Reiser4 FS Wiki: Benchmarks
https://reiser4.wiki.kernel.org/index.php/Main_Page

Please add your benchmarks (and a description of your setup) here:

* [[:File:Compilebench-0.6.pdf|Compilebench-0.6]] benchmark with Linux v3.14.1 (2014-06-11)

= Old stuff =

* Bonnie++ on a [http://docs.oracle.com/cd/E19121-01/sf.v40z/817-5248-21/chapter1.html#0_pgfId-1007454 Sun Fire V40z]
** [http://nerdbynature.de/benchmarks/v40z/2010-02-11/bonnie.html 2.6.33-rc6] (Feb 2010, vs. btrfs)
** [http://nerdbynature.de/benchmarks/v40z/2010-02-03/bonnie.html 2.6.33-rc6] (Feb 2010)
** [http://nerdbynature.de/benchmarks/v40z/2009-12-21/bonnie.html 2.6.33-rc1] (Dec 2009)
* Bonnie++ in a [http://nerdbynature.de/benchmarks/sid/2009-02-04-1237/ Xen DomU] (Feb 2009)
* Bonnie++ on a [http://nerdbynature.de/benchmarks/macmini/ MacMini] (July 2008)
* [http://marc.info/?l=reiserfs-devel&m=121484256609180&w=2 Reiser4 vs. everyone else] (June 2008)
* [http://vizzzion.org/?id=reiser4 Reiser4 vs. everyone else] (May 2004)
* [[NamesysBenchmarks|Namesys benchmarks]] (2003-11-20)
* [http://kerneltrap.org/node/715 filesystem benchmarks by Grant Miner] (2003-08-05)

Peter Grandi [http://www.sabi.co.uk/Notes/linuxFS.html#fsRefsBench lists] quite a few benchmarks and filesystem-related papers on his site.
= Files =

* [[File:Fs-bench-py.txt]]
* [[File:Slow.c.txt]]
* [http://nerdbynature.de/bits/thebsh/benchmarks/dist/ Mongo] benchmark

[[category:ReiserFS]]
[[category:Reiser4]]
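Most of the results listed above come from Bonnie++ or from Namesys' mongo suite, both of which time distinct phases (create, read, delete, ...) over a generated file set. Below is a minimal sketch of that phase-timing approach in Python. This is an illustrative toy only, not the [[File:Fs-bench-py.txt]] script attached above; the file count and size are arbitrary.

```python
import os
import tempfile
import time

def bench_phases(nfiles=200, size=4096):
    """Time CREATE/READ/DELETE passes over a small file set,
    in the spirit of mongo's per-phase timings."""
    data = b"x" * size
    times = {}
    with tempfile.TemporaryDirectory() as d:
        t0 = time.monotonic()
        for i in range(nfiles):
            with open(os.path.join(d, "f%d" % i), "wb") as f:
                f.write(data)
        times["CREATE"] = time.monotonic() - t0

        t0 = time.monotonic()
        total = 0
        for i in range(nfiles):
            with open(os.path.join(d, "f%d" % i), "rb") as f:
                total += len(f.read())
        times["READ"] = time.monotonic() - t0
        assert total == nfiles * size  # sanity check: read everything back

        t0 = time.monotonic()
        for i in range(nfiles):
            os.remove(os.path.join(d, "f%d" % i))
        times["DELETE"] = time.monotonic() - t0
    return times
```

A serious run would additionally flush caches between phases (sync, dropping the page cache) and use a file set larger than RAM; otherwise the page cache dominates the measurement.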
= Benchmarks of ReiserFS Version 4 (archived Namesys page) =

== Remarks ==

Htree (<code>-O dir_index</code>) is the ext3 developers' attempt to handle large directories as well as reiserfs does, by using better-than-linear search algorithms. One of the interesting results in this benchmark was that htree does bad things to ext3 performance, at least for this workload. That means that trying to get usable performance for large directories with ext3 can severely impact performance in the non-large case.

You'll note that our latest benchmark at the top uses larger filesets. ext3 seems to do a poor job of utilizing its write cache when the fileset uses a lot of memory without exceeding it, and by increasing the fileset size we get a fairer (read: better for ext3) benchmark for the create phase. The use of filesets small enough to barely fit into RAM for the create (but not the copy) phase was due to my being lax in supervising the benchmarking, but it did reveal something interesting. Andrew Morton will probably fix that fairly quickly; it is most likely not as deep a fix as fixing htree would be.

If anyone knows where the tail-combining patch for ext3 went, let us know so we can benchmark it. Good tail-combining performance is not trivial to get right, and I wonder whether there is a performance reason it was not merged.

Keep in mind that these benchmarks are still evolving and maturing, and the mongo code needs a complete review again, as it has been worked on by others quite a bit. While I like the mongo benchmarks, those concerned that they may be stacked in our favor can look at the benchmarks run by others on lkml, one of which is linked at the bottom of this page; while not as elaborate and detailed as mongo, it comes to roughly the same result.

Andrew Morton wrote some beautiful readahead code in the VM; many thanks to him for what it contributes to V4 performance. It should be confessed, however, that these benchmarks utterly fail to measure its cleverness for real-world usage patterns: they access everything once in each pass, which is not at all representative of typical server workloads. So understand them as validly illuminating some aspects of performance, not all aspects.

We ran data-ordered ext3 benchmarks at Andrew Morton's suggestion, but they came out slower for this workload. We need to increase the base size range to 8k and run again.
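The write-cache observation above points at a general benchmarking pitfall: if the file set fits in RAM, the create phase largely measures the page cache rather than the filesystem. Below is a small sketch of sizing a file set relative to physical memory; the Linux-only <code>/proc/meminfo</code> parsing and the 2x multiplier are assumptions for illustration, not part of the original benchmark scripts.

```python
def fileset_bytes(multiple=2.0, meminfo="/proc/meminfo"):
    """Return a fileset size exceeding physical RAM by `multiple`,
    so create/copy phases cannot be served entirely from the page cache."""
    with open(meminfo) as f:
        for line in f:
            if line.startswith("MemTotal:"):
                kib = int(line.split()[1])  # MemTotal is reported in kB
                return int(kib * 1024 * multiple)
    raise RuntimeError("MemTotal not found in " + meminfo)
```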
V4 is a fully atomic filesystem: keep in mind that these performance numbers were obtained with every FS operation performed as a fully atomic transaction. We are the first to make that performance-effective. Look for a user-space transaction interface to come out soon.

Finally, remember that reiser4 is more space efficient than V3; the df measurements are there to look at.

== Results ==

* linux-2.6.15-mm4: mongo comparison of ext3 vs. reiser4 with the "unixfile" regular-file plugin vs. reiser4 with the "cryptcompress" regular-file plugin
* linux-2.6.11: mongo comparison against xfs and ext2
* linux-2.6.8.1-mm3: mongo comparison against ext3
* 2004-03-26: slow.c comparison against ext2 and ext3
* 2003-11-20: mongo comparison against ext3
* 2003-09-30: Bonnie++ comparison of reiser4 and ext3
* 2003-09-25: mongo comparison against ext3
* Older mongo results (2003-08-28)
* mongo results obtained before LinuxTAG (2003-07-10); here reiser4 is compared with reiserfs
* External benchmarks by Grant Miner

=== linux-2.6.15-mm4 mongo results ===

Comparative mongo results for ext3 vs. reiser4 with the "unixfile" regular-file plugin vs. reiser4 with the "cryptcompress" regular-file plugin. The cryptcompress patch against 2.6.15-mm4 and the new version of reiser4progs are from ftp://ftp.namesys.com/pub/tmp/cryptcompress_patches

* reiser4: 2.6.15-mm4 + cryptcompress-4.patch
* mem total: 516312 kB
* machine: Intel(R) Xeon(TM) CPU 2.40GHz, running a UP kernel
* kernel: 2.6.15-mm4 #1 Sat Feb 11 20:00:11 MSK 2006
* dates: Sat Feb 11 21:03:21 / 21:18:43 / 21:37:52 2006

Legend:
* A: reiser4 with the "cryptcompress" regular-file plugin
* B: reiser4 with the "unixfile" regular-file plugin
* C: ext3

The table gives absolute values (elapsed time, CPU time, CPU utilization, disk usage) for configuration A, and ratios against A for configurations B and C. A ratio larger than 1.0 means reiser4 with the "cryptcompress" plugin is better in that test; a ratio below 1.0 means it loses.
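The presentation used in these tables, an absolute number for the reference configuration and ratios for the others, is easy to reproduce from raw measurements. A sketch follows; the timings for B and C are back-calculated illustrative values, not published data.

```python
def ratio_row(absolute_a, others):
    """Given config A's absolute measurement and the other configs'
    raw values, return (A, {name: other/A}) as in the mongo tables."""
    return absolute_a, {name: round(v / absolute_a, 3)
                        for name, v in others.items()}

# Hypothetical raw timings chosen so the ratios match the CREATE row's
# REAL_TIME columns; a ratio over 1.0 means that config took longer than A.
a, ratios = ratio_row(53.36, {"B": 65.85, "C": 226.73})
```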
Per-configuration mkfs/mount options (from the original table header):

* A: <code>mkfs.reiser4 -y -o create=create_ccreg40,compressMode=col8</code>, mounted <code>noatime</code> (reiser4, cryptcompress)
* B: <code>mkfs.reiser4 -y</code>, mounted <code>noatime</code> (reiser4, unixfile)
* C: ext3, mounted <code>noatime,data=ordered</code>

Run #0:

{| class="wikitable"
|-
! rowspan="2" | Phase
! colspan="3" | REAL_TIME (s)
! colspan="3" | CPU_TIME (s)
! colspan="3" | CPU_UTIL (%)
! colspan="3" | DF
|-
! A !! B/A !! C/A !! A !! B/A !! C/A !! A !! B/A !! C/A !! A !! B/A !! C/A
|-
| CREATE || 53.36 || 1.234 || 4.249 || 28.79 || 0.493 || 1.108 || 94.36 || 0.255 || 0.155 || 775856 || 2.550 || 2.825
|-
| COPY || 137.6 || 1.543 || 2.931 || 40.91 || 0.716 || 0.975 || 59.94 || 0.257 || 0.183 || 1551756 || 2.550 || 2.825
|-
| READ || 161.17 || 1.087 || 1.077 || 48.35 || 0.433 || 0.195 || 33.23 || 0.487 || 0.291 || 1551756 || 2.550 || 2.825
|-
| STATS || 24.12 || 0.936 || 0.927 || 6.76 || 0.941 || 0.624 || 27.97 || 1.005 || 0.676 || 1551756 || 2.550 || 2.825
|-
| DELETE || 155.26 || 1.091 || 0.989 || 38.76 || 0.824 || 0.108 || 26.33 || 0.758 || 0.104 || 4 || 1.000 || 0.000
|}

Run #1 (DD_MBCOUNT=5000):

{| class="wikitable"
|-
! rowspan="2" | Phase
! colspan="3" | REAL_TIME (s)
! colspan="3" | CPU_TIME (s)
! colspan="3" | CPU_UTIL (%)
! colspan="3" | DF
|-
! A !! B/A !! C/A !! A !! B/A !! C/A !! A !! B/A !! C/A !! A !! B/A !! C/A
|-
| dd_writing_largefile || 116.02 || 1.430 || 1.553 || 38.65 || 0.514 || 0.619 || 92.86 || 0.155 || 0.149 || 1909012 || 2.682 || 2.685
|-
| dd_reading_largefile || 153.76 || 0.996 || 1.001 || 58.11 || 0.192 || 0.147 || 38.73 || 0.224 || 0.152 || 1909012 || 2.682 || 2.685
|}

Benchmark parameters: <code>DIR=/mnt1 GAMMA=0.2 WRITE_BUFFER=131072 PHASE_APPEND=off SYNC=off PHASE_DELETE=rm NPROC=1 DEV=/dev/hda9 DD_MBCOUNT=5000 FILE_SIZE=8192 REP_COUNTER=1 PHASE_COPY=cp INFO_R4=2.6.15-mm4 cryptcompress-4.patch PHASE_READ=find BYTES=1024000000 PHASE_OVERWRITE=off PHASE_MODIFY=off</code>

Produced by the [http://namesys.com/benchmarks/mongo_readme.html Mongo] benchmark suite. [http://www.namesys.com/intbenchmarks/mongo/06.02.11.belka.crc/charts/comp.html The same results in chart form.]

=== linux-2.6.11 mongo results ===

* reiser4: reiser4-for-2.6.11-5.patch from ftp://ftp.namesys.com/pub/reiser4-for-2.6/2.6.11
* mem total: 254496 kB
* machine: bones
* kernel: 2.6.11-reiser4-5 #2 SMP Sat Jun 4 20:06:47 MSD 2005
* date: Fri Jun 17 23:52:17 2005

In this test 81% of the files are chosen from the 0-10k size range and 19% from the 10-100k size range.

Legend:
* A: reiser4 (<code>FSTYPE=reiser4</code>)
* B: reiserfs v3 (<code>FSTYPE=reiserfs MOUNT_OPTIONS=notail</code>)
* C: ext2 (<code>FSTYPE=ext2</code>)
* D: xfs (<code>MKFS=mkfs.xfs -f FSTYPE=xfs</code>)

The table gives absolute values for reiser4 and ratios against reiser4 for the other configurations; a ratio larger than 1.0 means reiser4 is better in that test, a ratio below 1.0 means it loses.

{| class="wikitable"
|-
! rowspan="2" | Phase
! colspan="4" | REAL_TIME (s)
! colspan="4" | CPU_TIME (s)
! colspan="4" | CPU_UTIL (%)
! colspan="4" | DF
|-
! A !! B/A !! C/A !! D/A !! A !! B/A !! C/A !! D/A !! A !! B/A !! C/A !! D/A !! A !! B/A !! C/A !! D/A
|-
| CREATE || 66.12 || 2.022 || 2.686 || 4.288 || 34.98 || 0.901 || 1.114 || 1.445 || 29.86 || 0.424 || 0.398 || 0.398 || 1623204 || 1.086 || 1.107 || 1.098
|-
| COPY || 187.77 || 1.438 || 1.751 || 3.733 || 44.8 || 0.883 || 1.124 || 1.161 || 14.85 || 0.606 || 0.611 || 0.353 || 3245428 || 1.087 || 1.107 || 1.098
|-
| READ || 151.01 || 1.459 || 1.113 || 1.978 || 44.34 || 0.607 || 0.470 || 1.535 || 18.54 || 0.444 || 0.500 || 0.724 || 3245428 || 1.087 || 1.107 || 1.098
|}
<td bgcolor=#E0E0C0 align=right><tt>22.04</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.314 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.812</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.871 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>8.61</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.698 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.571</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 4.591 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>20.11</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.528</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.709 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.579 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 3245428</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.087 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.098 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>DELETE</b></td> <td bgcolor=#E0E0C0 align=right><tt>108.77</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.313</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.193 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.071 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>41</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.637 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.091</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.795 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>21.45</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.795 
</font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.077</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.556 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>4</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 14877.000 </font></tt></td> </tt></td> </tr> <tr> <td colspan=17 bgcolor=#606060><b><font color=white>#1:DD_MBCOUNT=5000 </font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=4><b>REAL_TIME</b></td> <td colspan=4><b>CPU_TIME</b></td> <td colspan=4><b>CPU_UTIL</b></td> <td colspan=4><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_writing_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 536.06</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.005 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.017 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 0.982</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>122.28</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.826 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.819</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.806</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>14.99</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.771 </font></tt></td> <td 
bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.711</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.742 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 5120008</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.012</U> </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_reading_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt>145.32</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.031 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.965</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 0.982</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>157.51</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.947 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.890</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.880</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>57.01</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.901</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.909 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.884</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 5120008</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.012</U> </font></tt></td> </tt></td> </tr> <tr><td bgcolor=black colspan=17><font color=white></td></tr> <tr><td colspan=17 align=right> <tr> <td colspan=17 bgcolor=#303030><b><font 
color=white>INFO_R4=2.6.11 + reiser4-5 REP_COUNTER=1 DEV=/dev/hda5 DD_MBCOUNT=5000 PHASE_OVERWRITE=off FILE_SIZE=8192 NPROC=3 PHASE_READ=find PHASE_DELETE=rm PHASE_APPEND=off WRITE_BUFFER=131072 DIR=/mnt1 PHASE_MODIFY=off BYTES=1024000000 PHASE_COPY=cp GAMMA=0.2 SYNC=off </td></tr> <tr><td colspan=17 align=right> <font size=-2>Produced by <a href=http://namesys.com/benchmarks/mongo_readme.html>Mongo</a> benchmark suite.</font></td></tr> </table> <hr> <a name="mongo.2.6.8.1-mm3"></a> linux-2.6.8.1-mm3 <a href="benchmarks/mongo_readme.html">mongo</a> results <dl> <dt>reiser4 </dt> <dd>large key</dd> <dt>mem total</dt> <dd>254324</dd> <dt>machine </dt> <dd>bones</dd> <dt>kernel </dt> <dd>2.6.8.1-mm3 #3 SMP Mon Aug 23 19:33:13 MSD 2004</dd> <dt>date </dt> <dd>Tue Aug 31 15:47:51 2004</dd> </dl> <p> In this test 81% of files are chosen from the 0-10k size range and 19% from the 10-100k size range. </p> <!-- File stats: Units are decimal (1k = 1000) files 0-100 : 1433 files 100-1K : 12597 files 1K-10K : 103101 files 10K-100K : 28131 files 100K-1M : 0 files 1M-10M : 0 files 10M-larger : 0 total bytes written : 1886585039 --> <p>Legend:</p> <ul> <li><tt>A</tt> reiser4</li> <li><tt>B</tt> reiser4, extents only</li> <li><tt>C</tt> reiserfs <tt>v3 (notail)</tt></li> <li><tt>D</tt> ext3 in <tt>data=writeback</tt> mode (meta-data only journalling)</li> <li><tt>E</tt> ext3 in <tt>data=journal</tt> mode</li> <li><tt>F</tt> ext3 in <tt>data=ordered</tt> mode</li> </ul> <img src="http://www.namesys.com/intbenchmarks/mongo/04.08.26/256MB.RAM/one-thread-8k.g02.charts/CREATE.0.png"> <img src="http://www.namesys.com/intbenchmarks/mongo/04.08.26/256MB.RAM/one-thread-8k.g02.charts/COPY.0.png"> <img src="http://www.namesys.com/intbenchmarks/mongo/04.08.26/256MB.RAM/one-thread-8k.g02.charts/READ.0.png"> <img src="http://www.namesys.com/intbenchmarks/mongo/04.08.26/256MB.RAM/one-thread-8k.g02.charts/STATS.0.png"> <img 
src="http://www.namesys.com/intbenchmarks/mongo/04.08.26/256MB.RAM/one-thread-8k.g02.charts/DELETE.0.png"> <img src="http://www.namesys.com/intbenchmarks/mongo/04.08.26/256MB.RAM/one-thread-8k.g02.charts/dd_writing_largefile.1.png"> <img src="http://www.namesys.com/intbenchmarks/mongo/04.08.26/256MB.RAM/one-thread-8k.g02.charts/dd_reading_largefile.1.png"> <p> Table presents absolute values (of elapsed time, CPU usage, CPU utilization, disk usage) for reiser4, and ratios against reiser4 for all other configurations. <font color=red>Red</font> number means ratio is larger than <tt>1.0</tt>, that is, reiser4 is better in this test. <font color=green>Green</font> number means that reiser4 loses in this test. </p> <table cols=25 cellpadding=2 cellspacing=2 noborder> <tr><td bgcolor=black colspan=25><font color=white></td></tr> <tr> <th bgcolor=#303030 colspan=25 align=left><font color=white>A.FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=25 align=left><font color=white>B.FSTYPE=reiser4 MKFS=mkfs.reiser4 -q -o extent=extent40 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=25 align=left><font color=white>C.MOUNT_OPTIONS=notail FSTYPE=reiserfs </font></th> </tr> <tr> <th bgcolor=#303030 colspan=25 align=left><font color=white>D.MOUNT_OPTIONS="data=writeback" FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=25 align=left><font color=white>E.MOUNT_OPTIONS="data=journal" FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=25 align=left><font color=white>F.MOUNT_OPTIONS="data=ordered" FSTYPE=ext3 </font></th> </tr> <tr> <td colspan=25 bgcolor=#606060><b><font color=white>#0:</font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=6><b>REAL_TIME</b></td> <td colspan=6><b>CPU_TIME</b></td> <td colspan=6><b>CPU_UTIL</b></td> <td colspan=6><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A 
</b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>CREATE</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 91.6</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 0.988</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.983 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.592 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.010 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.256 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>31.13</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.965 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.826</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.577 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.529 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.802 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>22.63</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.981 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.350</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.791 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.738 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.000 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1978440</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 
align=right><tt><font color=red> 1.088 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>COPY</b></td> <td bgcolor=#E0E0C0 align=right><tt>219.5</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.968</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.674 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.241 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.105 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.819 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>54.04</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.938 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.792</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.694 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.004 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.860 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>16.01</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.996 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.460</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.663 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.839 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.890 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 3956708</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.088 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font 
color=red> 1.108 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>READ</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 187.34</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.007</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.617 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.282 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.295 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.250 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>38.61</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.002 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.711 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.615</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.622</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.615</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>13.05</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.995 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.441</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.520 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.517 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.533 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 3956708</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.088 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> <td bgcolor=#E0E0C0 
align=right><tt><font color=red> 1.108 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>STATS</b></td> <td bgcolor=#E0E0C0 align=right><tt>23.71</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.968 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.162 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.943</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.943</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.943</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>10.91</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.944 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.717 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.661</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.674 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.658</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>24.46</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.971 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.587</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.700 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.707 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.697 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 3956708</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.088 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> <td 
bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>DELETE</b></td> <td bgcolor=#E0E0C0 align=right><tt>156.84</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.993 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.233</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.264 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.270 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.216 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>53.05</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.938 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.440 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.209</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.215 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.214 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>18.23</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.947 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.758 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.157</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.160 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.167 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>4</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.000 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> 
</tt></td> </tr> <tr> <td colspan=25 bgcolor=#606060><b><font color=white>#1:DD_MBCOUNT=768 </font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=6><b>REAL_TIME</b></td> <td colspan=6><b>CPU_TIME</b></td> <td colspan=6><b>CPU_UTIL</b></td> <td colspan=6><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_writing_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 30.09</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.006</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.286 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.342 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.473 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.311 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>5.24</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.996 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.966</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.286 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.393 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.437 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>11.43</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.994 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 
0.631</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.796 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.655 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.967 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 786436</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_reading_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt>28.38</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.969</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.010 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 0.980</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 0.982</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.999 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>4.37</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.979 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.014 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.911</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.895</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.936 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>8.88</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.030 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.922 
</font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.858</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.854</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.867</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 786436</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr><td bgcolor=black colspan=25><font color=white></td></tr> <tr><td colspan=25 align=right> <tr> <td colspan=25 bgcolor=#303030><b><font color=white>REP_COUNTER=1 PHASE_COPY=cp INFO_R4=2.6.8.1-mm3 + parse_options.patch FILE_SIZE=8192 DEV=/dev/hda6 PHASE_MODIFY=off DD_MBCOUNT=768 PHASE_APPEND=off PHASE_OVERWRITE=off SYNC=off DIR=/mnt1 PHASE_DELETE=rm NPROC=1 BYTES=1024000000 GAMMA=0.2 PHASE_READ=find WRITE_BUFFER=131072 </td></tr> <tr><td colspan=25 align=right> <font size=-2>Produced by <a href=http://namesys.com/benchmarks/mongo_readme.html>Mongo</a> benchmark suite.</font></td></tr> </table> <hr> <a name="slow.2004.03.26">2004.03.26 slow.c benchmark results</a> <p> These are the <a href="http://www.jburgess.uklinux.net/slow.c">slow.c</a> benchmark results for the 2004.03.26 reiser4 snapshot. </p> <p> <b>slow.c</b> is a small program by Jon Burgess that writes and reads multiple data streams. For details and the source code, see <a href="http://marc.theaimsgroup.com/?l=linux-kernel&m=107652683608384&w=2">the discussion</a> on the linux-kernel mailing list.
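The access pattern can be sketched as follows. This is a hypothetical Python illustration, not Jon Burgess's actual C program; file names, chunk sizes, and stream counts here are made up for the example. Each "stream" is a separate file, and chunks are written and then read round-robin across all streams, which is what stresses the allocator when several writers are active at once.

```python
import os
import tempfile
import time

def stream_bench(nstreams, chunks, chunk_size=64 * 1024):
    """Interleave writes across nstreams files, then read them back.
    Returns (write_MBps, read_MBps)."""
    block = b"\xaa" * chunk_size
    with tempfile.TemporaryDirectory() as d:
        paths = [os.path.join(d, "stream%d" % i) for i in range(nstreams)]
        t0 = time.time()
        files = [open(p, "wb") for p in paths]
        for _ in range(chunks):
            for f in files:            # round-robin over the streams
                f.write(block)
        for f in files:
            f.flush()
            os.fsync(f.fileno())       # force the data out to the device
            f.close()
        write_s = time.time() - t0
        t0 = time.time()
        total = 0
        files = [open(p, "rb") for p in paths]
        for _ in range(chunks):
            for f in files:            # interleaved sequential reads
                total += len(f.read(chunk_size))
        for f in files:
            f.close()
        read_s = time.time() - t0
    assert total == nstreams * chunks * chunk_size  # full round-trip
    mb = total / 1e6
    return mb / max(write_s, 1e-9), mb / max(read_s, 1e-9)

if __name__ == "__main__":
    for n in (1, 2, 4):
        w, r = stream_bench(nstreams=n, chunks=64)
        print("%d stream(s): write %6.1f MB/s, read %6.1f MB/s" % (n, w, r))
```

Note that a toy run like this mostly measures the page cache; meaningful numbers require a data set larger than RAM, which is why the runs here push 500-1000 MB through a 256 MB machine.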
</p> <p> kernel : 2.6.5-rc2</p> <p> RAM : 256Mb</p> <p> reiser4 : <a href="http://www.namesys.com/snapshots/2004.03.26/">2004.03.26 snapshot</a></p> <p>Hardware specs:</p> <pre> processor : 1 vendor_id : AuthenticAMD cpu family : 6 model : 6 model name : AMD Athlon(tm) Processor stepping : 2 cpu MHz : 1460.098 cache size : 256 KB bogomips : 2916.35 Dual CPU AMD Athlon(tm) 1.4Ghz </pre> <pre> # hdparm /dev/hda6: multcount = 16 (on) IO_support = 1 (32-bit) unmaskirq = 1 (on) using_dma = 1 (on) keepsettings = 0 (off) readonly = 0 (off) readahead = 256 (on) geometry = 65535/16/63, sectors = 35937342, start = 84164598 </pre> <pre> # hdparm -t /dev/hda6 /dev/hda6: Timing buffered disk reads: 84 MB in 3.07 seconds = 27.39 MB/sec </pre> <pre> # hdparm -i /dev/hda /dev/hda: Model=IC35L060AVER07-0, FwRev=ER6OA44A, SerialNo=SZPTZMB6154 Config={ HardSect NotMFM HdSw>15uSec Fixed DTR>10Mbs } RawCHS=16383/16/63, TrkSize=0, SectSize=0, ECCbytes=40 BuffType=DualPortCache, BuffSize=1916kB, MaxMultSect=16, MultSect=16 CurCHS=16383/16/63, CurSects=16514064, LBA=yes, LBAsects=120103200 IORDY=on/off, tPIO={min:240,w/IORDY:120}, tDMA={min:120,rec:120} PIO modes: pio0 pio1 pio2 pio3 pio4 DMA modes: mdma0 mdma1 mdma2 UDMA modes: udma0 udma1 udma2 AdvancedPM=yes: disabled (255) WriteCache=enabled Drive conforms to: ATA/ATAPI-5 T13 1321D revision 1: * signifies the current active mode </pre> <pre> <!-- (500Mb of data) test : ./slow foo 500 Results : ============================================================== | 1 stream | 2 streams --------------+----------------------------------------------- | WRITE READ | WRITE READ --------------+----------------------------------------------- ext2 25.08Mb/s 27.08Mb/s 13.72Mb/s 14.04Mb/s reiser4 26.31Mb/s 26.99Mb/s 24.03Mb/s 26.84Mb/s reiser4-extents 25.28Mb/s 27.40Mb/s 24.12Mb/s 26.85Mb/s ext3-ordered 20.99Mb/s 26.40Mb/s 12.01Mb/s 13.34Mb/s ext3-journal 10.13Mb/s 24.48Mb/s 8.87Mb/s 13.26Mb/s reiserfs 20.42Mb/s 27.67Mb/s 12.98Mb/s 13.13Mb/s 
reiserfs-notail 20.07Mb/s 27.58Mb/s 13.04Mb/s 13.25Mb/s ============================================================== --> (1000Mb of data) test : ./slow foo 1000 Results : <!-- ============================================================================================================== | 1 stream | 2 streams | 4 streams | 8 stream --------------+----------------------------------------------------------------------------------------------- | WRITE READ | WRITE READ | WRITE READ | WRITE READ --------------+----------------------------------------------------------------------------------------------- ext2 24.66Mb/s 27.56Mb/s 13.40Mb/s 13.67Mb/s 7.73Mb/s 6.94Mb/s 6.69Mb/s 3.52Mb/s reiser4 25.42Mb/s 27.71Mb/s 23.96Mb/s 26.34Mb/s 24.55Mb/s 26.58Mb/s 24.90Mb/s 26.76Mb/s reiser4-extents 25.60Mb/s 27.68Mb/s 24.19Mb/s 25.92Mb/s 25.24Mb/s 27.12Mb/s 25.39Mb/s 26.72Mb/s ext3-ordered 20.05Mb/s 26.46Mb/s 11.06Mb/s 13.12Mb/s 9.63Mb/s 6.76Mb/s 10.02Mb/s 3.48Mb/s ext3-journal 10.10Mb/s 26.81Mb/s 8.87Mb/s 13.08Mb/s 8.59Mb/s 6.84Mb/s 8.14Mb/s 3.47Mb/s reiserfs 20.19Mb/s 27.48Mb/s 12.69Mb/s 13.03Mb/s 8.27Mb/s 6.84Mb/s 7.87Mb/s 4.13Mb/s reiserfs-notail 20.31Mb/s 27.10Mb/s 12.74Mb/s 13.09Mb/s 8.33Mb/s 6.89Mb/s 7.87Mb/s 4.17Mb/s ============================================================================================================= --> </pre> <table> <tr> <td><img src="intbenchmarks/slow/04.03.25-int.snapshot.bones/wr.1.png"></td> <td><img src="intbenchmarks/slow/04.03.25-int.snapshot.bones/wr.2.png"></td> <td><img src="intbenchmarks/slow/04.03.25-int.snapshot.bones/wr.4.png"></td> <td><img src="intbenchmarks/slow/04.03.25-int.snapshot.bones/wr.8.png"></td> </tr> <tr> <td><img src="intbenchmarks/slow/04.03.25-int.snapshot.bones/rd.1.png"></td> <td><img src="intbenchmarks/slow/04.03.25-int.snapshot.bones/rd.2.png"></td> <td><img src="intbenchmarks/slow/04.03.25-int.snapshot.bones/rd.4.png"></td> <td><img src="intbenchmarks/slow/04.03.25-int.snapshot.bones/rd.8.png"></td> </tr> 
</table> <hr> <a name="mongo.2003.11.20"></a>2003.11.20 <a href="benchmarks/mongo_readme.html">mongo</a> results <dl> <dt>reiser4 </dt> <dd>''</dd> <dt>mem total</dt> <dd>255716</dd> <dt>machine </dt> <dd>belka</dd> <dt>kernel </dt> <dd>2.6.0-test9 #2 SMP Thu Nov 20 16:08:42 MSK 2003</dd> <dt>date </dt> <dd>Thu Nov 20 16:16:50 2003</dd> </dl> <p> In this test, 80% of files are chosen from the 0-8k size range, 16% from the 0-80k size range, 3.2% (0.8 x 4%) from the 0-800k size range, and so on. Most files are small; most bytes are in large files. </p> <p>Legend:</p> <ul> <li><tt>A</tt> reiser4</li> <li><tt>B</tt> reiser4, extents only</li> <li><tt>C</tt> reiserfs <tt>v3</tt></li> <li><tt>D</tt> ext3 in <tt>data=writeback</tt> mode (metadata-only journalling)</li> <li><tt>E</tt> ext3 in <tt>data=journal</tt> mode</li> <li><tt>F</tt> ext3 in <tt>data=ordered</tt> mode</li> <li><tt>G</tt> ext3 with htree (hashed directories)</li> </ul> <p> The table presents absolute values (elapsed time, CPU usage, and disk usage) for reiser4, and ratios against reiser4 for all other configurations. A <font color=red>red</font> number means the ratio is larger than <tt>1.0</tt>, i.e. reiser4 does better in that test. A <font color=green>green</font> number means reiser4 does worse.
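The size distribution described above can be sketched in a few lines. This is a minimal illustration assuming the geometric scheme implied by the run's <tt>GAMMA=0.2</tt> parameter (shown in the settings row of the table); the names <tt>sample_file_size</tt>, <tt>BASE</tt>, and <tt>MAX_RANGES</tt> are hypothetical and not part of the mongo suite itself:

```python
import random

GAMMA = 0.2          # escalation probability, matching GAMMA=0.2 in the run settings
BASE = 8 * 1024      # upper bound of the smallest size range (0-8k)
MAX_RANGES = 4       # assumed cut-off; the real suite derives its limits differently

def sample_file_size(rng=random):
    """Pick size range k with probability (1-GAMMA)*GAMMA**k, then a uniform
    size inside it: ~80% of files in 0-8k, 16% in 0-80k, 3.2% in 0-800k..."""
    k = 0
    while k < MAX_RANGES - 1 and rng.random() < GAMMA:
        k += 1
    return rng.randint(1, BASE * 10 ** k)

sizes = [sample_file_size() for _ in range(100_000)]
share_small = sum(s <= BASE for s in sizes) / len(sizes)
bytes_large = sum(s for s in sizes if s > BASE) / sum(sizes)
print(f"files <= 8k: {share_small:.2f}, share of bytes in files > 8k: {bytes_large:.2f}")
```

Running it shows the property the text states: the overwhelming share of files falls in the 0-8k range, while the bulk of the bytes sit in the rare large files.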
</p> <table cols=22 cellpadding=2 cellspacing=2 noborder> <tr><td bgcolor=black colspan=22><font color=white></td></tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>A.INFO_R4='' FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>B.INFO_R4='' MKFS=mkfs.reiser4 -q -o policy=extents FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>C.FSTYPE=reiserfs </font></th> </tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>D.MOUNT_OPTIONS=data=writeback FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>E.MOUNT_OPTIONS=data=journal FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>F.MOUNT_OPTIONS=data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>G.MKFS=mkfs.ext3 -O dir_index MOUNT_OPTIONS=data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <td colspan=22 bgcolor=#606060><b><font color=white>#0:</font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=7><b>REAL_TIME</b></td> <td colspan=7><b>CPU_TIME</b></td> <td colspan=7><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>CREATE</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 21.81</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.171 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.983 </font></tt></td> <td bgcolor=#E0E0C0 
align=right><tt><font color=red> 3.253 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.702 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.161 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.212 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>6.38</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.130 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.020 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.461 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.461 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.354 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.851</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 607612</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.091 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.035 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>COPY</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 64.37</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.089 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.046 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.980 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.834 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.929 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 6.246 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>11.55</tt></td> 
<td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.047 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.797 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.590 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.725 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.542 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.698</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1214992</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.091 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.034 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>READ</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 45.38</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.026 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.406 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.248 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.307 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.232 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 7.192 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>10.13</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.934 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.517 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.454 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.453</U> </font></tt></td> <td bgcolor=#E0E0C0 
align=right><tt><font color=green> <U> 0.444</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.504 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1214992</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.091 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.034 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>STATS</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 5.74</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.030 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.413 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.014</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.033 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.021 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.634 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>2.34</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.000 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.936 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.761 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.791 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.774 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.744</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1214992</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.091 </font></tt></td> <td bgcolor=#E0E0C0 
align=right><tt><font color=red> 1.034 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>DELETE</b></td> <td bgcolor=#E0E0C0 align=right><tt>46.94</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.424</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.520 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.017 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.043 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.956 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.315 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>14.19</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.743 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.443 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.200</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.206 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.201</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.234 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>4</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.000 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td 
bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> </tt></td> </tr> <tr> <td colspan=22 bgcolor=#606060><b><font color=white>#1:DD_MBCOUNT=768 </font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=7><b>REAL_TIME</b></td> <td colspan=7><b>CPU_TIME</b></td> <td colspan=7><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_writing_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 29.33</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.026 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.184 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.102 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.499 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.097 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.098 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>2.61</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.008 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.659</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.437 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.054 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.556 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.571 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 786436</U></tt></td> 
<td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_reading_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 22.96</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.056 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.003</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.004</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.003</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.006</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>2.26</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.991 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.912 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.796 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.765</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.779</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.783 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 786436</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 
align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tr> <tr><td bgcolor=black colspan=22><font color=white></font></td></tr> <tr> <td colspan=22 bgcolor=#303030><b><font color=white>NPROC=1 DIR=/mnt/testfs SYNC=off PHASE_COPY=cp REP_COUNTER=1 GAMMA=0.2 PHASE_OVERWRITE=off FILE_SIZE=8192 BYTES=512000000 PHASE_APPEND=off PHASE_READ=find DEV=/dev/hdb3 DD_MBCOUNT=768 WRITE_BUFFER=131072 PHASE_DELETE=rm PHASE_MODIFY=off </font></b></td></tr> <tr><td colspan=22 align=right> <font size=-2>Produced by the <a href=http://namesys.com/benchmarks/mongo_readme.html>Mongo</a> benchmark suite.</font></td></tr> </table> <hr> <a name="mongo.2003.09.25"></a>2003.09.25 <a href="benchmarks/mongo_readme.html">mongo</a> results <dl> <dt>reiser4 </dt> <dd>''</dd> <dt>mem total</dt> <dd>255048</dd> <dt>machine </dt> <dd>belka</dd> <dt>kernel </dt> <dd>2.6.0-test5 #33 SMP Thu Sep 25 15:45:38 MSD 2003</dd> <dt>date </dt> <dd>Thu Sep 25 15:57:38 2003</dd> </dl> <p> In this test, 80% of files are chosen from the 0-8k size range, 16% from the 0-80k size range, 3.2% (0.8 x 4%) from the 0-800k size range, and so on. Most files are small; most bytes are in large files. </p> <p>Legend:</p> <ul> <li><tt>A</tt> reiser4</li> <li><tt>B</tt> reiser4, extents only</li> <li><tt>C</tt> reiserfs <tt>v3</tt></li> <li><tt>D</tt> ext3 in <tt>data=writeback</tt> mode (metadata-only journalling)</li> <li><tt>E</tt> ext3 in <tt>data=journal</tt> mode</li> <li><tt>F</tt> ext3 in <tt>data=ordered</tt> mode</li> <li><tt>G</tt> ext3 with htree (hashed directories)</li> </ul> <p> The table presents absolute values (elapsed time, CPU usage, and disk usage) for reiser4, and ratios against reiser4 for all other configurations.
A <font color=red>red</font> number means the ratio is larger than <tt>1.0</tt>, i.e. reiser4 does better in that test. A <font color=green>green</font> number means reiser4 does worse. </p> <table cols=22 cellpadding=2 cellspacing=2 noborder> <tr><td bgcolor=black colspan=22><font color=white></font></td></tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>A.INFO_R4='' FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>B.INFO_R4='' MKFS=mkfs.reiser4 -q -o policy=extents FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>C.FSTYPE=reiserfs </font></th> </tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>D.MOUNT_OPTIONS=data=writeback FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>E.MOUNT_OPTIONS=data=journal FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>F.MOUNT_OPTIONS=data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>G.MKFS=mkfs.ext3 -O dir_index MOUNT_OPTIONS=data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <td colspan=22 bgcolor=#606060><b><font color=white>#0:</font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=7><b>REAL_TIME</b></td> <td colspan=7><b>CPU_TIME</b></td> <td colspan=7><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>CREATE</b></td> <td bgcolor=#E0E0C0 align=right><tt><U>
23.57</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.158 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.714 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.263 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.234 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.020 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.376 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>6.66</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.075 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.947 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.240 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.357 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.264 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.835</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 608548</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.090 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.034 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.105 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.105 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.105 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>COPY</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 64.98</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.083 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.050 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.023 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.810 </font></tt></td> <td bgcolor=#E0E0C0 
align=right><tt><font color=red> 1.908 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 6.850 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>12.18</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.057 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.776 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.507 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.603 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.518 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.743</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1216784</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.090 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.033 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.105 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.105 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.105 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>READ</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 44.65</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.028 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.733 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.237 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.114 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.179 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 7.694 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>10.28</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.933 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.590</U> 
</font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.608 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.593</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.608 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.620 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1216784</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.090 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.033 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.105 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.105 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.105 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>STATS</b></td> <td bgcolor=#E0E0C0 align=right><tt>5.88</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.998 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.139 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.981 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.020 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.929</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.655 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>2.29</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.987 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.900 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.747</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.782 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.747</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 
0.755</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1216784</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.090 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.033 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.105 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.105 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.105 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>DELETE</b></td> <td bgcolor=#E0E0C0 align=right><tt>46.65</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.438</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.504 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.109 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.023 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.022 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.376 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>14.19</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.746 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.431 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.206</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.211 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.211 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.232 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>4</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.000 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> 
</font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> </tt></td> </tr> <tr> <td colspan=22 bgcolor=#606060><b><font color=white>#1:DD_MBCOUNT=768 </font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=7><b>REAL_TIME</b></td> <td colspan=7><b>CPU_TIME</b></td> <td colspan=7><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_writing_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 30.78</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.017</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.177 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.063 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.394 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.066 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.056 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>3.11</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.981 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.553</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.701 </font></tt></td> <td bgcolor=#E0E0C0 
align=right><tt><font color=red> 1.296 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.318 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 786436</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_reading_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 22.96</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.045 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.005</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.005</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.004</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.006</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>2.41</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.996 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.867 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.739 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.718</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.739 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.722</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 786436</U></tt></td> 
<td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tr> <tr><td bgcolor=black colspan=22><font color=white></font></td></tr> <tr> <td colspan=22 bgcolor=#303030><b><font color=white>NPROC=1 DIR=/mnt/testfs SYNC=off PHASE_COPY=cp REP_COUNTER=1 GAMMA=0.2 PHASE_OVERWRITE=off FILE_SIZE=8192 BYTES=512000000 PHASE_APPEND=off PHASE_READ=find DEV=/dev/hdb3 DD_MBCOUNT=768 WRITE_BUFFER=131072 PHASE_DELETE=rm PHASE_MODIFY=off </font></b></td></tr> <tr><td colspan=22 align=right> <font size=-2>Produced by the <a href=http://namesys.com/benchmarks/mongo_readme.html>Mongo</a> benchmark suite.</font></td></tr> </table> <hr> <a name="mongo.2003.08.28"></a>2003.08.28 <a href="benchmarks/mongo_readme.html">mongo</a> results <dl> <dt>reiser4 </dt> <dd>''</dd> <dt>mem total</dt> <dd>256276</dd> <dt>machine </dt> <dd>belka</dd> <dt>kernel </dt> <dd>2.6.0-test4 #194 SMP Thu Aug 28 17:18:47 MSD 2003</dd> <dt>date </dt> <dd>Thu Aug 28 17:20:18 2003</dd> </dl> <p> In this test, 80% of files are chosen from the 0-8k size range, 16% from the 0-80k size range, 3.2% (0.8 x 4%) from the 0-800k size range, and so on. Most files are small; most bytes are in large files.
</p> <p>Legend:</p> <ul> <li><tt>A</tt> reiser4</li> <li><tt>B</tt> reiser4, extents only</li> <li><tt>C</tt> reiserfs</li> <li><tt>D</tt> ext3 in <tt>data=writeback</tt> mode (meta-data only journalling)</li> <li><tt>E</tt> ext3 in <tt>data=journal</tt> mode</li> <li><tt>F</tt> ext3 in <tt>data=ordered</tt> mode</li> <li><tt>G</tt> ext3 with htree (hashed directories)</li> </ul> <p> The table presents absolute values (elapsed time, CPU usage, and disk usage) for reiser4, and ratios against reiser4 for all other configurations. A <font color=red>red</font> number means the ratio is larger than <tt>1.0</tt>, i.e. reiser4 wins in this test; a <font color=green>green</font> number means reiser4 loses. </p> <table cols=22 cellpadding=2 cellspacing=2 noborder> <tr><td bgcolor=black colspan=22><font color=white></font></td></tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>A.INFO_R4='' FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>B.INFO_R4='' MKFS=mkfs.reiser4 -q -o policy=extents FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>C.FSTYPE=reiserfs </font></th> </tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>D.MOUNT_OPTIONS=data=writeback FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>E.MOUNT_OPTIONS=data=journal FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>F.MOUNT_OPTIONS=data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>G.MKFS=mkfs.ext3 -O dir_index MOUNT_OPTIONS=data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <td colspan=22 bgcolor=#606060><b><font color=white>#0:</font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=7><b>REAL_TIME</b></td> <td colspan=7><b>CPU_TIME</b></td> <td colspan=7><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td>
<td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>CREATE</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 21.94</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.056 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.957 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.049 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.430 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.399 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.558 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>6.7</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.104 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.913 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.213 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.334 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.345 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.821</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 608452</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.091 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.034 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.105 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.105 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.105 </font></tt></td> <td bgcolor=#E0E0C0 
align=right><tt><font color=red> 1.106 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>COPY</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 64.05</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.078 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.112 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.964 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.703 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.022 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 7.356 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>11.37</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.039 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.819 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.538 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.692 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.568 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.708</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1216572</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.091 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.033 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>READ</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 52.53</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.072 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.882 </font></tt></td> <td 
bgcolor=#E0E0C0 align=right><tt><font color=red> 1.056 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.126 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.124 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 7.158 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>9.8</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.914 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.538 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.489 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.467 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.456</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.551 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1216572</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.091 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.033 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>STATS</b></td> <td bgcolor=#E0E0C0 align=right><tt>5.82</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.973</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.251 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.040 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.009 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.048 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.641 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 
align=right><tt>2.29</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.991 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.926 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.755 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.742</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.751 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.734</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1216572</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.091 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.033 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>DELETE</b></td> <td bgcolor=#E0E0C0 align=right><tt>46.96</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.409</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.491 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.949 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.988 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.987 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.382 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>13.89</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.734 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.453 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.210 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font 
color=green> <U> 0.204</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.202</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.238 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>4</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.000 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> </tt></td> </tr> <tr> <td colspan=22 bgcolor=#606060><b><font color=white>#1:DD_MBCOUNT=768 </font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=7><b>REAL_TIME</b></td> <td colspan=7><b>CPU_TIME</b></td> <td colspan=7><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_writing_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 26.1</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.006</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.205 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.066 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.353 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 
1.068 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.070 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>3.18</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.028 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.547</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.173 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.708 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.327 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.296 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 786436</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_reading_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 18.99</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.009</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.072 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.009</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.007</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.006</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.008</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>2.12</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.000 
</font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.925 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.877 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.844 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.830 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.811</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 786436</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tr> <tr><td bgcolor=black colspan=22><font color=white></font></td></tr> <tr> <td colspan=22 bgcolor=#303030><b><font color=white>NPROC=1 DIR=/mnt/testfs SYNC=off PHASE_COPY=cp REP_COUNTER=1 GAMMA=0.2 PHASE_OVERWRITE=off FILE_SIZE=8192 BYTES=512000000 PHASE_APPEND=off PHASE_READ=find DEV=/dev/hdb3 DD_MBCOUNT=768 WRITE_BUFFER=131072 PHASE_DELETE=rm PHASE_MODIFY=off </font></b></td></tr> <tr><td colspan=22 align=right> <font size=-2>Produced by <a href=http://namesys.com/benchmarks/mongo_readme.html>Mongo</a> benchmark suite.</font></td></tr> </table> <hr> <a name="mongo.2003.08.27"></a>2003.08.27 <a href="benchmarks/mongo_readme.html">mongo</a> results <dl> <dt>reiser4 </dt> <dd>''</dd> <dt>mem total</dt> <dd>256276</dd> <dt>machine </dt> <dd>belka</dd> <dt>kernel </dt> <dd>2.6.0-test4 #189 SMP Wed Aug 27 20:36:51 MSD 2003</dd> <dt>date </dt> <dd>Wed Aug 27 20:44:02 2003</dd> </dl> <p> In this test, 80% of files are chosen from the 0-8k size
range, 16% from the 0-80k size range, 3.2% (0.8 x 4%) from the 0-800k size range, and so on. Most files are small; most bytes are in large files. </p> <p>Legend:</p> <ul> <li><tt>A</tt> reiser4</li> <li><tt>B</tt> reiser4, extents only</li> <li><tt>C</tt> ext3 in <tt>data=writeback</tt> mode (meta-data only journalling)</li> <li><tt>D</tt> ext3 in <tt>data=journal</tt> mode</li> <li><tt>E</tt> ext3 in <tt>data=ordered</tt> mode</li> <li><tt>F</tt> ext3 with htree (hashed directories)</li> </ul> <p> The table presents absolute values (elapsed time, CPU usage, and disk usage) for reiser4, and ratios against reiser4 for all other configurations. A <font color=red>red</font> number means the ratio is larger than <tt>1.0</tt>, i.e. reiser4 wins in this test; a <font color=green>green</font> number means reiser4 loses. </p> <table cols=19 cellpadding=2 cellspacing=2 noborder> <tr><td bgcolor=black colspan=19><font color=white></font></td></tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>A.INFO_R4='' FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>B.INFO_R4='' MKFS=mkfs.reiser4 -q -o policy=extents FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>C.MOUNT_OPTIONS=data=writeback FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>D.MOUNT_OPTIONS=data=journal FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>E.MOUNT_OPTIONS=data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>F.MKFS=mkfs.ext3 -O dir_index MOUNT_OPTIONS=data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <td colspan=19 bgcolor=#606060><b><font color=white>#0:</font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=6><b>REAL_TIME</b></td> <td colspan=6><b>CPU_TIME</b></td> <td colspan=6><b>DF</b></td> </tr> <tr align=center
bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>CREATE</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 22.41</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.673 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.325 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.975 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.213 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>7.66</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.069 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.347 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.415 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.410 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.708</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 635264</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.096 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.110 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.110 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.110 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.111 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>COPY</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 90.92</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.099 </font></tt></td> <td 
bgcolor=#E0E0C0 align=right><tt><font color=red> 1.471 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.221 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.470 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 4.989 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>12.14</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.068 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.066 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.241 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.094 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.668</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1269840</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.096 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.110 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.110 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.110 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.112 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>READ</b></td> <td bgcolor=#E0E0C0 align=right><tt>82.21</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.063 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.861 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.852 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.791</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 4.417 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>10.57</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.914 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.400</U> </font></tt></td> <td bgcolor=#E0E0C0 
align=right><tt><font color=green> 0.428 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.402</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.534 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1269840</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.096 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.110 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.110 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.110 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.112 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>STATS</b></td> <td bgcolor=#E0E0C0 align=right><tt>8.52</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.993 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.822</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.816</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.811</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.335 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>2.96</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.997 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.561</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.564</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.584 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.608 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1269840</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.096 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.110 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.110 </font></tt></td> <td 
bgcolor=#E0E0C0 align=right><tt><font color=red> 1.110 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.112 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>DELETE</b></td> <td bgcolor=#E0E0C0 align=right><tt>69.69</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.301</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.749 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.717 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.659 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.912 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>14.73</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.703 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.208</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.207</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.213 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.237 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>4</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.000 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> </tt></td> </tr> <tr> <td colspan=19 bgcolor=#606060><b><font color=white>#1:DD_MBCOUNT=768 </font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=6><b>REAL_TIME</b></td> <td colspan=6><b>CPU_TIME</b></td> <td colspan=6><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A 
</b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_writing_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 25.85</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.092 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.335 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.085 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.095 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 3.27</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 0.982</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.159 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.648 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.251 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.254 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 786436</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_reading_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 19</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 0.999</U> 
</font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.005</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.007</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.007</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.007</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>2.18</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.963 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.807 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.803</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.789</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.803</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 786436</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr><td bgcolor=black colspan=19><font color=white></td></tr> <tr><td colspan=19 align=right> <tr> <td colspan=19 bgcolor=#303030><b><font color=white>NPROC=1 DIR=/mnt/testfs SYNC=off PHASE_COPY=cp REP_COUNTER=1 GAMMA=0.2 PHASE_OVERWRITE=off FILE_SIZE=8000 BYTES=512000000 PHASE_APPEND=off PHASE_READ=find DEV=/dev/hdb3 DD_MBCOUNT=768 WRITE_BUFFER=131072 PHASE_DELETE=rm PHASE_MODIFY=off </td></tr> <tr><td colspan=19 align=right> <font size=-2>Produced by <a href=http://namesys.com/benchmarks/mongo_readme.html>Mongo</a> benchmark suite.</font></td></tr> </table> <hr> <p> This is the same test as above, but with base file 
size 4k, that is, in this test 80% of files are chosen from the 0-4k size range, 16% from the 0-40k size range, 3.2% (0.8 x 4%) from the 0-400k size range, and so on. </p> <hr> <dl> <dt>reiser4 </dt> <dd>''</dd> <dt>mem total</dt> <dd>255580</dd> <dt>machine </dt> <dd>belka</dd> <dt>kernel </dt> <dd>2.6.0-test4 #176 SMP Tue Aug 26 19:09:38 MSD 2003</dd> <dt>date </dt> <dd>Wed Aug 27 12:41:54 2003</dd> </dl> <table cols=19 cellpadding=2 cellspacing=2 noborder> <tr><td bgcolor=black colspan=19><font color=white></font></td></tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>A.INFO_R4='' FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>B.INFO_R4='' MKFS=mkfs.reiser4 -q -o policy=extents FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>C.MOUNT_OPTIONS=data=writeback FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>D.MOUNT_OPTIONS=data=journal FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>E.MOUNT_OPTIONS=data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>F.MKFS=mkfs.ext3 -O dir_index MOUNT_OPTIONS=data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <td colspan=19 bgcolor=#606060><b><font color=white>#0:</font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=6><b>REAL_TIME</b></td> <td colspan=6><b>CPU_TIME</b></td> <td colspan=6><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>CREATE</b></td> <td
bgcolor=#E0E0C0 align=right><tt><U> 33.86</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.223 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.305 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.895 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.549 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.298 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>14.11</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.118 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.967 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.046 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.045 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.647</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 789424</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.208 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.181 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>COPY</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 119.68</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.228 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.237 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.397 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.277 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 7.061 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>23.05</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 
</font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.484 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.683 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.515 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.691</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1578216</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.208 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.182 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>READ</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 118.5</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.217 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.041 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.065 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.020</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 6.585 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>19.84</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.993 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.436</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.446 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.431</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.540 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1578216</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.208 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 
</font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.182 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>STATS</b></td> <td bgcolor=#E0E0C0 align=right><tt>24.69</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.951 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.677</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.696 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.677</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.151 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>7.75</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.008 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.590</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.582</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.583</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.645 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1578216</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.208 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.182 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>DELETE</b></td> <td bgcolor=#E0E0C0 align=right><tt>114.49</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.438 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.174</U> </font></tt></td> <td 
bgcolor=#E0E0C0 align=right><tt><font color=green> 0.188 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.177 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.257 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>32.64</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.790 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.193</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.199 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.194</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.223 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>4</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.000 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> </tt></td> </tr> <tr> <td colspan=19 bgcolor=#606060><b><font color=white>#1:DD_MBCOUNT=768 </font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=6><b>REAL_TIME</b></td> <td colspan=6><b>CPU_TIME</b></td> <td colspan=6><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_writing_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 26.24</U></tt></td> <td bgcolor=#E0E0C0 
align=right><tt><font color=black> <U> 1.002</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.066 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.311 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.056 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.063 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 3.25</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 0.997</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.138 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.622 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.286 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.298 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 786436</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_reading_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 19.04</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 0.994</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.002</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.003</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.002</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>2.08</tt></td> <td bgcolor=#E0E0C0 
align=right><tt><font color=red> 1.038 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.870 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.870 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.870 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.837</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 786436</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tr> <tr><td bgcolor=black colspan=19><font color=white></font></td></tr> <tr> <td colspan=19 bgcolor=#303030><b><font color=white>NPROC=1 DIR=/mnt/testfs SYNC=off PHASE_COPY=cp REP_COUNTER=1 GAMMA=0.2 PHASE_OVERWRITE=off FILE_SIZE=4000 BYTES=512000000 PHASE_APPEND=off PHASE_READ=find DEV=/dev/hdb3 DD_MBCOUNT=768 WRITE_BUFFER=131072 PHASE_DELETE=rm PHASE_MODIFY=off </font></b></td></tr> <tr><td colspan=19 align=right> <font size=-2>Produced by <a href="http://namesys.com/benchmarks/mongo_readme.html">Mongo</a> benchmark suite.</font></td></tr> </table> <hr> <a name="mongo.2003.08.26"></a>2003.08.26 <a href="benchmarks/mongo_readme.html">mongo</a> results <dl> <dt>reiser4 </dt> <dd>''</dd> <dt>mem total</dt> <dd>904048</dd> <dt>machine </dt> <dd>belka</dd> <dt>kernel </dt> <dd>2.6.0-test4 #176 SMP Tue Aug 26 19:09:38 MSD 2003</dd> <dt>date </dt> <dd>Tue Aug 26 19:34:39 2003</dd> </dl> <p> In this test 80% of files are chosen from the 0-4k size range, 16% from the 0-40k size range, 0.8 x 4% from the 0-400k size range, etc.
Most files are small, most bytes are in large files. </p> <p>Legend:</p> <ul> <li><tt>A</tt> reiser4</li> <li><tt>B</tt> reiser4, extents only</li> <li><tt>C</tt> ext3 in <tt>data=writeback</tt> mode (meta-data only journalling)</li> <li><tt>D</tt> ext3 in <tt>data=journal</tt> mode</li> <li><tt>E</tt> ext3 in <tt>data=ordered</tt> mode</li> <li><tt>F</tt> ext3 with htree (hashed directories)</li> </ul> <p> The table presents absolute values (of elapsed time, CPU usage, and disk usage) for reiser4, and ratios against reiser4 for all other configurations. A <font color=red>red</font> number means the ratio is larger than <tt>1.0</tt>, that is, reiser4 is better in this test. A <font color=green>green</font> number means that reiser4 loses in this test. </p> <table cols=19 cellpadding=2 cellspacing=2 noborder> <tr><td bgcolor=black colspan=19><font color=white></font></td></tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>A.INFO_R4='' FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>B.INFO_R4='' MKFS=mkfs.reiser4 -q -o policy=extents FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>C.MOUNT_OPTIONS=data=writeback FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>D.MOUNT_OPTIONS=data=journal FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>E.MOUNT_OPTIONS=data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>F.MKFS=mkfs.ext3 -O dir_index MOUNT_OPTIONS=data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <td colspan=19 bgcolor=#606060><b><font color=white>#0:</font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=6><b>REAL_TIME</b></td> <td colspan=6><b>CPU_TIME</b></td> <td colspan=6><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A
</b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>CREATE</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 27.6</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.311 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.567 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.538 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.668 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.566 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>13.55</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.166 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.035 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.162 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.189 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.670</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 788884</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.208 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.181 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.181 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.181 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.182 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>COPY</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 113.71</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.237 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.167 </font></tt></td> <td 
bgcolor=#E0E0C0 align=right><tt><font color=red> 1.460 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.227 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 7.387 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>23.13</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.169 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.498 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.691 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.591 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.709</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1577560</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.208 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.181 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.181 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.181 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.183 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>READ</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 111.51</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.239 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.157 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.176 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.096 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 7.017 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>20.76</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.042 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.424 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.415</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font 
color=green> <U> 0.416</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.521 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1577560</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.208 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.181 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.181 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.181 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.183 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>STATS</b></td> <td bgcolor=#E0E0C0 align=right><tt>20.22</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.034 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.834</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.827</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.832</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.439 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>7.47</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.009 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.590</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.585</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.584</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.631 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1577560</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.208 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.181 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.181 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.181 </font></tt></td> <td bgcolor=#E0E0C0 
align=right><tt><font color=red> 1.183 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>DELETE</b></td> <td bgcolor=#E0E0C0 align=right><tt>110.98</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.437 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.183</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.180</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.185 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.277 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>33.03</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.838 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.196 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.192</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.193</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.221 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>4</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.000 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> </tt></td> </tr> <tr> <td colspan=19 bgcolor=#606060><b><font color=white>#1:DD_MBCOUNT=768 </font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=6><b>REAL_TIME</b></td> <td colspan=6><b>CPU_TIME</b></td> <td colspan=6><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A 
</b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_writing_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 26.03</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.096 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.340 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.092 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.080 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 3.48</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.011</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.083 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.583 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.187 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.190 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 786436</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_reading_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 19</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 0.995</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> 
</font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 0.999</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 0.999</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt>2.28</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.018 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.741 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.737</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.741 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.724</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 786436</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tr> <tr><td bgcolor=black colspan=19><font color=white></font></td></tr> <tr> <td colspan=19 bgcolor=#303030><b><font color=white>NPROC=1 DIR=/mnt/testfs SYNC=off PHASE_COPY=cp REP_COUNTER=1 GAMMA=0.2 PHASE_OVERWRITE=off FILE_SIZE=4000 BYTES=512000000 PHASE_APPEND=off PHASE_READ=find DEV=/dev/hdb3 DD_MBCOUNT=768 WRITE_BUFFER=131072 PHASE_DELETE=rm PHASE_MODIFY=off </font></b></td></tr> <tr><td colspan=19 align=right> <font size=-2>Produced by <a href="http://namesys.com/benchmarks/mongo_readme.html">Mongo</a> benchmark suite.</font></td></tr> </table> <hr> <a name="mongo.2003.08.18"></a>2003.08.18 <a href="benchmarks/mongo_readme.html">mongo</a> results <dl> <dt>reiser4 </dt> <dd></dd> <dt>mem total</dt>
<dd>255992</dd> <dt>machine </dt> <dd>belka</dd> <dt>kernel </dt> <dd>2.6.0-test3 #37 SMP Mon Aug 18 18:12:14 MSD 2003</dd> <dt>date </dt> <dd>Mon 18 Aug 2003 20:24:16</dd> </dl> <p> In this test 80% of files are chosen from the 0-8k size range, 16% from the 0-80k size range, 0.8 x 4% from the 0-800k size range, etc. Most files are small, most bytes are in large files. </p> <p>Legend:</p> <ul> <li><tt>A</tt> reiser4</li> <li><tt>B</tt> reiser4, extents only</li> <li><tt>C</tt> ext3 in <tt>data=writeback</tt> mode (meta-data only journalling)</li> <li><tt>D</tt> ext3 in <tt>data=journal</tt> mode</li> <li><tt>E</tt> ext3 in <tt>data=ordered</tt> mode</li> <li><tt>F</tt> ext3 with htree (hashed directories)</li> </ul> <p> The table presents absolute values (of elapsed time, CPU usage, and disk usage) for reiser4, and ratios against reiser4 for all other configurations. A <font color=red>red</font> number means the ratio is larger than <tt>1.0</tt>, that is, reiser4 is better in this test. A <font color=green>green</font> number means that reiser4 loses in this test.
</p> <table cols=19 cellpadding=2 cellspacing=2 noborder> <tr><td bgcolor=black colspan=19><font color=white></td></tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>A.INFO_R4= FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>B.INFO_R4=ext MKFS=mkfs.reiser4 -q -o policy=extents FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>C.MOUNT_OPTIONS=data=writeback FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>D.MOUNT_OPTIONS=data=journal FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>E.MOUNT_OPTIONS=data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>F.MKFS=mkfs.ext3 -O dir_index MOUNT_OPTIONS=data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <td colspan=19 bgcolor=#606060><b><font color=white>#0:</font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=6><b>REAL_TIME</b></td> <td colspan=6><b>CPU_TIME</b></td> <td colspan=6><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>CREATE</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 29.16</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.220 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.422 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.779 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.491 </font></tt></td> <td bgcolor=#E0E0C0 
align=right><tt><font color=red> 1.645 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>13.52</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.182 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.013 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.087 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.997 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.657</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 789364</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.208 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.181 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>COPY</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 119.64</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.211 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.191 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.473 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.230 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 7.288 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>21.98</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.152 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.515 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.746 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.520 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.695</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 
1578116</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.208 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.182 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>READ</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 116.55</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.213 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.177 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.025 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.134 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 6.850 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>18.35</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.035 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.447 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.436</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.431</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.569 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1578116</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.208 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.182 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>STATS</b></td> <td bgcolor=#E0E0C0 align=right><tt>21.65</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font 
color=red> 1.050 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.779</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.811 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.782</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.358 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>7.56</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.001 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.599</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.612 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.611</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.638 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1578116</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.208 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.182 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>DELETE</b></td> <td bgcolor=#E0E0C0 align=right><tt>112.37</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.434 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.179</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.198 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.177</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.281 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>30.62</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.851 </font></tt></td> <td bgcolor=#E0E0C0 
align=right><tt><font color=green> <U> 0.205</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.205</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.203</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.230 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>4</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.000 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> </tt></td> </tr> <tr> <td colspan=19 bgcolor=#606060><b><font color=white>#1:DD_MBCOUNT=768 </font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=6><b>REAL_TIME</b></td> <td colspan=6><b>CPU_TIME</b></td> <td colspan=6><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_writing_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 26.11</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.011</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.090 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.388 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.076 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.083 </font></tt></td> </tt></td> 
<td bgcolor=#E0E0C0 align=right><tt>3.25</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.945</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.092 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.640 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.255 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.231 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 786436</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_reading_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 19.09</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.005</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 0.999</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 0.996</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.004</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.011</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>2.09</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.019 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.847</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.856 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.833</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.842</U> 
</font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 786436</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr><td bgcolor=black colspan=19><font color=white></td></tr> <tr><td colspan=19 align=right> <tr> <td colspan=19 bgcolor=#303030><b><font color=white>NPROC=1 DIR=/mnt/testfs SYNC=off PHASE_COPY=cp REP_COUNTER=1 GAMMA=0.2 PHASE_OVERWRITE=off FILE_SIZE=4000 BYTES=512000000 PHASE_APPEND=off PHASE_READ=find DEV=/dev/hdb3 DD_MBCOUNT=768 WRITE_BUFFER=131072 PHASE_DELETE=rm PHASE_MODIFY=off </td></tr> <tr><td colspan=19 align=right> <font size=-2>Produced by <a href=http://namesys.com/benchmarks/mongo_readme.html>Mongo</a> benchmark suite.</font></td></tr> </table> <hr> <a name="mongo.2003.08.12"></a>2003.08.12 <a href="benchmarks/mongo_readme.html">mongo</a> results <dl> <dt>mem total</dt> <dd>513284</dd> <dt>machine </dt> <dd>strelka</dd> <dt>kernel </dt> <dd>2.6.0-test2 #52 SMP Tue Aug 12 15:17:12 MSD 2003</dd> <dt>date </dt> <dd>Tue Aug 12 15:38:47 2003</dd> </dl> <p> This is a comparison of the latest (2003.08.12) version of reiser4 with ext3. Reiser4 is an atomic filesystem, so the comparison with the data journaling mode of ext3 is the fairest, but since most users run ext3 in data ordering mode, we compare against that as well. </p> <p> In this test, 80% of files are chosen from the 0-8k size range, 16% from the 0-80k size range, 0.8 x 4% from the 0-800k size range, and so on. Most files are small; most bytes are in large files.
</p> <p>Legend:</p> <ul> <li><tt>A</tt> reiser4</li> <li><tt>B</tt> ext3 in <tt>data=writeback</tt> mode (meta-data only journalling)</li> <li><tt>C</tt> ext3 in <tt>data=journal</tt> mode</li> <li><tt>D</tt> ext3 in <tt>data=ordered</tt> mode</li> <li><tt>E</tt> ext3 with htree (hashed directories)</li> <li><tt>F</tt> ext3 with support for filetypes in <tt>readdir()</tt></li> </ul> <p> The table presents absolute values (of elapsed time, CPU usage, and disk usage) for reiser4, and ratios against reiser4 for all other configurations. A <font color=red>red</font> number means the ratio is larger than <tt>1.0</tt>, that is, reiser4 is better in this test; a <font color=green>green</font> number means reiser4 loses in this test. </p> <table cols=19 cellpadding=2 cellspacing=2 noborder> <tr><td bgcolor=black colspan=19><font color=white></td></tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>A.INFO_R4= MKFS=/usr/local/sbin/mkfs.reiser4 -qf FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>B.MOUNT_OPTIONS=data=writeback FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>C.MOUNT_OPTIONS=data=journal FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>D.MOUNT_OPTIONS=data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>E.MKFS=/usr/local/sbin/mkfs.ext3 -O dir_index MOUNT_OPTIONS=data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>F.MKFS=/usr/local/sbin/mkfs.ext3 -O filetype MOUNT_OPTIONS=data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <td colspan=19 bgcolor=#606060><b><font color=white>#0:</font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=6><b>REAL_TIME</b></td> <td colspan=6><b>CPU_TIME</b></td> <td colspan=6><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td>
<td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>CREATE</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 14.06</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.317 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.248 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.050 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.016 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.077 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>5.3</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.558 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.692 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.602 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.823</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.592 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 458224</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>COPY</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 43.62</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.982 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font 
color=red> 1.733 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.033 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 6.685 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.904 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>9.19</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.163 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.286 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.230 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.706</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.200 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 916172</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>READ</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 39.86</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.091 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.091 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.140 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 6.003 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.119 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>8.22</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.467 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.454 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.464 </font></tt></td> <td 
bgcolor=#E0E0C0 align=right><tt><font color=green> 0.529 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.443</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 916172</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>STATS</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1.54</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.987 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.896 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.942 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.649 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.883 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 0.26</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.115 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.115 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.115 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.385 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.962 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 916172</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font 
color=red> 1.107 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>DELETE</b></td> <td bgcolor=#E0E0C0 align=right><tt>37.85</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.833 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.825 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.867 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.133 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.760</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>11.11</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.223</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.223</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.220</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.254 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.222</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>4</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> </tt></td> </tr> <tr> <td colspan=19 bgcolor=#606060><b><font color=white>#1:DD_MBCOUNT=500 </font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=6><b>REAL_TIME</b></td> <td colspan=6><b>CPU_TIME</b></td> <td colspan=6><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A 
</b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_writing_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 42.15</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.062 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.534 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.066 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.071 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.073 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 7.86</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.094 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.500 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.206 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.211 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.198 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 512004</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_reading_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 36.5</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.005</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.008</U> </font></tt></td> <td 
bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.005</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.007</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.007</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>4.7</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.745</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.732</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.743</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.736</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.734</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 512004</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr><td bgcolor=black colspan=19><font color=white></td></tr> <tr><td colspan=19 align=right> <tr> <td colspan=19 bgcolor=#303030><b><font color=white>NPROC=1 DIR=/data1 SYNC=off PHASE_COPY=cp REP_COUNTER=3 GAMMA=0.2 PHASE_OVERWRITE=off PHASE_STATS=find FILE_SIZE=8192 BYTES=134217728 PHASE_APPEND=off PHASE_READ=find DEV=/dev/hdb1 DD_MBCOUNT=500 WRITE_BUFFER=131072 PHASE_DELETE=rm PHASE_MODIFY=off </td></tr> <tr><td colspan=19 align=right> <font size=-2>Produced by <a href=http://namesys.com/benchmarks/mongo_readme.html>Mongo</a> benchmark suite.</font></td></tr> </table> <hr> <p> <a name="mongo.2003.07.23"></a> Below are older (2003.07.23) mongo results.
</p> <table cols=10 cellpadding=2 cellspacing=2 noborder> <tr><td bgcolor=black colspan=10><font color=white></td></tr> <tr> <th bgcolor=#303030 colspan=10 align=left><font color=white>A. reiser4</th> </tr> <tr> <th bgcolor=#303030 colspan=10 align=left><font color=white>B. ext3 data journalling</th> </tr> <tr> <th bgcolor=#303030 colspan=10 align=left><font color=white>C. ext3 </font></th> </tr> <tr> <td colspan=10 bgcolor=#606060><b><font color=white>#0:</font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=3><b>REAL_TIME</b></td> <td colspan=3><b>CPU_TIME</b></td> <td colspan=3><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>CREATE</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 14.19</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.221 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.592 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 5.66</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.610 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.475 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 458692</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>COPY</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 49.01</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.586 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.783 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 9.08</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.308 </font></tt></td> <td 
bgcolor=#E0E0C0 align=right><tt><font color=red> 1.176 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 916668</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>READ</b></td> <td bgcolor=#E0E0C0 align=right><tt>43.39</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.970</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.017 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>8.1</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.452</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.453</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 916668</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>STATS</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1.93</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.534 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.549 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 0.27</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.000 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.963 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 916668</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>DELETE</b></td> <td bgcolor=#E0E0C0 align=right><tt>40.13</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.797</U> </font></tt></td> <td bgcolor=#E0E0C0 
align=right><tt><font color=green> 0.837 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>11.26</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.217 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.210</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>4</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> </tt></td> </tr> <tr> <td colspan=10 bgcolor=#606060><b><font color=white>#1:DD_MBCOUNT=500 </font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=3><b>REAL_TIME</b></td> <td colspan=3><b>CPU_TIME</b></td> <td colspan=3><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_writing_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 42.27</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.527 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.057 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 7.78</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.497 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.189 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 512004</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_reading_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 36.57</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.005</U> </font></tt></td> <td bgcolor=#E0E0C0 
align=right><tt><font color=black> <U> 1.005</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>4.8</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.760</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.777 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 512004</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr><td bgcolor=black colspan=10><font color=white></td></tr> <tr><td colspan=10 align=right> <tr> <td colspan=10 bgcolor=#303030><b><font color=white>NPROC=1 DIR=/data1 SYNC=off PHASE_COPY=cp REP_COUNTER=3 GAMMA=0.2 PHASE_OVERWRITE=off PHASE_STATS=find FILE_SIZE=8192 BYTES=134217728 PHASE_APPEND=off PHASE_READ=find DEV=/dev/hdb1 DD_MBCOUNT=500 WRITE_BUFFER=131072 PHASE_DELETE=rm PHASE_MODIFY=off </td></tr> <tr><td colspan=10 align=right> <font size=-2>Produced by <a href=http://namesys.com/benchmarks/mongo_readme.html>Mongo</a> benchmark suite.</font></td></tr> </table> <hr> <a name="mongo.2003.07.10"></a> <p> Below are some older benchmarks from just before Linux Tag. In these, note that gamma is the fraction of files that are larger than the base size by 10x. This is set either to 0.2 (as in the benchmark above), in an attempt to mimic observed real usage patterns, or to 0, in an attempt to measure a file size range's performance qualities in isolation. Note that V3 performs poorly in the 0-8k size range, and V4 performs well. This is the result of deep design changes you can read about at <a href="http://www.namesys.com/v4/v4.html">http://www.namesys.com/v4/v4.html</a>.
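The gamma parameter can be pictured with a short sketch. This is a hypothetical illustration, not the Mongo suite's actual generator: with GAMMA=0.2 a file stays in the base 0-8k range with probability 0.8, is promoted to a range 10x larger with probability 0.2 (and so on for each further level), which reproduces the "80% in 0-8k, 16% in 0-80k" shape described above, where most files are small but most bytes land in large files.

```python
import random

def sample_file_size(base=8192, gamma=0.2, max_level=5, rng=random):
    """Hypothetical sketch of a gamma-style size distribution.

    Level k is reached with probability (1 - gamma) * gamma**k, and the
    file size is then drawn uniformly from 0..base * 10**k.  With
    gamma=0.2 roughly 80% of files fall in 0..base, 16% in 0..10*base,
    etc.; with gamma=0 every file stays in the base range.
    """
    k = 0
    while k < max_level and rng.random() < gamma:
        k += 1  # promote the file to a range 10x larger
    return rng.randint(0, base * 10 ** k)

if __name__ == "__main__":
    random.seed(42)
    sizes = [sample_file_size() for _ in range(10000)]
    small = sum(s <= 8192 for s in sizes) / len(sizes)
    print(f"fraction of files <= 8k: {small:.2f}")  # close to 0.8
```

Setting gamma to 0 collapses the loop, so every file is drawn from the base range alone, which is what the "in isolation" runs measure.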
<dl><dt>mem total</dt><dd>513748</dd><dt>machine </dt><dd>strelka</dd><dt>kernel </dt><dd>2.5.74 #213 SMP Thu Jul 10 22:53:23 MSD 2003</dd><dt>date </dt><dd>Thu Jul 10 22:48:56 2003</dd><dt>.config </dt><dd><a href="http://www.namesys.com/intbenchmarks/mongo/03.07.11.nikita/.config">here</a></dd><dt>NPROC</dt><dd>1</dd><dt>DIR</dt><dd>/data1</dd><dt>SYNC</dt><dd>off</dd><dt>REP_COUNTER</dt><dd>3</dd><dt>All phases are in readdir order</dt><dd></dd><dt>BYTES</dt><dd>100M</dd><dt>DEV</dt><dd>/dev/hdb1</dd><dt>WRITE_BUFFER</dt><dd><b>256k</b></dd></dl> <p>Everywhere, <b>A</b> is reiserfs and <b>B</b> is reiser4. Green numbers mean reiser4 is better.</p> <table cols="7" cellpadding="2" cellspacing="2" noborder=""> <tbody><tr><td bgcolor="black" colspan="7"><font color="white"></font></td></tr> <tr> <th bgcolor="#303030" colspan="7" align="left"><font color="white">median file size 8k</font></th> </tr> <tr align="center" bgcolor="#c0c0c0"> <td></td> <td colspan="2"><b>REAL_TIME</b></td> <td colspan="2"><b>CPU_TIME</b></td> <td colspan="2"><b>DF</b></td> </tr> <tr align="center" bgcolor="#c0c0c0"> <td></td> <td><b>A</b></td><td><b>B/A </b></td> <td><b>A</b></td><td><b>B/A </b></td> <td><b>A</b></td><td><b>B/A </b></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>CREATE</b></td> <td bgcolor="#e0e0c0" align="right"><tt>41.26</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.246</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>3.93</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.908</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>321632</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.961</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>COPY</b></td> <td bgcolor="#e0e0c0" align="right"><tt>154.09</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.504</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 5.17</u></tt></td> <td
bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.217 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>642624</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.962</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>APPEND</b></td> <td bgcolor="#e0e0c0" align="right"><tt>282.09</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.573</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 6.6</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.392 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>944428</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 0.980</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>MODIFY</b></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 284.52</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 0.986</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 3.29</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.489 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 943592</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 0.981</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>OVERWRITE</b></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 298.19</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.263 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 5.33</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.608 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>943548</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.968</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>READ</b></td> <td bgcolor="#e0e0c0" align="right"><tt>245.22</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.940</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 
3.85</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.753 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>943548</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.968</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>STATS</b></td> <td bgcolor="#e0e0c0" align="right"><tt>20.58</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.099</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 0.48</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.292 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>943548</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.968</u> </font></tt></td> </tr> <tr> <td colspan="7" bgcolor="#a0a0a0"><b><font color="white">GAMMA=0.2 FILE_SIZE=8192 <a href="http://www.namesys.com/intbenchmarks/mongo/03.07.11.nikita/8k.heavy.v3.profile">A profile</a> <a href="http://www.namesys.com/intbenchmarks/mongo/03.07.11.nikita/8k.heavy.v4.profile">B profile</a></font></b></td></tr> <tr><td bgcolor="white" colspan="7"><font color="white"></font></td></tr> <tr><td bgcolor="white" colspan="7"><font color="white"></font></td></tr> <tr><td bgcolor="white" colspan="7"><font color="white"></font></td></tr> <tr><td bgcolor="black" colspan="7"><font color="white"></font></td></tr> <tr> <th bgcolor="#303030" colspan="7" align="left"><font color="white">median file size 4k</font></th> </tr> <tr align="center" bgcolor="#c0c0c0"> <td></td> <td colspan="2"><b>REAL_TIME</b></td> <td colspan="2"><b>CPU_TIME</b></td> <td colspan="2"><b>DF</b></td> </tr> <tr align="center" bgcolor="#c0c0c0"> <td></td> <td><b>A</b></td><td><b>B/A </b></td> <td><b>A</b></td><td><b>B/A </b></td> <td><b>A</b></td><td><b>B/A </b></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>CREATE</b></td> <td bgcolor="#e0e0c0" align="right"><tt>117.32</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.176</u> </font></tt></td> 
<td bgcolor="#e0e0c0" align="right"><tt>15.57</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.758</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 667652</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.000</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>COPY</b></td> <td bgcolor="#e0e0c0" align="right"><tt>524.67</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.365</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 19.16</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.059 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 1332856</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.002</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>APPEND</b></td> <td bgcolor="#e0e0c0" align="right"><tt>1068.43</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.363</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>31.27</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.937</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>2073420</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.950</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>MODIFY</b></td> <td bgcolor="#e0e0c0" align="right"><tt>1081.23</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.670</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 18.61</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.048 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>2066536</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.953</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>OVERWRITE</b></td> <td bgcolor="#e0e0c0" align="right"><tt>1050.55</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 
0.885</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 22.81</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.017</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>2066424</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.948</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>READ</b></td> <td bgcolor="#e0e0c0" align="right"><tt>974.43</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.644</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 12.28</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.635 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>2066424</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.948</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>STATS</b></td> <td bgcolor="#e0e0c0" align="right"><tt>83.44</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.075</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>1.26</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.802</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>2066424</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.948</u> </font></tt></td> </tr> <tr> <td colspan="7" bgcolor="#a0a0a0"><b><font color="white">GAMMA=0.2 FILE_SIZE=4096 <a href="http://www.namesys.com/intbenchmarks/mongo/03.07.11.nikita/4k.heavy.v3.profile">A profile</a> <a href="http://www.namesys.com/intbenchmarks/mongo/03.07.11.nikita/4k.heavy.v4.profile">B profile</a></font></b></td></tr> <tr><td bgcolor="white" colspan="7"><font color="white"></font></td></tr> <tr><td bgcolor="white" colspan="7"><font color="white"></font></td></tr> <tr><td bgcolor="white" colspan="7"><font color="white"></font></td></tr> <tr><td bgcolor="black" colspan="7"><font color="white"></font></td></tr> <tr> <th bgcolor="#303030" colspan="7" 
align="left"><font color="white">maximal file size 4k</font></th> </tr> <tr align="center" bgcolor="#c0c0c0"> <td></td> <td colspan="2"><b>REAL_TIME</b></td> <td colspan="2"><b>CPU_TIME</b></td> <td colspan="2"><b>DF</b></td> </tr> <tr align="center" bgcolor="#c0c0c0"> <td></td> <td><b>A</b></td><td><b>B/A </b></td> <td><b>A</b></td><td><b>B/A </b></td> <td><b>A</b></td><td><b>B/A </b></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>CREATE</b></td> <td bgcolor="#e0e0c0" align="right"><tt>77.34</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.309</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>21.86</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.938</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>452252</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.923</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>COPY</b></td> <td bgcolor="#e0e0c0" align="right"><tt>412.28</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.300</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 35.11</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.013</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>893408</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.934</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>APPEND</b></td> <td bgcolor="#e0e0c0" align="right"><tt>1198.9</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.164</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>67.06</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.694</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>1631992</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.749</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>MODIFY</b></td> <td bgcolor="#e0e0c0" 
align="right"><tt>1305.14</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.351</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>43.77</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.762</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>1613124</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.758</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>OVERWRITE</b></td> <td bgcolor="#e0e0c0" align="right"><tt>1390.94</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.239</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>44.22</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.777</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>1610948</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.759</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>READ</b></td> <td bgcolor="#e0e0c0" align="right"><tt>1093.6</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.256</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 19.46</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.743 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>1610948</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.759</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>STATS</b></td> <td bgcolor="#e0e0c0" align="right"><tt>115.76</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.200</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>2.6</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.735</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>1610948</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.759</u> </font></tt></td> </tr> <tr> <td colspan="7" bgcolor="#a0a0a0"><b><font 
color="white">GAMMA=0.0 FILE_SIZE=4096 <a href="http://www.namesys.com/intbenchmarks/mongo/03.07.11.nikita/100.heavy.v3.profile">A profile</a> <a href="http://www.namesys.com/intbenchmarks/mongo/03.07.11.nikita/100.heavy.v4.profile">B profile</a></font></b></td></tr> <tr><td bgcolor="white" colspan="7"><font color="white"></font></td></tr> <tr><td bgcolor="white" colspan="7"><font color="white"></font></td></tr> <tr><td bgcolor="white" colspan="7"><font color="white"></font></td></tr> <tr> <th bgcolor="#303030" colspan="7" align="left"><font color="white">median file size 8k</font></th> </tr> <tr align="center" bgcolor="#c0c0c0"> <td></td> <td colspan="2"><b>REAL_TIME</b></td> <td colspan="2"><b>CPU_TIME</b></td> <td colspan="2"><b>DF</b></td> </tr> <tr align="center" bgcolor="#c0c0c0"> <td></td> <td><b>A</b></td><td><b>B/A </b></td> <td><b>A</b></td><td><b>B/A </b></td> <td><b>A</b></td><td><b>B/A </b></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>CREATE</b></td> <td bgcolor="#e0e0c0" align="right"><tt>40.54</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.248</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>4.01</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.895</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>321632</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.961</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>COPY</b></td> <td bgcolor="#e0e0c0" align="right"><tt>152.82</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.506</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 5.2</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.215 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>642624</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.962</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>READ</b></td> <td bgcolor="#e0e0c0" 
align="right"><tt>141.8</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.563</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 3.03</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.762 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>642624</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.962</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>STATS</b></td> <td bgcolor="#e0e0c0" align="right"><tt>14.91</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.084</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 0.59</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.051 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>642624</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.962</u> </font></tt></td> </tr> <tr><td bgcolor="black" colspan="7"><font color="white"></font></td></tr> <tr><td colspan="7" align="right"> </td></tr><tr> <td colspan="7" bgcolor="#303030"><b><font color="white">GAMMA=0.2 FILE_SIZE=8192</font></b></td></tr> <tr><td bgcolor="white" colspan="7"><font color="white"></font></td></tr> <tr><td bgcolor="white" colspan="7"><font color="white"></font></td></tr> <tr><td bgcolor="white" colspan="7"><font color="white"></font></td></tr> <tr> <th bgcolor="#303030" colspan="7" align="left"><font color="white">median file size 4k</font></th> </tr> <tr align="center" bgcolor="#c0c0c0"> <td></td> <td colspan="2"><b>REAL_TIME</b></td> <td colspan="2"><b>CPU_TIME</b></td> <td colspan="2"><b>DF</b></td> </tr> <tr align="center" bgcolor="#c0c0c0"> <td></td> <td><b>A</b></td><td><b>B/A </b></td> <td><b>A</b></td><td><b>B/A </b></td> <td><b>A</b></td><td><b>B/A </b></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>CREATE</b></td> <td bgcolor="#e0e0c0" align="right"><tt>115.6</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.174</u> 
</font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>14.84</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.772</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 667652</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.000</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>COPY</b></td> <td bgcolor="#e0e0c0" align="right"><tt>528.83</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.361</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 18.91</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.058 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 1332856</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.002</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>READ</b></td> <td bgcolor="#e0e0c0" align="right"><tt>532.06</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.372</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 10.87</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.589 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 1332856</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.002</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>STATS</b></td> <td bgcolor="#e0e0c0" align="right"><tt>51.99</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.069</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>1.67</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.581</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 1332856</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.002</u> </font></tt></td> </tr> <tr><td bgcolor="black" colspan="7"><font color="white"></font></td></tr> <tr><td colspan="7" align="right"> </td></tr><tr> <td colspan="7" 
bgcolor="#303030"><b><font color="white">GAMMA=0.2 FILE_SIZE=4096</font></b></td></tr> <tr><td bgcolor="white" colspan="7"><font color="white"></font></td></tr> <tr><td bgcolor="white" colspan="7"><font color="white"></font></td></tr> <tr><td bgcolor="white" colspan="7"><font color="white"></font></td></tr> <tr> <th bgcolor="#303030" colspan="7" align="left"><font color="white">maximal file size 4k</font></th> </tr> <tr align="center" bgcolor="#c0c0c0"> <td></td> <td colspan="2"><b>REAL_TIME</b></td> <td colspan="2"><b>CPU_TIME</b></td> <td colspan="2"><b>DF</b></td> </tr> <tr align="center" bgcolor="#c0c0c0"> <td></td> <td><b>A</b></td><td><b>B/A </b></td> <td><b>A</b></td><td><b>B/A </b></td> <td><b>A</b></td><td><b>B/A </b></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>CREATE</b></td> <td bgcolor="#e0e0c0" align="right"><tt>77.5</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.309</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>22.24</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.910</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>452252</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.923</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>COPY</b></td> <td bgcolor="#e0e0c0" align="right"><tt>415.84</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.297</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 34.9</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.009</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>893408</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.934</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>READ</b></td> <td bgcolor="#e0e0c0" align="right"><tt>469.97</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.273</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 
20.14</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.454 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>893408</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.934</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>STATS</b></td> <td bgcolor="#e0e0c0" align="right"><tt>65.49</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.162</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>3.09</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.599</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>893408</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.934</u> </font></tt></td> </tr> <tr><td bgcolor="black" colspan="7"><font color="white"></font></td></tr> <tr><td colspan="7" align="right"> </td></tr><tr> <td colspan="7" bgcolor="#303030"><b><font color="white">GAMMA=0.0 FILE_SIZE=4096</font></b></td></tr> </tbody></table> <hr> <h1>Mongo benchmark results</h1> <h2>create, copy, read, stats, delete phases</h2> <dl><dt>reiser4 </dt><dd>ChangeSet@1.1095, 2003-07-10 15:22:17+04:00, god@laputa.namesys.com oops ChangeSet@1.1094, 2003-07-10 15:14:06+04:00, god@laputa.namesys.com repairing compilation damage. 
</dd><dt>mem total</dt><dd>256624</dd><dt>machine </dt><dd>belka</dd><dt>kernel </dt><dd>2.5.74 #28 Thu Jul 10 18:36:03 MSD 2003</dd><dt>date </dt><dd>Thu Jul 10 19:21:06 2003</dd><dt><a href="http://namesys.com/intbenchmarks/mongo/03.07.11.light/dot.config">.config</a></dt></dl> <table cols="19" cellpadding="2" cellspacing="2" noborder=""> <tbody><tr><td bgcolor="black" colspan="19"><font color="white"></font></td></tr> <tr> <th bgcolor="#303030" colspan="19" align="left"><font color="white">A.INFO_R4=test FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor="#303030" colspan="19" align="left"><font color="white">B.INFO_R4=test FSTYPE=reiser4 MKFS=mkfs.reiser4 -q -e extent40 </font></th> </tr> <tr> <th bgcolor="#303030" colspan="19" align="left"><font color="white">C.FSTYPE=reiserfs </font></th> </tr> <tr> <th bgcolor="#303030" colspan="19" align="left"><font color="white">D.FSTYPE=reiserfs MOUNT_OPTIONS=notail </font></th> </tr> <tr> <th bgcolor="#303030" colspan="19" align="left"><font color="white">E.FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor="#303030" colspan="19" align="left"><font color="white">F.FSTYPE=ext3 MOUNT_OPTIONS=data=journal </font></th> </tr> <tr> <td colspan="19" bgcolor="#606060"><b><font color="white">#0:FILE_SIZE=4000 </font></b></td></tr> <tr align="center" bgcolor="#c0c0c0"> <td></td> <td colspan="6"><b>REAL_TIME</b></td> <td colspan="6"><b>CPU_TIME</b></td> <td colspan="6"><b>DF</b></td> </tr> <tr align="center" bgcolor="#c0c0c0"> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>CREATE</b></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 20.47</u></tt></td> <td bgcolor="#e0e0c0" 
align="right"><tt><font color="red"> 1.404 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 3.037 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 2.024 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 2.513 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 3.324 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>12.72</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.143 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.270 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.873 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.615</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.606</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 416332</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.934 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.088 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.909 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.858 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.858 </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>COPY</b></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 65.25</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.484 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 2.953 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 2.020 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.986 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 2.267 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>21.98</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font 
color="red"> 1.032 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.098 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.732 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.529</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.699 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 832640</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.934 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.088 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.910 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.858 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.858 </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>READ</b></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 75.56</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.349 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 2.868 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 2.218 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.902 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.925 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>17.36</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.213 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.745 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.857 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.695 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.681</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 832640</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font 
color="red"> 1.934 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.088 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.910 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.858 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.858 </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>STATS</b></td> <td bgcolor="#e0e0c0" align="right"><tt>132.18</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> 0.996 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.963</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> 0.994 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.967</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.950</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>2.63</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.977</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.970</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 0.989</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 0.981</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> 1.008 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 832640</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.934 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.088 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.910 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.858 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.858 </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>DELETE</b></td> <td bgcolor="#e0e0c0" 
align="right"><tt>85.32</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.627 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.239 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.442 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.403</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.449 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>33.57</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.856 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.780 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.623 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.157</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.154</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>4</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> 1.000 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.000</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.000</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.000</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.000</u> </font></tt></td> </tr> <tr> <td colspan="19" bgcolor="#606060"><b><font color="white">#1:FILE_SIZE=8000 </font></b></td></tr> <tr align="center" bgcolor="#c0c0c0"> <td></td> <td colspan="6"><b>REAL_TIME</b></td> <td colspan="6"><b>CPU_TIME</b></td> <td colspan="6"><b>DF</b></td> </tr> <tr align="center" bgcolor="#c0c0c0"> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A 
</b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>CREATE</b></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 15.07</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.009</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 8.875 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.709 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 2.237 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 3.321 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>8.62</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.945 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.932 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.729 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.517</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.522</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 399788</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.000</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.243 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.461 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.434 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.434 </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>COPY</b></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 52.24</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.007</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 4.998 </font></tt></td> <td bgcolor="#e0e0c0" 
align="right"><tt><font color="red"> 1.492 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.562 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.879 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>13.42</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.026 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.264 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.700 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.487</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.635 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 799488</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.000</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.243 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.461 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.434 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.434 </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>READ</b></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 60.91</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.013</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 3.738 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.606 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.333 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.340 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>11.66</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> 1.018 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.526</u> </font></tt></td> <td bgcolor="#e0e0c0" 
align="right"><tt><font color="green"> 0.749 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.547 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.547 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 799488</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.000</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.243 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.461 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.434 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.434 </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>STATS</b></td> <td bgcolor="#e0e0c0" align="right"><tt>126.53</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.951</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.958</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> 0.991 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> 1.004 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.966</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 2.57</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.023 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.027 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 0.988</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> 1.016 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> 1.012 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 799488</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.000</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.243 
</font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.461 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.434 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.434 </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>DELETE</b></td> <td bgcolor="#e0e0c0" align="right"><tt>73.21</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.116 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.746 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.242</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.301 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.396 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>19.93</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> 1.013 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.584 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.530 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.126 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.123</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>4</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> 1.000 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.000</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.000</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.000</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.000</u> </font></tt></td> </tr> <tr><td bgcolor="black" colspan="19"><font color="white"></font></td></tr> <tr><td colspan="19" align="right"> </td></tr><tr> <td colspan="19" bgcolor="#303030"><b><font 
color="white">PHASE_APPEND=off NPROC=1 DIR=/mnt/testfs SYNC=off REP_COUNTER=3 GAMMA=0.0 PHASE_OVERWRITE=off DEV=/dev/hdb3 WRITE_BUFFER=4096 BYTES=128000000 PHASE_MODIFY=off </font></b></td></tr> <tr><td colspan="19" align="right"> <font size="-2">Produced by <a href="http://namesys.com/benchmarks/mongo_readme.html">Mongo</a> benchmark suite.</font></td></tr> </tbody></table> <h2>dd of a large file phase</h2> <dl><dt>reiser4 </dt><dd>ChangeSet@1.1095, 2003-07-10 15:22:17+04:00, god@laputa.namesys.com oops ChangeSet@1.1094, 2003-07-10 15:14:06+04:00, god@laputa.namesys.com repairing compilation damage. </dd><dt>mem total</dt><dd>256624</dd><dt>machine </dt><dd>belka</dd><dt>kernel </dt><dd>2.5.74 #28 Thu Jul 10 18:36:03 MSD 2003</dd><dt>date </dt><dd>Thu Jul 10 21:36:22 2003</dd><dt><a href="http://namesys.com/intbenchmarks/mongo/03.07.11.light/dot.config">.config</a></dt></dl> <table cols="19" cellpadding="2" cellspacing="2" noborder=""> <tbody><tr><td bgcolor="black" colspan="19"><font color="white"></font></td></tr> <tr> <th bgcolor="#303030" colspan="19" align="left"><font color="white">A.INFO_R4=test FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor="#303030" colspan="19" align="left"><font color="white">B.INFO_R4=test FSTYPE=reiser4 MKFS=mkfs.reiser4 -q -e extent40 </font></th> </tr> <tr> <th bgcolor="#303030" colspan="19" align="left"><font color="white">C.FSTYPE=reiserfs </font></th> </tr> <tr> <th bgcolor="#303030" colspan="19" align="left"><font color="white">D.FSTYPE=reiserfs MOUNT_OPTIONS=notail </font></th> </tr> <tr> <th bgcolor="#303030" colspan="19" align="left"><font color="white">E.FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor="#303030" colspan="19" align="left"><font color="white">F.FSTYPE=ext3 MOUNT_OPTIONS=data=journal </font></th> </tr> <tr> <td colspan="19" bgcolor="#606060"><b><font color="white">#0:DD_MBCOUNT=768 </font></b></td></tr> <tr align="center" bgcolor="#c0c0c0"> <td></td> <td colspan="6"><b>REAL_TIME</b></td> <td 
colspan="6"><b>CPU_TIME</b></td> <td colspan="6"><b>DF</b></td> </tr> <tr align="center" bgcolor="#c0c0c0"> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>dd_writing_largefile</b></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 76.29</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 0.997</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.137 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.149 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.062 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 2.217 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>7.47</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.027 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.545</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.549</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.803 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.835 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 786432</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.000</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.001</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.001</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.001</u> </font></tt></td> <td bgcolor="#e0e0c0" 
align="right"><tt><font color="black"> <u> 1.001</u> </font></tt></td> </tr> <tr><td bgcolor="black" colspan="19"><font color="white"></font></td></tr> <tr><td colspan="19" align="right"> </td></tr><tr> <td colspan="19" bgcolor="#303030"><b><font color="white">NPROC=1 DIR=/mnt/testfs SYNC=off REP_COUNTER=3 GAMMA=0.0 DD_MBCOUNT=768 DEV=/dev/hdb3 WRITE_BUFFER=4096 FILE_SIZE=8000 BYTES=128000000 </font></b></td></tr> <tr><td colspan="19" align="right"> <font size="-2">Produced by <a href="http://namesys.com/benchmarks/mongo_readme.html">Mongo</a> benchmark suite.</font></td></tr> </tbody></table> <hr> <a name="bonnie++.2003.09.30"> This is bonnie++ output for reiser4 and ext3. This has been done in an attempt to analyze <a href="http://fsbench.netnation.com/">results</a> obtained by Mike Benoit. Hardware specs: <pre> processor : 3 vendor_id : GenuineIntel cpu family : 15 model : 2 model name : Intel(R) Xeon(TM) CPU 2.40GHz stepping : 7 cpu MHz : 2379.253 cache size : 512 KB bogomips : 4751.36 </pre> Dual CPU with hyper-threading Memory: 128M HDD: <pre> # hdparm /dev/hdb1 /dev/hdb1: multcount = 16 (on) IO_support = 0 (default 16-bit) unmaskirq = 0 (off) using_dma = 1 (on) keepsettings = 0 (off) readonly = 0 (off) readahead = 256 (on) geometry = 65535/16/63, sectors = 117226242, start = 63 # hdparm -t /dev/hdb1 /dev/hdb1: Timing buffered disk reads: 64 MB in 1.60 seconds = 39.91 MB/sec # hdparm -i /dev/hdb /dev/hdb: Model=ST360021A, FwRev=3.19, SerialNo=3HR173RB Config={ HardSect NotMFM HdSw>15uSec Fixed DTR>10Mbs RotSpdTol>.5% } RawCHS=16383/16/63, TrkSize=0, SectSize=0, ECCbytes=4 BuffType=unknown, BuffSize=2048kB, MaxMultSect=16, MultSect=16 CurCHS=16383/16/63, CurSects=16514064, LBA=yes, LBAsects=117231408 IORDY=on/off, tPIO={min:240,w/IORDY:120}, tDMA={min:120,rec:120} PIO modes: pio0 pio1 pio2 pio3 pio4 DMA modes: mdma0 mdma1 mdma2 UDMA modes: udma0 udma1 udma2 udma3 udma4 *udma5 AdvancedPM=no WriteCache=enabled Drive conforms to: device does not report version: 1 
2 3 4 5 </pre> <pre> ./bonnie++ -s 1g -n 10 -x 5 Version 1.03 ------Sequential Output------ --Sequential Input- --Random- -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks-- Machine Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP v4.128M 1G 19903 89 37911 20 15392 11 13624 58 41807 12 131.0 0 v4.128M 1G 19965 89 37600 20 15845 11 13730 58 41751 12 130.0 0 v4.128M 1G 19937 89 37746 20 15404 11 13624 58 41793 12 132.1 0 v4.128M 1G 19998 89 37184 19 15007 10 13393 56 41611 11 130.2 0 v4.128M 1G 19771 89 37679 20 15206 11 13466 57 41808 11 130.2 1 ext3.128M 1G 21236 99 37258 22 11357 4 13460 56 41748 6 120.0 0 ext3.128M 1G 20821 99 36838 23 12176 5 13154 55 40671 6 120.7 0 ext3.128M 1G 20755 99 37032 24 12069 4 12908 54 40851 5 120.2 0 ext3.128M 1G 20651 99 37094 24 11817 5 13038 54 40842 6 121.3 0 ext3.128M 1G 20928 99 37300 23 12287 4 13067 55 41404 6 120.1 0 ------Sequential Create------ --------Random Create-------- -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete-- files:max:min /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP v4.128M 10 18503 100 +++++ +++ 9488 99 10158 99 +++++ +++ 11635 99 v4.128M 10 19760 99 +++++ +++ 9696 99 10441 100 +++++ +++ 11831 99 v4.128M 10 19583 100 +++++ +++ 9672 100 10597 99 +++++ +++ 11846 100 v4.128M 10 19720 100 +++++ +++ 9577 99 10126 100 +++++ +++ 11924 100 v4.128M 10 19682 100 +++++ +++ 9683 100 10461 100 +++++ +++ 11834 100 ext3.128M 10 3279 97 +++++ +++ +++++ +++ 3406 100 +++++ +++ 8951 95 ext3.128M 10 3303 98 +++++ +++ +++++ +++ 3423 99 +++++ +++ 8558 96 ext3.128M 10 3317 98 +++++ +++ +++++ +++ 3402 100 +++++ +++ 8721 93 ext3.128M 10 3325 98 +++++ +++ +++++ +++ 3390 100 +++++ +++ 9242 100 ext3.128M 10 3315 97 +++++ +++ +++++ +++ 3439 100 +++++ +++ 8896 96 </pre> <pre> ./bonnie++ -f -d . 
-s 3072 -n 10:100000:10:10 -x 1 Version 1.03 ------Sequential Output------ --Sequential Input- --Random- -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks-- Machine Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP v4 3G 37579 19 15657 11 41531 11 105.8 0 v4 3G 37993 20 15478 11 41632 11 105.4 0 ext3 3G 35221 22 10987 4 41105 6 90.9 0 ext3 3G 35099 22 11517 4 41416 6 90.7 0 ------Sequential Create------ --------Random Create-------- -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete-- files:max:min /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP v4 10:100000:10/10 570 39 746 17 1435 23 513 40 104 2 951 15 v4 10:100000:10/10 566 40 765 17 1385 23 509 41 104 2 904 14 ext3 10:100000:10/10 221 8 364 4 853 4 204 7 99 1 306 2 ext3 10:100000:10/10 221 7 368 4 839 5 206 7 91 1 309 2 </pre> <hr> <a name="grant"></a> Benchmarks performed by <a href="mailto:mine0057@mrs.umn.edu">Grant Miner</a>. He used the <a href="http://epoxy.mrs.umn.edu/~minerg/fstests/bench.scm">bench.scm</a> script (requires <a href="http://www.scsh.net/">scsh</a>). Results (copied from <a href="http://epoxy.mrs.umn.edu/~minerg/fstests/results.html">http://epoxy.mrs.umn.edu/~minerg/fstests/results.html</a>): <p>2.6.0-test3</p> <p>mkfs ran with default options</p> <p>Each test has three columns: the first gives the canonical name of the test together with the elapsed time in seconds, the second is the system CPU time, and the third is the user CPU time. The summary column "total" is the total elapsed time; "sys" is the total system CPU time; "usr" is the total user CPU time; "total cpu" is the sum of the total system and user CPU times.
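As a quick arithmetic check of how the summary columns are derived, here is a minimal Python sketch (illustrative only, not part of bench.scm; the per-test figures are copied from the reiser4 row of the table that follows):

```python
# The summary columns of this table are simple sums over the per-test
# columns. Using the reiser4 row (tests: bigdir, cp, cp2..cp5, rm, rm2,
# rm3, sync) as an example: "total" sums the per-test elapsed times,
# "sys" the system times, "usr" the user times, and "total cpu" = sys + usr.
reiser4_real = [33.51, 33.9, 32.9, 34.0, 33.62, 31.31, 17.45, 11.54, 13.08, 0.52]
reiser4_sys = [10.85, 10.65, 10.79, 10.87, 10.87, 10.83, 4.07, 4.49, 4.27, 0.0]
reiser4_usr = [0.69, 0.65, 0.67, 0.65, 0.69, 0.76, 0.3, 0.3, 0.27, 0.0]

total_real = round(sum(reiser4_real), 2)     # 241.83, the "total" column
total_sys = round(sum(reiser4_sys), 2)       # 77.69, the "sys" column
total_usr = round(sum(reiser4_usr), 2)       # 4.98, the "usr" column
total_cpu = round(total_sys + total_usr, 2)  # 82.67, the "total cpu" column
```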
</p> <p><b>all values are in seconds thus lower is better</b></p> <table border cellspacing=0 cellpadding=5> <caption>Filesystem Performance</caption> <colgroup> <col> <col bgcolor="gray"> </colgroup> <tr> <th>fs</th> <td bgcolor="lightgray">bigdir</td> <td>sys</td> <td>usr</td> <td bgcolor="lightgray">cp</td> <td>sys</td> <td>usr</td> <td bgcolor="lightgray">cp2</td> <td>sys</td> <td>usr</td> <td bgcolor="lightgray">cp3</td> <td>sys</td> <td>usr</td> <td bgcolor="lightgray">cp4</td> <td>sys</td> <td>usr</td> <td bgcolor="lightgray">cp5</td> <td>sys</td> <td>usr</td> <td bgcolor="lightgray">rm</td> <td>sys</td> <td>usr</td> <td bgcolor="lightgray">rm2</td> <td>sys</td> <td>usr</td> <td bgcolor="lightgray">rm3</td> <td>sys</td> <td>usr</td> <td bgcolor="lightgray">sync</td> <td>sys</td> <td>usr</td> <td bgcolor="lightgray">total</td> <td>sys</td> <td>usr</td> <td bgcolor="lightgray">total cpu</td> <th>fs</th> </tr> <tr> <th>reiserfs</th> <td bgcolor="lightgray">40.03</td> <td>12.22</td> <td>0.76</td> <td bgcolor="lightgray">77.75</td> <td>10.72</td> <td>0.45</td> <td bgcolor="lightgray">62.9</td> <td>10.82</td> <td>0.43</td> <td bgcolor="lightgray">60.26</td> <td>11.03</td> <td>0.43</td> <td bgcolor="lightgray">61.33</td> <td>11.13</td> <td>0.43</td> <td bgcolor="lightgray">66.08</td> <td>11.31</td> <td>0.45</td> <td bgcolor="lightgray">10.86</td> <td>3.74</td> <td>0.07</td> <td bgcolor="lightgray">4.62</td> <td>3.36</td> <td>0.09</td> <td bgcolor="lightgray">8.22</td> <td>3.5</td> <td>0.09</td> <td bgcolor="lightgray">1.78</td> <td>0.03</td> <td>0.</td> <td bgcolor="lightgray">393.83</td> <td>77.86</td> <td>3.2</td> <td bgcolor="lightgray">81.06</td> <th>reiserfs</th> </tr> <tr> <th>jfs</th> <td bgcolor="lightgray">47.2</td> <td>8.9</td> <td>0.77</td> <td bgcolor="lightgray">109.75</td> <td>5.5</td> <td>0.3</td> <td bgcolor="lightgray">110.71</td> <td>5.49</td> <td>0.35</td> <td bgcolor="lightgray">114.69</td> <td>5.6</td> <td>0.29</td> <td 
bgcolor="lightgray">117.97</td> <td>5.65</td> <td>0.35</td> <td bgcolor="lightgray">125.48</td> <td>5.82</td> <td>0.29</td> <td bgcolor="lightgray">38.68</td> <td>0.74</td> <td>0.05</td> <td bgcolor="lightgray">16.25</td> <td>1.08</td> <td>0.07</td> <td bgcolor="lightgray">37.46</td> <td>0.74</td> <td>0.04</td> <td bgcolor="lightgray">0.07</td> <td>0.</td> <td>0.</td> <td bgcolor="lightgray">718.26</td> <td>39.52</td> <td>2.51</td> <td bgcolor="lightgray">42.03</td> <th>jfs</th> </tr> <tr> <th>xfs</th> <td bgcolor="lightgray">44.77</td> <td>13.3</td> <td>0.94</td> <td bgcolor="lightgray">105.36</td> <td>13.33</td> <td>0.53</td> <td bgcolor="lightgray">110.27</td> <td>14.36</td> <td>0.5</td> <td bgcolor="lightgray">110.17</td> <td>14.37</td> <td>0.51</td> <td bgcolor="lightgray">111.03</td> <td>14.43</td> <td>0.53</td> <td bgcolor="lightgray">118.84</td> <td>14.87</td> <td>0.55</td> <td bgcolor="lightgray">31.85</td> <td>6.44</td> <td>0.15</td> <td bgcolor="lightgray">15.2</td> <td>5.45</td> <td>0.14</td> <td bgcolor="lightgray">34.32</td> <td>5.87</td> <td>0.14</td> <td bgcolor="lightgray">0.03</td> <td>0.</td> <td>0.</td> <td bgcolor="lightgray">681.84</td> <td>102.42</td> <td>3.99</td> <td bgcolor="lightgray">106.41</td> <th>xfs</th> </tr> <tr> <th>reiser4</th> <td bgcolor="lightgray">33.51</td> <td>10.85</td> <td>0.69</td> <td bgcolor="lightgray">33.9</td> <td>10.65</td> <td>0.65</td> <td bgcolor="lightgray">32.9</td> <td>10.79</td> <td>0.67</td> <td bgcolor="lightgray">34.</td> <td>10.87</td> <td>0.65</td> <td bgcolor="lightgray">33.62</td> <td>10.87</td> <td>0.69</td> <td bgcolor="lightgray">31.31</td> <td>10.83</td> <td>0.76</td> <td bgcolor="lightgray">17.45</td> <td>4.07</td> <td>0.3</td> <td bgcolor="lightgray">11.54</td> <td>4.49</td> <td>0.3</td> <td bgcolor="lightgray">13.08</td> <td>4.27</td> <td>0.27</td> <td bgcolor="lightgray">0.52</td> <td>0.</td> <td>0.</td> <td bgcolor="lightgray">241.83</td> <td>77.69</td> <td>4.98</td> <td 
bgcolor="lightgray">82.67</td> <th>reiser4</th> </tr> <tr> <th>ext3</th> <td bgcolor="lightgray">38.79</td> <td>9.35</td> <td>0.7</td> <td bgcolor="lightgray">91.57</td> <td>7.21</td> <td>0.36</td> <td bgcolor="lightgray">62.6</td> <td>7.44</td> <td>0.36</td> <td bgcolor="lightgray">62.74</td> <td>7.5</td> <td>0.37</td> <td bgcolor="lightgray">60.62</td> <td>7.52</td> <td>0.34</td> <td bgcolor="lightgray">69.82</td> <td>7.59</td> <td>0.39</td> <td bgcolor="lightgray">26.21</td> <td>1.67</td> <td>0.05</td> <td bgcolor="lightgray">8.73</td> <td>1.66</td> <td>0.04</td> <td bgcolor="lightgray">13.79</td> <td>1.63</td> <td>0.06</td> <td bgcolor="lightgray">4.76</td> <td>0.01</td> <td>0.</td> <td bgcolor="lightgray">439.63</td> <td>51.58</td> <td>2.67</td> <td bgcolor="lightgray">54.25</td> <th>ext3</th> </tr> <tr> <th>ext2</th> <td bgcolor="lightgray">32.78</td> <td>7.61</td> <td>0.64</td> <td bgcolor="lightgray">37.28</td> <td>5.24</td> <td>0.34</td> <td bgcolor="lightgray">43.55</td> <td>5.34</td> <td>0.35</td> <td bgcolor="lightgray">45.41</td> <td>5.34</td> <td>0.37</td> <td bgcolor="lightgray">47.72</td> <td>5.48</td> <td>0.34</td> <td bgcolor="lightgray">50.5</td> <td>5.41</td> <td>0.32</td> <td bgcolor="lightgray">16.28</td> <td>0.67</td> <td>0.06</td> <td bgcolor="lightgray">7.54</td> <td>0.66</td> <td>0.05</td> <td bgcolor="lightgray">15.31</td> <td>0.71</td> <td>0.05</td> <td bgcolor="lightgray">0.24</td> <td>0.</td> <td>0.</td> <td bgcolor="lightgray">296.61</td> <td>36.46</td> <td>2.52</td> <td bgcolor="lightgray">38.98</td> <th>ext2</th> </tr> </table> <hr> </body> </html> <hr> <address><a href="mailto:reiser@namesys.com">Hans Reiser</a></address> <!-- Created: Sat Aug 23 00:28:46 MSD 2003 --> <!-- hhmts start --> Last modified: Thu Nov 20 17:51:10 MSK 2003 <!-- hhmts end --> </body> </html> [[category:ReiserFS]] [[category:Reiser4]] 0ad009d736188aa8a886cf07e58bf8c6beacbe81 1338 2009-06-25T07:58:10Z Chris goe 2 http://web.archive.org/web/20061113154648/www.namesys.com/benchmarks.html <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"> <html> <head> <BASE HREF="http://www.namesys.com.wstub.archive.org/benchmarks.html">
<title>Benchmarks Of Reiser4</title> </head> <body> <h1>Benchmarks Of ReiserFS Version 4</h1> <body> <hr> <H1>Remarks</H1> <p> Htree (-O dir_index) is the recent attempt by ext3 developers to handle large directories as well as reiserfs does, by using better-than-linear search algorithms. One of the interesting results here was that htree hurts ext3 performance, at least in this benchmark. This means that trying to get usable large-directory performance out of ext3 can severely impact performance in the non-large case. <p> You'll note that in our latest benchmark at the top here we use larger filesets. It seems that ext3 does a poor job of utilizing its write cache in the case where the fileset uses a lot of memory without exceeding it, and by increasing the size of the fileset we get a fairer (read: better for ext3) benchmark for the create phase. The use of filesets small enough to barely fit into RAM for the create (but not the copy) phase was due to my being lax in supervising the benchmarking, but it did reveal something interesting. Probably Andrew Morton will fix that pretty quickly --- it's most likely not a deep fix to make, like fixing htrees would be. <p> If anyone knows where the tail-combining patch for ext3 went to, let us know so we can benchmark it.... good tail-combining performance is not trivial to get right, and I am wondering if there is a performance reason it did not go in. <p> Keep in mind that these benchmarks are still evolving and maturing, and I need to give the mongo code a complete review again as it has been worked on by others quite a bit. Note that while I like the mongo benchmarks, those who are concerned they may be stacked in our favor can look at the benchmarks run by others on lkml, one of which is at the bottom of this page; while not as elaborate and detailed as mongo, it comes up with roughly the same result.
<p> Andrew Morton wrote some beautiful readahead code in VM; many thanks to him for what it contributes to V4 performance. Unfortunately, it should be confessed that these benchmarks utterly fail to measure its cleverness for real-world usage patterns. In fact, these benchmarks basically access everything once in each pass, which is not at all realistic in representing typical server workloads. So understand them as validly illuminating some aspects of performance, not all aspects, if you could be so generous. <p> We ran data-ordered ext3 benchmarks at the suggestion of Andrew Morton, but they came out slower for this benchmark. We need to increase the base size range to 8k and run again. <p> V4 is a fully atomic filesystem; keep in mind that these performance numbers are with every FS operation performed as a fully atomic transaction. We are the first to make fully atomic operation effective in terms of performance. Look for a user-space transactions interface to come out soon.... <p> Finally, remember that reiser4 is more space-efficient than V3; the df measurements are there for looking at....;-) <hr> <ul> <li><font color=red>linux-2.6.15-mm4</font> : mongo <a href="#mongo.2.6.15-mm4"> comparison</a> <tt>ext3 vs reiser4 with "unixfile" regular file plugin and reiser4 with "cryptcompress" regular file plugin</tt> </li> <li>linux-2.6.11 : mongo <a href="#mongo.2.6.11"> comparison</a> against <tt>xfs and ext2</tt> </li> <li>linux-2.6.8.1-mm3 : mongo <a href="#mongo.2.6.8.1-mm3"> comparison</a> against <tt>ext3</tt> </li> <li>2004.03.26 slow.c <a href="#slow.2004.03.26">comparison</a> against <tt>ext2, ext3</tt> </li> <li>2003.11.20 mongo <a href="#mongo.2003.11.20">comparison</a> against <tt>ext3</tt> </li> <li>Bonnie++ <a href="#bonnie++.2003.09.30">comparison</a> of <tt>reiser4</tt> and <tt>ext3</tt> done at 2003.09.30.
</li> <li>2003.09.25 mongo <a href="#mongo.2003.09.25">comparison</a> against <tt>ext3</tt> </li> <!-- <li>2003.08.28 mongo <a href="#mongo.2003.08.28">comparison</a> against <tt>ext3</tt> </li> <li>2003.08.27 mongo <a href="#mongo.2003.08.27">comparison</a> against <tt>ext3</tt> </li> <li>2003.08.26 mongo <a href="#mongo.2003.08.26">comparison</a> against <tt>ext3</tt> </li> <li>2003.08.18 mongo <a href="#mongo.2003.08.18">comparison</a> against <tt>ext3</tt> </li> <li>2003.08.12 mongo <a href="#mongo.2003.08.12">comparison</a> against <tt>ext3</tt> </li> --> <li>Older mongo <a href="#mongo.2003.08.28">results</a> (2003.08.28).</li> <li>mongo <a href="#mongo.2003.07.10">results</a> obtained before LinuxTAG (2003.07.10). Here reiser4 is compared with reiserfs.</li> <li>External benchmarks <a href="#grant">by Grant Miner</a>.</li> </ul> <hr> <a name="mongo.2.6.15-mm4"></a> linux-2.6.15-mm4 <a href="benchmarks/mongo_readme.html">mongo</a> results <p><b>Comparative results of mongo benchmark for ext3 vs reiser4 with "unixfile" regular file plugin vs reiser4 with "cryptcompress" regular file plugin</b> <p> <p>The cryptcompress patch against 2.6.15-mm4 and new version of reiser4progs are from <br> ftp://ftp.namesys.com/pub/tmp/cryptcompress_patches </p> <dl> <dt>reiser4 </dt> <dd>2.6.15-mm4 cryptcompress-4.patch</dd> <dt>mem total</dt> <dd>516312</dd> <dt>machine </dt> <dd>Intel(R) Xeon(TM) CPU 2.40GHz, <b>running UP kernel</b></dd> <dt>kernel </dt> <dd>2.6.15-mm4 #1 Sat Feb 11 20:00:11 MSK 2006</dd> <dt>date </dt> <dd>Sat Feb 11 21:03:21 2006</dd> <dd>Sat Feb 11 21:18:43 2006</dd> <dd>Sat Feb 11 21:37:52 2006</dd> </dl> <p>Legend:</p> <ul> <li><tt>A</tt> reiser4 with "cryptcompress" regular file plugin</li> <li><tt>B</tt> reiser4 with "unixfile" regular file plugin</li> <li><tt>C</tt> ext3</li> </ul> <p> Table presents absolute values (of elapsed time, CPU usage, CPU utilization, disk usage) for reiser4 with "cryptcompress" regular file plugin, and ratios against this 
reiser4 for reiser4 with "unixfile" regular file plugin and ext3. <font color=red>Red</font> number means ratio is larger than <tt>1.0</tt>, that is, reiser4 with "cryptcompress" regular file plugin is better in this test. <font color=green>Green</font> number means that it loses in this test. </p> <table cols=13 cellpadding=2 cellspacing=2 noborder> <tr><td bgcolor=black colspan=13><font color=white></td></tr> <tr> <th bgcolor=#303030 colspan=13 align=left><font color=white>A.MKFS=mkfs.reiser4 -y -o create=create_ccreg40,compressMode=col8 MOUNT_OPTIONS=noatime FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=13 align=left><font color=white>B.MKFS=mkfs.reiser4 -y MOUNT_OPTIONS=noatime FSTYPE=reiser4 (unixfile regular file plugin)</font></th> </tr> <tr> <th bgcolor=#303030 colspan=13 align=left><font color=white>C.MOUNT_OPTIONS=noatime,data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <td colspan=13 bgcolor=#606060><b><font color=white>#0:</font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=3><b>REAL_TIME</b></td> <td colspan=3><b>CPU_TIME</b></td> <td colspan=3><b>CPU_UTIL</b></td> <td colspan=3><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>CREATE</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 53.36</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.234 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 4.249 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>28.79</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.493</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 
align=right><tt>94.36</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.255 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.155</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 775856</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.550 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.825 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>COPY</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 137.6</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.543 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.931 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>40.91</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.716</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.975 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>59.94</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.257 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.183</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1551756</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.550 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.825 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>READ</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 161.17</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.087 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.077 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>48.35</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.433 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.195</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>33.23</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font 
color=green> 0.487 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.291</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1551756</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.550 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.825 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>STATS</b></td> <td bgcolor=#E0E0C0 align=right><tt>24.12</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.936</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.927</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>6.76</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.941 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.624</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>27.97</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.005 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.676</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1551756</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.550 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.825 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>DELETE</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 155.26</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.091 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 0.989</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>38.76</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.824 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.108</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>26.33</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.758 </font></tt></td> <td bgcolor=#E0E0C0 
align=right><tt><font color=green> <U> 0.104</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>4</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.000 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> </tt></td> </tr> <tr> <td colspan=13 bgcolor=#606060><b><font color=white>#1:DD_MBCOUNT=5000 </font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=3><b>REAL_TIME</b></td> <td colspan=3><b>CPU_TIME</b></td> <td colspan=3><b>CPU_UTIL</b></td> <td colspan=3><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_writing_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 116.02</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.430 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.553 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>38.65</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.514</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.619 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>92.86</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.155 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.149</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1909012</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.682 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.685 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_reading_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 153.76</U></tt></td> <td bgcolor=#E0E0C0 
align=right><tt><font color=black> <U> 0.996</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>58.11</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.192 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.147</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>38.73</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.224 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.152</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1909012</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.682 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.685 </font></tt></td> </tt></td> </tr> <tr><td bgcolor=black colspan=13><font color=white></td></tr> <tr><td colspan=13 align=right> <tr> <td colspan=13 bgcolor=#303030><b><font color=white>DIR=/mnt1 GAMMA=0.2 WRITE_BUFFER=131072 PHASE_APPEND=off SYNC=off PHASE_DELETE=rm NPROC=1 DEV=/dev/hda9 DD_MBCOUNT=5000 FILE_SIZE=8192 REP_COUNTER=1 PHASE_COPY=cp INFO_R4=2.6.15-mm4 cryptcompress-4.patch PHASE_READ=find BYTES=1024000000 PHASE_OVERWRITE=off PHASE_MODIFY=off </td></tr> <tr><td colspan=13 align=right> <font size=-2>Produced by <a href=http://namesys.com/benchmarks/mongo_readme.html>Mongo</a> benchmark suite.</font></td></tr> </table> <!-- <p><b>Legend:</b> <font color="green">green</font> color means the result is better (less) than reference value from the first column, results marked as <font color="red">red</font> are worse than reference value, best results are <u>underlined</u> other results which fit into 2% margin of the best result are underlined also.</p> --><p><a href="http://www.namesys.com/intbenchmarks/mongo/06.02.11.belka.crc/charts/comp.html">The same results in the charts</a></p> <hr> <a name="mongo.2.6.11"></a> linux-2.6.11 <a 
href="benchmarks/mongo_readme.html">mongo</a> results <dl> <dt>reiser4 </dt> <dd>reiser4-for-2.6.11-5.patch from <a href="ftp://ftp.namesys.com/pub/reiser4-for-2.6/2.6.11">ftp://ftp.namesys.com/pub/reiser4-for-2.6/2.6.11</a> </dd> <dt>mem total</dt> <dd>254496</dd> <dt>machine </dt> <dd>bones</dd> <dt>kernel </dt> <dd>2.6.11-reiser4-5 #2 SMP Sat Jun 4 20:06:47 MSD 2005</dd> <dt>date </dt> <dd>Fri Jun 17 23:52:17 2005</dd> </dl> <p> In this test 81% of files are chosen from the 0-10k size range and 19% from the 10-100k size range. </p> <!-- File stats: Units are decimal (1k = 1000) files 0-100 : 1433 files 100-1K : 12597 files 1K-10K : 103101 files 10K-100K : 28131 files 100K-1M : 0 files 1M-10M : 0 files 10M-larger : 0 total bytes written : 1886585039 --> <p>Legend:</p> <ul> <li><tt>A</tt> reiser4</li> <li><tt>B</tt> reiserfs <tt>v3 (notail)</tt></li> <li><tt>C</tt> ext2</li> <li><tt>D</tt> xfs default</li> </ul> <p> The table presents absolute values (elapsed time, CPU time, CPU utilization, disk usage) for reiser4, and ratios against reiser4 for all other configurations. A <font color=red>red</font> number means the ratio is larger than <tt>1.0</tt>, i.e. the other filesystem did worse and reiser4 wins that test; a <font color=green>green</font> number means reiser4 loses.
</p> <table cols=17 cellpadding=2 cellspacing=2 noborder> <tr><td bgcolor=black colspan=17><font color=white></td></tr> <tr> <th bgcolor=#303030 colspan=17 align=left><font color=white>A.FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=17 align=left><font color=white>B.FSTYPE=reiserfs MOUNT_OPTIONS=notail </font></th> </tr> <tr> <th bgcolor=#303030 colspan=17 align=left><font color=white>C.FSTYPE=ext2 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=17 align=left><font color=white>D.MKFS=mkfs.xfs -f FSTYPE=xfs </font></th> </tr> <tr> <td colspan=17 bgcolor=#606060><b><font color=white>#0:</font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=4><b>REAL_TIME</b></td> <td colspan=4><b>CPU_TIME</b></td> <td colspan=4><b>CPU_UTIL</b></td> <td colspan=4><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>CREATE</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 66.12</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.022 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.686 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 4.288 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>34.98</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.901</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.114 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.445 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>29.86</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.424 </font></tt></td> <td bgcolor=#E0E0C0 
align=right><tt><font color=green> <U> 0.398</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.398</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1623204</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.086 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.098 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>COPY</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 187.77</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.438 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.751 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.733 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>44.8</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.883</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.124 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.161 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>14.85</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.606 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.611 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.353</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 3245428</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.087 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.098 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>READ</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 151.01</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.459 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.113 </font></tt></td> 
<td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.978 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>44.34</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.607 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.470</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.535 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>18.54</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.444</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.500 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.724 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 3245428</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.087 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.098 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>STATS</b></td> <td bgcolor=#E0E0C0 align=right><tt>22.04</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.314 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.812</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.871 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>8.61</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.698 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.571</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 4.591 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>20.11</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.528</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.709 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.579 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 
align=right><tt><U> 3245428</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.087 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.098 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>DELETE</b></td> <td bgcolor=#E0E0C0 align=right><tt>108.77</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.313</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.193 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.071 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>41</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.637 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.091</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.795 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>21.45</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.795 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.077</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.556 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>4</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 14877.000 </font></tt></td> </tt></td> </tr> <tr> <td colspan=17 bgcolor=#606060><b><font color=white>#1:DD_MBCOUNT=5000 </font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=4><b>REAL_TIME</b></td> <td colspan=4><b>CPU_TIME</b></td> <td colspan=4><b>CPU_UTIL</b></td> <td colspan=4><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td> 
<td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_writing_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 536.06</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.005 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.017 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 0.982</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>122.28</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.826 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.819</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.806</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>14.99</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.771 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.711</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.742 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 5120008</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.012</U> </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_reading_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt>145.32</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.031 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.965</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 0.982</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 
align=right><tt>157.51</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.947 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.890</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.880</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>57.01</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.901</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.909 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.884</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 5120008</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.012</U> </font></tt></td> </tt></td> </tr> <tr><td bgcolor=black colspan=17><font color=white></td></tr> <tr><td colspan=17 align=right> <tr> <td colspan=17 bgcolor=#303030><b><font color=white>INFO_R4=2.6.11 + reiser4-5 REP_COUNTER=1 DEV=/dev/hda5 DD_MBCOUNT=5000 PHASE_OVERWRITE=off FILE_SIZE=8192 NPROC=3 PHASE_READ=find PHASE_DELETE=rm PHASE_APPEND=off WRITE_BUFFER=131072 DIR=/mnt1 PHASE_MODIFY=off BYTES=1024000000 PHASE_COPY=cp GAMMA=0.2 SYNC=off </td></tr> <tr><td colspan=17 align=right> <font size=-2>Produced by <a href=http://namesys.com/benchmarks/mongo_readme.html>Mongo</a> benchmark suite.</font></td></tr> </table> <hr> <a name="mongo.2.6.8.1-mm3"></a> linux-2.6.8.1-mm3 <a href="benchmarks/mongo_readme.html">mongo</a> results <dl> <dt>reiser4 </dt> <dd>large key</dd> <dt>mem total</dt> <dd>254324</dd> <dt>machine </dt> <dd>bones</dd> <dt>kernel </dt> <dd>2.6.8.1-mm3 #3 SMP Mon Aug 23 19:33:13 MSD 2004</dd> <dt>date </dt> <dd>Tue Aug 31 15:47:51 2004</dd> </dl> <p> In this test 81% of files are chosen from the 0-10k size range and 19% from the 10-100k size range. 
</p> <!-- File stats: Units are decimal (1k = 1000) files 0-100 : 1433 files 100-1K : 12597 files 1K-10K : 103101 files 10K-100K : 28131 files 100K-1M : 0 files 1M-10M : 0 files 10M-larger : 0 total bytes written : 1886585039 --> <p>Legend:</p> <ul> <li><tt>A</tt> reiser4</li> <li><tt>B</tt> reiser4, extents only</li> <li><tt>C</tt> reiserfs <tt>v3 (notail)</tt></li> <li><tt>D</tt> ext3 in <tt>data=writeback</tt> mode (meta-data only journalling)</li> <li><tt>E</tt> ext3 in <tt>data=journal</tt> mode</li> <li><tt>F</tt> ext3 in <tt>data=ordered</tt> mode</li> </ul> <img src="http://www.namesys.com/intbenchmarks/mongo/04.08.26/256MB.RAM/one-thread-8k.g02.charts/CREATE.0.png"> <img src="http://www.namesys.com/intbenchmarks/mongo/04.08.26/256MB.RAM/one-thread-8k.g02.charts/COPY.0.png"> <img src="http://www.namesys.com/intbenchmarks/mongo/04.08.26/256MB.RAM/one-thread-8k.g02.charts/READ.0.png"> <img src="http://www.namesys.com/intbenchmarks/mongo/04.08.26/256MB.RAM/one-thread-8k.g02.charts/STATS.0.png"> <img src="http://www.namesys.com/intbenchmarks/mongo/04.08.26/256MB.RAM/one-thread-8k.g02.charts/DELETE.0.png"> <img src="http://www.namesys.com/intbenchmarks/mongo/04.08.26/256MB.RAM/one-thread-8k.g02.charts/dd_writing_largefile.1.png"> <img src="http://www.namesys.com/intbenchmarks/mongo/04.08.26/256MB.RAM/one-thread-8k.g02.charts/dd_reading_largefile.1.png"> <p> The table presents absolute values (elapsed time, CPU time, CPU utilization, disk usage) for reiser4, and ratios against reiser4 for all other configurations. A <font color=red>red</font> number means the ratio is larger than <tt>1.0</tt>, i.e. the other filesystem did worse and reiser4 wins that test; a <font color=green>green</font> number means reiser4 loses.
</p> <table cols=25 cellpadding=2 cellspacing=2 noborder> <tr><td bgcolor=black colspan=25><font color=white></td></tr> <tr> <th bgcolor=#303030 colspan=25 align=left><font color=white>A.FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=25 align=left><font color=white>B.FSTYPE=reiser4 MKFS=mkfs.reiser4 -q -o extent=extent40 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=25 align=left><font color=white>C.MOUNT_OPTIONS=notail FSTYPE=reiserfs </font></th> </tr> <tr> <th bgcolor=#303030 colspan=25 align=left><font color=white>D.MOUNT_OPTIONS="data=writeback" FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=25 align=left><font color=white>E.MOUNT_OPTIONS="data=journal" FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=25 align=left><font color=white>F.MOUNT_OPTIONS="data=ordered" FSTYPE=ext3 </font></th> </tr> <tr> <td colspan=25 bgcolor=#606060><b><font color=white>#0:</font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=6><b>REAL_TIME</b></td> <td colspan=6><b>CPU_TIME</b></td> <td colspan=6><b>CPU_UTIL</b></td> <td colspan=6><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>CREATE</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 91.6</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 0.988</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.983 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.592 
</font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.010 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.256 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>31.13</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.965 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.826</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.577 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.529 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.802 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>22.63</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.981 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.350</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.791 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.738 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.000 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1978440</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.088 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>COPY</b></td> <td bgcolor=#E0E0C0 align=right><tt>219.5</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.968</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.674 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.241 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.105 </font></tt></td> <td 
bgcolor=#E0E0C0 align=right><tt><font color=red> 1.819 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>54.04</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.938 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.792</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.694 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.004 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.860 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>16.01</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.996 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.460</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.663 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.839 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.890 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 3956708</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.088 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>READ</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 187.34</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.007</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.617 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.282 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.295 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.250 </font></tt></td> </tt></td> <td 
bgcolor=#E0E0C0 align=right><tt>38.61</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.002 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.711 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.615</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.622</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.615</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>13.05</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.995 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.441</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.520 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.517 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.533 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 3956708</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.088 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>STATS</b></td> <td bgcolor=#E0E0C0 align=right><tt>23.71</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.968 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.162 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.943</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.943</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.943</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>10.91</tt></td> <td 
bgcolor=#E0E0C0 align=right><tt><font color=green> 0.944 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.717 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.661</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.674 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.658</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>24.46</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.971 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.587</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.700 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.707 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.697 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 3956708</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.088 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>DELETE</b></td> <td bgcolor=#E0E0C0 align=right><tt>156.84</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.993 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.233</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.264 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.270 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.216 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>53.05</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.938 </font></tt></td> <td 
bgcolor=#E0E0C0 align=right><tt><font color=green> 0.440 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.209</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.215 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.214 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>18.23</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.947 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.758 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.157</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.160 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.167 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>4</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.000 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> </tt></td> </tr> <tr> <td colspan=25 bgcolor=#606060><b><font color=white>#1:DD_MBCOUNT=768 </font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=6><b>REAL_TIME</b></td> <td colspan=6><b>CPU_TIME</b></td> <td colspan=6><b>CPU_UTIL</b></td> <td colspan=6><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A 
</b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_writing_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 30.09</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.006</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.286 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.342 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.473 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.311 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>5.24</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.996 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.966</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.286 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.393 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.437 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>11.43</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.994 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.631</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.796 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.655 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.967 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 786436</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font 
color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_reading_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt>28.38</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.969</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.010 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 0.980</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 0.982</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.999 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>4.37</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.979 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.014 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.911</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.895</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.936 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>8.88</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.030 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.922 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.858</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.854</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.867</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 786436</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 
align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tr> <tr><td bgcolor=black colspan=25><font color=white></font></td></tr> <tr> <td colspan=25 bgcolor=#303030><b><font color=white>REP_COUNTER=1 PHASE_COPY=cp INFO_R4=2.6.8.1-mm3 + parse_options.patch FILE_SIZE=8192 DEV=/dev/hda6 PHASE_MODIFY=off DD_MBCOUNT=768 PHASE_APPEND=off PHASE_OVERWRITE=off SYNC=off DIR=/mnt1 PHASE_DELETE=rm NPROC=1 BYTES=1024000000 GAMMA=0.2 PHASE_READ=find WRITE_BUFFER=131072 </font></b></td></tr> <tr><td colspan=25 align=right> <font size=-2>Produced by <a href=http://namesys.com/>Mongo</a> benchmark suite.</font></td></tr> </table> <hr> <a name="slow.2004.03.26">2004.03.26 slow.c benchmark results</a> <p> These are <a href="http://www.jburgess.uklinux.net/slow.c">slow.c</a> benchmark results for the latest 2004.03.26 reiser4 snapshot. </p> <p> <b>slow.c</b> is a simple program by Jon Burgess which writes and reads multiple data streams. For details and the source code, see <a href="http://marc.theaimsgroup.com/?l=linux-kernel&m=107652683608384&w=2">the discussion</a> on the linux-kernel mailing list.
</p> <p> kernel : 2.6.5-rc2</p> <p> RAM : 256Mb</p> <p> reiser4 : <a href="http://www.namesys.com/snapshots/2004.03.26/">2004.03.26 snapshot</a></p> <p>Hardware specs:</p> <pre> processor : 1 vendor_id : AuthenticAMD cpu family : 6 model : 6 model name : AMD Athlon(tm) Processor stepping : 2 cpu MHz : 1460.098 cache size : 256 KB bogomips : 2916.35 Dual CPU AMD Athlon(tm) 1.4Ghz </pre> <pre> # hdparm /dev/hda6: multcount = 16 (on) IO_support = 1 (32-bit) unmaskirq = 1 (on) using_dma = 1 (on) keepsettings = 0 (off) readonly = 0 (off) readahead = 256 (on) geometry = 65535/16/63, sectors = 35937342, start = 84164598 </pre> <pre> # hdparm -t /dev/hda6 /dev/hda6: Timing buffered disk reads: 84 MB in 3.07 seconds = 27.39 MB/sec </pre> <pre> # hdparm -i /dev/hda /dev/hda: Model=IC35L060AVER07-0, FwRev=ER6OA44A, SerialNo=SZPTZMB6154 Config={ HardSect NotMFM HdSw>15uSec Fixed DTR>10Mbs } RawCHS=16383/16/63, TrkSize=0, SectSize=0, ECCbytes=40 BuffType=DualPortCache, BuffSize=1916kB, MaxMultSect=16, MultSect=16 CurCHS=16383/16/63, CurSects=16514064, LBA=yes, LBAsects=120103200 IORDY=on/off, tPIO={min:240,w/IORDY:120}, tDMA={min:120,rec:120} PIO modes: pio0 pio1 pio2 pio3 pio4 DMA modes: mdma0 mdma1 mdma2 UDMA modes: udma0 udma1 udma2 AdvancedPM=yes: disabled (255) WriteCache=enabled Drive conforms to: ATA/ATAPI-5 T13 1321D revision 1: * signifies the current active mode </pre> <pre> <!-- (500Mb of data) test : ./slow foo 500 Results : ============================================================== | 1 stream | 2 streams --------------+----------------------------------------------- | WRITE READ | WRITE READ --------------+----------------------------------------------- ext2 25.08Mb/s 27.08Mb/s 13.72Mb/s 14.04Mb/s reiser4 26.31Mb/s 26.99Mb/s 24.03Mb/s 26.84Mb/s reiser4-extents 25.28Mb/s 27.40Mb/s 24.12Mb/s 26.85Mb/s ext3-ordered 20.99Mb/s 26.40Mb/s 12.01Mb/s 13.34Mb/s ext3-journal 10.13Mb/s 24.48Mb/s 8.87Mb/s 13.26Mb/s reiserfs 20.42Mb/s 27.67Mb/s 12.98Mb/s 13.13Mb/s 
reiserfs-notail 20.07Mb/s 27.58Mb/s 13.04Mb/s 13.25Mb/s ============================================================== --> (1000Mb of data) test : ./slow foo 1000 Results : <!-- ============================================================================================================== | 1 stream | 2 streams | 4 streams | 8 stream --------------+----------------------------------------------------------------------------------------------- | WRITE READ | WRITE READ | WRITE READ | WRITE READ --------------+----------------------------------------------------------------------------------------------- ext2 24.66Mb/s 27.56Mb/s 13.40Mb/s 13.67Mb/s 7.73Mb/s 6.94Mb/s 6.69Mb/s 3.52Mb/s reiser4 25.42Mb/s 27.71Mb/s 23.96Mb/s 26.34Mb/s 24.55Mb/s 26.58Mb/s 24.90Mb/s 26.76Mb/s reiser4-extents 25.60Mb/s 27.68Mb/s 24.19Mb/s 25.92Mb/s 25.24Mb/s 27.12Mb/s 25.39Mb/s 26.72Mb/s ext3-ordered 20.05Mb/s 26.46Mb/s 11.06Mb/s 13.12Mb/s 9.63Mb/s 6.76Mb/s 10.02Mb/s 3.48Mb/s ext3-journal 10.10Mb/s 26.81Mb/s 8.87Mb/s 13.08Mb/s 8.59Mb/s 6.84Mb/s 8.14Mb/s 3.47Mb/s reiserfs 20.19Mb/s 27.48Mb/s 12.69Mb/s 13.03Mb/s 8.27Mb/s 6.84Mb/s 7.87Mb/s 4.13Mb/s reiserfs-notail 20.31Mb/s 27.10Mb/s 12.74Mb/s 13.09Mb/s 8.33Mb/s 6.89Mb/s 7.87Mb/s 4.17Mb/s ============================================================================================================= --> </pre> <table> <tr> <td><img src="intbenchmarks/slow/04.03.25-int.snapshot.bones/wr.1.png"></td> <td><img src="intbenchmarks/slow/04.03.25-int.snapshot.bones/wr.2.png"></td> <td><img src="intbenchmarks/slow/04.03.25-int.snapshot.bones/wr.4.png"></td> <td><img src="intbenchmarks/slow/04.03.25-int.snapshot.bones/wr.8.png"></td> </tr> <tr> <td><img src="intbenchmarks/slow/04.03.25-int.snapshot.bones/rd.1.png"></td> <td><img src="intbenchmarks/slow/04.03.25-int.snapshot.bones/rd.2.png"></td> <td><img src="intbenchmarks/slow/04.03.25-int.snapshot.bones/rd.4.png"></td> <td><img src="intbenchmarks/slow/04.03.25-int.snapshot.bones/rd.8.png"></td> </tr> 
</table> <hr> <a name="mongo.2003.11.20"></a>2003.11.20 <a href="benchmarks/mongo_readme.html">mongo</a> results <dl> <dt>reiser4 </dt> <dd>''</dd> <dt>mem total</dt> <dd>255716</dd> <dt>machine </dt> <dd>belka</dd> <dt>kernel </dt> <dd>2.6.0-test9 #2 SMP Thu Nov 20 16:08:42 MSK 2003</dd> <dt>date </dt> <dd>Thu Nov 20 16:16:50 2003</dd> </dl> <p> In this test 80% of files are chosen from the 0-8k size range, 16% from the 0-80k size range, 3.2% (0.8 x 4%) from the 0-800k size range, and so on. Most files are small, but most bytes are in large files. </p> <p>Legend:</p> <ul> <li><tt>A</tt> reiser4</li> <li><tt>B</tt> reiser4, extents only</li> <li><tt>C</tt> reiserfs <tt>v3</tt></li> <li><tt>D</tt> ext3 in <tt>data=writeback</tt> mode (meta-data only journalling)</li> <li><tt>E</tt> ext3 in <tt>data=journal</tt> mode</li> <li><tt>F</tt> ext3 in <tt>data=ordered</tt> mode</li> <li><tt>G</tt> ext3 with htree (hashed directories)</li> </ul> <p> The table presents absolute values (of elapsed time, CPU usage, and disk usage) for reiser4, and ratios against reiser4 for all other configurations. A <font color=red>red</font> number means the ratio is larger than <tt>1.0</tt>, that is, reiser4 is better in this test. A <font color=green>green</font> number means that reiser4 loses in this test.
</p> <table cols=22 cellpadding=2 cellspacing=2 noborder> <tr><td bgcolor=black colspan=22><font color=white></td></tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>A.INFO_R4='' FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>B.INFO_R4='' MKFS=mkfs.reiser4 -q -o policy=extents FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>C.FSTYPE=reiserfs </font></th> </tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>D.MOUNT_OPTIONS=data=writeback FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>E.MOUNT_OPTIONS=data=journal FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>F.MOUNT_OPTIONS=data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>G.MKFS=mkfs.ext3 -O dir_index MOUNT_OPTIONS=data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <td colspan=22 bgcolor=#606060><b><font color=white>#0:</font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=7><b>REAL_TIME</b></td> <td colspan=7><b>CPU_TIME</b></td> <td colspan=7><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>CREATE</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 21.81</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.171 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.983 </font></tt></td> <td bgcolor=#E0E0C0 
align=right><tt><font color=red> 3.253 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.702 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.161 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.212 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>6.38</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.130 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.020 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.461 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.461 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.354 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.851</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 607612</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.091 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.035 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>COPY</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 64.37</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.089 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.046 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.980 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.834 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.929 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 6.246 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>11.55</tt></td> 
<td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.047 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.797 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.590 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.725 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.542 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.698</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1214992</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.091 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.034 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>READ</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 45.38</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.026 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.406 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.248 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.307 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.232 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 7.192 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>10.13</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.934 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.517 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.454 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.453</U> </font></tt></td> <td bgcolor=#E0E0C0 
align=right><tt><font color=green> <U> 0.444</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.504 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1214992</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.091 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.034 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>STATS</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 5.74</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.030 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.413 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.014</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.033 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.021 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.634 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>2.34</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.000 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.936 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.761 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.791 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.774 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.744</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1214992</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.091 </font></tt></td> <td bgcolor=#E0E0C0 
align=right><tt><font color=red> 1.034 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>DELETE</b></td> <td bgcolor=#E0E0C0 align=right><tt>46.94</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.424</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.520 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.017 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.043 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.956 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.315 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>14.19</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.743 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.443 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.200</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.206 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.201</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.234 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>4</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.000 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td 
bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> </tt></td> </tr> <tr> <td colspan=22 bgcolor=#606060><b><font color=white>#1:DD_MBCOUNT=768 </font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=7><b>REAL_TIME</b></td> <td colspan=7><b>CPU_TIME</b></td> <td colspan=7><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_writing_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 29.33</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.026 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.184 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.102 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.499 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.097 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.098 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>2.61</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.008 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.659</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.437 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.054 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.556 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.571 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 786436</U></tt></td> 
<td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_reading_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 22.96</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.056 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.003</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.004</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.003</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.006</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>2.26</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.991 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.912 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.796 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.765</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.779</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.783 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 786436</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 
align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tr> <tr><td bgcolor=black colspan=22><font color=white></font></td></tr> <tr> <td colspan=22 bgcolor=#303030><b><font color=white>NPROC=1 DIR=/mnt/testfs SYNC=off PHASE_COPY=cp REP_COUNTER=1 GAMMA=0.2 PHASE_OVERWRITE=off FILE_SIZE=8192 BYTES=512000000 PHASE_APPEND=off PHASE_READ=find DEV=/dev/hdb3 DD_MBCOUNT=768 WRITE_BUFFER=131072 PHASE_DELETE=rm PHASE_MODIFY=off </font></b></td></tr> <tr><td colspan=22 align=right> <font size=-2>Produced by <a href=http://namesys.com/benchmarks/mongo_readme.html>Mongo</a> benchmark suite.</font></td></tr> </table> <hr> <a name="mongo.2003.09.25"></a>2003.09.25 <a href="benchmarks/mongo_readme.html">mongo</a> results <dl> <dt>reiser4 </dt> <dd>''</dd> <dt>mem total</dt> <dd>255048</dd> <dt>machine </dt> <dd>belka</dd> <dt>kernel </dt> <dd>2.6.0-test5 #33 SMP Thu Sep 25 15:45:38 MSD 2003</dd> <dt>date </dt> <dd>Thu Sep 25 15:57:38 2003</dd> </dl> <p> In this test 80% of files are chosen from the 0-8k size range, 16% from the 0-80k size range, 3.2% (0.8 x 4%) from the 0-800k size range, and so on. Most files are small, but most bytes are in large files. </p> <p>Legend:</p> <ul> <li><tt>A</tt> reiser4</li> <li><tt>B</tt> reiser4, extents only</li> <li><tt>C</tt> reiserfs <tt>v3</tt></li> <li><tt>D</tt> ext3 in <tt>data=writeback</tt> mode (meta-data only journalling)</li> <li><tt>E</tt> ext3 in <tt>data=journal</tt> mode</li> <li><tt>F</tt> ext3 in <tt>data=ordered</tt> mode</li> <li><tt>G</tt> ext3 with htree (hashed directories)</li> </ul> <p> The table presents absolute values (of elapsed time, CPU usage, and disk usage) for reiser4, and ratios against reiser4 for all other configurations.
A <font color=red>red</font> number means the ratio is larger than <tt>1.0</tt>, that is, reiser4 is better in this test. A <font color=green>green</font> number means that reiser4 loses in this test. </p> <table cols=22 cellpadding=2 cellspacing=2 noborder> <tr><td bgcolor=black colspan=22><font color=white></td></tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>A.INFO_R4='' FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>B.INFO_R4='' MKFS=mkfs.reiser4 -q -o policy=extents FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>C.FSTYPE=reiserfs </font></th> </tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>D.MOUNT_OPTIONS=data=writeback FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>E.MOUNT_OPTIONS=data=journal FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>F.MOUNT_OPTIONS=data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>G.MKFS=mkfs.ext3 -O dir_index MOUNT_OPTIONS=data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <td colspan=22 bgcolor=#606060><b><font color=white>#0:</font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=7><b>REAL_TIME</b></td> <td colspan=7><b>CPU_TIME</b></td> <td colspan=7><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>CREATE</b></td> <td bgcolor=#E0E0C0 align=right><tt><U>
23.57</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.158 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.714 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.263 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.234 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.020 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.376 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>6.66</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.075 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.947 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.240 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.357 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.264 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.835</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 608548</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.090 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.034 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.105 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.105 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.105 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>COPY</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 64.98</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.083 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.050 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.023 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.810 </font></tt></td> <td bgcolor=#E0E0C0 
align=right><tt><font color=red> 1.908 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 6.850 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>12.18</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.057 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.776 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.507 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.603 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.518 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.743</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1216784</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.090 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.033 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.105 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.105 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.105 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>READ</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 44.65</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.028 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.733 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.237 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.114 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.179 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 7.694 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>10.28</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.933 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.590</U> 
</font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.608 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.593</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.608 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.620 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1216784</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.090 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.033 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.105 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.105 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.105 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>STATS</b></td> <td bgcolor=#E0E0C0 align=right><tt>5.88</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.998 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.139 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.981 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.020 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.929</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.655 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>2.29</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.987 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.900 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.747</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.782 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.747</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 
0.755</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1216784</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.090 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.033 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.105 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.105 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.105 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>DELETE</b></td> <td bgcolor=#E0E0C0 align=right><tt>46.65</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.438</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.504 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.109 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.023 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.022 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.376 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>14.19</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.746 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.431 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.206</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.211 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.211 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.232 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>4</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.000 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> 
</font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> </tt></td> </tr> <tr> <td colspan=22 bgcolor=#606060><b><font color=white>#1:DD_MBCOUNT=768 </font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=7><b>REAL_TIME</b></td> <td colspan=7><b>CPU_TIME</b></td> <td colspan=7><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_writing_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 30.78</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.017</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.177 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.063 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.394 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.066 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.056 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>3.11</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.981 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.553</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.701 </font></tt></td> <td bgcolor=#E0E0C0 
align=right><tt><font color=red> 1.296 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.318 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 786436</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_reading_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 22.96</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.045 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.005</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.005</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.004</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.006</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>2.41</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.996 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.867 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.739 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.718</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.739 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.722</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 786436</U></tt></td> 
<td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr><td bgcolor=black colspan=22><font color=white></td></tr> <tr><td colspan=22 align=right> <tr> <td colspan=22 bgcolor=#303030><b><font color=white>NPROC=1 DIR=/mnt/testfs SYNC=off PHASE_COPY=cp REP_COUNTER=1 GAMMA=0.2 PHASE_OVERWRITE=off FILE_SIZE=8192 BYTES=512000000 PHASE_APPEND=off PHASE_READ=find DEV=/dev/hdb3 DD_MBCOUNT=768 WRITE_BUFFER=131072 PHASE_DELETE=rm PHASE_MODIFY=off </td></tr> <tr><td colspan=22 align=right> <font size=-2>Produced by <a href=http://namesys.com/benchmarks/mongo_readme.html>Mongo</a> benchmark suite.</font></td></tr> </table> <hr> <a name="mongo.2003.08.28"></a>2003.08.28 <a href="benchmarks/mongo_readme.html">mongo</a> results <body text=black> <dl> <dt>reiser4 </dt> <dd>''</dd> <dt>mem total</dt> <dd>256276</dd> <dt>machine </dt> <dd>belka</dd> <dt>kernel </dt> <dd>2.6.0-test4 #194 SMP Thu Aug 28 17:18:47 MSD 2003</dd> <dt>date </dt> <dd>Thu Aug 28 17:20:18 2003</dd> </dl> <p> In this test 80% of files are chosen from the 0-8k size range, 16% from the 0-80k size range, 0.8 x 4% from the 0-800k size range, etc. Most files are small, most bytes are in large files. 
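The size distribution described above is geometric in decades, matching the GAMMA=0.2 run parameter shown below each table: each successive tenfold size range receives 0.2 times the previous range's share of files. A minimal sketch of such a sampler (hypothetical illustration, not code from the Mongo suite):

```python
import random

def sample_file_size(base=8192, gamma=0.2, max_decades=6):
    """Pick a file size per the distribution in the text:
    P(decade k) = (1 - gamma) * gamma**k, i.e. 80% from 0-8k,
    16% from 0-80k, 3.2% (0.8 x 4%) from 0-800k, and so on,
    with the size uniform within the chosen range."""
    r = random.random()
    share = 1.0 - gamma          # 0.8: the first decade's share
    k = 0
    while r >= share and k < max_decades:
        r -= share
        share *= gamma           # each decade gets gamma times the previous share
        k += 1
    return random.randint(1, base * 10 ** k)
```

With gamma=0.2, roughly 80% of sampled sizes fall at or below the base size, yet the rare large files contribute most of the total bytes, which is exactly the "most files are small, most bytes are in large files" property the benchmark aims for.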
</p> <p>Legend:</p> <ul> <li><tt>A</tt> reiser4</li> <li><tt>B</tt> reiser4, extents only</li> <li><tt>C</tt> reiserfs</li> <li><tt>D</tt> ext3 in <tt>data=writeback</tt> mode (meta-data only journalling)</li> <li><tt>E</tt> ext3 in <tt>data=journal</tt> mode</li> <li><tt>F</tt> ext3 in <tt>data=ordered</tt> mode</li> <li><tt>G</tt> ext3 with htree (hashed directories)</li> </ul> <p> The table presents absolute values (of elapsed time, CPU usage, and disk usage) for reiser4, and ratios against reiser4 for all other configurations. A <font color=red>red</font> number means the ratio is larger than <tt>1.0</tt>, that is, reiser4 is better in this test. A <font color=green>green</font> number means that reiser4 loses in this test. </p> <table cols=22 cellpadding=2 cellspacing=2 noborder> <tr><td bgcolor=black colspan=22><font color=white></td></tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>A.INFO_R4='' FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>B.INFO_R4='' MKFS=mkfs.reiser4 -q -o policy=extents FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>C.FSTYPE=reiserfs </font></th> </tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>D.MOUNT_OPTIONS=data=writeback FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>E.MOUNT_OPTIONS=data=journal FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>F.MOUNT_OPTIONS=data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>G.MKFS=mkfs.ext3 -O dir_index MOUNT_OPTIONS=data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <td colspan=22 bgcolor=#606060><b><font color=white>#0:</font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=7><b>REAL_TIME</b></td> <td colspan=7><b>CPU_TIME</b></td> <td colspan=7><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td>
<td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>CREATE</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 21.94</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.056 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.957 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.049 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.430 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.399 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.558 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>6.7</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.104 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.913 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.213 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.334 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.345 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.821</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 608452</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.091 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.034 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.105 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.105 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.105 </font></tt></td> <td bgcolor=#E0E0C0 
align=right><tt><font color=red> 1.106 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>COPY</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 64.05</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.078 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.112 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.964 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.703 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.022 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 7.356 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>11.37</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.039 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.819 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.538 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.692 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.568 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.708</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1216572</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.091 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.033 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>READ</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 52.53</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.072 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.882 </font></tt></td> <td 
bgcolor=#E0E0C0 align=right><tt><font color=red> 1.056 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.126 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.124 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 7.158 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>9.8</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.914 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.538 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.489 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.467 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.456</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.551 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1216572</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.091 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.033 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>STATS</b></td> <td bgcolor=#E0E0C0 align=right><tt>5.82</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.973</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.251 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.040 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.009 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.048 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.641 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 
align=right><tt>2.29</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.991 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.926 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.755 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.742</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.751 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.734</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1216572</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.091 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.033 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>DELETE</b></td> <td bgcolor=#E0E0C0 align=right><tt>46.96</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.409</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.491 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.949 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.988 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.987 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.382 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>13.89</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.734 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.453 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.210 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font 
color=green> <U> 0.204</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.202</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.238 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>4</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.000 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> </tt></td> </tr> <tr> <td colspan=22 bgcolor=#606060><b><font color=white>#1:DD_MBCOUNT=768 </font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=7><b>REAL_TIME</b></td> <td colspan=7><b>CPU_TIME</b></td> <td colspan=7><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_writing_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 26.1</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.006</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.205 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.066 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.353 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 
1.068 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.070 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>3.18</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.028 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.547</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.173 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.708 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.327 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.296 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 786436</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_reading_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 18.99</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.009</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.072 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.009</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.007</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.006</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.008</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>2.12</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.000 
</font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.925 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.877 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.844 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.830 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.811</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 786436</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr><td bgcolor=black colspan=22><font color=white></td></tr> <tr><td colspan=22 align=right> <tr> <td colspan=22 bgcolor=#303030><b><font color=white>NPROC=1 DIR=/mnt/testfs SYNC=off PHASE_COPY=cp REP_COUNTER=1 GAMMA=0.2 PHASE_OVERWRITE=off FILE_SIZE=8192 BYTES=512000000 PHASE_APPEND=off PHASE_READ=find DEV=/dev/hdb3 DD_MBCOUNT=768 WRITE_BUFFER=131072 PHASE_DELETE=rm PHASE_MODIFY=off </td></tr> <tr><td colspan=22 align=right> <font size=-2>Produced by <a href=http://namesys.com/benchmarks/mongo_readme.html>Mongo</a> benchmark suite.</font></td></tr> </table> <hr> <a name="mongo.2003.08.27"></a>2003.08.27 <a href="benchmarks/mongo_readme.html">mongo</a> results <dl> <dt>reiser4 </dt> <dd>''</dd> <dt>mem total</dt> <dd>256276</dd> <dt>machine </dt> <dd>belka</dd> <dt>kernel </dt> <dd>2.6.0-test4 #189 SMP Wed Aug 27 20:36:51 MSD 2003</dd> <dt>date </dt> <dd>Wed Aug 27 20:44:02 2003</dd> </dl> <p> In this test 80% of files are chosen from the 0-8k size 
range, 16% from the 0-80k size range, 0.8 x 4% from the 0-800k size range, etc. Most files are small, most bytes are in large files. </p> <p>Legend:</p> <ul> <li><tt>A</tt> reiser4</li> <li><tt>B</tt> reiser4, extents only</li> <li><tt>C</tt> ext3 in <tt>data=writeback</tt> mode (meta-data only journalling)</li> <li><tt>D</tt> ext3 in <tt>data=journal</tt> mode</li> <li><tt>E</tt> ext3 in <tt>data=ordered</tt> mode</li> <li><tt>F</tt> ext3 with htree (hashed directories)</li> </ul> <p> The table presents absolute values (of elapsed time, CPU usage, and disk usage) for reiser4, and ratios against reiser4 for all other configurations. A <font color=red>red</font> number means the ratio is larger than <tt>1.0</tt>, that is, reiser4 is better in this test. A <font color=green>green</font> number means that reiser4 loses in this test. </p> <table cols=19 cellpadding=2 cellspacing=2 noborder> <tr><td bgcolor=black colspan=19><font color=white></td></tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>A.INFO_R4='' FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>B.INFO_R4='' MKFS=mkfs.reiser4 -q -o policy=extents FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>C.MOUNT_OPTIONS=data=writeback FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>D.MOUNT_OPTIONS=data=journal FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>E.MOUNT_OPTIONS=data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>F.MKFS=mkfs.ext3 -O dir_index MOUNT_OPTIONS=data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <td colspan=19 bgcolor=#606060><b><font color=white>#0:</font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=6><b>REAL_TIME</b></td> <td colspan=6><b>CPU_TIME</b></td> <td colspan=6><b>DF</b></td> </tr> <tr align=center
bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>CREATE</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 22.41</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.673 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.325 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.975 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.213 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>7.66</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.069 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.347 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.415 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.410 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.708</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 635264</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.096 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.110 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.110 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.110 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.111 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>COPY</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 90.92</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.099 </font></tt></td> <td 
bgcolor=#E0E0C0 align=right><tt><font color=red> 1.471 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.221 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.470 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 4.989 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>12.14</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.068 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.066 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.241 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.094 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.668</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1269840</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.096 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.110 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.110 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.110 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.112 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>READ</b></td> <td bgcolor=#E0E0C0 align=right><tt>82.21</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.063 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.861 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.852 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.791</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 4.417 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>10.57</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.914 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.400</U> </font></tt></td> <td bgcolor=#E0E0C0 
align=right><tt><font color=green> 0.428 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.402</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.534 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1269840</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.096 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.110 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.110 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.110 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.112 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>STATS</b></td> <td bgcolor=#E0E0C0 align=right><tt>8.52</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.993 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.822</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.816</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.811</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.335 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>2.96</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.997 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.561</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.564</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.584 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.608 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1269840</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.096 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.110 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.110 </font></tt></td> <td 
bgcolor=#E0E0C0 align=right><tt><font color=red> 1.110 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.112 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>DELETE</b></td> <td bgcolor=#E0E0C0 align=right><tt>69.69</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.301</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.749 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.717 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.659 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.912 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>14.73</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.703 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.208</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.207</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.213 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.237 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>4</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.000 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> </tt></td> </tr> <tr> <td colspan=19 bgcolor=#606060><b><font color=white>#1:DD_MBCOUNT=768 </font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=6><b>REAL_TIME</b></td> <td colspan=6><b>CPU_TIME</b></td> <td colspan=6><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A 
</b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_writing_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 25.85</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.092 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.335 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.085 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.095 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 3.27</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 0.982</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.159 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.648 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.251 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.254 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 786436</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_reading_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 19</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 0.999</U> 
</font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.005</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.007</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.007</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.007</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>2.18</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.963 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.807 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.803</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.789</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.803</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 786436</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr><td bgcolor=black colspan=19><font color=white></td></tr> <tr><td colspan=19 align=right> <tr> <td colspan=19 bgcolor=#303030><b><font color=white>NPROC=1 DIR=/mnt/testfs SYNC=off PHASE_COPY=cp REP_COUNTER=1 GAMMA=0.2 PHASE_OVERWRITE=off FILE_SIZE=8000 BYTES=512000000 PHASE_APPEND=off PHASE_READ=find DEV=/dev/hdb3 DD_MBCOUNT=768 WRITE_BUFFER=131072 PHASE_DELETE=rm PHASE_MODIFY=off </td></tr> <tr><td colspan=19 align=right> <font size=-2>Produced by <a href=http://namesys.com/benchmarks/mongo_readme.html>Mongo</a> benchmark suite.</font></td></tr> </table> <hr> <p> This is the same test as above, but with base file 
size 4k, that is, in this test 80% of files are chosen from the 0-4k size range, 16% from the 0-40k size range, 0.8 x 4% from the 0-400k size range, etc. </p> <hr> <dl> <dt>reiser4 </dt> <dd>''</dd> <dt>mem total</dt> <dd>255580</dd> <dt>machine </dt> <dd>belka</dd> <dt>kernel </dt> <dd>2.6.0-test4 #176 SMP Tue Aug 26 19:09:38 MSD 2003</dd> <dt>date </dt> <dd>Wed Aug 27 12:41:54 2003</dd> </dl> <table cols=19 cellpadding=2 cellspacing=2 noborder> <tr><td bgcolor=black colspan=19><font color=white></td></tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>A.INFO_R4='' FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>B.INFO_R4='' MKFS=mkfs.reiser4 -q -o policy=extents FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>C.MOUNT_OPTIONS=data=writeback FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>D.MOUNT_OPTIONS=data=journal FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>E.MOUNT_OPTIONS=data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>F.MKFS=mkfs.ext3 -O dir_index MOUNT_OPTIONS=data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <td colspan=19 bgcolor=#606060><b><font color=white>#0:</font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=6><b>REAL_TIME</b></td> <td colspan=6><b>CPU_TIME</b></td> <td colspan=6><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>CREATE</b></td> <td 
bgcolor=#E0E0C0 align=right><tt><U> 33.86</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.223 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.305 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.895 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.549 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.298 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>14.11</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.118 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.967 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.046 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.045 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.647</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 789424</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.208 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.181 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>COPY</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 119.68</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.228 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.237 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.397 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.277 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 7.061 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>23.05</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 
</font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.484 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.683 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.515 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.691</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1578216</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.208 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.182 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>READ</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 118.5</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.217 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.041 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.065 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.020</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 6.585 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>19.84</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.993 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.436</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.446 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.431</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.540 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1578216</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.208 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 
</font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.182 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>STATS</b></td> <td bgcolor=#E0E0C0 align=right><tt>24.69</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.951 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.677</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.696 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.677</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.151 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>7.75</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.008 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.590</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.582</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.583</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.645 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1578216</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.208 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.182 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>DELETE</b></td> <td bgcolor=#E0E0C0 align=right><tt>114.49</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.438 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.174</U> </font></tt></td> <td 
bgcolor=#E0E0C0 align=right><tt><font color=green> 0.188 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.177 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.257 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>32.64</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.790 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.193</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.199 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.194</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.223 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>4</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.000 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> </tt></td> </tr> <tr> <td colspan=19 bgcolor=#606060><b><font color=white>#1:DD_MBCOUNT=768 </font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=6><b>REAL_TIME</b></td> <td colspan=6><b>CPU_TIME</b></td> <td colspan=6><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_writing_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 26.24</U></tt></td> <td bgcolor=#E0E0C0 
align=right><tt><font color=black> <U> 1.002</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.066 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.311 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.056 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.063 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 3.25</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 0.997</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.138 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.622 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.286 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.298 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 786436</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_reading_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 19.04</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 0.994</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.002</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.003</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.002</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>2.08</tt></td> <td bgcolor=#E0E0C0 
align=right><tt><font color=red> 1.038 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.870 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.870 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.870 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.837</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 786436</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr><td bgcolor=black colspan=19><font color=white></td></tr> <tr><td colspan=19 align=right> <tr> <td colspan=19 bgcolor=#303030><b><font color=white>NPROC=1 DIR=/mnt/testfs SYNC=off PHASE_COPY=cp REP_COUNTER=1 GAMMA=0.2 PHASE_OVERWRITE=off FILE_SIZE=4000 BYTES=512000000 PHASE_APPEND=off PHASE_READ=find DEV=/dev/hdb3 DD_MBCOUNT=768 WRITE_BUFFER=131072 PHASE_DELETE=rm PHASE_MODIFY=off </td></tr> <tr><td colspan=19 align=right> <font size=-2>Produced by <a href=http://namesys.com/benchmarks/mongo_readme.html>Mongo</a> benchmark suite.</font></td></tr> </table> <hr> <a name="mongo.2003.08.26"></a>2003.08.26 <a href="benchmarks/mongo_readme.html">mongo</a> results <dl> <dt>reiser4 </dt> <dd>''</dd> <dt>mem total</dt> <dd>904048</dd> <dt>machine </dt> <dd>belka</dd> <dt>kernel </dt> <dd>2.6.0-test4 #176 SMP Tue Aug 26 19:09:38 MSD 2003</dd> <dt>date </dt> <dd>Tue Aug 26 19:34:39 2003</dd> </dl> <p> In this test 80% of files are chosen from the 0-4k size range, 16% from the 0-40k size range, 0.8 x 4% from the 0-400k size range, etc. 
Most files are small, most bytes are in large files. </p> <p>Legend:</p> <ul> <li><tt>A</tt> reiser4</li> <li><tt>B</tt> reiser4, extents only</li> <li><tt>C</tt> ext3 in <tt>data=writeback</tt> mode (meta-data only journalling)</li> <li><tt>D</tt> ext3 in <tt>data=journal</tt> mode</li> <li><tt>E</tt> ext3 in <tt>data=ordered</tt> mode</li> <li><tt>F</tt> ext3 with htree (hashed directories)</li> </ul> <p> The table presents absolute values (of elapsed time, CPU usage, and disk usage) for reiser4, and ratios against reiser4 for all other configurations. A <font color=red>red</font> number means the ratio is larger than <tt>1.0</tt>, that is, reiser4 is better in this test. A <font color=green>green</font> number means that reiser4 loses in this test. </p> <table cols=19 cellpadding=2 cellspacing=2 noborder> <tr><td bgcolor=black colspan=19><font color=white></td></tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>A.INFO_R4='' FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>B.INFO_R4='' MKFS=mkfs.reiser4 -q -o policy=extents FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>C.MOUNT_OPTIONS=data=writeback FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>D.MOUNT_OPTIONS=data=journal FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>E.MOUNT_OPTIONS=data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>F.MKFS=mkfs.ext3 -O dir_index MOUNT_OPTIONS=data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <td colspan=19 bgcolor=#606060><b><font color=white>#0:</font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=6><b>REAL_TIME</b></td> <td colspan=6><b>CPU_TIME</b></td> <td colspan=6><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A
</b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>CREATE</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 27.6</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.311 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.567 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.538 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.668 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.566 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>13.55</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.166 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.035 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.162 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.189 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.670</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 788884</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.208 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.181 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.181 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.181 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.182 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>COPY</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 113.71</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.237 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.167 </font></tt></td> <td 
bgcolor=#E0E0C0 align=right><tt><font color=red> 1.460 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.227 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 7.387 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>23.13</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.169 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.498 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.691 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.591 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.709</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1577560</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.208 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.181 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.181 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.181 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.183 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>READ</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 111.51</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.239 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.157 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.176 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.096 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 7.017 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>20.76</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.042 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.424 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.415</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font 
color=green> <U> 0.416</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.521 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1577560</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.208 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.181 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.181 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.181 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.183 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>STATS</b></td> <td bgcolor=#E0E0C0 align=right><tt>20.22</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.034 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.834</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.827</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.832</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.439 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>7.47</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.009 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.590</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.585</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.584</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.631 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1577560</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.208 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.181 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.181 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.181 </font></tt></td> <td bgcolor=#E0E0C0 
align=right><tt><font color=red> 1.183 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>DELETE</b></td> <td bgcolor=#E0E0C0 align=right><tt>110.98</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.437 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.183</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.180</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.185 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.277 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>33.03</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.838 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.196 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.192</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.193</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.221 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>4</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.000 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> </tt></td> </tr> <tr> <td colspan=19 bgcolor=#606060><b><font color=white>#1:DD_MBCOUNT=768 </font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=6><b>REAL_TIME</b></td> <td colspan=6><b>CPU_TIME</b></td> <td colspan=6><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A 
</b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_writing_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 26.03</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.096 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.340 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.092 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.080 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 3.48</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.011</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.083 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.583 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.187 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.190 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 786436</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_reading_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 19</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 0.995</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> 
</font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 0.999</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 0.999</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>2.28</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.018 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.741 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.737</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.741 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.724</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 786436</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr><td bgcolor=black colspan=19><font color=white></td></tr> <tr><td colspan=19 align=right> <tr> <td colspan=19 bgcolor=#303030><b><font color=white>NPROC=1 DIR=/mnt/testfs SYNC=off PHASE_COPY=cp REP_COUNTER=1 GAMMA=0.2 PHASE_OVERWRITE=off FILE_SIZE=4000 BYTES=512000000 PHASE_APPEND=off PHASE_READ=find DEV=/dev/hdb3 DD_MBCOUNT=768 WRITE_BUFFER=131072 PHASE_DELETE=rm PHASE_MODIFY=off </td></tr> <tr><td colspan=19 align=right> <font size=-2>Produced by <a href=http://namesys.com/benchmarks/mongo_readme.html>Mongo</a> benchmark suite.</font></td></tr> </table> <hr> <a name="mongo.2003.08.18"></a>2003.08.18 <a href="benchmarks/mongo_readme.html">mongo</a> results <dl> <dt>reiser4 </dt> <dd></dd> <dt>mem total</dt> 
<dd>255992</dd> <dt>machine </dt> <dd>belka</dd> <dt>kernel </dt> <dd>2.6.0-test3 #37 SMP Mon Aug 18 18:12:14 MSD 2003</dd> <dt>date </dt> <dd>Mon 18 Aug 2003 20:24:16</dd> </dl> <p> In this test 80% of files are chosen from the 0-8k size range, 16% from the 0-80k size range, 0.8 x 4% from the 0-800k size range, etc. Most files are small, most bytes are in large files. </p> <p>Legend:</p> <ul> <li><tt>A</tt> reiser4</li> <li><tt>B</tt> reiser4, extents only</li> <li><tt>C</tt> ext3 in <tt>data=writeback</tt> mode (meta-data only journalling)</li> <li><tt>D</tt> ext3 in <tt>data=journal</tt> mode</li> <li><tt>E</tt> ext3 in <tt>data=ordered</tt> mode</li> <li><tt>F</tt> ext3 with htree (hashed directories)</li> </ul> <p> The table presents absolute values (of elapsed time, CPU usage, and disk usage) for reiser4, and ratios against reiser4 for all other configurations. A <font color=red>red</font> number means the ratio is larger than <tt>1.0</tt>, that is, reiser4 is better in this test. A <font color=green>green</font> number means that reiser4 loses in this test.
</p> <table cols=19 cellpadding=2 cellspacing=2 noborder> <tr><td bgcolor=black colspan=19><font color=white></td></tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>A.INFO_R4= FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>B.INFO_R4=ext MKFS=mkfs.reiser4 -q -o policy=extents FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>C.MOUNT_OPTIONS=data=writeback FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>D.MOUNT_OPTIONS=data=journal FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>E.MOUNT_OPTIONS=data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>F.MKFS=mkfs.ext3 -O dir_index MOUNT_OPTIONS=data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <td colspan=19 bgcolor=#606060><b><font color=white>#0:</font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=6><b>REAL_TIME</b></td> <td colspan=6><b>CPU_TIME</b></td> <td colspan=6><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>CREATE</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 29.16</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.220 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.422 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.779 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.491 </font></tt></td> <td bgcolor=#E0E0C0 
align=right><tt><font color=red> 1.645 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>13.52</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.182 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.013 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.087 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.997 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.657</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 789364</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.208 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.181 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>COPY</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 119.64</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.211 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.191 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.473 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.230 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 7.288 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>21.98</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.152 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.515 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.746 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.520 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.695</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 
1578116</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.208 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.182 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>READ</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 116.55</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.213 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.177 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.025 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.134 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 6.850 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>18.35</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.035 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.447 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.436</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.431</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.569 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1578116</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.208 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.182 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>STATS</b></td> <td bgcolor=#E0E0C0 align=right><tt>21.65</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font 
color=red> 1.050 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.779</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.811 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.782</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.358 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>7.56</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.001 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.599</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.612 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.611</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.638 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1578116</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.208 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.182 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>DELETE</b></td> <td bgcolor=#E0E0C0 align=right><tt>112.37</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.434 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.179</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.198 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.177</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.281 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>30.62</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.851 </font></tt></td> <td bgcolor=#E0E0C0 
align=right><tt><font color=green> <U> 0.205</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.205</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.203</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.230 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>4</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.000 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> </tt></td> </tr> <tr> <td colspan=19 bgcolor=#606060><b><font color=white>#1:DD_MBCOUNT=768 </font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=6><b>REAL_TIME</b></td> <td colspan=6><b>CPU_TIME</b></td> <td colspan=6><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_writing_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 26.11</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.011</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.090 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.388 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.076 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.083 </font></tt></td> </tt></td> 
<td bgcolor=#E0E0C0 align=right><tt>3.25</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.945</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.092 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.640 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.255 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.231 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 786436</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_reading_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 19.09</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.005</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 0.999</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 0.996</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.004</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.011</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>2.09</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.019 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.847</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.856 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.833</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.842</U> 
</font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 786436</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr><td bgcolor=black colspan=19><font color=white></td></tr> <tr><td colspan=19 align=right> <tr> <td colspan=19 bgcolor=#303030><b><font color=white>NPROC=1 DIR=/mnt/testfs SYNC=off PHASE_COPY=cp REP_COUNTER=1 GAMMA=0.2 PHASE_OVERWRITE=off FILE_SIZE=4000 BYTES=512000000 PHASE_APPEND=off PHASE_READ=find DEV=/dev/hdb3 DD_MBCOUNT=768 WRITE_BUFFER=131072 PHASE_DELETE=rm PHASE_MODIFY=off </td></tr> <tr><td colspan=19 align=right> <font size=-2>Produced by <a href=http://namesys.com/benchmarks/mongo_readme.html>Mongo</a> benchmark suite.</font></td></tr> </table> <hr> <a name="mongo.2003.08.12"></a>2003.08.12 <a href="benchmarks/mongo_readme.html">mongo</a> results <dl> <dt>mem total</dt> <dd>513284</dd> <dt>machine </dt> <dd>strelka</dd> <dt>kernel </dt> <dd>2.6.0-test2 #52 SMP Tue Aug 12 15:17:12 MSD 2003</dd> <dt>date </dt> <dd>Tue Aug 12 15:38:47 2003</dd> </dl> <p> This is a comparison of the latest (2003.08.12) version of reiser4 with ext3. Reiser4 is an atomic filesystem, so the comparison against ext3's data journalling mode is the fairest, but since most users run ext3 in data ordering mode, we compare against that as well. </p> <p> In this test, 80% of files are chosen from the 0-8k size range, 16% from the 0-80k size range, 3.2% (0.8 x 4%) from the 0-800k size range, etc. Most files are small, most bytes are in large files.
</p> <p>Legend:</p> <ul> <li><tt>A</tt> reiser4</li> <li><tt>B</tt> ext3 in <tt>data=writeback</tt> mode (meta-data only journalling)</li> <li><tt>C</tt> ext3 in <tt>data=journal</tt> mode</li> <li><tt>D</tt> ext3 in <tt>data=ordered</tt> mode</li> <li><tt>E</tt> ext3 with htree (hashed directories)</li> <li><tt>F</tt> ext3 with support for filetypes in <tt>readdir()</tt></li> </ul> <p> The table presents absolute values (of elapsed time, CPU usage, and disk usage) for reiser4, and ratios against reiser4 for all other configurations. A <font color=red>red</font> number means the ratio is larger than <tt>1.0</tt>, that is, reiser4 is better in this test. A <font color=green>green</font> number means that reiser4 loses in this test. </p> <table cols=19 cellpadding=2 cellspacing=2 noborder> <tr><td bgcolor=black colspan=19><font color=white></td></tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>A.INFO_R4= MKFS=/usr/local/sbin/mkfs.reiser4 -qf FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>B.MOUNT_OPTIONS=data=writeback FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>C.MOUNT_OPTIONS=data=journal FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>D.MOUNT_OPTIONS=data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>E.MKFS=/usr/local/sbin/mkfs.ext3 -O dir_index MOUNT_OPTIONS=data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>F.MKFS=/usr/local/sbin/mkfs.ext3 -O filetype MOUNT_OPTIONS=data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <td colspan=19 bgcolor=#606060><b><font color=white>#0:</font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=6><b>REAL_TIME</b></td> <td colspan=6><b>CPU_TIME</b></td> <td colspan=6><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td>
<td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>CREATE</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 14.06</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.317 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.248 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.050 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.016 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.077 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>5.3</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.558 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.692 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.602 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.823</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.592 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 458224</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>COPY</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 43.62</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.982 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font 
color=red> 1.733 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.033 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 6.685 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.904 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>9.19</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.163 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.286 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.230 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.706</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.200 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 916172</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>READ</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 39.86</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.091 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.091 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.140 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 6.003 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.119 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>8.22</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.467 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.454 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.464 </font></tt></td> <td 
bgcolor=#E0E0C0 align=right><tt><font color=green> 0.529 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.443</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 916172</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>STATS</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1.54</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.987 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.896 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.942 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.649 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.883 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 0.26</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.115 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.115 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.115 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.385 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.962 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 916172</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font 
color=red> 1.107 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>DELETE</b></td> <td bgcolor=#E0E0C0 align=right><tt>37.85</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.833 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.825 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.867 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.133 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.760</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>11.11</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.223</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.223</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.220</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.254 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.222</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>4</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> </tt></td> </tr> <tr> <td colspan=19 bgcolor=#606060><b><font color=white>#1:DD_MBCOUNT=500 </font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=6><b>REAL_TIME</b></td> <td colspan=6><b>CPU_TIME</b></td> <td colspan=6><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A 
</b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_writing_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 42.15</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.062 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.534 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.066 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.071 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.073 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 7.86</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.094 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.500 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.206 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.211 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.198 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 512004</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_reading_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 36.5</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.005</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.008</U> </font></tt></td> <td 
bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.005</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.007</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.007</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>4.7</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.745</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.732</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.743</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.736</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.734</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 512004</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr><td bgcolor=black colspan=19><font color=white></td></tr> <tr><td colspan=19 align=right> <tr> <td colspan=19 bgcolor=#303030><b><font color=white>NPROC=1 DIR=/data1 SYNC=off PHASE_COPY=cp REP_COUNTER=3 GAMMA=0.2 PHASE_OVERWRITE=off PHASE_STATS=find FILE_SIZE=8192 BYTES=134217728 PHASE_APPEND=off PHASE_READ=find DEV=/dev/hdb1 DD_MBCOUNT=500 WRITE_BUFFER=131072 PHASE_DELETE=rm PHASE_MODIFY=off </td></tr> <tr><td colspan=19 align=right> <font size=-2>Produced by <a href=http://namesys.com/benchmarks/mongo_readme.html>Mongo</a> benchmark suite.</font></td></tr> </table> <hr> <p> <a name="mongo.2003.07.23"></a> Below are older (2003.07.23) mongo results.
</p> <table cols=10 cellpadding=2 cellspacing=2 noborder> <tr><td bgcolor=black colspan=10><font color=white></td></tr> <tr> <th bgcolor=#303030 colspan=10 align=left><font color=white>A. reiser4</th> </tr> <tr> <th bgcolor=#303030 colspan=10 align=left><font color=white>B. ext3 data journalling</th> </tr> <tr> <th bgcolor=#303030 colspan=10 align=left><font color=white>C. ext3 </font></th> </tr> <tr> <td colspan=10 bgcolor=#606060><b><font color=white>#0:</font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=3><b>REAL_TIME</b></td> <td colspan=3><b>CPU_TIME</b></td> <td colspan=3><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>CREATE</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 14.19</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.221 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.592 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 5.66</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.610 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.475 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 458692</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>COPY</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 49.01</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.586 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.783 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 9.08</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.308 </font></tt></td> <td 
bgcolor=#E0E0C0 align=right><tt><font color=red> 1.176 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 916668</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>READ</b></td> <td bgcolor=#E0E0C0 align=right><tt>43.39</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.970</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.017 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>8.1</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.452</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.453</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 916668</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>STATS</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1.93</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.534 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.549 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 0.27</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.000 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.963 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 916668</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>DELETE</b></td> <td bgcolor=#E0E0C0 align=right><tt>40.13</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.797</U> </font></tt></td> <td bgcolor=#E0E0C0 
align=right><tt><font color=green> 0.837 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>11.26</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.217 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.210</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>4</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> </tt></td> </tr> <tr> <td colspan=10 bgcolor=#606060><b><font color=white>#1:DD_MBCOUNT=500 </font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=3><b>REAL_TIME</b></td> <td colspan=3><b>CPU_TIME</b></td> <td colspan=3><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_writing_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 42.27</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.527 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.057 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 7.78</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.497 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.189 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 512004</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_reading_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 36.57</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.005</U> </font></tt></td> <td bgcolor=#E0E0C0 
align=right><tt><font color=black> <U> 1.005</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>4.8</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.760</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.777 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 512004</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr><td bgcolor=black colspan=10><font color=white></td></tr> <tr><td colspan=10 align=right> <tr> <td colspan=10 bgcolor=#303030><b><font color=white>NPROC=1 DIR=/data1 SYNC=off PHASE_COPY=cp REP_COUNTER=3 GAMMA=0.2 PHASE_OVERWRITE=off PHASE_STATS=find FILE_SIZE=8192 BYTES=134217728 PHASE_APPEND=off PHASE_READ=find DEV=/dev/hdb1 DD_MBCOUNT=500 WRITE_BUFFER=131072 PHASE_DELETE=rm PHASE_MODIFY=off </td></tr> <tr><td colspan=10 align=right> <font size=-2>Produced by <a href=http://namesys.com/benchmarks/mongo_readme.html>Mongo</a> benchmark suite.</font></td></tr> </table> <hr> <a name="mongo.2003.07.10"></a> <p> Below are some older benchmarks from just before Linux Tag. In these, note that gamma is the fraction of files that are larger than the base size by 10x. It is set either to 0.2 (as in the benchmark above), in an attempt to mimic observed real usage patterns, or to 0, in an attempt to measure a file size range's performance qualities in isolation. Note that V3 performs poorly in the 0-8k size range, and V4 performs well. This is the result of deep design changes you can read about at <a href="http://www.namesys.com/v4/v4.html">http://www.namesys.com/v4/v4.html</a>.
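The gamma parameter described above (and shown as <tt>GAMMA=0.2</tt> in the parameter rows) can be sketched in a few lines. This is only an illustration of the distribution as the text describes it, not code from the Mongo suite; the name <tt>mongo_file_size</tt> and its parameters are hypothetical:

```python
import random

def mongo_file_size(base=8 * 1024, gamma=0.2, max_tiers=4, rng=random):
    """Sketch of the file-size distribution described in the text above
    (not Mongo's actual source): a file stays in the 0-base range with
    probability 1 - gamma; each escalation to a 10x larger range happens
    with probability gamma, so with gamma=0.2 the tiers receive roughly
    80%, 16%, 3.2%, ... of the files."""
    size_range = base
    for _ in range(max_tiers - 1):
        if rng.random() >= gamma:
            break                      # stay in the current size range
        size_range *= 10               # escalate to the 10x larger range
    return rng.randint(1, size_range)  # size uniform within the chosen range

# With GAMMA=0.2 and an 8k base this reproduces the quoted split:
# most files are small, but most bytes land in the rare large files.
rng = random.Random(0)
sizes = [mongo_file_size(rng=rng) for _ in range(100_000)]
print(f"fraction of files <= 8k: {sum(s <= 8 * 1024 for s in sizes) / len(sizes):.2f}")
```

Setting <tt>gamma=0</tt> confines every file to the base range, which is exactly the "in isolation" configuration mentioned above.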
<dl><dt>mem total</dt><dd>513748</dd><dt>machine </dt><dd>strelka</dd><dt>kernel </dt><dd>2.5.74 #213 SMP Thu Jul 10 22:53:23 MSD 2003</dd><dt>date </dt><dd>Thu Jul 10 22:48:56 2003</dd><dt>.config </dt><dd><a href="http://www.namesys.com/intbenchmarks/mongo/03.07.11.nikita/.config">here</a></dd><dt>NPROC</dt><dd>1</dd><dt>DIR</dt><dd>/data1</dd><dt>SYNC</dt><dd>off</dd><dt>REP_COUNTER</dt><dd>3</dd><dt>All phases are in readdir order</dt><dd></dd><dt>BYTES</dt><dd>100M</dd><dt>DEV</dt><dd>/dev/hdb1</dd><dt>WRITE_BUFFER</dt><dd><b>256k</b></dd></dl> <p>Everywhere, <b>A</b> is reiserfs and <b>B</b> is reiser4. Green numbers mean reiser4 is better.</p> <table cols="7" cellpadding="2" cellspacing="2" noborder=""> <tbody><tr><td bgcolor="black" colspan="7"><font color="white"></font></td></tr> <tr> <th bgcolor="#303030" colspan="7" align="left"><font color="white">median file size 8k</font></th> </tr> <tr align="center" bgcolor="#c0c0c0"> <td></td> <td colspan="2"><b>REAL_TIME</b></td> <td colspan="2"><b>CPU_TIME</b></td> <td colspan="2"><b>DF</b></td> </tr> <tr align="center" bgcolor="#c0c0c0"> <td></td> <td><b>A</b></td><td><b>B/A </b></td> <td><b>A</b></td><td><b>B/A </b></td> <td><b>A</b></td><td><b>B/A </b></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>CREATE</b></td> <td bgcolor="#e0e0c0" align="right"><tt>41.26</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.246</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>3.93</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.908</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>321632</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.961</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>COPY</b></td> <td bgcolor="#e0e0c0" align="right"><tt>154.09</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.504</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 5.17</u></tt></td> <td 
bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.217 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>642624</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.962</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>APPEND</b></td> <td bgcolor="#e0e0c0" align="right"><tt>282.09</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.573</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 6.6</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.392 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>944428</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 0.980</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>MODIFY</b></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 284.52</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 0.986</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 3.29</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.489 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 943592</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 0.981</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>OVERWRITE</b></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 298.19</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.263 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 5.33</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.608 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>943548</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.968</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>READ</b></td> <td bgcolor="#e0e0c0" align="right"><tt>245.22</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.940</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 
3.85</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.753 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>943548</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.968</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>STATS</b></td> <td bgcolor="#e0e0c0" align="right"><tt>20.58</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.099</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 0.48</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.292 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>943548</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.968</u> </font></tt></td> </tr> <tr> <td colspan="7" bgcolor="#a0a0a0"><b><font color="white">GAMMA=0.2 FILE_SIZE=8192 <a href="http://www.namesys.com/intbenchmarks/mongo/03.07.11.nikita/8k.heavy.v3.profile">A profile</a> <a href="http://www.namesys.com/intbenchmarks/mongo/03.07.11.nikita/8k.heavy.v4.profile">B profile</a></font></b></td></tr> <tr><td bgcolor="white" colspan="7"><font color="white"></font></td></tr> <tr><td bgcolor="white" colspan="7"><font color="white"></font></td></tr> <tr><td bgcolor="white" colspan="7"><font color="white"></font></td></tr> <tr><td bgcolor="black" colspan="7"><font color="white"></font></td></tr> <tr> <th bgcolor="#303030" colspan="7" align="left"><font color="white">median file size 4k</font></th> </tr> <tr align="center" bgcolor="#c0c0c0"> <td></td> <td colspan="2"><b>REAL_TIME</b></td> <td colspan="2"><b>CPU_TIME</b></td> <td colspan="2"><b>DF</b></td> </tr> <tr align="center" bgcolor="#c0c0c0"> <td></td> <td><b>A</b></td><td><b>B/A </b></td> <td><b>A</b></td><td><b>B/A </b></td> <td><b>A</b></td><td><b>B/A </b></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>CREATE</b></td> <td bgcolor="#e0e0c0" align="right"><tt>117.32</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.176</u> </font></tt></td> 
<td bgcolor="#e0e0c0" align="right"><tt>15.57</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.758</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 667652</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.000</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>COPY</b></td> <td bgcolor="#e0e0c0" align="right"><tt>524.67</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.365</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 19.16</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.059 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 1332856</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.002</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>APPEND</b></td> <td bgcolor="#e0e0c0" align="right"><tt>1068.43</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.363</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>31.27</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.937</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>2073420</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.950</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>MODIFY</b></td> <td bgcolor="#e0e0c0" align="right"><tt>1081.23</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.670</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 18.61</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.048 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>2066536</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.953</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>OVERWRITE</b></td> <td bgcolor="#e0e0c0" align="right"><tt>1050.55</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 
0.885</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 22.81</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.017</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>2066424</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.948</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>READ</b></td> <td bgcolor="#e0e0c0" align="right"><tt>974.43</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.644</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 12.28</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.635 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>2066424</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.948</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>STATS</b></td> <td bgcolor="#e0e0c0" align="right"><tt>83.44</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.075</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>1.26</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.802</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>2066424</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.948</u> </font></tt></td> </tr> <tr> <td colspan="7" bgcolor="#a0a0a0"><b><font color="white">GAMMA=0.2 FILE_SIZE=4096 <a href="http://www.namesys.com/intbenchmarks/mongo/03.07.11.nikita/4k.heavy.v3.profile">A profile</a> <a href="http://www.namesys.com/intbenchmarks/mongo/03.07.11.nikita/4k.heavy.v4.profile">B profile</a></font></b></td></tr> <tr><td bgcolor="white" colspan="7"><font color="white"></font></td></tr> <tr><td bgcolor="white" colspan="7"><font color="white"></font></td></tr> <tr><td bgcolor="white" colspan="7"><font color="white"></font></td></tr> <tr><td bgcolor="black" colspan="7"><font color="white"></font></td></tr> <tr> <th bgcolor="#303030" colspan="7" 
align="left"><font color="white">maximal file size 4k</font></th> </tr> <tr align="center" bgcolor="#c0c0c0"> <td></td> <td colspan="2"><b>REAL_TIME</b></td> <td colspan="2"><b>CPU_TIME</b></td> <td colspan="2"><b>DF</b></td> </tr> <tr align="center" bgcolor="#c0c0c0"> <td></td> <td><b>A</b></td><td><b>B/A </b></td> <td><b>A</b></td><td><b>B/A </b></td> <td><b>A</b></td><td><b>B/A </b></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>CREATE</b></td> <td bgcolor="#e0e0c0" align="right"><tt>77.34</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.309</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>21.86</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.938</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>452252</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.923</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>COPY</b></td> <td bgcolor="#e0e0c0" align="right"><tt>412.28</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.300</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 35.11</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.013</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>893408</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.934</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>APPEND</b></td> <td bgcolor="#e0e0c0" align="right"><tt>1198.9</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.164</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>67.06</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.694</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>1631992</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.749</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>MODIFY</b></td> <td bgcolor="#e0e0c0" 
align="right"><tt>1305.14</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.351</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>43.77</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.762</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>1613124</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.758</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>OVERWRITE</b></td> <td bgcolor="#e0e0c0" align="right"><tt>1390.94</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.239</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>44.22</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.777</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>1610948</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.759</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>READ</b></td> <td bgcolor="#e0e0c0" align="right"><tt>1093.6</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.256</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 19.46</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.743 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>1610948</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.759</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>STATS</b></td> <td bgcolor="#e0e0c0" align="right"><tt>115.76</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.200</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>2.6</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.735</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>1610948</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.759</u> </font></tt></td> </tr> <tr> <td colspan="7" bgcolor="#a0a0a0"><b><font 
color="white">GAMMA=0.0 FILE_SIZE=4096 <a href="http://www.namesys.com/intbenchmarks/mongo/03.07.11.nikita/100.heavy.v3.profile">A profile</a> <a href="http://www.namesys.com/intbenchmarks/mongo/03.07.11.nikita/100.heavy.v4.profile">B profile</a></font></b></td></tr> <tr><td bgcolor="white" colspan="7"><font color="white"></font></td></tr> <tr><td bgcolor="white" colspan="7"><font color="white"></font></td></tr> <tr><td bgcolor="white" colspan="7"><font color="white"></font></td></tr> <tr> <th bgcolor="#303030" colspan="7" align="left"><font color="white">median file size 8k</font></th> </tr> <tr align="center" bgcolor="#c0c0c0"> <td></td> <td colspan="2"><b>REAL_TIME</b></td> <td colspan="2"><b>CPU_TIME</b></td> <td colspan="2"><b>DF</b></td> </tr> <tr align="center" bgcolor="#c0c0c0"> <td></td> <td><b>A</b></td><td><b>B/A </b></td> <td><b>A</b></td><td><b>B/A </b></td> <td><b>A</b></td><td><b>B/A </b></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>CREATE</b></td> <td bgcolor="#e0e0c0" align="right"><tt>40.54</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.248</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>4.01</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.895</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>321632</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.961</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>COPY</b></td> <td bgcolor="#e0e0c0" align="right"><tt>152.82</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.506</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 5.2</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.215 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>642624</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.962</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>READ</b></td> <td bgcolor="#e0e0c0" 
align="right"><tt>141.8</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.563</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 3.03</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.762 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>642624</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.962</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>STATS</b></td> <td bgcolor="#e0e0c0" align="right"><tt>14.91</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.084</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 0.59</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.051 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>642624</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.962</u> </font></tt></td> </tr> <tr><td bgcolor="black" colspan="7"><font color="white"></font></td></tr> <tr><td colspan="7" align="right"> </td></tr><tr> <td colspan="7" bgcolor="#303030"><b><font color="white">GAMMA=0.2 FILE_SIZE=8192</font></b></td></tr> <tr><td bgcolor="white" colspan="7"><font color="white"></font></td></tr> <tr><td bgcolor="white" colspan="7"><font color="white"></font></td></tr> <tr><td bgcolor="white" colspan="7"><font color="white"></font></td></tr> <tr> <th bgcolor="#303030" colspan="7" align="left"><font color="white">median file size 4k</font></th> </tr> <tr align="center" bgcolor="#c0c0c0"> <td></td> <td colspan="2"><b>REAL_TIME</b></td> <td colspan="2"><b>CPU_TIME</b></td> <td colspan="2"><b>DF</b></td> </tr> <tr align="center" bgcolor="#c0c0c0"> <td></td> <td><b>A</b></td><td><b>B/A </b></td> <td><b>A</b></td><td><b>B/A </b></td> <td><b>A</b></td><td><b>B/A </b></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>CREATE</b></td> <td bgcolor="#e0e0c0" align="right"><tt>115.6</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.174</u> 
</font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>14.84</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.772</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 667652</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.000</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>COPY</b></td> <td bgcolor="#e0e0c0" align="right"><tt>528.83</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.361</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 18.91</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.058 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 1332856</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.002</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>READ</b></td> <td bgcolor="#e0e0c0" align="right"><tt>532.06</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.372</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 10.87</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.589 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 1332856</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.002</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>STATS</b></td> <td bgcolor="#e0e0c0" align="right"><tt>51.99</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.069</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>1.67</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.581</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 1332856</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.002</u> </font></tt></td> </tr> <tr><td bgcolor="black" colspan="7"><font color="white"></font></td></tr> <tr><td colspan="7" align="right"> </td></tr><tr> <td colspan="7" 
bgcolor="#303030"><b><font color="white">GAMMA=0.2 FILE_SIZE=4096</font></b></td></tr> <tr><td bgcolor="white" colspan="7"><font color="white"></font></td></tr> <tr><td bgcolor="white" colspan="7"><font color="white"></font></td></tr> <tr><td bgcolor="white" colspan="7"><font color="white"></font></td></tr> <tr> <th bgcolor="#303030" colspan="7" align="left"><font color="white">maximal file size 4k</font></th> </tr> <tr align="center" bgcolor="#c0c0c0"> <td></td> <td colspan="2"><b>REAL_TIME</b></td> <td colspan="2"><b>CPU_TIME</b></td> <td colspan="2"><b>DF</b></td> </tr> <tr align="center" bgcolor="#c0c0c0"> <td></td> <td><b>A</b></td><td><b>B/A </b></td> <td><b>A</b></td><td><b>B/A </b></td> <td><b>A</b></td><td><b>B/A </b></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>CREATE</b></td> <td bgcolor="#e0e0c0" align="right"><tt>77.5</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.309</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>22.24</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.910</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>452252</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.923</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>COPY</b></td> <td bgcolor="#e0e0c0" align="right"><tt>415.84</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.297</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 34.9</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.009</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>893408</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.934</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>READ</b></td> <td bgcolor="#e0e0c0" align="right"><tt>469.97</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.273</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 
20.14</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.454 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>893408</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.934</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>STATS</b></td> <td bgcolor="#e0e0c0" align="right"><tt>65.49</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.162</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>3.09</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.599</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>893408</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.934</u> </font></tt></td> </tr> <tr><td bgcolor="black" colspan="7"><font color="white"></font></td></tr> <tr><td colspan="7" align="right"> </td></tr><tr> <td colspan="7" bgcolor="#303030"><b><font color="white">GAMMA=0.0 FILE_SIZE=4096</font></b></td></tr> </tbody></table> <hr> <h1>Mongo benchmark results</h1> <h2>create, copy, read, stats, delete phases</h2> <dl><dt>reiser4 </dt><dd>ChangeSet@1.1095, 2003-07-10 15:22:17+04:00, god@laputa.namesys.com oops ChangeSet@1.1094, 2003-07-10 15:14:06+04:00, god@laputa.namesys.com repairing compilation damage. 
</dd><dt>mem total</dt><dd>256624</dd><dt>machine </dt><dd>belka</dd><dt>kernel </dt><dd>2.5.74 #28 Thu Jul 10 18:36:03 MSD 2003</dd><dt>date </dt><dd>Thu Jul 10 19:21:06 2003</dd><dt><a href="http://namesys.com/intbenchmarks/mongo/03.07.11.light/dot.config">.config</a></dt></dl> <table cols="19" cellpadding="2" cellspacing="2" noborder=""> <tbody><tr><td bgcolor="black" colspan="19"><font color="white"></font></td></tr> <tr> <th bgcolor="#303030" colspan="19" align="left"><font color="white">A.INFO_R4=test FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor="#303030" colspan="19" align="left"><font color="white">B.INFO_R4=test FSTYPE=reiser4 MKFS=mkfs.reiser4 -q -e extent40 </font></th> </tr> <tr> <th bgcolor="#303030" colspan="19" align="left"><font color="white">C.FSTYPE=reiserfs </font></th> </tr> <tr> <th bgcolor="#303030" colspan="19" align="left"><font color="white">D.FSTYPE=reiserfs MOUNT_OPTIONS=notail </font></th> </tr> <tr> <th bgcolor="#303030" colspan="19" align="left"><font color="white">E.FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor="#303030" colspan="19" align="left"><font color="white">F.FSTYPE=ext3 MOUNT_OPTIONS=data=journal </font></th> </tr> <tr> <td colspan="19" bgcolor="#606060"><b><font color="white">#0:FILE_SIZE=4000 </font></b></td></tr> <tr align="center" bgcolor="#c0c0c0"> <td></td> <td colspan="6"><b>REAL_TIME</b></td> <td colspan="6"><b>CPU_TIME</b></td> <td colspan="6"><b>DF</b></td> </tr> <tr align="center" bgcolor="#c0c0c0"> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>CREATE</b></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 20.47</u></tt></td> <td bgcolor="#e0e0c0" 
align="right"><tt><font color="red"> 1.404 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 3.037 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 2.024 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 2.513 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 3.324 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>12.72</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.143 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.270 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.873 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.615</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.606</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 416332</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.934 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.088 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.909 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.858 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.858 </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>COPY</b></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 65.25</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.484 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 2.953 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 2.020 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.986 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 2.267 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>21.98</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font 
color="red"> 1.032 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.098 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.732 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.529</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.699 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 832640</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.934 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.088 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.910 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.858 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.858 </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>READ</b></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 75.56</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.349 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 2.868 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 2.218 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.902 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.925 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>17.36</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.213 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.745 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.857 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.695 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.681</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 832640</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font 
color="red"> 1.934 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.088 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.910 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.858 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.858 </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>STATS</b></td> <td bgcolor="#e0e0c0" align="right"><tt>132.18</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> 0.996 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.963</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> 0.994 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.967</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.950</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>2.63</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.977</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.970</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 0.989</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 0.981</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> 1.008 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 832640</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.934 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.088 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.910 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.858 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.858 </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>DELETE</b></td> <td bgcolor="#e0e0c0" 
align="right"><tt>85.32</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.627 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.239 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.442 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.403</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.449 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>33.57</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.856 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.780 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.623 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.157</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.154</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>4</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> 1.000 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.000</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.000</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.000</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.000</u> </font></tt></td> </tr> <tr> <td colspan="19" bgcolor="#606060"><b><font color="white">#1:FILE_SIZE=8000 </font></b></td></tr> <tr align="center" bgcolor="#c0c0c0"> <td></td> <td colspan="6"><b>REAL_TIME</b></td> <td colspan="6"><b>CPU_TIME</b></td> <td colspan="6"><b>DF</b></td> </tr> <tr align="center" bgcolor="#c0c0c0"> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A 
</b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>CREATE</b></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 15.07</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.009</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 8.875 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.709 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 2.237 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 3.321 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>8.62</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.945 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.932 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.729 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.517</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.522</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 399788</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.000</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.243 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.461 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.434 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.434 </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>COPY</b></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 52.24</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.007</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 4.998 </font></tt></td> <td bgcolor="#e0e0c0" 
align="right"><tt><font color="red"> 1.492 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.562 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.879 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>13.42</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.026 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.264 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.700 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.487</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.635 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 799488</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.000</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.243 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.461 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.434 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.434 </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>READ</b></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 60.91</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.013</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 3.738 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.606 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.333 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.340 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>11.66</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> 1.018 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.526</u> </font></tt></td> <td bgcolor="#e0e0c0" 
align="right"><tt><font color="green"> 0.749 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.547 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.547 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 799488</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.000</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.243 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.461 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.434 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.434 </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>STATS</b></td> <td bgcolor="#e0e0c0" align="right"><tt>126.53</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.951</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.958</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> 0.991 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> 1.004 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.966</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 2.57</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.023 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.027 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 0.988</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> 1.016 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> 1.012 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 799488</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.000</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.243 
</font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.461 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.434 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.434 </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>DELETE</b></td> <td bgcolor="#e0e0c0" align="right"><tt>73.21</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.116 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.746 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.242</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.301 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.396 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>19.93</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> 1.013 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.584 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.530 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.126 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.123</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>4</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> 1.000 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.000</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.000</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.000</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.000</u> </font></tt></td> </tr> <tr><td bgcolor="black" colspan="19"><font color="white"></font></td></tr> <tr><td colspan="19" align="right"> </td></tr><tr> <td colspan="19" bgcolor="#303030"><b><font 
color="white">PHASE_APPEND=off NPROC=1 DIR=/mnt/testfs SYNC=off REP_COUNTER=3 GAMMA=0.0 PHASE_OVERWRITE=off DEV=/dev/hdb3 WRITE_BUFFER=4096 BYTES=128000000 PHASE_MODIFY=off </font></b></td></tr> <tr><td colspan="19" align="right"> <font size="-2">Produced by <a href="http://namesys.com/benchmarks/mongo_readme.html">Mongo</a> benchmark suite.</font></td></tr> </tbody></table> <h2>dd of a large file phase</h2> <dl><dt>reiser4 </dt><dd>ChangeSet@1.1095, 2003-07-10 15:22:17+04:00, god@laputa.namesys.com oops ChangeSet@1.1094, 2003-07-10 15:14:06+04:00, god@laputa.namesys.com repairing compilation damage. </dd><dt>mem total</dt><dd>256624</dd><dt>machine </dt><dd>belka</dd><dt>kernel </dt><dd>2.5.74 #28 Thu Jul 10 18:36:03 MSD 2003</dd><dt>date </dt><dd>Thu Jul 10 21:36:22 2003</dd><dt><a href="http://namesys.com/intbenchmarks/mongo/03.07.11.light/dot.config">.config</a></dt></dl> <table cols="19" cellpadding="2" cellspacing="2" noborder=""> <tbody><tr><td bgcolor="black" colspan="19"><font color="white"></font></td></tr> <tr> <th bgcolor="#303030" colspan="19" align="left"><font color="white">A.INFO_R4=test FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor="#303030" colspan="19" align="left"><font color="white">B.INFO_R4=test FSTYPE=reiser4 MKFS=mkfs.reiser4 -q -e extent40 </font></th> </tr> <tr> <th bgcolor="#303030" colspan="19" align="left"><font color="white">C.FSTYPE=reiserfs </font></th> </tr> <tr> <th bgcolor="#303030" colspan="19" align="left"><font color="white">D.FSTYPE=reiserfs MOUNT_OPTIONS=notail </font></th> </tr> <tr> <th bgcolor="#303030" colspan="19" align="left"><font color="white">E.FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor="#303030" colspan="19" align="left"><font color="white">F.FSTYPE=ext3 MOUNT_OPTIONS=data=journal </font></th> </tr> <tr> <td colspan="19" bgcolor="#606060"><b><font color="white">#0:DD_MBCOUNT=768 </font></b></td></tr> <tr align="center" bgcolor="#c0c0c0"> <td></td> <td colspan="6"><b>REAL_TIME</b></td> <td 
colspan="6"><b>CPU_TIME</b></td> <td colspan="6"><b>DF</b></td> </tr> <tr align="center" bgcolor="#c0c0c0"> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>dd_writing_largefile</b></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 76.29</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 0.997</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.137 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.149 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.062 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 2.217 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>7.47</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.027 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.545</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.549</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.803 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.835 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 786432</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.000</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.001</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.001</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.001</u> </font></tt></td> <td bgcolor="#e0e0c0" 
align="right"><tt><font color="black"> <u> 1.001</u> </font></tt></td> </tr> <tr><td bgcolor="black" colspan="19"><font color="white"></font></td></tr> <tr><td colspan="19" align="right"> </td></tr><tr> <td colspan="19" bgcolor="#303030"><b><font color="white">NPROC=1 DIR=/mnt/testfs SYNC=off REP_COUNTER=3 GAMMA=0.0 DD_MBCOUNT=768 DEV=/dev/hdb3 WRITE_BUFFER=4096 FILE_SIZE=8000 BYTES=128000000 </font></b></td></tr> <tr><td colspan="19" align="right"> <font size="-2">Produced by <a href="http://namesys.com/benchmarks/mongo_readme.html">Mongo</a> benchmark suite.</font></td></tr> </tbody></table> <hr> <a name="bonnie++.2003.09.30"> This is bonnie++ output for reiser4 and ext3. This has been done in an attempt to analyze <a href="http://fsbench.netnation.com/">results</a> obtained by Mike Benoit. Hardware specs: <pre> processor : 3 vendor_id : GenuineIntel cpu family : 15 model : 2 model name : Intel(R) Xeon(TM) CPU 2.40GHz stepping : 7 cpu MHz : 2379.253 cache size : 512 KB bogomips : 4751.36 </pre> Dual CPU with hyper-threading Memory: 128M HDD: <pre> # hdparm /dev/hdb1 /dev/hdb1: multcount = 16 (on) IO_support = 0 (default 16-bit) unmaskirq = 0 (off) using_dma = 1 (on) keepsettings = 0 (off) readonly = 0 (off) readahead = 256 (on) geometry = 65535/16/63, sectors = 117226242, start = 63 # hdparm -t /dev/hdb1 /dev/hdb1: Timing buffered disk reads: 64 MB in 1.60 seconds = 39.91 MB/sec # hdparm -i /dev/hdb /dev/hdb: Model=ST360021A, FwRev=3.19, SerialNo=3HR173RB Config={ HardSect NotMFM HdSw>15uSec Fixed DTR>10Mbs RotSpdTol>.5% } RawCHS=16383/16/63, TrkSize=0, SectSize=0, ECCbytes=4 BuffType=unknown, BuffSize=2048kB, MaxMultSect=16, MultSect=16 CurCHS=16383/16/63, CurSects=16514064, LBA=yes, LBAsects=117231408 IORDY=on/off, tPIO={min:240,w/IORDY:120}, tDMA={min:120,rec:120} PIO modes: pio0 pio1 pio2 pio3 pio4 DMA modes: mdma0 mdma1 mdma2 UDMA modes: udma0 udma1 udma2 udma3 udma4 *udma5 AdvancedPM=no WriteCache=enabled Drive conforms to: device does not report version: 1 
2 3 4 5 </pre> <pre> ./bonnie++ -s 1g -n 10 -x 5 Version 1.03 ------Sequential Output------ --Sequential Input- --Random- -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks-- Machine Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP v4.128M 1G 19903 89 37911 20 15392 11 13624 58 41807 12 131.0 0 v4.128M 1G 19965 89 37600 20 15845 11 13730 58 41751 12 130.0 0 v4.128M 1G 19937 89 37746 20 15404 11 13624 58 41793 12 132.1 0 v4.128M 1G 19998 89 37184 19 15007 10 13393 56 41611 11 130.2 0 v4.128M 1G 19771 89 37679 20 15206 11 13466 57 41808 11 130.2 1 ext3.128M 1G 21236 99 37258 22 11357 4 13460 56 41748 6 120.0 0 ext3.128M 1G 20821 99 36838 23 12176 5 13154 55 40671 6 120.7 0 ext3.128M 1G 20755 99 37032 24 12069 4 12908 54 40851 5 120.2 0 ext3.128M 1G 20651 99 37094 24 11817 5 13038 54 40842 6 121.3 0 ext3.128M 1G 20928 99 37300 23 12287 4 13067 55 41404 6 120.1 0 ------Sequential Create------ --------Random Create-------- -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete-- files:max:min /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP v4.128M 10 18503 100 +++++ +++ 9488 99 10158 99 +++++ +++ 11635 99 v4.128M 10 19760 99 +++++ +++ 9696 99 10441 100 +++++ +++ 11831 99 v4.128M 10 19583 100 +++++ +++ 9672 100 10597 99 +++++ +++ 11846 100 v4.128M 10 19720 100 +++++ +++ 9577 99 10126 100 +++++ +++ 11924 100 v4.128M 10 19682 100 +++++ +++ 9683 100 10461 100 +++++ +++ 11834 100 ext3.128M 10 3279 97 +++++ +++ +++++ +++ 3406 100 +++++ +++ 8951 95 ext3.128M 10 3303 98 +++++ +++ +++++ +++ 3423 99 +++++ +++ 8558 96 ext3.128M 10 3317 98 +++++ +++ +++++ +++ 3402 100 +++++ +++ 8721 93 ext3.128M 10 3325 98 +++++ +++ +++++ +++ 3390 100 +++++ +++ 9242 100 ext3.128M 10 3315 97 +++++ +++ +++++ +++ 3439 100 +++++ +++ 8896 96 </pre> <pre> ./bonnie++ -f -d . 
-s 3072 -n 10:100000:10:10 -x 1 Version 1.03 ------Sequential Output------ --Sequential Input- --Random- -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks-- Machine Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP v4 3G 37579 19 15657 11 41531 11 105.8 0 v4 3G 37993 20 15478 11 41632 11 105.4 0 ext3 3G 35221 22 10987 4 41105 6 90.9 0 ext3 3G 35099 22 11517 4 41416 6 90.7 0 ------Sequential Create------ --------Random Create-------- -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete-- files:max:min /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP v4 10:100000:10/10 570 39 746 17 1435 23 513 40 104 2 951 15 v4 10:100000:10/10 566 40 765 17 1385 23 509 41 104 2 904 14 ext3 10:100000:10/10 221 8 364 4 853 4 204 7 99 1 306 2 ext3 10:100000:10/10 221 7 368 4 839 5 206 7 91 1 309 2 </pre> <hr> <a name="grant"></a> Benchmarks performed by <a href="mailto:mine0057@mrs.umn.edu">Grant Miner</a>. He used the <a href="http://epoxy.mrs.umn.edu/~minerg/fstests/bench.scm">bench.scm</a> script (requires <a href="http://www.scsh.net/">scsh</a>). Results (copied from <a href="http://epoxy.mrs.umn.edu/~minerg/fstests/results.html">http://epoxy.mrs.umn.edu/~minerg/fstests/results.html</a>): <p>2.6.0-test3</p> <p>mkfs ran with default options</p> <p>Each test has three columns: the first is the canonical name of the test, with the time the test took in seconds; the second is system CPU time; the third is user CPU time. The trailing summary columns are: "total", the total time; "sys", the total system CPU time; "usr", the total user CPU time; and "total cpu", the sum of total system and total user time.
</p> <p><b>all values are in seconds thus lower is better</b></p> <table border cellspacing=0 cellpadding=5> <caption>Filesystem Performance</caption> <colgroup> <col> <col bgcolor="gray"> </colgroup> <tr> <th>fs</th> <td bgcolor="lightgray">bigdir</td> <td>sys</td> <td>usr</td> <td bgcolor="lightgray">cp</td> <td>sys</td> <td>usr</td> <td bgcolor="lightgray">cp2</td> <td>sys</td> <td>usr</td> <td bgcolor="lightgray">cp3</td> <td>sys</td> <td>usr</td> <td bgcolor="lightgray">cp4</td> <td>sys</td> <td>usr</td> <td bgcolor="lightgray">cp5</td> <td>sys</td> <td>usr</td> <td bgcolor="lightgray">rm</td> <td>sys</td> <td>usr</td> <td bgcolor="lightgray">rm2</td> <td>sys</td> <td>usr</td> <td bgcolor="lightgray">rm3</td> <td>sys</td> <td>usr</td> <td bgcolor="lightgray">sync</td> <td>sys</td> <td>usr</td> <td bgcolor="lightgray">total</td> <td>sys</td> <td>usr</td> <td bgcolor="lightgray">total cpu</td> <th>fs</th> </tr> <tr> <th>reiserfs</th> <td bgcolor="lightgray">40.03</td> <td>12.22</td> <td>0.76</td> <td bgcolor="lightgray">77.75</td> <td>10.72</td> <td>0.45</td> <td bgcolor="lightgray">62.9</td> <td>10.82</td> <td>0.43</td> <td bgcolor="lightgray">60.26</td> <td>11.03</td> <td>0.43</td> <td bgcolor="lightgray">61.33</td> <td>11.13</td> <td>0.43</td> <td bgcolor="lightgray">66.08</td> <td>11.31</td> <td>0.45</td> <td bgcolor="lightgray">10.86</td> <td>3.74</td> <td>0.07</td> <td bgcolor="lightgray">4.62</td> <td>3.36</td> <td>0.09</td> <td bgcolor="lightgray">8.22</td> <td>3.5</td> <td>0.09</td> <td bgcolor="lightgray">1.78</td> <td>0.03</td> <td>0.</td> <td bgcolor="lightgray">393.83</td> <td>77.86</td> <td>3.2</td> <td bgcolor="lightgray">81.06</td> <th>reiserfs</th> </tr> <tr> <th>jfs</th> <td bgcolor="lightgray">47.2</td> <td>8.9</td> <td>0.77</td> <td bgcolor="lightgray">109.75</td> <td>5.5</td> <td>0.3</td> <td bgcolor="lightgray">110.71</td> <td>5.49</td> <td>0.35</td> <td bgcolor="lightgray">114.69</td> <td>5.6</td> <td>0.29</td> <td 
bgcolor="lightgray">117.97</td> <td>5.65</td> <td>0.35</td> <td bgcolor="lightgray">125.48</td> <td>5.82</td> <td>0.29</td> <td bgcolor="lightgray">38.68</td> <td>0.74</td> <td>0.05</td> <td bgcolor="lightgray">16.25</td> <td>1.08</td> <td>0.07</td> <td bgcolor="lightgray">37.46</td> <td>0.74</td> <td>0.04</td> <td bgcolor="lightgray">0.07</td> <td>0.</td> <td>0.</td> <td bgcolor="lightgray">718.26</td> <td>39.52</td> <td>2.51</td> <td bgcolor="lightgray">42.03</td> <th>jfs</th> </tr> <tr> <th>xfs</th> <td bgcolor="lightgray">44.77</td> <td>13.3</td> <td>0.94</td> <td bgcolor="lightgray">105.36</td> <td>13.33</td> <td>0.53</td> <td bgcolor="lightgray">110.27</td> <td>14.36</td> <td>0.5</td> <td bgcolor="lightgray">110.17</td> <td>14.37</td> <td>0.51</td> <td bgcolor="lightgray">111.03</td> <td>14.43</td> <td>0.53</td> <td bgcolor="lightgray">118.84</td> <td>14.87</td> <td>0.55</td> <td bgcolor="lightgray">31.85</td> <td>6.44</td> <td>0.15</td> <td bgcolor="lightgray">15.2</td> <td>5.45</td> <td>0.14</td> <td bgcolor="lightgray">34.32</td> <td>5.87</td> <td>0.14</td> <td bgcolor="lightgray">0.03</td> <td>0.</td> <td>0.</td> <td bgcolor="lightgray">681.84</td> <td>102.42</td> <td>3.99</td> <td bgcolor="lightgray">106.41</td> <th>xfs</th> </tr> <tr> <th>reiser4</th> <td bgcolor="lightgray">33.51</td> <td>10.85</td> <td>0.69</td> <td bgcolor="lightgray">33.9</td> <td>10.65</td> <td>0.65</td> <td bgcolor="lightgray">32.9</td> <td>10.79</td> <td>0.67</td> <td bgcolor="lightgray">34.</td> <td>10.87</td> <td>0.65</td> <td bgcolor="lightgray">33.62</td> <td>10.87</td> <td>0.69</td> <td bgcolor="lightgray">31.31</td> <td>10.83</td> <td>0.76</td> <td bgcolor="lightgray">17.45</td> <td>4.07</td> <td>0.3</td> <td bgcolor="lightgray">11.54</td> <td>4.49</td> <td>0.3</td> <td bgcolor="lightgray">13.08</td> <td>4.27</td> <td>0.27</td> <td bgcolor="lightgray">0.52</td> <td>0.</td> <td>0.</td> <td bgcolor="lightgray">241.83</td> <td>77.69</td> <td>4.98</td> <td 
bgcolor="lightgray">82.67</td> <th>reiser4</th> </tr> <tr> <th>ext3</th> <td bgcolor="lightgray">38.79</td> <td>9.35</td> <td>0.7</td> <td bgcolor="lightgray">91.57</td> <td>7.21</td> <td>0.36</td> <td bgcolor="lightgray">62.6</td> <td>7.44</td> <td>0.36</td> <td bgcolor="lightgray">62.74</td> <td>7.5</td> <td>0.37</td> <td bgcolor="lightgray">60.62</td> <td>7.52</td> <td>0.34</td> <td bgcolor="lightgray">69.82</td> <td>7.59</td> <td>0.39</td> <td bgcolor="lightgray">26.21</td> <td>1.67</td> <td>0.05</td> <td bgcolor="lightgray">8.73</td> <td>1.66</td> <td>0.04</td> <td bgcolor="lightgray">13.79</td> <td>1.63</td> <td>0.06</td> <td bgcolor="lightgray">4.76</td> <td>0.01</td> <td>0.</td> <td bgcolor="lightgray">439.63</td> <td>51.58</td> <td>2.67</td> <td bgcolor="lightgray">54.25</td> <th>ext3</th> </tr> <tr> <th>ext2</th> <td bgcolor="lightgray">32.78</td> <td>7.61</td> <td>0.64</td> <td bgcolor="lightgray">37.28</td> <td>5.24</td> <td>0.34</td> <td bgcolor="lightgray">43.55</td> <td>5.34</td> <td>0.35</td> <td bgcolor="lightgray">45.41</td> <td>5.34</td> <td>0.37</td> <td bgcolor="lightgray">47.72</td> <td>5.48</td> <td>0.34</td> <td bgcolor="lightgray">50.5</td> <td>5.41</td> <td>0.32</td> <td bgcolor="lightgray">16.28</td> <td>0.67</td> <td>0.06</td> <td bgcolor="lightgray">7.54</td> <td>0.66</td> <td>0.05</td> <td bgcolor="lightgray">15.31</td> <td>0.71</td> <td>0.05</td> <td bgcolor="lightgray">0.24</td> <td>0.</td> <td>0.</td> <td bgcolor="lightgray">296.61</td> <td>36.46</td> <td>2.52</td> <td bgcolor="lightgray">38.98</td> <th>ext2</th> </tr> </table> <hr> </body> </html> <hr> <address><a href="mailto:reiser@namesys.com">Hans Reiser</a></address> <!-- Created: Sat Aug 23 00:28:46 MSD 2003 --> <!-- hhmts start --> Last modified: Thu Nov 20 17:51:10 MSK 2003 <!-- hhmts end --> </body> </html> b4cdf2dd58664fc36b1e0e57d060cabda38d6d31 Bugs 0 11 1302 2009-06-25T06:39:09Z Chris goe 2 Created page with 'If you find a bug, please report it to the [[Mailinglists|mailinglist]]. Already reported bugs can be found on the [http://bugzilla.kernel.org/buglist.cgi?query_format=advanced&s...'
If you find a bug, please report it to the [[Mailinglists|mailinglist]]. Already reported bugs can be found in the [http://bugzilla.kernel.org/buglist.cgi?query_format=advanced&short_desc_type=allwordssubstr&short_desc=&product=File+System&component=Reiser4&component=ReiserFS&bug_status=NEW&bug_status=ASSIGNED&bug_status=REOPENED&bugidtype=include&order=Importance kernel.org Bugzilla]. [[category:Reiser4]] [[category:ReiserFS]] 256f5a341e4ce5935716e52ac4df993927416071 Containers 0 1097 4249 4245 2017-06-20T23:26:46Z Chris goe 2 fsck linked Edward [http://www.spinics.net/lists/reiserfs-devel/msg05130.html posted] the following on 2016-08-01 to [[Mailinglists|reiserfs-devel]]: ---- Well, first of all, I suggest renaming it, as the term "subvolume" is already reserved for the parts of the reiser4 logical volumes that I am currently working on. How about "storage containers", or simply "containers"? So a container is an isolated semantic subtree associated with some reiser4 partition and rooted at some directory (let's call it the container's root). Isolation means that the container's root doesn't have a parent and hence cannot be reached by <tt>->lookup()</tt> during tree traversal (the procedure of name resolution). Thus, the container's root looks like the volume's root (which also doesn't have a parent). So we already know how to create a container's root: we just need to create a directory similar to the volume's root. = Creating a container = Every object in reiser4 has a locality, which is actually the object-id of its parent. Since the container's root (like the volume's root) doesn't have a parent, we should assign the same dummy locality <tt>FORMAT40_ROOT_LOCALITY</tt> (decimal 41) to the container's root. Directories in reiser4, like other objects, are created by the function <tt>create_vfs_object()</tt>. Specifically, for every new directory this function creates: a) the stat-data of the new directory; b) a cde-item with two units (the entries "." and "..").
The problem is that <tt>create_vfs_object()</tt> thinks of every new object as a child of the volume's root, whereas the container's root is not a child. So we'll need to work around this: simply don't perform any of the actions intended for the parent (updating the parent's number of links, its size, etc.). The next problem is that there is no dedicated system call to create containers, so we'll need to create a special user-space utility for this as part of reiser4progs, and implement a respective ioctl in the reiser4 kernel module. That utility should accept the mount point that the reiser4 partition (on which we want to create a container) is mounted at. Upon successful creation, the utility should return the object-id of the new container. = Mounting a container = Every newly created container (like every object in reiser4) gets a unique object-id. This object-id should be passed to <tt>mount(8)</tt> among the other parameters to inform the reiser4 kernel module that we want to mount a certain container. It can be done, for example, by a mount option. There are examples of how to implement a reiser4 mount option that requires a parameter (the object-id of the container's root in our case). I think that we need something like <tt>SB_FIELD_OPT</tt> (see <tt>init_super.c</tt>). The new option should store the passed container id in the reiser4 superblock. By default (if no container id was specified) the reiser4 mount procedure will look for the root with the object-id <tt>FORMAT40_ROOT_OBJECTID</tt>. If you specified an object-id by mount option, then it will look for the respective container's root. In all other respects, mounting a volume and mounting a container are identical. = Unmounting a container = No differences from unmounting a volume. = Deleting a container = Run <tt>rm -rf mountpoint</tt> in the proper mount session, then unmount the container, and finally remove the container's root (i.e. the now-empty container) with the user-space utility.
The respective ioctl implementation in the reiser4 kernel module should locate and remove the following objects: a) the container's stat-data; b) the cde-item with two entries, "." and "..". Comments: 1) If the container is not empty, the utility should return an error (suggesting that the user delete the container's content in a proper mount session). 2) Deleting an empty container should be performed in a mount session for some other container; otherwise, return an error. 3) The volume and the containers associated with the same reiser4 partition are distinguishable only by object-id. Volumes are containers which always have the object-id <tt>FORMAT40_ROOT_OBJECTID</tt> (decimal 42). In all other respects, containers and volumes are identical. So you can easily remove an empty volume like a container by specifying that object-id (decimal 42). Condition (2) guarantees that a reiser4 partition will always have at least one root (to prevent corruption). 4) If the volume (i.e. the container with object-id 42) was deleted, then <tt>mount(8)</tt> with no container id specified will fail because of the inability to find the default root. So after deleting the volume you will always need to specify a container id at mount time. = Print a list of containers of a Reiser4 partition = Find all roots with the locality <tt>FORMAT40_ROOT_LOCALITY</tt> on the specified partition and print their object-ids. This is also the job of the user-space utility. I suggest creating a single utility named "container.reiser4" for all of these needs, with the following synopsis: container.reiser4 -c mountpoint # create a container on the mounted partition. container.reiser4 -d object-id mountpoint # delete the container with the specified object-id from the mounted partition. container.reiser4 -l mountpoint # print a list of all containers of the mounted partition. = Fsck and containers = Current [[Reiser4progs|fsck]] will swear at containers. I'll help to make it happy.
Fsck should perform the semantic passes in parallel (one for each container of the partition). A pleasant bonus is that many containers on the same partition will make checking the partition faster. Basically, that's all. After mounting a container you'll be able to access it in "rw" or "ro" mode, depending on how it was mounted - everything is the same as for usual volumes. Ah, <tt>reiser4_lookup()</tt> will require a minor modification to descend into the proper container (I didn't check the code, but I think that without modifications it will always fall into the "default" root with object-id 42). [[category:Reiser4]] cc9be8e16425caf39d8e3c3abfe3a44ad850df87 4245 4153 2017-06-20T23:24:08Z Chris goe 2 category added Edward [http://www.spinics.net/lists/reiserfs-devel/msg05130.html posted] the following on 2016-08-01 to [[Mailinglists|reiserfs-devel]]: ---- Well, first of all I suggest to rename it, as the term "subvolume" is already reserved for the parts of reiser4 logical volumes, that I am currently working on. How about "storage containers", or simply "containers"? So a container is an isolated semantic subtree associated with some reiser4 partition and rooted with some directory (let's call it container's root). Isolation means that container's root doesn't have a parent, and, hence, can not be accessible by <tt>->lookup()</tt> during tree traversal (the procedure of name resolution). Thus, container's root looks like the volume's root (which also doesn't have a parent). So we already know how to create container's root: we just need to create a directory, which is similar to volume's root. = Creating a container = Every object in reiser4 has locality, which actually is object-id of the parent. Since container's root (like volume's root) doesn't have a parent, we should assign the same dummy locality <tt>FORMAT40_ROOT_LOCALITY</tt> (decimal 41) to the container's root. Directories in reiser4 as well as other objects are created by the function <tt>create_vfs_object()</tt>.
Specifically, for every new directory this function creates: a) the stat-data of the new directory; b) a cde-item with two units (the entries "." and ".."). The problem is that <tt>create_vfs_object()</tt> treats every new object as a child of the volume's root, whereas a container's root is not a child. So we will need to work around this: simply skip any actions intended for the parent (updating the parent's link count, size, etc.).

The next problem is that there is no dedicated system call to create containers, so we will need a special user-space utility in reiser4progs for this, plus a corresponding ioctl in the reiser4 kernel module. The utility should accept the mount point at which the reiser4 partition (on which we want to create the container) is mounted. Upon successful creation, the utility should return the object-id of the new container.

= Mounting a container =

Every newly created container (like every object in reiser4) gets a unique object-id. This object-id should be passed to <tt>mount(8)</tt> among the other parameters to tell the reiser4 kernel module which container we want to mount. It can be done, for example, via a mount option. There are examples of how to implement a reiser4 mount option that takes a parameter (the object-id of the container's root in our case); I think we need something like <tt>SB_FIELD_OPT</tt> (see <tt>init_super.c</tt>). The new option should store the passed container id in the reiser4 superblock. By default (if no container id was specified), the reiser4 mount procedure looks for the root with object-id <tt>FORMAT40_ROOT_OBJECTID</tt>; if you specify an object-id via the mount option, it will look for the respective container's root instead. In all other respects mounting volumes and containers is identical.

= Unmounting a container =

No different from unmounting a volume.
= Deleting a container =

Run <tt>rm -rf mountpoint</tt> in the proper mount session, then unmount the container, and finally remove the container's root (i.e. the empty container) with the user-space utility. The respective ioctl implementation in the reiser4 kernel module should locate and remove the following objects: a) the container's stat-data; b) the cde-item with the two entries "." and "..".

Comments:

1) If the container is not empty, the utility should return an error (and suggest that the user delete the container's contents in a proper mount session).

2) Deleting an empty container must be performed in a mount session for some other container; otherwise, return an error.

3) The volume and the containers associated with the same reiser4 partition are distinguishable only by object-id. Volumes are containers that always have object-id <tt>FORMAT40_ROOT_OBJECTID</tt> (decimal 42); in all other respects containers and volumes are identical. So you can easily remove an empty volume like any container by specifying that object-id (decimal 42). Condition (2) guarantees that a reiser4 partition always keeps at least one root (to prevent corruption).

4) If the volume (i.e. the container with object-id 42) was deleted, then <tt>mount(8)</tt> with no container id specified will fail because it cannot find the default root. So after deleting the volume you will always need to specify a container id at mount time.

= Print a list of containers of a Reiser4 partition =

Find all roots with locality <tt>FORMAT40_ROOT_LOCALITY</tt> on the specified partition and print their object-ids. This is also the job of the user-space utility. I suggest creating a single utility named "container.reiser4" for all needs, with the following synopsis:

<pre>
container.reiser4 -c mountpoint            # create a container on the mounted partition
container.reiser4 -d object-id mountpoint  # delete the container with the specified object-id
container.reiser4 -l mountpoint            # print a list of all containers of the mounted partition
</pre>
= Fsck and containers =

The current <tt>fsck</tt> will complain about containers; I'll help to make it happy. Fsck should perform its semantic passes in parallel (for all containers of the partition); a pleasant bonus is that many containers on the same partition will make the partition check faster.

Basically, that's all. After mounting a container you will be able to access it in "rw" or "ro" mode, depending on how it was mounted; everything is the same as for usual volumes. Ah, <tt>reiser4_lookup()</tt> will require a minor modification to descend into the proper container (I didn't check the code, but I think that without modifications it will always descend into the "default" root with object-id 42).

[[category:Reiser4]]

Credits

The following people were the main developers of [[ReiserFS]] and [[Reiser4]]. Some are still active contributors, others are not. Since [http://git.zen-sources.org/?p=mmotm.git;a=blob;f=fs/reiser4/README;hb=HEAD both] filesystems are [http://lxr.linux.no/linux+v2.6.30/fs/reiserfs/README licensed] under the [http://www.gnu.org/licenses/gpl-2.0.html GPLv2], hundreds of anonymous contributors should be on this list too.

{|
| Edward Shishkin || Reiser4 maintainer. ReiserFS journal relocation support, Reiser4 Encryption and Compression plugins, Different transaction models, Logical volumes.
|-
| Jeff Mahoney || ReiserFS(v3) maintainer. SuSE programmer, ReiserFS(v3) port to Power PC and sparc, Extended attributes support.
|-
| Vladimir Saveliev || Former lead Programmer.
Reiser4 unix-file plugin, node plugin, extent and formatting item plugins.
|-
| Alexander "Zam" Zarochentcev || Distinguished Programmer. ReiserFS(v3) resizer, Reiser4 Locking, Flushing, Block allocator.
|-
| Vladimir Demidov || Programmer, General Director of [http://namesys.ru namesys.ru], alpha port, v4 parser.
|-
| Vitaly Fertman || Programmer. [[reiserfsck|fsck.reiserfs]], fsck.reiser4.
|-
| Elena Gryaznova || Testing and benchmarking.
|-
| Lex "FLX" Lyamin || Hostmaster and Sysadmin
|-
| George Beshers || Team Leader of the Masks and Process Oriented Security DARPA Project
|-
| Nate Diller || Programmer, Masks and Process Oriented Security DARPA Project
|-
| Hans Reiser || Owner, Architect, Manager, Programmer.
|-
| Ramon Reiser || Marketing, Instructional Technology, Technical Writing.
|-
| Yura Umanets || Programmer. [[reiser4progs]]
|-
| Josh MacDonald || Programmer. Transaction Manager. Flush Code
|-
| Yury Rupasov || Programmer.
|-
| Yuri "Sizif" Shevchuk || Programmer.
|-
| Jeremy Fitzhardinge || Volunteer. Author of hashing code. (<tt>teahash.c</tt>)
|-
| Roman Pozlevich || Programmer.
|-
| Nikita Danilov || Balancing, plugins.
|-
| Oleg "Green" Drokin || Release Manager.
|-
| Chris Mason || ReiserFS(v3) journaling code.
|}

=== Companies ===

A few companies sponsored certain parts of ReiserFS/Reiser4 development:

* The [http://www.darpa.mil/ Defense Advanced Research Projects Agency] was the primary sponsor of Reiser4 (DARPA does not endorse this project; it merely sponsored it.)
* Journal relocation and resizing was sponsored by [http://www.applianceware.com ApplianceWare]
* [http://en.wikipedia.org/wiki/Hierarchical_storage_management HSM] sponsored by [http://www.bigstorage.com/ BigStorage, Inc.]
* Journaling sponsored by [http://www.suse.com SuSE] (originally sponsored by [http://mp3.com mp3.com])
* Debugging sponsored by [http://linspire.com/reiserlink Linspire] (now Xandros)

[[category:Reiser4]] [[category:ReiserFS]]

Debug Reiser4

= How to collect kernel oops messages =

Glossary: the testing machine is the machine where you test Reiser4; the control machine is the machine where you collect oops messages.
NOTE: the testing machine and the control machine must be different machines.

Steps:

1. Connect the testing and control machines with a serial (null-modem) cable.

2. Find out the device names of the serial ports used on the testing and control machines. Suppose it is /dev/ttyS0 on the testing machine and /dev/ttyS1 on the control machine.

3. On the control machine, open a console and type the following (as root):

 # stty ispeed 115200 ospeed 115200 -F /dev/ttyS1
 # cat /dev/ttyS1

4. Restart the testing machine, passing "console=ttyS0,115200" along with the other kernel parameters to the boot loader (for example by editing the grub.cfg file). You should now see boot messages on the console opened at step (3) on the control machine.

5. Run any stress tests on the testing machine. On the console opened at step (3) you will be able to collect any system information, including oops messages.

6. Send the oops messages to the <code>reiserfs-devel</code> [[Mailinglists|mailing list]].

Comment: if the testing and/or control machine does not have a serial port, you can use USB-to-serial adapter(s). In this case the device name will look like /dev/ttyUSB0.

Comment: you can also use the minicom utility on the control machine to collect oops messages.

[[category:Reiser4]]
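The console-capture side of steps (3)–(5) can be wrapped in a small helper. This is only a sketch: the function name <code>capture_console</code> and the log paths are inventions of this example, and the demo at the end substitutes a regular file for the serial device so the plumbing can be tried without hardware.

```shell
#!/bin/sh
# capture_console DEV LOG - sketch of steps (3)-(5): set the serial line
# speed and mirror everything arriving on DEV to the screen and to LOG.
# The function name and the paths below are this example's assumptions.
capture_console() {
    dev="$1"; log="$2"
    # Only a real character device (a serial port) needs the speed setup.
    if [ -c "$dev" ]; then
        stty ispeed 115200 ospeed 115200 -F "$dev"
    fi
    # tee shows the messages live while appending them to the log file.
    cat "$dev" | tee -a "$log"
}

# Real use on the control machine (as root):
#   capture_console /dev/ttyS1 /var/log/reiser4-oops.log
# Demo with a regular file standing in for the serial device:
printf 'Oops: 0002 [#1] SMP\n' > /tmp/fake-serial
capture_console /tmp/fake-serial /tmp/console.log
```

Logging to a file this way keeps a permanent copy of the oops, which is what you would attach to a report to the mailing list.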
Debug Reiser4progs

= Debug Reiser4progs with GDB =

1. Make sure you have the latest version of [[Reiser4progs]]:

 $ git clone https://github.com/edward6/reiser4progs

2. Compile and build static binaries with debugging symbols:

 $ cd reiser4progs
 $ ./prepare
 $ ./configure --enable-debug --enable-full-static
 $ make

Troubleshooting: on the Fedora distro, the make command can fail during compilation with:

 /usr/bin/ld: cannot find -luuid

A possible solution:

 $ cd /usr/lib64
 $ sudo mv libuuid.so libuuid.so_
 $ sudo ln -s libossp-uuid.so libuuid.so

3. Run gdb against the needed binary, which can be found in the ./progs directory. For example:

 $ gdb progs/fsck/fsck.reiser4

[[category:Reiser4]]
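Steps 1–3 above can be collected into one script. A sketch under stated assumptions: the function name <code>build_reiser4progs</code> is made up here, and its argument is a command prefix, so passing <code>echo</code> gives a dry run that only prints the steps; pass an empty prefix on a machine where you really want to build.

```shell
#!/bin/sh
# build_reiser4progs RUN - sketch of steps 1-3 above. RUN is a command
# prefix: "echo" just prints each step (dry run); an empty RUN executes
# them. The function name and the dry-run trick are this example's
# additions; the commands themselves are the ones from the article.
build_reiser4progs() {
    RUN="$1"
    $RUN git clone https://github.com/edward6/reiser4progs
    $RUN cd reiser4progs
    $RUN ./prepare
    $RUN ./configure --enable-debug --enable-full-static
    $RUN make
    # The static binaries with debug symbols land under ./progs:
    $RUN gdb progs/fsck/fsck.reiser4
}

# Dry run -- prints the steps without touching the system:
build_reiser4progs echo
```

The <code>--enable-full-static</code> flag matters for debugging: a static binary keeps all of libreiser4 inside the executable, so gdb sees its symbols without needing separately installed debug libraries.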
Debugfs.reiser4

=== NAME ===
debugfs.reiser4 - the program for debugging a reiser4 filesystem.

=== SYNOPSIS ===
 debugfs.reiser4 [ options ] FILE

=== DESCRIPTION ===
debugfs.reiser4 is the reiser4 filesystem debug program. You can use it to inspect the internal reiser4 filesystem structures.

=== COMMON OPTIONS ===
; -V, --version : prints the program version.
; -?, -h, --help : prints the program help.
; -y, --yes : assumes an answer of 'yes' to all questions.
; -f, --force : forces debugfs to use the whole disk, not a block device or mounted partition.
; -c, --cache N : sets the tree cache node count to the passed value.
This strongly affects the behavior of libreiser4: speed, tree allocation, etc.

=== BROWSING OPTIONS ===
; -k, --cat : browses the passed file, like the standard cat and ls programs.

=== PRINT OPTIONS ===
; -t, --print-tree : prints the internal tree.
; -b, --print-block N : prints the block associated with the passed block number.
; -n, --print-nodes FILE : prints all nodes that the passed file lies in.
; -i, --print-file : prints all items that the passed file consists of.
; -s, --print-super : prints both super blocks: the master super block and the format-specific one.
; -j, --print-journal : prints the journal with not yet committed transactions (if any).
; -a, --print-alloc : prints the block allocator data.
; -d, --print-oid : prints the oid allocator data.

=== METADATA OPTIONS ===
; -P, --pack-metadata : fetches the filesystem metadata and writes it to the standard output.
; -U, --unpack-metadata : reads a filesystem metadata stream from the standard input and constructs a new filesystem based on the metadata.

 debugfs.reiser4 --pack-metadata <FS1> | debugfs.reiser4 --unpack-metadata <FS2>

and then

 debugfs.reiser4 --pack-metadata <FS2>

produces a stream equivalent to the first one.

=== PLUGIN OPTIONS ===
; -p, --print-profile : prints the plugin profile. This is the set of default plugins used for all parts of a filesystem: format, nodes, files, directories, hashes, etc. If --override is specified, prints the modified plugins.
; -l, --print-plugins : prints all plugins libreiser4 knows about.
; -o, --override TYPE=PLUGIN, ... : overrides the default plugin of type "TYPE" with the plugin "PLUGIN" in the plugin profile.
=== EXAMPLES ===
 debugfs.reiser4 -o nodeptr=nodeptr41,hash=rupasov_hash /dev/hda2

=== REPORTING BUGS ===
Report bugs to {{listaddress}}

=== SEE ALSO ===
* [[measurefs.reiser4|measurefs.reiser4(8)]]
* [[mkfs.reiser4|mkfs.reiser4(8)]]
* [[fsck.reiser4|fsck.reiser4(8)]]

=== AUTHOR ===
This manual page was written by Yury Umanets <umka@namesys.com>

[[category:Reiser4]]
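The pack/unpack equivalence described under METADATA OPTIONS can be checked mechanically. In this sketch everything is an invention of the example: the <code>round_trip</code> helper takes its dump and load commands as parameters so the plumbing can be demonstrated with stand-ins; in real use they would be <code>debugfs.reiser4 --pack-metadata</code> and <code>debugfs.reiser4 --unpack-metadata</code> run against scratch devices (unpacking overwrites the target).

```shell
#!/bin/sh
# round_trip DUMP LOAD FS1 FS2 - sketch of the equivalence check from
# METADATA OPTIONS: pack FS1, unpack the stream into FS2, pack FS2,
# then compare the two streams byte for byte. DUMP and LOAD are
# parameters so the logic can run without a reiser4 device; with the
# real tool they are "debugfs.reiser4 --pack-metadata" and
# "debugfs.reiser4 --unpack-metadata".
round_trip() {
    dump="$1"; load="$2"; fs1="$3"; fs2="$4"
    $dump "$fs1" > /tmp/stream1
    $load "$fs2" < /tmp/stream1
    $dump "$fs2" > /tmp/stream2
    cmp -s /tmp/stream1 /tmp/stream2
}

# Demo stand-ins: "cat" packs (dumps a file), "store" unpacks (writes
# its stdin into the target file).
store() { cat > "$1"; }
printf 'master sb\nformat sb\n' > /tmp/fs1
round_trip cat store /tmp/fs1 /tmp/fs2 && echo 'round trip OK'
```

If <code>cmp</code> reports a difference with the real tool, the metadata did not survive the round trip, which would itself be a bug worth reporting to the mailing list.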
Debugreiserfs

=== NAME ===
debugreiserfs - the debugging tool for the [[ReiserFS]] filesystem.

=== SYNOPSIS ===
 debugreiserfs [ -dDJmoqpuSV ] [ -j device ] [ -B file ] [ -1 N ] ''device''

=== DESCRIPTION ===
<tt>debugreiserfs</tt> sometimes helps to solve problems with ReiserFS filesystems. When run without options it prints the super block of the ReiserFS filesystem found on the ''device''. ''device'' is the special file corresponding to the device (e.g. /dev/hdXX for an IDE disk partition or /dev/sdXX for a SCSI disk partition).

=== OPTIONS ===
; -j device : prints the contents of the journal. The option -p allows it to pack the journal with the other metadata into the archive.
; -J : prints the journal header.
; -d : prints the formatted nodes of the internal tree of the filesystem.
; -D : prints the formatted nodes of all used blocks of the filesystem.
; -m : prints the contents of the bitmap (slightly useful).
; -o : prints the objectid map (slightly useful).
; -B ''file'' : takes the list of [[FAQ/bad-block-handling|bad blocks]] stored in the internal ReiserFS tree and translates it into an ASCII list written to the specified ''file''.
; -1 blocknumber : prints the specified block of the filesystem.
; -p : extracts the filesystem's metadata with <tt>debugreiserfs -p /dev/xxx | gzip -c > xxx.gz</tt>. None of your data is packed unless a filesystem corruption is present, in which case the whole block containing the corruption is packed. You [[mailinglists|send us the output]], and we use it to create a filesystem with the same structure as yours using <tt>debugreiserfs -u</tt>. When the data file is not too large, this usually allows us to quickly reproduce and debug the problem.
; -u : builds a ReiserFS filesystem image with <tt>gunzip -c xxx.gz | debugreiserfs -u /dev/image</tt> from metadata previously packed with <tt>debugreiserfs -p</tt>. The resulting image is not identical to the original filesystem, because mostly only metadata was packed, but the filesystem structure is completely recreated.
; -S : when -S is not specified, -p deals only with blocks marked used in the filesystem bitmap. With this option set, debugreiserfs will work with the entire device.
; -q : when -p is in use, suppresses showing the speed of progress.

=== AUTHOR ===
This version of <tt>debugreiserfs</tt> has been written by Vitaly Fertman.

=== BUGS ===
Please report bugs to the ReiserFS developers {{listaddress}}, providing as much information as possible: your hardware, kernel, patches, settings, and all printed messages; also check the syslog file for any related information.
=== SEE ALSO ===
* [[reiserfsck|reiserfsck(8)]]
* [[mkreiserfs|mkreiserfs(8)]]

[[category:ReiserFS]]
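The -p workflow above (dump the metadata, gzip it, send it to the developers) is a one-liner, but the plumbing is easy to get wrong, so here it is as a tiny function. This is a sketch: <code>pack_fs_metadata</code> is a name invented for the example, and the dumper command is a parameter so the pipeline can be tried with <code>cat</code> on an ordinary file; in real use the dumper is <code>debugreiserfs -p</code> on the device.

```shell
#!/bin/sh
# pack_fs_metadata DUMPER DEV OUT - sketch of the "-p" workflow above:
# run DUMPER on DEV and store the gzipped stream in OUT. With the real
# tool, DUMPER is "debugreiserfs -p"; the demo below uses "cat" on a
# plain file so the pipeline can run without a ReiserFS device.
pack_fs_metadata() {
    dumper="$1"; dev="$2"; out="$3"
    # $dumper is deliberately unquoted so it may carry its own flags.
    $dumper "$dev" | gzip -c > "$out"
}

# Real use:  pack_fs_metadata 'debugreiserfs -p' /dev/sdb1 metadata.gz
# Demo with a stand-in dumper:
printf 'superblock\nbitmap\n' > /tmp/meta-src
pack_fs_metadata cat /tmp/meta-src /tmp/meta.gz
```

Gzipping matters here: the metadata stream is mostly structured, highly compressible data, so the archive you mail to the list is typically far smaller than the raw dump.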
FAQ

#REDIRECT [[FAQ ReiserFS]]

FAQ/bad-block-handling

Bad block handling in [[ReiserFS]] is supported since [[reiserfsprogs]] v3.6.12-pre1.

=== How can I get the list of bad blocks on my harddrive? ===
To figure out whether the harddrive has bad blocks, run

 /sbin/badblocks [-b <reiserfs-block-size>] ''device''

The result is the list of bad blocks on the device; save it somewhere.
Do not forget to pass the reiserfs block size to the [http://manpages.ubuntu.com/manpages/karmic/man8/badblocks.8.html badblocks] program if you are going to use the resulting list of bad blocks with the ReiserFS utilities. The default ReiserFS block size is 4k; you can also get it from the [[debugreiserfs]] ''device'' output.

=== I have bad blocks on my hard drive, what do I do? ===
You can try to write to all bad blocks with the [http://www.gnu.org/software/coreutils/ dd(1)] or [http://www.garloff.de/kurt/linux/ddrescue/ dd_rescue] program; the drive will probably be able to remap them to good ones (modern drives do this in response to writes, but not reads). Understand that drives that start having problems with bad blocks very often decay rapidly and go bad, and consider buying a new drive to save yourself from experiencing that.

=== I have bad blocks in the system ReiserFS area, what do I do? ===
ReiserFS can handle only those bad blocks that belong to the data area; it cannot handle bad blocks in the ReiserFS system area: the super block, journal, and bitmap. If the drive does not remap them (see [[#I have bad blocks on my hard drive, what do I do?|above]]), then you cannot use this partition with ReiserFS; use [http://www.garloff.de/kurt/linux/ddrescue/ dd_rescue] to make a backup and run [[reiserfsck]] on the backup.

=== How can I create a ReiserFS filesystem on a block device with bad blocks? ===
If you have the list of bad blocks of the device in a file (see [[#How can I get the list of bad blocks on my harddrive?|above]]), you can use the following:

 [[mkreiserfs]] --badblocks file ''device''

Remember that the block size of ReiserFS is 4k by default; pass the same block size to the [http://e2fsprogs.sourceforge.net/e2fsprogs-release.html badblocks] program.

=== How can I check a ReiserFS filesystem with bad blocks? ===
If you just want to check a filesystem, no extra option to [[reiserfsck]] is needed.
If you need to fix the list of bad blocks on the ReiserFS partition, use: reiserfsck --badblocks file ''device'' where <tt>"file"</tt> contains the list of '''ALL''' bad blocks on the device. If you need to rebuild a ReiserFS partition on a block device with bad blocks, you must '''ALWAYS''' specify the '''FULL''' list of bad blocks: reiserfsck --rebuild-tree --badblocks file ''device'' where <tt>"file"</tt> contains the list of '''ALL''' bad blocks on the device (see [[#How can I get the list of bad blocks on my harddrive?|above]]). === How can I adjust the bad block list on a ReiserFS partition? === If you need to adjust the list of bad blocks, you can use: [[reiserfstune]] --badblocks file ''device'' or [[reiserfstune]] --add-badblocks file ''device'' where <tt>"file"</tt> contains the list of blocks to be marked as bad. The <tt>--badblocks</tt> option clears the existing list of bad blocks on the ReiserFS before adding the given list, whereas <tt>--add-badblocks</tt> just adds the given list to the list of bad blocks already on the ReiserFS partition. If the ReiserFS has some corruptions and [[reiserfstune]] refuses to run, use [[reiserfsck]] instead (see [[#How can I check a ReiserFS filesystem with bad blocks?|above]]). === How can I get the list of bad blocks saved in the reiserfs? === To get the list of blocks that are marked bad on a ReiserFS partition, run [[debugreiserfs]] -B file ''device'' where <tt>''file''</tt> is the name of the file in which the list should be stored. Remember that if the ReiserFS partition has fatal corruptions in the tree, the list of bad blocks can become unavailable (see [[#How can I get the list of bad blocks on my harddrive?|above]]). === How can I mark a block as bad on a mounted ReiserFS?
=== You need to apply the patch corresponding to your kernel version: * [http://ftp.icm.edu.pl/packages/linux-reiserfs/misc-patches/linux-2.4.19-badblocks.diff linux-2.4.19-badblocks.diff] * [http://ftp.icm.edu.pl/packages/linux-reiserfs/misc-patches/linux-2.4.22-badblocks.diff linux-2.4.22-badblocks.diff] The patch provides new <tt>ioctl()</tt> commands for ReiserFS that allow one to mark a given block as used/free in the block allocation bitmap without unmounting the filesystem. Then use the program [[FAQ/bad-block-handling/reiserfs-add-badblock.c|reiserfs-add-badblock.c]] as follows: reiserfs-add-badblock /path_to/reiserfs-mount-point ''block'' used If you have the list of bad blocks of the block device saved in a file named <tt>file</tt>, and the ReiserFS on this block device is mounted at <tt>/path_to/reiserfs-mount-point</tt>, you can use: while read r; do reiserfs-add-badblock /path_to/reiserfs-mount-point $r used done < ''file'' [[category:ReiserFS]]
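The while-read loop above can be wrapped in a slightly more defensive helper. reiserfs-add-badblock is the tool built from the patch above; the function name and the mount-point/file arguments are placeholders of my own:

```shell
# Sketch of the loop above as a function. The mount point and list file are
# passed as arguments (placeholders). Non-numeric lines are skipped so stray
# text in the list file never reaches the reiserfs-add-badblock ioctl helper.
mark_bad_blocks() {
    mnt=$1; file=$2
    while read -r blk; do
        case $blk in
            ''|*[!0-9]*) echo "skipping non-numeric line: $blk" >&2 ;;
            *) reiserfs-add-badblock "$mnt" "$blk" used ;;
        esac
    done < "$file"
}
```

Validating each line matters here because badblocks(8) output can pick up warnings or blank lines when redirected together with stderr.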
Then use the program [[FAQ/bad-block-handling/reiserfs-add-badblock.c|reiserfs-add-badblock.c]] as the following: reiserfs-add-badblock /path_to/reiserfs-mount-point ''block'' used If you have the list of bad blocks of the block device saved in the file with the name file, and the ReiserFS on this block device is mounted to <tt>/path_to/reiserfs-mount-point</tt> you can use: while read r; do reiserfs-add-badblock /path_to/reiserfs-mount-point $r used done < ''file'' [[category:ReiserFS]] cb8f93a76106ef7f4546baff9645aa07d67bf55d 1469 1468 2009-06-27T04:53:57Z Chris goe 2 anchors Bad block handling in [[ReiserFS]] is supported since [[reiserfsprogs]] v3.6.12-pre1. === How can I get the list of bad blocks on my harddrive? === To figure out if the harddrive has bad blocks or not you can run /sbin/badblocks [-b <reiserfs-block-size>] ''device'' the result is the list of bad blocks on the device, save it somewhere. Do not forget to specify the reiserfs-block-size to the [[badblocks]] program if you are going to use the result list of bad blocks with ReiserFS utilities. The default ReiserFS block size if 4k by default, you can also get it from [[debugreiserfs]] device output. === I have bad blocks on my hard drive, what do I do? === You can try to write to all bad blocks with [http://www.gnu.org/software/coreutils/ dd(1)] or [http://www.garloff.de/kurt/linux/ddrescue/ dd_rescue] program, the drive will probably be able to remap them to good ones (modern drives do this in response to write, but not reads). Understand that drives that start having problems with bad blocks very often rapidly decay and go bad, and consider buying a new drive to save yourself from experiencing that. === I have bad blocks in the system ReiserFS area, what do I do? === ReiserFS can handle only those bad blocks that belong to the data area, and cannot handle bad blocks of the ReiserFS system area -- super block, journal, bitmap. 
If the drive does not remap them (see [[#I have bad blocks on my hard drive, what do I do]]), then you cannot use this partition with ReiserFS, use [http://www.garloff.de/kurt/linux/ddrescue/ dd_rescue] to make a backup, run [[reiserfsck]] on the backup. === How can I create a ReiserFS filesystem on the block device with bad blocks? === If you have the list of bad blocks of the device in the file (see [[How can I get the list of bad blocks on my harddrive?]]), then you can use the following: [[mkreiserfs]] --badblocks file ''device'' Remember that the block size of the ReiserFS is 4k by default, specify the same block size to [[badblocks]] program. === How can I check a ReiserFS filesystem with bad blocks? === If you want to just check a filesystem, there should be no extra option to [[reiserfsck]]. If you need to fix the list of bad blocks on the reiserfs partition, use: reiserfsck --badblocks file ''device'' where <tt>"file"</tt> contains the list of '''ALL''' bad blocks on the device. If you need to rebuild a ReiserFS partition on the block device with bad blocks, you must '''ALWAYS''' specify the '''FULL''' list of bad blocks: reiserfsck --rebuild-tree --bad-badblocks file ''device'' where <tt>"file"</tt> contains the list of '''ALL''' bad blocks on the device. (see [[#How can I get the list of bad blocks on my harddrive?]]) === How can I adjust the bad block list on a ReiserFS partition? === If you need to adjust the list of bad blocks you can use: [[reiserfstune]] --badblocks file ''device'' or [[reiserfstune]] --add-badblocks file ''device'' where <tt>"file"</tt> contains the list of blocks to be marked as bad. The <tt>--badblocks</tt> option clears the list of bad blocks on the ReiserFS before adding the given list as the list of bad blocks, whereas <tt>--add-badblocks</tt> just adds the list to the list of bad blocks on ReiserFS partition. 
If the ReiserFS has some corruptions and [[reiserfstune]] refuses to run, use [[reiserfsck]] instead (see [[#How can I check a ReiserFS filesystem with bad blocks]]). === How can I get the list of bad blocks saved in the reiserfs? === To get the list of blocks that are marked bad on ReiserFS partition, run [[debugreiserfs]] -B file ''device'' where <tt>''file''</tt> is the filename of the file where the list should be stored in. Remember that if the ReiserFS partition has fatal corruptions in the tree, the list of bad blocks can become unavailable (see [[#How can I get the list of bad blocks on my harddrive?]]) === How can I mark a block as bad on a mounted ReiserFS? === You need to apply the following patch corresponding to your kernel version: * [http://ftp.icm.edu.pl/packages/linux-reiserfs/misc-patches/linux-2.4.19-badblocks.diff linux-2.4.19-badblocks.diff] * [http://ftp.icm.edu.pl/packages/linux-reiserfs/misc-patches/linux-2.4.22-badblocks.diff linux-2.4.22-badblocks.diff] The patch provides the new <tt>ioctl()</tt> commands for the ReiserFS that allows one to mark a given block as used/free in the block allocation bitmap without unmounting the filesystem. Then use the program [[FAQ/bad-block-handling/reiserfs-add-badblock.c|reiserfs-add-badblock.c]] as the following: reiserfs-add-badblock /path_to/reiserfs-mount-point ''block'' used If you have the list of bad blocks of the block device saved in the file with the name file, and the ReiserFS on this block device is mounted to <tt>/path_to/reiserfs-mount-point</tt> you can use: while read r; do reiserfs-add-badblock /path_to/reiserfs-mount-point $r used done < ''file'' [[category:ReiserFS]] 00c381d4779de338aafc5778c3f3584bd1f42026 1468 1465 2009-06-27T04:50:01Z Chris goe 2 block is variable Bad block handling in [[ReiserFS]] is supported since [[reiserfsprogs]] v3.6.12-pre1. === Does my harddrive have bad blocks? How can I get the list of bad blocks on my harddrive? 
=== To figure out if the harddrive has bad blocks or not you can run /sbin/badblocks [-b <reiserfs-block-size>] ''device'' the result is the list of bad blocks on the device, save it somewhere. Do not forget to specify the reiserfs-block-size to the [[badblocks]] program if you are going to use the result list of bad blocks with ReiserFS utilities. The default ReiserFS block size if 4k by default, you can also get it from [[debugreiserfs]] device output. === I have bad blocks on my hard drive, what do I do? === You can try to write to all bad blocks with [http://www.gnu.org/software/coreutils/ dd(1)] or [http://www.garloff.de/kurt/linux/ddrescue/ dd_rescue] program, the drive will probably be able to remap them to good ones (modern drives do this in response to write, but not reads). Understand that drives that start having problems with bad blocks very often rapidly decay and go bad, and consider buying a new drive to save yourself from experiencing that. === I have bad blocks in the system reiserfs area, what do I do? === ReiserFS can handle only those bad blocks that belong to the data area, and cannot handle bad blocks of the ReiserFS system area -- super block, journal, bitmap. If the drive does not remap them (see I have bad blocks on my hard drive, what do I do), then you cannot use this partition with ReiserFS, use [http://www.garloff.de/kurt/linux/ddrescue/ dd_rescue] to make a backup, run [[reiserfsck]] on the backup. === How can I create a ReiserFS filesystem on the block device with bad blocks? === If you have the list of bad blocks of the device in the file (see How can I get the list of bad blocks on my harddrive), then you can use the following: [[mkreiserfs]] --badblocks file ''device'' Remember that the block size of the ReiserFS is 4k by default, specify the same block size to [[badblocks]] program. === How can I check a ReiserFS filesystem with bad blocks? 
=== If you want to just check a filesystem, there should be no extra option to [[reiserfsck]]. If you need to fix the list of bad blocks on the reiserfs partition, use: reiserfsck --badblocks file ''device'' where <tt>"file"</tt> contains the list of '''ALL''' bad blocks on the device. If you need to rebuild a ReiserFS partition on the block device with bad blocks, you must '''ALWAYS''' specify the '''FULL''' list of bad blocks: reiserfsck --rebuild-tree --bad-badblocks file ''device'' where <tt>"file"</tt> contains the list of '''ALL''' bad blocks on the device. (see How can I get the list of bad blocks on my harddrive) === How can I adjust the bad block list on a ReiserFS partition? === If you need to adjust the list of bad blocks you can use: [[reiserfstune]] --badblocks file ''device'' or [[reiserfstune]] --add-badblocks file ''device'' where <tt>"file"</tt> contains the list of blocks to be marked as bad. The <tt>--badblocks</tt> option clears the list of bad blocks on the ReiserFS before adding the given list as the list of bad blocks, whereas <tt>--add-badblocks</tt> just adds the list to the list of bad blocks on ReiserFS partition. If the ReiserFS has some corruptions and [[reiserfstune]] refuses to run, use [[reiserfsck]] instead (see [[#How can I check a ReiserFS filesystem with bad blocks]]). === How can I get the list of bad blocks saved in the reiserfs? === To get the list of blocks that are marked bad on ReiserFS partition, run [[debugreiserfs]] -B file ''device'' where <tt>''file''</tt> is the filename of the file where the list should be stored in. Remember that if the ReiserFS partition has fatal corruptions in the tree, the list of bad blocks can become unavailable (see [[#How can I get the list of bad blocks on my harddrive]]) === How can I mark a block as bad on a mounted ReiserFS? 
=== You need to apply the following patch corresponding to your kernel version: * [http://ftp.icm.edu.pl/packages/linux-reiserfs/misc-patches/linux-2.4.19-badblocks.diff linux-2.4.19-badblocks.diff] * [http://ftp.icm.edu.pl/packages/linux-reiserfs/misc-patches/linux-2.4.22-badblocks.diff linux-2.4.22-badblocks.diff] The patch provides the new <tt>ioctl()</tt> commands for the ReiserFS that allows one to mark a given block as used/free in the block allocation bitmap without unmounting the filesystem. Then use the program [[FAQ/bad-block-handling/reiserfs-add-badblock.c|reiserfs-add-badblock.c]] as the following: reiserfs-add-badblock /path_to/reiserfs-mount-point ''block'' used If you have the list of bad blocks of the block device saved in the file with the name file, and the ReiserFS on this block device is mounted to <tt>/path_to/reiserfs-mount-point</tt> you can use: while read r; do reiserfs-add-badblock /path_to/reiserfs-mount-point $r used done < ''file'' [[category:ReiserFS]] ecf5d5434005d85b7a9a5498d5741a73bc4bda03 1465 1464 2009-06-27T04:39:46Z Chris goe 2 sublink Bad block handling in [[ReiserFS]] is supported since [[reiserfsprogs]] v3.6.12-pre1. === Does my harddrive have bad blocks? How can I get the list of bad blocks on my harddrive? === To figure out if the harddrive has bad blocks or not you can run /sbin/badblocks [-b <reiserfs-block-size>] ''device'' the result is the list of bad blocks on the device, save it somewhere. Do not forget to specify the reiserfs-block-size to the [[badblocks]] program if you are going to use the result list of bad blocks with ReiserFS utilities. The default ReiserFS block size if 4k by default, you can also get it from [[debugreiserfs]] device output. === I have bad blocks on my hard drive, what do I do? 
=== You can try to write to all bad blocks with [http://www.gnu.org/software/coreutils/ dd(1)] or [http://www.garloff.de/kurt/linux/ddrescue/ dd_rescue] program, the drive will probably be able to remap them to good ones (modern drives do this in response to write, but not reads). Understand that drives that start having problems with bad blocks very often rapidly decay and go bad, and consider buying a new drive to save yourself from experiencing that. === I have bad blocks in the system reiserfs area, what do I do? === ReiserFS can handle only those bad blocks that belong to the data area, and cannot handle bad blocks of the ReiserFS system area -- super block, journal, bitmap. If the drive does not remap them (see I have bad blocks on my hard drive, what do I do), then you cannot use this partition with ReiserFS, use [http://www.garloff.de/kurt/linux/ddrescue/ dd_rescue] to make a backup, run [[reiserfsck]] on the backup. === How can I create a ReiserFS filesystem on the block device with bad blocks? === If you have the list of bad blocks of the device in the file (see How can I get the list of bad blocks on my harddrive), then you can use the following: [[mkreiserfs]] --badblocks file ''device'' Remember that the block size of the ReiserFS is 4k by default, specify the same block size to [[badblocks]] program. === How can I check a ReiserFS filesystem with bad blocks? === If you want to just check a filesystem, there should be no extra option to [[reiserfsck]]. If you need to fix the list of bad blocks on the reiserfs partition, use: reiserfsck --badblocks file ''device'' where <tt>"file"</tt> contains the list of '''ALL''' bad blocks on the device. If you need to rebuild a ReiserFS partition on the block device with bad blocks, you must '''ALWAYS''' specify the '''FULL''' list of bad blocks: reiserfsck --rebuild-tree --bad-badblocks file ''device'' where <tt>"file"</tt> contains the list of '''ALL''' bad blocks on the device. 
(see How can I get the list of bad blocks on my harddrive) === How can I adjust the bad block list on a ReiserFS partition? === If you need to adjust the list of bad blocks you can use: [[reiserfstune]] --badblocks file ''device'' or [[reiserfstune]] --add-badblocks file ''device'' where <tt>"file"</tt> contains the list of blocks to be marked as bad. The <tt>--badblocks</tt> option clears the list of bad blocks on the ReiserFS before adding the given list as the list of bad blocks, whereas <tt>--add-badblocks</tt> just adds the list to the list of bad blocks on ReiserFS partition. If the ReiserFS has some corruptions and [[reiserfstune]] refuses to run, use [[reiserfsck]] instead (see [[#How can I check a ReiserFS filesystem with bad blocks]]). === How can I get the list of bad blocks saved in the reiserfs? === To get the list of blocks that are marked bad on ReiserFS partition, run [[debugreiserfs]] -B file ''device'' where <tt>''file''</tt> is the filename of the file where the list should be stored in. Remember that if the ReiserFS partition has fatal corruptions in the tree, the list of bad blocks can become unavailable (see [[#How can I get the list of bad blocks on my harddrive]]) === How can I mark a block as bad on a mounted ReiserFS? === You need to apply the following patch corresponding to your kernel version: * [http://ftp.icm.edu.pl/packages/linux-reiserfs/misc-patches/linux-2.4.19-badblocks.diff linux-2.4.19-badblocks.diff] * [http://ftp.icm.edu.pl/packages/linux-reiserfs/misc-patches/linux-2.4.22-badblocks.diff linux-2.4.22-badblocks.diff] The patch provides the new <tt>ioctl()</tt> commands for the ReiserFS that allows one to mark a given block as used/free in the block allocation bitmap without unmounting the filesystem. 
Then use the program [[FAQ/bad-block-handling/reiserfs-add-badblock.c|reiserfs-add-badblock.c]] as the following: reiserfs-add-badblock block used If you have the list of bad blocks of the block device saved in the file with the name file, and the ReiserFS on this block device is mounted to <tt>/path_to/reiserfs-mount-point</tt> you can use: while read r; do reiserfs-add-badblock /path_to/reiserfs-mount-point $r used done < ''file'' [[category:ReiserFS]] 9b679695fb039571a689352475ccc67a83c6e002 1464 1463 2009-06-27T04:38:56Z Chris goe 2 reiserfs-add-badblock.c Bad block handling in [[ReiserFS]] is supported since [[reiserfsprogs]] v3.6.12-pre1. === Does my harddrive have bad blocks? How can I get the list of bad blocks on my harddrive? === To figure out if the harddrive has bad blocks or not you can run /sbin/badblocks [-b <reiserfs-block-size>] ''device'' the result is the list of bad blocks on the device, save it somewhere. Do not forget to specify the reiserfs-block-size to the [[badblocks]] program if you are going to use the result list of bad blocks with ReiserFS utilities. The default ReiserFS block size if 4k by default, you can also get it from [[debugreiserfs]] device output. === I have bad blocks on my hard drive, what do I do? === You can try to write to all bad blocks with [http://www.gnu.org/software/coreutils/ dd(1)] or [http://www.garloff.de/kurt/linux/ddrescue/ dd_rescue] program, the drive will probably be able to remap them to good ones (modern drives do this in response to write, but not reads). Understand that drives that start having problems with bad blocks very often rapidly decay and go bad, and consider buying a new drive to save yourself from experiencing that. === I have bad blocks in the system reiserfs area, what do I do? === ReiserFS can handle only those bad blocks that belong to the data area, and cannot handle bad blocks of the ReiserFS system area -- super block, journal, bitmap. 
If the drive does not remap them (see I have bad blocks on my hard drive, what do I do), then you cannot use this partition with ReiserFS, use [http://www.garloff.de/kurt/linux/ddrescue/ dd_rescue] to make a backup, run [[reiserfsck]] on the backup. === How can I create a ReiserFS filesystem on the block device with bad blocks? === If you have the list of bad blocks of the device in the file (see How can I get the list of bad blocks on my harddrive), then you can use the following: [[mkreiserfs]] --badblocks file ''device'' Remember that the block size of the ReiserFS is 4k by default, specify the same block size to [[badblocks]] program. === How can I check a ReiserFS filesystem with bad blocks? === If you want to just check a filesystem, there should be no extra option to [[reiserfsck]]. If you need to fix the list of bad blocks on the reiserfs partition, use: reiserfsck --badblocks file ''device'' where <tt>"file"</tt> contains the list of '''ALL''' bad blocks on the device. If you need to rebuild a ReiserFS partition on the block device with bad blocks, you must '''ALWAYS''' specify the '''FULL''' list of bad blocks: reiserfsck --rebuild-tree --bad-badblocks file ''device'' where <tt>"file"</tt> contains the list of '''ALL''' bad blocks on the device. (see How can I get the list of bad blocks on my harddrive) === How can I adjust the bad block list on a ReiserFS partition? === If you need to adjust the list of bad blocks you can use: [[reiserfstune]] --badblocks file ''device'' or [[reiserfstune]] --add-badblocks file ''device'' where <tt>"file"</tt> contains the list of blocks to be marked as bad. The <tt>--badblocks</tt> option clears the list of bad blocks on the ReiserFS before adding the given list as the list of bad blocks, whereas <tt>--add-badblocks</tt> just adds the list to the list of bad blocks on ReiserFS partition. 
If the ReiserFS has some corruptions and [[reiserfstune]] refuses to run, use [[reiserfsck]] instead (see [[#How can I check a ReiserFS filesystem with bad blocks]]). === How can I get the list of bad blocks saved in the reiserfs? === To get the list of blocks that are marked bad on ReiserFS partition, run [[debugreiserfs]] -B file ''device'' where <tt>''file''</tt> is the filename of the file where the list should be stored in. Remember that if the ReiserFS partition has fatal corruptions in the tree, the list of bad blocks can become unavailable (see [[#How can I get the list of bad blocks on my harddrive]]) === How can I mark a block as bad on a mounted ReiserFS? === You need to apply the following patch corresponding to your kernel version: * [http://ftp.icm.edu.pl/packages/linux-reiserfs/misc-patches/linux-2.4.19-badblocks.diff linux-2.4.19-badblocks.diff] * [http://ftp.icm.edu.pl/packages/linux-reiserfs/misc-patches/linux-2.4.22-badblocks.diff linux-2.4.22-badblocks.diff] The patch provides the new <tt>ioctl()</tt> commands for the ReiserFS that allows one to mark a given block as used/free in the block allocation bitmap without unmounting the filesystem. Then use the programm [[reiserfs-add-badblock.c]] as the following: reiserfs-add-badblock block used If you have the list of bad blocks of the block device saved in the file with the name file, and the ReiserFS on this block device is mounted to <tt>/path_to/reiserfs-mount-point</tt> you can use: while read r; do reiserfs-add-badblock /path_to/reiserfs-mount-point $r used done < ''file'' [[category:ReiserFS]] 2f4b92ef6a775ad51dc9d24c9550d07827f3e680 1463 1462 2009-06-27T04:35:05Z Chris goe 2 formatting fixes Bad block handling in [[ReiserFS]] is supported since [[reiserfsprogs]] v3.6.12-pre1. === Does my harddrive have bad blocks? How can I get the list of bad blocks on my harddrive? 
=== To figure out if the harddrive has bad blocks or not you can run /sbin/badblocks [-b <reiserfs-block-size>] ''device'' the result is the list of bad blocks on the device, save it somewhere. Do not forget to specify the reiserfs-block-size to the [[badblocks]] program if you are going to use the result list of bad blocks with ReiserFS utilities. The default ReiserFS block size if 4k by default, you can also get it from [[debugreiserfs]] device output. === I have bad blocks on my hard drive, what do I do? === You can try to write to all bad blocks with [http://www.gnu.org/software/coreutils/ dd(1)] or [http://www.garloff.de/kurt/linux/ddrescue/ dd_rescue] program, the drive will probably be able to remap them to good ones (modern drives do this in response to write, but not reads). Understand that drives that start having problems with bad blocks very often rapidly decay and go bad, and consider buying a new drive to save yourself from experiencing that. === I have bad blocks in the system reiserfs area, what do I do? === ReiserFS can handle only those bad blocks that belong to the data area, and cannot handle bad blocks of the ReiserFS system area -- super block, journal, bitmap. If the drive does not remap them (see I have bad blocks on my hard drive, what do I do), then you cannot use this partition with ReiserFS, use [http://www.garloff.de/kurt/linux/ddrescue/ dd_rescue] to make a backup, run [[reiserfsck]] on the backup. === How can I create a ReiserFS filesystem on the block device with bad blocks? === If you have the list of bad blocks of the device in the file (see How can I get the list of bad blocks on my harddrive), then you can use the following: [[mkreiserfs]] --badblocks file ''device'' Remember that the block size of the ReiserFS is 4k by default, specify the same block size to [[badblocks]] program. === How can I check a ReiserFS filesystem with bad blocks? 
=== If you want to just check a filesystem, there should be no extra option to [[reiserfsck]]. If you need to fix the list of bad blocks on the reiserfs partition, use: reiserfsck --badblocks file ''device'' where <tt>"file"</tt> contains the list of '''ALL''' bad blocks on the device. If you need to rebuild a ReiserFS partition on the block device with bad blocks, you must '''ALWAYS''' specify the '''FULL''' list of bad blocks: reiserfsck --rebuild-tree --bad-badblocks file ''device'' where <tt>"file"</tt> contains the list of '''ALL''' bad blocks on the device. (see How can I get the list of bad blocks on my harddrive) === How can I adjust the bad block list on a ReiserFS partition? === If you need to adjust the list of bad blocks you can use: [[reiserfstune]] --badblocks file ''device'' or [[reiserfstune]] --add-badblocks file ''device'' where <tt>"file"</tt> contains the list of blocks to be marked as bad. The <tt>--badblocks</tt> option clears the list of bad blocks on the ReiserFS before adding the given list as the list of bad blocks, whereas <tt>--add-badblocks</tt> just adds the list to the list of bad blocks on ReiserFS partition. If the ReiserFS has some corruptions and [[reiserfstune]] refuses to run, use [[reiserfsck]] instead (see [[#How can I check a ReiserFS filesystem with bad blocks]]). === How can I get the list of bad blocks saved in the reiserfs? === To get the list of blocks that are marked bad on ReiserFS partition, run [[debugreiserfs]] -B file ''device'' where <tt>''file''</tt> is the filename of the file where the list should be stored in. Remember that if the ReiserFS partition has fatal corruptions in the tree, the list of bad blocks can become unavailable (see [[#How can I get the list of bad blocks on my harddrive]]) === How can I mark a block as bad on a mounted ReiserFS? 
=== You need to apply the following patch corresponding to your kernel version: * [http://ftp.icm.edu.pl/packages/linux-reiserfs/misc-patches/linux-2.4.19-badblocks.diff linux-2.4.19-badblocks.diff] * [http://ftp.icm.edu.pl/packages/linux-reiserfs/misc-patches/linux-2.4.22-badblocks.diff linux-2.4.22-badblocks.diff] The patch provides the new <tt>ioctl()</tt> commands for the ReiserFS that allows one to mark a given block as used/free in the block allocation bitmap without unmounting the filesystem. Then use the programm [[Image:reiserfs-add-badblock]] as the following: reiserfs-add-badblock block used If you have the list of bad blocks of the block device saved in the file with the name file, and the ReiserFS on this block device is mounted to <tt>/path_to/reiserfs-mount-point</tt> you can use: while read r; do reiserfs-add-badblock /path_to/reiserfs-mount-point $r used done < ''file'' [[category:ReiserFS]] dafab25cea76fa7eb9c6743f6fe161f4d13b3c87 1462 1452 2009-06-27T04:19:28Z Chris goe 2 formatting fixes Bad block handling in [[ReiserFS]] is supported since [[reiserfsprogs]] v3.6.12-pre1. === Does my harddrive have bad blocks? How can I get the list of bad blocks on my harddrive? === To figure out if the harddrive has bad blocks or not you can run /sbin/badblocks [-b <reiserfs-block-size>] ''device'' the result is the list of bad blocks on the device, save it somewhere. Do not forget to specify the reiserfs-block-size to the [[badblocks]] program if you are going to use the result list of bad blocks with ReiserFS utilities. The default ReiserFS block size if 4k by default, you can also get it from [[debugreiserfs]] device output. === I have bad blocks on my hard drive, what do I do? === You can try to write to all bad blocks with [http://www.gnu.org/software/coreutils/ dd(1)] or [http://www.garloff.de/kurt/linux/ddrescue/ dd_rescue] program, the drive will probably be able to remap them to good ones (modern drives do this in response to write, but not reads). 
Understand that drives that start having problems with bad blocks very often rapidly decay and go bad, and consider buying a new drive to save yourself from experiencing that. === I have bad blocks in the system reiserfs area, what do I do? === ReiserFS can handle only those bad blocks that belong to the data area, and cannot handle bad blocks of the ReiserFS system area -- super block, journal, bitmap. If the drive does not remap them (see I have bad blocks on my hard drive, what do I do), then you cannot use this partition with ReiserFS, use [http://www.garloff.de/kurt/linux/ddrescue/ dd_rescue] to make a backup, run [[reiserfsck]] on the backup. === How can I create a ReiserFS filesystem on the block device with bad blocks? === If you have the list of bad blocks of the device in the file (see How can I get the list of bad blocks on my harddrive), then you can use the following: [[mkreiserfs]] --badblocks file ''device'' Remember that the block size of the ReiserFS is 4k by default, specify the same block size to [[badblocks]] program. === How can I check a ReiserFS filesystem with bad blocks? === If you want to just check a filesystem, there should be no extra option to [[reiserfsck]]. If you need to fix the list of bad blocks on the reiserfs partition, use: reiserfsck --badblocks file ''device'' where <tt>"file"</tt> contains the list of '''ALL''' bad blocks on the device. If you need to rebuild a ReiserFS partition on the block device with bad blocks, you must '''ALWAYS''' specify the '''FULL''' list of bad blocks: reiserfsck --rebuild-tree --bad-badblocks file ''device'' where <tt>"file"</tt> contains the list of '''ALL''' bad blocks on the device. (see How can I get the list of bad blocks on my harddrive) === How can I adjust the bad block list on a ReiserFS partition? 
=== If you need to adjust the list of bad blocks, you can use: reiserfstune --badblocks file device or reiserfstune --add-badblocks file device where file contains the list of blocks to be marked as bad. The --badblocks option clears the list of bad blocks on the reiserfs before adding the given list as the list of bad blocks, whereas --add-badblocks just adds the list to the list of bad blocks on the reiserfs partition. If the reiserfs has some corruptions and reiserfstune refuses to run, use reiserfsck instead (see How can I check a reiserfs filesystem with bad blocks). === How can I get the list of bad blocks saved in the reiserfs? === To get the list of blocks that are marked bad on a reiserfs partition, run debugreiserfs -B file device where file is the name of the file in which the list should be stored. Remember that if the reiserfs partition has fatal corruptions in the tree, the list of bad blocks can become unavailable; see How can I get the list of bad blocks on my harddrive. === How can I mark a block as bad on a mounted reiserfs? === You need to apply the following patch corresponding to your kernel version: linux-2.4.19-badblocks.diff linux-2.4.22-badblocks.diff The patch provides new ioctl() commands for ReiserFS that allow one to mark a given block as used/free in the block allocation bitmap without unmounting the filesystem. Then use the program reiserfs-add-badblock as follows: reiserfs-add-badblock block used If you have the list of bad blocks of the block device saved in a file named file, and the reiserfs on this block device is mounted at /path_to/reiserfs-mount-point, you can use: while read REPLY; do reiserfs-add-badblock /path_to/reiserfs-mount-point $REPLY used; done < file [[category:ReiserFS]] eabfd24eb5011454efe8f7d844e0d6141a13cc1b 1452 2009-06-27T03:47:59Z Chris goe 2 http://web.archive.org/web/20061113154756/www.namesys.com/bad-block-handling.html Bad block handling in ReiserFS.
f0dff84a2580517f18b2d0faa424792c0028028f FAQ/bad-block-handling/reiserfs-add-badblock.c 0 47 1467 1466 2009-06-27T04:45:55Z Chris goe 2 category added, copyright <pre> <nowiki>
/* -*- C -*- */
/*
 * reiserfs-add-badblock.c
 * Copyright: probably Vitaly Fertman <vitaly@namesys.com>
 */
#include <stdio.h>
#include <stdlib.h>	/* strtoul() */
#include <unistd.h>
#include <fcntl.h>
#include <string.h>
#include <errno.h>

/* kludge to avoid including kernel headers */
/* from include/asm-i386/ioctl.h */
#define _IOC_NRBITS	8
#define _IOC_TYPEBITS	8
#define _IOC_SIZEBITS	14
#define _IOC_DIRBITS	2
#define _IOC_NRSHIFT	0
#define _IOC_TYPESHIFT	(_IOC_NRSHIFT+_IOC_NRBITS)
#define _IOC_SIZESHIFT	(_IOC_TYPESHIFT+_IOC_TYPEBITS)
#define _IOC_DIRSHIFT	(_IOC_SIZESHIFT+_IOC_SIZEBITS)
#define _IOC_WRITE	1U
#define _IOC(dir,type,nr,size) \
	(((dir) << _IOC_DIRSHIFT) | \
	 ((type) << _IOC_TYPESHIFT) | \
	 ((nr) << _IOC_NRSHIFT) | \
	 ((size) << _IOC_SIZESHIFT))
#define _IOW(type,nr,size) _IOC(_IOC_WRITE,(type),(nr),sizeof(size))
#define REISERFS_IOC_USED   _IOW(0xCD,2,long)
#define REISERFS_IOC_FREE   _IOW(0xCD,3,long)
#define REISERFS_IOC_BADCNT _IOW(0xCD,4,long)

int main( int argc, char **argv )
{
	int return_code;

	return_code = 1;
	if( argc == 4 ) {
		int fd;

		fd = open( argv[ 1 ], O_RDONLY );
		if( fd != -1 ) {
			unsigned long blocknr;
			char *end_ptr;

			blocknr = strtoul( argv[ 2 ], &end_ptr, 0 );
			if( *end_ptr == 0 ) {
				int cmd;

				cmd = 0;
				if( !strcmp( argv[ 3 ], "free" ) ) {
					cmd = REISERFS_IOC_FREE;
				} else if( !strcmp( argv[ 3 ], "used" ) ) {
					cmd = REISERFS_IOC_USED;
				} else if( !strcmp( argv[ 3 ], "set" ) ) {
					cmd = REISERFS_IOC_BADCNT;
				} else {
					fprintf( stderr,
						 "%s: third argument `%s' is neither \"free\" "
						 "nor \"used\" or \"set\"", argv[ 0 ], argv[ 3 ] );
				}
				if( cmd != 0 ) {
					if( ioctl( fd, cmd, blocknr ) == -1 ) {
						fprintf( stderr, "%s: ioctl: %s\n",
							 argv[ 0 ], strerror( errno ) );
					} else {
						return_code = 0;
					}
				}
			} else {
				fprintf( stderr, "%s: `%s' is not valid unsigned long integer\n",
					 argv[ 0 ], argv[ 2 ] );
			}
		} else {
			fprintf( stderr, "%s: cannot open `%s': %s\n",
				 argv[ 0 ], argv[ 1 ], strerror( errno ) );
		}
	} else {
		fprintf( stderr, "Usage: %s path blocknr free|used|set\n", argv[ 0 ] );
	}
	return return_code;
}
</nowiki> </pre> [[category:ReiserFS]] af441b174f3b481ccf5d0d13260b84eb69cb05a9
a8c6ecc1b4c45f19ce148d8544f54dde9c3fe4bc FAQ/change fs 0 44 1431 2009-06-27T01:29:01Z Chris goe 2 http://web.archive.org/web/20071005050242/www.namesys.com/change_fs.html We'd first suggest you copy everything on your regular ext2 partition to the spare partition. If the spare is smaller than your original data, compress your whole partition into a tar.gz file on the spare partition.
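The backup step can be sketched as follows. The directory names are hypothetical stand-ins for the real mount points (the old ext2 root and the spare partition), so the commands can be tried safely before doing it for real:

```shell
#!/bin/sh
# Archive the contents of the "old" partition onto the spare one, then verify
# that the archive is listable before trusting it. old_fs/ and spare/ are
# stand-ins for the real mount points.
mkdir -p old_fs/etc spare
echo "important data" > old_fs/data.txt
echo "config"         > old_fs/etc/app.conf
tar czf spare/backup.tar.gz -C old_fs .             # back up everything
tar ztf spare/backup.tar.gz > /dev/null || exit 1   # listable => intact
# Later, after mkreiserfs and mounting the new filesystem, restore with
# something like: tar xzf spare/backup.tar.gz -C /mnt/new
echo "backup contains $(tar ztf spare/backup.tar.gz | wc -l) entries"
```

Listing the archive with `tar ztf` before wiping the original partition is cheap insurance against a truncated or corrupt backup.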
After making sure this worked correctly and all your files are there (and/or you have a good backup!), copy the /bin, /etc, /lib and /sbin directories (plus the mkreiserfs utility) to the spare partition. This is in preparation for booting off that partition so you can reformat the original ext2 partition as reiserfs. Next, make a boot diskette with a [[ReiserFS]]-enabled kernel on it (don't forget to run <tt>lilo</tt> on the diskette!) and make sure it works (so you won't get stuck with an unbootable system.) After booting this diskette, you should get a <tt>lilo:</tt> prompt. Enter <tt>"linux root=/dev/hd init=/bin/bash"</tt> at the lilo prompt. Your system should boot and stop at a bare shell prompt. At the prompt (now off of your spare partition), try <tt>tar ztvf</tt> to test the backup archive if you did the compression step above (just to make sure you can still get at your data). If you're convinced that you want to go ahead with the conversion, run [[mkreiserfs]] on your original ext2 partition, '''ERASING ALL DATA THERE''' (but you have the backup of course.) Then, mount the new partition somewhere as reiserfs and <tt>cd</tt> to the mount directory. Make sure the amount of free disk space is what you expected (just as a double check), and untar your backup archive to restore everything. At this point, your data is on ReiserFS and you should be able to rerun <tt>lilo</tt> (make sure your default kernel supports ReiserFS!) on your normal root partition to get the kernel set up again. Unmount all partitions and reboot. If everything goes as planned, Linux should say <tt>"VFS: Mounted root ... as reiserfs"</tt> at some point, and you should be all set. If the system doesn't boot, you still have your backup diskette and can just boot off the spare partition to fix things. Make sure you have the right ReiserFS-enabled kernel installed beforehand and all configuration files (especially <tt>lilo.conf</tt>) are up to date.
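As an extra safety net, the "make sure all your files are there" check after restoring can be done by checksumming both trees and comparing. A sketch with hypothetical directory names (scratch directories stand in for the unpacked backup and the new ReiserFS mount):

```shell
#!/bin/sh
# Verify a restore by checksumming both trees and diffing the results.
# backup_tree/ and new_fs/ are hypothetical stand-ins for the backup copy
# and the freshly restored ReiserFS root.
mkdir -p backup_tree new_fs
echo "data-a" > backup_tree/a
echo "data-b" > backup_tree/b
cp -a backup_tree/. new_fs/                    # stands in for the untar step
( cd backup_tree && find . -type f -exec md5sum {} + | sort ) > before.sum
( cd new_fs      && find . -type f -exec md5sum {} + | sort ) > after.sum
diff before.sum after.sum && echo "restore verified"
```

An empty diff means every restored file matches its backed-up counterpart byte for byte, which is a stronger check than eyeballing `ls` output.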
This mini-howto has been written by: ------------------------------------------------------------- Matt T. Yourst Massachusetts Institute of Technology yourst@mit.edu 617.225.7690 513 French House 476 Memorial Drive - Cambridge, MA 02136 ------------------------------------------------------------- [[category:ReiserFS]] 434452ec7d91658eba1a2922fe7e4b4856ea6ab7 FAQ/halt.patch 0 49 1485 2009-06-27T05:53:23Z Chris goe 2 http://web.archive.org/web/20061113154820/www.namesys.com/halt.patch <pre> <nowiki>
--- halt.orig	Mon Jul 30 17:26:24 2001
+++ halt	Mon Jul 30 17:26:36 2001
@@ -165,7 +165,7 @@
 # Remount read only anything that's left mounted.
 #echo $"Remounting remaining filesystems (if any) readonly"
-mount | awk '/ext2/ { print $3 }' | while read line; do
+mount | awk '/ext2|reiserfs/ { print $3 }' | while read line; do
 	mount -n -o ro,remount $line
 done
</nowiki> </pre> [[category:ReiserFS]] a6d186e263fe10c6053715ce77f26fb89b3a27fb FAQ/potato part 0 41 1547 1423 2009-07-02T19:42:37Z Chris goe 2 debian/potato has been discontinued in 2003 I succeeded with little trouble in installing [http://www.debian.org/releases/2.2/ Debian potato] with [[ReiserFS]] as its root partition. I'll summarise the process for those who are interested. I created a boot floppy with a kernel supporting ReiserFS and a root floppy with [[mkreiserfs|mkfs.reiserfs]] on it. I did find it useful to mount and umount each mkfs-ed partition, since it takes a good deal of time for the first install. I rebooted using my ReiserFS kernel but the potato root disk. I then installed as normal, with the following exceptions: * The Debian install procedure does not know about ReiserFS, so I had to mount the various filesystems manually. Then I picked another item from the menu. The first time this fails since the install program does not know the root partition is mounted (on target), but then it figures this out and proceeds normally.
* The install procedure writes a correct /etc/fstab for all partitions except the root, to which it incorrectly gives a file type of ext2. Simply edit this before rebooting. * The install procedure installs a kernel which knows nothing about ReiserFS. I just replaced this with my kernel, and put the appropriate modules in <tt>/lib/modules</tt>. * I use GRUB instead of Debian's <tt>mbr</tt> or <tt>LILO</tt>, so I simply <tt>dpkg --delete mbr</tt> and install grub before rebooting the first time. The latest GRUB supports ReiserFS. I chose to do this because I had a machine with a hardware fault which needed frequent rebooting. This of course led to a number of corruption problems even with ReiserFS, but we eventually ran the problem down to faulty SIMMs (thanks to memtest86). To avoid possible problems, I've reinstalled again, again using ReiserFS. The machine has run beautifully ever since. I've been digging through the [[mailinglists|mailing list]], and seen occasional remarks about FAQs, but no pointer to a reiserfs [[FAQ]]. Can I presume none exists at this time? -- [mailto:LeBlanc@mcc.ac.uk Dr A V Le Blanc] [[category:ReiserFS]] 35ea9fdfed7c10201fb5892a1d44c8f7b208e7e5 1423 2009-06-27T01:12:12Z Chris goe 2 http://web.archive.org/web/20070706011240/www.namesys.com/potato_part.html
6a9be93b4ead9eb7fe99a7fffb0c3c8c3c0aa737 FAQ/rc.sysinit.patch 0 48 1483 1482 2009-06-27T05:50:38Z Chris goe 2 category added <pre> <nowiki>
--- rc.sysinit.orig	Mon Jul 30 22:58:45 2001
+++ rc.sysinit	Mon Jul 30 22:57:16 2001
@@ -211,7 +211,8 @@
 _RUN_QUOTACHECK=0
 ROOTFSTYPE=`grep " / " /proc/mounts | awk '{ print $3 }'`
-if [ -z "$fastboot" -a "$ROOTFSTYPE" != "nfs" ]; then
+if [ -z "$fastboot" -a "$ROOTFSTYPE" != "nfs" \
+	-a "$ROOTFSTYPE" != "reiserfs" ]; then
 	STRING=$"Checking root filesystem"
 	echo $STRING
</nowiki> </pre> [[category:ReiserFS]] e885589bc5b1f104a8f3a68b54e3fa46c94610a6 FAQ/small blocks 0 43 1432 1430 2009-06-27T01:30:17Z Chris goe 2 last summary entry was wrong, the correct source was: http://web.archive.org/web/20070927003333/http://www.namesys.com/small_blocks.html You need to tune the reiserfs parameters for this. '''Note: these numbers are in units of 4k blocks.''' <tt>JOURNAL_TRANS_MAX</tt> must be less than <tt>JOURNAL_BLOCK_COUNT</tt>, and not be bigger than the default (1024). Every time a transaction starts, the log needs at least <tt>JOURNAL_TRANS_MAX</tt> log blocks available, and transactions are flushed if there aren't enough log blocks ready. The default ratio is JOURNAL_BLOCK_COUNT / JOURNAL_TRANS_MAX = 8 If the ratio is 1, you more or less have synchronous updates to metadata, and things get '''very slow'''.
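A quick way to sanity-check candidate journal sizes against these constraints is a little arithmetic. The ratios tried here (2, 4 and 8) and the MAX_BATCH = TRANS_MAX - 1 convention are illustrative assumptions, not values blessed by the developers; nothing below touches a filesystem:

```shell
#!/bin/sh
# Derive candidate journal parameter triples from a chosen JOURNAL_BLOCK_COUNT.
# Ratios 2, 4, 8 and MAX_BATCH = TRANS_MAX - 1 are illustrative assumptions.
COUNT=512
: > journal_params.txt
for ratio in 2 4 8; do
    trans=$((COUNT / ratio))
    batch=$((trans - 1))
    # Constraints: TRANS_MAX must stay below BLOCK_COUNT, and MAX_BATCH
    # must stay below TRANS_MAX.
    [ "$trans" -lt "$COUNT" ] || exit 1
    [ "$batch" -lt "$trans" ] || exit 1
    echo "JOURNAL_BLOCK_COUNT=$COUNT JOURNAL_TRANS_MAX=$trans JOURNAL_MAX_BATCH=$batch" \
        >> journal_params.txt
done
cat journal_params.txt
```

With COUNT=512 the ratio-4 row reproduces the 512/128/127 combination that users reported as working well.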
As you try different values for <tt>JOURNAL_BLOCK_COUNT</tt>, try ratios of 2, 4, and 8 for <tt>JOURNAL_TRANS_MAX</tt>. <tt>JOURNAL_MAX_BATCH</tt> controls the size of a joinable transaction. To keep overhead low, multiple transactions are combined into one before being written to the log. This number must be less than <tt>JOURNAL_TRANS_MAX</tt>. In theory, the smallest possible <tt>JOURNAL_BLOCK_COUNT</tt> or <tt>JOURNAL_TRANS_MAX</tt> size is around 48 blocks. Transactions this size will be slow, as a lot of the journal speed comes from the batching described above. You might also want to shrink <tt>RESERVED_FOR_PRESERVE_LIST</tt>. It is 500 right now, but in theory could be set to 0. It used to be space used only by the preserve lists, which no longer exist. We really need to do tests with this value at 0. Some users have done a little benchmarking and found: JOURNAL_BLOCK_COUNT 512 JOURNAL_TRANS_MAX 128 JOURNAL_MAX_BATCH 127 works pretty well. A journal block count of 256 gave poor performance with just about every other combination of parameters. But the performance will depend on how hard you hit the metadata. If you don't do a lot of work on small files (<16k), or if you don't do many file creations/deletions, smaller journal sizes might work well for you. [[category:ReiserFS]] c62ed9f9da5ac7879f8315319cb142e794b8db05 1430 2009-06-27T01:23:10Z Chris goe 2 http://web.archive.org/web/20071005050242/www.namesys.com/change_fs.html
64ebfe498f01d703ea9d73c50718141f3a826fad FAQ/streams 0 40 1416 2009-06-27T00:43:16Z Chris goe 2 http://web.archive.org/web/20070706011356/www.namesys.com/stream_ans.html We would like to implement for [[ReiserFS]] the functional equivalent of streams by expanding current directory functionality. I'll wait until we are ready to perform active coding on this before sending a long email with a proposed architectural definition, but I'll highlight the main features here since this seems to be an active thread at the moment.
* Inheritance of stat data. * Inheritance of filebodies, with inheritance being specified by writing to some name like filename/.meta. * Objects that are both files and directories (the sticky point here is telling open whether to open it as a file or a directory. Linus seemed to have resolved this though in this thread on lkml.) The filebody is that file within the directory which is set as the default filebody for the directory by virtue of it having a link to the name .anonymous (or some such name). * Hidden entries (a la .snapshot directories on Network Appliance servers) which open()/lookup() sees but the default readdir() method does not. * Attributes (including stat() data) are to be such hidden entries, and accessible as normal files if you choose to explicitly name them. * read() on the file directory_name_fu/.archive (or some such name) returns an archive (written in something not far from the format of the inheritance syntax) containing the directory tree rooted at directory_name_fu (and there should exist a write() plugin method for reiserfs such that cat'ing an archive file of the right syntax to directory_name_fu2/.archive causes the archive to get unpacked as directory_name_fu2.) I agree that tar is an ugly format. I'd like to encourage folks to consider Viro's valid concerns as the breaking of the egg that precedes the omelette. We should minimize breakage to the extent that we can without losing our desired functionality, but if we cease to innovate for fear of growing pains then the OS dies. I think that we can do stream functionality much better than NT or Mac by carefully analyzing what streams give people that is useful, and then implementing each such thing as a separate orthogonal feature. NTFS streams have properties that are rigidly bound to each other, and if we orthogonalize/decouple the features the expressive power goes way up. A. to F. constitutes my analysis so far.
Using the above features, a file with all that we value streams for in it can be implemented as: # a directory containing a .anonymous file that contains the default stream. Depending on the GUI objectives, this .anonymous file might be a symlink to .archive # All of the files in the directory are set as inheriting their stat data from the parent directory. # GUI tools learn to use the .archive feature for moving things. I think Miguel is right that Windows does some things better than Unix, and we should think about what is better, and then do their good stuff better than they were able to. I think that the timing might be right for Linux to start trying to be more than Unix/NT/Mac has ever been before us. In the past it made sense to just try to catch up to them. Now 2.4 has gotten us caught up to them in the bulk of their features. He who has the dominant market share needs to lead the pack, or else the pack will lose to other packs as it mills around doing nothing. Linux dominates the Unix market nowadays. [[category:ReiserFS]] 50c859e7c13a05d2709fd155edaef528056250a3 FAQ ReiserFS 0 42 4203 3721 2016-12-25T19:41:59Z DusanC 30310 DusanC moved page [[FAQ]] to [[FAQ ReiserFS]]: Make place to create separate Reiser4 and ReiserFS FAQs <font color=red>This FAQ is very [[ReiserFS]] centric and often a bit dated. The [[Reiser4]] filesystem is mentioned as ''upcoming''. Be sure to search the [[mailinglists|mailing list archives]] and help update this FAQ - Thanks!</font> __TOC__ === What are the specs for ReiserFS: maximum number of files, of files a directory can have, of sub-dirs in a dir, of links to a file, maximum file size, maximum filesystem size, etc.? === Specifications for [[ReiserFS]]: {|cellpadding="5" cellspacing="0" border="1" | '''property''' || '''3.5''' || '''3.6''' |- | max number of files || 2<sup>32</sup>-3 => 4 Gi-3 || 2<sup>32</sup>-3 => 4 Gi-3 |- | max number of files a dir can have || 518701895 (but in practice this value is limited by hash function.
r5 hash allows about 1 200 000 file names without collisions) || 2<sup>32</sup>-4 => 4 Gi-4 (but in practice this value is limited by hash function. r5 hash allows about 1 200 000 file names without collisions) |- | max file size || 2<sup>31</sup>-1 => 2 Gi-1 || 2<sup>60</sup> bytes => 1 Ei, but page cache limits this to 8 Ti on architectures with 32 bit int |- | max number of links to a file || 2<sup>16</sup> => 64 Ki || 2<sup>32</sup> => 4 Gi |- | max filesystem size || 2<sup>32</sup> (4K) blocks => 16 Ti || 2<sup>32</sup> (4K) blocks => 16 Ti |} ReiserFS does '''meta-data journaling''', enabling fast crash recovery without the expense of full '''data journaling'''. There [http://marc.info/?l=reiserfs-devel&m=100895310422415&w=2 are separate patches from Chris Mason] that implement full data journaling for ReiserFS for Linux 2.4.16: * [http://web.archive.org/web/20060517214944/http://ftp.suse.com/pub/people/mason/patches/data-logging/ ftp.suse.com/pub/people/mason/patches/] * [http://web.archive.org/web/20060517214944/http://ftp.suse.com/pub/people/mason/patches/intermezzo-alpha/ ftp.suse.com/pub/people/mason/patches/intermezzo-alpha/] '''Note''': Full data journaling is considered by many to be a good way to achieve file data integrity across system crashes. However, although file data may appear to be consistent from the kernel point of view, since there is no API exported to userspace to control transactions, we may end up in a situation where the application makes two write requests (as part of one logical transaction) but only one of these gets journaled before the system crashes. From the application point of view, we may then end up with inconsistent data in the file. Such issues should be addressed with the upcoming [[Reiser4]]. Such an API will be exported to userspace and all programs that need transactions will be able to use it. === Mount fails after reiserfsck --rebuild-tree failure === When [[reiserfsck]] --rebuild-tree is run, the first thing it does is to set the root inode value to -1.
This makes the filesystem unmountable. (So, if [[reiserfsck]] fails later on because the filesystem contains serious errors, the filesystem cannot be mounted.) Therefore, once [[reiserfsck]] --rebuild-tree has failed for one of your filesystems, mounting of this partition is disabled. To correct the error, first check that you have the latest [[reiserfsprogs]] package installed. If that fails, please send a bug report to our [[mailinglists|mailing list]] and be ready to answer our questions. === Why is the execution time for a <tt>find . -type f | xargs cat {} \;</tt> command much longer when using ReiserFS than for the same command when using ext2? === This effect is observed if the measured file set was produced by untarring an archive that was not created from a ReiserFS partition (or by copying files from a non-ReiserFS partition, or by running a program that writes a bunch of files in some order). This is because the <tt>readdir()</tt> operation performed on the ReiserFS partition returns filenames not in the original write order but rather in some hash order (dependent on the hash function used). Thus when reading the files' contents, the hard drive heads must move when going from one file to another. If you want ReiserFS to outperform any other filesystem in your setup, here is one solution: copy the entire directory that you are not satisfied with to the same partition but with a different name (use <tt>cp -a</tt>), then remove the old directory and rename the new one to the old name. If the partition does not have enough space available, another approach is to <tt>tar(1)</tt> up the whole partition, clear it, and then untar the previously saved data. === Is quota support built into the vanilla 2.4 kernels for ReiserFS?
No, quota support for the 2.4 kernel branch is bundled separately; the patches by Chris Mason were once available [ftp://ftp.suse.com/pub/people/mason/patches/reiserfs/quota-2.4/ at SuSE] (gone) and are still [http://gd.tuwien.ac.at/utils/fs/reiserfs/quota-patches/ mirrored at TU-Wien]. The reason these patches were not included in the 2.4 kernel branch is that they implement a new quota format and need new quota code too, which is too big a change for the 2.4 series of kernels. Various Linux distribution vendors (e.g. [http://www.suse.com SuSE]) do ship reiserfs-quota-enabled kernels, though.

=== I am getting some errors in my kernel logs that I do not know how to interpret ===

Messages like:

 vs-13070: reiserfs_read_inode2: i/o failure occurred trying to find stat data of [1718696 1718710 0x0 SD]
 zam-7001: io error in reiserfs_find_entry

most likely accompanied by samples like the ones below are definite signs of hard disk problems (bad sectors):

 hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
 hda: dma_intr: error=0x40 { UncorrectableError }, LBAsect=6599945, sector=4286584
 end_request: I/O error, dev 03:03 (hda), sector 4286584

or

 scsi0: ERROR on channel 0, id 1, lun 0, CDB: Read (10) 00 00 01 ee 60 00 00 08 00
 Current sd 08:00: sense key Medium Error

or

 I/O error: dev 08:21, sector 65704

Messages about <tt>"access beyond end of device"</tt> can have many different causes, from not rebooting after fdisk requested it, to unfinished resizes, to data corruption. The following messages mean you have a noisy IDE cable, or one of too low quality for the chosen UDMA mode.
Try to replace the cable with a better one, or choose a slower UDMA mode:

 hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
 hda: dma_intr: error=0x84 { DriveStatusError BadCRC }
 hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
 hda: dma_intr: error=0x84 { DriveStatusError BadCRC }

If you see any message from [[ReiserFS]] that you cannot interpret, and there is nothing similar to the messages above around it, [[mailinglists|mail the message to us]] and we will explain it to you.

=== Will ReiserFS implement streams, extended attributes, etc.? ===

[[FAQ/streams|Here]] is the one-page answer.

=== ReiserFS appears to be very slow while the RAID is resyncing. Mounting takes several minutes. Once mounted, an <tt>ls(1)</tt> in the mounted directory hangs. Forever. Once the RAID is synced, things appear to work pretty well. How can that be fixed? ===

First of all, a patch that makes mounting faster has been included in the Linux kernel since 2.4.19. You can grab the patch for earlier kernels [http://gd.tuwien.ac.at/utils/fs/reiserfs/reiserfs-for-2.5/2.5.4.pending/07-reiserfs-bitmap-journal-read-ahead.diff here]. Also, the RAID drivers have a '''minimal guaranteed''' and a '''maximal possible''' RAID rebuild bandwidth. These values are controlled through the <tt>/proc/sys/dev/raid/speed_limit_min</tt> and <tt>/proc/sys/dev/raid/speed_limit_max</tt> sysctl variables (values are in KiB/s). It seems the RAID logic cannot always tell whether the disk subsystem is busy at a given time. When it thinks the disk subsystem is idle, it tries to rebuild the array at the <tt>speed_limit_max</tt> speed, which defaults to 100 MB per second. Decrease this value to something more suitable (a bit of experimentation might be needed).

=== I get "attempt to read past the end of the partition" error messages; is ReiserFS corrupted? ===

You changed your partition sizes, and then ran [[mkreiserfs]] before rebooting.
The kernel does not update its idea of the partition sizes until reboot (this is fixable, but nobody had fixed it as of Dec. 2001), so [[mkreiserfs]] created a filesystem with the wrong notion of how large the partition it sits on is. The filesystem's notion of the partition boundaries will persist past reboot even though the kernel's notion will change. So yes, it is corrupted. Some other kinds of metadata breakage can also lead to such messages.

=== Can I use VMware with ReiserFS? ===

VMware was tested on [http://www.suse.com/ SuSE Linux] with a [http://support.microsoft.com/gp/lifean18 Windows 98] guest OS on a [[ReiserFS]] partition. There is one trick at the beginning: the following line was added to the VMware config file:

 host.FSSupportLocking1 = 0x52654973 # (0x52654973 == *(u32 *) "ReIs")

Thanks to [mailto:gkade@bigbrother.net Gregory K. Ade] for this hint.

=== How do I install Debian potato with ReiserFS as the root partition? ===

[[FAQ/potato_part|Here]] are instructions by [mailto:LeBlanc@mcc.ac.uk Dr. A.V. Le Blanc].

=== Starting with Linux kernel v2.4.21 I cannot mount my FS anymore. Why? ===

Special sanity checks were added to the kernel code to prohibit mounting filesystems that are bigger than the underlying block device. If you now see this message on mount:

 Filesystem on xx:yy cannot be mounted because it is bigger than the device

you may need to run fsck or increase the size of your LVM partition. Or maybe you forgot to reboot after fdisk told you to. If you do not use LVM, this usually means you need to run <tt>[[reiserfsck]] --rebuild-sb</tt> on your filesystem and agree to change its recorded size to the proposed one.

=== Is it OK to use ReiserFS on a small storage device, e.g. a 16MB NAND flash block device? ===

[[FAQ/small_blocks|Here]] are instructions.

=== How do I change root from ext2 to ReiserFS without loss of data? ===

[[FAQ/change_fs|Here]] are instructions.
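The v2.4.21 sanity check mentioned above boils down to comparing the size recorded in the superblock against the size of the underlying block device. A minimal sketch of that comparison (the helper name and parameters are ours for illustration, not the kernel's code):

```python
def fs_fits_device(fs_blocks, block_size, device_bytes):
    """Mimic the mount-time sanity check: a filesystem whose recorded
    size exceeds the underlying block device must not be mounted."""
    return fs_blocks * block_size <= device_bytes

# A 1 GiB filesystem (262144 x 4K blocks) on a 2 GiB device mounts fine...
ok = fs_fits_device(fs_blocks=262144, block_size=4096, device_bytes=2 << 30)
# ...but the same filesystem on a shrunken 512 MiB volume is refused.
too_big = fs_fits_device(fs_blocks=262144, block_size=4096, device_bytes=512 << 20)
```

Running <tt>reiserfsck --rebuild-sb</tt> fixes the left-hand side of this comparison by rewriting the size stored in the superblock.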
=== <tt>mount: /dev/hda5 has wrong major or minor number</tt> - what does that mean? ===

The kernel does not know anything about [[ReiserFS]]: it is neither compiled in nor available as a module.

=== Will it be possible to read/write ReiserFS partitions created now with future versions of ReiserFS? ===

Yes. [[ReiserFS]] 3.6.x (Linux 2.4.x) works with both the old (3.5) and the new (3.6) formats. ReiserFS 3.5.x (Linux 2.2.x) can only work with the old (3.5) disk format. There is no way to convert the new (3.6) disk format to the old one (3.5), but the old (3.5) format can be converted to the new one (3.6) with the <tt>-o conv</tt> [[mount|mount option]].

=== The ReiserFS module doesn't insert properly - why? ===

After applying the patch, ''recompile'' the whole kernel including the modules target, reboot, then try to insert the module.

=== Can I use ReiserFS with software RAID? ===

Yes, for all RAID levels, using any Linux >= 2.4.1, but '''DO NOT''' use RAID with Linux 2.2.x: our journaling and their RAID code step on each other in the buffering code. Also, mirroring is '''not''' safe in the 2.2.x kernels, because online mirror rebuilds in 2.2.x break the write-ordering requirements for the log. If you crash in the middle of an online rebuild, your meta-data may be corrupted. The only RAID level that is safe with [[ReiserFS]] in the 2.2.x kernels is striping/concatenation.

=== Can I use ReiserFS with 3ware RAID? ===

Yes, but you need to use Linux 2.2.19 or later, for reasons other than [[ReiserFS]] - so if you should encounter problems, be suspicious that the bug might not be in ReiserFS. See the [http://web.archive.org/web/20030415160519/http://www.3ware.com/support/raid5techbulletin.shtml special instructions] (archive.org).

=== Why do things freeze on my IDE hard drive for annoying amounts of time? ===

Because when large writes are scheduled all at once, reads can starve.
A fix for this is evolving; the later your ReiserFS patch, the better we handle this.

=== <tt>du(1)</tt> says ReiserFS makes space efficiency worse. ===

Use <tt>df(1)</tt>, not <tt>du(1)</tt>, or use the ''raw'' option for <tt>du(1)</tt> if it is supported. Summing up <tt>st_blocks</tt> is less accurate than <tt>st_size</tt> for [[ReiserFS]] because we pack tails, and <tt>st_blocks</tt> rounds numbers up.

=== <tt>mkreiserfs(8)</tt> fails after repartitioning ===

The kernel requires you to reboot after repartitioning (for all filesystems). We intend to fix that.

=== Performance is poor, and my disk at 96% full still has free space. ===

Once a disk drive gets more than 85% full, performance starts to suffer unless you use a repacker (which isn't implemented yet). You can probably get away with 92%, but if you value performance you are making a mistake by keeping it any fuller. This is true for almost all filesystems. Because it packs tails together, [[ReiserFS]] packs more data into a given percentage used, but it is still subject to the rules for the maximum recommended percentage used. If you fill the whole disk with one copy and then mount it read-only, you can fully pack it without problems. Be sure to copy it from (or <tt>tar</tt> it from) a ReiserFS partition, so that the files are created in ReiserFS <tt>readdir()</tt> order, as this will improve performance.

=== Why do I get a signal 11 when compiling the kernel using ReiserFS and not ext2? ===

Your CPU is overheating and/or you have [http://www.bitwizard.nl/sig11/ bad RAM].

=== But it doesn't happen with ext2? ===

ext2 uses less heat-sensitive gates in the CPU :-) Seriously, ext2 and [[ReiserFS]] contain random differences, and overheating and bad RAM have random sensitivities. ([http://www.bitwizard.nl/sig11/ Signal 11] is not due to ReiserFS. One user had a cable blocking the fan; it did not affect ext2, but it wasn't until he fixed the cable-fan problem that ReiserFS worked.)
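Returning to the <tt>du(1)</tt> question above: <tt>du</tt> sums <tt>st_blocks</tt> (allocated 512-byte units), while a file's logical length is <tt>st_size</tt>, and the two can legitimately disagree. Tail packing needs ReiserFS, but the same gap can be demonstrated on any Linux filesystem with a sparse file, a sketch:

```python
import os
import tempfile

# Create a 1 MiB sparse file: the length is set, but no data blocks
# are written, so st_blocks stays (close to) zero.
fd, path = tempfile.mkstemp()
os.close(fd)
os.truncate(path, 1 << 20)

st = os.stat(path)
logical = st.st_size            # what ls -l and most tools report
allocated = st.st_blocks * 512  # what du(1) sums up
os.unlink(path)
```

On such a file <tt>du</tt> reports (nearly) nothing while <tt>ls -l</tt> reports 1 MiB; tail-packed small files on ReiserFS skew the two numbers in a similar way, which is why <tt>df(1)</tt> is the better gauge of space efficiency.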
=== Can I use ReiserFS on architectures other than i386? ===

Yes; starting from Linux [http://kernel.org/pub/linux/kernel/v2.4/ChangeLog-2.4.13 kernel 2.4.13], ReiserFS can be run on any Linux-supported arch.

=== I need a program which will help me in rebuilding/recreating my partition table. ===

[http://brzitwa.de/mb/gpart/ gpart] is a utility that handles ext2, FAT, Linux swap, HPFS, NTFS, FreeBSD and Solaris/x86 disklabels, Minix, and ReiserFS. It prints a proposed content for the primary partition table and is well documented.

=== What partition type should I use for ReiserFS? ===

[http://www.win.tue.nl/~aeb/partitions/partition_types.html Linux native filesystem] (83).

=== Can I use 32GB+ IDE hard drives with ReiserFS? ===

Yes, if you use Linux kernel 2.4 and up.

=== What about resizing ReiserFS? ===

This can be done with [[resize_reiserfs]].

=== What should I put into the fifth (aka dump, fs_freq) and the sixth (aka pass, fs_passno) fields of /etc/fstab for ReiserFS filesystems? ===

You'd put in <tt>"0 0"</tt>, e.g.

 /dev/sda3 /var reiserfs notail,nodev,nosuid,noexec <font color="red">0 0</font>

=== Why are ReiserFS filesystems not fscked on reboot after a crash? ===

Because [[ReiserFS]] provides journaling of meta-data. After a crash, the consistency of the filesystem is restored by replaying the transaction log.

=== Can I interactively repair a filesystem that was corrupted? ===

This is done with [[reiserfsck]].

=== Can I use <tt>dump(8)</tt> and <tt>restore(8)</tt> with ReiserFS? Any caveats? ===

No. <tt>dump(8)</tt> uses knowledge of the internal structure of ext2, and works together with <tt>restore(8)</tt>, which also uses ext2-specific knowledge, to back up ext2 files. dump and restore are specific to ext2 and will not work with [[ReiserFS]]. To back up ReiserFS files use <tt>tar(1)</tt>, which is universal and can be applied to almost any reasonable Linux filesystem.
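Since <tt>dump(8)</tt> reads ext2 structures directly, a ReiserFS backup has to go through an archive layer that works above the filesystem, as tar does. As an illustration (not a ReiserFS-specific tool), the same round trip can be sketched with Python's <tt>tarfile</tt> module:

```python
import os
import tarfile
import tempfile

def backup(src_dir, archive):
    """Archive a directory tree, preserving names, modes and mtimes."""
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(src_dir, arcname=os.path.basename(src_dir))

def restore(archive, dest_dir):
    """Unpack the archive under dest_dir."""
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(dest_dir)

# Round-trip a tiny tree through the archive.
work = tempfile.mkdtemp()
src = os.path.join(work, "data")
os.makedirs(src)
with open(os.path.join(src, "mail.txt"), "w") as f:
    f.write("queued message")
backup(src, os.path.join(work, "data.tar.gz"))
restore(os.path.join(work, "data.tar.gz"), os.path.join(work, "out"))
with open(os.path.join(work, "out", "data", "mail.txt")) as f:
    roundtrip = f.read()
```

For real backups you would of course use GNU tar itself (sparse-file handling, incrementals, etc.); the point is only that the archive layer, not the filesystem, does the work.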
It is well known among system administrators that <tt>dump(8)</tt> is more complete than Unix tar, and that there is quite a list of things that Unix tar will fail to back up properly. This is not true of GNU tar, which is quite complete. Basically, the only real disadvantage of GNU tar compared to <tt>dump(8)</tt> is speed. Unfortunately, because it shares its name with Unix <tt>tar(1)</tt>, people are reluctant to believe this. (Yes, GNU tar has incremental backups, etc.) We will performance-optimize ReiserFS backups for you (and the rest of the world) for $30K, which is not a lot if you are a large site spending a few hundred thousand on equipment for backups.

=== Does ReiserFS support snapshots? ===

No, but you can create [[ReiserFS]] on top of an [http://sourceware.org/lvm2/ LVM] logical volume and use LVM's snapshot capabilities.

=== Can I check ReiserFS filesystems for errors without unmounting them? ===

[[reiserfsck]] in checking mode may be run over filesystems mounted read-only. There is no official way to fix mounted filesystems, though: you MUST completely unmount your filesystem in order to have it fixed. If you have LVM, you can check the consistency of filesystems mounted read-write; here is the script contributed by Andreas Dilger:

=== What ReiserFS mount options should I use to get the performance winner for a mail server? ===

[http://archives.neohapsis.com/archives/postfix/2001-03/1148.html Craig Sanders answered] in detail: By the time I got around to running <tt>bonnie</tt>, the <tt>postmark</tt> and <tt>postal</tt> benchmarks had convinced me that <tt>notail</tt> was essential.

Host system:
* Debian GNU/Linux (of course :)
* Linux kernel 2.4.2 with the latest 20010305 ReiserFS patch
* dual P3-866 (256K cache)
* 512MB RAM
* [http://www.adaptec.com/en-US/support/scsi/u160/ASC-19160/ Adaptec 19160] SCSI controller

External drive box:
* [http://www.domex.com.tw/support/product/8230u.htm Domex 8230u] RAID controller, 32MB battery-backed cache
* 6 x 18GB IBM [http://www.hitachigst.com/tech/techlib.nsf/techdocs/85256AB8006A31E587256A78005A3610/$file/ddys_sp21.PDF DDYS-T18350M] drives

For this particular hardware, [[ReiserFS]]/notail on RAID5 was the clear performance winner for a mail server with lots of synced random I/O.

=== Does using ReiserFS mean I can just press the power-off button without running <tt>/sbin/shutdown</tt>? Does it mean there is no risk of data loss? ===

No, definitely not. As of now, [[ReiserFS]] only provides meta-data journaling - that is, it records which files have been created or opened, whether they have had their size changed, or where they have been relocated. It guarantees that the structure of the internal ReiserFS tree will be correct, thereby allowing you, after an unclean shutdown, to start back up without having to run fsck on all the files that have not been changed. Data in files that were being used at the time of the crash could have been corrupted; this is usual for most filesystems. Data-journaling filesystems guarantee that no garbage will be written into a file, but they don't guarantee that a file update will actually be applied. (Only [[Reiser4]] guarantees that filesystem operations are performed as atomic operations, and provides atomic transaction functionality.) [[ReiserFS]] guarantees neither that the file contents themselves are uncorrupted nor that no data is lost. Moreover, even if all of your system is on ReiserFS, many system components (daemons, database managers, etc.) require a proper shutdown procedure to function correctly. However, there is a [ftp://ftp.suse.com/pub/people/mason/patches/data-logging separate implementation of data logging] (dead) that will [http://marc.info/?l=reiserfs-devel&m=103472026011689&w=2 soon] go into the mainstream kernel.

=== How does ReiserFS support bad block handling? ===

This is covered [[FAQ/bad-block-handling|here]].

=== I have a motherboard with VIA MVP3 chipset and experience ReiserFS problems. ===
[mailto:woster73@yahoo.com William Oster] answers: If you are using a motherboard with a [http://www.via.com.tw/en/products/apollo/mvp3.jsp VIA MVP3] chipset, you may have [[ReiserFS]] problems caused by the way your kernel is configured for the so-called [http://lxr.linux.no/linux+v2.6.30/drivers/pci/quirks.c PCI quirks]. My experience is with kernels 2.2.18 and 2.2.19, but it may affect the 2.4.x series too if you are using the MVP3 chipset (popular in Socket 7 motherboards, such as those used by the AMD K6 and classic Pentium). I have confirmed this problem with several motherboards using the VIA MVP3 chipset, ReiserFS 3.5.29 to 3.5.32, and [http://lxr.linux.no/linux+v2.6.30/Documentation/scsi/ncr53c8xx.txt NCR 53c8xx SCSI]. But please note: it probably affects '''any controller which uses DMA and PCI bus mastering'''. Problems which I was inclined to attribute to ReiserFS were actually problems with this kernel [mis]configuration.

If you fit this profile, '''DO NOT''' enable the <tt>CONFIG_PCI_QUIRKS</tt> configuration option in the <tt>/usr/src/linux/.config</tt> file. Although the Linux documentation suggests that this option can be enabled if in doubt, '''DO NOT''' enable it. It was never intended for the VIA MVP3 chipset anyway. It affects the way DMA is handled, and the combination with ReiserFS (and possibly NCR SCSI) can cause random disk corruption which will eventually result in ReiserFS and/or SCSI errors. Evidently ReiserFS exercises the DMA and SCSI bus very thoroughly; the problems seem not to be as likely under the ext2 filesystem.

Check your <tt>/usr/src/linux/.config</tt> file. You are safe from this problem if you find this line:

 # CONFIG_PCI_QUIRKS is not set

Any other setting could be dangerous for MVP3-chipset ReiserFS users, especially when using PCI bus-mastering controllers such as the NCR 53c8xx series. Re-configure your kernel to disable the "PCI quirks" option, then <tt>make dep</tt>, rebuild, and reinstall.
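The <tt>.config</tt> inspection described above is easy to script. A small sketch (the function name is ours) applying the rule from the answer: the option must be absent or explicitly unset:

```python
def pci_quirks_disabled(config_text):
    """Return True if CONFIG_PCI_QUIRKS is not enabled in a kernel
    .config, following the rule from the answer above."""
    for raw in config_text.splitlines():
        line = raw.strip()
        if line == "# CONFIG_PCI_QUIRKS is not set":
            return True
        if line.startswith("CONFIG_PCI_QUIRKS="):
            # Any explicit =y means the quirks code is compiled in.
            return not line.endswith("=y")
    return True  # option absent entirely, e.g. kernels that dropped it

safe = pci_quirks_disabled("# CONFIG_PCI_QUIRKS is not set\nCONFIG_PCI=y\n")
unsafe = pci_quirks_disabled("CONFIG_PCI_QUIRKS=y\nCONFIG_PCI=y\n")
```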
=== I am having extensive problems using ReiserFS; it seems to have bugs all over the place. I'm not compiling with a [[#I am using RedHat 7.0 with gcc 2.96; why does ReiserFS seem unstable with it?|buggy compiler]]. What is happening? How can this be stable? ===

You have hardware problems. Really, you do. Even if the bugs don't show up with ext2, you have hardware problems (see [[#Why_do_I_get_a_signal_11_when_compiling_the_kernel_using_ReiserFS_and_not_ext2?|the signal 11 question]]). Most SuSE users use ReiserFS. Obscure bugs probably still exist, but if you find bugs as easily as when using Windows, you have bad RAM, a bad CPU, a bad cable, bad cooling, a [[#I have a motherboard with VIA MVP3 chipset and experience ReiserFS problems.|VIA chipset with PCI quirks turned on]], or other hardware or software-layer bugs. ReiserFS is stable. You can be sure that if bugs are encountered easily and commonly with normal usage patterns, it is not us. This does not mean that the next release won't somehow break something, though :-/ Real bug reports are, at the time of writing, outnumbered 10 to 1 by hardware bugs that trigger error messages. We are working on making our error messages better at catching hardware bugs and identifying them as such; there is only so far we can go in runtime consistency checking without serious speed reductions. We don't release software unless it goes through extensive testing, so if you don't think our testing could have missed the bug, it is probably hardware.

=== How can I put a label (like the one allowed by the <tt>-L</tt> option of <tt>mkfs.ext2</tt>) on a ReiserFS instance? ===

Currently, this feature is only implemented for the [[ReiserFS]] v3.6 disk format. Adding it to the v3.5 disk format would break the existing format, and there is not enough free space in the superblock.
You can set a label (and UUID) with a recent [[reiserfsprogs]] package on a [[ReiserFS]] v3.6 filesystem using the <tt>-l</tt> switch (<tt>-u</tt> for UUID) to [[reiserfstune]] (for existing partitions) or to [[mkreiserfs]] (for partitions being created). Support for labels and UUIDs was integrated into [[reiserfsprogs]] starting from version 3.x.1a.

=== Why, when I'm working on files (i.e. having open files) on my laptop, does ReiserFS access the disk every 5 seconds? This effectively prevents the disk from spinning down, i.e. APM modes from taking over, even when I'm not writing anything. ===

[mailto:bgraveland@hyperchip.com Brent Graveland] answers: It's the [http://kerneltrap.org/node/14148 atime] update. Every time you run <tt>sync(1)</tt>, the sync program's <tt>atime</tt> is updated. The next <tt>sync()</tt> writes this <tt>atime</tt> update, then <tt>sync(1)</tt> gets updated again.

=== RedHat does not unmount <tt>/</tt> (<tt>/dev/root</tt>) with ReiserFS on halt. How do I fix it? ===

RedHat users kindly provided these patches (not tested by us):
* [[FAQ/rc.sysinit.patch|rc.sysinit.patch]]
* [[FAQ/halt.patch|halt.patch]]

Note that if you have [http://www.redhat.com/docs/manuals/linux/RHL-7.2-Manual RedHat Linux 7.2] or later, you do not need these patches.

=== How do I run programs from the reiserfsprogs package on encrypted devices? ===

In order to access such encrypted entities, you need to use the [http://www.linux.org/docs/ldp/howto/Cryptoloop-HOWTO/loopdevice-setup.html losetup(8)] tool to bind your entity to a <tt>loop</tt> device.

=== Are there any recommendations ''for'' or ''against'' particular hard drive manufacturers for use with ReiserFS? ===

No; bad hard drives are not [[ReiserFS]]-specific but affect all filesystems. There is basically no preference: the general rule '''the faster the drive and the lower the seek time, the better''' applies as always. On the other hand, almost every hard drive manufacturer has a '''widely known''' broken series of hard drives.
The most recent example is [http://en.wikipedia.org/wiki/Deskstar_75GXP IBM's Deskstar] series of disks, especially the DTLA models produced in Hungary in 2000-2001. These are [http://ask.slashdot.org/article.pl?sid=01/10/04/0050238 known to fail very often], to the point that you probably don't want to use them even if you have already paid for them. Other Deskstar drives also seem to be a poor choice: IBM released a note that Deskstar drives should not run for more than 8 hours/day on average. These drives are also known to be very sensitive to temperature conditions and to fail on overheating. There is a [http://web.archive.org/web/20060315210819/http://www.ibmdeskstar75gxplitigation.com/ class-action lawsuit against IBM] over that drive series.

=== I am using RedHat 7.0 with gcc 2.96; why does ReiserFS seem unstable with it? ===

Use the most recent version of RedHat (gcc 2.96-85 or later with RedHat 7.2, although 7.1 is also okay for ReiserFS). The choice of an unstable, [http://gcc.gnu.org/gcc-2.96.html unreleased] version of gcc 2.96 by RedHat as the default gcc was a Slashdot controversy. [http://www.redhat.com/advice/speaks_gcc.html gcc 2.96 on RedHat 7.0 was unstable], and ReiserFS was one of the things that would fail under it. There are two gcc versions: 2.96 and 2.96-85. 2.96-85 works for ReiserFS; the other (the one on [http://www.redhat.com/docs/manuals/linux/RHL-7-Manual/ RedHat 7.0]) surely does not. Read the Linux kernel instructions about what compiler to use. The solution to code not working on broken compilers is the one RedHat has taken - fix the compiler. They [http://rhn.redhat.com/errata/RHBA-2002-055.html fixed] the compiler and thereby allowed the correctly compiled [[ReiserFS]] to work.

=== In my program I am using <tt>fsync(2)</tt> calls after each write to the file to guarantee the integrity of my file data, and this is very slow; how can I improve the performance? ===
Answer from Chris Mason: The main thing to remember is that <tt>fsync()</tt> introduces a bunch of disk writes and forces the FS to wait on the buffers. The key to keeping performance up is to make it easy for the FS to do as much as possible before the <tt>fsync()</tt> call. So, if your application modifies 3 files, and you want to make sure all 3 changes are safely on disk:

 write(file1)
 write(file2)
 write(file3)
 fsync(file1)
 fsync(file2)
 fsync(file3)

is much faster than:

 write(file1)
 fsync(file1)
 write(file2)
 fsync(file2)
 write(file3)
 fsync(file3)

It is also faster to write over existing bytes in a file than to append new bytes onto its end. When you overwrite existing bytes, you don't have to commit new metadata to disk on <tt>fsync()</tt>; the FS can just write the data blocks. This means fewer seeks. Likewise, the more you write to a single file before calling <tt>fsync()</tt>, the faster overall things will run:

 write(8k)
 fsync(file)

is much faster than:

 write(4k)
 fsync(file)
 write(4k)
 fsync(file)

Optimizing for those 3 things alone can make a huge performance difference overall.

Answer from Josh MacDonald: You have to understand that even using <tt>fsync()</tt> after every <tt>write()</tt> makes no guarantees: if the system crashes during either the <tt>write()</tt> or the <tt>fsync()</tt> operation, your data may be lost or corrupted. Suppose the <tt>fsync()</tt> does complete: does your application keep its data in multiple files? If so, and you need to <tt>write()</tt> to multiple files as part of a transaction, you have even greater problems. The only safe and easy way to implement some kind of transaction with the traditional filesystem guarantees is to use <tt>rename()</tt>:

# Keep all of your data in a single file.
# Periodically write a complete copy of your database to a temporary file.
# Rename the temporary file to the original database name.
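The three-step <tt>rename()</tt> recipe above, sketched in Python (<tt>os.replace()</tt> maps to <tt>rename(2)</tt>; the function name is ours):

```python
import os
import tempfile

def atomic_update(path, data):
    """Steps 2 and 3 of the recipe: write a complete copy to a temporary
    file in the same directory, flush it to disk, then rename it over
    the original. Readers see either the old contents or the new ones,
    never a partially written file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    try:
        os.write(fd, data)
        os.fsync(fd)  # data blocks reach the disk before the rename
    finally:
        os.close(fd)
    os.replace(tmp, path)  # rename(2): atomic within one filesystem

# Two successive updates; a crash between them leaves a complete file.
dbdir = tempfile.mkdtemp()
db = os.path.join(dbdir, "database")
atomic_update(db, b"state 1")
atomic_update(db, b"state 2")
with open(db, "rb") as f:
    final = f.read()
```

(For full durability after a crash you would additionally <tt>fsync()</tt> the containing directory so that the rename itself is on disk; the sketch omits that.)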
Addition from Nikita Danilov: One can implement something like a ''phase tree'' at user level and use <tt>rename()</tt> to atomically switch the root of the tree. This overcomes the "everything in one file" limitation, but adds the complexity of requiring crash recovery. Or stop your development for now and wait until the [[Reiser4]] filesystem is released: it has a transaction API exported to userspace, and that API would solve all of your problems.

== Our program needs to access a lot of working files. What is the recommended way to organize files to get the best results out of ReiserFS? Should all the files be placed in a single directory, or should the files be spread across a directory tree to limit the number of files per directory? Can you also summarize the relevant caching and locking effects? ==

Traditional file systems typically have poor performance when there are many files in a single directory, but not [[ReiserFS]]. These other file systems perform poorly because they use a linear search algorithm to find and replace entries in a directory, which means the file system must scan, on average, half the blocks of a directory for every access. Typically, applications are required to work around this problem by manually structuring a tree of directories, allowing each individual directory to remain limited in size. For example, see how the Squid web proxy stores a large collection of files. ReiserFS does not have this problem because it uses an internal tree to store all directories and file metadata. Directory operations remain efficient even for very large directories, so you can write your application free from this performance concern. However, there are several issues that complicate this matter, namely locking and locality.
The Linux VFS currently imposes locking restrictions that serialize many operations on directories, so if concurrent processes or threads will access the collection of files then you may be better off using multiple directories. [[Reiser4]] will improve upon this restriction, although it is still under development. ReiserFS attempts to store all of the files in a directory, along with the directory itself, in nearby locations on disk. An application may exploit this spatial locality if it can predict which files will be accessed with temporal locality. You may be better of using multiple directories to store your files if you can predict that many files within a directory will be accessed at the same time. To summarize, ReiserFS supports efficient access to large directories and most traditional file systems do not. However, locking and locality issues may guide your decision to use manually structured directory trees instead, at least until ReiserFS exports control over packing locality to users, and improves its locking. [[category:ReiserFS]] [[category:Reiser4]] 501eb411242f26043d6fe1ad465d4b6e17631128 3721 2691 2013-09-24T01:51:00Z Chris goe 2 404, to the wayback machine! <font color=red>This FAQ is very [[ReiserFS]] centric and often a bit dated. The [[Reiser4]] filesystem is mentioned as ''upcoming''. Be sure to search the [[mailinglists|mailing list archives]] and help update this FAQ - Thanks!</font> __TOC__ === What are the specs for ReiserFS: maximum number of files, of files a directory can have, of sub-dirs in a dir, of links to a file, maximum file size, maximum filesystem size, etc.? === Specifications for [[ReiserFS]]: {|cellpadding="5" cellspacing="0" border="1" | '''property''' || '''3.5''' || '''3.6''' |- | max number of files || 232-3 => 4 Gi - 3 || 232-3 => 4 Gi-3 |- | max number files a dir can have || 518701895 (but in practice this value is limited by hash function. 
r5 hash allows about 1 200 000 file names without collisions) || 232 - 4 => 4 Gi - 4 (but in practice this value is limited by hash function. r5 hash allows about 1 200 000 file names without collisions) |- | max file size || 231-1 => 2 Gi-1 || 260 - bytes => 1 Ei, but page cache limits this to 8 Ti on architectures with 32 bit int |- | max number links to a file || 216 => 64 Ki || 232 => 4 Gi |- | max filesystem size || 232 (4K) blocks => 16 Ti || 232 (4K) blocks => 16 Ti |} ReiserFS does '''meta-data journaling''', enabling fast crash recovery without the expense of full '''data journaling'''. There [http://marc.info/?l=reiserfs-devel&m=100895310422415&w=2 are separate patches from Chris Mason] that implement full data journaling for ReiserFS for Linux 2.4.16: * [http://web.archive.org/web/20060517214944/http://ftp.suse.com/pub/people/mason/patches/data-logging/ ftp.suse.com/pub/people/mason/patches/] * [http://web.archive.org/web/20060517214944/http://ftp.suse.com/pub/people/mason/patches/intermezzo-alpha/ ftp.suse.com/pub/people/mason/patches/intermezzo-alpha/] '''Note''': Full data journaling is considered by many to be a good way to achieve file data integrity across system crashes. However, although file data may appear to be consistent from the kernel point of view, since there is no API exported to the userspace to control transactions, we may end-up in a situation where the application makes two write requests (as part of one logical transaction) but only one of these gets journaled before the system crashes. From the application point of view, we may then end up with inconsistent data in the file. Such issues should be addressed with the upcoming [[Reiser4]]. Such an API will be exported to userspace and all programs that need transactions will be able to use it. === Mount fails after reiserfsck --rebuild-tree failure === When [[reiserfsck]] --rebuild-tree is run, the first thing it does is to set the root inode value to -1. 
This makes the filesystem unmountable. (So, if [[reiserfsck]] will fail later on, because it contains serious errors, this filesystem could not be mounted.) Therefore once [[reiserfsck]] --rebuild-tree have failed for one of your filesystems, mounting of this partition is disabled. To correct the error you must check if you are have the latest [[reiserfsprogs]] package installed. If that fails, please send a bug report to our [[mailinglists|mailing list]] and be ready to answer our questions. === Why is the execution time for a <tt>find . -type f | xargs cat {} \;</tt> command much longer when using ReiserFS than for the same command when using ext2? === This effect is observed if the measured file set was produced by untarring some archive created not from a ReiserFS partition (or by copying files from a non-ReiserFS partition or by running a program that writes a bunch of files in some order). This is because the <tt>readdir()</tt> operation performed on the ReiserFS partition returns filenames not in the original write order but rather in some hash order (dependant on the hash function used). Thus when reading files' contents, the hard drive heads must move when going from one file to another. If you want ReiserFS to outperform any other filesystem in your setup here is one solution: Copy the entire directory that you are not satisfied with to the same partition but with a different name (use <tt>cp -a</tt>), then remove the old directory and rename the new one with the old name. If the partition does not have enough space available, another approach is to <tt>tar(1)</tt> up the whole partition, clear it, and then untar the previously saved data. === Is quota-support built-in in the vanilla 2.4 kernels for ReiserFS? 
===

No, quota support for the 2.4 kernel branch is bundled separately. The patches by Chris Mason were once available [ftp://ftp.suse.com/pub/people/mason/patches/reiserfs/quota-2.4/ at SuSE] (gone) and are still [http://gd.tuwien.ac.at/utils/fs/reiserfs/quota-patches/ mirrored at TU-Wien]. These patches were not included in the 2.4 kernel branch because they implement a new quota format and need new quota code as well, which is too big a change for the 2.4 series of kernels. Various Linux distribution vendors (e.g. [http://www.suse.com SuSE]) do ship reiserfs-quota-enabled kernels, though.

=== I am getting some errors in my kernel logs that I do not know how to interpret ===

Messages like:

 vs-13070: reiserfs_read_inode2: i/o failure occurred trying to find stat data of [1718696 1718710 0x0 SD]
 zam-7001: io error in reiserfs_find_entry

most likely accompanied by messages like the samples below, are definite signs of hard disk problems (bad sectors):

 hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
 hda: dma_intr: error=0x40 { UncorrectableError }, LBAsect=6599945, sector=4286584
 end_request: I/O error, dev 03:03 (hda), sector 4286584

or

 scsi0: ERROR on channel 0, id 1, lun 0, CDB: Read (10) 00 00 01 ee 60 00 00 08 00
 Current sd 08:00: sense key Medium Error

or

 I/O error: dev 08:21, sector 65704

Messages about <tt>"access beyond end of device"</tt> can have many different causes, ranging from not rebooting after fdisk requested it, to unfinished resizes, to data corruption.

The following messages mean you have a noisy IDE cable, or one of too low a quality for the chosen UDMA mode.
Try replacing the cable with a better one, or choose a slower UDMA mode:

 hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
 hda: dma_intr: error=0x84 { DriveStatusError BadCRC }
 hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
 hda: dma_intr: error=0x84 { DriveStatusError BadCRC }

If you see any message from [[ReiserFS]] that you cannot interpret and there is nothing similar to the messages above around it, [[mailinglists|mail the message to us]] and we will explain it to you.

=== Will ReiserFS implement streams, extended attributes, etc.? ===

[[FAQ/streams|Here]] is the one-page answer.

=== Reiserfs appears to be very slow while the RAID is resyncing. Mounting takes several minutes. Once mounted, an <tt>ls(1)</tt> in the mounted directory hangs. Forever. Once the RAID is sync'ed, things appear to work pretty well. How can that be fixed? ===

First of all, a patch that makes mounting faster has been included in the Linux kernel since 2.4.19. You can grab the patch for earlier kernels [http://gd.tuwien.ac.at/utils/fs/reiserfs/reiserfs-for-2.5/2.5.4.pending/07-reiserfs-bitmap-journal-read-ahead.diff here].

Also, the RAID drivers have '''minimal guaranteed''' and '''maximal possible''' RAID rebuild bandwidth settings. These values are controlled through the <tt>/proc/sys/dev/raid/speed_limit_min</tt> and <tt>/proc/sys/dev/raid/speed_limit_max</tt> sysctl variables (values are in 100 KiB/s). The RAID logic cannot always tell whether the disk subsystem is busy at a given time. When it thinks the disk subsystem is idle, it tries to rebuild the RAID array at the <tt>speed_limit_max</tt> speed, which defaults to 100 MB per second. Decrease this value to something more suitable (a bit of experimentation might be needed).

=== I get attempt to read past the end of the partition error messages; is ReiserFS corrupted? ===

You changed your partition sizes and then ran [[mkreiserfs]] before rebooting.
The kernel does not update its notion of the partition sizes until reboot. (This is fixable, but nobody has fixed it as of Dec. 2001.) [[mkreiserfs]] therefore created a filesystem with a wrong notion of how large its partition is, and the filesystem's notion of the partition boundaries will last past reboot even though the kernel's notion will change. So yes, it is corrupted. Some other kinds of metadata breakage can also lead to such messages.

=== Can I use VMware with ReiserFS? ===

VMware was tested on [http://www.suse.com/ SuSE Linux] with a [http://support.microsoft.com/gp/lifean18 Windows98] guest OS on a [[ReiserFS]] partition. There is one trick at the beginning: the following line was added to the VMware config file:

 host.FSSupportLocking1 = 0x52654973 # (0x52654973 == *(u32 *) "ReIs")

Thanks to [mailto:gkade@bigbrother.net Gregory K. Ade] for this hint.

=== How do I install Debian potato with ReiserFS as root partition? ===

[[FAQ/potato_part|Here]] are instructions by [mailto:LeBlanc@mcc.ac.uk Dr. A.V. Le Blanc].

=== Starting with linux kernel v2.4.21 I cannot mount my FS anymore. Why? ===

Sanity checks were added to the kernel code to prohibit mounting of filesystems that are bigger than the underlying block device. If you now see this message on mount:

 Filesystem on xx:yy cannot be mounted because it is bigger than the device

you may need to run fsck or increase the size of your LVM partition. Or maybe you forgot to reboot after fdisk when it told you to. If you do not use LVM, that usually means you need to run <tt>[[reiserfsck]] --rebuild-sb</tt> on your filesystem and agree to change its default size to the proposed one.

=== Is it ok to use ReiserFS on a small storage device: e.g. a 16MB NAND flash block device? ===

[[FAQ/small_blocks|Here]] are instructions.

=== How do I change root from ext2 to ReiserFS without loss of data? ===

[[FAQ/change_fs|Here]] are instructions.
=== <tt>mount: /dev/hda5 has wrong major or minor number</tt> - what does that mean? ===

The kernel does not know anything about [[ReiserFS]]; it is neither compiled in nor available as a module.

=== Will it be possible to read/write ReiserFS partitions created now with future versions of ReiserFS? ===

Yes. [[ReiserFS]]-3.6.x (Linux-2.4.x) works with both the old (3.5) and the new (3.6) formats. ReiserFS-3.5.x (Linux-2.2.x) can only work with the old (3.5) disk format. There is no way to convert the new (3.6) disk format to the old (3.5), but the old (3.5) format can be converted to the new one (3.6) with the <tt>-o conv</tt> [[mount|mount option]].

=== The ReiserFS module doesn't insert properly - why? ===

After applying the patch, ''recompile'' the whole kernel including the modules target, reboot, then try to insert the module.

=== Can I use ReiserFS with the software RAID? ===

Yes, for all RAID levels using any Linux >= 2.4.1, but '''DO NOT''' use RAID with Linux 2.2.x. Our journaling and their RAID code step on each other in the buffering code. Also, mirroring is '''not''' safe in the 2.2.x kernels because online mirror rebuilds in 2.2.x break the write-ordering requirements for the log. If you crash in the middle of an online rebuild, your meta-data may be corrupted. The only RAID level that is safe with [[ReiserFS]] in the 2.2.x kernels is the striping/concatenation level.

=== Can I use ReiserFS with 3ware RAID? ===

Yes, but you need to use Linux 2.2.19 or later for reasons other than [[ReiserFS]]. Also, if you encounter problems, you should be suspicious that it might not be ReiserFS that has the bug. See these [http://web.archive.org/web/20030415160519/http://www.3ware.com/support/raid5techbulletin.shtml special instructions] (archive.org).

=== Why do things freeze on my IDE hard drive for annoying amounts of time? ===

Because when large writes are scheduled all at once, reads can starve.
A fix for this is evolving; the later your ReiserFS patch, the better we handle this.

=== <tt>du(1)</tt> says ReiserFS makes space efficiency worse. ===

Use <tt>df(1)</tt>, not <tt>du(1)</tt>, or use the ''raw'' option for <tt>du(1)</tt> if it's supported. <tt>st_blocks</tt> summed up is less accurate than <tt>st_size</tt> for [[ReiserFS]] because we pack tails, and <tt>st_blocks</tt> rounds numbers up.

=== <tt>mkreiserfs(8)</tt> fails after repartitioning ===

The kernel requires you to reboot after repartitioning (for all filesystems). We intend to fix that.

=== Performance is poor, and my disk at 96% full still has free space. ===

Once a disk drive gets more than 85% full, performance starts to suffer unless you use a repacker (which isn't implemented yet). You can probably get away with 92%, but if performance is valued you are making a mistake to keep it any fuller. This is true for almost all filesystems. [[ReiserFS]], because we pack tails together, packs more data into a given percentage used, but it is still subject to the rules for the maximum recommended percentage used. If you create the whole disk with one copy and then mount it read-only, then you can fully pack it without problems. Please be sure to copy it from (or <tt>tar</tt> it from) a ReiserFS partition so that files are created in ReiserFS <tt>readdir()</tt> order, as this will improve performance.

=== Why do I get a signal 11 when compiling the kernel using ReiserFS and not ext2? ===

Your CPU is overheating and/or you have [http://www.bitwizard.nl/sig11/ bad RAM].

=== But it doesn't happen with ext2? ===

ext2 uses less heat-sensitive gates in the CPU :-) Seriously, ext2 and [[ReiserFS]] contain random differences, and overheating and bad RAM have random sensitivities. ([http://www.bitwizard.nl/sig11/ Signal 11] is not due to ReiserFS. One user had a cable blocking the fan; it did not affect ext2, but it wasn't until he fixed the cable-fan problem that ReiserFS worked.)
=== Can I use ReiserFS on other architectures than i386? ===

Yes. Starting from Linux [http://kernel.org/pub/linux/kernel/v2.4/ChangeLog-2.4.13 kernel 2.4.13], ReiserFS can be run on any architecture that Linux supports.

=== I need a program which will help me in rebuilding/recreating my partition table. ===

[http://brzitwa.de/mb/gpart/ gpart] is a utility that handles ext2, FAT, Linux swap, HPFS, NTFS, FreeBSD and Solaris/x86 disklabels, Minix, and ReiserFS. It prints a proposed content for the primary partition table and is well documented.

=== What partition type should I use for ReiserFS? ===

[http://www.win.tue.nl/~aeb/partitions/partition_types.html Linux native filesystem] (83)

=== Can I use 32GB+ IDE Hard Drives with ReiserFS? ===

Yes, if you use Linux kernel 2.4 or later.

=== What about resizing ReiserFS? ===

This can be done with [[resize_reiserfs]].

=== What should I put into the fifth (aka dump, fs_freq) and the sixth (aka pass, fs_passno) fields of /etc/fstab for ReiserFS filesystems? ===

You'd put in <tt>"0 0"</tt>, e.g.

 /dev/sda3 /var reiserfs notail,nodev,nosuid,noexec <font color="red">0 0</font>

=== Why are ReiserFS filesystems not fscked on reboot after a crash? ===

Because [[ReiserFS]] provides journaling of meta-data. After a crash, the consistency of a filesystem is restored by replaying the transaction log.

=== Can I interactively repair a filesystem that was corrupted? ===

This is done with [[reiserfsck]].

=== Can I use <tt>dump(8)</tt> and <tt>restore(8)</tt> with ReiserFS? Any caveats? ===

No. <tt>dump(8)</tt> uses knowledge of the internal structure of ext2 and works together with restore, which also uses ext2-specific knowledge, to back up ext2 files. dump and restore are specific to ext2 and will not work with [[ReiserFS]]. To back up ReiserFS files use <tt>tar(1)</tt>, which is universal and can be applied to almost any reasonable Linux filesystem.
It is well known among system administrators that <tt>dump(8)</tt> is more complete than Unix tar, and that there is quite a list of things that Unix tar will fail to back up properly. This is not true of GNU tar, which is quite complete. Basically, the only real disadvantage of GNU tar compared to <tt>dump(8)</tt> is speed. Unfortunately, because it shares the same name as Unix <tt>tar(1)</tt>, people are reluctant to believe this. (Yes, GNU tar has incremental backups, etc.) We will performance-optimize ReiserFS backups for you (and the rest of the world) for $30K, which is not a lot if you are a large site spending a few hundred thousand on equipment for backups.

=== Does ReiserFS support snapshots? ===

No, but you can create [[ReiserFS]] on top of an [http://sourceware.org/lvm2/ LVM] logical volume and use LVM's snapshot capabilities.

=== Can I check reiserfs filesystems for errors without unmounting them? ===

[[reiserfsck]] in checking mode may be run over filesystems mounted read-only. There is no official way to fix mounted filesystems, though; you MUST completely unmount your filesystem in order to have it fixed. If you have LVM, you can check the consistency of filesystems mounted read-write; a script for this was contributed by Andreas Dilger.

=== What ReiserFS mount options should I use to get the performance winner for a mail server? ===

[http://archives.neohapsis.com/archives/postfix/2001-03/1148.html Craig Sanders answered] in detail:

By the time I got around to running <tt>bonnie</tt>, the <tt>postmark</tt> and <tt>postal</tt> benchmarks had convinced me that <tt>notail</tt> was essential.

Host system:
* Debian GNU/Linux (of course :)
* Linux kernel 2.4.2 with latest 20010305 ReiserFS patch
* dual P3-866 (256K cache)
* 512MB RAM
* [http://www.adaptec.com/en-US/support/scsi/u160/ASC-19160/ Adaptec 19160] SCSI Controller

External drive box:
* [http://www.domex.com.tw/support/product/8230u.htm Domex 8230u] RAID controller, 32MB battery-backed cache.
* 6 x 18GB IBM [http://www.hitachigst.com/tech/techlib.nsf/techdocs/85256AB8006A31E587256A78005A3610/$file/ddys_sp21.PDF DDYS-T18350M] drives

For this particular hardware, [[ReiserFS]] with notail on RAID5 was the clear performance winner for a mail server with lots of synced random I/O.

=== Does using ReiserFS mean I can just press the power off button without running <tt>/sbin/shutdown</tt>? Does it mean there is no risk of data loss? ===

No, definitely not. As of now, [[ReiserFS]] only provides meta-data journaling - that is, it records which files have been created or opened, whether they have had their size changed, or where they have been relocated. It guarantees that the structure of the internal ReiserFS tree will be correct, thereby allowing you, after an unclean shutdown, to start back up without having to run fsck on all the files that have not been changed. Data in files that were being used at the time of the crash could have been corrupted. This is usual for most filesystems. Data-journaling filesystems guarantee that there will be no garbage written into a file, but they don't guarantee that a file update will be completed. (Only [[Reiser4]] guarantees that filesystem operations are performed as atomic operations and provides atomic transaction functionality.) [[ReiserFS]] does not guarantee that the file contents themselves are uncorrupted, nor that no data is lost. Moreover, even if all of your system is on ReiserFS, many system components (like daemons, database managers, etc.) require a proper shutdown procedure for correct functioning. However, there is a [ftp://ftp.suse.com/pub/people/mason/patches/data-logging separate implementation of data logging] (dead) that will [http://marc.info/?l=reiserfs-devel&m=103472026011689&w=2 soon] go into the mainstream kernel.

=== How does ReiserFS support bad block handling? ===

This is covered [[FAQ/bad-block-handling|here]].

=== I have a motherboard with VIA MVP3 chipset and experience ReiserFS problems.
===

[mailto:woster73@yahoo.com William Oster] answers:

If you are using a motherboard with a [http://www.via.com.tw/en/products/apollo/mvp3.jsp VIA MVP3] chipset, you may have [[ReiserFS]] problems caused by the way your kernel is configured for the so-called [http://lxr.linux.no/linux+v2.6.30/drivers/pci/quirks.c PCI quirks]. My experience is with kernels 2.2.18 and 2.2.19, but it may affect the 2.4.x series too if you are using the MVP3 chipset (popular in socket 7 type motherboards, such as those used by the AMD K6 and classic Pentium). I've confirmed this problem with several motherboards using the VIA MVP3 chipset, ReiserFS 3.5.29 to 3.5.32, and [http://lxr.linux.no/linux+v2.6.30/Documentation/scsi/ncr53c8xx.txt NCR 53c8xx SCSI]. But please note: it probably affects '''any controller which uses DMA and PCI bus mastering'''. Problems which I was inclined to attribute to ReiserFS were actually problems with this kernel [mis]configuration.

If you fit this profile, '''DO NOT''' enable the <tt>CONFIG_PCI_QUIRKS</tt> configuration option in the <tt>/usr/src/linux/.config</tt> file. Although the Linux documentation suggests that this option can be enabled if in doubt, '''DO NOT''' enable it. It was never intended for the VIA MVP3 chipset anyway. It affects the way DMA is handled, and the combination of ReiserFS (and possibly NCR SCSI) can cause random disk corruption which will eventually result in ReiserFS and/or SCSI errors. Evidently ReiserFS exercises the DMA and SCSI bus very thoroughly; the problems seem not to be as likely under the ext2 filesystem.

Check your <tt>/usr/src/linux/.config</tt> file. You are safe from this problem if you find this line:

 # CONFIG_PCI_QUIRKS is not set

Any other setting could be dangerous to MVP3-chipset ReiserFS users, especially when using PCI bus-mastering controllers such as the NCR 53c8xx series. Re-configure your kernel to disable the "PCI quirks" option, then <tt>make dep</tt>, rebuild, and reinstall.
=== I am having extensive problems using ReiserFS; it seems to have bugs all over the place. I'm not compiling with a [[#I am using RedHat 7.0 with gcc 2.96; why does ReiserFS seem unstable with it?|buggy compiler]]. What is happening? How can this be stable? ===

You have hardware problems. Really, you do. Even if the bugs don't show up with ext2, you have hardware problems. (See [[#Why_do_I_get_a_signal_11_when_compiling_the_kernel_using_ReiserFS_and_not_ext2?|the signal 11 question]].) Most SuSE users use ReiserFS. Obscure bugs probably still exist; but if you find bugs as easily as when using Windows, you have bad RAM, a bad CPU, a bad cable, bad cooling, a [[#I have a motherboard with VIA MVP3 chipset and experience ReiserFS problems.|VIA chipset with PCI quirks turned on]], or other hardware or software-layer bugs. ReiserFS is stable. You can be sure that if the bugs are encountered easily and commonly with normal usage patterns, it is not us. This does not mean that the next release won't somehow break something, though :-/

Real bug reports are, at the time of writing, outnumbered 10 to 1 by hardware bugs that trigger error messages. We are working on making our error messages better at catching hardware bugs and identifying them as such. There is only so far we can go, though, in runtime consistency checking without serious speed reductions. We don't release software unless it goes through extensive testing; so if you don't think that our testing could have missed the bug, it is probably hardware.

=== How can I put a label (like allowed by the <tt>-L</tt> option of <tt>mkfs.ext2</tt>) on a ReiserFS instance? ===

Currently, this feature is only implemented for the [[ReiserFS]] v3.6 disk format. Adding it to the v3.5 disk format would break the existing disk format, and there is not enough free space in the superblock.
You can set a label (and UUID) with a recent [[reiserfsprogs]] package on a [[ReiserFS]] v3.6 filesystem using the <tt>-l</tt> switch (<tt>-u</tt> for UUID) to the [[reiserfstune]] (for existing partitions) or [[mkreiserfs]] (for partitions being created) commands. Support for labels and UUIDs was integrated into [[reiserfsprogs]] starting from version 3.x.1a.

=== Why, when I'm working on files (i.e. having open files) on my laptop, does ReiserFS access the disk every 5 seconds? This effectively prevents the disk from spinning down, i.e. APM modes from taking over, even when I'm not writing anything. ===

[mailto:bgraveland@hyperchip.com Brent Graveland] answers: It's the [http://kerneltrap.org/node/14148 atime] update. Every time you run <tt>sync(1)</tt>, the sync program's <tt>atime</tt> is updated. The next <tt>sync()</tt> writes this <tt>atime</tt> update, then <tt>sync(1)</tt>'s <tt>atime</tt> gets updated again.

=== RedHat does not unmount <tt>/</tt> (<tt>/dev/root</tt>) with ReiserFS on halt. How to fix it? ===

RedHat users kindly provided these patches (not tested by us):
* [[FAQ/rc.sysinit.patch|rc.sysinit.patch]]
* [[FAQ/halt.patch|halt.patch]]

Note that if you have [http://www.redhat.com/docs/manuals/linux/RHL-7.2-Manual RedHat Linux 7.2] or later, you do not need these patches.

=== How do I run programs from the reiserfsprogs package on encrypted devices? ===

In order to access such encrypted entities, you need to use the [http://www.linux.org/docs/ldp/howto/Cryptoloop-HOWTO/loopdevice-setup.html losetup(8)] tool to bind your entity to a <tt>loop</tt> device.

=== Are there any recommendations ''for'' or ''against'' particular hard drive manufacturers for use with ReiserFS? ===

No, as bad hard drives are not [[ReiserFS]]-specific but affect all filesystems. There is basically no preference; the general rule that '''the faster the drive and the lower the seek time, the better''' applies as always. On the other hand, almost every hard drive manufacturer has a '''widely known''' broken series of hard drives.
The most recent example is [http://en.wikipedia.org/wiki/Deskstar_75GXP IBM's Deskstar] series of disks, especially the DTLA models produced in Hungary in 2000-2001. These are [http://ask.slashdot.org/article.pl?sid=01/10/04/0050238 known to fail very often], to the point that you probably don't want to use them even if you have already paid for them. Other Deskstar drives also seem to be a poor choice: IBM released a note that Deskstar drives should not run for more than 8 hours/day on average. These drives are also known to be very sensitive to temperature conditions and to fail on overheating. There is a [http://web.archive.org/web/20060315210819/http://www.ibmdeskstar75gxplitigation.com/ class action lawsuit against IBM] over that drive series.

=== I am using RedHat 7.0 with gcc 2.96; why does ReiserFS seem unstable with it? ===

Use the most recent version of RedHat (gcc 2.96-85 or later with RedHat 7.2, although 7.1 is also okay for ReiserFS). The choice of an unstable, [http://gcc.gnu.org/gcc-2.96.html unreleased] version of gcc 2.96 by RedHat as the default gcc was a Slashdot controversy. [http://www.redhat.com/advice/speaks_gcc.html gcc 2.96 on RedHat 7.0 was unstable], and ReiserFS was one of the things that would fail with it. There are two gcc versions: 2.96 and 2.96-85. 2.96-85 works for ReiserFS, and the other (the one on [http://www.redhat.com/docs/manuals/linux/RHL-7-Manual/ RedHat 7.0]) surely does not. Read the Linux kernel instructions about what compiler to use.

The solution to code not working on broken compilers is the one RedHat has taken: fix the compiler. They [http://rhn.redhat.com/errata/RHBA-2002-055.html fixed] the compiler and thereby allowed the correctly compiled [[ReiserFS]] to work.

=== In my program I am using <tt>fsync(2)</tt> calls after each write to the file to guarantee integrity of my file data, and this is very slow; how can I improve the performance?
===

Answer from Chris Mason:

The main thing to remember is that <tt>fsync()</tt> introduces a bunch of disk writes and forces the FS to wait on the buffers. The key to keeping performance up is to make it easy for the FS to do as much as possible before the <tt>fsync()</tt> call. So, if your application modifies 3 files and you want to make sure all 3 changes are safely on disk:

 write(file1)
 write(file2)
 write(file3)
 fsync(file1)
 fsync(file2)
 fsync(file3)

is much faster than:

 write(file1)
 fsync(file1)
 write(file2)
 fsync(file2)
 write(file3)
 fsync(file3)

It is also faster to write over existing bytes in a file than it is to append new bytes onto its end. When you overwrite existing bytes, you don't have to commit new metadata to disk on <tt>fsync()</tt>; the FS can just write the data blocks. This means fewer seeks. The more you write to a single file before calling <tt>fsync()</tt>, the faster overall things will run:

 write(8k)
 fsync(file)

is much faster than:

 write(4k)
 fsync(file)
 write(4k)
 fsync(file)

Trying to optimize for those 3 things alone can make a huge performance difference overall.

Answer from Josh MacDonald:

You have to understand that even using <tt>fsync()</tt> after every <tt>write()</tt> makes no guarantees. If the system crashes during either the <tt>write()</tt> or <tt>fsync()</tt> operation, your data may be lost or corrupted. And even supposing the <tt>fsync()</tt> completes: does your application keep its data in multiple files? If so, and you need to <tt>write()</tt> to multiple files as part of a transaction, you have even greater problems. The only safe and easy way to implement some kind of transaction with the traditional file system guarantees is to use <tt>rename()</tt>:

# Keep all of your data in a single file.
# Periodically write a complete copy of your database to a temporary file.
# Rename the temporary file to the original database name.
Addition from Nikita Danilov:

One can implement something like a ''phase tree'' at user level and use <tt>rename()</tt> to atomically switch the root of the tree. This overcomes the "everything-in-one-file" limitation but has the added complexity of requiring crash recovery. Or stop your development for now and wait until the [[Reiser4]] filesystem is released; it has a transaction API exported to userspace that would solve all of your problems.

== Our program needs to access a lot of working files. What is the recommended way to organize files to get the best results out of ReiserFS? Should all the files be placed in a single directory, or should the files be spread across a directory tree to limit the number of files per directory? Can you also summarize the relevant caching and locking effects? ==

Traditional file systems typically have poor performance when there are many files in a single directory, but not [[ReiserFS]]. These other file systems perform poorly because they use a linear search algorithm to find and replace entries in a directory. This means that the file system must scan, on average, half the blocks of a directory for every access. Typically, applications are required to work around this problem by manually structuring a tree of directories, allowing each individual directory to remain limited in size. For example, see how the Squid web proxy stores a large collection of files.

ReiserFS does not have this problem because it uses an internal tree to store all directories and file metadata. Directory operations remain efficient even for very large directories, so you can write your application free from this performance concern. However, there are several issues that complicate this matter: namely, locking and locality.
The Linux VFS currently imposes locking restrictions that serialize many operations on directories, so if concurrent processes or threads will access the collection of files then you may be better off using multiple directories. [[Reiser4]] will improve upon this restriction, although it is still under development. ReiserFS attempts to store all of the files in a directory, along with the directory itself, in nearby locations on disk. An application may exploit this spatial locality if it can predict which files will be accessed with temporal locality. You may be better of using multiple directories to store your files if you can predict that many files within a directory will be accessed at the same time. To summarize, ReiserFS supports efficient access to large directories and most traditional file systems do not. However, locking and locality issues may guide your decision to use manually structured directory trees instead, at least until ReiserFS exports control over packing locality to users, and improves its locking. [[category:ReiserFS]] [[category:Reiser4]] 501eb411242f26043d6fe1ad465d4b6e17631128 2691 2681 2013-02-19T18:33:16Z Chris goe 2 Undo revision 2681 by [[Special:Contributions/ClaudiaBradshaw1994|ClaudiaBradshaw1994]] ([[User talk:ClaudiaBradshaw1994|talk]]) <font color=red>This FAQ is very [[ReiserFS]] centric and often a bit dated. The [[Reiser4]] filesystem is mentioned as ''upcoming''. Be sure to search the [[mailinglists|mailing list archives]] and help update this FAQ - Thanks!</font> __TOC__ === What are the specs for ReiserFS: maximum number of files, of files a directory can have, of sub-dirs in a dir, of links to a file, maximum file size, maximum filesystem size, etc.? 
=== Specifications for [[ReiserFS]]: {|cellpadding="5" cellspacing="0" border="1" | '''property''' || '''3.5''' || '''3.6''' |- | max number of files || 232-3 => 4 Gi - 3 || 232-3 => 4 Gi-3 |- | max number files a dir can have || 518701895 (but in practice this value is limited by hash function. r5 hash allows about 1 200 000 file names without collisions) || 232 - 4 => 4 Gi - 4 (but in practice this value is limited by hash function. r5 hash allows about 1 200 000 file names without collisions) |- | max file size || 231-1 => 2 Gi-1 || 260 - bytes => 1 Ei, but page cache limits this to 8 Ti on architectures with 32 bit int |- | max number links to a file || 216 => 64 Ki || 232 => 4 Gi |- | max filesystem size || 232 (4K) blocks => 16 Ti || 232 (4K) blocks => 16 Ti |} ReiserFS does '''meta-data journaling''', enabling fast crash recovery without the expense of full '''data journaling'''. There [http://marc.info/?l=reiserfs-devel&m=100895310422415&w=2 are separate patches from Chris Mason] that implement full data journaling for ReiserFS for Linux 2.4.16: * [http://mirror.fraunhofer.de/ftp.suse.com/people/mason/patches/data-logging/ ftp.suse.com/people/mason/patches/data-logging/] * [http://mirror.fraunhofer.de/ftp.suse.com/people/mason/patches/intermezzo-alpha/ ftp.suse.com/people/mason/patches/intermezzo-alpha/] '''Note''': Full data journaling is considered by many to be a good way to achieve file data integrity across system crashes. However, although file data may appear to be consistent from the kernel point of view, since there is no API exported to the userspace to control transactions, we may end-up in a situation where the application makes two write requests (as part of one logical transaction) but only one of these gets journaled before the system crashes. From the application point of view, we may then end up with inconsistent data in the file. Such issues should be addressed with the upcoming [[Reiser4]]. 
Such an API will be exported to userspace and all programs that need transactions will be able to use it. === Mount fails after reiserfsck --rebuild-tree failure === When [[reiserfsck]] --rebuild-tree is run, the first thing it does is to set the root inode value to -1. This makes the filesystem unmountable. (So, if [[reiserfsck]] will fail later on, because it contains serious errors, this filesystem could not be mounted.) Therefore once [[reiserfsck]] --rebuild-tree have failed for one of your filesystems, mounting of this partition is disabled. To correct the error you must check if you are have the latest [[reiserfsprogs]] package installed. If that fails, please send a bug report to our [[mailinglists|mailing list]] and be ready to answer our questions. === Why is the execution time for a <tt>find . -type f | xargs cat {} \;</tt> command much longer when using ReiserFS than for the same command when using ext2? === This effect is observed if the measured file set was produced by untarring some archive created not from a ReiserFS partition (or by copying files from a non-ReiserFS partition or by running a program that writes a bunch of files in some order). This is because the <tt>readdir()</tt> operation performed on the ReiserFS partition returns filenames not in the original write order but rather in some hash order (dependant on the hash function used). Thus when reading files' contents, the hard drive heads must move when going from one file to another. If you want ReiserFS to outperform any other filesystem in your setup here is one solution: Copy the entire directory that you are not satisfied with to the same partition but with a different name (use <tt>cp -a</tt>), then remove the old directory and rename the new one with the old name. If the partition does not have enough space available, another approach is to <tt>tar(1)</tt> up the whole partition, clear it, and then untar the previously saved data. 
=== Is quota-support built-in in the vanilla 2.4 kernels for ReiserFS? ===
No. Quota support for the 2.4 kernel branch is bundled separately as patches by Chris Mason; they were once available [ftp://ftp.suse.com/pub/people/mason/patches/reiserfs/quota-2.4/ at SuSE] (gone) and are still [http://gd.tuwien.ac.at/utils/fs/reiserfs/quota-patches/ mirrored at TU-Wien]. These patches were not included in the 2.4 kernel branch because they implement a new quota format and need new quota code too, which is too big a change for the 2.4 series of kernels. Various Linux distribution vendors (e.g. [http://www.suse.com SuSE]) do ship ReiserFS-quota-enabled kernels, though.

=== I am getting some errors in my kernel logs that I do not know how to interpret ===
Messages like:
 vs-13070: reiserfs_read_inode2: i/o failure occurred trying to find stat data of [1718696 1718710 0x0 SD]
 zam-7001: io error in reiserfs_find_entry
most likely accompanied by samples like the ones below, are definite signs of hard disk problems (bad sectors):
 hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
 hda: dma_intr: error=0x40 { UncorrectableError }, LBAsect=6599945, sector=4286584
 end_request: I/O error, dev 03:03 (hda), sector 4286584
or
 scsi0: ERROR on channel 0, id 1, lun 0, CDB: Read (10) 00 00 01 ee 60 00 00 08 00
 Current sd 08:00: sense key Medium Error
or
 I/O error: dev 08:21, sector 65704
Messages about <tt>"access beyond end of device"</tt> can have many different causes, from not rebooting after fdisk requested it, to unfinished resizes, to data corruption. The following messages mean you have a noisy IDE cable, or one of too low a quality for the chosen UDMA mode.
Try replacing the cable with a better one, or choose a slower UDMA mode:
 hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
 hda: dma_intr: error=0x84 { DriveStatusError BadCRC }
 hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
 hda: dma_intr: error=0x84 { DriveStatusError BadCRC }
If you see any message from [[ReiserFS]] that you cannot interpret and there is nothing similar to the messages above around it, [[mailinglists|mail the message to us]] and we will explain it to you.

=== Will ReiserFS implement streams, extended attributes, etc.? ===
[[FAQ/streams|Here]] is the one-page answer.

=== Reiserfs appears to be very slow while the RAID is resyncing. Mounting takes several minutes. Once mounted, an <tt>ls(1)</tt> in the mounted directory hangs. Forever. Once the RAID is sync'ed, things appear to work pretty well. How can that be fixed? ===
First of all, a patch that makes mounting the drive faster has been included in the Linux kernel since 2.4.19. You can grab the patch for earlier kernels [http://gd.tuwien.ac.at/utils/fs/reiserfs/reiserfs-for-2.5/2.5.4.pending/07-reiserfs-bitmap-journal-read-ahead.diff here]. Also, the RAID drivers have '''minimal guaranteed''' and '''maximal possible''' RAID rebuild bandwidth settings. These values are controlled through the <tt>/proc/sys/dev/raid/speed_limit_min</tt> and <tt>/proc/sys/dev/raid/speed_limit_max</tt> sysctl variables (values are in KiB/s). It seems that the RAID logic cannot always tell whether the disk subsystem is busy at a given time. When it thinks the disk subsystem is idle, it tries to rebuild the RAID array at <tt>speed_limit_max</tt> speed, which defaults to 100 MB per second. Decrease this value to something more suitable (a bit of experimentation might be needed).

=== I get attempt to read past the end of the partition error messages; is ReiserFS corrupted? ===
You changed your partition sizes, and then ran [[mkreiserfs]] before rebooting.
The kernel does not change its belief about the partition sizes until reboot time. (This is fixable, but nobody has fixed it as of Dec. 2001.) [[mkreiserfs]] therefore created a filesystem with a wrong notion of how large its partition is. The filesystem's notion of what the partition boundaries are will last past reboot, even though the kernel's notion will change. So yes, it is corrupted. Some other kinds of metadata breakage can also lead to such messages.

=== Can I use VMware with ReiserFS? ===
VMware was tested on [http://www.suse.com/ SuSE Linux] with a [http://support.microsoft.com/gp/lifean18 Windows98] guest OS on a [[ReiserFS]] partition. There is one trick at the beginning: the following line was added to the VMware config file:
 host.FSSupportLocking1 = 0x52654973 # (0x52654973 == *(u32 *) "ReIs")
Thanks to [mailto:gkade@bigbrother.net Gregory K. Ade] for this hint.

=== How do I install Debian potato with ReiserFS as root partition? ===
[[FAQ/potato_part|Here]] are instructions by [mailto:LeBlanc@mcc.ac.uk Dr. A.V. Le Blanc].

=== Starting with Linux kernel v2.4.21 I cannot mount my FS anymore. Why? ===
Special sanity checks were added to the kernel code to prohibit mounting of filesystems that are bigger than the underlying block device. If you now see this message on mount:
 Filesystem on xx:yy cannot be mounted because it is bigger than the device
you may need to run fsck or increase the size of your LVM partition. Or maybe you forgot to reboot after fdisk when it told you to. If you do not use LVM, that usually means you need to run <tt>[[reiserfsck]] --rebuild-sb</tt> on your filesystem and agree to change its default size to the proposed one.

=== Is it ok to use ReiserFS on a small storage device, e.g. a 16MB NAND flash block device? ===
[[FAQ/small_blocks|Here]] are instructions.

=== How do I change root from ext2 to ReiserFS without loss of data? ===
[[FAQ/change_fs|Here]] are instructions.
=== <tt>mount: /dev/hda5 has wrong major or minor number</tt> - what does that mean? ===
The kernel does not know anything about [[ReiserFS]]; it is neither compiled in nor available as a module.

=== Will it be possible to read/write ReiserFS partitions created now with future versions of ReiserFS? ===
Yes. [[ReiserFS]]-3.6.x (Linux-2.4.x) works with both the old (3.5) and the new (3.6) formats. ReiserFS-3.5.x (Linux-2.2.x) can only work with the old (3.5) disk format. There is no way to convert the new (3.6) disk format to the old (3.5), but the old (3.5) format can be converted to the new one (3.6) with the <tt>-o conv</tt> [[mount|mount option]].

=== The ReiserFS module doesn't insert properly - why? ===
After applying the patch, ''recompile'' the whole kernel including the modules target, reboot, then try to insert the module.

=== Can I use ReiserFS with the software RAID? ===
Yes, for all RAID levels using any Linux >= 2.4.1, but '''DO NOT''' use RAID with Linux 2.2.x. Our journaling and their RAID code step on each other in the buffering code. Also, mirroring is '''not''' safe in the 2.2.x kernels because online mirror rebuilds in 2.2.x break the write-ordering requirements for the log. If you crash in the middle of an online rebuild, your meta-data may be corrupted. The only RAID level that is safe with [[ReiserFS]] in the 2.2.x kernels is the striping/concatenation level.

=== Can I use ReiserFS with 3ware RAID? ===
Yes, but you need to use Linux 2.2.19 or later for reasons other than [[ReiserFS]]. Also, if you encounter problems, you should be suspicious that it might not be ReiserFS that has the bug. See these [http://web.archive.org/web/20030415160519/http://www.3ware.com/support/raid5techbulletin.shtml special instructions] (archive.org).

=== Why do things freeze on my IDE hard drive for annoying amounts of time? ===
Because when large writes are scheduled all at once, reads can starve.
A fix for this is evolving; the later your ReiserFS patch, the better we handle this.

=== <tt>du(1)</tt> says ReiserFS makes space efficiency worse. ===
Use <tt>df(1)</tt>, not <tt>du(1)</tt>, or use the ''raw'' option of <tt>du(1)</tt> if it is supported. <tt>st_blocks</tt> summed up is less accurate than <tt>st_size</tt> for [[ReiserFS]] because we pack tails, and <tt>st_blocks</tt> rounds numbers up.

=== <tt>mkreiserfs(8)</tt> fails after repartitioning ===
The kernel requires you to reboot after repartitioning (for all filesystems). We intend to fix that.

=== Performance is poor, and my disk at 96% full still has free space. ===
Once a disk drive gets more than 85% full, performance starts to suffer unless a repacker is used (which isn't implemented yet). You can probably get away with 92%, but if performance is valued you are making a mistake keeping it any fuller. This is true for almost all filesystems. [[ReiserFS]], because we pack tails together, packs more data into a given percentage used, but it is still subject to the rules for the maximum recommended percentage used. If you populate the whole disk with one copy and then mount it read-only, then you can fully pack it without problems. Please be sure to copy it from (or <tt>tar</tt> it from) a ReiserFS partition so that files are created in ReiserFS <tt>readdir()</tt> order, as this will improve performance.

=== Why do I get a signal 11 when compiling the kernel using ReiserFS and not ext2? ===
Your CPU is overheating and/or you have [http://www.bitwizard.nl/sig11/ bad RAM].

=== But it doesn't happen with ext2? ===
ext2 uses less heat-sensitive gates in the CPU :-) Seriously, ext2 and [[ReiserFS]] contain random differences, and overheating and bad RAM have random sensitivities. ([http://www.bitwizard.nl/sig11/ Signal 11] is not due to ReiserFS. One user had a cable blocking the fan; it did not affect ext2, but it wasn't until he fixed the cable-fan problem that ReiserFS worked.)
=== Can I use ReiserFS on other architectures than i386? ===
Yes; starting from the Linux [http://kernel.org/pub/linux/kernel/v2.4/ChangeLog-2.4.13 kernel 2.4.13], ReiserFS can be run on any Linux-supported architecture.

=== I need a program which will help me in rebuilding/recreating my partition table. ===
[http://brzitwa.de/mb/gpart/ gpart] is a utility that handles ext2, FAT, Linux swap, HPFS, NTFS, FreeBSD and Solaris/x86 disklabels, Minix, and ReiserFS. It prints a proposed content for the primary partition table and is well documented.

=== What partition type should I use for ReiserFS? ===
[http://www.win.tue.nl/~aeb/partitions/partition_types.html Linux native filesystem] (83)

=== Can I use 32GB+ IDE Hard Drives with ReiserFS? ===
Yes, if you use Linux kernel 2.4 and up.

=== What about resizing ReiserFS? ===
This can be done with [[resize_reiserfs]].

=== What should I put into the fifth (aka dump, <tt>fs_freq</tt>) and the sixth (aka pass, <tt>fs_passno</tt>) fields of /etc/fstab for ReiserFS filesystems? ===
You'd put in <tt>"0 0"</tt>, e.g.
 /dev/sda3 /var reiserfs notail,nodev,nosuid,noexec <font color="red">0 0</font>

=== Why are ReiserFS filesystems not fscked on reboot after a crash? ===
Because [[ReiserFS]] provides journaling of meta-data. After a crash, the consistency of a filesystem is restored by replaying the transaction log.

=== Can I interactively repair a filesystem that was corrupted? ===
This is done with [[reiserfsck]].

=== Can I use <tt>dump(8)</tt> and <tt>restore(8)</tt> with ReiserFS? Any caveats? ===
No. <tt>dump(8)</tt> uses knowledge of the internal structure of ext2 to back up ext2 files and works together with restore, which also uses ext2-specific knowledge. dump and restore are specific to ext2 and will not work with [[ReiserFS]]. To back up ReiserFS files use <tt>tar(1)</tt>, which is universal and can be applied to almost any reasonable Linux filesystem.
It is well known among system administrators that <tt>dump(8)</tt> is more complete than Unix tar, and that there is quite a list of things that Unix tar will fail to properly back up. This is not true of GNU tar, which is quite complete. Basically, the only real disadvantage of GNU tar compared to <tt>dump(8)</tt> is speed. Unfortunately, because it shares its name with Unix <tt>tar(1)</tt>, people are reluctant to believe this. (Yes, GNU tar has incremental backups, etc.) We will performance-optimize ReiserFS backups for you (and the rest of the world) for $30K, which is not a lot if you are a large site spending a few hundred thousand on equipment for backups.

=== Does ReiserFS support snapshots? ===
No, but you can create [[ReiserFS]] on top of an [http://sourceware.org/lvm2/ LVM] logical volume and use LVM's snapshot capabilities.

=== Can I check reiserfs filesystems for errors without unmounting them? ===
[[reiserfsck]] in checking mode may be run on filesystems mounted read-only. There is no official way to fix mounted filesystems, though: you MUST completely unmount your filesystem in order to have it fixed. If you have LVM, you can check the consistency of filesystems mounted read-write; here is the script contributed by Andreas Dilger:

=== What ReiserFS mount options should I use to get the best performance for a mail server? ===
[http://archives.neohapsis.com/archives/postfix/2001-03/1148.html Craig Sanders answered] in detail: By the time I got around to running <tt>bonnie</tt>, the <tt>postmark</tt> and <tt>postal</tt> benchmarks had convinced me that <tt>notail</tt> was essential.

Host system:
* Debian GNU/Linux (of course :)
* Linux kernel 2.4.2 with latest 20010305 ReiserFS patch
* dual P3-866 (256K cache)
* 512MB RAM
* [http://www.adaptec.com/en-US/support/scsi/u160/ASC-19160/ Adaptec 19160] SCSI Controller

External drive box:
* [http://www.domex.com.tw/support/product/8230u.htm Domex 8230u] RAID controller, 32MB battery-backed cache.
* 6 x 18GB IBM [http://www.hitachigst.com/tech/techlib.nsf/techdocs/85256AB8006A31E587256A78005A3610/$file/ddys_sp21.PDF DDYS-T18350M] drives

For this particular hardware, [[ReiserFS]]/notail on RAID5 was the clear performance winner for a mail server with lots of synced random I/O.

=== Does using ReiserFS mean I can just press the power-off button without running <tt>/sbin/shutdown</tt>? Does it mean there is no risk of data loss? ===
No, definitely not. As of now, [[ReiserFS]] only provides meta-data journaling - that is, it records which files have been created or opened, whether they have had their size changed, or where they have been relocated. It guarantees that the structure of the internal ReiserFS tree will be correct, thereby allowing you to start back up after an unclean shutdown without having to run fsck on all the files that have not been changed. Data in files that were being used at the time of the crash may have been corrupted; this is usual for most filesystems. Data-journaling filesystems guarantee that there will be no garbage written into a file, but they don't guarantee that a file update will actually be written. (Only [[Reiser4]] guarantees that filesystem operations are performed as atomic operations, and provides atomic transaction functionality.) [[ReiserFS]] does not guarantee that the file contents themselves are uncorrupted, nor that no data is lost. Moreover, even if all of your system is on ReiserFS, many system components (like daemons, database managers, etc.) require a proper shutdown procedure for correct functioning. However, there is a [ftp://ftp.suse.com/pub/people/mason/patches/data-logging separate implementation of data logging] (dead) that will [http://marc.info/?l=reiserfs-devel&m=103472026011689&w=2 soon] go into the mainstream kernel.

=== How does ReiserFS support bad block handling? ===
This is covered [[FAQ/bad-block-handling|here]].

=== I have a motherboard with VIA MVP3 chipset and experience ReiserFS problems.
=== [mailto:woster73@yahoo.com William Oster] answers: If you are using a motherboard with a [http://www.via.com.tw/en/products/apollo/mvp3.jsp VIA MVP3] chipset, you may have [[ReiserFS]] problems caused by the way your kernel is configured for the so-called [http://lxr.linux.no/linux+v2.6.30/drivers/pci/quirks.c PCI quirks]. My experience is with kernels 2.2.18 and 2.2.19, but it may affect the 2.4.x series too if you are using the MVP3 chipset (popular in Socket 7 type motherboards, such as those used by the AMD K6 and classic Pentium). I've confirmed this problem with several motherboards using the VIA MVP3 chipset, ReiserFS 3.5.29 to 3.5.32, and [http://lxr.linux.no/linux+v2.6.30/Documentation/scsi/ncr53c8xx.txt NCR 53c8xx SCSI]. But please note: it probably affects '''any controller which uses DMA and PCI bus mastering'''. Problems which I was inclined to attribute to ReiserFS were actually problems with this kernel (mis)configuration. If you fit this profile, '''DO NOT''' enable the <tt>CONFIG_PCI_QUIRKS</tt> configuration option in the <tt>/usr/src/linux/.config</tt> file. Although the Linux documentation suggests that this option can be enabled if in doubt, '''DO NOT''' enable it. It was never intended for the VIA MVP3 chipset anyway. It affects the way DMA is handled, and the combination with ReiserFS (and possibly NCR SCSI) can cause random disk corruption which eventually results in ReiserFS and/or SCSI errors. Evidently ReiserFS exercises the DMA and SCSI bus very thoroughly. The problems seem not to be as likely under the ext2 filesystem. Check your <tt>/usr/src/linux/.config</tt> file. You are safe from this problem if you find this line:
 # CONFIG_PCI_QUIRKS is not set
Any other setting could be dangerous to MVP3-chipset ReiserFS users, especially when using PCI bus-mastering controllers such as the NCR 53c8xx series. Re-configure your kernel to disable the "PCI quirks" option, then <tt>make dep</tt>, rebuild, and reinstall.
=== I am having extensive problems using ReiserFS; it seems to have bugs all over the place. I'm not compiling with a [[#I am using RedHat 7.0 with gcc 2.96; why does ReiserFS seem unstable with it?|buggy compiler]]. What is happening? How can this be stable? ===
You have hardware problems. Really, you do. Even if the bugs don't show up with ext2, you have hardware problems. (See [[#Why_do_I_get_a_signal_11_when_compiling_the_kernel_using_ReiserFS_and_not_ext2?|the signal 11 question]].) Most SuSE users use ReiserFS. Obscure bugs probably still exist; but if you hit bugs as easily as when using Windows, you have bad RAM, a bad CPU, a bad cable, bad cooling, a [[#I have a motherboard with VIA MVP3 chipset and experience ReiserFS problems.|VIA chipset with PCI quirks turned on]], or other hardware or software-layer bugs. ReiserFS is stable. You can be sure that if bugs are encountered easily and commonly with normal usage patterns, it is not us. This does not mean that the next release won't somehow break something, though :-/ At the time of writing, real bug reports are outnumbered 10 to 1 by hardware bugs that trigger error messages. We are working on making our error messages better at catching hardware bugs and identifying them as such. There is only so far we can go in runtime consistency checking, though, without serious speed reductions. We don't release software unless it goes through extensive testing; so if you don't think that our testing could have missed the bug, it is probably hardware.

=== How can I put a label (like allowed by the <tt>-L</tt> option of <tt>mkfs.ext2</tt>) on a ReiserFS instance? ===
Currently, this feature is only implemented for the [[ReiserFS]] v3.6 disk format. Adding it to the v3.5 disk format would break the existing format, and there is not enough free space in the superblock.
You can set a label (and UUID) with a recent [[reiserfsprogs]] package on a [[ReiserFS]] v3.6 filesystem using the <tt>-l</tt> switch (<tt>-u</tt> for UUID) of the [[reiserfstune]] (for existing partitions) or [[mkreiserfs]] (for partitions being created) commands. Support for labels and UUIDs was integrated into [[reiserfsprogs]] starting from version 3.x.1a.

=== Why, when I'm working on files (i.e. having open files) on my laptop, does ReiserFS access the disk every 5 seconds? This effectively prevents the disk from spinning down, i.e. APM modes to take over, even when I'm not writing anything. ===
[mailto:bgraveland@hyperchip.com Brent Graveland] answers: It's the [http://kerneltrap.org/node/14148 atime] update. Every time you run <tt>sync(1)</tt>, the sync program's <tt>atime</tt> is updated. The next <tt>sync()</tt> writes this <tt>atime</tt> update, then <tt>sync(1)</tt>'s <tt>atime</tt> gets updated again. (Mounting with the <tt>noatime</tt> option avoids these updates.)

=== RedHat does not unmount <tt>/</tt> (<tt>/dev/root</tt>) with ReiserFS on halt. How to fix it? ===
RedHat users kindly provided these patches (not tested by us):
* [[FAQ/rc.sysinit.patch|rc.sysinit.patch]]
* [[FAQ/halt.patch|halt.patch]]
Note that if you have [http://www.redhat.com/docs/manuals/linux/RHL-7.2-Manual RedHat Linux 7.2] or later, you do not need these patches.

=== How do I run programs from the reiserfsprogs package on encrypted devices? ===
In order to access such encrypted entities you need to use the [http://www.linux.org/docs/ldp/howto/Cryptoloop-HOWTO/loopdevice-setup.html losetup(8)] tool to bind your entity to a <tt>loop</tt> device.

=== Are there any recommendations ''for'' or ''against'' any particular hard drive manufacturer for use with ReiserFS? ===
No, as bad hard drives are not [[ReiserFS]]-specific but affect all filesystems. There is basically no preference; the general rule '''the faster the drive and the lower the seek time, the better''' applies as always. On the other hand, almost every hard drive manufacturer has had a '''widely known''' broken series of hard drives.
The most recent example is [http://en.wikipedia.org/wiki/Deskstar_75GXP IBM's Deskstar] series of disks, especially the DTLA models produced in Hungary in 2000-2001. These are [http://ask.slashdot.org/article.pl?sid=01/10/04/0050238 known to fail very often], to the point that you probably don't want to use them even if you already paid for them. Other Deskstar drives also seem to be a poor choice: IBM released a note that Deskstar drives should not run for more than 8 hours/day on average. These drives are also known to be very sensitive to temperature conditions and to fail on overheating. There is a [http://web.archive.org/web/20060315210819/http://www.ibmdeskstar75gxplitigation.com/ class action lawsuit against IBM] over that drive series.

=== I am using RedHat 7.0 with gcc 2.96; why does ReiserFS seem unstable with it? ===
Use the most recent version of RedHat (gcc 2.96-85 or later with RedHat 7.2, although 7.1 is also okay for ReiserFS). The choice of an unstable, [http://gcc.gnu.org/gcc-2.96.html unreleased] version of gcc 2.96 by RedHat as the default gcc was a Slashdot controversy. [http://www.redhat.com/advice/speaks_gcc.html gcc 2.96 on RedHat 7.0 was unstable], and ReiserFS was one of the things that would fail under it. There are two gcc versions involved: 2.96 and 2.96-85. 2.96-85 works for ReiserFS; the other (the one on [http://www.redhat.com/docs/manuals/linux/RHL-7-Manual/ RedHat 7.0]) surely does not. Read the Linux kernel instructions about what compiler to use. The solution to code not working on broken compilers is the one RedHat has taken: fix the compiler. They [http://rhn.redhat.com/errata/RHBA-2002-055.html fixed] the compiler and thereby allowed the correctly compiled [[ReiserFS]] to work.

=== In my program I am using <tt>fsync(2)</tt> calls after each write to the file to guarantee the integrity of my file data, and this is very slow; how can I improve the performance?
=== Answer from Chris Mason: The main thing to remember is that <tt>fsync</tt> introduces a bunch of disk writes and forces the FS to wait on the buffers. The key to keeping performance up is to make it easy for the FS to do as much as possible before the <tt>fsync()</tt> call. So, if your application modifies 3 files and you want to make sure all 3 changes are safely on disk:
 write(file1)
 write(file2)
 write(file3)
 fsync(file1)
 fsync(file2)
 fsync(file3)
is much faster than:
 write(file1)
 fsync(file1)
 write(file2)
 fsync(file2)
 write(file3)
 fsync(file3)
It is also faster to write over existing bytes in a file than it is to append new bytes onto its end. When you overwrite existing bytes in the file, you don't have to commit new metadata to disk on <tt>fsync()</tt>; the FS can just write the data blocks. This means fewer seeks. The more you write to a single file before calling <tt>fsync()</tt>, the faster overall things will run:
 write(8k)
 fsync(file)
is much faster than:
 write(4k)
 fsync(file)
 write(4k)
 fsync(file)
Trying to optimize for those 3 things alone can make a huge performance difference overall.

Answer from Josh MacDonald: You have to understand that even using <tt>fsync()</tt> after every <tt>write()</tt> makes no guarantees. If the system crashes during either the <tt>write()</tt> or <tt>fsync()</tt> operation, your data may be lost or corrupted. Suppose the <tt>fsync()</tt> does complete: does your application keep its data in multiple files? If that is the case and you need to <tt>write()</tt> to multiple files as part of a transaction, you have even greater problems. The only safe and easy way for you to implement some kind of transaction with the traditional file system guarantees is to use <tt>rename()</tt>:
# Keep all of your data in a single file.
# Periodically write a complete copy of your database to a temporary file.
# Rename the temporary file to the original database name.
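The three-step <tt>rename()</tt> recipe above can be sketched like this (a minimal Python sketch; the file name is a placeholder, and the extra directory fsync for rename durability is our addition, not part of the original recipe):

```python
import os

def atomic_update(path, data):
    """Steps 2 and 3 of the recipe: write a complete copy to a temporary
    file, then rename it over the original. Readers see either the old
    or the new contents, never a partial write."""
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())   # data blocks must reach disk before the rename
    os.rename(tmp, path)       # atomic switch to the new copy
    # For durability of the rename itself, fsync the containing directory:
    dfd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
    try:
        os.fsync(dfd)
    finally:
        os.close(dfd)

atomic_update("db.dat", b"state v1")
atomic_update("db.dat", b"state v2")
print(open("db.dat", "rb").read())  # -> b'state v2'
```

A crash at any point leaves either the complete old copy or the complete new copy under the original name; at worst, a stale <tt>.tmp</tt> file is left behind to clean up.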
Addition from Nikita Danilov: One can implement something like a ''phase tree'' at user level and use <tt>rename()</tt> to atomically switch the root of the tree. This overcomes the "everything-in-one-file" limitation, but has the added complexity of requiring crash recovery. Alternatively, stop your development for now and wait until the [[Reiser4]] filesystem is released, which has a transaction API exported to userspace. That transaction API would solve all of these problems.

== Our program needs to access a lot of working files. What is the recommended way to organize files to get the best results out of ReiserFS? Should all the files be placed in a single directory, or should the files be spread across a directory tree to limit the number of files per directory? Can you also summarize the relevant caching and locking effects? ==
Traditional file systems typically perform poorly when there are many files in a single directory, but not [[ReiserFS]]. These other file systems perform poorly because they use a linear search algorithm to find and replace entries in a directory. This means that the file system must scan, on average, half the blocks of a directory for every access. Typically, applications are required to work around this problem by manually structuring a tree of directories, allowing each individual directory to remain limited in size. For example, see how the Squid web proxy stores a large collection of files. ReiserFS does not have this problem because it uses an internal tree to store all directories and file metadata. Directory operations remain efficient even for very large directories, so you can write your application free from this performance concern. However, there are several issues that complicate this matter: namely locking and locality.
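The manually structured tree mentioned above (of the kind Squid uses) can be sketched as follows (a Python sketch; the md5-based two-level fan-out and all names here are illustrative choices, not Squid's actual scheme):

```python
import hashlib
import os

def spread_path(root, name, fanout=(16, 256)):
    """Map a flat object name into a small two-level directory tree so
    that no single directory grows large: the classic workaround for
    filesystems that search directories linearly."""
    h = int(hashlib.md5(name.encode()).hexdigest(), 16)
    parts = []
    for width in fanout:
        parts.append("%02x" % (h % width))
        h //= width
    subdir = os.path.join(root, *parts)
    os.makedirs(subdir, exist_ok=True)
    return os.path.join(subdir, name)

p = spread_path("cache", "object-12345")
print(p)  # e.g. cache/0a/3f/object-12345 (exact path depends on the hash)
```

With a 16x256 fan-out, a million objects average well under 300 entries per leaf directory; on ReiserFS this layering is unnecessary for lookup speed, but it can still help with locking and locality, as discussed below.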
The Linux VFS currently imposes locking restrictions that serialize many operations on directories, so if concurrent processes or threads will access the collection of files then you may be better off using multiple directories. [[Reiser4]] will improve upon this restriction, although it is still under development. ReiserFS attempts to store all of the files in a directory, along with the directory itself, in nearby locations on disk. An application may exploit this spatial locality if it can predict which files will be accessed with temporal locality. You may be better of using multiple directories to store your files if you can predict that many files within a directory will be accessed at the same time. To summarize, ReiserFS supports efficient access to large directories and most traditional file systems do not. However, locking and locality issues may guide your decision to use manually structured directory trees instead, at least until ReiserFS exports control over packing locality to users, and improves its locking. [[category:ReiserFS]] [[category:Reiser4]] 8a1b4467ae14a758b8d2224a81ac36069d2069e4 2681 1952 2013-02-19T11:47:53Z ClaudiaBradshaw1994 561 /* What are the specs for ReiserFS: maximum number of files, of files a directory can have, of sub-dirs in a dir, of links to a file, maximum file size, maximum filesystem size, etc.? */ <font color=red>This FAQ is very [[ReiserFS]] centric and often a bit dated. The [[Reiser4]] filesystem is mentioned as ''upcoming''. Be sure to search the [[mailinglists|mailing list archives]] and help update this FAQ - Thanks!</font> __TOC__ === What are the specs for ReiserFS: maximum number of files, of files a directory can have, of sub-dirs in a dir, of links to a file, maximum file size, maximum filesystem size, etc.? 
=== Specifications for [[ReiserFS]]: {|cellpadding="5" cellspacing="0" border="1" | '''property''' || '''3.5''' || '''3.6''' |- | max number of files || 232-3 => 4 Gi - 3 || 232-3 => 4 Gi-3 |- | max number files a dir can have || 518701895 (but in practice this value is limited by hash function. r5 hash allows about 1 200 000 file names without collisions) || 232 - 4 => 4 Gi - 4 (but in practice this value is limited by hash function. r5 hash allows about 1 200 000 file names without collisions) |- | max file size || 231-1 => 2 Gi-1 || 260 - bytes => 1 Ei, but page cache limits this to 8 Ti on architectures with 32 bit int |- | max number links to a file || 216 => 64 Ki || 232 => 4 Gi |- | max filesystem size || 232 (4K) blocks => 16 Ti || 232 (4K) blocks => 16 Ti |} ReiserFS does '''meta-data journaling''', enabling fast crash recovery without the expense of full '''data journaling'''. There [http://marc.info/?l=reiserfs-devel&m=100895310422415&w=2 are separate patches from Chris Mason] that implement full data journaling for ReiserFS for Linux 2.4.16: * [http://mirror.fraunhofer.de/ftp.suse.com/people/mason/patches/data-logging/ ftp.suse.com/people/mason/patches/data-logging/] * [http://mirror.fraunhofer.de/ftp.suse.com/people/mason/patches/intermezzo-alpha/ ftp.suse.com/people/mason/patches/intermezzo-alpha/] '''Note''': Full data journaling is considered by many to be a good way to achieve file data integrity across system crashes. Are you busy as a bee? Do not even have time for fun? Do your friends advise you to buy [http://essaysexperts.com essay writing example]? Do not hesitate! Come after the advice of clever men and make a correct choice! 
However, although file data may appear to be consistent from the kernel point of view, since there is no API exported to the userspace to control transactions, we may end-up in a situation where the application makes two write requests (as part of one logical transaction) but only one of these gets journaled before the system crashes. From the application point of view, we may then end up with inconsistent data in the file. Such issues should be addressed with the upcoming [[Reiser4]]. Such an API will be exported to userspace and all programs that need transactions will be able to use it. === Mount fails after reiserfsck --rebuild-tree failure === When [[reiserfsck]] --rebuild-tree is run, the first thing it does is to set the root inode value to -1. This makes the filesystem unmountable. (So, if [[reiserfsck]] will fail later on, because it contains serious errors, this filesystem could not be mounted.) Therefore once [[reiserfsck]] --rebuild-tree have failed for one of your filesystems, mounting of this partition is disabled. To correct the error you must check if you are have the latest [[reiserfsprogs]] package installed. If that fails, please send a bug report to our [[mailinglists|mailing list]] and be ready to answer our questions. === Why is the execution time for a <tt>find . -type f | xargs cat {} \;</tt> command much longer when using ReiserFS than for the same command when using ext2? === This effect is observed if the measured file set was produced by untarring some archive created not from a ReiserFS partition (or by copying files from a non-ReiserFS partition or by running a program that writes a bunch of files in some order). This is because the <tt>readdir()</tt> operation performed on the ReiserFS partition returns filenames not in the original write order but rather in some hash order (dependant on the hash function used). Thus when reading files' contents, the hard drive heads must move when going from one file to another. 
If you want ReiserFS to outperform other filesystems in this setup, here is one solution: copy the entire directory that you are not satisfied with to the same partition under a different name (use <tt>cp -a</tt>), then remove the old directory and rename the new one to the old name. If the partition does not have enough space available, another approach is to <tt>tar(1)</tt> up the whole partition, clear it, and then untar the previously saved data.

=== Is quota support built into the vanilla 2.4 kernels for ReiserFS? ===

No. Quota support patches for the 2.4 kernel branch by Chris Mason were bundled separately and were once available [ftp://ftp.suse.com/pub/people/mason/patches/reiserfs/quota-2.4/ at SuSE] (gone); they are still [http://gd.tuwien.ac.at/utils/fs/reiserfs/quota-patches/ mirrored at TU-Wien]. These patches were not included in the 2.4 kernel branch because they implement a new quota format and need new quota code too, which is too big a change for the 2.4 series of kernels. Various Linux distribution vendors (e.g. [http://www.suse.com SuSE]) do ship reiserfs-quota-enabled kernels, though.
=== I am getting some errors in my kernel logs that I do not know how to interpret ===

Messages like:

 vs-13070: reiserfs_read_inode2: i/o failure occurred trying to find stat data of [1718696 1718710 0x0 SD]
 zam-7001: io error in reiserfs_find_entry

most likely accompanied by samples like the ones below are definite signs of hard disk problems (bad sectors):

 hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
 hda: dma_intr: error=0x40 { UncorrectableError }, LBAsect=6599945, sector=4286584
 end_request: I/O error, dev 03:03 (hda), sector 4286584

or

 scsi0: ERROR on channel 0, id 1, lun 0, CDB: Read (10) 00 00 01 ee 60 00 00 08 00
 Current sd 08:00: sense key Medium Error

or

 I/O error: dev 08:21, sector 65704

Messages about <tt>"access beyond end of device"</tt> can have many different causes, from not rebooting after fdisk requested it, to unfinished resizes, to data corruption. The following messages mean you have a noisy IDE cable, or one of too low quality for the chosen UDMA mode. Try replacing the cable with a better one, or choose a slower UDMA mode:

 hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
 hda: dma_intr: error=0x84 { DriveStatusError BadCRC }
 hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
 hda: dma_intr: error=0x84 { DriveStatusError BadCRC }

If you see any message from [[ReiserFS]] that you cannot interpret and there is nothing similar to the messages above around it, [[mailinglists|mail the message to us]] and we will explain it to you.

=== Will ReiserFS implement streams, extended attributes, etc.? ===

[[FAQ/streams|Here]] is the one-page answer.

=== Reiserfs appears to be very slow while the RAID is resyncing. Mounting takes several minutes. Once mounted, an <tt>ls(1)</tt> in the mounted directory hangs. Forever. Once the RAID is synced, things appear to work pretty well. How can that be fixed? ===

First of all, a patch that speeds up mounting has been included in the Linux kernel since 2.4.19.
You can grab the patch for earlier kernels [http://gd.tuwien.ac.at/utils/fs/reiserfs/reiserfs-for-2.5/2.5.4.pending/07-reiserfs-bitmap-journal-read-ahead.diff here]. Also, the RAID drivers have '''minimal guaranteed''' and '''maximal possible''' RAID rebuild bandwidth settings. These values are controlled through the <tt>/proc/sys/dev/raid/speed_limit_min</tt> and <tt>/proc/sys/dev/raid/speed_limit_max</tt> sysctl variables (values are in KiB/s). The RAID logic cannot always tell whether the disk subsystem is busy at a given time. When it thinks the disk subsystem is idle, it tries to rebuild the RAID array at the <tt>speed_limit_max</tt> speed, which defaults to 100 MB per second. Decrease this value to something more suitable (a bit of experimentation might be needed).

=== I get "attempt to read past the end of the partition" error messages; is ReiserFS corrupted? ===

You changed your partition sizes and then ran [[mkreiserfs]] before rebooting. The kernel does not update its notion of the partition sizes until reboot time. (This is fixable, but nobody has fixed it as of Dec. 2001.) [[mkreiserfs]] therefore created a filesystem with the wrong notion of how large its partition is. The filesystem's notion of the partition boundaries will last past reboot even though the kernel's notion will change. So yes, it is corrupted. Some other kinds of metadata breakage can also lead to such messages.

=== Can I use VMware with ReiserFS? ===

VMware was tested on [http://www.suse.com/ SuSE Linux] with a [http://support.microsoft.com/gp/lifean18 Windows 98] guest OS on a [[ReiserFS]] partition. There is one trick at the beginning: the following line was added to the VMware config file:

 host.FSSupportLocking1 = 0x52654973 # (0x52654973 == *(u32 *) "ReIs")

Thanks to [mailto:gkade@bigbrother.net Gregory K. Ade] for this hint.

=== How do I install Debian potato with ReiserFS as root partition? ===

[[FAQ/potato_part|Here]] are instructions by [mailto:LeBlanc@mcc.ac.uk Dr. A.V. Le Blanc].

=== Starting with Linux kernel v2.4.21 I cannot mount my FS anymore. Why? ===

Special sanity checks were added to the kernel code to prohibit mounting of filesystems that are bigger than the underlying block device. If you now see this message on mount:

 Filesystem on xx:yy cannot be mounted because it is bigger than the device

you may need to run fsck or increase the size of your LVM partition. Or maybe you forgot to reboot after fdisk when it told you to. If you do not use LVM, this usually means you need to run <tt>[[reiserfsck]] --rebuild-sb</tt> on your filesystem and agree to change its default size to the proposed one.

=== Is it OK to use ReiserFS on a small storage device, e.g. a 16MB NAND flash block device? ===

[[FAQ/small_blocks|Here]] are instructions.

=== How do I change root from ext2 to ReiserFS without loss of data? ===

[[FAQ/change_fs|Here]] are instructions.

=== <tt>mount: /dev/hda5 has wrong major or minor number</tt> - what does that mean? ===

The kernel does not know anything about [[ReiserFS]]; it is neither compiled in nor available as a module.

=== Will it be possible to read/write ReiserFS partitions created now with future versions of ReiserFS? ===

Yes. [[ReiserFS]]-3.6.x (Linux-2.4.x) works with both the old (3.5) and the new (3.6) formats. ReiserFS-3.5.x (Linux-2.2.x) can only work with the old (3.5) disk format. There is no way to convert the new (3.6) disk format to the old (3.5), but the old (3.5) format can be converted to the new one (3.6) with the <tt>-o conv</tt> [[mount|mount option]].

=== The ReiserFS module doesn't insert properly - why? ===

After applying the patch, ''recompile'' the whole kernel including the modules target, reboot, then try to insert the module.

=== Can I use ReiserFS with software RAID? ===

Yes, for all RAID levels using any Linux >= 2.4.1, but '''DO NOT''' use RAID with Linux 2.2.x.
Our journaling and their RAID code step on each other in the buffering code. Also, mirroring is '''not''' safe in the 2.2.x kernels because online mirror rebuilds in 2.2.x break the write-ordering requirements for the log. If you crash in the middle of an online rebuild, your meta-data may be corrupted. The only RAID level that is safe with [[ReiserFS]] in the 2.2.x kernels is the striping/concatenation level.

=== Can I use ReiserFS with 3ware RAID? ===

Yes, but you need to use Linux 2.2.19 or later for reasons other than [[ReiserFS]]. Also, if you encounter problems, be suspicious that it might not be ReiserFS that has the bug. 3ware published [http://web.archive.org/web/20030415160519/http://www.3ware.com/support/raid5techbulletin.shtml special instructions] (archive.org).

=== Why do things freeze on my IDE hard drive for annoying amounts of time? ===

Because when large writes are scheduled all at once, reads can starve. A fix for this is evolving; the later your ReiserFS patch, the better we handle this.

=== <tt>du(1)</tt> says ReiserFS makes space efficiency worse. ===

Use <tt>df(1)</tt>, not <tt>du(1)</tt>, or use the ''raw'' option for <tt>du(1)</tt> if it is supported. <tt>st_blocks</tt> summed up is less accurate than <tt>st_size</tt> for [[ReiserFS]] because we pack tails, and <tt>st_blocks</tt> rounds numbers up.

=== <tt>mkreiserfs(8)</tt> fails after repartitioning ===

The kernel requires you to reboot after repartitioning (for all filesystems). We intend to fix that.

=== Performance is poor, and my disk at 96% full still has free space. ===

Once a disk drive gets more than 85% full, performance starts to suffer unless a repacker is used (which isn't implemented yet). You can probably get away with 92%, but if performance is valued, you are making a mistake to keep it any fuller. This is true for almost all filesystems.
[[ReiserFS]], because it packs tails together, packs more data into a given percentage used, but it is still subject to the rules for the maximum recommended percentage used. If you create the whole disk with one copy and then mount it read-only, you can fully pack it without problems. Please be sure that you copy it from (or <tt>tar</tt> it from) a ReiserFS partition, so that files are created in ReiserFS <tt>readdir()</tt> order, as this will improve performance.

=== Why do I get a signal 11 when compiling the kernel using ReiserFS and not ext2? ===

Your CPU is overheating and/or you have [http://www.bitwizard.nl/sig11/ bad RAM].

=== But it doesn't happen with ext2? ===

ext2 uses less heat-sensitive gates in the CPU :-) Seriously, ext2 and [[ReiserFS]] contain random differences, and overheating and bad RAM have random sensitivities. ([http://www.bitwizard.nl/sig11/ Signal 11] is not due to ReiserFS. One user had a cable blocking the fan; it did not affect ext2, but it wasn't until he fixed the cable-fan problem that ReiserFS worked.)

=== Can I use ReiserFS on architectures other than i386? ===

Yes. Starting from Linux [http://kernel.org/pub/linux/kernel/v2.4/ChangeLog-2.4.13 kernel 2.4.13], ReiserFS can run on any architecture Linux supports.

=== I need a program which will help me in rebuilding/recreating my partition table. ===

[http://brzitwa.de/mb/gpart/ gpart] is a utility that handles ext2, FAT, Linux swap, HPFS, NTFS, FreeBSD and Solaris/x86 disklabels, Minix, and ReiserFS. It prints a proposed content for the primary partition table and is well documented.

=== What partition type should I use for ReiserFS? ===

[http://www.win.tue.nl/~aeb/partitions/partition_types.html Linux native filesystem] (83).

=== Can I use 32GB+ IDE hard drives with ReiserFS? ===

Yes, if you use Linux kernel 2.4 and up.

=== What about resizing ReiserFS? ===

This can be done with [[resize_reiserfs]].
=== What should I put into the fifth (aka dump, fs_freq) and the sixth (aka pass, fs_passno) fields of /etc/fstab for ReiserFS filesystems? ===

You'd put in <tt>"0 0"</tt>, e.g.

 /dev/sda3 /var reiserfs notail,nodev,nosuid,noexec <font color="red">0 0</font>

=== Why are ReiserFS filesystems not fscked on reboot after a crash? ===

Because [[ReiserFS]] provides journaling of meta-data. After a crash, the consistency of a filesystem is restored by replaying the transaction log.

=== Can I interactively repair a filesystem that was corrupted? ===

This is done with [[reiserfsck]].

=== Can I use <tt>dump(8)</tt> and <tt>restore(8)</tt> with ReiserFS? Any caveats? ===

No. <tt>dump(8)</tt> uses knowledge of the internal structure of ext2 and works together with <tt>restore(8)</tt>, which also uses ext2-specific knowledge, to back up ext2 files. dump and restore are specific to ext2 and will not work with [[ReiserFS]]. To back up ReiserFS files use <tt>tar(1)</tt>, which is universal and can be applied to almost any reasonable Linux filesystem. It is well known among system administrators that <tt>dump(8)</tt> is more complete than Unix tar, and that there is quite a list of things that Unix tar will fail to back up properly. This is not true of GNU tar, which is quite complete. Basically, the only real disadvantage of GNU tar compared to <tt>dump(8)</tt> is speed. Unfortunately, because it shares the same name as Unix <tt>tar(1)</tt>, people are reluctant to believe this. (Yes, GNU tar has incremental backups, etc.) We will performance-optimize ReiserFS backups for you (and the rest of the world) for $30K, which is not a lot if you are a large site spending a few hundred thousand on equipment for backups.

=== Does ReiserFS support snapshots? ===

No, but you can create [[ReiserFS]] on top of an [http://sourceware.org/lvm2/ LVM] logical volume and use LVM's snapshot capabilities.

=== Can I check reiserfs filesystems for errors without unmounting them? ===

[[reiserfsck]] in checking mode may be run on filesystems mounted read-only. There is no official way to fix mounted filesystems, though: you MUST completely unmount your filesystem in order to have it fixed. If you have LVM, you can check the consistency of filesystems mounted read-write using a script contributed by Andreas Dilger.

=== What ReiserFS mount options should I use to get the performance winner for a mail server? ===

[http://archives.neohapsis.com/archives/postfix/2001-03/1148.html Craig Sanders answered] in detail: By the time I got around to running <tt>bonnie</tt>, the <tt>postmark</tt> and <tt>postal</tt> benchmarks had convinced me that <tt>notail</tt> was essential.

Host system:
* Debian GNU/Linux (of course :)
* Linux kernel 2.4.2 with latest 20010305 ReiserFS patch
* dual P3-866 (256K cache)
* 512MB RAM
* [http://www.adaptec.com/en-US/support/scsi/u160/ASC-19160/ Adaptec 19160] SCSI controller

External drive box:
* [http://www.domex.com.tw/support/product/8230u.htm Domex 8230u] RAID controller, 32MB battery-backed cache
* 6 x 18GB IBM [http://www.hitachigst.com/tech/techlib.nsf/techdocs/85256AB8006A31E587256A78005A3610/$file/ddys_sp21.PDF DDYS-T18350M] drives

For this particular hardware, [[ReiserFS]]/notail on RAID5 was the clear performance winner for a mail server with lots of synced random I/O.

=== Does using ReiserFS mean I can just press the power-off button without running <tt>/sbin/shutdown</tt>? Does it mean there is no risk of data loss? ===

No, definitely not. As of now, [[ReiserFS]] only provides meta-data journaling - that is, it records which files have been created or opened, whether they have had their size changed, or where they have been relocated. It guarantees that the structure of the internal ReiserFS tree will be correct, thereby allowing you to start back up after an unclean shutdown without having to run fsck on all the files that have not been changed.
Data in files that were being used at the time of the crash may have been corrupted. This is usual for most filesystems. Data-journaling filesystems guarantee that no garbage will be written into a file, but they do not guarantee that a file update will actually reach the disk. (Only [[Reiser4]] guarantees that filesystem operations are performed as atomic operations, and provides atomic transaction functionality.) [[ReiserFS]] does not guarantee that the file contents themselves are uncorrupted, nor that no data is lost. Moreover, even if all of your system is on ReiserFS, many system components (like daemons, database managers, etc.) require the shutdown procedure for proper functioning. However, there is a [ftp://ftp.suse.com/pub/people/mason/patches/data-logging separate implementation of data logging] (dead) that will [http://marc.info/?l=reiserfs-devel&m=103472026011689&w=2 soon] go into the mainstream kernel.

=== How does ReiserFS support bad block handling? ===

This is covered [[FAQ/bad-block-handling|here]].

=== I have a motherboard with VIA MVP3 chipset and experience ReiserFS problems. ===

[mailto:woster73@yahoo.com William Oster] answers: If you are using a motherboard with a [http://www.via.com.tw/en/products/apollo/mvp3.jsp VIA MVP3] chipset, you may have [[ReiserFS]] problems caused by the way your kernel is configured for the so-called [http://lxr.linux.no/linux+v2.6.30/drivers/pci/quirks.c PCI quirks]. My experience is with kernels 2.2.18 and 2.2.19, but it may affect the 2.4.x series too if you are using the MVP3 chipset (popular in Socket 7 motherboards, such as those used by the AMD K6 and classic Pentium). I've confirmed this problem with several motherboards using the VIA MVP3 chipset, ReiserFS 3.5.29 to 3.5.32, and [http://lxr.linux.no/linux+v2.6.30/Documentation/scsi/ncr53c8xx.txt NCR 53c8xx SCSI]. But please note: it probably affects '''any controller which uses DMA and PCI bus mastering'''.
Problems which I was inclined to attribute to ReiserFS were actually problems with this kernel [mis]configuration. If you fit this profile, '''DO NOT''' enable the <tt>CONFIG_PCI_QUIRKS</tt> configuration option in the <tt>/usr/src/linux/.config</tt> file. Although the Linux documentation suggests that this option can be enabled if in doubt, '''DO NOT''' enable it. It was never intended for the VIA MVP3 chipset anyway. It affects the way DMA is handled, and in combination with ReiserFS (and possibly NCR SCSI) it can cause random disk corruption which eventually results in ReiserFS and/or SCSI errors. Evidently ReiserFS exercises DMA and the SCSI bus very thoroughly; the problems seem not to be as likely under the ext2 filesystem. Check your <tt>/usr/src/linux/.config</tt> file. You are safe from this problem if you find this line:

 # CONFIG_PCI_QUIRKS is not set

Any other setting could be dangerous to MVP3-chipset ReiserFS users, especially when using PCI bus-mastering controllers such as the NCR 53c8xx series. Re-configure your kernel to disable the "PCI quirks" option, then <tt>make dep</tt>, rebuild, and reinstall.

=== I am having extensive problems using ReiserFS; it seems to have bugs all over the place. I'm not compiling with a [[#I am using RedHat 7.0 with gcc 2.96; why does ReiserFS seem unstable with it?|buggy compiler]]. What is happening? How can this be stable? ===

You have hardware problems. Really, you do. Even if the bugs don't show up with ext2, you have hardware problems. (See [[#Why_do_I_get_a_signal_11_when_compiling_the_kernel_using_ReiserFS_and_not_ext2?|the signal 11 question]].) Most SuSE users use ReiserFS. Obscure bugs probably still exist, but if you find bugs as easily as when using Windows, you have bad RAM, a bad CPU, a bad cable, bad cooling, a [[#I have a motherboard with VIA MVP3 chipset and experience ReiserFS problems.|VIA chipset with PCI quirks turned on]], or other hardware or software-layer bugs. ReiserFS is stable.
You can be sure that if bugs are encountered easily and commonly with normal usage patterns, it is not us. This does not mean that the next release won't somehow break something, though :-/ Real bug reports are, at the time of writing, outnumbered 10 to 1 by hardware bugs that trigger error messages. We are working on making our error messages better at catching hardware bugs and identifying them as such. There is only so far we can go in runtime consistency checking without serious speed reductions, though. We don't release software unless it goes through extensive testing; so if you don't think that our testing could have missed the bug, it is probably hardware.

=== How can I put a label (like that allowed by the <tt>-L</tt> option of <tt>mkfs.ext2</tt>) on a ReiserFS instance? ===

Currently, this feature is only implemented for the [[ReiserFS]] v3.6 disk format. Adding it to the v3.5 disk format would break the existing disk format, and there is not enough free space in the superblock. You can set a label (and UUID) with a recent [[reiserfsprogs]] package on a [[ReiserFS]] v3.6 filesystem using the <tt>-l</tt> switch (<tt>-u</tt> for UUID) of [[reiserfstune]] (for existing partitions) or of [[mkreiserfs]] (for partitions being created). Support for labels and UUIDs was integrated into [[reiserfsprogs]] starting from version 3.x.1a.

=== Why, when I'm working on files (i.e. having open files) on my laptop, does ReiserFS access the disk every 5 seconds? This effectively prevents the disk from spinning down, i.e. APM modes from taking over, even when I'm not writing anything. ===

[mailto:bgraveland@hyperchip.com Brent Graveland] answers: It's the [http://kerneltrap.org/node/14148 atime] update. Every time you run <tt>sync(1)</tt>, the sync program's <tt>atime</tt> is updated. The next <tt>sync()</tt> writes this <tt>atime</tt> update, then <tt>sync(1)</tt>'s <tt>atime</tt> gets updated again.

=== RedHat does not unmount <tt>/</tt> (<tt>/dev/root</tt>) with ReiserFS on halt. How to fix it? ===

RedHat users kindly provided these patches (not tested by us):
* [[FAQ/rc.sysinit.patch|rc.sysinit.patch]]
* [[FAQ/halt.patch|halt.patch]]
Note that if you have [http://www.redhat.com/docs/manuals/linux/RHL-7.2-Manual RedHat Linux 7.2] or later, you do not need these patches.

=== How do I run programs from the reiserfsprogs package on encrypted devices? ===

In order to access such encrypted entities, you need to use the [http://www.linux.org/docs/ldp/howto/Cryptoloop-HOWTO/loopdevice-setup.html losetup(8)] tool to bind your entity to a <tt>loop</tt> device.

=== Are there any recommendations ''pro'' or ''against'' any particular hard drive manufacturers for use with ReiserFS? ===

No, as bad hard drives are not [[ReiserFS]]-specific but affect all filesystems. There is basically no preference; the general '''faster drive, lower seek time is better''' rule applies as always. On the other hand, almost every hard drive manufacturer has a '''widely known''' broken series of hard drives. The most recent example is [http://en.wikipedia.org/wiki/Deskstar_75GXP IBM's Deskstar] series of disks, especially the DTLA models produced in Hungary in 2000-2001. These are [http://ask.slashdot.org/article.pl?sid=01/10/04/0050238 known to fail very often], to the point that you probably don't want to use them even if you already paid for them. Other Deskstar drives also seem to be a poor choice: IBM released a note that Deskstar drives should not run for more than 8 hours/day on average. These drives are also known to be very sensitive to temperature conditions and to fail on overheating. There is a [http://web.archive.org/web/20060315210819/http://www.ibmdeskstar75gxplitigation.com/ class-action lawsuit against IBM] over that drive series.

=== I am using RedHat 7.0 with gcc 2.96; why does ReiserFS seem unstable with it? ===

Use the most recent version of RedHat (gcc 2.96-85 or later with RedHat 7.2, although 7.1 is also okay for ReiserFS).
The choice of an unstable [http://gcc.gnu.org/gcc-2.96.html unreleased] version of gcc 2.96 by RedHat as the default gcc was a Slashdot controversy. [http://www.redhat.com/advice/speaks_gcc.html gcc 2.96 on RedHat 7.0 was unstable], and ReiserFS was one of the things that would fail with it. There are two gcc versions: 2.96 and 2.96-85. 2.96-85 works for ReiserFS, and the other (the one on [http://www.redhat.com/docs/manuals/linux/RHL-7-Manual/ RedHat 7.0]) surely does not. Read the Linux kernel instructions about which compiler to use. The solution to code not working on broken compilers is the one RedHat has taken - fix the compiler. They [http://rhn.redhat.com/errata/RHBA-2002-055.html fixed] the compiler and thereby allowed the correctly compiled [[ReiserFS]] to work.

=== In my program I am using <tt>fsync(2)</tt> calls after each write to the file to guarantee the integrity of my file data, and this is very slow; how can I improve the performance? ===

Answer from Chris Mason: The main thing to remember is that <tt>fsync</tt> introduces a bunch of disk writes and forces the FS to wait on the buffers. The key to keeping performance up is to make it easy for the FS to do as much as possible before the <tt>fsync()</tt> call. So, if your application modifies 3 files and you want to make sure all 3 changes are safely on disk:

 write(file1)
 write(file2)
 write(file3)
 fsync(file1)
 fsync(file2)
 fsync(file3)

is much faster than:

 write(file1)
 fsync(file1)
 write(file2)
 fsync(file2)
 write(file3)
 fsync(file3)

It is also faster to write over existing bytes in a file than to append new bytes onto its end. When you overwrite existing bytes, you don't have to commit new metadata to disk on <tt>fsync()</tt>; the FS can just write the data blocks. This means fewer seeks. The more you write to a single file before calling <tt>fsync()</tt>, the faster overall things will run.
 write(8k)
 fsync(file)

is much faster than:

 write(4k)
 fsync(file)
 write(4k)
 fsync(file)

Optimizing for those three things alone can make a huge performance difference overall.

Answer from Josh MacDonald: You have to understand that even using <tt>fsync()</tt> after every <tt>write()</tt> makes no guarantees. If the system crashes during either the <tt>write()</tt> or <tt>fsync()</tt> operation, your data may be lost or corrupted. And suppose the <tt>fsync()</tt> does complete - does your application keep its data in multiple files? If so, and you need to <tt>write()</tt> to multiple files as part of one transaction, you have even greater problems. The only safe and easy way to implement some kind of transaction with the traditional filesystem guarantees is to use <tt>rename()</tt>:
# Keep all of your data in a single file.
# Periodically write a complete copy of your database to a temporary file.
# Rename the temporary file to the original database name.

Addition from Nikita Danilov: One can implement something like a ''phase tree'' at user level and use <tt>rename()</tt> to atomically switch the root of the tree. This overcomes the "everything-in-one-file" limitation but adds the complexity of requiring crash recovery. Alternatively, wait until the [[Reiser4]] filesystem is released: it has a transaction API exported to userspace, which would solve all of these problems.

== Our program needs to access a lot of working files. What is the recommended way to organize files to get the best results out of ReiserFS? Should all the files be placed in a single directory, or should the files be spread across a directory tree to limit the number of files per directory? Can you also summarize the relevant caching and locking effects? ==

Traditional file systems typically have poor performance when there are many files in a single directory, but not [[ReiserFS]].
These other file systems perform poorly because they use a linear search algorithm to find and replace entries in a directory. This means that the file system must scan, on average, half the blocks of a directory for every access. Typically, applications are required to work around this problem by manually structuring a tree of directories, allowing each individual directory to remain limited in size. For example, see how the Squid web proxy stores a large collection of files. ReiserFS does not have this problem because it uses an internal tree to store all directories and file metadata. Directory operations remain efficient even for very large directories, so you can write your application free from this performance concern.

However, there are several issues that complicate this matter, namely locking and locality. The Linux VFS currently imposes locking restrictions that serialize many operations on directories, so if concurrent processes or threads will access the collection of files, you may be better off using multiple directories. [[Reiser4]] will improve upon this restriction, although it is still under development. ReiserFS attempts to store all of the files in a directory, along with the directory itself, in nearby locations on disk. An application may exploit this spatial locality if it can predict which files will be accessed with temporal locality. You may be better off using multiple directories to store your files if you can predict that many files within a directory will be accessed at the same time.

To summarize, ReiserFS supports efficient access to large directories, and most traditional file systems do not. However, locking and locality issues may guide your decision to use manually structured directory trees instead, at least until ReiserFS exports control over packing locality to users and improves its locking.

[[category:ReiserFS]] [[category:Reiser4]]
<font color=red>This FAQ is very [[ReiserFS]] centric and often a bit dated. The [[Reiser4]] filesystem is mentioned as ''upcoming''. Be sure to search the [[mailinglists|mailing list archives]] and help update this FAQ - Thanks!</font> __TOC__ === What are the specs for ReiserFS: maximum number of files, of files a directory can have, of sub-dirs in a dir, of links to a file, maximum file size, maximum filesystem size, etc.? === Specifications for [[ReiserFS]]: {|cellpadding="5" cellspacing="0" border="1" | '''property''' || '''3.5''' || '''3.6''' |- | max number of files || 232-3 => 4 Gi - 3 || 232-3 => 4 Gi-3 |- | max number files a dir can have || 518701895 (but in practice this value is limited by hash function. r5 hash allows about 1 200 000 file names without collisions) || 232 - 4 => 4 Gi - 4 (but in practice this value is limited by hash function. r5 hash allows about 1 200 000 file names without collisions) |- | max file size || 231-1 => 2 Gi-1 || 260 - bytes => 1 Ei, but page cache limits this to 8 Ti on architectures with 32 bit int |- | max number links to a file || 216 => 64 Ki || 232 => 4 Gi |- | max filesystem size || 232 (4K) blocks => 16 Ti || 232 (4K) blocks => 16 Ti |} ReiserFS does '''meta-data journaling''', enabling fast crash recovery without the expense of full '''data journaling'''. There [http://marc.info/?l=reiserfs-devel&m=100895310422415&w=2 are separate patches from Chris Mason] that implement full data journaling for ReiserFS for Linux 2.4.16: * [http://mirror.fraunhofer.de/ftp.suse.com/people/mason/patches/data-logging/ ftp.suse.com/people/mason/patches/data-logging/] * [http://mirror.fraunhofer.de/ftp.suse.com/people/mason/patches/intermezzo-alpha/ ftp.suse.com/people/mason/patches/intermezzo-alpha/] '''Note''': Full data journaling is considered by many to be a good way to achieve file data integrity across system crashes. 
However, although file data may appear to be consistent from the kernel point of view, since there is no API exported to the userspace to control transactions, we may end-up in a situation where the application makes two write requests (as part of one logical transaction) but only one of these gets journaled before the system crashes. From the application point of view, we may then end up with inconsistent data in the file. Such issues should be addressed with the upcoming [[Reiser4]]. Such an API will be exported to userspace and all programs that need transactions will be able to use it. === Mount fails after reiserfsck --rebuild-tree failure === When [[reiserfsck]] --rebuild-tree is run, the first thing it does is to set the root inode value to -1. This makes the filesystem unmountable. (So, if [[reiserfsck]] will fail later on, because it contains serious errors, this filesystem could not be mounted.) Therefore once [[reiserfsck]] --rebuild-tree have failed for one of your filesystems, mounting of this partition is disabled. To correct the error you must check if you are have the latest [[reiserfsprogs]] package installed. If that fails, please send a bug report to our [[mailinglists|mailing list]] and be ready to answer our questions. === Why is the execution time for a <tt>find . -type f | xargs cat {} \;</tt> command much longer when using ReiserFS than for the same command when using ext2? === This effect is observed if the measured file set was produced by untarring some archive created not from a ReiserFS partition (or by copying files from a non-ReiserFS partition or by running a program that writes a bunch of files in some order). This is because the <tt>readdir()</tt> operation performed on the ReiserFS partition returns filenames not in the original write order but rather in some hash order (dependant on the hash function used). Thus when reading files' contents, the hard drive heads must move when going from one file to another. 
If you want ReiserFS to outperform any other filesystem in your setup, here is one solution: copy the entire directory that you are not satisfied with to the same partition but with a different name (use <tt>cp -a</tt>), then remove the old directory and rename the new one to the old name. If the partition does not have enough space available, another approach is to <tt>tar(1)</tt> up the whole partition, clear it, and then untar the previously saved data. === Is quota-support built-in in the vanilla 2.4 kernels for ReiserFS? === No. Quota support for the 2.4 kernel branch is bundled separately; the patches by Chris Mason were once available [ftp://ftp.suse.com/pub/people/mason/patches/reiserfs/quota-2.4/ at SuSE] (gone) and are still [http://gd.tuwien.ac.at/utils/fs/reiserfs/quota-patches/ mirrored at TU-Wien]. The reason these patches were not included in the 2.4 kernel branch is that they implement a new quota format and need new quota code too, which is too big a change for the 2.4 series of kernels. Various Linux distribution vendors (e.g. [http://www.suse.com SuSE]) do ship reiserfs-quota-enabled kernels, though.
=== I am getting some errors in my kernel logs that I do not know how to interpret === Messages like: vs-13070: reiserfs_read_inode2: i/o failure occurred trying to find stat data of [1718696 1718710 0x0 SD] zam-7001: io error in reiserfs_find_entry most likely accompanied by samples like those below are definite signs of hard disk problems (bad sectors): hda: dma_intr: status=0x51 { DriveReady SeekComplete Error } hda: dma_intr: error=0x40 { UncorrectableError }, LBAsect=6599945, sector=4286584 end_request: I/O error, dev 03:03 (hda), sector 4286584 or scsi0: ERROR on channel 0, id 1, lun 0, CDB: Read (10) 00 00 01 ee 60 00 00 08 00 Current sd 08:00: sense key Medium Error or I/O error: dev 08:21, sector 65704 Messages about <tt>"access beyond end of device"</tt> may have many different causes, from not rebooting after fdisk requested it, to unfinished resizings, to data corruption. The following messages mean you have a noisy IDE cable, or one of too low quality for the chosen UDMA mode. Try replacing the cable with a better one, or choose a slower UDMA mode: hda: dma_intr: status=0x51 { DriveReady SeekComplete Error } hda: dma_intr: error=0x84 { DriveStatusError BadCRC } hda: dma_intr: status=0x51 { DriveReady SeekComplete Error } hda: dma_intr: error=0x84 { DriveStatusError BadCRC } If you see any message from [[ReiserFS]] that you cannot interpret and there is nothing similar to the messages above around it, [[mailinglists|mail the message to us]] and we will explain it to you. === Will ReiserFS implement streams, extended attributes, etc.? === [[FAQ/streams|Here]] is the one-page answer. === Reiserfs appears to be very slow while the RAID is resyncing. Mounting takes several minutes. Once mounted, an <tt>ls(1)</tt> in the mounted directory hangs. Forever. Once the RAID is synced, things appear to work pretty well. How can that be fixed? === First of all, a patch that makes mounting the drive faster has been included in the Linux kernel since 2.4.19.
You can grab the patch for earlier kernels [http://gd.tuwien.ac.at/utils/fs/reiserfs/reiserfs-for-2.5/2.5.4.pending/07-reiserfs-bitmap-journal-read-ahead.diff here]. Also, RAID drivers have '''minimal guaranteed''' and '''maximal possible''' RAID rebuild bandwidth usage. These values are controlled through the <tt>/proc/sys/dev/raid/speed_limit_min</tt> and <tt>/proc/sys/dev/raid/speed_limit_max</tt> sysctl variables (values are in KiB/s). It seems that the RAID logic cannot always tell whether the disk subsystem is busy at a given time. When it thinks the disk subsystem is idle, it tries to rebuild the RAID array at the <tt>speed_limit_max</tt> speed, which defaults to 100 MB per second. Decrease this value to something more suitable (a bit of experimentation might be needed). === I get attempt to read past the end of the partition error messages; is ReiserFS corrupted? === You changed your partition sizes and then ran [[mkreiserfs]] before rebooting. The kernel does not update its notion of the partition sizes until reboot. (This is fixable, but nobody has fixed it as of Dec. 2001.) [[mkreiserfs]] created a filesystem with the wrong notion of how large its partition is. The filesystem's notion of the partition boundaries will persist past reboot even though the kernel's notion will change. So yes, it is corrupted. Some other kinds of metadata breakage can also lead to such messages. === Can I use VMware with ReiserFS? === VMware was tested on [http://www.suse.com/ SuSE Linux] with a [http://support.microsoft.com/gp/lifean18 Windows98] guest OS on a [[ReiserFS]] partition. There is one trick at the beginning: the following line was added to the VMware config file host.FSSupportLocking1 = 0x52654973 # (0x52654973 == *(u32 *) "ReIs") Thanks to [mailto:gkade@bigbrother.net Gregory K. Ade] for this hint. === How do I install Debian potato with ReiserFS as root partition?
=== [[FAQ/potato_part|Here]] are instructions by [mailto:LeBlanc@mcc.ac.uk Dr. A.V. Le Blanc]. === Starting with Linux kernel v2.4.21 I cannot mount my FS anymore. Why? === Sanity checks were added to the kernel code to prohibit mounting filesystems that are bigger than the underlying block device. If you now see this message on mount: Filesystem on xx:yy cannot be mounted because it is bigger than the device you may need to run fsck or increase the size of your LVM partition. Or maybe you forgot to reboot after fdisk when it told you to. If you do not use LVM, that usually means you need to run <tt>[[reiserfsck]] --rebuild-sb</tt> on your filesystem and agree to change its default size to the proposed one. === Is it ok to use ReiserFS on a small storage device: e.g. a 16MB NAND flash block device? === [[FAQ/small_blocks|Here]] are instructions. === How do I change root from ext2 to ReiserFS without loss of data? === [[FAQ/change_fs|Here]] are instructions. === <tt>mount: /dev/hda5 has wrong major or minor number</tt> - what does that mean? === The kernel does not know anything about [[ReiserFS]]; it is neither compiled in nor available as a module. === Will it be possible to read/write ReiserFS partitions created now with future versions of ReiserFS? === Yes. [[ReiserFS]]-3.6.x (Linux-2.4.x) works with both the old (3.5) and the new (3.6) formats. ReiserFS-3.5.x (Linux-2.2.x) can only work with the old (3.5) disk format. There is no way to convert the new (3.6) disk format to the old (3.5), but the old (3.5) format can be converted to the new one (3.6) with the <tt>-o conv</tt> [[mount|mount option]]. === The ReiserFS module doesn't insert properly - why? === After applying the patch, ''recompile'' the whole kernel including the modules target, reboot, then try to insert the module. === Can I use ReiserFS with the software RAID? === Yes, for all RAID levels using any Linux >= 2.4.1, but '''DO NOT''' use RAID with Linux 2.2.x.
Our journaling and their RAID code step on each other in the buffering code. Also, mirroring is '''not''' safe in the 2.2.x kernels because online mirror rebuilds in 2.2.x break the write-ordering requirements for the log. If you crash in the middle of an online rebuild, your meta-data may be corrupted. The only RAID level that is safe with [[ReiserFS]] in the 2.2.x kernels is the striping/concatenation level. === Can I use ReiserFS with 3ware RAID? === Yes, but you need to use Linux 2.2.19 or later for reasons other than [[ReiserFS]]. Also, should you encounter problems, be suspicious that it might not be ReiserFS that has the bug. See the [http://web.archive.org/web/20030415160519/http://www.3ware.com/support/raid5techbulletin.shtml special instructions] (archive.org). === Why do things freeze on my IDE hard drive for annoying amounts of time? === Because when large writes are scheduled all at once, reads can starve. A fix for this is evolving; the later your ReiserFS patch, the better we handle this. === <tt>du(1)</tt> says ReiserFS makes space efficiency worse. === Use <tt>df(1)</tt>, not <tt>du(1)</tt>, or use the ''raw'' option for <tt>du(1)</tt> if it's supported. <tt>st_blocks</tt> summed up is less accurate than <tt>st_size</tt> for [[ReiserFS]] because we pack tails, and <tt>st_blocks</tt> rounds numbers up. === <tt>mkreiserfs(8)</tt> fails after repartitioning === The kernel requires you to reboot after repartitioning (for all filesystems). We intend to fix that. === Performance is poor, and my disk at 96% full still has free space. === Once a disk drive gets more than 85% full, performance starts to suffer unless you use a repacker (which isn't implemented yet). You can probably get away with 92%, but if performance is valued you are making a mistake to keep it any fuller. This is true for almost all filesystems.
[[ReiserFS]], because it packs tails together, packs more data into a given percentage used, but it is still subject to the rules for the maximum recommended percentage used. If you create the whole disk with one copy and then mount it read-only, then you can fully pack it without problem. Please be sure that you copy it from (or <tt>tar</tt> it from) a reiserfs partition so that files are created in reiserfs <tt>readdir()</tt> order, as this will improve performance. === Why do I get a signal 11 when compiling the kernel using ReiserFS and not ext2? === Your CPU is overheating and/or you have [http://www.bitwizard.nl/sig11/ bad RAM]. === But it doesn't happen with ext2? === ext2 uses less heat-sensitive gates in the CPU :-) Seriously, ext2 and [[ReiserFS]] contain random differences, and overheating and bad RAM have random sensitivities. ([http://www.bitwizard.nl/sig11/ Signal 11] is not due to ReiserFS. One user had a cable blocking the fan; it did not affect ext2, but it wasn't until he fixed the cable-fan problem that ReiserFS worked.) === Can I use ReiserFS on architectures other than i386? === Yes. Starting from Linux [http://kernel.org/pub/linux/kernel/v2.4/ChangeLog-2.4.13 kernel 2.4.13], ReiserFS can run on any Linux-supported architecture. === I need a program which will help me in rebuilding/recreating my partition table. === [http://brzitwa.de/mb/gpart/ gpart] is a utility that handles ext2, FAT, Linux swap, HPFS, NTFS, FreeBSD and Solaris/x86 disklabels, Minix, and ReiserFS. It prints a proposed content for the primary partition table and is well documented. === What partition type should I use for ReiserFS? === [http://www.win.tue.nl/~aeb/partitions/partition_types.html Linux native filesystem] (83) === Can I use 32GB+ IDE Hard Drives with ReiserFS? === Yes, if you use Linux kernel 2.4 and up. === What about resizing ReiserFS? === This can be done with [[resize_reiserfs]].
=== What should I put into the fifth (aka dump, fs_freq) and the sixth (aka pass, fs_passno) fields of /etc/fstab for ReiserFS filesystems? === You'd put in <tt>"0 0"</tt>, e.g. /dev/sda3 /var reiserfs notail,nodev,nosuid,noexec <font color="red">0 0</font> === Why are ReiserFS filesystems not fscked on reboot after a crash? === Because [[ReiserFS]] provides journaling of meta-data. After a crash, the consistency of a filesystem is restored by replaying the transaction log. === Can I interactively repair a filesystem that was corrupted? === This is done with [[reiserfsck]]. === Can I use <tt>dump(8)</tt> and <tt>restore(8)</tt> with ReiserFS? Any caveats? === No. <tt>dump(8)</tt> uses knowledge of the internal structure of ext2 and works together with restore, which also uses ext2-specific knowledge, to back up ext2 files. dump and restore are specific to ext2 and will not work with [[ReiserFS]]. To back up ReiserFS files use <tt>tar(1)</tt>, which is universal and can be applied to almost any reasonable Linux filesystem. It is well known among system administrators that <tt>dump(8)</tt> is more complete than Unix tar, and that there is quite a list of things that Unix tar will fail to back up properly. This is not true of GNU tar, which is quite complete. Basically, the only real disadvantage of GNU tar compared to <tt>dump(8)</tt> is speed. Unfortunately, because it shares the same name as Unix <tt>tar(1)</tt>, people are reluctant to believe this. (Yes, GNU tar has incremental backups, etc.) We will performance-optimize ReiserFS backups for you (and the rest of the world) for $30K, which is not a lot if you are a large site spending a few hundred thousand on equipment for backups. === Does ReiserFS support snapshots? === No, but you can create [[ReiserFS]] on top of an [http://sourceware.org/lvm2/ LVM] logical volume and use LVM snapshot capabilities. === Can I check reiserfs filesystems for errors without unmounting them?
=== [[reiserfsck]] in checking mode can be run on filesystems mounted read-only. There is no official way to fix mounted filesystems, though: you MUST completely unmount your filesystem in order to have it fixed. If you have LVM, you can check the consistency of filesystems mounted read-write; see the script contributed by Andreas Dilger. === What ReiserFS mount options should I use to get the performance winner for a mail server? === [http://archives.neohapsis.com/archives/postfix/2001-03/1148.html Craig Sanders answered] in detail: By the time I got around to running <tt>bonnie</tt>, the <tt>postmark</tt> and <tt>postal</tt> benchmarks had convinced me that <tt>notail</tt> was essential. Host system: * Debian GNU/Linux (of course :) * Linux kernel 2.4.2 with latest 20010305 ReiserFS patch * dual P3-866 (256K cache) * 512MB RAM * [http://www.adaptec.com/en-US/support/scsi/u160/ASC-19160/ Adaptec 19160] SCSI Controller External drive box: * [http://www.domex.com.tw/support/product/8230u.htm Domex 8230u] RAID controller, 32MB battery-backed cache. * 6 x 18GB IBM [http://www.hitachigst.com/tech/techlib.nsf/techdocs/85256AB8006A31E587256A78005A3610/$file/ddys_sp21.PDF DDYS-T18350M] drives For this particular hardware, [[ReiserFS]]/notail on RAID5 was the clear performance winner for a mail server with lots of synced random I/O. === Does using ReiserFS mean I can just press the power off button without running <tt>/sbin/shutdown</tt>? Does it mean there is no risk of data loss? === No, definitely not. As of now, [[ReiserFS]] only provides meta-data journaling - that is, it records which files have been created or opened, whether they have had their size changed, or where they have been relocated. It guarantees that the structure of the internal ReiserFS tree will be correct, thereby allowing you, after an unclean shutdown, to start back up without having to run fsck on all the files that have not been changed.
Data in files that were being used at the time of the crash could have been corrupted. This is usual for most filesystems. Data-journaling filesystems guarantee that there will be no garbage written into a file, but they don't guarantee that a file update will be completed. (Only [[Reiser4]] guarantees that filesystem operations are performed as atomic operations, and provides atomic transaction functionality.) [[ReiserFS]] does not guarantee that the file contents themselves are uncorrupted, nor that no data is lost. Moreover, even if all of your system is on ReiserFS, many system components (like daemons, database managers, etc.) require the shutdown procedure for proper functioning. However, there is a [ftp://ftp.suse.com/pub/people/mason/patches/data-logging separate implementation of data logging] (dead) that will [http://marc.info/?l=reiserfs-devel&m=103472026011689&w=2 soon] go into the mainstream kernel. === How does ReiserFS support bad block handling? === This is covered [[FAQ/bad-block-handling|here]]. === I have a motherboard with VIA MVP3 chipset and experience ReiserFS problems. === [mailto:woster73@yahoo.com William Oster] answers: If you are using a motherboard with a [http://www.via.com.tw/en/products/apollo/mvp3.jsp VIA MVP3] chipset, you may have [[ReiserFS]] problems caused by the way your kernel is configured for the so-called [http://lxr.linux.no/linux+v2.6.30/drivers/pci/quirks.c PCI quirks]. My experience is with kernels 2.2.18 and 2.2.19, but it may affect the 2.4.x series too if you are using the MVP3 chipset (popular in Socket 7 motherboards, such as those used by the AMD K6 and classic Pentium). I've confirmed this problem with several motherboards using the VIA MVP3 chipset, ReiserFS 3.5.29 to 3.5.32, and [http://lxr.linux.no/linux+v2.6.30/Documentation/scsi/ncr53c8xx.txt NCR 53c8xx SCSI]. But please note: it probably affects '''any controller which uses DMA and PCI bus mastering'''.
Problems which I was inclined to attribute to ReiserFS were actually problems with this kernel [mis]configuration. If you fit this profile, '''DO NOT''' enable the <tt>CONFIG_PCI_QUIRKS</tt> configuration option in the <tt>/usr/src/linux/.config</tt> file. Although the Linux documentation suggests that this option can be enabled if in doubt, '''DO NOT''' enable it. It was never intended for the VIA MVP3 chipset anyway. It affects the way DMA is handled, and in combination with ReiserFS (and possibly NCR SCSI) it can cause random disk corruption which eventually results in ReiserFS and/or SCSI errors. Evidently ReiserFS exercises the DMA and SCSI bus very thoroughly; the problems seem less likely under the ext2 filesystem. Check your <tt>/usr/src/linux/.config</tt> file. You are safe from this problem if you find this line: # CONFIG_PCI_QUIRKS is not set Any other setting could be dangerous to MVP3-chipset ReiserFS users, especially when using PCI bus-mastering controllers such as the NCR 53c8xx series. Re-configure your kernel to disable the "PCI quirks" option, then <tt>make dep</tt>, rebuild, and reinstall. === I am having extensive problems using ReiserFS; it seems to have bugs all over the place. I'm not compiling with a [[#I am using RedHat 7.0 with gcc 2.96; why does ReiserFS seem unstable with it?|buggy compiler]]. What is happening? How can this be stable? === You have hardware problems. Really, you do. Even if the bugs don't show up with ext2, you have hardware problems. (See [[#Why_do_I_get_a_signal_11_when_compiling_the_kernel_using_ReiserFS_and_not_ext2?|the signal 11 question]].) Most SuSE users use ReiserFS. Obscure bugs probably still exist; but if you find bugs as easily as when using Windows, you have bad RAM, a bad CPU, a bad cable, bad cooling, a [[#I have a motherboard with VIA MVP3 chipset and experience ReiserFS problems.|VIA chipset with PCI quirks turned on]], or other hardware or software-layer bugs. ReiserFS is stable.
You can be sure that if bugs are encountered easily and commonly with normal usage patterns, it is not us. This does not mean that the next release won't somehow break something, though :-/ Real bug reports are, at the time of writing, outnumbered 10 to 1 by hardware bugs that trigger error messages. We are working on making our error messages better at catching hardware bugs and identifying them as such. There is only so far we can go, though, in runtime consistency checking without serious speed reductions. We don't release software unless it goes through extensive testing; so if you don't think that our testing could have missed the bug, it is probably hardware. === How can I put a label (like the one allowed by the <tt>-L</tt> option of <tt>mkfs.ext2</tt>) on a ReiserFS instance? === Currently, this feature is only implemented for the [[ReiserFS]] v3.6 disk format. Adding it to the v3.5 disk format would break the existing disk format, and there is not enough free space in the superblock. You can set a label (and UUID) with a recent [[reiserfsprogs]] package on a [[ReiserFS]] v3.6 filesystem using the <tt>-l</tt> switch (<tt>-u</tt> for UUID) to the [[reiserfstune]] (for existing partitions) or [[mkreiserfs]] (for partitions being created) commands. Support for labels and UUIDs was integrated into [[reiserfsprogs]] starting from version 3.x.1a. === Why, when I'm working on files (i.e. having open files) on my laptop, does ReiserFS access the disk every 5 seconds? This effectively prevents the disk from spinning down, i.e. APM modes taking over, even when I'm not writing anything. === [mailto:bgraveland@hyperchip.com Brent Graveland] answers: It's the [http://kerneltrap.org/node/14148 atime] update. Every time you run <tt>sync(1)</tt>, the sync program's <tt>atime</tt> is updated. The next <tt>sync()</tt> writes this <tt>atime</tt> update, then <tt>sync(1)</tt> gets updated again. === RedHat does not unmount <tt>/</tt> (<tt>/dev/root</tt>) with ReiserFS on halt. How to fix it?
=== RedHat users kindly provided these patches (not tested by us): * [[FAQ/rc.sysinit.patch|rc.sysinit.patch]] * [[FAQ/halt.patch|halt.patch]] Note that if you have [http://www.redhat.com/docs/manuals/linux/RHL-7.2-Manual RedHat Linux 7.2] or later, you do not need these patches. === How do I run programs from the reiserfsprogs package on encrypted devices? === In order to access such encrypted entities, you need to use the [http://www.linux.org/docs/ldp/howto/Cryptoloop-HOWTO/loopdevice-setup.html losetup(8)] tool to bind your entity to a <tt>loop</tt> device. === Are there any recommendations ''pro'' or ''against'' any particular hard drive manufacturers for use with ReiserFS? === No, as bad hard drives are not [[ReiserFS]]-specific but affect all filesystems. There is basically no preference; the general rule that '''a faster drive with less seek time is better''' applies as always. On the other hand, almost every hard drive manufacturer has a '''widely known''' broken series of hard drives. The most recent example is [http://en.wikipedia.org/wiki/Deskstar_75GXP IBM's Deskstar] series of disks, especially the DTLA models produced in Hungary in 2000-2001. These are [http://ask.slashdot.org/article.pl?sid=01/10/04/0050238 known to fail very often], to the point that you probably don't want to use them even if you already paid for them. Other Deskstar drives also seem to be a poor choice: IBM released a note that Deskstar drives should not run for more than 8 hours/day on average. These drives are also known to be very sensitive to temperature conditions and to fail on overheating. There is a [http://web.archive.org/web/20060315210819/http://www.ibmdeskstar75gxplitigation.com/ class-action lawsuit against IBM] over that drive series. === I am using RedHat 7.0 with gcc 2.96; why does ReiserFS seem unstable with it? === Use the most recent version of RedHat (gcc 2.96-85 or later with RedHat 7.2, although 7.1 is also okay for ReiserFS).
The choice of an unstable, [http://gcc.gnu.org/gcc-2.96.html unreleased] version of gcc (2.96) by RedHat as the default compiler was a Slashdot controversy. [http://www.redhat.com/advice/speaks_gcc.html gcc 2.96 on RedHat 7.0 was unstable], and ReiserFS was one of the things that would fail with it. There are two gcc versions involved: 2.96 and 2.96-85. 2.96-85 works for ReiserFS; the other (the one on [http://www.redhat.com/docs/manuals/linux/RHL-7-Manual/ RedHat 7.0]) surely does not. Read the Linux kernel instructions about what compiler to use. The solution to code not working on broken compilers is the one RedHat has taken - fix the compiler. They [http://rhn.redhat.com/errata/RHBA-2002-055.html fixed] the compiler and thereby allowed the correctly compiled [[ReiserFS]] to work. === In my program I am using <tt>fsync(2)</tt> calls after each write to the file to guarantee integrity of my file data, and this is very slow. How can I improve the performance? === Answer from Chris Mason: The main thing to remember is that <tt>fsync</tt> introduces a bunch of disk writes and forces the FS to wait on the buffers. The key to keeping performance up is to make it easy for the FS to do as much as possible before the <tt>fsync()</tt> call. So, if your application modifies 3 files, and you want to make sure all 3 changes are safely on disk: write(file1) write(file2) write(file3) fsync(file1) fsync(file2) fsync(file3) is much faster than: write(file1) fsync(file1) write(file2) fsync(file2) write(file3) fsync(file3) It is also faster to write over existing bytes in a file than it is to append new bytes onto the end of a file. When you overwrite existing bytes in the file, you don't have to commit new metadata to disk on <tt>fsync()</tt>; the FS can just write the data blocks. This means fewer seeks. The more you write to a single file before calling <tt>fsync()</tt>, the faster overall things will run.
write(8k) fsync(file) is much faster than: write(4k) fsync(file) write(4k) fsync(file) Trying to optimize for those 3 things alone can make a huge performance difference overall. Answer from Josh MacDonald: You have to understand that even using <tt>fsync()</tt> after every <tt>write()</tt> makes no guarantees. If the system crashes during either the <tt>write()</tt> or the <tt>fsync()</tt> operation, your data may be lost or corrupted. Suppose the <tt>fsync()</tt> does complete: does your application keep its data in multiple files? If so, and you need to <tt>write()</tt> to multiple files as part of a transaction, you have even greater problems. The only safe and easy way to implement some kind of transaction with the traditional filesystem guarantees is to use <tt>rename()</tt>: # Keep all of your data in a single file. # Periodically write a complete copy of your database to a temporary file. # Rename the temporary file to the original database name. Addition from Nikita Danilov: One can implement something like a ''phase tree'' at user level and use <tt>rename()</tt> to atomically switch the root of the tree. This overcomes the "everything-in-one-file" limitation but has the added complexity of requiring crash recovery. Or stop your development for now and wait until the [[Reiser4]] filesystem is released; its transaction API, exported to userspace, would solve all of your problems.
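The rename-based transaction pattern described above can be sketched at user level as follows. This is a minimal illustration, not a ReiserFS API: the <tt>atomic_write</tt> helper and the file names are invented for the example. The key points are that the complete new copy is written and fsynced first, and that the temporary file lives on the same filesystem so the final rename is atomic.

```python
import os
import tempfile

def atomic_write(path, data):
    """Replace the contents of `path` with `data` so that a reader (or a
    crash) sees either the old version or the new one, never a partial
    write. Illustrative helper, not part of any filesystem API."""
    dirname = os.path.dirname(os.path.abspath(path))
    # Write the complete new copy to a temporary file in the same
    # directory; rename is only atomic within a single filesystem.
    fd, tmp = tempfile.mkstemp(dir=dirname)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())   # force the new data to disk first
        os.replace(tmp, path)      # then atomically switch names
    except BaseException:
        os.unlink(tmp)
        raise

atomic_write("db.txt", b"version 2\n")
```

Note that this only batches one fsync per logical update, which also follows Chris Mason's advice above: do as much writing as possible before the single <tt>fsync()</tt> call.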
These other filesystems perform poorly because they use a linear search algorithm to find and replace entries in a directory. This means that the filesystem must scan, on average, half the blocks of a directory for every access. Typically, applications are required to work around this problem by manually structuring a tree of directories, allowing each individual directory to remain limited in size. For example, see how the Squid web proxy stores a large collection of files. ReiserFS does not have this problem because it uses an internal tree to store all directories and file metadata. Directory operations remain efficient even for very large directories, so you can write your application free from this performance concern. However, there are several issues that complicate this matter: namely locking and locality. The Linux VFS currently imposes locking restrictions that serialize many operations on directories, so if concurrent processes or threads will access the collection of files, then you may be better off using multiple directories. [[Reiser4]] will improve upon this restriction, although it is still under development. ReiserFS attempts to store all of the files in a directory, along with the directory itself, in nearby locations on disk. An application may exploit this spatial locality if it can predict which files will be accessed with temporal locality. You may be better off using multiple directories to store your files if you can predict that many files within a directory will be accessed at the same time. To summarize, ReiserFS supports efficient access to large directories while most traditional filesystems do not. However, locking and locality issues may guide your decision to use manually structured directory trees instead, at least until ReiserFS exports control over packing locality to users and improves its locking.
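For filesystems that do need the manual directory-tree workaround mentioned above (Squid-style caches are the classic case), the usual approach is to hash each file name into a small fixed fan-out of subdirectories. The sketch below is illustrative only: <tt>shard_path</tt>, <tt>store</tt>, and the width/depth parameters are invented for the example, not an interface of ReiserFS or Squid.

```python
import hashlib
import os

def shard_path(root, name, width=2, depth=2):
    """Map a file name to a nested directory path, spreading files
    evenly over subdirectories so no single directory grows large."""
    digest = hashlib.md5(name.encode()).hexdigest()
    # Take `depth` slices of `width` hex chars each as directory names,
    # e.g. root/ab/cd/name -> 256 * 256 buckets with the defaults.
    parts = [digest[i * width:(i + 1) * width] for i in range(depth)]
    return os.path.join(root, *parts, name)

def store(root, name, data):
    """Write `data` under its sharded path, creating buckets on demand."""
    path = shard_path(root, name)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as f:
        f.write(data)
    return path
```

With the default parameters, lookups stay cheap on linear-search directories because each bucket holds only a small fraction of the files; on ReiserFS the same files could simply live in one flat directory.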
[[category:ReiserFS]] [[category:Reiser4]] 8a1b4467ae14a758b8d2224a81ac36069d2069e4 1557 1556 2009-07-03T03:32:09Z Chris goe 2 formatting fix This FAQ is very [[ReiserFS]] centric and often a bit dated. The [[Reiser4]] filesystem is mentioned as ''upcoming''. Be sure to search the [[mailinglists|mailing list archives]] and help update this FAQ - Thanks! __TOC__ === What are the specs for ReiserFS: maximum number of files, of files a directory can have, of sub-dirs in a dir, of links to a file, maximum file size, maximum filesystem size, etc.? === Specifications for [[ReiserFS]]: {|cellpadding="5" cellspacing="0" border="1" | '''property''' || '''3.5''' || '''3.6''' |- | max number of files || 232-3 => 4 Gi - 3 || 232-3 => 4 Gi-3 |- | max number files a dir can have || 518701895 (but in practice this value is limited by hash function. r5 hash allows about 1 200 000 file names without collisions) || 232 - 4 => 4 Gi - 4 (but in practice this value is limited by hash function. r5 hash allows about 1 200 000 file names without collisions) |- | max file size || 231-1 => 2 Gi-1 || 260 - bytes => 1 Ei, but page cache limits this to 8 Ti on architectures with 32 bit int |- | max number links to a file || 216 => 64 Ki || 232 => 4 Gi |- | max filesystem size || 232 (4K) blocks => 16 Ti || 232 (4K) blocks => 16 Ti |} ReiserFS does '''meta-data journaling''', enabling fast crash recovery without the expense of full '''data journaling'''. 
There [http://marc.info/?l=reiserfs-devel&m=100895310422415&w=2 are separate patches from Chris Mason] that implement full data journaling for ReiserFS for Linux 2.4.16: * [http://mirror.fraunhofer.de/ftp.suse.com/people/mason/patches/data-logging/ ftp.suse.com/people/mason/patches/data-logging/] * [http://mirror.fraunhofer.de/ftp.suse.com/people/mason/patches/intermezzo-alpha/ ftp.suse.com/people/mason/patches/intermezzo-alpha/] '''Note''': Full data journaling is considered by many to be a good way to achieve file data integrity across system crashes. However, although file data may appear to be consistent from the kernel point of view, since there is no API exported to the userspace to control transactions, we may end-up in a situation where the application makes two write requests (as part of one logical transaction) but only one of these gets journaled before the system crashes. From the application point of view, we may then end up with inconsistent data in the file. Such issues should be addressed with the upcoming [[Reiser4]]. Such an API will be exported to userspace and all programs that need transactions will be able to use it. === Mount fails after reiserfsck --rebuild-tree failure === When [[reiserfsck]] --rebuild-tree is run, the first thing it does is to set the root inode value to -1. This makes the filesystem unmountable. (So, if [[reiserfsck]] will fail later on, because it contains serious errors, this filesystem could not be mounted.) Therefore once [[reiserfsck]] --rebuild-tree have failed for one of your filesystems, mounting of this partition is disabled. To correct the error you must check if you are have the latest [[reiserfsprogs]] package installed. If that fails, please send a bug report to our [[mailinglists|mailing list]] and be ready to answer our questions. === Why is the execution time for a <tt>find . -type f | xargs cat {} \;</tt> command much longer when using ReiserFS than for the same command when using ext2? 
=== This effect is observed if the measured file set was produced by untarring some archive created not from a ReiserFS partition (or by copying files from a non-ReiserFS partition or by running a program that writes a bunch of files in some order). This is because the <tt>readdir()</tt> operation performed on the ReiserFS partition returns filenames not in the original write order but rather in some hash order (dependant on the hash function used). Thus when reading files' contents, the hard drive heads must move when going from one file to another. If you want ReiserFS to outperform any other filesystem in your setup here is one solution: Copy the entire directory that you are not satisfied with to the same partition but with a different name (use <tt>cp -a</tt>), then remove the old directory and rename the new one with the old name. If the partition does not have enough space available, another approach is to <tt>tar(1)</tt> up the whole partition, clear it, and then untar the previously saved data. === Is quota-support built-in in the vanilla 2.4 kernels for ReiserFS? === No, quota support for Linux kernels for the 2.4 branch are bundled separately and were available once at [ftp://ftp.suse.com/pub/people/mason/patches/reiserfs/quota-2.4/ at SuSE] (gone) by Chris Mason, they are still [http://gd.tuwien.ac.at/utils/fs/reiserfs/quota-patches/ mirrored at TU-Wien]. The reason these patches were not included into 2.4 kernel branch is because they implement new quota format and need new quota code too, which is too big of a change for 2.4 series of kernels. Various Linux distributions vendors (ie [http://www.suse.com SuSE]) do ship reiserfs-quota enabled kernels, though. 
=== I am getting some errors in my kernel logs that I do not know how to interpret ===

Messages like:
 vs-13070: reiserfs_read_inode2: i/o failure occurred trying to find stat data of [1718696 1718710 0x0 SD]
 zam-7001: io error in reiserfs_find_entry
most likely accompanied by samples like those below are definite signs of hard disk problems (bad sectors):
 hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
 hda: dma_intr: error=0x40 { UncorrectableError }, LBAsect=6599945, sector=4286584
 end_request: I/O error, dev 03:03 (hda), sector 4286584
or
 scsi0: ERROR on channel 0, id 1, lun 0, CDB: Read (10) 00 00 01 ee 60 00 00 08 00
 Current sd 08:00: sense key Medium Error
or
 I/O error: dev 08:21, sector 65704

Messages about <tt>"access beyond end of device"</tt> can have many different causes, from not rebooting after fdisk requested it, to unfinished resizes, to data corruption.

The following messages mean you have a noisy IDE cable, or the cable is of too low quality for the chosen UDMA mode. Replace the cable with a better one, or choose a slower UDMA mode:
 hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
 hda: dma_intr: error=0x84 { DriveStatusError BadCRC }
 hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
 hda: dma_intr: error=0x84 { DriveStatusError BadCRC }

If you see any message from [[ReiserFS]] that you cannot interpret, and there is nothing similar to the messages above around it, [[mailinglists|mail the message to us]] and we will explain it to you.

=== Will ReiserFS implement streams, extended attributes, etc.? ===

[[FAQ/streams|Here]] is the one-page answer.

=== ReiserFS appears to be very slow while the RAID is resyncing. Mounting takes several minutes, and once mounted, an <tt>ls(1)</tt> in the mounted directory hangs forever. Once the RAID is synced, things appear to work pretty well. How can that be fixed? ===

First of all, a patch that makes mounting faster has been included in the Linux kernel since 2.4.19.
You can grab the patch for earlier kernels [http://gd.tuwien.ac.at/utils/fs/reiserfs/reiserfs-for-2.5/2.5.4.pending/07-reiserfs-bitmap-journal-read-ahead.diff here]. Also, the RAID drivers have '''minimal guaranteed''' and '''maximal possible''' RAID rebuild bandwidth settings. These values are controlled through the <tt>/proc/sys/dev/raid/speed_limit_min</tt> and <tt>/proc/sys/dev/raid/speed_limit_max</tt> sysctl variables (values are in KiB/s). The RAID logic cannot always tell whether the disk subsystem is busy at a given time; when it thinks the disk subsystem is idle, it tries to rebuild the array at <tt>speed_limit_max</tt>, which defaults to 100 MB per second. Decrease this value to something more suitable (a bit of experimentation may be needed).

=== I get ''attempt to read past the end of the partition'' error messages; is ReiserFS corrupted? ===

You changed your partition sizes and then ran [[mkreiserfs]] before rebooting. The kernel does not update its idea of the partition sizes until reboot. (This is fixable, but nobody has fixed it as of Dec. 2001.) [[mkreiserfs]] therefore created a filesystem with the wrong notion of how large its partition is, and the filesystem's notion of the partition boundaries persists past reboot even though the kernel's notion changes. So yes, it is corrupted. Some other kinds of metadata breakage can also lead to such messages.

=== Can I use VMware with ReiserFS? ===

VMware was tested on [http://www.suse.com/ SuSE Linux] with a [http://support.microsoft.com/gp/lifean18 Windows 98] guest OS on a [[ReiserFS]] partition. There is one trick at the beginning: the following line was added to the VMware config file:
 host.FSSupportLocking1 = 0x52654973 # (0x52654973 == *(u32 *) "ReIs")
Thanks to [mailto:gkade@bigbrother.net Gregory K. Ade] for this hint.

=== How do I install Debian potato with ReiserFS as the root partition?
===

[[FAQ/potato_part|Here]] are instructions by [mailto:LeBlanc@mcc.ac.uk Dr. A.V. Le Blanc].

=== Starting with Linux kernel v2.4.21 I cannot mount my FS anymore. Why? ===

Sanity checks were added to the kernel code to prohibit mounting filesystems that are bigger than the underlying block device. If you now see this message on mount:
 Filesystem on xx:yy cannot be mounted because it is bigger than the device
you may need to run fsck or increase the size of your LVM partition. Or maybe you forgot to reboot after fdisk told you to. If you do not use LVM, this usually means you need to run <tt>[[reiserfsck]] --rebuild-sb</tt> on your filesystem and agree to change its size to the proposed one.

=== Is it OK to use ReiserFS on a small storage device, e.g. a 16MB NAND flash block device? ===

[[FAQ/small_blocks|Here]] are instructions.

=== How do I change root from ext2 to ReiserFS without loss of data? ===

[[FAQ/change_fs|Here]] are instructions.

=== <tt>mount: /dev/hda5 has wrong major or minor number</tt> - what does that mean? ===

The kernel does not know anything about [[ReiserFS]]; it is neither compiled in nor available as a module.

=== Will it be possible to read/write ReiserFS partitions created now with future versions of ReiserFS? ===

Yes. [[ReiserFS]] 3.6.x (Linux 2.4.x) works with both the old (3.5) and the new (3.6) format. ReiserFS 3.5.x (Linux 2.2.x) can only work with the old (3.5) disk format. There is no way to convert the new (3.6) disk format to the old (3.5), but the old (3.5) format can be converted to the new one (3.6) with the <tt>-o conv</tt> [[mount|mount option]].

=== The ReiserFS module doesn't insert properly - why? ===

After applying the patch, ''recompile'' the whole kernel including the modules target, reboot, then try to insert the module.

=== Can I use ReiserFS with software RAID? ===

Yes, for all RAID levels using any Linux >= 2.4.1, but '''DO NOT''' use RAID with Linux 2.2.x.
Our journaling code and the 2.2.x RAID code step on each other in the buffering code. Also, mirroring is '''not''' safe in the 2.2.x kernels because online mirror rebuilds in 2.2.x break the write-ordering requirements of the log. If you crash in the middle of an online rebuild, your metadata may be corrupted. The only RAID level that is safe with [[ReiserFS]] in the 2.2.x kernels is striping/concatenation.

=== Can I use ReiserFS with 3ware RAID? ===

Yes, but you need to use Linux 2.2.19 or later for reasons other than [[ReiserFS]]. Also, if you encounter problems, consider that it might not be ReiserFS that has the bug. See these [http://web.archive.org/web/20030415160519/http://www.3ware.com/support/raid5techbulletin.shtml special instructions] (archive.org).

=== Why do things freeze on my IDE hard drive for annoying amounts of time? ===

Because when large writes are scheduled all at once, reads can starve. A fix for this is evolving; the later your ReiserFS patch, the better we handle this.

=== <tt>du(1)</tt> says ReiserFS makes space efficiency worse. ===

Use <tt>df(1)</tt>, not <tt>du(1)</tt>, or use the ''raw'' option for <tt>du(1)</tt> if it is supported. Summing up <tt>st_blocks</tt> is less accurate than <tt>st_size</tt> for [[ReiserFS]] because we pack tails, and <tt>st_blocks</tt> rounds numbers up.

=== <tt>mkreiserfs(8)</tt> fails after repartitioning ===

The kernel requires you to reboot after repartitioning (for all filesystems). We intend to fix that.

=== Performance is poor, and my disk at 96% full still has free space. ===

Once a disk drive gets more than 85% full, performance starts to suffer unless a repacker is used (which isn't implemented yet). You can probably get away with 92%, but if you value performance you are making a mistake to keep it any fuller. This is true for almost all filesystems.
[[ReiserFS]], because it packs tails together, packs more data into a given percentage used, but it is still subject to the rules for the maximum recommended percentage used. If you fill the whole disk with one copy and then mount it read-only, you can fully pack it without problems. Be sure to copy it from (or <tt>tar</tt> it from) a ReiserFS partition so that files are created in ReiserFS <tt>readdir()</tt> order, as this will improve performance.

=== Why do I get a signal 11 when compiling the kernel using ReiserFS and not ext2? ===

Your CPU is overheating and/or you have [http://www.bitwizard.nl/sig11/ bad RAM].

=== But it doesn't happen with ext2? ===

ext2 uses less heat-sensitive gates in the CPU :-) Seriously, ext2 and [[ReiserFS]] contain random differences, and overheating and bad RAM have random sensitivities. ([http://www.bitwizard.nl/sig11/ Signal 11] is not due to ReiserFS. One user had a cable blocking the fan; it did not affect ext2, but it wasn't until he fixed the cable-fan problem that ReiserFS worked.)

=== Can I use ReiserFS on architectures other than i386? ===

Yes. Starting from Linux [http://kernel.org/pub/linux/kernel/v2.4/ChangeLog-2.4.13 kernel 2.4.13], ReiserFS can run on any architecture Linux supports.

=== I need a program which will help me rebuild/recreate my partition table. ===

[http://brzitwa.de/mb/gpart/ gpart] is a utility that handles ext2, FAT, Linux swap, HPFS, NTFS, FreeBSD and Solaris/x86 disklabels, Minix, and ReiserFS. It prints a proposed layout for the primary partition table and is well documented.

=== What partition type should I use for ReiserFS? ===

[http://www.win.tue.nl/~aeb/partitions/partition_types.html Linux native filesystem] (83).

=== Can I use 32GB+ IDE hard drives with ReiserFS? ===

Yes, if you use Linux kernel 2.4 and up.

=== What about resizing ReiserFS? ===

This can be done with [[resize_reiserfs]].
=== What should I put into the fifth (aka dump, fs_freq) and sixth (aka pass, fs_passno) fields of /etc/fstab for ReiserFS filesystems? ===

Put in <tt>"0 0"</tt>, e.g.:
 /dev/sda3 /var reiserfs notail,nodev,nosuid,noexec <font color="red">0 0</font>

=== Why are ReiserFS filesystems not fscked on reboot after a crash? ===

Because [[ReiserFS]] provides journaling of metadata. After a crash, the consistency of a filesystem is restored by replaying the transaction log.

=== Can I interactively repair a filesystem that was corrupted? ===

This is done with [[reiserfsck]].

=== Can I use <tt>dump(8)</tt> and <tt>restore(8)</tt> with ReiserFS? Any caveats? ===

No. <tt>dump(8)</tt> uses knowledge of the internal structure of ext2 and works together with <tt>restore(8)</tt>, which also uses ext2-specific knowledge, to back up ext2 files. dump and restore are specific to ext2 and will not work with [[ReiserFS]]. To back up ReiserFS files use <tt>tar(1)</tt>, which is universal and can be applied to almost any reasonable Linux filesystem. It is well known among system administrators that <tt>dump(8)</tt> is more complete than Unix tar, and that there is quite a list of things that Unix tar will fail to back up properly. This is not true of GNU tar, which is quite complete. Basically, the only real disadvantage of GNU tar compared to <tt>dump(8)</tt> is speed. Unfortunately, because it shares its name with Unix <tt>tar(1)</tt>, people are reluctant to believe this. (Yes, GNU tar has incremental backups, etc.) We will performance-optimize ReiserFS backups for you (and the rest of the world) for $30K, which is not a lot if you are a large site spending a few hundred thousand on backup equipment.

=== Does ReiserFS support snapshots? ===

No, but you can create [[ReiserFS]] on top of an [http://sourceware.org/lvm2/ LVM] logical volume and use LVM's snapshot capabilities.

=== Can I check ReiserFS filesystems for errors without unmounting them?
===

[[reiserfsck]] in checking mode may be run over filesystems mounted read-only. There is no official way to fix mounted filesystems, though: you MUST completely unmount your filesystem in order to have it fixed. If you have LVM, you can check the consistency of filesystems mounted read-write using a script contributed by Andreas Dilger.

=== What ReiserFS mount options should I use to get the best performance for a mail server? ===

[http://archives.neohapsis.com/archives/postfix/2001-03/1148.html Craig Sanders answered] in detail: By the time I got around to running <tt>bonnie</tt>, the <tt>postmark</tt> and <tt>postal</tt> benchmarks had convinced me that <tt>notail</tt> was essential.

Host system:
* Debian GNU/Linux (of course :)
* Linux kernel 2.4.2 with latest 20010305 ReiserFS patch
* dual P3-866 (256K cache)
* 512MB RAM
* [http://www.adaptec.com/en-US/support/scsi/u160/ASC-19160/ Adaptec 19160] SCSI controller

External drive box:
* [http://www.domex.com.tw/support/product/8230u.htm Domex 8230u] RAID controller, 32MB battery-backed cache
* 6 x 18GB IBM [http://www.hitachigst.com/tech/techlib.nsf/techdocs/85256AB8006A31E587256A78005A3610/$file/ddys_sp21.PDF DDYS-T18350M] drives

For this particular hardware, [[ReiserFS]]/notail on RAID5 was the clear performance winner for a mail server with lots of synced random I/O.

=== Does using ReiserFS mean I can just press the power-off button without running <tt>/sbin/shutdown</tt>? Does it mean there is no risk of data loss? ===

No, definitely not. As of now, [[ReiserFS]] only provides metadata journaling - that is, it records which files have been created or opened, whether they have had their size changed, or where they have been relocated. It guarantees that the structure of the internal ReiserFS tree will be correct, thereby allowing you, after an unclean shutdown, to start back up without having to run fsck on all the files that have not been changed.
Data in files that were being used at the time of the crash may have been corrupted; this is usual for most filesystems. Data-journaling filesystems guarantee that no garbage is written into a file, but they do not guarantee that a file update will actually be completed. (Only [[Reiser4]] guarantees that filesystem operations are performed as atomic operations and provides atomic transaction functionality.) [[ReiserFS]] guarantees neither that the file contents themselves are uncorrupted nor that no data is lost. Moreover, even if your whole system is on ReiserFS, many system components (daemons, database managers, etc.) require a proper shutdown procedure to function correctly. However, there is a [ftp://ftp.suse.com/pub/people/mason/patches/data-logging separate implementation of data logging] (dead) that will [http://marc.info/?l=reiserfs-devel&m=103472026011689&w=2 soon] go into the mainstream kernel.

=== How does ReiserFS support bad block handling? ===

This is covered [[FAQ/bad-block-handling|here]].

=== I have a motherboard with VIA MVP3 chipset and experience ReiserFS problems. ===

[mailto:woster73@yahoo.com William Oster] answers: If you are using a motherboard with a [http://www.via.com.tw/en/products/apollo/mvp3.jsp VIA MVP3] chipset, you may have [[ReiserFS]] problems caused by the way your kernel is configured for the so-called [http://lxr.linux.no/linux+v2.6.30/drivers/pci/quirks.c PCI quirks]. My experience is with kernels 2.2.18 and 2.2.19, but it may affect the 2.4.x series too if you are using the MVP3 chipset (popular in Socket 7 motherboards, such as those used by the AMD K6 and classic Pentium). I have confirmed this problem with several motherboards using the VIA MVP3 chipset, ReiserFS 3.5.29 to 3.5.32, and [http://lxr.linux.no/linux+v2.6.30/Documentation/scsi/ncr53c8xx.txt NCR 53c8xx SCSI]. But please note: it probably affects '''any controller which uses DMA and PCI bus mastering'''.
Problems which I was inclined to attribute to ReiserFS were actually problems with this kernel misconfiguration. If you fit this profile, '''DO NOT''' enable the <tt>CONFIG_PCI_QUIRKS</tt> configuration option in the <tt>/usr/src/linux/.config</tt> file. Although the Linux documentation suggests that this option can be enabled if in doubt, '''DO NOT''' enable it. It was never intended for the VIA MVP3 chipset anyway. It affects the way DMA is handled, and in combination with ReiserFS (and possibly NCR SCSI) it can cause random disk corruption which eventually results in ReiserFS and/or SCSI errors. Evidently ReiserFS exercises the DMA and SCSI bus very thoroughly; the problems seem to be less likely under the ext2 filesystem. Check your <tt>/usr/src/linux/.config</tt> file. You are safe from this problem if you find this line:
 # CONFIG_PCI_QUIRKS is not set
Any other setting could be dangerous to MVP3 chipset ReiserFS users, especially when using PCI bus-mastering controllers such as the NCR 53c8xx series. Re-configure your kernel to disable the "PCI quirks" option, then <tt>make dep</tt>, rebuild, and reinstall.

=== I am having extensive problems using ReiserFS; it seems to have bugs all over the place. I'm not compiling with a [[#I am using RedHat 7.0 with gcc 2.96; why does ReiserFS seem unstable with it?|buggy compiler]]. What is happening? How can this be stable? ===

You have hardware problems. Really, you do. Even if the bugs don't show up with ext2, you have hardware problems. (See [[#Why_do_I_get_a_signal_11_when_compiling_the_kernel_using_ReiserFS_and_not_ext2?|the signal 11 question]].) Most SuSE users use ReiserFS. Obscure bugs probably still exist; but if you find bugs as easily as when using Windows, you have bad RAM, a bad CPU, a bad cable, bad cooling, a [[#I have a motherboard with VIA MVP3 chipset and experience ReiserFS problems.|VIA chipset with PCI quirks turned on]], or other hardware or software-layer bugs. ReiserFS is stable.
You can be sure that if the bugs are encountered easily and commonly with normal usage patterns, it is not us. This does not mean that the next release won't somehow break something, though :-/ Real bug reports are, at the time of writing, outnumbered 10 to 1 by hardware bugs that trigger error messages. We are working on making our error messages better at catching hardware bugs and identifying them as such, but there is only so far we can go in runtime consistency checking without serious speed reductions. We don't release software unless it goes through extensive testing; so if you don't think our testing could have missed the bug, it is probably hardware.

=== How can I put a label (like the one allowed by the <tt>-L</tt> option of <tt>mkfs.ext2</tt>) on a ReiserFS instance? ===

Currently, this feature is only implemented for the [[ReiserFS]] v3.6 disk format. Adding it to the v3.5 disk format would break the existing format, and there is not enough free space in the superblock. You can set a label (and UUID) with a recent [[reiserfsprogs]] package on a [[ReiserFS]] v3.6 filesystem using the <tt>-l</tt> switch (<tt>-u</tt> for UUID) of [[reiserfstune]] (for existing partitions) or [[mkreiserfs]] (for partitions being created). Support for labels and UUIDs was integrated into [[reiserfsprogs]] starting from version 3.x.1a.

=== Why, when I'm working on files (i.e. having open files) on my laptop, does ReiserFS access the disk every 5 seconds? This effectively prevents the disk from spinning down, i.e. APM modes from taking over, even when I'm not writing anything. ===

[mailto:bgraveland@hyperchip.com Brent Graveland] answers: It's the [http://kerneltrap.org/node/14148 atime] update. Every time you run <tt>sync(1)</tt>, the sync program's <tt>atime</tt> is updated. The next <tt>sync()</tt> writes this <tt>atime</tt> update, then <tt>sync(1)</tt> gets updated again.

=== RedHat does not unmount <tt>/</tt> (<tt>/dev/root</tt>) with ReiserFS on halt. How do I fix it?
===

RedHat users kindly provided these patches (not tested by us):
* [[FAQ/rc.sysinit.patch|rc.sysinit.patch]]
* [[FAQ/halt.patch|halt.patch]]
Note that if you have [http://www.redhat.com/docs/manuals/linux/RHL-7.2-Manual RedHat Linux 7.2] or later, you do not need these patches.

=== How do I run programs from the reiserfsprogs package on encrypted devices? ===

In order to access such encrypted entities you need to use the [http://www.linux.org/docs/ldp/howto/Cryptoloop-HOWTO/loopdevice-setup.html losetup(8)] tool to bind your entity to a <tt>loop</tt> device.

=== Are there any recommendations ''for'' or ''against'' particular hard drive manufacturers for use with ReiserFS? ===

No; bad hard drives are not [[ReiserFS]]-specific but affect all filesystems. There is basically no preference: the general rule that '''the faster the drive and the lower the seek time, the better''' applies as always. On the other hand, almost every hard drive manufacturer has a '''widely known''' broken series of hard drives. The most recent example is [http://en.wikipedia.org/wiki/Deskstar_75GXP IBM's Deskstar] series, especially the DTLA models produced in Hungary in 2000-2001. These are [http://ask.slashdot.org/article.pl?sid=01/10/04/0050238 known to fail very often], to the point that you probably don't want to use them even if you already paid for them. Other Deskstar drives also seem to be a poor choice: IBM released a note that Deskstar drives should not run for more than 8 hours/day on average. These drives are also known to be very sensitive to temperature conditions and to fail on overheating. There is a [http://web.archive.org/web/20060315210819/http://www.ibmdeskstar75gxplitigation.com/ class action lawsuit against IBM] over that drive series.

=== I am using RedHat 7.0 with gcc 2.96; why does ReiserFS seem unstable with it? ===

Use the most recent version of RedHat (gcc 2.96-85 or later with RedHat 7.2, although 7.1 is also okay for ReiserFS).
The choice of an unstable, [http://gcc.gnu.org/gcc-2.96.html unreleased] gcc 2.96 by RedHat as the default compiler was a Slashdot controversy. [http://www.redhat.com/advice/speaks_gcc.html gcc 2.96 on RedHat 7.0 was unstable], and ReiserFS was one of the things that would fail with it. There are two gcc versions involved: 2.96 and 2.96-85. 2.96-85 works for ReiserFS; the other (the one on [http://www.redhat.com/docs/manuals/linux/RHL-7-Manual/ RedHat 7.0]) surely does not. Read the Linux kernel instructions about which compiler to use. The solution to code not working on broken compilers is the one RedHat has taken - fix the compiler. They [http://rhn.redhat.com/errata/RHBA-2002-055.html fixed] the compiler and thereby allowed the correctly compiled [[ReiserFS]] to work.

=== In my program I am using <tt>fsync(2)</tt> calls after each write to guarantee the integrity of my file data, and this is very slow. How can I improve the performance? ===

Answer from Chris Mason: The main thing to remember is that <tt>fsync</tt> introduces a bunch of disk writes and forces the FS to wait on the buffers. The key to keeping performance up is to make it easy for the FS to do as much as possible before the <tt>fsync()</tt> call. So, if your application modifies 3 files and you want to make sure all 3 changes are safely on disk:
 write(file1)
 write(file2)
 write(file3)
 fsync(file1)
 fsync(file2)
 fsync(file3)
is much faster than:
 write(file1)
 fsync(file1)
 write(file2)
 fsync(file2)
 write(file3)
 fsync(file3)
It is also faster to overwrite existing bytes in a file than to append new bytes to its end. When you overwrite existing bytes, no new metadata has to be committed to disk on <tt>fsync()</tt>; the FS can just write the data blocks. This means fewer seeks. The more you write to a single file before calling <tt>fsync()</tt>, the faster things will run overall.
 write(8k)
 fsync(file)
is much faster than:
 write(4k)
 fsync(file)
 write(4k)
 fsync(file)
Optimizing for those three things alone can make a huge performance difference overall.

Answer from Josh MacDonald: You have to understand that even using <tt>fsync()</tt> after every <tt>write()</tt> makes no guarantees. If the system crashes during either the <tt>write()</tt> or the <tt>fsync()</tt> operation, your data may be lost or corrupted. And even supposing the <tt>fsync()</tt> completes, does your application keep its data in multiple files? If so, and you need to <tt>write()</tt> to multiple files as part of one transaction, you have even greater problems. The only safe and easy way to implement some kind of transaction with traditional filesystem guarantees is to use <tt>rename()</tt>:
# Keep all of your data in a single file.
# Periodically write a complete copy of your database to a temporary file.
# Rename the temporary file to the original database name.

Addition from Nikita Danilov: One can implement something like a ''phase tree'' at user level and use <tt>rename()</tt> to atomically switch the root of the tree. This overcomes the everything-in-one-file limitation but adds the complexity of requiring crash recovery. Or stop your development for now and wait until the [[Reiser4]] filesystem is released, which will have a transaction API exported to userspace. That transaction API would solve all of these problems.

== Our program needs to access a lot of working files. What is the recommended way to organize files to get the best results out of ReiserFS? Should all the files be placed in a single directory, or should the files be spread across a directory tree to limit the number of files per directory? Can you also summarize the relevant caching and locking effects? ==

Traditional filesystems typically have poor performance when there are many files in a single directory, but not [[ReiserFS]].
These other filesystems perform poorly because they use a linear search to find and replace entries in a directory, which means the filesystem must scan, on average, half the blocks of a directory on every access. Applications typically work around this problem by manually structuring a tree of directories so that each individual directory remains limited in size; see, for example, how the Squid web proxy stores its large collection of files. ReiserFS does not have this problem because it uses an internal tree to store all directories and file metadata. Directory operations remain efficient even for very large directories, so you can write your application free of this performance concern.

However, two issues complicate this matter: locking and locality. The Linux VFS currently imposes locking restrictions that serialize many operations on directories, so if concurrent processes or threads will access the collection of files, you may be better off using multiple directories. [[Reiser4]] will improve upon this restriction, although it is still under development. ReiserFS attempts to store all of the files in a directory, along with the directory itself, in nearby locations on disk. An application can exploit this spatial locality if it can predict which files will be accessed together. You may be better off using multiple directories to store your files if you can predict that many files within a directory will be accessed at the same time.

To summarize: ReiserFS supports efficient access to large directories, and most traditional filesystems do not. However, locking and locality issues may lead you to use manually structured directory trees instead, at least until ReiserFS exports control over packing locality to users and improves its locking.
[[category:ReiserFS]] [[category:Reiser4]]

This FAQ is very [[ReiserFS]] centric and often a bit dated. The [[Reiser4]] filesystem is mentioned as ''upcoming''. Be sure to search the [[mailinglists|mailing list archives]] and help update this FAQ - Thanks!

__TOC__

=== What are the specs for ReiserFS: maximum number of files, of files a directory can have, of sub-dirs in a dir, of links to a file, maximum file size, maximum filesystem size, etc.? ===

Specifications for [[ReiserFS]]:

{|cellpadding="5" cellspacing="0" border="1"
| '''property''' || '''3.5''' || '''3.6'''
|-
| max number of files || 2<sup>32</sup>-3 => 4 Gi-3 || 2<sup>32</sup>-3 => 4 Gi-3
|-
| max number of files a dir can have || 518701895 (but in practice this value is limited by the hash function; the r5 hash allows about 1 200 000 file names without collisions) || 2<sup>32</sup>-4 => 4 Gi-4 (but in practice this value is limited by the hash function; the r5 hash allows about 1 200 000 file names without collisions)
|-
| max file size || 2<sup>31</sup>-1 bytes => 2 Gi-1 || 2<sup>60</sup> bytes => 1 Ei, but the page cache limits this to 8 Ti on architectures with 32-bit int
|-
| max number of links to a file || 2<sup>16</sup> => 64 Ki || 2<sup>32</sup> => 4 Gi
|-
| max filesystem size || 2<sup>32</sup> (4K) blocks => 16 Ti || 2<sup>32</sup> (4K) blocks => 16 Ti
|}

ReiserFS does '''meta-data journaling''', enabling fast crash recovery without the expense of full '''data journaling'''.
There [http://marc.info/?l=reiserfs-devel&m=100895310422415&w=2 are separate patches from Chris Mason that implement full data journaling for ReiserFS for Linux 2.4.16: * [http://mirror.fraunhofer.de/ftp.suse.com/people/mason/patches/data-logging/ ftp.suse.com/people/mason/patches/data-logging/] * [http://mirror.fraunhofer.de/ftp.suse.com/people/mason/patches/intermezzo-alpha/ ftp.suse.com/people/mason/patches/intermezzo-alpha/] '''Note''': Full data journaling is considered by many to be a good way to achieve file data integrity across system crashes. However, although file data may appear to be consistent from the kernel point of view, since there is no API exported to the userspace to control transactions, we may end-up in a situation where the application makes two write requests (as part of one logical transaction) but only one of these gets journaled before the system crashes. From the application point of view, we may then end up with inconsistent data in the file. Such issues should be addressed with the upcoming [[Reiser4]]. Such an API will be exported to userspace and all programs that need transactions will be able to use it. === Mount fails after reiserfsck --rebuild-tree failure === When [[reiserfsck]] --rebuild-tree is run, the first thing it does is to set the root inode value to -1. This makes the filesystem unmountable. (So, if [[reiserfsck]] will fail later on, because it contains serious errors, this filesystem could not be mounted.) Therefore once [[reiserfsck]] --rebuild-tree have failed for one of your filesystems, mounting of this partition is disabled. To correct the error you must check if you are have the latest [[reiserfsprogs]] package installed. If that fails, please send a bug report to our [[mailinglists|mailing list]] and be ready to answer our questions. === Why is the execution time for a <tt>find . -type f | xargs cat {} \;</tt> command much longer when using ReiserFS than for the same command when using ext2? 
=== This effect is observed if the measured file set was produced by untarring some archive created not from a ReiserFS partition (or by copying files from a non-ReiserFS partition or by running a program that writes a bunch of files in some order). This is because the <tt>readdir()</tt> operation performed on the ReiserFS partition returns filenames not in the original write order but rather in some hash order (dependant on the hash function used). Thus when reading files' contents, the hard drive heads must move when going from one file to another. If you want ReiserFS to outperform any other filesystem in your setup here is one solution: Copy the entire directory that you are not satisfied with to the same partition but with a different name (use <tt>cp -a</tt>), then remove the old directory and rename the new one with the old name. If the partition does not have enough space available, another approach is to <tt>tar(1)</tt> up the whole partition, clear it, and then untar the previously saved data. === Is quota-support built-in in the vanilla 2.4 kernels for ReiserFS? === No, quota support for Linux kernels for the 2.4 branch are bundled separately and were available once at [ftp://ftp.suse.com/pub/people/mason/patches/reiserfs/quota-2.4/ at SuSE] (gone) by Chris Mason, they are still [http://gd.tuwien.ac.at/utils/fs/reiserfs/quota-patches/ mirrored at TU-Wien]. The reason these patches were not included into 2.4 kernel branch is because they implement new quota format and need new quota code too, which is too big of a change for 2.4 series of kernels. Various Linux distributions vendors (ie [http://www.suse.com SuSE]) do ship reiserfs-quota enabled kernels, though. 
=== I am getting some errors in my kernel logs that I do not know how to interpret ===
Messages like:
 vs-13070: reiserfs_read_inode2: i/o failure occurred trying to find stat data of [1718696 1718710 0x0 SD]
 zam-7001: io error in reiserfs_find_entry
most likely accompanied by samples like those below, are definite signs of hard disk problems (bad sectors):
 hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
 hda: dma_intr: error=0x40 { UncorrectableError }, LBAsect=6599945, sector=4286584
 end_request: I/O error, dev 03:03 (hda), sector 4286584
or
 scsi0: ERROR on channel 0, id 1, lun 0, CDB: Read (10) 00 00 01 ee 60 00 00 08 00
 Current sd 08:00: sense key Medium Error
or
 I/O error: dev 08:21, sector 65704
Messages about <tt>"access beyond end of device"</tt> can have many different causes, from not rebooting after fdisk requested it, to unfinished resizes, to data corruption. The following messages mean you have a noisy IDE cable, or one of too low a quality for the chosen UDMA mode. Try replacing the cable with a better one, or choose a slower UDMA mode:
 hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
 hda: dma_intr: error=0x84 { DriveStatusError BadCRC }
 hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
 hda: dma_intr: error=0x84 { DriveStatusError BadCRC }
If you see any message from [[ReiserFS]] that you cannot interpret, and there is nothing similar to the messages above around it, [[mailinglists|mail the message to us]] and we will explain it to you.

=== Will ReiserFS implement streams, extended attributes, etc.? ===
[[FAQ/streams|Here]] is the one-page answer.

=== Reiserfs appears to be very slow while the RAID is resyncing. Mounting takes several minutes. Once mounted, an <tt>ls(1)</tt> in the mounted directory hangs. Forever. Once the RAID is synced, things appear to work pretty well. How can that be fixed? ===
First of all, a patch that makes mounting faster has been included in the Linux kernel since 2.4.19.
You can grab the patch for earlier kernels [http://gd.tuwien.ac.at/utils/fs/reiserfs/reiserfs-for-2.5/2.5.4.pending/07-reiserfs-bitmap-journal-read-ahead.diff here]. Also, the RAID drivers have a '''minimal guaranteed''' and a '''maximal possible''' RAID rebuild bandwidth usage. These values are controlled through the <tt>/proc/sys/dev/raid/speed_limit_min</tt> and <tt>/proc/sys/dev/raid/speed_limit_max</tt> sysctl variables (values are in KiB/s). The RAID logic cannot always tell whether the disk subsystem is busy at a given time. When it thinks the disk subsystem is idle, it tries to rebuild the RAID array at the <tt>speed_limit_max</tt> speed, which defaults to 100 MB per second. Decrease this value to something more suitable (a bit of experimentation might be needed).

=== I get attempt to read past the end of the partition error messages; is ReiserFS corrupted? ===
You changed your partition sizes and then ran [[mkreiserfs]] before rebooting. The kernel does not update its notion of the partition sizes until reboot. (This is fixable, but nobody has fixed it as of Dec. 2001.) [[mkreiserfs]] therefore created a filesystem with the wrong notion of how large its partition is, and the filesystem's notion of the partition boundaries will persist past reboot even though the kernel's notion will change. So yes, it is corrupted. Some other kinds of metadata breakage can also lead to such messages.

=== Can I use VMware with ReiserFS? ===
VMware was tested on [http://www.suse.com/ SuSE Linux] with a [http://support.microsoft.com/gp/lifean18 Windows98] guest OS on a [[ReiserFS]] partition. There is one trick at the beginning: the following line was added to the VMware config file:
 host.FSSupportLocking1 = 0x52654973 # (0x52654973 == *(u32 *) "ReIs")
Thanks to [mailto:gkade@bigbrother.net Gregory K. Ade] for this hint.

=== How do I install Debian potato with ReiserFS as root partition?
=== [[FAQ/potato_part|Here]] are instructions by [mailto:LeBlanc@mcc.ac.uk Dr. A.V. Le Blanc].

=== Starting with linux kernel v2.4.21 I cannot mount my FS anymore. Why? ===
Special sanity checks were added to the kernel code to prohibit mounting of filesystems that are bigger than the underlying block device. If you now see this message on mount:
 Filesystem on xx:yy cannot be mounted because it is bigger than the device
you may need to run fsck or increase the size of your LVM partition. Or maybe you forgot to reboot after fdisk when it told you to. If you do not use LVM, that usually means you need to run <tt>[[reiserfsck]] --rebuild-sb</tt> on your filesystem and agree to change its size to the proposed one.

=== Is it ok to use ReiserFS on a small size storage device: e.g. 16MB NAND flash block device? ===
[[FAQ/small_blocks|Here]] are instructions.

=== How do I change root from ext2 to ReiserFS without loss of data? ===
[[FAQ/change_fs|Here]] are instructions.

=== <tt>mount: /dev/hda5 has wrong major or minor number</tt> - what does that mean? ===
The kernel does not know anything about [[ReiserFS]]; it is neither compiled in nor available as a module.

=== Will it be possible to read/write ReiserFS partitions created now with future versions of ReiserFS? ===
Yes. [[ReiserFS]]-3.6.x (Linux-2.4.x) works with both the old (3.5) and the new (3.6) format. ReiserFS-3.5.x (Linux-2.2.x) can only work with the old (3.5) disk format. There is no way to convert the new (3.6) disk format to the old (3.5), but the old (3.5) format can be converted to the new one (3.6) with the <tt>-o conv</tt> [[mount|mount option]].

=== The ReiserFS module doesn't insert properly - why? ===
After applying the patch, ''recompile'' the whole kernel including the modules target, reboot, then try to insert the module.

=== Can I use ReiserFS with software RAID? ===
Yes, for all RAID levels using any Linux >= 2.4.1, but '''DO NOT''' use RAID with Linux 2.2.x.
Our journaling and their RAID code step on each other in the buffering code. Also, mirroring is '''not''' safe in the 2.2.x kernels, because online mirror rebuilds in 2.2.x break the write-ordering requirements for the log. If you crash in the middle of an online rebuild, your metadata may be corrupted. The only RAID level that is safe with [[ReiserFS]] in the 2.2.x kernels is the striping/concatenation level.

=== Can I use ReiserFS with 3ware RAID? ===
Yes, but you need to use Linux 2.2.19 or later, for reasons other than [[ReiserFS]]. Also, if you encounter problems, be open to the possibility that the bug is not in ReiserFS. See the [http://web.archive.org/web/20030415160519/http://www.3ware.com/support/raid5techbulletin.shtml special instructions] (archive.org).

=== Why do things freeze on my IDE hard drive for annoying amounts of time? ===
Because when large writes are scheduled all at once, reads can starve. A fix for this is evolving; the later your ReiserFS patch, the better we handle this.

=== <tt>du(1)</tt> says ReiserFS makes space efficiency worse. ===
Use <tt>df(1)</tt>, not <tt>du(1)</tt>, or use the ''raw'' option for <tt>du(1)</tt> if it is supported. <tt>st_blocks</tt> summed up is less accurate than <tt>st_size</tt> for [[ReiserFS]] because we pack tails, and <tt>st_blocks</tt> rounds numbers up.

=== <tt>mkreiserfs(8)</tt> fails after repartitioning ===
The kernel requires you to reboot after repartitioning (for all filesystems). We intend to fix that.

=== Performance is poor, and my disk at 96% full still has free space. ===
Once a disk drive gets more than 85% full, performance starts to suffer unless a repacker is used (which isn't implemented yet). You can probably get away with 92%, but if performance is valued you are making a mistake to keep it any fuller. This is true for almost all filesystems.
[[ReiserFS]], because it packs tails together, packs more data into a given percentage of space used, but it is still subject to the rules for the maximum recommended percentage used. If you create the whole disk with one copy and then mount it read-only, you can fully pack it without problems. Please be sure that you copy it from (or <tt>tar</tt> it from) a ReiserFS partition, so that files are created in ReiserFS <tt>readdir()</tt> order, as this will improve performance.

=== Why do I get a signal 11 when compiling the kernel using ReiserFS and not ext2? ===
Your CPU is overheating and/or you have [http://www.bitwizard.nl/sig11/ bad RAM].

=== But it doesn't happen with ext2? ===
ext2 uses less heat-sensitive gates in the CPU :-) Seriously, ext2 and [[ReiserFS]] contain random differences, and overheating and bad RAM have random sensitivities. ([http://www.bitwizard.nl/sig11/ Signal 11] is not due to ReiserFS. One user had a cable blocking the fan; it did not affect ext2, but it wasn't until he fixed the cable-fan problem that ReiserFS worked.)

=== Can I use ReiserFS on other architectures than i386? ===
Yes. Starting from Linux [http://kernel.org/pub/linux/kernel/v2.4/ChangeLog-2.4.13 kernel 2.4.13], ReiserFS can run on any architecture Linux supports.

=== I need a program which will help me in rebuilding/recreating my partition table. ===
[http://brzitwa.de/mb/gpart/ gpart] is a utility that handles ext2, FAT, Linux swap, HPFS, NTFS, FreeBSD and Solaris/x86 disklabels, Minix, and ReiserFS. It prints a proposed content for the primary partition table and is well documented.

=== What partition type should I use for ReiserFS? ===
[http://www.win.tue.nl/~aeb/partitions/partition_types.html Linux native filesystem] (83).

=== Can I use 32GB+ IDE Hard Drives with ReiserFS? ===
Yes, if you use Linux kernel 2.4 and up.

=== What about resizing ReiserFS? ===
This can be done with [[resize_reiserfs]].
=== What should I put into the fifth (aka dump, fs_freq) and the sixth (aka pass, fs_passno) fields of /etc/fstab for ReiserFS filesystems? ===
You'd put in <tt>"0 0"</tt>, e.g.
 /dev/sda3 /var reiserfs notail,nodev,nosuid,noexec <font color="red">0 0</font>

=== Why are ReiserFS filesystems not fscked on reboot after a crash? ===
Because [[ReiserFS]] provides journaling of metadata. After a crash, the consistency of a filesystem is restored by replaying the transaction log.

=== Can I interactively repair a filesystem that was corrupted? ===
This is done with [[reiserfsck]].

=== Can I use <tt>dump(8)</tt> and <tt>restore(8)</tt> with ReiserFS? Any caveats? ===
No. <tt>dump(8)</tt> uses knowledge of the internal structure of ext2, and works together with <tt>restore(8)</tt>, which also uses ext2-specific knowledge, to back up ext2 files. dump and restore are specific to ext2 and will not work with [[ReiserFS]]. To back up ReiserFS files, use <tt>tar(1)</tt>, which is universal and can be applied to almost any reasonable Linux filesystem. It is well known among system administrators that <tt>dump(8)</tt> is more complete than Unix tar, and that there is quite a list of things that Unix tar will fail to back up properly. This is not true of GNU/tar, which is quite complete. Basically, the only real disadvantage of GNU/tar compared to <tt>dump(8)</tt> is speed. Unfortunately, because it shares its name with Unix <tt>tar(1)</tt>, people are reluctant to believe this. (Yes, GNU/tar has incremental backups, etc.) We will performance-optimize ReiserFS backups for you (and the rest of the world) for $30K, which is not a lot if you are a large site spending a few hundred thousand on equipment for backups.

=== Does ReiserFS support snapshots? ===
No, but you can create [[ReiserFS]] on top of an [http://sourceware.org/lvm2/ LVM] logical volume and use LVM's snapshot capabilities.

=== Can I check reiserfs filesystems for errors without unmounting them?
=== [[reiserfsck]] in checking mode may be run over filesystems mounted read-only. There is no official way to fix mounted filesystems, though: you MUST completely unmount your filesystem in order to have it fixed. If you have LVM, you can check the consistency of filesystems mounted read-write with a script contributed by Andreas Dilger.

=== What ReiserFS mount options should I use to get the performance winner for a mail server? ===
[http://archives.neohapsis.com/archives/postfix/2001-03/1148.html Craig Sanders answered] in detail: By the time I got around to running <tt>bonnie</tt>, the <tt>postmark</tt> and <tt>postal</tt> benchmarks had convinced me that <tt>notail</tt> was essential. Host system:
* Debian GNU/Linux (of course :)
* Linux kernel 2.4.2 with latest 20010305 ReiserFS patch
* dual P3-866 (256K cache)
* 512MB RAM
* [http://www.adaptec.com/en-US/support/scsi/u160/ASC-19160/ Adaptec 19160] SCSI Controller
External drive box:
* [http://www.domex.com.tw/support/product/8230u.htm Domex 8230u] RAID controller, 32MB battery-backed cache
* 6 x 18GB IBM [http://www.hitachigst.com/tech/techlib.nsf/techdocs/85256AB8006A31E587256A78005A3610/$file/ddys_sp21.PDF DDYS-T18350M] drives
For this particular hardware, [[ReiserFS]]/notail on RAID5 was the clear performance winner for a mail server with lots of synced random I/O.

=== Does using ReiserFS mean I can just press the power off button without running <tt>/sbin/shutdown</tt>? Does it mean there is no risk of data loss? ===
No, definitely not. As of now, [[ReiserFS]] only provides metadata journaling - that is, it records which files have been created or opened, whether they have had their size changed, or where they have been relocated. It guarantees that the structure of the internal ReiserFS tree will be correct, thereby allowing you to start back up after an unclean shutdown without having to run fsck on all the files that have not been changed.
Data in files that were being used at the time of the crash may have been corrupted. This is usual for most filesystems. Data-journaling filesystems guarantee that no garbage will be written into a file, but they do not guarantee that a file update will be completed. (Only [[Reiser4]] guarantees that filesystem operations are performed as atomic operations, and provides atomic transaction functionality.) [[ReiserFS]] does not guarantee that the file contents themselves are uncorrupted, nor that no data is lost. Moreover, even if all of your system is on ReiserFS, many system components (like daemons, database managers, etc.) require the shutdown procedure for proper functioning. However, there is a [ftp://ftp.suse.com/pub/people/mason/patches/data-logging separate implementation of data logging] (dead) that will [http://marc.info/?l=reiserfs-devel&m=103472026011689&w=2 soon] go into the mainstream kernel.

=== How does ReiserFS support bad block handling? ===
This is covered [[FAQ/bad-block-handling|here]].

=== I have a motherboard with VIA MVP3 chipset and experience ReiserFS problems. ===
[mailto:woster73@yahoo.com William Oster] answers: If you are using a motherboard with a [http://www.via.com.tw/en/products/apollo/mvp3.jsp VIA MVP3] chipset, you may have [[ReiserFS]] problems caused by the way your kernel is configured for the so-called [http://lxr.linux.no/linux+v2.6.30/drivers/pci/quirks.c PCI quirks]. My experience is with kernels 2.2.18 and 2.2.19, but it may affect the 2.4.x series too if you are using the MVP3 chipset (popular in Socket 7 motherboards, such as those used by the AMD K6 and classic Pentium). I have confirmed this problem with several motherboards using the VIA MVP3 chipset, ReiserFS 3.5.29 to 3.5.32, and [http://lxr.linux.no/linux+v2.6.30/Documentation/scsi/ncr53c8xx.txt NCR 53c8xx SCSI]. But please note: it probably affects '''any controller which uses DMA and PCI bus mastering'''.
Problems which I was inclined to attribute to ReiserFS were actually problems with this kernel [mis]configuration. If you fit this profile, '''DO NOT''' enable the <tt>CONFIG_PCI_QUIRKS</tt> configuration option in the <tt>/usr/src/linux/.config</tt> file. Although the Linux documentation suggests that this option can be enabled if in doubt, '''DO NOT''' enable it. It was never intended for the VIA MVP3 chipset anyway. It affects the way DMA is handled, and in combination with ReiserFS (and possibly NCR SCSI) it can cause random disk corruption which eventually results in ReiserFS and/or SCSI errors. Evidently ReiserFS exercises the DMA and SCSI bus very thoroughly; the problems seem less likely under the ext2 filesystem. Check your <tt>/usr/src/linux/.config</tt> file. You are safe from this problem if you find this line:
 # CONFIG_PCI_QUIRKS is not set
Any other setting could be dangerous to MVP3-chipset ReiserFS users, especially when using PCI bus-mastering controllers such as the NCR 53c8xx series. Re-configure your kernel to disable the "PCI quirks" option, then <tt>make dep</tt>, rebuild, and reinstall.

=== I am having extensive problems using ReiserFS; it seems to have bugs all over the place. I'm not compiling with a [[#I am using RedHat 7.0 with gcc 2.96; why does ReiserFS seem unstable with it?|buggy compiler]]. What is happening? How can this be stable? ===
You have hardware problems. Really, you do. Even if the bugs don't show up with ext2, you have hardware problems. (See [[#Why_do_I_get_a_signal_11_when_compiling_the_kernel_using_ReiserFS_and_not_ext2?|the signal 11 question]].) Most SuSE users use ReiserFS. Obscure bugs probably still exist; but if you find bugs as easily as when using Windows, you have bad RAM, a bad CPU, a bad cable, bad cooling, a [[#I have a motherboard with VIA MVP3 chipset and experience ReiserFS problems.|VIA chipset with PCI quirks turned on]], or other hardware or software-layer bugs. ReiserFS is stable.
You can be sure that if bugs are encountered easily and commonly with normal usage patterns, it is not us. This does not mean that the next release won't somehow break something, though :-/ Real bug reports are, at the time of writing, outnumbered 10 to 1 by hardware bugs that trigger error messages. We are working on making our error messages better at catching hardware bugs and identifying them as such. There is only so far we can go in runtime consistency checking without serious speed reductions, though. We don't release software unless it goes through extensive testing; so if you don't think that our testing could have missed the bug, it is probably hardware.

=== How can I put a label (like the one allowed by the <tt>-L</tt> option of <tt>mkfs.ext2</tt>) on a ReiserFS instance? ===
Currently, this feature is only implemented for the [[ReiserFS]] v3.6 disk format. Adding it to the v3.5 disk format would break the existing disk format, and there is not enough free space in the superblock. You can set a label (and UUID) with a recent [[reiserfsprogs]] package on a [[ReiserFS]] v3.6 filesystem using the <tt>-l</tt> switch (<tt>-u</tt> for UUID) of the [[reiserfstune]] (for existing partitions) or [[mkreiserfs]] (for partitions being created) commands. Support for labels and UUIDs was integrated into [[reiserfsprogs]] starting from version 3.x.1a.

=== Why, when I'm working on files (i.e. having open files) on my laptop, does ReiserFS access the disk every 5 seconds? This effectively prevents the disk from spinning down, i.e. APM modes from taking over, even when I'm not writing anything. ===
[mailto:bgraveland@hyperchip.com Brent Graveland] answers: It's the [http://kerneltrap.org/node/14148 atime] update. Every time you run <tt>sync(1)</tt>, the sync program's <tt>atime</tt> is updated. The next <tt>sync()</tt> writes this <tt>atime</tt> update, and then <tt>sync(1)</tt>'s <tt>atime</tt> gets updated again.

=== RedHat does not unmount <tt>/</tt> (<tt>/dev/root</tt>) with ReiserFS on halt. How to fix it?
=== RedHat users kindly provided these patches (not tested by us):
* [[FAQ/rc.sysinit.patch|rc.sysinit.patch]]
* [[FAQ/halt.patch|halt.patch]]
Note that if you have [http://www.redhat.com/docs/manuals/linux/RHL-7.2-Manual RedHat Linux 7.2] or later, you do not need these patches.

=== How do I run programs from the reiserfsprogs package on encrypted devices? ===
In order to access such encrypted entities, you need to use the [http://www.linux.org/docs/ldp/howto/Cryptoloop-HOWTO/loopdevice-setup.html losetup(8)] tool to bind your entity to a <tt>loop</tt> device.

=== Are there any recommendations ''for'' or ''against'' particular hard drive manufacturers for use with ReiserFS? ===
No. Bad hard drives are not [[ReiserFS]]-specific; they affect all filesystems, so there is basically no preference: the general rule '''the faster the drive and the lower the seek time, the better''' applies as always. On the other hand, almost every hard drive manufacturer has a '''widely known''' broken series of hard drives. The most recent example is [http://en.wikipedia.org/wiki/Deskstar_75GXP IBM's Deskstar] series of disks, especially the DTLA models produced in Hungary in 2000-2001. These are [http://ask.slashdot.org/article.pl?sid=01/10/04/0050238 known to fail very often], to the point that you probably don't want to use them even if you already paid for them. Other Deskstar drives also seem to be a poor choice: IBM released a note that Deskstar drives should not run for more than 8 hours/day on average, and these drives are also known to be very sensitive to temperature conditions and to fail on overheating. There is a [http://web.archive.org/web/20060315210819/http://www.ibmdeskstar75gxplitigation.com/ class action lawsuit against IBM] over that drive series.

=== I am using RedHat 7.0 with gcc 2.96; why does ReiserFS seem unstable with it? ===
Use the most recent version of RedHat (gcc 2.96-85 or later with RedHat 7.2, although 7.1 is also okay for ReiserFS).
The choice of an unstable, [http://gcc.gnu.org/gcc-2.96.html unreleased] gcc 2.96 by RedHat as the default compiler was a Slashdot controversy. [http://www.redhat.com/advice/speaks_gcc.html gcc 2.96 on RedHat 7.0 was unstable], and ReiserFS was one of the things that would fail under it. There are two gcc versions involved: 2.96 and 2.96-85. 2.96-85 works for ReiserFS; the other (the one on [http://www.redhat.com/docs/manuals/linux/RHL-7-Manual/ RedHat 7.0]) surely does not. Read the Linux kernel instructions about what compiler to use. The solution to code not working on broken compilers is the one RedHat has taken - fix the compiler. They [http://rhn.redhat.com/errata/RHBA-2002-055.html fixed] the compiler and thereby allowed the correctly compiled [[ReiserFS]] to work.

=== In my program I am using <tt>fsync(2)</tt> calls after each write to the file to guarantee integrity of my file data, and this is very slow, how can I improve the performance? ===
Answer from Chris Mason: The main thing to remember is that <tt>fsync()</tt> introduces a bunch of disk writes and forces the FS to wait on the buffers. The key to keeping performance up is to make it easy for the FS to do as much as possible before the <tt>fsync()</tt> call. So, if your application modifies 3 files and you want to make sure all 3 changes are safely on disk:
 write(file1)
 write(file2)
 write(file3)
 fsync(file1)
 fsync(file2)
 fsync(file3)
is much faster than:
 write(file1)
 fsync(file1)
 write(file2)
 fsync(file2)
 write(file3)
 fsync(file3)
It is also faster to write over existing bytes in a file than to append new bytes onto its end. When you overwrite existing bytes, you don't have to commit new metadata to disk on <tt>fsync()</tt>; the FS can just write the data blocks. This means fewer seeks. The more you write to a single file before calling <tt>fsync()</tt>, the faster overall things will run.
 write(8k)
 fsync(file)
is much faster than:
 write(4k)
 fsync(file)
 write(4k)
 fsync(file)
Trying to optimize for those 3 things alone can make a huge performance difference overall. Answer from Josh MacDonald: You have to understand that even using <tt>fsync()</tt> after every <tt>write()</tt> makes no guarantees. If the system crashes during either the <tt>write()</tt> or the <tt>fsync()</tt> operation, your data may be lost or corrupted. And suppose the <tt>fsync()</tt> does complete - does your application keep its data in multiple files? If so, and you need to <tt>write()</tt> to multiple files as part of one transaction, you have even greater problems. The only safe and easy way to implement some kind of transaction with the traditional file system guarantees is to use <tt>rename()</tt>:
# Keep all of your data in a single file.
# Periodically write a complete copy of your database to a temporary file.
# Rename the temporary file to the original database name.
Addition from Nikita Danilov: One can implement something like a ''phase tree'' at user level and use <tt>rename()</tt> to atomically switch the root of the tree. This overcomes the everything-in-one-file limitation, but adds the complexity of requiring crash recovery. Or stop your development for now and wait until the [[Reiser4]] filesystem is released; its transaction API, exported to userspace, would solve all of your problems.

== Our program needs to access a lot of working files. What is the recommended way to organize files to get the best results out of ReiserFS? Should all the files be placed in a single directory, or should the files be spread across a directory tree to limit the number of files per directory? Can you also summarize the relevant caching and locking effects? ==
Traditional file systems typically have poor performance when there are many files in a single directory, but not [[ReiserFS]].
These other file systems perform poorly because they use a linear search algorithm to find and replace entries in a directory. This means that the file system must scan, on average, half the blocks of a directory for every access. Typically, applications are required to work around this problem by manually structuring a tree of directories, allowing each individual directory to remain limited in size. For example, see how the Squid web proxy stores a large collection of files. ReiserFS does not have this problem, because it uses an internal tree to store all directories and file metadata. Directory operations remain efficient even for very large directories, so you can write your application free from this performance concern. However, there are several issues that complicate this matter: namely, locking and locality. The Linux VFS currently imposes locking restrictions that serialize many operations on directories, so if concurrent processes or threads will access the collection of files, then you may be better off using multiple directories. [[Reiser4]] will improve upon this restriction, although it is still under development. ReiserFS attempts to store all of the files in a directory, along with the directory itself, in nearby locations on disk. An application may exploit this spatial locality if it can predict which files will be accessed with temporal locality. You may be better off using multiple directories to store your files if you can predict that many files within a directory will be accessed at the same time. To summarize, ReiserFS supports efficient access to large directories, and most traditional file systems do not. However, locking and locality issues may guide your decision to use manually structured directory trees instead, at least until ReiserFS exports control over packing locality to users and improves its locking.
[[category:ReiserFS]] [[category:Reiser4]] 565289d80eb8295e6fa406370e0f8ef7c88e5bff 1489 1488 2009-06-27T06:22:40Z Chris goe 2 formatting fixes This FAQ is very [[ReiserFS]] centric and often a bit dated. The [[Reiser4]] filesystem is mentioned as ''upcoming''. Be sure to search the [[mailinglists|mailing list archives]] and help update this FAQ - Thanks! __TOC__ === What are the specs for ReiserFS: maximum number of files, of files a directory can have, of sub-dirs in a dir, of links to a file, maximum file size, maximum filesystem size, etc.? === Specifications for [[ReiserFS]]: {|cellpadding="5" cellspacing="0" border="1" | '''property''' || '''3.5''' || '''3.6''' |- | max number of files || 232-3 => 4 Gi - 3 || 232-3 => 4 Gi-3 |- | max number files a dir can have || 518701895 (but in practice this value is limited by hash function. r5 hash allows about 1 200 000 file names without collisions) || 232 - 4 => 4 Gi - 4 (but in practice this value is limited by hash function. r5 hash allows about 1 200 000 file names without collisions) |- | max file size || 231-1 => 2 Gi-1 || 260 - bytes => 1 Ei, but page cache limits this to 8 Ti on architectures with 32 bit int |- | max number links to a file || 216 => 64 Ki || 232 => 4 Gi |- | max filesystem size || 232 (4K) blocks => 16 Ti || 232 (4K) blocks => 16 Ti |} ReiserFS does '''meta-data journaling''', enabling fast crash recovery without the expense of full '''data journaling'''. There [ftp://ftp.suse.com/pub/people/mason/patches/intermezzo-alpha/ were] separate [http://marc.info/?l=reiserfs-devel&m=100895310422415&w=2 patches from Chris Mason] that implement full data journaling for ReiserFS for Linux 2.4.16. '''Note''': Full data journaling is considered by many to be a good way to achieve file data integrity across system crashes. 
However, although file data may appear to be consistent from the kernel point of view, since there is no API exported to the userspace to control transactions, we may end-up in a situation where the application makes two write requests (as part of one logical transaction) but only one of these gets journaled before the system crashes. From the application point of view, we may then end up with inconsistent data in the file. Such issues should be addressed with the upcoming [[Reiser4]]. Such an API will be exported to userspace and all programs that need transactions will be able to use it. === Mount fails after reiserfsck --rebuild-tree failure === When [[reiserfsck]] --rebuild-tree is run, the first thing it does is to set the root inode value to -1. This makes the filesystem unmountable. (So, if [[reiserfsck]] will fail later on, because it contains serious errors, this filesystem could not be mounted.) Therefore once [[reiserfsck]] --rebuild-tree have failed for one of your filesystems, mounting of this partition is disabled. To correct the error you must check if you are have the latest [[reiserfsprogs]] package installed. If that fails, please send a bug report to our [[mailinglists|mailing list]] and be ready to answer our questions. === Why is the execution time for a <tt>find . -type f | xargs cat {} \;</tt> command much longer when using ReiserFS than for the same command when using ext2? === This effect is observed if the measured file set was produced by untarring some archive created not from a ReiserFS partition (or by copying files from a non-ReiserFS partition or by running a program that writes a bunch of files in some order). This is because the <tt>readdir()</tt> operation performed on the ReiserFS partition returns filenames not in the original write order but rather in some hash order (dependant on the hash function used). Thus when reading files' contents, the hard drive heads must move when going from one file to another. 
If you want ReiserFS to outperform any other filesystem in your setup here is one solution: Copy the entire directory that you are not satisfied with to the same partition but with a different name (use <tt>cp -a</tt>), then remove the old directory and rename the new one with the old name. If the partition does not have enough space available, another approach is to <tt>tar(1)</tt> up the whole partition, clear it, and then untar the previously saved data. === Is quota-support built-in in the vanilla 2.4 kernels for ReiserFS? === No, quota support for Linux kernels for the 2.4 branch are bundled separately and were available once at [ftp://ftp.suse.com/pub/people/mason/patches/reiserfs/quota-2.4/ at SuSE] (gone) by Chris Mason, they are still [http://gd.tuwien.ac.at/utils/fs/reiserfs/quota-patches/ mirrored at TU-Wien]. The reason these patches were not included into 2.4 kernel branch is because they implement new quota format and need new quota code too, which is too big of a change for 2.4 series of kernels. Various Linux distributions vendors (ie [http://www.suse.com SuSE]) do ship reiserfs-quota enabled kernels, though. 
=== I am getting some errors in my kernel logs that I do not know how to interpret ===

Messages like:

 vs-13070: reiserfs_read_inode2: i/o failure occurred trying to find stat data of [1718696 1718710 0x0 SD]
 zam-7001: io error in reiserfs_find_entry

most likely accompanied by samples like the ones below are definite signs of hard disk problems (bad sectors):

 hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
 hda: dma_intr: error=0x40 { UncorrectableError }, LBAsect=6599945, sector=4286584
 end_request: I/O error, dev 03:03 (hda), sector 4286584

or

 scsi0: ERROR on channel 0, id 1, lun 0, CDB: Read (10) 00 00 01 ee 60 00 00 08 00
 Current sd 08:00: sense key Medium Error

or

 I/O error: dev 08:21, sector 65704

Messages about <tt>"access beyond end of device"</tt> can have many different causes, from not rebooting after fdisk requested it, to unfinished resizes, to data corruption. The following messages mean you have a noisy IDE cable, or one of too low quality for the chosen UDMA mode. Try replacing the cable with a better one, or choose a slower UDMA mode:

 hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
 hda: dma_intr: error=0x84 { DriveStatusError BadCRC }
 hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
 hda: dma_intr: error=0x84 { DriveStatusError BadCRC }

If you see any message from [[ReiserFS]] that you cannot interpret and there is nothing similar to the messages above around it, [[mailinglists|mail the message to us]] and we will explain it to you.

=== Will ReiserFS implement streams, extended attributes, etc.? ===

[[FAQ/streams|Here]] is the one-page answer.

=== Reiserfs appears to be very slow while the RAID is resyncing. Mounting takes several minutes. Once mounted, an <tt>ls(1)</tt> in the mounted directory hangs. Forever. Once the RAID is synced, things appear to work pretty well. How can that be fixed? ===

First of all, a patch that makes mounting faster has been included in the Linux kernel since 2.4.19.
You can grab the patch for earlier kernels [http://gd.tuwien.ac.at/utils/fs/reiserfs/reiserfs-for-2.5/2.5.4.pending/07-reiserfs-bitmap-journal-read-ahead.diff here]. Also, RAID drivers have '''minimal guaranteed''' and '''maximal possible''' RAID rebuild bandwidth settings. These values are controlled through the <tt>/proc/sys/dev/raid/speed_limit_min</tt> and <tt>/proc/sys/dev/raid/speed_limit_max</tt> sysctl variables (values are in 100 KiB/s). It seems that the RAID logic cannot always tell whether the disk subsystem is busy at a given time. When it thinks the disk subsystem is idle, it tries to rebuild the RAID array at the <tt>speed_limit_max</tt> speed, which defaults to 100 MB per second. Decrease this value to something more suitable (a bit of experimentation might be needed).

=== I get "attempt to read past the end of the partition" error messages; is ReiserFS corrupted? ===

You changed your partition sizes and then ran [[mkreiserfs]] before rebooting. The kernel does not update its notion of the partition sizes until reboot time. (This is fixable, but nobody has fixed it as of Dec. 2001.) [[mkreiserfs]] created a filesystem with a wrong notion of the size of the partition it is on. The filesystem's notion of the partition boundaries will persist past reboot even though the kernel's notion will change. So yes, it is corrupted. Some other kinds of metadata breakage can also lead to such messages.

=== Can I use VMware with ReiserFS? ===

VMware was tested on [http://www.suse.com/ SuSE Linux] with a [http://support.microsoft.com/gp/lifean18 Windows98] guest OS on a [[ReiserFS]] partition. There is one trick at the beginning: the following line was added to the VMware config file:

 host.FSSupportLocking1 = 0x52654973 # (0x52654973 == *(u32 *) "ReIs")

Thanks to [mailto:gkade@bigbrother.net Gregory K. Ade] for this hint.

=== How do I install Debian potato with ReiserFS as root partition? ===
[[FAQ/potato_part|Here]] are instructions by [mailto:LeBlanc@mcc.ac.uk Dr. A.V. Le Blanc].

=== Starting with Linux kernel v2.4.21 I cannot mount my FS anymore. Why? ===

Special sanity checks were added to the kernel code to prohibit mounting of filesystems that are bigger than the underlying block device. If you now see this message on mount:

 Filesystem on xx:yy cannot be mounted because it is bigger than the device

you may need to run fsck or increase the size of your LVM partition. Or maybe you forgot to reboot after fdisk told you to. If you do not use LVM, this usually means you need to run <tt>[[reiserfsck]] --rebuild-sb</tt> on your filesystem and agree to change its size to the proposed one.

=== Is it OK to use ReiserFS on a small storage device, e.g. a 16MB NAND flash block device? ===

[[FAQ/small_blocks|Here]] are instructions.

=== How do I change root from ext2 to ReiserFS without loss of data? ===

[[FAQ/change_fs|Here]] are instructions.

=== <tt>mount: /dev/hda5 has wrong major or minor number</tt> - what does that mean? ===

The kernel does not know anything about [[ReiserFS]]; it is neither compiled in nor available as a module.

=== Will it be possible to read/write ReiserFS partitions created now with future versions of ReiserFS? ===

Yes. [[ReiserFS]]-3.6.x (Linux-2.4.x) works with both the old (3.5) and the new (3.6) formats. ReiserFS-3.5.x (Linux-2.2.x) can only work with the old (3.5) disk format. There is no way to convert the new (3.6) disk format to the old (3.5), but the old (3.5) format can be converted to the new one (3.6) with the <tt>-o conv</tt> [[mount|mount option]].

=== The ReiserFS module doesn't insert properly - why? ===

After applying the patch, ''recompile'' the whole kernel including the modules target, reboot, then try to insert the module.

=== Can I use ReiserFS with software RAID? ===

Yes, for all RAID levels using any Linux >= 2.4.1, but '''DO NOT''' use RAID with Linux 2.2.x.
Our journaling code and their RAID code step on each other in the buffering code. Also, mirroring is '''not''' safe in the 2.2.x kernels, because online mirror rebuilds in 2.2.x break the write-ordering requirements for the log. If you crash in the middle of an online rebuild, your metadata may be corrupted. The only RAID level that is safe with [[ReiserFS]] in the 2.2.x kernels is the striping/concatenation level.

=== Can I use ReiserFS with 3ware RAID? ===

Yes, but you need to use Linux 2.2.19 or later, for reasons other than [[ReiserFS]]. Also, if you encounter problems, be suspicious that it might not be ReiserFS that has the bug. See these [http://web.archive.org/web/20030415160519/http://www.3ware.com/support/raid5techbulletin.shtml special instructions] (archive.org).

=== Why do things freeze on my IDE hard drive for annoying amounts of time? ===

Because when large writes are scheduled all at once, reads can starve. A fix for this is evolving; the later your ReiserFS patch, the better we handle this.

=== <tt>du(1)</tt> says ReiserFS makes space efficiency worse. ===

Use <tt>df(1)</tt>, not <tt>du(1)</tt>, or use the ''raw'' option for <tt>du(1)</tt> if it is supported. Summing up <tt>st_blocks</tt> is less accurate than <tt>st_size</tt> for [[ReiserFS]] because we pack tails, and <tt>st_blocks</tt> rounds numbers up.

=== <tt>mkreiserfs(8)</tt> fails after repartitioning ===

The kernel requires you to reboot after repartitioning (for all filesystems). We intend to fix that.

=== Performance is poor, and my disk at 96% full still has free space. ===

Once a disk drive gets more than 85% full, performance starts to suffer unless you use a repacker (which isn't implemented yet). You can probably get away with 92%, but if performance is valued you are making a mistake to keep it any fuller. This is true for almost all filesystems.
[[ReiserFS]], because we pack tails together, packs more data into a given percentage used, but it is still subject to the rules for the maximum recommended percentage used. If you create the whole disk with one copy and then mount it read-only, then you can fully pack it without problem. Please be sure that you copy it from (or <tt>tar</tt> it from) a ReiserFS partition, so that files are created in ReiserFS <tt>readdir()</tt> order, as this will improve performance.

=== Why do I get a signal 11 when compiling the kernel using ReiserFS and not ext2? ===

Your CPU is overheating and/or you have [http://www.bitwizard.nl/sig11/ bad RAM].

=== But it doesn't happen with ext2? ===

ext2 uses less heat-sensitive gates in the CPU :-) Seriously, ext2 and [[ReiserFS]] contain random differences, and overheating and bad RAM have random sensitivities. ([http://www.bitwizard.nl/sig11/ Signal 11] is not due to ReiserFS. One user had a cable blocking the fan; it did not affect ext2, but it wasn't until he fixed the cable-fan problem that ReiserFS worked.)

=== Can I use ReiserFS on other architectures than i386? ===

Yes. Starting from Linux [http://kernel.org/pub/linux/kernel/v2.4/ChangeLog-2.4.13 kernel 2.4.13], ReiserFS can run on any architecture supported by Linux.

=== I need a program which will help me in rebuilding/recreating my partition table. ===

[http://brzitwa.de/mb/gpart/ gpart] is a utility that handles ext2, FAT, Linux swap, HPFS, NTFS, FreeBSD and Solaris/x86 disklabels, Minix, and ReiserFS. It prints a proposed content for the primary partition table and is well documented.

=== What partition type should I use for ReiserFS? ===

[http://www.win.tue.nl/~aeb/partitions/partition_types.html Linux native filesystem] (83).

=== Can I use 32GB+ IDE hard drives with ReiserFS? ===

Yes, if you use Linux kernel 2.4 or later.

=== What about resizing ReiserFS? ===

This can be done with [[resize_reiserfs]].
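The <tt>st_blocks</tt>-versus-<tt>st_size</tt> point behind the <tt>du(1)</tt> answer above is easy to demonstrate. A small sketch (the file name is illustrative; on a tail-packing filesystem the gap for small files shrinks, while on most other filesystems a 5-byte file still occupies a whole block):

```python
import os
import tempfile

# du(1) effectively sums st_blocks (512-byte units), which the kernel
# rounds up to whole allocation units; st_size is the exact logical
# length, which is what tail packing preserves on ReiserFS.
with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "tiny.dat")
    with open(path, "wb") as f:
        f.write(b"hello")
    st = os.stat(path)
    print("logical bytes:", st.st_size)
    print("du-style bytes:", st.st_blocks * 512)  # typically a full block
```

Summing the second number over many small files is how <tt>du(1)</tt> comes to overstate usage relative to the packed on-disk reality.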
=== What should I put into the fifth (aka dump, fs_freq) and the sixth (aka pass, fs_passno) fields of /etc/fstab for ReiserFS filesystems? ===

You'd put in <tt>"0 0"</tt>, e.g.:

 /dev/sda3 /var reiserfs notail,nodev,nosuid,noexec <font color="red">0 0</font>

=== Why are ReiserFS filesystems not fscked on reboot after a crash? ===

Because [[ReiserFS]] provides journaling of metadata. After a crash, the consistency of a filesystem is restored by replaying the transaction log.

=== Can I interactively repair a filesystem that was corrupted? ===

This is done with [[reiserfsck]].

=== Can I use <tt>dump(8)</tt> and <tt>restore(8)</tt> with ReiserFS? Any caveats? ===

No. <tt>dump(8)</tt> uses knowledge of the internal structure of ext2 and works together with <tt>restore(8)</tt>, which also uses ext2-specific knowledge, to back up ext2 files. dump and restore are specific to ext2 and will not work with [[ReiserFS]]. To back up ReiserFS files, use <tt>tar(1)</tt>, which is universal and can be applied to almost any reasonable Linux filesystem. It is well known among system administrators that <tt>dump(8)</tt> is more complete than Unix tar, and that there is quite a list of things that Unix tar will fail to properly back up. This is not true of GNU tar, which is quite complete. Basically, the only real disadvantage of GNU tar compared to <tt>dump(8)</tt> is speed. Unfortunately, because it shares the same name as Unix <tt>tar(1)</tt>, people are reluctant to believe this. (Yes, GNU tar has incremental backups, etc.) We will performance-optimize ReiserFS backups for you (and the rest of the world) for $30K, which is not a lot if you are a large site spending a few hundred thousand on equipment for backups.

=== Does ReiserFS support snapshots? ===

No, but you can create a [[ReiserFS]] filesystem on top of an [http://sourceware.org/lvm2/ LVM] logical volume and use LVM's snapshot capabilities.

=== Can I check reiserfs filesystems for errors without unmounting them? ===
[[reiserfsck]] in checking mode can be run on filesystems mounted read-only. There is no official way to fix mounted filesystems, though; you MUST completely unmount your filesystem in order to have it fixed. If you have LVM, you can check the consistency of filesystems mounted read-write; here is the script contributed by Andreas Dilger:

=== What ReiserFS mount options should I use to get the performance winner for a mail server? ===

[http://archives.neohapsis.com/archives/postfix/2001-03/1148.html Craig Sanders answered] in detail: By the time I got around to running <tt>bonnie</tt>, the <tt>postmark</tt> and <tt>postal</tt> benchmarks had convinced me that <tt>notail</tt> was essential.

Host system:
* Debian GNU/Linux (of course :)
* Linux kernel 2.4.2 with the latest 20010305 ReiserFS patch
* dual P3-866 (256K cache)
* 512MB RAM
* [http://www.adaptec.com/en-US/support/scsi/u160/ASC-19160/ Adaptec 19160] SCSI controller

External drive box:
* [http://www.domex.com.tw/support/product/8230u.htm Domex 8230u] RAID controller, 32MB battery-backed cache
* 6 x 18GB IBM [http://www.hitachigst.com/tech/techlib.nsf/techdocs/85256AB8006A31E587256A78005A3610/$file/ddys_sp21.PDF DDYS-T18350M] drives

For this particular hardware, [[ReiserFS]]/notail on RAID5 was the clear performance winner for a mail server with lots of synced random I/O.

=== Does using ReiserFS mean I can just press the power-off button without running <tt>/sbin/shutdown</tt>? Does it mean there is no risk of data loss? ===

No, definitely not. As of now, [[ReiserFS]] only provides metadata journaling - that is, it records which files have been created or opened, whether they have had their size changed, or where they have been relocated. It guarantees that the structure of the internal ReiserFS tree will be correct, thereby allowing you, after an unclean shutdown, to start back up without having to run fsck on all the files that have not been changed.
Data in files that were being used at the time of the crash may have been corrupted; this is usual for most filesystems. Data-journaling filesystems guarantee that no garbage will be written into a file, but they do not guarantee that a file update will be completed. (Only [[Reiser4]] guarantees that filesystem operations are performed as atomic operations, and provides atomic transaction functionality.) [[ReiserFS]] does not guarantee that the file contents themselves are uncorrupted, nor that no data is lost. Moreover, even if all of your system is on ReiserFS, many system components (like daemons, database managers, etc.) require a proper shutdown procedure for correct functioning. However, there is a [ftp://ftp.suse.com/pub/people/mason/patches/data-logging separate implementation of data logging] (dead) that will [http://marc.info/?l=reiserfs-devel&m=103472026011689&w=2 soon] go into the mainstream kernel.

=== How does ReiserFS support bad block handling? ===

This is covered [[FAQ/bad-block-handling|here]].

=== I have a motherboard with VIA MVP3 chipset and experience ReiserFS problems. ===

[mailto:woster73@yahoo.com William Oster] answers: If you are using a motherboard with a [http://www.via.com.tw/en/products/apollo/mvp3.jsp VIA MVP3] chipset, you may have [[ReiserFS]] problems caused by the way your kernel is configured for the so-called [http://lxr.linux.no/linux+v2.6.30/drivers/pci/quirks.c PCI quirks]. My experience is with kernels 2.2.18 and 2.2.19, but it may affect the 2.4.x series too if you are using the MVP3 chipset (popular in socket 7 type motherboards, such as those used by the AMD K6 and classic Pentium). I've confirmed this problem with several motherboards using the VIA MVP3 chipset, ReiserFS 3.5.29 to 3.5.32, and [http://lxr.linux.no/linux+v2.6.30/Documentation/scsi/ncr53c8xx.txt NCR 53c8xx SCSI]. But please note: it probably affects '''any controller which uses DMA and PCI bus mastering'''.
Problems which I was inclined to attribute to ReiserFS were actually problems with this kernel [mis]configuration. If you fit this profile, '''DO NOT''' enable the <tt>CONFIG_PCI_QUIRKS</tt> configuration option in the <tt>/usr/src/linux/.config</tt> file. Although the Linux documentation suggests that this option can be enabled if in doubt, '''DO NOT''' enable it. It was never intended for the VIA MVP3 chipset anyway. It affects the way DMA is handled, and the combination with ReiserFS (and possibly NCR SCSI) can cause random disk corruption which eventually results in ReiserFS and/or SCSI errors. Evidently ReiserFS exercises the DMA and SCSI bus very thoroughly; the problems seem not to be as likely under the ext2 filesystem. Check your <tt>/usr/src/linux/.config</tt> file. You are safe from this problem if you find this line:

 # CONFIG_PCI_QUIRKS is not set

Any other setting could be dangerous to MVP3-chipset ReiserFS users, especially when using PCI bus mastering controllers such as the NCR 53c8xx series. Re-configure your kernel to disable the "PCI quirks" option, then <tt>make dep</tt>, rebuild, and reinstall.

=== I am having extensive problems using ReiserFS; it seems to have bugs all over the place. I'm not compiling with a [[#I am using RedHat 7.0 with gcc 2.96; why does ReiserFS seem unstable with it?|buggy compiler]]. What is happening? How can this be stable? ===

You have hardware problems. Really, you do. Even if the bugs don't show up with ext2, you have hardware problems. (See [[#Why_do_I_get_a_signal_11_when_compiling_the_kernel_using_ReiserFS_and_not_ext2?|the signal 11 question]].) Most SuSE users use ReiserFS. Obscure bugs probably still exist; but if you find bugs as easily as when using Windows, you have bad RAM, a bad CPU, a bad cable, bad cooling, a [[#I have a motherboard with VIA MVP3 chipset and experience ReiserFS problems.|VIA chipset with PCI quirks turned on]], or other hardware or software-layer bugs. ReiserFS is stable.
You can be sure that if the bugs are encountered easily and commonly with normal usage patterns, it is not us. This does not mean that the next release won't somehow break something, though :-/ Real bug reports are, at the time of writing, outnumbered 10 to 1 by hardware bugs that trigger error messages. We are working on making our error messages better at catching hardware bugs and identifying them as such. There is only so far we can go in runtime consistency checking, though, without serious speed reductions. We don't release software unless it goes through extensive testing; so if you don't think that our testing could have missed the bug, it is probably hardware.

=== How can I put a label (like the one allowed by the <tt>-L</tt> option of <tt>mkfs.ext2</tt>) on a ReiserFS instance? ===

Currently, this feature is only implemented for the [[ReiserFS]] v3.6 disk format. Adding it to the v3.5 disk format would break the existing format, and there is not enough free space in the superblock. You can set a label (and UUID) with a recent [[reiserfsprogs]] package on a [[ReiserFS]] v3.6 filesystem using the <tt>-l</tt> switch (<tt>-u</tt> for UUID) of the [[reiserfstune]] (for existing partitions) or [[mkreiserfs]] (for partitions being created) commands. Support for labels and UUIDs was integrated into [[reiserfsprogs]] starting from version 3.x.1a.

=== Why, when I'm working on files (i.e. having open files) on my laptop, does ReiserFS access the disk every 5 seconds? This effectively prevents the disk from spinning down, i.e. APM modes from taking over, even when I'm not writing anything. ===

[mailto:bgraveland@hyperchip.com Brent Graveland] answers: It's the [http://kerneltrap.org/node/14148 atime] update. Every time you run <tt>sync(1)</tt>, the sync program's <tt>atime</tt> is updated. The next <tt>sync()</tt> writes this <tt>atime</tt> update, then <tt>sync(1)</tt>'s <tt>atime</tt> gets updated again.

=== RedHat does not unmount <tt>/</tt> (<tt>/dev/root</tt>) with ReiserFS on halt. How to fix it? ===
RedHat users kindly provided these patches (not tested by us):
* [[FAQ/rc.sysinit.patch|rc.sysinit.patch]]
* [[FAQ/halt.patch|halt.patch]]

Note that if you have [http://www.redhat.com/docs/manuals/linux/RHL-7.2-Manual RedHat Linux 7.2] or later, you do not need these patches.

=== How do I run programs from the reiserfsprogs package on encrypted devices? ===

In order to access such encrypted entities, you need to use the [http://www.linux.org/docs/ldp/howto/Cryptoloop-HOWTO/loopdevice-setup.html losetup(8)] tool to bind your entity to a <tt>loop</tt> device.

=== Are there any recommendations ''pro'' or ''against'' any particular hard drive manufacturers for use with ReiserFS? ===

No; bad hard drives are not [[ReiserFS]] specific but affect all filesystems. There is basically no preference: the general rule '''the faster the drive and the lower the seek time, the better''' applies as always. On the other hand, almost every hard drive manufacturer has a '''widely known''' broken series of hard drives. The most recent example is [http://en.wikipedia.org/wiki/Deskstar_75GXP IBM's Deskstar] series of disks, especially the DTLA models produced in Hungary in 2000-2001. These are [http://ask.slashdot.org/article.pl?sid=01/10/04/0050238 known to fail very often], to the point that you probably don't want to use them even if you already paid for them. Other Deskstar drives also seem to be a poor choice: IBM released a note that Deskstar drives should not run for more than 8 hours per day on average. These drives are also known to be very sensitive to temperature conditions and to fail on overheating. There was a [http://web.archive.org/web/20060315210819/http://www.ibmdeskstar75gxplitigation.com/ class action lawsuit against IBM] over that drive series.

=== I am using RedHat 7.0 with gcc 2.96; why does ReiserFS seem unstable with it? ===

Use the most recent version of RedHat (gcc 2.96-85 or later with RedHat 7.2, although 7.1 is also okay for ReiserFS).
The choice of an unstable, [http://gcc.gnu.org/gcc-2.96.html unreleased] version of gcc 2.96 by RedHat as the default gcc was a Slashdot controversy. [http://www.redhat.com/advice/speaks_gcc.html gcc 2.96 on RedHat 7.0 was unstable], and ReiserFS was one of the things that would fail with it. There are two versions of gcc: 2.96 and 2.96-85. 2.96-85 works for ReiserFS; the other (the one on [http://www.redhat.com/docs/manuals/linux/RHL-7-Manual/ RedHat 7.0]) surely does not. Read the Linux kernel instructions about which compiler to use. The solution to code not working on broken compilers is the one RedHat has taken - fix the compiler. They [http://rhn.redhat.com/errata/RHBA-2002-055.html fixed] the compiler and thereby allowed the correctly compiled [[ReiserFS]] to work.

=== In my program I am using <tt>fsync(2)</tt> calls after each write to the file to guarantee the integrity of my file data, and this is very slow; how can I improve the performance? ===

Answer from Chris Mason: The main thing to remember is that <tt>fsync</tt>s introduce a bunch of disk writes and force the FS to wait on the buffers. The key to keeping performance up is to make it easy for the FS to do as much as possible before the <tt>fsync()</tt> call. So, if your application modifies 3 files and you want to make sure all 3 changes are safely on disk:

 write(file1)
 write(file2)
 write(file3)
 fsync(file1)
 fsync(file2)
 fsync(file3)

is much faster than:

 write(file1)
 fsync(file1)
 write(file2)
 fsync(file2)
 write(file3)
 fsync(file3)

It is also faster to write over existing bytes in a file than it is to append new bytes onto the end of it. When you overwrite existing bytes, you don't have to commit new metadata to disk on <tt>fsync()</tt>; the FS can just write the data blocks. This means fewer seeks. The more you write to a single file before calling <tt>fsync()</tt>, the faster overall things will run.
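The batching advice above translates directly into code. A sketch (the helper name and file layout are illustrative) that issues every <tt>write()</tt> before any <tt>fsync()</tt>:

```python
import os

def save_batched(updates):
    """Write several files, then fsync them all at the end.

    `updates` maps path -> bytes.  Issuing all the write() calls first
    lets the filesystem batch the dirty buffers; the trailing fsync()
    calls then wait on work already in flight, instead of forcing a
    full write-and-wait round trip per file.
    """
    fds = {}
    try:
        for path, data in updates.items():
            fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
            os.write(fd, data)
            fds[path] = fd
        for fd in fds.values():
            os.fsync(fd)  # all changes are now safely on disk
    finally:
        for fd in fds.values():
            os.close(fd)
```

The interleaved write-fsync-write-fsync pattern the answer warns against would put an fsync wait on the critical path of every single file.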
Similarly:

 write(8k)
 fsync(file)

is much faster than:

 write(4k)
 fsync(file)
 write(4k)
 fsync(file)

Trying to optimize for those three things alone can make a huge performance difference overall.

Answer from Josh MacDonald: You have to understand that even using <tt>fsync()</tt> after every <tt>write()</tt> makes no guarantees. If the system crashes during either the <tt>write()</tt> or the <tt>fsync()</tt> operation, your data may be lost or corrupted. And suppose the <tt>fsync()</tt> does complete: does your application keep its data in multiple files? If so, and you need to <tt>write()</tt> to multiple files as part of a transaction, you have even greater problems. The only safe and easy way to implement some kind of transaction with the traditional filesystem guarantees is to use <tt>rename()</tt>:
# Keep all of your data in a single file.
# Periodically write a complete copy of your database to a temporary file.
# Rename the temporary file to the original database name.

Addition from Nikita Danilov: One can implement something like a ''phase tree'' at user level and use <tt>rename()</tt> to atomically switch the root of the tree. This overcomes the "everything in one file" limitation but adds the complexity of requiring crash recovery. Or stop your development for now and wait until the [[Reiser4]] filesystem is released; it has a transaction API exported to userspace that would solve all of your problems.

== Our program needs to access a lot of working files. What is the recommended way to organize files to get the best results out of ReiserFS? Should all the files be placed in a single directory, or should the files be spread across a directory tree to limit the number of files per directory? Can you also summarize the relevant caching and locking effects? ==

Traditional file systems typically have poor performance when there are many files in a single directory, but not [[ReiserFS]].
These other file systems perform poorly because they use a linear search algorithm to find and replace entries in a directory. This means that the file system must scan, on average, half the blocks of a directory for every access. Typically, applications are required to work around this problem by manually structuring a tree of directories, allowing each individual directory to remain limited in size. For example, see how the Squid web proxy stores a large collection of files. ReiserFS does not have this problem, because it uses an internal tree to store all directories and file metadata. Directory operations remain efficient even for very large directories, so you can write your application free from this performance concern.

However, there are several issues that complicate this matter: namely locking and locality. The Linux VFS currently imposes locking restrictions that serialize many operations on directories, so if concurrent processes or threads will access the collection of files, then you may be better off using multiple directories. [[Reiser4]] will improve upon this restriction, although it is still under development. ReiserFS attempts to store all of the files in a directory, along with the directory itself, in nearby locations on disk. An application may exploit this spatial locality if it can predict which files will be accessed with temporal locality. You may be better off using multiple directories to store your files if you can predict that many files within a directory will be accessed at the same time.

To summarize, ReiserFS supports efficient access to large directories, and most traditional file systems do not. However, locking and locality issues may guide your decision to use manually structured directory trees instead, at least until ReiserFS exports control over packing locality to users and improves its locking.
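Josh MacDonald's three-step <tt>rename()</tt> recipe from the fsync answer above can be sketched as follows (the database path and helper name are illustrative):

```python
import os

def atomic_update(path, data):
    """Replace `path` with `data` via the write-temp-then-rename recipe.

    The temporary copy is fsync()ed before the rename, so after a crash
    readers see either the complete old contents or the complete new
    contents -- never a half-written file.  (A fully paranoid version
    would also fsync the containing directory after the rename.)
    """
    tmp = path + ".tmp"
    fd = os.open(tmp, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.write(fd, data)
        os.fsync(fd)  # data is durable before it becomes visible
    finally:
        os.close(fd)
    os.rename(tmp, path)  # atomic switch to the new contents
```

This is exactly the "everything in one file" pattern the FAQ describes: the atomicity comes from <tt>rename()</tt> being atomic on a POSIX filesystem, not from any journaling guarantee.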
[[category:ReiserFS]] [[category:Reiser4]]

This FAQ is very [[ReiserFS]]-centric and often a bit dated. The [[Reiser4]] filesystem is mentioned as ''upcoming''. Be sure to search the [[mailinglists|mailing list archives]] and help update this FAQ - thanks!

__TOC__

=== What are the specs for ReiserFS: maximum number of files, of files a directory can have, of sub-dirs in a dir, of links to a file, maximum file size, maximum filesystem size, etc.? ===

Specifications for [[ReiserFS]]:

{| cellpadding="5" cellspacing="0" border="1"
| '''property''' || '''3.5''' || '''3.6'''
|-
| max number of files || 2<sup>32</sup>-3 => 4 Gi - 3 || 2<sup>32</sup>-3 => 4 Gi - 3
|-
| max number of files a dir can have || 518701895 (but in practice this value is limited by the hash function; the r5 hash allows about 1 200 000 file names without collisions) || 2<sup>32</sup>-4 => 4 Gi - 4 (but in practice this value is limited by the hash function; the r5 hash allows about 1 200 000 file names without collisions)
|-
| max file size || 2<sup>31</sup>-1 bytes => 2 Gi - 1 || 2<sup>60</sup> bytes => 1 Ei, but the page cache limits this to 8 Ti on architectures with a 32-bit int
|-
| max number of links to a file || 2<sup>16</sup> => 64 Ki || 2<sup>32</sup> => 4 Gi
|-
| max filesystem size || 2<sup>32</sup> (4K) blocks => 16 Ti || 2<sup>32</sup> (4K) blocks => 16 Ti
|}

ReiserFS does '''meta-data journaling''', enabling fast crash recovery without the expense of full '''data journaling'''. There [ftp://ftp.suse.com/pub/people/mason/patches/intermezzo-alpha/ were] separate [http://marc.info/?l=reiserfs-devel&m=100895310422415&w=2 patches from Chris Mason] that implemented full data journaling for ReiserFS on Linux 2.4.16.

'''Note''': Full data journaling is considered by many to be a good way to achieve file data integrity across system crashes.
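The derived sizes in the specifications table follow from simple arithmetic; a quick sanity check (4 KiB blocks assumed):

```python
KiB = 1024

# max filesystem size: 2^32 blocks of 4 KiB each
max_fs = 2**32 * 4 * KiB
print(max_fs == 16 * KiB**4)        # 16 Ti

# v3.6 max file size upper bound: 2^60 bytes
print(2**60 == KiB**6)              # 1 Ei

# v3.5 max file size: 2^31 - 1 bytes, just under 2 Gi
print(2**31 - 1 == 2 * KiB**3 - 1)
```

All three checks hold, matching the Ti/Ei/Gi figures quoted in the table.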
However, although file data may appear to be consistent from the kernel point of view, since there is no API exported to the userspace to control transactions, we may end-up in a situation where the application makes two write requests (as part of one logical transaction) but only one of these gets journaled before the system crashes. From the application point of view, we may then end up with inconsistent data in the file. Such issues should be addressed with the upcoming [[Reiser4]]. Such an API will be exported to userspace and all programs that need transactions will be able to use it. === Mount fails after reiserfsck --rebuild-tree failure === When [[reiserfsck]] --rebuild-tree is run, the first thing it does is to set the root inode value to -1. This makes the filesystem unmountable. (So, if [[reiserfsck]] will fail later on, because it contains serious errors, this filesystem could not be mounted.) Therefore once [[reiserfsck]] --rebuild-tree have failed for one of your filesystems, mounting of this partition is disabled. To correct the error you must check if you are have the latest [[reiserfsprogs]] package installed. If that fails, please send a bug report to our [[mailinglists|mailing list]] and be ready to answer our questions. === Why is the execution time for a <tt>find . -type f | xargs cat {} \;</tt> command much longer when using ReiserFS than for the same command when using ext2? === This effect is observed if the measured file set was produced by untarring some archive created not from a ReiserFS partition (or by copying files from a non-ReiserFS partition or by running a program that writes a bunch of files in some order). This is because the <tt>readdir()</tt> operation performed on the ReiserFS partition returns filenames not in the original write order but rather in some hash order (dependant on the hash function used). Thus when reading files' contents, the hard drive heads must move when going from one file to another. 
If you want ReiserFS to outperform any other filesystem in your setup here is one solution: Copy the entire directory that you are not satisfied with to the same partition but with a different name (use <tt>cp -a</tt>), then remove the old directory and rename the new one with the old name. If the partition does not have enough space available, another approach is to <tt>tar(1)</tt> up the whole partition, clear it, and then untar the previously saved data. === Is quota-support built-in in the vanilla 2.4 kernels for ReiserFS? === No, quota support for Linux kernels for the 2.4 branch are bundled separately and were available once at [ftp://ftp.suse.com/pub/people/mason/patches/reiserfs/quota-2.4/ at SuSE] (gone) by Chris Mason, they are still [http://gd.tuwien.ac.at/utils/fs/reiserfs/quota-patches/ mirrored at TU-Wien]. The reason these patches were not included into 2.4 kernel branch is because they implement new quota format and need new quota code too, which is too big of a change for 2.4 series of kernels. Various Linux distributions vendors (ie [http://www.suse.com SuSE]) do ship reiserfs-quota enabled kernels, though. 
=== I am getting some errors in my kernel logs that I do not know how to interpret === Messages like:
 vs-13070: reiserfs_read_inode2: i/o failure occurred trying to find stat data of [1718696 1718710 0x0 SD]
 zam-7001: io error in reiserfs_find_entry
most likely accompanied by samples like those below, are definite signs of hard disk problems (bad sectors):
 hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
 hda: dma_intr: error=0x40 { UncorrectableError }, LBAsect=6599945, sector=4286584
 end_request: I/O error, dev 03:03 (hda), sector 4286584
or
 scsi0: ERROR on channel 0, id 1, lun 0, CDB: Read (10) 00 00 01 ee 60 00 00 08 00
 Current sd 08:00: sense key Medium Error
or
 I/O error: dev 08:21, sector 65704
Messages about <tt>"access beyond end of device"</tt> can have many different causes, from not rebooting after fdisk requested it, to unfinished resizes, to data corruption. The following messages mean you have a noisy IDE cable, or one of too low quality for the chosen UDMA mode. Replace the cable with a better one, or choose a slower UDMA mode:
 hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
 hda: dma_intr: error=0x84 { DriveStatusError BadCRC }
 hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
 hda: dma_intr: error=0x84 { DriveStatusError BadCRC }
If you see any message from [[ReiserFS]] that you cannot interpret and there is nothing similar to the messages above around it, [[mailinglists|mail the message to us]] and we will explain it to you. === Will ReiserFS implement streams, extended attributes, etc.? === [[FAQ/streams|Here]] is the one-page answer. === Reiserfs appears to be very slow while the RAID is resyncing. Mounting takes several minutes. Once mounted, an <tt>ls(1)</tt> in the mounted directory hangs. Forever. Once the RAID is synced, things appear to work pretty well. How can that be fixed? === First of all, a patch that makes mounting faster has been included in the Linux kernel since 2.4.19.
You can grab the patch for earlier kernels [http://gd.tuwien.ac.at/utils/fs/reiserfs/reiserfs-for-2.5/2.5.4.pending/07-reiserfs-bitmap-journal-read-ahead.diff here]. Also, RAID drivers have '''minimal guaranteed''' and '''maximal possible''' RAID rebuild bandwidth settings. These values are controlled through the <tt>/proc/sys/dev/raid/speed_limit_min</tt> and <tt>/proc/sys/dev/raid/speed_limit_max</tt> sysctl variables (values are in KiB/s). The RAID logic cannot always tell whether the disk subsystem is busy at a given time. When it thinks the disk subsystem is idle, it tries to rebuild the array at up to <tt>speed_limit_max</tt>, which defaults to 100 MB per second. Decrease this value to something more suitable (a bit of experimentation may be needed). === I get attempt to read past the end of the partition error messages; is ReiserFS corrupted? === You changed your partition sizes and then ran [[mkreiserfs]] before rebooting. The kernel does not update its idea of the partition sizes until reboot. (This is fixable, but nobody has fixed it as of Dec. 2001.) [[mkreiserfs]] therefore created a filesystem with a wrong notion of how large its partition is, and the filesystem's notion of the partition boundaries will persist past reboot even though the kernel's notion will change. So yes, it is corrupted. Some other kinds of metadata breakage can also lead to such messages. === Can I use VMware with ReiserFS? === VMware was tested on [http://www.suse.com/ SuSE Linux] with a [http://support.microsoft.com/gp/lifean18 Windows98] guest OS on a [[ReiserFS]] partition. There is one trick at the beginning: the following line was added to the VMware config file:
 host.FSSupportLocking1 = 0x52654973 # (0x52654973 == *(u32 *) "ReIs")
Thanks to [mailto:gkade@bigbrother.net Gregory K. Ade] for this hint. === How do I install Debian potato with ReiserFS as root partition?
=== [[FAQ/potato_part|Here]] are instructions by [mailto:LeBlanc@mcc.ac.uk Dr. A.V. Le Blanc]. === Starting with Linux kernel v2.4.21 I cannot mount my FS anymore. Why? === Special sanity checks were added to the kernel to prohibit mounting filesystems that are bigger than the underlying block device. If you now see this message on mount:
 Filesystem on xx:yy cannot be mounted because it is bigger than the device
you may need to run fsck or increase the size of your LVM partition. Or maybe you forgot to reboot after fdisk told you to. If you do not use LVM, this usually means you need to run <tt>[[reiserfsck]] --rebuild-sb</tt> on your filesystem and agree to change its default size to the proposed one. === Is it ok to use ReiserFS on a small storage device, e.g. a 16MB NAND flash block device? === [[FAQ/small_blocks|Here]] are instructions. === How do I change root from ext2 to ReiserFS without loss of data? === [[FAQ/change_fs|Here]] are instructions. === <tt>mount: /dev/hda5 has wrong major or minor number</tt> - what does that mean? === The kernel does not know anything about [[ReiserFS]]; it is neither compiled in nor available as a module. === Will it be possible to read/write ReiserFS partitions created now with future versions of ReiserFS? === Yes. [[ReiserFS]]-3.6.x (Linux 2.4.x) works with both the old (3.5) and the new (3.6) formats. ReiserFS-3.5.x (Linux 2.2.x) can only work with the old (3.5) disk format. There is no way to convert the new (3.6) disk format to the old (3.5), but the old (3.5) format can be converted to the new one (3.6) with the <tt>-o conv</tt> [[mount|mount option]]. === The ReiserFS module doesn't insert properly - why? === After applying the patch, ''recompile'' the whole kernel including the modules target, reboot, then try to insert the module. === Can I use ReiserFS with software RAID? === Yes, for all RAID levels using any Linux >= 2.4.1, but '''DO NOT''' use RAID with Linux 2.2.x.
Our journaling and their RAID code step on each other in the buffering code. Also, mirroring is '''not''' safe in the 2.2.x kernels, because online mirror rebuilds in 2.2.x break the write-ordering requirements for the log. If you crash in the middle of an online rebuild, your metadata may be corrupted. The only RAID level that is safe with [[ReiserFS]] in the 2.2.x kernels is striping/concatenation. === Can I use ReiserFS with 3ware RAID? === Yes, but you need to use Linux 2.2.19 or later for reasons other than [[ReiserFS]]. Also, if you encounter problems, be aware that it might not be ReiserFS that has the bug. See these [http://web.archive.org/web/20030415160519/http://www.3ware.com/support/raid5techbulletin.shtml special instructions] (archive.org). === Why do things freeze on my IDE hard drive for annoying amounts of time? === Because when large writes are scheduled all at once, reads can starve. A fix for this is evolving; the later your ReiserFS patch, the better we handle this. === <tt>du(1)</tt> says ReiserFS makes space efficiency worse. === Use <tt>df(1)</tt>, not <tt>du(1)</tt>, or use the ''raw'' option for <tt>du(1)</tt> if it is supported. <tt>st_blocks</tt> summed up is less accurate than <tt>st_size</tt> for [[ReiserFS]] because we pack tails, and <tt>st_blocks</tt> rounds numbers up. === <tt>mkreiserfs(8)</tt> fails after repartitioning === The kernel requires you to reboot after repartitioning (for all filesystems). We intend to fix that. === Performance is poor, and my disk at 96% full still has free space. === Once a disk drive gets more than 85% full, performance starts to suffer unless you use a repacker (which isn't implemented yet). You can probably get away with 92%, but if performance is valued, you are making a mistake keeping it any fuller. This is true for almost all filesystems.
[[ReiserFS]], because it packs tails together, fits more data into a given percentage used, but it is still subject to the rules for the maximum recommended percentage used. If you fill the whole disk with one copy and then mount it read-only, you can fully pack it without problems. Be sure to copy (or <tt>tar</tt>) the data from a ReiserFS partition so that files are created in ReiserFS <tt>readdir()</tt> order, as this will improve performance. === Why do I get a signal 11 when compiling the kernel using ReiserFS and not ext2? === Your CPU is overheating and/or you have [http://www.bitwizard.nl/sig11/ bad RAM]. === But it doesn't happen with ext2? === ext2 uses less heat-sensitive gates in the CPU :-) Seriously, ext2 and [[ReiserFS]] contain random differences, and overheating and bad RAM have random sensitivities. ([http://www.bitwizard.nl/sig11/ Signal 11] is not due to ReiserFS. One user had a cable blocking the fan; it did not affect ext2, but it wasn't until he fixed the cable-fan problem that ReiserFS worked.) === Can I use ReiserFS on architectures other than i386? === Yes; starting from Linux [http://kernel.org/pub/linux/kernel/v2.4/ChangeLog-2.4.13 kernel 2.4.13], ReiserFS can run on any Linux-supported architecture. === I need a program which will help me in rebuilding/recreating my partition table. === [http://brzitwa.de/mb/gpart/ gpart] is a utility that handles ext2, FAT, Linux swap, HPFS, NTFS, FreeBSD and Solaris/x86 disklabels, Minix, and ReiserFS. It prints a proposed layout for the primary partition table and is well documented. === What partition type should I use for ReiserFS? === [http://www.win.tue.nl/~aeb/partitions/partition_types.html Linux native filesystem] (83). === Can I use 32GB+ IDE hard drives with ReiserFS? === Yes, if you use Linux kernel 2.4 or later. === What about resizing ReiserFS? === This can be done with [[resize_reiserfs]].
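The <tt>du(1)</tt>-versus-<tt>df(1)</tt> answer above rests on the difference between <tt>st_size</tt> and <tt>st_blocks</tt>; a small Python sketch (the allocated size shown is filesystem-dependent):

```python
import os
import tempfile

# st_size is the exact byte count; st_blocks is the number of 512-byte
# units actually allocated, rounded up to the filesystem's block size.
# Summing st_blocks (what du does) therefore over-counts small files on
# most filesystems; ReiserFS tail packing narrows the real on-disk gap,
# which per-file st_blocks cannot show.
fd, path = tempfile.mkstemp()
os.write(fd, b"x" * 100)        # a 100-byte file
os.close(fd)

st = os.stat(path)
logical = st.st_size            # 100
allocated = st.st_blocks * 512  # typically one full block, e.g. 4096
```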
=== What should I put into the fifth (aka dump, fs_freq) and the sixth (aka pass, fs_passno) fields of /etc/fstab for ReiserFS filesystems? === Put in <tt>"0 0"</tt>, e.g.
 /dev/sda3 /var reiserfs notail,nodev,nosuid,noexec <font color="red">0 0</font>
=== Why are ReiserFS filesystems not fscked on reboot after a crash? === Because [[ReiserFS]] provides journaling of meta-data. After a crash, the consistency of a filesystem is restored by replaying the transaction log. === Can I interactively repair a filesystem that was corrupted? === This is done with [[reiserfsck]]. === Can I use <tt>dump(8)</tt> and <tt>restore(8)</tt> with ReiserFS? Any caveats? === No. <tt>dump(8)</tt> uses knowledge of the internal structure of ext2, and works together with <tt>restore(8)</tt>, which also relies on ext2-specific knowledge, to back up ext2 files. dump and restore are specific to ext2 and will not work with [[ReiserFS]]. To back up ReiserFS files use <tt>tar(1)</tt>, which is universal and can be applied to almost any reasonable Linux filesystem. It is well known among system administrators that <tt>dump(8)</tt> is more complete than Unix tar, and that there is quite a list of things Unix tar will fail to back up properly. This is not true of GNU tar, which is quite complete. Basically, the only real disadvantage of GNU tar compared to <tt>dump(8)</tt> is speed. Unfortunately, because it shares its name with Unix <tt>tar(1)</tt>, people are reluctant to believe this. (Yes, GNU tar has incremental backups, etc.) We will performance-optimize ReiserFS backups for you (and the rest of the world) for $30K, which is not a lot if you are a large site spending a few hundred thousand on backup equipment. === Does ReiserFS support snapshots? === No, but you can create [[ReiserFS]] on top of an [http://sourceware.org/lvm2/ LVM] logical volume and use LVM's snapshot capabilities. === Can I check reiserfs filesystems for errors without unmounting them?
=== [[reiserfsck]] in checking mode can be run on filesystems mounted read-only. There is no official way to fix mounted filesystems, though: you MUST completely unmount your filesystem in order to have it fixed. If you have LVM, you can check the consistency of filesystems mounted read-write; here is the script contributed by Andreas Dilger: === What ReiserFS mount options should I use to get the performance winner for a mail server? === [http://archives.neohapsis.com/archives/postfix/2001-03/1148.html Craig Sanders answered] in detail: By the time I got around to running <tt>bonnie</tt>, the <tt>postmark</tt> and <tt>postal</tt> benchmarks had convinced me that <tt>notail</tt> was essential. Host system:
* Debian GNU/Linux (of course :)
* Linux kernel 2.4.2 with the latest 20010305 ReiserFS patch
* dual P3-866 (256K cache)
* 512MB RAM
* [http://www.adaptec.com/en-US/support/scsi/u160/ASC-19160/ Adaptec 19160] SCSI controller
External drive box:
* [http://www.domex.com.tw/support/product/8230u.htm Domex 8230u] RAID controller, 32MB battery-backed cache
* 6 x 18GB IBM [http://www.hitachigst.com/tech/techlib.nsf/techdocs/85256AB8006A31E587256A78005A3610/$file/ddys_sp21.PDF DDYS-T18350M] drives
For this particular hardware, [[ReiserFS]]/notail on RAID5 was the clear performance winner for a mail server with lots of synced random I/O. === Does using ReiserFS mean I can just press the power-off button without running <tt>/sbin/shutdown</tt>? Does it mean there is no risk of data loss? === No, definitely not. As of now, [[ReiserFS]] only provides meta-data journaling - that is, it records which files have been created or opened, whether they have had their size changed, or where they have been relocated. It guarantees that the structure of the internal ReiserFS tree will be correct, thereby allowing you to start back up after an unclean shutdown without having to run fsck on all the files that have not been changed.
Data in files that were being used at the time of the crash could have been corrupted. This is usual for most filesystems. Data-journaling filesystems guarantee that no garbage will be written into a file, but they do not guarantee that a file update will be completed. (Only [[Reiser4]] guarantees that filesystem operations are performed as atomic operations, and provides atomic transaction functionality.) [[ReiserFS]] does not guarantee that the file contents themselves are uncorrupted, nor that no data is lost. Moreover, even if your entire system is on ReiserFS, many system components (daemons, database managers, etc.) require a proper shutdown procedure to function correctly. However, there is a [ftp://ftp.suse.com/pub/people/mason/patches/data-logging separate implementation of data logging] (dead) that will [http://marc.info/?l=reiserfs-devel&m=103472026011689&w=2 soon] go into the mainstream kernel. === How does ReiserFS support bad block handling? === This is covered [[FAQ/bad-block-handling|here]]. === I have a motherboard with VIA MVP3 chipset and experience ReiserFS problems. === [mailto:woster73@yahoo.com William Oster] answers: If you are using a motherboard with a [http://www.via.com.tw/en/products/apollo/mvp3.jsp VIA MVP3] chipset, you may have [[ReiserFS]] problems caused by the way your kernel is configured for the so-called [http://lxr.linux.no/linux+v2.6.30/drivers/pci/quirks.c PCI quirks]. My experience is with kernels 2.2.18 and 2.2.19, but it may affect the 2.4.x series too if you are using the MVP3 chipset (popular in Socket 7 motherboards, such as those used by the AMD K6 and classic Pentium). I have confirmed this problem with several motherboards using the VIA MVP3 chipset, ReiserFS 3.5.29 to 3.5.32, and [http://lxr.linux.no/linux+v2.6.30/Documentation/scsi/ncr53c8xx.txt NCR 53c8xx SCSI]. But please note: it probably affects '''any controller which uses DMA and PCI bus mastering'''.
Problems which I was inclined to attribute to ReiserFS were actually problems with this kernel misconfiguration. If you fit this profile, '''DO NOT''' enable the <tt>CONFIG_PCI_QUIRKS</tt> configuration option in the <tt>/usr/src/linux/.config</tt> file. Although the Linux documentation suggests that this option can be enabled if in doubt, '''DO NOT''' enable it. It was never intended for the VIA MVP3 chipset anyway. It affects the way DMA is handled, and in combination with ReiserFS (and possibly NCR SCSI) it can cause random disk corruption which eventually results in ReiserFS and/or SCSI errors. Evidently ReiserFS exercises the DMA and SCSI bus very thoroughly; the problems seem less likely under the ext2 filesystem. Check your <tt>/usr/src/linux/.config</tt> file. You are safe from this problem if you find this line:
 # CONFIG_PCI_QUIRKS is not set
Any other setting could be dangerous to MVP3-chipset ReiserFS users, especially when using PCI bus-mastering controllers such as the NCR 53c8xx series. Re-configure your kernel to disable the "PCI quirks" option, then <tt>make dep</tt>, rebuild, and reinstall. === I am having extensive problems using ReiserFS; it seems to have bugs all over the place. I'm not compiling with a [[#I am using RedHat 7.0 with gcc 2.96; why does ReiserFS seem unstable with it?|buggy compiler]]. What is happening? How can this be stable? === You have hardware problems. Really, you do. Even if the bugs don't show up with ext2, you have hardware problems. (See [[#Why_do_I_get_a_signal_11_when_compiling_the_kernel_using_ReiserFS_and_not_ext2?|the signal 11 question]].) Most SuSE users use ReiserFS. Obscure bugs probably still exist; but if you find bugs as easily as when using Windows, you have bad RAM, a bad CPU, a bad cable, bad cooling, a [[#I have a motherboard with VIA MVP3 chipset and experience ReiserFS problems.|VIA chipset with PCI quirks turned on]], or other hardware or software-layer bugs. ReiserFS is stable.
You can be sure that if bugs are encountered easily and commonly with normal usage patterns, it is not us. This does not mean that the next release won't somehow break something, though :-/ At the time of writing, real bug reports are outnumbered 10 to 1 by hardware bugs that trigger error messages. We are working on making our error messages better at catching hardware bugs and identifying them as such; there is only so far we can go in runtime consistency checking without serious speed reductions, though. We don't release software unless it goes through extensive testing, so if you don't think our testing could have missed the bug, it is probably hardware. === How can I put a label (like the <tt>-L</tt> option of <tt>mkfs.ext2</tt> allows) on a ReiserFS instance? === Currently, this feature is only implemented for the [[ReiserFS]] v3.6 disk format. Adding it to the v3.5 disk format would break the existing format, and there is not enough free space in the superblock. You can set a label (and UUID) on a [[ReiserFS]] v3.6 filesystem with a recent [[reiserfsprogs]] package, using the <tt>-l</tt> switch (<tt>-u</tt> for UUID) of [[reiserfstune]] (for existing partitions) or [[mkreiserfs]] (for partitions being created). Support for labels and UUIDs was integrated into [[reiserfsprogs]] starting from version 3.x.1a. === Why, when I'm working on files (i.e. having open files) on my laptop, does ReiserFS access the disk every 5 seconds? This effectively prevents the disk from spinning down, i.e. prevents APM modes from taking over, even when I'm not writing anything. === [mailto:bgraveland@hyperchip.com Brent Graveland] answers: It's the [http://kerneltrap.org/node/14148 atime] update. Every time you run <tt>sync(1)</tt>, the sync program's <tt>atime</tt> is updated. The next <tt>sync()</tt> writes this <tt>atime</tt> update, and then <tt>sync(1)</tt>'s atime gets updated again. === RedHat does not unmount <tt>/</tt> (<tt>/dev/root</tt>) with ReiserFS on halt. How can this be fixed?
=== RedHat users kindly provided these patches (not tested by us):
* [[FAQ/rc.sysinit.patch|rc.sysinit.patch]]
* [[FAQ/halt.patch|halt.patch]]
Note that if you have [http://www.redhat.com/docs/manuals/linux/RHL-7.2-Manual RedHat Linux 7.2] or later, you do not need these patches. === How do I run programs from the reiserfsprogs package on encrypted devices? === In order to access such encrypted entities, you need to use the [http://www.linux.org/docs/ldp/howto/Cryptoloop-HOWTO/loopdevice-setup.html losetup(8)] tool to bind your entity to a <tt>loop</tt> device. === Are there any recommendations ''pro'' or ''against'' particular hard drive manufacturers for use with ReiserFS? === No, as bad hard drives are not [[ReiserFS]]-specific but affect all filesystems. There is basically no preference; the general rule that '''a faster drive with lower seek time is better''' applies as always. On the other hand, almost every hard drive manufacturer has a '''widely known''' broken series of hard drives. The most recent example is [http://en.wikipedia.org/wiki/Deskstar_75GXP IBM's Deskstar] series, especially the DTLA models produced in Hungary in 2000-2001. These are [http://ask.slashdot.org/article.pl?sid=01/10/04/0050238 known to fail very often], to the point that you probably don't want to use them even if you already paid for them. Other Deskstar drives also seem to be a poor choice: IBM released a note that Deskstar drives should not run for more than 8 hours/day on average, and these drives are known to be very sensitive to temperature conditions and to fail on overheating. There is a [http://web.archive.org/web/20060315210819/http://www.ibmdeskstar75gxplitigation.com/ class-action lawsuit against IBM] over that drive series. === I am using RedHat 7.0 with gcc 2.96; why does ReiserFS seem unstable with it? === Use the most recent version of RedHat (gcc 2.96-85 or later with RedHat 7.2, although 7.1 is also okay for ReiserFS).
The choice by RedHat of an unstable, [http://gcc.gnu.org/gcc-2.96.html unreleased] version of gcc 2.96 as the default gcc was a Slashdot controversy. [http://www.redhat.com/advice/speaks_gcc.html gcc 2.96 on RedHat 7.0 was unstable], and ReiserFS was one of the things that would fail with it. There are two gcc versions: 2.96 and 2.96-85. 2.96-85 works for ReiserFS; the other (the one on [http://www.redhat.com/docs/manuals/linux/RHL-7-Manual/ RedHat 7.0]) surely does not. Read the Linux kernel instructions about what compiler to use. The solution to code not working on broken compilers is the one RedHat has taken - fix the compiler. They [http://rhn.redhat.com/errata/RHBA-2002-055.html fixed] the compiler and thereby allowed the correctly compiled [[ReiserFS]] to work. === In my program I am using fsync(2) calls after each write to the file to guarantee the integrity of my file data, and this is very slow; how can I improve the performance? === Answer from Chris Mason: The main thing to remember is that fsyncs introduce a bunch of disk writes and force the FS to wait on the buffers. The key to keeping performance up is to make it easy for the FS to do as much as possible before the fsync call. So, if your application modifies 3 files and you want to make sure all 3 changes are safely on disk:
 write(file1)
 write(file2)
 write(file3)
 fsync(file1)
 fsync(file2)
 fsync(file3)
is much faster than:
 write(file1)
 fsync(file1)
 write(file2)
 fsync(file2)
 write(file3)
 fsync(file3)
It is also faster to overwrite existing bytes in a file than to append new bytes onto its end. When you overwrite existing bytes, you don't have to commit new metadata to disk on fsync(); the FS can just write the data blocks. This means fewer seeks. The more you write to a single file before calling fsync, the faster things will run overall.
 write(8k)
 fsync(file)
is much faster than:
 write(4k)
 fsync(file)
 write(4k)
 fsync(file)
Trying to optimize for those 3 things alone can make a huge performance difference overall. Answer from Josh MacDonald: You have to understand that even using fsync() after every write() makes no guarantees. If the system crashes during either the write or the fsync operation, your data may be lost or corrupted. Suppose the fsync() does complete: does your application keep its data in multiple files? If so, and you need to write() to multiple files as part of one transaction, you have even greater problems. The only safe and easy way to implement some kind of transaction with the traditional filesystem guarantees is to use rename():
# Keep all of your data in a single file.
# Periodically write a complete copy of your database to a temporary file.
# Rename the temporary file to the original database name.
(Addition from Nikita Danilov: one can implement something like a phase tree at user level and use rename to atomically switch the root of the tree. This overcomes the "everything-in-one-file" limitation, but adds the complexity of requiring crash recovery.) Answer from Nikita Danilov: Stop your development for now and wait until the Reiser4 filesystem is released; it has a transaction API exported to userspace. That transaction API would solve all of your problems. == Our program needs to access a lot of working files. What is the recommended way to organize files to get the best results out of ReiserFS? Should all the files be placed in a single directory, or should the files be spread across a directory tree to limit the number of files per directory? Can you also summarize the relevant caching and locking effects? == Traditional filesystems typically perform poorly when there are many files in a single directory, but [[ReiserFS]] does not.
These other filesystems perform poorly because they use a linear search algorithm to find and replace entries in a directory. This means the filesystem must scan, on average, half the blocks of a directory for every access. Typically, applications work around this problem by manually structuring a tree of directories, allowing each individual directory to remain limited in size; for example, see how the Squid web proxy stores a large collection of files. ReiserFS does not have this problem because it uses an internal tree to store all directories and file metadata. Directory operations remain efficient even for very large directories, so you can write your application free from this performance concern. However, there are several issues that complicate this matter: namely, locking and locality. The Linux VFS currently imposes locking restrictions that serialize many operations on directories, so if concurrent processes or threads will access the collection of files, you may be better off using multiple directories. [[Reiser4]] will improve upon this restriction, although it is still under development. ReiserFS attempts to store all of the files in a directory, along with the directory itself, in nearby locations on disk. An application may exploit this spatial locality if it can predict which files will be accessed with temporal locality. You may be better off using multiple directories to store your files if you can predict that many files within a directory will be accessed at the same time. To summarize, ReiserFS supports efficient access to large directories while most traditional filesystems do not; however, locking and locality issues may guide your decision to use manually structured directory trees instead, at least until ReiserFS exports control over packing locality to users and improves its locking.
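The fsync-batching and write-then-rename advice from the answers above can be sketched in Python (file and payload names are illustrative):

```python
import os
import tempfile

def batched_sync(paths, payload=b"payload"):
    """Chris Mason's ordering: do all writes first, then all fsyncs,
    so the filesystem can batch the work before being forced to wait."""
    fds = [os.open(p, os.O_WRONLY | os.O_CREAT) for p in paths]
    for fd in fds:          # fast part: queue every write together
        os.write(fd, payload)
    for fd in fds:          # then force each file to stable storage
        os.fsync(fd)
    for fd in fds:
        os.close(fd)

def atomic_update(path, data):
    """Josh MacDonald's recipe: write a complete new copy, fsync it,
    then rename it over the original. rename(2) within one filesystem
    is atomic, so a crash leaves the old or the new file, never a mix."""
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(os.path.abspath(path)))
    try:
        os.write(fd, data)
        os.fsync(fd)
    finally:
        os.close(fd)
    os.replace(tmp, path)   # os.replace() has rename(2) semantics
```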
[[category:ReiserFS]] [[category:Reiser4]]
However, although file data may appear to be consistent from the kernel point of view, since there is no API exported to the userspace to control transactions, we may end-up in a situation where the application makes two write requests (as part of one logical transaction) but only one of these gets journaled before the system crashes. From the application point of view, we may then end up with inconsistent data in the file. Such issues should be addressed with the upcoming [[Reiser4]]. Such an API will be exported to userspace and all programs that need transactions will be able to use it. === Mount fails after reiserfsck --rebuild-tree failure === When [[reiserfsck]] --rebuild-tree is run, the first thing it does is to set the root inode value to -1. This makes the filesystem unmountable. (So, if [[reiserfsck]] will fail later on, because it contains serious errors, this filesystem could not be mounted.) Therefore once [[reiserfsck]] --rebuild-tree have failed for one of your filesystems, mounting of this partition is disabled. To correct the error you must check if you are have the latest [[reiserfsprogs]] package installed. If that fails, please send a bug report to our [[mailinglists|mailing list]] and be ready to answer our questions. === Why is the execution time for a <tt>find . -type f | xargs cat {} \;</tt> command much longer when using ReiserFS than for the same command when using ext2? === This effect is observed if the measured file set was produced by untarring some archive created not from a ReiserFS partition (or by copying files from a non-ReiserFS partition or by running a program that writes a bunch of files in some order). This is because the <tt>readdir()</tt> operation performed on the ReiserFS partition returns filenames not in the original write order but rather in some hash order (dependant on the hash function used). Thus when reading files' contents, the hard drive heads must move when going from one file to another. 
If you want ReiserFS to outperform any other filesystem in your setup here is one solution: Copy the entire directory that you are not satisfied with to the same partition but with a different name (use <tt>cp -a</tt>), then remove the old directory and rename the new one with the old name. If the partition does not have enough space available, another approach is to <tt>tar(1)</tt> up the whole partition, clear it, and then untar the previously saved data. === Is quota-support built-in in the vanilla 2.4 kernels for ReiserFS? === No, quota support for Linux kernels for the 2.4 branch are bundled separately and were available once at [ftp://ftp.suse.com/pub/people/mason/patches/reiserfs/quota-2.4/ at SuSE] (gone) by Chris Mason, they are still [http://gd.tuwien.ac.at/utils/fs/reiserfs/quota-patches/ mirrored at TU-Wien]. The reason these patches were not included into 2.4 kernel branch is because they implement new quota format and need new quota code too, which is too big of a change for 2.4 series of kernels. Various Linux distributions vendors (ie [http://www.suse.com SuSE]) do ship reiserfs-quota enabled kernels, though. 
=== I am getting some errors in my kernel logs, that I do not know how to interpret === Messages like: vs-13070: reiserfs_read_inode2: i/o failure occurred trying to find stat data of [1718696 1718710 0x0 SD]" zam-7001: io error in reiserfs_find_entry most likely accompanied with samples below are definite signs of harddisk problems (bad sectors): hda: dma_intr: status=0x51 { DriveReady SeekComplete Error } hda: dma_intr: error=0x40 { UncorrectableError }, LBAsect=6599945, sector=4286584 end_request: I/O error, dev 03:03 (hda), sector 4286584 or scsi0: ERROR on channel 0, id 1, lun 0, CDB: Read (10) 00 00 01 ee 60 00 00 08 00 Current sd 08:00: sense key Medium Error or I/O error: dev 08:21, sector 65704 Messages about <tt>"access beyond end of device"</tt> may have lots of different reasons starting from not rebooting after fdisk requested it, unfinished resizings, data corruptions. The following messages mean you have a noisy IDE cable, or it is just too low quality for choosen UDMA mode. Try to replace the cable with better one, or choose slower UDMA mode: hda: dma_intr: status=0x51 { DriveReady SeekComplete Error } hda: dma_intr: error=0x84 { DriveStatusError BadCRC } hda: dma_intr: status=0x51 { DriveReady SeekComplete Error } hda: dma_intr: error=0x84 { DriveStatusError BadCRC } If you see any message from [[ReiserFS]] that you cannot interpret and there is nothing similar to messages above around it, [[mailinglists|mail the message to us]] and we will explain it to you. === Will ReiserFS implement streams, extended attributes, etc.? === [[FAQ/streams|Here]] is the one page answer. === Reiserfs appears to be very slow while the RAID is resyncing. Mounting takes several minutes. Once mounted, an <tt>ls(1)</tt> in the mounted directory hangs. Forever. Once the RAID is sync'ed, things appear to work pretty well. How that can be fixed? === First of all we have included a patch that helps mounting the drive faster into linux kernel since 2.4.19. 
You can grab the patch for earlier kernels [http://gd.tuwien.ac.at/utils/fs/reiserfs/reiserfs-for-2.5/2.5.4.pending/07-reiserfs-bitmap-journal-read-ahead.diff here]. Also, RAID drivers have '''minimal guaranteed''' and '''maximal possible''' RAID rebuild bandwidth settings. These values are controlled through the <tt>/proc/sys/dev/raid/speed_limit_min</tt> and <tt>/proc/sys/dev/raid/speed_limit_max</tt> sysctl variables (values are in KiB/s). The RAID logic cannot always tell whether the disk subsystem is busy at a given time. When it thinks the disk subsystem is idle, it tries to rebuild the RAID array at the <tt>speed_limit_max</tt> speed, which defaults to 100 MB per second. Decrease this value to something more suitable (a bit of experimentation might be needed).

=== I get "attempt to read past the end of the partition" error messages; is ReiserFS corrupted? ===

You changed your partition sizes and then, before rebooting, ran [[mkreiserfs]]. The kernel does not update its idea of the partition sizes until reboot time. (This is fixable, but nobody has fixed it as of Dec. 2001.) [[mkreiserfs]] created a filesystem with a wrong notion of how large its partition is. The filesystem's notion of the partition boundaries will last past reboot even though the kernel's notion will change. So yes, it is corrupted. Some other kinds of metadata breakage can also lead to such messages.

=== Can I use VMware with ReiserFS? ===

VMware was tested on [http://www.suse.com/ SuSE Linux] with a [http://support.microsoft.com/gp/lifean18 Windows98] guest OS on a [[ReiserFS]] partition. There is one trick at the beginning: the following line was added to the VMware config file:

 host.FSSupportLocking1 = 0x52654973 # (0x52654973 == *(u32 *) "ReIs")

Thanks to [mailto:gkade@bigbrother.net Gregory K. Ade] for this hint.

=== How do I install Debian potato with ReiserFS as root partition?
=== [[FAQ/potato_part|Here]] are instructions by [mailto:LeBlanc@mcc.ac.uk Dr. A.V. Le Blanc].

=== Starting with Linux kernel v2.4.21 I cannot mount my FS anymore. Why? ===

Sanity checks were added to the kernel to prohibit mounting filesystems that are bigger than the underlying block device. If you now see this message on mount:

 Filesystem on xx:yy cannot be mounted because it is bigger than the device

you may need to run fsck or increase the size of your LVM partition. Or maybe you forgot to reboot after fdisk when it told you to. If you do not use LVM, that usually means you need to run <tt>[[reiserfsck]] --rebuild-sb</tt> on your filesystem and agree to change its size to the proposed one.

=== Is it OK to use ReiserFS on a small storage device, e.g. a 16MB NAND flash block device? ===

[[FAQ/small_blocks|Here]] are instructions.

=== How do I change root from ext2 to ReiserFS without loss of data? ===

[[FAQ/change_fs|Here]] are instructions.

=== <tt>mount: /dev/hda5 has wrong major or minor number</tt> - what does that mean? ===

The kernel does not know anything about [[ReiserFS]]; it is neither compiled in nor available as a module.

=== Will it be possible to read/write ReiserFS partitions created now with future versions of ReiserFS? ===

Yes. [[ReiserFS]]-3.6.x (Linux-2.4.x) works with both the old (3.5) and the new (3.6) format. ReiserFS-3.5.x (Linux-2.2.x) can only work with the old (3.5) disk format. There is no way to convert the new (3.6) disk format to the old (3.5), but the old (3.5) format can be converted to the new one (3.6) with the <tt>-o conv</tt> [[mount|mount option]].

=== The ReiserFS module doesn't insert properly - why? ===

After applying the patch, ''recompile'' the whole kernel including the modules target, reboot, then try to insert the module.

=== Can I use ReiserFS with software RAID? ===

Yes, for all RAID levels using any Linux >= 2.4.1, but '''DO NOT''' use RAID with Linux 2.2.x.
Our journaling and their RAID code step on each other in the buffering code. Also, mirroring is '''not''' safe in the 2.2.x kernels because online mirror rebuilds in 2.2.x break the write-ordering requirements for the log. If you crash in the middle of an online rebuild, your metadata may be corrupted. The only RAID level that is safe with [[ReiserFS]] in the 2.2.x kernels is the striping/concatenation level.

=== Can I use ReiserFS with 3ware RAID? ===

Yes, but you need to use Linux 2.2.19 or later for reasons other than [[ReiserFS]]. Also, if you should encounter problems, be suspicious that it might not be ReiserFS that has the bug. There are [http://web.archive.org/web/20030415160519/http://www.3ware.com/support/raid5techbulletin.shtml special instructions] (archive.org).

=== Why do things freeze on my IDE hard drive for annoying amounts of time? ===

Because when large writes are scheduled all at once, reads can starve. A fix for this is evolving; the later your ReiserFS patch, the better we handle this.

=== <tt>du(1)</tt> says ReiserFS makes space efficiency worse. ===

Use <tt>df(1)</tt>, not <tt>du(1)</tt>, or use the ''raw'' option for <tt>du(1)</tt> if it is supported. <tt>st_blocks</tt> summed up is less accurate than <tt>st_size</tt> for [[ReiserFS]] because we pack tails, and <tt>st_blocks</tt> rounds numbers up.

=== <tt>mkreiserfs(8)</tt> fails after repartitioning ===

The kernel requires you to reboot after repartitioning (for all filesystems). We intend to fix that.

=== Performance is poor, and my disk at 96% full still has free space. ===

Once a disk drive gets more than 85% full, performance starts to suffer unless a repacker is used (which isn't implemented yet). You can probably get away with 92%, but if performance is valued you are making a mistake to keep it any fuller. This is true for almost all filesystems.
[[ReiserFS]], because it packs tails together, packs more data into a given percentage used, but it is still subject to the rules for the maximum recommended percentage used. If you create the whole disk with one copy and then mount it read-only, you can fully pack it without problems. Be sure that you copy it from (or <tt>tar</tt> it from) a ReiserFS partition so that files are created in ReiserFS <tt>readdir()</tt> order, as this will improve performance.

=== Why do I get a signal 11 when compiling the kernel using ReiserFS and not ext2? ===

Your CPU is overheating and/or you have [http://www.bitwizard.nl/sig11/ bad RAM].

=== But it doesn't happen with ext2? ===

ext2 uses less heat-sensitive gates in the CPU :-) Seriously, ext2 and [[ReiserFS]] contain random differences, and overheating and bad RAM have random sensitivities. ([http://www.bitwizard.nl/sig11/ Signal 11] is not due to ReiserFS. One user had a cable blocking the fan; it did not affect ext2, but it wasn't until he fixed the cable-fan problem that ReiserFS worked.)

=== Can I use ReiserFS on architectures other than i386? ===

Yes. Starting from Linux [http://kernel.org/pub/linux/kernel/v2.4/ChangeLog-2.4.13 kernel 2.4.13], ReiserFS can be run on any architecture supported by Linux.

=== I need a program which will help me in rebuilding/recreating my partition table. ===

[http://brzitwa.de/mb/gpart/ gpart] is a utility that handles ext2, FAT, Linux swap, HPFS, NTFS, FreeBSD and Solaris/x86 disklabels, Minix, and ReiserFS. It prints a proposed content for the primary partition table and is well documented.

=== What partition type should I use for ReiserFS? ===

[http://www.win.tue.nl/~aeb/partitions/partition_types.html Linux native filesystem] (83)

=== Can I use 32GB+ IDE hard drives with ReiserFS? ===

Yes, if you use Linux kernel 2.4 and up.

=== What about resizing ReiserFS? ===

This can be done with [[resize_reiserfs]].
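A hedged sketch of how a resize with [[resize_reiserfs]] typically proceeds on top of LVM (device names and sizes are illustrative, not from this FAQ; whether growing works while mounted depends on your kernel, so when in doubt unmount first). The essential rule: grow the volume before the filesystem, and shrink the filesystem before the volume.

```shell
# Growing: enlarge the LVM volume first, then let resize_reiserfs
# expand the filesystem to fill the whole device.
lvextend -L +2G /dev/vg0/data
resize_reiserfs /dev/vg0/data

# Shrinking: unmount, shrink the filesystem first, then the volume.
umount /mnt/data
resize_reiserfs -s -2G /dev/vg0/data
lvreduce -L -2G /dev/vg0/data
```

Getting the shrink order wrong (volume before filesystem) destroys data, so double-check the sizes before running anything like this.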
=== What should I put into the fifth (aka dump, fs_freq) and the sixth (aka pass, fs_passno) fields of /etc/fstab for ReiserFS filesystems? ===

Put in <tt>0 0</tt>, e.g.:

 /dev/sda3 /var reiserfs notail,nodev,nosuid,noexec <font color="red">0 0</font>

=== Why are ReiserFS filesystems not fscked on reboot after a crash? ===

Because [[ReiserFS]] provides journaling of metadata. After a crash, the consistency of a filesystem is restored by replaying the transaction log.

=== Can I interactively repair a filesystem that was corrupted? ===

This is done with [[reiserfsck]].

=== Can I use <tt>dump(8)</tt> and <tt>restore(8)</tt> with ReiserFS? Any caveats? ===

No. <tt>dump(8)</tt> uses knowledge of the internal structure of ext2 and works together with restore, which also uses ext2-specific knowledge, to back up ext2 files. dump and restore are specific to ext2 and will not work with [[ReiserFS]]. To back up ReiserFS files use <tt>tar(1)</tt>, which is universal and can be applied to almost any reasonable Linux filesystem. It is well known among system administrators that <tt>dump(8)</tt> is more complete than Unix tar, and that there is quite a list of things that Unix tar will fail to properly back up. This is not true of GNU tar, which is quite complete. Basically, the only real disadvantage of GNU tar compared to <tt>dump(8)</tt> is speed. Unfortunately, because it shares its name with Unix <tt>tar(1)</tt>, people are reluctant to believe this. (Yes, GNU tar has incremental backups, etc.) We will performance-optimize ReiserFS backups for you (and the rest of the world) for $30K, which is not a lot if you are a large site spending a few hundred thousand on equipment for backups.

=== Does ReiserFS support snapshots? ===

No, but you can create [[ReiserFS]] on top of an [http://sourceware.org/lvm2/ LVM] logical volume and use LVM's snapshot capabilities.

=== Can I check reiserfs filesystems for errors without unmounting them?
=== [[reiserfsck]] in checking mode can be run on filesystems mounted read-only. There is no official way to fix mounted filesystems, though: you MUST completely unmount your filesystem in order to have it fixed. If you have LVM, you can check the consistency of filesystems mounted read-write using a script contributed by Andreas Dilger.

=== What ReiserFS mount options should I use to get the performance winner for a mail server? ===

[http://archives.neohapsis.com/archives/postfix/2001-03/1148.html Craig Sanders answered] in detail: By the time I got around to running <tt>bonnie</tt>, the <tt>postmark</tt> and <tt>postal</tt> benchmarks had convinced me that <tt>notail</tt> was essential.

Host system:
* Debian GNU/Linux (of course :)
* Linux kernel 2.4.2 with the latest 20010305 ReiserFS patch
* dual P3-866 (256K cache)
* 512MB RAM
* [http://www.adaptec.com/en-US/support/scsi/u160/ASC-19160/ Adaptec 19160] SCSI controller

External drive box:
* [http://www.domex.com.tw/support/product/8230u.htm Domex 8230u] RAID controller, 32MB battery-backed cache
* 6 x 18GB IBM [http://www.hitachigst.com/tech/techlib.nsf/techdocs/85256AB8006A31E587256A78005A3610/$file/ddys_sp21.PDF DDYS-T18350M] drives

For this particular hardware, [[ReiserFS]] with <tt>notail</tt> on RAID5 was the clear performance winner for a mail server with lots of synced random I/O.

=== Does using ReiserFS mean I can just press the power-off button without running <tt>/sbin/shutdown</tt>? Does it mean there is no risk of data loss? ===

No, definitely not. As of now, [[ReiserFS]] only provides metadata journaling - that is, it records which files have been created or opened, whether they have had their size changed, or where they have been relocated. It guarantees that the structure of the internal ReiserFS tree will be correct, thereby allowing you to start back up after an unclean shutdown without having to run fsck on all the files that have not been changed.
Data in files that were being used at the time of the crash could have been corrupted. This is usual for most filesystems. Data-journaling filesystems guarantee that no garbage will be written into a file, but they don't guarantee that a file update will be completed. (Only [[Reiser4]] guarantees that filesystem operations are performed as atomic operations, and provides atomic transaction functionality.) [[ReiserFS]] guarantees neither that the file contents themselves are uncorrupted nor that no data is lost. Moreover, even if all of your system is on ReiserFS, many system components (like daemons, database managers, etc.) require the shutdown procedure for proper functioning. However, there is a [ftp://ftp.suse.com/pub/people/mason/patches/data-logging separate implementation of data logging] (dead) that will [http://marc.info/?l=reiserfs-devel&m=103472026011689&w=2 soon] go into the mainstream kernel.

=== How does ReiserFS support bad block handling? ===

This is covered [[FAQ/bad-block-handling|here]].

=== I have a motherboard with VIA MVP3 chipset and experience ReiserFS problems. ===

[mailto:woster73@yahoo.com William Oster] answers: If you are using a motherboard with a [http://www.via.com.tw/en/products/apollo/mvp3.jsp VIA MVP3] chipset, you may have [[ReiserFS]] problems caused by the way your kernel is configured for the so-called [http://lxr.linux.no/linux+v2.6.30/drivers/pci/quirks.c PCI quirks]. My experience is with kernels 2.2.18 and 2.2.19, but it may affect the 2.4.x series too if you are using the MVP3 chipset (popular in socket 7 motherboards, such as those used by the AMD K6 and classic Pentium). I've confirmed this problem with several motherboards using the VIA MVP3 chipset, ReiserFS 3.5.29 to 3.5.32, and [http://lxr.linux.no/linux+v2.6.30/Documentation/scsi/ncr53c8xx.txt NCR 53c8xx SCSI]. But please note: it probably affects '''any controller which uses DMA and PCI bus mastering'''.
Problems which I was inclined to attribute to ReiserFS were actually problems with this kernel [mis]configuration. If you fit this profile, '''DO NOT''' enable the <tt>CONFIG_PCI_QUIRKS</tt> configuration option in the <tt>/usr/src/linux/.config</tt> file. Although the Linux documentation suggests that this option can be enabled if in doubt, '''DO NOT''' enable it. It was never intended for the VIA MVP3 chipset anyway. It affects the way DMA is handled, and in combination with ReiserFS (and possibly NCR SCSI) it can cause random disk corruption which eventually results in ReiserFS and/or SCSI errors. Evidently ReiserFS exercises the DMA and SCSI bus very thoroughly; the problems seem less likely under the ext2 filesystem. Check your <tt>/usr/src/linux/.config</tt> file. You are safe from this problem if you find this line:

 # CONFIG_PCI_QUIRKS is not set

Any other setting could be dangerous to MVP3 chipset ReiserFS users, especially when using PCI bus-mastering controllers such as the NCR 53c8xx series. Reconfigure your kernel to disable the "PCI quirks" option, then <tt>make dep</tt>, rebuild, and reinstall.

=== I am having extensive problems using ReiserFS; it seems to have bugs all over the place. I'm not compiling with a [[#I am using RedHat 7.0 with gcc 2.96; why does ReiserFS seem unstable with it?|buggy compiler]]. What is happening? How can this be stable? ===

You have hardware problems. Really, you do. Even if the bugs don't show up with ext2, you have hardware problems. (See [[#Why_do_I_get_a_signal_11_when_compiling_the_kernel_using_ReiserFS_and_not_ext2?|the signal 11 question]].) Most SuSE users use ReiserFS. Obscure bugs probably still exist; but if you find bugs as easily as when using Windows, you have bad RAM, a bad CPU, a bad cable, bad cooling, a [[#I have a motherboard with VIA MVP3 chipset and experience ReiserFS problems.|VIA chipset with PCI quirks turned on]], or other hardware or software-layer bugs. ReiserFS is stable.
You can be sure that if the bugs are encountered easily and commonly with normal usage patterns, it is not us. This does not mean that the next release won't somehow break something, though :-/ At the time of writing, real bug reports are outnumbered 10 to 1 by hardware bugs that trigger error messages. We are working on making our error messages better at catching hardware bugs and identifying them as such. There is only so far we can go in runtime consistency checking, though, without serious speed reductions. We don't release software unless it goes through extensive testing; so if you don't think that our testing could have missed the bug, it is probably hardware.

=== How can I put a label (like that allowed by the <tt>-L</tt> option of <tt>mkfs.ext2</tt>) on a ReiserFS instance? ===

Currently, this feature is only implemented for the [[ReiserFS]] v3.6 disk format. Adding it to the v3.5 disk format would break the existing format, and there is not enough free space in the superblock. You can set a label (and UUID) with a recent [[reiserfsprogs]] package on a [[ReiserFS]] v3.6 filesystem using the <tt>-l</tt> switch (<tt>-u</tt> for UUID) to the [[reiserfstune]] (for existing partitions) or [[mkreiserfs]] (for partitions being created) commands. Support for labels and UUIDs was integrated into [[reiserfsprogs]] starting from version 3.x.1a.

=== Why, when I'm working on files (i.e. having open files) on my laptop, does ReiserFS access the disk every 5 seconds? This effectively prevents the disk from spinning down, i.e. APM modes from taking over, even when I'm not writing anything. ===

[mailto:bgraveland@hyperchip.com Brent Graveland] answers: It's the [http://kerneltrap.org/node/14148 atime] update. Every time you run <tt>sync(1)</tt>, the sync program's <tt>atime</tt> is updated. The next <tt>sync()</tt> writes this <tt>atime</tt> update, then <tt>sync(1)</tt>'s <tt>atime</tt> gets updated again, and so on.

=== RedHat does not unmount <tt>/</tt> (<tt>/dev/root</tt>) with ReiserFS on halt. How to fix it?
=== RedHat users kindly provided these patches (not tested by us):
* [[FAQ/rc.sysinit.patch|rc.sysinit.patch]]
* [[FAQ/halt.patch|halt.patch]]

Note that if you have [http://www.redhat.com/docs/manuals/linux/RHL-7.2-Manual RedHat Linux 7.2] or later, you do not need these patches.

=== How do I run programs from the reiserfsprogs package on encrypted devices? ===

In order to access such encrypted entities, you need to use the [http://www.linux.org/docs/ldp/howto/Cryptoloop-HOWTO/loopdevice-setup.html losetup(8)] tool to bind your entity to a <tt>loop</tt> device.

=== Are there any recommendations ''pro'' or ''against'' particular hard drive manufacturers for use with ReiserFS? ===

No. Bad hard drives are not [[ReiserFS]]-specific but affect all filesystems, so there is basically no preference: the general '''the faster the drive and the lower the seek time, the better''' rule applies as always. On the other hand, almost every hard drive manufacturer has a '''widely known''' broken series of hard drives. The most recent example is [http://en.wikipedia.org/wiki/Deskstar_75GXP IBM's Deskstar] series, especially the DTLA models produced in Hungary in 2000-2001. These are [http://ask.slashdot.org/article.pl?sid=01/10/04/0050238 known to fail very often], to the point that you probably don't want to use them even if you already paid for them. Other Deskstar drives also seem to be a poor choice. IBM released a note that Deskstar drives should not run for more than 8 hours/day on average. These drives are also known to be very sensitive to temperature conditions and to fail on overheating. There is a [http://web.archive.org/web/20060315210819/http://www.ibmdeskstar75gxplitigation.com/ class action lawsuit against IBM] over that drive series.

=== I am using RedHat 7.0 with gcc 2.96; why does ReiserFS seem unstable with it? ===

Use the most recent version of RedHat (gcc 2.96-85 or later, shipped with RedHat 7.2; 7.1 is also okay for ReiserFS).
The choice of an unstable, unreleased version of gcc 2.96 by RedHat as the default gcc was a Slashdot controversy. gcc 2.96 on RedHat 7.0 was unstable, and ReiserFS was one of the things that would fail with it. There are two gcc versions: 2.96 and 2.96-85. 2.96-85 works for ReiserFS; the other (the one on RedHat 7.0) surely does not. Read the Linux kernel instructions about what compiler to use. The solution to code not working on broken compilers is the one RedHat has taken: fix the compiler. They fixed the compiler and thereby allowed the correctly compiled ReiserFS to work.

=== In my program I am using fsync(2) calls after each write to the file to guarantee the integrity of my file data, and this is very slow. How can I improve the performance? ===

Answer from Chris Mason: The main thing to remember is that fsyncs introduce a bunch of disk writes and force the FS to wait on the buffers. The key to keeping performance up is to make it easy for the FS to do as much as possible before the fsync call. So, if your application modifies 3 files and you want to make sure all 3 changes are safely on disk:

 write(file1)
 write(file2)
 write(file3)
 fsync(file1)
 fsync(file2)
 fsync(file3)

is much faster than:

 write(file1)
 fsync(file1)
 write(file2)
 fsync(file2)
 write(file3)
 fsync(file3)

It is also faster to write over existing bytes in a file than it is to append new bytes onto its end. When you overwrite existing bytes, you don't have to commit new metadata to disk on fsync(); the FS can just write the data blocks. This means fewer seeks. The more you write to a single file before calling fsync, the faster overall things will run:

 write(8k)
 fsync(file)

is much faster than:

 write(4k)
 fsync(file)
 write(4k)
 fsync(file)

Trying to optimize for those 3 things alone can make a huge performance difference overall.

Answer from Josh MacDonald: You have to understand that even using fsync() after every write() makes no guarantees.
If the system crashes during either the write or fsync operation, your data may be lost or corrupted. And suppose the fsync() does complete: does your application keep its data in multiple files? If so, and you need to write() to multiple files as part of a transaction, you have even greater problems. The only safe and easy way to implement some kind of transaction with the traditional filesystem guarantees is to use rename():

 1. Keep all of your data in a single file.
 2. Periodically write a complete copy of your database to a temporary file.
 3. Rename the temporary file to the original database name.

(Addition from Nikita Danilov: One can implement something like a phase tree at user level and use rename to atomically switch the root of the tree. This overcomes the "everything-in-one-file" limitation but adds the complexity of requiring crash recovery.)

Answer from Nikita Danilov: Stop your development for now and wait until the reiser4 filesystem is released; it will have a transaction API exported to userspace, and that transaction API would solve all of your problems.

== Our program needs to access a lot of working files. What is the recommended way to organize files to get the best results out of ReiserFS? Should all the files be placed in a single directory, or should the files be spread across a directory tree to limit the number of files per directory? Can you also summarize the relevant caching and locking effects? ==

Traditional filesystems typically have poor performance when there are many files in a single directory, but not [[ReiserFS]]. These other filesystems perform poorly because they use a linear search algorithm to find and replace entries in a directory. This means that the filesystem must scan, on average, half the blocks of a directory for every access. Typically, applications are required to work around this problem by manually structuring a tree of directories, allowing each individual directory to remain limited in size.
For example, see how the Squid web proxy stores a large collection of files. ReiserFS does not have this problem because it uses an internal tree to store all directories and file metadata. Directory operations remain efficient even for very large directories, so you can write your application free from this performance concern. However, there are several issues that complicate this matter: namely locking and locality. The Linux VFS currently imposes locking restrictions that serialize many operations on directories, so if concurrent processes or threads will access the collection of files then you may be better off using multiple directories. [[Reiser4]] will improve upon this restriction, although it is still under development. ReiserFS attempts to store all of the files in a directory, along with the directory itself, in nearby locations on disk. An application may exploit this spatial locality if it can predict which files will be accessed with temporal locality. You may be better off using multiple directories to store your files if you can predict that many files within a directory will be accessed at the same time. To summarize, ReiserFS supports efficient access to large directories, and most traditional filesystems do not. However, locking and locality issues may guide your decision to use manually structured directory trees instead, at least until ReiserFS exports control over packing locality to users and improves its locking.

[[category:ReiserFS]] [[category:Reiser4]]

This FAQ is very [[ReiserFS]] centric and often a bit dated. The [[Reiser4]] filesystem is mentioned as ''upcoming''. Be sure to search the [[mailinglists|mailing list archives]] and help update this FAQ - Thanks!
__TOC__

=== What are the specs for ReiserFS: maximum number of files, of files a directory can have, of sub-dirs in a dir, of links to a file, maximum file size, maximum filesystem size, etc.? ===

Specifications for [[ReiserFS]]:

{|cellpadding="5" cellspacing="0" border="1"
| '''property''' || '''3.5''' || '''3.6'''
|-
| max number of files || 2<sup>32</sup>-3 => 4 Gi-3 || 2<sup>32</sup>-3 => 4 Gi-3
|-
| max number of files a dir can have || 518701895 (but in practice this value is limited by the hash function; the r5 hash allows about 1 200 000 file names without collisions) || 2<sup>32</sup>-4 => 4 Gi-4 (but in practice this value is limited by the hash function; the r5 hash allows about 1 200 000 file names without collisions)
|-
| max file size || 2<sup>31</sup>-1 => 2 Gi-1 || 2<sup>60</sup> bytes => 1 Ei, but the page cache limits this to 8 Ti on architectures with a 32-bit int
|-
| max number of links to a file || 2<sup>16</sup> => 64 Ki || 2<sup>32</sup> => 4 Gi
|-
| max filesystem size || 2<sup>32</sup> (4K) blocks => 16 Ti || 2<sup>32</sup> (4K) blocks => 16 Ti
|}

ReiserFS does '''meta-data journaling''', enabling fast crash recovery without the expense of full '''data journaling'''. There [ftp://ftp.suse.com/pub/people/mason/patches/intermezzo-alpha/ were] separate [http://marc.info/?l=reiserfs-devel&m=100895310422415&w=2 patches from Chris Mason] that implement full data journaling for ReiserFS on Linux 2.4.16.

'''Note''': Full data journaling is considered by many to be a good way to achieve file data integrity across system crashes. However, although file data may appear consistent from the kernel's point of view, since there is no API exported to userspace to control transactions, we may end up in a situation where the application makes two write requests (as part of one logical transaction) but only one of them gets journaled before the system crashes. From the application's point of view, we may then end up with inconsistent data in the file. Such issues should be addressed with the upcoming [[Reiser4]].
Such an API will be exported to userspace, and all programs that need transactions will be able to use it.

=== Mount fails after reiserfsck --rebuild-tree failure ===

When [[reiserfsck]] --rebuild-tree is run, the first thing it does is set the root inode value to -1. This makes the filesystem unmountable. (So if [[reiserfsck]] fails later on because the filesystem contains serious errors, the filesystem cannot be mounted.) Therefore, once [[reiserfsck]] --rebuild-tree has failed for one of your filesystems, mounting of that partition is disabled. To correct the error, first check that you have the latest [[reiserfsprogs]] package installed. If that fails, please send a bug report to our [[mailinglists|mailing list]] and be ready to answer our questions.

=== Why is the execution time for a <tt>find . -type f | xargs cat</tt> command much longer when using ReiserFS than for the same command when using ext2? ===

This effect is observed if the measured file set was produced by untarring an archive that was not created from a ReiserFS partition (or by copying files from a non-ReiserFS partition, or by running a program that writes a bunch of files in some order). This is because the <tt>readdir()</tt> operation performed on the ReiserFS partition returns filenames not in the original write order but in some hash order (dependent on the hash function used). Thus, when reading the files' contents, the hard drive heads must move when going from one file to another.
=== Is quota-support built-in in the vanilla 2.4 kernels for ReiserFS? === No, quota support for Linux kernels for the 2.4 branch are bundled separately and were available once at [ftp://ftp.suse.com/pub/people/mason/patches/reiserfs/quota-2.4/ at SuSE] (gone) by Chris Mason, they are still [http://gd.tuwien.ac.at/utils/fs/reiserfs/quota-patches/ mirrored at TU-Wien]. The reason these patches were not included into 2.4 kernel branch is because they implement new quota format and need new quota code too, which is too big of a change for 2.4 series of kernels. Various Linux distributions vendors (ie [http://www.suse.com SuSE]) do ship reiserfs-quota enabled kernels, though. === I am getting some errors in my kernel logs, that I do not know how to interpret === Messages like: vs-13070: reiserfs_read_inode2: i/o failure occurred trying to find stat data of [1718696 1718710 0x0 SD]" zam-7001: io error in reiserfs_find_entry most likely accompanied with samples below are definite signs of harddisk problems (bad sectors): hda: dma_intr: status=0x51 { DriveReady SeekComplete Error } hda: dma_intr: error=0x40 { UncorrectableError }, LBAsect=6599945, sector=4286584 end_request: I/O error, dev 03:03 (hda), sector 4286584 or scsi0: ERROR on channel 0, id 1, lun 0, CDB: Read (10) 00 00 01 ee 60 00 00 08 00 Current sd 08:00: sense key Medium Error or I/O error: dev 08:21, sector 65704 Messages about <tt>"access beyond end of device"</tt> may have lots of different reasons starting from not rebooting after fdisk requested it, unfinished resizings, data corruptions. The following messages mean you have a noisy IDE cable, or it is just too low quality for choosen UDMA mode. 
Try to replace the cable with better one, or choose slower UDMA mode: hda: dma_intr: status=0x51 { DriveReady SeekComplete Error } hda: dma_intr: error=0x84 { DriveStatusError BadCRC } hda: dma_intr: status=0x51 { DriveReady SeekComplete Error } hda: dma_intr: error=0x84 { DriveStatusError BadCRC } If you see any message from [[ReiserFS]] that you cannot interpret and there is nothing similar to messages above around it, [[mailinglists|mail the message to us]] and we will explain it to you. === Will ReiserFS implement streams, extended attributes, etc.? === [[FAQ/streams|Here]] is the one page answer. === Reiserfs appears to be very slow while the RAID is resyncing. Mounting takes several minutes. Once mounted, an <tt>ls(1)</tt> in the mounted directory hangs. Forever. Once the RAID is sync'ed, things appear to work pretty well. How that can be fixed? === First of all we have included a patch that helps mounting the drive faster into linux kernel since 2.4.19. You can grab the patch for earlier kernels [http://gd.tuwien.ac.at/utils/fs/reiserfs/reiserfs-for-2.5/2.5.4.pending/07-reiserfs-bitmap-journal-read-ahead.diff here]. Also RAID drivers have '''minimal guaranteed''' and '''maximal possible''' RAID rebuild bandwidth usage. These valueas are controlled through <tt>/proc/sys/dev/raid/speed_limit_min</tt> and <tt>/proc/sys/dev/raid/speed_limit_max</tt> sysctl variables (values are in 100 KiB/s). It seems that RAID logic cannot always understand if the disk sysbsystem busy or not at a given time. When it thinks disk subsystem is idle, it tries to rebuild the raid array at <tt>speed_limit_max</tt> speed which defaults to 100 MB per second. Decrease this value to something more suitable (a bit of experimentation might be needed). === I get attempt to read past the end of the partition error messages; is ReiserFS corrupted? === You changed your partition sizes, and then before rebooting ran [[mkreiserfs]]. 
The kernel does not change its belief about the partition sizes until reboot time. (This is fixable, but nobody has fixed it as of Dec. 2001.) [[mkreiserfs]] created a filesystem that has the wrong notion of how large its partition is. The filesystem's notion of the partition boundaries will last past reboot even though the kernel's notion will change. So yes, it is corrupted. Some other kinds of metadata breakage can also lead to such messages. === Can I use VMware with ReiserFS? === VMware was tested on [http://www.suse.com/ SuSE Linux] with a [http://support.microsoft.com/gp/lifean18 Windows98] guest OS on a [[ReiserFS]] partition. There is one trick at the beginning: the following line was added to the VMware config file:
 host.FSSupportLocking1 = 0x52654973 # (0x52654973 == *(u32 *) "ReIs")
Thanks to [mailto:gkade@bigbrother.net Gregory K. Ade] for this hint. === How do I install Debian potato with ReiserFS as root partition? === [[FAQ/potato_part|Here]] are instructions by [mailto:LeBlanc@mcc.ac.uk Dr. A.V. Le Blanc]. === Starting with Linux kernel v2.4.21 I cannot mount my FS anymore. Why? === Special sanity checks were added to the kernel to prohibit mounting of filesystems that are bigger than the underlying block device. If you now see this message on mount:
 Filesystem on xx:yy cannot be mounted because it is bigger than the device
you may need to run fsck or increase the size of your LVM partition. Or maybe you forgot to reboot after fdisk when it told you to. If you do not use LVM, that usually means you need to run <tt>[[reiserfsck]] --rebuild-sb</tt> on your filesystem and agree to change its default size to the proposed one. === Is it ok to use ReiserFS on a small storage device, e.g. a 16MB NAND flash block device? === [[FAQ/small_blocks|Here]] are instructions. === How do I change root from ext2 to ReiserFS without loss of data? === [[FAQ/change_fs|Here]] are instructions.
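As an aside on the VMware workaround above: the magic value is simply the four ASCII bytes of "ReIs" read as one 32-bit integer. A quick sketch (purely illustrative; it only checks the arithmetic, using the big-endian byte order that the quoted comment assumes):

```python
import struct

# Interpret the four ASCII bytes "ReIs" as a single 32-bit integer.
# Big-endian order reproduces the constant quoted in the VMware hint:
# 'R' = 0x52, 'e' = 0x65, 'I' = 0x49, 's' = 0x73.
magic = struct.unpack(">I", b"ReIs")[0]
print(hex(magic))  # -> 0x52654973
```

(On a little-endian x86 host, <tt>*(u32 *) "ReIs"</tt> would actually evaluate to 0x73496552; the constant in the config line corresponds to the big-endian reading of the bytes.)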
=== <tt>mount: /dev/hda5 has wrong major or minor number</tt> - what does that mean? === The kernel does not know anything about [[ReiserFS]]; it is neither compiled in nor available as a module. === Will it be possible to read/write ReiserFS partitions created now with future versions of ReiserFS? === Yes. [[ReiserFS]]-3.6.x (Linux-2.4.x) works with both the old (3.5) and the new (3.6) formats. ReiserFS-3.5.x (Linux-2.2.x) can only work with the old (3.5) disk format. There is no way to convert the new (3.6) disk format to the old (3.5), but the old (3.5) format can be converted to the new one (3.6) with the <tt>-o conv</tt> [[mount|mount option]]. === The ReiserFS module doesn't insert properly - why? === After applying the patch, ''recompile'' the whole kernel including the modules target, reboot, then try to insert the module. === Can I use ReiserFS with software RAID? === Yes, for all RAID levels using any Linux >= 2.4.1, but '''DO NOT''' use RAID with Linux 2.2.x. Our journaling and their RAID code step on each other in the buffering code. Also, mirroring is '''not''' safe in the 2.2.x kernels because online mirror rebuilds in 2.2.x break the write-ordering requirements for the log. If you crash in the middle of an online rebuild, your meta-data may be corrupted. The only RAID level that is safe with [[ReiserFS]] in the 2.2.x kernels is the striping/concatenation level. === Can I use ReiserFS with 3ware RAID? === Yes, but you need to use Linux 2.2.19 or later for reasons other than [[ReiserFS]]. Also, if you encounter problems, you should suspect that it might not be ReiserFS that has the bug. 3ware published [http://web.archive.org/web/20030415160519/http://www.3ware.com/support/raid5techbulletin.shtml special instructions] (archive.org). === Why do things freeze on my IDE hard drive for annoying amounts of time? === Because when large writes are scheduled all at once, reads can starve.
A fix for this is evolving; the later your ReiserFS patch, the better we handle this. === <tt>du(1)</tt> says ReiserFS makes space efficiency worse. === Use <tt>df(1)</tt>, not <tt>du(1)</tt>, or use the ''raw'' option for <tt>du(1)</tt> if it is supported. Summed-up <tt>st_blocks</tt> is less accurate than <tt>st_size</tt> for [[ReiserFS]] because we pack tails, and <tt>st_blocks</tt> rounds numbers up. === <tt>mkreiserfs(8)</tt> fails after repartitioning === The kernel requires you to reboot after repartitioning (for all filesystems). We intend to fix that. === Performance is poor, and my disk at 96% full still has free space. === Once a disk drive gets more than 85% full, performance starts to suffer unless you use a repacker (which isn't implemented yet). You can probably get away with 92%, but if performance is valued you are making a mistake to keep it any fuller. This is true for almost all filesystems. [[ReiserFS]], because it packs tails together, packs more data into a given percentage used, but it is still subject to the rules for the maximum recommended percentage used. If you create the whole disk with one copy and then mount it read-only, you can fully pack it without problems. Please be sure that you copy it from (or <tt>tar</tt> it from) a ReiserFS partition so that files are created in ReiserFS <tt>readdir()</tt> order, as this will improve performance. === Why do I get a signal 11 when compiling the kernel using ReiserFS and not ext2? === Your CPU is overheating and/or you have [http://www.bitwizard.nl/sig11/ bad RAM]. === But it doesn't happen with ext2? === ext2 uses less heat-sensitive gates in the CPU :-) Seriously, ext2 and [[ReiserFS]] contain random differences, and overheating and bad RAM have random sensitivities. ([http://www.bitwizard.nl/sig11/ Signal 11] is not due to ReiserFS. One user had a cable blocking the fan; it did not affect ext2, but it wasn't until he fixed the cable-fan problem that ReiserFS worked.)
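The <tt>st_blocks</tt> vs. <tt>st_size</tt> distinction behind the <tt>du(1)</tt> answer above can be seen with a short script (illustrative only; the file name and sizes are arbitrary):

```python
import os
import tempfile

# Create a tiny file: st_size is the exact byte count, while st_blocks
# counts 512-byte allocation units and is rounded up by the filesystem.
fd, path = tempfile.mkstemp()
os.write(fd, b"x" * 10)   # 10 bytes of data
os.close(fd)

st = os.stat(path)
apparent = st.st_size           # exact length: what st_size reports
allocated = st.st_blocks * 512  # allocated space: what du(1) sums up
print(apparent)  # -> 10
# 'allocated' is typically a whole block (e.g. 4096 bytes), which is why
# summing per-file st_blocks overstates space usage on a tail-packing FS.
os.unlink(path)
```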
=== Can I use ReiserFS on architectures other than i386? === Yes; starting from Linux [http://kernel.org/pub/linux/kernel/v2.4/ChangeLog-2.4.13 kernel 2.4.13], ReiserFS can run on any Linux-supported arch. === I need a program which will help me in rebuilding/recreating my partition table. === [http://brzitwa.de/mb/gpart/ gpart] is a utility that handles ext2, FAT, Linux swap, HPFS, NTFS, FreeBSD and Solaris/x86 disklabels, Minix, and ReiserFS. It prints a proposed content for the primary partition table and is well documented. === What partition type should I use for ReiserFS? === [http://www.win.tue.nl/~aeb/partitions/partition_types.html Linux native filesystem] (83). === Can I use 32GB+ IDE hard drives with ReiserFS? === Yes, if you use Linux kernel 2.4 and up. === What about resizing ReiserFS? === This can be done with [[resize_reiserfs]]. === What should I put into the fifth (aka dump, fs_freq) and the sixth (aka pass, fs_passno) fields of /etc/fstab for ReiserFS filesystems? === You'd put in <tt>"0 0"</tt>, e.g.:
 /dev/sda3 /var reiserfs notail,nodev,nosuid,noexec <font color="red">0 0</font>
=== Why are ReiserFS filesystems not fscked on reboot after a crash? === Because [[ReiserFS]] provides journaling of meta-data. After a crash, the consistency of a filesystem is restored by replaying the transaction log. === Can I interactively repair a filesystem that was corrupted? === This is done with [[reiserfsck]]. === Can I use <tt>dump(8)</tt> and <tt>restore(8)</tt> with ReiserFS? Any caveats? === No. <tt>dump(8)</tt> uses knowledge of the internal structure of ext2 and works together with restore, which also uses ext2-specific knowledge, to back up ext2 files. dump and restore are specific to ext2 and will not work with [[ReiserFS]]. To back up ReiserFS files use <tt>tar(1)</tt>, which is universal and can be applied to almost any reasonable Linux filesystem.
It is well known among system administrators that <tt>dump(8)</tt> is more complete than Unix tar, and that there is quite a list of things that Unix tar will fail to back up properly. This is not true of GNU tar, which is quite complete. Basically, the only real disadvantage of GNU tar compared to <tt>dump(8)</tt> is speed. Unfortunately, because it shares the same name as Unix <tt>tar(1)</tt>, people are reluctant to believe this. (Yes, GNU tar has incremental backups, etc.) We will performance-optimize ReiserFS backups for you (and the rest of the world) for $30K, which is not a lot if you are a large site spending a few hundred thousand on equipment for backups. === Does ReiserFS support snapshots? === No, but you can create [[ReiserFS]] on top of an [http://sourceware.org/lvm2/ LVM] logical volume and use LVM's snapshot capabilities. === Can I check ReiserFS filesystems for errors without unmounting them? === [[reiserfsck]] in checking mode may be run over filesystems mounted read-only. There is no official way to fix mounted filesystems, though; you MUST completely unmount your filesystem in order to have it fixed. If you have LVM, you can check the consistency of filesystems mounted read-write; a script for this was contributed by Andreas Dilger. === What ReiserFS mount options should I use to get the best performance for a mail server? === [http://archives.neohapsis.com/archives/postfix/2001-03/1148.html Craig Sanders answered] in detail: By the time I got around to running <tt>bonnie</tt>, the <tt>postmark</tt> and <tt>postal</tt> benchmarks had convinced me that <tt>notail</tt> was essential. Host system: * Debian GNU/Linux (of course :) * Linux kernel 2.4.2 with latest 20010305 ReiserFS patch * dual P3-866 (256K cache) * 512MB RAM * [http://www.adaptec.com/en-US/support/scsi/u160/ASC-19160/ Adaptec 19160] SCSI Controller External drive box: * [http://www.domex.com.tw/support/product/8230u.htm Domex 8230u] RAID controller, 32MB battery-backed cache.
* 6 x 18GB IBM [http://www.hitachigst.com/tech/techlib.nsf/techdocs/85256AB8006A31E587256A78005A3610/$file/ddys_sp21.PDF DDYS-T18350M] drives For this particular hardware, [[ReiserFS]]/notail on RAID5 was the clear performance winner for a mail server with lots of synced random I/O. === Does using ReiserFS mean I can just press the power-off button without running <tt>/sbin/shutdown</tt>? Does it mean there is no risk of data loss? === No, definitely not. As of now, [[ReiserFS]] only provides meta-data journaling - that is, it records which files have been created or opened, whether they have had their size changed, or where they have been relocated. It guarantees that the structure of the internal ReiserFS tree will be correct, thereby allowing you to start back up after an unclean shutdown without having to run fsck on all the files that have not been changed. Data in files that were being used at the time of the crash could have been corrupted. This is usual for most filesystems. Data-journaling filesystems guarantee that no garbage will be written into a file, but they do not guarantee that a file update will be completed. (Only [[Reiser4]] guarantees that filesystem operations are performed as atomic operations, and provides atomic transaction functionality.) [[ReiserFS]] does not guarantee that the file contents themselves are uncorrupted, nor that no data is lost. Moreover, even if all of your system is on ReiserFS, many system components (like daemons, database managers, etc.) require a proper shutdown procedure to function correctly. However, there is a [ftp://ftp.suse.com/pub/people/mason/patches/data-logging separate implementation of data logging] (dead) that will [http://marc.info/?l=reiserfs-devel&m=103472026011689&w=2 soon] go into the mainstream kernel. === How does ReiserFS support bad block handling? === This is covered [[FAQ/bad-block-handling|here]]. === I have a motherboard with VIA MVP3 chipset and experience ReiserFS problems.
=== [mailto:woster73@yahoo.com William Oster] answers: If you are using a motherboard with a [http://www.via.com.tw/en/products/apollo/mvp3.jsp VIA MVP3] chipset, you may have [[ReiserFS]] problems caused by the way your kernel is configured for the so-called [http://lxr.linux.no/linux+v2.6.30/drivers/pci/quirks.c PCI quirks]. My experience is with kernels 2.2.18 and 2.2.19, but it may affect the 2.4.x series too if you are using the MVP3 chipset (popular in Socket 7 motherboards, such as those used by the AMD K6 and classic Pentium). I have confirmed this problem with several motherboards using the VIA MVP3 chipset, ReiserFS 3.5.29 to 3.5.32, and [http://lxr.linux.no/linux+v2.6.30/Documentation/scsi/ncr53c8xx.txt NCR 53c8xx SCSI]. But please note: it probably affects '''any controller which uses DMA and PCI bus mastering'''. Problems which I was inclined to attribute to ReiserFS were actually problems with this kernel (mis)configuration. If you fit this profile, '''DO NOT''' enable the <tt>CONFIG_PCI_QUIRKS</tt> configuration option in the <tt>/usr/src/linux/.config</tt> file. Although the Linux documentation suggests that this option can be enabled if in doubt, '''DO NOT''' enable it. It was never intended for the VIA MVP3 chipset anyway. It affects the way DMA is handled, and the combination with ReiserFS (and possibly NCR SCSI) can cause random disk corruption which will eventually result in ReiserFS and/or SCSI errors. Evidently ReiserFS exercises the DMA and SCSI bus very thoroughly; the problems seem to be less likely under the ext2 filesystem. Check your <tt>/usr/src/linux/.config</tt> file. You are safe from this problem if you find this line:
 # CONFIG_PCI_QUIRKS is not set
Any other setting could be dangerous to MVP3-chipset ReiserFS users, especially when using PCI bus-mastering controllers such as the NCR 53c8xx series. Re-configure your kernel to disable the "PCI quirks" option, then <tt>make dep</tt>, rebuild, and reinstall.
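The <tt>.config</tt> check described above is easy to script. A small sketch (a hypothetical helper, not part of any ReiserFS tool; the option name and the "is not set" line are as quoted in the answer):

```python
def pci_quirks_disabled(config_text):
    """Return True if CONFIG_PCI_QUIRKS is absent or explicitly unset
    in the given kernel .config contents."""
    for line in config_text.splitlines():
        line = line.strip()
        if line == "# CONFIG_PCI_QUIRKS is not set":
            return True   # explicitly disabled: safe
        if line.startswith("CONFIG_PCI_QUIRKS="):
            return False  # enabled: risky on MVP3 boards
    return True           # option not present at all

# The safe configuration quoted in the answer above:
print(pci_quirks_disabled("# CONFIG_PCI_QUIRKS is not set"))  # -> True
print(pci_quirks_disabled("CONFIG_PCI_QUIRKS=y"))             # -> False
```

In practice you would feed it the contents of <tt>/usr/src/linux/.config</tt>.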
=== I am having extensive problems using ReiserFS; it seems to have bugs all over the place. I'm not compiling with a [[#I am using RedHat 7.0 with gcc 2.96; why does ReiserFS seem unstable with it?|buggy compiler]]. What is happening? How can this be stable? === You have hardware problems. Really, you do. Even if the bugs don't show up with ext2, you have hardware problems. (See [[#Why_do_I_get_a_signal_11_when_compiling_the_kernel_using_ReiserFS_and_not_ext2?|the signal 11 question]].) Most SuSE users use ReiserFS. Obscure bugs probably still exist; but if you find bugs as easily as when using Windows, you have bad RAM, a bad CPU, a bad cable, bad cooling, a [[#I have a motherboard with VIA MVP3 chipset and experience ReiserFS problems.|VIA chipset with PCI quirks turned on]], or other hardware or software-layer bugs. ReiserFS is stable. You can be sure that if bugs are encountered easily and commonly with normal usage patterns, it is not us. This does not mean that the next release won't somehow break something, though :-/ Real bug reports are, at the time of writing, outnumbered 10 to 1 by hardware bugs that trigger error messages. We are working on making our error messages better at catching hardware bugs and identifying them as such. There is only so far we can go, though, in runtime consistency checking without serious speed reductions. We don't release software unless it goes through extensive testing; so if you don't think that our testing could have missed the bug, it is probably hardware. === How can I put a label (like the one allowed by the <tt>-L</tt> option of <tt>mkfs.ext2</tt>) on a ReiserFS instance? === Currently, this feature is only implemented for the [[ReiserFS]] v3.6 disk format. Adding it to the v3.5 disk format would break the existing disk format, and there is not enough free space in the superblock.
You can set a label (and UUID) with a recent [[reiserfsprogs]] package on a [[ReiserFS]] v3.6 filesystem using the <tt>-l</tt> switch (<tt>-u</tt> for UUID) of [[reiserfstune]] (for existing partitions) or [[mkreiserfs]] (for partitions being created). Support for labels and UUIDs was integrated into [[reiserfsprogs]] starting from version 3.x.1a. === Why, when I'm working on files (i.e. having open files) on my laptop, does ReiserFS access the disk every 5 seconds? This effectively prevents the disk from spinning down, i.e. APM modes from taking over, even when I'm not writing anything. === [mailto:bgraveland@hyperchip.com Brent Graveland] answers: It's the [http://kerneltrap.org/node/14148 atime] update. Every time you run <tt>sync(1)</tt>, the sync program's <tt>atime</tt> is updated. The next <tt>sync()</tt> writes this <tt>atime</tt> update, and then <tt>sync(1)</tt>'s atime gets updated again. === RedHat does not unmount <tt>/</tt> (<tt>/dev/root</tt>) with ReiserFS on halt. How can that be fixed? === RedHat users kindly provided these patches (not tested by us): * [[FAQ/rc.sysinit.patch|rc.sysinit.patch]] * [[FAQ/halt.patch|halt.patch]] Note that if you have [http://www.redhat.com/docs/manuals/linux/RHL-7.2-Manual RedHat Linux 7.2] or later, you do not need these patches. === How do I run programs from the reiserfsprogs package on encrypted devices? === In order to access such encrypted entities, you need to use the [http://www.linux.org/docs/ldp/howto/Cryptoloop-HOWTO/loopdevice-setup.html losetup(8)] tool to bind your entity to a <tt>loop</tt> device. === Are there any recommendations for or against particular hard drive manufacturers for use with ReiserFS? === There is basically no preference; the general rule "the faster the drive and the shorter the seek time, the better" applies as always. On the other hand, almost every hard drive manufacturer has a "widely known" broken series of hard drives.
The most recent example is IBM's "Deskstar" series of disks, especially the DTLA models produced in Hungary in 2000-2001. These are known to fail very often, to the point that you probably don't want to use them even if you have already paid for them. Other Deskstar drives also seem to be a poor choice: IBM released a note that Deskstar drives should not run for more than 8 hours/day on average. These drives are also known to be very sensitive to temperature conditions and to fail on overheating. A class-action lawsuit against IBM over that drive series is in progress. === I am using RedHat 7.0 with gcc 2.96; why does ReiserFS seem unstable with it? === Use the most recent version of RedHat (gcc 2.96-85 or later with RedHat 7.2, although 7.1 is also okay for ReiserFS). The choice of an unstable, unreleased version of gcc 2.96 by RedHat as the default gcc was a Slashdot controversy. gcc 2.96 on RedHat 7.0 was unstable, and ReiserFS was one of the things that would fail with it. There are two gcc 2.96 builds: 2.96 and 2.96-85. 2.96-85 works for ReiserFS; the other (the one on RedHat 7.0) surely does not. Read the Linux kernel instructions about what compiler to use. The solution to code not working on broken compilers is the one RedHat has taken: fix the compiler. They fixed the compiler and thereby allowed the correctly compiled ReiserFS to work. === In my program I am using fsync(2) calls after each write to the file to guarantee the integrity of my file data, and this is very slow; how can I improve the performance? === Answer from Chris Mason: The main thing to remember is that fsyncs introduce a bunch of disk writes and force the FS to wait on the buffers. The key to keeping performance up is to make it easy for the FS to do as much as possible before the fsync call.
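Chris Mason's batching advice can be sketched as follows (a minimal illustration only, not code from the FAQ; the file names and payload are arbitrary):

```python
import os
import tempfile

def write_then_fsync(paths, payload=b"data"):
    """Issue all writes first, then all fsyncs: the ordering the answer
    recommends, which lets the FS do more work before each sync."""
    fds = [os.open(p, os.O_WRONLY | os.O_CREAT, 0o600) for p in paths]
    for fd in fds:        # phase 1: every write
        os.write(fd, payload)
    for fd in fds:        # phase 2: every fsync
        os.fsync(fd)
    for fd in fds:
        os.close(fd)

workdir = tempfile.mkdtemp()
files = [os.path.join(workdir, name) for name in ("f1", "f2", "f3")]
write_then_fsync(files)
print(all(os.path.getsize(p) == len(b"data") for p in files))  # -> True
```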
So, if your application modifies three files and you want to make sure all three changes are safely on disk:
 write(file1)
 write(file2)
 write(file3)
 fsync(file1)
 fsync(file2)
 fsync(file3)
is much faster than:
 write(file1)
 fsync(file1)
 write(file2)
 fsync(file2)
 write(file3)
 fsync(file3)
It is also faster to write over existing bytes in a file than to append new bytes onto its end. When you overwrite existing bytes, you don't have to commit new metadata to disk on fsync(); the FS can just write the data blocks. This means fewer seeks. The more you write to a single file before calling fsync, the faster overall things will run:
 write(8k)
 fsync(file)
is much faster than:
 write(4k)
 fsync(file)
 write(4k)
 fsync(file)
Trying to optimize for these three things alone can make a huge performance difference overall. Answer from Josh MacDonald: You have to understand that even using fsync() after every write() makes no guarantees. If the system crashes during either the write or the fsync operation, your data may be lost or corrupted. Suppose the fsync() does complete: does your application keep its data in multiple files? If so, and you need to write() to multiple files as part of a transaction, you have even greater problems. The only safe and easy way to implement some kind of transaction with the traditional filesystem guarantees is to use rename():
# Keep all of your data in a single file.
# Periodically write a complete copy of your database to a temporary file.
# Rename the temporary file to the original database name.
(Addition from Nikita Danilov: One can implement something like a phase tree at user level and use rename to atomically switch the root of the tree. This overcomes the "everything-in-one-file" limitation, but has the added complexity of requiring crash recovery.) Answer from Nikita Danilov: Stop your development for now and wait until the reiser4 filesystem is released; it will have a transaction API exported to userspace.
That transaction API would solve all of your problems. === Our program needs to access a lot of working files. What is the recommended way to organize files to get the best results out of ReiserFS? Should all the files be placed in a single directory, or should the files be spread across a directory tree to limit the number of files per directory? Can you also summarize the relevant caching and locking effects? === Traditional filesystems typically have poor performance when there are many files in a single directory, but not [[ReiserFS]]. Those filesystems perform poorly because they use a linear search algorithm to find and replace entries in a directory, which means the filesystem must scan, on average, half the blocks of a directory for every access. Typically, applications are required to work around this problem by manually structuring a tree of directories, allowing each individual directory to remain limited in size. For example, see how the Squid web proxy stores a large collection of files. ReiserFS does not have this problem because it uses an internal tree to store all directories and file metadata. Directory operations remain efficient even for very large directories, so you can write your application free from this performance concern. However, there are several issues that complicate this matter: namely locking and locality. The Linux VFS currently imposes locking restrictions that serialize many operations on directories, so if concurrent processes or threads will access the collection of files, then you may be better off using multiple directories. [[Reiser4]] will improve upon this restriction, although it is still under development. ReiserFS attempts to store all of the files in a directory, along with the directory itself, in nearby locations on disk. An application may exploit this spatial locality if it can predict which files will be accessed with temporal locality.
You may be better off using multiple directories to store your files if you can predict that many files within a directory will be accessed at the same time. To summarize, ReiserFS supports efficient access to large directories while most traditional filesystems do not. However, locking and locality issues may guide your decision to use manually structured directory trees instead, at least until ReiserFS exports control over packing locality to users and improves its locking. [[category:ReiserFS]] [[category:Reiser4]] This FAQ is very [[ReiserFS]]-centric and often a bit dated. The [[Reiser4]] filesystem is mentioned as ''upcoming''. Be sure to search the [[mailinglists|mailing list archives]] and help update this FAQ - thanks! __TOC__ === What are the specs for ReiserFS: maximum number of files, of files a directory can have, of sub-dirs in a dir, of links to a file, maximum file size, maximum filesystem size, etc.? === Specifications for [[ReiserFS]]:
{|cellpadding="5" cellspacing="0" border="1"
| '''property''' || '''3.5''' || '''3.6'''
|-
| max number of files || 2<sup>32</sup>-3 => 4 Gi - 3 || 2<sup>32</sup>-3 => 4 Gi - 3
|-
| max number of files a dir can have || 518701895 (but in practice this value is limited by the hash function; the r5 hash allows about 1 200 000 file names without collisions) || 2<sup>32</sup> - 4 => 4 Gi - 4 (but in practice this value is limited by the hash function; the r5 hash allows about 1 200 000 file names without collisions)
|-
| max file size || 2<sup>31</sup>-1 => 2 Gi - 1 || 2<sup>60</sup> bytes => 1 Ei, but the page cache limits this to 8 Ti on architectures with a 32-bit int
|-
| max number of links to a file || 2<sup>16</sup> => 64 Ki || 2<sup>32</sup> => 4 Gi
|-
| max filesystem size || 2<sup>32</sup> (4K) blocks => 16 Ti || 2<sup>32</sup> (4K) blocks => 16 Ti
|}
ReiserFS does '''meta-data journaling''', enabling fast crash recovery without the expense of full '''data journaling'''.
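The power-of-two limits in the specs table above follow directly from the field widths; a quick arithmetic check (illustrative only):

```python
# Sanity-check the derived limits from the ReiserFS specs table.
KiB, TiB, EiB = 2**10, 2**40, 2**60

max_blocks = 2**32        # 32-bit block numbers
block_size = 4 * KiB      # default 4K blocks
print(max_blocks * block_size == 16 * TiB)  # -> True: 16 Ti max filesystem size

max_file_bytes = 2**60    # 60-bit file sizes in the 3.6 format
print(max_file_bytes == EiB)                # -> True: 1 Ei max file size
```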
There [ftp://ftp.suse.com/pub/people/mason/patches/intermezzo-alpha/ were] separate [http://marc.info/?l=reiserfs-devel&m=100895310422415&w=2 patches from Chris Mason] that implement full data journaling for ReiserFS on Linux 2.4.16. '''Note''': Full data journaling is considered by many to be a good way to achieve file-data integrity across system crashes. However, although file data may appear consistent from the kernel's point of view, since there is no API exported to userspace to control transactions, we may end up in a situation where the application makes two write requests (as part of one logical transaction) but only one of them gets journaled before the system crashes. From the application's point of view, we may then end up with inconsistent data in the file. Such issues should be addressed with the upcoming [[Reiser4]]: an API will be exported to userspace, and all programs that need transactions will be able to use it. === Mount fails after reiserfsck --rebuild-tree failure === When [[reiserfsck]] --rebuild-tree is run, the first thing it does is set the root inode value to -1. This makes the filesystem unmountable. (So, if [[reiserfsck]] fails later on because the filesystem contains serious errors, the filesystem cannot be mounted.) Therefore, once [[reiserfsck]] --rebuild-tree has failed for one of your filesystems, mounting of that partition is disabled. To correct the error, first check whether you have the latest [[reiserfsprogs]] package installed. If that fails, please send a bug report to our [[mailinglists|mailing list]] and be ready to answer our questions. === Why is the execution time for a <tt>find . -type f | xargs cat {} \;</tt> command much longer when using ReiserFS than for the same command when using ext2?
=== This effect is observed if the measured file set was produced by untarring an archive that was not created from a ReiserFS partition (or by copying files from a non-ReiserFS partition, or by running a program that writes a bunch of files in some order). This is because the <tt>readdir()</tt> operation performed on the ReiserFS partition returns filenames not in the original write order but in some hash order (dependent on the hash function used). Thus, when reading the files' contents, the hard drive heads must move when going from one file to another. If you want ReiserFS to outperform any other filesystem in your setup, here is one solution: copy the entire directory that you are not satisfied with to the same partition but with a different name (use <tt>cp -a</tt>), then remove the old directory and rename the new one to the old name. If the partition does not have enough space available, another approach is to <tt>tar(1)</tt> up the whole partition, clear it, and then untar the previously saved data.
=== I am getting some errors in my kernel logs, that I do not know how to interpret === Messages like: vs-13070: reiserfs_read_inode2: i/o failure occurred trying to find stat data of [1718696 1718710 0x0 SD]" zam-7001: io error in reiserfs_find_entry most likely accompanied with samples below are definite signs of harddisk problems (bad sectors): hda: dma_intr: status=0x51 { DriveReady SeekComplete Error } hda: dma_intr: error=0x40 { UncorrectableError }, LBAsect=6599945, sector=4286584 end_request: I/O error, dev 03:03 (hda), sector 4286584 or scsi0: ERROR on channel 0, id 1, lun 0, CDB: Read (10) 00 00 01 ee 60 00 00 08 00 Current sd 08:00: sense key Medium Error or I/O error: dev 08:21, sector 65704 Messages about <tt>"access beyond end of device"</tt> may have lots of different reasons starting from not rebooting after fdisk requested it, unfinished resizings, data corruptions. The following messages mean you have a noisy IDE cable, or it is just too low quality for choosen UDMA mode. Try to replace the cable with better one, or choose slower UDMA mode: hda: dma_intr: status=0x51 { DriveReady SeekComplete Error } hda: dma_intr: error=0x84 { DriveStatusError BadCRC } hda: dma_intr: status=0x51 { DriveReady SeekComplete Error } hda: dma_intr: error=0x84 { DriveStatusError BadCRC } If you see any message from [[ReiserFS]] that you cannot interpret and there is nothing similar to messages above around it, [[mailinglists|mail the message to us]] and we will explain it to you. === Will ReiserFS implement streams, extended attributes, etc.? === [[FAQ/streams|Here]] is the one page answer. === Reiserfs appears to be very slow while the RAID is resyncing. Mounting takes several minutes. Once mounted, an <tt>ls(1)</tt> in the mounted directory hangs. Forever. Once the RAID is sync'ed, things appear to work pretty well. How that can be fixed? === First of all we have included a patch that helps mounting the drive faster into linux kernel since 2.4.19. 
You can grab the patch for earlier kernels [http://gd.tuwien.ac.at/utils/fs/reiserfs/reiserfs-for-2.5/2.5.4.pending/07-reiserfs-bitmap-journal-read-ahead.diff here]. Also RAID drivers have '''minimal guaranteed''' and '''maximal possible''' RAID rebuild bandwidth usage. These valueas are controlled through <tt>/proc/sys/dev/raid/speed_limit_min</tt> and <tt>/proc/sys/dev/raid/speed_limit_max</tt> sysctl variables (values are in 100 KiB/s). It seems that RAID logic cannot always understand if the disk sysbsystem busy or not at a given time. When it thinks disk subsystem is idle, it tries to rebuild the raid array at <tt>speed_limit_max</tt> speed which defaults to 100 MB per second. Decrease this value to something more suitable (a bit of experimentation might be needed). === I get attempt to read past the end of the partition error messages; is ReiserFS corrupted? === You changed your partition sizes, and then before rebooting ran [[mkreiserfs]]. The kernel does not change its belief in what the partition sizes are until reboot time. (This is fixable, but nobody has fixed it as of Dec. 2001). [[mkreiserfs]] created a filesystem that has the wrong notion of how large the partition it is on is. The filesystem's notion of what the partition boundaries are will last past reboot even though the kernel's notion will change. So yes, it is corrupted. Also some other kinds of metadata breakage can lead to such messages. === Can I use VMware with ReiserFS? === VMware was tested on [http://www.suse.com/ SuSE Linux] with [http://support.microsoft.com/gp/lifean18 Windows98] Guest OS on a [[ReiserFS]] partition. There's one trick at the beginning: the following line was added to the VMware config file host.FSSupportLocking1 = 0x52654973 # (0x52654973 == *(u32 *) "ReIs") Thanks to [mailto:gkade@bigbrother.net Gregory K. Ade] for this hint. === How do I install Debian potato with ReiserFS as root partition? 
=== [[FAQ/potato_part|Here]] are instructions by [mailto:LeBlanc@mcc.ac.uk Dr. A.V. Le Blanc]. === Starting with linux kernel v2.4.21 I cannot mount my FS anymore. Why? === Special sanity checks were added to the kernel code to prohibit mounting of filesystems that are bigger than the underlying block device. If you now see this message on mount: Filesystem on xx:yy cannot be mounted because it is bigger than the device you may need to run fsck or increase the size of your LVM partition. Or maybe you forgot to reboot after fdisk told you to. If you do not use LVM, that usually means you need to run <tt>[[reiserfsck]] --rebuild-sb</tt> on your filesystem and agree to change its recorded size to the proposed one. === Is it ok to use ReiserFS on a small storage device, e.g. a 16MB NAND flash block device? === [[FAQ/small_blocks|Here]] are instructions. === How do I change root from ext2 to ReiserFS without loss of data? === [[FAQ/change_fs|Here]] are instructions. === <tt>mount: /dev/hda5 has wrong major or minor number</tt> - what does that mean? === The kernel does not know anything about [[ReiserFS]]; it is neither compiled in nor available as a module. === Will it be possible to read/write ReiserFS partitions created now with future versions of ReiserFS? === Yes. [[ReiserFS]]-3.6.x (Linux-2.4.x) works with both the old (3.5) and the new (3.6) formats. ReiserFS-3.5.x (Linux-2.2.x) can only work with the old (3.5) disk format. There is no way to convert the new (3.6) disk format to the old (3.5), but the old (3.5) format can be converted to the new one (3.6) with the <tt>-o conv</tt> [[mount|mount option]]. === The ReiserFS module doesn't insert properly - why? === After applying the patch, ''recompile'' the whole kernel including the modules target, reboot, then try to insert the module. === Can I use ReiserFS with software RAID? === Yes, for all RAID levels using any Linux >= 2.4.1, but '''DO NOT''' use RAID with Linux 2.2.x.
Our journaling and their RAID code step on each other in the buffering code. Also, mirroring is '''not''' safe in the 2.2.x kernels because online mirror rebuilds in 2.2.x break the write-ordering requirements for the log. If you crash in the middle of an online rebuild, your meta-data may be corrupted. The only RAID level that is safe with [[ReiserFS]] in the 2.2.x kernels is the striping/concatenation level. === Can I use ReiserFS with 3ware RAID? === Yes, but you need to use Linux 2.2.19 or later for reasons other than [[ReiserFS]]. If you encounter problems, keep in mind that the bug might not be in ReiserFS. See these [http://web.archive.org/web/20030415160519/http://www.3ware.com/support/raid5techbulletin.shtml special instructions] (archive.org). === Why do things freeze on my IDE hard drive for annoying amounts of time? === Because when large writes are scheduled all at once, reads can starve. A fix for this is evolving; the later your ReiserFS patch, the better we handle this. === <tt>du(1)</tt> says ReiserFS makes space efficiency worse. === Use <tt>df(1)</tt>, not <tt>du(1)</tt>, or use the ''raw'' option for <tt>du(1)</tt> if it is supported. <tt>st_blocks</tt> summed up is less accurate than <tt>st_size</tt> for [[ReiserFS]] because we pack tails, and <tt>st_blocks</tt> rounds numbers up. === <tt>mkreiserfs(8)</tt> fails after repartitioning === The kernel requires you to reboot after repartitioning (for all filesystems). We intend to fix that. === Performance is poor, and my disk at 96% full still has free space. === Once a disk drive gets more than 85% full, performance starts to suffer unless you use a repacker (which is not implemented yet). You can probably get away with 92%, but if performance is valued you are making a mistake to keep it any fuller. This is true for almost all filesystems.
[[ReiserFS]], because it packs tails together, packs more data into a given percentage used, but it is still subject to the recommended maximum usage percentage. If you fill the whole disk with one copy and then mount it read-only, then you can fully pack it without problems. Please be sure that you copy it from (or <tt>tar</tt> it from) a reiserfs partition so that files are created in reiserfs <tt>readdir()</tt> order, as this will improve performance. === Why do I get a signal 11 when compiling the kernel using ReiserFS and not ext2? === Your CPU is overheating and/or you have [http://www.bitwizard.nl/sig11/ bad RAM]. === But it doesn't happen with ext2? === ext2 uses less heat-sensitive gates in the CPU :-) Seriously, ext2 and [[ReiserFS]] contain random differences, and overheating and bad RAM have random sensitivities. ([http://www.bitwizard.nl/sig11/ Signal 11] is not due to ReiserFS. One user had a cable blocking the fan; it did not affect ext2, but it wasn't until he fixed the cable-fan problem that ReiserFS worked.) === Can I use ReiserFS on other architectures than i386? === Yes, starting from the Linux [http://kernel.org/pub/linux/kernel/v2.4/ChangeLog-2.4.13 kernel 2.4.13], ReiserFS can be run on any Linux-supported architecture. === I need a program which will help me in rebuilding/recreating my partition table. === [http://brzitwa.de/mb/gpart/ gpart] is a utility that handles ext2, FAT, Linux swap, HPFS, NTFS, FreeBSD and Solaris/x86 disklabels, Minix, and ReiserFS. It prints a proposed content for the primary partition table and is well documented. === What partition type should I use for ReiserFS? === [http://www.win.tue.nl/~aeb/partitions/partition_types.html Linux native filesystem] (83) === Can I use 32GB+ IDE hard drives with ReiserFS? === Yes, if you use Linux kernel 2.4 and up. === What about resizing ReiserFS? === This can be done with [[resize_reiserfs]].
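For example, growing a ReiserFS that sits on an LVM logical volume is a two-step job: enlarge the underlying device, then enlarge the filesystem. A hedged sketch only — the volume group and logical volume names below are hypothetical; adapt them to your own layout and take a backup first:

```shell
# Grow the logical volume by 2 GiB (hypothetical VG/LV names).
lvextend -L +2G /dev/vg0/data

# Grow the filesystem to fill the enlarged device; with no -s
# argument, resize_reiserfs expands to the full device size.
resize_reiserfs /dev/vg0/data
```

Shrinking is also possible via the <tt>-s</tt> option, but the filesystem should be unmounted for that, and the filesystem must be shrunk before the device is; check resize_reiserfs(8) for the behavior of your version.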
=== What should I put into the fifth (aka dump, fs_freq) and the sixth (aka pass, fs_passno) fields of /etc/fstab for ReiserFS filesystems? === You'd put in <tt>"0 0"</tt>, e.g. /dev/sda3 /var reiserfs notail,nodev,nosuid,noexec <font color="red">0 0</font> === Why are ReiserFS filesystems not fscked on reboot after a crash? === Because [[ReiserFS]] provides journaling of meta-data. After a crash, the consistency of a filesystem is restored by replaying the transaction log. === Can I interactively repair a filesystem that was corrupted? === This is done with [[reiserfsck]]. === Can I use <tt>dump(8)</tt> and <tt>restore(8)</tt> with ReiserFS? Any caveats? === No. <tt>dump(8)</tt> uses knowledge of the internal structure of ext2, and works together with restore, which also uses ext2-specific knowledge, to back up ext2 files. dump and restore are specific to ext2 and will not work with [[ReiserFS]]. To back up ReiserFS files use <tt>tar(1)</tt>, which is universal and can be applied to almost any reasonable Linux filesystem. It is well known among system administrators that <tt>dump(8)</tt> is more complete than Unix tar, and that there is quite a list of things that Unix tar will fail to back up properly. This is not true of GNU tar, which is quite complete. Basically, the only real disadvantage of GNU tar compared to <tt>dump(8)</tt> is speed. Unfortunately, because it shares its name with Unix <tt>tar(1)</tt>, people are reluctant to believe this. (Yes, GNU tar has incremental backups, etc.) We will performance-optimize ReiserFS backups for you (and the rest of the world) for $30K, which is not a lot if you are a large site spending a few hundred thousand on equipment for backups. === Does ReiserFS support snapshots? === No, but you can create a [[ReiserFS]] on top of an [http://sourceware.org/lvm2/ LVM] logical volume and use LVM's snapshot capabilities. === Can I check reiserfs filesystems for errors without unmounting them?
=== [[reiserfsck]] in checking mode can be run on filesystems mounted read-only. There is no official way to fix mounted filesystems, though. You MUST completely unmount your filesystem in order to have it fixed. If you have LVM, you can check the consistency of filesystems mounted read-write; a script for this was contributed by Andreas Dilger. === What ReiserFS mount options should I use to get the best performance for a mail server? === [http://archives.neohapsis.com/archives/postfix/2001-03/1148.html Craig Sanders answered] in detail: By the time I got around to running <tt>bonnie</tt>, the <tt>postmark</tt> and <tt>postal</tt> benchmarks had convinced me that <tt>notail</tt> was essential. Host system: * Debian GNU/Linux (of course :) * Linux kernel 2.4.2 with the latest 20010305 ReiserFS patch * dual P3-866 (256K cache) * 512MB RAM * [http://www.adaptec.com/en-US/support/scsi/u160/ASC-19160/ Adaptec 19160] SCSI controller External drive box: * [http://www.domex.com.tw/support/product/8230u.htm Domex 8230u] RAID controller, 32MB battery-backed cache * 6 x 18GB IBM [http://www.hitachigst.com/tech/techlib.nsf/techdocs/85256AB8006A31E587256A78005A3610/$file/ddys_sp21.PDF DDYS-T18350M] drives For this particular hardware, [[ReiserFS]]/notail on RAID5 was the clear performance winner for a mail server with lots of synced random I/O. === Does using ReiserFS mean I can just press the power-off button without running <tt>/sbin/shutdown</tt>? Does it mean there is no risk of data loss? === No, definitely not. As of now, [[ReiserFS]] only provides meta-data journaling - that is, it records which files have been created or opened, whether they have had their size changed, or where they have been relocated. It guarantees that the structure of the internal ReiserFS tree will be correct, thereby allowing you to start back up after an unclean shutdown without having to run fsck on all the files that have not been changed.
Data in files that were being used at the time of the crash could have been corrupted. This is usual for most filesystems. Data-journaling filesystems guarantee that no garbage will be written into a file, but they do not guarantee that a file update will be completed. (Only [[Reiser4]] guarantees that filesystem operations are performed as atomic operations, and provides atomic transaction functionality.) [[ReiserFS]] does not guarantee that the file contents themselves are uncorrupted, nor that no data is lost. Moreover, even if all of your system is on ReiserFS, many system components (like daemons, database managers, etc.) require the shutdown procedure for proper functioning. However, there is a [ftp://ftp.suse.com/pub/people/mason/patches/data-logging separate implementation of data logging] (dead) that will [http://marc.info/?l=reiserfs-devel&m=103472026011689&w=2 soon] go into the mainstream kernel. === How does ReiserFS support bad block handling? === This is covered [[FAQ/bad-block-handling|here]]. === I have a motherboard with VIA MVP3 chipset and experience ReiserFS problems. === [mailto:woster73@yahoo.com William Oster] answers: If you are using a motherboard with a [http://www.via.com.tw/en/products/apollo/mvp3.jsp VIA MVP3] chipset, you may have [[ReiserFS]] problems caused by the way your kernel is configured for the so-called [http://lxr.linux.no/linux+v2.6.30/drivers/pci/quirks.c PCI quirks]. My experience is with kernels 2.2.18 and 2.2.19, but it may affect the 2.4.x series too if you are using the MVP3 chipset (popular in socket 7 type motherboards, such as those used by the AMD K6 and classic Pentium). I've confirmed this problem with several motherboards using the VIA MVP3 chipset, ReiserFS 3.5.29 to 3.5.32, and [http://lxr.linux.no/linux+v2.6.30/Documentation/scsi/ncr53c8xx.txt NCR 53c8xx SCSI]. But please note: it probably affects '''any controller which uses DMA and PCI bus mastering'''.
Problems which I was inclined to attribute to ReiserFS were actually caused by this kernel misconfiguration. If you fit this profile, '''DO NOT''' enable the <tt>CONFIG_PCI_QUIRKS</tt> configuration option in the <tt>/usr/src/linux/.config</tt> file. Although the Linux documentation suggests that this option can be enabled if in doubt, '''DO NOT''' enable it. It was never intended for the VIA MVP3 chipset anyway. It affects the way DMA is handled, and in combination with ReiserFS (and possibly NCR SCSI) it can cause random disk corruption which will eventually result in ReiserFS and/or SCSI errors. Evidently ReiserFS exercises the DMA and SCSI bus very thoroughly; the problems seem to be less likely under the ext2 filesystem. Check your <tt>/usr/src/linux/.config</tt> file. You are safe from this problem if you find this line: # CONFIG_PCI_QUIRKS is not set Any other setting could be dangerous to MVP3 chipset ReiserFS users, especially when using PCI bus-mastering controllers such as the NCR 53c8xx series. Re-configure your kernel to disable the "PCI quirks" option, then <tt>make dep</tt>, rebuild, and reinstall. === I am having extensive problems using ReiserFS; it seems to have bugs all over the place. I'm not compiling with a [[#I am using RedHat 7.0 with gcc 2.96; why does ReiserFS seem unstable with it?|buggy compiler]]. What is happening? How can this be stable? === You have hardware problems. Really, you do. Even if the bugs don't show up with ext2, you have hardware problems. (See [[#Why_do_I_get_a_signal_11_when_compiling_the_kernel_using_ReiserFS_and_not_ext2?|the signal 11 question]].) Most SuSE users use ReiserFS. Obscure bugs probably still exist; but if you find bugs as easily as using Windows, you have bad RAM, a bad CPU, a bad cable, bad cooling, a [[#I have a motherboard with VIA MVP3 chipset and experience ReiserFS problems.|VIA chipset with PCI quirks turned on]], or other hardware or software-layer bugs. ReiserFS is stable.
You can be sure that if the bugs are encountered easily and commonly with normal usage patterns, it is not us. This does not mean that the next release won't somehow break something, though :-/ At the time of writing, real bug reports are outnumbered 10 to 1 by hardware problems that trigger error messages. We are working on making our error messages better at catching hardware bugs and identifying them as such. There is only so far we can go in runtime consistency checking, though, without serious speed reductions. We don't release software unless it goes through extensive testing; so if you don't think that our testing could have missed the bug, it is probably hardware. === How can I put a label (like the one allowed by the <tt>-L</tt> option of <tt>mkfs.ext2</tt>) on a ReiserFS instance? === Currently, this feature is only implemented for the [[ReiserFS]] v3.6 disk format. Adding it to the v3.5 disk format would break the existing on-disk format, and there is not enough free space in the superblock. You can set a label (and UUID) with a recent [[reiserfsprogs]] package on a [[ReiserFS]] v3.6 filesystem using the <tt>-l</tt> switch (<tt>-u</tt> for UUID) to the [[reiserfstune]] (for existing partitions) or [[mkreiserfs]] (for partitions being created) commands. Support for labels and UUIDs was integrated into [[reiserfsprogs]] starting from version 3.x.1a. === Why, when I'm working on files (i.e. having open files) on my laptop, does ReiserFS access the disk every 5 seconds? This effectively prevents the disk from spinning down, i.e. APM modes from taking over, even when I'm not writing anything. === [mailto:bgraveland@hyperchip.com Brent Graveland] answers: It's the [http://kerneltrap.org/node/14148 atime] update. Every time you run <tt>sync(1)</tt>, the sync program's <tt>atime</tt> is updated. The next <tt>sync()</tt> writes out this <tt>atime</tt> update, and then <tt>sync(1)</tt>'s <tt>atime</tt> gets updated again. Mounting the filesystem with the <tt>noatime</tt> option avoids these updates. === RedHat does not unmount <tt>/</tt> (<tt>/dev/root</tt>) with ReiserFS on halt. How to fix it?
=== RedHat users kindly provided these patches (not tested by us): * [[FAQ/rc.sysinit.patch|rc.sysinit.patch]] * [[FAQ/halt.patch|halt.patch]] Note that if you have [http://www.redhat.com/docs/manuals/linux/RHL-7.2-Manual RedHat Linux 7.2] or later, you do not need these patches. === How do I run programs from the reiserfsprogs package on encrypted devices? === In order to access such encrypted devices, you need to use the losetup tool to bind the device to a loop device. === Are there any recommendations for or against particular hard drive manufacturers for use with reiserfs? === There is basically no preference; the general rule applies as always: the faster the drive and the lower the seek time, the better. On the other hand, almost every hard drive manufacturer has a "widely known" broken series of hard drives. The most recent example is IBM's "Deskstar" series of disks, especially the DTLA models produced in Hungary in 2000-2001. These are known to fail very often, to the point that you probably don't want to use them even if you have already paid for them. Other Deskstar drives also seem to be a poor choice. IBM released a note that Deskstar drives should not run for more than 8 hours/day on average. These drives are also known to be very sensitive to temperature conditions and to fail on overheating. A class action lawsuit against IBM over that drive series is in progress. === I am using RedHat 7.0 with gcc 2.96; why does ReiserFS seem unstable with it? === Use the most recent version of RedHat (gcc 2.96-85 or later, as shipped with RedHat 7.2; 7.1 is also okay for ReiserFS). The choice of an unstable, unreleased version of gcc 2.96 by RedHat as the default gcc was a Slashdot controversy. gcc 2.96 on RedHat 7.0 was unstable, and ReiserFS was one of the things that would fail with it. There are two gcc versions: 2.96 and 2.96-85. 2.96-85 works for ReiserFS; the other (the one on RedHat 7.0) definitely does not.
Read the Linux kernel instructions about which compiler to use. The solution to code not working on broken compilers is the one RedHat has taken: fix the compiler. They fixed the compiler and thereby allowed the correctly compiled ReiserFS to work. === In my program I am using fsync(2) calls after each write to the file to guarantee the integrity of my file data, and this is very slow. How can I improve the performance? === Answer from Chris Mason: The main thing to remember is that fsyncs introduce a bunch of disk writes and force the FS to wait on the buffers. The key to keeping performance up is to make it easy for the FS to do as much as possible before the fsync call. So, if your application modifies 3 files, and you want to make sure all 3 changes are safely on disk: write(file1) write(file2) write(file3) fsync(file1) fsync(file2) fsync(file3) is much faster than: write(file1) fsync(file1) write(file2) fsync(file2) write(file3) fsync(file3) It is also faster to write over existing bytes in a file than to append new bytes onto its end. When you overwrite existing bytes in the file, you don't have to commit new metadata to disk on fsync(); the FS can just write the data blocks. This means fewer seeks. The more you write to a single file before calling fsync, the faster overall things will run. write(8k) fsync(file) is much faster than: write(4k) fsync(file) write(4k) fsync(file) Optimizing for those 3 things alone can make a huge performance difference overall. Answer from Josh MacDonald: You have to understand that even using fsync() after every write() makes no guarantees. If the system crashes during either the write or the fsync operation, your data may be lost or corrupted. Suppose the fsync() does complete: does your application keep its data in multiple files? If that is the case and you need to write() to multiple files as part of a transaction, you have even greater problems.
The only safe and easy way for you to implement some kind of transaction with the traditional file system guarantees is to use rename(): 1. Keep all of your data in a single file. 2. Periodically write a complete copy of your database to a temporary file. 3. Rename the temporary file to the original database name. (Addition from Nikita Danilov: One can implement something like a phase tree at user level and use rename to atomically switch the root of the tree. This overcomes the "everything-in-one-file" limitation but adds the complexity of requiring crash recovery.) Answer from Nikita Danilov: Stop your development for now and wait until the reiser4 filesystem is released; it will have a transaction API exported to userspace. That transaction API would solve all of your problems. == Our program needs to access a lot of working files. What is the recommended way to organize files to get the best results out of ReiserFS? Should all the files be placed in a single directory, or should the files be spread across a directory tree to limit the number of files per directory? Can you also summarize the relevant caching and locking effects? == Traditional file systems typically have poor performance when there are many files in a single directory, but not [[ReiserFS]]. These other file systems perform poorly because they use a linear search algorithm to find and replace entries in a directory. This means that the file system must scan, on average, half the blocks of a directory for every access. Typically, applications are required to work around this problem by manually structuring a tree of directories, allowing each individual directory to remain limited in size. For example, see how the Squid web proxy stores a large collection of files. ReiserFS does not have this problem because it uses an internal tree to store all directories and file metadata.
Directory operations remain efficient even for very large directories, so you can write your application free from this performance concern. However, there are several issues that complicate this matter: namely locking and locality. The Linux VFS currently imposes locking restrictions that serialize many operations on directories, so if concurrent processes or threads will access the collection of files then you may be better off using multiple directories. [[Reiser4]] will improve upon this restriction, although it is still under development. ReiserFS attempts to store all of the files in a directory, along with the directory itself, in nearby locations on disk. An application may exploit this spatial locality if it can predict which files will be accessed with temporal locality. You may be better off using multiple directories to store your files if you can predict that many files within a directory will be accessed at the same time. To summarize, ReiserFS supports efficient access to large directories, and most traditional file systems do not. However, locking and locality issues may guide your decision to use manually structured directory trees instead, at least until ReiserFS exports control over packing locality to users and improves its locking. [[category:ReiserFS]] [[category:Reiser4]] This FAQ is very [[ReiserFS]] centric and often a bit dated. The [[Reiser4]] filesystem is mentioned as ''upcoming''. Be sure to search the [[mailinglists|mailing list archives]] and help update this FAQ - thanks! __TOC__ === What are the specs for ReiserFS: maximum number of files, of files a directory can have, of sub-dirs in a dir, of links to a file, maximum file size, maximum filesystem size, etc.?
=== Specifications for [[ReiserFS]]:

{| cellpadding="5" cellspacing="0" border="1"
| '''property''' || '''3.5''' || '''3.6'''
|-
| max number of files || 2<sup>32</sup>-3 => 4 Gi-3 || 2<sup>32</sup>-3 => 4 Gi-3
|-
| max number of files a dir can have || 518701895 (but in practice this value is limited by the hash function; the r5 hash allows about 1 200 000 file names without collisions) || 2<sup>32</sup>-4 => 4 Gi-4 (but in practice this value is limited by the hash function; the r5 hash allows about 1 200 000 file names without collisions)
|-
| max file size || 2<sup>31</sup>-1 => 2 Gi-1 || 2<sup>60</sup> bytes => 1 Ei, but the page cache limits this to 8 Ti on architectures with a 32-bit int
|-
| max number of links to a file || 2<sup>16</sup> => 64 Ki || 2<sup>32</sup> => 4 Gi
|-
| max filesystem size || 2<sup>32</sup> (4K) blocks => 16 Ti || 2<sup>32</sup> (4K) blocks => 16 Ti
|}

ReiserFS does '''meta-data journaling''', enabling fast crash recovery without the expense of full '''data journaling'''. There [ftp://ftp.suse.com/pub/people/mason/patches/intermezzo-alpha/ were] separate [http://marc.info/?l=reiserfs-devel&m=100895310422415&w=2 patches from Chris Mason] that implement full data journaling for ReiserFS for Linux 2.4.16. '''Note''': Full data journaling is considered by many to be a good way to achieve file data integrity across system crashes. However, although file data may appear to be consistent from the kernel's point of view, since there is no API exported to userspace to control transactions, we may end up in a situation where the application makes two write requests (as part of one logical transaction) but only one of them gets journaled before the system crashes. From the application's point of view, we may then end up with inconsistent data in the file. Such issues should be addressed with the upcoming [[Reiser4]]: such an API will be exported to userspace, and all programs that need transactions will be able to use it.
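Until such a transaction API exists, the usual user-level workaround for this two-writes-one-crash problem is the rename() pattern: write the complete new contents to a temporary file, flush it, then atomically rename it over the old file. A minimal shell sketch (the filenames are made up for illustration):

```shell
# Write the complete new contents to a temporary file first.
printf 'balance=42\n' > state.txt.tmp

# Flush the temporary file to disk. Recent coreutils sync(1)
# accepts file arguments and calls fsync(2) on each of them.
sync state.txt.tmp

# rename(2) within one directory is atomic: readers see either
# the old contents or the new, never a half-written file.
mv -f state.txt.tmp state.txt
```

After a crash, state.txt holds either the complete old contents or the complete new ones; a leftover state.txt.tmp, if any, can simply be discarded.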
=== Mount fails after reiserfsck --rebuild-tree failure === When [[reiserfsck]] --rebuild-tree is run, the first thing it does is set the root inode value to -1, which makes the filesystem unmountable. (So, if [[reiserfsck]] fails later on because the filesystem contains serious errors, the filesystem cannot be mounted.) Therefore, once [[reiserfsck]] --rebuild-tree has failed for one of your filesystems, mounting of that partition is disabled. To correct the error, first check that you have the latest [[reiserfsprogs]] package installed. If that fails, please send a bug report to our [[mailinglists|mailing list]] and be ready to answer our questions. === Why is the execution time for a <tt>find . -type f | xargs cat</tt> command much longer when using ReiserFS than for the same command when using ext2? === This effect is observed if the measured file set was produced by untarring an archive that was not created from a ReiserFS partition (or by copying files from a non-ReiserFS partition, or by running a program that writes a bunch of files in some order). This is because the <tt>readdir()</tt> operation performed on the ReiserFS partition returns filenames not in the original write order but rather in some hash order (dependent on the hash function used). Thus, when reading the files' contents, the hard drive heads must move when going from one file to another. If you want ReiserFS to outperform any other filesystem in your setup, here is one solution: Copy the entire directory that you are not satisfied with to the same partition but with a different name (use <tt>cp -a</tt>), then remove the old directory and rename the new one to the old name. If the partition does not have enough space available, another approach is to <tt>tar(1)</tt> up the whole partition, clear it, and then untar the previously saved data. === Is quota-support built-in in the vanilla 2.4 kernels for ReiserFS?
=== No, quota support for Linux kernels for the 2.4 branch are bundled separately and were available once at [ftp://ftp.suse.com/pub/people/mason/patches/reiserfs/quota-2.4/ at SuSE] (gone) by Chris Mason, they are still [http://gd.tuwien.ac.at/utils/fs/reiserfs/quota-patches/ mirrored at TU-Wien]. The reason these patches were not included into 2.4 kernel branch is because they implement new quota format and need new quota code too, which is too big of a change for 2.4 series of kernels. Various Linux distributions vendors (ie [http://www.suse.com SuSE]) do ship reiserfs-quota enabled kernels, though. === I am getting some errors in my kernel logs, that I do not know how to interpret === Messages like: vs-13070: reiserfs_read_inode2: i/o failure occurred trying to find stat data of [1718696 1718710 0x0 SD]" zam-7001: io error in reiserfs_find_entry most likely accompanied with samples below are definite signs of harddisk problems (bad sectors): hda: dma_intr: status=0x51 { DriveReady SeekComplete Error } hda: dma_intr: error=0x40 { UncorrectableError }, LBAsect=6599945, sector=4286584 end_request: I/O error, dev 03:03 (hda), sector 4286584 or scsi0: ERROR on channel 0, id 1, lun 0, CDB: Read (10) 00 00 01 ee 60 00 00 08 00 Current sd 08:00: sense key Medium Error or I/O error: dev 08:21, sector 65704 Messages about <tt>"access beyond end of device"</tt> may have lots of different reasons starting from not rebooting after fdisk requested it, unfinished resizings, data corruptions. The following messages mean you have a noisy IDE cable, or it is just too low quality for choosen UDMA mode. 
Try to replace the cable with better one, or choose slower UDMA mode: hda: dma_intr: status=0x51 { DriveReady SeekComplete Error } hda: dma_intr: error=0x84 { DriveStatusError BadCRC } hda: dma_intr: status=0x51 { DriveReady SeekComplete Error } hda: dma_intr: error=0x84 { DriveStatusError BadCRC } If you see any message from [[ReiserFS]] that you cannot interpret and there is nothing similar to messages above around it, [[mailinglists|mail the message to us]] and we will explain it to you. === Will ReiserFS implement streams, extended attributes, etc.? === [[FAQ/streams|Here]] is the one page answer. === Reiserfs appears to be very slow while the RAID is resyncing. Mounting takes several minutes. Once mounted, an <tt>ls(1)</tt> in the mounted directory hangs. Forever. Once the RAID is sync'ed, things appear to work pretty well. How that can be fixed? === First of all we have included a patch that helps mounting the drive faster into linux kernel since 2.4.19. You can grab the patch for earlier kernels [http://gd.tuwien.ac.at/utils/fs/reiserfs/reiserfs-for-2.5/2.5.4.pending/07-reiserfs-bitmap-journal-read-ahead.diff here]. Also RAID drivers have '''minimal guaranteed''' and '''maximal possible''' RAID rebuild bandwidth usage. These valueas are controlled through <tt>/proc/sys/dev/raid/speed_limit_min</tt> and <tt>/proc/sys/dev/raid/speed_limit_max</tt> sysctl variables (values are in 100 KiB/s). It seems that RAID logic cannot always understand if the disk sysbsystem busy or not at a given time. When it thinks disk subsystem is idle, it tries to rebuild the raid array at <tt>speed_limit_max</tt> speed which defaults to 100 MB per second. Decrease this value to something more suitable (a bit of experimentation might be needed). === I get attempt to read past the end of the partition error messages; is ReiserFS corrupted? === You changed your partition sizes, and then before rebooting ran [[mkreiserfs]]. 
The kernel does not change its belief in what the partition sizes are until reboot time. (This is fixable, but nobody has fixed it as of Dec. 2001). [[mkreiserfs]] created a filesystem that has the wrong notion of how large the partition it is on is. The filesystem's notion of what the partition boundaries are will last past reboot even though the kernel's notion will change. So yes, it is corrupted. Also some other kinds of metadata breakage can lead to such messages. === Can I use VMware with ReiserFS? === VMware was tested on [http://www.suse.com/ SuSE Linux] with [http://support.microsoft.com/gp/lifean18 Windows98] Guest OS on a [[ReiserFS]] partition. There's one trick at the beginning: the following line was added to the VMware config file host.FSSupportLocking1 = 0x52654973 # (0x52654973 == *(u32 *) "ReIs") Thanks to [mailto:gkade@bigbrother.net Gregory K. Ade] for this hint. === How do I install Debian potato with ReiserFS as root partition? === [[FAQ/potato_part|Here]] are instructions by [mailto:LeBlanc@mcc.ac.uk Dr. A.V. Le Blanc]. === Starting with linux kernel v2.4.21 I cannot mount my FS anymore. Why? === Special sanity checks were added to kernel code to prohibit mounting of filesystems that are bigger then underlying block device. If you now see this message on mount: Filesystem on xx:yy cannot be mounted because it is bigger than the device you may need to run fsck or increase size of your LVM partition. Or may be you forgot to reboot after fdisk when it told you to If you do not use LVM, that usually means you need to run <tt>[[reiserfsck]] --rebuild-sb</tt> on your filesystem and agree to change its default size to proposed one. === Is it ok to use ReiserFS on a small size storage device: e.g. 16MB NAND flash block device? === [[FAQ/small_blocks|Here]] are instructions. === How do I change root from ext2 to ReiserFS without loss of data? === [[FAQ/change_fs|Here]] are instructions. 
=== <tt>mount: /dev/hda5 has wrong major or minor number</tt> - what does that mean? ===
The kernel does not know anything about [[ReiserFS]]; it is neither compiled in nor available as a module.

=== Will it be possible to read/write ReiserFS partitions created now with future versions of ReiserFS? ===
Yes. [[ReiserFS]]-3.6.x (Linux-2.4.x) works with both the old (3.5) and the new (3.6) formats. ReiserFS-3.5.x (Linux-2.2.x) can only work with the old (3.5) disk format. There is no way to convert the new (3.6) disk format to the old (3.5), but the old (3.5) format can be converted to the new one (3.6) with the <tt>-o conv</tt> [[mount|mount option]].

=== The ReiserFS module doesn't insert properly - why? ===
After applying the patch, ''recompile'' the whole kernel including the modules target, reboot, then try to insert the module.

=== Can I use ReiserFS with software RAID? ===
Yes, for all RAID levels using any Linux >= 2.4.1, but '''DO NOT''' use RAID with Linux 2.2.x. Our journaling and their RAID code step on each other in the buffering code. Also, mirroring is '''not''' safe in the 2.2.x kernels because online mirror rebuilds in 2.2.x break the write ordering requirements for the log. If you crash in the middle of an online rebuild, your meta-data may be corrupted. The only RAID level that is safe with [[ReiserFS]] in the 2.2.x kernels is the striping/concatenation level.

=== Can I use ReiserFS with 3ware RAID? ===
Yes, but you need to use Linux 2.2.19 or later for reasons other than [[ReiserFS]]. Also, if you should encounter problems, consider that it might not be ReiserFS that has the bug. See the [http://web.archive.org/web/20030415160519/http://www.3ware.com/support/raid5techbulletin.shtml special instructions] (archive.org).

=== Why do things freeze on my IDE hard drive for annoying amounts of time? ===
Because when large writes are scheduled all at once, reads can starve.
A fix for this is evolving; the later your ReiserFS patch, the better we handle this.

=== <tt>du(1)</tt> says ReiserFS makes space efficiency worse. ===
Use <tt>df(1)</tt>, not <tt>du(1)</tt>, or use the ''raw'' option for <tt>du(1)</tt> if it's supported. <tt>st_blocks</tt> summed up is less accurate than <tt>st_size</tt> for [[ReiserFS]] because we pack tails, and <tt>st_blocks</tt> rounds numbers up.

=== <tt>mkreiserfs(8)</tt> fails after repartitioning ===
The kernel requires you to reboot after repartitioning (for all filesystems). We intend to fix that.

=== Performance is poor, and my disk at 96% full still has free space. ===
Once a disk drive gets more than 85% full, performance starts to suffer unless you use a repacker (which isn't implemented yet). You can probably get away with 92%, but if performance is valued you are making a mistake to keep it any fuller. This is true for almost all filesystems. [[ReiserFS]], because it packs tails together, packs more data into a given percentage used, but it is still subject to the rules for the maximum recommended percentage used. If you create the whole disk with one copy and then mount it read-only, then you can fully pack it without problems. Please be sure that you copy it from (or <tt>tar</tt> it from) a reiserfs partition so that files are created in reiserfs <tt>readdir()</tt> order, as this will improve performance.

=== Why do I get a signal 11 when compiling the kernel using ReiserFS and not ext2? ===
Your CPU is overheating and/or you have [http://www.bitwizard.nl/sig11/ bad RAM].

=== But it doesn't happen with ext2? ===
ext2 uses less heat sensitive gates in the CPU :-) Seriously, ext2 and [[ReiserFS]] contain random differences, and overheating and bad RAM have random sensitivities. ([http://www.bitwizard.nl/sig11/ Signal 11] is not due to ReiserFS. One user had a cable blocking the fan; it did not affect ext2, but it wasn't until he fixed the cable-fan problem that ReiserFS worked.)
=== Can I use ReiserFS on other architectures than i386? ===
Yes, starting from the Linux [http://kernel.org/pub/linux/kernel/v2.4/ChangeLog-2.4.13 kernel 2.4.13], ReiserFS can run on any Linux-supported arch.

=== I need a program which will help me in rebuilding/recreating my partition table. ===
[http://brzitwa.de/mb/gpart/ gpart] is a utility that handles ext2, FAT, Linux swap, HPFS, NTFS, FreeBSD and Solaris/x86 disklabels, Minix, and ReiserFS. It prints a proposed content for the primary partition table and is well-documented.

=== What partition type should I use for ReiserFS? ===
[http://www.win.tue.nl/~aeb/partitions/partition_types.html Linux native filesystem] (83)

=== Can I use 32GB+ IDE Hard Drives with ReiserFS? ===
Yes, if you use Linux kernel 2.4 and up.

=== What about resizing ReiserFS? ===
This can be done with [[resize_reiserfs]].

=== What should I put into the fifth (aka dump, fs_freq) and the sixth (aka pass, fs_passno) fields of /etc/fstab for ReiserFS filesystems? ===
You'd put in <tt>"0 0"</tt>, e.g.
 /dev/sda3 /var reiserfs notail,nodev,nosuid,noexec <font color="red">0 0</font>

=== Why are ReiserFS filesystems not fscked on reboot after a crash? ===
Because [[ReiserFS]] provides journaling of meta-data. After a crash, the consistency of a filesystem is restored by replaying the transaction log.

=== Can I interactively repair a filesystem that was corrupted? ===
This is done with [[reiserfsck]].

=== Can I use <tt>dump(8)</tt> and <tt>restore(8)</tt> with ReiserFS? Any caveats? ===
No. <tt>dump(8)</tt> uses knowledge of the internal structure of ext2 and works together with restore, which also uses ext2-specific knowledge, to back up ext2 files. dump and restore are specific to ext2 and will not work with [[ReiserFS]]. To back up ReiserFS files use <tt>tar(1)</tt>, which is universal and can be applied to almost any reasonable Linux filesystem.
It is well known among system administrators that <tt>dump(8)</tt> is more complete than Unix tar, and that there is quite a list of things that Unix tar will fail to properly back up. This is not true of GNU tar, which is quite complete. Basically, the only real disadvantage of GNU tar compared to <tt>dump(8)</tt> is speed. Unfortunately, because it shares the same name as Unix <tt>tar(1)</tt>, people are reluctant to believe this. (Yes, GNU tar has incremental backups, etc.) We will performance optimize ReiserFS backups for you (and the rest of the world) for $30K, which is not a lot if you are a large site spending a few hundred thousand on equipment for backups.

=== Does ReiserFS support snapshots? ===
No, but you can create [[ReiserFS]] on top of an [http://sourceware.org/lvm2/ LVM] logical volume and use LVM's snapshot capabilities.

=== Can I check reiserfs filesystems for errors without unmounting them? ===
[[reiserfsck]] in checking mode may run over filesystems mounted read-only. There is no official way to fix mounted filesystems, though. You MUST completely unmount your filesystem in order to have it fixed. If you have LVM, you can check the consistency of filesystems mounted read-write; here is the script contributed by Andreas Dilger:

=== What ReiserFS mount options should I use to get the performance winner for a mail server? ===
[http://archives.neohapsis.com/archives/postfix/2001-03/1148.html Craig Sanders answered] in detail: By the time I got around to running <tt>bonnie</tt>, the <tt>postmark</tt> and <tt>postal</tt> benchmarks had convinced me that <tt>notail</tt> was essential. Host system:
* Debian GNU/Linux (of course :)
* Linux kernel 2.4.2 with latest 20010305 ReiserFS patch
* dual P3-866 (256K cache)
* 512MB RAM
* [http://www.adaptec.com/en-US/support/scsi/u160/ASC-19160/ Adaptec 19160] SCSI Controller
External drive box:
* [http://www.domex.com.tw/support/product/8230u.htm Domex 8230u] RAID controller, 32MB battery-backed cache.
* 6 x 18GB IBM [http://www.hitachigst.com/tech/techlib.nsf/techdocs/85256AB8006A31E587256A78005A3610/$file/ddys_sp21.PDF DDYS-T18350M] drives
For this particular hardware, [[ReiserFS]]/notail on RAID5 was the clear performance winner for a mail server with lots of synced random I/O.

=== Does using ReiserFS mean I can just press the power off button without running <tt>/sbin/shutdown</tt>? Does it mean there is no risk of data loss? ===
No, definitely not. As of now, [[ReiserFS]] only provides meta-data journaling - that is, it records which files have been created or opened, whether they have had their size changed, or where they have been relocated. It guarantees that the structure of the internal ReiserFS tree will be correct, thereby allowing you, after an unclean shutdown, to start back up without having to run fsck on all the files that have not been changed. Data in files that were being used at the time of the crash could have been corrupted. This is usual for most filesystems. Data journaling filesystems guarantee that there will be no garbage written into a file, but they don't guarantee that a file update will be completed. (Only [[Reiser4]] guarantees that filesystem operations are performed as atomic operations, and provides atomic transaction functionality.) [[ReiserFS]] does not guarantee that the file contents themselves are uncorrupted, nor that no data is lost. Moreover, even if all of your system is on ReiserFS, many system components (like daemons, database managers, etc.) require a proper shutdown procedure to function correctly. However, there is a [ftp://ftp.suse.com/pub/people/mason/patches/data-logging separate implementation of data logging] (dead) that will [http://marc.info/?l=reiserfs-devel&m=103472026011689&w=2 soon] go into the mainstream kernel.

=== How does ReiserFS support bad block handling? ===
This is covered [[FAQ/bad-block-handling|here]].

=== I have a motherboard with VIA MVP3 chipset and experience ReiserFS problems.
=== [mailto:woster73@yahoo.com William Oster] answers: If you are using a motherboard with a [http://www.via.com.tw/en/products/apollo/mvp3.jsp VIA MVP3] chipset, you may have [[ReiserFS]] problems caused by the way your kernel is configured for the so-called [http://lxr.linux.no/linux+v2.6.30/drivers/pci/quirks.c PCI quirks]. My experience is with kernels 2.2.18 and 2.2.19, but it may affect the 2.4.x series too if you are using the MVP3 chipset (popular in socket 7 type motherboards, such as those used by the AMD K6 and classic Pentium). I've confirmed this problem with several motherboards using the VIA MVP3 chipset, ReiserFS 3.5.29 to 3.5.32, and [http://lxr.linux.no/linux+v2.6.30/Documentation/scsi/ncr53c8xx.txt NCR 53c8xx SCSI]. But please note: it probably affects '''any controller which uses DMA and PCI bus mastering'''. Problems which I was inclined to attribute to ReiserFS were actually problems with this kernel misconfiguration. If you fit this profile, '''DO NOT''' enable the <tt>CONFIG_PCI_QUIRKS</tt> configuration option in the <tt>/usr/src/linux/.config</tt> file. Although the Linux documentation suggests that this option can be enabled if in doubt, '''DO NOT''' enable it. It was never intended for the VIA MVP3 chipset anyway. It affects the way DMA is handled, and the combination with ReiserFS (and possibly NCR SCSI) can cause random disk corruption which eventually results in ReiserFS and/or SCSI errors. Evidently ReiserFS exercises the DMA and SCSI bus very thoroughly; the problems seem not to be as likely under the ext2 filesystem. Check your <tt>/usr/src/linux/.config</tt> file. You are safe from this problem if you find this line:
 # CONFIG_PCI_QUIRKS is not set
Any other setting could be dangerous to MVP3 chipset ReiserFS users, especially when using PCI bus mastering controllers such as the NCR 53c8xx series. Re-configure your kernel to disable the "PCI quirks" option, then <tt>make dep</tt>, rebuild, and reinstall.
=== I am having extensive problems using ReiserFS; it seems to have bugs all over the place. I'm not compiling with a buggy compiler. What is happening? How can this be stable? ===
You have hardware problems. Really, you do. Even if the bugs don't show up with ext2, you have hardware problems. (See [[#Why_do_I_get_a_signal_11_when_compiling_the_kernel_using_ReiserFS_and_not_ext2?|the signal 11 question]].) Most SuSE users use ReiserFS. Obscure bugs probably still exist; but if you find bugs as easily as you would when using Windows, you have bad RAM, a bad CPU, a bad cable, bad cooling, a [[#I have a motherboard with VIA MVP3 chipset and experience ReiserFS problems.|VIA chipset with PCI quirks turned on]], or other hardware or software-layer bugs. ReiserFS is stable. You can be sure that if the bugs are encountered easily and commonly with normal usage patterns, it is not us. This does not mean that the next release won't somehow break something though :-/ Real bug reports are, at the time of writing, outnumbered 10 to 1 by hardware bugs that trigger error messages. We are working on making our error messages better at catching hardware bugs and identifying them as such. There is only so far we can go in runtime consistency checking, though, without serious speed reductions. We don't release software unless it goes through extensive testing; so if you don't think that our testing could have missed the bug, it is probably hardware.

=== How can I put a label (like the one allowed by the <tt>-L</tt> option of <tt>mkfs.ext2</tt>) on a ReiserFS instance? ===
Currently, this feature is only implemented for the [[ReiserFS]] v3.6 disk format. Adding it to the v3.5 disk format would break the existing disk format, and there is not enough free space in the superblock.
You can set a label (and UUID) with a recent [[reiserfsprogs]] package on a [[ReiserFS]] v3.6 filesystem using the <tt>-l</tt> switch (<tt>-u</tt> for UUID) to the [[reiserfstune]] (for existing partitions) or [[mkreiserfs]] (for partitions being created) commands. Support for labels and UUIDs was integrated into [[reiserfsprogs]] starting from version 3.x.1a.

=== Why, when I'm working on files (i.e. having open files) on my laptop, does ReiserFS access the disk every 5 seconds? This effectively prevents the disk from spinning down, i.e. APM modes from taking over, even when I'm not writing anything. ===
[mailto:bgraveland@hyperchip.com Brent Graveland] answers: It's the [http://kerneltrap.org/node/14148 atime] update. Every time you run <tt>sync(1)</tt>, the sync program's <tt>atime</tt> is updated. The next <tt>sync()</tt> writes this <tt>atime</tt> update, then <tt>sync(1)</tt>'s <tt>atime</tt> gets updated again.

=== RedHat does not unmount <tt>/</tt> (<tt>/dev/root</tt>) with ReiserFS on halt. How to fix it? ===
RedHat users kindly provided these patches (not tested by us):
* [[FAQ/rc.sysinit.patch|rc.sysinit.patch]]
* [[FAQ/halt.patch|halt.patch]]
Note that if you have [http://www.redhat.com/docs/manuals/linux/RHL-7.2-Manual RedHat Linux 7.2] or later, you do not need these patches.

=== How do I run programs from the reiserfsprogs package on encrypted devices? ===
In order to access such encrypted devices you need to use the losetup tool to bind your device to a loop device.

=== Are there any recommendations for or against particular hard drive manufacturers for use with reiserfs? ===
There is basically no preference; the general rule "the faster the drive and the lower the seek time, the better" applies as always. On the other hand, almost every hard drive manufacturer has a "widely known" broken series of hard drives. The most recent example is IBM's "Deskstar" series, especially the DTLA models produced in Hungary in 2000-2001.
These are known to fail very often, to the point that you probably don't want to use them even if you already paid for them. Other Deskstar drives also seem to be a poor choice. IBM released a note that Deskstar drives should not run for more than 8 hours/day on average. These drives are also known to be very sensitive to temperature conditions and to fail on overheating. There is a class action lawsuit against IBM over that drive series which is in progress.

=== I am using RedHat 7.0 with gcc 2.96; why does ReiserFS seem unstable with it? ===
Use the most recent version of RedHat (gcc 2.96-85 or later with RedHat 7.2, although 7.1 is also okay for ReiserFS). The choice of an unstable, unreleased version of gcc 2.96 by RedHat as the default gcc was a Slashdot controversy. gcc 2.96 on RedHat 7.0 was unstable, and ReiserFS was one of the things that would fail for it. There are two gcc versions: 2.96 and 2.96-85. 2.96-85 works for ReiserFS; the other (the one on RedHat 7.0) surely does not. Read the Linux kernel instructions about what compiler to use. The solution to code not working on broken compilers is the one RedHat has taken: fix the compiler. They fixed the compiler and thereby allowed the correctly compiled ReiserFS to work.

=== In my program I am using fsync(2) calls after each write to the file to guarantee integrity of my file data, and this is very slow; how can I improve the performance? ===
Answer from Chris Mason: The main thing to remember is that fsyncs introduce a bunch of disk writes and force the FS to wait on the buffers. The key to keeping performance up is to make it easy for the FS to do as much as possible before the fsync call.
So, if your application modifies 3 files, and you want to make sure all 3 changes are safely on disk:
 write(file1)
 write(file2)
 write(file3)
 fsync(file1)
 fsync(file2)
 fsync(file3)
is much faster than:
 write(file1)
 fsync(file1)
 write(file2)
 fsync(file2)
 write(file3)
 fsync(file3)
It is also faster to write over existing bytes in the file than it is to append new bytes onto the end of a file. When you overwrite existing bytes in the file, you don't have to commit new metadata to disk on fsync(); the FS can just write the data blocks. This means fewer seeks. The more you write to a single file before calling fsync, the faster overall things will run.
 write(8k)
 fsync(file)
is much faster than:
 write(4k)
 fsync(file)
 write(4k)
 fsync(file)
Trying to optimize for those 3 things alone can make a huge performance difference overall.

Answer from Josh MacDonald: You have to understand that even using fsync() after every write() makes no guarantees. If the system crashes during either the write or the fsync operation, your data may be lost or corrupted. Suppose the fsync() does complete: does your application keep its data in multiple files? If that is the case, and you need to write() to multiple files as part of a transaction, you have even greater problems. The only safe and easy way for you to implement some kind of transaction with the traditional file system guarantees is to use rename():
# Keep all of your data in a single file.
# Periodically write a complete copy of your database to a temporary file.
# Rename the temporary file to the original database name.
(Addition from Nikita Danilov: One can implement something like a phase-tree at user level and use rename to atomically switch the root of the tree. This overcomes the "everything-in-one-file" limitation but has the added complexity of requiring crash recovery.)

Answer from Nikita Danilov: Stop your development for now and wait until the reiser4 filesystem is released; it has a transaction API exported to userspace.
That transaction API would solve all of your problems.

== Our program needs to access a lot of working files. What is the recommended way to organize files to get the best results out of ReiserFS? Should all the files be placed in a single directory, or should the files be spread across a directory tree to limit the number of files per directory? Can you also summarize the relevant caching and locking effects? ==
Traditional file systems typically have poor performance when there are many files in a single directory, but not [[ReiserFS]]. These other file systems perform poorly because they use a linear search algorithm to find and replace entries in a directory. This means that the file system must scan, on average, half the blocks of a directory for every access. Typically, applications are required to work around this problem by manually structuring a tree of directories, allowing each individual directory to remain limited in size. For example, see how the Squid web proxy stores a large collection of files. ReiserFS does not have this problem because it uses an internal tree to store all directories and file metadata. Directory operations remain efficient even for very large directories, so you can write your application free from this performance concern. However, there are several issues that complicate this matter: namely locking and locality. The Linux VFS currently imposes locking restrictions that serialize many operations on directories, so if concurrent processes or threads will access the collection of files then you may be better off using multiple directories. [[Reiser4]] will improve upon this restriction, although it is still under development. ReiserFS attempts to store all of the files in a directory, along with the directory itself, in nearby locations on disk. An application may exploit this spatial locality if it can predict which files will be accessed with temporal locality.
You may be better of using multiple directories to store your files if you can predict that many files within a directory will be accessed at the same time. To summarize, ReiserFS supports efficient access to large directories and most traditional file systems do not. However, locking and locality issues may guide your decision to use manually structured directory trees instead, at least until ReiserFS exports control over packing locality to users, and improves its locking. [[category:ReiserFS]] [[category:Reiser4]] 7388863026cd54c29f2eacaf86f0947bc4ce940f 1480 1473 2009-06-27T05:45:54Z Chris goe 2 sync/sync? This FAQ is very [[ReiserFS]] centric and often a bit dated. The [[Reiser4]] filesystem is mentioned as ''upcoming''. Be sure to search the [[mailinglists|mailing list archives]] and help update this FAQ - Thanks! __TOC__ === What are the specs for ReiserFS: maximum number of files, of files a directory can have, of sub-dirs in a dir, of links to a file, maximum file size, maximum filesystem size, etc.? === Specifications for [[ReiserFS]]: {|cellpadding="5" cellspacing="0" border="1" | '''property''' || '''3.5''' || '''3.6''' |- | max number of files || 232-3 => 4 Gi - 3 || 232-3 => 4 Gi-3 |- | max number files a dir can have || 518701895 (but in practice this value is limited by hash function. r5 hash allows about 1 200 000 file names without collisions) || 232 - 4 => 4 Gi - 4 (but in practice this value is limited by hash function. r5 hash allows about 1 200 000 file names without collisions) |- | max file size || 231-1 => 2 Gi-1 || 260 - bytes => 1 Ei, but page cache limits this to 8 Ti on architectures with 32 bit int |- | max number links to a file || 216 => 64 Ki || 232 => 4 Gi |- | max filesystem size || 232 (4K) blocks => 16 Ti || 232 (4K) blocks => 16 Ti |} ReiserFS does '''meta-data journaling''', enabling fast crash recovery without the expense of full '''data journaling'''. 
There [ftp://ftp.suse.com/pub/people/mason/patches/intermezzo-alpha/ were] separate [http://marc.info/?l=reiserfs-devel&m=100895310422415&w=2 patches from Chris Mason] that implement full data journaling for ReiserFS for Linux 2.4.16. '''Note''': Full data journaling is considered by many to be a good way to achieve file data integrity across system crashes. However, although file data may appear to be consistent from the kernel point of view, since there is no API exported to the userspace to control transactions, we may end-up in a situation where the application makes two write requests (as part of one logical transaction) but only one of these gets journaled before the system crashes. From the application point of view, we may then end up with inconsistent data in the file. Such issues should be addressed with the upcoming [[Reiser4]]. Such an API will be exported to userspace and all programs that need transactions will be able to use it. === Mount fails after reiserfsck --rebuild-tree failure === When [[reiserfsck]] --rebuild-tree is run, the first thing it does is to set the root inode value to -1. This makes the filesystem unmountable. (So, if [[reiserfsck]] will fail later on, because it contains serious errors, this filesystem could not be mounted.) Therefore once [[reiserfsck]] --rebuild-tree have failed for one of your filesystems, mounting of this partition is disabled. To correct the error you must check if you are have the latest [[reiserfsprogs]] package installed. If that fails, please send a bug report to our [[mailinglists|mailing list]] and be ready to answer our questions. === Why is the execution time for a <tt>find . -type f | xargs cat {} \;</tt> command much longer when using ReiserFS than for the same command when using ext2? 
=== This effect is observed if the measured file set was produced by untarring some archive created not from a ReiserFS partition (or by copying files from a non-ReiserFS partition or by running a program that writes a bunch of files in some order). This is because the <tt>readdir()</tt> operation performed on the ReiserFS partition returns filenames not in the original write order but rather in some hash order (dependant on the hash function used). Thus when reading files' contents, the hard drive heads must move when going from one file to another. If you want ReiserFS to outperform any other filesystem in your setup here is one solution: Copy the entire directory that you are not satisfied with to the same partition but with a different name (use <tt>cp -a</tt>), then remove the old directory and rename the new one with the old name. If the partition does not have enough space available, another approach is to <tt>tar(1)</tt> up the whole partition, clear it, and then untar the previously saved data. === Is quota-support built-in in the vanilla 2.4 kernels for ReiserFS? === No, quota support for Linux kernels for the 2.4 branch are bundled separately and were available once at [ftp://ftp.suse.com/pub/people/mason/patches/reiserfs/quota-2.4/ at SuSE] (gone) by Chris Mason, they are still [http://gd.tuwien.ac.at/utils/fs/reiserfs/quota-patches/ mirrored at TU-Wien]. The reason these patches were not included into 2.4 kernel branch is because they implement new quota format and need new quota code too, which is too big of a change for 2.4 series of kernels. Various Linux distributions vendors (ie [http://www.suse.com SuSE]) do ship reiserfs-quota enabled kernels, though. 
=== I am getting some errors in my kernel logs, that I do not know how to interpret === Messages like: vs-13070: reiserfs_read_inode2: i/o failure occurred trying to find stat data of [1718696 1718710 0x0 SD]" zam-7001: io error in reiserfs_find_entry most likely accompanied with samples below are definite signs of harddisk problems (bad sectors): hda: dma_intr: status=0x51 { DriveReady SeekComplete Error } hda: dma_intr: error=0x40 { UncorrectableError }, LBAsect=6599945, sector=4286584 end_request: I/O error, dev 03:03 (hda), sector 4286584 or scsi0: ERROR on channel 0, id 1, lun 0, CDB: Read (10) 00 00 01 ee 60 00 00 08 00 Current sd 08:00: sense key Medium Error or I/O error: dev 08:21, sector 65704 Messages about <tt>"access beyond end of device"</tt> may have lots of different reasons starting from not rebooting after fdisk requested it, unfinished resizings, data corruptions. The following messages mean you have a noisy IDE cable, or it is just too low quality for choosen UDMA mode. Try to replace the cable with better one, or choose slower UDMA mode: hda: dma_intr: status=0x51 { DriveReady SeekComplete Error } hda: dma_intr: error=0x84 { DriveStatusError BadCRC } hda: dma_intr: status=0x51 { DriveReady SeekComplete Error } hda: dma_intr: error=0x84 { DriveStatusError BadCRC } If you see any message from [[ReiserFS]] that you cannot interpret and there is nothing similar to messages above around it, [[mailinglists|mail the message to us]] and we will explain it to you. === Will ReiserFS implement streams, extended attributes, etc.? === [[FAQ/streams|Here]] is the one page answer. === Reiserfs appears to be very slow while the RAID is resyncing. Mounting takes several minutes. Once mounted, an <tt>ls(1)</tt> in the mounted directory hangs. Forever. Once the RAID is sync'ed, things appear to work pretty well. How that can be fixed? === First of all we have included a patch that helps mounting the drive faster into linux kernel since 2.4.19. 
You can grab the patch for earlier kernels [http://gd.tuwien.ac.at/utils/fs/reiserfs/reiserfs-for-2.5/2.5.4.pending/07-reiserfs-bitmap-journal-read-ahead.diff here]. Also RAID drivers have '''minimal guaranteed''' and '''maximal possible''' RAID rebuild bandwidth usage. These valueas are controlled through <tt>/proc/sys/dev/raid/speed_limit_min</tt> and <tt>/proc/sys/dev/raid/speed_limit_max</tt> sysctl variables (values are in 100 KiB/s). It seems that RAID logic cannot always understand if the disk sysbsystem busy or not at a given time. When it thinks disk subsystem is idle, it tries to rebuild the raid array at <tt>speed_limit_max</tt> speed which defaults to 100 MB per second. Decrease this value to something more suitable (a bit of experimentation might be needed). === I get attempt to read past the end of the partition error messages; is ReiserFS corrupted? === You changed your partition sizes, and then before rebooting ran [[mkreiserfs]]. The kernel does not change its belief in what the partition sizes are until reboot time. (This is fixable, but nobody has fixed it as of Dec. 2001). [[mkreiserfs]] created a filesystem that has the wrong notion of how large the partition it is on is. The filesystem's notion of what the partition boundaries are will last past reboot even though the kernel's notion will change. So yes, it is corrupted. Also some other kinds of metadata breakage can lead to such messages. === Can I use VMware with ReiserFS? === VMware was tested on [http://www.suse.com/ SuSE Linux] with [http://support.microsoft.com/gp/lifean18 Windows98] Guest OS on a [[ReiserFS]] partition. There's one trick at the beginning: the following line was added to the VMware config file host.FSSupportLocking1 = 0x52654973 # (0x52654973 == *(u32 *) "ReIs") Thanks to [mailto:gkade@bigbrother.net Gregory K. Ade] for this hint. === How do I install Debian potato with ReiserFS as root partition? 
=== [[FAQ/potato_part|Here]] are instructions by [mailto:LeBlanc@mcc.ac.uk Dr. A.V. Le Blanc]. === Starting with linux kernel v2.4.21 I cannot mount my FS anymore. Why? === Special sanity checks were added to kernel code to prohibit mounting of filesystems that are bigger then underlying block device. If you now see this message on mount: Filesystem on xx:yy cannot be mounted because it is bigger than the device you may need to run fsck or increase size of your LVM partition. Or may be you forgot to reboot after fdisk when it told you to If you do not use LVM, that usually means you need to run <tt>[[reiserfsck]] --rebuild-sb</tt> on your filesystem and agree to change its default size to proposed one. === Is it ok to use ReiserFS on a small size storage device: e.g. 16MB NAND flash block device? === [[FAQ/small_blocks|Here]] are instructions. === How do I change root from ext2 to ReiserFS without loss of data? === [[FAQ/change_fs|Here]] are instructions. === <tt>mount: /dev/hda5 has wrong major or minor number</tt> - what does that mean? === The kernel does not know anything about [[ReiserFS]], it is neither compiled in nor available as a module. === Will it be possible to read/write ReiserFS partitions created now with future versions of ReiserFS? === Yes. [[ReiserFS]]-3.6.x (Linux-2.4.x) works with both the old (3.5) and the new (3.6) formats. ReiserFS-3.5.x (Linux-2.2.x) can only work with the old (3.5) disk-format. There is no way to convert the new (3.6) disk-format to the old (3.5), but the old (3.5) format could be converted to the new one (3.6) with the <tt>"-o conv</tt> [[mount|mount option]]. === The ReiserFS module doesn't insert properly - why? === After applying the patch, ''recompile'' the whole kernel including the modules target, reboot, then try to insert the module. === Can I use ReiserFS with the software RAID? === Yes, for all RAID levels using any Linux >= 2.4.1, but '''DO NOT''' use RAID with Linux 2.2.x. 
Our journaling and their RAID code step on each other in the buffering code. Also, mirroring is '''not''' safe in the 2.2.x kernels because online mirror rebuilds in 2.2.x break the write-ordering requirements for the log. If you crash in the middle of an online rebuild, your meta-data may be corrupted. The only RAID level that is safe with [[ReiserFS]] in the 2.2.x kernels is the striping/concatenation level. === Can I use ReiserFS with 3ware RAID? === Yes, but you need to use Linux 2.2.19 or later for reasons other than [[ReiserFS]]. If you encounter problems, be suspicious that the bug might not be in ReiserFS. See these [http://web.archive.org/web/20030415160519/http://www.3ware.com/support/raid5techbulletin.shtml special instructions] (archive.org). === Why do things freeze on my IDE hard drive for annoying amounts of time? === Because when large writes are scheduled all at once, reads can starve. A fix for this is evolving; the later your ReiserFS patch, the better we handle this. === <tt>du(1)</tt> says ReiserFS makes space efficiency worse. === Use <tt>df(1)</tt>, not <tt>du(1)</tt>, or use the ''raw'' option for <tt>du(1)</tt> if it is supported. <tt>st_blocks</tt> summed up is less accurate than <tt>st_size</tt> for [[ReiserFS]] because we pack tails, and <tt>st_blocks</tt> rounds numbers up. === <tt>mkreiserfs(8)</tt> fails after repartitioning === The kernel requires you to reboot after repartitioning (for all filesystems). We intend to fix that. === Performance is poor, and my disk at 96% full still has free space. === Once a disk drive gets more than 85% full, performance starts to suffer unless you use a repacker (which isn't implemented yet). You can probably get away with 92%, but if performance is valued, you are making a mistake to keep it any fuller. This is true for almost all filesystems.
[[ReiserFS]], because it packs tails together, packs more data into a given percentage used, but it is still subject to the rules for the maximum recommended percentage used. If you write the whole disk in one copy operation and then mount it read-only, you can fully pack it without problems. Please be sure that you copy it from (or <tt>tar</tt> it from) a reiserfs partition so that files are created in reiserfs <tt>readdir()</tt> order, as this will improve performance. === Why do I get a signal 11 when compiling the kernel using ReiserFS and not ext2? === Your CPU is overheating and/or you have [http://www.bitwizard.nl/sig11/ bad RAM]. === But it doesn't happen with ext2? === ext2 uses less heat-sensitive gates in the CPU :-) Seriously, ext2 and [[ReiserFS]] contain random differences, and overheating and bad RAM have random sensitivities. ([http://www.bitwizard.nl/sig11/ Signal 11] is not due to ReiserFS. One user had a cable blocking the fan; it did not affect ext2, but it wasn't until he fixed the cable-fan problem that ReiserFS worked.) === Can I use ReiserFS on other architectures than i386? === Yes; starting from the Linux [http://kernel.org/pub/linux/kernel/v2.4/ChangeLog-2.4.13 kernel 2.4.13], ReiserFS can run on any Linux-supported arch. === I need a program which will help me in rebuilding/recreating my partition table. === [http://brzitwa.de/mb/gpart/ gpart] is a utility that handles ext2, FAT, Linux swap, HPFS, NTFS, FreeBSD and Solaris/x86 disklabels, Minix, and ReiserFS. It prints a proposed content for the primary partition table and is well documented. === What partition type should I use for ReiserFS? === [http://www.win.tue.nl/~aeb/partitions/partition_types.html Linux native filesystem] (83) === Can I use 32GB+ IDE Hard Drives with ReiserFS? === Yes, if you use Linux kernel 2.4 and up. === What about resizing ReiserFS? === This can be done with [[resize_reiserfs]].
=== What should I put into the fifth (aka dump, <tt>fs_freq</tt>) and the sixth (aka pass, <tt>fs_passno</tt>) fields of /etc/fstab for ReiserFS filesystems? === You'd put in <tt>"0 0"</tt>, e.g.
 /dev/sda3 /var reiserfs notail,nodev,nosuid,noexec <font color="red">0 0</font>
=== Why are ReiserFS filesystems not fscked on reboot after a crash? === Because [[ReiserFS]] provides journaling of meta-data. After a crash, the consistency of a filesystem is restored by replaying the transaction log. === Can I interactively repair a filesystem that was corrupted? === This is done with [[reiserfsck]]. === Can I use <tt>dump(8)</tt> and <tt>restore(8)</tt> with ReiserFS? Any caveats? === No. <tt>dump(8)</tt> uses knowledge of the internal structure of ext2 and works together with restore, which also uses ext2-specific knowledge, to back up ext2 files. dump and restore are specific to ext2 and will not work with [[ReiserFS]]. To back up ReiserFS files use <tt>tar(1)</tt>, which is universal and can be applied to almost any reasonable Linux filesystem. It is well known among system administrators that <tt>dump(8)</tt> is more complete than Unix tar, and that there is quite a list of things that Unix tar will fail to back up properly. This is not true of GNU tar, which is quite complete. Basically, the only real disadvantage of GNU tar compared to <tt>dump(8)</tt> is speed. Unfortunately, because it shares the same name as Unix <tt>tar(1)</tt>, people are reluctant to believe this. (Yes, GNU tar has incremental backups, etc.) We will performance-optimize ReiserFS backups for you (and the rest of the world) for $30K, which is not a lot if you are a large site spending a few hundred thousand on equipment for backups. === Does ReiserFS support snapshots? === No, but you can create [[ReiserFS]] on top of an [http://sourceware.org/lvm2/ LVM] logical volume and use LVM snapshot capabilities. === Can I check reiserfs filesystems for errors without unmounting them?
=== [[reiserfsck]] in checking mode may be run on filesystems mounted read-only. There is no official way to fix mounted filesystems, though. You MUST completely unmount your filesystem in order to have it fixed. If you have LVM, you can check the consistency of filesystems mounted read-write using a script contributed by Andreas Dilger. === What ReiserFS mount options should I use to get the performance winner for a mail server? === [http://archives.neohapsis.com/archives/postfix/2001-03/1148.html Craig Sanders answered] in detail: By the time I got around to running <tt>bonnie</tt>, the <tt>postmark</tt> and <tt>postal</tt> benchmarks had convinced me that <tt>notail</tt> was essential. Host system:
* Debian GNU/Linux (of course :)
* Linux kernel 2.4.2 with the latest 20010305 ReiserFS patch
* dual P3-866 (256K cache)
* 512MB RAM
* [http://www.adaptec.com/en-US/support/scsi/u160/ASC-19160/ Adaptec 19160] SCSI controller
External drive box:
* [http://www.domex.com.tw/support/product/8230u.htm Domex 8230u] RAID controller, 32MB battery-backed cache
* 6 x 18GB IBM [http://www.hitachigst.com/tech/techlib.nsf/techdocs/85256AB8006A31E587256A78005A3610/$file/ddys_sp21.PDF DDYS-T18350M] drives
For the particular hardware I was using, [[ReiserFS]]/notail on RAID5 was the clear performance winner for a mail server with lots of synced random I/O. === Does using ReiserFS mean I can just press the power-off button without running <tt>/sbin/shutdown</tt>? Does it mean there is no risk of data loss? === No, definitely not. As of now, [[ReiserFS]] only provides meta-data journaling - that is, it records which files have been created or opened, whether they have had their size changed, or where they have been relocated. It guarantees that the structure of the internal ReiserFS tree will be correct, thereby allowing you to start back up after an unclean shutdown without having to run fsck on all the files that have not been changed.
Data in files that were being used at the time of the crash could have been corrupted. This is usual for most filesystems. Data-journaling filesystems guarantee that no garbage will be written into a file, but they don't guarantee that a file update will be completed. (Only [[Reiser4]] guarantees that filesystem operations are performed as atomic operations, and provides atomic transaction functionality.) [[ReiserFS]] does not guarantee that the file contents themselves are uncorrupted, nor that no data is lost. Moreover, even if all of your system is on ReiserFS, many system components (like daemons, database managers, etc.) require the shutdown procedure for proper functioning. However, there is a [ftp://ftp.suse.com/pub/people/mason/patches/data-logging separate implementation of data logging] (link dead) that will [http://marc.info/?l=reiserfs-devel&m=103472026011689&w=2 soon] go into the mainstream kernel. === How does ReiserFS support bad block handling? === This is covered [[FAQ/bad-block-handling|here]]. === I have a motherboard with VIA MVP3 chipset and experience ReiserFS problems. === [mailto:woster73@yahoo.com William Oster] answers: If you are using a motherboard with a [http://www.via.com.tw/en/products/apollo/mvp3.jsp VIA MVP3] chipset, you may have [[ReiserFS]] problems caused by the way your kernel is configured for the so-called [http://lxr.linux.no/linux+v2.6.30/drivers/pci/quirks.c PCI quirks]. My experience is with kernels 2.2.18 and 2.2.19, but it may affect the 2.4.x series too if you are using the MVP3 chipset (popular in socket 7 type motherboards, such as those used by the AMD K6 and classic Pentium). I've confirmed this problem with several motherboards using the VIA MVP3 chipset, ReiserFS 3.5.29 to 3.5.32, and [http://lxr.linux.no/linux+v2.6.30/Documentation/scsi/ncr53c8xx.txt NCR 53c8xx SCSI]. But please note: it probably affects '''any controller which uses DMA and PCI bus mastering'''.
Problems which I was inclined to attribute to ReiserFS were actually problems with this kernel (mis)configuration. If you fit this profile, '''DO NOT''' enable the <tt>CONFIG_PCI_QUIRKS</tt> configuration option in the <tt>/usr/src/linux/.config</tt> file. Although the Linux documentation suggests that this option can be enabled if in doubt, '''DO NOT''' enable it. It was never intended for the VIA MVP3 chipset anyway. It affects the way DMA is handled, and the combination with ReiserFS (and possibly NCR SCSI) can cause random disk corruption which will eventually result in ReiserFS and/or SCSI errors. Evidently ReiserFS exercises the DMA and SCSI bus very thoroughly; the problems seem less likely under the ext2 filesystem. Check your <tt>/usr/src/linux/.config</tt> file. You are safe from this problem if you find this line:
 # CONFIG_PCI_QUIRKS is not set
Any other setting could be dangerous for MVP3 chipset ReiserFS users, especially when using PCI bus-mastering controllers such as the NCR 53c8xx series. Re-configure your kernel to disable the "PCI quirks" option, then <tt>make dep</tt>, rebuild, and reinstall. === I am having extensive problems using ReiserFS; it seems to have bugs all over the place. I'm not compiling with a buggy compiler. What is happening? How can this be stable? === You have hardware problems. Really, you do. Even if the bugs don't show up with ext2, you have hardware problems. (See [[#Why_do_I_get_a_signal_11_when_compiling_the_kernel_using_ReiserFS_and_not_ext2?|the signal 11 question]].) Most SuSE users use ReiserFS. Obscure bugs probably still exist; but if you find bugs as easily as when using Windows, you have bad RAM, a bad CPU, a bad cable, bad cooling, a [[#I have a motherboard with VIA MVP3 chipset and experience ReiserFS problems.|VIA chipset with PCI quirks turned on]], or other hardware or software-layer bugs. ReiserFS is stable.
You can be sure that if bugs are encountered easily and commonly with normal usage patterns, it is not us. This does not mean that the next release won't somehow break something, though :-/ Real bug reports are, at the time of writing, outnumbered 10 to 1 by hardware bugs that trigger error messages. We are working on making our error messages better at catching hardware bugs and identifying them as such. There is only so far we can go in runtime consistency checking, though, without serious speed reductions. We don't release software unless it goes through extensive testing; so if you don't think that our testing could have missed the bug, it is probably hardware. === How can I put a label (like allowed by the <tt>-L</tt> option of <tt>mkfs.ext2</tt>) on a ReiserFS instance? === Currently, this feature is only implemented for the [[ReiserFS]] v3.6 disk format. Adding it to the v3.5 disk format would break the existing disk format, and there is not enough free space in the superblock. You can set a label (and UUID) with a recent [[reiserfsprogs]] package on a [[ReiserFS]] v3.6 filesystem using the <tt>-l</tt> switch (<tt>-u</tt> for UUID) to the [[reiserfstune]] (for existing partitions) or [[mkreiserfs]] (for partitions being created) commands. Support for labels and UUIDs was integrated into [[reiserfsprogs]] starting from version 3.x.1a. === Why, when I'm working on files (i.e. having open files) on my laptop, does ReiserFS access the disk every 5 seconds? This effectively prevents the disk from spinning down, i.e. APM modes taking over, even when I'm not writing anything. === [mailto:bgraveland@hyperchip.com Brent Graveland] answers: It's the [http://kerneltrap.org/node/14148 atime] update. Every time you run <tt>sync(1)</tt>, the sync program's <tt>atime</tt> is updated. The next <tt>sync()</tt> writes this <tt>atime</tt> update, and then <tt>sync(1)</tt>'s atime gets updated again. === RedHat does not unmount / with ReiserFS on halt. How do I fix it?
=== RedHat users kindly provided these patches (not tested by us): rc.sysinit.patch and halt.patch. Note that if you have RedHat Linux 7.2 or later, you do not need these patches. === How do I run programs from the reiserfsprogs package on encrypted devices? === In order to access such encrypted entities you need to use the <tt>losetup</tt> tool to bind your entity to a loop device. === Are there any recommendations for or against any particular hard drive manufacturer for use with reiserfs? === There is basically no preference; the general "the faster the drive and the lower the seek time, the better" rule applies as always. On the other hand, almost every hard drive manufacturer has a "widely known" broken series of hard drives. The most recent example is IBM's "Deskstar" series, especially the DTLA models produced in Hungary in 2000-2001. These are known to fail very often, to the point that you probably don't want to use them even if you already paid for them. Other Deskstar drives also seem to be a poor choice. IBM released a note that Deskstar drives should not run for more than 8 hours/day on average. These drives are also known to be very sensitive to temperature conditions and to fail on overheating. A class action lawsuit against IBM over that drive series is in progress. === I am using RedHat 7.0 with gcc 2.96; why does ReiserFS seem unstable with it? === Use the most recent version of RedHat (gcc 2.96-85 or later, as shipped with RedHat 7.2; 7.1 is also okay for ReiserFS). The choice of an unstable, unreleased version of gcc 2.96 by RedHat as the default gcc was a Slashdot controversy. gcc 2.96 on RedHat 7.0 was unstable, and ReiserFS was one of the things that would fail with it. There are two gcc versions: 2.96 and 2.96-85. 2.96-85 works for ReiserFS; the other (the one on RedHat 7.0) surely does not. Read the Linux kernel instructions about what compiler to use.
The solution to code not working on broken compilers is the one RedHat has taken: fix the compiler. They fixed the compiler and thereby allowed the correctly compiled ReiserFS to work. === In my program I am using fsync(2) calls after each write to the file to guarantee the integrity of my file data, and this is very slow; how can I improve the performance? === Answer from Chris Mason: The main thing to remember is that fsyncs introduce a bunch of disk writes, and force the FS to wait on the buffers. The key to keeping performance up is to make it easy for the FS to do as much as possible before the fsync call. So, if your application modifies 3 files, and you want to make sure all 3 changes are safely on disk:
 write(file1)
 write(file2)
 write(file3)
 fsync(file1)
 fsync(file2)
 fsync(file3)
is much faster than:
 write(file1)
 fsync(file1)
 write(file2)
 fsync(file2)
 write(file3)
 fsync(file3)
It is also faster to write over existing bytes in a file than it is to append new bytes onto the end of a file. When you overwrite existing bytes in the file, you don't have to commit new metadata to disk on fsync(); the FS can just write the data blocks. This means fewer seeks. The more you write to a single file before calling fsync, the faster overall things will run.
 write(8k)
 fsync(file)
is much faster than:
 write(4k)
 fsync(file)
 write(4k)
 fsync(file)
Trying to optimize for those 3 things alone can make a huge performance difference overall. Answer from Josh MacDonald: You have to understand that even using fsync() after every write() makes no guarantees. If the system crashes during either the write or the fsync operation, your data may be lost or corrupted. Suppose the fsync() does complete; does your application keep its data in multiple files? If that is the case and you need to write() to multiple files as part of a transaction, you have even greater problems.
The only safe and easy way for you to implement some kind of transaction with the traditional file system guarantees is to use rename():
# Keep all of your data in a single file.
# Periodically write a complete copy of your database to a temporary file.
# Rename the temporary file to the original database name.
(Addition from Nikita Danilov: One can implement something like a phase-tree at user level and use rename to atomically switch the root of the tree. This overcomes the "everything-in-one-file" limitation but adds the complexity of requiring crash recovery.) Answer from Nikita Danilov: Stop your development for now and wait until the reiser4 filesystem is released; it will have a transaction API exported to userspace. That transaction API would solve all of your problems. == Our program needs to access a lot of working files. What is the recommended way to organize files to get the best results out of ReiserFS? Should all the files be placed in a single directory, or should the files be spread across a directory tree to limit the number of files per directory? Can you also summarize the relevant caching and locking effects? == Traditional file systems typically have poor performance when there are many files in a single directory, but not [[ReiserFS]]. Those other file systems perform poorly because they use a linear search algorithm to find and replace entries in a directory. This means that the file system must scan, on average, half the blocks of a directory for every access. Typically, applications are required to work around this problem by manually structuring a tree of directories, allowing each individual directory to remain limited in size. For example, see how the Squid web proxy stores a large collection of files. ReiserFS does not have this problem because it uses an internal tree to store all directories and file metadata.
Directory operations remain efficient even for very large directories, so you can write your application free from this performance concern. However, there are several issues that complicate this matter: namely locking and locality. The Linux VFS currently imposes locking restrictions that serialize many operations on directories, so if concurrent processes or threads will access the collection of files then you may be better off using multiple directories. [[Reiser4]] will improve upon this restriction, although it is still under development. ReiserFS attempts to store all of the files in a directory, along with the directory itself, in nearby locations on disk. An application may exploit this spatial locality if it can predict which files will be accessed with temporal locality. You may be better off using multiple directories to store your files if you can predict that many files within a directory will be accessed at the same time. To summarize, ReiserFS supports efficient access to large directories and most traditional file systems do not. However, locking and locality issues may guide your decision to use manually structured directory trees instead, at least until ReiserFS exports control over packing locality to users and improves its locking. [[category:ReiserFS]] [[category:Reiser4]] This FAQ is very [[ReiserFS]] centric and often a bit dated. The [[Reiser4]] filesystem is mentioned as ''upcoming''. Be sure to search the [[mailinglists|mailing list archives]] and help update this FAQ - thanks! === What are the specs for ReiserFS: maximum number of files, of files a directory can have, of sub-dirs in a dir, of links to a file, maximum file size, maximum filesystem size, etc.?
=== Specifications for [[ReiserFS]]:
{|cellpadding="5" cellspacing="0" border="1"
| '''property''' || '''3.5''' || '''3.6'''
|-
| max number of files || 2<sup>32</sup> - 3 => 4 Gi - 3 || 2<sup>32</sup> - 3 => 4 Gi - 3
|-
| max number of files a dir can have || 518701895 (but in practice this value is limited by the hash function; the r5 hash allows about 1 200 000 file names without collisions) || 2<sup>32</sup> - 4 => 4 Gi - 4 (but in practice this value is limited by the hash function; the r5 hash allows about 1 200 000 file names without collisions)
|-
| max file size || 2<sup>31</sup> - 1 bytes => 2 Gi - 1 || 2<sup>60</sup> bytes => 1 Ei, but the page cache limits this to 8 Ti on architectures with a 32-bit int
|-
| max number of links to a file || 2<sup>16</sup> => 64 Ki || 2<sup>32</sup> => 4 Gi
|-
| max filesystem size || 2<sup>32</sup> (4K) blocks => 16 Ti || 2<sup>32</sup> (4K) blocks => 16 Ti
|}
ReiserFS does '''meta-data journaling''', enabling fast crash recovery without the expense of full '''data journaling'''. There [ftp://ftp.suse.com/pub/people/mason/patches/intermezzo-alpha/ were] separate [http://marc.info/?l=reiserfs-devel&m=100895310422415&w=2 patches from Chris Mason] that implement full data journaling for ReiserFS for Linux 2.4.16. '''Note''': Full data journaling is considered by many to be a good way to achieve file data integrity across system crashes. However, although file data may appear to be consistent from the kernel's point of view, since there is no API exported to userspace to control transactions, we may end up in a situation where the application makes two write requests (as part of one logical transaction) but only one of these gets journaled before the system crashes. From the application's point of view, we may then end up with inconsistent data in the file. Such issues should be addressed by the upcoming [[Reiser4]]: a transaction API will be exported to userspace and all programs that need transactions will be able to use it.
=== Mount fails after reiserfsck --rebuild-tree failure === When [[reiserfsck]] --rebuild-tree is run, the first thing it does is set the root inode value to -1. This makes the filesystem unmountable. (So, if [[reiserfsck]] fails later on because the filesystem contains serious errors, the filesystem cannot be mounted.) Therefore, once [[reiserfsck]] --rebuild-tree has failed for one of your filesystems, mounting of this partition is disabled. To correct the error, first check that you have the latest [[reiserfsprogs]] package installed. If that fails, please send a bug report to our [[mailinglists|mailing list]] and be ready to answer our questions. === Why is the execution time for a <tt>find . -type f -exec cat {} \;</tt> command much longer when using ReiserFS than for the same command when using ext2? === This effect is observed if the measured file set was produced by untarring an archive created not from a ReiserFS partition (or by copying files from a non-ReiserFS partition, or by running a program that writes a bunch of files in some order). This is because the <tt>readdir()</tt> operation performed on the ReiserFS partition returns filenames not in the original write order but rather in some hash order (dependent on the hash function used). Thus, when reading the files' contents, the hard drive heads must move when going from one file to another. If you want ReiserFS to outperform any other filesystem in your setup, here is one solution: copy the entire directory that you are not satisfied with to the same partition but with a different name (use <tt>cp -a</tt>), then remove the old directory and rename the new one to the old name. If the partition does not have enough space available, another approach is to <tt>tar(1)</tt> up the whole partition, clear it, and then untar the previously saved data. === Is quota support built into the vanilla 2.4 kernels for ReiserFS?
=== No; quota support for the 2.4 kernel branch is bundled separately. The patches by Chris Mason were once available [ftp://ftp.suse.com/pub/people/mason/patches/reiserfs/quota-2.4/ at SuSE] (gone) and are still [http://gd.tuwien.ac.at/utils/fs/reiserfs/quota-patches/ mirrored at TU-Wien]. The reason these patches were not included in the 2.4 kernel branch is that they implement a new quota format and need new quota code too, which is too big of a change for the 2.4 series of kernels. Various Linux distribution vendors (e.g. [http://www.suse.com SuSE]) do ship reiserfs-quota-enabled kernels, though. === I am getting some errors in my kernel logs that I do not know how to interpret === Messages like:
 vs-13070: reiserfs_read_inode2: i/o failure occurred trying to find stat data of [1718696 1718710 0x0 SD]
 zam-7001: io error in reiserfs_find_entry
most likely accompanied by samples like those below, are definite signs of hard disk problems (bad sectors):
 hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
 hda: dma_intr: error=0x40 { UncorrectableError }, LBAsect=6599945, sector=4286584
 end_request: I/O error, dev 03:03 (hda), sector 4286584
or
 scsi0: ERROR on channel 0, id 1, lun 0, CDB: Read (10) 00 00 01 ee 60 00 00 08 00
 Current sd 08:00: sense key Medium Error
or
 I/O error: dev 08:21, sector 65704
Messages about <tt>"access beyond end of device"</tt> can have many different causes, ranging from not rebooting after fdisk requested it, to unfinished resizes, to data corruption. The following messages mean you have a noisy IDE cable, or it is just too low quality for the chosen UDMA mode.
Try to replace the cable with better one, or choose slower UDMA mode: hda: dma_intr: status=0x51 { DriveReady SeekComplete Error } hda: dma_intr: error=0x84 { DriveStatusError BadCRC } hda: dma_intr: status=0x51 { DriveReady SeekComplete Error } hda: dma_intr: error=0x84 { DriveStatusError BadCRC } If you see any message from [[ReiserFS]] that you cannot interpret and there is nothing similar to messages above around it, [[mailinglists|mail the message to us]] and we will explain it to you. === Will ReiserFS implement streams, extended attributes, etc.? === [[FAQ/streams|Here]] is the one page answer. === Reiserfs appears to be very slow while the RAID is resyncing. Mounting takes several minutes. Once mounted, an <tt>ls(1)</tt> in the mounted directory hangs. Forever. Once the RAID is sync'ed, things appear to work pretty well. How that can be fixed? === First of all we have included a patch that helps mounting the drive faster into linux kernel since 2.4.19. You can grab the patch for earlier kernels [http://gd.tuwien.ac.at/utils/fs/reiserfs/reiserfs-for-2.5/2.5.4.pending/07-reiserfs-bitmap-journal-read-ahead.diff here]. Also RAID drivers have '''minimal guaranteed''' and '''maximal possible''' RAID rebuild bandwidth usage. These valueas are controlled through <tt>/proc/sys/dev/raid/speed_limit_min</tt> and <tt>/proc/sys/dev/raid/speed_limit_max</tt> sysctl variables (values are in 100 KiB/s). It seems that RAID logic cannot always understand if the disk sysbsystem busy or not at a given time. When it thinks disk subsystem is idle, it tries to rebuild the raid array at <tt>speed_limit_max</tt> speed which defaults to 100 MB per second. Decrease this value to something more suitable (a bit of experimentation might be needed). === I get attempt to read past the end of the partition error messages; is ReiserFS corrupted? === You changed your partition sizes, and then before rebooting ran [[mkreiserfs]]. 
The kernel does not change its belief in what the partition sizes are until reboot time. (This is fixable, but nobody has fixed it as of Dec. 2001). [[mkreiserfs]] created a filesystem that has the wrong notion of how large the partition it is on is. The filesystem's notion of what the partition boundaries are will last past reboot even though the kernel's notion will change. So yes, it is corrupted. Also some other kinds of metadata breakage can lead to such messages. === Can I use VMware with ReiserFS? === VMware was tested on [http://www.suse.com/ SuSE Linux] with [http://support.microsoft.com/gp/lifean18 Windows98] Guest OS on a [[ReiserFS]] partition. There's one trick at the beginning: the following line was added to the VMware config file host.FSSupportLocking1 = 0x52654973 # (0x52654973 == *(u32 *) "ReIs") Thanks to [mailto:gkade@bigbrother.net Gregory K. Ade] for this hint. === How do I install Debian potato with ReiserFS as root partition? === [[FAQ/potato_part|Here]] are instructions by [mailto:LeBlanc@mcc.ac.uk Dr. A.V. Le Blanc]. === Starting with linux kernel v2.4.21 I cannot mount my FS anymore. Why? === Special sanity checks were added to kernel code to prohibit mounting of filesystems that are bigger then underlying block device. If you now see this message on mount: Filesystem on xx:yy cannot be mounted because it is bigger than the device you may need to run fsck or increase size of your LVM partition. Or may be you forgot to reboot after fdisk when it told you to If you do not use LVM, that usually means you need to run <tt>[[reiserfsck]] --rebuild-sb</tt> on your filesystem and agree to change its default size to proposed one. === Is it ok to use ReiserFS on a small size storage device: e.g. 16MB NAND flash block device? === [[FAQ/small_blocks|Here]] are instructions. === How do I change root from ext2 to ReiserFS without loss of data? === [[FAQ/change_fs|Here]] are instructions. 
=== <tt>mount: /dev/hda5 has wrong major or minor number</tt> - what does that mean? === The kernel does not know anything about [[ReiserFS]], it is neither compiled in nor available as a module. === Will it be possible to read/write ReiserFS partitions created now with future versions of ReiserFS? === Yes. [[ReiserFS]]-3.6.x (Linux-2.4.x) works with both the old (3.5) and the new (3.6) formats. ReiserFS-3.5.x (Linux-2.2.x) can only work with the old (3.5) disk-format. There is no way to convert the new (3.6) disk-format to the old (3.5), but the old (3.5) format could be converted to the new one (3.6) with the <tt>"-o conv</tt> [[mount|mount option]]. === The ReiserFS module doesn't insert properly - why? === After applying the patch, ''recompile'' the whole kernel including the modules target, reboot, then try to insert the module. === Can I use ReiserFS with the software RAID? === Yes, for all RAID levels using any Linux >= 2.4.1, but '''DO NOT''' use RAID with Linux 2.2.x. Our journaling and their RAID code step on each other in the buffering code. Also, mirroring is '''not''' safe in the 2.2.x kernels because online mirror rebuilds in 2.2.x break the write ordering requirements for the log. If you crash in the middle of an online rebuild, your meta-data may be corrupted. The only RAID level that is safe with [[ReiserFS]] in the 2.2.x kernels is the striping/concatenation level. === Can I use ReiserFS with 3ware RAID? === Yes, but you need to use Linux 2.2.19 or later for reasons other than [[ReiserFS]]. Also if you should encounter problems you should be suspicious that it might not be ReiserFS that has the bug. In [http://web.archive.org/web/20030415160519/http://www.3ware.com/support/raid5techbulletin.shtml special instructions]. (archive.org) === Why do things freeze on my IDE hard drive for annoying amounts of time? === Because when large writes are scheduled all at once, reads can starve. 
A fix for this is evolving; the later your ReiserFS patch, the better we handle this.

=== <tt>du(1)</tt> says ReiserFS makes space efficiency worse. ===

Use <tt>df(1)</tt>, not <tt>du(1)</tt>, or use the ''raw'' option for <tt>du(1)</tt> if it's supported. <tt>st_blocks</tt> summed up is less accurate than <tt>st_size</tt> for [[ReiserFS]] because we pack tails, and <tt>st_blocks</tt> rounds numbers up.

=== <tt>mkreiserfs(8)</tt> fails after repartitioning ===

The kernel requires you to reboot after repartitioning (for all filesystems). We intend to fix that.

=== Performance is poor, and my disk at 96% full still has free space. ===

Once a disk drive gets more than 85% full, performance starts to suffer unless you use a repacker (which isn't implemented yet). You can probably get away with 92%, but if performance is valued you are making a mistake to keep it any fuller. This is true for almost all filesystems. [[ReiserFS]], because we pack tails together, packs more data into a given percentage used, but it is still subject to the rules for the maximum recommended percentage used. If you create the whole disk with one copy and then mount it read-only, then you can fully pack it without problem. Be sure that you copy it from (or <tt>tar</tt> it from) a ReiserFS partition so that files are created in ReiserFS <tt>readdir()</tt> order, as this will improve performance.

=== Why do I get a signal 11 when compiling the kernel using ReiserFS and not ext2? ===

Your CPU is overheating and/or you have [http://www.bitwizard.nl/sig11/ bad RAM].

=== But it doesn't happen with ext2? ===

ext2 uses less heat-sensitive gates in the CPU :-) Seriously, ext2 and [[ReiserFS]] contain random differences, and overheating and bad RAM have random sensitivities. ([http://www.bitwizard.nl/sig11/ Signal 11] is not due to ReiserFS. One user had a cable blocking the fan; it did not affect ext2, but it wasn't until he fixed the cable-fan problem that ReiserFS worked.)
=== Can I use ReiserFS on other architectures than i386? ===

Yes. Starting from the Linux [http://kernel.org/pub/linux/kernel/v2.4/ChangeLog-2.4.13 kernel 2.4.13], ReiserFS can be run on any Linux-supported architecture.

=== I need a program which will help me in rebuilding/recreating my partition table. ===

[http://brzitwa.de/mb/gpart/ gpart] is a utility that handles ext2, FAT, Linux swap, HPFS, NTFS, FreeBSD and Solaris/x86 disklabels, Minix, and ReiserFS. It prints a proposed content for the primary partition table and is well-documented.

=== What partition type should I use for ReiserFS? ===

[http://www.win.tue.nl/~aeb/partitions/partition_types.html Linux native filesystem] (83).

=== Can I use 32GB+ IDE Hard Drives with ReiserFS? ===

Yes, if you use Linux kernel 2.4 and up.

=== What about resizing ReiserFS? ===

This can be done with [[resize_reiserfs]].

=== What should I put into the fifth (aka dump, <tt>fs_freq</tt>) and the sixth (aka pass, <tt>fs_passno</tt>) fields of /etc/fstab for ReiserFS filesystems? ===

You'd put in <tt>"0 0"</tt>, e.g.

 /dev/sda3 /var reiserfs notail,nodev,nosuid,noexec <font color="red">0 0</font>

=== Why are ReiserFS filesystems not fscked on reboot after a crash? ===

Because [[ReiserFS]] provides journaling of meta-data. After a crash, the consistency of a filesystem is restored by replaying the transaction log.

=== Can I interactively repair a filesystem that was corrupted? ===

This is done with [[reiserfsck]].

=== Can I use <tt>dump(8)</tt> and <tt>restore(8)</tt> with ReiserFS? Any caveats? ===

No. <tt>dump(8)</tt> uses knowledge of the internal structure of ext2 and works together with restore, which also uses ext2-specific knowledge, to back up ext2 files. dump and restore are specific to ext2 and will not work with [[ReiserFS]]. To back up ReiserFS files use <tt>tar(1)</tt>, which is universal and can be applied to almost any reasonable Linux filesystem.
It is well known among system administrators that <tt>dump(8)</tt> is more complete than Unix tar, and that there is quite a list of things that Unix tar will fail to properly back up. This is not true of GNU tar, which is quite complete. Basically, the only real disadvantage of GNU tar compared to <tt>dump(8)</tt> is speed. Unfortunately, because it shares the same name as Unix <tt>tar(1)</tt>, people are reluctant to believe this. (Yes, GNU tar has incremental backups, etc.) We will performance-optimize ReiserFS backups for you (and the rest of the world) for $30K, which is not a lot if you are a large site spending a few hundred thousand on equipment for backups.

=== Does ReiserFS support snapshots? ===

No, but you can create [[ReiserFS]] on top of an [http://sourceware.org/lvm2/ LVM] logical volume and use LVM snapshot capabilities.

=== Can I check reiserfs filesystems for errors without unmounting them? ===

[[reiserfsck]] in checking mode may be run over filesystems mounted read-only. There is no official way to fix mounted filesystems, though: you MUST completely unmount your filesystem in order to have it fixed. If you have LVM, you can check the consistency of filesystems mounted read-write using a script contributed by Andreas Dilger.

=== What ReiserFS mount options should I use to get the best performance for a mail server? ===

[http://archives.neohapsis.com/archives/postfix/2001-03/1148.html Craig Sanders answered] in detail: By the time I got around to running <tt>bonnie</tt>, the <tt>postmark</tt> and <tt>postal</tt> benchmarks had convinced me that <tt>notail</tt> was essential.

Host system:
* Debian GNU/Linux (of course :)
* Linux kernel 2.4.2 with latest 20010305 ReiserFS patch
* dual P3-866 (256K cache)
* 512MB RAM
* [http://www.adaptec.com/en-US/support/scsi/u160/ASC-19160/ Adaptec 19160] SCSI Controller

External drive box:
* [http://www.domex.com.tw/support/product/8230u.htm Domex 8230u] RAID controller, 32MB battery-backed cache
* 6 x 18GB IBM [http://www.hitachigst.com/tech/techlib.nsf/techdocs/85256AB8006A31E587256A78005A3610/$file/ddys_sp21.PDF DDYS-T18350M] drives

For this particular hardware, [[ReiserFS]]/notail on RAID5 was the clear performance winner for a mail server with lots of synced random I/O.

=== Does using ReiserFS mean I can just press the power off button without running <tt>/sbin/shutdown</tt>? Does it mean there is no risk of data loss? ===

No, definitely not. As of now, [[ReiserFS]] only provides meta-data journaling - that is, it records which files have been created or opened, whether they have had their size changed, or where they have been relocated. It guarantees that the structure of the internal ReiserFS tree will be correct, thereby allowing you to start back up after an unclean shutdown without having to run fsck over all the files that have not been changed. Data in files that were being used at the time of the crash could have been corrupted. This is usual for most filesystems. Data-journaling filesystems guarantee that no garbage will be written into a file, but they don't guarantee that a file update will be completed. (Only [[Reiser4]] guarantees that filesystem operations are performed as atomic operations, and provides atomic transaction functionality.) [[ReiserFS]] does not guarantee that the file contents themselves are uncorrupted nor that no data is lost. Moreover, even if all of your system is on ReiserFS, many system components (like daemons, database managers, etc.) require a proper shutdown procedure to function correctly. However, there is a [ftp://ftp.suse.com/pub/people/mason/patches/data-logging separate implementation of data logging] (dead) that will [http://marc.info/?l=reiserfs-devel&m=103472026011689&w=2 soon] go into the mainstream kernel.

=== How does ReiserFS support bad block handling? ===

This is covered [[FAQ/bad-block-handling|here]].

=== I have a motherboard with VIA MVP3 chipset and experience ReiserFS problems. ===

[mailto:woster73@yahoo.com William Oster] answers: If you are using a motherboard with a [http://www.via.com.tw/en/products/apollo/mvp3.jsp VIA MVP3] chipset, you may have [[ReiserFS]] problems caused by the way your kernel is configured for the so-called [http://lxr.linux.no/linux+v2.6.30/drivers/pci/quirks.c PCI quirks]. My experience is with kernels 2.2.18 and 2.2.19, but it may affect the 2.4.x series too if you are using the MVP3 chipset (popular in socket 7 type motherboards, such as those used by the AMD K6 and classic Pentium). I've confirmed this problem with several motherboards using the VIA MVP3 chipset, ReiserFS 3.5.29 to 3.5.32, and [http://lxr.linux.no/linux+v2.6.30/Documentation/scsi/ncr53c8xx.txt NCR 53c8xx SCSI]. But please note: it probably affects '''any controller which uses DMA and PCI bus mastering'''. Problems which I was inclined to attribute to ReiserFS were actually problems with this kernel (mis)configuration.

If you fit this profile, '''DO NOT''' enable the <tt>CONFIG_PCI_QUIRKS</tt> configuration option in the <tt>/usr/src/linux/.config</tt> file. Although the Linux documentation suggests that this option can be enabled if in doubt, '''DO NOT''' enable it. It was never intended for the VIA MVP3 chipset anyway. It affects the way DMA is handled, and the combination with ReiserFS (and possibly NCR SCSI) can cause random disk corruption which will eventually result in ReiserFS and/or SCSI errors. Evidently ReiserFS exercises the DMA and SCSI bus very thoroughly; the problems seem less likely under the ext2 filesystem.

Check your <tt>/usr/src/linux/.config</tt> file. You are safe from this problem if you find this line:

 # CONFIG_PCI_QUIRKS is not set

Any other setting could be dangerous to MVP3 chipset ReiserFS users, especially when using PCI bus-mastering controllers such as the NCR 53c8xx series. Re-configure your kernel to disable the "PCI quirks" option, then <tt>make dep</tt>, rebuild, and reinstall.
=== I am having extensive problems using ReiserFS; it seems to have bugs all over the place. I'm not compiling with a buggy compiler. What is happening? How can this be stable? ===

You have hardware problems. Really, you do. Even if the bugs don't show up with ext2, you have hardware problems. (See [[#Why_do_I_get_a_signal_11_when_compiling_the_kernel_using_ReiserFS_and_not_ext2?|the signal 11 question]].) Most SuSE users use ReiserFS. Obscure bugs probably still exist, but if you find bugs as easily as when using Windows, you have bad RAM, a bad CPU, a bad cable, bad cooling, a [[#I have a motherboard with VIA MVP3 chipset and experience ReiserFS problems.|VIA chipset with PCI quirks turned on]], or other hardware or software layer bugs. ReiserFS is stable. You can be sure that if bugs are encountered easily and commonly with normal usage patterns, it is not us. This does not mean that the next release won't somehow break something though :-/ Real bug reports are, at the time of writing, outnumbered 10 to 1 by hardware bugs that trigger error messages. We are working on making our error messages better at catching hardware bugs and identifying them as such. There is only so far we can go, though, in runtime consistency checking without serious speed reductions. We don't release software unless it goes through extensive testing; so if you don't think that our testing could have missed the bug, it is probably hardware.

=== How can I put a label (like allowed by the <tt>-L</tt> option of <tt>mkfs.ext2</tt>) on a ReiserFS instance? ===

Currently, this feature is only implemented for the [[ReiserFS]] v3.6 disk format. Adding it to the v3.5 disk format would break the existing disk format, and there is not enough free space in the superblock.
You can set a label (and UUID) with a recent [[reiserfsprogs]] package on a [[ReiserFS]] v3.6 filesystem using the <tt>-l</tt> switch (<tt>-u</tt> for UUID) to the [[reiserfstune]] (for existing partitions) or [[mkreiserfs]] (for partitions being created) commands. Support for labels and UUIDs was integrated into [[reiserfsprogs]] starting from version 3.x.1a.

=== Why, when I'm working on files (i.e. having open files) on my laptop, does ReiserFS access the disk every 5 seconds? This effectively prevents the disk from spinning down, i.e. APM modes to take over, even when I'm not writing anything. ===

Brent Graveland <bgraveland@hyperchip.com> answers: It's the atime update. Every time you run sync, the sync program's atime is updated. The next sync writes this atime update, then sync gets updated again...

=== RedHat does not unmount / with ReiserFS on halt. How to fix it? ===

RedHat users kindly provided these patches (not tested by us): rc.sysinit.patch and halt.patch. Note that if you have RedHat Linux 7.2 or later, you do not need these patches.

=== How do I run programs from the reiserfsprogs package on encrypted devices? ===

In order to access such encrypted devices you need to use the losetup tool to bind your device to a loop device.

=== Are there any recommendations for or against particular hard drive manufacturers for use with ReiserFS? ===

There is basically no preference; the general rule "the faster the drive and the lower the seek time, the better" applies as always. On the other hand, almost every hard drive manufacturer has a "widely known" broken series of hard drives. The most recent example is IBM's "Deskstar" series, especially the DTLA models produced in Hungary in 2000-2001. These are known to fail very often, to the point that you probably don't want to use them even if you already paid for them. Other Deskstar drives also seem to be a poor choice. IBM released a note that Deskstar drives should not run for more than 8 hours/day on average.
These drives are also known to be very sensitive to temperature conditions and to fail on overheating. A class action lawsuit against IBM over that drive series is in progress.

=== I am using RedHat 7.0 with gcc 2.96; why does ReiserFS seem unstable with it? ===

Use the most recent version of RedHat (gcc 2.96-85 or later, shipped with RedHat 7.2, although 7.1 is also okay for ReiserFS). The choice of an unstable, unreleased version of gcc 2.96 by RedHat as the default gcc was a Slashdot controversy. gcc 2.96 on RedHat 7.0 was unstable, and ReiserFS was one of the things that would fail with it. There are two gcc 2.96 versions: 2.96 and 2.96-85. 2.96-85 works for ReiserFS, and the other (the one on RedHat 7.0) surely does not. Read the Linux kernel instructions about what compiler to use. The solution to code not working on broken compilers is the one RedHat has taken: fix the compiler. They fixed the compiler and thereby allowed the correctly compiled ReiserFS to work.

=== In my program I am using fsync(2) calls after each write to the file to guarantee integrity of my file data, and this is very slow. How can I improve the performance? ===

Answer from Chris Mason: The main thing to remember is that fsyncs introduce a bunch of disk writes and force the FS to wait on the buffers. The key to keeping performance up is to make it easy for the FS to do as much as possible before the fsync call. So, if your application modifies 3 files, and you want to make sure all 3 changes are safely on disk:

 write(file1) write(file2) write(file3) fsync(file1) fsync(file2) fsync(file3)

is much faster than:

 write(file1) fsync(file1) write(file2) fsync(file2) write(file3) fsync(file3)

It is also faster to write over existing bytes in a file than it is to append new bytes onto the end of a file. When you overwrite existing bytes in the file, you don't have to commit new metadata to disk on fsync(); the FS can just write the data blocks. This is fewer seeks.
The more you write to a single file before calling fsync, the faster overall things will run.

 write(8k) fsync(file)

is much faster than:

 write(4k) fsync(file) write(4k) fsync(file)

Trying to optimize for those 3 things alone can make a huge performance difference overall.

Answer from Josh MacDonald: You have to understand that even using fsync() after every write() makes no guarantees. If the system crashes during either the write or fsync operation, your data may be lost or corrupted. Suppose the fsync() does complete: does your application keep its data in multiple files? If that is the case and you need to write() to multiple files as part of a transaction, you have even greater problems. The only safe and easy way for you to implement some kind of transaction with the traditional file system guarantees is to use rename():

# Keep all of your data in a single file.
# Periodically write a complete copy of your database to a temporary file.
# Rename the temporary file to the original database name.

(Addition from Nikita Danilov: One can implement something like a phase-tree at user level and use rename to atomically switch the root of the tree. This overcomes the "everything-in-one-file" limitation but has the added complexity of requiring crash recovery.)

Answer from Nikita Danilov: Stop your development for now and wait until the Reiser4 filesystem is released; it will have a transaction API exported to userspace. That transaction API would solve all of your problems.

=== Our program needs to access a lot of working files. What is the recommended way to organize files to get the best results out of ReiserFS? Should all the files be placed in a single directory, or should the files be spread across a directory tree to limit the number of files per directory? Can you also summarize the relevant caching and locking effects? ===

Traditional file systems typically have poor performance when there are many files in a single directory, but not [[ReiserFS]].
These other file systems perform poorly because they use a linear search algorithm to find and replace entries in a directory. This means that the file system must scan, on average, half the blocks of a directory for every access. Typically, applications are required to work around this problem by manually structuring a tree of directories, allowing each individual directory to remain limited in size. For example, see how the Squid web proxy stores a large collection of files. ReiserFS does not have this problem because it uses an internal tree to store all directories and file metadata. Directory operations remain efficient even for very large directories, so you can write your application free from this performance concern.

However, there are several issues that complicate this matter, namely locking and locality. The Linux VFS currently imposes locking restrictions that serialize many operations on directories, so if concurrent processes or threads will access the collection of files then you may be better off using multiple directories. [[Reiser4]] will improve upon this restriction, although it is still under development. ReiserFS attempts to store all of the files in a directory, along with the directory itself, in nearby locations on disk. An application may exploit this spatial locality if it can predict which files will be accessed with temporal locality. You may be better off using multiple directories to store your files if you can predict that many files within a directory will be accessed at the same time.

To summarize, ReiserFS supports efficient access to large directories while most traditional file systems do not. However, locking and locality issues may guide your decision to use manually structured directory trees instead, at least until ReiserFS exports control over packing locality to users and improves its locking.
[[category:ReiserFS]] [[category:Reiser4]]

This FAQ is very [[ReiserFS]] centric and often a bit dated. The [[Reiser4]] filesystem is mentioned as ''upcoming''. Be sure to search the [[mailinglists|mailing list archives]] and help update this FAQ - Thanks!

__TOC__

=== What are the specs for ReiserFS: maximum number of files, of files a directory can have, of sub-dirs in a dir, of links to a file, maximum file size, maximum filesystem size, etc.? ===

Specifications for [[ReiserFS]]:

{|cellpadding="5" cellspacing="0" border="1"
| '''property''' || '''3.5''' || '''3.6'''
|-
| max number of files || 2<sup>32</sup>-3 => 4 Gi - 3 || 2<sup>32</sup>-3 => 4 Gi - 3
|-
| max number of files a dir can have || 518701895 (but in practice this value is limited by the hash function; the r5 hash allows about 1 200 000 file names without collisions) || 2<sup>32</sup>-4 => 4 Gi - 4 (but in practice this value is limited by the hash function; the r5 hash allows about 1 200 000 file names without collisions)
|-
| max file size || 2<sup>31</sup>-1 => 2 Gi - 1 || 2<sup>60</sup> bytes => 1 Ei, but the page cache limits this to 8 Ti on architectures with a 32-bit int
|-
| max number of links to a file || 2<sup>16</sup> => 64 Ki || 2<sup>32</sup> => 4 Gi
|-
| max filesystem size || 2<sup>32</sup> (4K) blocks => 16 Ti || 2<sup>32</sup> (4K) blocks => 16 Ti
|}

ReiserFS does '''meta-data journaling''', enabling fast crash recovery without the expense of full '''data journaling'''. There [ftp://ftp.suse.com/pub/people/mason/patches/intermezzo-alpha/ were] separate [http://marc.info/?l=reiserfs-devel&m=100895310422415&w=2 patches from Chris Mason] that implement full data journaling for ReiserFS for Linux 2.4.16.

'''Note''': Full data journaling is considered by many to be a good way to achieve file data integrity across system crashes.
However, although file data may appear to be consistent from the kernel's point of view, since there is no API exported to userspace to control transactions, we may end up in a situation where the application makes two write requests (as part of one logical transaction) but only one of them gets journaled before the system crashes. From the application's point of view, we may then end up with inconsistent data in the file. Such issues should be addressed with the upcoming [[Reiser4]]: such an API will be exported to userspace, and all programs that need transactions will be able to use it.

=== Mount fails after reiserfsck --rebuild-tree failure ===

When [[reiserfsck]] --rebuild-tree is run, the first thing it does is set the root inode value to -1. This makes the filesystem unmountable. (So, if [[reiserfsck]] fails later on because the filesystem contains serious errors, the filesystem cannot be mounted.) Therefore, once [[reiserfsck]] --rebuild-tree has failed for one of your filesystems, mounting of that partition is disabled. To correct the error, first check that you have the latest [[reiserfsprogs]] package installed. If that fails, please send a bug report to our [[mailinglists|mailing list]] and be ready to answer our questions.

=== Why is the execution time for a <tt>find . -type f | xargs cat {} \;</tt> command much longer when using ReiserFS than for the same command when using ext2? ===

This effect is observed if the measured file set was produced by untarring an archive created not from a ReiserFS partition (or by copying files from a non-ReiserFS partition, or by running a program that writes a bunch of files in some order). This is because the <tt>readdir()</tt> operation performed on the ReiserFS partition returns filenames not in the original write order but rather in some hash order (dependent on the hash function used). Thus when reading the files' contents, the hard drive heads must move when going from one file to another.
If you want ReiserFS to outperform any other filesystem in your setup, here is one solution: copy the entire directory that you are not satisfied with to the same partition but with a different name (use <tt>cp -a</tt>), then remove the old directory and rename the new one to the old name. If the partition does not have enough space available, another approach is to <tt>tar(1)</tt> up the whole partition, clear it, and then untar the previously saved data.

=== Is quota support built into the vanilla 2.4 kernels for ReiserFS? ===

No. Quota support for the 2.4 kernel branch is bundled separately; the patches by Chris Mason were once available [ftp://ftp.suse.com/pub/people/mason/patches/reiserfs/quota-2.4/ at SuSE] (gone) and are still [http://gd.tuwien.ac.at/utils/fs/reiserfs/quota-patches/ mirrored at TU-Wien]. The reason these patches were not included in the 2.4 kernel branch is that they implement a new quota format and need new quota code too, which is too big a change for the 2.4 series of kernels. Various Linux distribution vendors (e.g. [http://www.suse.com SuSE]) do ship reiserfs-quota-enabled kernels, though.
=== I am getting some errors in my kernel logs that I do not know how to interpret ===

Messages like:

 vs-13070: reiserfs_read_inode2: i/o failure occurred trying to find stat data of [1718696 1718710 0x0 SD]
 zam-7001: io error in reiserfs_find_entry

most likely accompanied by samples like the ones below are definite signs of hard disk problems (bad sectors):

 hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
 hda: dma_intr: error=0x40 { UncorrectableError }, LBAsect=6599945, sector=4286584
 end_request: I/O error, dev 03:03 (hda), sector 4286584

or

 scsi0: ERROR on channel 0, id 1, lun 0, CDB: Read (10) 00 00 01 ee 60 00 00 08 00
 Current sd 08:00: sense key Medium Error

or

 I/O error: dev 08:21, sector 65704

Messages about <tt>"access beyond end of device"</tt> may have lots of different causes, ranging from not rebooting after fdisk requested it, to unfinished resizings, to data corruption. The following messages mean you have a noisy IDE cable, or it is just too low quality for the chosen UDMA mode. Try to replace the cable with a better one, or choose a slower UDMA mode:

 hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
 hda: dma_intr: error=0x84 { DriveStatusError BadCRC }
 hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
 hda: dma_intr: error=0x84 { DriveStatusError BadCRC }

If you see any message from [[ReiserFS]] that you cannot interpret, and there is nothing similar to the messages above around it, [[mailinglists|mail the message to us]] and we will explain it to you.

=== Will ReiserFS implement streams, extended attributes, etc.? ===

[[FAQ/streams|Here]] is the one-page answer.

=== Reiserfs appears to be very slow while the RAID is resyncing. Mounting takes several minutes. Once mounted, an <tt>ls(1)</tt> in the mounted directory hangs. Forever. Once the RAID is sync'ed, things appear to work pretty well. How can that be fixed? ===

First of all, a patch that helps mount the drive faster has been included in the Linux kernel since 2.4.19.
You can grab the patch for earlier kernels [http://gd.tuwien.ac.at/utils/fs/reiserfs/reiserfs-for-2.5/2.5.4.pending/07-reiserfs-bitmap-journal-read-ahead.diff here].

Also, RAID drivers have '''minimal guaranteed''' and '''maximal possible''' RAID rebuild bandwidth usage. These values are controlled through the <tt>/proc/sys/dev/raid/speed_limit_min</tt> and <tt>/proc/sys/dev/raid/speed_limit_max</tt> sysctl variables (values are in 100 KiB/s). It seems that the RAID logic cannot always tell whether the disk subsystem is busy at a given time. When it thinks the disk subsystem is idle, it tries to rebuild the RAID array at the <tt>speed_limit_max</tt> speed, which defaults to 100 MB per second. Decrease this value to something more suitable (a bit of experimentation might be needed).
=== [[FAQ/potato_part|Here]] are instructions by [mailto:LeBlanc@mcc.ac.uk Dr. A.V. Le Blanc]. === Starting with linux kernel v2.4.21 I cannot mount my FS anymore. Why? === Special sanity checks were added to kernel code to prohibit mounting of filesystems that are bigger then underlying block device. If you now see this message on mount: Filesystem on xx:yy cannot be mounted because it is bigger than the device you may need to run fsck or increase size of your LVM partition. Or may be you forgot to reboot after fdisk when it told you to If you do not use LVM, that usually means you need to run <tt>[[reiserfsck]] --rebuild-sb</tt> on your filesystem and agree to change its default size to proposed one. === Is it ok to use ReiserFS on a small size storage device: e.g. 16MB NAND flash block device? === [[FAQ/small_blocks|Here]] are instructions. === How do I change root from ext2 to ReiserFS without loss of data? === [[FAQ/change_fs|Here]] are instructions. === <tt>mount: /dev/hda5 has wrong major or minor number</tt> - what does that mean? === The kernel does not know anything about [[ReiserFS]], it is neither compiled in nor available as a module. === Will it be possible to read/write ReiserFS partitions created now with future versions of ReiserFS? === Yes. [[ReiserFS]]-3.6.x (Linux-2.4.x) works with both the old (3.5) and the new (3.6) formats. ReiserFS-3.5.x (Linux-2.2.x) can only work with the old (3.5) disk-format. There is no way to convert the new (3.6) disk-format to the old (3.5), but the old (3.5) format could be converted to the new one (3.6) with the <tt>"-o conv</tt> [[mount|mount option]]. === The ReiserFS module doesn't insert properly - why? === After applying the patch, ''recompile'' the whole kernel including the modules target, reboot, then try to insert the module. === Can I use ReiserFS with the software RAID? === Yes, for all RAID levels using any Linux >= 2.4.1, but '''DO NOT''' use RAID with Linux 2.2.x. 
Our journaling and their RAID code step on each other in the buffering code. Also, mirroring is '''not''' safe in the 2.2.x kernels because online mirror rebuilds in 2.2.x break the write ordering requirements for the log. If you crash in the middle of an online rebuild, your meta-data may be corrupted. The only RAID level that is safe with [[ReiserFS]] in the 2.2.x kernels is the striping/concatenation level. === Can I use ReiserFS with 3ware RAID? === Yes, but you need to use Linux 2.2.19 or later for reasons other than [[ReiserFS]]. Also if you should encounter problems you should be suspicious that it might not be ReiserFS that has the bug. In [http://web.archive.org/web/20030415160519/http://www.3ware.com/support/raid5techbulletin.shtml special instructions]. (archive.org) === Why do things freeze on my IDE hard drive for annoying amounts of time? === Because when large writes are scheduled all at once, reads can starve. A fix for this is evolving; the later your ReiserFS patch, the better we handle this. === <tt>du(1)</tt> says ReiserFS makes space efficiency worse. === Use <tt>df(1)</tt> not <tt>du(1)</tt>, or use ''raw'' option for <tt>du(1)</tt> if it's supported. <tt>st_blocks</tt> summed up is less accurate than <tt>st_size</tt> for [[ReiserFS]] because we pack tails, and <tt>st_blocks</tt> rounds numbers up. === <tt>mkreiserfs(8)</tt> fails after repartitioning === The kernel requires you to reboot after repartitioning (for all filesystems). We intend to fix that. === Performance is poor, and my disk at 96% full still has free space. === Once a disk drive gets more than 85% full, the performance starts to suffer unless using a repacker (which isn't implemented yet.) You can probably get away with 92%, but if performance is valued you are making a mistake to keep it any fuller. This is true for almost all filesystems. 
[[ReiserFS]], because it packs tails together, packs more data into a given percentage used, but it is still subject to the rules for the maximum recommended percentage used. If you create the whole disk with one copy and then mount it read-only, then you can fully pack it without problem. Please be sure that you copy it from (or <tt>tar</tt> it from) a reiserfs partition so that files are created in reiserfs <tt>readdir()</tt> order, as this will improve performance.

=== Why do I get a signal 11 when compiling the kernel using ReiserFS and not ext2? ===

Your CPU is overheating and/or you have [http://www.bitwizard.nl/sig11/ bad RAM].

=== But it doesn't happen with ext2? ===

ext2 uses less heat sensitive gates in the CPU :-) Seriously, ext2 and [[ReiserFS]] contain random differences, and overheating and bad RAM have random sensitivities. ([http://www.bitwizard.nl/sig11/ Signal 11] is not due to ReiserFS. One user had a cable blocking the fan; it did not affect ext2, but it wasn't until he fixed the cable-fan problem that ReiserFS worked.)

=== Can I use ReiserFS on other architectures than i386? ===

Yes, starting from the Linux [http://kernel.org/pub/linux/kernel/v2.4/ChangeLog-2.4.13 kernel 2.4.13], ReiserFS can be run on any Linux-supported arch.

=== I need a program which will help me in rebuilding/recreating my partition table. ===

[http://brzitwa.de/mb/gpart/ gpart] is a utility that handles ext2, FAT, Linux swap, HPFS, NTFS, FreeBSD and Solaris/x86 disklabels, Minix, and ReiserFS. It prints a proposed content for the primary partition table and is well documented.

=== What partition type should I use for ReiserFS? ===

[http://www.win.tue.nl/~aeb/partitions/partition_types.html Linux native filesystem] (83)

=== Can I use 32GB+ IDE Hard Drives with ReiserFS? ===

Yes, if you use Linux kernel 2.4 and up.

=== What about resizing ReiserFS? ===

This can be done with [[resize_reiserfs]].
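As a hedged sketch of how a grow operation is typically scripted (the device path and size argument are illustrative examples, not from this FAQ; <tt>resize_reiserfs -s</tt> takes either an absolute new size or a relative <tt>+</tt>/<tt>-</tt> delta):

```python
import subprocess

def resize_reiserfs_cmd(device, size):
    # Build a resize_reiserfs(8) invocation; size follows the tool's
    # syntax, e.g. "+1G" to grow by 1 GiB.  When the filesystem sits
    # on LVM, grow the logical volume (lvextend) before growing the fs.
    return ["resize_reiserfs", "-s", size, device]

cmd = resize_reiserfs_cmd("/dev/vg0/home", "+1G")
# subprocess.run(cmd, check=True)  # needs root; shrinking requires an unmounted fs
```

The helper only assembles the command line, so the invocation itself stays visible and auditable before anything touches a real device.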
=== What should I put into the fifth (aka dump, fs_freq) and the sixth (aka pass, fs_passno) fields of /etc/fstab for ReiserFS filesystems? ===

You'd put in <tt>"0 0"</tt>, e.g.

 /dev/sda3 /var reiserfs notail,nodev,nosuid,noexec <font color="red">0 0</font>

=== Why are ReiserFS filesystems not fscked on reboot after a crash? ===

Because [[ReiserFS]] provides journaling of meta-data. After a crash, the consistency of a filesystem is restored by replaying the transaction log.

=== Can I interactively repair a filesystem that was corrupted? ===

This is done with [[reiserfsck]].

=== Can I use <tt>dump(8)</tt> and <tt>restore(8)</tt> with ReiserFS? Any caveats? ===

No. <tt>dump(8)</tt> uses knowledge of the internal structure of ext2 and works together with restore, which also uses ext2-specific knowledge, to back up ext2 files. dump and restore are specific to ext2 and will not work with [[ReiserFS]]. To back up ReiserFS files use <tt>tar(1)</tt>, which is universal and can be applied to almost any reasonable Linux filesystem.

It is well known among system administrators that <tt>dump(8)</tt> is more complete than unix tar, and that there is quite a list of things that unix tar will fail to properly back up. This is not true of GNU/tar, which is quite complete. Basically, the only real disadvantage of GNU/tar compared to <tt>dump(8)</tt> is speed. Unfortunately, because it shares the same name as Unix <tt>tar(1)</tt>, people are reluctant to believe this. (Yes, GNU/tar has incremental backups, etc.) We will performance optimize ReiserFS backups for you (and the rest of the world) for $30K, which is not a lot if you are a large site spending a few hundred thousand on equipment for backups.

=== Does ReiserFS support snapshots? ===

No, but you can create [[ReiserFS]] on top of an [http://sourceware.org/lvm2/ LVM] logical volume and use LVM snapshot capabilities.

=== Can I check reiserfs filesystems for errors without unmounting them?
=== [[reiserfsck]] in checking mode may run over filesystems mounted read-only. There is no official way to fix mounted filesystems, though: you MUST completely unmount your filesystem in order to have it fixed. If you have LVM, you can check the consistency of filesystems mounted read-write using a script contributed by Andreas Dilger.

=== What ReiserFS mount options should I use to get the best performance for a mail server? ===

[http://archives.neohapsis.com/archives/postfix/2001-03/1148.html Craig Sanders answered] in detail: By the time I got around to running <tt>bonnie</tt>, the <tt>postmark</tt> and <tt>postal</tt> benchmarks had convinced me that <tt>notail</tt> was essential.

Host system:
* Debian GNU/Linux (of course :)
* Linux kernel 2.4.2 with latest 20010305 ReiserFS patch
* dual P3-866 (256K cache)
* 512MB RAM
* [http://www.adaptec.com/en-US/support/scsi/u160/ASC-19160/ Adaptec 19160] SCSI Controller

External drive box:
* [http://www.domex.com.tw/support/product/8230u.htm Domex 8230u] RAID controller, 32MB battery-backed cache.
* 6 x 18GB IBM [http://www.hitachigst.com/tech/techlib.nsf/techdocs/85256AB8006A31E587256A78005A3610/$file/ddys_sp21.PDF DDYS-T18350M] drives

For this particular hardware I was using, [[ReiserFS]]/notail on RAID5 was the clear performance winner for a mail server with lots of synced random I/O.

=== Does using ReiserFS mean I can just press the power off button without running <tt>/sbin/shutdown</tt>? Does it mean there is no risk of data loss? ===

No, definitely not. As of now, [[ReiserFS]] only provides meta-data journaling - that is, it records which files have been created or opened, whether they have had their size changed, or where they have been relocated. It guarantees that the structure of the internal ReiserFS tree will be correct, thereby allowing you to start back up after an unclean shutdown without having to run fsck on all the files that have not been changed.
Data in files that were being used at the time of the crash could have been corrupted. This is usual for most filesystems. Data journaling filesystems guarantee that there will be no garbage written into a file, but they don't guarantee that a file update will be completed. (Only [[Reiser4]] guarantees that filesystem operations are performed as atomic operations, and provides atomic transaction functionality.) [[ReiserFS]] does not guarantee that the file contents themselves are uncorrupted, nor that no data is lost. Moreover, even if all of your system is on ReiserFS, many system components (like daemons, database managers, etc.) require the shutdown procedure for proper functioning. However, there is a [ftp://ftp.suse.com/pub/people/mason/patches/data-logging separate implementation of data logging] (dead) that will [http://marc.info/?l=reiserfs-devel&m=103472026011689&w=2 soon] go into the mainstream kernel.

=== How does ReiserFS support bad block handling? ===

This is covered [[FAQ/bad-block-handling|here]].

=== I have a motherboard with VIA MVP3 chipset and experience ReiserFS problems. ===

[mailto:woster73@yahoo.com William Oster] answers: If you are using a motherboard with a [http://www.via.com.tw/en/products/apollo/mvp3.jsp VIA MVP3] chipset, you may have [[ReiserFS]] problems caused by the way your kernel is configured for the so-called [http://lxr.linux.no/linux+v2.6.30/drivers/pci/quirks.c PCI quirks]. My experience is with kernels 2.2.18 and 2.2.19, but it may affect the 2.4.x series too if you are using the MVP3 chipset (popular in socket 7 type motherboards, such as used by the AMD K6 and classic Pentium). I've confirmed this problem with several motherboards using the VIA MVP3 chipset, ReiserFS 3.5.29 to 3.5.32, and [http://lxr.linux.no/linux+v2.6.30/Documentation/scsi/ncr53c8xx.txt NCR 53c8xx SCSI]. But please note: It probably affects '''any controller which uses DMA and PCI bus mastering'''.
Problems which I was inclined to attribute to ReiserFS were actually problems with this kernel [mis]configuration. If you fit this profile, '''DO NOT''' enable the <tt>CONFIG_PCI_QUIRKS</tt> configuration option in the <tt>/usr/src/linux/.config</tt> file. Although the Linux documentation suggests that this option can be enabled if in doubt, '''DO NOT''' enable it. It was never intended for the VIA MVP3 chipset anyway. It affects the way DMA is handled, and the combination with ReiserFS (and possibly NCR SCSI) can cause random disk corruption which eventually will result in ReiserFS and/or SCSI errors. Evidently ReiserFS exercises the DMA and SCSI bus very thoroughly; the problems seem to be less likely under the ext2 filesystem.

Check your <tt>/usr/src/linux/.config</tt> file. You are safe from this problem if you find this line:

 # CONFIG_PCI_QUIRKS is not set

Any other setting could be dangerous to MVP3 chipset ReiserFS users, especially when using PCI bus mastering controllers such as the NCR 53c8xx series. Re-configure your kernel to disable the "PCI quirks" option, then <tt>make dep</tt>, rebuild, and reinstall.

=== I am having extensive problems using ReiserFS; it seems to have bugs all over the place. I'm not compiling with a buggy compiler. What is happening? How can this be stable? ===

You have hardware problems. Really, you do. Even if the bugs don't show up with ext2, you have hardware problems. (See [[#Why_do_I_get_a_signal_11_when_compiling_the_kernel_using_ReiserFS_and_not_ext2?|the signal 11 question]].) Most SuSE users use ReiserFS. Obscure bugs probably still exist; but if you find bugs as easily as using Windows, you have bad RAM, a bad CPU, a bad cable, bad cooling, a [[#I have a motherboard with VIA MVP3 chipset and experience ReiserFS problems|VIA chipset with PCI quirks turned on]], or other hardware or other software layer bugs. ReiserFS is stable.
You can be sure that if the bugs are encountered easily and commonly with normal usage patterns, it is not us. This does not mean that the next release won't somehow break something though :-/ At the time of writing, real bug reports are outnumbered 10 to 1 by hardware bugs that trigger error messages. We are working on making our error messages better at catching hardware bugs and identifying them as such. There is only so far we can go in runtime consistency checking, though, without serious speed reductions. We don't release software unless it goes through extensive testing; so if you don't think that our testing could have missed the bug, it is probably hardware.

=== How can I put a label (like allowed by the <tt>-L</tt> option of <tt>mkfs.ext2</tt>) on a ReiserFS instance? ===

Currently, this feature is only implemented for the [[ReiserFS]] v3.6 disk format. Adding it to the v3.5 disk format would break the existing disk format, and there is not enough free space in the superblock. You can set a label (and UUID) with a recent [[reiserfsprogs]] package on a [[ReiserFS]] v3.6 filesystem using the <tt>-l</tt> switch (<tt>-u</tt> for UUID) to the [[reiserfstune]] (for existing partitions) or [[mkreiserfs]] (for partitions being created) commands. Support for labels and UUIDs was integrated into [[reiserfsprogs]] starting from version 3.x.1a.

=== Why, when I'm working on files (i.e. having open files) on my laptop, does ReiserFS access the disk every 5 seconds? This effectively prevents the disk from spinning down, i.e. APM modes to take over, even when I'm not writing anything. ===

Brent Graveland <bgraveland@hyperchip.com> answers: It's the atime update. Every time you run sync, the sync program's atime is updated. The next sync writes this atime update, then sync gets updated again...

=== RedHat does not unmount / with ReiserFS on halt. How to fix it? ===

RedHat users kindly provided these patches (not tested by us): rc.sysinit.patch and halt.patch.
Note that if you have RedHat Linux 7.2 or later, you do not need these patches.

=== How do I run programs from the reiserfsprogs package on encrypted devices? ===

In order to access such encrypted entities you need to use the losetup tool to bind your entity to a loop device.

=== Are there any recommendations for or against any particular hard drive manufacturers for use with reiserfs? ===

There is basically no preference; the general rule that a faster drive with a lower seek time is better applies as always. On the other hand, almost every hard drive manufacturer has a "widely known" broken series of hard drives. The most recent example is IBM's "Deskstar" series of disks, especially the DTLA models produced in Hungary in 2000-2001. These are known to fail very often, to the point that you probably don't want to use them even if you already paid for them. Other Deskstar drives also seem to be not a very good choice: IBM released a note that Deskstar drives should not run for more than 8 hours/day on average. These drives are also known to be very sensitive to temperature conditions and to fail on overheating. A class action lawsuit against IBM over that drive series is in progress.

=== I am using RedHat 7.0 with gcc 2.96; why does ReiserFS seem unstable with it? ===

Use the most recent version of RedHat (gcc 2.96-85 or later with RedHat 7.2, although 7.1 is also okay for ReiserFS). The choice of an unstable, unreleased version of gcc 2.96 by RedHat as the default gcc was a Slashdot controversy. gcc 2.96 on RedHat 7.0 was unstable, and ReiserFS was one of the things that would fail with it. There are two gcc versions here: 2.96 and 2.96-85. 2.96-85 works for ReiserFS, and the other (the one on RedHat 7.0) surely does not. Read the Linux kernel instructions about what compiler to use. The solution to code not working on broken compilers is the one RedHat has taken: fix the compiler. They fixed the compiler and thereby allowed the correctly compiled ReiserFS to work.
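The losetup binding from the encrypted-devices answer above can be sketched as a short command sequence (the image path and loop device are hypothetical; the commands are only assembled here, not executed):

```python
# Sketch of running reiserfsprogs tools against an encrypted container:
# losetup(8) binds the container file to a loop device, after which the
# tools can address it like any ordinary block device.
image = "/path/to/container.img"   # hypothetical container file
loop = "/dev/loop0"                # hypothetical free loop device

cmds = [
    ["losetup", loop, image],          # bind the image to the loop device
    ["reiserfsck", "--check", loop],   # run the tool against the loop device
    ["losetup", "-d", loop],           # detach when done
]
# import subprocess
# for c in cmds:
#     subprocess.run(c, check=True)    # needs root
```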
=== In my program I am using fsync(2) calls after each write to the file to guarantee integrity of my file data, and this is very slow. How can I improve the performance? ===

Answer from Chris Mason: The main thing to remember is that fsyncs introduce a bunch of disk writes, and force the FS to wait on the buffers. The key to keeping performance up is to make it easy for the FS to do as much as possible before the fsync call. So, if your application modifies 3 files, and you want to make sure all 3 changes are safely on disk:

 write(file1)
 write(file2)
 write(file3)
 fsync(file1)
 fsync(file2)
 fsync(file3)

is much faster than:

 write(file1)
 fsync(file1)
 write(file2)
 fsync(file2)
 write(file3)
 fsync(file3)

It is also faster to write over existing bytes in the file than it is to append new bytes onto the end of a file. When you overwrite existing bytes in the file, you don't have to commit new metadata to disk on fsync(); the FS can just write the data blocks. This means fewer seeks. The more you write to a single file before calling fsync, the faster overall things will run.

 write(8k)
 fsync(file)

is much faster than:

 write(4k)
 fsync(file)
 write(4k)
 fsync(file)

Trying to optimize for those 3 things alone can make a huge performance difference overall.

Answer from Josh MacDonald: You have to understand that even using fsync() after every write() makes no guarantees. If the system crashes during either the write or fsync operation, your data may be lost or corrupted. Suppose the fsync() does complete: does your application keep its data in multiple files? If that is the case and you need to write() to multiple files as part of a transaction, you have even greater problems. The only safe and easy way for you to implement some kind of transaction with the traditional file system guarantees is to use rename(): 1. Keep all of your data in a single file. 2. Periodically write a complete copy of your database to a temporary file. 3.
Rename the temporary file to the original database name. (Addition from Nikita Danilov: One can implement something like a phase-tree at user level and use rename to atomically switch the root of the tree. This overcomes the "everything-in-one-file" limitation but has the added complexity of requiring crash recovery.)

Answer from Nikita Danilov: Stop your development for now and wait until the reiser4 filesystem is released; it will have a transaction API exported to userspace, and that transaction API would solve all of your problems.

=== Our program needs to access a lot of working files. What is the recommended way to organize files to get the best results out of ReiserFS? Should all the files be placed in a single directory, or should the files be spread across a directory tree to limit the number of files per directory? Can you also summarize the relevant caching and locking effects? ===

Traditional file systems typically have poor performance when there are many files in a single directory, but not [[ReiserFS]]. These other file systems perform poorly because they use a linear search algorithm to find and replace entries in a directory. This means that the file system must scan, on average, half the blocks of a directory for every access. Typically, applications are required to work around this problem by manually structuring a tree of directories, allowing each individual directory to remain limited in size. For example, see how the Squid web proxy stores a large collection of files.

ReiserFS does not have this problem because it uses an internal tree to store all directories and file metadata. Directory operations remain efficient even for very large directories, so you can write your application free from this performance concern. However, there are several issues that complicate this matter: namely locking and locality.
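The manual directory-spreading workaround described above (Squid-style, needed on filesystems with linear directory search) can be sketched like this; the <tt>spread_path</tt> helper and its two-level layout are illustrative assumptions, not Squid's exact scheme:

```python
import hashlib
import os

def spread_path(root, name, levels=2):
    # Use the first `levels` hex digits of a hash of the file name to
    # pick nested subdirectories (16 entries per level), keeping any
    # single directory small on filesystems with linear lookup.
    h = hashlib.md5(name.encode()).hexdigest()
    return os.path.join(root, *(h[i] for i in range(levels)), name)

p = spread_path("/var/cache/app", "object-1234")  # /var/cache/app/<x>/<y>/object-1234
```

On ReiserFS the hashed layers buy nothing, since directory lookups go through the internal tree; the sketch only illustrates the workaround that traditional filesystems force.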
The Linux VFS currently imposes locking restrictions that serialize many operations on directories, so if concurrent processes or threads will access the collection of files then you may be better off using multiple directories. [[Reiser4]] will improve upon this restriction, although it is still under development.

ReiserFS attempts to store all of the files in a directory, along with the directory itself, in nearby locations on disk. An application may exploit this spatial locality if it can predict which files will be accessed with temporal locality. You may be better off using multiple directories to store your files if you can predict that many files within a directory will be accessed at the same time.

To summarize, ReiserFS supports efficient access to large directories and most traditional file systems do not. However, locking and locality issues may guide your decision to use manually structured directory trees instead, at least until ReiserFS exports control over packing locality to users, and improves its locking.

This FAQ is very [[ReiserFS]]-centric and often a bit dated. The [[Reiser4]] filesystem is mentioned as ''upcoming''. Be sure to search the [[mailinglists|mailing list archives]] and help update this FAQ - Thanks!

__TOC__

=== What are the specs for ReiserFS: maximum number of files, of files a directory can have, of sub-dirs in a dir, of links to a file, maximum file size, maximum filesystem size, etc.? ===

Specifications for [[ReiserFS]]:

{| cellpadding="5" cellspacing="0" border="1"
| '''property''' || '''3.5''' || '''3.6'''
|-
| max number of files || 2<sup>32</sup>-3 => 4 Gi-3 || 2<sup>32</sup>-3 => 4 Gi-3
|-
| max number files a dir can have || 518701895 (but in practice this value is limited by the hash function.
r5 hash allows about 1 200 000 file names without collisions) || 2<sup>32</sup>-4 => 4 Gi-4 (but in practice this value is limited by the hash function; r5 hash allows about 1 200 000 file names without collisions)
|-
| max file size || 2<sup>31</sup>-1 => 2 Gi-1 || 2<sup>60</sup> bytes => 1 Ei, but page cache limits this to 8 Ti on architectures with 32 bit int
|-
| max number links to a file || 2<sup>16</sup> => 64 Ki || 2<sup>32</sup> => 4 Gi
|-
| max filesystem size || 2<sup>32</sup> (4K) blocks => 16 Ti || 2<sup>32</sup> (4K) blocks => 16 Ti
|}

ReiserFS does '''meta-data journaling''', enabling fast crash recovery without the expense of full '''data journaling'''. There [ftp://ftp.suse.com/pub/people/mason/patches/intermezzo-alpha/ were] separate [http://marc.info/?l=reiserfs-devel&m=100895310422415&w=2 patches from Chris Mason] that implemented full data journaling for ReiserFS on Linux 2.4.16.

'''Note''': Full data journaling is considered by many to be a good way to achieve file data integrity across system crashes. However, although file data may appear to be consistent from the kernel point of view, since there is no API exported to userspace to control transactions, we may end up in a situation where the application makes two write requests (as part of one logical transaction) but only one of these gets journaled before the system crashes. From the application point of view, we may then end up with inconsistent data in the file. Such issues should be addressed with the upcoming [[Reiser4]]. Such an API will be exported to userspace and all programs that need transactions will be able to use it.

=== Mount fails after reiserfsck --rebuild-tree failure ===

When [[reiserfsck]] --rebuild-tree is run, the first thing it does is set the root inode value to -1. This makes the filesystem unmountable. (So, if [[reiserfsck]] fails later on because the filesystem contains serious errors, the filesystem cannot be mounted.)
Therefore, once <tt>[[reiserfsck]] --rebuild-tree</tt> has failed for one of your filesystems, mounting of this partition is disabled. To correct the error, first check that you have the latest [[reiserfsprogs]] package installed. If that fails, please send a bug report to our [[mailinglists|mailing list]] and be ready to answer our questions.

=== Why is the execution time for a <tt>find . -type f | xargs cat</tt> command much longer when using ReiserFS than for the same command when using ext2? ===

This effect is observed if the measured file set was produced by untarring some archive created not from a ReiserFS partition (or by copying files from a non-ReiserFS partition, or by running a program that writes a bunch of files in some order). This is because the <tt>readdir()</tt> operation performed on the ReiserFS partition returns filenames not in the original write order but rather in some hash order (dependent on the hash function used). Thus when reading files' contents, the hard drive heads must move when going from one file to another.

If you want ReiserFS to outperform any other filesystem in your setup, here is one solution: Copy the entire directory that you are not satisfied with to the same partition but with a different name (use <tt>cp -a</tt>), then remove the old directory and rename the new one with the old name. If the partition does not have enough space available, another approach is to <tt>tar(1)</tt> up the whole partition, clear it, and then untar the previously saved data.

=== Is quota-support built-in in the vanilla 2.4 kernels for ReiserFS? ===

No. Quota support for the 2.4 kernel branch is bundled separately; the patches by Chris Mason were once available [ftp://ftp.suse.com/pub/people/mason/patches/reiserfs/quota-2.4/ at SuSE] (gone) and are still [http://gd.tuwien.ac.at/utils/fs/reiserfs/quota-patches/ mirrored at TU-Wien].
The reason these patches were not included in the 2.4 kernel branch is that they implement a new quota format and need new quota code too, which is too big a change for the 2.4 series of kernels. Various Linux distribution vendors (e.g. [http://www.suse.com SuSE]) do ship reiserfs-quota enabled kernels, though.

=== I am getting some errors in my kernel logs, that I do not know how to interpret ===

Messages like:

 vs-13070: reiserfs_read_inode2: i/o failure occurred trying to find stat data of [1718696 1718710 0x0 SD]
 zam-7001: io error in reiserfs_find_entry

most likely accompanied by samples like those below are definite signs of harddisk problems (bad sectors):

 hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
 hda: dma_intr: error=0x40 { UncorrectableError }, LBAsect=6599945, sector=4286584
 end_request: I/O error, dev 03:03 (hda), sector 4286584

or

 scsi0: ERROR on channel 0, id 1, lun 0, CDB: Read (10) 00 00 01 ee 60 00 00 08 00
 Current sd 08:00: sense key Medium Error

or

 I/O error: dev 08:21, sector 65704

Messages about <tt>"access beyond end of device"</tt> may have many different causes, from not rebooting after fdisk requested it, to unfinished resizes, to data corruption.

The following messages mean you have a noisy IDE cable, or one of too low quality for the chosen UDMA mode. Try replacing the cable with a better one, or choose a slower UDMA mode:

 hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
 hda: dma_intr: error=0x84 { DriveStatusError BadCRC }
 hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
 hda: dma_intr: error=0x84 { DriveStatusError BadCRC }

If you see any message from [[ReiserFS]] that you cannot interpret and there is nothing similar to the messages above around it, [[mailinglists|mail the message to us]] and we will explain it to you.

=== Will ReiserFS implement streams, extended attributes, etc.? ===

[[FAQ/streams|Here]] is the one page answer.

=== Reiserfs appears to be very slow while the RAID is resyncing.
Mounting takes several minutes. Once mounted, an <tt>ls(1)</tt> in the mounted directory hangs. Forever. Once the RAID is sync'ed, things appear to work pretty well. How can that be fixed? ===

First of all, a patch that makes mounting faster has been included in the Linux kernel since 2.4.19. You can grab the patch for earlier kernels [http://gd.tuwien.ac.at/utils/fs/reiserfs/reiserfs-for-2.5/2.5.4.pending/07-reiserfs-bitmap-journal-read-ahead.diff here].

Also, RAID drivers have '''minimal guaranteed''' and '''maximal possible''' RAID rebuild bandwidth usage. These values are controlled through the <tt>/proc/sys/dev/raid/speed_limit_min</tt> and <tt>/proc/sys/dev/raid/speed_limit_max</tt> sysctl variables (values are in 100 KiB/s). It seems that the RAID logic cannot always tell whether the disk subsystem is busy or not at a given time. When it thinks the disk subsystem is idle, it tries to rebuild the RAID array at the <tt>speed_limit_max</tt> speed, which defaults to 100 MB per second. Decrease this value to something more suitable (a bit of experimentation might be needed).

=== I get attempt to read past the end of the partition error messages; is ReiserFS corrupted? ===

You changed your partition sizes, and then ran [[mkreiserfs]] before rebooting. The kernel does not change its belief in what the partition sizes are until reboot time. (This is fixable, but nobody has fixed it as of Dec. 2001.) [[mkreiserfs]] created a filesystem that has the wrong notion of how large its partition is. The filesystem's notion of the partition boundaries will last past reboot even though the kernel's notion will change. So yes, it is corrupted. Also, some other kinds of metadata breakage can lead to such messages.

=== Can I use VMware with ReiserFS? ===

VMware was tested on [http://www.suse.com/ SuSE Linux] with a [http://support.microsoft.com/gp/lifean18 Windows98] guest OS on a [[ReiserFS]] partition.
There's one trick at the beginning: the following line was added to the VMware config file host.FSSupportLocking1 = 0x52654973 # (0x52654973 == *(u32 *) "ReIs") Thanks to [mailto:gkade@bigbrother.net Gregory K. Ade] for this hint. === How do I install Debian potato with ReiserFS as root partition? === [[FAQ/potato_part|Here]] are instructions by [mailto:LeBlanc@mcc.ac.uk Dr. A.V. Le Blanc]. === Starting with linux kernel v2.4.21 I cannot mount my FS anymore. Why? === Special sanity checks were added to kernel code to prohibit mounting of filesystems that are bigger then underlying block device. If you now see this message on mount: Filesystem on xx:yy cannot be mounted because it is bigger than the device you may need to run fsck or increase size of your LVM partition. Or may be you forgot to reboot after fdisk when it told you to If you do not use LVM, that usually means you need to run <tt>[[reiserfsck]] --rebuild-sb</tt> on your filesystem and agree to change its default size to proposed one. === Is it ok to use ReiserFS on a small size storage device: e.g. 16MB NAND flash block device? === [[FAQ/small_blocks|Here]] are instructions. === How do I change root from ext2 to ReiserFS without loss of data? === [[FAQ/change_fs|Here]] are instructions. === <tt>mount: /dev/hda5 has wrong major or minor number</tt> - what does that mean? === The kernel does not know anything about [[ReiserFS]], it is neither compiled in nor available as a module. === Will it be possible to read/write ReiserFS partitions created now with future versions of ReiserFS? === Yes. [[ReiserFS]]-3.6.x (Linux-2.4.x) works with both the old (3.5) and the new (3.6) formats. ReiserFS-3.5.x (Linux-2.2.x) can only work with the old (3.5) disk-format. There is no way to convert the new (3.6) disk-format to the old (3.5), but the old (3.5) format could be converted to the new one (3.6) with the <tt>"-o conv</tt> [[mount|mount option]]. === The ReiserFS module doesn't insert properly - why? 
=== After applying the patch, ''recompile'' the whole kernel including the modules target, reboot, then try to insert the module. === Can I use ReiserFS with the software RAID? === Yes, for all RAID levels using any Linux >= 2.4.1, but '''DO NOT''' use RAID with Linux 2.2.x. Our journaling and their RAID code step on each other in the buffering code. Also, mirroring is '''not''' safe in the 2.2.x kernels because online mirror rebuilds in 2.2.x break the write ordering requirements for the log. If you crash in the middle of an online rebuild, your meta-data may be corrupted. The only RAID level that is safe with [[ReiserFS]] in the 2.2.x kernels is the striping/concatenation level. === Can I use ReiserFS with 3ware RAID? === Yes, but you need to use Linux 2.2.19 or later for reasons other than [[ReiserFS]]. Also if you should encounter problems you should be suspicious that it might not be ReiserFS that has the bug. In [http://web.archive.org/web/20030415160519/http://www.3ware.com/support/raid5techbulletin.shtml special instructions]. (archive.org) === Why do things freeze on my IDE hard drive for annoying amounts of time? === Because when large writes are scheduled all at once, reads can starve. A fix for this is evolving; the later your ReiserFS patch, the better we handle this. === <tt>du(1)</tt> says ReiserFS makes space efficiency worse. === Use <tt>df(1)</tt> not <tt>du(1)</tt>, or use ''raw'' option for <tt>du(1)</tt> if it's supported. <tt>st_blocks</tt> summed up is less accurate than <tt>st_size</tt> for [[ReiserFS]] because we pack tails, and <tt>st_blocks</tt> rounds numbers up. === <tt>mkreiserfs(8)</tt> fails after repartitioning === The kernel requires you to reboot after repartitioning (for all filesystems). We intend to fix that. === Performance is poor, and my disk at 96% full still has free space. === Once a disk drive gets more than 85% full, the performance starts to suffer unless using a repacker (which isn't implemented yet.) 
You can probably get away with 92%, but if performance is valued you are making a mistake to keep it any fuller. This is true for almost all filesystems. [[ReiserFS]], because of our packing tails together, pack more data into a given percentage used, but it still is subject to the rules for max recommended percentage used. If you create the whole disk with one copy and then mount it read-only, then you can fully pack it without problem. Please be sure that you copy it from (or <tt>tar</tt> it from) a reiserfs partition so that files are created in reiserfs <tt>readdir()</tt> order as this will improve performance. === Why do I get a signal 11 when compiling the kernel using ReiserFS and not ext2? === Your CPU is overheating and/or you have [http://www.bitwizard.nl/sig11/ bad RAM]. === But it doesn't happen with ext2? === ext2 uses less heat sensitive gates in the CPU :-) Seriously, ext2 and [[ReiserFS]] contain random differences, and overheating and bad RAM have random sensitivities. ([http://www.bitwizard.nl/sig11/ Signal 11] is not due to ReiserFS. One user had a cable blocking the fan; it did not affect ext2, but it wasn't until he fixed the cable-fan problem that ReiserFS worked.) === Can I use ReiserFS on other architectures than i386? === Yes, starting from the Linux [http://kernel.org/pub/linux/kernel/v2.4/ChangeLog-2.4.13 kernel 2.4.13], ReiserFS can be run on any Linux supported arch. === I need a program which will help me in rebuilding/recreating my partition table. === [http://brzitwa.de/mb/gpart/ gpart] is a utility that handles ext2, FAT, Linux swap, HPFS, NTFS, FreeBSD and Solaris/x86 disklabels, Minix, ReiserFS. It prints a proposed content for the primary partition table and is well-documented. === What partition type should I use for ReiserFS? === [http://www.win.tue.nl/~aeb/partitions/partition_types.html Linux native filesystem] (83) === Can I use 32GB+ IDE Hard Drives with ReiserFS? === Yes if you use Linux kernel 2.4 and up. 
=== What about resizing ReiserFS? ===
This can be done with [[resize_reiserfs]].

=== What should I put into the fifth (aka dump, fs_freq) and the sixth (aka pass, fs_passno) fields of /etc/fstab for ReiserFS filesystems? ===
You'd put in <tt>"0 0"</tt>, e.g.
 /dev/sda3 /var reiserfs notail,nodev,nosuid,noexec <font color="red">0 0</font>

=== Why are ReiserFS filesystems not fscked on reboot after a crash? ===
Because [[ReiserFS]] provides journaling of meta-data. After a crash, the consistency of a filesystem is restored by replaying the transaction log.

=== Can I interactively repair a filesystem that was corrupted? ===
This is done with [[reiserfsck]].

=== Can I use <tt>dump(8)</tt> and <tt>restore(8)</tt> with ReiserFS? Any caveats? ===
No. <tt>dump(8)</tt> uses knowledge of the internal structure of ext2 and works together with restore, which also uses ext2-specific knowledge, to back up ext2 files. dump and restore are specific to ext2 and will not work with [[ReiserFS]]. To back up ReiserFS files use <tt>tar(1)</tt>, which is universal and can be applied to almost any reasonable Linux filesystem. It is well known among system administrators that <tt>dump(8)</tt> is more complete than Unix tar, and that there is quite a list of things that Unix tar will fail to back up properly. This is not true of GNU tar, which is quite complete. Basically, the only real disadvantage of GNU tar compared to <tt>dump(8)</tt> is speed. Unfortunately, because it shares the same name as Unix <tt>tar(1)</tt>, people are reluctant to believe this. (Yes, GNU tar has incremental backups, etc.) We will performance-optimize ReiserFS backups for you (and the rest of the world) for $30K, which is not a lot if you are a large site spending a few hundred thousand on equipment for backups.

=== Does ReiserFS support snapshots? ===
No, but you can create [[ReiserFS]] on top of an [http://sourceware.org/lvm2/ LVM] logical volume and use LVM snapshot capabilities.
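The backup answer above works because tar-style tools go through the ordinary VFS file API instead of reading ext2's on-disk structures the way <tt>dump(8)</tt> does, so they are filesystem-independent. A small sketch of the same round trip using Python's tarfile module; the directory layout and file contents are made-up examples:

```python
import os
import tarfile
import tempfile

# Build a tiny directory tree to stand in for data worth backing up.
src = tempfile.mkdtemp()
with open(os.path.join(src, "motd"), "w") as f:
    f.write("hello reiserfs\n")

# Archive it through the normal file API - no ext2/reiserfs internals
# involved, which is why this works on any filesystem.
archive = os.path.join(tempfile.mkdtemp(), "backup.tar")
with tarfile.open(archive, "w") as tar:
    tar.add(src, arcname="backup")

# Restoring is the reverse: extract into a fresh directory.
dst = tempfile.mkdtemp()
with tarfile.open(archive) as tar:
    tar.extractall(dst)

restored = os.path.join(dst, "backup", "motd")
print(open(restored).read())   # the original file contents survive
```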
=== Can I check reiserfs filesystems for errors without unmounting them? ===
[[reiserfsck]] in checking mode may be run over filesystems mounted read-only. There is no official way to fix mounted filesystems, though. You MUST completely unmount your filesystem in order to have it fixed. If you have LVM, you can check the consistency of filesystems mounted read-write with a script contributed by Andreas Dilger.

=== What ReiserFS mount options should I use to get the performance winner for a mail server? ===
[http://archives.neohapsis.com/archives/postfix/2001-03/1148.html Craig Sanders answered] in detail: By the time I got around to running <tt>bonnie</tt>, the <tt>postmark</tt> and <tt>postal</tt> benchmarks had convinced me that <tt>notail</tt> was essential.

Host system:
* Debian GNU/Linux (of course :)
* Linux kernel 2.4.2 with latest 20010305 ReiserFS patch
* dual P3-866 (256K cache)
* 512MB RAM
* [http://www.adaptec.com/en-US/support/scsi/u160/ASC-19160/ Adaptec 19160] SCSI Controller

External drive box:
* [http://www.domex.com.tw/support/product/8230u.htm Domex 8230u] RAID controller, 32MB battery-backed cache.
* 6 x 18GB IBM [http://www.hitachigst.com/tech/techlib.nsf/techdocs/85256AB8006A31E587256A78005A3610/$file/ddys_sp21.PDF DDYS-T18350M] drives

For this particular hardware, [[ReiserFS]]/notail on RAID5 was the clear performance winner for a mail server with lots of synced random I/O.

=== Does using ReiserFS mean I can just press the power off button without running <tt>/sbin/shutdown</tt>? Does it mean there is no risk of data loss? ===
No, definitely not. As of now, [[ReiserFS]] only provides meta-data journaling - that is, it records which files have been created or opened, whether they have had their size changed, or where they have been relocated.
It guarantees that the structure of the internal ReiserFS tree will be correct, thereby allowing you, after an unclean shutdown, to start back up without having to run fsck on all the files that have not been changed. Data in files that were being used at the time of the crash could have been corrupted. This is usual for most filesystems. Data-journaling filesystems guarantee that no garbage will be written into a file, but they don't guarantee that a file update will actually reach the disk. (Only [[Reiser4]] guarantees that filesystem operations are performed as atomic operations, and provides atomic transaction functionality.) [[ReiserFS]] does not guarantee that the file contents themselves are uncorrupted, nor that no data is lost. Moreover, even if all of your system is on ReiserFS, many system components (like daemons, database managers, etc.) require the shutdown procedure for proper functioning. However, there is a [ftp://ftp.suse.com/pub/people/mason/patches/data-logging separate implementation of data logging] (dead) that will [http://marc.info/?l=reiserfs-devel&m=103472026011689&w=2 soon] go into the mainstream kernel.

=== How does ReiserFS support bad block handling? ===
This is covered [[FAQ/bad-block-handling|here]].

=== I have a motherboard with VIA MVP3 chipset and experience ReiserFS problems. ===
[mailto:woster73@yahoo.com William Oster] answers: If you are using a motherboard with a [http://www.via.com.tw/en/products/apollo/mvp3.jsp VIA MVP3] chipset, you may have [[ReiserFS]] problems caused by the way your kernel is configured for the so-called [http://lxr.linux.no/linux+v2.6.30/drivers/pci/quirks.c PCI quirks]. My experience is with kernels 2.2.18 and 2.2.19, but it may affect the 2.4.x series too if you are using the MVP3 chipset (popular in socket 7 type motherboards, such as those used by the AMD K6 and classic Pentium).
I've confirmed this problem with several motherboards using the VIA MVP3 chipset, ReiserFS 3.5.29 to 3.5.32, and [http://lxr.linux.no/linux+v2.6.30/Documentation/scsi/ncr53c8xx.txt NCR 53c8xx SCSI]. But please note: it probably affects '''any controller which uses DMA and PCI bus mastering'''. Problems which I was inclined to attribute to ReiserFS were actually problems with this kernel [mis]configuration. If you fit this profile, '''DO NOT''' enable the <tt>CONFIG_PCI_QUIRKS</tt> configuration option in the <tt>/usr/src/linux/.config</tt> file. Although the Linux documentation suggests that this option can be enabled if in doubt, '''DO NOT''' enable it. It was never intended for the VIA MVP3 chipset anyway. It affects the way DMA is handled, and in combination with ReiserFS (and possibly NCR SCSI) it can cause random disk corruption, which eventually results in ReiserFS and/or SCSI errors. Evidently ReiserFS exercises DMA and the SCSI bus very thoroughly; the problems seem not to be as likely under the ext2 filesystem. Check your <tt>/usr/src/linux/.config</tt> file. You are safe from this problem if you find this line:
 # CONFIG_PCI_QUIRKS is not set
Any other setting could be dangerous to MVP3 chipset ReiserFS users, especially when using PCI bus-mastering controllers such as the NCR 53c8xx series. Re-configure your kernel to disable the "PCI quirks" option, then <tt>make dep</tt>, rebuild, and reinstall.

=== I am having extensive problems using ReiserFS; it seems to have bugs all over the place. I'm not compiling with a buggy compiler. What is happening? How can this be stable? ===
You have hardware problems. Really, you do. Even if the bugs don't show up with ext2, you have hardware problems. (See [[#ReiserFS running 3C hotter than ext2]].) Most SuSE users use ReiserFS.
Obscure bugs probably still exist; but if you find bugs as easily as when using Windows, you have bad RAM, a bad CPU, a bad cable, bad cooling, a [[#I have a motherboard with VIA MVP3 chipset and experience ReiserFS problems|VIA chipset with PCI quirks turned on]], or other hardware or software-layer bugs. ReiserFS is stable. You can be sure that if bugs are encountered easily and commonly with normal usage patterns, it is not us. This does not mean that the next release won't somehow break something, though :-/ Real bug reports are, at the time of writing, outnumbered 10 to 1 by hardware bugs that trigger error messages. We are working on making our error messages better at catching hardware bugs and identifying them as such. There is only so far we can go, though, in runtime consistency checking without serious speed reductions. We don't release software unless it goes through extensive testing; so if you don't think that our testing could have missed the bug, it is probably hardware.

=== How can I put a label (like allowed by the <tt>-L</tt> option of <tt>mkfs.ext2</tt>) on a ReiserFS instance? ===
Currently, this feature is only implemented for the [[ReiserFS]] v3.6 disk format. Adding it to the v3.5 disk format would break the existing disk format, and there is not enough free space in the superblock. You can set a label (and UUID) with a recent [[reiserfsprogs]] package on a [[ReiserFS]] v3.6 filesystem, using the <tt>-l</tt> switch (<tt>-u</tt> for UUID) to the [[reiserfstune]] (for existing partitions) or [[mkreiserfs]] (for partitions being created) commands. Support for labels and UUIDs was integrated into [[reiserfsprogs]] starting from version 3.x.1a.

=== Why, when I'm working on files (i.e. having open files) on my laptop, does ReiserFS access the disk every 5 seconds? This effectively prevents the disk from spinning down, i.e. APM modes to take over, even when I'm not writing anything. ===
Brent Graveland <bgraveland@hyperchip.com> answers: It's the atime update.
Every time you run sync, the sync program's atime is updated. The next sync writes this atime update, then sync gets updated again...

=== RedHat does not unmount / with ReiserFS on halt. How to fix it? ===
RedHat users kindly provided these patches (not tested by us): rc.sysinit.patch and halt.patch. Note that if you have RedHat Linux 7.2 or later, you do not need these patches.

=== How do I run programs from the reiserfsprogs package on encrypted devices? ===
In order to access such encrypted entities, you need to use the losetup tool to bind your entity to a loop device.

=== Are there any recommendations for or against any particular hard drive manufacturers for use with reiserfs? ===
There is basically no preference; the general rule "the faster the drive and the lower the seek time, the better" applies as always. On the other hand, almost every hard drive manufacturer has a "widely known" broken series of hard drives. The most recent example is IBM's "Deskstar" series of disks, especially the DTLA models produced in Hungary in 2000-2001. These are known to fail very often, to the point that you probably don't want to use them even if you have already paid for them. Other Deskstar drives also seem to be a not very good choice. IBM released a note that Deskstar drives should not run for more than 8 hours/day on average. These drives are also known to be very sensitive to temperature conditions and to fail on overheating. A class-action lawsuit against IBM over that drive series is in progress.

=== I am using RedHat 7.0 with gcc 2.96; why does ReiserFS seem unstable with it? ===
Use the most recent version of RedHat (gcc 2.96-85 or later, as shipped with RedHat 7.2, although 7.1 is also okay for ReiserFS). The choice of an unstable unreleased version of gcc 2.96 by RedHat as the default gcc was a Slashdot controversy. gcc 2.96 on RedHat 7.0 was unstable, and ReiserFS was one of the things that would fail for it. There are two gcc 2.96 versions: 2.96 and 2.96-85.
2.96-85 works for ReiserFS, and the other (the one on RedHat 7.0) surely does not. Read the Linux kernel instructions about what compiler to use. The solution to code not working on broken compilers is the one RedHat has taken: fix the compiler. They fixed the compiler and thereby allowed the correctly compiled ReiserFS to work.

=== In my program I am using fsync(2) calls after each write to the file to guarantee integrity of my file data, and this is very slow; how can I improve the performance? ===
Answer from Chris Mason: The main thing to remember is that fsyncs introduce a bunch of disk writes, and force the FS to wait on the buffers. The key to keeping performance up is to make it easy for the FS to do as much as possible before the fsync call. So, if your application modifies 3 files, and you want to make sure all 3 changes are safely on disk:
 write(file1)
 write(file2)
 write(file3)
 fsync(file1)
 fsync(file2)
 fsync(file3)
is much faster than:
 write(file1)
 fsync(file1)
 write(file2)
 fsync(file2)
 write(file3)
 fsync(file3)
It is also faster to write over existing bytes in a file than it is to append new bytes onto the end of a file. When you overwrite existing bytes in the file, you don't have to commit new metadata to disk on fsync(); the FS can just write the data blocks. This means fewer seeks. The more you write to a single file before calling fsync, the faster overall things will run.
 write(8k)
 fsync(file)
is much faster than:
 write(4k)
 fsync(file)
 write(4k)
 fsync(file)
Trying to optimize for those 3 things alone can make a huge performance difference overall.

Answer from Josh MacDonald: You have to understand that even using fsync() after every write() makes no guarantees. If the system crashes during either the write or fsync operation, your data may be lost or corrupted. Suppose the fsync() does complete: does your application keep its data in multiple files?
If that is the case and you need to write() to multiple files as part of a transaction, you have even greater problems. The only safe and easy way for you to implement some kind of transaction with the traditional file system guarantees is to use rename():
# Keep all of your data in a single file.
# Periodically write a complete copy of your database to a temporary file.
# Rename the temporary file to the original database name.
(Addition from Nikita Danilov: One can implement something like a phase-tree at user level and use rename to atomically switch the root of the tree. This overcomes the "everything-in-one-file" limitation but has the added complexity of requiring crash recovery.)

Answer from Nikita Danilov: Stop your development for now and wait until the reiser4 filesystem is released; it has a transaction API exported to userspace. That transaction API would solve all of your problems.

== Our program needs to access a lot of working files. What is the recommended way to organize files to get the best results out of ReiserFS? Should all the files be placed in a single directory, or should the files be spread across a directory tree to limit the number of files per directory? Can you also summarize the relevant caching and locking effects? ==
Traditional file systems typically have poor performance when there are many files in a single directory, but not [[ReiserFS]]. These other file systems perform poorly because they use a linear search algorithm to find and replace entries in a directory. This means that the file system must scan, on average, half the blocks of a directory for every access. Typically, applications are required to work around this problem by manually structuring a tree of directories, allowing each individual directory to remain limited in size. For example, see how the Squid web proxy stores a large collection of files. ReiserFS does not have this problem because it uses an internal tree to store all directories and file metadata.
Directory operations remain efficient even for very large directories, so you can write your application free from this performance concern. However, there are several issues that complicate this matter: namely locking and locality. The Linux VFS currently imposes locking restrictions that serialize many operations on directories, so if concurrent processes or threads will access the collection of files, then you may be better off using multiple directories. [[Reiser4]] will improve upon this restriction, although it is still under development. ReiserFS attempts to store all of the files in a directory, along with the directory itself, in nearby locations on disk. An application may exploit this spatial locality if it can predict which files will be accessed with temporal locality. You may be better off using multiple directories to store your files if you can predict that many files within a directory will be accessed at the same time. To summarize, ReiserFS supports efficient access to large directories, and most traditional file systems do not. However, locking and locality issues may guide your decision to use manually structured directory trees instead, at least until ReiserFS exports control over packing locality to users and improves its locking.

[[category:ReiserFS]] [[category:Reiser4]]

This FAQ is very [[ReiserFS]] centric and often a bit dated. The [[Reiser4]] filesystem is mentioned as ''upcoming''. Be sure to search the [[mailinglists|mailing list archives]] and help update this FAQ - thanks!

__TOC__

=== What are the specs for ReiserFS: maximum number of files, of files a directory can have, of sub-dirs in a dir, of links to a file, maximum file size, maximum filesystem size, etc.?
===
Specifications for [[ReiserFS]]:
{| cellpadding="5" cellspacing="0" border="1"
| '''property''' || '''3.5''' || '''3.6'''
|-
| max number of files || 2<sup>32</sup>-3 => 4 Gi - 3 || 2<sup>32</sup>-3 => 4 Gi - 3
|-
| max number of files a dir can have || 518701895 (but in practice this value is limited by the hash function; the r5 hash allows about 1 200 000 file names without collisions) || 2<sup>32</sup>-4 => 4 Gi - 4 (but in practice this value is limited by the hash function; the r5 hash allows about 1 200 000 file names without collisions)
|-
| max file size || 2<sup>31</sup>-1 bytes => 2 Gi - 1 || 2<sup>60</sup> bytes => 1 Ei, but the page cache limits this to 8 Ti on architectures with 32-bit int
|-
| max number of links to a file || 2<sup>16</sup> => 64 Ki || 2<sup>32</sup> => 4 Gi
|-
| max filesystem size || 2<sup>32</sup> (4K) blocks => 16 Ti || 2<sup>32</sup> (4K) blocks => 16 Ti
|}
ReiserFS does '''meta-data journaling''', enabling fast crash recovery without the expense of full '''data journaling'''. There [ftp://ftp.suse.com/pub/people/mason/patches/intermezzo-alpha/ were] separate [http://marc.info/?l=reiserfs-devel&m=100895310422415&w=2 patches from Chris Mason] that implement full data journaling for ReiserFS on Linux 2.4.16.

'''Note''': Full data journaling is considered by many to be a good way to achieve file data integrity across system crashes. However, although file data may appear to be consistent from the kernel point of view, since there is no API exported to userspace to control transactions, we may end up in a situation where the application makes two write requests (as part of one logical transaction) but only one of them gets journaled before the system crashes. From the application point of view, we may then end up with inconsistent data in the file. Such issues should be addressed with the upcoming [[Reiser4]]: such an API will be exported to userspace, and all programs that need transactions will be able to use it.
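Until such a transaction API exists, the closest portable equivalent is the write-to-temporary-file, fsync, rename pattern recommended elsewhere in this FAQ: rename() atomically switches the name to the fully written copy, so readers see either the old or the new version, never a mix. A sketch under the "single-file database" assumption; the function and file names are illustrative, not part of any real API:

```python
import os
import tempfile

def atomic_update(path, data):
    """Replace the contents of `path` so that readers observe either
    the complete old version or the complete new version."""
    dirname = os.path.dirname(os.path.abspath(path))
    # 1. Write the complete new copy to a temporary file in the
    #    same directory (rename must not cross filesystems).
    fd, tmp = tempfile.mkstemp(dir=dirname)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())   # push the data blocks to disk first
        # 2. rename() atomically replaces the old file with the new one.
        os.rename(tmp, path)
    except BaseException:
        os.unlink(tmp)
        raise

db = os.path.join(tempfile.mkdtemp(), "app.db")
atomic_update(db, b"state v1")
atomic_update(db, b"state v2")
print(open(db, "rb").read())   # always a complete version
```

Note that this gives per-file atomicity only; updates spanning several files still need application-level recovery logic, which is exactly the gap the Reiser4 transaction API was meant to close.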
=== Mount fails after reiserfsck --rebuild-tree failure ===
When [[reiserfsck]] --rebuild-tree is run, the first thing it does is set the root inode value to -1. This makes the filesystem unmountable. (So, if [[reiserfsck]] fails later on because the filesystem contains serious errors, the filesystem cannot be mounted.) Therefore, once [[reiserfsck]] --rebuild-tree has failed for one of your filesystems, mounting of that partition is disabled. To correct the error, first check whether you have the latest [[reiserfsprogs]] package installed. If that fails, please send a bug report to our [[mailinglists|mailing list]] and be ready to answer our questions.

=== Why is the execution time for a <tt>find . -type f | xargs cat {} \;</tt> command much longer when using ReiserFS than for the same command when using ext2? ===
This effect is observed if the measured file set was produced by untarring an archive created not from a ReiserFS partition (or by copying files from a non-ReiserFS partition, or by running a program that writes a bunch of files in some order). This is because the <tt>readdir()</tt> operation performed on the ReiserFS partition returns filenames not in the original write order but rather in some hash order (dependent on the hash function used). Thus, when reading the files' contents, the hard drive heads must move when going from one file to another. If you want ReiserFS to outperform any other filesystem in your setup, here is one solution: copy the entire directory that you are not satisfied with to the same partition but with a different name (use <tt>cp -a</tt>), then remove the old directory and rename the new one to the old name. If the partition does not have enough space available, another approach is to <tt>tar(1)</tt> up the whole partition, clear it, and then untar the previously saved data.

=== Is quota-support built-in in the vanilla 2.4 kernels for ReiserFS?
===
No. Quota support patches for the 2.4 branch of Linux kernels are bundled separately; they were once available [ftp://ftp.suse.com/pub/people/mason/patches/reiserfs/quota-2.4/ at SuSE] (gone) from Chris Mason, and are still [http://gd.tuwien.ac.at/utils/fs/reiserfs/quota-patches/ mirrored at TU-Wien]. The reason these patches were not included in the 2.4 kernel branch is that they implement a new quota format and need new quota code too, which is too big a change for the 2.4 series of kernels. Various Linux distribution vendors (e.g. [http://www.suse.com SuSE]) do ship reiserfs-quota-enabled kernels, though.

=== I am getting some errors in my kernel logs that I do not know how to interpret ===
Messages like:
 vs-13070: reiserfs_read_inode2: i/o failure occurred trying to find stat data of [1718696 1718710 0x0 SD]
 zam-7001: io error in reiserfs_find_entry
most likely accompanied by samples like those below are definite signs of hard disk problems (bad sectors):
 hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
 hda: dma_intr: error=0x40 { UncorrectableError }, LBAsect=6599945, sector=4286584
 end_request: I/O error, dev 03:03 (hda), sector 4286584
or
 scsi0: ERROR on channel 0, id 1, lun 0, CDB: Read (10) 00 00 01 ee 60 00 00 08 00
 Current sd 08:00: sense key Medium Error
or
 I/O error: dev 08:21, sector 65704
Messages about <tt>"access beyond end of device"</tt> may have lots of different causes, ranging from not rebooting after fdisk requested it, to unfinished resizings, to data corruption. The following messages mean you have a noisy IDE cable, or that it is simply too low quality for the chosen UDMA mode.
Try to replace the cable with a better one, or choose a slower UDMA mode:
 hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
 hda: dma_intr: error=0x84 { DriveStatusError BadCRC }
 hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
 hda: dma_intr: error=0x84 { DriveStatusError BadCRC }
If you see any message from [[ReiserFS]] that you cannot interpret and there is nothing similar to the messages above around it, [[mailinglists|mail the message to us]] and we will explain it to you.

=== Will ReiserFS implement streams, extended attributes, etc.? ===
[[FAQ/streams|Here]] is the one-page answer.

=== Reiserfs appears to be very slow while the RAID is resyncing. Mounting takes several minutes. Once mounted, an <tt>ls(1)</tt> in the mounted directory hangs. Forever. Once the RAID is sync'ed, things appear to work pretty well. How can that be fixed? ===
First of all, we have included a patch that helps mount the drive faster in the Linux kernel since 2.4.19. You can grab the patch for earlier kernels [http://gd.tuwien.ac.at/utils/fs/reiserfs/reiserfs-for-2.5/2.5.4.pending/07-reiserfs-bitmap-journal-read-ahead.diff here]. Also, RAID drivers have settings for the '''minimal guaranteed''' and '''maximal possible''' RAID rebuild bandwidth usage. These values are controlled through the <tt>/proc/sys/dev/raid/speed_limit_min</tt> and <tt>/proc/sys/dev/raid/speed_limit_max</tt> sysctl variables (values are in 100 KiB/s). It seems that the RAID logic cannot always tell whether the disk subsystem is busy at a given time. When it thinks the disk subsystem is idle, it tries to rebuild the RAID array at the <tt>speed_limit_max</tt> speed, which defaults to 100 MB per second. Decrease this value to something more suitable (a bit of experimentation might be needed).

=== I get attempt to read past the end of the partition error messages; is ReiserFS corrupted? ===
You changed your partition sizes, and then before rebooting ran [[mkreiserfs]].
The kernel does not change its belief about the partition sizes until reboot time. (This is fixable, but nobody has fixed it as of Dec. 2001.) [[mkreiserfs]] created a filesystem that has the wrong notion of how large the partition it is on is. The filesystem's notion of the partition boundaries will last past reboot, even though the kernel's notion will change. So yes, it is corrupted. Some other kinds of metadata breakage can also lead to such messages.

=== Can I use VMware with ReiserFS? ===
VMware was tested on [http://www.suse.com/ SuSE Linux] with a [http://support.microsoft.com/gp/lifean18 Windows98] guest OS on a [[ReiserFS]] partition. There's one trick at the beginning: the following line was added to the VMware config file
 host.FSSupportLocking1 = 0x52654973 # (0x52654973 == *(u32 *) "ReIs")
Thanks to [mailto:gkade@bigbrother.net Gregory K. Ade] for this hint.

=== How do I install Debian potato with ReiserFS as root partition? ===
[[FAQ/potato_part|Here]] are instructions by [mailto:LeBlanc@mcc.ac.uk Dr. A.V. Le Blanc].

=== Starting with linux kernel v2.4.21 I cannot mount my FS anymore. Why? ===
Special sanity checks were added to the kernel code to prohibit mounting of filesystems that are bigger than the underlying block device. If you now see this message on mount:
 Filesystem on xx:yy cannot be mounted because it is bigger than the device
you may need to run fsck or increase the size of your LVM partition. Or maybe you forgot to reboot after fdisk when it told you to. If you do not use LVM, that usually means you need to run <tt>[[reiserfsck]] --rebuild-sb</tt> on your filesystem and agree to change its default size to the proposed one.

=== Is it ok to use ReiserFS on a small size storage device: e.g. 16MB NAND flash block device? ===
[[FAQ/small_blocks|Here]] are instructions.

=== How do I change root from ext2 to ReiserFS without loss of data? ===
[[FAQ/change_fs|Here]] are instructions.
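The magic constant in the VMware workaround above is not arbitrary: <tt>0x52654973</tt> is the four ASCII bytes <tt>"ReIs"</tt> read as a 32-bit integer, most-significant byte first (on a little-endian x86 machine a literal <tt>*(u32 *)</tt> cast would yield the byte-swapped value, so the comment in the config line is shorthand). A few lines of Python confirm the correspondence:

```python
import struct

MAGIC = 0x52654973   # value from the VMware config workaround

# The four ASCII bytes "ReIs", taken most-significant byte first
# (">I" = big-endian unsigned 32-bit), give exactly that constant.
(value,) = struct.unpack(">I", b"ReIs")
print(hex(value))       # 0x52654973
print(value == MAGIC)   # True
```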
=== <tt>mount: /dev/hda5 has wrong major or minor number</tt> - what does that mean? ===
The kernel does not know anything about [[ReiserFS]]; it is neither compiled in nor available as a module.

=== Will it be possible to read/write ReiserFS partitions created now with future versions of ReiserFS? ===
Yes. [[ReiserFS]]-3.6.x (Linux 2.4.x) works with both the old (3.5) and the new (3.6) formats. ReiserFS-3.5.x (Linux 2.2.x) can only work with the old (3.5) disk format. There is no way to convert the new (3.6) disk format to the old (3.5), but the old (3.5) format can be converted to the new one (3.6) with the <tt>-o conv</tt> [[mount|mount option]].
A fix for this is evolving; the later your ReiserFS patch, the better we handle this. === <tt>du(1)</tt> says ReiserFS makes space efficiency worse. === Use <tt>df(1)</tt> not <tt>du(1)</tt>, or use ''raw'' option for <tt>du(1)</tt> if it's supported. <tt>st_blocks</tt> summed up is less accurate than <tt>st_size</tt> for [[ReiserFS]] because we pack tails, and <tt>st_blocks</tt> rounds numbers up. === <tt>mkreiserfs(8)</tt> fails after repartitioning === The kernel requires you to reboot after repartitioning (for all filesystems). We intend to fix that. === Performance is poor, and my disk at 96% full still has free space. === Once a disk drive gets more than 85% full, the performance starts to suffer unless using a repacker (which isn't implemented yet.) You can probably get away with 92%, but if performance is valued you are making a mistake to keep it any fuller. This is true for almost all filesystems. [[ReiserFS]], because of our packing tails together, pack more data into a given percentage used, but it still is subject to the rules for max recommended percentage used. If you create the whole disk with one copy and then mount it read-only, then you can fully pack it without problem. Please be sure that you copy it from (or <tt>tar</tt> it from) a reiserfs partition so that files are created in reiserfs <tt>readdir()</tt> order as this will improve performance. === Why do I get a signal 11 when compiling the kernel using ReiserFS and not ext2? === Your CPU is overheating and/or you have [http://www.bitwizard.nl/sig11/ bad RAM]. === But it doesn't happen with ext2? === ext2 uses less heat sensitive gates in the CPU :-) Seriously, ext2 and [[ReiserFS]] contain random differences, and overheating and bad RAM have random sensitivities. ([http://www.bitwizard.nl/sig11/ Signal 11] is not due to ReiserFS. One user had a cable blocking the fan; it did not affect ext2, but it wasn't until he fixed the cable-fan problem that ReiserFS worked.) 
=== Can I use ReiserFS on other architectures than i386? === Yes, starting from the Linux [http://kernel.org/pub/linux/kernel/v2.4/ChangeLog-2.4.13 kernel 2.4.13], ReiserFS can be run on any Linux supported arch. === I need a program which will help me in rebuilding/recreating my partition table. === [http://brzitwa.de/mb/gpart/ gpart] is a utility that handles ext2, FAT, Linux swap, HPFS, NTFS, FreeBSD and Solaris/x86 disklabels, Minix, ReiserFS. It prints a proposed content for the primary partition table and is well-documented. === What partition type should I use for ReiserFS? === [http://www.win.tue.nl/~aeb/partitions/partition_types.html Linux native filesystem] (83) === Can I use 32GB+ IDE Hard Drives with ReiserFS? === Yes if you use Linux kernel 2.4 and up. === What about resizing ReiserFS? === This can be done with [[resize_reiserfs]]. === What should I put into the fifth (aka dump, fs_freq ) and the sixth (aka pass, fs_passno ) fields of /etc/fstab for ReiserFS filesystems? === You'd put in <tt>"0 0"</tt>, e.g. /dev/sda3 /var reiserfs notail,nodev,nosuid,noexec <font color="red">0 0</font> === Why are ReiserFS filesystems not fscked on reboot after a crash? === Because [[ReiserFS]] provides journaling of meta-data. After a crash, the consistency of a filesystem is restored by replaying the transaction log. === Can I interactively repair a filesystem that was corrupted? === This is done with [[reiserfsck]]. === Can I use <tt>dump(8)</tt> and <tt>restore(8)</tt> with ReiserFS? Any caveats? === No. <tt>dump(8)</tt> uses knowledge of the internal structure of ext2 and works together with restore, which also uses ext2 specific knowledge, to back up ext2 files. dump and restore are specific to ext2 and will not work with [[ReiserFS]]. To back up ReiserFS files use <tt>tar(1)</tt>, which is universal and can be applied to almost any reasonable Linux filesystem. 
It is well known among system administrators that <tt>dump(8)</tt> is more complete than Unix tar, and that there is quite a list of things that Unix tar will fail to back up properly. This is not true of GNU tar, which is quite complete. Basically, the only real disadvantage of GNU tar compared to <tt>dump(8)</tt> is speed. Unfortunately, because it shares the same name as Unix <tt>tar(1)</tt>, people are reluctant to believe this. (Yes, GNU tar has incremental backups, etc.) We will performance-optimize ReiserFS backups for you (and the rest of the world) for $30K, which is not a lot if you are a large site spending a few hundred thousand on equipment for backups. === Does ReiserFS support snapshots? === No, but you can create [[ReiserFS]] on top of an [http://sourceware.org/lvm2/ LVM] logical volume and use LVM snapshot capabilities. === Can I check ReiserFS filesystems for errors without unmounting them? === [[reiserfsck]] in checking mode may run over filesystems mounted read-only. There is no official way to fix mounted filesystems, though. You MUST completely unmount your filesystem in order to have it fixed. If you have LVM, you can check the consistency of filesystems mounted read-write; here is the script contributed by Andreas Dilger: === What ReiserFS mount options should I use to get the best performance from a mail server? === [http://archives.neohapsis.com/archives/postfix/2001-03/1148.html Craig Sanders answered] in detail: By the time I got around to running <tt>bonnie</tt>, the <tt>postmark</tt> and <tt>postal</tt> benchmarks had convinced me that <tt>notail</tt> was essential. Host system: * Debian GNU/Linux (of course :) * Linux kernel 2.4.2 with latest 20010305 ReiserFS patch * dual P3-866 (256K cache) * 512MB RAM * [http://www.adaptec.com/en-US/support/scsi/u160/ASC-19160/ Adaptec 19160] SCSI controller External drive box: * [http://www.domex.com.tw/support/product/8230u.htm Domex 8230u] RAID controller, 32MB battery-backed cache.
* 6 x 18GB IBM [http://www.hitachigst.com/tech/techlib.nsf/techdocs/85256AB8006A31E587256A78005A3610/$file/ddys_sp21.PDF DDYS-T18350M] drives For this particular hardware, [[ReiserFS]]/notail on RAID5 was the clear performance winner for a mail server with lots of synced random I/O. === Does using ReiserFS mean I can just press the power-off button without running <tt>/sbin/shutdown</tt>? Does it mean there is no risk of data loss? === No, definitely not. As of now, [[ReiserFS]] only provides meta-data journaling - that is, it records which files have been created or opened, whether they have had their size changed, or where they have been relocated. It guarantees that the structure of the internal ReiserFS tree will be correct, thereby allowing you to start back up after an unclean shutdown without having to run fsck on all the files that have not been changed. Data in files that were being used at the time of the crash could have been corrupted. This is usual for most filesystems. Data-journaling filesystems guarantee that there will be no garbage written into a file, but they don't guarantee that a file update will complete. (Only [[Reiser4]] guarantees that filesystem operations are performed as atomic operations, and provides atomic transaction functionality.) [[ReiserFS]] does not guarantee that the file contents themselves are uncorrupted, nor that no data is lost. Moreover, even if all of your system is on ReiserFS, many system components (like daemons, database managers, etc.) require a proper shutdown procedure to function correctly. However, there is a [ftp://ftp.suse.com/pub/people/mason/patches/data-logging separate implementation of data logging] (dead link) that will [http://marc.info/?l=reiserfs-devel&m=103472026011689&w=2 soon] go into the mainstream kernel. === How does ReiserFS support bad block handling? === This is covered [[FAQ/bad-block-handling|here]]. === I have a motherboard with a VIA MVP3 chipset and experience ReiserFS problems.
=== William Oster <woster73@yahoo.com> answers: If you are using a motherboard with a VIA MVP3 chipset, you may have ReiserFS problems caused by the way your kernel is configured for the so-called "pci quirks". My experience is with kernels 2.2.18 and 2.2.19, but it may affect the 2.4.x series too if you are using the MVP3 chipset (popular in socket 7 type motherboards, such as those used by the AMD K6 and classic Pentium). I've confirmed this problem with several motherboards using the VIA MVP3 chipset, ReiserFS 3.5.29 to 3.5.32, and NCR 53c8xx SCSI. But please note: it probably affects any controller which uses DMA and PCI bus mastering. Problems which I was inclined to attribute to ReiserFS were actually problems with this kernel misconfiguration. If you fit this profile, DO NOT enable the "pci quirks" configuration option in the /usr/src/linux/.config file. Although the Linux documentation suggests that this option can be enabled if in doubt, DO NOT enable it. It was never intended for the VIA MVP3 chipset anyway. It affects the way DMA is handled, and the combination of ReiserFS (and possibly NCR SCSI) can cause random disk corruption which eventually will result in ReiserFS and/or SCSI errors. Evidently ReiserFS exercises the DMA and SCSI bus very thoroughly; the problems seem less likely to occur under the ext2 filesystem. Check your /usr/src/linux/.config file. You are SAFE from this problem if you find this line: # CONFIG_PCI_QUIRKS is not set Any other setting could be dangerous to MVP3 chipset ReiserFS users, especially when using PCI bus-mastering controllers such as the NCR 53c8xx series. Re-configure your kernel to disable the "pci quirks" option, then make dep, rebuild, and reinstall. === I am having extensive problems using ReiserFS; it seems to have bugs all over the place. I'm not compiling with a buggy compiler. What is happening? How can this be stable? === You have hardware problems. Really, you do.
Even if the bugs don't show up with ext2, you have hardware problems. (See the FAQ question about ReiserFS running 3°C hotter than ext2.) Most SuSE users use ReiserFS. Obscure bugs probably still exist; but if you find bugs as easily as when using Windows, you have bad RAM, a bad CPU, a bad cable, bad cooling, a VIA chipset with PCI quirks turned on, or other hardware or software-layer bugs. ReiserFS is stable. You can be sure that if the bugs are encountered easily and commonly with normal usage patterns, it is not us. This does not mean that the next release won't somehow break something though :-/..... Real bug reports are, at the time of writing, outnumbered 10 to 1 by hardware bugs that trigger error messages. We are working on making our error messages better at catching hardware bugs and identifying them as such. There is only so far we can go, though, in runtime consistency checking without serious speed reductions. We don't release software unless it goes through extensive testing; so if you don't think that our testing could have missed the bug, it is probably hardware. === How can I put a label (like that allowed by the <tt>-L</tt> option of <tt>mkfs.ext2</tt>) on a ReiserFS instance? === Currently, this feature is only implemented for the [[ReiserFS]] v3.6 disk format. Adding it to the v3.5 disk format would break the existing disk format, and there is not enough free space in the superblock. You can set a label (and UUID) with a recent [[reiserfsprogs]] package on a [[ReiserFS]] v3.6 filesystem using the <tt>-l</tt> switch (<tt>-u</tt> for UUID) to the [[reiserfstune]] (for existing partitions) or [[mkreiserfs]] (for partitions being created) commands. Support for labels and UUIDs was integrated into [[reiserfsprogs]] starting from version 3.x.1a. === Why, when I'm working on files (i.e. having open files) on my laptop, does ReiserFS access the disk every 5 seconds? This effectively prevents the disk from spinning down, i.e. prevents APM modes from taking over, even when I'm not writing anything.
=== Brent Graveland <bgraveland@hyperchip.com> answers: It's the atime update. Every time you run sync, the sync program's atime is updated. The next sync writes this atime update, then sync gets updated again... === RedHat does not unmount / with ReiserFS on halt. How do I fix it? === RedHat users kindly provided these patches (not tested by us): rc.sysinit.patch and halt.patch. Note that if you have RedHat Linux 7.2 or later, you do not need these patches. === How do I run programs from the reiserfsprogs package on encrypted devices? === In order to access such encrypted entities, you need to use the losetup tool to bind your entity to a loop device. === Are there any recommendations for or against particular hard drive manufacturers for use with ReiserFS? === There is basically no preference; the general rule "the faster the drive and the lower the seek time, the better" applies as always. On the other hand, almost every hard drive manufacturer has a "widely known" broken series of hard drives. The most recent example is IBM's "Deskstar" series of disks, especially the DTLA models produced in Hungary in 2000-2001. These are known to fail very often, to the point that you probably don't want to use them even if you already paid for them. Other Deskstar drives also seem to be a poor choice. IBM released a note that Deskstar drives should not run for more than 8 hours/day on average. These drives are also known to be very sensitive to temperature conditions and to fail on overheating. A class action lawsuit against IBM over that drive series is in progress. === I am using RedHat 7.0 with gcc 2.96; why does ReiserFS seem unstable with it? === Use the most recent version of RedHat (gcc 2.96-85 or later, shipped with RedHat 7.2, although 7.1 is also okay for ReiserFS). The choice of an unstable, unreleased version of gcc 2.96 by RedHat as the default gcc was a Slashdot controversy.
gcc 2.96 on RedHat 7.0 was unstable, and ReiserFS was one of the things that would fail for it. There are two gcc versions: 2.96 and 2.96-85. 2.96-85 works for ReiserFS, and the other (the one on RedHat 7.0) surely does not. Read the Linux kernel instructions about what compiler to use. The solution to code not working on broken compilers is the one RedHat has taken: fix the compiler. They fixed the compiler and thereby allowed the correctly compiled ReiserFS to work. === In my program I am using fsync(2) calls after each write to guarantee the integrity of my file data, and this is very slow; how can I improve the performance? === Answer from Chris Mason: the main thing to remember is that fsyncs introduce a bunch of disk writes and force the FS to wait on the buffers. The key to keeping performance up is to make it easy for the FS to do as much as possible before the fsync call. So, if your application modifies 3 files and you want to make sure all 3 changes are safely on disk:
 write(file1)
 write(file2)
 write(file3)
 fsync(file1)
 fsync(file2)
 fsync(file3)
is much faster than:
 write(file1)
 fsync(file1)
 write(file2)
 fsync(file2)
 write(file3)
 fsync(file3)
It is also faster to write over existing bytes in the file than it is to append new bytes onto the end of a file. When you overwrite existing bytes in the file, you don't have to commit new metadata to disk on fsync(); the FS can just write the data blocks. This means fewer seeks. The more you write to a single file before calling fsync, the faster overall things will run:
 write(8k)
 fsync(file)
is much faster than:
 write(4k)
 fsync(file)
 write(4k)
 fsync(file)
Trying to optimize for those 3 things alone can make a huge performance difference overall. Answer from Josh MacDonald: you have to understand that even using fsync() after every write() makes no guarantees. If the system crashes during either the write or fsync operation, your data may be lost or corrupted.
Suppose the fsync() does complete; does your application keep its data in multiple files? If so, and you need to write() to multiple files as part of a transaction, you have even greater problems. The only safe and easy way to implement some kind of transaction with the traditional filesystem guarantees is to use rename():
# Keep all of your data in a single file.
# Periodically write a complete copy of your database to a temporary file.
# Rename the temporary file to the original database name.
(Addition from Nikita Danilov: one can implement something like a phase-tree at user level and use rename to atomically switch the root of the tree. This overcomes the "everything-in-one-file" limitation but has the added complexity of requiring crash recovery.) Answer from Nikita Danilov: stop your development for now and wait until the [[Reiser4]] filesystem is released; it has a transaction API exported to userspace that would solve all of these problems. == Our program needs to access a lot of working files. What is the recommended way to organize files to get the best results out of ReiserFS? Should all the files be placed in a single directory, or should the files be spread across a directory tree to limit the number of files per directory? Can you also summarize the relevant caching and locking effects? == Traditional file systems typically have poor performance when there are many files in a single directory, but not [[ReiserFS]]. These other file systems perform poorly because they use a linear search algorithm to find and replace entries in a directory. This means that the file system must scan, on average, half the blocks of a directory for every access. Typically, applications are required to work around this problem by manually structuring a tree of directories, allowing each individual directory to remain limited in size. For example, see how the Squid web proxy stores a large collection of files.
ReiserFS does not have this problem because it uses an internal tree to store all directories and file metadata. Directory operations remain efficient even for very large directories, so you can write your application free from this performance concern. However, there are several issues that complicate this matter: namely locking and locality. The Linux VFS currently imposes locking restrictions that serialize many operations on directories, so if concurrent processes or threads will access the collection of files, then you may be better off using multiple directories. [[Reiser4]] will improve upon this restriction, although it is still under development. ReiserFS attempts to store all of the files in a directory, along with the directory itself, in nearby locations on disk. An application may exploit this spatial locality if it can predict which files will be accessed with temporal locality. You may be better off using multiple directories to store your files if you can predict that many files within a directory will be accessed at the same time. To summarize, ReiserFS supports efficient access to large directories and most traditional file systems do not. However, locking and locality issues may guide your decision to use manually structured directory trees instead, at least until ReiserFS exports control over packing locality to users, and improves its locking. [[category:ReiserFS]] [[category:Reiser4]] This FAQ is very [[ReiserFS]]-centric and often a bit dated. The [[Reiser4]] filesystem is mentioned as ''upcoming''. Be sure to search the [[mailinglists|mailing list archives]] and help update this FAQ - Thanks! __TOC__ === What are the specs for ReiserFS: maximum number of files, of files a directory can have, of sub-dirs in a dir, of links to a file, maximum file size, maximum filesystem size, etc.?
=== Specifications for [[ReiserFS]]:
{|cellpadding="5" cellspacing="0" border="1"
| '''property''' || '''3.5''' || '''3.6'''
|-
| max number of files || 2<sup>32</sup>-3 => 4 Gi - 3 || 2<sup>32</sup>-3 => 4 Gi - 3
|-
| max number of files a dir can have || 518701895 (but in practice this value is limited by the hash function; the r5 hash allows about 1 200 000 file names without collisions) || 2<sup>32</sup>-4 => 4 Gi - 4 (but in practice this value is limited by the hash function; the r5 hash allows about 1 200 000 file names without collisions)
|-
| max file size || 2<sup>31</sup>-1 => 2 Gi - 1 || 2<sup>60</sup> bytes => 1 Ei, but the page cache limits this to 8 Ti on architectures with 32-bit int
|-
| max number of links to a file || 2<sup>16</sup> => 64 Ki || 2<sup>32</sup> => 4 Gi
|-
| max filesystem size || 2<sup>32</sup> (4K) blocks => 16 Ti || 2<sup>32</sup> (4K) blocks => 16 Ti
|}
ReiserFS does '''meta-data journaling''', enabling fast crash recovery without the expense of full '''data journaling'''. There [ftp://ftp.suse.com/pub/people/mason/patches/intermezzo-alpha/ were] separate [http://marc.info/?l=reiserfs-devel&m=100895310422415&w=2 patches from Chris Mason] that implement full data journaling for ReiserFS for Linux 2.4.16. '''Note''': Full data journaling is considered by many to be a good way to achieve file data integrity across system crashes. However, although file data may appear to be consistent from the kernel point of view, since there is no API exported to userspace to control transactions, we may end up in a situation where the application makes two write requests (as part of one logical transaction) but only one of these gets journaled before the system crashes. From the application point of view, we may then end up with inconsistent data in the file. Such issues should be addressed with the upcoming [[Reiser4]]: such an API will be exported to userspace, and all programs that need transactions will be able to use it.
=== Mount fails after reiserfsck --rebuild-tree failure === When [[reiserfsck]] --rebuild-tree is run, the first thing it does is set the root inode value to -1. This makes the filesystem unmountable. (So if [[reiserfsck]] fails later on because the filesystem contains serious errors, the filesystem cannot be mounted.) Therefore, once [[reiserfsck]] --rebuild-tree has failed for one of your filesystems, mounting of that partition is disabled. To correct the error, first check that you have the latest [[reiserfsprogs]] package installed. If that fails, please send a bug report to our [[mailinglists|mailing list]] and be ready to answer our questions. === Why is the execution time for a <tt>find . -type f | xargs cat {} \;</tt> command much longer when using ReiserFS than for the same command when using ext2? === This effect is observed if the measured file set was produced by untarring an archive that was not created from a ReiserFS partition (or by copying files from a non-ReiserFS partition, or by running a program that writes a bunch of files in some order). This is because the <tt>readdir()</tt> operation performed on the ReiserFS partition returns filenames not in the original write order but rather in some hash order (dependent on the hash function used). Thus when reading the files' contents, the hard drive heads must move when going from one file to another. If you want ReiserFS to outperform any other filesystem in your setup, here is one solution: copy the entire directory that you are not satisfied with to the same partition but with a different name (use <tt>cp -a</tt>), then remove the old directory and rename the new one with the old name. If the partition does not have enough space available, another approach is to <tt>tar(1)</tt> up the whole partition, clear it, and then untar the previously saved data. === Is quota support built into the vanilla 2.4 kernels for ReiserFS?
=== No, quota support for the 2.4 branch of Linux kernels is bundled separately; the patches by Chris Mason were once available [ftp://ftp.suse.com/pub/people/mason/patches/reiserfs/quota-2.4/ at SuSE] (gone) and are still [http://gd.tuwien.ac.at/utils/fs/reiserfs/quota-patches/ mirrored at TU-Wien]. The reason these patches were not included in the 2.4 kernel branch is that they implement a new quota format and need new quota code too, which is too big a change for the 2.4 series of kernels. Various Linux distribution vendors (e.g. [http://www.suse.com SuSE]) do ship ReiserFS-quota-enabled kernels, though. === I am getting some errors in my kernel logs that I do not know how to interpret === Messages like:
 vs-13070: reiserfs_read_inode2: i/o failure occurred trying to find stat data of [1718696 1718710 0x0 SD]
 zam-7001: io error in reiserfs_find_entry
most likely accompanied by samples like those below, are definite signs of hard disk problems (bad sectors):
 hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
 hda: dma_intr: error=0x40 { UncorrectableError }, LBAsect=6599945, sector=4286584
 end_request: I/O error, dev 03:03 (hda), sector 4286584
or
 scsi0: ERROR on channel 0, id 1, lun 0, CDB: Read (10) 00 00 01 ee 60 00 00 08 00
 Current sd 08:00: sense key Medium Error
or
 I/O error: dev 08:21, sector 65704
Messages about <tt>"access beyond end of device"</tt> may have many different causes, from not rebooting after fdisk requested it, to unfinished resizes, to data corruption. The following messages mean you have a noisy IDE cable, or it is just too low quality for the chosen UDMA mode.
Try to replace the cable with a better one, or choose a slower UDMA mode:
 hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
 hda: dma_intr: error=0x84 { DriveStatusError BadCRC }
 hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
 hda: dma_intr: error=0x84 { DriveStatusError BadCRC }
If you see any message from [[ReiserFS]] that you cannot interpret and there is nothing similar to the messages above around it, [[mailinglists|mail the message to us]] and we will explain it to you. === Will ReiserFS implement streams, extended attributes, etc.? === [[FAQ/streams|Here]] is the one-page answer. === ReiserFS appears to be very slow while the RAID is resyncing. Mounting takes several minutes. Once mounted, an <tt>ls(1)</tt> in the mounted directory hangs. Forever. Once the RAID is synced, things appear to work pretty well. How can that be fixed? === First of all, we have included a patch that helps mount the drive faster in the Linux kernel since 2.4.19. You can grab the patch for earlier kernels [http://gd.tuwien.ac.at/utils/fs/reiserfs/reiserfs-for-2.5/2.5.4.pending/07-reiserfs-bitmap-journal-read-ahead.diff here]. Also, RAID drivers have '''minimal guaranteed''' and '''maximal possible''' RAID rebuild bandwidth usage. These values are controlled through the <tt>/proc/sys/dev/raid/speed_limit_min</tt> and <tt>/proc/sys/dev/raid/speed_limit_max</tt> sysctl variables (values are in 100 KiB/s). It seems that the RAID logic cannot always tell whether the disk subsystem is busy at a given time. When it thinks the disk subsystem is idle, it tries to rebuild the RAID array at the <tt>speed_limit_max</tt> speed, which defaults to 100 MB per second. Decrease this value to something more suitable (a bit of experimentation might be needed). === I get attempt to read past the end of the partition error messages; is ReiserFS corrupted? === You changed your partition sizes, and then ran [[mkreiserfs]] before rebooting.
The kernel does not change its belief about what the partition sizes are until reboot time. (This is fixable, but nobody has fixed it as of Dec. 2001.) [[mkreiserfs]] created a filesystem that has the wrong notion of how large the partition it is on is. The filesystem's notion of what the partition boundaries are will last past reboot, even though the kernel's notion will change. So yes, it is corrupted. Some other kinds of metadata breakage can also lead to such messages. === Can I use VMware with ReiserFS? === VMware was tested on [http://www.suse.com/ SuSE Linux] with a [http://support.microsoft.com/gp/lifean18 Windows98] guest OS on a [[ReiserFS]] partition. There's one trick at the beginning: the following line was added to the VMware config file
 host.FSSupportLocking1 = 0x52654973 # (0x52654973 == *(u32 *) "ReIs")
Thanks to [mailto:gkade@bigbrother.net Gregory K. Ade] for this hint. === How do I install Debian potato with ReiserFS as the root partition? === [[FAQ/potato_part|Here]] are instructions by [mailto:LeBlanc@mcc.ac.uk Dr. A.V. Le Blanc]. === Starting with Linux kernel v2.4.21 I cannot mount my FS anymore. Why? === Special sanity checks were added to the kernel code to prohibit mounting of filesystems that are bigger than the underlying block device. If you now see this message on mount:
 Filesystem on xx:yy cannot be mounted because it is bigger than the device
you may need to run fsck or increase the size of your LVM partition. Or maybe you forgot to reboot after fdisk when it told you to. If you do not use LVM, that usually means you need to run <tt>[[reiserfsck]] --rebuild-sb</tt> on your filesystem and agree to change its default size to the proposed one. === Is it OK to use ReiserFS on a small storage device, e.g. a 16MB NAND flash block device? === [[FAQ/small_blocks|Here]] are instructions. === How do I change root from ext2 to ReiserFS without loss of data? === [[FAQ/change_fs|Here]] are instructions.
=== <tt>mount: /dev/hda5 has wrong major or minor number</tt> - what does that mean? === The kernel does not know anything about [[ReiserFS]]; it is neither compiled in nor available as a module. === Will it be possible to read/write ReiserFS partitions created now with future versions of ReiserFS? === Yes. [[ReiserFS]]-3.6.x (Linux-2.4.x) works with both the old (3.5) and the new (3.6) formats. ReiserFS-3.5.x (Linux-2.2.x) can only work with the old (3.5) disk format. There is no way to convert the new (3.6) disk format to the old (3.5), but the old (3.5) format can be converted to the new one (3.6) with the <tt>-o conv</tt> [[mount|mount option]]. === The ReiserFS module doesn't insert properly - why? === After applying the patch, ''recompile'' the whole kernel including the modules target, reboot, then try to insert the module. === Can I use ReiserFS with software RAID? === Yes, for all RAID levels using any Linux >= 2.4.1, but '''DO NOT''' use RAID with Linux 2.2.x. Our journaling and their RAID code step on each other in the buffering code. Also, mirroring is '''not''' safe in the 2.2.x kernels, because online mirror rebuilds in 2.2.x break the write-ordering requirements for the log. If you crash in the middle of an online rebuild, your meta-data may be corrupted. The only RAID level that is safe with [[ReiserFS]] in the 2.2.x kernels is the striping/concatenation level. === Can I use ReiserFS with 3ware RAID? === Yes, but you need to use Linux 2.2.19 or later for reasons other than [[ReiserFS]]. Also, if you encounter problems, you should be suspicious that it might not be ReiserFS that has the bug. See the [http://web.archive.org/web/20030415160519/http://www.3ware.com/support/raid5techbulletin.shtml special instructions] (archive.org). === Why do things freeze on my IDE hard drive for annoying amounts of time? === Because when large writes are scheduled all at once, reads can starve.
A fix for this is evolving; the later your ReiserFS patch, the better we handle this. === <tt>du(1)</tt> says ReiserFS makes space efficiency worse. === Use <tt>df(1)</tt> not <tt>du(1)</tt>, or use ''raw'' option for <tt>du(1)</tt> if it's supported. <tt>st_blocks</tt> summed up is less accurate than <tt>st_size</tt> for [[ReiserFS]] because we pack tails, and <tt>st_blocks</tt> rounds numbers up. === <tt>mkreiserfs(8)</tt> fails after repartitioning === The kernel requires you to reboot after repartitioning (for all filesystems). We intend to fix that. === Performance is poor, and my disk at 96% full still has free space. === Once a disk drive gets more than 85% full, the performance starts to suffer unless using a repacker (which isn't implemented yet.) You can probably get away with 92%, but if performance is valued you are making a mistake to keep it any fuller. This is true for almost all filesystems. [[ReiserFS]], because of our packing tails together, pack more data into a given percentage used, but it still is subject to the rules for max recommended percentage used. If you create the whole disk with one copy and then mount it read-only, then you can fully pack it without problem. Please be sure that you copy it from (or <tt>tar</tt> it from) a reiserfs partition so that files are created in reiserfs <tt>readdir()</tt> order as this will improve performance. === Why do I get a signal 11 when compiling the kernel using ReiserFS and not ext2? === Your CPU is overheating and/or you have [http://www.bitwizard.nl/sig11/ bad RAM]. === But it doesn't happen with ext2? === ext2 uses less heat sensitive gates in the CPU :-) Seriously, ext2 and [[ReiserFS]] contain random differences, and overheating and bad RAM have random sensitivities. ([http://www.bitwizard.nl/sig11/ Signal 11] is not due to ReiserFS. One user had a cable blocking the fan; it did not affect ext2, but it wasn't until he fixed the cable-fan problem that ReiserFS worked.) 
=== Can I use ReiserFS on other architectures than i386? === Yes, starting from the Linux [http://kernel.org/pub/linux/kernel/v2.4/ChangeLog-2.4.13 kernel 2.4.13], ReiserFS can be run on any Linux supported arch. === I need a program which will help me in rebuilding/recreating my partition table. === [http://brzitwa.de/mb/gpart/ gpart] is a utility that handles ext2, FAT, Linux swap, HPFS, NTFS, FreeBSD and Solaris/x86 disklabels, Minix, ReiserFS. It prints a proposed content for the primary partition table and is well-documented. === What partition type should I use for ReiserFS? === [http://www.win.tue.nl/~aeb/partitions/partition_types.html Linux native filesystem] (83) === Can I use 32GB+ IDE Hard Drives with ReiserFS? === Yes if you use Linux kernel 2.4 and up. === What about resizing ReiserFS? === This can be done with [[resize_reiserfs]]. === What should I put into the fifth (aka dump, fs_freq ) and the sixth (aka pass, fs_passno ) fields of /etc/fstab for ReiserFS filesystems? === You'd put in <tt>"0 0"</tt>, e.g. /dev/sda3 /var reiserfs notail,nodev,nosuid,noexec <font color="red">0 0</font> === Why are ReiserFS filesystems not fscked on reboot after a crash? === Because [[ReiserFS]] provides journaling of meta-data. After a crash, the consistency of a filesystem is restored by replaying the transaction log. === Can I interactively repair a filesystem that was corrupted? === This is done with [[reiserfsck]]. === Can I use <tt>dump(8)</tt> and <tt>restore(8)</tt> with ReiserFS? Any caveats? === No. <tt>dump(8)</tt> uses knowledge of the internal structure of ext2 and works together with restore, which also uses ext2 specific knowledge, to back up ext2 files. dump and restore are specific to ext2 and will not work with [[ReiserFS]]. To back up ReiserFS files use <tt>tar(1)</tt>, which is universal and can be applied to almost any reasonable Linux filesystem. 
It is well known among system administrators that <tt>dump(8)</tt> is more complete than unix tar, and that there is quite a list of things that unix tar will fail to properly backup. This is not true of GNU/tar, which is quite complete. Basically, the only real disadvantage of GNU/tar compared to <tt>dump(8)</tt> is speed. Unfortunately, because it shares the same name as Unix <tt>tar(1)</tt>, people are reluctant to believe this. (Yes, the GNU/tar has incremental backups, etc.) We will performance optimize ReiserFS backups for you (and the rest of the world) for $30K, which is not a lot if you are a large site spending a few hundred thousand on equipment for backups. === Does ReiserFS support snapshots? === No, but you can create [[ReiserFS]] on top of [http://sourceware.org/lvm2/ LVM] logical volume and use LVM snapshot capabilities. === Can I check reiserfs filesystems for errors without unmounting them? === [[reiserfsck]] in checking mode may run over filesystems mounted read-only. There is no official way to fix mounted filesystems, though. You MUST completely unmount your filesystem in order to have it fixed. If you have LVM, you can check consistency of filesystems mounted read-write, here is the script contributed by Andreas Dilger: === What ReiserFS mount options should I use to get the performance winner for a mail server? === [http://archives.neohapsis.com/archives/postfix/2001-03/1148.html Craig Sanders answered] in detail: By the time I got around to running <tt>bonnie</tt>, the <tt>postmark</tt> and <tt>postal</tt> benchmarks had convinced me that <tt>notail</tt> was essential. Host system: * Debian GNU/Linux (of course :) * Linux kernel 2.4.2 with latest 20010305 ReiserFS patch * dual P3-866 (256K cache) * 512MB RAM * [http://www.adaptec.com/en-US/support/scsi/u160/ASC-19160/ Adaptec 19160] SCSI Controller External drive box: * [http://www.domex.com.tw/support/product/8230u.htm Domex 8230u] RAID controller, 32MB battery-backed cache. 
* 6 x 18GB IBM [http://www.hitachigst.com/tech/techlib.nsf/techdocs/85256AB8006A31E587256A78005A3610/$file/ddys_sp21.PDF DDYS-T18350M] drives

For this particular hardware I was using, [[ReiserFS]]/notail on RAID5 was the clear performance winner for a mail server with lots of synced random I/O.

=== Does using ReiserFS mean I can just press the power off button without running <tt>/sbin/shutdown</tt>? Does it mean there is no risk of data loss? ===
No, definitely not. As of now, [[ReiserFS]] only provides meta-data journaling - that is, it records which files have been created or opened, whether they have had their size changed, or where they have been relocated. It guarantees that the structure of the internal ReiserFS tree will be correct, thereby allowing you to start back up after an unclean shutdown without having to run fsck on all the files that have not been changed. Data in files that were being used at the time of the crash could have been corrupted. This is usual for most filesystems. Data journaling filesystems guarantee that there will be no garbage written into a file, but they don't guarantee that a file update will be completed. (Only [[Reiser4]] guarantees that filesystem operations are performed as atomic operations, and provides atomic transaction functionality.) [[ReiserFS]] does not guarantee that the file contents themselves are uncorrupted, nor that no data is lost. Moreover, even if all of your system is on ReiserFS, many system components (like daemons, database managers, etc.) require the shutdown procedure for proper functioning. However, there is a [ftp://ftp.suse.com/pub/people/mason/patches/data-logging separate implementation of data logging] that will [http://marc.info/?l=reiserfs-devel&m=103472026011689&w=2 soon] go into the mainstream kernel.

=== How does ReiserFS support bad block handling? ===
This is covered [[FAQ/bad-block-handling|here]].

=== I have a motherboard with VIA MVP3 chipset and experience ReiserFS problems. ===
William Oster <woster73@yahoo.com> answers: If you are using a motherboard with a VIA MVP3 chipset, you may have ReiserFS problems caused by the way your kernel is configured for the so-called "pci quirks". My experience is with kernels 2.2.18 and 2.2.19, but it may affect the 2.4.x series too if you are using the MVP3 chipset (popular in socket 7 type motherboards, such as those used by the AMD K6 and classic Pentium). I've confirmed this problem with several motherboards using the VIA MVP3 chipset, ReiserFS 3.5.29 to 3.5.32, and NCR 53c8xx SCSI. But please note: it probably affects any controller which uses DMA and PCI bus mastering. Problems which I was inclined to attribute to ReiserFS were actually problems with this kernel (mis)configuration. If you fit this profile, DO NOT enable the "pci quirks" configuration option in the /usr/src/linux/.config file. Although the Linux documentation suggests that this option can be enabled if in doubt, DO NOT enable it. It was never intended for the VIA MVP3 chipset anyway. It affects the way DMA is handled, and the combination of ReiserFS (and possibly NCR SCSI) can cause random disk corruption which eventually will result in ReiserFS and/or SCSI errors. Evidently ReiserFS exercises the DMA and SCSI bus very thoroughly. The problems seem not to be as likely under the ext2 filesystem. Check your /usr/src/linux/.config file. You are SAFE from this problem if you find this line:

 # CONFIG_PCI_QUIRKS is not set

Any other setting could be dangerous to MVP3-chipset ReiserFS users, especially when using PCI bus mastering controllers such as the NCR 53c8xx series. Re-configure your kernel to disable the "pci quirks" option, then make dep, rebuild, and reinstall.

=== I am having extensive problems using ReiserFS; it seems to have bugs all over the place. I'm not compiling with a buggy compiler. What is happening? How can this be stable? ===
You have hardware problems. Really, you do.
Even if the bugs don't show up with ext2, you have hardware problems. (See the FAQ question about ReiserFS running 3°C hotter than ext2.) Most SuSE users use ReiserFS. Obscure bugs probably still exist; but if you find bugs as easily as when using Windows, you have bad RAM, a bad CPU, a bad cable, bad cooling, a VIA chipset with PCI quirks turned on, or other hardware or software layer bugs. ReiserFS is stable. You can be sure that if the bugs are encountered easily and commonly with normal usage patterns, it is not us. This does not mean that the next release won't somehow break something though :-/..... Real bug reports are at the time of writing outnumbered 10 to 1 by hardware bugs that trigger error messages. We are working on making our error messages better at catching hardware bugs and identifying them as such. There is only so far we can go, though, in runtime consistency checking without serious speed reductions. We don't release software unless it goes through extensive testing; so if you don't think that our testing could have missed the bug, it is probably hardware.

=== How can I put a label (like that allowed by the <tt>-L</tt> option of <tt>mkfs.ext2</tt>) on a ReiserFS instance? ===
Currently, this feature is only implemented for the [[ReiserFS]] v3.6 disk format. Adding it to the v3.5 disk format would break the existing disk format, and there is not enough free space in the superblock. You can set a label (and UUID) with a recent [[reiserfsprogs]] package on a [[ReiserFS]] v3.6 filesystem using the <tt>-l</tt> switch (<tt>-u</tt> for UUID) to the [[reiserfstune]] (for existing partitions) or [[mkreiserfs]] (for partitions being created) commands. Support for labels and UUIDs was integrated into [[reiserfsprogs]] starting from version 3.x.1a.

=== Why, when I'm working on files (i.e. having open files) on my laptop, does ReiserFS access the disk every 5 seconds? This effectively prevents the disk from spinning down, i.e. prevents APM modes from taking over, even when I'm not writing anything. ===
Brent Graveland <bgraveland@hyperchip.com> answers: It's the atime update. Every time you run sync, the sync program's atime is updated. The next sync writes this atime update, then sync gets updated again...

=== RedHat does not unmount / with ReiserFS on halt. How to fix it? ===
RedHat users kindly provided these patches (not tested by us): rc.sysinit.patch and halt.patch. Note that if you have RedHat Linux 7.2 or later, you do not need these patches.

=== How do I run programs from the reiserfsprogs package on encrypted devices? ===
In order to access such encrypted entities, you need to use the losetup tool to bind your entity to a loop device.

=== Are there any recommendations for or against any particular hard drive manufacturers for use with ReiserFS? ===
There is basically no preference; the general "the faster the drive and the lower the seek time, the better" rule applies as always. On the other hand, almost every hard drive manufacturer has a "widely known" broken series of hard drives. The most recent example is IBM's "Deskstar" series of disks, especially the DTLA models produced in Hungary in 2000-2001. These are known to fail very often, to the point that you probably don't want to use them even if you already paid for them. Other Deskstar drives also seem to be a poor choice. IBM released a note that Deskstar drives should not run for more than 8 hours/day on average. These drives are also known to be very sensitive to temperature conditions and to fail on overheating. A class-action lawsuit against IBM over that drive series is in progress.

=== I am using RedHat 7.0 with gcc 2.96; why does ReiserFS seem unstable with it? ===
Use the most recent version of RedHat (gcc 2.96-85 or later, as shipped with RedHat 7.2, although 7.1 is also okay for ReiserFS). The choice of an unstable, unreleased version of gcc 2.96 by RedHat as the default gcc was a Slashdot controversy.
gcc 2.96 on RedHat 7.0 was unstable, and ReiserFS was one of the things that would fail for it. There are two gcc versions: 2.96 and 2.96-85. 2.96-85 works for ReiserFS, and the other (the one on RedHat 7.0) surely does not. Read the Linux kernel instructions about what compiler to use. The solution to code not working on broken compilers is the one RedHat has taken: fix the compiler. They fixed the compiler and thereby allowed the correctly compiled ReiserFS to work.

=== In my program I am using fsync(2) calls after each write to the file to guarantee integrity of my file data, and this is very slow; how can I improve the performance? ===
Answer from Chris Mason: The main thing to remember is that fsyncs introduce a bunch of disk writes, and force the FS to wait on the buffers. The key to keeping performance up is to make it easy for the FS to do as much as possible before the fsync call. So, if your application modifies 3 files, and you want to make sure all 3 changes are safely on disk:

 write(file1) write(file2) write(file3) fsync(file1) fsync(file2) fsync(file3)

is much faster than:

 write(file1) fsync(file1) write(file2) fsync(file2) write(file3) fsync(file3)

It is also faster to write over existing bytes in the file than it is to append new bytes onto the end of a file. When you overwrite existing bytes in the file, you don't have to commit new metadata to disk on fsync(); the FS can just write the data blocks. This means fewer seeks. The more you write to a single file before calling fsync, the faster overall things will run.

 write(8k) fsync(file)

is much faster than:

 write(4k) fsync(file) write(4k) fsync(file)

Trying to optimize for those 3 things alone can make a huge performance difference overall.

Answer from Josh MacDonald: You have to understand that even using fsync() after every write() makes no guarantees. If the system crashes during either the write or fsync operation, your data may be lost or corrupted.
Suppose the fsync() does complete: does your application keep its data in multiple files? If that is the case and you need to write() to multiple files as part of a transaction, you have even greater problems. The only safe and easy way for you to implement some kind of transaction with the traditional file system guarantees is to use rename():

# Keep all of your data in a single file.
# Periodically write a complete copy of your database to a temporary file.
# Rename the temporary file to the original database name.

(Addition from Nikita Danilov: One can implement something like a phase-tree at user level and use rename to atomically switch the root of the tree. This overcomes the "everything-in-one-file" limitation but has the added complexity of requiring crash recovery.)

Answer from Nikita Danilov: Stop your development for now and wait until the [[Reiser4]] filesystem is released; it will have a transaction API exported to userspace, and that transaction API would solve all of your problems.

== Our program needs to access a lot of working files. What is the recommended way to organize files to get the best results out of ReiserFS? Should all the files be placed in a single directory, or should the files be spread across a directory tree to limit the number of files per directory? Can you also summarize the relevant caching and locking effects? ==
Traditional file systems typically have poor performance when there are many files in a single directory, but not [[ReiserFS]]. These other file systems perform poorly because they use a linear search algorithm to find and replace entries in a directory. This means that the file system must scan, on average, half the blocks of a directory for every access. Typically, applications are required to work around this problem by manually structuring a tree of directories, allowing each individual directory to remain limited in size. For example, see how the Squid web proxy stores a large collection of files.
ReiserFS does not have this problem because it uses an internal tree to store all directories and file metadata. Directory operations remain efficient even for very large directories, so you can write your application free from this performance concern. However, there are several issues that complicate this matter: namely locking and locality. The Linux VFS currently imposes locking restrictions that serialize many operations on directories, so if concurrent processes or threads will access the collection of files, then you may be better off using multiple directories. [[Reiser4]] will improve upon this restriction, although it is still under development. ReiserFS attempts to store all of the files in a directory, along with the directory itself, in nearby locations on disk. An application may exploit this spatial locality if it can predict which files will be accessed with temporal locality. You may be better off using multiple directories to store your files if you can predict that many files within a directory will be accessed at the same time. To summarize, ReiserFS supports efficient access to large directories and most traditional file systems do not. However, locking and locality issues may guide your decision to use manually structured directory trees instead, at least until ReiserFS exports control over packing locality to users, and improves its locking.

[[category:ReiserFS]] [[category:Reiser4]]

This FAQ is very [[ReiserFS]]-centric and often a bit dated. The [[Reiser4]] filesystem is mentioned as ''upcoming''. Be sure to search the [[mailinglists|mailing list archives]] and help update this FAQ - Thanks!

__TOC__

=== What are the specs for ReiserFS: maximum number of files, of files a directory can have, of sub-dirs in a dir, of links to a file, maximum file size, maximum filesystem size, etc.? ===
Specifications for [[ReiserFS]]:

{|cellpadding="5" cellspacing="0" border="1"
| '''property''' || '''3.5''' || '''3.6'''
|-
| max number of files || 2<sup>32</sup>-3 => 4 Gi-3 || 2<sup>32</sup>-3 => 4 Gi-3
|-
| max number files a dir can have || 518701895 (but in practice this value is limited by hash function. r5 hash allows about 1 200 000 file names without collisions) || 2<sup>32</sup>-4 => 4 Gi-4 (but in practice this value is limited by hash function. r5 hash allows about 1 200 000 file names without collisions)
|-
| max file size || 2<sup>31</sup>-1 => 2 Gi-1 || 2<sup>60</sup> bytes => 1 Ei, but page cache limits this to 8 Ti on architectures with 32 bit int
|-
| max number links to a file || 2<sup>16</sup> => 64 Ki || 2<sup>32</sup> => 4 Gi
|-
| max filesystem size || 2<sup>32</sup> (4K) blocks => 16 Ti || 2<sup>32</sup> (4K) blocks => 16 Ti
|}

ReiserFS does '''meta-data journaling''', enabling fast crash recovery without the expense of full '''data journaling'''. There [ftp://ftp.suse.com/pub/people/mason/patches/intermezzo-alpha/ were] separate [http://marc.info/?l=reiserfs-devel&m=100895310422415&w=2 patches from Chris Mason] that implement full data journaling for ReiserFS for Linux 2.4.16.

'''Note''': Full data journaling is considered by many to be a good way to achieve file data integrity across system crashes. However, although file data may appear to be consistent from the kernel point of view, since there is no API exported to userspace to control transactions, we may end up in a situation where the application makes two write requests (as part of one logical transaction) but only one of these gets journaled before the system crashes. From the application point of view, we may then end up with inconsistent data in the file. Such issues should be addressed with the upcoming [[Reiser4]]: such an API will be exported to userspace, and all programs that need transactions will be able to use it.
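The temp-file/fsync()/rename() update pattern recommended in the fsync discussion earlier in this FAQ can be sketched as follows. This is a minimal illustration, not part of any ReiserFS tooling; the function name <tt>atomic_write</tt> is invented for the example:

```python
import os
import tempfile

def atomic_write(path, data):
    """Replace the file at `path` with `data` so that readers see either
    the old contents or the new contents, never a partial mix."""
    # The temporary file must live in the same directory as the target,
    # because rename() cannot move files across filesystems atomically.
    dirname = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=dirname)
    try:
        os.write(fd, data)
        os.fsync(fd)      # push the new data to disk before the rename
    finally:
        os.close(fd)
    os.rename(tmp, path)  # atomic replacement on POSIX filesystems
```

Note that, as Josh MacDonald's answer points out, this only gives whole-file atomicity; transactions spanning multiple files still need something stronger.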
=== Mount fails after reiserfsck --rebuild-tree failure ===
When [[reiserfsck]] --rebuild-tree is run, the first thing it does is set the root inode value to -1. This makes the filesystem unmountable. (So, if [[reiserfsck]] fails later on because the filesystem contains serious errors, it cannot be mounted.) Therefore, once [[reiserfsck]] --rebuild-tree has failed for one of your filesystems, mounting of this partition is disabled. To correct the error, first check that you have the latest [[Reiser4progs|reiserfsprogs]] package installed. If that fails, please send a bug report to our [[mailinglists|mailing list]] and be ready to answer our questions.

=== Why is the execution time for a <tt>find . -type f | xargs cat {} \;</tt> command much longer when using ReiserFS than for the same command when using ext2? ===
This effect is observed if the measured file set was produced by untarring some archive created not from a ReiserFS partition (or by copying files from a non-ReiserFS partition, or by running a program that writes a bunch of files in some order). This is because the <tt>readdir()</tt> operation performed on the ReiserFS partition returns filenames not in the original write order but rather in some hash order (dependent on the hash function used). Thus when reading files' contents, the hard drive heads must move when going from one file to another. If you want ReiserFS to outperform any other filesystem in your setup, here is one solution: copy the entire directory that you are not satisfied with to the same partition but with a different name (use <tt>cp -a</tt>), then remove the old directory and rename the new one with the old name. If the partition does not have enough space available, another approach is to <tt>tar(1)</tt> up the whole partition, clear it, and then untar the previously saved data.

=== Is quota-support built-in in the vanilla 2.4 kernels for ReiserFS? ===
No. Quota support for the 2.4 kernel branch is bundled separately as patches by Chris Mason; they were once available [ftp://ftp.suse.com/pub/people/mason/patches/reiserfs/quota-2.4/ at SuSE] (gone) and are still [http://gd.tuwien.ac.at/utils/fs/reiserfs/quota-patches/ mirrored at TU-Wien]. The reason these patches were not included in the 2.4 kernel branch is that they implement a new quota format and need new quota code too, which is too big a change for the 2.4 series of kernels. Various Linux distribution vendors (e.g. [http://www.suse.com SuSE]) do ship reiserfs-quota-enabled kernels, though.

=== I am getting some errors in my kernel logs that I do not know how to interpret ===
Messages like:

 vs-13070: reiserfs_read_inode2: i/o failure occurred trying to find stat data of [1718696 1718710 0x0 SD]
 zam-7001: io error in reiserfs_find_entry

most likely accompanied by samples like those below are definite signs of hard disk problems (bad sectors):

 hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
 hda: dma_intr: error=0x40 { UncorrectableError }, LBAsect=6599945, sector=4286584
 end_request: I/O error, dev 03:03 (hda), sector 4286584

or

 scsi0: ERROR on channel 0, id 1, lun 0, CDB: Read (10) 00 00 01 ee 60 00 00 08 00
 Current sd 08:00: sense key Medium Error

or

 I/O error: dev 08:21, sector 65704

Messages about <tt>"access beyond end of device"</tt> may have lots of different reasons, starting from not rebooting after fdisk requested it, unfinished resizings, or data corruption. The following messages mean you have a noisy IDE cable, or it is just too low quality for the chosen UDMA mode.
Try to replace the cable with a better one, or choose a slower UDMA mode:

 hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
 hda: dma_intr: error=0x84 { DriveStatusError BadCRC }
 hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
 hda: dma_intr: error=0x84 { DriveStatusError BadCRC }

If you see any message from [[ReiserFS]] that you cannot interpret and there is nothing similar to the messages above around it, [[mailinglists|mail the message to us]] and we will explain it to you.

=== Will ReiserFS implement streams, extended attributes, etc.? ===
[[FAQ/streams|Here]] is the one-page answer.

=== Reiserfs appears to be very slow while the RAID is resyncing. Mounting takes several minutes. Once mounted, an <tt>ls(1)</tt> in the mounted directory hangs. Forever. Once the RAID is sync'ed, things appear to work pretty well. How can that be fixed? ===
First of all, we have included a patch that helps mount the drive faster into the Linux kernel since 2.4.19. You can grab the patch for earlier kernels [http://gd.tuwien.ac.at/utils/fs/reiserfs/reiserfs-for-2.5/2.5.4.pending/07-reiserfs-bitmap-journal-read-ahead.diff here]. Also, RAID drivers have '''minimal guaranteed''' and '''maximal possible''' RAID rebuild bandwidth usage. These values are controlled through the <tt>/proc/sys/dev/raid/speed_limit_min</tt> and <tt>/proc/sys/dev/raid/speed_limit_max</tt> sysctl variables (values are in KiB/s). It seems that the RAID logic cannot always tell whether the disk subsystem is busy at a given time. When it thinks the disk subsystem is idle, it tries to rebuild the RAID array at <tt>speed_limit_max</tt> speed, which defaults to 100 MB per second. Decrease this value to something more suitable (a bit of experimentation might be needed).

=== I get attempt to read past the end of the partition error messages; is ReiserFS corrupted? ===
You changed your partition sizes, and then before rebooting ran [[mkreiserfs]].
The kernel does not change its belief in what the partition sizes are until reboot time. (This is fixable, but nobody has fixed it as of Dec. 2001.) [[mkreiserfs]] created a filesystem that has the wrong notion of how large the partition it sits on is. The filesystem's notion of what the partition boundaries are will last past reboot, even though the kernel's notion will change. So yes, it is corrupted. Some other kinds of metadata breakage can also lead to such messages.

=== Can I use VMware with ReiserFS? ===
VMware was tested on [http://www.suse.com/ SuSE Linux] with a [http://support.microsoft.com/gp/lifean18 Windows98] guest OS on a [[ReiserFS]] partition. There's one trick at the beginning: the following line was added to the VMware config file:

 host.FSSupportLocking1 = 0x52654973 # (0x52654973 == *(u32 *) "ReIs")

Thanks to [mailto:gkade@bigbrother.net Gregory K. Ade] for this hint.

=== How do I install Debian potato with ReiserFS as root partition? ===
[[FAQ/potato_part|Here]] are instructions by [mailto:LeBlanc@mcc.ac.uk Dr. A.V. Le Blanc].

=== Starting with linux kernel v2.4.21 I cannot mount my FS anymore. Why? ===
Special sanity checks were added to the kernel code to prohibit mounting of filesystems that are bigger than the underlying block device. If you now see this message on mount:

 Filesystem on xx:yy cannot be mounted because it is bigger than the device

you may need to run fsck or increase the size of your LVM partition. Or maybe you forgot to reboot after fdisk when it told you to. If you do not use LVM, that usually means you need to run <tt>[[reiserfsck]] --rebuild-sb</tt> on your filesystem and agree to change its default size to the proposed one.

=== Is it ok to use ReiserFS on a small size storage device: e.g. 16MB NAND flash block device? ===
[[FAQ/small_blocks|Here]] are instructions.

=== How do I change root from ext2 to ReiserFS without loss of data? ===
[[FAQ/change_fs|Here]] are instructions.
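As an aside on the VMware workaround above: the magic value <tt>0x52654973</tt> is simply the four ASCII bytes of "ReIs" packed into a 32-bit integer (in the byte order written in the comment). A quick sanity check:

```python
# The VMware FSSupportLocking1 magic: the bytes "ReIs" read as a
# big-endian 32-bit integer give exactly 0x52654973.
magic = int.from_bytes(b"ReIs", byteorder="big")
print(hex(magic))  # -> 0x52654973
```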
=== <tt>mount: /dev/hda5 has wrong major or minor number</tt> - what does that mean? ===
The kernel does not know anything about [[ReiserFS]]; it is neither compiled in nor available as a module.

=== Will it be possible to read/write ReiserFS partitions created now with future versions of ReiserFS? ===
Yes. [[ReiserFS]]-3.6.x (Linux-2.4.x) works with both the old (3.5) and the new (3.6) formats. ReiserFS-3.5.x (Linux-2.2.x) can only work with the old (3.5) disk format. There is no way to convert the new (3.6) disk format to the old (3.5), but the old (3.5) format can be converted to the new one (3.6) with the <tt>-o conv</tt> [[mount|mount option]].

=== The ReiserFS module doesn't insert properly - why? ===
After applying the patch, ''recompile'' the whole kernel including the modules target, reboot, then try to insert the module.

=== Can I use ReiserFS with the software RAID? ===
Yes, for all RAID levels using any Linux >= 2.4.1, but '''DO NOT''' use RAID with Linux 2.2.x. Our journaling and their RAID code step on each other in the buffering code. Also, mirroring is '''not''' safe in the 2.2.x kernels because online mirror rebuilds in 2.2.x break the write-ordering requirements for the log. If you crash in the middle of an online rebuild, your meta-data may be corrupted. The only RAID level that is safe with [[ReiserFS]] in the 2.2.x kernels is the striping/concatenation level.

=== Can I use ReiserFS with 3ware RAID? ===
Yes, but you need to use Linux 2.2.19 or later for reasons other than [[ReiserFS]]. Also, if you should encounter problems, you should be suspicious that it might not be ReiserFS that has the bug. See the [http://web.archive.org/web/20030415160519/http://www.3ware.com/support/raid5techbulletin.shtml special instructions] (archive.org).

=== Why do things freeze on my IDE hard drive for annoying amounts of time? ===
Because when large writes are scheduled all at once, reads can starve.
A fix for this is evolving; the later your ReiserFS patch, the better we handle this.

=== <tt>du(1)</tt> says ReiserFS makes space efficiency worse. ===
Use <tt>df(1)</tt>, not <tt>du(1)</tt>, or use the ''raw'' option for <tt>du(1)</tt> if it's supported. <tt>st_blocks</tt> summed up is less accurate than <tt>st_size</tt> for [[ReiserFS]] because we pack tails, and <tt>st_blocks</tt> rounds numbers up.

=== <tt>mkreiserfs(8)</tt> fails after repartitioning ===
The kernel requires you to reboot after repartitioning (for all filesystems). We intend to fix that.

=== Performance is poor, and my disk at 96% full still has free space. ===
Once a disk drive gets more than 85% full, performance starts to suffer unless you use a repacker (which isn't implemented yet). You can probably get away with 92%, but if performance is valued you are making a mistake to keep it any fuller. This is true for almost all filesystems. [[ReiserFS]], because of our packing tails together, packs more data into a given percentage used, but it is still subject to the rules for the maximum recommended percentage used. If you create the whole disk with one copy and then mount it read-only, then you can fully pack it without problem. Please be sure that you copy it from (or <tt>tar</tt> it from) a reiserfs partition so that files are created in reiserfs <tt>readdir()</tt> order, as this will improve performance.

=== Why do I get a signal 11 when compiling the kernel using ReiserFS and not ext2? ===
Your CPU is overheating and/or you have [http://www.bitwizard.nl/sig11/ bad RAM].

=== But it doesn't happen with ext2? ===
ext2 uses less heat-sensitive gates in the CPU :-) Seriously, ext2 and [[ReiserFS]] contain random differences, and overheating and bad RAM have random sensitivities. ([http://www.bitwizard.nl/sig11/ Signal 11] is not due to ReiserFS. One user had a cable blocking the fan; it did not affect ext2, but it wasn't until he fixed the cable-fan problem that ReiserFS worked.)
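The <tt>st_size</tt> vs. <tt>st_blocks</tt> distinction behind the <tt>du(1)</tt> answer above can be observed directly with <tt>stat(2)</tt>. A small sketch (the helper name is invented for illustration):

```python
import os

def apparent_and_allocated(path):
    # st_size is the logical file length; st_blocks counts 512-byte
    # units actually allocated, rounded up to the filesystem block
    # size. Summing st_blocks (as du does) therefore overstates the
    # space used by many small files on non-tail-packing filesystems.
    st = os.stat(path)
    return st.st_size, st.st_blocks * 512
```

On a typical 4 KiB-block filesystem, a 10-byte file reports an apparent size of 10 but an allocated size of a full block; a tail-packing filesystem can do better, which is why <tt>df(1)</tt> gives the more honest picture of free space.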
=== Can I use ReiserFS on other architectures than i386? === Yes, starting from the Linux [http://kernel.org/pub/linux/kernel/v2.4/ChangeLog-2.4.13 kernel 2.4.13], ReiserFS can be run on any Linux supported arch. === I need a program which will help me in rebuilding/recreating my partition table. === [http://brzitwa.de/mb/gpart/ gpart] is a utility that handles ext2, FAT, Linux swap, HPFS, NTFS, FreeBSD and Solaris/x86 disklabels, Minix, ReiserFS. It prints a proposed content for the primary partition table and is well-documented. === What partition type should I use for ReiserFS? === [http://www.win.tue.nl/~aeb/partitions/partition_types.html Linux native filesystem] (83) === Can I use 32GB+ IDE Hard Drives with ReiserFS? === Yes if you use Linux kernel 2.4 and up. === What about resizing ReiserFS? === This can be done with [[resize_reiserfs]]. === What should I put into the fifth (aka dump, fs_freq ) and the sixth (aka pass, fs_passno ) fields of /etc/fstab for ReiserFS filesystems? === You'd put in <tt>"0 0"</tt>, e.g. /dev/sda3 /var reiserfs notail,nodev,nosuid,noexec <font color="red">0 0</font> === Why are ReiserFS filesystems not fscked on reboot after a crash? === Because [[ReiserFS]] provides journaling of meta-data. After a crash, the consistency of a filesystem is restored by replaying the transaction log. === Can I interactively repair a filesystem that was corrupted? === This is done with [[reiserfsck]]. === Can I use <tt>dump(8)</tt> and <tt>restore(8)</tt> with ReiserFS? Any caveats? === No. <tt>dump(8)</tt> uses knowledge of the internal structure of ext2 and works together with restore, which also uses ext2 specific knowledge, to back up ext2 files. dump and restore are specific to ext2 and will not work with [[ReiserFS]]. To back up ReiserFS files use <tt>tar(1)</tt>, which is universal and can be applied to almost any reasonable Linux filesystem. 
It is well known among system administrators that <tt>dump(8)</tt> is more complete than unix tar, and that there is quite a list of things that unix tar will fail to properly backup. This is not true of GNU/tar, which is quite complete. Basically, the only real disadvantage of GNU/tar compared to <tt>dump(8)</tt> is speed. Unfortunately, because it shares the same name as Unix <tt>tar(1)</tt>, people are reluctant to believe this. (Yes, the GNU/tar has incremental backups, etc.) We will performance optimize ReiserFS backups for you (and the rest of the world) for $30K, which is not a lot if you are a large site spending a few hundred thousand on equipment for backups. === Does ReiserFS support snapshots? === No, but you can create [[ReiserFS]] on top of [http://sourceware.org/lvm2/ LVM] logical volume and use LVM snapshot capabilities. === Can I check reiserfs filesystems for errors without unmounting them? === [[reiserfsck]] in checking mode may run over filesystems mounted read-only. There is no official way to fix mounted filesystems, though. You MUST completely unmount your filesystem in order to have it fixed. If you have LVM, you can check consistency of filesystems mounted read-write, here is the script contributed by Andreas Dilger: === What ReiserFS mount options should I use to get the performance winner for a mail server? === [http://archives.neohapsis.com/archives/postfix/2001-03/1148.html Craig Sanders answered] in detail: By the time I got around to running <tt>bonnie</tt>, the <tt>postmark</tt> and <tt>postal</tt> benchmarks had convinced me that <tt>notail</tt> was essential. Host system: * Debian GNU/Linux (of course :) * Linux kernel 2.4.2 with latest 20010305 ReiserFS patch * dual P3-866 (256K cache) * 512MB RAM * [http://www.adaptec.com/en-US/support/scsi/u160/ASC-19160/ Adaptec 19160] SCSI Controller External drive box: * [http://www.domex.com.tw/support/product/8230u.htm Domex 8230u] RAID controller, 32MB battery-backed cache. 
* 6 x 18GB IBM [http://www.hitachigst.com/tech/techlib.nsf/techdocs/85256AB8006A31E587256A78005A3610/$file/ddys_sp21.PDF DDYS-T18350M] drives For this particular hardware I was using, [[ReiserFS]]/notail on RAID5 was the clear performance winner for a mail server with lots of synced random I/O. === Does using ReiserFS mean I can just press the power off button without running <tt>/sbin/shutdown</tt>? Does it mean there is no risk of data loss? === No, definitely not. As of now, [[ReiserFS]] only provides meta-data journaling - that is, it records which files have been created or opened, whether they have had their size changed, or where they have been relocated. It guarantees that the structure of the internal ReiserFS tree will be correct, thereby allowing you after an unclean shutdown to start back up without having to run fsck on all the files that have not been changed. Data in files that were being used at the time of the crash could have been corrupted. This is usual for most filesystems. Data journaling filesystems guarantee that there will be no garbage written into a file, but they don't guarantee that a file update will be. (Only [[Reiser4]] guarantees that filesystem operations are performed as atomic operations, and provides atomic transaction functionality.) [[ReiserFS]] does not guarantee the file contents themselves are uncorrupted nor that no data is lost. Moreover, even given that all of your system is on ReiserFS, many system components (like daemons, database managers, etc) require the shut down procedure for proper functioning. However, there is [ftp://ftp.suse.com/pub/people/mason/patches/data-logging separate implementation of data logging] that will [http://marc.info/?l=reiserfs-devel&m=103472026011689&w=2 soon] go into the mainstream kernel. === How does ReiserFS support bad block handling? === This is covered [[FAQ/bad-block-handling|here]]. === I have a motherboard with VIA MVP3 chipset and experience ReiserFS problems. 
=== William Oster <woster73@yahoo.com> answers: If you are using a motherboard with a VIA MVP3 chipset, you may have ReiserFS problems caused by the way your kernel is configured for the so-called "PCI quirks". My experience is with kernels 2.2.18 and 2.2.19, but it may affect the 2.4.x series too if you are using the MVP3 chipset (popular in Socket 7 motherboards, such as those used by the AMD K6 and classic Pentium). I've confirmed this problem with several motherboards using the VIA MVP3 chipset, ReiserFS 3.5.29 to 3.5.32, and NCR 53c8xx SCSI. But please note: it probably affects any controller which uses DMA and PCI bus mastering. Problems which I was inclined to attribute to ReiserFS were actually problems with this kernel [mis]configuration. If you fit this profile, DO NOT enable the "PCI quirks" configuration option in the /usr/src/linux/.config file. Although the Linux documentation suggests that this option can be enabled if in doubt, DO NOT enable it. It was never intended for the VIA MVP3 chipset anyway. It affects the way DMA is handled, and the combination with ReiserFS (and possibly NCR SCSI) can cause random disk corruption which eventually results in ReiserFS and/or SCSI errors. Evidently ReiserFS exercises the DMA and SCSI bus very thoroughly; the problems seem less likely under the ext2 filesystem. Check your /usr/src/linux/.config file. You are SAFE from this problem if you find this line:
 # CONFIG_PCI_QUIRKS is not set
Any other setting could be dangerous to MVP3-chipset ReiserFS users, especially when using PCI bus-mastering controllers such as the NCR 53c8xx series. Re-configure your kernel to disable the "PCI quirks" option, then make dep, rebuild, and reinstall. === I am having extensive problems using ReiserFS; it seems to have bugs all over the place. I'm not compiling with a buggy compiler. What is happening? How can this be stable? === You have hardware problems. Really, you do.
Even if the bugs don't show up with ext2, you have hardware problems. (See the FAQ question about ReiserFS running 3°C hotter than ext2.) Most SuSE users use ReiserFS. Obscure bugs probably still exist; but if you find bugs as easily as when using Windows, you have bad RAM, a bad CPU, a bad cable, bad cooling, a VIA chipset with PCI quirks turned on, or other hardware or software-layer bugs. ReiserFS is stable. You can be sure that if bugs are encountered easily and commonly with normal usage patterns, it is not us. This does not mean that the next release won't somehow break something, though :-/..... At the time of writing, real bug reports are outnumbered 10 to 1 by hardware bugs that trigger error messages. We are working on making our error messages better at catching hardware bugs and identifying them as such. There is only so far we can go in runtime consistency checking, though, without serious speed reductions. We don't release software unless it goes through extensive testing; so if you don't think that our testing could have missed the bug, it is probably hardware. === How can I put a label (like the one allowed by the <tt>-L</tt> option of <tt>mkfs.ext2</tt>) on a ReiserFS instance? === Currently, this feature is only implemented for the [[ReiserFS]] v3.6 disk format. Adding it to the v3.5 disk format would break the existing format, and there is not enough free space in the superblock. You can set a label (and UUID) with a recent [[Reiser4progs|reiserfsprogs]] package on a [[ReiserFS]] v3.6 filesystem using the <tt>-l</tt> switch (<tt>-u</tt> for UUID) of the [[reiserfstune]] (for existing partitions) or [[mkreiserfs]] (for partitions being created) commands. Support for labels and UUIDs was integrated into [[Reiser4progs|reiserfsprogs]] starting from version 3.x.1a. === Why, when I'm working on files (i.e. having open files) on my laptop, does ReiserFS access the disk every 5 seconds? This effectively prevents the disk from spinning down, i.e. prevents APM modes from taking over, even when I'm not writing anything. === Brent Graveland <bgraveland@hyperchip.com> answers: It's the atime update. Every time you run sync, the sync program's atime is updated. The next sync writes this atime update, then sync's atime gets updated again... === RedHat does not unmount / with ReiserFS on halt. How do I fix it? === RedHat users kindly provided these patches (not tested by us): rc.sysinit.patch and halt.patch. Note that if you have RedHat Linux 7.2 or later, you do not need these patches. === How do I run programs from the reiserfsprogs package on encrypted devices? === In order to access such encrypted devices you need to use the losetup tool to bind the device to a loop device. === Are there any recommendations for or against particular hard drive manufacturers for use with ReiserFS? === There is basically no preference; the general rule applies as always: the faster the drive and the lower the seek time, the better. On the other hand, almost every hard drive manufacturer has a "widely known" broken series of hard drives. The most recent example is IBM's "Deskstar" series, especially the DTLA models produced in Hungary in 2000-2001. These are known to fail very often, to the point that you probably don't want to use them even if you have already paid for them. Other Deskstar drives also seem to be a poor choice: IBM released a note that Deskstar drives should not run for more than 8 hours/day on average, and these drives are also known to be very sensitive to temperature conditions and to fail on overheating. A class-action lawsuit against IBM over that drive series is in progress. === I am using RedHat 7.0 with gcc 2.96; why does ReiserFS seem unstable with it? === Use the most recent version of RedHat (gcc 2.96-85 or later, as shipped with RedHat 7.2, although 7.1 is also okay for ReiserFS). The choice of an unstable, unreleased version of gcc 2.96 by RedHat as the default gcc was a Slashdot controversy.
gcc 2.96 on RedHat 7.0 was unstable, and ReiserFS was one of the things that would fail with it. There are two versions of gcc 2.96: plain 2.96 and 2.96-85. 2.96-85 works for ReiserFS; the other (the one on RedHat 7.0) surely does not. Read the Linux kernel instructions about what compiler to use. The solution to code not working on broken compilers is the one RedHat has taken: fix the compiler. They fixed the compiler and thereby allowed the correctly compiled ReiserFS to work. === In my program I am using fsync(2) calls after each write to the file to guarantee the integrity of my file data, and this is very slow; how can I improve the performance? === Answer from Chris Mason: The main thing to remember is that fsyncs introduce a bunch of disk writes and force the FS to wait on the buffers. The key to keeping performance up is to make it easy for the FS to do as much as possible before the fsync call. So, if your application modifies 3 files and you want to make sure all 3 changes are safely on disk:
 write(file1) write(file2) write(file3)
 fsync(file1) fsync(file2) fsync(file3)
is much faster than:
 write(file1) fsync(file1)
 write(file2) fsync(file2)
 write(file3) fsync(file3)
It is also faster to write over existing bytes in a file than it is to append new bytes onto the end of it. When you overwrite existing bytes, you don't have to commit new metadata to disk on fsync(); the FS can just write the data blocks. This means fewer seeks. The more you write to a single file before calling fsync, the faster things will run overall:
 write(8k) fsync(file)
is much faster than:
 write(4k) fsync(file)
 write(4k) fsync(file)
Optimizing for those 3 things alone can make a huge performance difference overall. Answer from Josh MacDonald: You have to understand that even using fsync() after every write() makes no guarantees. If the system crashes during either the write or the fsync operation, your data may be lost or corrupted.
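Chris Mason's ordering advice above, together with the write-then-rename() recipe discussed just below, can be sketched as follows. This is a minimal, filesystem-agnostic illustration of the pattern; the function names are ours, not part of any ReiserFS API:

```python
import os

def write_then_sync(updates):
    """Batch the work: issue all writes first, then all fsyncs,
    instead of interleaving write/fsync per file."""
    fds = []
    for path, data in updates:
        fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o644)
        os.write(fd, data)          # all writes first...
        fds.append(fd)
    for fd in fds:
        os.fsync(fd)                # ...then all fsyncs
        os.close(fd)

def atomic_replace(path, data):
    """Write a complete copy to a temporary file, fsync it, then
    rename(2) it into place; rename is atomic on a POSIX filesystem,
    so readers see either the old file or the new one, never a mix."""
    tmp = path + ".tmp"
    fd = os.open(tmp, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.write(fd, data)
        os.fsync(fd)                # data reaches disk before the switch
    finally:
        os.close(fd)
    os.rename(tmp, path)            # atomically replace the old copy
```

Note that the batched form gives the filesystem one large chunk of deferrable work instead of three forced round trips, which is exactly the effect the answer describes.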
Suppose the fsync() does complete - does your application keep its data in multiple files? If so, and you need to write() to multiple files as part of a transaction, you have even greater problems. The only safe and easy way to implement some kind of transaction with the traditional file system guarantees is to use rename():
# Keep all of your data in a single file.
# Periodically write a complete copy of your database to a temporary file.
# Rename the temporary file to the original database name.
(Addition from Nikita Danilov: one can implement something like a phase tree at user level and use rename to atomically switch the root of the tree. This overcomes the "everything-in-one-file" limitation but has the added complexity of requiring crash recovery.) Answer from Nikita Danilov: Stop your development for now and wait until the Reiser4 filesystem is released; it will have a transaction API exported to userspace. That transaction API would solve all of your problems. == Our program needs to access a lot of working files. What is the recommended way to organize files to get the best results out of ReiserFS? Should all the files be placed in a single directory, or should the files be spread across a directory tree to limit the number of files per directory? Can you also summarize the relevant caching and locking effects? == Traditional file systems typically have poor performance when there are many files in a single directory, but [[ReiserFS]] does not. These other file systems perform poorly because they use a linear search algorithm to find and replace entries in a directory. This means that the file system must scan, on average, half the blocks of a directory for every access. Typically, applications are required to work around this problem by manually structuring a tree of directories, allowing each individual directory to remain limited in size. For example, see how the Squid web proxy stores a large collection of files.
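The manual workaround just described - fanning files out across a tree of subdirectories keyed by a hash of the name - might look like the sketch below. This is an illustration of the general idea only, not Squid's actual on-disk layout:

```python
import hashlib
import os

def fanout_path(root, name, levels=2):
    """Map a file name to a nested subdirectory based on its hash,
    so no single directory grows too large. With levels=2 each file
    lands in root/<x>/<y>/name, where x and y are hex digits."""
    digest = hashlib.md5(name.encode()).hexdigest()
    subdirs = list(digest[:levels])          # e.g. "3a..." -> "3/a"
    directory = os.path.join(root, *subdirs)
    os.makedirs(directory, exist_ok=True)
    return os.path.join(directory, name)
```

As the next paragraph explains, on ReiserFS itself this kind of fanout is unnecessary for lookup performance, though locking and locality can still make it worthwhile.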
ReiserFS does not have this problem because it uses an internal tree to store all directories and file metadata. Directory operations remain efficient even for very large directories, so you can write your application free from this performance concern. However, there are several issues that complicate this matter: namely locking and locality. The Linux VFS currently imposes locking restrictions that serialize many operations on directories, so if concurrent processes or threads will access the collection of files, then you may be better off using multiple directories. [[Reiser4]] will improve upon this restriction, although it is still under development. ReiserFS attempts to store all of the files in a directory, along with the directory itself, in nearby locations on disk. An application may exploit this spatial locality if it can predict which files will be accessed with temporal locality. You may be better off using multiple directories to store your files if you can predict that many files within a directory will be accessed at the same time. To summarize, ReiserFS supports efficient access to large directories and most traditional file systems do not. However, locking and locality issues may guide your decision to use manually structured directory trees instead, at least until ReiserFS exports control over packing locality to users and improves its locking. [[category:ReiserFS]] [[category:Reiser4]] This FAQ is very [[ReiserFS]]-centric and often a bit dated. The [[Reiser4]] filesystem is mentioned as ''upcoming''. Be sure to search the [[mailinglists|mailing list archives]] and help update this FAQ - Thanks! __TOC__ === What are the specs for ReiserFS: maximum number of files, of files a directory can have, of sub-dirs in a dir, of links to a file, maximum file size, maximum filesystem size, etc.?
=== Specifications for [[ReiserFS]]:
{|cellpadding="5" cellspacing="0" border="1"
| '''property''' || '''3.5''' || '''3.6'''
|-
| max number of files || 2<sup>32</sup>-3 => 4 Gi - 3 || 2<sup>32</sup>-3 => 4 Gi - 3
|-
| max number of files a dir can have || 518701895 (but in practice this value is limited by the hash function; the r5 hash allows about 1,200,000 file names without collisions) || 2<sup>32</sup>-4 => 4 Gi - 4 (but in practice this value is limited by the hash function; the r5 hash allows about 1,200,000 file names without collisions)
|-
| max file size || 2<sup>31</sup>-1 bytes => 2 GiB - 1 || 2<sup>60</sup> bytes => 1 EiB, but the page cache limits this to 8 TiB on architectures with a 32-bit int
|-
| max number of links to a file || 2<sup>16</sup> => 64 Ki || 2<sup>32</sup> => 4 Gi
|-
| max filesystem size || 2<sup>32</sup> blocks of 4 KiB => 16 TiB || 2<sup>32</sup> blocks of 4 KiB => 16 TiB
|}
ReiserFS does '''meta-data journaling''', enabling fast crash recovery without the expense of full '''data journaling'''. There [ftp://ftp.suse.com/pub/people/mason/patches/intermezzo-alpha/ were] separate [http://marc.info/?l=reiserfs-devel&m=100895310422415&w=2 patches from Chris Mason] that implemented full data journaling for ReiserFS on Linux 2.4.16. '''Note''': Full data journaling is considered by many to be a good way to achieve file data integrity across system crashes. However, although file data may appear to be consistent from the kernel's point of view, since there is no API exported to userspace to control transactions, we may end up in a situation where the application makes two write requests (as part of one logical transaction) but only one of these gets journaled before the system crashes. From the application's point of view, we may then end up with inconsistent data in the file. Such issues should be addressed with the upcoming [[Reiser4]]: such an API will be exported to userspace, and all programs that need transactions will be able to use it.
=== Mount fails after reiserfsck --rebuild-tree failure === When [[reiserfsck]] --rebuild-tree is run, the first thing it does is set the root inode value to -1. This makes the filesystem unmountable. (So if [[reiserfsck]] fails later on because the filesystem contains serious errors, the filesystem cannot be mounted.) Therefore, once [[reiserfsck]] --rebuild-tree has failed for one of your filesystems, mounting of that partition is disabled. To correct the error, first check that you have the latest [[Reiser4progs|reiserfsprogs]] package installed. If that fails, please send a bug report to our [[mailinglists|mailing list]] and be ready to answer our questions. === Why is the execution time for a <tt>find . -type f | xargs cat {} \;</tt> command much longer when using ReiserFS than for the same command when using ext2? === This effect is observed if the measured file set was produced by untarring an archive that was not created from a ReiserFS partition (or by copying files from a non-ReiserFS partition, or by running a program that writes a bunch of files in some order). This is because the <tt>readdir()</tt> operation performed on the ReiserFS partition returns filenames not in the original write order but rather in some hash order (dependent on the hash function used). Thus, when reading the files' contents, the hard drive heads must move when going from one file to another. If you want ReiserFS to outperform any other filesystem in your setup, here is one solution: copy the entire directory that you are not satisfied with to the same partition but with a different name (use <tt>cp -a</tt>), then remove the old directory and rename the new one to the old name. If the partition does not have enough space available, another approach is to <tt>tar(1)</tt> up the whole partition, clear it, and then untar the previously saved data. === Is quota support built into the vanilla 2.4 kernels for ReiserFS?
=== No, quota support for the 2.4 kernel branch is bundled separately. The patches by Chris Mason were once available [ftp://ftp.suse.com/pub/people/mason/patches/reiserfs/quota-2.4/ at SuSE] (gone) and are still [http://gd.tuwien.ac.at/utils/fs/reiserfs/quota-patches/ mirrored at TU-Wien]. The reason these patches were not included in the 2.4 kernel branch is that they implement a new quota format and need new quota code too, which is too big a change for the 2.4 series of kernels. Various Linux distribution vendors (e.g. [http://www.suse.com SuSE]) do ship reiserfs-quota-enabled kernels, though. === I am getting some errors in my kernel logs that I do not know how to interpret === Messages like:
 vs-13070: reiserfs_read_inode2: i/o failure occurred trying to find stat data of [1718696 1718710 0x0 SD]
 zam-7001: io error in reiserfs_find_entry
most likely accompanied by messages like the samples below, are definite signs of hard disk problems (bad sectors):
 hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
 hda: dma_intr: error=0x40 { UncorrectableError }, LBAsect=6599945, sector=4286584
 end_request: I/O error, dev 03:03 (hda), sector 4286584
or
 scsi0: ERROR on channel 0, id 1, lun 0, CDB: Read (10) 00 00 01 ee 60 00 00 08 00
 Current sd 08:00: sense key Medium Error
or
 I/O error: dev 08:21, sector 65704
Messages about <tt>"access beyond end of device"</tt> may have lots of different causes, ranging from not rebooting after fdisk requested it, to unfinished resizes, to data corruption. The following messages mean you have a noisy IDE cable, or one that is just too low-quality for the chosen UDMA mode.
Try replacing the cable with a better one, or choose a slower UDMA mode:
 hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
 hda: dma_intr: error=0x84 { DriveStatusError BadCRC }
 hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
 hda: dma_intr: error=0x84 { DriveStatusError BadCRC }
If you see any message from [[ReiserFS]] that you cannot interpret, and there is nothing similar to the messages above around it, [[mailinglists|mail the message to us]] and we will explain it to you. === Will ReiserFS implement streams, extended attributes, etc.? === [[FAQ/streams|Here]] is the one-page answer. === ReiserFS appears to be very slow while the RAID is resyncing. Mounting takes several minutes. Once mounted, an <tt>ls(1)</tt> in the mounted directory hangs. Forever. Once the RAID is synced, things appear to work pretty well. How can that be fixed? === First of all, a patch that speeds up mounting has been included in the Linux kernel since 2.4.19; you can grab the patch for earlier kernels [http://gd.tuwien.ac.at/utils/fs/reiserfs/reiserfs-for-2.5/2.5.4.pending/07-reiserfs-bitmap-journal-read-ahead.diff here]. Also, RAID drivers have '''minimal guaranteed''' and '''maximal possible''' RAID rebuild bandwidth usage. These values are controlled through the <tt>/proc/sys/dev/raid/speed_limit_min</tt> and <tt>/proc/sys/dev/raid/speed_limit_max</tt> sysctl variables (values are in 100 KiB/s). It seems that the RAID logic cannot always tell whether the disk subsystem is busy at a given time. When it thinks the disk subsystem is idle, it tries to rebuild the RAID array at the <tt>speed_limit_max</tt> speed, which defaults to 100 MB per second. Decrease this value to something more suitable (a bit of experimentation might be needed). === I get <tt>attempt to read past the end of the partition</tt> error messages; is ReiserFS corrupted? === You changed your partition sizes, and then before rebooting ran [[mkreiserfs]].
The kernel does not change its belief about the partition sizes until reboot time. (This is fixable, but nobody has fixed it as of Dec. 2001.) [[mkreiserfs]] created a filesystem that has the wrong notion of how large its partition is. The filesystem's notion of the partition boundaries will last past reboot even though the kernel's notion will change. So yes, it is corrupted. Some other kinds of metadata breakage can also lead to such messages. === Can I use VMware with ReiserFS? === VMware was tested on [http://www.suse.com/ SuSE Linux] with a [http://support.microsoft.com/gp/lifean18 Windows98] guest OS on a [[ReiserFS]] partition. There's one trick at the beginning: the following line was added to the VMware config file:
 host.FSSupportLocking1 = 0x52654973 # (0x52654973 == *(u32 *) "ReIs")
Thanks to [mailto:gkade@bigbrother.net Gregory K. Ade] for this hint. === How do I install Debian potato with ReiserFS as the root partition? === [[FAQ/potato_part|Here]] are instructions by [mailto:LeBlanc@mcc.ac.uk Dr. A.V. Le Blanc]. === Starting with Linux kernel v2.4.21 I cannot mount my FS anymore. Why? === Special sanity checks were added to the kernel code to prohibit mounting of filesystems that are bigger than the underlying block device. If you now see this message on mount:
 Filesystem on xx:yy cannot be mounted because it is bigger than the device
you may need to run fsck or increase the size of your LVM partition - or maybe you forgot to reboot after fdisk when it told you to. If you do not use LVM, this usually means you need to run <tt>[[reiserfsck]] --rebuild-sb</tt> on your filesystem and agree to change its default size to the proposed one. === Is it OK to use ReiserFS on a small storage device, e.g. a 16MB NAND flash block device? === [[FAQ/small_blocks|Here]] are instructions. === How do I change root from ext2 to ReiserFS without loss of data? === [[FAQ/change_fs|Here]] are instructions.
=== <tt>mount: /dev/hda5 has wrong major or minor number</tt> - what does that mean? === The kernel does not know anything about [[ReiserFS]]; it is neither compiled in nor available as a module. === Will it be possible to read/write ReiserFS partitions created now with future versions of ReiserFS? === Yes. [[ReiserFS]]-3.6.x (Linux 2.4.x) works with both the old (3.5) and the new (3.6) formats. ReiserFS-3.5.x (Linux 2.2.x) can only work with the old (3.5) disk format. There is no way to convert the new (3.6) disk format to the old (3.5), but the old (3.5) format can be converted to the new one (3.6) with the <tt>-o conv</tt> [[mount|mount option]]. === The ReiserFS module doesn't insert properly - why? === After applying the patch, ''recompile'' the whole kernel including the modules target, reboot, then try to insert the module. === Can I use ReiserFS with software RAID? === Yes, for all RAID levels using any Linux >= 2.4.1, but '''DO NOT''' use RAID with Linux 2.2.x. Our journaling and their RAID code step on each other in the buffering code. Also, mirroring is '''not''' safe in the 2.2.x kernels, because online mirror rebuilds in 2.2.x break the write-ordering requirements for the log. If you crash in the middle of an online rebuild, your meta-data may be corrupted. The only RAID level that is safe with [[ReiserFS]] in the 2.2.x kernels is the striping/concatenation level. === Can I use ReiserFS with 3ware RAID? === Yes, but you need to use Linux 2.2.19 or later, for reasons other than [[ReiserFS]]. Also, if you should encounter problems, you should be suspicious that it might not be ReiserFS that has the bug. See the [http://web.archive.org/web/20030415160519/http://www.3ware.com/support/raid5techbulletin.shtml special instructions] (archive.org). === Why do things freeze on my IDE hard drive for annoying amounts of time? === Because when large writes are scheduled all at once, reads can starve.
A fix for this is evolving; the later your ReiserFS patch, the better we handle this. === <tt>du(1)</tt> says ReiserFS makes space efficiency worse. === Use <tt>df(1)</tt>, not <tt>du(1)</tt>, or use the ''raw'' option for <tt>du(1)</tt> if it's supported. <tt>st_blocks</tt> summed up is less accurate than <tt>st_size</tt> for [[ReiserFS]] because we pack tails, and <tt>st_blocks</tt> rounds numbers up. === <tt>mkreiserfs(8)</tt> fails after repartitioning === The kernel requires you to reboot after repartitioning (for all filesystems). We intend to fix that. === Performance is poor, and my disk at 96% full still has free space. === Once a disk drive gets more than 85% full, performance starts to suffer unless you use a repacker (which isn't implemented yet). You can probably get away with 92%, but if performance is valued you are making a mistake to keep it any fuller. This is true for almost all filesystems. [[ReiserFS]], because we pack tails together, packs more data into a given percentage used, but it is still subject to the rules for the maximum recommended percentage used. If you create the whole disk with one copy and then mount it read-only, then you can fully pack it without problems. Please be sure that you copy it from (or <tt>tar</tt> it from) a ReiserFS partition, so that files are created in ReiserFS <tt>readdir()</tt> order, as this will improve performance. === Why do I get a signal 11 when compiling the kernel using ReiserFS and not ext2? === Your CPU is overheating and/or you have [http://www.bitwizard.nl/sig11/ bad RAM]. === But it doesn't happen with ext2? === ext2 uses less heat-sensitive gates in the CPU :-) Seriously, ext2 and [[ReiserFS]] contain random differences, and overheating and bad RAM have random sensitivities. ([http://www.bitwizard.nl/sig11/ Signal 11] is not due to ReiserFS. One user had a cable blocking the fan; it did not affect ext2, but it wasn't until he fixed the cable-fan problem that ReiserFS worked.)
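The <tt>du(1)</tt>/<tt>df(1)</tt> discrepancy from a few questions above comes down to which <tt>stat(2)</tt> field you sum. A small sketch of the two views (field semantics are standard POSIX; the ReiserFS-specific tail packing is what makes summed <tt>st_blocks</tt> pessimistic):

```python
import os

def apparent_and_allocated(path):
    """Return (apparent, allocated) sizes for a file: st_size is the
    apparent length (what a "raw"/apparent-size du would report),
    while st_blocks counts allocated 512-byte units rounded up to
    whole blocks, which overstates usage for files whose tails
    ReiserFS packs together."""
    st = os.stat(path)
    return st.st_size, st.st_blocks * 512
```

Summing the first value over a tree approximates what the data actually occupies on a tail-packing filesystem; summing the second is what plain <tt>du(1)</tt> does.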
=== Can I use ReiserFS on architectures other than i386? === Yes; starting from Linux [http://kernel.org/pub/linux/kernel/v2.4/ChangeLog-2.4.13 kernel 2.4.13], ReiserFS can be run on any Linux-supported architecture. === I need a program which will help me in rebuilding/recreating my partition table. === [http://brzitwa.de/mb/gpart/ gpart] is a utility that handles ext2, FAT, Linux swap, HPFS, NTFS, FreeBSD and Solaris/x86 disklabels, Minix, and ReiserFS. It prints a proposed content for the primary partition table and is well documented. === What partition type should I use for ReiserFS? === [http://www.win.tue.nl/~aeb/partitions/partition_types.html Linux native filesystem] (83). === Can I use 32GB+ IDE hard drives with ReiserFS? === Yes, if you use Linux kernel 2.4 and up. === What about resizing ReiserFS? === This can be done with [[resize_reiserfs]]. === What should I put into the fifth (aka dump, fs_freq) and the sixth (aka pass, fs_passno) fields of /etc/fstab for ReiserFS filesystems? === You'd put in <tt>"0 0"</tt>, e.g.
 /dev/sda3 /var reiserfs notail,nodev,nosuid,noexec <font color="red">0 0</font>
=== Why are ReiserFS filesystems not fscked on reboot after a crash? === Because [[ReiserFS]] provides journaling of meta-data. After a crash, the consistency of the filesystem is restored by replaying the transaction log. === Can I interactively repair a filesystem that was corrupted? === This is done with [[reiserfsck]]. === Can I use <tt>dump(8)</tt> and <tt>restore(8)</tt> with ReiserFS? Any caveats? === No. <tt>dump(8)</tt> uses knowledge of the internal structure of ext2, and works together with restore, which also uses ext2-specific knowledge, to back up ext2 files. dump and restore are specific to ext2 and will not work with [[ReiserFS]]. To back up ReiserFS files use <tt>tar(1)</tt>, which is universal and can be applied to almost any reasonable Linux filesystem.
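Since <tt>dump(8)</tt> is ext2-only, a portable backup of a ReiserFS tree really is just an ordinary archive. A minimal sketch using Python's <tt>tarfile</tt> module, roughly equivalent to <tt>tar -czf</tt> and <tt>tar -xzf</tt> (paths and names here are illustrative):

```python
import tarfile

def backup(src_dir, archive):
    """Archive a directory tree into a gzip-compressed tarball,
    preserving permissions and symlinks the way tar(1) does."""
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(src_dir, arcname=".")

def restore(archive, dest_dir):
    """Unpack the archive into dest_dir."""
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(dest_dir)
```

For production use you would of course prefer GNU tar itself, which adds incremental backups and wider metadata coverage, as noted above.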
It is well known among system administrators that <tt>dump(8)</tt> is more complete than unix tar, and that there is quite a list of things that unix tar will fail to properly backup. This is not true of GNU/tar, which is quite complete. Basically, the only real disadvantage of GNU/tar compared to <tt>dump(8)</tt> is speed. Unfortunately, because it shares the same name as Unix <tt>tar(1)</tt>, people are reluctant to believe this. (Yes, the GNU/tar has incremental backups, etc.) We will performance optimize ReiserFS backups for you (and the rest of the world) for $30K, which is not a lot if you are a large site spending a few hundred thousand on equipment for backups. === Does ReiserFS support snapshots? === No, but you can create [[ReiserFS]] on top of [http://sourceware.org/lvm2/ LVM] logical volume and use LVM snapshot capabilities. === Can I check reiserfs filesystems for errors without unmounting them? === [[reiserfsck]] in checking mode may run over filesystems mounted read-only. There is no official way to fix mounted filesystems, though. You MUST completely unmount your filesystem in order to have it fixed. If you have LVM, you can check consistency of filesystems mounted read-write, here is the script contributed by Andreas Dilger: === What ReiserFS mount options should I use to get the performance winner for a mail server? === [http://archives.neohapsis.com/archives/postfix/2001-03/1148.html Craig Sanders answered] in detail: By the time I got around to running <tt>bonnie</tt>, the <tt>postmark</tt> and <tt>postal</tt> benchmarks had convinced me that <tt>notail</tt> was essential. Host system: * Debian GNU/Linux (of course :) * Linux kernel 2.4.2 with latest 20010305 ReiserFS patch * dual P3-866 (256K cache) * 512MB RAM * [http://www.adaptec.com/en-US/support/scsi/u160/ASC-19160/ Adaptec 19160] SCSI Controller External drive box: * [http://www.domex.com.tw/support/product/8230u.htm Domex 8230u] RAID controller, 32MB battery-backed cache. 
* 6 x 18GB IBM [http://www.hitachigst.com/tech/techlib.nsf/techdocs/85256AB8006A31E587256A78005A3610/$file/ddys_sp21.PDF DDYS-T18350M] drives For this particular hardware I was using, [[ReiserFS]]/notail on RAID5 was the clear performance winner for a mail server with lots of synced random I/O. === Does using ReiserFS mean I can just press the power off button without running <tt>/sbin/shutdown</tt>? Does it mean there is no risk of data loss? === No, definitely not. As of now, [[ReiserFS]] only provides meta-data journaling - that is, it records which files have been created or opened, whether they have had their size changed, or where they have been relocated. It guarantees that the structure of the internal ReiserFS tree will be correct, thereby allowing you after an unclean shutdown to start back up without having to run fsck on all the files that have not been changed. Data in files that were being used at the time of the crash could have been corrupted. This is usual for most filesystems. Data journaling filesystems guarantee that there will be no garbage written into a file, but they don't guarantee that a file update will be. (Only [[Reiser4]] guarantees that filesystem operations are performed as atomic operations, and provides atomic transaction functionality.) [[ReiserFS]] does not guarantee the file contents themselves are uncorrupted nor that no data is lost. Moreover, even given that all of your system is on ReiserFS, many system components (like daemons, database managers, etc) require the shut down procedure for proper functioning. However, there is [ftp://ftp.suse.com/pub/people/mason/patches/data-logging separate implementation of data logging] that will [http://marc.info/?l=reiserfs-devel&m=103472026011689&w=2 soon] go into the mainstream kernel. === How does ReiserFS support bad block handling? === See here. === I have a motherboard with VIA MVP3 chipset and experience ReiserFS problems. 
=== William Oster <woster73@yahoo.com> answers: If you are using a motherboard with a VIA MVP3 chipset, you may have ReiserFS problems caused by the way your kernel is configured for the so called "pci quirks". My experience is with kernel 2.2.18 and 2.2.19 but it may affect the 2.4.x series too if you are using MVP3 chipset (popular in socket 7 type motherboards, such as used by AMD K6 and classic Pentium). I've confirmed this problem with several motherboards using the VIA MVP3 chipset, ReiserFS 3.5.29 to 3.5.32, and NCR 53c8xx SCSI. But please note: It probably affects any controller which uses DMA and PCI bus mastering. Problems which I was inclined to attribute to the ReiserFS were actually problems with this kernel [mis] configuration. If you fit this profile, DO NOT enable the "pci quirks" configuration option in the /usr/src/linux/.config file. Although the Linux documentation suggests that this option can be enabled if in doubt, DO NOT enable it. It was never intended for the VIA MVP3 chipset anyway. It affects the way DMA is handled, and the combination of ReiserFS (and possibly NCR SCSI) can cause random disk corruption which eventually will result in ReiserFS and/or SCSI errors. Evidently ReiserFS exercises the DMA and SCSI bus very thoroughly, The problems seem not to be as likely under the ext2 filesystem. Check your /usr/src/linux/.config file. You are SAFE from this problem if you find this line: # CONFIG_PCI_QUIRKS is not set Any other setting could be dangerous to MVP3 chipset ReiserFS users especially when using PCI bus mastering controllers such as the NCR 53c8xx series. Re-configure your kernel to disable the "pci quirks" option, then make dep, rebuild, and reinstall. I am having extensive problems using ReiserFS; it seems to have bugs all over the place. I'm not compiling with a buggy compiler. What is happening? How can this be stable? You have hardware problems. Really, you do. 
Even if the bugs don't show up with ext2, you have hardware problems. (See the FAQ question about ReiserFS running 3°C hotter than ext2.) Most SuSE users use ReiserFS. Obscure bugs probably still exist; but if you find bugs as easily as when using Windows, you have bad RAM, a bad CPU, a bad cable, bad cooling, a VIA chipset with PCI quirks turned on, or other hardware or software layer bugs. ReiserFS is stable. You can be sure that if bugs are encountered easily and commonly with normal usage patterns, it is not us. This does not mean that the next release won't somehow break something, though :-/ At the time of writing, real bug reports are outnumbered 10 to 1 by hardware bugs that trigger error messages. We are working on making our error messages better at catching hardware bugs and identifying them as such. There is only so far we can go in runtime consistency checking, though, without serious speed reductions. We don't release software unless it goes through extensive testing; so if you don't think that our testing could have missed the bug, it is probably hardware.

=== How can I put a label (like allowed by the <tt>-L</tt> option of <tt>mkfs.ext2</tt>) on a ReiserFS instance? ===

Currently, this feature is only implemented for the [[ReiserFS]] v3.6 disk format. Adding it to the v3.5 disk format would break the existing format, and there is not enough free space in the superblock. You can set a label (and UUID) on a [[ReiserFS]] v3.6 filesystem with a recent [[Reiser4progs|reiserfsprogs]] package, using the <tt>-l</tt> switch (<tt>-u</tt> for UUID) of [[reiserfstune]] (for existing partitions) or [[mkreiserfs]] (for partitions being created). Support for labels and UUIDs was integrated into [[Reiser4progs|reiserfsprogs]] starting from version 3.x.1a.

=== Why, when I'm working on files (i.e. having open files) on my laptop, does ReiserFS access the disk every 5 seconds? This effectively prevents the disk from spinning down, i.e.
prevents APM power-saving modes from taking over, even when I'm not writing anything. ===

Brent Graveland <bgraveland@hyperchip.com> answers: It's the atime update. Every time you run sync, the sync program's atime is updated. The next sync writes this atime update, then sync gets updated again...

=== RedHat does not unmount / with ReiserFS on halt. How to fix it? ===

RedHat users kindly provided these patches (not tested by us): rc.sysinit.patch and halt.patch. Note that if you have RedHat Linux 7.2 or later, you do not need these patches.

=== How do I run programs from the reiserfsprogs package on encrypted devices? ===

To access such encrypted devices, use the losetup tool to bind them to a loop device.

=== Are there any recommendations for or against particular hard drive manufacturers for use with ReiserFS? ===

There is basically no preference; the general rule of "the faster the drive and the lower the seek time, the better" applies as always. On the other hand, almost every hard drive manufacturer has a "widely known" broken series of hard drives. The most recent example is IBM's "Deskstar" series, especially the DTLA models produced in Hungary in 2000-2001. These are known to fail very often, to the point that you probably don't want to use them even if you already paid for them. Other Deskstar drives also seem to be a poor choice: IBM released a note that Deskstar drives should not run for more than 8 hours/day on average, and these drives are also known to be very sensitive to temperature conditions and to fail on overheating. A class action lawsuit against IBM over that drive series is in progress.

=== I am using RedHat 7.0 with gcc 2.96; why does ReiserFS seem unstable with it? ===

Use the most recent version of RedHat (gcc 2.96-85 or later, as shipped with RedHat 7.2, although 7.1 is also okay for ReiserFS). The choice of an unstable, unreleased version of gcc 2.96 by RedHat as the default gcc was a Slashdot controversy.
gcc 2.96 on RedHat 7.0 was unstable, and ReiserFS was one of the things that would fail with it. There are two gcc versions: 2.96 and 2.96-85. 2.96-85 works for ReiserFS; the other (the one on RedHat 7.0) surely does not. Read the Linux kernel instructions about what compiler to use. The solution to code not working on broken compilers is the one RedHat has taken: fix the compiler. They fixed the compiler and thereby allowed the correctly compiled ReiserFS to work.

=== In my program I am using fsync(2) calls after each write to the file to guarantee integrity of my file data, and this is very slow. How can I improve the performance? ===

Answer from Chris Mason: The main thing to remember is that fsyncs introduce a bunch of disk writes, and force the FS to wait on the buffers. The key to keeping performance up is to make it easy for the FS to do as much as possible before the fsync call. So, if your application modifies 3 files and you want to make sure all 3 changes are safely on disk, this:

 write(file1) write(file2) write(file3) fsync(file1) fsync(file2) fsync(file3)

is much faster than:

 write(file1) fsync(file1) write(file2) fsync(file2) write(file3) fsync(file3)

It is also faster to write over existing bytes in a file than to append new bytes onto its end. When you overwrite existing bytes, you don't have to commit new metadata to disk on fsync(); the FS can just write the data blocks. This means fewer seeks. The more you write to a single file before calling fsync, the faster things will run overall:

 write(8k) fsync(file)

is much faster than:

 write(4k) fsync(file) write(4k) fsync(file)

Trying to optimize for those 3 things alone can make a huge performance difference overall.

Answer from Josh MacDonald: You have to understand that even using fsync() after every write() makes no guarantees. If the system crashes during either the write or the fsync operation, your data may be lost or corrupted.
Suppose the fsync() does complete: does your application keep its data in multiple files? If so, and you need to write() to multiple files as part of a transaction, you have even greater problems. The only safe and easy way to implement some kind of transaction with the traditional file system guarantees is to use rename():

# Keep all of your data in a single file.
# Periodically write a complete copy of your database to a temporary file.
# Rename the temporary file to the original database name.

(Addition from Nikita Danilov: One can implement something like a phase-tree at user level and use rename to atomically switch the root of the tree. This overcomes the "everything-in-one-file" limitation but has the added complexity of requiring crash recovery.)

Answer from Nikita Danilov: Stop your development for now and wait until the [[Reiser4]] filesystem is released; it will have a transaction API exported to userspace. That transaction API would solve all of your problems.

=== Our program needs to access a lot of working files. What is the recommended way to organize files to get the best results out of ReiserFS? Should all the files be placed in a single directory, or should the files be spread across a directory tree to limit the number of files per directory? Can you also summarize the relevant caching and locking effects? ===

Traditional file systems typically have poor performance when there are many files in a single directory, but not [[ReiserFS]]. These other file systems perform poorly because they use a linear search algorithm to find and replace entries in a directory. This means that the file system must scan, on average, half the blocks of a directory for every access. Typically, applications are required to work around this problem by manually structuring a tree of directories, allowing each individual directory to remain limited in size. For example, see how the Squid web proxy stores a large collection of files.
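The three-step rename() recipe above can be sketched as follows (a minimal Python illustration of the same POSIX calls; the helper name and error handling are my own, not part of any ReiserFS tool):

```python
import os
import tempfile


def atomic_update(path, data):
    """Replace the contents of `path` with `data` atomically.

    Writes a complete copy to a temporary file in the same directory,
    fsyncs it, then renames it over the original.  rename(2) is atomic,
    so after a crash the file holds either the old or the new contents,
    never a mix of both.
    """
    dirname = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=dirname)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # force the new copy to disk first
        os.rename(tmp, path)      # atomic switch to the new copy
    except BaseException:
        os.unlink(tmp)
        raise
```

Note that on some filesystems the containing directory may also need an fsync for the rename itself to be durable.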
ReiserFS does not have this problem because it uses an internal tree to store all directories and file metadata. Directory operations remain efficient even for very large directories, so you can write your application free from this performance concern. However, there are several issues that complicate this matter: namely locking and locality. The Linux VFS currently imposes locking restrictions that serialize many operations on directories, so if concurrent processes or threads will access the collection of files, you may be better off using multiple directories. [[Reiser4]] will improve upon this restriction, although it is still under development. ReiserFS attempts to store all of the files in a directory, along with the directory itself, in nearby locations on disk. An application may exploit this spatial locality if it can predict which files will be accessed with temporal locality. You may be better off using multiple directories to store your files if you can predict that many files within a directory will be accessed at the same time. To summarize, ReiserFS supports efficient access to large directories and most traditional file systems do not. However, locking and locality issues may guide your decision to use manually structured directory trees instead, at least until ReiserFS exports control over packing locality to users and improves its locking.

[[category:ReiserFS]] [[category:Reiser4]]

This FAQ is very [[ReiserFS]] centric and often a bit dated. The [[Reiser4]] filesystem is mentioned as ''upcoming''. Be sure to search the [[mailinglists|mailing list archives]] and help update this FAQ - thanks!

__TOC__

=== What are the specs for ReiserFS: maximum number of files, of files a directory can have, of sub-dirs in a dir, of links to a file, maximum file size, maximum filesystem size, etc.?
=== Specifications for [[ReiserFS]]:

{|cellpadding="5" cellspacing="0" border="1"
| '''property''' || '''3.5''' || '''3.6'''
|-
| max number of files || 2<sup>32</sup>-3 => 4 Gi - 3 || 2<sup>32</sup>-3 => 4 Gi - 3
|-
| max number of files a dir can have || 518701895 (but in practice this value is limited by the hash function; the r5 hash allows about 1 200 000 file names without collisions) || 2<sup>32</sup>-4 => 4 Gi - 4 (but in practice this value is limited by the hash function; the r5 hash allows about 1 200 000 file names without collisions)
|-
| max file size || 2<sup>31</sup>-1 => 2 Gi - 1 || 2<sup>60</sup> bytes => 1 Ei, but the page cache limits this to 8 Ti on architectures with 32-bit int
|-
| max number of links to a file || 2<sup>16</sup> => 64 Ki || 2<sup>32</sup> => 4 Gi
|-
| max filesystem size || 2<sup>32</sup> (4K) blocks => 16 Ti || 2<sup>32</sup> (4K) blocks => 16 Ti
|}

ReiserFS does '''meta-data journaling''', enabling fast crash recovery without the expense of full '''data journaling'''. There [ftp://ftp.suse.com/pub/people/mason/patches/intermezzo-alpha/ were] separate [http://marc.info/?l=reiserfs-devel&m=100895310422415&w=2 patches from Chris Mason] that implemented full data journaling for ReiserFS on Linux 2.4.16.

'''Note''': Full data journaling is considered by many to be a good way to achieve file data integrity across system crashes. However, although file data may appear to be consistent from the kernel's point of view, since there is no API exported to userspace to control transactions, we may end up in a situation where the application makes two write requests (as part of one logical transaction) but only one of them gets journaled before the system crashes. From the application's point of view, we may then end up with inconsistent data in the file. Such issues should be addressed by the upcoming [[Reiser4]]: its transaction API will be exported to userspace, and all programs that need transactions will be able to use it.
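As a quick sanity check, the human-readable limits in the table above fall straight out of the powers of two (a sketch; the 4 KiB block size is the table's own assumption):

```python
# Derive the table's 3.6-format limits from the raw powers of two.
BLOCK_SIZE = 4096                      # "4K" blocks, as stated in the table

max_fs_bytes = 2**32 * BLOCK_SIZE      # 2^32 addressable blocks
max_file_v36 = 2**60                   # 60-bit file offsets (3.6 format)
max_file_v35 = 2**31 - 1               # 32-bit signed offsets (3.5 format)

assert max_fs_bytes == 16 * 2**40      # 16 Ti
assert max_file_v36 == 1 * 2**60       # 1 Ei
assert max_file_v35 == 2 * 2**30 - 1   # 2 Gi - 1
```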
=== Mount fails after reiserfsck --rebuild-tree failure ===

When [[reiserfsck]] --rebuild-tree is run, the first thing it does is set the root inode value to -1. This makes the filesystem unmountable. (So if [[reiserfsck]] fails later on because the filesystem contains serious errors, the filesystem cannot be mounted.) Therefore, once [[reiserfsck]] --rebuild-tree has failed for one of your filesystems, mounting of that partition is disabled. To correct the error, first check that you have the latest [[Reiser4progs|reiserfsprogs]] package installed. If that fails, please send a bug report to our [[mailinglists|mailing list]] and be ready to answer our questions.

=== Why is the execution time for a <tt>find . -type f | xargs cat</tt> command much longer when using ReiserFS than for the same command when using ext2? ===

This effect is observed if the measured file set was produced by untarring an archive that was not created from a ReiserFS partition (or by copying files from a non-ReiserFS partition, or by running a program that writes a bunch of files in some order). This is because the <tt>readdir()</tt> operation performed on the ReiserFS partition returns filenames not in the original write order but in some hash order (dependent on the hash function used). Thus when reading the files' contents, the hard drive heads must seek when going from one file to another. If you want ReiserFS to outperform any other filesystem in your setup, here is one solution: copy the entire directory that you are not satisfied with to the same partition but with a different name (use <tt>cp -a</tt>), then remove the old directory and rename the new one to the old name. If the partition does not have enough space available, another approach is to <tt>tar(1)</tt> up the whole partition, clear it, and then untar the previously saved data.

=== Is quota support built into the vanilla 2.4 kernels for ReiserFS?
=== No. Quota support for the 2.4 kernel branch is bundled separately; the patches by Chris Mason were once available [ftp://ftp.suse.com/pub/people/mason/patches/reiserfs/quota-2.4/ at SuSE] (now gone) and are still [http://gd.tuwien.ac.at/utils/fs/reiserfs/quota-patches/ mirrored at TU-Wien]. These patches were not included in the 2.4 kernel branch because they implement a new quota format and need new quota code as well, which is too big a change for the 2.4 series of kernels. Various Linux distribution vendors (e.g. [http://www.suse.com SuSE]) do ship reiserfs-quota-enabled kernels, though.

=== I am getting some errors in my kernel logs that I do not know how to interpret ===

Messages like:

 vs-13070: reiserfs_read_inode2: i/o failure occurred trying to find stat data of [1718696 1718710 0x0 SD]
 zam-7001: io error in reiserfs_find_entry

most likely accompanied by samples like the ones below, are definite signs of hard disk problems (bad sectors):

 hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
 hda: dma_intr: error=0x40 { UncorrectableError }, LBAsect=6599945, sector=4286584
 end_request: I/O error, dev 03:03 (hda), sector 4286584

or

 scsi0: ERROR on channel 0, id 1, lun 0, CDB: Read (10) 00 00 01 ee 60 00 00 08 00
 Current sd 08:00: sense key Medium Error

or

 I/O error: dev 08:21, sector 65704

Messages about <tt>"access beyond end of device"</tt> can have many different causes, from not rebooting after fdisk requested it, to unfinished resizes, to data corruption. The following messages mean you have a noisy IDE cable, or one of too low a quality for the chosen UDMA mode.
Try replacing the cable with a better one, or choose a slower UDMA mode:

 hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
 hda: dma_intr: error=0x84 { DriveStatusError BadCRC }
 hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
 hda: dma_intr: error=0x84 { DriveStatusError BadCRC }

If you see any message from [[ReiserFS]] that you cannot interpret and there is nothing similar to the messages above around it, [[mailinglists|mail the message to us]] and we will explain it to you.

=== Will ReiserFS implement streams, extended attributes, etc.? ===

[[FAQ/streams|Here]] is the one page answer.

=== ReiserFS appears to be very slow while the RAID is resyncing. Mounting takes several minutes. Once mounted, an <tt>ls(1)</tt> in the mounted directory hangs. Forever. Once the RAID is sync'ed, things appear to work pretty well. How can that be fixed? ===

First of all, a patch that makes mounting the drive faster has been included in the Linux kernel since 2.4.19. You can grab the patch for earlier kernels [http://gd.tuwien.ac.at/utils/fs/reiserfs/reiserfs-for-2.5/2.5.4.pending/07-reiserfs-bitmap-journal-read-ahead.diff here]. Also, RAID drivers have '''minimal guaranteed''' and '''maximal possible''' RAID rebuild bandwidth usage. These values are controlled through the <tt>/proc/sys/dev/raid/speed_limit_min</tt> and <tt>/proc/sys/dev/raid/speed_limit_max</tt> sysctl variables (values are in KiB/s). It seems that the RAID logic cannot always tell whether the disk subsystem is busy at a given time. When it thinks the disk subsystem is idle, it tries to rebuild the RAID array at <tt>speed_limit_max</tt> speed, which defaults to 100 MB per second. Decrease this value to something more suitable (a bit of experimentation might be needed).

=== I get attempt to read past the end of the partition error messages; is ReiserFS corrupted? ===

You changed your partition sizes, and then before rebooting ran [[mkreiserfs]].
The kernel does not change its belief about the partition sizes until reboot time. (This is fixable, but nobody has fixed it as of Dec. 2001.) [[mkreiserfs]] created a filesystem with a wrong notion of how large its partition is. The filesystem's notion of the partition boundaries will last past reboot even though the kernel's notion will change. So yes, it is corrupted. Some other kinds of metadata breakage can also lead to such messages.

=== Can I use VMware with ReiserFS? ===

VMware was tested on [http://www.suse.com/ SuSE Linux] with a [http://support.microsoft.com/gp/lifean18 Windows98] guest OS on a [[ReiserFS]] partition. There's one trick at the beginning: the following line was added to the VMware config file:

 host.FSSupportLocking1 = 0x52654973 # (0x52654973 == *(u32 *) "ReIs")

Thanks to [mailto:gkade@bigbrother.net Gregory K. Ade] for this hint.

=== How do I install Debian potato with ReiserFS as root partition? ===

[[FAQ/potato_part|Here]] are instructions by [mailto:LeBlanc@mcc.ac.uk Dr. A.V. Le Blanc].

=== Starting with Linux kernel v2.4.21 I cannot mount my FS anymore. Why? ===

Special sanity checks were added to the kernel code to prohibit mounting of filesystems that are bigger than the underlying block device. If you now see this message on mount:

 Filesystem on xx:yy cannot be mounted because it is bigger than the device

you may need to run fsck or increase the size of your LVM partition. Or maybe you forgot to reboot after fdisk told you to. If you do not use LVM, that usually means you need to run <tt>[[reiserfsck]] --rebuild-sb</tt> on your filesystem and agree to change its default size to the proposed one.

=== Is it ok to use ReiserFS on a small storage device, e.g. a 16MB NAND flash block device? ===

[[FAQ/small_blocks|Here]] are instructions.

=== How do I change root from ext2 to ReiserFS without loss of data? ===

[[FAQ/change_fs|Here]] are instructions.
=== <tt>mount: /dev/hda5 has wrong major or minor number</tt> - what does that mean? ===

The kernel does not know anything about [[ReiserFS]]; it is neither compiled in nor available as a module.

=== Will it be possible to read/write ReiserFS partitions created now with future versions of ReiserFS? ===

Yes. [[ReiserFS]]-3.6.x (Linux-2.4.x) works with both the old (3.5) and the new (3.6) formats. ReiserFS-3.5.x (Linux-2.2.x) can only work with the old (3.5) disk format. There is no way to convert the new (3.6) disk format back to the old (3.5), but the old (3.5) format can be converted to the new one (3.6) with the <tt>-o conv</tt> [[mount|mount option]].

=== The ReiserFS module doesn't insert properly - why? ===

After applying the patch, ''recompile'' the whole kernel including the modules target, reboot, then try to insert the module.

=== Can I use ReiserFS with software RAID? ===

Yes, for all RAID levels using any Linux >= 2.4.1, but '''DO NOT''' use RAID with Linux 2.2.x. Our journaling and their RAID code step on each other in the buffering code. Also, mirroring is '''not''' safe in the 2.2.x kernels because online mirror rebuilds in 2.2.x break the write ordering requirements for the log. If you crash in the middle of an online rebuild, your meta-data may be corrupted. The only RAID level that is safe with [[ReiserFS]] in the 2.2.x kernels is the striping/concatenation level.

=== Can I use ReiserFS with 3ware RAID? ===

Yes, but you need to use Linux 2.2.19 or later for reasons other than [[ReiserFS]]. Also, if you should encounter problems, you should be suspicious that it might not be ReiserFS that has the bug. See the [http://web.archive.org/web/20030415160519/http://www.3ware.com/support/raid5techbulletin.shtml special instructions] (archive.org).

=== Why do things freeze on my IDE hard drive for annoying amounts of time? ===

Because when large writes are scheduled all at once, reads can starve.
A fix for this is evolving; the later your ReiserFS patch, the better we handle this.

=== <tt>du(1)</tt> says ReiserFS makes space efficiency worse. ===

Use <tt>df(1)</tt>, not <tt>du(1)</tt>, or use the ''raw'' option for <tt>du(1)</tt> if it's supported. Summed-up <tt>st_blocks</tt> is less accurate than <tt>st_size</tt> for [[ReiserFS]] because we pack tails, and <tt>st_blocks</tt> rounds numbers up.

=== <tt>mkreiserfs(8)</tt> fails after repartitioning ===

The kernel requires you to reboot after repartitioning (for all filesystems). We intend to fix that.

=== Performance is poor, and my disk at 96% full still has free space. ===

Once a disk drive gets more than 85% full, performance starts to suffer unless a repacker is used (which isn't implemented yet). You can probably get away with 92%, but if performance is valued, you are making a mistake to keep it any fuller. This is true for almost all filesystems. [[ReiserFS]], because we pack tails together, packs more data into a given percentage used, but it is still subject to the rules for the maximum recommended percentage used. If you create the whole disk with one copy and then mount it read-only, you can fully pack it without problems. Please be sure that you copy it from (or <tt>tar</tt> it from) a ReiserFS partition so that files are created in ReiserFS <tt>readdir()</tt> order, as this will improve performance.

=== Why do I get a signal 11 when compiling the kernel using ReiserFS and not ext2? ===

Your CPU is overheating and/or you have [http://www.bitwizard.nl/sig11/ bad RAM].

=== But it doesn't happen with ext2? ===

ext2 uses less heat-sensitive gates in the CPU :-) Seriously, ext2 and [[ReiserFS]] contain random differences, and overheating and bad RAM have random sensitivities. ([http://www.bitwizard.nl/sig11/ Signal 11] is not due to ReiserFS. One user had a cable blocking the fan; it did not affect ext2, but it wasn't until he fixed the cable-fan problem that ReiserFS worked.)
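The <tt>st_size</tt> versus <tt>st_blocks</tt> distinction in the du(1) answer above can be observed directly through stat(2) (a small sketch; the function name is mine, not from any standard tool):

```python
import os


def apparent_vs_allocated(path):
    """Return (st_size, st_blocks * 512) for `path`.

    du(1) sums the second figure: allocated space in 512-byte units,
    rounded up per file.  st_size is the exact byte length.  With tail
    packing the allocated figure can overstate real usage, which is why
    df(1) is the more accurate gauge on ReiserFS.
    """
    st = os.stat(path)
    return st.st_size, st.st_blocks * 512
```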
=== Can I use ReiserFS on architectures other than i386? ===

Yes, starting from Linux [http://kernel.org/pub/linux/kernel/v2.4/ChangeLog-2.4.13 kernel 2.4.13], ReiserFS can be run on any Linux-supported architecture.

=== I need a program which will help me in rebuilding/recreating my partition table. ===

[http://brzitwa.de/mb/gpart/ gpart] is a utility that handles ext2, FAT, Linux swap, HPFS, NTFS, FreeBSD and Solaris/x86 disklabels, Minix, and ReiserFS. It prints a proposed content for the primary partition table and is well documented.

=== What partition type should I use for ReiserFS? ===

[http://www.win.tue.nl/~aeb/partitions/partition_types.html Linux native filesystem] (83)

=== Can I use 32GB+ IDE hard drives with ReiserFS? ===

Yes, if you use Linux kernel 2.4 or later.

=== What about resizing ReiserFS? ===

This can be done with [[resize_reiserfs]].

=== What should I put into the fifth (aka dump, fs_freq) and the sixth (aka pass, fs_passno) fields of /etc/fstab for ReiserFS filesystems? ===

You'd put in <tt>"0 0"</tt>, e.g.

 /dev/sda3 /var reiserfs notail,nodev,nosuid,noexec <font color="red">0 0</font>

=== Why are ReiserFS filesystems not fscked on reboot after a crash? ===

Because [[ReiserFS]] provides journaling of meta-data. After a crash, the consistency of a filesystem is restored by replaying the transaction log.

=== Can I interactively repair a filesystem that was corrupted? ===

This is done with [[reiserfsck]].

=== Can I use <tt>dump(8)</tt> and <tt>restore(8)</tt> with ReiserFS? Any caveats? ===

No. <tt>dump(8)</tt> uses knowledge of the internal structure of ext2 and works together with restore, which also uses ext2-specific knowledge, to back up ext2 files. dump and restore are specific to ext2 and will not work with [[ReiserFS]]. To back up ReiserFS files use <tt>tar(1)</tt>, which is universal and can be applied to almost any reasonable Linux filesystem.
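The tar advice above is not tied to any one tar implementation; any archiver that reads through the ordinary VFS interface works the same on ReiserFS, ext2, or anything else. As a sketch, the same idea with Python's tarfile module (paths and the helper name are made up for illustration):

```python
import tarfile


def backup_tree(tree, archive):
    """Archive `tree` into a gzip-compressed tarball.

    Unlike dump(8), this reads files through the normal filesystem
    interface, so it is filesystem-agnostic while still preserving
    permissions, ownership and symlinks.
    """
    with tarfile.open(archive, "w:gz") as tf:
        tf.add(tree)
```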
It is well known among system administrators that <tt>dump(8)</tt> is more complete than Unix tar, and that there is quite a list of things that Unix tar will fail to back up properly. This is not true of GNU tar, which is quite complete. Basically, the only real disadvantage of GNU tar compared to <tt>dump(8)</tt> is speed. Unfortunately, because it shares the same name as Unix <tt>tar(1)</tt>, people are reluctant to believe this. (Yes, GNU tar has incremental backups, etc.) We will performance-optimize ReiserFS backups for you (and the rest of the world) for $30K, which is not a lot if you are a large site spending a few hundred thousand on equipment for backups.

=== Does ReiserFS support snapshots? ===

No, but you can create [[ReiserFS]] on top of an [http://sourceware.org/lvm2/ LVM] logical volume and use LVM's snapshot capabilities.

=== Can I check ReiserFS filesystems for errors without unmounting them? ===

[[reiserfsck]] in checking mode may run over filesystems mounted read-only. There is no official way to fix mounted filesystems, though: you MUST completely unmount your filesystem in order to have it fixed. If you have LVM, you can check the consistency of filesystems mounted read-write using a script contributed by Andreas Dilger.

=== What ReiserFS mount options should I use to get the performance winner for a mail server? ===

[http://archives.neohapsis.com/archives/postfix/2001-03/1148.html Craig Sanders answered] in detail: By the time I got around to running <tt>bonnie</tt>, the <tt>postmark</tt> and <tt>postal</tt> benchmarks had convinced me that <tt>notail</tt> was essential.

Host system:
* Debian GNU/Linux (of course :)
* Linux kernel 2.4.2 with the latest 20010305 ReiserFS patch
* dual P3-866 (256K cache)
* 512MB RAM
* [http://www.adaptec.com/en-US/support/scsi/u160/ASC-19160/ Adaptec 19160] SCSI Controller

External drive box:
* [http://www.domex.com.tw/support/product/8230u.htm Domex 8230u] RAID controller, 32MB battery-backed cache.
* 6 x 18GB IBM [http://www.hitachigst.com/tech/techlib.nsf/techdocs/85256AB8006A31E587256A78005A3610/$file/ddys_sp21.PDF DDYS-T18350M] drives For this particular hardware I was using, [[ReiserFS]]/notail on RAID5 was the clear performance winner for a mail server with lots of synced random I/O. === Does using ReiserFS mean I can just press the power off button without running "shutdown" or "init 0," etc? Does it mean there is no risk of data loss? === No, definitely not. As of now, ReiserFS only provides meta-data journaling--that is, it records which files have been created or opened, whether they have had their size changed, or where they have been relocated. It guarantees that the structure of the internal ReiserFS tree will be correct, thereby allowing you after an unclean shutdown to start back up without having to run fsck on all the files that have not been changed. Data in files that were being used at the time of the crash could have been corrupted. This is usual for most filesystems. Data journaling filesystems guarantee that there will be no garbage written into a file, but they don't guarantee that a file update will be. (Only reiser4 guarantees that filesystem operations are performed as atomic operations, and provides atomic transaction functionality.) ReiserFS V3 does not guarantee the file contents themselves are uncorrupted nor that no data is lost. Moreover, even given that all of your system is on ReiserFS, many system components (like daemons, database managers, etc) require the shut down procedure for proper functioning. However, there is separate implementation of data logging that will soon go into the mainstream kernel. You should be able to get it from ftp.suse.com/pub/people/mason/patches/data-logging === How does ReiserFS support bad block handling? === See here. === I have a motherboard with VIA MVP3 chipset and experience ReiserFS problems. 
=== William Oster <woster73@yahoo.com> answers: If you are using a motherboard with a VIA MVP3 chipset, you may have ReiserFS problems caused by the way your kernel is configured for the so called "pci quirks". My experience is with kernel 2.2.18 and 2.2.19 but it may affect the 2.4.x series too if you are using MVP3 chipset (popular in socket 7 type motherboards, such as used by AMD K6 and classic Pentium). I've confirmed this problem with several motherboards using the VIA MVP3 chipset, ReiserFS 3.5.29 to 3.5.32, and NCR 53c8xx SCSI. But please note: It probably affects any controller which uses DMA and PCI bus mastering. Problems which I was inclined to attribute to the ReiserFS were actually problems with this kernel [mis] configuration. If you fit this profile, DO NOT enable the "pci quirks" configuration option in the /usr/src/linux/.config file. Although the Linux documentation suggests that this option can be enabled if in doubt, DO NOT enable it. It was never intended for the VIA MVP3 chipset anyway. It affects the way DMA is handled, and the combination of ReiserFS (and possibly NCR SCSI) can cause random disk corruption which eventually will result in ReiserFS and/or SCSI errors. Evidently ReiserFS exercises the DMA and SCSI bus very thoroughly, The problems seem not to be as likely under the ext2 filesystem. Check your /usr/src/linux/.config file. You are SAFE from this problem if you find this line: # CONFIG_PCI_QUIRKS is not set Any other setting could be dangerous to MVP3 chipset ReiserFS users especially when using PCI bus mastering controllers such as the NCR 53c8xx series. Re-configure your kernel to disable the "pci quirks" option, then make dep, rebuild, and reinstall. I am having extensive problems using ReiserFS; it seems to have bugs all over the place. I'm not compiling with a buggy compiler. What is happening? How can this be stable? You have hardware problems. Really, you do. 
Even if the bugs don't show up with ext2, you have hardware problems. (See the FAQ question about ReiserFS running 3°C hotter than ext2.) Most SuSE users use ReiserFS. Obscure bugs probably still exist; but if you find bugs as easily as when using Windows, you have bad RAM, a bad CPU, a bad cable, bad cooling, a VIA chipset with PCI quirks turned on, or other hardware or software-layer bugs. ReiserFS is stable. You can be sure that if the bugs are encountered easily and commonly with normal usage patterns, it is not us. This does not mean that the next release won't somehow break something, though :-/

Real bug reports are, at the time of writing, outnumbered 10 to 1 by hardware bugs that trigger error messages. We are working on making our error messages better at catching hardware bugs and identifying them as such. There is only so far we can go, though, in runtime consistency checking without serious speed reductions. We don't release software unless it goes through extensive testing; so if you don't think that our testing could have missed the bug, it is probably hardware.

=== How can I put a label (like allowed by the <tt>-L</tt> option of <tt>mkfs.ext2</tt>) on a ReiserFS instance? ===

Currently, this feature is only implemented for the [[ReiserFS]] v3.6 disk format. Adding it to the v3.5 disk format would break the existing format, and there is not enough free space in the superblock. You can set a label (and UUID) on a [[ReiserFS]] v3.6 filesystem with a recent [[Reiser4progs|reiserfsprogs]] package, using the <tt>-l</tt> switch (<tt>-u</tt> for UUID) of [[reiserfstune]] (for existing partitions) or [[mkreiserfs]] (for partitions being created). Support for labels and UUIDs was integrated into [[Reiser4progs|reiserfsprogs]] starting from version 3.x.1a.

=== Why, when I'm working on files (i.e. having open files) on my laptop, does ReiserFS access the disk every 5 seconds? This effectively prevents the disk from spinning down and APM modes from taking over, even when I'm not writing anything. ===

Brent Graveland <bgraveland@hyperchip.com> answers: It's the atime update. Every time you run sync, the sync program's atime is updated. The next sync writes this atime update, then sync gets updated again...

=== RedHat does not unmount / with ReiserFS on halt. How can I fix it? ===

RedHat users kindly provided these patches (not tested by us): rc.sysinit.patch and halt.patch. Note that if you have RedHat Linux 7.2 or later, you do not need these patches.

=== How do I run programs from the reiserfsprogs package on encrypted devices? ===

To access such encrypted entities, use the losetup tool to bind the entity to a loop device.

=== Are there any recommendations for or against particular hard drive manufacturers for use with ReiserFS? ===

There is basically no preference; the general rule "the faster the drive and the lower the seek time, the better" applies as always. On the other hand, almost every hard drive manufacturer has a "widely known" broken series of hard drives. The most recent example is IBM's "Deskstar" series, especially the DTLA models produced in Hungary in 2000-2001. These are known to fail very often, to the point that you probably don't want to use them even if you have already paid for them. Other Deskstar drives also seem to be a poor choice: IBM released a note that Deskstar drives should not run for more than 8 hours/day on average, and these drives are known to be very sensitive to temperature conditions and to fail on overheating. There is a class-action lawsuit in progress against IBM over that drive series.

=== I am using RedHat 7.0 with gcc 2.96; why does ReiserFS seem unstable with it? ===

Use the most recent version of RedHat (gcc 2.96-85 or later with RedHat 7.2, although 7.1 is also okay for ReiserFS). The choice of an unstable, unreleased version of gcc 2.96 by RedHat as the default gcc was a Slashdot controversy.
gcc 2.96 on RedHat 7.0 was unstable, and ReiserFS was one of the things that would fail with it. There are two gcc versions: 2.96 and 2.96-85. 2.96-85 works for ReiserFS; the other (the one on RedHat 7.0) surely does not. Read the Linux kernel instructions about what compiler to use. The solution to code not working on broken compilers is the one RedHat has taken: fix the compiler. They fixed the compiler and thereby allowed the correctly compiled ReiserFS to work.

=== In my program I am using fsync(2) calls after each write to the file to guarantee integrity of my file data, and this is very slow. How can I improve the performance? ===

Answer from Chris Mason: The main thing to remember is that fsyncs introduce a bunch of disk writes, and force the FS to wait on the buffers. The key to keeping performance up is to make it easy for the FS to do as much as possible before the fsync call. So, if your application modifies 3 files, and you want to make sure all 3 changes are safely on disk:

 write(file1)
 write(file2)
 write(file3)
 fsync(file1)
 fsync(file2)
 fsync(file3)

is much faster than:

 write(file1)
 fsync(file1)
 write(file2)
 fsync(file2)
 write(file3)
 fsync(file3)

It is also faster to write over existing bytes in a file than it is to append new bytes onto its end. When you overwrite existing bytes in the file, you don't have to commit new metadata to disk on fsync(); the FS can just write the data blocks. This means fewer seeks. The more you write to a single file before calling fsync, the faster overall things will run:

 write(8k)
 fsync(file)

is much faster than:

 write(4k)
 fsync(file)
 write(4k)
 fsync(file)

Trying to optimize for those 3 things alone can make a huge performance difference overall.

Answer from Josh MacDonald: You have to understand that even using fsync() after every write() makes no guarantees. If the system crashes during either the write or fsync operation your data may be lost or corrupted.
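Chris Mason's batch-then-fsync advice above can be sketched with Python's os-level wrappers around write(2) and fsync(2); the helper name and three-file scenario are illustrative, not from the FAQ:

```python
import os
import tempfile

def update_files_batched(paths, payload):
    """Queue all writes first, then issue the fsyncs in a second pass,
    so the filesystem can group the dirty buffers into fewer commits."""
    fds = [os.open(p, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
           for p in paths]
    for fd in fds:
        os.write(fd, payload)   # write(file1) write(file2) write(file3)
    for fd in fds:
        os.fsync(fd)            # fsync(file1) fsync(file2) fsync(file3)
        os.close(fd)

tmpdir = tempfile.mkdtemp()
paths = [os.path.join(tmpdir, name) for name in ("file1", "file2", "file3")]
update_files_batched(paths, b"all three changes safely on disk\n")
```

The interleaved ordering pays a full synchronous commit wait per file; batching pays that cost once over the whole set.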
Suppose the fsync() does complete: does your application keep its data in multiple files? If so, and you need to write() to multiple files as part of a transaction, you have even greater problems. The only safe and easy way to implement some kind of transaction with the traditional file system guarantees is to use rename():

1. Keep all of your data in a single file.
2. Periodically write a complete copy of your database to a temporary file.
3. Rename the temporary file to the original database name.

(Addition from Nikita Danilov: One can implement something like a phase tree at user level and use rename to atomically switch the root of the tree. This overcomes the "everything-in-one-file" limitation but has the added complexity of requiring crash recovery.)

Answer from Nikita Danilov: Stop your development for now and wait until the reiser4 filesystem is released; it will have a transaction API exported to userspace. That transaction API would solve all of your problems.

== Our program needs to access a lot of working files. What is the recommended way to organize files to get the best results out of ReiserFS? Should all the files be placed in a single directory, or should the files be spread across a directory tree to limit the number of files per directory? Can you also summarize the relevant caching and locking effects? ==

Traditional file systems typically have poor performance when there are many files in a single directory, but not [[ReiserFS]]. These other file systems perform poorly because they use a linear search algorithm to find and replace entries in a directory. This means that the file system must scan, on average, half the blocks of a directory for every access. Typically, applications are required to work around this problem by manually structuring a tree of directories, so that each individual directory remains limited in size. For example, see how the Squid web proxy stores a large collection of files.
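The manual fan-out workaround just described looks roughly like the following sketch; the two-level layout, the md5 hash, and the `fanout_path` helper are illustrative assumptions, not Squid's actual scheme:

```python
import hashlib
import os
import tempfile

def fanout_path(root, name, levels=2):
    """Map a file name to root/<h0>/<h1>/name, one hex digit per level,
    so no single directory grows large even with many files in total."""
    digest = hashlib.md5(name.encode()).hexdigest()
    target = os.path.join(root, *digest[:levels])
    os.makedirs(target, exist_ok=True)
    return os.path.join(target, name)

root = tempfile.mkdtemp()
for i in range(1000):
    # 1000 files end up spread over at most 16 * 16 = 256 leaf directories.
    with open(fanout_path(root, f"object{i}"), "w"):
        pass
```

Hashing keeps the layout deterministic, so a reader can recompute a file's location without any lookup table.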
ReiserFS does not have this problem because it uses an internal tree to store all directories and file metadata. Directory operations remain efficient even for very large directories, so you can write your application free from this performance concern. However, there are several issues that complicate this matter: namely locking and locality.

The Linux VFS currently imposes locking restrictions that serialize many operations on directories, so if concurrent processes or threads will access the collection of files then you may be better off using multiple directories. [[Reiser4]] will improve upon this restriction, although it is still under development.

ReiserFS attempts to store all of the files in a directory, along with the directory itself, in nearby locations on disk. An application may exploit this spatial locality if it can predict which files will be accessed with temporal locality. You may be better off using multiple directories to store your files if you can predict that many files within a directory will be accessed at the same time.

To summarize, ReiserFS supports efficient access to large directories and most traditional file systems do not. However, locking and locality issues may guide your decision to use manually structured directory trees instead, at least until ReiserFS exports control over packing locality to users and improves its locking.

[[category:ReiserFS]] [[category:Reiser4]]

This FAQ is very [[ReiserFS]] centric and often a bit dated. The [[Reiser4]] filesystem is mentioned as ''upcoming''. Be sure to search the [[mailinglists|mailing list archives]] and help update this FAQ. Thanks!

__TOC__

=== What are the specs for ReiserFS: maximum number of files, of files a directory can have, of sub-dirs in a dir, of links to a file, maximum file size, maximum filesystem size, etc.?
===
Specifications for [[ReiserFS]]:

{|cellpadding="5" cellspacing="0" border="1"
| '''property''' || '''3.5''' || '''3.6'''
|-
| max number of files || 2<sup>32</sup> - 3 => 4 Gi - 3 || 2<sup>32</sup> - 3 => 4 Gi - 3
|-
| max number of files a dir can have || 518701895 (but in practice this value is limited by the hash function: the r5 hash allows about 1 200 000 file names without collisions) || 2<sup>32</sup> - 4 => 4 Gi - 4 (but in practice this value is limited by the hash function: the r5 hash allows about 1 200 000 file names without collisions)
|-
| max file size || 2<sup>31</sup> - 1 bytes => 2 Gi - 1 || 2<sup>60</sup> bytes => 1 Ei, but the page cache limits this to 8 Ti on architectures with a 32-bit int
|-
| max number of links to a file || 2<sup>16</sup> => 64 Ki || 2<sup>32</sup> => 4 Gi
|-
| max filesystem size || 2<sup>32</sup> (4K) blocks => 16 Ti || 2<sup>32</sup> (4K) blocks => 16 Ti
|}

ReiserFS does '''meta-data journaling''', enabling fast crash recovery without the expense of full '''data journaling'''. There [ftp://ftp.suse.com/pub/people/mason/patches/intermezzo-alpha/ were] separate [http://marc.info/?l=reiserfs-devel&m=100895310422415&w=2 patches from Chris Mason] that implement full data journaling for ReiserFS for Linux 2.4.16.

'''Note''': Full data journaling is considered by many to be a good way to achieve file data integrity across system crashes. However, although file data may appear to be consistent from the kernel point of view, since there is no API exported to userspace to control transactions, we may end up in a situation where the application makes two write requests (as part of one logical transaction) but only one of these gets journaled before the system crashes. From the application point of view, we may then end up with inconsistent data in the file. Such issues should be addressed with the upcoming [[Reiser4]]. Such an API will be exported to userspace and all programs that need transactions will be able to use it.
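As a quick cross-check of the derived sizes in the table above (Gi/Ti/Ei are binary units; the rationale in each comment is an assumption about why the counter has that width):

```python
# Binary unit constants.
KiB, GiB, TiB, EiB = 2**10, 2**30, 2**40, 2**60

# 3.5 format: file offsets fit in a signed 32-bit quantity.
assert 2**31 - 1 == 2 * GiB - 1

# 3.6 format: 60-bit offsets give the 1 Ei figure.
assert 2**60 == 1 * EiB

# Both formats: 2^32 blocks of 4 KiB each.
assert 2**32 * 4 * KiB == 16 * TiB
```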
=== Mount fails after reiserfsck --rebuild-tree failure ===

When [[reiserfsck]] --rebuild-tree is run, the first thing it does is set the root inode value to -1. This makes the filesystem unmountable. (So, if [[reiserfsck]] fails later on because the filesystem contains serious errors, the filesystem cannot be mounted.) Therefore, once [[reiserfsck]] --rebuild-tree has failed for one of your filesystems, mounting of that partition is disabled. To correct the error, first check that you have the latest [[Reiser4progs|reiserfsprogs]] package installed. If that fails, please send a bug report to our [[mailinglists|mailing list]] and be ready to answer our questions.

=== Why is the execution time for a <tt>find . -type f | xargs cat {} \;</tt> command much longer when using ReiserFS than for the same command when using ext2? ===

This effect is observed if the measured file set was produced by untarring an archive that was not created from a ReiserFS partition (or by copying files from a non-ReiserFS partition, or by running a program that writes a bunch of files in some order). This is because the <tt>readdir()</tt> operation performed on the ReiserFS partition returns filenames not in the original write order but in some hash order (dependent on the hash function used). Thus, when reading the files' contents, the hard drive heads must move when going from one file to another. If you want ReiserFS to outperform any other filesystem in your setup, here is one solution: copy the entire directory that you are not satisfied with to the same partition but under a different name (use <tt>cp -a</tt>), then remove the old directory and rename the new one to the old name. If the partition does not have enough space available, another approach is to <tt>tar(1)</tt> up the whole partition, clear it, and then untar the previously saved data.

=== Is quota support built into the vanilla 2.4 kernels for ReiserFS?
===
No. Quota support for the 2.4 kernel branch is bundled separately; the patches by Chris Mason were once available [ftp://ftp.suse.com/pub/people/mason/patches/reiserfs/quota-2.4/ at SuSE] (gone) and are still [http://gd.tuwien.ac.at/utils/fs/reiserfs/quota-patches/ mirrored at TU-Wien]. These patches were not included in the 2.4 kernel branch because they implement a new quota format and need new quota code too, which is too big a change for the 2.4 series of kernels. Various Linux distribution vendors (e.g. [http://www.suse.com SuSE]) do ship reiserfs-quota-enabled kernels, though.

=== I am getting some errors in my kernel logs that I do not know how to interpret ===

Messages like:

 vs-13070: reiserfs_read_inode2: i/o failure occurred trying to find stat data of [1718696 1718710 0x0 SD]
 zam-7001: io error in reiserfs_find_entry

most likely accompanied by samples like the ones below, are definite signs of hard disk problems (bad sectors):

 hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
 hda: dma_intr: error=0x40 { UncorrectableError }, LBAsect=6599945, sector=4286584
 end_request: I/O error, dev 03:03 (hda), sector 4286584

or

 scsi0: ERROR on channel 0, id 1, lun 0, CDB: Read (10) 00 00 01 ee 60 00 00 08 00
 Current sd 08:00: sense key Medium Error

or

 I/O error: dev 08:21, sector 65704

Messages about <tt>"access beyond end of device"</tt> can have many different causes, ranging from not rebooting after fdisk requested it, to unfinished resizes, to data corruption. The following messages mean you have a noisy IDE cable, or one of too low a quality for the chosen UDMA mode.
Try to replace the cable with a better one, or choose a slower UDMA mode:

 hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
 hda: dma_intr: error=0x84 { DriveStatusError BadCRC }
 hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
 hda: dma_intr: error=0x84 { DriveStatusError BadCRC }

If you see any message from [[ReiserFS]] that you cannot interpret and there is nothing similar to the messages above around it, [[mailinglists|mail the message to us]] and we will explain it to you.

=== Will ReiserFS implement streams, extended attributes, etc.? ===

[[FAQ/streams|Here]] is the one-page answer.

=== ReiserFS appears to be very slow while the RAID is resyncing. Mounting takes several minutes. Once mounted, an <tt>ls(1)</tt> in the mounted directory hangs. Forever. Once the RAID is synced, things appear to work pretty well. How can that be fixed? ===

First of all, a patch that helps mount the drive faster has been included in the Linux kernel since 2.4.19. You can grab the patch for earlier kernels [http://gd.tuwien.ac.at/utils/fs/reiserfs/reiserfs-for-2.5/2.5.4.pending/07-reiserfs-bitmap-journal-read-ahead.diff here]. Also, RAID drivers have '''minimal guaranteed''' and '''maximal possible''' RAID rebuild bandwidth settings. These values are controlled through the <tt>/proc/sys/dev/raid/speed_limit_min</tt> and <tt>/proc/sys/dev/raid/speed_limit_max</tt> sysctl variables (values are in KiB/s). It seems that the RAID logic cannot always tell whether the disk subsystem is busy at a given time. When it thinks the disk subsystem is idle, it tries to rebuild the RAID array at the <tt>speed_limit_max</tt> speed, which defaults to 100 MB per second. Decrease this value to something more suitable (a bit of experimentation might be needed).

=== I get "attempt to read past the end of the partition" error messages; is ReiserFS corrupted? ===

You changed your partition sizes and then, before rebooting, ran [[mkreiserfs]].
The kernel does not change its belief about the partition sizes until reboot time. (This is fixable, but nobody has fixed it as of Dec. 2001.) [[mkreiserfs]] created a filesystem that has the wrong notion of how large its partition is. The filesystem's notion of the partition boundaries will last past reboot even though the kernel's notion will change. So yes, it is corrupted. Some other kinds of metadata breakage can also lead to such messages.

=== Can I use VMware with ReiserFS? ===

VMware was tested on [http://www.suse.com/ SuSE Linux] with a [http://support.microsoft.com/gp/lifean18 Windows 98] guest OS on a [[ReiserFS]] partition. There is one trick at the beginning: the following line was added to the VMware config file:

 host.FSSupportLocking1 = 0x52654973 # (0x52654973 == *(u32 *) "ReIs")

Thanks to [mailto:gkade@bigbrother.net Gregory K. Ade] for this hint.

=== How do I install Debian potato with ReiserFS as the root partition? ===

[[FAQ/potato_part|Here]] are instructions by [mailto:LeBlanc@mcc.ac.uk Dr. A.V. Le Blanc].

=== Starting with Linux kernel v2.4.21 I cannot mount my FS anymore. Why? ===

Special sanity checks were added to the kernel code to prohibit mounting of filesystems that are bigger than the underlying block device. If you now see this message on mount:

 Filesystem on xx:yy cannot be mounted because it is bigger than the device

you may need to run fsck or increase the size of your LVM partition. Or maybe you forgot to reboot after fdisk when it told you to. If you do not use LVM, this usually means you need to run <tt>[[reiserfsck]] --rebuild-sb</tt> on your filesystem and agree to change its default size to the proposed one.

=== Is it OK to use ReiserFS on a small storage device, e.g. a 16MB NAND flash block device? ===

[[FAQ/small_blocks|Here]] are instructions.

=== How do I change root from ext2 to ReiserFS without loss of data? ===

[[FAQ/change_fs|Here]] are instructions.
=== <tt>mount: /dev/hda5 has wrong major or minor number</tt> - what does that mean? ===

The kernel does not know anything about [[ReiserFS]]; it is neither compiled in nor available as a module.

=== Will it be possible to read/write ReiserFS partitions created now with future versions of ReiserFS? ===

Yes. [[ReiserFS]]-3.6.x (Linux 2.4.x) works with both the old (3.5) and the new (3.6) formats. ReiserFS-3.5.x (Linux 2.2.x) can only work with the old (3.5) disk format. There is no way to convert the new (3.6) disk format to the old (3.5), but the old (3.5) format can be converted to the new one (3.6) with the <tt>-o conv</tt> [[mount|mount option]].

=== The ReiserFS module doesn't insert properly - why? ===

After applying the patch, ''recompile'' the whole kernel including the modules target, reboot, then try to insert the module.

=== Can I use ReiserFS with software RAID? ===

Yes, for all RAID levels using any Linux >= 2.4.1, but '''DO NOT''' use RAID with Linux 2.2.x. Our journaling and their RAID code step on each other in the buffering code. Also, mirroring is '''not''' safe in the 2.2.x kernels because online mirror rebuilds in 2.2.x break the write-ordering requirements for the log. If you crash in the middle of an online rebuild, your meta-data may be corrupted. The only RAID level that is safe with [[ReiserFS]] in the 2.2.x kernels is the striping/concatenation level.

=== Can I use ReiserFS with 3ware RAID? ===

Yes, but you need to use Linux 2.2.19 or later, for reasons other than [[ReiserFS]]. Also, if you encounter problems, you should be suspicious that it might not be ReiserFS that has the bug. See the [http://web.archive.org/web/20030415160519/http://www.3ware.com/support/raid5techbulletin.shtml special instructions] (archive.org).

=== Why do things freeze on my IDE hard drive for annoying amounts of time? ===

Because when large writes are scheduled all at once, reads can starve.
A fix for this is evolving; the later your ReiserFS patch, the better we handle this.

=== <tt>du(1)</tt> says ReiserFS makes space efficiency worse. ===

Use <tt>df(1)</tt>, not <tt>du(1)</tt>, or use the ''raw'' option for <tt>du(1)</tt> if it is supported. <tt>st_blocks</tt> summed up is less accurate than <tt>st_size</tt> for [[ReiserFS]] because we pack tails, and <tt>st_blocks</tt> rounds numbers up.

=== <tt>mkreiserfs(8)</tt> fails after repartitioning ===

The kernel requires you to reboot after repartitioning (for all filesystems). We intend to fix that.

=== Performance is poor, and my disk at 96% full still has free space. ===

Once a disk drive gets more than 85% full, performance starts to suffer unless a repacker is used (which isn't implemented yet). You can probably get away with 92%, but if performance is valued you are making a mistake to keep it any fuller. This is true for almost all filesystems. [[ReiserFS]], because it packs tails together, packs more data into a given percentage used, but it is still subject to the rules for the maximum recommended percentage used. If you create the whole disk with one copy and then mount it read-only, then you can fully pack it without problems. Please be sure that you copy it from (or <tt>tar</tt> it from) a ReiserFS partition so that files are created in ReiserFS <tt>readdir()</tt> order, as this will improve performance.

=== Why do I get a signal 11 when compiling the kernel using ReiserFS and not ext2? ===

Your CPU is overheating and/or you have [http://www.bitwizard.nl/sig11/ bad RAM].

=== But it doesn't happen with ext2? ===

ext2 uses less heat-sensitive gates in the CPU :-) Seriously, ext2 and [[ReiserFS]] contain random differences, and overheating and bad RAM have random sensitivities. ([http://www.bitwizard.nl/sig11/ Signal 11] is not due to ReiserFS. One user had a cable blocking the fan; it did not affect ext2, but it wasn't until he fixed the cable-fan problem that ReiserFS worked.)
=== Can I use ReiserFS on other architectures than i386? === Yes, starting from the Linux [http://kernel.org/pub/linux/kernel/v2.4/ChangeLog-2.4.13 kernel 2.4.13], ReiserFS can be run on any Linux supported arch. === I need a program which will help me in rebuilding/recreating my partition table. === [http://brzitwa.de/mb/gpart/ gpart] is a utility that handles ext2, FAT, Linux swap, HPFS, NTFS, FreeBSD and Solaris/x86 disklabels, Minix, ReiserFS. It prints a proposed content for the primary partition table and is well-documented. === What partition type should I use for ReiserFS? === [http://www.win.tue.nl/~aeb/partitions/partition_types.html Linux native filesystem] (83) === Can I use 32GB+ IDE Hard Drives with ReiserFS? === Yes if you use Linux kernel 2.4 and up. === What about resizing ReiserFS? === This can be done with [[resize_reiserfs]]. === What should I put into the fifth (aka dump, fs_freq ) and the sixth (aka pass, fs_passno ) fields of /etc/fstab for ReiserFS filesystems? === You'd put in <tt>"0 0"</tt>, e.g. /dev/sda3 /var reiserfs notail,nodev,nosuid,noexec <font color="red">0 0</font> === Why are ReiserFS filesystems not fscked on reboot after a crash? === Because [[ReiserFS]] provides journaling of meta-data. After a crash, the consistency of a filesystem is restored by replaying the transaction log. === Can I interactively repair a filesystem that was corrupted? === This is done with [[reiserfsck]]. === Can I use <tt>dump(8)</tt> and <tt>restore(8)</tt> with ReiserFS? Any caveats? === No. <tt>dump(8)</tt> uses knowledge of the internal structure of ext2 and works together with restore, which also uses ext2 specific knowledge, to back up ext2 files. dump and restore are specific to ext2 and will not work with [[ReiserFS]]. To back up ReiserFS files use <tt>tar(1)</tt>, which is universal and can be applied to almost any reasonable Linux filesystem. 
It is well known among system administrators that <tt>dump(8)</tt> is more complete than unix tar, and that there is quite a list of things that unix tar will fail to properly backup. This is not true of GNU/tar, which is quite complete. Basically, the only real disadvantage of GNU/tar compared to <tt>dump(8)</tt> is speed. Unfortunately, because it shares the same name as Unix <tt>tar(1)</tt>, people are reluctant to believe this. (Yes, GNU/tar has incremental backups, etc.) We will performance optimize ReiserFS backups for you (and the rest of the world) for $30K, which is not a lot if you are a large site spending a few hundred thousand on equipment for backups.

=== Does ReiserFS support snapshots? ===

No, but you can create [[ReiserFS]] on top of an [http://sourceware.org/lvm2/ LVM] logical volume and use LVM snapshot capabilities.

=== Can I check reiserfs filesystems for errors without unmounting them? ===

[[reiserfsck]] in checking mode may run over filesystems mounted read-only. There is no official way to fix mounted filesystems, though. You MUST completely unmount your filesystem in order to have it fixed. If you have LVM, you can check consistency of filesystems mounted read-write; here is the script contributed by Andreas Dilger:

=== What ReiserFS mount options should I use to get the performance winner for a mail server? ===

Craig Sanders answered in detail: "By the time I got around to running bonnie, the postmark and postal benchmarks had convinced me that notail was essential.

 host system:
 - Debian GNU/Linux (of course :)
 - Linux kernel 2.4.2 with latest 20010305 ReiserFS patch
 - dual P3-866 (256K cache)
 - 512MB RAM
 - Adaptec 19160 SCSI Controller

 external drive box:
 - Domex 8230u RAID controller, 32MB battery-backed cache.
 - 6 x 18GB IBM DDYS-T18350M drives

for this particular hardware I was using, reiserfs/notail on RAID5 was the clear performance winner for a mail server with lots of synced random I/O."
=== Does using ReiserFS mean I can just press the power off button without running "shutdown" or "init 0," etc? Does it mean there is no risk of data loss? === No, definitely not. As of now, ReiserFS only provides meta-data journaling--that is, it records which files have been created or opened, whether they have had their size changed, or where they have been relocated. It guarantees that the structure of the internal ReiserFS tree will be correct, thereby allowing you after an unclean shutdown to start back up without having to run fsck on all the files that have not been changed. Data in files that were being used at the time of the crash could have been corrupted. This is usual for most filesystems. Data journaling filesystems guarantee that there will be no garbage written into a file, but they don't guarantee that a file update will be. (Only reiser4 guarantees that filesystem operations are performed as atomic operations, and provides atomic transaction functionality.) ReiserFS V3 does not guarantee the file contents themselves are uncorrupted nor that no data is lost. Moreover, even given that all of your system is on ReiserFS, many system components (like daemons, database managers, etc) require the shut down procedure for proper functioning. However, there is separate implementation of data logging that will soon go into the mainstream kernel. You should be able to get it from ftp.suse.com/pub/people/mason/patches/data-logging === How does ReiserFS support bad block handling? === See here. === I have a motherboard with VIA MVP3 chipset and experience ReiserFS problems. === William Oster <woster73@yahoo.com> answers: If you are using a motherboard with a VIA MVP3 chipset, you may have ReiserFS problems caused by the way your kernel is configured for the so called "pci quirks". 
My experience is with kernel 2.2.18 and 2.2.19 but it may affect the 2.4.x series too if you are using MVP3 chipset (popular in socket 7 type motherboards, such as used by AMD K6 and classic Pentium). I've confirmed this problem with several motherboards using the VIA MVP3 chipset, ReiserFS 3.5.29 to 3.5.32, and NCR 53c8xx SCSI. But please note: It probably affects any controller which uses DMA and PCI bus mastering. Problems which I was inclined to attribute to the ReiserFS were actually problems with this kernel [mis] configuration. If you fit this profile, DO NOT enable the "pci quirks" configuration option in the /usr/src/linux/.config file. Although the Linux documentation suggests that this option can be enabled if in doubt, DO NOT enable it. It was never intended for the VIA MVP3 chipset anyway. It affects the way DMA is handled, and the combination of ReiserFS (and possibly NCR SCSI) can cause random disk corruption which eventually will result in ReiserFS and/or SCSI errors. Evidently ReiserFS exercises the DMA and SCSI bus very thoroughly, The problems seem not to be as likely under the ext2 filesystem. Check your /usr/src/linux/.config file. You are SAFE from this problem if you find this line: # CONFIG_PCI_QUIRKS is not set Any other setting could be dangerous to MVP3 chipset ReiserFS users especially when using PCI bus mastering controllers such as the NCR 53c8xx series. Re-configure your kernel to disable the "pci quirks" option, then make dep, rebuild, and reinstall. I am having extensive problems using ReiserFS; it seems to have bugs all over the place. I'm not compiling with a buggy compiler. What is happening? How can this be stable? You have hardware problems. Really, you do. Even if the bugs don't show up with ext2, you have hardware problems. (See FAQ question about ReiserFS running 3C hotter than ext2.) Most SuSE users use ReiserFS. 
Obscure bugs probably still exist; but if you find bugs as easily as using Windows, you have bad RAM, bad CPU, bad cable, bad cooling, VIA chipset with PCI quirks turned on, or other hardware or other software layer bugs. ReiserFS is stable. You can be sure that if the bugs are encountered easily and commonly with normal usage patterns, it is not us. This does not mean that the next release won't somehow break something though :-/..... Real bug reports are at the time of writing outnumbered 10 to 1 by hardware bugs that trigger error messages. We are working on making our error messages better at catching hardware bugs and identifying them as such. There is only so far we can go though in runtime consistency checking without serious speed reductions. We don't release software unless it goes through extensive testing; so if you don't think that our testing could have missed the bug, it is probably hardware. === How can I put a label (like allowed by <tt>-L</tt> option of <tt>mkfs.ext2</tt>) on a ReiserFS instance? === Currently, this feature is only implemented for [[ReiserFS]] v3.6 disk format. Adding it to v3.5 disk format would break existing disk format, and there is not enough free space in the superblock. You can set a label (and UUID) with recent [[Reiser4progs|reiserfsprogs]] package on [[ReiserFS]] v3.6 filesystem using <tt>-l</tt> switch (<tt>-u</tt> for UUID) to [[reiserfstune]] (for existing partitions) or to [[mkreiserfs]] (for partitions being created) commands. Support for labels and UUIDs was integrated into [[Reiser4progs|reiserfsprogs]] starting from version 3.x.1a. === Why, when I'm working on files (i.e. having open files) on my laptop, does ReiserFS access the disk every 5 seconds? This effectively prevents the disk from spinning down, i.e. APM modes to take over, even when I'm not writing anything. === Brent Graveland <bgraveland@hyperchip.com> answers: It's the atime update. Every time you run sync, the sync program's atime is updated. 
The next sync writes this atime update, then sync gets updated again... === RedHat does not unmount / with ReiserFS on halt. How to fix it? === RedHat users kindly provided these patches (not tested by us): rc.sysinit.patch and halt.patch. Note that if you have RedHat Linux 7.2 or later, you do not need these patches. === How do I run programs from reiserfsprogs package on encrypted devices? === In order to access such encrypted entities you need to use losetup tool to bind your entity to loop device. === Are there any recomendation pro or against any particular hard drive manufacturers for using with reiserfs? === There is basically no preference, general "the faster the drive is and less seek time is better" rule applies as always. On the other hand almost every hard drive manufacturer has a "widely known" broken series of hard drives. The most recent example is IBM's "Deskstar" series disks, especially DTLA models produced in Hungary 2000-2001. These are known to fail very often, to the point that you probably don't want to use them even if you already paid for them. Also other Deskstar drives are seem to be a not very good choice. IBM released a note that deskstar drives should not run for more then 8 hours/day on average. These drives are also known to be very sensitive to temperature conditions and are known to fail on overheating. There is class action lawsuit against IBM on that drives series which is in progress. === I am using RedHat 7.0 with gcc 2.96; why does ReiserFS seem unstable with it? === Use the most recent version of RedHat (gcc Linux 2.96-85 or later with RedHat 7.2, although 7.1 is also okay for ReiserFS). The choice of an unstable unreleased version of gcc 2.96 by RedHat as the default gcc was a Slashdot controversy. gcc 2.96 on RedHat 7.0 was unstable, and ReiserFS was one of the things that would fail for it. There are two gcc: 2.96 and 2.96-85 's. 2.96-85 works for ReiserFS, and the other (the one on RedHat 7.0) surely does not. 
Read the Linux kernel instructions about what compiler to use. The solution to code not working on broken compilers is the one RedHat has taken: fix the compiler. They fixed the compiler and thereby allowed the correctly compiled ReiserFS to work.

=== In my program I am using fsync(2) calls after each write to the file to guarantee the integrity of my file data, and this is very slow. How can I improve the performance? ===

Answer from Chris Mason: The main thing to remember is that fsyncs introduce a bunch of disk writes and force the FS to wait on the buffers. The key to keeping performance up is to make it easy for the FS to do as much as possible before the fsync call. So, if your application modifies 3 files, and you want to make sure all 3 changes are safely on disk:

 write(file1)
 write(file2)
 write(file3)
 fsync(file1)
 fsync(file2)
 fsync(file3)

is much faster than:

 write(file1)
 fsync(file1)
 write(file2)
 fsync(file2)
 write(file3)
 fsync(file3)

It is also faster to write over existing bytes in a file than to append new bytes onto its end. When you overwrite existing bytes, you don't have to commit new metadata to disk on fsync(); the FS can just write the data blocks. This means fewer seeks. Also, the more you write to a single file before calling fsync, the faster overall things will run:

 write(8k)
 fsync(file)

is much faster than:

 write(4k)
 fsync(file)
 write(4k)
 fsync(file)

Optimizing for those 3 things alone can make a huge performance difference overall.

Answer from Josh MacDonald: You have to understand that even using fsync() after every write() makes no guarantees. If the system crashes during either the write or the fsync operation, your data may be lost or corrupted. Suppose the fsync() does complete: does your application keep its data in multiple files? If so, and you need to write() to multiple files as part of a transaction, you have even greater problems.
The only safe and easy way to implement some kind of transaction with the traditional file system guarantees is to use rename():

1. Keep all of your data in a single file.
2. Periodically write a complete copy of your database to a temporary file.
3. Rename the temporary file to the original database name.

(Addition from Nikita Danilov: One can implement something like a phase tree at user level and use rename to atomically switch the root of the tree. This overcomes the "everything-in-one-file" limitation, but has the added complexity of requiring crash recovery.)

Answer from Nikita Danilov: Stop your development for now and wait until the reiser4 filesystem is released; it will have a transaction API exported to userspace. That transaction API would solve all of your problems.

== Our program needs to access a lot of working files. What is the recommended way to organize files to get the best results out of ReiserFS? Should all the files be placed in a single directory, or should the files be spread across a directory tree to limit the number of files per directory? Can you also summarize the relevant caching and locking effects? ==

Traditional file systems typically have poor performance when there are many files in a single directory, but not [[ReiserFS]]. These other file systems perform poorly because they use a linear search algorithm to find and replace entries in a directory. This means that the file system must scan, on average, half the blocks of a directory for every access. Typically, applications are required to work around this problem by manually structuring a tree of directories, allowing each individual directory to remain limited in size. For example, see how the Squid web proxy stores a large collection of files. ReiserFS does not have this problem because it uses an internal tree to store all directories and file metadata.
Directory operations remain efficient even for very large directories, so you can write your application free from this performance concern. However, there are several issues that complicate this matter, namely locking and locality. The Linux VFS currently imposes locking restrictions that serialize many operations on directories, so if concurrent processes or threads will access the collection of files, then you may be better off using multiple directories. [[Reiser4]] will improve upon this restriction, although it is still under development. ReiserFS attempts to store all of the files in a directory, along with the directory itself, in nearby locations on disk. An application may exploit this spatial locality if it can predict which files will be accessed with temporal locality. You may be better off using multiple directories to store your files if you can predict that many files within a directory will be accessed at the same time. To summarize, ReiserFS supports efficient access to large directories and most traditional file systems do not. However, locking and locality issues may guide your decision to use manually structured directory trees instead, at least until ReiserFS exports control over packing locality to users and improves its locking.

This FAQ is very [[ReiserFS]] centric and often a bit dated. The [[Reiser4]] filesystem is mentioned as ''upcoming''. Be sure to search the [[mailinglists|mailing list archives]] and help update this FAQ - Thanks!

__TOC__

=== What are the specs for ReiserFS: maximum number of files, of files a directory can have, of sub-dirs in a dir, of links to a file, maximum file size, maximum filesystem size, etc.? ===
Specifications for [[ReiserFS]]:

{|cellpadding="5" cellspacing="0" border="1"
| '''property''' || '''3.5''' || '''3.6'''
|-
| max number of files || 2<sup>32</sup>-3 => 4 Gi-3 || 2<sup>32</sup>-3 => 4 Gi-3
|-
| max number of files a dir can have || 518701895 (but in practice this value is limited by the hash function; the r5 hash allows about 1,200,000 file names without collisions) || 2<sup>32</sup>-4 => 4 Gi-4 (but in practice this value is limited by the hash function; the r5 hash allows about 1,200,000 file names without collisions)
|-
| max file size || 2<sup>31</sup>-1 bytes => 2 Gi-1 || 2<sup>60</sup> bytes => 1 Ei, but the page cache limits this to 8 Ti on architectures with a 32-bit int
|-
| max number of links to a file || 2<sup>16</sup> => 64 Ki || 2<sup>32</sup> => 4 Gi
|-
| max filesystem size || 2<sup>32</sup> (4K) blocks => 16 Ti || 2<sup>32</sup> (4K) blocks => 16 Ti
|}

ReiserFS does '''meta-data journaling''', enabling fast crash recovery without the expense of full '''data journaling'''. There [ftp://ftp.suse.com/pub/people/mason/patches/intermezzo-alpha/ were] separate [http://marc.info/?l=reiserfs-devel&m=100895310422415&w=2 patches from Chris Mason] that implement full data journaling for ReiserFS on Linux 2.4.16.

'''Note''': Full data journaling is considered by many to be a good way to achieve file data integrity across system crashes. However, although file data may appear to be consistent from the kernel's point of view, since there is no API exported to userspace to control transactions, we may end up in a situation where the application makes two write requests (as part of one logical transaction) but only one of them gets journaled before the system crashes. From the application's point of view, we may then end up with inconsistent data in the file. Such issues should be addressed with the upcoming [[Reiser4]]: such an API will be exported to userspace, and all programs that need transactions will be able to use it.
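The filesystem-size and file-size limits in the table above are straightforward powers of two; for instance, the 16 Ti maximum follows from a 32-bit block counter and a 4 KiB block size:

```latex
\underbrace{2^{32}}_{\text{block numbers}} \times \underbrace{2^{12}\ \text{bytes}}_{4\ \text{KiB block}}
  = 2^{44}\ \text{bytes} = 16\ \text{Ti},
\qquad
2^{31}-1\ \text{bytes} \approx 2\ \text{Gi (v3.5 max file size)}.
```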
=== Mount fails after reiserfsck --rebuild-tree failure ===

When [[reiserfsck]] --rebuild-tree is run, the first thing it does is set the root inode value to -1. This makes the filesystem unmountable, so if [[reiserfsck]] fails later on because the filesystem contains serious errors, the filesystem cannot be mounted. Therefore, once [[reiserfsck]] --rebuild-tree has failed for one of your filesystems, mounting of that partition is disabled. To correct the error, first check that you have the latest [[Reiser4progs|reiserfsprogs]] package installed. If that fails, please send a bug report to our [[mailinglists|mailing list]] and be ready to answer our questions.

=== Why is the execution time for a <tt>find . -type f | xargs cat</tt> command much longer when using ReiserFS than for the same command when using ext2? ===

This effect is observed if the measured file set was produced by untarring an archive that was not created from a ReiserFS partition (or by copying files from a non-ReiserFS partition, or by running a program that writes a bunch of files in some order). This is because the <tt>readdir()</tt> operation performed on the ReiserFS partition returns filenames not in the original write order but in some hash order (dependent on the hash function used). Thus, when reading the files' contents, the hard drive heads must move when going from one file to another. If you want ReiserFS to outperform other filesystems in your setup, here is one solution: copy the entire directory that you are not satisfied with to the same partition but under a different name (use <tt>cp -a</tt>), then remove the old directory and rename the new one to the old name. If the partition does not have enough space available, another approach is to <tt>tar(1)</tt> up the whole partition, clear it, and then untar the previously saved data.

=== Is quota-support built-in in the vanilla 2.4 kernels for ReiserFS? ===
No, quota support for the 2.4 kernel branch is bundled separately. The patches by Chris Mason were once available [ftp://ftp.suse.com/pub/people/mason/patches/reiserfs/quota-2.4/ at SuSE] (gone) and are still [http://gd.tuwien.ac.at/utils/fs/reiserfs/quota-patches/ mirrored at TU-Wien]. The reason these patches were not included in the 2.4 kernel branch is that they implement a new quota format and need new quota code too, which is too big a change for the 2.4 series of kernels. Various Linux distribution vendors (e.g. [http://www.suse.com SuSE]) do ship reiserfs-quota-enabled kernels, though.

=== I am getting some errors in my kernel logs that I do not know how to interpret ===

Messages like:

 vs-13070: reiserfs_read_inode2: i/o failure occurred trying to find stat data of [1718696 1718710 0x0 SD]
 zam-7001: io error in reiserfs_find_entry

most likely accompanied by samples like those below, are definite signs of hard disk problems (bad sectors):

 hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
 hda: dma_intr: error=0x40 { UncorrectableError }, LBAsect=6599945, sector=4286584
 end_request: I/O error, dev 03:03 (hda), sector 4286584

or

 scsi0: ERROR on channel 0, id 1, lun 0, CDB: Read (10) 00 00 01 ee 60 00 00 08 00
 Current sd 08:00: sense key Medium Error

or

 I/O error: dev 08:21, sector 65704

Messages about <tt>"access beyond end of device"</tt> may have lots of different causes, ranging from not rebooting after fdisk requested it, to unfinished resizings, to data corruption. The following messages mean you have a noisy IDE cable, or one of too low quality for the chosen UDMA mode.
Try to replace the cable with a better one, or choose a slower UDMA mode:

 hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
 hda: dma_intr: error=0x84 { DriveStatusError BadCRC }
 hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
 hda: dma_intr: error=0x84 { DriveStatusError BadCRC }

If you see any message from [[ReiserFS]] that you cannot interpret and there is nothing similar to the messages above around it, [[mailinglists|mail the message to us]] and we will explain it to you.

=== Will ReiserFS implement streams, extended attributes, etc.? ===

[[FAQ/streams|Here]] is the one-page answer.

=== ReiserFS appears to be very slow while the RAID is resyncing. Mounting takes several minutes. Once mounted, an <tt>ls(1)</tt> in the mounted directory hangs. Forever. Once the RAID is synced, things appear to work pretty well. How can that be fixed? ===

First of all, a patch that makes mounting the drive faster has been included in the Linux kernel since 2.4.19. You can grab the patch for earlier kernels [http://gd.tuwien.ac.at/utils/fs/reiserfs/reiserfs-for-2.5/2.5.4.pending/07-reiserfs-bitmap-journal-read-ahead.diff here]. Also, the RAID drivers have '''minimal guaranteed''' and '''maximal possible''' RAID rebuild bandwidth limits. These values are controlled through the <tt>/proc/sys/dev/raid/speed_limit_min</tt> and <tt>/proc/sys/dev/raid/speed_limit_max</tt> sysctl variables (values are in KiB/s). The RAID logic cannot always tell whether the disk subsystem is busy at a given time. When it thinks the disk subsystem is idle, it tries to rebuild the RAID array at the <tt>speed_limit_max</tt> speed, which defaults to 100 MB per second. Decrease this value to something more suitable (a bit of experimentation might be needed).

=== I get "attempt to read past the end of the partition" error messages; is ReiserFS corrupted? ===

You changed your partition sizes and then ran [[mkreiserfs]] before rebooting.
The kernel does not change its belief about the partition sizes until reboot time. (This is fixable, but nobody has fixed it as of Dec. 2001.) [[mkreiserfs]] created a filesystem that has the wrong notion of how large its partition is. The filesystem's notion of the partition boundaries will last past reboot even though the kernel's notion will change. So yes, it is corrupted. Some other kinds of metadata breakage can also lead to such messages.

=== Can I use VMware with ReiserFS? ===

VMware was tested on [http://www.suse.com/ SuSE Linux] with a [http://support.microsoft.com/gp/lifean18 Windows98] guest OS on a [[ReiserFS]] partition. There is one trick at the beginning: the following line was added to the VMware config file:

 host.FSSupportLocking1 = 0x52654973 # (0x52654973 == *(u32 *) "ReIs")

Thanks to [mailto:gkade@bigbrother.net Gregory K. Ade] for this hint.

=== How do I install Debian potato with ReiserFS as the root partition? ===

[[FAQ/potato_part|Here]] are instructions by [mailto:LeBlanc@mcc.ac.uk Dr. A.V. Le Blanc].

=== Starting with Linux kernel v2.4.21 I cannot mount my FS anymore. Why? ===

Sanity checks were added to the kernel code to prohibit mounting of filesystems that are bigger than the underlying block device. If you now see this message on mount:

 Filesystem on xx:yy cannot be mounted because it is bigger than the device

you may need to run fsck or increase the size of your LVM partition; or maybe you forgot to reboot after fdisk when it told you to. If you do not use LVM, that usually means you need to run <tt>[[reiserfsck]] --rebuild-sb</tt> on your filesystem and agree to change its recorded size to the proposed one.

=== Is it OK to use ReiserFS on a small storage device, e.g. a 16MB NAND flash block device? ===

[[FAQ/small_blocks|Here]] are instructions.

=== How do I change root from ext2 to ReiserFS without loss of data? ===

[[FAQ/change_fs|Here]] are instructions.
=== <tt>mount: /dev/hda5 has wrong major or minor number</tt> - what does that mean? ===

The kernel does not know anything about [[ReiserFS]]; it is neither compiled in nor available as a module.

=== Will it be possible to read/write ReiserFS partitions created now with future versions of ReiserFS? ===

Yes. [[ReiserFS]]-3.6.x (Linux 2.4.x) works with both the old (3.5) and the new (3.6) formats. ReiserFS-3.5.x (Linux 2.2.x) can only work with the old (3.5) disk format. There is no way to convert the new (3.6) disk format to the old (3.5), but the old (3.5) format can be converted to the new one (3.6) with the <tt>-o conv</tt> [[mount|mount option]].

=== The ReiserFS module doesn't insert properly - why? ===

After applying the patch, ''recompile'' the whole kernel including the modules target, reboot, then try to insert the module.

=== Can I use ReiserFS with software RAID? ===

Yes, for all RAID levels using any Linux >= 2.4.1, but '''DO NOT''' use RAID with Linux 2.2.x. Our journaling and their RAID code step on each other in the buffering code. Also, mirroring is '''not''' safe in the 2.2.x kernels, because online mirror rebuilds in 2.2.x break the write-ordering requirements for the log. If you crash in the middle of an online rebuild, your meta-data may be corrupted. The only RAID level that is safe with [[ReiserFS]] in the 2.2.x kernels is the striping/concatenation level.

=== Can I use ReiserFS with 3ware RAID? ===

Yes, but you need to use Linux 2.2.19 or later, for reasons other than [[ReiserFS]]. If you encounter problems, you should also be suspicious that it might not be ReiserFS that has the bug. There are [http://web.archive.org/web/20030415160519/http://www.3ware.com/support/raid5techbulletin.shtml special instructions] (archive.org).

=== Why do things freeze on my IDE hard drive for annoying amounts of time? ===

Because when large writes are scheduled all at once, reads can starve.
A fix for this is evolving; the later your ReiserFS patch, the better we handle this.

=== <tt>du(1)</tt> says ReiserFS makes space efficiency worse. ===

Use <tt>df(1)</tt>, not <tt>du(1)</tt>, or use the ''raw'' option for <tt>du(1)</tt> if it's supported. <tt>st_blocks</tt> summed up is less accurate than <tt>st_size</tt> for [[ReiserFS]] because we pack tails, and <tt>st_blocks</tt> rounds numbers up.

=== <tt>mkreiserfs(8)</tt> fails after repartitioning ===

The kernel requires you to reboot after repartitioning (for all filesystems). We intend to fix that.

=== Performance is poor, and my disk at 96% full still has free space. ===

Once a disk drive gets more than 85% full, performance starts to suffer unless you use a repacker (which isn't implemented yet). You can probably get away with 92%, but if performance is valued, you are making a mistake to keep it any fuller. This is true for almost all filesystems. [[ReiserFS]], because we pack tails together, packs more data into a given percentage used, but it is still subject to the rules for the maximum recommended percentage used. If you create the whole disk with one copy and then mount it read-only, then you can fully pack it without problems. Please be sure that you copy it from (or <tt>tar</tt> it from) a ReiserFS partition, so that files are created in ReiserFS <tt>readdir()</tt> order, as this will improve performance.

=== Why do I get a signal 11 when compiling the kernel using ReiserFS and not ext2? ===

Your CPU is overheating and/or you have [http://www.bitwizard.nl/sig11/ bad RAM].

=== But it doesn't happen with ext2? ===

ext2 uses less heat-sensitive gates in the CPU :-) Seriously, ext2 and [[ReiserFS]] contain random differences, and overheating and bad RAM have random sensitivities. ([http://www.bitwizard.nl/sig11/ Signal 11] is not due to ReiserFS. One user had a cable blocking the fan; it did not affect ext2, but it wasn't until he fixed the cable-fan problem that ReiserFS worked.)
=== Can I use ReiserFS on architectures other than i386? ===

Yes. Starting from the Linux [http://kernel.org/pub/linux/kernel/v2.4/ChangeLog-2.4.13 kernel 2.4.13], ReiserFS can be run on any Linux-supported architecture.

=== I need a program which will help me in rebuilding/recreating my partition table. ===

[http://brzitwa.de/mb/gpart/ gpart] is a utility that handles ext2, FAT, Linux swap, HPFS, NTFS, FreeBSD and Solaris/x86 disklabels, Minix, and ReiserFS. It prints a proposed content for the primary partition table and is well-documented.

=== What partition type should I use for ReiserFS? ===

[http://www.win.tue.nl/~aeb/partitions/partition_types.html Linux native filesystem] (83).

=== Can I use 32GB+ IDE hard drives with ReiserFS? ===

Yes, if you use Linux kernel 2.4 and up.

=== What about resizing ReiserFS? ===

This can be done with [[resize_reiserfs]].

=== What should I put into the fifth (aka dump, fs_freq) and the sixth (aka pass, fs_passno) fields of /etc/fstab for ReiserFS filesystems? ===

You'd put in <tt>"0 0"</tt>, e.g.:

 /dev/sda3 /var reiserfs notail,nodev,nosuid,noexec <font color="red">0 0</font>

=== Why are ReiserFS filesystems not fscked on reboot after a crash? ===

Because [[ReiserFS]] provides journaling of meta-data. After a crash, the consistency of a filesystem is restored by replaying the transaction log.

=== Can I interactively repair a filesystem that was corrupted? ===

This is done with [[reiserfsck]].

=== Can I use <tt>dump(8)</tt> and <tt>restore(8)</tt> with ReiserFS? Any caveats? ===

No. <tt>dump(8)</tt> uses knowledge of the internal structure of ext2 and works together with restore, which also uses ext2-specific knowledge, to back up ext2 files. dump and restore are specific to ext2 and will not work with [[ReiserFS]]. To back up ReiserFS files, use <tt>tar(1)</tt>, which is universal and can be applied to almost any reasonable Linux filesystem.
It is well known among system administrators that <tt>dump(8)</tt> is more complete than Unix tar, and that there is quite a list of things that Unix tar will fail to back up properly. This is not true of GNU tar, which is quite complete. Basically, the only real disadvantage of GNU tar compared to <tt>dump(8)</tt> is speed. Unfortunately, because it shares the same name as Unix <tt>tar(1)</tt>, people are reluctant to believe this. (Yes, GNU tar has incremental backups, etc.) We will performance-optimize ReiserFS backups for you (and the rest of the world) for $30K, which is not a lot if you are a large site spending a few hundred thousand on equipment for backups.

=== Does ReiserFS support snapshots? ===

No, but you can create ReiserFS on top of an LVM logical volume and use LVM's snapshot capabilities.

=== Can I check ReiserFS filesystems for errors without unmounting them? ===

[[reiserfsck]] in checking mode may run over filesystems mounted read-only. There is no official way to fix mounted filesystems, though: you MUST completely unmount your filesystem in order to have it fixed. If you have LVM, you can check the consistency of filesystems mounted read-write; here is the script contributed by Andreas Dilger:

=== What ReiserFS mount options should I use to get the performance winner for a mail server? ===

Craig Sanders answered in detail: "By the time I got around to running bonnie, the postmark and postal benchmarks had convinced me that notail was essential.

host system:
* Debian GNU/Linux (of course :)
* Linux kernel 2.4.2 with latest 20010305 ReiserFS patch
* dual P3-866 (256K cache)
* 512MB RAM
* Adaptec 19160 SCSI Controller

external drive box:
* Domex 8230u RAID controller, 32MB battery-backed cache
* 6 x 18GB IBM DDYS-T18350M drives

For this particular hardware I was using, reiserfs/notail on RAID5 was the clear performance winner for a mail server with lots of synced random I/O."
=== Does using ReiserFS mean I can just press the power-off button without running "shutdown" or "init 0", etc.? Does it mean there is no risk of data loss? ===

No, definitely not. As of now, ReiserFS only provides meta-data journaling: it records which files have been created or opened, whether they have had their size changed, or where they have been relocated. It guarantees that the structure of the internal ReiserFS tree will be correct, thereby allowing you, after an unclean shutdown, to start back up without having to run fsck on all the files that have not been changed. Data in files that were being used at the time of the crash could have been corrupted; this is usual for most filesystems. Data-journaling filesystems guarantee that no garbage will be written into a file, but they don't guarantee that a file update will be completed. (Only reiser4 guarantees that filesystem operations are performed as atomic operations, and provides atomic transaction functionality.) ReiserFS v3 does not guarantee that the file contents themselves are uncorrupted, nor that no data is lost. Moreover, even if all of your system is on ReiserFS, many system components (like daemons, database managers, etc.) require the shutdown procedure for proper functioning. However, there is a separate implementation of data logging that will soon go into the mainstream kernel. You should be able to get it from ftp.suse.com/pub/people/mason/patches/data-logging

=== How does ReiserFS support bad block handling? ===

See here.

=== I have a motherboard with a VIA MVP3 chipset and experience ReiserFS problems. ===

William Oster <woster73@yahoo.com> answers: If you are using a motherboard with a VIA MVP3 chipset, you may have ReiserFS problems caused by the way your kernel is configured for the so-called "pci quirks".
My experience is with kernels 2.2.18 and 2.2.19, but it may affect the 2.4.x series too if you are using the MVP3 chipset (popular in Socket 7 motherboards, such as those used by the AMD K6 and classic Pentium). I've confirmed this problem with several motherboards using the VIA MVP3 chipset, ReiserFS 3.5.29 to 3.5.32, and NCR 53c8xx SCSI. But please note: it probably affects any controller which uses DMA and PCI bus mastering. Problems which I was inclined to attribute to ReiserFS were actually problems with this kernel [mis]configuration. If you fit this profile, DO NOT enable the "pci quirks" configuration option in the /usr/src/linux/.config file. Although the Linux documentation suggests that this option can be enabled if in doubt, DO NOT enable it. It was never intended for the VIA MVP3 chipset anyway. It affects the way DMA is handled, and the combination with ReiserFS (and possibly NCR SCSI) can cause random disk corruption, which eventually results in ReiserFS and/or SCSI errors. Evidently ReiserFS exercises the DMA and SCSI bus very thoroughly. The problems seem not to be as likely under the ext2 filesystem. Check your /usr/src/linux/.config file. You are SAFE from this problem if you find this line:

 # CONFIG_PCI_QUIRKS is not set

Any other setting could be dangerous to MVP3 chipset ReiserFS users, especially when using PCI bus-mastering controllers such as the NCR 53c8xx series. Re-configure your kernel to disable the "pci quirks" option, then make dep, rebuild, and reinstall.

=== I am having extensive problems using ReiserFS; it seems to have bugs all over the place. I'm not compiling with a buggy compiler. What is happening? How can this be stable? ===

You have hardware problems. Really, you do. Even if the bugs don't show up with ext2, you have hardware problems. (See the FAQ question about ReiserFS running 3°C hotter than ext2.) Most SuSE users use ReiserFS.
[[category:ReiserFS]] [[category:Reiser4]]
=== Specifications for [[ReiserFS]]: {|cellpadding="5" cellspacing="0" border="1" | '''property''' || '''3.5''' || '''3.6''' |- | max number of files || 2<sup>32</sup>-3 => 4 Gi - 3 || 2<sup>32</sup>-3 => 4 Gi - 3 |- | max number of files a dir can have || 518701895 (but in practice this value is limited by the hash function; the r5 hash allows about 1 200 000 file names without collisions) || 2<sup>32</sup>-4 => 4 Gi - 4 (but in practice this value is limited by the hash function; the r5 hash allows about 1 200 000 file names without collisions) |- | max file size || 2<sup>31</sup>-1 => 2 Gi - 1 || 2<sup>60</sup> bytes => 1 Ei, but the page cache limits this to 8 Ti on architectures with a 32-bit int |- | max number of links to a file || 2<sup>16</sup> => 64 Ki || 2<sup>32</sup> => 4 Gi |- | max filesystem size || 2<sup>32</sup> (4K) blocks => 16 Ti || 2<sup>32</sup> (4K) blocks => 16 Ti |} ReiserFS does '''meta-data journaling''', enabling fast crash recovery without the expense of full '''data journaling'''. There [ftp://ftp.suse.com/pub/people/mason/patches/intermezzo-alpha/ were] separate [http://marc.info/?l=reiserfs-devel&m=100895310422415&w=2 patches from Chris Mason] that implement full data journaling for ReiserFS for Linux 2.4.16. '''Note''': Full data journaling is considered by many to be a good way to achieve file data integrity across system crashes. However, although file data may appear to be consistent from the kernel's point of view, since there is no API exported to userspace to control transactions, we may end up in a situation where the application makes two write requests (as part of one logical transaction) but only one of them gets journaled before the system crashes. From the application's point of view, we may then end up with inconsistent data in the file. Such issues should be addressed with the upcoming [[Reiser4]]: such an API will be exported to userspace, and all programs that need transactions will be able to use it.
=== Mount fails after reiserfsck --rebuild-tree failure === When [[reiserfsck]] --rebuild-tree is run, the first thing it does is set the root inode value to -1. This makes the filesystem unmountable. (So if [[reiserfsck]] fails later on because the filesystem contains serious errors, the filesystem cannot be mounted.) Therefore, once [[reiserfsck]] --rebuild-tree has failed for one of your filesystems, mounting of that partition is disabled. To correct the error, check that you have the latest [[Reiser4progs|reiserfsprogs]] package installed. If that fails, please send a bug report to our [[mailinglists|mailing list]] and be ready to answer our questions. === Why is the execution time for a <tt>find . -type f -exec cat {} \;</tt> command much longer when using ReiserFS than for the same command when using ext2? === This effect is observed if the measured file set was produced by untarring an archive that was not created from a ReiserFS partition (or by copying files from a non-ReiserFS partition, or by running a program that writes a bunch of files in some order). This is because the <tt>readdir()</tt> operation performed on the ReiserFS partition returns filenames not in the original write order but rather in some hash order (dependent on the hash function used). Thus when reading the files' contents, the hard drive heads must move when going from one file to another. If you want ReiserFS to outperform any other filesystem in your setup, here is one solution: copy the entire directory that you are not satisfied with to the same partition but with a different name (use <tt>cp -a</tt>), then remove the old directory and rename the new one to the old name. If the partition does not have enough space available, another approach is to <tt>tar(1)</tt> up the whole partition, clear it, and then untar the previously saved data. === Is quota support built into the vanilla 2.4 kernels for ReiserFS?
=== No, quota support for the 2.4 kernel branch is bundled separately: patches by Chris Mason were once available [ftp://ftp.suse.com/pub/people/mason/patches/reiserfs/quota-2.4/ at SuSE] (gone) and are still [http://gd.tuwien.ac.at/utils/fs/reiserfs/quota-patches/ mirrored at TU-Wien]. The reason these patches were not included in the 2.4 kernel branch is that they implement a new quota format and need new quota code too, which is too big a change for the 2.4 series of kernels. Various Linux distribution vendors (e.g. [http://www.suse.com SuSE]) do ship ReiserFS-quota-enabled kernels, though. === I am getting some errors in my kernel logs that I do not know how to interpret === Messages like: vs-13070: reiserfs_read_inode2: i/o failure occurred trying to find stat data of [1718696 1718710 0x0 SD] zam-7001: io error in reiserfs_find_entry most likely accompanied by messages like the samples below, are definite signs of hard disk problems (bad sectors): hda: dma_intr: status=0x51 { DriveReady SeekComplete Error } hda: dma_intr: error=0x40 { UncorrectableError }, LBAsect=6599945, sector=4286584 end_request: I/O error, dev 03:03 (hda), sector 4286584 or scsi0: ERROR on channel 0, id 1, lun 0, CDB: Read (10) 00 00 01 ee 60 00 00 08 00 Current sd 08:00: sense key Medium Error or I/O error: dev 08:21, sector 65704 Messages about <tt>"access beyond end of device"</tt> may have many different causes, from not rebooting after fdisk requested it, to unfinished resizes, to data corruption. The following messages mean you have a noisy IDE cable, or one of too low quality for the chosen UDMA mode.
Try to replace the cable with a better one, or choose a slower UDMA mode: hda: dma_intr: status=0x51 { DriveReady SeekComplete Error } hda: dma_intr: error=0x84 { DriveStatusError BadCRC } hda: dma_intr: status=0x51 { DriveReady SeekComplete Error } hda: dma_intr: error=0x84 { DriveStatusError BadCRC } If you see any message from [[ReiserFS]] that you cannot interpret and there is nothing similar to the messages above around it, [[mailinglists|mail the message to us]] and we will explain it to you. === Will ReiserFS implement streams, extended attributes, etc.? === [[FAQ/streams|Here]] is the one page answer. === ReiserFS appears to be very slow while the RAID is resyncing. Mounting takes several minutes. Once mounted, an <tt>ls(1)</tt> in the mounted directory hangs. Forever. Once the RAID is synced, things appear to work pretty well. How can that be fixed? === First of all, a patch that makes mounting the drive faster has been included in the Linux kernel since 2.4.19. You can grab the patch for earlier kernels [http://gd.tuwien.ac.at/utils/fs/reiserfs/reiserfs-for-2.5/2.5.4.pending/07-reiserfs-bitmap-journal-read-ahead.diff here]. Also, the RAID drivers have a '''minimal guaranteed''' and a '''maximal possible''' RAID rebuild bandwidth usage. These values are controlled through the <tt>/proc/sys/dev/raid/speed_limit_min</tt> and <tt>/proc/sys/dev/raid/speed_limit_max</tt> sysctl variables (values are in KiB/s). It seems that the RAID logic cannot always tell whether the disk subsystem is busy at a given time. When it thinks the disk subsystem is idle, it tries to rebuild the RAID array at the <tt>speed_limit_max</tt> speed, which defaults to 100 MB per second. Decrease this value to something more suitable (a bit of experimentation might be needed). === I get attempt to read past the end of the partition error messages; is ReiserFS corrupted? === You changed your partition sizes, and then before rebooting ran [[mkreiserfs]].
The kernel does not change its belief in what the partition sizes are until reboot time. (This is fixable, but nobody has fixed it as of Dec. 2001.) [[mkreiserfs]] created a filesystem with the wrong notion of how large its partition is. The filesystem's notion of the partition boundaries will last past reboot even though the kernel's notion will change. So yes, it is corrupted. Some other kinds of metadata breakage can also lead to such messages. === Can I use VMware with ReiserFS? === VMware was tested on [http://www.suse.com/ SuSE Linux] with a [http://support.microsoft.com/gp/lifean18 Windows98] guest OS on a [[ReiserFS]] partition. There's one trick at the beginning: the following line was added to the VMware config file host.FSSupportLocking1 = 0x52654973 # (0x52654973 == *(u32 *) "ReIs") Thanks to [mailto:gkade@bigbrother.net Gregory K. Ade] for this hint. === How do I install Debian potato with ReiserFS as root partition? === [[FAQ/potato_part|Here]] are instructions by [mailto:LeBlanc@mcc.ac.uk Dr. A.V. Le Blanc]. === Starting with Linux kernel v2.4.21 I cannot mount my FS anymore. Why? === Sanity checks were added to the kernel code to prohibit mounting of filesystems that are bigger than the underlying block device. If you now see this message on mount: Filesystem on xx:yy cannot be mounted because it is bigger than the device you may need to run fsck or increase the size of your LVM partition. Or maybe you forgot to reboot after fdisk when it told you to. If you do not use LVM, that usually means you need to run <tt>[[reiserfsck]] --rebuild-sb</tt> on your filesystem and agree to change its default size to the proposed one. === Is it ok to use ReiserFS on a small size storage device: e.g. 16MB NAND flash block device? === [[FAQ/small_blocks|Here]] are instructions. === How do I change root from ext2 to ReiserFS without loss of data? === [[FAQ/change_fs|Here]] are instructions.
=== <tt>mount: /dev/hda5 has wrong major or minor number</tt> - what does that mean? === The kernel does not know anything about [[ReiserFS]]; it is neither compiled in nor available as a module. === Will it be possible to read/write ReiserFS partitions created now with future versions of ReiserFS? === Yes. [[ReiserFS]]-3.6.x (Linux-2.4.x) works with both the old (3.5) and the new (3.6) formats. ReiserFS-3.5.x (Linux-2.2.x) can only work with the old (3.5) disk format. There is no way to convert the new (3.6) disk format to the old (3.5), but the old (3.5) format can be converted to the new one (3.6) with the <tt>-o conv</tt> [[mount|mount option]]. === The ReiserFS module doesn't insert properly - why? === After applying the patch, ''recompile'' the whole kernel including the modules target, reboot, then try to insert the module. === Can I use ReiserFS with software RAID? === Yes, for all RAID levels using any Linux >= 2.4.1, but '''DO NOT''' use RAID with Linux 2.2.x. Our journaling and their RAID code step on each other in the buffering code. Also, mirroring is '''not''' safe in the 2.2.x kernels because online mirror rebuilds in 2.2.x break the write ordering requirements for the log. If you crash in the middle of an online rebuild, your meta-data may be corrupted. The only RAID level that is safe with [[ReiserFS]] in the 2.2.x kernels is the striping/concatenation level. === Can I use ReiserFS with 3ware RAID? === Yes, but you need to use Linux 2.2.19 or later for reasons other than [[ReiserFS]]. Also, if you encounter problems, be aware that it might not be ReiserFS that has the bug. See these [http://web.archive.org/web/20030415160519/http://www.3ware.com/support/raid5techbulletin.shtml special instructions] (archive.org). === Why do things freeze on my IDE hard drive for annoying amounts of time? === Because when large writes are scheduled all at once, reads can starve.
A fix for this is evolving; the later your ReiserFS patch, the better we handle this. === <tt>du(1)</tt> says ReiserFS makes space efficiency worse. === Use <tt>df(1)</tt>, not <tt>du(1)</tt>, or use the ''raw'' option for <tt>du(1)</tt> if it is supported. Summing up <tt>st_blocks</tt> is less accurate than <tt>st_size</tt> for [[ReiserFS]] because we pack tails, and <tt>st_blocks</tt> rounds numbers up. === <tt>mkreiserfs(8)</tt> fails after repartitioning === The kernel requires you to reboot after repartitioning (for all filesystems). We intend to fix that. === Performance is poor, and my disk at 96% full still has free space. === Once a disk drive gets more than 85% full, performance starts to suffer unless you use a repacker (which isn't implemented yet). You can probably get away with 92%, but if performance is valued you are making a mistake to keep it any fuller. This is true for almost all filesystems. [[ReiserFS]], because it packs tails together, packs more data into a given percentage used, but it is still subject to the rules for the maximum recommended percentage used. If you create the whole disk with one copy and then mount it read-only, you can fully pack it without problems. Please be sure to copy it from (or <tt>tar</tt> it from) a ReiserFS partition so that files are created in ReiserFS <tt>readdir()</tt> order, as this will improve performance. === Why do I get a signal 11 when compiling the kernel using ReiserFS and not ext2? === Your CPU is overheating and/or you have [http://www.bitwizard.nl/sig11/ bad RAM]. === But it doesn't happen with ext2? === ext2 uses less heat-sensitive gates in the CPU :-) Seriously, ext2 and [[ReiserFS]] contain random differences, and overheating and bad RAM have random sensitivities. ([http://www.bitwizard.nl/sig11/ Signal 11] is not due to ReiserFS. One user had a cable blocking the fan; it did not affect ext2, but it wasn't until he fixed the cable-fan problem that ReiserFS worked.)
=== Can I use ReiserFS on architectures other than i386? === Yes, starting from Linux [http://kernel.org/pub/linux/kernel/v2.4/ChangeLog-2.4.13 kernel 2.4.13], ReiserFS can run on any Linux-supported architecture. === I need a program which will help me in rebuilding/recreating my partition table. === [http://brzitwa.de/mb/gpart/ gpart] is a utility that handles ext2, FAT, Linux swap, HPFS, NTFS, FreeBSD and Solaris/x86 disklabels, Minix, and ReiserFS. It prints a proposed content for the primary partition table and is well documented. === What partition type should I use for ReiserFS? === [http://www.win.tue.nl/~aeb/partitions/partition_types.html Linux native filesystem] (83) === Can I use 32GB+ IDE hard drives with ReiserFS? === Yes, if you use Linux kernel 2.4 and up. === What about resizing ReiserFS? === This can be done with [[resize_reiserfs]]. === What should I put into the fifth (aka dump, fs_freq) and the sixth (aka pass, fs_passno) fields of /etc/fstab for ReiserFS filesystems? === You'd put in <tt>0 0</tt>, e.g. /dev/sda3 /var reiserfs notail,nodev,nosuid,noexec <font color="red">0 0</font> === Why are ReiserFS filesystems not fscked on reboot after a crash? === Because [[ReiserFS]] provides journaling of meta-data. After a crash, the consistency of the filesystem is restored by replaying the transaction log. === Can I interactively repair a filesystem that was corrupted? === This is done with [[reiserfsck]]. === Can I use "dump" and "restore" with ReiserFS? Any caveats? === No. dump uses knowledge of the internal structure of ext2 and works together with restore, which also uses ext2-specific knowledge, to back up ext2 files. dump and restore are specific to ext2 and will not work with ReiserFS. To back up ReiserFS files use tar, which is universal and can be applied to almost any reasonable Linux filesystem.
It is well known among system administrators that dump is more complete than Unix tar, and that there is quite a list of things that Unix tar will fail to properly back up. This is not true of GNU tar, which is quite complete. Basically, the only real disadvantage of GNU tar compared to dump is speed. Unfortunately, because it shares its name with Unix tar, people are reluctant to believe this. (Yes, the GNU version has incremental backups, etc.) We will performance-optimize ReiserFS backups for you (and the rest of the world) for $30k, which is not a lot if you are a large site spending a few hundred thousand on equipment for backups. === Does ReiserFS support snapshots? === No, but you can create ReiserFS on top of an LVM logical volume and use LVM's snapshot capabilities. === Can I check ReiserFS filesystems for errors without unmounting them? === [[reiserfsck]] in checking mode may be run over filesystems mounted read-only. There is no official way to fix mounted filesystems, though: you MUST completely unmount your filesystem in order to have it fixed. If you have LVM, you can check the consistency of filesystems mounted read-write; here is the script contributed by Andreas Dilger: === What ReiserFS mount options should I use to get a performance winner for a mail server? === Craig Sanders answered in detail: "By the time I got around to running bonnie, the postmark and postal benchmarks had convinced me that notail was essential.
 host system:
 - Debian GNU/Linux (of course :)
 - Linux kernel 2.4.2 with latest 20010305 ReiserFS patch
 - dual P3-866 (256K cache)
 - 512MB RAM
 - Adaptec 19160 SCSI Controller
 external drive box:
 - Domex 8230u RAID controller, 32MB battery-backed cache
 - 6 x 18GB IBM DDYS-T18350M drives
For this particular hardware I was using, reiserfs/notail on RAID5 was the clear performance winner for a mail server with lots of synced random I/O."
=== Does using ReiserFS mean I can just press the power-off button without running "shutdown" or "init 0", etc.? Does it mean there is no risk of data loss? === No, definitely not. As of now, ReiserFS only provides meta-data journaling; that is, it records which files have been created or opened, whether they have had their size changed, or where they have been relocated. It guarantees that the structure of the internal ReiserFS tree will be correct, thereby allowing you, after an unclean shutdown, to start back up without having to run fsck on all the files that have not been changed. Data in files that were being used at the time of the crash could have been corrupted. This is usual for most filesystems. Data-journaling filesystems guarantee that there will be no garbage written into a file, but they don't guarantee that a file update will be completed. (Only Reiser4 guarantees that filesystem operations are performed as atomic operations, and provides atomic transaction functionality.) ReiserFS v3 guarantees neither that the file contents themselves are uncorrupted nor that no data is lost. Moreover, even if all of your system is on ReiserFS, many system components (like daemons, database managers, etc.) require a proper shutdown procedure for correct functioning. However, there is a separate implementation of data logging that will soon go into the mainstream kernel. You should be able to get it from ftp.suse.com/pub/people/mason/patches/data-logging === How does ReiserFS support bad block handling? === See here. === I have a motherboard with a VIA MVP3 chipset and experience ReiserFS problems. === William Oster <woster73@yahoo.com> answers: If you are using a motherboard with a VIA MVP3 chipset, you may have ReiserFS problems caused by the way your kernel is configured for the so-called "pci quirks".
My experience is with kernels 2.2.18 and 2.2.19, but it may affect the 2.4.x series too if you are using the MVP3 chipset (popular in Socket 7 motherboards, such as those used by the AMD K6 and classic Pentium). I've confirmed this problem with several motherboards using the VIA MVP3 chipset, ReiserFS 3.5.29 to 3.5.32, and NCR 53c8xx SCSI. But please note: it probably affects any controller which uses DMA and PCI bus mastering. Problems which I was inclined to attribute to ReiserFS were actually problems with this kernel [mis]configuration. If you fit this profile, DO NOT enable the "pci quirks" configuration option in the /usr/src/linux/.config file. Although the Linux documentation suggests that this option can be enabled if in doubt, DO NOT enable it. It was never intended for the VIA MVP3 chipset anyway. It affects the way DMA is handled, and the combination with ReiserFS (and possibly NCR SCSI) can cause random disk corruption which eventually will result in ReiserFS and/or SCSI errors. Evidently ReiserFS exercises the DMA and SCSI bus very thoroughly. The problems seem not to be as likely under the ext2 filesystem. Check your /usr/src/linux/.config file. You are SAFE from this problem if you find this line: # CONFIG_PCI_QUIRKS is not set Any other setting could be dangerous for MVP3-chipset ReiserFS users, especially when using PCI bus-mastering controllers such as the NCR 53c8xx series. Re-configure your kernel to disable the "pci quirks" option, then make dep, rebuild, and reinstall. === I am having extensive problems using ReiserFS; it seems to have bugs all over the place. I'm not compiling with a buggy compiler. What is happening? How can this be stable? === You have hardware problems. Really, you do. Even if the bugs don't show up with ext2, you have hardware problems. (See the FAQ question about ReiserFS running 3 °C hotter than ext2.) Most SuSE users use ReiserFS.
Obscure bugs probably still exist; but if you find bugs as easily as when using Windows, you have bad RAM, a bad CPU, a bad cable, bad cooling, a VIA chipset with PCI quirks turned on, or other hardware or software-layer bugs. ReiserFS is stable. You can be sure that if bugs are encountered easily and commonly with normal usage patterns, it is not us. This does not mean that the next release won't somehow break something, though :-/ Real bug reports are, at the time of writing, outnumbered 10 to 1 by hardware bugs that trigger error messages. We are working on making our error messages better at catching hardware bugs and identifying them as such. There is only so far we can go in runtime consistency checking, though, without serious speed reductions. We don't release software unless it goes through extensive testing; so if you don't think that our testing could have missed the bug, it is probably hardware. === How can I put a label (like the one allowed by the <tt>-L</tt> option of <tt>mkfs.ext2</tt>) on a ReiserFS instance? === Currently, this feature is only implemented for the [[ReiserFS]] v3.6 disk format. Adding it to the v3.5 disk format would break the existing disk format, and there is not enough free space in the superblock. You can set a label (and UUID) with a recent [[Reiser4progs|reiserfsprogs]] package on a [[ReiserFS]] v3.6 filesystem using the <tt>-l</tt> switch (<tt>-u</tt> for UUID) of the [[reiserfstune]] (for existing partitions) or [[mkreiserfs]] (for partitions being created) commands. Support for labels and UUIDs was integrated into [[Reiser4progs|reiserfsprogs]] starting from version 3.x.1a. === Why, when I'm working on files (i.e. having open files) on my laptop, does ReiserFS access the disk every 5 seconds? This effectively prevents the disk from spinning down, i.e. prevents APM modes from taking over, even when I'm not writing anything. === Brent Graveland <bgraveland@hyperchip.com> answers: It's the atime update. Every time you run sync, the sync program's atime is updated.
The next sync writes this atime update, then sync's atime gets updated again... === RedHat does not unmount / with ReiserFS on halt. How do I fix it? === RedHat users kindly provided these patches (not tested by us): rc.sysinit.patch and halt.patch. Note that if you have RedHat Linux 7.2 or later, you do not need these patches. === How do I run programs from the reiserfsprogs package on encrypted devices? === In order to access such encrypted entities, you need to use the losetup tool to bind your entity to a loop device. === Are there any recommendations for or against any particular hard drive manufacturer for use with ReiserFS? === There is basically no preference; the general "the faster the drive and the lower the seek time, the better" rule applies as always. On the other hand, almost every hard drive manufacturer has a "widely known" broken series of hard drives. The most recent example is IBM's "Deskstar" series, especially the DTLA models produced in Hungary in 2000-2001. These are known to fail very often, to the point that you probably don't want to use them even if you have already paid for them. Other Deskstar drives also seem to be a poor choice. IBM released a note that Deskstar drives should not run for more than 8 hours/day on average. These drives are also known to be very sensitive to temperature conditions and to fail on overheating. There is a class-action lawsuit in progress against IBM over that drive series. === I am using RedHat 7.0 with gcc 2.96; why does ReiserFS seem unstable with it? === Use the most recent version of RedHat (gcc 2.96-85 or later with RedHat 7.2, although 7.1 is also okay for ReiserFS). The choice of an unstable, unreleased version of gcc 2.96 by RedHat as the default gcc was a Slashdot controversy. gcc 2.96 on RedHat 7.0 was unstable, and ReiserFS was one of the things that would fail with it. There are two gcc versions: 2.96 and 2.96-85. 2.96-85 works for ReiserFS, and the other (the one on RedHat 7.0) surely does not.
Read the Linux kernel instructions about what compiler to use. The solution to code not working on broken compilers is the one RedHat has taken: fix the compiler. They fixed the compiler and thereby allowed the correctly compiled ReiserFS to work. === In my program I am using fsync(2) calls after each write to the file to guarantee the integrity of my file data, and this is very slow; how can I improve the performance? === Answer from Chris Mason: The main thing to remember is that fsyncs introduce a bunch of disk writes, and force the FS to wait on the buffers. The key to keeping performance up is to make it easy for the FS to do as much as possible before the fsync call. So, if your application modifies 3 files, and you want to make sure all 3 changes are safely on disk:

 write(file1)
 write(file2)
 write(file3)
 fsync(file1)
 fsync(file2)
 fsync(file3)

is much faster than:

 write(file1)
 fsync(file1)
 write(file2)
 fsync(file2)
 write(file3)
 fsync(file3)

It is also faster to write over existing bytes in a file than it is to append new bytes onto the end of it. When you overwrite existing bytes, you don't have to commit new metadata to disk on fsync(); the FS can just write the data blocks. This means fewer seeks. Also, the more you write to a single file before calling fsync, the faster overall things will run:

 write(8k)
 fsync(file)

is much faster than:

 write(4k)
 fsync(file)
 write(4k)
 fsync(file)

Trying to optimize for those 3 things alone can make a huge performance difference overall. Answer from Josh MacDonald: You have to understand that even using fsync() after every write() makes no guarantees. If the system crashes during either the write or the fsync operation, your data may be lost or corrupted. Suppose the fsync() does complete: does your application keep its data in multiple files? If that is the case and you need to write() to multiple files as part of a transaction, you have even greater problems.
The only safe and easy way for you to implement some kind of transaction with the traditional file system guarantees is to use rename():

# Keep all of your data in a single file.
# Periodically write a complete copy of your database to a temporary file.
# Rename the temporary file to the original database name.

(Addition from Nikita Danilov: One can implement something like a phase tree at user level and use rename to atomically switch the root of the tree. This overcomes the "everything-in-one-file" limitation but has the added complexity of requiring crash recovery.) Answer from Nikita Danilov: Stop your development for now and wait until the [[Reiser4]] filesystem is released; it will have a transaction API exported to userspace, and that API would solve all of your problems. == Our program needs to access a lot of working files. What is the recommended way to organize files to get the best results out of ReiserFS? Should all the files be placed in a single directory, or should the files be spread across a directory tree to limit the number of files per directory? Can you also summarize the relevant caching and locking effects? == Traditional file systems typically have poor performance when there are many files in a single directory, but not [[ReiserFS]]. These other file systems perform poorly because they use a linear search algorithm to find and replace entries in a directory. This means that the file system must scan, on average, half the blocks of a directory for every access. Typically, applications are required to work around this problem by manually structuring a tree of directories, allowing each individual directory to remain limited in size. For example, see how the Squid web proxy stores a large collection of files. ReiserFS does not have this problem because it uses an internal tree to store all directories and file metadata.
Directory operations remain effecient even for very large directories, so you can write your application free from this performance concern. However, there are several issues that complicate this matter: namely locking and locality. The Linux VFS currently imposes locking restrictions that serialize many operations on directories, so if concurrent processes or threads will access the collection of files then you may be better off using multiple directories. [[Reiser4]] will improve upon this restriction, although it is still under development. ReiserFS attempts to store all of the files in a directory, along with the directory itself, in nearby locations on disk. An application may exploit this spatial locality if it can predict which files will be accessed with temporal locality. You may be better of using multiple directories to store your files if you can predict that many files within a directory will be accessed at the same time. To summarize, ReiserFS supports efficient access to large directories and most traditional file systems do not. However, locking and locality issues may guide your decision to use manually structured directory trees instead, at least until ReiserFS exports control over packing locality to users, and improves its locking. [[category:ReiserFS]] [[category:Reiser4]] e34b3cc2ffcf6f60ec22cbb7608563ffaa11e2d5 1445 1444 2009-06-27T01:49:48Z Chris goe 2 mod_speling This FAQ is very [[ReiserFS]] centric and often a bit dated. The [[Reiser4]] filesystem is mentioned as ''upcoming''. Be sure to search the [[mailinglists|mailing list archives]] and help update this FAQ - Thanks! __TOC__ === What are the specs for ReiserFS: maximum number of files, of files a directory can have, of sub-dirs in a dir, of links to a file, maximum file size, maximum filesystem size, etc.? 
Specifications for [[ReiserFS]]:

{|cellpadding="5" cellspacing="0" border="1"
| '''property''' || '''3.5''' || '''3.6'''
|-
| max number of files || 2<sup>32</sup>-3 => 4 Gi - 3 || 2<sup>32</sup>-3 => 4 Gi - 3
|-
| max number of files a dir can have || 518701895 (but in practice this value is limited by the hash function; the r5 hash allows about 1 200 000 file names without collisions) || 2<sup>32</sup> - 4 => 4 Gi - 4 (but in practice this value is limited by the hash function; the r5 hash allows about 1 200 000 file names without collisions)
|-
| max file size || 2<sup>31</sup>-1 => 2 Gi - 1 || 2<sup>60</sup> bytes => 1 Ei, but the page cache limits this to 8 Ti on architectures with 32-bit int
|-
| max number of links to a file || 2<sup>16</sup> => 64 Ki || 2<sup>32</sup> => 4 Gi
|-
| max filesystem size || 2<sup>32</sup> (4K) blocks => 16 Ti || 2<sup>32</sup> (4K) blocks => 16 Ti
|}

ReiserFS does '''meta-data journaling''', enabling fast crash recovery without the expense of full '''data journaling'''. There [ftp://ftp.suse.com/pub/people/mason/patches/intermezzo-alpha/ were] separate [http://marc.info/?l=reiserfs-devel&m=100895310422415&w=2 patches from Chris Mason] that implement full data journaling for ReiserFS on Linux 2.4.16. '''Note''': Full data journaling is considered by many to be a good way to achieve file data integrity across system crashes. However, although file data may appear to be consistent from the kernel's point of view, since there is no API exported to userspace to control transactions, we may end up in a situation where the application makes two write requests (as part of one logical transaction) but only one of them gets journaled before the system crashes. From the application's point of view, we may then end up with inconsistent data in the file. Such issues should be addressed with the upcoming [[Reiser4]]: such an API will be exported to userspace, and all programs that need transactions will be able to use it.
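As a quick sanity check, the size limits in the table follow directly from the 32-bit block counter and the 4 KiB block size:

```latex
% maximum filesystem size: 2^{32} blocks of 4 KiB each
2^{32} \times 2^{12}\,\mathrm{B} = 2^{44}\,\mathrm{B} = 16\,\mathrm{Ti}

% maximum file size in the 3.6 format: a 60-bit byte offset
2^{60}\,\mathrm{B} = 1\,\mathrm{Ei}

% maximum file size in the 3.5 format: a signed 32-bit byte offset
2^{31}-1\,\mathrm{B} \approx 2\,\mathrm{Gi}
```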
=== Mount fails after reiserfsck --rebuild-tree failure ===

When [[reiserfsck]] --rebuild-tree is run, the first thing it does is set the root inode value to -1. This makes the filesystem unmountable. (So, if [[reiserfsck]] fails later on because the filesystem contains serious errors, the filesystem cannot be mounted.) Therefore, once [[reiserfsck]] --rebuild-tree has failed for one of your filesystems, mounting of that partition is disabled. To correct the error, first check that you have the latest [[Reiser4progs|reiserfsprogs]] package installed. If that fails, please send a bug report to our [[mailinglists|mailing list]] and be ready to answer our questions.

=== Why is the execution time for a <tt>find . -type f | xargs cat {} \;</tt> command much longer when using ReiserFS than for the same command when using ext2? ===

This effect is observed if the measured file set was produced by untarring an archive that was not created from a ReiserFS partition (or by copying files from a non-ReiserFS partition, or by running a program that writes a bunch of files in some order). This is because the <tt>readdir()</tt> operation performed on the ReiserFS partition returns filenames not in the original write order but rather in some hash order (dependent on the hash function used). Thus when reading the files' contents, the hard drive heads must move when going from one file to another. If you want ReiserFS to outperform any other filesystem in your setup, here is one solution: copy the entire directory that you are not satisfied with to the same partition but with a different name (use <tt>cp -a</tt>), then remove the old directory and rename the new one to the old name. If the partition does not have enough space available, another approach is to <tt>tar(1)</tt> up the whole partition, clear it, and then untar the previously saved data.

=== Is quota-support built-in in the vanilla 2.4 kernels for ReiserFS? ===
No, quota support for the 2.4 branch of Linux kernels is bundled separately. The patches by Chris Mason were once available [ftp://ftp.suse.com/pub/people/mason/patches/reiserfs/quota-2.4/ at SuSE] (gone) and are still [http://gd.tuwien.ac.at/utils/fs/reiserfs/quota-patches/ mirrored at TU-Wien]. The reason these patches were not included in the 2.4 kernel branch is that they implement a new quota format and need new quota code too, which is too big a change for the 2.4 series of kernels. Various Linux distribution vendors (e.g. [http://www.suse.com SuSE]) do ship reiserfs-quota-enabled kernels, though.

=== I am getting some errors in my kernel logs that I do not know how to interpret ===

Messages like:

 vs-13070: reiserfs_read_inode2: i/o failure occurred trying to find stat data of [1718696 1718710 0x0 SD]
 zam-7001: io error in reiserfs_find_entry

most likely accompanied by samples like the ones below, are definite signs of hard disk problems (bad sectors):

 hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
 hda: dma_intr: error=0x40 { UncorrectableError }, LBAsect=6599945, sector=4286584
 end_request: I/O error, dev 03:03 (hda), sector 4286584

or

 scsi0: ERROR on channel 0, id 1, lun 0, CDB: Read (10) 00 00 01 ee 60 00 00 08 00
 Current sd 08:00: sense key Medium Error

or

 I/O error: dev 08:21, sector 65704

Messages about <tt>"access beyond end of device"</tt> may have many different causes, from not rebooting after fdisk requested it, to unfinished resizings, to data corruption. The following messages mean you have a noisy IDE cable, or one of too low a quality for the chosen UDMA mode.
Try to replace the cable with a better one, or choose a slower UDMA mode:

 hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
 hda: dma_intr: error=0x84 { DriveStatusError BadCRC }
 hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
 hda: dma_intr: error=0x84 { DriveStatusError BadCRC }

If you see any message from [[ReiserFS]] that you cannot interpret and there is nothing similar to the messages above around it, [[mailinglists|mail the message to us]] and we will explain it to you.

=== Will ReiserFS implement streams, extended attributes, etc.? ===

[[FAQ/streams|Here]] is the one page answer.

=== Reiserfs appears to be very slow while the RAID is resyncing. Mounting takes several minutes. Once mounted, an <tt>ls(1)</tt> in the mounted directory hangs. Forever. Once the RAID is sync'ed, things appear to work pretty well. How can that be fixed? ===

First of all, a patch that helps mount the drive faster has been included in the Linux kernel since 2.4.19. You can grab the patch for earlier kernels [http://gd.tuwien.ac.at/utils/fs/reiserfs/reiserfs-for-2.5/2.5.4.pending/07-reiserfs-bitmap-journal-read-ahead.diff here]. Also, RAID drivers have '''minimal guaranteed''' and '''maximal possible''' RAID rebuild bandwidth usage. These values are controlled through the <tt>/proc/sys/dev/raid/speed_limit_min</tt> and <tt>/proc/sys/dev/raid/speed_limit_max</tt> sysctl variables (values are in 100 KiB/s). It seems that the RAID logic cannot always tell whether the disk subsystem is busy at a given time. When it thinks the disk subsystem is idle, it tries to rebuild the RAID array at the <tt>speed_limit_max</tt> speed, which defaults to 100 MB per second. Decrease this value to something more suitable (a bit of experimentation might be needed).

=== I get attempt to read past the end of the partition error messages; is ReiserFS corrupted? ===

You changed your partition sizes, and then before rebooting ran [[mkreiserfs]].
The kernel does not change its belief about the partition sizes until reboot time. (This is fixable, but nobody has fixed it as of Dec. 2001.) [[mkreiserfs]] created a filesystem with a wrong notion of how large its partition is. The filesystem's notion of the partition boundaries will last past reboot even though the kernel's notion will change. So yes, it is corrupted. Some other kinds of metadata breakage can also lead to such messages.

=== Can I use VMware with ReiserFS? ===

VMware was tested on [http://www.suse.com/ SuSE Linux] with a [http://support.microsoft.com/gp/lifean18 Windows98] guest OS on a [[ReiserFS]] partition. There's one trick at the beginning: the following line was added to the VMware config file:

 host.FSSupportLocking1 = 0x52654973 # (0x52654973 == *(u32 *) "ReIs")

Thanks to [mailto:gkade@bigbrother.net Gregory K. Ade] for this hint.

=== How do I install Debian potato with ReiserFS as root partition? ===

[[FAQ/potato_part|Here]] are instructions by [mailto:LeBlanc@mcc.ac.uk Dr. A.V. Le Blanc].

=== Starting with linux kernel v2.4.21 I cannot mount my FS anymore. Why? ===

Special sanity checks were added to the kernel code to prohibit mounting of filesystems that are bigger than the underlying block device. If you now see this message on mount:

 Filesystem on xx:yy cannot be mounted because it is bigger than the device

you may need to run fsck or increase the size of your LVM partition. Or maybe you forgot to reboot after fdisk when it told you to. If you do not use LVM, that usually means you need to run <tt>[[reiserfsck]] --rebuild-sb</tt> on your filesystem and agree to change its default size to the proposed one.

=== Is it ok to use ReiserFS on a small size storage device: e.g. 16MB NAND flash block device? ===

[[FAQ/small_blocks|Here]] are instructions.

=== How do I change root from ext2 to ReiserFS without loss of data? ===

[[FAQ/change_fs|Here]] are instructions.
=== <tt>mount: /dev/hda5 has wrong major or minor number</tt> - what does that mean? ===

The kernel does not know anything about [[ReiserFS]]; it is neither compiled in nor available as a module.

=== Will it be possible to read/write ReiserFS partitions created now with future versions of ReiserFS? ===

Yes. [[ReiserFS]]-3.6.x (Linux-2.4.x) works with both the old (3.5) and the new (3.6) formats. ReiserFS-3.5.x (Linux-2.2.x) can only work with the old (3.5) disk format. There is no way to convert the new (3.6) disk format to the old (3.5), but the old (3.5) format can be converted to the new one (3.6) with the <tt>-o conv</tt> [[mount|mount option]].

=== The ReiserFS module doesn't insert properly - why? ===

After applying the patch, ''recompile'' the whole kernel including the modules target, reboot, then try to insert the module.

=== Can I use ReiserFS with the software RAID? ===

Yes, for all RAID levels using any Linux >= 2.4.1, but '''DO NOT''' use RAID with Linux 2.2.x. Our journaling and their RAID code step on each other in the buffering code. Also, mirroring is '''not''' safe in the 2.2.x kernels because online mirror rebuilds in 2.2.x break the write ordering requirements for the log. If you crash in the middle of an online rebuild, your meta-data may be corrupted. The only RAID level that is safe with [[ReiserFS]] in the 2.2.x kernels is the striping/concatenation level.

=== Can I use ReiserFS with 3ware RAID? ===

Yes, but you need to use Linux 2.2.19 or later for reasons other than [[ReiserFS]]. Also, if you should encounter problems, you should be suspicious that it might not be ReiserFS that has the bug. See 3ware's [http://web.archive.org/web/20030415160519/http://www.3ware.com/support/raid5techbulletin.shtml special instructions] (archive.org).

=== Why do things freeze on my IDE hard drive for annoying amounts of time? ===

Because when large writes are scheduled all at once, reads can starve.
A fix for this is evolving; the later your ReiserFS patch, the better we handle this.

=== <tt>du(1)</tt> says ReiserFS makes space efficiency worse. ===

Use <tt>df(1)</tt>, not <tt>du(1)</tt>, or use the ''raw'' option for <tt>du(1)</tt> if it's supported. <tt>st_blocks</tt> summed up is less accurate than <tt>st_size</tt> for [[ReiserFS]] because we pack tails, and <tt>st_blocks</tt> rounds numbers up.

=== <tt>mkreiserfs(8)</tt> fails after repartitioning ===

The kernel requires you to reboot after repartitioning (for all filesystems). We intend to fix that.

=== Performance is poor, and my disk at 96% full still has free space. ===

Once a disk drive gets more than 85% full, performance starts to suffer unless a repacker is used (which isn't implemented yet). You can probably get away with 92%, but if performance is valued you are making a mistake by keeping it any fuller. This is true for almost all filesystems. [[ReiserFS]], because it packs tails together, packs more data into a given percentage used, but it is still subject to the rules for the maximum recommended percentage used. If you create the whole disk with one copy and then mount it read-only, then you can fully pack it without problems. Please be sure that you copy it from (or <tt>tar</tt> it from) a reiserfs partition so that files are created in reiserfs <tt>readdir()</tt> order, as this will improve performance.

=== Why do I get a signal 11 when compiling the kernel using ReiserFS and not ext2? ===

Your CPU is overheating and/or you have [http://www.bitwizard.nl/sig11/ bad RAM].

=== But it doesn't happen with ext2? ===

ext2 uses less heat sensitive gates in the CPU :-) Seriously, ext2 and [[ReiserFS]] contain random differences, and overheating and bad RAM have random sensitivities. ([http://www.bitwizard.nl/sig11/ Signal 11] is not due to ReiserFS. One user had a cable blocking the fan; it did not affect ext2, but it wasn't until he fixed the cable-fan problem that ReiserFS worked.)
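The <tt>st_blocks</tt>-versus-<tt>st_size</tt> distinction from the <tt>du</tt>/<tt>df</tt> answer above can be observed directly with stat(2). The <tt>size_vs_blocks()</tt> helper below is hypothetical, written only to illustrate the two fields:

```c
/* Compare the exact byte length of a file (st_size) with the space
 * the kernel reports as allocated (st_blocks, counted in 512-byte
 * units). On a tail-packing filesystem the two can diverge, which
 * is why summing st_blocks (as du does) misjudges ReiserFS. */
#include <assert.h>
#include <sys/stat.h>

/* Fill in exact size and allocated bytes for path; 0 on success. */
int size_vs_blocks(const char *path, long long *size, long long *allocated)
{
    struct stat st;
    if (stat(path, &st) != 0)
        return -1;
    *size = st.st_size;                          /* exact byte length  */
    *allocated = (long long)st.st_blocks * 512;  /* 512-byte units     */
    return 0;
}
```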
=== Can I use ReiserFS on other architectures than i386? ===

Yes. Starting from the Linux [http://kernel.org/pub/linux/kernel/v2.4/ChangeLog-2.4.13 kernel 2.4.13], ReiserFS can be run on any Linux-supported arch.

=== I need a program which will help me in rebuilding/recreating my partition table. ===

[http://brzitwa.de/mb/gpart/ gpart] is a utility that handles ext2, FAT, Linux swap, HPFS, NTFS, FreeBSD and Solaris/x86 disklabels, Minix, and ReiserFS. It prints a proposed content for the primary partition table and is well-documented.

=== What partition type should I use for ReiserFS? ===

[http://www.win.tue.nl/~aeb/partitions/partition_types.html Linux native filesystem] (83)

=== Can I use 32GB+ IDE Hard Drives with ReiserFS? ===

Yes, if you use Linux kernel 2.4 and up.

=== What about resizing ReiserFS? ===

This can be done with [[resize_reiserfs]].

=== What should I put into the fifth (aka dump, fs_freq) and the sixth (aka pass, fs_passno) fields of /etc/fstab for ReiserFS filesystems? ===

You'd put in <tt>"0 0"</tt>, e.g.

 /dev/sda3 /var reiserfs notail,nodev,nosuid,noexec <font color="red">0 0</font>

=== Why are ReiserFS filesystems not fscked on reboot after a crash? ===

Because [[ReiserFS]] provides journaling of meta-data. After a crash, the consistency of a filesystem is restored by replaying the transaction log.

=== Can I interactively repair a filesystem that was corrupted (due to an internal bug in the kernel or to a hardware fault)? ===

man [[reiserfsck]]

=== Can I use "dump" and "restore" with ReiserFS? Any caveats? ===

No. dump uses knowledge of the internal structure of ext2 and works together with restore, which also uses ext2-specific knowledge, to back up ext2 files. dump and restore are specific to ext2 and will not work with ReiserFS. To back up ReiserFS files use tar, which is universal and can be applied to almost any reasonable Linux filesystem.
It is well known among system administrators that dump is more complete than unix tar, and that there is quite a list of things that unix tar will fail to properly back up. This is not true of GNU tar, which is quite complete. Basically, the only real disadvantage of GNU tar compared to dump is speed. Unfortunately, because it shares the same name as unix tar, people are reluctant to believe this. (Yes, the GNU version has incremental backups, etc.) We will performance-optimize ReiserFS backups for you (and the rest of the world) for $30k, which is not a lot if you are a large site spending a few hundred thousand on equipment for backups.

=== Does ReiserFS support snapshots? ===

No, but you can create ReiserFS on top of an LVM logical volume and use LVM snapshot capabilities.

=== Can I check reiserfs filesystems for errors without unmounting them? ===

[[reiserfsck]] in checking mode may run over filesystems mounted read-only. There is no official way to fix mounted filesystems, though. You MUST completely unmount your filesystem in order to have it fixed. If you have LVM, you can check the consistency of filesystems mounted read-write; here is the script contributed by Andreas Dilger:

=== What ReiserFS mount options should I use to get the performance winner for a mail server? ===

Craig Sanders answered in detail: "By the time I got around to running bonnie, the postmark and postal benchmarks had convinced me that notail was essential.

 host system:
 - Debian GNU/Linux (of course :)
 - Linux kernel 2.4.2 with latest 20010305 ReiserFS patch
 - dual P3-866 (256K cache)
 - 512MB RAM
 - Adaptec 19160 SCSI Controller

 external drive box:
 - Domex 8230u RAID controller, 32MB battery-backed cache.
 - 6 x 18GB IBM DDYS-T18350M drives

For this particular hardware I was using, reiserfs/notail on RAID5 was the clear performance winner for a mail server with lots of synced random I/O."
=== Does using ReiserFS mean I can just press the power off button without running "shutdown" or "init 0," etc? Does it mean there is no risk of data loss? ===

No, definitely not. As of now, ReiserFS only provides meta-data journaling--that is, it records which files have been created or opened, whether they have had their size changed, or where they have been relocated. It guarantees that the structure of the internal ReiserFS tree will be correct, thereby allowing you, after an unclean shutdown, to start back up without having to run fsck on all the files that have not been changed. Data in files that were being used at the time of the crash could have been corrupted. This is usual for most filesystems. Data-journaling filesystems guarantee that there will be no garbage written into a file, but they don't guarantee that a file update will be completed. (Only reiser4 guarantees that filesystem operations are performed as atomic operations, and provides atomic transaction functionality.) ReiserFS V3 does not guarantee that the file contents themselves are uncorrupted, nor that no data is lost. Moreover, even if all of your system is on ReiserFS, many system components (like daemons, database managers, etc.) require the shutdown procedure for proper functioning. However, there is a separate implementation of data logging that will soon go into the mainstream kernel. You should be able to get it from ftp.suse.com/pub/people/mason/patches/data-logging

=== How does ReiserFS support bad block handling? ===

See here.

=== I have a motherboard with VIA MVP3 chipset and experience ReiserFS problems. ===

William Oster <woster73@yahoo.com> answers: If you are using a motherboard with a VIA MVP3 chipset, you may have ReiserFS problems caused by the way your kernel is configured for the so-called "pci quirks".
My experience is with kernels 2.2.18 and 2.2.19, but it may affect the 2.4.x series too if you are using the MVP3 chipset (popular in socket 7 type motherboards, such as those used by the AMD K6 and classic Pentium). I've confirmed this problem with several motherboards using the VIA MVP3 chipset, ReiserFS 3.5.29 to 3.5.32, and NCR 53c8xx SCSI. But please note: it probably affects any controller which uses DMA and PCI bus mastering. Problems which I was inclined to attribute to ReiserFS were actually problems with this kernel [mis]configuration. If you fit this profile, DO NOT enable the "pci quirks" configuration option in the /usr/src/linux/.config file. Although the Linux documentation suggests that this option can be enabled if in doubt, DO NOT enable it. It was never intended for the VIA MVP3 chipset anyway. It affects the way DMA is handled, and the combination with ReiserFS (and possibly NCR SCSI) can cause random disk corruption which eventually will result in ReiserFS and/or SCSI errors. Evidently ReiserFS exercises the DMA and SCSI bus very thoroughly; the problems seem not to be as likely under the ext2 filesystem. Check your /usr/src/linux/.config file. You are SAFE from this problem if you find this line:

 # CONFIG_PCI_QUIRKS is not set

Any other setting could be dangerous to MVP3 chipset ReiserFS users, especially when using PCI bus mastering controllers such as the NCR 53c8xx series. Re-configure your kernel to disable the "pci quirks" option, then make dep, rebuild, and reinstall.

=== I am having extensive problems using ReiserFS; it seems to have bugs all over the place. I'm not compiling with a buggy compiler. What is happening? How can this be stable? ===

You have hardware problems. Really, you do. Even if the bugs don't show up with ext2, you have hardware problems. (See the FAQ question about ReiserFS running 3C hotter than ext2.) Most SuSE users use ReiserFS.
Obscure bugs probably still exist; but if you find bugs as easily as when using Windows, you have bad RAM, a bad CPU, a bad cable, bad cooling, a VIA chipset with PCI quirks turned on, or other hardware or software layer bugs. ReiserFS is stable. You can be sure that if bugs are encountered easily and commonly with normal usage patterns, it is not us. This does not mean that the next release won't somehow break something, though :-/..... Real bug reports are, at the time of writing, outnumbered 10 to 1 by hardware bugs that trigger error messages. We are working on making our error messages better at catching hardware bugs and identifying them as such. There is only so far we can go, though, in runtime consistency checking without serious speed reductions. We don't release software unless it goes through extensive testing; so if you don't think that our testing could have missed the bug, it is probably hardware.

=== How can I put a label (like allowed by the <tt>-L</tt> option of <tt>mkfs.ext2</tt>) on a ReiserFS instance? ===

Currently, this feature is only implemented for the [[ReiserFS]] v3.6 disk format. Adding it to the v3.5 disk format would break the existing disk format, and there is not enough free space in the superblock. You can set a label (and UUID) with a recent [[Reiser4progs|reiserfsprogs]] package on a [[ReiserFS]] v3.6 filesystem using the <tt>-l</tt> switch (<tt>-u</tt> for UUID) to the [[reiserfstune]] (for existing partitions) or [[mkreiserfs]] (for partitions being created) commands. Support for labels and UUIDs was integrated into [[Reiser4progs|reiserfsprogs]] starting from version 3.x.1a.

=== Why, when I'm working on files (i.e. having open files) on my laptop, does ReiserFS access the disk every 5 seconds? This effectively prevents the disk from spinning down, i.e. APM modes to take over, even when I'm not writing anything. ===

Brent Graveland <bgraveland@hyperchip.com> answers: It's the atime update. Every time you run sync, the sync program's atime is updated.
The next sync writes this atime update, then sync's atime gets updated again...

=== RedHat does not unmount / with ReiserFS on halt. How to fix it? ===

RedHat users kindly provided these patches (not tested by us): rc.sysinit.patch and halt.patch. Note that if you have RedHat Linux 7.2 or later, you do not need these patches.

=== How do I run programs from the reiserfsprogs package on encrypted devices? ===

In order to access such encrypted entities, you need to use the losetup tool to bind your entity to a loop device.

=== Are there any recommendations for or against any particular hard drive manufacturers for use with reiserfs? ===

There is basically no preference; the general "the faster the drive and the lower the seek time, the better" rule applies as always. On the other hand, almost every hard drive manufacturer has a "widely known" broken series of hard drives. The most recent example is IBM's "Deskstar" series of disks, especially the DTLA models produced in Hungary in 2000-2001. These are known to fail very often, to the point that you probably don't want to use them even if you already paid for them. Other Deskstar drives also seem to be a not very good choice. IBM released a note that Deskstar drives should not run for more than 8 hours/day on average. These drives are also known to be very sensitive to temperature conditions and to fail on overheating. A class action lawsuit against IBM over that drive series is in progress.

=== I am using RedHat 7.0 with gcc 2.96; why does ReiserFS seem unstable with it? ===

Use the most recent version of RedHat (gcc 2.96-85 or later with RedHat 7.2, although 7.1 is also okay for ReiserFS). The choice of an unstable, unreleased version of gcc 2.96 by RedHat as the default gcc was a Slashdot controversy. gcc 2.96 on RedHat 7.0 was unstable, and ReiserFS was one of the things that would fail with it. There are two gcc versions: 2.96 and 2.96-85. 2.96-85 works for ReiserFS, and the other (the one on RedHat 7.0) surely does not.
Read the Linux kernel instructions about what compiler to use. The solution to code not working on broken compilers is the one RedHat has taken: fix the compiler. They fixed the compiler and thereby allowed the correctly compiled ReiserFS to work.

=== In my program I am using fsync(2) calls after each write to the file to guarantee integrity of my file data, and this is very slow, how can I improve the performance? ===

Answer from Chris Mason: The main thing to remember is that fsyncs introduce a bunch of disk writes, and force the FS to wait on the buffers. The key to keeping performance up is to make it easy for the FS to do as much as possible before the fsync call. So, if your application modifies 3 files, and you want to make sure all 3 changes are safely on disk:

 write(file1)
 write(file2)
 write(file3)
 fsync(file1)
 fsync(file2)
 fsync(file3)

is much faster than:

 write(file1)
 fsync(file1)
 write(file2)
 fsync(file2)
 write(file3)
 fsync(file3)

It is also faster to write over existing bytes in a file than it is to append new bytes onto its end. When you overwrite existing bytes, you don't have to commit new metadata to disk on fsync(); the FS can just write the data blocks, which means fewer seeks. The more you write to a single file before calling fsync, the faster things will run overall:

 write(8k)
 fsync(file)

is much faster than:

 write(4k)
 fsync(file)
 write(4k)
 fsync(file)

Optimizing for those 3 things alone can make a huge performance difference overall. Answer from Josh MacDonald: You have to understand that even using fsync() after every write() makes no guarantees. If the system crashes during either the write or fsync operation, your data may be lost or corrupted. And suppose the fsync() does complete: does your application keep its data in multiple files? If so, and you need to write() to multiple files as part of a transaction, you have even greater problems.
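Chris Mason's batching advice above can be sketched in C. This is a generic POSIX illustration, not ReiserFS-specific code; the <tt>write_then_fsync()</tt> helper and the file paths are made up for the example:

```c
/* Minimal sketch of the batching pattern: issue all writes first,
 * then fsync in a second pass, instead of interleaving write/fsync
 * pairs, so the filesystem can group the disk writes. */
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Write one buffer to each of n paths, then fsync them all.
 * Returns 0 on success, -1 on any failure. */
int write_then_fsync(const char *paths[], int n, const char *buf, size_t len)
{
    int fds[16];
    int rc = 0;

    if (n > 16)
        return -1;
    for (int i = 0; i < n; i++) {
        fds[i] = open(paths[i], O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fds[i] < 0 || write(fds[i], buf, len) != (ssize_t)len)
            return -1;          /* sketch: skips fd cleanup on error */
    }
    /* All writes are queued; now flush them together. */
    for (int i = 0; i < n; i++) {
        if (fsync(fds[i]) != 0)
            rc = -1;
        close(fds[i]);
    }
    return rc;
}
```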
The only safe and easy way to implement some kind of transaction with the traditional file system guarantees is to use rename():

# Keep all of your data in a single file.
# Periodically write a complete copy of your database to a temporary file.
# Rename the temporary file to the original database name.

(Addition from Nikita Danilov: One can implement something like a phase-tree at user level and use rename to atomically switch the root of the tree. This overcomes the "everything-in-one-file" limitation but adds the complexity of requiring crash recovery.) Answer from Nikita Danilov: Stop your development for now and wait until the reiser4 filesystem is released; it will have a transaction API exported to userspace, and that transaction API would solve all of your problems.

== Our program needs to access a lot of working files. What is the recommended way to organize files to get the best results out of ReiserFS? Should all the files be placed in a single directory, or should the files be spread across a directory tree to limit the number of files per directory? Can you also summarize the relevant caching and locking effects? ==

Traditional file systems typically have poor performance when there are many files in a single directory, but not [[ReiserFS]]. These other file systems perform poorly because they use a linear search algorithm to find and replace entries in a directory. This means that the file system must scan, on average, half the blocks of a directory for every access. Typically, applications are required to work around this problem by manually structuring a tree of directories, allowing each individual directory to remain limited in size. For example, see how the Squid web proxy stores a large collection of files. ReiserFS does not have this problem because it uses an internal tree to store all directories and file metadata.
Directory operations remain efficient even for very large directories, so you can write your application free from this performance concern. However, there are several issues that complicate this matter: namely locking and locality. The Linux VFS currently imposes locking restrictions that serialize many operations on directories, so if concurrent processes or threads will access the collection of files, then you may be better off using multiple directories. [[Reiser4]] will improve upon this restriction, although it is still under development. ReiserFS attempts to store all of the files in a directory, along with the directory itself, in nearby locations on disk. An application may exploit this spatial locality if it can predict which files will be accessed with temporal locality. You may be better off using multiple directories to store your files if you can predict that many files within a directory will be accessed at the same time. To summarize, ReiserFS supports efficient access to large directories and most traditional file systems do not. However, locking and locality issues may guide your decision to use manually structured directory trees instead, at least until ReiserFS exports control over packing locality to users and improves its locking.

[[category:ReiserFS]] [[category:Reiser4]]
=== Specifications for [[ReiserFS]]:

{| cellpadding="5" cellspacing="0" border="1"
| '''property''' || '''3.5''' || '''3.6'''
|-
| max number of files || 2<sup>32</sup>-3 => 4 Gi - 3 || 2<sup>32</sup>-3 => 4 Gi - 3
|-
| max number of files a dir can have || 518701895 (but in practice this value is limited by the hash function; the r5 hash allows about 1 200 000 file names without collisions) || 2<sup>32</sup>-4 => 4 Gi - 4 (but in practice this value is limited by the hash function; the r5 hash allows about 1 200 000 file names without collisions)
|-
| max file size || 2<sup>31</sup>-1 bytes => 2 Gi - 1 || 2<sup>60</sup> bytes => 1 Ei, but the page cache limits this to 8 Ti on architectures with 32 bit int
|-
| max number of links to a file || 2<sup>16</sup> => 64 Ki || 2<sup>32</sup> => 4 Gi
|-
| max filesystem size || 2<sup>32</sup> (4K) blocks => 16 Ti || 2<sup>32</sup> (4K) blocks => 16 Ti
|}

ReiserFS does '''meta-data journaling''', enabling fast crash recovery without the expense of full '''data journaling'''. There [ftp://ftp.suse.com/pub/people/mason/patches/intermezzo-alpha/ were] separate [http://marc.info/?l=reiserfs-devel&m=100895310422415&w=2 patches from Chris Mason] that implement full data journaling for ReiserFS on Linux 2.4.16.

'''Note''': Full data journaling is considered by many to be a good way to achieve file data integrity across system crashes. However, although file data may appear to be consistent from the kernel's point of view, since there is no API exported to userspace to control transactions, we may end up in a situation where the application makes two write requests (as part of one logical transaction) but only one of them gets journaled before the system crashes. From the application's point of view, we may then end up with inconsistent data in the file. Such issues should be addressed with the upcoming [[Reiser4]]: a transaction API will be exported to userspace, and all programs that need transactions will be able to use it.
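The derived sizes in the table above follow from simple powers-of-two arithmetic; a quick sketch checking them (the 8 Ti page-cache limit assumes a signed 32-bit page index, i.e. at most 2<sup>31</sup> addressable 4 KiB pages):

```python
# Sanity-check the derived values in the ReiserFS limits table above.

BLOCK = 4 * 1024   # ReiserFS default block size: 4 KiB
GiB = 2**30
TiB = 2**40
EiB = 2**60

# max filesystem size: 2^32 blocks of 4 KiB => 16 Ti
assert 2**32 * BLOCK == 16 * TiB

# v3.6 max file size: 2^60 bytes => 1 Ei
assert 2**60 == EiB

# page cache on a 32-bit architecture: at most 2^31 4 KiB pages => 8 Ti
assert 2**31 * BLOCK == 8 * TiB

# v3.5 max file size: 2^31 - 1 bytes => 2 Gi - 1
assert 2**31 - 1 == 2 * GiB - 1

print("all limits consistent")
```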
=== Mount fails after reiserfsck --rebuild-tree failure ===

When [[reiserfsck]] --rebuild-tree is run, the first thing it does is set the root inode value to -1. This makes the filesystem unmountable. (So, if [[reiserfsck]] fails later on because the filesystem contains serious errors, the filesystem cannot be mounted.) Therefore, once [[reiserfsck]] --rebuild-tree has failed for one of your filesystems, mounting of that partition is disabled. To correct the error, first check that you have the latest [[Reiser4progs|reiserfsprogs]] package installed. If that fails, please send a bug report to our [[mailinglists|mailing list]] and be ready to answer our questions.

=== Why is the execution time for a <tt>find . -type f | xargs cat</tt> command much longer when using ReiserFS than for the same command when using ext2? ===

This effect is observed if the measured file set was produced by untarring an archive that was not created from a ReiserFS partition (or by copying files from a non-ReiserFS partition, or by running a program that writes a bunch of files in some order). This is because the <tt>readdir()</tt> operation on the ReiserFS partition returns filenames not in the original write order but in some hash order (dependent on the hash function used). Thus, when reading the files' contents, the hard drive heads must move when going from one file to another. If you want ReiserFS to outperform any other filesystem in your setup, here is one solution: copy the entire directory that you are not satisfied with to the same partition under a different name (use <tt>cp -a</tt>), then remove the old directory and rename the new one to the old name. If the partition does not have enough free space, another approach is to <tt>tar(1)</tt> up the whole partition, clear it, and then untar the previously saved data.

=== Is quota support built into the vanilla 2.4 kernels for ReiserFS?
=== No. Quota support for the 2.4 kernel branch is bundled separately; the patches, by Chris Mason, were once available [ftp://ftp.suse.com/pub/people/mason/patches/reiserfs/quota-2.4/ at SuSE] (now gone) and are still [http://gd.tuwien.ac.at/utils/fs/reiserfs/quota-patches/ mirrored at TU-Wien]. These patches were not included in the 2.4 kernel branch because they implement a new quota format and need new quota code as well, which is too big a change for the 2.4 series. Various Linux distribution vendors (e.g. [http://www.suse.com SuSE]) do ship reiserfs-quota-enabled kernels, though.

=== I am getting some errors in my kernel logs that I do not know how to interpret ===

Messages like:

 vs-13070: reiserfs_read_inode2: i/o failure occurred trying to find stat data of [1718696 1718710 0x0 SD]
 zam-7001: io error in reiserfs_find_entry

most likely accompanied by samples like those below, are definite signs of hard disk problems (bad sectors):

 hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
 hda: dma_intr: error=0x40 { UncorrectableError }, LBAsect=6599945, sector=4286584
 end_request: I/O error, dev 03:03 (hda), sector 4286584

or

 scsi0: ERROR on channel 0, id 1, lun 0, CDB: Read (10) 00 00 01 ee 60 00 00 08 00
 Current sd 08:00: sense key Medium Error

or

 I/O error: dev 08:21, sector 65704

Messages about <tt>"access beyond end of device"</tt> can have many different causes: not rebooting after fdisk requested it, unfinished resizes, or data corruption. The following messages mean you have a noisy IDE cable, or one of too low a quality for the chosen UDMA mode.
Try replacing the cable with a better one, or choose a slower UDMA mode:

 hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
 hda: dma_intr: error=0x84 { DriveStatusError BadCRC }
 hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
 hda: dma_intr: error=0x84 { DriveStatusError BadCRC }

If you see any message from [[ReiserFS]] that you cannot interpret, and there is nothing similar to the messages above around it, [[mailinglists|mail the message to us]] and we will explain it to you.

=== Will ReiserFS implement streams, extended attributes, etc.? ===

[[FAQ/streams|Here]] is the one page answer.

=== Reiserfs appears to be very slow while the RAID is resyncing. Mounting takes several minutes. Once mounted, an <tt>ls(1)</tt> in the mounted directory hangs. Forever. Once the RAID is sync'ed, things appear to work pretty well. How can that be fixed? ===

First of all, a patch that makes mounting faster has been included in the Linux kernel since 2.4.19. You can grab the patch for earlier kernels [http://gd.tuwien.ac.at/utils/fs/reiserfs/reiserfs-for-2.5/2.5.4.pending/07-reiserfs-bitmap-journal-read-ahead.diff here]. Also, RAID drivers have '''minimal guaranteed''' and '''maximal possible''' RAID rebuild bandwidth usage. These values are controlled through the <tt>/proc/sys/dev/raid/speed_limit_min</tt> and <tt>/proc/sys/dev/raid/speed_limit_max</tt> sysctl variables (values are in KiB/s). It seems that the RAID logic cannot always tell whether the disk subsystem is busy at a given time. When it thinks the disk subsystem is idle, it tries to rebuild the RAID array at the <tt>speed_limit_max</tt> speed, which defaults to 100 MB per second. Decrease this value to something more suitable (a bit of experimentation might be needed).

=== I get attempt to read past the end of the partition error messages; is ReiserFS corrupted? ===

You changed your partition sizes, and then ran [[mkreiserfs]] before rebooting.
The kernel does not change its belief about the partition sizes until reboot time. (This is fixable, but nobody has fixed it as of Dec. 2001.) [[mkreiserfs]] therefore created a filesystem with the wrong notion of how large its partition is. The filesystem's notion of the partition boundaries will last past reboot even though the kernel's notion will change. So yes, it is corrupted. Some other kinds of metadata breakage can also lead to such messages.

=== Can I use VMware with ReiserFS? ===

VMware was tested on [http://www.suse.com/ SuSE Linux] with a [http://support.microsoft.com/gp/lifean18 Windows98] guest OS on a [[ReiserFS]] partition. There is one trick at the beginning: the following line was added to the VMware config file

 host.FSSupportLocking1 = 0x52654973 # (0x52654973 == *(u32 *) "ReIs")

Thanks to [mailto:gkade@bigbrother.net Gregory K. Ade] for this hint.

=== How do I install Debian potato with ReiserFS as root partition? ===

[[FAQ/potato_part|Here]] are instructions by [mailto:LeBlanc@mcc.ac.uk Dr. A.V. Le Blanc].

=== Starting with Linux kernel v2.4.21 I cannot mount my FS anymore. Why? ===

Sanity checks were added to the kernel code to prohibit mounting filesystems that are bigger than the underlying block device. If you now see this message on mount:

 Filesystem on xx:yy cannot be mounted because it is bigger than the device

you may need to run fsck or increase the size of your LVM partition. Or maybe you forgot to reboot after fdisk told you to. If you do not use LVM, it usually means you need to run <tt>[[reiserfsck]] --rebuild-sb</tt> on your filesystem and agree to change its default size to the proposed one.

=== Is it OK to use ReiserFS on a small storage device, e.g. a 16MB NAND flash block device? ===

[[FAQ/small_blocks|Here]] are instructions.

=== How do I change root from ext2 to ReiserFS without loss of data? ===

[[FAQ/change_fs|Here]] are instructions.
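The "bigger than the device" sanity check above compares the size the filesystem believes it has against the size of the block device it sits on. As an illustration only (these function names are not part of any ReiserFS tool), the two measurements can be approximated from userspace like this:

```python
import os

def device_size_bytes(path):
    """Size of a block device (or regular file), found by seeking to its end."""
    with open(path, 'rb') as f:
        f.seek(0, os.SEEK_END)
        return f.tell()

def filesystem_size_bytes(mountpoint):
    """Total size the mounted filesystem believes it has, via statvfs."""
    st = os.statvfs(mountpoint)
    return st.f_blocks * st.f_frsize

def fs_fits_device(mountpoint, device_path):
    # The kernel's check refuses the mount when this comparison fails.
    return filesystem_size_bytes(mountpoint) <= device_size_bytes(device_path)
```

Running it against a real pair such as <tt>/</tt> and <tt>/dev/sda1</tt> (hypothetical names) requires permission to open the block device.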
=== <tt>mount: /dev/hda5 has wrong major or minor number</tt> - what does that mean? ===

The kernel does not know anything about [[ReiserFS]]; it is neither compiled in nor available as a module.

=== Will it be possible to read/write ReiserFS partitions created now with future versions of ReiserFS? ===

Yes. [[ReiserFS]]-3.6.x (Linux-2.4.x) works with both the old (3.5) and the new (3.6) formats. ReiserFS-3.5.x (Linux-2.2.x) can only work with the old (3.5) disk format. There is no way to convert the new (3.6) disk format to the old one (3.5), but the old (3.5) format can be converted to the new one (3.6) with the <tt>-o conv</tt> [[mount|mount option]].

=== The ReiserFS module doesn't insert properly - why? ===

After applying the patch, ''recompile'' the whole kernel including the modules target, reboot, then try to insert the module.

=== Can I use ReiserFS with software RAID? ===

Yes, for all RAID levels using any Linux >= 2.4.1, but '''DO NOT''' use RAID with Linux 2.2.x. Our journaling and their RAID code step on each other in the buffering code. Also, mirroring is '''not''' safe in the 2.2.x kernels because online mirror rebuilds in 2.2.x break the write ordering requirements for the log. If you crash in the middle of an online rebuild, your meta-data may be corrupted. The only RAID level that is safe with [[ReiserFS]] in the 2.2.x kernels is the striping/concatenation level.

=== Can I use ReiserFS with 3ware RAID? ===

Yes, but you need to use Linux 2.2.19 or later, for reasons other than [[ReiserFS]]. Also, if you encounter problems, consider that the bug might not be in ReiserFS; see 3ware's [http://web.archive.org/web/20030415160519/http://www.3ware.com/support/raid5techbulletin.shtml special instructions] (archive.org).

=== Why do things freeze on my IDE hard drive for annoying amounts of time? ===

Because when large writes are scheduled all at once, reads can starve.
A fix for this is evolving; the later your ReiserFS patch, the better we handle this.

=== <tt>du(1)</tt> says ReiserFS makes space efficiency worse. ===

Use <tt>df(1)</tt>, not <tt>du(1)</tt>, or use the ''raw'' option for <tt>du(1)</tt> if it is supported. <tt>st_blocks</tt> summed up is less accurate than <tt>st_size</tt> for [[ReiserFS]] because we pack tails, and <tt>st_blocks</tt> rounds numbers up.

=== <tt>mkreiserfs(8)</tt> fails after repartitioning ===

The kernel requires you to reboot after repartitioning (for all filesystems). We intend to fix that.

=== Performance is poor, and my disk at 96% full still has free space. ===

Once a disk drive gets more than 85% full, performance starts to suffer unless a repacker is used (which isn't implemented yet). You can probably get away with 92%, but if performance is valued, you are making a mistake to keep it any fuller. This is true for almost all filesystems. [[ReiserFS]], because it packs tails together, packs more data into a given percentage used, but it is still subject to the rules for the maximum recommended percentage used. If you create the whole disk with one copy and then mount it read-only, you can fully pack it without problem. Please be sure that you copy it from (or <tt>tar</tt> it from) a reiserfs partition so that files are created in reiserfs <tt>readdir()</tt> order, as this will improve performance.

=== Why do I get a signal 11 when compiling the kernel using ReiserFS and not ext2? ===

Your CPU is overheating and/or you have [http://www.bitwizard.nl/sig11/ bad RAM].

=== But it doesn't happen with ext2? ===

ext2 uses less heat-sensitive gates in the CPU :-) Seriously, ext2 and [[ReiserFS]] contain random differences, and overheating and bad RAM have random sensitivities. ([http://www.bitwizard.nl/sig11/ Signal 11] is not due to ReiserFS. One user had a cable blocking the fan; it did not affect ext2, but it wasn't until he fixed the cable-fan problem that ReiserFS worked.)
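The <tt>du</tt>/<tt>df</tt> point above comes down to which stat field is summed. A small illustration: on most filesystems a 1-byte file still occupies a whole block in <tt>st_blocks</tt>, which is exactly the per-file overhead that ReiserFS's tail packing reduces.

```python
import os
import tempfile

# Create a 1-byte file and compare its logical size with its allocated size.
fd, path = tempfile.mkstemp()
os.write(fd, b"x")
os.fsync(fd)   # force allocation so st_blocks is meaningful
os.close(fd)

st = os.stat(path)
logical = st.st_size            # what summing st_size counts: 1 byte
allocated = st.st_blocks * 512  # st_blocks is always in 512-byte units
print(logical, allocated)       # e.g. on a 4 KiB-block filesystem: 1 vs 4096
os.unlink(path)
```

Summing <tt>st_blocks</tt> (as <tt>du</tt> does) therefore overstates usage for small files relative to <tt>st_size</tt>, and rounds up on filesystems without tail packing.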
=== Can I use ReiserFS on architectures other than i386? ===

Yes. Starting from Linux [http://kernel.org/pub/linux/kernel/v2.4/ChangeLog-2.4.13 kernel 2.4.13], ReiserFS can be run on any Linux-supported architecture.

=== I need a program which will help me in rebuilding/recreating my partition table. ===

[http://brzitwa.de/mb/gpart/ gpart] is a utility that handles ext2, FAT, Linux swap, HPFS, NTFS, FreeBSD and Solaris/x86 disklabels, Minix, and ReiserFS. It prints a proposed content for the primary partition table and is well documented.

=== What partition type should I use for ReiserFS? ===

[http://www.win.tue.nl/~aeb/partitions/partition_types.html Linux native filesystem] (83)

=== Can I use 32GB+ IDE hard drives with ReiserFS? ===

Yes, if you use Linux kernel 2.4 or later.

=== What about resizing ReiserFS? ===

This can be done with [[resize_reiserfs]].

=== What should I put into the fifth (aka dump, <tt>fs_freq</tt>) and the sixth (aka pass, <tt>fs_passno</tt>) fields of /etc/fstab for ReiserFS filesystems? ===

You'd put in <tt>"0 0"</tt>, e.g.

 /dev/sda3 /var reiserfs notail,nodev,nosuid,noexec <font color="red">0 0</font>

=== Why are ReiserFS filesystems not fscked on reboot after a crash? ===

Because ReiserFS provides journaling of meta-data. After a crash, the consistency of the filesystem is restored by replaying the transaction log.

=== Can I interactively repair a filesystem that was corrupted (due to an internal bug in the kernel or to a hardware fault)? ===

man [[reiserfsck]]

=== Can I use "dump" and "restore" with ReiserFS? Any caveats? ===

No. dump uses knowledge of the internal structure of ext2, and works together with restore, which also uses ext2-specific knowledge, to back up ext2 files. dump and restore are specific to ext2 and will not work with ReiserFS. To back up ReiserFS files use tar, which is universal and can be applied to almost any reasonable Linux filesystem.
It is well known among system administrators that dump is more complete than unix tar, and that there is quite a list of things that unix tar will fail to back up properly. This is not true of GNU tar, which is quite complete. Basically, the only real disadvantage of GNU tar compared to dump is speed. Unfortunately, because it shares its name with unix tar, people are reluctant to believe this. (Yes, the GNU version has incremental backups, etc.) We will performance-optimize ReiserFS backups for you (and the rest of the world) for $30k, which is not a lot if you are a large site spending a few hundred thousand on equipment for backups.

=== Does ReiserFS support snapshots? ===

No, but you can create ReiserFS on top of an LVM logical volume and use LVM's snapshot capabilities.

=== Can I check reiserfs filesystems for errors without unmounting them? ===

[[reiserfsck]] in checking mode may be run on filesystems mounted read-only. There is no official way to fix mounted filesystems, though: you MUST completely unmount your filesystem in order to have it fixed. If you have LVM, you can check the consistency of filesystems mounted read-write; here is the script contributed by Andreas Dilger:

=== What ReiserFS mount options should I use to get the best performance for a mail server? ===

Craig Sanders answered in detail: "By the time I got around to running bonnie, the postmark and postal benchmarks had convinced me that notail was essential.

 host system:
 - Debian GNU/Linux (of course :)
 - Linux kernel 2.4.2 with latest 20010305 ReiserFS patch
 - dual P3-866 (256K cache)
 - 512MB RAM
 - Adaptec 19160 SCSI Controller

 external drive box:
 - Domex 8230u RAID controller, 32MB battery-backed cache.
 - 6 x 18GB IBM DDYS-T18350M drives

For this particular hardware, reiserfs/notail on RAID5 was the clear performance winner for a mail server with lots of synced random I/O."
=== Does using ReiserFS mean I can just press the power-off button without running "shutdown" or "init 0," etc? Does it mean there is no risk of data loss? ===

No, definitely not. As of now, ReiserFS only provides meta-data journaling: it records which files have been created or opened, whether they have had their size changed, or where they have been relocated. It guarantees that the structure of the internal ReiserFS tree will be correct, thereby allowing you, after an unclean shutdown, to start back up without having to run fsck on all the files that have not been changed. Data in files that were being used at the time of the crash may have been corrupted; this is usual for most filesystems. Data-journaling filesystems guarantee that there will be no garbage written into a file, but they do not guarantee that a file update will actually be completed. (Only reiser4 guarantees that filesystem operations are performed as atomic operations, and provides atomic transaction functionality.) ReiserFS V3 guarantees neither that the file contents are uncorrupted nor that no data is lost. Moreover, even if all of your system is on ReiserFS, many system components (like daemons, database managers, etc.) require a proper shutdown procedure to function correctly. However, there is a separate implementation of data logging that will soon go into the mainstream kernel. You should be able to get it from ftp.suse.com/pub/people/mason/patches/data-logging

=== How does ReiserFS support bad block handling? ===

See here.

=== I have a motherboard with VIA MVP3 chipset and experience ReiserFS problems. ===

William Oster <woster73@yahoo.com> answers: If you are using a motherboard with a VIA MVP3 chipset, you may have ReiserFS problems caused by the way your kernel is configured for the so-called "pci quirks".
My experience is with kernels 2.2.18 and 2.2.19, but it may affect the 2.4.x series too if you are using the MVP3 chipset (popular in Socket 7 motherboards, such as those used with the AMD K6 and classic Pentium). I've confirmed this problem with several motherboards using the VIA MVP3 chipset, ReiserFS 3.5.29 to 3.5.32, and NCR 53c8xx SCSI. But please note: it probably affects any controller which uses DMA and PCI bus mastering. Problems which I was inclined to attribute to ReiserFS were actually problems with this kernel misconfiguration. If you fit this profile, DO NOT enable the "pci quirks" configuration option in the /usr/src/linux/.config file. Although the Linux documentation suggests that this option can be enabled if in doubt, DO NOT enable it. It was never intended for the VIA MVP3 chipset anyway. It affects the way DMA is handled, and the combination with ReiserFS (and possibly NCR SCSI) can cause random disk corruption which eventually results in ReiserFS and/or SCSI errors. Evidently ReiserFS exercises the DMA and SCSI bus very thoroughly; the problems seem not to be as likely under the ext2 filesystem. Check your /usr/src/linux/.config file. You are SAFE from this problem if you find this line:

 # CONFIG_PCI_QUIRKS is not set

Any other setting could be dangerous to MVP3-chipset ReiserFS users, especially when using PCI bus-mastering controllers such as the NCR 53c8xx series. Re-configure your kernel to disable the "pci quirks" option, then make dep, rebuild, and reinstall.

=== I am having extensive problems using ReiserFS; it seems to have bugs all over the place. I'm not compiling with a buggy compiler. What is happening? How can this be stable? ===

You have hardware problems. Really, you do. Even if the bugs don't show up with ext2, you have hardware problems. (See the FAQ question about ReiserFS running 3°C hotter than ext2.) Most SuSE users use ReiserFS.
Obscure bugs probably still exist, but if you find bugs as easily as when using Windows, you have bad RAM, a bad CPU, a bad cable, bad cooling, a VIA chipset with PCI quirks turned on, or other hardware or software-layer bugs. ReiserFS is stable. You can be sure that if bugs are encountered easily and commonly with normal usage patterns, it is not us. This does not mean that the next release won't somehow break something, though :-/ Real bug reports are, at the time of writing, outnumbered 10 to 1 by hardware bugs that trigger error messages. We are working on making our error messages better at catching hardware bugs and identifying them as such. There is only so far we can go, though, in runtime consistency checking without serious speed reductions. We don't release software unless it goes through extensive testing; so if you don't think that our testing could have missed the bug, it is probably hardware.

=== How can I put a label (like the one allowed by the <tt>-L</tt> option of <tt>mkfs.ext2</tt>) on a ReiserFS instance? ===

Currently, this feature is only implemented for the [[ReiserFS]] v3.6 disk format. Adding it to the v3.5 disk format would break the existing disk format, and there is not enough free space in the superblock. You can set a label (and UUID) with a recent [[Reiser4progs|reiserfsprogs]] package on a [[ReiserFS]] v3.6 filesystem, using the <tt>-l</tt> switch (<tt>-u</tt> for UUID) of the [[reiserfstune]] (for existing partitions) or [[mkreiserfs]] (for partitions being created) commands. Support for labels and UUIDs was integrated into [[Reiser4progs|reiserfsprogs]] starting from version 3.x.1a.

=== Why, when I'm working on files (i.e. having open files) on my laptop, does ReiserFS access the disk every 5 seconds? This effectively prevents the disk from spinning down, i.e. prevents APM modes from taking over, even when I'm not writing anything. ===

Brent Graveland <bgraveland@hyperchip.com> answers: It's the atime update. Every time you run sync, the sync program's atime is updated.
The next sync writes this atime update, then sync's atime gets updated again...

=== RedHat does not unmount / with ReiserFS on halt. How do I fix it? ===

RedHat users kindly provided these patches (not tested by us): rc.sysinit.patch and halt.patch. Note that if you have RedHat Linux 7.2 or later, you do not need these patches.

=== How do I run programs from the reiserfsprogs package on encrypted devices? ===

In order to access such encrypted devices, use the losetup tool to bind the device to a loop device.

=== Are there any recommendations for or against particular hard drive manufacturers for use with reiserfs? ===

There is basically no preference; the general rule that a faster drive with a lower seek time is better applies as always. On the other hand, almost every hard drive manufacturer has a "widely known" broken series of hard drives. The most recent example is IBM's "Deskstar" series, especially the DTLA models produced in Hungary in 2000-2001. These are known to fail very often, to the point that you probably don't want to use them even if you already paid for them. Other Deskstar drives also seem to be a poor choice: IBM released a note that Deskstar drives should not run for more than 8 hours/day on average, and these drives are known to be very sensitive to temperature conditions and to fail on overheating. A class action lawsuit against IBM over that drive series is in progress.

=== I am using RedHat 7.0 with gcc 2.96; why does ReiserFS seem unstable with it? ===

Use the most recent version of RedHat (gcc 2.96-85 or later with RedHat 7.2, although 7.1 is also okay for ReiserFS). The choice of an unstable, unreleased version of gcc 2.96 by RedHat as the default gcc was a Slashdot controversy. gcc 2.96 on RedHat 7.0 was unstable, and ReiserFS was one of the things that would fail with it. There are two gcc versions: 2.96 and 2.96-85. 2.96-85 works for ReiserFS; the other (the one on RedHat 7.0) surely does not.
Read the Linux kernel instructions about what compiler to use. The solution to code not working on broken compilers is the one RedHat has taken: fix the compiler. They fixed the compiler and thereby allowed the correctly compiled ReiserFS to work.

=== In my program I am using fsync(2) calls after each write to the file to guarantee the integrity of my file data, and this is very slow. How can I improve the performance? ===

Answer from Chris Mason: The main thing to remember is that fsyncs introduce a bunch of disk writes and force the FS to wait on the buffers. The key to keeping performance up is to make it easy for the FS to do as much as possible before the fsync call. So, if your application modifies 3 files and you want to make sure all 3 changes are safely on disk:

 write(file1)
 write(file2)
 write(file3)
 fsync(file1)
 fsync(file2)
 fsync(file3)

is much faster than:

 write(file1)
 fsync(file1)
 write(file2)
 fsync(file2)
 write(file3)
 fsync(file3)

It is also faster to write over existing bytes in a file than to append new bytes onto its end. When you overwrite existing bytes, you don't have to commit new metadata to disk on fsync(); the FS can just write the data blocks. This means fewer seeks. The more you write to a single file before calling fsync, the faster things will run overall:

 write(8k)
 fsync(file)

is much faster than:

 write(4k)
 fsync(file)
 write(4k)
 fsync(file)

Trying to optimize for those 3 things alone can make a huge performance difference overall.

Answer from Josh MacDonald: You have to understand that even using fsync() after every write() makes no guarantees. If the system crashes during either the write or the fsync operation, your data may be lost or corrupted. Suppose the fsync() does complete: does your application keep its data in multiple files? If so, and you need to write() to multiple files as part of one transaction, you have even greater problems.
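Chris Mason's batching advice in this section can be sketched in a few lines. This is an illustration of the ordering only, not ReiserFS-specific code; the file paths are whatever the caller supplies:

```python
import os

DATA = b"x" * 4096

def interleaved(paths):
    # Slow pattern: each fsync() forces a commit before the next write starts.
    for p in paths:
        fd = os.open(p, os.O_WRONLY | os.O_CREAT, 0o644)
        os.write(fd, DATA)
        os.fsync(fd)
        os.close(fd)

def batched(paths):
    # Fast pattern: issue all the writes first, then fsync each file; the
    # filesystem can commit the accumulated dirty data in fewer, larger
    # transactions.
    fds = [os.open(p, os.O_WRONLY | os.O_CREAT, 0o644) for p in paths]
    for fd in fds:
        os.write(fd, DATA)
    for fd in fds:
        os.fsync(fd)
    for fd in fds:
        os.close(fd)
```

Both functions produce identical files; only the commit pattern, and hence the number of journal transactions and seeks, differs.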
The only safe and easy way for you to implement some kind of transaction with the traditional file system guarantees is to use rename():

 1. Keep all of your data in a single file.
 2. Periodically write a complete copy of your database to a temporary file.
 3. Rename the temporary file to the original database name.

(Addition from Nikita Danilov: One can implement something like a phase-tree at user level and use rename to atomically switch the root of the tree. This overcomes the "everything-in-one-file" limitation, but has the added complexity of requiring crash recovery.)

Answer from Nikita Danilov: Stop your development for now and wait until the reiser4 filesystem, which will have a transaction API exported to userspace, is released. That transaction API would solve all of your problems.

== Our program needs to access a lot of working files. What is the recommended way to organize files to get the best results out of ReiserFS? Should all the files be placed in a single directory, or should the files be spread across a directory tree to limit the number of files per directory? Can you also summarize the relevant caching and locking effects? ==

Traditional file systems typically have poor performance when there are many files in a single directory, but not [[ReiserFS]]. These other file systems perform poorly because they use a linear search algorithm to find and replace entries in a directory. This means that the file system must scan, on average, half the blocks of a directory for every access. Typically, applications are required to work around this problem by manually structuring a tree of directories, allowing each individual directory to remain limited in size. For example, see how the Squid web proxy stores a large collection of files. ReiserFS does not have this problem because it uses an internal tree to store all directories and file metadata.
Directory operations remain efficient even for very large directories, so you can write your application free from this performance concern. However, there are several issues that complicate this matter, namely locking and locality. The Linux VFS currently imposes locking restrictions that serialize many operations on directories, so if concurrent processes or threads will access the collection of files, you may be better off using multiple directories. [[Reiser4]] will improve upon this restriction, although it is still under development. ReiserFS attempts to store all of the files in a directory, along with the directory itself, in nearby locations on disk. An application may exploit this spatial locality if it can predict which files will be accessed with temporal locality. You may be better off using multiple directories to store your files if you can predict that many files within a directory will be accessed at the same time. To summarize, ReiserFS supports efficient access to large directories and most traditional file systems do not. However, locking and locality issues may guide your decision to use manually structured directory trees instead, at least until ReiserFS exports control over packing locality to users and improves its locking.

[[category:ReiserFS]]
[[category:Reiser4]]
=== Specifications for [[ReiserFS]]: {|cellpadding="5" cellspacing="0" border="1" | '''property''' || '''3.5''' || '''3.6''' |- | max number of files || 232-3 => 4 Gi - 3 || 232-3 => 4 Gi-3 |- | max number files a dir can have || 518701895 (but in practice this value is limited by hash function. r5 hash allows about 1 200 000 file names without collisions) || 232 - 4 => 4 Gi - 4 (but in practice this value is limited by hash function. r5 hash allows about 1 200 000 file names without collisions) |- | max file size || 231-1 => 2 Gi-1 || 260 - bytes => 1 Ei, but page cache limits this to 8 Ti on architectures with 32 bit int |- | max number links to a file || 216 => 64 Ki || 232 => 4 Gi |- | max filesystem size || 232 (4K) blocks => 16 Ti || 232 (4K) blocks => 16 Ti |} ReiserFS does '''meta-data journaling''', enabling fast crash recovery without the expense of full '''data journaling'''. There [ftp://ftp.suse.com/pub/people/mason/patches/intermezzo-alpha/ were] separate [http://marc.info/?l=reiserfs-devel&m=100895310422415&w=2 patches from Chris Mason] that implement full data journaling for ReiserFS for Linux 2.4.16. '''Note''': Full data journaling is considered by many to be a good way to achieve file data integrity across system crashes. However, although file data may appear to be consistent from the kernel point of view, since there is no API exported to the userspace to control transactions, we may end-up in a situation where the application makes two write requests (as part of one logical transaction) but only one of these gets journaled before the system crashes. From the application point of view, we may then end up with inconsistent data in the file. Such issues should be addressed with the upcoming [[Reiser4]]. Such an API will be exported to userspace and all programs that need transactions will be able to use it. 
=== Mount fails after reiserfsck --rebuild-tree failure === When [[reiserfsck]] --rebuild-tree is run, the first thing it does is set the root inode value to -1. This makes the filesystem unmountable. (So, if [[reiserfsck]] fails later on because the filesystem contains serious errors, the filesystem cannot be mounted.) Therefore, once [[reiserfsck]] --rebuild-tree has failed for one of your filesystems, mounting of this partition is disabled. To correct the error, first check that you have the latest [[Reiser4progs|reiserfsprogs]] package installed. If that fails, please send a bug report to our [[mailinglists|mailing list]] and be ready to answer our questions.

=== Why is the execution time for a <tt>find . -type f | xargs cat {} \;</tt> command much longer when using ReiserFS than for the same command when using ext2? === This effect is observed if the measured file set was produced by untarring an archive that was not created from a ReiserFS partition (or by copying files from a non-ReiserFS partition, or by running a program that writes a bunch of files in some order). This is because the <tt>readdir()</tt> operation performed on the ReiserFS partition returns filenames not in the original write order but rather in some hash order (dependent on the hash function used). Thus, when reading the files' contents, the hard drive heads must move when going from one file to another. If you want ReiserFS to outperform any other filesystem in your setup, here is one solution: copy the entire directory that you are not satisfied with to the same partition but with a different name (use <tt>cp -a</tt>), then remove the old directory and rename the new one to the old name. If the partition does not have enough space available, another approach is to <tt>tar(1)</tt> up the whole partition, clear it, and then untar the previously saved data.

=== Is quota support built into the vanilla 2.4 kernels for ReiserFS?
=== No, quota support for the 2.4 kernel branch is bundled separately. Chris Mason's patches were once available [ftp://ftp.suse.com/pub/people/mason/patches/reiserfs/quota-2.4/ at SuSE] (gone) and are still [http://gd.tuwien.ac.at/utils/fs/reiserfs/quota-patches/ mirrored at TU-Wien]. The reason these patches were not included in the 2.4 kernel branch is that they implement a new quota format and need new quota code too, which is too big a change for the 2.4 series of kernels. Various Linux distribution vendors (e.g. [http://www.suse.com SuSE]) do ship reiserfs-quota-enabled kernels, though.

=== I am getting some errors in my kernel logs that I do not know how to interpret === Messages like:

 vs-13070: reiserfs_read_inode2: i/o failure occurred trying to find stat data of [1718696 1718710 0x0 SD]
 zam-7001: io error in reiserfs_find_entry

most likely accompanied by samples like those below, are definite signs of hard disk problems (bad sectors):

 hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
 hda: dma_intr: error=0x40 { UncorrectableError }, LBAsect=6599945, sector=4286584
 end_request: I/O error, dev 03:03 (hda), sector 4286584

or

 scsi0: ERROR on channel 0, id 1, lun 0, CDB: Read (10) 00 00 01 ee 60 00 00 08 00
 Current sd 08:00: sense key Medium Error

or

 I/O error: dev 08:21, sector 65704

Messages about <tt>"access beyond end of device"</tt> can have many different causes, ranging from not rebooting after fdisk requested it, to unfinished resizings, to data corruption. The following messages mean you have a noisy IDE cable, or one of too low a quality for the chosen UDMA mode.
Try to replace the cable with a better one, or choose a slower UDMA mode:

 hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
 hda: dma_intr: error=0x84 { DriveStatusError BadCRC }
 hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
 hda: dma_intr: error=0x84 { DriveStatusError BadCRC }

If you see any message from [[ReiserFS]] that you cannot interpret, and there is nothing similar to the messages above around it, [[mailinglists|mail the message to us]] and we will explain it to you.

=== Will ReiserFS implement streams, extended attributes, etc.? === [[FAQ/streams|Here]] is the one page answer.

=== Reiserfs appears to be very slow while the RAID is resyncing. Mounting takes several minutes. Once mounted, an <tt>ls(1)</tt> in the mounted directory hangs. Forever. Once the RAID is sync'ed, things appear to work pretty well. How can that be fixed? === First of all, a patch that makes mounting faster has been included in the Linux kernel since 2.4.19. You can grab the patch for earlier kernels [http://gd.tuwien.ac.at/utils/fs/reiserfs/reiserfs-for-2.5/2.5.4.pending/07-reiserfs-bitmap-journal-read-ahead.diff here]. Also, RAID drivers have '''minimal guaranteed''' and '''maximal possible''' RAID rebuild bandwidth usage. These values are controlled through the <tt>/proc/sys/dev/raid/speed_limit_min</tt> and <tt>/proc/sys/dev/raid/speed_limit_max</tt> sysctl variables (values are in KiB/s). It seems that the RAID logic cannot always tell whether the disk subsystem is busy at a given time. When it thinks the disk subsystem is idle, it tries to rebuild the RAID array at the <tt>speed_limit_max</tt> speed, which defaults to 100 MB per second. Decrease this value to something more suitable (a bit of experimentation might be needed).

=== I get attempt to read past the end of the partition error messages; is ReiserFS corrupted? === You changed your partition sizes, and then before rebooting ran [[mkreiserfs]].
The kernel does not change its belief about the partition sizes until reboot time. (This is fixable, but nobody has fixed it as of Dec. 2001.) [[mkreiserfs]] created a filesystem that has the wrong notion of how large its partition is. The filesystem's notion of the partition boundaries will last past reboot even though the kernel's notion will change. So yes, it is corrupted. Some other kinds of metadata breakage can also lead to such messages.

=== Can I use VMware with ReiserFS? === VMware was tested on [http://www.suse.com/ SuSE Linux] with a [http://support.microsoft.com/gp/lifean18 Windows98] guest OS on a [[ReiserFS]] partition. There's one trick at the beginning: the following line was added to the VMware config file:

 host.FSSupportLocking1 = 0x52654973 # (0x52654973 == *(u32 *) "ReIs")

Thanks to [mailto:gkade@bigbrother.net Gregory K. Ade] for this hint.

=== How do I install Debian potato with ReiserFS as root partition? === [[FAQ/potato_part|Here]] are instructions by [mailto:LeBlanc@mcc.ac.uk Dr. A.V. Le Blanc].

=== Starting with linux kernel v2.4.21 I cannot mount my FS anymore. Why? === Special sanity checks were added to the kernel code to prohibit mounting of filesystems that are bigger than the underlying block device. If you now see this message on mount:

 Filesystem on xx:yy cannot be mounted because it is bigger than the device

you may need to run fsck or increase the size of your LVM partition. Or maybe you forgot to reboot after fdisk when it told you to. If you do not use LVM, that usually means you need to run <tt>[[reiserfsck]] --rebuild-sb</tt> on your filesystem and agree to change its default size to the proposed one.

=== Is it ok to use ReiserFS on a small size storage device: e.g. 16MB NAND flash block device? === [[FAQ/small_blocks|Here]] are instructions.

=== How do I change root from ext2 to ReiserFS without loss of data? === [[FAQ/change_fs|Here]] are instructions.
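The "bigger than the device" sanity check boils down to comparing two numbers: the byte size of the block device and the byte size the filesystem's superblock claims. A small sketch of that comparison in Python (the helper is hypothetical, not part of reiserfsprogs, and opening a real block device needs appropriate permissions):

```python
import os

def fs_vs_device_size(device_path, mountpoint):
    """Return (device_size, fs_size) in bytes: the size of the block
    device versus the size the mounted filesystem believes it has."""
    fd = os.open(device_path, os.O_RDONLY)
    try:
        device_size = os.lseek(fd, 0, os.SEEK_END)  # seek to end = size
    finally:
        os.close(fd)
    st = os.statvfs(mountpoint)
    fs_size = st.f_blocks * st.f_frsize             # total bytes the fs claims
    return device_size, fs_size
```

If fs_size comes out larger than device_size, a 2.4.21+ kernel refuses the mount; <tt>reiserfsck --rebuild-sb</tt> (or growing the LVM volume) brings the superblock back into agreement.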
=== <tt>mount: /dev/hda5 has wrong major or minor number</tt> - what does that mean? === The kernel does not know anything about [[ReiserFS]]; it is neither compiled in nor available as a module.

=== Will it be possible to read/write ReiserFS partitions created now with future versions of ReiserFS? === Yes. [[ReiserFS]]-3.6.x (Linux-2.4.x) works with both the old (3.5) and the new (3.6) formats. ReiserFS-3.5.x (Linux-2.2.x) can only work with the old (3.5) disk format. There is no way to convert the new (3.6) disk format to the old (3.5), but the old (3.5) format can be converted to the new one (3.6) with the <tt>-o conv</tt> [[mount|mount option]].

=== The ReiserFS module doesn't insert properly - why? === After applying the patch, ''recompile'' the whole kernel including the modules target, reboot, then try to insert the module.

=== Can I use ReiserFS with the software RAID? === Yes, for all RAID levels using any Linux >= 2.4.1, but '''DO NOT''' use RAID with Linux 2.2.x. Our journaling and their RAID code step on each other in the buffering code. Also, mirroring is '''not''' safe in the 2.2.x kernels because online mirror rebuilds in 2.2.x break the write ordering requirements for the log. If you crash in the middle of an online rebuild, your meta-data may be corrupted. The only RAID level that is safe with [[ReiserFS]] in the 2.2.x kernels is the striping/concatenation level.

=== Can I use ReiserFS with 3ware RAID? === Yes, but you need to use Linux 2.2.19 or later for reasons other than [[ReiserFS]]. Also, if you encounter problems, you should be suspicious that it might not be ReiserFS that has the bug. See the [http://web.archive.org/web/20030415160519/http://www.3ware.com/support/raid5techbulletin.shtml special instructions] (archive.org).

=== Why do things freeze on my IDE hard drive for annoying amounts of time? === Because when large writes are scheduled all at once, reads can starve.
A fix for this is evolving; the later your ReiserFS patch, the better we handle this.

=== <tt>du(1)</tt> says ReiserFS makes space efficiency worse. === Use <tt>df(1)</tt>, not <tt>du(1)</tt>, or use the ''raw'' option for <tt>du(1)</tt> if it's supported. <tt>st_blocks</tt> summed up is less accurate than <tt>st_size</tt> for [[ReiserFS]] because we pack tails, and <tt>st_blocks</tt> rounds numbers up.

=== <tt>mkreiserfs(8)</tt> fails after repartitioning === The kernel requires you to reboot after repartitioning (for all filesystems). We intend to fix that.

=== Performance is poor, and my disk at 96% full still has free space. === Once a disk drive gets more than 85% full, performance starts to suffer unless a repacker is used (which isn't implemented yet). You can probably get away with 92%, but if performance is valued, you are making a mistake to keep it any fuller. This is true for almost all filesystems. [[ReiserFS]], because it packs tails together, packs more data into a given percentage used, but it is still subject to the rules for the maximum recommended percentage used. If you create the whole disk with one copy and then mount it read-only, then you can fully pack it without problem. Please be sure that you copy it from (or <tt>tar</tt> it from) a reiserfs partition so that files are created in reiserfs <tt>readdir()</tt> order, as this will improve performance.

=== Why do I get a signal 11 when compiling the kernel using ReiserFS and not ext2? === Your CPU is overheating and/or you have [http://www.bitwizard.nl/sig11/ bad RAM].

=== But it doesn't happen with ext2? === ext2 uses less heat sensitive gates in the CPU :-) Seriously, ext2 and [[ReiserFS]] contain random differences, and overheating and bad RAM have random sensitivities. ([http://www.bitwizard.nl/sig11/ Signal 11] is not due to ReiserFS. One user had a cable blocking the fan; it did not affect ext2, but it wasn't until he fixed the cable-fan problem that ReiserFS worked.)
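The <tt>du(1)</tt> behaviour above is easy to see from stat(2): <tt>st_blocks</tt> counts 512-byte sectors actually allocated, rounded up to whole filesystem blocks, while <tt>st_size</tt> is the byte length of the file. A small illustration in Python (the helper name is made up for this example):

```python
import os

def size_report(path):
    """Return (apparent, allocated) sizes in bytes for one file:
    apparent  = st_size, what a summed `ls -l` would show;
    allocated = st_blocks * 512, what du(1) reports."""
    st = os.stat(path)
    return st.st_size, st.st_blocks * 512
```

On a filesystem with 4K blocks, a 10-byte file typically shows apparent=10 but allocated=4096; summing the du-style figure therefore overstates the space that tail-packed ReiserFS files actually consume.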
=== Can I use ReiserFS on other architectures than i386? === Yes. Starting from the Linux [http://kernel.org/pub/linux/kernel/v2.4/ChangeLog-2.4.13 kernel 2.4.13], ReiserFS can be run on any Linux supported arch.

=== I need a program which will help me in rebuilding/recreating my partition table. === [http://brzitwa.de/mb/gpart/ gpart] is a utility that handles ext2, FAT, Linux swap, HPFS, NTFS, FreeBSD and Solaris/x86 disklabels, Minix, and ReiserFS. It prints a proposed content for the primary partition table and is well-documented.

=== What partition type should I use for ReiserFS? === [http://www.win.tue.nl/~aeb/partitions/partition_types.html Linux native filesystem] (83)

=== Can I use 32GB+ IDE Hard Drives with ReiserFS? === Yes, if you use Linux kernel 2.4 and up.

=== What about resizing ReiserFS? === This can be done with [[resize_reiserfs]].

=== What should I put into the fifth (aka dump, fs_freq) and the sixth (aka pass, fs_passno) fields of /etc/fstab for ReiserFS filesystems? === 0 0

=== Why are ReiserFS filesystems not fscked on reboot after a crash? === Because ReiserFS provides journalling of meta-data. After a crash, the consistency of a filesystem is restored by replaying the transaction log.

=== Can I interactively repair a filesystem that was corrupted (due to an internal bug in the kernel or to a hardware fault)? === man [[reiserfsck]]

=== Can I use "dump" and "restore" with ReiserFS? Any caveats? === No. dump uses knowledge of the internal structure of ext2 and works together with restore, which also uses ext2-specific knowledge, to back up ext2 files. dump and restore are specific to ext2 and will not work with ReiserFS. To back up ReiserFS files use tar, which is universal and can be applied to almost any reasonable Linux filesystem. It is well known among system administrators that dump is more complete than unix tar, and that there is quite a list of things that unix tar will fail to properly backup.
This is not true of Gnu tar, which is quite complete. Basically, the only real disadvantage of Gnu tar compared to dump is speed. Unfortunately, because it shares the same name as unix tar, people are reluctant to believe this. (Yes, the Gnu version has incremental backups, etc.) We will performance-optimize ReiserFS backups for you (and the rest of the world) for $30k, which is not a lot if you are a large site spending a few hundred thousand on equipment for backups.

=== Does ReiserFS support snapshots? === No, but you can create ReiserFS on top of an LVM logical volume and use the LVM snapshot capabilities.

=== Can I check reiserfs filesystems for errors without unmounting them? === [[reiserfsck]] in checking mode may be run over filesystems mounted read-only. There is no official way to fix mounted filesystems, though. You MUST completely unmount your filesystem in order to have it fixed. If you have LVM, you can check the consistency of filesystems mounted read-write; here is the script contributed by Andreas Dilger:

=== What ReiserFS mount options should I use to get the best performance for a mail server? === Craig Sanders answered in detail: "By the time I got around to running bonnie, the postmark and postal benchmarks had convinced me that notail was essential.

 host system:
 - Debian GNU/Linux (of course :)
 - Linux kernel 2.4.2 with latest 20010305 ReiserFS patch
 - dual P3-866 (256K cache)
 - 512MB RAM
 - Adaptec 19160 SCSI Controller

 external drive box:
 - Domex 8230u RAID controller, 32MB battery-backed cache.
 - 6 x 18GB IBM DDYS-T18350M drives

For this particular hardware I was using, reiserfs/notail on RAID5 was the clear performance winner for a mail server with lots of synced random I/O."

=== Does using ReiserFS mean I can just press the power off button without running "shutdown" or "init 0," etc? Does it mean there is no risk of data loss? === No, definitely not.
As of now, ReiserFS only provides meta-data journaling; that is, it records which files have been created or opened, whether they have had their size changed, or where they have been relocated. It guarantees that the structure of the internal ReiserFS tree will be correct, thereby allowing you, after an unclean shutdown, to start back up without having to run fsck on all the files that have not been changed. Data in files that were being used at the time of the crash could have been corrupted. This is usual for most filesystems. Data journaling filesystems guarantee that there will be no garbage written into a file, but they don't guarantee that a file update will be completed. (Only reiser4 guarantees that filesystem operations are performed as atomic operations, and provides atomic transaction functionality.) ReiserFS V3 guarantees neither that the file contents themselves are uncorrupted nor that no data is lost. Moreover, even if all of your system is on ReiserFS, many system components (like daemons, database managers, etc.) require the shutdown procedure for proper functioning. However, there is a separate implementation of data logging that will soon go into the mainstream kernel. You should be able to get it from ftp.suse.com/pub/people/mason/patches/data-logging

=== How does ReiserFS support bad block handling? === See here.

=== I have a motherboard with VIA MVP3 chipset and experience ReiserFS problems. === William Oster <woster73@yahoo.com> answers: If you are using a motherboard with a VIA MVP3 chipset, you may have ReiserFS problems caused by the way your kernel is configured for the so-called "pci quirks". My experience is with kernels 2.2.18 and 2.2.19, but it may affect the 2.4.x series too if you are using the MVP3 chipset (popular in socket 7 type motherboards, such as those used by the AMD K6 and classic Pentium). I've confirmed this problem with several motherboards using the VIA MVP3 chipset, ReiserFS 3.5.29 to 3.5.32, and NCR 53c8xx SCSI.
But please note: it probably affects any controller which uses DMA and PCI bus mastering. Problems which I was inclined to attribute to ReiserFS were actually problems with this kernel [mis]configuration. If you fit this profile, DO NOT enable the "pci quirks" configuration option in the /usr/src/linux/.config file. Although the Linux documentation suggests that this option can be enabled if in doubt, DO NOT enable it. It was never intended for the VIA MVP3 chipset anyway. It affects the way DMA is handled, and the combination of ReiserFS (and possibly NCR SCSI) can cause random disk corruption which eventually will result in ReiserFS and/or SCSI errors. Evidently ReiserFS exercises the DMA and SCSI bus very thoroughly. The problems seem not to be as likely under the ext2 filesystem. Check your /usr/src/linux/.config file. You are SAFE from this problem if you find this line:

 # CONFIG_PCI_QUIRKS is not set

Any other setting could be dangerous to MVP3 chipset ReiserFS users, especially when using PCI bus mastering controllers such as the NCR 53c8xx series. Re-configure your kernel to disable the "pci quirks" option, then make dep, rebuild, and reinstall.

=== I am having extensive problems using ReiserFS; it seems to have bugs all over the place. I'm not compiling with a buggy compiler. What is happening? How can this be stable? === You have hardware problems. Really, you do. Even if the bugs don't show up with ext2, you have hardware problems. (See the FAQ question about ReiserFS running 3°C hotter than ext2.) Most SuSE users use ReiserFS. Obscure bugs probably still exist; but if you find bugs as easily as using Windows, you have bad RAM, a bad CPU, a bad cable, bad cooling, a VIA chipset with PCI quirks turned on, or other hardware or other software layer bugs. ReiserFS is stable. You can be sure that if the bugs are encountered easily and commonly with normal usage patterns, it is not us. This does not mean that the next release won't somehow break something, though :-/.....
Real bug reports are, at the time of writing, outnumbered 10 to 1 by hardware bugs that trigger error messages. We are working on making our error messages better at catching hardware bugs and identifying them as such. There is only so far we can go, though, in runtime consistency checking without serious speed reductions. We don't release software unless it goes through extensive testing; so if you don't think that our testing could have missed the bug, it is probably hardware.

=== How can I put a label (like that allowed by the <tt>-L</tt> option of <tt>mkfs.ext2</tt>) on a ReiserFS instance? === Currently, this feature is only implemented for the [[ReiserFS]] v3.6 disk format. Adding it to the v3.5 disk format would break the existing disk format, and there is not enough free space in the superblock. You can set a label (and UUID) with a recent [[Reiser4progs|reiserfsprogs]] package on a [[ReiserFS]] v3.6 filesystem using the <tt>-l</tt> switch (<tt>-u</tt> for UUID) to the [[reiserfstune]] (for existing partitions) or [[mkreiserfs]] (for partitions being created) commands. Support for labels and UUIDs was integrated into [[Reiser4progs|reiserfsprogs]] starting from version 3.x.1a.

=== Why, when I'm working on files (i.e. having open files) on my laptop, does ReiserFS access the disk every 5 seconds? This effectively prevents the disk from spinning down, i.e. prevents APM modes from taking over, even when I'm not writing anything. === Brent Graveland <bgraveland@hyperchip.com> answers: It's the atime update. Every time you run sync, the sync program's atime is updated. The next sync writes this atime update, then sync gets updated again...

=== RedHat does not unmount / with ReiserFS on halt. How can that be fixed? === RedHat users kindly provided these patches (not tested by us): rc.sysinit.patch and halt.patch. Note that if you have RedHat Linux 7.2 or later, you do not need these patches.

=== How do I run programs from the reiserfsprogs package on encrypted devices?
=== In order to access such encrypted entities, you need to use the losetup tool to bind the underlying entity to a loop device.

=== Are there any recommendations for or against any particular hard drive manufacturers for use with reiserfs? === There is basically no preference; the general rule of "the faster the drive and the lower the seek time, the better" applies as always. On the other hand, almost every hard drive manufacturer has a "widely known" broken series of hard drives. The most recent example is IBM's "Deskstar" series disks, especially the DTLA models produced in Hungary in 2000-2001. These are known to fail very often, to the point that you probably don't want to use them even if you already paid for them. Other Deskstar drives also seem to be a not very good choice. IBM released a note that Deskstar drives should not run for more than 8 hours/day on average. These drives are also known to be very sensitive to temperature conditions and to fail on overheating. A class action lawsuit against IBM over that drive series is in progress.

=== I am using RedHat 7.0 with gcc 2.96; why does ReiserFS seem unstable with it? === Use the most recent version of RedHat (gcc 2.96-85 or later, shipped with RedHat 7.2, although 7.1 is also okay for ReiserFS). The choice of an unstable unreleased version of gcc 2.96 by RedHat as the default gcc was a Slashdot controversy. gcc 2.96 on RedHat 7.0 was unstable, and ReiserFS was one of the things that would fail with it. There are two gcc versions: 2.96 and 2.96-85. 2.96-85 works for ReiserFS, and the other (the one on RedHat 7.0) surely does not. Read the Linux kernel instructions about what compiler to use. The solution to code not working on broken compilers is the one RedHat has taken: fix the compiler. They fixed the compiler and thereby allowed the correctly compiled ReiserFS to work.
=== In my program I am using fsync(2) calls after each write to the file to guarantee the integrity of my file data, and this is very slow; how can I improve the performance? === Answer from Chris Mason: The main thing to remember is that fsyncs introduce a bunch of disk writes, and force the FS to wait on the buffers. The key to keeping performance up is to make it easy for the FS to do as much as possible before the fsync call. So, if your application modifies 3 files, and you want to make sure all 3 changes are safely on disk:

 write(file1)
 write(file2)
 write(file3)
 fsync(file1)
 fsync(file2)
 fsync(file3)

is much faster than:

 write(file1)
 fsync(file1)
 write(file2)
 fsync(file2)
 write(file3)
 fsync(file3)

It is also faster to write over existing bytes in a file than it is to append new bytes onto the end of it. When you overwrite existing bytes in the file, you don't have to commit new metadata to disk on fsync(); the FS can just write the data blocks. This means fewer seeks. The more you write to a single file before calling fsync, the faster overall things will run:

 write(8k)
 fsync(file)

is much faster than:

 write(4k)
 fsync(file)
 write(4k)
 fsync(file)

Trying to optimize for those 3 things alone can make a huge performance difference overall.

Answer from Josh MacDonald: You have to understand that even using fsync() after every write() makes no guarantees. If the system crashes during either the write or fsync operation, your data may be lost or corrupted. Suppose the fsync() does complete; does your application keep its data in multiple files? If that is the case and you need to write() to multiple files as part of one transaction, you have even greater problems. The only safe and easy way for you to implement some kind of transaction with the traditional file system guarantees is to use rename():

 1. Keep all of your data in a single file.
 2. Periodically write a complete copy of your database to a temporary file.
 3.
Rename the temporary file to the original database name.

(Addition from Nikita Danilov: One can implement something like a phase-tree at user level and use rename to atomically switch the root of the tree. This overcomes the "everything-in-one-file" limitation but has the added complexity of requiring crash recovery.)

Answer from Nikita Danilov: Stop your development for now and wait until the reiser4 filesystem is released; it will have a transaction API exported to userspace. That transaction API would solve all of your problems.

=== Our program needs to access a lot of working files. What is the recommended way to organize files to get the best results out of ReiserFS? Should all the files be placed in a single directory, or should the files be spread across a directory tree to limit the number of files per directory? Can you also summarize the relevant caching and locking effects? === Traditional file systems typically have poor performance when there are many files in a single directory, but not [[ReiserFS]]. These other file systems perform poorly because they use a linear search algorithm to find and replace entries in a directory. This means that the file system must scan, on average, half the blocks of a directory for every access. Typically, applications are required to work around this problem by manually structuring a tree of directories, allowing each individual directory to remain limited in size. For example, see how the Squid web proxy stores a large collection of files. ReiserFS does not have this problem because it uses an internal tree to store all directories and file metadata. Directory operations remain efficient even for very large directories, so you can write your application free from this performance concern. However, there are several issues that complicate this matter: namely locking and locality.
The Linux VFS currently imposes locking restrictions that serialize many operations on directories, so if concurrent processes or threads will access the collection of files then you may be better off using multiple directories. [[Reiser4]] will improve upon this restriction, although it is still under development. ReiserFS attempts to store all of the files in a directory, along with the directory itself, in nearby locations on disk. An application may exploit this spatial locality if it can predict which files will be accessed with temporal locality. You may be better of using multiple directories to store your files if you can predict that many files within a directory will be accessed at the same time. To summarize, ReiserFS supports efficient access to large directories and most traditional file systems do not. However, locking and locality issues may guide your decision to use manually structured directory trees instead, at least until ReiserFS exports control over packing locality to users, and improves its locking. [[category:ReiserFS]] [[category:Reiser4]] fcb28ae6565df4c5ad47c8e66639c41b4281ede4 1442 1441 2009-06-27T01:46:22Z Chris goe 2 -> http://www.win.tue.nl/~aeb/partitions/partition_types.html This FAQ is very [[ReiserFS]] centric and often a bit dated. The [[Reiser4]] filesystem is mentioned as ''upcoming''. Be sure to search the [[mailinglists|mailing list archives]] and help update this FAQ - Thanks! __TOC__ === What are the specs for ReiserFS: maximum number of files, of files a directory can have, of sub-dirs in a dir, of links to a file, maximum file size, maximum filesystem size, etc.? === Specifications for [[ReiserFS]]: {|cellpadding="5" cellspacing="0" border="1" | '''property''' || '''3.5''' || '''3.6''' |- | max number of files || 232-3 => 4 Gi - 3 || 232-3 => 4 Gi-3 |- | max number files a dir can have || 518701895 (but in practice this value is limited by hash function. 
r5 hash allows about 1 200 000 file names without collisions) || 232 - 4 => 4 Gi - 4 (but in practice this value is limited by hash function. r5 hash allows about 1 200 000 file names without collisions) |- | max file size || 231-1 => 2 Gi-1 || 260 - bytes => 1 Ei, but page cache limits this to 8 Ti on architectures with 32 bit int |- | max number links to a file || 216 => 64 Ki || 232 => 4 Gi |- | max filesystem size || 232 (4K) blocks => 16 Ti || 232 (4K) blocks => 16 Ti |} ReiserFS does '''meta-data journaling''', enabling fast crash recovery without the expense of full '''data journaling'''. There [ftp://ftp.suse.com/pub/people/mason/patches/intermezzo-alpha/ were] separate [http://marc.info/?l=reiserfs-devel&m=100895310422415&w=2 patches from Chris Mason] that implement full data journaling for ReiserFS for Linux 2.4.16. '''Note''': Full data journaling is considered by many to be a good way to achieve file data integrity across system crashes. However, although file data may appear to be consistent from the kernel point of view, since there is no API exported to the userspace to control transactions, we may end-up in a situation where the application makes two write requests (as part of one logical transaction) but only one of these gets journaled before the system crashes. From the application point of view, we may then end up with inconsistent data in the file. Such issues should be addressed with the upcoming [[Reiser4]]. Such an API will be exported to userspace and all programs that need transactions will be able to use it. === Mount fails after reiserfsck --rebuild-tree failure === When [[reiserfsck]] --rebuild-tree is run, the first thing it does is to set the root inode value to -1. This makes the filesystem unmountable. (So, if [[reiserfsck]] will fail later on, because it contains serious errors, this filesystem could not be mounted.) 
Therefore once [[reiserfsck]] --rebuild-tree have failed for one of your filesystems, mounting of this partition is disabled. To correct the error you must check if you are have the latest [[Reiser4progs|reiserfsprogs]] package installed. If that fails, please send a bug report to our [[mailinglists|mailing list]] and be ready to answer our questions. === Why is the execution time for a <tt>find . -type f | xargs cat {} \;</tt> command much longer when using ReiserFS than for the same command when using ext2? === This effect is observed if the measured file set was produced by untarring some archive created not from a ReiserFS partition (or by copying files from a non-ReiserFS partition or by running a program that writes a bunch of files in some order). This is because the <tt>readdir()</tt> operation performed on the ReiserFS partition returns filenames not in the original write order but rather in some hash order (dependant on the hash function used). Thus when reading files' contents, the hard drive heads must move when going from one file to another. If you want ReiserFS to outperform any other filesystem in your setup here is one solution: Copy the entire directory that you are not satisfied with to the same partition but with a different name (use <tt>cp -a</tt>), then remove the old directory and rename the new one with the old name. If the partition does not have enough space available, another approach is to <tt>tar(1)</tt> up the whole partition, clear it, and then untar the previously saved data. === Is quota-support built-in in the vanilla 2.4 kernels for ReiserFS? === No, quota support for Linux kernels for the 2.4 branch are bundled separately and were available once at [ftp://ftp.suse.com/pub/people/mason/patches/reiserfs/quota-2.4/ at SuSE] (gone) by Chris Mason, they are still [http://gd.tuwien.ac.at/utils/fs/reiserfs/quota-patches/ mirrored at TU-Wien]. 
These patches were not included in the 2.4 kernel branch because they implement a new quota format and also need new quota code, which is too big a change for the 2.4 series of kernels. Various Linux distribution vendors (e.g. [http://www.suse.com SuSE]) do ship reiserfs-quota enabled kernels, though.

=== I am getting some errors in my kernel logs that I do not know how to interpret ===

Messages like:

 vs-13070: reiserfs_read_inode2: i/o failure occurred trying to find stat data of [1718696 1718710 0x0 SD]
 zam-7001: io error in reiserfs_find_entry

most likely accompanied by samples like those below, are definite signs of hard disk problems (bad sectors):

 hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
 hda: dma_intr: error=0x40 { UncorrectableError }, LBAsect=6599945, sector=4286584
 end_request: I/O error, dev 03:03 (hda), sector 4286584

or

 scsi0: ERROR on channel 0, id 1, lun 0, CDB: Read (10) 00 00 01 ee 60 00 00 08 00
 Current sd 08:00: sense key Medium Error

or

 I/O error: dev 08:21, sector 65704

Messages about <tt>"access beyond end of device"</tt> can have many different causes, from not rebooting after fdisk requested it, to unfinished resizes, to data corruption.

The following messages mean you have a noisy IDE cable, or one of too low a quality for the chosen UDMA mode. Replace the cable with a better one, or choose a slower UDMA mode:

 hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
 hda: dma_intr: error=0x84 { DriveStatusError BadCRC }
 hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
 hda: dma_intr: error=0x84 { DriveStatusError BadCRC }

If you see any message from [[ReiserFS]] that you cannot interpret and there is nothing similar to the messages above around it, [[mailinglists|mail the message to us]] and we will explain it to you.

=== Will ReiserFS implement streams, extended attributes, etc.? ===

[[FAQ/streams|Here]] is the one page answer.

=== Reiserfs appears to be very slow while the RAID is resyncing.
Mounting takes several minutes. Once mounted, an <tt>ls(1)</tt> in the mounted directory hangs. Forever. Once the RAID is synced, things appear to work pretty well. How can that be fixed? ===

First of all, a patch that makes mounting faster has been included in the Linux kernel since 2.4.19. You can grab the patch for earlier kernels [http://gd.tuwien.ac.at/utils/fs/reiserfs/reiserfs-for-2.5/2.5.4.pending/07-reiserfs-bitmap-journal-read-ahead.diff here].

Also, RAID drivers have '''minimal guaranteed''' and '''maximal possible''' RAID rebuild bandwidth usage. These values are controlled through the <tt>/proc/sys/dev/raid/speed_limit_min</tt> and <tt>/proc/sys/dev/raid/speed_limit_max</tt> sysctl variables (values are in 100 KiB/s). The RAID logic cannot always tell whether the disk subsystem is busy at a given time. When it thinks the disk subsystem is idle, it tries to rebuild the RAID array at the <tt>speed_limit_max</tt> speed, which defaults to 100 MB per second. Decrease this value to something more suitable (a bit of experimentation might be needed).

=== I get attempt to read past the end of the partition error messages; is ReiserFS corrupted? ===

You changed your partition sizes, and then ran [[mkreiserfs]] before rebooting. The kernel does not change its belief about the partition sizes until reboot time. (This is fixable, but nobody has fixed it as of Dec. 2001.) [[mkreiserfs]] therefore created a filesystem with a wrong notion of how large its partition is. The filesystem's notion of the partition boundaries will last past reboot even though the kernel's notion will change. So yes, it is corrupted. Some other kinds of metadata breakage can also lead to such messages.

=== Can I use VMware with ReiserFS? ===

VMware was tested on [http://www.suse.com/ SuSE Linux] with [http://support.microsoft.com/gp/lifean18 Windows98] Guest OS on a [[ReiserFS]] partition.
There's one trick at the beginning: the following line was added to the VMware config file:

 host.FSSupportLocking1 = 0x52654973 # (0x52654973 == *(u32 *) "ReIs")

Thanks to [mailto:gkade@bigbrother.net Gregory K. Ade] for this hint.

=== How do I install Debian potato with ReiserFS as root partition? ===

[[FAQ/potato_part|Here]] are instructions by [mailto:LeBlanc@mcc.ac.uk Dr. A.V. Le Blanc].

=== Starting with linux kernel v2.4.21 I cannot mount my FS anymore. Why? ===

Special sanity checks were added to the kernel code to prohibit mounting of filesystems that are bigger than the underlying block device. If you now see this message on mount:

 Filesystem on xx:yy cannot be mounted because it is bigger than the device

you may need to run fsck or increase the size of your LVM partition. Or maybe you forgot to reboot after fdisk told you to. If you do not use LVM, that usually means you need to run <tt>[[reiserfsck]] --rebuild-sb</tt> on your filesystem and agree to change its default size to the proposed one.

=== Is it ok to use ReiserFS on a small storage device, e.g. a 16MB NAND flash block device? ===

[[FAQ/small_blocks|Here]] are instructions.

=== How do I change root from ext2 to ReiserFS without loss of data? ===

[[FAQ/change_fs|Here]] are instructions.

=== <tt>mount: /dev/hda5 has wrong major or minor number</tt> - what does that mean? ===

The kernel does not know anything about [[ReiserFS]]; it is neither compiled in nor available as a module.

=== Will it be possible to read/write ReiserFS partitions created now with future versions of ReiserFS? ===

Yes. [[ReiserFS]]-3.6.x (Linux-2.4.x) works with both the old (3.5) and the new (3.6) formats. ReiserFS-3.5.x (Linux-2.2.x) can only work with the old (3.5) disk format. There is no way to convert the new (3.6) disk format to the old (3.5), but the old (3.5) format can be converted to the new one (3.6) with the <tt>-o conv</tt> [[mount|mount option]].

=== The ReiserFS module doesn't insert properly - why?
===

After applying the patch, ''recompile'' the whole kernel including the modules target, reboot, then try to insert the module.

=== Can I use ReiserFS with the software RAID? ===

Yes, for all RAID levels using any Linux >= 2.4.1, but '''DO NOT''' use RAID with Linux 2.2.x. Our journaling and their RAID code step on each other in the buffering code. Also, mirroring is '''not''' safe in the 2.2.x kernels because online mirror rebuilds in 2.2.x break the write ordering requirements for the log. If you crash in the middle of an online rebuild, your meta-data may be corrupted. The only RAID level that is safe with [[ReiserFS]] in the 2.2.x kernels is the striping/concatenation level.

=== Can I use ReiserFS with 3ware RAID? ===

Yes, but you need to use Linux 2.2.19 or later for reasons other than [[ReiserFS]]. Also, if you encounter problems, be suspicious that the bug might not be in ReiserFS. See these [http://web.archive.org/web/20030415160519/http://www.3ware.com/support/raid5techbulletin.shtml special instructions] (archive.org).

=== Why do things freeze on my IDE hard drive for annoying amounts of time? ===

Because when large writes are scheduled all at once, reads can starve. A fix for this is evolving; the later your ReiserFS patch, the better we handle this.

=== <tt>du(1)</tt> says ReiserFS makes space efficiency worse. ===

Use <tt>df(1)</tt>, not <tt>du(1)</tt>, or use the ''raw'' option of <tt>du(1)</tt> if it is supported. <tt>st_blocks</tt> summed up is less accurate than <tt>st_size</tt> for [[ReiserFS]] because we pack tails, and <tt>st_blocks</tt> rounds numbers up.

=== <tt>mkreiserfs(8)</tt> fails after repartitioning ===

The kernel requires you to reboot after repartitioning (for all filesystems). We intend to fix that.

=== Performance is poor, and my disk at 96% full still has free space. ===

Once a disk drive gets more than 85% full, performance starts to suffer unless a repacker is used (which isn't implemented yet).
You can probably get away with 92%, but if performance is valued you are making a mistake to keep it any fuller. This is true for almost all filesystems. [[ReiserFS]], because it packs tails together, packs more data into a given percentage used, but it is still subject to the rules for the maximum recommended percentage used. If you fill the whole disk with one copy and then mount it read-only, you can fully pack it without problem. Please be sure that you copy it from (or <tt>tar</tt> it from) a reiserfs partition so that files are created in reiserfs <tt>readdir()</tt> order, as this will improve performance.

=== Why do I get a signal 11 when compiling the kernel using ReiserFS and not ext2? ===

Your CPU is overheating and/or you have [http://www.bitwizard.nl/sig11/ bad RAM].

=== But it doesn't happen with ext2? ===

ext2 uses less heat-sensitive gates in the CPU :-) Seriously, ext2 and [[ReiserFS]] contain random differences, and overheating and bad RAM have random sensitivities. ([http://www.bitwizard.nl/sig11/ Signal 11] is not due to ReiserFS. One user had a cable blocking the fan; it did not affect ext2, but it wasn't until he fixed the cable-fan problem that ReiserFS worked.)

=== Can I use ReiserFS on other architectures than i386? ===

Yes. Starting from the Linux [http://kernel.org/pub/linux/kernel/v2.4/ChangeLog-2.4.13 kernel 2.4.13], ReiserFS can run on any Linux-supported architecture.

=== I need a program which will help me in rebuilding/recreating my partition table. ===

[http://brzitwa.de/mb/gpart/ gpart] is a utility that handles ext2, FAT, Linux swap, HPFS, NTFS, FreeBSD and Solaris/x86 disklabels, Minix, and ReiserFS. It prints a proposed content for the primary partition table and is well documented.

=== What partition type should I use for ReiserFS? ===

[http://www.win.tue.nl/~aeb/partitions/partition_types.html Linux native filesystem] (83)

=== Can I use 32GB+ IDE Hard Drives with ReiserFS? ===

Yes, if you use Linux kernel 2.4 and up.
=== What about resizing ReiserFS? ===

Please follow this link.

=== What should I put into the fifth (aka dump, fs_freq) and the sixth (aka pass, fs_passno) fields of /etc/fstab for ReiserFS filesystems? ===

0 0

=== Why are ReiserFS filesystems not fscked on reboot after a crash? ===

Because ReiserFS provides journalling of meta-data. After a crash, the consistency of a filesystem is restored by replaying the transaction log.

=== Can I interactively repair a filesystem that was corrupted (due to an internal bug in the kernel or to a hardware fault)? ===

man [[reiserfsck]]

=== Can I use "dump" and "restore" with ReiserFS? Any caveats? ===

No. dump uses knowledge of the internal structure of ext2 and works together with restore, which also uses ext2-specific knowledge, to back up ext2 files. dump and restore are specific to ext2 and will not work with ReiserFS. To back up ReiserFS files use tar, which is universal and can be applied to almost any reasonable Linux filesystem. It is well known among system administrators that dump is more complete than unix tar, and that there is quite a list of things that unix tar will fail to back up properly. This is not true of GNU tar, which is quite complete. Basically, the only real disadvantage of GNU tar compared to dump is speed. Unfortunately, because it shares the same name as unix tar, people are reluctant to believe this. (Yes, the GNU version has incremental backups, etc.) We will performance-optimize ReiserFS backups for you (and the rest of the world) for $30k, which is not a lot if you are a large site spending a few hundred thousand on equipment for backups.

=== Does ReiserFS support snapshots? ===

No, but you can create ReiserFS on top of an LVM logical volume and use LVM's snapshot capabilities.

=== Can I check reiserfs filesystems for errors without unmounting them? ===

[[reiserfsck]] in checking mode may run over filesystems mounted read-only. There is no official way to fix mounted filesystems, though.
You MUST completely unmount your filesystem in order to have it fixed. If you have LVM, you can check the consistency of filesystems mounted read-write; here is the script contributed by Andreas Dilger:

=== What ReiserFS mount options should I use to get the performance winner for a mail server? ===

Craig Sanders answered in detail: "By the time I got around to running bonnie, the postmark and postal benchmarks had convinced me that notail was essential.

host system:
 - Debian GNU/Linux (of course :)
 - Linux kernel 2.4.2 with latest 20010305 ReiserFS patch
 - dual P3-866 (256K cache)
 - 512MB RAM
 - Adaptec 19160 SCSI Controller

external drive box:
 - Domex 8230u RAID controller, 32MB battery-backed cache.
 - 6 x 18GB IBM DDYS-T18350M drives

For this particular hardware I was using, reiserfs/notail on RAID5 was the clear performance winner for a mail server with lots of synced random I/O."

=== Does using ReiserFS mean I can just press the power off button without running "shutdown" or "init 0," etc? Does it mean there is no risk of data loss? ===

No, definitely not. As of now, ReiserFS only provides meta-data journaling--that is, it records which files have been created or opened, whether they have had their size changed, or where they have been relocated. It guarantees that the structure of the internal ReiserFS tree will be correct, thereby allowing you, after an unclean shutdown, to start back up without having to run fsck on all the files that have not been changed. Data in files that were being used at the time of the crash could have been corrupted. This is usual for most filesystems. Data journaling filesystems guarantee that no garbage will be written into a file, but they don't guarantee that a file update will be completed. (Only reiser4 guarantees that filesystem operations are performed as atomic operations, and provides atomic transaction functionality.) ReiserFS V3 does not guarantee that the file contents themselves are uncorrupted, nor that no data is lost.
Moreover, even if all of your system is on ReiserFS, many system components (like daemons, database managers, etc.) require the shutdown procedure for proper functioning. However, there is a separate implementation of data logging that will soon go into the mainstream kernel. You should be able to get it from ftp.suse.com/pub/people/mason/patches/data-logging

=== How does ReiserFS support bad block handling? ===

See here.

=== I have a motherboard with VIA MVP3 chipset and experience ReiserFS problems. ===

William Oster <woster73@yahoo.com> answers: If you are using a motherboard with a VIA MVP3 chipset, you may have ReiserFS problems caused by the way your kernel is configured for the so-called "pci quirks". My experience is with kernels 2.2.18 and 2.2.19, but it may affect the 2.4.x series too if you are using the MVP3 chipset (popular in socket 7 type motherboards, such as those used by the AMD K6 and classic Pentium). I've confirmed this problem with several motherboards using the VIA MVP3 chipset, ReiserFS 3.5.29 to 3.5.32, and NCR 53c8xx SCSI. But please note: it probably affects any controller which uses DMA and PCI bus mastering. Problems which I was inclined to attribute to ReiserFS were actually problems with this kernel [mis]configuration. If you fit this profile, DO NOT enable the "pci quirks" configuration option in the /usr/src/linux/.config file. Although the Linux documentation suggests that this option can be enabled if in doubt, DO NOT enable it. It was never intended for the VIA MVP3 chipset anyway. It affects the way DMA is handled, and the combination with ReiserFS (and possibly NCR SCSI) can cause random disk corruption which will eventually result in ReiserFS and/or SCSI errors. Evidently ReiserFS exercises the DMA and SCSI bus very thoroughly. The problems seem less likely under the ext2 filesystem. Check your /usr/src/linux/.config file.
You are SAFE from this problem if you find this line:

 # CONFIG_PCI_QUIRKS is not set

Any other setting could be dangerous to MVP3 chipset ReiserFS users, especially when using PCI bus mastering controllers such as the NCR 53c8xx series. Re-configure your kernel to disable the "pci quirks" option, then make dep, rebuild, and reinstall.

=== I am having extensive problems using ReiserFS; it seems to have bugs all over the place. I'm not compiling with a buggy compiler. What is happening? How can this be stable? ===

You have hardware problems. Really, you do. Even if the bugs don't show up with ext2, you have hardware problems. (See the FAQ question about ReiserFS running 3C hotter than ext2.) Most SuSE users use ReiserFS. Obscure bugs probably still exist; but if you find bugs as easily as when using Windows, you have bad RAM, a bad CPU, a bad cable, bad cooling, a VIA chipset with PCI quirks turned on, or other hardware or other software layer bugs. ReiserFS is stable. You can be sure that if the bugs are encountered easily and commonly with normal usage patterns, it is not us. This does not mean that the next release won't somehow break something though :-/..... Real bug reports are, at the time of writing, outnumbered 10 to 1 by hardware bugs that trigger error messages. We are working on making our error messages better at catching hardware bugs and identifying them as such. There is only so far we can go, though, in runtime consistency checking without serious speed reductions. We don't release software unless it goes through extensive testing; so if you don't think that our testing could have missed the bug, it is probably hardware.

=== How can I put a label (like allowed by the <tt>-L</tt> option of <tt>mkfs.ext2</tt>) on a ReiserFS instance? ===

Currently, this feature is only implemented for the [[ReiserFS]] v3.6 disk format. Adding it to the v3.5 disk format would break the existing format, and there is not enough free space in the superblock.
You can set a label (and UUID) with a recent [[Reiser4progs|reiserfsprogs]] package on a [[ReiserFS]] v3.6 filesystem using the <tt>-l</tt> switch (<tt>-u</tt> for UUID) of the [[reiserfstune]] (for existing partitions) or [[mkreiserfs]] (for partitions being created) commands. Support for labels and UUIDs was integrated into [[Reiser4progs|reiserfsprogs]] starting from version 3.x.1a.

=== Why, when I'm working on files (i.e. having open files) on my laptop, does ReiserFS access the disk every 5 seconds? This effectively prevents the disk from spinning down, i.e. APM modes from taking over, even when I'm not writing anything. ===

Brent Graveland <bgraveland@hyperchip.com> answers: It's the atime update. Every time you run sync, the sync program's atime is updated. The next sync writes this atime update, then sync gets updated again...

=== RedHat does not unmount / with ReiserFS on halt. How to fix it? ===

RedHat users kindly provided these patches (not tested by us): rc.sysinit.patch and halt.patch. Note that if you have RedHat Linux 7.2 or later, you do not need these patches.

=== How do I run programs from the reiserfsprogs package on encrypted devices? ===

In order to access such encrypted entities, you need to use the losetup tool to bind your entity to a loop device.

=== Are there any recommendations for or against particular hard drive manufacturers for use with reiserfs? ===

There is basically no preference; the general rule that a faster drive with a lower seek time is better applies as always. On the other hand, almost every hard drive manufacturer has a "widely known" broken series of hard drives. The most recent example is IBM's "Deskstar" series of disks, especially the DTLA models produced in Hungary in 2000-2001. These are known to fail very often, to the point that you probably don't want to use them even if you have already paid for them. Other Deskstar drives also seem to be a poor choice.
IBM released a note that Deskstar drives should not run for more than 8 hours/day on average. These drives are also known to be very sensitive to temperature conditions and to fail on overheating. A class action lawsuit against IBM over that drive series is in progress.

=== I am using RedHat 7.0 with gcc 2.96; why does ReiserFS seem unstable with it? ===

Use the most recent version of RedHat (gcc 2.96-85 or later with RedHat 7.2, although 7.1 is also okay for ReiserFS). The choice of an unstable, unreleased version of gcc 2.96 by RedHat as the default gcc was a Slashdot controversy. gcc 2.96 on RedHat 7.0 was unstable, and ReiserFS was one of the things that would fail with it. There are two gcc 2.96 builds: 2.96 and 2.96-85. 2.96-85 works for ReiserFS, and the other (the one on RedHat 7.0) surely does not. Read the Linux kernel instructions about what compiler to use. The solution to code not working on broken compilers is the one RedHat has taken: fix the compiler. They fixed the compiler and thereby allowed the correctly compiled ReiserFS to work.

=== In my program I am using fsync(2) calls after each write to the file to guarantee integrity of my file data, and this is very slow. How can I improve the performance? ===

Answer from Chris Mason: The main thing to remember is that fsyncs introduce a bunch of disk writes and force the FS to wait on the buffers. The key to keeping performance up is to make it easy for the FS to do as much as possible before the fsync call. So, if your application modifies 3 files, and you want to make sure all 3 changes are safely on disk:

 write(file1)
 write(file2)
 write(file3)
 fsync(file1)
 fsync(file2)
 fsync(file3)

is much faster than:

 write(file1)
 fsync(file1)
 write(file2)
 fsync(file2)
 write(file3)
 fsync(file3)

It is also faster to write over existing bytes in the file than it is to append new bytes onto the end of a file.
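The batching pattern above can be sketched as follows (a minimal illustration, not ReiserFS-specific; the file names and payload are made up, and error handling is omitted for brevity):

```python
# Batch the writes, then batch the fsyncs: the filesystem can then
# commit all of the dirty buffers together instead of stalling on a
# synchronous flush once per file.
import os

def write_then_sync(paths, payload=b"data\n"):
    """Write payload to every path first, then fsync each descriptor."""
    fds = [os.open(p, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
           for p in paths]
    try:
        for fd in fds:      # all writes first ...
            os.write(fd, payload)
        for fd in fds:      # ... then all fsyncs
            os.fsync(fd)
    finally:
        for fd in fds:
            os.close(fd)

write_then_sync(["file1.tmp", "file2.tmp", "file3.tmp"])
```

The slow pattern simply interleaves the two loops, paying a full synchronous commit per file.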
When you overwrite existing bytes in the file, you don't have to commit new metadata to disk on fsync(); the FS can just write the data blocks. This means fewer seeks. The more you write to a single file before calling fsync, the faster overall things will run:

 write(8k)
 fsync(file)

is much faster than:

 write(4k)
 fsync(file)
 write(4k)
 fsync(file)

Trying to optimize for those 3 things alone can make a huge performance difference overall.

Answer from Josh MacDonald: You have to understand that even using fsync() after every write() makes no guarantees. If the system crashes during either the write or the fsync operation, your data may be lost or corrupted. Suppose the fsync() does complete: does your application keep its data in multiple files? If that is the case and you need to write() to multiple files as part of a transaction, you have even greater problems. The only safe and easy way for you to implement some kind of transaction with the traditional file system guarantees is to use rename():

# Keep all of your data in a single file.
# Periodically write a complete copy of your database to a temporary file.
# Rename the temporary file to the original database name.

(Addition from Nikita Danilov: One can implement something like a phase-tree at user level and use rename to atomically switch the root of the tree. This overcomes the "everything-in-one-file" limitation but has the added complexity of requiring crash recovery.)

Answer from Nikita Danilov: Stop your development for now and wait until the Reiser4 filesystem is released; it will have a transaction API exported to userspace. That transaction API would solve all of your problems.

== Our program needs to access a lot of working files. What is the recommended way to organize files to get the best results out of ReiserFS? Should all the files be placed in a single directory, or should the files be spread across a directory tree to limit the number of files per directory?
Can you also summarize the relevant caching and locking effects? ==

Traditional file systems typically have poor performance when there are many files in a single directory, but [[ReiserFS]] does not. These other file systems perform poorly because they use a linear search algorithm to find and replace entries in a directory, which means the file system must scan, on average, half the blocks of a directory for every access. Typically, applications are required to work around this problem by manually structuring a tree of directories, allowing each individual directory to remain limited in size. For example, see how the Squid web proxy stores a large collection of files. ReiserFS does not have this problem because it uses an internal tree to store all directories and file metadata. Directory operations remain efficient even for very large directories, so you can write your application free from this performance concern.

However, there are several issues that complicate this matter, namely locking and locality. The Linux VFS currently imposes locking restrictions that serialize many operations on directories, so if concurrent processes or threads will access the collection of files, then you may be better off using multiple directories. [[Reiser4]] will improve upon this restriction, although it is still under development. ReiserFS attempts to store all of the files in a directory, along with the directory itself, in nearby locations on disk. An application may exploit this spatial locality if it can predict which files will be accessed with temporal locality. You may be better off using multiple directories to store your files if you can predict that many files within a directory will be accessed at the same time.

To summarize, ReiserFS supports efficient access to large directories, which most traditional file systems do not.
However, locking and locality issues may guide your decision to use manually structured directory trees instead, at least until ReiserFS exports control over packing locality to users, and improves its locking.

[[category:ReiserFS]] [[category:Reiser4]]
'''Note''': Full data journaling is considered by many to be a good way to achieve file data integrity across system crashes. However, although file data may appear to be consistent from the kernel point of view, since there is no API exported to the userspace to control transactions, we may end-up in a situation where the application makes two write requests (as part of one logical transaction) but only one of these gets journaled before the system crashes. From the application point of view, we may then end up with inconsistent data in the file. Such issues should be addressed with the upcoming [[Reiser4]]. Such an API will be exported to userspace and all programs that need transactions will be able to use it. === Mount fails after reiserfsck --rebuild-tree failure === When [[reiserfsck]] --rebuild-tree is run, the first thing it does is to set the root inode value to -1. This makes the filesystem unmountable. (So, if [[reiserfsck]] will fail later on, because it contains serious errors, this filesystem could not be mounted.) Therefore once [[reiserfsck]] --rebuild-tree have failed for one of your filesystems, mounting of this partition is disabled. To correct the error you must check if you are have the latest [[Reiser4progs|reiserfsprogs]] package installed. If that fails, please send a bug report to our [[mailinglists|mailing list]] and be ready to answer our questions. === Why is the execution time for a <tt>find . -type f | xargs cat {} \;</tt> command much longer when using ReiserFS than for the same command when using ext2? === This effect is observed if the measured file set was produced by untarring some archive created not from a ReiserFS partition (or by copying files from a non-ReiserFS partition or by running a program that writes a bunch of files in some order). 
This is because the <tt>readdir()</tt> operation performed on the ReiserFS partition returns filenames not in the original write order but rather in some hash order (dependant on the hash function used). Thus when reading files' contents, the hard drive heads must move when going from one file to another. If you want ReiserFS to outperform any other filesystem in your setup here is one solution: Copy the entire directory that you are not satisfied with to the same partition but with a different name (use <tt>cp -a</tt>), then remove the old directory and rename the new one with the old name. If the partition does not have enough space available, another approach is to <tt>tar(1)</tt> up the whole partition, clear it, and then untar the previously saved data. === Is quota-support built-in in the vanilla 2.4 kernels for ReiserFS? === No, quota support for Linux kernels for the 2.4 branch are bundled separately and were available once at [ftp://ftp.suse.com/pub/people/mason/patches/reiserfs/quota-2.4/ at SuSE] (gone) by Chris Mason, they are still [http://gd.tuwien.ac.at/utils/fs/reiserfs/quota-patches/ mirrored at TU-Wien]. The reason these patches were not included into 2.4 kernel branch is because they implement new quota format and need new quota code too, which is too big of a change for 2.4 series of kernels. Various Linux distributions vendors (ie [http://www.suse.com SuSE]) do ship reiserfs-quota enabled kernels, though. 
=== I am getting some errors in my kernel logs, that I do not know how to interpret === Messages like: vs-13070: reiserfs_read_inode2: i/o failure occurred trying to find stat data of [1718696 1718710 0x0 SD]" zam-7001: io error in reiserfs_find_entry most likely accompanied with samples below are definite signs of harddisk problems (bad sectors): hda: dma_intr: status=0x51 { DriveReady SeekComplete Error } hda: dma_intr: error=0x40 { UncorrectableError }, LBAsect=6599945, sector=4286584 end_request: I/O error, dev 03:03 (hda), sector 4286584 or scsi0: ERROR on channel 0, id 1, lun 0, CDB: Read (10) 00 00 01 ee 60 00 00 08 00 Current sd 08:00: sense key Medium Error or I/O error: dev 08:21, sector 65704 Messages about <tt>"access beyond end of device"</tt> may have lots of different reasons starting from not rebooting after fdisk requested it, unfinished resizings, data corruptions. The following messages mean you have a noisy IDE cable, or it is just too low quality for choosen UDMA mode. Try to replace the cable with better one, or choose slower UDMA mode: hda: dma_intr: status=0x51 { DriveReady SeekComplete Error } hda: dma_intr: error=0x84 { DriveStatusError BadCRC } hda: dma_intr: status=0x51 { DriveReady SeekComplete Error } hda: dma_intr: error=0x84 { DriveStatusError BadCRC } If you see any message from [[ReiserFS]] that you cannot interpret and there is nothing similar to messages above around it, [[mailinglists|mail the message to us]] and we will explain it to you. === Will ReiserFS implement streams, extended attributes, etc.? === [[FAQ/streams|Here]] is the one page answer. === Reiserfs appears to be very slow while the RAID is resyncing. Mounting takes several minutes. Once mounted, an <tt>ls(1)</tt> in the mounted directory hangs. Forever. Once the RAID is sync'ed, things appear to work pretty well. How that can be fixed? === First of all we have included a patch that helps mounting the drive faster into linux kernel since 2.4.19. 
You can grab the patch for earlier kernels [http://gd.tuwien.ac.at/utils/fs/reiserfs/reiserfs-for-2.5/2.5.4.pending/07-reiserfs-bitmap-journal-read-ahead.diff here]. Also, RAID drivers have '''minimal guaranteed''' and '''maximal possible''' RAID rebuild bandwidth usage. These values are controlled through the <tt>/proc/sys/dev/raid/speed_limit_min</tt> and <tt>/proc/sys/dev/raid/speed_limit_max</tt> sysctl variables (values are in 100 KiB/s). It seems that the RAID logic cannot always tell whether the disk subsystem is busy at a given time. When it thinks the disk subsystem is idle, it tries to rebuild the RAID array at the <tt>speed_limit_max</tt> speed, which defaults to 100 MB per second. Decrease this value to something more suitable (a bit of experimentation might be needed). === I get <tt>attempt to read past the end of the partition</tt> error messages; is ReiserFS corrupted? === You changed your partition sizes, and then before rebooting ran [[mkreiserfs]]. The kernel does not change its belief in what the partition sizes are until reboot time. (This is fixable, but nobody has fixed it as of Dec. 2001.) [[mkreiserfs]] created a filesystem that has the wrong notion of how large its partition is. The filesystem's notion of what the partition boundaries are will last past reboot even though the kernel's notion will change. So yes, it is corrupted. Other kinds of metadata breakage can also lead to such messages. === Can I use VMware with ReiserFS? === VMware was tested on [http://www.suse.com/ SuSE Linux] with a [http://support.microsoft.com/gp/lifean18 Windows98] guest OS on a [[ReiserFS]] partition. There's one trick at the beginning: the following line was added to the VMware config file: host.FSSupportLocking1 = 0x52654973 # (0x52654973 == *(u32 *) "ReIs") Thanks to [mailto:gkade@bigbrother.net Gregory K. Ade] for this hint. === How do I install Debian potato with ReiserFS as root partition?
=== [[FAQ/potato_part|Here]] are instructions by [mailto:LeBlanc@mcc.ac.uk Dr. A.V. Le Blanc]. === Starting with Linux kernel v2.4.21 I cannot mount my FS anymore. Why? === Special sanity checks were added to the kernel code to prohibit mounting of filesystems that are bigger than the underlying block device. If you now see this message on mount: Filesystem on xx:yy cannot be mounted because it is bigger than the device you may need to run fsck or increase the size of your LVM partition. Or maybe you forgot to reboot after fdisk when it told you to. If you do not use LVM, that usually means you need to run <tt>[[reiserfsck]] --rebuild-sb</tt> on your filesystem and agree to change its recorded size to the proposed one. === Is it OK to use ReiserFS on a small storage device, e.g. a 16MB NAND flash block device? === [[FAQ/small_blocks|Here]] are instructions. === How do I change root from ext2 to ReiserFS without loss of data? === [[FAQ/change_fs|Here]] are instructions. === <tt>mount: /dev/hda5 has wrong major or minor number</tt> - what does that mean? === The kernel does not know anything about [[ReiserFS]]; it is neither compiled in nor available as a module. === Will it be possible to read/write ReiserFS partitions created now with future versions of ReiserFS? === Yes. [[ReiserFS]]-3.6.x (Linux-2.4.x) works with both the old (3.5) and the new (3.6) formats. ReiserFS-3.5.x (Linux-2.2.x) can only work with the old (3.5) disk format. There is no way to convert the new (3.6) disk format to the old (3.5), but the old (3.5) format can be converted to the new one (3.6) with the <tt>-o conv</tt> [[mount|mount option]]. === The ReiserFS module doesn't insert properly - why? === After applying the patch, ''recompile'' the whole kernel including the modules target, reboot, then try to insert the module. === Can I use ReiserFS with software RAID? === Yes, for all RAID levels using any Linux >= 2.4.1, but '''DO NOT''' use RAID with Linux 2.2.x.
Our journaling and their RAID code step on each other in the buffering code. Also, mirroring is '''not''' safe in the 2.2.x kernels because online mirror rebuilds in 2.2.x break the write-ordering requirements for the log. If you crash in the middle of an online rebuild, your meta-data may be corrupted. The only RAID level that is safe with [[ReiserFS]] in the 2.2.x kernels is the striping/concatenation level. === Can I use ReiserFS with 3ware RAID? === Yes, but you need to use Linux 2.2.19 or later for reasons other than [[ReiserFS]]. Also, if you encounter problems, be suspicious that it might not be ReiserFS that has the bug. See these [http://web.archive.org/web/20030415160519/http://www.3ware.com/support/raid5techbulletin.shtml special instructions] (archive.org). === Why do things freeze on my IDE hard drive for annoying amounts of time? === Because when large writes are scheduled all at once, reads can starve. A fix for this is evolving; the later your ReiserFS patch, the better we handle this. === <tt>du(1)</tt> says ReiserFS makes space efficiency worse. === Use <tt>df(1)</tt>, not <tt>du(1)</tt>, or use the ''raw'' option for <tt>du(1)</tt> if it's supported. <tt>st_blocks</tt> summed up is less accurate than <tt>st_size</tt> for [[ReiserFS]] because we pack tails, and <tt>st_blocks</tt> rounds numbers up. === <tt>mkreiserfs(8)</tt> fails after repartitioning === The kernel requires you to reboot after repartitioning (for all filesystems). We intend to fix that. === Performance is poor, and my disk at 96% full still has free space. === Once a disk drive gets more than 85% full, performance starts to suffer unless a repacker is used (which isn't implemented yet). You can probably get away with 92%, but if performance is valued you are making a mistake to keep it any fuller. This is true for almost all filesystems.
[[ReiserFS]], because of our packing tails together, packs more data into a given percentage used, but it is still subject to the rule about the maximum recommended percentage used. If you create the whole disk with one copy and then mount it read-only, then you can fully pack it without problems. Please be sure that you copy it from (or <tt>tar</tt> it from) a ReiserFS partition so that files are created in ReiserFS <tt>readdir()</tt> order, as this will improve performance. === Why do I get a signal 11 when compiling the kernel using ReiserFS and not ext2? === Your CPU is overheating and/or you have [http://www.bitwizard.nl/sig11/ bad RAM]. === But it doesn't happen with ext2? === ext2 uses less heat-sensitive gates in the CPU :-) Seriously, ext2 and [[ReiserFS]] contain random differences, and overheating and bad RAM have random sensitivities. ([http://www.bitwizard.nl/sig11/ Signal 11] is not due to ReiserFS. One user had a cable blocking the fan; it did not affect ext2, but it wasn't until he fixed the cable-fan problem that ReiserFS worked.) === Can I use ReiserFS on architectures other than i386? === Yes; starting with the Linux [http://kernel.org/pub/linux/kernel/v2.4/ChangeLog-2.4.13 kernel 2.4.13], ReiserFS can run on any architecture supported by Linux. === I need a program which will help me in rebuilding/recreating my partition table. === [http://brzitwa.de/mb/gpart/ gpart] is a utility that handles ext2, FAT, Linux swap, HPFS, NTFS, FreeBSD and Solaris/x86 disklabels, Minix, and ReiserFS. It prints a proposed content for the primary partition table and is well documented. === What partition type should I use for ReiserFS? === Linux native filesystem (83). === Can I use 32GB+ IDE hard drives with ReiserFS? === Yes, if you use Linux kernel 2.4 or later. === What about resizing ReiserFS? === Please follow this link. === What should I put into the fifth (aka dump, fs_freq) and the sixth (aka pass, fs_passno) fields of /etc/fstab for ReiserFS filesystems?
=== 0 0 === Why are ReiserFS filesystems not fscked on reboot after a crash? === Because ReiserFS provides journaling of meta-data. After a crash, the consistency of a filesystem is restored by replaying the transaction log. === Can I interactively repair a filesystem that was corrupted (due to an internal bug in the kernel or to a hardware fault)? === man [[reiserfsck]] === Can I use "dump" and "restore" with ReiserFS? Any caveats? === No. dump uses knowledge of the internal structure of ext2 and works together with restore, which also uses ext2-specific knowledge, to back up ext2 files. dump and restore are specific to ext2 and will not work with ReiserFS. To back up ReiserFS files use tar, which is universal and can be applied to almost any reasonable Linux filesystem. It is well known among system administrators that dump is more complete than unix tar, and that there is quite a list of things that unix tar will fail to properly back up. This is not true of Gnu tar, which is quite complete. Basically, the only real disadvantage of Gnu tar compared to dump is speed. Unfortunately, because it shares the same name as unix tar, people are reluctant to believe this. (Yes, the Gnu version has incremental backups, etc.) We will performance-optimize ReiserFS backups for you (and the rest of the world) for $30k, which is not a lot if you are a large site spending a few hundred thousand on equipment for backups. === Does ReiserFS support snapshots? === No, but you can create ReiserFS on top of an LVM logical volume and use LVM snapshot capabilities. === Can I check ReiserFS filesystems for errors without unmounting them? === [[reiserfsck]] in checking mode may be run on filesystems mounted read-only. There is no official way to fix mounted filesystems, though. You MUST completely unmount your filesystem in order to have it fixed.
If you have LVM, you can check the consistency of filesystems mounted read-write; here is the script contributed by Andreas Dilger: === What ReiserFS mount options should I use to get the performance winner for a mail server? === Craig Sanders answered in detail: "By the time I got around to running bonnie, the postmark and postal benchmarks had convinced me that notail was essential. host system: - Debian GNU/Linux (of course :) - Linux kernel 2.4.2 with latest 20010305 ReiserFS patch - dual P3-866 (256K cache) - 512MB RAM - Adaptec 19160 SCSI Controller external drive box: - Domex 8230u RAID controller, 32MB battery-backed cache. - 6 x 18GB IBM DDYS-T18350M drives for this particular hardware I was using, reiserfs/notail on RAID5 was the clear performance winner for a mail server with lots of synced random I/O." === Does using ReiserFS mean I can just press the power-off button without running "shutdown" or "init 0," etc? Does it mean there is no risk of data loss? === No, definitely not. As of now, ReiserFS only provides meta-data journaling--that is, it records which files have been created or opened, whether they have had their size changed, or where they have been relocated. It guarantees that the structure of the internal ReiserFS tree will be correct, thereby allowing you after an unclean shutdown to start back up without having to run fsck on all the files that have not been changed. Data in files that were being used at the time of the crash could have been corrupted. This is typical of most filesystems. Data-journaling filesystems guarantee that there will be no garbage written into a file, but they don't guarantee that a file update will be completed. (Only reiser4 guarantees that filesystem operations are performed as atomic operations, and provides atomic transaction functionality.) ReiserFS V3 does not guarantee that the file contents themselves are uncorrupted nor that no data is lost.
Moreover, even given that all of your system is on ReiserFS, many system components (like daemons, database managers, etc.) require the shutdown procedure for proper functioning. However, there is a separate implementation of data logging that will soon go into the mainstream kernel. You should be able to get it from ftp.suse.com/pub/people/mason/patches/data-logging === How does ReiserFS support bad block handling? === See here. === I have a motherboard with a VIA MVP3 chipset and experience ReiserFS problems. === William Oster <woster73@yahoo.com> answers: If you are using a motherboard with a VIA MVP3 chipset, you may have ReiserFS problems caused by the way your kernel is configured for the so-called "pci quirks". My experience is with kernels 2.2.18 and 2.2.19, but it may affect the 2.4.x series too if you are using the MVP3 chipset (popular in socket 7 type motherboards, such as those used by the AMD K6 and classic Pentium). I've confirmed this problem with several motherboards using the VIA MVP3 chipset, ReiserFS 3.5.29 to 3.5.32, and NCR 53c8xx SCSI. But please note: it probably affects any controller which uses DMA and PCI bus mastering. Problems which I was inclined to attribute to ReiserFS were actually problems with this kernel [mis]configuration. If you fit this profile, DO NOT enable the "pci quirks" configuration option in the /usr/src/linux/.config file. Although the Linux documentation suggests that this option can be enabled if in doubt, DO NOT enable it. It was never intended for the VIA MVP3 chipset anyway. It affects the way DMA is handled, and the combination with ReiserFS (and possibly NCR SCSI) can cause random disk corruption which eventually will result in ReiserFS and/or SCSI errors. Evidently ReiserFS exercises the DMA and SCSI bus very thoroughly; the problems seem not to be as likely under the ext2 filesystem. Check your /usr/src/linux/.config file.
You are SAFE from this problem if you find this line: # CONFIG_PCI_QUIRKS is not set Any other setting could be dangerous to MVP3 chipset ReiserFS users, especially when using PCI bus mastering controllers such as the NCR 53c8xx series. Re-configure your kernel to disable the "pci quirks" option, then make dep, rebuild, and reinstall. === I am having extensive problems using ReiserFS; it seems to have bugs all over the place. I'm not compiling with a buggy compiler. What is happening? How can this be stable? === You have hardware problems. Really, you do. Even if the bugs don't show up with ext2, you have hardware problems. (See the FAQ question about ReiserFS running 3°C hotter than ext2.) Most SuSE users use ReiserFS. Obscure bugs probably still exist; but if you find bugs as easily as using Windows, you have bad RAM, a bad CPU, a bad cable, bad cooling, a VIA chipset with PCI quirks turned on, or other hardware or other software-layer bugs. ReiserFS is stable. You can be sure that if the bugs are encountered easily and commonly with normal usage patterns, it is not us. This does not mean that the next release won't somehow break something though :-/..... Real bug reports are, at the time of writing, outnumbered 10 to 1 by hardware bugs that trigger error messages. We are working on making our error messages better at catching hardware bugs and identifying them as such. There is only so far we can go, though, in runtime consistency checking without serious speed reductions. We don't release software unless it goes through extensive testing; so if you don't think that our testing could have missed the bug, it is probably hardware. === How can I put a label (like allowed by the <tt>-L</tt> option of <tt>mkfs.ext2</tt>) on a ReiserFS instance? === Currently, this feature is only implemented for the [[ReiserFS]] v3.6 disk format. Adding it to the v3.5 disk format would break the existing disk format, and there is not enough free space in the superblock.
You can set a label (and UUID) with a recent [[Reiser4progs|reiserfsprogs]] package on a [[ReiserFS]] v3.6 filesystem using the <tt>-l</tt> switch (<tt>-u</tt> for UUID) of the [[reiserfstune]] (for existing partitions) or [[mkreiserfs]] (for partitions being created) commands. Support for labels and UUIDs was integrated into [[Reiser4progs|reiserfsprogs]] starting from version 3.x.1a. === Why, when I'm working on files (i.e. having open files) on my laptop, does ReiserFS access the disk every 5 seconds? This effectively prevents the disk from spinning down, i.e. prevents APM modes from taking over, even when I'm not writing anything. === Brent Graveland <bgraveland@hyperchip.com> answers: It's the atime update. Every time you run sync, the sync program's atime is updated. The next sync writes this atime update, then sync gets updated again... === RedHat does not unmount / with ReiserFS on halt. How do I fix it? === RedHat users kindly provided these patches (not tested by us): rc.sysinit.patch and halt.patch. Note that if you have RedHat Linux 7.2 or later, you do not need these patches. === How do I run programs from the reiserfsprogs package on encrypted devices? === In order to access such encrypted entities you need to use the losetup tool to bind your entity to a loop device. === Are there any recommendations for or against particular hard drive manufacturers for use with ReiserFS? === There is basically no preference; the general "the faster the drive and the lower the seek time, the better" rule applies as always. On the other hand, almost every hard drive manufacturer has a "widely known" broken series of hard drives. The most recent example is IBM's "Deskstar" series of disks, especially the DTLA models produced in Hungary in 2000-2001. These are known to fail very often, to the point that you probably don't want to use them even if you already paid for them. Other Deskstar drives also seem to be a poor choice.
IBM released a note that Deskstar drives should not run for more than 8 hours/day on average. These drives are also known to be very sensitive to temperature conditions and are known to fail on overheating. A class action lawsuit against IBM over that drive series is in progress. === I am using RedHat 7.0 with gcc 2.96; why does ReiserFS seem unstable with it? === Use the most recent version of RedHat (gcc 2.96-85 or later, as shipped with RedHat 7.2, although 7.1 is also okay for ReiserFS). The choice of an unstable, unreleased version of gcc 2.96 by RedHat as the default gcc was a Slashdot controversy. gcc 2.96 on RedHat 7.0 was unstable, and ReiserFS was one of the things that would fail for it. There are two gcc versions here: 2.96 and 2.96-85. 2.96-85 works for ReiserFS, and the other (the one on RedHat 7.0) surely does not. Read the Linux kernel instructions about what compiler to use. The solution to code not working on broken compilers is the one RedHat has taken: fix the compiler. They fixed the compiler and thereby allowed the correctly compiled ReiserFS to work. === In my program I am using fsync(2) calls after each write to the file to guarantee the integrity of my file data, and this is very slow. How can I improve the performance? === Answer from Chris Mason: The main thing to remember is that fsyncs introduce a bunch of disk writes and force the FS to wait on the buffers. The key to keeping performance up is to make it easy for the FS to do as much as possible before the fsync call. So, if your application modifies 3 files, and you want to make sure all 3 changes are safely on disk: write(file1) write(file2) write(file3) fsync(file1) fsync(file2) fsync(file3) is much faster than: write(file1) fsync(file1) write(file2) fsync(file2) write(file3) fsync(file3) It is also faster to write over existing bytes in the file than it is to append new bytes onto the end of a file.
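Chris Mason's batching advice above can be sketched in Python (file names are hypothetical, and <tt>save_batched</tt> is an illustrative helper, not part of any real API): every <tt>write</tt> is issued before any <tt>fsync</tt>, so the filesystem can commit the dirty buffers together.

```python
import os

def save_batched(updates):
    """Write all files first, then fsync them all.  Batching the flushes
    after the writes lets the filesystem do more work per fsync than the
    write/fsync/write/fsync pattern allows."""
    updates = list(updates)
    fds = [os.open(name, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
           for name, _ in updates]
    for fd, (_, data) in zip(fds, updates):
        os.write(fd, data)          # all writes happen up front...
    for fd in fds:
        os.fsync(fd)                # ...then the flushes are batched
        os.close(fd)
```

The interleaved variant would simply move each <tt>os.fsync</tt> directly after its <tt>os.write</tt>, forcing the filesystem to wait on the buffers three separate times.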
When you overwrite existing bytes in the file, you don't have to commit new metadata to disk on fsync(); the FS can just write the data blocks. This means fewer seeks. The more you write to a single file before calling fsync, the faster overall things will run. write(8k) fsync(file) is much faster than: write(4k) fsync(file) write(4k) fsync(file) Trying to optimize for those 3 things alone can make a huge performance difference overall. Answer from Josh MacDonald: You have to understand that even using fsync() after every write() makes no guarantees. If the system crashes during either the write or the fsync operation, your data may be lost or corrupted. Suppose the fsync() does complete: does your application keep its data in multiple files? If that is the case and you need to write() to multiple files as part of a transaction, you have even greater problems. The only safe and easy way for you to implement some kind of transaction with the traditional file system guarantees is to use rename(): 1. Keep all of your data in a single file. 2. Periodically write a complete copy of your database to a temporary file. 3. Rename the temporary file to the original database name. (Addition from Nikita Danilov: One can implement something like a phase-tree at user level and use rename to atomically switch the root of the tree. This overcomes the "everything-in-one-file" limitation but has the added complexity of requiring crash recovery.) Answer from Nikita Danilov: Stop your development for now and wait until the reiser4 filesystem is released; it will have a transaction API exported to userspace, and that transaction API would solve all of your problems. == Our program needs to access a lot of working files. What is the recommended way to organize files to get the best results out of ReiserFS? Should all the files be placed in a single directory, or should the files be spread across a directory tree to limit the number of files per directory?
Can you also summarize the relevant caching and locking effects? == Traditional file systems typically have poor performance when there are many files in a single directory, but not [[ReiserFS]]. These other file systems perform poorly because they use a linear search algorithm to find and replace entries in a directory. This means that the file system must scan, on average, half the blocks of a directory for every access. Typically, applications are required to work around this problem by manually structuring a tree of directories, allowing each individual directory to remain limited in size. For example, see how the Squid web proxy stores a large collection of files. ReiserFS does not have this problem because it uses an internal tree to store all directories and file metadata. Directory operations remain efficient even for very large directories, so you can write your application free from this performance concern. However, there are several issues that complicate this matter: namely locking and locality. The Linux VFS currently imposes locking restrictions that serialize many operations on directories, so if concurrent processes or threads will access the collection of files, then you may be better off using multiple directories. [[Reiser4]] will improve upon this restriction, although it is still under development. ReiserFS attempts to store all of the files in a directory, along with the directory itself, in nearby locations on disk. An application may exploit this spatial locality if it can predict which files will be accessed with temporal locality. You may be better off using multiple directories to store your files if you can predict that many files within a directory will be accessed at the same time. To summarize, ReiserFS supports efficient access to large directories and most traditional file systems do not.
However, locking and locality issues may guide your decision to use manually structured directory trees instead, at least until ReiserFS exports control over packing locality to users, and improves its locking. [[category:ReiserFS]] [[category:Reiser4]] This FAQ is very [[ReiserFS]]-centric and often a bit dated. The [[Reiser4]] filesystem is mentioned as ''upcoming''. Be sure to search the [[mailinglists|mailing list archives]] and help update this FAQ - Thanks! __TOC__ === What are the specs for ReiserFS: maximum number of files, of files a directory can have, of sub-dirs in a dir, of links to a file, maximum file size, maximum filesystem size, etc.? === Specifications for [[ReiserFS]]: {|cellpadding="5" cellspacing="0" border="1" | '''property''' || '''3.5''' || '''3.6''' |- | max number of files || 2<sup>32</sup>-3 => 4 Gi - 3 || 2<sup>32</sup>-3 => 4 Gi - 3 |- | max number of files a dir can have || 518701895 (but in practice this value is limited by the hash function; the r5 hash allows about 1 200 000 file names without collisions) || 2<sup>32</sup> - 4 => 4 Gi - 4 (but in practice this value is limited by the hash function; the r5 hash allows about 1 200 000 file names without collisions) |- | max file size || 2<sup>31</sup>-1 => 2 Gi - 1 || 2<sup>60</sup> bytes => 1 Ei, but the page cache limits this to 8 Ti on architectures with 32-bit int |- | max number of links to a file || 2<sup>16</sup> => 64 Ki || 2<sup>32</sup> => 4 Gi |- | max filesystem size || 2<sup>32</sup> (4K) blocks => 16 Ti || 2<sup>32</sup> (4K) blocks => 16 Ti |} ReiserFS does '''meta-data journaling''', enabling fast crash recovery without the expense of full '''data journaling'''. There [ftp://ftp.suse.com/pub/people/mason/patches/intermezzo-alpha/ were] separate [http://marc.info/?l=reiserfs-devel&m=100895310422415&w=2 patches from Chris Mason] that implement full data journaling for ReiserFS for Linux 2.4.16.
'''Note''': Full data journaling is considered by many to be a good way to achieve file data integrity across system crashes. However, although file data may appear to be consistent from the kernel's point of view, since there is no API exported to userspace to control transactions, we may end up in a situation where the application makes two write requests (as part of one logical transaction) but only one of these gets journaled before the system crashes. From the application's point of view, we may then end up with inconsistent data in the file. Such issues should be addressed with the upcoming [[Reiser4]]. Such an API will be exported to userspace, and all programs that need transactions will be able to use it. === Mount fails after reiserfsck --rebuild-tree failure === When [[reiserfsck]] --rebuild-tree is run, the first thing it does is set the root inode value to -1. This makes the filesystem unmountable. (So, if [[reiserfsck]] fails later on because the filesystem contains serious errors, the filesystem cannot be mounted.) Therefore, once [[reiserfsck]] --rebuild-tree has failed for one of your filesystems, mounting of this partition is disabled. To correct the error, first check that you have the latest [[Reiser4progs|reiserfsprogs]] package installed. If that fails, please send a bug report to our [[mailinglists|mailing list]] and be ready to answer our questions. === Why is the execution time for a <tt>find . -type f -exec cat {} \;</tt> command much longer when using ReiserFS than for the same command when using ext2? === This effect is observed if the measured file set was produced by untarring some archive that was not created from a ReiserFS partition (or by copying files from a non-ReiserFS partition, or by running a program that writes a bunch of files in some order).
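Josh MacDonald's rename() recipe above (write a complete temporary copy, then rename it over the original) can be sketched as follows; the paths and the helper name are hypothetical, and this is a sketch of the pattern rather than a hardened library routine.

```python
import os

def atomic_update(path, data):
    """Crash-safe whole-file update: after a crash the file holds either
    the complete old contents or the complete new contents, never a mix.
    The new copy is made durable with fsync before rename() swaps it in."""
    tmp = path + ".tmp"
    fd = os.open(tmp, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.write(fd, data)
        os.fsync(fd)                 # temporary copy is on disk first
    finally:
        os.close(fd)
    os.rename(tmp, path)             # POSIX rename() replaces atomically
```

This is the "periodically write a complete copy, then rename" step of the recipe; keeping all the data in one file is what makes the single rename() sufficient as a commit point.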
This is because the <tt>readdir()</tt> operation performed on the ReiserFS partition returns filenames not in the original write order but rather in some hash order (dependant on the hash function used). Thus when reading files' contents, the hard drive heads must move when going from one file to another. If you want ReiserFS to outperform any other filesystem in your setup here is one solution: Copy the entire directory that you are not satisfied with to the same partition but with a different name (use <tt>cp -a</tt>), then remove the old directory and rename the new one with the old name. If the partition does not have enough space available, another approach is to <tt>tar(1)</tt> up the whole partition, clear it, and then untar the previously saved data. === Is quota-support built-in in the vanilla 2.4 kernels for ReiserFS? === No, quota support for Linux kernels for the 2.4 branch are bundled separately and were available once at [ftp://ftp.suse.com/pub/people/mason/patches/reiserfs/quota-2.4/ at SuSE] (gone) by Chris Mason, they are still [http://gd.tuwien.ac.at/utils/fs/reiserfs/quota-patches/ mirrored at TU-Wien]. The reason these patches were not included into 2.4 kernel branch is because they implement new quota format and need new quota code too, which is too big of a change for 2.4 series of kernels. Various Linux distributions vendors (ie [http://www.suse.com SuSE]) do ship reiserfs-quota enabled kernels, though. 
=== I am getting some errors in my kernel logs, that I do not know how to interpret === Messages like: vs-13070: reiserfs_read_inode2: i/o failure occurred trying to find stat data of [1718696 1718710 0x0 SD]" zam-7001: io error in reiserfs_find_entry most likely accompanied with samples below are definite signs of harddisk problems (bad sectors): hda: dma_intr: status=0x51 { DriveReady SeekComplete Error } hda: dma_intr: error=0x40 { UncorrectableError }, LBAsect=6599945, sector=4286584 end_request: I/O error, dev 03:03 (hda), sector 4286584 or scsi0: ERROR on channel 0, id 1, lun 0, CDB: Read (10) 00 00 01 ee 60 00 00 08 00 Current sd 08:00: sense key Medium Error or I/O error: dev 08:21, sector 65704 Messages about <tt>"access beyond end of device"</tt> may have lots of different reasons starting from not rebooting after fdisk requested it, unfinished resizings, data corruptions. The following messages mean you have a noisy IDE cable, or it is just too low quality for choosen UDMA mode. Try to replace the cable with better one, or choose slower UDMA mode: hda: dma_intr: status=0x51 { DriveReady SeekComplete Error } hda: dma_intr: error=0x84 { DriveStatusError BadCRC } hda: dma_intr: status=0x51 { DriveReady SeekComplete Error } hda: dma_intr: error=0x84 { DriveStatusError BadCRC } If you see any message from [[ReiserFS]] that you cannot interpret and there is nothing similar to messages above around it, [[mailinglists|mail the message to us]] and we will explain it to you. === Will ReiserFS implement streams, extended attributes, etc.? === [[FAQ/streams|Here]] is the one page answer. === Reiserfs appears to be very slow while the RAID is resyncing. Mounting takes several minutes. Once mounted, an <tt>ls(1)</tt> in the mounted directory hangs. Forever. Once the RAID is sync'ed, things appear to work pretty well. How that can be fixed? === First of all we have included a patch that helps mounting the drive faster into linux kernel since 2.4.19. 
You can grab the patch for earlier kernels [http://gd.tuwien.ac.at/utils/fs/reiserfs/reiserfs-for-2.5/2.5.4.pending/07-reiserfs-bitmap-journal-read-ahead.diff here].

Also, RAID drivers have '''minimal guaranteed''' and '''maximal possible''' RAID rebuild bandwidth limits. These values are controlled through the <tt>/proc/sys/dev/raid/speed_limit_min</tt> and <tt>/proc/sys/dev/raid/speed_limit_max</tt> sysctl variables (values are in KiB/s). It seems that the RAID logic cannot always tell whether the disk subsystem is busy at a given time. When it thinks the disk subsystem is idle, it tries to rebuild the RAID array at the <tt>speed_limit_max</tt> speed, which defaults to 100 MB per second. Decrease this value to something more suitable (a bit of experimentation might be needed).

=== I get "attempt to read past the end of the partition" error messages; is ReiserFS corrupted? ===

You changed your partition sizes, and then ran [[mkreiserfs]] before rebooting. The kernel does not update its notion of the partition sizes until reboot time. (This is fixable, but nobody has fixed it as of Dec. 2001.) [[mkreiserfs]] therefore created a filesystem with the wrong notion of how large its partition is. The filesystem's notion of the partition boundaries will last past reboot even though the kernel's notion will change. So yes, it is corrupted. Some other kinds of metadata breakage can also lead to such messages.

=== Can I use VMware with ReiserFS? ===

VMware was tested on [http://www.suse.com/ SuSE Linux] with a [http://support.microsoft.com/gp/lifean18 Windows98] guest OS on a [[ReiserFS]] partition. There is one trick at the beginning: the following line was added to the VMware config file:

 host.FSSupportLocking1 = 0x52654973 # (0x52654973 == *(u32 *) "ReIs")

Thanks to [mailto:gkade@bigbrother.net Gregory K. Ade] for this hint.

=== How do I install Debian potato with ReiserFS as root partition? ===

[[FAQ/potato_part|Here]] are instructions by [mailto:LeBlanc@mcc.ac.uk Dr. A.V. Le Blanc].

=== Starting with linux kernel v2.4.21 I cannot mount my FS anymore. Why? ===

Special sanity checks were added to the kernel code to prohibit mounting of filesystems that are bigger than the underlying block device. If you now see this message on mount:

 Filesystem on xx:yy cannot be mounted because it is bigger than the device

you may need to run fsck or increase the size of your LVM partition. Or maybe you forgot to reboot after fdisk when it told you to. If you do not use LVM, that usually means you need to run <tt>[[reiserfsck]] --rebuild-sb</tt> on your filesystem and agree to change its default size to the proposed one.

=== Is it ok to use ReiserFS on a small storage device, e.g. a 16MB NAND flash block device? ===

[[FAQ/small_blocks|Here]] are instructions.

=== How do I change root from ext2 to ReiserFS without loss of data? ===

[[FAQ/change_fs|Here]] are instructions.

=== <tt>mount: /dev/hda5 has wrong major or minor number</tt> - what does that mean? ===

The kernel does not know anything about [[ReiserFS]]; it is neither compiled in nor available as a module.

=== Will it be possible to read/write ReiserFS partitions created now with future versions of ReiserFS? ===

Yes. [[ReiserFS]]-3.6.x (Linux-2.4.x) works with both the old (3.5) and the new (3.6) formats. ReiserFS-3.5.x (Linux-2.2.x) can only work with the old (3.5) disk format. There is no way to convert the new (3.6) disk format to the old (3.5), but the old (3.5) format can be converted to the new one (3.6) with the <tt>-o conv</tt> [[mount|mount option]].

=== The ReiserFS module doesn't insert properly - why? ===

After applying the patch, ''recompile'' the whole kernel including the modules target, reboot, then try to insert the module.

=== Can I use ReiserFS with software RAID? ===

Yes, for all RAID levels using any Linux >= 2.4.1, but '''DO NOT''' use RAID with Linux 2.2.x.
Our journaling and their RAID code step on each other in the buffering code. Also, mirroring is '''not''' safe in the 2.2.x kernels, because online mirror rebuilds in 2.2.x break the write-ordering requirements for the log. If you crash in the middle of an online rebuild, your metadata may be corrupted. The only RAID level that is safe with [[ReiserFS]] in the 2.2.x kernels is the striping/concatenation level.

=== Can I use ReiserFS with 3ware RAID? ===

Yes, but you need to use Linux 2.2.19 or later for reasons other than [[ReiserFS]]. Also, if you should encounter problems, consider that it might not be ReiserFS that has the bug. See these [http://web.archive.org/web/20030415160519/http://www.3ware.com/support/raid5techbulletin.shtml special instructions] (archive.org).

=== Why do things freeze on my IDE hard drive for annoying amounts of time? ===

Because when large writes are scheduled all at once, reads can starve. A fix for this is evolving; the later your ReiserFS patch, the better we handle this.

=== <tt>du(1)</tt> says ReiserFS makes space efficiency worse. ===

Use <tt>df(1)</tt>, not <tt>du(1)</tt>, or use the ''raw'' option for <tt>du(1)</tt> if it is supported. <tt>st_blocks</tt> summed up is less accurate than <tt>st_size</tt> for [[ReiserFS]] because we pack tails, and <tt>st_blocks</tt> rounds numbers up.

=== <tt>mkreiserfs(8)</tt> fails after repartitioning ===

The kernel requires you to reboot after repartitioning (for all filesystems). We intend to fix that.

=== Performance is poor, and my disk at 96% full still has free space. ===

Once a disk drive gets more than 85% full, performance starts to suffer unless you use a repacker (which isn't implemented yet). You can probably get away with 92%, but if performance is valued, you are making a mistake to keep it any fuller. This is true for almost all filesystems.
[[ReiserFS]], because we pack tails together, packs more data into a given percentage used, but it is still subject to the rules for the maximum recommended percentage used. If you create the whole disk with one copy and then mount it read-only, then you can fully pack it without problems. Please be sure that you copy it from (or <tt>tar</tt> it from) a reiserfs partition, so that files are created in reiserfs <tt>readdir()</tt> order, as this will improve performance.

=== Why do I get a signal 11 when compiling the kernel using ReiserFS and not ext2? ===

Your CPU is overheating and/or you have [http://www.bitwizard.nl/sig11/ bad RAM].

=== But it doesn't happen with ext2? ===

ext2 uses less heat-sensitive gates in the CPU :-) Seriously, ext2 and [[ReiserFS]] contain random differences, and overheating and bad RAM have random sensitivities. ([http://www.bitwizard.nl/sig11/ Signal 11] is not due to ReiserFS. One user had a cable blocking the fan; it did not affect ext2, but it wasn't until he fixed the cable-fan problem that ReiserFS worked.)

=== Can I use ReiserFS on architectures other than i386? ===

Yes. Starting from the Linux [http://kernel.org/pub/linux/kernel/v2.4/ChangeLog-2.4.13 kernel 2.4.13], ReiserFS can run on any Linux-supported architecture.

=== I need a program which will help me in rebuilding/recreating my partition table. ===

[http://brzitwa.de/mb/gpart/index.html gpart] is a utility that handles ext2, FAT, Linux swap, HPFS, NTFS, FreeBSD and Solaris/x86 disklabels, Minix, and ReiserFS; it prints proposed contents for the primary partition table and is well documented.

=== What partition type should I use for ReiserFS? ===

Linux native filesystem (83).

=== Can I use 32GB+ IDE Hard Drives with ReiserFS? ===

Yes, if you use Linux kernel 2.4 and up.

=== What about resizing ReiserFS? ===

Please follow this link.

=== What should I put into the fifth (aka dump, fs_freq) and the sixth (aka pass, fs_passno) fields of /etc/fstab for ReiserFS filesystems? ===

 0 0

=== Why are ReiserFS filesystems not fscked on reboot after a crash? ===

Because ReiserFS provides journalling of metadata. After a crash, the consistency of a filesystem is restored by replaying the transaction log.

=== Can I interactively repair a filesystem that was corrupted (due to an internal bug in the kernel or to a hardware fault)? ===

man [[reiserfsck]]

=== Can I use "dump" and "restore" with ReiserFS? Any caveats? ===

No. dump uses knowledge of the internal structure of ext2 and works together with restore, which also uses ext2-specific knowledge, to back up ext2 files. dump and restore are specific to ext2 and will not work with ReiserFS. To back up ReiserFS files use tar, which is universal and can be applied to almost any reasonable Linux filesystem. It is well known among system administrators that dump is more complete than Unix tar, and that there is quite a list of things that Unix tar will fail to back up properly. This is not true of GNU tar, which is quite complete. Basically, the only real disadvantage of GNU tar compared to dump is speed. Unfortunately, because it shares the same name as Unix tar, people are reluctant to believe this. (Yes, the GNU version has incremental backups, etc.) We will performance-optimize ReiserFS backups for you (and the rest of the world) for $30k, which is not a lot if you are a large site spending a few hundred thousand on equipment for backups.

=== Does ReiserFS support snapshots? ===

No, but you can create ReiserFS on top of an LVM logical volume and use LVM's snapshot capabilities.

=== Can I check reiserfs filesystems for errors without unmounting them? ===

[[reiserfsck]] in checking mode may run over filesystems mounted read-only. There is no official way to fix mounted filesystems, though. You MUST completely unmount your filesystem in order to have it fixed.
If you have LVM, you can check the consistency of filesystems mounted read-write; here is the script contributed by Andreas Dilger:

=== What ReiserFS mount options should I use to get the performance winner for a mail server? ===

Craig Sanders answered in detail: "By the time I got around to running bonnie, the postmark and postal benchmarks had convinced me that notail was essential.

host system:
* Debian GNU/Linux (of course :)
* Linux kernel 2.4.2 with latest 20010305 ReiserFS patch
* dual P3-866 (256K cache)
* 512MB RAM
* Adaptec 19160 SCSI Controller

external drive box:
* Domex 8230u RAID controller, 32MB battery-backed cache
* 6 x 18GB IBM DDYS-T18350M drives

For this particular hardware I was using, reiserfs/notail on RAID5 was the clear performance winner for a mail server with lots of synced random I/O."

=== Does using ReiserFS mean I can just press the power-off button without running "shutdown" or "init 0", etc.? Does it mean there is no risk of data loss? ===

No, definitely not. As of now, ReiserFS only provides metadata journaling; that is, it records which files have been created or opened, whether they have had their size changed, or where they have been relocated. It guarantees that the structure of the internal ReiserFS tree will be correct, thereby allowing you to start back up after an unclean shutdown without having to run fsck on all the files that have not been changed. Data in files that were being used at the time of the crash may have been corrupted. This is usual for most filesystems. Data-journaling filesystems guarantee that no garbage will be written into a file, but they do not guarantee that a file update will be completed. (Only reiser4 guarantees that filesystem operations are performed as atomic operations, and provides atomic transaction functionality.) ReiserFS v3 does not guarantee that the file contents themselves are uncorrupted, nor that no data is lost.
Moreover, even given that all of your system is on ReiserFS, many system components (like daemons, database managers, etc.) require a proper shutdown procedure to function correctly.

However, there is a separate implementation of data logging that will soon go into the mainstream kernel. You should be able to get it from ftp.suse.com/pub/people/mason/patches/data-logging

=== How does ReiserFS support bad block handling? ===

See here.

=== I have a motherboard with VIA MVP3 chipset and experience ReiserFS problems. ===

William Oster <woster73@yahoo.com> answers: If you are using a motherboard with a VIA MVP3 chipset, you may have ReiserFS problems caused by the way your kernel is configured for the so-called "pci quirks". My experience is with kernels 2.2.18 and 2.2.19, but it may affect the 2.4.x series too if you are using the MVP3 chipset (popular in socket 7 type motherboards, such as those used by the AMD K6 and classic Pentium). I have confirmed this problem with several motherboards using the VIA MVP3 chipset, ReiserFS 3.5.29 to 3.5.32, and NCR 53c8xx SCSI. But please note: it probably affects any controller which uses DMA and PCI bus mastering. Problems which I was inclined to attribute to ReiserFS were actually problems with this kernel [mis]configuration. If you fit this profile, DO NOT enable the "pci quirks" configuration option in the /usr/src/linux/.config file. Although the Linux documentation suggests that this option can be enabled if in doubt, DO NOT enable it. It was never intended for the VIA MVP3 chipset anyway. It affects the way DMA is handled, and the combination with ReiserFS (and possibly NCR SCSI) can cause random disk corruption which eventually results in ReiserFS and/or SCSI errors. Evidently ReiserFS exercises DMA and the SCSI bus very thoroughly; the problems seem not to be as likely under the ext2 filesystem. Check your /usr/src/linux/.config file.
You are SAFE from this problem if you find this line:

 # CONFIG_PCI_QUIRKS is not set

Any other setting could be dangerous to MVP3 chipset ReiserFS users, especially when using PCI bus mastering controllers such as the NCR 53c8xx series. Re-configure your kernel to disable the "pci quirks" option, then make dep, rebuild, and reinstall.

=== I am having extensive problems using ReiserFS; it seems to have bugs all over the place. I'm not compiling with a buggy compiler. What is happening? How can this be stable? ===

You have hardware problems. Really, you do. Even if the bugs don't show up with ext2, you have hardware problems. (See the FAQ question about ReiserFS running 3°C hotter than ext2.) Most SuSE users use ReiserFS. Obscure bugs probably still exist; but if you find bugs as easily as when using Windows, you have bad RAM, a bad CPU, a bad cable, bad cooling, a VIA chipset with PCI quirks turned on, or other hardware or software-layer bugs. ReiserFS is stable. You can be sure that if bugs are encountered easily and commonly with normal usage patterns, it is not us. This does not mean that the next release won't somehow break something, though :-/ Real bug reports are, at the time of writing, outnumbered 10 to 1 by hardware bugs that trigger error messages. We are working on making our error messages better at catching hardware bugs and identifying them as such. There is only so far we can go in runtime consistency checking without serious speed reductions. We don't release software unless it goes through extensive testing; so if you don't think that our testing could have missed the bug, it is probably hardware.

=== How can I put a label (like that allowed by the <tt>-L</tt> option of <tt>mkfs.ext2</tt>) on a ReiserFS instance? ===

Currently, this feature is only implemented for the [[ReiserFS]] v3.6 disk format. Adding it to the v3.5 disk format would break the existing disk format, and there is not enough free space in the superblock.
You can set a label (and UUID) with a recent [[Reiser4progs|reiserfsprogs]] package on a [[ReiserFS]] v3.6 filesystem, using the <tt>-l</tt> switch (<tt>-u</tt> for UUID) to the [[reiserfstune]] (for existing partitions) or [[mkreiserfs]] (for partitions being created) commands. Support for labels and UUIDs was integrated into [[Reiser4progs|reiserfsprogs]] starting from version 3.x.1a.

=== Why, when I'm working on files (i.e. having open files) on my laptop, does ReiserFS access the disk every 5 seconds? This effectively prevents the disk from spinning down, i.e. APM modes from taking over, even when I'm not writing anything. ===

Brent Graveland <bgraveland@hyperchip.com> answers: It's the atime update. Every time you run sync, the sync program's atime is updated. The next sync writes this atime update, then sync gets updated again...

=== RedHat does not unmount / with ReiserFS on halt. How do I fix it? ===

RedHat users kindly provided these patches (not tested by us): rc.sysinit.patch and halt.patch. Note that if you have RedHat Linux 7.2 or later, you do not need these patches.

=== How do I run programs from the reiserfsprogs package on encrypted devices? ===

In order to access such encrypted entities, you need to use the losetup tool to bind your entity to a loop device.

=== Are there any recommendations for or against particular hard drive manufacturers for use with reiserfs? ===

There is basically no preference; the general "the faster the drive and the lower the seek time, the better" rule applies as always. On the other hand, almost every hard drive manufacturer has a "widely known" broken series of hard drives. The most recent example is IBM's "Deskstar" series, especially the DTLA models produced in Hungary in 2000-2001. These are known to fail very often, to the point that you probably don't want to use them even if you have already paid for them. Other Deskstar drives also seem to be not a very good choice.
IBM released a note that Deskstar drives should not run for more than 8 hours/day on average. These drives are also known to be very sensitive to temperature conditions and to fail on overheating. A class action lawsuit against IBM over that drive series is in progress.

=== I am using RedHat 7.0 with gcc 2.96; why does ReiserFS seem unstable with it? ===

Use the most recent version of RedHat (gcc 2.96-85 or later with RedHat 7.2, although 7.1 is also okay for ReiserFS). The choice of an unstable, unreleased version of gcc 2.96 by RedHat as the default gcc was a Slashdot controversy. gcc 2.96 on RedHat 7.0 was unstable, and ReiserFS was one of the things that would fail with it. There are two gcc versions: 2.96 and 2.96-85. 2.96-85 works for ReiserFS, and the other (the one on RedHat 7.0) surely does not. Read the Linux kernel instructions about which compiler to use. The solution to code not working on broken compilers is the one RedHat has taken: fix the compiler. They fixed the compiler and thereby allowed the correctly compiled ReiserFS to work.

=== In my program I am using fsync(2) calls after each write to the file to guarantee the integrity of my file data, and this is very slow. How can I improve the performance? ===

Answer from Chris Mason: The main thing to remember is that fsyncs introduce a bunch of disk writes, and force the FS to wait on the buffers. The key to keeping performance up is to make it easy for the FS to do as much as possible before the fsync call. So, if your application modifies 3 files, and you want to make sure all 3 changes are safely on disk:

 write(file1)
 write(file2)
 write(file3)
 fsync(file1)
 fsync(file2)
 fsync(file3)

is much faster than:

 write(file1)
 fsync(file1)
 write(file2)
 fsync(file2)
 write(file3)
 fsync(file3)

It is also faster to write over existing bytes in a file than it is to append new bytes onto the end of a file.
When you overwrite existing bytes in the file, you don't have to commit new metadata to disk on fsync(); the FS can just write the data blocks. This means fewer seeks. The more you write to a single file before calling fsync, the faster things will run overall:

 write(8k)
 fsync(file)

is much faster than:

 write(4k)
 fsync(file)
 write(4k)
 fsync(file)

Trying to optimize for those 3 things alone can make a huge performance difference overall.

Answer from Josh MacDonald: You have to understand that even using fsync() after every write() makes no guarantees. If the system crashes during either the write or fsync operation, your data may be lost or corrupted. Suppose the fsync() does complete; does your application keep its data in multiple files? If that is the case and you need to write() to multiple files as part of a transaction, you have even greater problems. The only safe and easy way for you to implement some kind of transaction with the traditional filesystem guarantees is to use rename():

# Keep all of your data in a single file.
# Periodically write a complete copy of your database to a temporary file.
# Rename the temporary file to the original database name.

(Addition from Nikita Danilov: One can implement something like a phase tree at user level and use rename to atomically switch the root of the tree. This overcomes the "everything-in-one-file" limitation, but has the added complexity of requiring crash recovery.)

Answer from Nikita Danilov: Stop your development for now and wait until the reiser4 filesystem is released; it has a transaction API exported to userspace. That transaction API would solve all of your problems.

== Our program needs to access a lot of working files. What is the recommended way to organize files to get the best results out of ReiserFS? Should all the files be placed in a single directory, or should the files be spread across a directory tree to limit the number of files per directory? Can you also summarize the relevant caching and locking effects? ==

Traditional file systems typically have poor performance when there are many files in a single directory, but not [[ReiserFS]]. These other file systems perform poorly because they use a linear search algorithm to find and replace entries in a directory. This means that the file system must scan, on average, half the blocks of a directory for every access. Typically, applications are required to work around this problem by manually structuring a tree of directories, allowing each individual directory to remain limited in size. For example, see how the Squid web proxy stores a large collection of files.

ReiserFS does not have this problem, because it uses an internal tree to store all directories and file metadata. Directory operations remain efficient even for very large directories, so you can write your application free from this performance concern.

However, there are several issues that complicate this matter: namely locking and locality. The Linux VFS currently imposes locking restrictions that serialize many operations on directories, so if concurrent processes or threads will access the collection of files, then you may be better off using multiple directories. [[Reiser4]] will improve upon this restriction, although it is still under development. ReiserFS attempts to store all of the files in a directory, along with the directory itself, in nearby locations on disk. An application may exploit this spatial locality if it can predict which files will be accessed with temporal locality. You may be better off using multiple directories to store your files if you can predict that many files within a directory will be accessed at the same time.

To summarize, ReiserFS supports efficient access to large directories, while most traditional file systems do not.
However, locking and locality issues may guide your decision to use manually structured directory trees instead, at least until ReiserFS exports control over packing locality to users and improves its locking.

[[category:ReiserFS]] [[category:Reiser4]]

This FAQ is very [[ReiserFS]] centric and often a bit dated. The [[Reiser4]] filesystem is mentioned as ''upcoming''. Be sure to search the [[mailinglists|mailing list archives]] and help update this FAQ - Thanks!

__TOC__

=== What are the specs for ReiserFS: maximum number of files, of files a directory can have, of sub-dirs in a dir, of links to a file, maximum file size, maximum filesystem size, etc.? ===

Specifications for [[ReiserFS]]:

{|cellpadding="5" cellspacing="0" border="1"
| '''property''' || '''3.5''' || '''3.6'''
|-
| max number of files || 2<sup>32</sup>-3 => 4 Gi - 3 || 2<sup>32</sup>-3 => 4 Gi - 3
|-
| max number of files a dir can have || 518701895 (but in practice this value is limited by the hash function. r5 hash allows about 1 200 000 file names without collisions) || 2<sup>32</sup> - 4 => 4 Gi - 4 (but in practice this value is limited by the hash function. r5 hash allows about 1 200 000 file names without collisions)
|-
| max file size || 2<sup>31</sup>-1 => 2 Gi - 1 || 2<sup>60</sup> bytes => 1 Ei, but the page cache limits this to 8 Ti on architectures with 32-bit int
|-
| max number of links to a file || 2<sup>16</sup> => 64 Ki || 2<sup>32</sup> => 4 Gi
|-
| max filesystem size || 2<sup>32</sup> (4K) blocks => 16 Ti || 2<sup>32</sup> (4K) blocks => 16 Ti
|}

ReiserFS does '''meta-data journaling''', enabling fast crash recovery without the expense of full '''data journaling'''. There [ftp://ftp.suse.com/pub/people/mason/patches/intermezzo-alpha/ were] separate [http://marc.info/?l=reiserfs-devel&m=100895310422415&w=2 patches from Chris Mason] that implement full data journaling for ReiserFS for Linux 2.4.16.
'''Note''': Full data journaling is considered by many to be a good way to achieve file data integrity across system crashes. However, although file data may appear to be consistent from the kernel's point of view, since there is no API exported to userspace to control transactions, we may end up in a situation where the application makes two write requests (as part of one logical transaction) but only one of them gets journaled before the system crashes. From the application's point of view, we may then end up with inconsistent data in the file. Such issues should be addressed with the upcoming [[Reiser4]]: such an API will be exported to userspace, and all programs that need transactions will be able to use it.

=== Mount fails after reiserfsck --rebuild-tree failure ===

When [[reiserfsck]] --rebuild-tree is run, the first thing it does is set the root inode value to -1. This makes the filesystem unmountable. (So if [[reiserfsck]] fails later on because the filesystem contains serious errors, the filesystem cannot be mounted.) Therefore, once [[reiserfsck]] --rebuild-tree has failed for one of your filesystems, mounting of that partition is disabled. To correct the error, first check that you have the latest [[Reiser4progs|reiserfsprogs]] package installed. If that fails, please send a bug report to our [[mailinglists|mailing list]] and be ready to answer our questions.

=== Why is the execution time for a <tt>find . -type f | xargs cat {} \;</tt> command much longer when using ReiserFS than for the same command when using ext2? ===

This effect is observed if the measured file set was produced by untarring an archive that was not created from a ReiserFS partition (or by copying files from a non-ReiserFS partition, or by running a program that writes a bunch of files in some order).
This is because the <tt>readdir()</tt> operation performed on the ReiserFS partition returns filenames not in the original write order, but rather in some hash order (dependent on the hash function used). Thus when reading the files' contents, the hard drive heads must move when going from one file to another.

If you want ReiserFS to outperform any other filesystem in your setup, here is one solution: copy the entire directory that you are not satisfied with to the same partition but under a different name (use <tt>cp -a</tt>), then remove the old directory and rename the new one to the old name. If the partition does not have enough space available, another approach is to <tt>tar(1)</tt> up the whole partition, clear it, and then untar the previously saved data.

=== Is quota support built into the vanilla 2.4 kernels for ReiserFS? ===

No. Quota support patches by Chris Mason for the 2.4 kernel branch are bundled separately; they were once available [ftp://ftp.suse.com/pub/people/mason/patches/reiserfs/quota-2.4/ at SuSE] (gone) and are still [http://gd.tuwien.ac.at/utils/fs/reiserfs/quota-patches/ mirrored at TU-Wien]. The reason these patches were not included in the 2.4 kernel branch is that they implement a new quota format and need new quota code too, which is too big a change for the 2.4 series of kernels. Various Linux distribution vendors (e.g. [http://www.suse.com SuSE]) do ship reiserfs-quota-enabled kernels, though.
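The <tt>cp -a</tt> repacking trick described above can be sketched as a short shell session. The directory names and the throwaway staging area are illustrative only; on a real system you would run the copy, remove, and rename steps directly on the fragmented directory on the ReiserFS partition:

```shell
# Create a throwaway directory tree standing in for the fragmented one.
work=$(mktemp -d)
mkdir "$work/maildir"
for i in 1 2 3; do echo "message $i" > "$work/maildir/msg$i"; done

# Copy the tree; cp -a preserves attributes, and the copy recreates the
# files in readdir() order, so they end up packed close together on disk.
cp -a "$work/maildir" "$work/maildir.new"

# Replace the old directory with the freshly packed copy.
rm -rf "$work/maildir"
mv "$work/maildir.new" "$work/maildir"

ls "$work/maildir"
```

The rename at the end means readers never see a half-copied directory under the original name, which is why the copy is staged under a different name first.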
=== I am getting some errors in my kernel logs, that I do not know how to interpret === Messages like: vs-13070: reiserfs_read_inode2: i/o failure occurred trying to find stat data of [1718696 1718710 0x0 SD]" zam-7001: io error in reiserfs_find_entry most likely accompanied with samples below are definite signs of harddisk problems (bad sectors): hda: dma_intr: status=0x51 { DriveReady SeekComplete Error } hda: dma_intr: error=0x40 { UncorrectableError }, LBAsect=6599945, sector=4286584 end_request: I/O error, dev 03:03 (hda), sector 4286584 or scsi0: ERROR on channel 0, id 1, lun 0, CDB: Read (10) 00 00 01 ee 60 00 00 08 00 Current sd 08:00: sense key Medium Error or I/O error: dev 08:21, sector 65704 Messages about <tt>"access beyond end of device"</tt> may have lots of different reasons starting from not rebooting after fdisk requested it, unfinished resizings, data corruptions. The following messages mean you have a noisy IDE cable, or it is just too low quality for choosen UDMA mode. Try to replace the cable with better one, or choose slower UDMA mode: hda: dma_intr: status=0x51 { DriveReady SeekComplete Error } hda: dma_intr: error=0x84 { DriveStatusError BadCRC } hda: dma_intr: status=0x51 { DriveReady SeekComplete Error } hda: dma_intr: error=0x84 { DriveStatusError BadCRC } If you see any message from [[ReiserFS]] that you cannot interpret and there is nothing similar to messages above around it, [[mailinglists|mail the message to us]] and we will explain it to you. === Will ReiserFS implement streams, extended attributes, etc.? === [[FAQ/streams|Here]] is the one page answer. === Reiserfs appears to be very slow while the RAID is resyncing. Mounting takes several minutes. Once mounted, an <tt>ls(1)</tt> in the mounted directory hangs. Forever. Once the RAID is sync'ed, things appear to work pretty well. How that can be fixed? === First of all we have included a patch that helps mounting the drive faster into linux kernel since 2.4.19. 
You can grab the patch for earlier kernels [http://gd.tuwien.ac.at/utils/fs/reiserfs/reiserfs-for-2.5/2.5.4.pending/07-reiserfs-bitmap-journal-read-ahead.diff here]. Also RAID drivers have '''minimal guaranteed''' and '''maximal possible''' RAID rebuild bandwidth usage. These valueas are controlled through <tt>/proc/sys/dev/raid/speed_limit_min</tt> and <tt>/proc/sys/dev/raid/speed_limit_max</tt> sysctl variables (values are in 100 KiB/s). It seems that RAID logic cannot always understand if the disk sysbsystem busy or not at a given time. When it thinks disk subsystem is idle, it tries to rebuild the raid array at <tt>speed_limit_max</tt> speed which defaults to 100 MB per second. Decrease this value to something more suitable (a bit of experimentation might be needed). === I get attempt to read past the end of the partition error messages; is ReiserFS corrupted? === You changed your partition sizes, and then before rebooting ran [[mkreiserfs]]. The kernel does not change its belief in what the partition sizes are until reboot time. (This is fixable, but nobody has fixed it as of Dec. 2001). [[mkreiserfs]] created a filesystem that has the wrong notion of how large the partition it is on is. The filesystem's notion of what the partition boundaries are will last past reboot even though the kernel's notion will change. So yes, it is corrupted. Also some other kinds of metadata breakage can lead to such messages. === Can I use VMware with ReiserFS? === VMware was tested on [http://www.suse.com/ SuSE Linux] with [http://support.microsoft.com/gp/lifean18 Windows98] Guest OS on a [[ReiserFS]] partition. There's one trick at the beginning: the following line was added to the VMware config file host.FSSupportLocking1 = 0x52654973 # (0x52654973 == *(u32 *) "ReIs") Thanks to [mailto:gkade@bigbrother.net Gregory K. Ade] for this hint. === How do I install Debian potato with ReiserFS as root partition? 
=== [[FAQ/potato_part|Here]] are instructions by [mailto:LeBlanc@mcc.ac.uk Dr. A.V. Le Blanc]. === Starting with linux kernel v2.4.21 I cannot mount my FS anymore. Why? === Special sanity checks were added to kernel code to prohibit mounting of filesystems that are bigger then underlying block device. If you now see this message on mount: Filesystem on xx:yy cannot be mounted because it is bigger than the device you may need to run fsck or increase size of your LVM partition. Or may be you forgot to reboot after fdisk when it told you to If you do not use LVM, that usually means you need to run <tt>[[reiserfsck]] --rebuild-sb</tt> on your filesystem and agree to change its default size to proposed one. === Is it ok to use ReiserFS on a small size storage device: e.g. 16MB NAND flash block device? === [[FAQ/small_blocks|Here]] are instructions. === How do I change root from ext2 to ReiserFS without loss of data? === [[FAQ/change_fs|Here]] are instructions. === <tt>mount: /dev/hda5 has wrong major or minor number</tt> - what does that mean? === The kernel does not know anything about [[ReiserFS]], it is neither compiled in nor available as a module. === Will it be possible to read/write ReiserFS partitions created now with future versions of ReiserFS? === Yes. [[ReiserFS]]-3.6.x (Linux-2.4.x) works with both the old (3.5) and the new (3.6) formats. ReiserFS-3.5.x (Linux-2.2.x) can only work with the old (3.5) disk-format. There is no way to convert the new (3.6) disk-format to the old (3.5), but the old (3.5) format could be converted to the new one (3.6) with the <tt>"-o conv</tt> [[mount|mount option]]. === The ReiserFS module doesn't insert properly - why? === After applying the patch, ''recompile'' the whole kernel including the modules target, reboot, then try to insert the module. === Can I use ReiserFS with the software RAID? === Yes, for all RAID levels using any Linux >= 2.4.1, but '''DO NOT''' use RAID with Linux 2.2.x. 
Our journaling and their RAID code step on each other in the buffering code. Also, mirroring is '''not''' safe in the 2.2.x kernels because online mirror rebuilds in 2.2.x break the write-ordering requirements for the log. If you crash in the middle of an online rebuild, your meta-data may be corrupted. The only RAID level that is safe with [[ReiserFS]] in the 2.2.x kernels is the striping/concatenation level. === Can I use ReiserFS with 3ware RAID? === Yes, but you need to use Linux 2.2.19 or later for reasons other than [[ReiserFS]]. Also, if you encounter problems, be aware that it might not be ReiserFS that has the bug. See the [http://web.archive.org/web/20030415160519/http://www.3ware.com/support/raid5techbulletin.shtml special instructions] (archive.org). === Why do things freeze on my IDE hard drive for annoying amounts of time? === Because when large writes are scheduled all at once, reads can starve. A fix for this is evolving; the later your ReiserFS patch, the better we handle this. === <tt>du(1)</tt> says ReiserFS makes space efficiency worse. === Use <tt>df(1)</tt>, not <tt>du(1)</tt>, or use the ''raw'' option for <tt>du(1)</tt> if it's supported. <tt>st_blocks</tt> summed up is less accurate than <tt>st_size</tt> for [[ReiserFS]] because we pack tails, and <tt>st_blocks</tt> rounds numbers up. === <tt>mkreiserfs(8)</tt> fails after repartitioning === The kernel requires you to reboot after repartitioning (for all filesystems). We intend to fix that. === Performance is poor, and my disk at 96% full still has free space. === Once a disk drive gets more than 85% full, the performance starts to suffer unless you use a repacker (which isn't implemented yet). You can probably get away with 92%, but if performance is valued you are making a mistake to keep it any fuller. This is true for almost all filesystems.
[[ReiserFS]], because we pack tails together, packs more data into a given percentage used, but it is still subject to the rules for the maximum recommended percentage used. If you create the whole disk with one copy and then mount it read-only, then you can fully pack it without problems. Please be sure that you copy it from (or <tt>tar</tt> it from) a reiserfs partition so that files are created in reiserfs <tt>readdir()</tt> order, as this will improve performance. === Why do I get a signal 11 when compiling the kernel using ReiserFS and not ext2? === Your CPU is overheating and/or you have [http://www.bitwizard.nl/sig11/ bad RAM]. === But it doesn't happen with ext2? === ext2 uses less heat-sensitive gates in the CPU. :-) Seriously, ext2 and [[ReiserFS]] contain random differences, and overheating and bad RAM have random sensitivities. ([http://www.bitwizard.nl/sig11/ Signal 11] is not due to ReiserFS. One user had a cable blocking the fan; it did not affect ext2, but it wasn't until he fixed the cable-fan problem that ReiserFS worked.) === Can I use ReiserFS on architectures other than i386? === Yes. Starting from Linux kernel 2.4.13, ReiserFS can run on any Linux-supported arch. === I need a program which will help me in rebuilding/recreating my partition table. === gpart (http://brzitwa.de/mb/gpart/index.html) is a utility that handles ext2, FAT, Linux swap, HPFS, NTFS, FreeBSD and Solaris/x86 disklabels, Minix, and ReiserFS; it prints proposed contents for the primary partition table and is well documented. === What partition type should I use for ReiserFS? === Linux native filesystem (83). === Can I use 32GB+ IDE hard drives with ReiserFS? === Yes, if you use Linux kernel 2.4 and up. === What about resizing ReiserFS? === Please follow this link. === What should I put into the fifth (aka dump, fs_freq) and the sixth (aka pass, fs_passno) fields of /etc/fstab for ReiserFS filesystems?
=== 0 0 === Why are ReiserFS filesystems not fscked on reboot after a crash? === Because ReiserFS provides journalling of meta-data. After a crash, the consistency of a filesystem is restored by replaying the transaction log. === Can I interactively repair a filesystem that was corrupted (due to an internal bug in the kernel or to a hardware fault)? === man [[reiserfsck]] === Can I use "dump" and "restore" with ReiserFS? Any caveats? === No. dump uses knowledge of the internal structure of ext2 and works together with restore, which also uses ext2-specific knowledge, to back up ext2 files. dump and restore are specific to ext2 and will not work with ReiserFS. To back up ReiserFS files use tar, which is universal and can be applied to almost any reasonable Linux filesystem. It is well known among system administrators that dump is more complete than Unix tar, and that there is quite a list of things that Unix tar will fail to back up properly. This is not true of GNU tar, which is quite complete. Basically, the only real disadvantage of GNU tar compared to dump is speed. Unfortunately, because it shares the same name as Unix tar, people are reluctant to believe this. (Yes, the GNU version has incremental backups, etc.) We will performance-optimize ReiserFS backups for you (and the rest of the world) for $30k, which is not a lot if you are a large site spending a few hundred thousand on equipment for backups. === Does ReiserFS support snapshots? === No, but you can create ReiserFS on top of an LVM logical volume and use LVM snapshot capabilities. === Can I check reiserfs filesystems for errors without unmounting them? === [[reiserfsck]] in checking mode may be run on filesystems mounted read-only. There is no official way to fix mounted filesystems, though. You MUST completely unmount your filesystem in order to have it fixed.
If you have LVM, you can check the consistency of filesystems mounted read-write; here is the script contributed by Andreas Dilger: === What ReiserFS mount options should I use to get the performance winner for a mail server? === Craig Sanders answered in detail: "By the time I got around to running bonnie, the postmark and postal benchmarks had convinced me that notail was essential.

host system:
* Debian GNU/Linux (of course :)
* Linux kernel 2.4.2 with latest 20010305 ReiserFS patch
* dual P3-866 (256K cache)
* 512MB RAM
* Adaptec 19160 SCSI Controller

external drive box:
* Domex 8230u RAID controller, 32MB battery-backed cache.
* 6 x 18GB IBM DDYS-T18350M drives

For this particular hardware I was using, reiserfs/notail on RAID5 was the clear performance winner for a mail server with lots of synced random I/O." === Does using ReiserFS mean I can just press the power-off button without running "shutdown" or "init 0," etc.? Does it mean there is no risk of data loss? === No, definitely not. As of now, ReiserFS only provides meta-data journaling -- that is, it records which files have been created or opened, whether they have had their size changed, or where they have been relocated. It guarantees that the structure of the internal ReiserFS tree will be correct, thereby allowing you, after an unclean shutdown, to start back up without having to run fsck on all the files that have not been changed. Data in files that were being used at the time of the crash could have been corrupted. This is usual for most filesystems. Data-journaling filesystems guarantee that there will be no garbage written into a file, but they don't guarantee that a file update will be complete. (Only reiser4 guarantees that filesystem operations are performed as atomic operations, and provides atomic transaction functionality.) ReiserFS V3 does not guarantee that the file contents themselves are uncorrupted, nor that no data is lost.
Moreover, even given that all of your system is on ReiserFS, many system components (like daemons, database managers, etc.) require the shutdown procedure for proper functioning. However, there is a separate implementation of data logging that will soon go into the mainstream kernel. You should be able to get it from ftp.suse.com/pub/people/mason/patches/data-logging === How does ReiserFS support bad block handling? === See here. === I have a motherboard with VIA MVP3 chipset and experience ReiserFS problems. === William Oster <woster73@yahoo.com> answers: If you are using a motherboard with a VIA MVP3 chipset, you may have ReiserFS problems caused by the way your kernel is configured for the so-called "pci quirks". My experience is with kernels 2.2.18 and 2.2.19, but it may affect the 2.4.x series too if you are using the MVP3 chipset (popular in socket 7 type motherboards, such as those used by the AMD K6 and classic Pentium). I've confirmed this problem with several motherboards using the VIA MVP3 chipset, ReiserFS 3.5.29 to 3.5.32, and NCR 53c8xx SCSI. But please note: it probably affects any controller which uses DMA and PCI bus mastering. Problems which I was inclined to attribute to ReiserFS were actually problems with this kernel (mis)configuration. If you fit this profile, DO NOT enable the "pci quirks" configuration option in the /usr/src/linux/.config file. Although the Linux documentation suggests that this option can be enabled if in doubt, DO NOT enable it. It was never intended for the VIA MVP3 chipset anyway. It affects the way DMA is handled, and the combination with ReiserFS (and possibly NCR SCSI) can cause random disk corruption which eventually will result in ReiserFS and/or SCSI errors. Evidently ReiserFS exercises the DMA and SCSI bus very thoroughly; the problems seem not to be as likely under the ext2 filesystem. Check your /usr/src/linux/.config file.
You are SAFE from this problem if you find this line: # CONFIG_PCI_QUIRKS is not set Any other setting could be dangerous to MVP3-chipset ReiserFS users, especially when using PCI bus-mastering controllers such as the NCR 53c8xx series. Re-configure your kernel to disable the "pci quirks" option, then make dep, rebuild, and reinstall. === I am having extensive problems using ReiserFS; it seems to have bugs all over the place. I'm not compiling with a buggy compiler. What is happening? How can this be stable? === You have hardware problems. Really, you do. Even if the bugs don't show up with ext2, you have hardware problems. (See the FAQ question about ReiserFS running 3°C hotter than ext2.) Most SuSE users use ReiserFS. Obscure bugs probably still exist; but if you find bugs as easily as when using Windows, you have bad RAM, a bad CPU, a bad cable, bad cooling, a VIA chipset with PCI quirks turned on, or other hardware or other software-layer bugs. ReiserFS is stable. You can be sure that if the bugs are encountered easily and commonly with normal usage patterns, it is not us. This does not mean that the next release won't somehow break something though :-/..... Real bug reports are, at the time of writing, outnumbered 10 to 1 by hardware bugs that trigger error messages. We are working on making our error messages better at catching hardware bugs and identifying them as such. There is only so far we can go, though, in runtime consistency checking without serious speed reductions. We don't release software unless it goes through extensive testing; so if you don't think that our testing could have missed the bug, it is probably hardware. === How can I put a label (like the one allowed by the <tt>-L</tt> option of <tt>mkfs.ext2</tt>) on a ReiserFS instance? === Currently, this feature is only implemented for the [[ReiserFS]] v3.6 disk format. Adding it to the v3.5 disk format would break the existing disk format, and there is not enough free space in the superblock.
You can set a label (and UUID) with a recent [[Reiser4progs|reiserfsprogs]] package on a [[ReiserFS]] v3.6 filesystem using the <tt>-l</tt> switch (<tt>-u</tt> for UUID) of the [[reiserfstune]] (for existing partitions) or [[mkreiserfs]] (for partitions being created) commands. Support for labels and UUIDs was integrated into [[Reiser4progs|reiserfsprogs]] starting from version 3.x.1a. === Why, when I'm working on files (i.e. having open files) on my laptop, does ReiserFS access the disk every 5 seconds? This effectively prevents the disk from spinning down, i.e. APM modes from taking over, even when I'm not writing anything. === Brent Graveland <bgraveland@hyperchip.com> answers: It's the atime update. Every time you run sync, the sync program's atime is updated. The next sync writes this atime update, then sync gets updated again... === RedHat does not unmount / with ReiserFS on halt. How do I fix it? === RedHat users kindly provided these patches (not tested by us): rc.sysinit.patch and halt.patch. Note that if you have RedHat Linux 7.2 or later, you do not need these patches. === How do I run programs from the reiserfsprogs package on encrypted devices? === In order to access such encrypted entities, you need to use the losetup tool to bind your entity to a loop device. === Are there any recommendations for or against particular hard drive manufacturers for use with reiserfs? === There is basically no preference; the general rule "the faster the drive and the lower the seek time, the better" applies as always. On the other hand, almost every hard drive manufacturer has a "widely known" broken series of hard drives. The most recent example is IBM's "Deskstar" series of disks, especially the DTLA models produced in Hungary in 2000-2001. These are known to fail very often, to the point that you probably don't want to use them even if you already paid for them. Other Deskstar drives also seem to be a poor choice.
IBM released a note that Deskstar drives should not run for more than 8 hours/day on average. These drives are also known to be very sensitive to temperature conditions and to fail on overheating. There is a class action lawsuit in progress against IBM over that drive series. === I am using RedHat 7.0 with gcc 2.96; why does ReiserFS seem unstable with it? === Use the most recent version of RedHat (gcc 2.96-85 or later with RedHat 7.2, although 7.1 is also okay for ReiserFS). The choice of an unstable, unreleased version of gcc 2.96 by RedHat as the default gcc was a Slashdot controversy. gcc 2.96 on RedHat 7.0 was unstable, and ReiserFS was one of the things that would fail for it. There are two gcc versions: 2.96 and 2.96-85. 2.96-85 works for ReiserFS, and the other (the one on RedHat 7.0) surely does not. Read the Linux kernel instructions about what compiler to use. The solution to code not working on broken compilers is the one RedHat has taken: fix the compiler. They fixed the compiler and thereby allowed the correctly compiled ReiserFS to work. === In my program I am using fsync(2) calls after each write to the file to guarantee the integrity of my file data, and this is very slow; how can I improve the performance? === Answer from Chris Mason: The main thing to remember is that fsyncs introduce a bunch of disk writes and force the FS to wait on the buffers. The key to keeping performance up is to make it easy for the FS to do as much as possible before the fsync call. So, if your application modifies 3 files, and you want to make sure all 3 changes are safely on disk:

 write(file1)
 write(file2)
 write(file3)
 fsync(file1)
 fsync(file2)
 fsync(file3)

is much faster than:

 write(file1)
 fsync(file1)
 write(file2)
 fsync(file2)
 write(file3)
 fsync(file3)

It is also faster to write over existing bytes in the file than it is to append new bytes onto the end of a file.
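Chris Mason's batching advice can be sketched at the syscall level in Python; this is a minimal illustration (the function names and the three-file setup are invented for the example), using the <tt>os</tt>-level calls that map directly onto <tt>write(2)</tt> and <tt>fsync(2)</tt>:

```python
import os
import tempfile

def write_then_fsync_batched(paths, payload):
    """Issue all writes first, then all fsyncs: each flush finds work
    already queued, so the FS can batch it into fewer transactions."""
    fds = [os.open(p, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
           for p in paths]
    for fd in fds:
        os.write(fd, payload)   # all writes issued before any flush
    for fd in fds:
        os.fsync(fd)            # flushes overlap the queued writes
    for fd in fds:
        os.close(fd)

def write_then_fsync_interleaved(paths, payload):
    """The slow pattern: every fsync blocks on a single file's buffers."""
    for p in paths:
        fd = os.open(p, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
        os.write(fd, payload)
        os.fsync(fd)            # one synchronous flush per file
        os.close(fd)
```

Both functions leave identical data on disk; only the ordering of the flushes differs, and that ordering is what determines how much work the filesystem can batch per flush.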
When you overwrite existing bytes in the file, you don't have to commit new metadata to disk on fsync(); the FS can just write the data blocks. This means fewer seeks. The more you write to a single file before calling fsync, the faster overall things will run.

 write(8k)
 fsync(file)

is much faster than:

 write(4k)
 fsync(file)
 write(4k)
 fsync(file)

Trying to optimize for those 3 things alone can make a huge performance difference overall. Answer from Josh MacDonald: You have to understand that even using fsync() after every write() makes no guarantees. If the system crashes during either the write or fsync operation, your data may be lost or corrupted. Suppose the fsync() does complete; does your application keep its data in multiple files? If that is the case and you need to write() to multiple files as part of a transaction, you have even greater problems. The only safe and easy way for you to implement some kind of transaction with the traditional file system guarantees is to use rename():

1. Keep all of your data in a single file.
2. Periodically write a complete copy of your database to a temporary file.
3. Rename the temporary file to the original database name.

(Addition from Nikita Danilov: One can implement something like a phase-tree at user level and use rename to atomically switch the root of the tree. This overcomes the "everything-in-one-file" limitation but has the added complexity of requiring crash recovery.) Answer from Nikita Danilov: Stop your development for now and wait until the reiser4 filesystem is released; it will have a transaction API exported to userspace. That transaction API would solve all of your problems. == Our program needs to access a lot of working files. What is the recommended way to organize files to get the best results out of ReiserFS? Should all the files be placed in a single directory, or should the files be spread across a directory tree to limit the number of files per directory?
Can you also summarize the relevant caching and locking effects? == Traditional file systems typically have poor performance when there are many files in a single directory, but not [[ReiserFS]]. These other file systems perform poorly because they use a linear search algorithm to find and replace entries in a directory. This means that the file system must scan, on average, half the blocks of a directory for every access. Typically, applications are required to work around this problem by manually structuring a tree of directories, allowing each individual directory to remain limited in size. For example, see how the Squid web proxy stores a large collection of files. ReiserFS does not have this problem because it uses an internal tree to store all directories and file metadata. Directory operations remain efficient even for very large directories, so you can write your application free from this performance concern. However, there are several issues that complicate this matter, namely locking and locality. The Linux VFS currently imposes locking restrictions that serialize many operations on directories, so if concurrent processes or threads will access the collection of files, then you may be better off using multiple directories. [[Reiser4]] will improve upon this restriction, although it is still under development. ReiserFS attempts to store all of the files in a directory, along with the directory itself, in nearby locations on disk. An application may exploit this spatial locality if it can predict which files will be accessed with temporal locality. You may be better off using multiple directories to store your files if you can predict that many files within a directory will be accessed at the same time. To summarize, ReiserFS supports efficient access to large directories, and most traditional file systems do not.
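The Squid-style workaround mentioned above, spreading files over a small tree of subdirectories so that no single directory grows large, can be sketched as follows. This is an illustrative example, not Squid's actual code; the two-level 16x256 fan-out and the function names are arbitrary choices:

```python
import hashlib
import os

def hashed_path(root, name, level1=16, level2=256):
    """Map a file name to root/<d1>/<d2>/<name>, spreading files evenly
    over level1*level2 small directories so no directory grows large."""
    h = int(hashlib.md5(name.encode()).hexdigest(), 16)
    d1 = "%02X" % (h % level1)
    d2 = "%02X" % ((h // level1) % level2)
    return os.path.join(root, d1, d2, name)

def store(root, name, data):
    """Write data under the hashed subdirectory, creating it on demand."""
    path = hashed_path(root, name)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as f:
        f.write(data)
    return path
```

On ReiserFS a single flat directory already performs well, so this layering is only worth the complexity for locking/locality reasons, or on filesystems that do linear directory scans.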
However, locking and locality issues may guide your decision to use manually structured directory trees instead, at least until ReiserFS exports control over packing locality to users and improves its locking. [[category:ReiserFS]] [[category:Reiser4]] This FAQ is very [[ReiserFS]] centric and often a bit dated. The [[Reiser4]] filesystem is mentioned as ''upcoming''. Be sure to search the [[mailinglists|mailing list archives]] and help update this FAQ - Thanks! __TOC__ === What are the specs for ReiserFS: maximum number of files, of files a directory can have, of sub-dirs in a dir, of links to a file, maximum file size, maximum filesystem size, etc.? === Specifications for [[ReiserFS]]:

{|cellpadding="5" cellspacing="0" border="1"
| '''property''' || '''3.5''' || '''3.6'''
|-
| max number of files || 2<sup>32</sup>-3 => 4 Gi - 3 || 2<sup>32</sup>-3 => 4 Gi - 3
|-
| max number of files a dir can have || 518701895 (but in practice this value is limited by the hash function; the r5 hash allows about 1 200 000 file names without collisions) || 2<sup>32</sup>-4 => 4 Gi - 4 (but in practice this value is limited by the hash function; the r5 hash allows about 1 200 000 file names without collisions)
|-
| max file size || 2<sup>31</sup>-1 => 2 Gi - 1 || 2<sup>60</sup> bytes => 1 Ei, but the page cache limits this to 8 Ti on architectures with 32-bit int
|-
| max number of links to a file || 2<sup>16</sup> => 64 Ki || 2<sup>32</sup> => 4 Gi
|-
| max filesystem size || 2<sup>32</sup> (4K) blocks => 16 Ti || 2<sup>32</sup> (4K) blocks => 16 Ti
|}

ReiserFS does '''meta-data journaling''', enabling fast crash recovery without the expense of full '''data journaling'''. There [ftp://ftp.suse.com/pub/people/mason/patches/intermezzo-alpha/ were] separate [http://marc.info/?l=reiserfs-devel&m=100895310422415&w=2 patches from Chris Mason] that implement full data journaling for ReiserFS for Linux 2.4.16.
'''Note''': Full data journaling is considered by many to be a good way to achieve file data integrity across system crashes. However, although file data may appear to be consistent from the kernel's point of view, since there is no API exported to userspace to control transactions, we may end up in a situation where the application makes two write requests (as part of one logical transaction) but only one of them gets journaled before the system crashes. From the application's point of view, we may then end up with inconsistent data in the file. Such issues should be addressed with the upcoming [[Reiser4]]. Such an API will be exported to userspace, and all programs that need transactions will be able to use it. === Mount fails after reiserfsck --rebuild-tree failure === When [[reiserfsck]] --rebuild-tree is run, the first thing it does is set the root inode value to -1. This makes the filesystem unmountable. (So, if [[reiserfsck]] fails later on because the filesystem contains serious errors, the filesystem cannot be mounted.) Therefore, once [[reiserfsck]] --rebuild-tree has failed for one of your filesystems, mounting of this partition is disabled. To correct the error, first check that you have the latest [[Reiser4progs|reiserfsprogs]] package installed. If that fails, please send a bug report to our [[mailinglists|mailing list]] and be ready to answer our questions. === Why is the execution time for a <tt>find . -type f | xargs cat {} \;</tt> command much longer when using ReiserFS than for the same command when using ext2? === This effect is observed if the measured file set was produced by untarring an archive that was not created from a ReiserFS partition (or by copying files from a non-ReiserFS partition, or by running a program that writes a bunch of files in some order).
This is because the <tt>readdir()</tt> operation performed on the ReiserFS partition returns filenames not in the original write order but rather in some hash order (dependent on the hash function used). Thus, when reading the files' contents, the hard drive heads must move when going from one file to another. If you want ReiserFS to outperform any other filesystem in your setup, here is one solution: copy the entire directory that you are not satisfied with to the same partition but under a different name (use <tt>cp -a</tt>), then remove the old directory and rename the new one to the old name. If the partition does not have enough space available, another approach is to <tt>tar(1)</tt> up the whole partition, clear it, and then untar the previously saved data. === Is quota support built into the vanilla 2.4 kernels for ReiserFS? === No. Quota support for the 2.4 branch of Linux kernels is bundled separately; the patches by Chris Mason were once available [ftp://ftp.suse.com/pub/people/mason/patches/reiserfs/quota-2.4/ at SuSE] (gone) and are still [http://gd.tuwien.ac.at/utils/fs/reiserfs/quota-patches/ mirrored at TU-Wien]. The reason these patches were not included in the 2.4 kernel branch is that they implement a new quota format and need new quota code too, which is too big a change for the 2.4 series of kernels. Various Linux distribution vendors (e.g. [http://www.suse.com SuSE]) do ship reiserfs-quota-enabled kernels, though.
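The repacking recipe from the <tt>readdir()</tt>-order answer above (copy with <tt>cp -a</tt>, remove the original, rename the copy back) can be sketched in Python. The function name is invented for the example; it assumes the directory and its temporary copy live on the same partition:

```python
import os
import shutil

def repack_directory(path):
    """Re-create a directory tree in place so its files are written out
    in readdir() order -- the effect of:
        cp -a dir dir.repack && rm -rf dir && mv dir.repack dir
    The partition temporarily needs free space for a full extra copy."""
    tmp = path + ".repack"
    shutil.copytree(path, tmp, symlinks=True)  # like cp -a: recursive, keeps symlinks
    shutil.rmtree(path)                        # drop the fragmented original
    os.rename(tmp, path)                       # same-partition rename is cheap
```

Note that between the <tt>rmtree</tt> and the <tt>rename</tt> the data exists only in the copy, so this should not be run on a tree you cannot afford to restore from backup.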
=== I am getting some errors in my kernel logs that I do not know how to interpret === Messages like:

 vs-13070: reiserfs_read_inode2: i/o failure occurred trying to find stat data of [1718696 1718710 0x0 SD]
 zam-7001: io error in reiserfs_find_entry

most likely accompanied by samples like those below are definite signs of hard disk problems (bad sectors):

 hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
 hda: dma_intr: error=0x40 { UncorrectableError }, LBAsect=6599945, sector=4286584
 end_request: I/O error, dev 03:03 (hda), sector 4286584

or

 scsi0: ERROR on channel 0, id 1, lun 0, CDB: Read (10) 00 00 01 ee 60 00 00 08 00
 Current sd 08:00: sense key Medium Error

or

 I/O error: dev 08:21, sector 65704

Messages about <tt>"access beyond end of device"</tt> may have many different causes, ranging from not rebooting after fdisk requested it, to unfinished resizings, to data corruption. The following messages mean you have a noisy IDE cable, or one of too low quality for the chosen UDMA mode. Try to replace the cable with a better one, or choose a slower UDMA mode:

 hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
 hda: dma_intr: error=0x84 { DriveStatusError BadCRC }
 hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
 hda: dma_intr: error=0x84 { DriveStatusError BadCRC }

If you see any message from [[ReiserFS]] that you cannot interpret and there is nothing similar to the messages above around it, [[mailinglists|mail the message to us]] and we will explain it to you. === Will ReiserFS implement streams, extended attributes, etc.? === [[FAQ/streams|Here]] is the one-page answer. === Reiserfs appears to be very slow while the RAID is resyncing. Mounting takes several minutes. Once mounted, an <tt>ls(1)</tt> in the mounted directory hangs. Forever. Once the RAID is synced, things appear to work pretty well. How can that be fixed? === First of all, a patch that helps mount the drive faster has been included in the Linux kernel since 2.4.19.
You can grab the patch for earlier kernels [http://gd.tuwien.ac.at/utils/fs/reiserfs/reiserfs-for-2.5/2.5.4.pending/07-reiserfs-bitmap-journal-read-ahead.diff here]. Also RAID drivers have '''minimal guaranteed''' and '''maximal possible''' RAID rebuild bandwidth usage. These valueas are controlled through <tt>/proc/sys/dev/raid/speed_limit_min</tt> and <tt>/proc/sys/dev/raid/speed_limit_max</tt> sysctl variables (values are in 100 KiB/s). It seems that RAID logic cannot always understand if the disk sysbsystem busy or not at a given time. When it thinks disk subsystem is idle, it tries to rebuild the raid array at <tt>speed_limit_max</tt> speed which defaults to 100 MB per second. Decrease this value to something more suitable (a bit of experimentation might be needed). === I get attempt to read past the end of the partition error messages; is ReiserFS corrupted? === You changed your partition sizes, and then before rebooting ran [[mkreiserfs]]. The kernel does not change its belief in what the partition sizes are until reboot time. (This is fixable, but nobody has fixed it as of Dec. 2001). [[mkreiserfs]] created a filesystem that has the wrong notion of how large the partition it is on is. The filesystem's notion of what the partition boundaries are will last past reboot even though the kernel's notion will change. So yes, it is corrupted. Also some other kinds of metadata breakage can lead to such messages. === Can I use VMware with ReiserFS? === VMware was tested on [http://www.suse.com/ SuSE Linux] with [http://support.microsoft.com/gp/lifean18 Windows98] Guest OS on a [[ReiserFS]] partition. There's one trick at the beginning: the following line was added to the VMware config file host.FSSupportLocking1 = 0x52654973 # (0x52654973 == *(u32 *) "ReIs") Thanks to [mailto:gkade@bigbrother.net Gregory K. Ade] for this hint. === How do I install Debian potato with ReiserFS as root partition? 
=== [[FAQ/potato_part|Here]] are instructions by [mailto:LeBlanc@mcc.ac.uk Dr. A.V. Le Blanc]. === Starting with linux kernel v2.4.21 I cannot mount my FS anymore. Why? === Special sanity checks were added to kernel code to prohibit mounting of filesystems that are bigger then underlying block device. If you now see this message on mount: Filesystem on xx:yy cannot be mounted because it is bigger than the device you may need to run fsck or increase size of your LVM partition. Or may be you forgot to reboot after fdisk when it told you to If you do not use LVM, that usually means you need to run <tt>[[reiserfsck]] --rebuild-sb</tt> on your filesystem and agree to change its default size to proposed one. === Is it ok to use ReiserFS on a small size storage device: e.g. 16MB NAND flash block device? === [[FAQ/small_blocks|Here]] are instructions. === How do I change root from ext2 to ReiserFS without loss of data? === [[FAQ/change_fs|Here]] are instructions. === <tt>mount: /dev/hda5 has wrong major or minor number</tt> - what does that mean? === The kernel does not know anything about [[ReiserFS]], it is neither compiled in nor available as a module. === Will it be possible to read/write ReiserFS partitions created now with future versions of ReiserFS? === Yes. [[ReiserFS]]-3.6.x (Linux-2.4.x) works with both the old (3.5) and the new (3.6) formats. ReiserFS-3.5.x (Linux-2.2.x) can only work with the old (3.5) disk-format. There is no way to convert the new (3.6) disk-format to the old (3.5), but the old (3.5) format could be converted to the new one (3.6) with the <tt>"-o conv</tt> [[mount|mount option]]. === The ReiserFS module doesn't insert properly - why? === After applying the patch, ''recompile'' the whole kernel including the modules target, reboot, then try to insert the module. === Can I use ReiserFS with the software RAID? === Yes, for all RAID levels using any Linux >= 2.4.1, but '''DO NOT''' use RAID with Linux 2.2.x. 
Our journaling and their RAID code step on each other in the buffering code. Also, mirroring is '''not''' safe in the 2.2.x kernels because online mirror rebuilds in 2.2.x break the write ordering requirements for the log. If you crash in the middle of an online rebuild, your meta-data may be corrupted. The only RAID level that is safe with [[ReiserFS]] in the 2.2.x kernels is the striping/concatenation level. === Can I use ReiserFS with 3ware RAID? === Yes, but you need to use Linux 2.2.19 or later for reasons other than [[ReiserFS]]. Also if you should encounter problems you should be suspicious that it might not be ReiserFS that has the bug. In [http://web.archive.org/web/20030415160519/http://www.3ware.com/support/raid5techbulletin.shtml special instructions]. (archive.org) === Why do things freeze on my IDE hard drive for annoying amounts of time? === Because when large writes are scheduled all at once, reads can starve. A fix for this is evolving; the later your ReiserFS patch, the better we handle this. === <tt>du(1)</tt> says ReiserFS makes space efficiency worse. === Use <tt>df(1)</tt> not <tt>du(1)</tt>, or use ''raw'' option for <tt>du(1)</tt> if it's supported. <tt>st_blocks</tt> summed up is less accurate than <tt>st_size</tt> for [[ReiserFS]] because we pack tails, and <tt>st_blocks</tt> rounds numbers up. === <tt>mkreiserfs(8)</tt> fails after repartitioning === The kernel requires you to reboot after repartitioning (for all filesystems). We intend to fix that. === Performance is poor, and my disk at 96% full still has free space. === Once a disk drive gets more than 85% full, the performance starts to suffer unless using a repacker (which isn't implemented yet.) You can probably get away with 92%, but if performance is valued you are making a mistake to keep it any fuller. This is true for almost all filesystems. 
[[ReiserFS]], because of our packing tails together, pack more data into a given percentage used, but it still is subject to the rules for max recommended percentage used. If you create the whole disk with one copy and then mount it read-only, then you can fully pack it without problem. Please be sure that you copy it from (or <tt>tar</tt> it from) a reiserfs partition so that files are created in reiserfs <tt>readdir()</tt> order as this will improve performance. === Why do I get a signal 11 when compiling the kernel using ReiserFS and not ext2? === Your CPU is overheating and/or you have [http://www.bitwizard.nl/sig11/ bad RAM]. === But it doesn't happen with ext2? === ext2 uses less heat sensitive gates in the CPU. :-) Seriously, ext2 and ReiserFS contain random differences, and overheating and bad RAM have random sensitivities. (Signal 11 is not due to ReiserFS. One user had a cable blocking the fan; it did not affect ext2, but it wasn't until he fixed the cable-fan problem that ReiserFS worked ...) === Can I use ReiserFS on other architectures than i386? === Yes, starting from the Linux kernel 2.4.13, ReiserFS can be run on any Linux supported arch. === I need a program which will help me in rebuilding/recreating my partition table. === gpart ( http://brzitwa.de/mb/gpart/index.html) is a utility that handles ext2, FAT, Linux swap, HPFS, NTFS, FreeBSD and Solaris/x86 disklabels, Minix, ReiserFS; it prints a proposed contents for the primary partition table and is well-documented. === What partition type should I use for ReiserFS? === Linux native filesystem (83) === Can I use 32GB+ IDE Hard Drives with ReiserFS? === Yes if you use Linux kernel 2.4 and up. === What about resizing ReiserFS? === Please follow this link. === What should I put into the fifth (aka dump, fs_freq ) and the sixth (aka pass, fs_passno ) fields of /etc/fstab for ReiserFS filesystems? === 0 0 === Why are ReiserFS filesystems not fscked on reboot after a crash? 
=== Because ReiserFS provides journalling of meta-data. After a crash, the consistency of a filesystem is restored by replaying the transaction log. === Can I interactively repair a filesystem that was corrupted (due to an internal bug in the kernel or to a hardware fault)? === man [[reiserfsck]] === Can I use "dump" and "restore" with ReiserFS? Any caveats? === No. dump uses knowledge of the internal structure of ext2 and works together with restore, which also uses ext2-specific knowledge, to back up ext2 files. dump and restore are specific to ext2 and will not work with ReiserFS. To back up ReiserFS files use tar, which is universal and can be applied to almost any reasonable Linux filesystem. It is well known among system administrators that dump is more complete than unix tar, and that there is quite a list of things that unix tar will fail to back up properly. This is not true of Gnu tar, which is quite complete. Basically, the only real disadvantage of Gnu tar compared to dump is speed. Unfortunately, because it shares the same name as unix tar, people are reluctant to believe this. (Yes, the Gnu version has incremental backups, etc.) We will performance-optimize ReiserFS backups for you (and the rest of the world) for $30k, which is not a lot if you are a large site spending a few hundred thousand on equipment for backups. === Does ReiserFS support snapshots? === No, but you can create ReiserFS on top of an LVM logical volume and use LVM snapshot capabilities. === Can I check reiserfs filesystems for errors without unmounting them? === [[reiserfsck]] in checking mode may be run over filesystems mounted read-only. There is no official way to fix mounted filesystems, though. You MUST completely unmount your filesystem in order to have it fixed. If you have LVM, you can check the consistency of filesystems mounted read-write; here is the script contributed by Andreas Dilger: === What ReiserFS mount options should I use to get the best performance for a mail server?
=== Craig Sanders answered in detail: "By the time I got around to running bonnie, the postmark and postal benchmarks had convinced me that notail was essential.

host system:
* Debian GNU/Linux (of course :)
* Linux kernel 2.4.2 with the latest 20010305 ReiserFS patch
* dual P3-866 (256K cache)
* 512MB RAM
* Adaptec 19160 SCSI Controller

external drive box:
* Domex 8230u RAID controller, 32MB battery-backed cache
* 6 x 18GB IBM DDYS-T18350M drives

For this particular hardware I was using, reiserfs/notail on RAID5 was the clear performance winner for a mail server with lots of synced random I/O." === Does using ReiserFS mean I can just press the power off button without running "shutdown" or "init 0," etc? Does it mean there is no risk of data loss? === No, definitely not. As of now, ReiserFS only provides meta-data journaling: it records which files have been created or opened, whether they have had their size changed, or where they have been relocated. It guarantees that the structure of the internal ReiserFS tree will be correct, thereby allowing you to start back up after an unclean shutdown without having to run fsck on all the files that have not been changed. Data in files that were being used at the time of the crash could have been corrupted. This is usual for most filesystems. Data-journaling filesystems guarantee that there will be no garbage written into a file, but they don't guarantee that a file update will be completed. (Only reiser4 guarantees that filesystem operations are performed as atomic operations, and provides atomic transaction functionality.) ReiserFS V3 does not guarantee that the file contents themselves are uncorrupted, nor that no data is lost. Moreover, even if all of your system is on ReiserFS, many system components (like daemons, database managers, etc.) require the shutdown procedure for proper functioning. However, there is a separate implementation of data logging that will soon go into the mainstream kernel.
You should be able to get it from ftp.suse.com/pub/people/mason/patches/data-logging === How does ReiserFS support bad block handling? === See here. === I have a motherboard with VIA MVP3 chipset and experience ReiserFS problems. === William Oster <woster73@yahoo.com> answers: If you are using a motherboard with a VIA MVP3 chipset, you may have ReiserFS problems caused by the way your kernel is configured for the so-called "pci quirks". My experience is with kernels 2.2.18 and 2.2.19, but it may affect the 2.4.x series too if you are using the MVP3 chipset (popular in socket 7 type motherboards, such as those used by the AMD K6 and classic Pentium). I've confirmed this problem with several motherboards using the VIA MVP3 chipset, ReiserFS 3.5.29 to 3.5.32, and NCR 53c8xx SCSI. But please note: it probably affects any controller which uses DMA and PCI bus mastering. Problems which I was inclined to attribute to ReiserFS were actually problems with this kernel [mis]configuration. If you fit this profile, DO NOT enable the "pci quirks" configuration option in the /usr/src/linux/.config file. Although the Linux documentation suggests that this option can be enabled if in doubt, DO NOT enable it. It was never intended for the VIA MVP3 chipset anyway. It affects the way DMA is handled, and the combination with ReiserFS (and possibly NCR SCSI) can cause random disk corruption which will eventually result in ReiserFS and/or SCSI errors. Evidently ReiserFS exercises the DMA and SCSI bus very thoroughly. The problems seem not to be as likely under the ext2 filesystem. Check your /usr/src/linux/.config file. You are SAFE from this problem if you find this line:

 # CONFIG_PCI_QUIRKS is not set

Any other setting could be dangerous to MVP3 chipset ReiserFS users, especially when using PCI bus mastering controllers such as the NCR 53c8xx series. Re-configure your kernel to disable the "pci quirks" option, then make dep, rebuild, and reinstall.
=== I am having extensive problems using ReiserFS; it seems to have bugs all over the place. I'm not compiling with a buggy compiler. What is happening? How can this be stable? === You have hardware problems. Really, you do. Even if the bugs don't show up with ext2, you have hardware problems. (See the FAQ question about ReiserFS running 3°C hotter than ext2.) Most SuSE users use ReiserFS. Obscure bugs probably still exist; but if you find bugs as easily as when using Windows, you have bad RAM, a bad CPU, a bad cable, bad cooling, a VIA chipset with PCI quirks turned on, or other hardware or other software-layer bugs. ReiserFS is stable. You can be sure that if the bugs are encountered easily and commonly with normal usage patterns, it is not us. This does not mean that the next release won't somehow break something though :-/..... Real bug reports are, at the time of writing, outnumbered 10 to 1 by hardware bugs that trigger error messages. We are working on making our error messages better at catching hardware bugs and identifying them as such. There is only so far we can go, though, in runtime consistency checking without serious speed reductions. We don't release software unless it goes through extensive testing; so if you don't think that our testing could have missed the bug, it is probably hardware. === How can I put a label (like allowed by the <tt>-L</tt> option of <tt>mkfs.ext2</tt>) on a ReiserFS instance? === Currently, this feature is only implemented for the [[ReiserFS]] v3.6 disk format. Adding it to the v3.5 disk format would break the existing disk format, and there is not enough free space in the superblock. You can set a label (and UUID) with a recent [[Reiser4progs|reiserfsprogs]] package on a [[ReiserFS]] v3.6 filesystem using the <tt>-l</tt> switch (<tt>-u</tt> for UUID) to the [[reiserfstune]] (for existing partitions) or [[mkreiserfs]] (for partitions being created) commands. Support for labels and UUIDs was integrated into [[Reiser4progs|reiserfsprogs]] starting from version 3.x.1a.
=== Why, when I'm working on files (i.e. having open files) on my laptop, does ReiserFS access the disk every 5 seconds? This effectively prevents the disk from spinning down, i.e. APM modes from taking over, even when I'm not writing anything. === Brent Graveland <bgraveland@hyperchip.com> answers: It's the atime update. Every time you run sync, the sync program's atime is updated. The next sync writes this atime update, then sync gets updated again... === RedHat does not unmount / with ReiserFS on halt. How do I fix it? === RedHat users kindly provided these patches (not tested by us): rc.sysinit.patch and halt.patch. Note that if you have RedHat Linux 7.2 or later, you do not need these patches. === How do I run programs from the reiserfsprogs package on encrypted devices? === In order to access such encrypted entities you need to use the losetup tool to bind your entity to a loop device. === Are there any recommendations for or against any particular hard drive manufacturer for use with reiserfs? === There is basically no preference; the general rule "the faster the drive and the lower the seek time, the better" applies as always. On the other hand, almost every hard drive manufacturer has a "widely known" broken series of hard drives. The most recent example is IBM's "Deskstar" series of disks, especially the DTLA models produced in Hungary in 2000-2001. These are known to fail very often, to the point that you probably don't want to use them even if you already paid for them. Other Deskstar drives also seem to be not a very good choice. IBM released a note that Deskstar drives should not run for more than 8 hours/day on average. These drives are also known to be very sensitive to temperature conditions and to fail on overheating. A class-action lawsuit against IBM over that drive series is in progress. === I am using RedHat 7.0 with gcc 2.96; why does ReiserFS seem unstable with it?
=== Use the most recent version of RedHat (gcc 2.96-85 or later, as shipped with RedHat 7.2, although 7.1 is also okay for ReiserFS). The choice of an unstable, unreleased version of gcc 2.96 by RedHat as the default gcc was a Slashdot controversy. gcc 2.96 on RedHat 7.0 was unstable, and ReiserFS was one of the things that would fail for it. There are two gcc versions: 2.96 and 2.96-85. 2.96-85 works for ReiserFS, and the other (the one on RedHat 7.0) surely does not. Read the Linux kernel instructions about what compiler to use. The solution to code not working on broken compilers is the one RedHat has taken: fix the compiler. They fixed the compiler and thereby allowed the correctly compiled ReiserFS to work. === In my program I am using fsync(2) calls after each write to the file to guarantee integrity of my file data, and this is very slow; how can I improve the performance? === Answer from Chris Mason: The main thing to remember is that fsyncs introduce a bunch of disk writes, and force the FS to wait on the buffers. The key to keeping performance up is to make it easy for the FS to do as much as possible before the fsync call. So, if your application modifies 3 files, and you want to make sure all 3 changes are safely on disk:

 write(file1)
 write(file2)
 write(file3)
 fsync(file1)
 fsync(file2)
 fsync(file3)

is much faster than:

 write(file1)
 fsync(file1)
 write(file2)
 fsync(file2)
 write(file3)
 fsync(file3)

It is also faster to write over existing bytes in the file than it is to append new bytes onto the end of a file. When you overwrite existing bytes in the file, you don't have to commit new metadata to disk on fsync(); the FS can just write the data blocks. This means fewer seeks. The more you write to a single file before calling fsync, the faster overall things will run:

 write(8k)
 fsync(file)

is much faster than:

 write(4k)
 fsync(file)
 write(4k)
 fsync(file)

Trying to optimize for those 3 things alone can make a huge performance difference overall.
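The batching advice above can be sketched with plain POSIX calls. This is an illustrative sketch, not ReiserFS code; <tt>write_then_fsync</tt> is a hypothetical helper name chosen for this example:

```c
#include <string.h>
#include <unistd.h>

/* Hypothetical helper illustrating the advice above: issue all the
 * writes first, then all the fsyncs.  Batching the writes lets the
 * filesystem gather the dirty blocks of every file into fewer, larger
 * disk transactions than a write/fsync/write/fsync sequence would.
 * Returns 0 on success, -1 on the first failed write or fsync. */
static int write_then_fsync(int *fds, int nfds, const char *data)
{
    size_t len = strlen(data);

    for (int i = 0; i < nfds; i++)        /* phase 1: all writes */
        if (write(fds[i], data, len) != (ssize_t)len)
            return -1;

    for (int i = 0; i < nfds; i++)        /* phase 2: all fsyncs */
        if (fsync(fds[i]) != 0)
            return -1;

    return 0;
}
```

The same shape applies within a single file: accumulating several small updates into one larger write() before the single fsync() avoids repeating the metadata commit for every call.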
Answer from Josh MacDonald: You have to understand that even using fsync() after every write() makes no guarantees. If the system crashes during either the write or the fsync operation, your data may be lost or corrupted. Suppose the fsync() does complete: does your application keep its data in multiple files? If that is the case and you need to write() to multiple files as part of a transaction, you have even greater problems. The only safe and easy way for you to implement some kind of transaction with the traditional file system guarantees is to use rename():

# Keep all of your data in a single file.
# Periodically write a complete copy of your database to a temporary file.
# Rename the temporary file to the original database name.

(Addition from Nikita Danilov: One can implement something like a phase-tree at user level and use rename to atomically switch the root of the tree. This overcomes the "everything-in-one-file" limitation but has the added complexity of requiring crash recovery.) Answer from Nikita Danilov: Stop your development for now and wait until the reiser4 filesystem is released; it will have a transaction API exported to userspace. That transaction API would solve all of your problems. === Our program needs to access a lot of working files. What is the recommended way to organize files to get the best results out of ReiserFS? Should all the files be placed in a single directory, or should the files be spread across a directory tree to limit the number of files per directory? Can you also summarize the relevant caching and locking effects? === Traditional file systems typically have poor performance when there are many files in a single directory, but not [[ReiserFS]]. These other file systems perform poorly because they use a linear search algorithm to find and replace entries in a directory. This means that the file system must scan, on average, half the blocks of a directory for every access.
Typically, applications are required to work around this problem by manually structuring a tree of directories, allowing each individual directory to remain limited in size. For example, see how the Squid web proxy stores a large collection of files. ReiserFS does not have this problem because it uses an internal tree to store all directories and file metadata. Directory operations remain efficient even for very large directories, so you can write your application free from this performance concern. However, there are several issues that complicate this matter: namely locking and locality. The Linux VFS currently imposes locking restrictions that serialize many operations on directories, so if concurrent processes or threads will access the collection of files then you may be better off using multiple directories. [[Reiser4]] will improve upon this restriction, although it is still under development. ReiserFS attempts to store all of the files in a directory, along with the directory itself, in nearby locations on disk. An application may exploit this spatial locality if it can predict which files will be accessed with temporal locality. You may be better off using multiple directories to store your files if you can predict that many files within a directory will be accessed at the same time. To summarize, ReiserFS supports efficient access to large directories and most traditional file systems do not. However, locking and locality issues may guide your decision to use manually structured directory trees instead, at least until ReiserFS exports control over packing locality to users, and improves its locking. [[category:ReiserFS]] [[category:Reiser4]] This FAQ is very [[ReiserFS]] centric and often a bit dated. The [[Reiser4]] filesystem is mentioned as ''upcoming''. Be sure to search the [[mailinglists|mailing list archives]] and help update this FAQ - Thanks!
__TOC__ === What are the specs for ReiserFS: maximum number of files, of files a directory can have, of sub-dirs in a dir, of links to a file, maximum file size, maximum filesystem size, etc.? === Specifications for [[ReiserFS]]:

{| cellpadding="5" cellspacing="0" border="1"
| '''property''' || '''3.5''' || '''3.6'''
|-
| max number of files || 2<sup>32</sup>-3 => 4 Gi - 3 || 2<sup>32</sup>-3 => 4 Gi - 3
|-
| max number of files a dir can have || 518701895 (but in practice this value is limited by the hash function; the r5 hash allows about 1,200,000 file names without collisions) || 2<sup>32</sup>-4 => 4 Gi - 4 (but in practice this value is limited by the hash function; the r5 hash allows about 1,200,000 file names without collisions)
|-
| max file size || 2<sup>31</sup>-1 => 2 Gi - 1 || 2<sup>60</sup> bytes => 1 Ei, but the page cache limits this to 8 Ti on architectures with a 32-bit int
|-
| max number of links to a file || 2<sup>16</sup> => 64 Ki || 2<sup>32</sup> => 4 Gi
|-
| max filesystem size || 2<sup>32</sup> (4K) blocks => 16 Ti || 2<sup>32</sup> (4K) blocks => 16 Ti
|}

ReiserFS does '''meta-data journaling''', enabling fast crash recovery without the expense of full '''data journaling'''. There [ftp://ftp.suse.com/pub/people/mason/patches/intermezzo-alpha/ were] separate [http://marc.info/?l=reiserfs-devel&m=100895310422415&w=2 patches from Chris Mason] that implemented full data journaling for ReiserFS on Linux 2.4.16. '''Note''': Full data journaling is considered by many to be a good way to achieve file data integrity across system crashes. However, although file data may appear to be consistent from the kernel point of view, since there is no API exported to userspace to control transactions, we may end up in a situation where the application makes two write requests (as part of one logical transaction) but only one of these gets journaled before the system crashes. From the application point of view, we may then end up with inconsistent data in the file. Such issues should be addressed with the upcoming [[Reiser4]].
Such an API will be exported to userspace, and all programs that need transactions will be able to use it. === Mount fails after reiserfsck --rebuild-tree failure === When [[reiserfsck]] --rebuild-tree is run, the first thing it does is set the root inode value to -1. This makes the filesystem unmountable. (So if [[reiserfsck]] fails later on because the filesystem contains serious errors, the filesystem cannot be mounted.) Therefore, once [[reiserfsck]] --rebuild-tree has failed for one of your filesystems, mounting of that partition is disabled. To correct the error, first check that you have the latest [[Reiser4progs|reiserfsprogs]] package installed. If that fails, please send a bug report to our [[mailinglists|mailing list]] and be ready to answer our questions. === Why is the execution time for a <tt>find . -type f | xargs cat</tt> command much longer when using ReiserFS than for the same command when using ext2? === This effect is observed if the measured file set was produced by untarring an archive that was not created from a ReiserFS partition (or by copying files from a non-ReiserFS partition, or by running a program that writes a bunch of files in some order). This is because the <tt>readdir()</tt> operation performed on the ReiserFS partition returns filenames not in the original write order but rather in some hash order (dependent on the hash function used). Thus, when reading the files' contents, the hard drive heads must move when going from one file to another. If you want ReiserFS to outperform any other filesystem in your setup, here is one solution: Copy the entire directory that you are not satisfied with to the same partition but with a different name (use <tt>cp -a</tt>), then remove the old directory and rename the new one with the old name. If the partition does not have enough space available, another approach is to <tt>tar(1)</tt> up the whole partition, clear it, and then untar the previously saved data.
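The hash-order behaviour described above is easy to observe with a short C program; this is a generic POSIX sketch (not tied to ReiserFS internals), and <tt>list_in_readdir_order</tt> is a name invented for this example. On a ReiserFS partition the names come back in hash order rather than creation order, which is also the order <tt>cp -a</tt> and <tt>tar</tt> visit files — hence the repacking trick:

```c
#include <stdio.h>
#include <dirent.h>

/* Print directory entries in exactly the order readdir() returns
 * them.  On ReiserFS this is hash order, not creation order, so
 * recreating a file set in this order lays files out the way later
 * sequential reads will visit them.
 * Returns the number of entries, or -1 on error. */
static int list_in_readdir_order(const char *path)
{
    DIR *dir = opendir(path);
    if (dir == NULL)
        return -1;

    int count = 0;
    struct dirent *entry;
    while ((entry = readdir(dir)) != NULL) {
        printf("%s\n", entry->d_name);
        count++;
    }
    closedir(dir);
    return count;
}
```

Comparing this program's output on the original and on a freshly repacked copy of a directory shows whether the on-disk order already matches the readdir() order.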
=== Is quota support built into the vanilla 2.4 kernels for ReiserFS? === No. Quota support patches for the 2.4 branch of Linux kernels are bundled separately; written by Chris Mason, they were once available [ftp://ftp.suse.com/pub/people/mason/patches/reiserfs/quota-2.4/ at SuSE] (gone) and are still [http://gd.tuwien.ac.at/utils/fs/reiserfs/quota-patches/ mirrored at TU-Wien]. The reason these patches were not included in the 2.4 kernel branch is that they implement a new quota format and need new quota code too, which is too big a change for the 2.4 series of kernels. Various Linux distribution vendors (e.g. [http://www.suse.com SuSE]) do ship reiserfs-quota-enabled kernels, though. === I am getting some errors in my kernel logs that I do not know how to interpret === Messages like:

 vs-13070: reiserfs_read_inode2: i/o failure occurred trying to find stat data of [1718696 1718710 0x0 SD]
 zam-7001: io error in reiserfs_find_entry

most likely accompanied by samples like those below are definite signs of hard disk problems (bad sectors):

 hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
 hda: dma_intr: error=0x40 { UncorrectableError }, LBAsect=6599945, sector=4286584
 end_request: I/O error, dev 03:03 (hda), sector 4286584

or

 scsi0: ERROR on channel 0, id 1, lun 0, CDB: Read (10) 00 00 01 ee 60 00 00 08 00
 Current sd 08:00: sense key Medium Error

or

 I/O error: dev 08:21, sector 65704

Messages about <tt>"access beyond end of device"</tt> may have lots of different causes, ranging from not rebooting after fdisk requested it, to unfinished resizings, to data corruption. The following messages mean you have a noisy IDE cable, or that it is just too low quality for the chosen UDMA mode.
Try to replace the cable with a better one, or choose a slower UDMA mode:

 hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
 hda: dma_intr: error=0x84 { DriveStatusError BadCRC }
 hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
 hda: dma_intr: error=0x84 { DriveStatusError BadCRC }

If you see any message from [[ReiserFS]] that you cannot interpret and there is nothing similar to the messages above around it, [[mailinglists|mail the message to us]] and we will explain it to you. === Will ReiserFS implement streams, extended attributes, etc.? === [[FAQ/streams|Here]] is the one-page answer. === Reiserfs appears to be very slow while the RAID is resyncing. Mounting takes several minutes. Once mounted, an <tt>ls(1)</tt> in the mounted directory hangs. Forever. Once the RAID is sync'ed, things appear to work pretty well. How can that be fixed? === First of all, a patch that makes mounting the drive faster has been included in the Linux kernel since 2.4.19. You can grab the patch for earlier kernels [http://gd.tuwien.ac.at/utils/fs/reiserfs/reiserfs-for-2.5/2.5.4.pending/07-reiserfs-bitmap-journal-read-ahead.diff here]. Also, RAID drivers have '''minimal guaranteed''' and '''maximal possible''' RAID rebuild bandwidth usage. These values are controlled through the <tt>/proc/sys/dev/raid/speed_limit_min</tt> and <tt>/proc/sys/dev/raid/speed_limit_max</tt> sysctl variables (values are in KiB/s). It seems that the RAID logic cannot always tell whether the disk subsystem is busy at a given time. When it thinks the disk subsystem is idle, it tries to rebuild the RAID array at the <tt>speed_limit_max</tt> speed, which defaults to 100 MB per second. Decrease this value to something more suitable (a bit of experimentation might be needed). === I get <tt>attempt to read past the end of the partition</tt> error messages; is ReiserFS corrupted? === You changed your partition sizes, and then before rebooting ran [[mkreiserfs]].
The kernel does not change its belief in what the partition sizes are until reboot time. (This is fixable, but nobody has fixed it as of Dec. 2001.) [[mkreiserfs]] created a filesystem that has the wrong notion of how large the partition it is on is. The filesystem's notion of what the partition boundaries are will last past reboot even though the kernel's notion will change. So yes, it is corrupted. Some other kinds of metadata breakage can also lead to such messages. === Can I use VMware with ReiserFS? === VMware was tested on [http://www.suse.com/ SuSE Linux] with a [http://support.microsoft.com/gp/lifean18 Windows98] guest OS on a [[ReiserFS]] partition. There's one trick at the beginning: the following line was added to the VMware config file:

 host.FSSupportLocking1 = 0x52654973 # (0x52654973 == *(u32 *) "ReIs")

Thanks to [mailto:gkade@bigbrother.net Gregory K. Ade] for this hint. === How do I install Debian potato with ReiserFS as root partition? === [[FAQ/potato_part|Here]] are instructions by [mailto:LeBlanc@mcc.ac.uk Dr. A.V. Le Blanc]. === Starting with Linux kernel v2.4.21 I cannot mount my FS anymore. Why? === Special sanity checks were added to the kernel code to prohibit mounting of filesystems that are bigger than the underlying block device. If you now see this message on mount:

 Filesystem on xx:yy cannot be mounted because it is bigger than the device

you may need to run fsck or increase the size of your LVM partition. Or maybe you forgot to reboot after fdisk told you to. If you do not use LVM, that usually means you need to run <tt>[[reiserfsck]] --rebuild-sb</tt> on your filesystem and agree to change its default size to the proposed one. === Is it ok to use ReiserFS on a small storage device, e.g. a 16MB NAND flash block device? === [[FAQ/small_blocks|Here]] are instructions. === How do I change root from ext2 to ReiserFS without loss of data? === [[FAQ/change_fs|Here]] are instructions.
=== <tt>mount: /dev/hda5 has wrong major or minor number</tt> - what does that mean? === The kernel does not know anything about [[ReiserFS]], it is neither compiled in nor available as a module. === Will it be possible to read/write ReiserFS partitions created now with future versions of ReiserFS? === Yes. [[ReiserFS]]-3.6.x (Linux-2.4.x) works with both the old (3.5) and the new (3.6) formats. ReiserFS-3.5.x (Linux-2.2.x) can only work with the old (3.5) disk-format. There is no way to convert the new (3.6) disk-format to the old (3.5), but the old (3.5) format could be converted to the new one (3.6) with the <tt>"-o conv</tt> [[mount|mount option]]. === The ReiserFS module doesn't insert properly - why? === After applying the patch, ''recompile'' the whole kernel including the modules target, reboot, then try to insert the module. === Can I use ReiserFS with the software RAID? === Yes, for all RAID levels using any Linux >= 2.4.1, but '''DO NOT''' use RAID with Linux 2.2.x. Our journaling and their RAID code step on each other in the buffering code. Also, mirroring is '''not''' safe in the 2.2.x kernels because online mirror rebuilds in 2.2.x break the write ordering requirements for the log. If you crash in the middle of an online rebuild, your meta-data may be corrupted. The only RAID level that is safe with [[ReiserFS]] in the 2.2.x kernels is the striping/concatenation level. === Can I use ReiserFS with 3ware RAID? === Yes, but you need to use Linux 2.2.19 or later for reasons other than [[ReiserFS]]. Also if you should encounter problems you should be suspicious that it might not be ReiserFS that has the bug. In [http://web.archive.org/web/20030415160519/http://www.3ware.com/support/raid5techbulletin.shtml special instructions]. (archive.org) === Why do things freeze on my IDE hard drive for annoying amounts of time? === Because when large writes are scheduled all at once, reads can starve. 
A fix for this is evolving; the later your ReiserFS patch, the better we handle this. === <tt>du(1)</tt> says ReiserFS makes space efficiency worse. === Use <tt>df(1)</tt> not <tt>du(1)</tt>, or use ''raw'' option for <tt>du(1)</tt> if it's supported. <tt>st_blocks</tt> summed up is less accurate than <tt>st_size</tt> for [[ReiserFS]] because we pack tails, and <tt>st_blocks</tt> rounds numbers up. === <tt>mkreiserfs(8)</tt> fails after repartitioning === The kernel requires you to reboot after repartitioning (for all filesystems). We intend to fix that. === Performance is poor, and my disk at 96% full still has free space. === Once a disk drive gets more than 85% full, the performance starts to suffer unless using a repacker (which isn't implemented yet.) You can probably get away with 92%, but if performance is valued you are making a mistake to keep it any fuller. This is true for almost all filesystems. [[ReiserFS]], because of our packing tails together, pack more data into a given percentage used, but it still is subject to the rules for max recommended percentage used. If you create the whole disk with one copy and then mount it read-only, then you can fully pack it without problem. Please be sure that you copy it from (or <tt>tar</tt> it from) a reiserfs partition so that files are created in reiserfs <tt>readdir()</tt> order as this will improve performance. === Why do I get a signal 11 when compiling the kernel using ReiserFS and not ext2? === Your CPU is overheating or you have bad RAM. === But it doesn't happen with ext2? === ext2 uses less heat sensitive gates in the CPU. :-) Seriously, ext2 and ReiserFS contain random differences, and overheating and bad RAM have random sensitivities. (Signal 11 is not due to ReiserFS. One user had a cable blocking the fan; it did not affect ext2, but it wasn't until he fixed the cable-fan problem that ReiserFS worked ...) === Can I use ReiserFS on other architectures than i386? 
=== Yes, starting from the Linux kernel 2.4.13, ReiserFS can be run on any Linux supported arch. === I need a program which will help me in rebuilding/recreating my partition table. === gpart ( http://brzitwa.de/mb/gpart/index.html) is a utility that handles ext2, FAT, Linux swap, HPFS, NTFS, FreeBSD and Solaris/x86 disklabels, Minix, ReiserFS; it prints a proposed contents for the primary partition table and is well-documented. === What partition type should I use for ReiserFS? === Linux native filesystem (83) === Can I use 32GB+ IDE Hard Drives with ReiserFS? === Yes if you use Linux kernel 2.4 and up. === What about resizing ReiserFS? === Please follow this link. === What should I put into the fifth (aka dump, fs_freq ) and the sixth (aka pass, fs_passno ) fields of /etc/fstab for ReiserFS filesystems? === 0 0 === Why are ReiserFS filesystems not fscked on reboot after a crash? === Because ReiserFS provides journalling of meta-data. After a crash, the consistency of a filesystem is restored by replaying the transaction log. === Can I interactively repair a filesystem that was corrupted (due to an internal bug in the kernel or a to hardware fault)? === man [[reiserfsck]] === Can I use "dump" and "restore" with ReiserFS? Any caveats? === No. dump uses knowledge of the internal structure of ext2 and works together with restore, which also uses ext2 specific knowledge, to back up ext2 files. dump and restore are specific to ext2 and will not work with ReiserFS. To back up ReiserFS files use tar, which is universal and can be applied to almost any reasonable Linux filesystem. It is well known among system administrators that dump is more complete than unix tar, and that there is quite a list of things that unix tar will fail to properly backup. This is not true of Gnu tar, which is quite complete. Basically, the only real disadvantage of Gnu tar compared to dump is speed. 
Unfortunately, because it shares the same name as unix tar, people are reluctant to believe this. (Yes, the Gnu version has incremental backups, etc.) We will performance optimize ReiserFS backups for you (and the rest of the world) for $30k, which is not a lot if you are a large site spending a few hundred thousand on equipment for backups. === Does ReiserFS support snapshots? === No, but you can create ReiserFS on top of LVM logical volume and use LVM snapshot capabilities. === Can I check reiserfs filesystems for errors without unmounting them? === [[reiserfsck]] in checking mode may run over filesystems mounted read-only. There is no official way to fix mounted filesystems, though. You MUST completely unmount your filesystem in order to have it fixed. If you have LVM, you can check consistency of filesystems mounted read-write, here is the script contributed by Andreas Dilger: === What ReiserFS mount options should I use to get the performance winner for a mail server? === Craig Sanders answered in detail: "By the time I got around to running bonnie, the postmark and postal benchmarks had convinced me that notail was essential. host system: - Debian GNU/Linux (of course :) - Linux kernel 2.4.2 with latest 20010305 ReiserFS patch - dual P3-866 (256K cache) - 512MB RAM - Adaptec 19160 SCSI Controller external drive box: - Domex 8230u RAID controller, 32MB battery-backed cache. - 6 x 18GB IBM DDYS-T18350M drives for this particular hardware I was using, reiserfs/notail on RAID5 was the clear performance winner for a mail server with lots of synced random I/O." === Does using ReiserFS mean I can just press the power off button without running "shutdown" or "init 0," etc? Does it mean there is no risk of data loss? === No, definitely not. As of now, ReiserFS only provides meta-data journaling--that is, it records which files have been created or opened, whether they have had their size changed, or where they have been relocated. 
It guarantees that the structure of the internal ReiserFS tree will be correct, thereby allowing you, after an unclean shutdown, to start back up without having to run fsck on all the files that have not been changed. Data in files that were being used at the time of the crash could have been corrupted. This is usual for most filesystems. Data-journaling filesystems guarantee that there will be no garbage written into a file, but they don't guarantee that a file update will be completed. (Only reiser4 guarantees that filesystem operations are performed as atomic operations, and provides atomic transaction functionality.) ReiserFS V3 does not guarantee that the file contents themselves are uncorrupted, nor that no data is lost. Moreover, even if all of your system is on ReiserFS, many system components (like daemons, database managers, etc.) require the shutdown procedure for proper functioning. However, there is a separate implementation of data logging that will soon go into the mainstream kernel. You should be able to get it from ftp.suse.com/pub/people/mason/patches/data-logging

=== How does ReiserFS support bad block handling? ===
See here.

=== I have a motherboard with VIA MVP3 chipset and experience ReiserFS problems. ===
William Oster <woster73@yahoo.com> answers:

If you are using a motherboard with a VIA MVP3 chipset, you may have ReiserFS problems caused by the way your kernel is configured for the so-called "pci quirks". My experience is with kernels 2.2.18 and 2.2.19, but it may affect the 2.4.x series too if you are using the MVP3 chipset (popular in socket 7 type motherboards, such as those used with the AMD K6 and classic Pentium). I've confirmed this problem with several motherboards using the VIA MVP3 chipset, ReiserFS 3.5.29 to 3.5.32, and NCR 53c8xx SCSI. But please note: it probably affects any controller which uses DMA and PCI bus mastering. Problems which I was inclined to attribute to ReiserFS were actually problems with this kernel misconfiguration.
If you fit this profile, DO NOT enable the "pci quirks" configuration option in the /usr/src/linux/.config file. Although the Linux documentation suggests that this option can be enabled if in doubt, DO NOT enable it. It was never intended for the VIA MVP3 chipset anyway. It affects the way DMA is handled, and the combination with ReiserFS (and possibly NCR SCSI) can cause random disk corruption which eventually will result in ReiserFS and/or SCSI errors. Evidently ReiserFS exercises the DMA and SCSI bus very thoroughly; the problems seem less likely under the ext2 filesystem. Check your /usr/src/linux/.config file. You are SAFE from this problem if you find this line:

 # CONFIG_PCI_QUIRKS is not set

Any other setting could be dangerous to MVP3-chipset ReiserFS users, especially when using PCI bus mastering controllers such as the NCR 53c8xx series. Re-configure your kernel to disable the "pci quirks" option, then make dep, rebuild, and reinstall.

=== I am having extensive problems using ReiserFS; it seems to have bugs all over the place. I'm not compiling with a buggy compiler. What is happening? How can this be stable? ===
You have hardware problems. Really, you do. Even if the bugs don't show up with ext2, you have hardware problems. (See the FAQ question about ReiserFS running 3°C hotter than ext2.) Most SuSE users use ReiserFS. Obscure bugs probably still exist; but if you find bugs as easily as when using Windows, you have bad RAM, a bad CPU, a bad cable, bad cooling, a VIA chipset with PCI quirks turned on, or other hardware or software-layer bugs. ReiserFS is stable. You can be sure that if the bugs are encountered easily and commonly with normal usage patterns, it is not us. This does not mean that the next release won't somehow break something, though :-/ Real bug reports are, at the time of writing, outnumbered 10 to 1 by hardware bugs that trigger error messages.
We are working on making our error messages better at catching hardware bugs and identifying them as such. There is only so far we can go, though, in runtime consistency checking without serious speed reductions. We don't release software unless it goes through extensive testing; so if you don't think that our testing could have missed the bug, it is probably hardware.

=== How can I put a label (like that allowed by the <tt>-L</tt> option of <tt>mkfs.ext2</tt>) on a ReiserFS instance? ===
Currently, this feature is only implemented for the [[ReiserFS]] v3.6 disk format. Adding it to the v3.5 disk format would break the existing format, and there is not enough free space in the superblock. You can set a label (and UUID) with a recent [[Reiser4progs|reiserfsprogs]] package on a [[ReiserFS]] v3.6 filesystem using the <tt>-l</tt> switch (<tt>-u</tt> for UUID) of the [[reiserfstune]] (for existing partitions) or [[mkreiserfs]] (for partitions being created) commands. Support for labels and UUIDs was integrated into [[Reiser4progs|reiserfsprogs]] starting from version 3.x.1a.

=== Why, when I'm working on files (i.e. having open files) on my laptop, does ReiserFS access the disk every 5 seconds? This effectively prevents the disk from spinning down, i.e. prevents APM modes from taking over, even when I'm not writing anything. ===
Brent Graveland <bgraveland@hyperchip.com> answers: It's the atime update. Every time you run sync, the sync program's atime is updated. The next sync writes this atime update, then sync gets updated again...

=== RedHat does not unmount / with ReiserFS on halt. How do I fix it? ===
RedHat users kindly provided these patches (not tested by us): rc.sysinit.patch and halt.patch. Note that if you have RedHat Linux 7.2 or later, you do not need these patches.

=== How do I run programs from the reiserfsprogs package on encrypted devices? ===
In order to access such encrypted devices you need to use the losetup tool to bind the device to a loop device.
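A minimal sketch of that losetup procedure, assuming a cryptoloop-era setup; the device names and the cipher here are hypothetical, and the commands require root, so adjust for your own system:

```shell
# Hypothetical example: /dev/hda5 is the encrypted partition, AES via cryptoloop.
# Bind the encrypted partition to a free loop device (prompts for the passphrase).
losetup -e aes /dev/loop0 /dev/hda5
# The reiserfsprogs tools can now operate on the decrypted view.
reiserfsck --check /dev/loop0
# Detach the loop device when finished.
losetup -d /dev/loop0
```

The same pattern applies to [[reiserfstune]] or [[mkreiserfs]]: point the tool at the loop device, never at the raw encrypted partition.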
=== Are there any recommendations for or against particular hard drive manufacturers for use with reiserfs? ===
There is basically no preference; the general rule "the faster the drive and the lower the seek time, the better" applies as always. On the other hand, almost every hard drive manufacturer has a "widely known" broken series of hard drives. The most recent example is IBM's "Deskstar" series of disks, especially the DTLA models produced in Hungary in 2000-2001. These are known to fail very often, to the point that you probably don't want to use them even if you already paid for them. Other Deskstar drives also seem to be a poor choice. IBM released a note that Deskstar drives should not run for more than 8 hours/day on average. These drives are also known to be very sensitive to temperature conditions and to fail on overheating. A class-action lawsuit against IBM over that drive series is in progress.

=== I am using RedHat 7.0 with gcc 2.96; why does ReiserFS seem unstable with it? ===
Use the most recent version of RedHat (gcc 2.96-85 or later, as shipped with RedHat 7.2; 7.1 is also okay for ReiserFS). The choice of an unstable, unreleased version of gcc 2.96 by RedHat as the default gcc was a Slashdot controversy. gcc 2.96 on RedHat 7.0 was unstable, and ReiserFS was one of the things that would fail with it. There are two gcc versions: 2.96 and 2.96-85. 2.96-85 works for ReiserFS, and the other (the one on RedHat 7.0) surely does not. Read the Linux kernel instructions about what compiler to use. The solution to code not working on broken compilers is the one RedHat has taken: fix the compiler. They fixed the compiler and thereby allowed the correctly compiled ReiserFS to work.

=== In my program I am using fsync(2) calls after each write to the file to guarantee the integrity of my file data, and this is very slow. How can I improve the performance?
=== Answer from Chris Mason: The main thing to remember is that fsyncs introduce a bunch of disk writes, and force the FS to wait on the buffers. The key to keeping performance up is to make it easy for the FS to do as much as possible before the fsync call. So, if your application modifies 3 files, and you want to make sure all 3 changes are safely on disk:

 write(file1)
 write(file2)
 write(file3)
 fsync(file1)
 fsync(file2)
 fsync(file3)

is much faster than:

 write(file1)
 fsync(file1)
 write(file2)
 fsync(file2)
 write(file3)
 fsync(file3)

It is also faster to write over existing bytes in a file than it is to append new bytes onto the end of a file. When you overwrite existing bytes in a file, you don't have to commit new metadata to disk on fsync(); the FS can just write the data blocks. That means fewer seeks. The more you write to a single file before calling fsync, the faster overall things will run.

 write(8k)
 fsync(file)

is much faster than:

 write(4k)
 fsync(file)
 write(4k)
 fsync(file)

Trying to optimize for those 3 things alone can make a huge performance difference overall.

Answer from Josh MacDonald: You have to understand that even using fsync() after every write() makes no guarantees. If the system crashes during either the write or fsync operation, your data may be lost or corrupted. Suppose the fsync() does complete: does your application keep its data in multiple files? If that is the case and you need to write() to multiple files as part of a transaction, you have even greater problems. The only safe and easy way for you to implement some kind of transaction with the traditional file system guarantees is to use rename():

1. Keep all of your data in a single file.
2. Periodically write a complete copy of your database to a temporary file.
3. Rename the temporary file to the original database name.

(Addition from Nikita Danilov: One can implement something like a phase-tree at user level and use rename to atomically switch the root of the tree.
This overcomes the "everything-in-one-file" limitation but has the added complexity of requiring crash recovery.)

Answer from Nikita Danilov: Stop your development for now and wait until the reiser4 filesystem is released; it will have a transaction API exported to userspace. That transaction API would solve all of your problems.

== Our program needs to access a lot of working files. What is the recommended way to organize files to get the best results out of ReiserFS? Should all the files be placed in a single directory, or should the files be spread across a directory tree to limit the number of files per directory? Can you also summarize the relevant caching and locking effects? ==
Traditional file systems typically have poor performance when there are many files in a single directory, but not [[ReiserFS]]. These other file systems perform poorly because they use a linear search algorithm to find and replace entries in a directory. This means that the file system must scan, on average, half the blocks of a directory for every access. Typically, applications are required to work around this problem by manually structuring a tree of directories, allowing each individual directory to remain limited in size. For example, see how the Squid web proxy stores a large collection of files. ReiserFS does not have this problem because it uses an internal tree to store all directories and file metadata. Directory operations remain efficient even for very large directories, so you can write your application free from this performance concern. However, there are several issues that complicate this matter: namely locking and locality. The Linux VFS currently imposes locking restrictions that serialize many operations on directories, so if concurrent processes or threads will access the collection of files then you may be better off using multiple directories. [[Reiser4]] will improve upon this restriction, although it is still under development.
ReiserFS attempts to store all of the files in a directory, along with the directory itself, in nearby locations on disk. An application may exploit this spatial locality if it can predict which files will be accessed with temporal locality. You may be better off using multiple directories to store your files if you can predict that many files within a directory will be accessed at the same time. To summarize, ReiserFS supports efficient access to large directories and most traditional file systems do not. However, locking and locality issues may guide your decision to use manually structured directory trees instead, at least until ReiserFS exports control over packing locality to users, and improves its locking.

[[category:ReiserFS]] [[category:Reiser4]]

This FAQ is very [[ReiserFS]] centric and often a bit dated. The [[Reiser4]] filesystem is mentioned as ''upcoming''. Be sure to search the [[mailinglists|mailing list archives]] and help update this FAQ - Thanks!

__TOC__

=== What are the specs for ReiserFS: maximum number of files, of files a directory can have, of sub-dirs in a dir, of links to a file, maximum file size, maximum filesystem size, etc.? ===
Specifications for [[ReiserFS]]:

{|cellpadding="5" cellspacing="0" border="1"
| '''property''' || '''3.5''' || '''3.6'''
|-
| max number of files || 2<sup>32</sup>-3 => 4 Gi-3 || 2<sup>32</sup>-3 => 4 Gi-3
|-
| max number of files a dir can have || 518701895 (but in practice this value is limited by the hash function; the r5 hash allows about 1 200 000 file names without collisions) || 2<sup>32</sup>-4 => 4 Gi-4 (but in practice this value is limited by the hash function.
r5 hash allows about 1 200 000 file names without collisions)
|-
| max file size || 2<sup>31</sup>-1 => 2 Gi-1 || 2<sup>60</sup> bytes => 1 Ei, but the page cache limits this to 8 Ti on architectures with a 32-bit int
|-
| max number of links to a file || 2<sup>16</sup> => 64 Ki || 2<sup>32</sup> => 4 Gi
|-
| max filesystem size || 2<sup>32</sup> (4K) blocks => 16 Ti || 2<sup>32</sup> (4K) blocks => 16 Ti
|}

ReiserFS does '''meta-data journaling''', enabling fast crash recovery without the expense of full '''data journaling'''. There [ftp://ftp.suse.com/pub/people/mason/patches/intermezzo-alpha/ were] separate [http://marc.info/?l=reiserfs-devel&m=100895310422415&w=2 patches from Chris Mason] that implement full data journaling for ReiserFS for Linux 2.4.16.

'''Note''': Full data journaling is considered by many to be a good way to achieve file data integrity across system crashes. However, although file data may appear to be consistent from the kernel point of view, since there is no API exported to userspace to control transactions, we may end up in a situation where the application makes two write requests (as part of one logical transaction) but only one of them gets journaled before the system crashes. From the application point of view, we may then end up with inconsistent data in the file. Such issues should be addressed with the upcoming [[Reiser4]]: such an API will be exported to userspace, and all programs that need transactions will be able to use it.

=== Mount fails after reiserfsck --rebuild-tree failure ===
When [[reiserfsck]] --rebuild-tree is run, the first thing it does is set the root inode value to -1. This makes the filesystem unmountable. (So if [[reiserfsck]] fails later on because the filesystem contains serious errors, the filesystem cannot be mounted.) Therefore, once [[reiserfsck]] --rebuild-tree has failed for one of your filesystems, mounting of that partition is disabled. To correct the error, first check that you have the latest [[Reiser4progs|reiserfsprogs]] package installed.
If that fails, please send a bug report to our [[mailinglists|mailing list]] and be ready to answer our questions.

=== Why is the execution time for a <tt>find . -type f | xargs cat {} \;</tt> command much longer when using ReiserFS than for the same command when using ext2? ===
This effect is observed if the measured file set was produced by untarring an archive created not from a ReiserFS partition (or by copying files from a non-ReiserFS partition, or by running a program that writes a bunch of files in some order). This is because the <tt>readdir()</tt> operation performed on the ReiserFS partition returns filenames not in the original write order but rather in some hash order (dependent on the hash function used). Thus when reading the files' contents, the hard drive heads must move when going from one file to another. If you want ReiserFS to outperform any other filesystem in your setup, here is one solution: copy the entire directory that you are not satisfied with to the same partition but with a different name (use <tt>cp -a</tt>), then remove the old directory and rename the new one to the old name. If the partition does not have enough space available, another approach is to <tt>tar(1)</tt> up the whole partition, clear it, and then untar the previously saved data.

=== Is quota support built into the vanilla 2.4 kernels for ReiserFS? ===
No. Quota support for the 2.4 branch of Linux kernels is bundled separately; the patches by Chris Mason were once available [ftp://ftp.suse.com/pub/people/mason/patches/reiserfs/quota-2.4/ at SuSE] (gone) and are still [http://gd.tuwien.ac.at/utils/fs/reiserfs/quota-patches/ mirrored at TU-Wien]. The reason these patches were not included in the 2.4 kernel branch is that they implement a new quota format and need new quota code too, which is too big a change for the 2.4 series of kernels. Various Linux distribution vendors (e.g. [http://www.suse.com SuSE]) do ship reiserfs-quota-enabled kernels, though.
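The copy-and-rename repacking trick from the <tt>readdir()</tt> answer above can be sketched as a short shell sequence; the directory name <tt>maildir</tt> is hypothetical, and the copy must stay on the same (ReiserFS) partition:

```shell
# Repack a directory so its files are re-created in readdir order on disk.
# "maildir" stands in for whichever directory shows the slow-scan symptom.
cp -a maildir maildir.repacked   # the copy re-creates files in readdir order
rm -rf maildir                   # remove the fragmented original
mv maildir.repacked maildir      # swap the repacked copy into place
```

After this, scans such as <tt>find . -type f | xargs cat</tt> visit files in roughly the order they are laid out, so the drive heads seek far less.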
=== I am getting some errors in my kernel logs that I do not know how to interpret ===
Messages like:

 vs-13070: reiserfs_read_inode2: i/o failure occurred trying to find stat data of [1718696 1718710 0x0 SD]
 zam-7001: io error in reiserfs_find_entry

most likely accompanied by samples like those below, are definite signs of hard disk problems (bad sectors):

 hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
 hda: dma_intr: error=0x40 { UncorrectableError }, LBAsect=6599945, sector=4286584
 end_request: I/O error, dev 03:03 (hda), sector 4286584

or

 scsi0: ERROR on channel 0, id 1, lun 0, CDB: Read (10) 00 00 01 ee 60 00 00 08 00
 Current sd 08:00: sense key Medium Error

or

 I/O error: dev 08:21, sector 65704

Messages about <tt>"access beyond end of device"</tt> can have many different causes: not rebooting after fdisk requested it, unfinished resizes, data corruption. The following messages mean you have a noisy IDE cable, or one of too low quality for the chosen UDMA mode. Replace the cable with a better one, or choose a slower UDMA mode:

 hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
 hda: dma_intr: error=0x84 { DriveStatusError BadCRC }
 hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
 hda: dma_intr: error=0x84 { DriveStatusError BadCRC }

If you see any message from [[ReiserFS]] that you cannot interpret and there is nothing similar to the messages above around it, [[mailinglists|mail the message to us]] and we will explain it to you.

=== Will ReiserFS implement streams, extended attributes, etc.? ===
[[FAQ/streams|Here]] is the one-page answer.

=== Reiserfs appears to be very slow while the RAID is resyncing. Mounting takes several minutes. Once mounted, an <tt>ls(1)</tt> in the mounted directory hangs. Forever. Once the RAID is sync'ed, things appear to work pretty well. How can that be fixed? ===
First of all, a patch that makes mounting faster has been included in the Linux kernel since 2.4.19.
You can grab the patch for earlier kernels [http://gd.tuwien.ac.at/utils/fs/reiserfs/reiserfs-for-2.5/2.5.4.pending/07-reiserfs-bitmap-journal-read-ahead.diff here]. Also, RAID drivers have '''minimal guaranteed''' and '''maximal possible''' RAID rebuild bandwidth usage. These values are controlled through the <tt>/proc/sys/dev/raid/speed_limit_min</tt> and <tt>/proc/sys/dev/raid/speed_limit_max</tt> sysctl variables (values are in 100 KiB/s). The RAID logic cannot always tell whether the disk subsystem is busy at a given time. When it thinks the disk subsystem is idle, it tries to rebuild the RAID array at the <tt>speed_limit_max</tt> speed, which defaults to 100 MB per second. Decrease this value to something more suitable (a bit of experimentation might be needed).

=== I get attempt to read past the end of the partition error messages; is ReiserFS corrupted? ===
You changed your partition sizes, and then ran [[mkreiserfs]] before rebooting. The kernel does not change its belief about the partition sizes until reboot time. (This is fixable, but nobody has fixed it as of Dec. 2001.) [[mkreiserfs]] created a filesystem that has a wrong notion of how large its partition is. The filesystem's notion of the partition boundaries will last past reboot even though the kernel's notion will change. So yes, it is corrupted. Some other kinds of metadata breakage can also lead to such messages.

=== Can I use VMware with ReiserFS? ===
VMware was tested on [http://www.suse.com/ SuSE Linux] with a [http://support.microsoft.com/gp/lifean18 Windows98] guest OS on a [[ReiserFS]] partition. There's one trick at the beginning: the following line was added to the VMware config file:

 host.FSSupportLocking1 = 0x52654973 # (0x52654973 == *(u32 *) "ReIs")

Thanks to [mailto:gkade@bigbrother.net Gregory K. Ade] for this hint.

=== How do I install Debian potato with ReiserFS as root partition?
=== [[FAQ/potato_part|Here]] are instructions by [mailto:LeBlanc@mcc.ac.uk Dr. A.V. Le Blanc].

=== Starting with linux kernel v2.4.21 I cannot mount my FS anymore. Why? ===
Special sanity checks were added to the kernel code to prohibit mounting of filesystems that are bigger than the underlying block device. If you now see this message on mount:

 Filesystem on xx:yy cannot be mounted because it is bigger than the device

you may need to run fsck or increase the size of your LVM partition. Or maybe you forgot to reboot after fdisk told you to. If you do not use LVM, that usually means you need to run <tt>[[reiserfsck]] --rebuild-sb</tt> on your filesystem and agree to change its default size to the proposed one.

=== Is it ok to use ReiserFS on a small storage device, e.g. a 16MB NAND flash block device? ===
[[FAQ/small_blocks|Here]] are instructions.

=== How do I change root from ext2 to ReiserFS without loss of data? ===
[[FAQ/change_fs|Here]] are instructions.

=== <tt>mount: /dev/hda5 has wrong major or minor number</tt> - what does that mean? ===
The kernel does not know anything about [[ReiserFS]]; it is neither compiled in nor available as a module.

=== Will it be possible to read/write ReiserFS partitions created now with future versions of ReiserFS? ===
Yes. [[ReiserFS]]-3.6.x (Linux-2.4.x) works with both the old (3.5) and the new (3.6) formats. ReiserFS-3.5.x (Linux-2.2.x) can only work with the old (3.5) disk format. There is no way to convert the new (3.6) disk format to the old (3.5), but the old (3.5) format can be converted to the new one (3.6) with the <tt>-o conv</tt> [[mount|mount option]].

=== The ReiserFS module doesn't insert properly - why? ===
After applying the patch, ''recompile'' the whole kernel including the modules target, reboot, then try to insert the module.

=== Can I use ReiserFS with the software RAID? ===
Yes, for all RAID levels using any Linux >= 2.4.1, but '''DO NOT''' use RAID with Linux 2.2.x.
Our journaling and their RAID code step on each other in the buffering code. Also, mirroring is '''not''' safe in the 2.2.x kernels, because online mirror rebuilds in 2.2.x break the write-ordering requirements for the log. If you crash in the middle of an online rebuild, your meta-data may be corrupted. The only RAID level that is safe with [[ReiserFS]] in the 2.2.x kernels is the striping/concatenation level.

=== Can I use ReiserFS with 3ware RAID? ===
Yes, but you need to use Linux 2.2.19 or later for reasons other than [[ReiserFS]]. Also, if you should encounter problems, be suspicious that it might not be ReiserFS that has the bug. See the [http://web.archive.org/web/20030415160519/http://www.3ware.com/support/raid5techbulletin.shtml special instructions] (archive.org).

=== Why do things freeze on my IDE hard drive for annoying amounts of time? ===
Because when large writes are scheduled all at once, reads can starve. A fix for this is evolving; the later your ReiserFS patch, the better we handle this.

=== du says ReiserFS makes space efficiency worse. ===
Use df, not du, or use the "raw" option for du if your du supports that. st_blocks summed up is less accurate than st_size for [[ReiserFS]] because we pack tails, and st_blocks rounds numbers up.

=== [[mkreiserfs]] fails after repartitioning. ===
The kernel requires you to reboot after repartitioning (for all filesystems). We intend to fix that.

=== Performance is poor, and my disk at 96% full still has free space. ===
Once a disk drive gets more than 85% full, performance starts to suffer unless you use a repacker (which isn't implemented yet). You can probably get away with 92%, but if performance is valued, you are making a mistake to keep it any fuller. This is true for almost all filesystems. ReiserFS, because of packing tails together, packs more data into a given percentage used, but it is still subject to the rules for the maximum recommended percentage used.
If you create the whole disk with one copy and then mount it read-only, then you can fully pack it without problem. Please be sure that you copy it from (or tar it from) a reiserfs partition, so that files are created in reiserfs readdir order, as this will improve performance.

=== Why do I get a signal 11 when compiling the kernel using ReiserFS and not ext2? ===
Your CPU is overheating or you have bad RAM.

=== But it doesn't happen with ext2? ===
ext2 uses less heat-sensitive gates in the CPU. :-) Seriously, ext2 and ReiserFS contain random differences, and overheating and bad RAM have random sensitivities. (Signal 11 is not due to ReiserFS. One user had a cable blocking the fan; it did not affect ext2, but it wasn't until he fixed the cable-fan problem that ReiserFS worked...)
=== Can I interactively repair a filesystem that was corrupted (due to an internal bug in the kernel or a to hardware fault)? === man [[reiserfsck]] === Can I use "dump" and "restore" with ReiserFS? Any caveats? === No. dump uses knowledge of the internal structure of ext2 and works together with restore, which also uses ext2 specific knowledge, to back up ext2 files. dump and restore are specific to ext2 and will not work with ReiserFS. To back up ReiserFS files use tar, which is universal and can be applied to almost any reasonable Linux filesystem. It is well known among system administrators that dump is more complete than unix tar, and that there is quite a list of things that unix tar will fail to properly backup. This is not true of Gnu tar, which is quite complete. Basically, the only real disadvantage of Gnu tar compared to dump is speed. Unfortunately, because it shares the same name as unix tar, people are reluctant to believe this. (Yes, the Gnu version has incremental backups, etc.) We will performance optimize ReiserFS backups for you (and the rest of the world) for $30k, which is not a lot if you are a large site spending a few hundred thousand on equipment for backups. === Does ReiserFS support snapshots? === No, but you can create ReiserFS on top of LVM logical volume and use LVM snapshot capabilities. === Can I check reiserfs filesystems for errors without unmounting them? === [[reiserfsck]] in checking mode may run over filesystems mounted read-only. There is no official way to fix mounted filesystems, though. You MUST completely unmount your filesystem in order to have it fixed. If you have LVM, you can check consistency of filesystems mounted read-write, here is the script contributed by Andreas Dilger: === What ReiserFS mount options should I use to get the performance winner for a mail server? 
=== Craig Sanders answered in detail: "By the time I got around to running bonnie, the postmark and postal benchmarks had convinced me that notail was essential. host system: - Debian GNU/Linux (of course :) - Linux kernel 2.4.2 with latest 20010305 ReiserFS patch - dual P3-866 (256K cache) - 512MB RAM - Adaptec 19160 SCSI Controller external drive box: - Domex 8230u RAID controller, 32MB battery-backed cache. - 6 x 18GB IBM DDYS-T18350M drives for this particular hardware I was using, reiserfs/notail on RAID5 was the clear performance winner for a mail server with lots of synced random I/O." === Does using ReiserFS mean I can just press the power off button without running "shutdown" or "init 0," etc? Does it mean there is no risk of data loss? === No, definitely not. As of now, ReiserFS only provides meta-data journaling--that is, it records which files have been created or opened, whether they have had their size changed, or where they have been relocated. It guarantees that the structure of the internal ReiserFS tree will be correct, thereby allowing you after an unclean shutdown to start back up without having to run fsck on all the files that have not been changed. Data in files that were being used at the time of the crash could have been corrupted. This is usual for most filesystems. Data journaling filesystems guarantee that there will be no garbage written into a file, but they don't guarantee that a file update will be. (Only reiser4 guarantees that filesystem operations are performed as atomic operations, and provides atomic transaction functionality.) ReiserFS V3 does not guarantee the file contents themselves are uncorrupted nor that no data is lost. Moreover, even given that all of your system is on ReiserFS, many system components (like daemons, database managers, etc) require the shut down procedure for proper functioning. However, there is separate implementation of data logging that will soon go into the mainstream kernel. 
You should be able to get it from ftp.suse.com/pub/people/mason/patches/data-logging === How does ReiserFS support bad block handling? === See here. === I have a motherboard with VIA MVP3 chipset and experience ReiserFS problems. === William Oster <woster73@yahoo.com> answers: If you are using a motherboard with a VIA MVP3 chipset, you may have ReiserFS problems caused by the way your kernel is configured for the so called "pci quirks". My experience is with kernel 2.2.18 and 2.2.19 but it may affect the 2.4.x series too if you are using MVP3 chipset (popular in socket 7 type motherboards, such as used by AMD K6 and classic Pentium). I've confirmed this problem with several motherboards using the VIA MVP3 chipset, ReiserFS 3.5.29 to 3.5.32, and NCR 53c8xx SCSI. But please note: It probably affects any controller which uses DMA and PCI bus mastering. Problems which I was inclined to attribute to the ReiserFS were actually problems with this kernel [mis] configuration. If you fit this profile, DO NOT enable the "pci quirks" configuration option in the /usr/src/linux/.config file. Although the Linux documentation suggests that this option can be enabled if in doubt, DO NOT enable it. It was never intended for the VIA MVP3 chipset anyway. It affects the way DMA is handled, and the combination of ReiserFS (and possibly NCR SCSI) can cause random disk corruption which eventually will result in ReiserFS and/or SCSI errors. Evidently ReiserFS exercises the DMA and SCSI bus very thoroughly, The problems seem not to be as likely under the ext2 filesystem. Check your /usr/src/linux/.config file. You are SAFE from this problem if you find this line: # CONFIG_PCI_QUIRKS is not set Any other setting could be dangerous to MVP3 chipset ReiserFS users especially when using PCI bus mastering controllers such as the NCR 53c8xx series. Re-configure your kernel to disable the "pci quirks" option, then make dep, rebuild, and reinstall. 
=== I am having extensive problems using ReiserFS; it seems to have bugs all over the place. I'm not compiling with a buggy compiler. What is happening? How can this be stable? ===

You have hardware problems. Really, you do. Even if the bugs don't show up with ext2, you have hardware problems. (See the FAQ question about ReiserFS running 3°C hotter than ext2.) Most SuSE users use ReiserFS. Obscure bugs probably still exist; but if you find bugs as easily as when using Windows, you have bad RAM, a bad CPU, a bad cable, bad cooling, a VIA chipset with PCI quirks turned on, or other hardware or software-layer bugs. ReiserFS is stable. You can be sure that if the bugs are encountered easily and commonly with normal usage patterns, it is not us. This does not mean that the next release won't somehow break something though :-/ Real bug reports are, at the time of writing, outnumbered 10 to 1 by hardware bugs that trigger error messages. We are working on making our error messages better at catching hardware bugs and identifying them as such. There is only so far we can go, though, in runtime consistency checking without serious speed reductions. We don't release software unless it goes through extensive testing; so if you don't think that our testing could have missed the bug, it is probably hardware.

=== How can I put a label (like allowed by the <tt>-L</tt> option of <tt>mkfs.ext2</tt>) on a ReiserFS instance? ===

Currently, this feature is only implemented for the [[ReiserFS]] v3.6 disk format. Adding it to the v3.5 disk format would break the existing disk format, and there is not enough free space in the superblock. You can set a label (and UUID) with a recent [[Reiser4progs|reiserfsprogs]] package on a [[ReiserFS]] v3.6 filesystem using the <tt>-l</tt> switch (<tt>-u</tt> for UUID) to the [[reiserfstune]] (for existing partitions) or [[mkreiserfs]] (for partitions being created) commands. Support for labels and UUIDs was integrated into [[Reiser4progs|reiserfsprogs]] starting from version 3.x.1a.
=== Why, when I'm working on files (i.e. having open files) on my laptop, does ReiserFS access the disk every 5 seconds? This effectively prevents the disk from spinning down, i.e. prevents APM power-saving modes from taking over, even when I'm not writing anything. ===

Brent Graveland <bgraveland@hyperchip.com> answers: It's the atime update. Every time you run sync, the sync program's atime is updated. The next sync writes this atime update, then sync gets updated again...

=== RedHat does not unmount / with ReiserFS on halt. How do I fix it? ===

RedHat users kindly provided these patches (not tested by us): rc.sysinit.patch and halt.patch. Note that if you have RedHat Linux 7.2 or later, you do not need these patches.

=== How do I run programs from the reiserfsprogs package on encrypted devices? ===

In order to access such encrypted entities, you need to use the losetup tool to bind your entity to a loop device.

=== Are there any recommendations for or against particular hard drive manufacturers for use with ReiserFS? ===

There is basically no preference; the general "the faster the drive and the lower the seek time, the better" rule applies as always. On the other hand, almost every hard drive manufacturer has a "widely known" broken series of hard drives. The most recent example is IBM's "Deskstar" series of disks, especially the DTLA models produced in Hungary in 2000-2001. These are known to fail very often, to the point that you probably don't want to use them even if you already paid for them. Other Deskstar drives also seem to be a poor choice. IBM released a note that Deskstar drives should not run for more than 8 hours/day on average. These drives are also known to be very sensitive to temperature conditions and to fail on overheating. There is a class action lawsuit against IBM over that drive series in progress.

=== I am using RedHat 7.0 with gcc 2.96; why does ReiserFS seem unstable with it?
=== Use the most recent version of RedHat (gcc 2.96-85 or later, as shipped with RedHat 7.2, although 7.1 is also okay for ReiserFS). The choice of an unstable, unreleased version of gcc 2.96 by RedHat as the default gcc was a Slashdot controversy. gcc 2.96 on RedHat 7.0 was unstable, and ReiserFS was one of the things that would fail with it. There are two gcc versions here: 2.96 and 2.96-85. 2.96-85 works for ReiserFS, and the other (the one on RedHat 7.0) surely does not. Read the Linux kernel instructions about what compiler to use. The solution to code not working on broken compilers is the one RedHat has taken: fix the compiler. They fixed the compiler and thereby allowed the correctly compiled ReiserFS to work.

=== In my program I am using fsync(2) calls after each write to the file to guarantee the integrity of my file data, and this is very slow; how can I improve the performance? ===

Answer from Chris Mason: The main thing to remember is that fsyncs introduce a bunch of disk writes, and force the FS to wait on the buffers. The key to keeping performance up is to make it easy for the FS to do as much as possible before the fsync call. So, if your application modifies 3 files, and you want to make sure all 3 changes are safely on disk:

 write(file1) write(file2) write(file3) fsync(file1) fsync(file2) fsync(file3)

is much faster than:

 write(file1) fsync(file1) write(file2) fsync(file2) write(file3) fsync(file3)

It is also faster to write over existing bytes in a file than it is to append new bytes onto the end of the file. When you overwrite existing bytes, you don't have to commit new metadata to disk on fsync(); the FS can just write the data blocks. This means fewer seeks. The more you write to a single file before calling fsync, the faster overall things will run:

 write(8k) fsync(file)

is much faster than:

 write(4k) fsync(file) write(4k) fsync(file)

Trying to optimize for those 3 things alone can make a huge performance difference overall.
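The batching pattern Chris Mason describes can be sketched in Python. This is a minimal illustration (the helper name and file names are mine, and the actual speedup depends on the filesystem): all writes are issued first, then all fsyncs, rather than interleaving write/fsync pairs.

```python
import os
import tempfile

def write_all_then_fsync(payloads):
    """Write every file first, then fsync them all in one sweep,
    instead of interleaving write/fsync pairs. Returns the paths written."""
    dirpath = tempfile.mkdtemp()
    handles = []
    # Phase 1: issue all the writes; nothing is forced to disk yet.
    for name, data in payloads.items():
        path = os.path.join(dirpath, name)
        fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o644)
        os.write(fd, data)
        handles.append((path, fd))
    # Phase 2: fsync everything; the filesystem can batch the commits.
    for _, fd in handles:
        os.fsync(fd)
        os.close(fd)
    return [path for path, _ in handles]

paths = write_all_then_fsync({"file1": b"one", "file2": b"two", "file3": b"three"})
```

The durability guarantee is the same in both orderings; only the amount of work the filesystem can merge per fsync differs.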
Answer from Josh MacDonald: You have to understand that even using fsync() after every write() makes no guarantees. If the system crashes during either the write or the fsync operation, your data may be lost or corrupted. Suppose the fsync() does complete: does your application keep its data in multiple files? If that is the case and you need to write() to multiple files as part of a transaction, you have even greater problems. The only safe and easy way for you to implement some kind of transaction with the traditional file system guarantees is to use rename():

1. Keep all of your data in a single file.
2. Periodically write a complete copy of your database to a temporary file.
3. Rename the temporary file to the original database name.

(Addition from Nikita Danilov: One can implement something like a phase-tree at user level and use rename to atomically switch the root of the tree. This overcomes the "everything-in-one-file" limitation but has the added complexity of requiring crash recovery.)

Answer from Nikita Danilov: Stop your development for now and wait until the reiser4 filesystem is released, which will have a transaction API exported to userspace. That transaction API would solve all of your problems.

== Our program needs to access a lot of working files. What is the recommended way to organize files to get the best results out of ReiserFS? Should all the files be placed in a single directory, or should the files be spread across a directory tree to limit the number of files per directory? Can you also summarize the relevant caching and locking effects? ==

Traditional file systems typically have poor performance when there are many files in a single directory, but not [[ReiserFS]]. These other file systems perform poorly because they use a linear search algorithm to find and replace entries in a directory. This means that the file system must scan, on average, half the blocks of a directory for every access.
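Josh MacDonald's three-step rename() recipe above can be sketched in Python (the helper name <tt>atomic_replace</tt> is mine). Because rename() within one filesystem is atomic, a crash at any point leaves either the complete old copy or the complete new one, never a mix:

```python
import os
import tempfile

def atomic_replace(path, data):
    """Replace the contents of `path` using the write-temp/fsync/rename
    recipe: the temporary copy lives in the same directory, so the
    final rename() stays within one filesystem and is atomic."""
    dirpath = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=dirpath)
    try:
        os.write(fd, data)
        os.fsync(fd)       # force the new copy to disk before switching
    finally:
        os.close(fd)
    os.rename(tmp, path)   # atomic switch to the new complete copy

dbfile = os.path.join(tempfile.mkdtemp(), "database")
atomic_replace(dbfile, b"state 1")
atomic_replace(dbfile, b"state 2")
```

Note the cost this recipe carries: every update rewrites the whole file, which is why it only suits the "everything in one file" layout described above.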
Typically, applications are required to work around this problem by manually structuring a tree of directories, allowing each individual directory to remain limited in size. For example, see how the Squid web proxy stores a large collection of files. ReiserFS does not have this problem because it uses an internal tree to store all directories and file metadata. Directory operations remain efficient even for very large directories, so you can write your application free from this performance concern. However, there are several issues that complicate this matter: namely locking and locality. The Linux VFS currently imposes locking restrictions that serialize many operations on directories, so if concurrent processes or threads will access the collection of files, then you may be better off using multiple directories. [[Reiser4]] will improve upon this restriction, although it is still under development. ReiserFS attempts to store all of the files in a directory, along with the directory itself, in nearby locations on disk. An application may exploit this spatial locality if it can predict which files will be accessed with temporal locality. You may be better off using multiple directories to store your files if you can predict that many files within a directory will be accessed at the same time. To summarize, ReiserFS supports efficient access to large directories and most traditional file systems do not. However, locking and locality issues may guide your decision to use manually structured directory trees instead, at least until ReiserFS exports control over packing locality to users and improves its locking.

[[category:ReiserFS]] [[category:Reiser4]]

This FAQ is very [[ReiserFS]] centric and often a bit dated. The [[Reiser4]] filesystem is mentioned as ''upcoming''. Be sure to search the [[mailinglists|mailing list archives]] and help update this FAQ - Thanks!
__TOC__

=== What are the specs for ReiserFS: maximum number of files, of files a directory can have, of sub-dirs in a dir, of links to a file, maximum file size, maximum filesystem size, etc.? ===

Specifications for [[ReiserFS]]:

{| cellpadding="5" cellspacing="0" border="1"
| '''property''' || '''3.5''' || '''3.6'''
|-
| max number of files || 2<sup>32</sup>-3 => 4 Gi-3 || 2<sup>32</sup>-3 => 4 Gi-3
|-
| max number of files a dir can have || 518701895 (but in practice this value is limited by the hash function; the r5 hash allows about 1 200 000 file names without collisions) || 2<sup>32</sup>-4 => 4 Gi-4 (but in practice this value is limited by the hash function; the r5 hash allows about 1 200 000 file names without collisions)
|-
| max file size || 2<sup>31</sup>-1 => 2 Gi-1 || 2<sup>60</sup> bytes => 1 Ei, but the page cache limits this to 8 Ti on architectures with 32-bit int
|-
| max number of links to a file || 2<sup>16</sup> => 64 Ki || 2<sup>32</sup> => 4 Gi
|-
| max filesystem size || 2<sup>32</sup> (4K) blocks => 16 Ti || 2<sup>32</sup> (4K) blocks => 16 Ti
|}

ReiserFS does '''meta-data journaling''', enabling fast crash recovery without the expense of full '''data journaling'''. There [ftp://ftp.suse.com/pub/people/mason/patches/intermezzo-alpha/ were] separate [http://marc.info/?l=reiserfs-devel&m=100895310422415&w=2 patches from Chris Mason] that implement full data journaling for ReiserFS for Linux 2.4.16.

'''Note''': Full data journaling is considered by many to be a good way to achieve file data integrity across system crashes. However, although file data may appear to be consistent from the kernel point of view, since there is no API exported to userspace to control transactions, we may end up in a situation where the application makes two write requests (as part of one logical transaction) but only one of these gets journaled before the system crashes. From the application point of view, we may then end up with inconsistent data in the file. Such issues should be addressed with the upcoming [[Reiser4]].
Such an API will be exported to userspace, and all programs that need transactions will be able to use it.

=== Mount fails after reiserfsck --rebuild-tree failure ===

When [[reiserfsck]] --rebuild-tree is run, the first thing it does is set the root inode value to -1. This makes the filesystem unmountable. (So, if [[reiserfsck]] fails later on because the filesystem contains serious errors, the filesystem cannot be mounted.) Therefore, once [[reiserfsck]] --rebuild-tree has failed for one of your filesystems, mounting of this partition is disabled. To correct the error, first check that you have the latest [[Reiser4progs|reiserfsprogs]] package installed. If that fails, please send a bug report to our [[mailinglists|mailing list]] and be ready to answer our questions.

=== Why is the execution time for a <tt>find . -type f | xargs cat</tt> command much longer when using ReiserFS than for the same command when using ext2? ===

This effect is observed if the measured file set was produced by untarring an archive created not from a ReiserFS partition (or by copying files from a non-ReiserFS partition, or by running a program that writes a bunch of files in some order). This is because the <tt>readdir()</tt> operation performed on the ReiserFS partition returns filenames not in the original write order but rather in some hash order (dependent on the hash function used). Thus, when reading the files' contents, the hard drive heads must move when going from one file to another. If you want ReiserFS to outperform any other filesystem in your setup, here is one solution: copy the entire directory that you are not satisfied with to the same partition but with a different name (use <tt>cp -a</tt>), then remove the old directory and rename the new one with the old name. If the partition does not have enough space available, another approach is to <tt>tar(1)</tt> up the whole partition, clear it, and then untar the previously saved data.
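The <tt>readdir()</tt> effect above can be observed with a short Python sketch: <tt>os.listdir()</tt> returns names in the kernel's readdir order, which need not match the order in which the files were created (the helper name is mine, for illustration only).

```python
import os
import tempfile

def read_in_listing_order(dirpath):
    """Read every regular file in the order the directory listing
    returns it. On ReiserFS that listing order is hash order, not
    creation order, which is what forces the extra head movement."""
    contents = {}
    for name in os.listdir(dirpath):   # kernel readdir() order
        path = os.path.join(dirpath, name)
        if os.path.isfile(path):
            with open(path, "rb") as f:
                contents[name] = f.read()
    return contents

d = tempfile.mkdtemp()
for i in range(5):                     # creation order: f0, f1, ..., f4
    with open(os.path.join(d, "f%d" % i), "wb") as f:
        f.write(b"x" * i)

files = read_in_listing_order(d)
```

The <tt>cp -a</tt> trick works because the copy is written in readdir order, so afterwards the listing order and the on-disk layout agree.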
=== Is quota support built into the vanilla 2.4 kernels for ReiserFS? ===

No. Quota support for the 2.4 kernel branch is bundled separately; the patches by Chris Mason were once available [ftp://ftp.suse.com/pub/people/mason/patches/reiserfs/quota-2.4/ at SuSE] (gone) and are still [http://gd.tuwien.ac.at/utils/fs/reiserfs/quota-patches/ mirrored at TU-Wien]. The reason these patches were not included in the 2.4 kernel branch is that they implement a new quota format and need new quota code too, which is too big a change for the 2.4 series of kernels. Various Linux distribution vendors (e.g. [http://www.suse.com SuSE]) do ship reiserfs-quota-enabled kernels, though.

=== I am getting some errors in my kernel logs that I do not know how to interpret ===

Messages like:

 vs-13070: reiserfs_read_inode2: i/o failure occurred trying to find stat data of [1718696 1718710 0x0 SD]
 zam-7001: io error in reiserfs_find_entry

most likely accompanied by samples like those below are definite signs of hard disk problems (bad sectors):

 hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
 hda: dma_intr: error=0x40 { UncorrectableError }, LBAsect=6599945, sector=4286584
 end_request: I/O error, dev 03:03 (hda), sector 4286584

or

 scsi0: ERROR on channel 0, id 1, lun 0, CDB: Read (10) 00 00 01 ee 60 00 00 08 00
 Current sd 08:00: sense key Medium Error

or

 I/O error: dev 08:21, sector 65704

Messages about <tt>"access beyond end of device"</tt> can have many different causes, from not rebooting after fdisk requested it, to unfinished resizes, to data corruption. The following messages mean you have a noisy IDE cable, or it is just too low quality for the chosen UDMA mode.
Try to replace the cable with a better one, or choose a slower UDMA mode:

 hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
 hda: dma_intr: error=0x84 { DriveStatusError BadCRC }
 hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
 hda: dma_intr: error=0x84 { DriveStatusError BadCRC }

If you see any message from [[ReiserFS]] that you cannot interpret and there is nothing similar to the messages above around it, [[mailinglists|mail the message to us]] and we will explain it to you.

=== Will ReiserFS implement streams, extended attributes, etc.? ===

[[FAQ/streams|Here]] is the one-page answer.

=== ReiserFS appears to be very slow while the RAID is resyncing. Mounting takes several minutes. Once mounted, an <tt>ls(1)</tt> in the mounted directory hangs. Forever. Once the RAID is synced, things appear to work pretty well. How can that be fixed? ===

First of all, a patch that helps mount the drive faster has been included in the Linux kernel since 2.4.19. You can grab the patch for earlier kernels [http://gd.tuwien.ac.at/utils/fs/reiserfs/reiserfs-for-2.5/2.5.4.pending/07-reiserfs-bitmap-journal-read-ahead.diff here]. Also, RAID drivers have '''minimal guaranteed''' and '''maximal possible''' RAID rebuild bandwidth usage. These values are controlled through the <tt>/proc/sys/dev/raid/speed_limit_min</tt> and <tt>/proc/sys/dev/raid/speed_limit_max</tt> sysctl variables (values are in KiB/s). It seems that the RAID logic cannot always tell whether the disk subsystem is busy at a given time. When it thinks the disk subsystem is idle, it tries to rebuild the RAID array at the <tt>speed_limit_max</tt> speed, which defaults to 100 MB per second. Decrease this value to something more suitable (a bit of experimentation might be needed).

=== I get "attempt to read past the end of the partition" error messages; is ReiserFS corrupted? ===

You changed your partition sizes, and then before rebooting ran [[mkreiserfs]].
The kernel does not change its belief in what the partition sizes are until reboot time. (This is fixable, but nobody has fixed it as of Dec. 2001.) [[mkreiserfs]] created a filesystem that has the wrong notion of how large the partition it is on is. The filesystem's notion of what the partition boundaries are will last past reboot, even though the kernel's notion will change. So yes, it is corrupted. Some other kinds of metadata breakage can also lead to such messages.

=== Can I use VMware with ReiserFS? ===

VMware was tested on [http://www.suse.com/ SuSE Linux] with a [http://support.microsoft.com/gp/lifean18 Windows98] guest OS on a [[ReiserFS]] partition. There is one trick at the beginning: the following line was added to the VMware config file:

 host.FSSupportLocking1 = 0x52654973 # (0x52654973 == *(u32 *) "ReIs")

Thanks to [mailto:gkade@bigbrother.net Gregory K. Ade] for this hint.

=== How do I install Debian potato with ReiserFS as the root partition? ===

[[FAQ/potato_part|Here]] are instructions by [mailto:LeBlanc@mcc.ac.uk Dr. A.V. Le Blanc].

=== Starting with Linux kernel v2.4.21 I cannot mount my FS anymore. Why? ===

Sanity checks were added to the kernel code to prohibit mounting of filesystems that are bigger than the underlying block device. If you now see this message on mount:

 Filesystem on xx:yy cannot be mounted because it is bigger than the device

you may need to run fsck or increase the size of your LVM volume, or maybe you forgot to reboot after fdisk when it told you to. If you do not use LVM, that usually means you need to run <tt>[[reiserfsck]] --rebuild-sb</tt> on your filesystem and agree to change its default size to the proposed one.

=== Is it OK to use ReiserFS on a small storage device, e.g. a 16MB NAND flash block device? ===

[[FAQ/small_blocks|Here]] are instructions.

=== How do I change root from ext2 to ReiserFS without loss of data? ===

[[FAQ/change_fs|Here]] are instructions.
=== <tt>mount: /dev/hda5 has wrong major or minor number</tt> - what does that mean? ===

The kernel does not know anything about [[ReiserFS]]; it is neither compiled in nor available as a module.

=== Will it be possible to read/write ReiserFS partitions created now with future versions of ReiserFS? ===

Yes. [[ReiserFS]]-3.6.x (Linux-2.4.x) works with both the old (3.5) and the new (3.6) formats. ReiserFS-3.5.x (Linux-2.2.x) can only work with the old (3.5) disk format. There is no way to convert the new (3.6) disk format to the old (3.5), but the old (3.5) format can be converted to the new one (3.6) with the <tt>-o conv</tt> [[mount|mount option]].

=== The ReiserFS module doesn't insert properly - why? ===

After applying the patch, ''recompile'' the whole kernel including the modules target, reboot, then try to insert the module.

=== Can I use ReiserFS with software RAID? ===

Yes, for all RAID levels using any Linux >= 2.4.1, but '''DO NOT''' use RAID with Linux 2.2.x. Our journaling and their RAID code step on each other in the buffering code. Also, mirroring is '''not''' safe in the 2.2.x kernels, because online mirror rebuilds in 2.2.x break the write-ordering requirements for the log. If you crash in the middle of an online rebuild, your meta-data may be corrupted. The only RAID level that is safe with [[ReiserFS]] in the 2.2.x kernels is the striping/concatenation level.

=== Can I use ReiserFS with 3ware RAID? ===

Yes, but you need to use Linux 2.2.19 or later, for reasons other than [[ReiserFS]]. Also, if you encounter problems, you should be suspicious that it might not be ReiserFS that has the bug. See these [http://web.archive.org/web/20030415160519/http://www.3ware.com/support/raid5techbulletin.shtml special instructions] (archive.org).

=== Why do things freeze on my IDE hard drive for annoying amounts of time? ===

Because when large writes are scheduled all at once, reads can starve.
A fix for this is evolving; the later your ReiserFS patch, the better we handle this.

=== du says ReiserFS makes space efficiency worse. ===

Use df, not du, or use the "raw" option for du if your du supports that. st_blocks summed up is less accurate than st_size for [[ReiserFS]] because we pack tails, and st_blocks rounds numbers up.

=== [[mkreiserfs]] fails after repartitioning. ===

The kernel requires you to reboot after repartitioning (for all filesystems). We intend to fix that....

=== Performance is poor, and my disk at 96% full still has free space. ===

Once a disk drive gets more than 85% full, performance starts to suffer unless you use a repacker (which isn't implemented yet). You can probably get away with 92%, but if performance is valued, you are making a mistake to keep it any fuller. This is true for almost all filesystems. ReiserFS, because it packs tails together, packs more data into a given percentage used, but it is still subject to the rules for the maximum recommended percentage used. If you create the whole disk with one copy and then mount it read-only, then you can fully pack it without problem. Please be sure that you copy it from (or tar it from) a ReiserFS partition so that files are created in ReiserFS readdir order, as this will improve performance.

=== Why do I get a signal 11 when compiling the kernel using ReiserFS and not ext2? ===

Your CPU is overheating or you have bad RAM.

=== But it doesn't happen with ext2? ===

ext2 uses less heat-sensitive gates in the CPU. :-) Seriously, ext2 and ReiserFS contain random differences, and overheating and bad RAM have random sensitivities. (Signal 11 is not due to ReiserFS. One user had a cable blocking the fan; it did not affect ext2, but it wasn't until he fixed the cable-fan problem that ReiserFS worked...)

=== Can I use ReiserFS on architectures other than i386? ===

Yes, starting from Linux kernel 2.4.13, ReiserFS can run on any Linux-supported arch.
=== I need a program which will help me in rebuilding/recreating my partition table. ===

gpart (http://brzitwa.de/mb/gpart/index.html) is a utility that handles ext2, FAT, Linux swap, HPFS, NTFS, FreeBSD and Solaris/x86 disklabels, Minix, and ReiserFS; it prints proposed contents for the primary partition table and is well documented.

=== What partition type should I use for ReiserFS? ===

Linux native filesystem (83).

=== Can I use 32GB+ IDE hard drives with ReiserFS? ===

Yes, if you use Linux kernel 2.4 and up.

=== What about resizing ReiserFS? ===

Please follow this link.

=== What should I put into the fifth (aka dump, fs_freq) and the sixth (aka pass, fs_passno) fields of /etc/fstab for ReiserFS filesystems? ===

0 0

=== Why are ReiserFS filesystems not fscked on reboot after a crash? ===

Because ReiserFS provides journalling of meta-data. After a crash, the consistency of a filesystem is restored by replaying the transaction log.

=== Can I interactively repair a filesystem that was corrupted (due to an internal bug in the kernel or to a hardware fault)? ===

man [[reiserfsck]]

=== Can I use "dump" and "restore" with ReiserFS? Any caveats? ===

No. dump uses knowledge of the internal structure of ext2 and works together with restore, which also uses ext2-specific knowledge, to back up ext2 files. dump and restore are specific to ext2 and will not work with ReiserFS. To back up ReiserFS files, use tar, which is universal and can be applied to almost any reasonable Linux filesystem. It is well known among system administrators that dump is more complete than unix tar, and that there is quite a list of things that unix tar will fail to back up properly. This is not true of GNU tar, which is quite complete. Basically, the only real disadvantage of GNU tar compared to dump is speed. Unfortunately, because it shares the same name as unix tar, people are reluctant to believe this. (Yes, the GNU version has incremental backups, etc.)
We will performance optimize ReiserFS backups for you (and the rest of the world) for $30k, which is not a lot if you are a large site spending a few hundred thousand on equipment for backups.

=== Does ReiserFS support snapshots? ===

No, but you can create ReiserFS on top of an LVM logical volume and use LVM snapshot capabilities.

=== Can I check reiserfs filesystems for errors without unmounting them? ===

[[reiserfsck]] in checking mode may run over filesystems mounted read-only. There is no official way to fix mounted filesystems, though. You MUST completely unmount your filesystem in order to have it fixed. If you have LVM, you can check consistency of filesystems mounted read-write, here is the script contributed by Andreas Dilger:
This overcomes "everything-in-one-file" limitation but has the added complexity of requiring crash-recovery.) Answer from Nikita Danilov: Stop your development for now and wait until reiser4 filesystem will be released, that have transaction API exported to the userspace. That transaction API would solve all of your problems == Our program needs to access a lot of working files. What is the recommended way to organize files to get the best results out of ReiserFS? Should all the files be placed in a single directory, or should the files be spread across a directory tree to limit the number of files per directory? Can you also summarize the relevant caching and locking effects? == Traditional file systems typically have poor performance when there are many files in a single directory, but not [[ReiserFS]]. These other file systems perform poorly because they use a linear search algorithm to find and replace entries in a directory. This means that the file system must scan, on average, half the blocks of a directory for every access. Typically, applications are required to work around this problem by manually structuring a tree of directories, allowing each individual directory to remain limited in size. For example, see how the Squid web proxy stores a large collection of files. ReiserFS does not have this problem because it uses an internal tree to store all directories and file metadata. Directory operations remain effecient even for very large directories, so you can write your application free from this performance concern. However, there are several issues that complicate this matter: namely locking and locality. The Linux VFS currently imposes locking restrictions that serialize many operations on directories, so if concurrent processes or threads will access the collection of files then you may be better off using multiple directories. [[Reiser4]] will improve upon this restriction, although it is still under development. 
This FAQ is very [[ReiserFS]] centric and often a bit dated. The [[Reiser4]] filesystem is mentioned as ''upcoming''. Be sure to search the [[mailinglists|mailing list archives]] and help update this FAQ - Thanks!

__TOC__

=== What are the specs for ReiserFS: maximum number of files, of files a directory can have, of sub-dirs in a dir, of links to a file, maximum file size, maximum filesystem size, etc.? ===

Specifications for [[ReiserFS]]:

{|cellpadding="5" cellspacing="0" border="1"
| '''property''' || '''3.5''' || '''3.6'''
|-
| max number of files || 2<sup>32</sup> - 3 => 4 Gi - 3 || 2<sup>32</sup> - 3 => 4 Gi - 3
|-
| max number of files a dir can have || 518701895 (but in practice this value is limited by the hash function; the r5 hash allows about 1 200 000 file names without collisions) || 2<sup>32</sup> - 4 => 4 Gi - 4 (but in practice this value is limited by the hash function.
r5 hash allows about 1 200 000 file names without collisions)
|-
| max file size || 2<sup>31</sup> - 1 => 2 Gi - 1 || 2<sup>60</sup> bytes => 1 Ei, but the page cache limits this to 8 Ti on architectures with a 32-bit int
|-
| max number of links to a file || 2<sup>16</sup> => 64 Ki || 2<sup>32</sup> => 4 Gi
|-
| max filesystem size || 2<sup>32</sup> (4K) blocks => 16 Ti || 2<sup>32</sup> (4K) blocks => 16 Ti
|}

ReiserFS does '''meta-data journaling''', enabling fast crash recovery without the expense of full '''data journaling'''. There [ftp://ftp.suse.com/pub/people/mason/patches/intermezzo-alpha/ were] separate [http://marc.info/?l=reiserfs-devel&m=100895310422415&w=2 patches from Chris Mason] that implement full data journaling for ReiserFS for Linux 2.4.16.

'''Note''': Full data journaling is considered by many to be a good way to achieve file data integrity across system crashes. However, although file data may appear consistent from the kernel's point of view, since there is no API exported to userspace to control transactions, we may end up in a situation where an application makes two write requests (as part of one logical transaction) but only one of them gets journaled before the system crashes. From the application's point of view, we then end up with inconsistent data in the file. Such issues should be addressed with the upcoming [[Reiser4]]: a transaction API will be exported to userspace, and all programs that need transactions will be able to use it.

=== Mount fails after reiserfsck --rebuild-tree failure ===

When [[reiserfsck]] --rebuild-tree is run, the first thing it does is set the root inode value to -1, which makes the filesystem unmountable. (So if [[reiserfsck]] fails later on because the filesystem contains serious errors, the filesystem cannot be mounted.) Therefore, once [[reiserfsck]] --rebuild-tree has failed for one of your filesystems, mounting of that partition is disabled. To correct the error, first check that you have the latest [[Reiser4progs|reiserfsprogs]] package installed.
If that fails, please send a bug report to our [[mailinglists|mailing list]] and be ready to answer our questions.

=== Why is the execution time for a <tt>find . -type f | xargs cat</tt> command much longer when using ReiserFS than for the same command when using ext2? ===

This effect is observed if the measured file set was produced by untarring an archive that was not created from a ReiserFS partition (or by copying files from a non-ReiserFS partition, or by running a program that writes a bunch of files in some order). This is because the <tt>readdir()</tt> operation performed on the ReiserFS partition returns filenames not in the original write order but in some hash order (dependent on the hash function used). Thus when reading the files' contents, the hard drive heads must move when going from one file to another. If you want ReiserFS to outperform any other filesystem in your setup, here is one solution: copy the entire directory that you are not satisfied with to the same partition but with a different name (use <tt>cp -a</tt>), then remove the old directory and rename the new one to the old name. If the partition does not have enough space available, another approach is to <tt>tar(1)</tt> up the whole partition, clear it, and then untar the previously saved data.

=== Is quota support built into the vanilla 2.4 kernels for ReiserFS? ===

No. Quota support for the 2.4 kernel branch is bundled separately; the patches by Chris Mason were once available [ftp://ftp.suse.com/pub/people/mason/patches/reiserfs/quota-2.4/ at SuSE] (gone) and are still [http://gd.tuwien.ac.at/utils/fs/reiserfs/quota-patches/ mirrored at TU-Wien]. The reason these patches were not included in the 2.4 kernel branch is that they implement a new quota format and need new quota code too, which is too big a change for the 2.4 series of kernels. Various Linux distribution vendors (e.g. [http://www.suse.com SuSE]) do ship reiserfs-quota-enabled kernels, though.
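The copy-and-rename repacking trick described above can be scripted as a minimal sketch; the function name <tt>repack</tt> and the example path are placeholders, not tools shipped with reiserfsprogs:

```shell
# repack DIR: copy the directory aside on the same partition, then swap
# the copy in for the original, so its files end up laid out on disk in
# the readdir (hash) order that ReiserFS returns.
repack() {
    dir=$1
    cp -a "$dir" "$dir.repacked" &&   # cp -a preserves ownership and times
    rm -rf "$dir" &&                  # drop the fragmented original
    mv "$dir.repacked" "$dir"         # the freshly packed copy takes its name
}
```

For example, <tt>repack /var/spool/news</tt> (path is illustrative). Make sure nothing is writing to the directory while it is repacked, and that the partition has room for the temporary copy.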
=== I am getting some errors in my kernel logs that I do not know how to interpret ===

Messages like:

 vs-13070: reiserfs_read_inode2: i/o failure occurred trying to find stat data of [1718696 1718710 0x0 SD]
 zam-7001: io error in reiserfs_find_entry

most likely accompanied by samples like those below, are definite signs of hard disk problems (bad sectors):

 hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
 hda: dma_intr: error=0x40 { UncorrectableError }, LBAsect=6599945, sector=4286584
 end_request: I/O error, dev 03:03 (hda), sector 4286584

or

 scsi0: ERROR on channel 0, id 1, lun 0, CDB: Read (10) 00 00 01 ee 60 00 00 08 00
 Current sd 08:00: sense key Medium Error

or

 I/O error: dev 08:21, sector 65704

Messages about <tt>"access beyond end of device"</tt> can have many different causes, ranging from not rebooting after fdisk requested it, to unfinished resizes, to data corruption. The following messages mean you have a noisy IDE cable, or one of too low a quality for the chosen UDMA mode. Try replacing the cable with a better one, or choose a slower UDMA mode:

 hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
 hda: dma_intr: error=0x84 { DriveStatusError BadCRC }
 hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
 hda: dma_intr: error=0x84 { DriveStatusError BadCRC }

If you see any message from [[ReiserFS]] that you cannot interpret, and there is nothing similar to the messages above around it, [[mailinglists|mail the message to us]] and we will explain it to you.

=== Will ReiserFS implement streams, extended attributes, etc.? ===

[[FAQ/streams|Here]] is the one-page answer.

=== Reiserfs appears to be very slow while the RAID is resyncing. Mounting takes several minutes. Once mounted, an <tt>ls(1)</tt> in the mounted directory hangs. Forever. Once the RAID is sync'ed, things appear to work pretty well. How can that be fixed? ===

First of all, a patch that helps mount the drive faster has been included in the Linux kernel since 2.4.19.
You can grab the patch for earlier kernels [http://gd.tuwien.ac.at/utils/fs/reiserfs/reiserfs-for-2.5/2.5.4.pending/07-reiserfs-bitmap-journal-read-ahead.diff here]. Also, RAID drivers have a '''minimal guaranteed''' and a '''maximal possible''' RAID rebuild bandwidth usage. These values are controlled through the <tt>/proc/sys/dev/raid/speed_limit_min</tt> and <tt>/proc/sys/dev/raid/speed_limit_max</tt> sysctl variables (values are in KiB/s). The RAID logic cannot always tell whether the disk subsystem is busy at a given time; when it thinks the disk subsystem is idle, it tries to rebuild the RAID array at the <tt>speed_limit_max</tt> speed, which defaults to 100 MB per second. Decrease this value to something more suitable (a bit of experimentation might be needed).

=== I get attempt to read past the end of the partition error messages; is ReiserFS corrupted? ===

You changed your partition sizes and then, before rebooting, ran [[mkreiserfs]]. The kernel does not change its belief about the partition sizes until reboot time. (This is fixable, but nobody has fixed it as of Dec. 2001.) [[mkreiserfs]] therefore created a filesystem with the wrong notion of how large its partition is. The filesystem's notion of the partition boundaries will last past reboot, even though the kernel's notion will change. So yes, it is corrupted. Some other kinds of metadata breakage can also lead to such messages.

=== Can I use VMware with ReiserFS? ===

VMware was tested on [http://www.suse.com/ SuSE Linux] with a [http://support.microsoft.com/gp/lifean18 Windows98] guest OS on a [[ReiserFS]] partition. There is one trick at the beginning: the following line was added to the VMware config file:

 host.FSSupportLocking1 = 0x52654973 # (0x52654973 == *(u32 *) "ReIs")

Thanks to [mailto:gkade@bigbrother.net Gregory K. Ade] for this hint.

=== How do I install Debian potato with ReiserFS as root partition? ===

[[FAQ/potato_part|Here]] are instructions by [mailto:LeBlanc@mcc.ac.uk Dr. A.V. Le Blanc].

=== Starting with Linux kernel v2.4.21 I cannot mount my FS anymore. Why? ===

Special sanity checks were added to the kernel code to prohibit mounting of filesystems that are bigger than the underlying block device. If you now see this message on mount:

 Filesystem on xx:yy cannot be mounted because it is bigger than the device

you may need to run fsck or increase the size of your LVM partition. Or maybe you forgot to reboot after fdisk when it told you to. If you do not use LVM, this usually means you need to run <tt>[[reiserfsck]] --rebuild-sb</tt> on your filesystem and agree to change its default size to the proposed one.

=== Is it OK to use ReiserFS on a small storage device, e.g. a 16MB NAND flash block device? ===

[[FAQ/small_blocks|Here]] are instructions.

=== How do I change root from ext2 to ReiserFS without loss of data? ===

[[FAQ/change_fs|Here]] are instructions.

=== <tt>mount: /dev/hda5 has wrong major or minor number</tt> - what does that mean? ===

The kernel does not know anything about [[ReiserFS]]; it is neither compiled in nor available as a module.

=== Will it be possible to read/write ReiserFS partitions created now with future versions of ReiserFS? ===

Yes. [[ReiserFS]]-3.6.x (Linux-2.4.x) works with both the old (3.5) and the new (3.6) formats. ReiserFS-3.5.x (Linux-2.2.x) can only work with the old (3.5) disk format. There is no way to convert the new (3.6) disk format to the old (3.5), but the old (3.5) format can be converted to the new one (3.6) with the <tt>-o conv</tt> [[mount|mount option]].

=== The ReiserFS module doesn't insert properly - why? ===

After applying the patch, ''recompile'' the whole kernel, including the modules target, reboot, then try to insert the module.

=== Can I use ReiserFS with the software RAID? ===

Yes, for all RAID levels using any Linux >= 2.4.1, but DO NOT use RAID5 with Linux 2.2.x.
Our journaling and their RAID code step on each other in the buffering code. Also, mirroring is not safe in the 2.2.x kernels, because online mirror rebuilds in 2.2.x break the write-ordering requirements for the log: if you crash in the middle of an online rebuild, your meta-data may be corrupted. The only RAID level that is safe with ReiserFS in the 2.2.x kernels is the striping/concatenation level.

=== Can I use ReiserFS with 3ware RAID? ===

Yes, but you need to use Linux 2.2.19 or later for reasons other than [[ReiserFS]]. Also, if you encounter problems, be suspicious that the bug might not be in ReiserFS. See these [http://web.archive.org/web/20030415160519/http://www.3ware.com/support/raid5techbulletin.shtml special instructions] (archive.org).

=== Why do things freeze on my IDE hard drive for annoying amounts of time? ===

Because when large writes are scheduled all at once, reads can starve. A fix for this is evolving; the later your ReiserFS patch, the better we handle this.

=== du says ReiserFS makes space efficiency worse ===

Use df, not du, or use the "raw" option of du if your du supports it. st_blocks summed up is less accurate than st_size for [[ReiserFS]] because we pack tails, and st_blocks rounds numbers up.

=== mkreiserfs fails after repartitioning ===

The kernel requires you to reboot after repartitioning (for all filesystems). We intend to fix that....

=== Performance is poor, and my disk at 96% full still has free space ===

Once a disk drive gets more than 85% full, performance starts to suffer unless you use a repacker (which isn't implemented yet). You can probably get away with 92%, but if performance is valued, you are making a mistake to keep it any fuller. This is true for almost all filesystems. ReiserFS, because we pack tails together, packs more data into a given percentage used, but it is still subject to the rules for the maximum recommended percentage used.
If you create the whole disk with one copy and then mount it read-only, you can fully pack it without problems. Please be sure to copy it from (or tar it from) a ReiserFS partition, so that files are created in ReiserFS readdir order, as this will improve performance.

=== Why do I get a signal 11 when compiling the kernel using ReiserFS and not ext2? ===

Your CPU is overheating or you have bad RAM.

=== But it doesn't happen with ext2? ===

ext2 uses less heat-sensitive gates in the CPU. :-) Seriously, ext2 and ReiserFS contain random differences, and overheating and bad RAM have random sensitivities. (Signal 11 is not due to ReiserFS. One user had a cable blocking the fan; it did not affect ext2, but it wasn't until he fixed the cable-fan problem that ReiserFS worked...)

=== Can I use ReiserFS on other architectures than i386? ===

Yes. Starting from Linux kernel 2.4.13, ReiserFS can be run on any Linux-supported arch.

=== I need a program which will help me in rebuilding/recreating my partition table ===

gpart (http://brzitwa.de/mb/gpart/index.html) is a utility that handles ext2, FAT, Linux swap, HPFS, NTFS, FreeBSD and Solaris/x86 disklabels, Minix, and ReiserFS; it prints proposed contents for the primary partition table and is well documented.

=== What partition type should I use for ReiserFS? ===

Linux native filesystem (83).

=== Can I use 32GB+ IDE Hard Drives with ReiserFS? ===

Yes, if you use Linux kernel 2.4 and up.

=== What about resizing ReiserFS? ===

Please follow this link.

=== What should I put into the fifth (aka dump, fs_freq) and the sixth (aka pass, fs_passno) fields of /etc/fstab for ReiserFS filesystems? ===

0 0

=== Why are ReiserFS filesystems not fscked on reboot after a crash? ===

Because ReiserFS provides journaling of meta-data. After a crash, the consistency of a filesystem is restored by replaying the transaction log.
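For context, a complete <tt>/etc/fstab</tt> line for a ReiserFS filesystem, with the fifth and sixth fields set to <tt>0 0</tt> as recommended above (device and mount point are placeholders):

```
/dev/hda5   /home   reiserfs   defaults   0 0
```

The <tt>0</tt> in the fifth field excludes the filesystem from dump backups, and the <tt>0</tt> in the sixth field tells init scripts never to fsck it at boot, since journal replay handles crash recovery.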
=== Can I interactively repair a filesystem that was corrupted (due to an internal bug in the kernel or to a hardware fault)? ===

man [[reiserfsck]]

=== Can I use "dump" and "restore" with ReiserFS? Any caveats? ===

No. dump uses knowledge of the internal structure of ext2, and works together with restore, which also uses ext2-specific knowledge, to back up ext2 files. dump and restore are specific to ext2 and will not work with ReiserFS. To back up ReiserFS files, use tar, which is universal and can be applied to almost any reasonable Linux filesystem. It is well known among system administrators that dump is more complete than Unix tar, and that there is quite a list of things that Unix tar will fail to back up properly. This is not true of GNU tar, which is quite complete. Basically, the only real disadvantage of GNU tar compared to dump is speed. Unfortunately, because it shares its name with Unix tar, people are reluctant to believe this. (Yes, the GNU version has incremental backups, etc.) We will performance-optimize ReiserFS backups for you (and the rest of the world) for $30k, which is not a lot if you are a large site spending a few hundred thousand on equipment for backups.

=== Does ReiserFS support snapshots? ===

No, but you can create ReiserFS on top of an LVM logical volume and use LVM's snapshot capabilities.

=== Can I check ReiserFS filesystems for errors without unmounting them? ===

[[reiserfsck]] in checking mode may be run on filesystems mounted read-only. There is no official way to fix mounted filesystems, though: you MUST completely unmount your filesystem in order to have it fixed. If you have LVM, you can check the consistency of filesystems mounted read-write; here is the script contributed by Andreas Dilger:

=== What ReiserFS mount options should I use to get the performance winner for a mail server? ===

Craig Sanders answered in detail: "By the time I got around to running bonnie, the postmark and postal benchmarks had convinced me that notail was essential.

host system:
- Debian GNU/Linux (of course :)
- Linux kernel 2.4.2 with latest 20010305 ReiserFS patch
- dual P3-866 (256K cache)
- 512MB RAM
- Adaptec 19160 SCSI Controller

external drive box:
- Domex 8230u RAID controller, 32MB battery-backed cache
- 6 x 18GB IBM DDYS-T18350M drives

For this particular hardware I was using, reiserfs/notail on RAID5 was the clear performance winner for a mail server with lots of synced random I/O."

=== Does using ReiserFS mean I can just press the power-off button without running "shutdown" or "init 0", etc.? Does it mean there is no risk of data loss? ===

No, definitely not. As of now, ReiserFS only provides meta-data journaling; that is, it records which files have been created or opened, whether they have had their size changed, or where they have been relocated. It guarantees that the structure of the internal ReiserFS tree will be correct, thereby allowing you to start back up after an unclean shutdown without having to run fsck on all the files that have not been changed. Data in files that were being used at the time of the crash could have been corrupted; this is usual for most filesystems. Data-journaling filesystems guarantee that no garbage will be written into a file, but they do not guarantee that a file update will actually be completed. (Only Reiser4 guarantees that filesystem operations are performed as atomic operations, and provides atomic transaction functionality.) ReiserFS v3 guarantees neither that file contents are uncorrupted nor that no data is lost. Moreover, even if your entire system is on ReiserFS, many system components (like daemons, database managers, etc.) require a proper shutdown procedure to function correctly. However, there is a separate implementation of data logging that will soon go into the mainstream kernel.
You should be able to get it from ftp.suse.com/pub/people/mason/patches/data-logging

=== How does ReiserFS support bad block handling? ===

See here.

=== I have a motherboard with VIA MVP3 chipset and experience ReiserFS problems ===

William Oster <woster73@yahoo.com> answers: If you are using a motherboard with a VIA MVP3 chipset, you may have ReiserFS problems caused by the way your kernel is configured for the so-called "pci quirks". My experience is with kernels 2.2.18 and 2.2.19, but it may affect the 2.4.x series too if you are using the MVP3 chipset (popular in Socket 7 motherboards, such as those used by the AMD K6 and classic Pentium). I've confirmed this problem with several motherboards using the VIA MVP3 chipset, ReiserFS 3.5.29 to 3.5.32, and NCR 53c8xx SCSI. But please note: it probably affects any controller which uses DMA and PCI bus mastering. Problems which I was inclined to attribute to ReiserFS were actually problems with this kernel (mis)configuration. If you fit this profile, DO NOT enable the "pci quirks" configuration option in the /usr/src/linux/.config file. Although the Linux documentation suggests that this option can be enabled if in doubt, DO NOT enable it. It was never intended for the VIA MVP3 chipset anyway. It affects the way DMA is handled, and the combination with ReiserFS (and possibly NCR SCSI) can cause random disk corruption, which will eventually result in ReiserFS and/or SCSI errors. Evidently ReiserFS exercises the DMA and SCSI bus very thoroughly; the problems seem less likely under the ext2 filesystem. Check your /usr/src/linux/.config file. You are SAFE from this problem if you find this line:

 # CONFIG_PCI_QUIRKS is not set

Any other setting could be dangerous for MVP3-chipset ReiserFS users, especially when using PCI bus-mastering controllers such as the NCR 53c8xx series. Re-configure your kernel to disable the "pci quirks" option, then make dep, rebuild, and reinstall.
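A quick check for the setting described above can be sketched as follows (the default path assumes a kernel source tree at /usr/src/linux; adjust KCONFIG if yours lives elsewhere):

```shell
# Report whether PCI quirks are disabled in the kernel configuration.
KCONFIG="${KCONFIG:-/usr/src/linux/.config}"
if grep -q '^# CONFIG_PCI_QUIRKS is not set' "$KCONFIG" 2>/dev/null; then
    echo "safe: PCI quirks disabled"
else
    echo "warning: PCI quirks may be enabled - check $KCONFIG"
fi
```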
=== I am having extensive problems using ReiserFS; it seems to have bugs all over the place. I'm not compiling with a buggy compiler. What is happening? How can this be stable? ===

You have hardware problems. Really, you do. Even if the bugs don't show up with ext2, you have hardware problems. (See the FAQ question about ReiserFS running 3°C hotter than ext2.) Most SuSE users use ReiserFS. Obscure bugs probably still exist, but if you find bugs as easily as when using Windows, you have bad RAM, a bad CPU, a bad cable, bad cooling, a VIA chipset with PCI quirks turned on, or other hardware or software-layer bugs. ReiserFS is stable. You can be sure that if bugs are encountered easily and commonly with normal usage patterns, it is not us. This does not mean that the next release won't somehow break something though :-/..... At the time of writing, real bug reports are outnumbered 10 to 1 by hardware bugs that trigger error messages. We are working on making our error messages better at catching hardware bugs and identifying them as such. There is only so far we can go, though, in runtime consistency checking without serious speed reductions. We don't release software unless it goes through extensive testing; so if you don't think that our testing could have missed the bug, it is probably hardware.

=== How can I put a label (like that allowed by the <tt>-L</tt> option of <tt>mkfs.ext2</tt>) on a ReiserFS instance? ===

Currently, this feature is only implemented for the [[ReiserFS]] v3.6 disk format. Adding it to the v3.5 disk format would break the existing format, and there is not enough free space in the superblock. You can set a label (and UUID) on a [[ReiserFS]] v3.6 filesystem with a recent [[Reiser4progs|reiserfsprogs]] package, using the <tt>-l</tt> switch (<tt>-u</tt> for UUID) to the [[reiserfstune]] (for existing partitions) or [[mkreiserfs]] (for partitions being created) commands. Support for labels and UUIDs was integrated into [[Reiser4progs|reiserfsprogs]] starting from version 3.x.1a.
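For example (a sketch; the device <tt>/dev/hda5</tt> and the label <tt>backups</tt> are placeholders, both commands require root, and mkreiserfs is destructive):

```shell
# Label an existing ReiserFS v3.6 filesystem:
reiserfstune -l backups /dev/hda5

# Or set the label at filesystem creation time:
mkreiserfs -l backups /dev/hda5
```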
=== Why, when I'm working on files (i.e. have open files) on my laptop, does ReiserFS access the disk every 5 seconds? This effectively prevents the disk from spinning down, i.e. prevents APM modes from taking over, even when I'm not writing anything. ===

Brent Graveland <bgraveland@hyperchip.com> answers: It's the atime update. Every time you run sync, the sync program's atime is updated. The next sync writes this atime update, then sync gets updated again...

=== RedHat does not unmount / with ReiserFS on halt. How do I fix it? ===

RedHat users kindly provided these patches (not tested by us): rc.sysinit.patch and halt.patch. Note that if you have RedHat Linux 7.2 or later, you do not need these patches.

=== How do I run programs from the reiserfsprogs package on encrypted devices? ===

In order to access such encrypted entities, use the losetup tool to bind the entity to a loop device.

=== Are there any recommendations for or against particular hard drive manufacturers for use with ReiserFS? ===

There is basically no preference; the general rule "the faster the drive and the lower the seek time, the better" applies as always. On the other hand, almost every hard drive manufacturer has a "widely known" broken series of hard drives. The most recent example is IBM's "Deskstar" series of disks, especially the DTLA models produced in Hungary in 2000-2001. These are known to fail very often, to the point that you probably don't want to use them even if you already paid for them. Other Deskstar drives also seem to be a poor choice. IBM released a note saying that Deskstar drives should not run for more than 8 hours/day on average. These drives are also known to be very sensitive to temperature conditions and to fail on overheating. A class action lawsuit against IBM over that drive series is in progress.

=== I am using RedHat 7.0 with gcc 2.96; why does ReiserFS seem unstable with it? ===

Use the most recent version of RedHat (gcc 2.96-85 or later, as shipped with RedHat 7.2, although 7.1 is also okay for ReiserFS). The choice of the unstable, unreleased gcc 2.96 by RedHat as the default gcc was a Slashdot controversy. gcc 2.96 on RedHat 7.0 was unstable, and ReiserFS was one of the things that would fail under it. There are two gcc versions involved: 2.96 and 2.96-85. 2.96-85 works for ReiserFS; the other (the one on RedHat 7.0) surely does not. Read the Linux kernel instructions about what compiler to use. The solution to code not working on broken compilers is the one RedHat has taken: fix the compiler. They fixed the compiler and thereby allowed the correctly compiled ReiserFS to work.

=== In my program I am using fsync(2) calls after each write to the file to guarantee the integrity of my file data, and this is very slow. How can I improve the performance? ===

Answer from Chris Mason: The main thing to remember is that fsyncs introduce a bunch of disk writes and force the FS to wait on the buffers. The key to keeping performance up is to make it easy for the FS to do as much as possible before the fsync call. So, if your application modifies 3 files, and you want to make sure all 3 changes are safely on disk:

 write(file1)
 write(file2)
 write(file3)
 fsync(file1)
 fsync(file2)
 fsync(file3)

is much faster than:

 write(file1)
 fsync(file1)
 write(file2)
 fsync(file2)
 write(file3)
 fsync(file3)

It is also faster to write over existing bytes in a file than to append new bytes onto its end. When you overwrite existing bytes, you don't have to commit new metadata to disk on fsync(); the FS can just write the data blocks. This means fewer seeks. The more you write to a single file before calling fsync, the faster things will run overall:

 write(8k)
 fsync(file)

is much faster than:

 write(4k)
 fsync(file)
 write(4k)
 fsync(file)

Trying to optimize for those 3 things alone can make a huge performance difference overall.
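The batching effect described above can be seen from the shell too (a rough sketch: dd's conv=fsync forces an fsync() per file, while a single sync at the end flushes everything in one pass; the file names are placeholders):

```shell
# Slow pattern: force an fsync() after each file is written.
for f in file1 file2 file3; do
    dd if=/dev/zero of="$f" bs=4k count=1 conv=fsync 2>/dev/null
done

# Faster pattern: write all three files first, then flush once.
# Note sync(1) flushes ALL dirty data system-wide, not just these files.
for f in file1 file2 file3; do
    dd if=/dev/zero of="$f" bs=4k count=1 2>/dev/null
done
sync
```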
Answer from Josh MacDonald: You have to understand that even using fsync() after every write() makes no guarantees. If the system crashes during either the write or the fsync operation, your data may be lost or corrupted. Suppose the fsync() does complete: does your application keep its data in multiple files? If so, and you need to write() to multiple files as part of a transaction, you have even greater problems. The only safe and easy way to implement some kind of transaction with the traditional file system guarantees is to use rename():

1. Keep all of your data in a single file.
2. Periodically write a complete copy of your database to a temporary file.
3. Rename the temporary file to the original database name.

(Addition from Nikita Danilov: One can implement something like a phase-tree at user level and use rename to atomically switch the root of the tree. This overcomes the "everything-in-one-file" limitation but has the added complexity of requiring crash recovery.)

Answer from Nikita Danilov: Stop your development for now and wait until the reiser4 filesystem is released; it will have a transaction API exported to userspace, and that transaction API would solve all of your problems.

== Our program needs to access a lot of working files. What is the recommended way to organize files to get the best results out of ReiserFS? Should all the files be placed in a single directory, or should the files be spread across a directory tree to limit the number of files per directory? Can you also summarize the relevant caching and locking effects? ==

Traditional file systems typically have poor performance when there are many files in a single directory, but not [[ReiserFS]]. These other file systems perform poorly because they use a linear search algorithm to find and replace entries in a directory. This means that the file system must scan, on average, half the blocks of a directory for every access.
Typically, applications are required to work around this problem by manually structuring a tree of directories, allowing each individual directory to remain limited in size. For example, see how the Squid web proxy stores a large collection of files. ReiserFS does not have this problem because it uses an internal tree to store all directories and file metadata. Directory operations remain efficient even for very large directories, so you can write your application free from this performance concern.

However, there are several issues that complicate this matter, namely locking and locality. The Linux VFS currently imposes locking restrictions that serialize many operations on directories, so if concurrent processes or threads will access the collection of files, you may be better off using multiple directories. [[Reiser4]] will improve upon this restriction, although it is still under development.

ReiserFS attempts to store all of the files in a directory, along with the directory itself, in nearby locations on disk. An application may exploit this spatial locality if it can predict which files will be accessed with temporal locality. You may be better off using multiple directories to store your files if you can predict that many files within a directory will be accessed at the same time.

To summarize, ReiserFS supports efficient access to large directories, and most traditional file systems do not. However, locking and locality issues may guide your decision to use manually structured directory trees instead, at least until ReiserFS exports control over packing locality to users and improves its locking.

[[category:ReiserFS]] [[category:Reiser4]]

This FAQ is very [[ReiserFS]] centric and often a bit dated. The [[Reiser4]] filesystem is mentioned as ''upcoming''. Be sure to search the [[mailinglists|mailing list archives]] and help update this FAQ - Thanks!
__TOC__

=== What are the specs for ReiserFS: maximum number of files, of files a directory can have, of sub-dirs in a dir, of links to a file, maximum file size, maximum filesystem size, etc.? ===

Specifications for [[ReiserFS]]:

{| cellpadding="5" cellspacing="0" border="1"
| '''property''' || '''3.5''' || '''3.6'''
|-
| max number of files || 2<sup>32</sup>-3 => 4 Gi-3 || 2<sup>32</sup>-3 => 4 Gi-3
|-
| max number of files a dir can have || 518701895 (but in practice this value is limited by the hash function; the r5 hash allows about 1 200 000 file names without collisions) || 2<sup>32</sup>-4 => 4 Gi-4 (but in practice this value is limited by the hash function; the r5 hash allows about 1 200 000 file names without collisions)
|-
| max file size || 2<sup>31</sup>-1 => 2 Gi-1 || 2<sup>60</sup> bytes => 1 Ei, but the page cache limits this to 8 Ti on architectures with 32-bit int
|-
| max number of links to a file || 2<sup>16</sup> => 64 Ki || 2<sup>32</sup> => 4 Gi
|-
| max filesystem size || 2<sup>32</sup> (4K) blocks => 16 Ti || 2<sup>32</sup> (4K) blocks => 16 Ti
|}

ReiserFS does '''meta-data journaling''', enabling fast crash recovery without the expense of full '''data journaling'''. There [ftp://ftp.suse.com/pub/people/mason/patches/intermezzo-alpha/ were] separate [http://marc.info/?l=reiserfs-devel&m=100895310422415&w=2 patches from Chris Mason] that implement full data journaling for ReiserFS for Linux 2.4.16.

'''Note''': Full data journaling is considered by many to be a good way to achieve file data integrity across system crashes. However, although file data may appear to be consistent from the kernel's point of view, since there is no API exported to userspace to control transactions, we may end up in a situation where the application makes two write requests (as part of one logical transaction) but only one of them gets journaled before the system crashes. From the application's point of view, we may then end up with inconsistent data in the file. Such issues should be addressed with the upcoming [[Reiser4]].
Such an API will be exported to userspace, and all programs that need transactions will be able to use it.

=== Mount fails after reiserfsck --rebuild-tree failure ===

When [[reiserfsck]] --rebuild-tree is run, the first thing it does is set the root inode value to -1. This makes the filesystem unmountable. (So if [[reiserfsck]] later fails because the filesystem contains serious errors, the filesystem cannot be mounted.) Therefore, once [[reiserfsck]] --rebuild-tree has failed for one of your filesystems, mounting of that partition is disabled. To correct the error, first check that you have the latest [[Reiser4progs|reiserfsprogs]] package installed. If that fails, please send a bug report to our [[mailinglists|mailing list]] and be ready to answer our questions.

=== Why is the execution time for a <tt>find . -type f | xargs cat {} \;</tt> command much longer when using ReiserFS than for the same command when using ext2? ===

This effect is observed if the measured file set was produced by untarring an archive that was not created from a ReiserFS partition (or by copying files from a non-ReiserFS partition, or by running a program that writes a bunch of files in some order). This is because the <tt>readdir()</tt> operation performed on the ReiserFS partition returns filenames not in the original write order but in some hash order (dependent on the hash function used). Thus, when reading the files' contents, the hard drive heads must move when going from one file to another. If you want ReiserFS to outperform any other filesystem in your setup, here is one solution: copy the entire directory that you are not satisfied with to the same partition but with a different name (use <tt>cp -a</tt>), then remove the old directory and rename the new one to the old name. If the partition does not have enough space available, another approach is to <tt>tar(1)</tt> up the whole partition, clear it, and then untar the previously saved data.
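The readdir() ordering that the answer above hinges on is easy to observe. Below is a minimal sketch (the function name is illustrative, not from the FAQ) that prints entries in exactly the order readdir() returns them; on ReiserFS that is hash order rather than creation order, which is why sequentially reading files in this order can force extra seeks.

```c
/* List a directory in the order readdir() returns entries. POSIX leaves
 * this order unspecified; ReiserFS yields hash order, not write order. */
#include <dirent.h>
#include <stdio.h>

/* Print each entry name; return the number of entries, or -1 on error. */
int list_in_readdir_order(const char *path)
{
    DIR *d = opendir(path);
    if (!d)
        return -1;
    int count = 0;
    struct dirent *de;
    while ((de = readdir(d)) != NULL) {
        printf("%s\n", de->d_name);
        count++;
    }
    closedir(d);
    return count;
}
```

Running this over a freshly untarred directory and comparing the output to the archive's member order shows the mismatch the answer describes; the <tt>cp -a</tt> trick works because the copy recreates the files in readdir (hash) order.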
=== Is quota support built into the vanilla 2.4 kernels for ReiserFS? ===

No. Quota support for the 2.4 kernel branch is bundled separately; the patches by Chris Mason were once available [ftp://ftp.suse.com/pub/people/mason/patches/reiserfs/quota-2.4/ at SuSE] (gone) and are still [http://gd.tuwien.ac.at/utils/fs/reiserfs/quota-patches/ mirrored at TU-Wien]. The reason these patches were not included in the 2.4 kernel branch is that they implement a new quota format and need new quota code too, which is too big a change for the 2.4 series of kernels. Various Linux distribution vendors (e.g. [http://www.suse.com SuSE]) do ship reiserfs-quota enabled kernels, though.

=== I am getting some errors in my kernel logs that I do not know how to interpret ===

Messages like:

 vs-13070: reiserfs_read_inode2: i/o failure occurred trying to find stat data of [1718696 1718710 0x0 SD]
 zam-7001: io error in reiserfs_find_entry

most likely accompanied by messages like the samples below, are definite signs of hard disk problems (bad sectors):

 hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
 hda: dma_intr: error=0x40 { UncorrectableError }, LBAsect=6599945, sector=4286584
 end_request: I/O error, dev 03:03 (hda), sector 4286584

or

 scsi0: ERROR on channel 0, id 1, lun 0, CDB: Read (10) 00 00 01 ee 60 00 00 08 00
 Current sd 08:00: sense key Medium Error

or

 I/O error: dev 08:21, sector 65704

Messages about <tt>"access beyond end of device"</tt> can have many different causes, from not rebooting after fdisk requested it, to unfinished resizings, to data corruption. The following messages mean you have a noisy IDE cable, or it is just too low quality for the chosen UDMA mode.
Try replacing the cable with a better one, or choose a slower UDMA mode:

 hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
 hda: dma_intr: error=0x84 { DriveStatusError BadCRC }
 hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
 hda: dma_intr: error=0x84 { DriveStatusError BadCRC }

If you see any message from [[ReiserFS]] that you cannot interpret, and there is nothing similar to the messages above around it, [[mailinglists|mail the message to us]] and we will explain it to you.

=== Will ReiserFS implement streams, extended attributes, etc.? ===

[[FAQ/streams|Here]] is the one page answer.

=== ReiserFS appears to be very slow while the RAID is resyncing. Mounting takes several minutes. Once mounted, an <tt>ls(1)</tt> in the mounted directory hangs. Forever. Once the RAID is synced, things appear to work pretty well. How can that be fixed? ===

First of all, a patch that helps mount the drive faster has been included in the Linux kernel since 2.4.19. You can grab the patch for earlier kernels [http://gd.tuwien.ac.at/utils/fs/reiserfs/reiserfs-for-2.5/2.5.4.pending/07-reiserfs-bitmap-journal-read-ahead.diff here]. Also, the RAID drivers have '''minimal guaranteed''' and '''maximal possible''' RAID rebuild bandwidth usage. These values are controlled through the <tt>/proc/sys/dev/raid/speed_limit_min</tt> and <tt>/proc/sys/dev/raid/speed_limit_max</tt> sysctl variables (values are in 100 KiB/s). It seems that the RAID logic cannot always tell whether the disk subsystem is busy at a given time. When it thinks the disk subsystem is idle, it tries to rebuild the RAID array at the <tt>speed_limit_max</tt> speed, which defaults to 100 MB per second. Decrease this value to something more suitable (a bit of experimentation might be needed).

=== I get attempt to read past the end of the partition error messages; is ReiserFS corrupted? ===

You changed your partition sizes, and then before rebooting ran [[mkreiserfs]].
The kernel does not change its belief about the partition sizes until reboot time. (This is fixable, but nobody has fixed it as of Dec. 2001.) [[mkreiserfs]] created a filesystem that has the wrong notion of how large its partition is. The filesystem's notion of the partition boundaries will last past reboot even though the kernel's notion will change. So yes, it is corrupted. Some other kinds of metadata breakage can also lead to such messages.

=== Can I use VMware with ReiserFS? ===

VMware was tested on [http://www.suse.com/ SuSE Linux] with a [http://support.microsoft.com/gp/lifean18 Windows98] guest OS on a [[ReiserFS]] partition. There is one trick at the beginning: the following line was added to the VMware config file:

 host.FSSupportLocking1 = 0x52654973 # (0x52654973 == *(u32 *) "ReIs")

Thanks to [mailto:gkade@bigbrother.net Gregory K. Ade] for this hint.

=== How do I install Debian potato with ReiserFS as root partition? ===

[[FAQ/potato_part|Here]] are instructions by [mailto:LeBlanc@mcc.ac.uk Dr. A.V. Le Blanc].

=== Starting with Linux kernel v2.4.21 I cannot mount my FS anymore. Why? ===

Special sanity checks were added to the kernel code to prohibit mounting filesystems that are bigger than the underlying block device. If you now see this message on mount:

 Filesystem on xx:yy cannot be mounted because it is bigger than the device

you may need to run fsck or increase the size of your LVM partition. Or maybe you forgot to reboot when fdisk told you to. If you do not use LVM, that usually means you need to run <tt>[[reiserfsck]] --rebuild-sb</tt> on your filesystem and agree to change its default size to the proposed one.

=== Is it ok to use ReiserFS on a small size storage device: e.g. 16MB NAND flash block device? ===

[[FAQ/small_blocks|Here]] are instructions.

=== How do I change root from ext2 to ReiserFS without loss of data? ===

[[FAQ/change_fs|Here]] are instructions.
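Returning to the "bigger than the device" sanity check described above: it can be approximated from userspace. The sketch below is an assumption-laden illustration, not the kernel's code. It presumes the ReiserFS v3.x superblock sits at byte offset 65536 with a little-endian 32-bit block count as its first field, and assumes a 4 KiB block size; verify these against reiserfs_fs.h before relying on it.

```c
/* Rough userspace version of the kernel sanity check: compare the block
 * count recorded in the ReiserFS superblock with the size of the
 * underlying block device. Layout assumptions are noted in the lead-in. */
#include <fcntl.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fs.h>   /* BLKGETSIZE64 */

#define REISERFS_SB_OFFSET 65536
#define ASSUMED_BLOCK_SIZE 4096

/* Decode a little-endian 32-bit value, independent of host byte order. */
uint32_t le32_to_host(const unsigned char b[4])
{
    return (uint32_t)b[0] | ((uint32_t)b[1] << 8) |
           ((uint32_t)b[2] << 16) | ((uint32_t)b[3] << 24);
}

/* Return 1 if the filesystem claims more bytes than the device has,
 * 0 if it fits, -1 on error. */
int fs_bigger_than_device(const char *dev)
{
    unsigned char buf[4];
    uint64_t dev_bytes;
    int fd = open(dev, O_RDONLY);
    if (fd < 0)
        return -1;
    if (pread(fd, buf, 4, REISERFS_SB_OFFSET) != 4 ||
        ioctl(fd, BLKGETSIZE64, &dev_bytes) != 0) {
        close(fd);
        return -1;
    }
    close(fd);
    uint64_t fs_bytes = (uint64_t)le32_to_host(buf) * ASSUMED_BLOCK_SIZE;
    return fs_bytes > dev_bytes;
}
```

In the failure scenario the FAQ describes (repartitioning without a reboot, or an unfinished resize), this check returns 1, which is exactly when <tt>reiserfsck --rebuild-sb</tt> is the suggested remedy.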
=== <tt>mount: /dev/hda5 has wrong major or minor number</tt> - what does that mean? ===

The kernel does not know anything about [[ReiserFS]]; it is neither compiled in nor available as a module.

=== Will it be possible to read/write ReiserFS partitions created now with future versions of ReiserFS? ===

Yes. ReiserFS-3.6.x (Linux-2.4.x) works with both the old (3.5) and the new (3.6) formats. ReiserFS-3.5.x (Linux-2.2.x) can only work with the old (3.5) disk format. There is no way to convert the new (3.6) disk format to the old (3.5), but the old (3.5) format can be converted to the new one (3.6) with the "-o conv" mount option.

=== The ReiserFS module doesn't insert properly - why? ===

After applying the patch, ''recompile'' the whole kernel including the modules target, reboot, then try to insert the module.

=== Can I use ReiserFS with software RAID? ===

Yes, for all RAID levels using any Linux >= 2.4.1, but DO NOT use RAID5 with Linux 2.2.x. Our journaling and their RAID code step on each other in the buffering code. Also, mirroring is not safe in the 2.2.x kernels, because online mirror rebuilds in 2.2.x break the write ordering requirements for the log. If you crash in the middle of an online rebuild, your meta-data may be corrupted. The only RAID level that is safe with ReiserFS in the 2.2.x kernels is the striping/concatenation level.

=== Can I use ReiserFS with 3ware RAID? ===

Yes, but you need to use Linux 2.2.19 or later, for reasons other than [[ReiserFS]]. Also, if you should encounter problems, be aware that it might not be ReiserFS that has the bug. 3ware published [http://web.archive.org/web/20030415160519/http://www.3ware.com/support/raid5techbulletin.shtml special instructions] (archive.org).

=== Why do things freeze on my IDE hard drive for annoying amounts of time? ===

Because when large writes are scheduled all at once, reads can starve. A fix for this is evolving; the later your ReiserFS patch, the better we handle this.
=== du says ReiserFS makes space efficiency worse ===

Use df, not du, or use the "raw" option for du if your du supports it. st_blocks summed up is less accurate than st_size for [[ReiserFS]] because we pack tails, and st_blocks rounds numbers up.

=== [[mkreiserfs]] fails after repartitioning ===

The kernel requires you to reboot after repartitioning (for all filesystems). We intend to fix that....

=== Performance is poor, and my disk at 96% full still has free space ===

Once a disk drive gets more than 85% full, performance starts to suffer unless you use a repacker (which isn't implemented yet). You can probably get away with 92%, but if performance is valued you are making a mistake to keep it any fuller. This is true for almost all filesystems. ReiserFS, because we pack tails together, packs more data into a given percentage used, but it is still subject to the rules for the maximum recommended percentage used. If you create the whole disk with one copy and then mount it read-only, then you can fully pack it without problems. Please be sure to copy it from (or tar it from) a ReiserFS partition so that files are created in ReiserFS readdir order, as this will improve performance.

=== Why do I get a signal 11 when compiling the kernel using ReiserFS and not ext2? ===

Your CPU is overheating or you have bad RAM.

=== But it doesn't happen with ext2? ===

ext2 uses less heat sensitive gates in the CPU. :-) Seriously, ext2 and ReiserFS contain random differences, and overheating and bad RAM have random sensitivities. (Signal 11 is not due to ReiserFS. One user had a cable blocking the fan; it did not affect ext2, but it wasn't until he fixed the cable-fan problem that ReiserFS worked ...)

=== Can I use ReiserFS on other architectures than i386? ===

Yes. Starting from Linux kernel 2.4.13, ReiserFS can run on any Linux-supported arch.

=== I need a program which will help me in rebuilding/recreating my partition table.
===

[http://brzitwa.de/mb/gpart/index.html gpart] is a utility that handles ext2, FAT, Linux swap, HPFS, NTFS, FreeBSD and Solaris/x86 disklabels, Minix, and ReiserFS; it prints proposed contents for the primary partition table and is well documented.

=== What partition type should I use for ReiserFS? ===

Linux native filesystem (83).

=== Can I use 32GB+ IDE Hard Drives with ReiserFS? ===

Yes, if you use Linux kernel 2.4 and up.

=== What about resizing ReiserFS? ===

Please follow this link.

=== What should I put into the fifth (aka dump, fs_freq) and the sixth (aka pass, fs_passno) fields of /etc/fstab for ReiserFS filesystems? ===

0 0

=== Why are ReiserFS filesystems not fscked on reboot after a crash? ===

Because ReiserFS provides journalling of meta-data. After a crash, the consistency of a filesystem is restored by replaying the transaction log.

=== Can I interactively repair a filesystem that was corrupted (due to an internal bug in the kernel or to a hardware fault)? ===

man [[reiserfsck]]

=== Can I use "dump" and "restore" with ReiserFS? Any caveats? ===

No. dump uses knowledge of the internal structure of ext2 and works together with restore, which also uses ext2-specific knowledge, to back up ext2 files. dump and restore are specific to ext2 and will not work with ReiserFS. To back up ReiserFS files use tar, which is universal and can be applied to almost any reasonable Linux filesystem. It is well known among system administrators that dump is more complete than unix tar, and that there is quite a list of things that unix tar will fail to properly back up. This is not true of Gnu tar, which is quite complete. Basically, the only real disadvantage of Gnu tar compared to dump is speed. Unfortunately, because it shares the same name as unix tar, people are reluctant to believe this. (Yes, the Gnu version has incremental backups, etc.)
We will performance optimize ReiserFS backups for you (and the rest of the world) for $30k, which is not a lot if you are a large site spending a few hundred thousand on equipment for backups.

=== Does ReiserFS support snapshots? ===

No, but you can create ReiserFS on top of an LVM logical volume and use LVM's snapshot capabilities.

=== Can I check ReiserFS filesystems for errors without unmounting them? ===

[[reiserfsck]] in checking mode may run over filesystems mounted read-only. There is no official way to fix mounted filesystems, though. You MUST completely unmount your filesystem in order to have it fixed. If you have LVM, you can check the consistency of filesystems mounted read-write; here is the script contributed by Andreas Dilger:

=== What ReiserFS mount options should I use to get the best performance for a mail server? ===

Craig Sanders answered in detail: "By the time I got around to running bonnie, the postmark and postal benchmarks had convinced me that notail was essential.

host system:
- Debian GNU/Linux (of course :)
- Linux kernel 2.4.2 with latest 20010305 ReiserFS patch
- dual P3-866 (256K cache)
- 512MB RAM
- Adaptec 19160 SCSI Controller

external drive box:
- Domex 8230u RAID controller, 32MB battery-backed cache.
- 6 x 18GB IBM DDYS-T18350M drives

For this particular hardware I was using, reiserfs/notail on RAID5 was the clear performance winner for a mail server with lots of synced random I/O."

=== Does using ReiserFS mean I can just press the power off button without running "shutdown" or "init 0," etc? Does it mean there is no risk of data loss? ===

No, definitely not. As of now, ReiserFS only provides meta-data journaling--that is, it records which files have been created or opened, whether they have had their size changed, or where they have been relocated.
It guarantees that the structure of the internal ReiserFS tree will be correct, thereby allowing you to start back up after an unclean shutdown without having to run fsck on all the files that have not been changed. Data in files that were being used at the time of the crash could have been corrupted. This is usual for most filesystems. Data journaling filesystems guarantee that no garbage will be written into a file, but they don't guarantee that a file update will be completed. (Only reiser4 guarantees that filesystem operations are performed as atomic operations, and provides atomic transaction functionality.) ReiserFS V3 guarantees neither that the file contents themselves are uncorrupted nor that no data is lost. Moreover, even if all of your system is on ReiserFS, many system components (like daemons, database managers, etc.) require the shutdown procedure for proper functioning. However, there is a separate implementation of data logging that will soon go into the mainstream kernel. You should be able to get it from ftp.suse.com/pub/people/mason/patches/data-logging

=== How does ReiserFS support bad block handling? ===

See here.

=== I have a motherboard with VIA MVP3 chipset and experience ReiserFS problems. ===

William Oster <woster73@yahoo.com> answers: If you are using a motherboard with a VIA MVP3 chipset, you may have ReiserFS problems caused by the way your kernel is configured for the so-called "pci quirks". My experience is with kernels 2.2.18 and 2.2.19, but it may affect the 2.4.x series too if you are using the MVP3 chipset (popular in socket 7 type motherboards, such as those used by the AMD K6 and classic Pentium). I've confirmed this problem with several motherboards using the VIA MVP3 chipset, ReiserFS 3.5.29 to 3.5.32, and NCR 53c8xx SCSI. But please note: it probably affects any controller which uses DMA and PCI bus mastering. Problems which I was inclined to attribute to ReiserFS were actually problems with this kernel [mis]configuration.
If you fit this profile, DO NOT enable the "pci quirks" configuration option in the /usr/src/linux/.config file. Although the Linux documentation suggests that this option can be enabled if in doubt, DO NOT enable it. It was never intended for the VIA MVP3 chipset anyway. It affects the way DMA is handled, and the combination with ReiserFS (and possibly NCR SCSI) can cause random disk corruption which will eventually result in ReiserFS and/or SCSI errors. Evidently ReiserFS exercises the DMA and SCSI bus very thoroughly. The problems seem not to be as likely under the ext2 filesystem. Check your /usr/src/linux/.config file. You are SAFE from this problem if you find this line:

 # CONFIG_PCI_QUIRKS is not set

Any other setting could be dangerous to MVP3 chipset ReiserFS users, especially when using PCI bus mastering controllers such as the NCR 53c8xx series. Re-configure your kernel to disable the "pci quirks" option, then make dep, rebuild, and reinstall.

=== I am having extensive problems using ReiserFS; it seems to have bugs all over the place. I'm not compiling with a buggy compiler. What is happening? How can this be stable? ===

You have hardware problems. Really, you do. Even if the bugs don't show up with ext2, you have hardware problems. (See the FAQ question about ReiserFS running 3°C hotter than ext2.) Most SuSE users use ReiserFS. Obscure bugs probably still exist; but if you find bugs as easily as when using Windows, you have bad RAM, a bad CPU, a bad cable, bad cooling, a VIA chipset with PCI quirks turned on, or other hardware or software layer bugs. ReiserFS is stable. You can be sure that if the bugs are encountered easily and commonly with normal usage patterns, it is not us. This does not mean that the next release won't somehow break something though :-/..... Real bug reports are, at the time of writing, outnumbered 10 to 1 by hardware bugs that trigger error messages.
We are working on making our error messages better at catching hardware bugs and identifying them as such. There is only so far we can go, though, in runtime consistency checking without serious speed reductions. We don't release software unless it goes through extensive testing; so if you don't think that our testing could have missed the bug, it is probably hardware.

=== How can I put a label (like allowed by <tt>-L</tt> option of <tt>mkfs.ext2</tt>) on a ReiserFS instance? ===

Currently, this feature is only implemented for the [[ReiserFS]] v3.6 disk format. Adding it to the v3.5 disk format would break the existing disk format, and there is not enough free space in the superblock. You can set a label (and UUID) with a recent [[Reiser4progs|reiserfsprogs]] package on a [[ReiserFS]] v3.6 filesystem using the <tt>-l</tt> switch (<tt>-u</tt> for UUID) to the [[reiserfstune]] (for existing partitions) or [[mkreiserfs]] (for partitions being created) commands. Support for labels and UUIDs was integrated into [[Reiser4progs|reiserfsprogs]] starting from version 3.x.1a.
=== Are there any recomendation pro or against any particular hard drive manufacturers for using with reiserfs? === There is basically no preference, general "the faster the drive is and less seek time is better" rule applies as always. On the other hand almost every hard drive manufacturer has a "widely known" broken series of hard drives. The most recent example is IBM's "Deskstar" series disks, especially DTLA models produced in Hungary 2000-2001. These are known to fail very often, to the point that you probably don't want to use them even if you already paid for them. Also other Deskstar drives are seem to be a not very good choice. IBM released a note that deskstar drives should not run for more then 8 hours/day on average. These drives are also known to be very sensitive to temperature conditions and are known to fail on overheating. There is class action lawsuit against IBM on that drives series which is in progress. === I am using RedHat 7.0 with gcc 2.96; why does ReiserFS seem unstable with it? === Use the most recent version of RedHat (gcc Linux 2.96-85 or later with RedHat 7.2, although 7.1 is also okay for ReiserFS). The choice of an unstable unreleased version of gcc 2.96 by RedHat as the default gcc was a Slashdot controversy. gcc 2.96 on RedHat 7.0 was unstable, and ReiserFS was one of the things that would fail for it. There are two gcc: 2.96 and 2.96-85 's. 2.96-85 works for ReiserFS, and the other (the one on RedHat 7.0) surely does not. Read the Linux kernel instructions about what compiler to use. The solution to code not working on broken compilers is the one RedHat has taken-fix the compiler. They fixed the compiler and thereby allowed the correctly compiled ReiserFS to work. === In my program I am using fsync(2) calls after each write to the file to guarantee integrity of my file data, and this is very slow, how can I improve the performance? 
431d14d591521f7ffa0b2cb7356b2774bb03b085 1426 1424 2009-06-27T01:17:00Z Chris goe 2 from: Frequently_Asked_Questions This FAQ is very [[ReiserFS]]-centric and often a bit dated. The [[Reiser4]] filesystem is mentioned as ''upcoming''. Be sure to search the [[mailinglists|mailing list archives]] and help update this FAQ - Thanks! __TOC__ === What are the specs for ReiserFS: maximum number of files, of files a directory can have, of sub-dirs in a dir, of links to a file, maximum file size, maximum filesystem size, etc.? === Specifications for [[ReiserFS]]: {|cellpadding="5" cellspacing="0" border="1" | '''property''' || '''3.5''' || '''3.6''' |- | max number of files || 2<sup>32</sup>-3 => 4 Gi - 3 || 2<sup>32</sup>-3 => 4 Gi - 3 |- | max number of files a dir can have || 518701895 (but in practice this value is limited by the hash function. r5 hash allows about 1 200 000 file names without collisions) || 2<sup>32</sup>-4 => 4 Gi - 4 (but in practice this value is limited by the hash function.
r5 hash allows about 1 200 000 file names without collisions) |- | max file size || 2<sup>31</sup>-1 => 2 Gi-1 || 2<sup>60</sup> bytes => 1 Ei, but the page cache limits this to 8 Ti on architectures with a 32-bit int |- | max number of links to a file || 2<sup>16</sup> => 64 Ki || 2<sup>32</sup> => 4 Gi |- | max filesystem size || 2<sup>32</sup> (4K) blocks => 16 Ti || 2<sup>32</sup> (4K) blocks => 16 Ti |} ReiserFS does '''meta-data journaling''', enabling fast crash recovery without the expense of full '''data journaling'''. There [ftp://ftp.suse.com/pub/people/mason/patches/intermezzo-alpha/ were] separate [http://marc.info/?l=reiserfs-devel&m=100895310422415&w=2 patches from Chris Mason] that implement full data journaling for ReiserFS for Linux 2.4.16. '''Note''': Full data journaling is considered by many to be a good way to achieve file data integrity across system crashes. However, although file data may appear to be consistent from the kernel's point of view, since there is no API exported to userspace to control transactions, we may end up in a situation where the application makes two write requests (as part of one logical transaction) but only one of them gets journaled before the system crashes. From the application's point of view, we may then end up with inconsistent data in the file. Such issues should be addressed by the upcoming [[Reiser4]]: a transaction API will be exported to userspace, and all programs that need transactions will be able to use it. === Mount fails after reiserfsck --rebuild-tree failure === When [[reiserfsck]] --rebuild-tree is run, the first thing it does is set the root inode value to -1. This makes the filesystem unmountable. (So if [[reiserfsck]] fails later on, because the filesystem contains serious errors, the filesystem cannot be mounted.) Therefore, once [[reiserfsck]] --rebuild-tree has failed for one of your filesystems, mounting of that partition is disabled. To correct the error, first check that you have the latest [[Reiser4progs|reiserfsprogs]] package installed.
If that fails, please send a bug report to our [[mailinglists|mailing list]] and be ready to answer our questions. === Why is the execution time for a <tt>find . -type f | xargs cat</tt> command much longer when using ReiserFS than for the same command when using ext2? === This effect is observed if the measured file set was produced by untarring an archive that was not created from a ReiserFS partition (or by copying files from a non-ReiserFS partition, or by running a program that writes a bunch of files in some order). This is because the <tt>readdir()</tt> operation performed on the ReiserFS partition returns filenames not in the original write order but rather in some hash order (dependent on the hash function used). Thus, when reading the files' contents, the hard drive heads must move when going from one file to another. If you want ReiserFS to outperform any other filesystem in your setup, here is one solution: Copy the entire directory that you are not satisfied with to the same partition but with a different name (use <tt>cp -a</tt>), then remove the old directory and rename the new one to the old name. If the partition does not have enough space available, another approach is to <tt>tar(1)</tt> up the whole partition, clear it, and then untar the previously saved data. === Is quota support built into the vanilla 2.4 kernels for ReiserFS? === No. Quota support for the 2.4 kernel branch is bundled separately; the patches by Chris Mason were once available [ftp://ftp.suse.com/pub/people/mason/patches/reiserfs/quota-2.4/ at SuSE] (gone) and are still [http://gd.tuwien.ac.at/utils/fs/reiserfs/quota-patches/ mirrored at TU-Wien]. The reason these patches were not included in the 2.4 kernel branch is that they implement a new quota format and need new quota code too, which is too big a change for the 2.4 series of kernels. Various Linux distribution vendors (e.g. [http://www.suse.com SuSE]) do ship reiserfs-quota enabled kernels, though.
=== I am getting some errors in my kernel logs that I do not know how to interpret === Messages like: vs-13070: reiserfs_read_inode2: i/o failure occurred trying to find stat data of [1718696 1718710 0x0 SD] zam-7001: io error in reiserfs_find_entry most likely accompanied by samples like those below are definite signs of hard disk problems (bad sectors): hda: dma_intr: status=0x51 { DriveReady SeekComplete Error } hda: dma_intr: error=0x40 { UncorrectableError }, LBAsect=6599945, sector=4286584 end_request: I/O error, dev 03:03 (hda), sector 4286584 or scsi0: ERROR on channel 0, id 1, lun 0, CDB: Read (10) 00 00 01 ee 60 00 00 08 00 Current sd 08:00: sense key Medium Error or I/O error: dev 08:21, sector 65704 Messages about <tt>"access beyond end of device"</tt> can have many different causes, from not rebooting after fdisk requested it, to unfinished resizings, to data corruption. The following messages mean you have a noisy IDE cable, or it is just too low quality for the chosen UDMA mode. Try to replace the cable with a better one, or choose a slower UDMA mode: hda: dma_intr: status=0x51 { DriveReady SeekComplete Error } hda: dma_intr: error=0x84 { DriveStatusError BadCRC } hda: dma_intr: status=0x51 { DriveReady SeekComplete Error } hda: dma_intr: error=0x84 { DriveStatusError BadCRC } If you see any message from [[ReiserFS]] that you cannot interpret and there is nothing similar to the messages above around it, [[mailinglists|mail the message to us]] and we will explain it to you. === Will ReiserFS implement streams, extended attributes, etc.? === [[FAQ/streams|Here]] is the one page answer. === ReiserFS appears to be very slow while the RAID is resyncing. Mounting takes several minutes. Once mounted, an <tt>ls(1)</tt> in the mounted directory hangs. Forever. Once the RAID is synced, things appear to work pretty well. How can that be fixed? === First of all, a patch that speeds up mounting has been included in the Linux kernel since 2.4.19.
You can grab the patch for earlier kernels [http://gd.tuwien.ac.at/utils/fs/reiserfs/reiserfs-for-2.5/2.5.4.pending/07-reiserfs-bitmap-journal-read-ahead.diff here]. Also, the RAID drivers have '''minimal guaranteed''' and '''maximal possible''' RAID rebuild bandwidth usage. These values are controlled through the <tt>/proc/sys/dev/raid/speed_limit_min</tt> and <tt>/proc/sys/dev/raid/speed_limit_max</tt> sysctl variables (values are in KiB/s). It seems that the RAID logic cannot always tell whether the disk subsystem is busy or not at a given time. When it thinks the disk subsystem is idle, it tries to rebuild the RAID array at the <tt>speed_limit_max</tt> speed, which defaults to 100 MB per second. Decrease this value to something more suitable (a bit of experimentation might be needed). === I get attempt to read past the end of the partition error messages; is ReiserFS corrupted? === You changed your partition sizes and then ran [[mkreiserfs]] before rebooting. The kernel does not update its notion of the partition sizes until reboot time. (This is fixable, but nobody has fixed it as of Dec. 2001.) [[mkreiserfs]] therefore created a filesystem with the wrong notion of how large its partition is. The filesystem's notion of the partition boundaries will persist past reboot even though the kernel's notion will change. So yes, it is corrupted. Some other kinds of metadata breakage can also lead to such messages. === Can I use VMware with ReiserFS? === VMware was tested on [http://www.suse.com/ SuSE Linux] with a [http://support.microsoft.com/gp/lifean18 Windows98] guest OS on a [[ReiserFS]] partition. There's one trick at the beginning: the following line was added to the VMware config file host.FSSupportLocking1 = 0x52654973 # (0x52654973 == *(u32 *) "ReIs") Thanks to [mailto:gkade@bigbrother.net Gregory K. Ade] for this hint. === How do I install Debian potato with ReiserFS as root partition?
=== [[FAQ/potato_part|Here]] are instructions by [mailto:LeBlanc@mcc.ac.uk Dr. A.V. Le Blanc]. === Starting with Linux kernel v2.4.21 I cannot mount my FS anymore. Why? === Special sanity checks were added to the kernel code to prohibit mounting of filesystems that are bigger than the underlying block device. If you now see this message on mount: Filesystem on xx:yy cannot be mounted because it is bigger than the device you may need to run fsck or increase the size of your LVM partition. Or maybe you forgot to reboot after fdisk when it told you to. If you do not use LVM, that usually means you need to run <tt>[[reiserfsck]] --rebuild-sb</tt> on your filesystem and agree to change its default size to the proposed one. === Is it OK to use ReiserFS on a small storage device, e.g. a 16MB NAND flash block device? === [[FAQ/small_blocks|Here]] are instructions. === How do I change root from ext2 to ReiserFS without loss of data? === [[FAQ/change_fs|Here]] are instructions. === <tt>mount: /dev/hda5 has wrong major or minor number</tt> - what does that mean? === The kernel does not know anything about [[ReiserFS]]; it is neither compiled in nor available as a module. === Will it be possible to read/write ReiserFS partitions created now with future versions of ReiserFS? === Yes. ReiserFS-3.6.x (Linux-2.4.x) works with both the old (3.5) and the new (3.6) formats. ReiserFS-3.5.x (Linux-2.2.x) can only work with the old (3.5) disk format. There is no way to convert the new (3.6) disk format to the old (3.5), but the old (3.5) format can be converted to the new one (3.6) with the "-o conv" mount option. === The ReiserFS module doesn't insert properly === After applying the patch, recompile the whole kernel including the modules target, reboot, then try to insert the module. Also, be sure that you run LILO. === Can I use ReiserFS with software RAID? === Yes, for all RAID levels using any Linux >= 2.4.1, but DO NOT use RAID5 with Linux 2.2.x.
Our journaling and their RAID code step on each other in the buffering code. Also, mirroring is not safe in the 2.2.x kernels, because online mirror rebuilds in 2.2.x break the write-ordering requirements for the log. If you crash in the middle of an online rebuild, your meta-data may be corrupted. The only RAID level that is safe with ReiserFS in the 2.2.x kernels is the striping/concatenation level. === Can I use ReiserFS with 3ware RAID? === Yes, but you need to use Linux 2.2.19 or later for reasons other than [[ReiserFS]]. Also, should you encounter problems, consider that it might not be ReiserFS that has the bug. 3ware published [http://web.archive.org/web/20030415160519/http://www.3ware.com/support/raid5techbulletin.shtml special instructions] (archive.org). === Why do things freeze on my IDE hard drive for annoying amounts of time? === Because when large writes are scheduled all at once, reads can starve. A fix for this is evolving; the later your ReiserFS patch, the better we handle this. === du says ReiserFS makes space efficiency worse === Use df, not du, or use the "raw" option for du if your du supports it. st_blocks summed up is less accurate than st_size for [[ReiserFS]] because we pack tails, and st_blocks rounds numbers up. === [[mkreiserfs]] fails after repartitioning === The kernel requires you to reboot after repartitioning (for all filesystems). We intend to fix that.... === Performance is poor, and my disk at 96% full still has free space === Once a disk drive gets more than 85% full, performance starts to suffer unless you use a repacker (which isn't implemented yet). You can probably get away with 92%, but if performance is valued you are making a mistake to keep it any fuller. This is true for almost all filesystems. ReiserFS, because of our packing of tails together, packs more data into a given percentage used, but it is still subject to the rules for the maximum recommended percentage used.
If you create the whole disk contents with one copy and then mount it read-only, then you can fully pack it without problems. Please be sure that you copy it from (or tar it from) a ReiserFS partition, so that files are created in ReiserFS readdir order, as this will improve performance. === Why do I get a signal 11 when compiling the kernel using ReiserFS and not ext2? === Your CPU is overheating or you have bad RAM. === But it doesn't happen with ext2? === ext2 uses less heat-sensitive gates in the CPU. :-) Seriously, ext2 and ReiserFS contain random differences, and overheating and bad RAM have random sensitivities. (Signal 11 is not due to ReiserFS. One user had a cable blocking the fan; it did not affect ext2, but it wasn't until he fixed the cable-fan problem that ReiserFS worked ...) === Can I use ReiserFS on architectures other than i386? === Yes, starting from Linux kernel 2.4.13, ReiserFS can run on any Linux-supported arch. === I need a program which will help me in rebuilding/recreating my partition table === gpart (http://brzitwa.de/mb/gpart/index.html) is a utility that handles ext2, FAT, Linux swap, HPFS, NTFS, FreeBSD and Solaris/x86 disklabels, Minix, and ReiserFS; it prints proposed contents for the primary partition table and is well documented. === What partition type should I use for ReiserFS? === Linux native filesystem (83). === Can I use 32GB+ IDE hard drives with ReiserFS? === Yes, if you use Linux kernel 2.4 and up. === What about resizing ReiserFS? === Please follow this link. === What should I put into the fifth (aka dump, fs_freq) and the sixth (aka pass, fs_passno) fields of /etc/fstab for ReiserFS filesystems? === 0 0 === Why are ReiserFS filesystems not fscked on reboot after a crash? === Because ReiserFS provides journalling of meta-data. After a crash, the consistency of a filesystem is restored by replaying the transaction log.
=== Can I interactively repair a filesystem that was corrupted (due to an internal bug in the kernel or to a hardware fault)? === man [[reiserfsck]] === Can I use "dump" and "restore" with ReiserFS? Any caveats? === No. dump uses knowledge of the internal structure of ext2 and works together with restore, which also uses ext2-specific knowledge, to back up ext2 files. dump and restore are specific to ext2 and will not work with ReiserFS. To back up ReiserFS files, use tar, which is universal and can be applied to almost any reasonable Linux filesystem. It is well known among system administrators that dump is more complete than unix tar, and that there is quite a list of things that unix tar will fail to properly back up. This is not true of GNU tar, which is quite complete. Basically, the only real disadvantage of GNU tar compared to dump is speed. Unfortunately, because it shares the same name as unix tar, people are reluctant to believe this. (Yes, the GNU version has incremental backups, etc.) We will performance-optimize ReiserFS backups for you (and the rest of the world) for $30k, which is not a lot if you are a large site spending a few hundred thousand on equipment for backups. === Does ReiserFS support snapshots? === No, but you can create ReiserFS on top of an LVM logical volume and use LVM's snapshot capabilities. === Can I check ReiserFS filesystems for errors without unmounting them? === [[reiserfsck]] in checking mode may run over filesystems mounted read-only. There is no official way to fix mounted filesystems, though: you MUST completely unmount your filesystem in order to have it fixed. If you have LVM, you can check the consistency of filesystems mounted read-write; here is the script contributed by Andreas Dilger: === What ReiserFS mount options should I use to get the performance winner for a mail server?
=== Craig Sanders answered in detail: "By the time I got around to running bonnie, the postmark and postal benchmarks had convinced me that notail was essential. host system: - Debian GNU/Linux (of course :) - Linux kernel 2.4.2 with latest 20010305 ReiserFS patch - dual P3-866 (256K cache) - 512MB RAM - Adaptec 19160 SCSI Controller external drive box: - Domex 8230u RAID controller, 32MB battery-backed cache. - 6 x 18GB IBM DDYS-T18350M drives For this particular hardware, reiserfs/notail on RAID5 was the clear performance winner for a mail server with lots of synced random I/O." === Does using ReiserFS mean I can just press the power-off button without running "shutdown" or "init 0", etc.? Does it mean there is no risk of data loss? === No, definitely not. As of now, ReiserFS only provides meta-data journaling--that is, it records which files have been created or opened, whether they have had their size changed, or where they have been relocated. It guarantees that the structure of the internal ReiserFS tree will be correct, thereby allowing you, after an unclean shutdown, to start back up without having to run fsck on all the files that have not been changed. Data in files that were being used at the time of the crash could have been corrupted. This is usual for most filesystems. Data-journaling filesystems guarantee that there will be no garbage written into a file, but they don't guarantee that a file update will be there. (Only reiser4 guarantees that filesystem operations are performed as atomic operations, and provides atomic transaction functionality.) ReiserFS v3 does not guarantee that the file contents themselves are uncorrupted, nor that no data is lost. Moreover, even if all of your system is on ReiserFS, many system components (like daemons, database managers, etc.) require the shutdown procedure for proper functioning. However, there is a separate implementation of data logging that will soon go into the mainstream kernel.
You should be able to get it from ftp.suse.com/pub/people/mason/patches/data-logging === How does ReiserFS support bad block handling? === See here. === I have a motherboard with a VIA MVP3 chipset and experience ReiserFS problems. === William Oster <woster73@yahoo.com> answers: If you are using a motherboard with a VIA MVP3 chipset, you may have ReiserFS problems caused by the way your kernel is configured for the so-called "pci quirks". My experience is with kernels 2.2.18 and 2.2.19, but it may affect the 2.4.x series too if you are using the MVP3 chipset (popular in socket 7 type motherboards, such as those used by the AMD K6 and classic Pentium). I've confirmed this problem with several motherboards using the VIA MVP3 chipset, ReiserFS 3.5.29 to 3.5.32, and NCR 53c8xx SCSI. But please note: it probably affects any controller which uses DMA and PCI bus mastering. Problems which I was inclined to attribute to ReiserFS were actually problems with this kernel [mis]configuration. If you fit this profile, DO NOT enable the "pci quirks" configuration option in the /usr/src/linux/.config file. Although the Linux documentation suggests that this option can be enabled if in doubt, DO NOT enable it. It was never intended for the VIA MVP3 chipset anyway. It affects the way DMA is handled, and the combination with ReiserFS (and possibly NCR SCSI) can cause random disk corruption which eventually will result in ReiserFS and/or SCSI errors. Evidently ReiserFS exercises the DMA and SCSI bus very thoroughly; the problems seem not to be as likely under the ext2 filesystem. Check your /usr/src/linux/.config file. You are SAFE from this problem if you find this line: # CONFIG_PCI_QUIRKS is not set Any other setting could be dangerous to MVP3-chipset ReiserFS users, especially when using PCI bus mastering controllers such as the NCR 53c8xx series. Re-configure your kernel to disable the "pci quirks" option, then make dep, rebuild, and reinstall.
=== I am having extensive problems using ReiserFS; it seems to have bugs all over the place. I'm not compiling with a buggy compiler. What is happening? How can this be stable? === You have hardware problems. Really, you do. Even if the bugs don't show up with ext2, you have hardware problems. (See the FAQ question about ReiserFS running 3°C hotter than ext2.) Most SuSE users use ReiserFS. Obscure bugs probably still exist; but if you find bugs as easily as when using Windows, you have bad RAM, a bad CPU, a bad cable, bad cooling, a VIA chipset with PCI quirks turned on, or other hardware or software-layer bugs. ReiserFS is stable. You can be sure that if bugs are encountered easily and commonly with normal usage patterns, it is not us. This does not mean that the next release won't somehow break something, though :-/..... Real bug reports are, at the time of writing, outnumbered 10 to 1 by hardware bugs that trigger error messages. We are working on making our error messages better at catching hardware bugs and identifying them as such. There is only so far we can go, though, in runtime consistency checking without serious speed reductions. We don't release software unless it goes through extensive testing; so if you don't think that our testing could have missed the bug, it is probably hardware. === How can I put a label (like the one allowed by the <tt>-L</tt> option of <tt>mkfs.ext2</tt>) on a ReiserFS instance? === Currently, this feature is only implemented for the [[ReiserFS]] v3.6 disk format. Adding it to the v3.5 disk format would break the existing disk format, and there is not enough free space in the superblock. You can set a label (and UUID) with a recent [[Reiser4progs|reiserfsprogs]] package on a [[ReiserFS]] v3.6 filesystem using the <tt>-l</tt> switch (<tt>-u</tt> for UUID) to [[reiserfstune]] (for existing partitions) or to [[mkreiserfs]] (for partitions being created). Support for labels and UUIDs was integrated into [[Reiser4progs|reiserfsprogs]] starting from version 3.x.1a.
=== Why, when I'm working on files (i.e. having open files) on my laptop, does ReiserFS access the disk every 5 seconds? This effectively prevents the disk from spinning down, i.e. APM modes taking over, even when I'm not writing anything. === Brent Graveland <bgraveland@hyperchip.com> answers: It's the atime update. Every time you run sync, the sync program's atime is updated. The next sync writes this atime update, then sync's atime gets updated again... === RedHat does not unmount / with ReiserFS on halt. How do I fix it? === RedHat users kindly provided these patches (not tested by us): rc.sysinit.patch and halt.patch. Note that if you have RedHat Linux 7.2 or later, you do not need these patches. === How do I run programs from the reiserfsprogs package on encrypted devices? === In order to access such encrypted entities, you need to use the losetup tool to bind your entity to a loop device. === Are there any recommendations for or against any particular hard drive manufacturer for use with ReiserFS? === There is basically no preference; the general rule "the faster the drive and the lower the seek time, the better" applies as always. On the other hand, almost every hard drive manufacturer has a "widely known" broken series of hard drives. The most recent example is IBM's "Deskstar" series, especially the DTLA models produced in Hungary in 2000-2001. These are known to fail very often, to the point that you probably don't want to use them even if you have already paid for them. Other Deskstar drives also seem to be a not very good choice: IBM released a note that Deskstar drives should not run for more than 8 hours/day on average. These drives are also known to be very sensitive to temperature conditions and to fail on overheating. There is a class action lawsuit in progress against IBM over that drive series. === I am using RedHat 7.0 with gcc 2.96; why does ReiserFS seem unstable with it?
=== Use the most recent version of RedHat (gcc 2.96-85 or later with RedHat 7.2, although 7.1 is also okay for ReiserFS). The choice of an unstable, unreleased version of gcc 2.96 by RedHat as the default gcc was a Slashdot controversy. gcc 2.96 on RedHat 7.0 was unstable, and ReiserFS was one of the things that would fail with it. There are two gcc versions: 2.96 and 2.96-85. 2.96-85 works for ReiserFS, and the other (the one on RedHat 7.0) surely does not. Read the Linux kernel instructions about what compiler to use. The solution to code not working on broken compilers is the one RedHat has taken: fix the compiler. They fixed the compiler and thereby allowed the correctly compiled ReiserFS to work. === In my program I am using fsync(2) calls after each write to the file to guarantee the integrity of my file data, and this is very slow. How can I improve the performance? === Answer from Chris Mason: The main thing to remember is that fsyncs introduce a bunch of disk writes, and force the FS to wait on the buffers. The key to keeping performance up is to make it easy for the FS to do as much as possible before the fsync call. So, if your application modifies 3 files, and you want to make sure all 3 changes are safely on disk: write(file1) write(file2) write(file3) fsync(file1) fsync(file2) fsync(file3) is much faster than: write(file1) fsync(file1) write(file2) fsync(file2) write(file3) fsync(file3) It is also faster to write over existing bytes in a file than it is to append new bytes onto the end of a file. When you overwrite existing bytes in the file, you don't have to commit new metadata to disk on fsync(); the FS can just write the data blocks. This means fewer seeks. The more you write to a single file before calling fsync, the faster overall things will run. write(8k) fsync(file) is much faster than: write(4k) fsync(file) write(4k) fsync(file) Trying to optimize for those 3 things alone can make a huge performance difference overall.
Answer from Josh MacDonald: You have to understand that even using fsync() after every write() makes no guarantees. If the system crashes during either the write or the fsync operation, your data may be lost or corrupted. Supposing the fsync() does complete: does your application keep its data in multiple files? If so, and you need to write() to multiple files as part of one transaction, you have even greater problems. The only safe and easy way for you to implement some kind of transaction with the traditional file system guarantees is to use rename(): 1. Keep all of your data in a single file. 2. Periodically write a complete copy of your database to a temporary file. 3. Rename the temporary file to the original database name. (Addition from Nikita Danilov: One can implement something like a phase-tree at user level and use rename to atomically switch the root of the tree. This overcomes the "everything-in-one-file" limitation, but has the added complexity of requiring crash recovery.) Answer from Nikita Danilov: Stop your development for now and wait until the reiser4 filesystem is released, which will have a transaction API exported to userspace. That transaction API would solve all of your problems. == Our program needs to access a lot of working files. What is the recommended way to organize files to get the best results out of ReiserFS? Should all the files be placed in a single directory, or should the files be spread across a directory tree to limit the number of files per directory? Can you also summarize the relevant caching and locking effects? == Traditional file systems typically have poor performance when there are many files in a single directory, but not [[ReiserFS]]. Those other file systems perform poorly because they use a linear search algorithm to find and replace entries in a directory. This means that the file system must scan, on average, half the blocks of a directory for every access.
Typically, applications are required to work around this problem by manually structuring a tree of directories, allowing each individual directory to remain limited in size. For example, see how the Squid web proxy stores a large collection of files. ReiserFS does not have this problem, because it uses an internal tree to store all directories and file metadata. Directory operations remain efficient even for very large directories, so you can write your application free from this performance concern. However, there are several issues that complicate this matter: namely locking and locality. The Linux VFS currently imposes locking restrictions that serialize many operations on directories, so if concurrent processes or threads will access the collection of files, then you may be better off using multiple directories. [[Reiser4]] will improve upon this restriction, although it is still under development. ReiserFS attempts to store all of the files in a directory, along with the directory itself, in nearby locations on disk. An application may exploit this spatial locality if it can predict which files will be accessed with temporal locality. You may be better off using multiple directories to store your files if you can predict that many files within a directory will be accessed at the same time. To summarize, ReiserFS supports efficient access to large directories and most traditional file systems do not. However, locking and locality issues may guide your decision to use manually structured directory trees instead, at least until ReiserFS exports control over packing locality to users and improves its locking.
[[category:ReiserFS]] [[category:Reiser4]] 1ae6ad87714e93517c659bf285bc73eac279ac0b 1424 2009-06-27T01:12:36Z Chris goe 2 -> Frequently_Asked_Questions #REDIRECT [[Frequently_Asked_Questions]] f8bae25b14df0c597140c31a21593a1a322d8e86 Filesystem Testing Tools 0 76 4305 2531 2018-10-31T17:06:14Z Edward 4 Fixed error facti: Chris Mason didn't write stress.sh, it has copyrights in the header = stress.sh = [http://oss.oracle.com/~mason/stress.sh stress.sh] is a generic filesystem testing [http://www.spinics.net/lists/reiserfs-devel/msg01812.html tool]. [[category:ReiserFS]] [[category:Reiser4]] 518daf9f353cb4e1ad2022cc60dd592bfbc9008d 2531 2232 2012-09-25T17:37:51Z Chris goe 2 -> = stress.sh = [http://oss.oracle.com/~mason/stress.sh stress.sh] is a generic filesystem testing tool, written by [http://www.spinics.net/lists/reiserfs-devel/msg01812.html Chris Mason]. [[category:ReiserFS]] [[category:Reiser4]] d3b2bc47d6ecb27612302550ae6b0212233ee094 2232 1623 2011-04-04T17:47:12Z Chris goe 2 categories added === stress.sh === [http://oss.oracle.com/~mason/stress.sh stress.sh] is a generic filesystem testing tool, written by [http://www.spinics.net/lists/reiserfs-devel/msg01812.html Chris Mason]. [[category:ReiserFS]] [[category:Reiser4]] cb5048231252a40a92888f7ec21a8296f69cab6f 1623 2009-08-31T06:48:47Z Chris goe 2 Created page with '=== stress.sh === A more general filesystem testing tool really, not specific to ReiserFS or Reiser4 is [http://oss.oracle.com/~mason/stress.sh stress.sh], by [http://www.spinic...' === stress.sh === A more general filesystem testing tool really, not specific to ReiserFS or Reiser4 is [http://oss.oracle.com/~mason/stress.sh stress.sh], by [http://www.spinics.net/lists/reiserfs-devel/msg01812.html Chris Mason]. 
Fsck.reiser4

=== NAME ===
fsck.reiser4 - the program for checking and repairing reiser4 filesystems

=== SYNOPSIS ===
fsck.reiser4 [ options ] FILE

=== DESCRIPTION ===
fsck.reiser4 is the reiser4 filesystem check and repair program.

=== CHECK OPTIONS ===
; --check
: the default action; checks the consistency and reports, but does not repair, any corruption that it finds. This option may be used on a read-only mounted file system.
; --fix
: fixes minor corruptions that do not require rebuilding; sets up correct values of bytes unsupported by the kernel in the case of transparent compression.
; --build-sb
: fixes all severe corruptions in super blocks; rebuilds super blocks from scratch if needed.
; --build-fs
: fixes all severe fs corruptions, except super block ones; rebuilds the reiser4 filesystem from scratch if needed.
; -L, --logfile
: forces fsck to report any corruption it finds to the specified logfile rather than to stderr.
; -n, --no-log
: prevents fsck from reporting any kind of corruption.
; -a, --auto
: automatically checks the file system without any questions.
; -q, --quiet
: suppresses gauges.
; -r
: ignored.

=== PLUGIN OPTIONS ===
; --print-profile
: prints the plugin profile. This is the set of default plugins used for all parts of a filesystem -- format, nodes, files, directories, hashes, etc. If --override is specified, prints the modified plugins.
; -l, --print-plugins
: prints all plugins libreiser4 knows about.
; -o, --override TYPE=PLUGIN, ...
: overrides the default plugin of the type "TYPE" with the plugin "PLUGIN" in the plugin profile.

=== COMMON OPTIONS ===
; -V, --version
: prints the program version.
; -?, -h, --help
: prints program help.
; -q, --quiet
: suppresses messages.
; -y, --yes
: assumes an answer 'yes' to all questions.
; -f, --force
: forces fsck to use the whole disk, not a block device or mounted partition.
; -p, --preen
: automatically repairs minor corruptions on the filesystem.
; -c, --cache N
: tunes the number of nodes in the libreiser4 tree buffer cache.

=== REPORTING BUGS ===
Report bugs to {{listaddress}}

=== SEE ALSO ===
* [[debugfs.reiser4|debugfs.reiser4(8)]]
* [[mkfs.reiser4|mkfs.reiser4(8)]]
* [[measurefs.reiser4|measurefs.reiser4(8)]]

=== AUTHOR ===
This manual page was written by Vitaly Fertman <vitaly@namesys.com>

[[category:Reiser4]]
Future Vision

{{wayback|http://www.namesys.com/whitepaper.html|2006-11-13}}

Future Vision of ReiserFS

Name Spaces As Tools for Integrating the Operating System Rather Than As Ends in Themselves

By Hans Reiser

http://namesys.com

6114 La Salle ave., #405, Oakland, CA 94611

email: reiser@namesys.com

== Abstract ==

For too long the file system has been semantically impoverished in comparison with database and keyword systems. It is time to change! The current lack of features makes it much easier to use the latest set theoretic models rather than older models of relational algebra or hypertext. The current FS syntax fits nicely into the newer model. The utility of an operating system is more proportional to the number of connections possible between its components than it is to the number of those components. Namespace fragmentation is the most important determinant of that number of possible connections between OS components.
Unix at its beginning increased the integration of I/O by putting devices into the file system name space. This is a winning strategy: let's take the file system name space, and one by one eliminate the reasons why the filesystem is inadequate for what other name spaces are used for, one missing feature at a time. Only once we have done so will the hobbles be removed from OS architects, or even OS conspiracies. Yet before doing that, we need a core architecture for the semantics to ensure we end up with a coherent whole.

This paper suggests a set theoretic model for those semantics. The relational models would at times unacceptably add structure to information, the keyword models would at times delete structure, and purely hierarchical models would create information mazes. Reworking their primitives is required to synthesize the best attributes of these models in a way that allows one the flexibility to tailor the level of structure to the need of the moment. The set theoretic model I propose has a syntax that is upwardly compatible with Linux, MacOS, and DOS file system syntax, as well as with the CORBA naming layer.

This is a planning document for the next major version of ReiserFS, that is, a description of vaporware. It is useful to ReiserFS users and contributors who want to know where we are going, and why we are building all sorts of strange optimizations into the storage layer (and especially those who are willing to help shape the vision in the course of discussions on the {{listaddress}} mailing list....). Currently the storage layer for ReiserFS is working and useful as an everyday FS with conventional semantics. That storage layer is available as a GPL'd Linux kernel patch.

== Introduction ==

Many OS researchers have built hierarchical name spaces that innovate in their effect on the integration of the operating system (e.g. Plan 9 and its file system [Pike].)
Relational and keyword researchers rightfully scorn hierarchical name spaces as 20 years behind the state of the art [Date], but pay little attention to integration of the operating system as a design objective in their own work, or as a possible influence on data model design. I won't go into that here. Limiting associations to single key words is an unnecessary restriction.

== A Naming System Should Reflect Rather than Mold Structure ==

The importance of not deleting the structure of information is obvious; few would advocate using the keyword model to unify naming. What can be more difficult to see is the harm from adding structure to information; some do recommend the relational model for unifying naming (e.g. OS/400). By decomposing a primitive of a model into smaller primitives one can end up with a more general model, one with greater flexibility of application. This is the very normal practice of mathematicians, who in their work constantly examine mathematical models with an eye to finding a more fundamental set of primitives, in hopes that a new formulation of the model will allow the new primitives to function more independently, and thereby increase the generality and expressive power of the model. Here I break the relational primitive (a tuple is an unordered set of ordered pairs) into separate ordered and unordered set primitives. Relational systems force you to use unordered sets of ordered pairs when sometimes what you want is a simple unordered set.

Why should a naming system match rather than mold the structure of information? For systems of low complexity, the reasons are deeply philosophical, which means uncompelling. And for multiterabyte distributed systems?...

Reiser's Rule of Thumb #2: The most important characteristic of a very complex system is the user's inability to learn its structure as a whole. We must avoid adding structure, or guarantee that the user will be informed of all structure relevant to his partial information.
Avoiding adding structure is both more feasible and less burdensome to the user. Hierarchical, relational, semantic, and hypersemantic systems all force structure on information, structure inherent in the system rather than the information represented. If a system adds structure, and the user is trying to exploit partial knowledge (such as a name embodies), then it inevitably requires the user to learn what was added before he can employ his partial knowledge. With complex systems, the amount added is beyond the capacity of users to learn, and information is lost. Example: <tt>"My name is Kali, your friendly whitepaper.html technical support specialist for REGRES. Our system puts the Library of Congress online! How may I help you."</tt> George doesn't know Santa Claus' name: <tt>"I'm trying to find the reindeer chimneys christmas man, and I can't get your system to do it."</tt> [[Image:Reindeer.jpg]] FIGURE 1. Graphical representation of a typical simple unordered set that is difficult for relational systems. Kali says: <tt>"OK, now let's define a query.'''is-a equals man''', that's easy. But reindeer? Is reindeer a property of this man?"</tt> <tt>"Uh no. I wish I could remember the dude's name. I read this story about him a long time ago, and all I can remember is that he had something to do with reindeer and chimneys. The story is on-line, somewhere."</tt> <tt>"Reindeer chimneys presents man, that's the sort of speech pattern I'd expect from a three year old."</tt> Kali corrects him. <tt>"Let's see if we can structure this properly. Is reindeer an '''instance-of''' of this man? A '''member-of''' of this man? It couldn't be a '''generalization''' of this man. Hmm..."</tt> <tt>"No! It's not that complicated. They just have something to do with him."</tt> <tt>"Pavlov would probably say you associate reindeer with this man, the way the unstructured mind of an animal thinks. But here in technical support we try to help our customers become more sophisticated. 
Is reindeer a property of this man?"</tt> <tt>"No. Try '''propulsion-provider-for'''."</tt> <tt>"Do you think that that was the schema the person who put the information in our system used?"</tt> <tt>"No. Shoot. I can think of a dozen different columns it could be under. But what are the chances that the ones I think of are going to be the same as the ones the dude who put the information in used?"</tt> Kali feels satisfaction. <tt>"Guess it can't be done, not if you can't structure your REGRES query properly. I'll put you down in my log as a closed ticket, 190 seconds to resolution, not bad."</tt> <tt>"A keyword system could handle reindeer chimneys christmas man."</tt> George grumbles as he stares in despair at his display. Unfortunately, the ''Library of Congress'' is only one of REGRES' many reference aids. George could spend his life at it, and he'd never learn its schema. <tt>"But a keyword system would delete even necessary structure inherent to the information. It couldn't handle our other needs!"</tt> Kali says before she hangs up. In addition to the searcher's difficulties, having to manufacture structure by specifying the column for reindeer also adds unnecessary cognitive load to the story author's indexing tasks. == A Few of the Other Approaches to This Problem == There is lurking at the heart of my approach a subtle difference between my analysis of naming, and the analysis of at least some others. I started my research by systematically categorizing the different structures embodied by names, placing them into equivalency classes, and then picking one syntax out of each class of functionally equivalent naming structures, on the assumption that each of the equivalency classes has value. For example, I considered that languages sometimes convey structure by word endings (tags), and sometimes by word order, but while the syntax differs, the word order and word ending techniques are equivalent in their power to convey structure. 
In my analysis of the effect of word ordering I decided that either the ordering mattered, or it did not, and that was the basis for two different naming primitives. Others have instead studied the inherent structure of data, and then from that derived ways of naming. The hypersemantic system [Smith] [Potter] represents an attempt to pick a manageably few columns which cover all possible needs. Generalization, aggregation, classification, and membership correspond to the is-a, has-property, is-an-instance-of, and is-a-member-of columns, respectively. The minor problem is that these columns don't cover all possibilities. They don't cover reindeer, presents, or chimneys for George's query. The major problem is that they don't correspond as closely as possible to the most common style of human thought, simple unordered association, and they require cognitive effort to transform.

The first response of relational database researchers to this is usually to ask: "Why not modify an existing relational database to contain an 'associated' column, put everything in that column, and it would be functionally equivalent to what you want?" This is like saying that you can do everything Pascal can do using TeX macros. (They are both Turing complete.) We don't design languages to simply be Turing complete, we design them to be useful. I have seen a colleague do in six lines of SQL (nonstandard SQL) a simple three keyword unordered set that I do in 3 words plus a pair of delimiters, and that traditional keyword systems also handle easily. Doing simple unordered sets well is crucial for highly heterogeneous name spaces, and the market success of keyword systems in Internet searching is evidence of that. If you look at the structure of names in human languages, they are not all tuple structured, and to make them tuple structured might be to distort them.

I have merely discussed the burden of naming columns. Most relational systems also require the user to specify the relation name.
If column naming is a burden, naming both the column and the relation is no less a burden. Many systems invest effort into allowing you to take the key that you know, and figure out all the relation names and columns that you might choose to pair with it. This is a good idea, but not as good as not imposing extraneous structure to begin with. [Salton] can be read for devastating critiques of the document clustering system, but there is a worthwhile idea lurking within that system. Perhaps it is worthwhile to keep track of a small number of documents which are "close" to a given document. The document creator could be informed upon auto-indexing the document what other documents appear to be close to it, and asked to consider associating it with them. This is not within our current plan of work, but I don't reject it conceptually. In summary, modularity within the naming system is improved by recognizing unordered grouping and ordering as two different functions that deserve separate primitives rather than being combined into a tuple primitive. The tuple is an unordered set of ordered pairs. There are other useful combinations of unordered grouping and ordering than that embodied by the relation, and the success of keyword systems suggests that a plain unordered set without any ordering at all is the most fundamental and common of them. == Names as Random Subsets of the Information In an Object == A system may still be effective when its assumptions are known to be false. You may regard the above as an overstatement of the notion that we are neural nets, and sometimes our abstract systems deal with assumptions that are not true or false, but are somewhat true. After we are finished stating them in English they lose the delicate weighting possessed by the reality of the situation. Sometimes we find it easier to model without that weighting. 
Classical economics and its assumption of perfect competition is the best known example of an effective system based on assumptions known to be substantially false. Introductory economics classes usually spend several weeks of class time arguing the merits of building models on somewhat false assumptions. This paper will now use such a somewhat false model to convey a feel for why mandatory pairing of name components causes problems.

* Assume the user's information from which he tries to construct a description will be some completely random subset of the information about the object. (Some of that information will be structural, and the structural fragments selected will be just as random as the rest.)
* Assume a user has 15 random clues of information selected from 300 pieces of information the system knows about some object.
* Assume the REGRES naming system requires that data be supplied in threesomes (perhaps column name, key name, relation name), and cannot use one member of a threesome without the other members of the threesome.
* Assume the ANARCHY naming system lacks this restriction, but does so at the cost that it can only use those 10 of the 15 information fragments which do not embody structure.
* Assume the statistical distribution of the 15 pieces of information the user has to construct a name with is fully independent and equally likely (this is both substantially wrong, and unfair to REGRES, but .... )
* Assume each clue has a selectivity of 100 (it divides the number of objects returned by 100).

Then ANARCHY has a selectivity of 100<sup>10</sup> = 10<sup>20</sup> = good.

REGRES has a selectivity of: 100<sup>(chance that the other two members of an object's threesome are possessed by user x 15)</sup> = 100<sup>(9/300 x 8/300 x 15)</sup> = 1.06 = very bad.

While it is not true that the clues are fully independent, it is true that to the extent that they are not fully dependent, ANARCHY will gain in selectivity compared to REGRES.
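The arithmetic above is easy to reproduce. A short sketch under exactly the stated (admittedly false) assumptions; REGRES and ANARCHY are the paper's hypothetical systems, not real software:

```python
CLUE_SELECTIVITY = 100   # each clue divides the result set by 100
TOTAL_FACTS = 300        # facts the system knows about the object
USER_CLUES = 15          # random clues the user holds

# ANARCHY: only the 10 non-structural clues are usable, but each one counts.
anarchy = CLUE_SELECTIVITY ** 10

# REGRES: a clue is usable only when the other two members of its threesome
# are also among the user's clues; the expected number of usable clues is
# (9/300) * (8/300) * 15.
expected_usable = (9 / TOTAL_FACTS) * (8 / TOTAL_FACTS) * USER_CLUES
regres = CLUE_SELECTIVITY ** expected_usable

print(anarchy)           # 10**20 -- "good"
print(round(regres, 2))  # 1.06   -- "very bad"
```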
Attempting to quantify for any database the extent of the dependence would be a nightmare, and so this model assumes a substantial falsity, through which it is hoped the reader can see a greater truth. For databases of the lower heterogeneity and complexity that the relational model was designed for, the independence within a threesome can be small, and the ability to also employ the 5 of 15 fragments which are structural is often more important than the difficulty of guessing any structure added. There is an implicit assumption here that you are looking for information that others have structured, and this argument in favor of ANARCHY becomes much less strong without this assumption. I feel obligated to stress once again that I do not advocate low structure over high structure, but I do advocate having the flexibility to match the amount of structure to the needs of the moment. Only with such flexibility can one hope to use all of the 15 fragments that happen to be possessed. == The Syntax In More Detail == What's needed is a naming system intended to reflect just the structure inherent in the information, whatever that structure might be, rather than restructuring the information to fit the naming system. === Orthogonal or Unoriginal Primitives and Features === There are many primitives that the ultimate naming system would include but which I will not discuss here: macros, OR, weight for subnames and AND-OR connectors [Fox], rules, constraints, indirection, links, and others. I have tried to select only those aspects in which my approach differs from the standard approach. Unifying the namespace does not require unifying automatic name generation, and those who read the [Blair] vs. [Salton] controversy likely understand my concluding that whatever the benefits might be of unifying automatic name generation, it is not feasible now, and won't be feasible for a long time to come. 
The names one can assign an object are kept completely orthogonal from the contents of the object in the implementation of this naming layer. It is up to the owner of the object to name it, and it is up to him to use whatever combination of autonaming programs and manual naming best achieves his purpose. He may name it on object creation, and he may continually adjust its various names throughout its lifetime. See the section defining the "Key_Object primitive" for a discussion of why names should be thought of this way.

Technically, object creation only requires that the object be given a Storage_Key. In practice, most users will, in the same act that creates the object, also associate the object with at least one name that will spare them from directly specifying the Storage_Key in hex the next time they make a reference to it. Applications implementing external name spaces can interact with the storage layer by referencing just the Storage_Key.

Namesys will provide a manual naming interface, and the API autonaming programs need to plug into. Companies such as Ecila will provide autonamers for various purposes. Ecila is implementing a program which scans remote stores, creates links to them in the unified name space, but leaves the data on the remote stores. Other programs may also be implemented to perform this general function. To be more specific, the Ecila search engine scans the web for documents in French, and uses the filesystem as an indexing engine. However, they are writing their engine to be a general purpose engine; they have sold support and the addition of extensions to it to other search engine companies, and it is open source. For now we are simply functioning as part of their engine, and the interface is by web browser; at some point we may be able to add their functionality to the namespace.
While the implementation of Microsoft's attempt to blur the distinction between the filesystem name space and the web namespace is one more of appearance than substance, it is surely the right thing to do for Linux as well in the long run. We should simply make our integration one with substance and utility, rather than integrating mostly the look and feel. When the store is external to the primary store for the namespace, then stale names can be an issue with no clean resolution. That said, unification at just the naming layer is, in a real rather than ideal world, often quite useful, and so we have Internet search engines.

GUI based naming is beyond the scope of this paper, except to mention that it is common for GUI namespaces to be designed such that they are not well integrated with the other namespaces of the OS. They are often thought to be necessarily less powerful, but proper integration would make this untrue, as they would then be additional syntaxes, not substitutes. These additional syntaxes should possess closure within the general name space, and thereby be capable of finding employment as components of compound names like all the other types of names. The compound names should be able to contain both GUI and non-GUI based name components. Integration would make them simply the aspect of naming that applies to what is present in the visual cache of the screen, and to how to manage and display that cache most effectively.

=== Vicinity Set Intersection Definition (Also Called Grouping) ===

Suppose you have a set X of objects. Suppose some of these objects are associated with each other. You can draw them as connected in a graph. Let the vicinity of an object A be the set of objects associated with A. Let there be a set of query objects Q. Then the set vicinity intersection of Q is the set of objects which are a member of all vicinities of the objects in Q. When thinking of this as a data model, it seems natural to use the term vicinity set intersection.
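The definition can be stated directly in code. A minimal sketch, where the association graph is a mapping from each object to its vicinity; all object names are invented for the example:

```python
def vicinity_intersection(vicinity, query):
    """Return the objects that lie in the vicinity of every object in `query`.

    `vicinity[x]` is the set of objects associated with x; the result is
    the set vicinity intersection of the query set Q from the definition.
    """
    sets = [vicinity[q] for q in query]
    return set.intersection(*sets) if sets else set()

# George's query from Figure 1: all three clues are associated with the story.
vicinity = {
    "reindeer": {"santa-story", "zoo-guide"},
    "chimneys": {"santa-story", "masonry-manual"},
    "man": {"santa-story", "zoo-guide", "biography"},
}
hits = vicinity_intersection(vicinity, ["reindeer", "chimneys", "man"])
# hits == {"santa-story"}
```

Note that the operation is pure set intersection: the order of the query objects is irrelevant, which is exactly why the grouping syntax below treats subname order as insignificant.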
When thinking of this syntactically, it seems natural to use the term "grouping", because it implies that the subnames are grouped together without the order of the subnames being significant. There is exactly one data model primitive (set vicinity intersection) possessing exactly one syntax (grouping), and I rarely intend to distinguish data model primitive from syntax primitive (I can be criticized for this), and yet I use both terms for it, forgive me.

=== Synthesizing Ordering and Grouping ===

I am going to describe a toy naming system that allows focusing on how best to combine grouping and ordering into one naming system. This synthesis will contain the core features of the hierarchical, keyword, and relational systems as functional subsets. It consists of a few simple primitives, allowed to build on each other. It sets the discussion framework from which our project will over many years evolve a real naming system out of its current storage layer implementation.

Resolving the second component of an ordering is dependent on resolving the first --- unlike set theory. In set theory one can derive ordered set from unordered set, but because resolving the name of the second component depends on the first component one cannot do so in this naming system. For this reason it can well be argued that this naming system is not truly set theory based. Now that I have mentioned this difference I will start to call them grouping and ordering, rather than unordered and ordered set.

These two primitives take other names as sub-names, and allow the user to construct compound names. Either the order of the subnames is significant (ordering), or it isn't (grouping), and thus we have the two different primitives. Because I have myself found that BNFs are easier to read if preceded by examples, I will first list progressively more complex examples using the naming system, and then formally define them.
The examples, and the simplified syntax, use / rather than : or \, but this is of no moment. Examples <tt>/etc/passwd</tt> [[Image:Passwd.jpg]] Ordering and grouping are not just better; file system upward compatibility makes them cheaper for unifying naming in OSes based on hierarchical file systems than a relational naming system would be. This approach is fully upwardly compatible with the old file system. Users should be able to retain their old habits for as long as they wish, engage in a slow comfortable migration, and incorporate the new features into their habits as they feel the desire. Elderly programs should be untroubled in their operation. Many worthwhile projects fail because they emphasize how much they wish to change rather than asking of the user the minimal collection of changes necessary to achieve the added functionality. [dragon gandalf bilbo] [[Image:Bilbo.jpg]] FIGURE 3. Graphical representation of ascii name on left Mr. B. Bizy looking for a dimly remembered story ( The Hobbit by Tolkien ) to print out and take with him for rereading during the annual company meeting. case-insensitive/[computer privacy laws] [[Image:Syntax-barrier.jpg]] FIGURE 4. Graphical representation of ascii name on left When one subname contains no information except relative to another subname, and the order of the subnames is essential to the meaning of the name, then using ordering is appropriate. This most commonly occurs when syntax barriers are crossed. This is when a single compound name makes a transition from interpreting a subname according to the rules of one syntax to interpreting it according to the rules of another syntax. Ordering is essential at the boundary between the name of the new syntax as expressed in the current syntax, and the name to be interpreted according to that new syntax. Some researchers use the term context rather than syntax. The pairing of a program or function name, and the arguments it is passed, is inherently ordered. 
While that is usually the concern of the shell, when we use a variety of ordering functions to sort Key_Objects of different types it affects the object store. In this example the ordering serves as a syntax barrier. Case-insensitive is the unabbreviated name of a directory that ignores the distinction between upper and lower case. For Linux compatibility this naming layer is case sensitive by default, even though I agree with those who think that it would be better were it not. [my secrets]/ [love letter susan] [[Image:My-secrets.jpg]] FIGURE 5. Graphical representation of ascii name on left Devhuman (that's the account name he chose) is the company's senior programmer. Six years ago he wrote a love letter to Susan, which he put in his read protected secrets directory. (He never found the nerve to send it to her.) He's looking for it so he can rewrite it, and then consider sending it. Security is a particular kind of syntax barrier (you have to squint a bit before you can see it that way). Here the ordering serves as a security barrier. (He certainly wouldn't want anyone to know that an object owned by him with attributes love letter susan existed.) [subject/[illegal strike] to/elves from/santa document-type/RFC822 ultimatum] [[Image:Ultimatum.jpg]] FIGURE 6. Graphical representation of search for santa's ultimatum Devhuman knows his object store cold. He is looking for something he saw once before, he knows that it was auto-named by a particular namer he knows well (perhaps one whose functionality is similar to the classifier in [Messinger]), and he knows just what categorizations that namer uses when naming email. Still, he doesn't quite remember whether the word 'ultimatum' was part of the subject line, the body, or even was just elvish manual supplementation of the automatic naming. Rather than craft a query carefully specifying what he does and does not know about the possible categorizations of ultimatum, he lazily groups it. 
If Devhuman's object store is implemented using this naming system with good style, someone less knowledgeable about the object store would also be able to say: [santa illegal strike ultimatum elves] and perhaps get some false hits as well as the desired email (instead of finding mail from santa perhaps finding the elvish response). Notice that if you delete the 'illegal' and 'ultimatum' to get [subject/strike to/elves from/santa document-type/RFC822] the query is structurally equivalent to a relational query. Many authors (e.g. semantic database designers) have written papers with good examples of standard column names which might be worth teaching to users. So long as they are an option made available to the user rather than a requirement demanded of the user, the increased selectivity they provide can be helpful.

[_is-a-shellscript bill]

[[Image:Pruner.jpg]]

FIGURE 7. Graphical representation of ascii name on left

This name finds all shellscripts associated with bill. Names preceded by _ are pruners. Pruners are analogous to the predicate evaluators of relational database theory. If you have read papers distinguishing between recognition and retrieval, pruners are a recognition primitive. They are passed a list of objects, and return a subset of that list which matches some criteria. They are a mechanism appropriate for when a nonlinear search method that can deliver the desired functionality is either impossible, or not supported by existing indexes. There are many useful names for which we cannot do better than linear time search algorithms (perhaps simply as a result of incomplete indexing). _is-a-shellscript checks each member of its list to see if it is an executable object containing solely ascii. The user can use it just like any other Key_Object within an association; it will prune the results of the grouping. Since set intersections are commutative, its order within the grouping has no meaning, and optimizers are free to rearrange it.
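A pruner, as described, is just a linear filter over a candidate list. A minimal sketch of what _is-a-shellscript might do; the in-memory stand-in for the object store (an executable flag plus raw bytes per object) and all object names are invented for the example:

```python
def is_a_shellscript(candidates, store):
    """Pruner: keep only executable objects whose contents are pure ASCII.

    Runs in linear time over the candidate list, as pruners must when no
    better index exists.
    """
    def matches(name):
        executable, data = store[name]
        return executable and all(byte < 128 for byte in data)
    return [name for name in candidates if matches(name)]

# Hypothetical objects: only backup.sh is both executable and all-ASCII.
store = {
    "backup.sh": (True, b"#!/bin/sh\necho hi\n"),
    "photo.png": (True, b"\x89PNG..."),
    "notes.txt": (False, b"plain text"),
}
pruned = is_a_shellscript(["backup.sh", "photo.png", "notes.txt"], store)
# pruned == ["backup.sh"]
```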
=== The Formal Definitions === {| border=1 | <Object Name>::= || <pre> <Grouping> | <Ordering> | <Key_Object> | <Storage_Key> | <Orthogonal and Unoriginal Primitives I Will Not Define Here> | ; </pre> |} See the section listing orthogonal and unoriginal primitives for a discussion of what primitives I left out of the definitions of this grammar that are necessary to a real world working system. The name resolver has a method for converting all of the primitives into '''<Storage_Keys>''', and when processing the compound names it first converts the subnames into '''<Storage_Keys>''', though an object may have null contents and serve purely to embody structure. This allows the use of anything which anyone can invent a way of allowing the user to find an '''<Object Name>''' for, and then invent a method for the resolver to convert the '''<Object Name>''' into a '''<Storage_Key>''', as a component of a grouping or ordering. In a word, closure. Extensible closure. Compound names are interpreted by first interpreting the subnames that they are constructed from. At each stage of subname interpretation an '''<Object Name>''' is converted into a '''<Storage_Key>''' for the object that it is resolved to. The modules that implement the grouping and ordering primitives do not interpret the subnames; they merely pass them to the naming system which returns the '''<Storage_Key>'''s they resolve to. It was a long discussion which led to the use of storage keys rather than objectids. A storage key differs from an objectid in that it gives the storage layer directions as to where to try to locate the object in the logical tree ordering of the storage layer. If the logical location changes, then in the worst case we leave a link behind, and get an extra disk access like we get with an inode.
(Inode numbers are functionally objectids.) In the better case, the repacker eventually comes along and changes all references by key to the new location, at least for all objects that have not given their key to external naming systems the repacker cannot repack. A '''<Storage_Key>''' is assigned by the system at object creation, and serves the purpose of allowing the system to concisely name the object, and provide hints to the storage layer about which objects should be packed near each other. The user does not directly interact with the '''<Storage_Key>''' any more often than C programmers hardcode pointers in hex. The packing locality of keys may be redefined. == The Primitives == <Key_Object> A description of the contents of an object using the syntax of the current directory. For objects used to embody keywords this may be the keyword in its entirety. If it contains spaces, etc. it must be enclosed in quotes. Note that making it easy for third parties to add plug-in directory types is part of Namesys's current contract with Ecila. Ecila wants space efficient directories suitable for use in implementing a term dictionary and its postings files for their Internet search engine. Example: [reindeer chimneys presents man] In this name, 'presents', 'reindeer', 'chimneys', and 'man' are the contents of objects associated with the Santa Claus story. Each of them is searched for by contents; when found, they are converted into their Storage_Keys, and the grouping algorithm is fed those four Storage_Keys. The grouping module then looks in the object headers of the four objects, gets the four sets of objects the Key_Objects group to, and performs a set intersection.
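That interpretation can be sketched in a few lines of Python. The Storage_Key values and the grouped-to sets in the object headers below are invented purely for illustration:

```python
# Sketch of grouping interpretation: each Key_Object's header carries
# the set of Storage_Keys it groups to; a grouping intersects those
# sets. All keys and header contents are hypothetical.

# object header: Key_Object contents -> Storage_Keys it groups to
groups_to = {
    "reindeer": {101, 205},
    "chimneys": {101, 317},
    "presents": {101, 205, 317},
    "man":      {101, 422},
}

def resolve_grouping(key_objects):
    """Intersect the grouped-to sets of every named Key_Object."""
    sets = [groups_to[k] for k in key_objects]
    result = sets[0]
    for s in sets[1:]:
        result = result & s
    return result

print(resolve_grouping(["reindeer", "chimneys", "presents", "man"]))
# {101} -- the Storage_Key of the Santa Claus story
```

Since set intersection is commutative and associative, the order in which the module intersects the sets does not affect the result, which is what licenses the optimizer freedoms described earlier.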
Besides greater closure, another advantage of storing Key_Objects as objects is that non-ascii Key_Objects and ordering functions can be implemented as a layer on top of the ascii naming system, allowing the user to interact with the naming system by pressing hyperbuttons, drawing pictures, making sounds, and supplying other non-ascii Key_Objects that the higher layers convert into Storage_Keys. There are endless content description techniques. If the directory owner supplies an ordering function for the Key_Objects in a directory, one can generate a search index for the directory using a directory plug-in which is fully orthogonal to the ordering function, though perhaps slower in some cases than one that is tailored for the ordering function. Users will find it easier to write ordering functions than index creation objects, and will not always need the speed of specialized indexes. We will need one ordering function for ascii text, another for numbers, another for sounds, perhaps someday one even for pictures of faces (perhaps to be used by a law enforcement agency constructing an electronic mug book, or a white pages implementation), etc. No system designer can provide all the different and sometimes esoteric ordering functions which users will want to employ. What we can do is create a library of code, from which users can construct their own ordering function and their own directory plug-ins, and this is the approach we are taking on behalf of Ecila. For an Internet search engine one wants what is called a postings file, which is like a directory in that there is no need to support a byte offset, and one frequently wants to efficiently perform insertions into it. <Grouping> ::= [<Unordered List>] ; <Unordered List> ::= <Unordered List> <Unordered List> |<Object Name> |<Pruner> ; <Pruner> ::= _<Object Name> A <Grouping> is a list of object names and pruners whose order has no meaning.
Every object has a list of objects it groups to (associates with in neural network idiom) in its object header. A grouping is interpreted by performing a set intersection of those lists for every object named in the grouping; in the sense of the data model, this is a set vicinity intersection. Grouping is not transitive: <pre> [A] => B and [B] => C does not imply [A] => C though it does imply that [[A]] => C </pre> A pruner is an <Object Name> which has been preceded with an _ to indicate that the object described should be passed a list of objects named by the rest of the grouping, executed, and will return a subset of the list it was passed. Whether a member of the set is in the returned subset must be fully independent of what the other members of the set were, or else the results become indeterminate after application of a query optimizer, as with an optimizer in use there is no guarantee of the order of application of the pruners. <Ordering> ::= <Object Name>/<Object Name> | <Object Name>/<Custom Programmed Syntax> <Custom Programmed Syntax> ::= Varies, provides extensibility hook. An ordering is a pairing of names, with the order representing information. The first component of the ordering determines the module to which the second component is passed as an argument. In contrast, a grouping first converts all subnames to Storage_Keys by looking through the same current directory for all of them in parallel, and then does its set intersection with the subdescriptions already resolved. Example: In resolving [my secrets] / [love letter susan] the system would look for the objects with contents my and secrets, find both of them and do a set intersection of all of the objects those two objects both group to (are associated with). This will allow it to find the [my secrets] directory, inside of which it will look for the three objects love, letter, and susan.
It will then extract from their object headers the sets of objects those three words ('love', 'letter', and 'susan') group to, and do a set intersection which will find the desired letter. The desired letter is not necessarily inside the [my secrets] directory, though in this case it probably is. A directory is an object named by the first component of an ordering, to which the second component is passed, and which returns a set of Storage_Keys. One can in principle use different implementations of the same directory object without impacting the semantics and only affecting performance, as is often done in databases. There are flavors of directories: Custom programmed directories, aka filters, are any executable program that will return a Storage_Key when executed and fed the second component as an argument. They provide extensibility. (They are the ordered counterpart of pruners.) Another term for them is filter directories. Custom programmed directories whose name interpretation modules aren't unique to them will contain just the name of the module (filter), plus some directory dependent parameters to be passed to the module. It should be considered merely a syntax barrier directory, and not a fully custom programmed directory, if those parameters include a reference to a search tree that the module operates on, and if that search tree adheres to the default index structure. The connotations conveyed by the term 'filter' of there being an original which is distorted are not always appropriate, but in honesty this is not an issue about which we deeply care. Syntax barrier directories allow you to describe the contents of the objects they contain with a syntax different from their parents. Except for being sorted by a different ordering function, the indexes of syntax barrier directories are standard in their structure, and use a standard index traversal module. The index traversal module is ordering function independent.
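A fully standard directory, in this scheme, is little more than an index from Key_Object contents to Storage_Keys, with the second component of an ordering handed to it for lookup. A minimal sketch, in which all directory contents and key values are hypothetical:

```python
# Sketch of ordering resolution: the first component of an ordering
# names a directory; the directory's module receives the second
# component and returns a set of Storage_Keys. Contents and keys
# here are invented for illustration.

class StandardDirectory:
    """A fully standard directory: a plain index from Key_Object
    contents to Storage_Keys, traversed by a generic module."""
    def __init__(self, index):
        self.index = index

    def resolve(self, subname):
        return self.index.get(subname, set())

# Suppose [my secrets] has already resolved (by grouping) to this
# directory object:
my_secrets = StandardDirectory({
    "love letter susan": {7001},
    "diary":             {7002},
})

# The ordering [my secrets]/[love letter susan]:
print(my_secrets.resolve("love letter susan"))   # {7001}
```

A custom programmed (filter) directory would replace `resolve` with arbitrary code, possibly even creating the object it returns; a syntax barrier directory would keep this standard index but sort it by a different ordering function.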
There must be an ordering function for every <Key_Object> employed within a given syntax barrier directory. By contrast, a <Custom Programmed Syntax> could be anything which the syntax module somehow finds an object with, possibly even creating the object in order to be able to find it. To cross a security barrier directory the user must use an ordered pair of names with the security barrier as the first member of the pair, and he must satisfy the security module of the secured directory. A security barrier directory may be both a security and a syntax barrier directory, or the security barrier directory may share the syntax module of its parents. Fully standard directories are those built using the default directory module, and adding structure is their only semantic effect. There is an aspect of customization which is beyond the scope of this paper, in which one customizes the items employed by the storage layer to implement files and directories. That is, the storage of the files and directories is implemented by composing them of items, and these items have different types. We are now creating the code for packing and balancing arbitrary types of items using item handlers and object oriented balancing code, so as to make it easier to extend our filesystem. === Ordering can be implemented more efficiently than grouping === The set intersections performed in evaluating the grouping primitive are normally much more expensive computationally than performing the classical filesystem lookup. Imposing excess structure on one's data does not just at times reduce the cost of human thinking :-); it can be used to reduce the cost of automated computation as well. When the cost to a user of learning structure is less important than the burden on the machine, use of highly ordered names is often called for. === The Motivation for Different Syntactic Treatment of Ordering and Grouping, and Some of the Deeper Issues Revealed by the Difference.
=== An important difference between grouping and ordering affects syntax. It allows us to represent an ordering with a single symbol ('/') placed between the pair, but requires two symbols ('[' and ']') for each grouping. Imagine using < and > as a two symbol delimiter style alternative notation for ordering: <<father-of mother-of> sister-of> = <father-of <mother-of sister-of>> = <father-of mother-of sister-of> = father-of/mother-of/sister-of All of the expressions above are equivalent in referring to the paternal great aunt of the person who is the current context. The ones using nested pairs of symbols to enclose pairs of subnames imply a false structure that requires the user to think to realize the first two expressions are equivalent. The fourth is the notation this naming system employs. Grouping is different: Fast Acting Freddy is looking through the All-LA Shopping Database for a single store with black reebok sneakers, a green leather jacket, and a red beret so that he can dress an actor for a part before the director notices he forgot all about him. [[black reebok sneakers] [green leather jacket] [red beret]] is not equivalent to [black reebok sneakers green leather jacket red beret] which equals [red sneakers black reebok jacket green beret] Ordering is not algebraically commutative (father-of/mother-of is not equivalent to mother-of/father-of). Groupings are algebraically commutative. ([large red] = [red large]) == Style == As a general principle, a more restricted system can avoid requiring the user to repeatedly specify the restrictions, and if the user has no need to escape the restrictions then the restricted system may be superior. This is why "4GLs", which supply the structure for the user's query, are useful for some applications. They are typically implemented as layers on top of unrestricting systems such as this one. This paper has addressed issues surrounding finding information, particularly when the user's clues are faint.
When supporting other user goals, such as exploring information, adding structure through substantial use of ordering can be helpful [Marchionini][McAleese]. When the user goal is finding, one should assume that of all the fragments of information about an object, the user has some random subset of them. The goal is to allow the user to use that random subset in a name, whatever that subset might be. Some of that subset will be structural fragments. While requiring the user to supply a structure fragment is as foolish as requiring him to supply any other arbitrary fragment, allowing him to is laudable. In the best of all worlds the object store would incorporate all valid possible structurings of Key_Objects. The difficulty in implementing that is obvious. [Metzler and Haas] discuss ways of extracting structure from English text documents, and why one would want to be able to use that structure in retrievals. Unfortunately, there is an important difference between representing the structure of an English language sentence in a way that conveys its meaning, and representing it in a way that allows it to be found by someone who knows only a fragment of its semantic content. I doubt the wisdom of trying to advocate the use of more than essential structure in searching. You can allow users to avoid false structure; you cannot force them to. It is important to teach those creating the structure that if they group a personnel file with sex/female they should also group it with female. Type checking can impose structure usefully. Its implementation can enhance or reduce closure, depending on whether it is done right. === When To Decompound Groupings === There are dangers in excessive compounding of compound groupings analogous to those of excessive ordering. Let's examine two examples of compound groupings, both of which are valid both semantically and syntactically.
One of them can be "decompounded" with moderate information loss, and the other loses all meaning if decompounded. Example: Finding a loquacious Celtic textbook salesman who told you in excruciating detail about how he was an ordnance researcher until one day he went to a Grateful Dead concert. [[Celtic textbook salesman] [ordnance researcher]] vs. [celtic textbook salesman ordnance researcher] These two phrasings of the same query are not equivalent, but they are "close." Our second example is the one in which Fast Acting Freddy tries to find a suspect by the objects he is associated with: [[black reebok sneakers] [green leather jacket] [red beret]] vs. [black reebok sneakers green leather jacket red beret] These two are not at all "close." The difference between the two examples of inequivalencies is that the subdescriptions within the second example describe objects whose existence within the object store, independent of the store described, is worthwhile. The first does not, and it is more reasonable to try to design so that the "decompounded" version of the query is used. False hits will occur, but for large systems that's better than asking the user to learn structure. A higher level user interface might choose to present only one level to the user at a time, and then once the user confirms that a subdescription has resolved properly it would let him incorporate it into a higher level description. There might be 6 models of [black reebok sneakers], and Fast Acting Freddy should have the opportunity to click his mouse on the exact model, and have the interface substitute that object for his subdescription. Using such an interface an advanced user might simultaneously develop several subdescriptions, refine and resolve them, and then use the mouse to draw lines connecting them into a compound grouping. Closure makes it possible for that to work.
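The algebra behind these examples can be modelled directly: a grouping behaves like a set, so it commutes, but nesting subgroupings carries binding information that the flat ("decompounded") grouping loses. A toy sketch, with all names purely illustrative:

```python
# Sketch: groupings modelled as frozensets. [large red] = [red large],
# but a compound grouping of subgroupings is a different name than its
# decompounded flat version.

def grouping(*names):
    """A grouping is an unordered collection of subnames."""
    return frozenset(names)

# Groupings are algebraically commutative:
assert grouping("large", "red") == grouping("red", "large")

# Decompounding loses the bindings between subdescriptions:
compound = grouping(grouping("black", "reebok", "sneakers"),
                    grouping("green", "leather", "jacket"))
flat = grouping("black", "reebok", "sneakers",
                "green", "leather", "jacket")
assert compound != flat
print("grouping commutes; nesting carries binding information")
```

This is why Freddy's query cannot safely be decompounded: the flat grouping no longer records which color binds to which garment.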
== Examples of Creating Associations == <- creates an association between all of the objects on the left hand side and all of the objects on the right hand side. A - B is the set difference of A and B, and it resolves to the set of objects in A except for those that are in B. A & B resolves to the set intersection of A and B, the objects that are both in A and B. [A B] = [A] & [B], by definition. animal <- (lives, moves) mammal <- ([animal], animal, `warm blooded') cat <- ([mammal], hypernym/mammal, mammal, meronym/fur, fur, meronym/whiskers, whiskers, hypernym/quadruped, quadruped, capability/purr, purr, capability/meow, meow) Basil <- (owner/Nina, Nina, [siamese], siamese, clever, playful, brave/overly, brave, 'toilet explorer') bag <- ([container], container, consists-of/`highly flexible material', `highly flexible material') backpack <- ([bag], shoulderstrap/quantity/2, shoulderstrap, college-student, holonym/backpacker, meronym/shoulderstrap) mould <- ([fungi] - green/not, furry, `grows on'/surfaces/moist, `killed by'/chlorine) fungi <- ([plant], plant, leaves/no, flowers/no, green/not) bird <- ([vertebrate], vertebrate, flies, feathers) penguin <- ([bird] - flies, bird, hypernym/bird, swims, Linux, [Linux (mascot, symbol)]) siamese <- ([cat], cat, hair/short, short-hair) Notice how we don't associate siamese with short despite associating it with hair/short, but we do associate Basil with Nina as well as with owner/Nina. small <-0 little The above means that small and little are synonyms, and are to be treated as 0 distance away from each other for vicinity calculation purposes. In traditional Unix terms, they are hardlinked together. Creating a serious ontology is not our field or task, but it is worth doing. The reader is referred to WordNet (free), and Cyc by Doug Lenat (proprietary).
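The mechanics of `obj <- (names...)` and `a <-0 b` can be sketched as a small association store. The data structures and names below are invented for illustration; the paper does not prescribe an implementation:

```python
# Sketch of association creation: `obj <- (a, b, ...)` records that
# each name on the right groups to obj; `a <-0 b` declares synonyms
# at distance 0 (hardlinked names). All structures are hypothetical.

from collections import defaultdict

groups_to = defaultdict(set)   # name -> set of objects it groups to
synonyms = {}                  # name -> its canonical synonym

def canon(name):
    """Collapse a name to its 0-distance canonical synonym."""
    return synonyms.get(name, name)

def associate(obj, names):
    """obj <- (names...): every name now groups to obj."""
    for n in names:
        groups_to[canon(n)].add(obj)

def synonym(a, b):
    """a <-0 b: treat b as the same name as a."""
    synonyms[b] = a

associate("cat", ["mammal", "fur", "whiskers", "purr"])
associate("mouse", ["mammal", "fur", "small"])
synonym("small", "little")

# The grouping [fur little] finds the mouse via the synonym:
hit = groups_to[canon("fur")] & groups_to[canon("little")]
print(hit)   # {'mouse'}
```

A real vicinity calculation would assign nonzero distances to weaker associations; collapsing only 0-distance synonyms is the degenerate hardlink case described above.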
While we will focus on implementing primitives that allow for creating better ontologies, we are happy to work with persons interested in contributing or porting an ontology. == Other Projects Seeking To Increase Closure In The OS == === AT&T's Plan 9 === [Plan 9] is being produced by the original authors of Unix at AT&T research labs. It has influenced CORBA, and Linux's /proc is a direct steal from it. Their major focus is on integration. Their major trick for increasing integration is unifying the name space. Name spaces integrated into the Plan 9 file system include the status, control, virtual memory, and environment variables of running processes. They have a hierarchical analog to what the relational culture calls constructing views, which the Plan 9 culture calls context binding. === Microsoft's Information At Your Fingertips === Plan 9 ignores integration of application program name spaces, concentrating on OS oriented name spaces. Microsoft's "Information at Your Fingertips" name space integration effort appears to be taking the other approach, and focusing on integrating the name spaces of the various Microsoft applications via OLE and Structured Storage. The application group at Microsoft has long been better staffed and funded than the OS group, and FS developers have long preferred to simply ignore the needs of application builders generally. The primary semantic disadvantages of Microsoft's approach are primitives selected with insufficient care, a lack of closure, and the use of an object oriented rather than set oriented approach in both naming syntax and data model.
Realistically, one can say that folks within Microsoft have often made statements favoring name space integration, and in various areas have successfully executed on it, but on the whole I rather suspect that the lack of someone in marketing making a business case for $X in revenue resulting from name space integration has crippled name space integration work at commercial OS producers generally, including MS. ==== Internet Explorer ==== Internet Explorer attempts to unify the filesystem and Internet namespaces. At the time of writing, the unification is so superficial, with so little substance, that I would describe it as having the look and feel of integration without most of the substance. Perhaps this will change. ==== Microsoft's Well Known Performance Difficulties ==== Despite having many of the leading names in the industry on their payroll, they have somehow managed to create a file system implementation with performance so terrible that it is for the Unix customer base a significant consideration contributing to hesitation in moving to NT. It may well have the worst performance of any of the major OS file systems. Their implementation of OLE's structured storage offers extremely poor performance, and their excuse that it is due to the incorporation of transaction concepts into their design is just a reminder that they did a poor job at that as well. They managed to implement something intended to store small objects within a file, yet implement it such that it still suffers from 512-byte granularity problems, problems that they try to somewhat overcome by encouraging the packing of several objects within "storages" at horrible kludge cost. === Storage Layers Above the FS: A Sure Symptom The FS Developer Has Failed === When filesystems aren't really designed for the needs of the storage layers above them, and none of them are, not Microsoft's, not anybody's, then layering results in enormous performance loss.
The very existence of a storage layer above the filesystem means that the filesystem team at an OS vendor failed to listen to someone, and that someone was forced to go and implement something on their own. You just have to listen to one of these meetings in which some poor application developer tries to suggest that more features in the FS would be nice; I heard one at a nameless OS vendor. The FS team responds that disks are cheap, small object storage isn't really important, we haven't changed the disk layout in 10 years, and changing it isn't going to fly with the gods above us about whom we can do nothing. At these meetings you start to understand that most people who go into filesystem design are persons who didn't have the guts to pursue a more interesting field in CS. There is a sort of reverse increasing returns effect that governs FS research, in which the more code becomes fixed on the current APIs, the more persons in the field react with fear to any thought of the field of FS semantics being other than a dead research topic, the less research gets done, and the fewer persons of imagination see a reason to enter the field. Every time one vendor gets a little ahead in adding functionality, the other vendors go on a FUD campaign about it breaking standards and therefore being dangerous for mission critical usage. This is a field in which only performance research is allowed, and every other aspect is simply dead. Namesys seeks to raise the dead, and is willing to commit whatever unholy acts that requires. There is no need for two implementations of the set primitive, one called directories, the other called a file with streams, each having a different interface. File systems should just implement directories right, give them some more optional features, and then there is no need at all for streams.
If you combine allowing directory names to be overloaded to also be filenames when acted on as files, allowing stat data to be inherited, allowing file bodies to be inherited, and implementing filters of various kinds, then in the event that the user happens to need the precise peculiar functionality embodied by streams, they can have it by just configuring their directory in a particular way. There was a lengthy Linux-kernel thread on this topic which I won't repeat in more detail here. The tree architecture of the storage layer of this FS design will lend itself to a distributed caching system much more effectively than the Microsoft storage layer, in part due to its ability to cache not just hits and misses of files, but to cache semantic localities (ranges). For more on this topic see later in this paper. === Rufus === The Rufus system [Messinger et al.] indexes information while leaving it in its original location and format. While it does allow the user to create a unified name space, it does not choose to integrate that name space into the operating system. Even so, it is immensely useful in practice, and strongly hints at what the OS could gain if it had a more than hierarchical name space with a data model oriented towards what [Messinger] calls "semi-structured information", such as you find in the RFC822 format for email. When you have 7000 pieces of mail, and linear searching the mail with a utility like grep takes 10 minutes, it is nice to be able to quickly keyword search via inverted indexes for the mail whose from: field contains billg and that has the words "exclusive" and "bundling" in the body of the message, as you hurriedly search for an old email just before an appearance at court. === Semantic File System === The Semantic File System comes closest to addressing the needs I have described. It is a Unix compatible file system with more than hierarchical naming (attribute based is the term they use).
Its data model unfortunately has the important flaw of lacking closure (in it names of objects are not themselves objects). In my upcoming discussion of the unnecessary lack of closure in hypertext products, notice that the arguments apply to the Semantic File System (and so I won't duplicate them here). === OS/400 === IBM's OS/400 employs a unified relational name space. The section of this paper entitled A System Should Reflect Rather than Mold Structure will cover its problems of forcing false structure. Inadequate closure due to mandatory type checking is another source of difficulties for it. While users moan about these two unnecessary design flaws, the essence of the opinions AS/400 partisans have expressed to me has been that the unification of its name space is a great advantage that OS/400 has over Unix. I claim these users were right, and later in this paper will propose doing something about it. == Conclusion == While I spent most of this paper on why adding structure to information can be harmful, particularly when it is intended to be found by others sifting through large amounts of other information, this was purely because it is a harder argument than why deleting structure is harmful. My goal was not to be better at unstructured applications than keyword systems, or better at structured applications than the hierarchical and relational systems --- the goal is to be more flexible in allowing the user to choose how structured to be, while still being within a single name space. I claimed that multiple fragmented name spaces cannot match the power and ease of name spaces integrated with closure: closure makes a naming system far more powerful by increasing its ability to compound complex descriptions out of simpler ones. The strong points of this naming system's design are various forms of generalizing abstractions already known to the literature, for greater closure. == Acknowledgments == David P.
Anderson and Clifford Lynch helped enormously in rounding out my education, and improving my paper. Their generosity with their time was remarkable. David P. Anderson was simply a great professor, and it was a privilege to work with him. Brian Harvey informed me that it wasn't too obvious to mention that an object store should be unified. Cimmaron Taylor provided me with many valuable late night discussions in the early stages of this paper. I would like to thank Bill Cody and Guy Lohman of the database group at the IBM Almaden Research Center for a wonderful learning experience. Vladimir Saveliev kept this file system going when others fell by the wayside. He started as the most junior programmer on the team, and through sheer hard work and dedication to excellence outshone all the other more senior researchers. Of course after some time he could no longer be considered a junior programmer. NOTE: See also the DARPA funded, but not endorsed, [[Txn-doc|Reiser4 Transaction Design Document]] and [[Reiser4|Reiser4 Whitepaper]]. == References == * 1. Blair, David C. and Maron, M. E. [http://portal.acm.org/citation.cfm?doid=3166.3197 Evaluation of Retrieval Effectiveness for a Full-Text Document-Retrieval System] Communications of the ACM v 28 n 3 Mar 1985 p289-299 * 2. Codd, E. F. [http://portal.acm.org/citation.cfm?id=77708 The Relational Model for Database Management: version 2] c1990 Addison-Wesley Pub. Co., not recommended as a textbook, Date's is better for that, but worthwhile if you want a long paper by Codd. Notice that he places greater emphasis on closure, and design methodology principles in general, than designers of other naming systems such as hypertext. * 3. Date, C.J. [http://portal.acm.org/citation.cfm?id=4198 An Introduction to Database Systems], 4th ed. Reading, Mass.: Addison-Wesley Pub. Co., c1986. Contains a well written substantive textbook sneer at the problems of hierarchical naming systems, and a well annotated bibliography. * 4.
Curtis, Ronald and Larry Wittie [http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?isnumber=35714&arnumber=1695185 Global Naming in Distributed Systems] IEEE Software July 1984 p76-80 * 5. Feldman, Jerome A., Mark A. Fanty, Nigel H. Goddard and Kenton J. Lynne, [http://portal.acm.org/citation.cfm?id=42372.42378 Computing with Structured Connectionist Networks] Communications of the ACM, v31 Feb '88, p170(18) * 6. Fox, E. A., and Wu, H. [http://portal.acm.org/citation.cfm?id=358466 Extended Boolean Information Retrieval], Communications of the ACM, 26, 1983, pp. 1022-1036 * 7. Gallant, Stephen I., [http://portal.acm.org/citation.cfm?id=42377 Connectionist Expert Systems], Communications of the ACM, v31 Feb '88, p152(18) * 8. Gates, Bill. Comdex '91 speech on [http://findarticles.com/p/articles/mi_m0REL/is_n11_v90/ai_9715919/ Information at Your Fingertips] available for $8 on videotape from Microsoft's sales department. * 9. Gifford, David K., Jouvelot, Pierre, Sheldon, Mark A., O'Toole, James W. Jr., [http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.17.4726 Semantic File Systems], Operating Systems Review Volume 25, Number 5, October 13-16, 1991. They demonstrated that extending Unix file semantics to include nonhierarchical features is useful and feasible. Unfortunately, their naming system lacks closure. * 10. Gilula, Mikhail. [http://portal.acm.org/citation.cfm?id=174888 The Set Model for Database and Information Systems], 1st Edition, c1994, Addison-Wesley, provides a Set Theoretic Database Model in which relational algebra is shown to be a special case of a more general and powerful set theoretic approach. * 11. [http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.23.4527 Joint Object Services Submission] (JOSS), OMG TC Document 93.5.1 * 12. Marchionini, Gary, and Shneiderman, Ben. [http://portal.acm.org/citation.cfm?id=619765 Finding Facts vs. Browsing Knowledge in Hypertext Systems] Computer, January 1988, p. 70 * 13.
McAleese, Ray (ed.) "Hypertext: Theory into Practice", ABLEX Publishing Corporation, Norwood, NJ 07648 * 14. Messinger, Eli., Shoens, Kurt., Thomas, John., Luniewski, Allen [http://domino.watson.ibm.com/library/cyberdig.nsf/a3807c5b4823c53f85256561006324be/1e2deed787c18fbc85256593006f843c?OpenDocument Rufus: The Information Sponge], Research Report RJ 8294 (75655), August 13, 1991, IBM Almaden Research Center * 15. Metzler and Haas. [http://portal.acm.org/citation.cfm?id=65943.65949 The Constituent Object Parser: Syntactic Structure Matching for Information Retrieval], Proceedings of the ACM SIGIR Conference, 1989, ACM Press * 16. Nelson, T.H. [http://www.eastgate.com/catalog/LiteraryMachines.html Literary Machines], self published by Nelson, Nashville, Tenn., 1981. Did much to popularize hypertext; at the time of writing he has still not released a working product, though competitors such as HyperCard have done so with notable success. * 17. Mozer, Michael C. [http://www.eric.ed.gov/ERICWebPortal/custom/portlets/recordDetails/detailmini.jsp?accno=ED245694 Inductive Information Retrieval Using Parallel Distributed Computation], UCLA * 18. Pike, Rob and P.J. Weinberger ... The Hideous Name, "AT&T Research Report" * 19. Pike, Rob., Presotto, Dave., Thompson, Ken., Trickey, Howard., Winterbottom, Phil. [http://plan9.bell-labs.com/sys/doc/names.html The Use of Name Spaces in Plan 9]. Plan 9 is an operating system intended to be the successor to Unix, and greater integration of its name spaces is its primary focus. * 20. Potter, Walter D. and Robert P. Trueblood, [http://portal.acm.org/citation.cfm?id=45937 Traditional, semantic, and hyper-semantic approaches to data modeling], Computer, v21, '88, p53(11) * 21. Rijsbergen, C. J. Van, [http://www.dcs.gla.ac.uk/Keith/Preface.html Information Retrieval], 2nd ed., Butterworth and Co. Ltd., 1979. Printed in Great Britain by The Whitefriars Ltd., London and Tonbridge * 22. Salton, G.
(1986) [http://portal.acm.org/citation.cfm?id=6149 Another Look At Automatic Text-Retrieval Systems], Communications of the ACM, 29, 648-656 * 23. Smith, J.M. and D.C. Smith, [http://portal.acm.org/citation.cfm?id=320546 Database Abstractions: Aggregation and Generalization], ACM Transactions on Database Systems, June 1977, pp. 105-133; ICS Report No. 8406, June 1984 * 24. [http://www.win.tue.nl/~aeb/partitions/partition_types.html Partition types] by [mailto:aeb@cwi.nl Andries Brouwer], 2009-06-25 [[category:Reiser4]] {{wayback|http://www.namesys.com/whitepaper.html|2006-11-13}} Future Vision of ReiserFS Name Spaces As Tools for Integrating the Operating System Rather Than As Ends in Themselves By Hans Reiser http://namesys.com 6114 La Salle ave., #405 Oakland, CA 94611 email: reiser@namesys.com == Abstract == For too long the file system has been semantically impoverished in comparison with database and keyword systems. It is time to change! The current lack of features makes it much easier to use the latest set theoretic models rather than older models of relational algebra or hypertext. The current FS syntax fits nicely into the newer model. The utility of an operating system is more proportional to the number of connections possible between its components than it is to the number of those components. Namespace fragmentation is the most important determinant of that number of possible connections between OS components. Unix at its beginning increased the integration of I/O by putting devices into the file system name space. This is a winning strategy; let's take the file system name space and, one missing feature at a time, eliminate the reasons why the filesystem is inadequate for what other name spaces are used for. Only once we have done so will the hobbles be removed from OS architects, or even OS conspiracies.
Yet before doing that, we need a core architecture for the semantics to ensure we end up with a coherent whole. This paper suggests a set theoretic model for those semantics. The relational models would at times unacceptably add structure to information, the keyword models would at times delete structure, and purely hierarchical models would create information mazes. Reworking their primitives is required to synthesize the best attributes of these models in a way that allows one the flexibility to tailor the level of structure to the need of the moment. The set theoretic model I propose has a syntax that is upwardly compatible with Linux, MacOS, and DOS file system syntax, as well as with the CORBA naming layer. This is a planning document for the next major version of ReiserFS, that is, a description of vaporware. It is useful to ReiserFS users and contributors who want to know where we are going, and why we are building all sorts of strange optimizations into the storage layer (and especially those who are willing to help shape the vision in the course of discussions on the {{listaddress}} mailing list....). Currently the storage layer for ReiserFS is working and useful as an everyday FS with conventional semantics. That storage layer is available as a GPL'd Linux kernel patch. == Introduction == Many OS researchers have built hierarchical name spaces that innovate in their effect on the integration of the operating system (e.g. Plan 9 and its file system [Pike]). Relational and keyword researchers rightfully scorn hierarchical name spaces as 20 years behind the state of the art [Date], but pay little attention to integration of the operating system as a design objective in their own work, or as a possible influence on data model design. I won't go into that here. Limiting associations to single key words is an unnecessary restriction.
== A Naming System Should Reflect Rather than Mold Structure == The importance of not deleting the structure of information is obvious; few would advocate using the keyword model to unify naming. What can be more difficult to see is the harm from adding structure to information; some do recommend the relational model for unifying naming (e.g. OS/400). By decomposing a primitive of a model into smaller primitives one can end up with a more general model, one with greater flexibility of application. This is the very normal practice of mathematicians, who in their work constantly examine mathematical models with an eye to finding a more fundamental set of primitives, in hopes that a new formulation of the model will allow the new primitives to function more independently, and thereby increase the generality and expressive power of the model. Here I break the relational primitive (a tuple is an unordered set of ordered pairs) into separate ordered and unordered set primitives. Relational systems force you to use unordered sets of ordered pairs when sometimes what you want is a simple unordered set. Why should a naming system match rather than mold the structure of information? For systems of low complexity, the reasons are deeply philosophical, which means uncompelling. And for multiterabyte distributed systems?... Reiser's Rule of Thumb #2: The most important characteristic of a very complex system is the user's inability to learn its structure as a whole. We must avoid adding structure, or guarantee that the user will be informed of all structure relevant to his partial information. Avoiding adding structure is both more feasible and less burdensome to the user. Hierarchical, relational, semantic, and hypersemantic systems all force structure on information, structure inherent in the system rather than the information represented.
If a system adds structure, and the user is trying to exploit partial knowledge (such as a name embodies), then it inevitably requires the user to learn what was added before he can employ his partial knowledge. With complex systems, the amount added is beyond the capacity of users to learn, and information is lost. Example: <tt>"My name is Kali, your friendly technical support specialist for REGRES. Our system puts the Library of Congress online! How may I help you?"</tt> George doesn't know Santa Claus' name: <tt>"I'm trying to find the reindeer chimneys christmas man, and I can't get your system to do it."</tt> [[Image:Reindeer.jpg]] FIGURE 1. Graphical representation of a typical simple unordered set that is difficult for relational systems. Kali says: <tt>"OK, now let's define a query. '''is-a equals man''', that's easy. But reindeer? Is reindeer a property of this man?"</tt> <tt>"Uh no. I wish I could remember the dude's name. I read this story about him a long time ago, and all I can remember is that he had something to do with reindeer and chimneys. The story is on-line, somewhere."</tt> <tt>"Reindeer chimneys presents man, that's the sort of speech pattern I'd expect from a three year old."</tt> Kali corrects him. <tt>"Let's see if we can structure this properly. Is reindeer an '''instance-of''' this man? A '''member-of''' this man? It couldn't be a '''generalization''' of this man. Hmm..."</tt> <tt>"No! It's not that complicated. They just have something to do with him."</tt> <tt>"Pavlov would probably say you associate reindeer with this man, the way the unstructured mind of an animal thinks. But here in technical support we try to help our customers become more sophisticated. Is reindeer a property of this man?"</tt> <tt>"No. Try '''propulsion-provider-for'''."</tt> <tt>"Do you think that that was the schema the person who put the information in our system used?"</tt> <tt>"No. Shoot.
I can think of a dozen different columns it could be under. But what are the chances that the ones I think of are going to be the same as the ones the dude who put the information in used?"</tt> Kali feels satisfaction. <tt>"Guess it can't be done, not if you can't structure your REGRES query properly. I'll put you down in my log as a closed ticket, 190 seconds to resolution, not bad."</tt> <tt>"A keyword system could handle reindeer chimneys christmas man."</tt> George grumbles as he stares in despair at his display. Unfortunately, the ''Library of Congress'' is only one of REGRES' many reference aids. George could spend his life at it, and he'd never learn its schema. <tt>"But a keyword system would delete even necessary structure inherent to the information. It couldn't handle our other needs!"</tt> Kali says before she hangs up. In addition to the searcher's difficulties, having to manufacture structure by specifying the column for reindeer also adds unnecessary cognitive load to the story author's indexing tasks. == A Few of the Other Approaches to This Problem == There is lurking at the heart of my approach a subtle difference between my analysis of naming, and the analysis of at least some others. I started my research by systematically categorizing the different structures embodied by names, placing them into equivalency classes, and then picking one syntax out of each class of functionally equivalent naming structures, on the assumption that each of the equivalency classes has value. For example, I considered that languages sometimes convey structure by word endings (tags), and sometimes by word order, but while the syntax differs, the word order and word ending techniques are equivalent in their power to convey structure. In my analysis of the effect of word ordering I decided that either the ordering mattered, or it did not, and that was the basis for two different naming primitives. 
Others have instead studied the inherent structure of data, and then from that derived ways of naming. The hypersemantic system [Smith] [Potter] represents an attempt to pick a manageably few columns which cover all possible needs. Generalization, aggregation, classification, and membership correspond to the is-a, has-property, is-an-instance-of, and is-a-member-of columns, respectively. The minor problem is that these columns don't cover all possibilities. They don't cover reindeer, presents, or chimneys for George's query. The major problem is that they don't correspond as closely as possible to the most common style of human thought, simple unordered association, and require cognitive effort to transform. The first response of relational database researchers to this is usually to ask: "Why not modify an existing relational database to contain an 'associated' column, put everything in that column, and it would be functionally equivalent to what you want." This is like saying that you can do everything Pascal can do using TeX macros. (They are both Turing complete.) We don't design languages to simply be Turing complete, we design them to be useful. I have seen a colleague do in six lines of SQL (nonstandard SQL) a simple three keyword unordered set that I do in 3 words plus a pair of delimiters, and that traditional keyword systems also handle easily. Doing simple unordered sets well is crucial for highly heterogeneous name spaces, and the market success of keyword systems in Internet searching is evidence of that. If you look at the structure of names in human languages, they are not all tuple structured, and to make them tuple structured might be to distort them. I have merely discussed the burden of naming columns. Most relational systems also require the user to specify the relation name. If column naming is a burden, naming both the column and the relation is no less a burden.
Many systems invest effort into allowing you to take the key that you know, and figure out all the relation names and columns that you might choose to pair with it. This is a good idea, but not as good as not imposing extraneous structure to begin with. [Salton] can be read for devastating critiques of the document clustering system, but there is a worthwhile idea lurking within that system. Perhaps it is worthwhile to keep track of a small number of documents which are "close" to a given document. The document creator could be informed upon auto-indexing the document what other documents appear to be close to it, and asked to consider associating it with them. This is not within our current plan of work, but I don't reject it conceptually. In summary, modularity within the naming system is improved by recognizing unordered grouping and ordering as two different functions that deserve separate primitives rather than being combined into a tuple primitive. The tuple is an unordered set of ordered pairs. There are other useful combinations of unordered grouping and ordering than that embodied by the relation, and the success of keyword systems suggests that a plain unordered set without any ordering at all is the most fundamental and common of them. == Names as Random Subsets of the Information In an Object == A system may still be effective when its assumptions are known to be false. You may regard the above as an overstatement of the notion that we are neural nets, and sometimes our abstract systems deal with assumptions that are not true or false, but are somewhat true. After we are finished stating them in English they lose the delicate weighting possessed by the reality of the situation. Sometimes we find it easier to model without that weighting. Classical economics and its assumption of perfect competition is the best known example of an effective system based on assumptions known to be substantially false.
Introductory economics classes usually spend several weeks of class time arguing the merits of building models on somewhat false assumptions. This paper will now use such a somewhat false model to convey a feel for why mandatory pairing of name components causes problems. Assume the user's information from which he tries to construct a description will be some completely random subset of the information about the object. (Some of that information will be structural, and the structural fragments selected will be just as random as the rest.) Assume a user has 15 random clues of information selected from 300 pieces of information the system knows about some object. Assume the REGRES naming system requires that data be supplied in threesomes (perhaps column name, key name, relation name), and cannot use one member of a threesome without the other members of the threesome. Assume the ANARCHY naming system lacks this restriction, but does so at the cost that it can only use those 10 of the 15 information fragments which do not embody structure. Assume the statistical distribution of the 15 pieces of information the user has to construct a name with are fully independent and equally likely (this is both substantially wrong, and unfair to REGRES, but .... ) Assume each clue has a selectivity of 100 (it divides the number of objects returned by 100). Then ANARCHY has a selectivity of 100<sup>10</sup> = 10<sup>20</sup> = good. REGRES has a selectivity of: 100<sup>(chance that the other two members of an object's threesome are possessed by user x 15)</sup> = 100<sup>(9/300 x 8/300 x 15)</sup> = 1.06 = very bad. While it is not true that the clues are fully independent, it is true that to the extent that they are not fully dependent, ANARCHY will gain in selectivity compared to REGRES.
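The arithmetic above can be checked directly. The following few lines of Python are not part of the original paper; the constants 15, 300, 100, 9, and 8 are taken straight from the text of the model.

```python
# Selectivity model from the text: 15 random clues out of 300 facts,
# each clue divides the candidate set by 100.
clues, facts, selectivity = 15, 300, 100

# ANARCHY can use the 10 non-structural clues independently.
anarchy = selectivity ** 10  # 100^10 = 10^20

# REGRES can only use a clue when the other two members of its
# threesome are also among the user's clues (the 9/300 and 8/300
# factors in the text), expected over all 15 clues.
regres = selectivity ** ((9 / facts) * (8 / facts) * clues)

print(anarchy)           # 100000000000000000000 (10^20)
print(round(regres, 2))  # 1.06
```

The exponent for REGRES works out to 0.012, so its selectivity is barely above 1, which is the "very bad" figure the text reports.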
Attempting to quantify for any database the extent of the dependence would be a nightmare, and so this model assumes a substantial falsity, through which it is hoped the reader can see a greater truth. For databases of the lower heterogeneity and complexity that the relational model was designed for, the independence within a threesome can be small, and the ability to also employ the 5 of 15 fragments which are structural is often more important than the difficulty of guessing any structure added. There is an implicit assumption here that you are looking for information that others have structured, and this argument in favor of ANARCHY becomes much less strong without this assumption. I feel obligated to stress once again that I do not advocate low structure over high structure, but I do advocate having the flexibility to match the amount of structure to the needs of the moment. Only with such flexibility can one hope to use all of the 15 fragments that happen to be possessed. == The Syntax In More Detail == What's needed is a naming system intended to reflect just the structure inherent in the information, whatever that structure might be, rather than restructuring the information to fit the naming system. === Orthogonal or Unoriginal Primitives and Features === There are many primitives that the ultimate naming system would include but which I will not discuss here: macros, OR, weight for subnames and AND-OR connectors [Fox], rules, constraints, indirection, links, and others. I have tried to select only those aspects in which my approach differs from the standard approach. Unifying the namespace does not require unifying automatic name generation, and those who read the [Blair] vs. [Salton] controversy likely understand my concluding that whatever the benefits might be of unifying automatic name generation, it is not feasible now, and won't be feasible for a long time to come. 
The names one can assign an object are kept completely orthogonal from the contents of the object in the implementation of this naming layer. It is up to the owner of the object to name it, and it is up to him to use whatever combination of autonaming programs and manual naming best achieves his purpose. He may name it on object creation, and he may continually adjust its various names throughout its lifetime. See the section defining the "Key_Object primitive" for a discussion of why names should be thought of this way. Technically, object creation only requires that the object be given a Storage_Key. In practice most users will, in the same act that creates the object, also associate the object with at least one name that will spare them from directly specifying the Storage_Key in hex the next time they make a reference to it. Applications implementing external name spaces can interact with the storage layer by referencing just the Storage_Key. Namesys will provide a manual naming interface, and the API that autonaming programs need to plug into. Companies such as Ecila will provide autonamers for various purposes. Ecila is implementing a program which scans remote stores, creates links to them in the unified name space, but leaves the data on the remote stores. Other programs may also be implemented to perform this general function. To be more specific, the Ecila search engine scans the web for documents in French, and uses the filesystem as an indexing engine. However, they are writing their engine to be general purpose; they have sold support and the addition of extensions to it to other search engine companies, and it is open source. For now we are simply functioning as part of their engine, and the interface is by web browser: at some point we may be able to add their functionality to the namespace.
While the implementation of Microsoft's attempt to blur the distinction between the filesystem name space and the web namespace is one more of appearance than substance, it is surely the right thing to do for Linux as well in the long run. We should simply make our integration one with substance and utility, rather than integrating mostly the look and feel. When the store is external to the primary store for the namespace, then stale names can be an issue with no clean resolution. That said, unification at just the naming layer is, in a real rather than ideal world, often quite useful, and so we have Internet search engines. GUI based naming is beyond the scope of this paper, except to mention that it is common for GUI namespaces to be designed such that they are not well integrated with the other namespaces of the OS. They are often thought to necessarily be less powerful, but proper integration would make this untrue, as they would then be additional syntaxes, not substitutes. These additional syntaxes should possess closure within the general name space, and thereby be capable of finding employment as components of compound names like all the other types of names. The compound names should be able to contain both GUI and non-GUI based name components. Integration would make them simply the aspect of naming that applies to what is present in the visual cache of the screen, and to how to manage and display that cache most effectively. === Vicinity Set Intersection Definition (Also Called Grouping) === Suppose you have a set X of objects. Suppose some of these objects are associated with each other. You can draw them as connected in a graph. Let the vicinity of an object A be the set of objects associated with A. Let there be a set of query objects Q. Then the set vicinity intersection of Q is the set of objects which are a member of all vicinities of the objects in Q. When thinking of this as a data model, it seems natural to use the term vicinity set intersection.
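The definition just given can be sketched in a few lines of Python. This is an illustrative model only, not part of the paper; the association graph and object names below are invented.

```python
# Vicinity set intersection: each object has a vicinity (the set of
# objects it is associated with); a query set Q returns the objects
# that lie in every vicinity of the members of Q.

# Hypothetical association graph: object -> its vicinity.
vicinity = {
    "reindeer": {"santa-story", "zoology-text"},
    "chimneys": {"santa-story", "masonry-manual"},
    "man":      {"santa-story", "zoology-text", "masonry-manual"},
}

def vicinity_intersection(query):
    """Intersect the vicinities of every object in the query set."""
    sets = [vicinity[q] for q in query]
    return set.intersection(*sets) if sets else set()

print(vicinity_intersection({"reindeer", "chimneys", "man"}))
# -> {'santa-story'}
```

Note that the result shrinks as the query set grows, which is exactly the selectivity behavior the paper attributes to grouping.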
When thinking of this syntactically, it seems natural to use the term "grouping", because it implies that the subnames are grouped together without the order of the subnames being significant. There is exactly one data model primitive (set vicinity intersection) possessing exactly one syntax (grouping), and I rarely intend to distinguish data model primitive from syntax primitive (I can be criticized for this), and yet I use both terms for it, forgive me. === Synthesizing Ordering and Grouping === I am going to describe a toy naming system that allows focusing on how best to combine grouping and ordering into one naming system. This synthesis will contain the core features of the hierarchical, keyword, and relational systems as functional subsets. It consists of a few simple primitives, allowed to build on each other. It sets the discussion framework from which our project will over many years evolve a real naming system out of its current storage layer implementation. Resolving the second component of an ordering is dependent on resolving the first --- unlike set theory. In set theory one can derive ordered set from unordered set, but because resolving the name of the second component depends on the first component one cannot do so in this naming system. For this reason it can well be argued that this naming system is not truly set theory based. Now that I have mentioned this difference I will start to call them grouping and ordering, rather than unordered and ordered set. These two primitives take other names as sub-names, and allow the user to construct compound names. Either the order of the subnames is significant (ordering), or it isn't (grouping), and thus we have the two different primitives. Because I have myself found that BNFs are easier to read if preceded by examples, I will first list progressively more complex examples using the naming system, and then formally define them.
The examples, and the simplified syntax, use / rather than : or \, but this is of no moment. Examples: <tt>/etc/passwd</tt> [[Image:Passwd.jpg]] Ordering and grouping are not just better; file system upward compatibility makes them cheaper for unifying naming in OSes based on hierarchical file systems than a relational naming system would be. This approach is fully upwardly compatible with the old file system. Users should be able to retain their old habits for as long as they wish, engage in a slow comfortable migration, and incorporate the new features into their habits as they feel the desire. Elderly programs should be untroubled in their operation. Many worthwhile projects fail because they emphasize how much they wish to change rather than asking of the user the minimal collection of changes necessary to achieve the added functionality. [dragon gandalf bilbo] [[Image:Bilbo.jpg]] FIGURE 3. Graphical representation of ascii name on left. Mr. B. Bizy is looking for a dimly remembered story (The Hobbit by Tolkien) to print out and take with him for rereading during the annual company meeting. case-insensitive/[computer privacy laws] [[Image:Syntax-barrier.jpg]] FIGURE 4. Graphical representation of ascii name on left. When one subname contains no information except relative to another subname, and the order of the subnames is essential to the meaning of the name, then using ordering is appropriate. This most commonly occurs when syntax barriers are crossed. This is when a single compound name makes a transition from interpreting a subname according to the rules of one syntax to interpreting it according to the rules of another syntax. Ordering is essential at the boundary between the name of the new syntax as expressed in the current syntax, and the name to be interpreted according to that new syntax. Some researchers use the term context rather than syntax. The pairing of a program or function name, and the arguments it is passed, is inherently ordered.
While that is usually the concern of the shell, when we use a variety of ordering functions to sort Key_Objects of different types it affects the object store. In this example the ordering serves as a syntax barrier. Case-insensitive is the unabbreviated name of a directory that ignores the distinction between upper and lower case. For Linux compatibility this naming layer is case sensitive by default, even though I agree with those who think that it would be better were it not. [my secrets]/ [love letter susan] [[Image:My-secrets.jpg]] FIGURE 5. Graphical representation of ascii name on left Devhuman (that's the account name he chose) is the company's senior programmer. Six years ago he wrote a love letter to Susan, which he put in his read protected secrets directory. (He never found the nerve to send it to her.) He's looking for it so he can rewrite it, and then consider sending it. Security is a particular kind of syntax barrier (you have to squint a bit before you can see it that way). Here the ordering serves as a security barrier. (He certainly wouldn't want anyone to know that an object owned by him with attributes love letter susan existed.) [subject/[illegal strike] to/elves from/santa document-type/RFC822 ultimatum] [[Image:Ultimatum.jpg]] FIGURE 6. Graphical representation of search for santa's ultimatum Devhuman knows his object store cold. He is looking for something he saw once before, he knows that it was auto-named by a particular namer he knows well (perhaps one whose functionality is similar to the classifier in [Messinger]), and he knows just what categorizations that namer uses when naming email. Still, he doesn't quite remember whether the word 'ultimatum' was part of the subject line, the body, or even was just elvish manual supplementation of the automatic naming. Rather than craft a query carefully specifying what he does and does not know about the possible categorizations of ultimatum, he lazily groups it. 
If Devhuman's object store is implemented using this naming system with good style, someone less knowledgeable about the object store would also be able to say: [santa illegal strike ultimatum elves] and perhaps get some false hits as well as the desired email (instead of finding mail from santa perhaps finding the elvish response). Notice that if you delete the 'illegal' and 'ultimatum' to get [subject/strike to/elves from/santa document-type/RFC822] the query is structurally equivalent to a relational query. Many authors (e.g. semantic database designers) have written papers with good examples of standard column names which might be worth teaching to users. So long as they are an option made available to the user rather than a requirement demanded of the user, the increased selectivity they provide can be helpful. [_is-a-shellscript bill] [[Image:Pruner.jpg]] FIGURE 7. Graphical representation of ascii name on left. This name finds all shellscripts associated with bill. Names preceded by _ are pruners. Pruners are analogous to the predicate evaluators of relational database theory. If you have read papers distinguishing between recognition and retrieval, pruners are a recognition primitive. They are passed a list of objects, and return a subset of that list which matches some criteria. They are a mechanism appropriate for when a nonlinear search method that can deliver the desired functionality is either impossible, or not supported by existing indexes. There are many useful names for which we cannot do better than linear time search algorithms (perhaps simply as a result of incomplete indexing). _is-a-shellscript checks each member of its list to see if it is an executable object containing solely ascii. The user can use it just like any other Key_Object within an association; it will prune the results of the grouping. Since set intersections are commutative, its order within the grouping has no meaning, and optimizers are free to rearrange it.
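A pruner as described here amounts to a plain predicate applied, in linear time, to the candidates an index-based grouping has already produced. A minimal sketch, assuming hypothetical object records with `executable` and `contents` fields (not an interface the paper defines):

```python
# A pruner is a linear-time filter: given candidate objects from a
# grouping, return only those matching a predicate.

def is_a_shellscript(obj):
    """_is-a-shellscript: an executable object containing solely ASCII."""
    return obj["executable"] and all(ord(c) < 128 for c in obj["contents"])

def prune(candidates, predicate):
    return [obj for obj in candidates if predicate(obj)]

# Hypothetical candidates already matched by a grouping such as [bill ...]:
candidates = [
    {"name": "backup.sh", "executable": True,  "contents": "#!/bin/sh\necho bill\n"},
    {"name": "photo.png", "executable": False, "contents": "\x89PNG"},
    {"name": "bill-tool", "executable": True,  "contents": "\x89binary"},
]

print([o["name"] for o in prune(candidates, is_a_shellscript)])
# -> ['backup.sh']
```

Because the predicate is commutative with set intersection, an optimizer could equally run it before or after the other subnames are intersected, which is the rearrangement freedom the text mentions.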
=== The Formal Definitions === {| border=1 | <Object Name>::= || <pre> <Grouping> | <Ordering> | <Key_Object> | <Storage_Key> | <Orthogonal and Unoriginal Primitives I Will Not Define Here> | ; </pre> |} See the section listing orthogonal and unoriginal primitives for a discussion of what primitives I left out of the definitions of this grammar that are necessary to a real world working system. The name resolver has a method for converting all of the primitives into '''<Storage_Keys>''', and when processing the compound names it first converts the subnames into '''<Storage_Keys>'''. (An object may have null contents, and serve purely to embody structure.) This allows the use of anything which anyone can invent a way of allowing the user to find an '''<Object Name>''' for, and then invent a method for the resolver to convert the '''<Object Name>''' into a '''<Storage_Key>''', as a component of a grouping or ordering. In a word, closure. Extensible closure. Compound names are interpreted by first interpreting the subnames that they are constructed from. At each stage of subname interpretation an '''<Object Name>''' is converted into a '''<Storage_Key>''' for the object that it is resolved to. The modules that implement the grouping and ordering primitives do not interpret the subnames, they merely pass them to the naming system which returns the '''<Storage_Key>'''s they resolve to. It was a long discussion which led to the use of storage keys rather than objectids. A storage key differs from an objectid in that it gives the storage layer directions as to where to try to locate the object in the logical tree ordering of the storage layer. If the logical location changes, then in the worst case we leave a link behind, and get an extra disk access like we get with an inode.
(Inode numbers are functionally objectids.) In the better case, the repacker eventually comes along, and changes all references by key to the new location, at least for all objects that have not given their key to external naming systems the repacker cannot repack. A '''<Storage_Key>''' is assigned by the system at object creation, and serves the purpose of allowing the system to concisely name the object, and provide hints to the storage layer about which objects should be packed near each other. The user does not directly interact with the '''<Storage_Key>''' any more often than C programmers hardcode pointers in hex. The packing locality of keys may be redefined. == The Primitives == <Key_Object> A description of the contents of an object using the syntax of the current directory. For objects used to embody keywords this may be the keyword in its entirety. If it contains spaces, etc. it must be enclosed in quotes. Note that making it easy for third parties to add plug-in directory types is part of Namesys's current contract with Ecila. Ecila wants space efficient directories suitable for use in implementing a term dictionary and its postings files for their Internet search engine. Example: [reindeer chimneys presents man] In this example, 'presents', 'reindeer', 'chimneys', and 'man' are the contents of objects associated with the Santa Claus story. Each of them is searched for by contents, and when found they are converted into their Storage_Keys; the grouping algorithm is then fed those Storage_Keys. The grouping module then looks in the object headers of these objects, gets the sets of objects the Key_Objects group to, and performs a set intersection.
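The grouping module's set intersection can be sketched as follows. This is a hedged illustration only: the header dict stands in for per-object headers, and the integer storage keys are made up:

```python
# Sketch of interpreting a grouping such as [reindeer chimneys presents man],
# assuming each object header carries the set of storage keys the object
# groups to.  All keys and values here are invented for illustration.

header = {                       # Key_Object -> storage keys it groups to
    "reindeer": {101, 102, 205},
    "chimneys": {102, 205, 309},
    "presents": {102, 205, 404},
    "man":      {102, 500},
}

def resolve_grouping(key_objects):
    """Intersect the grouping sets of every named Key_Object."""
    return set.intersection(*(header[k] for k in key_objects))

print(resolve_grouping(["reindeer", "chimneys", "presents", "man"]))  # {102}
```

Only storage key 102 (the hypothetical Santa Claus story) survives every intersection; any permutation of the Key_Objects yields the same answer, since set intersection is commutative.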
Besides greater closure, another advantage of storing Key_Objects as objects is that non-ascii Key_Objects and ordering functions can be implemented as a layer on top of the ascii naming system, allowing the user to interact with the naming system by pressing hyperbuttons, drawing pictures, making sounds, and supplying other non-ascii Key_Objects that the higher layers convert into Storage_Keys. There are endless content description techniques. If the directory owner supplies an ordering function for the Key_Objects in a directory, one can generate a search index for the directory using a directory plug-in which is fully orthogonal to the ordering function, though perhaps slower in some cases than one that is tailored for the ordering function. Users will find it easier to write ordering functions than index creation objects, and will not always need the speed of specialized indexes. We will need one ordering function for ascii text, another for numbers, another for sounds, perhaps someday one even for pictures of faces (perhaps to be used by a law enforcement agency constructing an electronic mug book, or a white pages implementation), etc. No system designer can provide all the different and sometimes esoteric ordering functions which users will want to employ. What we can do is create a library of code from which users can construct their own ordering functions and their own directory plug-ins, and this is the approach we are taking on behalf of Ecila. For an Internet search engine one wants what is called a postings file, which is like a directory in that there is no need to support a byte offset, and one frequently wants to efficiently perform insertions into it. <pre> <Grouping> ::= [<Unordered List>] ; <Unordered List> ::= <Unordered List> <Unordered List> | <Object Name> | <Pruner> ; <Pruner> ::= _<Object Name> </pre> A <Grouping> is a list of object names and pruners whose order has no meaning.
Every object has a list of objects it groups to (associates with, in neural network idiom) in its object header. A grouping is interpreted by performing a set intersection of those lists for every object named in the grouping; in the sense of the data model, this is a set vicinity intersection. Grouping is not transitive: <pre> [A] => B and [B] => C does not imply [A] => C though it does imply that [[A]] => C </pre> A pruner is an <Object Name> which has been preceded with an _ to indicate that the object described should be passed a list of objects named by the rest of the grouping, executed, and that it will return a subset of the list it was passed. Whether a member of the set is in the returned subset must be fully independent of what the other members of the set were, or else the results become indeterminate after application of a query optimizer, since with an optimizer in use there is no guarantee of the order in which the pruners are applied. <pre> <Ordering> ::= <Object Name>/<Object Name> | <Object Name>/<Custom Programmed Syntax> <Custom Programmed Syntax> ::= Varies, provides extensibility hook. </pre> An ordering is a pairing of names, with the order representing information. The first component of the ordering determines the module to which the second component is passed as an argument. In contrast, a grouping first converts all subnames to Storage_Keys by looking through the same current directory for all of them in parallel, and then does its set intersection with the subdescriptions already resolved. Example: In resolving [my secrets] / [love letter susan] the system would look for the objects with contents my and secrets, find both of them, and do a set intersection of all of the objects those two objects both group to (are associated with). This will allow it to find the [my secrets] directory, inside of which it will look for the three objects love, letter, and susan.
It will then extract from their object headers the sets of objects those three words ('love', 'letter', and 'susan') group to, and do a set intersection which will find the desired letter. The desired letter is not necessarily inside the [my secrets] directory, though in this case it probably is. A directory is an object named by the first component of an ordering, to which the second component is passed, and which returns a set of Storage_Keys. One can in principle use different implementations of the same directory object without impacting the semantics, only affecting performance, as is often done in databases. There are flavors of directories: Custom programmed directories, aka filters, are any executable program that will return a Storage_Key when executed and fed the second component as an argument. They provide extensibility. (They are the ordered counterpart of pruners.) Another term for them is filter directories. Custom programmed directories whose name interpretation modules aren't unique to them will contain just the name of the module (filter), plus some directory dependent parameters to be passed to the module. It should be considered merely a syntax barrier directory, and not a fully custom programmed directory, if those parameters include a reference to a search tree that the module operates on, and if that search tree adheres to the default index structure. The connotations conveyed by the term 'filter' of there being an original which is distorted are not always appropriate, but in honesty this is not an issue about which we deeply care. Syntax barrier directories allow you to describe the contents of the objects they contain with a syntax different from that of their parents. Except for being sorted by a different ordering function, the indexes of syntax barrier directories are standard in their structure, and use a standard index traversal module. The index traversal module is ordering function independent.
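The two-stage resolution of [my secrets] / [love letter susan] described above can be sketched as follows. The dict-based "directories" and the integer storage keys are stand-ins invented for illustration; real directories would be index structures in the storage layer:

```python
# Sketch of resolving the ordering [my secrets] / [love letter susan]:
# the left grouping is resolved in the current directory, and the single
# object it names is then used as a directory in which the right-hand
# grouping is resolved.  All data here is illustrative.

current_dir = {                # Key_Object -> storage keys it groups to
    "my":      {7},
    "secrets": {7},
}

directories = {                # storage key -> that directory's own index
    7: {"love": {42, 43}, "letter": {42}, "susan": {42, 99}},
}

def grouping(index, words):
    return set.intersection(*(index[w] for w in words))

def ordering(left_words, right_words):
    (dir_key,) = grouping(current_dir, left_words)   # resolve [my secrets] -> 7
    return grouping(directories[dir_key], right_words)

print(ordering(["my", "secrets"], ["love", "letter", "susan"]))   # {42}
```

Note the asymmetry: the left component must resolve before the right one can even be interpreted, which is why an ordering, unlike a grouping, carries information in its order.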
There must be an ordering function for every <Key_Object> employed within a given syntax barrier directory. By contrast, a <Custom Programmed Syntax> could be anything which the syntax module somehow finds an object with, possibly even creating the object in order to be able to find it. To cross a security barrier directory the user must use an ordered pair of names with the security barrier as the first member of the pair, and he must satisfy the security module of the secured directory. A security barrier directory may be both a security and a syntax barrier directory, or it may share the syntax module of its parents. Fully standard directories are those built using the default directory module, and adding structure is their only semantic effect. There is an aspect of customization which is beyond the scope of this paper, in which one customizes the items employed by the storage layer to implement files and directories. That is, the storage of the files and directories is implemented by composing them of items, and these items have different types. We are now creating the code for packing and balancing arbitrary types of items using item handlers and object oriented balancing code, so as to make it easier to extend our filesystem. === Ordering can be implemented more efficiently than grouping === The set intersections performed in evaluating the grouping primitive are normally much more expensive computationally than performing the classical filesystem lookup. Imposing excess structure on one's data does not just at times reduce the cost of human thinking :-); it can reduce the cost of automated computation as well. When the cost to a user of learning structure is less important than the burden on the machine, use of highly ordered names is often called for. === The Motivation for Different Syntactic Treatment of Ordering and Grouping, and Some of the Deeper Issues Revealed by the Difference.
=== An important difference between grouping and ordering affects syntax. It allows us to represent an ordering with a single symbol ( '/') placed between the pair, but requires two symbols ( '[' and ']' ) for each grouping. Imagine using < and > as a two symbol delimiter style alternative notation for ordering: <<father-of mother-of>sister-of> = <father-of<mother-of sister-of> > = <father-of mother-of sister-of> = father-of /mother-of /sister-of All of the expressions above are equivalent in referring to the paternal great aunt of the person who is the current context. The ones using nested pairs of symbols to enclose pairs of subnames imply a false structure that requires the user to think to realize the first two expressions are equivalent. The fourth is the notation this naming system employs. Grouping is different: Fast Acting Freddy is looking through the All-LA Shopping Database for a single store with black reebok sneakers, a green leather jacket, and a red beret so that he can dress an actor for a part before the director notices he forgot all about him. [[black reebok sneakers] [green leather jacket] [red beret]] is not equivalent to [black reebok sneakers green leather jacket red beret] which equals [red sneakers black reebok jacket green beret] Ordering is not algebraically commutative (father-of/mother-of is not equivalent to mother-of/father-of ). Groupings are algebraically commutative. ([large red] = [red large]) == Style == As a general principle, a more restricted system can avoid requiring the user to repeatedly specify the restrictions, and if the user has no need to escape the restrictions then the restricted system may be superior. This is why "4GLs", which supply the structure for the user's query, are useful for some applications. They are typically implemented as layers on top of unrestricting systems such as this one. This paper has addressed issues surrounding finding information, particularly when the user's clues are faint. 
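The algebraic contrast just drawn can be checked mechanically: a grouping is a set intersection, hence commutative, while an ordering composes lookups, hence is not. The kin relation data below is invented purely for illustration:

```python
# Hedged sketch of the algebra: [large red] = [red large], but
# father-of/mother-of differs from mother-of/father-of.

def group(*sets):                     # a grouping is a set intersection
    return set.intersection(*sets)

large, red = {1, 2, 3}, {2, 3, 4}
print(group(large, red) == group(red, large))   # True: groupings commute

kin = {                               # (relation, person) -> person, made up
    ("father-of", "me"): "dad",
    ("mother-of", "dad"): "grandma",
    ("mother-of", "me"): "mom",
    ("father-of", "mom"): "grandpa",
}

def chain(relations, context):        # a/b/c applied left to right
    for r in relations:
        context = kin[(r, context)]
    return context

# orderings do not commute:
print(chain(["father-of", "mother-of"], "me"))  # grandma (paternal side)
print(chain(["mother-of", "father-of"], "me"))  # grandpa (maternal side)
```

Flattening the ordering into a single left-to-right chain, as in father-of/mother-of/sister-of, imposes no false nesting; reversing the chain changes the referent.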
When supporting other user goals, such as exploring information, adding structure through substantial use of ordering can be helpful. [Marchionini][McAleese]. When the user goal is finding, one should assume that of all the fragments of information about an object, the user has some random subset of them. The goal is to allow the user to use that random subset in a name, whatever that subset might be. Some of that subset will be structural fragments. While requiring the user to supply a structure fragment is as foolish as requiring him to supply any other arbitrary fragment, allowing him to is laudable. In the best of all worlds the object store would incorporate all valid possible structurings of Key_Objects. The difficulty in implementing that is obvious. [Metzler and Haas] discuss ways of extracting structure from English text documents, and why one would want to be able to use that structure in retrievals. Unfortunately, there is an important difference between representing the structure of an English language sentence in a way that conveys its meaning, and representing it in a way that allows it to be found by someone who knows only a fragment of its semantic content. I doubt the wisdom of trying to advocate the use of more than essential structure in searching. You can allow users to avoid false structure; you cannot force them to. It is important to teach those creating the structure that if they group a personnel file with sex/female they should also group it with female. Type checking can impose structure usefully. Its implementation can enhance or reduce closure, depending on whether it is done right. === When To Decompound Groupings === There are dangers in excessive compounding of compound groupings analogous to those of excessive ordering. Let's examine two examples of compound groupings, both of which are valid both semantically and syntactically. 
One of them can be "decompounded" with moderate information loss, and the other loses all meaning if decompounded. Example: Finding a loquacious Celtic textbook salesman who told you in excruciating detail about how he was an ordnance researcher until one day he went to a Grateful Dead concert. [[Celtic textbook salesman] [ordnance researcher]] vs. [Celtic textbook salesman ordnance researcher] These two phrasings of the same query are not equivalent, but they are "close." Our second example is the one in which Fast Acting Freddy tries to find a suspect by the objects he is associated with: [[black reebok sneakers] [green leather jacket] [red beret]] vs. [black reebok sneakers green leather jacket red beret] These two are not at all "close." The difference between the two examples of inequivalencies is that the subdescriptions within the second example describe objects whose existence within the object store, independent of the store described, is worthwhile. The first does not, and it is more reasonable to try to design so that the "decompounded" version of the query is used. False hits will occur, but for large systems that's better than asking the user to learn structure. A higher level user interface might choose to present only one level to the user at a time, and then once the user confirms that a subdescription has resolved properly it would let him incorporate it into a higher level description. There might be 6 models of [black reebok sneakers], and Fast Acting Freddy should have the opportunity to click his mouse on the exact model, and have the interface substitute that object for his subdescription. Using such an interface an advanced user might simultaneously develop several subdescriptions, refine and resolve them, and then use the mouse to draw lines connecting them into a compound grouping. Closure makes it possible for that to work.
== Examples of Creating Associations == <- creates an association between all of the objects on the left hand side and all of the objects on the right hand side. A - B is the set difference of A and B, and it resolves to the set of objects in A except for those that are in B. A & B resolves to the set intersection of A and B, the objects that are in both A and B. [A B] = [A] & [B], by definition. animal <- (lives, moves) mammal <- ([animal], animal, `warm blooded') cat <- ([mammal], hypernym/mammal, mammal, meronym/fur, fur, meronym/whiskers, whiskers, hypernym/quadruped, quadruped, capability/purr, purr, capability/meow, meow) Basil <- (owner/Nina, Nina, [siamese], siamese, clever, playful, brave/overly, brave, 'toilet explorer') bag <- ([container], container, consists-of/`highly flexible material', `highly flexible material') backpack <- ([bag], shoulderstrap/quantity/2, shoulderstrap, college-student, holonym/backpacker, meronym/shoulderstrap) mould <- ([fungi] - green/not, furry, `grows on'/surfaces/moist, `killed by'/chlorine) fungi <- ([plant], plant, leaves/no, flowers/no, green/not) bird <- ([vertebrate], vertebrate, flies, feathers) penguin <- ([bird] - flies, bird, hypernym/bird, swims, Linux, [Linux (mascot, symbol)]) siamese <- ([cat], cat, hair/short, short-hair) Notice how we don't associate siamese with short despite associating it with hair/short, but we do associate Basil with Nina as well as with owner/Nina. small <-0 little The above means that small and little are synonyms, and are to be treated as 0 distance away from each other for vicinity calculation purposes. In other, traditional Unix, words, they are hardlinked together. Creating a serious ontology is not our field or task, but it is worth doing. The reader is referred to WordNet (free), and to Cyc by Doug Lenat (proprietary). While we will focus on implementing primitives that allow for creating better ontologies, we are happy to work with persons interested in contributing or porting an ontology.
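The core of the <- primitive can be sketched as updates to object headers plus a reverse index for retrieval. The AssocStore class and its method names are inventions for illustration, not part of any real implementation; the sample associations echo the ontology above:

```python
# Sketch of `x <- (a, b, ...)`: record in x's object header that x groups
# to a, b, ..., and keep a reverse index so that groupings can be resolved.

from collections import defaultdict

class AssocStore:
    def __init__(self):
        self.groups_to = defaultdict(set)     # object header lists
        self.grouped_by = defaultdict(set)    # reverse index for lookups

    def associate(self, obj, *targets):       # obj <- (targets...)
        for t in targets:
            self.groups_to[obj].add(t)
            self.grouped_by[t].add(obj)

    def grouping(self, *key_objects):         # [a b ...] as set intersection
        return set.intersection(*(self.grouped_by[k] for k in key_objects))

s = AssocStore()
s.associate("cat", "mammal", "fur", "whiskers", "purr")
s.associate("siamese", "cat", "hair/short", "short-hair")
s.associate("Basil", "siamese", "clever", "playful", "toilet explorer")
print(s.grouping("siamese", "clever"))   # {'Basil'}
```

Note that grouping is not transitive here, just as the data model requires: Basil groups to siamese and siamese groups to cat, but [cat] does not find Basil.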
== Other Projects Seeking To Increase Closure In The OS == === AT&T's Plan 9 === [Plan 9] is being produced by the original authors of Unix at AT&T research labs. It has influenced CORBA, and Linux's /proc is a direct steal from it. Their major focus is on integration. Their major trick for increasing integration is unifying the name space. Name spaces integrated into the Plan 9 file system include the status, control, virtual memory, and environment variables of running processes. They have a hierarchical analog to what the relational culture calls constructing views, which the Plan 9 culture calls context binding. === Microsoft's Information At Your Fingertips === Plan 9 ignores integration of application program name spaces, concentrating on OS oriented name spaces. Microsoft's "Information at Your Fingertips" name space integration effort appears to be taking the other approach, focusing on integrating the name spaces of the various Microsoft applications via OLE and Structured Storage. The application group at Microsoft has long been better staffed and funded than the OS group, and FS developers have long preferred to simply ignore the needs of application builders generally. The primary semantic disadvantages of Microsoft's approach are primitives selected with insufficient care, a lack of closure, and the use of an object oriented rather than set oriented approach in both naming syntax and data model. Realistically, one can say that folks within Microsoft have often made statements favoring name space integration, and in various areas have successfully executed on it, but on the whole I rather suspect that the lack of someone in marketing making a business case for $X in revenue resulting from name space integration has crippled name space integration work at commercial OS producers generally, including MS. ==== Internet Explorer ==== Internet Explorer attempts to unify the filesystem and Internet namespaces.
At the time of writing, the unification is so superficial, with so little substance, that I would describe it as having the look and feel of integration without most of the substance. Perhaps this will change. ==== Microsoft's Well Known Performance Difficulties ==== Despite having many of the leading names in the industry on their payroll, they have somehow managed to create a file system implementation with performance so terrible that, for the Unix customer base, it is a significant consideration contributing to hesitation in moving to NT. It may well have the worst performance of any of the major OS file systems. Their implementation of OLE's structured storage offers extremely poor performance, and their excuse that it is due to the incorporation of transaction concepts into their design is just a reminder that they did a poor job at that as well. They managed to implement something intended to store small objects within a file, yet implemented it such that it still suffers from 512-byte granularity problems, problems that they try to somewhat overcome by encouraging the packing of several objects within "storages" at horrible kludge cost. === Storage Layers Above the FS: A Sure Symptom The FS Developer Has Failed === When filesystems aren't really designed for the needs of the storage layers above them, and none of them are, not Microsoft's, not anybody's, then layering results in enormous performance loss. The very existence of a storage layer above the filesystem means that the filesystem team at an OS vendor failed to listen to someone, and that someone was forced to go and implement something on their own. You just have to listen to one of these meetings in which some poor application developer tries to suggest that more features in the FS would be nice; I heard one at a nameless OS vendor.
The FS team responds that disks are cheap, small object storage isn't really important, we haven't changed the disk layout in 10 years, and changing it isn't going to fly with the gods above us about whom we can do nothing. At these meetings you start to understand that most people who go into filesystem design are persons who didn't have the guts to pursue a more interesting field in CS. There is a sort of reverse increasing returns effect that governs FS research: the more code becomes fixed on the current APIs, the more persons in the field react with fear to any thought of FS semantics being other than a dead research topic, the less research gets done, and the fewer persons of imagination see a reason to enter the field. Every time one vendor gets a little ahead in adding functionality, the other vendors go on a FUD campaign about it breaking standards and therefore being dangerous for mission critical usage. This is a field in which only performance research is allowed, and every other aspect is simply dead. Namesys seeks to raise the dead, and is willing to commit whatever unholy acts that requires. There is no need for two implementations of the set primitive, one called directories, the other called a file with streams, each having a different interface. File systems should just implement directories right, give them some more optional features, and then there is no need at all for streams. If you combine allowing directory names to be overloaded to also be filenames when acted on as files, allowing stat data to be inherited, allowing file bodies to be inherited, and implement filters of various kinds, then in the event that the user happens to need the precise peculiar functionality embodied by streams, they can have it by just configuring their directory in a particular way. There was a lengthy Linux-kernel thread on this topic which I won't repeat in more detail here.
The tree architecture of the storage layer of this FS design will lend itself to a distributed caching system much more effectively than the Microsoft storage layer, in part due to its ability to cache not just hits and misses of files, but to cache semantic localities (ranges). For more on this topic see later in this paper. === Rufus === The Rufus system [Messinger et al.] indexes information while leaving it in its original location and format. While it does allow the user to create a unified name space, it does not choose to integrate that name space into the operating system. Even so, it is immensely useful in practice, and strongly hints at what the OS could gain if it had a more than hierarchical name space with a data model oriented towards what [Messinger] calls "semi-structured information", such as you find in the RFC822 format for email. When you have 7000 pieces of mail, and linearly searching the mail with a utility like grep takes 10 minutes, it is nice to be able to quickly keyword search via inverted indexes for the mail whose from: field contains billg and that has the words "exclusive" and "bundling" in the body of the message, as you hurriedly search for an old email just before an appearance in court. === Semantic File System === The Semantic File System comes closest to addressing the needs I have described. It is a Unix compatible file system with more than hierarchical naming (attribute based is the term they use). Its data model unfortunately has the important flaw of lacking closure (in it, names of objects are not themselves objects). In my upcoming discussion of the unnecessary lack of closure in hypertext products, notice that the arguments apply to the Semantic File System as well (and so I won't duplicate them here). === OS/400 === IBM's OS/400 employs a unified relational name space. The section of this paper entitled A System Should Reflect Rather than Mold Structure will cover its problems of forcing false structure.
Inadequate closure due to mandatory type checking is another source of difficulties for it. While users moan about these two unnecessary design flaws, the essence of the opinions AS/400 partisans have expressed to me has been that the unification of its name space is a great advantage that OS/400 has over Unix. I claim these users were right, and later in this paper will propose doing something about it. == Conclusion == While I spent most of this paper on why adding structure to information can be harmful, particularly when it is intended to be found by others sifting through large amounts of other information, this was purely because it is a harder argument than why deleting structure is harmful. My goal was not to be better at unstructured applications than keyword systems, or better at structured applications than the hierarchical and relational systems --- the goal is to be more flexible in allowing the user to choose how structured to be, while still being within a single name space. I claimed that multiple fragmented name spaces cannot match the power and ease of name spaces integrated with closure: closure makes a naming system far more powerful by increasing its ability to compound complex descriptions out of simpler ones. The strong points of this naming system's design are various forms of generalizing abstractions already known to the literature, for greater closure. == Acknowledgments == David P. Anderson and Clifford Lynch helped enormously in rounding out my education, and improving my paper. Their generosity with their time was remarkable. David P. Anderson was simply a great professor, and it was a privilege to work with him. Brian Harvey informed me that it wasn't too obvious to mention that an object store should be unified. Cimmaron Taylor provided me with many valuable late night discussions in the early stages of this paper. 
I would like to thank Bill Cody and Guy Lohman of the database group at the IBM Almaden Research Center for a wonderful learning experience. Vladimir Saveliev kept this file system going when others fell by the wayside. He started as the most junior programmer on the team, and through sheer hard work and dedication to excellence outshone all the other more senior researchers. Of course after some time he could no longer be considered a junior programmer. NOTE: See also the DARPA funded, but not endorsed, [[Txn-doc|Reiser4 Transaction Design Document]] and [[Reiser4|Reiser4 Whitepaper]]. == References == * 1. Blair, David C. and Maron, M. E. [http://portal.acm.org/citation.cfm?doid=3166.3197 Evaluation of Retrieval Effectiveness for a Full-Text Document-Retrieval System] Communications of the ACM v28 n3 Mar 1985 p289-299 * 2. Codd, E. F. [http://portal.acm.org/citation.cfm?id=77708 The Relational Model for Database Management: version 2] c1990 Addison-Wesley Pub. Co., not recommended as a textbook (Date's is better for that), but worthwhile if you want a long paper by Codd. Notice that he places greater emphasis on closure, and design methodology principles in general, than designers of other naming systems such as hypertext. * 3. Date, C.J. [http://portal.acm.org/citation.cfm?id=4198 An Introduction to Database Systems], 4th ed. Reading, Mass.: Addison-Wesley Pub. Co., c1986- Contains a well written substantive textbook sneer at the problems of hierarchical naming systems, and a well annotated bibliography. * 4. Curtis, Ronald and Larry Wittie [http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?isnumber=35714&arnumber=1695185 Global Naming in Distributed Systems] IEEE Software July 1984 p76-80 * 5. Feldman, Jerome A., Mark A. Fanty, Nigel H. Goddard and Kenton J. Lynne, [http://portal.acm.org/citation.cfm?id=42372.42378 Computing with Structured Connectionist Networks] Communications of the ACM, v31 Feb '88, p170(18) * 6. Fox, E. A., and Wu, H.
[http://portal.acm.org/citation.cfm?id=358466 Extended Boolean Information Retrieval], Communications of the ACM, 26, 1983, pp. 1022-1036 * 7. Gallant, Stephen I., [http://portal.acm.org/citation.cfm?id=42377 Connectionist Expert Systems], Communications of the ACM, v31 Feb '88, p152(18) * 8. Gates, Bill. Comdex '91 speech on [http://findarticles.com/p/articles/mi_m0REL/is_n11_v90/ai_9715919/ Information at Your Fingertips] available for $8 on videotape from Microsoft's sales department. * 9. Gifford, David K., Jouvelot, Pierre., Sheldon, Mark A., O'Toole, James W. Jr., [http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.17.4726 Semantic File Systems], Operating Systems Review Volume 25, Number 5, October 13-16, 1991. They demonstrated that extending Unix file semantics to include nonhierarchical features is useful and feasible. Unfortunately, their naming system lacks closure. * 10. Gilula, Mikhail. [http://portal.acm.org/citation.cfm?id=174888 The Set Model for Database and Information Systems], 1st Edition, c1994, Addison-Wesley, provides a Set Theoretic Database Model in which relational algebra is shown to be a special case of a more general and powerful set theoretic approach. * 11. [http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.23.4527 Joint Object Services Submission] (JOSS), OMG TC Document 93.5.1 * 12. Marchionini, Gary., and Shneiderman, Ben. [http://portal.acm.org/citation.cfm?id=619765 Finding Facts vs. Browsing Knowledge in Hypertext Systems] Computer, January 1988, p. 70 * 13. McAleese, Ray "Hypertext: Theory into Practice" edited by Ray McAleese, ABLEX Publishing Corporation, Norwood, NJ 07648 * 14. Messinger, Eli., Shoens, Kurt., Thomas, John., Luniewski, Allen [http://domino.watson.ibm.com/library/cyberdig.nsf/a3807c5b4823c53f85256561006324be/1e2deed787c18fbc85256593006f843c?OpenDocument Rufus: The Information Sponge] Research Report RJ 8294 (75655) August 13, 1991, IBM Almaden Research Center * 15. Metzler and Haas.
[http://portal.acm.org/citation.cfm?id=65943.65949 The Constituent Object Parser: Syntactic Structure Matching for Information Retrieval], Proceedings of the ACM SIGIR Conference, 1989, ACM Press * 16. Nelson, T.H. [http://www.eastgate.com/catalog/LiteraryMachines.html Literary Machines], self published by Nelson, Nashville, Tenn., 1981, did much to popularize hypertext; at the time of writing he has still not released a working product, though competitors such as HyperCard have done so with notable success. * 17. Mozer, Michael C. [http://www.eric.ed.gov/ERICWebPortal/custom/portlets/recordDetails/detailmini.jsp?accno=ED245694 Inductive Information Retrieval Using Parallel Distributed Computation], UCLA * 18. Pike, Rob and P.J. Weinberger ... The Hideous Name "AT&T Research Report" * 19. Pike, Rob., Presotto, Dave., Thompson, Ken., Trickey, Howard., Winterbottom, Phil. [http://plan9.bell-labs.com/sys/doc/names.html The Use of Name Spaces in Plan 9]. Plan 9 is an operating system intended to be the successor to Unix, and greater integration of its name spaces is its primary focus. * 20. Potter, Walter D. and Robert P. Trueblood, [http://portal.acm.org/citation.cfm?id=45937 Traditional, semantic, and hyper-semantic approaches to data modeling] v21 Computer '88 p53(11) * 21. Rijsbergen, C. J. Van, [http://www.dcs.gla.ac.uk/Keith/Preface.html Information Retrieval] - 2nd. ed., Butterworth and Co. Ltd., 1979, Printed in Great Britain by The Whitefriars Ltd., London and Tonbridge * 22. Salton, G. (1986) [http://portal.acm.org/citation.cfm?id=6149 Another Look At Automatic Text-Retrieval Systems], Communications of the ACM, 29, 648-656 * 23. Smith, J.M. and D.C. Smith, [http://portal.acm.org/citation.cfm?id=320546 Database Abstractions: Aggregation and Generalization], ACM Transactions on Database Systems, June 1977, pp. 105-133 ICS Report No. 8406 June 1984 * 24.
[http://www.win.tue.nl/~aeb/partitions/partition_types.html Partition types] by [mailto:aeb@cwi.nl Andries Brouwer], 2009-06-25

[[category:Reiser4]]

{{wayback|http://www.namesys.com/whitepaper.html|2006-11-13}}

Future Vision of ReiserFS: Name Spaces As Tools for Integrating the Operating System Rather Than As Ends in Themselves

By Hans Reiser, http://namesys.com, 6114 La Salle ave., #405, Oakland, CA 94611, email: reiser@namesys.com

== Abstract ==

For too long the file system has been semantically impoverished in comparison with database and keyword systems. It is time to change! The current lack of features makes it much easier to adopt the latest set theoretic models than the older models of relational algebra or hypertext, and the current FS syntax fits nicely into the newer model. The utility of an operating system is more proportional to the number of connections possible between its components than it is to the number of those components. Namespace fragmentation is the most important determinant of that number of possible connections between OS components. Unix at its beginning increased the integration of I/O by putting devices into the file system name space. This is a winning strategy: let's take the file system name space and, one missing feature at a time, eliminate the reasons why the filesystem is inadequate for what other name spaces are used for. Only once we have done so will the hobbles be removed from OS architects, or even OS conspiracies. Yet before doing that, we need a core architecture for the semantics to ensure we end up with a coherent whole. This paper suggests a set theoretic model for those semantics. The relational models would at times unacceptably add structure to information, the keyword models would at times delete structure, and purely hierarchical models would create information mazes.
Reworking their primitives is required to synthesize the best attributes of these models in a way that allows one the flexibility to tailor the level of structure to the need of the moment. The set theoretic model I propose has a syntax that is upwardly compatible with Linux, MacOS, and DOS file system syntax, as well as with the CORBA naming layer.

This is a planning document for the next major version of ReiserFS, that is, a description of vaporware. It is useful to ReiserFS users and contributors who want to know where we are going, and why we are building all sorts of strange optimizations into the storage layer (and especially to those who are willing to help shape the vision in the course of discussions on the {{listaddress}} mailing list...). Currently the storage layer for ReiserFS is working and useful as an everyday FS with conventional semantics. That storage layer is available as a GPL'd Linux kernel patch.

== Introduction ==

Many OS researchers have built hierarchical name spaces that innovate in their effect on the integration of the operating system (e.g. Plan 9 and its file system [Pike]). Relational and keyword researchers rightfully scorn hierarchical name spaces as 20 years behind the state of the art [Date], but pay little attention to integration of the operating system as a design objective in their own work, or as a possible influence on data model design. I won't go into that here. Limiting associations to single key words is an unnecessary restriction.

== A Naming System Should Reflect Rather than Mold Structure ==

The importance of not deleting the structure of information is obvious; few would advocate using the keyword model to unify naming. What can be more difficult to see is the harm from adding structure to information; some do recommend the relational model for unifying naming (e.g. OS/400).
By decomposing a primitive of a model into smaller primitives one can end up with a more general model, one with greater flexibility of application. This is the very normal practice of mathematicians, who in their work constantly examine mathematical models with an eye to finding a more fundamental set of primitives, in hopes that a new formulation of the model will allow the new primitives to function more independently, and thereby increase the generality and expressive power of the model. Here I break the relational primitive (a tuple is an unordered set of ordered pairs) into separate ordered and unordered set primitives. Relational systems force you to use unordered sets of ordered pairs when sometimes what you want is a simple unordered set.

Why should a naming system match rather than mold the structure of information? For systems of low complexity, the reasons are deeply philosophical, which means uncompelling. And for multiterabyte distributed systems?...

Reiser's Rule of Thumb #2: The most important characteristic of a very complex system is the user's inability to learn its structure as a whole. We must avoid adding structure, or guarantee that the user will be informed of all structure relevant to his partial information. Avoiding adding structure is both more feasible and less burdensome to the user.

Hierarchical, relational, semantic, and hypersemantic systems all force structure on information, structure inherent in the system rather than the information represented. If a system adds structure, and the user is trying to exploit partial knowledge (such as a name embodies), then it inevitably requires the user to learn what was added before he can employ his partial knowledge. With complex systems, the amount added is beyond the capacity of users to learn, and information is lost.

Example: <tt>"My name is Kali, your friendly technical support specialist for REGRES. Our system puts the Library of Congress online!
How may I help you."</tt>

George doesn't know Santa Claus' name: <tt>"I'm trying to find the reindeer chimneys christmas man, and I can't get your system to do it."</tt>

[[Image:Reindeer.jpg]]

FIGURE 1. Graphical representation of a typical simple unordered set that is difficult for relational systems.

Kali says: <tt>"OK, now let's define a query. '''is-a equals man''', that's easy. But reindeer? Is reindeer a property of this man?"</tt>

<tt>"Uh no. I wish I could remember the dude's name. I read this story about him a long time ago, and all I can remember is that he had something to do with reindeer and chimneys. The story is on-line, somewhere."</tt>

<tt>"Reindeer chimneys presents man, that's the sort of speech pattern I'd expect from a three year old."</tt> Kali corrects him. <tt>"Let's see if we can structure this properly. Is reindeer an '''instance-of''' of this man? A '''member-of''' of this man? It couldn't be a '''generalization''' of this man. Hmm..."</tt>

<tt>"No! It's not that complicated. They just have something to do with him."</tt>

<tt>"Pavlov would probably say you associate reindeer with this man, the way the unstructured mind of an animal thinks. But here in technical support we try to help our customers become more sophisticated. Is reindeer a property of this man?"</tt>

<tt>"No. Try '''propulsion-provider-for'''."</tt>

<tt>"Do you think that that was the schema the person who put the information in our system used?"</tt>

<tt>"No. Shoot. I can think of a dozen different columns it could be under. But what are the chances that the ones I think of are going to be the same as the ones the dude who put the information in used?"</tt>

Kali feels satisfaction. <tt>"Guess it can't be done, not if you can't structure your REGRES query properly.
I'll put you down in my log as a closed ticket, 190 seconds to resolution, not bad."</tt>

<tt>"A keyword system could handle reindeer chimneys christmas man."</tt> George grumbles as he stares in despair at his display. Unfortunately, the ''Library of Congress'' is only one of REGRES' many reference aids. George could spend his life at it, and he'd never learn its schema.

<tt>"But a keyword system would delete even necessary structure inherent to the information. It couldn't handle our other needs!"</tt> Kali says before she hangs up.

In addition to the searcher's difficulties, having to manufacture structure by specifying the column for reindeer also adds unnecessary cognitive load to the story author's indexing tasks.

== A Few of the Other Approaches to This Problem ==

There is lurking at the heart of my approach a subtle difference between my analysis of naming, and the analysis of at least some others. I started my research by systematically categorizing the different structures embodied by names, placing them into equivalency classes, and then picking one syntax out of each class of functionally equivalent naming structures, on the assumption that each of the equivalency classes has value. For example, I considered that languages sometimes convey structure by word endings (tags), and sometimes by word order, but while the syntax differs, the word order and word ending techniques are equivalent in their power to convey structure. In my analysis of the effect of word ordering I decided that either the ordering mattered, or it did not, and that was the basis for two different naming primitives. Others have instead studied the inherent structure of data, and then from that derived ways of naming. The hypersemantic system [Smith] [Potter] represents an attempt to pick a manageably few columns which cover all possible needs.
Generalization, aggregation, classification, and membership correspond to the is-a, has-property, is-an-instance-of, and is-a-member-of columns, respectively. The minor problem is that these columns don't cover all possibilities. They don't cover reindeer, presents, or chimneys for George's query. The major problem is that they don't correspond as closely as possible to the most common style of human thought, simple unordered association, and they require cognitive effort to transform.

The first response of relational database researchers to this is usually to ask: "Why not modify an existing relational database to contain an 'associated' column, put everything in that column, and it would be functionally equivalent to what you want?" This is like saying that you can do everything Pascal can do using TeX macros. (They are both Turing complete.) We don't design languages to simply be Turing complete, we design them to be useful. I have seen a colleague do in six lines of SQL (nonstandard SQL) a simple three-keyword unordered set that I do in 3 words plus a pair of delimiters, and that traditional keyword systems also handle easily. Doing simple unordered sets well is crucial for highly heterogeneous name spaces, and the market success of keyword systems in Internet searching is evidence of that. If you look at the structure of names in human languages, they are not all tuple structured, and to make them tuple structured might be to distort them.

I have merely discussed the burden of naming columns. Most relational systems also require the user to specify the relation name. If column naming is a burden, naming both the column and the relation is no less a burden. Many systems invest effort into allowing you to take the key that you know, and figure out all the relation names and columns that you might choose to pair with it. This is a good idea, but not as good as not imposing extraneous structure to begin with.
[Salton] can be read for devastating critiques of the document clustering system, but there is a worthwhile idea lurking within that system. Perhaps it is worthwhile to keep track of a small number of documents which are "close" to a given document. The document creator could be informed upon auto-indexing the document what other documents appear to be close to it, and asked to consider associating it with them. This is not within our current plan of work, but I don't reject it conceptually.

In summary, modularity within the naming system is improved by recognizing unordered grouping and ordering as two different functions that deserve separate primitives rather than being combined into a tuple primitive. The tuple is an unordered set of ordered pairs. There are other useful combinations of unordered grouping and ordering than that embodied by the relation, and the success of keyword systems suggests that a plain unordered set without any ordering at all is the most fundamental and common of them.

== Names as Random Subsets of the Information In an Object ==

A system may still be effective when its assumptions are known to be false. You may regard the above as an overstatement of the notion that we are neural nets, and sometimes our abstract systems deal with assumptions that are not true or false, but are somewhat true. After we are finished stating them in English they lose the delicate weighting possessed by the reality of the situation. Sometimes we find it easier to model without that weighting. Classical economics and its assumption of perfect competition is the best known example of an effective system based on assumptions known to be substantially false. Introductory economics classes usually spend several weeks of class time arguing the merits of building models on somewhat false assumptions. This paper will now use such a somewhat false model to convey a feel for why mandatory pairing of name components causes problems.
Assume the user's information from which he tries to construct a description will be some completely random subset of the information about the object. (Some of that information will be structural, and the structural fragments selected will be just as random as the rest.) Assume a user has 15 random clues of information selected from 300 pieces of information the system knows about some object. Assume the REGRES naming system requires that data be supplied in threesomes (perhaps column name, key name, relation name), and cannot use one member of a threesome without the other members of the threesome. Assume the ANARCHY naming system lacks this restriction, but does so at the cost that it can only use those 10 of the 15 information fragments which do not embody structure. Assume the 15 pieces of information the user has to construct a name with are statistically fully independent and equally likely (this is both substantially wrong, and unfair to REGRES, but...). Assume each clue has a selectivity of 100 (it divides the number of objects returned by 100).

Then ANARCHY has a selectivity of 100<sup>10</sup> = 10<sup>20</sup> = good.

REGRES has a selectivity of: 100<sup>(chance that the other two members of an object's threesome are possessed by the user × 15)</sup> = 100<sup>(9/300 × 8/300 × 15)</sup> = 1.06 = very bad.

While it is not true that the clues are fully independent, it is true that to the extent that they are not fully dependent, ANARCHY will gain in selectivity compared to REGRES. Attempting to quantify for any database the extent of the dependence would be a nightmare, and so this model assumes a substantial falsity, through which it is hoped the reader can see a greater truth.
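The arithmetic of this toy model can be checked directly. The sketch below is mine, not part of any ReiserFS code; the names and numbers are simply the assumptions stated above:

```python
# Toy selectivity model from the text: each clue divides the number of
# matching objects by 100 ("selectivity of 100").
CLUE_SELECTIVITY = 100
TOTAL_FACTS = 300       # facts the system knows about the object
USER_CLUES = 15         # random clues the user holds
USABLE_BY_ANARCHY = 10  # clues that do not embody structure

def anarchy_selectivity():
    # ANARCHY applies each of its 10 usable clues independently.
    return CLUE_SELECTIVITY ** USABLE_BY_ANARCHY

def regres_selectivity():
    # REGRES can only use a clue if the user also holds the other two
    # members of its threesome: probability (9/300) * (8/300) per clue.
    p_threesome = (9 / TOTAL_FACTS) * (8 / TOTAL_FACTS)
    return CLUE_SELECTIVITY ** (p_threesome * USER_CLUES)

print(anarchy_selectivity())           # 100**10, i.e. 10**20
print(round(regres_selectivity(), 2))  # about 1.06
```

The point survives the artificiality of the assumptions: the clues ANARCHY can use compound multiplicatively, while REGRES almost never gets to apply a clue at all.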
For databases of the lower heterogeneity and complexity that the relational model was designed for, the independence within a threesome can be small, and the ability to also employ the 5 of 15 fragments which are structural is often more important than the difficulty of guessing any structure added. There is an implicit assumption here that you are looking for information that others have structured, and this argument in favor of ANARCHY becomes much less strong without this assumption. I feel obligated to stress once again that I do not advocate low structure over high structure, but I do advocate having the flexibility to match the amount of structure to the needs of the moment. Only with such flexibility can one hope to use all of the 15 fragments that happen to be possessed.

== The Syntax In More Detail ==

What's needed is a naming system intended to reflect just the structure inherent in the information, whatever that structure might be, rather than restructuring the information to fit the naming system.

=== Orthogonal or Unoriginal Primitives and Features ===

There are many primitives that the ultimate naming system would include but which I will not discuss here: macros, OR, weight for subnames and AND-OR connectors [Fox], rules, constraints, indirection, links, and others. I have tried to select only those aspects in which my approach differs from the standard approach. Unifying the namespace does not require unifying automatic name generation, and those who read the [Blair] vs. [Salton] controversy likely understand my concluding that whatever the benefits might be of unifying automatic name generation, it is not feasible now, and won't be feasible for a long time to come. The names one can assign an object are kept completely orthogonal from the contents of the object in the implementation of this naming layer.
It is up to the owner of the object to name it, and it is up to him to use whatever combination of autonaming programs and manual naming best achieves his purpose. He may name it on object creation, and he may continually adjust its various names throughout its lifetime. See the section defining the "Key_Object primitive" for a discussion of why names should be thought of this way. Technically, object creation only requires the object be given a Storage_Key. In practice most users will, in the same act that creates the object, also associate the object with at least one name that will spare them from directly specifying the Storage_Key in hex the next time they make a reference to it. Applications implementing external name spaces can interact with the storage layer by referencing just the Storage_Key.

Namesys will provide a manual naming interface, and the API autonaming programs need to plug into. Companies such as Ecila will provide autonamers for various purposes. Ecila is implementing a program which scans remote stores, creates links to them in the unified name space, but leaves the data on the remote stores. Other programs may also be implemented to perform this general function. To be more specific, the Ecila search engine scans the web for documents in French, and uses the filesystem as an indexing engine. However, they are writing their engine to be a general purpose engine; they have sold support and the addition of extensions to it to other search engine companies, and it is open source. For now we are simply functioning as part of their engine, and the interface is by web browser; at some point we may be able to add their functionality to the namespace.

While the implementation of Microsoft's attempt to blur the distinction between the filesystem name space and the web namespace is one more of appearance than substance, it is surely the right thing to do for Linux as well in the long run.
We should simply make our integration one with substance and utility, rather than integrating mostly the look and feel. When the store is external to the primary store for the namespace, then stale names can be an issue with no clean resolution. That said, unification at just the naming layer is, in a real rather than ideal world, often quite useful, and so we have Internet search engines.

GUI based naming is beyond the scope of this paper, except to mention that it is common for GUI namespaces to be designed such that they are not well integrated with the other namespaces of the OS. They are often thought to necessarily be less powerful, but proper integration would make this untrue, as they would then be additional syntaxes, not substitutes. These additional syntaxes should possess closure within the general name space, and thereby be capable of finding employment as components of compound names like all the other types of names. The compound names should be able to contain both GUI and non-GUI based name components. Integration would make them simply the aspect of naming that applies to what is present in the visual cache of the screen, and to how to manage and display that cache most effectively.

=== Vicinity Set Intersection Definition (Also Called Grouping) ===

Suppose you have a set X of objects. Suppose some of these objects are associated with each other. You can draw them as connected in a graph. Let the vicinity of an object A be the set of objects associated with A. Let there be a set of query objects Q. Then the set vicinity intersection of Q is the set of objects which are members of all vicinities of the objects in Q. When thinking of this as a data model, it seems natural to use the term vicinity set intersection. When thinking of this syntactically, it seems natural to use the term "grouping", because it implies that the subnames are grouped together without the order of the subnames being significant.
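The definition can be stated in a few lines of code. This is a toy sketch of my own; the association table and object names are invented, and a real implementation would keep the associations in the storage layer's object headers:

```python
# Minimal sketch of vicinity set intersection ("grouping").
# Each object's vicinity is the set of objects associated with it.
associations = {
    "reindeer": {"santa-story", "zoology-text"},
    "chimneys": {"santa-story", "masonry-manual"},
    "presents": {"santa-story", "birthday-list"},
    "man":      {"santa-story", "biography"},
}

def vicinity(obj):
    return associations.get(obj, set())

def grouping(query_objects):
    """Objects that are members of all vicinities of the objects in Q."""
    it = iter(query_objects)
    result = set(vicinity(next(it)))
    for q in it:
        result &= vicinity(q)
    return result

# [reindeer chimneys presents man] resolves to the Santa Claus story:
print(grouping(["reindeer", "chimneys", "presents", "man"]))  # {'santa-story'}
```

Because set intersection is commutative and associative, the order in which the subnames are listed does not matter, which is exactly the property the grouping syntax is meant to express.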
There is exactly one data model primitive (set vicinity intersection) possessing exactly one syntax (grouping), and I rarely intend to distinguish data model primitive from syntax primitive (I can be criticized for this), and yet I use both terms for it; forgive me.

=== Synthesizing Ordering and Grouping ===

I am going to describe a toy naming system that allows focusing on how best to combine grouping and ordering into one naming system. This synthesis will contain the core features of the hierarchical, keyword, and relational systems as functional subsets. It consists of a few simple primitives, allowed to build on each other. It sets the discussion framework from which our project will over many years evolve a real naming system out of its current storage layer implementation.

Resolving the second component of an ordering is dependent on resolving the first, unlike in set theory. In set theory one can derive ordered set from unordered set, but because resolving the name of the second component depends on the first component one cannot do so in this naming system. For this reason it can well be argued that this naming system is not truly set theory based. Now that I have mentioned this difference I will start to call them grouping and ordering, rather than unordered and ordered set. These two primitives take other names as sub-names, and allow the user to construct compound names. Either the order of the subnames is significant (ordering), or it isn't (grouping), and thus we have the two different primitives.

Because I have myself found that BNFs are easier to read if preceded by examples, I will first list progressively more complex examples using the naming system, and then give the formal definitions. The examples, and the simplified syntax, use / rather than : or \, but this is of no moment.
Examples:

<tt>/etc/passwd</tt>

[[Image:Passwd.jpg]]

Ordering and grouping are not just better; file system upward compatibility makes them cheaper for unifying naming in OSes based on hierarchical file systems than a relational naming system would be. This approach is fully upwardly compatible with the old file system. Users should be able to retain their old habits for as long as they wish, engage in a slow comfortable migration, and incorporate the new features into their habits as they feel the desire. Elderly programs should be untroubled in their operation. Many worthwhile projects fail because they emphasize how much they wish to change rather than asking of the user the minimal collection of changes necessary to achieve the added functionality.

<tt>[dragon gandalf bilbo]</tt>

[[Image:Bilbo.jpg]]

FIGURE 3. Graphical representation of the ascii name on the left. Mr. B. Bizy is looking for a dimly remembered story (The Hobbit by Tolkien) to print out and take with him for rereading during the annual company meeting.

<tt>case-insensitive/[computer privacy laws]</tt>

[[Image:Syntax-barrier.jpg]]

FIGURE 4. Graphical representation of the ascii name on the left.

When one subname contains no information except relative to another subname, and the order of the subnames is essential to the meaning of the name, then using ordering is appropriate. This most commonly occurs when syntax barriers are crossed. This is when a single compound name makes a transition from interpreting a subname according to the rules of one syntax to interpreting it according to the rules of another syntax. Ordering is essential at the boundary between the name of the new syntax as expressed in the current syntax, and the name to be interpreted according to that new syntax. Some researchers use the term context rather than syntax. The pairing of a program or function name, and the arguments it is passed, is inherently ordered.
While that is usually the concern of the shell, when we use a variety of ordering functions to sort Key_Objects of different types it affects the object store. In this example the ordering serves as a syntax barrier. Case-insensitive is the unabbreviated name of a directory that ignores the distinction between upper and lower case. For Linux compatibility this naming layer is case sensitive by default, even though I agree with those who think that it would be better were it not.

<tt>[my secrets]/[love letter susan]</tt>

[[Image:My-secrets.jpg]]

FIGURE 5. Graphical representation of the ascii name on the left.

Devhuman (that's the account name he chose) is the company's senior programmer. Six years ago he wrote a love letter to Susan, which he put in his read-protected secrets directory. (He never found the nerve to send it to her.) He's looking for it so he can rewrite it, and then consider sending it. Security is a particular kind of syntax barrier (you have to squint a bit before you can see it that way). Here the ordering serves as a security barrier. (He certainly wouldn't want anyone to know that an object owned by him with attributes love letter susan existed.)

<tt>[subject/[illegal strike] to/elves from/santa document-type/RFC822 ultimatum]</tt>

[[Image:Ultimatum.jpg]]

FIGURE 6. Graphical representation of the search for santa's ultimatum.

Devhuman knows his object store cold. He is looking for something he saw once before, he knows that it was auto-named by a particular namer he knows well (perhaps one whose functionality is similar to the classifier in [Messinger]), and he knows just what categorizations that namer uses when naming email. Still, he doesn't quite remember whether the word 'ultimatum' was part of the subject line, the body, or even was just elvish manual supplementation of the automatic naming. Rather than craft a query carefully specifying what he does and does not know about the possible categorizations of ultimatum, he lazily groups it.
If Devhuman's object store is implemented using this naming system with good style, someone less knowledgeable about the object store would also be able to say:

<tt>[santa illegal strike ultimatum elves]</tt>

and perhaps get some false hits as well as the desired email (instead of finding mail from santa perhaps finding the elvish response). Notice that if you delete the 'illegal' and 'ultimatum' to get <tt>[subject/strike to/elves from/santa document-type/RFC822]</tt> the query is structurally equivalent to a relational query. Many authors (e.g. semantic database designers) have written papers with good examples of standard column names which might be worth teaching to users. So long as they are an option made available to the user rather than a requirement demanded of the user, the increased selectivity they provide can be helpful.

<tt>[_is-a-shellscript bill]</tt>

[[Image:Pruner.jpg]]

FIGURE 7. Graphical representation of the ascii name on the left.

This name finds all shellscripts associated with bill. Names preceded by _ are pruners. Pruners are analogous to the predicate evaluators of relational database theory. If you have read papers distinguishing between recognition and retrieval, pruners are a recognition primitive. They are passed a list of objects, and return a subset of that list which matches some criteria. They are a mechanism appropriate for when a nonlinear search method that can deliver the desired functionality is either impossible, or not supported by existing indexes. There are many useful names for which we cannot do better than linear time search algorithms (perhaps simply as a result of incomplete indexing). _is-a-shellscript checks each member of its list to see if it is an executable object containing solely ascii. The user can use it just like any other Key_Object within an association; it will prune the results of the grouping. Since set intersections are commutative, its order within the grouping has no meaning, and optimizers are free to rearrange it.
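As a sketch of the pruner idea (the object metadata and names here are invented for illustration; a real _is-a-shellscript would inspect the object itself), a pruner is just a per-member predicate applied to a grouping's result:

```python
# Sketch of a pruner: a per-object predicate applied to a grouping's
# result. "_is-a-shellscript" is modeled as a check on object metadata.
objects = {
    "deploy.sh":  {"executable": True,  "ascii": True,  "tags": {"bill"}},
    "report.doc": {"executable": False, "ascii": False, "tags": {"bill"}},
    "run.sh":     {"executable": True,  "ascii": True,  "tags": {"alice"}},
}

def is_a_shellscript(name):
    meta = objects[name]
    return meta["executable"] and meta["ascii"]

def prune(candidates, predicate):
    # A pruner must judge each member independently of the others, so
    # an optimizer may apply pruners in any order.
    return {name for name in candidates if predicate(name)}

bills_stuff = {n for n, m in objects.items() if "bill" in m["tags"]}
print(prune(bills_stuff, is_a_shellscript))  # {'deploy.sh'}
```

The independence requirement in the comment is the same one the text imposes: a predicate over single members commutes with set intersection, so the optimizer may reorder it freely.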
=== The Formal Definitions ===

{| border=1
| <Object Name> ::= ||
<pre>
<Grouping> |
<Ordering> |
<Key_Object> |
<Storage_Key> |
<Orthogonal and Unoriginal Primitives I Will Not Define Here> |
;
</pre>
|}

See the section listing orthogonal and unoriginal primitives for a discussion of what primitives I left out of the definitions of this grammar that are necessary to a real world working system. The name resolver has a method for converting all of the primitives into '''<Storage_Keys>''', and when processing compound names it first converts the subnames into '''<Storage_Keys>''', though an object may have null contents and serve purely to embody structure. This allows the use, as a component of a grouping or ordering, of anything for which anyone can invent a way of allowing the user to find an '''<Object Name>''', together with a method for the resolver to convert the '''<Object Name>''' into a '''<Storage_Key>'''. In a word, closure. Extensible closure.

Compound names are interpreted by first interpreting the subnames that they are constructed from. At each stage of subname interpretation an '''<Object Name>''' is converted into a '''<Storage_Key>''' for the object that it is resolved to. The modules that implement the grouping and ordering primitives do not interpret the subnames; they merely pass them to the naming system, which returns the '''<Storage_Key>'''s they resolve to.

It was a long discussion which led to the use of storage keys rather than objectids. A storage key differs from an objectid in that it gives the storage layer directions as to where to try to locate the object in the logical tree ordering of the storage layer. If the logical location changes, then in the worst case we leave a link behind, and get an extra disk access like we get with an inode.
(Inode numbers are functionally objectids.) In the better case, the repacker eventually comes along, and changes all references by key to the new location, at least for all objects that have not given their key to external naming systems the repacker cannot repack. A '''<Storage_Key>''' is assigned by the system at object creation, and serves the purpose of allowing the system to concisely name the object, and provide hints to the storage layer about which objects should be packed near each other. The user does not directly interact with the '''<Storage_Key>''' any more often than C programmers hardcode pointers in hex. The packing locality of keys may be redefined.

== The Primitives ==

<Key_Object>

A description of the contents of an object using the syntax of the current directory. For objects used to embody keywords this may be the keyword in its entirety. If it contains spaces, etc. it must be enclosed in quotes. Note that making it easy for third parties to add plug-in directory types is part of Namesys's current contract with Ecila. Ecila wants space efficient directories suitable for use in implementing a term dictionary and its postings files for their Internet search engine.

Example: [reindeer chimneys presents man]

In this example 'presents', 'reindeer', 'chimneys', and 'man' are the contents of objects associated with the Santa Claus story. Each of them is searched for by contents, and then when found they are converted into their Storage_Keys, and then the grouping algorithm is fed their four Storage_Keys. The grouping module then looks in the object headers of the four objects, gets the four sets of objects the Key_Objects group to, and performs a set intersection.
Besides greater closure, another advantage of storing Key_Objects as objects is that non-ascii Key_Objects and ordering functions can be implemented as a layer on top of the ascii naming system, allowing the user to interact with the naming system by pressing hyperbuttons, drawing pictures, making sounds, and supplying other non-ascii Key_Objects that the higher layers convert into Storage_Keys. There are endless content description techniques. If the directory owner supplies an ordering function for the Key_Objects in a directory, one can generate a search index for the directory using a directory plug-in which is fully orthogonal to the ordering function, though perhaps slower in some cases than one that is tailored for the ordering function. Users will find it easier to write ordering functions than index creation objects, and will not always need the speed of specialized indexes. We will need one ordering function for ascii text, another for numbers, another for sounds, perhaps someday one even for pictures of faces (perhaps to be used by a law enforcement agency constructing an electronic mug book, or a white pages implementation), etc. No system designer can provide all the different and sometimes esoteric ordering functions which users will want to employ. What we can do is create a library of code, from which users can construct their own ordering functions and their own directory plug-ins, and this is the approach we are taking on behalf of Ecila. For an Internet search engine one wants what is called a postings file, which is like a directory in that there is no need to support a byte offset, and one frequently wants to efficiently perform insertions into it. <Grouping> ::= [<Unordered List>] ; <Unordered List> ::= <Unordered List> <Unordered List> | <Object Name> | <Pruner> ; <Pruner> ::= _<Object Name> A <Grouping> is a list of object names and pruners whose order has no meaning. 
Every object has a list of objects it groups to (associates with, in neural network idiom) in its object header. A grouping is interpreted by performing a set intersection of those lists for every object named in the grouping. In the sense of the data model, a grouping is interpreted by performing a set vicinity intersection. Grouping is not transitive: [A] => B and [B] => C does not imply [A] => C, though it does imply that [[A]] => C. A pruner is an <Object Name> which has been preceded with an _ to indicate that the object described should be passed a list of objects named by the rest of the grouping, executed, and it will return a subset of the list it was passed. Whether a member of the set is in the returned subset must be fully independent of what the other members of the set were, or else the results become indeterminate after application of a query optimizer, as with an optimizer in use there is no guarantee of the order in which the pruners are applied. <Ordering> ::= <Object Name>/<Object Name> | <Object Name>/<Custom Programmed Syntax> <Custom Programmed Syntax> ::= Varies, provides extensibility hook. An ordering is a pairing of names, with the order representing information. The first component of the ordering determines the module to which the second component is passed as an argument. In contrast, a grouping first converts all subnames to Storage_Keys by looking through the same current directory for all of them in parallel, and then does its set intersection with the subdescriptions already resolved. Example: In resolving [my secrets] / [love letter susan] the system would look for the objects with contents my and secrets, find both of them, and do a set intersection of all of the objects those two objects both group to (are associated with). This will allow it to find the [my secrets] directory, inside of which it will look for the three objects love, letter, and susan. 
It will then extract from their object headers the sets of objects those three words ('love', 'letter', and 'susan') group to, and do a set intersection which will find the desired letter. The desired letter is not necessarily inside the [my secrets] directory, though in this case it probably is. A directory is an object named by the first component of an ordering, to which the second component is passed, and which returns a set of Storage_Keys. One can in principle use different implementations of the same directory object without impacting the semantics and only affecting performance, as is often done in databases. There are flavors of directories: Custom programmed directories, aka filters, are any executable program that will return a Storage_Key when executed and fed the second component as an argument. They provide extensibility. (They are the ordered counterpart of pruners.) Another term for them is filter directories. Custom programmed directories whose name interpretation modules aren't unique to them will contain just the name of the module (filter), plus some directory dependent parameters to be passed to the module. It should be considered merely a syntax barrier directory, and not a fully custom programmed directory, if those parameters include a reference to a search tree that the module operates on, and if that search tree adheres to the default index structure. The connotations conveyed by the term 'filter' of there being an original which is distorted are not always appropriate, but in honesty this is not an issue about which we deeply care. Syntax barrier directories allow you to describe the contents of the objects they contain with a syntax different from their parents'. Except for being sorted by a different ordering function, the indexes of syntax barrier directories are standard in their structure, and use a standard index traversal module. The index traversal module is ordering function independent. 
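The [my secrets] / [love letter susan] resolution described above can be sketched as follows. This is a toy model in which groupings are set intersections and an ordering hands its second component to the directory found by the first; every structure and Storage_Key value here is invented for illustration:

```python
# Association sets in the current (outer) directory -- invented data.
outer = {"my": {10, 11}, "secrets": {10}}
# Directory object 10 is [my secrets]; its own association sets.
inner = {"love": {42, 43}, "letter": {42}, "susan": {42, 44}}
directories = {10: inner}

def grouping(assoc, subnames):
    # A grouping is a set intersection over the named association sets.
    return set.intersection(*(assoc[n] for n in subnames))

def ordering(first, second):
    # Resolve the first component in the current directory...
    (dir_key,) = grouping(outer, first)
    # ...then pass the second component to that directory's module.
    return grouping(directories[dir_key], second)

# Finds the desired letter, Storage_Key 42 in this toy store.
print(ordering(["my", "secrets"], ["love", "letter", "susan"]))  # {42}
```

Note how the two primitives compose: the subname [my secrets] is fully resolved to a Storage_Key before the directory it names is asked to interpret [love letter susan].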
There must be an ordering function for every <Key_Object> employed within a given syntax barrier directory. By contrast, a <Custom Programmed Syntax> could be anything which the syntax module somehow finds an object with, possibly even creating the object in order to be able to find it. To cross a security barrier directory the user must use an ordered pair of names with the security barrier as the first member of the pair, and he must satisfy the security module of the secured directory. A security barrier directory may be both a security and a syntax barrier directory, or the security barrier directory may share the syntax module of its parents. Fully standard directories are those built using the default directory module, and adding structure is their only semantic effect. There is an aspect of customization which is beyond the scope of this paper, in which one customizes the items employed by the storage layer to implement files and directories. That is, the storage of the files and directories are implemented by composing them of items, and these items have different types. We are now creating the code for packing and balancing arbitrary types of items using item handlers and object oriented balancing code, so as to make it easier to extend our filesystem. === Ordering can be implemented more efficiently than grouping === The set intersections performed in evaluating the grouping primitive are normally much more expensive computationally than performing the classical filesystem lookup. Imposing excess structure on one's data does not just at times reduce the cost of human thinking :-), it can be used to reduce the cost of automated computation as well. When the cost to a user of learning structure is less important than the burden on the machine, use of highly ordered names is often called for. === The Motivation for Different Syntactic Treatment of Ordering and Grouping, and Some of the Deeper Issues Revealed by the Difference. 
=== An important difference between grouping and ordering affects syntax. It allows us to represent an ordering with a single symbol ('/') placed between the pair, but requires two symbols ('[' and ']') for each grouping. Imagine using < and > as a two symbol delimiter style alternative notation for ordering: <<father-of mother-of> sister-of> = <father-of <mother-of sister-of>> = <father-of mother-of sister-of> = father-of/mother-of/sister-of All of the expressions above are equivalent in referring to the paternal great aunt of the person who is the current context. The ones using nested pairs of symbols to enclose pairs of subnames imply a false structure that requires the user to think to realize the first two expressions are equivalent. The fourth is the notation this naming system employs. Grouping is different: Fast Acting Freddy is looking through the All-LA Shopping Database for a single store with black reebok sneakers, a green leather jacket, and a red beret so that he can dress an actor for a part before the director notices he forgot all about him. [[black reebok sneakers] [green leather jacket] [red beret]] is not equivalent to [black reebok sneakers green leather jacket red beret] which equals [red sneakers black reebok jacket green beret] Ordering is not algebraically commutative (father-of/mother-of is not equivalent to mother-of/father-of). Groupings are algebraically commutative. ([large red] = [red large]) == Style == As a general principle, a more restricted system can avoid requiring the user to repeatedly specify the restrictions, and if the user has no need to escape the restrictions then the restricted system may be superior. This is why "4GLs", which supply the structure for the user's query, are useful for some applications. They are typically implemented as layers on top of unrestricting systems such as this one. This paper has addressed issues surrounding finding information, particularly when the user's clues are faint. 
When supporting other user goals, such as exploring information, adding structure through substantial use of ordering can be helpful. [Marchionini][McAleese]. When the user goal is finding, one should assume that of all the fragments of information about an object, the user has some random subset of them. The goal is to allow the user to use that random subset in a name, whatever that subset might be. Some of that subset will be structural fragments. While requiring the user to supply a structure fragment is as foolish as requiring him to supply any other arbitrary fragment, allowing him to is laudable. In the best of all worlds the object store would incorporate all valid possible structurings of Key_Objects. The difficulty in implementing that is obvious. [Metzler and Haas] discuss ways of extracting structure from English text documents, and why one would want to be able to use that structure in retrievals. Unfortunately, there is an important difference between representing the structure of an English language sentence in a way that conveys its meaning, and representing it in a way that allows it to be found by someone who knows only a fragment of its semantic content. I doubt the wisdom of trying to advocate the use of more than essential structure in searching. You can allow users to avoid false structure; you cannot force them to. It is important to teach those creating the structure that if they group a personnel file with sex/female they should also group it with female. Type checking can impose structure usefully. Its implementation can enhance or reduce closure, depending on whether it is done right. === When To Decompound Groupings === There are dangers in excessive compounding of compound groupings analogous to those of excessive ordering. Let's examine two examples of compound groupings, both of which are valid both semantically and syntactically. 
One of them can be "decompounded" with moderate information loss, and the other loses all meaning if decompounded. Example: Finding a loquacious Celtic textbook salesman who told you in excruciating detail about how he was an ordinance researcher until one day he went to a Grateful Dead concert. [[Celtic textbook salesman] [ordinance researcher]] vs. [celtic textbook salesman ordinance researcher] These two phrasings of the same query are not equivalent, but they are "close." Our second example is the one in which Fast Acting Freddy tries to find a suspect by the objects he is associated with: [[black reebok sneakers] [green leather jacket] [red beret]] vs. [black reebok sneakers green leather jacket red beret] These two are not at all "close." The difference between the two examples of inequivalencies is that the subdescriptions within the second example describe objects whose existence within the object store independent of the store described is worthwhile. The first does not, and it is more reasonable to try to design so that the "decompounded" version of the query is used. False hits will occur, but for large systems that's better than asking the user to learn structure. A higher level user interface might choose to present only one level to the user at a time, and then once the user confirms that a subdescription has resolved properly it would let him incorporate it into a higher level description. There might be 6 models of [black reebok sneakers], and Fast Acting Freddy should have the opportunity to click his mouse on the exact model, and have the interface substitute that object for his subdescription. Using such an interface an advanced user might simultaneously develop several subdescriptions, refine and resolve them, and then use the mouse to draw lines connecting them into a compound grouping. Closure makes it possible for that to work. 
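Freddy's example can be made concrete with a toy store showing why the compound and decompounded forms are not at all "close". All data here is invented; object 4 stands for a mismatched item (red sneakers, green beret, and so on) that makes the individual words ambiguous:

```python
# Word -> Storage_Keys of the clothing objects it groups to (invented).
word_groups = {
    "black": {1, 4}, "reebok":  {1}, "sneakers": {1, 4},
    "green": {2, 4}, "leather": {2}, "jacket":   {2, 4},
    "red":   {3, 4}, "beret":   {3},
}
# Clothing object -> stores it groups to (from its object header).
obj_groups = {1: {"StoreA", "StoreB"}, 2: {"StoreA"},
              3: {"StoreA", "StoreC"}, 4: {"StoreC"}}

def grouping(assoc, names):
    return set.intersection(*(assoc[n] for n in names))

def compound(subgroupings):
    # Resolve each inner grouping to a single clothing object first,
    # then intersect the store sets those objects group to.
    stores = []
    for inner in subgroupings:
        (obj,) = grouping(word_groups, inner)
        stores.append(obj_groups[obj])
    return set.intersection(*stores)

hits = compound([["black", "reebok", "sneakers"],
                 ["green", "leather", "jacket"],
                 ["red", "beret"]])
flat = grouping(word_groups, ["black", "reebok", "sneakers", "green",
                              "leather", "jacket", "red", "beret"])
print(hits, flat)  # {'StoreA'} set() -- the flat form finds nothing
```

The compound form finds the one store carrying all three garments; the flat form asks for a single object grouping to all eight words, and no such object exists.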
== Examples of Creating Associations == <- creates an association between all of the objects on the left hand side and all of the objects on the right hand side. A - B is the set difference of A and B, and it resolves to the set of objects in A except for those that are in B. A & B resolves to the set intersection of A and B, the objects that are both in A and B. [A B] = [A] & [B], by definition. animal <- (lives, moves) mammal <- ([animal], animal, `warm blooded') cat <- ([mammal], hypernym/mammal, mammal, meronym/fur, fur, meronym/whiskers, whiskers, hypernym/quadruped, quadruped, capability/purr, purr, capability/meow, meow) Basil <- (owner/Nina, Nina, [siamese], siamese, clever, playful, brave/overly, brave, 'toilet explorer') bag <- ([container], container, consists-of/`highly flexible material', `highly flexible material') backpack <- ([bag], shoulderstrap/quantity/2, shoulderstrap, college-student, holonym/backpacker, meronym/shoulderstrap) mould <- ([fungi] - green/not, furry, `grows on'/surfaces/moist, `killed by'/chlorine) fungi <- ([plant], plant, leaves/no, flowers/no, green/not) bird <- ([vertebrate], vertebrate, flies, feathers) penguin <- ([bird] - flies, bird, hypernym/bird, swims, Linux, [Linux (mascot, symbol)]) siamese <- ([cat], cat, hair/short, short-hair) Notice how we don't associate siamese with short despite associating it with hair/short, but we do associate Basil with Nina as well as with owner/Nina. small <-0 little The above means that small and little are synonyms, and are to be treated as 0 distance away from each other for vicinity calculation purposes. In other, traditional Unix, words, they are hardlinked together. Creating a serious ontology is not our field or task, but it is worth doing. The reader is referred to WordNet (free), and Cyc by Doug Lenat (proprietary). While we will focus on implementing primitives that allow for creating better ontologies, we are happy to work with persons interested in contributing or porting an ontology. 
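A toy sketch of these association examples, with a hypothetical `associate` helper standing in for <-, and Python's own set difference and union in place of the - and [] notation. The compound name [bird] is simplified here to a plain lookup of bird's association set, which only approximates the grouping semantics:

```python
# Toy association store: object -> set of objects it groups to.
# `associate` plays the role of `<-`; the data mirrors the examples above.
associations = {}

def associate(target, sources):
    associations.setdefault(target, set()).update(sources)

associate("animal", {"lives", "moves"})
associate("mammal", {"animal", "warm blooded"})
associate("bird", {"vertebrate", "flies", "feathers"})
# penguin <- ([bird] - flies, bird, swims): inherit bird's associations
# minus flies, then add bird and swims.
associate("penguin", (associations["bird"] - {"flies"}) | {"bird", "swims"})

print(sorted(associations["penguin"]))
# ['bird', 'feathers', 'swims', 'vertebrate']
```

The set difference is what lets penguin inherit everything associated with bird without being associated with flies.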
== Other Projects Seeking To Increase Closure In The OS == === AT&T's Plan 9 === [Plan 9] is being produced by the original authors of Unix at AT&T research labs. It has influenced CORBA, and Linux's /proc is a direct steal from it. Their major focus is on integration. Their major trick for increasing integration is unifying the name space. Name spaces integrated into the Plan 9 file system include the status, control, virtual memory, and environment variables of running processes. They have a hierarchical analog to what the relational culture calls constructing views, which the Plan 9 culture calls context binding. === Microsoft's Information At Your Fingertips === Plan 9 ignores integration of application program name spaces, concentrating on OS oriented name spaces. Microsoft's "Information at Your Fingertips" name space integration effort appears to be taking the other approach, focusing on integrating the name spaces of the various Microsoft applications via OLE and Structured Storage. The application group at Microsoft has long been better staffed and funded than the OS group, and FS developers have long preferred to simply ignore the needs of application builders generally. The primary semantic disadvantages of Microsoft's approach are primitives selected with insufficient care, a lack of closure, and the use of an object oriented rather than set oriented approach in both naming syntax and data model. Realistically, one can say that folks within Microsoft have often made statements favoring name space integration, and in various areas have successfully executed on it, but on the whole I rather suspect that the lack of someone in marketing making a business case for $X in revenue resulting from name space integration has crippled name space integration work at commercial OS producers generally, including MS. ==== Internet Explorer ==== Internet Explorer attempts to unify the filesystem and Internet namespaces. 
At the time of writing, the unity is so superficial, with so little substance, that I would describe it as having the look and feel of integration without most of the substance. Perhaps this will change. ==== Microsoft's Well Known Performance Difficulties ==== Despite having many of the leading names in the industry on their payroll, they have somehow managed to create a file system implementation with performance so terrible that it is, for the Unix customer base, a significant consideration contributing to hesitation in moving to NT. It may well have the worst performance of any of the major OS file systems. Their implementation of OLE's structured storage offers extremely poor performance, and their excuse that it is due to the incorporation of transaction concepts into their design is just a reminder that they did a poor job at that as well. They managed to implement something intended to store small objects within a file, and implemented it such that it still suffers from 512-byte granularity problems, problems that they try to somewhat overcome by encouraging the packing of several objects within "storages" at horrible kludge cost. === Storage Layers Above the FS: A Sure Symptom The FS Developer Has Failed === When filesystems aren't really designed for the needs of the storage layers above them, and none of them are, not Microsoft's, not anybody's, then layering results in enormous performance loss. The very existence of a storage layer above the filesystem means that the filesystem team at an OS vendor failed to listen to someone, and that someone was forced to go and implement something on their own. You just have to listen to one of these meetings in which some poor application developer tries to suggest that more features in the FS would be nice; I heard one at a nameless OS vendor. 
The FS team responds that disks are cheap, small object storage isn't really important, we haven't changed the disk layout in 10 years, and changing it isn't going to fly with the gods above us about whom we can do nothing. At these meetings you start to understand that most people who go into filesystem design are persons who didn't have the guts to pursue a more interesting field in CS. There is a sort of reverse increasing returns effect that governs FS research, in which the more code becomes fixed on the current APIs, the more persons in the field react with fear to any thought of the field of FS semantics being other than a dead research topic, the less research gets done, and the fewer persons of imagination see a reason to enter the field. Every time one vendor gets a little forward in adding functionality, the other vendors go on a FUD campaign about it breaking standards and therefore being dangerous for mission critical usage. This is a field in which only performance research is allowed, and every other aspect is simply dead. Namesys seeks to raise the dead, and is willing to commit whatever unholy acts that requires. There is no need for two implementations of the set primitive, one called directories, the other called a file with streams, each having a different interface. File systems should just implement directories right, give them some more optional features, and then there is no need at all for streams. If you combine allowing directory names to be overloaded to also be filenames when acted on as files, allowing stat data to be inherited, allowing file bodies to be inherited, and implementing filters of various kinds, then in the event that the user happens to need the precise peculiar functionality embodied by streams, they can have it by just configuring their directory in a particular way. There was a lengthy Linux-kernel thread on this topic which I won't repeat in more detail here. 
The tree architecture of the storage layer of this FS design will lend itself to a distributed caching system much more effectively than the Microsoft storage layer, in part due to its ability to cache not just hits and misses of files, but to cache semantic localities (ranges). For more on this topic see later in this paper. === Rufus === The Rufus system [Messinger et al.] indexes information while leaving it in its original location and format. While it does allow the user to create a unified name space, it does not choose to integrate that name space into the operating system. Even so, it is immensely useful in practice, and strongly hints at what the OS could gain if it had a more than hierarchical name space with a data model oriented towards what [Messinger] calls "semi-structured information," such as you find in the RFC822 format for email. When you have 7000 pieces of mail, and linear searching the mail with a utility like grep takes 10 minutes, it is nice to be able to quickly keyword search via inverted indexes for the mail whose from: field contains billg and that has the words "exclusive" and "bundling" in the body of the message, as you hurriedly search for an old email just before an appearance at court. === Semantic File System === The Semantic File System comes closest to addressing the needs I have described. It is a Unix compatible file system with more than hierarchical naming (attribute based is the term they use). Its data model unfortunately has the important flaw of lacking closure (in it, names of objects are not themselves objects). In my upcoming discussion of the unnecessary lack of closure in hypertext products, notice that the arguments apply to the Semantic File System (and so I won't duplicate them here). === OS/400 === IBM's OS/400 employs a unified relational name space. The section of this paper entitled A System Should Reflect Rather than Mold Structure will cover its problems of forcing false structure. 
Inadequate closure due to mandatory type checking is another source of difficulties for it. While users moan about these two unnecessary design flaws, the essence of the opinions AS/400 partisans have expressed to me has been that the unification of its name space is a great advantage that OS/400 has over Unix. I claim these users were right, and later in this paper will propose doing something about it. == Conclusion == While I spent most of this paper on why adding structure to information can be harmful, particularly when it is intended to be found by others sifting through large amounts of other information, this was purely because it is a harder argument than why deleting structure is harmful. My goal was not to be better at unstructured applications than keyword systems, or better at structured applications than the hierarchical and relational systems --- the goal is to be more flexible in allowing the user to choose how structured to be, while still being within a single name space. I claimed that multiple fragmented name spaces cannot match the power and ease of name spaces integrated with closure: closure makes a naming system far more powerful by increasing its ability to compound complex descriptions out of simpler ones. The strong points of this naming system's design are various forms of generalizing abstractions already known to the literature, for greater closure. == Acknowledgments == David P. Anderson and Clifford Lynch helped enormously in rounding out my education, and improving my paper. Their generosity with their time was remarkable. David P. Anderson was simply a great professor, and it was a privilege to work with him. Brian Harvey informed me that it wasn't too obvious to mention that an object store should be unified. Cimmaron Taylor provided me with many valuable late night discussions in the early stages of this paper. 
I would like to thank Bill Cody and Guy Lohman of the database group at the IBM Almaden Research Center for a wonderful learning experience. Vladimir Saveliev kept this file system going when others fell by the wayside. He started as the most junior programmer on the team, and through sheer hard work and dedication to excellence outshone all the other more senior researchers. Of course after some time he could no longer be considered a junior programmer. NOTE: See also the DARPA funded, but not endorsed, [[Txn-doc|Reiser4 Transaction Design Document]] and [[Reiser4|Reiser4 Whitepaper]]. == References == * 1. Blair, David C. and Maron, M. E. [http://portal.acm.org/citation.cfm?doid=3166.3197 Evaluation of Retrieval Effectiveness for a Full-Text Document-Retrieval System] Communications of the ACM v28 n3 Mar 1985 p289-299 * 2. Codd, E. F. [http://portal.acm.org/citation.cfm?id=77708 The Relational Model for Database Management: version 2] c1990 Addison-Wesley Pub. Co. Not recommended as a textbook (Date's is better for that), but worthwhile if you want a long paper by Codd. Notice that he places greater emphasis on closure, and design methodology principles in general, than designers of other naming systems such as hypertext. * 3. Date, C.J. [http://portal.acm.org/citation.cfm?id=4198 An Introduction to Database Systems], 4th ed. Reading, Mass.: Addison-Wesley Pub. Co., c1986. Contains a well written substantive textbook sneer at the problems of hierarchical naming systems, and a well annotated bibliography. * 4. Curtis, Ronald and Larry Wittie [http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?isnumber=35714&arnumber=1695185 Global Naming in Distributed Systems] IEEE Software July 1984 p76-80 * 5. Feldman, Jerome A., Mark A. Fanty, Nigel H. Goddard and Kenton J. Lynne, [http://portal.acm.org/citation.cfm?id=42372.42378 Computing with Structured Connectionist Networks] Communications of the ACM, v31 Feb '88, p170(18) * 6. Fox, E. A., and Wu, H. 
[http://portal.acm.org/citation.cfm?id=358466 Extended Boolean Information Retrieval], Communications of the ACM, 26, 1983, pp. 1022-1036 * 7. Gallant, Stephen I., [http://portal.acm.org/citation.cfm?id=42377 Connectionist Expert Systems], Communications of the ACM, v31 Feb '88, p152(18) * 8. Gates, Bill. Comdex '91 speech on [http://findarticles.com/p/articles/mi_m0REL/is_n11_v90/ai_9715919/ Information at Your Fingertips] available for $8 on videotape from Microsoft's sales department. * 9. Gifford, David K., Jouvelot, Pierre., Sheldon, Mark A., O'Toole, James W. Jr., [http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.17.4726 Semantic File Systems], Operating Systems Review Volume 25, Number 5, October 13-16, 1991. They demonstrated that extending Unix file semantics to include nonhierarchical features is useful and feasible. Unfortunately, their naming system lacks closure. * 10. Gilula, Mikhail. [http://portal.acm.org/citation.cfm?id=174888 The Set Model for Database and Information Systems], 1st Edition, c1994, Addison-Wesley. Provides a Set Theoretic Database Model in which relational algebra is shown to be a special case of a more general and powerful set theoretic approach. * 11. [http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.23.4527 Joint Object Services Submission] (JOSS), OMG TC Document 93.5.1 * 12. Marchionini, Gary., and Shneiderman, Ben. [http://portal.acm.org/citation.cfm?id=619765 Finding Facts vs. Browsing Knowledge in Hypertext Systems] Computer, January 1988, p. 70 * 13. McAleese, Ray "Hypertext: Theory into Practice" edited by Ray McAleese, ABLEX Publishing Corporation, Norwood, NJ 07648 * 14. Messinger, Eli., Shoens, Kurt., Thomas, John., Luniewski, Allen [http://domino.watson.ibm.com/library/cyberdig.nsf/a3807c5b4823c53f85256561006324be/1e2deed787c18fbc85256593006f843c?OpenDocument Rufus: The Information Sponge] Research Report RJ 8294 (75655) August 13, 1991, IBM Almaden Research Center * 15. Metzler and Haas. 
[http://portal.acm.org/citation.cfm?id=65943.65949 The Constituent Object Parser: Syntactic Structure Matching for Information Retrieval], Proceedings of the ACM SIGIR Conference, 1989, ACM Press * 16. Nelson, T.H. [http://www.eastgate.com/catalog/LiteraryMachines.html Literary Machines], self published by Nelson, Nashville, Tenn., 1981. Did much to popularize hypertext; at the time of writing he has still not released a working product, though competitors such as hypercard have done so with notable success. * 17. Mozer, Michael C. [http://www.eric.ed.gov/ERICWebPortal/custom/portlets/recordDetails/detailmini.jsp?accno=ED245694 Inductive Information Retrieval Using Parallel Distributed Computation], UCLA, ICS Report No. 8406, June 1984 * 18. Pike, Rob and P.J. Weinberger ... The Hideous Name, AT&T Research Report * 19. Pike, Rob., Presotto, Dave., Thompson, Ken. Trickey, Howard., Winterbottom, Phil. [http://plan9.bell-labs.com/sys/doc/names.html The Use of Name Spaces in Plan 9]. Plan 9 is an operating system intended to be the successor to Unix, and greater integration of its name spaces is its primary focus. * 20. Potter, Walter D. and Robert P. Trueblood, [http://portal.acm.org/citation.cfm?id=45937 Traditional, semantic, and hyper-semantic approaches to data modeling] v21 Computer '88 p53(11) * 21. Rijsbergen, C. J. Van, [http://www.dcs.gla.ac.uk/Keith/Preface.html Information Retrieval] - 2nd. ed., Butterworth and Co. Ltd., 1979, Printed in Great Britain by The Whitefriars Ltd., London and Tonbridge * 22. Salton, G. (1986) [http://portal.acm.org/citation.cfm?id=6149 Another Look At Automatic Text-Retrieval Systems], Communications of the ACM, 29, 648-656 * 23. Smith, J.M. and D.C. Smith, [http://portal.acm.org/citation.cfm?id=320546 Database Abstractions: Aggregation and Generalization], ACM Transactions on Database Systems, June 1977, pp. 105-133 * 24. 
[http://www.win.tue.nl/~aeb/partitions/partition_types.html Partition types] by [mailto:aeb@cwi.nl Andries Brouwer], 2009-06-25 [[category:Reiser4]] {{wayback|http://www.namesys.com/whitepaper.html|2006-11-13}} The Naming System Venture By Hans Reiser http://namesys.com 6114 La Salle ave., #405, Oakland, CA 94611 email: reiser@namesys.com == Abstract == For too long the file system has been semantically impoverished in comparison with database and keyword systems. It is time to change! The current lack of features makes it much easier to use the latest set theoretic models rather than older models of relational algebra or hypertext. The current FS syntax fits nicely into the newer model. The utility of an operating system is more proportional to the number of connections possible between its components than it is to the number of those components. Namespace fragmentation is the most important determinant of that number of possible connections between OS components. Unix at its beginning increased the integration of I/O by putting devices into the file system name space. This is a winning strategy: let's take the file system name space and, one missing feature at a time, eliminate the reasons why the filesystem is inadequate for what other name spaces are used for. Only once we have done so will the hobbles be removed from OS architects, or even OS conspiracies. Yet before doing that, we need a core architecture for the semantics to ensure we end up with a coherent whole. This paper suggests a set theoretic model for those semantics. The relational models would at times unacceptably add structure to information, the keyword models would at times delete structure, and purely hierarchical models would create information mazes. 
Reworking their primitives is required to synthesize the best attributes of these models in a way that allows one the flexibility to tailor the level of structure to the need of the moment. The set theoretic model I propose has a syntax that is upwardly compatible with Linux, MacOS, and DOS file system syntax, as well as with the CORBA naming layer.

This is a planning document for the next major version of ReiserFS, that is, a description of vaporware. It is useful to ReiserFS users and contributors who want to know where we are going, and why we are building all sorts of strange optimizations into the storage layer (and especially to those who are willing to help shape the vision in the course of discussions on the {{listaddress}} mailing list). Currently the storage layer for ReiserFS is working and useful as an everyday FS with conventional semantics. That storage layer is available as a GPL'd Linux kernel patch.

== Introduction ==

Many OS researchers have built hierarchical name spaces that innovate in their effect on the integration of the operating system (e.g. Plan 9 and its file system [Pike]). Relational and keyword researchers rightfully scorn hierarchical name spaces as 20 years behind the state of the art [Date], but pay little attention to integration of the operating system as a design objective in their own work, or as a possible influence on data model design. I won't go into that here. Limiting associations to single key words is an unnecessary restriction.

== A Naming System Should Reflect Rather than Mold Structure ==

The importance of not deleting the structure of information is obvious; few would advocate using the keyword model to unify naming. What can be more difficult to see is the harm from adding structure to information; some do recommend the relational model for unifying naming (e.g. OS/400).
By decomposing a primitive of a model into smaller primitives one can end up with a more general model, one with greater flexibility of application. This is the very normal practice of mathematicians, who in their work constantly examine mathematical models with an eye to finding a more fundamental set of primitives, in hopes that a new formulation of the model will allow the new primitives to function more independently, and thereby increase the generality and expressive power of the model. Here I break the relational primitive (a tuple is an unordered set of ordered pairs) into separate ordered and unordered set primitives. Relational systems force you to use unordered sets of ordered pairs when sometimes what you want is a simple unordered set.

Why should a naming system match rather than mold the structure of information? For systems of low complexity, the reasons are deeply philosophical, which means uncompelling. And for multiterabyte distributed systems?...

Reiser's Rule of Thumb #2: The most important characteristic of a very complex system is the user's inability to learn its structure as a whole. We must avoid adding structure, or guarantee that the user will be informed of all structure relevant to his partial information. Avoiding adding structure is both more feasible and less burdensome to the user.

Hierarchical, relational, semantic, and hypersemantic systems all force structure on information, structure inherent in the system rather than in the information represented. If a system adds structure, and the user is trying to exploit partial knowledge (such as a name embodies), then it inevitably requires the user to learn what was added before he can employ his partial knowledge. With complex systems, the amount added is beyond the capacity of users to learn, and information is lost.

Example: <tt>"My name is Kali, your friendly technical support specialist for REGRES. Our system puts the Library of Congress online!
How may I help you."</tt>

George doesn't know Santa Claus' name: <tt>"I'm trying to find the reindeer chimneys christmas man, and I can't get your system to do it."</tt>

[[Image:Reindeer.jpg]]

FIGURE 1. Graphical representation of a typical simple unordered set that is difficult for relational systems.

Kali says: <tt>"OK, now let's define a query. '''is-a equals man''', that's easy. But reindeer? Is reindeer a property of this man?"</tt>

<tt>"Uh no. I wish I could remember the dude's name. I read this story about him a long time ago, and all I can remember is that he had something to do with reindeer and chimneys. The story is on-line, somewhere."</tt>

<tt>"Reindeer chimneys presents man, that's the sort of speech pattern I'd expect from a three year old."</tt> Kali corrects him. <tt>"Let's see if we can structure this properly. Is reindeer an '''instance-of''' of this man? A '''member-of''' of this man? It couldn't be a '''generalization''' of this man. Hmm..."</tt>

<tt>"No! It's not that complicated. They just have something to do with him."</tt>

<tt>"Pavlov would probably say you associate reindeer with this man, the way the unstructured mind of an animal thinks. But here in technical support we try to help our customers become more sophisticated. Is reindeer a property of this man?"</tt>

<tt>"No. Try '''propulsion-provider-for'''."</tt>

<tt>"Do you think that that was the schema the person who put the information in our system used?"</tt>

<tt>"No. Shoot. I can think of a dozen different columns it could be under. But what are the chances that the ones I think of are going to be the same as the ones the dude who put the information in used?"</tt>

Kali feels satisfaction. <tt>"Guess it can't be done, not if you can't structure your REGRES query properly.
I'll put you down in my log as a closed ticket, 190 seconds to resolution, not bad."</tt>

<tt>"A keyword system could handle reindeer chimneys christmas man."</tt> George grumbles as he stares in despair at his display. Unfortunately, the ''Library of Congress'' is only one of REGRES' many reference aids. George could spend his life at it, and he'd never learn its schema.

<tt>"But a keyword system would delete even necessary structure inherent to the information. It couldn't handle our other needs!"</tt> Kali says before she hangs up.

In addition to the searcher's difficulties, having to manufacture structure by specifying the column for reindeer also adds unnecessary cognitive load to the story author's indexing tasks.

== A Few of the Other Approaches to This Problem ==

There is lurking at the heart of my approach a subtle difference between my analysis of naming and the analysis of at least some others. I started my research by systematically categorizing the different structures embodied by names, placing them into equivalency classes, and then picking one syntax out of each class of functionally equivalent naming structures, on the assumption that each of the equivalency classes has value. For example, I considered that languages sometimes convey structure by word endings (tags), and sometimes by word order, but while the syntax differs, the word order and word ending techniques are equivalent in their power to convey structure. In my analysis of the effect of word ordering I decided that either the ordering mattered or it did not, and that was the basis for two different naming primitives. Others have instead studied the inherent structure of data, and then from that derived ways of naming.

The hypersemantic system [Smith] [Potter] represents an attempt to pick a manageably few columns which cover all possible needs.
Generalization, aggregation, classification, and membership correspond to the is-a, has-property, is-an-instance-of, and is-a-member-of columns, respectively. The minor problem is that these columns don't cover all possibilities: they don't cover reindeer, presents, or chimneys for George's query. The major problem is that they don't correspond as closely as possible to the most common style of human thought, simple unordered association, and they require cognitive effort to transform.

The first response of relational database researchers to this is usually to ask: "Why not modify an existing relational database to contain an 'associated' column, put everything in that column, and it would be functionally equivalent to what you want?" This is like saying that you can do everything Pascal can do using TeX macros. (They are both Turing complete.) We don't design languages simply to be Turing complete, we design them to be useful. I have seen a colleague do in six lines of (nonstandard) SQL a simple three-keyword unordered set that I do in three words plus a pair of delimiters, and that traditional keyword systems also handle easily. Doing simple unordered sets well is crucial for highly heterogeneous name spaces, and the market success of keyword systems in Internet searching is evidence of that. If you look at the structure of names in human languages, they are not all tuple structured, and to make them tuple structured might be to distort them.

I have merely discussed the burden of naming columns. Most relational systems also require the user to specify the relation name. If column naming is a burden, naming both the column and the relation is no less a burden. Many systems invest effort into allowing you to take the key that you know, and figure out all the relation names and columns that you might choose to pair with it. This is a good idea, but not as good as not imposing extraneous structure to begin with.
[Salton] can be read for devastating critiques of the document clustering system, but there is a worthwhile idea lurking within that system. Perhaps it is worthwhile to keep track of a small number of documents which are "close" to a given document. The document creator could be informed upon auto-indexing the document what other documents appear to be close to it, and asked to consider associating it with them. This is not within our current plan of work, but I don't reject it conceptually.

In summary, modularity within the naming system is improved by recognizing unordered grouping and ordering as two different functions that deserve separate primitives rather than being combined into a tuple primitive. The tuple is an unordered set of ordered pairs. There are other useful combinations of unordered grouping and ordering than that embodied by the relation, and the success of keyword systems suggests that a plain unordered set without any ordering at all is the most fundamental and common of them.

== Names as Random Subsets of the Information In an Object ==

A system may still be effective when its assumptions are known to be false. You may regard the above as an overstatement of the notion that we are neural nets, and sometimes our abstract systems deal with assumptions that are not true or false, but are somewhat true. After we are finished stating them in English they lose the delicate weighting possessed by the reality of the situation. Sometimes we find it easier to model without that weighting. Classical economics and its assumption of perfect competition is the best known example of an effective system based on assumptions known to be substantially false. Introductory economics classes usually spend several weeks of class time arguing the merits of building models on somewhat false assumptions. This paper will now use such a somewhat false model to convey a feel for why mandatory pairing of name components causes problems.
Assume the user's information from which he tries to construct a description will be some completely random subset of the information about the object. (Some of that information will be structural, and the structural fragments selected will be just as random as the rest.) Assume a user has 15 random clues of information selected from 300 pieces of information the system knows about some object. Assume the REGRES naming system requires that data be supplied in threesomes (perhaps column name, key name, relation name), and cannot use one member of a threesome without the other members of the threesome. Assume the ANARCHY naming system lacks this restriction, but does so at the cost that it can only use those 10 of the 15 information fragments which do not embody structure. Assume the statistical distributions of the 15 pieces of information the user has to construct a name with are fully independent and equally likely (this is both substantially wrong and unfair to REGRES, but....). Assume each clue has a selectivity of 100 (it divides the number of objects returned by 100).

Then ANARCHY has a selectivity of 100<sup>10</sup> = 10<sup>20</sup> = good.

REGRES has a selectivity of 100<sup>(chance that the other two members of an object's threesome are possessed by the user × 15)</sup> = 100<sup>(9/300 × 8/300 × 15)</sup> = 1.06 = very bad.

While it is not true that the clues are fully independent, it is true that to the extent that they are not fully dependent, ANARCHY will gain in selectivity compared to REGRES. Attempting to quantify for any database the extent of the dependence would be a nightmare, and so this model assumes a substantial falsity, through which it is hoped the reader can see a greater truth.
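The arithmetic can be checked mechanically. A minimal sketch using the model's own figures (15 of 300 fragments, selectivity 100 per usable clue, and the 9/300 × 8/300 threesome estimate):

```python
# ANARCHY: 10 of the user's 15 fragments are usable as independent clues,
# each dividing the candidate set by 100.
anarchy = 100 ** 10                      # 10**20

# REGRES: a clue is usable only when the other two members of its
# threesome are also among the user's fragments; the model estimates
# 9/300 * 8/300 per clue, times 15 clues.
usable = (9 / 300) * (8 / 300) * 15      # = 0.012 expected usable clues
regres = 100 ** usable

print(anarchy)           # 100000000000000000000
print(round(regres, 2))  # 1.06
```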
For databases of the lower heterogeneity and complexity that the relational model was designed for, the independence within a threesome can be small, and the ability to also employ the 5 of 15 fragments which are structural is often more important than the difficulty of guessing any structure added. There is an implicit assumption here that you are looking for information that others have structured, and this argument in favor of ANARCHY becomes much less strong without that assumption. I feel obligated to stress once again that I do not advocate low structure over high structure, but I do advocate having the flexibility to match the amount of structure to the needs of the moment. Only with such flexibility can one hope to use all of the 15 fragments that happen to be possessed.

== The Syntax In More Detail ==

What's needed is a naming system intended to reflect just the structure inherent in the information, whatever that structure might be, rather than restructuring the information to fit the naming system.

=== Orthogonal or Unoriginal Primitives and Features ===

There are many primitives that the ultimate naming system would include but which I will not discuss here: macros, OR, weights for subnames and AND-OR connectors [Fox], rules, constraints, indirection, links, and others. I have tried to select only those aspects in which my approach differs from the standard approach.

Unifying the namespace does not require unifying automatic name generation, and those who have read the [Blair] vs. [Salton] controversy will likely understand why I conclude that whatever the benefits might be of unifying automatic name generation, it is not feasible now, and won't be feasible for a long time to come. The names one can assign an object are kept completely orthogonal to the contents of the object in the implementation of this naming layer.
It is up to the owner of the object to name it, and it is up to him to use whatever combination of autonaming programs and manual naming best achieves his purpose. He may name it on object creation, and he may continually adjust its various names throughout its lifetime. See the section defining the Key_Object primitive for a discussion of why names should be thought of this way. Technically, object creation only requires that the object be given a Storage_Key. In practice most users will, in the same act that creates the object, also associate the object with at least one name that will spare them from directly specifying the Storage_Key in hex the next time they make a reference to it. Applications implementing external name spaces can interact with the storage layer by referencing just the Storage_Key.

Namesys will provide a manual naming interface, and the API autonaming programs need to plug into. Companies such as Ecila will provide autonamers for various purposes. Ecila is implementing a program which scans remote stores, creates links to them in the unified name space, but leaves the data on the remote stores. Other programs may also be implemented to perform this general function. To be more specific, the Ecila search engine scans the web for documents in French, and uses the filesystem as an indexing engine. However, they are writing their engine to be a general purpose engine; they have sold support and the addition of extensions to it to other search engine companies; and it is open source. For now we are simply functioning as part of their engine, and the interface is by web browser; at some point we may be able to add their functionality to the namespace.

While the implementation of Microsoft's attempt to blur the distinction between the filesystem name space and the web namespace is one more of appearance than substance, it is surely the right thing to do for Linux as well in the long run.
We should simply make our integration one with substance and utility, rather than integrating mostly the look and feel. When the store is external to the primary store for the namespace, then stale names can be an issue with no clean resolution. That said, unification at just the naming layer is, in a real rather than ideal world, often quite useful, and so we have Internet search engines.

GUI based naming is beyond the scope of this paper, except to mention that it is common for GUI namespaces to be designed such that they are not well integrated with the other namespaces of the OS. They are often thought to necessarily be less powerful, but proper integration would make this untrue, as they would then be additional syntaxes, not substitutes. These additional syntaxes should possess closure within the general name space, and thereby be capable of finding employment as components of compound names like all the other types of names. The compound names should be able to contain both GUI and non-GUI based name components. Integration would make them simply the aspect of naming that applies to what is present in the visual cache of the screen, and to how to manage and display that cache most effectively.

=== Vicinity Set Intersection Definition (Also Called Grouping) ===

Suppose you have a set X of objects. Suppose some of these objects are associated with each other. You can draw them as connected in a graph. Let the vicinity of an object A be the set of objects associated with A. Let there be a set of query objects Q. Then the set vicinity intersection of Q is the set of objects which are members of all vicinities of the objects in Q. When thinking of this as a data model, it seems natural to use the term vicinity set intersection. When thinking of this syntactically, it seems natural to use the term "grouping", because it implies that the subnames are grouped together without the order of the subnames being significant.
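The definition can be sketched directly; the object names and their associations below are invented for illustration:

```python
# Each object maps to its vicinity: the set of objects associated with it.
vicinity = {
    "reindeer": {"santa-story", "zoology-text"},
    "chimneys": {"santa-story", "masonry-manual"},
    "christmas": {"santa-story", "carol-lyrics"},
    "man": {"santa-story", "zoology-text", "biography"},
}

def vicinity_intersection(query):
    """Return the objects lying in every vicinity of the query objects."""
    return set.intersection(*(vicinity[q] for q in query))

# The grouping [reindeer chimneys christmas man] resolves to:
print(vicinity_intersection(["reindeer", "chimneys", "christmas", "man"]))
# {'santa-story'}
```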
There is exactly one data model primitive (set vicinity intersection) possessing exactly one syntax (grouping), and I rarely intend to distinguish the data model primitive from the syntax primitive (I can be criticized for this), and yet I use both terms for it; forgive me.

=== Synthesizing Ordering and Grouping ===

I am going to describe a toy naming system that allows focusing on how best to combine grouping and ordering into one naming system. This synthesis will contain the core features of the hierarchical, keyword, and relational systems as functional subsets. It consists of a few simple primitives, allowed to build on each other. It sets the discussion framework from which our project will, over many years, evolve a real naming system out of its current storage layer implementation.

Resolving the second component of an ordering is dependent on resolving the first, unlike in set theory. In set theory one can derive the ordered set from the unordered set, but because resolving the name of the second component depends on the first component, one cannot do so in this naming system. For this reason it can well be argued that this naming system is not truly set theory based. Now that I have mentioned this difference I will start to call them grouping and ordering, rather than unordered and ordered sets.

These two primitives take other names as sub-names, and allow the user to construct compound names. Either the order of the subnames is significant (ordering), or it isn't (grouping), and thus we have the two different primitives. Because I have myself found that BNFs are easier to read if preceded by examples, I will first list progressively more complex examples using the naming system, and then formally define it. The examples, and the simplified syntax, use / rather than : or \, but this is of no moment.
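A sketch of how compound names built from these two primitives decompose, as a toy recursive-descent parser; the flat treatment of a/b/c chains and the whitespace tokenization are my own simplifications of the syntax:

```python
def split_top_level(s, sep):
    """Split s on sep, but not inside [...] groupings."""
    parts, depth, cur = [], 0, ""
    for ch in s:
        if ch == "[":
            depth += 1
        elif ch == "]":
            depth -= 1
        if ch == sep and depth == 0:
            parts.append(cur)
            cur = ""
        else:
            cur += ch
    parts.append(cur)
    return parts

def parse(name):
    """Toy reading of the syntax: ordering, grouping, or bare key-object."""
    parts = split_top_level(name, "/")
    if len(parts) > 1:                    # ordering: component order matters
        return ("ordering", [parse(p) for p in parts])
    name = name.strip()
    if name.startswith("[") and name.endswith("]"):   # grouping: it doesn't
        subnames = split_top_level(name[1:-1], " ")
        return ("grouping", [parse(p) for p in subnames if p.strip()])
    return ("key_object", name)

print(parse("[my secrets]/[love letter susan]"))
# ('ordering', [('grouping', [('key_object', 'my'), ('key_object', 'secrets')]),
#               ('grouping', [('key_object', 'love'), ('key_object', 'letter'),
#                             ('key_object', 'susan')])])
```

Because the splitter respects brackets, a subname such as subject/[illegal strike] survives intact inside an outer grouping and is parsed as a nested ordering.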
=== Examples ===

<tt>/etc/passwd</tt>

[[Image:Passwd.jpg]]

Ordering and grouping are not just better; file system upward compatibility makes them cheaper for unifying naming in OSes based on hierarchical file systems than a relational naming system would be. This approach is fully upwardly compatible with the old file system. Users should be able to retain their old habits for as long as they wish, engage in a slow comfortable migration, and incorporate the new features into their habits as they feel the desire. Elderly programs should be untroubled in their operation. Many worthwhile projects fail because they emphasize how much they wish to change rather than asking of the user the minimal collection of changes necessary to achieve the added functionality.

<tt>[dragon gandalf bilbo]</tt>

[[Image:Bilbo.jpg]]

FIGURE 3. Graphical representation of the ascii name on the left. Mr. B. Bizy is looking for a dimly remembered story (The Hobbit by Tolkien) to print out and take with him for rereading during the annual company meeting.

<tt>case-insensitive/[computer privacy laws]</tt>

[[Image:Syntax-barrier.jpg]]

FIGURE 4. Graphical representation of the ascii name on the left.

When one subname contains no information except relative to another subname, and the order of the subnames is essential to the meaning of the name, then using ordering is appropriate. This most commonly occurs when syntax barriers are crossed. This is when a single compound name makes a transition from interpreting a subname according to the rules of one syntax to interpreting it according to the rules of another syntax. Ordering is essential at the boundary between the name of the new syntax as expressed in the current syntax, and the name to be interpreted according to that new syntax. Some researchers use the term context rather than syntax. The pairing of a program or function name and the arguments it is passed is inherently ordered.
While that is usually the concern of the shell, when we use a variety of ordering functions to sort Key_Objects of different types it affects the object store. In this example the ordering serves as a syntax barrier. Case-insensitive is the unabbreviated name of a directory that ignores the distinction between upper and lower case. For Linux compatibility this naming layer is case sensitive by default, even though I agree with those who think that it would be better were it not.

<tt>[my secrets]/[love letter susan]</tt>

[[Image:My-secrets.jpg]]

FIGURE 5. Graphical representation of the ascii name on the left.

Devhuman (that's the account name he chose) is the company's senior programmer. Six years ago he wrote a love letter to Susan, which he put in his read-protected secrets directory. (He never found the nerve to send it to her.) He's looking for it so he can rewrite it, and then consider sending it. Security is a particular kind of syntax barrier (you have to squint a bit before you can see it that way). Here the ordering serves as a security barrier. (He certainly wouldn't want anyone to know that an object owned by him with attributes love letter susan existed.)

<tt>[subject/[illegal strike] to/elves from/santa document-type/RFC822 ultimatum]</tt>

[[Image:Ultimatum.jpg]]

FIGURE 6. Graphical representation of the search for santa's ultimatum.

Devhuman knows his object store cold. He is looking for something he saw once before; he knows that it was auto-named by a particular namer he knows well (perhaps one whose functionality is similar to the classifier in [Messinger]), and he knows just what categorizations that namer uses when naming email. Still, he doesn't quite remember whether the word 'ultimatum' was part of the subject line, the body, or even was just elvish manual supplementation of the automatic naming. Rather than craft a query carefully specifying what he does and does not know about the possible categorizations of ultimatum, he lazily groups it.
If Devhuman's object store is implemented using this naming system with good style, someone less knowledgeable about the object store would also be able to say:

<tt>[santa illegal strike ultimatum elves]</tt>

and perhaps get some false hits as well as the desired email (instead of finding mail from santa, perhaps finding the elvish response). Notice that if you delete 'illegal' and 'ultimatum' to get <tt>[subject/strike to/elves from/santa document-type/RFC822]</tt>, the query is structurally equivalent to a relational query. Many authors (e.g. semantic database designers) have written papers with good examples of standard column names which might be worth teaching to users. So long as they are an option made available to the user rather than a requirement demanded of the user, the increased selectivity they provide can be helpful.

<tt>[_is-a-shellscript bill]</tt>

[[Image:Pruner.jpg]]

FIGURE 7. Graphical representation of the ascii name on the left.

This name finds all shellscripts associated with bill. Names preceded by _ are pruners. Pruners are analogous to the predicate evaluators of relational database theory. If you have read papers distinguishing between recognition and retrieval, pruners are a recognition primitive. They are passed a list of objects, and return the subset of that list which matches some criteria. They are a mechanism appropriate for when a nonlinear search method that can deliver the desired functionality is either impossible, or not supported by existing indexes. There are many useful names for which we cannot do better than linear time search algorithms (perhaps simply as a result of incomplete indexing). _is-a-shellscript checks each member of its list to see if it is an executable object containing solely ascii. The user can use it just like any other Key_Object within an association; it will prune the results of the grouping. Since set intersections are commutative, its order within the grouping has no meaning, and optimizers are free to rearrange it.
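A pruner, then, is a predicate applied to a candidate list after the intersection; a minimal sketch, in which the is_shellscript check and the sample objects are hypothetical:

```python
def is_shellscript(obj):
    """Hypothetical pruner: an executable object containing solely ascii."""
    return obj["executable"] and obj["contents"].isascii()

def prune(candidates, pruner):
    # A pruner is passed a list of objects and returns the matching subset.
    # Membership must not depend on the other members of the list, so an
    # optimizer may apply pruners in any order.
    return [obj for obj in candidates if pruner(obj)]

objects = [
    {"name": "backup.sh", "executable": True, "contents": "#!/bin/sh\n"},
    {"name": "photo.bin", "executable": True, "contents": "\xff\xd8"},
    {"name": "notes.txt", "executable": False, "contents": "todo"},
]

print([o["name"] for o in prune(objects, is_shellscript)])  # ['backup.sh']
```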
=== The Formal Definitions ===

{| border=1
| <Object Name> ::= ||
<pre>
<Grouping> |
<Ordering> |
<Key_Object> |
<Storage_Key> |
<Orthogonal and Unoriginal Primitives I Will Not Define Here> ;
</pre>
|}

See the section listing orthogonal and unoriginal primitives for a discussion of what primitives, necessary to a real world working system, I left out of the definitions of this grammar. The name resolver has a method for converting all of the primitives into '''<Storage_Keys>''', and when processing compound names it first converts the subnames into '''<Storage_Keys>''' (though an object may have null contents, and serve purely to embody structure). This allows the use, as a component of a grouping or ordering, of anything for which anyone can invent a way of allowing the user to find an '''<Object Name>''', and a method for the resolver to convert the '''<Object Name>''' into a '''<Storage_Key>'''. In a word, closure. Extensible closure.

Compound names are interpreted by first interpreting the subnames that they are constructed from. At each stage of subname interpretation an '''<Object Name>''' is converted into a '''<Storage_Key>''' for the object that it is resolved to. The modules that implement the grouping and ordering primitives do not interpret the subnames; they merely pass them to the naming system, which returns the '''<Storage_Key>'''s they resolve to.

It was a long discussion which led to the use of storage keys rather than objectids. A storage key differs from an objectid in that it gives the storage layer directions as to where to try to locate the object in the logical tree ordering of the storage layer. If the logical location changes, then in the worst case we leave a link behind, and get an extra disk access like we get with an inode.
(Inode numbers are functionally objectids.) In the better case, the repacker eventually comes along and changes all references by key to the new location, at least for all objects that have not given their key to external naming systems the repacker cannot repack. A '''<Storage_Key>''' is assigned by the system at object creation, and serves the purpose of allowing the system to concisely name the object, and to provide hints to the storage layer about which objects should be packed near each other. The user does not directly interact with the '''<Storage_Key>''' any more often than C programmers hardcode pointers in hex. The packing locality of keys may be redefined.

== The Primitives ==

<Key_Object>: a description of the contents of an object using the syntax of the current directory. For objects used to embody keywords this may be the keyword in its entirety. If it contains spaces, etc., it must be enclosed in quotes. Note that making it easy for third parties to add plug-in directory types is part of Namesys's current contract with Ecila. Ecila wants space efficient directories suitable for use in implementing a term dictionary and its postings files for their Internet search engine.

Example: <tt>[reindeer chimneys presents man]</tt>

In this example 'presents', 'reindeer', 'chimneys', and 'man' are the contents of objects associated with the Santa Claus story. Each of them is searched for by contents, and when found they are converted into their Storage_Keys, and then the grouping algorithm is fed their four Storage_Keys. The grouping module then looks in the object headers of the four objects, gets the four sets of objects the Key_Objects group to, and performs a set intersection.
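The pipeline just described, from contents lookup to object-header intersection, might look like the following; the hex Storage_Keys and the association sets are invented for illustration:

```python
# Invented sample store: contents -> Storage_Key, plus object headers
# holding the set of Storage_Keys each Key_Object groups to.
storage_key = {"reindeer": 0x0A, "chimneys": 0x0B, "presents": 0x0C, "man": 0x0D}
header_groups = {
    0x0A: {0x99, 0x42},        # reindeer groups to the santa story (0x99)...
    0x0B: {0x99, 0x17},
    0x0C: {0x99, 0x42},
    0x0D: {0x99, 0x17, 0x42},
}

def resolve_grouping(words):
    # 1. Search each Key_Object by contents, yielding its Storage_Key.
    keys = [storage_key[w] for w in words]
    # 2. Fetch from each object header the set of objects it groups to,
    #    and 3. intersect those sets.
    return set.intersection(*(header_groups[k] for k in keys))

print(resolve_grouping(["reindeer", "chimneys", "presents", "man"]))
# {153}, i.e. {0x99}: the story's Storage_Key
```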
Besides greater closure, another advantage of storing Key_Objects as objects is that non-ascii Key_Objects and ordering functions can be implemented as a layer on top of the ascii naming system, allowing the user to interact with the naming system by pressing hyperbuttons, drawing pictures, making sounds, and supplying other non-ascii Key_Objects that the higher layers convert into Storage_Keys.

There are endless content description techniques. If the directory owner supplies an ordering function for the Key_Objects in a directory, one can generate a search index for the directory using a directory plug-in which is fully orthogonal to the ordering function, though perhaps slower in some cases than one that is tailored for the ordering function. Users will find it easier to write ordering functions than index creation objects, and will not always need the speed of specialized indexes. We will need one ordering function for ascii text, another for numbers, another for sounds, perhaps someday one even for pictures of faces (perhaps to be used by a law enforcement agency constructing an electronic mug book, or a white pages implementation), etc. No system designer can provide all the different and sometimes esoteric ordering functions which users will want to employ. What we can do is create a library of code from which users can construct their own ordering functions and their own directory plug-ins, and this is the approach we are taking on behalf of Ecila. For an Internet search engine one wants what is called a postings file, which is like a directory in that there is no need to support a byte offset, and one frequently wants to efficiently perform insertions into it.

<pre>
<Grouping> ::= [<Unordered List>] ;
<Unordered List> ::= <Unordered List> <Unordered List> | <Object Name> | <Pruner> ;
<Pruner> ::= _<Object Name>
</pre>

A <Grouping> is a list of object names and pruners whose order has no meaning.
Every object has, in its object header, a list of the objects it groups to (associates with, in neural network idiom). A grouping is interpreted by performing a set intersection of those lists for every object named in the grouping; in the sense of the data model, this is a set vicinity intersection.

Grouping is not transitive: [A] => B and [B] => C does not imply [A] => C, though it does imply that [[A]] => C.

A pruner is an <Object Name> which has been preceded with an _ to indicate that the object described should be passed a list of the objects named by the rest of the grouping, executed, and will return a subset of the list it was passed. Whether a member of the set is in the returned subset must be fully independent of what the other members of the set were, or else the results become indeterminate after application of a query optimizer, as with an optimizer in use there is no guarantee of the order of application of the pruners.

<pre>
<Ordering> ::= <Object Name>/<Object Name> | <Object Name>/<Custom Programmed Syntax>
<Custom Programmed Syntax> ::= Varies, provides extensibility hook.
</pre>

An ordering is a pairing of names, with the order representing information. The first component of the ordering determines the module to which the second component is passed as an argument. In contrast, a grouping first converts all subnames to Storage_Keys by looking through the same current directory for all of them in parallel, and then does its set intersection with the subdescriptions already resolved.

Example: in resolving <tt>[my secrets]/[love letter susan]</tt> the system would look for the objects with contents my and secrets, find both of them, and do a set intersection of all the objects those two objects both group to (are associated with). This will allow it to find the [my secrets] directory, inside of which it will look for the three objects love, letter, and susan.
It will then extract from their object headers the sets of objects those three words ('love', 'letter', and 'susan') group to, and do a set intersection which will find the desired letter. The desired letter is not necessarily inside the [my secrets] directory, though in this case it probably is. A directory is an object named by the first component of an ordering, to which the second component is passed, and which returns a set of Storage_Keys. One can in principle use different implementations of the same directory object without impacting the semantics and only affecting performance, as is often done in databases. There are several flavors of directories: Custom programmed directories, aka filters, are any executable program that will return a Storage_Key when executed and fed the second component as an argument. They provide extensibility. (They are the ordered counterpart of pruners.) Another term for them is filter directories. Custom programmed directories whose name interpretation modules aren't unique to them will contain just the name of the module (filter), plus some directory dependent parameters to be passed to the module. It should be considered merely a syntax barrier directory, and not a fully custom programmed directory, if those parameters include a reference to a search tree that the module operates on, and if that search tree adheres to the default index structure. The connotations conveyed by the term 'filter', of there being an original which is distorted, are not always appropriate, but in honesty this is not an issue about which we deeply care. Syntax barrier directories allow you to describe the contents of the objects they contain with a syntax different from their parents'. Except for being sorted by a different ordering function, the indexes of syntax barrier directories are standard in their structure, and use a standard index traversal module. The index traversal module is ordering function independent.
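The two-stage resolution of [my secrets] / [love letter susan] described above combines the primitives: the first grouping must resolve to a single directory object, and the second grouping is then resolved within that directory. A minimal sketch, with all directory contents and names invented for illustration:

```python
# Sketch of resolving the ordering [my secrets] / [love letter susan].
# Directory contents and object names are invented; not Namesys code.

def intersect(assoc, names):
    """The grouping primitive: intersect the groups-to sets."""
    return set.intersection(*(assoc[n] for n in names))

# Each directory maps a word to the set of objects it groups to there.
current_dir = {
    "my":      {"my_secrets_dir", "my_music_dir"},
    "secrets": {"my_secrets_dir"},
}
my_secrets_dir = {
    "love":   {"letter_to_susan", "letter_to_anna"},
    "letter": {"letter_to_susan", "letter_to_anna", "tax_letter"},
    "susan":  {"letter_to_susan"},
}
directories = {"my_secrets_dir": my_secrets_dir}

def resolve(first, second, cwd):
    """Resolve [first] / [second]: the first grouping must name a
    single directory, in which the second grouping is then resolved."""
    (dir_key,) = intersect(cwd, first)       # must resolve uniquely
    return intersect(directories[dir_key], second)

print(resolve(["my", "secrets"], ["love", "letter", "susan"], current_dir))
# {'letter_to_susan'}
```

The ordered pair carries information (resolve the left side first, in the current context), which is exactly what distinguishes it from the commutative grouping.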
There must be an ordering function for every <Key_Object> employed within a given syntax barrier directory. By contrast, a <Custom Programmed Syntax> could be anything which the syntax module somehow finds an object with, possibly even creating the object in order to be able to find it. To cross a security barrier directory the user must use an ordered pair of names with the security barrier as the first member of the pair, and he must satisfy the security module of the secured directory. A security barrier directory may be both a security and a syntax barrier directory, or it may share the syntax module of its parents. Fully standard directories are those built using the default directory module, and adding structure is their only semantic effect. There is an aspect of customization which is beyond the scope of this paper, in which one customizes the items employed by the storage layer to implement files and directories. That is, the storage of files and directories is implemented by composing them of items, and these items have different types. We are now creating the code for packing and balancing arbitrary types of items using item handlers and object oriented balancing code, so as to make it easier to extend our filesystem.

=== Ordering can be implemented more efficiently than grouping ===

The set intersections performed in evaluating the grouping primitive are normally much more expensive computationally than performing the classical filesystem lookup. Imposing excess structure on one's data does not just at times reduce the cost of human thinking :-); it can be used to reduce the cost of automated computation as well. When the cost to a user of learning structure is less important than the burden on the machine, use of highly ordered names is often called for.

=== The Motivation for Different Syntactic Treatment of Ordering and Grouping, and Some of the Deeper Issues Revealed by the Difference
===

An important difference between grouping and ordering affects syntax: it allows us to represent an ordering with a single symbol ('/') placed between the pair, but requires two symbols ('[' and ']') for each grouping. Imagine using < and > as a two symbol delimiter style alternative notation for ordering:

 <<father-of mother-of> sister-of>
 = <father-of <mother-of sister-of>>
 = <father-of mother-of sister-of>
 = father-of/mother-of/sister-of

All of the expressions above are equivalent in referring to the paternal great aunt of the person who is the current context. The ones using nested pairs of symbols to enclose pairs of subnames imply a false structure that requires the user to think to realize the first two expressions are equivalent. The fourth is the notation this naming system employs. Grouping is different: Fast Acting Freddy is looking through the All-LA Shopping Database for a single store with black reebok sneakers, a green leather jacket, and a red beret so that he can dress an actor for a part before the director notices he forgot all about him.

 [[black reebok sneakers] [green leather jacket] [red beret]]

is not equivalent to

 [black reebok sneakers green leather jacket red beret]

which equals

 [red sneakers black reebok jacket green beret]

Ordering is not algebraically commutative (father-of/mother-of is not equivalent to mother-of/father-of). Groupings are algebraically commutative ([large red] = [red large]).

== Style ==

As a general principle, a more restricted system can avoid requiring the user to repeatedly specify the restrictions, and if the user has no need to escape the restrictions then the restricted system may be superior. This is why "4GLs", which supply the structure for the user's query, are useful for some applications. They are typically implemented as layers on top of unrestricting systems such as this one. This paper has addressed issues surrounding finding information, particularly when the user's clues are faint.
When supporting other user goals, such as exploring information, adding structure through substantial use of ordering can be helpful [Marchionini] [McAleese]. When the user goal is finding, one should assume that of all the fragments of information about an object, the user has some random subset of them. The goal is to allow the user to use that random subset in a name, whatever that subset might be. Some of that subset will be structural fragments. While requiring the user to supply a structure fragment is as foolish as requiring him to supply any other arbitrary fragment, allowing him to is laudable. In the best of all worlds the object store would incorporate all valid possible structurings of Key_Objects; the difficulty in implementing that is obvious. [Metzler and Haas] discuss ways of extracting structure from English text documents, and why one would want to be able to use that structure in retrievals. Unfortunately, there is an important difference between representing the structure of an English language sentence in a way that conveys its meaning, and representing it in a way that allows it to be found by someone who knows only a fragment of its semantic content. I doubt the wisdom of advocating the use of more than essential structure in searching. You can allow users to avoid false structure; you cannot force them to. It is important to teach those creating the structure that if they group a personnel file with sex/female they should also group it with female. Type checking can impose structure usefully. Its implementation can enhance or reduce closure, depending on whether it is done right.

=== When To Decompound Groupings ===

There are dangers in excessive compounding of compound groupings analogous to those of excessive ordering. Let's examine two examples of compound groupings, each of which is valid both semantically and syntactically.
One of them can be "decompounded" with moderate information loss, and the other loses all meaning if decompounded. Example: finding a loquacious Celtic textbook salesman who told you in excruciating detail about how he was an ordnance researcher until one day he went to a Grateful Dead concert.

 [[Celtic textbook salesman] [ordnance researcher]]

vs.

 [celtic textbook salesman ordnance researcher]

These two phrasings of the same query are not equivalent, but they are "close." Our second example is the one in which Fast Acting Freddy tries to find a suspect by the objects he is associated with:

 [[black reebok sneakers] [green leather jacket] [red beret]]

vs.

 [black reebok sneakers green leather jacket red beret]

These two are not at all "close." The difference between the two examples of inequivalence is that the subdescriptions within the second example describe objects whose existence within the object store, independent of the object described, is worthwhile. The first does not, and it is more reasonable to try to design so that the "decompounded" version of the query is used. False hits will occur, but for large systems that's better than asking the user to learn structure. A higher level user interface might choose to present only one level to the user at a time, and then once the user confirms that a subdescription has resolved properly it would let him incorporate it into a higher level description. There might be 6 models of [black reebok sneakers], and Fast Acting Freddy should have the opportunity to click his mouse on the exact model, and have the interface substitute that object for his subdescription. Using such an interface an advanced user might simultaneously develop several subdescriptions, refine and resolve them, and then use the mouse to draw lines connecting them into a compound grouping. Closure makes it possible for that to work.
== Examples of Creating Associations ==

<- creates an association between all of the objects on the left hand side and all of the objects on the right hand side. A - B is the set difference of A and B; it resolves to the set of objects in A except for those that are in B. A & B resolves to the set intersection of A and B, the objects that are in both A and B. [A B] = [A] & [B], by definition.

 animal <- (lives, moves)
 mammal <- ([animal], animal, `warm blooded')
 cat <- ([mammal], hypernym/mammal, mammal, meronym/fur, fur, meronym/whiskers, whiskers, hypernym/quadruped, quadruped, capability/purr, purr, capability/meow, meow)
 Basil <- (owner/Nina, Nina, [siamese], siamese, clever, playful, brave/overly, brave, 'toilet explorer')
 bag <- ([container], container, consists-of/`highly flexible material', `highly flexible material')
 backpack <- ([bag], shoulderstrap/quantity/2, shoulderstrap, college-student, holonym/backpacker, meronym/shoulderstrap)
 mould <- ([fungi] - green/not, furry, `grows on'/surfaces/moist, `killed by'/chlorine)
 fungi <- ([plant], plant, leaves/no, flowers/no, green/not)
 bird <- ([vertebrate], vertebrate, flies, feathers)
 penguin <- ([bird] - flies, bird, hypernym/bird, swims, Linux, [Linux (mascot, symbol)])
 siamese <- ([cat], cat, hair/short, short-hair)

Notice how we don't associate siamese with short despite associating it with hair/short, but we do associate Basil with Nina as well as with owner/Nina.

 small <-0 little

The above means that small and little are synonyms, and are to be treated as 0 distance away from each other for vicinity calculation purposes. In other, traditional Unix, words, they are hardlinked together. Creating a serious ontology is not our field or task, but it is worth doing. The reader is referred to WordNet (free), and to Cyc by Doug Lenat (proprietary). While we will focus on implementing primitives that allow for creating better ontologies, we are happy to work with persons interested in contributing or porting an ontology.
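The notations above can be modeled with a small sketch. The store, its helper names, and the canonical-representative treatment of <-0 are assumptions made for illustration, not the Namesys implementation:

```python
# Toy association store for the "<-" and "<-0" notations above.
# All helper names and data are invented for illustration.
from collections import defaultdict

associations = defaultdict(set)  # object -> set of objects it groups to
synonyms = {}                    # name -> canonical synonym representative

def canon(name):
    """Follow a <-0 synonym link, if any, to its representative."""
    return synonyms.get(name, name)

def associate(obj, *targets):    # models: obj <- (targets...)
    associations[canon(obj)].update(canon(t) for t in targets)

def synonym(a, b):               # models: a <-0 b (distance-0 hardlink)
    synonyms[b] = canon(a)

associate("cat", "[mammal]", "mammal", "meronym/fur", "fur",
          "capability/purr", "purr", "capability/meow", "meow")
associate("siamese", "[cat]", "cat", "hair/short", "short-hair")
associate("Basil", "owner/Nina", "Nina", "[siamese]", "siamese")
synonym("small", "little")
associate("little", "size")      # recorded under the synonym "small"

# siamese groups to hair/short but deliberately not to bare "short":
assert "hair/short" in associations["siamese"]
assert "short" not in associations["siamese"]
# small and little are hardlinked: associating little updates small.
assert associations["small"] == {"size"}
```

Treating hair/short and short as distinct association targets is what the text's siamese example requires; the synonym table shows one way a distance-0 vicinity could collapse hardlinked names before intersection.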
== Other Projects Seeking To Increase Closure In The OS ==

=== AT&T's Plan 9 ===

[Plan 9] is being produced by the original authors of Unix at AT&T research labs. It has influenced CORBA, and /proc is a direct steal from it to Linux. Their major focus is on integration. Their major trick for increasing integration is unifying the name space. Name spaces integrated into the Plan 9 file system include the status, control, virtual memory, and environment variables of running processes. They have a hierarchical analog to what the relational culture calls constructing views, which the Plan 9 culture calls context binding.

=== Microsoft's Information At Your Fingertips ===

Plan 9 ignores integration of application program name spaces, concentrating on OS oriented name spaces. Microsoft's "Information at Your Fingertips" name space integration effort appears to be taking the other approach, focusing on integrating the name spaces of the various Microsoft applications via OLE and Structured Storage. The application group at Microsoft has long been better staffed and funded than the OS group, and FS developers have long preferred to simply ignore the needs of application builders generally. The primary semantic disadvantages of Microsoft's approach are primitives selected with insufficient care, a lack of closure, and the use of an object oriented rather than set oriented approach in both naming syntax and data model. Realistically, one can say that folks within Microsoft have often made statements favoring name space integration, and in various areas have successfully executed on it, but on the whole I rather suspect that the lack of someone in marketing making a business case for $X in revenue resulting from name space integration has crippled name space integration work at commercial OS producers generally, including MS.

==== Internet Explorer ====

Internet Explorer attempts to unify the filesystem and Internet namespaces.
At the time of writing, the unity is so superficial, with so little substance, that I would describe it as having the look and feel of integration without most of the substance. Perhaps this will change.

==== Microsoft's Well Known Performance Difficulties ====

Despite having many of the leading names in the industry on their payroll, they have somehow managed to create a file system implementation with performance so terrible that, for the Unix customer base, it is a significant consideration contributing to hesitation in moving to NT. It may well have the worst performance of any of the major OS file systems. Their implementation of OLE's structured storage offers extremely poor performance, and their excuse that it is due to the incorporation of transaction concepts into their design is just a reminder that they did a poor job at that as well. They managed to implement something intended to store small objects within a file, and implemented it such that it still suffers from 512-byte granularity problems, problems that they try to somewhat overcome by encouraging the packing of several objects within "storages" at horrible kludge cost.

=== Storage Layers Above the FS: A Sure Symptom That the FS Developer Has Failed ===

When filesystems aren't really designed for the needs of the storage layers above them, and none of them are, not Microsoft's, not anybody's, then layering results in enormous performance loss. The very existence of a storage layer above the filesystem means that the filesystem team at an OS vendor failed to listen to someone, and that someone was forced to go and implement something on their own. You just have to listen to one of these meetings in which some poor application developer tries to suggest that more features in the FS would be nice; I heard one at a nameless OS vendor.
The FS team responds: disks are cheap, small object storage isn't really important, we haven't changed the disk layout in 10 years, and changing it isn't going to fly with the gods above us about whom we can do nothing. At these meetings you start to understand that most people who go into filesystem design are persons who didn't have the guts to pursue a more interesting field in CS. There is a sort of reverse increasing returns effect that governs FS research, in which the more code becomes fixed on the current APIs, the more persons in the field react with fear to any thought of the field of FS semantics being other than a dead research topic, the less research gets done, and the fewer persons of imagination see a reason to enter the field. Every time one vendor gets a little forward in adding functionality, the other vendors go on a FUD campaign about it breaking standards and therefore being dangerous for mission critical usage. This is a field in which only performance research is allowed, and every other aspect is simply dead. Namesys seeks to raise the dead, and is willing to commit whatever unholy acts that requires. There is no need for two implementations of the set primitive, one called directories, the other called a file with streams, each having a different interface. File systems should just implement directories right, give them some more optional features, and then there is no need at all for streams. If you combine allowing directory names to be overloaded to also be filenames when acted on as files, allowing stat data to be inherited, allowing file bodies to be inherited, and implement filters of various kinds, then in the event that the user happens to need the precise peculiar functionality embodied by streams, they can have it by just configuring their directory in a particular way. There was a lengthy Linux-kernel thread on this topic which I won't repeat in more detail here.
The tree architecture of the storage layer of this FS design will lend itself to a distributed caching system much more effectively than the Microsoft storage layer, in part due to its ability to cache not just hits and misses of files, but to cache semantic localities (ranges). For more on this topic see later in this paper.

=== Rufus ===

The Rufus system [Messinger et al.] indexes information while leaving it in its original location and format. While it does allow the user to create a unified name space, it does not choose to integrate that name space into the operating system. Even so, it is immensely useful in practice, and strongly hints at what the OS could gain if it had a more than hierarchical name space with a data model oriented towards what [Messinger] calls "semi-structured information", such as you find in the RFC822 format for email. When you have 7000 pieces of mail, and linearly searching the mail with a utility like grep takes 10 minutes, it is nice to be able to quickly keyword search via inverted indexes for the mail whose from: field contains billg and that has the words "exclusive" and "bundling" in the body of the message, as you hurriedly search for an old email just before an appearance in court.

=== Semantic File System ===

The Semantic File System comes closest to addressing the needs I have described. It is a Unix compatible file system with more than hierarchical naming (attribute based is the term they use). Its data model unfortunately has the important flaw of lacking closure (in it, names of objects are not themselves objects). In my upcoming discussion of the unnecessary lack of closure in hypertext products, notice that the arguments apply to the Semantic File System as well (and so I won't duplicate them here).

=== OS/400 ===

IBM's OS/400 employs a unified relational name space. The section of this paper entitled ''A Naming System Should Reflect Rather than Mold Structure'' will cover its problems of forcing false structure.
Inadequate closure due to mandatory type checking is another source of difficulties for it. While users moan about these two unnecessary design flaws, the essence of the opinions AS/400 partisans have expressed to me has been that the unification of its name space is a great advantage that OS/400 has over Unix. I claim these users were right, and later in this paper will propose doing something about it. == Conclusion == While I spent most of this paper on why adding structure to information can be harmful, particularly when it is intended to be found by others sifting through large amounts of other information, this was purely because it is a harder argument than why deleting structure is harmful. My goal was not to be better at unstructured applications than keyword systems, or better at structured applications than the hierarchical and relational systems --- the goal is to be more flexible in allowing the user to choose how structured to be, while still being within a single name space. I claimed that multiple fragmented name spaces cannot match the power and ease of name spaces integrated with closure: closure makes a naming system far more powerful by increasing its ability to compound complex descriptions out of simpler ones. The strong points of this naming system's design are various forms of generalizing abstractions already known to the literature, for greater closure. == Acknowledgments == David P. Anderson and Clifford Lynch helped enormously in rounding out my education, and improving my paper. Their generosity with their time was remarkable. David P. Anderson was simply a great professor, and it was a privilege to work with him. Brian Harvey informed me that it wasn't too obvious to mention that an object store should be unified. Cimmaron Taylor provided me with many valuable late night discussions in the early stages of this paper. 
I would like to thank Bill Cody and Guy Lohman of the database group at the IBM Almaden Research Center for a wonderful learning experience. Vladimir Saveliev kept this file system going when others fell by the wayside. He started as the most junior programmer on the team, and through sheer hard work and dedication to excellence outshone all the other more senior researchers. Of course after some time he could no longer be considered a junior programmer.

NOTE: See also the DARPA funded, but not endorsed, [[Txn-doc|Reiser4 Transaction Design Document]] and [[Reiser4|Reiser4 Whitepaper]].

== References ==

* 1. Blair, David C. and Maron, M. E. [http://portal.acm.org/citation.cfm?doid=3166.3197 Evaluation of Retrieval Effectiveness for a Full-Text Document-Retrieval System], Communications of the ACM, v28 n3, Mar 1985, p289-299
* 2. Codd, E. F. [http://portal.acm.org/citation.cfm?id=77708 The Relational Model for Database Management: version 2], c1990, Addison-Wesley Pub. Co. Not recommended as a textbook (Date's is better for that), but worthwhile if you want a long paper by Codd. Notice that he places greater emphasis on closure, and design methodology principles in general, than designers of other naming systems such as hypertext.
* 3. Date, C.J. [http://portal.acm.org/citation.cfm?id=4198 An Introduction to Database Systems], 4th ed., Reading, Mass.: Addison-Wesley Pub. Co., c1986. Contains a well written substantive textbook sneer at the problems of hierarchical naming systems, and a well annotated bibliography.
* 4. Curtis, Ronald and Larry Wittie [http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?isnumber=35714&arnumber=1695185 Global Naming in Distributed Systems], IEEE Software, July 1984, p76-80
* 5. Feldman, Jerome A., Mark A. Fanty, Nigel H. Goddard and Kenton J. Lynne [http://portal.acm.org/citation.cfm?id=42372.42378 Computing with Structured Connectionist Networks], Communications of the ACM, v31, Feb '88, p170(18)
* 6. Fox, E. A., and Wu, H.
[http://portal.acm.org/citation.cfm?id=358466 Extended Boolean Information Retrieval], Communications of the ACM, 26, 1983, pp. 1022-1036
* 7. Gallant, Stephen I. [http://portal.acm.org/citation.cfm?id=42377 Connectionist Expert Systems], Communications of the ACM, v31, Feb '88, p152(18)
* 8. Gates, Bill. Comdex '91 speech on [http://findarticles.com/p/articles/mi_m0REL/is_n11_v90/ai_9715919/ Information at Your Fingertips], available for $8 on videotape from Microsoft's sales department.
* 9. Gifford, David K., Jouvelot, Pierre, Sheldon, Mark A., O'Toole, James W. Jr. [http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.17.4726 Semantic File Systems], Operating Systems Review, Volume 25, Number 5, October 13-16, 1991. They demonstrated that extending Unix file semantics to include nonhierarchical features is useful and feasible. Unfortunately, their naming system lacks closure.
* 10. Gilula, Mikhail. [http://portal.acm.org/citation.cfm?id=174888 The Set Model for Database and Information Systems], 1st Edition, c1994, Addison-Wesley. Provides a Set Theoretic Database Model in which relational algebra is shown to be a special case of a more general and powerful set theoretic approach.
* 11. [http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.23.4527 Joint Object Services Submission] (JOSS), OMG TC Document 93.5.1
* 12. Marchionini, Gary and Shneiderman, Ben [http://portal.acm.org/citation.cfm?id=619765 Finding Facts vs. Browsing Knowledge in Hypertext Systems], Computer, January 1988, p. 70
* 13. McAleese, Ray. "Hypertext: Theory into Practice", edited by Ray McAleese, ABLEX Publishing Corporation, Norwood, NJ 07648
* 14. Messinger, Eli, Shoens, Kurt, Thomas, John, Luniewski, Allen [http://domino.watson.ibm.com/library/cyberdig.nsf/a3807c5b4823c53f85256561006324be/1e2deed787c18fbc85256593006f843c?OpenDocument Rufus: The Information Sponge], Research Report RJ 8294 (75655), August 13, 1991, IBM Almaden Research Center
* 15. Metzler and Haas.
[http://portal.acm.org/citation.cfm?id=65943.65949 The Constituent Object Parser: Syntactic Structure Matching for Information Retrieval], Proceedings of the ACM SIGIR Conference, 1989, ACM Press
* 16. Nelson, T.H. [http://www.eastgate.com/catalog/LiteraryMachines.html Literary Machines], self published by Nelson, Nashville, Tenn., 1981. Did much to popularize hypertext; at the time of writing he has still not released a working product, though competitors such as HyperCard have done so with notable success.
* 17. Mozer, Michael C. [http://www.eric.ed.gov/ERICWebPortal/custom/portlets/recordDetails/detailmini.jsp?accno=ED245694 Inductive Information Retrieval Using Parallel Distributed Computation], UCLA, ICS Report No. 8406, June 1984
* 18. Pike, Rob and P.J. Weinberger ... The Hideous Name, AT&T Research Report
* 19. Pike, Rob, Presotto, Dave, Thompson, Ken, Trickey, Howard, Winterbottom, Phil [http://plan9.bell-labs.com/sys/doc/names.html The Use of Name Spaces in Plan 9]. Plan 9 is an operating system intended to be the successor to Unix, and greater integration of its name spaces is its primary focus.
* 20. Potter, Walter D. and Robert P. Trueblood [http://portal.acm.org/citation.cfm?id=45937 Traditional, semantic, and hyper-semantic approaches to data modeling], Computer, v21, '88, p53(11)
* 21. Rijsbergen, C. J. Van [http://www.dcs.gla.ac.uk/Keith/Preface.html Information Retrieval], 2nd ed., Butterworth and Co. Ltd., 1979. Printed in Great Britain by The Whitefriars Ltd., London and Tonbridge
* 22. Salton, G. (1986) [http://portal.acm.org/citation.cfm?id=6149 Another Look At Automatic Text-Retrieval Systems], Communications of the ACM, 29, 648-656
* 23. Smith, J.M. and D.C. Smith [http://portal.acm.org/citation.cfm?id=320546 Database Abstractions: Aggregation and Generalization], ACM Transactions on Database Systems, June 1977, pp. 105-133
* 24.
[http://www.win.tue.nl/~aeb/partitions/partition_types.html Partition types] by [mailto:aeb@cwi.nl Andries Brouwer], 2009-06-25

[[category:Reiser4]]

The Naming System Venture

By Hans Reiser

http://namesys.com

6114 La Salle ave., #405, Oakland, CA 94611

email: reiser@namesys.com

__TOC__

== Abstract ==

For too long the file system has been semantically impoverished in comparison with database and keyword systems. It is time to change! The current lack of features makes it much easier to adopt the latest set theoretic models than the older models of relational algebra or hypertext, and the current FS syntax fits nicely into the newer model. The utility of an operating system is more proportional to the number of connections possible between its components than it is to the number of those components. Namespace fragmentation is the most important determinant of that number of possible connections between OS components. Unix at its beginning increased the integration of I/O by putting devices into the file system name space. This is a winning strategy: let's take the file system name space and eliminate, one missing feature at a time, the reasons why the filesystem is inadequate for what other name spaces are used for. Only once we have done so will the hobbles be removed from OS architects, or even OS conspiracies. Yet before doing that, we need a core architecture for the semantics to ensure we end up with a coherent whole. This paper suggests a set theoretic model for those semantics. The relational models would at times unacceptably add structure to information, the keyword models would at times delete structure, and purely hierarchical models would create information mazes.
Reworking their primitives is required to synthesize the best attributes of these models in a way that allows one the flexibility to tailor the level of structure to the need of the moment. The set theoretic model I propose has a syntax that is upwardly compatible with Linux, MacOS, and DOS file system syntax, as well as with the CORBA naming layer. This is a planning document for the next major version of ReiserFS, that is, a description of vaporware. It is useful to ReiserFS users and contributors who want to know where we are going, and why we are building all sorts of strange optimizations into the storage layer (and especially to those who are willing to help shape the vision in the course of discussions on the {{listaddress}} mailing list...). Currently the storage layer for ReiserFS is working and useful as an everyday FS with conventional semantics. That storage layer is available as a GPL'd Linux kernel patch.

== Introduction ==

Many OS researchers have built hierarchical name spaces that innovate in their effect on the integration of the operating system (e.g. Plan 9 and its file system [Pike]). Relational and keyword researchers rightfully scorn hierarchical name spaces as 20 years behind the state of the art [Date], but pay little attention to integration of the operating system as a design objective in their own work, or as a possible influence on data model design. I won't go into that here. Limiting associations to single key words is an unnecessary restriction.

== A Naming System Should Reflect Rather than Mold Structure ==

The importance of not deleting the structure of information is obvious; few would advocate using the keyword model to unify naming. What can be more difficult to see is the harm from adding structure to information; some do recommend the relational model for unifying naming (e.g. OS/400).
By decomposing a primitive of a model into smaller primitives one can end up with a more general model, one with greater flexibility of application. This is the very normal practice of mathematicians, who in their work constantly examine mathematical models with an eye to finding a more fundamental set of primitives, in hopes that a new formulation of the model will allow the new primitives to function more independently, and thereby increase the generality and expressive power of the model. Here I break the relational primitive (a tuple is an unordered set of ordered pairs) into separate ordered and unordered set primitives. Relational systems force you to use unordered sets of ordered pairs when sometimes what you want is a simple unordered set. Why should a naming system match rather than mold the structure of information? For systems of low complexity, the reasons are deeply philosophical, which means uncompelling. And for multiterabyte distributed systems?... Reiser's Rule of Thumb #2: The most important characteristic of a very complex system is the user's inability to learn its structure as a whole. We must avoid adding structure, or guarantee that the user will be informed of all structure relevant to his partial information. Avoiding adding structure is both more feasible and less burdensome to the user. Hierarchical, relational, semantic, and hypersemantic systems all force structure on information, structure inherent in the system rather than the information represented. If a system adds structure, and the user is trying to exploit partial knowledge (such as a name embodies), then it inevitably requires the user to learn what was added before he can employ his partial knowledge. With complex systems, the amount added is beyond the capacity of users to learn, and information is lost. Example: <tt>"My name is Kali, your friendly whitepaper.html technical support specialist for REGRES. Our system puts the Library of Congress online! 
How may I help you."</tt>

George doesn't know Santa Claus' name: <tt>"I'm trying to find the reindeer chimneys christmas man, and I can't get your system to do it."</tt>

[[Image:Reindeer.jpg]]

FIGURE 1. Graphical representation of a typical simple unordered set that is difficult for relational systems.

Kali says: <tt>"OK, now let's define a query. '''is-a equals man''', that's easy. But reindeer? Is reindeer a property of this man?"</tt>

<tt>"Uh no. I wish I could remember the dude's name. I read this story about him a long time ago, and all I can remember is that he had something to do with reindeer and chimneys. The story is on-line, somewhere."</tt>

<tt>"Reindeer chimneys presents man, that's the sort of speech pattern I'd expect from a three year old."</tt> Kali corrects him. <tt>"Let's see if we can structure this properly. Is reindeer an '''instance-of''' of this man? A '''member-of''' of this man? It couldn't be a '''generalization''' of this man. Hmm..."</tt>

<tt>"No! It's not that complicated. They just have something to do with him."</tt>

<tt>"Pavlov would probably say you associate reindeer with this man, the way the unstructured mind of an animal thinks. But here in technical support we try to help our customers become more sophisticated. Is reindeer a property of this man?"</tt>

<tt>"No. Try '''propulsion-provider-for'''."</tt>

<tt>"Do you think that that was the schema the person who put the information in our system used?"</tt>

<tt>"No. Shoot. I can think of a dozen different columns it could be under. But what are the chances that the ones I think of are going to be the same as the ones the dude who put the information in used?"</tt>

Kali feels satisfaction. <tt>"Guess it can't be done, not if you can't structure your REGRES query properly.
I'll put you down in my log as a closed ticket, 190 seconds to resolution, not bad."</tt>

<tt>"A keyword system could handle reindeer chimneys christmas man."</tt> George grumbles as he stares in despair at his display. Unfortunately, the ''Library of Congress'' is only one of REGRES' many reference aids. George could spend his life at it, and he'd never learn its schema.

<tt>"But a keyword system would delete even necessary structure inherent to the information. It couldn't handle our other needs!"</tt> Kali says before she hangs up.

In addition to the searcher's difficulties, having to manufacture structure by specifying the column for reindeer also adds unnecessary cognitive load to the story author's indexing tasks.

== A Few of the Other Approaches to This Problem ==

There is lurking at the heart of my approach a subtle difference between my analysis of naming, and the analysis of at least some others. I started my research by systematically categorizing the different structures embodied by names, placing them into equivalency classes, and then picking one syntax out of each class of functionally equivalent naming structures, on the assumption that each of the equivalency classes has value. For example, I considered that languages sometimes convey structure by word endings (tags), and sometimes by word order, but while the syntax differs, the word order and word ending techniques are equivalent in their power to convey structure. In my analysis of the effect of word ordering I decided that either the ordering mattered, or it did not, and that was the basis for two different naming primitives. Others have instead studied the inherent structure of data, and then from that derived ways of naming. The hypersemantic system [Smith] [Potter] represents an attempt to pick a manageably few columns which cover all possible needs.
Generalization, aggregation, classification, and membership correspond to the is-a, has-property, is-an-instance-of, and is-a-member-of columns, respectively. The minor problem is that these columns don't cover all possibilities. They don't cover reindeer, presents, or chimneys for George's query. The major problem is that they don't correspond as closely as possible to the most common style of human thought, simple unordered association, and they require cognitive effort to transform.

The first response of relational database researchers to this is usually to ask: "Why not modify an existing relational database to contain an 'associated' column, put everything in that column, and it would be functionally equivalent to what you want?" This is like saying that you can do everything Pascal can do using TeX macros. (They are both Turing complete.) We don't design languages to simply be Turing complete, we design them to be useful. I have seen a colleague do in six lines of SQL (nonstandard SQL) a simple three keyword unordered set that I do in three words plus a pair of delimiters, and that traditional keyword systems also handle easily. Doing simple unordered sets well is crucial for highly heterogeneous name spaces, and the market success of keyword systems in Internet searching is evidence of that. If you look at the structure of names in human languages, they are not all tuple structured, and to make them tuple structured might be to distort them.

I have merely discussed the burden of naming columns. Most relational systems also require the user to specify the relation name. If column naming is a burden, naming both the column and the relation is no less a burden. Many systems invest effort into allowing you to take the key that you know, and figure out all the relation names and columns that you might choose to pair with it. This is a good idea, but not as good as not imposing extraneous structure to begin with.
[Salton] can be read for devastating critiques of the document clustering system, but there is a worthwhile idea lurking within that system. Perhaps it is worthwhile to keep track of a small number of documents which are "close" to a given document. The document creator could be informed upon auto-indexing the document what other documents appear to be close to it, and asked to consider associating it with them. This is not within our current plan of work, but I don't reject it conceptually.

In summary, modularity within the naming system is improved by recognizing unordered grouping and ordering as two different functions that deserve separate primitives rather than being combined into a tuple primitive. The tuple is an unordered set of ordered pairs. There are other useful combinations of unordered grouping and ordering than that embodied by the relation, and the success of keyword systems suggests that a plain unordered set without any ordering at all is the most fundamental and common of them.

== Names as Random Subsets of the Information In an Object ==

A system may still be effective when its assumptions are known to be false. You may regard the above as an overstatement of the notion that we are neural nets, and sometimes our abstract systems deal with assumptions that are not true or false, but are somewhat true. After we are finished stating them in English they lose the delicate weighting possessed by the reality of the situation. Sometimes we find it easier to model without that weighting. Classical economics and its assumption of perfect competition is the best known example of an effective system based on assumptions known to be substantially false. Introductory economics classes usually spend several weeks of class time arguing the merits of building models on somewhat false assumptions. This paper will now use such a somewhat false model to convey a feel for why mandatory pairing of name components causes problems.
Assume the user's information from which he tries to construct a description will be some completely random subset of the information about the object. (Some of that information will be structural, and the structural fragments selected will be just as random as the rest.) Assume a user has 15 random clues of information selected from 300 pieces of information the system knows about some object. Assume the REGRES naming system requires that data be supplied in threesomes (perhaps column name, key name, relation name), and cannot use one member of a threesome without the other members of the threesome. Assume the ANARCHY naming system lacks this restriction, but does so at the cost that it can only use those 10 of the 15 information fragments which do not embody structure. Assume the statistical distribution of the 15 pieces of information the user has to construct a name with is fully independent and equally likely (this is both substantially wrong, and unfair to REGRES, but .... ) Assume each clue has a selectivity of 100 (it divides the number of objects returned by 100).

Then ANARCHY has a selectivity of 100<sup>10</sup> = 10<sup>20</sup> = good.

REGRES has a selectivity of: 100<sup>(chance that the other two members of an object's threesome are possessed by user × 15)</sup> = 100<sup>(9/300 × 8/300 × 15)</sup> ≈ 1.06 = very bad

While it is not true that the clues are fully independent, it is true that to the extent that they are not fully dependent, ANARCHY will gain in selectivity compared to REGRES. Attempting to quantify for any database the extent of the dependence would be a nightmare, and so this model assumes a substantial falsity, through which it is hoped the reader can see a greater truth.
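The arithmetic above can be checked mechanically. A minimal Python sketch of the stated model — the numbers (15 of 300 clues, per-clue selectivity of 100, the 9/300 × 8/300 pairing chances) are the paper's assumptions, not measurements:

```python
# Model from the text: each usable clue divides the candidate
# set of objects by its selectivity of 100.
CLUE_SELECTIVITY = 100

# ANARCHY uses the 10 of the 15 clues that embody no structure.
anarchy = CLUE_SELECTIVITY ** 10                # 100^10 = 10^20

# REGRES can use a clue only when the other two members of its
# threesome are also among the user's clues; the text estimates
# the expected number of usable threesomes as (9/300)(8/300)(15).
expected_usable = (9 / 300) * (8 / 300) * 15    # = 0.012
regres = CLUE_SELECTIVITY ** expected_usable

print(anarchy)            # 10^20: "good"
print(round(regres, 2))   # 1.06: "very bad"
```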
For databases of the lower heterogeneity and complexity that the relational model was designed for, the independence within a threesome can be small, and the ability to also employ the 5 of 15 fragments which are structural is often more important than the difficulty of guessing any structure added. There is an implicit assumption here that you are looking for information that others have structured, and this argument in favor of ANARCHY becomes much less strong without this assumption. I feel obligated to stress once again that I do not advocate low structure over high structure, but I do advocate having the flexibility to match the amount of structure to the needs of the moment. Only with such flexibility can one hope to use all of the 15 fragments that happen to be possessed.

== The Syntax In More Detail ==

What's needed is a naming system intended to reflect just the structure inherent in the information, whatever that structure might be, rather than restructuring the information to fit the naming system.

=== Orthogonal or Unoriginal Primitives and Features ===

There are many primitives that the ultimate naming system would include but which I will not discuss here: macros, OR, weight for subnames and AND-OR connectors [Fox], rules, constraints, indirection, links, and others. I have tried to select only those aspects in which my approach differs from the standard approach. Unifying the namespace does not require unifying automatic name generation, and those who read the [Blair] vs. [Salton] controversy likely understand my concluding that whatever the benefits might be of unifying automatic name generation, it is not feasible now, and won't be feasible for a long time to come. The names one can assign an object are kept completely orthogonal from the contents of the object in the implementation of this naming layer.
It is up to the owner of the object to name it, and it is up to him to use whatever combination of autonaming programs and manual naming best achieves his purpose. He may name it on object creation, and he may continually adjust its various names throughout its lifetime. See the section defining the "Key_Object primitive" for a discussion of why names should be thought of this way. Technically, object creation only requires that the object be given a Storage_Key. In practice most users will, in the same act that creates the object, also associate the object with at least one name that will spare them from directly specifying the Storage_Key in hex the next time they make a reference to it. Applications implementing external name spaces can interact with the storage layer by referencing just the Storage_Key.

Namesys will provide a manual naming interface, and the API autonaming programs need to plug into. Companies such as Ecila will provide autonamers for various purposes. Ecila is implementing a program which scans remote stores, creates links to them in the unified name space, but leaves the data on the remote stores. Other programs may also be implemented to perform this general function. To be more specific, the Ecila search engine scans the web for documents in French, and uses the filesystem as an indexing engine. However, they are writing their engine to be general purpose; they have sold support and the addition of extensions to it to other search engine companies, and it is open source. For now we are simply functioning as part of their engine, and the interface is by web browser; at some point we may be able to add their functionality to the namespace.

While the implementation of Microsoft's attempt to blur the distinction between the filesystem name space and the web namespace is one more of appearance than substance, it is surely the right thing to do for Linux as well in the long run.
We should simply make our integration one with substance and utility, rather than integrating mostly the look and feel. When the store is external to the primary store for the namespace, then stale names can be an issue with no clean resolution. That said, unification at just the naming layer is, in a real rather than ideal world, often quite useful, and so we have Internet search engines.

GUI based naming is beyond the scope of this paper, except to mention that it is common for GUI namespaces to be designed such that they are not well integrated with the other namespaces of the OS. They are often thought to be necessarily less powerful, but proper integration would make this untrue, as they would then be additional syntaxes, not substitutes. These additional syntaxes should possess closure within the general name space, and thereby be capable of finding employment as components of compound names like all the other types of names. The compound names should be able to contain both GUI and non-GUI based name components. Integration would make them simply the aspect of naming that applies to what is present in the visual cache of the screen, and to how to manage and display that cache most effectively.

=== Vicinity Set Intersection Definition (Also Called Grouping) ===

Suppose you have a set X of objects. Suppose some of these objects are associated with each other. You can draw them as connected in a graph. Let the vicinity of an object A be the set of objects associated with A. Let there be a set of query objects Q. Then the set vicinity intersection of Q is the set of objects which are a member of all vicinities of the objects in Q. When thinking of this as a data model, it seems natural to use the term vicinity set intersection. When thinking of this syntactically, it seems natural to use the term "grouping", because it implies that the subnames are grouped together without the order of the subnames being significant.
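The definition above can be restated as a few lines of code. A toy sketch in Python — the association graph and the object names in it are invented for illustration, not part of the design:

```python
# Toy association graph: each object maps to its vicinity, i.e. the
# set of objects it is associated with. All names are invented.
vicinity = {
    "reindeer":  {"santa-story", "zoo-guide"},
    "chimneys":  {"santa-story", "masonry-manual"},
    "christmas": {"santa-story", "carol-book"},
    "man":       {"santa-story", "masonry-manual", "zoo-guide"},
}

def vicinity_set_intersection(query):
    """Return the objects that lie in the vicinity of every member
    of the query set Q."""
    if not query:
        return set()
    return set.intersection(*(vicinity[q] for q in query))

# The grouping [reindeer chimneys christmas man] resolves to the one
# object that all four query objects are associated with:
print(vicinity_set_intersection({"reindeer", "chimneys", "christmas", "man"}))
# -> {'santa-story'}
```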
There is exactly one data model primitive (set vicinity intersection) possessing exactly one syntax (grouping), and I rarely intend to distinguish data model primitive from syntax primitive (I can be criticized for this), and yet I use both terms for it; forgive me.

=== Synthesizing Ordering and Grouping ===

I am going to describe a toy naming system that allows focusing on how best to combine grouping and ordering into one naming system. This synthesis will contain the core features of the hierarchical, keyword, and relational systems as functional subsets. It consists of a few simple primitives, allowed to build on each other. It sets the discussion framework from which our project will over many years evolve a real naming system out of its current storage layer implementation.

Resolving the second component of an ordering is dependent on resolving the first --- unlike set theory. In set theory one can derive ordered set from unordered set, but because resolving the name of the second component depends on the first component one cannot do so in this naming system. For this reason it can well be argued that this naming system is not truly set theory based. Now that I have mentioned this difference I will start to call them grouping and ordering, rather than unordered and ordered set. These two primitives take other names as sub-names, and allow the user to construct compound names. Either the order of the subnames is significant (ordering), or it isn't (grouping), and thus we have the two different primitives.

Because I have myself found that BNFs are easier to read if preceded by examples, I will first list progressively more complex examples using the naming system, and then define it formally. The examples, and the simplified syntax, use / rather than : or \, but this is of no moment.
Examples:

<tt>/etc/passwd</tt>

[[Image:Passwd.jpg]]

Ordering and grouping are not just better; file system upward compatibility makes them cheaper for unifying naming in OSes based on hierarchical file systems than a relational naming system would be. This approach is fully upwardly compatible with the old file system. Users should be able to retain their old habits for as long as they wish, engage in a slow comfortable migration, and incorporate the new features into their habits as they feel the desire. Elderly programs should be untroubled in their operation. Many worthwhile projects fail because they emphasize how much they wish to change rather than asking of the user the minimal collection of changes necessary to achieve the added functionality.

[dragon gandalf bilbo]

[[Image:Bilbo.jpg]]

FIGURE 3. Graphical representation of ascii name on left. Mr. B. Bizy is looking for a dimly remembered story (The Hobbit by Tolkien) to print out and take with him for rereading during the annual company meeting.

case-insensitive/[computer privacy laws]

[[Image:Syntax-barrier.jpg]]

FIGURE 4. Graphical representation of ascii name on left.

When one subname contains no information except relative to another subname, and the order of the subnames is essential to the meaning of the name, then using ordering is appropriate. This most commonly occurs when syntax barriers are crossed. This is when a single compound name makes a transition from interpreting a subname according to the rules of one syntax to interpreting it according to the rules of another syntax. Ordering is essential at the boundary between the name of the new syntax as expressed in the current syntax, and the name to be interpreted according to that new syntax. Some researchers use the term context rather than syntax. The pairing of a program or function name, and the arguments it is passed, is inherently ordered.
While that is usually the concern of the shell, when we use a variety of ordering functions to sort Key_Objects of different types it affects the object store. In this example the ordering serves as a syntax barrier. Case-insensitive is the unabbreviated name of a directory that ignores the distinction between upper and lower case. For Linux compatibility this naming layer is case sensitive by default, even though I agree with those who think that it would be better were it not.

[my secrets]/[love letter susan]

[[Image:My-secrets.jpg]]

FIGURE 5. Graphical representation of ascii name on left.

Devhuman (that's the account name he chose) is the company's senior programmer. Six years ago he wrote a love letter to Susan, which he put in his read protected secrets directory. (He never found the nerve to send it to her.) He's looking for it so he can rewrite it, and then consider sending it. Security is a particular kind of syntax barrier (you have to squint a bit before you can see it that way). Here the ordering serves as a security barrier. (He certainly wouldn't want anyone to know that an object owned by him with attributes love letter susan existed.)

[subject/[illegal strike] to/elves from/santa document-type/RFC822 ultimatum]

[[Image:Ultimatum.jpg]]

FIGURE 6. Graphical representation of search for santa's ultimatum.

Devhuman knows his object store cold. He is looking for something he saw once before, he knows that it was auto-named by a particular namer he knows well (perhaps one whose functionality is similar to the classifier in [Messinger]), and he knows just what categorizations that namer uses when naming email. Still, he doesn't quite remember whether the word 'ultimatum' was part of the subject line, the body, or even was just elvish manual supplementation of the automatic naming. Rather than craft a query carefully specifying what he does and does not know about the possible categorizations of ultimatum, he lazily groups it.
If Devhuman's object store is implemented using this naming system with good style, someone less knowledgeable about the object store would also be able to say:

[santa illegal strike ultimatum elves]

and perhaps get some false hits as well as the desired email (instead of finding mail from santa perhaps finding the elvish response). Notice that if you delete the 'illegal' and 'ultimatum' to get

[subject/strike to/elves from/santa document-type/RFC822]

the query is structurally equivalent to a relational query. Many authors (e.g. semantic database designers) have written papers with good examples of standard column names which might be worth teaching to users. So long as they are an option made available to the user rather than a requirement demanded of the user, the increased selectivity they provide can be helpful.

[_is-a-shellscript bill]

[[Image:Pruner.jpg]]

FIGURE 7. Graphical representation of ascii name on left.

This name finds all shellscripts associated with bill. Names preceded by _ are pruners. Pruners are analogous to the predicate evaluators of relational database theory. If you have read papers distinguishing between recognition and retrieval, pruners are a recognition primitive. They are passed a list of objects, and return a subset of that list which matches some criteria. They are a mechanism appropriate for when a nonlinear search method that can deliver the desired functionality is either impossible, or not supported by existing indexes. There are many useful names for which we cannot do better than linear time search algorithms (perhaps simply as a result of incomplete indexing). _is-a-shellscript checks each member of its list to see if it is an executable object containing solely ascii. The user can use it just like any other Key_Object within an association; it will prune the results of the grouping. Since set intersections are commutative, its order within the grouping has no meaning, and optimizers are free to rearrange it.
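A pruner, then, is just an order-independent predicate applied to the list a grouping produces. A minimal sketch in Python — the object table and its executable/contents fields are invented, and a real _is-a-shellscript would inspect actual object contents:

```python
# Toy object store: name -> (is_executable, contents). Invented data.
objects = {
    "backup.sh": (True,  "#!/bin/sh\ntar cf /tmp/b.tar ."),
    "bill.jpg":  (False, "\xff\xd8 binary image bytes"),
    "run.sh":    (True,  "#!/bin/sh\necho bill"),
    "notes":     (False, "ask bill about backups"),
}

def is_a_shellscript(candidates):
    """Pruner: keep only executable objects whose contents are solely
    ascii. Each object's fate depends only on that object, never on
    the other candidates, so a query optimizer may apply pruners in
    any order without changing the result."""
    kept = []
    for name in candidates:
        executable, contents = objects[name]
        if executable and all(ord(ch) < 128 for ch in contents):
            kept.append(name)
    return kept

# [_is-a-shellscript bill]: prune the objects the grouping on 'bill' found.
print(is_a_shellscript(["backup.sh", "bill.jpg", "run.sh", "notes"]))
# -> ['backup.sh', 'run.sh']
```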
=== The Formal Definitions ===

{| border=1
| <Object Name> ::=
| <pre>
<Grouping> |
<Ordering> |
<Key_Object> |
<Storage_Key> |
<Orthogonal and Unoriginal Primitives I Will Not Define Here> |
;
</pre>
|}

See the section listing orthogonal and unoriginal primitives for a discussion of what primitives I left out of the definitions of this grammar that are necessary to a real world working system. The name resolver has a method for converting all of the primitives into '''<Storage_Keys>''', and when processing compound names it first converts the subnames into '''<Storage_Keys>''' (an object may have null contents, and serve purely to embody structure). This allows the use, as a component of a grouping or ordering, of anything for which anyone can invent a way of allowing the user to find an '''<Object Name>''', and a method for the resolver to convert that '''<Object Name>''' into a '''<Storage_Key>'''. In a word, closure. Extensible closure.

Compound names are interpreted by first interpreting the subnames that they are constructed from. At each stage of subname interpretation an '''<Object Name>''' is converted into a '''<Storage_Key>''' for the object that it is resolved to. The modules that implement the grouping and ordering primitives do not interpret the subnames; they merely pass them to the naming system, which returns the '''<Storage_Key>'''s they resolve to.

It was a long discussion which led to the use of storage keys rather than objectids. A storage key differs from an objectid in that it gives the storage layer directions as to where to try to locate the object in the logical tree ordering of the storage layer. If the logical location changes, then in the worst case we leave a link behind, and get an extra disk access like we get with an inode.
(Inode numbers are functionally objectids.) In the better case, the repacker eventually comes along, and changes all references by key to the new location, at least for all objects that have not given their key to external naming systems the repacker cannot repack. A '''<Storage_Key>''' is assigned by the system at object creation, and serves the purpose of allowing the system to concisely name the object, and provide hints to the storage layer about which objects should be packed near each other. The user does not directly interact with the '''<Storage_Key>''' any more often than C programmers hardcode pointers in hex. The packing locality of keys may be redefined.

== The Primitives ==

<Key_Object>

A description of the contents of an object using the syntax of the current directory. For objects used to embody keywords this may be the keyword in its entirety. If it contains spaces, etc., it must be enclosed in quotes. Note that making it easy for third parties to add plug-in directory types is part of Namesys's current contract with Ecila. Ecila wants space efficient directories suitable for use in implementing a term dictionary and its postings files for their Internet search engine.

Example: [reindeer chimneys presents man]

In this name, 'presents', 'reindeer', 'chimneys', and 'man' are the contents of objects associated with the Santa Claus story. Each of them is searched for by contents, and then when found they are converted into their Storage_Keys, and then the grouping algorithm is fed their four Storage_Keys. The grouping module then looks in the object headers of the four objects, gets the four sets of objects the Key_Objects group to, and performs a set intersection.
Besides greater closure, another advantage of storing Key_Objects as objects is that non-ascii Key_Objects and ordering functions can be implemented as a layer on top of the ascii naming system, allowing the user to interact with the naming system by pressing hyperbuttons, drawing pictures, making sounds, and supplying other non-ascii Key_Objects that the higher layers convert into Storage_Keys. There are endless content description techniques. If the directory owner supplies an ordering function for the Key_Objects in a directory, one can generate a search index for the directory using a directory plug-in which is fully orthogonal to the ordering function, though perhaps slower in some cases than one that is tailored for the ordering function. Users will find it easier to write ordering functions than index creation objects, and will not always need the speed of specialized indexes. We will need one ordering function for ascii text, another for numbers, another for sounds, perhaps someday one even for pictures of faces (perhaps to be used by a law enforcement agency constructing an electronic mug book, or a white pages implementation), etc. No system designer can provide all the different and sometimes esoteric ordering functions which users will want to employ. What we can do is create a library of code, from which users can construct their own ordering functions and their own directory plug-ins, and this is the approach we are taking on behalf of Ecila. For an Internet search engine one wants what is called a postings file, which is like a directory in that there is no need to support a byte offset, and one frequently wants to efficiently perform insertions into it.

<pre>
<Grouping> ::= [<Unordered List>] ;
<Unordered List> ::= <Unordered List> <Unordered List> | <Object Name> | <Pruner> ;
<Pruner> ::= _<Object Name>
</pre>

A <Grouping> is a list of object names and pruners whose order has no meaning.
Every object has a list of objects it groups to (associates with, in neural network idiom) in its object header. A grouping is interpreted by performing a set intersection of those lists for every object named in the grouping; in the sense of the data model, this is a set vicinity intersection.

Grouping is not transitive: [A] => B and [B] => C does not imply [A] => C, though it does imply that [[A]] => C.

A pruner is an <Object Name> which has been preceded with an _ to indicate that the object described should be passed a list of objects named by the rest of the grouping, executed, and it will return a subset of the list it was passed. Whether a member of the set is in the returned subset must be fully independent of what the other members of the set were, or else the results become indeterminate after application of a query optimizer, as with an optimizer in use there is no guarantee provided of the order of application of the pruners.

<pre>
<Ordering> ::= <Object Name>/<Object Name> | <Object Name>/<Custom Programmed Syntax>
<Custom Programmed Syntax> ::= Varies, provides extensibility hook.
</pre>

An ordering is a pairing of names, with the order representing information. The first component of the ordering determines the module to which the second component is passed as an argument. In contrast, a grouping first converts all subnames to Storage_Keys by looking through the same current directory for all of them in parallel, and then does its set intersection with the subdescriptions already resolved.

Example: In resolving [my secrets]/[love letter susan] the system would look for the objects with contents my and secrets, find both of them, and do a set intersection of all of the objects those two objects both group to (are associated with). This will allow it to find the [my secrets] directory, inside of which it will look for the three objects love, letter, and susan.
It will then extract from their object headers the sets of objects those three words ('love', 'letter', and 'susan') group to, and do a set intersection which will find the desired letter. The desired letter is not necessarily inside the [my secrets] directory, though in this case it probably is.

A directory is an object named by the first component of an ordering, to which the second component is passed, and which returns a set of Storage_Keys. One can in principle use different implementations of the same directory object without impacting the semantics and only affecting performance, as is often done in databases. There are flavors of directories:

Custom programmed directories, aka filters, are executable programs that return a Storage_Key when executed and fed the second component as an argument. They provide extensibility. (They are the ordered counterpart of pruners.) Another term for them is filter directories. Custom programmed directories whose name interpretation modules aren't unique to them will contain just the name of the module (filter), plus some directory dependent parameters to be passed to the module. It should be considered merely a syntax barrier directory, and not a fully custom programmed directory, if those parameters include a reference to a search tree that the module operates on, and if that search tree adheres to the default index structure. The connotations conveyed by the term 'filter' of there being an original which is distorted are not always appropriate, but in honesty this is not an issue about which we deeply care.

Syntax barrier directories allow you to describe the contents of the object they contain with a syntax different from their parents. Except for being sorted by a different ordering function, the indexes of syntax barrier directories are standard in their structure, and use a standard index traversal module. The index traversal module is ordering function independent.
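The resolution rule for orderings — resolve the first component to a directory object, then hand that object the second component — can be sketched as follows. The directory table, subnames, and hex keys are all invented for illustration; a real directory would consult its index and return Storage_Keys:

```python
# Toy directories: each maps the second component of an ordering to a
# Storage_Key. All names and keys here are invented.
def case_insensitive_dir(subname):
    # A syntax barrier directory: ignores upper/lower case distinctions.
    index = {"computer privacy laws": "0x2f1a"}
    return index[subname.lower()]

def my_secrets_dir(subname):
    # A (case sensitive) directory reached through a security barrier.
    index = {"love letter susan": "0x91c4"}
    return index[subname]

directories = {
    "case-insensitive": case_insensitive_dir,
    "[my secrets]": my_secrets_dir,
}

def resolve_ordering(first, second):
    """The first component names the directory module; the second
    component is passed to it unchanged. Swapping the two components
    is meaningless, which is why ordering is not commutative."""
    return directories[first](second)

print(resolve_ordering("case-insensitive", "Computer Privacy Laws"))
print(resolve_ordering("[my secrets]", "love letter susan"))
```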
There must be an ordering function for every <Key_Object> employed within a given syntax barrier directory. By contrast, a <Custom Programmed Syntax> could be anything which the syntax module somehow finds an object with, possibly even creating the object in order to be able to find it. To cross a security barrier directory the user must use an ordered pair of names with the security barrier as the first member of the pair, and he must satisfy the security module of the secured directory. A security barrier directory may be both a security and a syntax barrier directory, or the security barrier directory may share the syntax module of its parents. Fully standard directories are those built using the default directory module, and adding structure is their only semantic effect. There is an aspect of customization which is beyond the scope of this paper, in which one customizes the items employed by the storage layer to implement files and directories. That is, the storage of the files and directories is implemented by composing them of items, and these items have different types. We are now creating the code for packing and balancing arbitrary types of items using item handlers and object oriented balancing code, so as to make it easier to extend our filesystem. === Ordering can be implemented more efficiently than grouping === The set intersections performed in evaluating the grouping primitive are normally much more expensive computationally than performing the classical filesystem lookup. Imposing excess structure on one's data does not just at times reduce the cost of human thinking :-), it can be used to reduce the cost of automated computation as well. When the cost to a user of learning structure is less important than the burden on the machine, use of highly ordered names is often called for. === The Motivation for Different Syntactic Treatment of Ordering and Grouping, and Some of the Deeper Issues Revealed by the Difference === An important difference between grouping and ordering affects syntax. It allows us to represent an ordering with a single symbol ('/') placed between the pair, but requires two symbols ('[' and ']') for each grouping. Imagine using < and > as a two-symbol delimiter style alternative notation for ordering: <<father-of mother-of> sister-of> = <father-of <mother-of sister-of>> = <father-of mother-of sister-of> = father-of/mother-of/sister-of All of the expressions above are equivalent in referring to the paternal great aunt of the person who is the current context. The ones using nested pairs of symbols to enclose pairs of subnames imply a false structure that requires the user to think to realize the first two expressions are equivalent. The fourth is the notation this naming system employs. Grouping is different: Fast Acting Freddy is looking through the All-LA Shopping Database for a single store with black reebok sneakers, a green leather jacket, and a red beret so that he can dress an actor for a part before the director notices he forgot all about him. [[black reebok sneakers] [green leather jacket] [red beret]] is not equivalent to [black reebok sneakers green leather jacket red beret] which equals [red sneakers black reebok jacket green beret] Ordering is not algebraically commutative (father-of/mother-of is not equivalent to mother-of/father-of). Groupings are algebraically commutative. ([large red] = [red large]) == Style == As a general principle, a more restricted system can avoid requiring the user to repeatedly specify the restrictions, and if the user has no need to escape the restrictions then the restricted system may be superior. This is why "4GLs", which supply the structure for the user's query, are useful for some applications. They are typically implemented as layers on top of unrestricting systems such as this one. This paper has addressed issues surrounding finding information, particularly when the user's clues are faint. 
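The algebra claimed in the previous section, that nested ordering pairs flatten into one equivalent sequence while groupings ignore order entirely, can be checked with a small sketch (encoding orderings as tuples and groupings as frozensets is an illustrative assumption, not the system's storage format):

```python
# Nested ordering pairs flatten to one tuple, so
# <<father-of mother-of> sister-of> = father-of/mother-of/sister-of.
def flatten(pair):
    """Recursively flatten nested ordering pairs into one sequence."""
    out = []
    for part in pair:
        out.extend(flatten(part) if isinstance(part, tuple) else [part])
    return tuple(out)

a = flatten((("father-of", "mother-of"), "sister-of"))
b = flatten(("father-of", ("mother-of", "sister-of")))
assert a == b == ("father-of", "mother-of", "sister-of")

# Orderings are not commutative: the order carries information.
assert ("father-of", "mother-of") != ("mother-of", "father-of")

# Groupings are commutative: [large red] = [red large].
assert frozenset(["large", "red"]) == frozenset(["red", "large"])
```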
When supporting other user goals, such as exploring information, adding structure through substantial use of ordering can be helpful [Marchionini] [McAleese]. When the user goal is finding, one should assume that of all the fragments of information about an object, the user has some random subset of them. The goal is to allow the user to use that random subset in a name, whatever that subset might be. Some of that subset will be structural fragments. While requiring the user to supply a structure fragment is as foolish as requiring him to supply any other arbitrary fragment, allowing him to do so is laudable. In the best of all worlds the object store would incorporate all valid possible structurings of Key_Objects. The difficulty in implementing that is obvious. [Metzler and Haas] discuss ways of extracting structure from English text documents, and why one would want to be able to use that structure in retrievals. Unfortunately, there is an important difference between representing the structure of an English language sentence in a way that conveys its meaning, and representing it in a way that allows it to be found by someone who knows only a fragment of its semantic content. I doubt the wisdom of trying to advocate the use of more than essential structure in searching. You can allow users to avoid false structure; you cannot force them to. It is important to teach those creating the structure that if they group a personnel file with sex/female they should also group it with female. Type checking can impose structure usefully. Its implementation can enhance or reduce closure, depending on whether it is done right. === When To Decompound Groupings === There are dangers in excessive compounding of compound groupings analogous to those of excessive ordering. Let's examine two examples of compound groupings, both of which are valid both semantically and syntactically. 
One of them can be "decompounded" with moderate information loss, and the other loses all meaning if decompounded. Example: Finding a loquacious Celtic textbook salesman who told you in excruciating detail about how he was an ordnance researcher until one day he went to a Grateful Dead concert. [[Celtic textbook salesman] [ordnance researcher]] vs. [celtic textbook salesman ordnance researcher] These two phrasings of the same query are not equivalent, but they are "close." Our second example is the one in which Fast Acting Freddy tries to find a suspect by the objects he is associated with: [[black reebok sneakers] [green leather jacket] [red beret]] vs. [black reebok sneakers green leather jacket red beret] These two are not at all "close." The difference between the two examples of inequivalence is that the subdescriptions within the second example describe objects whose existence within the object store, independent of the object described, is worthwhile. Those of the first example do not, and it is more reasonable to try to design so that the "decompounded" version of the query is used. False hits will occur, but for large systems that's better than asking the user to learn structure. A higher level user interface might choose to present only one level to the user at a time, and then once the user confirms that a subdescription has resolved properly it would let him incorporate it into a higher level description. There might be 6 models of [black reebok sneakers], and Fast Acting Freddy should have the opportunity to click his mouse on the exact model, and have the interface substitute that object for his subdescription. Using such an interface an advanced user might simultaneously develop several subdescriptions, refine and resolve them, and then use the mouse to draw lines connecting them into a compound grouping. Closure makes it possible for that to work. 
== Examples of Creating Associations == <- creates an association between all of the objects on the left hand side and all of the objects on the right hand side. A - B is the set difference of A and B, and it resolves to the set of objects in A except for those that are in B. A & B resolves to the set intersection of A and B, the objects that are both in A and B. [A B] = [A] & [B], by definition.
 animal <- (lives, moves)
 mammal <- ([animal], animal, `warm blooded')
 cat <- ([mammal], hypernym/mammal, mammal, meronym/fur, fur, meronym/whiskers, whiskers, hypernym/quadruped, quadruped, capability/purr, purr, capability/meow, meow)
 Basil <- (owner/Nina, Nina, [siamese], siamese, clever, playful, brave/overly, brave, 'toilet explorer')
 bag <- ([container], container, consists-of/`highly flexible material', `highly flexible material')
 backpack <- ([bag], shoulderstrap/quantity/2, shoulderstrap, college-student, holonym/backpacker, meronym/shoulderstrap)
 mould <- ([fungi] - green/not, furry, `grows on'/surfaces/moist, `killed by'/chlorine)
 fungi <- ([plant], plant, leaves/no, flowers/no, green/not)
 bird <- ([vertebrate], vertebrate, flies, feathers)
 penguin <- ([bird] - flies, bird, hypernym/bird, swims, Linux, [Linux (mascot, symbol)])
 siamese <- ([cat], cat, hair/short, short-hair)
Notice how we don't associate siamese with short despite associating it with hair/short, but we do associate Basil with Nina as well as with owner/Nina.
 small <-0 little
The above means that small and little are synonyms, and are to be treated as 0 distance away from each other for vicinity calculation purposes. In traditional Unix terms, they are hardlinked together. Creating a serious ontology is not our field or task, but it is worth doing. The reader is referred to WordNet (free), and Cyc by Doug Lenat (proprietary). While we will focus on implementing primitives that allow for creating better ontologies, we are happy to work with persons interested in contributing or porting an ontology. 
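A minimal sketch of how such an association store might behave (the associate() and resolve() helpers are hypothetical stand-ins for the <- operator and [name] resolution; the definitions are reordered so each [name] is resolved after it is defined):

```python
store = {}          # object -> set of objects it groups to

def associate(obj, *rhs):
    """obj <- (rhs...): associate obj with every object on the right."""
    store.setdefault(obj, set()).update(rhs)

def resolve(name):
    """[name]: the association set of name."""
    return store.get(name, set())

associate("animal", "lives", "moves")
# mammal <- ([animal], animal, `warm blooded'): [animal] splices in
# the associations of animal.
associate("mammal", *resolve("animal"), "animal", "warm blooded")
associate("fungi", "plant", "leaves/no", "flowers/no", "green/not")
# mould <- ([fungi] - green/not, ...): set difference drops a member.
associate("mould", *(resolve("fungi") - {"green/not"}),
          "furry", "grows on/surfaces/moist", "killed by/chlorine")

# [A B] = [A] & [B]: a grouping is the intersection of the sets.
print(resolve("mammal") & resolve("mould"))
print("green/not" in resolve("mould"))
```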
== Other Projects Seeking To Increase Closure In The OS == === AT&T's Plan 9 === [Plan 9] is being produced by the original authors of Unix at AT&T research labs. It has influenced CORBA, and Linux's /proc is a direct steal from it. Their major focus is on integration. Their major trick for increasing integration is unifying the name space. Name spaces integrated into the Plan 9 file system include the status, control, virtual memory, and environment variables of running processes. They have a hierarchical analog to what the relational culture calls constructing views, which the Plan 9 culture calls context binding. === Microsoft's Information At Your Fingertips === Plan 9 ignores integration of application program name spaces, concentrating on OS oriented name spaces. Microsoft's "Information at Your Fingertips" name space integration effort appears to be taking the other approach, and focusing on integrating the name spaces of the various Microsoft applications via OLE and Structured Storage. The application group at Microsoft has long been better staffed and funded than the OS group, and FS developers have long preferred to simply ignore the needs of application builders generally. The primary semantic disadvantages of Microsoft's approach are primitives selected with insufficient care, a lack of closure, and the use of an object oriented rather than set oriented approach in both naming syntax and data model. Realistically, one can say that folks within Microsoft have often made statements favoring name space integration, and in various areas have successfully executed on it, but on the whole I rather suspect that the lack of someone in marketing making a business case for $X in revenue resulting from name space integration has crippled name space integration work at commercial OS producers generally, including MS. ==== Internet Explorer ==== Internet Explorer attempts to unify the filesystem and Internet namespaces. 
At the time of writing, the unification is so superficial, with so little substance, that I would describe it as having the look and feel of integration without most of the substance. Perhaps this will change. ==== Microsoft's Well-Known Performance Difficulties ==== Despite having many of the leading names in the industry on their payroll, they have somehow managed to create a file system implementation with performance so terrible that it is, for the Unix customer base, a significant consideration contributing to hesitation in moving to NT. It may well have the worst performance of any of the major OS file systems. Their implementation of OLE's structured storage offers extremely poor performance, and their excuse that it is due to the incorporation of transaction concepts into their design is just a reminder that they did a poor job at that as well. They managed to implement something intended to store small objects within a file, and implemented it such that it still suffers from 512-byte granularity problems, problems that they try to somewhat overcome by encouraging the packing of several objects within "storages" at horrible kludge cost. === Storage Layers Above the FS: A Sure Symptom That the FS Developer Has Failed === When filesystems aren't really designed for the needs of the storage layers above them, and none of them are, not Microsoft's, not anybody's, then layering results in enormous performance loss. The very existence of a storage layer above the filesystem means that the filesystem team at an OS vendor failed to listen to someone, and that someone was forced to go and implement something on their own. You just have to listen to one of these meetings in which some poor application developer tries to suggest that more features in the FS would be nice; I heard one at a nameless OS vendor. 
The FS team responds by saying that disks are cheap, small object storage isn't really important, we haven't changed the disk layout in 10 years, and changing it isn't going to fly with the gods above us about whom we can do nothing. At these meetings you start to understand that most people who go into filesystem design are persons who didn't have the guts to pursue a more interesting field in CS. There is a sort of reverse increasing returns effect that governs FS research, in which the more code becomes fixed on the current APIs, the more persons in the field react with fear to any thought of FS semantics being other than a dead research topic, the less research gets done, and the fewer persons of imagination see a reason to enter the field. Every time one vendor gets a little ahead in adding functionality, the other vendors go on a FUD campaign about it breaking standards and therefore being dangerous for mission critical usage. This is a field in which only performance research is allowed, and every other aspect is simply dead. Namesys seeks to raise the dead, and is willing to commit whatever unholy acts that requires. There is no need for two implementations of the set primitive, one called directories, the other called a file with streams, each having a different interface. File systems should just implement directories right, give them some more optional features, and then there is no need at all for streams. If you combine allowing directory names to be overloaded to also be filenames when acted on as files, allowing stat data to be inherited, allowing file bodies to be inherited, and implement filters of various kinds, then in the event that the user happens to need the precise peculiar functionality embodied by streams, they can have it by just configuring their directory in a particular way. There was a lengthy Linux-kernel thread on this topic which I won't repeat in more detail here. 
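To illustrate the claim that directories done right subsume streams, here is a toy model (not the actual ReiserFS design; the Node class and names are invented) of a node overloaded to act as both file and directory, so that a "stream" is just a child entry reached by one more name component:

```python
class Node:
    """A node that is a file when read directly and a directory when
    a further name component is supplied."""
    def __init__(self, body=b""):
        self.body = body          # contents when read as a file
        self.entries = {}         # children when used as a directory

    def lookup(self, path):
        """Resolve a/b/c; an empty path reads this node as a file."""
        if not path:
            return self.body
        head, _, rest = path.partition("/")
        return self.entries[head].lookup(rest)

doc = Node(body=b"main document text")
doc.entries["icon"] = Node(body=b"icon bits")   # the "stream"

root = Node()
root.entries["report"] = doc
print(root.lookup("report"))        # the file body
print(root.lookup("report/icon"))   # the stream, via directory naming
```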
The tree architecture of the storage layer of this FS design will lend itself to a distributed caching system much more effectively than the Microsoft storage layer, in part due to its ability to cache not just hits and misses of files, but to cache semantic localities (ranges). For more on this topic see later in this paper. === Rufus === The Rufus system [Messinger et al.] indexes information while leaving it in its original location and format. While it does allow the user to create a unified name space, it does not choose to integrate that name space into the operating system. Even so, it is immensely useful in practice, and strongly hints at what the OS could gain if it had a more than hierarchical name space with a data model oriented towards what [Messinger] calls "semi-structured information," such as you find in the RFC822 format for email. When you have 7000 pieces of mail, and searching the mail linearly with a utility like grep takes 10 minutes, it is nice to be able to quickly keyword search via inverted indexes for the mail whose from: field contains billg and that has the words "exclusive" and "bundling" in the body of the message, as you hurriedly search for an old email just before an appearance at court. === Semantic File System === The Semantic File System comes closest to addressing the needs I have described. It is a Unix compatible file system with more than hierarchical naming (attribute-based is the term they use). Its data model unfortunately has the important flaw of lacking closure (in it, names of objects are not themselves objects). In my upcoming discussion of the unnecessary lack of closure in hypertext products, notice that the arguments apply to the Semantic File System (and so I won't duplicate them here). === OS/400 === IBM's OS/400 employs a unified relational name space. The section of this paper entitled A System Should Reflect Rather than Mold Structure will cover its problems of forcing false structure. 
Inadequate closure due to mandatory type checking is another source of difficulties for it. While users moan about these two unnecessary design flaws, the essence of the opinions AS/400 partisans have expressed to me has been that the unification of its name space is a great advantage that OS/400 has over Unix. I claim these users were right, and later in this paper will propose doing something about it. == Conclusion == While I spent most of this paper on why adding structure to information can be harmful, particularly when it is intended to be found by others sifting through large amounts of other information, this was purely because it is a harder argument than why deleting structure is harmful. My goal was not to be better at unstructured applications than keyword systems, or better at structured applications than the hierarchical and relational systems --- the goal is to be more flexible in allowing the user to choose how structured to be, while still being within a single name space. I claimed that multiple fragmented name spaces cannot match the power and ease of name spaces integrated with closure: closure makes a naming system far more powerful by increasing its ability to compound complex descriptions out of simpler ones. The strong points of this naming system's design are various forms of generalizing abstractions already known to the literature, for greater closure. == Acknowledgments == David P. Anderson and Clifford Lynch helped enormously in rounding out my education, and improving my paper. Their generosity with their time was remarkable. David P. Anderson was simply a great professor, and it was a privilege to work with him. Brian Harvey informed me that it wasn't too obvious to mention that an object store should be unified. Cimmaron Taylor provided me with many valuable late night discussions in the early stages of this paper. 
I would like to thank Bill Cody and Guy Lohman of the database group at the IBM Almaden Research Center for a wonderful learning experience. Vladimir Saveliev kept this file system going when others fell by the wayside. He started as the most junior programmer on the team, and through sheer hard work and dedication to excellence outshone all the other more senior researchers. Of course after some time he could no longer be considered a junior programmer. NOTE: See also the DARPA-funded, but not endorsed, [[Txn-doc|Reiser4 Transaction Design Document]] and [[Reiser4|Reiser4 Whitepaper]]. == References == * 1. Blair, David C. and Maron, M. E. [http://portal.acm.org/citation.cfm?doid=3166.3197 Evaluation of Retrieval Effectiveness for a Full-Text Document-Retrieval System] Communications of the ACM v 28 n 3 Mar 1985 pp. 289-299 * 2. Codd, E. F. [http://portal.acm.org/citation.cfm?id=77708 The Relational Model for Database Management: version 2] c1990 Addison-Wesley Pub. Co., not recommended as a textbook, Date's is better for that, but worthwhile if you want a long paper by Codd. Notice that he places greater emphasis on closure, and design methodology principles in general, than designers of other naming systems such as hypertext. * 3. Date, C.J. [http://portal.acm.org/citation.cfm?id=4198 An Introduction to Database Systems], 4th ed. Reading, Mass.: Addison-Wesley Pub. Co., c1986. Contains a well-written substantive textbook sneer at the problems of hierarchical naming systems, and a well-annotated bibliography. * 4. Curtis, Ronald and Larry Wittie [http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?isnumber=35714&arnumber=1695185 Global Naming in Distributed Systems] IEEE Software July 1984 p76-80 * 5. Feldman, Jerome A., Mark A. Fanty, Nigel H. Goddard and Kenton J. Lynne, [http://portal.acm.org/citation.cfm?id=42372.42378 Computing with Structured Connectionist Networks] Communications of the ACM, v31 Feb '88, p170(18) * 6. Fox, E. A., and Wu, H. 
[http://portal.acm.org/citation.cfm?id=358466 Extended Boolean Information Retrieval], Communications of the ACM, 26, 1983, pp. 1022-1036 * 7. Gallant, Stephen I., [http://portal.acm.org/citation.cfm?id=42377 Connectionist Expert Systems], Communications of the ACM, v31 Feb '88, p152(18) * 8. Gates, Bill. Comdex '91 speech on [http://findarticles.com/p/articles/mi_m0REL/is_n11_v90/ai_9715919/ Information at Your Fingertips] available for $8 on videotape from Microsoft's sales department. * 9. Gifford, David K., Jouvelot, Pierre, Sheldon, Mark A., O'Toole, James W. Jr., [http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.17.4726 Semantic File Systems], Operating Systems Review Volume 25, Number 5, October 13-16, 1991. They demonstrated that extending Unix file semantics to include nonhierarchical features is useful and feasible. Unfortunately, their naming system lacks closure. * 10. Gilula, Mikhail. [http://portal.acm.org/citation.cfm?id=174888 The Set Model for Database and Information Systems], 1st Edition, c1994, Addison-Wesley, provides a Set Theoretic Database Model in which relational algebra is shown to be a special case of a more general and powerful set theoretic approach. * 11. [http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.23.4527 Joint Object Services Submission] (JOSS), OMG TC Document 93.5.1 * 12. Marchionini, Gary, and Shneiderman, Ben. [http://portal.acm.org/citation.cfm?id=619765 Finding Facts vs. Browsing Knowledge in Hypertext Systems] Computer, January 1988, p. 70 * 13. McAleese, Ray "Hypertext: Theory into Practice" edited by Ray McAleese, ABLEX Publishing Corporation, Norwood, NJ 07648 * 14. Messinger, Eli, Shoens, Kurt, Thomas, John, Luniewski, Allen [http://domino.watson.ibm.com/library/cyberdig.nsf/a3807c5b4823c53f85256561006324be/1e2deed787c18fbc85256593006f843c?OpenDocument Rufus: The Information Sponge] Research Report RJ 8294 (75655) August 13, 1991, IBM Almaden Research Center * 15. Metzler and Haas. 
[http://portal.acm.org/citation.cfm?id=65943.65949 The Constituent Object Parser: Syntactic Structure Matching for Information Retrieval], Proceedings of the ACM SIGIR Conference, 1989, ACM Press * 16. Nelson, T.H. [http://www.eastgate.com/catalog/LiteraryMachines.html Literary Machines], self-published by Nelson, Nashville, Tenn., 1981, did much to popularize hypertext; at the time of writing he has still not released a working product, though competitors such as HyperCard have done so with notable success. * 17. Mozer, Michael C. [http://www.eric.ed.gov/ERICWebPortal/custom/portlets/recordDetails/detailmini.jsp?accno=ED245694 Inductive Information Retrieval Using Parallel Distributed Computation], ICS Report No. 8406, June 1984, UCLA * 18. Pike, Rob and P.J. Weinberger ... The Hideous Name "AT&T Research Report" * 19. Pike, Rob, Presotto, Dave, Thompson, Ken, Trickey, Howard, Winterbottom, Phil. [http://plan9.bell-labs.com/sys/doc/names.html The Use of Name Spaces in Plan 9]. Plan 9 is an operating system intended to be the successor to Unix, and greater integration of its name spaces is its primary focus. * 20. Potter, Walter D. and Robert P. Trueblood, [http://portal.acm.org/citation.cfm?id=45937 Traditional, semantic, and hyper-semantic approaches to data modeling] v21 Computer '88 p53(11) * 21. Rijsbergen, C. J. Van, [http://www.dcs.gla.ac.uk/Keith/Preface.html Information Retrieval] - 2nd. ed., Butterworth and Co. Ltd., 1979, Printed in Great Britain by The Whitefriars Ltd., London and Tonbridge * 22. Salton, G. (1986) [http://portal.acm.org/citation.cfm?id=6149 Another Look At Automatic Text-Retrieval Systems], Communications of the ACM, 29, 648-656 * 23. Smith, J.M. and D.C. Smith, [http://portal.acm.org/citation.cfm?id=320546 Database Abstractions: Aggregation and Generalization], ACM Transactions on Database Systems, June 1977, pp. 105-133 * 24. 
[http://www.win.tue.nl/~aeb/partitions/partition_types.html Partition types] by [mailto:aeb@cwi.nl Andries Brouwer], 2009-06-25 [[category:Reiser4]] The Naming System Venture == Abstract == For too long the file system has been semantically impoverished in comparison with database and keyword systems. It is time to change! The current lack of features makes it much easier to use the latest set theoretic models rather than older models of relational algebra or hypertext. The current FS syntax fits nicely into the newer model. The utility of an operating system is more proportional to the number of connections possible between its components than it is to the number of those components. Namespace fragmentation is the most important determinant of that number of possible connections between OS components. Unix at its beginning increased the integration of I/O by putting devices into the file system name space. This is a winning strategy: let's take the file system name space and eliminate, one missing feature at a time, the reasons why the filesystem is inadequate for what other name spaces are used for. Only once we have done so will the hobbles be removed from OS architects, or even OS conspiracies. Yet before doing that, we need a core architecture for the semantics to ensure we end up with a coherent whole. This paper suggests a set theoretic model for those semantics. The relational models would at times unacceptably add structure to information, the keyword models would at times delete structure, and purely hierarchical models would create information mazes. Reworking their primitives is required to synthesize the best attributes of these models in a way that allows one the flexibility to tailor the level of structure to the need of the moment. 
The set theoretic model I propose has a syntax that is upwardly compatible with Linux, MacOS, and DOS file system syntax, as well as with the CORBA naming layer. This is a planning document for the next major version of ReiserFS, that is, a description of vaporware. It is useful to ReiserFS users and contributors who want to know where we are going, and why we are building all sorts of strange optimizations into the storage layer (and especially those who are willing to help shape the vision in the course of discussions on the {{listaddress}} mailing list....). Currently the storage layer for ReiserFS is working and useful as an everyday FS with conventional semantics. That storage layer is available as a GPL'd Linux kernel patch. == Introduction == Many OS researchers have built hierarchical name spaces that innovate in their effect on the integration of the operating system (e.g. Plan 9 and its file system [Pike]). Relational and keyword researchers rightfully scorn hierarchical name spaces as 20 years behind the state of the art [Date], but pay little attention to integration of the operating system as a design objective in their own work, or as a possible influence on data model design. I won't go into that here. Limiting associations to single key words is an unnecessary restriction. == A Naming System Should Reflect Rather than Mold Structure == The importance of not deleting the structure of information is obvious; few would advocate using the keyword model to unify naming. What can be more difficult to see is the harm from adding structure to information; some do recommend the relational model for unifying naming (e.g. OS/400). By decomposing a primitive of a model into smaller primitives one can end up with a more general model, one with greater flexibility of application. 
This is the very normal practice of mathematicians, who in their work constantly examine mathematical models with an eye to finding a more fundamental set of primitives, in hopes that a new formulation of the model will allow the new primitives to function more independently, and thereby increase the generality and expressive power of the model. Here I break the relational primitive (a tuple is an unordered set of ordered pairs) into separate ordered and unordered set primitives. Relational systems force you to use unordered sets of ordered pairs when sometimes what you want is a simple unordered set. Why should a naming system match rather than mold the structure of information? For systems of low complexity, the reasons are deeply philosophical, which means uncompelling. And for multiterabyte distributed systems?... Reiser's Rule of Thumb #2: The most important characteristic of a very complex system is the user's inability to learn its structure as a whole. We must avoid adding structure, or guarantee that the user will be informed of all structure relevant to his partial information. Avoiding adding structure is both more feasible and less burdensome to the user. Hierarchical, relational, semantic, and hypersemantic systems all force structure on information, structure inherent in the system rather than the information represented. If a system adds structure, and the user is trying to exploit partial knowledge (such as a name embodies), then it inevitably requires the user to learn what was added before he can employ his partial knowledge. With complex systems, the amount added is beyond the capacity of users to learn, and information is lost. Example: <tt>"My name is Kali, your friendly technical support specialist for REGRES. Our system puts the Library of Congress online! 
How may I help you."</tt> George doesn't know Santa Claus' name: <tt>"I'm trying to find the reindeer chimneys christmas man, and I can't get your system to do it."</tt> [[Image:Reindeer.jpg]] FIGURE 1. Graphical representation of a typical simple unordered set that is difficult for relational systems. Kali says: <tt>"OK, now let's define a query. '''is-a equals man''', that's easy. But reindeer? Is reindeer a property of this man?"</tt> <tt>"Uh no. I wish I could remember the dude's name. I read this story about him a long time ago, and all I can remember is that he had something to do with reindeer and chimneys. The story is on-line, somewhere."</tt> <tt>"Reindeer chimneys presents man, that's the sort of speech pattern I'd expect from a three-year-old."</tt> Kali corrects him. <tt>"Let's see if we can structure this properly. Is reindeer an '''instance-of''' of this man? A '''member-of''' of this man? It couldn't be a '''generalization''' of this man. Hmm..."</tt> <tt>"No! It's not that complicated. They just have something to do with him."</tt> <tt>"Pavlov would probably say you associate reindeer with this man, the way the unstructured mind of an animal thinks. But here in technical support we try to help our customers become more sophisticated. Is reindeer a property of this man?"</tt> <tt>"No. Try '''propulsion-provider-for'''."</tt> <tt>"Do you think that that was the schema the person who put the information in our system used?"</tt> <tt>"No. Shoot. I can think of a dozen different columns it could be under. But what are the chances that the ones I think of are going to be the same as the ones the dude who put the information in used?"</tt> Kali feels satisfaction. <tt>"Guess it can't be done, not if you can't structure your REGRES query properly. 
I'll put you down in my log as a closed ticket, 190 seconds to resolution, not bad."</tt> <tt>"A keyword system could handle reindeer chimneys christmas man."</tt> George grumbles as he stares in despair at his display. Unfortunately, the ''Library of Congress'' is only one of REGRES' many reference aids. George could spend his life at it, and he'd never learn its schema. <tt>"But a keyword system would delete even necessary structure inherent to the information. It couldn't handle our other needs!"</tt> Kali says before she hangs up. In addition to the searcher's difficulties, having to manufacture structure by specifying the column for reindeer also adds unnecessary cognitive load to the story author's indexing tasks. == A Few of the Other Approaches to This Problem == There is lurking at the heart of my approach a subtle difference between my analysis of naming, and the analysis of at least some others. I started my research by systematically categorizing the different structures embodied by names, placing them into equivalency classes, and then picking one syntax out of each class of functionally equivalent naming structures, on the assumption that each of the equivalency classes has value. For example, I considered that languages sometimes convey structure by word endings (tags), and sometimes by word order, but while the syntax differs, the word order and word ending techniques are equivalent in their power to convey structure. In my analysis of the effect of word ordering I decided that either the ordering mattered, or it did not, and that was the basis for two different naming primitives. Others have instead studied the inherent structure of data, and then from that derived ways of naming. The hypersemantic system [Smith] [Potter] represents an attempt to pick a manageably few columns which cover all possible needs. 
Generalization, aggregation, classification, and membership correspond to the is-a, has-property, is-an-instance-of, and is-a-member-of columns, respectively. The minor problem is that these columns don't cover all possibilities. They don't cover reindeer, presents, or chimneys for George's query. The major problem is that they don't correspond as closely as possible to the most common style of human thought, simple unordered association, and require cognitive effort to transform. The first response of relational database researchers to this is usually to ask: "Why not modify an existing relational database to contain an 'associated' column, put everything in that column, and it would be functionally equivalent to what you want?" This is like saying that you can do everything Pascal can do using TeX macros. (They are both Turing complete.) We don't design languages to simply be Turing complete, we design them to be useful. I have seen a colleague take six lines of (nonstandard) SQL to express a simple three-keyword unordered set that I express in three words plus a pair of delimiters, and that traditional keyword systems also handle easily. Doing simple unordered sets well is crucial for highly heterogeneous name spaces, and the market success of keyword systems in Internet searching is evidence of that. If you look at the structure of names in human languages, they are not all tuple structured, and to make them tuple structured might be to distort them. I have merely discussed the burden of naming columns. Most relational systems also require the user to specify the relation name. If column naming is a burden, naming both the column and the relation is no less a burden. Many systems invest effort into allowing you to take the key that you know, and figure out all the relation names and columns that you might choose to pair with it. This is a good idea, but not as good as not imposing extraneous structure to begin with. 
[Salton] can be read for devastating critiques of the document clustering system, but there is a worthwhile idea lurking within that system. Perhaps it is worthwhile to keep track of a small number of documents which are "close" to a given document. The document creator could be informed upon auto-indexing the document what other documents appear to be close to it, and asked to consider associating it with them. This is not within our current plan of work, but I don't reject it conceptually. In summary, modularity within the naming system is improved by recognizing unordered grouping and ordering as two different functions that deserve separate primitives rather than being combined into a tuple primitive. The tuple is an unordered set of ordered pairs. There are other useful combinations of unordered grouping and ordering than that embodied by the relation, and the success of keyword systems suggests that a plain unordered set without any ordering at all is the most fundamental and common of them. == Names as Random Subsets of the Information In an Object == A system may still be effective when its assumptions are known to be false. You may regard the above as an overstatement of the notion that we are neural nets, and sometimes our abstract systems deal with assumptions that are not true or false, but are somewhat true. After we are finished stating them in English they lose the delicate weighting possessed by the reality of the situation. Sometimes we find it easier to model without that weighting. Classical economics and its assumption of perfect competition is the best-known example of an effective system based on an assumption known to be substantially false. Introductory economics classes usually spend several weeks of class time arguing the merits of building models on somewhat false assumptions. This paper will now use such a somewhat false model to convey a feel for why mandatory pairing of name components causes problems. 
Assume the user's information from which he tries to construct a description will be some completely random subset of the information about the object. (Some of that information will be structural, and the structural fragments selected will be just as random as the rest.) Assume a user has 15 random clues of information selected from 300 pieces of information the system knows about some object. Assume the REGRES naming system requires that data be supplied in threesomes (perhaps column name, key name, relation name), and cannot use one member of a threesome without the other members of the threesome. Assume the ANARCHY naming system lacks this restriction, but does so at the cost that it can only use those 10 of the 15 information fragments which do not embody structure. Assume the statistical distribution of the 15 pieces of information the user has to construct a name with are fully independent and equally likely (this is both substantially wrong, and unfair to REGRES, but .... ) Assume each clue has a selectivity of 100 (it divides the number of objects returned by 100). Then ANARCHY has a selectivity of 100<sup>10</sup> = 10<sup>20</sup> = good. REGRES has a selectivity of: 100<sup>(chance that the other two members of an object's threesome are possessed by the user × 15)</sup> = 100<sup>(9/300 × 8/300 × 15)</sup> = 1.06 = very bad. While it is not true that the clues are fully independent, it is true that to the extent that they are not fully dependent, ANARCHY will gain in selectivity compared to REGRES. Attempting to quantify for any database the extent of the dependence would be a nightmare, and so this model assumes a substantial falsity, through which it is hoped the reader can see a greater truth. 
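The arithmetic of this toy model can be checked directly. This is a sketch of the model only; ANARCHY, REGRES, and the clue counts are the hypothetical quantities defined above, not measurements of any real system:

```python
# Toy selectivity model from the text: the user holds 15 of 300 facts
# about an object, and each usable clue divides the candidate set by 100.
per_clue = 100

# ANARCHY: all 10 non-structural clues are usable independently.
anarchy = per_clue ** 10          # 100^10 = 10^20

# REGRES: a clue helps only when the other two members of its threesome
# are also among the user's clues; the text estimates that chance as
# (9/300) * (8/300) per clue, summed over 15 clues.
expected_usable = (9 / 300) * (8 / 300) * 15
regres = per_clue ** expected_usable

print(f"ANARCHY selectivity: {anarchy:.3g}")   # ~1e+20
print(f"REGRES selectivity:  {regres:.2f}")    # ~1.06
```

As the text notes, the independence assumption is substantially false; the point of the sketch is only the enormous gap between the two exponents.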
For databases of the lower heterogeneity and complexity that the relational model was designed for, the independence within a threesome can be small, and the ability to also employ the 5 of 15 fragments which are structural is often more important than the difficulty of guessing any structure added. There is an implicit assumption here that you are looking for information that others have structured, and this argument in favor of ANARCHY becomes much less strong without this assumption. I feel obligated to stress once again that I do not advocate low structure over high structure, but I do advocate having the flexibility to match the amount of structure to the needs of the moment. Only with such flexibility can one hope to use all of the 15 fragments that happen to be possessed. == The Syntax In More Detail == What's needed is a naming system intended to reflect just the structure inherent in the information, whatever that structure might be, rather than restructuring the information to fit the naming system. === Orthogonal or Unoriginal Primitives and Features === There are many primitives that the ultimate naming system would include but which I will not discuss here: macros, OR, weight for subnames and AND-OR connectors [Fox], rules, constraints, indirection, links, and others. I have tried to select only those aspects in which my approach differs from the standard approach. Unifying the namespace does not require unifying automatic name generation, and those who read the [Blair] vs. [Salton] controversy likely understand my concluding that whatever the benefits might be of unifying automatic name generation, it is not feasible now, and won't be feasible for a long time to come. The names one can assign an object are kept completely orthogonal from the contents of the object in the implementation of this naming layer. 
It is up to the owner of the object to name it, and it is up to him to use whatever combination of autonaming programs and manual naming best achieves his purpose. He may name it on object creation, and he may continually adjust its various names throughout its lifetime. See the section defining the "Key_Object primitive" for a discussion of why names should be thought of this way. Technically, object creation only requires that the object be given a Storage_Key. In practice most users will, in the same act that creates the object, also associate the object with at least one name that will spare them from directly specifying the Storage_Key in hex the next time they make a reference to it. Applications implementing external name spaces can interact with the storage layer by referencing just the Storage_Key. Namesys will provide a manual naming interface, and the API that autonaming programs need to plug into. Companies such as Ecila will provide autonamers for various purposes. Ecila is implementing a program which scans remote stores and creates links to them in the unified name space, but leaves the data on the remote stores. Other programs may also be implemented to perform this general function. To be more specific, the Ecila search engine scans the web for documents in French, and uses the filesystem as an indexing engine. However, they are writing their engine to be general purpose; they have sold support for it, and the addition of extensions to it, to other search engine companies, and it is open source. For now we are simply functioning as part of their engine, and the interface is by web browser; at some point we may be able to add their functionality to the namespace. While the implementation of Microsoft's attempt to blur the distinction between the filesystem name space and the web namespace is one more of appearance than substance, it is surely the right thing to do for Linux as well in the long run. 
We should simply make our integration one with substance and utility, rather than integrating mostly the look and feel. When the store is external to the primary store for the namespace, then stale names can be an issue with no clean resolution. That said, unification at just the naming layer is, in a real rather than ideal world, often quite useful, and so we have Internet search engines. GUI based naming is beyond the scope of this paper, except to mention that it is common for GUI namespaces to be designed such that they are not well integrated with the other namespaces of the OS. They are often thought to be necessarily less powerful, but proper integration would make this untrue, as they would then be additional syntaxes, not substitutes. These additional syntaxes should possess closure within the general name space, and thereby be capable of finding employment as components of compound names like all the other types of names. The compound names should be able to contain both GUI and non-GUI based name components. Integration would make them simply the aspect of naming that applies to what is present in the visual cache of the screen, and to how to manage and display that cache most effectively. === Vicinity Set Intersection Definition (Also Called Grouping) === Suppose you have a set X of objects. Suppose some of these objects are associated with each other. You can draw them as connected in a graph. Let the vicinity of an object A be the set of objects associated with A. Let there be a set of query objects Q. Then the set vicinity intersection of Q is the set of objects which are a member of all vicinities of the objects in Q. When thinking of this as a data model, it seems natural to use the term vicinity set intersection. When thinking of this syntactically, it seems natural to use the term "grouping", because it implies that the subnames are grouped together without the order of the subnames being significant. 
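The definition above can be sketched in a few lines of Python. The association graph and the object names are invented for illustration; in the real system the vicinities would live in object headers, not a dictionary:

```python
# Each object maps to its vicinity: the set of objects associated with it.
vicinity = {
    "reindeer": {"santa-story", "zoo-guide"},
    "chimneys": {"santa-story", "masonry-manual"},
    "man":      {"santa-story", "zoo-guide", "masonry-manual"},
}

def vicinity_intersection(query):
    """Return the objects present in every query member's vicinity."""
    sets = [vicinity[q] for q in query]
    return set.intersection(*sets)

# The grouping [reindeer chimneys man] resolves to:
print(vicinity_intersection(["reindeer", "chimneys", "man"]))
# {'santa-story'}
```

Note that the query members need no column, relation, or other imposed structure; any random subset of an object's associations narrows the result.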
There is exactly one data model primitive (set vicinity intersection) possessing exactly one syntax (grouping), and I rarely intend to distinguish data model primitive from syntax primitive (I can be criticized for this), and yet I use both terms for it; forgive me. === Synthesizing Ordering and Grouping === I am going to describe a toy naming system that allows focusing on how best to combine grouping and ordering into one naming system. This synthesis will contain the core features of the hierarchical, keyword, and relational systems as functional subsets. It consists of a few simple primitives, allowed to build on each other. It sets the discussion framework from which our project will over many years evolve a real naming system out of its current storage layer implementation. Resolving the second component of an ordering is dependent on resolving the first --- unlike set theory. In set theory one can derive ordered set from unordered set, but because resolving the name of the second component depends on the first component one cannot do so in this naming system. For this reason it can well be argued that this naming system is not truly set theory based. Now that I have mentioned this difference I will start to call them grouping and ordering, rather than unordered and ordered set. These two primitives take other names as sub-names, and allow the user to construct compound names. Either the order of the subnames is significant (ordering), or it isn't (grouping), and thus we have the two different primitives. Because I have myself found that BNFs are easier to read if preceded by examples, I will first list progressively more complex examples using the naming system, and then formally define it. The examples, and the simplified syntax, use / rather than : or \, but this is of no moment. 
Examples <tt>/etc/passwd</tt> [[Image:Passwd.jpg]] Ordering and grouping are not just better; file system upward compatibility makes them cheaper for unifying naming in OSes based on hierarchical file systems than a relational naming system would be. This approach is fully upwardly compatible with the old file system. Users should be able to retain their old habits for as long as they wish, engage in a slow comfortable migration, and incorporate the new features into their habits as they feel the desire. Elderly programs should be untroubled in their operation. Many worthwhile projects fail because they emphasize how much they wish to change rather than asking of the user the minimal collection of changes necessary to achieve the added functionality. [dragon gandalf bilbo] [[Image:Bilbo.jpg]] FIGURE 3. Graphical representation of ascii name on left. Mr. B. Bizy is looking for a dimly remembered story (The Hobbit by Tolkien) to print out and take with him for rereading during the annual company meeting. case-insensitive/[computer privacy laws] [[Image:Syntax-barrier.jpg]] FIGURE 4. Graphical representation of ascii name on left. When one subname contains no information except relative to another subname, and the order of the subnames is essential to the meaning of the name, then using ordering is appropriate. This most commonly occurs when syntax barriers are crossed. This is when a single compound name makes a transition from interpreting a subname according to the rules of one syntax to interpreting it according to the rules of another syntax. Ordering is essential at the boundary between the name of the new syntax as expressed in the current syntax, and the name to be interpreted according to that new syntax. Some researchers use the term context rather than syntax. The pairing of a program or function name, and the arguments it is passed, is inherently ordered. 
While that is usually the concern of the shell, when we use a variety of ordering functions to sort Key_Objects of different types it affects the object store. In this example the ordering serves as a syntax barrier. Case-insensitive is the unabbreviated name of a directory that ignores the distinction between upper and lower case. For Linux compatibility this naming layer is case sensitive by default, even though I agree with those who think that it would be better were it not. [my secrets]/[love letter susan] [[Image:My-secrets.jpg]] FIGURE 5. Graphical representation of ascii name on left. Devhuman (that's the account name he chose) is the company's senior programmer. Six years ago he wrote a love letter to Susan, which he put in his read-protected secrets directory. (He never found the nerve to send it to her.) He's looking for it so he can rewrite it, and then consider sending it. Security is a particular kind of syntax barrier (you have to squint a bit before you can see it that way). Here the ordering serves as a security barrier. (He certainly wouldn't want anyone to know that an object owned by him with attributes love letter susan existed.) [subject/[illegal strike] to/elves from/santa document-type/RFC822 ultimatum] [[Image:Ultimatum.jpg]] FIGURE 6. Graphical representation of search for santa's ultimatum. Devhuman knows his object store cold. He is looking for something he saw once before, he knows that it was auto-named by a particular namer he knows well (perhaps one whose functionality is similar to the classifier in [Messinger]), and he knows just what categorizations that namer uses when naming email. Still, he doesn't quite remember whether the word 'ultimatum' was part of the subject line, the body, or even was just elvish manual supplementation of the automatic naming. Rather than craft a query carefully specifying what he does and does not know about the possible categorizations of ultimatum, he lazily groups it. 
If Devhuman's object store is implemented using this naming system with good style, someone less knowledgeable about the object store would also be able to say: [santa illegal strike ultimatum elves] and perhaps get some false hits as well as the desired email (instead of finding mail from santa perhaps finding the elvish response). Notice that if you delete the 'illegal' and 'ultimatum' to get [subject/strike to/elves from/santa document-type/RFC822] the query is structurally equivalent to a relational query. Many authors (e.g. semantic database designers) have written papers with good examples of standard column names which might be worth teaching to users. So long as they are an option made available to the user rather than a requirement demanded of the user, the increased selectivity they provide can be helpful. [_is-a-shellscript bill] [[Image:Pruner.jpg]] FIGURE 7. Graphical representation of ascii name on left. This name finds all shellscripts associated with bill. Names preceded by _ are pruners. Pruners are analogous to the predicate evaluators of relational database theory. If you have read papers distinguishing between recognition and retrieval, pruners are a recognition primitive. They are passed a list of objects, and return a subset of that list which matches some criteria. They are a mechanism appropriate for when a nonlinear search method that can deliver the desired functionality is either impossible, or not supported by existing indexes. There are many useful names for which we cannot do better than linear time search algorithms (perhaps simply as a result of incomplete indexing). _is-a-shellscript checks each member of its list to see if it is an executable object containing solely ascii. The user can use it just like any other Key_Object within an association; it will prune the results of the grouping. Since set intersections are commutative its order within the grouping has no meaning, and optimizers are free to rearrange it. 
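A pruner can be sketched as a linear filter over the candidate list. This is a hypothetical sketch only: the dictionary representation and the `is_shellscript` predicate stand in for real object headers and the real _is-a-shellscript check:

```python
# A pruner receives the list of objects produced by the rest of the
# grouping and returns the subset matching its criterion. Whether an
# object passes must not depend on the other members of the list, so
# an optimizer may apply pruners in any order.
def prune(candidates, predicate):
    return [obj for obj in candidates if predicate(obj)]

# Hypothetical stand-in for _is-a-shellscript: executable and all-ascii.
def is_shellscript(obj):
    return obj.get("executable", False) and obj["content"].isascii()

objects = [
    {"name": "bill-report", "executable": False, "content": "quarterly"},
    {"name": "bill-backup", "executable": True,  "content": "#!/bin/sh\ncp a b"},
]
print([o["name"] for o in prune(objects, is_shellscript)])
# ['bill-backup']
```

Because `prune` examines each candidate independently, its result is the same wherever the optimizer chooses to apply it within the grouping.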
=== The Formal Definitions === {| border=1 | <Object Name>::= || <pre> <Grouping> | <Ordering> | <Key_Object> | <Storage_Key> | <Orthogonal and Unoriginal Primitives I Will Not Define Here> | ; </pre> |} See the section listing orthogonal and unoriginal primitives for a discussion of what primitives I left out of the definitions of this grammar that are necessary to a real world working system. The name resolver has a method for converting all of the primitives into '''<Storage_Keys>''', and when processing compound names it first converts the subnames into '''<Storage_Keys>''', though an object may have null contents, and serve purely to embody structure. This allows the use, as a component of a grouping or ordering, of anything for which anyone can invent a way of allowing the user to find an '''<Object Name>''', together with a method for the resolver to convert that '''<Object Name>''' into a '''<Storage_Key>'''. In a word, closure. Extensible closure. Compound names are interpreted by first interpreting the subnames that they are constructed from. At each stage of subname interpretation an '''<Object Name>''' is converted into a '''<Storage_Key>''' for the object that it is resolved to. The modules that implement the grouping and ordering primitives do not interpret the subnames, they merely pass them to the naming system, which returns the '''<Storage_Key>'''s they resolve to. It was a long discussion which led to the use of storage keys rather than objectids. A storage key differs from an objectid in that it gives the storage layer directions as to where to try to locate the object in the logical tree ordering of the storage layer. If the logical location changes, then in the worst case we leave a link behind, and get an extra disk access like we get with an inode. 
(Inode numbers are functionally objectids.) In the better case, the repacker eventually comes along, and changes all references by key to the new location, at least for all objects that have not given their key to external naming systems the repacker cannot repack. A '''<Storage_Key>''' is assigned by the system at object creation, and serves the purpose of allowing the system to concisely name the object, and provide hints to the storage layer about which objects should be packed near each other. The user does not directly interact with the '''<Storage_Key>''' any more often than C programmers hardcode pointers in hex. The packing locality of keys may be redefined. == The Primitives == <Key_Object> A description of the contents of an object using the syntax of the current directory. For objects used to embody keywords this may be the keyword in its entirety. If it contains spaces, etc. it must be enclosed in quotes. Note that making it easy for third parties to add plug-in directory types is part of Namesys's current contract with Ecila. Ecila wants space efficient directories suitable for use in implementing a term dictionary and its postings files for their Internet search engine. Example: [reindeer chimneys presents man] In this example 'reindeer', 'chimneys', 'presents', and 'man' are the contents of objects associated with the Santa Claus story. Each of them is searched for by contents, and then when found they are converted into their Storage_Keys, and then the grouping algorithm is fed their four Storage_Keys. The grouping module then looks in the object headers of the four objects, gets the four sets of objects the Key_Objects group to, and performs a set intersection. 
Besides greater closure, another advantage of storing Key_Objects as objects is that non-ascii Key_Objects and ordering functions can be implemented as a layer on top of the ascii naming system, allowing the user to interact with the naming system by pressing hyperbuttons, drawing pictures, making sounds, and supplying other non-ascii Key_Objects that the higher layers convert into Storage_Keys. There are endless content description techniques. If the directory owner supplies an ordering function for the Key_Objects in a directory, one can generate a search index for the directory using a directory plug-in which is fully orthogonal to the ordering function, though perhaps slower in some cases than one that is tailored for the ordering function. Users will find it easier to write ordering functions than index creation objects, and will not always need the speed of specialized indexes. We will need one ordering function for ascii text, another for numbers, another for sounds, perhaps someday one even for pictures of faces (perhaps to be used by a law enforcement agency constructing an electronic mug book, or a white pages implementation), etc. No system designer can provide all the different and sometimes esoteric ordering functions which users will want to employ. What we can do is create a library of code, from which users can construct their own ordering function and their own directory plug-ins, and this is the approach we are taking on behalf of Ecila. For an Internet search engine one wants what is called a postings file, which is like a directory in that there is no need to support a byte offset, and one frequently wants to efficiently perform insertions into it. <Grouping> ::= [<Unordered List>] ; <Unordered List> ::= <Unordered List> <Unordered List> | <Object Name> | <Pruner> ; <Pruner> ::= _<Object Name> A <Grouping> is a list of object names and pruners whose order has no meaning. 
Every object has a list of objects it groups to (associates with, in neural network idiom) in its object header. A grouping is interpreted by performing a set intersection of those lists for every object named in the grouping. In data model terms, interpreting a grouping means performing a set vicinity intersection. Grouping is not transitive: [A] => B and [B] => C does not imply [A] => C though it does imply that [[A]] => C A pruner is an <Object Name> which has been preceded with an _ to indicate that the object described should be passed a list of objects named by the rest of the grouping, executed, and it will return a subset of the list it was passed. Whether a member of the set is in the returned subset must be fully independent of what the other members of the set were, or else the results become indeterminate after application of a query optimizer, as with an optimizer in use there is no guarantee provided of the order of application of the pruners. <Ordering> ::= <Object Name>/<Object Name> | <Object Name>/<Custom Programmed Syntax> <Custom Programmed Syntax> ::= Varies, provides extensibility hook. An ordering is a pairing of names, with the order representing information. The first component of the ordering determines the module to which the second component is passed as an argument. In contrast, a grouping first converts all subnames to Storage_Keys by looking through the same current directory for all of them in parallel, and then does its set intersection with the subdescriptions already resolved. Example: In resolving [my secrets] / [love letter susan] the system would look for the objects with contents my and secrets, find both of them, and do a set intersection of all of the objects those two objects both group to (are associated with). This will allow it to find the [my secrets] directory, inside of which it will look for the three objects love, letter, and susan. 
It will then extract from their object headers the sets of objects those three words ('love', 'letter', and 'susan') group to, and do a set intersection which will find the desired letter. The desired letter is not necessarily inside the [my secrets] directory, though in this case it probably is. A directory is an object named by the first component of an ordering, to which the second component is passed, and which returns a set of Storage_Keys. One can in principle use different implementations of the same directory object without impacting the semantics and only affecting performance, as is often done in databases. There are flavors of directories: Custom programmed directories, aka filters, are any executable program that will return a Storage_Key when executed and fed the second component as an argument. They provide extensibility. (They are the ordered counterpart of pruners.) Another term for them is filter directories. Custom programmed directories whose name interpretation modules aren't unique to them will contain just the name of the module (filter), plus some directory dependent parameters to be passed to the module. It should be considered merely a syntax barrier directory, and not a fully custom programmed directory, if those parameters include a reference to a search tree that the module operates on, and if that search tree adheres to the default index structure. The connotations conveyed by the term 'filter' of there being an original which is distorted are not always appropriate, but in honesty this is not an issue about which we deeply care. Syntax barrier directories allow you to describe the contents of the object they contain with a syntax different from that of their parents. Except for being sorted by a different ordering function, the indexes of syntax barrier directories are standard in their structure, and use a standard index traversal module. The index traversal module is ordering function independent. 
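The two-stage resolution of [my secrets]/[love letter susan] described earlier can be sketched as follows. All objects and associations here are invented for illustration; a real resolver would consult object headers and directory indexes rather than in-memory dictionaries:

```python
# Sketch: resolve the first grouping in the current directory, then
# resolve the second grouping inside the directory it names.
def resolve_grouping(directory, subnames):
    """Intersect the vicinities of each subname within one directory."""
    sets = [directory["vicinity"][name] for name in subnames]
    return set.intersection(*sets)

root = {"vicinity": {
    "my":      {"my-secrets-dir"},
    "secrets": {"my-secrets-dir", "trade-secrets-dir"},
}}
directories = {"my-secrets-dir": {"vicinity": {
    "love":   {"letter-to-susan"},
    "letter": {"letter-to-susan", "resignation-letter"},
    "susan":  {"letter-to-susan"},
}}}

# Stage 1: [my secrets] narrows the root to a single directory.
(dir_key,) = resolve_grouping(root, ["my", "secrets"])
# Stage 2: [love letter susan] is resolved inside that directory.
result = resolve_grouping(directories[dir_key], ["love", "letter", "susan"])
print(result)  # {'letter-to-susan'}
```

The ordering is what forces stage 2 to wait for stage 1: the second grouping cannot be interpreted until the first has resolved to a directory, which is exactly why ordering cannot be reduced to set intersection.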
There must be an ordering function for every <Key_Object> employed within a given syntax barrier directory. By contrast, a <Custom Programmed Syntax> could be anything which the syntax module somehow finds an object with, possibly even creating the object in order to be able to find it. To cross a security barrier directory the user must use an ordered pair of names with the security barrier as the first member of the pair, and he must satisfy the security module of the secured directory. A security barrier directory may be both a security and a syntax barrier directory, or the security barrier directory may share the syntax module of its parents. Fully standard directories are those built using the default directory module, and adding structure is their only semantic effect. There is an aspect of customization which is beyond the scope of this paper, in which one customizes the items employed by the storage layer to implement files and directories. That is, the storage of the files and directories is implemented by composing them of items, and these items have different types. We are now creating the code for packing and balancing arbitrary types of items using item handlers and object oriented balancing code, so as to make it easier to extend our filesystem.

=== Ordering can be implemented more efficiently than grouping ===

The set intersections performed in evaluating the grouping primitive are normally much more expensive computationally than performing the classical filesystem lookup. Imposing excess structure on one's data does not just at times reduce the cost of human thinking :-), it can be used to reduce the cost of automated computation as well. When the cost to a user of learning structure is less important than the burden on the machine, use of highly ordered names is often called for.

=== The Motivation for Different Syntactic Treatment of Ordering and Grouping, and Some of the Deeper Issues Revealed by the Difference ===

An important difference between grouping and ordering affects syntax. It allows us to represent an ordering with a single symbol ('/') placed between the pair, but requires two symbols ('[' and ']') for each grouping. Imagine using < and > as a two symbol delimiter style alternative notation for ordering: <<father-of mother-of> sister-of> = <father-of <mother-of sister-of>> = <father-of mother-of sister-of> = father-of/mother-of/sister-of All of the expressions above are equivalent in referring to the paternal great aunt of the person who is the current context. The ones using nested pairs of symbols to enclose pairs of subnames imply a false structure that requires the user to think to realize the first two expressions are equivalent. The fourth is the notation this naming system employs. Grouping is different: Fast Acting Freddy is looking through the All-LA Shopping Database for a single store with black reebok sneakers, a green leather jacket, and a red beret so that he can dress an actor for a part before the director notices he forgot all about him. [[black reebok sneakers] [green leather jacket] [red beret]] is not equivalent to [black reebok sneakers green leather jacket red beret] which equals [red sneakers black reebok jacket green beret] Ordering is not algebraically commutative (father-of/mother-of is not equivalent to mother-of/father-of). Groupings are algebraically commutative. ([large red] = [red large])

== Style ==

As a general principle, a more restricted system can avoid requiring the user to repeatedly specify the restrictions, and if the user has no need to escape the restrictions then the restricted system may be superior. This is why "4GLs", which supply the structure for the user's query, are useful for some applications. They are typically implemented as layers on top of unrestricting systems such as this one. This paper has addressed issues surrounding finding information, particularly when the user's clues are faint.
When supporting other user goals, such as exploring information, adding structure through substantial use of ordering can be helpful. [Marchionini] [McAleese]. When the user goal is finding, one should assume that of all the fragments of information about an object, the user has some random subset of them. The goal is to allow the user to use that random subset in a name, whatever that subset might be. Some of that subset will be structural fragments. While requiring the user to supply a structure fragment is as foolish as requiring him to supply any other arbitrary fragment, allowing him to is laudable. In the best of all worlds the object store would incorporate all valid possible structurings of Key_Objects. The difficulty in implementing that is obvious. [Metzler and Haas] discuss ways of extracting structure from English text documents, and why one would want to be able to use that structure in retrievals. Unfortunately, there is an important difference between representing the structure of an English language sentence in a way that conveys its meaning, and representing it in a way that allows it to be found by someone who knows only a fragment of its semantic content. I doubt the wisdom of trying to advocate the use of more than essential structure in searching. You can allow users to avoid false structure; you cannot force them to. It is important to teach those creating the structure that if they group a personnel file with sex/female they should also group it with female. Type checking can impose structure usefully. Its implementation can enhance or reduce closure, depending on whether it is done right.

=== When To Decompound Groupings ===

There are dangers in excessive compounding of compound groupings analogous to those of excessive ordering. Let's examine two examples of compound groupings, both of which are valid both semantically and syntactically.
One of them can be "decompounded" with moderate information loss, and the other loses all meaning if decompounded. Example: Finding a loquacious Celtic textbook salesman who told you in excruciating detail about how he was an ordinance researcher until one day he went to a Grateful Dead concert. [[Celtic textbook salesman] [ordinance researcher]] vs. [celtic textbook salesman ordinance researcher] These two phrasings of the same query are not equivalent, but they are "close." Our second example is the one in which Fast Acting Freddy tries to find a suspect by the objects he is associated with: [[black reebok sneakers] [green leather jacket] [red beret]] vs. [black reebok sneakers green leather jacket red beret] These two are not at all "close." The difference between the two examples of inequivalencies is that the subdescriptions within the second example describe objects whose existence within the object store, independent of the object described, is worthwhile. The first does not, and it is more reasonable to try to design so that the "decompounded" version of the query is used. False hits will occur, but for large systems that's better than asking the user to learn structure. A higher level user interface might choose to present only one level to the user at a time, and then once the user confirms that a subdescription has resolved properly it would let him incorporate it into a higher level description. There might be 6 models of [black reebok sneakers], and Fast Acting Freddy should have the opportunity to click his mouse on the exact model, and have the interface substitute that object for his subdescription. Using such an interface an advanced user might simultaneously develop several subdescriptions, refine and resolve them, and then use the mouse to draw lines connecting them into a compound grouping. Closure makes it possible for that to work.
== Examples of Creating Associations ==

<- creates an association between all of the objects on the left hand side and all of the objects on the right hand side. A - B is the set difference of A and B, and it resolves to the set of objects in A except for those that are in B. A & B resolves to the set intersection of A and B, the objects that are in both A and B. [A B] = [A] & [B], by definition.

 animal <- (lives, moves)
 mammal <- ([animal], animal, `warm blooded')
 cat <- ([mammal], hypernym/mammal, mammal, meronym/fur, fur, meronym/whiskers, whiskers, hypernym/quadruped, quadruped, capability/purr, purr, capability/meow, meow)
 Basil <- (owner/Nina, Nina, [siamese], siamese, clever, playful, brave/overly, brave, 'toilet explorer')
 bag <- ([container], container, consists-of/`highly flexible material', `highly flexible material')
 backpack <- ([bag], shoulderstrap/quantity/2, shoulderstrap, college-student, holonym/backpacker, meronym/shoulderstrap)
 mould <- ([fungi] - green/not, furry, `grows on'/surfaces/moist, `killed by'/chlorine)
 fungi <- ([plant], plant, leaves/no, flowers/no, green/not)
 bird <- ([vertebrate], vertebrate, flies, feathers)
 penguin <- ([bird] - flies, bird, hypernym/bird, swims, Linux, [Linux (mascot, symbol)])
 siamese <- ([cat], cat, hair/short, short-hair)

Notice how we don't associate siamese with short despite associating it with hair/short, but we do associate Basil with Nina as well as with owner/Nina.

 small <-0 little

The above means that small and little are synonyms, and are to be treated as 0 distance away from each other for vicinity calculation purposes. In other, traditional Unix, words, they are hardlinked together. Creating a serious ontology is not our field or task, but it is worth doing. The reader is referred to WordNet (free), and Cyc by Doug Lenat (proprietary). While we will focus on implementing primitives that allow for creating better ontologies, we are happy to work with persons interested in contributing or porting an ontology.
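The primitives above can be modeled in a few lines of Python. This is a toy sketch under assumed semantics, not Namesys code: the helper names (`associate`, `grouping`) are my own, and an ordering such as owner/Nina is encoded here as an ordinary Python pair.

```python
# Toy model of the association examples above: X <- (a, b, ...) groups
# X with each name on the right; [A B] = [A] & [B]; orderings such as
# owner/Nina are encoded as (non-commutative) tuples. Names invented.
from collections import defaultdict

grouped_with = defaultdict(set)   # name -> set of objects grouped to it

def associate(obj, names):
    """obj <- (names...): group obj with every name on the right."""
    for name in names:
        grouped_with[name].add(obj)

associate("Basil", ["Nina", ("owner", "Nina"), "siamese", "clever", "playful"])
associate("siamese", ["cat", ("hair", "short"), "short-hair"])

def grouping(*names):
    """[a b ...] = [a] & [b] & ... (commutative, per the beret example)."""
    sets = [grouped_with[n] for n in names]
    out = sets[0].copy()
    for s in sets[1:]:
        out &= s
    return out

# Grouping is commutative; ordering is not:
assert grouping("Nina", "clever") == grouping("clever", "Nina")
assert ("owner", "Nina") != ("Nina", "owner")
# siamese is grouped with hair/short but not with bare "short",
# mirroring the note in the text.
assert "siamese" not in grouped_with["short"]
```

Note how the encoding preserves the paper's distinction: querying [Nina clever] finds Basil regardless of word order, while owner/Nina and Nina/owner remain different names.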
== Other Projects Seeking To Increase Closure In The OS ==

=== AT&T's Plan 9 ===

[Plan 9] is being produced by the original authors of Unix at AT&T research labs. It has influenced CORBA, and /proc is a direct steal from it to Linux. Their major focus is on integration. Their major trick for increasing integration is unifying the name space. Name spaces integrated into the Plan 9 file system include the status, control, virtual memory, and environment variables of running processes. They have a hierarchical analog to what the relational culture calls constructing views, that the Plan 9 culture calls context binding.

=== Microsoft's Information At Your Fingertips ===

Plan 9 ignores integration of application program name spaces, concentrating on OS oriented name spaces. Microsoft's "Information at Your Fingertips" name space integration effort appears to be taking the other approach, and focusing on integrating the name spaces of the various Microsoft applications via OLE and Structured Storage. The application group at Microsoft has long been better staffed and funded than the OS group, and FS developers have long preferred to simply ignore the needs of application builders generally. The primary semantic disadvantages of Microsoft's approach are primitives selected with insufficient care, a lack of closure, and the use of an object oriented rather than set oriented approach in both naming syntax and data model. Realistically, one can say that folks within Microsoft have often made statements favoring name space integration, and in various areas have successfully executed on it, but on the whole I rather suspect that the lack of someone in marketing making a business case for $X in revenue resulting from name space integration has crippled name space integration work at commercial OS producers generally, including MS.

==== Internet Explorer ====

Internet Explorer attempts to unify the filesystem and Internet namespaces.
At the time of writing, the unity is so superficial, with so little substance, that I would describe it as having the look and feel of integration without most of the substance. Perhaps this will change.

==== Microsoft's Well Known Performance Difficulties ====

Despite having many of the leading names in the industry on their payroll, they have somehow managed to create a file system implementation with performance so terrible that it is for the Unix customer base a significant consideration contributing to hesitation in moving to NT. It may well have the worst performance of any of the major OS file systems. Their implementation of OLE's structured storage offers extremely poor performance, and their excuse that it is due to the incorporation of transaction concepts into their design is just a reminder that they did a poor job at that as well. They managed to implement something intended to store small objects within a file, yet implemented it such that it still suffers from 512-byte granularity problems, problems that they try to somewhat overcome by encouraging the packing of several objects within "storages" at horrible kludge costs....

=== Storage Layers Above the FS: A Sure Symptom The FS Developer Has Failed ===

When filesystems aren't really designed for the needs of the storage layers above them, and none of them are, not Microsoft's, not anybody's, then layering results in enormous performance loss. The very existence of a storage layer above the filesystem means that the filesystem team at an OS vendor failed to listen to someone, and that someone was forced to go and implement something on their own. You just have to listen to one of these meetings in which some poor application developer tries to suggest that more features in the FS would be nice; I heard one at a nameless OS vendor.
The FS team responds that disks are cheap, small object storage isn't really important, we haven't changed the disk layout in 10 years, and changing it isn't going to fly with the gods above us about whom we can do nothing. At these meetings you start to understand that most people who go into filesystem design are persons who didn't have the guts to pursue a more interesting field in CS. There is a sort of reverse increasing returns effect that governs FS research, in which the more code becomes fixed on the current APIs, the more persons in the field react with fear to any thought of the field of FS semantics being other than a dead research topic, the less research gets done, and the fewer persons of imagination see a reason to enter the field. Every time one vendor gets a little forward in adding functionality, the other vendors go on a FUD campaign about it breaking standards and therefore being dangerous for mission critical usage. This is a field in which only performance research is allowed, and every other aspect is simply dead. Namesys seeks to raise the dead, and is willing to commit whatever unholy acts that requires. There is no need for two implementations of the set primitive, one called directories, the other called a file with streams, each having a different interface. File systems should just implement directories right, give them some more optional features, and then there is no need at all for streams. If you combine allowing directory names to be overloaded to also be filenames when acted on as files, allowing stat data to be inherited, allowing file bodies to be inherited, and implementing filters of various kinds, then in the event that the user happens to need the precise peculiar functionality embodied by streams, they can have it by just configuring their directory in a particular way. There was a lengthy Linux-kernel thread on this topic which I won't repeat in more detail here.
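The claim that streams become redundant once a directory name may be overloaded to act as a file can be illustrated with a hypothetical sketch. None of these class or method names come from any real filesystem API; this is only the shape of the argument.

```python
# Hypothetical sketch: if an object can be acted on as a file (a body)
# and as a directory (entries) at the same time, an NT-style named
# stream is just an ordinary directory entry. All names are invented.

class Node:
    """An object that behaves as a file and as a directory at once."""
    def __init__(self, body=b""):
        self.body = body          # contents when acted on as a file
        self.entries = {}         # children when acted on as a directory

    def stream(self, name):
        """What a stream API would call doc:name is just a lookup."""
        return self.entries.setdefault(name, Node())

doc = Node(b"main document body")
doc.stream("summary").body = b"short abstract"   # the 'stream'

assert doc.body == b"main document body"                 # acted on as a file
assert doc.stream("summary").body == b"short abstract"   # acted on as an entry
```

The point of the sketch is that one set primitive (the directory) suffices; the second interface (streams) adds nothing that a suitably configured directory cannot express.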
The tree architecture of the storage layer of this FS design will lend itself to a distributed caching system much more effectively than the Microsoft storage layer, in part due to its ability to cache not just hits and misses of files, but to cache semantic localities (ranges). For more on this topic see later in this paper.

=== Rufus ===

The Rufus system [Messinger et al.] indexes information while leaving it in its original location and format. While it does allow the user to create a unified name space, it does not choose to integrate that name space into the operating system. Even so, it is immensely useful in practice, and strongly hints at what the OS could gain if it had a more than hierarchical name space with a data model oriented towards what [Messinger] calls "semi-structured information", such as you find in the RFC822 format for email. When you have 7000 pieces of mail, and linear searching the mail with a utility like grep takes 10 minutes, it is nice to be able to quickly keyword search via inverted indexes for the mail whose from: field contains billg and that has the words "exclusive" and "bundling" in the body of the message, as you hurriedly search for an old email just before an appearance at court.

=== Semantic File System ===

The Semantic File System comes closest to addressing the needs I have described. It is a Unix compatible file system with more than hierarchical naming (attribute based is the term they use). Its data model unfortunately has the important flaw of lacking closure (in it names of objects are not themselves objects). In my upcoming discussion of the unnecessary lack of closure in hypertext products, notice that the arguments apply to the Semantic File System (and so I won't duplicate them here).

=== OS/400 ===

IBM's OS/400 employs a unified relational name space. The section of this paper entitled A Naming System Should Reflect Rather than Mold Structure will cover its problems of forcing false structure.
Inadequate closure due to mandatory type checking is another source of difficulties for it. While users moan about these two unnecessary design flaws, the essence of the opinions AS/400 partisans have expressed to me has been that the unification of its name space is a great advantage that OS/400 has over Unix. I claim these users were right, and later in this paper will propose doing something about it.

== Conclusion ==

While I spent most of this paper on why adding structure to information can be harmful, particularly when it is intended to be found by others sifting through large amounts of other information, this was purely because it is a harder argument than why deleting structure is harmful. My goal was not to be better at unstructured applications than keyword systems, or better at structured applications than the hierarchical and relational systems --- the goal is to be more flexible in allowing the user to choose how structured to be, while still being within a single name space. I claimed that multiple fragmented name spaces cannot match the power and ease of name spaces integrated with closure: closure makes a naming system far more powerful by increasing its ability to compound complex descriptions out of simpler ones. The strong points of this naming system's design are various forms of generalizing abstractions already known to the literature, for greater closure.

== Acknowledgments ==

David P. Anderson and Clifford Lynch helped enormously in rounding out my education, and improving my paper. Their generosity with their time was remarkable. David P. Anderson was simply a great professor, and it was a privilege to work with him. Brian Harvey informed me that it wasn't too obvious to mention that an object store should be unified. Cimmaron Taylor provided me with many valuable late night discussions in the early stages of this paper.
I would like to thank Bill Cody and Guy Lohman of the database group at the IBM Almaden Research center for a wonderful learning experience. Vladimir Saveliev kept this file system going when others fell by the wayside. He started as the most junior programmer on the team, and through sheer hard work and dedication to excellence outshone all the other more senior researchers. Of course after some time he could no longer be considered a junior programmer.

NOTE: See also the DARPA funded, but not endorsed, [[Txn-doc|Reiser4 Transaction Design Document]] and [[Reiser4|Reiser4 Whitepaper]].

== References ==

* 1. Blair, David C. and Maron, M. E. [http://portal.acm.org/citation.cfm?doid=3166.3197 Evaluation of Retrieval Effectiveness for a Full-Text Document-Retrieval System] Communications of the ACM v 28 n 3 Mar 1985 p289-299
* 2. Codd, E. F. [http://portal.acm.org/citation.cfm?id=77708 The Relational Model for Database Management: version 2] c1990 Addison-Wesley Pub. Co., not recommended as a textbook, Date's is better for that, but worthwhile if you want a long paper by Codd. Notice that he places greater emphasis on closure, and design methodology principles in general, than designers of other naming systems such as hypertext.
* 3. Date, C.J. [http://portal.acm.org/citation.cfm?id=4198 An Introduction to Database Systems], 4th ed. Reading, Mass.: Addison-Wesley Pub. Co., c1986- Contains a well written substantive textbook sneer at the problems of hierarchical naming systems, and a well annotated bibliography.
* 4. Curtis, Ronald and Larry Wittie [http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?isnumber=35714&arnumber=1695185 Global Naming in Distributed Systems] IEEE Software July 1984 p76-80
* 5. Feldman, Jerome A., Mark A. Fanty, Nigel H. Goddard and Kenton J. Lynne, [http://portal.acm.org/citation.cfm?id=42372.42378 Computing with Structured Connectionist Networks] Communications of the ACM, v31 Feb '88, p170(18)
* 6. Fox, E. A., and Wu, H. [http://portal.acm.org/citation.cfm?id=358466 Extended Boolean Information Retrieval], Communications of the ACM, 26, 1983, pp. 1022-1036
* 7. Gallant, Stephen I., [http://portal.acm.org/citation.cfm?id=42377 Connectionist Expert Systems], Communications of the ACM, v31 Feb '88, p152(18)
* 8. Gates, Bill. Comdex '91 speech on [http://findarticles.com/p/articles/mi_m0REL/is_n11_v90/ai_9715919/ Information at Your Fingertips] available for $8 on videotape from Microsoft's sales department.
* 9. Gifford, David K., Jouvelot, Pierre., Sheldon, Mark A., O'Toole, James W. Jr., [http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.17.4726 Semantic File Systems], Operating Systems Review Volume 25, Number 5, October 13-16, 1991. They demonstrated that extending Unix file semantics to include nonhierarchical features is useful and feasible. Unfortunately, their naming system lacks closure.
* 10. Gilula, Mikhail. [http://portal.acm.org/citation.cfm?id=174888 The Set Model for Database and Information Systems], 1st Edition, c 1994, Addison-Wesley, provides a Set Theoretic Database Model in which relational algebra is shown to be a special case of a more general and powerful set theoretic approach.
* 11. [http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.23.4527 Joint Object Services Submission] (JOSS), OMG TC Document 93.5.1
* 12. Marchionini, Gary., and Shneiderman, Ben. [http://portal.acm.org/citation.cfm?id=619765 Finding Facts vs. Browsing Knowledge in Hypertext Systems] Computer, January 1988, p. 70
* 13. McAleese, Ray "Hypertext: Theory into Practice" edited by Ray McAleese, ABLEX Publishing Corporation, Norwood, NJ 07648
* 14. Messinger, Eli., Shoens, Kurt., Thomas, John., Luniewski, Allen [http://domino.watson.ibm.com/library/cyberdig.nsf/a3807c5b4823c53f85256561006324be/1e2deed787c18fbc85256593006f843c?OpenDocument Rufus: The Information Sponge] Research Report RJ 8294 (75655) August 13, 1991, IBM Almaden Research Center
* 15. Metzler and Haas. [http://portal.acm.org/citation.cfm?id=65943.65949 The Constituent Object Parser: Syntactic Structure Matching for Information Retrieval], Proceedings of the ACM SIGIR Conference, 1989, ACM Press
* 16. Nelson, T.H. [http://www.eastgate.com/catalog/LiteraryMachines.html Literary Machines], self published by Nelson, Nashville, Tenn., 1981, did much to popularize hypertext, at the time of writing he has still not released a working product, though competitors such as hypercard have done so with notable success.
* 17. Mozer, Michael C. [http://www.eric.ed.gov/ERICWebPortal/custom/portlets/recordDetails/detailmini.jsp?accno=ED245694 Inductive Information Retrieval Using Parallel Distributed Computation], ICS Report No. 8406, June 1984, UCLA
* 18. Pike, Rob and P.J. Weinberger ... The Hideous Name "AT&T Research Report"
* 19. Pike, Rob., Presotto, Dave., Thompson, Ken. Trickey, Howard., Winterbottom, Phil. [http://plan9.bell-labs.com/sys/doc/names.html The Use of Name Spaces in Plan 9]. Plan 9 is an operating system intended to be the successor to Unix, and greater integration of its name spaces is its primary focus.
* 20. Potter, Walter D. and Robert P. Trueblood, [http://portal.acm.org/citation.cfm?id=45937 Traditional, semantic, and hyper-semantic approaches to data modeling] v21 Computer '88 p53(11)
* 21. Rijsbergen, C. J. Van, [http://www.dcs.gla.ac.uk/Keith/Preface.html Information Retrieval] - 2nd. ed., Butterworth and Co. Ltd., 1979, Printed in Great Britain by The Whitefriars Ltd., London and Tonbridge
* 22. Salton, G. (1986) [http://portal.acm.org/citation.cfm?id=6149 Another Look At Automatic Text-Retrieval Systems], Communications of the ACM, 29, 648-656
* 23. Smith, J.M. and D.C. Smith, [http://portal.acm.org/citation.cfm?id=320546 Database Abstractions: Aggregation and Generalization], ACM Transactions on Database Systems, June 1977, pp. 105-133
* 24. [http://www.win.tue.nl/~aeb/partitions/partition_types.html Partition types] by [mailto:aeb@cwi.nl Andries Brouwer], 2009-06-25

[[category:Reiser4]]

The Naming System Venture

== Abstract ==

For too long the file system has been semantically impoverished in comparison with database and keyword systems. It is time to change! The current lack of features makes it much easier to use the latest set theoretic models rather than older models of relational algebra or hypertext. The current FS syntax fits nicely into the newer model. The utility of an operating system is more proportional to the number of connections possible between its components than it is to the number of those components. Namespace fragmentation is the most important determinant of that number of possible connections between OS components. Unix at its beginning increased the integration of I/O by putting devices into the file system name space. This is a winning strategy: let's take the file system name space, and one-by-one eliminate the reasons why the filesystem is inadequate for what other name spaces are used for, one missing feature at a time. Only once we have done so will the hobbles be removed from OS architects, or even OS conspiracies. Yet before doing that, we need a core architecture for the semantics to ensure we end up with a coherent whole. This paper suggests a set theoretic model for those semantics. The relational models would at times unacceptably add structure to information, the keyword models would at times delete structure, and purely hierarchical models would create information mazes. Reworking their primitives is required to synthesize the best attributes of these models in a way that allows one the flexibility to tailor the level of structure to the need of the moment.
The set theoretic model I propose has a syntax that is Linux, MacOS, and DOS file system syntax upwardly compatible, as well as CORBA naming layer upwardly compatible. This is a planning document for the next major version of ReiserFS, that is, a description of vaporware. It is useful to ReiserFS users and contributors who want to know where we are going, and why we are building all sorts of strange optimizations into the storage layer (and especially those who are willing to help shape the vision in the course of discussions on the {{listaddress}} mailing list....). Currently the storage layer for ReiserFS is working and useful as an everyday FS with conventional semantics. That storage layer is available as a GPL'd Linux kernel patch.

== Introduction ==

Many OS researchers have built hierarchical name spaces that innovate in their effect on the integration of the operating system (e.g. Plan 9 and their file system [Pike].) Relational and keyword researchers rightfully scorn hierarchical name spaces as 20 years behind the state of the art [Date], but pay little attention to integration of the operating system as a design objective in their own work, or as a possible influence on data model design. I won't go into that here. Limiting associations to single key words is an unnecessary restriction.

== A Naming System Should Reflect Rather than Mold Structure ==

The importance of not deleting the structure of information is obvious; few would advocate using the keyword model to unify naming. What can be more difficult to see is the harm from adding structure to information; some do recommend the relational model for unifying naming (e.g. OS/400). By decomposing a primitive of a model into smaller primitives one can end up with a more general model, one with greater flexibility of application.
This is the very normal practice of mathematicians, who in their work constantly examine mathematical models with an eye to finding a more fundamental set of primitives, in hopes that a new formulation of the model will allow the new primitives to function more independently, and thereby increase the generality and expressive power of the model. Here I break the relational primitive (a tuple is an unordered set of ordered pairs) into separate ordered and unordered set primitives. Relational systems force you to use unordered sets of ordered pairs when sometimes what you want is a simple unordered set. Why should a naming system match rather than mold the structure of information? For systems of low complexity, the reasons are deeply philosophical, which means uncompelling. And for multiterabyte distributed systems?... Reiser's Rule of Thumb #2: The most important characteristic of a very complex system is the user's inability to learn its structure as a whole. We must avoid adding structure, or guarantee that the user will be informed of all structure relevant to his partial information. Avoiding adding structure is both more feasible and less burdensome to the user. Hierarchical, relational, semantic, and hypersemantic systems all force structure on information, structure inherent in the system rather than the information represented. If a system adds structure, and the user is trying to exploit partial knowledge (such as a name embodies), then it inevitably requires the user to learn what was added before he can employ his partial knowledge. With complex systems, the amount added is beyond the capacity of users to learn, and information is lost.

Example: <tt>"My name is Kali, your friendly technical support specialist for REGRES. Our system puts the Library of Congress online!
How may I help you."</tt>

George doesn't know Santa Claus' name: <tt>"I'm trying to find the reindeer chimneys christmas man, and I can't get your system to do it."</tt>

[[Image:Reindeer.jpg]]

FIGURE 1. Graphical representation of a typical simple unordered set that is difficult for relational systems.

Kali says: <tt>"OK, now let's define a query. '''is-a equals man''', that's easy. But reindeer? Is reindeer a property of this man?"</tt>

<tt>"Uh no. I wish I could remember the dude's name. I read this story about him a long time ago, and all I can remember is that he had something to do with reindeer and chimneys. The story is on-line, somewhere."</tt>

<tt>"Reindeer chimneys presents man, that's the sort of speech pattern I'd expect from a three year old."</tt> Kali corrects him. <tt>"Let's see if we can structure this properly. Is reindeer an '''instance-of''' of this man? A '''member-of''' of this man? It couldn't be a '''generalization''' of this man. Hmm..."</tt>

<tt>"No! It's not that complicated. They just have something to do with him."</tt>

<tt>"Pavlov would probably say you associate reindeer with this man, the way the unstructured mind of an animal thinks. But here in technical support we try to help our customers become more sophisticated. Is reindeer a property of this man?"</tt>

<tt>"No. Try '''propulsion-provider-for'''."</tt>

<tt>"Do you think that that was the schema the person who put the information in our system used?"</tt>

<tt>"No. Shoot. I can think of a dozen different columns it could be under. But what are the chances that the ones I think of are going to be the same as the ones the dude who put the information in used?"</tt>

Kali feels satisfaction. <tt>"Guess it can't be done, not if you can't structure your REGRES query properly.
I'll put you down in my log as a closed ticket, 190 seconds to resolution, not bad."</tt> <tt>"A keyword system could handle reindeer chimneys christmas man."</tt> George grumbles as he stares in despair at his display. Unfortunately, the ''Library of Congress'' is only one of REGRES' many reference aids. George could spend his life at it, and he'd never learn its schema. <tt>"But a keyword system would delete even necessary structure inherent to the information. It couldn't handle our other needs!"</tt> Kali says before she hangs up. In addition to the searcher's difficulties, having to manufacture structure by specifying the column for reindeer also adds unnecessary cognitive load to the story author's indexing tasks. == A Few of the Other Approaches to This Problem == There is lurking at the heart of my approach a subtle difference between my analysis of naming, and the analysis of at least some others. I started my research by systematically categorizing the different structures embodied by names, placing them into equivalency classes, and then picking one syntax out of each class of functionally equivalent naming structures, on the assumption that each of the equivalency classes has value. For example, I considered that languages sometimes convey structure by word endings (tags), and sometimes by word order, but while the syntax differs, the word order and word ending techniques are equivalent in their power to convey structure. In my analysis of the effect of word ordering I decided that either the ordering mattered, or it did not, and that was the basis for two different naming primitives. Others have instead studied the inherent structure of data, and then from that derived ways of naming. The hypersemantic system [Smith] [Potter] represents an attempt to pick a manageably few columns which cover all possible needs. 
Generalization, aggregation, classification, and membership correspond to the is-a, has-property, is-an-instance-of, and is-a-member-of columns, respectively. The minor problem is that these columns don't cover all possibilities. They don't cover reindeer, presents, or chimneys for George's query. The major problem is that they don't correspond as closely as possible to the most common style of human thought, simple unordered association, and require cognitive effort to transform. The first response of relational database researchers to this is usually to ask: "Why not modify an existing relational database to contain an 'associated' column, put everything in that column, and it would be functionally equivalent to what you want?" This is like saying that you can do everything Pascal can do using TeX macros. (They are both Turing complete.) We don't design languages to simply be Turing complete; we design them to be useful. I have seen a colleague do in six lines of SQL (nonstandard SQL) a simple three keyword unordered set that I do in three words plus a pair of delimiters, and that traditional keyword systems also handle easily. Doing simple unordered sets well is crucial for highly heterogeneous name spaces, and the market success of keyword systems in Internet searching is evidence of that. If you look at the structure of names in human languages, they are not all tuple structured, and to make them tuple structured might be to distort them. I have merely discussed the burden of naming columns. Most relational systems also require the user to specify the relation name. If column naming is a burden, naming both the column and the relation is no less a burden. Many systems invest effort into allowing you to take the key that you know, and figure out all the relation names and columns that you might choose to pair with it. This is a good idea, but not as good as not imposing extraneous structure to begin with.
[Salton] can be read for devastating critiques of the document clustering system, but there is a worthwhile idea lurking within that system. Perhaps it is worthwhile to keep track of a small number of documents which are "close" to a given document. The document creator could be informed upon auto-indexing the document what other documents appear to be close to it, and asked to consider associating it with them. This is not within our current plan of work, but I don't reject it conceptually. In summary, modularity within the naming system is improved by recognizing unordered grouping and ordering as two different functions that deserve separate primitives rather than being combined into a tuple primitive. The tuple is an unordered set of ordered pairs. There are other useful combinations of unordered grouping and ordering than that embodied by the relation, and the success of keyword systems suggests that a plain unordered set without any ordering at all is the most fundamental and common of them. == Names as Random Subsets of the Information In an Object == A system may still be effective when its assumptions are known to be false. You may regard the above as an overstatement of the notion that we are neural nets, and sometimes our abstract systems deal with assumptions that are not true or false, but are somewhat true. After we are finished stating them in English, they lose the delicate weighting possessed by the reality of the situation. Sometimes we find it easier to model without that weighting. Classical economics and its assumption of perfect competition is the best known example of an effective system based on assumptions known to be substantially false. Introductory economics classes usually spend several weeks of class time arguing the merits of building models on somewhat false assumptions. This paper will now use such a somewhat false model to convey a feel for why mandatory pairing of name components causes problems.
Assume the user's information from which he tries to construct a description will be some completely random subset of the information about the object. (Some of that information will be structural, and the structural fragments selected will be just as random as the rest.) Assume a user has 15 random clues of information selected from 300 pieces of information the system knows about some object. Assume the REGRES naming system requires that data be supplied in threesomes (perhaps column name, key name, relation name), and cannot use one member of a threesome without the other members of the threesome. Assume the ANARCHY naming system lacks this restriction, but does so at the cost that it can only use those 10 of the 15 information fragments which do not embody structure. Assume the statistical distribution of the 15 pieces of information the user has to construct a name with are fully independent and equally likely (this is both substantially wrong, and unfair to REGRES, but .... ) Assume each clue has a selectivity of 100 (it divides the number of objects returned by 100). Then ANARCHY has a selectivity of 100<sup>10</sup> = 10<sup>20</sup> = good. REGRES has a selectivity of: 100<sup>(chance that the other two members of an object's threesome are possessed by user × 15)</sup> = 100<sup>(9/300 × 8/300 × 15)</sup> ≈ 1.06 = very bad. While it is not true that the clues are fully independent, it is true that to the extent that they are not fully dependent, ANARCHY will gain in selectivity compared to REGRES. Attempting to quantify for any database the extent of the dependence would be a nightmare, and so this model assumes a substantial falsity, through which it is hoped the reader can see a greater truth.
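The arithmetic of this toy model can be checked directly. The following sketch takes its numbers straight from the text above; the variable names are invented for illustration:

```python
# Toy selectivity model from the text. The user holds 15 of 300
# information fragments; each clue divides the candidate set by 100.
CLUES, TOTAL, SELECTIVITY = 15, 300, 100

# ANARCHY can use its 10 non-structural fragments independently.
anarchy = SELECTIVITY ** 10            # 100**10 == 10**20

# REGRES can use a clue only when the other two members of its
# threesome are also among the user's clues (9/300 * 8/300 per clue,
# per the text's deliberately rough model).
usable = (9 / TOTAL) * (8 / TOTAL) * CLUES
regres = SELECTIVITY ** usable         # ~1.06

print(f"ANARCHY selectivity: {anarchy:.3g}")
print(f"REGRES selectivity:  {regres:.3f}")
```

As the model predicts, the grouping-style system retains enormous selectivity while the mandatory-threesome system ends up barely better than no name at all.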
For databases of the lower heterogeneity and complexity that the relational model was designed for, the independence within a threesome can be small, and the ability to also employ the 5 of 15 fragments which are structural is often more important than the difficulty of guessing any structure added. There is an implicit assumption here that you are looking for information that others have structured, and this argument in favor of ANARCHY becomes much less strong without this assumption. I feel obligated to stress once again that I do not advocate low structure over high structure, but I do advocate having the flexibility to match the amount of structure to the needs of the moment. Only with such flexibility can one hope to use all of the 15 fragments that happen to be possessed. == The Syntax In More Detail == What's needed is a naming system intended to reflect just the structure inherent in the information, whatever that structure might be, rather than restructuring the information to fit the naming system. === Orthogonal or Unoriginal Primitives and Features === There are many primitives that the ultimate naming system would include but which I will not discuss here: macros, OR, weight for subnames and AND-OR connectors [Fox], rules, constraints, indirection, links, and others. I have tried to select only those aspects in which my approach differs from the standard approach. Unifying the namespace does not require unifying automatic name generation, and those who read the [Blair] vs. [Salton] controversy likely understand my concluding that whatever the benefits might be of unifying automatic name generation, it is not feasible now, and won't be feasible for a long time to come. The names one can assign an object are kept completely orthogonal to the contents of the object in the implementation of this naming layer.
It is up to the owner of the object to name it, and it is up to him to use whatever combination of autonaming programs and manual naming best achieves his purpose. He may name it on object creation, and he may continually adjust its various names throughout its lifetime. See the section defining the "Key_Object primitive" for a discussion of why names should be thought of this way. Technically, object creation only requires the object be given a Storage_Key. In practice most users will, in the same act that creates the object, also associate the object with at least one name that will spare them from directly specifying the Storage_Key in hex the next time they make a reference to it. Applications implementing external name spaces can interact with the storage layer by referencing just the Storage_Key. Namesys will provide a manual naming interface, and the API autonaming programs need to plug into. Companies such as Ecila will provide autonamers for various purposes. Ecila is implementing a program which scans remote stores, creates links to them in the unified name space, but leaves the data on the remote stores. Other programs may also be implemented to perform this general function. To be more specific, the Ecila search engine scans the web for documents in French, and uses the filesystem as an indexing engine. However, they are writing their engine to be general purpose; they have sold support and the addition of extensions to other search engine companies, and it is open source. For now we are simply functioning as part of their engine, and the interface is by web browser: at some point we may be able to add their functionality to the namespace. While the implementation of Microsoft's attempt to blur the distinction between the filesystem name space and the web namespace is one more of appearance than substance, it is surely the right thing to do for Linux as well in the long run.
We should simply make our integration one with substance and utility, rather than integrating mostly the look and feel. When the store is external to the primary store for the namespace, then stale names can be an issue with no clean resolution. That said, unification at just the naming layer is, in a real rather than ideal world, often quite useful, and so we have Internet search engines. GUI based naming is beyond the scope of this paper, except to mention that it is common for GUI namespaces to be designed such that they are not well integrated with the other namespaces of the OS. They are often thought to necessarily be less powerful, but proper integration would make this untrue, as they would then be additional syntaxes, not substitutes. These additional syntaxes should possess closure within the general name space, and thereby be capable of finding employment as components of compound names like all the other types of names. The compound names should be able to contain both GUI and non-GUI based name components. Integration would make them simply the aspect of naming that applies to what is present in the visual cache of the screen, and to how to manage and display that cache most effectively. === Vicinity Set Intersection Definition (Also Called Grouping) === Suppose you have a set X of objects. Suppose some of these objects are associated with each other. You can draw them as connected in a graph. Let the vicinity of an object A be the set of objects associated with A. Let there be a set of query objects Q. Then the set vicinity intersection of Q is the set of objects which are a member of all vicinities of the objects in Q. When thinking of this as a data model, it seems natural to use the term vicinity set intersection. When thinking of this syntactically, it seems natural to use the term "grouping", because it implies that the subnames are grouped together without the order of the subnames being significant.
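The definition just given is easy to state in code. A minimal sketch follows, using an invented toy association graph; the object names are illustrative, not from any real store:

```python
# Toy association graph: each key maps to its vicinity, i.e. the set
# of objects associated with it.
vicinity = {
    "reindeer":  {"santa-story", "zoo-doc"},
    "chimneys":  {"santa-story", "masonry-manual"},
    "christmas": {"santa-story", "carols"},
    "man":       {"santa-story", "zoo-doc", "biography"},
}

def vicinity_intersection(query):
    """Return the objects that lie in the vicinity of every query object."""
    sets = [vicinity[q] for q in query]
    return set.intersection(*sets) if sets else set()

# George's grouping [reindeer chimneys christmas man] resolves to:
print(vicinity_intersection(["reindeer", "chimneys", "christmas", "man"]))
# -> {'santa-story'}
```

Note how each additional query object can only narrow the result, which is the selectivity behavior the toy model in the previous section quantified.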
There is exactly one data model primitive (set vicinity intersection) possessing exactly one syntax (grouping), and I rarely intend to distinguish data model primitive from syntax primitive (I can be criticized for this), and yet I use both terms for it, forgive me. === Synthesizing Ordering and Grouping === I am going to describe a toy naming system that allows focusing on how best to combine grouping and ordering into one naming system. This synthesis will contain the core features of the hierarchical, keyword, and relational systems as functional subsets. It consists of a few simple primitives, allowed to build on each other. It sets the discussion framework from which our project will over many years evolve a real naming system out of its current storage layer implementation. Resolving the second component of an ordering is dependent on resolving the first --- unlike set theory. In set theory one can derive ordered set from unordered set, but because resolving the name of the second component depends on the first component one cannot do so in this naming system. For this reason it can well be argued that this naming system is not truly set theory based. Now that I have mentioned this difference I will start to call them grouping and ordering, rather than unordered and ordered set. These two primitives take other names as sub-names, and allow the user to construct compound names. Either the order of the subnames is significant (ordering), or it isn't (grouping), and thus we have the two different primitives. Because I have myself found that BNFs are easier to read if preceded by examples, I will first list progressively more complex examples using the naming system, and then formally define them. The examples, and the simplified syntax, use / rather than : or \, but this is of no moment.
Examples <tt>/etc/passwd</tt> [[Image:Passwd.jpg]] Ordering and grouping are not just better; file system upward compatibility makes them cheaper for unifying naming in OSes based on hierarchical file systems than a relational naming system would be. This approach is fully upwardly compatible with the old file system. Users should be able to retain their old habits for as long as they wish, engage in a slow comfortable migration, and incorporate the new features into their habits as they feel the desire. Elderly programs should be untroubled in their operation. Many worthwhile projects fail because they emphasize how much they wish to change rather than asking of the user the minimal collection of changes necessary to achieve the added functionality. [dragon gandalf bilbo] [[Image:Bilbo.jpg]] FIGURE 3. Graphical representation of ascii name on left Mr. B. Bizy is looking for a dimly remembered story (''The Hobbit'' by Tolkien) to print out and take with him for rereading during the annual company meeting. case-insensitive/[computer privacy laws] [[Image:Syntax-barrier.jpg]] FIGURE 4. Graphical representation of ascii name on left When one subname contains no information except relative to another subname, and the order of the subnames is essential to the meaning of the name, then using ordering is appropriate. This most commonly occurs when syntax barriers are crossed. This is when a single compound name makes a transition from interpreting a subname according to the rules of one syntax to interpreting it according to the rules of another syntax. Ordering is essential at the boundary between the name of the new syntax as expressed in the current syntax, and the name to be interpreted according to that new syntax. Some researchers use the term context rather than syntax. The pairing of a program or function name, and the arguments it is passed, is inherently ordered.
While that is usually the concern of the shell, when we use a variety of ordering functions to sort Key_Objects of different types it affects the object store. In this example the ordering serves as a syntax barrier. Case-insensitive is the unabbreviated name of a directory that ignores the distinction between upper and lower case. For Linux compatibility this naming layer is case sensitive by default, even though I agree with those who think that it would be better were it not. [my secrets]/ [love letter susan] [[Image:My-secrets.jpg]] FIGURE 5. Graphical representation of ascii name on left Devhuman (that's the account name he chose) is the company's senior programmer. Six years ago he wrote a love letter to Susan, which he put in his read protected secrets directory. (He never found the nerve to send it to her.) He's looking for it so he can rewrite it, and then consider sending it. Security is a particular kind of syntax barrier (you have to squint a bit before you can see it that way). Here the ordering serves as a security barrier. (He certainly wouldn't want anyone to know that an object owned by him with attributes love letter susan existed.) [subject/[illegal strike] to/elves from/santa document-type/RFC822 ultimatum] [[Image:Ultimatum.jpg]] FIGURE 6. Graphical representation of search for santa's ultimatum Devhuman knows his object store cold. He is looking for something he saw once before, he knows that it was auto-named by a particular namer he knows well (perhaps one whose functionality is similar to the classifier in [Messinger]), and he knows just what categorizations that namer uses when naming email. Still, he doesn't quite remember whether the word 'ultimatum' was part of the subject line, the body, or even was just elvish manual supplementation of the automatic naming. Rather than craft a query carefully specifying what he does and does not know about the possible categorizations of ultimatum, he lazily groups it. 
If Devhuman's object store is implemented using this naming system with good style, someone less knowledgeable about the object store would also be able to say: [santa illegal strike ultimatum elves] and perhaps get some false hits as well as the desired email (instead of finding mail from santa perhaps finding the elvish response). Notice that if you delete the 'illegal' and 'ultimatum' to get [subject/strike to/elves from/santa document-type/RFC822] the query is structurally equivalent to a relational query. Many authors (e.g. semantic database designers) have written papers with good examples of standard column names which might be worth teaching to users. So long as they are an option made available to the user rather than a requirement demanded of the user, the increased selectivity they provide can be helpful. [_is-a-shellscript bill] [[Image:Pruner.jpg]] FIGURE 7. Graphical representation of ascii name on left This name finds all shellscripts associated with bill. Names preceded by _ are pruners. Pruners are analogous to the predicate evaluators of relational database theory. If you have read papers distinguishing between recognition and retrieval, pruners are a recognition primitive. They are passed a list of objects, and return a subset of that list which matches some criteria. They are a mechanism appropriate for when a nonlinear search method that can deliver the desired functionality is either impossible, or not supported by existing indexes. There are many useful names for which we cannot do better than linear time search algorithms (perhaps simply as a result of incomplete indexing). _is-a-shellscript checks each member of its list to see if it is an executable object containing solely ascii. The user can use it just like any other Key_Object within an association; it will prune the results of the grouping. Since set intersections are commutative, its order within the grouping has no meaning, and optimizers are free to rearrange it.
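The pruner semantics described above (a predicate applied to each member of a candidate list, independent of the other members, and hence freely reorderable by an optimizer) can be sketched as follows. The object representation and the _is-a-shellscript test here are invented for illustration:

```python
import string

def is_a_shellscript(obj):
    """Pruner body: an executable object containing solely ascii."""
    return obj.get("executable", False) and all(
        c in string.printable for c in obj.get("contents", ""))

def apply_pruner(pruner, candidates):
    # Each decision depends only on the member itself, never on the
    # rest of the set, so an optimizer may reorder pruners freely.
    return [obj for obj in candidates if pruner(obj)]

# Toy candidates, as might be returned by resolving the grouping [bill].
candidates = [
    {"name": "bill-report", "executable": False, "contents": "quarterly"},
    {"name": "bill-backup", "executable": True,  "contents": "#!/bin/sh\ncp a b"},
]
print([o["name"] for o in apply_pruner(is_a_shellscript, candidates)])
# -> ['bill-backup']
```

The independence requirement stated in the text is exactly what makes `apply_pruner` safe to run before, after, or interleaved with the set intersections.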
=== The Formal Definitions === {| border=1 | <Object Name>::= || <pre> <Grouping> | <Ordering> | <Key_Object> | <Storage_Key> | <Orthogonal and Unoriginal Primitives I Will Not Define Here> | ; </pre> |} See the section listing orthogonal and unoriginal primitives for a discussion of what primitives I left out of the definitions of this grammar that are necessary to a real world working system. The name resolver has a method for converting all of the primitives into '''<Storage_Keys>''', and when processing compound names it first converts the subnames into '''<Storage_Keys>''' (an object may have null contents, and serve purely to embody structure). This allows the use of anything which anyone can invent a way of allowing the user to find an '''<Object Name>''' for, and then invent a method for the resolver to convert the '''<Object Name>''' into a '''<Storage_Key>''', as a component of a grouping or ordering. In a word, closure. Extensible closure. Compound names are interpreted by first interpreting the subnames that they are constructed from. At each stage of subname interpretation an '''<Object Name>''' is converted into a '''<Storage_Key>''' for the object that it is resolved to. The modules that implement the grouping and ordering primitives do not interpret the subnames; they merely pass them to the naming system which returns the '''<Storage_Key>'''s they resolve to. It was a long discussion which led to the use of storage keys rather than objectids. A storage key differs from an objectid in that it gives the storage layer directions as to where to try to locate the object in the logical tree ordering of the storage layer. If the logical location changes, then in the worst case we leave a link behind, and get an extra disk access like we get with an inode.
(Inode numbers are functionally objectids.) In the better case, the repacker eventually comes along, and changes all references by key to the new location, at least for all objects that have not given their key to external naming systems the repacker cannot repack. A '''<Storage_Key>''' is assigned by the system at object creation, and serves the purpose of allowing the system to concisely name the object, and provide hints to the storage layer about which objects should be packed near each other. The user does not directly interact with the '''<Storage_Key>''' any more often than C programmers hardcode pointers in hex. The packing locality of keys may be redefined. == The Primitives == <Key_Object> A description of the contents of an object using the syntax of the current directory. For objects used to embody keywords this may be the keyword in its entirety. If it contains spaces, etc. it must be enclosed in quotes. Note that making it easy for third parties to add plug-in directory types is part of Namesys's current contract with Ecila. Ecila wants space efficient directories suitable for use in implementing a term dictionary and its postings files for their Internet search engine. Example: [reindeer chimneys presents man] In this example 'presents', 'reindeer', 'chimneys', and 'man' are the contents of objects associated with the Santa Claus story. Each of them is searched for by contents, and then when found they are converted into their Storage_Keys, and then the grouping algorithm is fed their four Storage_Keys. The grouping module then looks in the object headers of the four objects, gets the four sets of objects the Key_Objects group to, and performs a set intersection.
Besides greater closure, another advantage of storing Key_Objects as objects is that non-ascii Key_Objects and ordering functions can be implemented as a layer on top of the ascii naming system, allowing the user to interact with the naming system by pressing hyperbuttons, drawing pictures, making sounds, and supplying other non-ascii Key_Objects that the higher layers convert into Storage_Keys. There are endless content description techniques. If the directory owner supplies an ordering function for the Key_Objects in a directory, one can generate a search index for the directory using a directory plug-in which is fully orthogonal to the ordering function, though perhaps slower in some cases than one that is tailored for the ordering function. Users will find it easier to write ordering functions than index creation objects, and will not always need the speed of specialized indexes. We will need one ordering function for ascii text, another for numbers, another for sounds, perhaps someday one even for pictures of faces (perhaps to be used by a law enforcement agency constructing an electronic mug book, or a white pages implementation), etc. No system designer can provide all the different and sometimes esoteric ordering functions which users will want to employ. What we can do is create a library of code, from which users can construct their own ordering function and their own directory plug-ins, and this is the approach we are taking on behalf of Ecila. For an Internet search engine one wants what is called a postings file, which is like a directory in that there is no need to support a byte offset, and one frequently wants to efficiently perform insertions into it. <pre> <Grouping> ::= [<Unordered List>] ; <Unordered List> ::= <Unordered List> <Unordered List> | <Object Name> | <Pruner> ; <Pruner> ::= _<Object Name> </pre> A <Grouping> is a list of object names and pruners whose order has no meaning.
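The <Grouping> grammar can be exercised with a toy recursive-descent reading: '[' starts a grouping, names are whitespace-separated, and a leading '_' marks a pruner. This is an illustrative sketch, not the real resolver; the tokenization and output representation are invented:

```python
def parse_grouping(tokens):
    """Parse tokens after an opening '[' up to its matching ']'."""
    members = []
    while tokens:
        tok = tokens.pop(0)
        if tok == "]":
            return ("grouping", members)
        if tok == "[":
            members.append(parse_grouping(tokens))   # nested grouping
        elif tok.startswith("_"):
            members.append(("pruner", tok[1:]))      # _<Object Name>
        else:
            members.append(("key_object", tok))
    raise ValueError("unbalanced grouping")

def parse(name):
    tokens = name.replace("[", " [ ").replace("]", " ] ").split()
    assert tokens[0] == "["
    return parse_grouping(tokens[1:])

print(parse("[_is-a-shellscript bill]"))
# -> ('grouping', [('pruner', 'is-a-shellscript'), ('key_object', 'bill')])
```

Because `<Unordered List>` is defined by concatenation, the member list here carries no semantic order; an interpreter over this tree would be free to permute it.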
Every object has a list of objects it groups to (associates with, in neural network idiom) in its object header. A grouping is interpreted by performing a set intersection of those lists for every object named in the grouping. In the sense of the data model, a grouping is interpreted by performing a set vicinity intersection. Grouping is not transitive: [A] => B and [B] => C do not imply [A] => C, though they do imply that [[A]] => C. A pruner is an <Object Name> which has been preceded with an _ to indicate that the object described should be passed a list of objects named by the rest of the grouping, executed, and that it will return a subset of the list it was passed. Whether a member of the set is in the returned subset must be fully independent of what the other members of the set were, or else the results become indeterminate after application of a query optimizer, since with an optimizer in use there is no guarantee of the order in which the pruners are applied. <pre> <Ordering> ::= <Object Name>/<Object Name> | <Object Name>/<Custom Programmed Syntax> <Custom Programmed Syntax> ::= Varies, provides extensibility hook. </pre> An ordering is a pairing of names, with the order representing information. The first component of the ordering determines the module to which the second component is passed as an argument. In contrast, a grouping first converts all subnames to Storage_Keys by looking through the same current directory for all of them in parallel, and then does its set intersection with the subdescriptions already resolved. Example: In resolving [my secrets] / [love letter susan] the system would look for the objects with contents my and secrets, find both of them and do a set intersection of all objects those two objects both group to (are associated with). This will allow it to find the [my secrets] directory, inside of which it will look for the three objects love, letter, and susan.
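The non-transitivity claim ([A] => B and [B] => C imply [[A]] => C but not [A] => C) can be demonstrated over a toy association graph. The graph and the resolver below are invented for illustration:

```python
# Toy graph: each object's header lists the objects it groups to.
groups_to = {"A": {"B"}, "B": {"C"}}

def resolve(name):
    """Resolve a bare Key_Object or a (possibly nested) grouping."""
    if isinstance(name, str):              # a bare Key_Object names itself
        return {name}
    hops = []
    for sub in name:                       # one association hop per subname
        members = resolve(sub)
        hops.append(set().union(*(groups_to.get(m, set()) for m in members)))
    return set.intersection(*hops)

print(resolve(["A"]))      # -> {'B'} : [A] reaches B in one hop
print(resolve([["A"]]))    # -> {'C'} : [[A]] resolves [A] first, then hops again
```

Each level of bracket nesting buys exactly one more hop through the association lists, which is why [A] => C fails while [[A]] => C holds.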
It will then extract from their object headers the sets of objects those three words ('love', 'letter', and 'susan') group to, and do a set intersection which will find the desired letter. The desired letter is not necessarily inside the [my secrets] directory, though in this case it probably is. A directory is an object named by the first component of an ordering, to which the second component is passed, and which returns a set of Storage_Keys. One can in principle use different implementations of the same directory object without impacting the semantics and only affecting performance, as is often done in databases. There are flavors of directories: Custom programmed directories, aka filters, are any executable program that will return a Storage_Key when executed and fed the second component as an argument. They provide extensibility. (They are the ordered counterpart of pruners.) Another term for them is filter directories. Custom programmed directories whose name interpretation modules aren't unique to them will contain just the name of the module (filter), plus some directory dependent parameters to be passed to the module. It should be considered merely a syntax barrier directory, and not a fully custom programmed directory, if those parameters include a reference to a search tree that the module operates on, and if that search tree adheres to the default index structure. The connotations conveyed by the term 'filter' of there being an original which is distorted are not always appropriate, but in honesty this is not an issue about which we deeply care. Syntax barrier directories allow you to describe the contents of the object they contain with a syntax different from that of their parents. Except for being sorted by a different ordering function, the indexes of syntax barrier directories are standard in their structure, and use a standard index traversal module. The index traversal module is ordering function independent.
There must be an ordering function for every <Key_Object> employed within a given syntax barrier directory. By contrast, a <Custom Programmed Syntax> could be anything which the syntax module somehow finds an object with, possibly even creating the object in order to be able to find it. To cross a security barrier directory the user must use an ordered pair of names with the security barrier as the first member of the pair, and he must satisfy the security module of the secured directory. A security barrier directory may be both a security and a syntax barrier directory, or the security barrier directory may share the syntax module of its parents. Fully standard directories are those built using the default directory module, and adding structure is their only semantic effect. There is an aspect of customization which is beyond the scope of this paper, in which one customizes the items employed by the storage layer to implement files and directories. That is, the storage of the files and directories is implemented by composing them of items, and these items have different types. We are now creating the code for packing and balancing arbitrary types of items using item handlers and object oriented balancing code, so as to make it easier to extend our filesystem. === Ordering can be implemented more efficiently than grouping === The set intersections performed in evaluating the grouping primitive are normally much more expensive computationally than performing the classical filesystem lookup. Imposing excess structure on one's data does not just at times reduce the cost of human thinking :-), it can be used to reduce the cost of automated computation as well. When the cost to a user of learning structure is less important than the burden on the machine, use of highly ordered names is often called for. === The Motivation for Different Syntactic Treatment of Ordering and Grouping, and Some of the Deeper Issues Revealed by the Difference.
=== An important difference between grouping and ordering affects syntax. It allows us to represent an ordering with a single symbol ('/') placed between the pair, but requires two symbols ('[' and ']') for each grouping. Imagine using < and > as a two symbol delimiter style alternative notation for ordering: <<father-of mother-of>sister-of> = <father-of<mother-of sister-of>> = <father-of mother-of sister-of> = father-of/mother-of/sister-of All of the expressions above are equivalent in referring to the paternal great aunt of the person who is the current context. The ones using nested pairs of symbols to enclose pairs of subnames imply a false structure that requires the user to think to realize the first two expressions are equivalent. The fourth is the notation this naming system employs. Grouping is different: Fast Acting Freddy is looking through the All-LA Shopping Database for a single store with black reebok sneakers, a green leather jacket, and a red beret so that he can dress an actor for a part before the director notices he forgot all about him. [[black reebok sneakers] [green leather jacket] [red beret]] is not equivalent to [black reebok sneakers green leather jacket red beret] which equals [red sneakers black reebok leather jacket green beret] Ordering is not algebraically commutative (father-of/mother-of is not equivalent to mother-of/father-of). Groupings are algebraically commutative. ([large red] = [red large]) == Style == As a general principle, a more restricted system can avoid requiring the user to repeatedly specify the restrictions, and if the user has no need to escape the restrictions then the restricted system may be superior. This is why "4GLs", which supply the structure for the user's query, are useful for some applications. They are typically implemented as layers on top of unrestricting systems such as this one. This paper has addressed issues surrounding finding information, particularly when the user's clues are faint.
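The commutativity contrast can be made concrete by modelling a grouping as a frozenset and an ordering as a tuple. This is a deliberately crude algebraic model (real groupings also carry pruners and resolution semantics), with Freddy's example reused for illustration:

```python
# A grouping is a frozenset of subnames (order carries no information);
# an ordering is a tuple (the order IS the information).
assert frozenset({"large", "red"}) == frozenset({"red", "large"})  # [large red] = [red large]
assert ("father-of", "mother-of") != ("mother-of", "father-of")    # orderings do not commute

# Freddy's query: nesting matters even though ordering within a grouping does not.
nested = frozenset({frozenset({"black", "reebok", "sneakers"}),
                    frozenset({"green", "leather", "jacket"}),
                    frozenset({"red", "beret"})})
flat = frozenset({"black", "reebok", "sneakers", "green",
                  "leather", "jacket", "red", "beret"})
scrambled = frozenset({"red", "sneakers", "black", "reebok",
                       "leather", "jacket", "green", "beret"})
assert flat == scrambled   # scrambling a flat grouping changes nothing
assert nested != flat      # but flattening a nested grouping loses structure
print("grouping commutes; nesting carries information")
```

The last two assertions restate the section's point: the information destroyed by decompounding is the partitioning into subgroupings, not the order of the words.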
When supporting other user goals, such as exploring information, adding structure through substantial use of ordering can be helpful [Marchionini][McAleese]. When the user goal is finding, one should assume that of all the fragments of information about an object, the user has some random subset of them. The goal is to allow the user to use that random subset in a name, whatever that subset might be. Some of that subset will be structural fragments. While requiring the user to supply a structure fragment is as foolish as requiring him to supply any other arbitrary fragment, allowing him to do so is laudable. In the best of all worlds the object store would incorporate all valid possible structurings of Key_Objects. The difficulty in implementing that is obvious. [Metzler and Haas] discuss ways of extracting structure from English text documents, and why one would want to be able to use that structure in retrievals. Unfortunately, there is an important difference between representing the structure of an English language sentence in a way that conveys its meaning, and representing it in a way that allows it to be found by someone who knows only a fragment of its semantic content. I doubt the wisdom of advocating the use of more than essential structure in searching. You can allow users to avoid false structure; you cannot force them to. It is important to teach those creating the structure that if they group a personnel file with sex/female they should also group it with female. Type checking can impose structure usefully. Its implementation can enhance or reduce closure, depending on whether it is done right.

=== When To Decompound Groupings ===

There are dangers in excessive compounding of compound groupings analogous to those of excessive ordering. Let's examine two examples of compound groupings, both of which are valid semantically and syntactically.
One of them can be "decompounded" with moderate information loss, and the other loses all meaning if decompounded. Example: finding a loquacious Celtic textbook salesman who told you in excruciating detail about how he was an ordnance researcher until one day he went to a Grateful Dead concert.

[[Celtic textbook salesman] [ordnance researcher]]

vs.

[Celtic textbook salesman ordnance researcher]

These two phrasings of the same query are not equivalent, but they are "close." Our second example is the one in which Fast Acting Freddy tries to find a suspect by the objects he is associated with:

[[black reebok sneakers] [green leather jacket] [red beret]]

vs.

[black reebok sneakers green leather jacket red beret]

These two are not at all "close." The difference between the two examples of inequivalencies is that the subdescriptions within the second example describe objects that are worth representing in the object store independently of the store they describe. The subdescriptions of the first example are not, and so it is more reasonable to try to design so that the "decompounded" version of the query is used. False hits will occur, but for large systems that's better than asking the user to learn structure. A higher level user interface might choose to present only one level to the user at a time, and then once the user confirms that a subdescription has resolved properly it would let him incorporate it into a higher level description. There might be 6 models of [black reebok sneakers], and Fast Acting Freddy should have the opportunity to click his mouse on the exact model, and have the interface substitute that object for his subdescription. Using such an interface an advanced user might simultaneously develop several subdescriptions, refine and resolve them, and then use the mouse to draw lines connecting them into a compound grouping. Closure makes it possible for that to work.
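The compound-versus-decompounded distinction can be made concrete with a toy matcher. All data and function names here are invented for illustration (the paper specifies the semantics, not this code): a flat grouping only requires every word to appear somewhere in the store, while a compound grouping requires each subdescription to be satisfied by a single item.

```python
# Toy model: a store is a list of items, each item an unordered set of
# words. Illustrative sketch only; all names and data are invented.

store_good = [{"black", "reebok", "sneakers"},
              {"green", "leather", "jacket"},
              {"red", "beret"}]
# A store whose colors are shuffled across different items:
store_false_hit = [{"red", "sneakers"},
                   {"black", "reebok", "leather", "jacket"},
                   {"green", "beret"}]

def matches_flat(store, words):
    """Decompounded query: every word appears somewhere in the store."""
    available = set().union(*store)
    return set(words) <= available

def matches_compound(store, subdescriptions):
    """Compound query: each subdescription fits within a single item."""
    return all(any(set(sub) <= item for item in store)
               for sub in subdescriptions)

query_compound = [{"black", "reebok", "sneakers"},
                  {"green", "leather", "jacket"},
                  {"red", "beret"}]
query_flat = {"black", "reebok", "sneakers", "green", "leather",
              "jacket", "red", "beret"}

assert matches_compound(store_good, query_compound)
assert matches_flat(store_good, query_flat)
# The decompounded query also matches the shuffled store -- the false
# hit the text warns about -- while the compound query rejects it:
assert matches_flat(store_false_hit, query_flat)
assert not matches_compound(store_false_hit, query_compound)
```

This is the trade the text describes: the flat form asks less structure of the user and accepts occasional false hits in exchange.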
== Examples of Creating Associations ==

<- creates an association between all of the objects on the left hand side and all of the objects on the right hand side. A - B is the set difference of A and B, and it resolves to the set of objects in A except for those that are in B. A & B resolves to the set intersection of A and B, the objects that are in both A and B. [A B] = [A] & [B], by definition.

animal <- (lives, moves)
mammal <- ([animal], animal, `warm blooded')
cat <- ([mammal], hypernym/mammal, mammal, meronym/fur, fur, meronym/whiskers, whiskers, hypernym/quadruped, quadruped, capability/purr, purr, capability/meow, meow)
Basil <- (owner/Nina, Nina, [siamese], siamese, clever, playful, brave/overly, brave, 'toilet explorer')
bag <- ([container], container, consists-of/`highly flexible material', `highly flexible material')
backpack <- ([bag], shoulderstrap/quantity/2, shoulderstrap, college-student, holonym/backpacker, meronym/shoulderstrap)
mould <- ([fungi] - green/not, furry, `grows on'/surfaces/moist, `killed by'/chlorine)
fungi <- ([plant], plant, leaves/no, flowers/no, green/not)
bird <- ([vertebrate], vertebrate, flies, feathers)
penguin <- ([bird] - flies, bird, hypernym/bird, swims, Linux, [Linux (mascot, symbol)])
siamese <- ([cat], cat, hair/short, short-hair)

Notice how we don't associate siamese with short despite associating it with hair/short, but we do associate Basil with Nina as well as with owner/Nina.

small <-0 little

The above means that small and little are synonyms, and are to be treated as 0 distance away from each other for vicinity calculation purposes. In other, traditional Unix, words, they are hardlinked together. Creating a serious ontology is not our field or task, but it is worth doing. The reader is referred to WordNet (free), and Cyc by Doug Lenat (proprietary). While we will focus on implementing primitives that allow for creating better ontologies, we are happy to work with persons interested in contributing or porting an ontology.
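A minimal sketch of how such associations might be resolved by set intersection, using a toy in-memory index. The `associate` and `resolve` names, and the sample data, are invented for illustration; the paper defines the semantics, not this data structure. By resolving a grouping as the intersection of each subname's set, [A B] = [A] & [B] holds by construction.

```python
from collections import defaultdict

# name -> set of objects associated with that name (a toy index)
index = defaultdict(set)

def associate(obj, names):
    """Toy version of `obj <- (names...)`."""
    for name in names:
        index[name].add(obj)

associate("bird",    ["vertebrate", "flies", "feathers"])
associate("penguin", ["bird", "swims", "feathers"])
associate("sparrow", ["bird", "flies", "feathers"])

def resolve(grouping):
    """[a b ...] resolves to the intersection of each subname's set."""
    sets = [index[name] for name in grouping]
    result = sets[0].copy()
    for s in sets[1:]:
        result &= s
    return result

# [bird flies] = [bird] & [flies]:
assert resolve(["bird", "flies"]) == {"sparrow"}
# Set difference expresses exceptions, as in `penguin <- ([bird] - flies)`:
assert index["bird"] - index["flies"] == {"penguin"}
```

The intersections here are exactly the cost discussed earlier under "Ordering can be implemented more efficiently than grouping."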
== Other Projects Seeking To Increase Closure In The OS ==

=== AT&T's Plan 9 ===

[Plan 9] is being produced by the original authors of Unix at AT&T research labs. It has influenced CORBA, and Linux's /proc is a direct steal from it. Their major focus is on integration. Their major trick for increasing integration is unifying the name space. Name spaces integrated into the Plan 9 file system include the status, control, virtual memory, and environment variables of running processes. They have a hierarchical analog to what the relational culture calls constructing views, which the Plan 9 culture calls context binding.

=== Microsoft's Information At Your Fingertips ===

Plan 9 ignores integration of application program name spaces, concentrating on OS oriented name spaces. Microsoft's "Information at Your Fingertips" name space integration effort appears to be taking the other approach, focusing on integrating the name spaces of the various Microsoft applications via OLE and Structured Storage. The application group at Microsoft has long been better staffed and funded than the OS group, and FS developers have long preferred to simply ignore the needs of application builders generally. The primary semantic disadvantages of Microsoft's approach are primitives selected with insufficient care, a lack of closure, and the use of an object oriented rather than set oriented approach in both naming syntax and data model. Realistically, one can say that folks within Microsoft have often made statements favoring name space integration, and in various areas have successfully executed on it, but on the whole I rather suspect that the lack of someone in marketing making a business case for $X in revenue resulting from name space integration has crippled name space integration work at commercial OS producers generally, including MS.

==== Internet Explorer ====

Internet Explorer attempts to unify the filesystem and Internet namespaces.
At the time of writing, the unification is so superficial that I would describe it as having the look and feel of integration without most of the substance. Perhaps this will change.

==== Microsoft's Well Known Performance Difficulties ====

Despite having many of the leading names in the industry on their payroll, they have somehow managed to create a file system implementation with performance so terrible that, for the Unix customer base, it is a significant consideration contributing to hesitation in moving to NT. It may well have the worst performance of any of the major OS file systems. Their implementation of OLE's structured storage offers extremely poor performance, and their excuse that it is due to the incorporation of transaction concepts into their design is just a reminder that they did a poor job at that as well. They managed to implement something intended to store small objects within a file, yet implemented it such that it still suffers from 512-byte granularity problems, problems that they try to somewhat overcome by encouraging the packing of several objects within "storages" at horrible kludge cost.

=== Storage Layers Above the FS: A Sure Symptom The FS Developer Has Failed ===

When filesystems aren't really designed for the needs of the storage layers above them, and none of them are, not Microsoft's, not anybody's, then layering results in enormous performance loss. The very existence of a storage layer above the filesystem means that the filesystem team at an OS vendor failed to listen to someone, and that someone was forced to go and implement something on their own. You just have to listen to one of these meetings in which some poor application developer tries to suggest that more features in the FS would be nice; I heard one at a nameless OS vendor.
The FS team responds by saying that disks are cheap, small object storage isn't really important, we haven't changed the disk layout in 10 years, and changing it isn't going to fly with the gods above us, about whom we can do nothing. At these meetings you start to understand that most people who go into filesystem design are persons who didn't have the guts to pursue a more interesting field in CS. There is a sort of reverse increasing returns effect that governs FS research: the more code becomes fixed on the current APIs, the more persons in the field react with fear to any thought of the field of FS semantics being other than a dead research topic, the less research gets done, and the fewer persons of imagination see a reason to enter the field. Every time one vendor gets a little ahead in adding functionality, the other vendors go on a FUD campaign about it breaking standards and therefore being dangerous for mission critical usage. This is a field in which only performance research is allowed, and every other aspect is simply dead. Namesys seeks to raise the dead, and is willing to commit whatever unholy acts that requires.

There is no need for two implementations of the set primitive, one called directories, the other called a file with streams, each having a different interface. File systems should just implement directories right, give them some more optional features, and then there is no need at all for streams. If you combine allowing directory names to be overloaded to also be filenames when acted on as files, allowing stat data to be inherited, allowing file bodies to be inherited, and implementing filters of various kinds, then in the event that the user happens to need the precise peculiar functionality embodied by streams, they can have it by just configuring their directory in a particular way. There was a lengthy Linux-kernel thread on this topic which I won't repeat in more detail here.
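The claim that streams are just directories done right can be illustrated with a hypothetical sketch. The `Node` class below is invented for this illustration (it is not ReiserFS code): if a single node can be opened as a file and also listed as a directory, a per-file "stream" is just an ordinary named child of that node.

```python
# Illustrative sketch only: a node overloaded to act as both a file
# (it has a body) and a directory (it has named children).

class Node:
    def __init__(self, body=b""):
        self.body = body       # contents when acted on as a file
        self.children = {}     # contents when acted on as a directory

    def read(self):
        return self.body

    def lookup(self, name):
        return self.children[name]

# A document with an NTFS-style alternate stream, modeled as a child:
doc = Node(b"main document text")
doc.children["icon"] = Node(b"icon bytes")

assert doc.read() == b"main document text"          # acted on as a file
assert doc.lookup("icon").read() == b"icon bytes"   # acted on as a directory
```

One interface, the directory, covers both cases; no second "stream" primitive is needed.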
The tree architecture of the storage layer of this FS design will lend itself to a distributed caching system much more effectively than the Microsoft storage layer, in part due to its ability to cache not just hits and misses of files, but to cache semantic localities (ranges). For more on this topic see later in this paper.

=== Rufus ===

The Rufus system [Messinger et al.] indexes information while leaving it in its original location and format. While it does allow the user to create a unified name space, it does not choose to integrate that name space into the operating system. Even so, it is immensely useful in practice, and strongly hints at what the OS could gain if it had a more than hierarchical name space with a data model oriented towards what [Messinger] calls "semi-structured information", such as you find in the RFC822 format for email. When you have 7000 pieces of mail, and linearly searching the mail with a utility like grep takes 10 minutes, it is nice to be able to quickly keyword search via inverted indexes for the mail whose from: field contains billg and that has the words "exclusive" and "bundling" in the body of the message, as you hurriedly search for an old email just before an appearance in court.

=== Semantic File System ===

The Semantic File System comes closest to addressing the needs I have described. It is a Unix compatible file system with more than hierarchical naming (attribute based is the term they use). Its data model unfortunately has the important flaw of lacking closure (in it, names of objects are not themselves objects). In my upcoming discussion of the unnecessary lack of closure in hypertext products, notice that the arguments apply to the Semantic File System as well (and so I won't duplicate them here).

=== OS/400 ===

IBM's OS/400 employs a unified relational name space. The section of this paper entitled A System Should Reflect Rather than Mold Structure will cover its problems of forcing false structure.
Inadequate closure due to mandatory type checking is another source of difficulties for it. While users moan about these two unnecessary design flaws, the essence of the opinions AS/400 partisans have expressed to me has been that the unification of its name space is a great advantage that OS/400 has over Unix. I claim these users were right, and later in this paper will propose doing something about it.

== Conclusion ==

While I spent most of this paper on why adding structure to information can be harmful, particularly when it is intended to be found by others sifting through large amounts of other information, this was purely because it is a harder argument than why deleting structure is harmful. My goal was not to be better at unstructured applications than keyword systems, or better at structured applications than the hierarchical and relational systems --- the goal was to be more flexible in allowing the user to choose how structured to be, while still being within a single name space. I claimed that multiple fragmented name spaces cannot match the power and ease of name spaces integrated with closure: closure makes a naming system far more powerful by increasing its ability to compound complex descriptions out of simpler ones. The strong points of this naming system's design are various forms of generalizing abstractions already known to the literature, for greater closure.

== Acknowledgments ==

David P. Anderson and Clifford Lynch helped enormously in rounding out my education, and improving my paper. Their generosity with their time was remarkable. David P. Anderson was simply a great professor, and it was a privilege to work with him. Brian Harvey informed me that it wasn't too obvious to mention that an object store should be unified. Cimmaron Taylor provided me with many valuable late night discussions in the early stages of this paper.
I would like to thank Bill Cody and Guy Lohman of the database group at the IBM Almaden Research Center for a wonderful learning experience. Vladimir Saveliev kept this file system going when others fell by the wayside. He started as the most junior programmer on the team, and through sheer hard work and dedication to excellence outshone all the other more senior researchers. Of course, after some time he could no longer be considered a junior programmer.

NOTE: See also the DARPA funded, but not endorsed, [[Txn-doc|Reiser4 Transaction Design Document]] and [[Reiser4|Reiser4 Whitepaper]].

== References ==

* 1. Blair, David C. and Maron, M. E. [http://portal.acm.org/citation.cfm?doid=3166.3197 Evaluation of Retrieval Effectiveness for a Full-Text Document-Retrieval System], Communications of the ACM, v 28 n 3, Mar 1985, p289-299
* 2. Codd, E. F. [http://portal.acm.org/citation.cfm?id=77708 The Relational Model for Database Management: version 2], c1990, Addison-Wesley Pub. Co. Not recommended as a textbook (Date's is better for that), but worthwhile if you want a long paper by Codd. Notice that he places greater emphasis on closure, and design methodology principles in general, than designers of other naming systems such as hypertext.
* 3. Date, C.J. [http://portal.acm.org/citation.cfm?id=4198 An Introduction to Database Systems], 4th ed., Reading, Mass.: Addison-Wesley Pub. Co., c1986. Contains a well written substantive textbook sneer at the problems of hierarchical naming systems, and a well annotated bibliography.
* 4. Curtis, Ronald and Larry Wittie. [http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?isnumber=35714&arnumber=1695185 Global Naming in Distributed Systems], IEEE Software, July 1984, p76-80
* 5. Feldman, Jerome A., Mark A. Fanty, Nigel H. Goddard and Kenton J. Lynne. [http://portal.acm.org/citation.cfm?id=42372.42378 Computing with Structured Connectionist Networks], Communications of the ACM, v31, Feb '88, p170(18)
* 6. Fox, E. A., and Wu, H. [http://portal.acm.org/citation.cfm?id=358466 Extended Boolean Information Retrieval], Communications of the ACM, 26, 1983, pp. 1022-1036
* 7. Gallant, Stephen I. "Connectionist Expert Systems", Communications of the ACM, v31, Feb '88, p152(18)
* 8. Gates, Bill. Comdex '91 speech on "Information at Your Fingertips", available for $8 on videotape from Microsoft's sales department.
* 9. Gifford, David K., Jouvelot, Pierre, Sheldon, Mark A., O'Toole, James W. Jr. "Semantic File Systems", Operating Systems Review, Volume 25, Number 5, October 13-16, 1991. They demonstrated that extending Unix file semantics to include nonhierarchical features is useful and feasible. Unfortunately, their naming system lacks closure.
* 10. Gilula, Mikhail. "The Set Model for Database and Information Systems", 1st Edition, c1994, Addison-Wesley. Provides a Set Theoretic Database Model in which relational algebra is shown to be a special case of a more general and powerful set theoretic approach.
* 11. Joint Object Services Submission (JOSS), OMG TC Document 93.5.1
* 12. Marchionini, Gary, and Shneiderman, Ben. "Finding Facts vs. Browsing Knowledge in Hypertext Systems", Computer, January 1988, p. 70
* 13. McAleese, Ray. "Hypertext: Theory into Practice", edited by Ray McAleese, ABLEX Publishing Corporation, Norwood, NJ 07648
* 14. Messinger, Eli, Shoens, Kurt, Thomas, John, Luniewski, Allen. "Rufus: The Information Sponge", Research Report RJ 8294 (75655), August 13, 1991, IBM Almaden Research Center
* 15. Metzler and Haas. "The Constituent Object Parser: Syntactic Structure Matching for Information Retrieval", Proceedings of the ACM SIGIR Conference, 1989, ACM Press
* 16. Nelson, T.H. Literary Machines, self published by Nelson, Nashville, Tenn., 1981. Did much to popularize hypertext; at the time of writing he has still not released a working product, though competitors such as HyperCard have done so with notable success.
* 17. Mozer, Michael C. "Inductive Information Retrieval Using Parallel Distributed Computation", UCLA
* 18. Pike, Rob and P.J. Weinberger. "The Hideous Name", AT&T Research Report
* 19. Pike, Rob, Presotto, Dave, Thompson, Ken, Trickey, Howard, Winterbottom, Phil. "The Use of Name Spaces in Plan 9", available via ftp from att.com. Plan 9 is an operating system intended to be the successor to Unix, and greater integration of its name spaces is its primary focus.
* 20. Potter, Walter D. and Robert P. Trueblood. "Traditional, semantic, and hyper-semantic approaches to data modeling", v21, Computer, '88, p53(11)
* 21. Rijsbergen, C. J. Van. Information Retrieval, 2nd ed., Butterworth and Co. Ltd., 1979. Printed in Great Britain by The Whitefriars Ltd., London and Tonbridge
* 22. Salton, G. (1986). Another Look At Automatic Text-Retrieval Systems, Communications of the ACM, 29, 648-656
* 23. Smith, J.M. and D.C. Smith. "Database Abstractions: Aggregation and Generalization", ACM Transactions on Database Systems, June 1977, pp. 105-133, ICS Report No. 8406, June 1984
* 24. http://www.win.tue.nl/~aeb/partitions/partition_types.html

[[category:Reiser4]]

The Naming System Venture

== Abstract ==

For too long the file system has been semantically impoverished in comparison with database and keyword systems. It is time to change! The current lack of features makes it much easier to use the latest set theoretic models rather than older models of relational algebra or hypertext. The current FS syntax fits nicely into the newer model. The utility of an operating system is more proportional to the number of connections possible between its components than it is to the number of those components. Namespace fragmentation is the most important determinant of that number of possible connections between OS components.
Unix at its beginning increased the integration of I/O by putting devices into the file system name space. This is a winning strategy: let's take the file system name space and, one missing feature at a time, eliminate the reasons why the filesystem is inadequate for what other name spaces are used for. Only once we have done so will the hobbles be removed from OS architects, or even OS conspiracies. Yet before doing that, we need a core architecture for the semantics to ensure we end up with a coherent whole. This paper suggests a set theoretic model for those semantics. The relational models would at times unacceptably add structure to information, the keyword models would at times delete structure, and purely hierarchical models would create information mazes. Reworking their primitives is required to synthesize the best attributes of these models in a way that allows one the flexibility to tailor the level of structure to the need of the moment. The set theoretic model I propose has a syntax that is upwardly compatible with Linux, MacOS, and DOS file system syntax, as well as with the CORBA naming layer.

This is a planning document for the next major version of ReiserFS, that is, a description of vaporware. It is useful to ReiserFS users and contributors who want to know where we are going, and why we are building all sorts of strange optimizations into the storage layer (and especially to those who are willing to help shape the vision in the course of discussions on the {{listaddress}} mailing list...). Currently the storage layer for ReiserFS is working and useful as an everyday FS with conventional semantics. That storage layer is available as a GPL'd Linux kernel patch.

== Introduction ==

Many OS researchers have built hierarchical name spaces that innovate in their effect on the integration of the operating system (e.g. Plan 9 and their file system [Pike]).
Relational and keyword researchers rightfully scorn hierarchical name spaces as 20 years behind the state of the art [Date], but pay little attention to integration of the operating system as a design objective in their own work, or as a possible influence on data model design. I won't go into that here. Limiting associations to single key words is an unnecessary restriction.

== A Naming System Should Reflect Rather than Mold Structure ==

The importance of not deleting the structure of information is obvious; few would advocate using the keyword model to unify naming. What can be more difficult to see is the harm from adding structure to information; some do recommend the relational model for unifying naming (e.g. OS/400). By decomposing a primitive of a model into smaller primitives one can end up with a more general model, one with greater flexibility of application. This is the very normal practice of mathematicians, who in their work constantly examine mathematical models with an eye to finding a more fundamental set of primitives, in hopes that a new formulation of the model will allow the new primitives to function more independently, and thereby increase the generality and expressive power of the model. Here I break the relational primitive (a tuple is an unordered set of ordered pairs) into separate ordered and unordered set primitives. Relational systems force you to use unordered sets of ordered pairs when sometimes what you want is a simple unordered set.

Why should a naming system match rather than mold the structure of information? For systems of low complexity, the reasons are deeply philosophical, which means uncompelling. And for multiterabyte distributed systems?...

Reiser's Rule of Thumb #2: The most important characteristic of a very complex system is the user's inability to learn its structure as a whole. We must avoid adding structure, or guarantee that the user will be informed of all structure relevant to his partial information.
Avoiding adding structure is both more feasible and less burdensome to the user. Hierarchical, relational, semantic, and hypersemantic systems all force structure on information, structure inherent in the system rather than the information represented. If a system adds structure, and the user is trying to exploit partial knowledge (such as a name embodies), then it inevitably requires the user to learn what was added before he can employ his partial knowledge. With complex systems, the amount added is beyond the capacity of users to learn, and information is lost.

Example:

<tt>"My name is Kali, your friendly technical support specialist for REGRES. Our system puts the Library of Congress online! How may I help you."</tt>

George doesn't know Santa Claus' name: <tt>"I'm trying to find the reindeer chimneys christmas man, and I can't get your system to do it."</tt>

[[Image:Reindeer.jpg]]

FIGURE 1. Graphical representation of a typical simple unordered set that is difficult for relational systems.

Kali says: <tt>"OK, now let's define a query. '''is-a equals man''', that's easy. But reindeer? Is reindeer a property of this man?"</tt>

<tt>"Uh no. I wish I could remember the dude's name. I read this story about him a long time ago, and all I can remember is that he had something to do with reindeer and chimneys. The story is on-line, somewhere."</tt>

<tt>"Reindeer chimneys presents man, that's the sort of speech pattern I'd expect from a three year old."</tt> Kali corrects him. <tt>"Let's see if we can structure this properly. Is reindeer an '''instance-of''' of this man? A '''member-of''' of this man? It couldn't be a '''generalization''' of this man. Hmm..."</tt>

<tt>"No! It's not that complicated. They just have something to do with him."</tt>

<tt>"Pavlov would probably say you associate reindeer with this man, the way the unstructured mind of an animal thinks. But here in technical support we try to help our customers become more sophisticated.
Is reindeer a property of this man?"</tt>

<tt>"No. Try '''propulsion-provider-for'''."</tt>

<tt>"Do you think that that was the schema the person who put the information in our system used?"</tt>

<tt>"No. Shoot. I can think of a dozen different columns it could be under. But what are the chances that the ones I think of are going to be the same as the ones the dude who put the information in used?"</tt>

Kali feels satisfaction. <tt>"Guess it can't be done, not if you can't structure your REGRES query properly. I'll put you down in my log as a closed ticket, 190 seconds to resolution, not bad."</tt>

<tt>"A keyword system could handle reindeer chimneys christmas man."</tt> George grumbles as he stares in despair at his display. Unfortunately, the ''Library of Congress'' is only one of REGRES' many reference aids. George could spend his life at it, and he'd never learn its schema.

<tt>"But a keyword system would delete even necessary structure inherent to the information. It couldn't handle our other needs!"</tt> Kali says before she hangs up.

In addition to the searcher's difficulties, having to manufacture structure by specifying the column for reindeer also adds unnecessary cognitive load to the story author's indexing tasks.

== A Few of the Other Approaches to This Problem ==

There is lurking at the heart of my approach a subtle difference between my analysis of naming, and the analysis of at least some others. I started my research by systematically categorizing the different structures embodied by names, placing them into equivalency classes, and then picking one syntax out of each class of functionally equivalent naming structures, on the assumption that each of the equivalency classes has value. For example, I considered that languages sometimes convey structure by word endings (tags), and sometimes by word order, but while the syntax differs, the word order and word ending techniques are equivalent in their power to convey structure.
In my analysis of the effect of word ordering I decided that either the ordering mattered, or it did not, and that was the basis for two different naming primitives. Others have instead studied the inherent structure of data, and then from that derived ways of naming. The hypersemantic system [Smith] [Potter] represents an attempt to pick a manageably few columns which cover all possible needs. Generalization, aggregation, classification, and membership correspond to the is-a, has-property, is-an-instance-of, and is-a-member-of columns, respectively. The minor problem is that these columns don't cover all possibilities. They don't cover reindeer, presents, or chimneys for George's query. The major problem is that they don't correspond as closely as possible to the most common style of human thought, simple unordered association, and they require cognitive effort to transform.

The first response of relational database researchers to this is usually to ask: "Why not modify an existing relational database to contain an 'associated' column, put everything in that column, and it would be functionally equivalent to what you want?" This is like saying that you can do everything Pascal can do using TeX macros. (They are both Turing complete.) We don't design languages to simply be Turing complete, we design them to be useful. I have seen a colleague do in six lines of SQL (nonstandard SQL) a simple three keyword unordered set query that I do in three words plus a pair of delimiters, and that traditional keyword systems also handle easily. Doing simple unordered sets well is crucial for highly heterogeneous name spaces, and the market success of keyword systems in Internet searching is evidence of that. If you look at the structure of names in human languages, they are not all tuple structured, and to make them tuple structured might be to distort them. I have merely discussed the burden of naming columns. Most relational systems also require the user to specify the relation name.
If column naming is a burden, naming both the column and the relation is no less a burden. Many systems invest effort into allowing you to take the key that you know, and figure out all the relation names and columns that you might choose to pair with it. This is a good idea, but not as good as not imposing extraneous structure to begin with. [Salton] can be read for devastating critiques of the document clustering system, but there is a worthwhile idea lurking within that system. Perhaps it is worthwhile to keep track of a small number of documents which are "close" to a given document. The document creator could be informed upon auto-indexing the document what other documents appear to be close to it, and asked to consider associating it with them. This is not within our current plan of work, but I don't reject it conceptually.

In summary, modularity within the naming system is improved by recognizing unordered grouping and ordering as two different functions that deserve separate primitives rather than being combined into a tuple primitive. The tuple is an unordered set of ordered pairs. There are other useful combinations of unordered grouping and ordering than that embodied by the relation, and the success of keyword systems suggests that a plain unordered set without any ordering at all is the most fundamental and common of them.

== Names as Random Subsets of the Information In an Object ==

A system may still be effective when its assumptions are known to be false. You may regard the above as an overstatement of the notion that we are neural nets, and sometimes our abstract systems deal with assumptions that are not true or false, but are somewhat true. After we are finished stating them in English they lose the delicate weighting possessed by the reality of the situation. Sometimes we find it easier to model without that weighting.
Classical economics, with its assumption of perfect competition, is the best known example of an effective system based on assumptions known to be substantially false. Introductory economics classes usually spend several weeks of class time arguing the merits of building models on somewhat false assumptions. This paper will now use such a somewhat false model to convey a feel for why mandatory pairing of name components causes problems.

Assume the user's information from which he tries to construct a description will be some completely random subset of the information about the object. (Some of that information will be structural, and the structural fragments selected will be just as random as the rest.) Assume a user has 15 random clues of information selected from 300 pieces of information the system knows about some object. Assume the REGRES naming system requires that data be supplied in threesomes (perhaps column name, key name, relation name), and cannot use one member of a threesome without the other members of the threesome. Assume the ANARCHY naming system lacks this restriction, but does so at the cost that it can only use those 10 of the 15 information fragments which do not embody structure. Assume the statistical distribution of the 15 pieces of information the user has to construct a name with is fully independent and equally likely (this is both substantially wrong, and unfair to REGRES, but ....). Assume each clue has a selectivity of 100 (it divides the number of objects returned by 100).

Then ANARCHY has a selectivity of 100<sup>10</sup> = 10<sup>20</sup> = good.

REGRES has a selectivity of 100<sup>(chance that the other two members of an object's threesome are possessed by the user × 15)</sup> = 100<sup>(9/300 × 8/300 × 15)</sup> ≈ 1.06 = very bad.

While it is not true that the clues are fully independent, it is true that to the extent that they are not fully dependent, ANARCHY will gain in selectivity compared to REGRES.
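The arithmetic of this toy model can be checked with a few lines of Python. This is only a sketch of the model above (the constants 15, 300, 10, and the 9/300 × 8/300 approximation are taken from the text), not of any real system:

```python
# Toy selectivity model from the text: 15 random clues out of 300
# fragments known about the object, each usable clue dividing the
# candidate set by 100.

CLUES = 15          # fragments the user happens to possess
FRAGMENTS = 300     # fragments the system knows about the object
SELECTIVITY = 100   # each usable clue divides the result set by this

# ANARCHY can use only the 10 non-structural clues, but can use
# each of them independently.
anarchy = SELECTIVITY ** 10              # 100^10 = 10^20

# REGRES can use a clue only when the other two members of its
# threesome are also among the user's clues (the text's 9/300 and
# 8/300 approximation), so the expected number of usable clues is:
usable = (9 / 300) * (8 / 300) * CLUES   # = 0.012
regres = SELECTIVITY ** usable           # ~= 1.06

print(anarchy)            # 10^20
print(round(regres, 2))   # ~1.06
```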
Attempting to quantify for any database the extent of the dependence would be a nightmare, and so this model assumes a substantial falsity, through which it is hoped the reader can see a greater truth. For databases of the lower heterogeneity and complexity that the relational model was designed for, the independence within a threesome can be small, and the ability to also employ the 5 of 15 fragments which are structural is often more important than the difficulty of guessing any structure added. There is an implicit assumption here that you are looking for information that others have structured, and this argument in favor of ANARCHY becomes much less strong without this assumption.

I feel obligated to stress once again that I do not advocate low structure over high structure, but I do advocate having the flexibility to match the amount of structure to the needs of the moment. Only with such flexibility can one hope to use all of the 15 fragments that happen to be possessed.

== The Syntax In More Detail ==

What's needed is a naming system intended to reflect just the structure inherent in the information, whatever that structure might be, rather than restructuring the information to fit the naming system.

=== Orthogonal or Unoriginal Primitives and Features ===

There are many primitives that the ultimate naming system would include but which I will not discuss here: macros, OR, weights for subnames and AND-OR connectors [Fox], rules, constraints, indirection, links, and others. I have tried to select only those aspects in which my approach differs from the standard approach. Unifying the namespace does not require unifying automatic name generation, and those who read the [Blair] vs. [Salton] controversy likely understand my concluding that whatever the benefits might be of unifying automatic name generation, it is not feasible now, and won't be feasible for a long time to come.
The names one can assign an object are kept completely orthogonal to the contents of the object in the implementation of this naming layer. It is up to the owner of the object to name it, and it is up to him to use whatever combination of autonaming programs and manual naming best achieves his purpose. He may name it on object creation, and he may continually adjust its various names throughout its lifetime. See the section defining the "Key_Object primitive" for a discussion of why names should be thought of this way.

Technically, object creation only requires that the object be given a Storage_Key. In practice, most users will, in the same act that creates the object, also associate the object with at least one name that will spare them from directly specifying the Storage_Key in hex the next time they make a reference to it. Applications implementing external name spaces can interact with the storage layer by referencing just the Storage_Key.

Namesys will provide a manual naming interface, and the API autonaming programs need to plug into. Companies such as Ecila will provide autonamers for various purposes. Ecila is implementing a program which scans remote stores, creates links to them in the unified name space, but leaves the data on the remote stores. Other programs may also be implemented to perform this general function. To be more specific, the Ecila search engine scans the web for documents in French, and uses the filesystem as an indexing engine. However, they are writing their engine to be a general purpose engine, they have sold support and the addition of extensions to it to other search engine companies, and it is open source. For now we are simply functioning as part of their engine, and the interface is by web browser; at some point we may be able to add their functionality to the namespace.
While the implementation of Microsoft's attempt to blur the distinction between the filesystem name space and the web namespace is one more of appearance than substance, it is surely the right thing to do for Linux as well in the long run. We should simply make our integration one with substance and utility, rather than integrating mostly the look and feel. When the store is external to the primary store for the namespace, stale names can be an issue with no clean resolution. That said, unification at just the naming layer is, in a real rather than ideal world, often quite useful, and so we have Internet search engines.

GUI based naming is beyond the scope of this paper, except to mention that it is common for GUI namespaces to be designed such that they are not well integrated with the other namespaces of the OS. They are often thought to be necessarily less powerful, but proper integration would make this untrue, as they would then be additional syntaxes rather than substitutes. These additional syntaxes should possess closure within the general name space, and thereby be capable of finding employment as components of compound names like all the other types of names. Compound names should be able to contain both GUI and non-GUI based name components. Integration would make them simply the aspect of naming that applies to what is present in the visual cache of the screen, and to how to manage and display that cache most effectively.

=== Vicinity Set Intersection Definition (Also Called Grouping) ===

Suppose you have a set X of objects. Suppose some of these objects are associated with each other. You can draw them as connected in a graph. Let the vicinity of an object A be the set of objects associated with A. Let there be a set of query objects Q. Then the set vicinity intersection of Q is the set of objects which are a member of all vicinities of the objects in Q. When thinking of this as a data model, it seems natural to use the term vicinity set intersection.
When thinking of this syntactically, it seems natural to use the term "grouping", because it implies that the subnames are grouped together without the order of the subnames being significant. There is exactly one data model primitive (set vicinity intersection) possessing exactly one syntax (grouping); I rarely intend to distinguish the data model primitive from the syntax primitive (I can be criticized for this), and yet I use both terms for it; forgive me.

=== Synthesizing Ordering and Grouping ===

I am going to describe a toy naming system that allows focusing on how best to combine grouping and ordering into one naming system. This synthesis will contain the core features of the hierarchical, keyword, and relational systems as functional subsets. It consists of a few simple primitives, allowed to build on each other. It sets the discussion framework from which our project will, over many years, evolve a real naming system out of its current storage layer implementation.

Resolving the second component of an ordering is dependent on resolving the first, unlike in set theory. In set theory one can derive the ordered set from the unordered set, but because resolving the name of the second component depends on the first component, one cannot do so in this naming system. For this reason it can well be argued that this naming system is not truly set theory based. Now that I have mentioned this difference I will start to call them grouping and ordering, rather than unordered and ordered set.

These two primitives take other names as sub-names, and allow the user to construct compound names. Either the order of the subnames is significant (ordering), or it isn't (grouping), and thus we have the two different primitives. Because I have myself found that BNFs are easier to read if preceded by examples, I will first list progressively more complex examples using the naming system, and then formally define it.
The examples, and the simplified syntax, use / rather than : or \, but this is of no moment.

Examples:

<tt>/etc/passwd</tt>

[[Image:Passwd.jpg]]

Ordering and grouping are not just better; file system upward compatibility makes them cheaper for unifying naming in OSes based on hierarchical file systems than a relational naming system would be. This approach is fully upwardly compatible with the old file system. Users should be able to retain their old habits for as long as they wish, engage in a slow comfortable migration, and incorporate the new features into their habits as they feel the desire. Elderly programs should be untroubled in their operation. Many worthwhile projects fail because they emphasize how much they wish to change rather than asking of the user the minimal collection of changes necessary to achieve the added functionality.

<tt>[dragon gandalf bilbo]</tt>

[[Image:Bilbo.jpg]]

FIGURE 3. Graphical representation of the ascii name on the left. Mr. B. Bizy is looking for a dimly remembered story (The Hobbit by Tolkien) to print out and take with him for rereading during the annual company meeting.

<tt>case-insensitive/[computer privacy laws]</tt>

[[Image:Syntax-barrier.jpg]]

FIGURE 4. Graphical representation of the ascii name on the left.

When one subname contains no information except relative to another subname, and the order of the subnames is essential to the meaning of the name, then using ordering is appropriate. This most commonly occurs when syntax barriers are crossed. This is when a single compound name makes a transition from interpreting a subname according to the rules of one syntax to interpreting it according to the rules of another syntax. Ordering is essential at the boundary between the name of the new syntax as expressed in the current syntax, and the name to be interpreted according to that new syntax. Some researchers use the term context rather than syntax. The pairing of a program or function name, and the arguments it is passed, is inherently ordered.
While that is usually the concern of the shell, when we use a variety of ordering functions to sort Key_Objects of different types it affects the object store. In this example the ordering serves as a syntax barrier. Case-insensitive is the unabbreviated name of a directory that ignores the distinction between upper and lower case. For Linux compatibility this naming layer is case sensitive by default, even though I agree with those who think that it would be better were it not.

<tt>[my secrets]/[love letter susan]</tt>

[[Image:My-secrets.jpg]]

FIGURE 5. Graphical representation of the ascii name on the left.

Devhuman (that's the account name he chose) is the company's senior programmer. Six years ago he wrote a love letter to Susan, which he put in his read-protected secrets directory. (He never found the nerve to send it to her.) He's looking for it so he can rewrite it, and then consider sending it. Security is a particular kind of syntax barrier (you have to squint a bit before you can see it that way). Here the ordering serves as a security barrier. (He certainly wouldn't want anyone to know that an object owned by him with attributes love letter susan existed.)

<tt>[subject/[illegal strike] to/elves from/santa document-type/RFC822 ultimatum]</tt>

[[Image:Ultimatum.jpg]]

FIGURE 6. Graphical representation of the search for santa's ultimatum.

Devhuman knows his object store cold. He is looking for something he saw once before, he knows that it was auto-named by a particular namer he knows well (perhaps one whose functionality is similar to the classifier in [Messinger]), and he knows just what categorizations that namer uses when naming email. Still, he doesn't quite remember whether the word 'ultimatum' was part of the subject line, the body, or even was just elvish manual supplementation of the automatic naming. Rather than craft a query carefully specifying what he does and does not know about the possible categorizations of ultimatum, he lazily groups it.
If Devhuman's object store is implemented using this naming system with good style, someone less knowledgeable about the object store would also be able to say:

<tt>[santa illegal strike ultimatum elves]</tt>

and perhaps get some false hits as well as the desired email (instead of finding mail from santa, perhaps finding the elvish response). Notice that if you delete 'illegal' and 'ultimatum' to get

<tt>[subject/strike to/elves from/santa document-type/RFC822]</tt>

the query is structurally equivalent to a relational query. Many authors (e.g. semantic database designers) have written papers with good examples of standard column names which might be worth teaching to users. So long as they are an option made available to the user rather than a requirement demanded of the user, the increased selectivity they provide can be helpful.

<tt>[_is-a-shellscript bill]</tt>

[[Image:Pruner.jpg]]

FIGURE 7. Graphical representation of the ascii name on the left.

This name finds all shellscripts associated with bill. Names preceded by _ are pruners. Pruners are analogous to the predicate evaluators of relational database theory. If you have read papers distinguishing between recognition and retrieval, pruners are a recognition primitive. They are passed a list of objects, and return the subset of that list which matches some criteria. They are a mechanism appropriate for when a nonlinear search method that can deliver the desired functionality is either impossible, or not supported by existing indexes. There are many useful names for which we cannot do better than linear time search algorithms (perhaps simply as a result of incomplete indexing). _is-a-shellscript checks each member of its list to see if it is an executable object containing solely ascii. The user can use it just like any other Key_Object within an association; it will prune the results of the grouping. Since set intersections are commutative, its order within the grouping has no meaning, and optimizers are free to rearrange it.
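A pruner can be sketched as a predicate applied independently to each member of the set produced by the rest of the grouping. The following Python model is illustrative only; the store contents and the `is_a_shellscript` stand-in are invented for the example:

```python
# Sketch of a pruner: a predicate applied independently to each member
# of the set produced by the rest of the grouping.  Because membership
# in the pruned result depends only on the object itself, pruners
# commute with the set intersection and with each other, so an
# optimizer may apply them in any order.

store = {
    "backup.sh":  {"executable": True,  "ascii": True},
    "report.doc": {"executable": False, "ascii": False},
    "bill-tool":  {"executable": True,  "ascii": True},
}

def is_a_shellscript(obj):
    # stand-in for _is-a-shellscript: executable and solely ascii
    return obj.get("executable", False) and obj.get("ascii", False)

def prune(names, predicate):
    return {name for name in names if predicate(store[name])}

# [_is-a-shellscript bill] ~ prune the objects associated with 'bill'
associated_with_bill = {"backup.sh", "report.doc", "bill-tool"}
result = prune(associated_with_bill, is_a_shellscript)
print(sorted(result))   # ['backup.sh', 'bill-tool']
```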
=== The Formal Definitions ===

{| border=1
| <Object Name>::= ||
<pre>
 <Grouping> |
 <Ordering> |
 <Key_Object> |
 <Storage_Key> |
 <Orthogonal and Unoriginal Primitives I Will Not Define Here> |
 ;
</pre>
|}

See the section listing orthogonal and unoriginal primitives for a discussion of what primitives I left out of the definitions of this grammar that are necessary to a real world working system.

The name resolver has a method for converting all of the primitives into '''<Storage_Keys>''', and when processing compound names it first converts the subnames into '''<Storage_Keys>'''; the object resolved to may have null contents, and serve purely to embody structure. This allows the use, as a component of a grouping or ordering, of anything for which anyone can invent a way of allowing the user to find an '''<Object Name>''', and then invent a method for the resolver to convert the '''<Object Name>''' into a '''<Storage_Key>'''. In a word, closure. Extensible closure.

Compound names are interpreted by first interpreting the subnames that they are constructed from. At each stage of subname interpretation an '''<Object Name>''' is converted into a '''<Storage_Key>''' for the object that it is resolved to. The modules that implement the grouping and ordering primitives do not interpret the subnames; they merely pass them to the naming system, which returns the '''<Storage_Key>'''s they resolve to.

It was a long discussion which led to the use of storage keys rather than objectids. A storage key differs from an objectid in that it gives the storage layer directions as to where to try to locate the object in the logical tree ordering of the storage layer. If the logical location changes, then in the worst case we leave a link behind, and get an extra disk access like we get with an inode.
(Inode numbers are functionally objectids.) In the better case, the repacker eventually comes along and changes all references by key to the new location, at least for all objects that have not given their key to external naming systems the repacker cannot repack.

A '''<Storage_Key>''' is assigned by the system at object creation, and serves the purpose of allowing the system to concisely name the object, and of providing hints to the storage layer about which objects should be packed near each other. The user does not directly interact with the '''<Storage_Key>''' any more often than C programmers hardcode pointers in hex. The packing locality of keys may be redefined.

== The Primitives ==

<Key_Object>

A description of the contents of an object using the syntax of the current directory. For objects used to embody keywords this may be the keyword in its entirety. If it contains spaces, etc., it must be enclosed in quotes. Note that making it easy for third parties to add plug-in directory types is part of Namesys's current contract with Ecila. Ecila wants space efficient directories suitable for use in implementing a term dictionary and its postings files for their Internet search engine.

Example:

<tt>[reindeer chimneys presents man]</tt>

In this name, 'presents', 'reindeer', 'chimneys', and 'man' are the contents of objects associated with the Santa Claus story. Each of them is searched for by contents, and then when found they are converted into their Storage_Keys, and then the grouping algorithm is fed their four Storage_Keys. The grouping module then looks in the object headers of the four objects, gets the four sets of objects the Key_Objects group to, and performs a set intersection.
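The resolution steps just described can be sketched in Python. The in-memory `by_contents` and `groups_to` tables and the numeric Storage_Key values below are invented stand-ins for the real storage layer:

```python
# Sketch of grouping resolution as described above: each Key_Object is
# looked up by contents to obtain a Storage_Key, the set of objects it
# groups to is read from its object header, and the sets are
# intersected (a set vicinity intersection).

# contents -> Storage_Key (stand-in for search-by-contents)
by_contents = {"reindeer": 101, "chimneys": 102, "presents": 103, "man": 104}

# object header: Storage_Key -> set of Storage_Keys the object groups to
groups_to = {
    101: {500, 501},        # reindeer -> {santa-story, zoo-article}
    102: {500, 502},        # chimneys -> {santa-story, sweep-manual}
    103: {500, 503},        # presents -> {santa-story, gift-list}
    104: {500, 501, 503},   # man      -> {santa-story, ...}
}

def resolve_grouping(key_objects):
    keys = [by_contents[k] for k in key_objects]   # contents -> Storage_Key
    vicinities = [groups_to[k] for k in keys]      # object header lookups
    return set.intersection(*vicinities)           # set vicinity intersection

print(resolve_grouping(["reindeer", "chimneys", "presents", "man"]))  # {500}
```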
Besides greater closure, another advantage of storing Key_Objects as objects is that non-ascii Key_Objects and ordering functions can be implemented as a layer on top of the ascii naming system, allowing the user to interact with the naming system by pressing hyperbuttons, drawing pictures, making sounds, and supplying other non-ascii Key_Objects that the higher layers convert into Storage_Keys.

There are endless content description techniques. If the directory owner supplies an ordering function for the Key_Objects in a directory, one can generate a search index for the directory using a directory plug-in which is fully orthogonal to the ordering function, though perhaps slower in some cases than one that is tailored for the ordering function. Users will find it easier to write ordering functions than index creation objects, and will not always need the speed of specialized indexes. We will need one ordering function for ascii text, another for numbers, another for sounds, perhaps someday one even for pictures of faces (perhaps to be used by a law enforcement agency constructing an electronic mug book, or a white pages implementation), etc. No system designer can provide all the different and sometimes esoteric ordering functions which users will want to employ. What we can do is create a library of code from which users can construct their own ordering functions and their own directory plug-ins, and this is the approach we are taking on behalf of Ecila. For an Internet search engine one wants what is called a postings file, which is like a directory in that there is no need to support a byte offset, and one frequently wants to efficiently perform insertions into it.

<pre>
<Grouping> ::= [<Unordered List>] ;

<Unordered List> ::= <Unordered List> <Unordered List> |
                     <Object Name> |
                     <Pruner> ;

<Pruner> ::= _<Object Name> ;
</pre>

A <Grouping> is a list of object names and pruners whose order has no meaning.
Every object has, in its object header, a list of the objects it groups to (associates with, in neural network idiom). A grouping is interpreted by performing a set intersection of those lists for every object named in the grouping; in the sense of the data model, this is a set vicinity intersection.

Grouping is not transitive: [A] => B and [B] => C does not imply [A] => C, though it does imply that [[A]] => C.

A pruner is an <Object Name> which has been preceded with an _ to indicate that the object described should be passed a list of the objects named by the rest of the grouping, executed, and that it will return a subset of the list it was passed. Whether a member of the set is in the returned subset must be fully independent of what the other members of the set were, or else the results become indeterminate after application of a query optimizer, as with an optimizer in use there is no guarantee provided of the order of application of the pruners.

<pre>
<Ordering> ::= <Object Name>/<Object Name> |
               <Object Name>/<Custom Programmed Syntax> ;

<Custom Programmed Syntax> ::= Varies, provides extensibility hook.
</pre>

An ordering is a pairing of names, with the order representing information. The first component of the ordering determines the module to which the second component is passed as an argument. In contrast, a grouping first converts all subnames to Storage_Keys by looking through the same current directory for all of them in parallel, and then does its set intersection with the subdescriptions already resolved.

Example: In resolving [my secrets]/[love letter susan] the system would look for the objects with contents my and secrets, find both of them, and do a set intersection of all the objects those two objects group to (are associated with). This will allow it to find the [my secrets] directory, inside of which it will look for the three objects love, letter, and susan.
It will then extract from their object headers the sets of objects those three words ('love', 'letter', and 'susan') group to, and do a set intersection which will find the desired letter. The desired letter is not necessarily inside the [my secrets] directory, though in this case it probably is.

A directory is an object named by the first component of an ordering, to which the second component is passed, and which returns a set of Storage_Keys. One can in principle use different implementations of the same directory object without impacting the semantics and only affecting performance, as is often done in databases. There are several flavors of directories:

Custom programmed directories, aka filters, are any executable program that will return a Storage_Key when executed and fed the second component as an argument. They provide extensibility. (They are the ordered counterpart of pruners.) Another term for them is filter directories. Custom programmed directories whose name interpretation modules aren't unique to them will contain just the name of the module (filter), plus some directory dependent parameters to be passed to the module. It should be considered merely a syntax barrier directory, and not a fully custom programmed directory, if those parameters include a reference to a search tree that the module operates on, and if that search tree adheres to the default index structure. The connotations conveyed by the term 'filter', of there being an original which is distorted, are not always appropriate, but in honesty this is not an issue about which we deeply care.

Syntax barrier directories allow you to describe the contents of the objects they contain with a syntax different from that of their parents. Except for being sorted by a different ordering function, the indexes of syntax barrier directories are standard in their structure, and use a standard index traversal module. The index traversal module is ordering function independent.
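The claim that the index machinery can be orthogonal to the ordering function can be illustrated with a small sketch: a generic sorted index parameterized by an owner-supplied ordering function. This is a toy under stated assumptions (the `Directory` class and its methods are invented for illustration), not the actual directory plug-in interface:

```python
# Sketch of an index whose traversal is independent of the ordering
# function: the directory owner supplies a key function, and the same
# sorted-list machinery serves case-insensitive text, numbers, or
# anything else comparable.

import bisect

class Directory:
    def __init__(self, ordering):
        self.ordering = ordering   # owner-supplied ordering function
        self._keys = []            # sort keys, kept in order
        self._names = []           # entries, parallel to _keys

    def insert(self, name):
        k = self.ordering(name)
        i = bisect.bisect_left(self._keys, k)
        self._keys.insert(i, k)
        self._names.insert(i, name)

    def lookup(self, name):
        k = self.ordering(name)
        i = bisect.bisect_left(self._keys, k)
        return i < len(self._keys) and self._keys[i] == k

# One ordering function for case-insensitive text, another for
# numbers; the index code above is unchanged.
text_dir = Directory(ordering=str.lower)
for n in ("Passwd", "hosts", "Fstab"):
    text_dir.insert(n)

num_dir = Directory(ordering=int)
for n in ("10", "2", "33"):
    num_dir.insert(n)

print(text_dir.lookup("PASSWD"))   # True
print(num_dir.lookup("2"))         # True
```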
There must be an ordering function for every <Key_Object> employed within a given syntax barrier directory. By contrast, a <Custom Programmed Syntax> could be anything which the syntax module somehow finds an object with, possibly even creating the object in order to be able to find it.

To cross a security barrier directory the user must use an ordered pair of names with the security barrier as the first member of the pair, and he must satisfy the security module of the secured directory. A security barrier directory may be both a security and a syntax barrier directory, or the security barrier directory may share the syntax module of its parents.

Fully standard directories are those built using the default directory module, and adding structure is their only semantic effect.

There is an aspect of customization which is beyond the scope of this paper, in which one customizes the items employed by the storage layer to implement files and directories. That is, the storage of the files and directories is implemented by composing them of items, and these items have different types. We are now creating the code for packing and balancing arbitrary types of items using item handlers and object oriented balancing code, so as to make it easier to extend our filesystem.

=== Ordering can be implemented more efficiently than grouping ===

The set intersections performed in evaluating the grouping primitive are normally much more expensive computationally than performing the classical filesystem lookup. Imposing excess structure on one's data does not just at times reduce the cost of human thinking :-), it can be used to reduce the cost of automated computation as well. When the cost to a user of learning structure is less important than the burden on the machine, use of highly ordered names is often called for.

=== The Motivation for Different Syntactic Treatment of Ordering and Grouping, and Some of the Deeper Issues Revealed by the Difference ===

An important difference between grouping and ordering affects syntax. It allows us to represent an ordering with a single symbol ('/') placed between the pair, but requires two symbols ('[' and ']') for each grouping. Imagine using < and > as a two symbol delimiter style alternative notation for ordering:

<<father-of mother-of> sister-of> = <father-of <mother-of sister-of>> = <father-of mother-of sister-of> = father-of/mother-of/sister-of

All of the expressions above are equivalent in referring to the paternal great aunt of the person who is the current context. The ones using nested pairs of symbols to enclose pairs of subnames imply a false structure that requires the user to think to realize the first two expressions are equivalent. The fourth is the notation this naming system employs.

Grouping is different. Fast Acting Freddy is looking through the All-LA Shopping Database for a single store with black reebok sneakers, a green leather jacket, and a red beret, so that he can dress an actor for a part before the director notices he forgot all about him.

[[black reebok sneakers] [green leather jacket] [red beret]]

is not equivalent to

[black reebok sneakers green leather jacket red beret]

which equals

[red sneakers black reebok jacket green beret]

Ordering is not algebraically commutative (father-of/mother-of is not equivalent to mother-of/father-of). Groupings are algebraically commutative ([large red] = [red large]).

== Style ==

As a general principle, a more restricted system can avoid requiring the user to repeatedly specify the restrictions, and if the user has no need to escape the restrictions then the restricted system may be superior. This is why "4GLs", which supply the structure for the user's query, are useful for some applications. They are typically implemented as layers on top of unrestricting systems such as this one.

This paper has addressed issues surrounding finding information, particularly when the user's clues are faint.
When supporting other user goals, such as exploring information, adding structure through substantial use of ordering can be helpful [Marchionini] [McAleese]. When the user goal is finding, one should assume that of all the fragments of information about an object, the user has some random subset of them. The goal is to allow the user to use that random subset in a name, whatever that subset might be. Some of that subset will be structural fragments. While requiring the user to supply a structure fragment is as foolish as requiring him to supply any other arbitrary fragment, allowing him to is laudable.

In the best of all worlds the object store would incorporate all valid possible structurings of Key_Objects. The difficulty in implementing that is obvious. [Metzler and Haas] discuss ways of extracting structure from English text documents, and why one would want to be able to use that structure in retrievals. Unfortunately, there is an important difference between representing the structure of an English language sentence in a way that conveys its meaning, and representing it in a way that allows it to be found by someone who knows only a fragment of its semantic content. I doubt the wisdom of advocating the use of more than essential structure in searching. You can allow users to avoid false structure; you cannot force them to. It is important to teach those creating the structure that if they group a personnel file with sex/female they should also group it with female.

Type checking can impose structure usefully. Its implementation can enhance or reduce closure, depending on whether it is done right.

=== When To Decompound Groupings ===

There are dangers in excessive compounding of compound groupings analogous to those of excessive ordering. Let's examine two examples of compound groupings, both of which are valid both semantically and syntactically.
One of them can be "decompounded" with moderate information loss, and the other loses all meaning if decompounded.

Example: Finding a loquacious Celtic textbook salesman who told you in excruciating detail about how he was an ordinance researcher until one day he went to a Grateful Dead concert.

[[Celtic textbook salesman] [ordinance researcher]]

vs.

[celtic textbook salesman ordinance researcher]

These two phrasings of the same query are not equivalent, but they are "close." Our second example is the one in which Fast Acting Freddy tries to find a suspect by the objects he is associated with:

[[black reebok sneakers] [green leather jacket] [red beret]]

vs.

[black reebok sneakers green leather jacket red beret]

These two are not at all "close." The difference between the two examples of inequivalence is that the subdescriptions within the second example describe objects whose existence within the object store, independent of the store described, is worthwhile. The first example's subdescriptions do not, and for such queries it is more reasonable to try to design so that the "decompounded" version of the query is used. False hits will occur, but for large systems that's better than asking the user to learn structure.

A higher level user interface might choose to present only one level to the user at a time, and then once the user confirms that a subdescription has resolved properly it would let him incorporate it into a higher level description. There might be 6 models of [black reebok sneakers], and Fast Acting Freddy should have the opportunity to click his mouse on the exact model, and have the interface substitute that object for his subdescription. Using such an interface an advanced user might simultaneously develop several subdescriptions, refine and resolve them, and then use the mouse to draw lines connecting them into a compound grouping. Closure makes it possible for that to work.
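The contrast between a compound grouping and its "decompounded" form can be made concrete in the set-vicinity model. The store data below is invented for the example, arranged so that one store (storeB) matches every bare word without stocking any of the wanted items:

```python
# Sketch contrasting a compound grouping with its "decompounded" form.
# groups_to maps each object to the set of stores associated with it.
# storeB has red sneakers, a black jacket and a green beret, so it
# matches every bare word while stocking none of the wanted items.

groups_to = {
    # exact items -> stores stocking them
    "black-reebok-sneakers": {"storeA"},
    "green-leather-jacket":  {"storeA", "storeC"},
    "red-beret":             {"storeA"},
    # bare words -> stores associated with the word at all
    "black": {"storeA", "storeB"},   "reebok": {"storeA", "storeB"},
    "sneakers": {"storeA", "storeB"}, "green": {"storeA", "storeB"},
    "leather": {"storeA", "storeB"},  "jacket": {"storeA", "storeB"},
    "red": {"storeA", "storeB"},      "beret": {"storeA", "storeB"},
}

def grouping(*names):
    return set.intersection(*(groups_to[n] for n in names))

# [[black reebok sneakers] [green leather jacket] [red beret]]:
# each inner grouping resolves to one exact item object first.
compound = grouping("black-reebok-sneakers", "green-leather-jacket", "red-beret")

# [black reebok sneakers green leather jacket red beret]:
# all eight words thrown into one flat grouping.
flat = grouping("black", "reebok", "sneakers", "green",
                "leather", "jacket", "red", "beret")

print(compound)   # {'storeA'}            -- exact
print(flat)       # {'storeA', 'storeB'}  -- storeB is a false hit
```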
== Examples of Creating Associations ==

<- creates an association between all of the objects on the left hand side and all of the objects on the right hand side. A - B is the set difference of A and B, and it resolves to the set of objects in A except for those that are in B. A & B resolves to the set intersection of A and B, the objects that are in both A and B. [A B] = [A] & [B], by definition.

 animal <- (lives, moves)
 mammal <- ([animal], animal, `warm blooded')
 cat <- ([mammal], hypernym/mammal, mammal, meronym/fur, fur, meronym/whiskers, whiskers, hypernym/quadruped, quadruped, capability/purr, purr, capability/meow, meow)
 Basil <- (owner/Nina, Nina, [siamese], siamese, clever, playful, brave/overly, brave, 'toilet explorer')
 bag <- ([container], container, consists-of/`highly flexible material', `highly flexible material')
 backpack <- ([bag], shoulderstrap/quantity/2, shoulderstrap, college-student, holonym/backpacker, meronym/shoulderstrap)
 mould <- ([fungi] - green/not, furry, `grows on'/surfaces/moist, `killed by'/chlorine)
 fungi <- ([plant], plant, leaves/no, flowers/no, green/not)
 bird <- ([vertebrate], vertebrate, flies, feathers)
 penguin <- ([bird] - flies, bird, hypernym/bird, swims, Linux, [Linux (mascot, symbol)])
 siamese <- ([cat], cat, hair/short, short-hair)

Notice how we don't associate siamese with short despite associating it with hair/short, but we do associate Basil with Nina as well as with owner/Nina.

 small <-0 little

The above means that small and little are synonyms, and are to be treated as 0 distance away from each other for vicinity calculation purposes. In traditional Unix terms, they are hardlinked together. Creating a serious ontology is not our field or task, but it is worth doing. The reader is referred to WordNet (free), and Cyc by Doug Lenat (proprietary). While we will focus on implementing primitives that allow for creating better ontologies, we are happy to work with persons interested in contributing or porting an ontology.
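A minimal sketch of these operators, assuming objects are simple strings and that a name resolves to the set of objects associated with it. The function names are invented for illustration; the real design attaches much richer semantics (subnames like hypernym/bird, vicinity distances) than this toy shows:

```python
from collections import defaultdict

assoc = defaultdict(set)  # name -> set of objects grouped under that name

def create_association(obj, names):
    """The '<-' operator: associate obj with every name on the right hand side."""
    for name in names:
        assoc[name].add(obj)

def resolve(name):
    """[name] resolves to the set of objects associated with name."""
    return assoc[name]

create_association("sparrow", {"bird", "flies", "feathers"})
create_association("penguin", {"bird", "swims", "feathers"})

# [A B] = [A] & [B] is set intersection; A - B is set difference,
# as in penguin <- ([bird] - flies, ...).
print(resolve("bird") & resolve("swims"))   # {'penguin'}
print(resolve("bird") - resolve("flies"))   # {'penguin'}
```

Note how both the intersection and the difference pick out the penguin: it is a bird that swims, and a bird minus the fliers.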
== Other Projects Seeking To Increase Closure In The OS ==

=== AT&T's Plan 9 ===

[Plan 9] is being produced by the original authors of Unix at AT&T research labs. It has influenced CORBA, and /proc is a direct steal from it to Linux. Their major focus is on integration. Their major trick for increasing integration is unifying the name space. Name spaces integrated into the Plan 9 file system include the status, control, virtual memory, and environment variables of running processes. They have a hierarchical analog to what the relational culture calls constructing views, which the Plan 9 culture calls context binding.

=== Microsoft's Information At Your Fingertips ===

Plan 9 ignores integration of application program name spaces, concentrating on OS oriented name spaces. Microsoft's "Information at Your Fingertips" name space integration effort appears to be taking the other approach, focusing on integrating the name spaces of the various Microsoft applications via OLE and Structured Storage. The application group at Microsoft has long been better staffed and funded than the OS group, and FS developers have long preferred to simply ignore the needs of application builders generally. The primary semantic disadvantages of Microsoft's approach are primitives selected with insufficient care, a lack of closure, and the use of an object oriented rather than set oriented approach in both naming syntax and data model. Realistically, one can say that folks within Microsoft have often made statements favoring name space integration, and in various areas have successfully executed on it, but on the whole I rather suspect that the lack of someone in marketing making a business case for $X in revenue resulting from name space integration has crippled name space integration work at commercial OS producers generally, including MS.

==== Internet Explorer ====

Internet Explorer attempts to unify the filesystem and Internet namespaces.
At the time of writing, the unification is so superficial, with so little substance, that I would describe it as having the look and feel of integration without most of the substance. Perhaps this will change.

==== Microsoft's Well Known Performance Difficulties ====

Despite having many of the leading names in the industry on their payroll, they have somehow managed to create a file system implementation with performance so terrible that, for the Unix customer base, it is a significant consideration contributing to hesitation in moving to NT. It may well have the worst performance of any of the major OS file systems. Their implementation of OLE's structured storage offers extremely poor performance, and their excuse that it is due to the incorporation of transaction concepts into their design is just a reminder that they did a poor job at that as well. They managed to implement something intended to store small objects within a file, yet it still suffers from 512-byte granularity problems, problems that they try to somewhat overcome by encouraging the packing of several objects within "storages" at horrible kludge costs.

=== Storage Layers Above the FS: A Sure Symptom That the FS Developer Has Failed ===

When filesystems aren't really designed for the needs of the storage layers above them, and none of them are, not Microsoft's, not anybody's, then layering results in enormous performance loss. The very existence of a storage layer above the filesystem means that the filesystem team at an OS vendor failed to listen to someone, and that someone was forced to go and implement something on their own. You just have to listen to one of these meetings in which some poor application developer tries to suggest that more features in the FS would be nice; I heard one at a nameless OS vendor.
The FS team responds to say that disks are cheap, small object storage isn't really important, we haven't changed the disk layout in 10 years, and changing it isn't going to fly with the gods above us about whom we can do nothing. At these meetings you start to understand that most people who go into filesystem design are persons who didn't have the guts to pursue a more interesting field in CS. There is a sort of reverse increasing returns effect that governs FS research, in which the more code becomes fixed on the current APIs, the more persons in the field react with fear to any thought of FS semantics being other than a dead research topic, the less research gets done, and the fewer persons of imagination see a reason to enter the field. Every time one vendor gets a little ahead in adding functionality, the other vendors go on a FUD campaign about it breaking standards and therefore being dangerous for mission critical usage. This is a field in which only performance research is allowed, and every other aspect is simply dead. Namesys seeks to raise the dead, and is willing to commit whatever unholy acts that requires.

There is no need for two implementations of the set primitive, one called directories, the other called a file with streams, each having a different interface. File systems should just implement directories right, give them some more optional features, and then there is no need at all for streams. If you combine allowing directory names to be overloaded to also be filenames when acted on as files, allowing stat data to be inherited, allowing file bodies to be inherited, and implement filters of various kinds, then in the event that the user happens to need the precise peculiar functionality embodied by streams, they can have it by just configuring their directory in a particular way. There was a lengthy linux-kernel thread on this topic which I won't repeat in more detail here.
The tree architecture of the storage layer of this FS design will lend itself to a distributed caching system much more effectively than the Microsoft storage layer, in part due to its ability to cache not just hits and misses of files, but to cache semantic localities (ranges). For more on this topic see later in this paper.

=== Rufus ===

The Rufus system [Messinger et al.] indexes information while leaving it in its original location and format. While it does allow the user to create a unified name space, it does not choose to integrate that name space into the operating system. Even so, it is immensely useful in practice, and strongly hints at what the OS could gain if it had a more than hierarchical name space with a data model oriented towards what [Messinger] calls "semi-structured information," such as you find in the RFC822 format for email. When you have 7000 pieces of mail, and linearly searching the mail with a utility like grep takes 10 minutes, it is nice to be able to quickly keyword search via inverted indexes for the mail whose from: field contains billg and that has the words "exclusive" and "bundling" in the body of the message, as you hurriedly search for an old email just before an appearance in court.

=== Semantic File System ===

The Semantic File System comes closest to addressing the needs I have described. It is a Unix compatible file system with more than hierarchical naming (attribute based is the term they use). Its data model unfortunately has the important flaw of lacking closure (in it, names of objects are not themselves objects). In my upcoming discussion of the unnecessary lack of closure in hypertext products, notice that the arguments apply to the Semantic File System as well (and so I won't duplicate them here).

=== OS/400 ===

IBM's OS/400 employs a unified relational name space. The section of this paper entitled A Naming System Should Reflect Rather than Mold Structure will cover its problems of forcing false structure.
Inadequate closure due to mandatory type checking is another source of difficulties for it. While users moan about these two unnecessary design flaws, the essence of the opinions AS/400 partisans have expressed to me has been that the unification of its name space is a great advantage that OS/400 has over Unix. I claim these users were right, and later in this paper will propose doing something about it.

== Conclusion ==

While I spent most of this paper on why adding structure to information can be harmful, particularly when it is intended to be found by others sifting through large amounts of other information, this was purely because it is a harder argument than why deleting structure is harmful. My goal was not to be better at unstructured applications than keyword systems, or better at structured applications than the hierarchical and relational systems --- the goal is to be more flexible in allowing the user to choose how structured to be, while still being within a single name space. I claimed that multiple fragmented name spaces cannot match the power and ease of name spaces integrated with closure: closure makes a naming system far more powerful by increasing its ability to compound complex descriptions out of simpler ones. The strong points of this naming system's design are various forms of generalizing abstractions already known to the literature, for greater closure.

== Acknowledgments ==

David P. Anderson and Clifford Lynch helped enormously in rounding out my education, and improving my paper. Their generosity with their time was remarkable. David P. Anderson was simply a great professor, and it was a privilege to work with him. Brian Harvey informed me that it wasn't too obvious to mention that an object store should be unified. Cimmaron Taylor provided me with many valuable late night discussions in the early stages of this paper.
I would like to thank Bill Cody and Guy Lohman of the database group at the IBM Almaden Research Center for a wonderful learning experience. Vladimir Saveliev kept this file system going when others fell by the wayside. He started as the most junior programmer on the team, and through sheer hard work and dedication to excellence outshone all the other more senior researchers. Of course, after some time he could no longer be considered a junior programmer.

NOTE: See also the DARPA funded, but not endorsed, [[Txn-doc|Reiser4 Transaction Design Document]] and [[Reiser4|Reiser4 Whitepaper]].

== References ==

1. Blair, David C. and Maron, M. E. "Evaluation of Retrieval Effectiveness for a Full-Text Document-Retrieval System", Communications of the ACM, v28 n3, Mar 1985, p289-299

2. Codd, E. F. "The Relational Model for Database Management: Version 2", c1990, Addison-Wesley Pub. Co. Not recommended as a textbook, Date's is better for that, but worthwhile if you want a long paper by Codd. Notice that he places greater emphasis on closure, and design methodology principles in general, than designers of other naming systems such as hypertext.

3. Date, C.J. "An Introduction to Database Systems", 4th ed., Reading, Mass.: Addison-Wesley Pub. Co., c1986. Contains a well written substantive textbook sneer at the problems of hierarchical naming systems, and a well annotated bibliography.

4. Curtis, Ronald and Larry Wittie "Global Naming in Distributed Systems", IEEE Software, July 1984, p76-80

5. Feldman, Jerome A., Mark A. Fanty, Nigel H. Goddard and Kenton J. Lynne, "Computing with Structured Connectionist Networks", Communications of the ACM, v31, Feb '88, p170(18)

6. Fox, E. A., and Wu, H. "Extended Boolean Information Retrieval", Communications of the ACM, 26, 1983, pp. 1022-1036

7. Gallant, Stephen I., "Connectionist Expert Systems", Communications of the ACM, v31, Feb '88, p152(18)

8. Gates, Bill.
Comdex '91 speech on "Information at Your Fingertips", available for $8 on videotape from Microsoft's sales department.

9. Gifford, David K., Jouvelot, Pierre, Sheldon, Mark A., O'Toole, James W. Jr., "Semantic File Systems", Operating Systems Review, Volume 25, Number 5, October 13-16, 1991. They demonstrated that extending Unix file semantics to include nonhierarchical features is useful and feasible. Unfortunately, their naming system lacks closure.

10. Gilula, Mikhail. "The Set Model for Database and Information Systems", 1st Edition, c1994, Addison-Wesley. Provides a set theoretic database model in which relational algebra is shown to be a special case of a more general and powerful set theoretic approach.

11. Joint Object Services Submission (JOSS), OMG TC Document 93.5.1

12. Marchionini, Gary, and Shneiderman, Ben. "Finding Facts vs. Browsing Knowledge in Hypertext Systems", Computer, January 1988, p. 70

13. McAleese, Ray "Hypertext: Theory into Practice", edited by Ray McAleese, ABLEX Publishing Corporation, Norwood, NJ 07648

14. Messinger, Eli, Shoens, Kurt, Thomas, John, Luniewski, Allen "Rufus: The Information Sponge", Research Report RJ 8294 (75655), August 13, 1991, IBM Almaden Research Center

15. Metzler and Haas. "The Constituent Object Parser: Syntactic Structure Matching for Information Retrieval", Proceedings of the ACM SIGIR Conference, 1989, ACM Press

16. Nelson, T.H. "Literary Machines", self published by Nelson, Nashville, Tenn., 1981. Did much to popularize hypertext; at the time of writing he has still not released a working product, though competitors such as HyperCard have done so with notable success.

17. Mozer, Michael C. "Inductive Information Retrieval Using Parallel Distributed Computation", UCLA

18. Pike, Rob and P.J. Weinberger, "The Hideous Name", AT&T Research Report

19. Pike, Rob, Presotto, Dave, Thompson, Ken, Trickey, Howard, Winterbottom, Phil,
"The Use of Name Spaces in Plan 9", available via ftp from att.com. Plan 9 is an operating system intended to be the successor to Unix, and greater integration of its name spaces is its primary focus.

20. Potter, Walter D. and Robert P. Trueblood, "Traditional, semantic, and hyper-semantic approaches to data modeling", Computer, v21, '88, p53(11)

21. Rijsbergen, C. J. Van, "Information Retrieval", 2nd ed., Butterworth and Co. Ltd., 1979. Printed in Great Britain by The Whitefriars Ltd., London and Tonbridge

22. Salton, G. (1986) "Another Look At Automatic Text-Retrieval Systems", Communications of the ACM, 29, 648-656

23. Smith, J.M. and D.C. Smith, "Database Abstractions: Aggregation and Generalization", ACM Transactions on Database Systems, June 1977, pp. 105-133

ICS Report No. 8406, June 1984

24. http://www.win.tue.nl/~aeb/partitions/partition_types.html

[[category:Reiser4]]

= The Naming System Venture =

== Abstract ==

For too long the file system has been semantically impoverished in comparison with database and keyword systems. It is time to change! The current lack of features makes it much easier to use the latest set theoretic models rather than older models of relational algebra or hypertext. The current FS syntax fits nicely into the newer model. The utility of an operating system is more proportional to the number of connections possible between its components than it is to the number of those components. Namespace fragmentation is the most important determinant of that number of possible connections between OS components. Unix at its beginning increased the integration of I/O by putting devices into the file system name space. This is a winning strategy; let's take the file system name space and, one missing feature at a time, eliminate the reasons why the filesystem is inadequate for what other name spaces are used for.
Only once we have done so will the hobbles be removed from OS architects, or even OS conspiracies. Yet before doing that, we need a core architecture for the semantics to ensure we end up with a coherent whole. This paper suggests a set theoretic model for those semantics. The relational models would at times unacceptably add structure to information, the keyword models would at times delete structure, and purely hierarchical models would create information mazes. Reworking their primitives is required to synthesize the best attributes of these models in a way that allows one the flexibility to tailor the level of structure to the need of the moment. The set theoretic model I propose has a syntax that is upwardly compatible with Linux, MacOS, and DOS file system syntax, as well as with the CORBA naming layer.

This is a planning document for the next major version of ReiserFS, that is, a description of vaporware. It is useful to ReiserFS users and contributors who want to know where we are going, and why we are building all sorts of strange optimizations into the storage layer (and especially to those who are willing to help shape the vision in the course of discussions on the {{listaddress}} mailing list). Currently the storage layer for ReiserFS is working and useful as an everyday FS with conventional semantics. That storage layer is available as a GPL'd Linux kernel patch.

== Introduction ==

Many OS researchers have built hierarchical name spaces that innovate in their effect on the integration of the operating system (e.g. Plan 9 and its file system [Pike]). Relational and keyword researchers rightfully scorn hierarchical name spaces as 20 years behind the state of the art [Date], but pay little attention to integration of the operating system as a design objective in their own work, or as a possible influence on data model design. I won't go into that here. Limiting associations to single key words is an unnecessary restriction.
== A Naming System Should Reflect Rather than Mold Structure ==

The importance of not deleting the structure of information is obvious; few would advocate using the keyword model to unify naming. What can be more difficult to see is the harm from adding structure to information; some do recommend the relational model for unifying naming (e.g. OS/400). By decomposing a primitive of a model into smaller primitives one can end up with a more general model, one with greater flexibility of application. This is the very normal practice of mathematicians, who in their work constantly examine mathematical models with an eye to finding a more fundamental set of primitives, in hopes that a new formulation of the model will allow the new primitives to function more independently, and thereby increase the generality and expressive power of the model. Here I break the relational primitive (a tuple is an unordered set of ordered pairs) into separate ordered and unordered set primitives. Relational systems force you to use unordered sets of ordered pairs when sometimes what you want is a simple unordered set.

Why should a naming system match rather than mold the structure of information? For systems of low complexity, the reasons are deeply philosophical, which means uncompelling. And for multiterabyte distributed systems?...

Reiser's Rule of Thumb #2: The most important characteristic of a very complex system is the user's inability to learn its structure as a whole. We must avoid adding structure, or guarantee that the user will be informed of all structure relevant to his partial information. Avoiding adding structure is both more feasible and less burdensome to the user.

Hierarchical, relational, semantic, and hypersemantic systems all force structure on information, structure inherent in the system rather than the information represented.
If a system adds structure, and the user is trying to exploit partial knowledge (such as a name embodies), then it inevitably requires the user to learn what was added before he can employ his partial knowledge. With complex systems, the amount added is beyond the capacity of users to learn, and information is lost.

Example: <tt>"My name is Kali, your friendly technical support specialist for REGRES. Our system puts the Library of Congress online! How may I help you?"</tt>

George doesn't know Santa Claus' name: <tt>"I'm trying to find the reindeer chimneys christmas man, and I can't get your system to do it."</tt>

[[Image:Reindeer.jpg]]

FIGURE 1. Graphical representation of a typical simple unordered set that is difficult for relational systems.

Kali says: <tt>"OK, now let's define a query. '''is-a equals man''', that's easy. But reindeer? Is reindeer a property of this man?"</tt>

<tt>"Uh no. I wish I could remember the dude's name. I read this story about him a long time ago, and all I can remember is that he had something to do with reindeer and chimneys. The story is on-line, somewhere."</tt>

<tt>"Reindeer chimneys presents man, that's the sort of speech pattern I'd expect from a three year old,"</tt> Kali corrects him. <tt>"Let's see if we can structure this properly. Is reindeer an '''instance-of''' of this man? A '''member-of''' of this man? It couldn't be a '''generalization''' of this man. Hmm..."</tt>

<tt>"No! It's not that complicated. They just have something to do with him."</tt>

<tt>"Pavlov would probably say you associate reindeer with this man, the way the unstructured mind of an animal thinks. But here in technical support we try to help our customers become more sophisticated. Is reindeer a property of this man?"</tt>

<tt>"No. Try '''propulsion-provider-for'''."</tt>

<tt>"Do you think that that was the schema the person who put the information in our system used?"</tt>

<tt>"No. Shoot.
I can think of a dozen different columns it could be under. But what are the chances that the ones I think of are going to be the same as the ones the dude who put the information in used?"</tt>

Kali feels satisfaction. <tt>"Guess it can't be done, not if you can't structure your REGRES query properly. I'll put you down in my log as a closed ticket, 190 seconds to resolution, not bad."</tt>

<tt>"A keyword system could handle reindeer chimneys christmas man,"</tt> George grumbles as he stares in despair at his display. Unfortunately, the ''Library of Congress'' is only one of REGRES' many reference aids. George could spend his life at it, and he'd never learn its schema.

<tt>"But a keyword system would delete even necessary structure inherent to the information. It couldn't handle our other needs!"</tt> Kali says before she hangs up.

In addition to the searcher's difficulties, having to manufacture structure by specifying the column for reindeer also adds unnecessary cognitive load to the story author's indexing tasks.

== A Few of the Other Approaches to This Problem ==

There is lurking at the heart of my approach a subtle difference between my analysis of naming and the analysis of at least some others. I started my research by systematically categorizing the different structures embodied by names, placing them into equivalency classes, and then picking one syntax out of each class of functionally equivalent naming structures, on the assumption that each of the equivalency classes has value. For example, I considered that languages sometimes convey structure by word endings (tags), and sometimes by word order, but while the syntax differs, the word order and word ending techniques are equivalent in their power to convey structure. In my analysis of the effect of word ordering I decided that either the ordering mattered, or it did not, and that was the basis for two different naming primitives.
Others have instead studied the inherent structure of data, and then from that derived ways of naming. The hypersemantic system [Smith] [Potter] represents an attempt to pick a manageably few columns which cover all possible needs. Generalization, aggregation, classification, and membership correspond to the is-a, has-property, is-an-instance-of, and is-a-member-of columns, respectively. The minor problem is that these columns don't cover all possibilities. They don't cover reindeer, presents, or chimneys for George's query. The major problem is that they don't correspond as closely as possible to the most common style of human thought, simple unordered association, and they require cognitive effort to transform.

The first response of relational database researchers to this is usually to ask: "Why not modify an existing relational database to contain an 'associated' column, put everything in that column, and it would be functionally equivalent to what you want?" This is like saying that you can do everything Pascal can do using TeX macros. (They are both Turing complete.) We don't design languages to simply be Turing complete, we design them to be useful. I have seen a colleague do in six lines of SQL (nonstandard SQL) a simple three keyword unordered set query that I do in 3 words plus a pair of delimiters, and that traditional keyword systems also handle easily. Doing simple unordered sets well is crucial for highly heterogeneous name spaces, and the market success of keyword systems in Internet searching is evidence of that. If you look at the structure of names in human languages, they are not all tuple structured, and to make them tuple structured might be to distort them. I have merely discussed the burden of naming columns. Most relational systems also require the user to specify the relation name. If column naming is a burden, naming both the column and the relation is no less a burden.
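The contrast can be made concrete in a toy sketch. The store, the column names, and the function names below are all invented: matching a plain unordered set succeeds with whatever word fragments the user happens to have, while the tuple-structured form also demands that the searcher guess the columns the indexer chose.

```python
objects = {
    "story17": {"reindeer", "chimneys", "christmas", "man"},
}

def keyword_match(query):
    """Unordered-set naming: match if the object is grouped with every
    query word, in any order and under no particular column."""
    return {o for o, words in objects.items() if query <= words}

print(keyword_match({"reindeer", "chimneys", "man"}))  # {'story17'}

# Tuple-structured naming: the same facts, filed under whatever columns
# the indexer happened to pick.
tuples = {
    "story17": {("propulsion-provider", "reindeer"),
                ("entry-route", "chimneys"),
                ("is-a", "man")},
}

def tuple_match(query_pairs):
    return {o for o, ts in tuples.items() if query_pairs <= ts}

# Fails unless the searcher guesses the indexer's column for reindeer.
print(tuple_match({("associated", "reindeer")}))  # set()
```

The keyword query uses George's three random fragments directly; the tuple query turns each fragment into a guessing game about structure that was added by someone else.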
Many systems invest effort into allowing you to take the key that you know, and figure out all the relation names and columns that you might choose to pair with it. This is a good idea, but not as good as not imposing extraneous structure to begin with. [Salton] can be read for devastating critiques of the document clustering system, but there is a worthwhile idea lurking within that system. Perhaps it is worthwhile to keep track of a small number of documents which are "close" to a given document. The document creator could be informed upon auto-indexing the document what other documents appear to be close to it, and asked to consider associating it with them. This is not within our current plan of work, but I don't reject it conceptually.

In summary, modularity within the naming system is improved by recognizing unordered grouping and ordering as two different functions that deserve separate primitives rather than being combined into a tuple primitive. The tuple is an unordered set of ordered pairs. There are other useful combinations of unordered grouping and ordering than that embodied by the relation, and the success of keyword systems suggests that a plain unordered set without any ordering at all is the most fundamental and common of them.

== Names as Random Subsets of the Information In an Object ==

A system may still be effective when its assumptions are known to be false. You may regard the above as an overstatement of the notion that we are neural nets, and sometimes our abstract systems deal with assumptions that are not true or false, but are somewhat true. After we are finished stating them in English they lose the delicate weighting possessed by the reality of the situation. Sometimes we find it easier to model without that weighting. Classical economics and its assumption of perfect competition is the best known example of an effective system based on assumptions known to be substantially false.
Introductory economics classes usually spend several weeks of class time arguing the merits of building models on somewhat false assumptions. This paper will now use such a somewhat false model to convey a feel for why mandatory pairing of name components causes problems.

Assume the user's information from which he tries to construct a description will be some completely random subset of the information about the object. (Some of that information will be structural, and the structural fragments selected will be just as random as the rest.) Assume a user has 15 random clues of information selected from 300 pieces of information the system knows about some object. Assume the REGRES naming system requires that data be supplied in threesomes (perhaps column name, key name, relation name), and cannot use one member of a threesome without the other members of the threesome. Assume the ANARCHY naming system lacks this restriction, but does so at the cost that it can only use those 10 of the 15 information fragments which do not embody structure. Assume the statistical distribution of the 15 pieces of information the user has to construct a name with is fully independent and equally likely (this is both substantially wrong, and unfair to REGRES, but....) Assume each clue has a selectivity of 100 (it divides the number of objects returned by 100).

Then ANARCHY has a selectivity of 100<sup>10</sup> = 10<sup>20</sup> = good.

REGRES has a selectivity of: 100<sup>(chance that the other two members of an object's threesome are possessed by the user × 15)</sup> = 100<sup>(9/300 × 8/300 × 15)</sup> = 1.06 = very bad

While it is not true that the clues are fully independent, it is true that to the extent that they are not fully dependent, ANARCHY will gain in selectivity compared to REGRES.
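The arithmetic of the model can be checked directly; this is just a transcription of the numbers above, not a claim about any real system:

```python
# ANARCHY: 10 independently usable fragments, each dividing the candidate
# set by 100.
anarchy = 100 ** 10
print(anarchy == 10 ** 20)  # True

# REGRES: a fragment is usable only when the other two members of its
# threesome are also among the user's fragments; the text approximates
# that probability per clue as (9/300) * (8/300), over 15 clues.
regres = 100 ** ((9 / 300) * (8 / 300) * 15)
print(round(regres, 2))  # 1.06
```

The exponent for REGRES works out to 0.012, so its effective selectivity barely narrows the candidate set at all, which is the point of the comparison.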
Attempting to quantify for any database the extent of the dependence would be a nightmare, and so this model assumes a substantial falsity, through which it is hoped the reader can see a greater truth. For databases of the lower heterogeneity and complexity that the relational model was designed for, the independence within a threesome can be small, and the ability to also employ the 5 of 15 fragments which are structural is often more important than the difficulty of guessing any structure added. There is an implicit assumption here that you are looking for information that others have structured, and this argument in favor of ANARCHY becomes much less strong without this assumption. I feel obligated to stress once again that I do not advocate low structure over high structure, but I do advocate having the flexibility to match the amount of structure to the needs of the moment. Only with such flexibility can one hope to use all of the 15 fragments that happen to be possessed.

== The Syntax In More Detail ==

What's needed is a naming system intended to reflect just the structure inherent in the information, whatever that structure might be, rather than restructuring the information to fit the naming system.

=== Orthogonal or Unoriginal Primitives and Features ===

There are many primitives that the ultimate naming system would include but which I will not discuss here: macros, OR, weights for subnames and AND-OR connectors [Fox], rules, constraints, indirection, links, and others. I have tried to select only those aspects in which my approach differs from the standard approach. Unifying the namespace does not require unifying automatic name generation, and those who read the [Blair] vs. [Salton] controversy likely understand my concluding that whatever the benefits might be of unifying automatic name generation, it is not feasible now, and won't be feasible for a long time to come.
The names one can assign an object are kept completely orthogonal from the contents of the object in the implementation of this naming layer. It is up to the owner of the object to name it, and it is up to him to use whatever combination of autonaming programs and manual naming best achieves his purpose. He may name it on object creation, and he may continually adjust its various names throughout its lifetime. See the section defining the "Key_Object primitive" for a discussion of why names should be thought of this way. Technically, object creation only requires that the object be given a Storage_Key. In practice most users will, in the same act that creates the object, also associate the object with at least one name that will spare them from directly specifying the Storage_Key in hex the next time they make a reference to it. Applications implementing external name spaces can interact with the storage layer by referencing just the Storage_Key. Namesys will provide a manual naming interface, and the API autonaming programs need to plug into. Companies such as Ecila will provide autonamers for various purposes. Ecila is implementing a program which scans remote stores and creates links to them in the unified name space, but leaves the data on the remote stores. Other programs may also be implemented to perform this general function. To be more specific, the Ecila search engine scans the web for documents in French, and uses the filesystem as an indexing engine. However, they are writing their engine to be general purpose; they have sold support for it, and the addition of extensions to it, to other search engine companies; and it is open source. For now we are simply functioning as part of their engine, and the interface is by web browser: at some point we may be able to add their functionality to the namespace. 
While the implementation of Microsoft's attempt to blur the distinction between the filesystem name space and the web namespace is one more of appearance than substance, it is surely the right thing to do for Linux as well in the long run. We should simply make our integration one with substance and utility, rather than integrating mostly the look and feel. When the store is external to the primary store for the namespace, then stale names can be an issue with no clean resolution. That said, unification at just the naming layer is, in a real rather than ideal world, often quite useful, and so we have Internet search engines. GUI-based naming is beyond the scope of this paper, except to mention that it is common for GUI namespaces to be designed such that they are not well integrated with the other namespaces of the OS. They are often thought to necessarily be less powerful, but proper integration would make this untrue, as they would then be additional syntaxes, not substitutes. These additional syntaxes should possess closure within the general name space, and thereby be capable of finding employment as components of compound names like all the other types of names. The compound names should be able to contain both GUI and non-GUI based name components. Integration would make them simply the aspect of naming that applies to what is present in the visual cache of the screen, and to how to manage and display that cache most effectively. === Vicinity Set Intersection Definition (Also Called Grouping) === Suppose you have a set X of objects. Suppose some of these objects are associated with each other. You can draw them as connected in a graph. Let the vicinity of an object A be the set of objects associated with A. Let there be a set of query objects Q. Then the set vicinity intersection of Q is the set of objects which are members of all vicinities of the objects in Q. When thinking of this as a data model, it seems natural to use the term vicinity set intersection. 
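As a minimal illustration of the set vicinity intersection just defined, consider a toy object store in which each object carries its vicinity as a set (all object names here are hypothetical):

```python
# A toy object store: each object maps to its vicinity, i.e. the set of
# objects it is associated with. Names are hypothetical illustrations.
vicinity = {
    "love":   {"letter-to-susan", "valentine-card"},
    "letter": {"letter-to-susan", "resignation-letter"},
    "susan":  {"letter-to-susan", "susan-photo"},
}

def vicinity_intersection(query):
    """Return the objects that lie in the vicinity of every query object."""
    return set.intersection(*[vicinity[q] for q in query])

print(vicinity_intersection(["love", "letter", "susan"]))
# {'letter-to-susan'}
```

Each additional query object can only shrink the result, which is exactly the selectivity behavior the economic model above attributes to ANARCHY.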
When thinking of this syntactically, it seems natural to use the term "grouping", because it implies that the subnames are grouped together without the order of the subnames being significant. There is exactly one data model primitive (set vicinity intersection) possessing exactly one syntax (grouping), and I rarely intend to distinguish data model primitive from syntax primitive (I can be criticized for this), and yet I use both terms for it; forgive me. === Synthesizing Ordering and Grouping === I am going to describe a toy naming system that allows focusing on how best to combine grouping and ordering into one naming system. This synthesis will contain the core features of the hierarchical, keyword, and relational systems as functional subsets. It consists of a few simple primitives, allowed to build on each other. It sets the discussion framework from which our project will over many years evolve a real naming system out of its current storage layer implementation. Resolving the second component of an ordering is dependent on resolving the first --- unlike set theory. In set theory one can derive ordered set from unordered set, but because resolving the name of the second component depends on the first component one cannot do so in this naming system. For this reason it can well be argued that this naming system is not truly set theory based. Now that I have mentioned this difference I will start to call them grouping and ordering, rather than unordered and ordered set. These two primitives take other names as sub-names, and allow the user to construct compound names. Either the order of the subnames is significant (ordering), or it isn't (grouping), and thus we have the two different primitives. Because I have myself found that BNFs are easier to read if preceded by examples, I will first list progressively more complex examples using the naming system, and then give the formal definitions. 
The examples, and the simplified syntax, use / rather than : or \, but this is of no moment. Examples <tt>/etc/passwd</tt> [[Image:Passwd.jpg]] Ordering and grouping are not just better; file system upward compatibility makes them cheaper for unifying naming in OSes based on hierarchical file systems than a relational naming system would be. This approach is fully upwardly compatible with the old file system. Users should be able to retain their old habits for as long as they wish, engage in a slow comfortable migration, and incorporate the new features into their habits as they feel the desire. Elderly programs should be untroubled in their operation. Many worthwhile projects fail because they emphasize how much they wish to change rather than asking of the user the minimal collection of changes necessary to achieve the added functionality. [dragon gandalf bilbo] [[Image:Bilbo.jpg]] FIGURE 3. Graphical representation of ascii name on left Mr. B. Bizy looking for a dimly remembered story ( The Hobbit by Tolkien ) to print out and take with him for rereading during the annual company meeting. case-insensitive/[computer privacy laws] [[Image:Syntax-barrier.jpg]] FIGURE 4. Graphical representation of ascii name on left When one subname contains no information except relative to another subname, and the order of the subnames is essential to the meaning of the name, then using ordering is appropriate. This most commonly occurs when syntax barriers are crossed. This is when a single compound name makes a transition from interpreting a subname according to the rules of one syntax to interpreting it according to the rules of another syntax. Ordering is essential at the boundary between the name of the new syntax as expressed in the current syntax, and the name to be interpreted according to that new syntax. Some researchers use the term context rather than syntax. The pairing of a program or function name, and the arguments it is passed, is inherently ordered. 
While that is usually the concern of the shell, when we use a variety of ordering functions to sort Key_Objects of different types it affects the object store. In this example the ordering serves as a syntax barrier. Case-insensitive is the unabbreviated name of a directory that ignores the distinction between upper and lower case. For Linux compatibility this naming layer is case sensitive by default, even though I agree with those who think that it would be better were it not. [my secrets]/ [love letter susan] [[Image:My-secrets.jpg]] FIGURE 5. Graphical representation of ascii name on left Devhuman (that's the account name he chose) is the company's senior programmer. Six years ago he wrote a love letter to Susan, which he put in his read protected secrets directory. (He never found the nerve to send it to her.) He's looking for it so he can rewrite it, and then consider sending it. Security is a particular kind of syntax barrier (you have to squint a bit before you can see it that way). Here the ordering serves as a security barrier. (He certainly wouldn't want anyone to know that an object owned by him with attributes love letter susan existed.) [subject/[illegal strike] to/elves from/santa document-type/RFC822 ultimatum] [[Image:Ultimatum.jpg]] FIGURE 6. Graphical representation of search for santa's ultimatum Devhuman knows his object store cold. He is looking for something he saw once before, he knows that it was auto-named by a particular namer he knows well (perhaps one whose functionality is similar to the classifier in [Messinger]), and he knows just what categorizations that namer uses when naming email. Still, he doesn't quite remember whether the word 'ultimatum' was part of the subject line, the body, or even was just elvish manual supplementation of the automatic naming. Rather than craft a query carefully specifying what he does and does not know about the possible categorizations of ultimatum, he lazily groups it. 
If Devhuman's object store is implemented using this naming system with good style, someone less knowledgeable about the object store would also be able to say: [santa illegal strike ultimatum elves] and perhaps get some false hits as well as the desired email (instead of finding mail from santa perhaps finding the elvish response). Notice that if you delete the 'illegal' and 'ultimatum' to get [subject/strike to/elves from/santa document-type/RFC822] the query is structurally equivalent to a relational query. Many authors (e.g. semantic database designers) have written papers with good examples of standard column names which might be worth teaching to users. So long as they are an option made available to the user rather than a requirement demanded of the user, the increased selectivity they provide can be helpful. [_is-a-shellscript bill] [[Image:Pruner.jpg]] FIGURE 7. Graphical representation of ascii name on left This name finds all shellscripts associated with bill. Names preceded by _ are pruners. Pruners are analogous to the predicate evaluators of relational database theory. If you have read papers distinguishing between recognition and retrieval, pruners are a recognition primitive. They are passed a list of objects, and return a subset of that list which matches some criteria. They are a mechanism appropriate for when a nonlinear search method that can deliver the desired functionality is either impossible, or not supported by existing indexes. There are many useful names for which we cannot do better than linear-time search algorithms (perhaps simply as a result of incomplete indexing). _is-a-shellscript checks each member of its list to see if it is an executable object containing solely ascii. The user can use it just like any other Key_Object within an association; it will prune the results of the grouping. Since set intersections are commutative its order within the grouping has no meaning, and optimizers are free to rearrange it. 
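A pruner of the _is-a-shellscript sort can be sketched as an ordinary filter over a candidate list. The store contents below are invented for illustration, and the printable-ascii test stands in for whatever executable-and-ascii check a real pruner would perform:

```python
import string

# Toy objects: name -> (is_executable, contents). All names and contents
# here are hypothetical illustrations.
store = {
    "bill-report": (False, "Quarterly numbers for Bill."),
    "bill-backup": (True,  "#!/bin/sh\ntar cf /backup/bill.tar ~bill\n"),
    "bill-binary": (True,  "\x7fELF\x02\x01..."),
}

def is_a_shellscript(candidates):
    """A pruner: keep only candidates that are executable and pure ascii.
    Whether an object survives depends only on that object, never on the
    other candidates, so an optimizer may apply pruners in any order."""
    kept = []
    for name in candidates:
        executable, contents = store[name]
        if executable and all(c in string.printable for c in contents):
            kept.append(name)
    return kept

print(is_a_shellscript(["bill-report", "bill-backup", "bill-binary"]))
# ['bill-backup']
```

The per-member independence noted in the docstring is exactly the property the text later demands of pruners, since a query optimizer gives no guarantee about their order of application.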
=== The Formal Definitions === {| border=1 | <Object Name> ::= || <pre> <Grouping> | <Ordering> | <Key_Object> | <Storage_Key> | <Orthogonal and Unoriginal Primitives I Will Not Define Here> ; </pre> |} See the section listing orthogonal and unoriginal primitives for a discussion of what primitives I left out of the definitions of this grammar that are necessary to a real world working system. The name resolver has a method for converting all of the primitives into '''<Storage_Keys>''', and when processing compound names it first converts the subnames into '''<Storage_Keys>''', though an object may have null contents, and serve purely to embody structure. This allows the use, as a component of a grouping or ordering, of anything for which anyone can invent a way of allowing the user to find an '''<Object Name>''', and then invent a method for the resolver to convert the '''<Object Name>''' into a '''<Storage_Key>'''. In a word, closure. Extensible closure. Compound names are interpreted by first interpreting the subnames that they are constructed from. At each stage of subname interpretation an '''<Object Name>''' is converted into a '''<Storage_Key>''' for the object that it is resolved to. The modules that implement the grouping and ordering primitives do not interpret the subnames; they merely pass them to the naming system, which returns the '''<Storage_Key>'''s they resolve to. It was a long discussion which led to the use of storage keys rather than objectids. A storage key differs from an objectid in that it gives the storage layer directions as to where to try to locate the object in the logical tree ordering of the storage layer. If the logical location changes, then in the worst case we leave a link behind, and get an extra disk access like we get with an inode. 
(Inode numbers are functionally objectids.) In the better case, the repacker eventually comes along, and changes all references by key to the new location, at least for all objects that have not given their key to external naming systems the repacker cannot repack. A '''<Storage_Key>''' is assigned by the system at object creation, and serves the purpose of allowing the system to concisely name the object, and provide hints to the storage layer about which objects should be packed near each other. The user does not directly interact with the '''<Storage_Key>''' any more often than C programmers hardcode pointers in hex. The packing locality of keys may be redefined. == The Primitives == <Key_Object> A description of the contents of an object using the syntax of the current directory. For objects used to embody keywords this may be the keyword in its entirety. If it contains spaces, etc. it must be enclosed in quotes. Note that making it easy for third parties to add plug-in directory types is part of Namesys's current contract with Ecila. Ecila wants space efficient directories suitable for use in implementing a term dictionary and its postings files for their Internet search engine. Example: [reindeer chimneys presents man] In this example 'presents', 'reindeer', 'chimneys', and 'man' are the contents of objects associated with the Santa Claus story. Each of them is searched for by contents, and then when found they are converted into their Storage_Keys, and then the grouping algorithm is fed their four Storage_Keys. The grouping module then looks in the object headers of the four objects, gets the four sets of objects the Key_Objects group to, and performs a set intersection. 
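The resolution pipeline just described (contents lookup to obtain Storage_Keys, then intersection of the sets found in the object headers) can be sketched as follows; all keys and header contents here are invented for illustration:

```python
# Toy resolver for a grouping such as [reindeer chimneys presents man]:
# each Key_Object is first resolved by contents to a Storage_Key, then the
# group-to sets in the object headers are intersected. Keys are hypothetical.

contents_index = {          # contents -> Storage_Key
    "reindeer": 0x01, "chimneys": 0x02, "presents": 0x03, "man": 0x04,
}

object_headers = {          # Storage_Key -> set of keys the object groups to
    0x01: {0x10, 0x11},     # 0x10 stands in for the Santa Claus story
    0x02: {0x10, 0x12},
    0x03: {0x10, 0x13},
    0x04: {0x10, 0x14},
}

def resolve_grouping(key_objects):
    keys = [contents_index[k] for k in key_objects]    # contents lookup
    groups = [object_headers[k] for k in keys]         # header fetch
    return set.intersection(*groups)                   # set intersection

print(resolve_grouping(["reindeer", "chimneys", "presents", "man"]))
# the set containing 0x10, the story object
```

The grouping module itself never interprets the subnames; as the formal definitions say, it only receives the Storage_Keys the naming system resolved them to.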
Besides greater closure, another advantage of storing Key_Objects as objects is that non-ascii Key_Objects and ordering functions can be implemented as a layer on top of the ascii naming system, allowing the user to interact with the naming system by pressing hyperbuttons, drawing pictures, making sounds, and supplying other non-ascii Key_Objects that the higher layers convert into Storage_Keys. There are endless content description techniques. If the directory owner supplies an ordering function for the Key_Objects in a directory, one can generate a search index for the directory using a directory plug-in which is fully orthogonal to the ordering function, though perhaps slower in some cases than one that is tailored for the ordering function. Users will find it easier to write ordering functions than index creation objects, and will not always need the speed of specialized indexes. We will need one ordering function for ascii text, another for numbers, another for sounds, perhaps someday one even for pictures of faces (perhaps to be used by a law enforcement agency constructing an electronic mug book, or a white pages implementation), etc. No system designer can provide all the different and sometimes esoteric ordering functions which users will want to employ. What we can do is create a library of code, from which users can construct their own ordering functions and their own directory plug-ins, and this is the approach we are taking on behalf of Ecila. For an Internet search engine one wants what is called a postings file, which is like a directory in that there is no need to support a byte offset, and one frequently wants to efficiently perform insertions into it. <Grouping> ::= [<Unordered List>] ; <Unordered List> ::= <Unordered List> <Unordered List> | <Object Name> | <Pruner> ; <Pruner> ::= _<Object Name> A <Grouping> is a list of object names and pruners whose order has no meaning. 
Every object has a list of objects it groups to (associates with, in neural network idiom) in its object header. A grouping is interpreted by performing a set intersection of those lists for every object named in the grouping; in the sense of the data model, this is a set vicinity intersection. Grouping is not transitive: [A] => B and [B] => C does not imply [A] => C, though it does imply that [[A]] => C. A pruner is an <Object Name> which has been preceded with an _ to indicate that the object described should be passed a list of objects named by the rest of the grouping, executed, and it will return a subset of the list it was passed. Whether a member of the set is in the returned subset must be fully independent of what the other members of the set were, or else the results become indeterminate after application of a query optimizer, as with an optimizer in use there is no guarantee provided of the order of application of the pruners. <Ordering> ::= <Object Name>/<Object Name> | <Object Name>/<Custom Programmed Syntax> <Custom Programmed Syntax> ::= Varies, provides extensibility hook. An ordering is a pairing of names, with the order representing information. The first component of the ordering determines the module to which the second component is passed as an argument. In contrast, a grouping first converts all subnames to Storage_Keys by looking through the same current directory for all of them in parallel, and then does its set intersection with the subdescriptions already resolved. Example: In resolving [my secrets] / [love letter susan] the system would look for the objects with contents my and secrets, find both of them, and do a set intersection of all of the objects those two objects both group to (are associated with). This will allow it to find the [my secrets] directory, inside of which it will look for the three objects love, letter, and susan. 
It will then extract from their object headers the sets of objects those three words ('love', 'letter', and 'susan') group to, and do a set intersection which will find the desired letter. The desired letter is not necessarily inside the [my secrets] directory, though in this case it probably is. A directory is an object named by the first component of an ordering, to which the second component is passed, and which returns a set of Storage_Keys. One can in principle use different implementations of the same directory object without impacting the semantics and only affecting performance, as is often done in databases. There are flavors of directories: Custom programmed directories, aka filters, are any executable program that will return a Storage_Key when executed and fed the second component as an argument. They provide extensibility. (They are the ordered counterpart of pruners.) Another term for them is filter directories. Custom programmed directories whose name interpretation modules aren't unique to them will contain just the name of the module (filter), plus some directory dependent parameters to be passed to the module. It should be considered merely a syntax barrier directory, and not a fully custom programmed directory, if those parameters include a reference to a search tree that the module operates on, and if that search tree adheres to the default index structure. The connotations conveyed by the term 'filter', of there being an original which is distorted, are not always appropriate, but in honesty this is not an issue about which we deeply care. Syntax barrier directories allow you to describe the contents of the objects they contain with a syntax different from their parents'. Except for being sorted by a different ordering function, the indexes of syntax barrier directories are standard in their structure, and use a standard index traversal module. The index traversal module is ordering function independent. 
There must be an ordering function for every <Key_Object> employed within a given syntax barrier directory. By contrast, a <Custom Programmed Syntax> could be anything which the syntax module somehow finds an object with, possibly even creating the object in order to be able to find it. To cross a security barrier directory the user must use an ordered pair of names with the security barrier as the first member of the pair, and he must satisfy the security module of the secured directory. A security barrier directory may be both a security and a syntax barrier directory, or the security barrier directory may share the syntax module of its parents. Fully standard directories are those built using the default directory module, and adding structure is their only semantic effect. There is an aspect of customization which is beyond the scope of this paper, in which one customizes the items employed by the storage layer to implement files and directories. That is, the storage of the files and directories are implemented by composing them of items, and these items have different types. We are now creating the code for packing and balancing arbitrary types of items using item handlers and object oriented balancing code, so as to make it easier to extend our filesystem. === Ordering can be implemented more efficiently than grouping === The set intersections performed in evaluating the grouping primitive are normally much more expensive computationally than performing the classical filesystem lookup. Imposing excess structure on one's data does not just at times reduce the cost of human thinking :-), it can be used to reduce the cost of automated computation as well. When the cost to a user of learning structure is less important than the burden on the machine, use of highly ordered names is often called for. === The Motivation for Different Syntactic Treatment of Ordering and Grouping, and Some of the Deeper Issues Revealed by the Difference. 
=== An important difference between grouping and ordering affects syntax. It allows us to represent an ordering with a single symbol ('/') placed between the pair, but requires two symbols ('[' and ']') for each grouping. Imagine using < and > as a two symbol delimiter style alternative notation for ordering: <<father-of mother-of>sister-of> = <father-of<mother-of sister-of>> = <father-of mother-of sister-of> = father-of/mother-of/sister-of All of the expressions above are equivalent in referring to the paternal great aunt of the person who is the current context. The ones using nested pairs of symbols to enclose pairs of subnames imply a false structure that requires the user to think to realize the first two expressions are equivalent. The fourth is the notation this naming system employs. Grouping is different: Fast Acting Freddy is looking through the All-LA Shopping Database for a single store with black reebok sneakers, a green leather jacket, and a red beret so that he can dress an actor for a part before the director notices he forgot all about him. [[black reebok sneakers] [green leather jacket] [red beret]] is not equivalent to [black reebok sneakers green leather jacket red beret] which equals [red sneakers black reebok jacket green beret] Ordering is not algebraically commutative (father-of/mother-of is not equivalent to mother-of/father-of). Groupings are algebraically commutative. ([large red] = [red large]) == Style == As a general principle, a more restricted system can avoid requiring the user to repeatedly specify the restrictions, and if the user has no need to escape the restrictions then the restricted system may be superior. This is why "4GLs", which supply the structure for the user's query, are useful for some applications. They are typically implemented as layers on top of unrestricting systems such as this one. This paper has addressed issues surrounding finding information, particularly when the user's clues are faint. 
When supporting other user goals, such as exploring information, adding structure through substantial use of ordering can be helpful. [Marchionini][McAleese]. When the user goal is finding, one should assume that of all the fragments of information about an object, the user has some random subset of them. The goal is to allow the user to use that random subset in a name, whatever that subset might be. Some of that subset will be structural fragments. While requiring the user to supply a structure fragment is as foolish as requiring him to supply any other arbitrary fragment, allowing him to is laudable. In the best of all worlds the object store would incorporate all valid possible structurings of Key_Objects. The difficulty in implementing that is obvious. [Metzler and Haas] discuss ways of extracting structure from English text documents, and why one would want to be able to use that structure in retrievals. Unfortunately, there is an important difference between representing the structure of an English language sentence in a way that conveys its meaning, and representing it in a way that allows it to be found by someone who knows only a fragment of its semantic content. I doubt the wisdom of trying to advocate the use of more than essential structure in searching. You can allow users to avoid false structure; you cannot force them to. It is important to teach those creating the structure that if they group a personnel file with sex/female they should also group it with female. Type checking can impose structure usefully. Its implementation can enhance or reduce closure, depending on whether it is done right. === When To Decompound Groupings === There are dangers in excessive compounding of compound groupings analogous to those of excessive ordering. Let's examine two examples of compound groupings, both of which are valid both semantically and syntactically. 
One of them can be "decompounded" with moderate information loss, and the other loses all meaning if decompounded. Example: Finding a loquacious Celtic textbook salesman who told you in excruciating detail about how he was an ordinance researcher until one day he went to a Grateful Dead concert. [[Celtic textbook salesman] [ordinance researcher]] vs. [celtic textbook salesman ordinance researcher] These two phrasings of the same query are not equivalent, but they are "close." Our second example is the one in which Fast Acting Freddy tries to find a suspect by the objects he is associated with: [[black reebok sneakers] [green leather jacket] [red beret]] vs. [black reebok sneakers green leather jacket red beret] These two are not at all "close." The difference between the two examples of inequivalence is that the subdescriptions within the second example describe objects whose independent existence within the object store is worthwhile; the subdescriptions within the first example do not, and so it is more reasonable to try to design so that the "decompounded" version of the query is used. False hits will occur, but for large systems that's better than asking the user to learn structure. A higher level user interface might choose to present only one level to the user at a time, and then once the user confirms that a subdescription has resolved properly it would let him incorporate it into a higher level description. There might be 6 models of [black reebok sneakers], and Fast Acting Freddy should have the opportunity to click his mouse on the exact model, and have the interface substitute that object for his subdescription. Using such an interface an advanced user might simultaneously develop several subdescriptions, refine and resolve them, and then use the mouse to draw lines connecting them into a compound grouping. Closure makes it possible for that to work. 
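A toy version of the vicinity model makes the inequivalence concrete. The shopping data below is entirely invented, but it shows how letting colors float free of their items produces a false hit that the compound form excludes:

```python
# Hypothetical All-LA Shopping data: name -> set of objects it groups to.
vicinity = {
    "black":    {"sneaker-A", "jacket-B"},
    "reebok":   {"sneaker-A"},
    "sneakers": {"sneaker-A"},
    "green":    {"jacket-A"},
    "leather":  {"jacket-A", "jacket-B"},
    "jacket":   {"jacket-A", "jacket-B"},
    "red":      {"beret-A"},
    "beret":    {"beret-A"},
    # which stores stock which items
    "sneaker-A": {"store-1", "store-2"},
    "jacket-A":  {"store-1"},
    "jacket-B":  {"store-2"},
    "beret-A":   {"store-1"},
}

def group(names):
    return set.intersection(*[vicinity[n] for n in names])

# Compound form: resolve each inner grouping to an item, then group the items.
items = [group(["black", "reebok", "sneakers"]),   # the exact sneakers
         group(["green", "leather", "jacket"]),    # the exact jacket
         group(["red", "beret"])]                  # the exact beret
stores = set.intersection(*[vicinity[next(iter(i))] for i in items])
print(stores)   # the one store stocking all three exact items

# Flat grouping lets the colors cross between items: [black leather jacket]
# matches jacket-B even though the query asked for a *green* jacket.
flat = group(["black", "leather", "jacket"])
print(flat)     # a false hit the compound form would have excluded
```

This is only a sketch; a real object store would resolve each subdescription to a Storage_Key rather than a string, as the earlier examples describe.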
== Examples of Creating Associations == <- creates an association between all of the objects on the left hand side and all of the objects on the right hand side. A - B is the set difference of A and B, and it resolves to the set of objects in A except for those that are in B. A & B resolves to the set intersection of A and B, the objects that are in both A and B. [A B] = [A] & [B], by definition. animal <- (lives, moves) mammal <- ([animal], animal, `warm blooded') cat <- ([mammal], hypernym/mammal, mammal, meronym/fur, fur, meronym/whiskers, whiskers, hypernym/quadruped, quadruped, capability/purr, purr, capability/meow, meow) Basil <- (owner/Nina, Nina, [siamese], siamese, clever, playful, brave/overly, brave, 'toilet explorer') bag <- ([container], container, consists-of/`highly flexible material', `highly flexible material') backpack <- ([bag], shoulderstrap/quantity/2, shoulderstrap, college-student, holonym/backpacker, meronym/shoulderstrap) mould <- ([fungi] - green/not, furry, `grows on'/surfaces/moist, `killed by'/chlorine) fungi <- ([plant], plant, leaves/no, flowers/no, green/not) bird <- ([vertebrate], vertebrate, flies, feathers) penguin <- ([bird] - flies, bird, hypernym/bird, swims, Linux, [Linux (mascot, symbol)]) siamese <- ([cat], cat, hair/short, short-hair) Notice how we don't associate siamese with short despite associating it with hair/short, but we do associate Basil with Nina as well as with owner/Nina. small <-0 little The above means that small and little are synonyms, and are to be treated as 0 distance away from each other for vicinity calculation purposes. In other (traditional Unix) words, they are hardlinked together. Creating a serious ontology is not our field or task, but it is worth doing. The reader is referred to WordNet (free), and Cyc by Doug Lenat (proprietary). While we will focus on implementing primitives that allow for creating better ontologies, we are happy to work with persons interested in contributing or porting an ontology. 
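A simplified sketch of the <- operator and of the definition [A B] = [A] & [B] follows. It deliberately ignores the ordering components (hypernym/mammal and the like) and treats every right-hand name as a plain association; the zoo of names comes from the examples above:

```python
# A toy association store implementing '<-' from the examples above:
# 'a <- (b, c)' associates object a with b and c, so that a later appears
# in the grouping result for [b c]. Ordering components are omitted.
from collections import defaultdict

groups_to = defaultdict(set)     # name -> set of objects grouping to it

def associate(obj, names):       # the '<-' operator, simplified
    for n in names:
        groups_to[n].add(obj)

def grouping(names):             # [n1 n2 ...] = [n1] & [n2] & ..., by definition
    return set.intersection(*[groups_to[n] for n in names])

associate("animal", ["lives", "moves"])
associate("mammal", ["animal", "warm blooded"])
associate("cat", ["mammal", "fur", "whiskers", "purr", "meow"])
associate("Basil", ["Nina", "siamese", "clever", "playful", "brave"])
associate("siamese", ["cat", "short-hair"])

print(grouping(["mammal", "purr"]))        # {'cat'}
print(grouping(["siamese", "playful"]))    # {'Basil'}
```

The <-0 synonym operator would be a small extension: merge the two names' entries so that vicinity calculations treat them as zero distance apart.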
== Other Projects Seeking To Increase Closure In The OS == === ATT's Plan 9 === [Plan 9] is being produced by the original authors of Unix at ATT research labs. It has influenced CORBA, and /proc is a direct steal from it to Linux. Their major focus is on integration. Their major trick for increasing integration is unifying the name space. Name spaces integrated into the Plan 9 file system include the status, control, virtual memory, and environment variables of running processes. They have a hierarchical analog to what the relational culture calls constructing views, which the Plan 9 culture calls context binding. === Microsoft's Information At Your Fingertips === Plan 9 ignores integration of application program name spaces, concentrating on OS oriented name spaces. Microsoft's "Information at Your Fingertips" name space integration effort appears to be taking the other approach, and focusing on integrating the name spaces of the various Microsoft applications via OLE and Structured Storage. The application group at Microsoft has long been better staffed and funded than the OS group, and FS developers have long preferred to simply ignore the needs of application builders generally. The primary semantic disadvantages of Microsoft's approach are primitives selected with insufficient care, a lack of closure, and the use of an object oriented rather than set oriented approach in both naming syntax and data model. Realistically, one can say that folks within Microsoft have often made statements favoring name space integration, and in various areas have successfully executed on it, but on the whole I rather suspect that the lack of someone in marketing making a business case for $X in revenue resulting from name space integration has crippled name space integration work at commercial OS producers generally, including MS. ==== Internet Explorer ==== Internet Explorer attempts to unify the filesystem and Internet namespaces. 
At the time of writing, the unity is so superficial, with so little substance, that I would describe it as having the look and feel of integration without most of the substance. Perhaps this will change.

==== Microsoft's Well Known Performance Difficulties ====

Despite having many of the leading names in the industry on their payroll, they have somehow managed to create a file system implementation with performance so terrible that, for the Unix customer base, it is a significant consideration contributing to hesitation in moving to NT. It may well have the worst performance of any of the major OS file systems. Their implementation of OLE's structured storage offers extremely poor performance, and their excuse that it is due to the incorporation of transaction concepts into their design is just a reminder that they did a poor job at that as well. They managed to implement something intended to store small objects within a file, yet implemented it such that it still suffers from 512-byte granularity problems, problems that they try to somewhat overcome by encouraging the packing of several objects within "storages" at horrible kludge costs.

=== Storage Layers Above the FS: A Sure Symptom The FS Developer Has Failed ===

When filesystems aren't really designed for the needs of the storage layers above them, and none of them are, not Microsoft's, not anybody's, then layering results in enormous performance loss. The very existence of a storage layer above the filesystem means that the filesystem team at an OS vendor failed to listen to someone, and that someone was forced to go and implement something on their own. You just have to listen to one of these meetings in which some poor application developer tries to suggest that more features in the FS would be nice; I heard one at a nameless OS vendor.
The FS team responds to say disks are cheap, small object storage isn't really important, we haven't changed the disk layout in 10 years, and changing it isn't going to fly with the gods above us about whom we can do nothing. At these meetings you start to understand that most people who go into filesystem design are persons who didn't have the guts to pursue a more interesting field in CS. There is a sort of reverse increasing returns effect that governs FS research, in which the more code becomes fixed on the current APIs, the more persons in the field react with fear to any thought of the field of FS semantics being other than a dead research topic, the less research gets done, and the fewer persons of imagination see a reason to enter the field. Every time one vendor gets a little ahead in adding functionality, the other vendors go on a FUD campaign about it breaking standards and therefore being dangerous for mission critical usage. This is a field in which only performance research is allowed, and every other aspect is simply dead. Namesys seeks to raise the dead, and is willing to commit whatever unholy acts that requires.

There is no need for two implementations of the set primitive, one called directories, the other called a file with streams, each having a different interface. File systems should just implement directories right, give them some more optional features, and then there is no need at all for streams. If you combine allowing directory names to be overloaded to also be filenames when acted on as files, allowing stat data to be inherited, allowing file bodies to be inherited, and implement filters of various kinds, then in the event that the user happens to need the precise peculiar functionality embodied by streams, they can have it by just configuring their directory in a particular way. There was a lengthy Linux-kernel thread on this topic which I won't repeat in more detail here.
The tree architecture of the storage layer of this FS design will lend itself to a distributed caching system much more effectively than the Microsoft storage layer, in part due to its ability to cache not just hits and misses of files, but semantic localities (ranges). For more on this topic see later in this paper.

=== Rufus ===

The Rufus system [Messinger et al.] indexes information while leaving it in its original location and format. While it does allow the user to create a unified name space, it does not choose to integrate that name space into the operating system. Even so, it is immensely useful in practice, and strongly hints at what the OS could gain if it had a more than hierarchical name space with a data model oriented towards what [Messinger] calls "semi-structured information", such as you find in the RFC822 format for email. When you have 7000 pieces of mail, and linearly searching the mail with a utility like grep takes 10 minutes, it is nice to be able to quickly keyword search via inverted indexes for the mail whose from: field contains billg and that has the words "exclusive" and "bundling" in the body of the message, as you hurriedly search for an old email just before an appearance in court.

=== Semantic File System ===

The Semantic File System comes closest to addressing the needs I have described. It is a Unix compatible file system with more than hierarchical naming (attribute based is the term they use). Its data model unfortunately has the important flaw of lacking closure (in it, names of objects are not themselves objects). In my upcoming discussion of the unnecessary lack of closure in hypertext products, notice that the arguments apply to the Semantic File System (and so I won't duplicate them here).

=== OS/400 ===

IBM's OS/400 employs a unified relational name space. The section of this paper entitled A System Should Reflect Rather than Mold Structure will cover its problems of forcing false structure.
Inadequate closure due to mandatory type checking is another source of difficulties for it. While users moan about these two unnecessary design flaws, the essence of the opinions AS/400 partisans have expressed to me has been that the unification of its name space is a great advantage that OS/400 has over Unix. I claim these users were right, and later in this paper I will propose doing something about it.

== Conclusion ==

While I spent most of this paper on why adding structure to information can be harmful, particularly when the information is intended to be found by others sifting through large amounts of other information, this was purely because it is a harder argument than why deleting structure is harmful. My goal was not to be better at unstructured applications than keyword systems, or better at structured applications than the hierarchical and relational systems: the goal is to be more flexible in allowing the user to choose how structured to be, while still being within a single name space. I claimed that multiple fragmented name spaces cannot match the power and ease of name spaces integrated with closure: closure makes a naming system far more powerful by increasing its ability to compound complex descriptions out of simpler ones. The strong points of this naming system's design are various forms of generalizing abstractions already known to the literature, for greater closure.

== Acknowledgments ==

David P. Anderson and Clifford Lynch helped enormously in rounding out my education and improving my paper. Their generosity with their time was remarkable. David P. Anderson was simply a great professor, and it was a privilege to work with him. Brian Harvey informed me that it wasn't too obvious to mention that an object store should be unified. Cimmaron Taylor provided me with many valuable late night discussions in the early stages of this paper.
I would like to thank Bill Cody and Guy Lohman of the database group at the IBM Almaden Research Center for a wonderful learning experience. Vladimir Saveliev kept this file system going when others fell by the wayside. He started as the most junior programmer on the team, and through sheer hard work and dedication to excellence outshone all the other more senior researchers. Of course after some time he could no longer be considered a junior programmer.

NOTE: See also the DARPA funded, but not endorsed, Reiser4 Transaction Design Document and Reiser4 Whitepaper.

== References ==

1. Blair, David C. and Maron, M. E. "Evaluation of Retrieval Effectiveness for a Full-Text Document-Retrieval System", Communications of the ACM, v28 n3, March 1985, p289-299

2. Codd, E. F. "The Relational Model for Database Management: Version 2", Addison-Wesley, c1990. Not recommended as a textbook (Date's is better for that), but worthwhile if you want a long paper by Codd. Notice that he places greater emphasis on closure, and design methodology principles in general, than designers of other naming systems such as hypertext.

3. Date, C. J. "An Introduction to Database Systems", 4th ed., Reading, Mass.: Addison-Wesley, c1986. Contains a well written substantive textbook sneer at the problems of hierarchical naming systems, and a well annotated bibliography.

4. Curtis, Ronald and Larry Wittie "Global Naming in Distributed Systems", IEEE Software, July 1984, p76-80

5. Feldman, Jerome A., Mark A. Fanty, Nigel H. Goddard and Kenton J. Lynne, "Computing with Structured Connectionist Networks", Communications of the ACM, v31, Feb 1988, p170(18)

6. Fox, E. A., and Wu, H. "Extended Boolean Information Retrieval", Communications of the ACM, 26, 1983, pp. 1022-1036

7. Gallant, Stephen I., "Connectionist Expert Systems", Communications of the ACM, v31, Feb 1988, p152(18)

8. Gates, Bill.
Comdex '91 speech on "Information at Your Fingertips", available for $8 on videotape from Microsoft's sales department.

9. Gifford, David K., Jouvelot, Pierre, Sheldon, Mark A., O'Toole, James W. Jr., "Semantic File Systems", Operating Systems Review, Volume 25, Number 5, October 13-16, 1991. They demonstrated that extending Unix file semantics to include nonhierarchical features is useful and feasible. Unfortunately, their naming system lacks closure.

10. Gilula, Mikhail. "The Set Model for Database and Information Systems", 1st Edition, c1994, Addison-Wesley. Provides a set theoretic database model in which relational algebra is shown to be a special case of a more general and powerful set theoretic approach.

11. Joint Object Services Submission (JOSS), OMG TC Document 93.5.1

12. Marchionini, Gary, and Shneiderman, Ben. "Finding Facts vs. Browsing Knowledge in Hypertext Systems", Computer, January 1988, p. 70

13. McAleese, Ray (ed.) "Hypertext: Theory into Practice", ABLEX Publishing Corporation, Norwood, NJ 07648

14. Messinger, Eli, Shoens, Kurt, Thomas, John, Luniewski, Allen "Rufus: The Information Sponge", Research Report RJ 8294 (75655), August 13, 1991, IBM Almaden Research Center

15. Metzler and Haas. "The Constituent Object Parser: Syntactic Structure Matching for Information Retrieval", Proceedings of the ACM SIGIR Conference, 1989, ACM Press

16. Nelson, T. H. "Literary Machines", self published by Nelson, Nashville, Tenn., 1981. Did much to popularize hypertext; at the time of writing he has still not released a working product, though competitors such as HyperCard have done so with notable success.

17. Mozer, Michael C. "Inductive Information Retrieval Using Parallel Distributed Computation", UCLA, ICS Report No. 8406, June 1984

18. Pike, Rob and P. J. Weinberger, "The Hideous Name", AT&T Research Report

19. Pike, Rob, Presotto, Dave, Thompson, Ken, Trickey, Howard, Winterbottom, Phil,
"The Use of Name Spaces in Plan 9", available via ftp from att.com. Plan 9 is an operating system intended to be the successor to Unix, and greater integration of its name spaces is its primary focus.

20. Potter, Walter D. and Robert P. Trueblood, "Traditional, semantic, and hyper-semantic approaches to data modeling", Computer, v21, 1988, p53(11)

21. Rijsbergen, C. J. van, "Information Retrieval", 2nd ed., Butterworth and Co. Ltd., 1979, printed in Great Britain by The Whitefriars Ltd., London and Tonbridge

22. Salton, G. (1986) "Another Look at Automatic Text-Retrieval Systems", Communications of the ACM, 29, 648-656

23. Smith, J. M. and D. C. Smith, "Database Abstractions: Aggregation and Generalization", ACM Transactions on Database Systems, June 1977, pp. 105-133

24. http://www.win.tue.nl/~aeb/partitions/partition_types.html

[[category:Reiser4]]

The Naming System Venture

== Abstract ==

For too long the file system has been semantically impoverished in comparison with database and keyword systems. It is time to change! The current lack of features makes it much easier to adopt the latest set theoretic models rather than older models of relational algebra or hypertext. The current FS syntax fits nicely into the newer model. The utility of an operating system is more proportional to the number of connections possible between its components than it is to the number of those components. Namespace fragmentation is the most important determinant of that number of possible connections between OS components. Unix at its beginning increased the integration of I/O by putting devices into the file system name space. This is a winning strategy: let's take the file system name space and, one by one, eliminate the reasons why the filesystem is inadequate for what other name spaces are used for, one missing feature at a time.
Only once we have done so will the hobbles be removed from OS architects, or even OS conspiracies. Yet before doing that, we need a core architecture for the semantics to ensure we end up with a coherent whole. This paper suggests a set theoretic model for those semantics. The relational models would at times unacceptably add structure to information, the keyword models would at times delete structure, and purely hierarchical models would create information mazes. Reworking their primitives is required to synthesize the best attributes of these models in a way that allows one the flexibility to tailor the level of structure to the need of the moment. The set theoretic model I propose has a syntax that is upwardly compatible with Linux, MacOS, and DOS file system syntax, as well as with the CORBA naming layer.

This is a planning document for the next major version of ReiserFS, that is, a description of vaporware. It is useful to ReiserFS users and contributors who want to know where we are going, and why we are building all sorts of strange optimizations into the storage layer (and especially those who are willing to help shape the vision in the course of discussions on the {{listaddress}} mailing list....). Currently the storage layer for ReiserFS is working and useful as an everyday FS with conventional semantics. That storage layer is available as a GPL'd Linux kernel patch.

== Introduction ==

Many OS researchers have built hierarchical name spaces that innovate in their effect on the integration of the operating system (e.g. Plan 9 and its file system [Pike]). Relational and keyword researchers rightfully scorn hierarchical name spaces as 20 years behind the state of the art [Date], but pay little attention to integration of the operating system as a design objective in their own work, or as a possible influence on data model design. I won't go into that here. Limiting associations to single key words is an unnecessary restriction.
== A Naming System Should Reflect Rather than Mold Structure ==

The importance of not deleting the structure of information is obvious; few would advocate using the keyword model to unify naming. What can be more difficult to see is the harm from adding structure to information; some do recommend the relational model for unifying naming (e.g. OS/400).

By decomposing a primitive of a model into smaller primitives one can end up with a more general model, one with greater flexibility of application. This is the very normal practice of mathematicians, who in their work constantly examine mathematical models with an eye to finding a more fundamental set of primitives, in hopes that a new formulation of the model will allow the new primitives to function more independently, and thereby increase the generality and expressive power of the model. Here I break the relational primitive (a tuple is an unordered set of ordered pairs) into separate ordered and unordered set primitives. Relational systems force you to use unordered sets of ordered pairs when sometimes what you want is a simple unordered set.

Why should a naming system match rather than mold the structure of information? For systems of low complexity, the reasons are deeply philosophical, which means uncompelling. And for multiterabyte distributed systems?...

Reiser's Rule of Thumb #2: The most important characteristic of a very complex system is the user's inability to learn its structure as a whole. We must avoid adding structure, or guarantee that the user will be informed of all structure relevant to his partial information. Avoiding adding structure is both more feasible and less burdensome to the user.

Hierarchical, relational, semantic, and hypersemantic systems all force structure on information, structure inherent in the system rather than the information represented.
If a system adds structure, and the user is trying to exploit partial knowledge (such as a name embodies), then it inevitably requires the user to learn what was added before he can employ his partial knowledge. With complex systems, the amount added is beyond the capacity of users to learn, and information is lost.

Example:

<tt>"My name is Kali, your friendly technical support specialist for REGRES. Our system puts the Library of Congress online! How may I help you."</tt>

George doesn't know Santa Claus' name: <tt>"I'm trying to find the reindeer chimneys christmas man, and I can't get your system to do it."</tt>

[[Image:Reindeer.jpg]]

FIGURE 1. Graphical representation of a typical simple unordered set that is difficult for relational systems.

Kali says: <tt>"OK, now let's define a query. '''is-a equals man''', that's easy. But reindeer? Is reindeer a property of this man?"</tt>

<tt>"Uh no. I wish I could remember the dude's name. I read this story about him a long time ago, and all I can remember is that he had something to do with reindeer and chimneys. The story is on-line, somewhere."</tt>

<tt>"Reindeer chimneys presents man, that's the sort of speech pattern I'd expect from a three year old."</tt> Kali corrects him. <tt>"Let's see if we can structure this properly. Is reindeer an '''instance-of''' of this man? A '''member-of''' of this man? It couldn't be a '''generalization''' of this man. Hmm..."</tt>

<tt>"No! It's not that complicated. They just have something to do with him."</tt>

<tt>"Pavlov would probably say you associate reindeer with this man, the way the unstructured mind of an animal thinks. But here in technical support we try to help our customers become more sophisticated. Is reindeer a property of this man?"</tt>

<tt>"No. Try '''propulsion-provider-for'''."</tt>

<tt>"Do you think that that was the schema the person who put the information in our system used?"</tt>

<tt>"No. Shoot.
I can think of a dozen different columns it could be under. But what are the chances that the ones I think of are going to be the same as the ones the dude who put the information in used?"</tt>

Kali feels satisfaction. <tt>"Guess it can't be done, not if you can't structure your REGRES query properly. I'll put you down in my log as a closed ticket, 190 seconds to resolution, not bad."</tt>

<tt>"A keyword system could handle reindeer chimneys christmas man."</tt> George grumbles as he stares in despair at his display. Unfortunately, the ''Library of Congress'' is only one of REGRES' many reference aids. George could spend his life at it, and he'd never learn its schema.

<tt>"But a keyword system would delete even necessary structure inherent to the information. It couldn't handle our other needs!"</tt> Kali says before she hangs up.

In addition to the searcher's difficulties, having to manufacture structure by specifying the column for reindeer also adds unnecessary cognitive load to the story author's indexing tasks.

== A Few of the Other Approaches to This Problem ==

There is lurking at the heart of my approach a subtle difference between my analysis of naming, and the analysis of at least some others. I started my research by systematically categorizing the different structures embodied by names, placing them into equivalency classes, and then picking one syntax out of each class of functionally equivalent naming structures, on the assumption that each of the equivalency classes has value. For example, I considered that languages sometimes convey structure by word endings (tags), and sometimes by word order, but while the syntax differs, the word order and word ending techniques are equivalent in their power to convey structure. In my analysis of the effect of word ordering I decided that either the ordering mattered, or it did not, and that was the basis for two different naming primitives.
Others have instead studied the inherent structure of data, and then from that derived ways of naming. The hypersemantic system [Smith] [Potter] represents an attempt to pick a manageably few columns which cover all possible needs. Generalization, aggregation, classification, and membership correspond to the is-a, has-property, is-an-instance-of, and is-a-member-of columns, respectively. The minor problem is that these columns don't cover all possibilities. They don't cover reindeer, presents, or chimneys for George's query. The major problem is that they don't correspond as closely as possible to the most common style of human thought, simple unordered association, and require cognitive effort to transform.

The first response of relational database researchers to this is usually to ask: "Why not modify an existing relational database to contain an 'associated' column, put everything in that column, and it would be functionally equivalent to what you want?" This is like saying that you can do everything Pascal can do using TeX macros. (They are both Turing complete.) We don't design languages to simply be Turing complete, we design them to be useful. I have seen a colleague do in six lines of SQL (nonstandard SQL) a simple three keyword unordered set that I do in 3 words plus a pair of delimiters, and that traditional keyword systems also handle easily. Doing simple unordered sets well is crucial for highly heterogeneous name spaces, and the market success of keyword systems in Internet searching is evidence of that. If you look at the structure of names in human languages, they are not all tuple structured, and to make them tuple structured might be to distort them.

I have merely discussed the burden of naming columns. Most relational systems also require the user to specify the relation name. If column naming is a burden, naming both the column and the relation is no less a burden.
Many systems invest effort into allowing you to take the key that you know, and figure out all the relation names and columns that you might choose to pair with it. This is a good idea, but not as good as not imposing extraneous structure to begin with.

[Salton] can be read for devastating critiques of the document clustering system, but there is a worthwhile idea lurking within that system. Perhaps it is worthwhile to keep track of a small number of documents which are "close" to a given document. The document creator could be informed upon auto-indexing the document what other documents appear to be close to it, and asked to consider associating it with them. This is not within our current plan of work, but I don't reject it conceptually.

In summary, modularity within the naming system is improved by recognizing unordered grouping and ordering as two different functions that deserve separate primitives rather than being combined into a tuple primitive. The tuple is an unordered set of ordered pairs. There are other useful combinations of unordered grouping and ordering than that embodied by the relation, and the success of keyword systems suggests that a plain unordered set without any ordering at all is the most fundamental and common of them.

== Names as Random Subsets of the Information In an Object ==

A system may still be effective when its assumptions are known to be false. You may regard the above as an overstatement of the notion that we are neural nets, and sometimes our abstract systems deal with assumptions that are not true or false, but are somewhat true. After we are finished stating them in English they lose the delicate weighting possessed by the reality of the situation. Sometimes we find it easier to model without that weighting. Classical economics and its assumption of perfect competition is the best known example of an effective system based on assumptions known to be substantially false.
Introductory economics classes usually spend several weeks of class time arguing the merits of building models on somewhat false assumptions. This paper will now use such a somewhat false model to convey a feel for why mandatory pairing of name components causes problems.

* Assume the user's information from which he tries to construct a description will be some completely random subset of the information about the object. (Some of that information will be structural, and the structural fragments selected will be just as random as the rest.)
* Assume a user has 15 random clues of information selected from 300 pieces of information the system knows about some object.
* Assume the REGRES naming system requires that data be supplied in threesomes (perhaps column name, key name, relation name), and cannot use one member of a threesome without the other members of the threesome.
* Assume the ANARCHY naming system lacks this restriction, but does so at the cost that it can only use those 10 of the 15 information fragments which do not embody structure.
* Assume the statistical distributions of the 15 pieces of information the user has to construct a name with are fully independent and equally likely (this is both substantially wrong, and unfair to REGRES, but....)
* Assume each clue has a selectivity of 100 (it divides the number of objects returned by 100).

Then ANARCHY has a selectivity of 100<sup>10</sup> = 10<sup>20</sup> (good). REGRES has a selectivity of:

100<sup>(chance that the other two members of an object's threesome are possessed by the user × 15)</sup> = 100<sup>(9/300 × 8/300 × 15)</sup> = 1.06 (very bad)

While it is not true that the clues are fully independent, it is true that to the extent that they are not fully dependent, ANARCHY will gain in selectivity compared to REGRES.
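The arithmetic above can be reproduced directly; the figures (15 clues, 300 facts, selectivity 100 per clue, REGRES threesomes) all come from the text, and ANARCHY and REGRES are the paper's hypothetical systems, not real software.

```python
# ANARCHY: 10 structure-free clues usable, each with selectivity 100.
anarchy = 100 ** 10

# REGRES: expected number of usable clues is the chance that a clue's
# two threesome partners are also among the user's clues (approximated
# in the text as 9/300 and 8/300), times the 15 clues held.
usable = (9 / 300) * (8 / 300) * 15   # = 0.012 expected usable clues
regres = 100 ** usable

print(anarchy)           # 10^20
print(round(regres, 2))  # 1.06
```

The gap of twenty orders of magnitude is the point: mandatory pairing wastes almost all of the user's partial knowledge.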
Attempting to quantify for any database the extent of the dependence would be a nightmare, and so this model assumes a substantial falsity, through which it is hoped the reader can see a greater truth. For databases of the lower heterogeneity and complexity that the relational model was designed for, the independence within a threesome can be small, and the ability to also employ the 5 of 15 fragments which are structural is often more important than the difficulty of guessing any structure added. There is an implicit assumption here that you are looking for information that others have structured, and this argument in favor of ANARCHY becomes much less strong without this assumption. I feel obligated to stress once again that I do not advocate low structure over high structure, but I do advocate having the flexibility to match the amount of structure to the needs of the moment. Only with such flexibility can one hope to use all of the 15 fragments that happen to be possessed.

== The Syntax In More Detail ==

What's needed is a naming system intended to reflect just the structure inherent in the information, whatever that structure might be, rather than restructuring the information to fit the naming system.

=== Orthogonal or Unoriginal Primitives and Features ===

There are many primitives that the ultimate naming system would include but which I will not discuss here: macros, OR, weight for subnames and AND-OR connectors [Fox], rules, constraints, indirection, links, and others. I have tried to select only those aspects in which my approach differs from the standard approach. Unifying the namespace does not require unifying automatic name generation, and those who read the [Blair] vs. [Salton] controversy likely understand my concluding that whatever the benefits might be of unifying automatic name generation, it is not feasible now, and won't be feasible for a long time to come.
The names one can assign an object are kept completely orthogonal from the contents of the object in the implementation of this naming layer. It is up to the owner of the object to name it, and it is up to him to use whatever combination of autonaming programs and manual naming best achieves his purpose. He may name it on object creation, and he may continually adjust its various names throughout its lifetime. See the section defining the "Key_Object primitive" for a discussion of why names should be thought of this way. Technically, object creation only requires the object be given a Storage_Key. In practice, most users will, in the same act that creates the object, also associate the object with at least one name that will spare them from directly specifying the Storage_Key in hex the next time they make a reference to it. Applications implementing external name spaces can interact with the storage layer by referencing just the Storage_Key.

Namesys will provide a manual naming interface, and the API autonaming programs need to plug into. Companies such as Ecila will provide autonamers for various purposes. Ecila is implementing a program which scans remote stores, creates links to them in the unified name space, but leaves the data on the remote stores. Other programs may also be implemented to perform this general function. To be more specific, the Ecila search engine scans the web for documents in French, and uses the filesystem as an indexing engine. However, they are writing their engine to be general purpose; they have sold support and the addition of extensions to other search engine companies, and it is open source. For now we are simply functioning as part of their engine, and the interface is by web browser: at some point we may be able to add their functionality to the namespace.
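The orthogonality of names and contents described above can be sketched as follows. This is a toy model under stated assumptions: every identifier here (<tt>create_object</tt>, <tt>bind_name</tt>, the dictionaries) is invented for illustration and is not a ReiserFS or Namesys interface.

```python
import secrets

objects = {}   # storage_key (hex string) -> contents
names = {}     # name -> storage_key

def create_object(contents):
    """Creation requires only a Storage_Key, never a name."""
    key = secrets.token_hex(8)   # stand-in for a real Storage_Key
    objects[key] = contents
    return key

def bind_name(name, key):
    """Naming is a separate, optional act, repeatable at any time."""
    names[name] = key

# The usual case: create and name in one act, so the hex key
# never has to be typed again.
key = create_object(b"hello")
bind_name("greeting", key)
print(objects[names["greeting"]])
```

An autonamer would simply be another caller of <tt>bind_name</tt>, adjusting an object's names over its lifetime without touching its contents.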
While the implementation of Microsoft's attempt to blur the distinction between the filesystem name space and the web namespace is one more of appearance than substance, it is surely the right thing to do for Linux as well in the long run. We should simply make our integration one with substance and utility, rather than integrating mostly the look and feel. When the store is external to the primary store for the namespace, then stale names can be an issue with no clean resolution. That said, unification at just the naming layer is, in a real rather than ideal world, often quite useful, and so we have Internet search engines.

GUI based naming is beyond the scope of this paper, except to mention that it is common for GUI namespaces to be designed such that they are not well integrated with the other namespaces of the OS. They are often thought to necessarily be less powerful, but proper integration would make this untrue, as they would then be additional syntaxes, not substitutes. These additional syntaxes should possess closure within the general name space, and thereby be capable of finding employment as components of compound names like all the other types of names. The compound names should be able to contain both GUI and non-GUI based name components. Integration would make them simply the aspect of naming that applies to what is present in the visual cache of the screen, and to how to manage and display that cache most effectively.

=== Vicinity Set Intersection Definition (Also Called Grouping) ===

Suppose you have a set X of objects. Suppose some of these objects are associated with each other. You can draw them as connected in a graph. Let the vicinity of an object A be the set of objects associated with A. Let there be a set of query objects Q. Then the set vicinity intersection of Q is the set of objects which are a member of all vicinities of the objects in Q. When thinking of this as a data model, it seems natural to use the term vicinity set intersection.
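The definition above translates almost word for word into code. A minimal sketch, with a made-up association graph; the names and data are illustrative only.

```python
# vicinity[A] = the set of objects associated with A
vicinity = {
    "dragon":  {"Hobbit", "Beowulf"},
    "gandalf": {"Hobbit", "LOTR"},
    "bilbo":   {"Hobbit", "LOTR"},
}

def vicinity_intersection(query):
    """Objects lying in the vicinity of *every* member of the query set Q."""
    sets = [vicinity.get(q, set()) for q in query]
    return set.intersection(*sets) if sets else set()

print(vicinity_intersection(["dragon", "gandalf", "bilbo"]))
```

Note that the query is handed over as an unordered collection: reordering it cannot change the intersection, which is exactly why the syntactic view of the same primitive is called grouping.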
When thinking of this syntactically, it seems natural to use the term "grouping", because it implies that the subnames are grouped together without the order of the subnames being significant. There is exactly one data model primitive (set vicinity intersection) possessing exactly one syntax (grouping), and I rarely intend to distinguish data model primitive from syntax primitive (I can be criticized for this), and yet I use both terms for it, forgive me. === Synthesizing Ordering and Grouping === I am going to describe a toy naming system that allows focusing on how best to combine grouping and ordering into one naming system. This synthesis will contain the core features of the hierarchical, keyword, and relational systems as functional subsets. It consists of a few simple primitives, allowed to build on each other. It sets the discussion framework from which our project will over many years evolve a real naming system out of its current storage layer implementation. Resolving the second component of an ordering is dependent on resolving the first --- unlike set theory. In set theory one can derive ordered set from unordered set, but because resolving the name of the second component depends on the first component one cannot do so in this naming system. For this reason it can well be argued that this naming system is not truly set theory based. Now that I have mentioned this difference I will start to call them grouping and ordering, rather than unordered and ordered set. These two primitives take other names as sub-names, and allow the user to construct compound names. Either the order of the subnames is significant (ordering), or it isn't (grouping), and thus we have the two different primitives. Because I have myself found that BNFs are easier to read if preceded by examples, I will first list progressively more complex examples using the naming system, and then formally define it.
The examples, and the simplified syntax, use / rather than : or \, but this is of no moment. Examples <tt>/etc/passwd</tt> [[Image:Passwd.jpg]] Ordering and grouping are not just better; file system upward compatibility makes them cheaper for unifying naming in OSes based on hierarchical file systems than a relational naming system would be. This approach is fully upwardly compatible with the old file system. Users should be able to retain their old habits for as long as they wish, engage in a slow comfortable migration, and incorporate the new features into their habits as they feel the desire. Elderly programs should be untroubled in their operation. Many worthwhile projects fail because they emphasize how much they wish to change rather than asking of the user the minimal collection of changes necessary to achieve the added functionality. [dragon gandalf bilbo] [[Image:Bilbo.jpg]] FIGURE 3. Graphical representation of ascii name on left Mr. B. Bizy is looking for a dimly remembered story (The Hobbit by Tolkien) to print out and take with him for rereading during the annual company meeting. case-insensitive/[computer privacy laws] [[Image:Syntax-barrier.jpg]] FIGURE 4. Graphical representation of ascii name on left When one subname contains no information except relative to another subname, and the order of the subnames is essential to the meaning of the name, then using ordering is appropriate. This most commonly occurs when syntax barriers are crossed. This is when a single compound name makes a transition from interpreting a subname according to the rules of one syntax to interpreting it according to the rules of another. Ordering is essential at the boundary between the name of the new syntax as expressed in the current syntax, and the name to be interpreted according to that new syntax. Some researchers use the term context rather than syntax. The pairing of a program or function name and the arguments it is passed is inherently ordered.
While that is usually the concern of the shell, when we use a variety of ordering functions to sort Key_Objects of different types it affects the object store. In this example the ordering serves as a syntax barrier. Case-insensitive is the unabbreviated name of a directory that ignores the distinction between upper and lower case. For Linux compatibility this naming layer is case sensitive by default, even though I agree with those who think that it would be better were it not. [my secrets] / [love letter susan] [[Image:My-secrets.jpg]] FIGURE 5. Graphical representation of ascii name on left Devhuman (that's the account name he chose) is the company's senior programmer. Six years ago he wrote a love letter to Susan, which he put in his read-protected secrets directory. (He never found the nerve to send it to her.) He's looking for it so he can rewrite it, and then consider sending it. Security is a particular kind of syntax barrier (you have to squint a bit before you can see it that way). Here the ordering serves as a security barrier. (He certainly wouldn't want anyone to know that an object owned by him with the attributes love letter susan existed.) [subject/[illegal strike] to/elves from/santa document-type/RFC822 ultimatum] [[Image:Ultimatum.jpg]] FIGURE 6. Graphical representation of search for santa's ultimatum Devhuman knows his object store cold. He is looking for something he saw once before, he knows that it was auto-named by a particular namer he knows well (perhaps one whose functionality is similar to the classifier in [Messinger]), and he knows just what categorizations that namer uses when naming email. Still, he doesn't quite remember whether the word 'ultimatum' was part of the subject line, the body, or even was just elvish manual supplementation of the automatic naming. Rather than craft a query carefully specifying what he does and does not know about the possible categorizations of ultimatum, he lazily groups it.
If Devhuman's object store is implemented using this naming system with good style, someone less knowledgeable about the object store would also be able to say: [santa illegal strike ultimatum elves] and perhaps get some false hits as well as the desired email (instead of finding mail from santa perhaps finding the elvish response). Notice that if you delete the 'illegal' and 'ultimatum' to get [subject/strike to/elves from/santa document-type/RFC822] the query is structurally equivalent to a relational query. Many authors (e.g. semantic database designers) have written papers with good examples of standard column names which might be worth teaching to users. So long as they are an option made available to the user rather than a requirement demanded of the user, the increased selectivity they provide can be helpful. [_is-a-shellscript bill] [[Image:Pruner.jpg]] FIGURE 7. Graphical representation of ascii name on left This name finds all shellscripts associated with bill. Names preceded by _ are pruners. Pruners are analogous to the predicate evaluators of relational database theory. If you have read papers distinguishing between recognition and retrieval, pruners are a recognition primitive. They are passed a list of objects, and return the subset of that list which matches some criteria. They are a mechanism appropriate for when a nonlinear search method that can deliver the desired functionality is either impossible, or not supported by existing indexes. There are many useful names for which we cannot do better than linear time search algorithms (perhaps simply as a result of incomplete indexing). _is-a-shellscript checks each member of its list to see if it is an executable object containing solely ascii. The user can use it just like any other Key_Object within an association; it will prune the results of the grouping. Since set intersections are commutative its order within the grouping has no meaning, and optimizers are free to rearrange it.
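A pruner of this kind is just a predicate applied linearly to the candidate set a grouping produces. A minimal sketch, assuming a toy in-memory object table; the object names and attribute fields are invented for illustration:

```python
# A pruner sketch: the grouping's set intersection runs first, then the
# pruner linearly filters the result. The object table and the
# _is-a-shellscript test are invented for illustration.
objects = {
    "backup.sh":  {"tags": {"bill"},  "executable": True,  "ascii": True},
    "report.pdf": {"tags": {"bill"},  "executable": False, "ascii": False},
    "run.sh":     {"tags": {"alice"}, "executable": True,  "ascii": True},
}

def group(keyword):
    """Objects whose vicinity contains the keyword."""
    return {name for name, o in objects.items() if keyword in o["tags"]}

def is_a_shellscript(candidates):
    """Pruner: keep only executable, all-ascii objects. Each object's
    membership is independent of the others, so optimizers may reorder it."""
    return {n for n in candidates
            if objects[n]["executable"] and objects[n]["ascii"]}

# [_is-a-shellscript bill]
print(is_a_shellscript(group("bill")))   # {'backup.sh'}
```

Because the pruner decides each member independently, applying it before or after other pruners yields the same result, which is what makes optimizer reordering safe.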
=== The Formal Definitions === {| border=1 | <Object Name>::= || <pre> <Grouping> | <Ordering> | <Key_Object> | <Storage_Key> | <Orthogonal and Unoriginal Primitives I Will Not Define Here> | ; </pre> |} See the section listing orthogonal and unoriginal primitives for a discussion of what primitives I left out of the definitions of this grammar that are necessary to a real world working system. The name resolver has a method for converting all of the primitives into '''<Storage_Keys>''', and when processing compound names it first converts the subnames into '''<Storage_Keys>'''; the object referred to may have null contents, and serve purely to embody structure. This allows the use, as a component of a grouping or ordering, of anything for which anyone can invent a way of letting the user find an '''<Object Name>''', together with a method for the resolver to convert that '''<Object Name>''' into a '''<Storage_Key>'''. In a word, closure. Extensible closure. Compound names are interpreted by first interpreting the subnames that they are constructed from. At each stage of subname interpretation an '''<Object Name>''' is converted into the '''<Storage_Key>''' for the object that it resolves to. The modules that implement the grouping and ordering primitives do not interpret the subnames; they merely pass them to the naming system, which returns the '''<Storage_Key>'''s they resolve to. It was a long discussion which led to the use of storage keys rather than objectids. A storage key differs from an objectid in that it gives the storage layer directions as to where to try to locate the object in the logical tree ordering of the storage layer. If the logical location changes, then in the worst case we leave a link behind, and get an extra disk access like we get with an inode.
(Inode numbers are functionally objectids.) In the better case, the repacker eventually comes along, and changes all references by key to the new location, at least for all objects that have not given their key to external naming systems the repacker cannot repack. A '''<Storage_Key>''' is assigned by the system at object creation, and serves the purpose of allowing the system to concisely name the object, and provide hints to the storage layer about which objects should be packed near each other. The user does not directly interact with the '''<Storage_Key>''' any more often than C programmers hardcode pointers in hex. The packing locality of keys may be redefined. == The Primitives == <Key_Object> A description of the contents of an object using the syntax of the current directory. For objects used to embody keywords this may be the keyword in its entirety. If it contains spaces, etc. it must be enclosed in quotes. Note that making it easy for third parties to add plug-in directory types is part of Namesys's current contract with Ecila. Ecila wants space efficient directories suitable for use in implementing a term dictionary and its postings files for their Internet search engine. Example: [reindeer chimneys presents man] In this example, 'presents', 'reindeer', 'chimneys', and 'man' are the contents of objects associated with the Santa Claus story. Each of them is searched for by contents, and then when found they are converted into their Storage_Keys, and then the grouping algorithm is fed their Storage_Keys. The grouping module then looks in the object headers of those objects, gets the sets of objects the Key_Objects group to, and performs a set intersection.
Besides greater closure, another advantage of storing Key_Objects as objects is that non-ascii Key_Objects and ordering functions can be implemented as a layer on top of the ascii naming system, allowing the user to interact with the naming system by pressing hyperbuttons, drawing pictures, making sounds, and supplying other non-ascii Key_Objects that the higher layers convert into Storage_Keys. There are endless content description techniques. If the directory owner supplies an ordering function for the Key_Objects in a directory, one can generate a search index for the directory using a directory plug-in which is fully orthogonal to the ordering function, though perhaps slower in some cases than one that is tailored for the ordering function. Users will find it easier to write ordering functions than index creation objects, and will not always need the speed of specialized indexes. We will need one ordering function for ascii text, another for numbers, another for sounds, perhaps someday one even for pictures of faces (perhaps to be used by a law enforcement agency constructing an electronic mug book, or a white pages implementation), etc. No system designer can provide all the different and sometimes esoteric ordering functions which users will want to employ. What we can do is create a library of code from which users can construct their own ordering functions and their own directory plug-ins, and this is the approach we are taking on behalf of Ecila. For an Internet search engine one wants what is called a postings file, which is like a directory in that there is no need to support a byte offset, and one frequently wants to efficiently perform insertions into it. <Grouping> ::= [<Unordered List>] ; <Unordered List> ::= <Unordered List> <Unordered List> |<Object Name> |<Pruner> ; <Pruner> ::= _<Object Name> A <Grouping> is a list of object names and pruners whose order has no meaning.
Every object has a list of objects it groups to (associates with, in neural network idiom) in its object header. A grouping is interpreted by performing a set intersection of those lists for every object named in the grouping; in data model terms, this is a set vicinity intersection. Grouping is not transitive: [A] => B and [B] => C does not imply [A] => C, though it does imply that [[A]] => C. A pruner is an <Object Name> which has been preceded with an _ to indicate that the object described should be passed a list of objects named by the rest of the grouping, executed, and it will return a subset of the list it was passed. Whether a member of the set is in the returned subset must be fully independent of what the other members of the set were, or else the results become indeterminate after application of a query optimizer, as with an optimizer in use there is no guarantee provided of the order of application of the pruners. <Ordering> ::= <Object Name>/<Object Name> | <Object Name>/<Custom Programmed Syntax> <Custom Programmed Syntax> ::= Varies, provides extensibility hook. An ordering is a pairing of names, with the order representing information. The first component of the ordering determines the module to which the second component is passed as an argument. In contrast, a grouping first converts all subnames to Storage_Keys by looking through the same current directory for all of them in parallel, and then does its set intersection with the subdescriptions already resolved. Example: In resolving [my secrets] / [love letter susan] the system would look for the objects with contents my and secrets, find both of them, and do a set intersection of all the objects those two objects both group to (are associated with). This will allow it to find the [my secrets] directory, inside of which it will look for the three objects love, letter, and susan.
It will then extract from their object headers the sets of objects those three words ('love', 'letter', and 'susan') group to, and do a set intersection which will find the desired letter. The desired letter is not necessarily inside the [my secrets] directory, though in this case it probably is. A directory is an object named by the first component of an ordering, to which the second component is passed, and which returns a set of Storage_Keys. One can in principle use different implementations of the same directory object without impacting the semantics and only affecting performance, as is often done in databases. There are flavors of directories: Custom programmed directories, aka filters, are any executable program that will return a Storage_Key when executed and fed the second component as an argument. They provide extensibility. (They are the ordered counterpart of pruners.) Another term for them is filter directories. Custom programmed directories whose name interpretation modules aren't unique to them will contain just the name of the module (filter), plus some directory dependent parameters to be passed to the module. It should be considered merely a syntax barrier directory, and not a fully custom programmed directory, if those parameters include a reference to a search tree that the module operates on, and if that search tree adheres to the default index structure. The connotations conveyed by the term 'filter' of there being an original which is distorted are not always appropriate, but in honesty this is not an issue about which we deeply care. Syntax barrier directories allow you to describe the contents of the objects they contain with a syntax different from that of their parents. Except for being sorted by a different ordering function, the indexes of syntax barrier directories are standard in their structure, and use a standard index traversal module. The index traversal module is ordering function independent.
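All of these directory flavors share one contract: the first component of an ordering names an object whose lookup module is handed the second component. A minimal sketch of that resolution loop, under that assumption; the Directory class, entries, and storage key are all hypothetical:

```python
# Sketch of resolving an ordering: each component after the first is
# passed to the directory object the previous component resolved to.
# The Directory class and the storage key value are hypothetical.
class Directory:
    def __init__(self, entries):
        self.entries = entries            # subname -> Directory or storage key

    def lookup(self, name):
        return self.entries[name]

root = Directory({
    "my secrets": Directory({"love letter susan": "key:0x2a"}),
})

def resolve(components):
    node = root
    for c in components:                  # order matters: resolving the
        node = node.lookup(c)             # second depends on the first
    return node

print(resolve(["my secrets", "love letter susan"]))   # key:0x2a
```

Note how the dependence of each step on the previous one is exactly why ordering cannot be reduced to a set-theoretic primitive: the second subname has no meaning until the first has resolved.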
There must be an ordering function for every <Key_Object> employed within a given syntax barrier directory. By contrast, a <Custom Programmed Syntax> could be anything which the syntax module somehow finds an object with, possibly even creating the object in order to be able to find it. To cross a security barrier directory the user must use an ordered pair of names with the security barrier as the first member of the pair, and he must satisfy the security module of the secured directory. A security barrier directory may be both a security and a syntax barrier directory, or the security barrier directory may share the syntax module of its parents. Fully standard directories are those built using the default directory module, and adding structure is their only semantic effect. There is an aspect of customization which is beyond the scope of this paper, in which one customizes the items employed by the storage layer to implement files and directories. That is, the storage of the files and directories are implemented by composing them of items, and these items have different types. We are now creating the code for packing and balancing arbitrary types of items using item handlers and object oriented balancing code, so as to make it easier to extend our filesystem. === Ordering can be implemented more efficiently than grouping === The set intersections performed in evaluating the grouping primitive are normally much more expensive computationally than performing the classical filesystem lookup. Imposing excess structure on one's data does not just at times reduce the cost of human thinking :-), it can be used to reduce the cost of automated computation as well. When the cost to a user of learning structure is less important than the burden on the machine, use of highly ordered names is often called for. === The Motivation for Different Syntactic Treatment of Ordering and Grouping, and Some of the Deeper Issues Revealed by the Difference. 
=== An important difference between grouping and ordering affects syntax. It allows us to represent an ordering with a single symbol ('/') placed between the pair, but requires two symbols ('[' and ']') for each grouping. Imagine using < and > as a two symbol delimiter style alternative notation for ordering: <<father-of mother-of> sister-of> = <father-of <mother-of sister-of>> = <father-of mother-of sister-of> = father-of/mother-of/sister-of All of the expressions above are equivalent in referring to the paternal great aunt of the person who is the current context. The ones using nested pairs of symbols to enclose pairs of subnames imply a false structure that requires the user to think to realize the first two expressions are equivalent. The fourth is the notation this naming system employs. Grouping is different: Fast Acting Freddy is looking through the All-LA Shopping Database for a single store with black reebok sneakers, a green leather jacket, and a red beret so that he can dress an actor for a part before the director notices he forgot all about him. [[black reebok sneakers] [green leather jacket] [red beret]] is not equivalent to [black reebok sneakers green leather jacket red beret] which equals [red sneakers black reebok jacket green beret] Ordering is not algebraically commutative (father-of/mother-of is not equivalent to mother-of/father-of). Groupings are algebraically commutative. ([large red] = [red large]) == Style == As a general principle, a more restricted system can avoid requiring the user to repeatedly specify the restrictions, and if the user has no need to escape the restrictions then the restricted system may be superior. This is why "4GLs", which supply the structure for the user's query, are useful for some applications. They are typically implemented as layers on top of unrestricting systems such as this one. This paper has addressed issues surrounding finding information, particularly when the user's clues are faint.
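The algebraic difference noted above (groupings commute, orderings do not) can be modeled directly: a grouping behaves like a mathematical set, an ordering like a tuple. A two-line Python illustration, purely a model of the algebra and not the resolver itself:

```python
# Model groupings as frozensets (commutative) and orderings as tuples (not).
grouping = frozenset
ordering = tuple

# [large red] = [red large]
assert grouping(["large", "red"]) == grouping(["red", "large"])
# father-of/mother-of is not mother-of/father-of
assert ordering(["father-of", "mother-of"]) != ordering(["mother-of", "father-of"])
print("both algebraic properties hold")
```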
When supporting other user goals, such as exploring information, adding structure through substantial use of ordering can be helpful. [Marchionini][McAleese]. When the user goal is finding, one should assume that of all the fragments of information about an object, the user has some random subset of them. The goal is to allow the user to use that random subset in a name, whatever that subset might be. Some of that subset will be structural fragments. While requiring the user to supply a structure fragment is as foolish as requiring him to supply any other arbitrary fragment, allowing him to is laudable. In the best of all worlds the object store would incorporate all valid possible structurings of Key_Objects. The difficulty in implementing that is obvious. [Metzler and Haas] discuss ways of extracting structure from English text documents, and why one would want to be able to use that structure in retrievals. Unfortunately, there is an important difference between representing the structure of an English language sentence in a way that conveys its meaning, and representing it in a way that allows it to be found by someone who knows only a fragment of its semantic content. I doubt the wisdom of trying to advocate the use of more than essential structure in searching. You can allow users to avoid false structure; you cannot force them to. It is important to teach those creating the structure that if they group a personnel file with sex/female they should also group it with female. Type checking can impose structure usefully. Its implementation can enhance or reduce closure, depending on whether it is done right. === When To Decompound Groupings === There are dangers in excessive compounding of compound groupings analogous to those of excessive ordering. Let's examine two examples of compound groupings, both of which are valid both semantically and syntactically. 
One of them can be "decompounded" with moderate information loss, and the other loses all meaning if decompounded. Example: Finding a loquacious Celtic textbook salesman who told you in excruciating detail about how he was an ordnance researcher until one day he went to a Grateful Dead concert. [[Celtic textbook salesman] [ordnance researcher]] vs. [celtic textbook salesman ordnance researcher] These two phrasings of the same query are not equivalent, but they are "close." Our second example is the one in which Fast Acting Freddy tries to find a suspect by the objects he is associated with: [[black reebok sneakers] [green leather jacket] [red beret]] vs. [black reebok sneakers green leather jacket red beret] These two are not at all "close." The difference between the two examples of inequivalence is that the subdescriptions within the second example describe objects whose existence within the object store, independent of the object described, is worthwhile. The subdescriptions of the first example do not, and so it is more reasonable to design so that the "decompounded" version of that query is used. False hits will occur, but for large systems that's better than asking the user to learn structure. A higher level user interface might choose to present only one level to the user at a time, and then once the user confirms that a subdescription has resolved properly it would let him incorporate it into a higher level description. There might be 6 models of [black reebok sneakers], and Fast Acting Freddy should have the opportunity to click his mouse on the exact model, and have the interface substitute that object for his subdescription. Using such an interface an advanced user might simultaneously develop several subdescriptions, refine and resolve them, and then use the mouse to draw lines connecting them into a compound grouping. Closure makes it possible for that to work.
== Examples of Creating Associations == <- creates an association between all of the objects on the left hand side and all of the objects on the right hand side. A - B is the set difference of A and B, and it resolves to the set of objects in A except for those that are in B. A & B resolves to the set intersection of A and B, the objects that are in both A and B. [A B] = [A] & [B], by definition. animal <- (lives, moves) mammal <- ([animal], animal, `warm blooded') cat <- ([mammal], hypernym/mammal, mammal, meronym/fur, fur, meronym/whiskers, whiskers, hypernym/quadruped, quadruped, capability/purr, purr, capability/meow, meow) Basil <- (owner/Nina, Nina, [siamese], siamese, clever, playful, brave/overly, brave, 'toilet explorer') bag <- ([container], container, consists-of/`highly flexible material', `highly flexible material') backpack <- ([bag], shoulderstrap/quantity/2, shoulderstrap, college-student, holonym/backpacker, meronym/shoulderstrap) mould <- ([fungi] - green/not, furry, `grows on'/surfaces/moist, `killed by'/chlorine) fungi <- ([plant], plant, leaves/no, flowers/no, green/not) bird <- ([vertebrate], vertebrate, flies, feathers) penguin <- ([bird] - flies, bird, hypernym/bird, swims, Linux, [Linux (mascot, symbol)]) siamese <- ([cat], cat, hair/short, short-hair) Notice how we don't associate siamese with short despite associating it with hair/short, but we do associate Basil with Nina as well as with owner/Nina. small <-0 little The above means that small and little are synonyms, and are to be treated as 0 distance away from each other for vicinity calculation purposes. In other, traditional Unix, words, they are hardlinked together. Creating a serious ontology is not our field or task, but it is worth doing. The reader is referred to WordNet (free), and Cyc by Doug Lenat (proprietary). While we will focus on implementing primitives that allow for creating better ontologies, we are happy to work with persons interested in contributing or porting an ontology.
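The '<-' operator above can be sketched as a dictionary of vicinity sets; only a few of the declarations are reproduced, and the storage scheme is an invented illustration, not the real storage layer:

```python
# Sketch of the '<-' association operator as a dict of vicinity sets.
# Names come from the examples above; the storage scheme is invented.
from collections import defaultdict

associations = defaultdict(set)   # name -> set of objects it groups to

def associate(obj, names):
    """obj <- (names): add obj to the vicinity of every right-hand name."""
    for n in names:
        associations[n].add(obj)

associate("cat", ["mammal", "fur", "whiskers", "purr", "meow"])
associate("Basil", ["Nina", "siamese", "clever", "playful", "brave"])
associate("siamese", ["cat", "hair/short", "short-hair"])

# small <-0 little: synonyms share one vicinity set (hardlinked together,
# in traditional Unix words).
associations["little"] = associations["small"]
associate("kitten", ["small"])

print(associations["siamese"])   # {'Basil'}
print(associations["little"])    # {'kitten'}
```

The synonym line makes "little" and "small" aliases for the same underlying set, so an association added through either name is visible through both.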
== Other Projects Seeking To Increase Closure In The OS == === AT&T's Plan 9 === [Plan 9] is being produced by the original authors of Unix at AT&T research labs. It has influenced CORBA, and /proc is a direct steal from it to Linux. Their major focus is on integration. Their major trick for increasing integration is unifying the name space. Name spaces integrated into the Plan 9 file system include the status, control, virtual memory, and environment variables of running processes. They have a hierarchical analog to what the relational culture calls constructing views, which the Plan 9 culture calls context binding. === Microsoft's Information At Your Fingertips === Plan 9 ignores integration of application program name spaces, concentrating on OS oriented name spaces. Microsoft's "Information at Your Fingertips" name space integration effort appears to be taking the other approach, and focusing on integrating the name spaces of the various Microsoft applications via OLE and Structured Storage. The application group at Microsoft has long been better staffed and funded than the OS group, and FS developers have long preferred to simply ignore the needs of application builders generally. The primary semantic disadvantages of Microsoft's approach are primitives selected with insufficient care, a lack of closure, and the use of an object oriented rather than set oriented approach in both naming syntax and data model. Realistically, one can say that folks within Microsoft have often made statements favoring name space integration, and in various areas have successfully executed on it, but on the whole I rather suspect that the lack of someone in marketing making a business case for $X in revenue resulting from name space integration has crippled name space integration work at commercial OS producers generally, including MS. ==== Internet Explorer ==== Internet Explorer attempts to unify the filesystem and Internet namespaces.
At the time of writing, the unification is so superficial, with so little substance, that I would describe it as having the look and feel of integration without most of the substance. Perhaps this will change. ==== Microsoft's Well Known Performance Difficulties ==== Despite having many of the leading names in the industry on their payroll, they have somehow managed to create a file system implementation with performance so terrible that it is for the Unix customer base a significant consideration contributing to hesitation in moving to NT. It may well have the worst performance of any of the major OS file systems. Their implementation of OLE's structured storage offers extremely poor performance, and their excuse that it is due to the incorporation of transaction concepts into their design is just a reminder that they did a poor job at that as well. They managed to implement something intended to store small objects within a file, and implement it such that it still suffers from 512-byte granularity problems, problems that they try to somewhat overcome by encouraging the packing of several objects within "storages" at horrible kludge cost. === Storage Layers Above the FS: A Sure Symptom The FS Developer Has Failed === When filesystems aren't really designed for the needs of the storage layers above them, and none of them are, not Microsoft's, not anybody's, then layering results in enormous performance loss. The very existence of a storage layer above the filesystem means that the filesystem team at an OS vendor failed to listen to someone, and that someone was forced to go and implement something on their own. You just have to listen to one of these meetings in which some poor application developer tries to suggest that more features in the FS would be nice; I heard one at a nameless OS vendor.
The FS team responds to say disks are cheap, small object storage isn't really important, we haven't changed the disk layout in 10 years, and changing it isn't going to fly with the gods above us about whom we can do nothing. At these meetings you start to understand that most people who go into filesystem design are persons who didn't have the guts to pursue a more interesting field in CS. There is a sort of reverse increasing returns effect that governs FS research, in which the more code becomes fixed on the current APIs, the more persons in the field react with fear to any thought of the field of FS semantics being other than a dead research topic, the less research gets done, and the fewer persons of imagination see a reason to enter the field. Every time one vendor gets a little forward in adding functionality, the other vendors go on a FUD campaign about it breaking standards and therefore being dangerous for mission critical usage. This is a field in which only performance research is allowed, and every other aspect is simply dead. Namesys seeks to raise the dead, and is willing to commit whatever unholy acts that requires. There is no need for two implementations of the set primitive, one called directories, the other called a file with streams, each having a different interface. File systems should just implement directories right, give them some more optional features, and then there is no need at all for streams. If you combine allowing directory names to be overloaded to also be filenames when acted on as files, allowing stat data to be inherited, allowing file bodies to be inherited, and implementing filters of various kinds, then in the event that the user happens to need the precise peculiar functionality embodied by streams, they can have it by just configuring their directory in a particular way. There was a lengthy Linux-kernel thread on this topic which I won't repeat in more detail here.
The tree architecture of the storage layer of this FS design will lend itself to a distributed caching system much more effectively than the Microsoft storage layer, in part due to its ability to cache not just hits and misses of files, but to cache semantic localities (ranges). For more on this topic see later in this paper.

=== Rufus ===

The Rufus system [Messinger et al.] indexes information while leaving it in its original location and format. While it does allow the user to create a unified name space, it does not choose to integrate that name space into the operating system. Even so, it is immensely useful in practice, and strongly hints at what the OS could gain if it had a more than hierarchical name space with a data model oriented towards what [Messinger] calls "semi-structured information", such as you find in the RFC822 format for email. When you have 7000 pieces of mail, and linearly searching the mail with a utility like grep takes 10 minutes, it is nice to be able to quickly keyword search via inverted indexes for the mail whose from: field contains billg and that has the words "exclusive" and "bundling" in the body of the message, as you hurriedly search for an old email just before an appearance in court.

=== Semantic File System ===

The Semantic File System comes closest to addressing the needs I have described. It is a Unix compatible file system with more than hierarchical naming (attribute based is the term they use). Its data model unfortunately has the important flaw of lacking closure (in it, names of objects are not themselves objects). In my upcoming discussion of the unnecessary lack of closure in hypertext products, notice that the arguments apply to the Semantic File System (and so I won't duplicate them here).

=== OS/400 ===

IBM's OS/400 employs a unified relational name space. The section of this paper entitled A System Should Reflect Rather than Mold Structure will cover its problems of forcing false structure.
Inadequate closure due to mandatory type checking is another source of difficulties for it. While users moan about these two unnecessary design flaws, the essence of the opinions AS/400 partisans have expressed to me has been that the unification of its name space is a great advantage that OS/400 has over Unix. I claim these users were right, and later in this paper will propose doing something about it.

== Conclusion ==

While I spent most of this paper on why adding structure to information can be harmful, particularly when it is intended to be found by others sifting through large amounts of other information, this was purely because it is a harder argument than why deleting structure is harmful. My goal was not to be better at unstructured applications than keyword systems, or better at structured applications than the hierarchical and relational systems --- the goal is to be more flexible in allowing the user to choose how structured to be, while still being within a single name space. I claimed that multiple fragmented name spaces cannot match the power and ease of name spaces integrated with closure: closure makes a naming system far more powerful by increasing its ability to compound complex descriptions out of simpler ones. The strong points of this naming system's design are various forms of generalizing abstractions already known to the literature, for greater closure.

== Acknowledgments ==

David P. Anderson and Clifford Lynch helped enormously in rounding out my education, and improving my paper. Their generosity with their time was remarkable. David P. Anderson was simply a great professor, and it was a privilege to work with him. Brian Harvey informed me that it wasn't too obvious to mention that an object store should be unified. Cimmaron Taylor provided me with many valuable late night discussions in the early stages of this paper.
I would like to thank Bill Cody and Guy Lohman of the database group at the IBM Almaden Research Center for a wonderful learning experience. Vladimir Saveliev kept this file system going when others fell by the wayside. He started as the most junior programmer on the team, and through sheer hard work and dedication to excellence outshone all the other more senior researchers. Of course after some time he could no longer be considered a junior programmer.

NOTE: See also the DARPA funded, but not endorsed, Reiser4 Transaction Design Document and Reiser4 Whitepaper.

== References ==

1. Blair, David C. and Maron, M. E. "Evaluation of Retrieval Effectiveness for a Full-Text Document-Retrieval System", Communications of the ACM, v28 n3, Mar 1985, pp. 289-299

2. Codd, E. F. "The Relational Model for Database Management: Version 2", c1990, Addison-Wesley Pub. Co. Not recommended as a textbook, Date's is better for that, but worthwhile if you want a long paper by Codd. Notice that he places greater emphasis on closure, and design methodology principles in general, than designers of other naming systems such as hypertext.

3. Date, C. J. "An Introduction to Database Systems", 4th ed., Reading, Mass.: Addison-Wesley Pub. Co., c1986. Contains a well written substantive textbook sneer at the problems of hierarchical naming systems, and a well annotated bibliography.

4. Curtis, Ronald and Larry Wittie. "Global Naming in Distributed Systems", IEEE Software, July 1984, pp. 76-80

5. Feldman, Jerome A., Mark A. Fanty, Nigel H. Goddard and Kenton J. Lynne. "Computing with Structured Connectionist Networks", Communications of the ACM, v31, Feb '88, p170(18)

6. Fox, E. A., and Wu, H. "Extended Boolean Information Retrieval", Communications of the ACM, 26, 1983, pp. 1022-1036

7. Gallant, Stephen I. "Connectionist Expert Systems", Communications of the ACM, v31, Feb '88, p152(18)

8. Gates, Bill. Comdex '91 speech on "Information at Your Fingertips", available for $8 on videotape from Microsoft's sales department.

9. Gifford, David K., Jouvelot, Pierre, Sheldon, Mark A., O'Toole, James W. Jr. "Semantic File Systems", Operating Systems Review, Volume 25, Number 5, October 13-16, 1991. They demonstrated that extending Unix file semantics to include nonhierarchical features is useful and feasible. Unfortunately, their naming system lacks closure.

10. Gilula, Mikhail. "The Set Model for Database and Information Systems", 1st Edition, c1994, Addison-Wesley. Provides a set theoretic database model in which relational algebra is shown to be a special case of a more general and powerful set theoretic approach.

11. Joint Object Services Submission (JOSS), OMG TC Document 93.5.1

12. Marchionini, Gary, and Shneiderman, Ben. "Finding Facts vs. Browsing Knowledge in Hypertext Systems", Computer, January 1988, p. 70

13. McAleese, Ray. "Hypertext: Theory into Practice", edited by Ray McAleese, ABLEX Publishing Corporation, Norwood, NJ 07648

14. Messinger, Eli, Shoens, Kurt, Thomas, John, Luniewski, Allen. "Rufus: The Information Sponge", Research Report RJ 8294 (75655), August 13, 1991, IBM Almaden Research Center

15. Metzler and Haas. "The Constituent Object Parser: Syntactic Structure Matching for Information Retrieval", Proceedings of the ACM SIGIR Conference, 1989, ACM Press

16. Nelson, T. H. "Literary Machines", self published by Nelson, Nashville, Tenn., 1981. Did much to popularize hypertext; at the time of writing he has still not released a working product, though competitors such as HyperCard have done so with notable success.

17. Mozer, Michael C. "Inductive Information Retrieval Using Parallel Distributed Computation", ICS Report No. 8406, June 1984

18. Pike, Rob and P. J. Weinberger. "The Hideous Name", AT&T Research Report

19. Pike, Rob, Presotto, Dave, Thompson, Ken, Trickey, Howard, Winterbottom, Phil. "The Use of Name Spaces in Plan 9", available via ftp from att.com. Plan 9 is an operating system intended to be the successor to Unix, and greater integration of its name spaces is its primary focus.

20. Potter, Walter D. and Robert P. Trueblood. "Traditional, semantic, and hyper-semantic approaches to data modeling", Computer, v21, '88, p53(11)

21. Rijsbergen, C. J. van. "Information Retrieval", 2nd ed., Butterworth and Co. Ltd., 1979. Printed in Great Britain by The Whitefriars Ltd., London and Tonbridge

22. Salton, G. (1986). "Another Look at Automatic Text-Retrieval Systems", Communications of the ACM, 29, 648-656

23. Smith, J. M. and D. C. Smith. "Database Abstractions: Aggregation and Generalization", ACM Transactions on Database Systems, June 1977, pp. 105-133

24. http://www.win.tue.nl/~aeb/partitions/partition_types.html

[[category:Reiser4]]

= The Naming System Venture =

== Abstract ==

For too long the file system has been semantically impoverished in comparison with database and keyword systems. It is time to change! The current lack of features makes it much easier to use the latest set theoretic models rather than older models of relational algebra or hypertext. The current FS syntax fits nicely into the newer model. The utility of an operating system is more proportional to the number of connections possible between its components than it is to the number of those components. Namespace fragmentation is the most important determinant of that number of possible connections between OS components. Unix at its beginning increased the integration of I/O by putting devices into the file system name space. This is a winning strategy: let's take the file system name space, and one-by-one eliminate the reasons why the filesystem is inadequate for what other name spaces are used for, one missing feature at a time.
Only once we have done so will the hobbles be removed from OS architects, or even OS conspiracies. Yet before doing that, we need a core architecture for the semantics to ensure we end up with a coherent whole. This paper suggests a set theoretic model for those semantics. The relational models would at times unacceptably add structure to information, the keyword models would at times delete structure, and purely hierarchical models would create information mazes. Reworking their primitives is required to synthesize the best attributes of these models in a way that allows one the flexibility to tailor the level of structure to the need of the moment. The set theoretic model I propose has a syntax that is upwardly compatible with Linux, MacOS, and DOS file system syntax, as well as upwardly compatible with the CORBA naming layer. This is a planning document for the next major version of ReiserFS, that is, a description of vaporware. It is useful to ReiserFS users and contributors who want to know where we are going, and why we are building all sorts of strange optimizations into the storage layer (and especially those who are willing to help shape the vision in the course of discussions on the {{listaddress}} mailing list....). Currently the storage layer for ReiserFS is working and useful as an everyday FS with conventional semantics. That storage layer is available as a GPL'd Linux kernel patch.

== Introduction ==

Many OS researchers have built hierarchical name spaces that innovate in their effect on the integration of the operating system (e.g. Plan 9 and its file system [Pike]). Relational and keyword researchers rightfully scorn hierarchical name spaces as 20 years behind the state of the art [Date], but pay little attention to integration of the operating system as a design objective in their own work, or as a possible influence on data model design. I won't go into that here. Limiting associations to single key words is an unnecessary restriction.
== A Naming System Should Reflect Rather than Mold Structure ==

The importance of not deleting the structure of information is obvious; few would advocate using the keyword model to unify naming. What can be more difficult to see is the harm from adding structure to information; some do recommend the relational model for unifying naming (e.g. OS/400). By decomposing a primitive of a model into smaller primitives one can end up with a more general model, one with greater flexibility of application. This is the very normal practice of mathematicians, who in their work constantly examine mathematical models with an eye to finding a more fundamental set of primitives, in hopes that a new formulation of the model will allow the new primitives to function more independently, and thereby increase the generality and expressive power of the model. Here I break the relational primitive (a tuple is an unordered set of ordered pairs) into separate ordered and unordered set primitives. Relational systems force you to use unordered sets of ordered pairs when sometimes what you want is a simple unordered set. Why should a naming system match rather than mold the structure of information? For systems of low complexity, the reasons are deeply philosophical, which means uncompelling. And for multiterabyte distributed systems?... Reiser's Rule of Thumb #2: The most important characteristic of a very complex system is the user's inability to learn its structure as a whole. We must avoid adding structure, or guarantee that the user will be informed of all structure relevant to his partial information. Avoiding adding structure is both more feasible and less burdensome to the user. Hierarchical, relational, semantic, and hypersemantic systems all force structure on information, structure inherent in the system rather than the information represented.
If a system adds structure, and the user is trying to exploit partial knowledge (such as a name embodies), then it inevitably requires the user to learn what was added before he can employ his partial knowledge. With complex systems, the amount added is beyond the capacity of users to learn, and information is lost. Example: <tt>"My name is Kali, your friendly whitepaper.html technical support specialist for REGRES. Our system puts the Library of Congress online! How may I help you."</tt> George doesn't know Santa Claus' name: <tt>"I'm trying to find the reindeer chimneys christmas man, and I can't get your system to do it."</tt> [[Image:Reindeer.jpg]] FIGURE 1. Graphical representation of a typical simple unordered set that is difficult for relational systems. Kali says: <tt>"OK, now let's define a query.'''is-a equals man''', that's easy. But reindeer? Is reindeer a property of this man?"</tt> <tt>"Uh no. I wish I could remember the dude's name. I read this story about him a long time ago, and all I can remember is that he had something to do with reindeer and chimneys. The story is on-line, somewhere."</tt> <tt>"Reindeer chimneys presents man, that's the sort of speech pattern I'd expect from a three year old."</tt> Kali corrects him. <tt>"Let's see if we can structure this properly. Is reindeer an '''instance-of''' of this man? A '''member-of''' of this man? It couldn't be a '''generalization''' of this man. Hmm..."</tt> <tt>"No! It's not that complicated. They just have something to do with him."</tt> <tt>"Pavlov would probably say you associate reindeer with this man, the way the unstructured mind of an animal thinks. But here in technical support we try to help our customers become more sophisticated. Is reindeer a property of this man?"</tt> <tt>"No. Try '''propulsion-provider-for'''."</tt> <tt>"Do you think that that was the schema the person who put the information in our system used?"</tt> <tt>"No. Shoot. 
I can think of a dozen different columns it could be under. But what are the chances that the ones I think of are going to be the same as the ones the dude who put the information in used?"</tt> Kali feels satisfaction. <tt>"Guess it can't be done, not if you can't structure your REGRES query properly. I'll put you down in my log as a closed ticket, 190 seconds to resolution, not bad."</tt> <tt>"A keyword system could handle reindeer chimneys christmas man."</tt> George grumbles as he stares in despair at his display. Unfortunately, the ''Library of Congress'' is only one of REGRES' many reference aids. George could spend his life at it, and he'd never learn its schema. <tt>"But a keyword system would delete even necessary structure inherent to the information. It couldn't handle our other needs!"</tt> Kali says before she hangs up. In addition to the searcher's difficulties, having to manufacture structure by specifying the column for reindeer also adds unnecessary cognitive load to the story author's indexing tasks. == A Few of the Other Approaches to This Problem == There is lurking at the heart of my approach a subtle difference between my analysis of naming, and the analysis of at least some others. I started my research by systematically categorizing the different structures embodied by names, placing them into equivalency classes, and then picking one syntax out of each class of functionally equivalent naming structures, on the assumption that each of the equivalency classes has value. For example, I considered that languages sometimes convey structure by word endings (tags), and sometimes by word order, but while the syntax differs, the word order and word ending techniques are equivalent in their power to convey structure. In my analysis of the effect of word ordering I decided that either the ordering mattered, or it did not, and that was the basis for two different naming primitives. 
Others have instead studied the inherent structure of data, and then from that derived ways of naming. The hypersemantic system [Smith] [Potter] represents an attempt to pick a manageably few columns which cover all possible needs. Generalization, aggregation, classification, and membership correspond to the is-a, has-property, is-an-instance-of, and is-a-member-of columns, respectively. The minor problem is that these columns don't cover all possibilities. They don't cover reindeer, presents, or chimneys for George's query. The major problem is that they don't correspond as closely as possible to the most common style of human thought, simple unordered association, and they require cognitive effort to transform. The first response of relational database researchers to this is usually to ask: "Why not modify an existing relational database to contain an 'associated' column, put everything in that column, and it would be functionally equivalent to what you want?" This is like saying that you can do everything Pascal can do using TeX macros. (They are both Turing complete.) We don't design languages to simply be Turing complete, we design them to be useful. I have seen a colleague do in six lines of SQL (nonstandard SQL) a simple three keyword unordered set query that I do in 3 words plus a pair of delimiters, and that traditional keyword systems also handle easily. Doing simple unordered sets well is crucial for highly heterogeneous name spaces, and the market success of keyword systems in Internet searching is evidence of that. If you look at the structure of names in human languages, they are not all tuple structured, and to make them tuple structured might be to distort them. I have merely discussed the burden of naming columns. Most relational systems also require the user to specify the relation name. If column naming is a burden, naming both the column and the relation is no less a burden.
Many systems invest effort into allowing you to take the key that you know, and figure out all the relation names and columns that you might choose to pair with it. This is a good idea, but not as good as not imposing extraneous structure to begin with. [Salton] can be read for devastating critiques of the document clustering system, but there is a worthwhile idea lurking within that system. Perhaps it is worthwhile to keep track of a small number of documents which are "close" to a given document. The document creator could be informed upon auto-indexing the document what other documents appear to be close to it, and asked to consider associating it with them. This is not within our current plan of work, but I don't reject it conceptually. In summary, modularity within the naming system is improved by recognizing unordered grouping and ordering as two different functions that deserve separate primitives rather than being combined into a tuple primitive. The tuple is an unordered set of ordered pairs. There are other useful combinations of unordered grouping and ordering than that embodied by the relation, and the success of keyword systems suggests that a plain unordered set without any ordering at all is the most fundamental and common of them.

== Names as Random Subsets of the Information In an Object ==

A system may still be effective when its assumptions are known to be false. You may regard the above as an overstatement of the notion that we are neural nets, and sometimes our abstract systems deal with assumptions that are not true or false, but are somewhat true. After we are finished stating them in English they lose the delicate weighting possessed by the reality of the situation. Sometimes we find it easier to model without that weighting. Classical economics and its assumption of perfect competition is the best known example of an effective system based on assumptions known to be substantially false.
Introductory economics classes usually spend several weeks of class time arguing the merits of building models on somewhat false assumptions. This paper will now use such a somewhat false model to convey a feel for why mandatory pairing of name components causes problems. Assume the user's information from which he tries to construct a description will be some completely random subset of the information about the object. (Some of that information will be structural, and the structural fragments selected will be just as random as the rest.) Assume a user has 15 random clues of information selected from 300 pieces of information the system knows about some object. Assume the REGRES naming system requires that data be supplied in threesomes (perhaps column name, key name, relation name), and cannot use one member of a threesome without the other members of the threesome. Assume the ANARCHY naming system lacks this restriction, but does so at the cost that it can only use those 10 of the 15 information fragments which do not embody structure. Assume the statistical distribution of the 15 pieces of information the user has to construct a name with are fully independent and equally likely (this is both substantially wrong, and unfair to REGRES, but....) Assume each clue has a selectivity of 100 (it divides the number of objects returned by 100).

Then ANARCHY has a selectivity of 100<sup>10</sup> = 10<sup>20</sup> = good.

REGRES has a selectivity of: 100<sup>(chance that the other two members of an object's threesome are possessed by user × 15)</sup> = 100<sup>(9/300 × 8/300 × 15)</sup> = 1.06 = very bad.

While it is not true that the clues are fully independent, it is true that to the extent that they are not fully dependent, ANARCHY will gain in selectivity compared to REGRES.
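The selectivity arithmetic of this toy model can be checked directly. REGRES, ANARCHY, and every number below are the assumptions stated above, not measurements:

```python
# Toy model from the text: 15 random clues out of 300 facts, each clue
# with selectivity 100 (it divides the number of objects returned by 100).
clues = 15
facts = 300
per_clue = 100

# ANARCHY uses only the 10 non-structural clues, with no pairing required.
anarchy = per_clue ** 10
print(f"ANARCHY selectivity: {anarchy:.3g}")   # 1e+20

# REGRES can use a clue only if the user also holds the other two members
# of its threesome, approximated as (9/300) * (8/300), times the 15 clues.
effective_clues = (9 / facts) * (8 / facts) * clues
regres = per_clue ** effective_clues
print(f"REGRES selectivity: {regres:.2f}")     # 1.06
```

The gap of roughly twenty orders of magnitude is the whole argument: under mandatory pairing, almost none of the user's clues are usable.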
Attempting to quantify for any database the extent of the dependence would be a nightmare, and so this model assumes a substantial falsity, through which it is hoped the reader can see a greater truth. For databases of the lower heterogeneity and complexity that the relational model was designed for, the independence within a threesome can be small, and the ability to also employ the 5 of 15 fragments which are structural is often more important than the difficulty of guessing any structure added. There is an implicit assumption here that you are looking for information that others have structured, and this argument in favor of ANARCHY becomes much less strong without this assumption. I feel obligated to stress once again that I do not advocate low structure over high structure, but I do advocate having the flexibility to match the amount of structure to the needs of the moment. Only with such flexibility can one hope to use all of the 15 fragments that happen to be possessed.

== The Syntax In More Detail ==

What's needed is a naming system intended to reflect just the structure inherent in the information, whatever that structure might be, rather than restructuring the information to fit the naming system.

=== Orthogonal or Unoriginal Primitives and Features ===

There are many primitives that the ultimate naming system would include but which I will not discuss here: macros, OR, weight for subnames and AND-OR connectors [Fox], rules, constraints, indirection, links, and others. I have tried to select only those aspects in which my approach differs from the standard approach. Unifying the namespace does not require unifying automatic name generation, and those who read the [Blair] vs. [Salton] controversy likely understand my concluding that whatever the benefits might be of unifying automatic name generation, it is not feasible now, and won't be feasible for a long time to come.
The names one can assign an object are kept completely orthogonal from the contents of the object in the implementation of this naming layer. It is up to the owner of the object to name it, and it is up to him to use whatever combination of autonaming programs and manual naming best achieves his purpose. He may name it on object creation, and he may continually adjust its various names throughout its lifetime. See the section defining the "Key_Object primitive" for a discussion of why names should be thought of this way. Technically, object creation only requires that the object be given a Storage_Key. In practice most users will, in the same act that creates the object, also associate the object with at least one name that will spare them from directly specifying the Storage_Key in hex the next time they make a reference to it. Applications implementing external name spaces can interact with the storage layer by referencing just the Storage_Key. Namesys will provide a manual naming interface, and the API autonaming programs need to plug into. Companies such as Ecila will provide autonamers for various purposes. Ecila is implementing a program which scans remote stores, creates links to them in the unified name space, but leaves the data on the remote stores. Other programs may also be implemented to perform this general function. To be more specific, the Ecila search engine scans the web for documents in French, and uses the filesystem as an indexing engine. However, they are writing their engine to be a general purpose engine, they have sold support and the addition of extensions to it to other search engine companies, and it is open source. For now we are simply functioning as part of their engine, and the interface is by web browser; at some point we may be able to add their functionality to the namespace.
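The separation of Storage_Keys from names described above can be sketched in a few lines. All class and method names here are invented for illustration; this is not the Namesys API:

```python
import os

class ObjectStore:
    """Toy model: creation needs only a Storage_Key; names are optional,
    orthogonal, and adjustable throughout the object's lifetime."""

    def __init__(self):
        self.objects = {}  # Storage_Key (hex string) -> object body
        self.names = {}    # name -> Storage_Key

    def create(self, body):
        key = os.urandom(8).hex()  # stand-in for a real Storage_Key
        self.objects[key] = body
        return key

    def associate(self, name, key):
        # Manual naming or an autonamer plugin would call this; an object
        # may gain, lose, or change names at any time after creation.
        self.names[name] = key

    def resolve(self, ref):
        # A name resolves to a key; a raw hex Storage_Key works directly,
        # which is how external name spaces would reference objects.
        return self.objects[self.names.get(ref, ref)]

store = ObjectStore()
key = store.create("Dear Susan, ...")        # nameless at creation
store.associate("love-letter-susan", key)    # named afterwards, at leisure
assert store.resolve("love-letter-susan") == store.resolve(key)
```

The point of the sketch is the asymmetry: the object exists and is reachable by Storage_Key before, and independently of, any name.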
While the implementation of Microsoft's attempt to blur the distinction between the filesystem name space and the web namespace is one more of appearance than substance, it is surely the right thing to do for Linux as well in the long run. We should simply make our integration one with substance and utility, rather than integrating mostly the look and feel. When the store is external to the primary store for the namespace, then stale names can be an issue with no clean resolution. That said, unification at just the naming layer is, in a real rather than ideal world, often quite useful, and so we have Internet search engines. GUI based naming is beyond the scope of this paper, except to mention that it is common for GUI namespaces to be designed such that they are not well integrated with the other namespaces of the OS. They are often thought to necessarily be less powerful, but proper integration would make this untrue, as they would then be additional syntaxes, not substitutes. These additional syntaxes should possess closure within the general name space, and thereby be capable of finding employment as components of compound names like all the other types of names. The compound names should be able to contain both GUI and non-GUI based name components. Integration would make them simply the aspect of naming that applies to what is present in the visual cache of the screen, and to how to manage and display that cache most effectively.

=== Vicinity Set Intersection Definition (Also Called Grouping) ===

Suppose you have a set X of objects. Suppose some of these objects are associated with each other. You can draw them as connected in a graph. Let the vicinity of an object A be the set of objects associated with A. Let there be a set of query objects Q. Then the set vicinity intersection of Q is the set of objects which are members of all vicinities of the objects in Q. When thinking of this as a data model, it seems natural to use the term vicinity set intersection.
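The vicinity set intersection defined above translates directly into code. The association graph and the objects in it are invented for illustration:

```python
def vicinity(assoc, a):
    """The vicinity of A: the set of objects associated with A."""
    return assoc.get(a, set())

def vicinity_set_intersection(assoc, query):
    """Objects that lie in the vicinity of every object in Q."""
    vicinities = [vicinity(assoc, q) for q in query]
    return set.intersection(*vicinities) if vicinities else set()

# Associations are symmetric, so the graph stores each edge both ways.
assoc = {
    "reindeer": {"santa", "story.txt"},
    "chimneys": {"santa", "story.txt"},
    "santa":    {"reindeer", "chimneys", "story.txt"},
}
# The grouping [reindeer chimneys]: everything associated with both.
found = vicinity_set_intersection(assoc, ["reindeer", "chimneys"])
assert found == {"santa", "story.txt"}
```

Because intersection is commutative, the order in which the query objects are listed cannot matter, which is exactly what makes this the grouping (rather than ordering) primitive.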
When thinking of this syntactically, it seems natural to use the term "grouping", because it implies that the subnames are grouped together without the order of the subnames being significant. There is exactly one data model primitive (set vicinity intersection) possessing exactly one syntax (grouping), and I rarely intend to distinguish data model primitive from syntax primitive (I can be criticized for this), and yet I use both terms for it; forgive me.

=== Synthesizing Ordering and Grouping ===

I am going to describe a toy naming system that allows focusing on how best to combine grouping and ordering into one naming system. This synthesis will contain the core features of the hierarchical, keyword, and relational systems as functional subsets. It consists of a few simple primitives, allowed to build on each other. It sets the discussion framework from which our project will over many years evolve a real naming system out of its current storage layer implementation. Resolving the second component of an ordering is dependent on resolving the first --- unlike set theory. In set theory one can derive the ordered set from the unordered set, but because resolving the name of the second component depends on the first component one cannot do so in this naming system. For this reason it can well be argued that this naming system is not truly set theory based. Now that I have mentioned this difference I will start to call them grouping and ordering, rather than unordered and ordered set. These two primitives take other names as sub-names, and allow the user to construct compound names. Either the order of the subnames is significant (ordering), or it isn't (grouping), and thus we have the two different primitives. Because I have myself found that BNFs are easier to read if preceded by examples, I will first list progressively more complex examples using the naming system, and then formally define it.
The examples, and the simplified syntax, use / rather than : or \, but this is of no moment.

Examples

<tt>/etc/passwd</tt>

[[Image:Passwd.jpg]]

Ordering and grouping are not just better; file system upward compatibility makes them cheaper for unifying naming in OSes based on hierarchical file systems than a relational naming system would be. This approach is fully upwardly compatible with the old file system. Users should be able to retain their old habits for as long as they wish, engage in a slow comfortable migration, and incorporate the new features into their habits as they feel the desire. Elderly programs should be untroubled in their operation. Many worthwhile projects fail because they emphasize how much they wish to change rather than asking of the user the minimal collection of changes necessary to achieve the added functionality.

<tt>[dragon gandalf bilbo]</tt>

[[Image:Bilbo.jpg]]

FIGURE 3. Graphical representation of ascii name on left. Mr. B. Bizy is looking for a dimly remembered story (The Hobbit by Tolkien) to print out and take with him for rereading during the annual company meeting.

<tt>case-insensitive/[computer privacy laws]</tt>

[[Image:Syntax-barrier.jpg]]

FIGURE 4. Graphical representation of ascii name on left.

When one subname contains no information except relative to another subname, and the order of the subnames is essential to the meaning of the name, then using ordering is appropriate. This most commonly occurs when syntax barriers are crossed. This is when a single compound name makes a transition from interpreting a subname according to the rules of one syntax to interpreting it according to the rules of another syntax. Ordering is essential at the boundary between the name of the new syntax as expressed in the current syntax, and the name to be interpreted according to that new syntax. Some researchers use the term context rather than syntax. The pairing of a program or function name, and the arguments it is passed, is inherently ordered.
While that is usually the concern of the shell, when we use a variety of ordering functions to sort Key_Objects of different types it affects the object store. In this example the ordering serves as a syntax barrier. Case-insensitive is the unabbreviated name of a directory that ignores the distinction between upper and lower case. For Linux compatibility this naming layer is case sensitive by default, even though I agree with those who think that it would be better were it not.

<tt>[my secrets]/[love letter susan]</tt>

[[Image:My-secrets.jpg]]

FIGURE 5. Graphical representation of the ascii name on the left.

Devhuman (that's the account name he chose) is the company's senior programmer. Six years ago he wrote a love letter to Susan, which he put in his read-protected secrets directory. (He never found the nerve to send it to her.) He's looking for it so he can rewrite it, and then consider sending it. Security is a particular kind of syntax barrier (you have to squint a bit before you can see it that way). Here the ordering serves as a security barrier. (He certainly wouldn't want anyone to know that an object owned by him with the attributes love letter susan existed.)

<tt>[subject/[illegal strike] to/elves from/santa document-type/RFC822 ultimatum]</tt>

[[Image:Ultimatum.jpg]]

FIGURE 6. Graphical representation of the search for santa's ultimatum.

Devhuman knows his object store cold. He is looking for something he saw once before; he knows that it was auto-named by a particular namer he knows well (perhaps one whose functionality is similar to the classifier in [Messinger]), and he knows just what categorizations that namer uses when naming email. Still, he doesn't quite remember whether the word 'ultimatum' was part of the subject line, the body, or was even just elvish manual supplementation of the automatic naming. Rather than craft a query carefully specifying what he does and does not know about the possible categorizations of ultimatum, he lazily groups it.
If Devhuman's object store is implemented using this naming system with good style, someone less knowledgeable about the object store would also be able to say:

<tt>[santa illegal strike ultimatum elves]</tt>

and perhaps get some false hits as well as the desired email (instead of finding mail from santa, perhaps finding the elvish response). Notice that if you delete 'illegal' and 'ultimatum' to get

<tt>[subject/strike to/elves from/santa document-type/RFC822]</tt>

the query is structurally equivalent to a relational query. Many authors (e.g. semantic database designers) have written papers with good examples of standard column names which might be worth teaching to users. So long as they are an option made available to the user rather than a requirement demanded of the user, the increased selectivity they provide can be helpful.

<tt>[_is-a-shellscript bill]</tt>

[[Image:Pruner.jpg]]

FIGURE 7. Graphical representation of the ascii name on the left.

This name finds all shellscripts associated with bill. Names preceded by _ are pruners. Pruners are analogous to the predicate evaluators of relational database theory. If you have read papers distinguishing between recognition and retrieval, pruners are a recognition primitive. They are passed a list of objects, and return the subset of that list which matches some criteria. They are a mechanism appropriate for when a nonlinear search method that can deliver the desired functionality is either impossible, or not supported by existing indexes. There are many useful names for which we cannot do better than linear time search algorithms (perhaps simply as a result of incomplete indexing). _is-a-shellscript checks each member of its list to see if it is an executable object containing solely ascii. The user can use it just like any other Key_Object within an association; it will prune the results of the grouping. Since set intersections are commutative, its order within the grouping has no meaning, and optimizers are free to rearrange it.
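These semantics can be sketched in a few lines of Python. This is an illustration only, not Reiser4 code: the association table, the predicate, and the object names are all invented for the example. A grouping intersects the sets of objects its Key_Objects group to; pruners are then applied as predicates, and since the intersection is commutative their order cannot matter.

```python
# Sketch of grouping evaluation with pruners (hypothetical data model).

def evaluate_grouping(key_objects, pruners, groups_to):
    """key_objects: names whose association sets are intersected.
    pruners: predicates returning True for objects to keep.
    groups_to: dict mapping a name to the set of object ids it groups to."""
    sets = [groups_to[name] for name in key_objects]
    result = set.intersection(*sets)          # the set vicinity intersection
    for pruner in pruners:                    # linear-time recognition step
        result = {obj for obj in result if pruner(obj)}
    return result

# [_is-a-shellscript bill], with invented example data:
groups_to = {"bill": {"memo.txt", "backup.sh", "photo.jpg"}}
is_shellscript = lambda obj: obj.endswith(".sh")   # stand-in predicate
print(evaluate_grouping(["bill"], [is_shellscript], groups_to))
# -> {'backup.sh'}
```

Note that each pruner sees only individual members, never the whole candidate set, which is exactly the independence requirement stated later: an optimizer may reorder pruners freely only if membership in the returned subset does not depend on the other members.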
=== The Formal Definitions ===

 <Object Name> ::= <Grouping> | <Ordering> | <Key_Object> | <Storage_Key> | <Orthogonal and Unoriginal Primitives I Will Not Define Here> ;

See the section listing orthogonal and unoriginal primitives for a discussion of what primitives I left out of the definitions of this grammar that are necessary to a real world working system.

The name resolver has a method for converting all of the primitives into <Storage_Key>s, and when processing compound names it first converts the subnames into <Storage_Key>s, though the object may have null contents and serve purely to embody structure. This allows the use, as a component of a grouping or ordering, of anything for which anyone can invent a way of letting the user find an <Object Name>, and then invent a method for the resolver to convert that <Object Name> into a <Storage_Key>. In a word, closure. Extensible closure.

Compound names are interpreted by first interpreting the subnames that they are constructed from. At each stage of subname interpretation an <Object Name> is converted into a <Storage_Key> for the object that it resolves to. The modules that implement the grouping and ordering primitives do not interpret the subnames; they merely pass them to the naming system, which returns the <Storage_Key>s they resolve to.

It was a long discussion which led to the use of storage keys rather than objectids. A storage key differs from an objectid in that it gives the storage layer directions as to where to try to locate the object in the logical tree ordering of the storage layer. If the logical location changes, then in the worst case we leave a link behind, and get an extra disk access like we get with an inode. (Inode numbers are functionally objectids.) In the better case, the repacker eventually comes along and changes all references by key to the new location, at least for all objects that have not given their key to external naming systems the repacker cannot repack.
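The bottom-up interpretation rule above can be sketched as a small recursive resolver. Everything here is assumed for illustration (the node types, the `lookup`/`intersect`/`enter` callbacks); the point is only the shape of the recursion: each subname becomes a Storage_Key before the grouping or ordering module runs, and the ordering module alone decides how its second component is interpreted.

```python
# Sketch of the recursive name resolver implied by the grammar above.
from dataclasses import dataclass

@dataclass
class KeyObject:          # leaf: an object described by its contents
    contents: str

@dataclass
class Grouping:           # unordered list of subnames
    subnames: list

@dataclass
class Ordering:           # first component names the directory/module
    first: object
    second: object

def resolve(name, lookup, intersect, enter):
    """lookup: contents -> Storage_Key (search in the current directory).
    intersect: list of Storage_Keys -> Storage_Key (set vicinity intersection).
    enter: (directory_key, second_name) -> Storage_Key (directory module)."""
    if isinstance(name, KeyObject):
        return lookup(name.contents)
    if isinstance(name, Grouping):
        # subnames are resolved first, then handed to the grouping module
        keys = [resolve(n, lookup, intersect, enter) for n in name.subnames]
        return intersect(keys)
    if isinstance(name, Ordering):
        # the second component is passed, unresolved, to the module that
        # the first component names; this is why ordering is not commutative
        first_key = resolve(name.first, lookup, intersect, enter)
        return enter(first_key, name.second)
    raise TypeError("unknown name primitive")
```

Any new primitive that can be resolved to a <Storage_Key> slots into this recursion unchanged, which is the extensible closure the text describes.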
A <Storage_Key> is assigned by the system at object creation, and serves the purpose of allowing the system to concisely name the object, and of providing hints to the storage layer about which objects should be packed near each other. The user does not directly interact with the <Storage_Key> any more often than C programmers hardcode pointers in hex. The packing locality of keys may be redefined.

== The Primitives ==

<Key_Object>: a description of the contents of an object using the syntax of the current directory. For objects used to embody keywords this may be the keyword in its entirety. If it contains spaces, etc., it must be enclosed in quotes. Note that making it easy for third parties to add plug-in directory types is part of Namesys's current contract with Ecila. Ecila wants space efficient directories suitable for use in implementing a term dictionary and its postings files for their Internet search engine.

Example:

<tt>[reindeer chimneys presents man]</tt>

In this, 'presents', 'reindeer', 'chimneys', and 'man' are the contents of objects associated with the Santa Claus story. Each of them is searched for by contents, and then when found they are converted into their Storage_Keys, and the grouping algorithm is fed their four Storage_Keys. The grouping module then looks in the object headers of the four objects, gets the four sets of objects the Key_Objects group to, and performs a set intersection.

Besides greater closure, another advantage of storing Key_Objects as objects is that non-ascii Key_Objects and ordering functions can be implemented as a layer on top of the ascii naming system, allowing the user to interact with the naming system by pressing hyperbuttons, drawing pictures, making sounds, and supplying other non-ascii Key_Objects that the higher layers convert into Storage_Keys.
There are endless content description techniques. If the directory owner supplies an ordering function for the Key_Objects in a directory, one can generate a search index for the directory using a directory plug-in which is fully orthogonal to the ordering function, though perhaps slower in some cases than one that is tailored for the ordering function. Users will find it easier to write ordering functions than index creation objects, and will not always need the speed of specialized indexes. We will need one ordering function for ascii text, another for numbers, another for sounds, perhaps someday one even for pictures of faces (perhaps to be used by a law enforcement agency constructing an electronic mug book, or a white pages implementation), etc. No system designer can provide all the different and sometimes esoteric ordering functions which users will want to employ. What we can do is create a library of code from which users can construct their own ordering functions and their own directory plug-ins, and this is the approach we are taking on behalf of Ecila. For an Internet search engine one wants what is called a postings file, which is like a directory in that there is no need to support a byte offset, and one frequently wants to efficiently perform insertions into it.

 <Grouping> ::= [<Unordered List>] ;
 <Unordered List> ::= <Unordered List> <Unordered List> | <Object Name> | <Pruner> ;
 <Pruner> ::= _<Object Name> ;

A <Grouping> is a list of object names and pruners whose order has no meaning. Every object has, in its object header, a list of objects it groups to (associates with, in neural network idiom). A grouping is interpreted by performing a set intersection of those lists for every object named in the grouping; in the sense of the data model, this is a set vicinity intersection.
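The grammar above is small enough that a toy parser for the simplified syntax fits in a page. This is a sketch, not the system's parser: it tokenizes on brackets and slashes, returns nested tuples rather than resolving anything, and the tag names (`grouping`, `ordering`, `pruner`, `key_object`) are invented labels.

```python
# Toy parser for the simplified syntax: '[' ... ']' delimits a grouping,
# '_' prefixes a pruner, '/' builds an ordering (left-associative).

def tokenize(s):
    return s.replace("[", " [ ").replace("]", " ] ").replace("/", " / ").split()

def parse(tokens):
    node = parse_one(tokens)
    while tokens and tokens[0] == "/":        # a/b/c parses as (a/b)/c
        tokens.pop(0)
        node = ("ordering", node, parse_one(tokens))
    return node

def parse_one(tokens):
    tok = tokens.pop(0)
    if tok == "[":
        members = []
        while tokens[0] != "]":
            members.append(parse(tokens))     # orderings may nest in groupings
        tokens.pop(0)                         # drop the closing ']'
        return ("grouping", members)
    if tok.startswith("_"):
        return ("pruner", tok[1:])
    return ("key_object", tok)

print(parse(tokenize("[my secrets]/[love letter susan]")))
```

Because grouping members are unordered, a real implementation could canonicalize the member list, which this sketch does not do.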
Grouping is not transitive: [A] => B and [B] => C does not imply [A] => C, though it does imply that [[A]] => C.

A pruner is an <Object Name> which has been preceded with an _ to indicate that the object described should be passed a list of objects named by the rest of the grouping, executed, and that it will return a subset of the list it was passed. Whether a member of the set is in the returned subset must be fully independent of what the other members of the set were, or else the results become indeterminate after application of a query optimizer, as with an optimizer in use there is no guarantee of the order of application of the pruners.

 <Ordering> ::= <Object Name>/<Object Name> | <Object Name>/<Custom Programmed Syntax> ;
 <Custom Programmed Syntax> ::= varies; provides an extensibility hook.

An ordering is a pairing of names, with the order representing information. The first component of the ordering determines the module to which the second component is passed as an argument. In contrast, a grouping first converts all subnames to Storage_Keys by looking through the same current directory for all of them in parallel, and then does its set intersection with the subdescriptions already resolved.

Example: In resolving [my secrets]/[love letter susan] the system would look for the objects with contents my and secrets, find both of them, and do a set intersection of all the objects those two objects both group to (are associated with). This will allow it to find the [my secrets] directory, inside of which it will look for the three objects love, letter, and susan. It will then extract from their object headers the sets of objects those three words ('love', 'letter', and 'susan') group to, and do a set intersection which will find the desired letter. The desired letter is not necessarily inside the [my secrets] directory, though in this case it probably is.
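The two-stage resolution in the example can be walked through concretely. All the object names and per-directory association tables below are invented for illustration: the grouping left of the '/' is intersected in the current directory, and the result selects the directory in which the second grouping is intersected.

```python
# Worked sketch of resolving [my secrets]/[love letter susan]
# against hypothetical per-directory association tables.

def group_intersect(names, assoc):
    """Intersect the groups-to sets of the named objects in one directory."""
    return set.intersection(*(assoc[n] for n in names))

root_assoc = {                         # associations in the current directory
    "my":      {"my-secrets-dir", "my-public-dir"},
    "secrets": {"my-secrets-dir"},
}
secrets_assoc = {                      # associations inside [my secrets]
    "love":   {"letter-to-susan", "poem"},
    "letter": {"letter-to-susan", "resignation"},
    "susan":  {"letter-to-susan"},
}
directories = {"my-secrets-dir": secrets_assoc}

# Stage 1: [my secrets] resolves to a single directory in the current scope.
(first,) = group_intersect(["my", "secrets"], root_assoc)
# Stage 2: [love letter susan] is intersected inside that directory.
result = group_intersect(["love", "letter", "susan"], directories[first])
print(result)   # -> {'letter-to-susan'}
```

Stage 2 cannot begin until stage 1 has resolved, which is the dependence of the second component on the first that the text says distinguishes ordering from pure set theory.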
A directory is an object named by the first component of an ordering, to which the second component is passed, and which returns a set of Storage_Keys. One can in principle use different implementations of the same directory object without impacting the semantics and only affecting performance, as is often done in databases. There are flavors of directories:

Custom programmed directories, aka filters, are any executable program that will return a Storage_Key when executed and fed the second component as an argument. They provide extensibility. (They are the ordered counterpart of pruners.) Another term for them is filter directories. Custom programmed directories whose name interpretation modules aren't unique to them will contain just the name of the module (filter), plus some directory dependent parameters to be passed to the module. It should be considered merely a syntax barrier directory, and not a fully custom programmed directory, if those parameters include a reference to a search tree that the module operates on, and if that search tree adheres to the default index structure. The connotations conveyed by the term 'filter', of there being an original which is distorted, are not always appropriate, but in honesty this is not an issue about which we deeply care.

Syntax barrier directories allow you to describe the contents of the objects they contain with a syntax different from their parents'. Except for being sorted by a different ordering function, the indexes of syntax barrier directories are standard in their structure, and use a standard index traversal module. The index traversal module is ordering function independent. There must be an ordering function for every <Key_Object> employed within a given syntax barrier directory. By contrast, a <Custom Programmed Syntax> could be anything which the syntax module somehow finds an object with, possibly even creating the object in order to be able to find it.
To cross a security barrier directory the user must use an ordered pair of names with the security barrier as the first member of the pair, and he must satisfy the security module of the secured directory. A security barrier directory may be both a security and a syntax barrier directory, or it may share the syntax module of its parents.

Fully standard directories are those built using the default directory module; adding structure is their only semantic effect.

There is an aspect of customization which is beyond the scope of this paper, in which one customizes the items employed by the storage layer to implement files and directories. That is, the storage of the files and directories is implemented by composing them of items, and these items have different types. We are now creating the code for packing and balancing arbitrary types of items using item handlers and object oriented balancing code, so as to make it easier to extend our filesystem.

=== Ordering can be implemented more efficiently than grouping ===

The set intersections performed in evaluating the grouping primitive are normally much more expensive computationally than performing the classical filesystem lookup. Imposing excess structure on one's data does not just at times reduce the cost of human thinking :-), it can be used to reduce the cost of automated computation as well. When the cost to a user of learning structure is less important than the burden on the machine, use of highly ordered names is often called for.

=== The Motivation for Different Syntactic Treatment of Ordering and Grouping, and Some of the Deeper Issues Revealed by the Difference ===

An important difference between grouping and ordering affects syntax: we can represent an ordering with a single symbol ('/') placed between the pair, but we require two symbols ('[' and ']') for each grouping.
Imagine using < and > as a two-symbol delimiter style alternative notation for ordering:

 <<father-of mother-of> sister-of>
 = <father-of <mother-of sister-of>>
 = <father-of mother-of sister-of>
 = father-of/mother-of/sister-of

All of the expressions above are equivalent in referring to the paternal great aunt of the person who is the current context. The ones using nested pairs of symbols to enclose pairs of subnames imply a false structure that requires the user to think to realize the first two expressions are equivalent. The fourth is the notation this naming system employs.

Grouping is different. Fast Acting Freddy is looking through the All-LA Shopping Database for a single store with black reebok sneakers, a green leather jacket, and a red beret, so that he can dress an actor for a part before the director notices he forgot all about him.

 [[black reebok sneakers] [green leather jacket] [red beret]]

is not equivalent to

 [black reebok sneakers green leather jacket red beret]

which equals

 [red sneakers black reebok jacket green beret]

Ordering is not algebraically commutative (father-of/mother-of is not equivalent to mother-of/father-of). Groupings are algebraically commutative ([large red] = [red large]).

== Style ==

As a general principle, a more restricted system can avoid requiring the user to repeatedly specify the restrictions, and if the user has no need to escape the restrictions then the restricted system may be superior. This is why "4GLs", which supply the structure for the user's query, are useful for some applications. They are typically implemented as layers on top of unrestricting systems such as this one.

This paper has addressed issues surrounding finding information, particularly when the user's clues are faint. When supporting other user goals, such as exploring information, adding structure through substantial use of ordering can be helpful [Marchionini][McAleese].
When the user goal is finding, one should assume that of all the fragments of information about an object, the user has some random subset of them. The goal is to allow the user to use that random subset in a name, whatever that subset might be. Some of that subset will be structural fragments. While requiring the user to supply a structure fragment is as foolish as requiring him to supply any other arbitrary fragment, allowing him to is laudable. In the best of all worlds the object store would incorporate all valid possible structurings of Key_Objects. The difficulty in implementing that is obvious.

[Metzler and Haas] discuss ways of extracting structure from English text documents, and why one would want to be able to use that structure in retrievals. Unfortunately, there is an important difference between representing the structure of an English language sentence in a way that conveys its meaning, and representing it in a way that allows it to be found by someone who knows only a fragment of its semantic content. I doubt the wisdom of advocating the use of more than essential structure in searching. You can allow users to avoid false structure; you cannot force them to. It is important to teach those creating the structure that if they group a personnel file with sex/female they should also group it with female. Type checking can impose structure usefully. Its implementation can enhance or reduce closure, depending on whether it is done right.

=== When To Decompound Groupings ===

There are dangers in excessive compounding of groupings analogous to those of excessive ordering. Let's examine two examples of compound groupings, both of which are valid both semantically and syntactically. One of them can be "decompounded" with moderate information loss, and the other loses all meaning if decompounded.
Example: Finding a loquacious Celtic textbook salesman who told you in excruciating detail about how he was an ordinance researcher until one day he went to a Grateful Dead concert.

 [[Celtic textbook salesman] [ordinance researcher]]

vs.

 [celtic textbook salesman ordinance researcher]

These two phrasings of the same query are not equivalent, but they are "close." Our second example is the one in which Fast Acting Freddy tries to find a suspect by the objects he is associated with:

 [[black reebok sneakers] [green leather jacket] [red beret]]

vs.

 [black reebok sneakers green leather jacket red beret]

These two are not at all "close." The difference between the two examples of inequivalence is that the subdescriptions within the second example describe objects whose existence within the object store, independent of the store described, is worthwhile. The first does not, and it is more reasonable to try to design so that the "decompounded" version of the query is used. False hits will occur, but for large systems that's better than asking the user to learn structure.

A higher level user interface might choose to present only one level to the user at a time, and then once the user confirms that a subdescription has resolved properly it would let him incorporate it into a higher level description. There might be 6 models of [black reebok sneakers], and Fast Acting Freddy should have the opportunity to click his mouse on the exact model, and have the interface substitute that object for his subdescription. Using such an interface an advanced user might simultaneously develop several subdescriptions, refine and resolve them, and then use the mouse to draw lines connecting them into a compound grouping. Closure makes it possible for that to work.

== Examples of Creating Associations ==

<- creates an association between all of the objects on the left hand side and all of the objects on the right hand side.
A - B is the set difference of A and B; it resolves to the set of objects in A except for those that are in B. A & B resolves to the set intersection of A and B, the objects that are in both A and B. [A B] = [A] & [B], by definition.

 animal <- (lives, moves)
 mammal <- ([animal], animal, `warm blooded')
 cat <- ([mammal], hypernym/mammal, mammal, meronym/fur, fur, meronym/whiskers, whiskers, hypernym/quadruped, quadruped, capability/purr, purr, capability/meow, meow)
 Basil <- (owner/Nina, Nina, [siamese], siamese, clever, playful, brave/overly, brave, 'toilet explorer')
 bag <- ([container], container, consists-of/`highly flexible material', `highly flexible material')
 backpack <- ([bag], shoulderstrap/quantity/2, shoulderstrap, college-student, holonym/backpacker, meronym/shoulderstrap)
 mould <- ([fungi] - green/not, furry, `grows on'/surfaces/moist, `killed by'/chlorine)
 fungi <- ([plant], plant, leaves/no, flowers/no, green/not)
 bird <- ([vertebrate], vertebrate, flies, feathers)
 penguin <- ([bird] - flies, bird, hypernym/bird, swims, Linux, [Linux (mascot, symbol)])
 siamese <- ([cat], cat, hair/short, short-hair)

Notice how we don't associate siamese with short despite associating it with hair/short, but we do associate Basil with Nina as well as with owner/Nina.

 small <-0 little

The above means that small and little are synonyms, and are to be treated as 0 distance away from each other for vicinity calculation purposes. In other, traditional Unix, words, they are hardlinked together.

Creating a serious ontology is not our field or task, but it is worth doing. The reader is referred to WordNet (free), and Cyc by Doug Lenat (proprietary). While we will focus on implementing primitives that allow for creating better ontologies, we are happy to work with persons interested in contributing or porting an ontology.

== Other Projects Seeking To Increase Closure In The OS ==

=== AT&T's Plan 9 ===

[Plan 9] is being produced by the original authors of Unix at AT&T research labs.
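The association-building operators above can be sketched as follows. The API is invented for the sketch: `a <- (x, y)` becomes `associate("a", ["x", "y"])`, and `small <-0 little` becomes `synonym("small", "little")`, modeling the 0-distance vicinity as name expansion during lookup.

```python
# Sketch of an association store for the <- and <-0 operators (invented API).
from collections import defaultdict

groups_to = defaultdict(set)    # object header: what each name groups to
synonyms = defaultdict(set)     # names at 0 vicinity distance ("hardlinked")

def associate(left, right_names):
    """a <- (x, y): naming any right-hand name groups to `left`."""
    for name in right_names:
        groups_to[name].add(left)

def synonym(a, b):
    """a <-0 b: treat a and b as the same name for vicinity purposes."""
    synonyms[a].add(b)
    synonyms[b].add(a)

def lookup(names):
    """Resolve a grouping: intersect, expanding each name by its synonyms."""
    def reachable(name):
        hits = set(groups_to[name])
        for s in synonyms[name]:
            hits |= groups_to[s]
        return hits
    return set.intersection(*(reachable(n) for n in names))

associate("bird", ["vertebrate", "flies", "feathers"])
associate("penguin", ["bird", "swims", "Linux"])
synonym("flies", "flying")
print(lookup(["vertebrate", "flying"]))   # -> {'bird'}
```

Note that the sketch also exhibits the non-transitivity stated earlier: [Linux] finds penguin, but [vertebrate] finds bird, not penguin, because penguin's associations are one vicinity step away.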
It has influenced CORBA, and Linux's /proc is a direct steal from it. Their major focus is on integration. Their major trick for increasing integration is unifying the name space. Name spaces integrated into the Plan 9 file system include the status, control, virtual memory, and environment variables of running processes. They have a hierarchical analog to what the relational culture calls constructing views, which the Plan 9 culture calls context binding.

=== Microsoft's Information At Your Fingertips ===

Plan 9 ignores integration of application program name spaces, concentrating on OS oriented name spaces. Microsoft's "Information at Your Fingertips" name space integration effort appears to be taking the other approach, focusing on integrating the name spaces of the various Microsoft applications via OLE and Structured Storage. The application group at Microsoft has long been better staffed and funded than the OS group, and FS developers have long preferred to simply ignore the needs of application builders generally. The primary semantic disadvantages of Microsoft's approach are primitives selected with insufficient care, a lack of closure, and the use of an object oriented rather than set oriented approach in both naming syntax and data model. Realistically, one can say that folks within Microsoft have often made statements favoring name space integration, and in various areas have successfully executed on it, but on the whole I rather suspect that the lack of someone in marketing making a business case for $X in revenue resulting from name space integration has crippled name space integration work at commercial OS producers generally, including MS.

==== Internet Explorer ====

Internet Explorer attempts to unify the filesystem and Internet namespaces. At the time of writing, the unity is so superficial, with so little substance, that I would describe it as having the look and feel of integration without most of the substance. Perhaps this will change.
==== Microsoft's Well Known Performance Difficulties ====

Despite having many of the leading names in the industry on their payroll, they have somehow managed to create a file system implementation with performance so terrible that, for the Unix customer base, it is a significant consideration contributing to hesitation in moving to NT. It may well have the worst performance of any of the major OS file systems. Their implementation of OLE's structured storage offers extremely poor performance, and their excuse that it is due to the incorporation of transaction concepts into their design is just a reminder that they did a poor job at that as well. They managed to implement something intended to store small objects within a file, yet implemented it such that it still suffers from 512-byte granularity problems, problems that they try to somewhat overcome by encouraging the packing of several objects within "storages" at horrible kludge costs.

=== Storage Layers Above the FS: A Sure Symptom The FS Developer Has Failed ===

When filesystems aren't really designed for the needs of the storage layers above them, and none of them are, not Microsoft's, not anybody's, then layering results in enormous performance loss. The very existence of a storage layer above the filesystem means that the filesystem team at an OS vendor failed to listen to someone, and that someone was forced to go and implement something on their own. You just have to listen to one of these meetings in which some poor application developer tries to suggest that more features in the FS would be nice; I heard one at a nameless OS vendor. The FS team responds: disks are cheap, small object storage isn't really important, we haven't changed the disk layout in 10 years, and changing it isn't going to fly with the gods above us about whom we can do nothing.
At these meetings you start to understand that most people who go into filesystem design are persons who didn't have the guts to pursue a more interesting field in CS. There is a sort of reverse increasing returns effect that governs FS research: the more code becomes fixed on the current APIs, the more persons in the field react with fear to any thought of the field of FS semantics being other than a dead research topic, the less research gets done, and the fewer persons of imagination see a reason to enter the field. Every time one vendor gets a little ahead in adding functionality, the other vendors go on a FUD campaign about it breaking standards and therefore being dangerous for mission critical usage. This is a field in which only performance research is allowed, and every other aspect is simply dead. Namesys seeks to raise the dead, and is willing to commit whatever unholy acts that requires.

There is no need for two implementations of the set primitive, one called directories, the other called a file with streams, each having a different interface. File systems should just implement directories right, give them some more optional features, and then there is no need at all for streams. If you combine allowing directory names to be overloaded to also be filenames when acted on as files, allowing stat data to be inherited, allowing file bodies to be inherited, and implement filters of various kinds, then in the event that the user happens to need the precise peculiar functionality embodied by streams, they can have it by just configuring their directory in a particular way. There was a lengthy linux-kernel thread on this topic which I won't repeat in more detail here.

The tree architecture of the storage layer of this FS design will lend itself to a distributed caching system much more effectively than the Microsoft storage layer, in part due to its ability to cache not just hits and misses of files, but to cache semantic localities (ranges).
For more on this topic see later in this paper.

=== Rufus ===

The Rufus system [Messinger et al.] indexes information while leaving it in its original location and format. While it does allow the user to create a unified name space, it does not choose to integrate that name space into the operating system. Even so, it is immensely useful in practice, and strongly hints at what the OS could gain if it had a more than hierarchical name space with a data model oriented towards what [Messinger] calls "semi-structured information", such as you find in the RFC822 format for email. When you have 7000 pieces of mail, and linearly searching the mail with a utility like grep takes 10 minutes, it is nice to be able to quickly keyword search via inverted indexes for the mail whose From: field contains billg and that has the words "exclusive" and "bundling" in the body of the message, as you hurriedly search for an old email just before an appearance in court.

=== Semantic File System ===

The Semantic File System comes closest to addressing the needs I have described. It is a Unix compatible file system with more than hierarchical naming (attribute based is the term they use). Its data model unfortunately has the important flaw of lacking closure (in it, names of objects are not themselves objects). In my upcoming discussion of the unnecessary lack of closure in hypertext products, notice that the arguments apply to the Semantic File System as well (and so I won't duplicate them here).

=== OS/400 ===

IBM's OS/400 employs a unified relational name space. The section of this paper entitled A System Should Reflect Rather than Mold Structure will cover its problems of forcing false structure. Inadequate closure due to mandatory type checking is another source of difficulties for it.
While users moan about these two unnecessary design flaws, the essence of the opinions AS/400 partisans have expressed to me has been that the unification of its name space is a great advantage that OS/400 has over Unix. I claim these users are right, and later in this paper I will propose doing something about it.

== Conclusion ==

While I spent most of this paper on why adding structure to information can be harmful, particularly when the information is intended to be found by others sifting through large amounts of other information, this was purely because it is a harder argument than why deleting structure is harmful. My goal was not to be better at unstructured applications than keyword systems, or better at structured applications than the hierarchical and relational systems; the goal is to be more flexible in allowing the user to choose how structured to be, while still being within a single name space. I claimed that multiple fragmented name spaces cannot match the power and ease of name spaces integrated with closure: closure makes a naming system far more powerful by increasing its ability to compound complex descriptions out of simpler ones. The strong points of this naming system's design are various forms of generalizing abstractions already known to the literature, for greater closure.

== Acknowledgments ==

David P. Anderson and Clifford Lynch helped enormously in rounding out my education, and improving my paper. Their generosity with their time was remarkable. David P. Anderson was simply a great professor, and it was a privilege to work with him. Brian Harvey informed me that it wasn't too obvious to mention that an object store should be unified. Cimmaron Taylor provided me with many valuable late night discussions in the early stages of this paper. I would like to thank Bill Cody and Guy Lohman of the database group at the IBM Almaden Research Center for a wonderful learning experience.
Vladimir Saveliev kept this file system going when others fell by the wayside. He started as the most junior programmer on the team, and through sheer hard work and dedication to excellence outshone all the other more senior researchers. Of course after some time he could no longer be considered a junior programmer. NOTE: See also the DARPA funded, but not endorsed, Reiser4 Transaction Design Document and Reiser4 Whitepaper. == References == 1. Blair, David C. and Maron, M. E. "Evaluation of Retrieval Effectiveness for a Full-Text Document-Retrieval System" Communications of the ACM v 28 n 3 Mar 1985 p289-299 2. Codd, E. F. "The Relational Model for Database Management: version 2" c1990 Addison-Wesley Pub. Co., not recommended as a textbook, Date's is better for that, but worthwhile if you want a long paper by Codd. Notice that he places greater emphasis on closure, and design methodology principles in general, than designers of other naming systems such as hypertext. 3. Date, C.J. "An Introduction to Database Systems", 4th ed. Reading, Mass.: Addison-Wesley Pub. Co., c1986. Contains a well written substantive textbook sneer at the problems of hierarchical naming systems, and a well annotated bibliography. 4. Curtis, Ronald and Larry Wittie "Global Naming in Distributed Systems" IEEE Software July 1984 p76-80 5. Feldman, Jerome A., Mark A. Fanty, Nigel H. Goddard and Kenton J. Lynne, "Computing with Structured Connectionist Networks." Communications of the ACM, v31 Feb '88, p170(18) 6. Fox, E. A., and Wu, H. Extended Boolean Information Retrieval, Communications of the ACM, 26, 1983, pp. 1022-1036 7. Gallant, Stephen I., "Connectionist Expert Systems", Communications of the ACM, v31 Feb '88, p152(18) 8. Gates, Bill. Comdex '91 speech on "Information at Your Fingertips" available for $8 on videotape from Microsoft's sales department. 9. Gifford, David K., Jouvelot, Pierre., Sheldon, Mark A., O'Toole, James W.
Jr., "Semantic File Systems", Operating Systems Review Volume 25, Number 5, October 13-16, 1991. They demonstrated that extending Unix file semantics to include nonhierarchical features is useful and feasible. Unfortunately, their naming system lacks closure. 10. Gilula, Mikhail. "The Set Model for Database and Information Systems", 1st Edition, c1994, Addison-Wesley, provides a Set Theoretic Database Model in which relational algebra is shown to be a special case of a more general and powerful set theoretic approach. 11. Joint Object Services Submission (JOSS), OMG TC Document 93.5.1 12. Marchionini, Gary., and Shneiderman, Ben. "Finding Facts vs. Browsing Knowledge in Hypertext Systems." Computer, January 1988, p. 70 13. McAleese, Ray "Hypertext: Theory into Practice" edited by Ray McAleese, ABLEX Publishing Corporation, Norwood, NJ 07648 14. Messinger, Eli., Shoens, Kurt., Thomas, John., Luniewski, Allen "Rufus: The Information Sponge" Research Report RJ 8294 (75655) August 13, 1991, IBM Almaden Research Center 15. Metzler and Haas. "The Constituent Object Parser: Syntactic Structure Matching for Information Retrieval", Proceedings of the ACM SIGIR Conference, 1989, ACM Press. 16. Nelson, T.H. Literary Machines, self published by Nelson, Nashville, Tenn., 1981, did much to popularize hypertext; at the time of writing he has still not released a working product, though competitors such as hypercard have done so with notable success. 17. Mozer, Michael C. "Inductive Information Retrieval Using Parallel Distributed Computation", UCLA 18. Pike, Rob and P.J. Weinberger, "The Hideous Name", AT&T Research Report 19. Pike, Rob., Presotto, Dave., Thompson, Ken., Trickey, Howard., Winterbottom, Phil. "The Use of Name Spaces in Plan 9", available via ftp from att.com. Plan 9 is an operating system intended to be the successor to Unix, and greater integration of its name spaces is its primary focus. 20. Potter, Walter D. and Robert P.
Trueblood, "Traditional, semantic, and hyper-semantic approaches to data modeling" v21 Computer '88 p53(11) 21. Rijsbergen, C. J. Van, Information Retrieval - 2nd. ed., Butterworth and Co. Ltd., 1979, Printed in Great Britain by The Whitefriars Ltd., London and Tonbridge 22. Salton, G. (1986) Another Look At Automatic Text-Retrieval Systems, Communications of the ACM, 29, 648-656 23. Smith, J.M. and D.C. Smith, "Database Abstractions: Aggregation and Generalization" ACM Transactions Database Systems, June 1977, pp. 105-133 ICS Report No. 8406 June 1984 24. http://www.win.tue.nl/~aeb/partitions/partition_types.html [[category:Reiser4]] The Naming System Venture == Abstract == For too long the file system has been semantically impoverished in comparison with database and keyword systems. It is time to change! The current lack of features makes it much easier to use the latest set theoretic models rather than older models of relational algebra or hypertext. The current FS syntax fits nicely into the newer model. The utility of an operating system is more proportional to the number of connections possible between its components than it is to the number of those components. Namespace fragmentation is the most important determinant of that number of possible connections between OS components. Unix at its beginning increased the integration of I/O by putting devices into the file system name space. This is a winning strategy: let's take the file system name space, and one-by-one eliminate the reasons why the filesystem is inadequate for what other name spaces are used for, one missing feature at a time. Only once we have done so will the hobbles be removed from OS architects, or even OS conspiracies. Yet before doing that, we need a core architecture for the semantics to ensure we end up with a coherent whole. This paper suggests a set theoretic model for those semantics.
The relational models would at times unacceptably add structure to information, the keyword models would at times delete structure, and purely hierarchical models would create information mazes. Reworking their primitives is required to synthesize the best attributes of these models in a way that allows one the flexibility to tailor the level of structure to the need of the moment. The set theoretic model I propose has a syntax that is Linux, MacOS, and DOS file system syntax upwardly compatible, as well as CORBA naming layer upwardly compatible. This is a planning document for the next major version of ReiserFS, that is, a description of vaporware. It is useful to ReiserFS users and contributors who want to know where we are going, and why we are building all sorts of strange optimizations into the storage layer (and especially those who are willing to help shape the vision in the course of discussions on the {{listaddress}} mailing list....). Currently the storage layer for ReiserFS is working and useful as an everyday FS with conventional semantics. That storage layer is available as a GPL'd Linux kernel patch. == Introduction == Many OS researchers have built hierarchical name spaces that innovate in their effect on the integration of the operating system (e.g. Plan 9 and its file system [Pike]). Relational and keyword researchers rightfully scorn hierarchical name spaces as 20 years behind the state of the art [Date], but pay little attention to integration of the operating system as a design objective in their own work, or as a possible influence on data model design. I won't go into that here. Limiting associations to single key words is an unnecessary restriction. == A Naming System Should Reflect Rather than Mold Structure == The importance of not deleting the structure of information is obvious; few would advocate using the keyword model to unify naming.
What can be more difficult to see is the harm from adding structure to information; some do recommend the relational model for unifying naming (e.g. OS/400). By decomposing a primitive of a model into smaller primitives one can end up with a more general model, one with greater flexibility of application. This is the very normal practice of mathematicians, who in their work constantly examine mathematical models with an eye to finding a more fundamental set of primitives, in hopes that a new formulation of the model will allow the new primitives to function more independently, and thereby increase the generality and expressive power of the model. Here I break the relational primitive (a tuple is an unordered set of ordered pairs) into separate ordered and unordered set primitives. Relational systems force you to use unordered sets of ordered pairs when sometimes what you want is a simple unordered set. Why should a naming system match rather than mold the structure of information? For systems of low complexity, the reasons are deeply philosophical, which means uncompelling. And for multiterabyte distributed systems?... Reiser's Rule of Thumb #2: The most important characteristic of a very complex system is the user's inability to learn its structure as a whole. We must avoid adding structure, or guarantee that the user will be informed of all structure relevant to his partial information. Avoiding adding structure is both more feasible and less burdensome to the user. Hierarchical, relational, semantic, and hypersemantic systems all force structure on information, structure inherent in the system rather than the information represented. If a system adds structure, and the user is trying to exploit partial knowledge (such as a name embodies), then it inevitably requires the user to learn what was added before he can employ his partial knowledge. With complex systems, the amount added is beyond the capacity of users to learn, and information is lost. 
Example: <tt>"My name is Kali, your friendly whitepaper.html technical support specialist for REGRES. Our system puts the Library of Congress online! How may I help you."</tt> George doesn't know Santa Claus' name: <tt>"I'm trying to find the reindeer chimneys christmas man, and I can't get your system to do it."</tt> [[Image:Reindeer.jpg]] FIGURE 1. Graphical representation of a typical simple unordered set that is difficult for relational systems. Kali says: <tt>"OK, now let's define a query.'''is-a equals man''', that's easy. But reindeer? Is reindeer a property of this man?"</tt> <tt>"Uh no. I wish I could remember the dude's name. I read this story about him a long time ago, and all I can remember is that he had something to do with reindeer and chimneys. The story is on-line, somewhere."</tt> <tt>"Reindeer chimneys presents man, that's the sort of speech pattern I'd expect from a three year old."</tt> Kali corrects him. <tt>"Let's see if we can structure this properly. Is reindeer an '''instance-of''' of this man? A '''member-of''' of this man? It couldn't be a '''generalization''' of this man. Hmm..."</tt> <tt>"No! It's not that complicated. They just have something to do with him."</tt> <tt>"Pavlov would probably say you associate reindeer with this man, the way the unstructured mind of an animal thinks. But here in technical support we try to help our customers become more sophisticated. Is reindeer a property of this man?"</tt> <tt>"No. Try '''propulsion-provider-for'''."</tt> <tt>"Do you think that that was the schema the person who put the information in our system used?"</tt> <tt>"No. Shoot. I can think of a dozen different columns it could be under. But what are the chances that the ones I think of are going to be the same as the ones the dude who put the information in used?"</tt> Kali feels satisfaction. <tt>"Guess it can't be done, not if you can't structure your REGRES query properly. 
I'll put you down in my log as a closed ticket, 190 seconds to resolution, not bad."</tt> <tt>"A keyword system could handle reindeer chimneys christmas man."</tt> George grumbles as he stares in despair at his display. Unfortunately, the ''Library of Congress'' is only one of REGRES' many reference aids. George could spend his life at it, and he'd never learn its schema. <tt>"But a keyword system would delete even necessary structure inherent to the information. It couldn't handle our other needs!"</tt> Kali says before she hangs up. In addition to the searcher's difficulties, having to manufacture structure by specifying the column for reindeer also adds unnecessary cognitive load to the story author's indexing tasks. == A Few of the Other Approaches to This Problem == There is lurking at the heart of my approach a subtle difference between my analysis of naming, and the analysis of at least some others. I started my research by systematically categorizing the different structures embodied by names, placing them into equivalency classes, and then picking one syntax out of each class of functionally equivalent naming structures, on the assumption that each of the equivalency classes has value. For example, I considered that languages sometimes convey structure by word endings (tags), and sometimes by word order, but while the syntax differs, the word order and word ending techniques are equivalent in their power to convey structure. In my analysis of the effect of word ordering I decided that either the ordering mattered, or it did not, and that was the basis for two different naming primitives. Others have instead studied the inherent structure of data, and then from that derived ways of naming. The hypersemantic system [Smith] [Potter] represents an attempt to pick a manageably few columns which cover all possible needs. 
Generalization, aggregation, classification, and membership correspond to the is-a, has-property, is-an-instance-of, and is-a-member-of columns, respectively. The minor problem is that these columns don't cover all possibilities. They don't cover reindeer, presents, or chimneys for George's query. The major problem is that they don't correspond as closely as possible to the most common style of human thought, simple unordered association, and require cognitive effort to transform. The first response of relational database researchers to this is usually to ask: "Why not modify an existing relational database to contain an 'associated' column, put everything in that column, and it would be functionally equivalent to what you want?" This is like saying that you can do everything Pascal can do using TeX macros. (They are both Turing complete.) We don't design languages to simply be Turing complete, we design them to be useful. I have seen a colleague do in six lines of SQL (nonstandard SQL) a simple three keyword unordered set that I do in 3 words plus a pair of delimiters, and that traditional keyword systems also handle easily. Doing simple unordered sets well is crucial for highly heterogeneous name spaces, and the market success of keyword systems in Internet searching is evidence of that. If you look at the structure of names in human languages, they are not all tuple structured, and to make them tuple structured might be to distort them. I have merely discussed the burden of naming columns. Most relational systems also require the user to specify the relation name. If column naming is a burden, naming both the column and the relation is no less a burden. Many systems invest effort into allowing you to take the key that you know, and figure out all the relation names and columns that you might choose to pair with it. This is a good idea, but not as good as not imposing extraneous structure to begin with.
[Salton] can be read for devastating critiques of the document clustering system, but there is a worthwhile idea lurking within that system. Perhaps it is worthwhile to keep track of a small number of documents which are "close" to a given document. The document creator could be informed upon auto-indexing the document what other documents appear to be close to it, and asked to consider associating it with them. This is not within our current plan of work, but I don't reject it conceptually. In summary, modularity within the naming system is improved by recognizing unordered grouping and ordering as two different functions that deserve separate primitives rather than being combined into a tuple primitive. The tuple is an unordered set of ordered pairs. There are other useful combinations of unordered grouping and ordering than that embodied by the relation, and the success of keyword systems suggests that a plain unordered set without any ordering at all is the most fundamental and common of them. == Names as Random Subsets of the Information In an Object == A system may still be effective when its assumptions are known to be false. You may regard the above as an overstatement of the notion that we are neural nets, and sometimes our abstract systems deal with assumptions that are not true or false, but are somewhat true. After we are finished stating them in English they lose the delicate weighting possessed by the reality of the situation. Sometimes we find it easier to model without that weighting. Classical economics and its assumption of perfect competition is the best known example of an effective system based on assumptions known to be substantially false. Introductory economics classes usually spend several weeks of class time arguing the merits of building models on somewhat false assumptions. This paper will now use such a somewhat false model to convey a feel for why mandatory pairing of name components causes problems.
Assume the user's information from which he tries to construct a description will be some completely random subset of the information about the object. (Some of that information will be structural, and the structural fragments selected will be just as random as the rest.) Assume a user has 15 random clues of information selected from 300 pieces of information the system knows about some object. Assume the REGRES naming system requires that data be supplied in threesomes (perhaps column name, key name, relation name), and cannot use one member of a threesome without the other members of the threesome. Assume the ANARCHY naming system lacks this restriction, but does so at the cost that it can only use those 10 of the 15 information fragments which do not embody structure. Assume the statistical distribution of the 15 pieces of information the user has to construct a name with are fully independent and equally likely (this is both substantially wrong, and unfair to REGRES, but ....) Assume each clue has a selectivity of 100 (it divides the number of objects returned by 100). Then ANARCHY has a selectivity of 100<sup>10</sup> = 10<sup>20</sup> = good. REGRES has a selectivity of: 100<sup>(chance that the other two members of an object's threesome are possessed by user x 15)</sup> = 100<sup>(9/300 x 8/300 x 15)</sup> = 1.06 = very bad. While it is not true that the clues are fully independent, it is true that to the extent that they are not fully dependent, ANARCHY will gain in selectivity compared to REGRES. Attempting to quantify for any database the extent of the dependence would be a nightmare, and so this model assumes a substantial falsity, through which it is hoped the reader can see a greater truth.
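The selectivity arithmetic above can be reproduced in a few lines. This is a minimal sketch of the paper's admittedly oversimplified model; the variable names are mine, the numbers (15 clues, 300 facts, threesomes, selectivity 100 per usable clue) are the paper's.

```python
# Sketch of the ANARCHY vs. REGRES selectivity model described in the text.
CLUES = 15       # random information fragments the user holds
FACTS = 300      # pieces of information the system knows about the object
PER_CLUE = 100   # each usable clue divides the candidate set by 100

# ANARCHY can use only the 10 non-structural fragments, but each works alone.
anarchy_selectivity = PER_CLUE ** 10                    # 100^10 = 10^20

# REGRES can use a clue only if the other two members of its threesome were
# also randomly drawn: probability (9/300) * (8/300), for each of 15 clues.
p_threesome_complete = (9 / FACTS) * (8 / FACTS)
regres_selectivity = PER_CLUE ** (p_threesome_complete * CLUES)

print(f"ANARCHY: {anarchy_selectivity:.3g}")            # 1e+20
print(f"REGRES:  {regres_selectivity:.2f}")             # 1.06
```

Running this confirms the text's figures: 10<sup>20</sup> for ANARCHY versus roughly 1.06 for REGRES.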
For databases of the lower heterogeneity and complexity that the relational model was designed for, the independence within a threesome can be small, and the ability to also employ the 5 of 15 fragments which are structural is often more important than the difficulty of guessing any structure added. There is an implicit assumption here that you are looking for information that others have structured, and this argument in favor of ANARCHY becomes much less strong without this assumption. I feel obligated to stress once again that I do not advocate low structure over high structure, but I do advocate having the flexibility to match the amount of structure to the needs of the moment. Only with such flexibility can one hope to use all of the 15 fragments that happen to be possessed. == The Syntax In More Detail == What's needed is a naming system intended to reflect just the structure inherent in the information, whatever that structure might be, rather than restructuring the information to fit the naming system. === Orthogonal or Unoriginal Primitives and Features === There are many primitives that the ultimate naming system would include but which I will not discuss here: macros, OR, weight for subnames and AND-OR connectors [Fox], rules, constraints, indirection, links, and others. I have tried to select only those aspects in which my approach differs from the standard approach. Unifying the namespace does not require unifying automatic name generation, and those who read the [Blair] vs. [Salton] controversy likely understand my concluding that whatever the benefits might be of unifying automatic name generation, it is not feasible now, and won't be feasible for a long time to come. The names one can assign an object are kept completely orthogonal from the contents of the object in the implementation of this naming layer. 
It is up to the owner of the object to name it, and it is up to him to use whatever combination of autonaming programs and manual naming best achieves his purpose. He may name it on object creation, and he may continually adjust its various names throughout its lifetime. See the section defining the "Key_Object primitive" for a discussion of why names should be thought of this way. Technically, object creation only requires that the object be given a Storage_Key. In practice most users will, in the same act that creates the object, also associate the object with at least one name that will spare them from directly specifying the Storage_Key in hex the next time they make a reference to it. Applications implementing external name spaces can interact with the storage layer by referencing just the Storage_Key. Namesys will provide a manual naming interface, and the API autonaming programs need to plug into. Companies such as Ecila will provide autonamers for various purposes. Ecila is implementing a program which scans remote stores, creates links to them in the unified name space, but leaves the data on the remote stores. Other programs may also be implemented to perform this general function. To be more specific, the Ecila search engine scans the web for documents in French, and uses the filesystem as an indexing engine. However, they are writing their engine to be a general purpose engine; they have sold support and the addition of extensions to it to other search engine companies, and it is open source. For now we are simply functioning as part of their engine, and the interface is by web browser; at some point we may be able to add their functionality to the namespace. While the implementation of Microsoft's attempt to blur the distinction between the filesystem name space and the web namespace is one more of appearance than substance, it is surely the right thing to do for Linux as well in the long run.
We should simply make our integration one with substance and utility, rather than integrating mostly the look and feel. When the store is external to the primary store for the namespace, then stale names can be an issue with no clean resolution. That said, unification at just the naming layer is, in a real rather than ideal world, often quite useful, and so we have Internet search engines. GUI based naming is beyond the scope of this paper, except to mention that it is common for GUI namespaces to be designed such that they are not well integrated with the other namespaces of the OS. They are often thought to necessarily be less powerful, but proper integration would make this untrue, as they would then be additional syntaxes, not substitutes. These additional syntaxes should possess closure within the general name space, and thereby be capable of finding employment as components of compound names like all the other types of names. The compound names should be able to contain both GUI and non-GUI based name components. Integration would make them simply the aspect of naming that applies to what is present in the visual cache of the screen, and to how to manage and display that cache most effectively. === Vicinity Set Intersection Definition (Also Called Grouping) === Suppose you have a set X of objects. Suppose some of these objects are associated with each other. You can draw them as connected in a graph. Let the vicinity of an object A be the set of objects associated with A. Let there be a set of query objects Q. Then the set vicinity intersection of Q is the set of objects which are a member of all vicinities of the objects in Q. When thinking of this as a data model, it seems natural to use the term vicinity set intersection. When thinking of this syntactically, it seems natural to use the term "grouping", because it implies that the subnames are grouped together without the order of the subnames being significant.
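The vicinity set intersection definition above translates almost directly into code. The following is an illustrative sketch only: the object names and the dictionary-of-sets representation of the association graph are mine, while the real storage layer would of course operate on Storage_Keys and on-disk structures.

```python
# Sketch of the vicinity-set-intersection (grouping) primitive defined above.

def vicinity(associations, obj):
    """The vicinity of obj: the set of objects associated with obj."""
    return associations.get(obj, set())

def vicinity_set_intersection(associations, query_objects):
    """Objects lying in the vicinity of every object in the query set Q."""
    vicinities = [vicinity(associations, q) for q in query_objects]
    if not vicinities:
        return set()
    result = set(vicinities[0])
    for v in vicinities[1:]:
        result &= v      # order of intersection is irrelevant: grouping
    return result

# Toy association graph for George's query.
assoc = {
    "reindeer": {"santa-story", "zoology-article"},
    "chimneys": {"santa-story", "masonry-howto"},
    "presents": {"santa-story"},
    "man":      {"santa-story", "zoology-article"},
}

print(vicinity_set_intersection(assoc, ["reindeer", "chimneys", "presents", "man"]))
# {'santa-story'}
```

Because set intersection is commutative, the subnames of a grouping carry no significant order, which is exactly the property that distinguishes grouping from ordering.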
There is exactly one data model primitive (set vicinity intersection) possessing exactly one syntax (grouping), and I rarely intend to distinguish data model primitive from syntax primitive (I can be criticized for this), and yet I use both terms for it; forgive me. === Synthesizing Ordering and Grouping === I am going to describe a toy naming system that allows focusing on how best to combine grouping and ordering into one naming system. This synthesis will contain the core features of the hierarchical, keyword, and relational systems as functional subsets. It consists of a few simple primitives, allowed to build on each other. It sets the discussion framework from which our project will over many years evolve a real naming system out of its current storage layer implementation. Resolving the second component of an ordering is dependent on resolving the first --- unlike set theory. In set theory one can derive ordered set from unordered set, but because resolving the name of the second component depends on the first component one cannot do so in this naming system. For this reason it can well be argued that this naming system is not truly set theory based. Now that I have mentioned this difference I will start to call them grouping and ordering, rather than unordered and ordered set. These two primitives take other names as sub-names, and allow the user to construct compound names. Either the order of the subnames is significant (ordering), or it isn't (grouping), and thus we have the two different primitives. Because I have myself found that BNFs are easier to read if preceded by examples, I will first list progressively more complex examples using the naming system, and then formally define it. The examples, and the simplified syntax, use / rather than : or \, but this is of no moment.
Examples: /etc/passwd Ordering and grouping are not just better; file system upward compatibility makes them cheaper for unifying naming in OSes based on hierarchical file systems than a relational naming system would be. This approach is fully upwardly compatible with the old file system. Users should be able to retain their old habits for as long as they wish, engage in a slow comfortable migration, and incorporate the new features into their habits as they feel the desire. Elderly programs should be untroubled in their operation. Many worthwhile projects fail because they emphasize how much they wish to change rather than asking of the user the minimal collection of changes necessary to achieve the added functionality. [dragon gandalf bilbo] FIGURE 3. Graphical representation of ascii name on left Mr. B. Bizy is looking for a dimly remembered story (The Hobbit by Tolkien) to print out and take with him for rereading during the annual company meeting. case-insensitive/[computer privacy laws] FIGURE 4. Graphical representation of ascii name on left When one subname contains no information except relative to another subname, and the order of the subnames is essential to the meaning of the name, then using ordering is appropriate. This most commonly occurs when syntax barriers are crossed. This is when a single compound name makes a transition from interpreting a subname according to the rules of one syntax to interpreting it according to the rules of another syntax. Ordering is essential at the boundary between the name of the new syntax as expressed in the current syntax, and the name to be interpreted according to that new syntax. Some researchers use the term context rather than syntax. The pairing of a program or function name, and the arguments it is passed, is inherently ordered. While that is usually the concern of the shell, when we use a variety of ordering functions to sort Key_Objects of different types it affects the object store.
In this example the ordering serves as a syntax barrier. Case-insensitive is the unabbreviated name of a directory that ignores the distinction between upper and lower case. For Linux compatibility this naming layer is case sensitive by default, even though I agree with those who think that it would be better were it not. [my secrets]/ [love letter susan] FIGURE 5. Graphical representation of ascii name on left Devhuman (that's the account name he chose) is the company's senior programmer. Six years ago he wrote a love letter to Susan, which he put in his read protected secrets directory. (He never found the nerve to send it to her.) He's looking for it so he can rewrite it, and then consider sending it. Security is a particular kind of syntax barrier (you have to squint a bit before you can see it that way). Here the ordering serves as a security barrier. (He certainly wouldn't want anyone to know that an object owned by him with attributes love letter susan existed.) [subject/[illegal strike] to/elves from/santa document-type/RFC822 ultimatum] FIGURE 6. Graphical representation of search for santa's ultimatum Devhuman knows his object store cold. He is looking for something he saw once before, he knows that it was auto-named by a particular namer he knows well (perhaps one whose functionality is similar to the classifier in [Messinger]), and he knows just what categorizations that namer uses when naming email. Still, he doesn't quite remember whether the word 'ultimatum' was part of the subject line, the body, or even was just elvish manual supplementation of the automatic naming. Rather than craft a query carefully specifying what he does and does not know about the possible categorizations of ultimatum, he lazily groups it. 
If Devhuman's object store is implemented using this naming system with good style, someone less knowledgeable about the object store would also be able to say: [santa illegal strike ultimatum elves] and perhaps get some false hits as well as the desired email (instead of finding mail from santa perhaps finding the elvish response). Notice that if you delete the 'illegal' and 'ultimatum' to get [subject/strike to/elves from/santa document-type/RFC822] the query is structurally equivalent to a relational query. Many authors (e.g. semantic database designers) have written papers with good examples of standard column names which might be worth teaching to users. So long as they are an option made available to the user rather than a requirement demanded of the user, the increased selectivity they provide can be helpful. [_is-a-shellscript bill] FIGURE 7. Graphical representation of ascii name on left This name finds all shellscripts associated with bill. Names preceded by _ are pruners. Pruners are analogous to the predicate evaluators of relational database theory. If you have read papers distinguishing between recognition and retrieval, pruners are a recognition primitive. They are passed a list of objects, and return a subset of that list which matches some criteria. They are a mechanism appropriate for when a nonlinear search method that can deliver the desired functionality is either impossible, or not supported by existing indexes. There are many useful names for which we cannot do better than linear time search algorithms (perhaps simply as a result of incomplete indexing). _is-a-shellscript checks each member of its list to see if it is an executable object containing solely ascii. The user can use it just like any other Key_Object within an association; it will prune the results of the grouping. Since set intersections are commutative its order within the grouping has no meaning, and optimizers are free to rearrange it.
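A pruner in the spirit of _is-a-shellscript can be sketched as a plain linear filter. This is a hypothetical illustration, not the real implementation: the object representation (a content byte string plus an executable flag) and the sample objects are mine; the paper only specifies the interface (a list of objects in, a matching subset out).

```python
# Sketch of a pruner: a linear-time recognition primitive that filters a
# candidate list, used when no index supports the desired criterion.

def is_a_shellscript(obj):
    """True if obj is an executable object containing solely ascii."""
    return obj["executable"] and all(b < 128 for b in obj["contents"])

def prune(candidates, predicate):
    """Pruners take a list of objects and return the matching subset."""
    return [obj for obj in candidates if predicate(obj)]

# Hypothetical candidate objects, e.g. the result of the grouping [bill].
objs = [
    {"name": "backup.sh", "executable": True,  "contents": b"#!/bin/sh\necho hi\n"},
    {"name": "a.out",     "executable": True,  "contents": b"\x7fELF\xc3\x90"},
    {"name": "notes.txt", "executable": False, "contents": b"plain text"},
]

print([o["name"] for o in prune(objs, is_a_shellscript)])
# ['backup.sh']
```

Because the predicate is applied to whatever list the rest of the grouping produces, an optimizer can evaluate the cheap indexed subnames first and run the pruner last over the smallest possible candidate set.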
=== The Formal Definitions ===

<Object Name> ::= <Grouping> | <Ordering> | <Key_Object> | <Storage_Key> | <Orthogonal and Unoriginal Primitives I Will Not Define Here> ;

See the section listing orthogonal and unoriginal primitives for a discussion of what primitives I left out of the definitions of this grammar that are necessary to a real world working system. The name resolver has a method for converting all of the primitives into <Storage_Key>s, and when processing compound names it first converts the subnames into <Storage_Key>s, though an object may have null contents and serve purely to embody structure. This allows the use, as a component of a grouping or ordering, of anything for which anyone can invent a way of allowing the user to find an <Object Name>, and then invent a method for the resolver to convert the <Object Name> into a <Storage_Key>. In a word, closure. Extensible closure. Compound names are interpreted by first interpreting the subnames that they are constructed from. At each stage of subname interpretation an <Object Name> is converted into a <Storage_Key> for the object that it is resolved to. The modules that implement the grouping and ordering primitives do not interpret the subnames; they merely pass them to the naming system, which returns the <Storage_Key>s they resolve to. It was a long discussion which led to the use of storage keys rather than objectids. A storage key differs from an objectid in that it gives the storage layer directions as to where to try to locate the object in the logical tree ordering of the storage layer. If the logical location changes, then in the worst case we leave a link behind, and get an extra disk access like we get with an inode. (Inode numbers are functionally objectids.) In the better case, the repacker eventually comes along, and changes all references by key to the new location, at least for all objects that have not given their key to external naming systems the repacker cannot repack.
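As a rough illustration of the storage-key behavior described above (the class and its dict-based "tree" are assumptions made for the sketch, not the real storage layer):

```python
# Sketch of storage keys vs. plain objectids: a storage key carries a
# hint about the object's position in the storage layer's logical tree
# ordering.  If the object has moved, a link left at the old position
# costs one extra access, much as an inode lookup does.  The class and
# its dict-based "tree" are illustrative assumptions.

class StorageLayer:
    def __init__(self):
        self.tree = {}    # logical position -> object
        self.links = {}   # stale position -> current position

    def insert(self, position, obj):
        self.tree[position] = obj
        return position   # the position itself serves as the storage key

    def relocate(self, old_pos, new_pos):
        self.tree[new_pos] = self.tree.pop(old_pos)
        self.links[old_pos] = new_pos   # leave a link behind

    def lookup(self, key):
        if key in self.tree:            # fast path: the hint is still valid
            return self.tree[key], 1
        return self.tree[self.links[key]], 2   # slow path: one extra access

layer = StorageLayer()
key = layer.insert(10, "love letter susan")
layer.relocate(10, 42)            # the repacker has not yet fixed references
obj, accesses = layer.lookup(key)
```

Once a repacker rewrites stale references to point at the new position, the fast path applies again; keys handed to external naming systems are the ones it cannot rewrite.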
A <Storage_Key> is assigned by the system at object creation, and serves the purpose of allowing the system to concisely name the object, and to provide hints to the storage layer about which objects should be packed near each other. The user does not directly interact with the <Storage_Key> any more often than C programmers hardcode pointers in hex. The packing locality of keys may be redefined.

== The Primitives ==

<Key_Object> A description of the contents of an object using the syntax of the current directory. For objects used to embody keywords this may be the keyword in its entirety. If it contains spaces, etc., it must be enclosed in quotes. Note that making it easy for third parties to add plug-in directory types is part of Namesys's current contract with Ecila. Ecila wants space efficient directories suitable for use in implementing a term dictionary and its postings files for their Internet search engine.

Example:

[reindeer chimneys presents man]

In this example 'presents', 'reindeer', 'chimneys', and 'man' are the contents of objects associated with the Santa Claus story. Each of them is searched for by contents, and when found they are converted into their Storage_Keys, and the grouping algorithm is fed their four Storage_Keys. The grouping module then looks in the object headers of the four objects, gets the four sets of objects the Key_Objects group to, and performs a set intersection. Besides greater closure, another advantage of storing Key_Objects as objects is that non-ascii Key_Objects and ordering functions can be implemented as a layer on top of the ascii naming system, allowing the user to interact with the naming system by pressing hyperbuttons, drawing pictures, making sounds, and supplying other non-ascii Key_Objects that the higher layers convert into Storage_Keys.
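The grouping interpretation just described amounts to a set intersection over the association sets stored in object headers. A minimal Python sketch, with storage keys (101, 102, ...) and a header table invented purely for illustration:

```python
# Sketch of grouping resolution as set intersection over the association
# sets kept in object headers.  The storage keys and the header table
# are invented for illustration; they are not a real on-disk format.

headers = {   # Key_Object -> storage keys of the objects it groups to
    "reindeer": {101, 102, 107},
    "chimneys": {101, 105, 107},
    "presents": {101, 107, 109},
    "man":      {101, 103, 107},
}

def resolve_grouping(key_objects):
    """Intersect the association sets of every named Key_Object."""
    return set.intersection(*(headers[k] for k in key_objects))

result = resolve_grouping(["reindeer", "chimneys", "presents", "man"])
# groupings are commutative: any permutation resolves identically
```

In a real resolver each Key_Object would first be found by contents and converted to its own Storage_Key before its header is consulted; the sketch skips that step.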
There are endless content description techniques. If the directory owner supplies an ordering function for the Key_Objects in a directory, one can generate a search index for the directory using a directory plug-in which is fully orthogonal to the ordering function, though perhaps slower in some cases than one that is tailored for the ordering function. Users will find it easier to write ordering functions than index creation objects, and will not always need the speed of specialized indexes. We will need one ordering function for ascii text, another for numbers, another for sounds, perhaps someday one even for pictures of faces (perhaps to be used by a law enforcement agency constructing an electronic mug book, or a white pages implementation), etc. No system designer can provide all the different and sometimes esoteric ordering functions which users will want to employ. What we can do is create a library of code from which users can construct their own ordering functions and their own directory plug-ins, and this is the approach we are taking on behalf of Ecila. For an Internet search engine one wants what is called a postings file, which is like a directory in that there is no need to support a byte offset, and one frequently wants to efficiently perform insertions into it.

<Grouping> ::= [<Unordered List>] ;
<Unordered List> ::= <Unordered List> <Unordered List> | <Object Name> | <Pruner> ;
<Pruner> ::= _<Object Name>

A <Grouping> is a list of object names and pruners whose order has no meaning. Every object has a list of objects it groups to (associates with, in neural network idiom) in its object header. A grouping is interpreted by performing a set intersection of those lists for every object named in the grouping. In the sense of the data model, a grouping is interpreted by performing a set vicinity intersection.
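A toy interpreter for flat instances of the <Grouping> grammar above can make the split between Key_Objects and pruners concrete. Nested groupings and the real name resolver are omitted, and the header table and pruner definitions are illustrative assumptions:

```python
# Toy interpreter for flat instances of the <Grouping> grammar: tokens
# inside [ ] are Key_Objects, and tokens prefixed with _ are pruners.
# Nested groupings and real name resolution are omitted; the header
# table and the pruner definitions are invented for illustration.

headers = {"black": {1, 2}, "beret": {1, 3}, "red": {1, 2, 3}}

def parse_grouping(text):
    """Split '[a b _c]' into (key_object_names, pruner_names)."""
    assert text.startswith("[") and text.endswith("]")
    names, pruners = [], []
    for token in text[1:-1].split():
        (pruners if token.startswith("_") else names).append(token.lstrip("_"))
    return names, pruners

def interpret(text, pruner_defs):
    names, pruners = parse_grouping(text)
    result = set.intersection(*(headers[n] for n in names))
    for p in pruners:   # application order carries no meaning
        result = {obj for obj in result if pruner_defs[p](obj)}
    return result

hits = interpret("[black beret _odd]", {"odd": lambda key: key % 2 == 1})
```

Since pruners only filter the candidate set, the interpreter is free to apply them after the indexed intersection, in any order.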
Grouping is not transitive: [A] => B and [B] => C does not imply [A] => C, though it does imply that [[A]] => C. A pruner is an <Object Name> which has been preceded with an _ to indicate that the object described should be passed a list of objects named by the rest of the grouping, executed, and that it will return a subset of the list it was passed. Whether a member of the set is in the returned subset must be fully independent of what the other members of the set were, or else the results become indeterminate after application of a query optimizer, as with an optimizer in use there is no guarantee provided of the order of application of the pruners.

<Ordering> ::= <Object Name>/<Object Name> | <Object Name>/<Custom Programmed Syntax>
<Custom Programmed Syntax> ::= Varies, provides extensibility hook.

An ordering is a pairing of names, with the order representing information. The first component of the ordering determines the module to which the second component is passed as an argument. In contrast, a grouping first converts all subnames to Storage_Keys by looking through the same current directory for all of them in parallel, and then does its set intersection with the subdescriptions already resolved.

Example: In resolving [my secrets] / [love letter susan] the system would look for the objects with contents my and secrets, find both of them, and do a set intersection of all of the objects those two objects both group to (are associated with). This will allow it to find the [my secrets] directory, inside of which it will look for the three objects love, letter, and susan. It will then extract from their object headers the sets of objects those three words ('love', 'letter', and 'susan') group to, and do a set intersection which will find the desired letter. The desired letter is not necessarily inside the [my secrets] directory, though in this case it probably is.
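The [my secrets] / [love letter susan] resolution walked through above can be sketched as follows; the tables and storage keys are invented for illustration:

```python
# Sketch of resolving the ordering [my secrets] / [love letter susan]:
# the grouping left of '/' finds a directory, and the grouping right of
# '/' is then interpreted inside that directory.  All tables and keys
# here are invented for illustration.

headers = {"my": {1, 5}, "secrets": {1, 9}}   # both group to directory 1
directory_headers = {                          # per-directory associations
    1: {"love": {20, 21}, "letter": {20, 22}, "susan": {20, 23}},
}

def resolve_grouping(assoc, words):
    return set.intersection(*(assoc[w] for w in words))

def resolve_ordering(first, second):
    (directory,) = resolve_grouping(headers, first)   # find [my secrets]
    # pass the second component to the directory named by the first
    return resolve_grouping(directory_headers[directory], second)

found = resolve_ordering(["my", "secrets"], ["love", "letter", "susan"])
```

The asymmetry is the point: the first component selects the module (here, which association table to consult), and only then is the second component interpreted.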
A directory is an object named by the first component of an ordering, to which the second component is passed, and which returns a set of Storage_Keys. One can in principle use different implementations of the same directory object without impacting the semantics and only affecting performance, as is often done in databases. There are several flavors of directories. Custom programmed directories, aka filters, are any executable program that will return a Storage_Key when executed and fed the second component as an argument. They provide extensibility. (They are the ordered counterpart of pruners.) Another term for them is filter directories. Custom programmed directories whose name interpretation modules aren't unique to them will contain just the name of the module (filter), plus some directory dependent parameters to be passed to the module. It should be considered merely a syntax barrier directory, and not a fully custom programmed directory, if those parameters include a reference to a search tree that the module operates on, and if that search tree adheres to the default index structure. The connotations conveyed by the term 'filter' of there being an original which is distorted are not always appropriate, but in honesty this is not an issue about which we deeply care. Syntax barrier directories allow you to describe the contents of the objects they contain with a syntax different from their parents'. Except for being sorted by a different ordering function, the indexes of syntax barrier directories are standard in their structure, and use a standard index traversal module. The index traversal module is ordering function independent. There must be an ordering function for every <Key_Object> employed within a given syntax barrier directory. By contrast, a <Custom Programmed Syntax> could be anything which the syntax module somehow finds an object with, possibly even creating the object in order to be able to find it.
To cross a security barrier directory the user must use an ordered pair of names with the security barrier as the first member of the pair, and he must satisfy the security module of the secured directory. A security barrier directory may be both a security and a syntax barrier directory, or it may share the syntax module of its parents. Fully standard directories are those built using the default directory module, and adding structure is their only semantic effect. There is an aspect of customization which is beyond the scope of this paper, in which one customizes the items employed by the storage layer to implement files and directories. That is, the storage of the files and directories is implemented by composing them of items, and these items have different types. We are now creating the code for packing and balancing arbitrary types of items using item handlers and object oriented balancing code, so as to make it easier to extend our filesystem.

=== Ordering can be implemented more efficiently than grouping ===

The set intersections performed in evaluating the grouping primitive are normally much more expensive computationally than performing the classical filesystem lookup. Imposing excess structure on one's data does not just at times reduce the cost of human thinking :-); it can be used to reduce the cost of automated computation as well. When the cost to a user of learning structure is less important than the burden on the machine, use of highly ordered names is often called for.

=== The Motivation for Different Syntactic Treatment of Ordering and Grouping, and Some of the Deeper Issues Revealed by the Difference ===

An important difference between grouping and ordering affects syntax. It allows us to represent an ordering with a single symbol ('/') placed between the pair, but requires two symbols ('[' and ']') for each grouping.
Imagine using < and > as a two symbol delimiter style alternative notation for ordering:

<<father-of mother-of> sister-of> = <father-of <mother-of sister-of>> = <father-of mother-of sister-of> = father-of/mother-of/sister-of

All of the expressions above are equivalent in referring to the paternal great aunt of the person who is the current context. The ones using nested pairs of symbols to enclose pairs of subnames imply a false structure that requires the user to think to realize the first two expressions are equivalent. The fourth is the notation this naming system employs. Grouping is different: Fast Acting Freddy is looking through the All-LA Shopping Database for a single store with black reebok sneakers, a green leather jacket, and a red beret so that he can dress an actor for a part before the director notices he forgot all about him.

[[black reebok sneakers] [green leather jacket] [red beret]]

is not equivalent to

[black reebok sneakers green leather jacket red beret]

which equals

[red sneakers black reebok jacket green beret]

Ordering is not algebraically commutative (father-of/mother-of is not equivalent to mother-of/father-of). Groupings are algebraically commutative ([large red] = [red large]).

== Style ==

As a general principle, a more restricted system can avoid requiring the user to repeatedly specify the restrictions, and if the user has no need to escape the restrictions then the restricted system may be superior. This is why "4GLs", which supply the structure for the user's query, are useful for some applications. They are typically implemented as layers on top of unrestricting systems such as this one. This paper has addressed issues surrounding finding information, particularly when the user's clues are faint. When supporting other user goals, such as exploring information, adding structure through substantial use of ordering can be helpful [Marchionini] [McAleese].
When the user goal is finding, one should assume that the user has some random subset of all the fragments of information about an object. The goal is to allow the user to use that random subset in a name, whatever that subset might be. Some of that subset will be structural fragments. While requiring the user to supply a structure fragment is as foolish as requiring him to supply any other arbitrary fragment, allowing him to is laudable. In the best of all worlds the object store would incorporate all valid possible structurings of Key_Objects. The difficulty in implementing that is obvious. [Metzler and Haas] discuss ways of extracting structure from English text documents, and why one would want to be able to use that structure in retrievals. Unfortunately, there is an important difference between representing the structure of an English language sentence in a way that conveys its meaning, and representing it in a way that allows it to be found by someone who knows only a fragment of its semantic content. I doubt the wisdom of advocating the use of more than essential structure in searching. You can allow users to avoid false structure; you cannot force them to. It is important to teach those creating the structure that if they group a personnel file with sex/female they should also group it with female. Type checking can impose structure usefully. Its implementation can enhance or reduce closure, depending on whether it is done right.

=== When To Decompound Groupings ===

There are dangers in excessive compounding of groupings analogous to those of excessive ordering. Let's examine two examples of compound groupings, both of which are valid both semantically and syntactically. One of them can be "decompounded" with moderate information loss, and the other loses all meaning if decompounded.
Example: Finding a loquacious Celtic textbook salesman who told you in excruciating detail about how he was an ordnance researcher until one day he went to a Grateful Dead concert.

[[Celtic textbook salesman] [ordnance researcher]]

vs.

[celtic textbook salesman ordnance researcher]

These two phrasings of the same query are not equivalent, but they are "close." Our second example is the one in which Fast Acting Freddy tries to find a suspect by the objects he is associated with:

[[black reebok sneakers] [green leather jacket] [red beret]]

vs.

[black reebok sneakers green leather jacket red beret]

These two are not at all "close." The difference between the two examples of inequivalencies is that the subdescriptions within the second example describe objects whose existence within the object store, independent of the object described, is worthwhile. The first example's subdescriptions do not, and it is more reasonable to try to design so that the "decompounded" version of the query is used. False hits will occur, but for large systems that's better than asking the user to learn structure. A higher level user interface might choose to present only one level to the user at a time, and then once the user confirms that a subdescription has resolved properly it would let him incorporate it into a higher level description. There might be 6 models of [black reebok sneakers], and Fast Acting Freddy should have the opportunity to click his mouse on the exact model, and have the interface substitute that object for his subdescription. Using such an interface an advanced user might simultaneously develop several subdescriptions, refine and resolve them, and then use the mouse to draw lines connecting them into a compound grouping. Closure makes it possible for that to work.

== Examples of Creating Associations ==

<- creates an association between all of the objects on the left hand side and all of the objects on the right hand side.
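The <- operator just defined can be sketched as a pair of nested loops over the two sides; the dict-of-sets representation and the function names are assumptions of the sketch, not a real API:

```python
# Sketch of the '<-' association operator: every object on the left hand
# side becomes findable through every object on the right hand side.
# The dict-of-sets representation is an assumption of the sketch.
from collections import defaultdict

groups_to = defaultdict(set)   # object header: set of objects grouped to

def associate(lhs, rhs):
    """lhs <- rhs: naming any member of rhs now helps find each lhs."""
    for a in lhs:
        for b in rhs:
            groups_to[b].add(a)

associate(["animal"], ["lives", "moves"])
associate(["cat"], ["mammal", "fur", "whiskers", "purr", "meow"])

# the grouping [fur whiskers] now resolves by intersection
found = groups_to["fur"] & groups_to["whiskers"]
```

Vicinity distances (such as the distance-0 synonym form described below in this section) could be carried as an extra per-pair value alongside the association, but are omitted here.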
A - B is the set difference of A and B, and it resolves to the set of objects in A except for those that are in B. A & B resolves to the set intersection of A and B, the objects that are in both A and B. [A B] = [A] & [B], by definition.

animal <- (lives, moves)
mammal <- ([animal], animal, `warm blooded')
cat <- ([mammal], hypernym/mammal, mammal, meronym/fur, fur, meronym/whiskers, whiskers, hypernym/quadruped, quadruped, capability/purr, purr, capability/meow, meow)
Basil <- (owner/Nina, Nina, [siamese], siamese, clever, playful, brave/overly, brave, 'toilet explorer')
bag <- ([container], container, consists-of/`highly flexible material', `highly flexible material')
backpack <- ([bag], shoulderstrap/quantity/2, shoulderstrap, college-student, holonym/backpacker, meronym/shoulderstrap)
mould <- ([fungi] - green/not, furry, `grows on'/surfaces/moist, `killed by'/chlorine)
fungi <- ([plant], plant, leaves/no, flowers/no, green/not)
bird <- ([vertebrate], vertebrate, flies, feathers)
penguin <- ([bird] - flies, bird, hypernym/bird, swims, Linux, [Linux (mascot, symbol)])
siamese <- ([cat], cat, hair/short, short-hair)

Notice how we don't associate siamese with short despite associating it with hair/short, but we do associate Basil with Nina as well as with owner/Nina.

small <-0 little

The above means that small and little are synonyms, and are to be treated as 0 distance away from each other for vicinity calculation purposes. In other, traditional Unix, words, they are hardlinked together. Creating a serious ontology is not our field or task, but it is worth doing. The reader is referred to WordNet (free), and Cyc by Doug Lenat (proprietary). While we will focus on implementing primitives that allow for creating better ontologies, we are happy to work with persons interested in contributing or porting an ontology.

== Other Projects Seeking To Increase Closure In The OS ==

=== AT&T's Plan 9 ===

[Plan 9] is being produced by the original authors of Unix at AT&T research labs.
It has influenced CORBA, and Linux's /proc is a direct steal from it. Their major focus is on integration. Their major trick for increasing integration is unifying the name space. Name spaces integrated into the Plan 9 file system include the status, control, virtual memory, and environment variables of running processes. They have a hierarchical analog to what the relational culture calls constructing views, which the Plan 9 culture calls context binding.

=== Microsoft's Information At Your Fingertips ===

Plan 9 ignores integration of application program name spaces, concentrating on OS oriented name spaces. Microsoft's "Information at Your Fingertips" name space integration effort appears to be taking the other approach, focusing on integrating the name spaces of the various Microsoft applications via OLE and Structured Storage. The application group at Microsoft has long been better staffed and funded than the OS group, and FS developers have long preferred to simply ignore the needs of application builders generally. The primary semantic disadvantages of Microsoft's approach are primitives selected with insufficient care, a lack of closure, and the use of an object oriented rather than set oriented approach in both naming syntax and data model. Realistically, one can say that folks within Microsoft have often made statements favoring name space integration, and in various areas have successfully executed on it, but on the whole I rather suspect that the lack of someone in marketing making a business case for $X in revenue resulting from name space integration has crippled name space integration work at commercial OS producers generally, including MS.

==== Internet Explorer ====

Internet Explorer attempts to unify the filesystem and Internet namespaces. At the time of writing, the unity is so superficial, with so little substance, that I would describe it as having the look and feel of integration without most of the substance. Perhaps this will change.
==== Microsoft's Well Known Performance Difficulties ====

Despite having many of the leading names in the industry on their payroll, they have somehow managed to create a file system implementation with performance so terrible that, for the Unix customer base, it is a significant consideration contributing to hesitation in moving to NT. It may well have the worst performance of any of the major OS file systems. Their implementation of OLE's structured storage offers extremely poor performance, and their excuse that it is due to the incorporation of transaction concepts into their design is just a reminder that they did a poor job at that as well. They managed to implement something intended to store small objects within a file, yet implemented it such that it still suffers from 512-byte granularity problems, problems that they try to somewhat overcome by encouraging the packing of several objects within "storages" at horrible kludge costs.

=== Storage Layers Above the FS: A Sure Symptom The FS Developer Has Failed ===

When filesystems aren't really designed for the needs of the storage layers above them, and none of them are, not Microsoft's, not anybody's, then layering results in enormous performance loss. The very existence of a storage layer above the filesystem means that the filesystem team at an OS vendor failed to listen to someone, and that someone was forced to go and implement something on their own. You just have to listen to one of these meetings in which some poor application developer tries to suggest that more features in the FS would be nice; I heard one at a nameless OS vendor. The FS team responds by saying that disks are cheap, small object storage isn't really important, we haven't changed the disk layout in 10 years, and changing it isn't going to fly with the gods above us about whom we can do nothing.
At these meetings you start to understand that most people who go into filesystem design are persons who didn't have the guts to pursue a more interesting field in CS. There is a sort of reverse increasing returns effect that governs FS research: the more code becomes fixed on the current APIs, the more persons in the field react with fear to any thought of the field of FS semantics being other than a dead research topic, the less research gets done, and the fewer persons of imagination see a reason to enter the field. Every time one vendor gets a little forward in adding functionality, the other vendors go on a FUD campaign about it breaking standards and therefore being dangerous for mission critical usage. This is a field in which only performance research is allowed, and every other aspect is simply dead. Namesys seeks to raise the dead, and is willing to commit whatever unholy acts that requires. There is no need for two implementations of the set primitive, one called directories, the other called a file with streams, each having a different interface. File systems should just implement directories right, give them some more optional features, and then there is no need at all for streams. If you combine allowing directory names to be overloaded to also be filenames when acted on as files, allowing stat data to be inherited, allowing file bodies to be inherited, and implementing filters of various kinds, then in the event that the user happens to need the precise peculiar functionality embodied by streams, they can have it by just configuring their directory in a particular way. There was a lengthy Linux-kernel thread on this topic which I won't repeat in more detail here. The tree architecture of the storage layer of this FS design will lend itself to a distributed caching system much more effectively than the Microsoft storage layer, in part due to its ability to cache not just hits and misses of files, but to cache semantic localities (ranges).
For more on this topic see later in this paper.

=== Rufus ===

The Rufus system [Messinger et al.] indexes information while leaving it in its original location and format. While it does allow the user to create a unified name space, it does not choose to integrate that name space into the operating system. Even so, it is immensely useful in practice, and strongly hints at what the OS could gain if it had a more than hierarchical name space with a data model oriented towards what [Messinger] calls "semi-structured information", such as you find in the RFC822 format for email. When you have 7000 pieces of mail, and linearly searching the mail with a utility like grep takes 10 minutes, it is nice to be able to quickly keyword search via inverted indexes for the mail whose from: field contains billg and that has the words "exclusive" and "bundling" in the body of the message, as you hurriedly search for an old email just before an appearance in court.

=== Semantic File System ===

The Semantic File System comes closest to addressing the needs I have described. It is a Unix compatible file system with more than hierarchical naming (attribute based is the term they use). Its data model unfortunately has the important flaw of lacking closure (in it, names of objects are not themselves objects). In my upcoming discussion of the unnecessary lack of closure in hypertext products, notice that the arguments apply to the Semantic File System (and so I won't duplicate them here).

=== OS/400 ===

IBM's OS/400 employs a unified relational name space. The section of this paper entitled A System Should Reflect Rather than Mold Structure will cover its problems of forcing false structure. Inadequate closure due to mandatory type checking is another source of difficulties for it.
While users moan about these two unnecessary design flaws, the essence of the opinions AS/400 partisans have expressed to me has been that the unification of its name space is a great advantage that OS/400 has over Unix. I claim these users were right, and later in this paper will propose doing something about it.

== Conclusion ==

While I spent most of this paper on why adding structure to information can be harmful, particularly when the information is intended to be found by others sifting through large amounts of other information, this was purely because it is a harder argument than why deleting structure is harmful. My goal was not to be better at unstructured applications than keyword systems, or better at structured applications than the hierarchical and relational systems --- the goal is to be more flexible in allowing the user to choose how structured to be, while still being within a single name space. I claimed that multiple fragmented name spaces cannot match the power and ease of name spaces integrated with closure: closure makes a naming system far more powerful by increasing its ability to compound complex descriptions out of simpler ones. The strong points of this naming system's design are various forms of generalizing abstractions already known to the literature, for greater closure.

== Acknowledgments ==

David P. Anderson and Clifford Lynch helped enormously in rounding out my education, and improving my paper. Their generosity with their time was remarkable. David P. Anderson was simply a great professor, and it was a privilege to work with him. Brian Harvey informed me that it wasn't too obvious to mention that an object store should be unified. Cimmaron Taylor provided me with many valuable late night discussions in the early stages of this paper. I would like to thank Bill Cody and Guy Lohman of the database group at the IBM Almaden Research Center for a wonderful learning experience.
Vladimir Saveliev kept this file system going when others fell by the wayside. He started as the most junior programmer on the team, and through sheer hard work and dedication to excellence outshone all the other more senior researchers. Of course, after some time he could no longer be considered a junior programmer. NOTE: See also the DARPA funded, but not endorsed, Reiser4 Transaction Design Document and Reiser4 Whitepaper.

== References ==

1. Blair, David C. and Maron, M. E. "Evaluation of Retrieval Effectiveness for a Full-Text Document-Retrieval System", Communications of the ACM, v28 n3, Mar 1985, p289-299
2. Codd, E. F. "The Relational Model for Database Management: Version 2", c1990, Addison-Wesley Pub. Co. Not recommended as a textbook, Date's is better for that, but worthwhile if you want a long paper by Codd. Notice that he places greater emphasis on closure, and design methodology principles in general, than designers of other naming systems such as hypertext.
3. Date, C.J. "An Introduction to Database Systems", 4th ed., Reading, Mass.: Addison-Wesley Pub. Co., c1986. Contains a well written substantive textbook sneer at the problems of hierarchical naming systems, and a well annotated bibliography.
4. Curtis, Ronald and Larry Wittie "Global Naming in Distributed Systems", IEEE Software, July 1984, p76-80
5. Feldman, Jerome A., Mark A. Fanty, Nigel H. Goddard and Kenton J. Lynne, "Computing with Structured Connectionist Networks", Communications of the ACM, v31, Feb '88, p170(18)
6. Fox, E. A., and Wu, H. "Extended Boolean Information Retrieval", Communications of the ACM, 26, 1983, pp. 1022-1036
7. Gallant, Stephen I., "Connectionist Expert Systems", Communications of the ACM, v31, Feb '88, p152(18)
8. Gates, Bill. Comdex '91 speech on "Information at Your Fingertips", available for $8 on videotape from Microsoft's sales department.
9. Gifford, David K., Jouvelot, Pierre., Sheldon, Mark A., O'Toole, James W.
Jr., "Semantic File Systems", Operating Systems Review, Volume 25, Number 5, October 13-16, 1991. They demonstrated that extending Unix file semantics to include nonhierarchical features is useful and feasible. Unfortunately, their naming system lacks closure.
10. Gilula, Mikhail. "The Set Model for Database and Information Systems", 1st Edition, c1994, Addison-Wesley. Provides a Set Theoretic Database Model in which relational algebra is shown to be a special case of a more general and powerful set theoretic approach.
11. Joint Object Services Submission (JOSS), OMG TC Document 93.5.1
12. Marchionini, Gary., and Shneiderman, Ben. "Finding Facts vs. Browsing Knowledge in Hypertext Systems", Computer, January 1988, p. 70
13. McAleese, Ray "Hypertext: Theory into Practice", edited by Ray McAleese, ABLEX Publishing Corporation, Norwood, NJ 07648
14. Messinger, Eli., Shoens, Kurt., Thomas, John., Luniewski, Allen "Rufus: The Information Sponge", Research Report RJ 8294 (75655), August 13, 1991, IBM Almaden Research Center
15. Metzler and Haas. "The Constituent Object Parser: Syntactic Structure Matching for Information Retrieval", Proceedings of the ACM SIGIR Conference, 1989, ACM Press
16. Nelson, T.H. Literary Machines, self published by Nelson, Nashville, Tenn., 1981. Did much to popularize hypertext; at the time of writing he has still not released a working product, though competitors such as hypercard have done so with notable success.
17. Mozer, Michael C. "Inductive Information Retrieval Using Parallel Distributed Computation", UCLA
18. Pike, Rob and P.J. Weinberger, "The Hideous Name", AT&T Research Report
19. Pike, Rob., Presotto, Dave., Thompson, Ken., Trickey, Howard., Winterbottom, Phil. "The Use of Name Spaces in Plan 9", available via ftp from att.com. Plan 9 is an operating system intended to be the successor to Unix, and greater integration of its name spaces is its primary focus.
20. Potter, Walter D. and Robert P.
Trueblood. "Traditional, Semantic, and Hyper-Semantic Approaches to Data Modeling", Computer, v21, 1988, p. 53(11).

21. Rijsbergen, C. J. van. "Information Retrieval", 2nd ed., Butterworth and Co. Ltd., 1979. Printed in Great Britain by The Whitefriars Ltd., London and Tonbridge.

22. Salton, G. "Another Look at Automatic Text-Retrieval Systems", Communications of the ACM, 29, 1986, pp. 648-656.

23. Smith, J. M. and D. C. Smith. "Database Abstractions: Aggregation and Generalization", ACM Transactions on Database Systems, June 1977, pp. 105-133.

24. http://www.win.tue.nl/~aeb/partitions/partition_types.html

[[category:Reiser4]]

The Naming System Venture

== Abstract ==

For too long the file system has been semantically impoverished in comparison with database and keyword systems. It is time to change! This current lack of features makes it much easier to adopt the latest set theoretic models rather than older models based on relational algebra or hypertext, and the current FS syntax fits nicely into the newer model.

The utility of an operating system is more proportional to the number of connections possible between its components than it is to the number of those components. Namespace fragmentation is the most important determinant of that number of possible connections between OS components. Unix at its beginning increased the integration of I/O by putting devices into the file system name space. This is a winning strategy: let's take the file system name space and eliminate, one missing feature at a time, the reasons why the filesystem is inadequate for what other name spaces are used for. Only once we have done so will the hobbles be removed from OS architects, or even OS conspiracies. Yet before doing that, we need a core architecture for the semantics to ensure we end up with a coherent whole. This paper suggests a set theoretic model for those semantics.
The relational models would at times unacceptably add structure to information, the keyword models would at times delete structure, and purely hierarchical models would create information mazes. Reworking their primitives is required to synthesize the best attributes of these models in a way that gives one the flexibility to tailor the level of structure to the need of the moment. The set theoretic model I propose has a syntax that is upwardly compatible with Linux, MacOS, and DOS file system syntax, as well as upwardly compatible with the CORBA naming layer.

This is a planning document for the next major version of ReiserFS, that is, a description of vaporware. It is useful to ReiserFS users and contributors who want to know where we are going, and why we are building all sorts of strange optimizations into the storage layer (and especially to those who are willing to help shape the vision in the course of discussions on the {{listaddress}} mailing list....). Currently the storage layer for ReiserFS is working and useful as an everyday FS with conventional semantics. That storage layer is available as a GPL'd Linux kernel patch.

== Introduction ==

Many OS researchers have built hierarchical name spaces that innovate in their effect on the integration of the operating system (e.g. Plan 9 and its file system [Pike]). Relational and keyword researchers rightfully scorn hierarchical name spaces as 20 years behind the state of the art [Date], but pay little attention to integration of the operating system as a design objective in their own work, or as a possible influence on data model design. I won't go into that here. Limiting associations to single keywords, as keyword systems do, is an unnecessary restriction.

== A Naming System Should Reflect Rather than Mold Structure ==

The importance of not deleting the structure of information is obvious; few would advocate using the keyword model to unify naming.
What can be more difficult to see is the harm from adding structure to information; some do recommend the relational model for unifying naming (e.g. OS/400). By decomposing a primitive of a model into smaller primitives one can end up with a more general model, one with greater flexibility of application. This is the very normal practice of mathematicians, who in their work constantly examine mathematical models with an eye to finding a more fundamental set of primitives, in hopes that a new formulation of the model will allow the new primitives to function more independently, and thereby increase the generality and expressive power of the model. Here I break the relational primitive (a tuple is an unordered set of ordered pairs) into separate ordered and unordered set primitives. Relational systems force you to use unordered sets of ordered pairs when sometimes what you want is a simple unordered set. Why should a naming system match rather than mold the structure of information? For systems of low complexity, the reasons are deeply philosophical, which means uncompelling. And for multiterabyte distributed systems?... Reiser's Rule of Thumb #2: The most important characteristic of a very complex system is the user's inability to learn its structure as a whole. We must avoid adding structure, or guarantee that the user will be informed of all structure relevant to his partial information. Avoiding adding structure is both more feasible and less burdensome to the user. Hierarchical, relational, semantic, and hypersemantic systems all force structure on information, structure inherent in the system rather than the information represented. If a system adds structure, and the user is trying to exploit partial knowledge (such as a name embodies), then it inevitably requires the user to learn what was added before he can employ his partial knowledge. With complex systems, the amount added is beyond the capacity of users to learn, and information is lost. 
Example: <tt>"My name is Kali, your friendly technical support specialist for REGRES. Our system puts the Library of Congress online! How may I help you?"</tt> George doesn't know Santa Claus' name: <tt>"I'm trying to find the reindeer chimneys christmas man, and I can't get your system to do it."</tt>

[[Image:Reindeer.jpg]]

FIGURE 1. Graphical representation of a typical simple unordered set that is difficult for relational systems.

Kali says: <tt>"OK, now let's define a query. '''is-a equals man''', that's easy. But reindeer? Is reindeer a property of this man?"</tt>

<tt>"Uh, no. I wish I could remember the dude's name. I read this story about him a long time ago, and all I can remember is that he had something to do with reindeer and chimneys. The story is on-line, somewhere."</tt>

<tt>"Reindeer chimneys presents man, that's the sort of speech pattern I'd expect from a three year old."</tt> Kali corrects him. <tt>"Let's see if we can structure this properly. Is reindeer an '''instance-of''' of this man? A '''member-of''' of this man? It couldn't be a '''generalization''' of this man. Hmm..."</tt>

<tt>"No! It's not that complicated. They just have something to do with him."</tt>

<tt>"Pavlov would probably say you associate reindeer with this man, the way the unstructured mind of an animal thinks. But here in technical support we try to help our customers become more sophisticated. Is reindeer a property of this man?"</tt>

<tt>"No. Try '''propulsion-provider-for'''."</tt>

<tt>"Do you think that that was the schema the person who put the information in our system used?"</tt>

<tt>"No. Shoot. I can think of a dozen different columns it could be under. But what are the chances that the ones I think of are going to be the same as the ones the dude who put the information in used?"</tt>

Kali feels satisfaction. <tt>"Guess it can't be done, not if you can't structure your REGRES query properly.
I'll put you down in my log as a closed ticket, 190 seconds to resolution, not bad."</tt> <tt>"A keyword system could handle reindeer chimneys christmas man."</tt> George grumbles as he stares in despair at his display. Unfortunately, the ''Library of Congress'' is only one of REGRES' many reference aids. George could spend his life at it, and he'd never learn its schema. <tt>"But a keyword system would delete even necessary structure inherent to the information. It couldn't handle our other needs!"</tt> Kali says before she hangs up. In addition to the searcher's difficulties, having to manufacture structure by specifying the column for reindeer also adds unnecessary cognitive load to the story author's indexing tasks. == A Few of the Other Approaches to This Problem == There is lurking at the heart of my approach a subtle difference between my analysis of naming, and the analysis of at least some others. I started my research by systematically categorizing the different structures embodied by names, placing them into equivalency classes, and then picking one syntax out of each class of functionally equivalent naming structures, on the assumption that each of the equivalency classes has value. For example, I considered that languages sometimes convey structure by word endings (tags), and sometimes by word order, but while the syntax differs, the word order and word ending techniques are equivalent in their power to convey structure. In my analysis of the effect of word ordering I decided that either the ordering mattered, or it did not, and that was the basis for two different naming primitives. Others have instead studied the inherent structure of data, and then from that derived ways of naming. The hypersemantic system [Smith] [Potter] represents an attempt to pick a manageably few columns which cover all possible needs. 
Generalization, aggregation, classification, and membership correspond to the is-a, has-property, is-an-instance-of, and is-a-member-of columns, respectively. The minor problem is that these columns don't cover all possibilities: they don't cover reindeer, presents, or chimneys for George's query. The major problem is that they don't correspond as closely as is possible to the most common style of human thought, simple unordered association, and they require cognitive effort to transform. The first response of relational database researchers to this is usually to ask: "Why not modify an existing relational database to contain an 'associated' column, put everything in that column, and it would be functionally equivalent to what you want?" This is like saying that you can do everything Pascal can do using TeX macros. (They are both Turing complete.) We don't design languages simply to be Turing complete, we design them to be useful. I have seen a colleague do in six lines of (nonstandard) SQL a simple three keyword unordered set query that I do in three words plus a pair of delimiters, and that traditional keyword systems also handle easily. Doing simple unordered sets well is crucial for highly heterogeneous name spaces, and the market success of keyword systems in Internet searching is evidence of that. If you look at the structure of names in human languages, they are not all tuple structured, and to make them tuple structured might be to distort them. I have merely discussed the burden of naming columns. Most relational systems also require the user to specify the relation name. If column naming is a burden, naming both the column and the relation is no less a burden. Many systems invest effort into allowing you to take the key that you know, and figure out all the relation names and columns that you might choose to pair with it. This is a good idea, but not as good as not imposing extraneous structure to begin with.
[Salton] can be read for devastating critiques of the document clustering system, but there is a worthwhile idea lurking within that system. Perhaps it is worthwhile to keep track of a small number of documents which are "close" to a given document. The document creator could be informed upon auto-indexing the document what other documents appear to be close to it, and asked to consider associating it with them. This is not within our current plan of work, but I don't reject it conceptually. In summary, modularity within the naming system is improved by recognizing unordered grouping and ordering as two different functions that deserve separate primitives rather than being combined into a tuple primitive. The tuple is an unordered set of ordered pairs. There are other useful combinations of unordered grouping and ordering than that embodied by the relation, and the success of keyword systems suggests that a plain unordered set without any ordering at all is the most fundamental and common of them.

== Names as Random Subsets of the Information In an Object ==

A system may still be effective when its assumptions are known to be false. You may regard the above as an overstatement of the notion that we are neural nets, and that sometimes our abstract systems deal with assumptions that are not true or false, but are somewhat true. After we are finished stating them in English they lose the delicate weighting possessed by the reality of the situation. Sometimes we find it easier to model without that weighting. Classical economics and its assumption of perfect competition is the best known example of an effective system based on assumptions known to be substantially false. Introductory economics classes usually spend several weeks of class time arguing the merits of building models on somewhat false assumptions. This paper will now use such a somewhat false model to convey a feel for why mandatory pairing of name components causes problems.
Assume the user's information from which he tries to construct a description will be some completely random subset of the information about the object. (Some of that information will be structural, and the structural fragments selected will be just as random as the rest.) Assume a user has 15 random clues of information selected from 300 pieces of information the system knows about some object. Assume the REGRES naming system requires that data be supplied in threesomes (perhaps column name, key name, relation name), and cannot use one member of a threesome without the other members of the threesome. Assume the ANARCHY naming system lacks this restriction, but does so at the cost that it can only use those 10 of the 15 information fragments which do not embody structure. Assume the statistical distributions of the 15 pieces of information the user has to construct a name with are fully independent and equally likely (this is both substantially wrong, and unfair to REGRES, but ....) Assume each clue has a selectivity of 100 (it divides the number of objects returned by 100).

Then ANARCHY has a selectivity of: 100<sup>10</sup> = 10<sup>20</sup> = good.

REGRES has a selectivity of: 100<sup>(chance that the other two members of a clue's threesome are possessed by the user × 15)</sup> = 100<sup>(9/300 × 8/300 × 15)</sup> ≈ 1.06 = very bad.

While it is not true that the clues are fully independent, it is true that to the extent that they are not fully dependent, ANARCHY will gain in selectivity compared to REGRES. Attempting to quantify for any database the extent of the dependence would be a nightmare, and so this model assumes a substantial falsity, through which it is hoped the reader can see a greater truth. For databases of the lower heterogeneity and complexity that the relational model was designed for, the independence within a threesome can be small, and the ability to also employ the 5 of 15 fragments which are structural is often more important than the difficulty of guessing any structure added.
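The arithmetic of this toy model is easy to check mechanically. Here is a minimal Python sketch using only the assumptions stated above; the variable names are mine, not part of the model:

```python
# Check of the toy selectivity model. Assumptions from the text:
# 15 random clues out of 300 facts, selectivity 100 per usable clue;
# ANARCHY can use its 10 non-structural clues, while REGRES can use a clue
# only when the other two members of its threesome are also among the 15.

clues_total, clues_held, per_clue = 300, 15, 100

# ANARCHY: ten independent clues, each dividing the result set by 100.
anarchy = per_clue ** 10                                   # 100^10 = 10^20

# REGRES: expected number of complete threesomes the user can form.
expected_usable = (9 / clues_total) * (8 / clues_total) * clues_held
regres = per_clue ** expected_usable                       # ~1.06

print(anarchy == 10**20)        # True
print(round(regres, 2))         # 1.06
```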
There is an implicit assumption here that you are looking for information that others have structured, and this argument in favor of ANARCHY becomes much less strong without this assumption. I feel obligated to stress once again that I do not advocate low structure over high structure, but I do advocate having the flexibility to match the amount of structure to the needs of the moment. Only with such flexibility can one hope to use all of the 15 fragments that happen to be possessed. == The Syntax In More Detail == What's needed is a naming system intended to reflect just the structure inherent in the information, whatever that structure might be, rather than restructuring the information to fit the naming system. === Orthogonal or Unoriginal Primitives and Features === There are many primitives that the ultimate naming system would include but which I will not discuss here: macros, OR, weight for subnames and AND-OR connectors [Fox], rules, constraints, indirection, links, and others. I have tried to select only those aspects in which my approach differs from the standard approach. Unifying the namespace does not require unifying automatic name generation, and those who read the [Blair] vs. [Salton] controversy likely understand my concluding that whatever the benefits might be of unifying automatic name generation, it is not feasible now, and won't be feasible for a long time to come. The names one can assign an object are kept completely orthogonal from the contents of the object in the implementation of this naming layer. It is up to the owner of the object to name it, and it is up to him to use whatever combination of autonaming programs and manual naming best achieves his purpose. He may name it on object creation, and he may continually adjust its various names throughout its lifetime. See the section defining the "Key_Object primitive" for a discussion of why names should be thought of this way. 
Technically, object creation only requires that the object be given a Storage_Key. In practice, most users will, in the same act that creates the object, also associate the object with at least one name that will spare them from directly specifying the Storage_Key in hex the next time they make a reference to it. Applications implementing external name spaces can interact with the storage layer by referencing just the Storage_Key. Namesys will provide a manual naming interface, and the API that autonaming programs need to plug into. Companies such as Ecila will provide autonamers for various purposes. Ecila is implementing a program which scans remote stores and creates links to them in the unified name space, but leaves the data on the remote stores. Other programs may also be implemented to perform this general function. To be more specific, the Ecila search engine scans the web for documents in French, and uses the filesystem as an indexing engine. However, they are writing their engine to be a general purpose engine, they have sold support for it and the addition of extensions to it to other search engine companies, and it is open source. For now we are simply functioning as part of their engine, and the interface is by web browser; at some point we may be able to add their functionality to the namespace. While Microsoft's attempt to blur the distinction between the filesystem name space and the web namespace is in its implementation one more of appearance than substance, it is surely the right thing to do for Linux as well in the long run. We should simply make our integration one with substance and utility, rather than integrating mostly the look and feel. When the store is external to the primary store for the namespace, stale names can be an issue with no clean resolution. That said, unification at just the naming layer is, in a real rather than ideal world, often quite useful, and so we have Internet search engines.
GUI based naming is beyond the scope of this paper, except to mention that it is common for GUI namespaces to be designed such that they are not well integrated with the other namespaces of the OS. They are often thought to necessarily be less powerful, but proper integration would make this untrue, as they would then be additional syntaxes, not substitutes. These additional syntaxes should possess closure within the general name space, and thereby be capable of finding employment as components of compound names like all the other types of names. The compound names should be able to contain both GUI and non-GUI based name components. Integration would make GUI naming simply the aspect of naming that applies to what is present in the visual cache of the screen, and to how to manage and display that cache most effectively.

=== Vicinity Set Intersection Definition (Also Called Grouping) ===

Suppose you have a set X of objects. Suppose some of these objects are associated with each other. You can draw them as connected in a graph. Let the vicinity of an object A be the set of objects associated with A. Let there be a set of query objects Q. Then the set vicinity intersection of Q is the set of objects which are members of all vicinities of the objects in Q. When thinking of this as a data model, it seems natural to use the term vicinity set intersection. When thinking of this syntactically, it seems natural to use the term "grouping", because it implies that the subnames are grouped together without the order of the subnames being significant. There is exactly one data model primitive (set vicinity intersection) possessing exactly one syntax (grouping), and I rarely intend to distinguish data model primitive from syntax primitive (I can be criticized for this), and yet I use both terms for it; forgive me.
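The definition above can be sketched in a few lines of Python. The association graph, object names, and function name below are invented purely for illustration:

```python
# A minimal sketch of vicinity set intersection over a hypothetical
# in-memory association graph. vicinity[A] is the set of objects
# associated with A; all names and associations here are invented.

vicinity = {
    "reindeer": {"santa-story", "wildlife-article"},
    "chimneys": {"santa-story", "masonry-howto"},
    "man":      {"santa-story", "wildlife-article", "biography"},
}

def vicinity_intersection(query_objects):
    """Return every object that lies in the vicinity of ALL query objects."""
    return set.intersection(*(vicinity[q] for q in query_objects))

print(vicinity_intersection(["reindeer", "chimneys", "man"]))  # {'santa-story'}
```

Note that the result set shrinks as query objects are added, which is exactly the selectivity behavior the toy model in the previous section quantifies.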
=== Synthesizing Ordering and Grouping ===

I am going to describe a toy naming system that allows focusing on how best to combine grouping and ordering into one naming system. This synthesis will contain the core features of the hierarchical, keyword, and relational systems as functional subsets. It consists of a few simple primitives, allowed to build on each other. It sets the discussion framework from which our project will, over many years, evolve a real naming system out of its current storage layer implementation. Resolving the second component of an ordering is dependent on resolving the first --- unlike in set theory. In set theory one can derive the ordered set from the unordered set, but because resolving the name of the second component depends on the first component, one cannot do so in this naming system. For this reason it can well be argued that this naming system is not truly set theory based. Now that I have mentioned this difference I will start to call them grouping and ordering, rather than unordered and ordered set. These two primitives take other names as sub-names, and allow the user to construct compound names. Either the order of the subnames is significant (ordering), or it isn't (grouping), and thus we have the two different primitives. Because I have myself found that BNFs are easier to read if preceded by examples, I will first list progressively more complex examples using the naming system, and then define it formally. The examples, and the simplified syntax, use / rather than : or \, but this is of no moment.

Examples:

/etc/passwd

Ordering and grouping are not just better; file system upward compatibility makes them cheaper for unifying naming in OSes based on hierarchical file systems than a relational naming system would be. This approach is fully upwardly compatible with the old file system.
Users should be able to retain their old habits for as long as they wish, engage in a slow comfortable migration, and incorporate the new features into their habits as they feel the desire. Elderly programs should be untroubled in their operation. Many worthwhile projects fail because they emphasize how much they wish to change rather than asking of the user the minimal collection of changes necessary to achieve the added functionality.

[dragon gandalf bilbo]

FIGURE 3. Graphical representation of the ascii name on the left. Mr. B. Bizy is looking for a dimly remembered story (The Hobbit by Tolkien) to print out and take with him for rereading during the annual company meeting.

case-insensitive/[computer privacy laws]

FIGURE 4. Graphical representation of the ascii name on the left.

When one subname contains no information except relative to another subname, and the order of the subnames is essential to the meaning of the name, then using ordering is appropriate. This most commonly occurs when syntax barriers are crossed, that is, when a single compound name makes a transition from interpreting a subname according to the rules of one syntax to interpreting it according to the rules of another syntax. Ordering is essential at the boundary between the name of the new syntax as expressed in the current syntax, and the name to be interpreted according to that new syntax. Some researchers use the term context rather than syntax. The pairing of a program or function name and the arguments it is passed is inherently ordered. While that is usually the concern of the shell, when we use a variety of ordering functions to sort Key_Objects of different types it affects the object store. In this example the ordering serves as a syntax barrier. Case-insensitive is the unabbreviated name of a directory that ignores the distinction between upper and lower case.
For Linux compatibility this naming layer is case sensitive by default, even though I agree with those who think that it would be better were it not.

[my secrets]/[love letter susan]

FIGURE 5. Graphical representation of the ascii name on the left.

Devhuman (that's the account name he chose) is the company's senior programmer. Six years ago he wrote a love letter to Susan, which he put in his read-protected secrets directory. (He never found the nerve to send it to her.) He's looking for it so he can rewrite it, and then consider sending it. Security is a particular kind of syntax barrier (you have to squint a bit before you can see it that way). Here the ordering serves as a security barrier. (He certainly wouldn't want anyone to know that an object owned by him with attributes love letter susan existed.)

[subject/[illegal strike] to/elves from/santa document-type/RFC822 ultimatum]

FIGURE 6. Graphical representation of the search for santa's ultimatum.

Devhuman knows his object store cold. He is looking for something he saw once before; he knows that it was auto-named by a particular namer he knows well (perhaps one whose functionality is similar to the classifier in [Messinger]), and he knows just what categorizations that namer uses when naming email. Still, he doesn't quite remember whether the word 'ultimatum' was part of the subject line, the body, or even just elvish manual supplementation of the automatic naming. Rather than craft a query carefully specifying what he does and does not know about the possible categorizations of ultimatum, he lazily groups it. If Devhuman's object store is implemented using this naming system with good style, someone less knowledgeable about the object store would also be able to say: [santa illegal strike ultimatum elves] and perhaps get some false hits as well as the desired email (instead of finding mail from santa, perhaps finding the elvish response).
Notice that if you delete the 'illegal' and 'ultimatum' to get [subject/strike to/elves from/santa document-type/RFC822] the query is structurally equivalent to a relational query. Many authors (e.g. semantic database designers) have written papers with good examples of standard column names which might be worth teaching to users. So long as they are an option made available to the user rather than a requirement demanded of the user, the increased selectivity they provide can be helpful.

[_is-a-shellscript bill]

FIGURE 7. Graphical representation of the ascii name on the left.

This name finds all shellscripts associated with bill. Names preceded by _ are pruners. Pruners are analogous to the predicate evaluators of relational database theory. If you have read papers distinguishing between recognition and retrieval, pruners are a recognition primitive. They are passed a list of objects, and return the subset of that list which matches some criteria. They are a mechanism appropriate for when a nonlinear search method that can deliver the desired functionality is either impossible, or not supported by existing indexes. There are many useful names for which we cannot do better than linear time search algorithms (perhaps simply as a result of incomplete indexing). _is-a-shellscript checks each member of its list to see if it is an executable object containing solely ascii. The user can use it just like any other Key_Object within an association; it will prune the results of the grouping. Since set intersections are commutative, its order within the grouping has no meaning, and optimizers are free to rearrange it.

=== The Formal Definitions ===

<Object Name> ::= <Grouping> | <Ordering> | <Key_Object> | <Storage_Key> | <Orthogonal and Unoriginal Primitives I Will Not Define Here> ;

See the section listing orthogonal and unoriginal primitives for a discussion of what primitives I left out of the definitions of this grammar that are necessary to a real world working system.
The name resolver has a method for converting all of the primitives into <Storage_Key>s, and when processing compound names it first converts the subnames into <Storage_Key>s; an object may have null contents, and serve purely to embody structure. This allows the use, as a component of a grouping or ordering, of anything for which anyone can invent a way of letting the user find an <Object Name>, together with a method for the resolver to convert that <Object Name> into a <Storage_Key>. In a word, closure. Extensible closure. Compound names are interpreted by first interpreting the subnames that they are constructed from. At each stage of subname interpretation an <Object Name> is converted into a <Storage_Key> for the object that it resolves to. The modules that implement the grouping and ordering primitives do not interpret the subnames; they merely pass them to the naming system, which returns the <Storage_Key>s they resolve to. It was a long discussion which led to the use of storage keys rather than objectids. A storage key differs from an objectid in that it gives the storage layer directions as to where to try to locate the object in the logical tree ordering of the storage layer. If the logical location changes, then in the worst case we leave a link behind, and get an extra disk access like we get with an inode. (Inode numbers are functionally objectids.) In the better case, the repacker eventually comes along and changes all references by key to the new location, at least for all objects that have not given their key to external naming systems, which the repacker cannot repack. A <Storage_Key> is assigned by the system at object creation, and serves the purpose of allowing the system to concisely name the object, and of providing hints to the storage layer about which objects should be packed near each other. The user does not directly interact with the <Storage_Key> any more often than C programmers hardcode pointers in hex.
The packing locality of keys may be redefined.

== The Primitives ==

<Key_Object>

A description of the contents of an object using the syntax of the current directory. For objects used to embody keywords this may be the keyword in its entirety. If it contains spaces, etc., it must be enclosed in quotes. Note that making it easy for third parties to add plug-in directory types is part of Namesys's current contract with Ecila. Ecila wants space efficient directories suitable for use in implementing a term dictionary and its postings files for their Internet search engine.

Example: [reindeer chimneys presents man]

In this example 'reindeer', 'chimneys', 'presents', and 'man' are the contents of objects associated with the Santa Claus story. Each of them is searched for by contents, and when found they are converted into their Storage_Keys; then the grouping algorithm is fed the four Storage_Keys. The grouping module then looks in the object headers of the four objects, gets the four sets of objects the Key_Objects group to, and performs a set intersection. Besides greater closure, another advantage of storing Key_Objects as objects is that non-ascii Key_Objects and ordering functions can be implemented as a layer on top of the ascii naming system, allowing the user to interact with the naming system by pressing hyperbuttons, drawing pictures, making sounds, and supplying other non-ascii Key_Objects that the higher layers convert into Storage_Keys. There are endless content description techniques. If the directory owner supplies an ordering function for the Key_Objects in a directory, one can generate a search index for the directory using a directory plug-in which is fully orthogonal to the ordering function, though perhaps slower in some cases than one that is tailored for the ordering function. Users will find it easier to write ordering functions than index creation objects, and will not always need the speed of specialized indexes.
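The resolution sequence just described (look up each keyword by contents, fetch the set of objects it groups to from its object header, intersect) can be sketched as follows. Every key value and table here is hypothetical, invented for illustration, and not the actual on-disk format:

```python
# Hypothetical sketch of Key_Object resolution for [reindeer chimneys presents man].
# Each keyword is first looked up by contents to obtain a Storage_Key, then the
# sets of objects each Key_Object groups to (from its object header) are intersected.

contents_index = {            # contents -> Storage_Key (values invented)
    "reindeer": 0x11, "chimneys": 0x12, "presents": 0x13, "man": 0x14,
}
object_header = {             # Storage_Key -> set of Storage_Keys it groups to
    0x11: {0x99, 0x42}, 0x12: {0x99}, 0x13: {0x99, 0x55}, 0x14: {0x99, 0x42},
}

def resolve_grouping(keywords):
    keys = [contents_index[w] for w in keywords]              # search by contents
    return set.intersection(*(object_header[k] for k in keys))  # grouping module

result = resolve_grouping(["reindeer", "chimneys", "presents", "man"])
# result == {0x99}: only the Santa Claus story lies in all four vicinities
```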
We will need one ordering function for ascii text, another for numbers, another for sounds, perhaps someday one even for pictures of faces (perhaps to be used by a law enforcement agency constructing an electronic mug book, or a white pages implementation), etc. No system designer can provide all the different and sometimes esoteric ordering functions which users will want to employ. What we can do is create a library of code from which users can construct their own ordering functions and their own directory plug-ins, and this is the approach we are taking on behalf of Ecila. For an Internet search engine one wants what is called a postings file, which is like a directory in that there is no need to support a byte offset, and one frequently wants to efficiently perform insertions into it.

 <Grouping>       ::= [<Unordered List>] ;
 <Unordered List> ::= <Unordered List> <Unordered List>
                    | <Object Name>
                    | <Pruner> ;
 <Pruner>         ::= _<Object Name>

A <Grouping> is a list of object names and pruners whose order has no meaning. Every object has a list of objects it groups to (associates with, in neural network idiom) in its object header. A grouping is interpreted by performing a set intersection of those lists for every object named in the grouping; in the sense of the data model, this is a set vicinity intersection. Grouping is not transitive:

 [A] => B and [B] => C does not imply [A] => C, though it does imply that [[A]] => C.

A pruner is an <Object Name> which has been preceded with an _ to indicate that the object described should be passed a list of objects named by the rest of the grouping, executed, and it will return a subset of the list it was passed.
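The grouping interpretation just described, intersect the groups-to sets from each named object's header, then let each pruner return a subset, can be sketched as follows. This is an illustrative in-memory model with invented data, not the actual implementation (which would operate on Storage_Keys and on-disk object headers):

```python
# Illustrative model with invented data: evaluate a grouping
# [name1 name2 ... _pruner] by intersecting the groups-to set stored in
# each named object's header, then letting each pruner keep a subset.
# A pruner must judge each member independently of the others, so the
# order the pruners run in cannot affect the result.

# hypothetical groups-to sets, as if read from object headers
groups_to = {
    "reindeer": {"santa", "lapland"},
    "chimneys": {"santa", "sweep"},
    "presents": {"santa", "birthday"},
    "man":      {"santa", "sweep"},
}

def evaluate_grouping(names, pruners=()):
    result = set.intersection(*(groups_to[n] for n in names))
    for prune in pruners:                 # order of application must not matter
        result = {obj for obj in result if prune(obj)}
    return result

# [reindeer chimneys presents man] resolves to the Santa Claus story
assert evaluate_grouping(["reindeer", "chimneys", "presents", "man"]) == {"santa"}
# a pruner (written _name in the syntax) filters the intersection
assert evaluate_grouping(["chimneys", "man"],
                         pruners=(lambda obj: obj != "sweep",)) == {"santa"}
```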
Whether a member of the set is in the returned subset must be fully independent of what the other members of the set were, or else the results become indeterminate after application of a query optimizer, since with an optimizer in use there is no guarantee of the order of application of the pruners.

 <Ordering> ::= <Object Name>/<Object Name>
              | <Object Name>/<Custom Programmed Syntax>
 <Custom Programmed Syntax> ::= Varies; provides an extensibility hook.

An ordering is a pairing of names, with the order representing information. The first component of the ordering determines the module to which the second component is passed as an argument. In contrast, a grouping first converts all subnames to Storage_Keys by looking through the same current directory for all of them in parallel, and then does its set intersection with the subdescriptions already resolved.

Example: In resolving [my secrets] / [love letter susan] the system would look for the objects with contents my and secrets, find both of them, and do a set intersection of all of the objects those two objects both group to (are associated with). This will allow it to find the [my secrets] directory, inside of which it will look for the three objects love, letter, and susan. It will then extract from their object headers the sets of objects those three words ('love', 'letter', and 'susan') group to, and do a set intersection which will find the desired letter. The desired letter is not necessarily inside the [my secrets] directory, though in this case it probably is. A directory is an object named by the first component of an ordering, to which the second component is passed, and which returns a set of Storage_Keys. One can in principle use different implementations of the same directory object without impacting the semantics and only affecting performance, as is often done in databases.
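The two-step resolution in the example above can be sketched like this. All data and names here are invented for illustration; in particular the directory lookup is reduced to a stub, where a fully standard directory would itself perform a grouping intersection over its own index:

```python
# Hypothetical sketch: an ordering first/second resolves 'first' (a
# grouping) to a directory object, then hands 'second' to that
# directory's own interpretation module, which returns storage keys.

# groups-to sets as they might appear in object headers
groups_to = {
    "my":      {"dir_secrets", "dir_stuff"},
    "secrets": {"dir_secrets"},
}

# each directory object supplies its own lookup module; this stub
# stands in for a standard directory's index traversal
directories = {
    "dir_secrets":
        lambda subname: {"key_letter"} if subname == ["love", "letter", "susan"] else set(),
}

def resolve(first, second):
    # resolve the grouping [my secrets] by set intersection
    candidates = set.intersection(*(groups_to[w] for w in first))
    (directory,) = candidates          # assume the grouping resolved uniquely
    # pass the second component to the directory's module
    return directories[directory](second)

assert resolve(["my", "secrets"], ["love", "letter", "susan"]) == {"key_letter"}
```

Swapping in a different implementation for the `dir_secrets` module would change performance but not semantics, which is the point the last sentence above makes.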
There are flavors of directories: Custom programmed directories, aka filters, are any executable program that will return a Storage_Key when executed and fed the second component as an argument. They provide extensibility. (They are the ordered counterpart of pruners.) Another term for them is filter directories. Custom programmed directories whose name interpretation modules aren't unique to them will contain just the name of the module (filter), plus some directory dependent parameters to be passed to the module. It should be considered merely a syntax barrier directory, and not a fully custom programmed directory, if those parameters include a reference to a search tree that the module operates on, and if that search tree adheres to the default index structure. The connotations conveyed by the term 'filter' of there being an original which is distorted are not always appropriate, but in honesty this is not an issue about which we deeply care.

Syntax barrier directories allow you to describe the contents of the objects they contain with a syntax different from their parents'. Except for being sorted by a different ordering function, the indexes of syntax barrier directories are standard in their structure, and use a standard index traversal module. The index traversal module is ordering function independent. There must be an ordering function for every <Key_Object> employed within a given syntax barrier directory. By contrast, a <Custom Programmed Syntax> could be anything which the syntax module somehow finds an object with, possibly even creating the object in order to be able to find it.

To cross a security barrier directory the user must use an ordered pair of names with the security barrier as the first member of the pair, and he must satisfy the security module of the secured directory. A security barrier directory may be both a security and a syntax barrier directory, or the security barrier directory may share the syntax module of its parents.
Fully standard directories are those built using the default directory module, and adding structure is their only semantic effect.

There is an aspect of customization which is beyond the scope of this paper, in which one customizes the items employed by the storage layer to implement files and directories. That is, files and directories are stored by composing them of items, and these items have different types. We are now creating the code for packing and balancing arbitrary types of items using item handlers and object oriented balancing code, so as to make it easier to extend our filesystem.

=== Ordering can be implemented more efficiently than grouping ===

The set intersections performed in evaluating the grouping primitive are normally much more expensive computationally than performing the classical filesystem lookup. Imposing excess structure on one's data does not just at times reduce the cost of human thinking :-), it can be used to reduce the cost of automated computation as well. When the cost to a user of learning structure is less important than the burden on the machine, use of highly ordered names is often called for.

=== The Motivation for Different Syntactic Treatment of Ordering and Grouping, and Some of the Deeper Issues Revealed by the Difference ===

An important difference between grouping and ordering affects syntax. It allows us to represent an ordering with a single symbol ('/') placed between the pair, but requires two symbols ('[' and ']') for each grouping. Imagine using < and > as a two symbol delimiter style alternative notation for ordering:

 <<father-of mother-of> sister-of>
 = <father-of <mother-of sister-of>>
 = <father-of mother-of sister-of>
 = father-of/mother-of/sister-of

All of the expressions above are equivalent in referring to the paternal great aunt of the person who is the current context.
The ones using nested pairs of symbols to enclose pairs of subnames imply a false structure that requires the user to think to realize the first two expressions are equivalent. The fourth is the notation this naming system employs.

Grouping is different: Fast Acting Freddy is looking through the All-LA Shopping Database for a single store with black reebok sneakers, a green leather jacket, and a red beret so that he can dress an actor for a part before the director notices he forgot all about him.

 [[black reebok sneakers] [green leather jacket] [red beret]]

is not equivalent to

 [black reebok sneakers green leather jacket red beret]

which equals

 [red sneakers black reebok jacket green beret]

Ordering is not algebraically commutative (father-of/mother-of is not equivalent to mother-of/father-of). Groupings are algebraically commutative ([large red] = [red large]).

== Style ==

As a general principle, a more restricted system can avoid requiring the user to repeatedly specify the restrictions, and if the user has no need to escape the restrictions then the restricted system may be superior. This is why "4GLs", which supply the structure for the user's query, are useful for some applications. They are typically implemented as layers on top of unrestricting systems such as this one.

This paper has addressed issues surrounding finding information, particularly when the user's clues are faint. When supporting other user goals, such as exploring information, adding structure through substantial use of ordering can be helpful. [Marchionini][McAleese]. When the user goal is finding, one should assume that of all the fragments of information about an object, the user has some random subset of them. The goal is to allow the user to use that random subset in a name, whatever that subset might be. Some of that subset will be structural fragments.
While requiring the user to supply a structure fragment is as foolish as requiring him to supply any other arbitrary fragment, allowing him to is laudable. In the best of all worlds the object store would incorporate all valid possible structurings of Key_Objects. The difficulty in implementing that is obvious. [Metzler and Haas] discuss ways of extracting structure from English text documents, and why one would want to be able to use that structure in retrievals. Unfortunately, there is an important difference between representing the structure of an English language sentence in a way that conveys its meaning, and representing it in a way that allows it to be found by someone who knows only a fragment of its semantic content. I doubt the wisdom of trying to advocate the use of more than essential structure in searching. You can allow users to avoid false structure; you cannot force them to. It is important to teach those creating the structure that if they group a personnel file with sex/female they should also group it with female. Type checking can impose structure usefully. Its implementation can enhance or reduce closure, depending on whether it is done right.

=== When To Decompound Groupings ===

There are dangers in excessive compounding of compound groupings analogous to those of excessive ordering. Let's examine two examples of compound groupings, both of which are valid both semantically and syntactically. One of them can be "decompounded" with moderate information loss, and the other loses all meaning if decompounded.

Example: Finding a loquacious Celtic textbook salesman who told you in excruciating detail about how he was an ordinance researcher until one day he went to a Grateful Dead concert.

 [[Celtic textbook salesman] [ordinance researcher]]

vs.

 [celtic textbook salesman ordinance researcher]

These two phrasings of the same query are not equivalent, but they are "close."
Our second example is the one in which Fast Acting Freddy tries to find a suspect by the objects he is associated with:

 [[black reebok sneakers] [green leather jacket] [red beret]]

vs.

 [black reebok sneakers green leather jacket red beret]

These two are not at all "close." The difference between the two examples of inequivalencies is that the subdescriptions within the second example describe objects whose existence within the object store, independent of the store described, is worthwhile. The first does not, and it is more reasonable to try to design so that the "decompounded" version of the query is used. False hits will occur, but for large systems that's better than asking the user to learn structure.

A higher level user interface might choose to present only one level to the user at a time, and then once the user confirms that a subdescription has resolved properly it would let him incorporate it into a higher level description. There might be 6 models of [black reebok sneakers], and Fast Acting Freddy should have the opportunity to click his mouse on the exact model, and have the interface substitute that object for his subdescription. Using such an interface an advanced user might simultaneously develop several subdescriptions, refine and resolve them, and then use the mouse to draw lines connecting them into a compound grouping. Closure makes it possible for that to work.

== Examples of Creating Associations ==

<- creates an association between all of the objects on the left hand side and all of the objects on the right hand side. A - B is the set difference of A and B, and it resolves to the set of objects in A except for those that are in B. A & B resolves to the set intersection of A and B, the objects that are in both A and B. [A B] = [A] & [B], by definition.
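These operators can be sketched over an in-memory association table. This is a hypothetical illustration with invented objects A and B, not the real resolver:

```python
# Hypothetical in-memory association table illustrating the operators
# above; A, B and their members are invented examples.

assoc = {}                                  # object -> set of objects it groups to

def associate(lhs, rhs):                    # the <- operator
    for obj in lhs:
        assoc.setdefault(obj, set()).update(rhs)

def g(name):                                # [name]: the set name groups to
    return assoc.get(name, set())

def grouping(names):                        # [A B] = [A] & [B], by definition
    return set.intersection(*(g(n) for n in names))

associate(["A"], ["x", "y", "z"])           # A <- (x, y, z)
associate(["B"], ["y", "z", "w"])           # B <- (y, z, w)

assert g("A") & g("B") == {"y", "z"}        # A & B: set intersection
assert g("A") - g("B") == {"x"}             # A - B: set difference
assert grouping(["A", "B"]) == g("A") & g("B")
```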
 animal <- (lives, moves)
 mammal <- ([animal], animal, `warm blooded')
 cat <- ([mammal], hypernym/mammal, mammal, meronym/fur, fur, meronym/whiskers, whiskers, hypernym/quadruped, quadruped, capability/purr, purr, capability/meow, meow)
 Basil <- (owner/Nina, Nina, [siamese], siamese, clever, playful, brave/overly, brave, 'toilet explorer')
 bag <- ([container], container, consists-of/`highly flexible material', `highly flexible material')
 backpack <- ([bag], shoulderstrap/quantity/2, shoulderstrap, college-student, holonym/backpacker, meronym/shoulderstrap)
 mould <- ([fungi] - green/not, furry, `grows on'/surfaces/moist, `killed by'/chlorine)
 fungi <- ([plant], plant, leaves/no, flowers/no, green/not)
 bird <- ([vertebrate], vertebrate, flies, feathers)
 penguin <- ([bird] - flies, bird, hypernym/bird, swims, Linux, [Linux (mascot, symbol)])
 siamese <- ([cat], cat, hair/short, short-hair)

Notice how we don't associate siamese with short despite associating it with hair/short, but we do associate Basil with Nina as well as with owner/Nina.

 small <-0 little

The above means that small and little are synonyms, and are to be treated as 0 distance away from each other for vicinity calculation purposes. In other, traditional Unix, words, they are hardlinked together. Creating a serious ontology is not our field or task, but worth doing. The reader is referred to WordNet (free), and Cyc by Doug Lenat (proprietary). While we will focus on implementing primitives that allow for creating better ontologies, we are happy to work with persons interested in contributing or porting an ontology.

== Other Projects Seeking To Increase Closure In The OS ==

=== AT&T's Plan 9 ===

[Plan 9] is being produced by the original authors of Unix at AT&T research labs. It has influenced CORBA, and /proc is a direct steal from it to Linux. Their major focus is on integration. Their major trick for increasing integration is unifying the name space.
Name spaces integrated into the Plan 9 file system include the status, control, virtual memory, and environment variables of running processes. They have a hierarchical analog to what the relational culture calls constructing views, which the Plan 9 culture calls context binding.

=== Microsoft's Information At Your Fingertips ===

Plan 9 ignores integration of application program name spaces, concentrating on OS oriented name spaces. Microsoft's "Information at Your Fingertips" name space integration effort appears to be taking the other approach, and focusing on integrating the name spaces of the various Microsoft applications via OLE and Structured Storage. The application group at Microsoft has long been better staffed and funded than the OS group, and FS developers have long preferred to simply ignore the needs of application builders generally. The primary semantic disadvantages of Microsoft's approach are primitives selected with insufficient care, a lack of closure, and the use of an object oriented rather than set oriented approach in both naming syntax and data model. Realistically, one can say that folks within Microsoft have often made statements favoring name space integration, and in various areas have successfully executed on it, but on the whole I rather suspect that the lack of someone in marketing making a business case for $X in revenue resulting from name space integration has crippled name space integration work at commercial OS producers generally, including MS.

==== Internet Explorer ====

Internet Explorer attempts to unify the filesystem and Internet namespaces. At the time of writing, the unity is so superficial, with so little substance, that I would describe it as having the look and feel of integration without most of the substance. Perhaps this will change.
==== Microsoft's Well Known Performance Difficulties ====

Despite having many of the leading names in the industry on their payroll, they have somehow managed to create a file system implementation with performance so terrible that it is, for the Unix customer base, a significant consideration contributing to hesitation in moving to NT. It may well have the worst performance of any of the major OS file systems. Their implementation of OLE's structured storage offers extremely poor performance, and their excuse that it is due to the incorporation of transaction concepts into their design is just a reminder that they did a poor job at that as well. They managed to implement something intended to store small objects within a file, yet implemented it such that it still suffers from 512-byte granularity problems, problems that they try to somewhat overcome by encouraging the packing of several objects within "storages", at horrible kludge cost.

=== Storage Layers Above the FS: A Sure Symptom The FS Developer Has Failed ===

When filesystems aren't really designed for the needs of the storage layers above them, and none of them are, not Microsoft's, not anybody's, then layering results in enormous performance loss. The very existence of a storage layer above the filesystem means that the filesystem team at an OS vendor failed to listen to someone, and that someone was forced to go and implement something on their own. You just have to listen to one of these meetings in which some poor application developer tries to suggest that more features in the FS would be nice; I heard one at a nameless OS vendor. The FS team responds to say disks are cheap, small object storage isn't really important, we haven't changed the disk layout in 10 years, and changing it isn't going to fly with the gods above us about whom we can do nothing.
At these meetings you start to understand that most people who go into filesystem design are persons who didn't have the guts to pursue a more interesting field in CS. There is a sort of reverse increasing returns effect that governs FS research: the more code becomes fixed on the current APIs, the more persons in the field react with fear to any thought of FS semantics being other than a dead research topic, the less research gets done, and the fewer persons of imagination see a reason to enter the field. Every time one vendor gets a little forward in adding functionality, the other vendors go on a FUD campaign about it breaking standards and therefore being dangerous for mission critical usage. This is a field in which only performance research is allowed, and every other aspect is simply dead. Namesys seeks to raise the dead, and is willing to commit whatever unholy acts that requires.

There is no need for two implementations of the set primitive, one called directories, the other called a file with streams, each having a different interface. File systems should just implement directories right, give them some more optional features, and then there is no need at all for streams. If you combine allowing directory names to be overloaded to also be filenames when acted on as files, allowing stat data to be inherited, allowing file bodies to be inherited, and implement filters of various kinds, then in the event that the user happens to need the precise peculiar functionality embodied by streams, they can have it by just configuring their directory in a particular way. There was a lengthy Linux-kernel thread on this topic which I won't repeat in more detail here.

The tree architecture of the storage layer of this FS design will lend itself to a distributed caching system much more effectively than the Microsoft storage layer, in part due to its ability to cache not just hits and misses of files, but to cache semantic localities (ranges).
For more on this topic see later in this paper.

=== Rufus ===

The Rufus system [Messinger et al.] indexes information while leaving it in its original location and format. While it does allow the user to create a unified name space, it does not choose to integrate that name space into the operating system. Even so, it is immensely useful in practice, and strongly hints at what the OS could gain if it had a more than hierarchical name space with a data model oriented towards what [Messinger] calls "semi-structured information", such as you find in the RFC822 format for email. When you have 7000 pieces of mail, and linear searching the mail with a utility like grep takes 10 minutes, it is nice to be able to quickly keyword search via inverted indexes for the mail whose from: field contains billg and that has the words "exclusive" and "bundling" in the body of the message, as you hurriedly search for an old email just before an appearance at court.

=== Semantic File System ===

The Semantic File System comes closest to addressing the needs I have described. It is a Unix compatible file system with more than hierarchical naming (attribute based is the term they use). Its data model unfortunately has the important flaw of lacking closure (in it, names of objects are not themselves objects). In my upcoming discussion of the unnecessary lack of closure in hypertext products, notice that the arguments apply to the Semantic File System (and so I won't duplicate them here).

=== OS/400 ===

IBM's OS/400 employs a unified relational name space. The section of this paper entitled A Naming System Should Reflect Rather than Mold Structure will cover its problems of forcing false structure. Inadequate closure due to mandatory type checking is another source of difficulties for it.
While users moan about these two unnecessary design flaws, the essence of the opinions AS/400 partisans have expressed to me has been that the unification of its name space is a great advantage that OS/400 has over Unix. I claim these users were right, and later in this paper will propose doing something about it.

== Conclusion ==

While I spent most of this paper on why adding structure to information can be harmful, particularly when it is intended to be found by others sifting through large amounts of other information, this was purely because it is a harder argument than why deleting structure is harmful. My goal was not to be better at unstructured applications than keyword systems, or better at structured applications than the hierarchical and relational systems --- the goal is to be more flexible in allowing the user to choose how structured to be, while still being within a single name space. I claimed that multiple fragmented name spaces cannot match the power and ease of name spaces integrated with closure: closure makes a naming system far more powerful by increasing its ability to compound complex descriptions out of simpler ones. The strong points of this naming system's design are various forms of generalizing abstractions already known to the literature, for greater closure.

== Acknowledgments ==

David P. Anderson and Clifford Lynch helped enormously in rounding out my education, and improving my paper. Their generosity with their time was remarkable. David P. Anderson was simply a great professor, and it was a privilege to work with him. Brian Harvey informed me that it wasn't too obvious to mention that an object store should be unified. Cimmaron Taylor provided me with many valuable late night discussions in the early stages of this paper. I would like to thank Bill Cody and Guy Lohman of the database group at the IBM Almaden Research Center for a wonderful learning experience.
Vladimir Saveliev kept this file system going when others fell by the wayside. He started as the most junior programmer on the team, and through sheer hard work and dedication to excellence outshone all the other more senior researchers. Of course, after some time he could no longer be considered a junior programmer.

NOTE: See also the DARPA funded, but not endorsed, Reiser4 Transaction Design Document and Reiser4 Whitepaper.

== References ==

1. Blair, David C. and Maron, M. E. "An Evaluation of Retrieval Effectiveness for a Full-Text Document-Retrieval System", Communications of the ACM, v28 n3, March 1985, pp. 289-299.

2. Codd, E. F. "The Relational Model for Database Management: Version 2", Addison-Wesley, c1990. Not recommended as a textbook (Date's is better for that), but worthwhile if you want a long paper by Codd. Notice that he places greater emphasis on closure, and design methodology principles in general, than designers of other naming systems such as hypertext.

3. Date, C. J. "An Introduction to Database Systems", 4th ed., Addison-Wesley, Reading, Mass., c1986. Contains a well written substantive textbook sneer at the problems of hierarchical naming systems, and a well annotated bibliography.

4. Curtis, Ronald and Larry Wittie. "Global Naming in Distributed Systems", IEEE Software, July 1984, pp. 76-80.

5. Feldman, Jerome A., Mark A. Fanty, Nigel H. Goddard and Kenton J. Lynne. "Computing with Structured Connectionist Networks", Communications of the ACM, v31, February 1988, p. 170(18).

6. Fox, E. A. and Wu, H. "Extended Boolean Information Retrieval", Communications of the ACM, 26, 1983, pp. 1022-1036.

7. Gallant, Stephen I. "Connectionist Expert Systems", Communications of the ACM, v31, February 1988, p. 152(18).

8. Gates, Bill. Comdex '91 speech on "Information at Your Fingertips", available for $8 on videotape from Microsoft's sales department.

9. Gifford, David K., Jouvelot, Pierre, Sheldon, Mark A., O'Toole, James W. Jr. "Semantic File Systems", Operating Systems Review, Volume 25, Number 5, October 13-16, 1991. They demonstrated that extending Unix file semantics to include nonhierarchical features is useful and feasible. Unfortunately, their naming system lacks closure.

10. Gilula, Mikhail. "The Set Model for Database and Information Systems", 1st edition, Addison-Wesley, c1994. Provides a set theoretic database model in which relational algebra is shown to be a special case of a more general and powerful set theoretic approach.

11. Joint Object Services Submission (JOSS), OMG TC Document 93.5.1.

12. Marchionini, Gary and Shneiderman, Ben. "Finding Facts vs. Browsing Knowledge in Hypertext Systems", Computer, January 1988, p. 70.

13. McAleese, Ray (ed.). "Hypertext: Theory into Practice", ABLEX Publishing Corporation, Norwood, NJ 07648.

14. Messinger, Eli, Shoens, Kurt, Thomas, John, Luniewski, Allen. "Rufus: The Information Sponge", Research Report RJ 8294 (75655), August 13, 1991, IBM Almaden Research Center.

15. Metzler and Haas. "The Constituent Object Parser: Syntactic Structure Matching for Information Retrieval", Proceedings of the ACM SIGIR Conference, 1989, ACM Press.

16. Nelson, T. H. "Literary Machines", self published by Nelson, Nashville, Tenn., 1981. Did much to popularize hypertext; at the time of writing he has still not released a working product, though competitors such as hypercard have done so with notable success.

17. Mozer, Michael C. "Inductive Information Retrieval Using Parallel Distributed Computation", UCSD ICS Report No. 8406, June 1984.

18. Pike, Rob and P. J. Weinberger. "The Hideous Name", AT&T Research Report.

19. Pike, Rob, Presotto, Dave, Thompson, Ken, Trickey, Howard, Winterbottom, Phil. "The Use of Name Spaces in Plan 9", available via ftp from att.com. Plan 9 is an operating system intended to be the successor to Unix, and greater integration of its name spaces is its primary focus.

20. Potter, Walter D. and Robert P. Trueblood. "Traditional, Semantic, and Hyper-Semantic Approaches to Data Modeling", Computer, v21, 1988, p. 53(11).

21. Rijsbergen, C. J. van. "Information Retrieval", 2nd ed., Butterworth and Co. Ltd., 1979.

22. Salton, G. "Another Look at Automatic Text-Retrieval Systems", Communications of the ACM, 29, 1986, pp. 648-656.

23. Smith, J. M. and D. C. Smith. "Database Abstractions: Aggregation and Generalization", ACM Transactions on Database Systems, June 1977, pp. 105-133.

24. http://www.win.tue.nl/~aeb/partitions/partition_types.html

[[category:Reiser4]]

The Naming System Venture

== Abstract ==

For too long the file system has been semantically impoverished in comparison with database and keyword systems. It is time to change! The current lack of features makes it much easier to use the latest set theoretic models rather than older models of relational algebra or hypertext. The current FS syntax fits nicely into the newer model. The utility of an operating system is more proportional to the number of connections possible between its components than it is to the number of those components. Namespace fragmentation is the most important determinant of that number of possible connections between OS components. Unix at its beginning increased the integration of I/O by putting devices into the file system name space. This is a winning strategy: let's take the file system name space, and eliminate the reasons why the filesystem is inadequate for what other name spaces are used for, one missing feature at a time. Only once we have done so will the hobbles be removed from OS architects, or even OS conspiracies. Yet before doing that, we need a core architecture for the semantics to ensure we end up with a coherent whole.
This paper suggests a set theoretic model for those semantics. The relational models would at times unacceptably add structure to information, the keyword models would at times delete structure, and purely hierarchical models would create information mazes. Reworking their primitives is required to synthesize the best attributes of these models in a way that allows one the flexibility to tailor the level of structure to the need of the moment. The set theoretic model I propose has a syntax that is upwardly compatible with Linux, MacOS, and DOS file system syntax, as well as with the CORBA naming layer.

This is a planning document for the next major version of ReiserFS, that is, a description of vaporware. It is useful to ReiserFS users and contributors who want to know where we are going, and why we are building all sorts of strange optimizations into the storage layer (and especially those who are willing to help shape the vision in the course of discussions on the {{listaddress}} mailing list....). Currently the storage layer for ReiserFS is working and useful as an everyday FS with conventional semantics. That storage layer is available as a GPL'd Linux kernel patch.

== Introduction ==

Many OS researchers have built hierarchical name spaces that innovate in their effect on the integration of the operating system (e.g. Plan 9 and their file system [Pike]). Relational and keyword researchers rightfully scorn hierarchical name spaces as 20 years behind the state of the art [Date], but pay little attention to integration of the operating system as a design objective in their own work, or as a possible influence on data model design. I won't go into that here. Limiting associations to single key words is an unnecessary restriction.

== A Naming System Should Reflect Rather than Mold Structure ==

The importance of not deleting the structure of information is obvious; few would advocate using the keyword model to unify naming.
What can be more difficult to see is the harm from adding structure to information; some do recommend the relational model for unifying naming (e.g. OS/400). By decomposing a primitive of a model into smaller primitives one can end up with a more general model, one with greater flexibility of application. This is the very normal practice of mathematicians, who in their work constantly examine mathematical models with an eye to finding a more fundamental set of primitives, in hopes that a new formulation of the model will allow the new primitives to function more independently, and thereby increase the generality and expressive power of the model. Here I break the relational primitive (a tuple is an unordered set of ordered pairs) into separate ordered and unordered set primitives. Relational systems force you to use unordered sets of ordered pairs when sometimes what you want is a simple unordered set. Why should a naming system match rather than mold the structure of information? For systems of low complexity, the reasons are deeply philosophical, which means uncompelling. And for multiterabyte distributed systems?... Reiser's Rule of Thumb #2: The most important characteristic of a very complex system is the user's inability to learn its structure as a whole. We must avoid adding structure, or guarantee that the user will be informed of all structure relevant to his partial information. Avoiding adding structure is both more feasible and less burdensome to the user. Hierarchical, relational, semantic, and hypersemantic systems all force structure on information, structure inherent in the system rather than the information represented. If a system adds structure, and the user is trying to exploit partial knowledge (such as a name embodies), then it inevitably requires the user to learn what was added before he can employ his partial knowledge. With complex systems, the amount added is beyond the capacity of users to learn, and information is lost. 
Example: "My name is Kali, your friendly technical support specialist for REGRES. Our system puts the Library of Congress online! How may I help you?" George doesn't know Santa Claus' name: "I'm trying to find the reindeer chimneys christmas man, and I can't get your system to do it."

FIGURE 1. Graphical representation of a typical simple unordered set that is difficult for relational systems.

Kali says: "OK, now let's define a query. Is-a equals man, that's easy. But reindeer? Is reindeer a property of this man?" "Uh no. I wish I could remember the dude's name. I read this story about him a long time ago, and all I can remember is that he had something to do with reindeer and chimneys. The story is on-line, somewhere." "Reindeer chimneys presents man, that's the sort of speech pattern I'd expect from a three-year-old," Kali corrects him. "Let's see if we can structure this properly. Is reindeer an instance-of this man? A member-of this man? It couldn't be a generalization of this man. Hmm..." "No! It's not that complicated. They just have something to do with him." "Pavlov would probably say you associate reindeer with this man, the way the unstructured mind of an animal thinks. But here in technical support we try to help our customers become more sophisticated. Is reindeer a property of this man?" "No. Try propulsion-provider-for." "Do you think that that was the schema the person who put the information in our system used?" "No. Shoot. I can think of a dozen different columns it could be under. But what are the chances that the ones I think of are going to be the same as the ones the dude who put the information in used?" Kali feels satisfaction. "Guess it can't be done, not if you can't structure your REGRES query properly. I'll put you down in my log as a closed ticket, 190 seconds to resolution, not bad." "A keyword system could handle reindeer chimneys christmas man," George grumbles as he stares in despair at his display.
Unfortunately, the Library of Congress is only one of REGRES' many reference aids. George could spend his life at it, and he'd never learn its schema. "But a keyword system would delete even necessary structure inherent to the information. It couldn't handle our other needs!" Kali says before she hangs up. In addition to the searcher's difficulties, having to manufacture structure by specifying the column for reindeer also adds unnecessary cognitive load to the story author's indexing tasks.

== A Few of the Other Approaches to This Problem ==

There is lurking at the heart of my approach a subtle difference between my analysis of naming and the analysis of at least some others. I started my research by systematically categorizing the different structures embodied by names, placing them into equivalency classes, and then picking one syntax out of each class of functionally equivalent naming structures, on the assumption that each of the equivalency classes has value. For example, I considered that languages sometimes convey structure by word endings (tags), and sometimes by word order, but while the syntax differs, the word order and word ending techniques are equivalent in their power to convey structure. In my analysis of the effect of word ordering I decided that either the ordering mattered or it did not, and that was the basis for two different naming primitives. Others have instead studied the inherent structure of data, and then from that derived ways of naming. The hypersemantic system [Smith] [Potter] represents an attempt to pick a manageably few columns which cover all possible needs. Generalization, aggregation, classification, and membership correspond to the is-a, has-property, is-an-instance-of, and is-a-member-of columns, respectively. The minor problem is that these columns don't cover all possibilities. They don't cover reindeer, presents, or chimneys for George's query.
The major problem is that they don't correspond as closely as possible to the most common style of human thought, simple unordered association, and require cognitive effort to transform. The first response of relational database researchers to this is usually to ask: "Why not modify an existing relational database to contain an 'associated' column, put everything in that column, and it would be functionally equivalent to what you want?" This is like saying that you can do everything Pascal can do using TeX macros. (They are both Turing complete.) We don't design languages simply to be Turing complete, we design them to be useful. I have seen a colleague do in six lines of (nonstandard) SQL a simple three-keyword unordered set that I do in three words plus a pair of delimiters, and that traditional keyword systems also handle easily. Doing simple unordered sets well is crucial for highly heterogeneous name spaces, and the market success of keyword systems in Internet searching is evidence of that. If you look at the structure of names in human languages, they are not all tuple structured, and to make them tuple structured might be to distort them. I have merely discussed the burden of naming columns. Most relational systems also require the user to specify the relation name. If column naming is a burden, naming both the column and the relation is no less a burden. Many systems invest effort into allowing you to take the key that you know, and figure out all the relation names and columns that you might choose to pair with it. This is a good idea, but not as good as not imposing extraneous structure to begin with. [Salton] can be read for devastating critiques of the document clustering system, but there is a worthwhile idea lurking within that system. Perhaps it is worthwhile to keep track of a small number of documents which are "close" to a given document.
The document creator could be informed upon auto-indexing the document what other documents appear to be close to it, and asked to consider associating it with them. This is not within our current plan of work, but I don't reject it conceptually. In summary, modularity within the naming system is improved by recognizing unordered grouping and ordering as two different functions that deserve separate primitives, rather than being combined into a tuple primitive. The tuple is an unordered set of ordered pairs. There are other useful combinations of unordered grouping and ordering than that embodied by the relation, and the success of keyword systems suggests that a plain unordered set without any ordering at all is the most fundamental and common of them.

== Names as Random Subsets of the Information In an Object ==

A system may still be effective when its assumptions are known to be false. You may regard the above as an overstatement of the notion that we are neural nets, and sometimes our abstract systems deal with assumptions that are not true or false, but are somewhat true. After we are finished stating them in English they lose the delicate weighting possessed by the reality of the situation. Sometimes we find it easier to model without that weighting. Classical economics and its assumption of perfect competition is the best-known example of an effective system based on assumptions known to be substantially false. Introductory economics classes usually spend several weeks of class time arguing the merits of building models on somewhat false assumptions. This paper will now use such a somewhat false model to convey a feel for why mandatory pairing of name components causes problems. Assume the user's information from which he tries to construct a description will be some completely random subset of the information about the object. (Some of that information will be structural, and the structural fragments selected will be just as random as the rest.)
Assume a user has 15 random clues of information selected from 300 pieces of information the system knows about some object. Assume the REGRES naming system requires that data be supplied in threesomes (perhaps column name, key name, relation name), and cannot use one member of a threesome without the other members of the threesome. Assume the ANARCHY naming system lacks this restriction, but does so at the cost that it can only use those 10 of the 15 information fragments which do not embody structure. Assume the statistical distribution of the 15 pieces of information the user has to construct a name with is fully independent and equally likely (this is both substantially wrong and unfair to REGRES, but...). Assume each clue has a selectivity of 100 (it divides the number of objects returned by 100). Then ANARCHY has a selectivity of 100<sup>10</sup> = 10<sup>20</sup> = good. REGRES has a selectivity of: 100<sup>(chance that the other two members of an object's threesome are possessed by the user × 15)</sup> = 100<sup>(9/300 × 8/300 × 15)</sup> ≈ 1.06 = very bad. While it is not true that the clues are fully independent, it is true that to the extent that they are not fully dependent, ANARCHY will gain in selectivity compared to REGRES. Attempting to quantify for any database the extent of the dependence would be a nightmare, and so this model assumes a substantial falsity, through which it is hoped the reader can see a greater truth. For databases of the lower heterogeneity and complexity that the relational model was designed for, the independence within a threesome can be small, and the ability to also employ the 5 of 15 fragments which are structural is often more important than the difficulty of guessing any structure added. There is an implicit assumption here that you are looking for information that others have structured, and this argument in favor of ANARCHY becomes much less strong without this assumption.
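The arithmetic above is easy to check; a few lines of Python reproduce it (the 9/300 and 8/300 factors are the paper's own estimate of the chance that the other two members of a clue's threesome are among the user's clues):

```python
# Selectivity of the two toy naming systems in the example above.
# ANARCHY uses its 10 non-structural clues, each with selectivity 100.
anarchy = 100 ** 10
print(anarchy)            # equals 10**20

# REGRES can only use a clue when the other two members of its threesome
# are also possessed, so the effective exponent is (9/300)*(8/300)*15.
regres = 100 ** (9 / 300 * 8 / 300 * 15)
print(round(regres, 2))   # -> 1.06
```

The point of the model survives the rough estimates: raising the per-clue selectivity to an exponent near 0.012 instead of 10 collapses it to nearly 1.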
I feel obligated to stress once again that I do not advocate low structure over high structure, but I do advocate having the flexibility to match the amount of structure to the needs of the moment. Only with such flexibility can one hope to use all of the 15 fragments that happen to be possessed.

== The Syntax In More Detail ==

What's needed is a naming system intended to reflect just the structure inherent in the information, whatever that structure might be, rather than restructuring the information to fit the naming system.

=== Orthogonal or Unoriginal Primitives and Features ===

There are many primitives that the ultimate naming system would include but which I will not discuss here: macros, OR, weights for subnames and AND-OR connectors [Fox], rules, constraints, indirection, links, and others. I have tried to select only those aspects in which my approach differs from the standard approach. Unifying the namespace does not require unifying automatic name generation, and those who read the [Blair] vs. [Salton] controversy likely understand my concluding that whatever the benefits might be of unifying automatic name generation, it is not feasible now, and won't be feasible for a long time to come. The names one can assign an object are kept completely orthogonal to the contents of the object in the implementation of this naming layer. It is up to the owner of the object to name it, and it is up to him to use whatever combination of autonaming programs and manual naming best achieves his purpose. He may name it on object creation, and he may continually adjust its various names throughout its lifetime. See the section defining the Key_Object primitive for a discussion of why names should be thought of this way. Technically, object creation only requires that the object be given a Storage_Key.
In practice most users will, in the same act that creates the object, also associate the object with at least one name that will spare them from directly specifying the Storage_Key in hex the next time they make a reference to it. Applications implementing external name spaces can interact with the storage layer by referencing just the Storage_Key. Namesys will provide a manual naming interface, and the API that autonaming programs need to plug into. Companies such as Ecila will provide autonamers for various purposes. Ecila is implementing a program which scans remote stores, creates links to them in the unified name space, but leaves the data on the remote stores. Other programs may also be implemented to perform this general function. To be more specific, the Ecila search engine scans the web for documents in French, and uses the filesystem as an indexing engine. However, they are writing their engine to be a general-purpose engine; they have sold support for it, and the addition of extensions to it, to other search engine companies; and it is open source. For now we are simply functioning as part of their engine, and the interface is by web browser; at some point we may be able to add their functionality to the namespace. While Microsoft's attempt to blur the distinction between the filesystem name space and the web namespace is one more of appearance than substance, it is surely the right thing to do for Linux as well in the long run. We should simply make our integration one with substance and utility, rather than integrating mostly the look and feel. When the store is external to the primary store for the namespace, stale names can be an issue with no clean resolution. That said, unification at just the naming layer is, in a real rather than ideal world, often quite useful, and so we have Internet search engines.
GUI based naming is beyond the scope of this paper, except to mention that it is common for GUI namespaces to be designed such that they are not well integrated with the other namespaces of the OS. They are often thought to be necessarily less powerful, but proper integration would make this untrue, as they would then be additional syntaxes, not substitutes. These additional syntaxes should possess closure within the general name space, and thereby be capable of finding employment as components of compound names like all the other types of names. The compound names should be able to contain both GUI and non-GUI based name components. Integration would make them simply the aspect of naming that applies to what is present in the visual cache of the screen, and to how to manage and display that cache most effectively.

=== Vicinity Set Intersection Definition (Also Called Grouping) ===

Suppose you have a set X of objects. Suppose some of these objects are associated with each other. You can draw them as connected in a graph. Let the vicinity of an object A be the set of objects associated with A. Let there be a set of query objects Q. Then the set vicinity intersection of Q is the set of objects which are members of all vicinities of the objects in Q. When thinking of this as a data model, it seems natural to use the term vicinity set intersection. When thinking of this syntactically, it seems natural to use the term "grouping", because it implies that the subnames are grouped together without the order of the subnames being significant. There is exactly one data model primitive (set vicinity intersection) possessing exactly one syntax (grouping), and I rarely intend to distinguish data model primitive from syntax primitive (I can be criticized for this), and yet I use both terms for it; forgive me.
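The definition above is short enough to state as executable Python. This is a minimal sketch, assuming the association graph is stored as a dict mapping each object to the set of objects in its vicinity; the object names are hypothetical, chosen to echo the George-and-Kali example:

```python
# Vicinity of each query object: the set of objects it is associated with.
vicinity = {
    "reindeer": {"santa-story", "zoo-guide"},
    "chimneys": {"santa-story", "masonry-manual"},
    "man":      {"santa-story", "zoo-guide", "biography"},
}

def vicinity_intersection(query, vicinity):
    """Return the objects that lie in the vicinity of every query object."""
    sets = [vicinity[q] for q in query]
    return set.intersection(*sets) if sets else set()

print(vicinity_intersection(["reindeer", "chimneys", "man"], vicinity))
# -> {'santa-story'}
```

Note that the order of the query objects is irrelevant, which is exactly why the syntax for this primitive is an unordered grouping.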
=== Synthesizing Ordering and Grouping ===

I am going to describe a toy naming system that allows focusing on how best to combine grouping and ordering into one naming system. This synthesis will contain the core features of the hierarchical, keyword, and relational systems as functional subsets. It consists of a few simple primitives, allowed to build on each other. It sets the discussion framework from which our project will, over many years, evolve a real naming system out of its current storage layer implementation. Resolving the second component of an ordering is dependent on resolving the first, unlike in set theory. In set theory one can derive the ordered set from the unordered set, but because resolving the name of the second component depends on the first component, one cannot do so in this naming system. For this reason it can well be argued that this naming system is not truly set theory based. Now that I have mentioned this difference I will start to call them grouping and ordering, rather than unordered and ordered set. These two primitives take other names as sub-names, and allow the user to construct compound names. Either the order of the subnames is significant (ordering), or it isn't (grouping), and thus we have the two different primitives. Because I have myself found that BNFs are easier to read if preceded by examples, I will first list progressively more complex examples using the naming system, and then define it formally. The examples, and the simplified syntax, use / rather than : or \, but this is of no moment. Examples:

 /etc/passwd

Ordering and grouping are not just better; file system upward compatibility makes them cheaper for unifying naming in OSes based on hierarchical file systems than a relational naming system would be. This approach is fully upwardly compatible with the old file system.
Users should be able to retain their old habits for as long as they wish, engage in a slow comfortable migration, and incorporate the new features into their habits as they feel the desire. Elderly programs should be untroubled in their operation. Many worthwhile projects fail because they emphasize how much they wish to change rather than asking of the user the minimal collection of changes necessary to achieve the added functionality.

 [dragon gandalf bilbo]

FIGURE 3. Graphical representation of ascii name on left

Mr. B. Bizy is looking for a dimly remembered story (The Hobbit by Tolkien) to print out and take with him for rereading during the annual company meeting.

 case-insensitive/[computer privacy laws]

FIGURE 4. Graphical representation of ascii name on left

When one subname contains no information except relative to another subname, and the order of the subnames is essential to the meaning of the name, then using ordering is appropriate. This most commonly occurs when syntax barriers are crossed. This is when a single compound name makes a transition from interpreting a subname according to the rules of one syntax to interpreting it according to the rules of another syntax. Ordering is essential at the boundary between the name of the new syntax as expressed in the current syntax, and the name to be interpreted according to that new syntax. Some researchers use the term context rather than syntax. The pairing of a program or function name and the arguments it is passed is inherently ordered. While that is usually the concern of the shell, when we use a variety of ordering functions to sort Key_Objects of different types it affects the object store. In this example the ordering serves as a syntax barrier. Case-insensitive is the unabbreviated name of a directory that ignores the distinction between upper and lower case.
For Linux compatibility this naming layer is case sensitive by default, even though I agree with those who think that it would be better were it not.

 [my secrets]/[love letter susan]

FIGURE 5. Graphical representation of ascii name on left

Devhuman (that's the account name he chose) is the company's senior programmer. Six years ago he wrote a love letter to Susan, which he put in his read-protected secrets directory. (He never found the nerve to send it to her.) He's looking for it so he can rewrite it, and then consider sending it. Security is a particular kind of syntax barrier (you have to squint a bit before you can see it that way). Here the ordering serves as a security barrier. (He certainly wouldn't want anyone to know that an object owned by him with attributes love letter susan existed.)

 [subject/[illegal strike] to/elves from/santa document-type/RFC822 ultimatum]

FIGURE 6. Graphical representation of search for santa's ultimatum

Devhuman knows his object store cold. He is looking for something he saw once before, he knows that it was auto-named by a particular namer he knows well (perhaps one whose functionality is similar to the classifier in [Messinger]), and he knows just what categorizations that namer uses when naming email. Still, he doesn't quite remember whether the word 'ultimatum' was part of the subject line, the body, or even was just elvish manual supplementation of the automatic naming. Rather than craft a query carefully specifying what he does and does not know about the possible categorizations of ultimatum, he lazily groups it. If Devhuman's object store is implemented using this naming system with good style, someone less knowledgeable about the object store would also be able to say: [santa illegal strike ultimatum elves] and perhaps get some false hits as well as the desired email (instead of finding mail from santa perhaps finding the elvish response).
Notice that if you delete the 'illegal' and 'ultimatum' to get [subject/strike to/elves from/santa document-type/RFC822] the query is structurally equivalent to a relational query. Many authors (e.g. semantic database designers) have written papers with good examples of standard column names which might be worth teaching to users. So long as they are an option made available to the user rather than a requirement demanded of the user, the increased selectivity they provide can be helpful.

 [_is-a-shellscript bill]

FIGURE 7. Graphical representation of ascii name on left

This name finds all shellscripts associated with bill. Names preceded by _ are pruners. Pruners are analogous to the predicate evaluators of relational database theory. If you have read papers distinguishing between recognition and retrieval, pruners are a recognition primitive. They are passed a list of objects, and return the subset of that list which matches some criteria. They are a mechanism appropriate for when a nonlinear search method that can deliver the desired functionality is either impossible, or not supported by existing indexes. There are many useful names for which we cannot do better than linear-time search algorithms (perhaps simply as a result of incomplete indexing). _is-a-shellscript checks each member of its list to see if it is an executable object containing solely ascii. The user can use it just like any other Key_Object within an association; it will prune the results of the grouping. Since set intersections are commutative, its order within the grouping has no meaning, and optimizers are free to rearrange it.

=== The Formal Definitions ===

 <Object Name> ::= <Grouping> | <Ordering> | <Key_Object> | <Storage_Key> |
                   <Orthogonal and Unoriginal Primitives I Will Not Define Here> ;

See the section listing orthogonal and unoriginal primitives for a discussion of what primitives I left out of the definitions of this grammar that are necessary to a real world working system.
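The pruner mechanism described above is just a predicate applied to the result list of a grouping. A minimal sketch in Python, assuming objects are modeled as dicts with hypothetical `executable` and `contents` fields (the real system would inspect object headers and bodies):

```python
def is_a_shellscript(obj):
    """Hypothetical _is-a-shellscript pruner: executable and all-ASCII."""
    return obj["executable"] and obj["contents"].isascii()

def prune(objects, pruner):
    # A pruner's verdict on one object must not depend on the other
    # objects in the list, so an optimizer may apply pruners in any order.
    return [o for o in objects if pruner(o)]

candidates = [  # pretend these are the objects the grouping [bill] returned
    {"name": "bill-backup", "executable": True,  "contents": "#!/bin/sh\necho hi\n"},
    {"name": "bill-photo",  "executable": False, "contents": "\x89PNG..."},
]
print([o["name"] for o in prune(candidates, is_a_shellscript)])
# -> ['bill-backup']
```

The independence requirement in the next paragraph is what makes the list comprehension above safe to reorder or fuse with other pruners.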
The name resolver has a method for converting all of the primitives into <Storage_Key>s, and when processing compound names it first converts the subnames into <Storage_Key>s, though an object may have null contents and serve purely to embody structure. This allows the use, as a component of a grouping or ordering, of anything for which anyone can invent a way of allowing the user to find an <Object Name>, and then invent a method for the resolver to convert the <Object Name> into a <Storage_Key>. In a word, closure. Extensible closure. Compound names are interpreted by first interpreting the subnames that they are constructed from. At each stage of subname interpretation an <Object Name> is converted into a <Storage_Key> for the object that it is resolved to. The modules that implement the grouping and ordering primitives do not interpret the subnames; they merely pass them to the naming system, which returns the <Storage_Key>s they resolve to. It was a long discussion which led to the use of storage keys rather than objectids. A storage key differs from an objectid in that it gives the storage layer directions as to where to try to locate the object in the logical tree ordering of the storage layer. If the logical location changes, then in the worst case we leave a link behind, and get an extra disk access like we get with an inode. (Inode numbers are functionally objectids.) In the better case, the repacker eventually comes along and changes all references by key to the new location, at least for all objects that have not given their key to external naming systems the repacker cannot repack. A <Storage_Key> is assigned by the system at object creation, and serves the purpose of allowing the system to concisely name the object, and of providing hints to the storage layer about which objects should be packed near each other. The user does not directly interact with the <Storage_Key> any more often than C programmers hardcode pointers in hex.
The packing locality of keys may be redefined.

== The Primitives ==

<Key_Object>: A description of the contents of an object using the syntax of the current directory. For objects used to embody keywords this may be the keyword in its entirety. If it contains spaces, etc., it must be enclosed in quotes. Note that making it easy for third parties to add plug-in directory types is part of Namesys's current contract with Ecila. Ecila wants space-efficient directories suitable for use in implementing a term dictionary and its postings files for their Internet search engine.

Example: [reindeer chimneys presents man]

In this example 'reindeer', 'chimneys', 'presents', and 'man' are the contents of objects associated with the Santa Claus story. Each of them is searched for by contents, and then when found they are converted into their Storage_Keys, and the grouping algorithm is fed their four Storage_Keys. The grouping module then looks in the object headers of the four objects, gets the four sets of objects the Key_Objects group to, and performs a set intersection. Besides greater closure, another advantage of storing Key_Objects as objects is that non-ascii Key_Objects and ordering functions can be implemented as a layer on top of the ascii naming system, allowing the user to interact with the naming system by pressing hyperbuttons, drawing pictures, making sounds, and supplying other non-ascii Key_Objects that the higher layers convert into Storage_Keys. There are endless content description techniques. If the directory owner supplies an ordering function for the Key_Objects in a directory, one can generate a search index for the directory using a directory plug-in which is fully orthogonal to the ordering function, though perhaps slower in some cases than one that is tailored for the ordering function. Users will find it easier to write ordering functions than index creation objects, and will not always need the speed of specialized indexes.
We will need one ordering function for ascii text, another for numbers, another for sounds, perhaps someday one even for pictures of faces (perhaps to be used by a law enforcement agency constructing an electronic mug book, or a white pages implementation), etc. No system designer can provide all the different and sometimes esoteric ordering functions which users will want to employ. What we can do is create a library of code from which users can construct their own ordering functions and their own directory plug-ins, and this is the approach we are taking on behalf of Ecila. For an Internet search engine one wants what is called a postings file, which is like a directory in that there is no need to support a byte offset, and one frequently wants to efficiently perform insertions into it.

 <Grouping> ::= [<Unordered List>] ;
 <Unordered List> ::= <Unordered List> <Unordered List> | <Object Name> | <Pruner> ;
 <Pruner> ::= _<Object Name> ;

A <Grouping> is a list of object names and pruners whose order has no meaning. Every object has, in its object header, a list of objects it groups to (associates with, in neural network idiom). A grouping is interpreted by performing a set intersection of those lists for every object named in the grouping; in the sense of the data model, this is a set vicinity intersection. Grouping is not transitive: [A] => B and [B] => C does not imply [A] => C, though it does imply that [[A]] => C. A pruner is an <Object Name> which has been preceded with an _ to indicate that the object described should be passed a list of objects named by the rest of the grouping, executed, and it will return a subset of the list it was passed.
Whether a member of the set is in the returned subset must be fully independent of what the other members of the set were, or else the results become indeterminate after application of a query optimizer, as with an optimizer in use there is no guarantee provided of the order of application of the pruners.

 <Ordering> ::= <Object Name>/<Object Name> | <Object Name>/<Custom Programmed Syntax> ;
 <Custom Programmed Syntax> ::= Varies, provides extensibility hook.

An ordering is a pairing of names, with the order representing information. The first component of the ordering determines the module to which the second component is passed as an argument. In contrast, a grouping first converts all subnames to Storage_Keys by looking through the same current directory for all of them in parallel, and then does its set intersection with the subdescriptions already resolved.

Example: In resolving [my secrets]/[love letter susan] the system would look for the objects with contents my and secrets, find both of them, and do a set intersection of all the objects those two objects both group to (are associated with). This will allow it to find the [my secrets] directory, inside of which it will look for the three objects love, letter, and susan. It will then extract from their object headers the sets of objects those three words ('love', 'letter', and 'susan') group to, and do a set intersection which will find the desired letter. The desired letter is not necessarily inside the [my secrets] directory, though in this case it probably is. A directory is an object named by the first component of an ordering, to which the second component is passed, and which returns a set of Storage_Keys. One can in principle use different implementations of the same directory object without impacting the semantics, only affecting performance, as is often done in databases.
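The walk-through of resolving [my secrets]/[love letter susan] can be sketched in a few lines of Python. This is a toy model under stated assumptions: directories map Key_Object contents to storage keys, each object header carries the set of storage keys it groups to, and every name, key, and directory here is hypothetical:

```python
directories = {
    "root":        {"my": 1, "secrets": 2},
    "secrets-dir": {"love": 10, "letter": 11, "susan": 12},
}
groups_to = {           # object header: storage keys each object groups to
    1: {100},           # "my"      -> {key of secrets-dir}
    2: {100, 101},      # "secrets" -> {key of secrets-dir, another object}
    10: {200, 201},     # "love", "letter", "susan" all group to the letter (200)
    11: {200},
    12: {200},
}

def resolve_grouping(directory, words):
    """Resolve Key_Objects in a directory, then intersect their vicinities."""
    keys = [directories[directory][w] for w in words]
    return set.intersection(*(groups_to[k] for k in keys))

# The ordering '/': the first component's result names the directory through
# which the second component is resolved (key 100 is secrets-dir here).
step1 = resolve_grouping("root", ["my", "secrets"])
print(step1)   # -> {100}, the [my secrets] directory
step2 = resolve_grouping("secrets-dir", ["love", "letter", "susan"])
print(step2)   # -> {200}, the desired letter
```

The mapping from storage key 100 back to the `secrets-dir` lookup table is hand-waved here; in the real design the storage key itself locates the directory object in the tree.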
There are several flavors of directories. Custom programmed directories, aka filters, are any executable program that will return a Storage_Key when executed and fed the second component as an argument. They provide extensibility. (They are the ordered counterpart of pruners.) Another term for them is filter directories. Custom programmed directories whose name interpretation modules aren't unique to them will contain just the name of the module (filter), plus some directory dependent parameters to be passed to the module. If those parameters include a reference to a search tree that the module operates on, and if that search tree adheres to the default index structure, it should be considered merely a syntax barrier directory, and not a fully custom programmed directory. The connotations conveyed by the term 'filter', of there being an original which is distorted, are not always appropriate, but in honesty this is not an issue about which we deeply care.

Syntax barrier directories allow you to describe the contents of the objects they contain with a syntax different from that of their parents. Except for being sorted by a different ordering function, the indexes of syntax barrier directories are standard in their structure, and use a standard index traversal module. The index traversal module is ordering function independent. There must be an ordering function for every <Key_Object> employed within a given syntax barrier directory. By contrast, a <Custom Programmed Syntax> could be anything which the syntax module somehow finds an object with, possibly even creating the object in order to be able to find it.

To cross a security barrier directory the user must use an ordered pair of names with the security barrier as the first member of the pair, and he must satisfy the security module of the secured directory. A security barrier directory may be both a security and a syntax barrier directory, or it may share the syntax module of its parents.
Fully standard directories are those built using the default directory module; adding structure is their only semantic effect.

There is an aspect of customization which is beyond the scope of this paper, in which one customizes the items employed by the storage layer to implement files and directories. That is, the storage of files and directories is implemented by composing them of items, and these items have different types. We are now creating the code for packing and balancing arbitrary types of items using item handlers and object oriented balancing code, so as to make it easier to extend our filesystem.

=== Ordering can be implemented more efficiently than grouping ===

The set intersections performed in evaluating the grouping primitive are normally much more expensive computationally than performing the classical filesystem lookup. Imposing excess structure on one's data does not just at times reduce the cost of human thinking :-), it can reduce the cost of automated computation as well. When the cost to a user of learning structure is less important than the burden on the machine, use of highly ordered names is often called for.

=== The Motivation for Different Syntactic Treatment of Ordering and Grouping, and Some of the Deeper Issues Revealed by the Difference ===

An important difference between grouping and ordering affects syntax: an ordering can be represented with a single symbol ('/') placed between the pair, but each grouping requires two symbols ('[' and ']'). Imagine using < and > as a two symbol, delimiter style alternative notation for ordering:

 <<father-of mother-of> sister-of>
 = <father-of <mother-of sister-of>>
 = <father-of mother-of sister-of>
 = father-of/mother-of/sister-of

All of the expressions above are equivalent in referring to the paternal great aunt of the person who is the current context.
The ones using nested pairs of symbols to enclose pairs of subnames imply a false structure that requires the user to think before realizing the first two expressions are equivalent. The fourth is the notation this naming system employs.

Grouping is different. Fast Acting Freddy is looking through the All-LA Shopping Database for a single store with black reebok sneakers, a green leather jacket, and a red beret, so that he can dress an actor for a part before the director notices he forgot all about him.

 [[black reebok sneakers] [green leather jacket] [red beret]]

is not equivalent to

 [black reebok sneakers green leather jacket red beret]

which equals

 [red sneakers black reebok jacket green beret]

Ordering is not algebraically commutative (father-of/mother-of is not equivalent to mother-of/father-of). Groupings are algebraically commutative ([large red] = [red large]).

== Style ==

As a general principle, a more restricted system can avoid requiring the user to repeatedly specify the restrictions, and if the user has no need to escape the restrictions then the restricted system may be superior. This is why "4GLs", which supply the structure for the user's query, are useful for some applications. They are typically implemented as layers on top of unrestricting systems such as this one.

This paper has addressed issues surrounding finding information, particularly when the user's clues are faint. When supporting other user goals, such as exploring information, adding structure through substantial use of ordering can be helpful [Marchionini] [McAleese]. When the user goal is finding, one should assume that of all the fragments of information about an object, the user has some random subset of them. The goal is to allow the user to use that random subset in a name, whatever that subset might be. Some of that subset will be structural fragments.
While requiring the user to supply a structure fragment is as foolish as requiring him to supply any other arbitrary fragment, allowing him to is laudable. In the best of all worlds the object store would incorporate all valid possible structurings of Key_Objects. The difficulty of implementing that is obvious. [Metzler and Haas] discuss ways of extracting structure from English text documents, and why one would want to be able to use that structure in retrievals. Unfortunately, there is an important difference between representing the structure of an English language sentence in a way that conveys its meaning, and representing it in a way that allows it to be found by someone who knows only a fragment of its semantic content. I doubt the wisdom of advocating the use of more than essential structure in searching. You can allow users to avoid false structure; you cannot force them to. It is important to teach those creating the structure that if they group a personnel file with sex/female they should also group it with female. Type checking can impose structure usefully. Its implementation can enhance or reduce closure, depending on whether it is done right.

=== When To Decompound Groupings ===

There are dangers in excessive compounding of groupings analogous to those of excessive ordering. Let's examine two examples of compound groupings, both of which are valid both semantically and syntactically. One of them can be "decompounded" with moderate information loss, and the other loses all meaning if decompounded.

Example: finding a loquacious Celtic textbook salesman who told you in excruciating detail about how he was an ordnance researcher until one day he went to a Grateful Dead concert.

 [[Celtic textbook salesman] [ordnance researcher]]

vs.

 [celtic textbook salesman ordnance researcher]

These two phrasings of the same query are not equivalent, but they are "close."
Our second example is the one in which Fast Acting Freddy tries to find a suspect by the objects he is associated with:

 [[black reebok sneakers] [green leather jacket] [red beret]]

vs.

 [black reebok sneakers green leather jacket red beret]

These two are not at all "close." The difference between the two examples of inequivalence is that the subdescriptions within the second example describe objects whose existence within the object store, independent of the object described, is worthwhile. The first does not, and there it is more reasonable to try to design so that the "decompounded" version of the query is used. False hits will occur, but for large systems that's better than asking the user to learn structure. A higher level user interface might choose to present only one level to the user at a time, and then, once the user confirms that a subdescription has resolved properly, let him incorporate it into a higher level description. There might be six models of [black reebok sneakers], and Fast Acting Freddy should have the opportunity to click his mouse on the exact model, and have the interface substitute that object for his subdescription. Using such an interface, an advanced user might simultaneously develop several subdescriptions, refine and resolve them, and then use the mouse to draw lines connecting them into a compound grouping. Closure makes it possible for that to work.

== Examples of Creating Associations ==

<- creates an association between all of the objects on the left hand side and all of the objects on the right hand side. A - B is the set difference of A and B, and it resolves to the set of objects in A except for those that are in B. A & B resolves to the set intersection of A and B, the objects that are in both A and B. [A B] = [A] & [B], by definition.
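The association primitives just defined can be sketched with ordinary sets. This is a toy model under stated assumptions: the table, the `associate` helper, and all object names are invented for illustration, not the actual filesystem interface.

```python
# Hypothetical sketch of the association primitives described above:
# "left <- right" associates every object on the left with every object
# on the right; A - B and A & B are ordinary set difference and
# intersection over those association sets.

associations = {}  # object name -> set of objects it is associated with

def associate(left, right):
    """left <- right: associate each left object with all right objects."""
    for obj in left:
        associations.setdefault(obj, set()).update(right)

associate(["cat"], ["mammal", "fur", "whiskers"])
associate(["Basil"], ["Nina", "siamese", "clever"])

A = associations["cat"]
B = {"fur", "tail"}
assert A - B == {"mammal", "whiskers"}   # objects in A except those in B
assert A & B == {"fur"}                  # objects in both A and B
```

Because groupings are defined by [A B] = [A] & [B], the commutativity of groupings noted earlier is simply the commutativity of set intersection.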
 animal <- (lives, moves)
 mammal <- ([animal], animal, `warm blooded')
 cat <- ([mammal], hypernym/mammal, mammal, meronym/fur, fur, meronym/whiskers, whiskers, hypernym/quadruped, quadruped, capability/purr, purr, capability/meow, meow)
 Basil <- (owner/Nina, Nina, [siamese], siamese, clever, playful, brave/overly, brave, 'toilet explorer')
 bag <- ([container], container, consists-of/`highly flexible material', `highly flexible material')
 backpack <- ([bag], shoulderstrap/quantity/2, shoulderstrap, college-student, holonym/backpacker, meronym/shoulderstrap)
 mould <- ([fungi] - green/not, furry, `grows on'/surfaces/moist, `killed by'/chlorine)
 fungi <- ([plant], plant, leaves/no, flowers/no, green/not)
 bird <- ([vertebrate], vertebrate, flies, feathers)
 penguin <- ([bird] - flies, bird, hypernym/bird, swims, Linux, [Linux (mascot, symbol)])
 siamese <- ([cat], cat, hair/short, short-hair)

Notice how we don't associate siamese with short despite associating it with hair/short, but we do associate Basil with Nina as well as with owner/Nina.

 small <-0 little

The above means that small and little are synonyms, and are to be treated as 0 distance away from each other for vicinity calculation purposes. In other, more traditional Unix words, they are hardlinked together. Creating a serious ontology is not our field or task, but it is worth doing. The reader is referred to WordNet (free), and to Cyc by Doug Lenat (proprietary). While we will focus on implementing primitives that allow for creating better ontologies, we are happy to work with persons interested in contributing or porting an ontology.

== Other Projects Seeking To Increase Closure In The OS ==

=== AT&T's Plan 9 ===

[Plan 9] is being produced by the original authors of Unix at AT&T research labs. It has influenced CORBA, and Linux's /proc is a direct steal from it. Their major focus is on integration. Their major trick for increasing integration is unifying the name space.
Name spaces integrated into the Plan 9 file system include the status, control, virtual memory, and environment variables of running processes. They have a hierarchical analog to what the relational culture calls constructing views, which the Plan 9 culture calls context binding.

=== Microsoft's Information At Your Fingertips ===

Plan 9 ignores integration of application program name spaces, concentrating on OS oriented name spaces. Microsoft's "Information at Your Fingertips" name space integration effort appears to be taking the other approach, focusing on integrating the name spaces of the various Microsoft applications via OLE and Structured Storage. The application group at Microsoft has long been better staffed and funded than the OS group, and FS developers have long preferred to simply ignore the needs of application builders generally. The primary semantic disadvantages of Microsoft's approach are primitives selected with insufficient care, a lack of closure, and the use of an object oriented rather than set oriented approach in both naming syntax and data model. Realistically, one can say that folks within Microsoft have often made statements favoring name space integration, and in various areas have successfully executed on it, but on the whole I rather suspect that the lack of someone in marketing making a business case for $X in revenue resulting from name space integration has crippled name space integration work at commercial OS producers generally, including MS.

==== Internet Explorer ====

Internet Explorer attempts to unify the filesystem and Internet namespaces. At the time of writing, the unity is so superficial, with so little substance, that I would describe it as having the look and feel of integration without most of the substance. Perhaps this will change.
==== Microsoft's Well Known Performance Difficulties ====

Despite having many of the leading names in the industry on their payroll, they have somehow managed to create a file system implementation with performance so terrible that, for the Unix customer base, it is a significant consideration contributing to hesitation in moving to NT. It may well have the worst performance of any of the major OS file systems. Their implementation of OLE's structured storage offers extremely poor performance, and their excuse that it is due to the incorporation of transaction concepts into their design is just a reminder that they did a poor job at that as well. They managed to implement something intended to store small objects within a file, and implement it such that it still suffers from 512-byte granularity problems, problems that they try to somewhat overcome by encouraging the packing of several objects within "storages", at horrible kludge costs.

=== Storage Layers Above the FS: A Sure Symptom That the FS Developer Has Failed ===

When filesystems aren't really designed for the needs of the storage layers above them, and none of them are, not Microsoft's, not anybody's, then layering results in enormous performance loss. The very existence of a storage layer above the filesystem means that the filesystem team at an OS vendor failed to listen to someone, and that someone was forced to go and implement something on their own. You just have to listen to one of these meetings in which some poor application developer tries to suggest that more features in the FS would be nice (I heard one at a nameless OS vendor). The FS team responds that disks are cheap, that small object storage isn't really important, that they haven't changed the disk layout in 10 years, and that changing it isn't going to fly with the gods above them, about whom they can do nothing.
At these meetings you start to understand that most people who go into filesystem design are persons who didn't have the guts to pursue a more interesting field in CS. There is a sort of reverse increasing returns effect that governs FS research: the more code becomes fixed on the current APIs, the more persons in the field react with fear to any thought of FS semantics being other than a dead research topic, the less research gets done, and the fewer persons of imagination see a reason to enter the field. Every time one vendor gets a little ahead in adding functionality, the other vendors go on a FUD campaign about it breaking standards and therefore being dangerous for mission critical usage. This is a field in which only performance research is allowed, and every other aspect is simply dead. Namesys seeks to raise the dead, and is willing to commit whatever unholy acts that requires.

There is no need for two implementations of the set primitive, one called directories, the other called a file with streams, each having a different interface. File systems should just implement directories right, give them some more optional features, and then there is no need at all for streams. If you combine allowing directory names to be overloaded to also be filenames when acted on as files, allowing stat data to be inherited, allowing file bodies to be inherited, and implementing filters of various kinds, then in the event that the user happens to need the precise peculiar functionality embodied by streams, they can have it by just configuring their directory in a particular way. There was a lengthy linux-kernel thread on this topic which I won't repeat in more detail here.

The tree architecture of the storage layer of this FS design will lend itself to a distributed caching system much more effectively than the Microsoft storage layer, in part due to its ability to cache not just hits and misses of files, but to cache semantic localities (ranges).
For more on this topic, see later in this paper.

=== Rufus ===

The Rufus system [Messinger et al.] indexes information while leaving it in its original location and format. While it does allow the user to create a unified name space, it does not choose to integrate that name space into the operating system. Even so, it is immensely useful in practice, and strongly hints at what the OS could gain if it had a more than hierarchical name space with a data model oriented towards what [Messinger] calls "semi-structured information", such as you find in the RFC822 format for email. When you have 7000 pieces of mail, and linearly searching the mail with a utility like grep takes 10 minutes, it is nice to be able to quickly keyword search via inverted indexes for the mail whose From: field contains billg and that has the words "exclusive" and "bundling" in the body of the message, as you hurriedly search for an old email just before an appearance in court.

=== Semantic File System ===

The Semantic File System comes closest to addressing the needs I have described. It is a Unix compatible file system with more than hierarchical naming (attribute based is the term they use). Its data model unfortunately has the important flaw of lacking closure (in it, names of objects are not themselves objects). In my upcoming discussion of the unnecessary lack of closure in hypertext products, notice that the arguments apply to the Semantic File System as well (and so I won't duplicate them here).

=== OS/400 ===

IBM's OS/400 employs a unified relational name space. The section of this paper entitled A Naming System Should Reflect Rather than Mold Structure will cover its problems of forcing false structure. Inadequate closure due to mandatory type checking is another source of difficulties for it.
While users moan about these two unnecessary design flaws, the essence of the opinions AS/400 partisans have expressed to me has been that the unification of its name space is a great advantage that OS/400 has over Unix. I claim these users were right, and later in this paper I will propose doing something about it.

== Conclusion ==

While I spent most of this paper on why adding structure to information can be harmful, particularly when the information is intended to be found by others sifting through large amounts of other information, this was purely because it is a harder argument than why deleting structure is harmful. My goal was not to be better at unstructured applications than keyword systems, or better at structured applications than the hierarchical and relational systems --- the goal is to be more flexible in allowing the user to choose how structured to be, while still remaining within a single name space. I claimed that multiple fragmented name spaces cannot match the power and ease of name spaces integrated with closure: closure makes a naming system far more powerful by increasing its ability to compound complex descriptions out of simpler ones. The strong points of this naming system's design are various forms of generalizing abstractions already known to the literature, for greater closure.

== Acknowledgments ==

David P. Anderson and Clifford Lynch helped enormously in rounding out my education, and in improving my paper. Their generosity with their time was remarkable. David P. Anderson was simply a great professor, and it was a privilege to work with him. Brian Harvey informed me that it wasn't too obvious to mention that an object store should be unified. Cimmaron Taylor provided me with many valuable late night discussions in the early stages of this paper. I would like to thank Bill Cody and Guy Lohman of the database group at the IBM Almaden Research Center for a wonderful learning experience.
Vladimir Saveliev kept this file system going when others fell by the wayside. He started as the most junior programmer on the team, and through sheer hard work and dedication to excellence outshone all the other, more senior researchers. Of course, after some time he could no longer be considered a junior programmer.

NOTE: See also the DARPA funded, but not endorsed, Reiser4 Transaction Design Document and Reiser4 Whitepaper.

== References ==

1. Blair, David C. and Maron, M. E. "Evaluation of Retrieval Effectiveness for a Full-Text Document-Retrieval System", Communications of the ACM, v28 n3, March 1985, p289-299.

2. Codd, E. F. "The Relational Model for Database Management: Version 2", Addison-Wesley, c1990. Not recommended as a textbook (Date's is better for that), but worthwhile if you want a long paper by Codd. Notice that he places greater emphasis on closure, and on design methodology principles in general, than the designers of other naming systems such as hypertext.

3. Date, C. J. "An Introduction to Database Systems", 4th ed., Reading, Mass.: Addison-Wesley, c1986. Contains a well written, substantive textbook sneer at the problems of hierarchical naming systems, and a well annotated bibliography.

4. Curtis, Ronald and Larry Wittie. "Global Naming in Distributed Systems", IEEE Software, July 1984, p76-80.

5. Feldman, Jerome A., Mark A. Fanty, Nigel H. Goddard and Kenton J. Lynne. "Computing with Structured Connectionist Networks", Communications of the ACM, v31, February 1988, p170(18).

6. Fox, E. A. and Wu, H. "Extended Boolean Information Retrieval", Communications of the ACM, 26, 1983, pp. 1022-1036.

7. Gallant, Stephen I. "Connectionist Expert Systems", Communications of the ACM, v31, February 1988, p152(18).

8. Gates, Bill. Comdex '91 speech on "Information at Your Fingertips", available for $8 on videotape from Microsoft's sales department.

9. Gifford, David K., Jouvelot, Pierre, Sheldon, Mark A. and O'Toole, James W. Jr. "Semantic File Systems", Operating Systems Review, Volume 25, Number 5, October 1991. They demonstrated that extending Unix file semantics to include nonhierarchical features is useful and feasible. Unfortunately, their naming system lacks closure.

10. Gilula, Mikhail. "The Set Model for Database and Information Systems", 1st edition, c1994, Addison-Wesley. Provides a set theoretic database model in which relational algebra is shown to be a special case of a more general and powerful set theoretic approach.

11. Joint Object Services Submission (JOSS), OMG TC Document 93.5.1.

12. Marchionini, Gary and Shneiderman, Ben. "Finding Facts vs. Browsing Knowledge in Hypertext Systems", Computer, January 1988, p. 70.

13. McAleese, Ray (ed.). "Hypertext: Theory into Practice", ABLEX Publishing Corporation, Norwood, NJ 07648.

14. Messinger, Eli, Shoens, Kurt, Thomas, John and Luniewski, Allen. "Rufus: The Information Sponge", Research Report RJ 8294 (75655), August 13, 1991, IBM Almaden Research Center.

15. Metzler and Haas. "The Constituent Object Parser: Syntactic Structure Matching for Information Retrieval", Proceedings of the ACM SIGIR Conference, 1989, ACM Press.

16. Nelson, T. H. "Literary Machines", self published by Nelson, Nashville, Tenn., 1981. Did much to popularize hypertext; at the time of writing he has still not released a working product, though competitors such as HyperCard have done so with notable success.

17. Mozer, Michael C. "Inductive Information Retrieval Using Parallel Distributed Computation", ICS Report No. 8406, UCSD, June 1984.

18. Pike, Rob and P. J. Weinberger. "The Hideous Name", AT&T Research Report.

19. Pike, Rob, Presotto, Dave, Thompson, Ken, Trickey, Howard and Winterbottom, Phil. "The Use of Name Spaces in Plan 9", available via ftp from att.com. Plan 9 is an operating system intended to be the successor to Unix, and greater integration of its name spaces is its primary focus.

20. Potter, Walter D. and Robert P. Trueblood. "Traditional, Semantic, and Hyper-semantic Approaches to Data Modeling", Computer, v21, 1988, p53(11).

21. Rijsbergen, C. J. van. "Information Retrieval", 2nd ed., Butterworth and Co. Ltd., 1979. Printed in Great Britain by The Whitefriars Ltd., London and Tonbridge.

22. Salton, G. "Another Look At Automatic Text-Retrieval Systems", Communications of the ACM, 29, 1986, 648-656.

23. Smith, J. M. and D. C. Smith. "Database Abstractions: Aggregation and Generalization", ACM Transactions on Database Systems, June 1977, pp. 105-133.

24. http://www.win.tue.nl/~aeb/partitions/partition_types.html

[[category:Reiser4]]

= The Naming System Venture =

== Abstract ==

For too long the file system has been semantically impoverished in comparison with database and keyword systems. It is time to change! The current lack of features makes it much easier to adopt the latest set theoretic models than the older models of relational algebra or hypertext, and the current FS syntax fits nicely into the newer model.

The utility of an operating system is more proportional to the number of connections possible between its components than to the number of those components. Namespace fragmentation is the most important determinant of that number of possible connections between OS components. Unix at its beginning increased the integration of I/O by putting devices into the file system name space. This is a winning strategy: let's take the file system name space and, one missing feature at a time, eliminate the reasons why the filesystem is inadequate for what other name spaces are used for. Only once we have done so will the hobbles be removed from OS architects, or even OS conspiracies. Yet before doing that, we need a core architecture for the semantics, to ensure we end up with a coherent whole. This paper suggests a set theoretic model for those semantics.
The relational models would at times unacceptably add structure to information, the keyword models would at times delete structure, and purely hierarchical models would create information mazes. Reworking their primitives is required to synthesize the best attributes of these models in a way that allows one the flexibility to tailor the level of structure to the need of the moment. The set theoretic model I propose has a syntax that is upwardly compatible with Linux, MacOS, and DOS file system syntax, as well as with the CORBA naming layer.

This is a planning document for the next major version of ReiserFS, that is, a description of vaporware. It is useful to ReiserFS users and contributors who want to know where we are going, and why we are building all sorts of strange optimizations into the storage layer (and especially to those who are willing to help shape the vision in the course of discussions on the reiserfs-list@namesys.com mailing list). Currently the storage layer for ReiserFS is working and useful as an everyday FS with conventional semantics. That storage layer is available as a GPL'd Linux kernel patch at http://namesys.com.

== Introduction ==

Many OS researchers have built hierarchical name spaces that innovate in their effect on the integration of the operating system (e.g. Plan 9 and its file system [Pike]). Relational and keyword researchers rightfully scorn hierarchical name spaces as 20 years behind the state of the art [Date], but pay little attention to integration of the operating system as a design objective in their own work, or as a possible influence on data model design. I won't go into that here. Limiting associations to single key words is an unnecessary restriction.

== A Naming System Should Reflect Rather than Mold Structure ==

The importance of not deleting the structure of information is obvious; few would advocate using the keyword model to unify naming.
What can be more difficult to see is the harm from adding structure to information; some do recommend the relational model for unifying naming (e.g. OS/400). By decomposing a primitive of a model into smaller primitives one can end up with a more general model, one with greater flexibility of application. This is the very normal practice of mathematicians, who in their work constantly examine mathematical models with an eye to finding a more fundamental set of primitives, in hopes that a new formulation of the model will allow the new primitives to function more independently, and thereby increase the generality and expressive power of the model. Here I break the relational primitive (a tuple is an unordered set of ordered pairs) into separate ordered and unordered set primitives. Relational systems force you to use unordered sets of ordered pairs when sometimes what you want is a simple unordered set. Why should a naming system match rather than mold the structure of information? For systems of low complexity, the reasons are deeply philosophical, which means uncompelling. And for multiterabyte distributed systems?... Reiser's Rule of Thumb #2: The most important characteristic of a very complex system is the user's inability to learn its structure as a whole. We must avoid adding structure, or guarantee that the user will be informed of all structure relevant to his partial information. Avoiding adding structure is both more feasible and less burdensome to the user. Hierarchical, relational, semantic, and hypersemantic systems all force structure on information, structure inherent in the system rather than the information represented. If a system adds structure, and the user is trying to exploit partial knowledge (such as a name embodies), then it inevitably requires the user to learn what was added before he can employ his partial knowledge. With complex systems, the amount added is beyond the capacity of users to learn, and information is lost. 
Example:

"My name is Kali, your friendly technical support specialist for REGRES. Our system puts the Library of Congress online! How may I help you?"

George doesn't know Santa Claus' name: "I'm trying to find the reindeer chimneys christmas man, and I can't get your system to do it."

FIGURE 1. Graphical representation of a typical simple unordered set that is difficult for relational systems.

Kali says: "OK, now let's define a query. is-a equals man, that's easy. But reindeer? Is reindeer a property of this man?"

"Uh, no. I wish I could remember the dude's name. I read this story about him a long time ago, and all I can remember is that he had something to do with reindeer and chimneys. The story is on-line, somewhere."

"Reindeer chimneys presents man, that's the sort of speech pattern I'd expect from a three year old," Kali corrects him. "Let's see if we can structure this properly. Is reindeer an instance-of of this man? A member-of of this man? It couldn't be a generalization of this man. Hmm..."

"No! It's not that complicated. They just have something to do with him."

"Pavlov would probably say you associate reindeer with this man, the way the unstructured mind of an animal thinks. But here in technical support we try to help our customers become more sophisticated. Is reindeer a property of this man?"

"No. Try propulsion-provider-for."

"Do you think that that was the schema the person who put the information in our system used?"

"No. Shoot. I can think of a dozen different columns it could be under. But what are the chances that the ones I think of are going to be the same as the ones the dude who put the information in used?"

Kali feels satisfaction. "Guess it can't be done, not if you can't structure your REGRES query properly. I'll put you down in my log as a closed ticket, 190 seconds to resolution, not bad."

"A keyword system could handle reindeer chimneys christmas man," George grumbles as he stares in despair at his display.
Unfortunately, the Library of Congress is only one of REGRES' many reference aids. George could spend his life at it, and he'd never learn its schema. "But a keyword system would delete even necessary structure inherent to the information. It couldn't handle our other needs!" Kali says before she hangs up. In addition to the searcher's difficulties, having to manufacture structure by specifying the column for reindeer also adds unnecessary cognitive load to the story author's indexing tasks. == A Few of the Other Approaches to This Problem == There is lurking at the heart of my approach a subtle difference between my analysis of naming, and the analysis of at least some others. I started my research by systematically categorizing the different structures embodied by names, placing them into equivalency classes, and then picking one syntax out of each class of functionally equivalent naming structures, on the assumption that each of the equivalency classes has value. For example, I considered that languages sometimes convey structure by word endings (tags), and sometimes by word order, but while the syntax differs, the word order and word ending techniques are equivalent in their power to convey structure. In my analysis of the effect of word ordering I decided that either the ordering mattered, or it did not, and that was the basis for two different naming primitives. Others have instead studied the inherent structure of data, and then from that derived ways of naming. The hypersemantic system [Smith] [Potter] represents an attempt to pick a manageably few columns which cover all possible needs. Generalization, aggregation, classification, and membership correspond to the is-a, has-property, is-an-instance-of, and is-a-member-of columns, respectively. The minor problem is that these columns don't cover all possibilities. They don't cover reindeer, presents, or chimneys for George's query.
The major problem is that they don't correspond as closely as possible to the most common style of human thought, simple unordered association, and require cognitive effort to transform. The first response of relational database researchers to this is usually to ask: "Why not modify an existing relational database to contain an 'associated' column, put everything in that column, and it would be functionally equivalent to what you want?" This is like saying that you can do everything Pascal can do using TeX macros. (They are both Turing complete.) We don't design languages to simply be Turing complete, we design them to be useful. I have seen a colleague do in six lines of SQL (nonstandard SQL) a simple three keyword unordered set that I do in 3 words plus a pair of delimiters, and that traditional keyword systems also handle easily. Doing simple unordered sets well is crucial for highly heterogeneous name spaces, and the market success of keyword systems in Internet searching is evidence of that. If you look at the structure of names in human languages, they are not all tuple structured, and to make them tuple structured might be to distort them. I have merely discussed the burden of naming columns. Most relational systems also require the user to specify the relation name. If column naming is a burden, naming both the column and the relation is no less a burden. Many systems invest effort into allowing you to take the key that you know, and figure out all the relation names and columns that you might choose to pair with it. This is a good idea, but not as good as not imposing extraneous structure to begin with. [Salton] can be read for devastating critiques of the document clustering system, but there is a worthwhile idea lurking within that system. Perhaps it is worthwhile to keep track of a small number of documents which are "close" to a given document.
The document creator could be informed upon auto-indexing the document what other documents appear to be close to it, and asked to consider associating it with them. This is not within our current plan of work, but I don't reject it conceptually. In summary, modularity within the naming system is improved by recognizing unordered grouping and ordering as two different functions that deserve separate primitives rather than being combined into a tuple primitive. The tuple is an unordered set of ordered pairs. There are other useful combinations of unordered grouping and ordering than that embodied by the relation, and the success of keyword systems suggests that a plain unordered set without any ordering at all is the most fundamental and common of them. == Names as Random Subsets of the Information In an Object == A system may still be effective when its assumptions are known to be false. You may regard the above as an overstatement of the notion that we are neural nets, and sometimes our abstract systems deal with assumptions that are not true or false, but are somewhat true. After we are finished stating them in English they lose the delicate weighting possessed by the reality of the situation. Sometimes we find it easier to model without that weighting. Classical economics and its assumption of perfect competition is the best known example of an effective system based on assumptions known to be substantially false. Introductory economics classes usually spend several weeks of class time arguing the merits of building models on somewhat false assumptions. This paper will now use such a somewhat false model to convey a feel for why mandatory pairing of name components causes problems. Assume the user's information from which he tries to construct a description will be some completely random subset of the information about the object. (Some of that information will be structural, and the structural fragments selected will be just as random as the rest.)
Assume a user has 15 random clues of information selected from 300 pieces of information the system knows about some object. Assume the REGRES naming system requires that data be supplied in threesomes (perhaps column name, key name, relation name), and cannot use one member of a threesome without the other members of the threesome. Assume the ANARCHY naming system lacks this restriction, but does so at the cost that it can only use those 10 of the 15 information fragments which do not embody structure. Assume the 15 pieces of information the user has to construct a name with are statistically fully independent and equally likely (this is both substantially wrong, and unfair to REGRES, but .... ) Assume each clue has a selectivity of 100 (it divides the number of objects returned by 100). Then ANARCHY has a selectivity of 100^10 = 10^20 = good. REGRES has a selectivity of: 100^(chance that the other two members of an object's threesome are possessed by the user × 15) = 100^(9/300 × 8/300 × 15) ≈ 1.06 = very bad. While it is not true that the clues are fully independent, it is true that to the extent that they are not fully dependent, ANARCHY will gain in selectivity compared to REGRES. Attempting to quantify for any database the extent of the dependence would be a nightmare, and so this model assumes a substantial falsity, through which it is hoped the reader can see a greater truth. For databases of the lower heterogeneity and complexity that the relational model was designed for, the independence within a threesome can be small, and the ability to also employ the 5 of 15 fragments which are structural is often more important than the difficulty of guessing any structure added. There is an implicit assumption here that you are looking for information that others have structured, and this argument in favor of ANARCHY becomes much less strong without this assumption.
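The arithmetic of this toy model can be checked directly. A sketch in Python, reproducing the model's own figures (15 clues held out of 300 facts, selectivity 100 per clue, 9/300 × 8/300 as the model's chance of holding the other two members of a threesome):

```python
# Worked version of the selectivity model above.
clue_selectivity = 100

# ANARCHY can use the 10 non-structural clues independently;
# each clue divides the candidate set by 100.
anarchy_selectivity = clue_selectivity ** 10
print(anarchy_selectivity == 10 ** 20)  # True

# REGRES needs complete threesomes. Following the model's figures,
# the chance the user also holds the other two members of a clue's
# threesome is (9/300) * (8/300); the expected number of usable
# threesomes is that probability times the 15 clues held.
expected_threesomes = (9 / 300) * (8 / 300) * 15
regres_selectivity = clue_selectivity ** expected_threesomes
print(round(regres_selectivity, 2))  # 1.06
```

So the exponent collapses from 10 to about 0.012, which is the whole argument in one line: unusable fragments do not merely go to waste, they gut the exponent.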
I feel obligated to stress once again that I do not advocate low structure over high structure, but I do advocate having the flexibility to match the amount of structure to the needs of the moment. Only with such flexibility can one hope to use all of the 15 fragments that happen to be possessed. == The Syntax In More Detail == What's needed is a naming system intended to reflect just the structure inherent in the information, whatever that structure might be, rather than restructuring the information to fit the naming system. === Orthogonal or Unoriginal Primitives and Features === There are many primitives that the ultimate naming system would include but which I will not discuss here: macros, OR, weight for subnames and AND-OR connectors [Fox], rules, constraints, indirection, links, and others. I have tried to select only those aspects in which my approach differs from the standard approach. Unifying the namespace does not require unifying automatic name generation, and those who read the [Blair] vs. [Salton] controversy likely understand my concluding that whatever the benefits might be of unifying automatic name generation, it is not feasible now, and won't be feasible for a long time to come. The names one can assign an object are kept completely orthogonal from the contents of the object in the implementation of this naming layer. It is up to the owner of the object to name it, and it is up to him to use whatever combination of autonaming programs and manual naming best achieves his purpose. He may name it on object creation, and he may continually adjust its various names throughout its lifetime. See the section defining the "Key_Object primitive" for a discussion of why names should be thought of this way. Technically, object creation only requires the object be given a Storage_Key.
In practice most users will, in the same act that creates the object, also associate the object with at least one name that will spare them from directly specifying the Storage_Key in hex the next time they make a reference to it. Applications implementing external name spaces can interact with the storage layer by referencing just the Storage_Key. Namesys will provide a manual naming interface, and the API autonaming programs need to plug into. Companies such as Ecila will provide autonamers for various purposes. Ecila is implementing a program which scans remote stores, creates links to them in the unified name space, but leaves the data on the remote stores. Other programs may also be implemented to perform this general function. To be more specific, the Ecila search engine scans the web for documents in French, and uses the filesystem as an indexing engine. However, they are writing their engine to be a general purpose engine; they have sold support and the addition of extensions to it to other search engine companies, and it is open source. For now we are simply functioning as part of their engine, and the interface is by web browser; at some point we may be able to add their functionality to the namespace. While the implementation of Microsoft's attempt to blur the distinction between the filesystem name space and the web namespace is one more of appearance than substance, it is surely the right thing to do for Linux as well in the long run. We should simply make our integration one with substance and utility, rather than integrating mostly the look and feel. When the store is external to the primary store for the namespace, then stale names can be an issue with no clean resolution. That said, unification at just the naming layer is, in a real rather than ideal world, often quite useful, and so we have Internet search engines.
GUI based naming is beyond the scope of this paper, except to mention that it is common for GUI namespaces to be designed such that they are not well integrated with the other namespaces of the OS. They are often thought to necessarily be less powerful, but proper integration would make this untrue, as they would then be additional syntaxes, not substitutes. These additional syntaxes should possess closure within the general name space, and thereby be capable of finding employment as components of compound names like all the other types of names. The compound names should be able to contain both GUI and non-GUI based name components. Integration would make them simply the aspect of naming that applies to what is present in the visual cache of the screen, and to how to manage and display that cache most effectively. === Vicinity Set Intersection Definition (Also Called Grouping) === Suppose you have a set X of objects. Suppose some of these objects are associated with each other. You can draw them as connected in a graph. Let the vicinity of an object A be the set of objects associated with A. Let there be a set of query objects Q. Then the set vicinity intersection of Q is the set of objects which are a member of all vicinities of the objects in Q. When thinking of this as a data model, it seems natural to use the term vicinity set intersection. When thinking of this syntactically, it seems natural to use the term "grouping", because it implies that the subnames are grouped together without the order of the subnames being significant. There is exactly one data model primitive (set vicinity intersection) possessing exactly one syntax (grouping), and I rarely intend to distinguish data model primitive from syntax primitive (I can be criticized for this), and yet I use both terms for it, forgive me.
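The vicinity set intersection just defined can be sketched in a few lines of Python; the object names and associations below are invented purely for illustration:

```python
# Vicinity set intersection: the set of objects that are in the
# vicinity (association set) of every object in the query set Q.
def vicinity_intersection(vicinity, query):
    """vicinity maps each object to the set of objects associated
    with it; query is the set of query objects Q."""
    sets = [vicinity[q] for q in query]
    return set.intersection(*sets) if sets else set()

# Hypothetical associations for the Santa Claus story example.
vicinity = {
    "reindeer": {"santa-story", "zoology-text"},
    "chimneys": {"santa-story", "masonry-manual"},
    "christmas": {"santa-story", "carol-book"},
}
print(vicinity_intersection(vicinity, {"reindeer", "chimneys", "christmas"}))
# {'santa-story'}
```

Each extra query object can only shrink the result, which is why adding clues to a grouping narrows the search rather than restructuring it.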
=== Synthesizing Ordering and Grouping === I am going to describe a toy naming system that allows focusing on how best to combine grouping and ordering into one naming system. This synthesis will contain the core features of the hierarchical, keyword, and relational systems as functional subsets. It consists of a few simple primitives, allowed to build on each other. It sets the discussion framework from which our project will over many years evolve a real naming system out of its current storage layer implementation. Resolving the second component of an ordering is dependent on resolving the first --- unlike set theory. In set theory one can derive ordered set from unordered set, but because resolving the name of the second component depends on the first component one cannot do so in this naming system. For this reason it can well be argued that this naming system is not truly set theory based. Now that I have mentioned this difference I will start to call them grouping and ordering, rather than unordered and ordered set. These two primitives take other names as sub-names, and allow the user to construct compound names. Either the order of the subnames is significant (ordering), or it isn't (grouping), and thus we have the two different primitives. Because I have myself found that BNFs are easier to read if preceded by examples, I will first list progressively more complex examples using the naming system, and then formally define it. The examples, and the simplified syntax, use / rather than : or \, but this is of no moment. Examples: /etc/passwd Ordering and grouping are not just better; file system upward compatibility makes them cheaper for unifying naming in OSes based on hierarchical file systems than a relational naming system would be. This approach is fully upwardly compatible with the old file system.
Users should be able to retain their old habits for as long as they wish, engage in a slow comfortable migration, and incorporate the new features into their habits as they feel the desire. Elderly programs should be untroubled in their operation. Many worthwhile projects fail because they emphasize how much they wish to change rather than asking of the user the minimal collection of changes necessary to achieve the added functionality. [dragon gandalf bilbo] FIGURE 3. Graphical representation of ascii name on left Mr. B. Bizy is looking for a dimly remembered story (The Hobbit by Tolkien) to print out and take with him for rereading during the annual company meeting. case-insensitive/[computer privacy laws] FIGURE 4. Graphical representation of ascii name on left When one subname contains no information except relative to another subname, and the order of the subnames is essential to the meaning of the name, then using ordering is appropriate. This most commonly occurs when syntax barriers are crossed. This is when a single compound name makes a transition from interpreting a subname according to the rules of one syntax to interpreting it according to the rules of another syntax. Ordering is essential at the boundary between the name of the new syntax as expressed in the current syntax, and the name to be interpreted according to that new syntax. Some researchers use the term context rather than syntax. The pairing of a program or function name, and the arguments it is passed, is inherently ordered. While that is usually the concern of the shell, when we use a variety of ordering functions to sort Key_Objects of different types it affects the object store. In this example the ordering serves as a syntax barrier. Case-insensitive is the unabbreviated name of a directory that ignores the distinction between upper and lower case.
For Linux compatibility this naming layer is case sensitive by default, even though I agree with those who think that it would be better were it not. [my secrets]/[love letter susan] FIGURE 5. Graphical representation of ascii name on left Devhuman (that's the account name he chose) is the company's senior programmer. Six years ago he wrote a love letter to Susan, which he put in his read-protected secrets directory. (He never found the nerve to send it to her.) He's looking for it so he can rewrite it, and then consider sending it. Security is a particular kind of syntax barrier (you have to squint a bit before you can see it that way). Here the ordering serves as a security barrier. (He certainly wouldn't want anyone to know that an object owned by him with attributes love letter susan existed.) [subject/[illegal strike] to/elves from/santa document-type/RFC822 ultimatum] FIGURE 6. Graphical representation of search for santa's ultimatum Devhuman knows his object store cold. He is looking for something he saw once before, he knows that it was auto-named by a particular namer he knows well (perhaps one whose functionality is similar to the classifier in [Messinger]), and he knows just what categorizations that namer uses when naming email. Still, he doesn't quite remember whether the word 'ultimatum' was part of the subject line, the body, or even was just elvish manual supplementation of the automatic naming. Rather than craft a query carefully specifying what he does and does not know about the possible categorizations of ultimatum, he lazily groups it. If Devhuman's object store is implemented using this naming system with good style, someone less knowledgeable about the object store would also be able to say: [santa illegal strike ultimatum elves] and perhaps get some false hits as well as the desired email (instead of finding mail from santa perhaps finding the elvish response).
Notice that if you delete the 'illegal' and 'ultimatum' to get [subject/strike to/elves from/santa document-type/RFC822] the query is structurally equivalent to a relational query. Many authors (e.g. semantic database designers) have written papers with good examples of standard column names which might be worth teaching to users. So long as they are an option made available to the user rather than a requirement demanded of the user, the increased selectivity they provide can be helpful. [_is-a-shellscript bill] FIGURE 7. Graphical representation of ascii name on left This name finds all shellscripts associated with bill. Names preceded by _ are pruners. Pruners are analogous to the predicate evaluators of relational database theory. If you have read papers distinguishing between recognition and retrieval, pruners are a recognition primitive. They are passed a list of objects, and return a subset of that list which matches some criteria. They are a mechanism appropriate for when a nonlinear search method that can deliver the desired functionality is either impossible, or not supported by existing indexes. There are many useful names for which we cannot do better than linear time search algorithms (perhaps simply as a result of incomplete indexing). _is-a-shellscript checks each member of its list to see if it is an executable object containing solely ascii. The user can use it just like any other Key_Object within an association; it will prune the results of the grouping. Since set intersections are commutative its order within the grouping has no meaning, and optimizers are free to rearrange it. === The Formal Definitions === <Object Name> ::= <Grouping> | <Ordering> | <Key_Object> | <Storage_Key> | <Orthogonal and Unoriginal Primitives I Will Not Define Here> ; See the section listing orthogonal and unoriginal primitives for a discussion of what primitives I left out of the definitions of this grammar that are necessary to a real world working system.
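The pruner mechanism described above amounts to a predicate filter applied after the grouping's set intersection. A minimal Python sketch, with an invented object model and an _is-a-shellscript-style predicate (executable, ascii-only contents):

```python
# A pruner is passed the candidate list a grouping produced and
# returns the subset matching its criterion.
def prune(candidates, predicate):
    # The predicate may only inspect each object in isolation, so
    # that an optimizer may apply pruners in any order.
    return [obj for obj in candidates if predicate(obj)]

# Hypothetical _is-a-shellscript: executable object, solely ascii.
def is_a_shellscript(obj):
    return obj.get("executable", False) and obj.get("ascii_only", False)

# Invented candidate objects associated with 'bill'.
candidates = [
    {"name": "bill-report.txt", "executable": False, "ascii_only": True},
    {"name": "bill-backup.sh", "executable": True, "ascii_only": True},
    {"name": "bill-photo.jpg", "executable": False, "ascii_only": False},
]
print([o["name"] for o in prune(candidates, is_a_shellscript)])
# ['bill-backup.sh']
```

The linear scan is the point: a pruner trades index support for generality, so it is the fallback when no nonlinear search method exists.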
The name resolver has a method for converting all of the primitives into <Storage_Keys>, and when processing compound names it first converts the subnames into <Storage_Keys>; the object may have null contents, and serve purely to embody structure. This allows the use of anything which anyone can invent a way of allowing the user to find an <Object Name> for, and then invent a method for the resolver to convert the <Object Name> into a <Storage_Key>, as a component of a grouping or ordering. In a word, closure. Extensible closure. Compound names are interpreted by first interpreting the subnames that they are constructed from. At each stage of subname interpretation an <Object Name> is converted into a <Storage_Key> for the object that it is resolved to. The modules that implement the grouping and ordering primitives do not interpret the subnames, they merely pass them to the naming system which returns the <Storage_Key>s they resolve to. It was a long discussion which led to the use of storage keys rather than objectids. A storage key differs from an objectid in that it gives the storage layer directions as to where to try to locate the object in the logical tree ordering of the storage layer. If the logical location changes, then in the worst case we leave a link behind, and get an extra disk access like we get with an inode. (Inode numbers are functionally objectids.) In the better case, the repacker eventually comes along, and changes all references by key to the new location, at least for all objects that have not given their key to external naming systems the repacker cannot repack. A <Storage_Key> is assigned by the system at object creation, and serves the purpose of allowing the system to concisely name the object, and provide hints to the storage layer about which objects should be packed near each other. The user does not directly interact with the <Storage_Key> any more often than C programmers hardcode pointers in hex.
The packing locality of keys may be redefined. == The Primitives == <Key_Object> A description of the contents of an object using the syntax of the current directory. For objects used to embody keywords this may be the keyword in its entirety. If it contains spaces, etc., it must be enclosed in quotes. Note that making it easy for third parties to add plug-in directory types is part of Namesys's current contract with Ecila. Ecila wants space efficient directories suitable for use in implementing a term dictionary and its postings files for their Internet search engine. Example: [reindeer chimneys presents man] In this name 'presents', 'reindeer', 'chimneys', and 'man' are the contents of objects associated with the Santa Claus story. Each of them is searched for by contents, and then when found they are converted into their Storage_Keys, and then the grouping algorithm is fed their four Storage_Keys. The grouping module then looks in the object headers of the four objects, gets the four sets of objects the Key_Objects group to, and performs a set intersection. Besides greater closure, another advantage of storing Key_Objects as objects is that non-ascii Key_Objects and ordering functions can be implemented as a layer on top of the ascii naming system, allowing the user to interact with the naming system by pressing hyperbuttons, drawing pictures, making sounds, and supplying other non-ascii Key_Objects that the higher layers convert into Storage_Keys. There are endless content description techniques. If the directory owner supplies an ordering function for the Key_Objects in a directory, one can generate a search index for the directory using a directory plug-in which is fully orthogonal to the ordering function, though perhaps slower in some cases than one that is tailored for the ordering function. Users will find it easier to write ordering functions than index creation objects, and will not always need the speed of specialized indexes.
We will need one ordering function for ascii text, another for numbers, another for sounds, perhaps someday one even for pictures of faces (perhaps to be used by a law enforcement agency constructing an electronic mug book, or a white pages implementation), etc. No system designer can provide all the different and sometimes esoteric ordering functions which users will want to employ. What we can do is create a library of code, from which users can construct their own ordering function and their own directory plug-ins, and this is the approach we are taking on behalf of Ecila. For an Internet search engine one wants what is called a postings file, which is like a directory in that there is no need to support a byte offset, and one frequently wants to efficiently perform insertions into it. <Grouping> ::= [<Unordered List>] ; <Unordered List> ::= <Unordered List> <Unordered List> | <Object Name> | <Pruner> ; <Pruner> ::= _<Object Name> A <Grouping> is a list of object names and pruners whose order has no meaning. Every object has a list of objects it groups to (associates with in neural network idiom) in its object header. A grouping is interpreted by performing a set intersection of those lists for every object named in the grouping. In the sense of the data model, a grouping is interpreted by performing a set vicinity intersection. Grouping is not transitive: [A] => B and [B] => C does not imply [A] => C though it does imply that [[A]] => C A pruner is an <Object Name> which has been preceded with an _ to indicate that the object described should be passed a list of objects named by the rest of the grouping, executed, and it will return a subset of the list it was passed.
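The non-transitivity noted above can be checked with a small sketch (associations invented for illustration): [A] reaches B but not C, while [[A]], which resolves the inner grouping first, does reach C:

```python
# Each object's header lists the objects it groups to.
assoc = {"A": {"B"}, "B": {"C"}}

def resolve(grouping):
    # A grouping resolves to the intersection of the association
    # sets of its subnames.
    return set.intersection(*(assoc[name] for name in grouping))

print(resolve(["A"]))         # {'B'}   -- [A] => B
print("C" in resolve(["A"]))  # False   -- [A] does not reach C
inner = resolve(["A"])        # resolve the inner grouping first ...
print(resolve(list(inner)))   # {'C'}   -- ... so [[A]] => C
```

Nesting is thus not a no-op: each pair of brackets means one more hop through the association graph.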
Whether a member of the set is in the returned subset must be fully independent of what the other members of the set were, or else the results become indeterminate after application of a query optimizer, as with an optimizer in use there is no guarantee provided of the order of application of the pruners. <Ordering> ::= <Object Name>/<Object Name> | <Object Name>/<Custom Programmed Syntax> ; <Custom Programmed Syntax> ::= Varies, provides extensibility hook. An ordering is a pairing of names, with the order representing information. The first component of the ordering determines the module to which the second component is passed as an argument. In contrast, a grouping first converts all subnames to Storage_Keys by looking through the same current directory for all of them in parallel, and then does its set intersection with the subdescriptions already resolved. Example: In resolving [my secrets]/[love letter susan] the system would look for the objects with contents my and secrets, find both of them and do a set intersection of all of the objects those two objects both group to (are associated with). This will allow it to find the [my secrets] directory, inside of which it will look for the three objects love, letter, and susan. It will then extract from their object headers the sets of objects those three words ('love', 'letter', and 'susan') group to, and do a set intersection which will find the desired letter. The desired letter is not necessarily inside the [my secrets] directory, though in this case it probably is. A directory is an object named by the first component of an ordering, to which the second component is passed, and which returns a set of Storage_Keys. One can in principle use different implementations of the same directory object without impacting the semantics and only affecting performance, as is often done in databases.
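The two-stage resolution of [my secrets]/[love letter susan] described above can be sketched as follows; the directory layout and Storage_Key strings are hypothetical, chosen only to mirror the example:

```python
# Resolving an ordering: the first component names a directory in
# the current context, and the second component is interpreted
# within the directory it resolves to.
def resolve_grouping(directory, subnames):
    # Intersect the association sets of each subname's Key_Object.
    sets = [directory["assoc"][name] for name in subnames]
    return set.intersection(*sets)

def resolve_ordering(root, first, second):
    (dir_key,) = resolve_grouping(root, first)  # e.g. the secrets dir
    return resolve_grouping(root["objects"][dir_key], second)

# Hypothetical object store for Devhuman's example.
root = {
    "assoc": {"my": {"secrets-dir"}, "secrets": {"secrets-dir"}},
    "objects": {
        "secrets-dir": {
            "assoc": {
                "love": {"letter-42"},
                "letter": {"letter-42", "memo-7"},
                "susan": {"letter-42"},
            },
        },
    },
}
print(resolve_ordering(root, ["my", "secrets"], ["love", "letter", "susan"]))
# {'letter-42'}
```

Note how the second grouping cannot even begin until the first has resolved, which is precisely the dependence that makes ordering a distinct primitive rather than sugar over sets.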
There are flavors of directories: Custom programmed directories, aka filters, are any executable program that will return a Storage_Key when executed and fed the second component as an argument. They provide extensibility. (They are the ordered counterpart of pruners.) Another term for them is filter directories. Custom programmed directories whose name interpretation modules aren't unique to them will contain just the name of the module (filter), plus some directory dependent parameters to be passed to the module. It should be considered merely a syntax barrier directory, and not a fully custom programmed directory, if those parameters include a reference to a search tree that the module operates on, and if that search tree adheres to the default index structure. The connotations conveyed by the term 'filter' of there being an original which is distorted are not always appropriate, but in honesty this is not an issue about which we deeply care. Syntax barrier directories allow you to describe the contents of the object they contain with a syntax different from their parents. Except for being sorted by a different ordering function, the indexes of syntax barrier directories are standard in their structure, and use a standard index traversal module. The index traversal module is ordering function independent. There must be an ordering function for every <Key_Object> employed within a given syntax barrier directory. By contrast, a <Custom Programmed Syntax> could be anything which the syntax module somehow finds an object with, possibly even creating the object in order to be able to find it. To cross a security barrier directory the user must use an ordered pair of names with the security barrier as the first member of the pair, and he must satisfy the security module of the secured directory. A security barrier directory may be both a security and a syntax barrier directory, or the security barrier directory may share the syntax module of its parents.
Fully standard directories are those built using the default directory module, and adding structure is their only semantic effect. There is an aspect of customization which is beyond the scope of this paper, in which one customizes the items employed by the storage layer to implement files and directories. That is, the storage of the files and directories are implemented by composing them of items, and these items have different types. We are now creating the code for packing and balancing arbitrary types of items using item handlers and object oriented balancing code, so as to make it easier to extend our filesystem. === Ordering can be implemented more efficiently than grouping === The set intersections performed in evaluating the grouping primitive are normally much more expensive computationally than performing the classical filesystem lookup. Imposing excess structure on one's data does not just at times reduce the cost of human thinking :-), it can be used to reduce the cost of automated computation as well. When the cost to a user of learning structure is less important than the burden on the machine, use of highly ordered names is often called for. === The Motivation for Different Syntactic Treatment of Ordering and Grouping, and Some of the Deeper Issues Revealed by the Difference. === An important difference between grouping and ordering affects syntax. It allows us to represent an ordering with a single symbol ( '/') placed between the pair, but requires two symbols ( '[' and ']' ) for each grouping. Imagine using < and > as a two symbol delimiter style alternative notation for ordering: <<father-of mother-of>sister-of> = <father-of<mother-of sister-of> > = <father-of mother-of sister-of> = father-of /mother-of /sister-of All of the expressions above are equivalent in referring to the paternal great aunt of the person who is the current context. 
The ones using nested pairs of symbols to enclose pairs of subnames imply a false structure that requires the user to think to realize the first two expressions are equivalent. The fourth is the notation this naming system employs.

Grouping is different: Fast Acting Freddy is looking through the All-LA Shopping Database for a single store with black reebok sneakers, a green leather jacket, and a red beret so that he can dress an actor for a part before the director notices he forgot all about him.

 [[black reebok sneakers] [green leather jacket] [red beret]]

is not equivalent to

 [black reebok sneakers green leather jacket red beret]

which equals

 [red sneakers black reebok jacket green beret]

Ordering is not algebraically commutative (father-of/mother-of is not equivalent to mother-of/father-of). Groupings are algebraically commutative ([large red] = [red large]).

== Style ==

As a general principle, a more restricted system can avoid requiring the user to repeatedly specify the restrictions, and if the user has no need to escape the restrictions then the restricted system may be superior. This is why "4GLs", which supply the structure for the user's query, are useful for some applications. They are typically implemented as layers on top of unrestricting systems such as this one.

This paper has addressed issues surrounding finding information, particularly when the user's clues are faint. When supporting other user goals, such as exploring information, adding structure through substantial use of ordering can be helpful [Marchionini][McAleese]. When the user goal is finding, one should assume that of all the fragments of information about an object, the user has some random subset of them. The goal is to allow the user to use that random subset in a name, whatever that subset might be. Some of that subset will be structural fragments.
While requiring the user to supply a structure fragment is as foolish as requiring him to supply any other arbitrary fragment, allowing him to is laudable. In the best of all worlds the object store would incorporate all valid possible structurings of Key_Objects. The difficulty in implementing that is obvious. [Metzler and Haas] discuss ways of extracting structure from English text documents, and why one would want to be able to use that structure in retrievals. Unfortunately, there is an important difference between representing the structure of an English language sentence in a way that conveys its meaning, and representing it in a way that allows it to be found by someone who knows only a fragment of its semantic content.

I doubt the wisdom of trying to advocate the use of more than essential structure in searching. You can allow users to avoid false structure; you cannot force them to. It is important to teach those creating the structure that if they group a personnel file with sex/female they should also group it with female. Type checking can impose structure usefully. Its implementation can enhance or reduce closure, depending on whether it is done right.

=== When To Decompound Groupings ===

There are dangers in excessive compounding of compound groupings analogous to those of excessive ordering. Let's examine two examples of compound groupings, both of which are valid both semantically and syntactically. One of them can be "decompounded" with moderate information loss, and the other loses all meaning if decompounded.

Example: Finding a loquacious Celtic textbook salesman who told you in excruciating detail about how he was an ordnance researcher until one day he went to a Grateful Dead concert.

 [[Celtic textbook salesman] [ordnance researcher]]

vs.

 [celtic textbook salesman ordnance researcher]

These two phrasings of the same query are not equivalent, but they are "close."
Our second example is the one in which Fast Acting Freddy tries to find a suspect by the objects he is associated with:

 [[black reebok sneakers] [green leather jacket] [red beret]]

vs.

 [black reebok sneakers green leather jacket red beret]

These two are not at all "close." The difference between the two examples of inequivalencies is that the subdescriptions within the second example describe objects whose existence within the object store, independent of the object described, is worthwhile. The first does not, and it is more reasonable to try to design so that the "decompounded" version of the query is used. False hits will occur, but for large systems that's better than asking the user to learn structure.

A higher level user interface might choose to present only one level to the user at a time, and then once the user confirms that a subdescription has resolved properly it would let him incorporate it into a higher level description. There might be 6 models of [black reebok sneakers], and Fast Acting Freddy should have the opportunity to click his mouse on the exact model, and have the interface substitute that object for his subdescription. Using such an interface an advanced user might simultaneously develop several subdescriptions, refine and resolve them, and then use the mouse to draw lines connecting them into a compound grouping. Closure makes it possible for that to work.

== Examples of Creating Associations ==

<- creates an association between all of the objects on the left hand side and all of the objects on the right hand side. A - B is the set difference of A and B, and it resolves to the set of objects in A except for those that are in B. A & B resolves to the set intersection of A and B, the objects that are both in A and B. [A B] = [A] & [B], by definition.
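These set operations, together with the earlier point that groupings commute while orderings do not, can be sketched with ordinary Python sets. The association store below (a dict from attribute to the set of objects grouped with it) is a hypothetical illustration of the semantics, not the paper's storage design; all data values are invented.

```python
# Sketch: grouping as set intersection, ordering as sequence (toy model).

# An inverted map: attribute -> set of objects associated with it.
associations = {
    "black":    {"sneakers-3", "jacket-9"},
    "reebok":   {"sneakers-3", "sneakers-5"},
    "sneakers": {"sneakers-3", "sneakers-5", "sneakers-8"},
}

def group(*attrs):
    """[a b c] resolves to the intersection [a] & [b] & [c]."""
    result = None
    for a in attrs:
        objs = associations.get(a, set())
        result = objs if result is None else result & objs
    return result if result is not None else set()

# Grouping is commutative: [black reebok] = [reebok black].
assert group("black", "reebok") == group("reebok", "black")

# Set difference and intersection as defined in the text:
A, B = group("sneakers"), group("black")
diff = A - B   # objects in A except those that are in B
both = A & B   # objects that are both in A and B

# Ordering is NOT commutative: father-of/mother-of != mother-of/father-of,
# which is why it is modeled as a sequence rather than a set.
assert ("father-of", "mother-of") != ("mother-of", "father-of")
```

The design point the sketch makes concrete: [A B] is evaluated as [A] & [B], so the grouping primitive is exactly set intersection, while the ordering primitive carries position and therefore cannot be reduced to it.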
 animal <- (lives, moves)
 mammal <- ([animal], animal, `warm blooded')
 cat <- ([mammal], hypernym/mammal, mammal, meronym/fur, fur, meronym/whiskers, whiskers, hypernym/quadruped, quadruped, capability/purr, purr, capability/meow, meow)
 Basil <- (owner/Nina, Nina, [siamese], siamese, clever, playful, brave/overly, brave, 'toilet explorer')
 bag <- ([container], container, consists-of/`highly flexible material', `highly flexible material')
 backpack <- ([bag], shoulderstrap/quantity/2, shoulderstrap, college-student, holonym/backpacker, meronym/shoulderstrap)
 mould <- ([fungi] - green/not, furry, `grows on'/surfaces/moist, `killed by'/chlorine)
 fungi <- ([plant], plant, leaves/no, flowers/no, green/not)
 bird <- ([vertebrate], vertebrate, flies, feathers)
 penguin <- ([bird] - flies, bird, hypernym/bird, swims, Linux, [Linux (mascot, symbol)])
 siamese <- ([cat], cat, hair/short, short-hair)

Notice how we don't associate siamese with short despite associating it with hair/short, but we do associate Basil with Nina as well as with owner/Nina.

 small <-0 little

The above means that small and little are synonyms, and are to be treated as 0 distance away from each other for vicinity calculation purposes. In other, traditional Unix, words, they are hardlinked together.

Creating a serious ontology is not our field or task, but worth doing. The reader is referred to WordNet (free), and Cyc by Doug Lenat (proprietary). While we will focus on implementing primitives that allow for creating better ontologies, we are happy to work with persons interested in contributing or porting an ontology.

== Other Projects Seeking To Increase Closure In The OS ==

=== AT&T's Plan 9 ===

[Plan 9] is being produced by the original authors of Unix at AT&T research labs. It has influenced CORBA, and Linux's /proc is a direct steal from it. Their major focus is on integration. Their major trick for increasing integration is unifying the name space.
Name spaces integrated into the Plan 9 file system include the status, control, virtual memory, and environment variables of running processes. They have a hierarchical analog to what the relational culture calls constructing views, which the Plan 9 culture calls context binding.

=== Microsoft's Information At Your Fingertips ===

Plan 9 ignores integration of application program name spaces, concentrating on OS oriented name spaces. Microsoft's "Information at Your Fingertips" name space integration effort appears to be taking the other approach, focusing on integrating the name spaces of the various Microsoft applications via OLE and Structured Storage. The application group at Microsoft has long been better staffed and funded than the OS group, and FS developers have long preferred to simply ignore the needs of application builders generally. The primary semantic disadvantages of Microsoft's approach are primitives selected with insufficient care, a lack of closure, and the use of an object oriented rather than set oriented approach in both naming syntax and data model. Realistically, one can say that folks within Microsoft have often made statements favoring name space integration, and in various areas have successfully executed on it, but on the whole I rather suspect that the lack of someone in marketing making a business case for $X in revenue resulting from name space integration has crippled name space integration work at commercial OS producers generally, including MS.

==== Internet Explorer ====

Internet Explorer attempts to unify the filesystem and Internet namespaces. At the time of writing, the unification is so superficial, with so little substance, that I would describe it as having the look and feel of integration without most of the substance. Perhaps this will change.
==== Microsoft's Well Known Performance Difficulties ====

Despite having many of the leading names in the industry on their payroll, they have somehow managed to create a file system implementation with performance so terrible that, for the Unix customer base, it is a significant consideration contributing to hesitation in moving to NT. It may well have the worst performance of any of the major OS file systems. Their implementation of OLE's structured storage offers extremely poor performance, and their excuse that it is due to the incorporation of transaction concepts into their design is just a reminder that they did a poor job at that as well. They managed to implement something intended to store small objects within a file, and implement it such that it still suffers from 512-byte granularity problems, problems that they try to somewhat overcome by encouraging the packing of several objects within "storages" at horrible kludge costs.

=== Storage Layers Above the FS: A Sure Symptom The FS Developer Has Failed ===

When filesystems aren't really designed for the needs of the storage layers above them, and none of them are, not Microsoft's, not anybody's, then layering results in enormous performance loss. The very existence of a storage layer above the filesystem means that the filesystem team at an OS vendor failed to listen to someone, and that someone was forced to go and implement something on their own. You just have to listen to one of these meetings in which some poor application developer tries to suggest that more features in the FS would be nice (I heard one at a nameless OS vendor). The FS team responds that disks are cheap, small object storage isn't really important, we haven't changed the disk layout in 10 years, and changing it isn't going to fly with the gods above us about whom we can do nothing.
At these meetings you start to understand that most people who go into filesystem design are persons who didn't have the guts to pursue a more interesting field in CS. There is a sort of reverse increasing returns effect that governs FS research, in which the more code becomes fixed on the current APIs, the more persons in the field react with fear to any thought of the field of FS semantics being other than a dead research topic, the less research gets done, and the fewer persons of imagination see a reason to enter the field. Every time one vendor gets a little forward in adding functionality, the other vendors go on a FUD campaign about it breaking standards and therefore being dangerous for mission critical usage. This is a field in which only performance research is allowed, and every other aspect is simply dead. Namesys seeks to raise the dead, and is willing to commit whatever unholy acts that requires.

There is no need for two implementations of the set primitive, one called directories, the other called a file with streams, each having a different interface. File systems should just implement directories right, give them some more optional features, and then there is no need at all for streams. If you combine allowing directory names to be overloaded to also be filenames when acted on as files, allowing stat data to be inherited, allowing file bodies to be inherited, and implement filters of various kinds, then in the event that the user happens to need the precise peculiar functionality embodied by streams, they can have it by just configuring their directory in a particular way. There was a lengthy linux-kernel thread on this topic which I won't repeat in more detail here.

The tree architecture of the storage layer of this FS design will lend itself to a distributed caching system much more effectively than the Microsoft storage layer, in part due to its ability to cache not just hits and misses of files, but to cache semantic localities (ranges).
For more on this topic see later in this paper.

=== Rufus ===

The Rufus system [Messinger et al.] indexes information while leaving it in its original location and format. While it does allow the user to create a unified name space, it does not choose to integrate that name space into the operating system. Even so, it is immensely useful in practice, and strongly hints at what the OS could gain if it had a more than hierarchical name space with a data model oriented towards what [Messinger] calls "semi-structured information", such as you find in the RFC822 format for email. When you have 7000 pieces of mail, and linearly searching the mail with a utility like grep takes 10 minutes, it is nice to be able to quickly keyword search via inverted indexes for the mail whose from: field contains billg and that has the words "exclusive" and "bundling" in the body of the message, as you hurriedly search for an old email just before an appearance in court.

=== Semantic File System ===

The Semantic File System comes closest to addressing the needs I have described. It is a Unix compatible file system with more than hierarchical naming (attribute based is the term they use). Its data model unfortunately has the important flaw of lacking closure (in it, names of objects are not themselves objects). In my upcoming discussion of the unnecessary lack of closure in hypertext products, notice that the arguments apply to the Semantic File System (and so I won't duplicate them here).

=== OS/400 ===

IBM's OS/400 employs a unified relational name space. The section of this paper entitled A System Should Reflect Rather than Mold Structure will cover its problems of forcing false structure. Inadequate closure due to mandatory type checking is another source of difficulties for it.
While users moan about these two unnecessary design flaws, the essence of the opinions AS/400 partisans have expressed to me has been that the unification of its name space is a great advantage that OS/400 has over Unix. I claim these users were right, and later in this paper will propose doing something about it.

== Conclusion ==

While I spent most of this paper on why adding structure to information can be harmful, particularly when it is intended to be found by others sifting through large amounts of other information, this was purely because it is a harder argument than why deleting structure is harmful. My goal was not to be better at unstructured applications than keyword systems, or better at structured applications than the hierarchical and relational systems --- the goal is to be more flexible in allowing the user to choose how structured to be, while still being within a single name space. I claimed that multiple fragmented name spaces cannot match the power and ease of name spaces integrated with closure: closure makes a naming system far more powerful by increasing its ability to compound complex descriptions out of simpler ones. The strong points of this naming system's design are various forms of generalizing abstractions already known to the literature, for greater closure.

== Acknowledgments ==

David P. Anderson and Clifford Lynch helped enormously in rounding out my education, and improving my paper. Their generosity with their time was remarkable. David P. Anderson was simply a great professor, and it was a privilege to work with him. Brian Harvey informed me that it wasn't too obvious to mention that an object store should be unified. Cimmaron Taylor provided me with many valuable late night discussions in the early stages of this paper. I would like to thank Bill Cody and Guy Lohman of the database group at the IBM Almaden Research Center for a wonderful learning experience.
Vladimir Saveliev kept this file system going when others fell by the wayside. He started as the most junior programmer on the team, and through sheer hard work and dedication to excellence outshone all the other more senior researchers. Of course after some time he could no longer be considered a junior programmer.

NOTE: See also the DARPA funded, but not endorsed, Reiser4 Transaction Design Document and Reiser4 Whitepaper.

== References ==

1. Blair, David C. and Maron, M. E. "Evaluation of Retrieval Effectiveness for a Full-Text Document-Retrieval System", Communications of the ACM, v28 n3, Mar 1985, p289-299

2. Codd, E. F. "The Relational Model for Database Management: Version 2", c1990, Addison-Wesley Pub. Co. Not recommended as a textbook; Date's is better for that, but worthwhile if you want a long paper by Codd. Notice that he places greater emphasis on closure, and design methodology principles in general, than designers of other naming systems such as hypertext.

3. Date, C. J. "An Introduction to Database Systems", 4th ed., Reading, Mass.: Addison-Wesley Pub. Co., c1986. Contains a well written substantive textbook sneer at the problems of hierarchical naming systems, and a well annotated bibliography.

4. Curtis, Ronald and Larry Wittie. "Global Naming in Distributed Systems", IEEE Software, July 1984, p76-80

5. Feldman, Jerome A., Mark A. Fanty, Nigel H. Goddard and Kenton J. Lynne. "Computing with Structured Connectionist Networks", Communications of the ACM, v31, Feb '88, p170(18)

6. Fox, E. A., and Wu, H. "Extended Boolean Information Retrieval", Communications of the ACM, 26, 1983, pp. 1022-1036

7. Gallant, Stephen I. "Connectionist Expert Systems", Communications of the ACM, v31, Feb '88, p152(18)

8. Gates, Bill. Comdex '91 speech on "Information at Your Fingertips", available for $8 on videotape from Microsoft's sales department.

9. Gifford, David K., Jouvelot, Pierre, Sheldon, Mark A., O'Toole, James W.
Jr. "Semantic File Systems", Operating Systems Review, Volume 25, Number 5, October 13-16, 1991. They demonstrated that extending Unix file semantics to include nonhierarchical features is useful and feasible. Unfortunately, their naming system lacks closure.

10. Gilula, Mikhail. "The Set Model for Database and Information Systems", 1st Edition, c1994, Addison-Wesley. Provides a Set Theoretic Database Model in which relational algebra is shown to be a special case of a more general and powerful set theoretic approach.

11. Joint Object Services Submission (JOSS), OMG TC Document 93.5.1

12. Marchionini, Gary, and Shneiderman, Ben. "Finding Facts vs. Browsing Knowledge in Hypertext Systems", Computer, January 1988, p. 70

13. McAleese, Ray. "Hypertext: Theory into Practice", edited by Ray McAleese, ABLEX Publishing Corporation, Norwood, NJ 07648

14. Messinger, Eli, Shoens, Kurt, Thomas, John, Luniewski, Allen. "Rufus: The Information Sponge", Research Report RJ 8294 (75655), August 13, 1991, IBM Almaden Research Center

15. Metzler and Haas. "The Constituent Object Parser: Syntactic Structure Matching for Information Retrieval", Proceedings of the ACM SIGIR Conference, 1989, ACM Press

16. Nelson, T. H. Literary Machines, self published by Nelson, Nashville, Tenn., 1981. Did much to popularize hypertext; at the time of writing he has still not released a working product, though competitors such as HyperCard have done so with notable success.

17. Mozer, Michael C. "Inductive Information Retrieval Using Parallel Distributed Computation", UCLA

18. Pike, Rob and P. J. Weinberger. "The Hideous Name", AT&T Research Report

19. Pike, Rob, Presotto, Dave, Thompson, Ken, Trickey, Howard, Winterbottom, Phil. "The Use of Name Spaces in Plan 9", available via ftp from att.com. Plan 9 is an operating system intended to be the successor to Unix, and greater integration of its name spaces is its primary focus.

20. Potter, Walter D. and Robert P.
Trueblood. "Traditional, semantic, and hyper-semantic approaches to data modeling", Computer, v21, '88, p53(11)

21. Rijsbergen, C. J. Van. Information Retrieval, 2nd ed., Butterworth and Co. Ltd., 1979. Printed in Great Britain by The Whitefriars Ltd., London and Tonbridge

22. Salton, G. (1986) "Another Look At Automatic Text-Retrieval Systems", Communications of the ACM, 29, 648-656

23. Smith, J. M. and D. C. Smith. "Database Abstractions: Aggregation and Generalization", ACM Transactions on Database Systems, June 1977, pp. 105-133. ICS Report No. 8406, June 1984

24. http://www.win.tue.nl/~aeb/partitions/partition_types.html

[[category:Reiser4]]

The Naming System Venture

== ABSTRACT ==

For too long the file system has been semantically impoverished in comparison with database and keyword systems. It is time to change! The current lack of features makes it much easier to use the latest set theoretic models rather than older models of relational algebra or hypertext. The current FS syntax fits nicely into the newer model.

The utility of an operating system is more proportional to the number of connections possible between its components than it is to the number of those components. Namespace fragmentation is the most important determinant of that number of possible connections between OS components. Unix at its beginning increased the integration of I/O by putting devices into the file system name space. This is a winning strategy: let's take the file system name space, and one-by-one eliminate the reasons why the filesystem is inadequate for what other name spaces are used for, one missing feature at a time. Only once we have done so will the hobbles be removed from OS architects, or even OS conspiracies. Yet before doing that, we need a core architecture for the semantics to ensure we end up with a coherent whole. This paper suggests a set theoretic model for those semantics.
The relational models would at times unacceptably add structure to information, the keyword models would at times delete structure, and purely hierarchical models would create information mazes. Reworking their primitives is required to synthesize the best attributes of these models in a way that allows one the flexibility to tailor the level of structure to the need of the moment. The set theoretic model I propose has a syntax that is upwardly compatible with Linux, MacOS, and DOS file system syntax, as well as with the CORBA naming layer.

This is a planning document for the next major version of ReiserFS, that is, a description of vaporware. It is useful to ReiserFS users and contributors who want to know where we are going, and why we are building all sorts of strange optimizations into the storage layer (and especially those who are willing to help shape the vision in the course of discussions on the reiserfs-list@namesys.com mailing list). Currently the storage layer for ReiserFS is working and useful as an everyday FS with conventional semantics. That storage layer is available as a GPL'd Linux kernel patch at http://namesys.com.

== Introduction ==

Many OS researchers have built hierarchical name spaces that innovate in their effect on the integration of the operating system (e.g. Plan 9 and its file system [Pike]). Relational and keyword researchers rightfully scorn hierarchical name spaces as 20 years behind the state of the art [Date], but pay little attention to integration of the operating system as a design objective in their own work, or as a possible influence on data model design. I won't go into that here. Limiting associations to single key words is an unnecessary restriction.

== A Naming System Should Reflect Rather than Mold Structure ==

The importance of not deleting the structure of information is obvious; few would advocate using the keyword model to unify naming.
What can be more difficult to see is the harm from adding structure to information; some do recommend the relational model for unifying naming (e.g. OS/400).

By decomposing a primitive of a model into smaller primitives one can end up with a more general model, one with greater flexibility of application. This is the very normal practice of mathematicians, who in their work constantly examine mathematical models with an eye to finding a more fundamental set of primitives, in hopes that a new formulation of the model will allow the new primitives to function more independently, and thereby increase the generality and expressive power of the model. Here I break the relational primitive (a tuple is an unordered set of ordered pairs) into separate ordered and unordered set primitives. Relational systems force you to use unordered sets of ordered pairs when sometimes what you want is a simple unordered set.

Why should a naming system match rather than mold the structure of information? For systems of low complexity, the reasons are deeply philosophical, which means uncompelling. And for multiterabyte distributed systems?

Reiser's Rule of Thumb #2: The most important characteristic of a very complex system is the user's inability to learn its structure as a whole. We must avoid adding structure, or guarantee that the user will be informed of all structure relevant to his partial information. Avoiding adding structure is both more feasible and less burdensome to the user.

Hierarchical, relational, semantic, and hypersemantic systems all force structure on information, structure inherent in the system rather than the information represented. If a system adds structure, and the user is trying to exploit partial knowledge (such as a name embodies), then it inevitably requires the user to learn what was added before he can employ his partial knowledge. With complex systems, the amount added is beyond the capacity of users to learn, and information is lost.
Example: "My name is Kali, your friendly whitepaper.html technical support specialist for REGRES. Our system puts the Library of Congress online! How may I help you?"

George doesn't know Santa Claus' name: "I'm trying to find the reindeer chimneys christmas man, and I can't get your system to do it."

FIGURE 1. Graphical representation of a typical simple unordered set that is difficult for relational systems.

Kali says: "OK, now let's define a query. Is-a equals man, that's easy. But reindeer? Is reindeer a property of this man?"

"Uh no. I wish I could remember the dude's name. I read this story about him a long time ago, and all I can remember is that he had something to do with reindeer and chimneys. The story is on-line, somewhere."

"Reindeer chimneys presents man, that's the sort of speech pattern I'd expect from a three year old," Kali corrects him. "Let's see if we can structure this properly. Is reindeer an instance-of of this man? A member-of of this man? It couldn't be a generalization of this man. Hmm..."

"No! It's not that complicated. They just have something to do with him."

"Pavlov would probably say you associate reindeer with this man, the way the unstructured mind of an animal thinks. But here in technical support we try to help our customers become more sophisticated. Is reindeer a property of this man?"

"No. Try propulsion-provider-for."

"Do you think that that was the schema the person who put the information in our system used?"

"No. Shoot. I can think of a dozen different columns it could be under. But what are the chances that the ones I think of are going to be the same as the ones the dude who put the information in used?"

Kali feels satisfaction. "Guess it can't be done, not if you can't structure your REGRES query properly. I'll put you down in my log as a closed ticket, 190 seconds to resolution, not bad."

"A keyword system could handle reindeer chimneys christmas man," George grumbles as he stares in despair at his display.
Unfortunately, the Library of Congress is only one of REGRES' many reference aids. George could spend his life at it, and he'd never learn its schema.

"But a keyword system would delete even necessary structure inherent to the information. It couldn't handle our other needs!" Kali says before she hangs up.

In addition to the searcher's difficulties, having to manufacture structure by specifying the column for reindeer also adds unnecessary cognitive load to the story author's indexing tasks.

== A Few of the Other Approaches to This Problem ==

There is lurking at the heart of my approach a subtle difference between my analysis of naming, and the analysis of at least some others. I started my research by systematically categorizing the different structures embodied by names, placing them into equivalency classes, and then picking one syntax out of each class of functionally equivalent naming structures, on the assumption that each of the equivalency classes has value. For example, I considered that languages sometimes convey structure by word endings (tags), and sometimes by word order, but while the syntax differs, the word order and word ending techniques are equivalent in their power to convey structure. In my analysis of the effect of word ordering I decided that either the ordering mattered, or it did not, and that was the basis for two different naming primitives. Others have instead studied the inherent structure of data, and then from that derived ways of naming.

The hypersemantic system [Smith] [Potter] represents an attempt to pick a manageably few columns which cover all possible needs. Generalization, aggregation, classification, and membership correspond to the is-a, has-property, is-an-instance-of, and is-a-member-of columns, respectively. The minor problem is that these columns don't cover all possibilities. They don't cover reindeer, presents, or chimneys for George's query.
The major problem is that they don't correspond as closely as possible to the most common style of human thought, simple unordered association, and they require cognitive effort to transform. The first response of relational database researchers to this is usually to ask: "Why not modify an existing relational database to contain an 'associated' column, put everything in that column, and it would be functionally equivalent to what you want?" This is like saying that you can do everything Pascal can do using TeX macros. (They are both Turing complete.) We don't design languages to simply be Turing complete, we design them to be useful. I have seen a colleague do in six lines of SQL (nonstandard SQL) a simple three keyword unordered set that I do in 3 words plus a pair of delimiters, and that traditional keyword systems also handle easily. Doing simple unordered sets well is crucial for highly heterogeneous name spaces, and the market success of keyword systems in Internet searching is evidence of that. If you look at the structure of names in human languages, they are not all tuple structured, and to make them tuple structured might be to distort them.

I have merely discussed the burden of naming columns. Most relational systems also require the user to specify the relation name. If column naming is a burden, naming both the column and the relation is no less a burden. Many systems invest effort into allowing you to take the key that you know, and figure out all the relation names and columns that you might choose to pair with it. This is a good idea, but not as good as not imposing extraneous structure to begin with.

[Salton] can be read for devastating critiques of the document clustering system, but there is a worthwhile idea lurking within that system. Perhaps it is worthwhile to keep track of a small number of documents which are "close" to a given document.
The document creator could be informed upon auto-indexing the document what other documents appear to be close to it, and asked to consider associating it with them. This is not within our current plan of work, but I don't reject it conceptually. In summary, modularity within the naming system is improved by recognizing unordered grouping and ordering as two different functions that deserve separate primitives rather than being combined into a tuple primitive. The tuple is an unordered set of ordered pairs. There are other useful combinations of unordered grouping and ordering than that embodied by the relation, and the success of keyword systems suggests that a plain unordered set without any ordering at all is the most fundamental and common of them.

== Names as Random Subsets of the Information In an Object ==

A system may still be effective when its assumptions are known to be false. You may regard the above as an overstatement of the notion that we are neural nets, and sometimes our abstract systems deal with assumptions that are not true or false, but are somewhat true. After we are finished stating them in English they lose the delicate weighting possessed by the reality of the situation. Sometimes we find it easier to model without that weighting. Classical economics and its assumption of perfect competition is the best known example of an effective system based on assumptions known to be substantially false. Introductory economics classes usually spend several weeks of class time arguing the merits of building models on somewhat false assumptions. This paper will now use such a somewhat false model to convey a feel for why mandatory pairing of name components causes problems. Assume the user's information from which he tries to construct a description will be some completely random subset of the information about the object. (Some of that information will be structural, and the structural fragments selected will be just as random as the rest.)
Assume a user has 15 random clues of information selected from 300 pieces of information the system knows about some object. Assume the REGRES naming system requires that data be supplied in threesomes (perhaps column name, key name, relation name), and cannot use one member of a threesome without the other members of the threesome. Assume the ANARCHY naming system lacks this restriction, but does so at the cost that it can only use those 10 of the 15 information fragments which do not embody structure. Assume the 15 pieces of information the user has to construct a name with are fully independent and equally likely (this is both substantially wrong, and unfair to REGRES, but ...). Assume each clue has a selectivity of 100 (it divides the number of objects returned by 100).

Then ANARCHY has a selectivity of 100^10 = 10^20 = good.

REGRES has a selectivity of 100^((chance that the other two members of an object's threesome are possessed by the user) x 15) = 100^(9/300 x 8/300 x 15) ≈ 1.06 = very bad.

While it is not true that the clues are fully independent, it is true that to the extent that they are not fully dependent, ANARCHY will gain in selectivity compared to REGRES. Attempting to quantify for any database the extent of the dependence would be a nightmare, and so this model assumes a substantial falsity, through which it is hoped the reader can see a greater truth. For databases of the lower heterogeneity and complexity that the relational model was designed for, the independence within a threesome can be small, and the ability to also employ the 5 of 15 fragments which are structural is often more important than the difficulty of guessing any structure added. There is an implicit assumption here that you are looking for information that others have structured, and this argument in favor of ANARCHY becomes much less strong without this assumption.
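The arithmetic above can be checked directly. This is a minimal sketch of the toy model, using only the numbers stated in the text (300 known facts, 15 random clues, per-clue selectivity 100, 10 non-structural clues for ANARCHY, and the 9/300 and 8/300 threesome chances for REGRES); the variable names are my own.

```python
# Toy selectivity model: each clue divides the candidate set by 100.
CLUE_SELECTIVITY = 100

# ANARCHY uses its 10 non-structural clues independently.
anarchy_selectivity = CLUE_SELECTIVITY ** 10        # 100^10 = 10^20

# REGRES can use a clue only when the other two members of its threesome
# are also among the user's 15 clues; the expected number of usable
# clues is the per-clue chance times 15.
expected_usable = (9 / 300) * (8 / 300) * 15        # = 0.012
regres_selectivity = CLUE_SELECTIVITY ** expected_usable

print(anarchy_selectivity)                # 100000000000000000000
print(round(regres_selectivity, 2))       # ~1.06
```

The contrast is the point: twenty orders of magnitude of selectivity versus essentially none, under the (admittedly false) independence assumption.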
I feel obligated to stress once again that I do not advocate low structure over high structure, but I do advocate having the flexibility to match the amount of structure to the needs of the moment. Only with such flexibility can one hope to use all of the 15 fragments that happen to be possessed.

== The Syntax In More Detail ==

What's needed is a naming system intended to reflect just the structure inherent in the information, whatever that structure might be, rather than restructuring the information to fit the naming system.

=== Orthogonal or Unoriginal Primitives and Features ===

There are many primitives that the ultimate naming system would include but which I will not discuss here: macros, OR, weights for subnames and AND-OR connectors [Fox], rules, constraints, indirection, links, and others. I have tried to select only those aspects in which my approach differs from the standard approach. Unifying the namespace does not require unifying automatic name generation, and those who read the [Blair] vs. [Salton] controversy likely understand my concluding that whatever the benefits might be of unifying automatic name generation, it is not feasible now, and won't be feasible for a long time to come. The names one can assign an object are kept completely orthogonal from the contents of the object in the implementation of this naming layer. It is up to the owner of the object to name it, and it is up to him to use whatever combination of autonaming programs and manual naming best achieves his purpose. He may name it on object creation, and he may continually adjust its various names throughout its lifetime. See the section defining the "Key_Object primitive" for a discussion of why names should be thought of this way. Technically, object creation only requires that the object be given a Storage_Key.
In practice most users will, in the same act that creates the object, also associate the object with at least one name that will spare them from directly specifying the Storage_Key in hex the next time they make a reference to it. Applications implementing external name spaces can interact with the storage layer by referencing just the Storage_Key. Namesys will provide a manual naming interface, and the API that autonaming programs need to plug into. Companies such as Ecila will provide autonamers for various purposes. Ecila is implementing a program which scans remote stores, creates links to them in the unified name space, but leaves the data on the remote stores. Other programs may also be implemented to perform this general function. To be more specific, the Ecila search engine scans the web for documents in French, and uses the filesystem as an indexing engine. However, they are writing their engine to be general purpose; they have sold support and the addition of extensions to it to other search engine companies, and it is open source. For now we are simply functioning as part of their engine, and the interface is by web browser; at some point we may be able to add their functionality to the namespace. While the implementation of Microsoft's attempt to blur the distinction between the filesystem name space and the web namespace is one more of appearance than substance, it is surely the right thing to do for Linux as well in the long run. We should simply make our integration one with substance and utility, rather than integrating mostly the look and feel. When the store is external to the primary store for the namespace, then stale names can be an issue with no clean resolution. That said, unification at just the naming layer is, in a real rather than ideal world, often quite useful, and so we have Internet search engines.
GUI based naming is beyond the scope of this paper, except to mention that it is common for GUI namespaces to be designed such that they are not well integrated with the other namespaces of the OS. They are often thought to necessarily be less powerful, but proper integration would make this untrue, as they would then be additional syntaxes, not substitutes. These additional syntaxes should possess closure within the general name space, and thereby be capable of finding employment as components of compound names like all the other types of names. The compound names should be able to contain both GUI and non-GUI based name components. Integration would make them simply the aspect of naming that applies to what is present in the visual cache of the screen, and to how to manage and display that cache most effectively.

=== Vicinity Set Intersection Definition (Also Called Grouping) ===

Suppose you have a set X of objects. Suppose some of these objects are associated with each other. You can draw them as connected in a graph. Let the vicinity of an object A be the set of objects associated with A. Let there be a set of query objects Q. Then the set vicinity intersection of Q is the set of objects which are a member of all vicinities of the objects in Q. When thinking of this as a data model, it seems natural to use the term vicinity set intersection. When thinking of this syntactically, it seems natural to use the term "grouping", because it implies that the subnames are grouped together without the order of the subnames being significant. There is exactly one data model primitive (set vicinity intersection) possessing exactly one syntax (grouping), and I rarely intend to distinguish data model primitive from syntax primitive (I can be criticized for this), and yet I use both terms for it; forgive me.
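The definition above maps directly onto ordinary set intersection. A minimal sketch, assuming the object store is just a mapping from each query object to its vicinity (the store contents and object names here are invented for illustration):

```python
# Each key maps to its "vicinity": the set of objects associated with it.
store = {
    "reindeer": {"santa-story", "zoo-brochure"},
    "chimneys": {"santa-story", "roofing-manual"},
    "presents": {"santa-story", "birthday-list"},
}

def vicinity_intersection(query, vicinities):
    """Return the objects lying in the vicinity of every object in query."""
    sets = [vicinities[q] for q in query]
    result = sets[0].copy()
    for s in sets[1:]:
        result &= s         # intersect with each remaining vicinity
    return result

# The grouping [reindeer chimneys presents] resolves to the one object
# that all three vicinities share:
print(vicinity_intersection(["reindeer", "chimneys", "presents"], store))
# {'santa-story'}
```

Because intersection is commutative, the order of the subnames in the query list cannot affect the result, which is exactly what the grouping syntax promises.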
=== Synthesizing Ordering and Grouping ===

I am going to describe a toy naming system that allows focusing on how best to combine grouping and ordering into one naming system. This synthesis will contain the core features of the hierarchical, keyword, and relational systems as functional subsets. It consists of a few simple primitives, allowed to build on each other. It sets the discussion framework from which our project will over many years evolve a real naming system out of its current storage layer implementation. Resolving the second component of an ordering is dependent on resolving the first --- unlike set theory. In set theory one can derive the ordered set from the unordered set, but because resolving the name of the second component depends on the first component one cannot do so in this naming system. For this reason it can well be argued that this naming system is not truly set theory based. Now that I have mentioned this difference I will start to call them grouping and ordering, rather than unordered and ordered set. These two primitives take other names as sub-names, and allow the user to construct compound names. Either the order of the subnames is significant (ordering), or it isn't (grouping), and thus we have the two different primitives. Because I have myself found that BNFs are easier to read if preceded by examples, I will first list progressively more complex examples using the naming system, and then define it formally. The examples, and the simplified syntax, use / rather than : or \, but this is of no moment.

Examples

/etc/passwd

Ordering and grouping are not just better; file system upward compatibility makes them cheaper for unifying naming in OSes based on hierarchical file systems than a relational naming system would be. This approach is fully upwardly compatible with the old file system.
Users should be able to retain their old habits for as long as they wish, engage in a slow comfortable migration, and incorporate the new features into their habits as they feel the desire. Elderly programs should be untroubled in their operation. Many worthwhile projects fail because they emphasize how much they wish to change rather than asking of the user the minimal collection of changes necessary to achieve the added functionality.

[dragon gandalf bilbo]

FIGURE 3. Graphical representation of ascii name on left

Mr. B. Bizy is looking for a dimly remembered story (The Hobbit by Tolkien) to print out and take with him for rereading during the annual company meeting.

case-insensitive/[computer privacy laws]

FIGURE 4. Graphical representation of ascii name on left

When one subname contains no information except relative to another subname, and the order of the subnames is essential to the meaning of the name, then using ordering is appropriate. This most commonly occurs when syntax barriers are crossed. This is when a single compound name makes a transition from interpreting a subname according to the rules of one syntax to interpreting it according to the rules of another syntax. Ordering is essential at the boundary between the name of the new syntax as expressed in the current syntax, and the name to be interpreted according to that new syntax. Some researchers use the term context rather than syntax. The pairing of a program or function name and the arguments it is passed is inherently ordered. While that is usually the concern of the shell, when we use a variety of ordering functions to sort Key_Objects of different types it affects the object store. In this example the ordering serves as a syntax barrier. Case-insensitive is the unabbreviated name of a directory that ignores the distinction between upper and lower case.
For Linux compatibility this naming layer is case sensitive by default, even though I agree with those who think that it would be better were it not.

[my secrets]/[love letter susan]

FIGURE 5. Graphical representation of ascii name on left

Devhuman (that's the account name he chose) is the company's senior programmer. Six years ago he wrote a love letter to Susan, which he put in his read-protected secrets directory. (He never found the nerve to send it to her.) He's looking for it so he can rewrite it, and then consider sending it. Security is a particular kind of syntax barrier (you have to squint a bit before you can see it that way). Here the ordering serves as a security barrier. (He certainly wouldn't want anyone to know that an object owned by him with attributes love letter susan existed.)

[subject/[illegal strike] to/elves from/santa document-type/RFC822 ultimatum]

FIGURE 6. Graphical representation of search for santa's ultimatum

Devhuman knows his object store cold. He is looking for something he saw once before, he knows that it was auto-named by a particular namer he knows well (perhaps one whose functionality is similar to the classifier in [Messinger]), and he knows just what categorizations that namer uses when naming email. Still, he doesn't quite remember whether the word 'ultimatum' was part of the subject line, the body, or even was just elvish manual supplementation of the automatic naming. Rather than craft a query carefully specifying what he does and does not know about the possible categorizations of ultimatum, he lazily groups it. If Devhuman's object store is implemented using this naming system with good style, someone less knowledgeable about the object store would also be able to say: [santa illegal strike ultimatum elves] and perhaps get some false hits as well as the desired email (instead of finding mail from santa perhaps finding the elvish response).
Notice that if you delete the 'illegal' and 'ultimatum' to get [subject/strike to/elves from/santa document-type/RFC822] the query is structurally equivalent to a relational query. Many authors (e.g. semantic database designers) have written papers with good examples of standard column names which might be worth teaching to users. So long as they are an option made available to the user rather than a requirement demanded of the user, the increased selectivity they provide can be helpful.

[_is-a-shellscript bill]

FIGURE 7. Graphical representation of ascii name on left

This name finds all shellscripts associated with bill. Names preceded by _ are pruners. Pruners are analogous to the predicate evaluators of relational database theory. If you have read papers distinguishing between recognition and retrieval, pruners are a recognition primitive. They are passed a list of objects, and return a subset of that list which matches some criteria. They are a mechanism appropriate for when a nonlinear search method that can deliver the desired functionality is either impossible, or not supported by existing indexes. There are many useful names for which we cannot do better than linear time search algorithms (perhaps simply as a result of incomplete indexing). _is-a-shellscript checks each member of its list to see if it is an executable object containing solely ascii. The user can use it just like any other Key_Object within an association; it will prune the results of the grouping. Since set intersections are commutative its order within the grouping has no meaning, and optimizers are free to rearrange it.

=== The Formal Definitions ===

<Object Name> ::= <Grouping> | <Ordering> | <Key_Object> | <Storage_Key> | <Orthogonal and Unoriginal Primitives I Will Not Define Here> ;

See the section listing orthogonal and unoriginal primitives for a discussion of what primitives I left out of the definitions of this grammar that are necessary to a real world working system.
The name resolver has a method for converting all of the primitives into <Storage_Key>s, and when processing compound names it first converts the subnames into <Storage_Key>s, though the object may have null contents and serve purely to embody structure. This allows the use, as a component of a grouping or ordering, of anything for which anyone can invent a way of letting the user find an <Object Name>, together with a method for the resolver to convert the <Object Name> into a <Storage_Key>. In a word, closure. Extensible closure. Compound names are interpreted by first interpreting the subnames that they are constructed from. At each stage of subname interpretation an <Object Name> is converted into a <Storage_Key> for the object that it is resolved to. The modules that implement the grouping and ordering primitives do not interpret the subnames; they merely pass them to the naming system, which returns the <Storage_Key>s they resolve to. It was a long discussion which led to the use of storage keys rather than objectids. A storage key differs from an objectid in that it gives the storage layer directions as to where to try to locate the object in the logical tree ordering of the storage layer. If the logical location changes, then in the worst case we leave a link behind, and get an extra disk access like we get with an inode. (Inode numbers are functionally objectids.) In the better case, the repacker eventually comes along, and changes all references by key to the new location, at least for all objects that have not given their key to external naming systems the repacker cannot repack. A <Storage_Key> is assigned by the system at object creation, and serves the purpose of allowing the system to concisely name the object, and of providing hints to the storage layer about which objects should be packed near each other. The user does not directly interact with the <Storage_Key> any more often than C programmers hardcode pointers in hex.
The packing locality of keys may be redefined.

== The Primitives ==

<Key_Object>

A description of the contents of an object using the syntax of the current directory. For objects used to embody keywords this may be the keyword in its entirety. If it contains spaces, etc., it must be enclosed in quotes. Note that making it easy for third parties to add plug-in directory types is part of Namesys's current contract with Ecila. Ecila wants space efficient directories suitable for use in implementing a term dictionary and its postings files for their Internet search engine.

Example: [reindeer chimneys presents man]

In this, 'presents', 'reindeer', 'chimneys', and 'man' are the contents of objects associated with the Santa Claus story. Each of them is searched for by contents, and then when found they are converted into their Storage_Keys, and then the grouping algorithm is fed those Storage_Keys. The grouping module then looks in the object headers of those objects, gets the sets of objects the Key_Objects group to, and performs a set intersection. Besides greater closure, another advantage of storing Key_Objects as objects is that non-ascii Key_Objects and ordering functions can be implemented as a layer on top of the ascii naming system, allowing the user to interact with the naming system by pressing hyperbuttons, drawing pictures, making sounds, and supplying other non-ascii Key_Objects that the higher layers convert into Storage_Keys. There are endless content description techniques. If the directory owner supplies an ordering function for the Key_Objects in a directory, one can generate a search index for the directory using a directory plug-in which is fully orthogonal to the ordering function, though perhaps slower in some cases than one that is tailored for the ordering function. Users will find it easier to write ordering functions than index creation objects, and will not always need the speed of specialized indexes.
We will need one ordering function for ascii text, another for numbers, another for sounds, perhaps someday one even for pictures of faces (perhaps to be used by a law enforcement agency constructing an electronic mug book, or a white pages implementation), etc. No system designer can provide all the different and sometimes esoteric ordering functions which users will want to employ. What we can do is create a library of code from which users can construct their own ordering functions and their own directory plug-ins, and this is the approach we are taking on behalf of Ecila. For an Internet search engine one wants what is called a postings file, which is like a directory in that there is no need to support a byte offset, and one frequently wants to efficiently perform insertions into it.

<Grouping> ::= [<Unordered List>] ;
<Unordered List> ::= <Unordered List> <Unordered List> | <Object Name> | <Pruner> ;
<Pruner> ::= _<Object Name>

A <Grouping> is a list of object names and pruners whose order has no meaning. Every object has a list of objects it groups to (associates with, in neural network idiom) in its object header. A grouping is interpreted by performing a set intersection of those lists for every object named in the grouping; in the sense of the data model, this is a set vicinity intersection. Grouping is not transitive: [A] => B and [B] => C does not imply [A] => C, though it does imply that [[A]] => C. A pruner is an <Object Name> which has been preceded with an _ to indicate that the object described should be passed a list of objects named by the rest of the grouping, executed, and it will return a subset of the list it was passed.
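A pruner can be sketched as a plain predicate applied to an already-resolved candidate list. The sketch below is illustrative only: `is_a_shellscript` is a toy stand-in for the _is-a-shellscript pruner described earlier (executable, ASCII-only contents), and the candidate objects are invented.

```python
def is_a_shellscript(obj):
    """Toy stand-in for _is-a-shellscript: executable object, all-ASCII contents."""
    return obj.get("executable", False) and obj["contents"].isascii()

def prune(candidates, predicate):
    """Return the subset of candidates matching the predicate."""
    return [obj for obj in candidates if predicate(obj)]

candidates = [
    {"name": "deploy.sh", "executable": True,  "contents": "#!/bin/sh\necho hi"},
    {"name": "notes.txt", "executable": False, "contents": "meeting notes"},
]

print([o["name"] for o in prune(candidates, is_a_shellscript)])
# ['deploy.sh']
```

Note that the predicate examines each object in isolation, which is exactly the independence property the text requires next: since membership in the result never depends on the other candidates, pruners commute and an optimizer may apply them in any order.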
Whether a member of the set is in the returned subset must be fully independent of what the other members of the set were, or else the results become indeterminate after application of a query optimizer, as with an optimizer in use there is no guarantee of the order of application of the pruners.

<Ordering> ::= <Object Name>/<Object Name> | <Object Name>/<Custom Programmed Syntax> ;
<Custom Programmed Syntax> ::= Varies, provides extensibility hook.

An ordering is a pairing of names, with the order representing information. The first component of the ordering determines the module to which the second component is passed as an argument. In contrast, a grouping first converts all subnames to Storage_Keys by looking through the same current directory for all of them in parallel, and then does its set intersection with the subdescriptions already resolved.

Example: In resolving [my secrets]/[love letter susan] the system would look for the objects with contents my and secrets, find both of them, and do a set intersection of all the objects those two objects both group to (are associated with). This will allow it to find the [my secrets] directory, inside of which it will look for the three objects love, letter, and susan. It will then extract from their object headers the sets of objects those three words ('love', 'letter', and 'susan') group to, and do a set intersection which will find the desired letter. The desired letter is not necessarily inside the [my secrets] directory, though in this case it probably is. A directory is an object named by the first component of an ordering, to which the second component is passed, and which returns a set of Storage_Keys. One can in principle use different implementations of the same directory object without impacting the semantics and only affecting performance, as is often done in databases.
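The two-stage resolution just walked through can be sketched as follows. Everything here is a deliberately simplified assumption: groupings resolve by vicinity intersection, an ordering resolves its first component in the current context and then interprets the second component inside the directory found, and all names and tables are invented for illustration.

```python
# Vicinity table for the current (top-level) context.
vicinities = {
    "my": {"dir:my-secrets"},
    "secrets": {"dir:my-secrets"},
}

# Each directory carries its own vicinity table for the Key_Objects inside it.
directories = {
    "dir:my-secrets": {
        "love":   {"obj:letter-1"},
        "letter": {"obj:letter-1", "obj:memo"},
        "susan":  {"obj:letter-1"},
    },
}

def resolve_grouping(subnames, table):
    """Intersect the vicinities of all subnames (order-independent)."""
    result = None
    for name in subnames:
        result = table[name] if result is None else result & table[name]
    return result

def resolve_ordering(first_grouping, second_grouping):
    # Stage 1: resolve the first component in the current context ...
    (directory,) = resolve_grouping(first_grouping, vicinities)
    # Stage 2: ... then interpret the second component inside that directory.
    return resolve_grouping(second_grouping, directories[directory])

# [my secrets]/[love letter susan]
print(resolve_ordering(["my", "secrets"], ["love", "letter", "susan"]))
# {'obj:letter-1'}
```

The key asymmetry is visible in `resolve_ordering`: the second component cannot even begin to resolve until the first has produced a directory, which is why the text says the ordering primitive is not derivable from the grouping primitive.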
There are flavors of directories: Custom programmed directories, aka filters, are any executable program that will return a Storage_Key when executed and fed the second component as an argument. They provide extensibility. (They are the ordered counterpart of pruners.) Another term for them is filter directories. Custom programmed directories whose name interpretation modules aren't unique to them will contain just the name of the module (filter), plus some directory dependent parameters to be passed to the module. It should be considered merely a syntax barrier directory, and not a fully custom programmed directory, if those parameters include a reference to a search tree that the module operates on, and if that search tree adheres to the default index structure. The connotations conveyed by the term 'filter' of there being an original which is distorted are not always appropriate, but in honesty this is not an issue about which we deeply care. Syntax barrier directories allow you to describe the contents of the object they contain with a syntax different from their parents'. Except for being sorted by a different ordering function, the indexes of syntax barrier directories are standard in their structure, and use a standard index traversal module. The index traversal module is ordering function independent. There must be an ordering function for every <Key_Object> employed within a given syntax barrier directory. By contrast, a <Custom Programmed Syntax> could be anything which the syntax module somehow finds an object with, possibly even creating the object in order to be able to find it. To cross a security barrier directory the user must use an ordered pair of names with the security barrier as the first member of the pair, and he must satisfy the security module of the secured directory. A security barrier directory may be both a security and a syntax barrier directory, or the security barrier directory may share the syntax module of its parents.
Fully standard directories are those built using the default directory module, and adding structure is their only semantic effect. There is an aspect of customization which is beyond the scope of this paper, in which one customizes the items employed by the storage layer to implement files and directories. That is, the storage of the files and directories is implemented by composing them of items, and these items have different types. We are now creating the code for packing and balancing arbitrary types of items using item handlers and object oriented balancing code, so as to make it easier to extend our filesystem.

=== Ordering can be implemented more efficiently than grouping ===

The set intersections performed in evaluating the grouping primitive are normally much more expensive computationally than performing the classical filesystem lookup. Imposing excess structure on one's data does not just at times reduce the cost of human thinking :-), it can be used to reduce the cost of automated computation as well. When the cost to a user of learning structure is less important than the burden on the machine, use of highly ordered names is often called for.

=== The Motivation for Different Syntactic Treatment of Ordering and Grouping, and Some of the Deeper Issues Revealed by the Difference ===

An important difference between grouping and ordering affects syntax. It allows us to represent an ordering with a single symbol ('/') placed between the pair, but requires two symbols ('[' and ']') for each grouping. Imagine using < and > as a two-symbol delimiter style alternative notation for ordering:

<<father-of mother-of> sister-of> = <father-of <mother-of sister-of>> = <father-of mother-of sister-of> = father-of/mother-of/sister-of

All of the expressions above are equivalent in referring to the paternal great aunt of the person who is the current context.
The ones using nested pairs of symbols to enclose pairs of subnames imply a false structure that requires the user to think to realize the first two expressions are equivalent. The fourth is the notation this naming system employs. Grouping is different: Fast Acting Freddy is looking through the All-LA Shopping Database for a single store with black reebok sneakers, a green leather jacket, and a red beret so that he can dress an actor for a part before the director notices he forgot all about him.

[[black reebok sneakers] [green leather jacket] [red beret]]

is not equivalent to

[black reebok sneakers green leather jacket red beret]

which equals

[red sneakers black reebok jacket green beret]

Ordering is not algebraically commutative (father-of/mother-of is not equivalent to mother-of/father-of). Groupings are algebraically commutative ([large red] = [red large]).

== Style ==

As a general principle, a more restricted system can avoid requiring the user to repeatedly specify the restrictions, and if the user has no need to escape the restrictions then the restricted system may be superior. This is why "4GLs", which supply the structure for the user's query, are useful for some applications. They are typically implemented as layers on top of unrestricting systems such as this one. This paper has addressed issues surrounding finding information, particularly when the user's clues are faint. When supporting other user goals, such as exploring information, adding structure through substantial use of ordering can be helpful [Marchionini] [McAleese]. When the user goal is finding, one should assume that of all the fragments of information about an object, the user has some random subset of them. The goal is to allow the user to use that random subset in a name, whatever that subset might be. Some of that subset will be structural fragments.
While requiring the user to supply a structure fragment is as foolish as requiring him to supply any other arbitrary fragment, allowing him to is laudable. In the best of all worlds the object store would incorporate all valid possible structurings of Key_Objects. The difficulty in implementing that is obvious. [Metzler and Haas] discuss ways of extracting structure from English text documents, and why one would want to be able to use that structure in retrievals. Unfortunately, there is an important difference between representing the structure of an English language sentence in a way that conveys its meaning, and representing it in a way that allows it to be found by someone who knows only a fragment of its semantic content. I doubt the wisdom of trying to advocate the use of more than essential structure in searching. You can allow users to avoid false structure; you cannot force them to. It is important to teach those creating the structure that if they group a personnel file with sex/female they should also group it with female. Type checking can impose structure usefully. Its implementation can enhance or reduce closure, depending on whether it is done right.

=== When To Decompound Groupings ===

There are dangers in excessive compounding of compound groupings analogous to those of excessive ordering. Let's examine two examples of compound groupings, both of which are valid both semantically and syntactically. One of them can be "decompounded" with moderate information loss, and the other loses all meaning if decompounded.

Example: Finding a loquacious Celtic textbook salesman who told you in excruciating detail about how he was an ordinance researcher until one day he went to a Grateful Dead concert.

[[Celtic textbook salesman] [ordinance researcher]]

vs.

[celtic textbook salesman ordinance researcher]

These two phrasings of the same query are not equivalent, but they are "close."
Our second example is the one in which Fast Acting Freddy tries to find a suspect by the objects he is associated with:

 [[black reebok sneakers] [green leather jacket] [red beret]]

vs.

 [black reebok sneakers green leather jacket red beret]

These two are not at all "close." The difference between the two examples of inequivalence is that the subdescriptions within the second example describe objects that are worth having exist within the object store independently of the suspect described. The subdescriptions in the first example do not, and there it is more reasonable to try to design so that the "decompounded" version of the query is used. False hits will occur, but for large systems that's better than asking the user to learn structure.

A higher level user interface might choose to present only one level to the user at a time, and then once the user confirms that a subdescription has resolved properly it would let him incorporate it into a higher level description. There might be six models of [black reebok sneakers], and Fast Acting Freddy should have the opportunity to click his mouse on the exact model, and have the interface substitute that object for his subdescription. Using such an interface an advanced user might simultaneously develop several subdescriptions, refine and resolve them, and then use the mouse to draw lines connecting them into a compound grouping. Closure makes it possible for that to work.

== Examples of Creating Associations ==

<- creates an association between all of the objects on the left hand side and all of the objects on the right hand side. A - B is the set difference of A and B, and it resolves to the set of objects in A except for those that are in B. A & B resolves to the set intersection of A and B, the objects that are both in A and B. [A B] = [A] & [B], by definition.
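The operator semantics above can be sketched with ordinary sets. The following is a minimal model, not Namesys code; the index layout and function names are invented for illustration:

```python
# Minimal sketch of the description algebra using Python sets.
# associate() implements "obj <- (n1, n2, ...)"; resolve() implements
# "[n1 n2 ...]" as set intersection, since [A B] = [A] & [B].

index = {}  # name -> set of object ids bearing that name

def associate(obj, names):
    """obj <- (n1, n2, ...): associate obj with each name."""
    for n in names:
        index.setdefault(n, set()).add(obj)

def resolve(*names):
    """[n1 n2 ...] = [n1] & [n2] & ...: grouping is intersection."""
    result = index.get(names[0], set()).copy()
    for n in names[1:]:
        result &= index.get(n, set())
    return result

# Freddy's store, plus a store with the adjectives bound to the wrong nouns:
associate("store-17", ["black", "reebok", "sneakers", "green", "leather",
                       "jacket", "red", "beret"])
associate("store-9",  ["red", "sneakers", "black", "reebok", "jacket",
                       "green", "leather", "beret"])

# The flat (decompounded) query cannot tell the stores apart, because
# grouping is commutative; the false hit on store-9 is the price of
# dropping the inner brackets.
flat = resolve("black", "reebok", "sneakers", "green", "leather",
               "jacket", "red", "beret")
```

Here `flat` contains both stores, which is exactly the "false hits will occur" trade-off described above.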
 animal <- (lives, moves)
 mammal <- ([animal], animal, `warm blooded')
 cat <- ([mammal], hypernym/mammal, mammal, meronym/fur, fur, meronym/whiskers, whiskers, hypernym/quadruped, quadruped, capability/purr, purr, capability/meow, meow)
 Basil <- (owner/Nina, Nina, [siamese], siamese, clever, playful, brave/overly, brave, `toilet explorer')
 bag <- ([container], container, consists-of/`highly flexible material', `highly flexible material')
 backpack <- ([bag], shoulderstrap/quantity/2, shoulderstrap, college-student, holonym/backpacker, meronym/shoulderstrap)
 mould <- ([fungi] - green/not, furry, `grows on'/surfaces/moist, `killed by'/chlorine)
 fungi <- ([plant], plant, leaves/no, flowers/no, green/not)
 bird <- ([vertebrate], vertebrate, flies, feathers)
 penguin <- ([bird] - flies, bird, hypernym/bird, swims, Linux, [Linux (mascot, symbol)])
 siamese <- ([cat], cat, hair/short, short-hair)

Notice how we don't associate siamese with short despite associating it with hair/short, but we do associate Basil with Nina as well as with owner/Nina.

 small <-0 little

The above means that small and little are synonyms, and are to be treated as 0 distance away from each other for vicinity calculation purposes. In other, traditional Unix, words, they are hardlinked together.

Creating a serious ontology is not our field or task, but it is worth doing. The reader is referred to WordNet (free), and to Cyc by Doug Lenat (proprietary). While we will focus on implementing primitives that allow for creating better ontologies, we are happy to work with persons interested in contributing or porting an ontology.

== Other Projects Seeking To Increase Closure In The OS ==

=== AT&T's Plan 9 ===

[Plan 9] is being produced by the original authors of Unix at AT&T research labs. It has influenced CORBA, and /proc is a direct steal from it into Linux. Their major focus is on integration. Their major trick for increasing integration is unifying the name space.
Name spaces integrated into the Plan 9 file system include the status, control, virtual memory, and environment variables of running processes. They have a hierarchical analog to what the relational culture calls constructing views, which the Plan 9 culture calls context binding.

=== Microsoft's Information At Your Fingertips ===

Plan 9 ignores integration of application program name spaces, concentrating on OS-oriented name spaces. Microsoft's "Information at Your Fingertips" name space integration effort appears to be taking the other approach, focusing on integrating the name spaces of the various Microsoft applications via OLE and Structured Storage. The application group at Microsoft has long been better staffed and funded than the OS group, and FS developers have long preferred to simply ignore the needs of application builders generally. The primary semantic disadvantages of Microsoft's approach are primitives selected with insufficient care, a lack of closure, and the use of an object oriented rather than set oriented approach in both naming syntax and data model. Realistically, one can say that folks within Microsoft have often made statements favoring name space integration, and in various areas have successfully executed on it, but on the whole I rather suspect that the lack of someone in marketing making a business case for $X in revenue resulting from name space integration has crippled name space integration work at commercial OS producers generally, including MS.

==== Internet Explorer ====

Internet Explorer attempts to unify the filesystem and Internet namespaces. At the time of writing, the unity is so superficial, with so little substance, that I would describe it as having the look and feel of integration without most of the substance. Perhaps this will change.
==== Microsoft's Well Known Performance Difficulties ====

Despite having many of the leading names in the industry on their payroll, they have somehow managed to create a file system implementation with performance so terrible that, for the Unix customer base, it is a significant consideration contributing to hesitation in moving to NT. It may well have the worst performance of any of the major OS file systems. Their implementation of OLE's Structured Storage offers extremely poor performance, and their excuse that this is due to the incorporation of transaction concepts into their design is just a reminder that they did a poor job at that as well. They managed to implement something intended to store small objects within a file, yet implement it such that it still suffers from 512-byte granularity problems, problems they try to somewhat overcome by encouraging the packing of several objects within "storages", at horrible kludge cost.

=== Storage Layers Above the FS: A Sure Symptom The FS Developer Has Failed ===

When filesystems aren't really designed for the needs of the storage layers above them, and none of them are, not Microsoft's, not anybody's, then layering results in enormous performance loss. The very existence of a storage layer above the filesystem means that the filesystem team at an OS vendor failed to listen to someone, and that someone was forced to go and implement something on their own. You just have to listen to one of these meetings in which some poor application developer tries to suggest that more features in the FS would be nice; I heard one at a nameless OS vendor. The FS team responds that disks are cheap, that small object storage isn't really important, that we haven't changed the disk layout in 10 years, and that changing it isn't going to fly with the gods above us about whom we can do nothing.
At these meetings you start to understand that most people who go into filesystem design are persons who didn't have the guts to pursue a more interesting field in CS. There is a sort of reverse increasing-returns effect that governs FS research: the more code becomes fixed on the current APIs, the more persons in the field react with fear to any thought of FS semantics being other than a dead research topic, the less research gets done, and the fewer persons of imagination see a reason to enter the field. Every time one vendor gets a little ahead in adding functionality, the other vendors go on a FUD campaign about it breaking standards and therefore being dangerous for mission critical usage. This is a field in which only performance research is allowed, and every other aspect is simply dead. Namesys seeks to raise the dead, and is willing to commit whatever unholy acts that requires.

There is no need for two implementations of the set primitive, one called directories, the other called a file with streams, each having a different interface. File systems should just implement directories right, give them some more optional features, and then there is no need at all for streams. If you combine allowing directory names to be overloaded to also be filenames when acted on as files, allowing stat data to be inherited, allowing file bodies to be inherited, and implement filters of various kinds, then in the event that the user happens to need the precise, peculiar functionality embodied by streams, he can have it by just configuring his directory in a particular way. There was a lengthy linux-kernel thread on this topic which I won't repeat in more detail here.

The tree architecture of the storage layer of this FS design will lend itself to a distributed caching system much more effectively than the Microsoft storage layer, in part due to its ability to cache not just hits and misses of files, but to cache semantic localities (ranges).
For more on this topic see later in this paper.

=== Rufus ===

The Rufus system [Messinger et al.] indexes information while leaving it in its original location and format. While it does allow the user to create a unified name space, it does not choose to integrate that name space into the operating system. Even so, it is immensely useful in practice, and strongly hints at what the OS could gain if it had a more than hierarchical name space with a data model oriented towards what [Messinger] calls "semi-structured information", such as you find in the RFC822 format for email. When you have 7000 pieces of mail, and linearly searching the mail with a utility like grep takes 10 minutes, it is nice to be able to quickly keyword-search via inverted indexes for the mail whose From: field contains billg and that has the words "exclusive" and "bundling" in the body of the message, as you hurriedly search for an old email just before an appearance in court.

=== Semantic File System ===

The Semantic File System comes closest to addressing the needs I have described. It is a Unix-compatible file system with more than hierarchical naming ("attribute based" is the term they use). Its data model unfortunately has the important flaw of lacking closure (in it, names of objects are not themselves objects). In my upcoming discussion of the unnecessary lack of closure in hypertext products, notice that the arguments apply to the Semantic File System (and so I won't duplicate them here).

=== OS/400 ===

IBM's OS/400 employs a unified relational name space. The section of this paper entitled A Naming System Should Reflect Rather than Mold Structure will cover its problems of forcing false structure. Inadequate closure due to mandatory type checking is another source of difficulties for it.
While users moan about these two unnecessary design flaws, the essence of the opinions AS/400 partisans have expressed to me has been that the unification of its name space is a great advantage that OS/400 has over Unix. I claim these users were right, and later in this paper I will propose doing something about it.

== Conclusion ==

While I spent most of this paper on why adding structure to information can be harmful, particularly when the information is intended to be found by others sifting through large amounts of other information, this was purely because it is a harder argument than why deleting structure is harmful. My goal was not to be better at unstructured applications than keyword systems, or better at structured applications than the hierarchical and relational systems; the goal is to be more flexible in allowing the user to choose how structured to be, while still remaining within a single name space. I claimed that multiple fragmented name spaces cannot match the power and ease of name spaces integrated with closure: closure makes a naming system far more powerful by increasing its ability to compound complex descriptions out of simpler ones. The strong points of this naming system's design are various forms of generalizing abstractions already known to the literature, for greater closure.

== Acknowledgments ==

David P. Anderson and Clifford Lynch helped enormously in rounding out my education and improving my paper. Their generosity with their time was remarkable. David P. Anderson was simply a great professor, and it was a privilege to work with him. Brian Harvey informed me that it wasn't too obvious to mention that an object store should be unified. Cimmaron Taylor provided me with many valuable late-night discussions in the early stages of this paper. I would like to thank Bill Cody and Guy Lohman of the database group at the IBM Almaden Research Center for a wonderful learning experience.
Vladimir Saveliev kept this file system going when others fell by the wayside. He started as the most junior programmer on the team, and through sheer hard work and dedication to excellence outshone all the other, more senior researchers. Of course, after some time he could no longer be considered a junior programmer.

NOTE: See also the DARPA-funded, but not endorsed, Reiser4 Transaction Design Document and Reiser4 Whitepaper.

== References ==

1. Blair, David C. and Maron, M. E. "Evaluation of Retrieval Effectiveness for a Full-Text Document-Retrieval System", Communications of the ACM, v28 n3, Mar 1985, p289-299.

2. Codd, E. F. "The Relational Model for Database Management: Version 2", Addison-Wesley, c1990. Not recommended as a textbook (Date's is better for that), but worthwhile if you want a long paper by Codd. Notice that he places greater emphasis on closure, and on design methodology principles in general, than designers of other naming systems such as hypertext.

3. Date, C. J. "An Introduction to Database Systems", 4th ed., Reading, Mass.: Addison-Wesley, c1986. Contains a well-written, substantive textbook sneer at the problems of hierarchical naming systems, and a well-annotated bibliography.

4. Curtis, Ronald and Larry Wittie. "Global Naming in Distributed Systems", IEEE Software, July 1984, p76-80.

5. Feldman, Jerome A., Mark A. Fanty, Nigel H. Goddard and Kenton J. Lynne. "Computing with Structured Connectionist Networks", Communications of the ACM, v31, Feb '88, p170(18).

6. Fox, E. A., and Wu, H. "Extended Boolean Information Retrieval", Communications of the ACM, 26, 1983, pp. 1022-1036.

7. Gallant, Stephen I. "Connectionist Expert Systems", Communications of the ACM, v31, Feb '88, p152(18).

8. Gates, Bill. Comdex '91 speech on "Information at Your Fingertips", available for $8 on videotape from Microsoft's sales department.

9. Gifford, David K., Jouvelot, Pierre, Sheldon, Mark A., O'Toole, James W. Jr. "Semantic File Systems", Operating Systems Review, Volume 25, Number 5, October 13-16, 1991. They demonstrated that extending Unix file semantics to include nonhierarchical features is useful and feasible. Unfortunately, their naming system lacks closure.

10. Gilula, Mikhail. "The Set Model for Database and Information Systems", 1st ed., Addison-Wesley, c1994. Provides a set theoretic database model in which relational algebra is shown to be a special case of a more general and powerful set theoretic approach.

11. Joint Object Services Submission (JOSS), OMG TC Document 93.5.1.

12. Marchionini, Gary, and Shneiderman, Ben. "Finding Facts vs. Browsing Knowledge in Hypertext Systems", Computer, January 1988, p. 70.

13. McAleese, Ray. "Hypertext: Theory into Practice", edited by Ray McAleese, ABLEX Publishing Corporation, Norwood, NJ 07648.

14. Messinger, Eli, Shoens, Kurt, Thomas, John, Luniewski, Allen. "Rufus: The Information Sponge", Research Report RJ 8294 (75655), August 13, 1991, IBM Almaden Research Center.

15. Metzler and Haas. "The Constituent Object Parser: Syntactic Structure Matching for Information Retrieval", Proceedings of the ACM SIGIR Conference, 1989, ACM Press.

16. Nelson, T. H. "Literary Machines", self-published by Nelson, Nashville, Tenn., 1981. Did much to popularize hypertext; at the time of writing he has still not released a working product, though competitors such as HyperCard have done so with notable success.

17. Mozer, Michael C. "Inductive Information Retrieval Using Parallel Distributed Computation", UCLA, ICS Report No. 8406, June 1984.

18. Pike, Rob and P. J. Weinberger. "The Hideous Name", AT&T Research Report.

19. Pike, Rob, Presotto, Dave, Thompson, Ken, Trickey, Howard, Winterbottom, Phil. "The Use of Name Spaces in Plan 9", available via ftp from att.com. Plan 9 is an operating system intended to be the successor to Unix, and greater integration of its name spaces is its primary focus.

20. Potter, Walter D. and Robert P. Trueblood. "Traditional, semantic, and hyper-semantic approaches to data modeling", Computer, v21, '88, p53(11).

21. Rijsbergen, C. J. van. "Information Retrieval", 2nd ed., Butterworth and Co. Ltd., 1979. Printed in Great Britain by The Whitefriars Ltd., London and Tonbridge.

22. Salton, G. "Another Look at Automatic Text-Retrieval Systems", Communications of the ACM, 29, 1986, 648-656.

23. Smith, J. M. and D. C. Smith. "Database Abstractions: Aggregation and Generalization", ACM Transactions on Database Systems, June 1977, pp. 105-133.

24. http://www.win.tue.nl/~aeb/partitions/partition_types.html

[[category:Reiser4]]

By Hans Reiser
The Naming System Venture
http://namesys.com
6114 La Salle ave., #405, Oakland, CA 94611
email: reiser@namesys.com

== Abstract ==

For too long the file system has been semantically impoverished in comparison with database and keyword systems. It is time to change! The current lack of features makes it much easier to use the latest set theoretic models rather than older models of relational algebra or hypertext. The current FS syntax fits nicely into the newer model.

The utility of an operating system is more proportional to the number of connections possible between its components than it is to the number of those components. Namespace fragmentation is the most important determinant of that number of possible connections between OS components. Unix at its beginning increased the integration of I/O by putting devices into the file system name space. This is a winning strategy: let's take the file system name space and, one missing feature at a time, eliminate the reasons why the filesystem is inadequate for what other name spaces are used for. Only once we have done so will the hobbles be removed from OS architects, or even OS conspiracies.
Yet before doing that, we need a core architecture for the semantics to ensure we end up with a coherent whole. This paper suggests a set theoretic model for those semantics. The relational models would at times unacceptably add structure to information, the keyword models would at times delete structure, and purely hierarchical models would create information mazes. Reworking their primitives is required to synthesize the best attributes of these models in a way that gives one the flexibility to tailor the level of structure to the need of the moment. The set theoretic model I propose has a syntax that is upwardly compatible with Linux, MacOS, and DOS file system syntax, as well as with the CORBA naming layer.

This is a planning document for the next major version of ReiserFS, that is, a description of vaporware. It is useful to ReiserFS users and contributors who want to know where we are going, and why we are building all sorts of strange optimizations into the storage layer (and especially to those who are willing to help shape the vision in the course of discussions on the reiserfs-list@namesys.com mailing list). Currently the storage layer for ReiserFS is working and useful as an everyday FS with conventional semantics. That storage layer is available as a GPL'd Linux kernel patch at http://namesys.com.

== Introduction ==

Many OS researchers have built hierarchical name spaces that innovate in their effect on the integration of the operating system (e.g. Plan 9 and its file system [Pike]). Relational and keyword researchers rightfully scorn hierarchical name spaces as 20 years behind the state of the art [Date], but pay little attention to integration of the operating system as a design objective in their own work, or as a possible influence on data model design. I won't go into that here. Limiting associations to single keywords is an unnecessary restriction.
== A Naming System Should Reflect Rather than Mold Structure ==

The importance of not deleting the structure of information is obvious; few would advocate using the keyword model to unify naming. What can be more difficult to see is the harm from adding structure to information; some do recommend the relational model for unifying naming (e.g. OS/400).

By decomposing a primitive of a model into smaller primitives one can end up with a more general model, one with greater flexibility of application. This is the very normal practice of mathematicians, who in their work constantly examine mathematical models with an eye to finding a more fundamental set of primitives, in hopes that a new formulation of the model will allow the new primitives to function more independently, and thereby increase the generality and expressive power of the model. Here I break the relational primitive (a tuple is an unordered set of ordered pairs) into separate ordered and unordered set primitives. Relational systems force you to use unordered sets of ordered pairs when sometimes what you want is a simple unordered set.

Why should a naming system match rather than mold the structure of information? For systems of low complexity, the reasons are deeply philosophical, which means uncompelling. And for multiterabyte distributed systems?...

Reiser's Rule of Thumb #2: The most important characteristic of a very complex system is the user's inability to learn its structure as a whole. We must avoid adding structure, or guarantee that the user will be informed of all structure relevant to his partial information. Avoiding adding structure is both more feasible and less burdensome to the user.

Hierarchical, relational, semantic, and hypersemantic systems all force structure on information, structure inherent in the system rather than in the information represented.
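The decomposition can be made concrete. A relational tuple is an unordered set of (column, value) pairs, so a searcher must reproduce the column names exactly; a plain unordered set matches on bare values alone. A toy comparison, with column names invented purely for illustration:

```python
# A tuple is an unordered set of ordered pairs; a plain grouping is an
# unordered set of bare values. The column names below are hypothetical.

tuple_row = frozenset([("is-a", "man"),
                       ("propulsion-provider", "reindeer"),
                       ("egress", "chimneys")])
plain_set = frozenset(["man", "reindeer", "chimneys"])

def match_tuple(row, clues):
    # Relational style: a clue helps only as the exact (column, value) pair.
    return clues <= row

def match_plain(objset, clues):
    # Set style: any subset of bare values suffices.
    return clues <= objset

# The searcher knows the values but guesses the wrong column:
assert not match_tuple(tuple_row, {("associated", "reindeer")})
# The plain set needs no schema at all:
assert match_plain(plain_set, {"reindeer", "chimneys"})
```

The point is not that columns are useless, but that mandatory columns force the searcher to reconstruct structure the indexer chose.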
If a system adds structure, and the user is trying to exploit partial knowledge (such as a name embodies), then it inevitably requires the user to learn what was added before he can employ his partial knowledge. With complex systems, the amount added is beyond the capacity of users to learn, and information is lost.

Example: "My name is Kali, your friendly whitepaper.html technical support specialist for REGRES. Our system puts the Library of Congress online! How may I help you?"

George doesn't know Santa Claus' name: "I'm trying to find the reindeer chimneys christmas man, and I can't get your system to do it."

FIGURE 1. Graphical representation of a typical simple unordered set that is difficult for relational systems.

Kali says: "OK, now let's define a query. Is-a equals man, that's easy. But reindeer? Is reindeer a property of this man?"

"Uh, no. I wish I could remember the dude's name. I read this story about him a long time ago, and all I can remember is that he had something to do with reindeer and chimneys. The story is on-line, somewhere."

"Reindeer chimneys presents man, that's the sort of speech pattern I'd expect from a three year old." Kali corrects him. "Let's see if we can structure this properly. Is reindeer an instance-of of this man? A member-of of this man? It couldn't be a generalization of this man. Hmm..."

"No! It's not that complicated. They just have something to do with him."

"Pavlov would probably say you associate reindeer with this man, the way the unstructured mind of an animal thinks. But here in technical support we try to help our customers become more sophisticated. Is reindeer a property of this man?"

"No. Try propulsion-provider-for."

"Do you think that that was the schema the person who put the information in our system used?"

"No. Shoot. I can think of a dozen different columns it could be under. But what are the chances that the ones I think of are going to be the same as the ones the dude who put the information in used?"
Kali feels satisfaction. "Guess it can't be done, not if you can't structure your REGRES query properly. I'll put you down in my log as a closed ticket, 190 seconds to resolution, not bad."

"A keyword system could handle reindeer chimneys christmas man," George grumbles as he stares in despair at his display. Unfortunately, the Library of Congress is only one of REGRES' many reference aids. George could spend his life at it, and he'd never learn its schema.

"But a keyword system would delete even necessary structure inherent to the information. It couldn't handle our other needs!" Kali says before she hangs up.

In addition to the searcher's difficulties, having to manufacture structure by specifying the column for reindeer also adds unnecessary cognitive load to the story author's indexing tasks.

== A Few of the Other Approaches to This Problem ==

There is lurking at the heart of my approach a subtle difference between my analysis of naming and the analysis of at least some others. I started my research by systematically categorizing the different structures embodied by names, placing them into equivalency classes, and then picking one syntax out of each class of functionally equivalent naming structures, on the assumption that each of the equivalency classes has value. For example, I considered that languages sometimes convey structure by word endings (tags), and sometimes by word order, but while the syntax differs, the word order and word ending techniques are equivalent in their power to convey structure. In my analysis of the effect of word ordering I decided that either the ordering mattered, or it did not, and that was the basis for two different naming primitives. Others have instead studied the inherent structure of data, and then from that derived ways of naming. The hypersemantic system [Smith] [Potter] represents an attempt to pick a manageably few columns which cover all possible needs.
Generalization, aggregation, classification, and membership correspond to the is-a, has-property, is-an-instance-of, and is-a-member-of columns, respectively. The minor problem is that these columns don't cover all possibilities: they don't cover reindeer, presents, or chimneys for George's query. The major problem is that they don't correspond as closely as possible to the most common style of human thought, simple unordered association, and they require cognitive effort to transform.

The first response of relational database researchers to this is usually to ask: "Why not modify an existing relational database to contain an 'associated' column, put everything in that column, and it would be functionally equivalent to what you want?" This is like saying that you can do everything Pascal can do using TeX macros. (They are both Turing complete.) We don't design languages simply to be Turing complete, we design them to be useful. I have seen a colleague do in six lines of (nonstandard) SQL a simple three-keyword unordered set that I do in 3 words plus a pair of delimiters, and that traditional keyword systems also handle easily. Doing simple unordered sets well is crucial for highly heterogeneous name spaces, and the market success of keyword systems in Internet searching is evidence of that. If you look at the structure of names in human languages, they are not all tuple structured, and to make them tuple structured might be to distort them.

I have merely discussed the burden of naming columns. Most relational systems also require the user to specify the relation name. If column naming is a burden, naming both the column and the relation is no less a burden. Many systems invest effort into allowing you to take the key that you know, and figure out all the relation names and columns that you might choose to pair with it. This is a good idea, but not as good as not imposing extraneous structure to begin with.
[Salton] can be read for devastating critiques of the document clustering system, but there is a worthwhile idea lurking within that system. Perhaps it is worthwhile to keep track of a small number of documents which are "close" to a given document. The document creator could be informed upon auto-indexing the document what other documents appear to be close to it, and asked to consider associating it with them. This is not within our current plan of work, but I don't reject it conceptually.

In summary, modularity within the naming system is improved by recognizing unordered grouping and ordering as two different functions that deserve separate primitives rather than being combined into a tuple primitive. The tuple is an unordered set of ordered pairs. There are other useful combinations of unordered grouping and ordering than that embodied by the relation, and the success of keyword systems suggests that a plain unordered set without any ordering at all is the most fundamental and common of them.

== Names as Random Subsets of the Information In an Object ==

A system may still be effective when its assumptions are known to be false. You may regard the above as an overstatement of the notion that we are neural nets, and sometimes our abstract systems deal with assumptions that are not true or false, but are somewhat true. After we are finished stating them in English they lose the delicate weighting possessed by the reality of the situation. Sometimes we find it easier to model without that weighting. Classical economics and its assumption of perfect competition is the best known example of an effective system based on assumptions known to be substantially false. Introductory economics classes usually spend several weeks of class time arguing the merits of building models on somewhat false assumptions. This paper will now use such a somewhat false model to convey a feel for why mandatory pairing of name components causes problems.
Assume the user's information from which he tries to construct a description will be some completely random subset of the information about the object. (Some of that information will be structural, and the structural fragments selected will be just as random as the rest.) Assume a user has 15 random clues of information selected from 300 pieces of information the system knows about some object. Assume the REGRES naming system requires that data be supplied in threesomes (perhaps column name, key name, relation name), and cannot use one member of a threesome without the other members of the threesome. Assume the ANARCHY naming system lacks this restriction, but does so at the cost that it can only use those 10 of the 15 information fragments which do not embody structure. Assume the statistical distribution of the 15 pieces of information the user has to construct a name with is fully independent and equally likely (this is both substantially wrong and unfair to REGRES, but....). Assume each clue has a selectivity of 100 (it divides the number of objects returned by 100).

Then ANARCHY has a selectivity of 100^10 = 10^20 = good.

REGRES has a selectivity of 100^(chance that the other two members of an object's threesome are possessed by the user × 15) = 100^(9/300 × 8/300 × 15) ≈ 1.06 = very bad.

While it is not true that the clues are fully independent, it is true that to the extent that they are not fully dependent, ANARCHY will gain in selectivity compared to REGRES. Attempting to quantify for any database the extent of the dependence would be a nightmare, and so this model assumes a substantial falsity, through which it is hoped the reader can see a greater truth. For databases of the lower heterogeneity and complexity that the relational model was designed for, the independence within a threesome can be small, and the ability to also employ the 5 of 15 fragments which are structural is often more important than the difficulty of guessing any structure added.
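The arithmetic above can be checked directly; this reproduces the paper's own numbers and assumptions rather than introducing new modeling:

```python
# Selectivity under the stated assumptions: 15 random clues out of 300
# facts, each clue with selectivity 100. ANARCHY can use its 10
# non-structural clues outright; REGRES can use a clue only when the
# user also holds the other two members of its threesome.

anarchy = 100 ** 10
regres = 100 ** ((9 / 300) * (8 / 300) * 15)

assert anarchy == 10 ** 20       # "good"
assert round(regres, 2) == 1.06  # "very bad"
```

The two exponents make the contrast vivid: ten fully usable clues compound multiplicatively, while clues gated on guessing the indexer's structure barely narrow the search at all.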
There is an implicit assumption here that you are looking for information that others have structured, and this argument in favor of ANARCHY becomes much less strong without this assumption. I feel obligated to stress once again that I do not advocate low structure over high structure, but I do advocate having the flexibility to match the amount of structure to the needs of the moment. Only with such flexibility can one hope to use all of the 15 fragments that happen to be possessed. The Syntax In More Detail What's needed is a naming system intended to reflect just the structure inherent in the information, whatever that structure might be, rather than restructuring the information to fit the naming system. Orthogonal or Unoriginal Primitives and Features There are many primitives that the ultimate naming system would include but which I will not discuss here: macros, OR, weight for subnames and AND-OR connectors [Fox], rules, constraints, indirection, links, and others. I have tried to select only those aspects in which my approach differs from the standard approach. Unifying the namespace does not require unifying automatic name generation, and those who read the [Blair] vs. [Salton] controversy likely understand my concluding that whatever the benefits might be of unifying automatic name generation, it is not feasible now, and won't be feasible for a long time to come. The names one can assign an object are kept completely orthogonal from the contents of the object in the implementation of this naming layer. It is up to the owner of the object to name it, and it is up to him to use whatever combination of autonaming programs and manual naming best achieves his purpose. He may name it on object creation, and he may continually adjust its various names throughout its lifetime. See the section defining the "Key_Object primitive" for a discussion of why names should be thought of this way. 
Technically, object creation only requires that the object be given a Storage_Key. In practice most users will, in the same act that creates the object, also associate the object with at least one name that will spare them from directly specifying the Storage_Key in hex the next time they make a reference to it. Applications implementing external name spaces can interact with the storage layer by referencing just the Storage_Key. Namesys will provide a manual naming interface, and the API autonaming programs need to plug into. Companies such as Ecila will provide autonamers for various purposes. Ecila is implementing a program which scans remote stores, creates links to them in the unified name space, but leaves the data on the remote stores. Other programs may also be implemented to perform this general function. To be more specific, the Ecila search engine scans the web for documents in French, and uses the filesystem as an indexing engine. However, they are writing their engine to be a general purpose engine; they have sold support and the addition of extensions to it to other search engine companies, and it is open source. For now we are simply functioning as part of their engine, and the interface is by web browser: at some point we may be able to add their functionality to the namespace. While the implementation of Microsoft's attempt to blur the distinction between the filesystem name space and the web namespace is one more of appearance than substance, it is surely the right thing to do for Linux as well in the long run. We should simply make our integration one with substance and utility, rather than integrating mostly the look and feel. When the store is external to the primary store for the namespace, then stale names can be an issue with no clean resolution. That said, unification at just the naming layer is, in a real rather than ideal world, often quite useful, and so we have Internet search engines.
GUI based naming is beyond the scope of this paper, except to mention that it is common for GUI namespaces to be designed such that they are not well integrated with the other namespaces of the OS. They are often thought to necessarily be less powerful, but proper integration would make this untrue, as they would then be additional syntaxes, not substitutes. These additional syntaxes should possess closure within the general name space, and thereby be capable of finding employment as components of compound names like all the other types of names. The compound names should be able to contain both GUI and non-GUI based name components. Integration would make them simply the aspect of naming that applies to what is present in the visual cache of the screen, and to how to manage and display that cache most effectively.

Vicinity Set Intersection Definition (Also Called Grouping)

Suppose you have a set X of objects. Suppose some of these objects are associated with each other. You can draw them as connected in a graph. Let the vicinity of an object A be the set of objects associated with A. Let there be a set of query objects Q. Then the set vicinity intersection of Q is the set of objects which are a member of all vicinities of the objects in Q. When thinking of this as a data model, it seems natural to use the term vicinity set intersection. When thinking of this syntactically, it seems natural to use the term "grouping", because it implies that the subnames are grouped together without the order of the subnames being significant. There is exactly one data model primitive (set vicinity intersection) possessing exactly one syntax (grouping), and I rarely intend to distinguish data model primitive from syntax primitive (I can be criticized for this), and yet I use both terms for it, forgive me.

Synthesizing Ordering and Grouping

I am going to describe a toy naming system that allows focusing on how best to combine grouping and ordering into one naming system.
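The vicinity set intersection just defined can be sketched in a few lines. This is an illustrative toy only; the object names and associations below are invented, not part of any Namesys implementation.

```python
# Sketch of vicinity set intersection (grouping).
# vicinity[A] = the set of objects associated with A.
vicinity = {
    "dragon":  {"hobbit", "beowulf"},
    "gandalf": {"hobbit", "lotr"},
    "bilbo":   {"hobbit", "lotr"},
}

def vicinity_intersection(query):
    """Return the objects that lie in every vicinity of the query set Q."""
    sets = [vicinity[q] for q in query]
    result = sets[0].copy()
    for s in sets[1:]:
        result &= s          # plain set intersection
    return result

# [dragon gandalf bilbo] resolves to the one story all three group to.
print(vicinity_intersection(["dragon", "gandalf", "bilbo"]))  # {'hobbit'}
```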
This synthesis will contain the core features of the hierarchical, keyword, and relational systems as functional subsets. It consists of a few simple primitives, allowed to build on each other. It sets the discussion framework from which our project will over many years evolve a real naming system out of its current storage layer implementation. Resolving the second component of an ordering is dependent on resolving the first --- unlike set theory. In set theory one can derive ordered set from unordered set, but because resolving the name of the second component depends on the first component one cannot do so in this naming system. For this reason it can well be argued that this naming system is not truly set theory based. Now that I have mentioned this difference I will start to call them grouping and ordering, rather than unordered and ordered set. These two primitives take other names as sub-names, and allow the user to construct compound names. Either the order of the subnames is significant (ordering), or it isn't (grouping), and thus we have the two different primitives. Because I have myself found that BNFs are easier to read if preceded by examples, I will first list progressively more complex examples using the naming system, and then formally define them. The examples, and the simplified syntax, use / rather than : or \, but this is of no moment.

Examples

/etc/passwd

Ordering and grouping are not just better; file system upward compatibility makes them cheaper for unifying naming in OSes based on hierarchical file systems than a relational naming system would be. This approach is fully upwardly compatible with the old file system. Users should be able to retain their old habits for as long as they wish, engage in a slow comfortable migration, and incorporate the new features into their habits as they feel the desire. Elderly programs should be untroubled in their operation.
Many worthwhile projects fail because they emphasize how much they wish to change rather than asking of the user the minimal collection of changes necessary to achieve the added functionality.

[dragon gandalf bilbo]
FIGURE 3. Graphical representation of ascii name on left

Mr. B. Bizy is looking for a dimly remembered story (The Hobbit by Tolkien) to print out and take with him for rereading during the annual company meeting.

case-insensitive/[computer privacy laws]
FIGURE 4. Graphical representation of ascii name on left

When one subname contains no information except relative to another subname, and the order of the subnames is essential to the meaning of the name, then using ordering is appropriate. This most commonly occurs when syntax barriers are crossed. This is when a single compound name makes a transition from interpreting a subname according to the rules of one syntax to interpreting it according to the rules of another syntax. Ordering is essential at the boundary between the name of the new syntax as expressed in the current syntax, and the name to be interpreted according to that new syntax. Some researchers use the term context rather than syntax. The pairing of a program or function name, and the arguments it is passed, is inherently ordered. While that is usually the concern of the shell, when we use a variety of ordering functions to sort Key_Objects of different types it affects the object store. In this example the ordering serves as a syntax barrier. Case-insensitive is the unabbreviated name of a directory that ignores the distinction between upper and lower case. For Linux compatibility this naming layer is case sensitive by default, even though I agree with those who think that it would be better were it not.

[my secrets]/[love letter susan]
FIGURE 5. Graphical representation of ascii name on left

Devhuman (that's the account name he chose) is the company's senior programmer.
Six years ago he wrote a love letter to Susan, which he put in his read protected secrets directory. (He never found the nerve to send it to her.) He's looking for it so he can rewrite it, and then consider sending it. Security is a particular kind of syntax barrier (you have to squint a bit before you can see it that way). Here the ordering serves as a security barrier. (He certainly wouldn't want anyone to know that an object owned by him with attributes love letter susan existed.)

[subject/[illegal strike] to/elves from/santa document-type/RFC822 ultimatum]
FIGURE 6. Graphical representation of search for santa's ultimatum

Devhuman knows his object store cold. He is looking for something he saw once before, he knows that it was auto-named by a particular namer he knows well (perhaps one whose functionality is similar to the classifier in [Messinger]), and he knows just what categorizations that namer uses when naming email. Still, he doesn't quite remember whether the word 'ultimatum' was part of the subject line, the body, or even was just elvish manual supplementation of the automatic naming. Rather than craft a query carefully specifying what he does and does not know about the possible categorizations of ultimatum, he lazily groups it. If Devhuman's object store is implemented using this naming system with good style, someone less knowledgeable about the object store would also be able to say: [santa illegal strike ultimatum elves] and perhaps get some false hits as well as the desired email (instead of finding mail from santa perhaps finding the elvish response). Notice that if you delete 'illegal' and 'ultimatum' to get [subject/strike to/elves from/santa document-type/RFC822] the query is structurally equivalent to a relational query. Many authors (e.g. semantic database designers) have written papers with good examples of standard column names which might be worth teaching to users.
So long as they are an option made available to the user rather than a requirement demanded of the user, the increased selectivity they provide can be helpful.

[_is-a-shellscript bill]
FIGURE 7. Graphical representation of ascii name on left

This name finds all shellscripts associated with bill. Names preceded by _ are pruners. Pruners are analogous to the predicate evaluators of relational database theory. If you have read papers distinguishing between recognition and retrieval, pruners are a recognition primitive. They are passed a list of objects, and return a subset of that list which matches some criteria. They are a mechanism appropriate for when a nonlinear search method that can deliver the desired functionality is either impossible, or not supported by existing indexes. There are many useful names for which we cannot do better than linear time search algorithms (perhaps simply as a result of incomplete indexing). _is-a-shellscript checks each member of its list to see if it is an executable object containing solely ascii. The user can use it just like any other Key_Object within an association; it will prune the results of the grouping. Since set intersections are commutative its order within the grouping has no meaning, and optimizers are free to rearrange it.

The Formal Definitions

<Object Name> ::= <Grouping>
              | <Ordering>
              | <Key_Object>
              | <Storage_Key>
              | <Orthogonal and Unoriginal Primitives I Will Not Define Here> ;

See the section listing orthogonal and unoriginal primitives for a discussion of what primitives I left out of the definitions of this grammar that are necessary to a real world working system. The name resolver has a method for converting all of the primitives into <Storage_Keys>, and when processing the compound names it first converts the subnames into <Storage_Keys>, though the object may have null contents, and serve purely to embody structure.
This allows the use of anything which anyone can invent a way of allowing the user to find an <Object Name> for, and then invent a method for the resolver to convert the <Object Name> into a <Storage_Key>, as a component of a grouping or ordering. In a word, closure. Extensible closure. Compound names are interpreted by first interpreting the subnames that they are constructed from. At each stage of subname interpretation an <Object Name> is converted into a <Storage_Key> for the object that it is resolved to. The modules that implement the grouping and ordering primitives do not interpret the subnames; they merely pass them to the naming system which returns the <Storage_Key>s they resolve to. It was a long discussion which led to the use of storage keys rather than objectids. A storage key differs from an objectid in that it gives the storage layer directions as to where to try to locate the object in the logical tree ordering of the storage layer. If the logical location changes, then in the worst case we leave a link behind, and get an extra disk access like we get with an inode. (Inode numbers are functionally objectids.) In the better case, the repacker eventually comes along, and changes all references by key to the new location, at least for all objects that have not given their key to external naming systems the repacker cannot repack. A <Storage_Key> is assigned by the system at object creation, and serves the purpose of allowing the system to concisely name the object, and provide hints to the storage layer about which objects should be packed near each other. The user does not directly interact with the <Storage_Key> any more often than C programmers hardcode pointers in hex. The packing locality of keys may be redefined.

The Primitives

<Key_Object>

A description of the contents of an object using the syntax of the current directory. For objects used to embody keywords this may be the keyword in its entirety. If it contains spaces, etc.
it must be enclosed in quotes. Note that making it easy for third parties to add plug-in directory types is part of Namesys's current contract with Ecila. Ecila wants space efficient directories suitable for use in implementing a term dictionary and its postings files for their Internet search engine. Example: [reindeer chimneys presents man] In this example 'presents', 'reindeer', 'chimneys', and 'man' are the contents of objects associated with the Santa Claus story. Each of them is searched for by contents, and then when found they are converted into their Storage_Keys, and then the grouping algorithm is fed their four Storage_Keys. The grouping module then looks in the object headers of the four objects, gets the four sets of objects the Key_Objects group to, and performs a set intersection. Besides greater closure, another advantage of storing Key_Objects as objects is that non-ascii Key_Objects and ordering functions can be implemented as a layer on top of the ascii naming system, allowing the user to interact with the naming system by pressing hyperbuttons, drawing pictures, making sounds, and supplying other non-ascii Key_Objects that the higher layers convert into Storage_Keys. There are endless content description techniques. If the directory owner supplies an ordering function for the Key_Objects in a directory, one can generate a search index for the directory using a directory plug-in which is fully orthogonal to the ordering function, though perhaps slower in some cases than one that is tailored for the ordering function. Users will find it easier to write ordering functions than index creation objects, and will not always need the speed of specialized indexes. We will need one ordering function for ascii text, another for numbers, another for sounds, perhaps someday one even for pictures of faces (perhaps to be used by a law enforcement agency constructing an electronic mug book, or a white pages implementation), etc.
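The resolution pipeline just described (contents lookup, conversion to Storage_Keys, then header intersection by the grouping module) can be sketched as follows. The keys, header contents, and index are invented stand-ins for illustration, not the real on-disk structures.

```python
# Sketch of Key_Object grouping resolution.

# Contents lookup: Key_Object contents -> Storage_Key (invented keys).
contents_index = {"reindeer": 101, "chimneys": 102, "presents": 103, "man": 104}

# Object header: Storage_Key -> set of Storage_Keys the object groups to.
# 900 stands for the Santa Claus story.
header = {
    101: {900, 901},
    102: {900},
    103: {900, 902},
    104: {900, 903},
}

def resolve_grouping(words):
    keys = [contents_index[w] for w in words]   # find by contents
    groups = [header[k] for k in keys]          # read the object headers
    result = groups[0].copy()
    for g in groups[1:]:
        result &= g                             # grouping = set intersection
    return result

print(resolve_grouping(["reindeer", "chimneys", "presents", "man"]))  # {900}
```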
No system designer can provide all the different and sometimes esoteric ordering functions which users will want to employ. What we can do is create a library of code, from which users can construct their own ordering function and their own directory plug-ins, and this is the approach we are taking on behalf of Ecila. For an Internet search engine one wants what is called a postings file, which is like a directory in that there is no need to support a byte offset, and one frequently wants to efficiently perform insertions into it.

<Grouping> ::= [<Unordered List>] ;
<Unordered List> ::= <Unordered List> <Unordered List>
                 | <Object Name>
                 | <Pruner> ;
<Pruner> ::= _<Object Name> ;

A <Grouping> is a list of object names and pruners whose order has no meaning. Every object has a list of objects it groups to (associates with, in neural network idiom) in its object header. A grouping is interpreted by performing a set intersection of those lists for every object named in the grouping. In the sense of the data model, interpreting a grouping means performing a set vicinity intersection. Grouping is not transitive: [A] => B and [B] => C does not imply [A] => C, though it does imply that [[A]] => C. A pruner is an <Object Name> which has been preceded with an _ to indicate that the object described should be passed a list of objects named by the rest of the grouping, executed, and it will return a subset of the list it was passed. Whether a member of the set is in the returned subset must be fully independent of what the other members of the set were, or else the results become indeterminate after application of a query optimizer, as with an optimizer in use there is no guarantee provided of the order of application of the pruners.

<Ordering> ::= <Object Name>/<Object Name>
           | <Object Name>/<Custom Programmed Syntax>
<Custom Programmed Syntax> ::= Varies, provides extensibility hook.
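A pruner such as _is-a-shellscript is simply a predicate applied by linear scan to the candidate list a grouping produced. This sketch uses invented object records and a stand-in predicate; the real check on executable ascii objects would of course inspect the storage layer, not dictionaries.

```python
# Sketch of a pruner: a recognition primitive applied by linear scan.

def is_a_shellscript(obj):
    """Stand-in for _is-a-shellscript: executable and solely ascii."""
    return obj.get("executable", False) and obj["contents"].isascii()

def prune(candidates, predicate):
    # Keep only members matching the predicate. Because the verdict on
    # one member is independent of the other members, an optimizer may
    # apply pruners in any order without changing the result.
    return [obj for obj in candidates if predicate(obj)]

candidates = [
    {"name": "deploy.sh", "executable": True,  "contents": "#!/bin/sh\nmake"},
    {"name": "bill.png",  "executable": False, "contents": "\x89PNG"},
]
print([o["name"] for o in prune(candidates, is_a_shellscript)])  # ['deploy.sh']
```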
An ordering is a pairing of names, with the order representing information. The first component of the ordering determines the module to which the second component is passed as an argument. In contrast, a grouping first converts all subnames to Storage_Keys by looking through the same current directory for all of them in parallel, and then does its set intersection with the subdescriptions already resolved. Example: In resolving [my secrets]/[love letter susan] the system would look for the objects with contents my and secrets, find both of them, and do a set intersection of all of the objects those two objects both group to (are associated with). This will allow it to find the [my secrets] directory, inside of which it will look for the three objects love, letter, and susan. It will then extract from their object headers the sets of objects those three words ('love', 'letter', and 'susan') group to, and do a set intersection which will find the desired letter. The desired letter is not necessarily inside the [my secrets] directory, though in this case it probably is. A directory is an object named by the first component of an ordering, to which the second component is passed, and which returns a set of Storage_Keys. One can in principle use different implementations of the same directory object without impacting the semantics and only affecting performance, as is often done in databases. There are flavors of directories: Custom programmed directories, aka filters, are any executable program that will return a Storage_Key when executed and fed the second component as an argument. They provide extensibility. (They are the ordered counterpart of pruners.) Another term for them is filter directories. Custom programmed directories whose name interpretation modules aren't unique to them will contain just the name of the module (filter), plus some directory dependent parameters to be passed to the module.
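The asymmetry of ordering resolution can be sketched briefly: the first component selects a directory object, and the second component is handed to that directory's own interpretation module. The directory class and its entries below are invented for illustration only.

```python
# Sketch of ordering resolution: first component picks the module,
# second component is interpreted under that module's syntax.

class Directory:
    """A directory interprets the second component of an ordering and
    returns a set of Storage_Keys (invented integers here)."""
    def __init__(self, entries):
        self.entries = entries              # name -> set of Storage_Keys
    def interpret(self, subname):
        return self.entries.get(subname, set())

directories = {
    "my-secrets": Directory({"love letter susan": {7001}}),
}

def resolve_ordering(first, second):
    # The second component is meaningless until the first has resolved:
    # this is why ordering cannot be derived from grouping here.
    return directories[first].interpret(second)

print(resolve_ordering("my-secrets", "love letter susan"))  # {7001}
```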
It should be considered merely a syntax barrier directory, and not a fully custom programmed directory, if those parameters include a reference to a search tree that the module operates on, and if that search tree adheres to the default index structure. The connotations conveyed by the term 'filter' of there being an original which is distorted are not always appropriate, but in honesty this is not an issue about which we deeply care. Syntax barrier directories allow you to describe the contents of the objects they contain with a syntax different from that of their parents. Except for being sorted by a different ordering function, the indexes of syntax barrier directories are standard in their structure, and use a standard index traversal module. The index traversal module is ordering function independent. There must be an ordering function for every <Key_Object> employed within a given syntax barrier directory. By contrast, a <Custom Programmed Syntax> could be anything which the syntax module somehow finds an object with, possibly even creating the object in order to be able to find it. To cross a security barrier directory the user must use an ordered pair of names with the security barrier as the first member of the pair, and he must satisfy the security module of the secured directory. A security barrier directory may be both a security and a syntax barrier directory, or the security barrier directory may share the syntax module of its parents. Fully standard directories are those built using the default directory module, and adding structure is their only semantic effect. There is an aspect of customization which is beyond the scope of this paper, in which one customizes the items employed by the storage layer to implement files and directories. That is, the storage of files and directories is implemented by composing them of items, and these items have different types.
We are now creating the code for packing and balancing arbitrary types of items using item handlers and object oriented balancing code, so as to make it easier to extend our filesystem. Ordering can be implemented more efficiently than grouping The set intersections performed in evaluating the grouping primitive are normally much more expensive computationally than performing the classical filesystem lookup. Imposing excess structure on one's data does not just at times reduce the cost of human thinking :-), it can be used to reduce the cost of automated computation as well. When the cost to a user of learning structure is less important than the burden on the machine, use of highly ordered names is often called for. The Motivation for Different Syntactic Treatment of Ordering and Grouping, and Some of the Deeper Issues Revealed by the Difference. An important difference between grouping and ordering affects syntax. It allows us to represent an ordering with a single symbol ( '/') placed between the pair, but requires two symbols ( '[' and ']' ) for each grouping. Imagine using < and > as a two symbol delimiter style alternative notation for ordering: <<father-of mother-of>sister-of> = <father-of<mother-of sister-of> > = <father-of mother-of sister-of> = father-of /mother-of /sister-of All of the expressions above are equivalent in referring to the paternal great aunt of the person who is the current context. The ones using nested pairs of symbols to enclose pairs of subnames imply a false structure that requires the user to think to realize the first two expressions are equivalent. The fourth is the notation this naming system employs. Grouping is different: Fast Acting Freddy is looking through the All-LA Shopping Database for a single store with black reebok sneakers, a green leather jacket, and a red beret so that he can dress an actor for a part before the director notices he forgot all about him. 
[[black reebok sneakers] [green leather jacket] [red beret]] is not equivalent to [black reebok sneakers green leather jacket red beret] which equals [red sneakers black reebok jacket green beret] Ordering is not algebraically commutative (father-of/mother-of is not equivalent to mother-of/father-of ). Groupings are algebraically commutative. ([large red] = [red large]) Style As a general principle, a more restricted system can avoid requiring the user to repeatedly specify the restrictions, and if the user has no need to escape the restrictions then the restricted system may be superior. This is why "4GLs", which supply the structure for the user's query, are useful for some applications. They are typically implemented as layers on top of unrestricting systems such as this one. This paper has addressed issues surrounding finding information, particularly when the user's clues are faint. When supporting other user goals, such as exploring information, adding structure through substantial use of ordering can be helpful. [Marchionini][McAleese]. When the user goal is finding, one should assume that of all the fragments of information about an object, the user has some random subset of them. The goal is to allow the user to use that random subset in a name, whatever that subset might be. Some of that subset will be structural fragments. While requiring the user to supply a structure fragment is as foolish as requiring him to supply any other arbitrary fragment, allowing him to is laudable. In the best of all worlds the object store would incorporate all valid possible structurings of Key_Objects. The difficulty in implementing that is obvious. [Metzler and Haas] discuss ways of extracting structure from English text documents, and why one would want to be able to use that structure in retrievals. 
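Freddy's inequivalence can be demonstrated concretely. In this sketch the word-to-item and item-to-store associations are invented: the nested query resolves each inner grouping to a specific item first and then intersects the items' vicinities, while the flat query, being commutative, loses the color-to-item pairing entirely.

```python
# Nested vs. flat grouping over an invented shopping graph.
vicinity = {
    # words group to merchandise items
    "black": {"sneaker1", "jacket2"}, "reebok": {"sneaker1", "jacket2"},
    "sneakers": {"sneaker1"},         "green": {"jacket1", "sneaker2"},
    "leather": {"jacket1"},           "jacket": {"jacket1", "jacket2"},
    "red": {"beret1", "sneaker2"},    "beret": {"beret1"},
    # merchandise items group to the stores stocking them
    "sneaker1": {"storeA"}, "jacket1": {"storeA"}, "beret1": {"storeA"},
    "sneaker2": {"storeB"}, "jacket2": {"storeB"},
}

def group(names):
    result = None
    for n in names:
        result = vicinity[n] if result is None else result & vicinity[n]
    return result

# Nested: resolve each inner grouping to an item, then intersect
# the items' vicinities to find the one store stocking all three.
inner_sets = [group(["black", "reebok", "sneakers"]),
              group(["green", "leather", "jacket"]),
              group(["red", "beret"])]
nested = group([item for s in inner_sets for item in s])

# Flat: one big commutative intersection over all eight words.
flat = group(["black", "reebok", "sneakers", "green",
              "leather", "jacket", "red", "beret"])

print(nested)  # {'storeA'}
print(flat)    # set() -- the color/item pairing was lost
```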
Unfortunately, there is an important difference between representing the structure of an English language sentence in a way that conveys its meaning, and representing it in a way that allows it to be found by someone who knows only a fragment of its semantic content. I doubt the wisdom of trying to advocate the use of more than essential structure in searching. You can allow users to avoid false structure; you cannot force them to. It is important to teach those creating the structure that if they group a personnel file with sex/female they should also group it with female. Type checking can impose structure usefully. Its implementation can enhance or reduce closure, depending on whether it is done right.

When To Decompound Groupings

There are dangers in excessive compounding of compound groupings analogous to those of excessive ordering. Let's examine two examples of compound groupings, both of which are valid both semantically and syntactically. One of them can be "decompounded" with moderate information loss, and the other loses all meaning if decompounded. Example: Finding a loquacious Celtic textbook salesman who told you in excruciating detail about how he was an ordinance researcher until one day he went to a Grateful Dead concert. [[Celtic textbook salesman] [ordinance researcher]] vs. [celtic textbook salesman ordinance researcher] These two phrasings of the same query are not equivalent, but they are "close." Our second example is the one in which Fast Acting Freddy tries to find a suspect by the objects he is associated with: [[black reebok sneakers] [green leather jacket] [red beret]] vs. [black reebok sneakers green leather jacket red beret] These two are not at all "close." The difference between the two examples of inequivalencies is that the subdescriptions within the second example describe objects whose existence within the object store is worthwhile independent of the object being described.
The first does not, and it is more reasonable to try to design so that the "decompounded" version of the query is used. False hits will occur, but for large systems that's better than asking the user to learn structure. A higher level user interface might choose to present only one level to the user at a time, and then once the user confirms that a subdescription has resolved properly it would let him incorporate it into a higher level description. There might be 6 models of [black reebok sneakers], and Fast Acting Freddy should have the opportunity to click his mouse on the exact model, and have the interface substitute that object for his subdescription. Using such an interface an advanced user might simultaneously develop several subdescriptions, refine and resolve them, and then use the mouse to draw lines connecting them into a compound grouping. Closure makes it possible for that to work.

Examples of Creating Associations

<- creates an association between all of the objects on the left hand side and all of the objects on the right hand side. A - B is the set difference of A and B, and it resolves to the set of objects in A except for those that are in B. A & B resolves to the set intersection of A and B, the objects that are both in A and B. [A B] = [A] & [B], by definition.
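The operators just defined map directly onto ordinary set operations. This is a minimal sketch, with an invented dictionary standing in for the object headers; only the <-, -, and & operators defined above are modeled.

```python
# Sketch of the association operators: <- (associate), - (difference),
# & (intersection). Object names are invented.

groups_to = {}   # object -> the set of objects it groups to

def associate(left, right):
    """left <- right: associate every left object with every right object."""
    for obj in left:
        groups_to.setdefault(obj, set()).update(right)

associate(["cat"], ["mammal", "fur", "whiskers"])

A = groups_to["cat"]        # {'mammal', 'fur', 'whiskers'}
B = {"fur", "tail"}

print(sorted(A - B))        # ['mammal', 'whiskers']  (set difference)
print(sorted(A & B))        # ['fur']                 (set intersection)
```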
animal <- (lives, moves)
mammal <- ([animal], animal, `warm blooded')
cat <- ([mammal], hypernym/mammal, mammal, meronym/fur, fur, meronym/whiskers, whiskers, hypernym/quadruped, quadruped, capability/purr, purr, capability/meow, meow)
Basil <- (owner/Nina, Nina, [siamese], siamese, clever, playful, brave/overly, brave, 'toilet explorer')
bag <- ([container], container, consists-of/`highly flexible material', `highly flexible material')
backpack <- ([bag], shoulderstrap/quantity/2, shoulderstrap, college-student, holonym/backpacker, meronym/shoulderstrap)
mould <- ([fungi] - green/not, furry, `grows on'/surfaces/moist, `killed by'/chlorine)
fungi <- ([plant], plant, leaves/no, flowers/no, green/not)
bird <- ([vertebrate], vertebrate, flies, feathers)
penguin <- ([bird] - flies, bird, hypernym/bird, swims, Linux, [Linux (mascot, symbol)])
siamese <- ([cat], cat, hair/short, short-hair)

Notice how we don't associate siamese with short despite associating it with hair/short, but we do associate Basil with Nina as well as with owner/Nina.

small <-0 little

The above means that small and little are synonyms, and are to be treated as 0 distance away from each other for vicinity calculation purposes. In other, traditional Unix, words, they are hardlinked together. Creating a serious ontology is not our field or task, but worth doing. The reader is referred to WordNet (free), and Cyc by Doug Lenat (proprietary). While we will focus on implementing primitives that allow for creating better ontologies, we are happy to work with persons interested in contributing or porting an ontology.

Other Projects Seeking To Increase Closure In The OS

ATT's Plan 9

[Plan 9] is being produced by the original authors of Unix at ATT research labs. It has influenced CORBA, and /proc is a direct steal from it to Linux. Their major focus is on integration. Their major trick for increasing integration is unifying the name space.
Name spaces integrated into the Plan 9 file system include the status, control, virtual memory, and environment variables of running processes. They have a hierarchical analog to what the relational culture calls constructing views, which the Plan 9 culture calls context binding.

Microsoft's Information At Your Fingertips

Plan 9 ignores integration of application program name spaces, concentrating on OS oriented name spaces. Microsoft's "Information at Your Fingertips" name space integration effort appears to be taking the other approach, and focusing on integrating the name spaces of the various Microsoft applications via OLE and Structured Storage. The application group at Microsoft has long been better staffed and funded than the OS group, and FS developers have long preferred to simply ignore the needs of application builders generally. The primary semantic disadvantages of Microsoft's approach are primitives selected with insufficient care, a lack of closure, and the use of an object oriented rather than set oriented approach in both naming syntax and data model. Realistically, one can say that folks within Microsoft have often made statements favoring name space integration, and in various areas have successfully executed on it, but on the whole I rather suspect that the lack of someone in marketing making a business case for $X in revenue resulting from name space integration has crippled name space integration work at commercial OS producers generally, including MS.

Internet Explorer

Internet Explorer attempts to unify the filesystem and Internet namespaces. At the time of writing, the unity is so superficial, with so little substance, that I would describe it as having the look and feel of integration without most of the substance. Perhaps this will change.
Microsoft's Well Known Performance Difficulties

Despite having many of the leading names in the industry on their payroll, they have somehow managed to create a file system implementation with performance so terrible that it is, for the Unix customer base, a significant consideration contributing to hesitation in moving to NT. It may well have the worst performance of any of the major OS file systems. Their implementation of OLE's structured storage offers extremely poor performance, and their excuse that it is due to the incorporation of transaction concepts into their design is just a reminder that they did a poor job at that as well. They managed to implement something intended to store small objects within a file, and to implement it such that it still suffers from 512-byte granularity problems, problems that they try to somewhat overcome by encouraging the packing of several objects within "storages", at horrible kludge cost.

Storage Layers Above the FS: A Sure Symptom The FS Developer Has Failed

When filesystems aren't really designed for the needs of the storage layers above them, and none of them are, not Microsoft's, not anybody's, then layering results in enormous performance loss. The very existence of a storage layer above the filesystem means that the filesystem team at an OS vendor failed to listen to someone, and that someone was forced to go and implement something on their own. You just have to listen to one of these meetings in which some poor application developer tries to suggest that more features in the FS would be nice; I heard one at a nameless OS vendor. The FS team responds to say: disks are cheap, small object storage isn't really important, we haven't changed the disk layout in 10 years, and changing it isn't going to fly with the gods above us about whom we can do nothing.
At these meetings you start to understand that most people who go into filesystem design are persons who didn't have the guts to pursue a more interesting field in CS. There is a sort of reverse increasing returns effect that governs FS research, in which the more code becomes fixed on the current APIs, the more persons in the field who react with fear to any thought of the field of FS semantics being other than a dead research topic, the less research gets done, and the fewer persons of imagination see a reason to enter the field. Every time one vendor gets a little forward in adding functionality, the other vendors go on a FUD campaign about it breaking standards and therefore being dangerous for mission critical usage. This is a field in which only performance research is allowed, and every other aspect is simply dead. Namesys seeks to raise the dead, and is willing to commit whatever unholy acts that requires. There is no need for two implementations of the set primitive, one called directories, the other called a file with streams, each having a different interface. File systems should just implement directories right, give them some more optional features, and then there is no need at all for streams. If you combine allowing directory names to be overloaded to also be filenames when acted on as files, allowing stat data to be inherited, allowing file bodies to be inherited, and implement filters of various kinds, then in the event that the user happens to need the precise peculiar functionality embodied by streams, they can have it by just configuring their directory in a particular way. There was a lengthy Linux-kernel thread on this topic which I won't repeat in more detail here. The tree architecture of the storage layer of this FS design will lend itself to a distributed caching system much more effectively than the Microsoft storage layer, in part due to its ability to cache not just hits and misses of files, but to cache semantic localities (ranges). 
For more on this topic see later in this paper.

Rufus

The Rufus system [Messinger et al.] indexes information while leaving it in its original location and format. While it does allow the user to create a unified name space, it does not choose to integrate that name space into the operating system. Even so, it is immensely useful in practice, and strongly hints at what the OS could gain if it had a more than hierarchical name space with a data model oriented towards what [Messinger] calls "semi-structured information", such as you find in the RFC822 format for email. When you have 7000 pieces of mail, and linearly searching the mail with a utility like grep takes 10 minutes, it is nice to be able to quickly keyword search via inverted indexes for the mail whose from: field contains billg and that has the words "exclusive" and "bundling" in the body of the message, as you hurriedly search for an old email just before an appearance at court.

Semantic File System

The Semantic File System comes closest to addressing the needs I have described. It is a Unix compatible file system with more than hierarchical naming (attribute based is the term they use). Its data model unfortunately has the important flaw of lacking closure (in it, names of objects are not themselves objects). In my upcoming discussion of the unnecessary lack of closure in hypertext products, notice that the arguments apply to the Semantic File System (and so I won't duplicate them here).

OS/400

IBM's OS/400 employs a unified relational name space. The section of this paper entitled A System Should Reflect Rather than Mold Structure will cover its problems of forcing false structure. Inadequate closure due to mandatory type checking is another source of difficulties for it. While users moan about these two unnecessary design flaws, the essence of the opinions AS/400 partisans have expressed to me has been that the unification of its name space is a great advantage that OS/400 has over Unix.
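The Rufus-style keyword search described above (quickly finding the mail whose from: field contains billg and whose body contains "exclusive" and "bundling") comes down to intersecting inverted-index posting lists. A minimal sketch, with field names and sample messages invented for illustration:

```python
# Minimal inverted-index sketch for the mail-search scenario discussed above.
# The messages and header scheme are invented examples, not Rufus's format.
from collections import defaultdict

index = defaultdict(set)  # term -> set of message ids (a posting list)

def add_message(msg_id, headers, body):
    """Index a message by its from: header and its body words."""
    index["from:" + headers.get("from", "")].add(msg_id)
    for word in body.lower().split():
        index[word].add(msg_id)

def search(*terms):
    """Intersect posting lists: the messages matching every term."""
    sets = [index[t] for t in terms]
    return set.intersection(*sets) if sets else set()

add_message(1, {"from": "billg"}, "an exclusive bundling arrangement")
add_message(2, {"from": "billg"}, "lunch on tuesday")
add_message(3, {"from": "steveb"}, "exclusive bundling terms")

print(search("from:billg", "exclusive", "bundling"))  # -> {1}
```

The point of the inverted index is that each query term costs one lookup plus a set intersection, rather than a linear grep over all 7000 messages.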
I claim these users were right, and later in this paper will propose doing something about it.

Conclusion

While I spent most of this paper on why adding structure to information can be harmful, particularly when it is intended to be found by others sifting through large amounts of other information, this was purely because it is a harder argument than why deleting structure is harmful. My goal was not to be better at unstructured applications than keyword systems, or better at structured applications than the hierarchical and relational systems --- the goal is to be more flexible in allowing the user to choose how structured to be, while still being within a single name space. I claimed that multiple fragmented name spaces cannot match the power and ease of name spaces integrated with closure: closure makes a naming system far more powerful by increasing its ability to compound complex descriptions out of simpler ones. The strong points of this naming system's design are various forms of generalizing abstractions already known to the literature, for greater closure.

Acknowledgments

David P. Anderson and Clifford Lynch helped enormously in rounding out my education, and improving my paper. Their generosity with their time was remarkable. David P. Anderson was simply a great professor, and it was a privilege to work with him. Brian Harvey informed me that it wasn't too obvious to mention that an object store should be unified. Cimmaron Taylor provided me with many valuable late night discussions in the early stages of this paper. I would like to thank Bill Cody and Guy Lohman of the database group at the IBM Almaden Research center for a wonderful learning experience. Vladimir Saveliev kept this file system going when others fell by the wayside. He started as the most junior programmer on the team, and through sheer hard work and dedication to excellence outshone all the other more senior researchers.
Of course after some time he could no longer be considered a junior programmer.

NOTE: See also the DARPA funded, but not endorsed, Reiser4 Transaction Design Document and Reiser4 Whitepaper.

References

1. Blair, David C. and Maron, M. E. "Evaluation of Retrieval Effectiveness for a Full-Text Document-Retrieval System", Communications of the ACM, v28 n3, Mar 1985, p289-299

2. Codd, E. F. "The Relational Model for Database Management: Version 2", c1990, Addison-Wesley Pub. Co. Not recommended as a textbook, Date's is better for that, but worthwhile if you want a long paper by Codd. Notice that he places greater emphasis on closure, and design methodology principles in general, than designers of other naming systems such as hypertext.

3. Date, C.J. "An Introduction to Database Systems", 4th ed., Reading, Mass.: Addison-Wesley Pub. Co., c1986. Contains a well written substantive textbook sneer at the problems of hierarchical naming systems, and a well annotated bibliography.

4. Curtis, Ronald and Larry Wittie "Global Naming in Distributed Systems", IEEE Software, July 1984, p76-80

5. Feldman, Jerome A., Mark A. Fanty, Nigel H. Goddard and Kenton J. Lynne, "Computing with Structured Connectionist Networks", Communications of the ACM, v31, Feb '88, p170(18)

6. Fox, E. A., and Wu, H. "Extended Boolean Information Retrieval", Communications of the ACM, 26, 1983, pp. 1022-1036

7. Gallant, Stephen I., "Connectionist Expert Systems", Communications of the ACM, v31, Feb '88, p152(18)

8. Gates, Bill. Comdex '91 speech on "Information at Your Fingertips", available for $8 on videotape from Microsoft's sales department.

9. Gifford, David K., Jouvelot, Pierre, Sheldon, Mark A., O'Toole, James W. Jr., "Semantic File Systems", Operating Systems Review, Volume 25, Number 5, October 13-16, 1991. They demonstrated that extending Unix file semantics to include nonhierarchical features is useful and feasible. Unfortunately, their naming system lacks closure.

10. Gilula, Mikhail.
"The Set Model for Database and Information Systems", 1st Edition, c1994, Addison-Wesley. Provides a Set Theoretic Database Model in which relational algebra is shown to be a special case of a more general and powerful set theoretic approach.

11. Joint Object Services Submission (JOSS), OMG TC Document 93.5.1

12. Marchionini, Gary, and Shneiderman, Ben. "Finding Facts vs. Browsing Knowledge in Hypertext Systems", Computer, January 1988, p. 70

13. McAleese, Ray "Hypertext: Theory into Practice", edited by Ray McAleese, ABLEX Publishing Corporation, Norwood, NJ 07648

14. Messinger, Eli, Shoens, Kurt, Thomas, John, Luniewski, Allen "Rufus: The Information Sponge", Research Report RJ 8294 (75655), August 13, 1991, IBM Almaden Research Center

15. Metzler and Haas. "The Constituent Object Parser: Syntactic Structure Matching for Information Retrieval", Proceedings of the ACM SIGIR Conference, 1989, ACM Press

16. Nelson, T.H. Literary Machines, self published by Nelson, Nashville, Tenn., 1981. Did much to popularize hypertext; at the time of writing he has still not released a working product, though competitors such as HyperCard have done so with notable success.

17. Mozer, Michael C. "Inductive Information Retrieval Using Parallel Distributed Computation", UCLA

18. Pike, Rob and P.J. Weinberger, "The Hideous Name", AT&T Research Report

19. Pike, Rob, Presotto, Dave, Thompson, Ken, Trickey, Howard, Winterbottom, Phil. "The Use of Name Spaces in Plan 9", available via ftp from att.com. Plan 9 is an operating system intended to be the successor to Unix, and greater integration of its name spaces is its primary focus.

20. Potter, Walter D. and Robert P. Trueblood, "Traditional, semantic, and hyper-semantic approaches to data modeling", Computer, v21, '88, p53(11)

21. Rijsbergen, C. J. Van, Information Retrieval, 2nd ed., Butterworth and Co. Ltd., 1979. Printed in Great Britain by The Whitefriars Ltd., London and Tonbridge

22. Salton, G.
(1986) "Another Look At Automatic Text-Retrieval Systems", Communications of the ACM, 29, 648-656

23. Smith, J.M. and D.C. Smith, "Database Abstractions: Aggregation and Generalization", ACM Transactions on Database Systems, June 1977, pp. 105-133, ICS Report No. 8406, June 1984

24. http://www.win.tue.nl/~aeb/partitions/partition_types.html

[[category:Reiser4]] [[category:Formatting-fixes-needed]]

By Hans Reiser
The Naming System Venture
http://namesys.com
6114 La Salle ave., #405, Oakland, CA 94611
email: reiser@namesys.com

ABSTRACT

For too long the file system has been semantically impoverished in comparison with database and keyword systems. It is time to change! The current lack of features makes it much easier to use the latest set theoretic models rather than older models of relational algebra or hypertext. The current FS syntax fits nicely into the newer model. The utility of an operating system is more proportional to the number of connections possible between its components than it is to the number of those components. Namespace fragmentation is the most important determinant of that number of possible connections between OS components. Unix at its beginning increased the integration of I/O by putting devices into the file system name space. This is a winning strategy; let's take the file system name space and one-by-one eliminate the reasons why the filesystem is inadequate for what other name spaces are used for, one missing feature at a time. Only once we have done so will the hobbles be removed from OS architects, or even OS conspiracies. Yet before doing that, we need a core architecture for the semantics to ensure we end up with a coherent whole. This paper suggests a set theoretic model for those semantics.
The relational models would at times unacceptably add structure to information, the keyword models would at times delete structure, and purely hierarchical models would create information mazes. Reworking their primitives is required to synthesize the best attributes of these models in a way that allows one the flexibility to tailor the level of structure to the need of the moment. The set theoretic model I propose has a syntax that is upwardly compatible with Linux, MacOS, and DOS file system syntax, as well as upwardly compatible with the CORBA naming layer. This is a planning document for the next major version of ReiserFS, that is, a description of vaporware. It is useful to ReiserFS users and contributors who want to know where we are going, and why we are building all sorts of strange optimizations into the storage layer (and especially to those who are willing to help shape the vision in the course of discussions on the reiserfs-list@namesys.com mailing list). Currently the storage layer for ReiserFS is working and useful as an everyday FS with conventional semantics. That storage layer is available as a GPL'd Linux kernel patch at http://namesys.com.

Introduction

Many OS researchers have built hierarchical name spaces that innovate in their effect on the integration of the operating system (e.g. Plan 9 and its file system [Pike]). Relational and keyword researchers rightfully scorn hierarchical name spaces as 20 years behind the state of the art [Date], but pay little attention to integration of the operating system as a design objective in their own work, or as a possible influence on data model design. I won't go into that here. Limiting associations to single key words is an unnecessary restriction.

A Naming System Should Reflect Rather than Mold Structure

The importance of not deleting the structure of information is obvious; few would advocate using the keyword model to unify naming.
What can be more difficult to see is the harm from adding structure to information; some do recommend the relational model for unifying naming (e.g. OS/400). By decomposing a primitive of a model into smaller primitives one can end up with a more general model, one with greater flexibility of application. This is the very normal practice of mathematicians, who in their work constantly examine mathematical models with an eye to finding a more fundamental set of primitives, in hopes that a new formulation of the model will allow the new primitives to function more independently, and thereby increase the generality and expressive power of the model. Here I break the relational primitive (a tuple is an unordered set of ordered pairs) into separate ordered and unordered set primitives. Relational systems force you to use unordered sets of ordered pairs when sometimes what you want is a simple unordered set. Why should a naming system match rather than mold the structure of information? For systems of low complexity, the reasons are deeply philosophical, which means uncompelling. And for multiterabyte distributed systems?... Reiser's Rule of Thumb #2: The most important characteristic of a very complex system is the user's inability to learn its structure as a whole. We must avoid adding structure, or guarantee that the user will be informed of all structure relevant to his partial information. Avoiding adding structure is both more feasible and less burdensome to the user. Hierarchical, relational, semantic, and hypersemantic systems all force structure on information, structure inherent in the system rather than the information represented. If a system adds structure, and the user is trying to exploit partial knowledge (such as a name embodies), then it inevitably requires the user to learn what was added before he can employ his partial knowledge. With complex systems, the amount added is beyond the capacity of users to learn, and information is lost. 
Example: "My name is Kali, your friendly technical support specialist for REGRES. Our system puts the Library of Congress online! How may I help you?" George doesn't know Santa Claus' name: "I'm trying to find the reindeer chimneys christmas man, and I can't get your system to do it."

FIGURE 1. Graphical representation of a typical simple unordered set that is difficult for relational systems.

Kali says: "OK, now let's define a query. Is-a equals man, that's easy. But reindeer? Is reindeer a property of this man?" "Uh no. I wish I could remember the dude's name. I read this story about him a long time ago, and all I can remember is that he had something to do with reindeer and chimneys. The story is on-line, somewhere." "Reindeer chimneys presents man, that's the sort of speech pattern I'd expect from a three year old." Kali corrects him. "Let's see if we can structure this properly. Is reindeer an instance-of of this man? A member-of of this man? It couldn't be a generalization of this man. Hmm..." "No! It's not that complicated. They just have something to do with him." "Pavlov would probably say you associate reindeer with this man, the way the unstructured mind of an animal thinks. But here in technical support we try to help our customers become more sophisticated. Is reindeer a property of this man?" "No. Try propulsion-provider-for." "Do you think that that was the schema the person who put the information in our system used?" "No. Shoot. I can think of a dozen different columns it could be under. But what are the chances that the ones I think of are going to be the same as the ones the dude who put the information in used?" Kali feels satisfaction. "Guess it can't be done, not if you can't structure your REGRES query properly. I'll put you down in my log as a closed ticket, 190 seconds to resolution, not bad." "A keyword system could handle reindeer chimneys christmas man." George grumbles as he stares in despair at his display.
Unfortunately, the Library of Congress is only one of REGRES' many reference aids. George could spend his life at it, and he'd never learn its schema. "But a keyword system would delete even necessary structure inherent to the information. It couldn't handle our other needs!" Kali says before she hangs up. In addition to the searcher's difficulties, having to manufacture structure by specifying the column for reindeer also adds unnecessary cognitive load to the story author's indexing tasks.

A Few of the Other Approaches to This Problem

There is lurking at the heart of my approach a subtle difference between my analysis of naming, and the analysis of at least some others. I started my research by systematically categorizing the different structures embodied by names, placing them into equivalency classes, and then picking one syntax out of each class of functionally equivalent naming structures, on the assumption that each of the equivalency classes has value. For example, I considered that languages sometimes convey structure by word endings (tags), and sometimes by word order, but while the syntax differs, the word order and word ending techniques are equivalent in their power to convey structure. In my analysis of the effect of word ordering I decided that either the ordering mattered, or it did not, and that was the basis for two different naming primitives. Others have instead studied the inherent structure of data, and then from that derived ways of naming. The hypersemantic system [Smith] [Potter] represents an attempt to pick a manageably few columns which cover all possible needs. Generalization, aggregation, classification, and membership correspond to the is-a, has-property, is-an-instance-of, and is-a-member-of columns, respectively. The minor problem is that these columns don't cover all possibilities. They don't cover reindeer, presents, or chimneys for George's query.
The major problem is that they don't correspond as closely as possible to the most common style of human thought, simple unordered association, and they require cognitive effort to transform. The first response of relational database researchers to this is usually to ask: "Why not modify an existing relational database to contain an 'associated' column, put everything in that column, and it would be functionally equivalent to what you want." This is like saying that you can do everything Pascal can do using TeX macros. (They are both Turing complete.) We don't design languages to simply be Turing complete, we design them to be useful. I have seen a colleague do in six lines of SQL (nonstandard SQL) a simple three keyword unordered set that I do in 3 words plus a pair of delimiters, and that traditional keyword systems also handle easily. Doing simple unordered sets well is crucial for highly heterogeneous name spaces, and the market success of keyword systems in Internet searching is evidence of that. If you look at the structure of names in human languages, they are not all tuple structured, and to make them tuple structured might be to distort them. I have merely discussed the burden of naming columns. Most relational systems also require the user to specify the relation name. If column naming is a burden, naming both the column and the relation is no less a burden. Many systems invest effort into allowing you to take the key that you know, and figure out all the relation names and columns that you might choose to pair with it. This is a good idea, but not as good as not imposing extraneous structure to begin with. [Salton] can be read for devastating critiques of the document clustering system, but there is a worthwhile idea lurking within that system. Perhaps it is worthwhile to keep track of a small number of documents which are "close" to a given document.
The document creator could be informed upon auto-indexing the document what other documents appear to be close to it, and asked to consider associating it with them. This is not within our current plan of work, but I don't reject it conceptually. In summary, modularity within the naming system is improved by recognizing unordered grouping and ordering as two different functions that deserve separate primitives rather than being combined into a tuple primitive. The tuple is an unordered set of ordered pairs. There are other useful combinations of unordered grouping and ordering than that embodied by the relation, and the success of keyword systems suggests that a plain unordered set without any ordering at all is the most fundamental and common of them.

Names as Random Subsets of the Information In an Object

A system may still be effective when its assumptions are known to be false. You may regard the above as an overstatement of the notion that we are neural nets, and sometimes our abstract systems deal with assumptions that are not true or false, but are somewhat true. After we are finished stating them in English they lose the delicate weighting possessed by the reality of the situation. Sometimes we find it easier to model without that weighting. Classical economics and its assumption of perfect competition is the best known example of an effective system based on assumptions known to be substantially false. Introductory economics classes usually spend several weeks of class time arguing the merits of building models on somewhat false assumptions. This paper will now use such a somewhat false model to convey a feel for why mandatory pairing of name components causes problems. Assume the user's information from which he tries to construct a description will be some completely random subset of the information about the object. (Some of that information will be structural, and the structural fragments selected will be just as random as the rest.)
Assume a user has 15 random clues of information selected from 300 pieces of information the system knows about some object. Assume the REGRES naming system requires that data be supplied in threesomes (perhaps column name, key name, relation name), and cannot use one member of a threesome without the other members of the threesome. Assume the ANARCHY naming system lacks this restriction, but does so at the cost that it can only use those 10 of the 15 information fragments which do not embody structure. Assume the 15 pieces of information the user has to construct a name with are statistically fully independent and equally likely (this is both substantially wrong, and unfair to REGRES, but ....). Assume each clue has a selectivity of 100 (it divides the number of objects returned by 100). Then ANARCHY has a selectivity of 100^10 = 10^20 = good. REGRES has a selectivity of 100^(chance that the other two members of an object's threesome are possessed by the user x 15) = 100^((9/300) x (8/300) x 15) ≈ 1.06 = very bad. While it is not true that the clues are fully independent, it is true that to the extent that they are not fully dependent, ANARCHY will gain in selectivity compared to REGRES. Attempting to quantify for any database the extent of the dependence would be a nightmare, and so this model assumes a substantial falsity, through which it is hoped the reader can see a greater truth. For databases of the lower heterogeneity and complexity that the relational model was designed for, the independence within a threesome can be small, and the ability to also employ the 5 of 15 fragments which are structural is often more important than the difficulty of guessing any structure added. There is an implicit assumption here that you are looking for information that others have structured, and this argument in favor of ANARCHY becomes much less strong without this assumption.
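The back-of-envelope selectivities above can be checked directly, using the same stated assumptions (15 random clues out of 300, per-clue selectivity of 100, ANARCHY using the 10 non-structural fragments):

```python
# Recomputing the ANARCHY vs. REGRES selectivities under the assumptions
# stated in the text above.
clue_selectivity = 100

# ANARCHY uses its 10 non-structural fragments independently:
anarchy = clue_selectivity ** 10            # 100^10 == 10^20

# REGRES can use a fragment only if the other two members of its threesome
# were also drawn; expected usable exponent = (9/300) * (8/300) * 15:
regres = clue_selectivity ** ((9 / 300) * (8 / 300) * 15)

print(f"ANARCHY selectivity: {anarchy:.3g}")   # ~1e20
print(f"REGRES selectivity:  {regres:.3g}")    # ~1.06
```

The exact figures reproduce the paper's: 100^10 is exactly 10^20, and 100^0.012 comes out to about 1.06.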
I feel obligated to stress once again that I do not advocate low structure over high structure, but I do advocate having the flexibility to match the amount of structure to the needs of the moment. Only with such flexibility can one hope to use all of the 15 fragments that happen to be possessed.

The Syntax In More Detail

What's needed is a naming system intended to reflect just the structure inherent in the information, whatever that structure might be, rather than restructuring the information to fit the naming system.

Orthogonal or Unoriginal Primitives and Features

There are many primitives that the ultimate naming system would include but which I will not discuss here: macros, OR, weights for subnames and AND-OR connectors [Fox], rules, constraints, indirection, links, and others. I have tried to select only those aspects in which my approach differs from the standard approach. Unifying the namespace does not require unifying automatic name generation, and those who read the [Blair] vs. [Salton] controversy likely understand my concluding that whatever the benefits might be of unifying automatic name generation, it is not feasible now, and won't be feasible for a long time to come. The names one can assign an object are kept completely orthogonal to the contents of the object in the implementation of this naming layer. It is up to the owner of the object to name it, and it is up to him to use whatever combination of autonaming programs and manual naming best achieves his purpose. He may name it on object creation, and he may continually adjust its various names throughout its lifetime. See the section defining the "Key_Object primitive" for a discussion of why names should be thought of this way. Technically, object creation only requires that the object be given a Storage_Key.
In practice most users will, in the same act that creates the object, also associate the object with at least one name that will spare them from directly specifying the Storage_Key in hex the next time they make a reference to it. Applications implementing external name spaces can interact with the storage layer by referencing just the Storage_Key. Namesys will provide a manual naming interface, and the API that autonaming programs need to plug into. Companies such as Ecila will provide autonamers for various purposes. Ecila is implementing a program which scans remote stores, creates links to them in the unified name space, but leaves the data on the remote stores. Other programs may also be implemented to perform this general function. To be more specific, the Ecila search engine scans the web for documents in French, and uses the filesystem as an indexing engine. However, they are writing their engine to be general purpose, they have sold support and the addition of extensions to it to other search engine companies, and it is open source. For now we are simply functioning as part of their engine, and the interface is by web browser; at some point we may be able to add their functionality to the namespace. While Microsoft's attempt to blur the distinction between the filesystem name space and the web namespace is one more of appearance than substance, it is surely the right thing to do for Linux as well in the long run. We should simply make our integration one with substance and utility, rather than integrating mostly the look and feel. When the store is external to the primary store for the namespace, then stale names can be an issue with no clean resolution. That said, unification at just the naming layer is, in a real rather than ideal world, often quite useful, and so we have Internet search engines.
GUI based naming is beyond the scope of this paper, except to mention that it is common for GUI namespaces to be designed such that they are not well integrated with the other namespaces of the OS. They are often thought to necessarily be less powerful, but proper integration would make this untrue, as they would then be additional syntaxes, not substitutes. These additional syntaxes should possess closure within the general name space, and thereby be capable of finding employment as components of compound names like all the other types of names. The compound names should be able to contain both GUI and non-GUI based name components. Integration would make them simply the aspect of naming that applies to what is present in the visual cache of the screen, and to how to manage and display that cache most effectively.

Vicinity Set Intersection Definition (Also Called Grouping)

Suppose you have a set X of objects. Suppose some of these objects are associated with each other. You can draw them as connected in a graph. Let the vicinity of an object A be the set of objects associated with A. Let there be a set of query objects Q. Then the set vicinity intersection of Q is the set of objects which are a member of all vicinities of the objects in Q. When thinking of this as a data model, it seems natural to use the term vicinity set intersection. When thinking of this syntactically, it seems natural to use the term "grouping", because it implies that the subnames are grouped together without the order of the subnames being significant. There is exactly one data model primitive (set vicinity intersection) possessing exactly one syntax (grouping), and I rarely intend to distinguish the data model primitive from the syntax primitive (I can be criticized for this), and yet I use both terms for it; forgive me.

Synthesizing Ordering and Grouping

I am going to describe a toy naming system that allows focusing on how best to combine grouping and ordering into one naming system.
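The set vicinity intersection just defined can be sketched in a few lines. This is a minimal illustration, not part of the design: the store contents, object names, and function names are all invented.

```python
# Toy sketch of set vicinity intersection (grouping). Each entry in this
# invented store maps an object to its vicinity: the set of objects it is
# associated with.
store = {
    "dragon":  {"hobbit-story", "smaug-drawing"},
    "gandalf": {"hobbit-story", "lotr-story"},
    "bilbo":   {"hobbit-story", "lotr-story"},
}

def vicinity(obj):
    """The set of objects associated with obj."""
    return store.get(obj, set())

def group(query):
    """Objects that are members of every vicinity of the query objects."""
    vicinities = [vicinity(q) for q in query]
    if not vicinities:
        return set()
    result = vicinities[0].copy()
    for v in vicinities[1:]:
        result &= v          # set intersection; subname order is insignificant
    return result

print(group(["dragon", "gandalf", "bilbo"]))   # {'hobbit-story'}
```

Because intersection is commutative, the result is the same for any permutation of the query, which is exactly why grouping syntax can leave subname order without meaning.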
This synthesis will contain the core features of the hierarchical, keyword, and relational systems as functional subsets. It consists of a few simple primitives, allowed to build on each other. It sets the discussion framework from which our project will over many years evolve a real naming system out of its current storage layer implementation. Resolving the second component of an ordering is dependent on resolving the first --- unlike set theory. In set theory one can derive ordered set from unordered set, but because resolving the name of the second component depends on the first component, one cannot do so in this naming system. For this reason it can well be argued that this naming system is not truly set theory based. Now that I have mentioned this difference I will start to call them grouping and ordering, rather than unordered and ordered set. These two primitives take other names as sub-names, and allow the user to construct compound names. Either the order of the subnames is significant (ordering), or it isn't (grouping), and thus we have the two different primitives. Because I have myself found that BNFs are easier to read if preceded by examples, I will first list progressively more complex examples using the naming system, and then formally define it. The examples, and the simplified syntax, use / rather than : or \, but this is of no moment.

Examples

/etc/passwd

Ordering and grouping are not just better; file system upward compatibility makes them cheaper for unifying naming in OSes based on hierarchical file systems than a relational naming system would be. This approach is fully upwardly compatible with the old file system. Users should be able to retain their old habits for as long as they wish, engage in a slow comfortable migration, and incorporate the new features into their habits as they feel the desire. Elderly programs should be untroubled in their operation.
Many worthwhile projects fail because they emphasize how much they wish to change rather than asking of the user the minimal collection of changes necessary to achieve the added functionality.

[dragon gandalf bilbo]

FIGURE 3. Graphical representation of the ascii name on the left.

Mr. B. Bizy is looking for a dimly remembered story (The Hobbit by Tolkien) to print out and take with him for rereading during the annual company meeting.

case-insensitive/[computer privacy laws]

FIGURE 4. Graphical representation of the ascii name on the left.

When one subname contains no information except relative to another subname, and the order of the subnames is essential to the meaning of the name, then using ordering is appropriate. This most commonly occurs when syntax barriers are crossed, that is, when a single compound name makes a transition from interpreting a subname according to the rules of one syntax to interpreting it according to the rules of another syntax. Ordering is essential at the boundary between the name of the new syntax as expressed in the current syntax, and the name to be interpreted according to that new syntax. Some researchers use the term context rather than syntax. The pairing of a program or function name and the arguments it is passed is inherently ordered. While that is usually the concern of the shell, when we use a variety of ordering functions to sort Key_Objects of different types it affects the object store. In this example the ordering serves as a syntax barrier. Case-insensitive is the unabbreviated name of a directory that ignores the distinction between upper and lower case. For Linux compatibility this naming layer is case sensitive by default, even though I agree with those who think that it would be better were it not.

[my secrets]/[love letter susan]

FIGURE 5. Graphical representation of the ascii name on the left.

Devhuman (that's the account name he chose) is the company's senior programmer.
Six years ago he wrote a love letter to Susan, which he put in his read-protected secrets directory. (He never found the nerve to send it to her.) He's looking for it so he can rewrite it, and then consider sending it. Security is a particular kind of syntax barrier (you have to squint a bit before you can see it that way). Here the ordering serves as a security barrier. (He certainly wouldn't want anyone to know that an object owned by him with attributes love letter susan existed.)

[subject/[illegal strike] to/elves from/santa document-type/RFC822 ultimatum]

FIGURE 6. Graphical representation of the search for santa's ultimatum.

Devhuman knows his object store cold. He is looking for something he saw once before, he knows that it was auto-named by a particular namer he knows well (perhaps one whose functionality is similar to the classifier in [Messinger]), and he knows just what categorizations that namer uses when naming email. Still, he doesn't quite remember whether the word 'ultimatum' was part of the subject line, the body, or even was just elvish manual supplementation of the automatic naming. Rather than craft a query carefully specifying what he does and does not know about the possible categorizations of ultimatum, he lazily groups it. If Devhuman's object store is implemented using this naming system with good style, someone less knowledgeable about the object store would also be able to say:

[santa illegal strike ultimatum elves]

and perhaps get some false hits as well as the desired email (instead of finding mail from santa perhaps finding the elvish response). Notice that if you delete 'illegal' and 'ultimatum' to get [subject/strike to/elves from/santa document-type/RFC822] the query is structurally equivalent to a relational query. Many authors (e.g. semantic database designers) have written papers with good examples of standard column names which might be worth teaching to users.
So long as they are an option made available to the user rather than a requirement demanded of the user, the increased selectivity they provide can be helpful.

[_is-a-shellscript bill]

FIGURE 7. Graphical representation of the ascii name on the left.

This name finds all shellscripts associated with bill. Names preceded by _ are pruners. Pruners are analogous to the predicate evaluators of relational database theory. If you have read papers distinguishing between recognition and retrieval, pruners are a recognition primitive. They are passed a list of objects, and return the subset of that list which matches some criteria. They are a mechanism appropriate for when a nonlinear search method that can deliver the desired functionality is either impossible, or not supported by existing indexes. There are many useful names for which we cannot do better than linear time search algorithms (perhaps simply as a result of incomplete indexing). _is-a-shellscript checks each member of its list to see if it is an executable object containing solely ascii. The user can use it just like any other Key_Object within an association; it will prune the results of the grouping. Since set intersections are commutative, its order within the grouping has no meaning, and optimizers are free to rearrange it.

The Formal Definitions

<Object Name> ::= <Grouping> | <Ordering> | <Key_Object> | <Storage_Key> | <Orthogonal and Unoriginal Primitives I Will Not Define Here> ;

See the section listing orthogonal and unoriginal primitives for a discussion of what primitives I left out of the definitions of this grammar that are necessary to a real world working system. The name resolver has a method for converting all of the primitives into <Storage_Keys>, and when processing compound names it first converts the subnames into <Storage_Keys>, though the object may have null contents, and serve purely to embody structure.
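As a toy illustration of the pruner primitive described above, the following sketch shows why pruner results are independent of evaluation order. The object table, its attributes, and the function names are all invented for illustration.

```python
# Sketch of a pruner: a linear-time recognition primitive applied to the
# candidate set produced by the rest of the grouping. The object table
# below is invented; real objects would be examined, not looked up.
objects = {
    "backup.sh": {"executable": True,  "all_ascii": True},
    "bill.jpg":  {"executable": False, "all_ascii": False},
    "notes.txt": {"executable": False, "all_ascii": True},
}

def is_a_shellscript(name):
    # Decides membership from this one object alone, independent of the
    # other candidates, so pruners commute and an optimizer may apply
    # them in any order.
    attrs = objects[name]
    return attrs["executable"] and attrs["all_ascii"]

def prune(candidates, predicate):
    """Return the subset of candidates matching the predicate."""
    return {c for c in candidates if predicate(c)}

print(sorted(prune(set(objects), is_a_shellscript)))   # ['backup.sh']
```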
This scheme allows the use, as a component of a grouping or ordering, of anything for which anyone can invent a way of allowing the user to find an <Object Name>, and then invent a method for the resolver to convert the <Object Name> into a <Storage_Key>. In a word, closure. Extensible closure. Compound names are interpreted by first interpreting the subnames that they are constructed from. At each stage of subname interpretation an <Object Name> is converted into a <Storage_Key> for the object that it is resolved to. The modules that implement the grouping and ordering primitives do not interpret the subnames; they merely pass them to the naming system, which returns the <Storage_Key>s they resolve to. It was a long discussion which led to the use of storage keys rather than objectids. A storage key differs from an objectid in that it gives the storage layer directions as to where to try to locate the object in the logical tree ordering of the storage layer. If the logical location changes, then in the worst case we leave a link behind, and get an extra disk access like we get with an inode. (Inode numbers are functionally objectids.) In the better case, the repacker eventually comes along, and changes all references by key to the new location, at least for all objects that have not given their key to external naming systems the repacker cannot repack. A <Storage_Key> is assigned by the system at object creation, and serves the purpose of allowing the system to concisely name the object, and provide hints to the storage layer about which objects should be packed near each other. The user does not directly interact with the <Storage_Key> any more often than C programmers hardcode pointers in hex. The packing locality of keys may be redefined.

The Primitives

<Key_Object>

A description of the contents of an object using the syntax of the current directory. For objects used to embody keywords this may be the keyword in its entirety. If it contains spaces, etc.
it must be enclosed in quotes. Note that making it easy for third parties to add plug-in directory types is part of Namesys's current contract with Ecila. Ecila wants space efficient directories suitable for use in implementing a term dictionary and its postings files for their Internet search engine. Example:

[reindeer chimneys presents man]

In this, 'presents', 'reindeer', 'chimneys', and 'man' are the contents of objects associated with the Santa Claus story. Each of them is searched for by contents, and then when found they are converted into their Storage_Keys, and then the grouping algorithm is fed their four Storage_Keys. The grouping module then looks in the object headers of the four objects, gets the four sets of objects the Key_Objects group to, and performs a set intersection. Besides greater closure, another advantage of storing Key_Objects as objects is that non-ascii Key_Objects and ordering functions can be implemented as a layer on top of the ascii naming system, allowing the user to interact with the naming system by pressing hyperbuttons, drawing pictures, making sounds, and supplying other non-ascii Key_Objects that the higher layers convert into Storage_Keys. There are endless content description techniques. If the directory owner supplies an ordering function for the Key_Objects in a directory, one can generate a search index for the directory using a directory plug-in which is fully orthogonal to the ordering function, though perhaps slower in some cases than one that is tailored for the ordering function. Users will find it easier to write ordering functions than index creation objects, and will not always need the speed of specialized indexes. We will need one ordering function for ascii text, another for numbers, another for sounds, perhaps someday one even for pictures of faces (perhaps to be used by a law enforcement agency constructing an electronic mug book, or a white pages implementation), etc.
No system designer can provide all the different and sometimes esoteric ordering functions which users will want to employ. What we can do is create a library of code, from which users can construct their own ordering functions and their own directory plug-ins, and this is the approach we are taking on behalf of Ecila. For an Internet search engine one wants what is called a postings file, which is like a directory in that there is no need to support a byte offset, and one frequently wants to efficiently perform insertions into it.

<Grouping> ::= [<Unordered List>] ;
<Unordered List> ::= <Unordered List> <Unordered List> | <Object Name> | <Pruner> ;
<Pruner> ::= _<Object Name> ;

A <Grouping> is a list of object names and pruners whose order has no meaning. Every object has a list of objects it groups to (associates with, in neural network idiom) in its object header. A grouping is interpreted by performing a set intersection of those lists for every object named in the grouping; in the sense of the data model, this is a set vicinity intersection. Grouping is not transitive: [A] => B and [B] => C does not imply [A] => C, though it does imply that [[A]] => C. A pruner is an <Object Name> which has been preceded with an _ to indicate that the object described should be passed a list of objects named by the rest of the grouping, executed, and made to return a subset of the list it was passed. Whether a member of the set is in the returned subset must be fully independent of what the other members of the set were, or else the results become indeterminate after application of a query optimizer, as with an optimizer in use there is no guarantee of the order of application of the pruners.

<Ordering> ::= <Object Name>/<Object Name> | <Object Name>/<Custom Programmed Syntax> ;
<Custom Programmed Syntax> ::= varies; provides an extensibility hook.
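A toy reading of the <Ordering> production above can be sketched as follows. The directory class, its entries, and the key values are all invented for illustration; the point is only that the second component cannot be resolved until the first has been.

```python
# Toy sketch of resolving an ordering: the first component names a
# directory object, and the second component is then interpreted by that
# directory's own module. Names and keys below are invented.
class Directory:
    def __init__(self, entries):
        self.entries = entries              # name -> Storage_Key

    def resolve(self, subname):
        # A syntax barrier: subname is interpreted by this directory's
        # own rules, not by the rules of the directory that named it.
        return self.entries[subname]

secrets = Directory({"love letter susan": "key:0x9c"})
objects = {"key:0x2a": secrets}             # Storage_Key -> object
root = Directory({"my secrets": "key:0x2a"})

def resolve_ordering(first, second, current):
    key = current.resolve(first)            # resolve the first component,
    return objects[key].resolve(second)     # then the second depends on it

print(resolve_ordering("my secrets", "love letter susan", root))
```

Unlike a grouping, swapping the two components here is not meaningful, which is the formal reason ordering is not commutative.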
An ordering is a pairing of names, with the order representing information. The first component of the ordering determines the module to which the second component is passed as an argument. In contrast, a grouping first converts all subnames to Storage_Keys by looking through the same current directory for all of them in parallel, and then does its set intersection with the subdescriptions already resolved. Example: In resolving [my secrets]/[love letter susan] the system would look for the objects with contents my and secrets, find both of them, and do a set intersection of all of the objects those two objects both group to (are associated with). This will allow it to find the [my secrets] directory, inside of which it will look for the three objects love, letter, and susan. It will then extract from their object headers the sets of objects those three words ('love', 'letter', and 'susan') group to, and do a set intersection which will find the desired letter. The desired letter is not necessarily inside the [my secrets] directory, though in this case it probably is. A directory is an object named by the first component of an ordering, to which the second component is passed, and which returns a set of Storage_Keys. One can in principle use different implementations of the same directory object without impacting the semantics and only affecting performance, as is often done in databases. There are flavors of directories: Custom programmed directories, aka filters, are any executable program that will return a Storage_Key when executed and fed the second component as an argument. They provide extensibility. (They are the ordered counterpart of pruners.) Another term for them is filter directories. Custom programmed directories whose name interpretation modules aren't unique to them will contain just the name of the module (filter), plus some directory dependent parameters to be passed to the module.
It should be considered merely a syntax barrier directory, and not a fully custom programmed directory, if those parameters include a reference to a search tree that the module operates on, and if that search tree adheres to the default index structure. The connotations conveyed by the term 'filter' of there being an original which is distorted are not always appropriate, but in honesty this is not an issue about which we deeply care. Syntax barrier directories allow you to describe the contents of the objects they contain with a syntax different from their parent's. Except for being sorted by a different ordering function, the indexes of syntax barrier directories are standard in their structure, and use a standard index traversal module. The index traversal module is ordering function independent. There must be an ordering function for every <Key_Object> employed within a given syntax barrier directory. By contrast, a <Custom Programmed Syntax> could be anything which the syntax module somehow finds an object with, possibly even creating the object in order to be able to find it. To cross a security barrier directory the user must use an ordered pair of names with the security barrier as the first member of the pair, and he must satisfy the security module of the secured directory. A security barrier directory may be both a security and a syntax barrier directory, or it may share the syntax module of its parent. Fully standard directories are those built using the default directory module, and adding structure is their only semantic effect. There is an aspect of customization which is beyond the scope of this paper, in which one customizes the items employed by the storage layer to implement files and directories. That is, the storage of files and directories is implemented by composing them of items, and these items have different types.
We are now creating the code for packing and balancing arbitrary types of items using item handlers and object oriented balancing code, so as to make it easier to extend our filesystem.

Ordering Can Be Implemented More Efficiently Than Grouping

The set intersections performed in evaluating the grouping primitive are normally much more expensive computationally than performing the classical filesystem lookup. Imposing excess structure on one's data does not just at times reduce the cost of human thinking :-), it can be used to reduce the cost of automated computation as well. When the cost to a user of learning structure is less important than the burden on the machine, use of highly ordered names is often called for.

The Motivation for Different Syntactic Treatment of Ordering and Grouping, and Some of the Deeper Issues Revealed by the Difference

An important difference between grouping and ordering affects syntax. It allows us to represent an ordering with a single symbol ('/') placed between the pair, but requires two symbols ('[' and ']') for each grouping. Imagine using < and > as a two symbol delimiter style alternative notation for ordering:

<<father-of mother-of> sister-of> = <father-of <mother-of sister-of>> = <father-of mother-of sister-of> = father-of/mother-of/sister-of

All of the expressions above are equivalent in referring to the paternal great aunt of the person who is the current context. The ones using nested pairs of symbols to enclose pairs of subnames imply a false structure that requires the user to think to realize the first two expressions are equivalent. The fourth is the notation this naming system employs. Grouping is different: Fast Acting Freddy is looking through the All-LA Shopping Database for a single store with black reebok sneakers, a green leather jacket, and a red beret so that he can dress an actor for a part before the director notices he forgot all about him.
[[black reebok sneakers] [green leather jacket] [red beret]] is not equivalent to [black reebok sneakers green leather jacket red beret], which equals [red sneakers black reebok jacket green beret]. Ordering is not algebraically commutative (father-of/mother-of is not equivalent to mother-of/father-of). Groupings are algebraically commutative ([large red] = [red large]).

Style

As a general principle, a more restricted system can avoid requiring the user to repeatedly specify the restrictions, and if the user has no need to escape the restrictions then the restricted system may be superior. This is why "4GLs", which supply the structure for the user's query, are useful for some applications. They are typically implemented as layers on top of unrestricting systems such as this one. This paper has addressed issues surrounding finding information, particularly when the user's clues are faint. When supporting other user goals, such as exploring information, adding structure through substantial use of ordering can be helpful [Marchionini] [McAleese]. When the user goal is finding, one should assume that of all the fragments of information about an object, the user has some random subset of them. The goal is to allow the user to use that random subset in a name, whatever that subset might be. Some of that subset will be structural fragments. While requiring the user to supply a structure fragment is as foolish as requiring him to supply any other arbitrary fragment, allowing him to is laudable. In the best of all worlds the object store would incorporate all valid possible structurings of Key_Objects. The difficulty in implementing that is obvious. [Metzler and Haas] discuss ways of extracting structure from English text documents, and why one would want to be able to use that structure in retrievals.
Unfortunately, there is an important difference between representing the structure of an English language sentence in a way that conveys its meaning, and representing it in a way that allows it to be found by someone who knows only a fragment of its semantic content. I doubt the wisdom of trying to advocate the use of more than essential structure in searching. You can allow users to avoid false structure; you cannot force them to. It is important to teach those creating the structure that if they group a personnel file with sex/female they should also group it with female. Type checking can impose structure usefully. Its implementation can enhance or reduce closure, depending on whether it is done right.

When To Decompound Groupings

There are dangers in excessive compounding of compound groupings analogous to those of excessive ordering. Let's examine two examples of compound groupings, both of which are valid both semantically and syntactically. One of them can be "decompounded" with moderate information loss, and the other loses all meaning if decompounded. Example: Finding a loquacious Celtic textbook salesman who told you in excruciating detail about how he was an ordnance researcher until one day he went to a Grateful Dead concert.

[[Celtic textbook salesman] [ordnance researcher]] vs. [celtic textbook salesman ordnance researcher]

These two phrasings of the same query are not equivalent, but they are "close." Our second example is the one in which Fast Acting Freddy tries to find a suspect by the objects he is associated with:

[[black reebok sneakers] [green leather jacket] [red beret]] vs. [black reebok sneakers green leather jacket red beret]

These two are not at all "close." The difference between the two examples of inequivalencies is that the subdescriptions within the second example describe objects whose existence within the object store, independent of the object described, is worthwhile.
The first does not, and it is more reasonable to try to design so that the "decompounded" version of the query is used. False hits will occur, but for large systems that's better than asking the user to learn structure. A higher level user interface might choose to present only one level to the user at a time, and then once the user confirms that a subdescription has resolved properly it would let him incorporate it into a higher level description. There might be 6 models of [black reebok sneakers], and Fast Acting Freddy should have the opportunity to click his mouse on the exact model, and have the interface substitute that object for his subdescription. Using such an interface an advanced user might simultaneously develop several subdescriptions, refine and resolve them, and then use the mouse to draw lines connecting them into a compound grouping. Closure makes it possible for that to work.

Examples of Creating Associations

<- creates an association between all of the objects on the left hand side and all of the objects on the right hand side. A - B is the set difference of A and B, and it resolves to the set of objects in A except for those that are in B. A & B resolves to the set intersection of A and B, the objects that are both in A and B. [A B] = [A] & [B], by definition.
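These operators can be sketched in a few lines, assuming a toy store that records each object's associations. All names here are invented for illustration; `associate` stands in for the <- operator.

```python
# Sketch of the association operators: associate() plays the role of "<-",
# and groupings are evaluated with set intersection (&) and difference (-).
# The store and the example names are invented for illustration.
associations = {}   # object -> set of names it is associated with

def associate(obj, names):
    """The "<-" operator: associate obj with every name on the right."""
    associations.setdefault(obj, set()).update(names)

def members(name):
    """All objects associated with a given name."""
    return {o for o, ns in associations.items() if name in ns}

associate("mould", {"fungi", "furry"})
associate("toadstool", {"fungi"})

# [A B] = [A] & [B], by definition; A - B keeps A's objects not in B.
print(members("fungi") & members("furry"))   # {'mould'}
print(members("fungi") - members("furry"))   # {'toadstool'}
```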
animal <- (lives, moves)
mammal <- ([animal], animal, `warm blooded')
cat <- ([mammal], hypernym/mammal, mammal, meronym/fur, fur, meronym/whiskers, whiskers, hypernym/quadruped, quadruped, capability/purr, purr, capability/meow, meow)
Basil <- (owner/Nina, Nina, [siamese], siamese, clever, playful, brave/overly, brave, 'toilet explorer')
bag <- ([container], container, consists-of/`highly flexible material', `highly flexible material')
backpack <- ([bag], shoulderstrap/quantity/2, shoulderstrap, college-student, holonym/backpacker, meronym/shoulderstrap)
mould <- ([fungi] - green/not, furry, `grows on'/surfaces/moist, `killed by'/chlorine)
fungi <- ([plant], plant, leaves/no, flowers/no, green/not)
bird <- ([vertebrate], vertebrate, flies, feathers)
penguin <- ([bird] - flies, bird, hypernym/bird, swims, Linux, [Linux (mascot, symbol)])
siamese <- ([cat], cat, hair/short, short-hair)

Notice how we don't associate siamese with short despite associating it with hair/short, but we do associate Basil with Nina as well as with owner/Nina.

small <-0 little

The above means that small and little are synonyms, and are to be treated as 0 distance away from each other for vicinity calculation purposes. In other words, in traditional Unix terms, they are hardlinked together. Creating a serious ontology is not our field or task, but it is worth doing. The reader is referred to WordNet (free), and Cyc by Doug Lenat (proprietary). While we will focus on implementing primitives that allow for creating better ontologies, we are happy to work with persons interested in contributing or porting an ontology.

Other Projects Seeking To Increase Closure In The OS

AT&T's Plan 9

[Plan 9] is being produced by the original authors of Unix at AT&T research labs. It has influenced CORBA, and Linux's /proc is a direct steal from it. Their major focus is on integration. Their major trick for increasing integration is unifying the name space.
Name spaces integrated into the Plan 9 file system include the status, control, virtual memory, and environment variables of running processes. They have a hierarchical analog to what the relational culture calls constructing views, which the Plan 9 culture calls context binding.

Microsoft's Information At Your Fingertips

Plan 9 ignores integration of application program name spaces, concentrating on OS oriented name spaces. Microsoft's "Information at Your Fingertips" name space integration effort appears to be taking the other approach, and focusing on integrating the name spaces of the various Microsoft applications via OLE and Structured Storage. The application group at Microsoft has long been better staffed and funded than the OS group, and FS developers have long preferred to simply ignore the needs of application builders generally. The primary semantic disadvantages of Microsoft's approach are primitives selected with insufficient care, a lack of closure, and the use of an object oriented rather than set oriented approach in both naming syntax and data model. Realistically, one can say that folks within Microsoft have often made statements favoring name space integration, and in various areas have successfully executed on it, but on the whole I rather suspect that the lack of someone in marketing making a business case for $X in revenue resulting from name space integration has crippled name space integration work at commercial OS producers generally, including MS.

Internet Explorer

Internet Explorer attempts to unify the filesystem and Internet namespaces. At the time of writing, the unification is so superficial, with so little substance, that I would describe it as having the look and feel of integration without most of the substance. Perhaps this will change.
Microsoft's Well Known Performance Difficulties

Despite having many of the leading names in the industry on their payroll, they have somehow managed to create a file system implementation with performance so terrible that it is, for the Unix customer base, a significant consideration contributing to hesitation in moving to NT. It may well have the worst performance of any of the major OS file systems. Their implementation of OLE's structured storage offers extremely poor performance, and their excuse that it is due to the incorporation of transaction concepts into their design is just a reminder that they did a poor job at that as well. They managed to implement something intended to store small objects within a file, and implement it such that it still suffers from 512-byte granularity problems, problems that they try to somewhat overcome by encouraging the packing of several objects within "storages" at horrible kludge cost.

Storage Layers Above the FS: A Sure Symptom The FS Developer Has Failed

When filesystems aren't really designed for the needs of the storage layers above them, and none of them are, not Microsoft's, not anybody's, then layering results in enormous performance loss. The very existence of a storage layer above the filesystem means that the filesystem team at an OS vendor failed to listen to someone, and that someone was forced to go and implement something on their own. You just have to listen to one of these meetings in which some poor application developer tries to suggest that more features in the FS would be nice; I heard one at a nameless OS vendor. The FS team responds that disks are cheap, small object storage isn't really important, we haven't changed the disk layout in 10 years, and changing it isn't going to fly with the gods above us about whom we can do nothing.
At these meetings you start to understand that most people who go into filesystem design are persons who didn't have the guts to pursue a more interesting field in CS. There is a sort of reverse increasing returns effect that governs FS research: the more code becomes fixed on the current APIs, the more persons in the field react with fear to any thought of the field of FS semantics being other than a dead research topic, the less research gets done, and the fewer persons of imagination see a reason to enter the field. Every time one vendor gets a little forward in adding functionality, the other vendors go on a FUD campaign about it breaking standards and therefore being dangerous for mission critical usage. This is a field in which only performance research is allowed, and every other aspect is simply dead. Namesys seeks to raise the dead, and is willing to commit whatever unholy acts that requires. There is no need for two implementations of the set primitive, one called directories, the other called a file with streams, each having a different interface. File systems should just implement directories right, give them some more optional features, and then there is no need at all for streams. If you combine allowing directory names to be overloaded to also be filenames when acted on as files, allowing stat data to be inherited, allowing file bodies to be inherited, and implementing filters of various kinds, then in the event that the user happens to need the precise peculiar functionality embodied by streams, they can have it by just configuring their directory in a particular way. There was a lengthy linux-kernel thread on this topic which I won't repeat in more detail here. The tree architecture of the storage layer of this FS design will lend itself to a distributed caching system much more effectively than the Microsoft storage layer, in part due to its ability to cache not just hits and misses of files, but to cache semantic localities (ranges).
For more on this topic see later in this paper. Rufus The Rufus system [Messinger et al.] indexes information while leaving it in its original location and format. While it does allow the user to create a unified name space, it does not choose to integrate that name space into the operating system. Even so, it is immensely useful in practice, and strongly hints at what the OS could gain if it had a more than hierarchical name space with a data model oriented towards what [Messinger] calls "semi-structured information", such as you find in the RFC822 format for email. When you have 7000 pieces of mail, and linearly searching the mail with a utility like grep takes 10 minutes, it is nice to be able to quickly keyword search via inverted indexes for the mail whose from: field contains billg and that has the words "exclusive" and "bundling" in the body of the message, as you hurriedly search for an old email just before an appearance at court. Semantic File System The Semantic File System comes closest to addressing the needs I have described. It is a Unix compatible file system with more than hierarchical naming (attribute based is the term they use). Its data model unfortunately has the important flaw of lacking closure (in it, names of objects are not themselves objects). In my upcoming discussion of the unnecessary lack of closure in hypertext products, notice that the arguments apply to the Semantic File System (and so I won't duplicate them here). OS/400 IBM's OS/400 employs a unified relational name space. The section of this paper entitled A System Should Reflect Rather than Mold Structure will cover its problems of forcing false structure. Inadequate closure due to mandatory type checking is another source of difficulties for it. While users moan about these two unnecessary design flaws, the essence of the opinions AS/400 partisans have expressed to me has been that the unification of its name space is a great advantage that OS/400 has over Unix.
I claim these users were right, and later in this paper will propose doing something about it. Conclusion While I spent most of this paper on why adding structure to information can be harmful, particularly when it is intended to be found by others sifting through large amounts of other information, this was purely because it is a harder argument than why deleting structure is harmful. My goal was not to be better at unstructured applications than keyword systems, or better at structured applications than the hierarchical and relational systems --- the goal is to be more flexible in allowing the user to choose how structured to be, while still being within a single name space. I claimed that multiple fragmented name spaces cannot match the power and ease of name spaces integrated with closure: closure makes a naming system far more powerful by increasing its ability to compound complex descriptions out of simpler ones. The strong points of this naming system's design are various forms of generalizing abstractions already known to the literature, for greater closure. Acknowledgments David P. Anderson and Clifford Lynch helped enormously in rounding out my education, and improving my paper. Their generosity with their time was remarkable. David P. Anderson was simply a great professor, and it was a privilege to work with him. Brian Harvey informed me that it wasn't too obvious to mention that an object store should be unified. Cimmaron Taylor provided me with many valuable late night discussions in the early stages of this paper. I would like to thank Bill Cody and Guy Lohman of the database group at the IBM Almaden Research center for a wonderful learning experience. Vladimir Saveliev kept this file system going when others fell by the wayside. He started as the most junior programmer on the team, and through sheer hard work and dedication to excellence outshone all the other more senior researchers. 
Of course after some time he could no longer be considered a junior programmer. NOTE: See also the DARPA funded, but not endorsed, Reiser4 Transaction Design Document and Reiser4 Whitepaper. References 1. Blair, David C. and Maron, M. E. "Evaluation of Retrieval Effectiveness for a Full-Text Document-Retrieval System", Communications of the ACM, v28 n3, Mar 1985, pp. 289-299 2. Codd, E. F. "The Relational Model for Database Management: Version 2", c1990, Addison-Wesley Pub. Co. Not recommended as a textbook (Date's is better for that), but worthwhile if you want a long paper by Codd. Notice that he places greater emphasis on closure, and design methodology principles in general, than designers of other naming systems such as hypertext. 3. Date, C. J. "An Introduction to Database Systems", 4th ed., Reading, Mass.: Addison-Wesley Pub. Co., c1986. Contains a well written substantive textbook sneer at the problems of hierarchical naming systems, and a well annotated bibliography. 4. Curtis, Ronald and Larry Wittie, "Global Naming in Distributed Systems", IEEE Software, July 1984, pp. 76-80 5. Feldman, Jerome A., Mark A. Fanty, Nigel H. Goddard and Kenton J. Lynne, "Computing with Structured Connectionist Networks", Communications of the ACM, v31, Feb '88, p170(18) 6. Fox, E. A., and Wu, H. "Extended Boolean Information Retrieval", Communications of the ACM, 26, 1983, pp. 1022-1036 7. Gallant, Stephen I., "Connectionist Expert Systems", Communications of the ACM, v31, Feb '88, p152(18) 8. Gates, Bill. Comdex '91 speech on "Information at Your Fingertips", available for $8 on videotape from Microsoft's sales department. 9. Gifford, David K., Jouvelot, Pierre, Sheldon, Mark A., O'Toole, James W. Jr., "Semantic File Systems", Operating Systems Review, Volume 25, Number 5, October 13-16, 1991. They demonstrated that extending Unix file semantics to include nonhierarchical features is useful and feasible. Unfortunately, their naming system lacks closure. 10. Gilula, Mikhail.
"The Set Model for Database and Information Systems", 1st Edition, c 1994, Addison-Wesley, provides a Set Theoretic Database Model in which relational algebra is a shown to be a special case of a more general and powerful set theoretic approach. 11. Joint Object Services Submission (JOSS), OMG TC Document 93.5.1 12.Marchionini, Gary., and Shneiderman, Ben. "Finding Facts vs. Browsing Knowledge in Hypertext Systems." Computer, January 1988, p. 70 13. McAleese, Ray "Hypertext: Theory into Practice" edited by Ray McAleese, ABLEX Publishing Corporation, Norwood, NJ 07648 14.Messinger, Eli., Shoens, Kurt., Thomas, John., Luniewski, Allen "Rufus: The Information Sponge" Research Report RJ 8294 (75655) August 13, 1991, IBM Almaden Research Center 15. Metzler and Haas. "The Constituent Object Parser: Syntactic Structure Matching for Information Retrieval", Proceedings of the ACM SIGIR Conference, 1989, ACM Press, 16.Nelson, T.H. Literary Machines, self published by Nelson, Nashville, Tenn., 198 1, did much to popularize hypertext, at the time of writing he has still not released a working product, though competitors such as hypercard have done so with notable success. 17. Mozer, Nfichael C. "Inductive Information Retrieval Using Parallel Distributed Computation", UCLA 18.Pike, Rob and P.J. Weinberger ... The Hideous Name" AT&T Research Report" 19.Pike, Rob., Presotto, Dave., Thompson, Ken. Trickey, Howard., Winterbottom, Phil. "The Use of Name Spaces in Plan 9", available via ftp from att.com, Plan 9 is an operating system intended to be the successor to Unix, and greater integration of its name spaces is its primary focus. 20. Potter, Walter D. and Robert P. Trueblood, "Traditional, semantic, and hyper-semantic approaches to data modeling" v21 Computer '88 p53(1 1) 21. Rijsbergen, C. J. Van, Information Retrieval - 2nd. ed., Butterworth and Co. Ltd., 1979, Printed in Great Britain by The Whitefriars Ltd., London and Tonbridge 22. Salton, G. 
(1986) "Another Look At Automatic Text-Retrieval Systems", Communications of the ACM, 29, pp. 648-656 23. Smith, J. M. and D. C. Smith, "Database Abstractions: Aggregation and Generalization", ACM Transactions on Database Systems, June 1977, pp. 105-133, ICS Report No. 8406, June 1984 24. http://www.win.tue.nl/~aeb/partitions/partition_types.html Last modified: Thu Jan 25 14:13:32 MSK 2001 (maintained by reiser@namesys.com). [[category:Reiser4]] IRC There is an IRC channel for <tt>reiser4</tt> and <tt>reiserfs</tt> discussion on [http://www.oftc.net OFTC]: [irc://irc.oftc.net:6667/#reiser4 #reiser4 on irc.oftc.net] [[category:ReiserFS]] [[category:Reiser4]] Logical Volumes Administration '''WARNING! FSCK doesn't work correctly on volumes with scaling-out abilities. Any attempts to repair such volumes will result in data loss''' Before working with logical volumes you need to understand some basic [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Background principles]. A logical volume (LV) can be composed of any number of block devices differing in physical and geometric parameters. However, the optimal configuration (true parallelism) imposes some restrictions and dependencies on the sizes of such devices. WARNING: This stuff is not stable. Don't put important data on logical volumes managed by software of release number 5.X.Y.
Also, don't mount your old partitions in kernels with Reiser4 of SFRN 5.X.Y before its stabilization. IMPORTANT: Currently there are no tools to manage Reiser5 logical volumes off-line, so it is strongly recommended to save/update the configuration of your LV in a file which doesn't belong to that volume. = Basic definitions. Volume configuration. Brick's capacity. Partitioning. Fair distribution. Balancing = The basic configuration of a logical volume is the following information: 1) Volume UUID; 2) Number of bricks in the volume; 3) List of brick names or UUIDs in the volume; 4) UUID or name of the brick to be added/removed (if any). That brick is not counted in (2) and (3). Item #4 exists to handle incomplete operations interrupted for various reasons (system crash, hard reset, etc.) when bringing logical volumes on-line. For each volume its configuration should be stored somewhere (but not on that volume!) and properly updated before and after each volume operation performed on that volume. We make the user responsible for this. The volume configuration is needed to facilitate deploying a volume. '''Capacity of a brick''' (or abstract capacity) is a positive integer number. Capacity is a brick's property defined by the user. Don't confuse it with the size of the block device. Think of it as the brick's "weight" in some units. It is the user who decides which property of the brick to assign as its abstract capacity, and in which units. In particular, it can be the size of the block device in kilobytes, or its size in megabytes, or its throughput in MB/sec, or another geometric or physical parameter of the device associated with the brick. It is important that the capacities of all bricks of the same logical volume are measured in the same units. Also, it would be utterly pointless to assign different properties as abstract capacities for bricks of the same LV: for example, size of the block device for one brick, and disk bandwidth for another one.
The capacity of each brick gets initialized by the mkfs utility. By default it is calculated as the number of free blocks on the device at the very end of the formatting procedure. For the meta-data brick it is calculated as 70% of that amount. The capacity of any brick can be changed on-line by the user. '''Capacity of a logical volume''' is defined as the sum of the capacities of its component bricks. '''Relative capacity of a brick''' is the ratio of the brick's capacity to the volume's capacity. Relative capacity defines the portion of IO-requests that will be issued against that brick. The array of relative capacities (C1, C2, ...) of all bricks is called the volume partitioning. Obviously, C1 + C2 + ... = 1. '''Real data space usage''' on a brick is the number of data blocks stored on that brick. '''Ideal data space usage''' on a brick is defined as T*C, where T is the total number of data blocks stored in the volume and C is the relative capacity of the brick. It is recommended to compose volumes so that the space-based partitioning coincides with the throughput-based one - that would be the optimal volume configuration, which provides true parallelism. If that is impossible for some reason, then choose a preferred partitioning method (space-based or throughput-based). Note that space-based partitioning saves volume space, whereas throughput-based partitioning saves volume throughput. When performing regular file operations, Reiser5 distributes data stripes throughout the volume evenly and fairly. This means that the portion of IO-requests issued against each brick is equal to its relative capacity, that is, to the portion of capacity that the brick adds to the total volume capacity. In contrast with regular file operations, volume operations break the fairness of data distribution on your logical volume. To restore fairness of distribution, a special balancing procedure should be run on the volume.
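The definitions above can be illustrated with a small calculation. This is only a sketch with made-up capacities (no output of any Reiser5 tool): it computes the volume partitioning, the ideal data space usage for a given number of stored data blocks, and the portion of data that balancing relocates after adding a brick, which equals the new brick's relative capacity in the updated volume.

```python
# Hypothetical abstract capacities of three bricks, all in the same units.
capacities = [100_000, 200_000, 100_000]

volume_capacity = sum(capacities)                         # capacity of the LV
partitioning = [c / volume_capacity for c in capacities]  # (C1, C2, ...)
assert abs(sum(partitioning) - 1.0) < 1e-9                # C1 + C2 + ... = 1

T = 1_000_000                                # total data blocks stored in the LV
ideal_usage = [C * T for C in partitioning]  # ideal data space usage per brick

# Portion of data moved by balancing after adding a new brick:
# it equals the new brick's relative capacity in the updated volume.
new_capacity = 100_000
moved = new_capacity / (volume_capacity + new_capacity)

print(partitioning)  # [0.25, 0.5, 0.25]
print(ideal_usage)   # [250000.0, 500000.0, 250000.0]
print(moved)         # 0.2
```

Note how the cost of balancing shrinks with the relative capacity of the added brick: a brick contributing 20% of the new total capacity forces 20% of the stored stripes to move.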
For example, after adding a brick to a logical volume, the balancing procedure will populate the new brick with data moved from other bricks. All volume operations except brick removal are fast, atomic and leave the volume in an unbalanced state. The operation of brick removal always includes balancing, which moves data from the brick you want to remove to the other bricks of the volume. If that data migration is interrupted for some reason, then the volume is marked as a "volume with incomplete brick removal". It is allowed to perform regular file and volume operations on an unbalanced LV (assuming it is not a volume with incomplete removal). However, in this case we don't guarantee a good quality of data distribution on your LV. In addition, on a volume with incomplete removal you won't be able to perform regular volume operations - first you will need to complete the removal by running a special removal completion procedure on your volume. = Prepare Software and Hardware = Build, install and boot a kernel with Reiser4 of software framework release number 5.X.Y. Kernel patches can be found [https://sourceforge.net/projects/reiser4/files/v5-unstable/ here]. When building the kernel, make sure that the Reiser4-specific configuration option "Enable Plan-A key allocation scheme" is '''disabled''', or check that the .config file contains the following line: # CONFIG_REISER4_OLD is not set Note that the Linux kernel and GNU utilities still recognize the testing stuff as "Reiser4". Make sure there is the following message in the kernel logs: "Loading Reiser4 (Software Framework Release: 5.X.Y)" Build and install the latest [https://sourceforge.net/projects/reiser4/files/reiser4-utils/libaal/ libaal]. Download, build and install the latest version 2.A.B of the [https://sourceforge.net/projects/reiser4/files/v5-unstable/ Reiser4progs package]. Make sure that the utility for managing logical volumes is installed (as a part of the reiser4progs package) on your machine: # volume.reiser4 -?
= Creating a logical volume = Start by choosing a unique ID (uuid) for your volume. By default it is generated by the mkfs utility. However, the user can generate it with appropriate tools (e.g. uuidgen(1)) and store it in an environment variable for convenience: # VOL_ID=`uuidgen` # echo "Using uuid $VOL_ID" Choose a stripe size for your logical volume. For a good quality of distribution it is recommended that the stripe doesn't exceed 1/10000 of the volume size. On the other hand, too small stripes will increase space consumption on your meta-data brick. In our example we choose a stripe size of 512K: # STRIPE=512K # echo "Using stripe size $STRIPE" Start by creating the first brick of your volume - the meta-data brick - passing the volume ID and stripe size to the mkfs.reiser4 utility: # mkfs.reiser4 -U $VOL_ID -t $STRIPE /dev/vdb1 Currently only one meta-data brick per volume is supported, so it is recommended that the size of the block device for the meta-data brick is not too small. In most cases it will be enough if your meta-data brick is not smaller than 1/200 of the maximal volume size. For example, a 100G meta-data brick will be able to service a ~20T logical volume. Data and meta-data bricks don't differ from the standpoint of disk format, and there is no special option to inform the mkfs utility that we want to create exactly a meta-data brick: the first brick in the volume automatically becomes the meta-data brick, and the other bricks are interpreted as data bricks. Mount your initial logical volume consisting of one meta-data brick: # mount /dev/vdb1 /mnt Find a record about your volume in the output of the following command: # volume.reiser4 -l Create the configuration of your logical volume (its definition is above) and store it somewhere, but not on that volume! Your logical volume is now on-line and ready to use. You can perform regular file operations and volume operations (e.g. add a data brick to your LV). = Adding a data brick to LV = At any time you are able to add a data brick to your LV.
You can do it in parallel with regular file operations executing on this volume. Make sure, however, that your volume is not marked as a "volume with incomplete brick removal", and that no other volume operations on your volume are in progress. Otherwise your operation will fail with EBUSY. Obviously, adding a brick will increase the capacity of your volume. Choose a block device for the new data brick. Make sure that it is not too large or too small: the capacities of any 2 bricks of the same logical volume can not differ by more than 2^19 (~500,000) times. E.g. your logical volume can not contain both 1M and 2T bricks. Any attempt to add a brick of improper capacity will fail with an error. Format it with the same volume ID and stripe size as you used for the meta-data brick, but also specify the "-a" option (to not restrict data capacity): # mkfs.reiser4 -U $VOL_ID -t $STRIPE -a /dev/vdb2 It is important that the data brick is formatted with the same volume ID and stripe size as the meta-data brick of your logical volume. Otherwise, the operation of adding the data brick will fail. Update item #4 of your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration] with the UUID or name of the brick you want to add.
To add a brick, simply pass its name as an argument for the option "-a" and specify your LV via its mount point: # volume.reiser4 -a /dev/vdb2 /mnt By default the operation of adding a brick is fast and atomic and leaves the volume in an unbalanced state, so after adding a brick you might want to run the balancing procedure, which will move a portion of data to the new brick from the other bricks of the logical volume, making the data distribution on your volume fair: # volume.reiser4 -b /mnt The portion of data blocks moved during such rebalancing is equal to the relative capacity of the new brick, that is, to the portion of capacity that the new brick adds to the updated LV's capacity. This important property defines the cost of the balancing procedure: if the portion of capacity added by a brick is small, then the number of stripes moved during balancing is also small. Using this operation in conjunction with the option -B (--with-balance) will automatically trigger the balancing procedure: # volume.reiser4 -Ba /dev/vdb2 /mnt Upon successful completion update your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]. That is, increment (#2), add info about the new brick to (#3) and remove the record at (#4). When adding more than one brick at once, call volume.reiser4 with the option -a for each brick individually, in any order. It is reasonable not to complete each call with balancing; run balancing only after adding the last brick. = Removing a data brick from LV = At any time you are able to remove any data brick from your LV. You can do it in parallel with regular file operations executing on that volume. Make sure, however, that your volume is not marked as a "volume with incomplete brick removal", and that no other volume operations on your volume are in progress. Otherwise your operation will fail with EBUSY.
Obviously, the removal operation will decrease the abstract capacity of your LV. Note that the other bricks should have enough space to store all data blocks of the brick you want to remove; otherwise, the removal operation will return an error (ENOSPC). Suppose you want to remove brick /dev/vdb2 from your LV mounted at /mnt. Update your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration] with the UUID and name of the brick you want to remove (item #4). To remove a brick, simply pass its name as an argument for the option "-r" and specify the logical volume by its mount point: # volume.reiser4 -r /dev/vdb2 /mnt The procedure of brick removal starts by moving all data from the brick you want to remove to the other bricks of your volume, so that the resulting data distribution among the rest of the bricks will also be fair. The portion of data stripes moved during such migration is equal to the relative capacity of the brick to be removed (that is, to the portion of capacity that the brick added to the LV's capacity). Successful brick removal always leaves the volume in a balanced state. So, in contrast with the operation of adding a brick, removing a brick is a rather long operation, which can be interrupted for various reasons. In this case the volume will be marked as a "volume with incomplete brick removal". To check the removal status of your LV simply run # volume.reiser4 /mnt and check the field "health". To complete brick removal in the current mount session simply run # volume.reiser4 -R /mnt Note that the option -R (--finish-removal) doesn't accept any arguments. On success update your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]: remove the information about the brick /dev/vdb2 at #3 and #4.
Check your kernel logs: they should contain a message that brick /dev/vdb2 has been unregistered. Now the device /dev/vdb2 doesn't belong to the logical volume any more, and you can reuse it for other purposes (re-format, etc.). = Changing brick's capacity = At any time you can change the abstract capacity of any brick to any new non-zero value. You can do it in parallel with regular file operations executing on that volume. Make sure, however, that your volume is not marked as a "volume with incomplete brick removal", and that no other volume operations on your volume are in progress. Otherwise your operation will fail with EBUSY. Changing capacity always changes the volume partitioning and, therefore, breaks the fairness of data distribution on the volume. By default the operation of changing brick capacity leaves the volume in an unbalanced state, so after changing brick capacity you might want to run the balancing procedure to make data distribution on your volume fair. In particular, after increasing a brick's capacity the balancing procedure will move some data from other bricks to the brick whose capacity was increased. After decreasing a brick's capacity the balancing procedure will move some data from the brick whose capacity was decreased to other bricks. To change the abstract capacity of a brick /dev/vdb1 to a new value (e.g. 200000), simply run # volume.reiser4 -z /dev/vdb1 -c 200000 /mnt Pronounced as "resize brick /dev/vdb1 to new capacity 200000 in the volume mounted at /mnt". By default the operation of changing capacity is fast, atomic and leaves the volume in an unbalanced state. To automatically invoke balancing, use this operation in conjunction with the option -B (--with-balance). Also you can run the balancing procedure later at any time by executing # volume.reiser4 -b /mnt When changing capacities of more than one brick at once, call the volume.reiser4 utility for each brick individually, in any order. It is reasonable not to complete each call with balancing.
Run balancing after changing the capacity of the last brick. Comment. Changing a brick's capacity to 0 is undefined and will return an error. Consider the brick removal operation instead. = Operations with meta-data brick = The meta-data brick can also contain data stripes and participate in data distribution like the other data bricks. All the volume operations described above are also applicable to the meta-data brick. Note, however, that it is impossible to completely remove the meta-data brick from the logical volume for obvious reasons (meta-data needs to be stored somewhere), so the brick removal operation applied to the meta-data brick actually removes it only from the Data-Storage Array (DSA) - the subset of the LV consisting of the bricks participating in regular data distribution according to their abstract capacities. Once you remove the meta-data brick from the DSA, that brick will be used only to store meta-data. The operation of adding a brick, applied to a meta-data brick, returns it back to the DSA. Important: Reiser5 doesn't count busy data and meta-data blocks separately. So in contrast with data bricks (which contain only data) you are not able to find out the real space occupied by data blocks on the meta-data brick - Reiser5 knows only the total space occupied. To check the status of the meta-data brick of the volume mounted at /mnt simply run # volume.reiser4 -p0 /mnt and check the field "in DSA". = Unmounting a logical volume = To terminate a mount session just issue the usual umount against the mount point: # umount /mnt Note that after unmounting the volume all bricks by default remain registered in the system till system shutdown. If you want to unregister a brick before system shutdown, then simply issue the following command: # volume.reiser4 -u BRICK_NAME = Deploying a logical volume after correct unmount = After unmounting a logical volume all its bricks remain registered in the system.
So, if you want to mount the volume again, simply issue the mount command against one of its bricks. It is recommended to issue it against the meta-data brick. NOTE: Reiser5 will refuse to mount a logical volume in the case when a wrong (incomplete or redundant) set of bricks is registered in the system. A redundant set of bricks appears, for example, when you mistakenly register a brick that was earlier removed from the logical volume. = Deploying a logical volume after correct shutdown = First of all, check the [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing configuration] of your volume and make sure that all its bricks (data and meta-data ones) are registered in the system. The list of registered bricks can be printed by # volume.reiser4 -l Also make sure that the set of registered bricks of the volume doesn't contain bricks not mentioned in the [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]. Important: Reiser5 will refuse to mount a logical volume in the case when a wrong (incomplete or redundant) set of bricks is registered in the system. A redundant set of bricks appears, for example, when you mistakenly register a brick that was removed from the logical volume. For this reason we strongly recommend that the user keep track of their LV - store its [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing configuration] somewhere, but not on this volume! And don't forget to update that configuration after '''every''' volume operation.
If you lost the configuration of your LV and don't remember it (which is most likely for large volumes), then it will be rather painful to restore it: currently there are no tools to manage logical volumes off-line. So users are expected to do this on their own. It is not at all difficult. To register a brick in the system use the following command: # volume.reiser4 -g BRICK_NAME To print a list of all registered bricks use # volume.reiser4 -l Now mount your LV, simply issuing a mount(8) command against one of the bricks of your LV. We recommend to issue it against the meta-data brick. Comment. Reiser5 always tries to register the brick which is passed to the mount command as an argument, so it is not necessary to pre-register the brick you want to issue a mount command against. = Deploying a logical volume after hard reset or system crash = If no volume operations were interrupted by the hard reset or system crash, then just follow the instructions in this [https://reiser4.wiki.kernel.org/index.php?title=Logical_Volumes_Administration#Deploying_a_logical_volume_after_correct_shutdown section]. In Reiser5 only a restricted number of bricks participate in every transaction. The maximal number of such bricks can be specified by the user. At mount time a transaction replay procedure will be launched on each such brick independently, in parallel. Depending on the kind of interrupted volume operation, perform one of the following actions: == Volume balancing was interrupted == Check your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]. Register the complete set of bricks and mount the volume by issuing the mount command against one of its bricks. Check the balanced status of your LV by running # volume.reiser4 /mnt and checking the "balanced" value.
If the volume is unbalanced, then complete balancing by running # volume.reiser4 -b /mnt == Brick removal was interrupted == Check your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]. Register the new set of bricks (that is, the set of bricks without the brick you wanted to remove). Try to mount the volume. In the case of an error, also register the brick you wanted to remove and try to mount again. Check the status of your LV by running # volume.reiser4 /mnt and checking the value of "health". If required, complete brick removal by running # volume.reiser4 -R /mnt Note that the option -R doesn't accept any arguments. After successful removal completion the brick will be automatically removed from the volume and unregistered. Verify this by checking the status of your LV and the list of registered bricks: # volume.reiser4 /mnt # volume.reiser4 -l Upon successful completion update your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration] accordingly.
= LV monitoring = Common info about the LV mounted at /mnt: # volume.reiser4 /mnt
ID: Volume UUID
volume: ID of the plugin managing the volume
distribution: ID of the distribution plugin
stripe: Stripe size in bytes
segments: Number of hash space segments (for distribution)
bricks total: Total number of bricks in the volume
bricks in DSA: Number of bricks participating in data distribution
balanced: Balanced status of the volume
health: Brick removal completion status
Info about any of its bricks, of index J: # volume.reiser4 -p J /mnt
internal ID: Brick's "internal ID" and its status in the volume
external ID: Brick's UUID
device name: Name of the block device associated with the brick
block count: Size of the block device in blocks
blocks used: Total number of occupied blocks on the device
system blocks: Minimal possible number of busy blocks on that device
data capacity: Abstract capacity of the brick
space usage: Portion of occupied blocks on the device
in DSA: Participation in regular data distribution
is proxy: Participation in data tiering (Burst Buffers, etc.)
Comment. When retrieving a brick's info make sure that no volume operations on that volume are in progress. Otherwise the command above will return an error (EBUSY). WARNING. Brick info obtained this way is not necessarily the most recent. To get up-to-date info, run sync(1) and make sure that no regular file operations are in progress. = Checking free space = To check the number of available free blocks on a volume mounted at /mnt, make sure that no regular file operations, as well as no volume operations, are in progress on that volume, then run # sync # df --block-size=4K /mnt To check the number of free blocks on the brick of index J run # volume.reiser4 -p J /mnt and then calculate the difference between "block count" and "blocks used". Comment. Not all free blocks on a brick/volume are available for use.
Number of available free blocks is always ~95% of total number of free blocks (Reiser4 reserves 5% to make sure that regular file truncate operations won't fail). NOTE: volume.reiser4 shows total number of free blocks, whereas df(1) shows number of available free blocks. "Space usage" statistics shows a portion of busy blocks on individual brick. For the reasons explained above "space usage" on any brick can not be more than 0.95 = Checking quality of data distribution = Quality of data distribution is a measure of deviation of the real data space usage from the ideal one defined by volume partitioning. The smaller the deviation, the better the distribution quality. Checking quality of distribution makes sense only in the case when your volume partitioning is space-based, or if it coincides with the space-based one. If your partitioning is throughput-based, and it doesn't coincide with the space-based one, then quality of actual data distribution can be rather bad, as in this case the file system is worried for low-performance devices to not become a bottleneck, and effective space usage in this case is not a high priority. Checking quality of data distribution is based on the free blocks accounting, provided by the file system. Note that file system doesn't count busy data and meta-data blocks separately, so you are not able to find real data space usage, and hence to check quality of distribution in the case when meta-data brick contains data blocks. To check quality of distribution * make sure that meta-data brick doesn't contain data blocks; * make sure that no regular file and volume operations are currently in progress; * find "blocks used", "system blocks" and "data capacity" statistics for each data brick: # sync # volume.reiser4 -p 1 /mnt ... # volume.reiser4 -p N /mnt * find real data space usage on each brick; * calculate partitioning and ideal data space usage on each data brick; * find deviation of (4) from (5). Example. 
Let' build a LV of 3 bricks (one 10G meta-data brick sdb1, and two data bricks: sdc1 (10G), sdd1(5G)) with space-based partitioning: # VOL_ID=`uuid -v4` # echo "Using uuid $VOL_ID" # mkfs.reiser4 -U $VOL_ID -y -t 256K /dev/vdb1 # mkfs.reiser4 -U $VOL_ID -y -a -t 256K /dev/vdc1 # mkfs.reiser4 -U $VOL_ID -y -a -t 256K /dev/vdd1 # mount /dev/vdb1 /mnt Fill the meta-data brick with data: # dd if=/dev/zero of=/mnt/myfile bs=256K No space left on device... Add data-bricks /dev/sdc1 and dev/sdd1 to the volume: # volume.reiser4 -a /dev/vdc1 /mnt # volume.reiser4 -a /dev/vdd1 /mnt Move all data blocks to the newly added bricks: # volume.reiser4 -r /dev/vdb1 /mnt # sync Now meta-data brick doesn't contain data blocks (only meta-data ones), so that we can calculate quality of data distribution # volume.reiser4 /mnt -p0 blocks used: 503 # volume.reiser4 /mnt -p1 blocks used: 1657203 system blocks: 115 data capacity: 2621069 # volume.reiser4 /mnt -p2 blocks used: 833001 system blocks: 73 data capacity: 1310391 Basing on the statistics above calculate quality of distribution. Total data capacity of the volume: C = 2621069 + 1310391 = 3931460 Relative capacities of data bricks: C1 = 2621069 /(2621069 + 1310391) = 0.6667 C2 = 1310464 /(2621069 + 1310391) = 0.3333 Real space usage on data bricks (blocks used - system blocks): R1 = 1657203 - 115 = 1657088 R2 = 833001 - 73 = 832928 Space usage on the volume: R = R1 + R2 = 1657088 + 832928 = 2490016 Ideal data space usage on data bricks: I1 = C1 * R = 0.6667 * 2490016 = 1660094 I2 = C2 * R = 0.3333 * 2490016 = 829922 Deviation: D = (R1, R2) - (I1, I2) = (3006, -3006) Relative deviation: D/R = (-0.0012, 0.0012) Quality of distribution: Q = 1 - max(|D1|, |D1|) = 1 - 0.0012 = 0.9988 Comment. For any specified number of bricks N and quality of distribution Q it is possible to find a configuration of a logical volume composed of N bricks, so that quality of distribution on that volume will be better than Q. Comment. 
Quality of distribution Q doesn't depend on the number of bricks in the logical volume. This is a theorem, which can be strictly proven. = FAQ = '''What happens if I lose a brick (due to a breakdown, etc) of my logical volume?''' If you lose a meta-data brick, then you lose the whole volume. If you loose a data brick, then you'll be able to mount the volume, but bodies of some your regular files will become "punched" in random places. Portion of such files depends on the relative capacity of the lost brick, on the number of bricks in the logical volume, and on other factors. Fsck will be able to detect and remove such files with corrupted bodies. Nevertheless, we recommend to consider mirroring your bricks (e.g. by software, or hardware RAID-1) to avoid such highly unpleasant situations. '''I am not able to complete brick removal because of "no space left on device". What should I do?''' At the very beginning of removal operation the file system calculates precise amount of space on other bricks, needed to accommodate all the data of the brick you are going to remove. It could happen, however that free space on the bricks was near the calculated amount, and during the removal, exceeded it due to intensive regular file operations (caused e.g. by torrents activity) going in parallel. So, to complete the operation consider removing a part of regular files on your partition. In the future it will be possible to add a brick to a volume marked as a "volume with incomplete brick removal". It will also allow to complete removal operation. [[category:Reiser4]] b667d8de4462ea2a5e9fd7a50f9a724d46bb57b5 4459 4458 2022-07-12T20:07:39Z Edward 4 Add warning about fsck '''WARNING! FSCK doesn't work correctly on volumes with scaling out abilities. Any attempts to repair such volumes will result in data loss''' Before working with logical volumes you need to understand some basic [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Background principles]. 
A logical volume (LV) can be composed of any number of block devices that differ in physical and geometric parameters. However, the optimal configuration (true parallelism) imposes some restrictions and dependencies on the sizes of those devices.

WARNING: This code is not yet stable. Don't put important data on logical volumes managed by software with release number 5.X.Y. Also, don't mount your old partitions with kernels containing Reiser4 of SFRN 5.X.Y before it stabilizes.

IMPORTANT: Currently there are no tools to manage Reiser5 logical volumes off-line, so it is strongly recommended to save and update the configuration of your LV in a file that does not belong to that volume.

= Basic definitions. Volume configuration. Brick's capacity. Partitioning. Fair distribution. Balancing =

The basic configuration of a logical volume consists of the following information:

1) Volume UUID;
2) Number of bricks in the volume;
3) List of brick names or UUIDs in the volume;
4) UUID or name of the brick being added/removed (if any). That brick is not counted in (2) and (3).

Item #4 is used to handle operations interrupted for various reasons (system crash, hard reset, etc.) when bringing logical volumes on-line. For each volume, its configuration should be stored somewhere (but not on that volume!) and properly updated before and after each volume operation performed on that volume. The user is responsible for this. The volume configuration is needed to facilitate deploying the volume.

'''Capacity of a brick''' (or abstract capacity) is a positive integer. Capacity is a brick property defined by the user. Don't confuse it with the size of the block device. Think of it as the brick's "weight" in some units; it is the user who decides which property of the brick to assign as its abstract capacity, and in which units.
In particular, it can be the size of the block device in kilobytes, its size in megabytes, its throughput in MB/s, or any other geometric or physical parameter of the device associated with the brick. It is important that the capacities of all bricks of the same logical volume are measured in the same units. Also, it would be pointless to assign different properties as abstract capacities to bricks of the same LV - for example, the size of the block device for one brick and the disk bandwidth for another.

The capacity of each brick gets initialized by the mkfs utility. By default it is calculated as the number of free blocks on the device at the very end of the formatting procedure. For the meta-data brick it is calculated as 70% of that amount. The capacity of any brick can be changed on-line by the user.

'''Capacity of a logical volume''' is defined as the sum of the capacities of its component bricks.

'''Relative capacity of a brick''' is the ratio of the brick's capacity to the volume's capacity. Relative capacity defines the portion of IO requests that will be issued against that brick. The array of relative capacities (C1, C2, ...) of all bricks is called the volume partitioning. Obviously, C1 + C2 + ... = 1.

'''Real data space usage''' on a brick is the number of data blocks stored on that brick.

'''Ideal data space usage''' on a brick is defined as T*C, where T is the total number of data blocks stored in the volume and C is the relative capacity of the brick.

It is recommended to compose volumes so that the space-based partitioning coincides with the throughput-based one - that is the optimal volume configuration, which provides true parallelism. If that is impossible for some reason, choose a preferred partitioning method (space-based or throughput-based). Note that space-based partitioning saves volume space, whereas throughput-based partitioning saves volume throughput. When performing regular file operations, Reiser5 distributes data stripes throughout the volume evenly and fairly.
This means that the portion of IO requests issued against each brick is equal to its relative capacity, that is, to the portion of capacity that the brick adds to the total volume capacity. In contrast with regular file operations, volume operations break the fairness of data distribution on your logical volume. To restore fairness of distribution, a special balancing procedure should be run on the volume. For example, after adding a brick to a logical volume, the balancing procedure will populate the new brick with data moved from other bricks.

All volume operations except brick removal are fast, atomic, and leave the volume in an unbalanced state. The brick removal operation always includes balancing, which moves data from the brick you want to remove to the other bricks of the volume. If that data migration is interrupted for some reason, the volume is marked as a "volume with incomplete brick removal".

It is allowed to perform regular file and volume operations on an unbalanced LV (assuming it is not a volume with incomplete removal). However, in this case we don't guarantee a good quality of data distribution on your LV. In addition, on a volume with incomplete removal you won't be able to perform regular volume operations - first you will need to complete the removal by running a special removal completion procedure on your volume.

= Prepare Software and Hardware =

Build, install, and boot a kernel with Reiser4 of software framework release number 5.X.Y. Kernel patches can be found [https://sourceforge.net/projects/reiser4/files/v5-unstable/ here]. When building the kernel, make sure that the Reiser4-specific configuration option "Enable Plan-A key allocation scheme" is '''disabled''', or check that the .config file contains the following line:

 # CONFIG_REISER4_OLD is not set

Note that the Linux kernel and GNU utilities still recognize this testing code as "Reiser4".
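The .config check above can be scripted. The helper below is our own convenience wrapper, not part of reiser4progs:

```shell
#!/bin/sh
# Check a kernel .config for the Plan-A key allocation option
# (CONFIG_REISER4_OLD must NOT be enabled). Pass the path to the
# .config file; defaults to ./.config. Illustrative helper only.
CFG=${1:-.config}
if grep -q '^CONFIG_REISER4_OLD=y' "$CFG"; then
    echo "CONFIG_REISER4_OLD is enabled - disable it and rebuild"
    exit 1
fi
echo "CONFIG_REISER4_OLD is disabled - OK"
```

Run it from the kernel build tree, or pass the path to the .config explicitly.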
Make sure the following message appears in the kernel logs: "Loading Reiser4 (Software Framework Release: 5.X.Y)".

Build and install the latest [https://sourceforge.net/projects/reiser4/files/reiser4-utils/libaal/ libaal]. Download, build, and install the latest version 2.A.B of the [https://sourceforge.net/projects/reiser4/files/v5-unstable/ Reiser4progs package]. Make sure that the utility for managing logical volumes is installed on your machine (as a part of reiser4progs):

 # volume.reiser4 -?

= Creating a logical volume =

Start by choosing a unique ID (UUID) for your volume. By default it is generated by the mkfs utility. However, the user can generate it with appropriate tools (e.g. uuidgen(1)) and store it in an environment variable for convenience:

 # VOL_ID=`uuidgen`
 # echo "Using uuid $VOL_ID"

Choose a stripe size for your logical volume. For a good quality of distribution it is recommended that the stripe not exceed 1/10000 of the volume size. On the other hand, too small a stripe will increase space consumption on your meta-data brick. In our example we choose a stripe size of 512K:

 # STRIPE=512K
 # echo "Using stripe size $STRIPE"

Now create the first brick of your volume - the meta-data brick - passing the volume ID and stripe size to the mkfs.reiser4 utility:

 # mkfs.reiser4 -U $VOL_ID -t $STRIPE /dev/vdb1

Currently only one meta-data brick per volume is supported, so it is recommended that the block device for the meta-data brick not be too small. In most cases it will be enough if your meta-data brick is not smaller than 1/200 of the maximal volume size. For example, a 100G meta-data brick will be able to service a ~20T logical volume. Data and meta-data bricks don't differ from the standpoint of disk format, and there is no special option to tell the mkfs utility that we want to create a meta-data brick: the first brick in the volume automatically becomes the meta-data brick, and the other bricks are interpreted as data bricks.
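The two sizing rules of thumb above (stripe no larger than 1/10000 of the volume size, meta-data brick no smaller than 1/200 of the maximal volume size) can be turned into a small calculator. The script below is our own illustration, not a reiser4progs tool, and assumes you express the planned volume size in GiB:

```shell
#!/bin/sh
# Print recommended limits from the rules of thumb above, for a planned
# volume size given in GiB (illustrative helper, not part of reiser4progs).
VOL_GIB=${1:-20000}   # default: a ~20T volume, as in the example above
awk -v g="$VOL_GIB" 'BEGIN {
    # stripe should not exceed 1/10000 of the volume size
    printf "max recommended stripe size:  %.0f MiB\n", g * 1024 / 10000
    # meta-data brick should not be smaller than 1/200 of the volume size
    printf "min meta-data brick size:     %.0f GiB\n", g / 200
}'
```

For the default ~20T volume this prints a 2048 MiB stripe ceiling and a 100 GiB meta-data brick floor, matching the 100G/20T example in the text.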
Mount your initial logical volume consisting of one meta-data brick:

 # mount /dev/vdb1 /mnt

Find a record about your volume in the output of the following command:

 # volume.reiser4 -l

Create the configuration of your logical volume (its definition is above) and store it somewhere - but not on that volume! Your logical volume is now on-line and ready to use. You can perform regular file operations and volume operations (e.g. add a data brick to your LV).

= Adding a data brick to LV =

At any time you can add a data brick to your LV. You can do it in parallel with regular file operations executing on the volume. Make sure, however, that your volume is not marked as a "volume with incomplete brick removal" and that no other volume operation on your volume is in progress; otherwise the operation will fail with EBUSY. Obviously, adding a brick will increase the capacity of your volume.

Choose a block device for the new data brick. Make sure that it is neither too large nor too small: the capacities of any two bricks of the same logical volume cannot differ by more than 2^19 (~500 thousand) times. E.g. your logical volume cannot contain both 1M and 2T bricks. Any attempt to add a brick of improper capacity will fail with an error. Format the device with the same volume ID and stripe size as you used for the meta-data brick, but also specify the "-a" option (to not restrict data capacity):

 # mkfs.reiser4 -U $VOL_ID -t $STRIPE -a /dev/vdb2

It is important that a data brick is formatted with the same volume ID and stripe size as the meta-data brick of your logical volume; otherwise the operation of adding the data brick will fail. Update item #4 of your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration] with the UUID or name of the brick you want to add.
To add a brick, simply pass its name as the argument of option "-a" and specify your LV by its mount point:

 # volume.reiser4 -a /dev/vdb2 /mnt

By default the operation of adding a brick is fast and atomic and leaves the volume in an unbalanced state, so after adding a brick you might want to run the balancing procedure, which will move a portion of data from the other bricks of the logical volume to the new brick, making the data distribution on your volume fair:

 # volume.reiser4 -b /mnt

The portion of data blocks moved during such rebalancing is equal to the relative capacity of the new brick, that is, to the portion of capacity that the new brick adds to the updated LV capacity. This important property defines the cost of the balancing procedure: if the portion of capacity added by a brick is small, then the number of stripes moved during balancing is also small. Using the add operation in conjunction with the option -B (--with-balance) will trigger the balancing procedure automatically:

 # volume.reiser4 -Ba /dev/vdb2 /mnt

Upon successful completion update your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]: increment (#2), add info about the new brick to (#3), and remove the record at (#4). When adding more than one brick at once, call volume.reiser4 with option -a for each brick individually, in any order. It is reasonable not to complete each call with balancing; run balancing only after adding the last brick.

= Removing a data brick from LV =

At any time you can remove any data brick from your LV. You can do it in parallel with regular file operations executing on that volume. Make sure, however, that your volume is not marked as a "volume with incomplete brick removal" and that no other volume operation on your volume is in progress; otherwise the operation will fail with EBUSY.
Obviously, the removal operation will decrease the abstract capacity of your LV. Note that the other bricks must have enough space to store all data blocks of the brick you want to remove; otherwise the removal operation will return an error (ENOSPC).

Suppose you want to remove brick /dev/vdb2 from your LV mounted at /mnt. Update your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration] with the UUID and name of the brick you want to remove (item #4). To remove the brick, simply pass its name as the argument of option "-r" and specify the logical volume by its mount point:

 # volume.reiser4 -r /dev/vdb2 /mnt

The brick removal procedure starts by moving all data from the brick you want to remove to the other bricks of your volume, so that the resulting data distribution among the remaining bricks is also fair. The portion of data stripes moved during this migration is equal to the relative capacity of the brick being removed (that is, to the portion of capacity that the brick added to the LV capacity). Successful brick removal always leaves the volume in a balanced state. So, in contrast with the operation of adding a brick, removing a brick is a rather long operation, which can be interrupted for various reasons. In that case the volume will be marked as a "volume with incomplete brick removal". To check the removal status of your LV, simply run

 # volume.reiser4 /mnt

and check the field "health". To complete brick removal in the current mount session, simply run

 # volume.reiser4 -R /mnt

Note that the option -R (--finish-removal) doesn't accept any arguments. On success, update your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]: remove the information about the brick /dev/vdb2 at #3 and #4.
Check your kernel logs: they should contain a message that brick /dev/vdb2 has been unregistered. The device /dev/vdb2 no longer belongs to the logical volume, and you can reuse it for other purposes (re-format it, etc).

= Changing brick's capacity =

At any time you can change the abstract capacity of any brick to some new non-zero value. You can do it in parallel with regular file operations executing on that volume. Make sure, however, that your volume is not marked as a "volume with incomplete brick removal" and that no other volume operation on your volume is in progress; otherwise the operation will fail with EBUSY.

Changing capacity always changes the volume partitioning and therefore breaks the fairness of data distribution on the volume. By default the operation of changing brick capacity leaves the volume in an unbalanced state, so after changing a brick's capacity you might want to run the balancing procedure to make data distribution on your volume fair. In particular, after increasing a brick's capacity the balancing procedure will move some data from the other bricks to the brick whose capacity was increased. After decreasing a brick's capacity the balancing procedure will move some data from the brick whose capacity was decreased to the other bricks.

To change the abstract capacity of brick /dev/vdb1 to a new value (e.g. 200000), simply run

 # volume.reiser4 -z /dev/vdb1 -c 200000 /mnt

pronounced as "resize brick /dev/vdb1 to new capacity 200000 in the volume mounted at /mnt". By default the operation of changing capacity is fast, atomic, and leaves the volume in an unbalanced state. To invoke balancing automatically, use this operation in conjunction with the option -B (--with-balance). You can also run the balancing procedure later at any time by executing

 # volume.reiser4 -b /mnt

When changing the capacities of more than one brick at once, call the volume.reiser4 utility for each brick individually, in any order. It is reasonable not to complete each call with balancing.
Run balancing only after changing the capacity of the last brick.

Comment. Changing a brick's capacity to 0 is undefined and will return an error. Consider the brick removal operation instead.

= Operations with meta-data brick =

The meta-data brick can also contain data stripes and participate in data distribution like the other data bricks. All the volume operations described above are also applicable to the meta-data brick. Note, however, that it is impossible to completely remove the meta-data brick from the logical volume, for obvious reasons (meta-data need to be stored somewhere), so the brick removal operation applied to the meta-data brick actually removes it only from the Data Storage Array (DSA) - the subset of the LV consisting of bricks participating in regular data distribution according to their abstract capacities. Once you remove the meta-data brick from the DSA, that brick will be used only to store meta-data. The operation of adding a brick, applied to the meta-data brick, returns it back to the DSA.

Important: Reiser5 doesn't count busy data and meta-data blocks separately. So, in contrast with data bricks (which contain only data), you cannot find out the real space occupied by data blocks on the meta-data brick - Reiser5 knows only the total space occupied.

To check the status of the meta-data brick of the volume mounted at /mnt, simply run

 # volume.reiser4 -p0 /mnt

and check the field "in DSA".

= Unmounting a logical volume =

To terminate a mount session, just issue the usual umount against the mount point:

 # umount /mnt

Note that after unmounting the volume all bricks by default remain registered in the system until system shutdown. If you want to unregister a brick before system shutdown, simply issue the following command:

 # volume.reiser4 -u BRICK_NAME

= Deploying a logical volume after correct unmount =

After unmounting a logical volume all its bricks remain registered in the system.
So, if you want to mount the volume again, simply issue the mount command against any of its bricks. It is recommended to issue it against the meta-data brick.

NOTE: Reiser5 will refuse to mount a logical volume when a wrong (incomplete or redundant) set of bricks is registered in the system. A redundant set of bricks appears, for example, when you mistakenly register a brick that was earlier removed from the logical volume.

= Deploying a logical volume after correct shutdown =

First of all, check the [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing configuration] of your volume and make sure that all its bricks (data and meta-data) are registered in the system. The list of registered bricks can be printed with

 # volume.reiser4 -l

Also make sure that the set of registered bricks doesn't contain bricks not mentioned in the [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]. Important: Reiser5 will refuse to mount a logical volume when a wrong (incomplete or redundant) set of bricks is registered in the system. A redundant set of bricks appears, for example, when you mistakenly register a brick that was removed from the logical volume. For this reason we strongly recommend that users keep track of their LV - store its [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing configuration] somewhere, but not on the volume itself! And don't forget to update that configuration after '''every''' volume operation.
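Since no tool dictates a format for this stored configuration, any plain-text record will do. A possible shape, mirroring items (1)-(4) of the configuration definition (the field names and file location are our own invention; no Reiser4 tool reads or writes this file):

```shell
#!/bin/sh
# Keep the LV configuration as a plain text file OFF the volume itself.
# The four fields mirror items (1)-(4) of the configuration definition;
# the format is our own convention, not read by any Reiser4 tool.
CONF="${TMPDIR:-/tmp}/reiser5-lv.conf"
cat > "$CONF" <<'EOF'
volume_uuid=REPLACE-WITH-YOUR-UUID
brick_count=3
bricks=/dev/vdb1,/dev/vdc1,/dev/vdd1
pending_brick=
EOF
# Sanity check: the record should contain exactly 4 fields.
grep -c '^[a-z_]*=' "$CONF"
```

Set pending_brick before starting an add/remove operation and clear it after the operation completes, so that the file always reflects what a crash could have interrupted.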
If you have lost the configuration of your LV and don't remember it (which is most likely for large volumes), it will be rather painful to restore: currently there are no tools to manage logical volumes off-line, so users have to do this on their own. It is not at all difficult.

To register a brick in the system, use the following command:

 # volume.reiser4 -g BRICK_NAME

To print a list of all registered bricks, use

 # volume.reiser4 -l

Now mount your LV by simply issuing a mount(8) command against one of its bricks. We recommend issuing it against the meta-data brick.

Comment. Reiser5 always tries to register the brick passed to the mount command as an argument, so it is not necessary to preregister the brick you issue the mount command against.

= Deploying a logical volume after hard reset or system crash =

If no volume operations were interrupted by the hard reset or system crash, just follow the instructions in this [https://reiser4.wiki.kernel.org/index.php?title=Logical_Volumes_Administration#Deploying_a_logical_volume_after_correct_shutdown section]. In Reiser5 only a restricted number of bricks participates in each transaction; the maximal number of such bricks can be specified by the user. At mount time a transaction replay procedure will be launched on each such brick independently, in parallel. Depending on the kind of interrupted volume operation, perform one of the following actions:

== Volume balancing was interrupted ==

Check your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]. Register the complete set of bricks and mount the volume by issuing the mount command against one of its bricks. Check the balanced status of your LV by running

 # volume.reiser4 /mnt

and checking the "balanced" value.
If the volume is unbalanced, complete balancing by running

 # volume.reiser4 -b /mnt

== Brick removal was interrupted ==

Check your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]. Register the new set of bricks (that is, the set of bricks without the brick you wanted to remove) and try to mount the volume. In case of error, register also the brick you wanted to remove and try to mount again. Check the status of your LV by running

 # volume.reiser4 /mnt

and checking the value of "health". If required, complete brick removal by running

 # volume.reiser4 -R /mnt

Note that the option -R doesn't accept any arguments. After successful removal completion the brick will be automatically removed from the volume and unregistered. Verify this by checking the status of your LV and the list of registered bricks:

 # volume.reiser4 /mnt
 # volume.reiser4 -l

Upon successful completion update your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration] accordingly.
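When scripting crash recovery, the two status checks above can be combined. The sketch below reads `volume.reiser4 /mnt` output from stdin and suggests the completion command. The field names ("balanced", "health") come from the monitoring section below, but the exact value strings are assumptions about reiser4progs output, so adjust the patterns to match your version:

```shell
#!/bin/sh
# Suggest a completion step, e.g.: volume.reiser4 /mnt | ./this-script
# Assumed input format: "balanced: <Yes|No>" and "health: <status>" lines;
# these value strings are assumptions, not documented reiser4progs output.
awk -v mnt=/mnt '
    tolower($1) ~ /^balanced/ { bal = tolower($2) }
    tolower($1) ~ /^health/   { sub(/^[^:]*:[ \t]*/, ""); health = tolower($0) }
    END {
        if (health ~ /incomplete/)  print "run: volume.reiser4 -R " mnt
        else if (bal !~ /yes|ok/)   print "run: volume.reiser4 -b " mnt
        else                        print "volume looks balanced and healthy"
    }'
```

An incomplete removal takes priority over an unbalanced state, matching the rule that removal must be completed before other volume operations.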
= LV monitoring =

Common info about the LV mounted at /mnt:

 # volume.reiser4 /mnt

 ID:             Volume UUID
 volume:         ID of the plugin managing the volume
 distribution:   ID of the distribution plugin
 stripe:         Stripe size in bytes
 segments:       Number of hash space segments (for distribution)
 bricks total:   Total number of bricks in the volume
 bricks in DSA:  Number of bricks participating in data distribution
 balanced:       Balanced status of the volume
 health:         Brick removal completion status

Info about its brick of index J:

 # volume.reiser4 -p J /mnt

 internal ID:    Brick's "internal ID" and its status in the volume
 external ID:    Brick's UUID
 device name:    Name of the block device associated with the brick
 block count:    Size of the block device in blocks
 blocks used:    Total number of occupied blocks on the device
 system blocks:  Minimal possible number of busy blocks on the device
 data capacity:  Abstract capacity of the brick
 space usage:    Portion of occupied blocks on the device
 in DSA:         Participation in regular data distribution
 is proxy:       Participation in data tiering (Burst Buffers, etc)

Comment. When retrieving brick info, make sure that no volume operations on that volume are in progress; otherwise the command above will return an error (EBUSY).

WARNING. Brick info obtained this way is not necessarily the most recent. To get up-to-date info, run sync(1) and make sure that no regular file operations are in progress.

= Checking free space =

To check the number of available free blocks on a volume mounted at /mnt, make sure that no regular file operations or volume operations are in progress on that volume, then run

 # sync
 # df --block-size=4K /mnt

To check the number of free blocks on the brick of index J, run

 # volume.reiser4 -p J /mnt

then calculate the difference between "block count" and "blocks used".

Comment. Not all free blocks on a brick/volume are available for use.
Number of available free blocks is always ~95% of total number of free blocks (Reiser4 reserves 5% to make sure that regular file truncate operations won't fail). NOTE: volume.reiser4 shows total number of free blocks, whereas df(1) shows number of available free blocks. "Space usage" statistics shows a portion of busy blocks on individual brick. For the reasons explained above "space usage" on any brick can not be more than 0.95 = Checking quality of data distribution = Quality of data distribution is a measure of deviation of the real data space usage from the ideal one defined by volume partitioning. The smaller the deviation, the better the distribution quality. Checking quality of distribution makes sense only in the case when your volume partitioning is space-based, or if it coincides with the space-based one. If your partitioning is throughput-based, and it doesn't coincide with the space-based one, then quality of actual data distribution can be rather bad, as in this case the file system is worried for low-performance devices to not become a bottleneck, and effective space usage in this case is not a high priority. Checking quality of data distribution is based on the free blocks accounting, provided by the file system. Note that file system doesn't count busy data and meta-data blocks separately, so you are not able to find real data space usage, and hence to check quality of distribution in the case when meta-data brick contains data blocks. To check quality of distribution * make sure that meta-data brick doesn't contain data blocks; * make sure that no regular file and volume operations are currently in progress; * find "blocks used", "system blocks" and "data capacity" statistics for each data brick: # sync # volume.reiser4 -p 1 /mnt ... # volume.reiser4 -p N /mnt * find real data space usage on each brick; * calculate partitioning and ideal data space usage on each data brick; * find deviation of (4) from (5). Example. 
Let's build an LV of 3 bricks (one 10G meta-data brick vdb1, and two data bricks: vdc1 (10G) and vdd1 (5G)) with space-based partitioning:

 # VOL_ID=`uuid -v4`
 # echo "Using uuid $VOL_ID"
 # mkfs.reiser4 -U $VOL_ID -y -t 256K /dev/vdb1
 # mkfs.reiser4 -U $VOL_ID -y -a -t 256K /dev/vdc1
 # mkfs.reiser4 -U $VOL_ID -y -a -t 256K /dev/vdd1
 # mount /dev/vdb1 /mnt

Fill the meta-data brick with data:

 # dd if=/dev/zero of=/mnt/myfile bs=256K
 No space left on device...

Add data bricks /dev/vdc1 and /dev/vdd1 to the volume:

 # volume.reiser4 -a /dev/vdc1 /mnt
 # volume.reiser4 -a /dev/vdd1 /mnt

Move all data blocks to the newly added bricks:

 # volume.reiser4 -r /dev/vdb1 /mnt
 # sync

Now the meta-data brick contains no data blocks (only meta-data ones), so we can calculate the quality of data distribution:

 # volume.reiser4 /mnt -p0
 blocks used: 503
 # volume.reiser4 /mnt -p1
 blocks used: 1657203
 system blocks: 115
 data capacity: 2621069
 # volume.reiser4 /mnt -p2
 blocks used: 833001
 system blocks: 73
 data capacity: 1310391

Based on the statistics above, calculate the quality of distribution.

Total data capacity of the volume:
 C = 2621069 + 1310391 = 3931460
Relative capacities of the data bricks:
 C1 = 2621069 / 3931460 = 0.6667
 C2 = 1310391 / 3931460 = 0.3333
Real space usage on the data bricks (blocks used - system blocks):
 R1 = 1657203 - 115 = 1657088
 R2 = 833001 - 73 = 832928
Space usage on the volume:
 R = R1 + R2 = 1657088 + 832928 = 2490016
Ideal data space usage on the data bricks:
 I1 = C1 * R = 0.6667 * 2490016 = 1660094
 I2 = C2 * R = 0.3333 * 2490016 = 829922
Deviation:
 D = (R1, R2) - (I1, I2) = (-3006, 3006)
Relative deviation:
 D/R = (-0.0012, 0.0012)
Quality of distribution:
 Q = 1 - max(|D1/R|, |D2/R|) = 1 - 0.0012 = 0.9988

Comment. For any specified number of bricks N and quality of distribution Q it is possible to find a configuration of a logical volume composed of N bricks such that the quality of distribution on that volume is better than Q.
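The arithmetic of the example can be scripted. A minimal sketch with awk, using the "blocks used", "system blocks" and "data capacity" values from the volume.reiser4 output above (nothing here is a Reiser5 tool, it is just the quality formula restated):

```shell
# Compute distribution quality Q from per-brick statistics.
# Values below are taken from the worked example above.
Q=$(awk 'BEGIN {
    used[1]=1657203; sys[1]=115; cap[1]=2621069;
    used[2]=833001;  sys[2]=73;  cap[2]=1310391;
    for (i=1; i<=2; i++) { C += cap[i]; R += used[i]-sys[i] }
    for (i=1; i<=2; i++) {
        d = (used[i]-sys[i]) - cap[i]/C*R   # deviation Di = Ri - Ii
        if (d < 0) d = -d
        if (d > m) m = d
    }
    printf "%.4f", 1 - m/R                  # Q = 1 - max|Di|/R
}')
echo "Q = $Q"
```

With exact (unrounded) relative capacities this prints Q = 0.9988, matching the hand calculation.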
Quality of distribution Q doesn't depend on the number of bricks in the logical volume. This is a theorem, which can be strictly proven.

= FAQ =

'''What happens if I lose a brick (due to a breakdown, etc.) of my logical volume?'''

If you lose the meta-data brick, then you lose the whole volume. If you lose a data brick, then you will still be able to mount the volume, but the bodies of some of your regular files will be "punched" in random places. The portion of such files depends on the relative capacity of the lost brick, on the number of bricks in the logical volume, and on other factors. Fsck will be able to detect and remove files with corrupted bodies. Nevertheless, we recommend mirroring your bricks (e.g. by software or hardware RAID-1) to avoid such highly unpleasant situations.

'''I am not able to complete brick removal because of "no space left on device". What should I do?'''

At the very beginning of the removal operation the file system calculates the precise amount of space on the other bricks needed to accommodate all the data of the brick you are going to remove. It can happen, however, that the free space on those bricks was close to the calculated amount and was exceeded during the removal because of intensive regular file operations (caused e.g. by torrent activity) running in parallel. To complete the operation, consider removing some regular files from your volume. In the future it will be possible to add a brick to a volume marked as a "volume with incomplete brick removal", which will also allow the removal operation to complete.

[[category:Reiser4]]

Before working with logical volumes you need to understand some basic [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Background principles]. A logical volume (LV) can be composed of any number of block devices that differ in physical and geometric parameters.
However, the optimal configuration (true parallelism) imposes some restrictions and dependencies on the sizes of such devices.

WARNING: This software is not yet stable. Don't put important data on logical volumes managed by software of release number 5.X.Y. Also, don't mount your old partitions in kernels with Reiser4 of SFRN 5.X.Y before its stabilization.

IMPORTANT: Currently there are no tools to manage Reiser5 logical volumes off-line, so it is strongly recommended to save/update the configuration of your LV in a file which doesn't belong to that volume.

= Basic definitions. Volume configuration. Brick's capacity. Partitioning. Fair distribution. Balancing =

The basic configuration of a logical volume is the following information:

1) Volume UUID;
2) Number of bricks in the volume;
3) List of brick names or UUIDs in the volume;
4) UUID or name of the brick being added/removed (if any). That brick is not counted in (2) and (3).

Item #4 exists to handle incomplete operations, interrupted for various reasons (system crash, hard reset, etc.), when bringing logical volumes on-line. The configuration of each volume should be stored somewhere (but not on that volume!) and properly updated before and after each volume operation performed on it. We make the user responsible for this. The volume configuration is needed to facilitate deploying the volume.

'''Capacity of a brick''' (or abstract capacity) is a positive integer. Capacity is a brick property defined by the user. Don't confuse it with the size of the block device. Think of it as the brick's "weight" in some units. It is the user who decides which property of the brick to assign as its abstract capacity, and in which units. In particular, it can be the size of the block device in kilobytes, its size in megabytes, its throughput in MB/s, or any other geometric or physical parameter of the device associated with the brick. It is important that the capacities of all bricks of the same logical volume are measured in the same units.
Also, it would be pointless to assign different properties as abstract capacities for bricks of the same LV - for example, block device size for one brick and disk bandwidth for another.

The capacity of each brick is initialized by the mkfs utility. By default it is calculated as the number of free blocks on the device at the very end of the formatting procedure. For the meta-data brick it is calculated as 70% of that amount. The capacity of any brick can be changed on-line by the user.

'''Capacity of a logical volume''' is defined as the sum of the capacities of its component bricks.

'''Relative capacity of a brick''' is the ratio of the brick's capacity to the volume's capacity. Relative capacity defines the portion of IO requests that will be issued against that brick. The array of relative capacities (C1, C2, ...) of all bricks is called the volume partitioning. Obviously, C1 + C2 + ... = 1.

'''Real data space usage''' on a brick is the number of data blocks stored on that brick.

'''Ideal data space usage''' on a brick is defined as T*C, where T is the total number of data blocks stored in the volume and C is the relative capacity of the brick.

It is recommended to compose volumes so that the space-based partitioning coincides with the throughput-based one - this is the optimal volume configuration, which provides true parallelism. If that is impossible for some reason, choose a preferred partitioning method (space-based or throughput-based). Note that space-based partitioning saves volume space, whereas throughput-based partitioning saves volume throughput.

When performing regular file operations, Reiser5 distributes data stripes throughout the volume evenly and fairly. This means that the portion of IO requests issued against each brick is equal to its relative capacity, that is, to the portion of capacity that the brick contributes to the total volume capacity. In contrast with regular file operations, volume operations break the fairness of data distribution on your logical volume.
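The capacity arithmetic above (relative capacity Ci = capi/C, ideal usage Ii = Ci * T) can be illustrated with a small sketch; the three brick capacities and the block count T below are made-up numbers, not output of any Reiser5 tool:

```shell
# Hypothetical volume: three bricks with capacities 100, 50, 50
# (all in the same units), and T = 10000 data blocks in the volume.
OUT=$(awk 'BEGIN {
    cap[1]=100; cap[2]=50; cap[3]=50; T=10000;
    for (i=1; i<=3; i++) C += cap[i];        # volume capacity C
    for (i=1; i<=3; i++)
        printf "brick %d: relative capacity %.2f, ideal usage %d blocks\n",
               i, cap[i]/C, cap[i]/C * T;
}')
echo "$OUT"
```

So a brick holding half of the total capacity should ideally hold half of the data blocks; balancing is what restores this proportion after volume operations disturb it.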
To restore fairness of distribution, a special balancing procedure should be run on the volume. For example, after adding a brick to a logical volume, the balancing procedure will populate the new brick with data moved from other bricks.

All volume operations except brick removal are fast, atomic and leave the volume in an unbalanced state. The brick removal operation always includes balancing, which moves data from the brick you want to remove to the other bricks of the volume. If that data migration is interrupted for some reason, the volume is marked as a "volume with incomplete brick removal".

It is allowed to perform regular file and volume operations on an unbalanced LV (provided the imbalance is not due to an incomplete removal). However, in this case we don't guarantee good quality of data distribution on your LV. In addition, on a volume with an incomplete removal you won't be able to perform regular volume operations - first you will need to complete the removal by running a special removal completion procedure on your volume.

= Prepare Software and Hardware =

Build, install and boot a kernel with Reiser4 of software framework release number 5.X.Y. Kernel patches can be found [https://sourceforge.net/projects/reiser4/files/v5-unstable/ here]. When building the kernel, make sure that the Reiser4-specific configuration option "Enable Plan-A key allocation scheme" is '''disabled''', or check that the .config file contains the following line:

 # CONFIG_REISER4_OLD is not set

Note that the Linux kernel and GNU utilities still identify this testing software as "Reiser4". Make sure the kernel log contains the following message:

 "Loading Reiser4 (Software Framework Release: 5.X.Y)"

Build and install the latest [https://sourceforge.net/projects/reiser4/files/reiser4-utils/libaal/ libaal]. Download, build and install the latest version 2.A.B of the [https://sourceforge.net/projects/reiser4/files/v5-unstable/ Reiser4progs package].
Make sure that the utility for managing logical volumes is installed (as a part of the reiser4progs package) on your machine:

 # volume.reiser4 -?

= Creating a logical volume =

Start by choosing a unique ID (UUID) for your volume. By default it is generated by the mkfs utility. However, you can generate it yourself with a suitable tool (e.g. uuidgen(1)) and store it in an environment variable for convenience:

 # VOL_ID=`uuidgen`
 # echo "Using uuid $VOL_ID"

Choose a stripe size for your logical volume. For good distribution quality it is recommended that the stripe not exceed 1/10000 of the volume size. On the other hand, too small a stripe will increase space consumption on your meta-data brick. In our example we choose a stripe size of 512K:

 # STRIPE=512K
 # echo "Using stripe size $STRIPE"

Now create the first brick of your volume - the meta-data brick - passing the volume ID and stripe size to the mkfs.reiser4 utility:

 # mkfs.reiser4 -U $VOL_ID -t $STRIPE /dev/vdb1

Currently only one meta-data brick per volume is supported, so it is recommended that the block device for the meta-data brick is not too small. In most cases it will be enough if your meta-data brick is not smaller than 1/200 of the maximal volume size. For example, a 100G meta-data brick will be able to service a ~20T logical volume.

Data and meta-data bricks don't differ from the standpoint of disk format, and there is no special option to inform the mkfs utility that we want to create a meta-data brick: the first brick in the volume automatically becomes the meta-data brick, and the other bricks are interpreted as data bricks.

Mount your initial logical volume, consisting of one meta-data brick:

 # mount /dev/vdb1 /mnt

Find the record about your volume in the output of the following command:

 # volume.reiser4 -l

Create the configuration of your logical volume (its definition is above) and store it somewhere - but not on that volume! Your logical volume is now on-line and ready to use.
You can perform regular file operations and volume operations (e.g. add a data brick to your LV).

= Adding a data brick to LV =

At any time you can add a data brick to your LV. You can do it in parallel with regular file operations executing on the volume. Make sure, however, that your volume is not marked as a "volume with incomplete brick removal" and that no other volume operations are in progress on it. Otherwise the operation will fail with EBUSY. Obviously, adding a brick increases the capacity of your volume.

Choose a block device for the new data brick. Make sure that it is not too large or too small: the capacities of any two bricks of the same logical volume cannot differ by more than a factor of 2^19. E.g. your logical volume cannot contain both 1M and 2T bricks. Any attempt to add a brick of improper capacity will fail with an error.

Format it with the same volume ID and stripe size as you used for the meta-data brick, but also specify the "-a" option (to not restrict data capacity):

 # mkfs.reiser4 -U $VOL_ID -t $STRIPE -a /dev/vdb2

It is important that the data brick is formatted with the same volume ID and stripe size as the meta-data brick of your logical volume. Otherwise, the operation of adding the data brick will fail.

Update item #4 of your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration] with the UUID or name of the brick you want to add.
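Since Reiser5 leaves configuration bookkeeping to the user, a plain text file is enough. The sketch below is our own invention (file location and record format are hypothetical, not anything Reiser5 prescribes); it shows the item #4 update before the operation and the item #2/#3 update after it:

```shell
# Hypothetical bookkeeping file for the volume configuration,
# kept OUTSIDE the logical volume (here a temp file for illustration).
CONF=$(mktemp)

cat > "$CONF" <<EOF
uuid=<your-volume-uuid>
bricks=1
brick=/dev/vdb1
pending=
EOF

# Before the operation: record the brick being added as item #4.
sed -i 's|^pending=.*|pending=/dev/vdb2|' "$CONF"

# After the operation succeeds: count it in #2, list it in #3, clear #4.
sed -i -e 's|^bricks=1|bricks=2|' \
       -e 's|^pending=/dev/vdb2|brick=/dev/vdb2\npending=|' "$CONF"
```

Any format works as long as it records items #1-#4 and is updated before and after every volume operation.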
To add a brick, simply pass its name as the argument of the "-a" option and specify your LV by its mount point:

 # volume.reiser4 -a /dev/vdb2 /mnt

By default the operation of adding a brick is fast and atomic and leaves the volume in an unbalanced state, so after adding a brick you might want to run the balancing procedure, which will move a portion of data to the new brick from the other bricks of the logical volume, making data distribution on your volume fair:

 # volume.reiser4 -b /mnt

The portion of data blocks moved during such rebalancing is equal to the relative capacity of the new brick, that is, to the portion of capacity that the new brick adds to the updated LV's capacity. This important property defines the cost of the balancing procedure: if the portion of capacity added by a brick is small, then the number of stripes moved during balancing is also small.

Using the add operation in conjunction with the option -B (--with-balance) will trigger the balancing procedure automatically:

 # volume.reiser4 -Ba /dev/vdb2 /mnt

Upon successful completion update your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]: increment (#2), add info about the new brick to (#3) and remove the record at (#4).

When adding more than one brick, call volume.reiser4 with the option -a for each brick individually, in any order. It is reasonable not to follow each call with balancing; run balancing only after adding the last brick.

= Removing a data brick from LV =

At any time you can remove any data brick from your LV. You can do it in parallel with regular file operations executing on the volume. Make sure, however, that your volume is not marked as a "volume with incomplete brick removal" and that no other volume operations are in progress on it. Otherwise the operation will fail with EBUSY.
Obviously, the removal operation decreases the abstract capacity of your LV. Note that the other bricks must have enough space to store all the data blocks of the brick you want to remove; otherwise the removal operation will fail with ENOSPC.

Suppose you want to remove brick /dev/vdb2 from your LV mounted at /mnt. Update your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration] with the UUID and name of the brick you want to remove (item #4). To remove the brick, simply pass its name as the argument of the "-r" option and specify the logical volume by its mount point:

 # volume.reiser4 -r /dev/vdb2 /mnt

The brick removal procedure starts by moving all data from the brick you want to remove to the other bricks of your volume, so that the resulting data distribution among the remaining bricks is also fair. The portion of data stripes moved during this migration is equal to the relative capacity of the brick being removed (that is, to the portion of capacity that the brick added to the LV's capacity). Successful brick removal always leaves the volume in a balanced state.

So, in contrast with the operation of adding a brick, removing a brick is a rather long operation, which can be interrupted for various reasons. In that case the volume will be marked as a "volume with incomplete brick removal". To check the removal status of your LV, simply run:

 # volume.reiser4 /mnt

and check the field "health". To complete brick removal in the current mount session, simply run:

 # volume.reiser4 -R /mnt

Note that the option -R (--finish-removal) doesn't accept any arguments. On success, update your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]: remove the information about the brick /dev/vdb2 at #3 and #4.
Check your kernel logs: they should contain a message that brick /dev/vdb2 has been unregistered. Device /dev/vdb2 no longer belongs to the logical volume, and you can reuse it for other purposes (re-format, etc.).

= Changing brick's capacity =

At any time you can change the abstract capacity of any brick to some new non-zero value. You can do it in parallel with regular file operations executing on the volume. Make sure, however, that your volume is not marked as a "volume with incomplete brick removal" and that no other volume operations are in progress on it. Otherwise the operation will fail with EBUSY.

Changing capacity always changes the volume partitioning and therefore breaks the fairness of data distribution on the volume. By default the operation of changing brick capacity leaves the volume in an unbalanced state, so after changing a brick's capacity you might want to run the balancing procedure to make data distribution on your volume fair. In particular, after increasing a brick's capacity the balancing procedure will move some data from other bricks to the brick whose capacity was increased. After decreasing a brick's capacity the balancing procedure will move some data from the brick whose capacity was decreased to the other bricks.

To change the abstract capacity of brick /dev/vdb1 to a new value (e.g. 200000), simply run:

 # volume.reiser4 -z /dev/vdb1 -c 200000 /mnt

Pronounced as "resize brick /dev/vdb1 to new capacity 200000 in the volume mounted at /mnt".

By default the operation of changing capacity is fast, atomic and leaves the volume in an unbalanced state. To invoke balancing automatically, use the operation in conjunction with the option -B (--with-balance). You can also run the balancing procedure later at any time by executing:

 # volume.reiser4 -b /mnt

When changing the capacities of more than one brick, call the volume.reiser4 utility for each brick individually, in any order. It is reasonable not to follow each call with balancing.
Run balancing after changing the capacity of the last brick.

Comment. Changing a brick's capacity to 0 is undefined and will return an error. Consider the brick removal operation instead.

= Operations with meta-data brick =

The meta-data brick can also contain data stripes and participate in data distribution like the other data bricks. All the volume operations described above are also applicable to the meta-data brick. Note, however, that it is impossible to completely remove the meta-data brick from the logical volume, for obvious reasons (meta-data need to be stored somewhere), so the brick removal operation applied to the meta-data brick actually removes it only from the Data Storage Array (DSA) - the subset of the LV consisting of the bricks that participate in regular data distribution according to their abstract capacities. Once you remove the meta-data brick from the DSA, that brick will be used only to store meta-data. The operation of adding a brick, applied to the meta-data brick, returns it to the DSA.

Important: Reiser5 doesn't count busy data and meta-data blocks separately. So, in contrast with data bricks (which contain only data), you cannot find out the real space occupied by data blocks on the meta-data brick - Reiser5 knows only the total space occupied.

To check the status of the meta-data brick of the volume mounted at /mnt, simply run:

 # volume.reiser4 -p0 /mnt

and check the field "in DSA".

= Unmounting a logical volume =

To terminate a mount session, just issue the usual umount against the mount point:

 # umount /mnt

Note that after unmounting the volume all bricks by default remain registered in the system until system shutdown. If you want to unregister a brick before system shutdown, simply issue the following command:

 # volume.reiser4 -u BRICK_NAME

= Deploying a logical volume after correct unmount =

After unmounting a logical volume all its bricks remain registered in the system.
So, if you want to mount the volume again, simply issue the mount command against one of its bricks. It is recommended to issue it against the meta-data brick.

NOTE: Reiser5 will refuse to mount a logical volume when a wrong (incomplete or redundant) set of bricks is registered in the system. A redundant set of bricks appears, for example, when you mistakenly register a brick that was earlier removed from the logical volume.

= Deploying a logical volume after correct shutdown =

First of all, check the [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing configuration] of your volume and make sure that all its bricks (data and meta-data ones) are registered in the system. The list of registered bricks can be printed with:

 # volume.reiser4 -l

Also make sure that the set of bricks registered for the volume doesn't contain bricks not mentioned in the [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration].

Important: Reiser5 will refuse to mount a logical volume when a wrong (incomplete or redundant) set of bricks is registered in the system. A redundant set of bricks appears, for example, when you mistakenly register a brick that was removed from the logical volume.

For these reasons we strongly recommend that you keep track of your LV: store its [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing configuration] somewhere - but not on that volume! And don't forget to update that configuration after '''every''' volume operation.
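Comparing the stored configuration against the registered set is an ordinary list diff. A sketch with comm(1), assuming a hypothetical one-brick-name-per-line configuration file and a similarly formatted list of registered bricks (the real volume.reiser4 -l output would first need to be filtered down to brick names; the lists below are written by hand for illustration):

```shell
# Two sorted lists: bricks per the stored configuration, and bricks
# currently registered in the system (illustrative values).
printf '%s\n' /dev/vdb1 /dev/vdb2 | sort > /tmp/conf-bricks
printf '%s\n' /dev/vdb1 /dev/vdb2 /dev/vdb3 | sort > /tmp/registered-bricks

# comm -23: only in the config  -> missing (must be registered)
# comm -13: only registered     -> redundant (must be unregistered)
MISSING=$(comm -23 /tmp/conf-bricks /tmp/registered-bricks)
REDUNDANT=$(comm -13 /tmp/conf-bricks /tmp/registered-bricks)
echo "missing: $MISSING"
echo "redundant: $REDUNDANT"
```

Here /dev/vdb3 shows up as redundant, i.e. exactly the situation that makes Reiser5 refuse the mount; unregister it before mounting.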
If you have lost the configuration of your LV and don't remember it (which is likely for large volumes), it will be rather painful to restore: currently there are no tools to manage logical volumes off-line, so you will have to do this on your own. It is not at all difficult.

To register a brick in the system, use the following command:

 # volume.reiser4 -g BRICK_NAME

To print the list of all registered bricks, use:

 # volume.reiser4 -l

Now mount your LV by simply issuing a mount(8) command against one of its bricks. We recommend issuing it against the meta-data brick.

Comment. Reiser5 always tries to register the brick which is passed to the mount command as an argument, so it is not necessary to pre-register the brick you are going to mount.

= Deploying a logical volume after hard reset or system crash =

If no volume operations were interrupted by the hard reset or system crash, just follow the instructions in this [https://reiser4.wiki.kernel.org/index.php?title=Logical_Volumes_Administration#Deploying_a_logical_volume_after_correct_shutdown section].

In Reiser5 only a restricted number of bricks participates in each transaction. The maximal number of such bricks can be specified by the user. At mount time a transaction replay procedure is launched on each such brick independently, in parallel.

Depending on the kind of interrupted volume operation, perform one of the following actions:

== Volume balancing was interrupted ==

Check your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]. Register the complete set of bricks and mount the volume by issuing the mount command against one of its bricks. Check the balanced status of your LV by running:

 # volume.reiser4 /mnt

and checking the "balanced" value.
If the volume is unbalanced, complete balancing by running:

 # volume.reiser4 -b /mnt

== Brick removal was interrupted ==

Check your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]. Register the new set of bricks (that is, the set of bricks without the brick you wanted to remove). Try to mount the volume. In case of error, register the brick you wanted to remove as well and try to mount again. Check the status of your LV by running:

 # volume.reiser4 /mnt

and checking the value of "health". If required, complete the brick removal by running:

 # volume.reiser4 -R /mnt

Note that the option -R doesn't accept any arguments. After successful removal completion the brick will be automatically removed from the volume and unregistered. Verify this by checking the status of your LV and the list of registered bricks:

 # volume.reiser4 /mnt
 # volume.reiser4 -l

Upon successful completion update your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration] accordingly.
= LV monitoring = Common info about LV mounted at /mnt # volume.reiser4 /mnt ID: Volume UUID volume: ID of plugin managing the volume distribution: ID of distribution plugin stripe: Stripe size in bytes segments: Number of hash space segments (for distribution) bricks total: Total number of bricks in the volume bricks in DSA: Number of bricks participating in data distribution balanced: Balanced status of the volume health: Brick removal completion status Info about any its brick of index J # volume.reiser4 -p J /mnt internal ID: Brick's "internal ID" and its status in the volume external ID: Brick's UUID device name: Name of the block device associated with the brick block count: Size of the block device in blocks blocks used: Total number of occupied blocks on the device system blocks: Minimal possible number of busy blocks on that device data capacity: Abstract capacity of the brick space usage: Portion of occupied blocks on the device in DSA: Participation in regular data distribution is proxy: Participation in data tiering (Burst Buffers, etc) Comment. When retrieving brick's info make sure that no volume operations over that volume are in progress. Otherwise the command above will return error (EBUSY). WARNING. Bricks info provided by such way is not necessarily the most recent one. To get an actual info run sync(1) and make sure that no regular file operations are in progress. = Checking free space = To check number of available free blocks on a volume mounted at /mnt, make sure that no regular file operations, as well as volume operations, are in progress on that volume, then run # sync # df --block-size=4K /mnt To check number of free blocks on the brick of index J run # volume.reiser4 -p J /mnt Then calculate the difference between block count and blocks used Comment. Not all free blocks on a brick/volume are available for use. 
Number of available free blocks is always ~95% of total number of free blocks (Reiser4 reserves 5% to make sure that regular file truncate operations won't fail). NOTE: volume.reiser4 shows total number of free blocks, whereas df(1) shows number of available free blocks. "Space usage" statistics shows a portion of busy blocks on individual brick. For the reasons explained above "space usage" on any brick can not be more than 0.95 = Checking quality of data distribution = Quality of data distribution is a measure of deviation of the real data space usage from the ideal one defined by volume partitioning. The smaller the deviation, the better the distribution quality. Checking quality of distribution makes sense only in the case when your volume partitioning is space-based, or if it coincides with the space-based one. If your partitioning is throughput-based, and it doesn't coincide with the space-based one, then quality of actual data distribution can be rather bad, as in this case the file system is worried for low-performance devices to not become a bottleneck, and effective space usage in this case is not a high priority. Checking quality of data distribution is based on the free blocks accounting, provided by the file system. Note that file system doesn't count busy data and meta-data blocks separately, so you are not able to find real data space usage, and hence to check quality of distribution in the case when meta-data brick contains data blocks. To check quality of distribution * make sure that meta-data brick doesn't contain data blocks; * make sure that no regular file and volume operations are currently in progress; * find "blocks used", "system blocks" and "data capacity" statistics for each data brick: # sync # volume.reiser4 -p 1 /mnt ... # volume.reiser4 -p N /mnt * find real data space usage on each brick; * calculate partitioning and ideal data space usage on each data brick; * find deviation of (4) from (5). Example. 
Let' build a LV of 3 bricks (one 10G meta-data brick sdb1, and two data bricks: sdc1 (10G), sdd1(5G)) with space-based partitioning: # VOL_ID=`uuid -v4` # echo "Using uuid $VOL_ID" # mkfs.reiser4 -U $VOL_ID -y -t 256K /dev/vdb1 # mkfs.reiser4 -U $VOL_ID -y -a -t 256K /dev/vdc1 # mkfs.reiser4 -U $VOL_ID -y -a -t 256K /dev/vdd1 # mount /dev/vdb1 /mnt Fill the meta-data brick with data: # dd if=/dev/zero of=/mnt/myfile bs=256K No space left on device... Add data-bricks /dev/sdc1 and dev/sdd1 to the volume: # volume.reiser4 -a /dev/vdc1 /mnt # volume.reiser4 -a /dev/vdd1 /mnt Move all data blocks to the newly added bricks: # volume.reiser4 -r /dev/vdb1 /mnt # sync Now meta-data brick doesn't contain data blocks (only meta-data ones), so that we can calculate quality of data distribution # volume.reiser4 /mnt -p0 blocks used: 503 # volume.reiser4 /mnt -p1 blocks used: 1657203 system blocks: 115 data capacity: 2621069 # volume.reiser4 /mnt -p2 blocks used: 833001 system blocks: 73 data capacity: 1310391 Basing on the statistics above calculate quality of distribution. Total data capacity of the volume: C = 2621069 + 1310391 = 3931460 Relative capacities of data bricks: C1 = 2621069 /(2621069 + 1310391) = 0.6667 C2 = 1310464 /(2621069 + 1310391) = 0.3333 Real space usage on data bricks (blocks used - system blocks): R1 = 1657203 - 115 = 1657088 R2 = 833001 - 73 = 832928 Space usage on the volume: R = R1 + R2 = 1657088 + 832928 = 2490016 Ideal data space usage on data bricks: I1 = C1 * R = 0.6667 * 2490016 = 1660094 I2 = C2 * R = 0.3333 * 2490016 = 829922 Deviation: D = (R1, R2) - (I1, I2) = (3006, -3006) Relative deviation: D/R = (-0.0012, 0.0012) Quality of distribution: Q = 1 - max(|D1|, |D1|) = 1 - 0.0012 = 0.9988 Comment. For any specified number of bricks N and quality of distribution Q it is possible to find a configuration of a logical volume composed of N bricks, so that quality of distribution on that volume will be better than Q. Comment. 
Comment. The quality of distribution Q doesn't depend on the number of bricks in the logical volume. This is a theorem, which can be strictly proven.

= FAQ =

'''What happens if I lose a brick (due to a breakdown, etc.) of my logical volume?'''

If you lose the meta-data brick, then you lose the whole volume. If you lose a data brick, then you'll still be able to mount the volume, but the bodies of some of your regular files will become "punched" in random places. The portion of such files depends on the relative capacity of the lost brick, on the number of bricks in the logical volume, and on other factors. Fsck will be able to detect and remove such files with corrupted bodies. Nevertheless, we recommend considering mirroring your bricks (e.g. by software or hardware RAID-1) to avoid such highly unpleasant situations.

'''I am not able to complete brick removal because of "no space left on device". What should I do?'''

At the very beginning of the removal operation the file system calculates the precise amount of space on the other bricks needed to accommodate all the data of the brick you are going to remove. It can happen, however, that the free space on those bricks was close to the calculated amount and, during the removal, fell short of it due to intensive regular file operations (caused e.g. by torrent activity) going on in parallel. To complete the operation, consider removing some regular files from your volume. In the future it will be possible to add a brick to a volume marked as a "volume with incomplete brick removal", which will also allow the removal operation to complete.

[[category:Reiser4]]

Before working with logical volumes you need to understand some basic [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Background principles]. A logical volume (LV) can be composed of any number of block devices differing in physical and geometric parameters.
However, the optimal configuration (true parallelism) imposes some restrictions and dependencies on the sizes of such devices.

WARNING: This software is not stable. Don't put important data on logical volumes managed by software of release number 5.X.Y. Also, don't mount your old partitions in kernels with Reiser4 of SFRN 5.X.Y before its stabilization.

IMPORTANT: Currently there are no tools to manage Reiser5 logical volumes off-line, so it is strongly recommended to save/update the configuration of your LV in a file which doesn't belong to that volume.

= Basic definitions. Volume configuration. Brick's capacity. Partitioning. Fair distribution. Balancing =

The basic configuration of a logical volume is the following information:
1) Volume UUID;
2) Number of bricks in the volume;
3) List of brick names or UUIDs in the volume;
4) UUID or name of the brick to be added/removed (if any). That brick is not counted in (2) and (3).

Item #4 exists to handle incomplete operations interrupted for various reasons (system crash, hard reset, etc.) when bringing logical volumes on-line. For each volume its configuration should be stored somewhere (but not on that volume!) and properly updated before and after each volume operation performed on that volume. We make the user responsible for this. The volume configuration is needed to facilitate deploying the volume.

'''Capacity of a brick''' (or abstract capacity) is a positive integer. Capacity is a brick's property defined by the user. Don't confuse it with the size of the block device. Think of it as the brick's "weight" in some units. It is the user who decides which property of the brick to assign as its abstract capacity and in which units. In particular, it can be the size of the block device in kilobytes, or its size in megabytes, or its throughput in MB/s, or any other geometric or physical parameter of the device associated with the brick. It is important that the capacities of all bricks of the same logical volume are measured in the same units.
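Returning to the volume configuration for a moment: items 1-4 can be kept in a small machine-readable file stored outside the volume. A sketch in Python; the JSON schema and the file path are my own invention for illustration, not a format used by reiser4progs:

```python
import json

# Hypothetical layout for the volume configuration (items 1-4 above).
config = {
    "volume_uuid": "00000000-0000-0000-0000-000000000000",  # 1) placeholder UUID
    "brick_count": 2,                      # 2) number of bricks
    "bricks": ["/dev/vdb1", "/dev/vdc1"],  # 3) brick names or UUIDs
    "pending_brick": None,                 # 4) brick being added/removed, if any
}

def save_config(path, cfg):
    # Store the configuration in a file that does NOT live on the volume itself.
    with open(path, "w") as f:
        json.dump(cfg, f, indent=2)

def load_config(path):
    with open(path) as f:
        return json.load(f)

save_config("/tmp/lv-vdb1.json", config)
assert load_config("/tmp/lv-vdb1.json") == config
```

Updating "pending_brick" before a volume operation and clearing it afterwards mirrors the bookkeeping for item #4 described above.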
Also, it would be utterly pointless to assign different properties as abstract capacities for bricks of the same LV - for example, the size of the block device for one brick and the disk bandwidth for another.

The capacity of each brick gets initialized by the mkfs utility. By default it is calculated as the number of free blocks on the device at the very end of the formatting procedure. For the meta-data brick it is calculated as 70% of that amount. The capacity of any brick can be changed on-line by the user.

'''Capacity of a logical volume''' is defined as the sum of the capacities of its component bricks.

'''Relative capacity of a brick''' is the ratio of the brick's capacity to the volume's capacity. Relative capacity defines the portion of IO-requests that will be issued against that brick. The array of relative capacities (C1, C2, ...) of all bricks is called the volume partitioning. Obviously, C1 + C2 + ... = 1.

'''Real data space usage''' on a brick is the number of data blocks stored on that brick.

'''Ideal data space usage''' on a brick is defined as T*C, where T is the total number of data blocks stored in the volume and C is the relative capacity of the brick.

It is recommended to compose volumes so that the space-based partitioning coincides with the throughput-based one - this is the optimal volume configuration, which provides true parallelism. If that is impossible for some reason, then choose a preferred partitioning method (space-based or throughput-based). Note that space-based partitioning saves volume space, whereas throughput-based partitioning saves volume throughput.

When performing regular file operations, Reiser5 distributes data stripes throughout the volume evenly and fairly. This means that the portion of IO-requests issued against each brick is equal to its relative capacity, that is, to the portion of capacity that the brick adds to the total volume capacity. In contrast with regular file operations, volume operations break the fairness of data distribution on your logical volume.
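The capacity arithmetic above (relative capacities summing to 1, ideal usage T*C) can be sketched in a few lines of Python (illustration only, not reiser4progs code):

```python
# Sketch of the definitions above: volume partitioning and ideal data space usage.

def partitioning(capacities):
    # Relative capacity of each brick: its capacity over the volume capacity.
    volume_capacity = sum(capacities)
    return [c / volume_capacity for c in capacities]

def ideal_usage(capacities, total_data_blocks):
    # Ideal data space usage per brick: T * C.
    return [total_data_blocks * c for c in partitioning(capacities)]

# Example: three bricks whose capacities are block-device sizes in megabytes.
caps = [10240, 10240, 5120]
print(partitioning(caps))
print(ideal_usage(caps, 1000))
```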
To restore fairness of distribution, a special balancing procedure should be run on the volume. For example, after adding a brick to a logical volume, the balancing procedure will populate the new brick with data moved from other bricks. All volume operations except brick removal are fast, atomic and leave the volume in an unbalanced state. The brick removal operation always includes balancing, which moves data from the brick you want to remove to the other bricks of the volume. If that data migration is interrupted for some reason, then the volume is marked as a "volume with incomplete brick removal".

It is allowed to perform regular file and volume operations on an unbalanced LV (assuming it is not in the incomplete-removal state). However, in this case we don't guarantee a good quality of data distribution on your LV. In addition, on a volume with incomplete removal you won't be able to perform regular volume operations - first you will need to complete the removal by running a special removal-completion procedure on your volume.

= Prepare Software and Hardware =

Build, install and boot a kernel with Reiser4 of software framework release number 5.X.Y. Kernel patches can be found [https://sourceforge.net/projects/reiser4/files/v5-unstable/ here]. Note that the Linux kernel and GNU utilities still recognize the testing software as "Reiser4". Make sure there is the following message in the kernel logs:

 "Loading Reiser4 (Software Framework Release: 5.X.Y)"

Build and install the latest [https://sourceforge.net/projects/reiser4/files/reiser4-utils/libaal/ libaal]. Download, build and install the latest version 2.A.B of the [https://sourceforge.net/projects/reiser4/files/v5-unstable/ Reiser4progs package]. Make sure that the utility for managing logical volumes is installed (as a part of the reiser4progs package) on your machine:

 # volume.reiser4 -?

= Creating a logical volume =

Start by choosing a unique ID (uuid) for your volume. By default it is generated by the mkfs utility.
However, you can generate it yourself with a suitable tool (e.g. uuidgen(1)) and store it in an environment variable for convenience:

 # VOL_ID=`uuidgen`
 # echo "Using uuid $VOL_ID"

Choose a stripe size for your logical volume. For a good quality of distribution it is recommended that the stripe not exceed 1/10000 of the volume size. On the other hand, too small a stripe will increase space consumption on your meta-data brick. In our example we choose a stripe size of 512K:

 # STRIPE=512K
 # echo "Using stripe size $STRIPE"

Start by creating the first brick of your volume - the meta-data brick - passing the volume ID and stripe size to the mkfs.reiser4 utility:

 # mkfs.reiser4 -U $VOL_ID -t $STRIPE /dev/vdb1

Currently only one meta-data brick per volume is supported, so it is recommended that the block device for the meta-data brick not be too small. In most cases it will be enough if your meta-data brick is not smaller than 1/200 of the maximal volume size. For example, a 100G meta-data brick will be able to service a ~20T logical volume. Data and meta-data bricks don't differ from the standpoint of the disk format, and there is no special option to inform the mkfs utility that we want to create a meta-data brick: the first brick in the volume automatically becomes the meta-data brick, and the other bricks are interpreted as data bricks.

Mount your initial logical volume consisting of one meta-data brick:

 # mount /dev/vdb1 /mnt

Find a record about your volume in the output of the following command:

 # volume.reiser4 -l

Create the configuration of your logical volume (its definition is above) and store it somewhere - but not on that volume! Your logical volume is now on-line and ready to use. You can perform regular file operations and volume operations (e.g. add a data brick to your LV).

= Adding a data brick to LV =

At any time you can add a data brick to your LV. You can do it in parallel with regular file operations executing on this volume.
Make sure, however, that your volume is not marked as a "volume with incomplete brick removal" and that no other volume operation on your volume is in progress. Otherwise your operation will fail with EBUSY. Obviously, adding a brick will increase the capacity of your volume.

Choose a block device for the new data brick. Make sure that it is not too large or too small: the capacities of any 2 bricks of the same logical volume can not differ by more than 2^19 (~500000) times. E.g. your logical volume can not contain both 1M and 2T bricks. Any attempt to add a brick of improper capacity will fail with an error.

Format the device with the same volume ID and stripe size as you used for the meta-data brick, but also specify the "-a" option (to not restrict the data capacity):

 # mkfs.reiser4 -U $VOL_ID -t $STRIPE -a /dev/vdb2

It is important that a data brick be formatted with the same volume ID and stripe size as the meta-data brick of your logical volume. Otherwise, the operation of adding the data brick will fail. Update item #4 of your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration] with the UUID or name of the brick you want to add. To add the brick, simply pass its name as an argument of the option "-a" and specify your LV via its mount point:

 # volume.reiser4 -a /dev/vdb2 /mnt

By default the operation of adding a brick is fast and atomic and leaves the volume in an unbalanced state, so after adding a brick you might want to run the balancing procedure, which will move a portion of data to the new brick from the other bricks of the logical volume, making the data distribution on your volume fair:

 # volume.reiser4 -b /mnt

The portion of data blocks moved during such rebalancing is equal to the relative capacity of the new brick, that is, to the portion of capacity that the new brick adds to the updated LV's capacity.
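Two quantities from this section can be estimated up front: whether a candidate brick respects the capacity-ratio limit, and what share of data the subsequent balancing will move. A Python sketch (my own helpers, not reiser4progs code):

```python
MAX_CAPACITY_RATIO = 2 ** 19  # bricks of one LV can not differ in capacity more than this

def capacity_ratio_ok(capacities):
    # True if the largest and smallest brick capacities are within the allowed ratio.
    return max(capacities) <= min(capacities) * MAX_CAPACITY_RATIO

def balancing_share(existing_capacities, new_capacity):
    # Fraction of data blocks moved by balancing after adding the brick:
    # the portion of capacity the new brick adds to the updated LV capacity.
    return new_capacity / (sum(existing_capacities) + new_capacity)

# A 2T brick next to a 1M brick exceeds the 2^19 limit:
print(capacity_ratio_ok([1 * 2**20, 2 * 2**40]))  # prints False
# Adding a 5G brick to a 10G+10G volume moves ~20% of the data:
print(balancing_share([10, 10], 5))  # prints 0.2
```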
This important property defines the cost of the balancing procedure: if the portion of capacity added by a brick is small, then the number of stripes moved during balancing is also small. Using this operation in conjunction with the option -B (--with-balance) will trigger the balancing procedure automatically:

 # volume.reiser4 -Ba /dev/vdb2 /mnt

Upon successful completion update your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]. That is, increment (#2), add info about the new brick to (#3) and remove the record at (#4). When adding more than one brick at once, call volume.reiser4 with the option -a for each brick individually, in any order. It is reasonable not to complete each call with balancing - run balancing only after adding the last brick.

= Removing a data brick from LV =

At any time you can remove any data brick from your LV. You can do it in parallel with regular file operations executing on that volume. Make sure, however, that your volume is not marked as a "volume with incomplete brick removal" and that no other volume operation on your volume is in progress. Otherwise your operation will fail with EBUSY. Obviously, the removal operation will decrease the abstract capacity of your LV. Note that the other bricks should have enough space to store all the data blocks of the brick you want to remove; otherwise the removal operation will return an error (ENOSPC).

Suppose you want to remove the brick /dev/vdb2 from your LV mounted at /mnt. Update your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration] with the UUID and name of the brick you want to remove (item #4).
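Before issuing the removal it can be useful to estimate the ENOSPC condition from volume.reiser4 brick statistics. A rough Python sketch (my own heuristic, assuming the ~5% reserve described in "Checking free space"; the authoritative check is done by the file system itself):

```python
RESERVED = 0.05  # Reiser4 keeps ~5% of free blocks unavailable (see "Checking free space")

def removal_feasible(remaining_bricks, brick_to_remove):
    # Each brick is a (block_count, blocks_used) pair, as printed by volume.reiser4 -p J.
    # Rough upper bound for the data to relocate: everything in use on the doomed brick.
    to_move = brick_to_remove[1]
    # Available room on the remaining bricks, with the reserve subtracted.
    available = sum((count - used) * (1 - RESERVED) for count, used in remaining_bricks)
    return available >= to_move

# Removing a half-full brick while the other bricks have plenty of free blocks:
print(removal_feasible([(2**18, 2**16), (2**18, 2**17)], (2**18, 2**17)))  # prints True
```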
To remove the brick, simply pass its name as an argument of the option "-r" and specify the logical volume by its mount point:

 # volume.reiser4 -r /dev/vdb2 /mnt

The brick removal procedure starts by moving all data from the brick you want to remove to the other bricks of your volume, so that the resulting data distribution among the remaining bricks is also fair. The portion of data stripes moved during such migration is equal to the relative capacity of the brick to be removed (that is, to the portion of capacity that the brick added to the LV's capacity). Successful brick removal always leaves the volume in a balanced state. So, in contrast with the operation of adding a brick, removing a brick is a rather long operation, which can be interrupted for various reasons. In that case the volume will be marked as a "volume with incomplete brick removal". To check the removal status of your LV simply run

 # volume.reiser4 /mnt

and check the field "health". To complete brick removal in the current mount session simply run

 # volume.reiser4 -R /mnt

Note that the option -R (--finish-removal) doesn't accept any arguments. On success, update your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]: remove the information about the brick /dev/vdb2 at (#3) and (#4). Check your kernel logs: they should contain a message that the brick /dev/vdb2 has been unregistered. Now the device /dev/vdb2 doesn't belong to the logical volume any more, and you can reuse it for other purposes (re-format, etc.).

= Changing brick's capacity =

At any time you can change the abstract capacity of any brick to some new value different from 0. You can do it in parallel with regular file operations executing on that volume. Make sure, however, that your volume is not marked as a "volume with incomplete brick removal" and that no other volume operation on your volume is in progress.
Otherwise your operation will fail with EBUSY. Changing capacity always changes the volume partitioning and therefore breaks the fairness of data distribution on the volume. By default the operation of changing a brick's capacity leaves the volume in an unbalanced state, so after changing a brick's capacity you might want to run the balancing procedure to make the data distribution on your volume fair. In particular, after increasing a brick's capacity the balancing procedure will move some data from other bricks to the brick whose capacity was increased. After decreasing a brick's capacity the balancing procedure will move some data from the brick whose capacity was decreased to the other bricks.

To change the abstract capacity of the brick /dev/vdb1 to a new value (e.g. 200000), simply run

 # volume.reiser4 -z /dev/vdb1 -c 200000 /mnt

pronounced as "resize brick /dev/vdb1 to new capacity 200000 in the volume mounted at /mnt". By default the operation of changing capacity is fast, atomic and leaves the volume in an unbalanced state. To invoke balancing automatically, use this operation in conjunction with the option -B (--with-balance). You can also run the balancing procedure later, at any time, by executing

 # volume.reiser4 -b /mnt

When changing the capacities of more than one brick at once, call the volume.reiser4 utility for each brick individually, in any order. It is reasonable not to complete each call with balancing - run balancing only after changing the capacity of the last brick.

Comment. Changing a brick's capacity to 0 is undefined and will return an error. Consider the brick removal operation instead.

= Operations with meta-data brick =

The meta-data brick can also contain data stripes and participate in data distribution like the other data bricks. All the volume operations described above are also applicable to the meta-data brick.
Note, however, that it is impossible to completely remove the meta-data brick from the logical volume for obvious reasons (meta-data need to be stored somewhere), so the brick removal operation applied to the meta-data brick actually removes it only from the Data Storage Array (DSA), which is the subset of the LV consisting of the bricks participating in regular data distribution according to their abstract capacities. Once you remove the meta-data brick from the DSA, that brick will be used only to store meta-data. The operation of adding a brick, applied to the meta-data brick, returns it to the DSA.

Important: Reiser5 doesn't count busy data and meta-data blocks separately. So, in contrast with data bricks (which contain only data), you are not able to find out the real space occupied by data blocks on the meta-data brick - Reiser5 knows only the total space occupied. To check the status of the meta-data brick of the volume mounted at /mnt simply run

 # volume.reiser4 -p0 /mnt

and check the field "in DSA".

= Unmounting a logical volume =

To terminate a mount session just issue the usual umount against the mount point:

 # umount /mnt

Note that after unmounting the volume all bricks by default remain registered in the system until system shutdown. If you want to unregister a brick before system shutdown, then simply issue the following command:

 # volume.reiser4 -u BRICK_NAME

= Deploying a logical volume after correct unmount =

After unmounting a logical volume all its bricks remain registered in the system. So, if you want to mount the volume again, simply issue the mount command against any of its bricks. It is recommended to issue it against the meta-data brick.

NOTE: Reiser5 will refuse to mount a logical volume when a wrong (incomplete or redundant) set of bricks is registered in the system. A redundant set of bricks appears, for example, when you mistakenly register a brick that was earlier removed from the logical volume.
= Deploying a logical volume after correct shutdown =

First of all, check the [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing configuration] of your volume and make sure that all its bricks (data and meta-data ones) are registered in the system. The list of registered bricks can be printed by

 # volume.reiser4 -l

Also make sure that the set of registered bricks doesn't contain bricks not mentioned in the [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration].

Important: Reiser5 will refuse to mount a logical volume when a wrong (incomplete or redundant) set of bricks is registered in the system. A redundant set of bricks appears, for example, when you mistakenly register a brick that was removed from the logical volume. For this reason we strongly recommend keeping track of your LV: store its [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing configuration] somewhere - but not on this volume! And don't forget to update that configuration after '''every''' volume operation. If you lose the configuration of your LV and don't remember it (which is most likely for large volumes), then it will be rather painful to restore: currently there are no tools to manage logical volumes off-line, so users have to do this on their own. It is not at all difficult.

To register a brick in the system use the following command:

 # volume.reiser4 -g BRICK_NAME

To print the list of all registered bricks use

 # volume.reiser4 -l

Now mount your LV, simply issuing a mount(8) command against one of the bricks of your LV.
We recommend issuing it against the meta-data brick.

Comment. Reiser5 always tries to register the brick which is passed to the mount command as an argument, so it is not necessary to pre-register the brick you issue the mount command against.

= Deploying a logical volume after hard reset or system crash =

If no volume operations were interrupted by the hard reset or system crash, then just follow the instructions in this [https://reiser4.wiki.kernel.org/index.php?title=Logical_Volumes_Administration#Deploying_a_logical_volume_after_correct_shutdown section]. In Reiser5 only a restricted number of bricks participate in every transaction. The maximal number of such bricks can be specified by the user. At mount time a transaction replay procedure will be launched on each such brick independently, in parallel. Depending on the kind of interrupted volume operation, perform one of the following actions:

== Volume balancing was interrupted ==

Check your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]. Register the complete set of bricks and mount the volume by issuing the mount command against any of its bricks. Check the balanced status of your LV by running

 # volume.reiser4 /mnt

and checking the "balanced" value. If the volume is unbalanced, then complete balancing by running

 # volume.reiser4 -b /mnt

== Brick removal was interrupted ==

Check your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]. Register the new set of bricks (that is, the set of bricks without the brick you wanted to remove). Try to mount the volume. In the case of an error, register also the brick you wanted to remove and try to mount again.
Check the status of your LV by running

 # volume.reiser4 /mnt

and checking the value of "health". If required, complete the brick removal by running

 # volume.reiser4 -R /mnt

Note that the option -R doesn't accept any arguments. After successful removal completion the brick will be automatically removed from the volume and unregistered. Make sure of it by checking the status of your LV and the list of registered bricks:

 # volume.reiser4 /mnt
 # volume.reiser4 -l

Upon successful completion update your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration] accordingly.

= LV monitoring =

Common info about the LV mounted at /mnt:

 # volume.reiser4 /mnt

 ID:             Volume UUID
 volume:         ID of the plugin managing the volume
 distribution:   ID of the distribution plugin
 stripe:         Stripe size in bytes
 segments:       Number of hash space segments (for distribution)
 bricks total:   Total number of bricks in the volume
 bricks in DSA:  Number of bricks participating in data distribution
 balanced:       Balanced status of the volume
 health:         Brick removal completion status

Info about the brick with index J:

 # volume.reiser4 -p J /mnt

 internal ID:    Brick's "internal ID" and its status in the volume
 external ID:    Brick's UUID
 device name:    Name of the block device associated with the brick
 block count:    Size of the block device in blocks
 blocks used:    Total number of occupied blocks on the device
 system blocks:  Minimal possible number of busy blocks on that device
 data capacity:  Abstract capacity of the brick
 space usage:    Portion of occupied blocks on the device
 in DSA:         Participation in regular data distribution
 is proxy:       Participation in data tiering (Burst Buffers, etc.)

Comment. When retrieving a brick's info make sure that no volume operations on that volume are in progress. Otherwise the command above will return an error (EBUSY).

WARNING:
Brick info obtained this way is not necessarily the most recent. To get up-to-date info, run sync(1) and make sure that no regular file operations are in progress.

= Checking free space =

To check the number of available free blocks on a volume mounted at /mnt, make sure that no regular file operations or volume operations are in progress on that volume, then run

 # sync
 # df --block-size=4K /mnt

To check the number of free blocks on the brick with index J run

 # volume.reiser4 -p J /mnt

then calculate the difference between "block count" and "blocks used".

Comment. Not all free blocks on a brick/volume are available for use.
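The free-space arithmetic of this section can be summarized in a short Python sketch (the 5% reserve and the block counts are as described above; the helper names are mine):

```python
RESERVE = 0.05  # share of free blocks Reiser4 keeps back so truncate can not fail

def free_blocks(block_count, blocks_used):
    # Total free blocks on a brick, derivable from volume.reiser4 -p J output.
    return block_count - blocks_used

def available_blocks(block_count, blocks_used):
    # Available free blocks, as df(1) would report: ~95% of the total free blocks.
    return int(free_blocks(block_count, blocks_used) * (1 - RESERVE))

# A 4 GiB brick with 4K blocks and 200000 blocks in use:
total = 4 * 2**30 // 4096  # 1048576 blocks
print(free_blocks(total, 200000))  # prints 848576
print(available_blocks(total, 200000))
```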
In addition, on a volume with incomplete removal you won't be able to perform regular volume operations - first you will need to complete the removal by running a special removal completion procedure on your volume. = Prepare Software and Hardware = Build, install and boot kernel with Reiser4 of software framework release number 5.X.Y. Kernel patches can be found [https://sourceforge.net/projects/reiser4/files/v5-unstable/ here]. Note that by Linux kernel and GNU utilities the testing stuff is still recognized as "Reiser4". Make sure there is the following message in kernel logs: "Loading Reiser4 (Software Framework Release: 5.X.Y)" Build and install the latest [https://sourceforge.net/projects/reiser4/files/reiser4-utils/libaal/ libaal] Download, build and install the latest version 2.A.B of [https://sourceforge.net/projects/reiser4/files/v5-unstable/ Reiser4progs package]. Make sure that utility for managing logical volumes is installed (as a part of reiser4progs package) on your machine: # volume.reiser4 -? = Creating a logical volume = Start from choosing a unique ID (uuid) of your volume. By default it is generated by mkfs utility. However, user can generate it himself by proper tools (e.g. uuidgen(1)) and store in an environment variable for convenience: # VOL_ID=`uuidgen` # echo "Using uuid $VOL_ID" Choose a stripe size for your logical volume. For a good quality of distribution it is recommended that stripe doesn't exceed 1/10000 of volume size. On the other hand, too small stripes will increase space consumption on your meta-data brick. In our example we choose stripe size 512K: # STRIPE=512K # echo "Using stripe size $STRIPE" Start from creating the first brick of your volume - meta-data brick, passing volume-ID and stripe size to mkfs.reiser4 utility: # mkfs.reiser4 -U $VOL_ID -t $STRIPE /dev/vdb1 Currently only one meta-data brick per volume is supported, so it is recommended that size of block device for meta-data brick in not too small. 
In most cases it will be enough, if your meta-data brick is not smaller than 1/200 of maximal volume size. For example, 100G meta-data brick will be able to service ~20T logical volume. Data and meta-data bricks don't differ from the standpoint of disk format, and there is no special option to inform mkfs utility that we want to create exactly meta-data brick: the first brick in the volume automatically becomes a meta-data brick, and other bricks are interpreted as data bricks. Mount your initial logical volume consisting of one meta-data brick: # mount /dev/vdb1 /mnt Find a record about your volume in the output of the following command: # volume.reiser4 -l Create configuration of your logical volume (its definition is above) and store it somewhere, but not on that volume! Your logical volume is now on-line and ready to use. You can perform regular file operations and volume operations (e.g. add a data brick to your LV). = Adding a data brick to LV = At any time you are able to add a data brick to your LV. You can do it in parallel with regular file operations executing on this volume. Make sure, however, that your volume is not marked as "volume with incomplete brick removal", or there is no other volume operations over your volume in progress. Otherwise your operation will fail with EBUSY. Obviously, adding a brick will increase capacity of your volume. Choose a block device for the new data brick. Make sure that it is not too large, or too small. Capacities of any 2 bricks of the same logical volume can not differ more than 2^19 (~1 million) times. E.g. your logical volume can not contain both, 1M and 2T bricks. Any attempts to add a brick of improper capacity will fail with error. Format it with the same volume ID and stripe size, as you used for meta-data brick, but specify also "-a" option (to not restrict data capacity). 
# mkfs.reiser4 -U $VOL_ID -t $STRIPE -a /dev/vdb2 It is important that data brick is formatted with the same volume ID and stripe size, as the meta-data brick of your logical volume. Otherwise, operation of adding a data brick will fail. Update item #4 of your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration] with UUID or name of the brick you want to add. To add a brick simply pass its name as an argument for the option "-a" and specify your LV via its mount point: # volume.reiser4 -a /dev/vdb2 /mnt By default operation of adding a brick is fast and atomic and leaves the volume in unbalanced state, so after adding a brick you might want to run a balancing procedure, which will move a portion of data to the new brick from other bricks of the logical volume, which will make data distribution on your volume fair: # volume.reiser4 -b /mnt Portion of data blocks, being moved during such rebalancing, is equal to relative capacity of the new brick, that is to the portion of capacity that the new brick adds to updated LV's capacity. This important property defines the cost of balancing procedure. If the portion of capacity added by a brick is small, then number of stripes moved during balancing is also small. Using this operation in conjunction with the option -B (--with-balance) will automatically trigger the balancing procedure: # volume.reiser4 -Ba /dev/vdb2 /mnt Upon successful completion update your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]. That is, increment (#2), add info about the new brick to (#3) and remove records at (#4). When adding more than one brick at once, call volume.reiser4 with option -a for each brick individually in any order. 
It will be reasonable to not complete each call with balancing. Run balancing only after adding the last brick. = Removing a data brick from LV = At any time you are able to remove any data brick from your LV. You can do it in parallel with regular file operations executing on that volume. Make sure, however, that your volume is not marked as "volume with incomplete brick removal", or there is no other volume operations over your volume in progress. Otherwise your operation will fail with EBUSY. Obviously, the removal operation will decrease abstract capacity of your LV. Note that other bricks should have enough space to store all data blocks of the brick you want to remove, otherwise, the removal operation will return error (ENOSPC). Suppose you want to remove brick /dev/vdb2 from your LV mounted at /mnt. Update your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration] with the UUID and name of the brick you want to remove (#item #4). To remove a brick simply pass its name as an argument for option "-r" and specify the logical volume by its mount point: # volume.reiser4 -r /dev/vdb2 /mnt The procedure of brick removal starts from moving all data from the brick you want to remove to other bricks of your volume, so that resulted data distribution among the rest of bricks will be also fair. Portion of data stripes being moved during such migration is equal to the relative capacity of the brick to be removed (that it to the portion of capacity that the brick added to LV's capacity). Successful brick removal always leaves the volume is balanced state. So, in contrast with the operation of adding a brick, removing a brick is a rather long operation, which can be interrupted for various reasons. In this case volume will be marked as a "volume with incomplete brick removal". 
To check removal status of your LV simply run # volume.reiser4 /mnt and check the field "health". To complete brick removal in the current mount session simply run # volume.reiser4 -R /mnt Note, that the option -R (--finish-removal) doesn't accept any arguments. On success update your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]: remove the information about the brick /dev/vdb2 at #3 and #4. Check your kernel logs: it should contain a message that brick /dev/vdb2 has been unregistered. Now device /dev/vdb2 doesn't belong to the logical volume any more, and you can reuse it for other purposes (re-format, etc). = Changing brick's capacity = At any time you can change abstract capacity of any brick to some new value, different from 0. You can do it in parallel with regular file operations executing on that volume. Make sure, however, that your volume is not marked as "volume with incomplete brick removal", or there is no other volume operations over your volume in progress. Otherwise your operation will fail with EBUSY. Changing capacity always changes volume partitioning, and therefore, breaks fairness of data distribution on the volume. By default operation of changing brick capacity leaves the volume in unbalanced state, so after changing brick capacity you might want to run a balancing procedure to make data distribution on your volume fair. In particular, after increasing brick capacity the balancing procedure will move some data from other bricks to the brick, whose capacity was increased. After decreasing bricks capacity the balancing procedure will move some data from the brick, whose capacity was decreased, to other bricks. To change abstract capacity of a brick /dev/vdb1 to a new value (e.g. 
200000), simply run # volume.reiser4 -z /dev/vdb1 -c 200000 /mnt Pronounced as "resize brick /dev/vdb1 to new capacity 200000 in volume mounted at /mnt". By default the operation of changing capacity is fast, atomic and leave the volume in unbalanced state. To automatically invoke balancing, use this operation in conjunction with the option -B (--with-balance). Also you can run a balancing procedure later at any time by executing # volume.reiser4 -b /mnt When changing capacities of more than one brick at once, call the volume.reiser4 utility for each brick individually in any order. It will be reasonable to not complete each call with balancing. Run balancing after changing capacity of the last brick. Comment. Changing bricks capacity to 0 is undefined and will return error. Consider brick removal operation instead. = Operations with meta-data brick = Meta-data brick also can contain data stripes and participate in data distribution like other data bricks. All the volume operations described above are also applicable to meta-data brick. Note, however, that it is impossible to completely remove meta-data brick from the logical volume for obvious reasons (meta-data need to be stored somewhere), so brick removal operation applied to the meta-data brick actually removes it only from Data-Storage Array (DSA), which is a subset of LV consisting of bricks, participating in regular data distribution, corresponding to their abstract capacities. Once you remove meta-data brick from DSA, that brick will be used only to store meta-data. Operation of adding a brick, being applied to a meta-data brick, returns the last one back to DSA. Important: Reiser5 doesn't count busy data and meta-data blocks separately. So in contrast with data bricks (which contain only data) you are not able to find out real space occupied by data blocks on the meta-data brick - Reiser5 knows only total space occupied. 
To check the status of meta-data brick of the volume mounted at /mnt simply run # volume.reiser4 -p0 /mnt and check the field "in DSA". = Unmounting a logical volume = To terminate a mount session just issue usual umount against the mount point: # umount /mnt Note that after unmounting the volume all bricks by default remain to be registered in the system till system shutdown. If you want to unregister a brick before system shutdown, then simply issue the following command: # volume.reiser4 -u BRICK_NAME = Deploying a logical volume after correct unmount = After unmounting a logical volume all its bricks remain to be registered in the system. So, if you want to mount the volume again, simply issue the mount command against some its brick. is recommended to issue it against meta-data brick. NOTE: Reiser5 will refuse to mount a logical volume, in the case, when a wrong (incomplete or redundant) set of bricks is registered in the system. Redundant set of bricks appears, for example, when you mistakenly register a brick that was earlier removed from the logical volume. = Deploying a logical volume after correct shutdown = First of all, check [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing configuration] of your volume and make sure that all its bricks (data and meta-data ones) are registered in the system. The list of registered bricks can be printed by # volume.reiser4 -l Also make sure that the set of registered per volume bricks doesn't contain bricks not mentioned in the [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]. Important: Reiser5 will refuse to mount a logical volume, in the case, when a wrong (incomplete or redundant) set of bricks is registered in the system. 
Redundant set of bricks appears, for example, when you mistakenly register a brick that was removed from the logical volume. For this reasons we strongly recommend for user to keep a track of his LV - store its [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing configuration] somewhere, but not in this volume! And don't forget to update that configuration after '''every''' volume operation. If you lost configuration of your LV and don't remember it (wich is most likely for large volumes), then it will be rather painful to restore it: currently there is no tools for to manage logical volumes off-line. So that, users are prompted to do this on their own. It is not at all difficult. To register a brick in the system use the following command: # volume.reiser4 -g BRICK_NAME To print a list of all registered bricks use # volume.reiser4 -l Now mount your LV, simply issuing a mount(8) command against one of the bricks of your LV. We recommend to issue it against meta-data brick. Comment. Reiser5 always tries to register the brick which is passed to the mount command as an argument, so it is not necessarily to preregister the brick you want to issue a mount command against. = Deploying a logical volume after hard reset or system crash = If no volume operations were interrupted by hard reset or system crash, then just follow the instructions in this [https://reiser4.wiki.kernel.org/index.php?title=Logical_Volumes_Administration#Deploying_a_logical_volume_after_correct_shutdown section]. In Reiser5 only restricted number of bricks participate in every transaction. Maximal number of such bricks can be specified by user. At mount time a transaction replay procedure will be launched on each such brick independently in parallel. 
Depending on a kind of interrupted volume operation, perform one of the following actions: == Volume balancing was interrupted == Check your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]. Register the complete set of bricks and mount the volume by issuing the mount command against some its brick. Check the balanced status of your LV by running # volume.reiser4 /mnt and checking "balanced" value. If the volume is unbalanced, then complete balancing by running # volume.reiser4 -b /mnt == Brick removal was interrupted == Check your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]. Register the new set of bricks (that is, the set of bricks without the brick you wanted to remove). Try to mount the volume. In the case of error register also the brick you wanted to remove and try to mount again. Check the status of your LV by running # volume.reiser4 /mnt and checking the value of "health". If required, complete brick removal by running # volume.reiser4 -R /mnt Note, that the option -R doesn't accept any arguments. After successful removal completion the brick will be automatically removed from the volume and unregistered. Make sure of it by checking status of your LV and the list of registered bricks: # volume.reiser4 /mnt # volume.reiser4 -l Upon successful completion update your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration] respectively. 
= LV monitoring = Common info about LV mounted at /mnt # volume.reiser4 /mnt ID: Volume UUID volume: ID of plugin managing the volume distribution: ID of distribution plugin stripe: Stripe size in bytes segments: Number of hash space segments (for distribution) bricks total: Total number of bricks in the volume bricks in DSA: Number of bricks participating in data distribution balanced: Balanced status of the volume health: Brick removal completion status Info about any its brick of index J # volume.reiser4 -p J /mnt internal ID: Brick's "internal ID" and its status in the volume external ID: Brick's UUID device name: Name of the block device associated with the brick block count: Size of the block device in blocks blocks used: Total number of occupied blocks on the device system blocks: Minimal possible number of busy blocks on that device data capacity: Abstract capacity of the brick space usage: Portion of occupied blocks on the device in DSA: Participation in regular data distribution is proxy: Participation in data tiering (Burst Buffers, etc) Comment. When retrieving brick's info make sure that no volume operations over that volume are in progress. Otherwise the command above will return error (EBUSY). WARNING. Bricks info provided by such way is not necessarily the most recent one. To get an actual info run sync(1) and make sure that no regular file operations are in progress. = Checking free space = To check number of available free blocks on a volume mounted at /mnt, make sure that no regular file operations, as well as volume operations, are in progress on that volume, then run # sync # df --block-size=4K /mnt To check number of free blocks on the brick of index J run # volume.reiser4 -p J /mnt Then calculate the difference between block count and blocks used Comment. Not all free blocks on a brick/volume are available for use. 
Number of available free blocks is always ~95% of total number of free blocks (Reiser4 reserves 5% to make sure that regular file truncate operations won't fail). NOTE: volume.reiser4 shows total number of free blocks, whereas df(1) shows number of available free blocks. "Space usage" statistics shows a portion of busy blocks on individual brick. For the reasons explained above "space usage" on any brick can not be more than 0.95 = Checking quality of data distribution = Quality of data distribution is a measure of deviation of the real data space usage from the ideal one defined by volume partitioning. The smaller the deviation, the better the distribution quality. Checking quality of distribution makes sense only in the case when your volume partitioning is space-based, or if it coincides with the space-based one. If your partitioning is throughput-based, and it doesn't coincide with the space-based one, then quality of actual data distribution can be rather bad, as in this case the file system is worried for low-performance devices to not become a bottleneck, and effective space usage in this case is not a high priority. Checking quality of data distribution is based on the free blocks accounting, provided by the file system. Note that file system doesn't count busy data and meta-data blocks separately, so you are not able to find real data space usage, and hence to check quality of distribution in the case when meta-data brick contains data blocks. To check quality of distribution * make sure that meta-data brick doesn't contain data blocks; * make sure that no regular file and volume operations are currently in progress; * find "blocks used", "system blocks" and "data capacity" statistics for each data brick: # sync # volume.reiser4 -p 1 /mnt ... # volume.reiser4 -p N /mnt * find real data space usage on each brick; * calculate partitioning and ideal data space usage on each data brick; * find deviation of (4) from (5). Example. 
Let' build a LV of 3 bricks (one 10G meta-data brick sdb1, and two data bricks: sdc1 (10G), sdd1(5G)) with space-based partitioning: # VOL_ID=`uuid -v4` # echo "Using uuid $VOL_ID" # mkfs.reiser4 -U $VOL_ID -y -t 256K /dev/vdb1 # mkfs.reiser4 -U $VOL_ID -y -a -t 256K /dev/vdc1 # mkfs.reiser4 -U $VOL_ID -y -a -t 256K /dev/vdd1 # mount /dev/vdb1 /mnt Fill the meta-data brick with data: # dd if=/dev/zero of=/mnt/myfile bs=256K No space left on device... Add data-bricks /dev/sdc1 and dev/sdd1 to the volume: # volume.reiser4 -a /dev/vdc1 /mnt # volume.reiser4 -a /dev/vdd1 /mnt Move all data blocks to the newly added bricks: # volume.reiser4 -r /dev/vdb1 /mnt # sync Now meta-data brick doesn't contain data blocks (only meta-data ones), so that we can calculate quality of data distribution # volume.reiser4 /mnt -p0 blocks used: 503 # volume.reiser4 /mnt -p1 blocks used: 1657203 system blocks: 115 data capacity: 2621069 # volume.reiser4 /mnt -p2 blocks used: 833001 system blocks: 73 data capacity: 1310391 Basing on the statistics above calculate quality of distribution. Total data capacity of the volume: C = 2621069 + 1310391 = 3931460 Relative capacities of data bricks: C1 = 2621069 /(2621069 + 1310391) = 0.6667 C2 = 1310464 /(2621069 + 1310391) = 0.3333 Real space usage on data bricks (blocks used - system blocks): R1 = 1657203 - 115 = 1657088 R2 = 833001 - 73 = 832928 Space usage on the volume: R = R1 + R2 = 1657088 + 832928 = 2490016 Ideal data space usage on data bricks: I1 = C1 * R = 0.6667 * 2490016 = 1660094 I2 = C2 * R = 0.3333 * 2490016 = 829922 Deviation: D = (R1, R2) - (I1, I2) = (3006, -3006) Relative deviation: D/R = (-0.0012, 0.0012) Quality of distribution: Q = 1 - max(|D1|, |D1|) = 1 - 0.0012 = 0.9988 Comment. For any specified number of bricks N and quality of distribution Q it is possible to find a configuration of a logical volume composed of N bricks, so that quality of distribution on that volume will be better than Q. Comment. 
Quality of distribution Q doesn't depend on the number of bricks in the logical volume. This is a theorem, which can be strictly proven. = FAQ = '''What happens if I lose a brick (due to a breakdown, etc) of my logical volume?''' If you lose a meta-data brick, then you lose the whole volume. If you loose a data brick, then you'll be able to mount the volume, but bodies of some your regular files will become "punched" in random places. Portion of such files depends on the relative capacity of the lost brick, on the number of bricks in the logical volume, and on other factors. Fsck will be able to detect and remove such files with corrupted bodies. Nevertheless, we recommend to consider mirroring your bricks (e.g. by software, or hardware RAID-1) to avoid such highly unpleasant situations. '''I am not able to complete brick removal because of "no space left on device". What should I do?''' At the very beginning of removal operation the file system calculates precise amount of space on other bricks, needed to accommodate all the data of the brick you are going to remove. It could happen, however that free space on the bricks was near the calculated amount, and during the removal, exceeded it due to intensive regular file operations (caused e.g. by torrents activity) going in parallel. So, to complete the operation consider removing a part of regular files on your partition. In the future it will be possible to add a brick to a volume with incomplete brick removal. It will also allow to complete removal operation. [[category:Reiser4]] 75857f3e9ec1f3bdc7d33170b210358dcdaf4faa 4433 4432 2020-11-12T12:40:31Z Edward 4 /* FAQ */ Before working with logical volumes you need to understand some basic [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Background principles]. Logical volume (LV) can be composed of any number of block devices, different in physical and geometric parameters. 
However the optimal configuration (true parallelism) imposes some restrictions and dependencies on the size of such devices. WARNING: The stuff is not stable. Don't put important data to logical volumes managed by software of release number 5.X.Y. Also don't mount your old partitions in kernels with Reiser4 of SFRN 5.X.Y before its stabilization IMPORTANT: Currently there is no tools to manage Reiser5 logical volumes off-line, so it it strongly recommended to save/update configurations of your LV in a file, which doesn't belong to that volume. = Basic definitions. Volume configuration. Brick's capacity. Partitioning. Fair distribution. Balancing = Basic configuration of a logical volume is the following information: 1) Volume UUID; 2) Number of bricks in the volume; 3) List of brick names or UUIDs in the volume; 4) UUID or name of the brick to be added/removed (if any). That brick is not counted in (2) and (3). The item #4 is to handle incomplete operations interrupted by various reasons (system crash, hard reset, etc) when bringing logical volumes on-line. For each volume its configuration should be stored somewhere (but not on that volume!) and properly updated before and after each volume operation performed on that volume. We make the user responsible for this. Volume configuration is needed to facilitate deploying a volume. '''Capacity of a brick''' (or abstract capacity) is a positive integer number. Capacity is a brick's property defined by user. Don't confuse it with the size of block device. Think of it as of brick's "weight" in some units. And this is the user, who decides, which property of the brick to assign as its abstract capacity and in which units. In particular, it can be size of the block device in kilobytes, or its size in megabytes, or its throughput in M/sec, or other geometric or physical parameter of the device, associated with the brick. It is important that capacities of all bricks of the same logical volume are measured in the same units. 
Also, it would be utterly pointless to assign different properties as abstract capacities for bricks of the same LV. For example, size of block device for one brick, and disk bandwidth for another one. Capacity of each brick gets initialized by mkfs utility. By default it is calculated as number of free blocks on the device at the very end of the formatting procedure. For meta-data brick it is calculated as 70% of such amount. Capacity of any brick can be changed on-line by user. '''Capacity of a logical volume''' is defined as a sum of capacities of its bricks-components. '''Relative capacity of a brick''' is the ratio of brick's capacity to volume's capacity. Relative capacity defines a portion of IO-requests that will be issued against that brick. Array of relative capacities (C1, C2, ...) of all bricks is called volume partitioning. Obviously, C1 + C2 + ... = 1. '''Real data space usage''' on a brick is number of data blocks, stored on that brick. '''Ideal data space usage''' on a brick is defined as T*C, where T is total number of data blocks stored in the volume. C is relative capacity of the brick. It is recommended to compose volumes in the way so that space-based partitioning coincides with throughput-based one - it would be the optimal volume configuration, which provides true parallelism. If it is impossible for some reason, then choose a preferred partitioning method (space-based, or throughput-based). Note that space-based partitioning saves volume space, whereas throughput based one saves volume throughput. When performing regular file operations, Reiser5 distributes data stripes throughout the volume evenly and fairly. It means that portion of IO-requests issued against each brick is equal to its relative capacity, that is, to the portion of capacity that the brick adds to the total volume's capacity. In contrast with regular file operations, volume operations break fairness of data distribution on your logical volume. 
To restore fair distribution, a special balancing procedure should be run on the volume. For example, after adding a brick to a logical volume, the balancing procedure populates the new brick with data moved from the other bricks.

All volume operations except brick removal are fast, atomic, and leave the volume in an unbalanced state. The brick removal operation always includes balancing, which moves data from the brick you want to remove to the other bricks of the volume. If that data migration is interrupted for some reason, the volume is marked as a "volume with incomplete brick removal".

It is allowed to perform regular file and volume operations on an unbalanced LV (assuming the imbalance is not an incomplete removal). However, in this case we don't guarantee good quality of data distribution on your LV. Moreover, on a volume with incomplete removal you won't be able to perform regular volume operations: first you will need to complete the removal by running a special removal completion procedure on your volume.

= Prepare Software and Hardware =

Build, install and boot a kernel with Reiser4 of software framework release number 5.X.Y. Kernel patches can be found [https://sourceforge.net/projects/reiser4/files/v5-unstable/ here]. Note that the Linux kernel and GNU utilities still recognize this testing software as "Reiser4". Make sure the following message appears in the kernel logs:

"Loading Reiser4 (Software Framework Release: 5.X.Y)"

Build and install the latest [https://sourceforge.net/projects/reiser4/files/reiser4-utils/libaal/ libaal]. Download, build and install the latest version 2.A.B of the [https://sourceforge.net/projects/reiser4/files/v5-unstable/ Reiser4progs package]. Make sure that the utility for managing logical volumes (part of the reiser4progs package) is installed on your machine:

# volume.reiser4 -?

= Creating a logical volume =

Start by choosing a unique ID (UUID) for your volume. By default it is generated by the mkfs utility.
However, you can generate it yourself with a suitable tool (e.g. uuidgen(1)) and store it in an environment variable for convenience:

# VOL_ID=`uuidgen`
# echo "Using uuid $VOL_ID"

Choose a stripe size for your logical volume. For good distribution quality it is recommended that the stripe not exceed 1/10000 of the volume size. On the other hand, too small a stripe will increase space consumption on your meta-data brick. In our example we choose a stripe size of 512K:

# STRIPE=512K
# echo "Using stripe size $STRIPE"

Create the first brick of your volume - the meta-data brick - passing the volume ID and stripe size to the mkfs.reiser4 utility:

# mkfs.reiser4 -U $VOL_ID -t $STRIPE /dev/vdb1

Currently only one meta-data brick per volume is supported, so the block device for the meta-data brick should not be too small. In most cases it is enough for the meta-data brick to be no smaller than 1/200 of the maximal volume size. For example, a 100G meta-data brick will be able to service a ~20T logical volume. Data and meta-data bricks don't differ in disk format, and there is no special option to tell the mkfs utility that we want to create a meta-data brick: the first brick in the volume automatically becomes the meta-data brick, and all other bricks are treated as data bricks.

Mount your initial logical volume, consisting of one meta-data brick:

# mount /dev/vdb1 /mnt

Find the record about your volume in the output of the following command:

# volume.reiser4 -l

Create the configuration of your logical volume (defined above) and store it somewhere, but not on that volume! Your logical volume is now on-line and ready to use. You can perform regular file operations and volume operations (e.g. add a data brick to your LV).

= Adding a data brick to LV =

At any time you can add a data brick to your LV. You can do it in parallel with regular file operations executing on this volume.
Make sure, however, that your volume is not marked as a "volume with incomplete brick removal" and that no other volume operation is in progress on it; otherwise the operation will fail with EBUSY. Obviously, adding a brick will increase the capacity of your volume.

Choose a block device for the new data brick. Make sure that it is neither too large nor too small: the capacities of any two bricks of the same logical volume cannot differ by more than a factor of 2^19. E.g. your logical volume cannot contain both a 1M and a 2T brick. Any attempt to add a brick of improper capacity will fail with an error.

Format the device with the same volume ID and stripe size as the meta-data brick, but also specify the "-a" option (to not restrict data capacity):

# mkfs.reiser4 -U $VOL_ID -t $STRIPE -a /dev/vdb2

It is important that the data brick is formatted with the same volume ID and stripe size as the meta-data brick of your logical volume; otherwise the operation of adding the data brick will fail. Update item #4 of your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration] with the UUID or name of the brick you want to add.

To add the brick, simply pass its name as an argument to the option "-a" and specify your LV by its mount point:

# volume.reiser4 -a /dev/vdb2 /mnt

By default the operation of adding a brick is fast and atomic and leaves the volume in an unbalanced state, so after adding a brick you might want to run the balancing procedure, which moves a portion of data from the other bricks of the logical volume to the new one, making data distribution on your volume fair:

# volume.reiser4 -b /mnt

The portion of data blocks moved during such rebalancing is equal to the relative capacity of the new brick, that is, to the portion of capacity that the new brick adds to the updated LV's capacity.
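As a rough illustration (with hypothetical capacity numbers, not taken from any real volume): the expected fraction of existing stripes migrated to a new brick equals the new brick's share of the updated volume capacity:

```python
# The expected fraction of data stripes moved by balancing after adding
# a brick equals the new brick's relative capacity in the updated volume.
# Capacity numbers below are hypothetical.

def migration_fraction(existing_capacities, new_capacity):
    return new_capacity / (sum(existing_capacities) + new_capacity)

# Adding a third brick of equal capacity: one third of the stripes move.
print(migration_fraction([1000, 1000], 1000))

# Adding a small brick to a large volume: balancing is cheap (~2.4%).
print(migration_fraction([10000, 10000], 500))
```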
This important property defines the cost of the balancing procedure: if the portion of capacity added by a brick is small, the number of stripes moved during balancing is also small.

Using the add operation in conjunction with the option -B (--with-balance) will trigger the balancing procedure automatically:

# volume.reiser4 -Ba /dev/vdb2 /mnt

Upon successful completion update your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]: increment (#2), add info about the new brick to (#3), and remove the record at (#4). When adding more than one brick at once, call volume.reiser4 with the option -a for each brick individually, in any order. It is reasonable not to follow each call with balancing; run balancing only after adding the last brick.

= Removing a data brick from LV =

At any time you can remove any data brick from your LV. You can do it in parallel with regular file operations executing on that volume. Make sure, however, that your volume is not marked as a "volume with incomplete brick removal" and that no other volume operation is in progress on it; otherwise the operation will fail with EBUSY. Obviously, the removal operation will decrease the abstract capacity of your LV. Note that the other bricks must have enough space to store all the data blocks of the brick you want to remove; otherwise the removal operation will return an error (ENOSPC).

Suppose you want to remove the brick /dev/vdb2 from your LV mounted at /mnt. Update your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration] with the UUID and name of the brick you want to remove (item #4).
To remove the brick, simply pass its name as an argument to the option "-r" and specify the logical volume by its mount point:

# volume.reiser4 -r /dev/vdb2 /mnt

The procedure of brick removal starts by moving all data from the brick you want to remove to the other bricks of your volume, so that the resulting data distribution among the remaining bricks is also fair. The portion of data stripes moved during this migration is equal to the relative capacity of the brick being removed (that is, to the portion of capacity that the brick contributed to the LV's capacity). Successful brick removal always leaves the volume in a balanced state.

So, in contrast with the operation of adding a brick, removing a brick is a rather long operation, which can be interrupted for various reasons. In that case the volume will be marked as a "volume with incomplete brick removal". To check the removal status of your LV, simply run

# volume.reiser4 /mnt

and check the field "health". To complete brick removal in the current mount session, simply run

# volume.reiser4 -R /mnt

Note that the option -R (--finish-removal) doesn't accept any arguments. On success update your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]: remove the information about the brick /dev/vdb2 at #3 and #4. Check your kernel logs: they should contain a message that brick /dev/vdb2 has been unregistered. The device /dev/vdb2 no longer belongs to the logical volume, and you can reuse it for other purposes (re-format, etc).

= Changing brick's capacity =

At any time you can change the abstract capacity of any brick to some new non-zero value. You can do it in parallel with regular file operations executing on that volume. Make sure, however, that your volume is not marked as a "volume with incomplete brick removal" and that no other volume operation is in progress on it.
Otherwise the operation will fail with EBUSY. Changing capacity always changes the volume partitioning, and therefore breaks the fairness of data distribution on the volume.

By default the operation of changing brick capacity leaves the volume in an unbalanced state, so after changing a brick's capacity you might want to run the balancing procedure to make data distribution on your volume fair. In particular, after increasing a brick's capacity the balancing procedure will move some data from other bricks to the brick whose capacity was increased; after decreasing a brick's capacity it will move some data from the brick whose capacity was decreased to the other bricks.

To change the abstract capacity of the brick /dev/vdb1 to a new value (e.g. 200000), simply run

# volume.reiser4 -z /dev/vdb1 -c 200000 /mnt

pronounced as "resize brick /dev/vdb1 to new capacity 200000 in the volume mounted at /mnt". By default the operation of changing capacity is fast, atomic, and leaves the volume in an unbalanced state. To invoke balancing automatically, use this operation in conjunction with the option -B (--with-balance). You can also run the balancing procedure later, at any time, by executing

# volume.reiser4 -b /mnt

When changing the capacities of more than one brick at once, call the volume.reiser4 utility for each brick individually, in any order. It is reasonable not to follow each call with balancing; run balancing after changing the capacity of the last brick.

Comment. Changing a brick's capacity to 0 is undefined and will return an error. Consider the brick removal operation instead.

= Operations with meta-data brick =

The meta-data brick can also contain data stripes and participate in data distribution like the data bricks. All the volume operations described above are also applicable to the meta-data brick.
Note, however, that it is impossible to completely remove the meta-data brick from the logical volume, for obvious reasons (meta-data needs to be stored somewhere). Applied to the meta-data brick, the brick removal operation only removes it from the Data Storage Array (DSA) - the subset of the LV consisting of the bricks that participate in regular data distribution according to their abstract capacities. Once you remove the meta-data brick from the DSA, that brick will be used only to store meta-data. The operation of adding a brick, applied to the meta-data brick, returns it to the DSA.

Important: Reiser5 doesn't count busy data and meta-data blocks separately. So, in contrast with data bricks (which contain only data), you cannot find out the real space occupied by data blocks on the meta-data brick - Reiser5 knows only the total space occupied. To check the status of the meta-data brick of the volume mounted at /mnt, simply run

# volume.reiser4 -p0 /mnt

and check the field "in DSA".

= Unmounting a logical volume =

To terminate a mount session, just issue the usual umount against the mount point:

# umount /mnt

Note that after unmounting the volume, all its bricks by default remain registered in the system until system shutdown. If you want to unregister a brick before system shutdown, simply issue the following command:

# volume.reiser4 -u BRICK_NAME

= Deploying a logical volume after correct unmount =

After unmounting a logical volume, all its bricks remain registered in the system. So, if you want to mount the volume again, simply issue the mount command against one of its bricks. It is recommended to issue it against the meta-data brick.

NOTE: Reiser5 will refuse to mount a logical volume if a wrong (incomplete or redundant) set of bricks is registered in the system. A redundant set of bricks appears, for example, when you mistakenly register a brick that was earlier removed from the logical volume.
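This consistency requirement can be sketched as a simple check of the saved configuration against the registered brick set (a hypothetical helper; reiser4progs provides no such tool):

```python
# Sketch: verify that the set of bricks registered in the system matches
# the saved volume configuration before attempting to mount.
# Hypothetical helper -- not part of reiser4progs.

def check_brick_set(configured, registered):
    """Return (missing, redundant) sets; both must be empty for mount to succeed."""
    configured, registered = set(configured), set(registered)
    return configured - registered, registered - configured

missing, redundant = check_brick_set(
    ["/dev/vdb1", "/dev/vdc1"],               # from the saved configuration
    ["/dev/vdb1", "/dev/vdc1", "/dev/vdd1"])  # from volume.reiser4 -l
print(missing)     # set()
print(redundant)   # {'/dev/vdd1'} -- e.g. a brick removed earlier but re-registered
```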
= Deploying a logical volume after correct shutdown =

First of all, check the [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing configuration] of your volume and make sure that all its bricks (data and meta-data) are registered in the system. The list of registered bricks can be printed by

# volume.reiser4 -l

Also make sure that the set of registered bricks of the volume doesn't contain bricks not mentioned in the [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration].

Important: Reiser5 will refuse to mount a logical volume if a wrong (incomplete or redundant) set of bricks is registered in the system. A redundant set of bricks appears, for example, when you mistakenly register a brick that was removed from the logical volume. For this reason we strongly recommend that users keep track of their LVs: store the [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing configuration] somewhere, but not on the volume itself! And don't forget to update that configuration after '''every''' volume operation. If you have lost the configuration of your LV and don't remember it (which is most likely for large volumes), it will be rather painful to restore: currently there are no tools to manage logical volumes off-line, so users have to do this on their own. It is not at all difficult.

To register a brick in the system, use the following command:

# volume.reiser4 -g BRICK_NAME

To print the list of all registered bricks, use

# volume.reiser4 -l

Now mount your LV by simply issuing a mount(8) command against one of its bricks.
We recommend issuing it against the meta-data brick.

Comment. Reiser5 always tries to register the brick that is passed to the mount command as an argument, so it is not necessary to pre-register the brick you issue the mount command against.

= Deploying a logical volume after hard reset or system crash =

If no volume operations were interrupted by the hard reset or system crash, just follow the instructions in this [https://reiser4.wiki.kernel.org/index.php?title=Logical_Volumes_Administration#Deploying_a_logical_volume_after_correct_shutdown section]. In Reiser5 only a restricted number of bricks participate in each transaction; the maximal number of such bricks can be specified by the user. At mount time a transaction replay procedure will be launched on each such brick independently, in parallel. Depending on the kind of interrupted volume operation, perform one of the following actions:

== Volume balancing was interrupted ==

Check your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]. Register the complete set of bricks and mount the volume by issuing the mount command against one of its bricks. Check the balanced status of your LV by running

# volume.reiser4 /mnt

and checking the "balanced" value. If the volume is unbalanced, complete balancing by running

# volume.reiser4 -b /mnt

== Brick removal was interrupted ==

Check your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]. Register the new set of bricks (that is, the set of bricks without the brick you wanted to remove). Try to mount the volume. In case of error, register also the brick you wanted to remove and try to mount again.
Check the status of your LV by running

# volume.reiser4 /mnt

and checking the value of "health". If required, complete the brick removal by running

# volume.reiser4 -R /mnt

Note that the option -R doesn't accept any arguments. After successful removal completion the brick will be automatically removed from the volume and unregistered. Verify this by checking the status of your LV and the list of registered bricks:

# volume.reiser4 /mnt
# volume.reiser4 -l

Upon successful completion update your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration] accordingly.

= LV monitoring =

Common info about the LV mounted at /mnt:

# volume.reiser4 /mnt

 ID:            Volume UUID
 volume:        ID of the plugin managing the volume
 distribution:  ID of the distribution plugin
 stripe:        Stripe size in bytes
 segments:      Number of hash space segments (for distribution)
 bricks total:  Total number of bricks in the volume
 bricks in DSA: Number of bricks participating in data distribution
 balanced:      Balanced status of the volume
 health:        Brick removal completion status

Info about any of its bricks, of index J:

# volume.reiser4 -p J /mnt

 internal ID:   Brick's "internal ID" and its status in the volume
 external ID:   Brick's UUID
 device name:   Name of the block device associated with the brick
 block count:   Size of the block device in blocks
 blocks used:   Total number of occupied blocks on the device
 system blocks: Minimal possible number of busy blocks on that device
 data capacity: Abstract capacity of the brick
 space usage:   Portion of occupied blocks on the device
 in DSA:        Participation in regular data distribution
 is proxy:      Participation in data tiering (Burst Buffers, etc)

Comment. When retrieving brick info, make sure that no volume operations on that volume are in progress; otherwise the command above will return an error (EBUSY).

WARNING:
Brick info obtained this way is not necessarily the most recent. To get up-to-date info, run sync(1) and make sure that no regular file operations are in progress.

= Checking free space =

To check the number of available free blocks on a volume mounted at /mnt, make sure that no regular file operations or volume operations are in progress on that volume, then run

# sync
# df --block-size=4K /mnt

To check the number of free blocks on the brick of index J, run

# volume.reiser4 -p J /mnt

and calculate the difference between "block count" and "blocks used".

Comment. Not all free blocks on a brick/volume are available for use. The number of available free blocks is always ~95% of the total number of free blocks (Reiser4 reserves 5% to make sure that regular file truncate operations won't fail). NOTE: volume.reiser4 shows the total number of free blocks, whereas df(1) shows the number of available free blocks. The "space usage" statistic shows the portion of busy blocks on an individual brick. For the reasons explained above, "space usage" on any brick cannot exceed 0.95.

= Checking quality of data distribution =

Quality of data distribution is a measure of the deviation of the real data space usage from the ideal one defined by the volume partitioning. The smaller the deviation, the better the distribution quality. Checking the quality of distribution makes sense only if your volume partitioning is space-based, or coincides with the space-based one. If your partitioning is throughput-based and doesn't coincide with the space-based one, the quality of the actual data distribution can be rather bad: in that case the file system takes care that low-performance devices don't become a bottleneck, and effective space usage is not a high priority. Checking the quality of data distribution is based on the free blocks accounting provided by the file system.
Note that the file system doesn't count busy data and meta-data blocks separately, so you cannot find the real data space usage - and hence check the quality of distribution - when the meta-data brick contains data blocks. To check the quality of distribution:

* make sure that the meta-data brick doesn't contain data blocks;
* make sure that no regular file or volume operations are currently in progress;
* find the "blocks used", "system blocks" and "data capacity" statistics for each data brick:

# sync
# volume.reiser4 -p 1 /mnt
...
# volume.reiser4 -p N /mnt

* find the real data space usage on each brick;
* calculate the partitioning and the ideal data space usage on each data brick;
* find the deviation of the real usage from the ideal one.

Example. Let's build an LV of 3 bricks (one 10G meta-data brick vdb1, and two data bricks: vdc1 (10G) and vdd1 (5G)) with space-based partitioning:

# VOL_ID=`uuid -v4`
# echo "Using uuid $VOL_ID"
# mkfs.reiser4 -U $VOL_ID -y -t 256K /dev/vdb1
# mkfs.reiser4 -U $VOL_ID -y -a -t 256K /dev/vdc1
# mkfs.reiser4 -U $VOL_ID -y -a -t 256K /dev/vdd1
# mount /dev/vdb1 /mnt

Fill the meta-data brick with data:

# dd if=/dev/zero of=/mnt/myfile bs=256K
No space left on device...

Add the data bricks /dev/vdc1 and /dev/vdd1 to the volume:

# volume.reiser4 -a /dev/vdc1 /mnt
# volume.reiser4 -a /dev/vdd1 /mnt

Move all data blocks to the newly added bricks:

# volume.reiser4 -r /dev/vdb1 /mnt
# sync

Now the meta-data brick contains no data blocks (only meta-data ones), so we can calculate the quality of data distribution:

# volume.reiser4 /mnt -p0
blocks used: 503
# volume.reiser4 /mnt -p1
blocks used: 1657203
system blocks: 115
data capacity: 2621069
# volume.reiser4 /mnt -p2
blocks used: 833001
system blocks: 73
data capacity: 1310391

Based on the statistics above, calculate the quality of distribution.
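The calculation can also be scripted; this sketch (plain Python) just automates the arithmetic, using the per-brick statistics printed above:

```python
# Quality of distribution from the per-brick statistics above:
# (blocks used, system blocks, data capacity) for each data brick.

def distribution_quality(stats):
    caps = [cap for (_, _, cap) in stats]
    total_cap = sum(caps)
    real = [used - system for (used, system, _) in stats]   # real data usage
    total_real = sum(real)
    ideal = [total_real * cap / total_cap for cap in caps]  # ideal data usage
    worst = max(abs(r - i) for r, i in zip(real, ideal))    # worst deviation
    return 1.0 - worst / total_real

q = distribution_quality([(1657203, 115, 2621069),   # brick of index 1
                          (833001,  73, 1310391)])   # brick of index 2
print(round(q, 4))   # 0.9988
```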
Total data capacity of the volume:

 C = 2621069 + 1310391 = 3931460

Relative capacities of the data bricks:

 C1 = 2621069 / 3931460 = 0.6667
 C2 = 1310391 / 3931460 = 0.3333

Real space usage on the data bricks (blocks used - system blocks):

 R1 = 1657203 - 115 = 1657088
 R2 = 833001 - 73 = 832928

Space usage on the volume:

 R = R1 + R2 = 1657088 + 832928 = 2490016

Ideal data space usage on the data bricks:

 I1 = C1 * R = 0.6667 * 2490016 = 1660094
 I2 = C2 * R = 0.3333 * 2490016 = 829922

Deviation:

 D = (R1, R2) - (I1, I2) = (-3006, 3006)

Relative deviation:

 D/R = (-0.0012, 0.0012)

Quality of distribution:

 Q = 1 - max(|D1|, |D2|)/R = 1 - 0.0012 = 0.9988

Comment. For any specified number of bricks N and quality of distribution Q it is possible to find a configuration of a logical volume composed of N bricks such that the quality of distribution on that volume is better than Q.

Comment. The quality of distribution Q doesn't depend on the number of bricks in the logical volume. This is a theorem, which can be strictly proven.

= FAQ =

'''What happens if I lose a device-component of my logical volume (due to a breakdown, etc)?'''

The bodies of some of your regular files will become "punched" in random places. The portion of such files depends on the relative capacity of the lost brick, on the number of bricks in the logical volume, and on other factors. Fsck will be able to detect and remove files with corrupted bodies. Nevertheless, we recommend mirroring your bricks (e.g. by software or hardware RAID-1) to avoid such highly unpleasant situations.

'''I am not able to complete brick removal because of "no space left on device". What should I do?'''

At the very beginning of the removal operation the file system calculates the precise amount of space on the other bricks needed to accommodate all the data of the brick you are going to remove.
It can happen, however, that the free space on those bricks was close to the calculated amount and was exceeded during the removal due to intensive regular file operations (like torrent activity) going on in parallel. To complete the operation, consider removing some regular files from your volume. In the future it will be possible to add a brick to a volume with incomplete brick removal, which will also allow the removal operation to complete.

[[category:Reiser4]]
For each volume its configuration should be stored somewhere (but not on that volume!) and properly updated before and after each volume operation performed on that volume. We make the user responsible for this. Volume configuration is needed to facilitate deploying a volume. '''Capacity of a brick''' (or abstract capacity) is a positive integer number. Capacity is a brick's property defined by user. Don't confuse it with the size of block device. Think of it as of brick's "weight" in some units. And this is the user, who decides, which property of the brick to assign as its abstract capacity and in which units. In particular, it can be size of the block device in kilobytes, or its size in megabytes, or its throughput in M/sec, or other geometric or physical parameter of the device, associated with the brick. It is important that capacities of all bricks of the same logical volume are measured in the same units. Also, it would be utterly pointless to assign different properties as abstract capacities for bricks of the same LV. For example, size of block device for one brick, and disk bandwidth for another one. Capacity of each brick gets initialized by mkfs utility. By default it is calculated as number of free blocks on the device at the very end of the formatting procedure. For meta-data brick it is calculated as 70% of such amount. Capacity of any brick can be changed on-line by user. '''Capacity of a logical volume''' is defined as a sum of capacities of its bricks-components. '''Relative capacity of a brick''' is the ratio of brick's capacity to volume's capacity. Relative capacity defines a portion of IO-requests that will be issued against that brick. Array of relative capacities (C1, C2, ...) of all bricks is called volume partitioning. Obviously, C1 + C2 + ... = 1. '''Real data space usage''' on a brick is number of data blocks, stored on that brick. 
'''Ideal data space usage''' on a brick is defined as T*C, where T is total number of data blocks stored in the volume. C is relative capacity of the brick. It is recommended to compose volumes in the way so that space-based partitioning coincides with throughput-based one - it would be the optimal volume configuration, which provides true parallelism. If it is impossible for some reason, then choose a preferred partitioning method (space-based, or throughput-based). Note that space-based partitioning saves volume space, whereas throughput based one saves volume throughput. When performing regular file operations, Reiser5 distributes data stripes throughout the volume evenly and fairly. It means that portion of IO-requests issued against each brick is equal to its relative capacity, that is, to the portion of capacity that the brick adds to the total volume's capacity. In contrast with regular file operations, volume operations break fairness of data distribution on your logical volume. To restore fairness of distribution, a special balancing procedure should be run on the volume. For example, after adding a brick to a logical volume, the balancing procedure will populate the new brick with data, moved from other bricks. All volume operations except brick removal are fast, atomic and leave the volume in unbalanced state. Operation of brick removal always includes balancing, which moves data from the brick you want to remove to other bricks of the volume. If that data migration is interrupted for some reason, then the volume is marked as a "volume with incomplete brick removal". It is allowed to perform regular file and volume operations on a not balanced LV (assuming, it was not incomplete removal). However, in this case we don't guarantee a good quality of data distribution on your LV. 
In addition, on a volume with incomplete removal you won't be able to perform regular volume operations - first you will need to complete the removal by running a special removal completion procedure on your volume. = Prepare Software and Hardware = Build, install and boot kernel with Reiser4 of software framework release number 5.X.Y. Kernel patches can be found [https://sourceforge.net/projects/reiser4/files/v5-unstable/ here]. Note that by Linux kernel and GNU utilities the testing stuff is still recognized as "Reiser4". Make sure there is the following message in kernel logs: "Loading Reiser4 (Software Framework Release: 5.X.Y)" Build and install the latest [https://sourceforge.net/projects/reiser4/files/reiser4-utils/libaal/ libaal] Download, build and install the latest version 2.A.B of [https://sourceforge.net/projects/reiser4/files/v5-unstable/ Reiser4progs package]. Make sure that utility for managing logical volumes is installed (as a part of reiser4progs package) on your machine: # volume.reiser4 -? = Creating a logical volume = Start from choosing a unique ID (uuid) of your volume. By default it is generated by mkfs utility. However, user can generate it himself by proper tools (e.g. uuidgen(1)) and store in an environment variable for convenience: # VOL_ID=`uuidgen` # echo "Using uuid $VOL_ID" Choose a stripe size for your logical volume. For a good quality of distribution it is recommended that stripe doesn't exceed 1/10000 of volume size. On the other hand, too small stripes will increase space consumption on your meta-data brick. In our example we choose stripe size 512K: # STRIPE=512K # echo "Using stripe size $STRIPE" Start from creating the first brick of your volume - meta-data brick, passing volume-ID and stripe size to mkfs.reiser4 utility: # mkfs.reiser4 -U $VOL_ID -t $STRIPE /dev/vdb1 Currently only one meta-data brick per volume is supported, so it is recommended that size of block device for meta-data brick in not too small. 
In most cases it is enough if your meta-data brick is not smaller than 1/200 of the maximal volume size. For example, a 100G meta-data brick will be able to service a ~20T logical volume. Data and meta-data bricks don't differ from the standpoint of disk format, and there is no special option to tell the mkfs utility that we want to create a meta-data brick specifically: the first brick in the volume automatically becomes the meta-data brick, and all other bricks are interpreted as data bricks.

Mount your initial logical volume consisting of one meta-data brick:

 # mount /dev/vdb1 /mnt

Find the record about your volume in the output of the following command:

 # volume.reiser4 -l

Create the configuration of your logical volume (its definition is above) and store it somewhere - but not on that volume! Your logical volume is now on-line and ready to use. You can perform regular file operations and volume operations (e.g. add a data brick to your LV).

= Adding a data brick to LV =

At any time you are able to add a data brick to your LV. You can do it in parallel with regular file operations executing on this volume. Make sure, however, that your volume is not marked as a "volume with incomplete brick removal" and that no other volume operation on your volume is in progress. Otherwise your operation will fail with EBUSY. Obviously, adding a brick will increase the capacity of your volume.

Choose a block device for the new data brick. Make sure that it is not too large or too small: capacities of any 2 bricks of the same logical volume can not differ by more than 2^19 (~500 thousand) times. E.g. your logical volume can not contain both 1M and 2T bricks. Any attempt to add a brick of improper capacity will fail with an error. Format the device with the same volume ID and stripe size as you used for the meta-data brick, but also specify the "-a" option (to not restrict data capacity).
 # mkfs.reiser4 -U $VOL_ID -t $STRIPE -a /dev/vdb2

It is important that the data brick is formatted with the same volume ID and stripe size as the meta-data brick of your logical volume. Otherwise the operation of adding the data brick will fail. Update item #4 of your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration] with the UUID or name of the brick you want to add. To add a brick, simply pass its name as an argument of the option "-a" and specify your LV via its mount point:

 # volume.reiser4 -a /dev/vdb2 /mnt

By default the operation of adding a brick is fast and atomic and leaves the volume in an unbalanced state, so after adding a brick you might want to run the balancing procedure, which will move a portion of data to the new brick from the other bricks of the logical volume, making data distribution on your volume fair:

 # volume.reiser4 -b /mnt

The portion of data blocks moved during such rebalancing is equal to the relative capacity of the new brick, that is, to the portion of capacity that the new brick adds to the updated LV's capacity. This important property defines the cost of the balancing procedure: if the portion of capacity added by a brick is small, then the number of stripes moved during balancing is also small. Using the add operation in conjunction with the option -B (--with-balance) will automatically trigger the balancing procedure:

 # volume.reiser4 -Ba /dev/vdb2 /mnt

Upon successful completion update your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]. That is, increment item #2, add info about the new brick to item #3, and remove the record at item #4. When adding more than one brick at once, call volume.reiser4 with option -a for each brick individually, in any order.
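The claim that balancing moves a portion of stripes equal to the new brick's relative capacity can be demonstrated with a toy capacity-weighted distribution model (weighted rendezvous hashing; this is only an illustration, not the distribution algorithm Reiser5 actually uses, and the brick names and capacities are made up):

```python
# Toy model (NOT the actual Reiser5 distribution algorithm): assign each
# stripe to a brick by capacity-weighted rendezvous hashing, then count
# how many stripes migrate when a brick is added to the volume.
import hashlib
import math

def pick_brick(stripe_id, bricks):
    """bricks: dict of brick name -> capacity. Returns the chosen brick."""
    best, best_score = None, None
    for name, capacity in bricks.items():
        digest = hashlib.sha256(f"{stripe_id}:{name}".encode()).digest()
        h = (int.from_bytes(digest[:8], "big") + 0.5) / 2.0**64  # uniform in (0,1)
        score = -capacity / math.log(h)  # weighted rendezvous score
        if best_score is None or score > best_score:
            best, best_score = name, score
    return best

old = {"vdb1": 100, "vdc1": 100}
new = dict(old, vdd1=100)  # the added brick has relative capacity 1/3
stripes = range(30000)
moved = sum(pick_brick(s, old) != pick_brick(s, new) for s in stripes)
print(f"moved fraction: {moved / 30000:.3f}")  # close to 1/3
```

A property of rendezvous hashing is that adding a brick leaves the scores of the existing bricks unchanged, so a stripe migrates only if the new brick wins it - hence the moved fraction matches the new brick's relative capacity.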
It is reasonable not to run balancing after each call; run it only after adding the last brick.

= Removing a data brick from LV =

At any time you are able to remove any data brick from your LV. You can do it in parallel with regular file operations executing on that volume. Make sure, however, that your volume is not marked as a "volume with incomplete brick removal" and that no other volume operation on your volume is in progress. Otherwise your operation will fail with EBUSY. Obviously, the removal operation will decrease the abstract capacity of your LV. Note that the other bricks should have enough space to store all data blocks of the brick you want to remove; otherwise the removal operation will return an error (ENOSPC).

Suppose you want to remove brick /dev/vdb2 from your LV mounted at /mnt. Update your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration] with the UUID and name of the brick you want to remove (item #4). To remove a brick, simply pass its name as an argument of the option "-r" and specify the logical volume by its mount point:

 # volume.reiser4 -r /dev/vdb2 /mnt

The procedure of brick removal starts by moving all data from the brick you want to remove to the other bricks of your volume, so that the resulting data distribution among the remaining bricks is also fair. The portion of data stripes moved during such migration is equal to the relative capacity of the brick to be removed (that is, to the portion of capacity that the brick added to the LV's capacity). Successful brick removal always leaves the volume in a balanced state. So, in contrast with the operation of adding a brick, removing a brick is a rather long operation, which can be interrupted for various reasons. In this case the volume will be marked as a "volume with incomplete brick removal".
To check the removal status of your LV, simply run

 # volume.reiser4 /mnt

and check the field "health". To complete brick removal in the current mount session, simply run

 # volume.reiser4 -R /mnt

Note that the option -R (--finish-removal) doesn't accept any arguments. On success, update your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]: remove the information about the brick /dev/vdb2 at items #3 and #4. Check your kernel logs: they should contain a message that brick /dev/vdb2 has been unregistered. Now the device /dev/vdb2 doesn't belong to the logical volume any more, and you can reuse it for other purposes (re-format it, etc).

= Changing brick's capacity =

At any time you can change the abstract capacity of any brick to any new non-zero value. You can do it in parallel with regular file operations executing on that volume. Make sure, however, that your volume is not marked as a "volume with incomplete brick removal" and that no other volume operation on your volume is in progress. Otherwise your operation will fail with EBUSY.

Changing capacity always changes the volume partitioning, and therefore breaks the fairness of data distribution on the volume. By default the operation of changing brick capacity leaves the volume in an unbalanced state, so after changing a brick's capacity you might want to run the balancing procedure to make data distribution on your volume fair. In particular, after increasing a brick's capacity the balancing procedure will move some data from other bricks to the brick whose capacity was increased. After decreasing a brick's capacity the balancing procedure will move some data from the brick whose capacity was decreased to other bricks. To change the abstract capacity of a brick /dev/vdb1 to a new value (e.g.
200000), simply run

 # volume.reiser4 -z /dev/vdb1 -c 200000 /mnt

pronounced as "resize brick /dev/vdb1 to new capacity 200000 in the volume mounted at /mnt". By default the operation of changing capacity is fast and atomic and leaves the volume in an unbalanced state. To automatically invoke balancing, use this operation in conjunction with the option -B (--with-balance). You can also run the balancing procedure later, at any time, by executing

 # volume.reiser4 -b /mnt

When changing the capacities of more than one brick at once, call the volume.reiser4 utility for each brick individually, in any order. It is reasonable not to run balancing after each call; run it only after changing the capacity of the last brick.

Comment. Changing a brick's capacity to 0 is undefined and will return an error. Consider the brick removal operation instead.

= Operations with meta-data brick =

The meta-data brick can also contain data stripes and participate in data distribution like the data bricks. All the volume operations described above are also applicable to the meta-data brick. Note, however, that it is impossible to completely remove the meta-data brick from the logical volume for obvious reasons (meta-data needs to be stored somewhere), so the brick removal operation applied to the meta-data brick actually removes it only from the Data Storage Array (DSA) - the subset of bricks of the LV that participate in regular data distribution according to their abstract capacities. Once you remove the meta-data brick from the DSA, that brick will be used only to store meta-data. The operation of adding a brick, applied to the meta-data brick, returns it to the DSA.

Important: Reiser5 doesn't count busy data and meta-data blocks separately. So, in contrast with data bricks (which contain only data), you are not able to find out the real space occupied by data blocks on the meta-data brick - Reiser5 knows only the total space occupied.
To check the status of the meta-data brick of the volume mounted at /mnt, simply run

 # volume.reiser4 -p0 /mnt

and check the field "in DSA".

= Unmounting a logical volume =

To terminate a mount session, just issue the usual umount against the mount point:

 # umount /mnt

Note that after unmounting the volume all its bricks by default remain registered in the system until system shutdown. If you want to unregister a brick before system shutdown, simply issue the following command:

 # volume.reiser4 -u BRICK_NAME

= Deploying a logical volume after correct unmount =

After unmounting a logical volume all its bricks remain registered in the system. So, if you want to mount the volume again, simply issue the mount command against one of its bricks. It is recommended to issue it against the meta-data brick. NOTE: Reiser5 will refuse to mount a logical volume when a wrong (incomplete or redundant) set of bricks is registered in the system. A redundant set of bricks appears, for example, when you mistakenly register a brick that was earlier removed from the logical volume.

= Deploying a logical volume after correct shutdown =

First of all, check the [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing configuration] of your volume and make sure that all its bricks (data and meta-data ones) are registered in the system. The list of registered bricks can be printed by

 # volume.reiser4 -l

Also make sure that the set of bricks registered for the volume doesn't contain bricks not mentioned in the [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]. Important: Reiser5 will refuse to mount a logical volume when a wrong (incomplete or redundant) set of bricks is registered in the system.
A redundant set of bricks appears, for example, when you mistakenly register a brick that was removed from the logical volume. For this reason we strongly recommend keeping track of your LV: store its [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing configuration] somewhere - but not on that volume! And don't forget to update that configuration after '''every''' volume operation. If you lose the configuration of your LV and don't remember it (which is most likely for large volumes), it will be rather painful to restore: currently there are no tools to manage logical volumes off-line. So users are expected to register bricks on their own. It is not at all difficult. To register a brick in the system, use the following command:

 # volume.reiser4 -g BRICK_NAME

To print the list of all registered bricks, use

 # volume.reiser4 -l

Now mount your LV by simply issuing a mount(8) command against one of the bricks of your LV. We recommend issuing it against the meta-data brick.

Comment. Reiser5 always tries to register the brick which is passed to the mount command as an argument, so it is not necessary to pre-register the brick you want to issue the mount command against.

= Deploying a logical volume after hard reset or system crash =

If no volume operations were interrupted by the hard reset or system crash, just follow the instructions in this [https://reiser4.wiki.kernel.org/index.php?title=Logical_Volumes_Administration#Deploying_a_logical_volume_after_correct_shutdown section]. In Reiser5 only a restricted number of bricks participates in each transaction; the maximal number of such bricks can be specified by the user. At mount time a transaction replay procedure will be launched on each such brick independently, in parallel.
Depending on the kind of interrupted volume operation, perform one of the following actions:

== Volume balancing was interrupted ==

Check your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]. Register the complete set of bricks and mount the volume by issuing the mount command against one of its bricks. Check the balanced status of your LV by running

 # volume.reiser4 /mnt

and checking the "balanced" value. If the volume is unbalanced, complete balancing by running

 # volume.reiser4 -b /mnt

== Brick removal was interrupted ==

Check your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]. Register the new set of bricks (that is, the set of bricks without the brick you wanted to remove). Try to mount the volume. In the case of an error, also register the brick you wanted to remove and try to mount again. Check the status of your LV by running

 # volume.reiser4 /mnt

and checking the value of "health". If required, complete the brick removal by running

 # volume.reiser4 -R /mnt

Note that the option -R doesn't accept any arguments. After successful removal completion the brick will be automatically removed from the volume and unregistered. Verify this by checking the status of your LV and the list of registered bricks:

 # volume.reiser4 /mnt
 # volume.reiser4 -l

Upon successful completion update your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration] accordingly.
= LV monitoring =

Common info about the LV mounted at /mnt:

 # volume.reiser4 /mnt

* ID: Volume UUID
* volume: ID of the plugin managing the volume
* distribution: ID of the distribution plugin
* stripe: Stripe size in bytes
* segments: Number of hash space segments (for distribution)
* bricks total: Total number of bricks in the volume
* bricks in DSA: Number of bricks participating in data distribution
* balanced: Balanced status of the volume
* health: Brick removal completion status

Info about the brick of index J:

 # volume.reiser4 -p J /mnt

* internal ID: Brick's "internal ID" and its status in the volume
* external ID: Brick's UUID
* device name: Name of the block device associated with the brick
* block count: Size of the block device in blocks
* blocks used: Total number of occupied blocks on the device
* system blocks: Minimal possible number of busy blocks on that device
* data capacity: Abstract capacity of the brick
* space usage: Portion of occupied blocks on the device
* in DSA: Participation in regular data distribution
* is proxy: Participation in data tiering (Burst Buffers, etc)

Comment. When retrieving brick info, make sure that no volume operations are in progress on that volume. Otherwise the command above will return an error (EBUSY). WARNING: brick info obtained this way is not necessarily the most recent. To get up-to-date info, run sync(1) and make sure that no regular file operations are in progress.

= Checking free space =

To check the number of available free blocks on a volume mounted at /mnt, make sure that no regular file operations, as well as no volume operations, are in progress on that volume, then run

 # sync
 # df --block-size=4K /mnt

To check the number of free blocks on the brick of index J, run

 # volume.reiser4 -p J /mnt

then calculate the difference between "block count" and "blocks used". Comment. Not all free blocks on a brick/volume are available for use.
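This free-block arithmetic can be sketched as a plain calculation (not a Reiser5 tool; the sample block numbers are made up, and the 5% reserve factor reflects the note on available free blocks):

```python
# Sketch of the free-space arithmetic (the numbers are hypothetical).
# Free blocks on a brick: "block count" minus "blocks used".
# Available blocks: ~95% of free blocks (Reiser4 reserves 5%).

def free_blocks(block_count, blocks_used):
    """Total number of free blocks on a brick."""
    return block_count - blocks_used

def available_blocks(block_count, blocks_used, reserve=0.05):
    """Free blocks actually available for use, after the reserve."""
    return int(free_blocks(block_count, blocks_used) * (1 - reserve))

print(free_blocks(2621440, 1657203))       # 964237
print(available_blocks(2621440, 1657203))  # 916025
```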
The number of available free blocks is always ~95% of the total number of free blocks (Reiser4 reserves 5% to make sure that regular file truncate operations won't fail). NOTE: volume.reiser4 shows the total number of free blocks, whereas df(1) shows the number of available free blocks. The "space usage" statistic shows the portion of busy blocks on an individual brick. For the reasons explained above, "space usage" on any brick can not be more than 0.95.

= Checking quality of data distribution =

Quality of data distribution is a measure of the deviation of the real data space usage from the ideal one defined by the volume partitioning. The smaller the deviation, the better the distribution quality. Checking the quality of distribution makes sense only when your volume partitioning is space-based, or coincides with the space-based one. If your partitioning is throughput-based and doesn't coincide with the space-based one, then the quality of the actual data distribution can be rather bad: in this case the file system takes care that low-performance devices don't become a bottleneck, and effective space usage is not a high priority.

Checking the quality of data distribution is based on the free blocks accounting provided by the file system. Note that the file system doesn't count busy data and meta-data blocks separately, so you are not able to find the real data space usage - and hence to check the quality of distribution - in the case when the meta-data brick contains data blocks. To check the quality of distribution:

* make sure that the meta-data brick doesn't contain data blocks;
* make sure that no regular file and volume operations are currently in progress;
* find the "blocks used", "system blocks" and "data capacity" statistics for each data brick:

 # sync
 # volume.reiser4 -p 1 /mnt
 ...
 # volume.reiser4 -p N /mnt

* find the real data space usage on each brick;
* calculate the partitioning and the ideal data space usage on each data brick;
* find the deviation of the real usage from the ideal one.

Example.
Let's build an LV of 3 bricks (one 10G meta-data brick /dev/vdb1, and two data bricks: /dev/vdc1 (10G) and /dev/vdd1 (5G)) with space-based partitioning:

 # VOL_ID=`uuid -v4`
 # echo "Using uuid $VOL_ID"
 # mkfs.reiser4 -U $VOL_ID -y -t 256K /dev/vdb1
 # mkfs.reiser4 -U $VOL_ID -y -a -t 256K /dev/vdc1
 # mkfs.reiser4 -U $VOL_ID -y -a -t 256K /dev/vdd1
 # mount /dev/vdb1 /mnt

Fill the meta-data brick with data:

 # dd if=/dev/zero of=/mnt/myfile bs=256K
 No space left on device...

Add data bricks /dev/vdc1 and /dev/vdd1 to the volume:

 # volume.reiser4 -a /dev/vdc1 /mnt
 # volume.reiser4 -a /dev/vdd1 /mnt

Move all data blocks to the newly added bricks:

 # volume.reiser4 -r /dev/vdb1 /mnt
 # sync

Now the meta-data brick doesn't contain data blocks (only meta-data ones), so we can calculate the quality of data distribution:

 # volume.reiser4 /mnt -p0
 blocks used: 503
 # volume.reiser4 /mnt -p1
 blocks used: 1657203
 system blocks: 115
 data capacity: 2621069
 # volume.reiser4 /mnt -p2
 blocks used: 833001
 system blocks: 73
 data capacity: 1310391

Based on the statistics above, calculate the quality of distribution.

Total data capacity of the volume:
 C = 2621069 + 1310391 = 3931460
Relative capacities of the data bricks:
 C1 = 2621069 / 3931460 = 0.6667
 C2 = 1310391 / 3931460 = 0.3333
Real space usage on the data bricks (blocks used - system blocks):
 R1 = 1657203 - 115 = 1657088
 R2 = 833001 - 73 = 832928
Space usage on the volume:
 R = R1 + R2 = 1657088 + 832928 = 2490016
Ideal data space usage on the data bricks:
 I1 = C1 * R = 0.6667 * 2490016 = 1660094
 I2 = C2 * R = 0.3333 * 2490016 = 829922
Deviation:
 D = (R1 - I1, R2 - I2) = (-3006, 3006)
Relative deviation:
 D/R = (-0.0012, 0.0012)
Quality of distribution:
 Q = 1 - max(|D1|, |D2|)/R = 1 - 0.0012 = 0.9988

Comment. For any specified number of bricks N and quality of distribution Q, it is possible to find a configuration of a logical volume composed of N bricks such that the quality of distribution on that volume is better than Q.
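The arithmetic of this example can be reproduced with a short script (just the calculation over the numbers above, not a Reiser5 utility):

```python
# Reproduce the distribution-quality calculation from the example above.
capacities  = [2621069, 1310391]   # "data capacity" of the data bricks
blocks_used = [1657203, 833001]    # "blocks used"
system      = [115, 73]            # "system blocks"

real  = [u - s for u, s in zip(blocks_used, system)]  # R1, R2
total = sum(real)                                     # R
whole = sum(capacities)
ideal = [c / whole * total for c in capacities]       # I1, I2
deviation = [(r - i) / total for r, i in zip(real, ideal)]
quality = 1 - max(abs(d) for d in deviation)
print(real, total)        # [1657088, 832928] 2490016
print(round(quality, 4))  # 0.9988
```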
Quality of distribution Q doesn't depend on the number of bricks in the logical volume. This is a theorem, which can be strictly proven.

= FAQ =

Q. What happens if I lose a device-component (due to a breakdown, etc) of my logical volume?

A. The bodies of some of your regular files will become "punched" in random places. The portion of such files depends on the relative capacity of the lost brick, on the number of bricks in the logical volume, and on other factors. Fsck will be able to detect and remove such files with corrupted bodies. Nevertheless, we recommend considering mirroring your bricks (e.g. by software or hardware RAID-1) to avoid such highly unpleasant situations.

[[category:Reiser4]]

Before working with logical volumes you need to understand some basic [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Background principles]. A logical volume (LV) can be composed of any number of block devices, different in physical and geometric parameters. However, the optimal configuration (true parallelism) imposes some restrictions and dependencies on the sizes of such devices.

WARNING: This stuff is not stable. Don't put important data on logical volumes managed by software of release number 5.X.Y. Also don't mount your old partitions in kernels with Reiser4 of SFRN 5.X.Y before its stabilization.

IMPORTANT: Currently there are no tools to manage Reiser5 logical volumes off-line, so it is strongly recommended to save/update the configuration of your LV in a file which doesn't belong to that volume.

= Basic definitions. Volume configuration. Brick's capacity. Partitioning. Fair distribution. Balancing =

The basic configuration of a logical volume is the following information:

1) Volume UUID;
2) Number of bricks in the volume;
3) List of brick names or UUIDs in the volume;
4) UUID or name of the brick to be added/removed (if any).
That brick is not counted in (2) and (3). Item #4 is needed to handle incomplete operations, interrupted for various reasons (system crash, hard reset, etc), when bringing logical volumes on-line. For each volume, its configuration should be stored somewhere (but not on that volume!) and properly updated before and after each volume operation performed on that volume. We make the user responsible for this. The volume configuration is needed to facilitate deploying a volume.

'''Capacity of a brick''' (or abstract capacity) is a positive integer number. Capacity is a brick's property defined by the user. Don't confuse it with the size of the block device. Think of it as the brick's "weight" in some units. It is the user who decides which property of the brick to assign as its abstract capacity, and in which units. In particular, it can be the size of the block device in kilobytes, or its size in megabytes, or its throughput in MB/sec, or another geometric or physical parameter of the device associated with the brick. It is important that the capacities of all bricks of the same logical volume are measured in the same units. Also, it would be utterly pointless to assign different properties as abstract capacities for bricks of the same LV - for example, the size of the block device for one brick, and disk bandwidth for another one.

The capacity of each brick gets initialized by the mkfs utility. By default it is calculated as the number of free blocks on the device at the very end of the formatting procedure. For the meta-data brick it is calculated as 70% of that amount. The capacity of any brick can be changed on-line by the user.

'''Capacity of a logical volume''' is defined as the sum of the capacities of its component bricks.

'''Relative capacity of a brick''' is the ratio of the brick's capacity to the volume's capacity. Relative capacity defines the portion of IO requests that will be issued against that brick. The array of relative capacities (C1, C2, ...) of all bricks is called the volume partitioning. Obviously, C1 + C2 + ... = 1.
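A small hypothetical example of partitioning (the brick names and capacity values are made up):

```python
# Volume partitioning: the array of relative capacities of all bricks.
capacities = {"vdb1": 100, "vdc1": 50, "vdd1": 50}  # hypothetical units
whole = sum(capacities.values())
partitioning = {name: c / whole for name, c in capacities.items()}
print(partitioning)  # {'vdb1': 0.5, 'vdc1': 0.25, 'vdd1': 0.25}
assert abs(sum(partitioning.values()) - 1.0) < 1e-9  # C1 + C2 + ... = 1
```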
'''Real data space usage''' on a brick is the number of data blocks stored on that brick.
In addition, on a volume with incomplete removal you won't be able to perform regular volume operations - first you will need to complete the removal by running a special removal completion procedure on your volume. = Prepare Software and Hardware = Build, install and boot kernel with Reiser4 of software framework release number 5.X.Y. Kernel patches can be found [https://sourceforge.net/projects/reiser4/files/v5-unstable/ here]. Note that by Linux kernel and GNU utilities the testing stuff is still recognized as "Reiser4". Make sure there is the following message in kernel logs: "Loading Reiser4 (Software Framework Release: 5.X.Y)" Build and install the latest [https://sourceforge.net/projects/reiser4/files/reiser4-utils/libaal/ libaal] Download, build and install the latest version 2.A.B of [https://sourceforge.net/projects/reiser4/files/v5-unstable/ Reiser4progs package]. Make sure that utility for managing logical volumes is installed (as a part of reiser4progs package) on your machine: # volume.reiser4 -? = Creating a logical volume = Start from choosing a unique ID (uuid) of your volume. By default it is generated by mkfs utility. However, user can generate it himself by proper tools (e.g. uuidgen(1)) and store in an environment variable for convenience: # VOL_ID=`uuidgen` # echo "Using uuid $VOL_ID" Choose a stripe size for your logical volume. For a good quality of distribution it is recommended that stripe doesn't exceed 1/10000 of volume size. On the other hand, too small stripes will increase space consumption on your meta-data brick. In our example we choose stripe size 512K: # STRIPE=512K # echo "Using stripe size $STRIPE" Start from creating the first brick of your volume - meta-data brick, passing volume-ID and stripe size to mkfs.reiser4 utility: # mkfs.reiser4 -U $VOL_ID -t $STRIPE /dev/vdb1 Currently only one meta-data brick per volume is supported, so it is recommended that size of block device for meta-data brick in not too small. 
In most cases it will be enough, if your meta-data brick is not smaller than 1/200 of maximal volume size. For example, 100G meta-data brick will be able to service ~20T logical volume. Data and meta-data bricks don't differ from the standpoint of disk format, and there is no special option to inform mkfs utility that we want to create exactly meta-data brick: the first brick in the volume automatically becomes a meta-data brick, and other bricks are interpreted as data bricks. Mount your initial logical volume consisting of one meta-data brick: # mount /dev/vdb1 /mnt Find a record about your volume in the output of the following command: # volume.reiser4 -l Create configuration of your logical volume (its definition is above) and store it somewhere, but not on that volume! Your logical volume is now on-line and ready to use. You can perform regular file operations and volume operations (e.g. add a data brick to your LV). = Adding a data brick to LV = At any time you are able to add a data brick to your LV. You can do it in parallel with regular file operations executing on this volume. Make sure, however, that your volume is not marked as "volume with incomplete brick removal", or there is no other volume operations over your volume in progress. Otherwise your operation will fail with EBUSY. Obviously, adding a brick will increase capacity of your volume. Choose a block device for the new data brick. Make sure that it is not too large, or too small. Capacities of any 2 bricks of the same logical volume can not differ more than 2^19 (~1 million) times. E.g. your logical volume can not contain both, 1M and 2T bricks. Any attempts to add a brick of improper capacity will fail with error. Format it with the same volume ID and stripe size, as you used for meta-data brick, but specify also "-a" option (to not restrict data capacity). 
# mkfs.reiser4 -U $VOL_ID -t $STRIPE -a /dev/vdb2 It is important that data brick is formatted with the same volume ID and stripe size, as the meta-data brick of your logical volume. Otherwise, operation of adding a data brick will fail. Update item #4 of your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration] with UUID or name of the brick you want to add. To add a brick simply pass its name as an argument for the option "-a" and specify your LV via its mount point: # volume.reiser4 -a /dev/vdb2 /mnt By default operation of adding a brick is fast and atomic and leaves the volume in unbalanced state, so after adding a brick you might want to run a balancing procedure, which will move a portion of data to the new brick from other bricks of the logical volume, which will make data distribution on your volume fair: # volume.reiser4 -b /mnt Portion of data blocks, being moved during such rebalancing, is equal to relative capacity of the new brick, that is to the portion of capacity that the new brick adds to updated LV's capacity. This important property defines the cost of balancing procedure. If the portion of capacity added by a brick is small, then number of stripes moved during balancing is also small. Using this operation in conjunction with the option -B (--with-balance) will automatically trigger the balancing procedure: # volume.reiser4 -Ba /dev/vdb2 /mnt Upon successful completion update your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]. That is, increment (#2), add info about the new brick to (#3) and remove records at (#4). When adding more than one brick at once, call volume.reiser4 with option -a for each brick individually in any order. 
It is reasonable not to complete each call with balancing; run balancing only once, after adding the last brick.

= Removing a data brick from LV =

At any time you can remove any data brick from your LV. You can do it in parallel with regular file operations executing on that volume. Make sure, however, that your volume is not marked as a "volume with incomplete brick removal" and that no other volume operations are in progress on it. Otherwise the operation will fail with EBUSY. Obviously, the removal operation decreases the abstract capacity of your LV. Note that the other bricks should have enough space to store all data blocks of the brick you want to remove; otherwise the removal operation will return an error (ENOSPC). Suppose you want to remove brick /dev/vdb2 from your LV mounted at /mnt. Update your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration] with the UUID and name of the brick you want to remove (item #4). To remove a brick, simply pass its name as an argument to option "-r" and specify the logical volume by its mount point:

 # volume.reiser4 -r /dev/vdb2 /mnt

The procedure of brick removal starts by moving all data from the brick you want to remove to the other bricks of your volume, so that the resulting data distribution among the remaining bricks is also fair. The portion of data stripes moved during such migration is equal to the relative capacity of the brick to be removed (that is, to the portion of capacity that the brick added to the LV's capacity). Successful brick removal always leaves the volume in balanced state. So, in contrast with the operation of adding a brick, removing a brick is a rather long operation, which can be interrupted for various reasons. In this case the volume will be marked as a "volume with incomplete brick removal".
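The migration cost stated above - the portion of data moved by balancing equals the relative capacity of the brick being added or removed - can be sketched as follows. This is an illustrative calculation only, not part of reiser4progs; the helper name is our own.

```python
# Illustrative sketch: estimate the portion of data stripes migrated by
# balancing after adding or removing a brick. Per the rule on this page,
# the moved portion equals the brick's relative capacity.

def moved_portion(brick_capacity, other_capacities):
    """Fraction of data migrated when a brick of `brick_capacity` joins
    or leaves a volume whose other bricks have `other_capacities`
    (all capacities in the same units)."""
    total = brick_capacity + sum(other_capacities)
    return brick_capacity / total

# Adding (or removing) a 1T brick on a volume with 4T and 5T bricks
# migrates ~10% of the data:
print(moved_portion(1, [4, 5]))   # 0.1
```

The same number applies in both directions: adding the brick moves that fraction onto it, removing the brick moves that fraction off it.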
To check the removal status of your LV, simply run

 # volume.reiser4 /mnt

and check the field "health". To complete brick removal in the current mount session, simply run

 # volume.reiser4 -R /mnt

Note that the option -R (--finish-removal) doesn't accept any arguments. On success, update your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]: remove the information about the brick /dev/vdb2 at #3 and #4. Check your kernel logs: they should contain a message that brick /dev/vdb2 has been unregistered. Now the device /dev/vdb2 doesn't belong to the logical volume any more, and you can reuse it for other purposes (re-format, etc).

= Changing brick's capacity =

At any time (assuming that your volume is not marked as a "volume with incomplete brick removal") you can change the abstract capacity of any brick to some new value different from 0. Changing capacity always changes volume partitioning, and therefore breaks fairness of data distribution on the volume. By default the operation of changing brick capacity leaves the volume in unbalanced state, so after changing brick capacity you might want to run a balancing procedure to make data distribution on your volume fair. In particular, after increasing a brick's capacity the balancing procedure will move some data from other bricks to the brick whose capacity was increased. After decreasing a brick's capacity the balancing procedure will move some data from the brick whose capacity was decreased to other bricks. To change the abstract capacity of a brick /dev/vdb1 to a new value (e.g. 200000), simply run

 # volume.reiser4 -z /dev/vdb1 -c 200000 /mnt

pronounced as "resize brick /dev/vdb1 to new capacity 200000 in the volume mounted at /mnt". By default the operation of changing capacity is fast, atomic and leaves the volume in unbalanced state.
To automatically invoke balancing, use this operation in conjunction with the option -B (--with-balance). You can also run a balancing procedure later, at any time, by executing

 # volume.reiser4 -b /mnt

When changing capacities of more than one brick at once, call the volume.reiser4 utility for each brick individually, in any order. It is reasonable not to complete each call with balancing; run balancing once, after changing the capacity of the last brick. Comment: changing a brick's capacity to 0 is undefined and will return an error. Consider the brick removal operation instead.

= Operations with meta-data brick =

The meta-data brick can also contain data stripes and participate in data distribution like the data bricks. All the volume operations described above are also applicable to the meta-data brick. Note, however, that it is impossible to completely remove the meta-data brick from the logical volume, for obvious reasons (meta-data need to be stored somewhere). So the brick removal operation applied to the meta-data brick actually removes it only from the Data-Storage Array (DSA) - the subset of the LV consisting of the bricks which participate in regular data distribution according to their abstract capacities. Once you remove the meta-data brick from the DSA, that brick will be used only to store meta-data. The operation of adding a brick, applied to the meta-data brick, returns it back to the DSA. Important: Reiser5 doesn't count busy data and meta-data blocks separately. So, in contrast with data bricks (which contain only data), you are not able to find out the real space occupied by data blocks on the meta-data brick - Reiser5 knows only the total space occupied. To check the status of the meta-data brick of the volume mounted at /mnt, simply run

 # volume.reiser4 -p0 /mnt

and check the field "in DSA".
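The asymmetry between data bricks and the meta-data brick under removal can be summarized in a small conceptual model. This is a hypothetical sketch of the rule described above, not reiser4progs code: a data brick leaves the volume entirely, while the meta-data brick merely leaves the DSA and keeps storing meta-data.

```python
# Conceptual model (hypothetical, not reiser4progs code) of the DSA rule:
# removing the meta-data brick only excludes it from the Data-Storage
# Array; removing a data brick takes it out of the volume completely.

class Brick:
    def __init__(self, name, is_metadata=False):
        self.name = name
        self.is_metadata = is_metadata
        self.in_dsa = True   # bricks participate in distribution by default

def remove_brick(volume, brick):
    """Model of brick removal as described on this page."""
    if brick.is_metadata:
        brick.in_dsa = False   # stays in the volume, holds meta-data only
    else:
        volume.remove(brick)   # leaves the volume entirely

vol = [Brick("/dev/vdb1", is_metadata=True), Brick("/dev/vdb2")]
remove_brick(vol, vol[0])
print(len(vol), vol[0].in_dsa)   # 2 False
```

Applying the add operation to the meta-data brick would simply set `in_dsa` back to True, mirroring "returns it back to the DSA" above.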
= Unmounting a logical volume =

To terminate a mount session, just issue the usual umount against the mount point:

 # umount /mnt

Note that after unmounting the volume all bricks by default remain registered in the system until system shutdown. If you want to unregister a brick before system shutdown, simply issue the following command:

 # volume.reiser4 -u BRICK_NAME

= Deploying a logical volume after correct unmount =

After unmounting a logical volume all its bricks remain registered in the system. So, if you want to mount the volume again, simply issue the mount command against one of its bricks. It is recommended to issue it against the meta-data brick. NOTE: Reiser5 will refuse to mount a logical volume in the case when a wrong (incomplete or redundant) set of bricks is registered in the system. A redundant set of bricks appears, for example, when you mistakenly register a brick that was earlier removed from the logical volume.

= Deploying a logical volume after correct shutdown =

First of all, check the [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing configuration] of your volume and make sure that all its bricks (data and meta-data ones) are registered in the system. The list of registered bricks can be printed by

 # volume.reiser4 -l

Also make sure that the set of registered per-volume bricks doesn't contain bricks not mentioned in the [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]. Important: Reiser5 will refuse to mount a logical volume in the case when a wrong (incomplete or redundant) set of bricks is registered in the system. A redundant set of bricks appears, for example, when you mistakenly register a brick that was removed from the logical volume.
For these reasons we strongly recommend that users keep track of their LV - store its [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing configuration] somewhere, but not on that volume! And don't forget to update that configuration after '''every''' volume operation. If you lose the configuration of your LV and don't remember it (which is most likely for large volumes), then it will be rather painful to restore: currently there are no tools to manage logical volumes off-line, so users have to do this on their own. It is not at all difficult. To register a brick in the system, use the following command:

 # volume.reiser4 -g BRICK_NAME

To print the list of all registered bricks, use

 # volume.reiser4 -l

Now mount your LV, simply issuing a mount(8) command against one of the bricks of your LV. We recommend to issue it against the meta-data brick. Comment: Reiser5 always tries to register the brick which is passed to the mount command as an argument, so it is not necessary to preregister the brick you want to issue the mount command against.

= Deploying a logical volume after hard reset or system crash =

If no volume operations were interrupted by the hard reset or system crash, then just follow the instructions in this [https://reiser4.wiki.kernel.org/index.php?title=Logical_Volumes_Administration#Deploying_a_logical_volume_after_correct_shutdown section]. In Reiser5 only a restricted number of bricks participate in every transaction. The maximal number of such bricks can be specified by the user. At mount time a transaction replay procedure will be launched on each such brick independently, in parallel.
Depending on the kind of interrupted volume operation, perform one of the following actions:

== Volume balancing was interrupted ==

Check your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]. Register the complete set of bricks and mount the volume by issuing the mount command against one of its bricks. Check the balanced status of your LV by running

 # volume.reiser4 /mnt

and checking the "balanced" value. If the volume is unbalanced, then complete balancing by running

 # volume.reiser4 -b /mnt

== Brick removal was interrupted ==

Check your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]. Register the new set of bricks (that is, the set of bricks without the brick you wanted to remove). Try to mount the volume. In case of error, register also the brick you wanted to remove and try to mount again. Check the status of your LV by running

 # volume.reiser4 /mnt

and checking the value of "health". If required, complete brick removal by running

 # volume.reiser4 -R /mnt

Note that the option -R doesn't accept any arguments. After successful removal completion the brick will be automatically removed from the volume and unregistered. Make sure of it by checking the status of your LV and the list of registered bricks:

 # volume.reiser4 /mnt
 # volume.reiser4 -l

Upon successful completion update your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration] accordingly.
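Since every recovery path above starts from the saved volume configuration, it pays to keep it in a machine-readable file off the volume. Below is a minimal sketch of such bookkeeping for the four configuration items defined on this page; the file layout and helper names are our own invention, not a reiser4progs format.

```python
# Minimal sketch (our own format, not reiser4progs) of storing the four
# volume-configuration items in a file that does NOT live on the volume.
import json

def save_config(path, volume_uuid, bricks, pending_brick=None):
    """Write the volume configuration to `path` and return it."""
    cfg = {
        "volume_uuid": volume_uuid,      # item 1: volume UUID
        "brick_count": len(bricks),      # item 2: number of bricks
        "bricks": bricks,                # item 3: brick names/UUIDs
        "pending_brick": pending_brick,  # item 4: brick being added/removed
    }
    with open(path, "w") as f:
        json.dump(cfg, f, indent=2)
    return cfg

# Update the file before and after every volume operation:
cfg = save_config("/tmp/lv.json", "EXAMPLE-UUID", ["/dev/vdb1", "/dev/vdb2"])
print(cfg["brick_count"])   # 2
```

Before an add or remove operation you would rewrite the file with `pending_brick` set; after the operation completes, rewrite it again with the final brick list and `pending_brick=None`.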
= LV monitoring =

Common info about the LV mounted at /mnt:

 # volume.reiser4 /mnt

* ID: Volume UUID
* volume: ID of the plugin managing the volume
* distribution: ID of the distribution plugin
* stripe: Stripe size in bytes
* segments: Number of hash space segments (for distribution)
* bricks total: Total number of bricks in the volume
* bricks in DSA: Number of bricks participating in data distribution
* balanced: Balanced status of the volume
* health: Brick removal completion status

Info about any of its bricks of index J:

 # volume.reiser4 -p J /mnt

* internal ID: Brick's "internal ID" and its status in the volume
* external ID: Brick's UUID
* device name: Name of the block device associated with the brick
* block count: Size of the block device in blocks
* blocks used: Total number of occupied blocks on the device
* system blocks: Minimal possible number of busy blocks on that device
* data capacity: Abstract capacity of the brick
* space usage: Portion of occupied blocks on the device
* in DSA: Participation in regular data distribution
* is proxy: Participation in data tiering (Burst Buffers, etc)

Comment: when retrieving a brick's info, make sure that no volume operations on that volume are in progress; otherwise the command above will return an error (EBUSY). WARNING: brick info obtained this way is not necessarily the most recent. To get up-to-date info, run sync(1) and make sure that no regular file operations are in progress.

= Checking free space =

To check the number of available free blocks on a volume mounted at /mnt, make sure that no regular file operations or volume operations are in progress on that volume, then run

 # sync
 # df --block-size=4K /mnt

To check the number of free blocks on the brick of index J, run

 # volume.reiser4 -p J /mnt

and calculate the difference between "block count" and "blocks used". Comment: not all free blocks on a brick/volume are available for use.
The number of available free blocks is always ~95% of the total number of free blocks (Reiser4 reserves 5% to make sure that regular file truncate operations won't fail). NOTE: volume.reiser4 shows the total number of free blocks, whereas df(1) shows the number of available free blocks. The "space usage" statistic shows the portion of busy blocks on an individual brick. For the reasons explained above, "space usage" on any brick can not be more than 0.95.

= Checking quality of data distribution =

Quality of data distribution is a measure of the deviation of the real data space usage from the ideal one defined by the volume partitioning. The smaller the deviation, the better the distribution quality. Checking quality of distribution makes sense only in the case when your volume partitioning is space-based, or when it coincides with the space-based one. If your partitioning is throughput-based and doesn't coincide with the space-based one, then the quality of actual data distribution can be rather bad: in this case the file system takes care that low-performance devices do not become a bottleneck, and effective space usage is not a high priority. Checking quality of data distribution is based on the free blocks accounting provided by the file system. Note that the file system doesn't count busy data and meta-data blocks separately, so you are not able to find the real data space usage, and hence to check the quality of distribution, in the case when the meta-data brick contains data blocks. To check quality of distribution:

* make sure that the meta-data brick doesn't contain data blocks;
* make sure that no regular file and volume operations are currently in progress;
* find the "blocks used", "system blocks" and "data capacity" statistics for each data brick:

 # sync
 # volume.reiser4 -p 1 /mnt
 ...
 # volume.reiser4 -p N /mnt

* find the real data space usage on each brick;
* calculate the partitioning and the ideal data space usage on each data brick;
* find the deviation of (4) from (5).

Example.
Let's build an LV of 3 bricks (one 10G meta-data brick vdb1, and two data bricks: vdc1 (10G) and vdd1 (5G)) with space-based partitioning:

 # VOL_ID=`uuid -v4`
 # echo "Using uuid $VOL_ID"
 # mkfs.reiser4 -U $VOL_ID -y -t 256K /dev/vdb1
 # mkfs.reiser4 -U $VOL_ID -y -a -t 256K /dev/vdc1
 # mkfs.reiser4 -U $VOL_ID -y -a -t 256K /dev/vdd1
 # mount /dev/vdb1 /mnt

Fill the meta-data brick with data:

 # dd if=/dev/zero of=/mnt/myfile bs=256K
 No space left on device...

Add data bricks /dev/vdc1 and /dev/vdd1 to the volume:

 # volume.reiser4 -a /dev/vdc1 /mnt
 # volume.reiser4 -a /dev/vdd1 /mnt

Move all data blocks to the newly added bricks:

 # volume.reiser4 -r /dev/vdb1 /mnt
 # sync

Now the meta-data brick doesn't contain data blocks (only meta-data ones), so we can calculate the quality of data distribution:

 # volume.reiser4 /mnt -p0
 blocks used: 503
 # volume.reiser4 /mnt -p1
 blocks used: 1657203
 system blocks: 115
 data capacity: 2621069
 # volume.reiser4 /mnt -p2
 blocks used: 833001
 system blocks: 73
 data capacity: 1310391

Based on the statistics above, calculate the quality of distribution. Total data capacity of the volume:

 C = 2621069 + 1310391 = 3931460

Relative capacities of the data bricks:

 C1 = 2621069 / 3931460 = 0.6667
 C2 = 1310391 / 3931460 = 0.3333

Real space usage on the data bricks (blocks used - system blocks):

 R1 = 1657203 - 115 = 1657088
 R2 = 833001 - 73 = 832928

Space usage on the volume:

 R = R1 + R2 = 1657088 + 832928 = 2490016

Ideal data space usage on the data bricks:

 I1 = C1 * R = 0.6667 * 2490016 = 1660094
 I2 = C2 * R = 0.3333 * 2490016 = 829922

Deviation:

 D = (R1, R2) - (I1, I2) = (-3006, 3006)

Relative deviation:

 D/R = (-0.0012, 0.0012)

Quality of distribution:

 Q = 1 - max(|D1|, |D2|)/R = 1 - 0.0012 = 0.9988

Comment: for any specified number of bricks N and quality of distribution Q it is possible to find a configuration of a logical volume composed of N bricks such that the quality of distribution on that volume will be better than Q.
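The hand calculation above can be automated. The following sketch (our own helper, not a reiser4progs tool) computes the quality of distribution directly from the per-brick statistics reported by volume.reiser4:

```python
# Sketch of the quality-of-distribution calculation performed by hand
# above. Input: per-brick (blocks used, system blocks, data capacity)
# statistics as reported by `volume.reiser4 -p J`.

def distribution_quality(bricks):
    """bricks: list of (blocks_used, system_blocks, data_capacity)."""
    caps = [c for _, _, c in bricks]
    real = [u - s for u, s, _ in bricks]   # real data space usage per brick
    total_cap = sum(caps)
    total_real = sum(real)
    # relative deviation of real usage from the ideal, capacity-proportional one
    dev = [abs(r / total_real - c / total_cap) for r, c in zip(real, caps)]
    return 1 - max(dev)

# Statistics of data bricks p1 and p2 from the example above:
q = distribution_quality([(1657203, 115, 2621069), (833001, 73, 1310391)])
print(round(q, 4))   # 0.9988
```

The expression `r/total_real - c/total_cap` is the same quantity as D/R above, since the ideal usage of a brick is its relative capacity times the total real usage.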
Comment: the quality of distribution Q doesn't depend on the number of bricks in the logical volume. This is a theorem, which can be strictly proven.

= FAQ =

Q. What happens if I lose a device-component (due to a breakdown, etc) of my logical volume?

A. The bodies of some of your regular files will become "punched" in random places. The portion of such files depends on the relative capacity of the lost brick, on the number of bricks in the logical volume, and on other factors. Fsck will be able to detect and remove such files with corrupted bodies. Nevertheless, we recommend considering mirroring your bricks (e.g. by software or hardware RAID-1) to avoid such highly unpleasant situations.

[[category:Reiser4]]

Before working with logical volumes you need to understand some basic [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Background principles]. A logical volume (LV) can be composed of any number of block devices, different in physical and geometric parameters. However, the optimal configuration (true parallelism) imposes some restrictions and dependencies on the size of such devices. WARNING: this stuff is not stable. Don't put important data on logical volumes managed by software of release number 5.X.Y. Also don't mount your old partitions in kernels with Reiser4 of SFRN 5.X.Y before its stabilization. IMPORTANT: currently there are no tools to manage Reiser5 logical volumes off-line, so it is strongly recommended to save/update the configuration of your LV in a file which doesn't belong to that volume.

= Basic definitions. Volume configuration. Brick's capacity. Partitioning. Fair distribution. Balancing =

The basic configuration of a logical volume is the following information:

1) Volume UUID;
2) Number of bricks in the volume;
3) List of brick names or UUIDs in the volume;
4) UUID or name of the brick to be added/removed (if any).
That brick is not counted in (2) and (3). Item #4 is needed to handle incomplete operations, interrupted for various reasons (system crash, hard reset, etc), when bringing logical volumes on-line. For each volume its configuration should be stored somewhere (but not on that volume!) and properly updated before and after each volume operation performed on that volume. We make the user responsible for this. The volume configuration is needed to facilitate deploying a volume.

'''Capacity of a brick''' (or abstract capacity) is a positive integer number. Capacity is a brick's property defined by the user. Don't confuse it with the size of the block device; think of it as the brick's "weight" in some units. It is the user who decides which property of the brick to assign as its abstract capacity and in which units. In particular, it can be the size of the block device in kilobytes, or its size in megabytes, or its throughput in M/sec, or another geometric or physical parameter of the device associated with the brick. It is important that the capacities of all bricks of the same logical volume are measured in the same units. Also, it would be utterly pointless to assign different properties as abstract capacities for bricks of the same LV - for example, the size of the block device for one brick, and disk bandwidth for another one. The capacity of each brick gets initialized by the mkfs utility. By default it is calculated as the number of free blocks on the device at the very end of the formatting procedure. For the meta-data brick it is calculated as 70% of that amount. The capacity of any brick can be changed on-line by the user.

'''Capacity of a logical volume''' is defined as the sum of the capacities of its component bricks.

'''Relative capacity of a brick''' is the ratio of the brick's capacity to the volume's capacity. Relative capacity defines the portion of IO-requests that will be issued against that brick. The array of relative capacities (C1, C2, ...) of all bricks is called the volume partitioning. Obviously, C1 + C2 + ... = 1.
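The partitioning definition above can be sketched in a few lines (an illustrative helper of our own, not reiser4progs code):

```python
# Sketch of volume partitioning as defined above: the array of relative
# capacities (C1, C2, ...) of all bricks, which always sums to 1.

def partitioning(capacities):
    """Return (C1, C2, ...) given the abstract capacities of the bricks
    (all in the same units)."""
    total = sum(capacities)
    return [c / total for c in capacities]

# Two 10G bricks and one 5G brick, capacities measured in gigabytes:
print(partitioning([10, 10, 5]))   # [0.4, 0.4, 0.2]
```

In this example 40% of IO-requests would be issued against each 10G brick and 20% against the 5G brick.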
'''Real data space usage''' on a brick is the number of data blocks stored on that brick.

'''Ideal data space usage''' on a brick is defined as T*C, where T is the total number of data blocks stored in the volume and C is the relative capacity of the brick.

It is recommended to compose volumes in such a way that the space-based partitioning coincides with the throughput-based one - this is the optimal volume configuration, which provides true parallelism. If that is impossible for some reason, then choose a preferred partitioning method (space-based or throughput-based). Note that space-based partitioning saves volume space, whereas throughput-based partitioning saves volume throughput. When performing regular file operations, Reiser5 distributes data stripes throughout the volume evenly and fairly. This means that the portion of IO-requests issued against each brick is equal to its relative capacity, that is, to the portion of capacity that the brick adds to the total volume capacity. In contrast with regular file operations, volume operations break the fairness of data distribution on your logical volume. To restore fairness of distribution, a special balancing procedure should be run on the volume. For example, after adding a brick to a logical volume, the balancing procedure will populate the new brick with data moved from other bricks. All volume operations except brick removal are fast, atomic, and leave the volume in unbalanced state. The operation of brick removal always includes balancing, which moves data from the brick you want to remove to the other bricks of the volume. If that data migration is interrupted for some reason, then the volume is marked as a "volume with incomplete brick removal". It is allowed to perform regular file and volume operations on an unbalanced LV (assuming it is not a volume with incomplete removal). However, in this case we don't guarantee a good quality of data distribution on your LV.
In addition, on a volume with incomplete removal you won't be able to perform regular volume operations - first you will need to complete the removal by running a special removal completion procedure on your volume.

= Prepare Software and Hardware =

Build, install and boot a kernel with Reiser4 of software framework release number 5.X.Y. Kernel patches can be found [https://sourceforge.net/projects/reiser4/files/v5-unstable/ here]. Note that the Linux kernel and GNU utilities still recognize the testing stuff as "Reiser4". Make sure there is the following message in the kernel logs: "Loading Reiser4 (Software Framework Release: 5.X.Y)". Build and install the latest [https://sourceforge.net/projects/reiser4/files/reiser4-utils/libaal/ libaal]. Download, build and install the latest version 2.A.B of the [https://sourceforge.net/projects/reiser4/files/v5-unstable/ Reiser4progs package]. Make sure that the utility for managing logical volumes is installed (as a part of the reiser4progs package) on your machine:

 # volume.reiser4 -?

= Creating a logical volume =

Start by choosing a unique ID (uuid) for your volume. By default it is generated by the mkfs utility. However, the user can generate it himself with proper tools (e.g. uuidgen(1)) and store it in an environment variable for convenience:

 # VOL_ID=`uuidgen`
 # echo "Using uuid $VOL_ID"

Choose a stripe size for your logical volume. For a good quality of distribution it is recommended that the stripe doesn't exceed 1/10000 of the volume size. On the other hand, too small stripes will increase space consumption on your meta-data brick. In our example we choose stripe size 512K:

 # STRIPE=512K
 # echo "Using stripe size $STRIPE"

Start by creating the first brick of your volume - the meta-data brick - passing the volume ID and stripe size to the mkfs.reiser4 utility:

 # mkfs.reiser4 -U $VOL_ID -t $STRIPE /dev/vdb1

Currently only one meta-data brick per volume is supported, so it is recommended that the size of the block device for the meta-data brick is not too small.
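The two sizing rules of thumb given on this page - a meta-data brick of at least 1/200 of the maximal volume size, and a stripe of at most 1/10000 of the volume size - can be sketched as quick back-of-the-envelope helpers (helper names are our own):

```python
# Back-of-the-envelope helpers for the sizing rules of thumb on this page:
# meta-data brick >= 1/200 of the maximal volume size, and
# stripe size     <= 1/10000 of the volume size.

def min_metadata_brick(max_volume_size):
    """Smallest recommended meta-data brick (same unit as the argument)."""
    return max_volume_size / 200

def max_stripe(volume_size):
    """Largest recommended stripe size (same unit as the argument)."""
    return volume_size / 10000

GiB = 1024 ** 3
TiB = 1024 ** 4

# A ~20T volume needs a meta-data brick of roughly 100G, as stated above:
print(min_metadata_brick(20 * TiB) / GiB)   # 102.4
```

These are planning aids only; actual capacity is still initialized by mkfs.reiser4 from the free block count at format time, as described in the basic definitions.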
In most cases it will be enough, if your meta-data brick is not smaller than 1/200 of maximal volume size. For example, 100G meta-data brick will be able to service ~20T logical volume. Data and meta-data bricks don't differ from the standpoint of disk format, and there is no special option to inform mkfs utility that we want to create exactly meta-data brick: the first brick in the volume automatically becomes a meta-data brick, and other bricks are interpreted as data bricks. Mount your initial logical volume consisting of one meta-data brick: # mount /dev/vdb1 /mnt Find a record about your volume in the output of the following command: # volume.reiser4 -l Create configuration of your logical volume (its definition is above) and store it somewhere, but not on that volume! Your logical volume is now on-line and ready to use. You can perform regular file operations and volume operations (e.g. add a data brick to your LV). = Adding a data brick to LV = At any time you are able to add a data brick to your LV. You can do it in parallel with regular file operations executing on this volume. Make sure, however, that your volume is not marked as "volume with incomplete brick removal", or there is no other volume operations over your volume in progress. Otherwise your operation will fail with EBUSY. Obviously, adding a brick will increase capacity of your volume. Choose a block device for the new data brick. Make sure that it is not too large, or too small. Capacities of any 2 bricks of the same logical volume can not differ more than 2^19 (~1 million) times. E.g. your logical volume can not contain both, 1M and 2T bricks. Any attempts to add a brick of improper capacity will fail with error. Format it with the same volume ID and stripe size, as you used for meta-data brick, but specify also "-a" option (to not restrict data capacity). 
# mkfs.reiser4 -U $VOL_ID -t $STRIPE -a /dev/vdb2 It is important that data brick is formatted with the same volume ID and stripe size, as the meta-data brick of your logical volume. Otherwise, operation of adding a data brick will fail. Update item #4 of your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration] with UUID or name of the brick you want to add. To add a brick simply pass its name as an argument for the option "-a" and specify your LV via its mount point: # volume.reiser4 -a /dev/vdb2 /mnt By default operation of adding a brick is fast and atomic and leaves the volume in unbalanced state, so after adding a brick you might want to run a balancing procedure, which will move a portion of data to the new brick from other bricks of the logical volume, which will make data distribution on your volume fair: # volume.reiser4 -b /mnt Portion of data blocks, being moved during such rebalancing, is equal to relative capacity of the new brick, that is to the portion of capacity that the new brick adds to updated LV's capacity. This important property defines the cost of balancing procedure. If the portion of capacity added by a brick is small, then number of stripes moved during balancing is also small. Using this operation in conjunction with the option -B (--with-balance) will automatically trigger the balancing procedure: # volume.reiser4 -Ba /dev/vdb2 /mnt Upon successful completion update your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]. That is, increment (#2), add info about the new brick to (#3) and remove records at (#4). When adding more than one brick at once, call volume.reiser4 with option -a for each brick individually in any order. 
It will be reasonable to not complete each call with balancing. Run balancing only after adding the last brick. = Removing a data brick from LV = At any time you are able to remove any data brick from your LV (assuming that your volume is not marked as a "volume with incomplete brick removal". You can perform brick removal in parallel with regular file operations executing on that volume. Obviously, the removal operation will decrease abstract capacity of your LV. Note that other bricks should have enough space to store all data blocks of the brick you want to remove, otherwise, the removal operation will return error (ENOSPC). Suppose you want to remove brick /dev/vdb2 from your LV mounted at /mnt. Update your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration] with the UUID and name of the brick you want to remove (#item #4). To remove a brick simply pass its name as an argument for option "-r" and specify the logical volume by its mount point: # volume.reiser4 -r /dev/vdb2 /mnt The procedure of brick removal starts from moving all data from the brick you want to remove to other bricks of your volume, so that resulted data distribution among the rest of bricks will be also fair. Portion of data stripes being moved during such migration is equal to the relative capacity of the brick to be removed (that it to the portion of capacity that the brick added to LV's capacity). Successful brick removal always leaves the volume is balanced state. So, in contrast with the operation of adding a brick, removing a brick is a rather long operation, which can be interrupted for various reasons. In this case volume will be marked as a "volume with incomplete brick removal". To check removal status of your LV simply run # volume.reiser4 /mnt and check the field "health". 
To complete brick removal in the current mount session simply run # volume.reiser4 -R /mnt Note, that the option -R (--finish-removal) doesn't accept any arguments. On success update your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]: remove the information about the brick /dev/vdb2 at #3 and #4. Check your kernel logs: it should contain a message that brick /dev/vdb2 has been unregistered. Now device /dev/vdb2 doesn't belong to the logical volume any more, and you can reuse it for other purposes (re-format, etc). = Changing brick's capacity = At any time (assuming that your volume is not marked as a "volume with incomplete brick removal) you can change abstract capacity of any brick to some new value, different from 0. Changing capacity always changes volume partitioning, and therefore, breaks fairness of data distribution on the volume. By default operation of changing brick capacity leaves the volume in unbalanced state, so after changing brick capacity you might want to run a balancing procedure to make data distribution on your volume fair. In particular, after increasing brick capacity the balancing procedure will move some data from other bricks to the brick, whose capacity was increased. After decreasing bricks capacity the balancing procedure will move some data from the brick, whose capacity was decreased, to other bricks. To change abstract capacity of a brick /dev/vdb1 to a new value (e.g. 200000), simply run # volume.reiser4 -z /dev/vdb1 -c 200000 /mnt Pronounced as "resize brick /dev/vdb1 to new capacity 200000 in volume mounted at /mnt". By default the operation of changing capacity is fast, atomic and leave the volume in unbalanced state. To automatically invoke balancing, use this operation in conjunction with the option -B (--with-balance). 
You can also run the balancing procedure later at any time by executing

 # volume.reiser4 -b /mnt

When changing the capacities of more than one brick at once, call the volume.reiser4 utility for each brick individually, in any order. It is reasonable not to complete each call with balancing; run balancing after changing the capacity of the last brick.

Comment. Changing a brick's capacity to 0 is undefined and will return an error. Consider the brick removal operation instead.

= Operations with meta-data brick =

The meta-data brick can also contain data stripes and participate in data distribution like the data bricks. All the volume operations described above are also applicable to the meta-data brick. Note, however, that it is impossible to completely remove the meta-data brick from the logical volume for obvious reasons (meta-data needs to be stored somewhere), so the brick removal operation applied to the meta-data brick actually removes it only from the Data Storage Array (DSA): the subset of the LV consisting of the bricks that participate in regular data distribution according to their abstract capacities. Once you remove the meta-data brick from the DSA, that brick will be used only to store meta-data. The operation of adding a brick, applied to the meta-data brick, returns it to the DSA.

Important: Reiser5 doesn't count busy data and meta-data blocks separately. So, in contrast with data bricks (which contain only data), you cannot find out the real space occupied by data blocks on the meta-data brick; Reiser5 knows only the total space occupied.

To check the status of the meta-data brick of the volume mounted at /mnt, simply run

 # volume.reiser4 -p0 /mnt

and check the field "in DSA".

= Unmounting a logical volume =

To terminate a mount session, just issue the usual umount against the mount point:

 # umount /mnt

Note that after unmounting the volume, all its bricks by default remain registered in the system until system shutdown.
If you want to unregister a brick before system shutdown, simply issue the following command:

 # volume.reiser4 -u BRICK_NAME

= Deploying a logical volume after correct unmount =

After unmounting a logical volume, all its bricks remain registered in the system. So, if you want to mount the volume again, simply issue the mount command against one of its bricks. It is recommended to issue it against the meta-data brick.

NOTE: Reiser5 will refuse to mount a logical volume when a wrong (incomplete or redundant) set of bricks is registered in the system. A redundant set of bricks appears, for example, when you mistakenly register a brick that was earlier removed from the logical volume.

= Deploying a logical volume after correct shutdown =

First of all, check the [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing configuration] of your volume and make sure that all its bricks (data and meta-data ones) are registered in the system. The list of registered bricks can be printed by

 # volume.reiser4 -l

Also make sure that the set of bricks registered for the volume doesn't contain bricks not mentioned in the [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration].

Important: Reiser5 will refuse to mount a logical volume when a wrong (incomplete or redundant) set of bricks is registered in the system. A redundant set of bricks appears, for example, when you mistakenly register a brick that was removed from the logical volume.
For this reason we strongly recommend keeping track of your LV: store its [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing configuration] somewhere, but not on that volume! And don't forget to update that configuration after '''every''' volume operation. If you lose the configuration of your LV and don't remember it (which is most likely for large volumes), it will be rather painful to restore: currently there are no tools to manage logical volumes off-line. So users have to do this on their own. It is not at all difficult.

To register a brick in the system, use the following command:

 # volume.reiser4 -g BRICK_NAME

To print a list of all registered bricks, use

 # volume.reiser4 -l

Now mount your LV by issuing a mount(8) command against one of its bricks. We recommend issuing it against the meta-data brick.

Comment. Reiser5 always tries to register the brick which is passed to the mount command as an argument, so it is not necessary to pre-register the brick you issue the mount command against.

= Deploying a logical volume after hard reset or system crash =

If no volume operations were interrupted by the hard reset or system crash, just follow the instructions in this [https://reiser4.wiki.kernel.org/index.php?title=Logical_Volumes_Administration#Deploying_a_logical_volume_after_correct_shutdown section]. In Reiser5 only a restricted number of bricks participate in each transaction; the maximal number of such bricks can be specified by the user. At mount time a transaction replay procedure is launched on each such brick independently, in parallel.
Depending on the kind of interrupted volume operation, perform one of the following actions:

== Volume balancing was interrupted ==

Check your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]. Register the complete set of bricks and mount the volume by issuing the mount command against one of its bricks. Check the balanced status of your LV by running

 # volume.reiser4 /mnt

and checking the "balanced" value. If the volume is unbalanced, complete balancing by running

 # volume.reiser4 -b /mnt

== Brick removal was interrupted ==

Check your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]. Register the new set of bricks (that is, the set without the brick you wanted to remove). Try to mount the volume. In case of error, register also the brick you wanted to remove and try to mount again. Check the status of your LV by running

 # volume.reiser4 /mnt

and checking the value of "health". If required, complete brick removal by running

 # volume.reiser4 -R /mnt

Note that the option -R doesn't accept any arguments. After successful removal completion the brick will be automatically removed from the volume and unregistered. Verify this by checking the status of your LV and the list of registered bricks:

 # volume.reiser4 /mnt
 # volume.reiser4 -l

Upon successful completion update your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration] accordingly.
= LV monitoring =

Common info about the LV mounted at /mnt:

 # volume.reiser4 /mnt

* ID: volume UUID
* volume: ID of the plugin managing the volume
* distribution: ID of the distribution plugin
* stripe: stripe size in bytes
* segments: number of hash space segments (for distribution)
* bricks total: total number of bricks in the volume
* bricks in DSA: number of bricks participating in data distribution
* balanced: balanced status of the volume
* health: brick removal completion status

Info about any of its bricks, of index J:

 # volume.reiser4 -p J /mnt

* internal ID: brick's "internal ID" and its status in the volume
* external ID: brick's UUID
* device name: name of the block device associated with the brick
* block count: size of the block device in blocks
* blocks used: total number of occupied blocks on the device
* system blocks: minimal possible number of busy blocks on the device
* data capacity: abstract capacity of the brick
* space usage: portion of occupied blocks on the device
* in DSA: participation in regular data distribution
* is proxy: participation in data tiering (Burst Buffers, etc.)

Comment. When retrieving a brick's info, make sure that no volume operations on that volume are in progress; otherwise the command above will return an error (EBUSY).

WARNING. Brick info obtained this way is not necessarily the most recent. To get up-to-date info, run sync(1) and make sure that no regular file operations are in progress.

= Checking free space =

To check the number of available free blocks on a volume mounted at /mnt, make sure that no regular file operations or volume operations are in progress on that volume, then run

 # sync
 # df --block-size=4K /mnt

To check the number of free blocks on the brick of index J, run

 # volume.reiser4 -p J /mnt

and calculate the difference between "block count" and "blocks used".

Comment. Not all free blocks on a brick/volume are available for use.
The number of available free blocks is always ~95% of the total number of free blocks (Reiser4 reserves 5% to make sure that regular file truncate operations won't fail).

NOTE: volume.reiser4 shows the total number of free blocks, whereas df(1) shows the number of available free blocks.

The "space usage" statistic shows the portion of busy blocks on an individual brick. For the reasons explained above, "space usage" on any brick cannot be more than 0.95.

= Checking quality of data distribution =

Quality of data distribution is a measure of the deviation of the real data space usage from the ideal one defined by the volume partitioning. The smaller the deviation, the better the distribution quality. Checking quality of distribution makes sense only when your volume partitioning is space-based, or coincides with the space-based one. If your partitioning is throughput-based and doesn't coincide with the space-based one, then the quality of the actual data distribution can be rather bad: in this case the file system takes care that low-performance devices don't become a bottleneck, and effective space usage is not a high priority.

Checking quality of data distribution is based on the free blocks accounting provided by the file system. Note that the file system doesn't count busy data and meta-data blocks separately, so you cannot find the real data space usage, and hence check the quality of distribution, when the meta-data brick contains data blocks.

To check quality of distribution:
* make sure that the meta-data brick doesn't contain data blocks;
* make sure that no regular file or volume operations are currently in progress;
* find the "blocks used", "system blocks" and "data capacity" statistics for each data brick:
 # sync
 # volume.reiser4 -p 1 /mnt
 ...
 # volume.reiser4 -p N /mnt
* find the real data space usage on each brick;
* calculate the partitioning and the ideal data space usage on each data brick;
* find the deviation of the real usage from the ideal one.

Example.
Let's build an LV of 3 bricks (one 10G meta-data brick vdb1, and two data bricks: vdc1 (10G) and vdd1 (5G)) with space-based partitioning:

 # VOL_ID=`uuid -v4`
 # echo "Using uuid $VOL_ID"
 # mkfs.reiser4 -U $VOL_ID -y -t 256K /dev/vdb1
 # mkfs.reiser4 -U $VOL_ID -y -a -t 256K /dev/vdc1
 # mkfs.reiser4 -U $VOL_ID -y -a -t 256K /dev/vdd1
 # mount /dev/vdb1 /mnt

Fill the meta-data brick with data:

 # dd if=/dev/zero of=/mnt/myfile bs=256K
 No space left on device...

Add data bricks /dev/vdc1 and /dev/vdd1 to the volume:

 # volume.reiser4 -a /dev/vdc1 /mnt
 # volume.reiser4 -a /dev/vdd1 /mnt

Move all data blocks to the newly added bricks:

 # volume.reiser4 -r /dev/vdb1 /mnt
 # sync

Now the meta-data brick doesn't contain data blocks (only meta-data ones), so we can calculate the quality of data distribution:

 # volume.reiser4 /mnt -p0
 blocks used: 503
 # volume.reiser4 /mnt -p1
 blocks used: 1657203
 system blocks: 115
 data capacity: 2621069
 # volume.reiser4 /mnt -p2
 blocks used: 833001
 system blocks: 73
 data capacity: 1310391

Based on the statistics above, calculate the quality of distribution.

Total data capacity of the volume: C = 2621069 + 1310391 = 3931460

Relative capacities of the data bricks:
 C1 = 2621069 / 3931460 = 0.6667
 C2 = 1310391 / 3931460 = 0.3333

Real space usage on the data bricks (blocks used - system blocks):
 R1 = 1657203 - 115 = 1657088
 R2 = 833001 - 73 = 832928

Space usage on the volume: R = R1 + R2 = 1657088 + 832928 = 2490016

Ideal data space usage on the data bricks:
 I1 = C1 * R = 0.6667 * 2490016 = 1660094
 I2 = C2 * R = 0.3333 * 2490016 = 829922

Deviation: D = (R1, R2) - (I1, I2) = (-3006, 3006)

Relative deviation: D/R = (-0.0012, 0.0012)

Quality of distribution: Q = 1 - max(|D1/R|, |D2/R|) = 1 - 0.0012 = 0.9988

Comment. For any specified number of bricks N and quality of distribution Q it is possible to find a configuration of a logical volume composed of N bricks such that the quality of distribution on that volume is better than Q.
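The arithmetic of this example is easy to script. The following Python sketch reproduces the calculation from the volume.reiser4 statistics printed above:

```python
# Quality of data distribution, computed as in the worked example above.

capacities  = [2_621_069, 1_310_391]   # "data capacity" of the data bricks
blocks_used = [1_657_203, 833_001]     # "blocks used"
system      = [115, 73]                # "system blocks"

C_total = sum(capacities)
C = [c / C_total for c in capacities]              # partitioning (C1, C2)
R = [u - s for u, s in zip(blocks_used, system)]   # real data space usage
R_total = sum(R)                                   # 2490016
I = [c * R_total for c in C]                       # ideal data space usage
rel_dev = [(r - i) / R_total for r, i in zip(R, I)]

Q = 1 - max(abs(d) for d in rel_dev)
print(round(Q, 4))   # 0.9988, matching the figure above
```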
Comment. Quality of distribution Q doesn't depend on the number of bricks in the logical volume. This is a theorem, which can be strictly proven.

= FAQ =

Q. What happens if I lose a device component (due to a breakdown, etc.) of my logical volume?

A. The bodies of some of your regular files will become "punched" in random places. The portion of such files depends on the relative capacity of the lost brick, on the number of bricks in the logical volume, and on other factors. Fsck will be able to detect and remove files with corrupted bodies. Nevertheless, we recommend considering mirroring your bricks (e.g. by software or hardware RAID-1) to avoid such highly unpleasant situations.

[[category:Reiser4]]

Before working with logical volumes you need to understand some basic [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Background principles]. A logical volume (LV) can be composed of any number of block devices, different in physical and geometric parameters. However, the optimal configuration (true parallelism) imposes some restrictions and dependencies on the sizes of such devices.

WARNING: This stuff is not stable. Don't put important data on logical volumes managed by software of release number 5.X.Y. Also don't mount your old partitions in kernels with Reiser4 of SFRN 5.X.Y before its stabilization.

IMPORTANT: Currently there are no tools to manage Reiser5 logical volumes off-line, so it is strongly recommended to save/update the configuration of your LV in a file which doesn't belong to that volume.

= Basic definitions. Volume configuration. Brick's capacity. Partitioning. Fair distribution. Balancing =

The basic configuration of a logical volume is the following information:

1) Volume UUID;
2) Number of bricks in the volume;
3) List of brick names or UUIDs in the volume;
4) UUID or name of the brick being added/removed (if any).
That brick is not counted in (2) and (3). Item #4 serves to handle incomplete operations, interrupted for various reasons (system crash, hard reset, etc.), when bringing logical volumes on-line. For each volume its configuration should be stored somewhere (but not on that volume!) and properly updated before and after each volume operation performed on that volume. We make the user responsible for this. The volume configuration is needed to facilitate deploying the volume.

'''Capacity of a brick''' (or abstract capacity) is a positive integer. Capacity is a brick property defined by the user. Don't confuse it with the size of the block device; think of it as the brick's "weight" in some units. It is the user who decides which property of the brick to assign as its abstract capacity, and in which units. In particular, it can be the size of the block device in kilobytes, or its size in megabytes, or its throughput in MB/sec, or another geometric or physical parameter of the device associated with the brick. It is important that the capacities of all bricks of the same logical volume are measured in the same units. Also, it would be utterly pointless to assign different properties as abstract capacities for bricks of the same LV: for example, the size of the block device for one brick and the disk bandwidth for another. The capacity of each brick is initialized by the mkfs utility. By default it is calculated as the number of free blocks on the device at the very end of the formatting procedure. For the meta-data brick it is calculated as 70% of that amount. The capacity of any brick can be changed on-line by the user.

'''Capacity of a logical volume''' is defined as the sum of the capacities of its component bricks.

'''Relative capacity of a brick''' is the ratio of the brick's capacity to the volume's capacity. Relative capacity defines the portion of IO requests that will be issued against that brick. The array of relative capacities (C1, C2, ...) of all bricks is called the volume partitioning. Obviously, C1 + C2 + ... = 1.
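A quick sketch of these definitions (capacities below are hypothetical, in arbitrary but uniform units):

```python
# Volume partitioning: relative capacity of each brick, as defined above.

def partitioning(capacities):
    """Return (C1, C2, ...) given the abstract capacities of the bricks."""
    total = sum(capacities)          # capacity of the logical volume
    return [c / total for c in capacities]

# Hypothetical 3-brick volume; capacities could be sizes in KB, or
# throughputs in MB/sec, as long as the units are uniform.
C = partitioning([1_000_000, 500_000, 500_000])
print(C)        # [0.5, 0.25, 0.25]
print(sum(C))   # 1.0  (C1 + C2 + ... = 1)
```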
'''Real data space usage''' on a brick is the number of data blocks stored on that brick.

'''Ideal data space usage''' on a brick is defined as T*C, where T is the total number of data blocks stored in the volume and C is the relative capacity of the brick.

It is recommended to compose volumes so that the space-based partitioning coincides with the throughput-based one: this is the optimal volume configuration, which provides true parallelism. If that is impossible for some reason, choose a preferred partitioning method (space-based or throughput-based). Note that space-based partitioning saves volume space, whereas throughput-based partitioning saves volume throughput.

When performing regular file operations, Reiser5 distributes data stripes throughout the volume evenly and fairly. This means that the portion of IO requests issued against each brick is equal to its relative capacity, that is, to the portion of capacity that the brick adds to the total volume capacity.

In contrast with regular file operations, volume operations break fairness of data distribution on your logical volume. To restore fairness of distribution, a special balancing procedure should be run on the volume. For example, after adding a brick to a logical volume, the balancing procedure populates the new brick with data moved from other bricks. All volume operations except brick removal are fast, atomic and leave the volume in an unbalanced state. The operation of brick removal always includes balancing, which moves data from the brick you want to remove to the other bricks of the volume. If that data migration is interrupted for some reason, the volume is marked as a "volume with incomplete brick removal".

It is allowed to perform regular file and volume operations on an unbalanced LV (assuming the imbalance is not due to incomplete removal). However, in this case we don't guarantee a good quality of data distribution on your LV.
In addition, on a volume with incomplete removal you won't be able to perform regular volume operations: first you will need to complete the removal by running a special removal completion procedure on your volume.

= Prepare Software and Hardware =

Build, install and boot a kernel with Reiser4 of software framework release number 5.X.Y. Kernel patches can be found [https://sourceforge.net/projects/reiser4/files/v5-unstable/ here]. Note that the testing stuff is still recognized as "Reiser4" by the Linux kernel and GNU utilities. Make sure there is the following message in the kernel logs:

 Loading Reiser4 (Software Framework Release: 5.X.Y)

Build and install the latest [https://sourceforge.net/projects/reiser4/files/reiser4-utils/libaal/ libaal]. Download, build and install the latest version 2.A.B of the [https://sourceforge.net/projects/reiser4/files/v5-unstable/ Reiser4progs package]. Make sure that the utility for managing logical volumes is installed (as a part of the reiser4progs package) on your machine:

 # volume.reiser4 -?

= Creating a logical volume =

Start by choosing a unique ID (uuid) for your volume. By default it is generated by the mkfs utility. However, you can generate it yourself with a suitable tool (e.g. uuidgen(1)) and store it in an environment variable for convenience:

 # VOL_ID=`uuidgen`
 # echo "Using uuid $VOL_ID"

Choose a stripe size for your logical volume. For a good quality of distribution it is recommended that the stripe not exceed 1/10000 of the volume size. On the other hand, too small stripes will increase space consumption on your meta-data brick. In our example we choose stripe size 512K:

 # STRIPE=512K
 # echo "Using stripe size $STRIPE"

Then create the first brick of your volume, the meta-data brick, passing the volume ID and stripe size to the mkfs.reiser4 utility:

 # mkfs.reiser4 -U $VOL_ID -t $STRIPE /dev/vdb1

Currently only one meta-data brick per volume is supported, so it is recommended that the size of the block device for the meta-data brick is not too small.
In most cases it will be enough if your meta-data brick is not smaller than 1/200 of the maximal volume size. For example, a 100G meta-data brick will be able to service a ~20T logical volume. Data and meta-data bricks don't differ from the standpoint of disk format, and there is no special option to inform the mkfs utility that we want to create a meta-data brick: the first brick in the volume automatically becomes the meta-data brick, and the other bricks are interpreted as data bricks.

Mount your initial logical volume consisting of one meta-data brick:

 # mount /dev/vdb1 /mnt

Find the record about your volume in the output of the following command:

 # volume.reiser4 -l

Create the configuration of your logical volume (its definition is above) and store it somewhere, but not on that volume! Your logical volume is now on-line and ready to use. You can perform regular file operations and volume operations (e.g. add a data brick to your LV).

= Adding a data brick to LV =

At any time you can add a data brick to your LV. You can do this in parallel with regular file operations executing on the volume. Make sure, however, that no other volume operation (e.g. removing a brick) is in progress on your volume; otherwise your operation will fail with EBUSY. Obviously, adding a brick increases the capacity of your volume.

Choose a block device for the new data brick. Make sure that it is not too large or too small: the capacities of any 2 bricks of the same logical volume cannot differ by more than 2^19 (about half a million) times. E.g. your logical volume cannot contain both 1M and 2T bricks. Any attempt to add a brick of improper capacity will fail with an error. Format the device with the same volume ID and stripe size as you used for the meta-data brick, but also specify the "-a" option (to not restrict data capacity):

 # mkfs.reiser4 -U $VOL_ID -t $STRIPE -a /dev/vdb2

It is important that a data brick is formatted with the same volume ID and stripe size as the meta-data brick of your logical volume.
Otherwise, the operation of adding the data brick will fail.

Update item #4 of your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration] with the UUID or name of the brick you want to add. To add a brick, simply pass its name as an argument to the option "-a" and specify your LV via its mount point:

 # volume.reiser4 -a /dev/vdb2 /mnt

By default the operation of adding a brick is fast, atomic and leaves the volume in an unbalanced state, so after adding a brick you might want to run the balancing procedure, which will move a portion of data from the other bricks of the logical volume to the new brick and thereby make data distribution on your volume fair:

 # volume.reiser4 -b /mnt

The portion of data blocks moved during such rebalancing is equal to the relative capacity of the new brick, that is, to the portion of capacity that the new brick adds to the updated LV capacity. This important property defines the cost of the balancing procedure: if the portion of capacity added by a brick is small, then the number of stripes moved during balancing is also small.

Using the add operation in conjunction with the option -B (--with-balance) will trigger the balancing procedure automatically:

 # volume.reiser4 -Ba /dev/vdb2 /mnt

Upon successful completion update your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]: increment (#2), add the info about the new brick to (#3) and remove the record at (#4).

When adding more than one brick at once, call volume.reiser4 with option -a for each brick individually, in any order. It is reasonable not to complete each call with balancing; run balancing only after adding the last brick.
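The cost rule above can be sketched numerically (the helper and all figures are hypothetical):

```python
# Cost of rebalancing after adding a brick, per the rule above: the portion
# of data stripes moved equals the relative capacity of the new brick in
# the *updated* volume.

def migration_fraction(old_capacities, new_capacity):
    """Fraction of existing data stripes moved onto the newly added brick."""
    return new_capacity / (sum(old_capacities) + new_capacity)

# Adding a brick of capacity 1000000 to a volume whose existing bricks
# have capacities 4000000 and 5000000:
frac = migration_fraction([4_000_000, 5_000_000], 1_000_000)
print(frac)   # 0.1 -> about 10% of the data stripes will be moved
```

This is why adding a small brick to a large volume is cheap to balance, while adding a brick comparable in capacity to the whole volume moves a large share of the data.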
= Removing a data brick from LV = At any time you are able to remove any data brick from your LV (assuming that your volume is not marked as a "volume with incomplete brick removal". You can perform brick removal in parallel with regular file operations executing on that volume. Obviously, the removal operation will decrease abstract capacity of your LV. Note that other bricks should have enough space to store all data blocks of the brick you want to remove, otherwise, the removal operation will return error (ENOSPC). Suppose you want to remove brick /dev/vdb2 from your LV mounted at /mnt. Update your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration] with the UUID and name of the brick you want to remove (#item #4). To remove a brick simply pass its name as an argument for option "-r" and specify the logical volume by its mount point: # volume.reiser4 -r /dev/vdb2 /mnt The procedure of brick removal starts from moving all data from the brick you want to remove to other bricks of your volume, so that resulted data distribution among the rest of bricks will be also fair. Portion of data stripes being moved during such migration is equal to the relative capacity of the brick to be removed (that it to the portion of capacity that the brick added to LV's capacity). Successful brick removal always leaves the volume is balanced state. So, in contrast with the operation of adding a brick, removing a brick is a rather long operation, which can be interrupted for various reasons. In this case volume will be marked as a "volume with incomplete brick removal". To check removal status of your LV simply run # volume.reiser4 /mnt and check the field "health". To complete brick removal in the current mount session simply run # volume.reiser4 -R /mnt Note, that the option -R (--finish-removal) doesn't accept any arguments. 
On success update your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]: remove the information about the brick /dev/vdb2 at #3 and #4. Check your kernel logs: it should contain a message that brick /dev/vdb2 has been unregistered. Now device /dev/vdb2 doesn't belong to the logical volume any more, and you can reuse it for other purposes (re-format, etc). = Changing brick's capacity = At any time (assuming that your volume is not marked as a "volume with incomplete brick removal) you can change abstract capacity of any brick to some new value, different from 0. Changing capacity always changes volume partitioning, and therefore, breaks fairness of data distribution on the volume. By default operation of changing brick capacity leaves the volume in unbalanced state, so after changing brick capacity you might want to run a balancing procedure to make data distribution on your volume fair. In particular, after increasing brick capacity the balancing procedure will move some data from other bricks to the brick, whose capacity was increased. After decreasing bricks capacity the balancing procedure will move some data from the brick, whose capacity was decreased, to other bricks. To change abstract capacity of a brick /dev/vdb1 to a new value (e.g. 200000), simply run # volume.reiser4 -z /dev/vdb1 -c 200000 /mnt Pronounced as "resize brick /dev/vdb1 to new capacity 200000 in volume mounted at /mnt". By default the operation of changing capacity is fast, atomic and leave the volume in unbalanced state. To automatically invoke balancing, use this operation in conjunction with the option -B (--with-balance). 
Also you can run a balancing procedure later at any time by executing # volume.reiser4 -b /mnt When changing capacities of more than one brick at once, call the volume.reiser4 utility for each brick individually in any order. It will be reasonable to not complete each call with balancing. Run balancing after changing capacity of the last brick. Comment. Changing bricks capacity to 0 is undefined and will return error. Consider brick removal operation instead. = Operations with meta-data brick = Meta-data brick also can contain data stripes and participate in data distribution like other data bricks. All the volume operations described above are also applicable to meta-data brick. Note, however, that it is impossible to completely remove meta-data brick from the logical volume for obvious reasons (meta-data need to be stored somewhere), so brick removal operation applied to the meta-data brick actually removes it only from Data-Storage Array (DSA), which is a subset of LV consisting of bricks, participating in regular data distribution, corresponding to their abstract capacities. Once you remove meta-data brick from DSA, that brick will be used only to store meta-data. Operation of adding a brick, being applied to a meta-data brick, returns the last one back to DSA. Important: Reiser5 doesn't count busy data and meta-data blocks separately. So in contrast with data bricks (which contain only data) you are not able to find out real space occupied by data blocks on the meta-data brick - Reiser5 knows only total space occupied. To check the status of meta-data brick of the volume mounted at /mnt simply run # volume.reiser4 -p0 /mnt and check the field "in DSA". = Unmounting a logical volume = To terminate a mount session just issue usual umount against the mount point: # umount /mnt Note that after unmounting the volume all bricks by default remain to be registered in the system till system shutdown. 
If you want to unregister a brick before system shutdown, then simply issue the following command: # volume.reiser4 -u BRICK_NAME = Deploying a logical volume after correct unmount = After unmounting a logical volume all its bricks remain to be registered in the system. So, if you want to mount the volume again, simply issue the mount command against some its brick. is recommended to issue it against meta-data brick. NOTE: Reiser5 will refuse to mount a logical volume, in the case, when a wrong (incomplete or redundant) set of bricks is registered in the system. Redundant set of bricks appears, for example, when you mistakenly register a brick that was earlier removed from the logical volume. = Deploying a logical volume after correct shutdown = First of all, check [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing configuration] of your volume and make sure that all its bricks (data and meta-data ones) are registered in the system. The list of registered bricks can be printed by # volume.reiser4 -l Also make sure that the set of registered per volume bricks doesn't contain bricks not mentioned in the [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]. Important: Reiser5 will refuse to mount a logical volume, in the case, when a wrong (incomplete or redundant) set of bricks is registered in the system. Redundant set of bricks appears, for example, when you mistakenly register a brick that was removed from the logical volume. 
For this reason we strongly recommend keeping track of your LV - store its [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing configuration] somewhere, but not on that volume! And don't forget to update that configuration after '''every''' volume operation. If you lose the configuration of your LV and don't remember it (which is most likely for large volumes), it will be rather painful to restore: currently there are no tools to manage logical volumes off-line, so users have to keep track of the configuration on their own. It is not at all difficult.

To register a brick in the system, use the following command:

 # volume.reiser4 -g BRICK_NAME

To print a list of all registered bricks, use

 # volume.reiser4 -l

Now mount your LV by issuing a mount(8) command against one of its bricks. We recommend issuing it against the meta-data brick.

Comment. Reiser5 always tries to register the brick that is passed to the mount command as an argument, so it is not necessary to preregister the brick you issue the mount command against.

= Deploying a logical volume after hard reset or system crash =

If no volume operations were interrupted by the hard reset or system crash, just follow the instructions in this [https://reiser4.wiki.kernel.org/index.php?title=Logical_Volumes_Administration#Deploying_a_logical_volume_after_correct_shutdown section]. In Reiser5 only a restricted number of bricks participate in each transaction; the maximal number of such bricks can be specified by the user. At mount time a transaction replay procedure is launched on each such brick independently, in parallel.
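Since each brick keeps its own transaction log, replay parallelizes naturally across bricks. The following is a purely hypothetical sketch of that idea (the function names and the replay body are placeholders, not the kernel's actual interfaces):

```python
# Hypothetical sketch of per-brick parallel replay: each registered brick's
# committed transactions are replayed independently of the other bricks.
from concurrent.futures import ThreadPoolExecutor

def replay_journal(brick):
    # Placeholder for reading and applying the brick's committed transactions.
    return f"{brick}: replayed"

def replay_all(bricks, max_workers=4):
    # One replay task per brick, running concurrently.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(replay_journal, bricks))
```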
Depending on the kind of interrupted volume operation, perform one of the following actions:

== Volume balancing was interrupted ==

Check your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]. Register the complete set of bricks and mount the volume by issuing the mount command against one of its bricks. Check the balanced status of your LV by running

 # volume.reiser4 /mnt

and checking the "balanced" value. If the volume is unbalanced, complete balancing by running

 # volume.reiser4 -b /mnt

== Brick removal was interrupted ==

Check your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]. Register the new set of bricks (that is, the set of bricks without the brick you wanted to remove) and try to mount the volume. In the case of error, register also the brick you wanted to remove and try to mount again. Check the status of your LV by running

 # volume.reiser4 /mnt

and checking the value of "health". If required, complete brick removal by running

 # volume.reiser4 -R /mnt

Note that the option -R doesn't accept any arguments. After successful removal completion the brick will be automatically removed from the volume and unregistered. Make sure of it by checking the status of your LV and the list of registered bricks:

 # volume.reiser4 /mnt
 # volume.reiser4 -l

Upon successful completion, update your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration] accordingly.
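The recovery logic above can be condensed into a tiny decision helper. This is an illustration of the documented rules only, not a real reiser4progs interface; inputs stand for the "balanced" and "health" fields reported by volume.reiser4:

```python
# Given the volume status, suggest the completion command (sketch).
def recovery_action(balanced, incomplete_removal):
    if incomplete_removal:
        return "volume.reiser4 -R /mnt"   # finish interrupted brick removal
    if not balanced:
        return "volume.reiser4 -b /mnt"   # finish interrupted balancing
    return None                           # healthy and balanced: nothing to do
```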
= LV monitoring =

Common info about the LV mounted at /mnt:

 # volume.reiser4 /mnt

 ID:             Volume UUID
 volume:         ID of the plugin managing the volume
 distribution:   ID of the distribution plugin
 stripe:         Stripe size in bytes
 segments:       Number of hash space segments (for distribution)
 bricks total:   Total number of bricks in the volume
 bricks in DSA:  Number of bricks participating in data distribution
 balanced:       Balanced status of the volume
 health:         Brick removal completion status

Info about the brick of index J:

 # volume.reiser4 -p J /mnt

 internal ID:    Brick's "internal ID" and its status in the volume
 external ID:    Brick's UUID
 device name:    Name of the block device associated with the brick
 block count:    Size of the block device in blocks
 blocks used:    Total number of occupied blocks on the device
 system blocks:  Minimal possible number of busy blocks on the device
 data capacity:  Abstract capacity of the brick
 space usage:    Portion of occupied blocks on the device
 in DSA:         Participation in regular data distribution
 is proxy:       Participation in data tiering (Burst Buffers, etc.)

Comment. When retrieving a brick's info, make sure that no volume operations on that volume are in progress; otherwise the command above will return an error (EBUSY).

WARNING. Brick info provided this way is not necessarily the most recent. To get actual info, run sync(1) and make sure that no regular file operations are in progress.

= Checking free space =

To check the number of available free blocks on a volume mounted at /mnt, make sure that no regular file operations or volume operations are in progress on that volume, then run

 # sync
 # df --block-size=4K /mnt

To check the number of free blocks on the brick of index J, run

 # volume.reiser4 -p J /mnt

and calculate the difference between "block count" and "blocks used".

Comment. Not all free blocks on a brick/volume are available for use.
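The free-space arithmetic above is simple enough to sketch. The figures below are hypothetical; on a real brick take "block count" and "blocks used" from `volume.reiser4 -p J /mnt`:

```python
# Free blocks on a brick = block count - blocks used (sketch).
def free_blocks(block_count, blocks_used):
    return block_count - blocks_used

def available_blocks(block_count, blocks_used, reserve=0.05):
    # Reiser4 reserves ~5% so that truncate operations won't fail,
    # hence df(1) reports roughly 95% of the free blocks.
    return int(free_blocks(block_count, blocks_used) * (1 - reserve))

# Hypothetical 10G brick (4K blocks):
free = free_blocks(2621440, 1657203)        # 964237 free blocks
avail = available_blocks(2621440, 1657203)  # ~95% of that
```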
The number of available free blocks is always ~95% of the total number of free blocks (Reiser4 reserves 5% to make sure that regular file truncate operations won't fail).

NOTE: volume.reiser4 shows the total number of free blocks, whereas df(1) shows the number of available free blocks. The "space usage" statistic shows the portion of busy blocks on an individual brick. For the reasons explained above, "space usage" on any brick cannot exceed 0.95.

= Checking quality of data distribution =

Quality of data distribution is a measure of the deviation of the real data space usage from the ideal one defined by the volume partitioning. The smaller the deviation, the better the distribution quality. Checking quality of distribution makes sense only when your volume partitioning is space-based, or coincides with the space-based one. If your partitioning is throughput-based and doesn't coincide with the space-based one, then the quality of the actual data distribution can be rather bad: in that case the file system takes care that low-performance devices do not become a bottleneck, and effective space usage is not a high priority.

Checking quality of data distribution is based on the free-blocks accounting provided by the file system. Note that the file system doesn't count busy data and meta-data blocks separately, so you cannot find the real data space usage - and hence cannot check quality of distribution - when the meta-data brick contains data blocks.

To check quality of distribution:

1) make sure that the meta-data brick doesn't contain data blocks;

2) make sure that no regular file or volume operations are currently in progress;

3) find the "blocks used", "system blocks" and "data capacity" statistics for each data brick:

 # sync
 # volume.reiser4 -p 1 /mnt
 ...
 # volume.reiser4 -p N /mnt

4) find the real data space usage on each brick;

5) calculate the partitioning and the ideal data space usage on each data brick;

6) find the deviation of (4) from (5).

Example.
Let's build an LV of 3 bricks (one 10G meta-data brick vdb1, and two data bricks: vdc1 (10G) and vdd1 (5G)) with space-based partitioning:

 # VOL_ID=`uuid -v4`
 # echo "Using uuid $VOL_ID"
 # mkfs.reiser4 -U $VOL_ID -y -t 256K /dev/vdb1
 # mkfs.reiser4 -U $VOL_ID -y -a -t 256K /dev/vdc1
 # mkfs.reiser4 -U $VOL_ID -y -a -t 256K /dev/vdd1
 # mount /dev/vdb1 /mnt

Fill the meta-data brick with data:

 # dd if=/dev/zero of=/mnt/myfile bs=256K
 No space left on device...

Add data bricks /dev/vdc1 and /dev/vdd1 to the volume:

 # volume.reiser4 -a /dev/vdc1 /mnt
 # volume.reiser4 -a /dev/vdd1 /mnt

Move all data blocks to the newly added bricks:

 # volume.reiser4 -r /dev/vdb1 /mnt
 # sync

Now the meta-data brick doesn't contain data blocks (only meta-data ones), so we can calculate the quality of data distribution:

 # volume.reiser4 /mnt -p0
 blocks used: 503
 # volume.reiser4 /mnt -p1
 blocks used: 1657203
 system blocks: 115
 data capacity: 2621069
 # volume.reiser4 /mnt -p2
 blocks used: 833001
 system blocks: 73
 data capacity: 1310391

Based on the statistics above, calculate the quality of distribution.

Total data capacity of the volume:

 C = 2621069 + 1310391 = 3931460

Relative capacities of the data bricks:

 C1 = 2621069 / 3931460 = 0.6667
 C2 = 1310391 / 3931460 = 0.3333

Real space usage on the data bricks (blocks used - system blocks):

 R1 = 1657203 - 115 = 1657088
 R2 = 833001 - 73 = 832928

Space usage on the volume:

 R = R1 + R2 = 1657088 + 832928 = 2490016

Ideal data space usage on the data bricks:

 I1 = C1 * R = 0.6667 * 2490016 = 1660094
 I2 = C2 * R = 0.3333 * 2490016 = 829922

Deviation:

 D = (R1, R2) - (I1, I2) = (-3006, 3006)

Relative deviation:

 D/R = (-0.0012, 0.0012)

Quality of distribution:

 Q = 1 - max(|D1|, |D2|)/R = 1 - 0.0012 = 0.9988

Comment. For any specified number of bricks N and quality of distribution Q it is possible to find a configuration of a logical volume composed of N bricks such that the quality of distribution on that volume is better than Q.

Comment.
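The hand calculation above can be replayed with a short script (a sketch; the per-brick numbers are the example's own, as reported by `volume.reiser4 -p J /mnt`):

```python
# Recompute the worked example's quality of distribution Q.
def distribution_quality(bricks):
    """bricks: list of (blocks_used, system_blocks, data_capacity) tuples."""
    total_cap = sum(cap for _, _, cap in bricks)               # C
    real = [used - sysb for used, sysb, _ in bricks]           # R_i
    total = sum(real)                                          # R
    ideal = [cap / total_cap * total for _, _, cap in bricks]  # I_i = C_i * R
    dev = [r - i for r, i in zip(real, ideal)]                 # D_i = R_i - I_i
    return 1 - max(abs(d) for d in dev) / total                # Q

# Data bricks from the example (vdc1 and vdd1):
q = distribution_quality([(1657203, 115, 2621069),
                          (833001, 73, 1310391)])
# q is ~0.9988, matching the hand calculation (the small difference
# comes from rounding C1 and C2 to four digits above).
```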
Quality of distribution Q doesn't depend on the number of bricks in the logical volume. This is a theorem, which can be strictly proven.

= FAQ =

Q. What happens if I lose a device-component of my logical volume (due to a breakdown, etc.)?

A. The bodies of some of your regular files will become "punched" in random places. The portion of such files depends on the relative capacity of the lost brick, on the number of bricks in the logical volume, and on other factors. Fsck will be able to detect and remove files with corrupted bodies. Nevertheless, we recommend considering mirroring your bricks (e.g. by software or hardware RAID-1) to avoid such highly unpleasant situations.

[[category:Reiser4]]
Let' build a LV of 3 bricks (one 10G meta-data brick sdb1, and two data bricks: sdc1 (10G), sdd1(5G)) with space-based partitioning: # VOL_ID=`uuid -v4` # echo "Using uuid $VOL_ID" # mkfs.reiser4 -U $VOL_ID -y -t 256K /dev/vdb1 # mkfs.reiser4 -U $VOL_ID -y -a -t 256K /dev/vdc1 # mkfs.reiser4 -U $VOL_ID -y -a -t 256K /dev/vdd1 # mount /dev/vdb1 /mnt Fill the meta-data brick with data: # dd if=/dev/zero of=/mnt/myfile bs=256K No space left on device... Add data-bricks /dev/sdc1 and dev/sdd1 to the volume: # volume.reiser4 -a /dev/vdc1 /mnt # volume.reiser4 -a /dev/vdd1 /mnt Move all data blocks to the newly added bricks: # volume.reiser4 -r /dev/vdb1 /mnt # sync Now meta-data brick doesn't contain data blocks (only meta-data ones), so that we can calculate quality of data distribution # volume.reiser4 /mnt -p0 blocks used: 503 # volume.reiser4 /mnt -p1 blocks used: 1657203 system blocks: 115 data capacity: 2621069 # volume.reiser4 /mnt -p2 blocks used: 833001 system blocks: 73 data capacity: 1310391 Basing on the statistics above calculate quality of distribution. Total data capacity of the volume: C = 2621069 + 1310391 = 3931460 Relative capacities of data bricks: C1 = 2621069 /(2621069 + 1310391) = 0.6667 C2 = 1310464 /(2621069 + 1310391) = 0.3333 Real space usage on data bricks (blocks used - system blocks): R1 = 1657203 - 115 = 1657088 R2 = 833001 - 73 = 832928 Space usage on the volume: R = R1 + R2 = 1657088 + 832928 = 2490016 Ideal data space usage on data bricks: I1 = C1 * R = 0.6667 * 2490016 = 1660094 I2 = C2 * R = 0.3333 * 2490016 = 829922 Deviation: D = (R1, R2) - (I1, I2) = (3006, -3006) Relative deviation: D/R = (-0.0012, 0.0012) Quality of distribution: Q = 1 - max(|D1|, |D1|) = 1 - 0.0012 = 0.9988 Comment. For any specified number of bricks N and quality of distribution Q it is possible to find a configuration of a logical volume composed of N bricks, so that quality of distribution on that volume will be better than Q. Comment. 
Before working with logical volumes you need to understand some basic [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Background principles]. A logical volume (LV) can be composed of any number of block devices with different physical and geometric parameters. However, the optimal configuration (true parallelism) imposes some restrictions and dependencies on the sizes of such devices.

WARNING: This code is not stable. Don't put important data on logical volumes managed by software of release number 5.X.Y, and don't mount your old partitions in kernels with Reiser4 of SFRN 5.X.Y before its stabilization.

IMPORTANT: Currently there are no tools to manage Reiser5 logical volumes off-line, so it is strongly recommended to save/update the configuration of your LV in a file which doesn't belong to that volume.

= Basic definitions. Volume configuration. Brick's capacity. Partitioning. Fair distribution. Balancing =

The basic configuration of a logical volume is the following information:

1) Volume UUID;
2) Number of bricks in the volume;
3) List of brick names or UUIDs in the volume;
4) UUID or name of the brick to be added/removed (if any).
That brick is not counted in (2) and (3). Item #4 exists to handle incomplete operations, interrupted for various reasons (system crash, hard reset, etc.), when bringing logical volumes on-line.

For each volume, its configuration should be stored somewhere (but not on that volume!) and properly updated before and after each volume operation performed on that volume. The user is responsible for this. The volume configuration is needed to facilitate deploying the volume.

'''Capacity of a brick''' (or abstract capacity) is a positive integer. Capacity is a brick's property defined by the user. Don't confuse it with the size of the block device; think of it as the brick's "weight" in some units. It is the user who decides which property of the brick to assign as its abstract capacity and in which units. In particular, it can be the size of the block device in kilobytes or megabytes, its throughput in MB/s, or another geometric or physical parameter of the device associated with the brick. It is important that the capacities of all bricks of the same logical volume are measured in the same units. Also, it would be utterly pointless to assign different properties as abstract capacities for bricks of the same LV; for example, the size of the block device for one brick and disk bandwidth for another.

The capacity of each brick gets initialized by the mkfs utility. By default it is calculated as the number of free blocks on the device at the very end of the formatting procedure. For the meta-data brick it is calculated as 70% of that amount. The capacity of any brick can be changed on-line by the user.

'''Capacity of a logical volume''' is defined as the sum of the capacities of its component bricks.

'''Relative capacity of a brick''' is the ratio of the brick's capacity to the volume's capacity. Relative capacity defines the portion of IO requests that will be issued against that brick. The array of relative capacities (C1, C2, ...) of all bricks is called the volume partitioning. Obviously, C1 + C2 + ... = 1.
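The partitioning arithmetic above can be illustrated with a tiny sketch (the brick names and capacity values are invented for the example):

```python
# Sketch of volume partitioning: relative capacities C1, C2, ... of the bricks.
# Brick names and capacity values are made-up examples, not real output.

def partitioning(capacities):
    """Map each brick to its relative capacity; the values sum to 1."""
    total = sum(capacities.values())
    return {brick: cap / total for brick, cap in capacities.items()}

# Capacities must be in the same units for every brick (here: arbitrary units).
bricks = {"/dev/vdb1": 7000, "/dev/vdc1": 10000, "/dev/vdd1": 5000}
shares = partitioning(bricks)   # e.g. /dev/vdd1 receives 5000/22000 of IO requests
```

Each value in `shares` is the fraction of IO requests the corresponding brick will receive under fair distribution.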
'''Real data space usage''' on a brick is the number of data blocks stored on that brick.

'''Ideal data space usage''' on a brick is defined as T*C, where T is the total number of data blocks stored in the volume and C is the relative capacity of the brick.

It is recommended to compose volumes so that the space-based partitioning coincides with the throughput-based one; this is the optimal volume configuration, which provides true parallelism. If that is impossible for some reason, then choose a preferred partitioning method (space-based or throughput-based). Note that space-based partitioning saves volume space, whereas throughput-based partitioning saves volume throughput.

When performing regular file operations, Reiser5 distributes data stripes throughout the volume evenly and fairly. This means that the portion of IO requests issued against each brick is equal to its relative capacity, that is, to the portion of capacity that the brick adds to the total volume capacity.

In contrast with regular file operations, volume operations break the fairness of data distribution on your logical volume. To restore fairness of distribution, a special balancing procedure should be run on the volume. For example, after adding a brick to a logical volume, the balancing procedure will populate the new brick with data moved from other bricks.

All volume operations except brick removal are fast, atomic, and leave the volume in an unbalanced state. The operation of brick removal always includes balancing, which moves data from the brick you want to remove to the other bricks of the volume. If that data migration is interrupted for some reason, then the volume is marked as a "volume with incomplete brick removal".

It is allowed to perform regular file and volume operations on an unbalanced LV (assuming the imbalance is not due to incomplete removal). However, in this case we don't guarantee a good quality of data distribution on your LV.
In addition, on a volume with incomplete removal you won't be able to perform regular volume operations: first you will need to complete the removal by running a special removal completion procedure on your volume.

= Prepare Software and Hardware =

Build, install and boot a kernel with Reiser4 of software framework release number 5.X.Y. Kernel patches can be found [https://sourceforge.net/projects/reiser4/files/v5-unstable/ here]. Note that the Linux kernel and GNU utilities still recognize this testing code as "Reiser4". Make sure there is the following message in the kernel logs:

 "Loading Reiser4 (Software Framework Release: 5.X.Y)"

Build and install the latest [https://sourceforge.net/projects/reiser4/files/reiser4-utils/libaal/ libaal]. Download, build and install the latest version 2.A.B of the [https://sourceforge.net/projects/reiser4/files/v5-unstable/ Reiser4progs package]. Make sure that the utility for managing logical volumes is installed (as part of reiser4progs) on your machine:

 # volume.reiser4 -?

= Creating a logical volume =

Start by choosing a unique ID (UUID) for your volume. By default it is generated by the mkfs utility. However, you can generate it yourself with a suitable tool (e.g. uuidgen(1)) and store it in an environment variable for convenience:

 # VOL_ID=`uuidgen`
 # echo "Using uuid $VOL_ID"

Choose a stripe size for your logical volume. For a good quality of distribution it is recommended that the stripe size not exceed 1/10000 of the volume size. On the other hand, too small a stripe will increase space consumption on your meta-data brick. In our example we choose a stripe size of 512K:

 # STRIPE=512K
 # echo "Using stripe size $STRIPE"

Start by creating the first brick of your volume, the meta-data brick, passing the volume ID and stripe size to the mkfs.reiser4 utility:

 # mkfs.reiser4 -U $VOL_ID -t $STRIPE /dev/vdb1

Currently only one meta-data brick per volume is supported, so it is recommended that the block device for the meta-data brick is not too small.
In most cases it will be enough if your meta-data brick is not smaller than 1/200 of the maximal volume size. For example, a 100G meta-data brick will be able to service a ~20T logical volume.

Data and meta-data bricks don't differ from the standpoint of disk format, and there is no special option to tell the mkfs utility that we want to create a meta-data brick: the first brick in the volume automatically becomes the meta-data brick, and the other bricks are interpreted as data bricks.

Mount your initial logical volume consisting of one meta-data brick:

 # mount /dev/vdb1 /mnt

Find a record about your volume in the output of the following command:

 # volume.reiser4 -l

Create the configuration of your logical volume (its definition is above) and store it somewhere, but not on that volume! Your logical volume is now on-line and ready to use. You can perform regular file operations and volume operations (e.g. add a data brick to your LV).

= Adding a data brick to LV =

At any time you are able to add a data brick to your LV. You can do it in parallel with regular file operations executing on this volume. Make sure, however, that no other volume operation (e.g. removing a brick) is in progress on your volume, otherwise your operation will fail with EBUSY. Obviously, adding a brick will increase the capacity of your volume.

Choose a block device for the new data brick. Make sure that it is not too large or too small: the capacities of any two bricks of the same logical volume cannot differ by more than 2^19 (~500,000) times. E.g. your logical volume cannot contain both a 1M and a 2T brick. Any attempt to add a brick of improper capacity will fail with an error.

Format it with the same volume ID and stripe size as you used for the meta-data brick, but specify also the "-a" option (to not restrict data capacity):

 # mkfs.reiser4 -U $VOL_ID -t $STRIPE -a /dev/vdb2

It is important that the data brick is formatted with the same volume ID and stripe size as the meta-data brick of your logical volume.
Otherwise, the operation of adding a data brick will fail.

Update item #4 of your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration] with the UUID or name of the brick you want to add. To add a brick, simply pass its name as an argument to the option "-a" and specify your LV via its mount point:

 # volume.reiser4 -a /dev/vdb2 /mnt

By default the operation of adding a brick is fast and atomic and leaves the volume in an unbalanced state, so after adding a brick you might want to run the balancing procedure, which will move a portion of data to the new brick from the other bricks of the logical volume, making data distribution on your volume fair:

 # volume.reiser4 -b /mnt

The portion of data blocks moved during such rebalancing is equal to the relative capacity of the new brick, that is, to the portion of capacity that the new brick adds to the updated LV's capacity. This important property defines the cost of the balancing procedure: if the portion of capacity added by a brick is small, then the number of stripes moved during balancing is also small. Using the add operation in conjunction with the option -B (--with-balance) will trigger the balancing procedure automatically:

 # volume.reiser4 -Ba /dev/vdb2 /mnt

Upon successful completion update your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]: increment (#2), add info about the new brick to (#3), and clear (#4).

When adding more than one brick at once, call volume.reiser4 with option -a for each brick individually, in any order. It is reasonable not to complete each call with balancing; run balancing only after adding the last brick.
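The cost property described above can be checked numerically; the capacity values in this sketch are invented:

```python
# Sketch of the balancing cost after adding a brick: the portion of data
# stripes moved equals the relative capacity that the new brick contributes
# to the updated volume. Capacities below are hypothetical.

def balancing_cost(old_capacities, new_capacity):
    """Fraction of data stripes expected to migrate to the new brick."""
    return new_capacity / (sum(old_capacities) + new_capacity)

# A volume of total capacity 9000 gains a brick of capacity 1000:
# about 10% of the stripes move during the subsequent balancing.
cost = balancing_cost([4000, 5000], 1000)
```

This is why adding a small brick to a large volume is cheap to balance, while adding a brick that doubles the volume's capacity moves about half of the data.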
= Removing a data brick from LV =

At any time you are able to remove any data brick from your LV (assuming that your volume is not marked as a "volume with incomplete brick removal"). You can perform brick removal in parallel with regular file operations executing on that volume. There shouldn't be, however, any other volume operation (e.g. adding a brick) in progress on your volume, otherwise the removal will fail with EBUSY. Obviously, the removal operation will decrease the abstract capacity of your LV. Note that the other bricks should have enough space to store all data blocks of the brick you want to remove; otherwise the removal operation will return an error (ENOSPC).

Suppose you want to remove brick /dev/vdb2 from your LV mounted at /mnt. Update your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration] with the UUID and name of the brick you want to remove (item #4). To remove a brick, simply pass its name as an argument to the option "-r" and specify the logical volume by its mount point:

 # volume.reiser4 -r /dev/vdb2 /mnt

The procedure of brick removal starts by moving all data from the brick you want to remove to the other bricks of your volume, so that the resulting data distribution among the remaining bricks is also fair. The portion of data stripes moved during such migration is equal to the relative capacity of the brick to be removed (that is, to the portion of capacity that the brick added to the LV's capacity). Successful brick removal always leaves the volume in a balanced state.

So, in contrast with the operation of adding a brick, removing a brick is a rather long operation, which can be interrupted for various reasons. In this case the volume will be marked as a "volume with incomplete brick removal". To check the removal status of your LV, simply run

 # volume.reiser4 /mnt

and check the field "health".
To complete brick removal in the current mount session, simply run

 # volume.reiser4 -R /mnt

Note that the option -R (--finish-removal) doesn't accept any arguments. On success, update your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]: remove the information about the brick /dev/vdb2 from #3 and #4. Check your kernel logs: they should contain a message that brick /dev/vdb2 has been unregistered. Now the device /dev/vdb2 doesn't belong to the logical volume any more, and you can reuse it for other purposes (re-format, etc.).

= Changing brick's capacity =

At any time (assuming that no other volume operation is in progress) you can change the abstract capacity of any brick to some new non-zero value. Changing capacity always changes the volume partitioning, and therefore breaks fairness of distribution. By default the operation of changing brick capacity leaves the volume in an unbalanced state, so after changing a brick's capacity you might want to run the balancing procedure to make data distribution on your volume fair. In particular, after increasing a brick's capacity the balancing procedure will move some data from other bricks to the brick whose capacity was increased. After decreasing a brick's capacity the balancing procedure will move some data from the brick whose capacity was decreased to the other bricks.

To change the abstract capacity of the brick /dev/vdb1 to a new value (e.g. 200000), simply run

 # volume.reiser4 -z /dev/vdb1 -c 200000 /mnt

pronounced as "resize brick /dev/vdb1 to new capacity 200000 in the volume mounted at /mnt". By default the operation of changing capacity is fast, atomic, and leaves the volume in an unbalanced state. To invoke balancing automatically, use this operation in conjunction with the option -B (--with-balance).
You can also run the balancing procedure later at any time by executing

 # volume.reiser4 -b /mnt

When changing the capacities of more than one brick at once, call the volume.reiser4 utility for each brick individually, in any order. It is reasonable not to complete each call with balancing; run balancing after changing the capacity of the last brick.

Comment. Changing a brick's capacity to 0 is undefined and will return an error. Consider the brick removal operation instead.

= Operations with meta-data brick =

The meta-data brick can also contain data stripes and participate in data distribution like the data bricks. All the volume operations described above are applicable to the meta-data brick as well. Note, however, that it is impossible to completely remove the meta-data brick from the logical volume for obvious reasons (meta-data needs to be stored somewhere), so the brick removal operation applied to the meta-data brick actually removes it only from the Data Storage Array (DSA), the subset of the LV consisting of the bricks that participate in regular data distribution according to their abstract capacities. Once you remove the meta-data brick from the DSA, that brick will be used only to store meta-data. The operation of adding a brick, applied to the meta-data brick, returns it to the DSA.

Important: Reiser5 doesn't count busy data and meta-data blocks separately. So in contrast with data bricks (which contain only data), you are not able to find out the real space occupied by data blocks on the meta-data brick; Reiser5 knows only the total space occupied.

To check the status of the meta-data brick of the volume mounted at /mnt, simply run

 # volume.reiser4 -p0 /mnt

and check the field "in DSA".

= Unmounting a logical volume =

To terminate a mount session, just issue the usual umount against the mount point:

 # umount /mnt

Note that after unmounting the volume all bricks by default remain registered in the system until system shutdown.
If you want to unregister a brick before system shutdown, then simply issue the following command:

 # volume.reiser4 -u BRICK_NAME

= Deploying a logical volume after correct unmount =

After unmounting a logical volume all its bricks remain registered in the system. So, if you want to mount the volume again, simply issue the mount command against one of its bricks. It is recommended to issue it against the meta-data brick.

NOTE: Reiser5 will refuse to mount a logical volume when a wrong (incomplete or redundant) set of bricks is registered in the system. A redundant set of bricks appears, for example, when you mistakenly register a brick that was earlier removed from the logical volume.

= Deploying a logical volume after correct shutdown =

First of all, check the [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing configuration] of your volume and make sure that all its bricks (data and meta-data ones) are registered in the system. The list of registered bricks can be printed with

 # volume.reiser4 -l

Also make sure that the set of registered bricks doesn't contain bricks not mentioned in the [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration].

Important: Reiser5 will refuse to mount a logical volume when a wrong (incomplete or redundant) set of bricks is registered in the system. A redundant set of bricks appears, for example, when you mistakenly register a brick that was removed from the logical volume.
For these reasons we strongly recommend keeping track of your LV: store its [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing configuration] somewhere, but not on that volume! And don't forget to update that configuration after '''every''' volume operation. If you have lost the configuration of your LV and don't remember it (which is likely for large volumes), then it will be rather painful to restore: currently there are no tools to manage logical volumes off-line, so users have to do this on their own. It is not at all difficult.

To register a brick in the system, use the following command:

 # volume.reiser4 -g BRICK_NAME

To print the list of all registered bricks, use

 # volume.reiser4 -l

Now mount your LV by simply issuing a mount(8) command against one of the bricks of your LV. We recommend issuing it against the meta-data brick.

Comment. Reiser5 always tries to register the brick which is passed to the mount command as an argument, so it is not necessary to preregister the brick you want to issue the mount command against.

= Deploying a logical volume after hard reset or system crash =

If no volume operations were interrupted by the hard reset or system crash, then just follow the instructions in this [https://reiser4.wiki.kernel.org/index.php?title=Logical_Volumes_Administration#Deploying_a_logical_volume_after_correct_shutdown section]. In Reiser5 only a restricted number of bricks participate in every transaction; the maximal number of such bricks can be specified by the user. At mount time a transaction replay procedure will be launched on each such brick independently, in parallel.
Depending on the kind of interrupted volume operation, perform one of the following actions:

== Volume balancing was interrupted ==

Check your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]. Register the complete set of bricks and mount the volume by issuing the mount command against one of its bricks. Check the balanced status of your LV by running

 # volume.reiser4 /mnt

and checking the "balanced" value. If the volume is unbalanced, then complete balancing by running

 # volume.reiser4 -b /mnt

== Brick removal was interrupted ==

Check your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]. Register the new set of bricks (that is, the set of bricks without the brick you wanted to remove). Try to mount the volume. In case of error, register also the brick you wanted to remove and try to mount again. Check the status of your LV by running

 # volume.reiser4 /mnt

and checking the value of "health". If required, complete the brick removal by running

 # volume.reiser4 -R /mnt

Note that the option -R doesn't accept any arguments. After successful removal completion the brick will be automatically removed from the volume and unregistered. Make sure of it by checking the status of your LV and the list of registered bricks:

 # volume.reiser4 /mnt
 # volume.reiser4 -l

Upon successful completion update your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration] accordingly.
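The record keeping urged throughout this page can be as simple as a small JSON file kept off the volume. The layout below is purely an illustration of items 1)-4) of the configuration; no Reiser5 tool reads this file, and every name and value in it is made up:

```python
import json

# Hypothetical bookkeeping record for items 1)-4) of a volume configuration.
# Store the resulting file OUTSIDE the logical volume (e.g. under /etc).
config = {
    "volume_uuid": "00000000-0000-0000-0000-000000000000",  # item 1 (made up)
    "brick_count": 2,                                       # item 2
    "bricks": ["/dev/vdb1", "/dev/vdc1"],                   # item 3
    "pending_brick": None,                                  # item 4
}

def begin_add(cfg, brick):
    """Record the pending brick (item 4) before running volume.reiser4 -a."""
    cfg["pending_brick"] = brick

def commit_add(cfg):
    """After a successful add: bump item 2, extend item 3, clear item 4."""
    cfg["bricks"].append(cfg["pending_brick"])
    cfg["brick_count"] += 1
    cfg["pending_brick"] = None

begin_add(config, "/dev/vdd1")
# ... here one would run: volume.reiser4 -a /dev/vdd1 /mnt ...
commit_add(config)
record = json.dumps(config, indent=2)   # write this string to a file off-volume
```

Updating item #4 before the operation and clearing it afterwards is exactly what lets you tell, after a crash, which brick an interrupted add/remove was working on.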
= LV monitoring =

Common info about the LV mounted at /mnt:

 # volume.reiser4 /mnt

 ID:            Volume UUID
 volume:        ID of the plugin managing the volume
 distribution:  ID of the distribution plugin
 stripe:        Stripe size in bytes
 segments:      Number of hash space segments (for distribution)
 bricks total:  Total number of bricks in the volume
 bricks in DSA: Number of bricks participating in data distribution
 balanced:      Balanced status of the volume
 health:        Brick removal completion status

Info about the brick of index J:

 # volume.reiser4 -p J /mnt

 internal ID:   Brick's "internal ID" and its status in the volume
 external ID:   Brick's UUID
 device name:   Name of the block device associated with the brick
 block count:   Size of the block device in blocks
 blocks used:   Total number of occupied blocks on the device
 system blocks: Minimal possible number of busy blocks on that device
 data capacity: Abstract capacity of the brick
 space usage:   Portion of occupied blocks on the device
 in DSA:        Participation in regular data distribution
 is proxy:      Participation in data tiering (Burst Buffers, etc.)

Comment. When retrieving a brick's info, make sure that no volume operations are in progress on that volume. Otherwise the command above will return an error (EBUSY).

WARNING. Brick info obtained this way is not necessarily the most recent. To get actual info, run sync(1) and make sure that no regular file operations are in progress.

= Checking free space =

To check the number of available free blocks on a volume mounted at /mnt, make sure that no regular file operations or volume operations are in progress on that volume, then run

 # sync
 # df --block-size=4K /mnt

To check the number of free blocks on the brick of index J, run

 # volume.reiser4 -p J /mnt

and calculate the difference between "block count" and "blocks used".

Comment. Not all free blocks on a brick/volume are available for use.
The number of available free blocks is always ~95% of the total number of free blocks (Reiser4 reserves 5% to make sure that regular file truncate operations won't fail).

NOTE: volume.reiser4 shows the total number of free blocks, whereas df(1) shows the number of available free blocks. The "space usage" statistic shows the portion of busy blocks on an individual brick. For the reasons explained above, "space usage" on any brick cannot be more than 0.95.

= Checking quality of data distribution =

Quality of data distribution is a measure of the deviation of the real data space usage from the ideal one defined by the volume partitioning. The smaller the deviation, the better the distribution quality. Checking quality of distribution makes sense only when your volume partitioning is space-based, or coincides with the space-based one. If your partitioning is throughput-based and doesn't coincide with the space-based one, then the quality of the actual data distribution can be rather bad: in this case the file system tries to prevent low-performance devices from becoming a bottleneck, and effective space usage is not a high priority.

Checking quality of data distribution is based on the free-blocks accounting provided by the file system. Note that the file system doesn't count busy data and meta-data blocks separately, so you are not able to find the real data space usage, and hence to check quality of distribution, when the meta-data brick contains data blocks.

To check quality of distribution:

1) make sure that the meta-data brick doesn't contain data blocks;
2) make sure that no regular file or volume operations are currently in progress;
3) find the "blocks used", "system blocks" and "data capacity" statistics for each data brick:

 # sync
 # volume.reiser4 -p 1 /mnt
 ...
 # volume.reiser4 -p N /mnt

4) find the real data space usage on each brick;
5) calculate the partitioning and the ideal data space usage on each data brick;
6) find the deviation of (4) from (5).

Example.
Let's build a LV of 3 bricks (one 10G meta-data brick vdb1, and two data bricks: vdc1 (10G) and vdd1 (5G)) with space-based partitioning:

 # VOL_ID=`uuid -v4`
 # echo "Using uuid $VOL_ID"
 # mkfs.reiser4 -U $VOL_ID -y -t 256K /dev/vdb1
 # mkfs.reiser4 -U $VOL_ID -y -a -t 256K /dev/vdc1
 # mkfs.reiser4 -U $VOL_ID -y -a -t 256K /dev/vdd1
 # mount /dev/vdb1 /mnt

Fill the meta-data brick with data:

 # dd if=/dev/zero of=/mnt/myfile bs=256K
 No space left on device...

Add data bricks /dev/vdc1 and /dev/vdd1 to the volume:

 # volume.reiser4 -a /dev/vdc1 /mnt
 # volume.reiser4 -a /dev/vdd1 /mnt

Move all data blocks to the newly added bricks:

 # volume.reiser4 -r /dev/vdb1 /mnt
 # sync

Now the meta-data brick doesn't contain data blocks (only meta-data ones), so we can calculate the quality of data distribution:

 # volume.reiser4 /mnt -p0
 blocks used: 503
 # volume.reiser4 /mnt -p1
 blocks used: 1657203
 system blocks: 115
 data capacity: 2621069
 # volume.reiser4 /mnt -p2
 blocks used: 833001
 system blocks: 73
 data capacity: 1310391

Based on the statistics above, calculate the quality of distribution.

Total data capacity of the volume:

 C = 2621069 + 1310391 = 3931460

Relative capacities of the data bricks:

 C1 = 2621069 / 3931460 = 0.6667
 C2 = 1310391 / 3931460 = 0.3333

Real data space usage on the data bricks (blocks used - system blocks):

 R1 = 1657203 - 115 = 1657088
 R2 = 833001 - 73 = 832928

Data space usage on the volume:

 R = R1 + R2 = 1657088 + 832928 = 2490016

Ideal data space usage on the data bricks:

 I1 = C1 * R = 0.6667 * 2490016 = 1660094
 I2 = C2 * R = 0.3333 * 2490016 = 829922

Deviation:

 (D1, D2) = (R1, R2) - (I1, I2) = (-3006, 3006)

Relative deviation:

 (D1, D2)/R = (-0.0012, 0.0012)

Quality of distribution:

 Q = 1 - max(|D1|, |D2|)/R = 1 - 0.0012 = 0.9988

Comment. For any specified number of bricks N and quality of distribution Q, it is possible to find a configuration of a logical volume composed of N bricks such that the quality of distribution on that volume is better than Q.
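The arithmetic of the example above can be checked mechanically. Below is a minimal sketch in POSIX shell + awk; the `distribution_quality` helper is hypothetical (it is not part of reiser4progs) and simply re-derives Q from the "blocks used", "system blocks" and "data capacity" statistics of the two data bricks:

```shell
# Hypothetical helper: compute quality of distribution Q for a volume
# with two data bricks, from statistics reported by volume.reiser4.
# Usage: distribution_quality USED1 SYS1 CAP1 USED2 SYS2 CAP2
distribution_quality() {
    awk -v u1="$1" -v s1="$2" -v c1="$3" \
        -v u2="$4" -v s2="$5" -v c2="$6" 'BEGIN {
        C  = c1 + c2                  # total data capacity of the volume
        R1 = u1 - s1; R2 = u2 - s2    # real data space usage per brick
        R  = R1 + R2                  # data space usage on the volume
        I1 = c1 / C * R               # ideal data space usage on brick 1
        d  = (R1 - I1) / R            # relative deviation (d2 = -d1 here)
        if (d < 0) d = -d
        printf "Q = %.4f\n", 1 - d    # quality of distribution
    }'
}

# The numbers reported for bricks -p1 and -p2 in the example:
distribution_quality 1657203 115 2621069 833001 73 1310391
```

Run with the example's statistics, this prints Q = 0.9988, matching the hand calculation (intermediate values differ slightly because the relative capacities are not rounded to four digits here).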
Quality of distribution Q doesn't depend on the number of bricks in the logical volume. This is a theorem, which can be strictly proven.

= FAQ =

Q. What happens if I lose a device-component of my logical volume (due to a breakdown, etc.)?

A. The bodies of some of your regular files will become "punched" in random places. The portion of such files depends on the relative capacity of the lost brick, on the number of bricks in the logical volume, and on other factors. Fsck will be able to detect and remove files with corrupted bodies. Nevertheless, we recommend considering mirroring your bricks (e.g. by software or hardware RAID-1) to avoid such highly unpleasant situations.

[[category:Reiser4]]

Before working with logical volumes you need to understand some basic [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Background principles].

A logical volume (LV) can be composed of any number of block devices, different in physical and geometric parameters. However, the optimal configuration (true parallelism) imposes some restrictions and dependencies on the sizes of such devices.

WARNING: This functionality is not yet stable. Don't put important data on logical volumes managed by software of release number 5.X.Y. Also, don't mount your old partitions in kernels with Reiser4 of SFRN 5.X.Y before its stabilization.

IMPORTANT: Currently there are no tools to manage Reiser5 logical volumes off-line, so it is strongly recommended to save/update the configuration of your LV in a file which doesn't belong to that volume.

= Basic definitions. Volume configuration. Brick's capacity. Partitioning. Fair distribution. Balancing =

The basic configuration of a logical volume is the following information:

1) Volume UUID;
2) Number of bricks in the volume;
3) List of brick names or UUIDs in the volume;
4) UUID or name of the brick to be added/removed (if any).
That brick is not counted in (2) and (3). Item #4 exists to handle incomplete operations interrupted for various reasons (system crash, hard reset, etc.) when bringing logical volumes on-line.

The configuration of each volume should be stored somewhere (but not on that volume!) and properly updated before and after each volume operation performed on that volume. We make the user responsible for this. The volume configuration is needed to facilitate deploying the volume.

'''Capacity of a brick''' (or abstract capacity) is a positive integer. Capacity is a brick property defined by the user. Don't confuse it with the size of the block device; think of it as the brick's "weight" in some units. It is the user who decides which property of the brick to assign as its abstract capacity, and in which units. In particular, it can be the size of the block device in kilobytes, or its size in megabytes, or its throughput in MB/sec, or any other geometric or physical parameter of the device associated with the brick. It is important that the capacities of all bricks of the same logical volume are measured in the same units. Also, it would be utterly pointless to assign different properties as abstract capacities for bricks of the same LV - for example, block device size for one brick and disk bandwidth for another.

The capacity of each brick gets initialized by the mkfs utility. By default it is calculated as the number of free blocks on the device at the very end of the formatting procedure. For a meta-data brick it is calculated as 70% of that amount. The capacity of any brick can be changed on-line by the user.

'''Capacity of a logical volume''' is defined as the sum of the capacities of its component bricks.

'''Relative capacity of a brick''' is the ratio of the brick's capacity to the volume's capacity. Relative capacity defines the portion of IO requests that will be issued against that brick. The array of relative capacities (C1, C2, ...) of all bricks is called the volume partitioning. Obviously, C1 + C2 + ... = 1.
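For illustration, the partitioning can be derived from a list of brick capacities with a few lines of shell; the `partitioning` helper and the capacity values below are hypothetical (they are not part of reiser4progs):

```shell
# Hypothetical helper: print the relative capacity of each brick,
# i.e. the volume partitioning (C1, C2, ...), given absolute capacities.
partitioning() {
    total=0
    for cap in "$@"; do total=$((total + cap)); done
    for cap in "$@"; do
        # relative capacity = brick capacity / volume capacity
        awk -v c="$cap" -v t="$total" 'BEGIN { printf "%.4f\n", c / t }'
    done
}

# e.g. three bricks whose capacities are all measured in the same units,
# as required (here: 100, 200 and 100):
partitioning 100 200 100
```

The printed values (0.2500, 0.5000, 0.2500) sum to 1, as the definition requires.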
'''Real data space usage''' on a brick is the number of data blocks stored on that brick.

'''Ideal data space usage''' on a brick is defined as T*C, where T is the total number of data blocks stored in the volume and C is the relative capacity of the brick.

It is recommended to compose volumes so that the space-based partitioning coincides with the throughput-based one - this is the optimal volume configuration, which provides true parallelism. If that is impossible for some reason, then choose a preferred partitioning method (space-based or throughput-based). Note that space-based partitioning saves volume space, whereas throughput-based partitioning saves volume throughput.

When performing regular file operations, Reiser5 distributes data stripes throughout the volume evenly and fairly. This means that the portion of IO requests issued against each brick is equal to its relative capacity, that is, to the portion of capacity that the brick adds to the total volume capacity.

In contrast with regular file operations, volume operations break the fairness of data distribution on your logical volume. To restore fairness of distribution, a special balancing procedure should be run on the volume. For example, after adding a brick to a logical volume, the balancing procedure will populate the new brick with data moved from other bricks.

All volume operations except brick removal are fast, atomic, and leave the volume in an unbalanced state. The operation of brick removal always includes balancing, which moves data from the brick you want to remove to the other bricks of the volume. If that data migration is interrupted for some reason, then the volume is marked as a "volume with incomplete brick removal".

It is allowed to perform regular file and volume operations on an unbalanced LV (assuming it is not marked with incomplete removal). However, in this case we don't guarantee a good quality of data distribution on your LV.
In addition, on a volume with incomplete removal you won't be able to perform regular volume operations - first you will need to complete the removal by running a special removal-completion procedure on your volume.

= Prepare Software and Hardware =

Build, install and boot a kernel with Reiser4 of software framework release number 5.X.Y. Kernel patches can be found [https://sourceforge.net/projects/reiser4/files/v5-unstable/ here]. Note that the Linux kernel and GNU utilities still recognize this testing software as "Reiser4". Make sure the following message appears in the kernel logs:

 Loading Reiser4 (Software Framework Release: 5.X.Y)

Build and install the latest [https://sourceforge.net/projects/reiser4/files/reiser4-utils/libaal/ libaal].

Download, build and install the latest version 2.A.B of the [https://sourceforge.net/projects/reiser4/files/v5-unstable/ Reiser4progs package]. Make sure that the utility for managing logical volumes is installed (as a part of the reiser4progs package) on your machine:

 # volume.reiser4 -?

= Creating a logical volume =

Start by choosing a unique ID (UUID) for your volume. By default it is generated by the mkfs utility. However, the user can generate it himself with a proper tool (e.g. uuidgen(1)) and store it in an environment variable for convenience:

 # VOL_ID=`uuidgen`
 # echo "Using uuid $VOL_ID"

Choose a stripe size for your logical volume. For a good quality of distribution it is recommended that the stripe not exceed 1/10000 of the volume size. On the other hand, too small a stripe will increase space consumption on your meta-data brick. In our example we choose a stripe size of 512K:

 # STRIPE=512K
 # echo "Using stripe size $STRIPE"

Then create the first brick of your volume - the meta-data brick - passing the volume ID and stripe size to the mkfs.reiser4 utility:

 # mkfs.reiser4 -U $VOL_ID -t $STRIPE /dev/vdb1

Currently only one meta-data brick per volume is supported, so it is recommended that the block device for the meta-data brick not be too small.
In most cases it will be enough if your meta-data brick is not smaller than 1/200 of the maximal volume size. For example, a 100G meta-data brick will be able to service a ~20T logical volume. Data and meta-data bricks don't differ from the standpoint of disk format, and there is no special option to inform the mkfs utility that we want to create a meta-data brick: the first brick in the volume automatically becomes the meta-data brick, and the other bricks are interpreted as data bricks.

Mount your initial logical volume consisting of one meta-data brick:

 # mount /dev/vdb1 /mnt

Find the record about your volume in the output of the following command:

 # volume.reiser4 -l

Create the configuration of your logical volume (its definition is above) and store it somewhere - but not on that volume!

Your logical volume is now on-line and ready to use. You can perform regular file operations and volume operations (e.g. add a data brick to your LV).

= Adding a data brick to LV =

At any time you are able to add a data brick to your LV. You can do it in parallel with regular file operations executing on the volume. Make sure, however, that no other volume operation (e.g. removing a brick) is in progress on your volume; otherwise your operation will fail with EBUSY. Obviously, adding a brick will increase the capacity of your volume.

Choose a block device for the new data brick. Make sure that it is not too large or too small: the capacities of any two bricks of the same logical volume cannot differ by more than 2^19 (~500,000) times. E.g. your logical volume cannot contain both 1M and 2T bricks. Any attempt to add a brick of improper capacity will fail with an error.

Format the device with the same volume ID and stripe size as you used for the meta-data brick, but also specify the "-a" option (to not restrict data capacity):

 # mkfs.reiser4 -U $VOL_ID -t $STRIPE -a /dev/vdb2

It is important that a data brick is formatted with the same volume ID and stripe size as the meta-data brick of your logical volume.
Otherwise, the operation of adding a data brick will fail.

Update item #4 of your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration] with the UUID or name of the brick you want to add.

To add a brick, simply pass its name as an argument to the option "-a" and specify your LV via its mount point:

 # volume.reiser4 -a /dev/vdb2 /mnt

By default the operation of adding a brick is fast and atomic and leaves the volume in an unbalanced state, so after adding a brick you might want to run the balancing procedure, which will move a portion of data from the other bricks of the logical volume to the new brick, making data distribution on your volume fair:

 # volume.reiser4 -b /mnt

The portion of data blocks moved during such rebalancing is equal to the relative capacity of the new brick, that is, to the portion of capacity that the new brick adds to the updated LV's capacity. This important property defines the cost of the balancing procedure: if the portion of capacity added by a brick is small, then the number of stripes moved during balancing is also small.

Specifying the option -B (--with-balance) will automatically trigger the balancing procedure after adding a brick:

 # volume.reiser4 -Ba /dev/vdb2 /mnt

Upon successful completion, update your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]. That is, increment (#2), add info about the new brick to (#3) and remove the record at (#4).

When adding more than one brick at once, call volume.reiser4 with option -a for each brick individually, in any order. It is reasonable to not complete each call with balancing; run balancing only after adding the last brick.
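The cost estimate described above can be sketched numerically. The `moved_stripes` helper and all numbers below are hypothetical; the point is only that the moved portion equals the relative capacity of the new brick in the updated volume:

```shell
# Hypothetical helper: estimate the rebalancing cost of adding a brick.
# $1 = capacity of the new brick, $2 = capacity of the old volume,
# $3 = number of data stripes currently stored in the volume.
moved_stripes() {
    awk -v n="$1" -v v="$2" -v s="$3" 'BEGIN {
        portion = n / (v + n)   # relative capacity of the new brick
        printf "portion = %.4f, stripes moved ~ %d\n", portion, s * portion
    }'
}

# Adding a brick of capacity 1000 to a volume of capacity 3000
# that currently stores 80000 stripes:
moved_stripes 1000 3000 80000
```

Here the new brick contributes a quarter of the updated capacity, so roughly a quarter of the stored stripes (20000 of 80000) migrate to it during balancing.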
= Removing a data brick from LV =

At any time you are able to remove any data brick from your LV (assuming that your volume is not marked as a "volume with incomplete brick removal"). You can perform brick removal in parallel with regular file operations executing on that volume. There shouldn't be, however, any other volume operation (e.g. adding a brick) in progress on your volume; otherwise the removal will fail with EBUSY.

Obviously, the removal operation will decrease the abstract capacity of your LV. Note that the other bricks should have enough space to store all data blocks of the brick you want to remove; otherwise the removal operation will return an error (ENOSPC).

Suppose you want to remove brick /dev/vdb2 from your LV mounted at /mnt. Update your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration] with the UUID and name of the brick you want to remove (item #4). To remove the brick, simply pass its name as an argument to option "-r" and specify the logical volume by its mount point:

 # volume.reiser4 -r /dev/vdb2 /mnt

The procedure of brick removal starts by moving all data from the brick you want to remove to the other bricks of your volume, so that the resulting data distribution among the remaining bricks is also fair. The portion of data stripes moved during such migration is equal to the relative capacity of the brick to be removed (that is, to the portion of capacity that the brick added to the LV's capacity). Successful brick removal always leaves the volume in a balanced state.

So, in contrast with the operation of adding a brick, removing a brick is a rather long operation, which can be interrupted for various reasons. In this case the volume will be marked as a "volume with incomplete brick removal". To check the removal status of your LV, simply run

 # volume.reiser4 /mnt

and check the field "health".
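That check can be scripted. The sketch below assumes `field: value` output lines with the field names listed in the monitoring section; the sample text and the value string "incomplete removal" are made up for illustration - on a live system you would pipe the real `volume.reiser4 /mnt` output instead of `sample_status`:

```shell
# Extract the "health" field from volume.reiser4-style output.
health_field() {
    awk -F': *' '$1 == "health" { print $2 }'
}

# Assumed sample output; NOT captured from a real system.
sample_status() {
    cat <<'EOF'
balanced: no
health: incomplete removal
EOF
}

health=$(sample_status | health_field)
echo "health: $health"
# If removal is incomplete, complete it with:  volume.reiser4 -R /mnt
```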
To complete brick removal in the current mount session, simply run

 # volume.reiser4 -R /mnt

Note that the option -R (--finish-removal) doesn't accept any arguments.

On success, update your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]: remove the information about the brick /dev/vdb2 at #3 and #4. Check your kernel logs: they should contain a message that brick /dev/vdb2 has been unregistered. Now the device /dev/vdb2 doesn't belong to the logical volume any more, and you can reuse it for other purposes (re-format it, etc.).

= Changing brick's capacity =

At any time (assuming that no other volume operation is in progress) you can change the abstract capacity of any brick to some new non-zero value. Changing capacity always changes the volume partitioning and, therefore, breaks fairness of distribution. By default the operation of changing brick capacity leaves the volume in an unbalanced state, so after changing brick capacity you might want to run the balancing procedure to make data distribution on your volume fair. In particular, after increasing a brick's capacity the balancing procedure will move some data from other bricks to the brick whose capacity was increased; after decreasing a brick's capacity it will move some data from the brick whose capacity was decreased to other bricks.

To change the abstract capacity of brick /dev/vdb1 to a new value (e.g. 200000), simply run

 # volume.reiser4 -z /dev/vdb1 -c 200000 /mnt

pronounced as "resize brick /dev/vdb1 to new capacity 200000 in the volume mounted at /mnt".

By default the operation of changing capacity is fast, atomic, and leaves the volume in an unbalanced state. To automatically invoke balancing, use this operation in conjunction with the option -B (--with-balance).
You can also run the balancing procedure later at any time by executing

 # volume.reiser4 -b /mnt

When changing the capacities of more than one brick at once, call the volume.reiser4 utility for each brick individually, in any order. It is reasonable to not complete each call with balancing; run balancing after changing the capacity of the last brick.

Comment. Changing a brick's capacity to 0 is undefined and will return an error. Consider the brick removal operation instead.

= Operations with meta-data brick =

The meta-data brick can also contain data stripes and participate in data distribution like the data bricks. All the volume operations described above are also applicable to the meta-data brick. Note, however, that it is impossible to completely remove the meta-data brick from the logical volume, for obvious reasons (meta-data needs to be stored somewhere). So the brick removal operation, applied to the meta-data brick, actually removes it only from the Data Storage Array (DSA) - the subset of the LV consisting of the bricks that participate in regular data distribution according to their abstract capacities. Once you remove the meta-data brick from the DSA, that brick will be used only to store meta-data. The operation of adding a brick, applied to the meta-data brick, returns it back to the DSA.

Important: Reiser5 doesn't count busy data and meta-data blocks separately. So, in contrast with data bricks (which contain only data), you are not able to find out the real space occupied by data blocks on the meta-data brick - Reiser5 knows only the total space occupied.

To check the status of the meta-data brick of the volume mounted at /mnt, simply run

 # volume.reiser4 -p0 /mnt

and check the field "in DSA".

= Unmounting a logical volume =

To terminate a mount session, just issue the usual umount against the mount point:

 # umount /mnt

Note that after unmounting the volume, all bricks by default remain registered in the system until system shutdown.
If you want to unregister a brick before system shutdown, simply issue the following command:

 # volume.reiser4 -u BRICK_NAME

= Deploying a logical volume after correct unmount =

After unmounting a logical volume, all its bricks remain registered in the system. So, if you want to mount the volume again, simply issue the mount command against one of its bricks. It is recommended to issue it against the meta-data brick.

NOTE: Reiser5 will refuse to mount a logical volume when a wrong (incomplete or redundant) set of bricks is registered in the system. A redundant set of bricks appears, for example, when you mistakenly register a brick that was earlier removed from the logical volume.

= Deploying a logical volume after correct shutdown =

First of all, check the [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing configuration] of your volume and make sure that all its bricks (data and meta-data ones) are registered in the system. The list of registered bricks can be printed by

 # volume.reiser4 -l

Also make sure that the set of registered bricks doesn't contain bricks not mentioned in the [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration].

Important: Reiser5 will refuse to mount a logical volume when a wrong (incomplete or redundant) set of bricks is registered in the system. A redundant set of bricks appears, for example, when you mistakenly register a brick that was removed from the logical volume.
For these reasons we strongly recommend that the user keep track of his LV: store its [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing configuration] somewhere - but not on that volume! And don't forget to update that configuration after '''every''' volume operation. If you lose the configuration of your LV and don't remember it (which is most likely for large volumes), it will be rather painful to restore: currently there are no tools to manage logical volumes off-line, so users are expected to maintain the configuration on their own. It is not at all difficult.

To register a brick in the system, use the following command:

 # volume.reiser4 -g BRICK_NAME

To print the list of all registered bricks, use

 # volume.reiser4 -l

Now mount your LV by simply issuing a mount(8) command against one of the bricks of your LV. We recommend issuing it against the meta-data brick.

Comment. Reiser5 always tries to register the brick which is passed to the mount command as an argument, so it is not necessary to preregister the brick you want to issue the mount command against.

= Deploying a logical volume after hard reset or system crash =

If no volume operations were interrupted by the hard reset or system crash, then just follow the instructions in this [https://reiser4.wiki.kernel.org/index.php?title=Logical_Volumes_Administration#Deploying_a_logical_volume_after_correct_shutdown section].

In Reiser5 only a restricted number of bricks participate in every transaction. The maximal number of such bricks can be specified by the user. At mount time a transaction replay procedure will be launched on each such brick independently, in parallel.
Depending on the kind of interrupted volume operation, perform one of the following actions:

== Volume balancing was interrupted ==

Check your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]. Register the complete set of bricks and mount the volume by issuing the mount command against one of its bricks. Check the balanced status of your LV by running

 # volume.reiser4 /mnt

and checking the "balanced" value. If the volume is unbalanced, then complete balancing by running

 # volume.reiser4 -b /mnt

== Brick removal was interrupted ==

Check your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]. Register the new set of bricks (that is, the set of bricks without the brick you wanted to remove). Try to mount the volume. In the case of an error, register also the brick you wanted to remove and try to mount again. Check the status of your LV by running

 # volume.reiser4 /mnt

and checking the value of "health". If required, complete the brick removal by running

 # volume.reiser4 -R /mnt

Note that the option -R doesn't accept any arguments. After successful removal completion the brick will be automatically removed from the volume and unregistered. Make sure of it by checking the status of your LV and the list of registered bricks:

 # volume.reiser4 /mnt
 # volume.reiser4 -l

Upon successful completion, update your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration] accordingly.
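Since a wrong set of registered bricks prevents mounting, it may help to compare the saved configuration against the registered list before attempting to mount. A minimal sketch, assuming a hypothetical one-brick-name-per-line file format (the document deliberately leaves the storage format of the volume configuration to the user):

```shell
# Compare the configured brick list against the registered one and report
# bricks that are missing (configured but not registered) or redundant
# (registered but not configured). Both input files must be sorted.
check_bricks() {
    missing=$(comm -23 "$1" "$2")
    redundant=$(comm -13 "$1" "$2")
    if [ -n "$missing" ];   then echo "missing: $missing"; fi
    if [ -n "$redundant" ]; then echo "redundant: $redundant"; fi
    if [ -z "$missing$redundant" ]; then echo "ok"; fi
}

# Example: /dev/vdc1 is in the saved configuration but was never registered.
printf '%s\n' /dev/vdb1 /dev/vdc1 | sort > /tmp/configured
printf '%s\n' /dev/vdb1 | sort > /tmp/registered
check_bricks /tmp/configured /tmp/registered
```

In a real deployment, the registered list would come from parsing `volume.reiser4 -l` output rather than from a hand-written file.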
= LV monitoring = Common info about LV mounted at /mnt # volume.reiser4 /mnt ID: Volume UUID volume: ID of plugin managing the volume distribution: ID of distribution plugin stripe: Stripe size in bytes segments: Number of hash space segments (for distribution) bricks total: Total number of bricks in the volume bricks in DSA: Number of bricks participating in data distribution balanced: Balanced status of the volume health: Brick removal completion status Info about any its brick of index J # volume.reiser4 -p J /mnt internal ID: Brick's "internal ID" and its status in the volume external ID: Brick's UUID device name: Name of the block device associated with the brick block count: Size of the block device in blocks blocks used: Total number of occupied blocks on the device system blocks: Minimal possible number of busy blocks on that device data capacity: Abstract capacity of the brick space usage: Portion of occupied blocks on the device in DSA: Participation in regular data distribution is proxy: Participation in data tiering (Burst Buffers, etc) Comment. When retrieving brick's info make sure that no volume operations over that volume are in progress. Otherwise the command above will return error (EBUSY). WARNING. Bricks info provided by such way is not necessarily the most recent one. To get an actual info run sync(1) and make sure that no regular file operations are in progress. = Checking free space = To check number of available free blocks on a volume mounted at /mnt, make sure that no regular file operations, as well as volume operations, are in progress on that volume, then run # sync # df --block-size=4K /mnt To check number of free blocks on the brick of index J run # volume.reiser4 -p J /mnt Then calculate the difference between block count and blocks used Comment. Not all free blocks on a brick/volume are available for use. 
Number of available free blocks is always ~95% of total number of free blocks (Reiser4 reserves 5% to make sure that regular file truncate operations won't fail). NOTE: volume.reiser4 shows total number of free blocks, whereas df(1) shows number of available free blocks. "Space usage" statistics shows a portion of busy blocks on individual brick. For the reasons explained above "space usage" on any brick can not be more than 0.95 = Checking quality of data distribution = Quality of data distribution is a measure of deviation of the real data space usage from the ideal one defined by volume partitioning. The smaller the deviation, the better the distribution quality. Checking quality of distribution makes sense only in the case when your volume partitioning is space-based, or if it coincides with the space-based one. If your partitioning is throughput-based, and it doesn't coincide with the space-based one, then quality of actual data distribution can be rather bad, as in this case the file system is worried for low-performance devices to not become a bottleneck, and effective space usage in this case is not a high priority. Checking quality of data distribution is based on the free blocks accounting, provided by the file system. Note that file system doesn't count busy data and meta-data blocks separately, so you are not able to find real data space usage, and hence to check quality of distribution in the case when meta-data brick contains data blocks. To check quality of distribution * make sure that meta-data brick doesn't contain data blocks; * make sure that no regular file and volume operations are currently in progress; * find "blocks used", "system blocks" and "data capacity" statistics for each data brick: # sync # volume.reiser4 -p 1 /mnt ... # volume.reiser4 -p N /mnt * find real data space usage on each brick; * calculate partitioning and ideal data space usage on each data brick; * find deviation of (4) from (5). Example. 
Let' build a LV of 3 bricks (one 10G meta-data brick sdb1, and two data bricks: sdc1 (10G), sdd1(5G)) with space-based partitioning: # VOL_ID=`uuid -v4` # echo "Using uuid $VOL_ID" # mkfs.reiser4 -U $VOL_ID -y -t 256K /dev/vdb1 # mkfs.reiser4 -U $VOL_ID -y -a -t 256K /dev/vdc1 # mkfs.reiser4 -U $VOL_ID -y -a -t 256K /dev/vdd1 # mount /dev/vdb1 /mnt Fill the meta-data brick with data: # dd if=/dev/zero of=/mnt/myfile bs=256K No space left on device... Add data-bricks /dev/sdc1 and dev/sdd1 to the volume: # volume.reiser4 -a /dev/vdc1 /mnt # volume.reiser4 -a /dev/vdd1 /mnt Move all data blocks to the newly added bricks: # volume.reiser4 -r /dev/vdb1 /mnt # sync Now meta-data brick doesn't contain data blocks (only meta-data ones), so that we can calculate quality of data distribution # volume.reiser4 /mnt -p0 blocks used: 503 # volume.reiser4 /mnt -p1 blocks used: 1657203 system blocks: 115 data capacity: 2621069 # volume.reiser4 /mnt -p2 blocks used: 833001 system blocks: 73 data capacity: 1310391 Basing on the statistics above calculate quality of distribution. Total data capacity of the volume: C = 2621069 + 1310391 = 3931460 Relative capacities of data bricks: C1 = 2621069 /(2621069 + 1310391) = 0.6667 C2 = 1310464 /(2621069 + 1310391) = 0.3333 Real space usage on data bricks (blocks used - system blocks): R1 = 1657203 - 115 = 1657088 R2 = 833001 - 73 = 832928 Space usage on the volume: R = R1 + R2 = 1657088 + 832928 = 2490016 Ideal data space usage on data bricks: I1 = C1 * R = 0.6667 * 2490016 = 1660094 I2 = C2 * R = 0.3333 * 2490016 = 829922 Deviation: D = (R1, R2) - (I1, I2) = (3006, -3006) Relative deviation: D/R = (-0.0012, 0.0012) Quality of distribution: Q = 1 - max(|D1|, |D1|) = 1 - 0.0012 = 0.9988 Comment. For any specified number of bricks N and quality of distribution Q it is possible to find a configuration of a logical volume composed of N bricks, so that quality of distribution on that volume will be better than Q. Comment. 
Quality of distribution Q doesn't depend on the number of bricks in the logical volume. This is a theorem, which can be strictly proven. = FAQ = Q. What happens if I lose a device-component (due to a breakdown, etc) of my logical volume? A. Bodies of some your regular files will become "punched" in random places. Portion of such files depends on the relative capacity of the lost brick, on the number of bricks in the logical volume, and on other factors. Fsck will be able to detect and remove such files with corrupted bodies. Nevertheless, we recommend to consider mirroring your bricks (e.g. by software, or hardware RAID-1) to avoid such highly unpleasant situations. [[category:Reiser4]] 600119b154f5f6d871ce867084ab95b044e22b0f 4425 4424 2020-11-12T00:55:10Z Edward 4 /* Creating a logical volume */ Before working with logical volumes you need to understand some basic [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Background principles]. Logical volume (LV) can be composed of any number of block devices, different in physical and geometric parameters. However the optimal configuration (true parallelism) imposes some restrictions and dependencies on the size of such devices. WARNING: The stuff is not stable. Don't put important data to logical volumes managed by software of release number 5.X.Y. Also don't mount your old partitions in kernels with Reiser4 of SFRN 5.X.Y before its stabilization IMPORTANT: Currently there is no tools to manage Reiser5 logical volumes off-line, so it it strongly recommended to save/update configurations of your LV in a file, which doesn't belong to that volume. = Basic definitions. Volume configuration. Brick's capacity. Partitioning. Fair distribution. Balancing = Basic configuration of a logical volume is the following information: 1) Volume UUID; 2) Number of bricks in the volume; 3) List of brick names or UUIDs in the volume; 4) UUID or name of the brick to be added/removed (if any). 
That brick is not counted in (2) and (3). Item #4 exists to handle incomplete operations interrupted for various reasons (system crash, hard reset, etc.) when bringing logical volumes on-line.

For each volume, its configuration should be stored somewhere (but not on that volume!) and properly updated before and after each volume operation performed on that volume. We make the user responsible for this. The volume configuration is needed to facilitate deploying the volume.

'''Capacity of a brick''' (or abstract capacity) is a positive integer. Capacity is a brick property defined by the user. Don't confuse it with the size of the block device; think of it as the brick's "weight" in some units. It is the user who decides which property of the brick to assign as its abstract capacity, and in which units. In particular, it can be the size of the block device in kilobytes, or its size in megabytes, or its throughput in MB/s, or any other geometric or physical parameter of the device associated with the brick. It is important that the capacities of all bricks of the same logical volume are measured in the same units. Also, it would be utterly pointless to assign different properties as abstract capacities for bricks of the same LV (for example, block device size for one brick and disk bandwidth for another).

The capacity of each brick gets initialized by the mkfs utility. By default it is calculated as the number of free blocks on the device at the very end of the formatting procedure; for a meta-data brick it is calculated as 70% of that amount. The capacity of any brick can be changed on-line by the user.

'''Capacity of a logical volume''' is defined as the sum of the capacities of its component bricks.

'''Relative capacity of a brick''' is the ratio of the brick's capacity to the volume's capacity. Relative capacity defines the portion of IO-requests that will be issued against that brick. The array of relative capacities (C1, C2, ...) of all bricks is called the volume partitioning. Obviously, C1 + C2 + ... = 1.
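The partitioning arithmetic above can be illustrated with a short sketch. The brick names and capacity values below are made up for illustration (say, device sizes in megabytes):

```python
# Relative capacities of bricks form the volume partitioning.
# Capacities are abstract, user-assigned weights; the values below
# are purely illustrative.
capacities = {"vdb1": 10240, "vdc1": 10240, "vdd1": 5120}

volume_capacity = sum(capacities.values())        # capacity of the LV
partitioning = {name: cap / volume_capacity       # relative capacities
                for name, cap in capacities.items()}

# C1 + C2 + ... = 1, and each C defines the portion of IO-requests
# issued against the corresponding brick.
assert abs(sum(partitioning.values()) - 1.0) < 1e-9
print(partitioning["vdd1"])   # → 0.2
```

Here the 5G brick contributes 1/5 of the total capacity, so it services 20% of the IO-requests.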
'''Real data space usage''' on a brick is the number of data blocks stored on that brick.

'''Ideal data space usage''' on a brick is defined as T*C, where T is the total number of data blocks stored in the volume and C is the relative capacity of the brick.

It is recommended to compose volumes so that the space-based partitioning coincides with the throughput-based one - this is the optimal volume configuration, which provides true parallelism. If that is impossible for some reason, then choose a preferred partitioning method (space-based or throughput-based). Note that space-based partitioning saves volume space, whereas throughput-based partitioning saves volume throughput.

When performing regular file operations, Reiser5 distributes data stripes throughout the volume evenly and fairly. That means the portion of IO-requests issued against each brick is equal to its relative capacity, i.e. to the portion of capacity that the brick contributes to the total volume capacity.

In contrast with regular file operations, volume operations break the fairness of data distribution on your logical volume. To restore fairness of distribution, a special balancing procedure should be run on the volume. For example, after adding a brick to a logical volume, the balancing procedure will populate the new brick with data moved from the other bricks. All volume operations except brick removal are fast, atomic and leave the volume in an unbalanced state. The operation of brick removal always includes balancing, which moves data from the brick you want to remove to the other bricks of the volume. If that data migration is interrupted for some reason, the volume is marked as a "volume with incomplete brick removal".

It is allowed to perform regular file and volume operations on an unbalanced LV (assuming the imbalance was not caused by an incomplete removal). However, in this case we don't guarantee a good quality of data distribution on your LV.
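A numeric sketch of the fairness definitions above; all block counts and capacities are invented:

```python
# Ideal data space usage on a brick is T * C, where T is the total
# number of data blocks on the volume and C the brick's relative
# capacity. The deviation from the ideal is what balancing removes.
capacities = [2000, 1000]     # abstract capacities (illustrative)
real_usage = [1350, 650]      # data blocks actually stored per brick

T = sum(real_usage)           # total data blocks on the volume
total_cap = sum(capacities)
ideal = [T * c / total_cap for c in capacities]

deviation = [r - i for r, i in zip(real_usage, ideal)]
# Deviations always cancel out: data missing from one brick is
# necessarily stored on the others.
assert abs(sum(deviation)) < 1e-9
print([round(d, 2) for d in deviation])   # → [16.67, -16.67]
```

A positive deviation means the brick stores more than its fair share; balancing would move that excess to the under-used brick.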
In addition, on a volume with incomplete removal you won't be able to perform regular volume operations - first you will need to complete the removal by running a special removal-completion procedure on your volume.

= Prepare Software and Hardware =

Build, install and boot a kernel with Reiser4 of software framework release number 5.X.Y. Kernel patches can be found [https://sourceforge.net/projects/reiser4/files/v5-unstable/ here]. Note that the Linux kernel and GNU utilities still recognize the testing software as "Reiser4". Make sure there is the following message in the kernel logs:

 "Loading Reiser4 (Software Framework Release: 5.X.Y)"

Build and install the latest [https://sourceforge.net/projects/reiser4/files/reiser4-utils/libaal/ libaal]. Download, build and install the latest version 2.A.B of the [https://sourceforge.net/projects/reiser4/files/v5-unstable/ Reiser4progs package]. Make sure that the utility for managing logical volumes is installed (as a part of the reiser4progs package) on your machine:

 # volume.reiser4 -?

= Creating a logical volume =

Start by choosing a unique ID (uuid) for your volume. By default it is generated by the mkfs utility; however, the user can generate it with a suitable tool (e.g. uuidgen(1)) and store it in an environment variable for convenience:

 # VOL_ID=`uuidgen`
 # echo "Using uuid $VOL_ID"

Choose a stripe size for your logical volume. For a good quality of distribution it is recommended that the stripe not exceed 1/10000 of the volume size. On the other hand, too small a stripe will increase space consumption on your meta-data brick. In our example we choose a stripe size of 512K:

 # STRIPE=512K
 # echo "Using stripe size $STRIPE"

Start by creating the first brick of your volume - the meta-data brick - passing the volume ID and stripe size to the mkfs.reiser4 utility:

 # mkfs.reiser4 -U $VOL_ID -t $STRIPE /dev/vdb1

Currently only one meta-data brick per volume is supported, so it is recommended that the block device for the meta-data brick is not too small.
In most cases it will be enough if your meta-data brick is not smaller than 1/200 of the maximal volume size. For example, a 100G meta-data brick will be able to service a ~20T logical volume. Data and meta-data bricks don't differ from the standpoint of disk format, and there is no special option to tell the mkfs utility that we want to create a meta-data brick: the first brick in the volume automatically becomes the meta-data brick, and the other bricks are interpreted as data bricks.

Mount your initial logical volume consisting of one meta-data brick:

 # mount /dev/vdb1 /mnt

Find the record about your volume in the output of the following command:

 # volume.reiser4 -l

Create the configuration of your logical volume (its definition is above) and store it somewhere - but not on that volume! Your logical volume is now on-line and ready to use. You can perform regular file operations and volume operations (e.g. add a data brick to your LV).

= Adding a data brick to LV =

At any time you are able to add a data brick to your LV. You can do it in parallel with regular file operations executing on this volume. Make sure, however, that no other volume operation (e.g. removing a brick) is in progress on your volume, otherwise your operation will fail with EBUSY. Obviously, adding a brick will increase the capacity of your volume.

Choose a block device for the new data brick. Make sure that it is not too large or too small: the capacities of any 2 bricks of the same logical volume cannot differ by more than 2^19 (about half a million) times. E.g. your logical volume cannot contain both 1M and 2T bricks. Any attempt to add a brick of improper capacity will fail with an error.

Format it with the same volume ID and stripe size as you used for the meta-data brick, but also specify the "-a" option (to not restrict data capacity):

 # mkfs.reiser4 -U $VOL_ID -t $STRIPE -a /dev/vdb2

It is important that a data brick is formatted with the same volume ID and stripe size as the meta-data brick of your logical volume.
Otherwise, the operation of adding the data brick will fail.

Update item #4 of your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration] with the UUID or name of the brick you want to add. To add a brick, simply pass its name as an argument to the option "-a" and specify your LV via its mount point:

 # volume.reiser4 -a /dev/vdb2 /mnt

By default the operation of adding a brick is fast and atomic and leaves the volume in an unbalanced state, so after adding a brick you might want to run the balancing procedure, which will move a portion of data to the new brick from the other bricks of the logical volume and thereby make the data distribution on your volume fair:

 # volume.reiser4 -b /mnt

The portion of data blocks moved during such rebalancing is equal to the relative capacity of the new brick, that is, to the portion of capacity that the new brick adds to the updated LV's capacity. This important property defines the cost of the balancing procedure: if the portion of capacity added by a brick is small, then the number of stripes moved during balancing is also small. Specifying the option -B (--with-balance) will automatically trigger the balancing procedure after adding the brick:

 # volume.reiser4 -Ba /dev/vdb2 /mnt

Upon successful completion, update your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]. That is, increment (#2), add info about the new brick to (#3) and remove the record at (#4). When adding more than one brick at once, call volume.reiser4 with option -a for each brick individually, in any order. It is reasonable not to follow each call with balancing; run balancing only after adding the last brick.
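The cost rule just stated - the fraction of stripes moved equals the relative capacity of the new brick in the updated volume - can be sketched as follows. The capacity numbers are invented:

```python
def balancing_cost(existing_caps, new_cap):
    """Fraction of data stripes moved when balancing after adding a brick.

    Per the rule above, it equals the new brick's relative capacity in
    the *updated* volume (existing capacities plus the new one).
    """
    return new_cap / (sum(existing_caps) + new_cap)

# Adding a brick of capacity 5000 to a volume with total capacity 15000:
# the new brick contributes 1/4 of the updated capacity, so about a
# quarter of the existing stripes migrate to it.
print(balancing_cost([10000, 5000], 5000))   # → 0.25
```

This is why adding a small brick to a large volume is cheap, while doubling the volume's capacity in one step moves half of the data.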
= Removing a data brick from LV =

At any time you are able to remove any data brick from your LV (assuming that your volume is not marked as a "volume with incomplete brick removal"). You can perform brick removal in parallel with regular file operations executing on that volume. There shouldn't be, however, any other volume operation (e.g. adding a brick) in progress on your volume, otherwise the removal will fail with EBUSY. Obviously, the removal operation will decrease the abstract capacity of your LV. Note that the other bricks should have enough space to store all data blocks of the brick you want to remove, otherwise the removal operation will return an error (ENOSPC).

Suppose you want to remove brick /dev/vdb2 from your LV mounted at /mnt. Update your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration] with the UUID and name of the brick you want to remove (item #4). To remove the brick, simply pass its name as an argument to option "-r" and specify the logical volume by its mount point:

 # volume.reiser4 -r /dev/vdb2 /mnt

The procedure of brick removal starts by moving all data from the brick you want to remove to the other bricks of your volume, so that the resulting data distribution among the remaining bricks is also fair. The portion of data stripes moved during such migration is equal to the relative capacity of the brick to be removed (that is, to the portion of capacity that the brick added to the LV's capacity). A successful brick removal always leaves the volume in a balanced state.

So, in contrast with the operation of adding a brick, removing a brick is a rather long operation, which can be interrupted for various reasons. In this case the volume will be marked as a "volume with incomplete brick removal". To check the removal status of your LV, simply run

 # volume.reiser4 /mnt

and check the field "health".
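The ENOSPC condition above (the remaining bricks must be able to absorb the removed brick's blocks) can be roughly pre-checked from the `volume.reiser4 -p` statistics. The helper below is a hypothetical sketch, not a reiser4 tool; all numbers are invented, and `blocks_used` also counts system blocks, so this is only an upper-bound estimate:

```python
def removal_fits(bricks, victim):
    """Rough pre-check for brick removal.

    `bricks` maps brick name -> (block_count, blocks_used), mirroring
    the `volume.reiser4 -p` fields; the values here are invented.
    Only ~95% of blocks are usable (Reiser4 reserves 5%), hence the
    0.95 factor applied to each remaining brick.
    """
    to_move = bricks[victim][1]   # upper bound: all busy blocks migrate
    free = sum(int(0.95 * count) - used
               for name, (count, used) in bricks.items()
               if name != victim)
    return free >= to_move

bricks = {"vdb2": (1310720, 800000),    # brick we want to remove
          "vdc1": (2621440, 900000)}    # brick that must absorb the data
print(removal_fits(bricks, "vdb2"))     # → True
```

If the check fails, free some space (or add another brick) before attempting the removal.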
To complete brick removal in the current mount session, simply run

 # volume.reiser4 -R /mnt

Note that the option -R (--finish-removal) doesn't accept any arguments. On success, update your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]: remove the information about the brick /dev/vdb2 at #3 and #4. Check your kernel logs: they should contain a message that brick /dev/vdb2 has been unregistered. Now the device /dev/vdb2 doesn't belong to the logical volume any more, and you can reuse it for other purposes (re-format, etc.).

= Changing brick's capacity =

At any time (assuming that no other volume operation is in progress) you can change the abstract capacity of any brick to a new value different from 0. Changing capacity always changes the volume partitioning and therefore breaks fairness of distribution, so Reiser5 automatically launches rebalancing to make sure that the resulting distribution is fair for the new set of capacities. In particular, increasing a brick's capacity will move some data from the other bricks to the brick whose capacity was increased; decreasing a brick's capacity will move some data from the brick whose capacity was decreased to the other bricks.

To change the abstract capacity of brick /dev/vdb1 to a new value (e.g. 200000), simply run

 # volume.reiser4 -z /dev/vdb1 -c 200000 /mnt

Pronounced as "resize brick /dev/vdb1 to new capacity 200000 in the volume mounted at /mnt".

The operation of changing capacity can return an error, most likely -ENOSPC, which is a side effect of concurrent regular file writes. In this case check the status of your LV. If it is unbalanced, then consider removing some files from your LV and complete the balancing by running

 # volume.reiser4 -b /mnt

Otherwise, repeat the operation from scratch.

Comment. Changing a brick's capacity to 0 is undefined and will return an error.
Consider the brick removal operation instead.

= Operations with meta-data brick =

The meta-data brick can also contain data stripes and participate in data distribution like the data bricks. All the volume operations described above are also applicable to the meta-data brick. Note, however, that it is impossible to completely remove the meta-data brick from the logical volume for obvious reasons (meta-data need to be stored somewhere), so the brick removal operation applied to the meta-data brick actually removes it only from the Data Storage Array (DSA) - the subset of the LV consisting of the bricks that participate in regular data distribution according to their abstract capacities. Once you remove the meta-data brick from the DSA, that brick will be used only to store meta-data. The operation of adding a brick, applied to the meta-data brick, returns it back to the DSA.

Important: Reiser5 doesn't count busy data and meta-data blocks separately. So, in contrast with data bricks (which contain only data), you are not able to find out the real space occupied by data blocks on the meta-data brick - Reiser5 knows only the total space occupied. To check the status of the meta-data brick of the volume mounted at /mnt, simply run

 # volume.reiser4 -p0 /mnt

and check the field "in DSA".

= Unmounting a logical volume =

To terminate a mount session, just issue the usual umount against the mount point:

 # umount /mnt

Note that after unmounting the volume all bricks by default remain registered in the system until system shutdown. If you want to unregister a brick before system shutdown, simply issue the following command:

 # volume.reiser4 -u BRICK_NAME

= Deploying a logical volume after correct unmount =

After unmounting a logical volume all its bricks remain registered in the system. So, if you want to mount the volume again, simply issue the mount command against one of its bricks. It is recommended to issue it against the meta-data brick.
NOTE: Reiser5 will refuse to mount a logical volume when a wrong (incomplete or redundant) set of bricks is registered in the system. A redundant set of bricks appears, for example, when you mistakenly register a brick that was earlier removed from the logical volume.

= Deploying a logical volume after correct shutdown =

First of all, check the [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing configuration] of your volume and make sure that all its bricks (data and meta-data ones) are registered in the system. The list of registered bricks can be printed by

 # volume.reiser4 -l

Also make sure that the set of registered bricks doesn't contain bricks not mentioned in the [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]. Important: Reiser5 will refuse to mount a logical volume when a wrong (incomplete or redundant) set of bricks is registered in the system. A redundant set of bricks appears, for example, when you mistakenly register a brick that was removed from the logical volume.

For these reasons we strongly recommend that the user keep track of his LV - store its [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing configuration] somewhere, but not on that volume! And don't forget to update that configuration after '''every''' volume operation. If you lose the configuration of your LV and don't remember it (which is most likely for large volumes), it will be rather painful to restore: currently there are no tools to manage logical volumes off-line, so users have to keep the configuration on their own. It is not at all difficult.
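The wiki prescribes no file format for the saved volume configuration, so here is one possible sketch using JSON. The field names and all concrete values are assumptions; only the four items themselves come from the "Basic definitions" section above:

```python
import json
import os
import tempfile

# Items 1-4 of the volume configuration ("Basic definitions" above).
# Field names and values are placeholders, not an official format.
config = {
    "volume_uuid": "00000000-0000-0000-0000-000000000000",  # item #1
    "brick_count": 2,                                       # item #2
    "bricks": ["/dev/vdb1", "/dev/vdb2"],                   # item #3
    "pending_brick": None,   # item #4: brick being added/removed, if any
}

# Store the file OUTSIDE the volume itself (a temp file stands in here).
fd, path = tempfile.mkstemp(suffix=".json")
with os.fdopen(fd, "w") as f:
    json.dump(config, f, indent=2)

# Re-read it, e.g. before bringing the volume on-line after a crash.
with open(path) as f:
    restored = json.load(f)
assert restored["brick_count"] == len(restored["bricks"])
os.remove(path)
```

Updating such a file before and after every volume operation (set `pending_brick` before adding/removing, clear it afterwards) gives exactly the crash-recovery record that item #4 is meant to provide.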
To register a brick in the system, use the following command:

 # volume.reiser4 -g BRICK_NAME

To print the list of all registered bricks, use

 # volume.reiser4 -l

Now mount your LV by simply issuing a mount(8) command against one of its bricks. We recommend issuing it against the meta-data brick.

Comment. Reiser5 always tries to register the brick that is passed to the mount command as an argument, so it is not necessary to pre-register the brick you want to issue the mount command against.

= Deploying a logical volume after hard reset or system crash =

If no volume operations were interrupted by the hard reset or system crash, just follow the instructions in this [https://reiser4.wiki.kernel.org/index.php?title=Logical_Volumes_Administration#Deploying_a_logical_volume_after_correct_shutdown section]. In Reiser5 only a restricted number of bricks participate in every transaction; the maximal number of such bricks can be specified by the user. At mount time a transaction replay procedure will be launched on each such brick independently, in parallel.

Depending on the kind of interrupted volume operation, perform one of the following actions:

== Volume balancing was interrupted ==

Check your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]. Register the complete set of bricks and mount the volume by issuing the mount command against one of its bricks. Check the balanced status of your LV by running

 # volume.reiser4 /mnt

and checking the "balanced" value. If the volume is unbalanced, complete the balancing by running

 # volume.reiser4 -b /mnt

== Brick removal was interrupted ==

Check your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration].
Register the new set of bricks (that is, the set of bricks without the brick you wanted to remove). Try to mount the volume. In case of error, register also the brick you wanted to remove and try to mount again. Check the status of your LV by running

 # volume.reiser4 /mnt

and checking the value of "health". If required, complete the brick removal by running

 # volume.reiser4 -R /mnt

Note that the option -R doesn't accept any arguments. After successful removal completion the brick will be automatically removed from the volume and unregistered. Make sure of it by checking the status of your LV and the list of registered bricks:

 # volume.reiser4 /mnt
 # volume.reiser4 -l

Upon successful completion, update your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration] accordingly.

= LV monitoring =

Common info about the LV mounted at /mnt:

 # volume.reiser4 /mnt

* ID: Volume UUID
* volume: ID of the plugin managing the volume
* distribution: ID of the distribution plugin
* stripe: Stripe size in bytes
* segments: Number of hash space segments (for distribution)
* bricks total: Total number of bricks in the volume
* bricks in DSA: Number of bricks participating in data distribution
* balanced: Balanced status of the volume
* health: Brick removal completion status

Info about any of its bricks of index J:

 # volume.reiser4 -p J /mnt

* internal ID: Brick's "internal ID" and its status in the volume
* external ID: Brick's UUID
* device name: Name of the block device associated with the brick
* block count: Size of the block device in blocks
* blocks used: Total number of occupied blocks on the device
* system blocks: Minimal possible number of busy blocks on that device
* data capacity: Abstract capacity of the brick
* space usage: Portion of occupied blocks on the device
* in DSA: Participation in regular data distribution
* is proxy: Participation in data tiering (Burst Buffers, etc.)

Comment.
When retrieving brick info, make sure that no volume operations are in progress on that volume; otherwise the command above will return an error (EBUSY).

WARNING. Brick info obtained this way is not necessarily the most recent. To get up-to-date info, run sync(1) and make sure that no regular file operations are in progress.

= Checking free space =

To check the number of available free blocks on a volume mounted at /mnt, make sure that no regular file operations or volume operations are in progress on that volume, then run

 # sync
 # df --block-size=4K /mnt

To check the number of free blocks on the brick of index J, run

 # volume.reiser4 -p J /mnt

then calculate the difference between "block count" and "blocks used".

Comment. Not all free blocks on a brick/volume are available for use: the number of available free blocks is always ~95% of the total number of free blocks (Reiser4 reserves 5% to make sure that regular file truncate operations won't fail).

NOTE: volume.reiser4 shows the total number of free blocks, whereas df(1) shows the number of available free blocks. The "space usage" statistic shows the portion of busy blocks on an individual brick; for the reasons explained above, "space usage" on any brick cannot be more than 0.95.

= Checking quality of data distribution =

Quality of data distribution is a measure of the deviation of the real data space usage from the ideal one defined by the volume partitioning. The smaller the deviation, the better the distribution quality. Checking quality of distribution makes sense only when your volume partitioning is space-based, or coincides with the space-based one. If your partitioning is throughput-based and doesn't coincide with the space-based one, then the quality of the actual data distribution can be rather bad: in that case the file system takes care that low-performance devices do not become a bottleneck, and effective space usage is not a high priority.
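The ~95% reservation described in the free-space section above can be wrapped in a small helper; the block counts below are invented:

```python
def available_blocks(block_count, blocks_used):
    """Free blocks actually available for use on a brick.

    Per the note above, only ~95% of the free blocks are available:
    Reiser4 reserves 5% so that regular file truncates cannot fail.
    df(1) reports roughly this figure, while `volume.reiser4 -p`
    reports the raw "block count" and "blocks used" values.
    """
    free = block_count - blocks_used
    return round(0.95 * free)

# A brick of 1,000,000 blocks with 600,000 in use has 400,000 free
# blocks, of which roughly 380,000 are available:
print(available_blocks(1_000_000, 600_000))   # → 380000
```

This also explains why "space usage" on a brick never exceeds 0.95: writes fail once the available (not the raw) free-block count is exhausted.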
Checking quality of data distribution is based on the free-block accounting provided by the file system. Note that the file system doesn't count busy data and meta-data blocks separately, so you are not able to find the real data space usage - and hence to check quality of distribution - when the meta-data brick contains data blocks.

To check quality of distribution:

* (1) make sure that the meta-data brick doesn't contain data blocks;
* (2) make sure that no regular file or volume operations are currently in progress;
* (3) find the "blocks used", "system blocks" and "data capacity" statistics for each data brick:
 # sync
 # volume.reiser4 -p 1 /mnt
 ...
 # volume.reiser4 -p N /mnt
* (4) find the real data space usage on each brick;
* (5) calculate the partitioning and the ideal data space usage on each data brick;
* (6) find the deviation of (4) from (5).

Example. Let's build an LV of 3 bricks (one 10G meta-data brick vdb1, and two data bricks: vdc1 (10G) and vdd1 (5G)) with space-based partitioning:

 # VOL_ID=`uuid -v4`
 # echo "Using uuid $VOL_ID"
 # mkfs.reiser4 -U $VOL_ID -y -t 256K /dev/vdb1
 # mkfs.reiser4 -U $VOL_ID -y -a -t 256K /dev/vdc1
 # mkfs.reiser4 -U $VOL_ID -y -a -t 256K /dev/vdd1
 # mount /dev/vdb1 /mnt

Fill the meta-data brick with data:

 # dd if=/dev/zero of=/mnt/myfile bs=256K
 No space left on device...

Add data bricks /dev/vdc1 and /dev/vdd1 to the volume:

 # volume.reiser4 -a /dev/vdc1 /mnt
 # volume.reiser4 -a /dev/vdd1 /mnt

Move all data blocks to the newly added bricks:

 # volume.reiser4 -r /dev/vdb1 /mnt
 # sync

Now the meta-data brick doesn't contain data blocks (only meta-data ones), so we can calculate the quality of data distribution:

 # volume.reiser4 /mnt -p0
 blocks used: 503
 # volume.reiser4 /mnt -p1
 blocks used: 1657203
 system blocks: 115
 data capacity: 2621069
 # volume.reiser4 /mnt -p2
 blocks used: 833001
 system blocks: 73
 data capacity: 1310391

Based on the statistics above, calculate the quality of distribution.
Total data capacity of the volume:

 C = 2621069 + 1310391 = 3931460

Relative capacities of the data bricks:

 C1 = 2621069 / 3931460 = 0.6667
 C2 = 1310391 / 3931460 = 0.3333

Real space usage on the data bricks (blocks used - system blocks):

 R1 = 1657203 - 115 = 1657088
 R2 = 833001 - 73 = 832928

Space usage on the volume:

 R = R1 + R2 = 1657088 + 832928 = 2490016

Ideal data space usage on the data bricks:

 I1 = C1 * R = 0.6667 * 2490016 = 1660094
 I2 = C2 * R = 0.3333 * 2490016 = 829922

Deviation:

 D = (R1, R2) - (I1, I2) = (-3006, 3006)

Relative deviation:

 D/R = (-0.0012, 0.0012)

Quality of distribution:

 Q = 1 - max(|D1|, |D2|)/R = 1 - 0.0012 = 0.9988

Comment. For any specified number of bricks N and quality of distribution Q it is possible to find a configuration of a logical volume composed of N bricks such that the quality of distribution on that volume is better than Q.

Comment. Quality of distribution Q doesn't depend on the number of bricks in the logical volume. This is a theorem, which can be strictly proven.

= FAQ =

Q. What happens if I lose a device-component of my logical volume (due to a breakdown, etc.)?

A. The bodies of some of your regular files will become "punched" in random places. The portion of such files depends on the relative capacity of the lost brick, on the number of bricks in the logical volume, and on other factors. Fsck will be able to detect and remove such files with corrupted bodies. Nevertheless, we recommend considering mirroring your bricks (e.g. by software or hardware RAID-1) to avoid such highly unpleasant situations.

[[category:Reiser4]]
However the optimal configuration (true parallelism) imposes some restrictions and dependencies on the size of such devices. WARNING: The stuff is not stable. Don't put important data to logical volumes managed by software of release number 5.X.Y. Also don't mount your old partitions in kernels with Reiser4 of SFRN 5.X.Y before its stabilization IMPORTANT: Currently there is no tools to manage Reiser5 logical volumes off-line, so it it strongly recommended to save/update configurations of your LV in a file, which doesn't belong to that volume. = Basic definitions. Volume configuration. Brick's capacity. Partitioning. Fair distribution. Balancing = Basic configuration of a logical volume is the following information: 1) Volume UUID; 2) Number of bricks in the volume; 3) List of brick names or UUIDs in the volume; 4) UUID or name of the brick to be added/removed (if any). That brick is not counted in (2) and (3). The item #4 is to handle incomplete operations interrupted by various reasons (system crash, hard reset, etc) when bringing logical volumes on-line. For each volume its configuration should be stored somewhere (but not on that volume!) and properly updated before and after each volume operation performed on that volume. We make the user responsible for this. Volume configuration is needed to facilitate deploying a volume. '''Capacity of a brick''' (or abstract capacity) is a positive integer number. Capacity is a brick's property defined by user. Don't confuse it with the size of block device. Think of it as of brick's "weight" in some units. And this is the user, who decides, which property of the brick to assign as its abstract capacity and in which units. In particular, it can be size of the block device in kilobytes, or its size in megabytes, or its throughput in M/sec, or other geometric or physical parameter of the device, associated with the brick. It is important that capacities of all bricks of the same logical volume are measured in the same units. 
Also, it would be utterly pointless to assign different properties as abstract capacities for bricks of the same LV. For example, size of block device for one brick, and disk bandwidth for another one. Capacity of each brick gets initialized by mkfs utility. By default it is calculated as number of free blocks on the device at the very end of the formatting procedure. For meta-data brick it is calculated as 70% of such amount. Capacity of any brick can be changed on-line by user. '''Capacity of a logical volume''' is defined as a sum of capacities of its bricks-components. '''Relative capacity of a brick''' is the ratio of brick's capacity to volume's capacity. Relative capacity defines a portion of IO-requests that will be issued against that brick. Array of relative capacities (C1, C2, ...) of all bricks is called volume partitioning. Obviously, C1 + C2 + ... = 1. '''Real data space usage''' on a brick is number of data blocks, stored on that brick. '''Ideal data space usage''' on a brick is defined as T*C, where T is total number of data blocks stored in the volume. C is relative capacity of the brick. It is recommended to compose volumes in the way so that space-based partitioning coincides with throughput-based one - it would be the optimal volume configuration, which provides true parallelism. If it is impossible for some reason, then choose a preferred partitioning method (space-based, or throughput-based). Note that space-based partitioning saves volume space, whereas throughput based one saves volume throughput. When performing regular file operations, Reiser5 distributes data stripes throughout the volume evenly and fairly. It means that portion of IO-requests issued against each brick is equal to its relative capacity, that is, to the portion of capacity that the brick adds to the total volume's capacity. In contrast with regular file operations, volume operations break fairness of data distribution on your logical volume. 
To restore fairness of distribution, a special balancing procedure should be run on the volume. For example, after adding a brick to a logical volume, the balancing procedure will populate the new brick with data, moved from other bricks. All volume operations except brick removal are fast, atomic and leave the volume in unbalanced state. Operation of brick removal always includes balancing, which moves data from the brick you want to remove to other bricks of the volume. If that data migration is interrupted for some reason, then the volume is marked as a "volume with incomplete brick removal". It is allowed to perform regular file and volume operations on a not balanced LV (assuming, it was not incomplete removal). However, in this case we don't guarantee a good quality of data distribution on your LV. In addition, on a volume with incomplete removal you won't be able to perform regular volume operations - first you will need to complete the removal by running a special removal completion procedure on your volume. = Prepare Software and Hardware = Build, install and boot kernel with Reiser4 of software framework release number 5.X.Y. Kernel patches can be found [https://sourceforge.net/projects/reiser4/files/v5-unstable/ here]. Note that by Linux kernel and GNU utilities the testing stuff is still recognized as "Reiser4". Make sure there is the following message in kernel logs: "Loading Reiser4 (Software Framework Release: 5.X.Y)" Build and install the latest [https://sourceforge.net/projects/reiser4/files/reiser4-utils/libaal/ libaal] Download, build and install the latest version 2.A.B of [https://sourceforge.net/projects/reiser4/files/v5-unstable/ Reiser4progs package]. Make sure that utility for managing logical volumes is installed (as a part of reiser4progs package) on your machine: # volume.reiser4 -? = Creating a logical volume = Start from choosing a unique ID (uuid) of your volume. By default it is generated by mkfs utility. 
However, the user can generate it with suitable tools (e.g. uuid(1)) and store it in an environment variable for convenience:

 # VOL_ID=`uuidgen`
 # echo "Using uuid $VOL_ID"

Choose a stripe size for your logical volume. For a good quality of distribution it is recommended that a stripe not exceed 1/10000 of the volume size. On the other hand, too small a stripe will increase space consumption on your meta-data brick. In our example we choose a stripe size of 512K:

 # STRIPE=512K
 # echo "Using stripe size $STRIPE"

Start by creating the first brick of your volume, the meta-data brick, passing the volume ID and stripe size to the mkfs.reiser4 utility:

 # mkfs.reiser4 -U $VOL_ID -t $STRIPE /dev/vdb1

Currently only one meta-data brick per volume is supported, so it is recommended that the block device for the meta-data brick not be too small. In most cases it is enough for your meta-data brick to be no smaller than 1/200 of the maximal volume size. For example, a 100G meta-data brick will be able to service a ~20T logical volume. Data and meta-data bricks don't differ from the standpoint of disk format, and there is no special option to tell the mkfs utility that we want to create a meta-data brick: the first brick in the volume automatically becomes the meta-data brick, and the other bricks are interpreted as data bricks.

Mount your initial logical volume consisting of one meta-data brick:

 # mount /dev/vdb1 /mnt

Find a record about your volume in the output of the following command:

 # volume.reiser4 -l

Create the configuration of your logical volume (its definition is above) and store it somewhere, but not on that volume! Your logical volume is now on-line and ready to use. You can perform regular file operations and volume operations (e.g. add a data brick to your LV).

= Adding a data brick to LV =

At any time you are able to add a data brick to your LV. You can do it in parallel with regular file operations executing on the volume. Make sure, however, that there is no other volume operation (e.g.
removing a brick) in progress on your volume; otherwise your operation will fail with EBUSY. Obviously, adding a brick will increase the capacity of your volume.

Choose a block device for the new data brick. Make sure that it is neither too large nor too small: the capacities of any two bricks of the same logical volume can not differ by more than 2^19 times. E.g. your logical volume can not contain both a 1M and a 2T brick. Any attempt to add a brick of improper capacity will fail with an error. Format the device with the same volume ID and stripe size as you used for the meta-data brick, but also specify the "-a" option (to not restrict data capacity):

 # mkfs.reiser4 -U $VOL_ID -t $STRIPE -a /dev/vdb2

It is important that the data brick is formatted with the same volume ID and stripe size as the meta-data brick of your logical volume; otherwise the operation of adding a data brick will fail. Update item #4 of your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration] with the UUID or name of the brick you want to add. To add the brick, simply pass its name as an argument to the option "-a" and specify your LV via its mount point:

 # volume.reiser4 -a /dev/vdb2 /mnt

By default the operation of adding a brick is fast and atomic and leaves the volume in an unbalanced state, so after adding a brick you might want to run the balancing procedure, which will move a portion of data to the new brick from the other bricks of the logical volume, making the data distribution on your volume fair:

 # volume.reiser4 -b /mnt

The portion of data blocks moved during such rebalancing is equal to the relative capacity of the new brick, that is, to the portion of capacity that the new brick adds to the updated LV's capacity. This important property defines the cost of the balancing procedure.
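That cost can be sketched numerically (plain Python with hypothetical capacities and stripe counts; not part of the Reiser5 tooling): the expected number of stripes relocated by balancing is the current stripe population times the new brick's relative capacity.

```python
# Expected cost of balancing after adding a brick: the fraction of existing
# stripes that migrates equals the new brick's relative capacity in the
# updated volume. All numbers below are made up for illustration.
old_capacities = [1000, 1000]    # bricks already in the volume
new_capacity = 500               # brick being added
total_stripes = 100_000          # data stripes currently stored in the volume

c_new = new_capacity / (sum(old_capacities) + new_capacity)
moved = round(total_stripes * c_new)   # stripes balancing should relocate
print(c_new, moved)                    # 0.2 20000
```

A small added capacity thus means a proportionally small amount of data movement, which is why balancing after adding a small brick is cheap.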
If the portion of capacity added by a brick is small, then the number of stripes moved during balancing is also small. Specifying the option -B (--with-balance) will automatically trigger the balancing procedure after adding the brick:

 # volume.reiser4 -Ba /dev/vdb2 /mnt

Upon successful completion update your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]: increment (#2), add info about the new brick to (#3) and remove the record at (#4). When adding more than one brick at once, call volume.reiser4 with the option -a for each brick individually, in any order. It is reasonable not to complete each call with balancing; run balancing only after adding the last brick.

= Removing a data brick from LV =

At any time you are able to remove any data brick from your LV (assuming that your volume is not marked as a "volume with incomplete brick removal"). You can perform brick removal in parallel with regular file operations executing on that volume. There shouldn't be, however, any other volume operation (e.g. adding a brick) in progress on your volume; otherwise the removal will fail with EBUSY. Obviously, the removal operation will decrease the abstract capacity of your LV. Note that the other bricks should have enough space to store all data blocks of the brick you want to remove; otherwise the removal operation will return an error (ENOSPC).

Suppose you want to remove brick /dev/vdb2 from your LV mounted at /mnt. Update your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration] with the UUID and name of the brick you want to remove (item #4).
To remove a brick, simply pass its name as an argument to the option "-r" and specify the logical volume by its mount point:

 # volume.reiser4 -r /dev/vdb2 /mnt

The procedure of brick removal starts by moving all data from the brick you want to remove to the other bricks of your volume, so that the resulting data distribution among the remaining bricks is also fair. The portion of data stripes moved during such migration is equal to the relative capacity of the brick being removed (that is, to the portion of capacity that the brick added to the LV's capacity). Successful brick removal always leaves the volume in a balanced state. So, in contrast with the operation of adding a brick, removing a brick is a rather long operation, which can be interrupted for various reasons. In this case the volume will be marked as a "volume with incomplete brick removal". To check the removal status of your LV, simply run

 # volume.reiser4 /mnt

and check the field "health". To complete brick removal in the current mount session, simply run

 # volume.reiser4 -R /mnt

Note that the option -R (--finish-removal) doesn't accept any arguments. On success, update your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]: remove the information about the brick /dev/vdb2 at #3 and #4. Check your kernel logs: they should contain a message that brick /dev/vdb2 has been unregistered. Now device /dev/vdb2 doesn't belong to the logical volume any more, and you can reuse it for other purposes (re-format, etc).

= Changing brick's capacity =

At any time (assuming that no other volume operation is in progress) you can change the abstract capacity of any brick to a new non-zero value.
Changing capacity always changes the volume partitioning and therefore breaks the fairness of distribution, so Reiser5 automatically launches rebalancing to make sure that the resulting distribution is fair for the new set of capacities. In particular, increasing a brick's capacity will move some data from other bricks to the brick whose capacity was increased; decreasing a brick's capacity will move some data from the brick whose capacity was decreased to other bricks. To change the abstract capacity of a brick /dev/vdb1 to a new value (e.g. 200000), simply run

 # volume.reiser4 -z /dev/vdb1 -c 200000 /mnt

pronounced as "resize brick /dev/vdb1 to new capacity 200000 in the volume mounted at /mnt". The operation of changing capacity can return an error, most likely -ENOSPC, which is a side effect of concurrent regular file writes. In this case check the status of your LV. If it is unbalanced, then consider removing some files from your LV and complete balancing by running

 # volume.reiser4 -b /mnt

Otherwise, repeat the operation from scratch.

Comment. Changing a brick's capacity to 0 is undefined and will return an error. Consider the brick removal operation instead.

= Operations with meta-data brick =

The meta-data brick can also contain data stripes and participate in data distribution like the other data bricks. All the volume operations described above are also applicable to the meta-data brick. Note, however, that it is impossible to completely remove the meta-data brick from the logical volume for obvious reasons (meta-data needs to be stored somewhere), so the brick removal operation applied to the meta-data brick actually removes it only from the Data Storage Array (DSA), the subset of the LV consisting of the bricks that participate in regular data distribution in correspondence with their abstract capacities. Once you remove the meta-data brick from the DSA, that brick will be used only to store meta-data. The operation of adding a brick, applied to the meta-data brick, returns it back to the DSA.
Important: Reiser5 doesn't count busy data and meta-data blocks separately. So, in contrast with data bricks (which contain only data), you are not able to find out the real space occupied by data blocks on the meta-data brick: Reiser5 knows only the total space occupied. To check the status of the meta-data brick of the volume mounted at /mnt, simply run

 # volume.reiser4 -p0 /mnt

and check the field "in DSA".

= Unmounting a logical volume =

To terminate a mount session, just issue the usual umount against the mount point:

 # umount /mnt

Note that after unmounting the volume all bricks by default remain registered in the system until system shutdown. If you want to unregister a brick before system shutdown, simply issue the following command:

 # volume.reiser4 -u BRICK_NAME

= Deploying a logical volume after correct unmount =

After unmounting a logical volume all its bricks remain registered in the system. So, if you want to mount the volume again, simply issue the mount command against one of its bricks; it is recommended to issue it against the meta-data brick. NOTE: Reiser5 will refuse to mount a logical volume when a wrong (incomplete or redundant) set of bricks is registered in the system. A redundant set of bricks appears, for example, when you mistakenly register a brick that was earlier removed from the logical volume.

= Deploying a logical volume after correct shutdown =

First of all, check the [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing configuration] of your volume and make sure that all its bricks (data and meta-data ones) are registered in the system.
The list of registered bricks can be printed by

 # volume.reiser4 -l

Also make sure that the set of registered bricks doesn't contain bricks not mentioned in the [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]. Important: Reiser5 will refuse to mount a logical volume when a wrong (incomplete or redundant) set of bricks is registered in the system. A redundant set of bricks appears, for example, when you mistakenly register a brick that was removed from the logical volume. For this reason we strongly recommend that the user keep track of his LV: store its [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing configuration] somewhere, but not on that volume! And don't forget to update that configuration after '''every''' volume operation. If you lose the configuration of your LV and don't remember it (which is most likely for large volumes), it will be rather painful to restore: currently there are no tools to manage logical volumes off-line, so users have to do this on their own. It is not at all difficult.

To register a brick in the system, use the following command:

 # volume.reiser4 -g BRICK_NAME

To print a list of all registered bricks, use

 # volume.reiser4 -l

Now mount your LV by simply issuing a mount(8) command against one of the bricks of your LV. We recommend issuing it against the meta-data brick.

Comment. Reiser5 always tries to register the brick which is passed to the mount command as an argument, so it is not necessary to pre-register the brick you want to issue the mount command against.
= Deploying a logical volume after hard reset or system crash =

If no volume operations were interrupted by the hard reset or system crash, just follow the instructions in this [https://reiser4.wiki.kernel.org/index.php?title=Logical_Volumes_Administration#Deploying_a_logical_volume_after_correct_shutdown section]. In Reiser5 only a restricted number of bricks participate in every transaction; the maximal number of such bricks can be specified by the user. At mount time a transaction replay procedure will be launched on each such brick independently, in parallel. Depending on the kind of interrupted volume operation, perform one of the following actions:

== Volume balancing was interrupted ==

Check your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]. Register the complete set of bricks and mount the volume by issuing the mount command against one of its bricks. Check the balanced status of your LV by running

 # volume.reiser4 /mnt

and checking the "balanced" value. If the volume is unbalanced, complete balancing by running

 # volume.reiser4 -b /mnt

== Brick removal was interrupted ==

Check your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]. Register the new set of bricks (that is, the set of bricks without the brick you wanted to remove). Try to mount the volume. In case of error, register also the brick you wanted to remove and try to mount again. Check the status of your LV by running

 # volume.reiser4 /mnt

and checking the value of "health". If required, complete the brick removal by running

 # volume.reiser4 -R /mnt

Note that the option -R doesn't accept any arguments. After successful removal completion the brick will be automatically removed from the volume and unregistered.
Make sure of it by checking the status of your LV and the list of registered bricks:

 # volume.reiser4 /mnt
 # volume.reiser4 -l

Upon successful completion update your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration] accordingly.

= LV monitoring =

Common info about the LV mounted at /mnt:

 # volume.reiser4 /mnt

 ID:             Volume UUID
 volume:         ID of the plugin managing the volume
 distribution:   ID of the distribution plugin
 stripe:         Stripe size in bytes
 segments:       Number of hash space segments (for distribution)
 bricks total:   Total number of bricks in the volume
 bricks in DSA:  Number of bricks participating in data distribution
 balanced:       Balanced status of the volume
 health:         Brick removal completion status

Info about any of its bricks, of index J:

 # volume.reiser4 -p J /mnt

 internal ID:    Brick's "internal ID" and its status in the volume
 external ID:    Brick's UUID
 device name:    Name of the block device associated with the brick
 block count:    Size of the block device in blocks
 blocks used:    Total number of occupied blocks on the device
 system blocks:  Minimal possible number of busy blocks on that device
 data capacity:  Abstract capacity of the brick
 space usage:    Portion of occupied blocks on the device
 in DSA:         Participation in regular data distribution
 is proxy:       Participation in data tiering (Burst Buffers, etc)

Comment. When retrieving a brick's info, make sure that no volume operations are in progress on that volume; otherwise the command above will return an error (EBUSY).

WARNING. Brick info provided this way is not necessarily the most recent. To get up-to-date info, run sync(1) and make sure that no regular file operations are in progress.
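The per-brick fields above are enough to derive simple free-space figures by hand. A minimal sketch (plain Python, hypothetical block counts; the 5% reserve is the one described in the free-space section):

```python
# Derive free-space figures from the per-brick fields reported by
# volume.reiser4 -p J: "block count" and "blocks used".
block_count = 2_621_440          # size of the (hypothetical) device in blocks
blocks_used = 1_000_000          # occupied blocks on that device

free = block_count - blocks_used          # total free blocks on the brick
available = free - free // 20             # ~95% of them are usable (5% reserved)
space_usage = blocks_used / block_count   # the "space usage" field, <= 0.95
print(free, available, round(space_usage, 3))   # 1621440 1540368 0.381
```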
= Checking free space =

To check the number of available free blocks on a volume mounted at /mnt, make sure that no regular file operations or volume operations are in progress on that volume, then run

 # sync
 # df --block-size=4K /mnt

To check the number of free blocks on the brick of index J, run

 # volume.reiser4 -p J /mnt

and calculate the difference between "block count" and "blocks used".

Comment. Not all free blocks on a brick/volume are available for use. The number of available free blocks is always ~95% of the total number of free blocks (Reiser4 reserves 5% to make sure that regular file truncate operations won't fail). NOTE: volume.reiser4 shows the total number of free blocks, whereas df(1) shows the number of available free blocks. The "space usage" statistic shows the portion of busy blocks on an individual brick; for the reasons explained above, "space usage" on any brick can not be more than 0.95.

= Checking quality of data distribution =

Quality of data distribution is a measure of the deviation of the real data space usage from the ideal one defined by the volume partitioning. The smaller the deviation, the better the distribution quality. Checking the quality of distribution makes sense only when your volume partitioning is space-based, or coincides with the space-based one. If your partitioning is throughput-based and doesn't coincide with the space-based one, then the quality of the actual data distribution can be rather bad: in that case the file system takes care that low-performance devices do not become a bottleneck, and effective space usage is not a high priority. Checking the quality of data distribution is based on the free blocks accounting provided by the file system. Note that the file system doesn't count busy data and meta-data blocks separately, so you are not able to find the real data space usage, and hence to check the quality of distribution, when the meta-data brick contains data blocks.
To check the quality of distribution:

* make sure that the meta-data brick doesn't contain data blocks;
* make sure that no regular file or volume operations are currently in progress;
* find the "blocks used", "system blocks" and "data capacity" statistics for each data brick:

 # sync
 # volume.reiser4 -p 1 /mnt
 ...
 # volume.reiser4 -p N /mnt

* find the real data space usage on each brick;
* calculate the partitioning and the ideal data space usage on each data brick;
* find the deviation of the real usage from the ideal one.

Example. Let's build an LV of 3 bricks (one 10G meta-data brick vdb1, and two data bricks: vdc1 (10G) and vdd1 (5G)) with space-based partitioning:

 # VOL_ID=`uuid -v4`
 # echo "Using uuid $VOL_ID"
 # mkfs.reiser4 -U $VOL_ID -y -t 256K /dev/vdb1
 # mkfs.reiser4 -U $VOL_ID -y -a -t 256K /dev/vdc1
 # mkfs.reiser4 -U $VOL_ID -y -a -t 256K /dev/vdd1
 # mount /dev/vdb1 /mnt

Fill the meta-data brick with data:

 # dd if=/dev/zero of=/mnt/myfile bs=256K
 No space left on device...

Add data bricks /dev/vdc1 and /dev/vdd1 to the volume:

 # volume.reiser4 -a /dev/vdc1 /mnt
 # volume.reiser4 -a /dev/vdd1 /mnt

Move all data blocks to the newly added bricks:

 # volume.reiser4 -r /dev/vdb1 /mnt
 # sync

Now the meta-data brick doesn't contain data blocks (only meta-data ones), so we can calculate the quality of data distribution:

 # volume.reiser4 /mnt -p0
 blocks used:     503
 # volume.reiser4 /mnt -p1
 blocks used:     1657203
 system blocks:   115
 data capacity:   2621069
 # volume.reiser4 /mnt -p2
 blocks used:     833001
 system blocks:   73
 data capacity:   1310391

Based on the statistics above, calculate the quality of distribution.
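The same calculation can be cross-checked with a short script (plain Python; the inputs are the volume.reiser4 statistics from the example above):

```python
# Quality of distribution for the example above, computed from the
# "blocks used", "system blocks" and "data capacity" statistics of the
# two data bricks (the meta-data brick holds no data blocks here).
capacity = [2621069, 1310391]    # data capacity of bricks p1 and p2
used     = [1657203, 833001]     # blocks used
system   = [115, 73]             # system blocks

real = [u - s for u, s in zip(used, system)]   # real data space usage
R = sum(real)                                  # data blocks on the volume
C = sum(capacity)
ideal = [c / C * R for c in capacity]          # ideal data space usage
rel_dev = [abs(r - i) / R for r, i in zip(real, ideal)]
Q = 1 - max(rel_dev)
print(round(Q, 4))   # 0.9988
```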
Total data capacity of the volume:

 C = 2621069 + 1310391 = 3931460

Relative capacities of the data bricks:

 C1 = 2621069 / 3931460 = 0.6667
 C2 = 1310391 / 3931460 = 0.3333

Real space usage on the data bricks (blocks used - system blocks):

 R1 = 1657203 - 115 = 1657088
 R2 = 833001 - 73 = 832928

Space usage on the volume:

 R = R1 + R2 = 1657088 + 832928 = 2490016

Ideal data space usage on the data bricks:

 I1 = C1 * R = 0.6667 * 2490016 = 1660094
 I2 = C2 * R = 0.3333 * 2490016 = 829922

Deviation:

 D = (R1, R2) - (I1, I2) = (-3006, 3006)

Relative deviation:

 D/R = (-0.0012, 0.0012)

Quality of distribution:

 Q = 1 - max(|D1|, |D2|)/R = 1 - 0.0012 = 0.9988

Comment. For any specified number of bricks N and quality of distribution Q, it is possible to find a configuration of a logical volume composed of N bricks such that the quality of distribution on that volume is better than Q.

Comment. The quality of distribution Q doesn't depend on the number of bricks in the logical volume. This is a theorem, which can be strictly proven.

= FAQ =

Q. What happens if I lose a device-component of my logical volume (due to a breakdown, etc)?

A. The bodies of some of your regular files will become "punched" in random places. The portion of such files depends on the relative capacity of the lost brick, on the number of bricks in the logical volume, and on other factors. Fsck will be able to detect and remove files with corrupted bodies. Nevertheless, we recommend considering mirroring your bricks (e.g. by software or hardware RAID-1) to avoid such highly unpleasant situations.

[[category:Reiser4]]

Before working with logical volumes you need to understand some basic [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Background principles]. A logical volume (LV) can be composed of any number of block devices, different in physical and geometric parameters.
However, the optimal configuration (true parallelism) imposes some restrictions and dependencies on the sizes of those devices.

WARNING: This stuff is not stable. Don't put important data on logical volumes managed by software of release number 5.X.Y. Also, don't mount your old partitions in kernels with Reiser4 of SFRN 5.X.Y before its stabilization.

IMPORTANT: Currently there are no tools to manage Reiser5 logical volumes off-line, so it is strongly recommended to save/update the configuration of your LV in a file which doesn't belong to that volume.

= Basic definitions. Volume configuration. Brick's capacity. Partitioning. Fair distribution. Balancing =

The basic configuration of a logical volume is the following information:

1) Volume UUID;
2) Number of bricks in the volume;
3) List of brick names or UUIDs in the volume;
4) UUID or name of the brick to be added/removed (if any). That brick is not counted in (2) and (3).

Item #4 is needed to handle incomplete operations, interrupted for various reasons (system crash, hard reset, etc), when bringing logical volumes on-line. For each volume its configuration should be stored somewhere (but not on that volume!) and properly updated before and after each volume operation performed on that volume. We make the user responsible for this. The volume configuration is needed to facilitate deploying a volume.

'''Capacity of a brick''' (or abstract capacity) is a positive integer number. Capacity is a brick's property defined by the user; don't confuse it with the size of the block device. Think of it as the brick's "weight" in some units. It is the user who decides which property of the brick to assign as its abstract capacity, and in which units. In particular, it can be the size of the block device in kilobytes, or its size in megabytes, or its throughput in M/sec, or another geometric or physical parameter of the device associated with the brick. It is important that the capacities of all bricks of the same logical volume are measured in the same units.
Also, it would be utterly pointless to assign different properties as abstract capacities for bricks of the same LV. For example, size of block device for one brick, and disk bandwidth for another one. Capacity of each brick gets initialized by mkfs utility. By default it is calculated as number of free blocks on the device at the very end of the formatting procedure. For meta-data brick it is calculated as 70% of such amount. Capacity of any brick can be changed on-line by user. '''Capacity of a logical volume''' is defined as a sum of capacities of its bricks-components. '''Relative capacity of a brick''' is the ratio of brick's capacity to volume's capacity. Relative capacity defines a portion of IO-requests that will be issued against that brick. Array of relative capacities (C1, C2, ...) of all bricks is called volume partitioning. Obviously, C1 + C2 + ... = 1. '''Real data space usage''' on a brick is number of data blocks, stored on that brick. '''Ideal data space usage''' on a brick is defined as T*C, where T is total number of data blocks stored in the volume. C is relative capacity of the brick. It is recommended to compose volumes in the way so that space-based partitioning coincides with throughput-based one - it would be the optimal volume configuration, which provides true parallelism. If it is impossible for some reason, then choose a preferred partitioning method (space-based, or throughput-based). Note that space-based partitioning saves volume space, whereas throughput based one saves volume throughput. When performing regular file operations, Reiser5 distributes data stripes throughout the volume evenly and fairly. It means that portion of IO-requests issued against each brick is equal to its relative capacity, that is, to the portion of capacity that the brick adds to the total volume's capacity. In contrast with regular file operations, volume operations break fairness of data distribution on your logical volume. 
To restore fairness of distribution, a special balancing procedure should be run on the volume. For example, after adding a brick to a logical volume, the balancing procedure will populate the new brick with data, moved from other bricks. All volume operations except brick removal are fast, atomic and leave the volume in unbalanced state. Operation of brick removal always includes balancing, which moves data from the brick you want to remove to other bricks of the volume. If that data migration is interrupted for some reason, then the volume is marked as a "volume with incomplete brick removal". It is allowed to perform regular file and volume operations on a not balanced LV (assuming, it was not incomplete removal). However, in this case we don't guarantee a good quality of data distribution on your LV. In addition, on a volume with incomplete removal you won't be able to perform regular volume operations - first you will need to complete the removal by running a special removal completion procedure on your volume. = Prepare Software and Hardware = Build, install and boot kernel with Reiser4 of software framework release number 5.X.Y. Kernel patches can be found [https://sourceforge.net/projects/reiser4/files/v5-unstable/ here]. Note that by Linux kernel and GNU utilities the testing stuff is still recognized as "Reiser4". Make sure there is the following message in kernel logs: "Loading Reiser4 (Software Framework Release: 5.X.Y)" Build and install the latest [https://sourceforge.net/projects/reiser4/files/reiser4-utils/libaal/ libaal] Download, build and install the latest version 2.A.B of [https://sourceforge.net/projects/reiser4/files/v5-unstable/ Reiser4progs package]. Make sure that utility for managing logical volumes is installed (as a part of reiser4progs package) on your machine: # volume.reiser4 -? = Creating a logical volume = Start from choosing a unique ID (uuid) of your volume. By default it is generated by mkfs utility. 
However, user can generate it himself by proper tools (e.g. uuid(1)) and store in an environment variable for convenience: # VOL_ID=`uuidgen` # echo "Using uuid $VOL_ID" Choose a stripe size for your logical volume. For a good quality of distribution it is recommended that stripe doesn't exceed 1/10000 of volume size. On the other hand, too small stripes will increase space consumption on your meta-data brick. In our example we choose stripe size 512K: # STRIPE=512K # echo "Using stripe size $STRIPE" Start from creating the first brick of your volume - meta-data brick, passing volume-ID and stripe size to mkfs.reiser4 utility: # mkfs.reiser4 -U $VOL_ID -t $STRIPE /dev/vdb1 Currently only one meta-data brick per volume is supported, so it is recommended that size of block device for meta-data brick in not too small. In most cases it will be enough, if your meta-data brick is not smaller than 1/200 of maximal volume size. For example, 100G meta-data brick will be able to service ~20T logical volume. Data and meta-data bricks don't differ from the standpoint of disk format, and there is no special option to inform mkfs utility that we want to create exactly meta-data brick: the first brick in the volume automatically becomes a meta-data brick, and other bricks are interpreted as data bricks. Mount your initial logical volume consisting of one meta-data brick: # mount /dev/vdb1 /mnt Find a record about your volume in the output of the following command: # volume.reiser4 -l Create configuration of your logical volume (its definition is above) and store it somewhere, but not on that volume! Your logical volume is now on-line and ready to use. You can perform regular file operations and volume operations (e.g. add a data brick to your LV). = Adding a data brick to LV = At any time you are able to add a data brick to your LV. You can do it in parallel with regular file operations executing on this volume. Make sure, however, that there is no other volume operations (e.g. 
removing a brick) over your volume in progress, otherwise your operation will fail with EBUSY. Obviously, adding a brick will increase capacity of your volume. Choose a block device for the new data brick. Make sure that it is not too large, or too small. Capacities of any 2 bricks of the same logical volume can not differ more than 2^19 (~1 million) times. E.g. your logical volume can not contain both, 1M and 2T bricks. Any attempts to add a brick of improper capacity will fail with error. Format it with the same volume ID and stripe size, as you used for meta-data brick, but specify also "-a" option (to not restrict data capacity). # mkfs.reiser4 -U $VOL_ID -t $STRIPE -a /dev/vdb2 Important: it is important that data brick is formatted with the same volume ID and stripe size, as the meta-data brick of your logical volume. Otherwise, operation of adding a data brick will fail. Update item #4 of your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration] with UUID or name of the brick you want to add. To add a brick simply pass its name as an argument for the option "-a" and specify your LV via its mount point: # volume.reiser4 -a /dev/vdb2 /mnt By default operation of adding a brick is fast and atomic and leaves the volume in unbalanced state, so after adding a brick you might want to run a balancing procedure, which will move a portion of data to the new brick from other bricks of the logical volume, which will make data distribution on your volume fair: # volume.reiser4 -b /mnt Portion of data blocks, being moved during such rebalancing, is equal to relative capacity of the new brick, that is to the portion of capacity that the new brick adds to updated LV's capacity. This important property defines the cost of balancing procedure. 
If the portion of capacity added by a brick is small, then the number of stripes moved during balancing is also small.

Specifying the option -B (--with-balance) will trigger the balancing procedure automatically after adding the brick:

 # volume.reiser4 -Ba /dev/vdb2 /mnt

Upon successful completion update your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]. That is, increment (#2), add info about the new brick to (#3) and remove the record at (#4).

When adding more than one brick at once, call volume.reiser4 with the option -a for each brick individually, in any order. It is reasonable not to complete each call with balancing; run balancing only after adding the last brick.

= Removing a data brick from LV =

At any time you can remove any data brick from your LV (assuming that your volume is not marked as a "volume with incomplete brick removal"). You can perform brick removal in parallel with regular file operations executing on that volume. There shouldn't be, however, any other volume operation (e.g. adding a brick) in progress on your volume, otherwise the removal will fail with EBUSY.

Obviously, the removal operation will decrease the abstract capacity of your LV. Note that the other bricks should have enough space to store all data blocks of the brick you want to remove, otherwise the removal operation will return an error (ENOSPC).

Suppose you want to remove brick /dev/vdb2 from your LV mounted at /mnt. Update your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration] with the UUID and name of the brick you want to remove (item #4).
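Before issuing the removal, you can estimate whether the remaining bricks have room for the victim's data (the ENOSPC condition mentioned above). A hypothetical pre-check, using the "block count" and "blocks used" statistics described later under "LV monitoring" (note that "blocks used" includes system blocks, so it is only an upper bound on the data to migrate):

```python
# Hypothetical pre-check before removing a brick: the remaining bricks must
# have enough free space for all data blocks of the brick being removed.
def removal_fits(bricks, victim):
    """bricks: {name: (block_count, blocks_used)}; victim: name of the
    brick to remove. "blocks used" is an upper bound on the blocks that
    the removal procedure will have to migrate."""
    data_to_move = bricks[victim][1]
    free_elsewhere = sum(count - used
                         for name, (count, used) in bricks.items()
                         if name != victim)
    return free_elsewhere >= data_to_move
```

If this returns False, the removal would fail with ENOSPC anyway, so free some space (or add a brick) first.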
To remove a brick, simply pass its name as an argument to the option "-r" and specify the logical volume by its mount point:

 # volume.reiser4 -r /dev/vdb2 /mnt

The procedure of brick removal starts by moving all data from the brick you want to remove to the other bricks of your volume, so that the resulting data distribution among the remaining bricks is also fair. The portion of data stripes moved during such migration is equal to the relative capacity of the brick to be removed (that is, to the portion of capacity that the brick added to the LV's capacity). Successful brick removal always leaves the volume in balanced state.

So, in contrast with the operation of adding a brick, removing a brick is a rather long operation, which can be interrupted for various reasons. In this case the volume will be marked as a "volume with incomplete brick removal". To check the removal status of your LV, simply run

 # volume.reiser4 /mnt

and check the field "health". To complete brick removal in the current mount session, simply run

 # volume.reiser4 -R /mnt

Note that the option -R (--finish-removal) doesn't accept any arguments.

On success update your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]: remove the information about the brick /dev/vdb2 at #3 and #4. Check your kernel logs: they should contain a message that brick /dev/vdb2 has been unregistered. Now the device /dev/vdb2 doesn't belong to the logical volume any more, and you can reuse it for other purposes (re-format, etc).

= Changing brick's capacity =

At any time (assuming that no other volume operation is in progress) you can change the abstract capacity of any brick to some new value, different from 0.
Changing capacity always changes volume partitioning and, therefore, breaks fairness of distribution, so Reiser5 automatically launches rebalancing to make sure that the resulting distribution is fair for the new set of capacities. In particular, increasing a brick's capacity will move some data from other bricks to the brick whose capacity was increased; decreasing a brick's capacity will move some data from the brick whose capacity was decreased to other bricks.

To change the abstract capacity of a brick /dev/vdb1 to a new value (e.g. 200000), simply run

 # volume.reiser4 -z /dev/vdb1 -c 200000 /mnt

pronounced as "resize brick /dev/vdb1 to new capacity 200000 in the volume mounted at /mnt".

The operation of changing capacity can return an error. Most likely it is -ENOSPC, which is a side effect of concurrent regular file writes. In this case check the status of your LV. If it is unbalanced, then consider removing some files from your LV and complete balancing by running

 # volume.reiser4 -b /mnt

Otherwise, repeat the operation from scratch.

Comment. Changing a brick's capacity to 0 is undefined and will return an error. Consider the brick removal operation instead.

= Operations with meta-data brick =

The meta-data brick can also contain data stripes and participate in data distribution like the data bricks. All the volume operations described above are also applicable to the meta-data brick. Note, however, that it is impossible to completely remove the meta-data brick from the logical volume for obvious reasons (meta-data need to be stored somewhere), so the brick removal operation applied to the meta-data brick actually removes it only from the Data-Storage Array (DSA) - the subset of the LV consisting of the bricks which participate in regular data distribution in accordance with their abstract capacities. Once you remove the meta-data brick from the DSA, that brick will be used only to store meta-data. The operation of adding a brick, applied to the meta-data brick, returns it back to the DSA.
Important: Reiser5 doesn't count busy data and meta-data blocks separately. So, in contrast with data bricks (which contain only data), you are not able to find out the real space occupied by data blocks on the meta-data brick - Reiser5 knows only the total space occupied.

To check the status of the meta-data brick of the volume mounted at /mnt, simply run

 # volume.reiser4 -p0 /mnt

and check the field "in DSA".

= Unmounting a logical volume =

To terminate a mount session, just issue the usual umount against the mount point:

 # umount /mnt

Note that after unmounting the volume all its bricks by default remain registered in the system until system shutdown. If you want to unregister a brick before system shutdown, simply issue the following command:

 # volume.reiser4 -u BRICK_NAME

= Deploying a logical volume after correct unmount =

After unmounting a logical volume all its bricks remain registered in the system. So, if you want to mount the volume again, simply issue the mount command against one of its bricks. It is recommended to issue it against the meta-data brick.

NOTE: Reiser5 will refuse to mount a logical volume when a wrong (incomplete or redundant) set of bricks is registered in the system. A redundant set of bricks appears, for example, when you mistakenly register a brick that was earlier removed from the logical volume.

= Deploying a logical volume after correct shutdown =

First of all, check the [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing configuration] of your volume and make sure that all its bricks (data and meta-data ones) are registered in the system.
The list of registered bricks can be printed by

 # volume.reiser4 -l

Also make sure that the set of bricks registered for the volume doesn't contain bricks not mentioned in the [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration].

Important: Reiser5 will refuse to mount a logical volume when a wrong (incomplete or redundant) set of bricks is registered in the system. A redundant set of bricks appears, for example, when you mistakenly register a brick that was removed from the logical volume. For this reason we strongly recommend that users keep track of their LVs - store the [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing configuration] somewhere, but not on the volume itself! And don't forget to update that configuration after '''every''' volume operation. If you have lost the configuration of your LV and don't remember it (which is most likely for large volumes), then it will be rather painful to restore: currently there are no tools to manage logical volumes off-line, so users have to keep track of their volumes on their own. It is not at all difficult.

To register a brick in the system, use the following command:

 # volume.reiser4 -g BRICK_NAME

To print the list of all registered bricks, use

 # volume.reiser4 -l

Now mount your LV by simply issuing a mount(8) command against one of its bricks. We recommend to issue it against the meta-data brick.

Comment. Reiser5 always tries to register the brick which is passed to the mount command as an argument, so it is not necessary to preregister the brick you want to issue the mount command against.
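The bookkeeping described above is easy to script. Below is a minimal sketch; the JSON layout and helper names are our own invention (not a reiser4progs format). It stores the four items of a volume configuration and checks a registered-brick list against it:

```python
import json

# Hypothetical on-disk layout for the four items of a volume configuration.
# Keep this file OFF the logical volume it describes!
def save_config(path, uuid, bricks, pending=None):
    cfg = {
        "uuid": uuid,                # item 1: volume UUID
        "brick_count": len(bricks),  # item 2: number of bricks
        "bricks": bricks,            # item 3: brick names/UUIDs
        "pending": pending,          # item 4: brick being added/removed, if any
    }
    with open(path, "w") as f:
        json.dump(cfg, f, indent=2)

def check_registered(path, registered):
    """Compare a registered-brick list (e.g. parsed from `volume.reiser4 -l`)
    against the stored configuration. Reiser5 refuses to mount a volume whose
    registered set is incomplete or redundant."""
    with open(path) as f:
        cfg = json.load(f)
    want, have = set(cfg["bricks"]), set(registered)
    return {"missing": sorted(want - have), "redundant": sorted(have - want)}
```

Running check_registered before mounting tells you which bricks still need `volume.reiser4 -g` (missing) and which should not be registered at all (redundant).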
= Deploying a logical volume after hard reset or system crash =

If no volume operations were interrupted by the hard reset or system crash, then just follow the instructions in this [https://reiser4.wiki.kernel.org/index.php?title=Logical_Volumes_Administration#Deploying_a_logical_volume_after_correct_shutdown section]. In Reiser5 only a restricted number of bricks participate in every transaction; the maximal number of such bricks can be specified by the user. At mount time a transaction replay procedure will be launched on each such brick independently, in parallel.

Depending on the kind of interrupted volume operation, perform one of the following actions:

== Volume balancing was interrupted ==

Check your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]. Register the complete set of bricks and mount the volume by issuing the mount command against one of its bricks. Check the balanced status of your LV by running

 # volume.reiser4 /mnt

and checking the "balanced" value. If the volume is unbalanced, then complete balancing by running

 # volume.reiser4 -b /mnt

== Brick removal was interrupted ==

Check your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]. Register the new set of bricks (that is, the set of bricks without the brick you wanted to remove). Try to mount the volume. In case of error, register also the brick you wanted to remove and try to mount again. Check the status of your LV by running

 # volume.reiser4 /mnt

and checking the value of "health". If required, complete brick removal by running

 # volume.reiser4 -R /mnt

Note that the option -R doesn't accept any arguments. After successful removal completion the brick will be automatically removed from the volume and unregistered.
Make sure of this by checking the status of your LV and the list of registered bricks:

 # volume.reiser4 /mnt
 # volume.reiser4 -l

Upon successful completion update your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration] accordingly.

= LV monitoring =

Common info about the LV mounted at /mnt:

 # volume.reiser4 /mnt

 ID:            Volume UUID
 volume:        ID of the plugin managing the volume
 distribution:  ID of the distribution plugin
 stripe:        Stripe size in bytes
 segments:      Number of hash space segments (for distribution)
 bricks total:  Total number of bricks in the volume
 bricks in DSA: Number of bricks participating in data distribution
 balanced:      Balanced status of the volume
 health:        Brick removal completion status

Info about any of its bricks, of index J:

 # volume.reiser4 -p J /mnt

 internal ID:   Brick's "internal ID" and its status in the volume
 external ID:   Brick's UUID
 device name:   Name of the block device associated with the brick
 block count:   Size of the block device in blocks
 blocks used:   Total number of occupied blocks on the device
 system blocks: Minimal possible number of busy blocks on the device
 data capacity: Abstract capacity of the brick
 space usage:   Portion of occupied blocks on the device
 in DSA:        Participation in regular data distribution
 is proxy:      Participation in data tiering (Burst Buffers, etc)

Comment. When retrieving a brick's info, make sure that no volume operations are in progress on that volume. Otherwise the command above will return an error (EBUSY).

WARNING. Brick info obtained this way is not necessarily the most recent. To get up-to-date info, run sync(1) and make sure that no regular file operations are in progress.
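Since the fields above are simple "name: value" pairs, monitoring scripts can parse them and decide which recovery step (if any) is pending. A sketch under the assumption that volume.reiser4 prints one field per line and that "balanced" / "health" take the values shown below; check the actual output of your version:

```python
# Parse the "field: value" lines printed by `volume.reiser4 /mnt` or
# `volume.reiser4 -p J /mnt` (one field per line is an assumption).
def parse_volume_info(text):
    info = {}
    for line in text.splitlines():
        if ":" in line:
            key, _, value = line.partition(":")
            info[key.strip()] = value.strip()
    return info

# Hypothetical helper choosing the next recovery step from the parsed
# fields; the exact field values are assumptions.
def next_recovery_step(info, mnt="/mnt"):
    if "incomplete" in info.get("health", ""):
        return "volume.reiser4 -R " + mnt   # complete interrupted brick removal
    if info.get("balanced") == "no":
        return "volume.reiser4 -b " + mnt   # complete interrupted balancing
    return None                             # volume is healthy and balanced
```

Feed parse_volume_info the captured output of volume.reiser4 (e.g. via subprocess) and run the suggested command, if any.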
= Checking free space =

To check the number of available free blocks on a volume mounted at /mnt, make sure that no regular file operations or volume operations are in progress on that volume, then run

 # sync
 # df --block-size=4K /mnt

To check the number of free blocks on the brick of index J, run

 # volume.reiser4 -p J /mnt

and calculate the difference between "block count" and "blocks used".

Comment. Not all free blocks on a brick/volume are available for use. The number of available free blocks is always ~95% of the total number of free blocks (Reiser4 reserves 5% to make sure that regular file truncate operations won't fail).

NOTE: volume.reiser4 shows the total number of free blocks, whereas df(1) shows the number of available free blocks.

The "space usage" statistic shows the portion of busy blocks on an individual brick. For the reasons explained above, "space usage" on any brick cannot exceed 0.95.

= Checking quality of data distribution =

Quality of data distribution is a measure of the deviation of the real data space usage from the ideal one defined by volume partitioning. The smaller the deviation, the better the distribution quality.

Checking quality of distribution makes sense only when your volume partitioning is space-based, or coincides with the space-based one. If your partitioning is throughput-based and doesn't coincide with the space-based one, then the quality of the actual data distribution can be rather bad: in this case the file system takes care that low-performance devices don't become a bottleneck, and effective space usage is not a high priority.

Checking quality of data distribution is based on the free blocks accounting provided by the file system. Note that the file system doesn't count busy data and meta-data blocks separately, so you are not able to find the real data space usage, and hence to check the quality of distribution, when the meta-data brick contains data blocks.
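The free-space arithmetic above is straightforward; a sketch with made-up numbers (the 0.95 factor is the 5% reserve mentioned in the text):

```python
RESERVED = 0.05  # Reiser4 keeps ~5% of free blocks in reserve

def free_blocks(block_count, blocks_used):
    """Total free blocks on a brick, from `volume.reiser4 -p J` statistics."""
    return block_count - blocks_used

def available_blocks(block_count, blocks_used):
    """Free blocks actually available for use (what df(1) reports)."""
    return round(free_blocks(block_count, blocks_used) * (1 - RESERVED))
```

This also explains why volume.reiser4 and df(1) report different numbers for the same volume: the former counts total free blocks, the latter only the available ones.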
To check quality of distribution:

* make sure that the meta-data brick doesn't contain data blocks;
* make sure that no regular file or volume operations are currently in progress;
* find the "blocks used", "system blocks" and "data capacity" statistics for each data brick:

 # sync
 # volume.reiser4 -p 1 /mnt
 ...
 # volume.reiser4 -p N /mnt

* find the real data space usage on each brick;
* calculate the partitioning and the ideal data space usage on each data brick;
* find the deviation of the real usage from the ideal one.

Example. Let's build a LV of 3 bricks (one 10G meta-data brick vdb1, and two data bricks: vdc1 (10G) and vdd1 (5G)) with space-based partitioning:

 # VOL_ID=`uuid -v4`
 # echo "Using uuid $VOL_ID"
 # mkfs.reiser4 -U $VOL_ID -y -t 256K /dev/vdb1
 # mkfs.reiser4 -U $VOL_ID -y -a -t 256K /dev/vdc1
 # mkfs.reiser4 -U $VOL_ID -y -a -t 256K /dev/vdd1
 # mount /dev/vdb1 /mnt

Fill the meta-data brick with data:

 # dd if=/dev/zero of=/mnt/myfile bs=256K
 No space left on device...

Add the data bricks /dev/vdc1 and /dev/vdd1 to the volume:

 # volume.reiser4 -a /dev/vdc1 /mnt
 # volume.reiser4 -a /dev/vdd1 /mnt

Move all data blocks to the newly added bricks:

 # volume.reiser4 -r /dev/vdb1 /mnt
 # sync

Now the meta-data brick doesn't contain data blocks (only meta-data ones), so we can calculate the quality of data distribution:

 # volume.reiser4 /mnt -p0
 blocks used: 503
 # volume.reiser4 /mnt -p1
 blocks used: 1657203
 system blocks: 115
 data capacity: 2621069
 # volume.reiser4 /mnt -p2
 blocks used: 833001
 system blocks: 73
 data capacity: 1310391

Based on the statistics above, calculate the quality of distribution.
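The arithmetic below can also be scripted. A minimal sketch using the per-brick statistics just shown, expressed as tuples of (blocks used, system blocks, data capacity):

```python
# Quality of distribution from per-brick (blocks_used, system_blocks,
# data_capacity) statistics of the data bricks, as defined in this section.
def distribution_quality(bricks):
    caps = [c for (_, _, c) in bricks]
    total_cap = sum(caps)
    real = [used - system for (used, system, _) in bricks]  # real data usage
    total_real = sum(real)
    ideal = [c / total_cap * total_real for c in caps]      # ideal data usage
    max_dev = max(abs(r - i) for r, i in zip(real, ideal))  # worst deviation
    return 1 - max_dev / total_real

# Statistics of the two data bricks from the example above:
q = distribution_quality([(1657203, 115, 2621069), (833001, 73, 1310391)])
```

Because this uses exact relative capacities instead of values rounded to four digits, the result may differ from the hand calculation in the last decimal place.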
Total data capacity of the volume:

 C = 2621069 + 1310391 = 3931460

Relative capacities of the data bricks:

 C1 = 2621069 / 3931460 = 0.6667
 C2 = 1310391 / 3931460 = 0.3333

Real data space usage on the data bricks (blocks used - system blocks):

 R1 = 1657203 - 115 = 1657088
 R2 = 833001 - 73 = 832928

Data space usage on the volume:

 R = R1 + R2 = 1657088 + 832928 = 2490016

Ideal data space usage on the data bricks:

 I1 = C1 * R = 0.6667 * 2490016 = 1660094
 I2 = C2 * R = 0.3333 * 2490016 = 829922

Deviation:

 D = (R1, R2) - (I1, I2) = (-3006, 3006)

Relative deviation:

 D/R = (-0.0012, 0.0012)

Quality of distribution:

 Q = 1 - max(|D1|, |D2|)/R = 1 - 0.0012 = 0.9988

Comment. For any specified number of bricks N and quality of distribution Q it is possible to find a configuration of a logical volume composed of N bricks such that the quality of distribution on that volume is better than Q.

Comment. The quality of distribution Q doesn't depend on the number of bricks in the logical volume. This is a theorem, which can be strictly proven.

= FAQ =

Q. What happens if I lose a device-component (due to a breakdown, etc) of my logical volume?

A. The bodies of some of your regular files will become "punched" in random places. The portion of such files depends on the relative capacity of the lost brick, on the number of bricks in the logical volume, and on other factors. Fsck will be able to detect and remove such files with corrupted bodies. Nevertheless, we recommend considering mirroring of your bricks (e.g. by software or hardware RAID-1) to avoid such highly unpleasant situations.

[[category:Reiser4]]
However the optimal configuration (true parallelism) imposes some restrictions and dependencies on the size of such devices. WARNING: The stuff is not stable. Don't put important data to logical volumes managed by software of release number 5.X.Y. Also don't mount your old partitions in kernels with Reiser4 of SFRN 5.X.Y before its stabilization IMPORTANT: Currently there is no tools to manage Reiser5 logical volumes off-line, so it it strongly recommended to save/update configurations of your LV in a file, which doesn't belong to that volume. = Basic definitions. Volume configuration. Brick's capacity. Partitioning. Fair distribution. Balancing = Basic configuration of a logical volume is the following information: 1) Volume UUID; 2) Number of bricks in the volume; 3) List of brick names or UUIDs in the volume; 4) UUID or name of the brick to be added/removed (if any). That brick is not counted in (2) and (3). The item #4 is to handle incomplete operations interrupted by various reasons (system crash, hard reset, etc) when bringing logical volumes on-line. For each volume its configuration should be stored somewhere (but not on that volume!) and properly updated before and after each volume operation performed on that volume. We make the user responsible for this. Volume configuration is needed to facilitate deploying a volume. '''Capacity of a brick''' (or abstract capacity) is a positive integer number. Capacity is a brick's property defined by user. Don't confuse it with the size of block device. Think of it as of brick's "weight" in some units. And this is the user, who decides, which property of the brick to assign as its abstract capacity and in which units. In particular, it can be size of the block device in kilobytes, or its size in megabytes, or its throughput in M/sec, or other geometric or physical parameter of the device, associated with the brick. It is important that capacities of all bricks of the same logical volume are measured in the same units. 
Also, it would be utterly pointless to assign different properties as abstract capacities for bricks of the same LV. For example, size of block device for one brick, and disk bandwidth for another one. Capacity of each brick gets initialized by mkfs utility. By default it is calculated as number of free blocks on the device at the very end of the formatting procedure. For meta-data brick it is calculated as 70% of such amount. Capacity of any brick can be changed on-line by user. '''Capacity of a logical volume''' is defined as a sum of capacities of its bricks-components. '''Relative capacity of a brick''' is the ratio of brick's capacity to volume's capacity. Relative capacity defines a portion of IO-requests that will be issued against that brick. Array of relative capacities (C1, C2, ...) of all bricks is called volume partitioning. Obviously, C1 + C2 + ... = 1. '''Real data space usage''' on a brick is number of data blocks, stored on that brick. '''Ideal data space usage''' on a brick is defined as T*C, where T is total number of data blocks stored in the volume. C is relative capacity of the brick. It is recommended to compose volumes in the way so that space-based partitioning coincides with throughput-based one - it would be the optimal volume configuration, which provides true parallelism. If it is impossible for some reason, then choose a preferred partitioning method (space-based, or throughput-based). Note that space-based partitioning saves volume space, whereas throughput based one saves volume throughput. When performing regular file operations, Reiser5 distributes data stripes throughout the volume evenly and fairly. It means that portion of IO-requests issued against each brick is equal to its relative capacity, that is, to the portion of capacity that the brick adds to the total volume's capacity. In contrast with regular file operations, volume operations break fairness of data distribution on your logical volume. 
To restore fairness of distribution, a special balancing procedure should be run on the volume. For example, after adding a brick to a logical volume, the balancing procedure will populate the new brick with data, moved from other bricks. All volume operations except brick removal are fast, atomic and leave the volume in unbalanced state. Operation of brick removal always includes balancing, which moves data from the brick you want to remove to other bricks of the volume. If that data migration is interrupted for some reason, then the volume is marked as a "volume with incomplete brick removal". It is allowed to perform regular file and volume operations on a not balanced LV (assuming, it was not incomplete removal). However, in this case we don't guarantee a good quality of data distribution on your LV. In addition, on a volume with incomplete removal you won't be able to perform regular volume operations - first you will need to complete the removal by running a special removal completion procedure on your volume. = Prepare Software and Hardware = Build, install and boot kernel with Reiser4 of software framework release number 5.X.Y. Kernel patches can be found [https://sourceforge.net/projects/reiser4/files/v5-unstable/ here]. Note that by Linux kernel and GNU utilities the testing stuff is still recognized as "Reiser4". Make sure there is the following message in kernel logs: "Loading Reiser4 (Software Framework Release: 5.X.Y)" Build and install the latest [https://sourceforge.net/projects/reiser4/files/reiser4-utils/libaal/ libaal] Download, build and install the latest version 2.A.B of [https://sourceforge.net/projects/reiser4/files/v5-unstable/ Reiser4progs package]. Make sure that utility for managing logical volumes is installed (as a part of reiser4progs package) on your machine: # volume.reiser4 -? = Creating a logical volume = Start from choosing a unique ID (uuid) of your volume. By default it is generated by mkfs utility. 
However, user can generate it himself by proper tools (e.g. uuid(1)) and store in an environment variable for convenience: # VOL_ID=`uuidgen` # echo "Using uuid $VOL_ID" Choose a stripe size for your logical volume. For a good quality of distribution it is recommended that stripe doesn't exceed 1/10000 of volume size. On the other hand, too small stripes will increase space consumption on your meta-data brick. In our example we choose stripe size 512K: # STRIPE=512K # echo "Using stripe size $STRIPE" Start from creating the first brick of your volume - meta-data brick, passing volume-ID and stripe size to mkfs.reiser4 utility: # mkfs.reiser4 -U $VOL_ID -t $STRIPE /dev/vdb1 Currently only one meta-data brick per volume is supported, so it is recommended that size of block device for meta-data brick in not too small. In most cases it will be enough, if your meta-data brick is not smaller than 1/200 of maximal volume size. For example, 100G meta-data brick will be able to service ~20T logical volume. Data and meta-data bricks don't differ from the standpoint of disk format, and there is no special option to inform mkfs utility that we want to create exactly meta-data brick: the first brick in the volume automatically becomes a meta-data brick, and other bricks are interpreted as data bricks. Mount your initial logical volume consisting of one meta-data brick: # mount /dev/vdb1 /mnt Find a record about your volume in the output of the following command: # volume.reiser4 -l Create configuration of your logical volume (its definition is above) and store it somewhere, but not on that volume! Your logical volume is now on-line and ready to use. You can perform regular file operations and volume operations (e.g. add a data brick to your LV). = Adding a data brick to LV = At any time you are able to add a data brick to your LV. You can do it in parallel with regular file operations executing on this volume. Make sure, however, that there is no other volume operations (e.g. 
removing a brick) over your volume in progress, otherwise your operation will fail with EBUSY. Obviously, adding a brick will increase capacity of your volume. Choose a block device for the new data brick. Make sure that it is not too large, or too small. Capacities of any 2 bricks of the same logical volume can not differ more than 2^19 (~1 million) times. E.g. your logical volume can not contain both, 1M and 2T bricks. Any attempts to add a brick of improper capacity will fail with error. Format it with the same volume ID and stripe size, as you used for meta-data brick, but specify also "-a" option (to not restrict data capacity). # mkfs.reiser4 -U $VOL_ID -t $STRIPE -a /dev/vdb2 Important: it is important that data brick is formatted with the same volume ID and stripe size, as the meta-data brick of your logical volume. Otherwise, operation of adding a data brick will fail. Update item #4 of your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration] with UUID or name of the brick you want to add. To add a brick simply pass its name as an argument for the option "-a" and specify your LV via its mount point: # volume.reiser4 -a /dev/vdb2 /mnt By default operation of adding a brick is fast and atomic and leaves the volume in unbalanced state, so after adding a brick you might want to run a balancing procedure, which will move a portion of data to the new brick from other bricks of the logical volume, which will make data distribution on your volume fair: # volume.reiser4 -b /mnt Portion of data blocks, being moved during such rebalancing, is equal to relative capacity of the new brick, that is to the portion of capacity that the new brick adds to updated LV's capacity. This important property defines the cost of balancing procedure. 
If the portion of capacity added by a brick is small, then number of stripes moved during balancing is also small. Specifying the option -B (--with-balance) will automatically trigger the balancing procedure after adding a brick: # volume.reiser4 -Ba /dev/vdb2 /mnt Upon successful completion update your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]. That is, increment (#2), add info about the new brick to (#3) and remove records at (#4). = Removing a data brick from LV = At any time you are able to remove any data brick from your LV (assuming that your volume is not marked as a "volume with incomplete brick removal". You can perform brick removal in parallel with regular file operations executing on that volume. There shouldn't be, however, other volume operations (e.g. adding a brick) over your volume in progress, otherwise your removal will fail with EBUSY. Obviously, the removal operation will decrease abstract capacity of your LV. Note that other bricks should have enough space to store all data blocks of the brick you want to remove, otherwise, the removal operation will return error (ENOSPC). Suppose you want to remove brick /dev/vdb2 from your LV mounted at /mnt. Update your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration] with the UUID and name of the brick you want to remove (#item #4). To remove a brick simply pass its name as an argument for option "-r" and specify the logical volume by its mount point: # volume.reiser4 -r /dev/vdb2 /mnt The procedure of brick removal starts from moving all data from the brick you want to remove to other bricks of your volume, so that resulted data distribution among the rest of bricks will be also fair. 
Portion of data stripes being moved during such migration is equal to the relative capacity of the brick to be removed (that it to the portion of capacity that the brick added to LV's capacity). Successful brick removal always leaves the volume is balanced state. So, in contrast with the operation of adding a brick, removing a brick is a rather long operation, which can be interrupted for various reasons. In this case volume will be marked as a "volume with incomplete brick removal". To check removal status of your LV simply run # volume.reiser4 /mnt and check the field "health". To complete brick removal in the current mount session simply run # volume.reiser4 -R /mnt Note, that the option -R (--finish-removal) doesn't accept any arguments. On success update your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]: remove the information about the brick /dev/vdb2 at #3 and #4. Check your kernel logs: it should contain a message that brick /dev/vdb2 has been unregistered. Now device /dev/vdb2 doesn't belong to the logical volume any more, and you can reuse it for other purposes (re-format, etc). = Changing brick's capacity = At any time (in the assumption that no other volume operation is in progress) you can change abstract capacity of any brick to some new value, different from 0. Changing capacity always changes volume partitioning, and therefore, breaks fairness of distribution, so Reiser5 automatically launches rebalancing to make sure that resulted distribution is fair for the new set of capacities. In particular, increasing bricks capacity will move some data from other bricks to the brick, whose capacity was increased. Decreasing bricks capacity will move some data from the brick, whose capacity was decreased, to other bricks. To change abstract capacity of a brick /dev/vdb1 to a new value (e.g. 
200000), simply run # volume.reiser4 -z /dev/vdb1 -c 200000 /mnt Pronounced as "resize brick /dev/vdb1 to new capacity 200000 in the volume mounted at /mnt". The operation of changing capacity can return an error. Most likely it is -ENOSPC, which is a side effect of concurrent regular file writes. In this case check the status of your LV. If it is unbalanced, then consider removing some files from your LV and complete balancing by running # volume.reiser4 -b /mnt Otherwise, repeat the operation from scratch. Comment. Changing a brick's capacity to 0 is undefined and will return an error. Consider the brick removal operation instead. = Operations with meta-data brick = The meta-data brick can also contain data stripes and participate in data distribution like the data bricks. All the volume operations described above are also applicable to the meta-data brick. Note, however, that it is impossible to completely remove the meta-data brick from the logical volume for obvious reasons (meta-data need to be stored somewhere), so the brick removal operation applied to the meta-data brick actually removes it only from the Data Storage Array (DSA), the subset of the LV consisting of the bricks that participate in regular data distribution according to their abstract capacities. Once you remove the meta-data brick from the DSA, that brick will be used only to store meta-data. The operation of adding a brick, applied to the meta-data brick, returns it to the DSA. Important: Reiser5 doesn't count busy data and meta-data blocks separately. So, in contrast with data bricks (which contain only data), you are not able to find out the real space occupied by data blocks on the meta-data brick - Reiser5 knows only the total space occupied. To check the status of the meta-data brick of the volume mounted at /mnt simply run # volume.reiser4 -p0 /mnt and check the field "in DSA". 
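As a back-of-the-envelope illustration of the partitioning arithmetic behind the capacity operations above, here is a minimal sketch in plain POSIX shell with awk. The three capacity values are made-up examples, not the output of any real volume; it shows why resizing one brick forces rebalancing:

```shell
#!/bin/sh
# relcap TOTAL CAPACITY - print a brick's relative capacity (4 decimals)
relcap() {
    awk -v t="$1" -v c="$2" 'BEGIN { printf "%.4f", c / t }'
}

# Hypothetical volume: bricks with abstract capacities 100000, 100000, 50000.
# Total capacity before resizing: 250000.
# After resizing the first brick to 200000 (volume.reiser4 -z ... -c 200000),
# the total becomes 350000 and every relative capacity changes:
echo "brick 1: $(relcap 250000 100000) -> $(relcap 350000 200000)"
echo "brick 2: $(relcap 250000 100000) -> $(relcap 350000 100000)"
echo "brick 3: $(relcap 250000  50000) -> $(relcap 350000  50000)"
```

A distribution that was fair for the partitioning (0.4000, 0.4000, 0.2000) is no longer fair for (0.5714, 0.2857, 0.1429), so data has to migrate toward the resized brick.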
= Unmounting a logical volume = To terminate a mount session just issue the usual umount against the mount point: # umount /mnt Note that after unmounting the volume all bricks by default remain registered in the system till system shutdown. If you want to unregister a brick before system shutdown, then simply issue the following command: # volume.reiser4 -u BRICK_NAME = Deploying a logical volume after correct unmount = After unmounting a logical volume all its bricks remain registered in the system. So, if you want to mount the volume again, simply issue the mount command against any of its bricks. It is recommended to issue it against the meta-data brick. NOTE: Reiser5 will refuse to mount a logical volume in the case when a wrong (incomplete or redundant) set of bricks is registered in the system. A redundant set of bricks appears, for example, when you mistakenly register a brick that was earlier removed from the logical volume. = Deploying a logical volume after correct shutdown = First of all, check the [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing configuration] of your volume and make sure that all its bricks (data and meta-data ones) are registered in the system. The list of registered bricks can be printed by # volume.reiser4 -l Also make sure that the set of bricks registered for the volume doesn't contain bricks not mentioned in the [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]. Important: Reiser5 will refuse to mount a logical volume in the case when a wrong (incomplete or redundant) set of bricks is registered in the system. A redundant set of bricks appears, for example, when you mistakenly register a brick that was removed from the logical volume. 
For these reasons we strongly recommend keeping track of your LV - store its [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing configuration] somewhere, but not on that volume! And don't forget to update that configuration after '''every''' volume operation. If you have lost the configuration of your LV and don't remember it (which is most likely for large volumes), then it will be rather painful to restore: currently there are no tools to manage logical volumes off-line, so users have to do this on their own. It is not at all difficult. To register a brick in the system use the following command: # volume.reiser4 -g BRICK_NAME To print a list of all registered bricks use # volume.reiser4 -l Now mount your LV by simply issuing a mount(8) command against one of its bricks. We recommend to issue it against the meta-data brick. Comment. Reiser5 always tries to register the brick which is passed to the mount command as an argument, so it is not necessary to pre-register the brick you want to issue a mount command against. = Deploying a logical volume after hard reset or system crash = If no volume operations were interrupted by the hard reset or system crash, then just follow the instructions in this [https://reiser4.wiki.kernel.org/index.php?title=Logical_Volumes_Administration#Deploying_a_logical_volume_after_correct_shutdown section]. In Reiser5 only a restricted number of bricks participate in every transaction. The maximal number of such bricks can be specified by the user. At mount time a transaction replay procedure will be launched on each such brick independently, in parallel. 
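The registration step can be scripted. The sketch below (POSIX shell) reads brick names from a saved volume-configuration file; the file format (one device name per line) and the REGISTER override are assumptions for illustration, while the underlying command, volume.reiser4 -g BRICK_NAME, is the one described above:

```shell
#!/bin/sh
# register_bricks CONF_FILE
# Registers every brick listed in CONF_FILE (one device name per line;
# blank lines and '#' comments are ignored).  Set REGISTER=echo first
# for a dry run that only prints the brick names.
register_bricks() {
    conf="$1"
    cmd="${REGISTER:-volume.reiser4 -g}"
    while IFS= read -r brick; do
        case "$brick" in ''|'#'*) continue ;; esac
        $cmd "$brick" || return 1
    done < "$conf"
}
```

Usage (assuming the configuration was saved to a hypothetical /root/myvolume.conf): run register_bricks /root/myvolume.conf, then mount the meta-data brick as usual.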
Depending on the kind of interrupted volume operation, perform one of the following actions: == Volume balancing was interrupted == Check your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]. Register the complete set of bricks and mount the volume by issuing the mount command against any of its bricks. Check the balanced status of your LV by running # volume.reiser4 /mnt and checking the "balanced" value. If the volume is unbalanced, then complete balancing by running # volume.reiser4 -b /mnt == Brick removal was interrupted == Check your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]. Register the new set of bricks (that is, the set of bricks without the brick you wanted to remove). Try to mount the volume. In the case of an error, register also the brick you wanted to remove and try to mount again. Check the status of your LV by running # volume.reiser4 /mnt and checking the value of "health". If required, complete brick removal by running # volume.reiser4 -R /mnt Note that the option -R doesn't accept any arguments. After successful removal completion the brick will be automatically removed from the volume and unregistered. Verify this by checking the status of your LV and the list of registered bricks: # volume.reiser4 /mnt # volume.reiser4 -l Upon successful completion update your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration] accordingly. 
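The two recovery cases above can be folded into one check. This sketch (POSIX shell) reads status text of the form printed by volume.reiser4 on stdin and suggests the completion command. The exact spellings of the "balanced" and "health" values ("no", "incomplete removal") are assumptions; check them against the output of your volume.reiser4 build:

```shell
#!/bin/sh
# recovery_action MOUNT_POINT
# Reads "volume.reiser4 MOUNT_POINT" output on stdin and prints the
# command (if any) needed to bring the volume to a clean state.
recovery_action() {
    status=$(cat)
    case "$status" in
        *'health: incomplete removal'*)
            echo "volume.reiser4 -R $1" ;;   # finish brick removal first
        *'balanced: no'*)
            echo "volume.reiser4 -b $1" ;;   # complete balancing
        *)
            echo "nothing to do" ;;
    esac
}
```

Intended usage: volume.reiser4 /mnt | recovery_action /mnt. Note that the health check comes first, because on a volume with incomplete removal the -R completion has to run before ordinary balancing.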
= LV monitoring = Common info about the LV mounted at /mnt: # volume.reiser4 /mnt
 ID:             Volume UUID
 volume:         ID of the plugin managing the volume
 distribution:   ID of the distribution plugin
 stripe:         Stripe size in bytes
 segments:       Number of hash space segments (for distribution)
 bricks total:   Total number of bricks in the volume
 bricks in DSA:  Number of bricks participating in data distribution
 balanced:       Balanced status of the volume
 health:         Brick removal completion status
Info about the brick of index J: # volume.reiser4 -p J /mnt
 internal ID:    Brick's "internal ID" and its status in the volume
 external ID:    Brick's UUID
 device name:    Name of the block device associated with the brick
 block count:    Size of the block device in blocks
 blocks used:    Total number of occupied blocks on the device
 system blocks:  Minimal possible number of busy blocks on that device
 data capacity:  Abstract capacity of the brick
 space usage:    Portion of occupied blocks on the device
 in DSA:         Participation in regular data distribution
 is proxy:       Participation in data tiering (Burst Buffers, etc.)
Comment. When retrieving a brick's info make sure that no volume operations are in progress on that volume. Otherwise the command above will return an error (EBUSY). WARNING. Brick info obtained this way is not necessarily the most recent. To get actual info run sync(1) and make sure that no regular file operations are in progress. = Checking free space = To check the number of available free blocks on a volume mounted at /mnt, make sure that no regular file operations, as well as volume operations, are in progress on that volume, then run # sync # df --block-size=4K /mnt To check the number of free blocks on the brick of index J run # volume.reiser4 -p J /mnt and calculate the difference between "block count" and "blocks used". Comment. Not all free blocks on a brick/volume are available for use. 
The number of available free blocks is always ~95% of the total number of free blocks (Reiser4 reserves 5% to make sure that regular file truncate operations won't fail). NOTE: volume.reiser4 shows the total number of free blocks, whereas df(1) shows the number of available free blocks. The "space usage" statistic shows the portion of busy blocks on an individual brick. For the reasons explained above, "space usage" on any brick can not exceed 0.95. = Checking quality of data distribution = Quality of data distribution is a measure of the deviation of the real data space usage from the ideal one defined by the volume partitioning. The smaller the deviation, the better the distribution quality. Checking quality of distribution makes sense only in the case when your volume partitioning is space-based, or coincides with the space-based one. If your partitioning is throughput-based and doesn't coincide with the space-based one, then the quality of the actual data distribution can be rather bad: in this case the file system takes care that low-performance devices don't become a bottleneck, and effective space usage is not a high priority. Checking quality of data distribution is based on the free-block accounting provided by the file system. Note that the file system doesn't count busy data and meta-data blocks separately, so you are not able to find the real data space usage, and hence to check the quality of distribution, in the case when the meta-data brick contains data blocks. To check quality of distribution:
* make sure that the meta-data brick doesn't contain data blocks;
* make sure that no regular file and volume operations are currently in progress;
* find the "blocks used", "system blocks" and "data capacity" statistics for each data brick: # sync # volume.reiser4 -p 1 /mnt ... # volume.reiser4 -p N /mnt
* find the real data space usage on each brick;
* calculate the partitioning and the ideal data space usage on each data brick;
* find the deviation of the real usage (step 4) from the ideal one (step 5).
Example. 
Let's build an LV of 3 bricks (one 10G meta-data brick vdb1, and two data bricks: vdc1 (10G) and vdd1 (5G)) with space-based partitioning: # VOL_ID=`uuid -v4` # echo "Using uuid $VOL_ID" # mkfs.reiser4 -U $VOL_ID -y -t 256K /dev/vdb1 # mkfs.reiser4 -U $VOL_ID -y -a -t 256K /dev/vdc1 # mkfs.reiser4 -U $VOL_ID -y -a -t 256K /dev/vdd1 # mount /dev/vdb1 /mnt Fill the meta-data brick with data: # dd if=/dev/zero of=/mnt/myfile bs=256K No space left on device... Add data bricks /dev/vdc1 and /dev/vdd1 to the volume: # volume.reiser4 -a /dev/vdc1 /mnt # volume.reiser4 -a /dev/vdd1 /mnt Move all data blocks to the newly added bricks: # volume.reiser4 -r /dev/vdb1 /mnt # sync Now the meta-data brick doesn't contain data blocks (only meta-data ones), so we can calculate the quality of data distribution: # volume.reiser4 /mnt -p0 blocks used: 503 # volume.reiser4 /mnt -p1 blocks used: 1657203 system blocks: 115 data capacity: 2621069 # volume.reiser4 /mnt -p2 blocks used: 833001 system blocks: 73 data capacity: 1310391 Based on the statistics above, calculate the quality of distribution.
 Total data capacity of the volume: C = 2621069 + 1310391 = 3931460
 Relative capacities of the data bricks: C1 = 2621069 / 3931460 = 0.6667, C2 = 1310391 / 3931460 = 0.3333
 Real space usage on the data bricks (blocks used - system blocks): R1 = 1657203 - 115 = 1657088, R2 = 833001 - 73 = 832928
 Space usage on the volume: R = R1 + R2 = 1657088 + 832928 = 2490016
 Ideal data space usage on the data bricks: I1 = C1 * R = 0.6667 * 2490016 = 1660094, I2 = C2 * R = 0.3333 * 2490016 = 829922
 Deviation: D = (R1, R2) - (I1, I2) = (-3006, 3006)
 Relative deviation: D/R = (-0.0012, 0.0012)
 Quality of distribution: Q = 1 - max(|D1|, |D2|)/R = 1 - 0.0012 = 0.9988
Comment. For any specified number of bricks N and quality of distribution Q it is possible to find a configuration of a logical volume composed of N bricks such that the quality of distribution on that volume is better than Q. Comment. 
Quality of distribution Q doesn't depend on the number of bricks in the logical volume. This is a theorem, which can be strictly proven. = FAQ = Q. What happens if I lose a device-component (due to a breakdown, etc.) of my logical volume? A. The bodies of some of your regular files will become "punched" in random places. The portion of such files depends on the relative capacity of the lost brick, on the number of bricks in the logical volume, and on other factors. Fsck will be able to detect and remove such files with corrupted bodies. Nevertheless, we recommend considering mirroring your bricks (e.g. by software or hardware RAID-1) to avoid such highly unpleasant situations. [[category:Reiser4]]
When retrieving brick's info make sure that no volume operations over that volume are in progress. Otherwise the command above will return error (EBUSY). WARNING. Bricks info provided by such way is not necessarily the most recent one. To get an actual info run sync(1) and make sure that no regular file operations are in progress. = Checking free space = To check number of available free blocks on a volume mounted at /mnt, make sure that no regular file operations, as well as volume operations, are in progress on that volume, then run # sync # df --block-size=4K /mnt To check number of free blocks on the brick of index J run # volume.reiser4 -p J /mnt Then calculate the difference between block count and blocks used Comment. Not all free blocks on a brick/volume are available for use. Number of available free blocks is always ~95% of total number of free blocks (Reiser4 reserves 5% to make sure that regular file truncate operations won't fail). NOTE: volume.reiser4 shows total number of free blocks, whereas df(1) shows number of available free blocks. "Space usage" statistics shows a portion of busy blocks on individual brick. For the reasons explained above "space usage" on any brick can not be more than 0.95 = Checking quality of data distribution = Quality of data distribution is a measure of deviation of the real data space usage from the ideal one defined by volume partitioning. The smaller the deviation, the better the distribution quality. Checking quality of distribution makes sense only in the case when your volume partitioning is space-based, or if it coincides with the space-based one. If your partitioning is throughput-based, and it doesn't coincide with the space-based one, then quality of actual data distribution can be rather bad, as in this case the file system is worried for low-performance devices to not become a bottleneck, and effective space usage in this case is not a high priority. 
Checking quality of data distribution is based on the free blocks accounting, provided by the file system. Note that file system doesn't count busy data and meta-data blocks separately, so you are not able to find real data space usage, and hence to check quality of distribution in the case when meta-data brick contains data blocks. To check quality of distribution * make sure that meta-data brick doesn't contain data blocks; * make sure that no regular file and volume operations are currently in progress; * find "blocks used", "system blocks" and "data capacity" statistics for each data brick: # sync # volume.reiser4 -p 1 /mnt ... # volume.reiser4 -p N /mnt * find real data space usage on each brick; * calculate partitioning and ideal data space usage on each data brick; * find deviation of (4) from (5). Example. Let' build a LV of 3 bricks (one 10G meta-data brick sdb1, and two data bricks: sdc1 (10G), sdd1(5G)) with space-based partitioning: # VOL_ID=`uuid -v4` # echo "Using uuid $VOL_ID" # mkfs.reiser4 -U $VOL_ID -y -t 256K /dev/vdb1 # mkfs.reiser4 -U $VOL_ID -y -a -t 256K /dev/vdc1 # mkfs.reiser4 -U $VOL_ID -y -a -t 256K /dev/vdd1 # mount /dev/vdb1 /mnt Fill the meta-data brick with data: # dd if=/dev/zero of=/mnt/myfile bs=256K No space left on device... Add data-bricks /dev/sdc1 and dev/sdd1 to the volume: # volume.reiser4 -a /dev/vdc1 /mnt # volume.reiser4 -a /dev/vdd1 /mnt Move all data blocks to the newly added bricks: # volume.reiser4 -r /dev/vdb1 /mnt # sync Now meta-data brick doesn't contain data blocks (only meta-data ones), so that we can calculate quality of data distribution # volume.reiser4 /mnt -p0 blocks used: 503 # volume.reiser4 /mnt -p1 blocks used: 1657203 system blocks: 115 data capacity: 2621069 # volume.reiser4 /mnt -p2 blocks used: 833001 system blocks: 73 data capacity: 1310391 Basing on the statistics above calculate quality of distribution. 
Total data capacity of the volume: C = 2621069 + 1310391 = 3931460 Relative capacities of data bricks: C1 = 2621069 /(2621069 + 1310391) = 0.6667 C2 = 1310464 /(2621069 + 1310391) = 0.3333 Real space usage on data bricks (blocks used - system blocks): R1 = 1657203 - 115 = 1657088 R2 = 833001 - 73 = 832928 Space usage on the volume: R = R1 + R2 = 1657088 + 832928 = 2490016 Ideal data space usage on data bricks: I1 = C1 * R = 0.6667 * 2490016 = 1660094 I2 = C2 * R = 0.3333 * 2490016 = 829922 Deviation: D = (R1, R2) - (I1, I2) = (3006, -3006) Relative deviation: D/R = (-0.0012, 0.0012) Quality of distribution: Q = 1 - max(|D1|, |D1|) = 1 - 0.0012 = 0.9988 Comment. For any specified number of bricks N and quality of distribution Q it is possible to find a configuration of a logical volume composed of N bricks, so that quality of distribution on that volume will be better than Q. Comment. Quality of distribution Q doesn't depend on the number of bricks in the logical volume. This is a theorem, which can be strictly proven. = FAQ = Q. What happens if I lose a device-component (due to a breakdown, etc) of my logical volume? A. Bodies of some your regular files will become "punched" in random places. Portion of such files depends on the relative capacity of the lost brick, on the number of bricks in the logical volume, and on other factors. Fsck will be able to detect and remove such files with corrupted bodies. Nevertheless, we recommend to consider mirroring your bricks (e.g. by software, or hardware RAID-1) to avoid such highly unpleasant situations. [[category:Reiser4]] 728bc86a98fe967e87f388fee694977ea647aa54 4420 4419 2020-11-11T23:13:51Z Edward 4 /* Volume balancing was interrupted */ Before working with logical volumes you need to understand some basic [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Background principles]. Logical volume (LV) can be composed of any number of block devices, different in physical and geometric parameters. 
However, the optimal configuration (true parallelism) imposes some restrictions and dependencies on the sizes of such devices.

WARNING: This software is not yet stable. Don't put important data on logical volumes managed by software of release number 5.X.Y, and don't mount your old partitions under kernels with Reiser4 of SFRN 5.X.Y before it stabilizes.

IMPORTANT: Currently there are no tools to manage Reiser5 logical volumes off-line, so it is strongly recommended to save and update the configuration of your LV in a file which does not belong to that volume.

= Basic definitions. Volume configuration. Brick's capacity. Partitioning. Fair distribution. Balancing =

The basic configuration of a logical volume is the following information:

1) Volume UUID;
2) Number of bricks in the volume;
3) List of brick names or UUIDs in the volume;
4) UUID or name of the brick being added or removed (if any). That brick is not counted in (2) and (3).

Item (4) exists to handle operations interrupted for various reasons (system crash, hard reset, etc.) when bringing logical volumes on-line. For each volume this configuration should be stored somewhere (but not on that volume!) and properly updated before and after every volume operation performed on it. The user is responsible for this. The volume configuration is needed to deploy the volume.

'''Capacity of a brick''' (or abstract capacity) is a positive integer. Capacity is a property of the brick defined by the user; don't confuse it with the size of the block device. Think of it as the brick's "weight" in some units. It is the user who decides which property of the brick to use as its abstract capacity and in which units. For example, it can be the size of the block device in kilobytes or in megabytes, its throughput in MB/sec, or any other geometric or physical parameter of the device associated with the brick. It is important that the capacities of all bricks of the same logical volume are measured in the same units.
Also, it would be pointless to assign different properties as abstract capacities for bricks of the same LV - for example, the size of the block device for one brick and disk bandwidth for another.

The capacity of each brick is initialized by the mkfs utility. By default it is calculated as the number of free blocks on the device at the very end of the formatting procedure; for the meta-data brick it is 70% of that amount. The capacity of any brick can be changed on-line by the user.

'''Capacity of a logical volume''' is defined as the sum of the capacities of its component bricks.

'''Relative capacity of a brick''' is the ratio of the brick's capacity to the volume's capacity. Relative capacity defines the portion of IO requests that will be issued against that brick. The array of relative capacities (C1, C2, ...) of all bricks is called the volume partitioning. Obviously, C1 + C2 + ... = 1.

'''Real data space usage''' on a brick is the number of data blocks stored on that brick.

'''Ideal data space usage''' on a brick is defined as T*C, where T is the total number of data blocks stored in the volume and C is the relative capacity of the brick.

It is recommended to compose volumes so that the space-based partitioning coincides with the throughput-based one - that is the optimal volume configuration, which provides true parallelism. If this is impossible for some reason, choose a preferred partitioning method (space-based or throughput-based). Note that space-based partitioning saves volume space, whereas throughput-based partitioning saves volume throughput.

When performing regular file operations, Reiser5 distributes data stripes throughout the volume evenly and fairly. This means that the portion of IO requests issued against each brick is equal to its relative capacity, that is, to the portion of capacity that the brick adds to the total volume capacity. In contrast with regular file operations, volume operations break the fairness of data distribution on your logical volume.
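The capacity arithmetic above (relative capacities summing to 1, and ideal usage T*C) can be sketched with a short shell helper. The capacities and block total below are made-up example values, not real brick statistics:

```shell
#!/bin/sh
# Sketch of the capacity arithmetic: volume partitioning (relative
# capacities C_i) and ideal data space usage I_i = T * C_i.
# All numbers are hypothetical example values.

CAPACITIES="2621069 1310391"   # abstract capacities of two data bricks
T=2490016                      # total number of data blocks in the volume

# Volume capacity = sum of brick capacities.
VOL_CAP=$(echo "$CAPACITIES" | awk '{ s = 0; for (i = 1; i <= NF; i++) s += $i; print s }')

# Relative capacity C_i = cap_i / VOL_CAP; ideal usage I_i = T * C_i.
PARTITIONING=$(echo "$CAPACITIES" | awk -v total="$VOL_CAP" -v T="$T" '{
    for (i = 1; i <= NF; i++)
        printf "brick %d: C=%.4f ideal=%d blocks\n", i, $i / total, int(T * $i / total + 0.5)
}')

echo "volume capacity: $VOL_CAP"
echo "$PARTITIONING"
```

Here the two relative capacities come out as 0.6667 and 0.3333, i.e. two thirds of IO requests would be issued against the first brick.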
To restore fairness of distribution, a special balancing procedure should be run on the volume. For example, after a brick is added to a logical volume, the balancing procedure populates the new brick with data moved from other bricks. All volume operations except brick removal are fast, atomic, and leave the volume in an unbalanced state. The brick removal operation always includes balancing, which moves data from the brick you want to remove to the other bricks of the volume. If that data migration is interrupted for some reason, the volume is marked as a "volume with incomplete brick removal". It is allowed to perform regular file and volume operations on an unbalanced LV (assuming there was no incomplete removal); however, in this case we don't guarantee good quality of data distribution on your LV. On a volume with incomplete removal you won't be able to perform regular volume operations at all - first you will need to complete the removal by running a special removal-completion procedure on your volume.

= Prepare Software and Hardware =

Build, install and boot a kernel with Reiser4 of software framework release number 5.X.Y. Kernel patches can be found [https://sourceforge.net/projects/reiser4/files/v5-unstable/ here]. Note that the Linux kernel and GNU utilities still recognize the testing software as "Reiser4". Make sure the following message appears in the kernel logs:

 "Loading Reiser4 (Software Framework Release: 5.X.Y)"

Build and install the latest [https://sourceforge.net/projects/reiser4/files/reiser4-utils/libaal/ libaal]. Download, build and install the latest version 2.A.B of the [https://sourceforge.net/projects/reiser4/files/v5-unstable/ Reiser4progs package]. Make sure that the utility for managing logical volumes is installed (as part of the reiser4progs package) on your machine:

 # volume.reiser4 -?

= Creating a logical volume =

Start by choosing a unique ID (uuid) for your volume. By default it is generated by the mkfs utility.
However, the user can generate it with an appropriate tool (e.g. uuid(1)) and store it in an environment variable for convenience:

 # VOL_ID=`uuidgen`
 # echo "Using uuid $VOL_ID"

Choose a stripe size for your logical volume. For a good quality of distribution it is recommended that the stripe size not exceed 1/10000 of the volume size. On the other hand, too small a stripe will increase space consumption on your meta-data brick. In our example we choose a stripe size of 512K:

 # STRIPE=512K
 # echo "Using stripe size $STRIPE"

Start by creating the first brick of your volume - the meta-data brick - passing the volume ID and stripe size to the mkfs.reiser4 utility:

 # mkfs.reiser4 -U $VOL_ID -t $STRIPE /dev/vdb1

Currently only one meta-data brick per volume is supported, so it is recommended that the block device for the meta-data brick not be too small. In most cases it is enough if your meta-data brick is not smaller than 1/200 of the maximal volume size. For example, a 100G meta-data brick will be able to service a ~20T logical volume. Data and meta-data bricks don't differ from the standpoint of disk format, and there is no special option to tell the mkfs utility that we want to create a meta-data brick: the first brick in the volume automatically becomes the meta-data brick, and the other bricks are interpreted as data bricks.

Mount your initial logical volume consisting of one meta-data brick:

 # mount /dev/vdb1 /mnt

Find the record about your volume in the output of the following command:

 # volume.reiser4 -l

Create the configuration of your logical volume (its definition is above) and store it somewhere - but not on that volume! Your logical volume is now on-line and ready to use. You can perform regular file operations and volume operations (e.g. add a data brick to your LV).

= Adding a data brick to LV =

At any time you can add a data brick to your LV. You can do it in parallel with regular file operations executing on this volume. Make sure, however, that there is no other volume operation (e.g.
removing a brick) in progress over your volume; otherwise your operation will fail with EBUSY. Obviously, adding a brick will increase the capacity of your volume.

Choose a block device for the new data brick. Make sure that it is neither too large nor too small: the capacities of any two bricks of the same logical volume cannot differ by more than 2^19 (roughly half a million) times. E.g. your logical volume cannot contain both a 1M brick and a 2T brick. Any attempt to add a brick of improper capacity will fail with an error. Format the device with the same volume ID and stripe size as you used for the meta-data brick, but also specify the "-a" option (to not restrict data capacity):

 # mkfs.reiser4 -U $VOL_ID -t $STRIPE -a /dev/vdb2

Important: the data brick must be formatted with the same volume ID and stripe size as the meta-data brick of your logical volume; otherwise the operation of adding the data brick will fail.

Update item (4) of your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration] with the UUID or name of the brick you want to add. To add the brick, simply pass its name as an argument to the option "-a" and specify your LV via its mount point:

 # volume.reiser4 -a /dev/vdb2 /mnt

By default the operation of adding a brick is fast and atomic and leaves the volume in an unbalanced state, so after adding a brick you might want to run the balancing procedure, which will move a portion of data to the new brick from the other bricks of the logical volume, making data distribution on your volume fair:

 # volume.reiser4 -b /mnt

The portion of data blocks moved during such rebalancing is equal to the relative capacity of the new brick, that is, to the portion of capacity that the new brick adds to the updated LV's capacity. This important property defines the cost of the balancing procedure.
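As a rough illustration of this cost, the expected number of moved stripes can be estimated from the capacities alone. All numbers below are hypothetical:

```shell
#!/bin/sh
# Sketch: estimate the rebalancing cost after adding a brick, using the
# property above: the fraction of stripes moved equals the relative
# capacity of the new brick. All numbers are hypothetical.

OLD_CAP=3000000       # volume capacity before the brick is added
NEW_BRICK_CAP=1000000 # abstract capacity of the brick being added
STRIPES=800000        # total number of data stripes in the volume

# Relative capacity of the new brick in the updated volume is
# p = NEW_BRICK_CAP / (OLD_CAP + NEW_BRICK_CAP); stripes moved ~= p * STRIPES.
MOVED=$(awk -v o="$OLD_CAP" -v n="$NEW_BRICK_CAP" -v s="$STRIPES" \
            'BEGIN { printf "%d", s * n / (o + n) }')
echo "estimated stripes to move: $MOVED"
```

With these numbers the new brick contributes a quarter of the updated capacity, so roughly a quarter of the stripes are moved.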
If the portion of capacity added by a brick is small, then the number of stripes moved during balancing is also small. Specifying the option -B (--with-balance) will automatically trigger the balancing procedure after adding the brick:

 # volume.reiser4 -Ba /dev/vdb2 /mnt

Upon successful completion, update your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]: increment (2), add info about the new brick to (3), and remove the record at (4).

= Removing a data brick from LV =

At any time you can remove any data brick from your LV (assuming that your volume is not marked as a "volume with incomplete brick removal"). You can perform brick removal in parallel with regular file operations executing on that volume. However, no other volume operation (e.g. adding a brick) should be in progress on your volume; otherwise the removal will fail with EBUSY. Obviously, the removal operation will decrease the abstract capacity of your LV. Note that the other bricks must have enough space to store all data blocks of the brick you want to remove; otherwise the removal operation will return an error (ENOSPC).

Suppose you want to remove brick /dev/vdb2 from your LV mounted at /mnt. Update your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration] with the UUID and name of the brick you want to remove (item (4)). To remove the brick, simply pass its name as an argument to the option "-r" and specify the logical volume by its mount point:

 # volume.reiser4 -r /dev/vdb2 /mnt

The procedure of brick removal starts by moving all data from the brick you want to remove to the other bricks of your volume, so that the resulting data distribution among the remaining bricks is also fair.
The portion of data stripes moved during such migration is equal to the relative capacity of the brick to be removed (that is, to the portion of capacity that the brick added to the LV's capacity). Successful brick removal always leaves the volume in a balanced state. So, in contrast with the operation of adding a brick, removing a brick is a rather long operation, which can be interrupted for various reasons. In that case the volume will be marked as a "volume with incomplete brick removal". To check the removal status of your LV, simply run

 # volume.reiser4 /mnt

and check the field "health". To complete brick removal in the current mount session, simply run

 # volume.reiser4 -R /mnt

Note that the option -R (--finish-removal) doesn't accept any arguments. On success, update your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]: remove the information about the brick /dev/vdb2 at (3) and (4). Check your kernel logs: they should contain a message that brick /dev/vdb2 has been unregistered. The device /dev/vdb2 no longer belongs to the logical volume, and you can reuse it for other purposes (re-format it, etc.).

= Changing brick's capacity =

At any time (assuming that no other volume operation is in progress) you can change the abstract capacity of any brick to a new non-zero value. Changing capacity always changes the volume partitioning and therefore breaks fairness of distribution, so Reiser5 automatically launches rebalancing to make sure that the resulting distribution is fair for the new set of capacities. In particular, increasing a brick's capacity will move some data from other bricks to the brick whose capacity was increased; decreasing a brick's capacity will move some data from that brick to the other bricks. To change the abstract capacity of a brick /dev/vdb1 to a new value (e.g.
200000), simply run

 # volume.reiser4 -z /dev/vdb1 -c 200000 /mnt

Read this as "resize brick /dev/vdb1 to new capacity 200000 in the volume mounted at /mnt". The operation of changing capacity can return an error. Most likely it is -ENOSPC, a side effect of concurrent regular file writes. In this case check the status of your LV. If it is unbalanced, consider removing some files from the LV and completing the balancing by running

 # volume.reiser4 -b /mnt

Otherwise, repeat the operation from scratch. Comment. Changing a brick's capacity to 0 is undefined and will return an error; consider the brick removal operation instead.

= Operations with meta-data brick =

The meta-data brick can also contain data stripes and participate in data distribution like the other (data) bricks, so all the volume operations described above are also applicable to the meta-data brick. Note, however, that it is impossible to completely remove the meta-data brick from the logical volume, for obvious reasons (the meta-data need to be stored somewhere). Therefore the brick removal operation, applied to the meta-data brick, removes it only from the Data-Storage Array (DSA), not from the logical volume. The DSA is the subset of the LV consisting of the bricks that participate in data distribution. Once you remove the meta-data brick from the DSA, that brick will be used only to store meta-data. The operation of adding a brick, applied to the meta-data brick, returns it to the DSA.

Important: Reiser5 doesn't count busy data and meta-data blocks separately. So, in contrast with data bricks (which contain only data), you cannot find out the real space occupied by data blocks on the meta-data brick - Reiser5 knows only the total space occupied. To check the status of the meta-data brick of the volume mounted at /mnt, simply run

 # volume.reiser4 -p0 /mnt

and check the value of "in DSA".
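A minimal sketch of that check, extracting the "in DSA" field. The captured output here is a hypothetical stand-in for a real volume.reiser4 -p0 run; the field names follow the list given in the "LV monitoring" section below:

```shell
#!/bin/sh
# Sketch: extract the "in DSA" field for the meta-data brick.
# The sample text below is hypothetical output; in practice it would
# come from: BRICK_INFO=$(volume.reiser4 -p0 /mnt)

BRICK_INFO='internal ID: 0
device name: /dev/vdb1
block count: 2621440
blocks used: 503
in DSA: no'

# Split each line on ": " and pick the value of the "in DSA" field.
IN_DSA=$(echo "$BRICK_INFO" | awk -F': ' '$1 == "in DSA" { print $2 }')
echo "meta-data brick in DSA: $IN_DSA"
```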
= Unmounting a logical volume =

To terminate a mount session, just issue the usual umount against the mount point:

 # umount /mnt

Note that after unmounting the volume all its bricks by default remain registered in the system until system shutdown. If you want to unregister a brick before system shutdown, simply issue the following command:

 # volume.reiser4 -u BRICK_NAME

= Deploying a logical volume after correct unmount =

After unmounting a logical volume all its bricks remain registered in the system. So, if you want to mount the volume again, simply issue the mount command against one of its bricks. It is recommended to issue it against the meta-data brick.

NOTE: Reiser5 will refuse to mount a logical volume when a wrong (incomplete or redundant) set of bricks is registered in the system. A redundant set of bricks appears, for example, when you mistakenly register a brick that was earlier removed from the logical volume.

= Deploying a logical volume after correct shutdown =

First of all, check the [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing configuration] of your volume and make sure that all its bricks (data and meta-data) are registered in the system. The list of registered bricks can be printed with

 # volume.reiser4 -l

Also make sure that the set of bricks registered for the volume doesn't contain bricks not mentioned in the [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]. Important: Reiser5 will refuse to mount a logical volume when a wrong (incomplete or redundant) set of bricks is registered in the system. A redundant set of bricks appears, for example, when you mistakenly register a brick that was removed from the logical volume.
For this reason we strongly recommend keeping track of your LV - store its [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing configuration] somewhere, but not on this volume! And don't forget to update that configuration after '''every''' volume operation. If you lose the configuration of your LV and don't remember it (which is likely for large volumes), it will be rather painful to restore: currently there are no tools to manage logical volumes off-line, so users are expected to do this on their own. It is not at all difficult.

To register a brick in the system, use the following command:

 # volume.reiser4 -g BRICK_NAME

To print a list of all registered bricks, use

 # volume.reiser4 -l

Now mount your LV by simply issuing a mount(8) command against one of its bricks. We recommend issuing it against the meta-data brick. Comment. Reiser5 always tries to register the brick which is passed to the mount command as an argument, so it is not necessary to pre-register the brick you want to issue the mount command against.

= Deploying a logical volume after hard reset or system crash =

If no volume operations were interrupted by the hard reset or system crash, just follow the instructions in this [https://reiser4.wiki.kernel.org/index.php?title=Logical_Volumes_Administration#Deploying_a_logical_volume_after_correct_shutdown section]. In Reiser5 only a restricted number of bricks participates in any transaction; the maximal number of such bricks can be specified by the user. At mount time a transaction replay procedure is launched on each such brick independently, in parallel.
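Both recovery paths below begin the same way as a normal deployment: register the bricks from your saved configuration, then mount via the meta-data brick. A sketch of that step follows; the config file format (one brick name per line, meta-data brick first) is our own convention, not part of reiser4progs, and the commands are only printed here, not executed:

```shell
#!/bin/sh
# Sketch: generate the deployment commands for a volume from a saved
# configuration file (one brick name per line, meta-data brick first).
# Pipe the printed commands to sh to actually run them.

CONF=$(mktemp)                                # stand-in for a saved config
printf '%s\n' /dev/vdb1 /dev/vdc1 /dev/vdd1 > "$CONF"

DEPLOY_CMDS=$(
    while read -r brick; do
        echo "volume.reiser4 -g $brick"       # register each brick
    done < "$CONF"
    echo "mount $(head -n 1 "$CONF") /mnt"    # mount via the meta-data brick
)
printf '%s\n' "$DEPLOY_CMDS"
rm -f "$CONF"
```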
Depending on the kind of interrupted volume operation, perform one of the following actions:

== Volume balancing was interrupted ==

Check your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]. Register the complete set of bricks and mount the volume by issuing the mount command against one of its bricks. Check the balanced status of your LV by running

 # volume.reiser4 /mnt

and checking the "balanced" value. If the volume is unbalanced, complete the balancing by running

 # volume.reiser4 -b /mnt

== Brick removal was interrupted ==

Check your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]. Register the new set of bricks (that is, the set of bricks without the brick you wanted to remove). Try to mount the volume. In case of error, register also the brick you wanted to remove and try to mount again. Check the status of your LV by running

 # volume.reiser4 /mnt

and checking the value of "health". If required, complete the brick removal by running

 # volume.reiser4 -R /mnt

Note that the option -R doesn't accept any arguments. After successful completion of the removal, the brick will be automatically removed from the volume and unregistered. Make sure of this by checking the status of your LV and the list of registered bricks:

 # volume.reiser4 /mnt
 # volume.reiser4 -l

Upon successful completion, update your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration] accordingly.
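The checks in the two subsections above can be combined into one small helper that picks the next recovery step from the "balanced" and "health" fields. The sample status text and the exact "health" wording are assumptions; the field names follow the "LV monitoring" section:

```shell
#!/bin/sh
# Sketch: choose the recovery step after a crash from the "balanced"
# and "health" fields of `volume.reiser4 /mnt`. The sample output and
# the exact "health" value are hypothetical; in practice it would come
# from: STATUS=$(volume.reiser4 /mnt)

STATUS='bricks total: 3
balanced: no
health: incomplete brick removal'

HEALTH=$(echo "$STATUS" | awk -F': ' '$1 == "health" { print $2 }')
BALANCED=$(echo "$STATUS" | awk -F': ' '$1 == "balanced" { print $2 }')

if [ "$HEALTH" = "incomplete brick removal" ]; then
    NEXT="volume.reiser4 -R /mnt"    # complete the interrupted removal
elif [ "$BALANCED" = "no" ]; then
    NEXT="volume.reiser4 -b /mnt"    # complete the interrupted balancing
else
    NEXT="nothing to do"
fi
echo "next step: $NEXT"
```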
= LV monitoring =

Common info about the LV mounted at /mnt:

 # volume.reiser4 /mnt

 ID:             Volume UUID
 volume:         ID of the plugin managing the volume
 distribution:   ID of the distribution plugin
 stripe:         Stripe size in bytes
 segments:       Number of hash space segments (for distribution)
 bricks total:   Total number of bricks in the volume
 bricks in DSA:  Number of bricks participating in data distribution
 balanced:       Balanced status of the volume
 health:         Brick removal completion status

Info about its brick of index J:

 # volume.reiser4 -p J /mnt

 internal ID:    Brick's "internal ID" and its status in the volume
 external ID:    Brick's UUID
 device name:    Name of the block device associated with the brick
 block count:    Size of the block device in blocks
 blocks used:    Total number of occupied blocks on the device
 system blocks:  Minimal possible number of busy blocks on that device
 data capacity:  Abstract capacity of the brick
 space usage:    Portion of occupied blocks on the device
 in DSA:         Participation in regular data distribution
 is proxy:       Participation in data tiering (Burst Buffers, etc.)

Comment. When retrieving brick info, make sure that no volume operations on that volume are in progress; otherwise the command above will return an error (EBUSY). WARNING. Brick info obtained this way is not necessarily the most recent. To get up-to-date info, run sync(1) and make sure that no regular file operations are in progress.

= Checking free space =

To check the number of available free blocks on a volume mounted at /mnt, make sure that no regular file operations or volume operations are in progress on that volume, then run

 # sync
 # df --block-size=4K /mnt

To check the number of free blocks on the brick of index J, run

 # volume.reiser4 -p J /mnt

and calculate the difference between "block count" and "blocks used". Comment. Not all free blocks on a brick/volume are available for use.
The number of available free blocks is always ~95% of the total number of free blocks (Reiser4 reserves 5% to make sure that regular file truncate operations won't fail).

NOTE: volume.reiser4 shows the total number of free blocks, whereas df(1) shows the number of available free blocks.

The "space usage" statistic shows the portion of busy blocks on an individual brick. For the reasons explained above, "space usage" on any brick cannot exceed 0.95.

= Checking quality of data distribution =

Quality of data distribution is a measure of the deviation of the real data space usage from the ideal one defined by volume partitioning. The smaller the deviation, the better the distribution quality.

Checking quality of distribution makes sense only when your volume partitioning is space-based, or coincides with the space-based one. If your partitioning is throughput-based and doesn't coincide with the space-based one, then the quality of the actual data distribution can be rather poor: in that case the file system takes care that low-performance devices don't become a bottleneck, and effective space usage is not a high priority.

Checking quality of data distribution is based on the free-blocks accounting provided by the file system. Note that the file system doesn't count busy data and meta-data blocks separately, so you cannot find the real data space usage, and hence cannot check quality of distribution, when the meta-data brick contains data blocks.

To check quality of distribution:
* make sure that the meta-data brick doesn't contain data blocks;
* make sure that no regular file or volume operations are currently in progress;
* find the "blocks used", "system blocks" and "data capacity" statistics for each data brick:

 # sync
 # volume.reiser4 -p 1 /mnt
 ...
 # volume.reiser4 -p N /mnt

* find the real data space usage on each brick;
* calculate the partitioning and the ideal data space usage on each data brick;
* find the deviation of the real data space usage from the ideal one.

Example.
Let's build an LV of 3 bricks (one 10G meta-data brick vdb1, and two data bricks: vdc1 (10G) and vdd1 (5G)) with space-based partitioning:

 # VOL_ID=`uuid -v4`
 # echo "Using uuid $VOL_ID"
 # mkfs.reiser4 -U $VOL_ID -y -t 256K /dev/vdb1
 # mkfs.reiser4 -U $VOL_ID -y -a -t 256K /dev/vdc1
 # mkfs.reiser4 -U $VOL_ID -y -a -t 256K /dev/vdd1
 # mount /dev/vdb1 /mnt

Fill the meta-data brick with data:

 # dd if=/dev/zero of=/mnt/myfile bs=256K
 No space left on device...

Add data bricks /dev/vdc1 and /dev/vdd1 to the volume:

 # volume.reiser4 -a /dev/vdc1 /mnt
 # volume.reiser4 -a /dev/vdd1 /mnt

Move all data blocks to the newly added bricks:

 # volume.reiser4 -r /dev/vdb1 /mnt
 # sync

Now the meta-data brick doesn't contain data blocks (only meta-data ones), so we can calculate quality of data distribution:

 # volume.reiser4 /mnt -p0
 blocks used: 503
 # volume.reiser4 /mnt -p1
 blocks used: 1657203
 system blocks: 115
 data capacity: 2621069
 # volume.reiser4 /mnt -p2
 blocks used: 833001
 system blocks: 73
 data capacity: 1310391

Based on the statistics above, calculate quality of distribution.

Total data capacity of the volume:
 C = 2621069 + 1310391 = 3931460
Relative capacities of the data bricks:
 C1 = 2621069 / 3931460 = 0.6667
 C2 = 1310391 / 3931460 = 0.3333
Real space usage on the data bricks (blocks used - system blocks):
 R1 = 1657203 - 115 = 1657088
 R2 = 833001 - 73 = 832928
Space usage on the volume:
 R = R1 + R2 = 1657088 + 832928 = 2490016
Ideal data space usage on the data bricks:
 I1 = C1 * R = 0.6667 * 2490016 = 1660094
 I2 = C2 * R = 0.3333 * 2490016 = 829922
Deviation:
 D = (R1, R2) - (I1, I2) = (-3006, 3006)
Relative deviation:
 D/R = (-0.0012, 0.0012)
Quality of distribution:
 Q = 1 - max(|D1|, |D2|)/R = 1 - 0.0012 = 0.9988

Comment. For any specified number of bricks N and quality of distribution Q, it is possible to find a configuration of a logical volume composed of N bricks such that the quality of distribution on that volume is better than Q.
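The arithmetic of the example condenses into a short script; a minimal sketch, with the per-brick statistics taken from the example above (the function name is ours, not a Reiser5 tool):

```python
# Sketch of the distribution-quality calculation: brick statistics are
# the (blocks used, system blocks, data capacity) triples reported by
# `volume.reiser4 -p J /mnt` for each data brick.

def distribution_quality(bricks):
    """bricks: list of (blocks_used, system_blocks, data_capacity) per data brick."""
    caps = [cap for _, _, cap in bricks]
    total_cap = sum(caps)                            # C: total data capacity
    real = [used - sys for used, sys, _ in bricks]   # R_j = blocks used - system blocks
    total_real = sum(real)                           # R: data blocks on the volume
    ideal = [cap / total_cap * total_real for cap in caps]  # I_j = C_j * R
    rel_dev = [(r - i) / total_real for r, i in zip(real, ideal)]
    return 1 - max(abs(d) for d in rel_dev)

# Data bricks from the example above:
q = distribution_quality([(1657203, 115, 2621069), (833001, 73, 1310391)])
print(round(q, 4))  # 0.9988
```

Using exact (unrounded) relative capacities, the result agrees with the hand calculation to four decimal places.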
Quality of distribution Q doesn't depend on the number of bricks in the logical volume. This is a theorem, which can be strictly proven.

= FAQ =

Q. What happens if I lose a device-component (due to a breakdown, etc.) of my logical volume?

A. The bodies of some of your regular files will become "punched" in random places. The portion of such files depends on the relative capacity of the lost brick, on the number of bricks in the logical volume, and on other factors. Fsck will be able to detect and remove files with corrupted bodies. Nevertheless, we recommend mirroring your bricks (e.g. by software or hardware RAID-1) to avoid such highly unpleasant situations.

[[category:Reiser4]]

Before working with logical volumes you need to understand some basic [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Background principles]. A logical volume (LV) can be composed of any number of block devices that differ in physical and geometric parameters. However, the optimal configuration (true parallelism) imposes some restrictions and dependencies on the size of such devices.

WARNING: This feature is not stable. Don't put important data on logical volumes managed by software of release number 5.X.Y. Also, don't mount your old partitions in kernels with Reiser4 of SFRN 5.X.Y before its stabilization.

IMPORTANT: Currently there are no tools to manage Reiser5 logical volumes off-line, so it is strongly recommended to save/update the configuration of your LV in a file which doesn't belong to that volume.

= Basic definitions. Volume configuration. Brick's capacity. Partitioning. Fair distribution. Balancing =

The basic configuration of a logical volume is the following information:

1) Volume UUID;
2) Number of bricks in the volume;
3) List of brick names or UUIDs in the volume;
4) UUID or name of the brick to be added/removed (if any).
That brick is not counted in (2) and (3). Item #4 serves to handle incomplete operations, interrupted for various reasons (system crash, hard reset, etc.), when bringing logical volumes on-line.

For each volume, its configuration should be stored somewhere (but not on that volume!) and properly updated before and after each volume operation performed on that volume. We make the user responsible for this. The volume configuration is needed to facilitate deploying the volume.

'''Capacity of a brick''' (or abstract capacity) is a positive integer number. Capacity is a brick's property defined by the user. Don't confuse it with the size of the block device; think of it as the brick's "weight" in some units. It is the user who decides which property of the brick to assign as its abstract capacity, and in which units. In particular, it can be the size of the block device in kilobytes, or its size in megabytes, or its throughput in MB/sec, or another geometric or physical parameter of the device associated with the brick. It is important that the capacities of all bricks of the same logical volume are measured in the same units. Also, it would be utterly pointless to assign different properties as abstract capacities for bricks of the same LV (for example, block device size for one brick and disk bandwidth for another).

The capacity of each brick gets initialized by the mkfs utility. By default it is calculated as the number of free blocks on the device at the very end of the formatting procedure; for the meta-data brick it is calculated as 70% of that amount. The capacity of any brick can be changed on-line by the user.

'''Capacity of a logical volume''' is defined as the sum of the capacities of its component bricks.

'''Relative capacity of a brick''' is the ratio of the brick's capacity to the volume's capacity. Relative capacity defines the portion of IO requests that will be issued against that brick. The array of relative capacities (C1, C2, ...) of all bricks is called volume partitioning. Obviously, C1 + C2 + ... = 1.
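The definitions above can be illustrated with a minimal sketch. This is not Reiser5's actual distribution code: it only shows how relative capacities partition a hash space (the "segments" reported by volume.reiser4) so that each brick receives a share of stripes proportional to the capacity it contributes.

```python
# Minimal sketch (not Reiser5's actual algorithm): relative capacities
# partition a hash space, and stripes land on bricks in proportion to
# the capacity each brick contributes to the volume.
import bisect
import hashlib

def partitioning(capacities):
    total = sum(capacities)                    # capacity of the logical volume
    return [c / total for c in capacities]     # relative capacities C1, C2, ...

def pick_brick(stripe_id, capacities):
    bounds, acc = [], 0
    for c in capacities:                       # cumulative capacity boundaries
        acc += c
        bounds.append(acc)
    h = int(hashlib.sha256(str(stripe_id).encode()).hexdigest(), 16) % acc
    return bisect.bisect_right(bounds, h)      # index of the receiving brick

caps = [100, 200, 100]                         # abstract capacities (same units)
print(partitioning(caps))                      # [0.25, 0.5, 0.25]; sums to 1

counts = [0, 0, 0]
for s in range(100000):
    counts[pick_brick(s, caps)] += 1
print([round(c / 100000, 2) for c in counts])  # roughly the relative capacities
```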
'''Real data space usage''' on a brick is the number of data blocks stored on that brick.

'''Ideal data space usage''' on a brick is defined as T*C, where T is the total number of data blocks stored in the volume and C is the relative capacity of the brick.

It is recommended to compose volumes so that the space-based partitioning coincides with the throughput-based one: this is the optimal volume configuration, which provides true parallelism. If that is impossible for some reason, choose a preferred partitioning method (space-based or throughput-based). Note that space-based partitioning saves volume space, whereas throughput-based partitioning saves volume throughput.

When performing regular file operations, Reiser5 distributes data stripes throughout the volume evenly and fairly. This means that the portion of IO requests issued against each brick is equal to its relative capacity, that is, to the portion of capacity that the brick adds to the total volume capacity.

In contrast with regular file operations, volume operations break the fairness of data distribution on your logical volume. To restore fairness of distribution, a special balancing procedure should be run on the volume. For example, after adding a brick to a logical volume, the balancing procedure will populate the new brick with data moved from other bricks.

All volume operations except brick removal are fast, atomic, and leave the volume in an unbalanced state. The operation of brick removal always includes balancing, which moves data from the brick you want to remove to the other bricks of the volume. If that data migration is interrupted for some reason, the volume is marked as a "volume with incomplete brick removal".

It is allowed to perform regular file and volume operations on an unbalanced LV (assuming it is not an incomplete removal). However, in this case we don't guarantee a good quality of data distribution on your LV.
In addition, on a volume with incomplete removal you won't be able to perform regular volume operations: first you will need to complete the removal by running a special removal completion procedure on your volume.

= Prepare Software and Hardware =

Build, install and boot a kernel with Reiser4 of software framework release number 5.X.Y. Kernel patches can be found [https://sourceforge.net/projects/reiser4/files/v5-unstable/ here]. Note that the Linux kernel and GNU utilities still recognize the testing stuff as "Reiser4". Make sure there is the following message in the kernel logs: "Loading Reiser4 (Software Framework Release: 5.X.Y)".

Build and install the latest [https://sourceforge.net/projects/reiser4/files/reiser4-utils/libaal/ libaal]. Download, build and install the latest version 2.A.B of the [https://sourceforge.net/projects/reiser4/files/v5-unstable/ Reiser4progs package]. Make sure that the utility for managing logical volumes is installed (as a part of the reiser4progs package) on your machine:

 # volume.reiser4 -?

= Creating a logical volume =

Start by choosing a unique ID (UUID) for your volume. By default it is generated by the mkfs utility. However, you can generate it yourself with a suitable tool (e.g. uuidgen(1)) and store it in an environment variable for convenience:

 # VOL_ID=`uuidgen`
 # echo "Using uuid $VOL_ID"

Choose a stripe size for your logical volume. For a good quality of distribution it is recommended that the stripe size doesn't exceed 1/10000 of the volume size. On the other hand, too small a stripe will increase space consumption on your meta-data brick. In our example we choose a stripe size of 512K:

 # STRIPE=512K
 # echo "Using stripe size $STRIPE"

Start by creating the first brick of your volume, the meta-data brick, passing the volume ID and stripe size to the mkfs.reiser4 utility:

 # mkfs.reiser4 -U $VOL_ID -t $STRIPE /dev/vdb1

Currently only one meta-data brick per volume is supported, so it is recommended that the block device for the meta-data brick is not too small.
In most cases it will be enough if your meta-data brick is not smaller than 1/200 of the maximal volume size. For example, a 100G meta-data brick will be able to service a ~20T logical volume.

Data and meta-data bricks don't differ from the standpoint of disk format, and there is no special option to inform the mkfs utility that we want to create a meta-data brick: the first brick in the volume automatically becomes the meta-data brick, and all other bricks are interpreted as data bricks.

Mount your initial logical volume consisting of one meta-data brick:

 # mount /dev/vdb1 /mnt

Find the record about your volume in the output of the following command:

 # volume.reiser4 -l

Create the configuration of your logical volume (its definition is above) and store it somewhere, but not on that volume! Your logical volume is now on-line and ready to use. You can perform regular file operations and volume operations (e.g. add a data brick to your LV).

= Adding a data brick to LV =

At any time you can add a data brick to your LV. You can do it in parallel with regular file operations executing on this volume. Make sure, however, that no other volume operation (e.g. removing a brick) is in progress on your volume, otherwise your operation will fail with EBUSY. Obviously, adding a brick will increase the capacity of your volume.

Choose a block device for the new data brick. Make sure that it is neither too large nor too small: the capacities of any 2 bricks of the same logical volume cannot differ by more than 2^19 (~520,000) times. E.g. your logical volume cannot contain both 1M and 2T bricks. Any attempt to add a brick of improper capacity will fail with an error.

Format it with the same volume ID and stripe size as you used for the meta-data brick, but also specify the "-a" option (to not restrict data capacity):
 # mkfs.reiser4 -U $VOL_ID -t $STRIPE -a /dev/vdb2

Important: the data brick must be formatted with the same volume ID and stripe size as the meta-data brick of your logical volume. Otherwise, the operation of adding the data brick will fail.

Update item #4 of your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration] with the UUID or name of the brick you want to add.

To add the brick, simply pass its name as an argument to the option "-a" and specify your LV via its mount point:

 # volume.reiser4 -a /dev/vdb2 /mnt

By default the operation of adding a brick is fast and atomic and leaves the volume in an unbalanced state, so after adding a brick you might want to run the balancing procedure, which will move a portion of data to the new brick from the other bricks of the logical volume, making data distribution on your volume fair:

 # volume.reiser4 -b /mnt

The portion of data blocks moved during such rebalancing is equal to the relative capacity of the new brick, that is, to the portion of capacity that the new brick adds to the updated LV capacity. This important property defines the cost of the balancing procedure: if the portion of capacity added by a brick is small, then the number of stripes moved during balancing is also small.

Specifying the option -B (--with-balance) will automatically trigger the balancing procedure after adding the brick:

 # volume.reiser4 -Ba /dev/vdb2 /mnt

Upon successful completion update your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]: increment (#2), add info about the new brick to (#3), and remove the record at (#4).
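The balancing-cost property above can be sketched as a one-line estimate. A minimal sketch (the function is ours, for illustration): the expected number of stripes moved equals the total stripe count times the new brick's relative capacity in the updated volume.

```python
# Sketch of the balancing cost when adding a brick: the portion of data
# stripes moved equals the new brick's share of the updated volume capacity.

def stripes_moved_on_add(existing_capacities, new_capacity, total_stripes):
    share = new_capacity / (sum(existing_capacities) + new_capacity)
    return round(total_stripes * share)

# Adding a 100-unit brick to a 300-unit volume holding 40000 stripes
# moves about a quarter of them:
print(stripes_moved_on_add([100, 200], 100, 40000))  # 10000
```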
= Removing a data brick from LV =

At any time you can remove any data brick from your LV (assuming that your volume is not marked as a "volume with incomplete brick removal"). You can perform brick removal in parallel with regular file operations executing on that volume. However, no other volume operation (e.g. adding a brick) should be in progress on your volume, otherwise the removal will fail with EBUSY. Obviously, the removal operation will decrease the abstract capacity of your LV. Note that the other bricks should have enough space to store all data blocks of the brick you want to remove; otherwise the removal operation will return an error (ENOSPC).

Suppose you want to remove brick /dev/vdb2 from your LV mounted at /mnt. Update your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration] with the UUID and name of the brick you want to remove (item #4). To remove the brick, simply pass its name as an argument to the option "-r" and specify the logical volume by its mount point:

 # volume.reiser4 -r /dev/vdb2 /mnt

The procedure of brick removal starts by moving all data from the brick you want to remove to the other bricks of your volume, so that the resulting data distribution among the remaining bricks is also fair. The portion of data stripes moved during such migration is equal to the relative capacity of the brick to be removed (that is, to the portion of capacity that the brick added to the LV capacity). Successful brick removal always leaves the volume in a balanced state.

So, in contrast with the operation of adding a brick, removing a brick is a rather long operation, which can be interrupted for various reasons. In this case the volume will be marked as a "volume with incomplete brick removal". To check the removal status of your LV, simply run

 # volume.reiser4 /mnt

and check the field "health".
To complete brick removal in the current mount session, simply run

 # volume.reiser4 -R /mnt

Note that the option -R (--finish-removal) doesn't accept any arguments.

On success, update your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]: remove the information about the brick /dev/vdb2 at #3 and #4. Check your kernel logs: they should contain a message that brick /dev/vdb2 has been unregistered. Now the device /dev/vdb2 doesn't belong to the logical volume any more, and you can reuse it for other purposes (re-format it, etc.).

= Changing brick's capacity =

At any time (assuming that no other volume operation is in progress) you can change the abstract capacity of any brick to a new non-zero value. Changing capacity always changes the volume partitioning and therefore breaks fairness of distribution, so Reiser5 automatically launches rebalancing to make sure that the resulting distribution is fair for the new set of capacities. In particular, increasing a brick's capacity will move some data from other bricks to the brick whose capacity was increased; decreasing a brick's capacity will move some data from the brick whose capacity was decreased to other bricks.

To change the abstract capacity of brick /dev/vdb1 to a new value (e.g. 200000), simply run

 # volume.reiser4 -z /dev/vdb1 -c 200000 /mnt

pronounced as "resize brick /dev/vdb1 to new capacity 200000 in the volume mounted at /mnt".

The operation of changing capacity can return an error, most likely ENOSPC, which is a side effect of concurrent regular file writes. In this case check the status of your LV. If it is unbalanced, consider removing some files from your LV and complete balancing by running

 # volume.reiser4 -b /mnt

Otherwise, repeat the operation from scratch.

Comment. Changing a brick's capacity to 0 is undefined and will return an error.
Consider the brick removal operation instead.

= Operations with meta-data brick =

The meta-data brick can also contain data stripes and participate in data distribution like the data bricks, so all the volume operations described above are also applicable to the meta-data brick. Note, however, that it is impossible to completely remove the meta-data brick from the logical volume for obvious reasons (meta-data needs to be stored somewhere), so the brick removal operation applied to the meta-data brick actually removes it from the Data Storage Array (DSA), not from the logical volume. The DSA is the subset of the LV consisting of the bricks participating in data distribution. Once you remove the meta-data brick from the DSA, that brick will be used only to store meta-data. The operation of adding a brick, applied to the meta-data brick, returns it to the DSA.

Important: Reiser5 doesn't count busy data and meta-data blocks separately. So, in contrast with data bricks (which contain only data), you cannot find out the real space occupied by data blocks on the meta-data brick: Reiser5 knows only the total space occupied.

To check the status of the meta-data brick of the volume mounted at /mnt, simply run

 # volume.reiser4 -p0 /mnt

and check the value of "in DSA".

= Unmounting a logical volume =

To terminate a mount session, just issue the usual umount against the mount point:

 # umount /mnt

Note that after unmounting the volume, all bricks by default remain registered in the system until system shutdown. If you want to unregister a brick before system shutdown, simply issue the following command:

 # volume.reiser4 -u BRICK_NAME

= Deploying a logical volume after correct unmount =

After unmounting a logical volume, all its bricks remain registered in the system. So, if you want to mount the volume again, simply issue the mount command against one of its bricks. It is recommended to issue it against the meta-data brick.
NOTE: Reiser5 will refuse to mount a logical volume when a wrong (incomplete or redundant) set of bricks is registered in the system. A redundant set of bricks appears, for example, when you mistakenly register a brick that was earlier removed from the logical volume.

= Deploying a logical volume after correct shutdown =

First of all, check the [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing configuration] of your volume and make sure that all its bricks (data and meta-data ones) are registered in the system. The list of registered bricks can be printed by

 # volume.reiser4 -l

Also make sure that the set of registered per-volume bricks doesn't contain bricks not mentioned in the [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration].

Important: Reiser5 will refuse to mount a logical volume when a wrong (incomplete or redundant) set of bricks is registered in the system. A redundant set of bricks appears, for example, when you mistakenly register a brick that was removed from the logical volume.

For this reason we strongly recommend that users keep track of their LV: store its [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing configuration] somewhere, but not on this volume! And don't forget to update that configuration after '''every''' volume operation. If you have lost the configuration of your LV and don't remember it (which is most likely for large volumes), it will be rather painful to restore: currently there are no tools to manage logical volumes off-line, so users are prompted to do this on their own. It is not at all difficult.
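The bookkeeping recommended above can be sketched in a few lines: keep items #1-#4 of the volume configuration in a file outside the volume, and compare it with the registered set before mounting. The dict layout and function below are assumptions for illustration, not a Reiser5 on-disk or tool format.

```python
# Hedged sketch of volume-configuration bookkeeping. The configuration
# dict mirrors items #1-#4 from "Basic definitions": volume UUID, the
# brick list, and any brick with a pending add/remove operation.

def check_registered(config, registered):
    """Return (missing, redundant) bricks relative to the saved configuration."""
    expected = set(config["bricks"])      # item #3 of the configuration
    if config.get("pending"):             # item #4: brick being added/removed
        expected.add(config["pending"])
    got = set(registered)
    return sorted(expected - got), sorted(got - expected)

cfg = {"uuid": "EXAMPLE-UUID", "bricks": ["/dev/vdb1", "/dev/vdc1"], "pending": None}
print(check_registered(cfg, ["/dev/vdb1"]))                            # vdc1 missing
print(check_registered(cfg, ["/dev/vdb1", "/dev/vdc1", "/dev/vdd1"]))  # vdd1 redundant
```

Either a non-empty "missing" or a non-empty "redundant" list predicts the mount refusal described above.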
To register a brick in the system, use the following command:

 # volume.reiser4 -g BRICK_NAME

To print a list of all registered bricks, use

 # volume.reiser4 -l

Now mount your LV by simply issuing a mount(8) command against one of its bricks. We recommend issuing it against the meta-data brick.

Comment. Reiser5 always tries to register the brick which is passed to the mount command as an argument, so it is not necessary to preregister the brick you want to issue the mount command against.

= Deploying a logical volume after hard reset or system crash =

If no volume operations were interrupted by the hard reset or system crash, just follow the instructions in this [https://reiser4.wiki.kernel.org/index.php?title=Logical_Volumes_Administration#Deploying_a_logical_volume_after_correct_shutdown section]. In Reiser5 only a restricted number of bricks participate in every transaction; the maximal number of such bricks can be specified by the user. At mount time a transaction replay procedure will be launched on each such brick independently, in parallel.
Try to mount the volume. In the case of error register also the brick you wanted to remove and try to mount again. Check the status of your LV by running # volume.reiser4 /mnt and checking the value of "health". If required, complete brick removal by running # volume.reiser4 -R /mnt Note, that the option -R doesn't accept any arguments. After successful removal completion the brick will be automatically removed from the volume and unregistered. Make sure of it by checking status of your LV and the list of registered bricks: # volume.reiser4 /mnt # volume.reiser4 -l Upon successful completion update your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration] respectively. = LV monitoring = Common info about LV mounted at /mnt # volume.reiser4 /mnt ID: Volume UUID volume: ID of plugin managing the volume distribution: ID of distribution plugin stripe: Stripe size in bytes segments: Number of hash space segments (for distribution) bricks total: Total number of bricks in the volume bricks in DSA: Number of bricks participating in data distribution balanced: Balanced status of the volume health: Brick removal completion status Info about any its brick of index J # volume.reiser4 -p J /mnt internal ID: Brick's "internal ID" and its status in the volume external ID: Brick's UUID device name: Name of the block device associated with the brick block count: Size of the block device in blocks blocks used: Total number of occupied blocks on the device system blocks: Minimal possible number of busy blocks on that device data capacity: Abstract capacity of the brick space usage: Portion of occupied blocks on the device in DSA: Participation in regular data distribution is proxy: Participation in data tiering (Burst Buffers, etc) Comment. When retrieving brick's info make sure that no volume operations over that volume are in progress. 
Otherwise the command above will return error (EBUSY). WARNING. Bricks info provided by such way is not necessarily the most recent one. To get an actual info run sync(1) and make sure that no regular file operations are in progress. = Checking free space = To check number of available free blocks on a volume mounted at /mnt, make sure that no regular file operations, as well as volume operations, are in progress on that volume, then run # sync # df --block-size=4K /mnt To check number of free blocks on the brick of index J run # volume.reiser4 -p J /mnt Then calculate the difference between block count and blocks used Comment. Not all free blocks on a brick/volume are available for use. Number of available free blocks is always ~95% of total number of free blocks (Reiser4 reserves 5% to make sure that regular file truncate operations won't fail). NOTE: volume.reiser4 shows total number of free blocks, whereas df(1) shows number of available free blocks. "Space usage" statistics shows a portion of busy blocks on individual brick. For the reasons explained above "space usage" on any brick can not be more than 0.95 = Checking quality of data distribution = Quality of data distribution is a measure of deviation of the real data space usage from the ideal one defined by volume partitioning. The smaller the deviation, the better the distribution quality. Checking quality of distribution makes sense only in the case when your volume partitioning is space-based, or if it coincides with the space-based one. If your partitioning is throughput-based, and it doesn't coincide with the space-based one, then quality of actual data distribution can be rather bad, as in this case the file system is worried for low-performance devices to not become a bottleneck, and effective space usage in this case is not a high priority. Checking quality of data distribution is based on the free blocks accounting, provided by the file system. 
Note that file system doesn't count busy data and meta-data blocks separately, so you are not able to find real data space usage, and hence to check quality of distribution in the case when meta-data brick contains data blocks. To check quality of distribution * make sure that meta-data brick doesn't contain data blocks; * make sure that no regular file and volume operations are currently in progress; * find "blocks used", "system blocks" and "data capacity" statistics for each data brick: # sync # volume.reiser4 -p 1 /mnt ... # volume.reiser4 -p N /mnt * find real data space usage on each brick; * calculate partitioning and ideal data space usage on each data brick; * find deviation of (4) from (5). Example. Let' build a LV of 3 bricks (one 10G meta-data brick sdb1, and two data bricks: sdc1 (10G), sdd1(5G)) with space-based partitioning: # VOL_ID=`uuid -v4` # echo "Using uuid $VOL_ID" # mkfs.reiser4 -U $VOL_ID -y -t 256K /dev/vdb1 # mkfs.reiser4 -U $VOL_ID -y -a -t 256K /dev/vdc1 # mkfs.reiser4 -U $VOL_ID -y -a -t 256K /dev/vdd1 # mount /dev/vdb1 /mnt Fill the meta-data brick with data: # dd if=/dev/zero of=/mnt/myfile bs=256K No space left on device... Add data-bricks /dev/sdc1 and dev/sdd1 to the volume: # volume.reiser4 -a /dev/vdc1 /mnt # volume.reiser4 -a /dev/vdd1 /mnt Move all data blocks to the newly added bricks: # volume.reiser4 -r /dev/vdb1 /mnt # sync Now meta-data brick doesn't contain data blocks (only meta-data ones), so that we can calculate quality of data distribution # volume.reiser4 /mnt -p0 blocks used: 503 # volume.reiser4 /mnt -p1 blocks used: 1657203 system blocks: 115 data capacity: 2621069 # volume.reiser4 /mnt -p2 blocks used: 833001 system blocks: 73 data capacity: 1310391 Basing on the statistics above calculate quality of distribution. 
Total data capacity of the volume: C = 2621069 + 1310391 = 3931460 Relative capacities of data bricks: C1 = 2621069 /(2621069 + 1310391) = 0.6667 C2 = 1310464 /(2621069 + 1310391) = 0.3333 Real space usage on data bricks (blocks used - system blocks): R1 = 1657203 - 115 = 1657088 R2 = 833001 - 73 = 832928 Space usage on the volume: R = R1 + R2 = 1657088 + 832928 = 2490016 Ideal data space usage on data bricks: I1 = C1 * R = 0.6667 * 2490016 = 1660094 I2 = C2 * R = 0.3333 * 2490016 = 829922 Deviation: D = (R1, R2) - (I1, I2) = (3006, -3006) Relative deviation: D/R = (-0.0012, 0.0012) Quality of distribution: Q = 1 - max(|D1|, |D1|) = 1 - 0.0012 = 0.9988 Comment. For any specified number of bricks N and quality of distribution Q it is possible to find a configuration of a logical volume composed of N bricks, so that quality of distribution on that volume will be better than Q. Comment. Quality of distribution Q doesn't depend on the number of bricks in the logical volume. This is a theorem, which can be strictly proven. = FAQ = Q. What happens if I lose a device-component (due to a breakdown, etc) of my logical volume? A. Bodies of some your regular files will become "punched" in random places. Portion of such files depends on the relative capacity of the lost brick, on the number of bricks in the logical volume, and on other factors. Fsck will be able to detect and remove such files with corrupted bodies. Nevertheless, we recommend to consider mirroring your bricks (e.g. by software, or hardware RAID-1) to avoid such highly unpleasant situations. [[category:Reiser4]] 0526e5be7c391daf34ceaeb234962aa1a4543abd 4418 4417 2020-11-11T22:59:30Z Edward 4 /* Deploying a logical volume after correct unmount */ Before working with logical volumes you need to understand some basic [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Background principles]. Logical volume (LV) can be composed of any number of block devices, different in physical and geometric parameters. 
However, the optimal configuration (true parallelism) imposes some restrictions and dependencies on the sizes of those devices. WARNING: This code is not stable. Don't put important data on logical volumes managed by software of release number 5.X.Y, and don't mount your old partitions in kernels with Reiser4 of SFRN 5.X.Y before it is stabilized. IMPORTANT: Currently there are no tools to manage Reiser5 logical volumes off-line, so it is strongly recommended to save and update the configuration of your LV in a file which does not belong to that volume. = Basic definitions. Volume configuration. Brick's capacity. Partitioning. Fair distribution. Balancing = The basic configuration of a logical volume consists of the following information: 1) volume UUID; 2) number of bricks in the volume; 3) list of brick names or UUIDs in the volume; 4) UUID or name of the brick being added or removed (if any). That brick is not counted in (2) and (3). Item #4 exists to handle incomplete operations, interrupted for various reasons (system crash, hard reset, etc.), when bringing logical volumes on-line. The configuration of each volume should be stored somewhere (but not on that volume!) and properly updated before and after each volume operation performed on it. The user is responsible for this. The volume configuration is needed to deploy the volume. '''Capacity of a brick''' (or abstract capacity) is a positive integer. Capacity is a brick's property defined by the user; don't confuse it with the size of the block device. Think of it as the brick's "weight" in some units: the user decides which property of the brick to assign as its abstract capacity, and in which units. It can be, for example, the size of the block device in kilobytes or megabytes, its throughput in MB/sec, or any other geometric or physical parameter of the device associated with the brick. It is important that the capacities of all bricks of the same logical volume are measured in the same units.
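Since reiser4progs defines no file format for the volume configuration listed at the top of this section, here is a minimal sketch of keeping it in a plain file. The layout, field names, path, and UUID below are our own made-up convention, not part of any tool; the file must live outside the volume it describes:

```shell
# Our own ad-hoc format for the four configuration items listed above.
# Keep this file OUTSIDE the logical volume it describes.
CONF=/tmp/reiser5-vol.conf
VOL_ID="00000000-0000-0000-0000-000000000000"   # hypothetical volume UUID

cat > "$CONF" <<EOF
uuid=$VOL_ID
nr_bricks=2
bricks=/dev/vdb1,/dev/vdb2
# brick currently being added/removed, if any:
pending=
EOF
```

Update such a file before and after every volume operation, as described above.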
Also, it would be pointless to assign different properties as abstract capacities for bricks of the same LV - for example, block device size for one brick and disk bandwidth for another. The capacity of each brick is initialized by the mkfs utility. By default it is calculated as the number of free blocks on the device at the very end of the formatting procedure; for the meta-data brick it is 70% of that amount. The capacity of any brick can be changed on-line by the user. '''Capacity of a logical volume''' is defined as the sum of the capacities of its component bricks. '''Relative capacity of a brick''' is the ratio of the brick's capacity to the volume's capacity. Relative capacity defines the portion of IO requests that will be issued against that brick. The array of relative capacities (C1, C2, ...) of all bricks is called the volume partitioning. Obviously, C1 + C2 + ... = 1. '''Real data space usage''' on a brick is the number of data blocks stored on that brick. '''Ideal data space usage''' on a brick is defined as T*C, where T is the total number of data blocks stored in the volume and C is the relative capacity of the brick. It is recommended to compose volumes so that the space-based partitioning coincides with the throughput-based one - this is the optimal volume configuration, which provides true parallelism. If that is impossible for some reason, choose a preferred partitioning method (space-based or throughput-based). Note that space-based partitioning saves volume space, whereas throughput-based partitioning saves volume throughput. When performing regular file operations, Reiser5 distributes data stripes throughout the volume evenly and fairly: the portion of IO requests issued against each brick is equal to its relative capacity, that is, to the portion of capacity that the brick contributes to the total volume capacity. In contrast with regular file operations, volume operations break the fairness of data distribution on your logical volume.
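The notions of partitioning and ideal data space usage defined above can be illustrated numerically. A sketch in shell (the capacities 300 and 100 and the block count 1000 are made-up numbers, not taken from any real volume):

```shell
# Two data bricks with made-up abstract capacities; T is the (made-up)
# total number of data blocks stored in the volume.
C1=300; C2=100
T=1000

REL1=$(awk "BEGIN { printf \"%.2f\", $C1 / ($C1 + $C2) }")   # relative capacity of brick 1
REL2=$(awk "BEGIN { printf \"%.2f\", $C2 / ($C1 + $C2) }")   # relative capacity of brick 2
I1=$(awk "BEGIN { print int($REL1 * $T) }")                  # ideal data space usage, brick 1
I2=$(awk "BEGIN { print int($REL2 * $T) }")                  # ideal data space usage, brick 2
echo "partitioning: ($REL1, $REL2)  ideal usage: ($I1, $I2)"
```

Here the partitioning is (0.75, 0.25), so with fair distribution brick 1 should ideally hold 750 of the 1000 data blocks.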
To restore fairness of distribution, a special balancing procedure should be run on the volume. For example, after adding a brick to a logical volume, the balancing procedure will populate the new brick with data moved from other bricks. All volume operations except brick removal are fast, atomic, and leave the volume in an unbalanced state. The brick removal operation always includes balancing, which moves data from the brick you want to remove to the other bricks of the volume. If that data migration is interrupted for some reason, the volume is marked as a "volume with incomplete brick removal". It is allowed to perform regular file and volume operations on an unbalanced LV (assuming it is not marked with incomplete removal); however, in that case we don't guarantee a good quality of data distribution on your LV. In addition, on a volume with incomplete removal you won't be able to perform regular volume operations - first you will need to complete the removal by running a special removal completion procedure on your volume. = Prepare Software and Hardware = Build, install and boot a kernel with Reiser4 of software framework release number 5.X.Y. Kernel patches can be found [https://sourceforge.net/projects/reiser4/files/v5-unstable/ here]. Note that the Linux kernel and GNU utilities still recognize this testing code as "Reiser4". Make sure the following message appears in the kernel logs: "Loading Reiser4 (Software Framework Release: 5.X.Y)" Build and install the latest [https://sourceforge.net/projects/reiser4/files/reiser4-utils/libaal/ libaal]. Download, build and install the latest version 2.A.B of the [https://sourceforge.net/projects/reiser4/files/v5-unstable/ Reiser4progs package]. Make sure that the utility for managing logical volumes is installed (as part of the reiser4progs package) on your machine: # volume.reiser4 -? = Creating a logical volume = Start by choosing a unique ID (uuid) for your volume. By default it is generated by the mkfs utility.
However, you can generate it yourself with an appropriate tool (e.g. uuidgen(1)) and store it in an environment variable for convenience: # VOL_ID=`uuidgen` # echo "Using uuid $VOL_ID" Choose a stripe size for your logical volume. For a good quality of distribution it is recommended that the stripe not exceed 1/10000 of the volume size. On the other hand, too small a stripe will increase space consumption on your meta-data brick. In our example we choose a stripe size of 512K: # STRIPE=512K # echo "Using stripe size $STRIPE" Start by creating the first brick of your volume - the meta-data brick - passing the volume ID and stripe size to the mkfs.reiser4 utility: # mkfs.reiser4 -U $VOL_ID -t $STRIPE /dev/vdb1 Currently only one meta-data brick per volume is supported, so it is recommended that the block device for the meta-data brick not be too small. In most cases it is enough if your meta-data brick is not smaller than 1/200 of the maximal volume size. For example, a 100G meta-data brick will be able to service a ~20T logical volume. Data and meta-data bricks don't differ from the standpoint of disk format, and there is no special option to tell the mkfs utility that we want to create a meta-data brick: the first brick in the volume automatically becomes the meta-data brick, and the other bricks are interpreted as data bricks. Mount your initial logical volume consisting of one meta-data brick: # mount /dev/vdb1 /mnt Find the record about your volume in the output of the following command: # volume.reiser4 -l Create the configuration of your logical volume (its definition is above) and store it somewhere - but not on that volume! Your logical volume is now on-line and ready to use. You can perform regular file operations and volume operations (e.g. add a data brick to your LV). = Adding a data brick to LV = At any time you can add a data brick to your LV. You can do it in parallel with regular file operations executing on this volume. Make sure, however, that no other volume operation (e.g. removing a brick) is in progress on your volume, otherwise your operation will fail with EBUSY. Obviously, adding a brick will increase the capacity of your volume. Choose a block device for the new data brick. Make sure that it is neither too large nor too small: the capacities of any two bricks of the same logical volume cannot differ by more than a factor of 2^19 (524288). E.g. your logical volume cannot contain both 1M and 2T bricks. Any attempt to add a brick of improper capacity will fail with an error. Format it with the same volume ID and stripe size as you used for the meta-data brick, but also specify the "-a" option (to not restrict data capacity): # mkfs.reiser4 -U $VOL_ID -t $STRIPE -a /dev/vdb2 Important: the data brick must be formatted with the same volume ID and stripe size as the meta-data brick of your logical volume; otherwise, the operation of adding the data brick will fail. Update item #4 of your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration] with the UUID or name of the brick you want to add. To add the brick, simply pass its name as an argument to the "-a" option and specify your LV via its mount point: # volume.reiser4 -a /dev/vdb2 /mnt By default the operation of adding a brick is fast and atomic and leaves the volume in an unbalanced state, so after adding a brick you might want to run the balancing procedure, which will move a portion of data to the new brick from the other bricks of the logical volume, making data distribution on your volume fair: # volume.reiser4 -b /mnt The portion of data blocks moved during such rebalancing is equal to the relative capacity of the new brick, that is, to the portion of capacity that the new brick adds to the updated LV's capacity. This important property defines the cost of the balancing procedure.
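A back-of-the-envelope sketch of that cost (the capacities 900 and 100 are made-up numbers): the share of stripes moved by rebalancing equals the relative capacity of the new brick in the updated volume.

```shell
OLD_CAP=900        # made-up total capacity of the volume before the add
NEW_CAP=100        # made-up capacity of the brick being added

# Relative capacity of the new brick in the updated volume =
# share of all data stripes that rebalancing will move to it.
MOVED=$(awk "BEGIN { printf \"%.2f\", $NEW_CAP / ($OLD_CAP + $NEW_CAP) }")
echo "share of data moved by balancing: $MOVED"
```

So a brick contributing 10% of the updated capacity causes about 10% of the data stripes to migrate.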
If the portion of capacity added by the brick is small, then the number of stripes moved during balancing is also small. Specifying the option -B (--with-balance) will automatically trigger the balancing procedure after adding the brick: # volume.reiser4 -Ba /dev/vdb2 /mnt Upon successful completion, update your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]: increment (#2), add info about the new brick to (#3), and remove the record at (#4). = Removing a data brick from LV = At any time you can remove any data brick from your LV (assuming that your volume is not marked as a "volume with incomplete brick removal"). You can perform brick removal in parallel with regular file operations executing on that volume. Make sure, however, that no other volume operation (e.g. adding a brick) is in progress on your volume, otherwise your removal will fail with EBUSY. Obviously, the removal operation will decrease the abstract capacity of your LV. Note that the other bricks must have enough space to store all data blocks of the brick you want to remove; otherwise, the removal operation will return an error (ENOSPC). Suppose you want to remove brick /dev/vdb2 from your LV mounted at /mnt. Update your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration] with the UUID and name of the brick you want to remove (item #4). To remove the brick, simply pass its name as an argument to the "-r" option and specify the logical volume by its mount point: # volume.reiser4 -r /dev/vdb2 /mnt The procedure of brick removal starts by moving all data from the brick you want to remove to the other bricks of your volume, so that the resulting data distribution among the remaining bricks is also fair.
The portion of data stripes moved during such migration is equal to the relative capacity of the brick to be removed (that is, to the portion of capacity that the brick added to the LV's capacity). Successful brick removal always leaves the volume in a balanced state. So, in contrast with the operation of adding a brick, removing a brick is a rather long operation, which can be interrupted for various reasons. In that case the volume will be marked as a "volume with incomplete brick removal". To check the removal status of your LV, simply run # volume.reiser4 /mnt and check the field "health". To complete brick removal in the current mount session, simply run # volume.reiser4 -R /mnt Note that the option -R (--finish-removal) doesn't accept any arguments. On success, update your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]: remove the information about brick /dev/vdb2 at #3 and #4. Check your kernel logs: they should contain a message that brick /dev/vdb2 has been unregistered. Now device /dev/vdb2 no longer belongs to the logical volume, and you can reuse it for other purposes (re-format it, etc.). = Changing brick's capacity = At any time (assuming no other volume operation is in progress) you can change the abstract capacity of any brick to a new non-zero value. Changing capacity always changes the volume partitioning and therefore breaks fairness of distribution, so Reiser5 automatically launches rebalancing to make sure that the resulting distribution is fair for the new set of capacities. In particular, increasing a brick's capacity will move some data from other bricks to the brick whose capacity was increased; decreasing a brick's capacity will move some data from the brick whose capacity was decreased to the other bricks. To change the abstract capacity of brick /dev/vdb1 to a new value (e.g. 200000), simply run # volume.reiser4 -z /dev/vdb1 -c 200000 /mnt pronounced as "resize brick /dev/vdb1 to new capacity 200000 in the volume mounted at /mnt". The operation of changing capacity can return an error - most likely -ENOSPC, a side effect of concurrent regular file writes. In this case check the status of your LV. If it is unbalanced, consider removing some files from your LV and complete balancing by running # volume.reiser4 -b /mnt Otherwise, repeat the operation from scratch. Comment. Changing a brick's capacity to 0 is undefined and will return an error; consider the brick removal operation instead. = Operations with meta-data brick = The meta-data brick can also contain data stripes and participate in data distribution like the data bricks, so all the volume operations described above are also applicable to the meta-data brick. Note, however, that it is impossible to completely remove the meta-data brick from the logical volume for obvious reasons (meta-data needs to be stored somewhere), so the brick removal operation applied to the meta-data brick actually removes it from the Data Storage Array (DSA), not from the logical volume. The DSA is the subset of the LV consisting of the bricks participating in data distribution. Once you remove the meta-data brick from the DSA, that brick will be used only to store meta-data. The operation of adding a brick, applied to the meta-data brick, returns it to the DSA. Important: Reiser5 doesn't count busy data and meta-data blocks separately, so in contrast with data bricks (which contain only data) you cannot find out the real space occupied by data blocks on the meta-data brick - Reiser5 knows only the total space occupied. To check the status of the meta-data brick of the volume mounted at /mnt, simply run # volume.reiser4 -p0 /mnt and check the value of "in DSA".
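For scripting, the "in DSA" field can be picked out of the brick info with awk. The sample lines below are a mock-up modeled on the fields listed in the "LV monitoring" section, not output captured from a real volume; on a live system you would pipe `volume.reiser4 -p0 /mnt` into the same awk program instead of the here-document:

```shell
# Extract the "in DSA" field from (mocked-up) brick info.
in_dsa=$(awk -F': *' '$1 == "in DSA" { print $2 }' <<'EOF'
device name: /dev/vdb1
blocks used: 503
in DSA: yes
EOF
)
echo "meta-data brick participates in data distribution: $in_dsa"
```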
= Unmounting a logical volume = To terminate a mount session, just issue the usual umount against the mount point: # umount /mnt Note that after unmounting the volume, all bricks by default remain registered in the system until system shutdown. If you want to unregister a brick before system shutdown, simply issue the following command: # volume.reiser4 -u BRICK_NAME = Deploying a logical volume after correct unmount = After unmounting a logical volume, all its bricks remain registered in the system. So, if you want to mount the volume again, simply issue the mount command against one of its bricks. It is recommended to issue it against the meta-data brick. NOTE: Reiser5 will refuse to mount a logical volume when a wrong (incomplete or redundant) set of bricks is registered in the system. A redundant set of bricks appears, for example, when you mistakenly register a brick that was earlier removed from the logical volume. = Deploying a logical volume after correct shutdown = First of all, check the [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing configuration] of your volume and make sure that all its bricks (data and meta-data) are registered in the system. The list of registered bricks can be printed by # volume.reiser4 -l Also make sure that the set of bricks registered for the volume doesn't contain bricks not mentioned in the [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]. Important: Reiser5 will refuse to mount a logical volume when a wrong (incomplete or redundant) set of bricks is registered in the system. A redundant set of bricks appears, for example, when you mistakenly register a brick that was removed from the logical volume.
For this reason we strongly recommend that the user keep track of the LV - store its [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing configuration] somewhere, but not on the volume itself! And don't forget to update that configuration after '''every''' volume operation. If you lost the configuration of your LV and don't remember it (which is most likely for large volumes), it will be rather painful to restore: currently there are no tools to manage logical volumes off-line, so users have to do this on their own. It is not at all difficult. To register a brick in the system, use the following command: # volume.reiser4 -g BRICK_NAME To print a list of all registered bricks, use # volume.reiser4 -l Now mount your LV by simply issuing a mount(8) command against one of its bricks. We recommend issuing it against the meta-data brick. Comment. Reiser5 always tries to register the brick that is passed to the mount command as an argument, so it is not necessary to preregister the brick you want to issue the mount command against. = Deploying a logical volume after hard reset or system crash = If no volume operations were interrupted by the hard reset or system crash, just follow the instructions in this [https://reiser4.wiki.kernel.org/index.php?title=Logical_Volumes_Administration#Deploying_a_logical_volume_after_correct_shutdown section]. In Reiser5 only a restricted number of bricks participate in each transaction; the maximal number of such bricks can be specified by the user. At mount time a transaction replay procedure will be launched on each such brick independently, in parallel.
Depending on the kind of interrupted volume operation, perform one of the following actions: == Volume balancing was interrupted == Check your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]. Register the complete set of bricks and mount the volume. Check the balanced status of your LV by running # volume.reiser4 /mnt and checking the "balanced" field. If the volume is unbalanced, complete balancing by running # volume.reiser4 -b /mnt == Brick removal was interrupted == Check your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]. Register the new set of bricks (that is, the set of bricks without the brick you wanted to remove). Try to mount the volume. In case of error, also register the brick you wanted to remove and try to mount again. Check the status of your LV by running # volume.reiser4 /mnt and checking the value of "health". If required, complete brick removal by running # volume.reiser4 -R /mnt Note that the option -R doesn't accept any arguments. After successful removal completion, the brick will be automatically removed from the volume and unregistered. Verify this by checking the status of your LV and the list of registered bricks: # volume.reiser4 /mnt # volume.reiser4 -l Upon successful completion, update your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration] accordingly.
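The register-and-mount flow used throughout the deployment sections above can be collected into a small helper. This is a sketch under our own conventions: VOLUME_BRICKS (meta-data brick first) would come from your saved volume configuration, and the REISER4/MOUNT variables exist only so the commands can be substituted during testing; on a real system they stay `volume.reiser4` and `mount`:

```shell
REISER4=${REISER4:-volume.reiser4}
MOUNT=${MOUNT:-mount}
MNT=${MNT:-/mnt}
VOLUME_BRICKS="/dev/vdb1 /dev/vdb2"   # meta-data brick first (from your saved configuration)

deploy_volume() {
    # Register every brick of the volume...
    for brick in $VOLUME_BRICKS; do
        "$REISER4" -g "$brick" || return 1
    done
    # ...verify the registered set, then mount via the meta-data brick.
    "$REISER4" -l
    "$MOUNT" "${VOLUME_BRICKS%% *}" "$MNT"
}
```

Registering the bricks explicitly before mounting avoids the "incomplete set of bricks" mount refusal described above.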
= LV monitoring = Common info about the LV mounted at /mnt: # volume.reiser4 /mnt
ID: Volume UUID
volume: ID of the plugin managing the volume
distribution: ID of the distribution plugin
stripe: Stripe size in bytes
segments: Number of hash space segments (for distribution)
bricks total: Total number of bricks in the volume
bricks in DSA: Number of bricks participating in data distribution
balanced: Balanced status of the volume
health: Brick removal completion status
Info about any of its bricks, of index J: # volume.reiser4 -p J /mnt
internal ID: The brick's "internal ID" and its status in the volume
external ID: The brick's UUID
device name: Name of the block device associated with the brick
block count: Size of the block device in blocks
blocks used: Total number of occupied blocks on the device
system blocks: Minimal possible number of busy blocks on the device
data capacity: Abstract capacity of the brick
space usage: Portion of occupied blocks on the device
in DSA: Participation in regular data distribution
is proxy: Participation in data tiering (Burst Buffers, etc.)
Comment. When retrieving brick info, make sure that no volume operations are in progress on that volume; otherwise the command above will return an error (EBUSY). WARNING. Brick info obtained this way is not necessarily the most recent. To get up-to-date info, run sync(1) and make sure that no regular file operations are in progress. = Checking free space = To check the number of available free blocks on a volume mounted at /mnt, make sure that no regular file operations or volume operations are in progress on that volume, then run # sync # df --block-size=4K /mnt To check the number of free blocks on the brick of index J, run # volume.reiser4 -p J /mnt then calculate the difference between "block count" and "blocks used". Comment. Not all free blocks on a brick/volume are available for use.
The number of available free blocks is always ~95% of the total number of free blocks (Reiser4 reserves 5% to make sure that regular file truncate operations won't fail). NOTE: volume.reiser4 shows the total number of free blocks, whereas df(1) shows the number of available free blocks. The "space usage" statistic shows the portion of busy blocks on an individual brick. For the reasons explained above, "space usage" on any brick cannot exceed 0.95. = Checking quality of data distribution = Quality of data distribution is a measure of the deviation of real data space usage from the ideal usage defined by the volume partitioning. The smaller the deviation, the better the distribution quality. Checking quality of distribution makes sense only when your volume partitioning is space-based, or coincides with the space-based one. If your partitioning is throughput-based and doesn't coincide with the space-based one, then the quality of the actual data distribution can be rather bad: in this case the file system cares about low-performance devices not becoming a bottleneck, and effective space usage is not a high priority. Checking quality of data distribution is based on the free-blocks accounting provided by the file system. Note that the file system doesn't count busy data and meta-data blocks separately, so you cannot find the real data space usage - and hence cannot check the quality of distribution - when the meta-data brick contains data blocks. To check quality of distribution: * (1) make sure that the meta-data brick doesn't contain data blocks; * (2) make sure that no regular file or volume operations are currently in progress; * (3) find the "blocks used", "system blocks" and "data capacity" statistics for each data brick: # sync # volume.reiser4 -p 1 /mnt ... # volume.reiser4 -p N /mnt * (4) find the real data space usage on each brick; * (5) calculate the partitioning and the ideal data space usage on each data brick; * (6) find the deviation of (4) from (5). Example.
Let's build an LV of 3 bricks (one 10G meta-data brick vdb1, and two data bricks: vdc1 (10G) and vdd1 (5G)) with space-based partitioning: # VOL_ID=`uuid -v4` # echo "Using uuid $VOL_ID" # mkfs.reiser4 -U $VOL_ID -y -t 256K /dev/vdb1 # mkfs.reiser4 -U $VOL_ID -y -a -t 256K /dev/vdc1 # mkfs.reiser4 -U $VOL_ID -y -a -t 256K /dev/vdd1 # mount /dev/vdb1 /mnt Fill the meta-data brick with data: # dd if=/dev/zero of=/mnt/myfile bs=256K No space left on device... Add the data bricks /dev/vdc1 and /dev/vdd1 to the volume: # volume.reiser4 -a /dev/vdc1 /mnt # volume.reiser4 -a /dev/vdd1 /mnt Move all data blocks to the newly added bricks: # volume.reiser4 -r /dev/vdb1 /mnt # sync Now the meta-data brick doesn't contain data blocks (only meta-data ones), so we can calculate the quality of data distribution: # volume.reiser4 /mnt -p0 blocks used: 503 # volume.reiser4 /mnt -p1 blocks used: 1657203 system blocks: 115 data capacity: 2621069 # volume.reiser4 /mnt -p2 blocks used: 833001 system blocks: 73 data capacity: 1310391 Based on the statistics above, calculate the quality of distribution. Total data capacity of the volume: C = 2621069 + 1310391 = 3931460 Relative capacities of the data bricks: C1 = 2621069 / (2621069 + 1310391) = 0.6667 C2 = 1310391 / (2621069 + 1310391) = 0.3333 Real space usage on the data bricks (blocks used - system blocks): R1 = 1657203 - 115 = 1657088 R2 = 833001 - 73 = 832928 Space usage on the volume: R = R1 + R2 = 1657088 + 832928 = 2490016 Ideal data space usage on the data bricks: I1 = C1 * R = 0.6667 * 2490016 = 1660094 I2 = C2 * R = 0.3333 * 2490016 = 829922 Deviation: D = (R1, R2) - (I1, I2) = (-3006, 3006) Relative deviation: D/R = (-0.0012, 0.0012) Quality of distribution: Q = 1 - max(|D1|, |D2|)/R = 1 - 0.0012 = 0.9988 Comment. For any specified number of bricks N and quality of distribution Q, it is possible to find a configuration of a logical volume composed of N bricks such that the quality of distribution on that volume is better than Q. Comment.
Quality of distribution Q doesn't depend on the number of bricks in the logical volume. This is a theorem, which can be strictly proven. = FAQ = Q. What happens if I lose a device-component (due to a breakdown, etc) of my logical volume? A. Bodies of some your regular files will become "punched" in random places. Portion of such files depends on the relative capacity of the lost brick, on the number of bricks in the logical volume, and on other factors. Fsck will be able to detect and remove such files with corrupted bodies. Nevertheless, we recommend to consider mirroring your bricks (e.g. by software, or hardware RAID-1) to avoid such highly unpleasant situations. [[category:Reiser4]] 6a53699cfca908c3d6a33f9d5e07fc592b549b35 4417 4416 2020-11-11T22:41:46Z Edward 4 /* Operations with meta-data brick */ Before working with logical volumes you need to understand some basic [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Background principles]. Logical volume (LV) can be composed of any number of block devices, different in physical and geometric parameters. However the optimal configuration (true parallelism) imposes some restrictions and dependencies on the size of such devices. WARNING: The stuff is not stable. Don't put important data to logical volumes managed by software of release number 5.X.Y. Also don't mount your old partitions in kernels with Reiser4 of SFRN 5.X.Y before its stabilization IMPORTANT: Currently there is no tools to manage Reiser5 logical volumes off-line, so it it strongly recommended to save/update configurations of your LV in a file, which doesn't belong to that volume. = Basic definitions. Volume configuration. Brick's capacity. Partitioning. Fair distribution. Balancing = Basic configuration of a logical volume is the following information: 1) Volume UUID; 2) Number of bricks in the volume; 3) List of brick names or UUIDs in the volume; 4) UUID or name of the brick to be added/removed (if any). 
That brick is not counted in (2) and (3). The item #4 is to handle incomplete operations interrupted by various reasons (system crash, hard reset, etc) when bringing logical volumes on-line. For each volume its configuration should be stored somewhere (but not on that volume!) and properly updated before and after each volume operation performed on that volume. We make the user responsible for this. Volume configuration is needed to facilitate deploying a volume. '''Capacity of a brick''' (or abstract capacity) is a positive integer number. Capacity is a brick's property defined by user. Don't confuse it with the size of block device. Think of it as of brick's "weight" in some units. And this is the user, who decides, which property of the brick to assign as its abstract capacity and in which units. In particular, it can be size of the block device in kilobytes, or its size in megabytes, or its throughput in M/sec, or other geometric or physical parameter of the device, associated with the brick. It is important that capacities of all bricks of the same logical volume are measured in the same units. Also, it would be utterly pointless to assign different properties as abstract capacities for bricks of the same LV. For example, size of block device for one brick, and disk bandwidth for another one. Capacity of each brick gets initialized by mkfs utility. By default it is calculated as number of free blocks on the device at the very end of the formatting procedure. For meta-data brick it is calculated as 70% of such amount. Capacity of any brick can be changed on-line by user. '''Capacity of a logical volume''' is defined as a sum of capacities of its bricks-components. '''Relative capacity of a brick''' is the ratio of brick's capacity to volume's capacity. Relative capacity defines a portion of IO-requests that will be issued against that brick. Array of relative capacities (C1, C2, ...) of all bricks is called volume partitioning. Obviously, C1 + C2 + ... = 1. 
'''Real data space usage''' on a brick is number of data blocks, stored on that brick. '''Ideal data space usage''' on a brick is defined as T*C, where T is total number of data blocks stored in the volume. C is relative capacity of the brick. It is recommended to compose volumes in the way so that space-based partitioning coincides with throughput-based one - it would be the optimal volume configuration, which provides true parallelism. If it is impossible for some reason, then choose a preferred partitioning method (space-based, or throughput-based). Note that space-based partitioning saves volume space, whereas throughput based one saves volume throughput. When performing regular file operations, Reiser5 distributes data stripes throughout the volume evenly and fairly. It means that portion of IO-requests issued against each brick is equal to its relative capacity, that is, to the portion of capacity that the brick adds to the total volume's capacity. In contrast with regular file operations, volume operations break fairness of data distribution on your logical volume. To restore fairness of distribution, a special balancing procedure should be run on the volume. For example, after adding a brick to a logical volume, the balancing procedure will populate the new brick with data, moved from other bricks. All volume operations except brick removal are fast, atomic and leave the volume in unbalanced state. Operation of brick removal always includes balancing, which moves data from the brick you want to remove to other bricks of the volume. If that data migration is interrupted for some reason, then the volume is marked as a "volume with incomplete brick removal". It is allowed to perform regular file and volume operations on a not balanced LV (assuming, it was not incomplete removal). However, in this case we don't guarantee a good quality of data distribution on your LV. 
In addition, on a volume with incomplete removal you won't be able to perform regular volume operations: first you will need to complete the removal by running a special removal completion procedure on your volume.

= Prepare Software and Hardware =

Build, install and boot a kernel with Reiser4 of Software Framework Release Number 5.X.Y. Kernel patches can be found [https://sourceforge.net/projects/reiser4/files/v5-unstable/ here]. Note that the Linux kernel and GNU utilities still recognize the testing stuff as "Reiser4". Make sure the following message appears in the kernel logs:

 Loading Reiser4 (Software Framework Release: 5.X.Y)

Build and install the latest [https://sourceforge.net/projects/reiser4/files/reiser4-utils/libaal/ libaal]. Download, build and install the latest version 2.A.B of the [https://sourceforge.net/projects/reiser4/files/v5-unstable/ Reiser4progs package]. Make sure the utility for managing logical volumes is installed on your machine (as a part of reiser4progs):

 # volume.reiser4 -?

= Creating a logical volume =

Start by choosing a unique ID (UUID) for your volume. By default it is generated by the mkfs utility; however, you can generate it yourself with a proper tool (e.g. uuidgen(1)) and store it in an environment variable for convenience:

 # VOL_ID=`uuidgen`
 # echo "Using uuid $VOL_ID"

Choose a stripe size for your logical volume. For a good quality of distribution it is recommended that the stripe not exceed 1/10000 of the volume size. On the other hand, too small a stripe will increase space consumption on your meta-data brick. In our example we choose a stripe size of 512K:

 # STRIPE=512K
 # echo "Using stripe size $STRIPE"

Create the first brick of your volume, the meta-data brick, passing the volume ID and stripe size to the mkfs.reiser4 utility:

 # mkfs.reiser4 -U $VOL_ID -t $STRIPE /dev/vdb1

Currently only one meta-data brick per volume is supported, so it is recommended that the block device for the meta-data brick is not too small.
In most cases it is enough if your meta-data brick is not smaller than 1/200 of the maximal volume size. For example, a 100G meta-data brick will be able to service a ~20T logical volume. Data and meta-data bricks don't differ from the standpoint of disk format, and there is no special option telling the mkfs utility that we want to create a meta-data brick: the first brick of the volume automatically becomes the meta-data brick, and all other bricks are interpreted as data bricks.

Mount your initial logical volume, consisting of one meta-data brick:

 # mount /dev/vdb1 /mnt

Find a record about your volume in the output of the following command:

 # volume.reiser4 -l

Create the configuration of your logical volume (its definition is above) and store it somewhere, but not on that volume! Your logical volume is now on-line and ready to use. You can perform regular file operations and volume operations (e.g. add a data brick to your LV).

= Adding a data brick to LV =

At any time you can add a data brick to your LV. You can do it in parallel with regular file operations executing on this volume. Make sure, however, that no other volume operation (e.g. removing a brick) is in progress on the volume; otherwise your operation will fail with EBUSY. Obviously, adding a brick increases the capacity of your volume.

Choose a block device for the new data brick. Make sure it is neither too large nor too small: the capacities of any two bricks of the same logical volume cannot differ by more than 2^19 times. E.g. your logical volume cannot contain both a 1M and a 2T brick. Any attempt to add a brick of improper capacity will fail with an error.

Format it with the same volume ID and stripe size as you used for the meta-data brick, but also specify the "-a" option (to not restrict data capacity).
 # mkfs.reiser4 -U $VOL_ID -t $STRIPE -a /dev/vdb2

Important: the data brick must be formatted with the same volume ID and stripe size as the meta-data brick of your logical volume; otherwise the operation of adding the data brick will fail.

Update item #4 of your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration] with the UUID or name of the brick you want to add. To add the brick, simply pass its name as the argument of the option "-a" and specify your LV via its mount point:

 # volume.reiser4 -a /dev/vdb2 /mnt

By default the operation of adding a brick is fast and atomic and leaves the volume in an unbalanced state, so after adding a brick you might want to run the balancing procedure, which moves a portion of data from the other bricks of the logical volume to the new brick and thereby makes the data distribution on your volume fair:

 # volume.reiser4 -b /mnt

The portion of data blocks moved during such rebalancing is equal to the relative capacity of the new brick, that is, to the portion of capacity that the new brick adds to the updated capacity of the LV. This important property defines the cost of the balancing procedure: if the portion of capacity added by a brick is small, then the number of stripes moved during balancing is also small.

Specifying the option -B (--with-balance) will automatically trigger the balancing procedure after adding the brick:

 # volume.reiser4 -Ba /dev/vdb2 /mnt

Upon successful completion update your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]. That is, increment (#2), add info about the new brick to (#3) and remove the record at (#4).
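The cost property just described can be sketched numerically. This does not query the volume; the block counts and capacities are hypothetical:

```python
# Estimated cost of rebalancing after adding a brick: the portion of data
# moved equals the new brick's relative capacity in the *updated* volume,
# as stated above.

def blocks_moved_on_add(total_data_blocks, old_capacities, new_capacity):
    """Approximate number of data blocks migrated to the new brick."""
    updated_total = sum(old_capacities) + new_capacity
    rel = new_capacity / updated_total   # relative capacity of the new brick
    return round(total_data_blocks * rel)

# Adding a brick contributing 10% of the updated capacity moves ~10% of data:
print(blocks_moved_on_add(2_490_016, [2_621_069, 1_310_391], 393_146))
```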
= Removing a data brick from LV =

At any time you can remove any data brick from your LV (assuming the volume is not marked as a "volume with incomplete brick removal"). You can perform brick removal in parallel with regular file operations executing on that volume. Make sure, however, that no other volume operation (e.g. adding a brick) is in progress on the volume; otherwise your removal will fail with EBUSY. Obviously, the removal operation decreases the abstract capacity of your LV. Note that the other bricks must have enough space to store all the data blocks of the brick you want to remove; otherwise the removal operation will return an error (ENOSPC).

Suppose you want to remove brick /dev/vdb2 from your LV mounted at /mnt. Update your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration] with the UUID and name of the brick you want to remove (item #4). To remove the brick, simply pass its name as the argument of the option "-r" and specify the logical volume by its mount point:

 # volume.reiser4 -r /dev/vdb2 /mnt

The procedure of brick removal starts by moving all data from the brick being removed to the other bricks of your volume, so that the resulting data distribution among the remaining bricks is also fair. The portion of data stripes moved during such migration is equal to the relative capacity of the brick being removed (that is, to the portion of capacity that the brick added to the capacity of the LV). Successful brick removal always leaves the volume in a balanced state.

So, in contrast with the operation of adding a brick, removing a brick is a rather long operation, which can be interrupted for various reasons. In that case the volume is marked as a "volume with incomplete brick removal". To check the removal status of your LV simply run

 # volume.reiser4 /mnt

and check the field "health".
To complete brick removal in the current mount session simply run

 # volume.reiser4 -R /mnt

Note that the option -R (--finish-removal) doesn't accept any arguments. On success, update your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]: remove the information about the brick /dev/vdb2 at (#3) and (#4). Check your kernel logs: they should contain a message that brick /dev/vdb2 has been unregistered. The device /dev/vdb2 no longer belongs to the logical volume, and you can reuse it for other purposes (re-format it, etc).

= Changing brick's capacity =

At any time (assuming no other volume operation is in progress) you can change the abstract capacity of any brick to a new non-zero value. Changing capacity always changes the volume partitioning and therefore breaks the fairness of distribution, so Reiser5 automatically launches rebalancing to make sure the resulting distribution is fair for the new set of capacities. In particular, increasing a brick's capacity moves some data from the other bricks to the brick whose capacity was increased; decreasing a brick's capacity moves some data from the brick whose capacity was decreased to the other bricks.

To change the abstract capacity of the brick /dev/vdb1 to a new value (e.g. 200000), simply run

 # volume.reiser4 -z /dev/vdb1 -c 200000 /mnt

pronounced as "resize brick /dev/vdb1 to new capacity 200000 in the volume mounted at /mnt". The operation of changing capacity can return an error, most likely -ENOSPC, which is a side effect of concurrent regular file writes. In this case check the status of your LV. If it is unbalanced, consider removing some files from your LV and complete balancing by running

 # volume.reiser4 -b /mnt

Otherwise, repeat the operation from scratch.

Comment. Changing a brick's capacity to 0 is undefined and will return an error.
Consider the brick removal operation instead.

= Operations with meta-data brick =

The meta-data brick can also contain data stripes and participate in data distribution like the data bricks, so all the volume operations described above are applicable to the meta-data brick as well. Note, however, that it is impossible to completely remove the meta-data brick from the logical volume for obvious reasons (meta-data needs to be stored somewhere), so the brick removal operation applied to the meta-data brick actually removes it from the Data Storage Array (DSA), not from the logical volume. The DSA is the subset of the LV consisting of the bricks participating in data distribution. Once you remove the meta-data brick from the DSA, that brick will be used only to store meta-data. The operation of adding a brick, applied to the meta-data brick, returns it to the DSA.

Important: Reiser5 doesn't count busy data and meta-data blocks separately. So, in contrast with data bricks (which contain only data), you cannot find out the real space occupied by data blocks on the meta-data brick; Reiser5 knows only the total space occupied. To check the status of the meta-data brick of the volume mounted at /mnt simply run

 # volume.reiser4 -p0 /mnt

and check the value of "in DSA".

= Unmounting a logical volume =

To terminate a mount session just issue the usual umount against the mount point:

 # umount /mnt

Note that after unmounting the volume all its bricks by default remain registered in the system until system shutdown. If you want to unregister a brick before system shutdown, simply issue the following command:

 # volume.reiser4 -u BRICK_NAME

= Deploying a logical volume after correct unmount =

Make sure (by checking your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]) that all bricks of the volume are registered in the system.
To register a brick issue the following command:

 # volume.reiser4 -g BRICK_NAME

The list of all volumes and bricks registered in the system can be found in the output of the following command:

 # volume.reiser4 -l

Issue the usual mount(8) command against one of the bricks of your volume; it is recommended to issue it against the meta-data brick.

NOTE: Reiser5 will refuse to mount a logical volume when a wrong (incomplete or redundant) set of bricks is registered in the system. A redundant set of bricks appears, for example, when you mistakenly register a brick that was earlier removed from the logical volume.

= Deploying a logical volume after correct shutdown =

First of all, check the [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing configuration] of your volume and make sure that all its bricks (data and meta-data ones) are registered in the system. The list of registered bricks can be printed by

 # volume.reiser4 -l

Also make sure that the registered set of bricks of the volume doesn't contain bricks not mentioned in the [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration].

Important: Reiser5 will refuse to mount a logical volume when a wrong (incomplete or redundant) set of bricks is registered in the system. A redundant set of bricks appears, for example, when you mistakenly register a brick that was removed from the logical volume. For this reason we strongly recommend keeping track of your LV: store its [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing configuration] somewhere, but not on that volume!
And don't forget to update that configuration after '''every''' volume operation. If you lose the configuration of your LV and don't remember it (which is most likely for large volumes), it will be rather painful to restore: currently there are no tools to manage logical volumes off-line, so users are expected to do this on their own. It is not at all difficult. To register a brick in the system use the following command:

 # volume.reiser4 -g BRICK_NAME

To print a list of all registered bricks use

 # volume.reiser4 -l

Now mount your LV, simply issuing a mount(8) command against one of its bricks. We recommend issuing it against the meta-data brick.

Comment. Reiser5 always tries to register the brick which is passed to the mount command as an argument, so it is not necessary to pre-register the brick you issue the mount command against.

= Deploying a logical volume after hard reset or system crash =

If no volume operations were interrupted by the hard reset or system crash, just follow the instructions in this [https://reiser4.wiki.kernel.org/index.php?title=Logical_Volumes_Administration#Deploying_a_logical_volume_after_correct_shutdown section]. In Reiser5 only a restricted number of bricks participate in every transaction; the maximal number of such bricks can be specified by the user. At mount time a transaction replay procedure is launched on each such brick independently, in parallel. Depending on the kind of interrupted volume operation, perform one of the following actions:

== Volume balancing was interrupted ==

Check your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]. Register the complete set of bricks and mount the volume. Check the balanced status of your LV by running

 # volume.reiser4 /mnt

and checking the "balanced" field.
If the volume is unbalanced, complete balancing by running

 # volume.reiser4 -b /mnt

== Brick removal was interrupted ==

Check your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]. Register the new set of bricks (that is, the set without the brick you wanted to remove). Try to mount the volume. In case of error, register also the brick you wanted to remove and try to mount again. Check the status of your LV by running

 # volume.reiser4 /mnt

and checking the value of "health". If required, complete the brick removal by running

 # volume.reiser4 -R /mnt

Note that the option -R doesn't accept any arguments. After successful removal completion the brick is automatically removed from the volume and unregistered. Make sure of it by checking the status of your LV and the list of registered bricks:

 # volume.reiser4 /mnt
 # volume.reiser4 -l

Upon successful completion update your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration] accordingly.
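The recovery steps in the two subsections above boil down to a small decision rule. A minimal sketch follows; the concrete string values of the "health" field are assumptions for illustration, not taken from real volume.reiser4 output, and nothing here invokes the utility itself:

```python
# Post-crash recovery decision rule, summarizing the two subsections above:
# an incomplete removal must be finished first (-R); otherwise an
# unbalanced volume just needs balancing (-b).

def next_step(balanced, health):
    """Return the command to run next, or None if the volume is fine.
    'health' values here are hypothetical placeholders."""
    if health == "incomplete removal":
        return "volume.reiser4 -R /mnt"   # complete the interrupted removal
    if not balanced:
        return "volume.reiser4 -b /mnt"   # complete balancing
    return None                           # nothing to do

print(next_step(balanced=False, health="fine"))
```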
= LV monitoring =

Common info about the LV mounted at /mnt:

 # volume.reiser4 /mnt

 ID:             Volume UUID
 volume:         ID of the plugin managing the volume
 distribution:   ID of the distribution plugin
 stripe:         Stripe size in bytes
 segments:       Number of hash space segments (for distribution)
 bricks total:   Total number of bricks in the volume
 bricks in DSA:  Number of bricks participating in data distribution
 balanced:       Balanced status of the volume
 health:         Brick removal completion status

Info about its brick of index J:

 # volume.reiser4 -p J /mnt

 internal ID:    Brick's "internal ID" and its status in the volume
 external ID:    Brick's UUID
 device name:    Name of the block device associated with the brick
 block count:    Size of the block device in blocks
 blocks used:    Total number of occupied blocks on the device
 system blocks:  Minimal possible number of busy blocks on the device
 data capacity:  Abstract capacity of the brick
 space usage:    Portion of occupied blocks on the device
 in DSA:         Participation in regular data distribution
 is proxy:       Participation in data tiering (Burst Buffers, etc)

Comment. When retrieving a brick's info, make sure that no volume operations are in progress on that volume; otherwise the command above will return an error (EBUSY).

WARNING. Brick info provided this way is not necessarily the most recent. To get up-to-date info, run sync(1) and make sure that no regular file operations are in progress.

= Checking free space =

To check the number of available free blocks on a volume mounted at /mnt, make sure that no regular file operations or volume operations are in progress on that volume, then run

 # sync
 # df --block-size=4K /mnt

To check the number of free blocks on the brick of index J, run

 # volume.reiser4 -p J /mnt

and calculate the difference between "block count" and "blocks used".

Comment. Not all free blocks on a brick/volume are available for use.
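The free-space arithmetic of this section can be sketched as follows. The statistics are hypothetical sample values; the 5% reservation is the figure this section gives for truncate safety:

```python
# Free vs. available blocks on a brick: free blocks are "block count"
# minus "blocks used", and only ~95% of them are available, since
# Reiser4 reserves 5% so that file truncate operations won't fail.

def free_blocks(block_count, blocks_used):
    return block_count - blocks_used

def available_blocks(block_count, blocks_used, reserve=0.05):
    return int(free_blocks(block_count, blocks_used) * (1 - reserve))

stats = {"block count": 2621440, "blocks used": 1657203}   # sample numbers
print(free_blocks(stats["block count"], stats["blocks used"]))       # total free
print(available_blocks(stats["block count"], stats["blocks used"]))  # ~95% of free
```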
The number of available free blocks is always ~95% of the total number of free blocks (Reiser4 reserves 5% to make sure that regular file truncate operations won't fail). NOTE: volume.reiser4 shows the total number of free blocks, whereas df(1) shows the number of available free blocks. The "space usage" statistic shows the portion of busy blocks on an individual brick. For the reasons explained above, "space usage" on any brick cannot be more than 0.95.

= Checking quality of data distribution =

Quality of data distribution is a measure of the deviation of the real data space usage from the ideal one defined by the volume partitioning. The smaller the deviation, the better the distribution quality. Checking the quality of distribution makes sense only when your volume partitioning is space-based, or coincides with the space-based one. If your partitioning is throughput-based and doesn't coincide with the space-based one, the quality of the actual data distribution can be rather bad: in that case the file system takes care that low-performance devices don't become a bottleneck, and effective space usage is not a high priority.

Checking the quality of data distribution is based on the free-block accounting provided by the file system. Note that the file system doesn't count busy data and meta-data blocks separately, so you cannot find the real data space usage, and hence cannot check the quality of distribution, when the meta-data brick contains data blocks.

To check the quality of distribution:
* make sure that the meta-data brick doesn't contain data blocks;
* make sure that no regular file or volume operations are currently in progress;
* find the "blocks used", "system blocks" and "data capacity" statistics for each data brick:

 # sync
 # volume.reiser4 -p 1 /mnt
 ...
 # volume.reiser4 -p N /mnt

* find the real data space usage on each brick;
* calculate the partitioning and the ideal data space usage on each data brick;
* find the deviation of the real data space usage from the ideal one.

Example.
Let's build an LV of 3 bricks (one 10G meta-data brick vdb1, and two data bricks: vdc1 (10G) and vdd1 (5G)) with space-based partitioning:

 # VOL_ID=`uuid -v4`
 # echo "Using uuid $VOL_ID"
 # mkfs.reiser4 -U $VOL_ID -y -t 256K /dev/vdb1
 # mkfs.reiser4 -U $VOL_ID -y -a -t 256K /dev/vdc1
 # mkfs.reiser4 -U $VOL_ID -y -a -t 256K /dev/vdd1
 # mount /dev/vdb1 /mnt

Fill the meta-data brick with data:

 # dd if=/dev/zero of=/mnt/myfile bs=256K
 No space left on device...

Add data bricks /dev/vdc1 and /dev/vdd1 to the volume:

 # volume.reiser4 -a /dev/vdc1 /mnt
 # volume.reiser4 -a /dev/vdd1 /mnt

Move all data blocks to the newly added bricks:

 # volume.reiser4 -r /dev/vdb1 /mnt
 # sync

Now the meta-data brick doesn't contain data blocks (only meta-data ones), so we can calculate the quality of data distribution:

 # volume.reiser4 /mnt -p0
 blocks used: 503
 # volume.reiser4 /mnt -p1
 blocks used: 1657203
 system blocks: 115
 data capacity: 2621069
 # volume.reiser4 /mnt -p2
 blocks used: 833001
 system blocks: 73
 data capacity: 1310391

Based on the statistics above, calculate the quality of distribution.

Total data capacity of the volume:
 C = 2621069 + 1310391 = 3931460
Relative capacities of the data bricks:
 C1 = 2621069 / 3931460 = 0.6667
 C2 = 1310391 / 3931460 = 0.3333
Real space usage on the data bricks (blocks used - system blocks):
 R1 = 1657203 - 115 = 1657088
 R2 = 833001 - 73 = 832928
Space usage on the volume:
 R = R1 + R2 = 1657088 + 832928 = 2490016
Ideal data space usage on the data bricks:
 I1 = C1 * R = 0.6667 * 2490016 = 1660094
 I2 = C2 * R = 0.3333 * 2490016 = 829922
Deviation:
 D = (R1, R2) - (I1, I2) = (-3006, 3006)
Relative deviation:
 D/R = (-0.0012, 0.0012)
Quality of distribution:
 Q = 1 - max(|D1|/R, |D2|/R) = 1 - 0.0012 = 0.9988

Comment. For any specified number of bricks N and quality of distribution Q, it is possible to find a configuration of a logical volume composed of N bricks such that the quality of distribution on that volume is better than Q.
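The arithmetic of the example above can be redone in a few lines. This sketch only recomputes the numbers from the printed statistics; it does not query volume.reiser4:

```python
# Quality of data distribution, recomputed from the brick statistics of
# the example above: the deviation of real data space usage from the
# ideal usage defined by the volume partitioning.

def distribution_quality(bricks):
    """bricks: list of (blocks_used, system_blocks, data_capacity) tuples."""
    real = [used - system for used, system, _ in bricks]      # R1, R2, ...
    total_real = sum(real)                                    # R
    total_cap = sum(cap for _, _, cap in bricks)              # C
    rel = [cap / total_cap for _, _, cap in bricks]           # partitioning
    ideal = [c * total_real for c in rel]                     # I1, I2, ...
    rel_dev = [(r - i) / total_real for r, i in zip(real, ideal)]
    return 1 - max(abs(d) for d in rel_dev)

bricks = [(1657203, 115, 2621069),   # data brick of index 1
          (833001,   73, 1310391)]   # data brick of index 2
print(round(distribution_quality(bricks), 4))  # 0.9988
```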
Comment. Quality of distribution Q doesn't depend on the number of bricks in the logical volume. This is a theorem, which can be strictly proven.

= FAQ =

Q. What happens if I lose a device component of my logical volume (due to a breakdown, etc)?

A. The bodies of some of your regular files will become "punched" in random places. The portion of such files depends on the relative capacity of the lost brick, on the number of bricks in the logical volume, and on other factors. Fsck will be able to detect and remove files with corrupted bodies. Nevertheless, we recommend considering mirroring your bricks (e.g. by software or hardware RAID-1) to avoid such highly unpleasant situations.

[[category:Reiser4]]
Balancing = Basic configuration of a logical volume is the following information: 1) Volume UUID; 2) Number of bricks in the volume; 3) List of brick names or UUIDs in the volume; 4) UUID or name of the brick to be added/removed (if any). That brick is not counted in (2) and (3). The item #4 is to handle incomplete operations interrupted by various reasons (system crash, hard reset, etc) when bringing logical volumes on-line. For each volume its configuration should be stored somewhere (but not on that volume!) and properly updated before and after each volume operation performed on that volume. We make the user responsible for this. Volume configuration is needed to facilitate deploying a volume. '''Capacity of a brick''' (or abstract capacity) is a positive integer number. Capacity is a brick's property defined by user. Don't confuse it with the size of block device. Think of it as of brick's "weight" in some units. And this is the user, who decides, which property of the brick to assign as its abstract capacity and in which units. In particular, it can be size of the block device in kilobytes, or its size in megabytes, or its throughput in M/sec, or other geometric or physical parameter of the device, associated with the brick. It is important that capacities of all bricks of the same logical volume are measured in the same units. Also, it would be utterly pointless to assign different properties as abstract capacities for bricks of the same LV. For example, size of block device for one brick, and disk bandwidth for another one. Capacity of each brick gets initialized by mkfs utility. By default it is calculated as number of free blocks on the device at the very end of the formatting procedure. For meta-data brick it is calculated as 70% of such amount. Capacity of any brick can be changed on-line by user. '''Capacity of a logical volume''' is defined as a sum of capacities of its bricks-components. 
'''Relative capacity of a brick''' is the ratio of brick's capacity to volume's capacity. Relative capacity defines a portion of IO-requests that will be issued against that brick. Array of relative capacities (C1, C2, ...) of all bricks is called volume partitioning. Obviously, C1 + C2 + ... = 1. '''Real data space usage''' on a brick is number of data blocks, stored on that brick. '''Ideal data space usage''' on a brick is defined as T*C, where T is total number of data blocks stored in the volume. C is relative capacity of the brick. It is recommended to compose volumes in the way so that space-based partitioning coincides with throughput-based one - it would be the optimal volume configuration, which provides true parallelism. If it is impossible for some reason, then choose a preferred partitioning method (space-based, or throughput-based). Note that space-based partitioning saves volume space, whereas throughput based one saves volume throughput. When performing regular file operations, Reiser5 distributes data stripes throughout the volume evenly and fairly. It means that portion of IO-requests issued against each brick is equal to its relative capacity, that is, to the portion of capacity that the brick adds to the total volume's capacity. In contrast with regular file operations, volume operations break fairness of data distribution on your logical volume. To restore fairness of distribution, a special balancing procedure should be run on the volume. For example, after adding a brick to a logical volume, the balancing procedure will populate the new brick with data, moved from other bricks. All volume operations except brick removal are fast, atomic and leave the volume in unbalanced state. Operation of brick removal always includes balancing, which moves data from the brick you want to remove to other bricks of the volume. If that data migration is interrupted for some reason, then the volume is marked as a "volume with incomplete brick removal". 
It is allowed to perform regular file and volume operations on a not balanced LV (assuming, it was not incomplete removal). However, in this case we don't guarantee a good quality of data distribution on your LV. In addition, on a volume with incomplete removal you won't be able to perform regular volume operations - first you will need to complete the removal by running a special removal completion procedure on your volume. = Prepare Software and Hardware = Build, install and boot kernel with Reiser4 of software framework release number 5.X.Y. Kernel patches can be found [https://sourceforge.net/projects/reiser4/files/v5-unstable/ here]. Note that by Linux kernel and GNU utilities the testing stuff is still recognized as "Reiser4". Make sure there is the following message in kernel logs: "Loading Reiser4 (Software Framework Release: 5.X.Y)" Build and install the latest [https://sourceforge.net/projects/reiser4/files/reiser4-utils/libaal/ libaal] Download, build and install the latest version 2.A.B of [https://sourceforge.net/projects/reiser4/files/v5-unstable/ Reiser4progs package]. Make sure that utility for managing logical volumes is installed (as a part of reiser4progs package) on your machine: # volume.reiser4 -? = Creating a logical volume = Start from choosing a unique ID (uuid) of your volume. By default it is generated by mkfs utility. However, user can generate it himself by proper tools (e.g. uuid(1)) and store in an environment variable for convenience: # VOL_ID=`uuidgen` # echo "Using uuid $VOL_ID" Choose a stripe size for your logical volume. For a good quality of distribution it is recommended that stripe doesn't exceed 1/10000 of volume size. On the other hand, too small stripes will increase space consumption on your meta-data brick. 
In our example we choose stripe size 512K: # STRIPE=512K # echo "Using stripe size $STRIPE" Start from creating the first brick of your volume - meta-data brick, passing volume-ID and stripe size to mkfs.reiser4 utility: # mkfs.reiser4 -U $VOL_ID -t $STRIPE /dev/vdb1 Currently only one meta-data brick per volume is supported, so it is recommended that size of block device for meta-data brick in not too small. In most cases it will be enough, if your meta-data brick is not smaller than 1/200 of maximal volume size. For example, 100G meta-data brick will be able to service ~20T logical volume. Data and meta-data bricks don't differ from the standpoint of disk format, and there is no special option to inform mkfs utility that we want to create exactly meta-data brick: the first brick in the volume automatically becomes a meta-data brick, and other bricks are interpreted as data bricks. Mount your initial logical volume consisting of one meta-data brick: # mount /dev/vdb1 /mnt Find a record about your volume in the output of the following command: # volume.reiser4 -l Create configuration of your logical volume (its definition is above) and store it somewhere, but not on that volume! Your logical volume is now on-line and ready to use. You can perform regular file operations and volume operations (e.g. add a data brick to your LV). = Adding a data brick to LV = At any time you are able to add a data brick to your LV. You can do it in parallel with regular file operations executing on this volume. Make sure, however, that there is no other volume operations (e.g. removing a brick) over your volume in progress, otherwise your operation will fail with EBUSY. Obviously, adding a brick will increase capacity of your volume. Choose a block device for the new data brick. Make sure that it is not too large, or too small. Capacities of any 2 bricks of the same logical volume can not differ more than 2^19 (~1 million) times. E.g. 
your logical volume cannot contain both 1M and 2T bricks. Any attempt to add a brick of improper capacity will fail with an error. Format the device with the same volume ID and stripe size as you used for the meta-data brick, but also specify the "-a" option (to not restrict data capacity): # mkfs.reiser4 -U $VOL_ID -t $STRIPE -a /dev/vdb2 Important: the data brick must be formatted with the same volume ID and stripe size as the meta-data brick of your logical volume. Otherwise, the operation of adding the data brick will fail. Update item #4 of your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration] with the UUID or name of the brick you want to add. To add the brick, simply pass its name as an argument to the option "-a" and specify your LV via its mount point: # volume.reiser4 -a /dev/vdb2 /mnt By default the operation of adding a brick is fast and atomic and leaves the volume in an unbalanced state, so after adding a brick you might want to run the balancing procedure, which will move a portion of data to the new brick from the other bricks of the logical volume, making data distribution on your volume fair: # volume.reiser4 -b /mnt The portion of data blocks moved during such rebalancing is equal to the relative capacity of the new brick, that is, to the portion of capacity that the new brick adds to the updated LV's capacity. This important property defines the cost of the balancing procedure: if the portion of capacity added by a brick is small, then the number of stripes moved during balancing is also small.
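To illustrate the cost with hypothetical numbers (none of these figures appear on this page): adding a brick of 1000 capacity units to a volume whose existing bricks total 4000 units gives the new brick a relative capacity of 1000/5000 = 0.2, so about 20% of the stripes migrate during balancing:

```shell
# Hypothetical capacities: the share of stripes moved by balancing equals
# the new brick's relative capacity within the updated volume.
NEW_CAP=1000       # capacity of the brick being added (assumed units)
OLD_TOTAL=4000     # total capacity of the existing bricks
awk -v n="$NEW_CAP" -v t="$OLD_TOTAL" \
    'BEGIN { printf "%.0f%% of stripes migrate\n", 100 * n / (t + n) }'
```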
Specifying the option -B (--with-balance) will automatically trigger the balancing procedure after adding the brick: # volume.reiser4 -Ba /dev/vdb2 /mnt Upon successful completion update your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]. That is, increment (#2), add info about the new brick to (#3) and remove the record at (#4). = Removing a data brick from LV = At any time you are able to remove any data brick from your LV (assuming that your volume is not marked as a "volume with incomplete brick removal"). You can perform brick removal in parallel with regular file operations executing on that volume. Make sure, however, that no other volume operation (e.g. adding a brick) is in progress on your volume, otherwise the removal will fail with EBUSY. Obviously, the removal operation will decrease the abstract capacity of your LV. Note that the other bricks should have enough space to store all data blocks of the brick you want to remove; otherwise the removal operation will return an error (ENOSPC). Suppose you want to remove brick /dev/vdb2 from your LV mounted at /mnt. Update your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration] with the UUID and name of the brick you want to remove (item #4). To remove the brick, simply pass its name as an argument to option "-r" and specify the logical volume by its mount point: # volume.reiser4 -r /dev/vdb2 /mnt The procedure of brick removal starts by moving all data from the brick you want to remove to the other bricks of your volume, so that the resulting data distribution among the remaining bricks is also fair.
The portion of data stripes moved during such migration is equal to the relative capacity of the brick to be removed (that is, to the portion of capacity that the brick added to the LV's capacity). Successful brick removal always leaves the volume in a balanced state. So, in contrast with the operation of adding a brick, removing a brick is a rather long operation, which can be interrupted for various reasons. In this case the volume will be marked as a "volume with incomplete brick removal". To check the removal status of your LV simply run # volume.reiser4 /mnt and check the field "health". To complete brick removal in the current mount session simply run # volume.reiser4 -R /mnt Note that the option -R (--finish-removal) doesn't accept any arguments. On success update your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]: remove the information about the brick /dev/vdb2 at #3 and #4. Check your kernel logs: they should contain a message that brick /dev/vdb2 has been unregistered. Now the device /dev/vdb2 doesn't belong to the logical volume any more, and you can reuse it for other purposes (re-format, etc). = Changing brick's capacity = At any time (assuming that no other volume operation is in progress) you can change the abstract capacity of any brick to a new non-zero value. Changing capacity always changes the volume partitioning, and therefore breaks fairness of distribution, so Reiser5 automatically launches rebalancing to make sure that the resulting distribution is fair for the new set of capacities. In particular, increasing a brick's capacity will move some data from other bricks to the brick whose capacity was increased. Decreasing a brick's capacity will move some data from the brick whose capacity was decreased to other bricks. To change the abstract capacity of a brick /dev/vdb1 to a new value (e.g.
200000), simply run # volume.reiser4 -z /dev/vdb1 -c 200000 /mnt This is pronounced as "resize brick /dev/vdb1 to new capacity 200000 in the volume mounted at /mnt". The operation of changing capacity can return an error. Most likely it is -ENOSPC, which is a side effect of concurrent regular file writes. In this case check the status of your LV. If it is unbalanced, then consider removing some files from your LV and complete balancing by running # volume.reiser4 -b /mnt Otherwise, repeat the operation from scratch. Comment. Changing a brick's capacity to 0 is undefined and will return an error. Consider the brick removal operation instead. = Operations with meta-data brick = The meta-data brick can also contain data stripes and participate in data distribution like the data bricks, so all the volume operations described above are also applicable to the meta-data brick. Note, however, that it is impossible to completely remove the meta-data brick from the logical volume for obvious reasons (meta-data need to be stored somewhere), so the brick removal operation applied to the meta-data brick actually removes it from the Data Storage Array (DSA), not from the logical volume. The DSA is the subset of the LV consisting of the bricks participating in data distribution. Once you remove the meta-data brick from the DSA, that brick will be used only to store meta-data. The operation of adding a brick, applied to the meta-data brick, returns it back to the DSA. Important: Reiser5 doesn't count busy data and meta-data blocks separately. So in contrast with data bricks (which contain only data) you are not able to find out the real space occupied by data blocks on the meta-data brick - Reiser5 knows only the total space occupied. To check the status of the meta-data brick simply run # volume.reiser4 /mnt and compare the values of "bricks total" and "bricks in DSA". If they are equal, then the meta-data brick participates in data distribution.
Otherwise, "bricks total" will be 1 more than "bricks in DSA" - this indicates that the meta-data brick doesn't participate in data distribution (and therefore doesn't contain data blocks). Note that other cases are impossible: for data bricks, participation in the LV and in the DSA are always equivalent. = Unmounting a logical volume = To terminate a mount session just issue the usual umount against the mount point: # umount /mnt Note that after unmounting the volume all bricks by default remain registered in the system until system shutdown. If you want to unregister a brick before system shutdown, then simply issue the following command: # volume.reiser4 -u BRICK_NAME = Deploying a logical volume after correct unmount = Make sure (by checking your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]) that all bricks of the volume are registered in the system. To register a brick issue the following command: # volume.reiser4 -g BRICK_NAME The list of all volumes and bricks registered in the system can be found in the output of the following command: # volume.reiser4 -l Issue the usual mount(8) command against one of the bricks of your volume. It is recommended to issue it against the meta-data brick. NOTE: Reiser5 will refuse to mount a logical volume in the case when a wrong (incomplete or redundant) set of bricks is registered in the system. A redundant set of bricks appears, for example, when you mistakenly register a brick that was earlier removed from the logical volume. = Deploying a logical volume after correct shutdown = First of all, check the [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing configuration] of your volume and make sure that all its bricks (data and meta-data ones) are registered in the system.
The list of registered bricks can be printed by # volume.reiser4 -l Also make sure that the set of registered per-volume bricks doesn't contain bricks not mentioned in the [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]. Important: Reiser5 will refuse to mount a logical volume in the case when a wrong (incomplete or redundant) set of bricks is registered in the system. A redundant set of bricks appears, for example, when you mistakenly register a brick that was removed from the logical volume. For these reasons we strongly recommend that the user keep track of his LV - store its [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing configuration] somewhere, but not on that volume! And don't forget to update that configuration after '''every''' volume operation. If you have lost the configuration of your LV and don't remember it (which is most likely for large volumes), then it will be rather painful to restore: currently there are no tools to manage logical volumes off-line, so users are prompted to do this on their own. It is not at all difficult. To register a brick in the system use the following command: # volume.reiser4 -g BRICK_NAME To print a list of all registered bricks use # volume.reiser4 -l Now mount your LV, simply issuing a mount(8) command against one of the bricks of your LV. We recommend issuing it against the meta-data brick. Comment. Reiser5 always tries to register the brick which is passed to the mount command as an argument, so it is not necessary to preregister the brick you want to issue the mount command against.
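The register-then-mount sequence can be sketched as a dry-run script. The configuration file name, its one-brick-per-line layout, and the device names are all assumptions for illustration; the script only echoes the commands it would run:

```shell
# Dry-run sketch: register every brick listed in a saved configuration
# file (meta-data brick on the first line), then mount via that brick.
# lv.conf and the /dev/vdb* names are hypothetical stand-ins.
CONF=lv.conf
printf '%s\n' /dev/vdb1 /dev/vdb2 > "$CONF"   # stand-in configuration

META=$(head -n 1 "$CONF")
while read -r brick; do
    echo volume.reiser4 -g "$brick"           # would register the brick
done < "$CONF"
echo mount "$META" /mnt                       # would mount the volume
rm -f "$CONF"
```

Dropping the `echo` prefixes turns the dry run into the real sequence.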
= Deploying a logical volume after hard reset or system crash = If no volume operations were interrupted by the hard reset or system crash, then just follow the instructions in this [https://reiser4.wiki.kernel.org/index.php?title=Logical_Volumes_Administration#Deploying_a_logical_volume_after_correct_shutdown section]. In Reiser5 only a restricted number of bricks participate in every transaction. The maximal number of such bricks can be specified by the user. At mount time a transaction replay procedure will be launched on each such brick independently, in parallel. Depending on the kind of interrupted volume operation, perform one of the following actions: == Volume balancing was interrupted == Check your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]. Register the complete set of bricks and mount the volume. Check the balanced status of your LV by running # volume.reiser4 /mnt and checking the "balanced" field. If the volume is unbalanced, then complete balancing by running # volume.reiser4 -b /mnt == Brick removal was interrupted == Check your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]. Register the new set of bricks (that is, the set of bricks without the brick you wanted to remove). Try to mount the volume. In the case of error, register also the brick you wanted to remove and try to mount again. Check the status of your LV by running # volume.reiser4 /mnt and checking the value of "health". If required, complete the brick removal by running # volume.reiser4 -R /mnt Note that the option -R doesn't accept any arguments. After successful removal completion the brick will be automatically removed from the volume and unregistered.
Make sure of it by checking the status of your LV and the list of registered bricks: # volume.reiser4 /mnt # volume.reiser4 -l Upon successful completion update your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration] accordingly. = LV monitoring = Common info about the LV mounted at /mnt: # volume.reiser4 /mnt ID: Volume UUID volume: ID of the plugin managing the volume distribution: ID of the distribution plugin stripe: Stripe size in bytes segments: Number of hash space segments (for distribution) bricks total: Total number of bricks in the volume bricks in DSA: Number of bricks participating in data distribution balanced: Balanced status of the volume health: Brick removal completion status Info about its brick of index J: # volume.reiser4 -p J /mnt internal ID: Brick's "internal ID" and its status in the volume external ID: Brick's UUID device name: Name of the block device associated with the brick block count: Size of the block device in blocks blocks used: Total number of occupied blocks on the device system blocks: Minimal possible number of busy blocks on that device data capacity: Abstract capacity of the brick space usage: Portion of occupied blocks on the device in DSA: Participation in regular data distribution is proxy: Participation in data tiering (Burst Buffers, etc) Comment. When retrieving a brick's info make sure that no volume operations are in progress on that volume. Otherwise the command above will return an error (EBUSY). WARNING. Brick info provided this way is not necessarily the most recent. To get actual info, run sync(1) and make sure that no regular file operations are in progress.
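The per-brick fields can be post-processed with standard tools. A sketch using awk on sample output (the numbers below are made up for illustration; real output would come from `volume.reiser4 -p J /mnt`):

```shell
# Sketch: derive free blocks and space usage from the "block count" and
# "blocks used" fields. The sample values are illustrative only.
sample='block count: 2621440
blocks used: 1657203'

echo "$sample" | awk -F': *' '
    $1 == "block count" { total = $2 }
    $1 == "blocks used" { used  = $2 }
    END { printf "free: %d blocks, space usage: %.2f\n", total - used, used / total }'
```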
= Checking free space = To check the number of available free blocks on a volume mounted at /mnt, make sure that no regular file operations, as well as no volume operations, are in progress on that volume, then run # sync # df --block-size=4K /mnt To check the number of free blocks on the brick of index J run # volume.reiser4 -p J /mnt and then calculate the difference between "block count" and "blocks used". Comment. Not all free blocks on a brick/volume are available for use. The number of available free blocks is always ~95% of the total number of free blocks (Reiser4 reserves 5% to make sure that regular file truncate operations won't fail). NOTE: volume.reiser4 shows the total number of free blocks, whereas df(1) shows the number of available free blocks. The "space usage" statistic shows the portion of busy blocks on an individual brick. For the reasons explained above, "space usage" on any brick cannot be more than 0.95. = Checking quality of data distribution = Quality of data distribution is a measure of the deviation of the real data space usage from the ideal one defined by the volume partitioning. The smaller the deviation, the better the distribution quality. Checking the quality of distribution makes sense only in the case when your volume partitioning is space-based, or when it coincides with the space-based one. If your partitioning is throughput-based and doesn't coincide with the space-based one, then the quality of the actual data distribution can be rather bad: in that case the file system takes care that low-performance devices not become a bottleneck, and effective space usage is not a high priority. Checking the quality of data distribution is based on the free-blocks accounting provided by the file system. Note that the file system doesn't count busy data and meta-data blocks separately, so you are not able to find the real data space usage, and hence to check the quality of distribution, in the case when the meta-data brick contains data blocks.
To check the quality of distribution: * (1) make sure that the meta-data brick doesn't contain data blocks; * (2) make sure that no regular file or volume operations are currently in progress; * (3) find the "blocks used", "system blocks" and "data capacity" statistics for each data brick: # sync # volume.reiser4 -p 1 /mnt ... # volume.reiser4 -p N /mnt * (4) find the real data space usage on each brick; * (5) calculate the partitioning and the ideal data space usage on each data brick; * (6) find the deviation of (4) from (5). Example. Let's build an LV of 3 bricks (one 10G meta-data brick vdb1, and two data bricks: vdc1 (10G), vdd1 (5G)) with space-based partitioning: # VOL_ID=`uuid -v4` # echo "Using uuid $VOL_ID" # mkfs.reiser4 -U $VOL_ID -y -t 256K /dev/vdb1 # mkfs.reiser4 -U $VOL_ID -y -a -t 256K /dev/vdc1 # mkfs.reiser4 -U $VOL_ID -y -a -t 256K /dev/vdd1 # mount /dev/vdb1 /mnt Fill the meta-data brick with data: # dd if=/dev/zero of=/mnt/myfile bs=256K No space left on device... Add the data bricks /dev/vdc1 and /dev/vdd1 to the volume: # volume.reiser4 -a /dev/vdc1 /mnt # volume.reiser4 -a /dev/vdd1 /mnt Move all data blocks to the newly added bricks: # volume.reiser4 -r /dev/vdb1 /mnt # sync Now the meta-data brick doesn't contain data blocks (only meta-data ones), so we can calculate the quality of data distribution: # volume.reiser4 /mnt -p0 blocks used: 503 # volume.reiser4 /mnt -p1 blocks used: 1657203 system blocks: 115 data capacity: 2621069 # volume.reiser4 /mnt -p2 blocks used: 833001 system blocks: 73 data capacity: 1310391 Based on the statistics above, calculate the quality of distribution.
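This calculation can also be done mechanically. A sketch using the statistics just listed (awk serves only as a calculator here; the field values are copied from the example output above):

```shell
# Sketch: recompute distribution quality Q from the example's statistics.
awk 'BEGIN {
    c1 = 2621069; c2 = 1310391                 # data capacity per data brick
    r1 = 1657203 - 115; r2 = 833001 - 73       # blocks used minus system blocks
    r  = r1 + r2                               # data blocks on the whole volume
    i1 = r * c1 / (c1 + c2)                    # ideal usage per brick
    i2 = r * c2 / (c1 + c2)
    d1 = (r1 - i1) / r; d2 = (r2 - i2) / r     # relative deviations
    a1 = (d1 < 0) ? -d1 : d1
    a2 = (d2 < 0) ? -d2 : d2
    printf "Q = %.4f\n", 1 - ((a1 > a2) ? a1 : a2)
}'
```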
Total data capacity of the volume: C = 2621069 + 1310391 = 3931460 Relative capacities of the data bricks: C1 = 2621069 / 3931460 = 0.6667 C2 = 1310391 / 3931460 = 0.3333 Real space usage on the data bricks (blocks used - system blocks): R1 = 1657203 - 115 = 1657088 R2 = 833001 - 73 = 832928 Space usage on the volume: R = R1 + R2 = 1657088 + 832928 = 2490016 Ideal data space usage on the data bricks: I1 = C1 * R = 0.6667 * 2490016 = 1660094 I2 = C2 * R = 0.3333 * 2490016 = 829922 Deviation: D = (R1, R2) - (I1, I2) = (-3006, 3006) Relative deviation: D/R = (-0.0012, 0.0012) Quality of distribution: Q = 1 - max(|D1|/R, |D2|/R) = 1 - 0.0012 = 0.9988 Comment. For any specified number of bricks N and quality of distribution Q it is possible to find a configuration of a logical volume composed of N bricks such that the quality of distribution on that volume is better than Q. Comment. The quality of distribution Q doesn't depend on the number of bricks in the logical volume. This is a theorem, which can be strictly proven. = FAQ = Q. What happens if I lose a device-component (due to a breakdown, etc) of my logical volume? A. The bodies of some of your regular files will become "punched" in random places. The portion of such files depends on the relative capacity of the lost brick, on the number of bricks in the logical volume, and on other factors. Fsck will be able to detect and remove such files with corrupted bodies. Nevertheless, we recommend considering mirroring your bricks (e.g. by software or hardware RAID-1) to avoid such highly unpleasant situations. [[category:Reiser4]] Before working with logical volumes you need to understand some basic [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Background principles].
A logical volume (LV) can be composed of any number of block devices, differing in physical and geometric parameters. However, the optimal configuration (true parallelism) imposes some restrictions and dependencies on the sizes of such devices. WARNING: This stuff is not stable. Don't put important data on logical volumes managed by software of release number 5.X.Y. Also, don't mount your old partitions in kernels with Reiser4 of SFRN 5.X.Y before its stabilization. IMPORTANT: Currently there are no tools to manage Reiser5 logical volumes off-line, so it is strongly recommended to save/update the configuration of your LV in a file which doesn't belong to that volume. = Basic definitions. Volume configuration. Brick's capacity. Partitioning. Fair distribution. Balancing = The basic configuration of a logical volume is the following information: 1) Volume UUID; 2) Number of bricks in the volume; 3) List of brick names or UUIDs in the volume; 4) UUID or name of the brick to be added/removed (if any). That brick is not counted in (2) and (3). Item #4 is needed to handle incomplete operations, interrupted for various reasons (system crash, hard reset, etc), when bringing logical volumes on-line. For each volume its configuration should be stored somewhere (but not on that volume!) and properly updated before and after each volume operation performed on that volume. We make the user responsible for this. The volume configuration is needed to facilitate deploying a volume. '''Abstract capacity''' (or simply capacity) of a brick is a positive integer number. Capacity is a brick's property defined by the user. Don't confuse it with the size of the block device. Think of it as the brick's "weight" in some units. It is the user who decides which property of the brick to assign as its abstract capacity and in which units.
In particular, it can be the size of the block device in kilobytes, or its size in megabytes, or its throughput in MB/sec, or any other geometric or physical parameter of the device associated with the brick. It is important that the capacities of all bricks of the same logical volume are measured in the same units. Also, it would be utterly pointless to assign different properties as abstract capacities for bricks of the same LV - for example, the size of the block device for one brick, and disk bandwidth for another. The capacity of each brick gets initialized by the mkfs utility. By default it is calculated as the number of free blocks on the device at the very end of the formatting procedure. For the meta-data brick it is calculated as 70% of that amount. The capacity of any brick can be changed on-line by the user. '''Capacity of a logical volume''' is defined as the sum of the capacities of its component bricks. '''Relative capacity of a brick''' is the ratio of the brick's capacity to the volume's capacity. Relative capacity defines the portion of IO-requests that will be issued against that brick. The array of relative capacities (C1, C2, ...) of all bricks is called the volume partitioning. Obviously, C1 + C2 + ... = 1. '''Real data space usage''' on a brick is the number of data blocks stored on that brick. '''Ideal data space usage''' on a brick is defined as T*C, where T is the total number of data blocks stored in the volume and C is the relative capacity of the brick. It is recommended to compose volumes in such a way that the space-based partitioning coincides with the throughput-based one - this is the optimal volume configuration, which provides true parallelism. If that is impossible for some reason, then choose a preferred partitioning method (space-based or throughput-based). Note that space-based partitioning saves volume space, whereas throughput-based partitioning saves volume throughput. When performing regular file operations, Reiser5 distributes data stripes throughout the volume evenly and fairly.
This means that the portion of IO-requests issued against each brick is equal to its relative capacity, that is, to the portion of capacity that the brick adds to the total volume's capacity. In contrast with regular file operations, volume operations break the fairness of data distribution on your logical volume. To restore fairness of distribution, a special balancing procedure should be run on the volume. For example, after adding a brick to a logical volume, the balancing procedure will populate the new brick with data moved from other bricks. All volume operations except brick removal are fast, atomic and leave the volume in an unbalanced state. The operation of brick removal always includes balancing, which moves data from the brick you want to remove to the other bricks of the volume. If that data migration is interrupted for some reason, then the volume is marked as a "volume with incomplete brick removal".
Make sure there is the following message in kernel logs: "Loading Reiser4 (Software Framework Release: 5.X.Y)" Build and install the latest [https://sourceforge.net/projects/reiser4/files/reiser4-utils/libaal/ libaal] Download, build and install the latest version 2.A.B of [https://sourceforge.net/projects/reiser4/files/v5-unstable/ Reiser4progs package]. Make sure that utility for managing logical volumes is installed (as a part of reiser4progs package) on your machine: # volume.reiser4 -? = Creating a logical volume = Start from choosing a unique ID (uuid) of your volume. By default it is generated by mkfs utility. However, user can generate it himself by proper tools (e.g. uuid(1)) and store in an environment variable for convenience: # VOL_ID=`uuidgen` # echo "Using uuid $VOL_ID" Choose a stripe size for your logical volume. For a good quality of distribution it is recommended that stripe doesn't exceed 1/10000 of volume size. On the other hand, too small stripes will increase space consumption on your meta-data brick. In our example we choose stripe size 512K: # STRIPE=512K # echo "Using stripe size $STRIPE" Start from creating the first brick of your volume - meta-data brick, passing volume-ID and stripe size to mkfs.reiser4 utility: # mkfs.reiser4 -U $VOL_ID -t $STRIPE /dev/vdb1 Currently only one meta-data brick per volume is supported, so it is recommended that size of block device for meta-data brick in not too small. In most cases it will be enough, if your meta-data brick is not smaller than 1/200 of maximal volume size. For example, 100G meta-data brick will be able to service ~20T logical volume. Data and meta-data bricks don't differ from the standpoint of disk format, and there is no special option to inform mkfs utility that we want to create exactly meta-data brick: the first brick in the volume automatically becomes a meta-data brick, and other bricks are interpreted as data bricks. 
Mount your initial logical volume consisting of one meta-data brick: # mount /dev/vdb1 /mnt Find a record about your volume in the output of the following command: # volume.reiser4 -l Create configuration of your logical volume (its definition is above) and store it somewhere, but not on that volume! Your logical volume is now on-line and ready to use. You can perform regular file operations and volume operations (e.g. add a data brick to your LV). = Adding a data brick to LV = At any time you are able to add a data brick to your LV. You can do it in parallel with regular file operations executing on this volume. Make sure, however, that there is no other volume operations (e.g. removing a brick) over your volume in progress, otherwise your operation will fail with EBUSY. Obviously, adding a brick will increase capacity of your volume. Choose a block device for the new data brick. Make sure that it is not too large, or too small. Capacities of any 2 bricks of the same logical volume can not differ more than 2^19 (~1 million) times. E.g. your logical volume can not contain both, 1M and 2T bricks. Any attempts to add a brick of improper capacity will fail with error. Format it with the same volume ID and stripe size, as you used for meta-data brick, but specify also "-a" option (to not restrict data capacity). # mkfs.reiser4 -U $VOL_ID -t $STRIPE -a /dev/vdb2 Important: it is important that data brick is formatted with the same volume ID and stripe size, as the meta-data brick of your logical volume. Otherwise, operation of adding a data brick will fail. Update item #4 of your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration] with UUID or name of the brick you want to add. 
To add a brick simply pass its name as an argument for the option "-a" and specify your LV via its mount point: # volume.reiser4 -a /dev/vdb2 /mnt By default operation of adding a brick is fast and atomic and leaves the volume in unbalanced state, so after adding a brick you might want to run a balancing procedure, which will move a portion of data to the new brick from other bricks of the logical volume, which will make data distribution on your volume fair: # volume.reiser4 -b /mnt Portion of data blocks, being moved during such rebalancing, is equal to relative capacity of the new brick, that is to the portion of capacity that the new brick adds to updated LV's capacity. This important property defines the cost of balancing procedure. If the portion of capacity added by a brick is small, then number of stripes moved during balancing is also small. Specifying the option -B (--with-balance) will automatically trigger the balancing procedure after adding a brick: # volume.reiser4 -Ba /dev/vdb2 /mnt Upon successful completion update your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]. That is, increment (#2), add info about the new brick to (#3) and remove records at (#4). = Removing a data brick from LV = At any time you are able to remove any data brick from your LV (assuming that your volume is not marked as a "volume with incomplete brick removal". You can perform brick removal in parallel with regular file operations executing on that volume. Make sure, however, that there shouldn't be other volume operations (e.g. adding a brick) over your volume in progress, otherwise your removal will fail with EBUSY. Obviously, the removal operation will decrease abstract capacity of your LV. 
Note that other bricks should have enough space to store all data blocks of the brick you want to remove, otherwise, the removal operation will return error (ENOSPC). Suppose you want to remove brick /dev/vdb2 from your LV mounted at /mnt. Update your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration] with the UUID and name of the brick you want to remove (#item #4). To remove a brick simply pass its name as an argument for option "-r" and specify the logical volume by its mount point: # volume.reiser4 -r /dev/vdb2 /mnt The procedure of brick removal starts from moving all data from the brick you want to remove to other bricks of your volume, so that resulted data distribution among the rest of bricks will be also fair. Portion of data stripes being moved during such migration is equal to the relative capacity of the brick to be removed (that it to the portion of capacity that the brick added to LV's capacity). Successful brick removal always leaves the volume is balanced state. So, in contrast with the operation of adding a brick, removing a brick is a rather long operation, which can be interrupted for various reasons. In this case volume will be marked as a "volume with incomplete brick removal". To check removal status of your LV simply run # volume.reiser4 /mnt and check the field "health". To complete brick removal in the current mount session simply run # volume.reiser4 -R /mnt Note, that the option -R (--finish-removal) doesn't accept any arguments. On success update your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]: remove the information about the brick /dev/vdb2 at #3 and #4. 
Check your kernel logs: they should contain a message that the brick /dev/vdb2 has been unregistered. The device /dev/vdb2 no longer belongs to the logical volume, and you can reuse it for other purposes (re-format it, etc.).

= Changing brick's capacity =

At any time (assuming that no other volume operation is in progress) you can change the abstract capacity of any brick to a new non-zero value. Changing capacity always changes the volume partitioning and therefore breaks fairness of distribution, so Reiser5 automatically launches rebalancing to make sure that the resulting distribution is fair for the new set of capacities. In particular, increasing a brick's capacity moves some data from the other bricks to the brick whose capacity was increased; decreasing a brick's capacity moves some data from that brick to the other bricks. To change the abstract capacity of the brick /dev/vdb1 to a new value (e.g. 200000), run

 # volume.reiser4 -z /dev/vdb1 -c 200000 /mnt

pronounced as "resize brick /dev/vdb1 to the new capacity 200000 in the volume mounted at /mnt". The operation of changing capacity can return an error, most likely -ENOSPC, which is a side effect of concurrent regular file writes. In that case check the status of your LV. If it is unbalanced, consider removing some files from the LV and complete balancing by running

 # volume.reiser4 -b /mnt

Otherwise, repeat the operation from scratch.

Comment. Changing a brick's capacity to 0 is undefined and will return an error. Consider the brick removal operation instead.

= Operations with meta-data brick =

The meta-data brick can also contain data stripes and participate in data distribution like the data bricks, so all the volume operations described above are also applicable to the meta-data brick.
Note, however, that it is impossible to completely remove the meta-data brick from the logical volume for obvious reasons (meta-data needs to be stored somewhere). The brick removal operation applied to the meta-data brick therefore removes it from the Data Storage Array (DSA), not from the logical volume. The DSA is the subset of the LV consisting of the bricks that participate in data distribution. Once you remove the meta-data brick from the DSA, that brick is used only to store meta-data. The operation of adding a brick, applied to the meta-data brick, returns it back to the DSA.

Important: Reiser5 doesn't count busy data and meta-data blocks separately. So, in contrast with data bricks (which contain only data), you can not find out the real space occupied by data blocks on the meta-data brick - Reiser5 knows only the total space occupied.

To check the status of the meta-data brick, run

 # volume.reiser4 /mnt

and compare the values of "bricks total" and "bricks in DSA". If they are equal, the meta-data brick participates in data distribution. Otherwise, "bricks total" will be exactly 1 more than "bricks in DSA", which indicates that the meta-data brick doesn't participate in data distribution (and therefore contains no data blocks). Other cases are impossible: for data bricks, membership in the LV and in the DSA is always equivalent.

= Unmounting a logical volume =

To terminate a mount session, issue the usual umount against the mount point:

 # umount /mnt

Note that after unmounting the volume all bricks by default remain registered in the system until system shutdown.
If you want to unregister a brick before system shutdown, issue the following command:

 # volume.reiser4 -u BRICK_NAME

= Deploying a logical volume after correct unmount =

Make sure (by checking your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]) that all bricks of the volume are registered in the system. To register a brick, issue:

 # volume.reiser4 -g BRICK_NAME

The list of all volumes and bricks registered in the system can be found in the output of:

 # volume.reiser4 -l

Then issue a usual mount(8) command against one of the bricks of your volume; it is recommended to issue it against the meta-data brick.

NOTE: Reiser5 will refuse to mount a logical volume when a wrong (incomplete or redundant) set of bricks is registered in the system. A redundant set of bricks appears, for example, when you mistakenly register a brick that was earlier removed from the logical volume.

= Deploying a logical volume after correct shutdown =

First of all, check the [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing configuration] of your volume and make sure that all its bricks (data and meta-data) are registered in the system.
A redundant set of bricks appears, for example, when you mistakenly register a brick that was removed from the logical volume. For this reason we strongly recommend keeping track of your LV: store its [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing configuration] somewhere - but not on that volume! - and don't forget to update it after '''every''' volume operation. If you lose the configuration of your LV and don't remember it (which is likely for large volumes), it will be rather painful to restore: currently there are no tools to manage logical volumes off-line, so users have to maintain the configuration on their own. It is not at all difficult.

To register a brick in the system, use:

 # volume.reiser4 -g BRICK_NAME

To print the list of all registered bricks, use:

 # volume.reiser4 -l

Now mount your LV by issuing a mount(8) command against one of its bricks. We recommend issuing it against the meta-data brick.

Comment. Reiser5 always tries to register the brick that is passed to the mount command as an argument, so it is not necessary to pre-register the brick you are going to mount.

= Deploying a logical volume after hard reset or system crash =

If no volume operations were interrupted by the hard reset or system crash, just follow the instructions in this [https://reiser4.wiki.kernel.org/index.php?title=Logical_Volumes_Administration#Deploying_a_logical_volume_after_correct_shutdown section]. In Reiser5 only a restricted number of bricks participate in each transaction; the maximal number of such bricks can be specified by the user. At mount time a transaction replay procedure is launched on each such brick independently, in parallel.
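The incomplete/redundant condition that makes mount refuse, as described above, is just a set comparison between the saved configuration (item #3, the list of brick names or UUIDs) and the bricks currently registered for the volume. A hypothetical sketch, not part of reiser4progs:

```python
# Classify the registered brick set against the saved volume
# configuration (item #3: list of brick names/UUIDs).

def check_brick_set(configured, registered):
    configured, registered = set(configured), set(registered)
    if configured - registered:
        return "incomplete"   # some bricks are not registered: mount fails
    if registered - configured:
        return "redundant"    # e.g. a previously removed brick registered
    return "ok"               # mount can proceed

print(check_brick_set(["/dev/vdb1", "/dev/vdb2"], ["/dev/vdb1"]))  # incomplete
```

Running such a check against your stored configuration before mounting spares you a failed mount attempt.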
Otherwise, depending on the kind of interrupted volume operation, perform one of the following actions:

== Volume balancing was interrupted ==

Check your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]. Register the complete set of bricks and mount the volume. Check the balanced status of your LV by running

 # volume.reiser4 /mnt

and checking the "balanced" field. If the volume is unbalanced, complete balancing by running

 # volume.reiser4 -b /mnt

== Brick removal was interrupted ==

Check your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]. Register the new set of bricks (that is, the set without the brick you wanted to remove) and try to mount the volume. In the case of an error, register also the brick you wanted to remove and try to mount again. Check the status of your LV by running

 # volume.reiser4 /mnt

and checking the value of "health". If required, complete brick removal by running

 # volume.reiser4 -R /mnt

Note that the option -R doesn't accept any arguments. After successful completion of the removal, the brick is automatically removed from the volume and unregistered. Verify this by checking the status of your LV and the list of registered bricks:

 # volume.reiser4 /mnt
 # volume.reiser4 -l

Upon successful completion update your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration] accordingly.
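The recovery steps of the two cases above reduce to a small decision on the "balanced" and "health" fields printed by volume.reiser4. The helper and the exact field values below are placeholders (check the output of your version), not part of reiser4progs:

```python
# Decide the completion command after a crash, from the status fields
# reported by `volume.reiser4 /mnt`. Field values are placeholders.

def recovery_action(status):
    if status.get("health") == "incomplete removal":
        return "volume.reiser4 -R /mnt"  # finish the interrupted removal
    if not status.get("balanced", True):
        return "volume.reiser4 -b /mnt"  # finish the interrupted balancing
    return None                          # nothing to complete

print(recovery_action({"balanced": False, "health": "ok"}))
```

Incomplete removal is checked first because a volume in that state refuses regular volume operations until the removal is finished.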
= LV monitoring =

Common info about the LV mounted at /mnt:

 # volume.reiser4 /mnt

* ID: Volume UUID
* volume: ID of the plugin managing the volume
* distribution: ID of the distribution plugin
* stripe: Stripe size in bytes
* segments: Number of hash space segments (for distribution)
* bricks total: Total number of bricks in the volume
* bricks in DSA: Number of bricks participating in data distribution
* balanced: Balanced status of the volume
* health: Brick removal completion status

Info about the volume's brick of index J:

 # volume.reiser4 -p J /mnt

* internal ID: Brick's "internal ID" and its status in the volume
* external ID: Brick's UUID
* device name: Name of the block device associated with the brick
* block count: Size of the block device in blocks
* blocks used: Total number of occupied blocks on the device
* system blocks: Minimal possible number of busy blocks on the device
* data capacity: Abstract capacity of the brick
* space usage: Portion of occupied blocks on the device
* in DSA: Participation in regular data distribution
* is proxy: Participation in data tiering (Burst Buffers, etc.)

Comment. When retrieving a brick's info, make sure that no volume operations are in progress on that volume; otherwise the command above will return an error (EBUSY).

WARNING. Brick info obtained this way is not necessarily the most recent. To get up-to-date info, run sync(1) and make sure that no regular file operations are in progress.

= Checking free space =

To check the number of available free blocks on a volume mounted at /mnt, make sure that no regular file operations or volume operations are in progress on that volume, then run

 # sync
 # df --block-size=4K /mnt

To check the number of free blocks on the brick of index J, run

 # volume.reiser4 -p J /mnt

and calculate the difference between "block count" and "blocks used".

Comment. Not all free blocks on a brick/volume are available for use.
The number of available free blocks is always ~95% of the total number of free blocks (Reiser4 reserves 5% to make sure that regular file truncate operations won't fail).

NOTE: volume.reiser4 shows the total number of free blocks, whereas df(1) shows the number of available free blocks. The "space usage" statistic shows the portion of busy blocks on an individual brick. For the reasons explained above, "space usage" on any brick can not exceed 0.95.

= Checking quality of data distribution =

Quality of data distribution is a measure of the deviation of the real data space usage from the ideal one defined by the volume partitioning. The smaller the deviation, the better the distribution quality. Checking quality of distribution makes sense only when your volume partitioning is space-based, or coincides with the space-based one. If your partitioning is throughput-based and doesn't coincide with the space-based one, then the quality of the actual data distribution can be rather bad: in that case the file system takes care that low-performance devices don't become a bottleneck, and effective space usage is not a high priority.

Checking quality of data distribution is based on the free-blocks accounting provided by the file system. Note that the file system doesn't count busy data and meta-data blocks separately, so you can not find the real data space usage - and hence can not check quality of distribution - when the meta-data brick contains data blocks.

To check quality of distribution:

1) make sure that the meta-data brick doesn't contain data blocks;
2) make sure that no regular file or volume operations are currently in progress;
3) find the "blocks used", "system blocks" and "data capacity" statistics for each data brick:

 # sync
 # volume.reiser4 -p 1 /mnt
 ...
 # volume.reiser4 -p N /mnt

4) find the real data space usage on each brick;
5) calculate the partitioning and the ideal data space usage on each data brick;
6) find the deviation of (4) from (5).

Example.
Let's build an LV of 3 bricks (one 10G meta-data brick vdb1, and two data bricks: vdc1 (10G) and vdd1 (5G)) with space-based partitioning:

 # VOL_ID=`uuid -v4`
 # echo "Using uuid $VOL_ID"
 # mkfs.reiser4 -U $VOL_ID -y -t 256K /dev/vdb1
 # mkfs.reiser4 -U $VOL_ID -y -a -t 256K /dev/vdc1
 # mkfs.reiser4 -U $VOL_ID -y -a -t 256K /dev/vdd1
 # mount /dev/vdb1 /mnt

Fill the meta-data brick with data:

 # dd if=/dev/zero of=/mnt/myfile bs=256K
 No space left on device...

Add the data bricks /dev/vdc1 and /dev/vdd1 to the volume:

 # volume.reiser4 -a /dev/vdc1 /mnt
 # volume.reiser4 -a /dev/vdd1 /mnt

Move all data blocks to the newly added bricks (this removes the meta-data brick from the DSA):

 # volume.reiser4 -r /dev/vdb1 /mnt
 # sync

Now the meta-data brick contains no data blocks (only meta-data ones), so we can calculate the quality of data distribution:

 # volume.reiser4 /mnt -p0
 blocks used: 503
 # volume.reiser4 /mnt -p1
 blocks used: 1657203
 system blocks: 115
 data capacity: 2621069
 # volume.reiser4 /mnt -p2
 blocks used: 833001
 system blocks: 73
 data capacity: 1310391

Based on the statistics above, calculate the quality of distribution.

Total data capacity of the volume:
 C = 2621069 + 1310391 = 3931460

Relative capacities of the data bricks:
 C1 = 2621069 / 3931460 = 0.6667
 C2 = 1310391 / 3931460 = 0.3333

Real space usage on the data bricks (blocks used - system blocks):
 R1 = 1657203 - 115 = 1657088
 R2 = 833001 - 73 = 832928

Space usage on the volume:
 R = R1 + R2 = 1657088 + 832928 = 2490016

Ideal data space usage on the data bricks:
 I1 = C1 * R = 0.6667 * 2490016 = 1660094
 I2 = C2 * R = 0.3333 * 2490016 = 829922

Deviation:
 D = (R1, R2) - (I1, I2) = (-3006, 3006)

Relative deviation:
 D/R = (D1, D2) = (-0.0012, 0.0012)

Quality of distribution:
 Q = 1 - max(|D1|, |D2|) = 1 - 0.0012 = 0.9988

Comment. For any specified number of bricks N and quality of distribution Q, it is possible to find a configuration of a logical volume composed of N bricks such that the quality of distribution on that volume is better than Q.
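The arithmetic of the example can be reproduced with a short script. The inputs are the "blocks used", "system blocks" and "data capacity" figures reported for the two data bricks; everything else follows the definitions of this section:

```python
# Quality of distribution for the example above.
used     = [1657203, 833001]     # blocks used (bricks 1 and 2)
system   = [115, 73]             # system blocks
capacity = [2621069, 1310391]    # data capacity

real  = [u - s for u, s in zip(used, system)]       # R1, R2
total = sum(real)                                   # R
rel_cap = [c / sum(capacity) for c in capacity]     # C1, C2
ideal = [c * total for c in rel_cap]                # I1, I2
rel_dev = [(r - i) / total for r, i in zip(real, ideal)]
quality = 1 - max(abs(d) for d in rel_dev)
print(round(quality, 4))  # 0.9988
```

Feeding the per-brick statistics of a larger volume into the same lists generalizes the computation to N data bricks.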
Comment. The quality of distribution Q doesn't depend on the number of bricks in the logical volume. This is a theorem, which can be strictly proven.

= FAQ =

Q. What happens if I lose a device component of my logical volume (due to a breakdown, etc.)?

A. The bodies of some of your regular files will become "punched" in random places. The portion of such files depends on the relative capacity of the lost brick, on the number of bricks in the logical volume, and on other factors. Fsck will be able to detect and remove files with corrupted bodies. Nevertheless, we recommend mirroring your bricks (e.g. by software or hardware RAID-1) to avoid such highly unpleasant situations.

[[category:Reiser4]]

Before working with logical volumes you need to understand some basic [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Background principles]. A logical volume (LV) can be composed of any number of block devices that differ in physical and geometric parameters. However, the optimal configuration (true parallelism) imposes some restrictions and dependencies on the sizes of such devices.

WARNING: This stuff is not stable. Don't put important data on logical volumes managed by software of release number 5.X.Y. Also don't mount your old partitions in kernels with Reiser4 of SFRN 5.X.Y before its stabilization.

IMPORTANT: Currently there are no tools to manage Reiser5 logical volumes off-line, so it is strongly recommended to save/update the configuration of your LV in a file that doesn't belong to that volume.

= Basic definitions. Volume configuration. Brick's capacity. Partitioning. Fair distribution.
Balancing =

The basic configuration of a logical volume is the following information:

1) Volume UUID;
2) Number of bricks in the volume;
3) List of brick names or UUIDs in the volume;
4) UUID or name of the brick being added/removed (if any). That brick is not counted in (2) and (3).

Item #4 exists to handle incomplete operations, interrupted for various reasons (system crash, hard reset, etc.), when bringing logical volumes on-line. For each volume, its configuration should be stored somewhere (but not on that volume!) and properly updated before and after each volume operation performed on it. We make the user responsible for this. The volume configuration is needed to facilitate deploying the volume.

'''Abstract capacity''' (or simply capacity) of a brick is a positive integer. Capacity is a brick's property defined by the user; don't confuse it with the size of the block device. Think of it as the brick's "weight" in some units. It is the user who decides which property of the brick to assign as its abstract capacity, and in which units. In particular, it can be the size of the block device in kilobytes or in megabytes, its throughput in MB/sec, or another geometric or physical parameter of the device associated with the brick. It is important that the capacities of all bricks of the same logical volume are measured in the same units. Likewise, it would be utterly pointless to assign different properties as abstract capacities for bricks of the same LV - for example, the size of the block device for one brick and disk bandwidth for another.

The capacity of each brick gets initialized by the mkfs utility. By default it is calculated as the number of free blocks on the device at the very end of the formatting procedure; for the meta-data brick it is calculated as 70% of that amount. The capacity of any brick can be changed on-line by the user.

'''Capacity of a logical volume''' is defined as the sum of the capacities of its component bricks.
'''Relative capacity of a brick''' is the ratio of the brick's capacity to the volume's capacity. Relative capacity defines the portion of IO requests that will be issued against that brick. The array of relative capacities (C1, C2, ...) of all bricks is called the volume partitioning. Obviously, C1 + C2 + ... = 1.

'''Real data space usage''' on a brick is the number of data blocks stored on that brick.

'''Ideal data space usage''' on a brick is T*C, where T is the total number of data blocks stored in the volume and C is the relative capacity of the brick.

It is recommended to compose volumes so that the space-based partitioning coincides with the throughput-based one - that is the optimal volume configuration, which provides true parallelism. If that is impossible for some reason, choose a preferred partitioning method (space-based or throughput-based). Note that space-based partitioning saves volume space, whereas throughput-based partitioning saves volume throughput.

When performing regular file operations, Reiser5 distributes data stripes throughout the volume evenly and fairly. That means the portion of IO requests issued against each brick is equal to its relative capacity, i.e. to the portion of capacity that the brick adds to the total volume's capacity.

In contrast with regular file operations, volume operations break fairness of data distribution on your logical volume. To restore fairness, a special balancing procedure should be run on the volume. For example, after adding a brick to a logical volume, the balancing procedure will populate the new brick with data moved from the other bricks. All volume operations except brick removal are fast, atomic and leave the volume in an unbalanced state. The brick removal operation always includes balancing, which moves data from the brick being removed to the other bricks of the volume. If that data migration is interrupted for some reason, the volume is marked as a "volume with incomplete brick removal".
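Fair distribution - each brick receiving a share of stripes equal to its relative capacity - can be illustrated with a toy model that maps a hash of the stripe id onto the capacity line. This is only a stand-in written for this page: the real Reiser5 distribution plugin uses its own algorithm (the "segments" field above refers to its hash space), and all names here are hypothetical:

```python
# Toy illustration of fair stripe distribution: a stripe's hash is
# mapped onto [0, total capacity), so each brick receives a share of
# stripes proportional to its relative capacity. NOT the real plugin.
import hashlib

def pick_brick(stripe_id, capacities):
    total = sum(capacities.values())
    point = int(hashlib.sha256(str(stripe_id).encode()).hexdigest(), 16) % total
    for brick, cap in sorted(capacities.items()):
        if point < cap:
            return brick
        point -= cap

caps = {"A": 2, "B": 1}  # relative capacities 2/3 and 1/3
hits = sum(pick_brick(i, caps) == "A" for i in range(9000))
print(hits / 9000)  # close to 2/3
```

Note what such a scheme does not give you by itself: changing a capacity re-maps many stripes, which is exactly why every capacity change triggers rebalancing.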
It is allowed to perform regular file and volume operations on an unbalanced LV (assuming it is not a volume with incomplete removal). However, in this case we don't guarantee a good quality of data distribution on your LV. In addition, on a volume with incomplete removal you won't be able to perform regular volume operations - first you will need to complete the removal by running a special removal completion procedure on the volume.

= Prepare Software and Hardware =

Build, install and boot a kernel with Reiser4 of software framework release number 5.X.Y. Kernel patches can be found [https://sourceforge.net/projects/reiser4/files/v5-unstable/ here]. Note that the Linux kernel and GNU utilities still recognize the testing stuff as "Reiser4". Make sure the kernel logs contain the following message:

 "Loading Reiser4 (Software Framework Release: 5.X.Y)"

Build and install the latest [https://sourceforge.net/projects/reiser4/files/reiser4-utils/libaal/ libaal]. Download, build and install the latest version 2.A.B of the [https://sourceforge.net/projects/reiser4/files/v5-unstable/ Reiser4progs package]. Make sure that the utility for managing logical volumes is installed on your machine (as a part of the reiser4progs package):

 # volume.reiser4 -?

= Creating a logical volume =

Start by choosing a unique ID (uuid) for your volume. By default it is generated by the mkfs utility; however, you can generate it yourself with a proper tool (e.g. uuid(1)) and store it in an environment variable for convenience:

 # VOL_ID=`uuidgen`
 # echo "Using uuid $VOL_ID"

Choose a stripe size for your logical volume. For a good quality of distribution it is recommended that the stripe not exceed 1/10000 of the volume size. On the other hand, too small a stripe will increase space consumption on your meta-data brick.
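The stripe-size rule of thumb above can be expressed as a quick sanity check. The helper is hypothetical; the 1/10000 ratio is the recommendation from this section:

```python
# Check a candidate stripe size against the recommendation that the
# stripe not exceed 1/10000 of the (maximal expected) volume size.

def stripe_ok(stripe_bytes, volume_bytes):
    return stripe_bytes <= volume_bytes // 10_000

K, G, T = 1024, 1024**3, 1024**4
print(stripe_ok(512 * K, 20 * T))  # True: 512K is fine for a ~20T volume
print(stripe_ok(512 * K, 1 * G))   # False: too coarse for a 1G volume
```

Since the stripe size is fixed at mkfs time, size it against the volume you expect to grow into, not the initial one.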
In our example we choose a stripe size of 512K:

 # STRIPE=512K
 # echo "Using stripe size $STRIPE"

Start by creating the first brick of your volume - the meta-data brick - passing the volume ID and stripe size to the mkfs.reiser4 utility:

 # mkfs.reiser4 -U $VOL_ID -t $STRIPE /dev/vdb1

Currently only one meta-data brick per volume is supported, so it is recommended that the block device for the meta-data brick not be too small. In most cases it is enough if your meta-data brick is not smaller than 1/200 of the maximal volume size. For example, a 100G meta-data brick will be able to service a ~20T logical volume. Data and meta-data bricks don't differ from the standpoint of disk format, and there is no special option to tell the mkfs utility that we want to create a meta-data brick: the first brick in the volume automatically becomes the meta-data brick, and the other bricks are interpreted as data bricks.

Mount your initial logical volume consisting of one meta-data brick:

 # mount /dev/vdb1 /mnt

Find the record about your volume in the output of the following command:

 # volume.reiser4 -l

Create the configuration of your logical volume (its definition is above) and store it somewhere - but not on that volume! Your logical volume is now on-line and ready to use. You can perform regular file operations and volume operations (e.g. add a data brick to your LV).

= Adding a data brick to LV =

At any time you can add a data brick to your LV. You can do it in parallel with regular file operations executing on the volume. Make sure, however, that no other volume operation (e.g. removing a brick) is in progress on the volume, otherwise your operation will fail with EBUSY. Obviously, adding a brick increases the capacity of your volume.

Choose a block device for the new data brick. Make sure that it is not too large or too small: the capacities of any 2 bricks of the same logical volume can not differ by more than 2^19 (~1 million) times. For example,
your logical volume can not contain both 1M and 2T bricks. Any attempt to add a brick of improper capacity will fail with an error.

Format the new device with the same volume ID and stripe size as you used for the meta-data brick, but also specify the "-a" option (to not restrict data capacity):

 # mkfs.reiser4 -U $VOL_ID -t $STRIPE -a /dev/vdb2

Important: the data brick must be formatted with the same volume ID and stripe size as the meta-data brick of your logical volume; otherwise the operation of adding the data brick will fail. Update item #4 of your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration] with the UUID or name of the brick you want to add.
Specifying the option -B (--with-balance) will automatically trigger the balancing procedure after adding a brick: # volume.reiser4 -Ba /dev/vdb2 /mnt Upon successful completion update your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]. That is, increment (#2), add info about the new brick to (#3) and remove records at (#4). = Removing a data brick from LV = At any time you are able to remove any data brick from your LV (assuming that your volume is not marked as a "volume with incomplete brick removal". You can perform brick removal in parallel with regular file operations executing on that volume. Make sure, however, that there shouldn't be other volume operations (e.g. adding a brick) over your volume in progress, otherwise your removal will fail with EBUSY. Obviously, the removal operation will decrease abstract capacity of your LV. Note that other bricks should have enough space to store all data blocks of the brick you want to remove, otherwise, the removal operation will return error (ENOSPC). Suppose you want to remove brick /dev/vdb2 from your LV mounted at /mnt. Update your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration] with the UUID and name of the brick you want to remove (#item #4). To remove a brick simply pass its name as an argument for option "-r" and specify the logical volume by its mount point: # volume.reiser4 -r /dev/vdb2 /mnt The procedure of brick removal starts from moving all data from the brick you want to remove to other bricks of your volume, so that resulted data distribution among the rest of bricks will be also fair. 
Portion of data stripes being moved during such migration is equal to the relative capacity of the brick to be removed (that it to the portion of capacity that the brick added to LV's capacity). Successful brick removal always leaves the volume is balanced state. So, in contrast with the operation of adding a brick, removing a brick is a rather long operation, which can be interrupted for various reasons. In this case volume will be marked as a "volume with incomplete brick removal". To check removal status of your LV simply run # volume.reiser4 /mnt and check the field "health". To complete brick removal in the current mount session simply run # volume.reiser4 -R /mnt Note, that the option -R (--finish-removal) doesn't accept any arguments. On success update your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]: remove the information about the brick /dev/vdb2 at #3 and #4. Check your kernel logs: it should contain a message that brick /dev/vdb2 has been unregistered. Now device /dev/vdb2 doesn't belong to the logical volume any more, and you can reuse it for other purposes (re-format, etc). = Changing brick's capacity = At any time (in the assumption that no other volume operation is in progress) you can change abstract capacity of any brick to some new value, different from 0. Changing capacity always changes volume partitioning, and therefore, breaks fairness of distribution, so Reiser5 automatically launches rebalancing to make sure that resulted distribution is fair for the new set of capacities. In particular, increasing bricks capacity will move some data from other bricks to the brick, whose capacity was increased. Decreasing bricks capacity will move some data from the brick, whose capacity was decreased, to other bricks. To change abstract capacity of a brick /dev/vdb1 to a new value (e.g. 
200000), simply run # volume.reiser4 -z /dev/vdb1 -c 200000 /mnt Pronounced as "resize brick /dev/vdb1 to new capacity 200000 in volume mounted at /mnt". The operation of changing capacity can return error. Most likely, it is -ENOSPC, which is a side effect of concurrent regular file writes. In this case check the status of your LV. If it is unbalanced, then consider removing some files from your LV and complete balancing by running # volume.reiser4 -b /mnt Otherwise, repeat the operation from scratch. Comment. Changing bricks capacity to 0 is undefined and will return error. Consider brick removal operation instead. = Operations with meta-data brick = Meta-data brick can also contain data stripes and participate in data distribution like other data bricks. So that all the volume operations described above are also applicable to meta-data brick. Note, however, that it is impossible to completely remove meta-data brick from the logical volume for obvious reasons (meta-data need to be stored somewhere), so brick removal operation applied to the meta-data brick actually removes it from Data-Storage Array (DSA), not from the logical volume. DSA is a subset of LV consisting of bricks, participating in data distribution. Once you remove meta-data brick from DSA, that brick will be used only to store meta-data. Operation of adding a brick, being applied to a meta-data brick, returns the last one back to DSA. Important: Reiser5 doesn't count busy data and meta-data blocks separately. So in contrast with data bricks (which contain only data) you are not able to find out real space occupied by data blocks on the meta-data brick - Reiser5 knows only total space occupied. To check the status of meta-data brick simply run # volume.reiser4 /mnt and compare values of "bricks total" and "bricks in DSA". If they are equal, then meta-data brick participates in data distribution. 
Otherwise, "bricks total" should be 1 more than "bricks in DSA" - it indicates that meta-data brick doesn't participate in data distribution (and therefore, doesn't contain data blocks). Note that other cases are impossible: for data bricks participation in LV and DSA is always equivalent. = Unmounting a logical volume = To terminate a mount session just issue usual umount against the mount point: # umount /mnt Note that after unmounting the volume all bricks by default remain to be registered in the system till system shutdown. If you want to unregister a brick before system shutdown, then simply issue the following command: # volume.reiser4 -u BRICK_NAME = Deploying a logical volume after correct unmount = Make sure (by checking your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]) that all bricks of the volume are registered in the system. To register a brick issue the following command: # volume.reiser4 -g BRICK_NAME The list of all volumes and bricks registered in the system can be found in the output of the following command: # volume.reisrer4 -l Issue usual mount(8) command against one of the bricks of your volume. It is recommended to issue it against meta-data brick. NOTE: Reiser5 will refuse to mount a logical volume, in the case, when a wrong (incomplete or redundant) set of bricks is registered in the system. Redundant set of bricks appears, for example, when you mistakenly register a brick that was earlier removed from the logical volume. = Deploying a logical volume after correct shutdown = First of all, check [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing configuration] of your volume and make sure that all its bricks (data and meta-data ones) are registered in the system. 
The list of registered bricks can be printed by

 # volume.reiser4 -l

Also make sure that the set of bricks registered for the volume doesn't contain bricks not mentioned in the [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration].

Important: Reiser5 will refuse to mount a logical volume when a wrong (incomplete or redundant) set of bricks is registered in the system. A redundant set of bricks appears, for example, when you mistakenly register a brick that was removed from the logical volume. For this reason we strongly recommend keeping track of your LV: store its [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing configuration] somewhere, but not on this volume! And don't forget to update that configuration after '''every''' volume operation. If you lose the configuration of your LV and don't remember it (which is most likely for large volumes), it will be rather painful to restore: currently there are no tools to manage logical volumes off-line, so users have to keep track of it on their own. It is not at all difficult.

To register a brick in the system, use the following command:

 # volume.reiser4 -g BRICK_NAME

To print the list of all registered bricks, use

 # volume.reiser4 -l

Now mount your LV by simply issuing a mount(8) command against one of the bricks of your LV. We recommend issuing it against the meta-data brick.

Comment. Reiser5 always tries to register the brick which is passed to the mount command as an argument, so it is not necessary to pre-register the brick you are going to issue the mount command against.
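The consistency check described above (registered bricks vs. saved configuration) can be sketched as a small script. Everything here is illustrative: the function name, the device names, and the assumption that your saved configuration and the parsed output of volume.reiser4 -l are both available as plain lists of brick names.

```python
# Sketch: compare a saved volume configuration against the bricks currently
# registered in the system. Names and formats are hypothetical -- adapt them
# to how you actually store your configuration and to the real output of
# `volume.reiser4 -l`.

def check_registered(config_bricks, registered_bricks):
    """Return (missing, redundant) sets of brick names.

    config_bricks     -- bricks listed in your saved volume configuration
    registered_bricks -- bricks reported as registered (e.g. parsed from
                         `volume.reiser4 -l` output)
    """
    config = set(config_bricks)
    registered = set(registered_bricks)
    missing = config - registered      # register these before mounting
    redundant = registered - config    # e.g. a brick removed earlier by mistake
    return missing, redundant

# Example with made-up device names:
missing, redundant = check_registered(
    ["/dev/vdb1", "/dev/vdc1", "/dev/vdd1"],
    ["/dev/vdb1", "/dev/vdc1", "/dev/vde1"],
)
print(missing)    # bricks still to register with `volume.reiser4 -g`
print(redundant)  # bricks to unregister with `volume.reiser4 -u`
```

Either a non-empty "missing" or a non-empty "redundant" set means Reiser5 would refuse the mount, per the note above.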
= Deploying a logical volume after hard reset or system crash =

If no volume operations were interrupted by the hard reset or system crash, just follow the instructions in this [https://reiser4.wiki.kernel.org/index.php?title=Logical_Volumes_Administration#Deploying_a_logical_volume_after_correct_shutdown section]. In Reiser5 only a restricted number of bricks participate in every transaction; the maximal number of such bricks can be specified by the user. At mount time a transaction replay procedure will be launched on each such brick independently, in parallel.

Depending on the kind of interrupted volume operation, perform one of the following actions:

== Volume balancing was interrupted ==

Check your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]. Register the complete set of bricks and mount the volume. Check the balanced status of your LV by running

 # volume.reiser4 /mnt

and checking the "balanced" field. If the volume is unbalanced, complete balancing by running

 # volume.reiser4 -b /mnt

== Brick removal was interrupted ==

Check your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]. Register the new set of bricks (that is, the set of bricks without the brick you wanted to remove). Try to mount the volume. In case of error, register also the brick you wanted to remove and try to mount again. Check the status of your LV by running

 # volume.reiser4 /mnt

and checking the value of "health". If required, complete brick removal by running

 # volume.reiser4 -R /mnt

Note that the option -R doesn't accept any arguments. After successful removal completion the brick will be automatically removed from the volume and unregistered.
Make sure of it by checking the status of your LV and the list of registered bricks:

 # volume.reiser4 /mnt
 # volume.reiser4 -l

Upon successful completion update your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration] accordingly.

= LV monitoring =

Common info about the LV mounted at /mnt:

 # volume.reiser4 /mnt

 ID:             Volume UUID
 volume:         ID of plugin managing the volume
 distribution:   ID of distribution plugin
 stripe:         Stripe size in bytes
 segments:       Number of hash space segments (for distribution)
 bricks total:   Total number of bricks in the volume
 bricks in DSA:  Number of bricks participating in data distribution
 balanced:       Balanced status of the volume
 health:         Brick removal completion status

Info about any of its bricks, of index J:

 # volume.reiser4 -p J /mnt

 internal ID:    Brick's "internal ID" and its status in the volume
 external ID:    Brick's UUID
 device name:    Name of the block device associated with the brick
 block count:    Size of the block device in blocks
 blocks used:    Total number of occupied blocks on the device
 system blocks:  Minimal possible number of busy blocks on that device
 data capacity:  Abstract capacity of the brick
 space usage:    Portion of occupied blocks on the device
 in DSA:         Participation in regular data distribution
 is proxy:       Participation in data tiering (Burst Buffers, etc.)

Comment. When retrieving a brick's info, make sure that no volume operations over that volume are in progress; otherwise the command above will return an error (EBUSY).

WARNING. Brick info obtained this way is not necessarily the most recent. To get actual info, run sync(1) and make sure that no regular file operations are in progress.
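For scripting around these reports, the "key: value" lines listed above can be collected into a dictionary. A minimal sketch; the sample text is modeled on the field list documented here, and the exact output format of volume.reiser4 may differ:

```python
# Sketch: turn "key: value" lines, as printed by volume.reiser4, into a dict.
# The sample below is hypothetical, based on the field list documented above.

def parse_info(text):
    info = {}
    for line in text.splitlines():
        if ":" not in line:
            continue  # skip lines that are not "key: value" pairs
        key, _, value = line.partition(":")
        info[key.strip()] = value.strip()
    return info

sample = """\
bricks total: 3
bricks in DSA: 2
balanced: yes
"""
info = parse_info(sample)
# One brick (the meta-data brick) outside the DSA:
print(int(info["bricks total"]) - int(info["bricks in DSA"]))  # prints 1
```

A difference of 1 between "bricks total" and "bricks in DSA" is exactly the meta-data-brick-outside-DSA case discussed in the "Operations with meta-data brick" section.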
= Checking free space =

To check the number of available free blocks on a volume mounted at /mnt, make sure that no regular file operations, as well as no volume operations, are in progress on that volume, then run

 # sync
 # df --block-size=4K /mnt

To check the number of free blocks on the brick of index J, run

 # volume.reiser4 -p J /mnt

then calculate the difference between "block count" and "blocks used".

Comment. Not all free blocks on a brick/volume are available for use. The number of available free blocks is always ~95% of the total number of free blocks (Reiser4 reserves 5% to make sure that regular file truncate operations won't fail).

NOTE: volume.reiser4 shows the total number of free blocks, whereas df(1) shows the number of available free blocks. The "space usage" statistic shows the portion of busy blocks on an individual brick. For the reasons explained above, "space usage" on any brick can not be more than 0.95.

= Checking quality of data distribution =

Quality of data distribution is a measure of the deviation of the real data space usage from the ideal one defined by volume partitioning. The smaller the deviation, the better the distribution quality. Checking quality of distribution makes sense only when your volume partitioning is space-based, or coincides with the space-based one. If your partitioning is throughput-based and doesn't coincide with the space-based one, then the quality of actual data distribution can be rather bad: in that case the file system strives to prevent low-performance devices from becoming a bottleneck, and effective space usage is not a high priority.

Checking quality of data distribution is based on the free blocks accounting provided by the file system. Note that the file system doesn't count busy data and meta-data blocks separately, so you are not able to find the real data space usage, and hence to check the quality of distribution, when the meta-data brick contains data blocks.
To check quality of distribution:
* make sure that the meta-data brick doesn't contain data blocks;
* make sure that no regular file and volume operations are currently in progress;
* find the "blocks used", "system blocks" and "data capacity" statistics for each data brick:

 # sync
 # volume.reiser4 -p 1 /mnt
 ...
 # volume.reiser4 -p N /mnt

* find the real data space usage on each brick;
* calculate the partitioning and the ideal data space usage on each data brick;
* find the deviation of the real data space usage from the ideal one.

Example. Let's build an LV of 3 bricks (one 10G meta-data brick vdb1, and two data bricks: vdc1 (10G) and vdd1 (5G)) with space-based partitioning:

 # VOL_ID=`uuid -v4`
 # echo "Using uuid $VOL_ID"
 # mkfs.reiser4 -U $VOL_ID -y -t 256K /dev/vdb1
 # mkfs.reiser4 -U $VOL_ID -y -a -t 256K /dev/vdc1
 # mkfs.reiser4 -U $VOL_ID -y -a -t 256K /dev/vdd1
 # mount /dev/vdb1 /mnt

Fill the meta-data brick with data:

 # dd if=/dev/zero of=/mnt/myfile bs=256K
 No space left on device...

Add data bricks /dev/vdc1 and /dev/vdd1 to the volume:

 # volume.reiser4 -a /dev/vdc1 /mnt
 # volume.reiser4 -a /dev/vdd1 /mnt

Move all data blocks to the newly added bricks:

 # volume.reiser4 -r /dev/vdb1 /mnt
 # sync

Now the meta-data brick doesn't contain data blocks (only meta-data ones), so we can calculate the quality of data distribution:

 # volume.reiser4 /mnt -p0
 blocks used:    503
 # volume.reiser4 /mnt -p1
 blocks used:    1657203
 system blocks:  115
 data capacity:  2621069
 # volume.reiser4 /mnt -p2
 blocks used:    833001
 system blocks:  73
 data capacity:  1310391

Based on the statistics above, calculate the quality of distribution.
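The calculation can also be scripted. A minimal sketch (the function name is ours), fed with the per-brick statistics gathered above:

```python
# Sketch: compute quality of distribution from per-brick statistics
# (blocks used, system blocks, data capacity), as reported by
# `volume.reiser4 -p J /mnt`. Numbers are taken from the example above.

def distribution_quality(bricks):
    """bricks: list of (blocks_used, system_blocks, data_capacity) tuples,
    one tuple per data brick."""
    caps = [c for _, _, c in bricks]
    C = sum(caps)                          # total data capacity of the volume
    real = [u - s for u, s, _ in bricks]   # real data space usage per brick
    R = sum(real)                          # data blocks stored on the volume
    ideal = [R * c / C for c in caps]      # ideal data space usage per brick
    rel_dev = [(r - i) / R for r, i in zip(real, ideal)]
    return 1 - max(abs(d) for d in rel_dev)

q = distribution_quality([
    (1657203, 115, 2621069),   # data brick of index 1
    (833001, 73, 1310391),     # data brick of index 2
])
print(round(q, 4))  # prints 0.9988
```

This reproduces the result of the step-by-step derivation that follows.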
Total data capacity of the volume:

 C = 2621069 + 1310391 = 3931460

Relative capacities of the data bricks:

 C1 = 2621069 / (2621069 + 1310391) = 0.6667
 C2 = 1310391 / (2621069 + 1310391) = 0.3333

Real space usage on the data bricks (blocks used - system blocks):

 R1 = 1657203 - 115 = 1657088
 R2 = 833001 - 73 = 832928

Space usage on the volume:

 R = R1 + R2 = 1657088 + 832928 = 2490016

Ideal data space usage on the data bricks:

 I1 = C1 * R = 0.6667 * 2490016 = 1660094
 I2 = C2 * R = 0.3333 * 2490016 = 829922

Deviation:

 D = (R1, R2) - (I1, I2) = (-3006, 3006)

Relative deviation:

 D/R = (-0.0012, 0.0012)

Quality of distribution:

 Q = 1 - max(|D1/R|, |D2/R|) = 1 - 0.0012 = 0.9988

Comment. For any specified number of bricks N and quality of distribution Q, it is possible to find a configuration of a logical volume composed of N bricks such that the quality of distribution on that volume is better than Q.

Comment. The quality of distribution Q doesn't depend on the number of bricks in the logical volume. This is a theorem, which can be strictly proven.

= FAQ =

Q. What happens if I lose a device-component (due to a breakdown, etc.) of my logical volume?

A. Bodies of some of your regular files will become "punched" in random places. The portion of such files depends on the relative capacity of the lost brick, on the number of bricks in the logical volume, and on other factors. Fsck will be able to detect and remove files with corrupted bodies. Nevertheless, we recommend considering mirroring your bricks (e.g. with software or hardware RAID-1) to avoid such highly unpleasant situations.

[[category:Reiser4]]

Before working with logical volumes you need to understand some basic [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Background principles].
A logical volume (LV) can be composed of any number of block devices, different in physical and geometric parameters. However, the optimal configuration (true parallelism) imposes some restrictions and dependencies on the sizes of such devices.

WARNING: This software is not yet stable. Don't put important data on logical volumes managed by software of release number 5.X.Y. Also, don't mount your old partitions in kernels with Reiser4 of SFRN 5.X.Y before its stabilization.

IMPORTANT: Currently there are no tools to manage Reiser5 logical volumes off-line, so it is strongly recommended to save/update the configuration of your LV in a file which doesn't belong to that volume.

= Basic definitions. Volume configuration. Brick's capacity. Partitioning. Fair distribution. Balancing =

The basic configuration of a logical volume is the following information:

1) Volume UUID;
2) Number of bricks in the volume;
3) List of names or UUIDs of the bricks in the volume;
4) UUID or name of the brick to be added/removed (if any). That brick is not counted in (2) and (3).

Item #4 is needed to handle incomplete operations, interrupted for various reasons (system crash, hard reset, etc.), when bringing logical volumes on-line. For each volume, its configuration should be stored somewhere (but not on that volume!) and properly updated before and after each volume operation performed on that volume. We make the user responsible for this. The volume configuration is needed to facilitate deploying the volume.

'''Abstract capacity''' (or simply capacity) of a brick is a positive integer number. Capacity is a brick's property defined by the user. Don't confuse it with the size of the block device. Think of it as the brick's "weight" in some units. It is the user who decides which property of the brick to assign as its abstract capacity, and in which units.
In particular, it can be the size of the block device in kilobytes, or its size in megabytes, or its throughput in MB/s, or another geometric or physical parameter of the device associated with the brick. It is important that capacities of all bricks of the same logical volume are measured in the same units. Also, it would be utterly pointless to assign different properties as abstract capacities for bricks of the same LV - for example, the size of the block device for one brick and disk bandwidth for another.

The capacity of each brick gets initialized by the mkfs utility. By default it is calculated as the number of free blocks on the device at the very end of the formatting procedure. For a meta-data brick it is calculated as 70% of that amount. The capacity of any brick can be changed on-line by the user.

'''Capacity of a logical volume''' is defined as the sum of the capacities of its component bricks.

'''Relative capacity of a brick''' is the ratio of the brick's capacity to the volume's capacity. Relative capacity defines the portion of IO-requests that will be issued against that brick. The array of relative capacities (C1, C2, ...) of all bricks is called the volume partitioning. Obviously, C1 + C2 + ... = 1.

'''Real data space usage''' on a brick is the number of data blocks stored on that brick.

'''Ideal data space usage''' on a brick is T*C, where T is the total number of data blocks stored in the volume and C is the relative capacity of the brick.

It is recommended to compose volumes so that the space-based partitioning coincides with the throughput-based one - this is the optimal volume configuration, which provides true parallelism. If that is impossible for some reason, choose a preferred partitioning method (space-based or throughput-based). Note that space-based partitioning saves volume space, whereas throughput-based partitioning saves volume throughput.

When performing regular file operations, Reiser5 distributes data stripes throughout the volume evenly and fairly.
It means that the portion of IO-requests issued against each brick is equal to its relative capacity, that is, to the portion of capacity that the brick adds to the total volume's capacity.

In contrast with regular file operations, volume operations break the fairness of data distribution on your logical volume. To restore fairness of distribution, a special balancing procedure should be run on the volume. For example, after adding a brick to a logical volume, the balancing procedure will populate the new brick with data moved from other bricks. All volume operations except brick removal are fast, atomic and leave the volume in an unbalanced state. The brick removal operation always includes balancing, which moves data from the brick you want to remove to the other bricks of the volume. If that data migration is interrupted for some reason, the volume is marked as a "volume with incomplete brick removal".

It is allowed to perform regular file and volume operations on an unbalanced LV (assuming it is not marked with incomplete removal). However, in this case we don't guarantee a good quality of data distribution on your LV. In addition, on a volume with incomplete removal you won't be able to perform regular volume operations - first you will need to complete the removal by running a special removal completion procedure on your volume.

= Prepare Software and Hardware =

Build, install and boot a kernel with Reiser4 of software framework release number 5.X.Y. Kernel patches can be found [https://sourceforge.net/projects/reiser4/files/v5-unstable/ here]. Note that the Linux kernel and GNU utilities still recognize this testing software as "Reiser4".
Make sure there is the following message in the kernel logs:

 "Loading Reiser4 (Software Framework Release: 5.X.Y)"

Build and install the latest [https://sourceforge.net/projects/reiser4/files/reiser4-utils/libaal/ libaal]. Download, build and install the latest version 2.A.B of the [https://sourceforge.net/projects/reiser4/files/v5-unstable/ Reiser4progs package]. Make sure that the utility for managing logical volumes is installed (as a part of the reiser4progs package) on your machine:

 # volume.reiser4 -?

= Creating a logical volume =

Start by choosing a unique ID (UUID) for your volume. By default it is generated by the mkfs utility. However, the user can generate it himself with proper tools (e.g. uuid(1)) and store it in an environment variable for convenience:

 # VOL_ID=`uuidgen`
 # echo "Using uuid $VOL_ID"

Choose a stripe size for your logical volume. For a good quality of distribution it is recommended that the stripe size doesn't exceed 1/10000 of the volume size. On the other hand, too small stripes will increase space consumption on your meta-data brick. In our example we choose a stripe size of 512K:

 # STRIPE=512K
 # echo "Using stripe size $STRIPE"

Start by creating the first brick of your volume - the meta-data brick - passing the volume ID and stripe size to the mkfs.reiser4 utility:

 # mkfs.reiser4 -U $VOL_ID -t $STRIPE /dev/vdb1

Currently only one meta-data brick per volume is supported, so it is recommended that the size of the block device for the meta-data brick is not too small. In most cases it is enough if your meta-data brick is not smaller than 1/200 of the maximal volume size. For example, a 100G meta-data brick will be able to service a ~20T logical volume. Data and meta-data bricks don't differ from the standpoint of disk format, and there is no special option to inform the mkfs utility that we want to create a meta-data brick: the first brick in the volume automatically becomes the meta-data brick, and the other bricks are interpreted as data bricks.
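The two sizing rules of thumb above (stripe at most 1/10000 of the volume size; meta-data brick at least 1/200 of the maximal volume size) are just arithmetic. A minimal sketch; the function names are ours, and all sizes are in bytes:

```python
# Sketch: the sizing rules of thumb given above, as arithmetic.
# Names are illustrative; all sizes are in bytes.

def max_recommended_stripe(volume_size):
    # stripe size should not exceed 1/10000 of the volume size
    return volume_size // 10000

def max_volume_for_metadata_brick(md_brick_size):
    # a meta-data brick of size S can service a volume of roughly 200 * S
    return 200 * md_brick_size

GiB = 1024 ** 3
TiB = 1024 ** 4

# A 100G meta-data brick services roughly a 20T volume, as stated above:
print(max_volume_for_metadata_brick(100 * GiB) // TiB)  # prints 19 (i.e. ~20T)
```

These are recommendations from the text, not hard limits enforced by the tools.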
Mount your initial logical volume consisting of one meta-data brick:

 # mount /dev/vdb1 /mnt

Find a record about your volume in the output of the following command:

 # volume.reiser4 -l

Create the configuration of your logical volume (its definition is above) and store it somewhere, but not on that volume! Your logical volume is now on-line and ready to use. You can perform regular file operations and volume operations (e.g. add a data brick to your LV).

= Adding a data brick to LV =

At any time you are able to add a data brick to your LV. You can do it in parallel with regular file operations executing on this volume. Make sure, however, that no other volume operations (e.g. removing a brick) are in progress over your volume, otherwise your operation will fail with EBUSY. Obviously, adding a brick will increase the capacity of your volume.

Choose a block device for the new data brick. Make sure that it is not too large or too small: capacities of any 2 bricks of the same logical volume can not differ by more than 2^19 (~500 thousand) times. E.g. your logical volume can not contain both 1M and 2T bricks. Any attempt to add a brick of improper capacity will fail with an error.

Format it with the same volume ID and stripe size as you used for the meta-data brick, but also specify the "-a" option (to not restrict data capacity):

 # mkfs.reiser4 -U $VOL_ID -t $STRIPE -a /dev/vdb2

Important: a data brick must be formatted with the same volume ID and stripe size as the meta-data brick of your logical volume. Otherwise, the operation of adding the data brick will fail.

Update item #4 of your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration] with the UUID or name of the brick you want to add.
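The capacity constraint mentioned above can be pre-checked before formatting the new brick. A minimal sketch, assuming the 2^19 ratio limit stated in the text; the function name is ours, and capacities are the abstract per-brick numbers in whatever units you chose for this volume:

```python
# Sketch: check the capacity constraint before adding a brick: capacities of
# any two bricks of one volume may not differ by more than 2**19 times
# (per the text above; the function name is illustrative).

MAX_RATIO = 2 ** 19

def capacity_ok(capacities, new_capacity):
    caps = list(capacities) + [new_capacity]
    return max(caps) <= MAX_RATIO * min(caps)

# A 1M brick and a 2T brick (capacities in KB here) cannot coexist,
# since their ratio is 2**21:
print(capacity_ok([2 * 1024 ** 3], 1024))  # prints False
```

A failed check here corresponds to the add-brick error described above.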
To add a brick, simply pass its name as an argument to the option "-a" and specify your LV via its mount point:

 # volume.reiser4 -a /dev/vdb2 /mnt

By default, the operation of adding a brick is fast and atomic and leaves the volume in an unbalanced state, so after adding a brick you might want to run the balancing procedure, which will move a portion of data to the new brick from the other bricks of the logical volume, making the data distribution on your volume fair:

 # volume.reiser4 -b /mnt

The portion of data blocks moved during such rebalancing is equal to the relative capacity of the new brick, that is, to the portion of capacity that the new brick adds to the updated LV's capacity. This important property defines the cost of the balancing procedure: if the portion of capacity added by a brick is small, then the number of stripes moved during balancing is also small.

Specifying the option -B (--with-balance) will automatically trigger the balancing procedure after adding a brick:

 # volume.reiser4 -Ba /dev/vdb2 /mnt

Upon successful completion update your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]: increment (#2), add info about the new brick to (#3) and remove the record at (#4).

= Removing a data brick from LV =

At any time you are able to remove any data brick from your LV (assuming that your volume is not marked as a "volume with incomplete brick removal"). You can perform brick removal in parallel with regular file operations executing on that volume. Make sure, however, that no other volume operations (e.g. adding a brick) are in progress over your volume, otherwise your removal will fail with EBUSY. Obviously, the removal operation will decrease the abstract capacity of your LV.
Note that the other bricks should have enough space to store all data blocks of the brick you want to remove; otherwise, the removal operation will return an error (ENOSPC).

Suppose you want to remove brick /dev/vdb2 from your LV mounted at /mnt. Update your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration] with the UUID or name of the brick you want to remove (item #4). To remove the brick, simply pass its name as an argument to the option "-r" and specify the logical volume by its mount point:

 # volume.reiser4 -r /dev/vdb2 /mnt

The procedure of brick removal starts by moving all data from the brick you want to remove to the other bricks of your volume, so that the resulting data distribution among the rest of the bricks is also fair. The portion of data stripes moved during such migration is equal to the relative capacity of the brick to be removed (that is, to the portion of capacity that the brick added to the LV's capacity). Successful brick removal always leaves the volume in a balanced state.

So, in contrast with the operation of adding a brick, removing a brick is a rather long operation, which can be interrupted for various reasons. In this case the volume will be marked as a "volume with incomplete brick removal". To check the removal status of your LV, simply run

 # volume.reiser4 /mnt

and check the field "health". To complete brick removal in the current mount session, simply run

 # volume.reiser4 -R /mnt

Note that the option -R (--finish-removal) doesn't accept any arguments.

On success, update your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]: remove the information about the brick /dev/vdb2 at #3 and #4.
Check your kernel logs: they should contain a message that brick /dev/vdb2 has been unregistered. Now the device /dev/vdb2 doesn't belong to the logical volume any more, and you can reuse it for other purposes (re-format, etc.).

= Changing brick's capacity =

At any time (assuming that no other volume operation is in progress) you can change the abstract capacity of any brick to some new non-zero value. Changing capacity always changes the volume partitioning and therefore breaks fairness of distribution, so Reiser5 automatically launches rebalancing to make sure that the resulting distribution is fair for the new set of capacities. In particular, increasing a brick's capacity will move some data from other bricks to the brick whose capacity was increased; decreasing a brick's capacity will move some data from the brick whose capacity was decreased to the other bricks.

To change the abstract capacity of brick /dev/vdb1 to a new value (e.g. 200000), simply run

 # volume.reiser4 -z /dev/vdb1 -c 200000 /mnt

pronounced as "resize brick /dev/vdb1 to new capacity 200000 in the volume mounted at /mnt".

The operation of changing capacity can return an error. Most likely it is -ENOSPC, which is a side effect of concurrent regular file writes. In this case check the status of your LV. If it is unbalanced, consider removing some files from your LV and complete balancing by running

 # volume.reiser4 -b /mnt

Otherwise, repeat the operation from scratch.

Comment. Changing a brick's capacity to 0 is undefined and will return an error. Consider the brick removal operation instead.

= Operations with meta-data brick =

The meta-data brick can also contain data stripes and participate in data distribution like the other (data) bricks, so all the volume operations described above are also applicable to the meta-data brick.
Note, however, that it is impossible to completely remove the meta-data brick from the logical volume for obvious reasons (meta-data need to be stored somewhere), so the brick removal operation applied to the meta-data brick actually removes it from the Data Storage Array (DSA), not from the logical volume. The DSA is the subset of the LV consisting of the bricks participating in data distribution. Once you remove the meta-data brick from the DSA, that brick will be used only to store meta-data. The operation of adding a brick, applied to the meta-data brick, returns it back to the DSA.

Important: Reiser5 doesn't count busy data and meta-data blocks separately. So, in contrast with data bricks (which contain only data), you are not able to find out the real space occupied by data blocks on the meta-data brick - Reiser5 knows only the total space occupied.

To check the status of the meta-data brick, simply run

 # volume.reiser4 /mnt

and compare the values of "bricks total" and "bricks in DSA". If they are equal, the meta-data brick participates in data distribution. Otherwise, "bricks total" should be 1 more than "bricks in DSA", which indicates that the meta-data brick doesn't participate in data distribution (and therefore doesn't contain data blocks). Note that other cases are impossible: for data bricks, participation in the LV and in the DSA is always equivalent.
Let' build a LV of 3 bricks (one 10G meta-data brick sdb1, and two data bricks: sdc1 (10G), sdd1(5G)) with space-based partitioning: # VOL_ID=`uuid -v4` # echo "Using uuid $VOL_ID" # mkfs.reiser4 -U $VOL_ID -y -t 256K /dev/vdb1 # mkfs.reiser4 -U $VOL_ID -y -a -t 256K /dev/vdc1 # mkfs.reiser4 -U $VOL_ID -y -a -t 256K /dev/vdd1 # mount /dev/vdb1 /mnt Fill the meta-data brick with data: # dd if=/dev/zero of=/mnt/myfile bs=256K No space left on device... Add data-bricks /dev/sdc1 and dev/sdd1 to the volume: # volume.reiser4 -a /dev/vdc1 /mnt # volume.reiser4 -a /dev/vdd1 /mnt Move all data blocks to the newly added bricks: # volume.reiser4 -r /dev/vdb1 /mnt # sync Now meta-data brick doesn't contain data blocks (only meta-data ones), so that we can calculate quality of data distribution # volume.reiser4 /mnt -p0 blocks used: 503 # volume.reiser4 /mnt -p1 blocks used: 1657203 system blocks: 115 data capacity: 2621069 # volume.reiser4 /mnt -p2 blocks used: 833001 system blocks: 73 data capacity: 1310391 Basing on the statistics above calculate quality of distribution. Total data capacity of the volume: C = 2621069 + 1310391 = 3931460 Relative capacities of data bricks: C1 = 2621069 /(2621069 + 1310391) = 0.6667 C2 = 1310464 /(2621069 + 1310391) = 0.3333 Real space usage on data bricks (blocks used - system blocks): R1 = 1657203 - 115 = 1657088 R2 = 833001 - 73 = 832928 Space usage on the volume: R = R1 + R2 = 1657088 + 832928 = 2490016 Ideal data space usage on data bricks: I1 = C1 * R = 0.6667 * 2490016 = 1660094 I2 = C2 * R = 0.3333 * 2490016 = 829922 Deviation: D = (R1, R2) - (I1, I2) = (3006, -3006) Relative deviation: D/R = (-0.0012, 0.0012) Quality of distribution: Q = 1 - max(|D1|, |D1|) = 1 - 0.0012 = 0.9988 Comment. For any specified number of bricks N and quality of distribution Q it is possible to find a configuration of a logical volume composed of N bricks, so that quality of distribution on that volume will be better than Q. Comment. 
Before working with logical volumes you need to understand some basic [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Background principles]. A logical volume (LV) can be composed of any number of block devices that differ in physical and geometric parameters. However, the optimal configuration (true parallelism) imposes some restrictions and dependencies on the sizes of those devices.

WARNING: This software is not yet stable. Don't put important data on logical volumes managed by software of release number 5.X.Y, and don't mount your old partitions on kernels with Reiser4 of SFRN 5.X.Y before its stabilization.

IMPORTANT: Currently there are no tools to manage Reiser5 logical volumes off-line, so it is strongly recommended to save/update the configuration of your LV in a file which doesn't belong to that volume.

= Basic definitions. Volume configuration. Brick's capacity. Partitioning. Fair distribution. Balancing =

The basic configuration of a logical volume is the following information:

1) volume UUID;

2) number of bricks in the volume;

3) list of brick names or UUIDs in the volume;

4) UUID or name of the brick being added/removed (if any); that brick is not counted in (2) and (3).
Item #4 exists to handle incomplete operations, interrupted for various reasons (system crash, hard reset, etc.), when bringing logical volumes on-line.

For each volume, its configuration should be stored somewhere (but not on that volume!) and properly updated before and after each volume operation performed on that volume. We make the user responsible for this. The volume configuration is needed to facilitate deploying the volume.

'''Abstract capacity''' (or simply capacity) of a brick is a positive integer. Capacity is a brick property defined by the user; don't confuse it with the size of the block device. Think of it as the brick's "weight" in some units. It is the user who decides which property of the brick to assign as its abstract capacity, and in which units. In particular, it can be the size of the block device in kilobytes or megabytes, its throughput in MB/sec, or any other geometric or physical parameter of the device associated with the brick. It is important that the capacities of all bricks of the same logical volume are measured in the same units. Likewise, it would be pointless to assign different properties as abstract capacities for bricks of the same LV (for example, block device size for one brick and disk bandwidth for another).

The capacity of each brick is initialized by the mkfs utility. By default it is calculated as the number of free blocks on the device at the very end of the formatting procedure; for the meta-data brick it is 70% of that amount. The capacity of any brick can be changed on-line by the user.

'''Capacity of a logical volume''' is defined as the sum of the capacities of its component bricks.

'''Relative capacity of a brick''' is the ratio of the brick's capacity to the volume's capacity. Relative capacity defines the portion of IO requests that will be issued against that brick. The array of relative capacities (C1, C2, ...) of all bricks is called the volume partitioning. Obviously, C1 + C2 + ... = 1.
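Since the user is responsible for tracking the configuration, it helps to keep it in a small machine-readable file. A minimal sketch, assuming a JSON record of items (1)-(4) above (the file name and field names are our own illustration; Reiser5 prescribes no particular format):

```python
import json

# Hypothetical record of a volume configuration, covering items (1)-(4)
# above. Field names are illustrative only; keep the file OUTSIDE the
# logical volume it describes.
config = {
    "volume_uuid": "2f1e8c1a-5d30-4f6e-9c1b-7a2d4e8b9a10",  # item (1)
    "brick_count": 2,                                        # item (2)
    "bricks": ["/dev/vdb1", "/dev/vdc1"],                    # item (3)
    "pending_brick": None,                                   # item (4)
}

record = json.dumps(config, indent=2)
print(record)
```

Such a file would be rewritten around every volume operation, e.g. setting "pending_brick" before adding a brick and clearing it afterwards.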
'''(Real) data space usage''' on a brick is the number of data blocks stored on that brick.

'''Ideal (or expected) data space usage''' on a brick is T*C, where T is the total number of data blocks stored in the volume and C is the relative capacity of the brick.

It is recommended to compose volumes so that the space-based partitioning coincides with the throughput-based one - this is the optimal volume configuration, which provides true parallelism. If that is impossible for some reason, then choose a preferred partitioning method (space-based or throughput-based). Note that space-based partitioning saves volume space, whereas throughput-based partitioning saves volume throughput.

When performing regular file operations, Reiser5 distributes data stripes throughout the volume evenly and fairly. This means that the portion of IO requests issued against each brick is equal to its relative capacity, that is, to the portion of capacity that the brick contributes to the total volume capacity.

In contrast with regular file operations, volume operations break the fairness of data distribution on your logical volume. To restore fairness, a special balancing procedure should be run on the volume. For example, after adding a brick to a logical volume, the balancing procedure will populate the new brick with data moved from other bricks.

All volume operations except brick removal are fast, atomic, and leave the volume in an unbalanced state. Brick removal always includes balancing, which moves data from the brick you want to remove to the other bricks of the volume. If that data migration is interrupted for some reason, the volume is marked as a "volume with incomplete brick removal".

It is allowed to perform regular file and volume operations on an unbalanced LV (assuming it is not marked with incomplete removal). However, in this case we don't guarantee good quality of data distribution on your LV.
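To make the definitions concrete, here is a small calculation of relative capacities and ideal data space usage (all capacity figures and T are invented for illustration):

```python
# Relative capacities (the "volume partitioning") and ideal data space
# usage, per the definitions above. Capacities and T are made-up values.
capacities = {"/dev/vdb1": 100_000, "/dev/vdc1": 200_000, "/dev/vdd1": 100_000}

volume_capacity = sum(capacities.values())       # capacity of the LV
relative = {b: c / volume_capacity for b, c in capacities.items()}
assert abs(sum(relative.values()) - 1.0) < 1e-9  # C1 + C2 + ... = 1

T = 1_000_000  # total number of data blocks stored in the volume
ideal = {b: T * c for b, c in relative.items()}  # ideal usage = T * C

print(relative["/dev/vdc1"])  # 0.5 - half of the IO requests go here
print(ideal["/dev/vdc1"])     # 500000.0
```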
In addition, on a volume with incomplete removal you won't be able to perform further volume operations - first you will need to complete the removal by running a special removal-completion procedure on your volume.

= Prepare Software and Hardware =

Build, install, and boot a kernel with Reiser4 of software framework release number 5.X.Y. Kernel patches can be found [https://sourceforge.net/projects/reiser4/files/v5-unstable/ here]. Note that the Linux kernel and GNU utilities still recognize the testing software as "Reiser4". Make sure the following message appears in the kernel logs:

 Loading Reiser4 (Software Framework Release: 5.X.Y)

Build and install the latest [https://sourceforge.net/projects/reiser4/files/reiser4-utils/libaal/ libaal].

Download, build, and install the latest version 2.A.B of the [https://sourceforge.net/projects/reiser4/files/v5-unstable/ Reiser4progs package]. Make sure that the utility for managing logical volumes is installed (as a part of reiser4progs) on your machine:

 # volume.reiser4 -?

= Creating a logical volume =

Start by choosing a unique ID (UUID) for your volume. By default it is generated by the mkfs utility, but you can generate it yourself with a suitable tool (e.g. uuid(1)) and store it in an environment variable for convenience:

 # VOL_ID=`uuidgen`
 # echo "Using uuid $VOL_ID"

Choose a stripe size for your logical volume. For good quality of distribution it is recommended that the stripe not exceed 1/10000 of the volume size. On the other hand, too small a stripe will increase space consumption on your meta-data brick. In our example we choose a stripe size of 512K:

 # STRIPE=512K
 # echo "Using stripe size $STRIPE"

Start by creating the first brick of your volume - the meta-data brick - passing the volume ID and stripe size to the mkfs.reiser4 utility:

 # mkfs.reiser4 -U $VOL_ID -t $STRIPE /dev/vdb1

Currently only one meta-data brick per volume is supported, so it is recommended that the block device for the meta-data brick not be too small.
In most cases it is enough for your meta-data brick to be no smaller than 1/200 of the maximal volume size. For example, a 100G meta-data brick will be able to service a ~20T logical volume.

Data and meta-data bricks don't differ from the standpoint of disk format, and there is no special option to tell the mkfs utility that we want to create a meta-data brick: the first brick in the volume automatically becomes the meta-data brick, and the other bricks are interpreted as data bricks.

Mount your initial logical volume consisting of one meta-data brick:

 # mount /dev/vdb1 /mnt

Find the record about your volume in the output of the following command:

 # volume.reiser4 -l

Create the configuration of your logical volume (its definition is above) and store it somewhere - but not on that volume! Your logical volume is now on-line and ready to use. You can perform regular file operations and volume operations (e.g. add a data brick to your LV).

= Adding a data brick to LV =

At any time you can add a data brick to your LV. You can do it in parallel with regular file operations executing on the volume. Make sure, however, that no other volume operation (e.g. removing a brick) is in progress on your volume, otherwise your operation will fail with EBUSY. Obviously, adding a brick will increase the capacity of your volume.

Choose a block device for the new data brick. Make sure that it is neither too large nor too small: the capacities of any two bricks of the same logical volume can not differ by more than 2^19 (~500,000) times. E.g. your logical volume can not contain both a 1M and a 2T brick. Any attempt to add a brick of improper capacity will fail with an error.

Format it with the same volume ID and stripe size as you used for the meta-data brick, but also specify the "-a" option (to not restrict data capacity).
 # mkfs.reiser4 -U $VOL_ID -t $STRIPE -a /dev/vdb2

Important: the data brick must be formatted with the same volume ID and stripe size as the meta-data brick of your logical volume. Otherwise, the operation of adding the data brick will fail.

Update item #4 of your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration] with the UUID or name of the brick you want to add. To add the brick, simply pass its name as an argument to the option "-a" and specify your LV via its mount point:

 # volume.reiser4 -a /dev/vdb2 /mnt

By default the operation of adding a brick is fast and atomic and leaves the volume in an unbalanced state, so after adding a brick you might want to run the balancing procedure, which will move a portion of data from the other bricks to the new brick, making data distribution on your volume fair:

 # volume.reiser4 -b /mnt

The portion of data blocks moved during such rebalancing is equal to the relative capacity of the new brick, that is, to the portion of capacity that the new brick adds to the updated LV's capacity. This important property defines the cost of the balancing procedure: if the portion of capacity added by a brick is small, then the number of stripes moved during balancing is also small.

Specifying the option -B (--with-balance) will automatically trigger the balancing procedure after adding the brick:

 # volume.reiser4 -Ba /dev/vdb2 /mnt

Upon successful completion update your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]: increment (#2), add info about the new brick to (#3), and remove the record at (#4).
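The balancing cost described above can be sketched in a couple of lines (the capacity figures are invented):

```python
# Portion of data stripes moved when balancing after a brick addition:
# it equals the relative capacity of the new brick (see the text above).
# Capacity values are made up for illustration.
old_volume_capacity = 3_000_000   # sum of capacities before the addition
new_brick_capacity = 1_000_000

portion_moved = new_brick_capacity / (old_volume_capacity + new_brick_capacity)
print(f"balancing moves ~{portion_moved:.0%} of the data stripes")  # ~25%
```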
= Removing a data brick from LV =

At any time you can remove any data brick from your LV (assuming that your volume is not marked as a "volume with incomplete brick removal"). You can perform brick removal in parallel with regular file operations executing on that volume. Make sure, however, that no other volume operation (e.g. adding a brick) is in progress on your volume, otherwise your removal will fail with EBUSY. Obviously, the removal operation will decrease the abstract capacity of your LV. Note that the other bricks must have enough space to store all the data blocks of the brick you want to remove; otherwise the removal operation will return an error (ENOSPC).

Suppose you want to remove brick /dev/vdb2 from your LV mounted at /mnt. Update your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration] with the UUID or name of the brick you want to remove (item #4). To remove the brick, simply pass its name as an argument to option "-r" and specify the logical volume by its mount point:

 # volume.reiser4 -r /dev/vdb2 /mnt

The procedure of brick removal starts by moving all data from the brick you want to remove to the other bricks of your volume, so that the resulting data distribution among the remaining bricks is also fair. The portion of data stripes moved during such migration is equal to the relative capacity of the brick to be removed (that is, to the portion of capacity that the brick added to the LV's capacity). Successful brick removal always leaves the volume in a balanced state.

So, in contrast with the operation of adding a brick, removing a brick is a rather long operation, which can be interrupted for various reasons. In that case the volume will be marked as a "volume with incomplete brick removal". To check the removal status of your LV simply run

 # volume.reiser4 /mnt

and check the field "health".
To complete brick removal in the current mount session simply run

 # volume.reiser4 -R /mnt

Note that the option -R (--finish-removal) doesn't accept any arguments.

On success, update your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]: remove the information about the brick /dev/vdb2 from #3 and #4. Check your kernel logs: they should contain a message that brick /dev/vdb2 has been unregistered. Now the device /dev/vdb2 doesn't belong to the logical volume any more, and you can reuse it for other purposes (re-format, etc).

= Changing brick's capacity =

At any time (assuming no other volume operation is in progress) you can change the abstract capacity of any brick to some new non-zero value. Changing capacity always changes the volume partitioning, and therefore breaks fairness of distribution, so Reiser5 automatically launches rebalancing to make sure that the resulting distribution is fair for the new set of capacities. In particular, increasing a brick's capacity will move some data from other bricks to the brick whose capacity was increased; decreasing a brick's capacity will move some data from the brick whose capacity was decreased to the other bricks.

To change the abstract capacity of brick /dev/vdb1 to a new value (e.g. 200000), simply run

 # volume.reiser4 -z /dev/vdb1 -c 200000 /mnt

pronounced as "resize brick /dev/vdb1 to new capacity 200000 in the volume mounted at /mnt".

The operation of changing capacity can return an error. Most likely it is ENOSPC, a side effect of concurrent regular file writes. In this case check the status of your LV. If it is unbalanced, then consider removing some files from your LV and complete the balancing by running

 # volume.reiser4 -b /mnt

Otherwise, repeat the operation from scratch.

Comment. Changing a brick's capacity to 0 is undefined and will return an error.
Consider the brick removal operation instead.

= Operations with meta-data brick =

The meta-data brick can also contain data stripes and participate in data distribution like the data bricks, so all the volume operations described above are applicable to the meta-data brick as well. Note, however, that it is impossible to completely remove the meta-data brick from the logical volume for obvious reasons (meta-data needs to be stored somewhere), so the brick removal operation applied to the meta-data brick actually removes it from the Data Storage Array (DSA), not from the logical volume. The DSA is the subset of the LV consisting of the bricks participating in data distribution. Once you remove the meta-data brick from the DSA, that brick will be used only to store meta-data. The operation of adding a brick, applied to the meta-data brick, returns it to the DSA.

Important: Reiser5 doesn't count busy data and meta-data blocks separately. So, in contrast with data bricks (which contain only data), you are not able to find out the real space occupied by data blocks on the meta-data brick - Reiser5 knows only the total space occupied.

To check the status of the meta-data brick simply run

 # volume.reiser4 /mnt

and compare the values of "bricks total" and "bricks in DSA". If they are equal, then the meta-data brick participates in data distribution. Otherwise, "bricks total" should be 1 more than "bricks in DSA", which indicates that the meta-data brick doesn't participate in data distribution (and therefore doesn't contain data blocks). Other cases are impossible: for data bricks, participation in the LV and in the DSA is always equivalent.

= Unmounting a logical volume =

To terminate a mount session, just issue the usual umount against the mount point:

 # umount /mnt

Note that after unmounting the volume all bricks by default remain registered in the system until system shutdown.
If you want to unregister a brick before system shutdown, simply issue the following command:

 # volume.reiser4 -u BRICK_NAME

= Deploying a logical volume after correct unmount =

Make sure (by checking your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]) that all bricks of the volume are registered in the system. To register a brick, issue the following command:

 # volume.reiser4 -g BRICK_NAME

The list of all volumes and bricks registered in the system can be found in the output of the following command:

 # volume.reiser4 -l

Issue the usual mount(8) command against one of the bricks of your volume. It is recommended to issue it against the meta-data brick.

NOTE: Reiser5 will refuse to mount a logical volume when a wrong (incomplete or redundant) set of bricks is registered in the system. A redundant set of bricks appears, for example, when you mistakenly register a brick that was earlier removed from the logical volume.

= Deploying a logical volume after correct shutdown =

First of all, check the [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing configuration] of your volume and make sure that all its bricks (data and meta-data) are registered in the system. The list of registered bricks can be printed with

 # volume.reiser4 -l

Also make sure that the set of registered bricks doesn't contain bricks not mentioned in the [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration].

Important: Reiser5 will refuse to mount a logical volume when a wrong (incomplete or redundant) set of bricks is registered in the system.
A redundant set of bricks appears, for example, when you mistakenly register a brick that was removed from the logical volume. For this reason we strongly recommend keeping track of your LV: store its [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing configuration] somewhere - but not on that volume! - and don't forget to update that configuration after '''every''' volume operation. If you have lost the configuration of your LV and don't remember it (which is most likely for large volumes), it will be rather painful to restore: currently there are no tools to manage logical volumes off-line, so users have to do this on their own. It is not at all difficult.

To register a brick in the system use the following command:

 # volume.reiser4 -g BRICK_NAME

To print a list of all registered bricks use

 # volume.reiser4 -l

Now mount your LV by simply issuing a mount(8) command against one of its bricks. We recommend issuing it against the meta-data brick.

Comment. Reiser5 always tries to register the brick which is passed to the mount command as an argument, so it is not necessary to preregister the brick you want to issue the mount command against.

= Deploying a logical volume after hard reset or system crash =

If no volume operations were interrupted by the hard reset or system crash, then just follow the instructions in this [https://reiser4.wiki.kernel.org/index.php?title=Logical_Volumes_Administration#Deploying_a_logical_volume_after_correct_shutdown section]. In Reiser5 only a restricted number of bricks participate in every transaction; the maximal number of such bricks can be specified by the user. At mount time a transaction replay procedure will be launched on each such brick independently, in parallel.
Depending on the kind of interrupted volume operation, perform one of the following actions:

== Volume balancing was interrupted ==

Check your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]. Register the complete set of bricks and mount the volume. Check the balanced status of your LV by running

 # volume.reiser4 /mnt

and checking the "balanced" field. If the volume is unbalanced, then complete the balancing by running

 # volume.reiser4 -b /mnt

== Brick removal was interrupted ==

Check your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]. Register the new set of bricks (that is, the set of bricks without the brick you wanted to remove). Try to mount the volume. If that fails, also register the brick you wanted to remove and try to mount again. Check the status of your LV by running

 # volume.reiser4 /mnt

and checking the value of "health". If required, complete the brick removal by running

 # volume.reiser4 -R /mnt

Note that the option -R doesn't accept any arguments. After successful removal completion the brick will be automatically removed from the volume and unregistered. Verify this by checking the status of your LV and the list of registered bricks:

 # volume.reiser4 /mnt
 # volume.reiser4 -l

Upon successful completion update your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration] accordingly.
= LV monitoring =

Common info about the LV mounted at /mnt:

 # volume.reiser4 /mnt

 ID:            volume UUID
 volume:        ID of the plugin managing the volume
 distribution:  ID of the distribution plugin
 stripe:        stripe size in bytes
 segments:      number of hash space segments (for distribution)
 bricks total:  total number of bricks in the volume
 bricks in DSA: number of bricks participating in data distribution
 balanced:      balanced status of the volume
 health:        brick removal completion status

Info about its brick of index J:

 # volume.reiser4 -p J /mnt

 internal ID:   brick's "internal ID" and its status in the volume
 external ID:   brick's UUID
 device name:   name of the block device associated with the brick
 block count:   size of the block device in blocks
 blocks used:   total number of occupied blocks on the device
 system blocks: minimal possible number of busy blocks on that device
 data capacity: abstract capacity of the brick
 space usage:   portion of occupied blocks on the device
 in DSA:        participation in regular data distribution
 is proxy:      participation in data tiering (Burst Buffers, etc)

Comment. When retrieving brick info, make sure that no volume operations on that volume are in progress; otherwise the command above will return an error (EBUSY).

WARNING. Brick info obtained this way is not necessarily the most recent. To get up-to-date info, run sync(1) and make sure that no regular file operations are in progress.

= Checking free space =

To check the number of available free blocks on a volume mounted at /mnt, make sure that no regular file operations or volume operations are in progress on that volume, then run

 # sync
 # df --block-size=4K /mnt

To check the number of free blocks on the brick of index J, run

 # volume.reiser4 -p J /mnt

and calculate the difference between "block count" and "blocks used".

Comment. Not all free blocks on a brick/volume are available for use.
The number of available free blocks is always ~95% of the total number of free blocks (Reiser4 reserves 5% to make sure that regular file truncate operations won't fail).

NOTE: volume.reiser4 shows the total number of free blocks, whereas df(1) shows the number of available free blocks. The "space usage" statistic shows the portion of busy blocks on an individual brick. For the reasons explained above, "space usage" on any brick can not exceed 0.95.

= Checking quality of data distribution =

Quality of data distribution is a measure of the deviation of the real data space usage from the ideal one defined by the volume partitioning. The smaller the deviation, the better the distribution quality.

Checking quality of distribution makes sense only when your volume partitioning is space-based, or coincides with the space-based one. If your partitioning is throughput-based and doesn't coincide with the space-based one, then the quality of the actual data distribution can be rather bad: in that case the file system tries to prevent low-performance devices from becoming a bottleneck, and effective space usage is not a high priority.

Checking quality of data distribution is based on the free-blocks accounting provided by the file system. Note that the file system doesn't count busy data and meta-data blocks separately, so you are not able to find the real data space usage - and hence to check quality of distribution - when the meta-data brick contains data blocks.

To check quality of distribution:

1) make sure that the meta-data brick doesn't contain data blocks;

2) make sure that no regular file or volume operations are currently in progress;

3) find the "blocks used", "system blocks" and "data capacity" statistics for each data brick:

 # sync
 # volume.reiser4 -p 1 /mnt
 ...
 # volume.reiser4 -p N /mnt

4) find the real data space usage on each brick;

5) calculate the partitioning and the ideal data space usage on each data brick;

6) find the deviation of (4) from (5).

Example.
Let' build a LV of 3 bricks (one 10G meta-data brick sdb1, and two data bricks: sdc1 (10G), sdd1(5G)) with space-based partitioning: # VOL_ID=`uuid -v4` # echo "Using uuid $VOL_ID" # mkfs.reiser4 -U $VOL_ID -y -t 256K /dev/vdb1 # mkfs.reiser4 -U $VOL_ID -y -a -t 256K /dev/vdc1 # mkfs.reiser4 -U $VOL_ID -y -a -t 256K /dev/vdd1 # mount /dev/vdb1 /mnt Fill the meta-data brick with data: # dd if=/dev/zero of=/mnt/myfile bs=256K No space left on device... Add data-bricks /dev/sdc1 and dev/sdd1 to the volume: # volume.reiser4 -a /dev/vdc1 /mnt # volume.reiser4 -a /dev/vdd1 /mnt Move all data blocks to the newly added bricks: # volume.reiser4 -r /dev/vdb1 /mnt # sync Now meta-data brick doesn't contain data blocks (only meta-data ones), so that we can calculate quality of data distribution # volume.reiser4 /mnt -p0 blocks used: 503 # volume.reiser4 /mnt -p1 blocks used: 1657203 system blocks: 115 data capacity: 2621069 # volume.reiser4 /mnt -p2 blocks used: 833001 system blocks: 73 data capacity: 1310391 Basing on the statistics above calculate quality of distribution. Total data capacity of the volume: C = 2621069 + 1310391 = 3931460 Relative capacities of data bricks: C1 = 2621069 /(2621069 + 1310391) = 0.6667 C2 = 1310464 /(2621069 + 1310391) = 0.3333 Real space usage on data bricks (blocks used - system blocks): R1 = 1657203 - 115 = 1657088 R2 = 833001 - 73 = 832928 Space usage on the volume: R = R1 + R2 = 1657088 + 832928 = 2490016 Ideal data space usage on data bricks: I1 = C1 * R = 0.6667 * 2490016 = 1660094 I2 = C2 * R = 0.3333 * 2490016 = 829922 Deviation: D = (R1, R2) - (I1, I2) = (3006, -3006) Relative deviation: D/R = (-0.0012, 0.0012) Quality of distribution: Q = 1 - max(|D1|, |D1|) = 1 - 0.0012 = 0.9988 Comment. For any specified number of bricks N and quality of distribution Q it is possible to find a configuration of a logical volume composed of N bricks, so that quality of distribution on that volume will be better than Q. Comment. 
Quality of distribution Q doesn't depend on the number of bricks in the logical volume. This is a theorem, which can be strictly proven. = FAQ = Q. What happens if I lose a device-component (due to a breakdown, etc) of my logical volume? A. Bodies of some your regular files will become "punched" in random places. Portion of such files depends on the relative capacity of the lost brick, on the number of bricks in the logical volume, and on other factors. Fsck will be able to detect and remove such files with corrupted bodies. Nevertheless, we recommend to consider mirroring your bricks (e.g. by software, or hardware RAID-1) to avoid such highly unpleasant situations. [[category:Reiser4]] 8adb1f9e571dcedabd9d25aaf9ac8da504483715 4411 4410 2020-11-11T22:15:12Z Edward 4 /* LV monitoring */ Before working with logical volumes you need to understand some basic [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Background principles]. Logical volume (LV) can be composed of any number of block devices, different in physical and geometric parameters. However the optimal configuration (true parallelism) imposes some restrictions and dependencies on the size of such devices. WARNING: The stuff is not stable. Don't put important data to logical volumes managed by software of release number 5.X.Y. Also don't mount your old partitions in kernels with Reiser4 of SFRN 5.X.Y before its stabilization IMPORTANT: Currently there is no tools to manage Reiser5 logical volumes off-line, so it it strongly recommended to save/update configurations of your LV in a file, which doesn't belong to that volume. = Basic definitions. Volume configuration. Brick's capacity. Partitioning. Fair distribution. Balancing = Basic configuration of a logical volume is the following information: 1) Volume UUID; 2) Number of bricks in the volume; 3) List of brick names or UUIDs in the volume; 4) UUID or name of the brick to be added/removed (if any). That brick is not counted in (2) and (3). 
Item #4 exists to handle operations interrupted for various reasons (system crash, hard reset, etc.) when bringing logical volumes on-line.

For each volume, its configuration should be stored somewhere (but not on that volume!) and properly updated before and after each volume operation performed on it. We make the user responsible for this. The volume configuration is needed to facilitate deploying the volume.

'''Abstract capacity''' (or simply capacity) of a brick is a positive integer. Capacity is a brick property defined by the user; don't confuse it with the size of the block device. Think of it as the brick's "weight" in some units. It is the user who decides which property of the brick to assign as its abstract capacity, and in which units. In particular, it can be the size of the block device in kilobytes or megabytes, its throughput in MB/sec, or another geometric or physical parameter of the device associated with the brick. It is important that the capacities of all bricks of the same logical volume are measured in the same units. Likewise, it would be pointless to assign different properties as abstract capacities of bricks of the same LV, for example, block device size for one brick and disk bandwidth for another.

The capacity of each brick gets initialized by the mkfs utility. By default it is calculated as the number of free blocks on the device at the very end of the formatting procedure; for the meta-data brick it is calculated as 70% of that amount. The capacity of any brick can be changed on-line by the user.

'''Capacity of a logical volume''' is defined as the sum of the capacities of its component bricks.

'''Relative capacity of a brick''' is the ratio of the brick's capacity to the volume's capacity. Relative capacity defines the portion of IO requests that will be issued against that brick. The array of relative capacities (C1, C2, ...) of all bricks is called the volume partitioning. Obviously, C1 + C2 + ... = 1.
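The four configuration items can be kept as a small plain-text file on a device outside the volume. A minimal sketch, assuming a file path and record format of our own invention (reiser4progs does not define any configuration-file format):

```shell
# Sketch: store the LV configuration in a file that does NOT live on the volume.
# The path and the uuid:/bricks:/brick:/pending: record names are illustrative
# conventions, not anything defined by reiser4progs.
CONF="${CONF:-/tmp/lv-vol1.conf}"

write_config() {
    vol_uuid="$1"; shift            # $1 = volume UUID, rest = bricks in order
    {
        echo "uuid: $vol_uuid"      # item 1: volume UUID
        echo "bricks: $#"           # item 2: number of bricks
        for b in "$@"; do
            echo "brick: $b"        # item 3: brick list
        done
        echo "pending:"             # item 4: brick being added/removed (none yet)
    } > "$CONF"
}

write_config 1b4e28ba-2fa1-11d2-883f-b9a761bde3fb /dev/vdb1 /dev/vdc1
```

The file is rewritten as a whole after each volume operation, mirroring the bookkeeping rule above: record the pending brick first, run the operation, then update the brick count and list.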
'''(Real) data space usage''' on a brick is the number of data blocks stored on that brick.

'''Ideal (or expected) data space usage''' on a brick is T*C, where T is the total number of data blocks stored in the volume and C is the relative capacity of the brick.

It is recommended to compose volumes so that the space-based partitioning coincides with the throughput-based one: this is the optimal volume configuration, which provides true parallelism. If that is impossible for some reason, choose a preferred partitioning method (space-based or throughput-based). Note that space-based partitioning saves volume space, whereas throughput-based partitioning saves volume throughput.

When performing regular file operations, Reiser5 distributes data stripes throughout the volume evenly and fairly. That means the portion of IO requests issued against each brick is equal to its relative capacity, i.e. to the portion of capacity that the brick contributes to the total volume capacity.

In contrast with regular file operations, volume operations break the fairness of data distribution on your logical volume. To restore fairness, a special balancing procedure should be run on the volume. For example, after adding a brick to a logical volume, the balancing procedure populates the new brick with data moved from the other bricks.

All volume operations except brick removal are fast and atomic, and leave the volume in an unbalanced state. Brick removal always includes balancing, which moves data from the brick you want to remove to the other bricks of the volume. If that data migration is interrupted for some reason, the volume is marked as a "volume with incomplete brick removal".

It is allowed to perform regular file and volume operations on an unbalanced LV (assuming the imbalance is not due to an incomplete removal). However, in this case we don't guarantee good quality of data distribution on your LV.
In addition, on a volume with incomplete removal you won't be able to perform regular volume operations: first you will need to complete the removal by running a special removal completion procedure on the volume.

= Prepare Software and Hardware =

Build, install and boot a kernel with Reiser4 of software framework release number 5.X.Y. Kernel patches can be found [https://sourceforge.net/projects/reiser4/files/v5-unstable/ here]. Note that the Linux kernel and GNU utilities still identify the testing code as "Reiser4". Make sure the following message appears in the kernel logs:

 Loading Reiser4 (Software Framework Release: 5.X.Y)

Build and install the latest [https://sourceforge.net/projects/reiser4/files/reiser4-utils/libaal/ libaal].

Download, build and install the latest version 2.A.B of the [https://sourceforge.net/projects/reiser4/files/v5-unstable/ Reiser4progs package]. Make sure that the utility for managing logical volumes is installed (as a part of reiser4progs) on your machine:

 # volume.reiser4 -?

= Creating a logical volume =

Start by choosing a unique ID (UUID) for your volume. By default it is generated by the mkfs utility. However, you can generate it yourself with appropriate tools (e.g. uuidgen(1)) and store it in an environment variable for convenience:

 # VOL_ID=`uuidgen`
 # echo "Using uuid $VOL_ID"

Choose a stripe size for your logical volume. For good distribution quality it is recommended that a stripe not exceed 1/10000 of the volume size. On the other hand, too small a stripe will increase space consumption on your meta-data brick. In our example we choose a stripe size of 512K:

 # STRIPE=512K
 # echo "Using stripe size $STRIPE"

Start by creating the first brick of your volume, the meta-data brick, passing the volume ID and stripe size to the mkfs.reiser4 utility:

 # mkfs.reiser4 -U $VOL_ID -t $STRIPE /dev/vdb1

Currently only one meta-data brick per volume is supported, so it is recommended that the block device for the meta-data brick not be too small.
In most cases it is enough if your meta-data brick is not smaller than 1/200 of the maximal volume size. For example, a 100G meta-data brick will be able to service a ~20T logical volume.

Data and meta-data bricks don't differ from the standpoint of disk format, and there is no special option to tell the mkfs utility that we want to create a meta-data brick: the first brick in the volume automatically becomes the meta-data brick, and the other bricks are interpreted as data bricks.

Mount your initial logical volume consisting of one meta-data brick:

 # mount /dev/vdb1 /mnt

Find the record about your volume in the output of the following command:

 # volume.reiser4 -l

Create the configuration of your logical volume (its definition is above) and store it somewhere, but not on that volume!

Your logical volume is now on-line and ready to use. You can perform regular file operations and volume operations (e.g. add a data brick to your LV).

= Adding a data brick to LV =

At any time you can add a data brick to your LV. You can do it in parallel with regular file operations executing on the volume. Make sure, however, that no other volume operation (e.g. removing a brick) is in progress on the volume, otherwise your operation will fail with EBUSY. Obviously, adding a brick will increase the capacity of your volume.

Choose a block device for the new data brick. Make sure that it is neither too large nor too small: the capacities of any 2 bricks of the same logical volume cannot differ by more than 2^19 (~500 thousand) times. E.g. your logical volume cannot contain both a 1M and a 2T brick. Any attempt to add a brick of improper capacity will fail with an error.

Format it with the same volume ID and stripe size as you used for the meta-data brick, but also specify the "-a" option (to not restrict data capacity):
 # mkfs.reiser4 -U $VOL_ID -t $STRIPE -a /dev/vdb2

Important: the data brick must be formatted with the same volume ID and stripe size as the meta-data brick of your logical volume. Otherwise, the operation of adding the data brick will fail.

Update item #4 of your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration] with the UUID or name of the brick you want to add.

To add the brick, simply pass its name as the argument of the option "-a" and specify your LV via its mount point:

 # volume.reiser4 -a /dev/vdb2 /mnt

By default the operation of adding a brick is fast and atomic and leaves the volume in an unbalanced state, so after adding a brick you might want to run the balancing procedure, which will move a portion of data from the other bricks of the logical volume to the new brick, making data distribution on your volume fair:

 # volume.reiser4 -b /mnt

The portion of data blocks moved during such rebalancing is equal to the relative capacity of the new brick, that is, to the portion of capacity that the new brick adds to the updated LV capacity. This important property defines the cost of the balancing procedure: if the portion of capacity added by a brick is small, then the number of stripes moved during balancing is also small.

Specifying the option -B (--with-balance) will automatically trigger the balancing procedure after adding the brick:

 # volume.reiser4 -Ba /dev/vdb2 /mnt

Upon successful completion update your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]: increment (#2), add info about the new brick to (#3), and remove the record at (#4).
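The bookkeeping steps above can be wrapped around the add-brick command so the saved configuration is updated before and after the operation. A hedged sketch: the config record format is the same illustrative convention as earlier in this page (not a reiser4progs format), and the `volume.reiser4` call is made through a variable so the bookkeeping logic can be exercised without a real volume:

```shell
# Sketch: keep the saved LV configuration in sync around a brick-add operation.
# CONF record names (uuid:/bricks:/brick:/pending:) are an illustrative
# convention, not a reiser4progs format.
CONF="${CONF:-/tmp/lv-vol1.conf}"
VOLCMD="${VOLCMD:-volume.reiser4}"   # swappable so the sketch runs without an LV

add_brick() {
    dev="$1"; mnt="$2"
    # item 4: record the pending brick BEFORE issuing the operation
    sed -i "s|^pending:.*|pending: $dev|" "$CONF"
    $VOLCMD -Ba "$dev" "$mnt" || return 1       # add the brick and rebalance
    # on success: bump brick count (item 2), append brick (item 3), clear item 4
    n=$(sed -n 's/^bricks: //p' "$CONF")
    sed -i "s|^bricks:.*|bricks: $((n + 1))|; s|^pending:.*|pending:|" "$CONF"
    echo "brick: $dev" >> "$CONF"
}

# Demonstration with a stub command (use the real volume.reiser4 on a live LV):
printf 'uuid: demo\nbricks: 1\nbrick: /dev/vdb1\npending:\n' > "$CONF"
VOLCMD=true
add_brick /dev/vdc1 /mnt
```

If the machine crashes between the two `sed` updates, the non-empty `pending:` record tells you which brick the interrupted operation concerned, which is exactly what item #4 is for.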
= Removing a data brick from LV =

At any time you can remove any data brick from your LV (assuming the volume is not marked as a "volume with incomplete brick removal"). You can perform brick removal in parallel with regular file operations executing on the volume. Make sure, however, that no other volume operation (e.g. adding a brick) is in progress on the volume, otherwise the removal will fail with EBUSY. Obviously, the removal operation will decrease the abstract capacity of your LV.

Note that the other bricks must have enough space to store all data blocks of the brick you want to remove; otherwise the removal operation will return an error (ENOSPC).

Suppose you want to remove brick /dev/vdb2 from your LV mounted at /mnt. Update your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration] with the UUID or name of the brick you want to remove (item #4). To remove the brick, simply pass its name as the argument of option "-r" and specify the logical volume by its mount point:

 # volume.reiser4 -r /dev/vdb2 /mnt

The procedure of brick removal starts by moving all data from the brick being removed to the other bricks of the volume, so that the resulting data distribution among the remaining bricks is also fair. The portion of data stripes moved during such migration is equal to the relative capacity of the brick being removed (that is, to the portion of capacity that the brick added to the LV capacity). Successful brick removal always leaves the volume in a balanced state.

So, in contrast with the operation of adding a brick, removing a brick is a rather long operation, which can be interrupted for various reasons. In that case the volume is marked as a "volume with incomplete brick removal". To check the removal status of your LV, simply run

 # volume.reiser4 /mnt

and check the field "health".
To complete the brick removal in the current mount session, simply run

 # volume.reiser4 -R /mnt

Note that the option -R (--finish-removal) doesn't accept any arguments.

On success, update your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]: remove the information about the brick /dev/vdb2 at (#3) and (#4). Check your kernel logs: they should contain a message that brick /dev/vdb2 has been unregistered. Now the device /dev/vdb2 no longer belongs to the logical volume, and you can reuse it for other purposes (re-format, etc.).

= Changing brick's capacity =

At any time (assuming no other volume operation is in progress) you can change the abstract capacity of any brick to a new non-zero value. Changing capacity always changes the volume partitioning and therefore breaks fairness of distribution, so Reiser5 automatically launches rebalancing to make sure the resulting distribution is fair for the new set of capacities. In particular, increasing a brick's capacity will move some data from the other bricks to the brick whose capacity was increased; decreasing a brick's capacity will move some data from that brick to the other bricks.

To change the abstract capacity of a brick /dev/vdb1 to a new value (e.g. 200000), simply run

 # volume.reiser4 -z /dev/vdb1 -c 200000 /mnt

pronounced as "resize brick /dev/vdb1 to new capacity 200000 in the volume mounted at /mnt".

The operation of changing capacity can return an error. Most likely it is ENOSPC, a side effect of concurrent regular file writes. In this case check the status of your LV. If it is unbalanced, consider removing some files from the LV and complete balancing by running

 # volume.reiser4 -b /mnt

Otherwise, repeat the operation from scratch.

Comment. Changing a brick's capacity to 0 is undefined and will return an error.
Consider the brick removal operation instead.

= Operations with meta-data brick =

The meta-data brick can also contain data stripes and participate in data distribution like the data bricks, so all the volume operations described above are applicable to the meta-data brick as well. Note, however, that it is impossible to completely remove the meta-data brick from the logical volume, for obvious reasons (meta-data need to be stored somewhere). So the brick removal operation applied to the meta-data brick actually removes it from the Data Storage Array (DSA), not from the logical volume. The DSA is the subset of the LV consisting of the bricks participating in data distribution. Once you remove the meta-data brick from the DSA, that brick will be used only to store meta-data. The operation of adding a brick, applied to a meta-data brick, returns it to the DSA.

Important: Reiser5 doesn't count busy data and meta-data blocks separately. So, in contrast with data bricks (which contain only data), you cannot find out the real space occupied by data blocks on the meta-data brick: Reiser5 knows only the total space occupied.

To check the status of the meta-data brick, simply run

 # volume.reiser4 /mnt

and compare the values of "bricks total" and "bricks in DSA". If they are equal, the meta-data brick participates in data distribution. Otherwise, "bricks total" will be 1 more than "bricks in DSA", which indicates that the meta-data brick doesn't participate in data distribution (and therefore doesn't contain data blocks). Other cases are impossible: for data bricks, membership in the LV and in the DSA is always equivalent.

= Unmounting a logical volume =

To terminate a mount session, just issue the usual umount against the mount point:

 # umount /mnt

Note that after unmounting the volume, all bricks by default remain registered in the system until system shutdown.
If you want to unregister a brick before system shutdown, simply issue the following command:

 # volume.reiser4 -u BRICK_NAME

= Deploying a logical volume after correct unmount =

Make sure (by checking your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]) that all bricks of the volume are registered in the system. To register a brick, issue the following command:

 # volume.reiser4 -g BRICK_NAME

The list of all volumes and bricks registered in the system can be found in the output of the following command:

 # volume.reiser4 -l

Issue the usual mount(8) command against one of the bricks of your volume. It is recommended to issue it against the meta-data brick.

NOTE: Reiser5 will refuse to mount a logical volume when a wrong (incomplete or redundant) set of bricks is registered in the system. A redundant set of bricks appears, for example, when you mistakenly register a brick that was earlier removed from the logical volume.

= Deploying a logical volume after correct shutdown =

First of all, check the [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing configuration] of your volume and make sure that all its bricks (data and meta-data) are registered in the system. The list of registered bricks can be printed by

 # volume.reiser4 -l

Also make sure that the set of bricks registered for the volume doesn't contain bricks not mentioned in the [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration].

Important: Reiser5 will refuse to mount a logical volume when a wrong (incomplete or redundant) set of bricks is registered in the system.
A redundant set of bricks appears, for example, when you mistakenly register a brick that was removed from the logical volume.

For this reason we strongly recommend that you keep track of your LV: store its [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing configuration] somewhere (but not on this volume!) and don't forget to update that configuration after '''every''' volume operation. If you lose the configuration of your LV and don't remember it (which is most likely for large volumes), it will be rather painful to restore: currently there are no tools to manage logical volumes off-line. So users are expected to do this bookkeeping on their own. It is not at all difficult.

To register a brick in the system, use the following command:

 # volume.reiser4 -g BRICK_NAME

To print the list of all registered bricks, use

 # volume.reiser4 -l

Now mount your LV by issuing a mount(8) command against one of its bricks. We recommend issuing it against the meta-data brick.

Comment. Reiser5 always tries to register the brick which is passed to the mount command as an argument, so it is not necessary to pre-register the brick you are going to mount.

= Deploying a logical volume after hard reset or system crash =

If no volume operations were interrupted by the hard reset or system crash, just follow the instructions in this [https://reiser4.wiki.kernel.org/index.php?title=Logical_Volumes_Administration#Deploying_a_logical_volume_after_correct_shutdown section].

In Reiser5 only a restricted number of bricks participate in every transaction. The maximal number of such bricks can be specified by the user. At mount time a transaction replay procedure is launched on each such brick independently, in parallel.
Depending on the kind of interrupted volume operation, perform one of the following actions:

== Volume balancing was interrupted ==

Check your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]. Register the complete set of bricks and mount the volume. Check the balanced status of your LV by running

 # volume.reiser4 /mnt

and checking the "balanced" field. If the volume is unbalanced, complete balancing by running

 # volume.reiser4 -b /mnt

== Brick removal was interrupted ==

Check your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]. Register the new set of bricks (that is, the set of bricks without the brick you wanted to remove). Try to mount the volume. In case of error, register also the brick you wanted to remove and try to mount again. Check the status of your LV by running

 # volume.reiser4 /mnt

and checking the value of "health". If required, complete the brick removal by running

 # volume.reiser4 -R /mnt

Note that the option -R doesn't accept any arguments. After successful removal completion, the brick will be automatically removed from the volume and unregistered. Verify this by checking the status of your LV and the list of registered bricks:

 # volume.reiser4 /mnt
 # volume.reiser4 -l

Upon successful completion, update your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration] accordingly.
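The two recovery paths above can be scripted around the "balanced" and "health" fields of the status report. A minimal sketch: the status lines matched below (`balanced: no`, a "health" value mentioning incomplete brick removal) are assumptions about the `volume.reiser4` output format, so adjust the patterns to your version:

```shell
# Sketch: pick the completion procedure after a crash, based on the status
# report of "volume.reiser4 /mnt". The matched field values are assumptions
# about the report format; the report is read from stdin so the logic can be
# tested without a real volume.
recovery_action() {
    status=$(cat)                               # whole status report
    case "$status" in
    *'incomplete brick removal'*)
        echo 'volume.reiser4 -R /mnt' ;;        # complete interrupted removal
    *'balanced: no'*)
        echo 'volume.reiser4 -b /mnt' ;;        # complete interrupted balancing
    *)
        echo 'none' ;;                          # balanced and healthy
    esac
}

# Example with a mocked status report:
printf 'balanced: no\nhealth: OK\n' | recovery_action
```

On a live system the mocked report would be replaced by `volume.reiser4 /mnt | recovery_action`. Checking "health" before "balanced" mirrors the rule that an incomplete removal must be finished before any other operation.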
= LV monitoring =

Common info about the LV mounted at /mnt:

 # volume.reiser4 /mnt

 ID:            Volume UUID
 volume:        ID of the plugin managing the volume
 distribution:  ID of the distribution plugin
 stripe:        Stripe size in bytes
 segments:      Number of hash space segments (for distribution)
 bricks total:  Total number of bricks in the volume
 bricks in DSA: Number of bricks participating in data distribution
 balanced:      Balanced status of the volume
 health:        Brick removal completion status

Info about any of its bricks, of index J:

 # volume.reiser4 -p J /mnt

 internal ID:   Brick's "internal ID" and its status in the volume
 external ID:   Brick's UUID
 device name:   Name of the block device associated with the brick
 block count:   Size of the block device in blocks
 blocks used:   Total number of occupied blocks on the device
 system blocks: Minimal possible number of busy blocks on that device
 data capacity: Abstract capacity of the brick
 space usage:   Portion of occupied blocks on the device
 in DSA:        Participation in regular data distribution
 is proxy:      Participation in data tiering (Burst Buffers, etc.)

Comment. When retrieving a brick's info, make sure that no volume operations are in progress on that volume; otherwise the command above will return an error (EBUSY).

WARNING. Brick info obtained this way is not necessarily the most recent. To get up-to-date info, run sync(1) and make sure that no regular file operations are in progress.

= Checking free space =

To check the number of available free blocks on a volume mounted at /mnt, make sure that no regular file operations or volume operations are in progress on that volume, then run

 # sync
 # df --block-size=4K /mnt

To check the number of free blocks on the brick of index J, run

 # volume.reiser4 -p J /mnt

and calculate the difference between "block count" and "blocks used".

Comment. Not all free blocks on a brick/volume are available for use.
The number of available free blocks is always ~95% of the total number of free blocks (Reiser4 reserves 5% to make sure that regular file truncate operations won't fail).

NOTE: volume.reiser4 shows the total number of free blocks, whereas df(1) shows the number of available free blocks. The "space usage" statistic shows the portion of busy blocks on an individual brick. For the reasons explained above, "space usage" on any brick cannot exceed 0.95.

= Checking quality of data distribution =

Quality of data distribution is a measure of the deviation of the real data space usage from the ideal one defined by the volume partitioning. The smaller the deviation, the better the distribution quality.

Checking quality of distribution makes sense only when your volume partitioning is space-based, or coincides with the space-based one. If your partitioning is throughput-based and doesn't coincide with the space-based one, the quality of the actual data distribution can be rather bad: in this case the file system takes care that low-performance devices don't become a bottleneck, and effective space usage is not a high priority.

Checking quality of data distribution is based on the free-block accounting provided by the file system. Note that the file system doesn't count busy data and meta-data blocks separately, so you cannot find the real data space usage, and hence cannot check quality of distribution, when the meta-data brick contains data blocks.

To check quality of distribution:

1) make sure that the meta-data brick doesn't contain data blocks;
2) make sure that no regular file or volume operations are currently in progress;
3) find the "blocks used", "system blocks" and "data capacity" statistics for each data brick:

 # sync
 # volume.reiser4 -p 1 /mnt
 ...
 # volume.reiser4 -p N /mnt

4) find the real data space usage on each brick;
5) calculate the partitioning and the ideal data space usage on each data brick;
6) find the deviation of (4) from (5).

Example.
Let's build an LV of 3 bricks (one 10G meta-data brick vdb1, and two data bricks: vdc1 (10G) and vdd1 (5G)) with space-based partitioning:

 # VOL_ID=`uuid -v4`
 # echo "Using uuid $VOL_ID"
 # mkfs.reiser4 -U $VOL_ID -y -t 256K /dev/vdb1
 # mkfs.reiser4 -U $VOL_ID -y -a -t 256K /dev/vdc1
 # mkfs.reiser4 -U $VOL_ID -y -a -t 256K /dev/vdd1
 # mount /dev/vdb1 /mnt

Fill the meta-data brick with data:

 # dd if=/dev/zero of=/mnt/myfile bs=256K
 No space left on device...

Add data bricks /dev/vdc1 and /dev/vdd1 to the volume:

 # volume.reiser4 -a /dev/vdc1 /mnt
 # volume.reiser4 -a /dev/vdd1 /mnt

Move all data blocks to the newly added bricks:

 # volume.reiser4 -r /dev/vdb1 /mnt
 # sync

Now the meta-data brick doesn't contain data blocks (only meta-data ones), so we can calculate the quality of data distribution:

 # volume.reiser4 /mnt -p0
 blocks used: 503
 # volume.reiser4 /mnt -p1
 blocks used: 1657203
 system blocks: 115
 data capacity: 2621069
 # volume.reiser4 /mnt -p2
 blocks used: 833001
 system blocks: 73
 data capacity: 1310391

Based on the statistics above, calculate the quality of distribution.

Total data capacity of the volume:

 C = 2621069 + 1310391 = 3931460

Relative capacities of the data bricks:

 C1 = 2621069 / 3931460 = 0.6667
 C2 = 1310391 / 3931460 = 0.3333

Real space usage on the data bricks (blocks used - system blocks):

 R1 = 1657203 - 115 = 1657088
 R2 = 833001 - 73 = 832928

Space usage on the volume:

 R = R1 + R2 = 1657088 + 832928 = 2490016

Ideal data space usage on the data bricks:

 I1 = C1 * R = 0.6667 * 2490016 = 1660094
 I2 = C2 * R = 0.3333 * 2490016 = 829922

Deviation:

 D = (R1, R2) - (I1, I2) = (-3006, 3006)

Relative deviation:

 D/R = (-0.0012, 0.0012)

Quality of distribution:

 Q = 1 - max(|D1|, |D2|)/R = 1 - 0.0012 = 0.9988

Comment. For any specified number of bricks N and quality of distribution Q, it is possible to find a configuration of a logical volume composed of N bricks such that the quality of distribution on that volume is better than Q.
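The arithmetic above can be automated. A small sketch that recomputes the distribution quality from per-brick statistics (the input values are hard-coded from the example; on a real volume you would paste the numbers reported by `volume.reiser4 -p J`):

```shell
# Sketch: recompute distribution quality from per-brick statistics.
# Each input line: <blocks used> <system blocks> <data capacity> for one
# data brick (values below are taken from the example above).
Q=$(awk '
{
    used[NR] = $1 - $2       # real data usage: blocks used - system blocks
    cap[NR]  = $3            # abstract capacity of the brick
    C += $3; R += used[NR]   # volume data capacity and total data usage
}
END {
    worst = 0
    for (i = 1; i <= NR; i++) {
        ideal = (cap[i] / C) * R        # ideal usage I_i = C_i * R
        dev = (used[i] - ideal) / R     # relative deviation D_i / R
        if (dev < 0) dev = -dev
        if (dev > worst) worst = dev
    }
    printf "%.4f", 1 - worst            # quality Q = 1 - max|D_i| / R
}' <<EOF
1657203 115 2621069
833001 73 1310391
EOF
)
echo "quality of distribution: $Q"     # prints: quality of distribution: 0.9988
```

Note that the script works with exact ratios rather than the rounded 0.6667/0.3333 used in the hand calculation, so its intermediate deviations differ slightly; the quality still rounds to the same 0.9988.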
Quality of distribution Q doesn't depend on the number of bricks in the logical volume. This is a theorem, which can be strictly proven. = FAQ = Q. What happens if I lose a device-component (due to a breakdown, etc) of my logical volume? A. Bodies of some your regular files will become "punched" in random places. Portion of such files depends on the relative capacity of the lost brick, on the number of bricks in the logical volume, and on other factors. Fsck will be able to detect and remove such files with corrupted bodies. Nevertheless, we recommend to consider mirroring your bricks (e.g. by software, or hardware RAID-1) to avoid such highly unpleasant situations. [[category:Reiser4]] ad602845beadbdc36cccd33b970c455f577ef03b 4410 4409 2020-11-11T22:14:01Z Edward 4 /* Deploying a logical volume after hard reset or system crash */ Before working with logical volumes you need to understand some basic [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Background principles]. Logical volume (LV) can be composed of any number of block devices, different in physical and geometric parameters. However the optimal configuration (true parallelism) imposes some restrictions and dependencies on the size of such devices. WARNING: The stuff is not stable. Don't put important data to logical volumes managed by software of release number 5.X.Y. Also don't mount your old partitions in kernels with Reiser4 of SFRN 5.X.Y before its stabilization IMPORTANT: Currently there is no tools to manage Reiser5 logical volumes off-line, so it it strongly recommended to save/update configurations of your LV in a file, which doesn't belong to that volume. = Basic definitions. Volume configuration. Brick's capacity. Partitioning. Fair distribution. Balancing = Basic configuration of a logical volume is the following information: 1) Volume UUID; 2) Number of bricks in the volume; 3) List of brick names or UUIDs in the volume; 4) UUID or name of the brick to be added/removed (if any). 
That brick is not counted in (2) and (3). Item #4 is needed to handle incomplete operations interrupted for various reasons (system crash, hard reset, etc) when bringing logical volumes on-line.

For each volume, its configuration should be stored somewhere (but not on that volume!) and properly updated before and after each volume operation performed on that volume. We make the user responsible for this. The volume configuration is needed to facilitate deploying the volume.

'''Abstract capacity''' (or simply capacity) of a brick is a positive integer. Capacity is a brick's property defined by the user. Don't confuse it with the size of the block device. Think of it as the brick's "weight" in some units. It is the user who decides which property of the brick to assign as its abstract capacity, and in which units. In particular, it can be the size of the block device in kilobytes, or its size in megabytes, or its throughput in MB/s, or another geometric or physical parameter of the device associated with the brick. It is important that the capacities of all bricks of the same logical volume are measured in the same units. Also, it would be utterly pointless to assign different properties as abstract capacities for bricks of the same LV - for example, size of the block device for one brick, and disk bandwidth for another.

The capacity of each brick gets initialized by the mkfs utility. By default it is calculated as the number of free blocks on the device at the very end of the formatting procedure. For a meta-data brick it is calculated as 70% of that amount. The capacity of any brick can be changed on-line by the user.

'''Capacity of a logical volume''' is defined as the sum of the capacities of its component bricks.

'''Relative capacity of a brick''' is the ratio of the brick's capacity to the volume's capacity. Relative capacity defines the portion of IO requests that will be issued against that brick. The array of relative capacities (C1, C2, ...) of all bricks is called the volume partitioning. Obviously, C1 + C2 + ... = 1.

'''(Real) data space usage''' on a brick is the number of data blocks stored on that brick.

'''Ideal (or expected) data space usage''' on a brick is T*C, where T is the total number of data blocks stored in the volume and C is the relative capacity of the brick.

It is recommended to compose volumes in such a way that the space-based partitioning coincides with the throughput-based one - that would be the optimal volume configuration, which provides true parallelism. If that is impossible for some reason, then choose a preferred partitioning method (space-based or throughput-based). Note that space-based partitioning saves volume space, whereas the throughput-based one saves volume throughput.

When performing regular file operations, Reiser5 distributes data stripes throughout the volume evenly and fairly. This means that the portion of IO requests issued against each brick is equal to its relative capacity, that is, to the portion of capacity that the brick adds to the total volume's capacity.

In contrast with regular file operations, volume operations break the fairness of data distribution on your logical volume. To restore fairness of distribution, a special balancing procedure should be run on the volume. For example, after adding a brick to a logical volume, the balancing procedure will populate the new brick with data moved from other bricks.

All volume operations except brick removal are fast, atomic and leave the volume in an unbalanced state. The operation of brick removal always includes balancing, which moves data from the brick you want to remove to the other bricks of the volume. If that data migration is interrupted for some reason, then the volume is marked as a "volume with incomplete brick removal".

It is allowed to perform regular file and volume operations on an unbalanced LV (assuming the imbalance was not caused by an incomplete removal). However, in this case we don't guarantee a good quality of data distribution on your LV.
In addition, on a volume with incomplete removal you won't be able to perform regular volume operations - first you will need to complete the removal by running a special removal-completion procedure on your volume.

= Prepare Software and Hardware =

Build, install and boot a kernel with Reiser4 of software framework release number 5.X.Y. Kernel patches can be found [https://sourceforge.net/projects/reiser4/files/v5-unstable/ here]. Note that the Linux kernel and GNU utilities still recognize the testing software as "Reiser4". Make sure there is the following message in the kernel logs:

 "Loading Reiser4 (Software Framework Release: 5.X.Y)"

Build and install the latest [https://sourceforge.net/projects/reiser4/files/reiser4-utils/libaal/ libaal].

Download, build and install the latest version 2.A.B of the [https://sourceforge.net/projects/reiser4/files/v5-unstable/ Reiser4progs package]. Make sure that the utility for managing logical volumes is installed (as a part of the reiser4progs package) on your machine:

 # volume.reiser4 -?

= Creating a logical volume =

Start by choosing a unique ID (uuid) for your volume. By default it is generated by the mkfs utility. However, you can generate it yourself with appropriate tools (e.g. uuid(1)) and store it in an environment variable for convenience:

 # VOL_ID=`uuidgen`
 # echo "Using uuid $VOL_ID"

Choose a stripe size for your logical volume. For a good quality of distribution it is recommended that the stripe size doesn't exceed 1/10000 of the volume size. On the other hand, too small a stripe will increase space consumption on your meta-data brick. In our example we choose a stripe size of 512K:

 # STRIPE=512K
 # echo "Using stripe size $STRIPE"

Start by creating the first brick of your volume - the meta-data brick - passing the volume ID and stripe size to the mkfs.reiser4 utility:

 # mkfs.reiser4 -U $VOL_ID -t $STRIPE /dev/vdb1

Currently only one meta-data brick per volume is supported, so it is recommended that the size of the block device for the meta-data brick is not too small.
In most cases it will be enough if your meta-data brick is not smaller than 1/200 of the maximal volume size. For example, a 100G meta-data brick will be able to service a ~20T logical volume.

Data and meta-data bricks don't differ from the standpoint of disk format, and there is no special option to inform the mkfs utility that we want to create a meta-data brick specifically: the first brick in the volume automatically becomes the meta-data brick, and the other bricks are interpreted as data bricks.

Mount your initial logical volume consisting of one meta-data brick:

 # mount /dev/vdb1 /mnt

Find a record about your volume in the output of the following command:

 # volume.reiser4 -l

Create the configuration of your logical volume (its definition is above) and store it somewhere - but not on that volume!

Your logical volume is now on-line and ready to use. You can perform regular file operations and volume operations (e.g. add a data brick to your LV).

= Adding a data brick to LV =

At any time you are able to add a data brick to your LV. You can do it in parallel with regular file operations executing on the volume. Make sure, however, that no other volume operation (e.g. removing a brick) is in progress on your volume, otherwise your operation will fail with EBUSY. Obviously, adding a brick will increase the capacity of your volume.

Choose a block device for the new data brick. Make sure that it is not too large or too small: the capacities of any 2 bricks of the same logical volume cannot differ by more than 2^19 (~500 thousand) times. E.g. your logical volume cannot contain both 1M and 2T bricks. Any attempt to add a brick of improper capacity will fail with an error.

Format it with the same volume ID and stripe size as you used for the meta-data brick, but specify also the "-a" option (to not restrict data capacity).
 # mkfs.reiser4 -U $VOL_ID -t $STRIPE -a /dev/vdb2

Important: the data brick must be formatted with the same volume ID and stripe size as the meta-data brick of your logical volume. Otherwise, the operation of adding the data brick will fail.

Update item #4 of your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration] with the UUID or name of the brick you want to add.

To add the brick, simply pass its name as an argument to the option "-a" and specify your LV via its mount point:

 # volume.reiser4 -a /dev/vdb2 /mnt

By default the operation of adding a brick is fast and atomic and leaves the volume in an unbalanced state. So after adding a brick you might want to run the balancing procedure, which will move a portion of data to the new brick from the other bricks of the logical volume, making the data distribution on your volume fair:

 # volume.reiser4 -b /mnt

The portion of data blocks moved during such rebalancing is equal to the relative capacity of the new brick, that is, to the portion of capacity that the new brick adds to the updated LV's capacity. This important property defines the cost of the balancing procedure: if the portion of capacity added by a brick is small, then the number of stripes moved during balancing is also small.

Specifying the option -B (--with-balance) will automatically trigger the balancing procedure after adding the brick:

 # volume.reiser4 -Ba /dev/vdb2 /mnt

Upon successful completion update your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]. That is, increment (#2), add info about the new brick to (#3) and remove the record at (#4).
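The bookkeeping around adding a brick — set item #4 before the operation; on success increment #2, extend #3 and clear #4 — can be sketched as follows. The dict layout and helper names are this example's assumptions, not part of any Reiser5 tooling:

```python
# Hypothetical helpers for the add-brick configuration bookkeeping.
config = {
    "volume_uuid": "f3b1c2d4-0000-4000-8000-000000000001",
    "brick_count": 1,
    "bricks": ["/dev/vdb1"],
    "pending_brick": None,
}

def begin_add(config, brick):
    # item #4: record the brick about to be added,
    # before running: volume.reiser4 -a <brick> <mountpoint>
    config["pending_brick"] = brick

def finish_add(config):
    # on success: increment #2, add the brick to #3, clear #4
    config["bricks"].append(config["pending_brick"])
    config["brick_count"] += 1
    config["pending_brick"] = None

begin_add(config, "/dev/vdb2")
# ... volume.reiser4 -a /dev/vdb2 /mnt succeeded ...
finish_add(config)
```

If the operation is interrupted between the two steps, the surviving "pending_brick" entry tells you which brick was in flight — exactly the role item #4 plays during recovery.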
= Removing a data brick from LV =

At any time you are able to remove any data brick from your LV (assuming that your volume is not marked as a "volume with incomplete brick removal"). You can perform brick removal in parallel with regular file operations executing on that volume. Make sure, however, that no other volume operation (e.g. adding a brick) is in progress on your volume, otherwise the removal will fail with EBUSY.

Obviously, the removal operation will decrease the abstract capacity of your LV. Note that the other bricks should have enough space to store all data blocks of the brick you want to remove, otherwise the removal operation will return an error (ENOSPC).

Suppose you want to remove brick /dev/vdb2 from your LV mounted at /mnt. Update your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration] with the UUID and name of the brick you want to remove (item #4). To remove the brick, simply pass its name as an argument to the option "-r" and specify the logical volume by its mount point:

 # volume.reiser4 -r /dev/vdb2 /mnt

The procedure of brick removal starts by moving all data from the brick you want to remove to the other bricks of your volume, so that the resulting data distribution among the remaining bricks is also fair. The portion of data stripes moved during such migration is equal to the relative capacity of the brick to be removed (that is, to the portion of capacity that the brick added to the LV's capacity). Successful brick removal always leaves the volume in a balanced state.

So, in contrast with the operation of adding a brick, removing a brick is a rather long operation, which can be interrupted for various reasons. In this case the volume will be marked as a "volume with incomplete brick removal". To check the removal status of your LV simply run

 # volume.reiser4 /mnt

and check the field "health".
To complete brick removal in the current mount session simply run

 # volume.reiser4 -R /mnt

Note that the option -R (--finish-removal) doesn't accept any arguments.

On success update your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]: remove the information about the brick /dev/vdb2 from #3 and #4. Check your kernel logs: they should contain a message that brick /dev/vdb2 has been unregistered. Now the device /dev/vdb2 doesn't belong to the logical volume any more, and you can reuse it for other purposes (re-format, etc).

= Changing brick's capacity =

At any time (assuming that no other volume operation is in progress) you can change the abstract capacity of any brick to some new value different from 0. Changing capacity always changes the volume partitioning, and therefore breaks fairness of distribution, so Reiser5 automatically launches rebalancing to make sure that the resulting distribution is fair for the new set of capacities. In particular, increasing a brick's capacity will move some data from other bricks to the brick whose capacity was increased. Decreasing a brick's capacity will move some data from the brick whose capacity was decreased to other bricks.

To change the abstract capacity of brick /dev/vdb1 to a new value (e.g. 200000), simply run

 # volume.reiser4 -z /dev/vdb1 -c 200000 /mnt

Pronounced as "resize brick /dev/vdb1 to new capacity 200000 in the volume mounted at /mnt".

The operation of changing capacity can return an error. Most likely it is -ENOSPC, which is a side effect of concurrent regular file writes. In this case check the status of your LV. If it is unbalanced, then consider removing some files from your LV and complete the balancing by running

 # volume.reiser4 -b /mnt

Otherwise, repeat the operation from scratch.

Comment. Changing a brick's capacity to 0 is undefined and will return an error.
Consider the brick removal operation instead.

= Operations with meta-data brick =

The meta-data brick can also contain data stripes and participate in data distribution like the data bricks, so all the volume operations described above are also applicable to the meta-data brick. Note, however, that it is impossible to completely remove the meta-data brick from the logical volume for obvious reasons (meta-data need to be stored somewhere), so the brick removal operation applied to the meta-data brick actually removes it from the Data Storage Array (DSA), not from the logical volume. The DSA is the subset of the LV consisting of the bricks participating in data distribution. Once you remove the meta-data brick from the DSA, that brick will be used only to store meta-data. The operation of adding a brick, applied to the meta-data brick, returns it to the DSA.

Important: Reiser5 doesn't count busy data and meta-data blocks separately. So in contrast with data bricks (which contain only data) you are not able to find out the real space occupied by data blocks on the meta-data brick - Reiser5 knows only the total space occupied.

To check the status of the meta-data brick simply run

 # volume.reiser4 /mnt

and compare the values of "bricks total" and "bricks in DSA". If they are equal, then the meta-data brick participates in data distribution. Otherwise, "bricks total" should be 1 more than "bricks in DSA" - this indicates that the meta-data brick doesn't participate in data distribution (and therefore doesn't contain data blocks). Note that other cases are impossible: for data bricks, participation in the LV and in the DSA are always equivalent.

= Unmounting a logical volume =

To terminate a mount session just issue the usual umount against the mount point:

 # umount /mnt

Note that after unmounting the volume all bricks by default remain registered in the system till system shutdown.
If you want to unregister a brick before system shutdown, then simply issue the following command:

 # volume.reiser4 -u BRICK_NAME

= Deploying a logical volume after correct unmount =

Make sure (by checking your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]) that all bricks of the volume are registered in the system. To register a brick issue the following command:

 # volume.reiser4 -g BRICK_NAME

The list of all volumes and bricks registered in the system can be found in the output of the following command:

 # volume.reiser4 -l

Issue the usual mount(8) command against one of the bricks of your volume. It is recommended to issue it against the meta-data brick.

NOTE: Reiser5 will refuse to mount a logical volume in the case when a wrong (incomplete or redundant) set of bricks is registered in the system. A redundant set of bricks appears, for example, when you mistakenly register a brick that was earlier removed from the logical volume.

= Deploying a logical volume after correct shutdown =

First of all, check the [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing configuration] of your volume and make sure that all its bricks (data and meta-data ones) are registered in the system. The list of registered bricks can be printed by

 # volume.reiser4 -l

Also make sure that the set of bricks registered for the volume doesn't contain bricks not mentioned in the [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration].

Important: Reiser5 will refuse to mount a logical volume in the case when a wrong (incomplete or redundant) set of bricks is registered in the system.
A redundant set of bricks appears, for example, when you mistakenly register a brick that was removed from the logical volume. For this reason we strongly recommend keeping track of your LV - store its [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing configuration] somewhere, but not on this volume! And don't forget to update that configuration after '''every''' volume operation.

If you lost the configuration of your LV and don't remember it (which is most likely for large volumes), then it will be rather painful to restore: currently there are no tools to manage logical volumes off-line, so users have to do this on their own. It is not at all difficult.

To register a brick in the system use the following command:

 # volume.reiser4 -g BRICK_NAME

To print the list of all registered bricks use

 # volume.reiser4 -l

Now mount your LV, simply issuing a mount(8) command against one of the bricks of your LV. We recommend issuing it against the meta-data brick.

Comment. Reiser5 always tries to register the brick which is passed to the mount command as an argument, so it is not necessary to preregister the brick you want to issue the mount command against.

= Deploying a logical volume after hard reset or system crash =

If no volume operations were interrupted by the hard reset or system crash, then just follow the instructions in this [https://reiser4.wiki.kernel.org/index.php?title=Logical_Volumes_Administration#Deploying_a_logical_volume_after_correct_shutdown section].

In Reiser5 only a restricted number of bricks participate in every transaction. The maximal number of such bricks can be specified by the user. At mount time a transaction replay procedure will be launched on each such brick independently, in parallel.
Depending on the kind of interrupted volume operation, perform one of the following actions:

== Volume balancing was interrupted ==

Check your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]. Register the complete set of bricks and mount the volume. Check the balanced status of your LV by running

 # volume.reiser4 /mnt

and checking the "balanced" field. If the volume is unbalanced, then complete the balancing by running

 # volume.reiser4 -b /mnt

== Brick removal was interrupted ==

Check your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]. Register the new set of bricks (that is, the set of bricks without the brick you wanted to remove). Try to mount the volume. In the case of an error, register also the brick you wanted to remove and try to mount again. Check the status of your LV by running

 # volume.reiser4 /mnt

and checking the value of "health". If required, complete the brick removal by running

 # volume.reiser4 -R /mnt

Note that the option -R doesn't accept any arguments. After successful removal completion the brick will be automatically removed from the volume and unregistered. Make sure of it by checking the status of your LV and the list of registered bricks:

 # volume.reiser4 /mnt
 # volume.reiser4 -l

Upon successful completion update your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration] accordingly.
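The decision procedure above can be scripted. A sketch, assuming "key: value" status lines and illustrative field values — the actual volume.reiser4 output format may differ:

```python
# Parse "key: value" status lines (an assumed output shape) and pick the
# completion command suggested by the recovery instructions above.
def recovery_action(status_text):
    fields = {}
    for line in status_text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    if "incomplete" in fields.get("health", "").lower():
        return "volume.reiser4 -R /mnt"   # complete interrupted brick removal
    if fields.get("balanced", "").lower() in ("no", "false", "0"):
        return "volume.reiser4 -b /mnt"   # complete interrupted balancing
    return None                           # nothing to complete

# hypothetical status text for a volume with an interrupted removal
action = recovery_action("health: incomplete brick removal\nbalanced: no")
```

Note the ordering: an incomplete removal must be finished first, since regular volume operations (including plain balancing) are refused until it completes.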
= LV monitoring =

Common info about the LV mounted at /mnt:

 # volume.reiser4 /mnt

 ID:             Volume UUID
 volume:         ID of the plugin managing the volume
 distribution:   ID of the distribution plugin
 stripe:         Stripe size in bytes
 segments:       Number of hash space segments (for distribution)
 bricks total:   Total number of bricks in the volume
 bricks in DSA:  Number of bricks participating in data distribution
 balanced:       Balanced status of the volume

Info about any of its bricks, by index J:

 # volume.reiser4 -p J /mnt

 internal ID:    Brick's "internal ID" and its status in the volume
 external ID:    Brick's UUID
 device name:    Name of the block device associated with the brick
 block count:    Size of the block device in blocks
 blocks used:    Total number of occupied blocks on the device
 system blocks:  Minimal possible number of busy blocks on that device
 data capacity:  Abstract capacity of the brick
 space usage:    Portion of occupied blocks on the device
 in DSA:         Participation in regular data distribution
 is proxy:       Participation in data tiering (Burst Buffers, etc)

Comment. When retrieving a brick's info, make sure that no volume operations on that volume are in progress. Otherwise the command above will return an error (EBUSY).

WARNING. Brick info obtained this way is not necessarily the most recent. To get up-to-date info, run sync(1) and make sure that no regular file operations are in progress.

= Checking free space =

To check the number of available free blocks on a volume mounted at /mnt, make sure that no regular file operations, as well as no volume operations, are in progress on that volume, then run

 # sync
 # df --block-size=4K /mnt

To check the number of free blocks on the brick of index J, run

 # volume.reiser4 -p J /mnt

then calculate the difference between "block count" and "blocks used".

Comment. Not all free blocks on a brick/volume are available for use. The number of available free blocks is always ~95% of the total number of free blocks (Reiser4 reserves 5% to make sure that regular file truncate operations won't fail).
NOTE: volume.reiser4 shows the total number of free blocks, whereas df(1) shows the number of available free blocks. The "space usage" statistic shows the portion of busy blocks on an individual brick. For the reasons explained above, "space usage" on any brick can not be more than 0.95.

= Checking quality of data distribution =

Quality of data distribution is a measure of the deviation of the real data space usage from the ideal one defined by the volume partitioning. The smaller the deviation, the better the distribution quality.

Checking quality of distribution makes sense only in the case when your volume partitioning is space-based, or when it coincides with the space-based one. If your partitioning is throughput-based and doesn't coincide with the space-based one, then the quality of the actual data distribution can be rather bad: in this case the file system takes care that low-performance devices don't become a bottleneck, and effective space usage is not a high priority.

Checking quality of data distribution is based on the free-blocks accounting provided by the file system. Note that the file system doesn't count busy data and meta-data blocks separately, so you are not able to find the real data space usage, and hence to check quality of distribution, in the case when the meta-data brick contains data blocks.

To check quality of distribution:

1) make sure that the meta-data brick doesn't contain data blocks;
2) make sure that no regular file and volume operations are currently in progress;
3) find the "blocks used", "system blocks" and "data capacity" statistics for each data brick:

 # sync
 # volume.reiser4 -p 1 /mnt
 ...
 # volume.reiser4 -p N /mnt

4) find the real data space usage on each brick;
5) calculate the partitioning and the ideal data space usage on each data brick;
6) find the deviation of (4) from (5).
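Steps 4)-6) are plain arithmetic and can be scripted. A sketch, fed with the per-brick "blocks used", "system blocks" and "data capacity" figures for the data bricks (the sample numbers below come from a two-data-brick volume like the worked example on this page):

```python
# Quality of distribution per the definition above: 1 minus the largest
# relative deviation of real data usage from ideal (capacity-weighted) usage.
def distribution_quality(bricks):
    # bricks: list of (blocks_used, system_blocks, data_capacity) per data brick
    real = [used - system for used, system, _cap in bricks]            # step 4
    total_used = sum(real)
    total_cap = sum(cap for _u, _s, cap in bricks)
    ideal = [total_used * cap / total_cap for _u, _s, cap in bricks]   # step 5
    worst = max(abs(r - i) for r, i in zip(real, ideal))               # step 6
    return 1 - worst / total_used

# sample figures: (blocks used, system blocks, data capacity)
q = distribution_quality([(1657203, 115, 2621069), (833001, 73, 1310391)])
# q is about 0.9988
```

With exact (unrounded) relative capacities the result differs slightly from a hand calculation that rounds C1 and C2 to four digits, but the quality figure agrees to about 0.9988.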
Let' build a LV of 3 bricks (one 10G meta-data brick sdb1, and two data bricks: sdc1 (10G), sdd1(5G)) with space-based partitioning: # VOL_ID=`uuid -v4` # echo "Using uuid $VOL_ID" # mkfs.reiser4 -U $VOL_ID -y -t 256K /dev/vdb1 # mkfs.reiser4 -U $VOL_ID -y -a -t 256K /dev/vdc1 # mkfs.reiser4 -U $VOL_ID -y -a -t 256K /dev/vdd1 # mount /dev/vdb1 /mnt Fill the meta-data brick with data: # dd if=/dev/zero of=/mnt/myfile bs=256K No space left on device... Add data-bricks /dev/sdc1 and dev/sdd1 to the volume: # volume.reiser4 -a /dev/vdc1 /mnt # volume.reiser4 -a /dev/vdd1 /mnt Move all data blocks to the newly added bricks: # volume.reiser4 -r /dev/vdb1 /mnt # sync Now meta-data brick doesn't contain data blocks (only meta-data ones), so that we can calculate quality of data distribution # volume.reiser4 /mnt -p0 blocks used: 503 # volume.reiser4 /mnt -p1 blocks used: 1657203 system blocks: 115 data capacity: 2621069 # volume.reiser4 /mnt -p2 blocks used: 833001 system blocks: 73 data capacity: 1310391 Basing on the statistics above calculate quality of distribution. Total data capacity of the volume: C = 2621069 + 1310391 = 3931460 Relative capacities of data bricks: C1 = 2621069 /(2621069 + 1310391) = 0.6667 C2 = 1310464 /(2621069 + 1310391) = 0.3333 Real space usage on data bricks (blocks used - system blocks): R1 = 1657203 - 115 = 1657088 R2 = 833001 - 73 = 832928 Space usage on the volume: R = R1 + R2 = 1657088 + 832928 = 2490016 Ideal data space usage on data bricks: I1 = C1 * R = 0.6667 * 2490016 = 1660094 I2 = C2 * R = 0.3333 * 2490016 = 829922 Deviation: D = (R1, R2) - (I1, I2) = (3006, -3006) Relative deviation: D/R = (-0.0012, 0.0012) Quality of distribution: Q = 1 - max(|D1|, |D1|) = 1 - 0.0012 = 0.9988 Comment. For any specified number of bricks N and quality of distribution Q it is possible to find a configuration of a logical volume composed of N bricks, so that quality of distribution on that volume will be better than Q. Comment. 
Quality of distribution Q doesn't depend on the number of bricks in the logical volume. This is a theorem, which can be strictly proven. = FAQ = Q. What happens if I lose a device-component (due to a breakdown, etc) of my logical volume? A. Bodies of some your regular files will become "punched" in random places. Portion of such files depends on the relative capacity of the lost brick, on the number of bricks in the logical volume, and on other factors. Fsck will be able to detect and remove such files with corrupted bodies. Nevertheless, we recommend to consider mirroring your bricks (e.g. by software, or hardware RAID-1) to avoid such highly unpleasant situations. [[category:Reiser4]] dbfaa5171991a50dea52931810af08b541668ba0 4409 4408 2020-11-11T22:00:09Z Edward 4 /* Unmounting a logical volume */ Before working with logical volumes you need to understand some basic [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Background principles]. Logical volume (LV) can be composed of any number of block devices, different in physical and geometric parameters. However the optimal configuration (true parallelism) imposes some restrictions and dependencies on the size of such devices. WARNING: The stuff is not stable. Don't put important data to logical volumes managed by software of release number 5.X.Y. Also don't mount your old partitions in kernels with Reiser4 of SFRN 5.X.Y before its stabilization IMPORTANT: Currently there is no tools to manage Reiser5 logical volumes off-line, so it it strongly recommended to save/update configurations of your LV in a file, which doesn't belong to that volume. = Basic definitions. Volume configuration. Brick's capacity. Partitioning. Fair distribution. Balancing = Basic configuration of a logical volume is the following information: 1) Volume UUID; 2) Number of bricks in the volume; 3) List of brick names or UUIDs in the volume; 4) UUID or name of the brick to be added/removed (if any). 
That brick is not counted in (2) and (3). The item #4 is to handle incomplete operations interrupted by various reasons (system crash, hard reset, etc) when bringing logical volumes on-line. For each volume its configuration should be stored somewhere (but not on that volume!) and properly updated before and after each volume operation performed on that volume. We make the user responsible for this. Volume configuration is needed to facilitate deploying a volume. '''Abstract capacity''' (or simply capacity) of a brick is a positive integer number. Capacity is a brick's property defined by user. Don't confuse it with the size of block device. Think of it as of brick's "weight" in some units. And this is the user, who decides, which property of the brick to assign as its abstract capacity and in which units. In particular, it can be size of the block device in kilobytes, or its size in megabytes, or its throughput in M/sec, or other geometric or physical parameter of the device, associated with the brick. It is important that capacities of all bricks of the same logical volume are measured in the same units. Also, it would be utterly pointless to assign different properties as abstract capacities for bricks of the same LV. For example, size of block device for one brick, and disk bandwidth for another one. Capacity of each brick gets initialized by mkfs utility. By default it is calculated as number of free blocks on the device at the very end of the formatting procedure. For meta-data brick it is calculated as 70% of such amount. Capacity of any brick can be changed on-line by user. '''Capacity of a logical volume''' is defined as a sum of capacities of its bricks-components. '''Relative capacity of a brick''' is the ratio of brick's capacity to volume's capacity. Relative capacity defines a portion of IO-requests that will be issued against that brick. Array of relative capacities (C1, C2, ...) of all bricks is called volume partitioning. Obviously, C1 + C2 + ... 
= 1. '''(Real) data space usage''' on a brick is number of data blocks, stored on that brick. '''Ideal (or expected) data space usage''' on a brick is T*C, where T is total number of data blocks stored in the volume. C is relative capacity of the brick. It is recommended to compose volumes in the way so that space-based partitioning coincides with throughput-based one - it would be the optimal volume configuration, which provides true parallelism. If it is impossible for some reason, then choose a preferred partitioning method (space-based, or throughput-based). Note that space-based partitioning saves volume space, whereas throughput based one saves volume throughput. When performing regular file operations, Reiser5 distributes data stripes throughout the volume evenly and fairly. It means that portion of IO-requests issued against each brick is equal to its relative capacity, that is, to the portion of capacity that the brick adds to the total volume's capacity. In contrast with regular file operations, volume operations break fairness of data distribution on your logical volume. To restore fairness of distribution, a special balancing procedure should be run on the volume. For example, after adding a brick to a logical volume, the balancing procedure will populate the new brick with data, moved from other bricks. All volume operations except brick removal are fast, atomic and leave the volume in unbalanced state. Operation of brick removal is always includes balancing, which moves data from the brick you want to remove to other bricks of the volume. If that data migration is interrupted for some reason, then the volume is marked as a "volume with incomplete brick removal". It is allowed to perform regular file and volume operations on a not balanced LV (assuming, it was not incomplete removal). However, in this case we don't guarantee a good quality of data distribution on your LV. 
In addition, on a volume with incomplete removal you won't be able to perform regular volume operations - first you will need to complete the removal by running a special removal completion procedure on your volume.

= Prepare Software and Hardware =

Build, install and boot a kernel with Reiser4 of software framework release number 5.X.Y. Kernel patches can be found [https://sourceforge.net/projects/reiser4/files/v5-unstable/ here]. Note that the Linux kernel and GNU utilities still recognize this testing stuff as "Reiser4". Make sure the following message appears in the kernel logs:

 "Loading Reiser4 (Software Framework Release: 5.X.Y)"

Build and install the latest [https://sourceforge.net/projects/reiser4/files/reiser4-utils/libaal/ libaal]. Download, build and install the latest version 2.A.B of the [https://sourceforge.net/projects/reiser4/files/v5-unstable/ Reiser4progs package]. Make sure that the utility for managing logical volumes is installed (as a part of the reiser4progs package) on your machine:

 # volume.reiser4 -?

= Creating a logical volume =

Start by choosing a unique ID (uuid) for your volume. By default it is generated by the mkfs utility. However, the user can generate it with an appropriate tool (e.g. uuid(1)) and store it in an environment variable for convenience:

 # VOL_ID=`uuidgen`
 # echo "Using uuid $VOL_ID"

Choose a stripe size for your logical volume. For a good quality of distribution it is recommended that the stripe size doesn't exceed 1/10000 of the volume size. On the other hand, too small a stripe will increase space consumption on your meta-data brick. In our example we choose a stripe size of 512K:

 # STRIPE=512K
 # echo "Using stripe size $STRIPE"

Start by creating the first brick of your volume - the meta-data brick - passing the volume ID and stripe size to the mkfs.reiser4 utility:

 # mkfs.reiser4 -U $VOL_ID -t $STRIPE /dev/vdb1

Currently only one meta-data brick per volume is supported, so it is recommended that the block device for the meta-data brick is not too small.
In most cases it is enough if your meta-data brick is not smaller than 1/200 of the maximal volume size. For example, a 100G meta-data brick will be able to service a ~20T logical volume. Data and meta-data bricks don't differ from the standpoint of disk format, and there is no special option to tell the mkfs utility that we want to create a meta-data brick: the first brick in the volume automatically becomes the meta-data brick, and the other bricks are interpreted as data bricks.

Mount your initial logical volume consisting of one meta-data brick:

 # mount /dev/vdb1 /mnt

Find a record about your volume in the output of the following command:

 # volume.reiser4 -l

Create the configuration of your logical volume (its definition is above) and store it somewhere - but not on that volume! Your logical volume is now on-line and ready to use. You can perform regular file operations and volume operations (e.g. add a data brick to your LV).

= Adding a data brick to LV =

At any time you are able to add a data brick to your LV. You can do it in parallel with regular file operations executing on this volume. Make sure, however, that no other volume operation (e.g. removing a brick) is in progress on your volume, otherwise your operation will fail with EBUSY. Obviously, adding a brick will increase the capacity of your volume.

Choose a block device for the new data brick. Make sure that it is not too large or too small: capacities of any two bricks of the same logical volume can not differ by more than 2^19 (~500,000) times. E.g. your logical volume can not contain both 1M and 2T bricks. Any attempt to add a brick of improper capacity will fail with an error. Format the device with the same volume ID and stripe size as you used for the meta-data brick, but also specify the "-a" option (to not restrict data capacity).
 # mkfs.reiser4 -U $VOL_ID -t $STRIPE -a /dev/vdb2

Important: the data brick must be formatted with the same volume ID and stripe size as the meta-data brick of your logical volume. Otherwise, the operation of adding the data brick will fail.

Update item #4 of your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration] with the UUID or name of the brick you want to add. To add the brick, simply pass its name as an argument to the option "-a" and specify your LV via its mount point:

 # volume.reiser4 -a /dev/vdb2 /mnt

By default the operation of adding a brick is fast and atomic and leaves the volume in an unbalanced state, so after adding a brick you might want to run the balancing procedure, which moves a portion of data to the new brick from the other bricks of the logical volume, making data distribution on your volume fair:

 # volume.reiser4 -b /mnt

The portion of data blocks moved during such rebalancing is equal to the relative capacity of the new brick, that is, to the portion of capacity that the new brick adds to the updated LV's capacity. This important property defines the cost of the balancing procedure: if the portion of capacity added by a brick is small, then the number of stripes moved during balancing is also small.

Specifying the option -B (--with-balance) will automatically trigger the balancing procedure after adding a brick:

 # volume.reiser4 -Ba /dev/vdb2 /mnt

Upon successful completion update your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]. That is, increment (#2), add info about the new brick to (#3), and remove the record at (#4).
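The two rules above - the 2^19 capacity-ratio limit and the balancing cost being equal to the new brick's relative capacity - can be illustrated with a small Python sketch. The helper names and numbers are hypothetical, not part of any Reiser5 tool:

```python
MAX_CAPACITY_RATIO = 2 ** 19  # limit on the ratio of any two brick capacities

def can_add_brick(capacities, new_cap):
    """Check the 2^19 capacity-ratio constraint for the updated brick set."""
    caps = capacities + [new_cap]
    return max(caps) <= min(caps) * MAX_CAPACITY_RATIO

def balancing_cost(capacities, new_cap, total_data_blocks):
    """Expected number of data blocks moved to the new brick by balancing:
    total data times the new brick's relative capacity in the updated LV."""
    rel = new_cap / (sum(capacities) + new_cap)
    return total_data_blocks * rel

caps = [100_000, 100_000]
print(can_add_brick(caps, 50_000))           # True
print(can_add_brick(caps, 100_000 * 2**20))  # False: ratio 2^20 exceeds 2^19
print(balancing_cost(caps, 50_000, 10_000))  # 20% of the data blocks
```

The max/min comparison covers both directions: a brick that is too small relative to the largest existing brick is rejected just like one that is too large.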
= Removing a data brick from LV =

At any time you are able to remove any data brick from your LV (assuming that your volume is not marked as a "volume with incomplete brick removal"). You can perform brick removal in parallel with regular file operations executing on that volume. Make sure, however, that no other volume operation (e.g. adding a brick) is in progress on your volume, otherwise the removal will fail with EBUSY. Obviously, the removal operation will decrease the abstract capacity of your LV. Note that the other bricks should have enough space to store all data blocks of the brick you want to remove; otherwise the removal operation will return an error (ENOSPC).

Suppose you want to remove brick /dev/vdb2 from your LV mounted at /mnt. Update your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration] with the UUID and name of the brick you want to remove (item #4). To remove the brick, simply pass its name as an argument to option "-r" and specify the logical volume by its mount point:

 # volume.reiser4 -r /dev/vdb2 /mnt

The procedure of brick removal starts by moving all data from the brick you want to remove to the other bricks of your volume, so that the resulting data distribution among the remaining bricks is also fair. The portion of data stripes moved during this migration is equal to the relative capacity of the brick to be removed (that is, to the portion of capacity that the brick added to the LV's capacity). Successful brick removal always leaves the volume in a balanced state. So, in contrast with the operation of adding a brick, removing a brick is a rather long operation, which can be interrupted for various reasons. In that case the volume will be marked as a "volume with incomplete brick removal". To check the removal status of your LV simply run

 # volume.reiser4 /mnt

and check the field "health".
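The ENOSPC precondition above can be illustrated with a rough sketch. The helper is hypothetical and simplified: it only compares totals and ignores the per-brick fairness constraints that the real balancing procedure must satisfy:

```python
def removal_feasible(free_blocks_on_others, data_blocks_on_victim):
    """Rough pre-check for brick removal: the remaining bricks must together
    have enough free blocks to absorb all data blocks of the brick being
    removed, otherwise volume.reiser4 -r is expected to fail with ENOSPC."""
    return sum(free_blocks_on_others) >= data_blocks_on_victim

print(removal_feasible([4000, 2500], 5000))  # True: 6500 free >= 5000 needed
print(removal_feasible([1000, 2500], 5000))  # False: ENOSPC expected
```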
To complete brick removal in the current mount session simply run

 # volume.reiser4 -R /mnt

Note that the option -R (--finish-removal) doesn't accept any arguments. On success update your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]: remove the information about the brick /dev/vdb2 at #3 and #4. Check your kernel logs: they should contain a message that brick /dev/vdb2 has been unregistered. Now device /dev/vdb2 doesn't belong to the logical volume any more, and you can reuse it for other purposes (re-format, etc).

= Changing brick's capacity =

At any time (assuming that no other volume operation is in progress) you can change the abstract capacity of any brick to some new non-zero value. Changing capacity always changes the volume partitioning and therefore breaks fairness of distribution, so Reiser5 automatically launches rebalancing to make sure that the resulting distribution is fair for the new set of capacities. In particular, increasing a brick's capacity will move some data from other bricks to the brick whose capacity was increased; decreasing a brick's capacity will move some data from the brick whose capacity was decreased to the other bricks.

To change the abstract capacity of brick /dev/vdb1 to a new value (e.g. 200000), simply run

 # volume.reiser4 -z /dev/vdb1 -c 200000 /mnt

pronounced as "resize brick /dev/vdb1 to new capacity 200000 in the volume mounted at /mnt".

The operation of changing capacity can return an error. Most likely it is -ENOSPC, a side effect of concurrent regular file writes. In this case check the status of your LV. If it is unbalanced, then consider removing some files from your LV and complete the balancing by running

 # volume.reiser4 -b /mnt

Otherwise, repeat the operation from scratch.

Comment. Changing a brick's capacity to 0 is undefined and will return an error.
Consider the brick removal operation instead.

= Operations with meta-data brick =

The meta-data brick can also contain data stripes and participate in data distribution like the data bricks, so all the volume operations described above are also applicable to the meta-data brick. Note, however, that it is impossible to completely remove the meta-data brick from the logical volume for obvious reasons (meta-data need to be stored somewhere), so the brick removal operation applied to the meta-data brick actually removes it from the Data Storage Array (DSA), not from the logical volume. The DSA is the subset of the LV consisting of the bricks participating in data distribution. Once you remove the meta-data brick from the DSA, that brick will be used only to store meta-data. The operation of adding a brick, applied to the meta-data brick, returns it back to the DSA.

Important: Reiser5 doesn't count busy data and meta-data blocks separately. So, in contrast with data bricks (which contain only data), you are not able to find out the real space occupied by data blocks on the meta-data brick - Reiser5 knows only the total space occupied.

To check the status of the meta-data brick simply run

 # volume.reiser4 /mnt

and compare the values of "bricks total" and "bricks in DSA". If they are equal, then the meta-data brick participates in data distribution. Otherwise, "bricks total" should be 1 more than "bricks in DSA", indicating that the meta-data brick doesn't participate in data distribution (and therefore doesn't contain data blocks). Note that other cases are impossible: for data bricks, participation in the LV and in the DSA are always equivalent.

= Unmounting a logical volume =

To terminate a mount session just issue the usual umount against the mount point:

 # umount /mnt

Note that after unmounting the volume all bricks by default remain registered in the system until system shutdown.
If you want to unregister a brick before system shutdown, then simply issue the following command:

 # volume.reiser4 -u BRICK_NAME

= Deploying a logical volume after correct unmount =

Make sure (by checking your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]) that all bricks of the volume are registered in the system. To register a brick issue the following command:

 # volume.reiser4 -g BRICK_NAME

The list of all volumes and bricks registered in the system can be found in the output of the following command:

 # volume.reiser4 -l

Issue the usual mount(8) command against one of the bricks of your volume. It is recommended to issue it against the meta-data brick.

NOTE: Reiser5 will refuse to mount a logical volume when a wrong (incomplete or redundant) set of bricks is registered in the system. A redundant set of bricks appears, for example, when you mistakenly register a brick that was earlier removed from the logical volume.

= Deploying a logical volume after correct shutdown =

First of all, check the [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing configuration] of your volume and make sure that all its bricks (data and meta-data ones) are registered in the system. The list of registered bricks can be printed by

 # volume.reiser4 -l

Also make sure that the set of registered bricks of the volume doesn't contain bricks not mentioned in the [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration].

Important: Reiser5 will refuse to mount a logical volume when a wrong (incomplete or redundant) set of bricks is registered in the system.
A redundant set of bricks appears, for example, when you mistakenly register a brick that was removed from the logical volume. For this reason we strongly recommend that the user keep track of his LV: store its [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing configuration] somewhere - but not on this volume! - and don't forget to update that configuration after '''every''' volume operation. If you lose the configuration of your LV and don't remember it (which is most likely for large volumes), then it will be rather painful to restore: currently there are no tools to manage logical volumes off-line. So users are prompted to do this on their own; it is not at all difficult.

To register a brick in the system use the following command:

 # volume.reiser4 -g BRICK_NAME

To print the list of all registered bricks use

 # volume.reiser4 -l

Now mount your LV by simply issuing a mount(8) command against one of the bricks of your LV. We recommend issuing it against the meta-data brick.

Comment. Reiser5 always tries to register the brick which is passed to the mount command as an argument, so it is not necessary to pre-register the brick you want to issue the mount command against.

= Deploying a logical volume after hard reset or system crash =

If no volume operations were interrupted by the hard reset or system crash, then just follow the instructions in this [https://reiser4.wiki.kernel.org/index.php?title=Logical_Volumes_Administration#Deploying_a_logical_volume_after_correct_shutdown section].

In Reiser5 only a restricted number of bricks participate in every transaction. The maximal number of such bricks can be specified by the user. At mount time a transaction replay procedure will be launched on each such brick independently, in parallel.
Depending on the kind of interrupted volume operation, perform one of the following actions:

== Adding a brick was interrupted ==

Check your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]. Register the old set of bricks (that is, the set of bricks that the volume had before applying the operation) and try to mount. In case of error, register also the brick you wanted to add and try to mount again. Check the status of your LV by running

 # volume.reiser4 /mnt

If the volume is unbalanced, then complete the balancing manually by running

 # volume.reiser4 -b /mnt

Check "bricks total" of your LV in the output of

 # volume.reiser4 /mnt

Compare it with the old number of bricks in the [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]. The new value should be the old one incremented by 1. If the number of bricks is the same, then your operation of adding a brick was completely rolled back by the transaction manager, so you need to repeat it from scratch. Otherwise, your operation was successfully completed - update your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration] accordingly.

== Brick removal was interrupted ==

Check your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]. Register the new set of bricks (that is, the set of bricks without the brick you wanted to remove). Try to mount the volume.
In case of error, register also the brick you wanted to remove and try to mount again. Check the status of your LV:

 # volume.reiser4 /mnt

If the volume is unbalanced, then complete the balancing manually by running

 # volume.reiser4 -b /mnt

Otherwise, check the total number of bricks in your LV. If it is the same as before the removal, then your removal operation was completely rolled back by the transaction manager, so you will need to repeat it from scratch.

Comment. After successful balancing completion the brick will be automatically removed from the volume and unregistered. Make sure of it by checking the status of your LV and the list of registered bricks:

 # volume.reiser4 /mnt
 # volume.reiser4 -l

Upon successful completion update your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration] accordingly.

== Another volume operation was interrupted ==

Using the volume configuration, register the new set of bricks and try to mount the volume. The mount should be successful.
Check the status of your LV:

 # volume.reiser4 /mnt

If the volume is unbalanced, then complete the balancing manually by running

 # volume.reiser4 -b /mnt

= LV monitoring =

Common info about the LV mounted at /mnt:

 # volume.reiser4 /mnt

 ID:             Volume UUID
 volume:         ID of the plugin managing the volume
 distribution:   ID of the distribution plugin
 stripe:         Stripe size in bytes
 segments:       Number of hash space segments (for distribution)
 bricks total:   Total number of bricks in the volume
 bricks in DSA:  Number of bricks participating in data distribution
 balanced:       Balance status of the volume

Info about its brick of index J:

 # volume.reiser4 -p J /mnt

 internal ID:    Brick's "internal ID" and its status in the volume
 external ID:    Brick's UUID
 device name:    Name of the block device associated with the brick
 block count:    Size of the block device in blocks
 blocks used:    Total number of occupied blocks on the device
 system blocks:  Minimal possible number of busy blocks on that device
 data capacity:  Abstract capacity of the brick
 space usage:    Portion of occupied blocks on the device
 in DSA:         Participation in regular data distribution
 is proxy:       Participation in data tiering (Burst Buffers, etc)

Comment. When retrieving a brick's info make sure that no volume operations on that volume are in progress; otherwise the command above will return an error (EBUSY).

WARNING. Brick info obtained this way is not necessarily the most recent. To get up-to-date info, run sync(1) and make sure that no regular file operations are in progress.

= Checking free space =

To check the number of available free blocks on a volume mounted at /mnt, make sure that no regular file operations or volume operations are in progress on that volume, then run

 # sync
 # df --block-size=4K /mnt

To check the number of free blocks on the brick of index J, run

 # volume.reiser4 -p J /mnt

then calculate the difference between "block count" and "blocks used".

Comment. Not all free blocks on a brick/volume are available for use.
The number of available free blocks is always ~95% of the total number of free blocks (Reiser4 reserves 5% to make sure that regular file truncate operations won't fail).

NOTE: volume.reiser4 shows the total number of free blocks, whereas df(1) shows the number of available free blocks. The "space usage" statistic shows the portion of busy blocks on an individual brick. For the reasons explained above, "space usage" on any brick can not be more than 0.95.

= Checking quality of data distribution =

Quality of data distribution is a measure of the deviation of the real data space usage from the ideal one defined by the volume partitioning. The smaller the deviation, the better the distribution quality. Checking quality of distribution makes sense only when your volume partitioning is space-based, or coincides with the space-based one. If your partitioning is throughput-based and doesn't coincide with the space-based one, then the quality of the actual data distribution can be rather bad: in this case the file system cares that low-performance devices not become a bottleneck, and effective space usage is not a high priority.

Checking quality of data distribution is based on the free blocks accounting provided by the file system. Note that the file system doesn't count busy data and meta-data blocks separately, so you are not able to find the real data space usage - and hence to check quality of distribution - in the case when the meta-data brick contains data blocks.

To check quality of distribution:

* make sure that the meta-data brick doesn't contain data blocks;
* make sure that no regular file and volume operations are currently in progress;
* find the "blocks used", "system blocks" and "data capacity" statistics for each data brick:

 # sync
 # volume.reiser4 -p 1 /mnt
 ...
 # volume.reiser4 -p N /mnt

* find the real data space usage on each brick;
* calculate the partitioning and the ideal data space usage on each data brick;
* find the deviation of the real usage from the ideal one.

Example.
Let's build an LV of 3 bricks (one 10G meta-data brick vdb1, and two data bricks: vdc1 (10G) and vdd1 (5G)) with space-based partitioning:

 # VOL_ID=`uuid -v4`
 # echo "Using uuid $VOL_ID"
 # mkfs.reiser4 -U $VOL_ID -y -t 256K /dev/vdb1
 # mkfs.reiser4 -U $VOL_ID -y -a -t 256K /dev/vdc1
 # mkfs.reiser4 -U $VOL_ID -y -a -t 256K /dev/vdd1
 # mount /dev/vdb1 /mnt

Fill the meta-data brick with data:

 # dd if=/dev/zero of=/mnt/myfile bs=256K
 No space left on device...

Add data bricks /dev/vdc1 and /dev/vdd1 to the volume:

 # volume.reiser4 -a /dev/vdc1 /mnt
 # volume.reiser4 -a /dev/vdd1 /mnt

Move all data blocks to the newly added bricks:

 # volume.reiser4 -r /dev/vdb1 /mnt
 # sync

Now the meta-data brick doesn't contain data blocks (only meta-data ones), so we can calculate the quality of data distribution:

 # volume.reiser4 /mnt -p0
 blocks used: 503
 # volume.reiser4 /mnt -p1
 blocks used: 1657203
 system blocks: 115
 data capacity: 2621069
 # volume.reiser4 /mnt -p2
 blocks used: 833001
 system blocks: 73
 data capacity: 1310391

Based on the statistics above, calculate the quality of distribution.

Total data capacity of the volume:

 C = 2621069 + 1310391 = 3931460

Relative capacities of the data bricks:

 C1 = 2621069 / 3931460 = 0.6667
 C2 = 1310391 / 3931460 = 0.3333

Real space usage on the data bricks (blocks used - system blocks):

 R1 = 1657203 - 115 = 1657088
 R2 = 833001 - 73 = 832928

Space usage on the volume:

 R = R1 + R2 = 1657088 + 832928 = 2490016

Ideal data space usage on the data bricks:

 I1 = C1 * R = 0.6667 * 2490016 = 1660094
 I2 = C2 * R = 0.3333 * 2490016 = 829922

Deviation:

 D = (R1, R2) - (I1, I2) = (-3006, 3006)

Relative deviation:

 D/R = (-0.0012, 0.0012)

Quality of distribution:

 Q = 1 - max(|D1|, |D2|)/R = 1 - 0.0012 = 0.9988

Comment. For any specified number of bricks N and quality of distribution Q it is possible to find a configuration of a logical volume composed of N bricks, such that the quality of distribution on that volume is better than Q.
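The arithmetic of the example above can be re-checked with a short script. It uses the statistics reported by volume.reiser4 -p; the small differences from the hand computation (deviation ~2983 instead of ~3006) come from rounding C1 and C2 to four digits there, and Q agrees to four digits either way:

```python
# (blocks used, system blocks, data capacity) for data bricks 1 and 2,
# copied from the volume.reiser4 -p output in the example above.
bricks = [
    (1657203, 115, 2621069),
    (833001,   73, 1310391),
]
C = sum(cap for _, _, cap in bricks)              # total data capacity: 3931460
rel = [cap / C for _, _, cap in bricks]           # C1 ~ 0.6667, C2 ~ 0.3333
real = [used - sysb for used, sysb, _ in bricks]  # R1 = 1657088, R2 = 832928
R = sum(real)                                     # 2490016
ideal = [R * c for c in rel]                      # I1 ~ 1660071, I2 ~ 829945
dev = [r - i for r, i in zip(real, ideal)]        # roughly (-2983, +2983)
Q = 1 - max(abs(d) for d in dev) / R
print(round(Q, 4))                                # 0.9988
```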
The quality of distribution Q doesn't depend on the number of bricks in the logical volume. This is a theorem, which can be strictly proven.

= FAQ =

Q. What happens if I lose a device-component of my logical volume (due to a breakdown, etc)?

A. The bodies of some of your regular files will become "punched" in random places. The portion of such files depends on the relative capacity of the lost brick, on the number of bricks in the logical volume, and on other factors. Fsck will be able to detect and remove such files with corrupted bodies. Nevertheless, we recommend considering mirroring your bricks (e.g. by software or hardware RAID-1) to avoid such highly unpleasant situations.

[[category:Reiser4]]
That brick is not counted in (2) and (3). The item #4 is to handle incomplete operations interrupted by various reasons (system crash, hard reset, etc) when bringing logical volumes on-line. For each volume its configuration should be stored somewhere (but not on that volume!) and properly updated before and after each volume operation performed on that volume. We make the user responsible for this. Volume configuration is needed to facilitate deploying a volume. '''Abstract capacity''' (or simply capacity) of a brick is a positive integer number. Capacity is a brick's property defined by user. Don't confuse it with the size of block device. Think of it as of brick's "weight" in some units. And this is the user, who decides, which property of the brick to assign as its abstract capacity and in which units. In particular, it can be size of the block device in kilobytes, or its size in megabytes, or its throughput in M/sec, or other geometric or physical parameter of the device, associated with the brick. It is important that capacities of all bricks of the same logical volume are measured in the same units. Also, it would be utterly pointless to assign different properties as abstract capacities for bricks of the same LV. For example, size of block device for one brick, and disk bandwidth for another one. Capacity of each brick gets initialized by mkfs utility. By default it is calculated as number of free blocks on the device at the very end of the formatting procedure. For meta-data brick it is calculated as 70% of such amount. Capacity of any brick can be changed on-line by user. '''Capacity of a logical volume''' is defined as a sum of capacities of its bricks-components. '''Relative capacity of a brick''' is the ratio of brick's capacity to volume's capacity. Relative capacity defines a portion of IO-requests that will be issued against that brick. Array of relative capacities (C1, C2, ...) of all bricks is called volume partitioning. Obviously, C1 + C2 + ... 
= 1. '''(Real) data space usage''' on a brick is number of data blocks, stored on that brick. '''Ideal (or expected) data space usage''' on a brick is T*C, where T is total number of data blocks stored in the volume. C is relative capacity of the brick. It is recommended to compose volumes in the way so that space-based partitioning coincides with throughput-based one - it would be the optimal volume configuration, which provides true parallelism. If it is impossible for some reason, then choose a preferred partitioning method (space-based, or throughput-based). Note that space-based partitioning saves volume space, whereas throughput based one saves volume throughput. When performing regular file operations, Reiser5 distributes data stripes throughout the volume evenly and fairly. It means that portion of IO-requests issued against each brick is equal to its relative capacity, that is, to the portion of capacity that the brick adds to the total volume's capacity. In contrast with regular file operations, volume operations break fairness of data distribution on your logical volume. To restore fairness of distribution, a special balancing procedure should be run on the volume. For example, after adding a brick to a logical volume, the balancing procedure will populate the new brick with data, moved from other bricks. All volume operations except brick removal are fast, atomic and leave the volume in unbalanced state. Operation of brick removal is always includes balancing, which moves data from the brick you want to remove to other bricks of the volume. If that data migration is interrupted for some reason, then the volume is marked as a "volume with incomplete brick removal". It is allowed to perform regular file and volume operations on a not balanced LV (assuming, it was not incomplete removal). However, in this case we don't guarantee a good quality of data distribution on your LV. 
In addition, on a volume with incomplete removal you won't be able to perform regular volume operations - first you will need to complete the removal by running a special removal completion procedure on your volume. = Prepare Software and Hardware = Build, install and boot kernel with Reiser4 of software framework release number 5.X.Y. Kernel patches can be found [https://sourceforge.net/projects/reiser4/files/v5-unstable/ here]. Note that by Linux kernel and GNU utilities the testing stuff is still recognized as "Reiser4". Make sure there is the following message in kernel logs: "Loading Reiser4 (Software Framework Release: 5.X.Y)" Build and install the latest [https://sourceforge.net/projects/reiser4/files/reiser4-utils/libaal/ libaal] Download, build and install the latest version 2.A.B of [https://sourceforge.net/projects/reiser4/files/v5-unstable/ Reiser4progs package]. Make sure that utility for managing logical volumes is installed (as a part of reiser4progs package) on your machine: # volume.reiser4 -? = Creating a logical volume = Start from choosing a unique ID (uuid) of your volume. By default it is generated by mkfs utility. However, user can generate it himself by proper tools (e.g. uuid(1)) and store in an environment variable for convenience: # VOL_ID=`uuidgen` # echo "Using uuid $VOL_ID" Choose a stripe size for your logical volume. For a good quality of distribution it is recommended that stripe doesn't exceed 1/10000 of volume size. On the other hand, too small stripes will increase space consumption on your meta-data brick. In our example we choose stripe size 512K: # STRIPE=512K # echo "Using stripe size $STRIPE" Start from creating the first brick of your volume - meta-data brick, passing volume-ID and stripe size to mkfs.reiser4 utility: # mkfs.reiser4 -U $VOL_ID -t $STRIPE /dev/vdb1 Currently only one meta-data brick per volume is supported, so it is recommended that size of block device for meta-data brick in not too small. 
In most cases it will be enough if your meta-data brick is not smaller than 1/200 of the maximal volume size. For example, a 100G meta-data brick will be able to service a ~20T logical volume. Data and meta-data bricks don't differ from the standpoint of disk format, and there is no special option to tell the mkfs utility that we want to create a meta-data brick: the first brick in the volume automatically becomes the meta-data brick, and the other bricks are interpreted as data bricks. Mount your initial logical volume consisting of one meta-data brick: # mount /dev/vdb1 /mnt Find a record about your volume in the output of the following command: # volume.reiser4 -l Create the configuration of your logical volume (its definition is above) and store it somewhere - but not on that volume! Your logical volume is now on-line and ready to use. You can perform regular file operations and volume operations (e.g. add a data brick to your LV). = Adding a data brick to LV = At any time you are able to add a data brick to your LV. You can do it in parallel with regular file operations executing on this volume. Make sure, however, that no other volume operation (e.g. removing a brick) is in progress on your volume, otherwise your operation will fail with EBUSY. Obviously, adding a brick will increase the capacity of your volume. Choose a block device for the new data brick. Make sure that it is not too large or too small: capacities of any 2 bricks of the same logical volume can not differ by more than 2^19 (~500,000) times. E.g. your logical volume can not contain both 1M and 2T bricks. Any attempt to add a brick of improper capacity will fail with an error. Format it with the same volume ID and stripe size as you used for the meta-data brick, but also specify the "-a" option (to not restrict data capacity). 
# mkfs.reiser4 -U $VOL_ID -t $STRIPE -a /dev/vdb2 Important: the data brick must be formatted with the same volume ID and stripe size as the meta-data brick of your logical volume. Otherwise, the operation of adding the data brick will fail. Update item #4 of your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration] with the UUID or name of the brick you want to add. To add a brick, simply pass its name as an argument to the option "-a" and specify your LV via its mount point: # volume.reiser4 -a /dev/vdb2 /mnt By default the operation of adding a brick is fast and atomic and leaves the volume in an unbalanced state, so after adding a brick you might want to run a balancing procedure, which will move a portion of data to the new brick from the other bricks of the logical volume, making data distribution on your volume fair: # volume.reiser4 -b /mnt The portion of data blocks moved during such rebalancing is equal to the relative capacity of the new brick, that is, to the portion of capacity that the new brick adds to the updated LV's capacity. This important property defines the cost of the balancing procedure: if the portion of capacity added by a brick is small, then the number of stripes moved during balancing is also small. Specifying the option -B (--with-balance) will automatically trigger the balancing procedure after adding a brick: # volume.reiser4 -Ba /dev/vdb2 /mnt Upon successful completion update your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]. That is, increment (#2), add info about the new brick to (#3) and remove the record at (#4). 
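The balancing-cost estimate above can be sketched as a toy calculation. The capacities here are hypothetical, chosen only to show the proportion:

```python
def balancing_cost(existing_capacities, new_capacity):
    """Fraction of the volume's data stripes moved to a newly added
    brick during balancing: equal to the new brick's relative capacity
    in the updated volume."""
    return new_capacity / (sum(existing_capacities) + new_capacity)

# Adding a brick that contributes 1/5 of the updated total capacity
# means roughly 20% of the stripes migrate to it during balancing.
print(balancing_cost([2_000_000, 2_000_000], 1_000_000))  # 0.2
```

This is why adding a small brick to a large volume is cheap to balance, while adding a brick that doubles the capacity moves about half of the data.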
= Removing a data brick from LV = At any time you are able to remove any data brick from your LV (assuming that your volume is not marked as a "volume with incomplete brick removal"). You can perform brick removal in parallel with regular file operations executing on that volume. Make sure, however, that no other volume operation (e.g. adding a brick) is in progress on your volume, otherwise your removal will fail with EBUSY. Obviously, the removal operation will decrease the abstract capacity of your LV. Note that the other bricks should have enough space to store all data blocks of the brick you want to remove; otherwise, the removal operation will return an error (ENOSPC). Suppose you want to remove brick /dev/vdb2 from your LV mounted at /mnt. Update your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration] with the UUID and name of the brick you want to remove (item #4). To remove a brick, simply pass its name as an argument to option "-r" and specify the logical volume by its mount point: # volume.reiser4 -r /dev/vdb2 /mnt The procedure of brick removal starts by moving all data from the brick you want to remove to the other bricks of your volume, so that the resulting data distribution among the remaining bricks is also fair. The portion of data stripes moved during such migration is equal to the relative capacity of the brick to be removed (that is, to the portion of capacity that the brick added to the LV's capacity). Successful brick removal always leaves the volume in a balanced state. So, in contrast with the operation of adding a brick, removing a brick is a rather long operation, which can be interrupted for various reasons. In this case the volume will be marked as a "volume with incomplete brick removal". To check the removal status of your LV simply run # volume.reiser4 /mnt and check the field "health". 
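The check-then-complete logic can be sketched as a tiny helper. Note that the exact wording of the "health" field is an assumption here; consult the output of your reiser4progs version:

```python
def removal_followup(health):
    """Suggest the next step given the "health" field reported by
    `volume.reiser4 /mnt`.  The exact field wording is an assumption;
    the -R (--finish-removal) command itself is described below."""
    if "incomplete" in health.lower():
        # Removal was interrupted: it must be completed before any
        # other volume operation can be performed.
        return "volume.reiser4 -R /mnt"
    return None  # nothing to complete

print(removal_followup("incomplete brick removal"))
```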
To complete brick removal in the current mount session simply run # volume.reiser4 -R /mnt Note that the option -R (--finish-removal) doesn't accept any arguments. On success update your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]: remove the information about the brick /dev/vdb2 at #3 and #4. Check your kernel logs: they should contain a message that brick /dev/vdb2 has been unregistered. Now device /dev/vdb2 doesn't belong to the logical volume any more, and you can reuse it for other purposes (re-format, etc). = Changing brick's capacity = At any time (assuming that no other volume operation is in progress) you can change the abstract capacity of any brick to some new value different from 0. Changing capacity always changes the volume partitioning, and therefore breaks fairness of distribution, so Reiser5 automatically launches rebalancing to make sure that the resulting distribution is fair for the new set of capacities. In particular, increasing a brick's capacity will move some data from other bricks to the brick whose capacity was increased. Decreasing a brick's capacity will move some data from the brick whose capacity was decreased to other bricks. To change the abstract capacity of a brick /dev/vdb1 to a new value (e.g. 200000), simply run # volume.reiser4 -z /dev/vdb1 -c 200000 /mnt Pronounced as "resize brick /dev/vdb1 to new capacity 200000 in the volume mounted at /mnt". The operation of changing capacity can return an error. Most likely it is -ENOSPC, which is a side effect of concurrent regular file writes. In this case check the status of your LV. If it is unbalanced, then consider removing some files from your LV and complete balancing by running # volume.reiser4 -b /mnt Otherwise, repeat the operation from scratch. Comment. Changing a brick's capacity to 0 is undefined and will return an error. 
Consider the brick removal operation instead. = Operations with meta-data brick = The meta-data brick can also contain data stripes and participate in data distribution like the data bricks, so all the volume operations described above are also applicable to the meta-data brick. Note, however, that it is impossible to completely remove the meta-data brick from the logical volume for obvious reasons (meta-data needs to be stored somewhere), so the brick removal operation applied to the meta-data brick actually removes it from the Data Storage Array (DSA), not from the logical volume. The DSA is the subset of the LV consisting of the bricks participating in data distribution. Once you remove the meta-data brick from the DSA, that brick will be used only to store meta-data. The operation of adding a brick, applied to the meta-data brick, returns it to the DSA. Important: Reiser5 doesn't count busy data and meta-data blocks separately. So in contrast with data bricks (which contain only data) you are not able to find out the real space occupied by data blocks on the meta-data brick - Reiser5 knows only the total space occupied. To check the status of the meta-data brick simply run # volume.reiser4 /mnt and compare the values of "bricks total" and "bricks in DSA". If they are equal, then the meta-data brick participates in data distribution. Otherwise, "bricks total" should be 1 more than "bricks in DSA" - this indicates that the meta-data brick doesn't participate in data distribution (and therefore doesn't contain data blocks). Note that other cases are impossible: for data bricks, participation in the LV and in the DSA is always equivalent. = Unmounting a logical volume = To terminate a mount session just issue the usual umount command with the mount point specified. Note that after unmounting the volume all bricks by default remain registered in the system until system shutdown. 
If you want to unregister a brick before system shutdown, then simply issue the following command: # volume.reiser4 -u BRICK_NAME = Deploying a logical volume after correct unmount = Make sure (by checking your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]) that all bricks of the volume are registered in the system. To register a brick issue the following command: # volume.reiser4 -g BRICK_NAME The list of all volumes and bricks registered in the system can be found in the output of the following command: # volume.reiser4 -l Issue the usual mount(8) command against one of the bricks of your volume. It is recommended to issue it against the meta-data brick. NOTE: Reiser5 will refuse to mount a logical volume when a wrong (incomplete or redundant) set of bricks is registered in the system. A redundant set of bricks appears, for example, when you mistakenly register a brick that was earlier removed from the logical volume. = Deploying a logical volume after correct shutdown = First of all, check the [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing configuration] of your volume and make sure that all its bricks (data and meta-data ones) are registered in the system. The list of registered bricks can be printed by # volume.reiser4 -l Also make sure that the set of registered bricks of the volume doesn't contain bricks not mentioned in the [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]. Important: Reiser5 will refuse to mount a logical volume when a wrong (incomplete or redundant) set of bricks is registered in the system. 
A redundant set of bricks appears, for example, when you mistakenly register a brick that was removed from the logical volume. For this reason we strongly recommend keeping track of your LV - store its [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing configuration] somewhere, but not on that volume! And don't forget to update that configuration after '''every''' volume operation. If you have lost the configuration of your LV and don't remember it (which is most likely for large volumes), then it will be rather painful to restore: currently there are no tools to manage logical volumes off-line, so users have to keep the configuration on their own. It is not at all difficult. To register a brick in the system use the following command: # volume.reiser4 -g BRICK_NAME To print a list of all registered bricks use # volume.reiser4 -l Now mount your LV by simply issuing a mount(8) command against one of the bricks of your LV. We recommend issuing it against the meta-data brick. Comment. Reiser5 always tries to register the brick which is passed to the mount command as an argument, so it is not necessary to pre-register the brick you want to issue the mount command against. = Deploying a logical volume after hard reset or system crash = If no volume operations were interrupted by the hard reset or system crash, then just follow the instructions in this [https://reiser4.wiki.kernel.org/index.php?title=Logical_Volumes_Administration#Deploying_a_logical_volume_after_correct_shutdown section]. In Reiser5 only a restricted number of bricks participate in every transaction. The maximal number of such bricks can be specified by the user. At mount time a transaction replay procedure will be launched on each such brick independently, in parallel. 
Depending on the kind of interrupted volume operation, perform one of the following actions: == Adding a brick was interrupted == Check your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]. Register the old set of bricks (that is, the set of bricks that the volume had before the operation) and try to mount. In the case of error, register also the brick you wanted to add and try to mount again. Check the status of your LV by running # volume.reiser4 /mnt If the volume is unbalanced, then complete balancing manually by running # volume.reiser4 -b /mnt Check "bricks total" of your LV in the output of # volume.reiser4 /mnt Compare it with the old number of bricks in the [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]. The new value should be greater than the old one by 1. If the number of bricks is the same, then your operation of adding a brick was completely rolled back by the transaction manager, and you need to repeat it from scratch. Otherwise, your operation was successfully completed - update your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration] accordingly. == Brick removal was interrupted == Check your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]. Register the new set of bricks (that is, the set of bricks without the brick you wanted to remove). Try to mount the volume. 
In the case of error, register also the brick you wanted to remove and try to mount again. Check the status of your LV: # volume.reiser4 /mnt If the volume is unbalanced, then complete balancing manually by running # volume.reiser4 -b /mnt Otherwise, check the total number of bricks in your LV. If it is the same as before the removal, then your removal operation was completely rolled back by the transaction manager, and you will need to repeat it from scratch. Comment. After successful balancing completion the brick will be automatically removed from the volume and unregistered. Make sure of it by checking the status of your LV and the list of registered bricks: # volume.reiser4 /mnt # volume.reiser4 -l Upon successful completion update your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration] accordingly. == Another volume operation was interrupted == Using the volume configuration, register the new set of bricks and try to mount the volume. The mount should be successful. 
Check the status of your LV: # volume.reiser4 /mnt If the volume is unbalanced, then complete balancing manually by running # volume.reiser4 -b /mnt = LV monitoring = Common info about the LV mounted at /mnt: # volume.reiser4 /mnt
 ID: Volume UUID
 volume: ID of the plugin managing the volume
 distribution: ID of the distribution plugin
 stripe: Stripe size in bytes
 segments: Number of hash space segments (for distribution)
 bricks total: Total number of bricks in the volume
 bricks in DSA: Number of bricks participating in data distribution
 balanced: Balanced status of the volume
Info about any of its bricks of index J: # volume.reiser4 -p J /mnt
 internal ID: Brick's "internal ID" and its status in the volume
 external ID: Brick's UUID
 device name: Name of the block device associated with the brick
 block count: Size of the block device in blocks
 blocks used: Total number of occupied blocks on the device
 system blocks: Minimal possible number of busy blocks on that device
 data capacity: Abstract capacity of the brick
 space usage: Portion of occupied blocks on the device
 in DSA: Participation in regular data distribution
 is proxy: Participation in data tiering (Burst Buffers, etc)
Comment. When retrieving a brick's info make sure that no volume operations on that volume are in progress; otherwise the command above will return an error (EBUSY). WARNING. Brick info obtained this way is not necessarily the most recent. To get up-to-date info run sync(1) and make sure that no regular file operations are in progress. = Checking free space = To check the number of available free blocks on a volume mounted at /mnt, make sure that no regular file operations or volume operations are in progress on that volume, then run # sync # df --block-size=4K /mnt To check the number of free blocks on the brick of index J run # volume.reiser4 -p J /mnt and calculate the difference between "block count" and "blocks used". Comment. Not all free blocks on a brick/volume are available for use. 
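The free-space arithmetic can be sketched like this. The figures are hypothetical; the ~5% reserve that Reiser4 keeps for truncate operations is explained in this section:

```python
def free_blocks(block_count, blocks_used):
    """Free blocks on a brick: the difference between the "block count"
    and "blocks used" fields of `volume.reiser4 -p J /mnt`."""
    return block_count - blocks_used

def available_blocks(block_count, blocks_used):
    """Blocks actually available for allocation: ~95% of the free
    blocks, since Reiser4 reserves ~5% so that regular file truncate
    operations won't fail."""
    return int(0.95 * free_blocks(block_count, blocks_used))

# Hypothetical brick: 1,000,000 blocks total, 400,000 occupied.
print(free_blocks(1_000_000, 400_000))       # 600000
print(available_blocks(1_000_000, 400_000))  # 570000
```

This also explains the discrepancy between volume.reiser4 (total free blocks) and df(1) (available free blocks) noted below.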
The number of available free blocks is always ~95% of the total number of free blocks (Reiser4 reserves 5% to make sure that regular file truncate operations won't fail). NOTE: volume.reiser4 shows the total number of free blocks, whereas df(1) shows the number of available free blocks. The "space usage" statistic shows the portion of busy blocks on an individual brick. For the reasons explained above, "space usage" on any brick can not be more than 0.95. = Checking quality of data distribution = Quality of data distribution is a measure of the deviation of the real data space usage from the ideal one defined by the volume partitioning. The smaller the deviation, the better the distribution quality. Checking quality of distribution makes sense only when your volume partitioning is space-based, or coincides with the space-based one. If your partitioning is throughput-based and doesn't coincide with the space-based one, then the quality of the actual data distribution can be rather bad: in this case the file system takes care that low-performance devices do not become a bottleneck, and effective space usage is not a high priority. Checking quality of data distribution is based on the free blocks accounting provided by the file system. Note that the file system doesn't count busy data and meta-data blocks separately, so you are not able to find the real data space usage, and hence to check quality of distribution, when the meta-data brick contains data blocks. To check quality of distribution: * make sure that the meta-data brick doesn't contain data blocks; * make sure that no regular file or volume operations are currently in progress; * find the "blocks used", "system blocks" and "data capacity" statistics for each data brick: # sync # volume.reiser4 -p 1 /mnt ... # volume.reiser4 -p N /mnt * find the real data space usage on each brick; * calculate the partitioning and the ideal data space usage on each data brick; * find the deviation of the real usage from the ideal one. Example. 
Let's build an LV of 3 bricks (one 10G meta-data brick vdb1, and two data bricks: vdc1 (10G) and vdd1 (5G)) with space-based partitioning: # VOL_ID=`uuid -v4` # echo "Using uuid $VOL_ID" # mkfs.reiser4 -U $VOL_ID -y -t 256K /dev/vdb1 # mkfs.reiser4 -U $VOL_ID -y -a -t 256K /dev/vdc1 # mkfs.reiser4 -U $VOL_ID -y -a -t 256K /dev/vdd1 # mount /dev/vdb1 /mnt Fill the meta-data brick with data: # dd if=/dev/zero of=/mnt/myfile bs=256K No space left on device... Add data bricks /dev/vdc1 and /dev/vdd1 to the volume: # volume.reiser4 -a /dev/vdc1 /mnt # volume.reiser4 -a /dev/vdd1 /mnt Move all data blocks to the newly added bricks: # volume.reiser4 -r /dev/vdb1 /mnt # sync Now the meta-data brick doesn't contain data blocks (only meta-data ones), so we can calculate the quality of data distribution:
 # volume.reiser4 /mnt -p0
  blocks used: 503
 # volume.reiser4 /mnt -p1
  blocks used: 1657203
  system blocks: 115
  data capacity: 2621069
 # volume.reiser4 /mnt -p2
  blocks used: 833001
  system blocks: 73
  data capacity: 1310391
Based on the statistics above, calculate the quality of distribution.
 Total data capacity of the volume: C = 2621069 + 1310391 = 3931460
 Relative capacities of the data bricks: C1 = 2621069 / 3931460 = 0.6667 C2 = 1310391 / 3931460 = 0.3333
 Real space usage on the data bricks (blocks used - system blocks): R1 = 1657203 - 115 = 1657088 R2 = 833001 - 73 = 832928
 Space usage on the volume: R = R1 + R2 = 1657088 + 832928 = 2490016
 Ideal data space usage on the data bricks: I1 = C1 * R = 0.6667 * 2490016 = 1660094 I2 = C2 * R = 0.3333 * 2490016 = 829922
 Deviation: D = (R1, R2) - (I1, I2) = (-3006, 3006)
 Relative deviation: D/R = (-0.0012, 0.0012)
 Quality of distribution: Q = 1 - max(|D1/R|, |D2/R|) = 1 - 0.0012 = 0.9988
Comment. For any specified number of bricks N and quality of distribution Q it is possible to find a configuration of a logical volume composed of N bricks, such that the quality of distribution on that volume will be better than Q. 
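The example's arithmetic can be re-checked with a short script. It uses exact relative capacities instead of the rounded 0.6667/0.3333, so the intermediate deviation comes out near ±2983 rather than the rounded hand-calculated figure, but the final quality agrees:

```python
def distribution_quality(used, system, capacity):
    """Recompute the quality of distribution from the "blocks used",
    "system blocks" and "data capacity" fields of each data brick:
    Q = 1 - max relative deviation of real from ideal usage."""
    real = [u - s for u, s in zip(used, system)]      # R_i
    total_cap = sum(capacity)
    rel_cap = [c / total_cap for c in capacity]       # C_i (partitioning)
    total_real = sum(real)                            # R
    ideal = [c * total_real for c in rel_cap]         # I_i = C_i * R
    deviation = [r - i for r, i in zip(real, ideal)]  # D_i
    return 1 - max(abs(d) for d in deviation) / total_real

# Figures from the example above (data bricks of index 1 and 2):
q = distribution_quality(used=[1657203, 833001],
                         system=[115, 73],
                         capacity=[2621069, 1310391])
print(round(q, 4))  # → 0.9988
```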
Quality of distribution Q doesn't depend on the number of bricks in the logical volume. This is a theorem, which can be strictly proven. = FAQ = Q. What happens if I lose a device-component (due to a breakdown, etc) of my logical volume? A. The bodies of some of your regular files will become "punched" in random places. The portion of such files depends on the relative capacity of the lost brick, on the number of bricks in the logical volume, and on other factors. Fsck will be able to detect and remove files with corrupted bodies. Nevertheless, we recommend considering mirroring your bricks (e.g. by software or hardware RAID-1) to avoid such highly unpleasant situations. [[category:Reiser4]] 
That brick is not counted in (2) and (3). The item #4 is to handle incomplete operations interrupted by various reasons (system crash, hard reset, etc) when bringing logical volumes on-line. For each volume its configuration should be stored somewhere (but not on that volume!) and properly updated before and after each volume operation performed on that volume. We make the user responsible for this. Volume configuration is needed to facilitate deploying a volume. '''Abstract capacity''' (or simply capacity) of a brick is a positive integer number. Capacity is a brick's property defined by user. Don't confuse it with the size of block device. Think of it as of brick's "weight" in some units. And this is the user, who decides, which property of the brick to assign as its abstract capacity and in which units. In particular, it can be size of the block device in kilobytes, or its size in megabytes, or its throughput in M/sec, or other geometric or physical parameter of the device, associated with the brick. It is important that capacities of all bricks of the same logical volume are measured in the same units. Also, it would be utterly pointless to assign different properties as abstract capacities for bricks of the same LV. For example, size of block device for one brick, and disk bandwidth for another one. Capacity of each brick gets initialized by mkfs utility. By default it is calculated as number of free blocks on the device at the very end of the formatting procedure. For meta-data brick it is calculated as 70% of such amount. Capacity of any brick can be changed on-line by user. '''Capacity of a logical volume''' is defined as a sum of capacities of its bricks-components. '''Relative capacity of a brick''' is the ratio of brick's capacity to volume's capacity. Relative capacity defines a portion of IO-requests that will be issued against that brick. Array of relative capacities (C1, C2, ...) of all bricks is called volume partitioning. Obviously, C1 + C2 + ... 
= 1. '''(Real) data space usage''' on a brick is number of data blocks, stored on that brick. '''Ideal (or expected) data space usage''' on a brick is T*C, where T is total number of data blocks stored in the volume. C is relative capacity of the brick. It is recommended to compose volumes in the way so that space-based partitioning coincides with throughput-based one - it would be the optimal volume configuration, which provides true parallelism. If it is impossible for some reason, then choose a preferred partitioning method (space-based, or throughput-based). Note that space-based partitioning saves volume space, whereas throughput based one saves volume throughput. When performing regular file operations, Reiser5 distributes data stripes throughout the volume evenly and fairly. It means that portion of IO-requests issued against each brick is equal to its relative capacity, that is, to the portion of capacity that the brick adds to the total volume's capacity. In contrast with regular file operations, volume operations break fairness of data distribution on your logical volume. To restore fairness of distribution, a special balancing procedure should be run on the volume. For example, after adding a brick to a logical volume, the balancing procedure will populate the new brick with data, moved from other bricks. All volume operations except brick removal are fast, atomic and leave the volume in unbalanced state. Operation of brick removal is always includes balancing, which moves data from the brick you want to remove to other bricks of the volume. If that data migration is interrupted for some reason, then the volume is marked as a "volume with incomplete brick removal". It is allowed to perform regular file and volume operations on a not balanced LV (assuming, it was not incomplete removal). However, in this case we don't guarantee a good quality of data distribution on your LV. 
In addition, on a volume with incomplete removal you won't be able to perform regular volume operations - first you will need to complete the removal by running a special removal completion procedure on your volume. = Prepare Software and Hardware = Build, install and boot kernel with Reiser4 of software framework release number 5.X.Y. Kernel patches can be found [https://sourceforge.net/projects/reiser4/files/v5-unstable/ here]. Note that by Linux kernel and GNU utilities the testing stuff is still recognized as "Reiser4". Make sure there is the following message in kernel logs: "Loading Reiser4 (Software Framework Release: 5.X.Y)" Build and install the latest [https://sourceforge.net/projects/reiser4/files/reiser4-utils/libaal/ libaal] Download, build and install the latest version 2.A.B of [https://sourceforge.net/projects/reiser4/files/v5-unstable/ Reiser4progs package]. Make sure that utility for managing logical volumes is installed (as a part of reiser4progs package) on your machine: # volume.reiser4 -? = Creating a logical volume = Start from choosing a unique ID (uuid) of your volume. By default it is generated by mkfs utility. However, user can generate it himself by proper tools (e.g. uuid(1)) and store in an environment variable for convenience: # VOL_ID=`uuidgen` # echo "Using uuid $VOL_ID" Choose a stripe size for your logical volume. For a good quality of distribution it is recommended that stripe doesn't exceed 1/10000 of volume size. On the other hand, too small stripes will increase space consumption on your meta-data brick. In our example we choose stripe size 512K: # STRIPE=512K # echo "Using stripe size $STRIPE" Start from creating the first brick of your volume - meta-data brick, passing volume-ID and stripe size to mkfs.reiser4 utility: # mkfs.reiser4 -U $VOL_ID -t $STRIPE /dev/vdb1 Currently only one meta-data brick per volume is supported, so it is recommended that size of block device for meta-data brick in not too small. 
In most cases it will be enough, if your meta-data brick is not smaller than 1/200 of maximal volume size. For example, 100G meta-data brick will be able to service ~20T logical volume. Data and meta-data bricks don't differ from the standpoint of disk format, and there is no special option to inform mkfs utility that we want to create exactly meta-data brick: the first brick in the volume automatically becomes a meta-data brick, and other bricks are interpreted as data bricks. Mount your initial logical volume consisting of one meta-data brick: # mount /dev/vdb1 /mnt Find a record about your volume in the output of the following command: # volume.reiser4 -l Create configuration of your logical volume (its definition is above) and store it somewhere, but not on that volume! Your logical volume is now on-line and ready to use. You can perform regular file operations and volume operations (e.g. add a data brick to your LV). = Adding a data brick to LV = At any time you are able to add a data brick to your LV. You can do it in parallel with regular file operations executing on this volume. Make sure, however, that there is no other volume operations (e.g. removing a brick) over your volume in progress, otherwise your operation will fail with EBUSY. Obviously, adding a brick will increase capacity of your volume. Choose a block device for the new data brick. Make sure that it is not too large, or too small. Capacities of any 2 bricks of the same logical volume can not differ more than 2^19 (~1 million) times. E.g. your logical volume can not contain both, 1M and 2T bricks. Any attempts to add a brick of improper capacity will fail with error. Format it with the same volume ID and stripe size, as you used for meta-data brick, but specify also "-a" option (to not restrict data capacity). 
# mkfs.reiser4 -U $VOL_ID -t $STRIPE -a /dev/vdb2 Important: it is important that data brick is formatted with the same volume ID and stripe size, as the meta-data brick of your logical volume. Otherwise, operation of adding a data brick will fail. Update item #4 of your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration] with UUID or name of the brick you want to add. To add a brick simply pass its name as an argument for the option "-a" and specify your LV via its mount point: # volume.reiser4 -a /dev/vdb2 /mnt By default operation of adding a brick is fast and atomic and leaves the volume in unbalanced state, so after adding a brick you might want to run a balancing procedure, which will move a portion of data to the new brick from other bricks of the logical volume, which will make data distribution on your volume fair: # volume.reiser4 -b /mnt Portion of data blocks, being moved during such rebalancing, is equal to relative capacity of the new brick, that is to the portion of capacity that the new brick adds to updated LV's capacity. This important property defines the cost of balancing procedure. If the portion of capacity added by a brick is small, then number of stripes moved during balancing is also small. Specifying the option -B (--with-balance) will automatically trigger the balancing procedure after adding a brick: # volume.reiser4 -Ba /dev/vdb2 /mnt Upon successful completion update your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]. That is, increment (#2), add info about the new brick to (#3) and remove records at (#4). = Removing a data brick from LV = At any time you are able to remove a data brick from your LV. 
You can do it in parallel with regular file operations executing on this volume. Make sure, however, that no other volume operation (e.g. adding a brick) is in progress on your volume, otherwise your operation will fail with EBUSY. Obviously, removing a brick will decrease the abstract capacity of your LV. Note that the other bricks must have enough space to store all data blocks of the brick you want to remove; otherwise the removal operation will return an error (ENOSPC).

Suppose you want to remove the brick /dev/vdb2 from your LV mounted at /mnt. Update your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration] with the UUID and name of the brick you want to remove (item #4). To remove a brick, simply pass its name as an argument to the option "-r" and specify the logical volume by its mount point:
 # volume.reiser4 -r /dev/vdb2 /mnt
The procedure of brick removal automatically invokes re-balancing, which distributes the data of the brick to be removed among the other bricks, so that the resulting distribution is also fair. The portion of data stripes moved during such rebalancing is equal to the relative capacity of the brick to be removed (that is, to the portion of capacity that the brick added to the LV's capacity).

It can happen that the command above completes with an error (like any other user-space application). In this case check the status of your LV:
 # volume.reiser4 /mnt
If the volume is not balanced, then simply complete balancing manually:
 # volume.reiser4 -b /mnt
Otherwise, check the number of bricks in your logical volume - it should be the same as before the failed operation. The error ENOSPC indicates that the free space on the other bricks is not enough to fit all the data of the brick you want to remove.
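Whether a removal is likely to hit ENOSPC can be estimated in advance from the per-brick statistics printed by "volume.reiser4 -p J /mnt" (see the LV monitoring section). The sketch below only illustrates the arithmetic, with made-up sample numbers; it also accounts for the fact that Reiser4 reserves 5% of free blocks, so only ~95% of free space is actually available.

```shell
#!/bin/sh
# Rough pre-check for brick removal: will the data of the brick to be
# removed fit into the available free space of the remaining bricks?
# All numbers are sample values of the statistics printed by
# "volume.reiser4 -p J /mnt" (block count, blocks used, system blocks).

# Brick to remove: data to migrate = "blocks used" - "system blocks"
REMOVE_USED=833001
REMOVE_SYSTEM=73
DATA_TO_MOVE=$((REMOVE_USED - REMOVE_SYSTEM))

# Remaining brick(s): free = "block count" - "blocks used";
# only ~95% of the free blocks are available (5% is reserved).
REST_COUNT=2621440
REST_USED=1657203
AVAIL_REST=$(( (REST_COUNT - REST_USED) * 95 / 100 ))

if [ "$DATA_TO_MOVE" -le "$AVAIL_REST" ]; then
    echo "removal should fit: need $DATA_TO_MOVE blocks, $AVAIL_REST available"
else
    echo "removal would likely fail with ENOSPC"
fi
```

With several remaining bricks, sum their available free blocks instead of using a single brick's figure.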
On success, update your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]: remove the information about the brick /dev/vdb2 at #3 and #4. Check your kernel logs: they should contain a message that brick /dev/vdb2 has been unregistered. Now the device /dev/vdb2 doesn't belong to the logical volume any more, and you can reuse it for other purposes (re-format it, etc).

= Changing brick's capacity =

At any time (assuming that no other volume operation is in progress) you can change the abstract capacity of any brick to a new non-zero value. Changing capacity always changes the volume partitioning, and therefore breaks fairness of distribution, so Reiser5 automatically launches rebalancing to make sure that the resulting distribution is fair for the new set of capacities. In particular, increasing a brick's capacity will move some data from other bricks to the brick whose capacity was increased; decreasing a brick's capacity will move some data from the brick whose capacity was decreased to other bricks.

To change the abstract capacity of a brick /dev/vdb1 to a new value (e.g. 200000), simply run
 # volume.reiser4 -z /dev/vdb1 -c 200000 /mnt
Pronounced as "resize brick /dev/vdb1 to new capacity 200000 in the volume mounted at /mnt".

The operation of changing capacity can return an error. Most likely it is ENOSPC, a side effect of concurrent regular file writes. In this case check the status of your LV. If it is unbalanced, then consider removing some files from your LV and complete balancing by running
 # volume.reiser4 -b /mnt
Otherwise, repeat the operation from scratch.

Comment. Changing a brick's capacity to 0 is undefined and will return an error. Consider the brick removal operation instead.

= Operations with meta-data brick =

The meta-data brick can also contain data stripes and participate in data distribution like the other data bricks.
Thus all the volume operations described above are also applicable to the meta-data brick. Note, however, that it is impossible to completely remove the meta-data brick from the logical volume for obvious reasons (meta-data needs to be stored somewhere), so the brick removal operation applied to the meta-data brick actually removes it from the Data Storage Array (DSA), not from the logical volume. The DSA is the subset of the LV consisting of the bricks participating in data distribution. Once you remove the meta-data brick from the DSA, that brick will be used only to store meta-data. The operation of adding a brick, applied to the meta-data brick, returns it back to the DSA.

Important: Reiser5 doesn't count busy data and meta-data blocks separately. So, in contrast with data bricks (which contain only data), you are not able to find out the real space occupied by data blocks on the meta-data brick - Reiser5 knows only the total space occupied.

To check the status of the meta-data brick simply run
 # volume.reiser4 /mnt
and compare the values of "bricks total" and "bricks in DSA". If they are equal, then the meta-data brick participates in data distribution. Otherwise, "bricks total" will be 1 more than "bricks in DSA", indicating that the meta-data brick doesn't participate in data distribution (and therefore doesn't contain data blocks). Other cases are impossible: for data bricks, participation in the LV and in the DSA is always equivalent.

= Unmounting a logical volume =

To terminate a mount session, just issue the usual umount command with the mount point specified. Note that after unmounting the volume all its bricks by default remain registered in the system until system shutdown.
If you want to unregister a brick before system shutdown, simply issue the following command:
 # volume.reiser4 -u BRICK_NAME

= Deploying a logical volume after correct unmount =

Make sure (by checking your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]) that all bricks of the volume are registered in the system. To register a brick, issue the following command:
 # volume.reiser4 -g BRICK_NAME
The list of all volumes and bricks registered in the system can be found in the output of the following command:
 # volume.reiser4 -l
Issue the usual mount(8) command against one of the bricks of your volume. It is recommended to issue it against the meta-data brick.

NOTE: Reiser5 will refuse to mount a logical volume when a wrong (incomplete or redundant) set of bricks is registered in the system. A redundant set of bricks appears, for example, when you mistakenly register a brick that was earlier removed from the logical volume.

= Deploying a logical volume after correct shutdown =

First of all, check the [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing configuration] of your volume and make sure that all its bricks (data and meta-data ones) are registered in the system. The list of registered bricks can be printed by
 # volume.reiser4 -l
Also make sure that the set of bricks registered for the volume doesn't contain bricks not mentioned in the [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration].

Important: Reiser5 will refuse to mount a logical volume when a wrong (incomplete or redundant) set of bricks is registered in the system.
A redundant set of bricks appears, for example, when you mistakenly register a brick that was removed from the logical volume. For this reason we strongly recommend keeping track of your LV: store its [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing configuration] somewhere - but not on that volume! And don't forget to update that configuration after '''every''' volume operation. If you lose the configuration of your LV and don't remember it (which is most likely for large volumes), it will be rather painful to restore: currently there are no tools to manage logical volumes off-line, so users have to do this on their own. It is not at all difficult.

To register a brick in the system use the following command:
 # volume.reiser4 -g BRICK_NAME
To print the list of all registered bricks use
 # volume.reiser4 -l
Now mount your LV, simply issuing a mount(8) command against one of the bricks of your LV. We recommend issuing it against the meta-data brick.

Comment. Reiser5 always tries to register the brick which is passed to the mount command as an argument, so it is not necessary to pre-register the brick you want to issue the mount command against.

= Deploying a logical volume after hard reset or system crash =

If no volume operations were interrupted by the hard reset or system crash, then just follow the instructions in this [https://reiser4.wiki.kernel.org/index.php?title=Logical_Volumes_Administration#Deploying_a_logical_volume_after_correct_shutdown section]. In Reiser5 only a restricted number of bricks participate in every transaction; the maximal number of such bricks can be specified by the user. At mount time a transaction replay procedure will be launched on each such brick independently, in parallel.
Depending on the kind of interrupted volume operation, perform one of the following actions:

== Adding a brick was interrupted ==

Check your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]. Register the old set of bricks (that is, the set of bricks that the volume had before the operation) and try to mount. In the case of error, register also the brick you wanted to add and try to mount again. Check the status of your LV by running
 # volume.reiser4 /mnt
If the volume is unbalanced, then complete balancing manually by running
 # volume.reiser4 -b /mnt
Check "bricks total" of your LV in the output of
 # volume.reiser4 /mnt
Compare it with the old number of bricks in the [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]. The new value should be greater than the old one by 1. If the number of bricks is the same, then your operation of adding a brick was completely rolled back by the transaction manager, and you need to repeat it from scratch. Otherwise, your operation was successfully completed - update your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration] accordingly.

== Brick removal was interrupted ==

Check your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]. Register the new set of bricks (that is, the set of bricks without the brick you wanted to remove). Try to mount the volume.
In the case of error, register also the brick you wanted to remove and try to mount again. Check the status of your LV:
 # volume.reiser4 /mnt
If the volume is unbalanced, then complete balancing manually by running
 # volume.reiser4 -b /mnt
Otherwise, check the total number of bricks in your LV. If it is the same as before the removal, then your removal operation was completely rolled back by the transaction manager, and you will need to repeat it from scratch.

Comment. After successful completion of balancing, the brick will be automatically removed from the volume and unregistered. Make sure of it by checking the status of your LV and the list of registered bricks:
 # volume.reiser4 /mnt
 # volume.reiser4 -l
Upon successful completion update your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration] accordingly.

== Another volume operation was interrupted ==

Using the volume configuration, register the new set of bricks and try to mount the volume. The mount should be successful.
Check the status of your LV:
 # volume.reiser4 /mnt
If the volume is unbalanced, then complete balancing manually by running
 # volume.reiser4 -b /mnt

= LV monitoring =

Common info about the LV mounted at /mnt:
 # volume.reiser4 /mnt
* ID: volume UUID
* volume: ID of the plugin managing the volume
* distribution: ID of the distribution plugin
* stripe: stripe size in bytes
* segments: number of hash space segments (for distribution)
* bricks total: total number of bricks in the volume
* bricks in DSA: number of bricks participating in data distribution
* balanced: balanced status of the volume

Info about any of its bricks of index J:
 # volume.reiser4 -p J /mnt
* internal ID: brick's "internal ID" and its status in the volume
* external ID: brick's UUID
* device name: name of the block device associated with the brick
* block count: size of the block device in blocks
* blocks used: total number of occupied blocks on the device
* system blocks: minimal possible number of busy blocks on that device
* data capacity: abstract capacity of the brick
* space usage: portion of occupied blocks on the device
* in DSA: participation in regular data distribution
* is proxy: participation in data tiering (Burst Buffers, etc)

Comment. When retrieving a brick's info, make sure that no volume operations are in progress on that volume. Otherwise the command above will return an error (EBUSY).

WARNING. Brick info provided this way is not necessarily the most recent. To get actual info, run sync(1) and make sure that no regular file operations are in progress.

= Checking free space =

To check the number of available free blocks on a volume mounted at /mnt, make sure that no regular file operations, as well as no volume operations, are in progress on that volume, then run
 # sync
 # df --block-size=4K /mnt
To check the number of free blocks on the brick of index J, run
 # volume.reiser4 -p J /mnt
and calculate the difference between "block count" and "blocks used".

Comment. Not all free blocks on a brick/volume are available for use.
The number of available free blocks is always ~95% of the total number of free blocks (Reiser4 reserves 5% to make sure that regular file truncate operations won't fail).

NOTE: volume.reiser4 shows the total number of free blocks, whereas df(1) shows the number of available free blocks. The "space usage" statistic shows the portion of busy blocks on an individual brick. For the reasons explained above, "space usage" on any brick can not be more than 0.95.

= Checking quality of data distribution =

Quality of data distribution is a measure of the deviation of the real data space usage from the ideal one defined by the volume partitioning. The smaller the deviation, the better the distribution quality. Checking the quality of distribution makes sense only when your volume partitioning is space-based, or coincides with the space-based one. If your partitioning is throughput-based and doesn't coincide with the space-based one, then the quality of the actual data distribution can be rather bad: in that case the file system is concerned with preventing low-performance devices from becoming a bottleneck, and effective space usage is not a high priority.

Checking the quality of data distribution is based on the free blocks accounting provided by the file system. Note that the file system doesn't count busy data and meta-data blocks separately, so you are not able to find the real data space usage, and hence to check the quality of distribution, when the meta-data brick contains data blocks.

To check the quality of distribution:
* make sure that the meta-data brick doesn't contain data blocks;
* make sure that no regular file and volume operations are currently in progress;
* find the "blocks used", "system blocks" and "data capacity" statistics for each data brick:
 # sync
 # volume.reiser4 -p 1 /mnt
 ...
 # volume.reiser4 -p N /mnt
* find the real data space usage on each brick;
* calculate the partitioning and the ideal data space usage on each data brick;
* find the deviation of the real data space usage from the ideal one.

Example.
Let's build a LV of 3 bricks (one 10G meta-data brick /dev/vdb1, and two data bricks: /dev/vdc1 (10G) and /dev/vdd1 (5G)) with space-based partitioning:
 # VOL_ID=`uuid -v4`
 # echo "Using uuid $VOL_ID"
 # mkfs.reiser4 -U $VOL_ID -y -t 256K /dev/vdb1
 # mkfs.reiser4 -U $VOL_ID -y -a -t 256K /dev/vdc1
 # mkfs.reiser4 -U $VOL_ID -y -a -t 256K /dev/vdd1
 # mount /dev/vdb1 /mnt
Fill the meta-data brick with data:
 # dd if=/dev/zero of=/mnt/myfile bs=256K
 No space left on device...
Add the data bricks /dev/vdc1 and /dev/vdd1 to the volume:
 # volume.reiser4 -a /dev/vdc1 /mnt
 # volume.reiser4 -a /dev/vdd1 /mnt
Move all data blocks to the newly added bricks:
 # volume.reiser4 -r /dev/vdb1 /mnt
 # sync
Now the meta-data brick doesn't contain data blocks (only meta-data ones), so we can calculate the quality of data distribution:
 # volume.reiser4 /mnt -p0
 blocks used: 503
 # volume.reiser4 /mnt -p1
 blocks used: 1657203
 system blocks: 115
 data capacity: 2621069
 # volume.reiser4 /mnt -p2
 blocks used: 833001
 system blocks: 73
 data capacity: 1310391
Based on the statistics above, calculate the quality of distribution.

Total data capacity of the volume:
 C = 2621069 + 1310391 = 3931460
Relative capacities of the data bricks:
 C1 = 2621069 / 3931460 = 0.6667
 C2 = 1310391 / 3931460 = 0.3333
Real space usage on the data bricks (blocks used - system blocks):
 R1 = 1657203 - 115 = 1657088
 R2 = 833001 - 73 = 832928
Space usage on the volume:
 R = R1 + R2 = 1657088 + 832928 = 2490016
Ideal data space usage on the data bricks:
 I1 = C1 * R = 0.6667 * 2490016 = 1660094
 I2 = C2 * R = 0.3333 * 2490016 = 829922
Deviation:
 D = (R1 - I1, R2 - I2) = (-3006, 3006)
Relative deviation:
 D/R = (-0.0012, 0.0012)
Quality of distribution:
 Q = 1 - max(|D1|, |D2|)/R = 1 - 0.0012 = 0.9988

Comment. For any specified number of bricks N and quality of distribution Q it is possible to find a configuration of a logical volume composed of N bricks, such that the quality of distribution on that volume is better than Q.

Comment.
Quality of distribution Q doesn't depend on the number of bricks in the logical volume. This is a theorem, which can be strictly proven.

= FAQ =

Q. What happens if I lose a device-component (due to a breakdown, etc) of my logical volume?

A. The bodies of some of your regular files will become "punched" in random places. The portion of such files depends on the relative capacity of the lost brick, on the number of bricks in the logical volume, and on other factors. Fsck will be able to detect and remove files with corrupted bodies. Nevertheless, we recommend considering mirroring your bricks (e.g. by software or hardware RAID-1) to avoid such highly unpleasant situations.

[[category:Reiser4]]

Before working with logical volumes you need to understand some basic [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Background principles]. A logical volume (LV) can be composed of any number of block devices, differing in physical and geometric parameters. However, the optimal configuration (true parallelism) imposes some restrictions and dependencies on the sizes of such devices.

WARNING: This stuff is not stable. Don't put important data on logical volumes managed by software of release number 5.X.Y. Also don't mount your old partitions in kernels with Reiser4 of SFRN 5.X.Y before its stabilization.

IMPORTANT: Currently there are no tools to manage Reiser5 logical volumes off-line, so it is strongly recommended to save/update the configuration of your LV in a file which doesn't belong to that volume.

= Basic definitions. Volume configuration. Brick's capacity. Partitioning. Fair distribution. Balancing =

The basic configuration of a logical volume is the following information:
1) Volume UUID;
2) Number of bricks in the volume;
3) List of brick names or UUIDs in the volume;
4) UUID or name of the brick to be added/removed (if any).
That brick is not counted in (2) and (3). Item #4 exists to handle incomplete operations, interrupted for various reasons (system crash, hard reset, etc), when bringing logical volumes on-line. For each volume, its configuration should be stored somewhere (but not on that volume!) and properly updated before and after each volume operation performed on that volume. We make the user responsible for this. The volume configuration is needed to facilitate deploying the volume.

'''Abstract capacity''' (or simply capacity) of a brick is a positive integer number. Capacity is a brick's property defined by the user. Don't confuse it with the size of the block device. Think of it as the brick's "weight" in some units. It is the user who decides which property of the brick to assign as its abstract capacity and in which units. In particular, it can be the size of the block device in kilobytes, or its size in megabytes, or its throughput in M/sec, or some other geometric or physical parameter of the device associated with the brick. It is important that the capacities of all bricks of the same logical volume are measured in the same units. Also, it would be utterly pointless to assign different properties as abstract capacities for bricks of the same LV - for example, the size of the block device for one brick, and the disk bandwidth for another.

The capacity of each brick gets initialized by the mkfs utility. By default it is calculated as the number of free blocks on the device at the very end of the formatting procedure. For the meta-data brick it is calculated as 70% of that amount. The capacity of any brick can be changed on-line by the user.

'''Capacity of a logical volume''' is defined as the sum of the capacities of its component bricks.

'''Relative capacity of a brick''' is the ratio of the brick's capacity to the volume's capacity. Relative capacity defines the portion of IO requests that will be issued against that brick. The array of relative capacities (C1, C2, ...) of all bricks is called the volume partitioning. Obviously, C1 + C2 + ... = 1.

'''(Real) data space usage''' on a brick is the number of data blocks stored on that brick.

'''Ideal (or expected) data space usage''' on a brick is T*C, where T is the total number of data blocks stored in the volume and C is the relative capacity of the brick.

It is recommended to compose volumes so that the space-based partitioning coincides with the throughput-based one - this is the optimal volume configuration, which provides true parallelism. If that is impossible for some reason, then choose a preferred partitioning method (space-based or throughput-based). Note that space-based partitioning saves volume space, whereas throughput-based partitioning saves volume throughput.

When performing regular file operations, Reiser5 distributes data stripes throughout the volume evenly and fairly. It means that the portion of IO requests issued against each brick is equal to its relative capacity, that is, to the portion of capacity that the brick adds to the total volume's capacity. In contrast with regular file operations, volume operations break fairness of data distribution on your logical volume. To restore fairness of distribution, a special balancing procedure should be run on the volume. For example, after adding a brick to a logical volume, the balancing procedure will populate the new brick with data moved from other bricks.

All volume operations except brick removal are fast, atomic and leave the volume in an unbalanced state. The operation of brick removal always includes balancing, which moves data from the brick you want to remove to the other bricks of the volume. If that data migration is interrupted for some reason, then the volume is marked as a "volume with incomplete brick removal". It is allowed to perform regular file and volume operations on a non-balanced LV (assuming it was not an incomplete removal). However, in this case we don't guarantee a good quality of data distribution on your LV.
In addition, on a volume with incomplete removal you won't be able to perform regular volume operations - first you will need to complete the removal by running a special removal completion procedure on your volume.

= Prepare Software and Hardware =

Build, install and boot a kernel with Reiser4 of software framework release number 5.X.Y. Kernel patches can be found [https://sourceforge.net/projects/reiser4/files/v5-unstable/ here]. Note that the Linux kernel and GNU utilities still recognize the testing stuff as "Reiser4". Make sure there is the following message in the kernel logs: "Loading Reiser4 (Software Framework Release: 5.X.Y)".

Build and install the latest [https://sourceforge.net/projects/reiser4/files/reiser4-utils/libaal/ libaal]. Download, build and install the latest version 2.A.B of the [https://sourceforge.net/projects/reiser4/files/v5-unstable/ Reiser4progs package]. Make sure that the utility for managing logical volumes is installed (as a part of the reiser4progs package) on your machine:
 # volume.reiser4 -?

= Creating a logical volume =

Start by choosing a unique ID (uuid) for your volume. By default it is generated by the mkfs utility. However, the user can generate it himself with proper tools (e.g. uuid(1)) and store it in an environment variable for convenience:
 # VOL_ID=`uuidgen`
 # echo "Using uuid $VOL_ID"
Choose a stripe size for your logical volume. For a good quality of distribution it is recommended that the stripe not exceed 1/10000 of the volume size. On the other hand, too small stripes will increase space consumption on your meta-data brick. In our example we choose a stripe size of 512K:
 # STRIPE=512K
 # echo "Using stripe size $STRIPE"
Start by creating the first brick of your volume - the meta-data brick - passing the volume ID and stripe size to the mkfs.reiser4 utility:
 # mkfs.reiser4 -U $VOL_ID -t $STRIPE /dev/vdb1
Currently only one meta-data brick per volume is supported, so it is recommended that the size of the block device for the meta-data brick is not too small.
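The stripe-size recommendation (at most 1/10000 of the volume size) and the rule of thumb that the meta-data brick hold at least 1/200 of the maximal volume size can be combined into a quick back-of-the-envelope calculation. The sketch below is just that arithmetic for a hypothetical 20T target volume, not an official tool:

```shell
#!/bin/sh
# Back-of-the-envelope sizing for a planned logical volume.
# Rules of thumb from this guide:
#   stripe size     <= volume_size / 10000
#   meta-data brick >= volume_size / 200
# The 20T target is an arbitrary example figure.

VOLUME_KIB=$((20 * 1024 * 1024 * 1024))   # 20T expressed in KiB

MAX_STRIPE_KIB=$((VOLUME_KIB / 10000))    # upper bound for the -t option
MIN_META_KIB=$((VOLUME_KIB / 200))        # lower bound for the meta-data brick

echo "max stripe:     $MAX_STRIPE_KIB KiB"
echo "min meta brick: $((MIN_META_KIB / 1024 / 1024)) GiB"
```

For a 20T volume this gives a stripe bound of ~2 GiB (so common values like 256K or 512K are well within it) and a meta-data brick of ~102 GiB, matching this guide's estimate that a 100G meta-data brick can service a ~20T volume.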
In most cases it will be enough, if your meta-data brick is not smaller than 1/200 of maximal volume size. For example, 100G meta-data brick will be able to service ~20T logical volume. Data and meta-data bricks don't differ from the standpoint of disk format, and there is no special option to inform mkfs utility that we want to create exactly meta-data brick: the first brick in the volume automatically becomes a meta-data brick, and other bricks are interpreted as data bricks. Mount your initial logical volume consisting of one meta-data brick: # mount /dev/vdb1 /mnt Find a record about your volume in the output of the following command: # volume.reiser4 -l Create configuration of your logical volume (its definition is above) and store it somewhere, but not on that volume! Your logical volume is now on-line and ready to use. You can perform regular file operations and volume operations (e.g. add a data brick to your LV). = Adding a data brick to LV = At any time you are able to add a data brick to your LV. You can do it in parallel with regular file operations executing on this volume. Make sure, however, that there is no other volume operations (e.g. removing a brick) over your volume in progress, otherwise your operation will fail with EBUSY. Obviously, adding a brick will increase capacity of your volume. Choose a block device for the new data brick. Make sure that it is not too large, or too small. Capacities of any 2 bricks of the same logical volume can not differ more than 2^19 (~1 million) times. E.g. your logical volume can not contain both, 1M and 2T bricks. Any attempts to add a brick of improper capacity will fail with error. Format it with the same volume ID and stripe size, as you used for meta-data brick, but specify also "-a" option (to not restrict data capacity). 
# mkfs.reiser4 -U $VOL_ID -t $STRIPE -a /dev/vdb2 Important: it is important that data brick is formatted with the same volume ID and stripe size, as the meta-data brick of your logical volume. Otherwise, operation of adding a data brick will fail. Update item #4 of your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration] with UUID or name of the brick you want to add. To add a brick simply pass its name as an argument for the option "-a" and specify your LV via its mount point: # volume.reiser4 -a /dev/vdb2 /mnt The procedure of adding a brick automatically invokes re-balancing, which moves a portion of data stripes to the newly added brick (so that the resulted distribution will fair). Portion of data blocks moved during such rebalancing is equal to the relative capacity of the new brick, that is to the portion of capacity that the new brick adds to updated LV's capacity. This important property defines the cost of balancing procedure. If the portion of capacity added by a brick is small, then number of stripes moved during balancing is also small. Like other user-space utilities, the operation of adding a brick can return error, even in the assumption that the brick you wanted to add is properly formatted. In this case check the status of your LV: # volume.reiser4 /mnt If the volume is unbalanced, then simply complete balancing manually: # volume.reiser4 -b /mnt Otherwise, check number of bricks in your LV. Most likely that it is the same as it was before the failed operation. In this case simply repeat the operation of adding a brick from scratch. Upon successful completion update your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]. 
That is, increment (#2), add info about the new brick to (#3) and remove records at (#4). = Removing a data brick from LV = At any time you are able to remove a data brick from your LV. You can do it in parallel with regular file operations executing on this volume. Make sure, however, that there is no other volume operations (e.g. adding a brick) over your volume in progress, otherwise your operation will fail with EBUSY. Obviously, removing a brick will decrease abstract capacity of your LV. Note that other bricks should have enough space to store all data blocks of the brick you want to remove, otherwise, the removal operation will return error (ENOSPC). Suppose you want to remove brick /dev/vdb2 from your LV mounted at /mnt. Update your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration] with the UUID and name of the brick you want to remove (#item #4). To remove a brick simply pass its name as an argument for option "-r" and specify the logical volume by its mount point: # volume.reiser4 -r /dev/vdb2 /mnt The procedure of brick removal automatically invokes re-balancing, which distributes data of the brick to be removed among other bricks, so that resulted distribution is also fair. Portion of data stripes moved during such rebalancing is equal to the relative capacity of the brick to be removed (that it to the portion of capacity that the brick added to LV's capacity). It can happen, that the command above completes with error (like other user-space applications). In this case check the status of your LV: # volume.reiser4 /mnt If volume is not balanced, then simply complete balancing manually: # volume.reiser4 -b /mnt Otherwise, check the number of the bricks in your logical volume - it should be the same as before the failed operation. 
The error -ENOSPC indicates that free space on other bricks is not enough to fit all the data of the brick you want to remove. On success update your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]: remove the information about the brick /dev/vdb2 at #3 and #4. Check your kernel logs: it should contain a message that brick /dev/vdb2 has been unregistered. Now device /dev/vdb2 doesn't belong to the logical volume any more, and you can reuse it for other purposes (re-format, etc). = Changing brick's capacity = At any time (in the assumption that no other volume operation is in progress) you can change abstract capacity of any brick to some new value, different from 0. Changing capacity always changes volume partitioning, and therefore, breaks fairness of distribution, so Reiser5 automatically launches rebalancing to make sure that resulted distribution is fair for the new set of capacities. In particular, increasing bricks capacity will move some data from other bricks to the brick, whose capacity was increased. Decreasing bricks capacity will move some data from the brick, whose capacity was decreased, to other bricks. To change abstract capacity of a brick /dev/vdb1 to a new value (e.g. 200000), simply run # volume.reiser4 -z /dev/vdb1 -c 200000 /mnt Pronounced as "resize brick /dev/vdb1 to new capacity 200000 in volume mounted at /mnt". The operation of changing capacity can return error. Most likely, it is -ENOSPC, which is a side effect of concurrent regular file writes. In this case check the status of your LV. If it is unbalanced, then consider removing some files from your LV and complete balancing by running # volume.reiser4 -b /mnt Otherwise, repeat the operation from scratch. Comment. Changing bricks capacity to 0 is undefined and will return error. Consider brick removal operation instead. 
= Operations with meta-data brick =

The meta-data brick can also contain data stripes and participate in data distribution like the data bricks, so all the volume operations described above are applicable to the meta-data brick as well. Note, however, that it is impossible to completely remove the meta-data brick from the logical volume for obvious reasons (meta-data needs to be stored somewhere), so the brick removal operation applied to the meta-data brick actually removes it from the Data-Storage Array (DSA), not from the logical volume. The DSA is the subset of the LV consisting of the bricks participating in data distribution. Once you remove the meta-data brick from the DSA, that brick will be used only to store meta-data. The operation of adding a brick, applied to the meta-data brick, returns it to the DSA.

Important: Reiser5 doesn't count busy data and meta-data blocks separately. So, in contrast with data bricks (which contain only data), you are not able to find out the real space occupied by data blocks on the meta-data brick - Reiser5 knows only the total space occupied.

To check the status of the meta-data brick simply run

 # volume.reiser4 /mnt

and compare the values of "bricks total" and "bricks in DSA". If they are equal, then the meta-data brick participates in data distribution. Otherwise, "bricks total" will be 1 more than "bricks in DSA", indicating that the meta-data brick doesn't participate in data distribution (and therefore doesn't contain data blocks). No other cases are possible: for data bricks, participation in the LV and in the DSA are always equivalent.

= Unmounting a logical volume =

To terminate a mount session just issue the usual umount(8) command with the mount point specified.
Note that after unmounting the volume all its bricks by default remain registered in the system until system shutdown. If you want to unregister a brick before system shutdown, then simply issue the following command:

 # volume.reiser4 -u BRICK_NAME

= Deploying a logical volume after correct unmount =

Make sure (by checking your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]) that all bricks of the volume are registered in the system. To register a brick issue the following command:

 # volume.reiser4 -g BRICK_NAME

The list of all volumes and bricks registered in the system can be found in the output of the following command:

 # volume.reiser4 -l

Issue the usual mount(8) command against one of the bricks of your volume. It is recommended to issue it against the meta-data brick.

NOTE: Reiser5 will refuse to mount a logical volume if a wrong (incomplete or redundant) set of bricks is registered in the system. A redundant set of bricks appears, for example, when you mistakenly register a brick that was earlier removed from the logical volume.

= Deploying a logical volume after correct shutdown =

First of all, check the [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing configuration] of your volume and make sure that all its bricks (data and meta-data ones) are registered in the system. The list of registered bricks can be printed by

 # volume.reiser4 -l

Also make sure that the set of bricks registered for the volume doesn't contain bricks not mentioned in the [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration].

Important: Reiser5 will refuse to mount a logical volume if a wrong (incomplete or redundant) set of bricks is registered in the system.
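A deployment session can be scripted along these lines. The one-brick-per-line configuration file format is made up for this sketch (the guide only says to keep the brick list outside the volume), and the real registration and mount commands are shown in comments:

```shell
# Register every brick from a saved configuration, then mount via the
# first listed brick. The config format is hypothetical; the convention
# that the meta-data brick is listed first is also ours, not the
# file system's.
conf=/tmp/lv.conf
cat > "$conf" <<'EOF'
/dev/vdb1
/dev/vdb2
/dev/vdb3
EOF

while IFS= read -r brick; do
    echo "register: $brick"      # real step: volume.reiser4 -g "$brick"
done < "$conf"

meta=$(head -n 1 "$conf")
echo "mount: $meta on /mnt"      # real step: mount "$meta" /mnt
```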
A redundant set of bricks appears, for example, when you mistakenly register a brick that was removed from the logical volume. For this reason we strongly recommend keeping track of your LV: store its [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing configuration] somewhere, but not on this volume! And don't forget to update that configuration after '''every''' volume operation. If you lost the configuration of your LV and don't remember it (which is likely for large volumes), it will be rather painful to restore: currently there are no tools to manage logical volumes off-line, so users have to do this on their own. It is not at all difficult.

To register a brick in the system use the following command:

 # volume.reiser4 -g BRICK_NAME

To print a list of all registered bricks use

 # volume.reiser4 -l

Now mount your LV, simply by issuing a mount(8) command against one of the bricks of your LV. We recommend issuing it against the meta-data brick.

Comment. Reiser5 always tries to register the brick which is passed to the mount command as an argument, so it is not necessary to pre-register the brick you want to issue the mount command against.

= Deploying a logical volume after hard reset or system crash =

If no volume operations were interrupted by the hard reset or system crash, then just follow the instructions in this [https://reiser4.wiki.kernel.org/index.php?title=Logical_Volumes_Administration#Deploying_a_logical_volume_after_correct_shutdown section]. In Reiser5 only a restricted number of bricks participate in every transaction. The maximal number of such bricks can be specified by the user.
At mount time a transaction replay procedure will be launched on each such brick independently, in parallel. Depending on the kind of interrupted volume operation, perform one of the following actions:

== Adding a brick was interrupted ==

Check your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]. Register the old set of bricks (that is, the set of bricks that the volume had before the operation) and try to mount. In the case of an error, also register the brick you wanted to add and try to mount again. Check the status of your LV by running

 # volume.reiser4 /mnt

If the volume is unbalanced, then complete balancing manually by running

 # volume.reiser4 -b /mnt

Check "bricks total" of your LV in the output of

 # volume.reiser4 /mnt

Compare it with the old number of bricks in the [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]. The new value should be the old one plus 1. If the number of bricks is the same, then your operation of adding a brick was completely rolled back by the transaction manager, and you need to repeat it from scratch. Otherwise, your operation completed successfully - update your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration] accordingly.

== Brick removal was interrupted ==

Check your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]. Register the new set of bricks (that is, the set of bricks without the brick you wanted to remove). Try to mount the volume.
In the case of an error, also register the brick you wanted to remove and try to mount again. Check the status of your LV:

 # volume.reiser4 /mnt

If the volume is unbalanced, then complete balancing manually by running

 # volume.reiser4 -b /mnt

Otherwise, check the total number of bricks in your LV. If it is the same as before the removal, then your removal operation was completely rolled back by the transaction manager, and you will need to repeat it from scratch.

Comment. After successful completion of balancing, the brick will be automatically removed from the volume and unregistered. Verify this by checking the status of your LV and the list of registered bricks:

 # volume.reiser4 /mnt
 # volume.reiser4 -l

Upon successful completion update your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration] accordingly.

== Another volume operation was interrupted ==

Using the volume configuration, register the new set of bricks and try to mount the volume. The mount should be successful.
Check the status of your LV:

 # volume.reiser4 /mnt

If the volume is unbalanced, then complete balancing manually by running

 # volume.reiser4 -b /mnt

= LV monitoring =

Common info about the LV mounted at /mnt:

 # volume.reiser4 /mnt

* ID: volume UUID
* volume: ID of the plugin managing the volume
* distribution: ID of the distribution plugin
* stripe: stripe size in bytes
* segments: number of hash space segments (for distribution)
* bricks total: total number of bricks in the volume
* bricks in DSA: number of bricks participating in data distribution
* balanced: balanced status of the volume

Info about any of its bricks, of index J:

 # volume.reiser4 -p J /mnt

* internal ID: brick's "internal ID" and its status in the volume
* external ID: brick's UUID
* device name: name of the block device associated with the brick
* block count: size of the block device in blocks
* blocks used: total number of occupied blocks on the device
* system blocks: minimal possible number of busy blocks on that device
* data capacity: abstract capacity of the brick
* space usage: portion of occupied blocks on the device
* in DSA: participation in regular data distribution
* is proxy: participation in data tiering (Burst Buffers, etc)

Comment. When retrieving a brick's info, make sure that no volume operations on that volume are in progress. Otherwise the command above will return an error (EBUSY).

WARNING. Brick info obtained this way is not necessarily the most recent. To get up-to-date info, run sync(1) and make sure that no regular file operations are in progress.

= Checking free space =

To check the number of available free blocks on a volume mounted at /mnt, make sure that no regular file operations, or volume operations, are in progress on that volume, then run

 # sync
 # df --block-size=4K /mnt

To check the number of free blocks on the brick of index J, run

 # volume.reiser4 -p J /mnt

then calculate the difference between "block count" and "blocks used".

Comment. Not all free blocks on a brick/volume are available for use.
The number of available free blocks is always ~95% of the total number of free blocks (Reiser4 reserves 5% to make sure that regular file truncate operations won't fail).

NOTE: volume.reiser4 shows the total number of free blocks, whereas df(1) shows the number of available free blocks.

The "space usage" statistic shows the portion of busy blocks on an individual brick. For the reasons explained above, "space usage" on any brick can not be more than 0.95.

= Checking quality of data distribution =

Quality of data distribution is a measure of the deviation of the real data space usage from the ideal one defined by the volume partitioning. The smaller the deviation, the better the distribution quality. Checking quality of distribution makes sense only when your volume partitioning is space-based, or coincides with the space-based one. If your partitioning is throughput-based and doesn't coincide with the space-based one, then the quality of the actual data distribution can be rather bad: in this case the file system tries to keep low-performance devices from becoming a bottleneck, and effective space usage is not a high priority.

Checking quality of data distribution is based on the free blocks accounting provided by the file system. Note that the file system doesn't count busy data and meta-data blocks separately, so you are not able to find the real data space usage, and hence to check the quality of distribution, when the meta-data brick contains data blocks.

To check quality of distribution:
* make sure that the meta-data brick doesn't contain data blocks;
* make sure that no regular file and volume operations are currently in progress;
* find the "blocks used", "system blocks" and "data capacity" statistics for each data brick:

 # sync
 # volume.reiser4 -p 1 /mnt
 ...
 # volume.reiser4 -p N /mnt

* find the real data space usage on each brick;
* calculate the partitioning and the ideal data space usage on each data brick;
* find the deviation of the real usage from the ideal one.

Example.
Let's build an LV of 3 bricks (one 10G meta-data brick vdb1, and two data bricks: vdc1 (10G) and vdd1 (5G)) with space-based partitioning:

 # VOL_ID=`uuid -v4`
 # echo "Using uuid $VOL_ID"
 # mkfs.reiser4 -U $VOL_ID -y -t 256K /dev/vdb1
 # mkfs.reiser4 -U $VOL_ID -y -a -t 256K /dev/vdc1
 # mkfs.reiser4 -U $VOL_ID -y -a -t 256K /dev/vdd1
 # mount /dev/vdb1 /mnt

Fill the meta-data brick with data:

 # dd if=/dev/zero of=/mnt/myfile bs=256K
 No space left on device...

Add data bricks /dev/vdc1 and /dev/vdd1 to the volume:

 # volume.reiser4 -a /dev/vdc1 /mnt
 # volume.reiser4 -a /dev/vdd1 /mnt

Move all data blocks to the newly added bricks:

 # volume.reiser4 -r /dev/vdb1 /mnt
 # sync

Now the meta-data brick doesn't contain data blocks (only meta-data ones), so we can calculate the quality of data distribution:

 # volume.reiser4 /mnt -p0
 blocks used: 503
 # volume.reiser4 /mnt -p1
 blocks used: 1657203
 system blocks: 115
 data capacity: 2621069
 # volume.reiser4 /mnt -p2
 blocks used: 833001
 system blocks: 73
 data capacity: 1310391

Based on the statistics above, calculate the quality of distribution.

Total data capacity of the volume:
 C = 2621069 + 1310391 = 3931460
Relative capacities of the data bricks:
 C1 = 2621069 / 3931460 = 0.6667
 C2 = 1310391 / 3931460 = 0.3333
Real space usage on the data bricks (blocks used - system blocks):
 R1 = 1657203 - 115 = 1657088
 R2 = 833001 - 73 = 832928
Space usage on the volume:
 R = R1 + R2 = 1657088 + 832928 = 2490016
Ideal data space usage on the data bricks:
 I1 = C1 * R = 0.6667 * 2490016 = 1660094
 I2 = C2 * R = 0.3333 * 2490016 = 829922
Deviation:
 D = (R1, R2) - (I1, I2) = (-3006, 3006)
Relative deviation:
 D/R = (-0.0012, 0.0012)
Quality of distribution:
 Q = 1 - max(|D1|, |D2|)/R = 1 - 0.0012 = 0.9988

Comment. For any specified number of bricks N and quality of distribution Q, it is possible to find a configuration of a logical volume composed of N bricks such that the quality of distribution on that volume is better than Q.
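The calculation in the example can be scripted. In the sketch below each argument is a "used:system:capacity" triple for one data brick; the numbers are the ones from the example above (with the meta-data brick excluded, as required). The function itself is ours, not part of reiser4progs.

```shell
# Compute distribution quality Q from per-brick statistics.
# Arguments: one "used:system:capacity" triple per data brick.
dist_quality() {
    printf '%s\n' "$@" | awk -F: '
        { r[NR] = $1 - $2; cap[NR] = $3; C += $3; R += $1 - $2 }
        END {
            maxdev = 0
            for (i = 1; i <= NR; i++) {
                ideal = R * cap[i] / C      # ideal data space usage
                dev = (r[i] - ideal) / R    # relative deviation
                if (dev < 0) dev = -dev
                if (dev > maxdev) maxdev = dev
            }
            printf "%.4f\n", 1 - maxdev
        }'
}

dist_quality 1657203:115:2621069 833001:73:1310391   # prints 0.9988
```

Note that the script works with exact relative capacities, so its result can differ in the last digit from a hand calculation done with rounded values.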
Comment. Quality of distribution Q doesn't depend on the number of bricks in the logical volume. This is a theorem, which can be strictly proven.

= FAQ =

Q. What happens if I lose a device-component of my logical volume (due to a breakdown, etc)?

A. The bodies of some of your regular files will become "punched" in random places. The portion of such files depends on the relative capacity of the lost brick, on the number of bricks in the logical volume, and on other factors. Fsck will be able to detect and remove files with corrupted bodies. Nevertheless, we recommend considering mirroring your bricks (e.g. by software or hardware RAID-1) to avoid such highly unpleasant situations.

[[category:Reiser4]]
Balancing = Basic configuration of a logical volume is the following information: 1) Volume UUID; 2) Number of bricks in the volume; 3) List of brick names or UUIDs in the volume; 4) UUID or name of the brick to be added/removed (if any). That brick is not counted in (2) and (3). The item #4 is to handle incomplete operations interrupted by various reasons (system crash, hard reset, etc) when bringing logical volumes on-line. For each volume its configuration should be stored somewhere (but not on that volume!) and properly updated before and after each volume operation performed on that volume. We make the user responsible for this. Volume configuration is needed to facilitate deploying a volume. '''Abstract capacity''' (or simply capacity) of a brick is a positive integer number. Capacity is a brick's property defined by user. Don't confuse it with the size of block device. Think of it as of brick's "weight" in some units. And this is the user, who decides, which property of the brick to assign as its abstract capacity and in which units. In particular, it can be size of the block device in kilobytes, or its size in megabytes, or its throughput in M/sec, or other geometric or physical parameter of the device, associated with the brick. It is important that capacities of all bricks of the same logical volume are measured in the same units. Also, it would be utterly pointless to assign different properties as abstract capacities for bricks of the same LV. For example, size of block device for one brick, and disk bandwidth for another one. Capacity of each brick gets initialized by mkfs utility. By default it is calculated as number of free blocks on the device at the very end of the formatting procedure. For meta-data brick it is calculated as 70% of such amount. Capacity of any brick can be changed on-line by user. '''Capacity of a logical volume''' is defined as a sum of capacities of its bricks-components. 
'''Relative capacity of a brick''' is the ratio of brick's capacity to volume's capacity. Relative capacity defines a portion of IO-requests that will be issued against that brick. Array of relative capacities (C1, C2, ...) of all bricks is called volume partitioning. Obviously, C1 + C2 + ... = 1. '''(Real) data space usage''' on a brick is number of data blocks, stored on that brick. '''Ideal (or expected) data space usage''' on a brick is T*C, where T is total number of data blocks stored in the volume. C is relative capacity of the brick. It is recommended to compose volumes in the way so that space-based partitioning coincides with throughput-based one - it would be the optimal volume configuration, which provides true parallelism. If it is impossible for some reason, then choose a preferred partitioning method (space-based, or throughput-based). Note that space-based partitioning saves volume space, whereas throughput based one saves volume throughput. When performing regular file operations, Reiser5 distributes data stripes throughout the volume evenly and fairly. It means that portion of IO-requests issued against each brick is equal to its relative capacity, that is, to the portion of capacity that the brick adds to the total volume's capacity. In contrast with regular file operations, volume operations break fairness of data distribution on your logical volume. To restore fairness of distribution, a special balancing procedure should be run on the volume. For example, after adding a brick to a logical volume, the balancing procedure will populate the new brick with data, moved from other bricks. All volume operations except brick removal are fast, atomic and leave the volume in unbalanced state. Operation of brick removal is always includes balancing, which moves data from the brick you want to remove to other bricks of the volume. If that data migration is interrupted for some reason, then the volume is marked as a "volume with incomplete brick removal". 
It is allowed to perform regular file and volume operations on a not balanced LV (assuming, it was not incomplete removal). However, in this case we don't guarantee a good quality of data distribution on your LV. In addition, on a volume with incomplete removal you won't be able to perform regular volume operations - first you will need to complete the removal by running a special removal completion procedure on your volume. = Prepare Software and Hardware = Build, install and boot kernel with Reiser4 of software framework release number 5.X.Y. Kernel patches can be found [https://sourceforge.net/projects/reiser4/files/v5-unstable/ here]. Note that by Linux kernel and GNU utilities the testing stuff is still recognized as "Reiser4". Make sure there is the following message in kernel logs: "Loading Reiser4 (Software Framework Release: 5.X.Y)" Build and install the latest [https://sourceforge.net/projects/reiser4/files/reiser4-utils/libaal/ libaal] Download, build and install the latest version 2.A.B of [https://sourceforge.net/projects/reiser4/files/v5-unstable/ Reiser4progs package]. Make sure that utility for managing logical volumes is installed (as a part of reiser4progs package) on your machine: # volume.reiser4 -? = Creating a logical volume = Start from choosing a unique ID (uuid) of your volume. By default it is generated by mkfs utility. However, user can generate it himself by proper tools (e.g. uuid(1)) and store in an environment variable for convenience: # VOL_ID=`uuidgen` # echo "Using uuid $VOL_ID" Choose a stripe size for your logical volume. For a good quality of distribution it is recommended that stripe doesn't exceed 1/10000 of volume size. On the other hand, too small stripes will increase space consumption on your meta-data brick. 
In our example we choose stripe size 256K: # STRIPE=256K # echo "Using stripe size $STRIPE" Start from creating the first brick of your volume - meta-data brick, passing volume-ID and stripe size to mkfs.reiser4 utility: # mkfs.reiser4 -U $VOL_ID -t $STRIPE /dev/vdb1 Currently only one meta-data brick per volume is supported, so it is recommended that size of block device for meta-data brick in not too small. In most cases it will be enough, if your meta-data brick is not smaller than 1/200 of maximal volume size. For example, 100G meta-data brick will be able to service ~20T logical volume. Data and meta-data bricks don't differ from the standpoint of disk format, and there is no special option to inform mkfs utility that we want to create exactly meta-data brick: the first brick in the volume automatically becomes a meta-data brick, and other bricks are interpreted as data bricks. Mount your initial logical volume consisting of one meta-data brick: # mount /dev/vdb1 /mnt Find a record about your volume in the output of the following command: # volume.reiser4 -l Create configuration of your logical volume (its definition is above) and store it somewhere, but not on that volume! Your logical volume is now on-line and ready to use. You can perform regular file operations and volume operations (e.g. add a data brick to your LV). = Adding a data brick to LV = At any time you are able to add a data brick to your LV. You can do it in parallel with regular file operations executing on this volume. Make sure, however, that there is no other volume operations (e.g. removing a brick) over your volume in progress, otherwise your operation will fail with EBUSY. Obviously, adding a brick will increase capacity of your volume. Choose a block device for the new data brick. Make sure that it is not too large, or too small. Capacities of any 2 bricks of the same logical volume can not differ more than 2^19 (~1 million) times. E.g. 
your logical volume can not contain both, 1M and 2T bricks. Any attempts to add a brick of improper capacity will fail with error. Format it with the same volume ID and stripe size, as you used for meta-data brick, but specify also "-a" option (to not restrict data capacity). # mkfs.reiser4 -U $VOL_ID -t $STRIPE -a /dev/vdb2 Important: it is important that data brick is formatted with the same volume ID and stripe size, as the meta-data brick of your logical volume. Otherwise, operation of adding a data brick will fail. Update item #4 of your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration] with UUID or name of the brick you want to add. To add a brick simply pass its name as an argument for the option "-a" and specify your LV via its mount point: # volume.reiser4 -a /dev/vdb2 /mnt The procedure of adding a brick automatically invokes re-balancing, which moves a portion of data stripes to the newly added brick (so that the resulted distribution will fair). Portion of data blocks moved during such rebalancing is equal to the relative capacity of the new brick, that is to the portion of capacity that the new brick adds to updated LV's capacity. This important property defines the cost of balancing procedure. If the portion of capacity added by a brick is small, then number of stripes moved during balancing is also small. Like other user-space utilities, the operation of adding a brick can return error, even in the assumption that the brick you wanted to add is properly formatted. In this case check the status of your LV: # volume.reiser4 /mnt If the volume is unbalanced, then simply complete balancing manually: # volume.reiser4 -b /mnt Otherwise, check number of bricks in your LV. Most likely that it is the same as it was before the failed operation. 
In this case simply repeat the operation of adding a brick from scratch. Upon successful completion update your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]. That is, increment (#2), add info about the new brick to (#3) and remove records at (#4). = Removing a data brick from LV = At any time you are able to remove a data brick from your LV. You can do it in parallel with regular file operations executing on this volume. Make sure, however, that there is no other volume operations (e.g. adding a brick) over your volume in progress, otherwise your operation will fail with EBUSY. Obviously, removing a brick will decrease abstract capacity of your LV. Note that other bricks should have enough space to store all data blocks of the brick you want to remove, otherwise, the removal operation will return error (ENOSPC). Suppose you want to remove brick /dev/vdb2 from your LV mounted at /mnt. Update your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration] with the UUID and name of the brick you want to remove (#item #4). To remove a brick simply pass its name as an argument for option "-r" and specify the logical volume by its mount point: # volume.reiser4 -r /dev/vdb2 /mnt The procedure of brick removal automatically invokes re-balancing, which distributes data of the brick to be removed among other bricks, so that resulted distribution is also fair. Portion of data stripes moved during such rebalancing is equal to the relative capacity of the brick to be removed (that it to the portion of capacity that the brick added to LV's capacity). It can happen, that the command above completes with error (like other user-space applications). 
In this case check the status of your LV: # volume.reiser4 /mnt If volume is not balanced, then simply complete balancing manually: # volume.reiser4 -b /mnt Otherwise, check the number of the bricks in your logical volume - it should be the same as before the failed operation. The error -ENOSPC indicates that free space on other bricks is not enough to fit all the data of the brick you want to remove. On success update your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]: remove the information about the brick /dev/vdb2 at #3 and #4. Check your kernel logs: it should contain a message that brick /dev/vdb2 has been unregistered. Now device /dev/vdb2 doesn't belong to the logical volume any more, and you can reuse it for other purposes (re-format, etc). = Changing brick's capacity = At any time (in the assumption that no other volume operation is in progress) you can change abstract capacity of any brick to some new value, different from 0. Changing capacity always changes volume partitioning, and therefore, breaks fairness of distribution, so Reiser5 automatically launches rebalancing to make sure that resulted distribution is fair for the new set of capacities. In particular, increasing bricks capacity will move some data from other bricks to the brick, whose capacity was increased. Decreasing bricks capacity will move some data from the brick, whose capacity was decreased, to other bricks. To change abstract capacity of a brick /dev/vdb1 to a new value (e.g. 200000), simply run # volume.reiser4 -z /dev/vdb1 -c 200000 /mnt Pronounced as "resize brick /dev/vdb1 to new capacity 200000 in volume mounted at /mnt". The operation of changing capacity can return error. Most likely, it is -ENOSPC, which is a side effect of concurrent regular file writes. In this case check the status of your LV. 
If it is unbalanced, then consider removing some files from your LV and complete balancing by running # volume.reiser4 -b /mnt Otherwise, repeat the operation from scratch. Comment. Changing bricks capacity to 0 is undefined and will return error. Consider brick removal operation instead. = Operations with meta-data brick = Meta-data brick can also contain data stripes and participate in data distribution like other data bricks. So that all the volume operations described above are also applicable to meta-data brick. Note, however, that it is impossible to completely remove meta-data brick from the logical volume for obvious reasons (meta-data need to be stored somewhere), so brick removal operation applied to the meta-data brick actually removes it from Data-Storage Array (DSA), not from the logical volume. DSA is a subset of LV consisting of bricks, participating in data distribution. Once you remove meta-data brick from DSA, that brick will be used only to store meta-data. Operation of adding a brick, being applied to a meta-data brick, returns the last one back to DSA. Important: Reiser5 doesn't count busy data and meta-data blocks separately. So in contrast with data bricks (which contain only data) you are not able to find out real space occupied by data blocks on the meta-data brick - Reiser5 knows only total space occupied. To check the status of meta-data brick simply run # volume.reiser4 /mnt and compare values of "bricks total" and "bricks in DSA". If they are equal, then meta-data brick participates in data distribution. Otherwise, "bricks total" should be 1 more than "bricks in DSA" - it indicates that meta-data brick doesn't participate in data distribution (and therefore, doesn't contain data blocks). Note that other cases are impossible: for data bricks participation in LV and DSA is always equivalent. = Unmounting a logical volume = To terminate a mount session just issue usual umount command with the mount point specified. 
Note that after unmounting the volume all bricks by default remain to be registered in the system till system shutdown. If you want to unregister a brick before system shutdown, then simply issue the following command: # volume.reiser4 -u BRICK_NAME = Deploying a logical volume after correct unmount = Make sure (by checking your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]) that all bricks of the volume are registered in the system. To register a brick issue the following command: # volume.reiser4 -g BRICK_NAME The list of all volumes and bricks registered in the system can be found in the output of the following command: # volume.reisrer4 -l Issue usual mount(8) command against one of the bricks of your volume. It is recommended to issue it against meta-data brick. NOTE: Reiser5 will refuse to mount a logical volume, in the case, when a wrong (incomplete or redundant) set of bricks is registered in the system. Redundant set of bricks appears, for example, when you mistakenly register a brick that was earlier removed from the logical volume. = Deploying a logical volume after correct shutdown = First of all, check [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing configuration] of your volume and make sure that all its bricks (data and meta-data ones) are registered in the system. The list of registered bricks can be printed by # volume.reiser4 -l Also make sure that the set of registered per volume bricks doesn't contain bricks not mentioned in the [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]. 
Important: Reiser5 will refuse to mount a logical volume if a wrong (incomplete or redundant) set of bricks is registered in the system. A redundant set of bricks appears, for example, when you mistakenly register a brick that was removed from the logical volume.

For this reason we strongly recommend keeping track of your LV: store its [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing configuration] somewhere, but not on that volume! And don't forget to update that configuration after '''every''' volume operation. If you have lost the configuration of your LV and don't remember it (which is most likely for large volumes), it will be rather painful to restore: currently there are no tools to manage logical volumes off-line, so users have to maintain the configuration on their own. It is not at all difficult.

To register a brick in the system, use the following command:

 # volume.reiser4 -g BRICK_NAME

To print the list of all registered bricks, use

 # volume.reiser4 -l

Now mount your LV by simply issuing a mount(8) command against one of its bricks. We recommend issuing it against the meta-data brick.

Comment. Reiser5 always tries to register the brick that is passed to the mount command as an argument, so it is not necessary to pre-register the brick you issue the mount command against.

= Deploying a logical volume after hard reset or system crash =

If no volume operations were interrupted by the hard reset or system crash, just follow the instructions in this [https://reiser4.wiki.kernel.org/index.php?title=Logical_Volumes_Administration#Deploying_a_logical_volume_after_correct_shutdown section].

In Reiser5 only a restricted number of bricks participate in every transaction. The maximal number of such bricks can be specified by the user.
At mount time a transaction replay procedure will be launched on each such brick independently, in parallel. Depending on the kind of interrupted volume operation, perform one of the following actions:

== Adding a brick was interrupted ==

Check your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]. Register the old set of bricks (that is, the set of bricks that the volume had before the operation) and try to mount. In the case of error, register also the brick you wanted to add and try to mount again.

Check the status of your LV by running

 # volume.reiser4 /mnt

If the volume is unbalanced, complete balancing manually by running

 # volume.reiser4 -b /mnt

Check "bricks total" of your LV in the output of

 # volume.reiser4 /mnt

and compare it with the old number of bricks in the [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]. The new value should exceed the old one by 1. If the number of bricks is the same, your add-brick operation was completely rolled back by the transaction manager, and you need to repeat it from scratch. Otherwise the operation completed successfully - update your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration] accordingly.

== Brick removal was interrupted ==

Check your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]. Register the new set of bricks (that is, the set of bricks without the brick you wanted to remove).
Try to mount the volume. In the case of error, register also the brick you wanted to remove and try to mount again.

Check the status of your LV:

 # volume.reiser4 /mnt

If the volume is unbalanced, complete balancing manually by running

 # volume.reiser4 -b /mnt

Otherwise, check the total number of bricks in your LV. If it is the same as before the removal, the removal operation was completely rolled back by the transaction manager, and you will need to repeat it from scratch.

Comment. After successful completion of balancing, the brick is automatically removed from the volume and unregistered. Make sure of it by checking the status of your LV and the list of registered bricks:

 # volume.reiser4 /mnt
 # volume.reiser4 -l

Upon successful completion update your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration] accordingly.

== Another volume operation was interrupted ==

Using the volume configuration, register the new set of bricks and try to mount the volume. The mount should be successful.
Check the status of your LV:

 # volume.reiser4 /mnt

If the volume is unbalanced, complete balancing manually by running

 # volume.reiser4 -b /mnt

= LV monitoring =

Common info about the LV mounted at /mnt:

 # volume.reiser4 /mnt

 ID:             Volume UUID
 volume:         ID of the plugin managing the volume
 distribution:   ID of the distribution plugin
 stripe:         Stripe size in bytes
 segments:       Number of hash space segments (for distribution)
 bricks total:   Total number of bricks in the volume
 bricks in DSA:  Number of bricks participating in data distribution
 balanced:       Balanced status of the volume

Info about any of its bricks, of index J:

 # volume.reiser4 -p J /mnt

 internal ID:    Brick's "internal ID" and its status in the volume
 external ID:    Brick's UUID
 device name:    Name of the block device associated with the brick
 block count:    Size of the block device in blocks
 blocks used:    Total number of occupied blocks on the device
 system blocks:  Minimal possible number of busy blocks on the device
 data capacity:  Abstract capacity of the brick
 space usage:    Portion of occupied blocks on the device
 in DSA:         Participation in regular data distribution
 is proxy:       Participation in data tiering (Burst Buffers, etc.)

Comment. When retrieving a brick's info, make sure that no volume operations are in progress on that volume. Otherwise the command above will return an error (EBUSY).

WARNING. Brick info obtained this way is not necessarily the most recent. To get up-to-date info, run sync(1) and make sure that no regular file operations are in progress.

= Checking free space =

To check the number of available free blocks on a volume mounted at /mnt, make sure that no regular file operations or volume operations are in progress on that volume, then run

 # sync
 # df --block-size=4K /mnt

To check the number of free blocks on the brick of index J, run

 # volume.reiser4 -p J /mnt

and calculate the difference between "block count" and "blocks used".

Comment. Not all free blocks on a brick/volume are available for use.
The number of available free blocks is always ~95% of the total number of free blocks (Reiser4 reserves 5% to make sure that regular file truncate operations won't fail).

NOTE: volume.reiser4 shows the total number of free blocks, whereas df(1) shows the number of available free blocks.

The "space usage" statistic shows the portion of busy blocks on an individual brick. For the reasons explained above, "space usage" on any brick cannot exceed 0.95.

= Checking quality of data distribution =

Quality of data distribution is a measure of the deviation of the real data space usage from the ideal one defined by the volume partitioning. The smaller the deviation, the better the distribution quality. Checking quality of distribution makes sense only when your volume partitioning is space-based, or coincides with the space-based one. If your partitioning is throughput-based and doesn't coincide with the space-based one, the quality of the actual data distribution can be rather bad: in that case the file system tries to prevent low-performance devices from becoming a bottleneck, and effective space usage is not a high priority.

Checking quality of data distribution is based on the free-blocks accounting provided by the file system. Note that the file system doesn't count busy data and meta-data blocks separately, so you cannot find the real data space usage, and hence check the quality of distribution, when the meta-data brick contains data blocks.

To check quality of distribution:

* (1) make sure that the meta-data brick doesn't contain data blocks;
* (2) make sure that no regular file or volume operations are currently in progress;
* (3) find the "blocks used", "system blocks" and "data capacity" statistics for each data brick:

 # sync
 # volume.reiser4 -p 1 /mnt
 ...
 # volume.reiser4 -p N /mnt

* (4) find the real data space usage on each brick;
* (5) calculate the partitioning and the ideal data space usage on each data brick;
* (6) find the deviation of (4) from (5).

Example.
Let's build an LV of 3 bricks (one 10G meta-data brick vdb1, and two data bricks: vdc1 (10G) and vdd1 (5G)) with space-based partitioning:

 # VOL_ID=`uuid -v4`
 # echo "Using uuid $VOL_ID"
 # mkfs.reiser4 -U $VOL_ID -y -t 256K /dev/vdb1
 # mkfs.reiser4 -U $VOL_ID -y -a -t 256K /dev/vdc1
 # mkfs.reiser4 -U $VOL_ID -y -a -t 256K /dev/vdd1
 # mount /dev/vdb1 /mnt

Fill the meta-data brick with data:

 # dd if=/dev/zero of=/mnt/myfile bs=256K
 No space left on device...

Add data bricks /dev/vdc1 and /dev/vdd1 to the volume:

 # volume.reiser4 -a /dev/vdc1 /mnt
 # volume.reiser4 -a /dev/vdd1 /mnt

Move all data blocks to the newly added bricks:

 # volume.reiser4 -r /dev/vdb1 /mnt
 # sync

Now the meta-data brick doesn't contain data blocks (only meta-data ones), so we can calculate the quality of data distribution:

 # volume.reiser4 -p 0 /mnt
 blocks used:    503
 # volume.reiser4 -p 1 /mnt
 blocks used:    1657203
 system blocks:  115
 data capacity:  2621069
 # volume.reiser4 -p 2 /mnt
 blocks used:    833001
 system blocks:  73
 data capacity:  1310391

Based on the statistics above, calculate the quality of distribution.

Total data capacity of the volume:

 C = 2621069 + 1310391 = 3931460

Relative capacities of the data bricks:

 C1 = 2621069 / 3931460 = 0.6667
 C2 = 1310391 / 3931460 = 0.3333

Real space usage on the data bricks (blocks used - system blocks):

 R1 = 1657203 - 115 = 1657088
 R2 = 833001 - 73 = 832928

Space usage on the volume:

 R = R1 + R2 = 1657088 + 832928 = 2490016

Ideal data space usage on the data bricks:

 I1 = C1 * R = 0.6667 * 2490016 = 1660094
 I2 = C2 * R = 0.3333 * 2490016 = 829922

Deviation:

 D = (R1 - I1, R2 - I2) = (-3006, 3006)

Relative deviation:

 D/R = (-0.0012, 0.0012)

Quality of distribution:

 Q = 1 - max(|D1|, |D2|)/R = 1 - 0.0012 = 0.9988

Comment. For any specified number of bricks N and quality of distribution Q it is possible to find a configuration of a logical volume composed of N bricks such that the quality of distribution on that volume is better than Q.

Comment.
Quality of distribution Q doesn't depend on the number of bricks in the logical volume. This is a theorem, which can be strictly proven.

= FAQ =

Q. What happens if I lose a device-component (due to a breakdown, etc.) of my logical volume?

A. The bodies of some of your regular files will become "punched" in random places. The portion of such files depends on the relative capacity of the lost brick, on the number of bricks in the logical volume, and on other factors. Fsck will be able to detect and remove files with corrupted bodies. Nevertheless, we recommend considering mirroring your bricks (e.g. by software or hardware RAID-1) to avoid such highly unpleasant situations.

[[category:Reiser4]]

Before working with logical volumes you need to understand some basic [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Background principles]. A logical volume (LV) can be composed of any number of block devices, different in physical and geometric parameters. However, the optimal configuration (true parallelism) imposes some restrictions and dependencies on the size of such devices.

WARNING: This stuff is not stable. Don't put important data on logical volumes managed by software of release number 5.X.Y. Also don't mount your old partitions in kernels with Reiser4 of SFRN 5.X.Y before its stabilization.

IMPORTANT: Currently there are no tools to manage Reiser5 logical volumes off-line, so it is strongly recommended to save/update the configuration of your LV in a file which doesn't belong to that volume.

= Basic definitions. Volume configuration. Brick's capacity. Partitioning. Fair distribution. Balancing =

The basic configuration of a logical volume is the following information:

1) Volume UUID;
2) Number of bricks in the volume;
3) List of brick names or UUIDs in the volume;
4) UUID or name of the brick being added/removed (if any). That brick is not counted in (2) and (3).

Item #4 is needed to handle incomplete operations, interrupted for various reasons (system crash, hard reset, etc.), when bringing logical volumes on-line. For each volume, its configuration should be stored somewhere (but not on that volume!) and properly updated before and after each volume operation performed on that volume. We make the user responsible for this. The volume configuration is needed to facilitate deploying a volume.

'''Abstract capacity''' (or simply capacity) of a brick is a positive integer. Capacity is a brick's property defined by the user. Don't confuse it with the size of the block device; think of it as the brick's "weight" in some units. It is the user who decides which property of the brick to use as its abstract capacity, and in which units. In particular, it can be the size of the block device in kilobytes, or its size in megabytes, or its throughput in MB/sec, or another geometric or physical parameter of the device associated with the brick. It is important that the capacities of all bricks of the same logical volume are measured in the same units. Also, it would be utterly pointless to assign different properties as abstract capacities for bricks of the same LV - for example, the size of the block device for one brick and the disk bandwidth for another.

The capacity of each brick gets initialized by the mkfs utility. By default it is calculated as the number of free blocks on the device at the very end of the formatting procedure. For the meta-data brick it is calculated as 70% of that amount. The capacity of any brick can be changed on-line by the user.

'''Capacity of a logical volume''' is defined as the sum of the capacities of its component bricks.
'''Relative capacity of a brick''' is the ratio of the brick's capacity to the volume's capacity. Relative capacity defines the portion of IO requests that will be issued against that brick. The array of relative capacities (C1, C2, ...) of all bricks is called the volume partitioning. Obviously, C1 + C2 + ... = 1.

'''(Real) data space usage''' on a brick is the number of data blocks stored on that brick.

'''Ideal (or expected) data space usage''' on a brick is T*C, where T is the total number of data blocks stored in the volume and C is the relative capacity of the brick.

It is recommended to compose volumes so that the space-based partitioning coincides with the throughput-based one - this is the optimal volume configuration, which provides true parallelism. If that is impossible for some reason, choose a preferred partitioning method (space-based or throughput-based). Note that space-based partitioning saves volume space, whereas throughput-based partitioning saves volume throughput.

When performing regular file operations, Reiser5 distributes data stripes throughout the volume evenly and fairly. This means that the portion of IO requests issued against each brick is equal to its relative capacity, that is, to the portion of capacity that the brick adds to the total volume's capacity.

In contrast with regular file operations, volume operations break the fairness of data distribution on your logical volume. To restore fairness of distribution, a special balancing procedure should be run on the volume. For example, after adding a brick to a logical volume, the balancing procedure populates the new brick with data moved from other bricks. The operation of removing a brick from a logical volume always starts with balancing, which moves data from the brick you want to remove to the other bricks of the volume. All volume operations (except brick removal) leave the volume unbalanced.
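The definitions above can be turned into a small calculation. The following sketch is illustrative only; the per-brick statistics (data capacity, blocks used, system blocks) are the ones from the worked example elsewhere on this page, and it computes the volume partitioning, the ideal data space usage, and the resulting quality of distribution:

```python
# Illustrative sketch of the definitions above.  The brick statistics
# match the worked example elsewhere on this page:
# (data capacity, blocks used, system blocks) per data brick.
bricks = [
    (2621069, 1657203, 115),   # data brick of index 1
    (1310391,  833001,  73),   # data brick of index 2
]

cap_total = sum(c for c, _, _ in bricks)
rel_cap   = [c / cap_total for c, _, _ in bricks]       # volume partitioning
real      = [used - system for _, used, system in bricks]  # real data space usage
total     = sum(real)                                   # data blocks in the volume
ideal     = [c * total for c in rel_cap]                # ideal data space usage
rel_dev   = [(r - i) / total for r, i in zip(real, ideal)]
quality   = 1 - max(abs(d) for d in rel_dev)            # quality of distribution

print("partitioning:", [round(c, 4) for c in rel_cap])
print("quality of distribution: %.4f" % quality)
```

Note that the relative capacities sum to 1 and the per-brick deviations sum to 0, exactly as the definitions require.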
If the balancing procedure fails for some reason, it should be completed by running the volume.reiser4 utility with the option -b (--balance). It is allowed to perform regular file and volume operations on an unbalanced LV (provided the imbalance is not due to an incomplete removal); however, in this case we don't guarantee a good quality of data distribution on your LV. In addition, on a volume with an incomplete removal you won't be able to perform regular volume operations - first you will need to complete the removal by running the volume.reiser4 utility with the option -R (--finish-removal).

= Prepare Software and Hardware =

Build, install and boot a kernel with Reiser4 of software framework release number 5.X.Y. Kernel patches can be found [https://sourceforge.net/projects/reiser4/files/v5-unstable/ here]. Note that the Linux kernel and GNU utilities still recognize the testing stuff as "Reiser4". Make sure there is the following message in the kernel logs:

 "Loading Reiser4 (Software Framework Release: 5.X.Y)"

Build and install the latest [https://sourceforge.net/projects/reiser4/files/reiser4-utils/libaal/ libaal]. Download, build and install the latest version 2.A.B of the [https://sourceforge.net/projects/reiser4/files/v5-unstable/ Reiser4progs package]. Make sure that the utility for managing logical volumes is installed on your machine (as a part of the reiser4progs package):

 # volume.reiser4 -?

= Creating a logical volume =

Start by choosing a unique ID (UUID) for your volume. By default it is generated by the mkfs utility. However, you can generate it yourself with a suitable tool (e.g. uuid(1)) and store it in an environment variable for convenience:

 # VOL_ID=`uuidgen`
 # echo "Using uuid $VOL_ID"

Choose a stripe size for your logical volume. For a good quality of distribution it is recommended that the stripe size not exceed 1/10000 of the volume size. On the other hand, too small a stripe will increase space consumption on your meta-data brick.
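The 1/10000 rule of thumb above can be sanity-checked with a quick calculation. This is only a sketch; the 4 TiB volume size is hypothetical, and the 1/10000 threshold is simply the recommendation stated above:

```python
# Rough check of the rule of thumb above: for a good quality of
# distribution, the stripe size should not exceed 1/10000 of the
# volume size.  The 4 TiB volume size here is purely illustrative.
KIB = 1024
volume_size = 4 * 1024**3 * KIB        # 4 TiB volume, hypothetical
stripe      = 256 * KIB                # a 256K stripe

max_stripe = volume_size // 10000
print("max recommended stripe: about", max_stripe // KIB, "KiB")
assert stripe <= max_stripe            # 256K easily fits a 4 TiB volume
```

Conversely, for a given stripe size the rule suggests a minimal sensible volume size of about 10000 stripes.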
In our example we choose a stripe size of 256K:

 # STRIPE=256K
 # echo "Using stripe size $STRIPE"

Start by creating the first brick of your volume - the meta-data brick - passing the volume ID and stripe size to the mkfs.reiser4 utility:

 # mkfs.reiser4 -U $VOL_ID -t $STRIPE /dev/vdb1

Currently only one meta-data brick per volume is supported, so it is recommended that the block device for the meta-data brick not be too small. In most cases it is enough if your meta-data brick is not smaller than 1/200 of the maximal volume size. For example, a 100G meta-data brick will be able to service a ~20T logical volume. Data and meta-data bricks don't differ from the standpoint of disk format, and there is no special option to tell the mkfs utility that we want to create a meta-data brick: the first brick in the volume automatically becomes the meta-data brick, and the other bricks are interpreted as data bricks.

Mount your initial logical volume consisting of one meta-data brick:

 # mount /dev/vdb1 /mnt

Find a record about your volume in the output of the following command:

 # volume.reiser4 -l

Create the configuration of your logical volume (its definition is above) and store it somewhere, but not on that volume! Your logical volume is now on-line and ready to use. You can perform regular file operations and volume operations (e.g. add a data brick to your LV).

= Adding a data brick to LV =

At any time you are able to add a data brick to your LV. You can do it in parallel with regular file operations executing on this volume. Make sure, however, that no other volume operation (e.g. removing a brick) is in progress on your volume, otherwise your operation will fail with EBUSY. Obviously, adding a brick will increase the capacity of your volume.

Choose a block device for the new data brick. Make sure that it is not too large or too small: the capacities of any 2 bricks of the same logical volume cannot differ by more than 2^19 (~500 000) times. E.g.
your logical volume cannot contain both 1M and 2T bricks. Any attempt to add a brick of improper capacity will fail with an error.

Format the device with the same volume ID and stripe size as you used for the meta-data brick, but specify also the "-a" option (to not restrict data capacity):

 # mkfs.reiser4 -U $VOL_ID -t $STRIPE -a /dev/vdb2

Important: the data brick must be formatted with the same volume ID and stripe size as the meta-data brick of your logical volume; otherwise the operation of adding it will fail.

Update item #4 of your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration] with the UUID or name of the brick you want to add. To add the brick, simply pass its name as an argument to the option "-a" and specify your LV via its mount point:

 # volume.reiser4 -a /dev/vdb2 /mnt

The procedure of adding a brick automatically invokes re-balancing, which moves a portion of data stripes to the newly added brick (so that the resulting distribution is fair). The portion of data blocks moved during such rebalancing is equal to the relative capacity of the new brick, that is, to the portion of capacity that the new brick adds to the updated LV's capacity. This important property defines the cost of the balancing procedure: if the portion of capacity added by a brick is small, the number of stripes moved during balancing is also small.

Like other user-space utilities, the operation of adding a brick can return an error, even if the brick you wanted to add is properly formatted. In this case check the status of your LV:

 # volume.reiser4 /mnt

If the volume is unbalanced, simply complete balancing manually:

 # volume.reiser4 -b /mnt

Otherwise, check the number of bricks in your LV. Most likely it is the same as before the failed operation.
In this case simply repeat the operation of adding a brick from scratch. Upon successful completion update your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]: increment (#2), add info about the new brick to (#3) and remove the record at (#4).

= Removing a data brick from LV =

At any time you are able to remove a data brick from your LV. You can do it in parallel with regular file operations executing on this volume. Make sure, however, that no other volume operation (e.g. adding a brick) is in progress on your volume, otherwise your operation will fail with EBUSY. Obviously, removing a brick will decrease the abstract capacity of your LV. Note that the other bricks must have enough space to store all the data blocks of the brick you want to remove; otherwise the removal operation will return an error (ENOSPC).

Suppose you want to remove brick /dev/vdb2 from your LV mounted at /mnt. Update your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration] with the UUID and name of the brick you want to remove (item #4). To remove the brick, simply pass its name as an argument to the option "-r" and specify the logical volume by its mount point:

 # volume.reiser4 -r /dev/vdb2 /mnt

The procedure of brick removal automatically invokes re-balancing, which distributes the data of the brick to be removed among the other bricks, so that the resulting distribution is also fair. The portion of data stripes moved during such rebalancing is equal to the relative capacity of the brick to be removed (that is, to the portion of capacity that the brick added to the LV's capacity).

It can happen that the command above completes with an error (like other user-space applications).
In this case check the status of your LV:

 # volume.reiser4 /mnt

If the volume is not balanced, simply complete balancing manually:

 # volume.reiser4 -b /mnt

Otherwise, check the number of bricks in your logical volume - it should be the same as before the failed operation. The error -ENOSPC indicates that the free space on the other bricks is not enough to fit all the data of the brick you want to remove.

On success, update your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]: remove the information about the brick /dev/vdb2 at (#3) and (#4). Check your kernel logs: they should contain a message that brick /dev/vdb2 has been unregistered. Now the device /dev/vdb2 doesn't belong to the logical volume any more, and you can reuse it for other purposes (re-format, etc.).

= Changing brick's capacity =

At any time (assuming that no other volume operation is in progress) you can change the abstract capacity of any brick to some new value different from 0. Changing capacity always changes the volume partitioning and therefore breaks the fairness of distribution, so Reiser5 automatically launches rebalancing to make sure that the resulting distribution is fair for the new set of capacities. In particular, increasing a brick's capacity will move some data from other bricks to the brick whose capacity was increased; decreasing a brick's capacity will move some data from that brick to the other bricks.

To change the abstract capacity of the brick /dev/vdb1 to a new value (e.g. 200000), simply run

 # volume.reiser4 -z /dev/vdb1 -c 200000 /mnt

pronounced as "resize brick /dev/vdb1 to new capacity 200000 in the volume mounted at /mnt". The operation of changing capacity can return an error. Most likely it is -ENOSPC, which is a side effect of concurrent regular file writes. In this case check the status of your LV.
If it is unbalanced, then consider removing some files from your LV and complete balancing by running # volume.reiser4 -b /mnt Otherwise, repeat the operation from scratch. Comment. Changing bricks capacity to 0 is undefined and will return error. Consider brick removal operation instead. = Operations with meta-data brick = Meta-data brick can also contain data stripes and participate in data distribution like other data bricks. So that all the volume operations described above are also applicable to meta-data brick. Note, however, that it is impossible to completely remove meta-data brick from the logical volume for obvious reasons (meta-data need to be stored somewhere), so brick removal operation applied to the meta-data brick actually removes it from Data-Storage Array (DSA), not from the logical volume. DSA is a subset of LV consisting of bricks, participating in data distribution. Once you remove meta-data brick from DSA, that brick will be used only to store meta-data. Operation of adding a brick, being applied to a meta-data brick, returns the last one back to DSA. Important: Reiser5 doesn't count busy data and meta-data blocks separately. So in contrast with data bricks (which contain only data) you are not able to find out real space occupied by data blocks on the meta-data brick - Reiser5 knows only total space occupied. To check the status of meta-data brick simply run # volume.reiser4 /mnt and compare values of "bricks total" and "bricks in DSA". If they are equal, then meta-data brick participates in data distribution. Otherwise, "bricks total" should be 1 more than "bricks in DSA" - it indicates that meta-data brick doesn't participate in data distribution (and therefore, doesn't contain data blocks). Note that other cases are impossible: for data bricks participation in LV and DSA is always equivalent. = Unmounting a logical volume = To terminate a mount session just issue usual umount command with the mount point specified. 
Note that after unmounting the volume all bricks by default remain to be registered in the system till system shutdown. If you want to unregister a brick before system shutdown, then simply issue the following command: # volume.reiser4 -u BRICK_NAME = Deploying a logical volume after correct unmount = Make sure (by checking your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]) that all bricks of the volume are registered in the system. To register a brick issue the following command: # volume.reiser4 -g BRICK_NAME The list of all volumes and bricks registered in the system can be found in the output of the following command: # volume.reisrer4 -l Issue usual mount(8) command against one of the bricks of your volume. It is recommended to issue it against meta-data brick. NOTE: Reiser5 will refuse to mount a logical volume, in the case, when a wrong (incomplete or redundant) set of bricks is registered in the system. Redundant set of bricks appears, for example, when you mistakenly register a brick that was earlier removed from the logical volume. = Deploying a logical volume after correct shutdown = First of all, check [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing configuration] of your volume and make sure that all its bricks (data and meta-data ones) are registered in the system. The list of registered bricks can be printed by # volume.reiser4 -l Also make sure that the set of registered per volume bricks doesn't contain bricks not mentioned in the [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]. 
Important: Reiser5 will refuse to mount a logical volume, in the case, when a wrong (incomplete or redundant) set of bricks is registered in the system. Redundant set of bricks appears, for example, when you mistakenly register a brick that was removed from the logical volume. For this reasons we strongly recommend for user to keep a track of his LV - store its [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing configuration] somewhere, but not in this volume! And don't forget to update that configuration after '''every''' volume operation. If you lost configuration of your LV and don't remember it (wich is most likely for large volumes), then it will be rather painful to restore it: currently there is no tools for to manage logical volumes off-line. So that, users are prompted to do this on their own. It is not at all difficult. To register a brick in the system use the following command: # volume.reiser4 -g BRICK_NAME To print a list of all registered bricks use # volume.reiser4 -l Now mount your LV, simply issuing a mount(8) command against one of the bricks of your LV. We recommend to issue it against meta-data brick. Comment. Reiser5 always tries to register the brick which is passed to the mount command as an argument, so it is not necessarily to preregister the brick you want to issue a mount command against. = Deploying a logical volume after hard reset or system crash = If no volume operations were interrupted by hard reset or system crash, then just follow the instructions in this [https://reiser4.wiki.kernel.org/index.php?title=Logical_Volumes_Administration#Deploying_a_logical_volume_after_correct_shutdown section]. In Reiser5 only restricted number of bricks participate in every transaction. Maximal number of such bricks can be specified by user. 
At mount time a transaction replay procedure will be launched on each such brick independently, in parallel. Depending on the kind of interrupted volume operation, perform one of the following actions:

== Adding a brick was interrupted ==

Check your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]. Register the old set of bricks (that is, the set of bricks that the volume had before the operation) and try to mount. In the case of an error, register also the brick you wanted to add and try to mount again. Check the status of your LV by running
 # volume.reiser4 /mnt
If the volume is unbalanced, then complete balancing manually by running
 # volume.reiser4 -b /mnt
Check "bricks total" of your LV in the output of
 # volume.reiser4 /mnt
Compare it with the old number of bricks in the [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]. The new value should be one greater than the old one. If the number of bricks is the same, then your operation of adding a brick was completely rolled back by the transaction manager, so you need to repeat it from scratch. Otherwise, your operation was successfully completed - update your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration] accordingly.

== Brick removal was interrupted ==

Check your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]. Register the new set of bricks (that is, the set of bricks without the brick you wanted to remove).
Try to mount the volume. In the case of an error, register also the brick you wanted to remove and try to mount again. Check the status of your LV:
 # volume.reiser4 /mnt
If the volume is unbalanced, then complete balancing manually by running
 # volume.reiser4 -b /mnt
Otherwise, check the total number of bricks in your LV. If it is the same as before the removal, then your removal operation was completely rolled back by the transaction manager, so you will need to repeat it from scratch.

Comment. After successful completion of balancing the brick will be automatically removed from the volume and unregistered. Make sure of it by checking the status of your LV and the list of registered bricks:
 # volume.reiser4 /mnt
 # volume.reiser4 -l
Upon successful completion update your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration] accordingly.

== Another volume operation was interrupted ==

Using the volume configuration, register the new set of bricks and try to mount the volume. The mount should be successful.
Check the status of your LV:
 # volume.reiser4 /mnt
If the volume is unbalanced, then complete balancing manually by running
 # volume.reiser4 -b /mnt

= LV monitoring =

Common info about an LV mounted at /mnt:
 # volume.reiser4 /mnt
 ID:            Volume UUID
 volume:        ID of the plugin managing the volume
 distribution:  ID of the distribution plugin
 stripe:        Stripe size in bytes
 segments:      Number of hash space segments (for distribution)
 bricks total:  Total number of bricks in the volume
 bricks in DSA: Number of bricks participating in data distribution
 balanced:      Balanced status of the volume
Info about any of its bricks, of index J:
 # volume.reiser4 -p J /mnt
 internal ID:   Brick's "internal ID" and its status in the volume
 external ID:   Brick's UUID
 device name:   Name of the block device associated with the brick
 block count:   Size of the block device in blocks
 blocks used:   Total number of occupied blocks on the device
 system blocks: Minimal possible number of busy blocks on that device
 data capacity: Abstract capacity of the brick
 space usage:   Portion of occupied blocks on the device
 in DSA:        Participation in regular data distribution
 is proxy:      Participation in data tiering (Burst Buffers, etc.)

Comment. When retrieving brick info, make sure that no volume operations on that volume are in progress. Otherwise the command above will return an error (EBUSY).

WARNING. Brick info obtained this way is not necessarily the most recent. To get actual info, run sync(1) and make sure that no regular file operations are in progress.

= Checking free space =

To check the number of available free blocks on a volume mounted at /mnt, make sure that no regular file operations, as well as no volume operations, are in progress on that volume, then run
 # sync
 # df --block-size=4K /mnt
To check the number of free blocks on the brick of index J, run
 # volume.reiser4 -p J /mnt
and calculate the difference between "block count" and "blocks used".

Comment. Not all free blocks on a brick/volume are available for use.
The number of available free blocks is always ~95% of the total number of free blocks (Reiser4 reserves 5% to make sure that regular file truncate operations won't fail).

NOTE: volume.reiser4 shows the total number of free blocks, whereas df(1) shows the number of available free blocks. The "space usage" statistic shows the portion of busy blocks on an individual brick. For the reasons explained above, "space usage" on any brick can not be more than 0.95.

= Checking quality of data distribution =

Quality of data distribution is a measure of the deviation of the real data space usage from the ideal one defined by the volume partitioning. The smaller the deviation, the better the distribution quality. Checking quality of distribution makes sense only in the case when your volume partitioning is space-based, or coincides with the space-based one. If your partitioning is throughput-based and doesn't coincide with the space-based one, then the quality of the actual data distribution can be rather bad: in this case the file system takes care that low-performance devices don't become a bottleneck, and effective space usage is not a high priority.

Checking quality of data distribution is based on the free blocks accounting provided by the file system. Note that the file system doesn't count busy data and meta-data blocks separately, so you are not able to find the real data space usage, and hence to check quality of distribution, in the case when the meta-data brick contains data blocks.

To check quality of distribution:
* (1) make sure that the meta-data brick doesn't contain data blocks;
* (2) make sure that no regular file and volume operations are currently in progress;
* (3) find the "blocks used", "system blocks" and "data capacity" statistics for each data brick:
 # sync
 # volume.reiser4 -p 1 /mnt
 ...
 # volume.reiser4 -p N /mnt
* (4) find the real data space usage on each brick;
* (5) calculate the partitioning and the ideal data space usage on each data brick;
* (6) find the deviation of (4) from (5).

Example.
Let's build an LV of 3 bricks (one 10G meta-data brick vdb1, and two data bricks: vdc1 (10G) and vdd1 (5G)) with space-based partitioning:
 # VOL_ID=`uuid -v4`
 # echo "Using uuid $VOL_ID"
 # mkfs.reiser4 -U $VOL_ID -y -t 256K /dev/vdb1
 # mkfs.reiser4 -U $VOL_ID -y -a -t 256K /dev/vdc1
 # mkfs.reiser4 -U $VOL_ID -y -a -t 256K /dev/vdd1
 # mount /dev/vdb1 /mnt
Fill the meta-data brick with data:
 # dd if=/dev/zero of=/mnt/myfile bs=256K
 No space left on device...
Add data bricks /dev/vdc1 and /dev/vdd1 to the volume:
 # volume.reiser4 -a /dev/vdc1 /mnt
 # volume.reiser4 -a /dev/vdd1 /mnt
Move all data blocks to the newly added bricks:
 # volume.reiser4 -r /dev/vdb1 /mnt
 # sync
Now the meta-data brick doesn't contain data blocks (only meta-data ones), so we can calculate quality of data distribution:
 # volume.reiser4 /mnt -p0
 blocks used: 503
 # volume.reiser4 /mnt -p1
 blocks used: 1657203
 system blocks: 115
 data capacity: 2621069
 # volume.reiser4 /mnt -p2
 blocks used: 833001
 system blocks: 73
 data capacity: 1310391
Based on the statistics above, calculate quality of distribution.

Total data capacity of the volume:
 C = 2621069 + 1310391 = 3931460
Relative capacities of the data bricks:
 C1 = 2621069 / (2621069 + 1310391) = 0.6667
 C2 = 1310391 / (2621069 + 1310391) = 0.3333
Real space usage on the data bricks (blocks used - system blocks):
 R1 = 1657203 - 115 = 1657088
 R2 = 833001 - 73 = 832928
Space usage on the volume:
 R = R1 + R2 = 1657088 + 832928 = 2490016
Ideal data space usage on the data bricks:
 I1 = C1 * R = 0.6667 * 2490016 = 1660094
 I2 = C2 * R = 0.3333 * 2490016 = 829922
Deviation:
 D = (R1, R2) - (I1, I2) = (-3006, 3006)
Relative deviation:
 D/R = (-0.0012, 0.0012)
Quality of distribution:
 Q = 1 - max(|D1|, |D2|)/R = 1 - 0.0012 = 0.9988

Comment. For any specified number of bricks N and quality of distribution Q it is possible to find a configuration of a logical volume composed of N bricks, such that the quality of distribution on that volume is better than Q.
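The arithmetic of the example above can be scripted. The following is a minimal sketch (plain shell plus awk); the per-brick statistics are copied by hand from the volume.reiser4 output shown above - it makes no assumption about machine-parsing that output:

```shell
#!/bin/sh
# Compute quality of distribution Q = 1 - max(|R1/R - C1|, |R2/R - C2|)
# from the "blocks used" / "system blocks" / "data capacity" values
# reported by `volume.reiser4 -p J /mnt` (numbers from the example above).
awk 'BEGIN {
  u1 = 1657203; s1 = 115; c1 = 2621069   # brick 1: used, system, capacity
  u2 = 833001;  s2 = 73;  c2 = 1310391   # brick 2: used, system, capacity
  C  = c1 + c2                           # total data capacity of the volume
  r1 = u1 - s1; r2 = u2 - s2             # real data space usage per brick
  R  = r1 + r2                           # data space usage on the volume
  d1 = r1 / R - c1 / C                   # relative deviation, brick 1
  d2 = r2 / R - c2 / C                   # relative deviation, brick 2
  a1 = (d1 < 0) ? -d1 : d1
  a2 = (d2 < 0) ? -d2 : d2
  printf "Q = %.4f\n", 1 - (a1 > a2 ? a1 : a2)
}'
# prints: Q = 0.9988
```

Note that d1 here equals (R1 - I1)/R from the example, since I1 = C1*R.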
Comment. The quality of distribution Q doesn't depend on the number of bricks in the logical volume. This is a theorem, which can be strictly proven.

= FAQ =

Q. What happens if I lose a device-component (due to a breakdown, etc.) of my logical volume?

A. The bodies of some of your regular files will become "punched" in random places. The portion of such files depends on the relative capacity of the lost brick, on the number of bricks in the logical volume, and on other factors. Fsck will be able to detect and remove such files with corrupted bodies. Nevertheless, we recommend considering mirroring your bricks (e.g. by software or hardware RAID-1) to avoid such highly unpleasant situations.

[[category:Reiser4]]

Before working with logical volumes you need to understand some basic [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Background principles]. A logical volume (LV) can be composed of any number of block devices, different in physical and geometric parameters. However, the optimal configuration (true parallelism) imposes some restrictions and dependencies on the size of such devices.

WARNING: This software is not yet stable. Don't put important data on logical volumes managed by software of release number 5.X.Y. Also don't mount your old partitions in kernels with Reiser4 of SFRN 5.X.Y before its stabilization.

IMPORTANT: Currently there are no tools to manage Reiser5 logical volumes off-line, so it is strongly recommended to save/update the configuration of your LV in a file which doesn't belong to that volume.

= Basic definitions. Volume configuration. Brick's capacity. Partitioning. Fair distribution.
Balancing =

The basic configuration of a logical volume is the following information:
 1) Volume UUID;
 2) Number of bricks in the volume;
 3) List of brick names or UUIDs in the volume;
 4) UUID or name of the brick to be added/removed (if any). That brick is not counted in (2) and (3).
Item #4 is needed to handle incomplete operations interrupted for various reasons (system crash, hard reset, etc.) when bringing logical volumes on-line. For each volume its configuration should be stored somewhere (but not on that volume!) and properly updated before and after each volume operation performed on that volume. We make the user responsible for this. The volume configuration is needed to facilitate deploying a volume.

'''Abstract capacity''' (or simply capacity) of a brick is a positive integer number. Capacity is a brick's property defined by the user. Don't confuse it with the size of the block device. Think of it as the brick's "weight" in some units. It is the user who decides which property of the brick to assign as its abstract capacity and in which units. In particular, it can be the size of the block device in kilobytes, or its size in megabytes, or its throughput in MB/sec, or another geometric or physical parameter of the device associated with the brick. It is important that the capacities of all bricks of the same logical volume are measured in the same units. Also, it would be utterly pointless to assign different properties as abstract capacities for bricks of the same LV - for example, the size of the block device for one brick and the disk bandwidth for another one.

The capacity of each brick gets initialized by the mkfs utility. By default it is calculated as the number of free blocks on the device at the very end of the formatting procedure. For a meta-data brick it is calculated as 70% of that amount. The capacity of any brick can be changed on-line by the user.

'''Capacity of a logical volume''' is defined as the sum of the capacities of its component bricks.
'''Relative capacity of a brick''' is the ratio of the brick's capacity to the volume's capacity. Relative capacity defines the portion of IO requests that will be issued against that brick. The array of relative capacities (C1, C2, ...) of all bricks is called the volume partitioning. Obviously, C1 + C2 + ... = 1.

'''(Real) data space usage''' on a brick is the number of data blocks stored on that brick.

'''Ideal (or expected) data space usage''' on a brick is T*C, where T is the total number of data blocks stored in the volume and C is the relative capacity of the brick.

It is recommended to compose volumes so that the space-based partitioning coincides with the throughput-based one - this is the optimal volume configuration, which provides true parallelism. If that is impossible for some reason, then choose a preferred partitioning method (space-based or throughput-based). Note that space-based partitioning saves volume space, whereas throughput-based partitioning saves volume throughput.

When performing regular file operations, Reiser5 distributes data stripes throughout the volume evenly and fairly. It means that the portion of IO requests issued against each brick is equal to its relative capacity, that is, to the portion of capacity that the brick adds to the total volume capacity.

Most volume operations break the fairness of data distribution on your logical volume. To restore fairness of distribution, a special balancing procedure should be run on the volume. For example, after adding a brick to a logical volume, the balancing procedure will populate the new brick with data moved from other bricks. The operation of removing a brick from a logical volume is always accompanied by balancing, which moves data out of the brick you want to remove to the other bricks of the volume. Every time the user performs a volume operation, Reiser5 marks the LV as "not balanced". After successful balancing the status of the LV is changed to "balanced".
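The capacity definitions above can be illustrated numerically. A minimal sketch (plain shell plus awk, with entirely hypothetical capacities and block counts, not taken from any real volume):

```shell
#!/bin/sh
# Hypothetical example: a 3-brick volume with abstract capacities
# 100000, 200000, 100000 and T = 50000 data blocks stored in it.
# Prints each brick's relative capacity C_j (they sum to 1) and its
# ideal data space usage T*C_j.
awk 'BEGIN {
  n = split("100000 200000 100000", cap, " ")
  T = 50000
  for (j = 1; j <= n; j++) total += cap[j]   # capacity of the whole volume
  for (j = 1; j <= n; j++) {
    C = cap[j] / total                       # relative capacity
    printf "brick %d: C%d = %.2f, ideal usage = %d blocks\n", j, j, C, T * C
  }
}'
# prints:
# brick 1: C1 = 0.25, ideal usage = 12500 blocks
# brick 2: C2 = 0.50, ideal usage = 25000 blocks
# brick 3: C3 = 0.25, ideal usage = 12500 blocks
```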
If the balancing procedure fails for some reason, it should be resumed manually (with the volume.reiser4 utility). It is allowed to perform regular file and volume operations on an unbalanced LV (assuming the interrupted operation was not a removal). However, in this case we don't guarantee a good quality of data distribution on your LV. On a volume with an incomplete removal you won't be able to perform regular volume operations - first you will need to complete the removal by running the volume.reiser4 utility with the option -R (--finish-removal).

= Prepare Software and Hardware =

Build, install and boot a kernel with Reiser4 of software framework release number 5.X.Y. Kernel patches can be found [https://sourceforge.net/projects/reiser4/files/v5-unstable/ here]. Note that the Linux kernel and GNU utilities still recognize the testing software as "Reiser4". Make sure there is the following message in the kernel logs:
 "Loading Reiser4 (Software Framework Release: 5.X.Y)"
Build and install the latest [https://sourceforge.net/projects/reiser4/files/reiser4-utils/libaal/ libaal]. Download, build and install the latest version 2.A.B of the [https://sourceforge.net/projects/reiser4/files/v5-unstable/ Reiser4progs package]. Make sure that the utility for managing logical volumes is installed (as a part of the reiser4progs package) on your machine:
 # volume.reiser4 -?

= Creating a logical volume =

Start by choosing a unique ID (uuid) for your volume. By default it is generated by the mkfs utility. However, you can generate it yourself with suitable tools (e.g. uuid(1)) and store it in an environment variable for convenience:
 # VOL_ID=`uuidgen`
 # echo "Using uuid $VOL_ID"
Choose a stripe size for your logical volume. For a good quality of distribution it is recommended that the stripe doesn't exceed 1/10000 of the volume size. On the other hand, too small stripes will increase space consumption on your meta-data brick.
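The 1/10000 stripe rule above (together with the 1/200 meta-data sizing guideline given in the next section) can be checked with plain shell arithmetic. The volume size here is a hypothetical planning figure:

```shell
#!/bin/sh
# Hypothetical planned volume size: 25G, expressed in KB.
VOL_K=$((25 * 1024 * 1024))
# The stripe should not exceed 1/10000 of the volume size.
echo "largest recommended stripe:    $((VOL_K / 10000))K"
# The meta-data brick should be at least 1/200 of the maximal volume size.
echo "smallest recommended md brick: $((VOL_K / 200 / 1024))M"
# prints:
# largest recommended stripe:    2621K
# smallest recommended md brick: 128M
```

For a volume of roughly this size, the 256K stripe used in the examples on this page is comfortably within the limit.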
In our example we choose stripe size 256K:
 # STRIPE=256K
 # echo "Using stripe size $STRIPE"
Start by creating the first brick of your volume - the meta-data brick - passing the volume ID and stripe size to the mkfs.reiser4 utility:
 # mkfs.reiser4 -U $VOL_ID -t $STRIPE /dev/vdb1
Currently only one meta-data brick per volume is supported, so it is recommended that the block device for the meta-data brick is not too small. In most cases it will be enough if your meta-data brick is not smaller than 1/200 of the maximal volume size. For example, a 100G meta-data brick will be able to service a ~20T logical volume. Data and meta-data bricks don't differ from the standpoint of disk format, and there is no special option to inform the mkfs utility that we want to create a meta-data brick specifically: the first brick in the volume automatically becomes the meta-data brick, and the other bricks are interpreted as data bricks.

Mount your initial logical volume consisting of one meta-data brick:
 # mount /dev/vdb1 /mnt
Find a record about your volume in the output of the following command:
 # volume.reiser4 -l
Create the configuration of your logical volume (its definition is above) and store it somewhere - but not on that volume! Your logical volume is now on-line and ready to use. You can perform regular file operations and volume operations (e.g. add a data brick to your LV).

= Adding a data brick to LV =

At any time you are able to add a data brick to your LV. You can do it in parallel with regular file operations executing on this volume. Make sure, however, that no other volume operation (e.g. removing a brick) is in progress on your volume, otherwise your operation will fail with EBUSY. Obviously, adding a brick will increase the capacity of your volume.

Choose a block device for the new data brick. Make sure that it is not too large or too small: the capacities of any 2 bricks of the same logical volume can not differ by more than 2^19 (~500 thousand) times. E.g.
your logical volume can not contain both 1M and 2T bricks. Any attempt to add a brick of improper capacity will fail with an error. Format it with the same volume ID and stripe size as you used for the meta-data brick, but also specify the "-a" option (to not restrict data capacity):
 # mkfs.reiser4 -U $VOL_ID -t $STRIPE -a /dev/vdb2
Important: it is important that the data brick is formatted with the same volume ID and stripe size as the meta-data brick of your logical volume. Otherwise, the operation of adding the data brick will fail.

Update item #4 of your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration] with the UUID or name of the brick you want to add. To add the brick, simply pass its name as an argument to the option "-a" and specify your LV via its mount point:
 # volume.reiser4 -a /dev/vdb2 /mnt
The procedure of adding a brick automatically invokes re-balancing, which moves a portion of data stripes to the newly added brick (so that the resulting distribution will be fair). The portion of data blocks moved during such rebalancing is equal to the relative capacity of the new brick, that is, to the portion of capacity that the new brick adds to the updated LV's capacity. This important property defines the cost of the balancing procedure. If the portion of capacity added by a brick is small, then the number of stripes moved during balancing is also small.

Like other user-space utilities, the operation of adding a brick can return an error, even assuming that the brick you wanted to add is properly formatted. In this case check the status of your LV:
 # volume.reiser4 /mnt
If the volume is unbalanced, then simply complete balancing manually:
 # volume.reiser4 -b /mnt
Otherwise, check the number of bricks in your LV. Most likely it is the same as it was before the failed operation.
In this case simply repeat the operation of adding a brick from scratch. Upon successful completion update your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]. That is, increment (#2), add info about the new brick to (#3) and remove the record at (#4).

= Removing a data brick from LV =

At any time you are able to remove a data brick from your LV. You can do it in parallel with regular file operations executing on this volume. Make sure, however, that no other volume operation (e.g. adding a brick) is in progress on your volume, otherwise your operation will fail with EBUSY. Obviously, removing a brick will decrease the abstract capacity of your LV. Note that the other bricks should have enough space to store all the data blocks of the brick you want to remove, otherwise the removal operation will return an error (ENOSPC).

Suppose you want to remove brick /dev/vdb2 from your LV mounted at /mnt. Update your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration] with the UUID and name of the brick you want to remove (item #4). To remove the brick, simply pass its name as an argument to the option "-r" and specify the logical volume by its mount point:
 # volume.reiser4 -r /dev/vdb2 /mnt
The procedure of brick removal automatically invokes re-balancing, which distributes the data of the brick to be removed among the other bricks, so that the resulting distribution is also fair. The portion of data stripes moved during such rebalancing is equal to the relative capacity of the brick to be removed (that is, to the portion of capacity that the brick added to the LV's capacity).

It can happen that the command above completes with an error (like other user-space applications).
In this case check the status of your LV:
 # volume.reiser4 /mnt
If the volume is not balanced, then simply complete balancing manually:
 # volume.reiser4 -b /mnt
Otherwise, check the number of bricks in your logical volume - it should be the same as before the failed operation. The error -ENOSPC indicates that the free space on the other bricks is not enough to fit all the data of the brick you want to remove.

On success, update your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]: remove the information about the brick /dev/vdb2 at #3 and #4. Check your kernel logs: they should contain a message that brick /dev/vdb2 has been unregistered. Now the device /dev/vdb2 doesn't belong to the logical volume any more, and you can reuse it for other purposes (re-format, etc.).

= Changing brick's capacity =

At any time (assuming that no other volume operation is in progress) you can change the abstract capacity of any brick to some new value different from 0. Changing capacity always changes the volume partitioning and, therefore, breaks fairness of distribution, so Reiser5 automatically launches rebalancing to make sure that the resulting distribution is fair for the new set of capacities. In particular, increasing a brick's capacity will move some data from other bricks to the brick whose capacity was increased. Decreasing a brick's capacity will move some data from the brick whose capacity was decreased to other bricks.

To change the abstract capacity of a brick /dev/vdb1 to a new value (e.g. 200000), simply run
 # volume.reiser4 -z /dev/vdb1 -c 200000 /mnt
Pronounced as "resize brick /dev/vdb1 to new capacity 200000 in the volume mounted at /mnt". The operation of changing capacity can return an error. Most likely it is -ENOSPC, which is a side effect of concurrent regular file writes. In this case check the status of your LV.
If it is unbalanced, then consider removing some files from your LV and complete balancing by running
 # volume.reiser4 -b /mnt
Otherwise, repeat the operation from scratch.

Comment. Changing a brick's capacity to 0 is undefined and will return an error. Consider the brick removal operation instead.

= Operations with meta-data brick =

The meta-data brick can also contain data stripes and participate in data distribution like the other (data) bricks, so all the volume operations described above are also applicable to the meta-data brick. Note, however, that it is impossible to completely remove the meta-data brick from the logical volume for obvious reasons (meta-data needs to be stored somewhere), so the brick removal operation applied to the meta-data brick actually removes it from the Data Storage Array (DSA), not from the logical volume. The DSA is the subset of the LV consisting of the bricks participating in data distribution. Once you remove the meta-data brick from the DSA, that brick will be used only to store meta-data. The operation of adding a brick, applied to the meta-data brick, returns it back to the DSA.

Important: Reiser5 doesn't count busy data and meta-data blocks separately. So, in contrast with data bricks (which contain only data), you are not able to find out the real space occupied by data blocks on the meta-data brick - Reiser5 knows only the total space occupied.

To check the status of the meta-data brick, simply run
 # volume.reiser4 /mnt
and compare the values of "bricks total" and "bricks in DSA". If they are equal, then the meta-data brick participates in data distribution. Otherwise, "bricks total" should be 1 more than "bricks in DSA" - it indicates that the meta-data brick doesn't participate in data distribution (and therefore doesn't contain data blocks). Note that other cases are impossible: for data bricks, participation in the LV and in the DSA is always equivalent.

= Unmounting a logical volume =

To terminate a mount session, just issue the usual umount command with the mount point specified.
Number of available free blocks is always ~95% of total number of free blocks (Reiser4 reserves 5% to make sure that regular file truncate operations won't fail). NOTE: volume.reiser4 shows total number of free blocks, whereas df(1) shows number of available free blocks. "Space usage" statistics shows a portion of busy blocks on individual brick. For the reasons explained above "space usage" on any brick can not be more than 0.95 = Checking quality of data distribution = Quality of data distribution is a measure of deviation of the real data space usage from the ideal one defined by volume partitioning. The smaller the deviation, the better the distribution quality. Checking quality of distribution makes sense only in the case when your volume partitioning is space-based, or if it coincides with the space-based one. If your partitioning is throughput-based, and it doesn't coincide with the space-based one, then quality of actual data distribution can be rather bad, as in this case the file system is worried for low-performance devices to not become a bottleneck, and effective space usage in this case is not a high priority. Checking quality of data distribution is based on the free blocks accounting, provided by the file system. Note that file system doesn't count busy data and meta-data blocks separately, so you are not able to find real data space usage, and hence to check quality of distribution in the case when meta-data brick contains data blocks. To check quality of distribution * make sure that meta-data brick doesn't contain data blocks; * make sure that no regular file and volume operations are currently in progress; * find "blocks used", "system blocks" and "data capacity" statistics for each data brick: # sync # volume.reiser4 -p 1 /mnt ... # volume.reiser4 -p N /mnt * find real data space usage on each brick; * calculate partitioning and ideal data space usage on each data brick; * find deviation of (4) from (5). Example. 
Let' build a LV of 3 bricks (one 10G meta-data brick sdb1, and two data bricks: sdc1 (10G), sdd1(5G)) with space-based partitioning: # VOL_ID=`uuid -v4` # echo "Using uuid $VOL_ID" # mkfs.reiser4 -U $VOL_ID -y -t 256K /dev/vdb1 # mkfs.reiser4 -U $VOL_ID -y -a -t 256K /dev/vdc1 # mkfs.reiser4 -U $VOL_ID -y -a -t 256K /dev/vdd1 # mount /dev/vdb1 /mnt Fill the meta-data brick with data: # dd if=/dev/zero of=/mnt/myfile bs=256K No space left on device... Add data-bricks /dev/sdc1 and dev/sdd1 to the volume: # volume.reiser4 -a /dev/vdc1 /mnt # volume.reiser4 -a /dev/vdd1 /mnt Move all data blocks to the newly added bricks: # volume.reiser4 -r /dev/vdb1 /mnt # sync Now meta-data brick doesn't contain data blocks (only meta-data ones), so that we can calculate quality of data distribution # volume.reiser4 /mnt -p0 blocks used: 503 # volume.reiser4 /mnt -p1 blocks used: 1657203 system blocks: 115 data capacity: 2621069 # volume.reiser4 /mnt -p2 blocks used: 833001 system blocks: 73 data capacity: 1310391 Basing on the statistics above calculate quality of distribution. Total data capacity of the volume: C = 2621069 + 1310391 = 3931460 Relative capacities of data bricks: C1 = 2621069 /(2621069 + 1310391) = 0.6667 C2 = 1310464 /(2621069 + 1310391) = 0.3333 Real space usage on data bricks (blocks used - system blocks): R1 = 1657203 - 115 = 1657088 R2 = 833001 - 73 = 832928 Space usage on the volume: R = R1 + R2 = 1657088 + 832928 = 2490016 Ideal data space usage on data bricks: I1 = C1 * R = 0.6667 * 2490016 = 1660094 I2 = C2 * R = 0.3333 * 2490016 = 829922 Deviation: D = (R1, R2) - (I1, I2) = (3006, -3006) Relative deviation: D/R = (-0.0012, 0.0012) Quality of distribution: Q = 1 - max(|D1|, |D1|) = 1 - 0.0012 = 0.9988 Comment. For any specified number of bricks N and quality of distribution Q it is possible to find a configuration of a logical volume composed of N bricks, so that quality of distribution on that volume will be better than Q. Comment. 
Before working with logical volumes you need to understand some basic [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Background principles].

A logical volume (LV) can be composed of any number of block devices that differ in physical and geometric parameters. However, the optimal configuration (true parallelism) imposes some restrictions and dependencies on the sizes of such devices.

WARNING: This code is not yet stable. Don't put important data on logical volumes managed by software of release number 5.X.Y. Also, don't mount your old partitions in kernels with Reiser4 of SFRN 5.X.Y before its stabilization.

IMPORTANT: Currently there are no tools to manage Reiser5 logical volumes off-line, so it is strongly recommended to save and update the configuration of your LV in a file that does not belong to that volume.

= Basic definitions. Volume configuration. Brick's capacity. Partitioning. Fair distribution. Balancing =

The basic configuration of a logical volume is the following information:

1) Volume UUID;

2) Number of bricks in the volume;

3) List of brick names or UUIDs in the volume;

4) UUID or name of the brick being added or removed (if any). That brick is not counted in (2) and (3).
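Since this configuration has to be tracked by hand, it can help to keep it in a small machine-readable file. A minimal sketch in Python (the file layout and helper names are ours, not part of any Reiser5 tooling):

```python
import json

def save_config(path, vol_uuid, bricks, pending_brick=None):
    """Store items (1)-(4) in a file that does NOT live on the volume."""
    cfg = {
        "volume_uuid": vol_uuid,         # (1) volume UUID
        "brick_count": len(bricks),      # (2) number of bricks
        "bricks": list(bricks),          # (3) brick names or UUIDs
        "pending_brick": pending_brick,  # (4) brick being added/removed, if any
    }
    with open(path, "w") as f:
        json.dump(cfg, f, indent=2)
    return cfg

def load_config(path):
    """Read the configuration back when deploying the volume."""
    with open(path) as f:
        return json.load(f)
```

Update such a file before and after every volume operation, as described below.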
Item (4) exists to handle operations that were interrupted for various reasons (system crash, hard reset, etc.) when bringing a logical volume on-line. For each volume, its configuration should be stored somewhere (but not on that volume!) and properly updated before and after each volume operation performed on it. We make the user responsible for this. The volume configuration is needed to facilitate deploying a volume.

'''Abstract capacity''' (or simply capacity) of a brick is a positive integer. Capacity is a brick's property defined by the user. Don't confuse it with the size of the block device: think of it as the brick's "weight" in some units. It is the user who decides which property of the brick to use as its abstract capacity, and in which units. In particular, it can be the size of the block device in kilobytes, or its size in megabytes, or its throughput in MB/sec, or another geometric or physical parameter of the device associated with the brick. It is important that the capacities of all bricks of the same logical volume are measured in the same units. Also, it would be pointless to assign different properties as abstract capacities for bricks of the same LV: for example, block device size for one brick and disk bandwidth for another.

The capacity of each brick is initialized by the mkfs utility. By default it is calculated as the number of free blocks on the device at the very end of the formatting procedure. For the meta-data brick it is calculated as 70% of that amount. The capacity of any brick can be changed on-line by the user.

'''Capacity of a logical volume''' is defined as the sum of the capacities of its component bricks.

'''Relative capacity of a brick''' is the ratio of the brick's capacity to the volume's capacity. Relative capacity defines the portion of IO-requests that will be issued against that brick. The array of relative capacities (C1, C2, ...) of all bricks is called the volume partitioning. Obviously, C1 + C2 + ... = 1.
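The partitioning arithmetic can be illustrated in a few lines of Python (a sketch; capacities are in arbitrary, but uniform, units):

```python
def partitioning(capacities):
    """Relative capacities C_i = cap_i / (cap_1 + cap_2 + ...).
    Each C_i is also the expected portion of IO-requests issued
    against brick i; the values always sum to 1."""
    total = sum(capacities)
    return [c / total for c in capacities]

# e.g. three bricks with capacities in ratio 2:1:1
# partitioning([200, 100, 100]) -> [0.5, 0.25, 0.25]
```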
'''(Real) data space usage''' on a brick is the number of data blocks stored on that brick. '''Ideal (or expected) data space usage''' on a brick is T*C, where T is the total number of data blocks stored in the volume and C is the relative capacity of the brick.

It is recommended to compose volumes so that the space-based partitioning coincides with the throughput-based one: that is the optimal volume configuration, which provides true parallelism. If that is impossible for some reason, then choose a preferred partitioning method (space-based or throughput-based). Note that space-based partitioning saves volume space, whereas throughput-based partitioning saves volume throughput.

When performing regular file operations, Reiser5 distributes data stripes throughout the volume evenly and fairly. This means that the portion of IO-requests issued against each brick is equal to its relative capacity, that is, to the portion of capacity that the brick contributes to the total volume capacity.

Most volume operations are accompanied by rebalancing, which preserves fairness of distribution. For example, adding a brick to a logical volume changes its partitioning and hence breaks fairness of the distribution, so some data stripes have to be moved to the new brick to make the distribution fair again. Likewise, you cannot simply remove a brick from a logical volume: all data stripes first have to be moved from that brick to the other bricks of the volume.

Every time the user performs a volume operation, Reiser5 marks the LV as "not balanced". After successful balancing the status of the LV is changed back to "balanced". If the balancing procedure fails for some reason, it should be resumed manually (with the volume.reiser4 utility). It is allowed to perform regular file operations on an unbalanced LV. However, in this case: a) a good quality of data distribution on your LV is not guaranteed; b) you won't be able to perform volume operations on your LV except balancing: any other volume operation will return an error (EBUSY).
So, don't forget to bring your LV to the balanced state as soon as possible!

= Prepare Software and Hardware =

Build, install and boot a kernel with Reiser4 of software framework release number 5.X.Y. Kernel patches can be found [https://sourceforge.net/projects/reiser4/files/v5-unstable/ here]. Note that the Linux kernel and GNU utilities still report the testing code as "Reiser4". Make sure the following message appears in the kernel logs:

 Loading Reiser4 (Software Framework Release: 5.X.Y)

Build and install the latest [https://sourceforge.net/projects/reiser4/files/reiser4-utils/libaal/ libaal]. Download, build and install the latest version 2.A.B of the [https://sourceforge.net/projects/reiser4/files/v5-unstable/ Reiser4progs package]. Make sure that the utility for managing logical volumes is installed (as a part of the reiser4progs package) on your machine:

 # volume.reiser4 -?

= Creating a logical volume =

Start by choosing a unique ID (UUID) for your volume. By default it is generated by the mkfs utility. However, you can generate it yourself with an appropriate tool (e.g. uuid(1)) and store it in an environment variable for convenience:

 # VOL_ID=`uuidgen`
 # echo "Using uuid $VOL_ID"

Choose a stripe size for your logical volume. For a good quality of distribution it is recommended that the stripe size not exceed 1/10000 of the volume size. On the other hand, too small a stripe will increase space consumption on your meta-data brick. In our example we choose a stripe size of 256K:

 # STRIPE=256K
 # echo "Using stripe size $STRIPE"

Start by creating the first brick of your volume, the meta-data brick, passing the volume ID and stripe size to the mkfs.reiser4 utility:

 # mkfs.reiser4 -U $VOL_ID -t $STRIPE /dev/vdb1

Currently only one meta-data brick per volume is supported, so it is recommended that the block device for the meta-data brick is not too small. In most cases it is enough if your meta-data brick is not smaller than 1/200 of the maximal volume size.
For example, a 100G meta-data brick will be able to service a ~20T logical volume. Data and meta-data bricks don't differ from the standpoint of disk format, and there is no special option to inform the mkfs utility that we want to create a meta-data brick: the first brick in the volume automatically becomes the meta-data brick, and all other bricks are interpreted as data bricks.

Mount your initial logical volume consisting of one meta-data brick:

 # mount /dev/vdb1 /mnt

Find the record about your volume in the output of the following command:

 # volume.reiser4 -l

Create the configuration of your logical volume (its definition is above) and store it somewhere, but not on that volume! Your logical volume is now on-line and ready to use. You can perform regular file operations and volume operations (e.g. add a data brick to your LV).

= Adding a data brick to LV =

At any time you can add a data brick to your LV. You can do it in parallel with regular file operations executing on this volume. Make sure, however, that no other volume operation (e.g. removing a brick) is in progress on your volume, otherwise your operation will fail with EBUSY. Obviously, adding a brick increases the capacity of your volume.

Choose a block device for the new data brick. Make sure that it is not too large or too small: the capacities of any two bricks of the same logical volume cannot differ by more than 2^19 (about half a million) times. E.g. your logical volume cannot contain both a 1M and a 2T brick. Any attempt to add a brick of improper capacity will fail with an error.

Format the device with the same volume ID and stripe size as you used for the meta-data brick, but also specify the "-a" option (to not restrict data capacity):

 # mkfs.reiser4 -U $VOL_ID -t $STRIPE -a /dev/vdb2

Important: the data brick must be formatted with the same volume ID and stripe size as the meta-data brick of your logical volume. Otherwise, the operation of adding the data brick will fail.
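The sizing rules of thumb given above (capacity ratio at most 2^19, stripe at most 1/10000 of the volume size, meta-data brick at least 1/200 of the maximal volume size) can be checked before formatting. A sketch, assuming sizes in bytes and capacities in uniform units (the helper names are ours):

```python
def capacity_ratio_ok(capacities, limit=2**19):
    """Capacities of any two bricks of one LV must not differ
    by more than 2^19 times."""
    return max(capacities) <= limit * min(capacities)

def stripe_ok(stripe, volume_size):
    """Stripe size should not exceed 1/10000 of the volume size."""
    return stripe * 10000 <= volume_size

def metadata_brick_ok(md_size, volume_size):
    """Meta-data brick should be at least 1/200 of the maximal volume size."""
    return md_size * 200 >= volume_size
```

For instance, `capacity_ratio_ok([2**20, 2 * 2**40])` is false: a 1M and a 2T brick differ by 2^21 times, which exceeds the 2^19 limit.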
Update item (4) of your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration] with the UUID or name of the brick you want to add. To add the brick, simply pass its name as an argument to the "-a" option and specify your LV via its mount point:

 # volume.reiser4 -a /dev/vdb2 /mnt

The procedure of adding a brick automatically invokes rebalancing, which moves a portion of data stripes to the newly added brick (so that the resulting distribution is fair). The portion of data blocks moved during such rebalancing is equal to the relative capacity of the new brick, that is, to the portion of capacity that the new brick adds to the updated LV's capacity. This important property defines the cost of the balancing procedure: if the portion of capacity added by a brick is small, then the number of stripes moved during balancing is also small.

Like other user-space utilities, the operation of adding a brick can return an error, even if the brick you wanted to add is properly formatted. In this case check the status of your LV:

 # volume.reiser4 /mnt

If the volume is unbalanced, then simply complete balancing manually:

 # volume.reiser4 -b /mnt

Otherwise, check the number of bricks in your LV. Most likely it is the same as before the failed operation; in that case simply repeat the operation of adding a brick from scratch. Upon successful completion update your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]: increment (2), add info about the new brick to (3) and clear (4).

= Removing a data brick from LV =

At any time you can remove a data brick from your LV. You can do it in parallel with regular file operations executing on this volume.
Make sure, however, that no other volume operation (e.g. adding a brick) is in progress on your volume, otherwise your operation will fail with EBUSY. Obviously, removing a brick decreases the abstract capacity of your LV. Note that the other bricks should have enough space to store all data blocks of the brick you want to remove, otherwise the removal operation will return an error (ENOSPC).

Suppose you want to remove brick /dev/vdb2 from your LV mounted at /mnt. Update your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration] with the UUID and name of the brick you want to remove (item (4)). To remove the brick, simply pass its name as an argument to the "-r" option and specify the logical volume by its mount point:

 # volume.reiser4 -r /dev/vdb2 /mnt

The procedure of brick removal automatically invokes rebalancing, which distributes the data of the brick being removed among the other bricks, so that the resulting distribution is also fair. The portion of data stripes moved during such rebalancing is equal to the relative capacity of the brick being removed (that is, to the portion of capacity that the brick added to the LV's capacity).

It can happen that the command above completes with an error (like any other user-space application). In this case check the status of your LV:

 # volume.reiser4 /mnt

If the volume is not balanced, then simply complete balancing manually:

 # volume.reiser4 -b /mnt

Otherwise, check the number of bricks in your logical volume: it should be the same as before the failed operation. The error ENOSPC indicates that the free space on the other bricks is not enough to fit all the data of the brick you want to remove.
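Whether a removal has a chance to succeed can be estimated beforehand from the per-brick statistics (see "LV monitoring" below). A rough sketch, not what the kernel actually computes; it discounts free space by the 5% reserve described later on this page, and the helper name is ours:

```python
def removal_may_fit(block_count, blocks_used, remove_idx, reserve=0.05):
    """Estimate whether the data of brick `remove_idx` fits into the
    available free space of the remaining bricks (else expect ENOSPC).
    block_count / blocks_used are per-brick lists, in blocks."""
    to_move = blocks_used[remove_idx]
    available = sum((1 - reserve) * (block_count[i] - blocks_used[i])
                    for i in range(len(block_count)) if i != remove_idx)
    return to_move <= available
```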
On success, update your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]: remove the information about the brick /dev/vdb2 from (3) and (4). Check your kernel logs: they should contain a message that brick /dev/vdb2 has been unregistered. Now the device /dev/vdb2 doesn't belong to the logical volume any more, and you can reuse it for other purposes (re-format it, etc.).

= Changing brick's capacity =

At any time (assuming no other volume operation is in progress) you can change the abstract capacity of any brick to a new non-zero value. Changing capacity always changes the volume partitioning and therefore breaks fairness of distribution, so Reiser5 automatically launches rebalancing to make sure that the resulting distribution is fair for the new set of capacities. In particular, increasing a brick's capacity will move some data from the other bricks to the brick whose capacity was increased; decreasing a brick's capacity will move some data from the brick whose capacity was decreased to the other bricks.

To change the abstract capacity of brick /dev/vdb1 to a new value (e.g. 200000), simply run

 # volume.reiser4 -z /dev/vdb1 -c 200000 /mnt

Read this as "resize brick /dev/vdb1 to new capacity 200000 in the volume mounted at /mnt".

The operation of changing capacity can return an error. Most likely it is ENOSPC, which is a side effect of concurrent regular file writes. In this case check the status of your LV. If it is unbalanced, then consider removing some files from your LV and complete balancing by running

 # volume.reiser4 -b /mnt

Otherwise, repeat the operation from scratch.

Comment. Changing a brick's capacity to 0 is undefined and will return an error. Consider the brick removal operation instead.

= Operations with meta-data brick =

The meta-data brick can also contain data stripes and participate in data distribution like the data bricks.
Thus all the volume operations described above are also applicable to the meta-data brick. Note, however, that it is impossible to completely remove the meta-data brick from the logical volume for obvious reasons (meta-data needs to be stored somewhere), so the brick removal operation applied to the meta-data brick actually removes it from the Data Storage Array (DSA), not from the logical volume. The DSA is the subset of LV bricks that participate in data distribution. Once you remove the meta-data brick from the DSA, that brick will be used only to store meta-data. The operation of adding a brick, applied to the meta-data brick, returns it to the DSA.

Important: Reiser5 doesn't count busy data and meta-data blocks separately. So, in contrast with data bricks (which contain only data), you are not able to find out the real space occupied by data blocks on the meta-data brick: Reiser5 knows only the total space occupied.

To check the status of the meta-data brick simply run

 # volume.reiser4 /mnt

and compare the values of "bricks total" and "bricks in DSA". If they are equal, then the meta-data brick participates in data distribution. Otherwise, "bricks total" should be 1 more than "bricks in DSA", which indicates that the meta-data brick doesn't participate in data distribution (and therefore doesn't contain data blocks). Other cases are impossible: for data bricks, participation in the LV and in the DSA is always equivalent.

= Unmounting a logical volume =

To terminate a mount session, just issue the usual umount command with the mount point specified. Note that after unmounting the volume all bricks by default remain registered in the system until system shutdown.
If you want to unregister a brick before system shutdown, simply issue the following command:

 # volume.reiser4 -u BRICK_NAME

= Deploying a logical volume after correct unmount =

Make sure (by checking your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]) that all bricks of the volume are registered in the system. To register a brick, issue the following command:

 # volume.reiser4 -g BRICK_NAME

The list of all volumes and bricks registered in the system can be found in the output of the following command:

 # volume.reiser4 -l

Issue the usual mount(8) command against one of the bricks of your volume. It is recommended to issue it against the meta-data brick.

NOTE: Reiser5 will refuse to mount a logical volume if a wrong (incomplete or redundant) set of bricks is registered in the system. A redundant set of bricks appears, for example, when you mistakenly register a brick that was earlier removed from the logical volume.

= Deploying a logical volume after correct shutdown =

First of all, check the [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing configuration] of your volume and make sure that all its bricks (data and meta-data ones) are registered in the system. The list of registered bricks can be printed by

 # volume.reiser4 -l

Also make sure that the set of bricks registered for the volume doesn't contain bricks not mentioned in the [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration].

Important: Reiser5 will refuse to mount a logical volume if a wrong (incomplete or redundant) set of bricks is registered in the system.
A redundant set of bricks appears, for example, when you mistakenly register a brick that was removed from the logical volume. For this reason we strongly recommend keeping track of your LV: store its [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing configuration] somewhere, but not on this volume! And don't forget to update that configuration after '''every''' volume operation. If you have lost the configuration of your LV and don't remember it (which is likely for large volumes), then it will be rather painful to restore: currently there are no tools to manage logical volumes off-line, so users have to do this on their own. It is not at all difficult.

To register a brick in the system use the following command:

 # volume.reiser4 -g BRICK_NAME

To print the list of all registered bricks use

 # volume.reiser4 -l

Now mount your LV by simply issuing a mount(8) command against one of its bricks. We recommend issuing it against the meta-data brick.

Comment. Reiser5 always tries to register the brick that is passed to the mount command as an argument, so it is not necessary to pre-register the brick you want to issue the mount command against.

= Deploying a logical volume after hard reset or system crash =

If no volume operations were interrupted by the hard reset or system crash, then just follow the instructions in this [https://reiser4.wiki.kernel.org/index.php?title=Logical_Volumes_Administration#Deploying_a_logical_volume_after_correct_shutdown section]. In Reiser5 only a restricted number of bricks participates in every transaction; the maximal number of such bricks can be specified by the user.
Depending on a kind of interrupted volume operation, perform one of the following actions: == Adding a brick was interrupted == Check your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]. Register the old set of bricks (that is, the set of brick that the volume had before applying the operation) and try to mount. In the case of error register also the brick you wanted to add and try to mount again. Check the status of your LV by running # volume.reiser4 /mnt In the volume is unbalanced, then complete balancing manually by running # volume.reiser4 -b /mnt Check "bricks total" of your LV in the output of # volume.reiser4 /mnt Compare it with the old number of bricks in the [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]. The new value should be an increment of the old one. If the number of bricks is the same, then your operation of adding a brick was completely rolled back by the transaction manager, so that you need to repeat it from scratch. Otherwise, your operation was successfully completed - update your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration] respectively. == Brick removal was interrupted == Check your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]. Register the new set of bricks (that is, the set of bricks without the brick you wanted to remove). Try to mount the volume. 
In case of error, also register the brick you wanted to remove and try to mount again. Check the status of your LV:

 # volume.reiser4 /mnt

If the volume is unbalanced, then complete balancing manually by running

 # volume.reiser4 -b /mnt

Otherwise, check the total number of bricks in your LV. If it is the same as before the removal, then your removal operation was completely rolled back by the transaction manager, and you will need to repeat it from scratch.

Comment. After successful completion of balancing the brick will be automatically removed from the volume and unregistered. Verify this by checking the status of your LV and the list of registered bricks:

 # volume.reiser4 /mnt
 # volume.reiser4 -l

Upon successful completion update your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration] accordingly.

== Another volume operation was interrupted ==

Using the volume configuration, register the new set of bricks and try to mount the volume. The mount should be successful.
Check the status of your LV:

 # volume.reiser4 /mnt

If the volume is unbalanced, then complete balancing manually by running

 # volume.reiser4 -b /mnt

= LV monitoring =

Common info about the LV mounted at /mnt:

 # volume.reiser4 /mnt

 ID:            Volume UUID
 volume:        ID of the plugin managing the volume
 distribution:  ID of the distribution plugin
 stripe:        Stripe size in bytes
 segments:      Number of hash space segments (for distribution)
 bricks total:  Total number of bricks in the volume
 bricks in DSA: Number of bricks participating in data distribution
 balanced:      Balanced status of the volume

Info about its brick of index J:

 # volume.reiser4 -p J /mnt

 internal ID:   Brick's "internal ID" and its status in the volume
 external ID:   Brick's UUID
 device name:   Name of the block device associated with the brick
 block count:   Size of the block device in blocks
 blocks used:   Total number of occupied blocks on the device
 system blocks: Minimal possible number of busy blocks on that device
 data capacity: Abstract capacity of the brick
 space usage:   Portion of occupied blocks on the device
 in DSA:        Participation in regular data distribution
 is proxy:      Participation in data tiering (Burst Buffers, etc.)

Comment. When retrieving brick info, make sure that no volume operations on that volume are in progress. Otherwise the command above will return an error (EBUSY).

WARNING. Brick info obtained this way is not necessarily the most recent. To get actual info, run sync(1) and make sure that no regular file operations are in progress.

= Checking free space =

To check the number of available free blocks on a volume mounted at /mnt, make sure that no regular file operations or volume operations are in progress on that volume, then run

 # sync
 # df --block-size=4K /mnt

To check the number of free blocks on the brick of index J, run

 # volume.reiser4 -p J /mnt

and calculate the difference between "block count" and "blocks used".

Comment. Not all free blocks on a brick/volume are available for use.
Number of available free blocks is always ~95% of total number of free blocks (Reiser4 reserves 5% to make sure that regular file truncate operations won't fail). NOTE: volume.reiser4 shows total number of free blocks, whereas df(1) shows number of available free blocks. "Space usage" statistics shows a portion of busy blocks on individual brick. For the reasons explained above "space usage" on any brick can not be more than 0.95 = Checking quality of data distribution = Quality of data distribution is a measure of deviation of the real data space usage from the ideal one defined by volume partitioning. The smaller the deviation, the better the distribution quality. Checking quality of distribution makes sense only in the case when your volume partitioning is space-based, or if it coincides with the space-based one. If your partitioning is throughput-based, and it doesn't coincide with the space-based one, then quality of actual data distribution can be rather bad, as in this case the file system is worried for low-performance devices to not become a bottleneck, and effective space usage in this case is not a high priority. Checking quality of data distribution is based on the free blocks accounting, provided by the file system. Note that file system doesn't count busy data and meta-data blocks separately, so you are not able to find real data space usage, and hence to check quality of distribution in the case when meta-data brick contains data blocks. To check quality of distribution * make sure that meta-data brick doesn't contain data blocks; * make sure that no regular file and volume operations are currently in progress; * find "blocks used", "system blocks" and "data capacity" statistics for each data brick: # sync # volume.reiser4 -p 1 /mnt ... # volume.reiser4 -p N /mnt * find real data space usage on each brick; * calculate partitioning and ideal data space usage on each data brick; * find deviation of (4) from (5). Example. 
Let' build a LV of 3 bricks (one 10G meta-data brick sdb1, and two data bricks: sdc1 (10G), sdd1(5G)) with space-based partitioning: # VOL_ID=`uuid -v4` # echo "Using uuid $VOL_ID" # mkfs.reiser4 -U $VOL_ID -y -t 256K /dev/vdb1 # mkfs.reiser4 -U $VOL_ID -y -a -t 256K /dev/vdc1 # mkfs.reiser4 -U $VOL_ID -y -a -t 256K /dev/vdd1 # mount /dev/vdb1 /mnt Fill the meta-data brick with data: # dd if=/dev/zero of=/mnt/myfile bs=256K No space left on device... Add data-bricks /dev/sdc1 and dev/sdd1 to the volume: # volume.reiser4 -a /dev/vdc1 /mnt # volume.reiser4 -a /dev/vdd1 /mnt Move all data blocks to the newly added bricks: # volume.reiser4 -r /dev/vdb1 /mnt # sync Now meta-data brick doesn't contain data blocks (only meta-data ones), so that we can calculate quality of data distribution # volume.reiser4 /mnt -p0 blocks used: 503 # volume.reiser4 /mnt -p1 blocks used: 1657203 system blocks: 115 data capacity: 2621069 # volume.reiser4 /mnt -p2 blocks used: 833001 system blocks: 73 data capacity: 1310391 Basing on the statistics above calculate quality of distribution. Total data capacity of the volume: C = 2621069 + 1310391 = 3931460 Relative capacities of data bricks: C1 = 2621069 /(2621069 + 1310391) = 0.6667 C2 = 1310464 /(2621069 + 1310391) = 0.3333 Real space usage on data bricks (blocks used - system blocks): R1 = 1657203 - 115 = 1657088 R2 = 833001 - 73 = 832928 Space usage on the volume: R = R1 + R2 = 1657088 + 832928 = 2490016 Ideal data space usage on data bricks: I1 = C1 * R = 0.6667 * 2490016 = 1660094 I2 = C2 * R = 0.3333 * 2490016 = 829922 Deviation: D = (R1, R2) - (I1, I2) = (3006, -3006) Relative deviation: D/R = (-0.0012, 0.0012) Quality of distribution: Q = 1 - max(|D1|, |D1|) = 1 - 0.0012 = 0.9988 Comment. For any specified number of bricks N and quality of distribution Q it is possible to find a configuration of a logical volume composed of N bricks, so that quality of distribution on that volume will be better than Q. Comment. 
Quality of distribution Q doesn't depend on the number of bricks in the logical volume. This is a theorem, which can be strictly proven. = FAQ = Q. What happens if I lose a device-component (due to a breakdown, etc) of my logical volume? A. Bodies of some your regular files will become "punched" in random places. Portion of such files depends on the relative capacity of the lost brick, on the number of bricks in the logical volume, and on other factors. Fsck will be able to detect and remove such files with corrupted bodies. Nevertheless, we recommend to consider mirroring your bricks (e.g. by software, or hardware RAID-1) to avoid such highly unpleasant situations. [[category:Reiser4]] 2c75776cdcdcabb576d5167b2f5bf200eea481c1 4388 4387 2020-08-16T21:52:09Z Edward 4 /* Creating a logical volume */ Logical volume (LV) can be composed of any number of block devices, different in physical and geometric parameters. However the optimal configuration (true parallelism) imposes some restrictions and dependencies on the size of such devices. WARNING: The stuff is not stable. Don't put important data to logical volumes managed by software of release number 5.X.Y. Also don't mount your old partitions in kernels with Reiser4 of SFRN 5.X.Y before its stabilization IMPORTANT: Currently there is no tools to manage Reiser5 logical volumes off-line, so it it strongly recommended to save/update configurations of your LV in a file, which doesn't belong to that volume. = Basic definitions. Volume configuration. Brick's capacity. Partitioning. Fair distribution. Balancing = Basic configuration of a logical volume is the following information: 1) Volume UUID; 2) Number of bricks in the volume; 3) List of brick names or UUIDs in the volume; 4) UUID or name of the brick to be added/removed (if any). That brick is not counted in (2) and (3). The item #4 is to handle incomplete operations interrupted by various reasons (system crash, hard reset, etc) when bringing logical volumes on-line. 
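The four items above fit naturally into a small machine-readable record. A minimal sketch in Python; the JSON layout, field names and example UUID are illustrative, not a format prescribed by Reiser5:

```python
import json

# Items 1-4 of the volume configuration described above.
# Field names and the example UUID are illustrative only.
config = {
    "volume_uuid": "b87d2a81-1e6c-4d54-a0a0-0f9d4d9f2a11",   # item 1
    "brick_count": 2,                                         # item 2
    "bricks": ["/dev/vdb1", "/dev/vdb2"],                     # item 3
    "pending_brick": None,          # item 4: brick being added/removed, if any
}

# Serialize it for storage OUTSIDE the volume itself.
text = json.dumps(config, indent=2)

# Sanity check on reload: item 2 must agree with item 3.
restored = json.loads(text)
assert restored["brick_count"] == len(restored["bricks"])
```

Before starting a volume operation, item 4 would be set to the affected brick; once the operation completes, items 2 and 3 are updated and item 4 is cleared.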
For each volume, its configuration should be stored somewhere (but not on that volume!) and properly updated before and after each volume operation performed on that volume. We make the user responsible for this. The volume configuration is needed to facilitate deploying the volume.

'''Abstract capacity''' (or simply capacity) of a brick is a positive integer. Capacity is a brick's property defined by the user. Don't confuse it with the size of the block device. Think of it as the brick's "weight" in some units. It is the user who decides which property of the brick to assign as its abstract capacity, and in which units. In particular, it can be the size of the block device in kilobytes, or its size in megabytes, or its throughput in MB/sec, or any other geometric or physical parameter of the device associated with the brick. It is important that the capacities of all bricks of the same logical volume are measured in the same units. Accordingly, it would be utterly pointless to assign different properties as abstract capacities for bricks of the same LV - for example, the size of the block device for one brick and disk bandwidth for another.

The capacity of each brick is initialized by the mkfs utility. By default it is calculated as the number of free blocks on the device at the very end of the formatting procedure. For the meta-data brick it is calculated as 70% of that amount. The capacity of any brick can be changed on-line by the user.

'''Capacity of a logical volume''' is defined as the sum of the capacities of its component bricks.

'''Relative capacity of a brick''' is the ratio of the brick's capacity to the volume's capacity. Relative capacity defines the portion of IO requests that will be issued against that brick. The array of relative capacities (C1, C2, ...) of all bricks is called the volume partitioning. Obviously, C1 + C2 + ... = 1.

'''(Real) data space usage''' on a brick is the number of data blocks stored on that brick.
'''Ideal (or expected) data space usage''' on a brick is T*C, where T is the total number of data blocks stored in the volume and C is the relative capacity of the brick.

It is recommended to compose volumes so that the space-based partitioning coincides with the throughput-based one - this is the optimal volume configuration, which provides true parallelism. If that is impossible for some reason, then choose a preferred partitioning method (space-based or throughput-based). Note that space-based partitioning saves volume space, whereas throughput-based partitioning saves volume throughput.

When performing regular file operations, Reiser5 distributes data stripes throughout the volume evenly and fairly. This means that the portion of IO requests issued against each brick is equal to its relative capacity, that is, to the portion of capacity that the brick contributes to the total volume capacity.

Most volume operations are accompanied by rebalancing, which preserves fairness of distribution. For example, adding a brick to a logical volume changes its partitioning and hence breaks fairness of the distribution, so some data stripes have to be moved to the new brick to make the distribution fair again. Likewise, you cannot simply remove a brick from a logical volume - all data stripes first have to be moved from that brick to the other bricks of the logical volume.

Every time the user performs a volume operation, Reiser5 marks the LV as "not balanced". After successful balancing, the status of the LV is changed back to "balanced". If the balancing procedure fails for some reason, it should be resumed manually (with the volume.reiser4 utility). It is allowed to perform regular file operations on an unbalanced LV. However, in this case:

a) we don't guarantee a good quality of data distribution on your LV;
b) you won't be able to perform volume operations on your LV except balancing - any other volume operation will return an error (EBUSY).

So don't forget to bring your LV to the balanced state as soon as possible!
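To make the definitions above concrete, here is a small Python sketch computing a volume partitioning and the ideal per-brick usage; the capacity values are invented for illustration:

```python
# Partitioning and ideal data space usage for three bricks whose abstract
# capacities (in arbitrary but uniform units) are invented for illustration.
capacities = [70_000, 100_000, 50_000]

volume_capacity = sum(capacities)                          # capacity of the LV
partitioning = [c / volume_capacity for c in capacities]   # (C1, C2, C3)
assert abs(sum(partitioning) - 1.0) < 1e-9                 # C1 + C2 + ... = 1

T = 1_000_000                          # total data blocks stored in the volume
ideal_usage = [T * c for c in partitioning]                # T*C per brick
print([round(u) for u in ideal_usage])        # -> [318182, 454545, 227273]
```

Each entry of `partitioning` is also the fraction of IO requests expected to hit the corresponding brick under fair distribution.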
= Prepare Software and Hardware =

Build, install and boot a kernel with Reiser4 of software framework release number 5.X.Y. Kernel patches can be found [https://sourceforge.net/projects/reiser4/files/v5-unstable/ here]. Note that the Linux kernel and GNU utilities still recognize the testing code as "Reiser4". Make sure the following message appears in the kernel logs:

 Loading Reiser4 (Software Framework Release: 5.X.Y)

Build and install the latest [https://sourceforge.net/projects/reiser4/files/reiser4-utils/libaal/ libaal]. Then download, build and install the latest version 2.A.B of the [https://sourceforge.net/projects/reiser4/files/v5-unstable/ Reiser4progs package]. Make sure that the utility for managing logical volumes is installed (as a part of the reiser4progs package) on your machine:

 # volume.reiser4 -?

= Creating a logical volume =

Start by choosing a unique ID (UUID) for your volume. By default it is generated by the mkfs utility. However, the user can generate it with a suitable tool (e.g. uuid(1)) and store it in an environment variable for convenience:

 # VOL_ID=`uuidgen`
 # echo "Using uuid $VOL_ID"

Choose a stripe size for your logical volume. For a good quality of distribution it is recommended that a stripe not exceed 1/10000 of the volume size. On the other hand, stripes that are too small will increase space consumption on your meta-data brick. In our example we choose a stripe size of 256K:

 # STRIPE=256K
 # echo "Using stripe size $STRIPE"

Start by creating the first brick of your volume - the meta-data brick - passing the volume ID and stripe size to the mkfs.reiser4 utility:

 # mkfs.reiser4 -U $VOL_ID -t $STRIPE /dev/vdb1

Currently only one meta-data brick per volume is supported, so the block device for the meta-data brick should not be too small. In most cases it will be enough if your meta-data brick is not smaller than 1/200 of the maximal volume size. For example, a 100G meta-data brick will be able to service a ~20T logical volume.
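The two sizing rules above (stripe at most ~1/10000 of the volume size, meta-data brick at least ~1/200 of the maximal volume size) can be sketched as small helpers; the function names are illustrative:

```python
# Helper sketches for the sizing guidance above; names are illustrative.
K, G, T = 1 << 10, 1 << 30, 1 << 40

def min_volume_for_stripe(stripe_bytes):
    """Smallest volume size for which the stripe still obeys the 1/10000 rule."""
    return stripe_bytes * 10_000

def min_metadata_brick(max_volume_bytes):
    """Meta-data brick should be at least 1/200 of the maximal volume size."""
    return max_volume_bytes // 200

print(min_volume_for_stripe(256 * K) // G)  # 256K stripes: volumes of a few GB and up
print(min_metadata_brick(20 * T) // G)      # ~20T volume: roughly a 100G meta-data brick
```

Both rules are rough guidance from the text, not hard limits enforced by the tools.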
Data and meta-data bricks don't differ from the standpoint of disk format, and there is no special option to tell the mkfs utility that we want to create a meta-data brick: the first brick in the volume automatically becomes the meta-data brick, and the other bricks are interpreted as data bricks.

Mount your initial logical volume consisting of one meta-data brick:

 # mount /dev/vdb1 /mnt

Find a record about your volume in the output of the following command:

 # volume.reiser4 -l

Create the configuration of your logical volume (its definition is above) and store it somewhere - but not on that volume! Your logical volume is now on-line and ready to use. You can perform regular file operations and volume operations (e.g. add a data brick to your LV).

= Adding a data brick to LV =

At any time you are able to add a data brick to your LV. You can do it in parallel with regular file operations executing on that volume. Make sure, however, that no other volume operation (e.g. removing a brick) is in progress on your volume, otherwise your operation will fail with EBUSY. Obviously, adding a brick will increase the capacity of your volume.

Choose a block device for the new data brick. Make sure that it is neither too large nor too small: the capacities of any two bricks of the same logical volume cannot differ by more than 2^19 (about half a million) times. E.g. your logical volume cannot contain both 1M and 2T bricks. Any attempt to add a brick of improper capacity will fail with an error.

Format the device with the same volume ID and stripe size as you used for the meta-data brick, but also specify the "-a" option (to not restrict data capacity):

 # mkfs.reiser4 -U $VOL_ID -t $STRIPE -a /dev/vdb2

Important: the data brick must be formatted with the same volume ID and stripe size as the meta-data brick of your logical volume. Otherwise, the operation of adding the data brick will fail.
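The capacity-ratio restriction above can be checked before formatting; a minimal sketch, where the 2^19 limit is the figure quoted in the text and the helper name is illustrative:

```python
# Capacity-ratio restriction quoted above: capacities of any two bricks of one
# volume must not differ by more than 2**19 times.
MAX_RATIO = 2 ** 19

def capacities_compatible(capacities):
    """True if the largest capacity is within MAX_RATIO times the smallest."""
    return max(capacities) <= min(capacities) * MAX_RATIO

M, T = 1 << 20, 1 << 40
print(capacities_compatible([M, 2 * T]))    # False: 1M vs 2T is a 2**21 ratio
print(capacities_compatible([M, 256 * M]))  # True
```

Here the capacities are taken to be device sizes in bytes, matching the 1M/2T example in the text.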
Update item #4 of your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration] with the UUID or name of the brick you want to add. To add the brick, simply pass its name as an argument to the option "-a" and specify your LV via its mount point:

 # volume.reiser4 -a /dev/vdb2 /mnt

The procedure of adding a brick automatically invokes rebalancing, which moves a portion of data stripes to the newly added brick (so that the resulting distribution is fair). The portion of data blocks moved during such rebalancing is equal to the relative capacity of the new brick, that is, to the portion of capacity that the new brick adds to the updated LV capacity. This important property defines the cost of the balancing procedure: if the portion of capacity added by a brick is small, then the number of stripes moved during balancing is also small.

Like other user-space utilities, the operation of adding a brick can return an error, even assuming that the brick you wanted to add is properly formatted. In that case check the status of your LV:

 # volume.reiser4 /mnt

If the volume is unbalanced, then simply complete balancing manually:

 # volume.reiser4 -b /mnt

Otherwise, check the number of bricks in your LV. Most likely it is the same as it was before the failed operation; in that case simply repeat the operation of adding a brick from scratch.

Upon successful completion, update your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]. That is, increment (#2), add info about the new brick to (#3) and remove the record at (#4).

= Removing a data brick from LV =

At any time you are able to remove a data brick from your LV. You can do it in parallel with regular file operations executing on that volume.
Make sure, however, that no other volume operation (e.g. adding a brick) is in progress on your volume, otherwise your operation will fail with EBUSY. Obviously, removing a brick will decrease the abstract capacity of your LV. Note that the other bricks must have enough space to store all data blocks of the brick you want to remove; otherwise the removal operation will return an error (ENOSPC).

Suppose you want to remove brick /dev/vdb2 from your LV mounted at /mnt. Update your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration] with the UUID and name of the brick you want to remove (item #4). To remove the brick, simply pass its name as an argument to the option "-r" and specify the logical volume by its mount point:

 # volume.reiser4 -r /dev/vdb2 /mnt

The procedure of brick removal automatically invokes rebalancing, which distributes the data of the brick to be removed among the other bricks, so that the resulting distribution is also fair. The portion of data stripes moved during such rebalancing is equal to the relative capacity of the brick to be removed (that is, to the portion of capacity that the brick added to the LV capacity).

It can happen that the command above completes with an error (like other user-space applications). In that case check the status of your LV:

 # volume.reiser4 /mnt

If the volume is not balanced, then simply complete balancing manually:

 # volume.reiser4 -b /mnt

Otherwise, check the number of bricks in your logical volume - it should be the same as before the failed operation. The error ENOSPC indicates that the free space on the other bricks is not enough to fit all the data of the brick you want to remove.
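A rough pre-flight check for the ENOSPC case above can be sketched as follows; the brick names, block counts and helper name are illustrative, and the 95% factor anticipates the 5% per-brick reserve described in the free-space section below:

```python
# Pre-flight check for brick removal: the remaining bricks must have enough
# usable free space for the victim's data, or the operation returns ENOSPC.
# Brick names and block counts are illustrative; the 95% cap reflects the
# 5% reserve that Reiser4 keeps on every brick.
def can_remove(bricks, victim):
    """bricks maps device name -> (blocks_used, block_count)."""
    data_to_move = bricks[victim][0]
    free_elsewhere = sum(total * 95 // 100 - used
                         for name, (used, total) in bricks.items()
                         if name != victim)
    return free_elsewhere >= data_to_move

bricks = {"/dev/vdb1": (400_000, 1_000_000),
          "/dev/vdb2": (800_000, 1_000_000),
          "/dev/vdb3": (300_000, 1_000_000)}
print(can_remove(bricks, "/dev/vdb2"))   # True: 1,200,000 usable blocks elsewhere
```

This slightly overestimates what must move, since "blocks used" on a data brick includes a few system blocks, so it errs on the safe side.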
On success, update your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]: remove the information about the brick /dev/vdb2 from #3 and #4. Check your kernel logs: they should contain a message that brick /dev/vdb2 has been unregistered. The device /dev/vdb2 no longer belongs to the logical volume, and you can reuse it for other purposes (re-format it, etc).

= Changing brick's capacity =

At any time (assuming no other volume operation is in progress) you can change the abstract capacity of any brick to a new non-zero value. Changing capacity always changes the volume partitioning and therefore breaks fairness of distribution, so Reiser5 automatically launches rebalancing to make sure that the resulting distribution is fair for the new set of capacities. In particular, increasing a brick's capacity will move some data from other bricks to the brick whose capacity was increased; decreasing a brick's capacity will move some data from the brick whose capacity was decreased to the other bricks.

To change the abstract capacity of a brick /dev/vdb1 to a new value (e.g. 200000), simply run

 # volume.reiser4 -z /dev/vdb1 -c 200000 /mnt

pronounced as "resize brick /dev/vdb1 to new capacity 200000 in the volume mounted at /mnt".

The operation of changing capacity can return an error. Most likely it is ENOSPC, a side effect of concurrent regular file writes. In that case check the status of your LV. If it is unbalanced, then consider removing some files from your LV and complete balancing by running

 # volume.reiser4 -b /mnt

Otherwise, repeat the operation from scratch.

Comment. Changing a brick's capacity to 0 is undefined and will return an error. Consider the brick removal operation instead.

= Operations with meta-data brick =

The meta-data brick can also contain data stripes and participate in data distribution like the data bricks.
Thus all the volume operations described above are also applicable to the meta-data brick. Note, however, that it is impossible to completely remove the meta-data brick from the logical volume for obvious reasons (meta-data needs to be stored somewhere), so the brick removal operation applied to the meta-data brick actually removes it from the Data Storage Array (DSA), not from the logical volume. The DSA is the subset of the LV consisting of the bricks participating in data distribution. Once you remove the meta-data brick from the DSA, that brick will be used only to store meta-data. The operation of adding a brick, applied to the meta-data brick, returns it to the DSA.

Important: Reiser5 doesn't count busy data and meta-data blocks separately. So, in contrast with data bricks (which contain only data), you are not able to find out the real space occupied by data blocks on the meta-data brick - Reiser5 knows only the total space occupied.

To check the status of the meta-data brick, simply run

 # volume.reiser4 /mnt

and compare the values of "bricks total" and "bricks in DSA". If they are equal, then the meta-data brick participates in data distribution. Otherwise, "bricks total" should be 1 more than "bricks in DSA", which indicates that the meta-data brick doesn't participate in data distribution (and therefore doesn't contain data blocks). Other cases are impossible: for data bricks, participation in the LV and in the DSA are always equivalent.

= Unmounting a logical volume =

To terminate a mount session, just issue the usual umount command with the mount point specified. Note that after unmounting the volume, all bricks by default remain registered in the system until system shutdown.
If you want to unregister a brick before system shutdown, simply issue the following command:

 # volume.reiser4 -u BRICK_NAME

= Deploying a logical volume after correct unmount =

Make sure (by checking your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]) that all bricks of the volume are registered in the system. To register a brick, issue the following command:

 # volume.reiser4 -g BRICK_NAME

The list of all volumes and bricks registered in the system can be found in the output of the following command:

 # volume.reiser4 -l

Issue the usual mount(8) command against one of the bricks of your volume. It is recommended to issue it against the meta-data brick.

NOTE: Reiser5 will refuse to mount a logical volume when a wrong (incomplete or redundant) set of bricks is registered in the system. A redundant set of bricks appears, for example, when you mistakenly register a brick that was earlier removed from the logical volume.

= Deploying a logical volume after correct shutdown =

First of all, check the [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing configuration] of your volume and make sure that all its bricks (data and meta-data) are registered in the system. The list of registered bricks can be printed by

 # volume.reiser4 -l

Also make sure that the set of bricks registered for the volume doesn't contain bricks not mentioned in the [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration].

Important: Reiser5 will refuse to mount a logical volume when a wrong (incomplete or redundant) set of bricks is registered in the system.
A redundant set of bricks appears, for example, when you mistakenly register a brick that was removed from the logical volume. For this reason we strongly recommend that the user keep track of his LV: store its [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing configuration] somewhere - but not on that volume! And don't forget to update that configuration after '''every''' volume operation. If you lost the configuration of your LV and don't remember it (which is most likely for large volumes), it will be rather painful to restore: currently there are no tools to manage logical volumes off-line, so users are expected to do this on their own. It is not at all difficult.

To register a brick in the system, use the following command:

 # volume.reiser4 -g BRICK_NAME

To print the list of all registered bricks, use

 # volume.reiser4 -l

Now mount your LV by simply issuing a mount(8) command against one of the bricks of your LV. We recommend issuing it against the meta-data brick.

Comment. Reiser5 always tries to register the brick which is passed to the mount command as an argument, so it is not necessary to pre-register the brick you want to issue the mount command against.

= Deploying a logical volume after hard reset or system crash =

If no volume operations were interrupted by the hard reset or system crash, then just follow the instructions in this [https://reiser4.wiki.kernel.org/index.php?title=Logical_Volumes_Administration#Deploying_a_logical_volume_after_correct_shutdown section]. In Reiser5 only a restricted number of bricks participate in each transaction; the maximal number of such bricks can be specified by the user. At mount time a transaction replay procedure will be launched on each such brick independently, in parallel.
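The mount-time check described above can be mirrored in user space by comparing the registered bricks against the saved configuration; a minimal sketch, with an illustrative function name:

```python
# Mirror of the mount-time check described above: the registered set of bricks
# must match the saved volume configuration exactly. Function name illustrative.
def check_registration(configured, registered):
    configured, registered = set(configured), set(registered)
    missing = configured - registered      # incomplete set: mount will fail
    redundant = registered - configured    # redundant set: mount will fail
    if missing:
        return "incomplete: register " + ", ".join(sorted(missing))
    if redundant:
        return "redundant: unregister " + ", ".join(sorted(redundant))
    return "ok"

conf = ["/dev/vdb1", "/dev/vdb2", "/dev/vdb3"]
print(check_registration(conf, conf[:2]))              # missing /dev/vdb3
print(check_registration(conf, conf + ["/dev/vdd1"]))  # stale /dev/vdd1
print(check_registration(conf, list(reversed(conf))))  # ok
```

Running such a check before mount tells you which bricks to register (or unregister with "volume.reiser4 -u") instead of learning it from a failed mount.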
Depending on the kind of interrupted volume operation, perform one of the following actions:

== Adding a brick was interrupted ==

Check your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]. Register the old set of bricks (that is, the set of bricks that the volume had before the operation) and try to mount. In case of error, register also the brick you wanted to add and try to mount again. Check the status of your LV by running

 # volume.reiser4 /mnt

If the volume is unbalanced, then complete balancing manually by running

 # volume.reiser4 -b /mnt

Check "bricks total" of your LV in the output of

 # volume.reiser4 /mnt

Compare it with the old number of bricks in the [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]. The new value should be the old one incremented by 1. If the number of bricks is the same, then your operation of adding a brick was completely rolled back by the transaction manager, and you need to repeat it from scratch. Otherwise, your operation completed successfully - update your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration] accordingly.

== Brick removal was interrupted ==

Check your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]. Register the new set of bricks (that is, the set of bricks without the brick you wanted to remove). Try to mount the volume.
In case of error, register also the brick you wanted to remove and try to mount again. Check the status of your LV:

 # volume.reiser4 /mnt

If the volume is unbalanced, then complete balancing manually by running

 # volume.reiser4 -b /mnt

Otherwise, check the total number of bricks in your LV. If it is the same as before the removal, then your removal operation was completely rolled back by the transaction manager, and you will need to repeat it from scratch.

Comment. After successful completion of balancing, the brick will be automatically removed from the volume and unregistered. Make sure of it by checking the status of your LV and the list of registered bricks:

 # volume.reiser4 /mnt
 # volume.reiser4 -l

Upon successful completion, update your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration] accordingly.

== Another volume operation was interrupted ==

Using the volume configuration, register the new set of bricks and try to mount the volume. The mount should be successful.
Check the status of your LV:

 # volume.reiser4 /mnt

If the volume is unbalanced, then complete balancing manually by running

 # volume.reiser4 -b /mnt

= LV monitoring =

Common info about the LV mounted at /mnt:

 # volume.reiser4 /mnt

 ID:            volume UUID
 volume:        ID of the plugin managing the volume
 distribution:  ID of the distribution plugin
 stripe:        stripe size in bytes
 segments:      number of hash space segments (for distribution)
 bricks total:  total number of bricks in the volume
 bricks in DSA: number of bricks participating in data distribution
 balanced:      balanced status of the volume

Info about any of its bricks of index J:

 # volume.reiser4 -p J /mnt

 internal ID:   brick's "internal ID" and its status in the volume
 external ID:   brick's UUID
 device name:   name of the block device associated with the brick
 block count:   size of the block device in blocks
 blocks used:   total number of occupied blocks on the device
 system blocks: minimal possible number of busy blocks on that device
 data capacity: abstract capacity of the brick
 space usage:   portion of occupied blocks on the device
 in DSA:        participation in regular data distribution
 is proxy:      participation in data tiering (Burst Buffers, etc)

Comment. When retrieving a brick's info, make sure that no volume operations on that volume are in progress; otherwise the command above will return an error (EBUSY).

WARNING. Brick info obtained this way is not necessarily the most recent. To get up-to-date info, run sync(1) and make sure that no regular file operations are in progress.

= Checking free space =

To check the number of available free blocks on a volume mounted at /mnt, make sure that no regular file operations or volume operations are in progress on that volume, then run

 # sync
 # df --block-size=4K /mnt

To check the number of free blocks on the brick of index J, run

 # volume.reiser4 -p J /mnt

and calculate the difference between "block count" and "blocks used".

Comment. Not all free blocks on a brick/volume are available for use.
The number of available free blocks is always ~95% of the total number of free blocks (Reiser4 reserves 5% to make sure that regular file truncate operations won't fail). NOTE: volume.reiser4 shows the total number of free blocks, whereas df(1) shows the number of available free blocks. The "space usage" statistic shows the portion of busy blocks on an individual brick. For the reasons explained above, "space usage" on any brick can not exceed 0.95.

= Checking quality of data distribution =

Quality of data distribution is a measure of the deviation of the real data space usage from the ideal one defined by the volume partitioning. The smaller the deviation, the better the distribution quality. Checking the quality of distribution makes sense only when your volume partitioning is space-based, or coincides with the space-based one. If your partitioning is throughput-based and doesn't coincide with the space-based one, then the quality of the actual data distribution can be rather bad: in that case the file system takes care that low-performance devices don't become a bottleneck, and effective space usage is not a high priority.

Checking quality of data distribution is based on the free-block accounting provided by the file system. Note that the file system doesn't count busy data and meta-data blocks separately, so you are not able to find the real data space usage, and hence to check the quality of distribution, when the meta-data brick contains data blocks.

To check quality of distribution:

1) make sure that the meta-data brick doesn't contain data blocks;
2) make sure that no regular file or volume operations are currently in progress;
3) find the "blocks used", "system blocks" and "data capacity" statistics for each data brick:

 # sync
 # volume.reiser4 -p 1 /mnt
 ...
 # volume.reiser4 -p N /mnt

4) find the real data space usage on each brick;
5) calculate the partitioning and ideal data space usage on each data brick;
6) find the deviation of (4) from (5).

Example.
Let's build an LV of 3 bricks (one 10G meta-data brick /dev/vdb1, and two data bricks: /dev/vdc1 (10G) and /dev/vdd1 (5G)) with space-based partitioning:

 # VOL_ID=`uuid -v4`
 # echo "Using uuid $VOL_ID"
 # mkfs.reiser4 -U $VOL_ID -y -t 256K /dev/vdb1
 # mkfs.reiser4 -U $VOL_ID -y -a -t 256K /dev/vdc1
 # mkfs.reiser4 -U $VOL_ID -y -a -t 256K /dev/vdd1
 # mount /dev/vdb1 /mnt

Fill the meta-data brick with data:

 # dd if=/dev/zero of=/mnt/myfile bs=256K
 No space left on device...

Add the data bricks /dev/vdc1 and /dev/vdd1 to the volume:

 # volume.reiser4 -a /dev/vdc1 /mnt
 # volume.reiser4 -a /dev/vdd1 /mnt

Move all data blocks to the newly added bricks:

 # volume.reiser4 -r /dev/vdb1 /mnt
 # sync

Now the meta-data brick doesn't contain data blocks (only meta-data ones), so we can calculate the quality of data distribution:

 # volume.reiser4 /mnt -p0
 blocks used: 503
 # volume.reiser4 /mnt -p1
 blocks used: 1657203
 system blocks: 115
 data capacity: 2621069
 # volume.reiser4 /mnt -p2
 blocks used: 833001
 system blocks: 73
 data capacity: 1310391

Based on the statistics above, calculate the quality of distribution.

Total data capacity of the volume:
 C = 2621069 + 1310391 = 3931460
Relative capacities of the data bricks:
 C1 = 2621069 / 3931460 = 0.6667
 C2 = 1310391 / 3931460 = 0.3333
Real space usage on the data bricks (blocks used - system blocks):
 R1 = 1657203 - 115 = 1657088
 R2 = 833001 - 73 = 832928
Space usage on the volume:
 R = R1 + R2 = 1657088 + 832928 = 2490016
Ideal data space usage on the data bricks:
 I1 = C1 * R = 0.6667 * 2490016 = 1660094
 I2 = C2 * R = 0.3333 * 2490016 = 829922
Deviation:
 D = (R1 - I1, R2 - I2) = (-3006, 3006)
Relative deviation:
 D/R = (-0.0012, 0.0012)
Quality of distribution:
 Q = 1 - max(|D1|, |D2|)/R = 1 - 0.0012 = 0.9988

Comment. For any specified number of bricks N and quality of distribution Q, it is possible to find a configuration of a logical volume composed of N bricks such that the quality of distribution on that volume is better than Q.
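The arithmetic of the example above can be scripted directly from the volume.reiser4 statistics; a minimal Python sketch:

```python
# Quality of distribution, computed from the "volume.reiser4 -p" statistics
# shown in the example above (two data bricks).
used = [1657203, 833001]    # blocks used
sysb = [115, 73]            # system blocks
cap  = [2621069, 1310391]   # data capacity

real = [u - s for u, s in zip(used, sysb)]   # real data usage: R1, R2
rel_cap = [c / sum(cap) for c in cap]        # partitioning: C1, C2
R = sum(real)                                # data blocks on the volume
ideal = [c * R for c in rel_cap]             # ideal usage: I1, I2
dev = [r - i for r, i in zip(real, ideal)]   # deviation: D1, D2
Q = 1 - max(abs(d) for d in dev) / R
print(round(Q, 4))                           # -> 0.9988
```

Working in full precision rather than with the rounded 0.6667/0.3333 factors gives a slightly different deviation than the hand calculation, but the same quality figure to four digits.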
Logical volume (LV) can be composed of any number of block devices that differ in physical and geometric parameters. However, the optimal configuration (true parallelism) imposes some restrictions and dependencies on the sizes of such devices.

WARNING: This code is not yet stable. Don't put important data on logical volumes managed by software of release number 5.X.Y. Also, don't mount your old partitions in kernels with Reiser4 of SFRN 5.X.Y before its stabilization.

IMPORTANT: Currently there are no tools to manage Reiser5 logical volumes off-line, so it is strongly recommended to save/update the configuration of your LV in a file which doesn't belong to that volume.

= Basic definitions. Volume configuration. Brick's capacity. Partitioning. Fair distribution. Balancing =

The basic configuration of a logical volume is the following information:

1) Volume UUID;
2) Number of bricks in the volume;
3) List of brick names or UUIDs in the volume;
4) UUID or name of the brick to be added/removed (if any). That brick is not counted in (2) and (3).
Item #4 exists to handle incomplete operations interrupted for various reasons (system crash, hard reset, etc.) when bringing logical volumes on-line.

For each volume, its configuration should be stored somewhere (but not on that volume!) and properly updated before and after each volume operation performed on that volume. We make the user responsible for this. The volume configuration is needed to facilitate deploying the volume.

'''Abstract capacity''' (or simply capacity) of a brick is a positive integer. Capacity is a brick's property defined by the user. Don't confuse it with the size of the block device. Think of it as the brick's "weight" in some units. It is the user who decides which property of the brick to assign as its abstract capacity, and in which units. In particular, it can be the size of the block device in kilobytes, or its size in megabytes, or its throughput in MB/s, or any other geometric or physical parameter of the device associated with the brick. It is important that the capacities of all bricks of the same logical volume are measured in the same units. Likewise, it would be pointless to assign different properties as abstract capacities for bricks of the same LV - for example, block device size for one brick and disk bandwidth for another.

The capacity of each brick is initialized by the mkfs utility. By default it is calculated as the number of free blocks on the device at the very end of the formatting procedure. For a meta-data brick it is calculated as 70% of that amount. The capacity of any brick can be changed on-line by the user.

'''Capacity of a logical volume''' is defined as the sum of the capacities of its component bricks.

'''Relative capacity of a brick''' is the ratio of the brick's capacity to the volume's capacity. Relative capacity defines the portion of IO-requests that will be issued against that brick. The array of relative capacities (C1, C2, ...) of all bricks is called the volume partitioning. Obviously, C1 + C2 + ... = 1.
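As a concrete illustration of these definitions, here is a small sketch (in Python; the brick names and capacity values are hypothetical) that computes a volume partitioning from user-assigned capacities:

```python
# Sketch: derive the volume partitioning (array of relative
# capacities) from user-assigned abstract capacities.
# Brick names and numbers below are hypothetical examples.

def partitioning(capacities):
    """Map each brick to its relative capacity C_i = cap_i / sum(caps).
    By definition the relative capacities sum to 1."""
    total = sum(capacities.values())
    return {brick: cap / total for brick, cap in capacities.items()}

# Two data bricks whose capacities are measured in the same units
# (here: blocks), as the text requires:
caps = {"/dev/vdc1": 2621069, "/dev/vdd1": 1310391}
part = partitioning(caps)
# part["/dev/vdc1"] is ~0.6667: about 2/3 of IO-requests go to vdc1.
```

The invariant C1 + C2 + ... = 1 holds by construction, up to floating-point rounding.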
'''(Real) data space usage''' on a brick is the number of data blocks stored on that brick.

'''Ideal (or expected) data space usage''' on a brick is T*C, where T is the total number of data blocks stored in the volume and C is the relative capacity of the brick.

It is recommended to compose volumes so that space-based partitioning coincides with throughput-based partitioning - this is the optimal volume configuration, which provides true parallelism. If that is impossible for some reason, choose a preferred partitioning method (space-based or throughput-based). Note that space-based partitioning saves volume space, whereas throughput-based partitioning saves volume throughput.

When performing regular file operations, Reiser5 distributes data stripes throughout the volume evenly and fairly. This means that the portion of IO-requests issued against each brick is equal to its relative capacity, that is, to the portion of capacity that the brick contributes to the total volume capacity.

Most volume operations are accompanied by rebalancing, which preserves fairness of distribution. For example, adding a brick to a logical volume changes its partitioning and hence breaks fairness of the distribution, so some data stripes have to be moved to the new brick to make the distribution fair again. Likewise, you cannot simply remove a brick from a logical volume - all data stripes must first be moved from that brick to the other bricks of the logical volume.

Every time the user performs a volume operation, Reiser5 marks the LV as "not balanced". After successful balancing the status of the LV is changed back to "balanced". If the balancing procedure fails for some reason, it should be resumed manually (with the volume.reiser4 utility). It is allowed to perform regular file operations on a not-balanced LV. However, in this case:

a) we don't guarantee a good quality of data distribution on your LV;
b) you won't be able to perform volume operations on your LV except balancing - any other volume operation will return an error (EBUSY).
So, don't forget to bring your LV to the balanced state as soon as possible!

= Prepare Software and Hardware =

Build, install and boot a kernel with Reiser4 of software framework release number 5.X.Y. Kernel patches can be found [https://sourceforge.net/projects/reiser4/files/v5-unstable/ here]. Note that the Linux kernel and GNU utilities still recognize the testing code as "Reiser4". Make sure there is the following message in the kernel logs:

"Loading Reiser4 (Software Framework Release: 5.X.Y)"

Build and install the latest [https://sourceforge.net/projects/reiser4/files/reiser4-utils/libaal/ libaal]. Download, build and install the latest version 2.A.B of the [https://sourceforge.net/projects/reiser4/files/v5-unstable/ Reiser4progs package]. Make sure that the utility for managing logical volumes is installed (as a part of the reiser4progs package) on your machine:

# volume.reiser4 -?

= Creating a logical volume =

Start by choosing a unique ID (uuid) for your volume. By default it is generated by the mkfs utility. However, the user can generate it with suitable tools (e.g. uuid(1)) and store it in an environment variable for convenience:

# VOL_ID=`uuid -v4`
# echo "Using uuid $VOL_ID"

Choose a stripe size for your logical volume. For a good quality of distribution it is recommended that the stripe size not exceed 1/10000 of the volume size. On the other hand, too small a stripe size will increase space consumption on your meta-data brick. In our example we choose a stripe size of 256K:

# STRIPE=256K
# echo "Using stripe size $STRIPE"

Create the first brick of your volume - the meta-data brick - passing the volume ID and stripe size to the mkfs.reiser4 utility:

# mkfs.reiser4 -U $VOL_ID -t $STRIPE /dev/vdb1

Currently only one meta-data brick per volume is supported, so it is recommended that the block device for the meta-data brick not be too small. In most cases it will be enough if your meta-data brick is not smaller than 1/200 of the maximal volume size.
For example, a 100G meta-data brick will be able to service a ~20T logical volume.

Data and meta-data bricks don't differ from the standpoint of disk format, and there is no special option to tell the mkfs utility that we want to create a meta-data brick: the first brick in the volume automatically becomes the meta-data brick, and the other bricks are interpreted as data bricks.

Mount your initial logical volume consisting of one meta-data brick:

# mount /dev/vdb1 /mnt

Find the record about your volume in the output of the following command:

# volume.reiser4 -l

Create the configuration of your logical volume (its definition is above) and store it somewhere - but not on that volume! Your logical volume is now on-line and ready to use. You can perform regular file operations and volume operations (e.g. add a data brick to your LV).

= Adding a data brick to LV =

At any time you are able to add a data brick to your LV. You can do it in parallel with regular file operations executing on this volume. Make sure, however, that no other volume operation (e.g. removing a brick) is in progress on your volume, otherwise your operation will fail with EBUSY. Obviously, adding a brick will increase the capacity of your volume.

Choose a block device for the new data brick. Make sure that it is not too large or too small: the capacities of any two bricks of the same logical volume cannot differ by more than 2^19 (~500,000) times. E.g. your logical volume cannot contain both a 1M and a 2T brick. Any attempt to add a brick of improper capacity will fail with an error.

Format the device with the same volume ID and stripe size as you used for the meta-data brick, but also specify the "-a" option (to not restrict data capacity):

# mkfs.reiser4 -U $VOL_ID -t $STRIPE -a /dev/vdb2

Important: the data brick must be formatted with the same volume ID and stripe size as the meta-data brick of your logical volume. Otherwise, the operation of adding the data brick will fail.
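The capacity-ratio restriction above can be checked in advance. A sketch (the helper name and example values are mine; the real check happens inside Reiser5):

```python
# Sketch: pre-flight check of the documented restriction that the
# capacities of any two bricks of one LV may differ by at most
# 2^19 times. Helper name and sample values are illustrative only.
MAX_RATIO = 2 ** 19

def capacity_ratio_ok(existing, new_cap):
    """True if adding a brick of capacity new_cap keeps every
    pairwise capacity ratio within MAX_RATIO."""
    caps = list(existing) + [new_cap]
    return max(caps) <= MAX_RATIO * min(caps)

# A 2T brick next to a 1M brick differs by 2^21 times, so the text's
# example is rejected:
# capacity_ratio_ok([2 * 2**40], 2**20) -> False
```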
Update item #4 of your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration] with the UUID or name of the brick you want to add.

To add the brick, simply pass its name as an argument to the option "-a" and specify your LV via its mount point:

# volume.reiser4 -a /dev/vdb2 /mnt

The procedure of adding a brick automatically invokes rebalancing, which moves a portion of data stripes to the newly added brick (so that the resulting distribution is fair). The portion of data blocks moved during such rebalancing is equal to the relative capacity of the new brick, that is, to the portion of capacity that the new brick adds to the updated LV's capacity. This important property defines the cost of the balancing procedure: if the portion of capacity added by a brick is small, then the number of stripes moved during balancing is also small.

Like other user-space utilities, the operation of adding a brick can return an error, even assuming that the brick you wanted to add is properly formatted. In this case check the status of your LV:

# volume.reiser4 /mnt

If the volume is unbalanced, then simply complete balancing manually:

# volume.reiser4 -b /mnt

Otherwise, check the number of bricks in your LV. Most likely it is the same as it was before the failed operation; in this case simply repeat the operation of adding a brick from scratch.

Upon successful completion update your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]: increment (#2), add info about the new brick to (#3), and remove the record at (#4).

= Removing a data brick from LV =

At any time you are able to remove a data brick from your LV. You can do it in parallel with regular file operations executing on this volume.
Make sure, however, that no other volume operation (e.g. adding a brick) is in progress on your volume, otherwise your operation will fail with EBUSY. Obviously, removing a brick will decrease the abstract capacity of your LV. Note that the other bricks must have enough space to store all data blocks of the brick you want to remove, otherwise the removal operation will return an error (ENOSPC).

Suppose you want to remove brick /dev/vdb2 from your LV mounted at /mnt. Update your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration] with the UUID and name of the brick you want to remove (item #4).

To remove the brick, simply pass its name as an argument to option "-r" and specify the logical volume by its mount point:

# volume.reiser4 -r /dev/vdb2 /mnt

The procedure of brick removal automatically invokes rebalancing, which distributes the data of the brick to be removed among the other bricks, so that the resulting distribution is also fair. The portion of data stripes moved during such rebalancing is equal to the relative capacity of the brick to be removed (that is, to the portion of capacity that the brick added to the LV's capacity).

It can happen that the command above completes with an error (like other user-space applications). In this case check the status of your LV:

# volume.reiser4 /mnt

If the volume is not balanced, then simply complete balancing manually:

# volume.reiser4 -b /mnt

Otherwise, check the total number of bricks in your logical volume - it should be the same as before the failed operation. The error ENOSPC indicates that the free space on the other bricks is not enough to fit all the data of the brick you want to remove.
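Whether a removal is likely to fail with ENOSPC can be estimated beforehand from the per-brick statistics ("block count" and "blocks used", as reported by volume.reiser4 -p). A sketch with hypothetical numbers; the 0.95 factor models the ~5% reserve described in the "Checking free space" section:

```python
# Sketch: estimate whether the remaining bricks can absorb the data
# of a brick that is about to be removed. All numbers are
# hypothetical; real values come from 'volume.reiser4 -p J /mnt'.
RESERVE = 0.95  # only ~95% of free blocks are available for use

def removal_fits(bricks, victim):
    """bricks: name -> (block_count, blocks_used).
    True if available space on the other bricks covers the victim's
    used blocks (pessimistic: 'blocks used' also counts the victim's
    own system blocks, which don't need to move)."""
    needed = bricks[victim][1]
    available = sum(int(RESERVE * (count - used))
                    for name, (count, used) in bricks.items()
                    if name != victim)
    return available >= needed

stats = {"/dev/vdc1": (2621440, 1657203),
         "/dev/vdd1": (1310720, 833001)}
# Removing vdd1 needs 833001 blocks; vdc1 has ~916025 available,
# so the removal would fit:
# removal_fits(stats, "/dev/vdd1") -> True
```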
On success, update your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]: remove the information about the brick /dev/vdb2 at #3 and #4. Check your kernel logs: they should contain a message that brick /dev/vdb2 has been unregistered. Now device /dev/vdb2 doesn't belong to the logical volume any more, and you can reuse it for other purposes (re-format, etc.).

= Changing brick's capacity =

At any time (assuming that no other volume operation is in progress) you can change the abstract capacity of any brick to some new non-zero value. Changing capacity always changes the volume partitioning and therefore breaks fairness of distribution, so Reiser5 automatically launches rebalancing to make sure that the resulting distribution is fair for the new set of capacities. In particular, increasing a brick's capacity will move some data from other bricks to the brick whose capacity was increased. Decreasing a brick's capacity will move some data from the brick whose capacity was decreased to other bricks.

To change the abstract capacity of a brick /dev/vdb1 to a new value (e.g. 200000), simply run

# volume.reiser4 -z /dev/vdb1 -c 200000 /mnt

Read this as "resize brick /dev/vdb1 to new capacity 200000 in the volume mounted at /mnt".

The operation of changing capacity can return an error. Most likely it is ENOSPC, which is a side effect of concurrent regular file writes. In this case check the status of your LV. If it is unbalanced, then consider removing some files from your LV and complete balancing by running

# volume.reiser4 -b /mnt

Otherwise, repeat the operation from scratch.

Comment. Changing a brick's capacity to 0 is undefined and will return an error. Consider the brick removal operation instead.

= Operations with meta-data brick =

The meta-data brick can also contain data stripes and participate in data distribution like the other data bricks.
Thus all the volume operations described above are also applicable to the meta-data brick. Note, however, that it is impossible to completely remove the meta-data brick from the logical volume for obvious reasons (meta-data needs to be stored somewhere), so the brick removal operation applied to the meta-data brick actually removes it from the Data Storage Array (DSA), not from the logical volume. The DSA is the subset of the LV consisting of the bricks participating in data distribution. Once you remove the meta-data brick from the DSA, that brick will be used only to store meta-data. The operation of adding a brick, applied to a meta-data brick, returns it back to the DSA.

Important: Reiser5 doesn't count busy data and meta-data blocks separately. So, in contrast with data bricks (which contain only data), you are not able to find out the real space occupied by data blocks on the meta-data brick - Reiser5 knows only the total space occupied.

To check the status of the meta-data brick simply run

# volume.reiser4 /mnt

and compare the values of "bricks total" and "bricks in DSA". If they are equal, then the meta-data brick participates in data distribution. Otherwise, "bricks total" will be 1 more than "bricks in DSA", which indicates that the meta-data brick doesn't participate in data distribution (and therefore doesn't contain data blocks). Note that other cases are impossible: for data bricks, participation in the LV and in the DSA are always equivalent.

= Unmounting a logical volume =

To terminate a mount session just issue the usual umount command with the mount point specified. Note that after unmounting the volume all bricks by default remain registered in the system until system shutdown.
If you want to unregister a brick before system shutdown, then simply issue the following command:

# volume.reiser4 -u BRICK_NAME

= Deploying a logical volume after correct unmount =

Make sure (by checking your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]) that all bricks of the volume are registered in the system. To register a brick, issue the following command:

# volume.reiser4 -g BRICK_NAME

The list of all volumes and bricks registered in the system can be found in the output of the following command:

# volume.reiser4 -l

Issue the usual mount(8) command against one of the bricks of your volume. It is recommended to issue it against the meta-data brick.

NOTE: Reiser5 will refuse to mount a logical volume in the case when a wrong (incomplete or redundant) set of bricks is registered in the system. A redundant set of bricks appears, for example, when you mistakenly register a brick that was earlier removed from the logical volume.

= Deploying a logical volume after correct shutdown =

First of all, check the [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing configuration] of your volume and make sure that all its bricks (data and meta-data ones) are registered in the system. The list of registered bricks can be printed by

# volume.reiser4 -l

Also make sure that the set of bricks registered for the volume doesn't contain bricks not mentioned in the [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration].

Important: Reiser5 will refuse to mount a logical volume in the case when a wrong (incomplete or redundant) set of bricks is registered in the system.
A redundant set of bricks appears, for example, when you mistakenly register a brick that was removed from the logical volume. For this reason we strongly recommend that the user keep track of his LV - store its [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing configuration] somewhere, but not on this volume! And don't forget to update that configuration after '''every''' volume operation. If you have lost the configuration of your LV and don't remember it (which is most likely for large volumes), then it will be rather painful to restore: currently there are no tools to manage logical volumes off-line. So users are prompted to do this bookkeeping on their own. It is not at all difficult.

To register a brick in the system use the following command:

# volume.reiser4 -g BRICK_NAME

To print a list of all registered bricks use

# volume.reiser4 -l

Now mount your LV, simply issuing a mount(8) command against one of the bricks of your LV. We recommend issuing it against the meta-data brick.

Comment. Reiser5 always tries to register the brick which is passed to the mount command as an argument, so it is not necessary to preregister the brick you want to issue the mount command against.

= Deploying a logical volume after hard reset or system crash =

If no volume operations were interrupted by the hard reset or system crash, then just follow the instructions in this [https://reiser4.wiki.kernel.org/index.php?title=Logical_Volumes_Administration#Deploying_a_logical_volume_after_correct_shutdown section]. In Reiser5 only a restricted number of bricks participate in every transaction. The maximal number of such bricks can be specified by the user. At mount time a transaction replay procedure will be launched on each such brick independently, in parallel.
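As noted above, keeping the volume configuration up to date is the user's job, and it is easy to script. A minimal sketch (the JSON layout, file path and function names are my own invention, not a reiser4progs format):

```python
import json

# Sketch of user-side bookkeeping for a volume configuration
# (items 1-4 of the "Basic definitions" section). The JSON layout
# is an arbitrary choice of this example, not a reiser4progs format.

def make_config(vol_uuid, bricks, pending=None):
    """Build the record that must be stored OUTSIDE the volume."""
    return {
        "uuid": vol_uuid,            # 1) volume UUID
        "brick_count": len(bricks),  # 2) number of bricks
        "bricks": list(bricks),      # 3) brick names or UUIDs
        "pending": pending,          # 4) brick being added/removed
    }

def save_config(path, cfg):
    with open(path, "w") as f:
        json.dump(cfg, f, indent=2)

# Before adding /dev/vdd1, record it as pending (item #4); clear
# "pending" and extend "bricks" once the operation and balancing
# have completed:
# save_config("/root/lv.json",
#             make_config(VOL_ID, ["/dev/vdb1", "/dev/vdc1"],
#                         pending="/dev/vdd1"))
```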
Depending on the kind of interrupted volume operation, perform one of the following actions:

== Adding a brick was interrupted ==

Check your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]. Register the old set of bricks (that is, the set of bricks that the volume had before the operation was applied) and try to mount. In the case of an error, register also the brick you wanted to add and try to mount again.

Check the status of your LV by running

# volume.reiser4 /mnt

If the volume is unbalanced, then complete balancing manually by running

# volume.reiser4 -b /mnt

Check "bricks total" of your LV in the output of

# volume.reiser4 /mnt

Compare it with the old number of bricks in the [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]. The new value should be one more than the old one. If the number of bricks is the same, then your operation of adding a brick was completely rolled back by the transaction manager, so you need to repeat it from scratch. Otherwise, your operation was successfully completed - update your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration] accordingly.

== Brick removal was interrupted ==

Check your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]. Register the new set of bricks (that is, the set of bricks without the brick you wanted to remove). Try to mount the volume.
In the case of an error, register also the brick you wanted to remove and try to mount again. Check the status of your LV:

# volume.reiser4 /mnt

If the volume is unbalanced, then complete balancing manually by running

# volume.reiser4 -b /mnt

Otherwise, check the total number of bricks in your LV. If it is the same as before the removal, then your removal operation was completely rolled back by the transaction manager, so you will need to repeat it from scratch.

Comment. After successful completion of balancing the brick will be automatically removed from the volume and unregistered. Make sure of it by checking the status of your LV and the list of registered bricks:

# volume.reiser4 /mnt
# volume.reiser4 -l

Upon successful completion update your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration] accordingly.

== Another volume operation was interrupted ==

Using the volume configuration, register the new set of bricks and try to mount the volume. The mount should be successful.
Check the status of your LV:

# volume.reiser4 /mnt

If the volume is unbalanced, then complete balancing manually by running

# volume.reiser4 -b /mnt

= LV monitoring =

Common info about the LV mounted at /mnt:

# volume.reiser4 /mnt

ID: volume UUID
volume: ID of the plugin managing the volume
distribution: ID of the distribution plugin
stripe: stripe size in bytes
segments: number of hash space segments (for distribution)
bricks total: total number of bricks in the volume
bricks in DSA: number of bricks participating in data distribution
balanced: balanced status of the volume

Info about its brick of index J:

# volume.reiser4 -p J /mnt

internal ID: brick's "internal ID" and its status in the volume
external ID: brick's UUID
device name: name of the block device associated with the brick
block count: size of the block device in blocks
blocks used: total number of occupied blocks on the device
system blocks: minimal possible number of busy blocks on that device
data capacity: abstract capacity of the brick
space usage: portion of occupied blocks on the device
in DSA: participation in regular data distribution
is proxy: participation in data tiering (Burst Buffers, etc.)

Comment. When retrieving a brick's info, make sure that no volume operations on that volume are in progress. Otherwise the command above will return an error (EBUSY).

WARNING. Brick info provided this way is not necessarily the most recent. To get up-to-date info, run sync(1) and make sure that no regular file operations are in progress.

= Checking free space =

To check the number of available free blocks on a volume mounted at /mnt, make sure that no regular file operations or volume operations are in progress on that volume, then run

# sync
# df --block-size=4K /mnt

To check the number of free blocks on the brick of index J, run

# volume.reiser4 -p J /mnt

then calculate the difference between "block count" and "blocks used".

Comment. Not all free blocks on a brick/volume are available for use.
The number of available free blocks is always ~95% of the total number of free blocks (Reiser4 reserves 5% to make sure that regular file truncate operations won't fail).

NOTE: volume.reiser4 shows the total number of free blocks, whereas df(1) shows the number of available free blocks. The "space usage" statistic shows the portion of busy blocks on an individual brick. For the reasons explained above, "space usage" on any brick cannot be more than 0.95.

= Checking quality of data distribution =

Quality of data distribution is a measure of the deviation of the real data space usage from the ideal one defined by the volume partitioning. The smaller the deviation, the better the distribution quality. Checking the quality of distribution makes sense only in the case when your volume partitioning is space-based, or when it coincides with the space-based one. If your partitioning is throughput-based and doesn't coincide with the space-based one, then the quality of the actual data distribution can be rather bad: in this case the file system tries to prevent low-performance devices from becoming a bottleneck, and effective space usage is not a high priority.

Checking the quality of data distribution is based on the free blocks accounting provided by the file system. Note that the file system doesn't count busy data and meta-data blocks separately, so you are not able to find the real data space usage, and hence to check the quality of distribution, in the case when the meta-data brick contains data blocks.

To check the quality of distribution:

* make sure that the meta-data brick doesn't contain data blocks;
* make sure that no regular file and volume operations are currently in progress;
* find the "blocks used", "system blocks" and "data capacity" statistics for each data brick:

# sync
# volume.reiser4 -p 1 /mnt
...
# volume.reiser4 -p N /mnt

* find the real data space usage on each brick;
* calculate the partitioning and ideal data space usage on each data brick;
* find the deviation of (4) from (5).

Example.
Let's build an LV of 3 bricks (one 10G meta-data brick vdb1, and two data bricks: vdc1 (10G) and vdd1 (5G)) with space-based partitioning:

# VOL_ID=`uuid -v4`
# echo "Using uuid $VOL_ID"
# mkfs.reiser4 -U $VOL_ID -y -t 256K /dev/vdb1
# mkfs.reiser4 -U $VOL_ID -y -a -t 256K /dev/vdc1
# mkfs.reiser4 -U $VOL_ID -y -a -t 256K /dev/vdd1
# mount /dev/vdb1 /mnt

Fill the meta-data brick with data:

# dd if=/dev/zero of=/mnt/myfile bs=256K
No space left on device...

Add data bricks /dev/vdc1 and /dev/vdd1 to the volume:

# volume.reiser4 -a /dev/vdc1 /mnt
# volume.reiser4 -a /dev/vdd1 /mnt

Move all data blocks to the newly added bricks:

# volume.reiser4 -r /dev/vdb1 /mnt
# sync

Now the meta-data brick doesn't contain data blocks (only meta-data ones), so we can calculate the quality of data distribution:

# volume.reiser4 /mnt -p0
blocks used: 503
# volume.reiser4 /mnt -p1
blocks used: 1657203
system blocks: 115
data capacity: 2621069
# volume.reiser4 /mnt -p2
blocks used: 833001
system blocks: 73
data capacity: 1310391

Based on the statistics above, calculate the quality of distribution.

Total data capacity of the volume:
C = 2621069 + 1310391 = 3931460

Relative capacities of data bricks:
C1 = 2621069 / (2621069 + 1310391) = 0.6667
C2 = 1310391 / (2621069 + 1310391) = 0.3333

Real space usage on data bricks (blocks used - system blocks):
R1 = 1657203 - 115 = 1657088
R2 = 833001 - 73 = 832928

Space usage on the volume:
R = R1 + R2 = 1657088 + 832928 = 2490016

Ideal data space usage on data bricks:
I1 = C1 * R = 0.6667 * 2490016 = 1660094
I2 = C2 * R = 0.3333 * 2490016 = 829922

Deviation:
D = (R1, R2) - (I1, I2) = (-3006, 3006)

Relative deviation:
D/R = (-0.0012, 0.0012)

Quality of distribution:
Q = 1 - max(|D1/R|, |D2/R|) = 1 - 0.0012 = 0.9988

Comment. For any specified number of bricks N and quality of distribution Q it is possible to find a configuration of a logical volume composed of N bricks such that the quality of distribution on that volume is better than Q.

Comment.
Quality of distribution Q doesn't depend on the number of bricks in the logical volume. This is a theorem, which can be strictly proven.

= FAQ =

Q. What happens if I lose a device-component (due to a breakdown, etc.) of my logical volume?

A. The bodies of some of your regular files will become "punched" in random places. The portion of such files depends on the relative capacity of the lost brick, on the number of bricks in the logical volume, and on other factors. Fsck will be able to detect and remove files with corrupted bodies. Nevertheless, we recommend considering mirroring your bricks (e.g. by software or hardware RAID-1) to avoid such highly unpleasant situations.
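The arithmetic of the worked example in the "Checking quality of data distribution" section can be reproduced with a short script (the statistics are the ones quoted there):

```python
# Sketch: compute the quality of data distribution Q from per-brick
# statistics, following the worked example above. Input tuples are
# (blocks used, system blocks, data capacity) per data brick.

def distribution_quality(bricks):
    cap_total = sum(cap for _, _, cap in bricks)        # C
    real = [used - sys for used, sys, _ in bricks]      # R_i
    r_total = sum(real)                                 # R
    worst = 0.0
    for (_, _, cap), r_i in zip(bricks, real):
        ideal = (cap / cap_total) * r_total             # I_i = C_i * R
        worst = max(worst, abs(r_i - ideal) / r_total)  # |D_i| / R
    return 1.0 - worst

stats = [(1657203, 115, 2621069),   # data brick of index 1
         (833001,  73, 1310391)]    # data brick of index 2
# distribution_quality(stats) is ~0.9988, matching the example.
```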
The item #4 is to handle incomplete operations interrupted by various reasons (system crash, hard reset, etc) when bringing logical volumes on-line. For each volume its configuration should be stored somewhere (but not on that volume!) and properly updated before and after each volume operation performed on that volume. We make the user responsible for this. Volume configuration is needed to facilitate deploying a volume. '''Abstract capacity''' (or simply capacity) of a brick is a positive integer number. Capacity is a brick's property defined by user. Don't confuse it with the size of block device. Think of it as of brick's "weight" in some units. And this is the user, who decides, which property of the brick to assign as its abstract capacity and in which units. In particular, it can be size of the block device in kilobytes, or its size in megabytes, or its throughput in M/sec, or other geometric or physical parameter of the device, associated with the brick. It is important that capacities of all bricks of the same logical volume are measured in the same units. Also, it would be utterly pointless to assign different properties as abstract capacities for bricks of the same LV. For example, size of block device for one brick, and disk bandwidth for another one. Capacity of each brick gets initialized by mkfs utility. By default it is calculated as number of free blocks on the device at the very end of the formatting procedure. For meta-data brick it is calculated as 70% of such amount. Capacity of any brick can be changed on-line by user. '''Capacity of a logical volume''' is defined as a sum of capacities of its bricks-components. '''Relative capacity of a brick''' is the ratio of brick's capacity to volume's capacity. Relative capacity defines a portion of IO-requests that will be issued against that brick. Array of relative capacities (C1, C2, ...) of all bricks is called volume partitioning. Obviously, C1 + C2 + ... = 1. 
'''(Real) data space usage''' on a brick is the number of data blocks stored on that brick.

'''Ideal (or expected) data space usage''' on a brick is T*C, where T is the total number of data blocks stored in the volume and C is the relative capacity of the brick.

It is recommended to compose volumes so that the space-based partitioning coincides with the throughput-based one - that is the optimal volume configuration, which provides true parallelism. If that is impossible for some reason, then choose a preferred partitioning method (space-based or throughput-based). Note that space-based partitioning saves volume space, whereas throughput-based partitioning saves volume throughput.

When performing regular file operations, Reiser5 distributes data stripes throughout the volume evenly and fairly: the portion of IO requests issued against each brick is equal to its relative capacity, that is, to the portion of capacity that the brick contributes to the total volume capacity.

Most volume operations are accompanied by rebalancing, which keeps the distribution fair. For example, adding a brick to a logical volume changes its partitioning and hence breaks the fairness of the distribution, so some data stripes have to be moved to the new brick to make the distribution fair again. Likewise, you cannot simply remove a brick from a logical volume - all data stripes have to be moved from that brick to the other bricks of the logical volume first.

Every time the user performs a volume operation, Reiser5 marks the LV as "not balanced". After successful balancing the status of the LV is changed back to "balanced". If the balancing procedure fails for some reason, it should be resumed manually (with the volume.reiser4 utility). It is allowed to perform regular file operations on an unbalanced LV; however, in this case:

a) a good quality of data distribution on your LV is not guaranteed;
b) you won't be able to perform any volume operation on your LV except balancing - any other volume operation will return an error (EBUSY).
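The notion of ideal data space usage defined above can be sketched in a few lines (illustrative numbers, not output of a Reiser5 tool):

```python
# Ideal data space usage per brick is T*C, where T is the total number of
# data blocks in the volume and C is the brick's relative capacity.
def ideal_usage(total_blocks, capacities):
    vol_cap = sum(capacities)
    return [total_blocks * c / vol_cap for c in capacities]

T = 1_000_000                        # data blocks stored in the volume
ideal = ideal_usage(T, [200, 100, 100])
assert ideal == [500000.0, 250000.0, 250000.0]
```

A fair distribution keeps the real usage of every brick close to this ideal; rebalancing restores that property after the partitioning changes.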
So don't forget to bring your LV back to the balanced state as soon as possible!

= Prepare Software and Hardware =

Build, install and boot a kernel with Reiser4 of software framework release number 5.X.Y. Kernel patches can be found [https://sourceforge.net/projects/reiser4/files/v5-unstable/ here]. Note that the Linux kernel and GNU utilities still recognize the testing software as "Reiser4". Make sure the following message appears in the kernel logs:

 "Loading Reiser4 (Software Framework Release: 5.X.Y)"

Build and install the latest [https://sourceforge.net/projects/reiser4/files/reiser4-utils/libaal/ libaal]. Download, build and install the latest version 2.A.B of the [https://sourceforge.net/projects/reiser4/files/v5-unstable/ Reiser4progs package]. Make sure that the utility for managing logical volumes is installed (as a part of reiser4progs) on your machine:

 # volume.reiser4 -?

= Creating a logical volume =

Start by choosing a unique ID (UUID) for your volume. By default it is generated by the mkfs utility, but you can also generate it yourself with a suitable tool (e.g. uuid(1)) and store it in an environment variable for convenience:

 # VOL_ID=`uuid -v4`
 # echo "Using uuid $VOL_ID"

Choose a stripe size for your logical volume. For a good quality of distribution it is recommended that a stripe not exceed 1/10000 of the volume size. On the other hand, too small a stripe will increase space consumption on your meta-data brick. In our example we choose a stripe size of 256K:

 # STRIPE=256K
 # echo "Using stripe size $STRIPE"

Create the first brick of your volume - the meta-data brick - passing the volume ID and stripe size to the mkfs.reiser4 utility:

 # mkfs.reiser4 -U $VOL_ID -t $STRIPE /dev/vdb1

Currently only one meta-data brick per volume is supported, so it is recommended that the block device of the meta-data brick is not too small. In most cases it is enough for the meta-data brick to be no smaller than 1/200 of the maximal volume size.
For example, a 100G meta-data brick will be able to service a ~20T logical volume.

Data and meta-data bricks don't differ from the standpoint of disk format, and there is no special option telling the mkfs utility to create a meta-data brick: the first brick in the volume automatically becomes the meta-data brick, and all other bricks are interpreted as data bricks. Mount your initial logical volume, consisting of the single meta-data brick:

 # mount /dev/vdb1 /mnt

Find the record about your volume in the output of the following command:

 # volume.reiser4 -l

Create the configuration of your logical volume (its definition is above) and store it somewhere - but not on that volume! Your logical volume is now on-line and ready to use. You can perform regular file operations as well as volume operations (e.g. add a data brick to your LV).

= Adding a data brick to LV =

At any time you can add a data brick to your LV, and you can do it in parallel with regular file operations executing on the volume. Make sure, however, that no other volume operation (e.g. removing a brick) is in progress on the volume, otherwise your operation will fail with EBUSY. Obviously, adding a brick increases the capacity of your volume.

Choose a block device for the new data brick. Make sure that it is neither too large nor too small: the capacities of any two bricks of the same logical volume cannot differ by more than 2^19 (~500 thousand) times. E.g. your logical volume cannot contain both a 1M and a 2T brick. Any attempt to add a brick of improper capacity will fail with an error.

Format the device with the same volume ID and stripe size as you used for the meta-data brick, but also specify the "-a" option (to not restrict data capacity):

 # mkfs.reiser4 -U $VOL_ID -t $STRIPE -a /dev/vdb2

Important: the data brick must be formatted with the same volume ID and stripe size as the meta-data brick of your logical volume; otherwise the operation of adding the data brick will fail.
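The sizing rules quoted on this page can be collected into a small pre-flight check. A sketch (my own helper functions, not a Reiser5 API; decimal units for readability):

```python
# Sizing rules from this page:
#  - stripe size should not exceed 1/10000 of the volume size;
#  - the meta-data brick should be at least ~1/200 of the maximal volume size;
#  - capacities of any two bricks may differ by at most a factor of 2**19.
M, G, T = 10**6, 10**9, 10**12

def stripe_ok(stripe_bytes, volume_bytes):
    return stripe_bytes * 10000 <= volume_bytes

def metadata_brick_ok(md_bytes, max_volume_bytes):
    return md_bytes * 200 >= max_volume_bytes

def capacities_ok(capacities):
    return max(capacities) <= min(capacities) * 2**19

assert stripe_ok(256 * 1024, 20 * T)          # 256K stripe, ~20T volume
assert metadata_brick_ok(100 * G, 20 * T)     # 100G brick services ~20T
assert not capacities_ok([1 * M, 2 * T])      # 1M and 2T bricks can't coexist
```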
Update item #4 of your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration] with the UUID or name of the brick you want to add. To add the brick, simply pass its name as the argument of the "-a" option and specify your LV by its mount point:

 # volume.reiser4 -a /dev/vdb2 /mnt

The procedure of adding a brick automatically invokes rebalancing, which moves a portion of data stripes to the newly added brick (so that the resulting distribution is fair). The portion of data blocks moved during such rebalancing is equal to the relative capacity of the new brick, that is, to the portion of capacity that the new brick adds to the updated LV capacity. This important property defines the cost of the balancing procedure: if the portion of capacity added by the brick is small, then the number of stripes moved during balancing is also small.

Like other user-space utilities, the operation of adding a brick can return an error, even if the brick you wanted to add is properly formatted. In this case check the status of your LV:

 # volume.reiser4 /mnt

If the volume is unbalanced, then simply complete the balancing manually:

 # volume.reiser4 -b /mnt

Otherwise, check the number of bricks in your LV. Most likely it is the same as before the failed operation; in that case simply repeat the operation of adding a brick from scratch. Upon successful completion update your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]. That is, increment (#2), add info about the new brick to (#3), and remove the record at (#4).

= Removing a data brick from LV =

At any time you can remove a data brick from your LV, and you can do it in parallel with regular file operations executing on the volume.
Make sure, however, that no other volume operation (e.g. adding a brick) is in progress on the volume, otherwise your operation will fail with EBUSY. Obviously, removing a brick decreases the abstract capacity of your LV. Note that the other bricks must have enough space to store all data blocks of the brick you want to remove, otherwise the removal operation will return an error (ENOSPC).

Suppose you want to remove brick /dev/vdb2 from your LV mounted at /mnt. Update item #4 of your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration] with the UUID or name of the brick you want to remove. To remove the brick, simply pass its name as the argument of the "-r" option and specify the logical volume by its mount point:

 # volume.reiser4 -r /dev/vdb2 /mnt

The procedure of brick removal automatically invokes rebalancing, which distributes the data of the brick being removed among the other bricks, so that the resulting distribution is again fair. The portion of data stripes moved during such rebalancing is equal to the relative capacity of the brick being removed (that is, to the portion of capacity that the brick contributed to the LV capacity).

It can happen that the command above completes with an error (like any other user-space application). In this case check the status of your LV:

 # volume.reiser4 /mnt

If the volume is not balanced, then simply complete the balancing manually:

 # volume.reiser4 -b /mnt

Otherwise, check the number of bricks in your logical volume - it should be the same as before the failed operation. The error ENOSPC indicates that the free space on the other bricks is not enough to fit all the data of the brick you want to remove.
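The ENOSPC condition can be anticipated with a simple pre-check. A sketch (field names and numbers are hypothetical; the 95% factor reflects the ~5% of free blocks that Reiser4 reserves, as explained in the free-space section of this page):

```python
# Removal fails with ENOSPC when the remaining bricks cannot absorb the
# departing brick's data blocks.
def removal_fits(bricks, victim):
    data_to_move = bricks[victim]["used"]
    free_elsewhere = sum(0.95 * (b["size"] - b["used"])
                         for name, b in bricks.items() if name != victim)
    return free_elsewhere >= data_to_move

bricks = {
    "/dev/vdc1": {"size": 1000, "used": 100},   # sizes/usage in blocks
    "/dev/vdd1": {"size": 400,  "used": 350},
}
assert removal_fits(bricks, "/dev/vdd1")        # 855 usable blocks for 350
assert not removal_fits(bricks, "/dev/vdc1")    # only 47.5 usable for 100
```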
On success, update your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]: decrement (#2) and remove the information about brick /dev/vdb2 from (#3) and (#4). Check your kernel logs: they should contain a message that brick /dev/vdb2 has been unregistered. Device /dev/vdb2 no longer belongs to the logical volume, and you can reuse it for other purposes (re-format it, etc.).

= Changing brick's capacity =

At any time (assuming no other volume operation is in progress) you can change the abstract capacity of any brick to a new non-zero value. Changing a capacity always changes the volume partitioning and therefore breaks the fairness of the distribution, so Reiser5 automatically launches rebalancing to make the resulting distribution fair for the new set of capacities. In particular, increasing a brick's capacity moves some data from the other bricks to the brick whose capacity was increased; decreasing a brick's capacity moves some data from that brick to the other bricks.

To change the abstract capacity of brick /dev/vdb1 to a new value (e.g. 200000), simply run

 # volume.reiser4 -z /dev/vdb1 -c 200000 /mnt

pronounced as "resize brick /dev/vdb1 to new capacity 200000 in the volume mounted at /mnt".

The operation of changing capacity can return an error. Most likely it is ENOSPC, a side effect of concurrent regular file writes. In this case check the status of your LV. If it is unbalanced, then consider removing some files from the LV and complete the balancing by running

 # volume.reiser4 -b /mnt

Otherwise, repeat the operation from scratch.

Comment. Changing a brick's capacity to 0 is undefined and will return an error; consider the brick removal operation instead.

= Operations with meta-data brick =

The meta-data brick can also contain data stripes and participate in data distribution like the data bricks.
Thus all the volume operations described above are also applicable to the meta-data brick. Note, however, that it is impossible to completely remove the meta-data brick from the logical volume for obvious reasons (meta-data have to be stored somewhere), so the brick removal operation applied to the meta-data brick actually removes it from the Data Storage Array (DSA), not from the logical volume. The DSA is the subset of the LV consisting of the bricks participating in data distribution. Once you remove the meta-data brick from the DSA, that brick will be used only to store meta-data. The operation of adding a brick, applied to the meta-data brick, returns it back to the DSA.

Important: Reiser5 doesn't count busy data and meta-data blocks separately. So, in contrast with data bricks (which contain only data), you cannot find out the real space occupied by data blocks on the meta-data brick - Reiser5 knows only the total space occupied.

To check the status of the meta-data brick, simply run

 # volume.reiser4 /mnt

and compare the values of "bricks total" and "bricks in DSA". If they are equal, then the meta-data brick participates in data distribution. Otherwise "bricks total" should be 1 more than "bricks in DSA", which indicates that the meta-data brick doesn't participate in data distribution (and therefore doesn't contain data blocks). No other cases are possible: for data bricks, membership in the LV and in the DSA is always equivalent.

= Unmounting a logical volume =

To terminate a mount session, just issue the usual umount command with the mount point specified. Note that after unmounting the volume all bricks by default remain registered in the system until system shutdown.
If you want to unregister a brick before system shutdown, then simply issue the following command:

 # volume.reiser4 -u BRICK_NAME

= Deploying a logical volume after correct unmount =

Make sure (by checking your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]) that all bricks of the volume are registered in the system. To register a brick, issue the following command:

 # volume.reiser4 -g BRICK_NAME

The list of all volumes and bricks registered in the system can be found in the output of the following command:

 # volume.reiser4 -l

Issue the usual mount(8) command against one of the bricks of your volume; it is recommended to issue it against the meta-data brick.

NOTE: Reiser5 will refuse to mount a logical volume when a wrong (incomplete or redundant) set of bricks is registered in the system. A redundant set of bricks appears, for example, when you mistakenly register a brick that was earlier removed from the logical volume.

= Deploying a logical volume after correct shutdown =

To mount your LV, first make sure that all its bricks (data and meta-data) are registered in the system.

Important: Reiser5 will refuse to mount a logical volume when a wrong (incomplete or redundant) set of bricks is registered in the system. A redundant set of bricks appears, for example, when you mistakenly register a brick that was removed from the logical volume. For this reason we strongly recommend that users keep track of their LV - store its configuration somewhere, but not on the volume itself! And don't forget to update that configuration after _every_ volume operation. If you have lost the configuration of your LV and don't remember it (which is most likely for large volumes), it will be rather painful to restore: currently there are no tools to manage off-line logical volumes.
So users are expected to do this on their own; it is not at all difficult. To register a brick in the system, use the following command:

 # volume.reiser4 -g BRICK_NAME

To print a list of all registered bricks, use

 # volume.reiser4 -l

To mount your LV, simply issue a mount(8) command against one of the bricks of your LV. We recommend issuing it against the meta-data brick.

Comment. Reiser5 always tries to register the brick that is passed to the mount command as an argument, so it is not necessary to pre-register the brick against which you issue the mount command.

= Deploying a logical volume after hard reset or system crash =

If no volume operations were interrupted by the hard reset or system crash, then just follow the instructions in this [https://reiser4.wiki.kernel.org/index.php?title=Logical_Volumes_Administration#Deploying_a_logical_volume_after_correct_shutdown section].

In Reiser5 only a restricted number of bricks participate in every transaction; the maximal number of such bricks can be specified by the user. At mount time a transaction replay procedure is launched on each such brick independently, in parallel. Depending on the kind of interrupted volume operation, perform one of the following actions:

== Adding a brick was interrupted ==

Check your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]. Register the old set of bricks (that is, the set of bricks that the volume had before the operation) and try to mount. In the case of an error, register also the brick you wanted to add and try to mount again.
Check the status of your LV by running

 # volume.reiser4 /mnt

If the volume is unbalanced, then complete the balancing manually by running

 # volume.reiser4 -b /mnt

Check "bricks total" of your LV in the output of

 # volume.reiser4 /mnt

and compare it with the old number of bricks in the [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]. The new value should be one more than the old one. If the number of bricks is unchanged, then your operation of adding a brick was completely rolled back by the transaction manager, and you need to repeat it from scratch. Otherwise your operation completed successfully - update your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration] accordingly.

== Brick removal was interrupted ==

Check your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]. Register the new set of bricks (that is, the set of bricks without the brick you wanted to remove) and try to mount the volume. In the case of an error, register also the brick you wanted to remove and try to mount again. Check the status of your LV:

 # volume.reiser4 /mnt

If the volume is unbalanced, then complete the balancing manually by running

 # volume.reiser4 -b /mnt

Otherwise, check the total number of bricks in your LV. If it is the same as before the removal, then your removal operation was completely rolled back by the transaction manager, and you will need to repeat it from scratch.

Comment. After successful completion of the balancing, the brick is automatically removed from the volume and unregistered.
Make sure of this by checking the status of your LV and the list of registered bricks:

 # volume.reiser4 /mnt
 # volume.reiser4 -l

Upon successful completion update your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration] accordingly.

== Another volume operation was interrupted ==

Using the volume configuration, register the new set of bricks and try to mount the volume; the mount should succeed. Check the status of your LV:

 # volume.reiser4 /mnt

If the volume is unbalanced, then complete the balancing manually by running

 # volume.reiser4 -b /mnt

= LV monitoring =

Common info about the LV mounted at /mnt:

 # volume.reiser4 /mnt

 ID:             volume UUID
 volume:         ID of the plugin managing the volume
 distribution:   ID of the distribution plugin
 stripe:         stripe size in bytes
 segments:       number of hash space segments (for distribution)
 bricks total:   total number of bricks in the volume
 bricks in DSA:  number of bricks participating in data distribution
 balanced:       balanced status of the volume

Info about any of its bricks, of index J:

 # volume.reiser4 -p J /mnt

 internal ID:    brick's "internal ID" and its status in the volume
 external ID:    brick's UUID
 device name:    name of the block device associated with the brick
 block count:    size of the block device in blocks
 blocks used:    total number of occupied blocks on the device
 system blocks:  minimal possible number of busy blocks on that device
 data capacity:  abstract capacity of the brick
 space usage:    portion of occupied blocks on the device
 in DSA:         participation in regular data distribution
 is proxy:       participation in data tiering (Burst Buffers, etc.)

Comment. When retrieving brick info, make sure that no volume operations are in progress on that volume; otherwise the command above will return an error (EBUSY).

WARNING. Brick info obtained this way is not necessarily the most recent.
To get up-to-date info, run sync(1) and make sure that no regular file operations are in progress.

= Checking free space =

To check the number of available free blocks on a volume mounted at /mnt, make sure that no regular file operations or volume operations are in progress on that volume, then run

 # sync
 # df --block-size=4K /mnt

To check the number of free blocks on the brick of index J, run

 # volume.reiser4 -p J /mnt

and calculate the difference between "block count" and "blocks used".

Comment. Not all free blocks on a brick/volume are available for use. The number of available free blocks is always ~95% of the total number of free blocks (Reiser4 reserves 5% to make sure that regular file truncate operations won't fail).

NOTE: volume.reiser4 shows the total number of free blocks, whereas df(1) shows the number of available free blocks. The "space usage" statistic shows the portion of busy blocks on an individual brick. For the reasons explained above, "space usage" on any brick cannot exceed 0.95.

= Checking quality of data distribution =

Quality of data distribution is a measure of the deviation of the real data space usage from the ideal one defined by the volume partitioning. The smaller the deviation, the better the distribution quality. Checking the quality of distribution makes sense only when your volume partitioning is space-based, or coincides with the space-based one. If your partitioning is throughput-based and doesn't coincide with the space-based one, then the quality of the actual data distribution can be rather bad: in this case the file system takes care that low-performance devices don't become a bottleneck, and effective space usage is not a high priority. Checking the quality of data distribution is based on the free blocks accounting provided by the file system.
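The free-blocks accounting just mentioned boils down to simple arithmetic. A sketch (sample numbers invented; the 5% reserve is the one stated in the free-space section above):

```python
# free blocks = block count - blocks used;
# only ~95% of free blocks are available (5% reserved for truncates).
def free_blocks(block_count, blocks_used):
    return block_count - blocks_used

def available_blocks(block_count, blocks_used):
    return free_blocks(block_count, blocks_used) * 95 // 100

assert free_blocks(1_000_000, 600_000) == 400_000
assert available_blocks(1_000_000, 600_000) == 380_000
```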
Note that the file system doesn't count busy data and meta-data blocks separately, so you cannot find the real data space usage - and hence check the quality of distribution - when the meta-data brick contains data blocks.

To check the quality of distribution:

* make sure that the meta-data brick doesn't contain data blocks;
* make sure that no regular file or volume operations are currently in progress;
* find the "blocks used", "system blocks" and "data capacity" statistics for each data brick:

 # sync
 # volume.reiser4 -p 1 /mnt
 ...
 # volume.reiser4 -p N /mnt

* find the real data space usage on each brick;
* calculate the partitioning and the ideal data space usage on each data brick;
* find the deviation of the real usage from the ideal one.

Example. Let's build an LV of 3 bricks (one 10G meta-data brick vdb1, and two data bricks: vdc1 (10G) and vdd1 (5G)) with space-based partitioning:

 # VOL_ID=`uuid -v4`
 # echo "Using uuid $VOL_ID"
 # mkfs.reiser4 -U $VOL_ID -y -t 256K /dev/vdb1
 # mkfs.reiser4 -U $VOL_ID -y -a -t 256K /dev/vdc1
 # mkfs.reiser4 -U $VOL_ID -y -a -t 256K /dev/vdd1
 # mount /dev/vdb1 /mnt

Fill the meta-data brick with data:

 # dd if=/dev/zero of=/mnt/myfile bs=256K
 No space left on device...

Add data bricks /dev/vdc1 and /dev/vdd1 to the volume:

 # volume.reiser4 -a /dev/vdc1 /mnt
 # volume.reiser4 -a /dev/vdd1 /mnt

Move all data blocks to the newly added bricks:

 # volume.reiser4 -r /dev/vdb1 /mnt
 # sync

Now the meta-data brick doesn't contain data blocks (only meta-data), so we can calculate the quality of data distribution:

 # volume.reiser4 /mnt -p0
 blocks used: 503
 # volume.reiser4 /mnt -p1
 blocks used: 1657203
 system blocks: 115
 data capacity: 2621069
 # volume.reiser4 /mnt -p2
 blocks used: 833001
 system blocks: 73
 data capacity: 1310391

Based on the statistics above, calculate the quality of distribution.
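The statistics above can be plugged into a short script that re-applies the formulas defined earlier on this page (a sketch, not a Reiser5 tool):

```python
# Quality of distribution from the brick statistics:
# relative capacities -> ideal usage -> relative deviation -> Q = 1 - max|D_i|/R.
caps  = [2621069, 1310391]              # data capacity of vdc1, vdd1
used  = [1657203 - 115, 833001 - 73]    # blocks used - system blocks
total = sum(used)                       # data blocks stored on the volume
ideal = [total * c / sum(caps) for c in caps]
rel_dev = [abs(u - i) / total for u, i in zip(used, ideal)]
quality = 1 - max(rel_dev)
assert round(quality, 4) == 0.9988
```

The script keeps full precision for the relative capacities, so its deviations differ slightly from the rounded hand calculation, but the resulting quality agrees to four digits.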
Total data capacity of the volume:

 C = 2621069 + 1310391 = 3931460

Relative capacities of the data bricks:

 C1 = 2621069 / 3931460 = 0.6667
 C2 = 1310391 / 3931460 = 0.3333

Real space usage on the data bricks (blocks used - system blocks):

 R1 = 1657203 - 115 = 1657088
 R2 = 833001 - 73 = 832928

Space usage on the volume:

 R = R1 + R2 = 1657088 + 832928 = 2490016

Ideal data space usage on the data bricks:

 I1 = C1 * R = 0.6667 * 2490016 = 1660094
 I2 = C2 * R = 0.3333 * 2490016 = 829922

Deviation:

 D = (R1, R2) - (I1, I2) = (-3006, 3006)

Relative deviation:

 D/R = (-0.0012, 0.0012)

Quality of distribution:

 Q = 1 - max(|D1|, |D2|)/R = 1 - 0.0012 = 0.9988

Comment. For any specified number of bricks N and quality of distribution Q it is possible to find a configuration of a logical volume composed of N bricks such that the quality of distribution on that volume is better than Q.

Comment. The quality of distribution Q doesn't depend on the number of bricks in the logical volume. This is a theorem, which can be strictly proven.

= FAQ =

Q. What happens if I lose a component device of my logical volume (due to a breakdown, etc.)?

A. The bodies of some of your regular files will become "punched" in random places. The portion of such files depends on the relative capacity of the lost brick, on the number of bricks in the logical volume, and on other factors. Fsck will be able to detect and remove files with corrupted bodies. Nevertheless, we recommend considering mirroring your bricks (e.g. by software or hardware RAID-1) to avoid such highly unpleasant situations.
Don't put important data to logical volumes managed by software of release number 5.X.Y. Also don't mount your old partitions in kernels with Reiser4 of SFRN 5.X.Y before its stabilization IMPORTANT: Currently there is no tools to manage Reiser5 logical volumes off-line, so it it strongly recommended to save/update configurations of your LV in a file, which doesn't belong to that volume. = Basic definitions. Volume configuration. Brick's capacity. Partitioning. Fair distribution. Balancing = Basic configuration of a logical volume is the following information: 1) Volume UUID; 2) Number of bricks in the volume; 3) List of brick names or UUIDs in the volume; 4) UUID or name of the brick to be added/removed (if any). That brick is not counted in (2) and (3). The item #4 is to handle incomplete operations interrupted by various reasons (system crash, hard reset, etc). For each volume its configuration should be stored somewhere (but not on that volume!) and properly updated before and after each volume operation performed on that volume. We make the user responsible for this. Volume configuration is needed to facilitate deploying a volume. '''Abstract capacity''' (or simply capacity) of a brick is a positive integer number. Capacity is a brick's property defined by user. Don't confuse it with the size of block device. Think of it as of brick's "weight" in some units. And this is the user, who decides, which property of the brick to assign as its abstract capacity and in which units. In particular, it can be size of the block device in kilobytes, or its size in megabytes, or its throughput in M/sec, or other geometric or physical parameter of the device, associated with the brick. It is important that capacities of all bricks of the same logical volume are measured in the same units. Also, it would be utterly pointless to assign different properties as abstract capacities for bricks of the same LV. 
For example, size of block device for one brick, and disk bandwidth for another one. Capacity of each brick gets initialized by mkfs utility. By default it is calculated as number of free blocks on the device at the very end of the formatting procedure. For meta-data brick it is calculated as 70% of such amount. Capacity of any brick can be changed on-line by user. '''Capacity of a logical volume''' is defined as a sum of capacities of its bricks-components. '''Relative capacity of a brick''' is the ratio of brick's capacity to volume's capacity. Relative capacity defines a portion of IO-requests that will be issued against that brick. Array of relative capacities (C1, C2, ...) of all bricks is called volume partitioning. Obviously, C1 + C2 + ... = 1. '''(Real) data space usage''' on a brick is number of data blocks, stored on that brick. '''Ideal (or expected) data space usage''' on a brick is T*C, where T is total number of data blocks stored in the volume. C is relative capacity of the brick. It is recommended to compose volumes in the way so that space-based partitioning coincides with throughput-based one - it would be the optimal volume configuration, which provides true parallelism. If it is impossible for some reason, then choose a preferred partitioning method (space-based, or throughput-based). Note that space-based partitioning saves volume space, whereas throughput based one saves volume throughput. When performing regular file operations, Reiser5 distributes data stripes throughout the volume evenly and fairly. It means that portion of IO-requests issued against each brick is equal to its relative capacity, that is, to the portion of capacity that the brick adds to the total volume's capacity. Most volume operations are accompanied by rebalancing, which keeps fairness of distribution. 
For example, adding a brick to a logical volume changes its partitioning, and hence, breaks fairness of the distribution, so we need to move some data stripes to the new brick to make distribution fair. Also you can not simply remove a brick from a logical volume - all data stripes should be moved from that brick to other bricks of the logical volume. Every time when user performs a volume operation, Reiser5 marks LV as "not balanced". After successful balancing the status of LV is changed to "balanced". If balancing procedure fails for some reasons, it should be resumed manually (with volume.reiser4 utility). It is allowed to perform regular file operations on not balanced LV. However, in this case: a) we don't guarantee a good quality of data distribution on your LV. b) you won't be able to perform volume operations on your LV except balancing - any other volume operation will return error (EBUSY). So, don't forget to bring your LV to the balanced state as soon as possible! = Prepare Software and Hardware = Build, install and boot kernel with Reiser4 of software framework release number 5.X.Y. Kernel patches can be found [https://sourceforge.net/projects/reiser4/files/v5-unstable/ here]. Note that by Linux kernel and GNU utilities the testing stuff is still recognized as "Reiser4". Make sure there is the following message in kernel logs: "Loading Reiser4 (Software Framework Release: 5.X.Y)" Build and install the latest [https://sourceforge.net/projects/reiser4/files/reiser4-utils/libaal/ libaal] Download, build and install the latest version 2.A.B of [https://sourceforge.net/projects/reiser4/files/v5-unstable/ Reiser4progs package]. Make sure that utility for managing logical volumes is installed (as a part of reiser4progs package) on your machine: # volume.reiser4 -? = Creating a logical volume = Start from choosing a unique ID (uuid) of your volume. By default it is generated by mkfs utility. However, user can generate it himself by proper tools (e.g. 
uuid(1)) and store it in an environment variable for convenience:

 # VOL_ID=`uuid -v4`
 # echo "Using uuid $VOL_ID"

Choose a stripe size for your logical volume. For a good quality of distribution it is recommended that the stripe size not exceed 1/10000 of the volume size. On the other hand, too small a stripe size will increase space consumption on your meta-data brick. In our example we choose a stripe size of 256K:

 # STRIPE=256K
 # echo "Using stripe size $STRIPE"

Start by creating the first brick of your volume - the meta-data brick - passing the volume ID and stripe size to the mkfs.reiser4 utility:

 # mkfs.reiser4 -U $VOL_ID -t $STRIPE /dev/vdb1

Currently only one meta-data brick per volume is supported, so it is recommended that the block device for the meta-data brick not be too small. In most cases it is enough if your meta-data brick is not smaller than 1/200 of the maximal volume size. For example, a 100G meta-data brick will be able to service a ~20T logical volume. Data and meta-data bricks don't differ from the standpoint of disk format, and there is no special option to tell the mkfs utility that we want to create a meta-data brick: the first brick in the volume automatically becomes the meta-data brick, and all other bricks are interpreted as data bricks.

Mount your initial logical volume consisting of one meta-data brick:

 # mount /dev/vdb1 /mnt

Find the record about your volume in the output of the following command:

 # volume.reiser4 -l

Create the configuration of your logical volume (its definition is above) and store it somewhere - but not on that volume! Your logical volume is now on-line and ready to use. You can perform regular file operations and volume operations (e.g. add a data brick to your LV).

= Adding a data brick to LV =

At any time you can add a data brick to your LV. You can do it in parallel with regular file operations executing on this volume. Make sure, however, that no other volume operation (e.g.
removing a brick) is in progress on your volume, otherwise your operation will fail with EBUSY. Obviously, adding a brick increases the capacity of your volume.

Choose a block device for the new data brick. Make sure that it is neither too large nor too small: the capacities of any two bricks of the same logical volume cannot differ by more than a factor of 2^19 (~500,000). E.g. your logical volume cannot contain both a 1M and a 2T brick. Any attempt to add a brick of improper capacity will fail with an error. Format the device with the same volume ID and stripe size as you used for the meta-data brick, but also specify the "-a" option (to not restrict data capacity):

 # mkfs.reiser4 -U $VOL_ID -t $STRIPE -a /dev/vdb2

Important: the data brick must be formatted with the same volume ID and stripe size as the meta-data brick of your logical volume. Otherwise, the operation of adding the data brick will fail.

Update item #4 of your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration] with the UUID or name of the brick you want to add. To add the brick, simply pass its name as an argument to the option "-a" and specify your LV via its mount point:

 # volume.reiser4 -a /dev/vdb2 /mnt

The procedure of adding a brick automatically invokes rebalancing, which moves a portion of data stripes to the newly added brick (so that the resulting distribution is fair). The portion of data blocks moved during such rebalancing is equal to the relative capacity of the new brick, that is, to the portion of capacity that the new brick adds to the updated LV's capacity. This important property defines the cost of the balancing procedure: if the portion of capacity added by a brick is small, then the number of stripes moved during balancing is also small.
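The format-then-add sequence can be collected into a small dry-run script. This is only a sketch: the device name, mount point and example uuid are hypothetical, and with `DRYRUN=echo` (the default here) the commands are printed rather than executed, so the sequence can be reviewed before running it for real as root:

```shell
# Dry-run sketch of adding a data brick (names below are examples).
DRYRUN=${DRYRUN:-echo}                    # set DRYRUN="" to really execute
VOL_ID=${VOL_ID:-00000000-0000-4000-8000-000000000000}   # example uuid
STRIPE=${STRIPE:-256K}
NEW_BRICK=/dev/vdb2
MNT=/mnt

# Format the new data brick with the volume's UUID and stripe size
# ("-a" leaves its data capacity unrestricted), then add it to the LV:
$DRYRUN mkfs.reiser4 -U "$VOL_ID" -t "$STRIPE" -a "$NEW_BRICK"
$DRYRUN volume.reiser4 -a "$NEW_BRICK" "$MNT"
# If adding fails and the volume is left unbalanced, finish balancing:
$DRYRUN volume.reiser4 -b "$MNT"
```

The flags used (`-U`, `-t`, `-a`, `-b`) are the ones documented in this guide; the `DRYRUN` wrapper is our own convention.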
Like any other user-space utility, the operation of adding a brick can return an error, even if the brick you wanted to add is properly formatted. In this case check the status of your LV:

 # volume.reiser4 /mnt

If the volume is unbalanced, then simply complete balancing manually:

 # volume.reiser4 -b /mnt

Otherwise, check the number of bricks in your LV. Most likely it is the same as before the failed operation; in this case simply repeat the operation of adding a brick from scratch. Upon successful completion update your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]. That is, increment (#2), add info about the new brick to (#3) and remove the record at (#4).

= Removing a data brick from LV =

At any time you can remove a data brick from your LV. You can do it in parallel with regular file operations executing on this volume. Make sure, however, that no other volume operation (e.g. adding a brick) is in progress on your volume, otherwise your operation will fail with EBUSY. Obviously, removing a brick decreases the abstract capacity of your LV. Note that the other bricks must have enough space to store all data blocks of the brick you want to remove; otherwise the removal operation will return an error (ENOSPC).

Suppose you want to remove brick /dev/vdb2 from your LV mounted at /mnt. Update your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration] with the UUID and name of the brick you want to remove (item #4).
To remove a brick, simply pass its name as an argument to option "-r" and specify the logical volume by its mount point:

 # volume.reiser4 -r /dev/vdb2 /mnt

The procedure of brick removal automatically invokes rebalancing, which distributes the data of the brick to be removed among the other bricks, so that the resulting distribution is also fair. The portion of data stripes moved during such rebalancing is equal to the relative capacity of the brick to be removed (that is, to the portion of capacity that the brick added to the LV's capacity).

It can happen that the command above completes with an error (like any other user-space application). In this case check the status of your LV:

 # volume.reiser4 /mnt

If the volume is unbalanced, then simply complete balancing manually:

 # volume.reiser4 -b /mnt

Otherwise, check the number of bricks in your logical volume - it should be the same as before the failed operation. The error -ENOSPC indicates that there is not enough free space on the other bricks to fit all the data of the brick you want to remove. On success, update your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]: remove the information about the brick /dev/vdb2 at #3 and #4. Check your kernel logs: they should contain a message that brick /dev/vdb2 has been unregistered. Now device /dev/vdb2 doesn't belong to the logical volume any more, and you can reuse it for other purposes (re-format it, etc).

= Changing brick's capacity =

At any time (assuming that no other volume operation is in progress) you can change the abstract capacity of any brick to a new non-zero value. Changing capacity always changes the volume partitioning and therefore breaks fairness of distribution, so Reiser5 automatically launches rebalancing to make sure that the resulting distribution is fair for the new set of capacities.
In particular, increasing a brick's capacity moves some data from other bricks to the brick whose capacity was increased; decreasing a brick's capacity moves some data from the brick whose capacity was decreased to the other bricks. To change the abstract capacity of brick /dev/vdb1 to a new value (e.g. 200000), simply run

 # volume.reiser4 -z /dev/vdb1 -c 200000 /mnt

pronounced as "resize brick /dev/vdb1 to new capacity 200000 in the volume mounted at /mnt". The operation of changing capacity can return an error. Most likely it is -ENOSPC, which is a side effect of concurrent regular file writes. In this case check the status of your LV. If it is unbalanced, then consider removing some files from your LV and complete balancing by running

 # volume.reiser4 -b /mnt

Otherwise, repeat the operation from scratch.

Comment. Changing a brick's capacity to 0 is undefined and will return an error. Consider the brick removal operation instead.

= Operations with meta-data brick =

The meta-data brick can also contain data stripes and participate in data distribution like the data bricks, so all the volume operations described above are applicable to the meta-data brick as well. Note, however, that it is impossible to completely remove the meta-data brick from the logical volume for obvious reasons (meta-data need to be stored somewhere), so the brick removal operation applied to the meta-data brick actually removes it from the Data Storage Array (DSA), not from the logical volume. The DSA is the subset of the LV consisting of the bricks participating in data distribution. Once you remove the meta-data brick from the DSA, that brick will be used only to store meta-data. The operation of adding a brick, applied to the meta-data brick, returns it to the DSA.

Important: Reiser5 doesn't count busy data and meta-data blocks separately. So, in contrast with data bricks (which contain only data), you are not able to find out the real space occupied by data blocks on the meta-data brick - Reiser5 knows only the total space occupied.
To check the status of the meta-data brick simply run

 # volume.reiser4 /mnt

and compare the values of "bricks total" and "bricks in DSA". If they are equal, then the meta-data brick participates in data distribution. Otherwise, "bricks total" should be 1 more than "bricks in DSA" - it indicates that the meta-data brick doesn't participate in data distribution (and therefore doesn't contain data blocks). Note that other cases are impossible: for data bricks, participation in the LV and in the DSA are always equivalent.

= Unmounting a logical volume =

To terminate a mount session, just issue the usual umount command with the mount point specified. Note that after unmounting the volume all bricks by default remain registered in the system until system shutdown. If you want to unregister a brick before system shutdown, then simply issue the following command:

 # volume.reiser4 -u BRICK_NAME

= Deploying a logical volume after correct unmount =

Make sure (by checking your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]) that all bricks of the volume are registered in the system. To register a brick, issue the following command:

 # volume.reiser4 -g BRICK_NAME

The list of all volumes and bricks registered in the system can be found in the output of the following command:

 # volume.reiser4 -l

Issue the usual mount(8) command against one of the bricks of your volume. It is recommended to issue it against the meta-data brick.

NOTE: Reiser5 will refuse to mount a logical volume when a wrong (incomplete or redundant) set of bricks is registered in the system. A redundant set of bricks appears, for example, when you mistakenly register a brick that was earlier removed from the logical volume.

= Deploying a logical volume after correct shutdown =

To mount your LV, first make sure that all its bricks (data and meta-data) are registered in the system.
Important: Reiser5 will refuse to mount a logical volume when a wrong (incomplete or redundant) set of bricks is registered in the system. A redundant set of bricks appears, for example, when you mistakenly register a brick that was removed from the logical volume. For this reason we strongly recommend keeping track of your LV - store its configuration somewhere, but not on that volume! And don't forget to update that configuration after _every_ volume operation. If you lost the configuration of your LV and don't remember it (which is most likely for large volumes), then it will be rather painful to restore: currently there are no tools to manage offline logical volumes, so users have to do this on their own. It is not at all difficult.

To register a brick in the system use the following command:

 # volume.reiser4 -g BRICK_NAME

To print a list of all registered bricks use

 # volume.reiser4 -l

To mount your LV, simply issue a mount(8) command against one of the bricks of your LV. We recommend issuing it against the meta-data brick.

Comment. Reiser5 always tries to register the brick which is passed to the mount command as an argument, so it is not necessary to preregister the brick you want to issue the mount command against.

= Deploying a logical volume after hard reset or system crash =

If no volume operations were interrupted by the hard reset or system crash, then just follow the instructions in this [https://reiser4.wiki.kernel.org/index.php?title=Logical_Volumes_Administration#Deploying_a_logical_volume_after_correct_shutdown section]. In Reiser5 only a restricted number of bricks participate in every transaction (the maximal number of such bricks can be specified by the user). At mount time a transaction replay procedure will be launched on each such brick independently, in parallel.
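Both deployment procedures start the same way: register every brick from the saved volume configuration, verify the registered set, then mount via the meta-data brick. A dry-run sketch (the brick names are examples standing in for item #3 of your configuration; with `DRYRUN=echo` the commands are only printed):

```shell
# Dry-run sketch: register the bricks from the saved configuration and mount.
DRYRUN=${DRYRUN:-echo}                    # set DRYRUN="" to really execute
BRICKS="/dev/vdb1 /dev/vdc1 /dev/vdd1"    # example: item #3 of the configuration
MD_BRICK=/dev/vdb1                        # the meta-data brick

for b in $BRICKS; do
    $DRYRUN volume.reiser4 -g "$b"        # register each brick
done
$DRYRUN volume.reiser4 -l                 # verify the registered set
$DRYRUN mount "$MD_BRICK" /mnt            # mount via the meta-data brick
```

Only the `-g` and `-l` options documented above are used; the loop over a saved brick list is our own convention for avoiding an incomplete or redundant set.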
Depending on the kind of interrupted volume operation, perform one of the following actions:

== Adding a brick was interrupted ==

Check your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]. Register the old set of bricks (that is, the set of bricks that the volume had before the operation) and try to mount. In the case of an error, register also the brick you wanted to add and try to mount again. Check the status of your LV by running

 # volume.reiser4 /mnt

If the volume is unbalanced, then complete balancing manually by running

 # volume.reiser4 -b /mnt

Check "bricks total" of your LV in the output of

 # volume.reiser4 /mnt

Compare it with the old number of bricks in the [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]. The new value should be the old one plus 1. If the number of bricks is the same, then your operation of adding a brick was completely rolled back by the transaction manager, and you need to repeat it from scratch. Otherwise, your operation completed successfully - update your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration] accordingly.

== Brick removal was interrupted ==

Check your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]. Register the new set of bricks (that is, the set of bricks without the brick you wanted to remove). Try to mount the volume.
In the case of an error, register also the brick you wanted to remove and try to mount again. Check the status of your LV:

 # volume.reiser4 /mnt

If the volume is unbalanced, then complete balancing manually by running

 # volume.reiser4 -b /mnt

Otherwise, check the total number of bricks in your LV. If it is the same as before the removal, then your removal operation was completely rolled back by the transaction manager, and you need to repeat it from scratch.

Comment. After successful completion of balancing, the brick will be automatically removed from the volume and unregistered. Verify this by checking the status of your LV and the list of registered bricks:

 # volume.reiser4 /mnt
 # volume.reiser4 -l

Upon successful completion update your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration] accordingly.

== Another volume operation was interrupted ==

Using the volume configuration, register the new set of bricks and try to mount the volume. The mount should be successful.
Check the status of your LV:

 # volume.reiser4 /mnt

If the volume is unbalanced, then complete balancing manually by running

 # volume.reiser4 -b /mnt

= LV monitoring =

Common info about the LV mounted at /mnt:

 # volume.reiser4 /mnt

 ID:             Volume UUID
 volume:         ID of the plugin managing the volume
 distribution:   ID of the distribution plugin
 stripe:         Stripe size in bytes
 segments:       Number of hash space segments (for distribution)
 bricks total:   Total number of bricks in the volume
 bricks in DSA:  Number of bricks participating in data distribution
 balanced:       Balanced status of the volume

Info about its brick of index J:

 # volume.reiser4 -p J /mnt

 internal ID:    Brick's "internal ID" and its status in the volume
 external ID:    Brick's UUID
 device name:    Name of the block device associated with the brick
 block count:    Size of the block device in blocks
 blocks used:    Total number of occupied blocks on the device
 system blocks:  Minimal possible number of busy blocks on that device
 data capacity:  Abstract capacity of the brick
 space usage:    Portion of occupied blocks on the device
 in DSA:         Participation in regular data distribution
 is proxy:       Participation in data tiering (Burst Buffers, etc)

Comment. When retrieving a brick's info, make sure that no volume operations on that volume are in progress; otherwise the command above will return an error (EBUSY).

WARNING. The brick info provided this way is not necessarily the most recent. To get up-to-date info, run sync(1) and make sure that no regular file operations are in progress.

= Checking free space =

To check the number of available free blocks on a volume mounted at /mnt, make sure that no regular file operations or volume operations are in progress on that volume, then run

 # sync
 # df --block-size=4K /mnt

To check the number of free blocks on the brick of index J, run

 # volume.reiser4 -p J /mnt

and calculate the difference between "block count" and "blocks used".

Comment. Not all free blocks on a brick/volume are available for use.
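The free-block arithmetic for one brick can be sketched with made-up numbers standing in for the "block count" and "blocks used" fields of `volume.reiser4 -p J /mnt`; roughly 95% of the free blocks are actually available, per the 5% reservation described in the text:

```shell
# Sketch of per-brick free-space arithmetic (numbers are examples):
BLOCK_COUNT=2621440     # e.g. a 10G device with 4K blocks
BLOCKS_USED=1657203
FREE=$((BLOCK_COUNT - BLOCKS_USED))
# df(1) reports available blocks: roughly 95% of free (5% is reserved)
AVAIL=$(awk -v f=$FREE 'BEGIN { printf "%d", f * 0.95 }')
echo "free blocks: $FREE (about $AVAIL available to df)"
```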
The number of available free blocks is always ~95% of the total number of free blocks (Reiser4 reserves 5% to make sure that regular file truncate operations won't fail). NOTE: volume.reiser4 shows the total number of free blocks, whereas df(1) shows the number of available free blocks. The "space usage" statistic shows the portion of busy blocks on an individual brick. For the reasons explained above, "space usage" on any brick cannot be more than 0.95.

= Checking quality of data distribution =

Quality of data distribution is a measure of the deviation of the real data space usage from the ideal one defined by the volume partitioning. The smaller the deviation, the better the distribution quality. Checking the quality of distribution makes sense only when your volume partitioning is space-based, or coincides with the space-based one. If your partitioning is throughput-based and doesn't coincide with the space-based one, then the quality of the actual data distribution can be rather bad: in that case the file system takes care that low-performance devices don't become a bottleneck, and effective space usage is not a high priority.

Checking the quality of data distribution is based on the free-blocks accounting provided by the file system. Note that the file system doesn't count busy data and meta-data blocks separately, so you are not able to find the real data space usage - and hence to check the quality of distribution - when the meta-data brick contains data blocks.

To check the quality of distribution:

* make sure that the meta-data brick doesn't contain data blocks;
* make sure that no regular file or volume operations are currently in progress;
* find the "blocks used", "system blocks" and "data capacity" statistics for each data brick:

 # sync
 # volume.reiser4 -p 1 /mnt
 ...
 # volume.reiser4 -p N /mnt

* find the real data space usage on each brick;
* calculate the partitioning and the ideal data space usage on each data brick;
* find the deviation of the real usage from the ideal one.

Example.
Let's build an LV of 3 bricks (one 10G meta-data brick vdb1, and two data bricks: vdc1 (10G) and vdd1 (5G)) with space-based partitioning:

 # VOL_ID=`uuid -v4`
 # echo "Using uuid $VOL_ID"
 # mkfs.reiser4 -U $VOL_ID -y -t 256K /dev/vdb1
 # mkfs.reiser4 -U $VOL_ID -y -a -t 256K /dev/vdc1
 # mkfs.reiser4 -U $VOL_ID -y -a -t 256K /dev/vdd1
 # mount /dev/vdb1 /mnt

Fill the meta-data brick with data:

 # dd if=/dev/zero of=/mnt/myfile bs=256K
 No space left on device...

Add data bricks /dev/vdc1 and /dev/vdd1 to the volume:

 # volume.reiser4 -a /dev/vdc1 /mnt
 # volume.reiser4 -a /dev/vdd1 /mnt

Move all data blocks to the newly added bricks:

 # volume.reiser4 -r /dev/vdb1 /mnt
 # sync

Now the meta-data brick doesn't contain data blocks (only meta-data ones), so we can calculate the quality of data distribution:

 # volume.reiser4 /mnt -p0
 blocks used: 503
 # volume.reiser4 /mnt -p1
 blocks used: 1657203
 system blocks: 115
 data capacity: 2621069
 # volume.reiser4 /mnt -p2
 blocks used: 833001
 system blocks: 73
 data capacity: 1310391

Based on the statistics above, calculate the quality of distribution.

Total data capacity of the volume:
 C = 2621069 + 1310391 = 3931460
Relative capacities of the data bricks:
 C1 = 2621069 / 3931460 = 0.6667
 C2 = 1310391 / 3931460 = 0.3333
Real space usage on the data bricks (blocks used - system blocks):
 R1 = 1657203 - 115 = 1657088
 R2 = 833001 - 73 = 832928
Space usage on the volume:
 R = R1 + R2 = 1657088 + 832928 = 2490016
Ideal data space usage on the data bricks:
 I1 = C1 * R = 0.6667 * 2490016 = 1660094
 I2 = C2 * R = 0.3333 * 2490016 = 829922
Deviation:
 D = (R1, R2) - (I1, I2) = (-3006, 3006)
Relative deviation:
 D/R = (-0.0012, 0.0012)
Quality of distribution:
 Q = 1 - max(|D1|, |D2|)/R = 1 - 0.0012 = 0.9988

For any specified number of bricks N and quality of distribution Q, it is possible to find a configuration of a logical volume composed of N bricks such that the quality of distribution on that volume is better than Q.
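As a quick cross-check of the example, the same quality figure can be recomputed with awk using exact relative capacities instead of the rounded 0.6667/0.3333 (only the statistics quoted above go in):

```shell
# Recompute the example's quality of distribution with exact ratios:
Q=$(awk 'BEGIN {
    c1 = 2621069; c2 = 1310391            # data capacities of the two bricks
    r1 = 1657203 - 115                    # real usage: blocks used - system
    r2 = 833001  - 73
    r  = r1 + r2                          # data space usage on the volume
    i1 = r * c1 / (c1 + c2)               # ideal usage on each brick
    i2 = r * c2 / (c1 + c2)
    d1 = r1 - i1; if (d1 < 0) d1 = -d1    # absolute deviations
    d2 = r2 - i2; if (d2 < 0) d2 = -d2
    m  = (d1 > d2) ? d1 : d2
    printf "%.4f", 1 - m / r              # quality of distribution
}')
echo "Q = $Q"
```

The exact computation agrees with the rounded one to four decimal places: Q = 0.9988.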
Comment. The quality of distribution Q doesn't depend on the number of bricks in the logical volume. This is a theorem, which can be strictly proven.

= FAQ =

Q. What happens if I lose a device-component of my logical volume (due to a breakdown, etc)?

A. The bodies of some of your regular files will become "punched" in random places. The portion of such files depends on the relative capacity of the lost brick, on the number of bricks in the logical volume, and on other factors. Fsck will be able to detect and remove files with corrupted bodies. Nevertheless, we recommend considering mirroring your bricks (e.g. by software or hardware RAID-1) to avoid such highly unpleasant situations.
For each volume its configuration should be stored somewhere (but not on that volume!) and properly updated before and after each volume operation performed on that volume. We make the user responsible for this. Volume configuration is needed to facilitate deploying a volume. '''Abstract capacity''' (or simply capacity) of a brick is a positive integer number. Capacity is a brick's property defined by user. Don't confuse it with the size of block device. Think of it as of brick's "weight" in some units. And this is the user, who decides, which property of the brick to assign as its abstract capacity and in which units. In particular, it can be size of the block device in kilobytes, or its size in megabytes, or its throughput in M/sec, or other geometric or physical parameter of the device, associated with the brick. It is important that capacities of all bricks of the same logical volume are measured in the same units. Also, it would be utterly pointless to assign different properties as abstract capacities for bricks of the same LV. For example, size of block device for one brick, and disk bandwidth for another one. Capacity of each brick gets initialized by mkfs utility. By default it is calculated as number of free blocks on the device at the very end of the formatting procedure. For meta-data brick it is calculated as 70% of such amount. Capacity of any brick can be changed on-line by user. '''Capacity of a logical volume''' is defined as a sum of capacities of its bricks-components. '''Relative capacity of a brick''' is the ratio of brick's capacity to volume's capacity. Relative capacity defines a portion of IO-requests that will be issued against that brick. Array of relative capacities (C1, C2, ...) of all bricks is called volume partitioning. Obviously, C1 + C2 + ... = 1. '''(Real) data space usage''' on a brick is number of data blocks, stored on that brick. 
'''Ideal (or expected) data space usage''' on a brick is T*C, where T is total number of data blocks stored in the volume. C is relative capacity of the brick. It is recommended to compose volumes in the way so that space-based partitioning coincides with throughput-based one - it would be the optimal volume configuration, which provides true parallelism. If it is impossible for some reason, then choose a preferred partitioning method (space-based, or throughput-based). Note that space-based partitioning saves volume space, whereas throughput based one saves volume throughput. When performing regular file operations, Reiser5 distributes data stripes throughout the volume evenly and fairly. It means that portion of IO-requests issued against each brick is equal to its relative capacity, that is, to the portion of capacity that the brick adds to the total volume's capacity. Most volume operations are accompanied by rebalancing, which keeps fairness of distribution. For example, adding a brick to a logical volume changes its partitioning, and hence, breaks fairness of the distribution, so we need to move some data stripes to the new brick to make distribution fair. Also you can not simply remove a brick from a logical volume - all data stripes should be moved from that brick to other bricks of the logical volume. Every time when user performs a volume operation, Reiser5 marks LV as "not balanced". After successful balancing the status of LV is changed to "balanced". If balancing procedure fails for some reasons, it should be resumed manually (with volume.reiser4 utility). It is allowed to perform regular file operations on not balanced LV. However, in this case: a) we don't guarantee a good quality of data distribution on your LV. b) you won't be able to perform volume operations on your LV except balancing - any other volume operation will return error (EBUSY). So, don't forget to bring your LV to the balanced state as soon as possible! 
= Prepare Software and Hardware = Build, install and boot kernel with Reiser4 of software framework release number 5.X.Y. Kernel patches can be found [https://sourceforge.net/projects/reiser4/files/v5-unstable/ here]. Note that by Linux kernel and GNU utilities the testing stuff is still recognized as "Reiser4". Make sure there is the following message in kernel logs: "Loading Reiser4 (Software Framework Release: 5.X.Y)" Build and install the latest [https://sourceforge.net/projects/reiser4/files/reiser4-utils/libaal/ libaal] Download, build and install the latest version 2.A.B of [https://sourceforge.net/projects/reiser4/files/v5-unstable/ Reiser4progs package]. Make sure that utility for managing logical volumes is installed (as a part of reiser4progs package) on your machine: # volume.reiser4 -? = Creating a logical volume = Start from choosing a unique ID (uuid) of your volume. By default it is generated by mkfs utility. However, user can generate it himself by proper tools (e.g. uuid(1)) and store in an environment variable for convenience: # VOL_ID=`uuid -v4` # echo "Using uuid $VOL_ID" Choose a stripe size for your logical volume. For a good quality of distribution it is recommended that stripe doesn't exceed 1/10000 of volume size. On the other hand, too small stripes will increase space consumption on your meta-data brick. In our example we choose stripe size 256K: # STRIPE=256K # echo "Using stripe size $STRIPE" Start from creating the first brick of your volume - meta-data brick, passing volume-ID and stripe size to mkfs.reiser4 utility: # mkfs.reiser4 -U $VOL_ID -t $STRIPE /dev/vdb1 Currently only one meta-data brick per volume is supported, so it is recommended that size of block device for meta-data brick in not too small. In most cases it will be enough, if your meta-data brick is not smaller than 1/200 of maximal volume size. For example, 100G meta-data brick will be able to service ~20T logical volume. 
Data and meta-data bricks don't differ from the standpoint of disk format, and there is no special option to inform mkfs utility that we want to create exactly meta-data brick: the first brick in the volume automatically becomes a meta-data brick, and other bricks are interpreted as data bricks. Mount your initial logical volume consisting of one meta-data brick: # mount /dev/vdb1 /mnt Find a record about your volume in the output of the following command: # volume.reiser4 -l Create configuration of your logical volume (its definition is above) and store it somewhere, but not on that volume! Your logical volume is now on-line and ready to use. You can perform regular file operations and volume operations (e.g. add a data brick to your LV). = Adding a data brick to LV = At any time you are able to add a data brick to your LV. You can do it in parallel with regular file operations executing on this volume. Make sure, however, that there is no other volume operations (e.g. removing a brick) over your volume in progress, otherwise your operation will fail with EBUSY. Obviously, adding a brick will increase capacity of your volume. Choose a block device for the new data brick. Make sure that it is not too large, or too small. Capacities of any 2 bricks of the same logical volume can not differ more than 2^19 (~1 million) times. E.g. your logical volume can not contain both, 1M and 2T bricks. Any attempts to add a brick of improper capacity will fail with error. Format it with the same volume ID and stripe size, as you used for meta-data brick, but specify also "-a" option (to not restrict data capacity). # mkfs.reiser4 -U $VOL_ID -t $STRIPE -a /dev/vdb2 Important: it is important that data brick is formatted with the same volume ID and stripe size, as the meta-data brick of your logical volume. Otherwise, operation of adding a data brick will fail. 
Update item #4 of your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration] with UUID or name of the brick you want to add. To add a brick simply pass its name as an argument for the option "-a" and specify your LV via its mount point: # volume.reiser4 -a /dev/vdb2 /mnt The procedure of adding a brick automatically invokes re-balancing, which moves a portion of data stripes to the newly added brick (so that the resulted distribution will fair). Portion of data blocks moved during such rebalancing is equal to the relative capacity of the new brick, that is to the portion of capacity that the new brick adds to updated LV's capacity. This important property defines the cost of balancing procedure. If the portion of capacity added by a brick is small, then number of stripes moved during balancing is also small. Like other user-space utilities, the operation of adding a brick can return error, even in the assumption that the brick you wanted to add is properly formatted. In this case check the status of your LV: # volume.reiser4 /mnt If the volume is unbalanced, then simply complete balancing manually: # volume.reiser4 -b /mnt Otherwise, check number of bricks in your LV. Most likely that it is the same as it was before the failed operation. In this case simply repeat the operation of adding a brick from scratch. Upon successful completion update your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]. That is, increment (#2), add info about the new brick to (#3) and remove records at (#4). = Removing a data brick from LV = At any time you are able to remove a data brick from your LV. You can do it in parallel with regular file operations executing on this volume. 
Make sure, however, that no other volume operation (e.g. adding a brick) is in progress on your volume; otherwise your operation will fail with EBUSY. Obviously, removing a brick will decrease the abstract capacity of your LV. Note that the other bricks should have enough space to store all data blocks of the brick you want to remove; otherwise the removal operation will return an error (ENOSPC). Suppose you want to remove brick /dev/vdb2 from your LV mounted at /mnt. Update your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration] with the UUID and name of the brick you want to remove (item #4). To remove a brick, simply pass its name as an argument to the option "-r" and specify the logical volume by its mount point:

# volume.reiser4 -r /dev/vdb2 /mnt

The procedure of brick removal automatically invokes re-balancing, which distributes the data of the brick to be removed among the other bricks, so that the resulting distribution is also fair. The portion of data stripes moved during such rebalancing is equal to the relative capacity of the brick to be removed (that is, to the portion of capacity that the brick added to the LV's capacity). Like other user-space applications, the command above can complete with an error. In this case check the status of your LV:

# volume.reiser4 /mnt

If the volume is unbalanced, then simply complete balancing manually:

# volume.reiser4 -b /mnt

Otherwise, check the number of bricks in your logical volume - it should be the same as before the failed operation. The error -ENOSPC indicates that the free space on the other bricks is not enough to fit all the data of the brick you want to remove.
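The two properties just described - the rebalancing cost proportional to a brick's relative capacity (for both adding and removing), and the -ENOSPC feasibility condition for removal - can be sketched with made-up numbers. This is only an illustration of the stated rules, not the file system's actual code:

```python
def fraction_moved(other_capacities, brick_capacity):
    """Fraction of data stripes moved when a brick is added to (or removed
    from) a volume: the brick's relative share of the total capacity."""
    return brick_capacity / (sum(other_capacities) + brick_capacity)

def removal_feasible(bricks, victim):
    """bricks: name -> (block_count, blocks_used). Removal can succeed only
    if the remaining bricks' free blocks can absorb the victim's used blocks
    (the ~5% reserve is ignored here for simplicity)."""
    victim_used = bricks[victim][1]
    free_elsewhere = sum(total - used
                         for name, (total, used) in bricks.items()
                         if name != victim)
    return free_elsewhere >= victim_used

# Adding a 500-unit brick to a volume of capacities 1000+1000+2000
# moves 500/4500 = 1/9 of the data:
print(fraction_moved([1000, 1000, 2000], 500))

# 500 + 600 blocks free elsewhere >= 600 blocks used on the victim:
bricks = {"vdb2": (1000, 600), "vdc1": (2000, 1500), "vdd1": (1000, 400)}
print(removal_feasible(bricks, "vdb2"))
```

The smaller the relative capacity of the brick being added or removed, the cheaper the rebalancing, exactly as the text states.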
On success update your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]: remove the information about the brick /dev/vdb2 at #3 and #4. Check your kernel logs: they should contain a message that brick /dev/vdb2 has been unregistered. Now the device /dev/vdb2 doesn't belong to the logical volume any more, and you can reuse it for other purposes (re-format it, etc).

= Changing brick's capacity =

At any time (assuming that no other volume operation is in progress) you can change the abstract capacity of any brick to some new value different from 0. Changing capacity always changes the volume partitioning and therefore breaks fairness of distribution, so Reiser5 automatically launches rebalancing to make sure that the resulting distribution is fair for the new set of capacities. In particular, increasing a brick's capacity will move some data from other bricks to the brick whose capacity was increased; decreasing a brick's capacity will move some data from the brick whose capacity was decreased to other bricks. To change the abstract capacity of a brick /dev/vdb1 to a new value (e.g. 200000), simply run

# volume.reiser4 -z /dev/vdb1 -c 200000 /mnt

Read this as "resize brick /dev/vdb1 to new capacity 200000 in the volume mounted at /mnt". The operation of changing capacity can return an error. Most likely it is -ENOSPC, which is a side effect of concurrent regular file writes. In this case check the status of your LV. If it is unbalanced, then consider removing some files from your LV and complete balancing by running

# volume.reiser4 -b /mnt

Otherwise, repeat the operation from scratch. Comment: changing a brick's capacity to 0 is undefined and will return an error; consider the brick removal operation instead.

= Operations with meta-data brick =

The meta-data brick can also contain data stripes and participate in data distribution like the other (data) bricks.
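The effect of changing one brick's capacity, described above, is a change of the volume partitioning, which rebalancing must then make the real distribution match. A numerical sketch with made-up capacities (not the kernel's algorithm):

```python
def repartition(capacities, brick_index, new_capacity):
    """Return (old_partitioning, new_partitioning) after changing one brick's
    abstract capacity. Rebalancing then moves data so that each brick's share
    of data matches its entry in the new partitioning."""
    old_total = sum(capacities)
    old_part = [c / old_total for c in capacities]
    caps = list(capacities)
    caps[brick_index] = new_capacity
    new_total = sum(caps)
    new_part = [c / new_total for c in caps]
    return old_part, new_part

# Doubling the first brick's capacity shifts its share from 1/2 to 2/3,
# so rebalancing moves data toward it:
old, new = repartition([100000, 100000], 0, 200000)
print(old)   # [0.5, 0.5]
print(new)   # roughly [0.667, 0.333]
```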
Thus all the volume operations described above are also applicable to the meta-data brick. Note, however, that it is impossible to completely remove the meta-data brick from the logical volume for obvious reasons (meta-data needs to be stored somewhere), so the brick removal operation applied to the meta-data brick actually removes it from the Data Storage Array (DSA), not from the logical volume. The DSA is the subset of the LV consisting of the bricks participating in data distribution. Once you remove the meta-data brick from the DSA, that brick will be used only to store meta-data. The operation of adding a brick, applied to the meta-data brick, returns it to the DSA. Important: Reiser5 doesn't count busy data and meta-data blocks separately. So, in contrast with data bricks (which contain only data), you are not able to find out the real space occupied by data blocks on the meta-data brick - Reiser5 knows only the total space occupied. To check the status of the meta-data brick simply run

# volume.reiser4 /mnt

and compare the values of "bricks total" and "bricks in DSA". If they are equal, then the meta-data brick participates in data distribution. Otherwise, "bricks total" should be 1 more than "bricks in DSA" - this indicates that the meta-data brick doesn't participate in data distribution (and therefore doesn't contain data blocks). Note that other cases are impossible: for data bricks, participation in the LV and in the DSA are always equivalent.

= Unmounting a logical volume =

To terminate a mount session just issue the usual umount command with the mount point specified. Note that after unmounting the volume all bricks by default remain registered in the system until system shutdown.
If you want to unregister a brick before system shutdown, then simply issue the following command:

# volume.reiser4 -u BRICK_NAME

= Deploying a logical volume after correct unmount =

Make sure (by checking your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]) that all bricks of the volume are registered in the system. To register a brick issue the following command:

# volume.reiser4 -g BRICK_NAME

The list of all volumes and bricks registered in the system can be found in the output of the following command:

# volume.reiser4 -l

Issue the usual mount(8) command against one of the bricks of your volume. It is recommended to issue it against the meta-data brick. NOTE: Reiser5 will refuse to mount a logical volume when a wrong (incomplete or redundant) set of bricks is registered in the system. A redundant set of bricks appears, for example, when you mistakenly register a brick that was earlier removed from the logical volume.

= Deploying a logical volume after correct shutdown =

To mount your LV, first make sure that all its bricks (data and meta-data) are registered in the system. Important: Reiser5 will refuse to mount a logical volume when a wrong (incomplete or redundant) set of bricks is registered in the system. A redundant set of bricks appears, for example, when you mistakenly register a brick that was removed from the logical volume. For this reason we strongly recommend keeping track of your LV: store its configuration somewhere, but not on this volume! And don't forget to update that configuration after _every_ volume operation. If you lose the configuration of your LV and don't remember it (which is most likely for large volumes), then it will be rather painful to restore: currently there are no tools to manage offline logical volumes.
So users are expected to do this on their own; it is not at all difficult. To register a brick in the system use the following command:

# volume.reiser4 -g BRICK_NAME

To print the list of all registered bricks use

# volume.reiser4 -l

To mount your LV simply issue a mount(8) command against one of the bricks of your LV. We recommend issuing it against the meta-data brick. Comment: Reiser5 always tries to register the brick which is passed to the mount command as an argument, so it is not necessary to pre-register the brick you want to issue the mount command against.

= Deploying a logical volume after hard reset or system crash =

If no volume operations were interrupted by the hard reset or system crash, then just follow the instructions in this [https://reiser4.wiki.kernel.org/index.php?title=Logical_Volumes_Administration#Deploying_a_logical_volume_after_correct_shutdown section]. In Reiser5 only a restricted number of bricks participates in every transaction; the maximum number of such bricks can be specified by the user. At mount time a transaction replay procedure will be launched on each such brick independently, in parallel. Depending on the kind of interrupted volume operation, perform one of the following actions:

== Adding a brick was interrupted ==

Check your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]. Register the old set of bricks (that is, the set of bricks that the volume had before the operation) and try to mount. In the case of an error, register also the brick you wanted to add and try to mount again.
Check the status of your LV by running

# volume.reiser4 /mnt

If the volume is unbalanced, then complete balancing manually by running

# volume.reiser4 -b /mnt

Check "bricks total" of your LV in the output of

# volume.reiser4 /mnt

Compare it with the old number of bricks in the [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]. The new value should be the old one incremented by 1. If the number of bricks is the same, then your operation of adding a brick was completely rolled back by the transaction manager, so you need to repeat it from scratch. Otherwise, your operation was successfully completed - update your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration] accordingly.

== Brick removal was interrupted ==

Check your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]. Register the new set of bricks (that is, the set of bricks without the brick you wanted to remove). Try to mount the volume. In the case of an error, register also the brick you wanted to remove and try to mount again. Check the status of your LV:

# volume.reiser4 /mnt

If the volume is unbalanced, then complete balancing manually by running

# volume.reiser4 -b /mnt

Otherwise, check the total number of bricks in your LV: if it is the same as before the removal, then your removal operation was completely rolled back by the transaction manager, and you will need to repeat it from scratch. Comment: after successful completion of balancing, the brick will be automatically removed from the volume and unregistered.
Verify this by checking the status of your LV and the list of registered bricks:

# volume.reiser4 /mnt
# volume.reiser4 -l

Upon successful completion update your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration] accordingly.

== Another volume operation was interrupted ==

Using the volume configuration, register the new set of bricks and try to mount the volume. The mount should succeed. Check the status of your LV:

# volume.reiser4 /mnt

If the volume is unbalanced, then complete balancing manually by running

# volume.reiser4 -b /mnt

= LV monitoring =

Common info about the LV mounted at /mnt:

# volume.reiser4 /mnt

ID: Volume UUID
volume: ID of the plugin managing the volume
distribution: ID of the distribution plugin
stripe: Stripe size in bytes
segments: Number of hash space segments (for distribution)
bricks total: Total number of bricks in the volume
bricks in DSA: Number of bricks participating in data distribution
balanced: Balanced status of the volume

Info about its brick of index J:

# volume.reiser4 -p J /mnt

internal ID: Brick's "internal ID" and its status in the volume
external ID: Brick's UUID
device name: Name of the block device associated with the brick
block count: Size of the block device in blocks
blocks used: Total number of occupied blocks on the device
system blocks: Minimal possible number of busy blocks on that device
data capacity: Abstract capacity of the brick
space usage: Portion of occupied blocks on the device
in DSA: Participation in regular data distribution
is proxy: Participation in data tiering (Burst Buffers, etc)

Comment: when retrieving a brick's info, make sure that no volume operations are in progress on that volume; otherwise the command above will return an error (EBUSY). WARNING: brick info obtained this way is not necessarily the most recent.
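For scripting, the per-brick report can be read into a dictionary. The sketch below assumes the simple "key: value" layout of the field list above; the real output of volume.reiser4 may differ, so treat this as a starting point:

```python
def parse_brick_info(text):
    """Parse 'key: value' lines such as those printed by volume.reiser4 -p J.
    The layout is assumed from the field list above; adjust for real output."""
    info = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip():
            info[key.strip()] = value.strip()
    return info

# Hypothetical report for one data brick:
sample = """device name: /dev/vdc1
block count: 2621440
blocks used: 1657203
data capacity: 2621069
in DSA: yes"""

info = parse_brick_info(sample)
print(info["blocks used"])    # '1657203'
print(info["device name"])    # '/dev/vdc1'
```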
To get up-to-date info, run sync(1) and make sure that no regular file operations are in progress.

= Checking free space =

To check the number of available free blocks on a volume mounted at /mnt, make sure that no regular file operations, as well as no volume operations, are in progress on that volume, then run

# sync
# df --block-size=4K /mnt

To check the number of free blocks on the brick of index J run

# volume.reiser4 -p J /mnt

then calculate the difference between "block count" and "blocks used". Comment: not all free blocks on a brick/volume are available for use. The number of available free blocks is always ~95% of the total number of free blocks (Reiser4 reserves 5% to make sure that regular file truncate operations won't fail). NOTE: volume.reiser4 shows the total number of free blocks, whereas df(1) shows the number of available free blocks. The "space usage" statistic shows the portion of busy blocks on an individual brick. For the reasons explained above, "space usage" on any brick can not be more than 0.95.

= Checking quality of data distribution =

Quality of data distribution is a measure of the deviation of the real data space usage from the ideal one defined by the volume partitioning. The smaller the deviation, the better the distribution quality. Checking the quality of distribution makes sense only when your volume partitioning is space-based, or when it coincides with the space-based one. If your partitioning is throughput-based and doesn't coincide with the space-based one, then the quality of the actual data distribution can be rather bad: in this case the file system cares about low-performance devices not becoming a bottleneck, and effective space usage is not a high priority. Checking the quality of data distribution is based on the free blocks accounting provided by the file system.
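The free-space accounting above (total free blocks vs. available free blocks, with the ~5% reserve) amounts to the following arithmetic. The brick size used here, 2621440 blocks, is simply a 10G device divided into 4K blocks:

```python
RESERVE = 0.05   # Reiser4 keeps ~5% of free blocks so truncation cannot fail

def free_blocks(block_count, blocks_used):
    # Total free blocks on a brick: "block count" - "blocks used",
    # as reported by volume.reiser4 -p J.
    return block_count - blocks_used

def available_blocks(block_count, blocks_used):
    # What df(1) reports: roughly 95% of the total free blocks.
    return int((1 - RESERVE) * free_blocks(block_count, blocks_used))

print(free_blocks(2621440, 1657203))        # total free blocks
print(available_blocks(2621440, 1657203))   # blocks actually usable
```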
Note that the file system doesn't count busy data and meta-data blocks separately, so you are not able to find the real data space usage - and hence to check the quality of distribution - when the meta-data brick contains data blocks. To check the quality of distribution:

* (1) make sure that the meta-data brick doesn't contain data blocks;
* (2) make sure that no regular file and volume operations are currently in progress;
* (3) find the "blocks used", "system blocks" and "data capacity" statistics for each data brick:

# sync
# volume.reiser4 -p 1 /mnt
...
# volume.reiser4 -p N /mnt

* (4) find the real data space usage on each brick;
* (5) calculate the partitioning and the ideal data space usage on each data brick;
* (6) find the deviation of (4) from (5).

Example. Let's build an LV of 3 bricks (one 10G meta-data brick vdb1, and two data bricks: vdc1 (10G) and vdd1 (5G)) with space-based partitioning:

# VOL_ID=`uuid -v4`
# echo "Using uuid $VOL_ID"
# mkfs.reiser4 -U $VOL_ID -y -t 256K /dev/vdb1
# mkfs.reiser4 -U $VOL_ID -y -a -t 256K /dev/vdc1
# mkfs.reiser4 -U $VOL_ID -y -a -t 256K /dev/vdd1
# mount /dev/vdb1 /mnt

Fill the meta-data brick with data:

# dd if=/dev/zero of=/mnt/myfile bs=256K
No space left on device...

Add the data bricks /dev/vdc1 and /dev/vdd1 to the volume:

# volume.reiser4 -a /dev/vdc1 /mnt
# volume.reiser4 -a /dev/vdd1 /mnt

Move all data blocks to the newly added bricks:

# volume.reiser4 -r /dev/vdb1 /mnt
# sync

Now the meta-data brick doesn't contain data blocks (only meta-data ones), so we can calculate the quality of data distribution:

# volume.reiser4 /mnt -p0
blocks used: 503
# volume.reiser4 /mnt -p1
blocks used: 1657203
system blocks: 115
data capacity: 2621069
# volume.reiser4 /mnt -p2
blocks used: 833001
system blocks: 73
data capacity: 1310391

Based on the statistics above, calculate the quality of distribution.
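The calculation can also be cross-checked with a short script over the reported statistics; Python is used here purely as a calculator for the steps (4)-(6) above:

```python
# Per-data-brick statistics from the volume.reiser4 -p1 / -p2 output above
used     = [1657203, 833001]      # "blocks used"
system   = [115, 73]              # "system blocks"
capacity = [2621069, 1310391]     # "data capacity"

C = [c / sum(capacity) for c in capacity]     # relative capacities (partitioning)
R = [u - s for u, s in zip(used, system)]     # real data space usage per brick
R_total = sum(R)
ideal = [ci * R_total for ci in C]            # ideal data space usage per brick
rel_dev = [(r - i) / R_total for r, i in zip(R, ideal)]
Q = 1 - max(abs(d) for d in rel_dev)          # quality of distribution
print(round(Q, 4))   # about 0.9988
```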
Total data capacity of the volume:

C = 2621069 + 1310391 = 3931460

Relative capacities of the data bricks:

C1 = 2621069 / (2621069 + 1310391) = 0.6667
C2 = 1310391 / (2621069 + 1310391) = 0.3333

Real space usage on the data bricks (blocks used - system blocks):

R1 = 1657203 - 115 = 1657088
R2 = 833001 - 73 = 832928

Space usage on the volume:

R = R1 + R2 = 1657088 + 832928 = 2490016

Ideal data space usage on the data bricks:

I1 = C1 * R = 0.6667 * 2490016 = 1660094
I2 = C2 * R = 0.3333 * 2490016 = 829922

Deviation:

D = (R1, R2) - (I1, I2) = (-3006, 3006)

Relative deviation:

D/R = (-0.0012, 0.0012)

Quality of distribution:

Q = 1 - max(|D1|, |D2|)/R = 1 - 0.0012 = 0.9988

Comment. For any specified number of bricks N and quality of distribution Q it is possible to find a configuration of a logical volume composed of N bricks such that the quality of distribution on that volume is better than Q.

Comment. The quality of distribution Q doesn't depend on the number of bricks in the logical volume. This is a theorem, which can be strictly proven.

= FAQ =

Q. What happens if I lose a device-component of my logical volume (due to a breakdown, etc)?

A. The bodies of some of your regular files will become "punched" in random places. The portion of such files depends on the relative capacity of the lost brick, on the number of bricks in the logical volume, and on other factors. Fsck will be able to detect and remove such files with corrupted bodies. Nevertheless, we recommend considering mirroring your bricks (e.g. by software or hardware RAID-1) to avoid such highly unpleasant situations.

A logical volume (LV) can be composed of any number of block devices that differ in physical and geometric parameters. However, the optimal configuration (true parallelism) imposes some restrictions and dependencies on the sizes of such devices. WARNING: this stuff is not stable.
Don't put important data on logical volumes managed by software of release number 5.X.Y. Also, don't mount your old partitions in kernels with Reiser4 of SFRN 5.X.Y before its stabilization. IMPORTANT: currently there are no tools to manage Reiser5 logical volumes off-line, so it is strongly recommended to save/update the configuration of your LV in a file which doesn't belong to that volume.

= Basic definitions. Volume configuration. Brick's capacity. Partitioning. Fair distribution. Balancing =

The basic configuration of a logical volume is the following information:

1) volume UUID;
2) number of bricks in the volume;
3) list of brick names or UUIDs in the volume;
4) UUID or name of the brick to be added/removed (if any); that brick is not counted in (2) and (3).

Item #4 is needed to handle incomplete operations interrupted for various reasons (system crash, hard reset, etc). For each volume its configuration should be stored somewhere (but not on that volume!) and properly updated before and after each volume operation performed on that volume. We make the user responsible for this. The volume configuration is needed to facilitate deploying the volume.

'''Abstract capacity''' (or simply capacity) of a brick is a positive integer number. Capacity is a brick property defined by the user. Don't confuse it with the size of the block device; think of it as the brick's "weight" in some units. It is the user who decides which property of the brick to assign as its abstract capacity and in which units. In particular, it can be the size of the block device in kilobytes, or its size in megabytes, or its throughput in M/sec, or another geometric or physical parameter of the device associated with the brick. It is important that the capacities of all bricks of the same logical volume are measured in the same units. Also, it would be utterly pointless to assign different properties as abstract capacities for bricks of the same LV.
For example, the size of the block device for one brick, and the disk bandwidth for another one. The capacity of each brick is initialized by the mkfs utility. By default it is calculated as the number of free blocks on the device at the very end of the formatting procedure; for the meta-data brick it is calculated as 70% of that amount. The capacity of any brick can be changed on-line by the user.

'''Capacity of a logical volume''' is defined as the sum of the capacities of its component bricks.

'''Relative capacity of a brick''' is the ratio of the brick's capacity to the volume's capacity. Relative capacity defines the portion of IO requests that will be issued against that brick. The array of relative capacities (C1, C2, ...) of all bricks is called the volume partitioning. Obviously, C1 + C2 + ... = 1.

'''(Real) data space usage''' on a brick is the number of data blocks stored on that brick.

'''Ideal (or expected) data space usage''' on a brick is T*C, where T is the total number of data blocks stored in the volume and C is the relative capacity of the brick.

It is recommended to compose volumes so that the space-based partitioning coincides with the throughput-based one - that is the optimal volume configuration, which provides true parallelism. If this is impossible for some reason, then choose a preferred partitioning method (space-based or throughput-based). Note that space-based partitioning saves volume space, whereas throughput-based partitioning saves volume throughput. When performing regular file operations, Reiser5 distributes data stripes throughout the volume evenly and fairly: the portion of IO requests issued against each brick is equal to its relative capacity, that is, to the portion of capacity that the brick adds to the total volume capacity. Most volume operations are accompanied by rebalancing, which keeps the distribution fair.
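The definitions above translate directly into formulas. A minimal sketch with hypothetical capacities:

```python
def partitioning(capacities):
    """Relative capacities (C1, C2, ...) of the bricks; they sum to 1."""
    total = sum(capacities)
    return [c / total for c in capacities]

def ideal_usage(capacities, T):
    """Ideal data space usage per brick: T * C_i, where T is the total
    number of data blocks stored in the volume."""
    return [T * c for c in partitioning(capacities)]

# Two data bricks whose capacities are, say, their free-block counts:
caps = [2621069, 1310391]
print(partitioning(caps))            # roughly [0.667, 0.333]
print(ideal_usage(caps, 2490016))    # ideal block counts per brick
```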
For example, adding a brick to a logical volume changes its partitioning and hence breaks fairness of the distribution, so some data stripes need to be moved to the new brick to make the distribution fair again. Likewise, you can not simply remove a brick from a logical volume - all data stripes have to be moved from that brick to the other bricks of the logical volume. Every time the user performs a volume operation, Reiser5 marks the LV as "not balanced"; after successful balancing the status of the LV is changed back to "balanced". If the balancing procedure fails for some reason, it should be resumed manually (with the volume.reiser4 utility). It is allowed to perform regular file operations on an unbalanced LV. However, in this case: a) we don't guarantee a good quality of data distribution on your LV; b) you won't be able to perform volume operations on your LV except balancing - any other volume operation will return an error (EBUSY). So don't forget to bring your LV to the balanced state as soon as possible!

= Prepare Software and Hardware =

Build, install and boot a kernel with Reiser4 of software framework release number 5.X.Y. Kernel patches can be found [https://sourceforge.net/projects/reiser4/files/v5-unstable/ here]. Note that the Linux kernel and GNU utilities still recognize the testing stuff as "Reiser4". Make sure the following message appears in the kernel logs: "Loading Reiser4 (Software Framework Release: 5.X.Y)". Build and install the latest [https://sourceforge.net/projects/reiser4/files/reiser4-utils/libaal/ libaal]. Download, build and install the latest version 2.A.B of the [https://sourceforge.net/projects/reiser4/files/v5-unstable/ Reiser4progs package]. Make sure that the utility for managing logical volumes is installed (as a part of the reiser4progs package) on your machine:

# volume.reiser4 -?

= Creating a logical volume =

Start by choosing a unique ID (uuid) for your volume. By default it is generated by the mkfs utility; however, the user can generate it with a suitable tool (e.g.
uuid(1)) and store in an environment variable for convenience: # VOL_ID=`uuid -v4` # echo "Using uuid $VOL_ID" Choose a stripe size for your logical volume. For a good quality of distribution it is recommended that stripe doesn't exceed 1/10000 of volume size. On the other hand, too small stripes will increase space consumption on your meta-data brick. In our example we choose stripe size 256K: # STRIPE=256K # echo "Using stripe size $STRIPE" Start from creating the first brick of your volume - meta-data brick, passing volume-ID and stripe size to mkfs.reiser4 utility: # mkfs.reiser4 -U $VOL_ID -t $STRIPE /dev/vdb1 Currently only one meta-data brick per volume is supported, so it is recommended that size of block device for meta-data brick in not too small. In most cases it will be enough, if your meta-data brick is not smaller than 1/200 of maximal volume size. For example, 100G meta-data brick will be able to service ~20T logical volume. Data and meta-data bricks don't differ from the standpoint of disk format, and there is no special option to inform mkfs utility that we want to create exactly meta-data brick: the first brick in the volume automatically becomes a meta-data brick, and other bricks are interpreted as data bricks. Mount your initial logical volume consisting of one meta-data brick: # mount /dev/vdb1 /mnt Find a record about your volume in the output of the following command: # volume.reiser4 -l Create configuration of your logical volume (its definition is above) and store it somewhere, but not on that volume! Your logical volume is now on-line and ready to use. You can perform regular file operations and volume operations (e.g. add a data brick to your LV). = Adding a data brick to LV = At any time you are able to add a data brick to your LV. You can do it in parallel with regular file operations executing on this volume. Make sure, however, that there is no other volume operations (e.g. 
removing a brick) over your volume in progress, otherwise your operation will fail with EBUSY. Obviously, adding a brick will increase capacity of your volume. Choose a block device for the new data brick. Make sure that it is not too large, or too small. Capacities of any 2 bricks of the same logical volume can not differ more than 2^19 (~1 million) times. E.g. your logical volume can not contain both, 1M and 2T bricks. Any attempts to add a brick of improper capacity will fail with error. Format it with the same volume ID and stripe size, as you used for meta-data brick, but specify also "-a" option (to not restrict data capacity). # mkfs.reiser4 -U $VOL_ID -t $STRIPE -a /dev/vdb2 Important: it is important that data brick is formatted with the same volume ID and stripe size, as the meta-data brick of your logical volume. Otherwise, operation of adding a data brick will fail. Update item #4 of your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration] with UUID or name of the brick you want to add. To add a brick simply pass its name as an argument for the option "-a" and specify your LV via its mount point: # volume.reiser4 -a /dev/vdb2 /mnt The procedure of adding a brick automatically invokes re-balancing, which moves a portion of data stripes to the newly added brick (so that the resulted distribution will fair). Portion of data blocks moved during such rebalancing is equal to the relative capacity of the new brick, that is to the portion of capacity that the new brick adds to updated LV's capacity. This important property defines the cost of balancing procedure. If the portion of capacity added by a brick is small, then number of stripes moved during balancing is also small. 
Like other user-space utilities, the operation of adding a brick can return error, even in the assumption that the brick you wanted to add is properly formatted. In this case check the status of your LV: # volume.reiser4 /mnt If the volume is unbalanced, then simply complete balancing manually: # volume.reiser4 -b /mnt Otherwise, check number of bricks in your LV. Most likely that it is the same as it was before the failed operation. In this case simply repeat the operation of adding a brick from scratch. Upon successful completion update your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]. That is, increment (#2), add info about the new brick to (#3) and remove records at (#4). = Removing a data brick from LV = At any time you are able to remove a data brick from your LV. You can do it in parallel with regular file operations executing on this volume. Make sure, however, that there is no other volume operations (e.g. adding a brick) over your volume in progress, otherwise your operation will fail with EBUSY. Obviously, removing a brick will decrease abstract capacity of your LV. Note that other bricks should have enough space to store all data blocks of the brick you want to remove, otherwise, the removal operation will return error (ENOSPC). Suppose you want to remove brick /dev/vdb2 from your LV mounted at /mnt. Update your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration] with the UUID and name of the brick you want to remove (#item #4). 
To remove a brick simply pass its name as an argument for option "-r" and specify the logical volume by its mount point: # volume.reiser4 -r /dev/vdb2 /mnt The procedure of brick removal automatically invokes re-balancing, which distributes data of the brick to be removed among other bricks, so that resulted distribution is also fair. Portion of data stripes moved during such rebalancing is equal to the relative capacity of the brick to be removed (that it to the portion of capacity that the brick added to LV's capacity). It can happen, that the command above completes with error (like other user-space applications). In this case check the status of your LV: # volume.reiser4 /mnt If volume is not balanced, then simply complete balancing manually: # volume.reiser4 -b /mnt Otherwise, check the number of the bricks in your logical volume - it should be the same as before the failed operation. The error -ENOSPC indicates that free space on other bricks is not enough to fit all the data of the brick you want to remove. On success update your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]: remove the information about the brick /dev/vdb2 at #3 and #4. Check your kernel logs: it should contain a message that brick /dev/vdb2 has been unregistered. Now device /dev/vdb2 doesn't belong to the logical volume any more, and you can reuse it for other purposes (re-format, etc). = Changing brick's capacity = At any time (in the assumption that no other volume operation is in progress) you can change abstract capacity of any brick to some new value, different from 0. Changing capacity always changes volume partitioning, and therefore, breaks fairness of distribution, so Reiser5 automatically launches rebalancing to make sure that resulted distribution is fair for the new set of capacities. 
In particular, increasing bricks capacity will move some data from other bricks to the brick, whose capacity was increased. Decreasing bricks capacity will move some data from the brick, whose capacity was decreased, to other bricks. To change abstract capacity of a brick /dev/vdb1 to a new value (e.g. 200000), simply run # volume.reiser4 -z /dev/vdb1 -c 200000 /mnt Pronounced as "resize brick /dev/vdb1 to new capacity 200000 in volume mounted at /mnt". The operation of changing capacity can return error. Most likely, it is -ENOSPC, which is a side effect of concurrent regular file writes. In this case check the status of your LV. If it is unbalanced, then consider removing some files from your LV and complete balancing by running # volume.reiser4 -b /mnt Otherwise, repeat the operation from scratch. Comment. Changing bricks capacity to 0 is undefined and will return error. Consider brick removal operation instead. = Operations with meta-data brick = Meta-data brick can also contain data stripes and participate in data distribution like other data bricks. So that all the volume operations described above are also applicable to meta-data brick. Note, however, that it is impossible to completely remove meta-data brick from the logical volume for obvious reasons (meta-data need to be stored somewhere), so brick removal operation applied to the meta-data brick actually removes it from Data-Storage Array (DSA), not from the logical volume. DSA is a subset of LV consisting of bricks, participating in data distribution. Once you remove meta-data brick from DSA, that brick will be used only to store meta-data. Operation of adding a brick, being applied to a meta-data brick, returns the last one back to DSA. Important: Reiser5 doesn't count busy data and meta-data blocks separately. So in contrast with data bricks (which contain only data) you are not able to find out real space occupied by data blocks on the meta-data brick - Reiser5 knows only total space occupied. 
To check the status of the meta-data brick, run

 # volume.reiser4 /mnt

and compare the values of "bricks total" and "bricks in DSA". If they are equal, the meta-data brick participates in data distribution. Otherwise, "bricks total" will be 1 more than "bricks in DSA", indicating that the meta-data brick doesn't participate in data distribution (and therefore contains no data blocks). No other cases are possible: for data bricks, membership in the LV and in the DSA is always equivalent.

= Unmounting a logical volume =

To terminate a mount session, issue the usual umount command with the mount point specified. Note that after unmounting the volume, all bricks by default remain registered in the system until system shutdown. If you want to unregister a brick before system shutdown, issue the following command:

 # volume.reiser4 -u BRICK_NAME

= Deploying a logical volume after correct unmount =

Make sure (by checking your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]) that all bricks of the volume are registered in the system. To register a brick, issue the following command:

 # volume.reiser4 -g BRICK_NAME

The list of all volumes and bricks registered in the system can be found in the output of:

 # volume.reiser4 -l

Then issue the usual mount(8) command against one of the bricks of your volume. It is recommended to issue it against the meta-data brick.

NOTE: Reiser5 will refuse to mount a logical volume when a wrong (incomplete or redundant) set of bricks is registered in the system. A redundant set of bricks appears, for example, when you mistakenly register a brick that was earlier removed from the logical volume.

= Deploying a logical volume after correct shutdown =

To mount your LV, first make sure that all its bricks (data and meta-data) are registered in the system.
Important: Reiser5 will refuse to mount a logical volume when a wrong (incomplete or redundant) set of bricks is registered in the system. A redundant set of bricks appears, for example, when you mistakenly register a brick that was removed from the logical volume. For this reason we strongly recommend keeping track of your LV: store its configuration somewhere, but not on that volume! And don't forget to update that configuration after _every_ volume operation. If you lose the configuration of your LV and don't remember it (which is most likely for large volumes), it will be rather painful to restore: currently there are no tools to manage offline logical volumes, so users have to do this bookkeeping on their own. It is not at all difficult.

To register a brick in the system, use:

 # volume.reiser4 -g BRICK_NAME

To print a list of all registered bricks, use:

 # volume.reiser4 -l

To mount your LV, issue a mount(8) command against one of its bricks. We recommend issuing it against the meta-data brick.

Comment. Reiser5 always tries to register the brick that is passed to the mount command as an argument, so it is not necessary to pre-register the brick against which you issue the mount command.

= Deploying a logical volume after hard reset or system crash =

If no volume operations were interrupted by the hard reset or system crash, just follow the instructions in this [https://reiser4.wiki.kernel.org/index.php?title=Logical_Volumes_Administration#Deploying_a_logical_volume_after_correct_shutdown section].

In Reiser5 only a restricted number of bricks participate in each transaction; the maximal number of such bricks can be specified by the user. At mount time a transaction replay procedure is launched on each such brick independently, in parallel.
Depending on the kind of interrupted volume operation, perform one of the following actions:

== Adding a brick was interrupted ==

Check your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]. Register the old set of bricks (that is, the set of bricks that the volume had before the operation) and try to mount. In the case of an error, register also the brick you wanted to add and try to mount again.

Check the status of your LV by running

 # volume.reiser4 /mnt

If the volume is unbalanced, complete the balancing manually by running

 # volume.reiser4 -b /mnt

Check "bricks total" of your LV in the output of

 # volume.reiser4 /mnt

and compare it with the old number of bricks in the [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]. The new value should exceed the old one by 1. If the number of bricks is the same, the operation of adding a brick was completely rolled back by the transaction manager, and you need to repeat it from scratch. Otherwise the operation completed successfully - update your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration] accordingly.

== Brick removal was interrupted ==

Check your volume configuration. Register the old set of bricks (that is, the set of bricks that the volume had before the interrupted operation) except the brick you wanted to remove. Try to mount the volume. In the case of an error, register also the brick you wanted to remove and try to mount again.
Check the status of your LV:

 # volume.reiser4 /mnt

If the volume is unbalanced, complete the balancing manually by running

 # volume.reiser4 -b /mnt

Comment. After successful completion of the balancing, the brick is automatically removed from the volume. Make sure of it by checking the status of your LV:

 # volume.reiser4 /mnt

Update your volume configuration accordingly.

== Another volume operation was interrupted ==

Using the volume configuration, register the new set of bricks and try to mount the volume. The mount should be successful. Check the status of your LV:

 # volume.reiser4 /mnt

If the volume is unbalanced, complete the balancing manually by running

 # volume.reiser4 -b /mnt

= LV monitoring =

Common info about the LV mounted at /mnt:

 # volume.reiser4 /mnt

 ID:             Volume UUID
 volume:         ID of the plugin managing the volume
 distribution:   ID of the distribution plugin
 stripe:         Stripe size in bytes
 segments:       Number of hash space segments (for distribution)
 bricks total:   Total number of bricks in the volume
 bricks in DSA:  Number of bricks participating in data distribution
 balanced:       Balanced status of the volume

Info about its brick of index J:

 # volume.reiser4 -p J /mnt

 internal ID:    Brick's "internal ID" and its status in the volume
 external ID:    Brick's UUID
 device name:    Name of the block device associated with the brick
 block count:    Size of the block device in blocks
 blocks used:    Total number of occupied blocks on the device
 system blocks:  Minimal possible number of busy blocks on that device
 data capacity:  Abstract capacity of the brick
 space usage:    Portion of occupied blocks on the device
 in DSA:         Participation in regular data distribution
 is proxy:       Participation in data tiering (Burst Buffers, etc.)

Comment. When retrieving a brick's info, make sure that no volume operations on that volume are in progress; otherwise the command above will return an error (EBUSY).

WARNING. Brick info obtained this way is not necessarily the most recent.
To get up-to-date info, run sync(1) and make sure that no regular file operations are in progress.

= Checking free space =

To check the number of available free blocks on a volume mounted at /mnt, make sure that no regular file operations or volume operations are in progress on that volume, then run

 # sync
 # df --block-size=4K /mnt

To check the number of free blocks on the brick of index J, run

 # volume.reiser4 -p J /mnt

and calculate the difference between "block count" and "blocks used".

Comment. Not all free blocks on a brick/volume are available for use. The number of available free blocks is always ~95% of the total number of free blocks (Reiser4 reserves 5% to make sure that regular file truncate operations won't fail).

NOTE: volume.reiser4 shows the total number of free blocks, whereas df(1) shows the number of available free blocks. The "space usage" statistic shows the portion of busy blocks on an individual brick. For the reasons explained above, "space usage" on any brick cannot exceed 0.95.

= Checking quality of data distribution =

Quality of data distribution is a measure of the deviation of the real data space usage from the ideal one defined by the volume partitioning. The smaller the deviation, the better the quality of distribution. Checking quality of distribution makes sense only when your volume partitioning is space-based, or coincides with the space-based one. If your partitioning is throughput-based and doesn't coincide with the space-based one, the quality of the actual data distribution can be rather bad: in this case the file system cares about low-performance devices not becoming a bottleneck, and effective space usage is not a high priority.

Checking quality of data distribution is based on the free-block accounting provided by the file system.
Note that the file system doesn't count busy data and meta-data blocks separately, so you cannot find the real data space usage - and hence cannot check the quality of distribution - when the meta-data brick contains data blocks.

To check quality of distribution:

* make sure that the meta-data brick doesn't contain data blocks;
* make sure that no regular file or volume operations are currently in progress;
* find the "blocks used", "system blocks" and "data capacity" statistics for each data brick:
 # sync
 # volume.reiser4 -p 1 /mnt
 ...
 # volume.reiser4 -p N /mnt
* find the real data space usage on each brick;
* calculate the partitioning and the ideal data space usage on each data brick;
* find the deviation of the real usage (item 4) from the ideal one (item 5).

Example. Let's build an LV of 3 bricks (one 10G meta-data brick vdb1 and two data bricks: vdc1 (10G) and vdd1 (5G)) with space-based partitioning:

 # VOL_ID=`uuid -v4`
 # echo "Using uuid $VOL_ID"
 # mkfs.reiser4 -U $VOL_ID -y -t 256K /dev/vdb1
 # mkfs.reiser4 -U $VOL_ID -y -a -t 256K /dev/vdc1
 # mkfs.reiser4 -U $VOL_ID -y -a -t 256K /dev/vdd1
 # mount /dev/vdb1 /mnt

Fill the meta-data brick with data:

 # dd if=/dev/zero of=/mnt/myfile bs=256K
 No space left on device...

Add the data bricks /dev/vdc1 and /dev/vdd1 to the volume:

 # volume.reiser4 -a /dev/vdc1 /mnt
 # volume.reiser4 -a /dev/vdd1 /mnt

Move all data blocks to the newly added bricks:

 # volume.reiser4 -r /dev/vdb1 /mnt
 # sync

Now the meta-data brick contains no data blocks (only meta-data ones), so we can calculate the quality of data distribution:

 # volume.reiser4 /mnt -p0
 blocks used:    503
 # volume.reiser4 /mnt -p1
 blocks used:    1657203
 system blocks:  115
 data capacity:  2621069
 # volume.reiser4 /mnt -p2
 blocks used:    833001
 system blocks:  73
 data capacity:  1310391

Based on the statistics above, calculate the quality of distribution.
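The hand calculation that follows can also be scripted. A minimal Python sketch using the numbers reported above (the formulas follow the definitions in this section; the brick list is typed in by hand, since parsing volume.reiser4 output is outside the scope of this example):

```python
# Quality-of-distribution calculation for the example volume above.
# Inputs: per-data-brick "blocks used", "system blocks", "data capacity"
# as reported by `volume.reiser4 -p J /mnt`.

bricks = [
    # (blocks_used, system_blocks, data_capacity)
    (1657203, 115, 2621069),   # data brick of index 1 (/dev/vdc1)
    (833001,   73, 1310391),   # data brick of index 2 (/dev/vdd1)
]

cap_total = sum(c for _, _, c in bricks)
rel_cap = [c / cap_total for _, _, c in bricks]           # partitioning (C1, C2)
real = [used - sys for used, sys, _ in bricks]            # real usage (R1, R2)
total = sum(real)                                         # R = R1 + R2
ideal = [ci * total for ci in rel_cap]                    # ideal usage (I1, I2)
rel_dev = [(r - i) / total for r, i in zip(real, ideal)]  # relative deviation D/R
quality = 1 - max(abs(d) for d in rel_dev)                # Q

print(f"Q = {quality:.4f}")  # ~0.9988 for the numbers above
```

The hand calculation below rounds the relative capacities to four digits, so its intermediate values differ slightly from the script's; the resulting Q agrees to four decimal places.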
Total data capacity of the volume:

 C = 2621069 + 1310391 = 3931460

Relative capacities of the data bricks:

 C1 = 2621069 / (2621069 + 1310391) = 0.6667
 C2 = 1310391 / (2621069 + 1310391) = 0.3333

Real space usage on the data bricks (blocks used - system blocks):

 R1 = 1657203 - 115 = 1657088
 R2 = 833001 - 73 = 832928

Space usage on the volume:

 R = R1 + R2 = 1657088 + 832928 = 2490016

Ideal data space usage on the data bricks:

 I1 = C1 * R = 0.6667 * 2490016 = 1660094
 I2 = C2 * R = 0.3333 * 2490016 = 829922

Deviation:

 D = (R1, R2) - (I1, I2) = (-3006, 3006)

Relative deviation:

 D/R = (-0.0012, 0.0012)

Quality of distribution:

 Q = 1 - max(|D1|, |D2|)/R = 1 - 0.0012 = 0.9988

Comment. For any specified number of bricks N and quality of distribution Q it is possible to find a configuration of a logical volume composed of N bricks whose quality of distribution is better than Q.

Comment. The quality of distribution Q doesn't depend on the number of bricks in the logical volume. This is a theorem, which can be strictly proven.

= FAQ =

Q. What happens if I lose a device component (due to a breakdown, etc.) of my logical volume?

A. The bodies of some of your regular files will become "punched" in random places. The portion of such files depends on the relative capacity of the lost brick, on the number of bricks in the logical volume, and on other factors. Fsck will be able to detect and remove files with corrupted bodies. Nevertheless, we recommend mirroring your bricks (e.g. with software or hardware RAID-1) to avoid such highly unpleasant situations.

A logical volume (LV) can be composed of any number of block devices differing in physical and geometric parameters. However, the optimal configuration (true parallelism) imposes some restrictions and dependencies on the sizes of such devices.

WARNING: This stuff is not stable.
Don't put important data on logical volumes managed by software with release number 5.X.Y. Also, don't mount your old partitions with kernels carrying Reiser4 of SFRN 5.X.Y before its stabilization.

IMPORTANT: Currently there are no tools to manage Reiser5 logical volumes off-line, so it is strongly recommended to save/update the configuration of your LV in a file which doesn't belong to that volume.

= Basic definitions. Volume configuration. Brick's capacity. Partitioning. Fair distribution. Balancing =

The basic configuration of a logical volume is the following information:

1) Volume UUID;
2) Number of bricks in the volume;
3) List of names or UUIDs of the bricks in the volume;
4) UUID or name of the brick being added/removed (if any). That brick is not counted in (2) and (3).

Item #4 exists to handle incomplete operations interrupted for various reasons (system crash, hard reset, etc.). For each volume, its configuration should be stored somewhere (but not on that volume!) and properly updated before and after each volume operation performed on it. We make the user responsible for this. The volume configuration is needed to facilitate deploying the volume.

'''Abstract capacity''' (or simply capacity) of a brick is a positive integer. Capacity is a brick property defined by the user. Don't confuse it with the size of the block device; think of it as the brick's "weight" in some units. It is the user who decides which property of the brick to assign as its abstract capacity, and in which units. In particular, it can be the size of the block device in kilobytes, or its size in megabytes, or its throughput in MB/sec, or any other geometric or physical parameter of the device associated with the brick. It is important that the capacities of all bricks of the same logical volume are measured in the same units. Also, it would be utterly pointless to assign different properties as abstract capacities for bricks of the same LV.
For example, the size of the block device for one brick and disk bandwidth for another.

The capacity of each brick is initialized by the mkfs utility. By default it is calculated as the number of free blocks on the device at the very end of the formatting procedure; for the meta-data brick it is calculated as 70% of that amount. The capacity of any brick can be changed on-line by the user.

'''Capacity of a logical volume''' is defined as the sum of the capacities of its component bricks.

'''Relative capacity of a brick''' is the ratio of the brick's capacity to the volume's capacity. Relative capacity defines the portion of IO requests that will be issued against that brick. The array of relative capacities (C1, C2, ...) of all bricks is called the volume partitioning. Obviously, C1 + C2 + ... = 1.

'''(Real) data space usage''' on a brick is the number of data blocks stored on that brick.

'''Ideal (or expected) data space usage''' on a brick is T*C, where T is the total number of data blocks stored in the volume and C is the relative capacity of the brick.

It is recommended to compose volumes so that the space-based partitioning coincides with the throughput-based one - this is the optimal volume configuration, which provides true parallelism. If that is impossible for some reason, choose a preferred partitioning method (space-based or throughput-based). Note that space-based partitioning saves volume space, whereas throughput-based partitioning saves volume throughput.

When performing regular file operations, Reiser5 distributes data stripes throughout the volume evenly and fairly. This means that the portion of IO requests issued against each brick is equal to its relative capacity, that is, to the portion of capacity that the brick adds to the total volume capacity. Most volume operations are accompanied by rebalancing, which preserves fairness of distribution.
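The definitions above translate directly into arithmetic. A small illustrative sketch in Python (all capacity values are made-up numbers, not output of any real tool):

```python
# Volume partitioning from abstract capacities (made-up example values).
capacities = [100000, 200000, 50000]     # per-brick abstract capacities
vol_capacity = sum(capacities)           # capacity of the logical volume

# Relative capacities (C1, C2, ...): the volume partitioning.
partitioning = [c / vol_capacity for c in capacities]
assert abs(sum(partitioning) - 1.0) < 1e-9   # C1 + C2 + ... = 1

# Ideal data space usage per brick: T * Ci, where T is the total
# number of data blocks stored in the volume.
T = 70000
ideal_usage = [T * c for c in partitioning]

print([round(u) for u in ideal_usage])  # → [20000, 40000, 10000]
```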
For example, adding a brick to a logical volume changes its partitioning and hence breaks fairness of the distribution, so some data stripes have to be moved to the new brick to make the distribution fair again. Likewise, you cannot simply remove a brick from a logical volume: all data stripes have to be moved from that brick to the other bricks of the volume.

Every time the user performs a volume operation, Reiser5 marks the LV as "not balanced". After successful balancing, the status of the LV is changed back to "balanced". If the balancing procedure fails for some reason, it should be resumed manually (with the volume.reiser4 utility). It is allowed to perform regular file operations on a non-balanced LV. However, in this case:

a) we don't guarantee a good quality of data distribution on your LV;
b) you won't be able to perform volume operations on your LV other than balancing - any other volume operation will return an error (EBUSY).

So don't forget to bring your LV to the balanced state as soon as possible!

= Prepare Software and Hardware =

Build, install and boot a kernel with Reiser4 of software framework release number 5.X.Y. Kernel patches can be found [https://sourceforge.net/projects/reiser4/files/v5-unstable/ here]. Note that the Linux kernel and GNU utilities still recognize the testing stuff as "Reiser4". Make sure the following message appears in the kernel logs:

 Loading Reiser4 (Software Framework Release: 5.X.Y)

Build and install the latest [https://sourceforge.net/projects/reiser4/files/reiser4-utils/libaal/ libaal]. Download, build and install the latest version 2.A.B of the [https://sourceforge.net/projects/reiser4/files/v5-unstable/ Reiser4progs package]. Make sure that the utility for managing logical volumes is installed (as part of the reiser4progs package) on your machine:

 # volume.reiser4 -?

= Creating a logical volume =

Start by choosing a unique ID (uuid) for your volume. By default it is generated by the mkfs utility; however, the user can generate it with suitable tools (e.g.
uuid(1)) and store it in an environment variable for convenience:

 # VOL_ID=`uuid -v4`
 # echo "Using uuid $VOL_ID"

Choose a stripe size for your logical volume. For a good quality of distribution it is recommended that a stripe not exceed 1/10000 of the volume size. On the other hand, too small a stripe will increase space consumption on your meta-data brick. In our example we choose a stripe size of 256K:

 # STRIPE=256K
 # echo "Using stripe size $STRIPE"

Start by creating the first brick of your volume - the meta-data brick - passing the volume ID and stripe size to the mkfs.reiser4 utility:

 # mkfs.reiser4 -U $VOL_ID -t $STRIPE /dev/vdb1

Currently only one meta-data brick per volume is supported, so it is recommended that the block device for the meta-data brick not be too small. In most cases it is enough for the meta-data brick to be no smaller than 1/200 of the maximal volume size. For example, a 100G meta-data brick will be able to service a ~20T logical volume. Data and meta-data bricks don't differ from the standpoint of disk format, and there is no special option to inform the mkfs utility that we want to create a meta-data brick: the first brick in the volume automatically becomes the meta-data brick, and the other bricks are interpreted as data bricks.

Mount your initial logical volume consisting of one meta-data brick:

 # mount /dev/vdb1 /mnt

Find the record about your volume in the output of:

 # volume.reiser4 -l

Create the configuration of your logical volume (its definition is above) and store it somewhere - but not on that volume! Your logical volume is now on-line and ready to use. You can perform regular file operations and volume operations (e.g. add a data brick to your LV).

= Adding a data brick to LV =

At any time you can add a data brick to your LV. You can do it in parallel with regular file operations executing on this volume. Make sure, however, that no other volume operation (e.g.
removing a brick) is in progress on your volume; otherwise the operation will fail with EBUSY. Obviously, adding a brick will increase the capacity of your volume.

Choose a block device for the new data brick. Make sure that it is neither too large nor too small: the capacities of any two bricks of the same logical volume cannot differ by more than 2^19 (~500,000) times. E.g. your logical volume cannot contain both 1M and 2T bricks. Any attempt to add a brick of improper capacity will fail with an error.

Format the device with the same volume ID and stripe size as you used for the meta-data brick, but also specify the "-a" option (to not restrict data capacity):

 # mkfs.reiser4 -U $VOL_ID -t $STRIPE -a /dev/vdb2

Important: the data brick must be formatted with the same volume ID and stripe size as the meta-data brick of your logical volume; otherwise the operation of adding the data brick will fail.

Update item #4 of your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration] with the UUID or name of the brick you want to add. To add the brick, pass its name as the argument of the "-a" option and specify your LV by its mount point:

 # volume.reiser4 -a /dev/vdb2 /mnt

The procedure of adding a brick automatically invokes rebalancing, which moves a portion of the data stripes to the newly added brick (so that the resulting distribution is again fair). The portion of data blocks moved during such rebalancing is equal to the relative capacity of the new brick, that is, to the portion of capacity that the new brick adds to the updated LV capacity. This important property defines the cost of the balancing procedure: if the portion of capacity added by a brick is small, the number of stripes moved during balancing is also small.
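The cost property described above is easy to quantify: the fraction of stripes relocated equals the relative capacity of the new brick in the updated volume. A hypothetical sketch (all capacities are made-up numbers):

```python
# Fraction of data stripes moved when a brick is added to an LV.
# Per the text above, it equals the relative capacity of the new
# brick within the updated volume (made-up capacities).

old_capacities = [1000000, 500000]   # capacities of existing bricks
new_brick = 300000                   # capacity of the brick being added

new_total = sum(old_capacities) + new_brick
moved_fraction = new_brick / new_total   # portion of stripes relocated

print(f"{moved_fraction:.3f}")  # → 0.167 (300000 / 1800000)
```

So adding a brick that contributes 1/6 of the new total capacity moves roughly 1/6 of the existing data stripes.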
Like other user-space utilities, the operation of adding a brick can return an error, even assuming that the brick you wanted to add is properly formatted. In this case check the status of your LV:

 # volume.reiser4 /mnt

If the volume is unbalanced, simply complete the balancing manually:

 # volume.reiser4 -b /mnt

Otherwise, check the number of bricks in your LV. Most likely it is the same as before the failed operation; in this case simply repeat the operation of adding a brick from scratch. Upon successful completion update your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]: increment (#2), add info about the new brick to (#3) and remove the record at (#4).

= Removing a data brick from LV =

At any time you can remove a data brick from your LV. You can do it in parallel with regular file operations executing on this volume. Make sure, however, that no other volume operation (e.g. adding a brick) is in progress on your volume; otherwise the operation will fail with EBUSY. Obviously, removing a brick will decrease the abstract capacity of your LV. Note that the other bricks must have enough space to store all data blocks of the brick you want to remove; otherwise the removal operation will return an error (ENOSPC).

Suppose you want to remove brick /dev/vdb2 from your LV mounted at /mnt. Update your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration] with the UUID and name of the brick you want to remove (item #4).
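The configuration bookkeeping that this page keeps insisting on can be as simple as a small JSON file stored outside the volume. A hypothetical sketch - the file name, field names and helper functions are our own for illustration, not anything volume.reiser4 knows about:

```python
import json

# A minimal volume-configuration record, mirroring items #1-#4
# from the "Basic definitions" section. All names are illustrative.
config = {
    "volume_uuid": "f81d4fae-7dec-11d0-a765-00a0c91e6bf6",  # item 1
    "brick_count": 2,                                        # item 2
    "bricks": ["/dev/vdb1", "/dev/vdb2"],                    # item 3
    "pending_brick": None,                                   # item 4
}

def begin_remove(cfg, brick):
    """Record intent (item #4) before running `volume.reiser4 -r`."""
    cfg["pending_brick"] = brick

def finish_remove(cfg):
    """Update the record after the removal completed successfully."""
    cfg["bricks"].remove(cfg["pending_brick"])
    cfg["brick_count"] -= 1
    cfg["pending_brick"] = None

begin_remove(config, "/dev/vdb2")
# ... run `volume.reiser4 -r /dev/vdb2 /mnt` here and verify success ...
finish_remove(config)

# Persist somewhere OUTSIDE the logical volume:
with open("/tmp/lv-config.json", "w") as f:
    json.dump(config, f, indent=2)
```

Updating item #4 before the operation and items #2/#3 after it matches the before/after bookkeeping the crash-recovery sections rely on.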
Number of available free blocks is always ~95% of total number of free blocks (Reiser4 reserves 5% to make sure that regular file truncate operations won't fail). NOTE: volume.reiser4 shows total number of free blocks, whereas df(1) shows number of available free blocks. "Space usage" statistics shows a portion of busy blocks on individual brick. For the reasons explained above "space usage" on any brick can not be more than 0.95 = Checking quality of data distribution = Quality of data distribution is a measure of deviation of the real data space usage from the ideal one defined by volume partitioning. The smaller the deviation, the better the distribution quality. Checking quality of distribution makes sense only in the case when your volume partitioning is space-based, or if it coincides with the space-based one. If your partitioning is throughput-based, and it doesn't coincide with the space-based one, then quality of actual data distribution can be rather bad, as in this case the file system is worried for low-performance devices to not become a bottleneck, and effective space usage in this case is not a high priority. Checking quality of data distribution is based on the free blocks accounting, provided by the file system. Note that file system doesn't count busy data and meta-data blocks separately, so you are not able to find real data space usage, and hence to check quality of distribution in the case when meta-data brick contains data blocks. To check quality of distribution * make sure that meta-data brick doesn't contain data blocks; * make sure that no regular file and volume operations are currently in progress; * find "blocks used", "system blocks" and "data capacity" statistics for each data brick: # sync # volume.reiser4 -p 1 /mnt ... # volume.reiser4 -p N /mnt * find real data space usage on each brick; * calculate partitioning and ideal data space usage on each data brick; * find deviation of (4) from (5). Example. 
Let' build a LV of 3 bricks (one 10G meta-data brick sdb1, and two data bricks: sdc1 (10G), sdd1(5G)) with space-based partitioning: # VOL_ID=`uuid -v4` # echo "Using uuid $VOL_ID" # mkfs.reiser4 -U $VOL_ID -y -t 256K /dev/vdb1 # mkfs.reiser4 -U $VOL_ID -y -a -t 256K /dev/vdc1 # mkfs.reiser4 -U $VOL_ID -y -a -t 256K /dev/vdd1 # mount /dev/vdb1 /mnt Fill the meta-data brick with data: # dd if=/dev/zero of=/mnt/myfile bs=256K No space left on device... Add data-bricks /dev/sdc1 and dev/sdd1 to the volume: # volume.reiser4 -a /dev/vdc1 /mnt # volume.reiser4 -a /dev/vdd1 /mnt Move all data blocks to the newly added bricks: # volume.reiser4 -r /dev/vdb1 /mnt # sync Now meta-data brick doesn't contain data blocks (only meta-data ones), so that we can calculate quality of data distribution # volume.reiser4 /mnt -p0 blocks used: 503 # volume.reiser4 /mnt -p1 blocks used: 1657203 system blocks: 115 data capacity: 2621069 # volume.reiser4 /mnt -p2 blocks used: 833001 system blocks: 73 data capacity: 1310391 Basing on the statistics above calculate quality of distribution. Total data capacity of the volume: C = 2621069 + 1310391 = 3931460 Relative capacities of data bricks: C1 = 2621069 /(2621069 + 1310391) = 0.6667 C2 = 1310464 /(2621069 + 1310391) = 0.3333 Real space usage on data bricks (blocks used - system blocks): R1 = 1657203 - 115 = 1657088 R2 = 833001 - 73 = 832928 Space usage on the volume: R = R1 + R2 = 1657088 + 832928 = 2490016 Ideal data space usage on data bricks: I1 = C1 * R = 0.6667 * 2490016 = 1660094 I2 = C2 * R = 0.3333 * 2490016 = 829922 Deviation: D = (R1, R2) - (I1, I2) = (3006, -3006) Relative deviation: D/R = (-0.0012, 0.0012) Quality of distribution: Q = 1 - max(|D1|, |D1|) = 1 - 0.0012 = 0.9988 Comment. For any specified number of bricks N and quality of distribution Q it is possible to find a configuration of a logical volume composed of N bricks, so that quality of distribution on that volume will be better than Q. Comment. 
Quality of distribution Q doesn't depend on the number of bricks in the logical volume. This is a theorem, which can be strictly proven. = FAQ = Q. What happens if I lose a device-component (due to a breakdown, etc) of my logical volume? A. Bodies of some your regular files will become "punched" in random places. Portion of such files depends on the relative capacity of the lost brick, on the number of bricks in the logical volume, and on other factors. Fsck will be able to detect and remove such files with corrupted bodies. Nevertheless, we recommend to consider mirroring your bricks (e.g. by software, or hardware RAID-1) to avoid such highly unpleasant situations. 844d0ba3b21adbcd9a3f6bdfb386820ac8c84ea1 4381 4380 2020-08-16T16:55:39Z Edward 4 /* Removing a data brick from LV */ Logical volume (LV) can be composed of any number of block devices, different in physical and geometric parameters. However the optimal configuration (true parallelism) imposes some restrictions and dependencies on the size of such devices. WARNING: The stuff is not stable. Don't put important data to logical volumes managed by software of release number 5.X.Y. Also don't mount your old partitions in kernels with Reiser4 of SFRN 5.X.Y before its stabilization IMPORTANT: Currently there is no tools to manage Reiser5 logical volumes off-line, so it it strongly recommended to save/update configurations of your LV in a file, which doesn't belong to that volume. = Basic definitions. Volume configuration. Brick's capacity. Partitioning. Fair distribution. Balancing = Basic configuration of a logical volume is the following information: 1) Volume UUID; 2) Number of bricks in the volume; 3) List of brick names or UUIDs in the volume; 4) UUID or name of the brick to be added/removed (if any). That brick is not counted in (2) and (3). The item #4 is to handle incomplete operations interrupted by various reasons (system crash, hard reset, etc). 
For each volume, its configuration should be stored somewhere (but not on that volume!) and properly updated before and after each volume operation performed on it. The user is responsible for this. The volume configuration is needed to facilitate deploying the volume.

'''Abstract capacity''' (or simply capacity) of a brick is a positive integer. Capacity is a brick property defined by the user; don't confuse it with the size of the block device. Think of it as the brick's "weight" in some units. It is the user who decides which property of the brick to assign as its abstract capacity, and in which units. In particular, it can be the size of the block device in kilobytes or in megabytes, its throughput in M/sec, or another geometric or physical parameter of the device associated with the brick. It is important that the capacities of all bricks of the same logical volume are measured in the same units. It would also be utterly pointless to assign different properties as abstract capacities to bricks of the same LV, for example, block device size for one brick and disk bandwidth for another.

The capacity of each brick is initialized by the mkfs utility. By default it is calculated as the number of free blocks on the device at the very end of the formatting procedure; for the meta-data brick it is calculated as 70% of that amount. The capacity of any brick can be changed on-line by the user.

'''Capacity of a logical volume''' is defined as the sum of the capacities of its component bricks.

'''Relative capacity of a brick''' is the ratio of the brick's capacity to the volume's capacity. Relative capacity defines the portion of IO requests that will be issued against that brick. The array of relative capacities (C1, C2, ...) of all bricks is called the volume partitioning. Obviously, C1 + C2 + ... = 1.

'''(Real) data space usage''' on a brick is the number of data blocks stored on that brick.
'''Ideal (or expected) data space usage''' on a brick is T*C, where T is the total number of data blocks stored in the volume and C is the relative capacity of the brick.

It is recommended to compose volumes so that the space-based partitioning coincides with the throughput-based one; this is the optimal volume configuration, which provides true parallelism. If that is impossible for some reason, choose a preferred partitioning method (space-based or throughput-based). Note that space-based partitioning saves volume space, whereas throughput-based partitioning saves volume throughput.

When performing regular file operations, Reiser5 distributes data stripes throughout the volume evenly and fairly: the portion of IO requests issued against each brick is equal to its relative capacity, that is, to the portion of capacity that the brick contributes to the total volume capacity.

Most volume operations are accompanied by rebalancing, which keeps the distribution fair. For example, adding a brick to a logical volume changes its partitioning and hence breaks the fairness of the distribution, so some data stripes have to be moved to the new brick to make the distribution fair again. Likewise, you cannot simply remove a brick from a logical volume: all data stripes have to be moved from that brick to the other bricks of the volume first.

Every time the user performs a volume operation, Reiser5 marks the LV as "not balanced". After successful balancing the status of the LV is changed back to "balanced". If the balancing procedure fails for some reason, it should be resumed manually (with the volume.reiser4 utility). It is allowed to perform regular file operations on an unbalanced LV. However, in this case:

a) a good quality of data distribution on your LV is not guaranteed;
b) you won't be able to perform any volume operation on your LV except balancing; any other volume operation will return an error (EBUSY).

So, don't forget to bring your LV to the balanced state as soon as possible!
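The definitions above amount to simple arithmetic. The following is only a sketch with made-up capacities and a made-up data block count (they are not output of any real volume), showing how relative capacities and ideal data space usage are derived:

```shell
#!/bin/sh
# Hypothetical capacities of three bricks (all in the same units).
C1=2000000; C2=1000000; C3=1000000
# Hypothetical total number of data blocks stored in the volume (T).
T=300000

# Capacity of the logical volume is the sum of the brick capacities.
CV=$((C1 + C2 + C3))

# Relative capacity of brick i is Ci/CV; ideal usage on brick i is T*Ci/CV.
awk -v t="$T" -v cv="$CV" -v c1="$C1" -v c2="$C2" -v c3="$C3" 'BEGIN {
    printf "relative capacities: %.2f %.2f %.2f\n", c1/cv, c2/cv, c3/cv
    printf "ideal usage:         %d %d %d\n", t*c1/cv, t*c2/cv, t*c3/cv
}'
```

With these numbers the relative capacities are 0.50, 0.25 and 0.25, and they sum to 1, as the definition of volume partitioning requires.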
= Prepare Software and Hardware =

Build, install and boot a kernel with Reiser4 of software framework release number 5.X.Y. Kernel patches can be found [https://sourceforge.net/projects/reiser4/files/v5-unstable/ here]. Note that the Linux kernel and GNU utilities still recognize this testing software as "Reiser4". Make sure the following message appears in the kernel logs: "Loading Reiser4 (Software Framework Release: 5.X.Y)"

Build and install the latest [https://sourceforge.net/projects/reiser4/files/reiser4-utils/libaal/ libaal]. Download, build and install the latest version 2.A.B of the [https://sourceforge.net/projects/reiser4/files/v5-unstable/ Reiser4progs package]. Make sure that the utility for managing logical volumes is installed (as part of reiser4progs) on your machine:

# volume.reiser4 -?

= Creating a logical volume =

Begin by choosing a unique ID (UUID) for your volume. By default it is generated by the mkfs utility, but you can also generate it yourself with a suitable tool (e.g. uuid(1)) and store it in an environment variable for convenience:

# VOL_ID=`uuid -v4`
# echo "Using uuid $VOL_ID"

Choose a stripe size for your logical volume. For a good quality of distribution it is recommended that the stripe not exceed 1/10000 of the volume size. On the other hand, stripes that are too small will increase space consumption on your meta-data brick. In our example we choose a stripe size of 256K:

# STRIPE=256K
# echo "Using stripe size $STRIPE"

Create the first brick of your volume, the meta-data brick, passing the volume ID and stripe size to the mkfs.reiser4 utility:

# mkfs.reiser4 -U $VOL_ID -t $STRIPE /dev/vdb1

Currently only one meta-data brick per volume is supported, so the block device for the meta-data brick should not be too small. In most cases it is enough if your meta-data brick is not smaller than 1/200 of the maximal volume size. For example, a 100G meta-data brick will be able to service a ~20T logical volume.
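The two rules of thumb above (stripe at most 1/10000 of the volume size, meta-data brick at least 1/200 of it) are easy to turn into a quick calculation. A sketch only; the 20T volume size is an example value matching the figure quoted above:

```shell
#!/bin/sh
# Planned maximal size of the logical volume, in gigabytes (~20T, example).
VOLUME_SIZE_G=20480

# Rule of thumb: meta-data brick >= 1/200 of the maximal volume size.
echo "meta-data brick: >= $((VOLUME_SIZE_G / 200))G"

# Rule of thumb: stripe <= 1/10000 of the volume size
# (in practice you would pick something much smaller, e.g. 256K).
echo "stripe size:     <= $((VOLUME_SIZE_G * 1024 / 10000))M"
```

For a ~20T volume this yields a meta-data brick of at least 102G, consistent with the "100G services ~20T" estimate above.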
Data and meta-data bricks don't differ from the standpoint of disk format, and there is no special option to tell the mkfs utility that we want to create a meta-data brick: the first brick in the volume automatically becomes the meta-data brick, and all other bricks are treated as data bricks.

Mount your initial logical volume consisting of one meta-data brick:

# mount /dev/vdb1 /mnt

Find the record about your volume in the output of the following command:

# volume.reiser4 -l

Create the configuration of your logical volume (its definition is above) and store it somewhere, but not on that volume! Your logical volume is now on-line and ready to use. You can perform regular file operations and volume operations (e.g. add a data brick to your LV).

= Adding a data brick to LV =

At any time you can add a data brick to your LV. You can do it in parallel with regular file operations executing on this volume. Make sure, however, that no other volume operation (e.g. removing a brick) is in progress on your volume, otherwise your operation will fail with EBUSY. Obviously, adding a brick increases the capacity of your volume.

Choose a block device for the new data brick. Make sure that it is neither too large nor too small: the capacities of any two bricks of the same logical volume cannot differ by more than 2^19 (~0.5 million) times. E.g. your logical volume cannot contain both a 1M and a 2T brick. Any attempt to add a brick of improper capacity will fail with an error.

Format the device with the same volume ID and stripe size as you used for the meta-data brick, but also specify the "-a" option (to not restrict data capacity):

# mkfs.reiser4 -U $VOL_ID -t $STRIPE -a /dev/vdb2

Important: the data brick must be formatted with the same volume ID and stripe size as the meta-data brick of your logical volume. Otherwise, the operation of adding the data brick will fail.
Update item #4 of your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration] with the UUID or name of the brick you want to add. To add the brick, pass its name as an argument to the "-a" option and specify your LV via its mount point:

# volume.reiser4 -a /dev/vdb2 /mnt

The procedure of adding a brick automatically invokes rebalancing, which moves a portion of data stripes to the newly added brick (so that the resulting distribution is fair). The portion of data blocks moved during such rebalancing is equal to the relative capacity of the new brick, that is, to the portion of capacity that the new brick adds to the updated LV capacity. This important property defines the cost of the balancing procedure: if the portion of capacity added by a brick is small, then the number of stripes moved during balancing is also small.

Like other user-space utilities, the operation of adding a brick can return an error, even assuming that the brick you wanted to add was properly formatted. In this case check the status of your LV:

# volume.reiser4 /mnt

If the volume is unbalanced, then simply complete the balancing manually:

# volume.reiser4 -b /mnt

Otherwise, check the number of bricks in your LV. Most likely it is the same as before the failed operation; in this case simply repeat the operation of adding a brick from scratch.

Upon successful completion update your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]: increment (#2), add info about the new brick to (#3) and remove the record at (#4).

= Removing a data brick from LV =

At any time you can remove a data brick from your LV. You can do it in parallel with regular file operations executing on this volume.
Make sure, however, that no other volume operation (e.g. adding a brick) is in progress on your volume, otherwise your operation will fail with EBUSY. Obviously, removing a brick decreases the abstract capacity of your LV. Note that the other bricks must have enough space to store all data blocks of the brick you want to remove, otherwise the removal operation will return an error (ENOSPC).

Suppose you want to remove brick /dev/vdb2 from your LV mounted at /mnt. Update your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration] with the UUID or name of the brick you want to remove (item #4). To remove the brick, pass its name as an argument to the "-r" option and specify the logical volume by its mount point:

# volume.reiser4 -r /dev/vdb2 /mnt

The procedure of brick removal automatically invokes rebalancing, which distributes the data of the brick to be removed among the other bricks, so that the resulting distribution is also fair. The portion of data stripes moved during such rebalancing is equal to the relative capacity of the brick to be removed (that is, to the portion of capacity that the brick added to the LV capacity).

It can happen that the command above completes with an error (like other user-space applications). In this case check the status of your LV:

# volume.reiser4 /mnt

If the volume is not balanced, then simply complete the balancing manually:

# volume.reiser4 -b /mnt

Otherwise, check the number of bricks in your logical volume; it should be the same as before the failed operation. The error ENOSPC indicates that the free space on the other bricks is not enough to fit all the data of the brick you want to remove.
On success, update your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]: remove the information about the brick /dev/vdb2 from #3 and #4. Check your kernel logs: they should contain a message that brick /dev/vdb2 has been unregistered. Now device /dev/vdb2 doesn't belong to the logical volume any more, and you can reuse it for other purposes (re-format, etc).

= Changing brick's capacity =

At any time (assuming that no other volume operation is in progress) you can change the abstract capacity of any brick to some new non-zero value. Changing capacity always changes the volume partitioning, and therefore breaks the fairness of distribution, so Reiser5 automatically launches rebalancing to make sure that the resulting distribution is fair for the new set of capacities. In particular, increasing a brick's capacity will move some data from other bricks to the brick whose capacity was increased; decreasing a brick's capacity will move some data from the brick whose capacity was decreased to the other bricks.

To change the abstract capacity of brick /dev/vdb1 to a new value (e.g. 200000), simply run

# volume.reiser4 -z /dev/vdb1 -c 200000 /mnt

pronounced as "resize brick /dev/vdb1 to new capacity 200000 in the volume mounted at /mnt".

The operation of changing capacity can return an error. Most likely it is ENOSPC, a side effect of concurrent regular file writes. In this case check the status of your LV. If it is unbalanced, then consider removing some files from your LV and complete the balancing by running

# volume.reiser4 -b /mnt

Otherwise, repeat the operation from scratch.

Comment. Changing a brick's capacity to 0 is undefined and will return an error. Consider the brick removal operation instead.

= Operations with meta-data brick =

The meta-data brick can also contain data stripes and participate in data distribution like the data bricks.
Thus all the volume operations described above are also applicable to the meta-data brick. Note, however, that it is impossible to completely remove the meta-data brick from the logical volume, for obvious reasons (meta-data needs to be stored somewhere), so the brick removal operation applied to the meta-data brick actually removes it from the Data Storage Array (DSA), not from the logical volume. The DSA is the subset of the LV consisting of the bricks participating in data distribution. Once you remove the meta-data brick from the DSA, that brick will be used only to store meta-data. The operation of adding a brick, applied to the meta-data brick, returns it to the DSA.

Important: Reiser5 doesn't count busy data and meta-data blocks separately. So, in contrast with data bricks (which contain only data), you cannot find out the real space occupied by data blocks on the meta-data brick; Reiser5 knows only the total space occupied.

To check the status of the meta-data brick, simply run

# volume.reiser4 /mnt

and compare the values of "bricks total" and "bricks in DSA". If they are equal, then the meta-data brick participates in data distribution. Otherwise, "bricks total" should be 1 more than "bricks in DSA", which indicates that the meta-data brick doesn't participate in data distribution (and therefore doesn't contain data blocks). Other cases are impossible: for data bricks, participation in the LV and in the DSA is always equivalent.

= Unmounting a logical volume =

To terminate a mount session, just issue the usual umount command with the mount point specified. Note that after unmounting the volume all bricks by default remain registered in the system until system shutdown. If you want to unregister a brick before system shutdown, simply issue the following command:

# volume.reiser4 -u BRICK_NAME

= Deploying a logical volume after correct unmount =

Make sure (by checking your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]) that all bricks of the volume are registered in the system.
To register a brick, issue the following command:

# volume.reiser4 -g BRICK_NAME

The list of all volumes and bricks registered in the system can be found in the output of the following command:

# volume.reiser4 -l

Issue the usual mount(8) command against one of the bricks of your volume. It is recommended to issue it against the meta-data brick.

NOTE: Reiser5 will refuse to mount a logical volume when a wrong (incomplete or redundant) set of bricks is registered in the system. A redundant set of bricks appears, for example, when you mistakenly register a brick that was earlier removed from the logical volume.

= Deploying a logical volume after correct shutdown =

To mount your LV, first make sure that all its bricks (data and meta-data) are registered in the system.

Important: Reiser5 will refuse to mount a logical volume when a wrong (incomplete or redundant) set of bricks is registered in the system. A redundant set of bricks appears, for example, when you mistakenly register a brick that was removed from the logical volume. For this reason we strongly recommend keeping track of your LV: store its configuration somewhere, but not on this volume! And don't forget to update that configuration after _every_ volume operation. If you lose the configuration of your LV and don't remember it (which is most likely for large volumes), it will be rather painful to restore: currently there are no tools to manage off-line logical volumes, so users have to do this on their own. It is not at all difficult.

To register a brick in the system, use the following command:

# volume.reiser4 -g BRICK_NAME

To print a list of all registered bricks, use

# volume.reiser4 -l

To mount your LV, simply issue a mount(8) command against one of the bricks of your LV. We recommend issuing it against the meta-data brick.

Comment.
Reiser5 always tries to register the brick that is passed to the mount command as an argument, so it is not necessary to preregister the brick you want to issue the mount command against.

= Deploying a logical volume after hard reset or system crash =

If no volume operations were interrupted by the hard reset or system crash, then just follow the instructions in this [https://reiser4.wiki.kernel.org/index.php?title=Logical_Volumes_Administration#Deploying_a_logical_volume_after_correct_shutdown section].

In Reiser5 only a restricted number of bricks participate in every transaction. The maximal number of such bricks can be specified by the user. At mount time a transaction replay procedure will be launched on each such brick independently, in parallel. Depending on the kind of interrupted volume operation, perform one of the following actions:

== Adding a brick was interrupted ==

Check your volume configuration. Register the old set of bricks (that is, the set of bricks that the volume had before the operation) and try to mount. In case of error, register also the brick you wanted to add and try to mount again. Check the status of your LV by running

# volume.reiser4 /mnt

If the volume is unbalanced, then complete the balancing manually by running

# volume.reiser4 -b /mnt

Check "bricks total" of your LV in the output of

# volume.reiser4 /mnt

Compare it with the old number of bricks in the configuration. The new value should be one more than the old one. If the number of bricks is the same, then your operation of adding a brick was completely rolled back by the transaction manager, and you need to repeat it from scratch. Otherwise, your operation was successfully completed; update your volume configuration accordingly.

== Brick removal was interrupted ==

Check your volume configuration. Register the old set of bricks (that is, the set of bricks that the volume had before the interrupted operation) except the brick you wanted to remove. Try to mount the volume.
In case of error, register also the brick you wanted to remove and try to mount again. Check the status of your LV:

# volume.reiser4 /mnt

If the volume is unbalanced, then complete the balancing manually by running

# volume.reiser4 -b /mnt

Comment. After successful completion of the balancing, the brick will be automatically removed from the volume. Make sure of it by checking the status of your LV:

# volume.reiser4 /mnt

Update your volume configuration accordingly.

== Another volume operation was interrupted ==

Using the volume configuration, register the new set of bricks and try to mount the volume. The mount should be successful. Check the status of your LV:

# volume.reiser4 /mnt

If the volume is unbalanced, then complete the balancing manually by running

# volume.reiser4 -b /mnt

= LV monitoring =

Common info about the LV mounted at /mnt:

# volume.reiser4 /mnt

 ID:            Volume UUID
 volume:        ID of the plugin managing the volume
 distribution:  ID of the distribution plugin
 stripe:        Stripe size in bytes
 segments:      Number of hash space segments (for distribution)
 bricks total:  Total number of bricks in the volume
 bricks in DSA: Number of bricks participating in data distribution
 balanced:      Balance status of the volume

Info about any of its bricks, of index J:

# volume.reiser4 -p J /mnt

 internal ID:   Brick's "internal ID" and its status in the volume
 external ID:   Brick's UUID
 device name:   Name of the block device associated with the brick
 block count:   Size of the block device in blocks
 blocks used:   Total number of occupied blocks on the device
 system blocks: Minimal possible number of busy blocks on the device
 data capacity: Abstract capacity of the brick
 space usage:   Portion of occupied blocks on the device
 in DSA:        Participation in regular data distribution
 is proxy:      Participation in data tiering (Burst Buffers, etc)

Comment. When retrieving brick info, make sure that no volume operations on that volume are in progress. Otherwise the command above will return an error (EBUSY).

WARNING.
Brick info obtained this way is not necessarily the most recent. To get up-to-date info, run sync(1) and make sure that no regular file operations are in progress.

= Checking free space =

To check the number of available free blocks on a volume mounted at /mnt, make sure that no regular file operations or volume operations are in progress on that volume, then run

# sync
# df --block-size=4K /mnt

To check the number of free blocks on the brick of index J, run

# volume.reiser4 -p J /mnt

and calculate the difference between "block count" and "blocks used".

Comment. Not all free blocks on a brick/volume are available for use. The number of available free blocks is always ~95% of the total number of free blocks (Reiser4 reserves 5% to make sure that regular file truncate operations won't fail).

NOTE: volume.reiser4 shows the total number of free blocks, whereas df(1) shows the number of available free blocks. The "space usage" statistic shows the portion of busy blocks on an individual brick. For the reasons explained above, "space usage" on any brick cannot be more than 0.95.

= Checking quality of data distribution =

Quality of data distribution is a measure of the deviation of the real data space usage from the ideal one defined by the volume partitioning. The smaller the deviation, the better the distribution quality.

Checking the quality of distribution makes sense only when your volume partitioning is space-based, or coincides with the space-based one. If your partitioning is throughput-based and doesn't coincide with the space-based one, then the quality of the actual data distribution can be rather bad: in that case the file system takes care that low-performance devices do not become a bottleneck, and effective space usage is not a high priority.

Checking the quality of data distribution is based on the free blocks accounting provided by the file system.
Note that file system doesn't count busy data and meta-data blocks separately, so you are not able to find real data space usage, and hence to check quality of distribution in the case when meta-data brick contains data blocks. To check quality of distribution * make sure that meta-data brick doesn't contain data blocks; * make sure that no regular file and volume operations are currently in progress; * find "blocks used", "system blocks" and "data capacity" statistics for each data brick: # sync # volume.reiser4 -p 1 /mnt ... # volume.reiser4 -p N /mnt * find real data space usage on each brick; * calculate partitioning and ideal data space usage on each data brick; * find deviation of (4) from (5). Example. Let' build a LV of 3 bricks (one 10G meta-data brick sdb1, and two data bricks: sdc1 (10G), sdd1(5G)) with space-based partitioning: # VOL_ID=`uuid -v4` # echo "Using uuid $VOL_ID" # mkfs.reiser4 -U $VOL_ID -y -t 256K /dev/vdb1 # mkfs.reiser4 -U $VOL_ID -y -a -t 256K /dev/vdc1 # mkfs.reiser4 -U $VOL_ID -y -a -t 256K /dev/vdd1 # mount /dev/vdb1 /mnt Fill the meta-data brick with data: # dd if=/dev/zero of=/mnt/myfile bs=256K No space left on device... Add data-bricks /dev/sdc1 and dev/sdd1 to the volume: # volume.reiser4 -a /dev/vdc1 /mnt # volume.reiser4 -a /dev/vdd1 /mnt Move all data blocks to the newly added bricks: # volume.reiser4 -r /dev/vdb1 /mnt # sync Now meta-data brick doesn't contain data blocks (only meta-data ones), so that we can calculate quality of data distribution # volume.reiser4 /mnt -p0 blocks used: 503 # volume.reiser4 /mnt -p1 blocks used: 1657203 system blocks: 115 data capacity: 2621069 # volume.reiser4 /mnt -p2 blocks used: 833001 system blocks: 73 data capacity: 1310391 Basing on the statistics above calculate quality of distribution. 
A logical volume (LV) can be composed of any number of block devices that differ in physical and geometric parameters. The optimal configuration (true parallelism), however, imposes some restrictions and dependencies on the sizes of those devices.

WARNING: This code is not stable.
Do not put important data on logical volumes managed by software of release number 5.X.Y, and do not mount your old partitions with kernels running Reiser4 of SFRN 5.X.Y before it has stabilized.

IMPORTANT: There are currently no tools to manage Reiser5 logical volumes off-line, so it is strongly recommended to save and update the configuration of your LV in a file that does not belong to that volume.

= Basic definitions. Volume configuration. Brick's capacity. Partitioning. Fair distribution. Balancing =

The basic configuration of a logical volume consists of the following information:

1) Volume UUID;
2) Number of bricks in the volume;
3) List of brick names or UUIDs in the volume;
4) UUID or name of the brick being added or removed (if any). That brick is not counted in (2) and (3).

Item (4) exists to handle incomplete operations interrupted for various reasons (system crash, hard reset, etc.).

The configuration of each volume should be stored somewhere (but not on that volume!) and properly updated before and after every volume operation performed on that volume. The user is responsible for this. The volume configuration is needed to facilitate deploying the volume.

'''Abstract capacity''' (or simply capacity) of a brick is a positive integer. Capacity is a brick property defined by the user; do not confuse it with the size of the block device. Think of it as the brick's "weight" in some units. It is the user who decides which property of the brick to assign as its abstract capacity, and in which units. In particular, it can be the size of the block device in kilobytes or in megabytes, its throughput in MB/s, or any other geometric or physical parameter of the device associated with the brick. It is important that the capacities of all bricks of the same logical volume are measured in the same units. It would also be pointless to assign different properties as abstract capacities to bricks of the same LV.
For example, the size of the block device for one brick and disk bandwidth for another.

The capacity of each brick is initialized by the mkfs utility. By default it is calculated as the number of free blocks on the device at the very end of the formatting procedure; for a meta-data brick it is 70% of that amount. The capacity of any brick can be changed on-line by the user.

'''Capacity of a logical volume''' is the sum of the capacities of its component bricks.

'''Relative capacity of a brick''' is the ratio of the brick's capacity to the volume's capacity. Relative capacity defines the portion of IO requests that will be issued against that brick. The array of relative capacities (C1, C2, ...) of all bricks is called the volume partitioning. Obviously, C1 + C2 + ... = 1.

'''(Real) data space usage''' on a brick is the number of data blocks stored on that brick.

'''Ideal (or expected) data space usage''' on a brick is T*C, where T is the total number of data blocks stored in the volume and C is the relative capacity of the brick.

It is recommended to compose volumes so that the space-based partitioning coincides with the throughput-based one; that is the optimal volume configuration, which provides true parallelism. If that is impossible for some reason, choose a preferred partitioning method (space-based or throughput-based). Note that space-based partitioning saves volume space, whereas throughput-based partitioning saves volume throughput.

When performing regular file operations, Reiser5 distributes data stripes throughout the volume evenly and fairly. This means that the portion of IO requests issued against each brick is equal to its relative capacity, that is, to the portion of capacity that the brick contributes to the total volume capacity. Most volume operations are accompanied by rebalancing, which maintains fairness of the distribution.
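The definitions above are easy to make concrete with a few lines of code. This is only an illustration of the arithmetic; the brick capacities and data-block count below are made-up numbers, not output of any real tool:

```python
# Illustration of the definitions above; all numbers are made up.
capacities = [2621069, 1310391]      # abstract capacities of the bricks (same units)
volume_capacity = sum(capacities)    # capacity of the logical volume

# Volume partitioning: the array of relative capacities (C1, C2, ...).
partitioning = [c / volume_capacity for c in capacities]
assert abs(sum(partitioning) - 1.0) < 1e-9   # C1 + C2 + ... = 1

# Ideal (expected) data space usage on each brick is T*C, where T is the
# total number of data blocks stored in the volume.
T = 2490016
ideal_usage = [T * c for c in partitioning]

print([round(c, 4) for c in partitioning])
```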
For example, adding a brick to a logical volume changes its partitioning and hence breaks fairness of the distribution, so some data stripes have to be moved to the new brick to make the distribution fair again. Likewise, you cannot simply remove a brick from a logical volume: all data stripes must first be moved from that brick to the other bricks of the volume.

Every time the user performs a volume operation, Reiser5 marks the LV as "not balanced". After successful balancing the status of the LV changes to "balanced". If the balancing procedure fails for some reason, it should be resumed manually (with the volume.reiser4 utility). It is allowed to perform regular file operations on an unbalanced LV. In that case, however:

a) a good quality of data distribution on your LV is not guaranteed;
b) you will not be able to perform any volume operation on your LV except balancing; any other volume operation will return an error (EBUSY).

So don't forget to bring your LV to the balanced state as soon as possible!

= Prepare Software and Hardware =

Build, install and boot a kernel with Reiser4 of software framework release number 5.X.Y. Kernel patches can be found [https://sourceforge.net/projects/reiser4/files/v5-unstable/ here]. Note that the Linux kernel and GNU utilities still recognize this software as "Reiser4". Make sure the following message appears in the kernel logs:

 Loading Reiser4 (Software Framework Release: 5.X.Y)

Build and install the latest [https://sourceforge.net/projects/reiser4/files/reiser4-utils/libaal/ libaal]. Download, build and install the latest version 2.A.B of the [https://sourceforge.net/projects/reiser4/files/v5-unstable/ Reiser4progs package]. Make sure that the utility for managing logical volumes is installed (as part of the reiser4progs package) on your machine:

 # volume.reiser4 -?

= Creating a logical volume =

Start by choosing a unique ID (UUID) for your volume. By default it is generated by the mkfs utility, but the user can also generate it with appropriate tools (e.g.
uuid(1)) and store it in an environment variable for convenience:

 # VOL_ID=`uuid -v4`
 # echo "Using uuid $VOL_ID"

Choose a stripe size for your logical volume. For a good quality of distribution it is recommended that the stripe not exceed 1/10000 of the volume size. On the other hand, too small a stripe will increase space consumption on your meta-data brick. In our example we choose a stripe size of 256K:

 # STRIPE=256K
 # echo "Using stripe size $STRIPE"

Start by creating the first brick of your volume, the meta-data brick, passing the volume ID and stripe size to the mkfs.reiser4 utility:

 # mkfs.reiser4 -U $VOL_ID -t $STRIPE /dev/vdb1

Currently only one meta-data brick per volume is supported, so the block device for the meta-data brick should not be too small. In most cases it is enough for the meta-data brick to be no smaller than 1/200 of the maximal volume size. For example, a 100G meta-data brick will be able to service a ~20T logical volume.

Data and meta-data bricks do not differ in disk format, and there is no special option to tell the mkfs utility that we want to create a meta-data brick: the first brick in the volume automatically becomes the meta-data brick, and the other bricks are treated as data bricks.

Mount your initial logical volume, consisting of one meta-data brick:

 # mount /dev/vdb1 /mnt

Find the record about your volume in the output of the following command:

 # volume.reiser4 -l

Create the configuration of your logical volume (its definition is above) and store it somewhere, but not on that volume!

Your logical volume is now on-line and ready to use. You can perform regular file operations and volume operations (e.g. add a data brick to your LV).

= Adding a data brick to LV =

At any time you can add a data brick to your LV. You can do this in parallel with regular file operations executing on the volume. Make sure, however, that no other volume operation (e.g.
removing a brick) is in progress on your volume; otherwise your operation will fail with EBUSY. Obviously, adding a brick increases the capacity of your volume.

Choose a block device for the new data brick. Make sure that it is neither too large nor too small: the capacities of any two bricks of the same logical volume cannot differ by more than a factor of 2^19 (~500,000). E.g. your logical volume cannot contain both a 1M and a 2T brick. Any attempt to add a brick of improper capacity will fail with an error.

Format the device with the same volume ID and stripe size as you used for the meta-data brick, but also specify the "-a" option (to not restrict data capacity):

 # mkfs.reiser4 -U $VOL_ID -t $STRIPE -a /dev/vdb2

Important: the data brick must be formatted with the same volume ID and stripe size as the meta-data brick of your logical volume; otherwise the operation of adding the data brick will fail.

Update item (4) of your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration] with the UUID or name of the brick you want to add. To add the brick, simply pass its name as an argument to the option "-a" and specify your LV via its mount point:

 # volume.reiser4 -a /dev/vdb2 /mnt

The procedure of adding a brick automatically invokes rebalancing, which moves a portion of data stripes to the newly added brick (so that the resulting distribution is fair). The portion of data blocks moved during such rebalancing is equal to the relative capacity of the new brick, that is, to the portion of capacity that the new brick adds to the updated LV's capacity. This important property defines the cost of the balancing procedure: if the portion of capacity added by a brick is small, then the number of stripes moved during balancing is also small.
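The cost of such a rebalancing can be estimated in advance from the capacities alone. A minimal sketch of the arithmetic (the capacities and block count below are hypothetical; volume.reiser4 reports the real values):

```python
# Estimate the cost of adding a brick: the fraction of data stripes moved
# by rebalancing equals the relative capacity of the new brick.
# All numbers below are hypothetical.
existing_capacities = [2621069, 1310391]   # capacities of the current bricks
new_brick_capacity = 655360                # capacity of the brick being added

new_volume_capacity = sum(existing_capacities) + new_brick_capacity
fraction_moved = new_brick_capacity / new_volume_capacity

data_blocks_total = 2490016                # data blocks currently in the volume
blocks_to_move = round(data_blocks_total * fraction_moved)

print(f"~{fraction_moved:.1%} of the data will move (~{blocks_to_move} blocks)")
```

A brick that contributes little capacity therefore costs little to add, which is why growing a volume in small increments keeps each balancing run cheap.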
Like other user-space utilities, the operation of adding a brick can return an error, even assuming that the brick you wanted to add is properly formatted. In this case check the status of your LV:

 # volume.reiser4 /mnt

If the volume is unbalanced, simply complete balancing manually:

 # volume.reiser4 -b /mnt

Otherwise, check the number of bricks in your LV. Most likely it is the same as before the failed operation; in that case simply repeat the operation of adding the brick from scratch.

Upon successful completion update your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration]: increment (2), add info about the new brick to (3) and remove the record at (4).

= Removing a data brick from LV =

At any time you can remove a data brick from your LV. You can do this in parallel with regular file operations executing on the volume. Make sure, however, that no other volume operation (e.g. adding a brick) is in progress on your volume; otherwise your operation will fail with EBUSY. Obviously, removing a brick decreases the abstract capacity of your LV. Note that the other bricks must have enough space to store all data blocks of the brick you want to remove; otherwise the removal operation will return an error (ENOSPC).

Suppose you want to remove brick /dev/vdb2 from your LV mounted at /mnt. Update your volume configuration with the UUID and name of the brick you want to remove (item (4)). To remove the brick, simply pass its name as an argument to option "-r" and specify the logical volume by its mount point:

 # volume.reiser4 -r /dev/vdb2 /mnt

The procedure of brick removal automatically invokes rebalancing, which distributes the data of the brick being removed among the other bricks, so that the resulting distribution is also fair.
The portion of data stripes moved during such rebalancing is equal to the relative capacity of the brick being removed (that is, to the portion of capacity that the brick contributed to the LV's capacity).

It can happen that the command above completes with an error (like any other user-space application). In this case check the status of your LV:

 # volume.reiser4 /mnt

If the volume is not balanced, simply complete balancing manually:

 # volume.reiser4 -b /mnt

Otherwise, check the number of bricks in your logical volume; it should be the same as before the failed operation. The error ENOSPC indicates that the free space on the other bricks is not enough to fit all the data of the brick you want to remove.

On success update your volume configuration: remove the information about the brick /dev/vdb2 at (3) and (4). Check your kernel logs: they should contain a message that brick /dev/vdb2 has been unregistered. Now the device /dev/vdb2 no longer belongs to the logical volume, and you can reuse it for other purposes (re-format, etc.).

= Changing brick's capacity =

At any time (assuming no other volume operation is in progress) you can change the abstract capacity of any brick to a new non-zero value. Changing capacity always changes the volume partitioning and therefore breaks fairness of distribution, so Reiser5 automatically launches rebalancing to make sure that the resulting distribution is fair for the new set of capacities. In particular, increasing a brick's capacity moves some data from other bricks to the brick whose capacity was increased; decreasing a brick's capacity moves some data from that brick to other bricks.

To change the abstract capacity of brick /dev/vdb1 to a new value (e.g. 200000), simply run

 # volume.reiser4 -z /dev/vdb1 -c 200000 /mnt

pronounced as "resize brick /dev/vdb1 to new capacity 200000 in the volume mounted at /mnt". The operation of changing capacity can return an error.
Most likely it is ENOSPC, a side effect of concurrent regular file writes. In this case check the status of your LV. If it is unbalanced, consider removing some files from your LV and complete balancing by running

 # volume.reiser4 -b /mnt

Otherwise, repeat the operation from scratch.

Comment. Changing a brick's capacity to 0 is undefined and will return an error. Consider the brick removal operation instead.

= Operations with meta-data brick =

The meta-data brick can also contain data stripes and participate in data distribution like the data bricks, so all the volume operations described above are applicable to the meta-data brick as well. Note, however, that it is impossible to completely remove the meta-data brick from the logical volume, for obvious reasons (meta-data has to be stored somewhere), so the brick removal operation applied to the meta-data brick actually removes it from the Data Storage Array (DSA), not from the logical volume. The DSA is the subset of the LV consisting of the bricks participating in data distribution. Once you remove the meta-data brick from the DSA, that brick is used only to store meta-data. The operation of adding a brick, applied to the meta-data brick, returns it to the DSA.

Important: Reiser5 does not count busy data and meta-data blocks separately. So, in contrast with data bricks (which contain only data), you cannot find out the real space occupied by data blocks on the meta-data brick; Reiser5 knows only the total space occupied.

To check the status of the meta-data brick, simply run

 # volume.reiser4 /mnt

and compare the values of "bricks total" and "bricks in DSA". If they are equal, the meta-data brick participates in data distribution. Otherwise "bricks total" will be 1 more than "bricks in DSA", indicating that the meta-data brick does not participate in data distribution (and therefore does not contain data blocks). Other cases are impossible: for data bricks, membership in the LV and in the DSA is always equivalent.
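The check above can be expressed as a tiny helper. This is a sketch written for this page (the function name is illustrative); it assumes you have already read the two counters from the volume.reiser4 output:

```python
def metadata_brick_in_dsa(bricks_total: int, bricks_in_dsa: int) -> bool:
    """Decide whether the meta-data brick participates in data distribution,
    given the "bricks total" and "bricks in DSA" counters reported by
    volume.reiser4.  For a healthy volume only two cases are possible:
    the counters are equal (meta-data brick is in the DSA), or they differ
    by exactly one (meta-data brick stores meta-data only)."""
    if bricks_total == bricks_in_dsa:
        return True
    if bricks_total == bricks_in_dsa + 1:
        return False
    raise ValueError("inconsistent counters: data bricks are always in the DSA")

# 3 bricks total, all 3 in the DSA: meta-data brick holds data stripes too.
assert metadata_brick_in_dsa(3, 3) is True
# 3 bricks total, 2 in the DSA: meta-data brick was removed from the DSA.
assert metadata_brick_in_dsa(3, 2) is False
```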
= Unmounting a logical volume =

To terminate a mount session, just issue the usual umount command with the mount point specified. Note that after unmounting the volume, all bricks by default remain registered in the system until system shutdown. If you want to unregister a brick before system shutdown, simply issue the following command:

 # volume.reiser4 -u BRICK_NAME

= Deploying a logical volume after correct unmount =

Make sure (by checking your volume configuration) that all bricks of the volume are registered in the system. To register a brick, issue the following command:

 # volume.reiser4 -g BRICK_NAME

The list of all volumes and bricks registered in the system can be found in the output of the following command:

 # volume.reiser4 -l

Issue the usual mount(8) command against one of the bricks of your volume. It is recommended to issue it against the meta-data brick.

NOTE: Reiser5 will refuse to mount a logical volume when a wrong (incomplete or redundant) set of bricks is registered in the system. A redundant set of bricks appears, for example, when you mistakenly register a brick that was earlier removed from the logical volume.

= Deploying a logical volume after correct shutdown =

To mount your LV, first make sure that all its bricks (data and meta-data) are registered in the system.

Important: Reiser5 will refuse to mount a logical volume when a wrong (incomplete or redundant) set of bricks is registered in the system. A redundant set of bricks appears, for example, when you mistakenly register a brick that was removed from the logical volume. For this reason we strongly recommend that the user keep track of his LV: store its configuration somewhere, but not on that volume! And don't forget to update that configuration after _every_ volume operation.
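One simple way to keep such a configuration is a small file maintained by a script. The sketch below uses JSON with illustrative field names (the format and names are this author's choice, not anything mandated by Reiser5); the four fields mirror items (1)-(4) of the volume configuration defined above:

```python
import json

# Illustrative volume-configuration record mirroring items (1)-(4) above.
config = {
    "volume_uuid": "f81d4fae-7dec-11d0-a765-00a0c91e6bf6",  # (1), example value
    "brick_count": 2,                                        # (2)
    "bricks": ["/dev/vdb1", "/dev/vdb2"],                    # (3)
    "brick_in_flight": None,  # (4): brick being added/removed, if any
}

# Store the configuration OUTSIDE the volume it describes, e.g. on the
# root file system, never under the LV's own mount point.
path = "/tmp/lv-vdb.json"   # illustrative location
with open(path, "w") as f:
    json.dump(config, f, indent=2)

# Before a volume operation: record the brick in (4); on success, update
# (2)/(3) and clear (4).  Sanity check when reading the file back:
with open(path) as f:
    restored = json.load(f)
assert restored["brick_count"] == len(restored["bricks"])
```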
If you have lost the configuration of your LV and do not remember it (which is likely for large volumes), it will be rather painful to restore: currently there are no tools to manage logical volumes off-line, so users are expected to do this on their own. It is not at all difficult.

To register a brick in the system, use the following command:

 # volume.reiser4 -g BRICK_NAME

To print a list of all registered bricks, use

 # volume.reiser4 -l

To mount your LV, simply issue a mount(8) command against one of the bricks of your LV. We recommend issuing it against the meta-data brick.

Comment. Reiser5 always tries to register the brick that is passed to the mount command as an argument, so it is not necessary to pre-register the brick you are going to mount.

= Deploying a logical volume after hard reset or system crash =

If no volume operations were interrupted by the hard reset or system crash, just follow the instructions in this [https://reiser4.wiki.kernel.org/index.php?title=Logical_Volumes_Administration#Deploying_a_logical_volume_after_correct_shutdown section].

In Reiser5 only a restricted number of bricks participate in every transaction; the maximal number of such bricks can be specified by the user. At mount time a transaction replay procedure is launched on each such brick independently, in parallel.

Depending on the kind of interrupted volume operation, perform one of the following actions:

== Adding a brick was interrupted ==

Check your volume configuration. Register the old set of bricks (that is, the set of bricks that the volume had before the operation) and try to mount. In case of error, register also the brick you wanted to add and try to mount again.
Check the status of your LV by running

 # volume.reiser4 /mnt

If the volume is unbalanced, complete balancing manually by running

 # volume.reiser4 -b /mnt

Check "bricks total" of your LV in the output of

 # volume.reiser4 /mnt

and compare it with the old number of bricks in the configuration. The new value should be the old one incremented by 1. If the number of bricks is the same, your operation of adding a brick was completely rolled back by the transaction manager, and you need to repeat it from scratch. Otherwise, your operation completed successfully; update your volume configuration accordingly.

== Brick removal was interrupted ==

Check your volume configuration. Register the old set of bricks (that is, the set of bricks that the volume had before the interrupted operation) except the brick you wanted to remove. Try to mount the volume. In case of error, register also the brick you wanted to remove and try to mount again.

Check the status of your LV:

 # volume.reiser4 /mnt

If the volume is unbalanced, complete balancing manually by running

 # volume.reiser4 -b /mnt

Comment. After successful completion of balancing, the brick will be automatically removed from the volume. Make sure of this by checking the status of your LV:

 # volume.reiser4 /mnt

Update your volume configuration accordingly.

== Another volume operation was interrupted ==

Using the volume configuration, register the new set of bricks and try to mount the volume. The mount should be successful.
Check the status of your LV:

 # volume.reiser4 /mnt

If the volume is unbalanced, complete balancing manually by running

 # volume.reiser4 -b /mnt

= LV monitoring =

Common info about the LV mounted at /mnt:

 # volume.reiser4 /mnt

 ID:            Volume UUID
 volume:        ID of the plugin managing the volume
 distribution:  ID of the distribution plugin
 stripe:        Stripe size in bytes
 segments:      Number of hash space segments (for distribution)
 bricks total:  Total number of bricks in the volume
 bricks in DSA: Number of bricks participating in data distribution
 balanced:      Balanced status of the volume

Info about any of its bricks, of index J:

 # volume.reiser4 -p J /mnt

 internal ID:   Brick's "internal ID" and its status in the volume
 external ID:   Brick's UUID
 device name:   Name of the block device associated with the brick
 block count:   Size of the block device in blocks
 blocks used:   Total number of occupied blocks on the device
 system blocks: Minimal possible number of busy blocks on that device
 data capacity: Abstract capacity of the brick
 space usage:   Portion of occupied blocks on the device
 in DSA:        Participation in regular data distribution
 is proxy:      Participation in data tiering (Burst Buffers, etc.)

Comment. When retrieving a brick's info, make sure that no volume operations are in progress on that volume; otherwise the command above will return an error (EBUSY).

WARNING. Brick info obtained this way is not necessarily the most recent. To get up-to-date info, run sync(1) and make sure that no regular file operations are in progress.

= Checking free space =

To check the number of available free blocks on a volume mounted at /mnt, make sure that no regular file operations or volume operations are in progress on that volume, then run

 # sync
 # df --block-size=4K /mnt

To check the number of free blocks on the brick of index J, run

 # volume.reiser4 -p J /mnt

and calculate the difference between "block count" and "blocks used".

Comment. Not all free blocks on a brick or volume are available for use.
The number of available free blocks is always ~95% of the total number of free blocks (Reiser4 reserves 5% to make sure that regular file truncate operations won't fail).

NOTE: volume.reiser4 shows the total number of free blocks, whereas df(1) shows the number of available free blocks. The "space usage" statistic shows the portion of busy blocks on an individual brick. For the reasons explained above, "space usage" on any brick cannot exceed 0.95.

= Checking quality of data distribution =

Quality of data distribution is a measure of the deviation of the real data space usage from the ideal one defined by the volume partitioning. The smaller the deviation, the better the distribution quality.

Checking the quality of distribution makes sense only when your volume partitioning is space-based, or coincides with the space-based one. If your partitioning is throughput-based and does not coincide with the space-based one, the quality of the actual data distribution can be rather bad: in that case the file system takes care that low-performance devices do not become a bottleneck, and effective space usage is not a high priority.

Checking the quality of data distribution is based on the free-blocks accounting provided by the file system. Note that the file system does not count busy data and meta-data blocks separately, so you cannot find the real data space usage, and hence cannot check the quality of distribution, when the meta-data brick contains data blocks.

To check the quality of distribution:

1) make sure that the meta-data brick does not contain data blocks;
2) make sure that no regular file operations or volume operations are currently in progress;
3) find the "blocks used", "system blocks" and "data capacity" statistics for each data brick:

 # sync
 # volume.reiser4 -p 1 /mnt
 ...
 # volume.reiser4 -p N /mnt

4) find the real data space usage on each brick;
5) calculate the partitioning and the ideal data space usage on each data brick;
6) find the deviation of (4) from (5).

Example.
Let's build an LV of 3 bricks (one 10G meta-data brick vdb1 and two data bricks: vdc1 (10G) and vdd1 (5G)) with space-based partitioning:

 # VOL_ID=`uuid -v4`
 # echo "Using uuid $VOL_ID"
 # mkfs.reiser4 -U $VOL_ID -y -t 256K /dev/vdb1
 # mkfs.reiser4 -U $VOL_ID -y -a -t 256K /dev/vdc1
 # mkfs.reiser4 -U $VOL_ID -y -a -t 256K /dev/vdd1
 # mount /dev/vdb1 /mnt

Fill the meta-data brick with data:

 # dd if=/dev/zero of=/mnt/myfile bs=256K
 No space left on device...

Add data bricks /dev/vdc1 and /dev/vdd1 to the volume:

 # volume.reiser4 -a /dev/vdc1 /mnt
 # volume.reiser4 -a /dev/vdd1 /mnt

Move all data blocks to the newly added bricks:

 # volume.reiser4 -r /dev/vdb1 /mnt
 # sync

Now the meta-data brick contains no data blocks (only meta-data ones), so we can calculate the quality of data distribution:

 # volume.reiser4 /mnt -p0
 blocks used: 503
 # volume.reiser4 /mnt -p1
 blocks used: 1657203
 system blocks: 115
 data capacity: 2621069
 # volume.reiser4 /mnt -p2
 blocks used: 833001
 system blocks: 73
 data capacity: 1310391

Based on the statistics above, calculate the quality of distribution.

Total data capacity of the volume:

 C = 2621069 + 1310391 = 3931460

Relative capacities of the data bricks:

 C1 = 2621069 / 3931460 = 0.6667
 C2 = 1310391 / 3931460 = 0.3333

Real space usage on the data bricks (blocks used - system blocks):

 R1 = 1657203 - 115 = 1657088
 R2 = 833001 - 73 = 832928

Space usage on the volume:

 R = R1 + R2 = 1657088 + 832928 = 2490016

Ideal data space usage on the data bricks:

 I1 = C1 * R = 0.6667 * 2490016 = 1660094
 I2 = C2 * R = 0.3333 * 2490016 = 829922

Deviation:

 D = (R1, R2) - (I1, I2) = (-3006, 3006)

Relative deviation:

 D/R = (-0.0012, 0.0012)

Quality of distribution:

 Q = 1 - max(|D1|, |D2|)/R = 1 - 0.0012 = 0.9988

Comment. For any specified number of bricks N and quality of distribution Q it is possible to find a configuration of a logical volume composed of N bricks such that the quality of distribution on that volume is better than Q.

Comment.
Quality of distribution Q does not depend on the number of bricks in the logical volume. This is a theorem, which can be strictly proven.

= FAQ =

Q. What happens if I lose a device-component of my logical volume (due to a breakdown, etc.)?

A. The bodies of some of your regular files will become "punched" in random places. The portion of such files depends on the relative capacity of the lost brick, on the number of bricks in the logical volume, and on other factors. Fsck will be able to detect and remove such files with corrupted bodies. Nevertheless, we recommend considering mirroring your bricks (e.g. by software or hardware RAID-1) to avoid such highly unpleasant situations.
For each volume, its configuration should be stored somewhere (but not on that volume!) and properly updated before and after each volume operation performed on that volume. We make the user responsible for this. The volume configuration is needed to facilitate deploying the volume.

'''Abstract capacity''' (or simply capacity) of a brick is a positive integer. Capacity is a brick property defined by the user; don't confuse it with the size of the block device. Think of it as the brick's "weight" in some units. It is the user who decides which property of the brick to assign as its abstract capacity, and in which units. In particular, it can be the size of the block device in kilobytes, or its size in megabytes, or its throughput in MB/sec, or some other geometric or physical parameter of the device associated with the brick. It is important that the capacities of all bricks of the same logical volume are measured in the same units. It would also be pointless to assign different properties as abstract capacities for bricks of the same LV - for example, block device size for one brick and disk bandwidth for another.

The capacity of each brick is initialized by the mkfs utility. By default it is calculated as the number of free blocks on the device at the very end of the formatting procedure; for a meta-data brick it is calculated as 70% of that amount. The capacity of any brick can be changed on-line by the user.

'''Capacity of a logical volume''' is defined as the sum of the capacities of its component bricks.

'''Relative capacity of a brick''' is the ratio of the brick's capacity to the volume's capacity. Relative capacity defines the portion of IO requests that will be issued against that brick. The array of relative capacities (C1, C2, ...) of all bricks is called the volume partitioning. Obviously, C1 + C2 + ... = 1.

'''(Real) data space usage''' on a brick is the number of data blocks stored on that brick.
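The partitioning defined above is plain arithmetic and can be sketched in a few lines (the sample capacities are borrowed from the worked example later in this document):

```python
def partitioning(capacities):
    """Relative capacity of each brick: its abstract capacity divided by
    the volume capacity. The results always sum to 1."""
    volume_capacity = sum(capacities)
    return [c / volume_capacity for c in capacities]

# Two data bricks whose capacities are measured in the same units (blocks):
parts = partitioning([2621069, 1310391])
```

Here `parts[0]` is the portion of IO requests that will be issued against the first brick, about 2/3 for these capacities.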
'''Ideal (or expected) data space usage''' on a brick is T*C, where T is the total number of data blocks stored in the volume and C is the relative capacity of the brick.

It is recommended to compose volumes so that the space-based partitioning coincides with the throughput-based one - this is the optimal volume configuration, which provides true parallelism. If that is impossible for some reason, choose a preferred partitioning method (space-based or throughput-based). Note that space-based partitioning saves volume space, whereas throughput-based partitioning saves volume throughput.

When performing regular file operations, Reiser5 distributes data stripes throughout the volume evenly and fairly. This means that the portion of IO requests issued against each brick is equal to its relative capacity, that is, to the portion of capacity that the brick contributes to the total volume capacity.

Most volume operations are accompanied by rebalancing, which preserves fairness of distribution. For example, adding a brick to a logical volume changes its partitioning, and hence breaks fairness of the distribution, so some data stripes must be moved to the new brick to make the distribution fair again. Likewise, you cannot simply remove a brick from a logical volume - all data stripes must first be moved from that brick to the other bricks of the logical volume.

Every time the user performs a volume operation, Reiser5 marks the LV as "not balanced". After successful balancing, the status of the LV is changed back to "balanced". If the balancing procedure fails for some reason, it should be resumed manually (with the volume.reiser4 utility). It is allowed to perform regular file operations on a non-balanced LV. However, in this case:

a) we don't guarantee a good quality of data distribution on your LV;
b) you won't be able to perform volume operations on your LV except balancing - any other volume operation will return an error (EBUSY).

So, don't forget to bring your LV to the balanced state as soon as possible!
= Prepare Software and Hardware =

Build, install and boot a kernel with Reiser4 of software framework release number 5.X.Y. Kernel patches can be found [https://sourceforge.net/projects/reiser4/files/v5-unstable/ here]. Note that the Linux kernel and GNU utilities still recognize the testing code as "Reiser4". Make sure the following message appears in the kernel logs: "Loading Reiser4 (Software Framework Release: 5.X.Y)".

Build and install the latest [https://sourceforge.net/projects/reiser4/files/reiser4-utils/libaal/ libaal]. Download, build and install the latest version 2.A.B of the [https://sourceforge.net/projects/reiser4/files/v5-unstable/ Reiser4progs package]. Make sure that the utility for managing logical volumes is installed (as a part of the reiser4progs package) on your machine:

# volume.reiser4 -?

= Creating a logical volume =

Start by choosing a unique ID (UUID) for your volume. By default it is generated by the mkfs utility. However, you can generate it yourself with a suitable tool (e.g. uuid(1)) and store it in an environment variable for convenience:

# VOL_ID=`uuid -v4`
# echo "Using uuid $VOL_ID"

Choose a stripe size for your logical volume. For a good quality of distribution it is recommended that the stripe not exceed 1/10000 of the volume size. On the other hand, too small a stripe will increase space consumption on your meta-data brick. In our example we choose a stripe size of 256K:

# STRIPE=256K
# echo "Using stripe size $STRIPE"

Start by creating the first brick of your volume - the meta-data brick - passing the volume ID and stripe size to the mkfs.reiser4 utility:

# mkfs.reiser4 -U $VOL_ID -t $STRIPE /dev/vdb1

Currently only one meta-data brick per volume is supported, so it is recommended that the block device for the meta-data brick is not too small. In most cases it is enough if your meta-data brick is not smaller than 1/200 of the maximal volume size. For example, a 100G meta-data brick will be able to service a ~20T logical volume.
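The two sizing rules of thumb above (stripe at most 1/10000 of the volume size, meta-data brick at least 1/200 of the maximal volume size) reduce to simple arithmetic; a sketch, using decimal units:

```python
def max_stripe_bytes(volume_bytes):
    """Largest stripe that keeps stripe <= 1/10000 of the volume size."""
    return volume_bytes // 10000

def min_metadata_brick_bytes(max_volume_bytes):
    """Smallest meta-data brick recommended for a volume of this size."""
    return max_volume_bytes // 200

GB, TB = 10**9, 10**12

# The document's example: a 100G meta-data brick services a ~20T volume.
assert min_metadata_brick_bytes(20 * TB) == 100 * GB

# A 256K stripe satisfies the 1/10000 rule for volumes of ~2.56 GB and up.
assert max_stripe_bytes(3 * GB) >= 256 * 1024
```

These are recommendations from the text, not hard limits enforced by the tools.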
Data and meta-data bricks don't differ from the standpoint of disk format, and there is no special option to tell the mkfs utility that we want to create a meta-data brick: the first brick in the volume automatically becomes the meta-data brick, and the other bricks are interpreted as data bricks.

Mount your initial logical volume consisting of one meta-data brick:

# mount /dev/vdb1 /mnt

Find the record about your volume in the output of the following command:

# volume.reiser4 -l

Create the configuration of your logical volume (its definition is above) and store it somewhere - but not on that volume! Your logical volume is now on-line and ready to use. You can perform regular file operations and volume operations (e.g. add a data brick to your LV).

= Adding a data brick to LV =

At any time you can add a data brick to your LV. You can do it in parallel with regular file operations executing on this volume. Make sure, however, that no other volume operation (e.g. removing a brick) is in progress on your volume, otherwise your operation will fail with EBUSY. Obviously, adding a brick will increase the capacity of your volume.

Choose a block device for the new data brick. Make sure that it is not too large or too small: the capacities of any two bricks of the same logical volume cannot differ by more than 2^19 (~500000) times. E.g. your logical volume cannot contain both a 1M and a 2T brick. Any attempt to add a brick of improper capacity will fail with an error.

Format the device with the same volume ID and stripe size as you used for the meta-data brick, but specify also the "-a" option (to not restrict data capacity):

# mkfs.reiser4 -U $VOL_ID -t $STRIPE -a /dev/vdb2

Important: the data brick must be formatted with the same volume ID and stripe size as the meta-data brick of your logical volume. Otherwise, the operation of adding the data brick will fail.
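The capacity-ratio restriction above can be checked before formatting a candidate device; a small sketch:

```python
# Abstract capacities of any two bricks in the same LV may not differ by
# more than 2^19 times (per the text above).
MAX_CAPACITY_RATIO = 2 ** 19

def capacities_compatible(cap_a, cap_b):
    """True if two abstract capacities may coexist in one logical volume."""
    lo, hi = sorted((cap_a, cap_b))
    return hi <= lo * MAX_CAPACITY_RATIO

M, T = 2**20, 2**40
# The document's example: 1M and 2T bricks can't coexist (ratio 2^21):
assert not capacities_compatible(1 * M, 2 * T)
# 10M and 1T bricks are fine (ratio ~2^16.7):
assert capacities_compatible(10 * M, 1 * T)
```

Here capacities are taken in bytes for illustration; the check only depends on their ratio, so any common unit works.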
Update item #4 of your [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Basic_definitions._Volume_configuration._Brick.27s_capacity._Partitioning._Fair_distribution._Balancing volume configuration] with the UUID or name of the brick you want to add. To add the brick, simply pass its name as an argument to the option "-a" and specify your LV via its mount point:

# volume.reiser4 -a /dev/vdb2 /mnt

The procedure of adding a brick automatically invokes rebalancing, which moves a portion of the data stripes to the newly added brick (so that the resulting distribution is fair). The portion of data blocks moved during such rebalancing is equal to the relative capacity of the new brick, that is, to the portion of capacity that the new brick adds to the updated LV's capacity. This important property defines the cost of the balancing procedure: if the portion of capacity added by the brick is small, then the number of stripes moved during balancing is also small.

Like other user-space utilities, the operation of adding a brick can return an error, even assuming that the brick you wanted to add is properly formatted. In this case check the status of your LV:

# volume.reiser4 /mnt

If the volume is unbalanced, then simply complete the balancing manually:

# volume.reiser4 -b /mnt

Otherwise, check the number of bricks in your LV. Most likely it is the same as before the failed operation; in this case simply repeat the operation of adding a brick from scratch. Upon successful completion, update your volume configuration: increment (#2), add info about the new brick to (#3) and remove the record at (#4).

= Removing a data brick from LV =

At any time you can remove a data brick from your LV. You can do it in parallel with regular file operations executing on this volume. Make sure, however, that no other volume operation (e.g. adding a brick) is in progress on your volume, otherwise your operation will fail with EBUSY.
Obviously, removing a brick will decrease the abstract capacity of your LV. Note that the other bricks must have enough space to store all the data blocks of the brick you want to remove, otherwise the removal operation will return an error (ENOSPC).

Suppose you want to remove brick /dev/vdb2 from your LV mounted at /mnt. Update your volume configuration with the UUID and name of the brick you want to remove (item #4). To remove the brick, simply pass its name as an argument to option "-r" and specify the logical volume by its mount point:

# volume.reiser4 -r /dev/vdb2 /mnt

The procedure of brick removal automatically invokes rebalancing, which distributes the data of the brick being removed among the other bricks, so that the resulting distribution is also fair. The portion of data stripes moved during such rebalancing is equal to the relative capacity of the brick being removed (that is, to the portion of capacity that the brick added to the LV's capacity).

It can happen that the command above completes with an error (like other user-space applications). In this case check the status of your LV:

# volume.reiser4 /mnt

If the volume is not balanced, then simply complete the balancing manually:

# volume.reiser4 -b /mnt

Otherwise, check the number of bricks in your logical volume - it should be the same as before the failed operation. The error ENOSPC indicates that the free space on the other bricks is not enough to fit all the data of the brick you want to remove.

On success, update your volume configuration: remove the information about the brick /dev/vdb2 at (#3) and (#4). Check your kernel logs: they should contain a message that brick /dev/vdb2 has been unregistered. Now device /dev/vdb2 doesn't belong to the logical volume any more, and you can reuse it for other purposes (re-format, etc).

= Changing brick's capacity =

At any time (assuming that no other volume operation is in progress) you can change the abstract capacity of any brick to some new non-zero value.
Changing capacity always changes the volume partitioning, and therefore breaks fairness of distribution, so Reiser5 automatically launches rebalancing to make sure that the resulting distribution is fair for the new set of capacities. In particular, increasing a brick's capacity will move some data from other bricks to the brick whose capacity was increased; decreasing a brick's capacity will move some data from the brick whose capacity was decreased to other bricks.

To change the abstract capacity of brick /dev/vdb1 to a new value (e.g. 200000), simply run

# volume.reiser4 -z /dev/vdb1 -c 200000 /mnt

pronounced as "resize brick /dev/vdb1 to new capacity 200000 in the volume mounted at /mnt".

The operation of changing capacity can return an error. Most likely it is ENOSPC, which is a side effect of concurrent regular file writes. In this case check the status of your LV. If it is unbalanced, then consider removing some files from your LV and complete the balancing by running

# volume.reiser4 -b /mnt

Otherwise, repeat the operation from scratch.

Comment. Changing a brick's capacity to 0 is undefined and will return an error. Consider the brick removal operation instead.

= Operations with meta-data brick =

The meta-data brick can also contain data stripes and participate in data distribution like the other data bricks, so all the volume operations described above are also applicable to the meta-data brick. Note, however, that it is impossible to completely remove the meta-data brick from the logical volume for obvious reasons (meta-data needs to be stored somewhere), so the brick removal operation applied to the meta-data brick actually removes it from the Data-Storage Array (DSA), not from the logical volume. The DSA is the subset of the LV consisting of the bricks participating in data distribution. Once you remove the meta-data brick from the DSA, that brick will be used only to store meta-data. The operation of adding a brick, applied to the meta-data brick, returns it to the DSA.
Important: Reiser5 doesn't count busy data and meta-data blocks separately. So, in contrast with data bricks (which contain only data), you are not able to find out the real space occupied by data blocks on the meta-data brick - Reiser5 knows only the total space occupied.

To check the status of the meta-data brick simply run

# volume.reiser4 /mnt

and compare the values of "bricks total" and "bricks in DSA". If they are equal, then the meta-data brick participates in data distribution. Otherwise, "bricks total" should be 1 more than "bricks in DSA" - this indicates that the meta-data brick doesn't participate in data distribution (and therefore doesn't contain data blocks). Note that other cases are impossible: for data bricks, participation in the LV and in the DSA is always equivalent.

= Unmounting a logical volume =

To terminate a mount session, just issue the usual umount command with the mount point specified. Note that after unmounting the volume, all bricks by default remain registered in the system until system shutdown. If you want to unregister a brick before system shutdown, then simply issue the following command:

# volume.reiser4 -u BRICK_NAME

= Deploying a logical volume after correct unmount =

Make sure (by checking your volume configuration) that all bricks of the volume are registered in the system. To register a brick, issue the following command:

# volume.reiser4 -g BRICK_NAME

The list of all volumes and bricks registered in the system can be found in the output of the following command:

# volume.reiser4 -l

Issue the usual mount(8) command against one of the bricks of your volume. It is recommended to issue it against the meta-data brick.

NOTE: Reiser5 will refuse to mount a logical volume when a wrong (incomplete or redundant) set of bricks is registered in the system. A redundant set of bricks appears, for example, when you mistakenly register a brick that was earlier removed from the logical volume.
= Deploying a logical volume after correct shutdown =

To mount your LV, first make sure that all its bricks (data and meta-data) are registered in the system.

Important: Reiser5 will refuse to mount a logical volume when a wrong (incomplete or redundant) set of bricks is registered in the system. A redundant set of bricks appears, for example, when you mistakenly register a brick that was removed from the logical volume. For this reason we strongly recommend keeping track of your LV - store its configuration somewhere, but not on that volume! And don't forget to update that configuration after _every_ volume operation. If you lost the configuration of your LV and don't remember it (which is most likely for large volumes), it will be rather painful to restore: currently there are no tools to manage logical volumes off-line, so users are prompted to do this on their own. It is not at all difficult.

To register a brick in the system use the following command:

# volume.reiser4 -g BRICK_NAME

To print a list of all registered bricks use

# volume.reiser4 -l

To mount your LV, simply issue a mount(8) command against one of the bricks of your LV. We recommend issuing it against the meta-data brick.

Comment. Reiser5 always tries to register the brick which is passed to the mount command as an argument, so it is not necessary to pre-register the brick you want to issue the mount command against.

= Deploying a logical volume after hard reset or system crash =

If no volume operations were interrupted by the hard reset or system crash, then just follow the instructions in this [https://reiser4.wiki.kernel.org/index.php?title=Logical_Volumes_Administration#Deploying_a_logical_volume_after_correct_shutdown section].

In Reiser5, only a restricted number of bricks participate in every transaction; the maximal number of such bricks can be specified by the user. At mount time a transaction replay procedure will be launched on each such brick independently, in parallel.
Depending on the kind of interrupted volume operation, perform one of the following actions:

== Adding a brick was interrupted ==

Check your volume configuration. Register the old set of bricks (that is, the set of bricks that the volume had before the operation was started) and try to mount. In the case of an error, register also the brick you wanted to add and try to mount again. Check the status of your LV by running

# volume.reiser4 /mnt

If the volume is unbalanced, then complete the balancing manually by running

# volume.reiser4 -b /mnt

Check the "bricks total" of your LV in the output of

# volume.reiser4 /mnt

Compare it with the old number of bricks in the configuration; the new value should be the old one incremented by 1. If the number of bricks is the same, then your operation of adding a brick was completely rolled back by the transaction manager, and you need to repeat it from scratch. Otherwise, your operation was successfully completed - update your volume configuration accordingly.

== Brick removal was interrupted ==

Check your volume configuration. Register the old set of bricks (that is, the set of bricks that the volume had before the interrupted operation), except the brick you wanted to remove. Try to mount the volume. In the case of an error, register also the brick you wanted to remove and try to mount again. Check the status of your LV:

# volume.reiser4 /mnt

If the volume is unbalanced, then complete the balancing manually by running

# volume.reiser4 -b /mnt

Comment. After successful completion of the balancing, the brick will be automatically removed from the volume. Make sure of it by checking the status of your LV:

# volume.reiser4 /mnt

Update your volume configuration accordingly.

== Another volume operation was interrupted ==

Using the volume configuration, register the new set of bricks and try to mount the volume. The mount should be successful.
Check the status of your LV:

# volume.reiser4 /mnt

If the volume is unbalanced, then complete the balancing manually by running

# volume.reiser4 -b /mnt

= LV monitoring =

Common info about the LV mounted at /mnt:

# volume.reiser4 /mnt

ID: Volume UUID
volume: ID of the plugin managing the volume
distribution: ID of the distribution plugin
stripe: Stripe size in bytes
segments: Number of hash space segments (for distribution)
bricks total: Total number of bricks in the volume
bricks in DSA: Number of bricks participating in data distribution
balanced: Balance status of the volume

Info about the brick of index J:

# volume.reiser4 -p J /mnt

internal ID: Brick's "internal ID" and its status in the volume
external ID: Brick's UUID
device name: Name of the block device associated with the brick
block count: Size of the block device in blocks
blocks used: Total number of occupied blocks on the device
system blocks: Minimal possible number of busy blocks on that device
data capacity: Abstract capacity of the brick
space usage: Portion of occupied blocks on the device
in DSA: Participation in regular data distribution
is proxy: Participation in data tiering (Burst Buffers, etc)

Comment. When retrieving a brick's info, make sure that no volume operations on that volume are in progress; otherwise the command above will return an error (EBUSY).

WARNING. Brick info provided this way is not necessarily the most recent. To get actual info, run sync(1) and make sure that no regular file operations are in progress.

= Checking free space =

To check the number of available free blocks on a volume mounted at /mnt, make sure that no regular file operations or volume operations are in progress on that volume, then run

# sync
# df --block-size=4K /mnt

To check the number of free blocks on the brick of index J, run

# volume.reiser4 -p J /mnt

then calculate the difference between "block count" and "blocks used".

Comment. Not all free blocks on a brick/volume are available for use.
The number of available free blocks is always ~95% of the total number of free blocks (Reiser4 reserves 5% to make sure that regular file truncate operations won't fail).

NOTE: volume.reiser4 shows the total number of free blocks, whereas df(1) shows the number of available free blocks.

The "space usage" statistic shows the portion of busy blocks on an individual brick. For the reasons explained above, "space usage" on any brick cannot be more than 0.95.

= Checking quality of data distribution =

Quality of data distribution is a measure of the deviation of the real data space usage from the ideal one defined by the volume partitioning. The smaller the deviation, the better the distribution quality.

Checking quality of distribution makes sense only when your volume partitioning is space-based, or coincides with the space-based one. If your partitioning is throughput-based and doesn't coincide with the space-based one, then the quality of the actual data distribution can be rather bad: in that case the file system takes care that low-performance devices don't become a bottleneck, and effective space usage is not a high priority.

Checking quality of data distribution is based on the free block accounting provided by the file system. Note that the file system doesn't count busy data and meta-data blocks separately, so you are not able to find the real data space usage - and hence to check quality of distribution - when the meta-data brick contains data blocks.

To check quality of distribution:

1) make sure that the meta-data brick doesn't contain data blocks;
2) make sure that no regular file or volume operations are currently in progress;
3) find the "blocks used", "system blocks" and "data capacity" statistics for each data brick:

# sync
# volume.reiser4 -p 1 /mnt
...
# volume.reiser4 -p N /mnt

4) find the real data space usage on each brick;
5) calculate the partitioning and the ideal data space usage on each data brick;
6) find the deviation of (4) from (5).

Example.
Let's build an LV of 3 bricks (one 10G meta-data brick vdb1, and two data bricks: vdc1 (10G) and vdd1 (5G)) with space-based partitioning:

# VOL_ID=`uuid -v4`
# echo "Using uuid $VOL_ID"
# mkfs.reiser4 -U $VOL_ID -y -t 256K /dev/vdb1
# mkfs.reiser4 -U $VOL_ID -y -a -t 256K /dev/vdc1
# mkfs.reiser4 -U $VOL_ID -y -a -t 256K /dev/vdd1
# mount /dev/vdb1 /mnt

Fill the meta-data brick with data:

# dd if=/dev/zero of=/mnt/myfile bs=256K
No space left on device...

Add data bricks /dev/vdc1 and /dev/vdd1 to the volume:

# volume.reiser4 -a /dev/vdc1 /mnt
# volume.reiser4 -a /dev/vdd1 /mnt

Move all data blocks to the newly added bricks:

# volume.reiser4 -r /dev/vdb1 /mnt
# sync

Now the meta-data brick doesn't contain data blocks (only meta-data ones), so we can calculate the quality of data distribution:

# volume.reiser4 /mnt -p0
blocks used: 503
# volume.reiser4 /mnt -p1
blocks used: 1657203
system blocks: 115
data capacity: 2621069
# volume.reiser4 /mnt -p2
blocks used: 833001
system blocks: 73
data capacity: 1310391

Based on the statistics above, calculate the quality of distribution.

Total data capacity of the volume:
C = 2621069 + 1310391 = 3931460

Relative capacities of the data bricks:
C1 = 2621069 / 3931460 = 0.6667
C2 = 1310391 / 3931460 = 0.3333

Real space usage on the data bricks (blocks used - system blocks):
R1 = 1657203 - 115 = 1657088
R2 = 833001 - 73 = 832928

Space usage on the volume:
R = R1 + R2 = 1657088 + 832928 = 2490016

Ideal data space usage on the data bricks:
I1 = C1 * R = 0.6667 * 2490016 = 1660094
I2 = C2 * R = 0.3333 * 2490016 = 829922

Deviation:
(D1, D2) = (R1, R2) - (I1, I2) = (-3006, 3006)

Relative deviation:
(D1, D2)/R = (-0.0012, 0.0012)

Quality of distribution:
Q = 1 - max(|D1|, |D2|)/R = 1 - 0.0012 = 0.9988

Comment. For any specified number of bricks N and quality of distribution Q it is possible to find a configuration of a logical volume composed of N bricks such that the quality of distribution on that volume is better than Q.

Comment. The quality of distribution Q doesn't depend on the number of bricks in the logical volume. This is a theorem, which can be strictly proven.

= FAQ =

Q. What happens if I lose a device-component of my logical volume (due to a breakdown, etc)?

A. The bodies of some of your regular files will become "punched" in random places. The portion of such files depends on the relative capacity of the lost brick, on the number of bricks in the logical volume, and on other factors. Fsck will be able to detect and remove files with corrupted bodies. Nevertheless, we recommend mirroring your bricks (e.g. by software or hardware RAID-1) to avoid such highly unpleasant situations.
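The arithmetic of the quality-of-distribution example above can be verified mechanically. A minimal sketch that follows the same steps, rounding relative capacities to four decimal places exactly as the text does:

```python
# Per-data-brick statistics from the example above:
# (blocks used, system blocks, data capacity)
bricks = [
    (1657203, 115, 2621069),  # data brick /dev/vdc1
    (833001,   73, 1310391),  # data brick /dev/vdd1
]

volume_capacity = sum(cap for _, _, cap in bricks)
# Partitioning: relative capacity of each data brick, rounded to 4 places.
C = [round(cap / volume_capacity, 4) for _, _, cap in bricks]
# Real data space usage: blocks used minus system blocks.
R = [used - system for used, system, _ in bricks]
total = sum(R)                                     # data blocks on the volume
I = [round(c * total) for c in C]                  # ideal data space usage
D = [r - i for r, i in zip(R, I)]                  # deviation
Q = 1 - round(max(abs(d) for d in D) / total, 4)   # quality of distribution

assert C == [0.6667, 0.3333]
assert R == [1657088, 832928]
assert I == [1660094, 829922]
assert D == [-3006, 3006]
assert round(Q, 4) == 0.9988
```

The same few lines can recheck the quality after any rebalancing: substitute the current "blocks used", "system blocks" and "data capacity" values reported by volume.reiser4 -p for each data brick.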
The item #4 is to handle incomplete operations interrupted by various reasons (system crash, hard reset, etc). For each volume its configuration should be stored somewhere (but not on that volume!) and properly updated before and after each volume operation performed on that volume. We make the user responsible for this. Volume configuration is needed to facilitate deploying a volume. '''Abstract capacity''' (or simply capacity) of a brick is a positive integer number. Capacity is a brick's property defined by user. Don't confuse it with the size of block device. Think of it as of brick's "weight" in some units. And this is the user, who decides, which property of the brick to assign as its abstract capacity and in which units. In particular, it can be size of the block device in kilobytes, or its size in megabytes, or its throughput in M/sec, or other geometric or physical parameter of the device, associated with the brick. It is important that capacities of all bricks of the same logical volume are measured in the same units. Also, it would be utterly pointless to assign different properties as abstract capacities for bricks of the same LV. For example, size of block device for one brick, and disk bandwidth for another one. Capacity of each brick gets initialized by mkfs utility. By default it is calculated as number of free blocks on the device at the very end of the formatting procedure. For meta-data brick it is calculated as 70% of such amount. Capacity of any brick can be changed on-line by user. '''Capacity of a logical volume''' is defined as a sum of capacities of its bricks-components. '''Relative capacity of a brick''' is the ratio of brick's capacity to volume's capacity. Relative capacity defines a portion of IO-requests that will be issued against that brick. Array of relative capacities (C1, C2, ...) of all bricks is called volume partitioning. Obviously, C1 + C2 + ... = 1. 
'''(Real) data space usage''' on a brick is number of data blocks, stored on that brick. '''Ideal (or expected) data space usage''' on a brick is T*C, where T is total number of data blocks stored in the volume. C is relative capacity of the brick. It is recommended to compose volumes in the way so that space-based partitioning coincides with throughput-based one - it would be the optimal volume configuration, which provides true parallelism. If it is impossible for some reason, then choose a preferred partitioning method (space-based, or throughput-based). Note that space-based partitioning saves volume space, whereas throughput based one saves volume throughput. When performing regular file operations, Reiser5 distributes data stripes throughout the volume evenly and fairly. It means that portion of IO-requests issued against each brick is equal to its relative capacity, that is, to the portion of capacity that the brick adds to the total volume's capacity. Most volume operations are accompanied by rebalancing, which keeps fairness of distribution. For example, adding a brick to a logical volume changes its partitioning, and hence, breaks fairness of the distribution, so we need to move some data stripes to the new brick to make distribution fair. Also you can not simply remove a brick from a logical volume - all data stripes should be moved from that brick to other bricks of the logical volume. Every time when user performs a volume operation, Reiser5 marks LV as "not balanced". After successful balancing the status of LV is changed to "balanced". If balancing procedure fails for some reasons, it should be resumed manually (with volume.reiser4 utility). It is allowed to perform regular file operations on not balanced LV. However, in this case: a) we don't guarantee a good quality of data distribution on your LV. b) you won't be able to perform volume operations on your LV except balancing - any other volume operation will return error (EBUSY). 
So, don't forget to bring your LV to the balanced state as soon as possible! = Prepare Software and Hardware = Build, install and boot kernel with Reiser4 of software framework release number 5.X.Y. Kernel patches can be found [https://sourceforge.net/projects/reiser4/files/v5-unstable/ here]. Note that by Linux kernel and GNU utilities the testing stuff is still recognized as "Reiser4". Make sure there is the following message in kernel logs: "Loading Reiser4 (Software Framework Release: 5.X.Y)" Build and install the latest [https://sourceforge.net/projects/reiser4/files/reiser4-utils/libaal/ libaal] Download, build and install the latest version 2.A.B of [https://sourceforge.net/projects/reiser4/files/v5-unstable/ Reiser4progs package]. Make sure that utility for managing logical volumes is installed (as a part of reiser4progs package) on your machine: # volume.reiser4 -? = Creating a logical volume = Start from choosing a unique ID (uuid) of your volume. By default it is generated by mkfs utility. However, user can generate it himself by proper tools (e.g. uuid(1)) and store in an environment variable for convenience: # VOL_ID=`uuid -v4` # echo "Using uuid $VOL_ID" Choose a stripe size for your logical volume. For a good quality of distribution it is recommended that stripe doesn't exceed 1/10000 of volume size. On the other hand, too small stripes will increase space consumption on your meta-data brick. In our example we choose stripe size 256K: # STRIPE=256K # echo "Using stripe size $STRIPE" Start from creating the first brick of your volume - meta-data brick, passing volume-ID and stripe size to mkfs.reiser4 utility: # mkfs.reiser4 -U $VOL_ID -t $STRIPE /dev/vdb1 Currently only one meta-data brick per volume is supported, so it is recommended that size of block device for meta-data brick in not too small. In most cases it will be enough, if your meta-data brick is not smaller than 1/200 of maximal volume size. 
For example, a 100G meta-data brick will be able to service a ~20T logical volume.

Data and meta-data bricks don't differ from the standpoint of disk format, and there is no special option to tell the mkfs utility that we want to create a meta-data brick: the first brick in the volume automatically becomes the meta-data brick, and all other bricks are interpreted as data bricks.

Mount your initial logical volume, consisting of one meta-data brick:

 # mount /dev/vdb1 /mnt

Find a record about your volume in the output of the following command:

 # volume.reiser4 -l

Create the configuration of your logical volume (its definition is above) and store it somewhere - but not on that volume! Your logical volume is now on-line and ready to use. You can perform regular file operations and volume operations (e.g. add a data brick to your LV).

= Adding a data brick to LV =

At any time you can add a data brick to your LV. You can do it in parallel with regular file operations executing on the volume. Make sure, however, that no other volume operation (e.g. removing a brick) is in progress on your volume, otherwise your operation will fail with EBUSY. Obviously, adding a brick increases the capacity of your volume.

Choose a block device for the new data brick. Make sure that it is neither too large nor too small: capacities of any two bricks of the same logical volume can not differ by more than 2^19 (~500,000) times. E.g. your logical volume can not contain both a 1M brick and a 2T brick. Any attempt to add a brick of improper capacity will fail with an error.

Format the device with the same volume ID and stripe size as you used for the meta-data brick, but also specify the "-a" option (to not restrict data capacity):

 # mkfs.reiser4 -U $VOL_ID -t $STRIPE -a /dev/vdb2

Important: the data brick must be formatted with the same volume ID and stripe size as the meta-data brick of your logical volume. Otherwise the operation of adding the data brick will fail.
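The capacity-ratio constraint above can be checked before formatting with a quick sketch (an assumption-labeled illustration, not a reiser4progs check):

```python
# Hedged sketch: capacities of any two bricks of one volume may
# differ by at most a factor of 2**19, per the rule stated above.

MAX_RATIO = 2 ** 19

def capacities_compatible(capacities):
    """True if every pair of brick capacities is within MAX_RATIO."""
    return max(capacities) <= MAX_RATIO * min(capacities)

M, T = 1024 ** 2, 1024 ** 4
print(capacities_compatible([1 * M, 2 * T]))     # False: differ by 2**21
print(capacities_compatible([10 * M, 100 * M]))  # True
```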
Update the configuration of your volume with the UUID or name of the brick you want to add (item #4). To add the brick, simply pass its name as an argument to the "-a" option and specify your LV via its mount point:

 # volume.reiser4 -a /dev/vdb2 /mnt

The procedure of adding a brick automatically invokes rebalancing, which moves a portion of data stripes to the newly added brick (so that the resulting distribution is fair). The portion of data blocks moved during such rebalancing is equal to the relative capacity of the new brick, that is, to the portion of capacity that the new brick adds to the updated LV's capacity. This important property defines the cost of the balancing procedure: if the portion of capacity added by a brick is small, then the number of stripes moved during balancing is also small.

Like other user-space utilities, the operation of adding a brick can return an error, even if the brick you wanted to add is properly formatted. In this case check the status of your LV:

 # volume.reiser4 /mnt

If the volume is unbalanced, then simply complete the balancing manually:

 # volume.reiser4 -b /mnt

Otherwise, check the number of bricks in your LV. Most likely it is the same as before the failed operation; in this case simply repeat the operation of adding a brick from scratch.

Upon successful completion update your volume configuration: increment (#2), add the info about the new brick to (#3), and remove the record at (#4).

= Removing a data brick from LV =

At any time you can remove a data brick from your LV. You can do it in parallel with regular file operations executing on the volume. Make sure, however, that no other volume operation (e.g. adding a brick) is in progress on your volume, otherwise your operation will fail with EBUSY. Obviously, removing a brick decreases the abstract capacity of your LV.
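In both cases - adding and removing a brick - the rebalancing cost stated above is the brick's relative capacity in the full volume. A small illustrative sketch (hypothetical numbers, not reiser4progs code):

```python
# Fraction of data stripes moved by the rebalancing that accompanies
# adding or removing a brick: it equals the brick's relative capacity,
# where volume_capacity is the total capacity including that brick.

def moved_fraction(brick_capacity, volume_capacity):
    return brick_capacity / volume_capacity

def stripes_moved(total_data_stripes, brick_capacity, volume_capacity):
    """Approximate number of stripes moved during the rebalancing."""
    return total_data_stripes * brick_capacity // volume_capacity

# Adding a 200-unit brick to an 1800-unit volume moves ~10% of stripes:
print(moved_fraction(200, 1800 + 200))        # 0.1
print(stripes_moved(50000, 200, 1800 + 200))  # 5000
```

This is why growing a volume by many small steps stays cheap: each step moves only the small fraction of data that the new brick contributes.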
Note that the other bricks must have enough space to store all data blocks of the brick you want to remove, otherwise the removal operation will return an error (ENOSPC).

Suppose you want to remove the brick /dev/vdb2 from your LV mounted at /mnt. Update your volume configuration with the UUID and name of the brick you want to remove (item #4). To remove the brick, simply pass its name as an argument to the "-r" option and specify the logical volume by its mount point:

 # volume.reiser4 -r /dev/vdb2 /mnt

The procedure of brick removal automatically invokes rebalancing, which distributes the data of the brick to be removed among the other bricks, so that the resulting distribution is also fair. The portion of data stripes moved during such rebalancing is equal to the relative capacity of the brick to be removed (that is, to the portion of capacity that the brick added to the LV's capacity).

It can happen that the command above completes with an error (like other user-space applications). In this case check the status of your LV:

 # volume.reiser4 /mnt

If the volume is not balanced, then simply complete the balancing manually:

 # volume.reiser4 -b /mnt

Otherwise, check the number of bricks in your logical volume - it should be the same as before the failed operation. The error -ENOSPC indicates that the free space on the other bricks is not enough to fit all the data of the brick you want to remove.

On success update your volume configuration: remove the information about the brick /dev/vdb2 at (#3) and (#4). Check your kernel logs: they should contain a message that the brick /dev/vdb2 has been unregistered. Now the device /dev/vdb2 doesn't belong to the logical volume any more, and you can reuse it for other purposes (re-format, etc).

= Changing brick's capacity =

At any time (assuming no other volume operation is in progress) you can change the abstract capacity of any brick to some new value different from 0.
Changing capacity always changes the volume partitioning and therefore breaks fairness of distribution, so Reiser5 automatically launches rebalancing to make sure that the resulting distribution is fair for the new set of capacities. In particular, increasing a brick's capacity will move some data from other bricks to the brick whose capacity was increased; decreasing a brick's capacity will move some data from the brick whose capacity was decreased to the other bricks.

To change the abstract capacity of the brick /dev/vdb1 to a new value (e.g. 200000), simply run

 # volume.reiser4 -z /dev/vdb1 -c 200000 /mnt

Pronounced as "resize brick /dev/vdb1 to new capacity 200000 in the volume mounted at /mnt".

The operation of changing capacity can return an error. Most likely it is -ENOSPC, which is a side effect of concurrent regular file writes. In this case check the status of your LV. If it is unbalanced, then consider removing some files from your LV and complete the balancing by running

 # volume.reiser4 -b /mnt

Otherwise, repeat the operation from scratch.

Comment. Changing a brick's capacity to 0 is undefined and will return an error. Consider the brick removal operation instead.

= Operations with meta-data brick =

The meta-data brick can also contain data stripes and participate in data distribution like the data bricks, so all the volume operations described above are also applicable to the meta-data brick. Note, however, that it is impossible to completely remove the meta-data brick from the logical volume, for obvious reasons (meta-data needs to be stored somewhere), so the brick removal operation applied to the meta-data brick actually removes it from the Data Storage Array (DSA), not from the logical volume. The DSA is the subset of the LV consisting of the bricks participating in data distribution. Once you remove the meta-data brick from the DSA, that brick will be used only to store meta-data. The operation of adding a brick, applied to the meta-data brick, returns it back to the DSA.
Important: Reiser5 doesn't count busy data and meta-data blocks separately. So, in contrast with data bricks (which contain only data), you are not able to find out the real space occupied by data blocks on the meta-data brick - Reiser5 knows only the total space occupied.

To check the status of the meta-data brick simply run

 # volume.reiser4 /mnt

and compare the values of "bricks total" and "bricks in DSA". If they are equal, then the meta-data brick participates in data distribution. Otherwise "bricks total" should be 1 more than "bricks in DSA" - this indicates that the meta-data brick doesn't participate in data distribution (and therefore doesn't contain data blocks). Note that other cases are impossible: for data bricks, participation in the LV and in the DSA are always equivalent.

= Unmounting a logical volume =

To terminate a mount session just issue the usual umount command with the mount point specified. Note that after unmounting the volume all bricks by default remain registered in the system until system shutdown. If you want to unregister a brick before system shutdown, then simply issue the following command:

 # volume.reiser4 -u BRICK_NAME

= Deploying a logical volume after correct unmount =

Make sure (by checking your volume configuration) that all bricks of the volume are registered in the system. To register a brick issue the following command:

 # volume.reiser4 -g BRICK_NAME

The list of all volumes and bricks registered in the system can be found in the output of the following command:

 # volume.reiser4 -l

Issue the usual mount(8) command against one of the bricks of your volume. It is recommended to issue it against the meta-data brick.

NOTE: Reiser5 will refuse to mount a logical volume when a wrong (incomplete or redundant) set of bricks is registered in the system. A redundant set of bricks appears, for example, when you mistakenly register a brick that was earlier removed from the logical volume.
= Deploying a logical volume after correct shutdown =

To mount your LV, first make sure that all its bricks (data and meta-data) are registered in the system.

Important: Reiser5 will refuse to mount a logical volume when a wrong (incomplete or redundant) set of bricks is registered in the system. A redundant set of bricks appears, for example, when you mistakenly register a brick that was removed from the logical volume. For this reason we strongly recommend keeping track of your LV - store its configuration somewhere, but not on that volume! And don't forget to update that configuration after _every_ volume operation. If you lose the configuration of your LV and don't remember it (which is most likely for large volumes), then it will be rather painful to restore: currently there are no tools to manage off-line logical volumes, so users have to do this on their own. It is not at all difficult.

To register a brick in the system use the following command:

 # volume.reiser4 -g BRICK_NAME

To print a list of all registered bricks use

 # volume.reiser4 -l

To mount your LV simply issue a mount(8) command against one of the bricks of your LV. We recommend issuing it against the meta-data brick.

Comment. Reiser5 always tries to register the brick which is passed to the mount command as an argument, so it is not necessary to pre-register the brick you want to issue the mount command against.

= Deploying a logical volume after hard reset or system crash =

If no volume operations were interrupted by the hard reset or system crash, then just follow the instructions in this [https://reiser4.wiki.kernel.org/index.php?title=Logical_Volumes_Administration#Deploying_a_logical_volume_after_correct_shutdown section].

In Reiser5 only a restricted number of bricks participate in every transaction; the maximal number of such bricks can be specified by the user. At mount time a transaction replay procedure will be launched on each such brick independently, in parallel.
Depending on the kind of interrupted volume operation, perform one of the following actions:

== Adding a brick was interrupted ==

Check your volume configuration. Register the old set of bricks (that is, the set of bricks that the volume had before the operation) and try to mount. In the case of an error, register also the brick you wanted to add and try to mount again. Check the status of your LV by running

 # volume.reiser4 /mnt

If the volume is unbalanced, then complete the balancing manually by running

 # volume.reiser4 -b /mnt

Check "bricks total" of your LV in the output of

 # volume.reiser4 /mnt

Compare it with the old number of bricks in the configuration. The new value should be the old one incremented by 1. If the number of bricks is the same, then your operation of adding a brick was completely rolled back by the transaction manager, and you need to repeat it from scratch. Otherwise your operation was successfully completed - update your volume configuration accordingly.

== Brick removal was interrupted ==

Check your volume configuration. Register the old set of bricks (that is, the set of bricks that the volume had before the interrupted operation) except the brick you wanted to remove. Try to mount the volume. In the case of an error, register also the brick you wanted to remove and try to mount again. Check the status of your LV:

 # volume.reiser4 /mnt

If the volume is unbalanced, then complete the balancing manually by running

 # volume.reiser4 -b /mnt

Comment. After successful completion of the balancing the brick will be automatically removed from the volume. Make sure of it by checking the status of your LV:

 # volume.reiser4 /mnt

Update your volume configuration accordingly.

== Another volume operation was interrupted ==

Using the volume configuration, register the new set of bricks and try to mount the volume. The mount should be successful.
Check the status of your LV:

 # volume.reiser4 /mnt

If the volume is unbalanced, then complete the balancing manually by running

 # volume.reiser4 -b /mnt

= LV monitoring =

Common info about the LV mounted at /mnt:

 # volume.reiser4 /mnt

 ID:             Volume UUID
 volume:         ID of the plugin managing the volume
 distribution:   ID of the distribution plugin
 stripe:         Stripe size in bytes
 segments:       Number of hash space segments (for distribution)
 bricks total:   Total number of bricks in the volume
 bricks in DSA:  Number of bricks participating in data distribution
 balanced:       Balanced status of the volume

Info about any of its bricks, of index J:

 # volume.reiser4 -p J /mnt

 internal ID:    Brick's "internal ID" and its status in the volume
 external ID:    Brick's UUID
 device name:    Name of the block device associated with the brick
 block count:    Size of the block device in blocks
 blocks used:    Total number of occupied blocks on the device
 system blocks:  Minimal possible number of busy blocks on that device
 data capacity:  Abstract capacity of the brick
 space usage:    Portion of occupied blocks on the device
 in DSA:         Participation in regular data distribution
 is proxy:       Participation in data tiering (Burst Buffers, etc)

Comment. When retrieving a brick's info, make sure that no volume operations are in progress on that volume. Otherwise the command above will return an error (EBUSY).

WARNING. The brick info provided this way is not necessarily the most recent. To get up-to-date info, run sync(1) and make sure that no regular file operations are in progress.

= Checking free space =

To check the number of available free blocks on a volume mounted at /mnt, make sure that no regular file operations or volume operations are in progress on that volume, then run

 # sync
 # df --block-size=4K /mnt

To check the number of free blocks on the brick of index J, run

 # volume.reiser4 -p J /mnt

then calculate the difference between "block count" and "blocks used".

Comment. Not all free blocks on a brick/volume are available for use.
The number of available free blocks is always ~95% of the total number of free blocks (Reiser4 reserves 5% to make sure that regular file truncate operations won't fail).

NOTE: volume.reiser4 shows the total number of free blocks, whereas df(1) shows the number of available free blocks. The "space usage" statistic shows the portion of busy blocks on an individual brick. For the reasons explained above, "space usage" on any brick can not be more than 0.95.

= Checking quality of data distribution =

Quality of data distribution is a measure of the deviation of the real data space usage from the ideal one defined by the volume partitioning. The smaller the deviation, the better the distribution quality.

Checking quality of distribution makes sense only when your volume partitioning is space-based, or coincides with the space-based one. If your partitioning is throughput-based and doesn't coincide with the space-based one, then the quality of the actual data distribution can be rather bad: in that case the file system cares about low-performance devices not becoming a bottleneck, and effective space usage is not a high priority.

Checking quality of data distribution is based on the free-blocks accounting provided by the file system. Note that the file system doesn't count busy data and meta-data blocks separately, so you are not able to find the real data space usage - and hence to check the quality of distribution - when the meta-data brick contains data blocks.

To check the quality of distribution:

* make sure that the meta-data brick doesn't contain data blocks;
* make sure that no regular file operations or volume operations are currently in progress;
* find the "blocks used", "system blocks" and "data capacity" statistics for each data brick:

 # sync
 # volume.reiser4 -p 1 /mnt
 ...
 # volume.reiser4 -p N /mnt

* find the real data space usage on each brick;
* calculate the partitioning and the ideal data space usage on each data brick;
* find the deviation of the real usage from the ideal one.

Example.
Let's build an LV of 3 bricks (one 10G meta-data brick vdb1, and two data bricks: vdc1 (10G) and vdd1 (5G)) with space-based partitioning:

 # VOL_ID=`uuid -v4`
 # echo "Using uuid $VOL_ID"
 # mkfs.reiser4 -U $VOL_ID -y -t 256K /dev/vdb1
 # mkfs.reiser4 -U $VOL_ID -y -a -t 256K /dev/vdc1
 # mkfs.reiser4 -U $VOL_ID -y -a -t 256K /dev/vdd1
 # mount /dev/vdb1 /mnt

Fill the meta-data brick with data:

 # dd if=/dev/zero of=/mnt/myfile bs=256K
 No space left on device...

Add the data bricks /dev/vdc1 and /dev/vdd1 to the volume:

 # volume.reiser4 -a /dev/vdc1 /mnt
 # volume.reiser4 -a /dev/vdd1 /mnt

Move all data blocks to the newly added bricks:

 # volume.reiser4 -r /dev/vdb1 /mnt
 # sync

Now the meta-data brick doesn't contain data blocks (only meta-data ones), so we can calculate the quality of data distribution:

 # volume.reiser4 /mnt -p0
 blocks used: 503
 # volume.reiser4 /mnt -p1
 blocks used: 1657203
 system blocks: 115
 data capacity: 2621069
 # volume.reiser4 /mnt -p2
 blocks used: 833001
 system blocks: 73
 data capacity: 1310391

Based on the statistics above, calculate the quality of distribution.

Total data capacity of the volume:
 C = 2621069 + 1310391 = 3931460
Relative capacities of the data bricks:
 C1 = 2621069 / 3931460 = 0.6667
 C2 = 1310391 / 3931460 = 0.3333
Real space usage on the data bricks (blocks used - system blocks):
 R1 = 1657203 - 115 = 1657088
 R2 = 833001 - 73 = 832928
Space usage on the volume:
 R = R1 + R2 = 1657088 + 832928 = 2490016
Ideal data space usage on the data bricks:
 I1 = C1 * R = 0.6667 * 2490016 = 1660094
 I2 = C2 * R = 0.3333 * 2490016 = 829922
Deviation:
 D = (R1, R2) - (I1, I2) = (-3006, 3006)
Relative deviation:
 D/R = (-0.0012, 0.0012)
Quality of distribution:
 Q = 1 - max(|D1/R|, |D2/R|) = 1 - 0.0012 = 0.9988

Comment. For any specified number of bricks N and quality of distribution Q it is possible to find a configuration of a logical volume composed of N bricks such that the quality of distribution on that volume is better than Q.
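The worked calculation above can be reproduced programmatically (an illustrative sketch using exact fractions instead of the rounded 0.6667/0.3333 values, so the result differs slightly in the last digits):

```python
# Quality of distribution from per-brick statistics, following the
# procedure above: real usage = blocks used - system blocks; ideal
# usage = total real usage * relative capacity.

def distribution_quality(stats):
    """stats: list of (blocks_used, system_blocks, data_capacity)
    tuples, one per data brick. Returns Q = 1 - max relative deviation."""
    real = [used - system for used, system, _ in stats]
    caps = [cap for _, _, cap in stats]
    total_real, total_cap = sum(real), sum(caps)
    rel_dev = [abs(r - total_real * c / total_cap) / total_real
               for r, c in zip(real, caps)]
    return 1 - max(rel_dev)

q = distribution_quality([(1657203, 115, 2621069),
                          (833001, 73, 1310391)])
print(round(q, 4))  # 0.9988, matching the worked example
```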
Quality of distribution Q does not depend on the number of bricks in the logical volume. This is a theorem, which can be strictly proven.

= FAQ =

Q. What happens if I lose a device-component of my logical volume (due to a breakdown, etc)?

A. The bodies of some of your regular files will become "punched" in random places. The portion of such files depends on the relative capacity of the lost brick, on the number of bricks in the logical volume, and on other factors. Fsck will be able to detect and remove files with corrupted bodies. Nevertheless, we recommend considering mirroring your bricks (e.g. by software or hardware RAID-1) to avoid such highly unpleasant situations.
and properly updated before and after each volume operation performed on that volume. We make the user responsible for this. Volume configuration is needed to facilitate deploying a volume. '''Abstract capacity''' (or simply capacity) of a brick is a positive integer number. Capacity is a brick's property defined by user. Don't confuse it with the size of block device. Think of it as of brick's "weight" in some units. And this is the user, who decides, which property of the brick to assign as its abstract capacity and in which units. In particular, it can be size of the block device in kilobytes, or its size in megabytes, or its throughput in M/sec, or other geometric or physical parameter of the device, associated with the brick. It is important that capacities of all bricks of the same logical volume are measured in the same units. Also, it would be utterly pointless to assign different properties as abstract capacities for bricks of the same LV. For example, size of block device for one brick, and disk bandwidth for another one. Capacity of each brick gets initialized by mkfs utility. By default it is calculated as number of free blocks on the device at the very end of the formatting procedure. For meta-data brick it is calculated as 70% of such amount. Capacity of any brick can be changed on-line by user. '''Capacity of a logical volume''' is defined as a sum of capacities of its bricks-components. '''Relative capacity of a brick''' is the ratio of brick's capacity to volume's capacity. Relative capacity defines a portion of IO-requests that will be issued against that brick. Array of relative capacities (C1, C2, ...) of all bricks is called volume partitioning. Obviously, C1 + C2 + ... = 1. '''(Real) data space usage''' on a brick is number of data blocks, stored on that brick. '''Ideal (or expected) data space usage''' on a brick is T*C, where T is total number of data blocks stored in the volume. C is relative capacity of the brick. 
It is recommended to compose volumes in the way so that space-based partitioning coincides with throughput-based one - it would be the optimal volume configuration, which provides true parallelism. If it is impossible for some reason, then choose a preferred partitioning method (space-based, or throughput-based). Note that space-based partitioning saves volume space, whereas throughput based one saves volume throughput. When performing regular file operations, Reiser5 distributes data stripes throughout the volume evenly and fairly. It means that portion of IO-requests issued against each brick is equal to its relative capacity, that is, to the portion of capacity that the brick adds to the total volume's capacity. Most volume operations are accompanied by rebalancing, which keeps fairness of distribution. For example, adding a brick to a logical volume changes its partitioning, and hence, breaks fairness of the distribution, so we need to move some data stripes to the new brick to make distribution fair. Also you can not simply remove a brick from a logical volume - all data stripes should be moved from that brick to other bricks of the logical volume. Every time when user performs a volume operation, Reiser5 marks LV as "not balanced". After successful balancing the status of LV is changed to "balanced". If balancing procedure fails for some reasons, it should be resumed manually (with volume.reiser4 utility). It is allowed to perform regular file operations on not balanced LV. However, in this case: a) we don't guarantee a good quality of data distribution on your LV. b) you won't be able to perform volume operations on your LV except balancing - any other volume operation will return error (EBUSY). So, don't forget to bring your LV to the balanced state as soon as possible! = Prepare Software and Hardware = Build, install and boot kernel with Reiser4 of software framework release number 5.X.Y. 
Kernel patches can be found [https://sourceforge.net/projects/reiser4/files/v5-unstable/ here]. Note that by Linux kernel and GNU utilities the testing stuff is still recognized as "Reiser4". Make sure there is the following message in kernel logs: "Loading Reiser4 (Software Framework Release: 5.X.Y)" Build and install the latest [https://sourceforge.net/projects/reiser4/files/reiser4-utils/libaal/ libaal] Download, build and install the latest version 2.A.B of [https://sourceforge.net/projects/reiser4/files/v5-unstable/ Reiser4progs package]. Make sure that utility for managing logical volumes is installed (as a part of reiser4progs package) on your machine: # volume.reiser4 -? = Creating a logical volume = Start from choosing a unique ID (uuid) of your volume. By default it is generated by mkfs utility. However, user can generate it himself by proper tools (e.g. uuid(1)) and store in an environment variable for convenience: # VOL_ID=`uuid -v4` # echo "Using uuid $VOL_ID" Choose a stripe size for your logical volume. For a good quality of distribution it is recommended that stripe doesn't exceed 1/10000 of volume size. On the other hand, too small stripes will increase space consumption on your meta-data brick. In our example we choose stripe size 256K: # STRIPE=256K # echo "Using stripe size $STRIPE" Start from creating the first brick of your volume - meta-data brick, passing volume-ID and stripe size to mkfs.reiser4 utility: # mkfs.reiser4 -U $VOL_ID -t $STRIPE /dev/vdb1 Currently only one meta-data brick per volume is supported, so it is recommended that size of block device for meta-data brick in not too small. In most cases it will be enough, if your meta-data brick is not smaller than 1/200 of maximal volume size. For example, 100G meta-data brick will be able to service ~20T logical volume. 
Data and meta-data bricks don't differ from the standpoint of disk format, and there is no special option to inform mkfs utility that we want to create exactly meta-data brick: the first brick in the volume automatically becomes a meta-data brick, and other bricks are interpreted as data bricks. Mount your initial logical volume consisting of one meta-data brick: # mount /dev/vdb1 /mnt Find a record about your volume in the output of the following command: # volume.reiser4 -l Create configuration of your logical volume (its definition is above) and store it somewhere, but not on that volume! Your logical volume is now on-line and ready to use. You can perform regular file operations and volume operations (e.g. add a data brick to your LV). = Adding a data brick to LV = At any time you are able to add a data brick to your LV. You can do it in parallel with regular file operations executing on this volume. Make sure, however, that there is no other volume operations (e.g. removing a brick) over your volume in progress, otherwise your operation will fail with EBUSY. Obviously, adding a brick will increase capacity of your volume. Choose a block device for the new data brick. Make sure that it is not too large, or too small. Capacities of any 2 bricks of the same logical volume can not differ more than 2^19 (~1 million) times. E.g. your logical volume can not contain both, 1M and 2T bricks. Any attempts to add a brick of improper capacity will fail with error. Format it with the same volume ID and stripe size, as you used for meta-data brick, but specify also "-a" option (to not restrict data capacity). # mkfs.reiser4 -U $VOL_ID -t $STRIPE -a /dev/vdb2 Important: it is important that data brick is formatted with the same volume ID and stripe size, as the meta-data brick of your logical volume. Otherwise, operation of adding a data brick will fail. Update configuration of your volume with UUID or name of the brick you want to add (item #4). 
To add a brick simply pass its name as an argument for the option "-a" and specify your LV via its mount point: # volume.reiser4 -a /dev/vdb2 /mnt The procedure of adding a brick automatically invokes re-balancing, which moves a portion of data stripes to the newly added brick (so that the resulted distribution will fair). Portion of data blocks moved during such rebalancing is equal to the relative capacity of the new brick, that is to the portion of capacity that the new brick adds to updated LV's capacity. This important property defines the cost of balancing procedure. If the portion of capacity added by a brick is small, then number of stripes moved during balancing is also small. Like other user-space utilities, the operation of adding a brick can return error, even in the assumption that the brick you wanted to add is properly formatted. In this case check the status of your LV: # volume.reiser4 /mnt If the volume is unbalanced, then simply complete balancing manually: # volume.reiser4 -b /mnt Otherwise, check number of bricks in your LV. Most likely that it is the same as it was before the failed operation. In this case simply repeat the operation of adding a brick from scratch. Upon successful completion update your volume configuration. That is, increment (#2), add info about the new brick to (#3) and remove records at (#4). = Removing a data brick from LV = At any time you are able to remove a data brick from your LV. You can do it in parallel with regular file operations executing on this volume. Make sure, however, that there is no other volume operations (e.g. adding a brick) over your volume in progress, otherwise your operation will fail with EBUSY. Obviously, removing a brick will decrease abstract capacity of your LV. Note that other bricks should have enough space to store all data blocks of the brick you want to remove, otherwise, the removal operation will return error (ENOSPC). 
Suppose you want to remove brick /dev/vdb2 from your LV mounted at /mnt. Update your volume configuration with the UUID and name of the brick you want to remove (#item #4). To remove a brick simply pass its name as an argument for option "-r" and specify the logical volume by its mount point: # volume.reiser4 -r /dev/vdb2 /mnt The procedure of brick removal automatically invokes re-balancing, which distributes data of the brick to be removed among other bricks, so that resulted distribution is also fair. Portion of data stripes moved during such rebalancing is equal to the relative capacity of the brick to be removed (that it to the portion of capacity that the brick added to LV's capacity). It can happen, that the command above completes with error (like other user-space applications). In this case check the status of your LV: # volume.reiser4 /mnt If volume is not balanced, then simply complete balancing manually: # volume.reiser4 -b /mnt Otherwise, check the number of the bricks in your logical volume - it should be the same as before the failed operation. The error -ENOSPC indicates that free space on other bricks is not enough to fit all the data of the brick you want to remove. On success update your volume configuration: remove the information about the brick /dev/vdb2 at #3 and #4. Check your kernel logs: it should contain a message that brick /dev/vdb2 has been unregistered. Now device /dev/vdb2 doesn't belong to the logical volume any more, and you can reuse it for other purposes (re-format, etc). = Changing brick's capacity = At any time (in the assumption that no other volume operation is in progress) you can change abstract capacity of any brick to some new value, different from 0. Changing capacity always changes volume partitioning, and therefore, breaks fairness of distribution, so Reiser5 automatically launches rebalancing to make sure that resulted distribution is fair for the new set of capacities. 
In particular, increasing a brick's capacity will move some data from other bricks to the brick whose capacity was increased. Decreasing a brick's capacity will move some data from the brick whose capacity was decreased to other bricks. To change the abstract capacity of a brick /dev/vdb1 to a new value (e.g. 200000), simply run

 # volume.reiser4 -z /dev/vdb1 -c 200000 /mnt

pronounced as "resize brick /dev/vdb1 to new capacity 200000 in the volume mounted at /mnt". The operation of changing capacity can return an error. Most likely it is -ENOSPC, which is a side effect of concurrent regular file writes. In this case check the status of your LV. If it is unbalanced, then consider removing some files from your LV and complete balancing by running

 # volume.reiser4 -b /mnt

Otherwise, repeat the operation from scratch.

Comment. Changing a brick's capacity to 0 is undefined and will return an error. Consider the brick removal operation instead.

= Operations with meta-data brick =

The meta-data brick can also contain data stripes and participate in data distribution like the data bricks, so all the volume operations described above are also applicable to the meta-data brick. Note, however, that it is impossible to completely remove the meta-data brick from the logical volume for obvious reasons (meta-data need to be stored somewhere), so the brick removal operation applied to the meta-data brick actually removes it from the Data Storage Array (DSA), not from the logical volume. The DSA is the subset of the LV consisting of the bricks participating in data distribution. Once you remove the meta-data brick from the DSA, that brick will be used only to store meta-data. The operation of adding a brick, applied to the meta-data brick, returns it back to the DSA.

Important: Reiser5 doesn't count busy data and meta-data blocks separately. So, in contrast with data bricks (which contain only data), you are not able to find out the real space occupied by data blocks on the meta-data brick - Reiser5 knows only the total space occupied.
To check the status of the meta-data brick, simply run

 # volume.reiser4 /mnt

and compare the values of "bricks total" and "bricks in DSA". If they are equal, then the meta-data brick participates in data distribution. Otherwise, "bricks total" should be 1 more than "bricks in DSA" - it indicates that the meta-data brick doesn't participate in data distribution (and therefore doesn't contain data blocks). Note that other cases are impossible: for data bricks, participation in the LV and in the DSA is always equivalent.

= Unmounting a logical volume =

To terminate a mount session, just issue the usual umount command with the mount point specified. Note that after unmounting the volume all bricks by default remain registered in the system until system shutdown. If you want to unregister a brick before system shutdown, then simply issue the following command:

 # volume.reiser4 -u BRICK_NAME

= Deploying a logical volume after correct unmount =

Make sure (by checking your volume configuration) that all bricks of the volume are registered in the system. To register a brick, issue the following command:

 # volume.reiser4 -g BRICK_NAME

The list of all volumes and bricks registered in the system can be found in the output of the following command:

 # volume.reiser4 -l

Issue the usual mount(8) command against one of the bricks of your volume. It is recommended to issue it against the meta-data brick.

NOTE: Reiser5 will refuse to mount a logical volume when a wrong (incomplete or redundant) set of bricks is registered in the system. A redundant set of bricks appears, for example, when you mistakenly register a brick that was earlier removed from the logical volume.

= Deploying a logical volume after correct shutdown =

To mount your LV, first make sure that all its bricks (data and meta-data) are registered in the system.

Important: Reiser5 will refuse to mount a logical volume when a wrong (incomplete or redundant) set of bricks is registered in the system.
A redundant set of bricks appears, for example, when you mistakenly register a brick that was removed from the logical volume. For this reason we strongly recommend keeping track of your LV - store its configuration somewhere, but not on this volume! And don't forget to update that configuration after _every_ volume operation. If you lost the configuration of your LV and don't remember it (which is most likely for large volumes), then it will be rather painful to restore: currently there are no tools to manage offline logical volumes, so users have to take care of this on their own. It is not at all difficult.

To register a brick in the system use the following command:

 # volume.reiser4 -g BRICK_NAME

To print a list of all registered bricks use

 # volume.reiser4 -l

To mount your LV, simply issue a mount(8) command against one of the bricks of your LV. We recommend issuing it against the meta-data brick.

Comment. Reiser5 always tries to register the brick which is passed to the mount command as an argument, so it is not necessary to pre-register the brick you want to issue the mount command against.

= Deploying a logical volume after hard reset or system crash =

If no volume operations were interrupted by the hard reset or system crash, then just follow the instructions in this [https://reiser4.wiki.kernel.org/index.php?title=Logical_Volumes_Administration#Deploying_a_logical_volume_after_correct_shutdown section]. In Reiser5 only a restricted number of bricks participate in every transaction; the maximal number of such bricks can be specified by the user. At mount time a transaction replay procedure will be launched on each such brick independently, in parallel. Depending on the kind of interrupted volume operation, perform one of the following actions:

== Adding a brick was interrupted ==

Check your volume configuration. Register the old set of bricks (that is, the set of bricks that the volume had before applying the operation) and try to mount.
If that fails, register also the brick you wanted to add and try to mount again. Check the status of your LV by running

 # volume.reiser4 /mnt

If the volume is unbalanced, then complete balancing manually by running

 # volume.reiser4 -b /mnt

Check "bricks total" of your LV in the output of

 # volume.reiser4 /mnt

Compare it with the old number of bricks in the configuration. The new value should exceed the old one by 1. If the number of bricks is the same, then your operation of adding a brick was completely rolled back by the transaction manager, and you need to repeat it from scratch. Otherwise, your operation was successfully completed - update your volume configuration accordingly.

== Brick removal was interrupted ==

Check your volume configuration. Register the old set of bricks (that is, the set of bricks that the volume had before applying the interrupted operation) except the brick you wanted to remove. Try to mount the volume. If that fails, register also the brick you wanted to remove and try to mount again. Check the status of your LV:

 # volume.reiser4 /mnt

If the volume is unbalanced, then complete balancing manually by running

 # volume.reiser4 -b /mnt

Comment. After successful completion of balancing the brick will be automatically removed from the volume. Make sure of it by checking the status of your LV:

 # volume.reiser4 /mnt

Update your volume configuration accordingly.

== Another volume operation was interrupted ==

Using the volume configuration, register the new set of bricks and try to mount the volume. The mount should be successful.
Check the status of your LV:

 # volume.reiser4 /mnt

If the volume is unbalanced, then complete balancing manually by running

 # volume.reiser4 -b /mnt

= LV monitoring =

Common info about the LV mounted at /mnt:

 # volume.reiser4 /mnt

 ID:             Volume UUID
 volume:         ID of the plugin managing the volume
 distribution:   ID of the distribution plugin
 stripe:         Stripe size in bytes
 segments:       Number of hash space segments (for distribution)
 bricks total:   Total number of bricks in the volume
 bricks in DSA:  Number of bricks participating in data distribution
 balanced:       Balanced status of the volume

Info about any of its bricks, of index J:

 # volume.reiser4 -p J /mnt

 internal ID:    Brick's "internal ID" and its status in the volume
 external ID:    Brick's UUID
 device name:    Name of the block device associated with the brick
 block count:    Size of the block device in blocks
 blocks used:    Total number of occupied blocks on the device
 system blocks:  Minimal possible number of busy blocks on that device
 data capacity:  Abstract capacity of the brick
 space usage:    Portion of occupied blocks on the device
 in DSA:         Participation in regular data distribution
 is proxy:       Participation in data tiering (Burst Buffers, etc)

Comment. When retrieving a brick's info, make sure that no volume operations on that volume are in progress. Otherwise the command above will return an error (EBUSY).

WARNING. Brick info provided this way is not necessarily the most recent. To get up-to-date info, run sync(1) and make sure that no regular file operations are in progress.

= Checking free space =

To check the number of available free blocks on a volume mounted at /mnt, make sure that no regular file operations or volume operations are in progress on that volume, then run

 # sync
 # df --block-size=4K /mnt

To check the number of free blocks on the brick of index J, run

 # volume.reiser4 -p J /mnt

then calculate the difference between "block count" and "blocks used".

Comment. Not all free blocks on a brick/volume are available for use.
The number of available free blocks is always ~95% of the total number of free blocks (Reiser4 reserves 5% to make sure that regular file truncate operations won't fail).

NOTE: volume.reiser4 shows the total number of free blocks, whereas df(1) shows the number of available free blocks. The "space usage" statistic shows the portion of busy blocks on an individual brick. For the reasons explained above, "space usage" on any brick can not be more than 0.95.

= Checking quality of data distribution =

Quality of data distribution is a measure of the deviation of the real data space usage from the ideal one defined by the volume partitioning. The smaller the deviation, the better the distribution quality. Checking quality of distribution makes sense only when your volume partitioning is space-based, or coincides with the space-based one. If your partitioning is throughput-based and doesn't coincide with the space-based one, then the quality of the actual data distribution can be rather bad: in that case the file system takes care that low-performance devices don't become a bottleneck, and effective space usage is not a high priority.

Checking quality of data distribution is based on the free blocks accounting provided by the file system. Note that the file system doesn't count busy data and meta-data blocks separately, so you are not able to find the real data space usage, and hence to check quality of distribution, when the meta-data brick contains data blocks.

To check quality of distribution:

* make sure that the meta-data brick doesn't contain data blocks;
* make sure that no regular file or volume operations are currently in progress;
* find the "blocks used", "system blocks" and "data capacity" statistics for each data brick:

 # sync
 # volume.reiser4 -p 1 /mnt
 ...
 # volume.reiser4 -p N /mnt

* find the real data space usage on each brick;
* calculate the partitioning and the ideal data space usage on each data brick;
* find the deviation of (4) from (5).

Example.
Let's build an LV of 3 bricks (one 10G meta-data brick vdb1, and two data bricks: vdc1 (10G) and vdd1 (5G)) with space-based partitioning:

 # VOL_ID=`uuid -v4`
 # echo "Using uuid $VOL_ID"
 # mkfs.reiser4 -U $VOL_ID -y -t 256K /dev/vdb1
 # mkfs.reiser4 -U $VOL_ID -y -a -t 256K /dev/vdc1
 # mkfs.reiser4 -U $VOL_ID -y -a -t 256K /dev/vdd1
 # mount /dev/vdb1 /mnt

Fill the meta-data brick with data:

 # dd if=/dev/zero of=/mnt/myfile bs=256K
 No space left on device...

Add data bricks /dev/vdc1 and /dev/vdd1 to the volume:

 # volume.reiser4 -a /dev/vdc1 /mnt
 # volume.reiser4 -a /dev/vdd1 /mnt

Move all data blocks to the newly added bricks:

 # volume.reiser4 -r /dev/vdb1 /mnt
 # sync

Now the meta-data brick doesn't contain data blocks (only meta-data), so we can calculate the quality of data distribution:

 # volume.reiser4 /mnt -p0
 blocks used: 503
 # volume.reiser4 /mnt -p1
 blocks used: 1657203
 system blocks: 115
 data capacity: 2621069
 # volume.reiser4 /mnt -p2
 blocks used: 833001
 system blocks: 73
 data capacity: 1310391

Based on the statistics above, calculate the quality of distribution.

Total data capacity of the volume:

 C = 2621069 + 1310391 = 3931460

Relative capacities of the data bricks:

 C1 = 2621069 / 3931460 = 0.6667
 C2 = 1310391 / 3931460 = 0.3333

Real space usage on the data bricks (blocks used - system blocks):

 R1 = 1657203 - 115 = 1657088
 R2 = 833001 - 73 = 832928

Space usage on the volume:

 R = R1 + R2 = 1657088 + 832928 = 2490016

Ideal data space usage on the data bricks:

 I1 = C1 * R = 0.6667 * 2490016 = 1660094
 I2 = C2 * R = 0.3333 * 2490016 = 829922

Deviation:

 D = (R1, R2) - (I1, I2) = (-3006, 3006)

Relative deviation:

 D/R = (-0.0012, 0.0012)

Quality of distribution:

 Q = 1 - max(|D1|, |D2|)/R = 1 - 0.0012 = 0.9988

Comment. For any specified number of bricks N and quality of distribution Q it is possible to find a configuration of a logical volume composed of N bricks, such that the quality of distribution on that volume will be better than Q.

Comment.
Quality of distribution Q doesn't depend on the number of bricks in the logical volume. This is a theorem, which can be strictly proven.

= FAQ =

Q. What happens if I lose a device-component of my logical volume (due to a breakdown, etc)?

A. The bodies of some of your regular files will become "punched" in random places. The portion of such files depends on the relative capacity of the lost brick, on the number of bricks in the logical volume, and on other factors. Fsck will be able to detect and remove such files with corrupted bodies. Nevertheless, we recommend considering mirroring your bricks (e.g. by software or hardware RAID-1) to avoid such highly unpleasant situations.

A logical volume (LV) can be composed of any number of block devices, different in physical and geometric parameters. However, the optimal configuration (true parallelism) imposes some restrictions and dependencies on the sizes of such devices.

WARNING: This stuff is not stable. Don't put important data on logical volumes managed by software of release number 5.X.Y. Also, don't mount your old partitions in kernels with Reiser4 of SFRN 5.X.Y before its stabilization.

IMPORTANT: Currently there are no tools to manage Reiser5 logical volumes off-line, so it is strongly recommended to save/update the configuration of your LV in a file which doesn't belong to that volume.

= Basic definitions. Volume configuration. Brick's capacity. Partitioning. Fair distribution. Balancing =

The basic configuration of a logical volume is the following information:

1) Volume UUID;
2) Number of bricks in the volume;
3) List of brick names or UUIDs in the volume;
4) UUID or name of the brick to be added/removed (if any). That brick is not counted in (2) and (3).

For each volume its configuration should be stored somewhere (but not on that volume!)
and properly updated before and after each volume operation performed on that volume. We make the user responsible for this. The volume configuration is needed to facilitate deploying the volume.

'''Abstract capacity''' (or simply capacity) of a brick is a positive integer number. Capacity is a brick's property defined by the user. Don't confuse it with the size of the block device. Think of it as the brick's "weight" in some units. It is the user who decides which property of the brick to assign as its abstract capacity, and in which units. In particular, it can be the size of the block device in kilobytes, or its size in megabytes, or its throughput in M/sec, or another geometric or physical parameter of the device associated with the brick. It is important that the capacities of all bricks of the same logical volume are measured in the same units. Also, it would be utterly pointless to assign different properties as abstract capacities for bricks of the same LV - for example, the size of the block device for one brick, and disk bandwidth for another.

The capacity of each brick gets initialized by the mkfs utility. By default it is calculated as the number of free blocks on the device at the very end of the formatting procedure. For the meta-data brick it is calculated as 70% of that amount. The capacity of any brick can be changed on-line by the user.

'''Capacity of a logical volume''' is defined as the sum of the capacities of its component bricks.

'''Relative capacity of a brick''' is the ratio of the brick's capacity to the volume's capacity. Relative capacity defines the portion of IO-requests that will be issued against that brick. The array of relative capacities (C1, C2, ...) of all bricks is called the volume partitioning. Obviously, C1 + C2 + ... = 1.

'''(Real) data space usage''' on a brick is the number of data blocks stored on that brick.

'''Ideal (or expected) data space usage''' on a brick is T*C, where T is the total number of data blocks stored in the volume and C is the relative capacity of the brick.
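The definitions above can be turned into a small calculation. A sketch, using the brick statistics from the quality-of-distribution worked example on this page, that computes the partitioning, the ideal space usage and the resulting quality of distribution:

```shell
# Real data usage R_i = (blocks used - system blocks), and data
# capacity of each data brick, from the worked example on this page.
R1=1657088; CAP1=2621069
R2=832928;  CAP2=1310391

Q=$(awk -v r1="$R1" -v c1="$CAP1" -v r2="$R2" -v c2="$CAP2" 'BEGIN {
    C1 = c1 / (c1 + c2)          # volume partitioning (C1 + C2 = 1)
    C2 = c2 / (c1 + c2)
    R  = r1 + r2                 # total data blocks on the volume
    I1 = C1 * R; I2 = C2 * R     # ideal (expected) data space usage
    d1 = (r1 - I1) / R           # relative deviations from the ideal
    d2 = (r2 - I2) / R
    m = (d1 < 0 ? -d1 : d1)
    a = (d2 < 0 ? -d2 : d2)
    if (a > m) m = a
    printf "%.4f", 1 - m         # quality of distribution
}')
echo "Q = $Q"   # Q = 0.9988
```

This computes with full floating-point precision rather than the 4-decimal rounding used in the hand calculation, but it lands on the same Q.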
It is recommended to compose volumes in such a way that the space-based partitioning coincides with the throughput-based one - that would be the optimal volume configuration, which provides true parallelism. If that is impossible for some reason, then choose a preferred partitioning method (space-based or throughput-based). Note that space-based partitioning saves volume space, whereas throughput-based partitioning saves volume throughput.

When performing regular file operations, Reiser5 distributes data stripes throughout the volume evenly and fairly. This means that the portion of IO-requests issued against each brick is equal to its relative capacity, that is, to the portion of capacity that the brick adds to the total volume capacity.

Most volume operations are accompanied by rebalancing, which keeps the distribution fair. For example, adding a brick to a logical volume changes its partitioning, and hence breaks fairness of the distribution, so some data stripes have to be moved to the new brick to make the distribution fair. Likewise, you can not simply remove a brick from a logical volume - all data stripes have to be moved from that brick to the other bricks of the logical volume.

Every time the user performs a volume operation, Reiser5 marks the LV as "not balanced". After successful balancing the status of the LV is changed to "balanced". If the balancing procedure fails for some reason, it should be resumed manually (with the volume.reiser4 utility). It is allowed to perform regular file operations on an unbalanced LV. However, in this case:

a) we don't guarantee a good quality of data distribution on your LV;
b) you won't be able to perform volume operations on your LV except balancing - any other volume operation will return an error (EBUSY).

So, don't forget to bring your LV to the balanced state as soon as possible!

= Prepare Software and Hardware =

Build, install and boot a kernel with Reiser4 of software framework release number 5.X.Y.
Kernel patches can be found [https://sourceforge.net/projects/reiser4/files/v5-unstable/ here]. Note that the Linux kernel and GNU utilities still recognize the testing stuff as "Reiser4". Make sure the following message is in the kernel logs:

 "Loading Reiser4 (Software Framework Release: 5.X.Y)"

Build and install the latest [https://sourceforge.net/projects/reiser4/files/reiser4-utils/libaal/ libaal]. Download, build and install the latest version 2.A.B of the [https://sourceforge.net/projects/reiser4/files/v5-unstable/ Reiser4progs package]. Make sure that the utility for managing logical volumes is installed (as a part of the reiser4progs package) on your machine:

 # volume.reiser4 -?

= Creating a logical volume =

Start by choosing a unique ID (uuid) for your volume. By default it is generated by the mkfs utility. However, the user can generate it himself with suitable tools (e.g. uuid(1)) and store it in an environment variable for convenience:

 # VOL_ID=`uuid -v4`
 # echo "Using uuid $VOL_ID"

Choose a stripe size for your logical volume. For a good quality of distribution it is recommended that the stripe size doesn't exceed 1/10000 of the volume size. On the other hand, too small stripes will increase space consumption on your meta-data brick. In our example we choose a stripe size of 256K:

 # STRIPE=256K
 # echo "Using stripe size $STRIPE"

Start by creating the first brick of your volume - the meta-data brick - passing the volume ID and stripe size to the mkfs.reiser4 utility:

 # mkfs.reiser4 -U $VOL_ID -t $STRIPE /dev/vdb1

Currently only one meta-data brick per volume is supported, so it is recommended that the size of the block device for the meta-data brick is not too small. In most cases it will be enough if your meta-data brick is not smaller than 1/200 of the maximal volume size. For example, a 100G meta-data brick will be able to service a ~20T logical volume.
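The two sizing rules above (stripe not exceeding 1/10000 of the volume size, meta-data brick not smaller than 1/200 of the maximal volume size) are easy to script. A sketch for a hypothetical 10T target volume:

```shell
# Hypothetical target volume size, in GiB (10 TiB).
VOL_GIB=10240

# Stripe size should not exceed 1/10000 of the volume size.
MAX_STRIPE_MIB=$(( VOL_GIB * 1024 / 10000 ))

# Meta-data brick should be at least 1/200 of the maximal volume size.
MIN_META_GIB=$(( VOL_GIB / 200 ))

echo "max stripe size:     ${MAX_STRIPE_MIB} MiB"   # 1048 MiB
echo "min meta-data brick: ${MIN_META_GIB} GiB"     # 51 GiB
```

These are upper and lower bounds, not prescriptions: the example in the text uses a much smaller 256K stripe, which is well within the 1/10000 limit.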
Data and meta-data bricks don't differ from the standpoint of disk format, and there is no special option to tell the mkfs utility that we want to create a meta-data brick: the first brick in the volume automatically becomes the meta-data brick, and the other bricks are interpreted as data bricks. Mount your initial logical volume consisting of one meta-data brick:

 # mount /dev/vdb1 /mnt

Find the record about your volume in the output of the following command:

 # volume.reiser4 -l

Create the configuration of your logical volume (its definition is above) and store it somewhere, but not on that volume! Your logical volume is now on-line and ready to use. You can perform regular file operations and volume operations (e.g. add a data brick to your LV).

= Adding a data brick to LV =

At any time you can add a data brick to your LV. You can do it in parallel with regular file operations executing on this volume. Make sure, however, that no other volume operation (e.g. removing a brick) is in progress on your volume, otherwise your operation will fail with EBUSY. Obviously, adding a brick will increase the capacity of your volume.

Choose a block device for the new data brick. Make sure that it is not too large or too small: the capacities of any 2 bricks of the same logical volume can not differ by more than 2^19 (~500,000) times. E.g. your logical volume can not contain both 1M and 2T bricks. Any attempt to add a brick of improper capacity will fail with an error.

Format it the same way as the meta-data brick, but specify also the "-a" option (to let mkfs know that it is a data brick):

 # mkfs.reiser4 -U $VOL_ID -t 256K -a /dev/vdb2

Important: make sure you specified the same volume ID and stripe size as the other bricks of the logical volume have. Otherwise, the operation of adding a data brick will fail.

Update the configuration of your volume with the UUID or name of the brick you want to add (item #4).
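The configuration items (1)-(4) from the "Basic definitions" section can be kept, for example, in a plain-text file on a filesystem that does not belong to the volume. A hypothetical sketch (the UUID and the file layout are illustrative only, not a format required by any tool):

```shell
# Write the volume configuration to a file outside the LV.
CONF="${TMPDIR:-/tmp}/lv-backup.conf"

cat > "$CONF" <<'EOF'
# (1) volume UUID
volume_uuid=f81d4fae-7dec-11d0-a765-00a0c91e6bf6
# (2) number of bricks
bricks_total=1
# (3) brick names
brick=/dev/vdb1
# (4) brick being added/removed, if any (not counted in (2) and (3))
pending=/dev/vdb2
EOF

cat "$CONF"
```

After the add operation completes, item (2) would be incremented, /dev/vdb2 moved from `pending` to the brick list, and the `pending` record cleared.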
To add a brick simply pass its name as an argument for the option "-a" and specify your LV via its mount point: # volume.reiser4 -a /dev/vdb2 /mnt The procedure of adding a brick automatically invokes re-balancing, which moves a portion of data stripes to the newly added brick (so that the resulted distribution will fair). Portion of data blocks moved during such rebalancing is equal to the relative capacity of the new brick, that is to the portion of capacity that the new brick adds to updated LV's capacity. This important property defines the cost of balancing procedure. If the portion of capacity added by a brick is small, then number of stripes moved during balancing is also small. Like other user-space utilities, the operation of adding a brick can return error, even in the assumption that the brick you wanted to add is properly formatted. In this case check the status of your LV: # volume.reiser4 /mnt If the volume is unbalanced, then simply complete balancing manually: # volume.reiser4 -b /mnt Otherwise, check number of bricks in your LV. Most likely that it is the same as it was before the failed operation. In this case simply repeat the operation of adding a brick from scratch. Upon successful completion update your volume configuration. That is, increment (#2), add info about the new brick to (#3) and remove records at (#4). = Removing a data brick from LV = At any time you are able to remove a data brick from your LV. You can do it in parallel with regular file operations executing on this volume. Make sure, however, that there is no other volume operations (e.g. adding a brick) over your volume in progress, otherwise your operation will fail with EBUSY. Obviously, removing a brick will decrease abstract capacity of your LV. Note that other bricks should have enough space to store all data blocks of the brick you want to remove, otherwise, the removal operation will return error (ENOSPC). 
Suppose you want to remove brick /dev/vdb2 from your LV mounted at /mnt. Update your volume configuration with the UUID and name of the brick you want to remove (#item #4). To remove a brick simply pass its name as an argument for option "-r" and specify the logical volume by its mount point: # volume.reiser4 -r /dev/vdb2 /mnt The procedure of brick removal automatically invokes re-balancing, which distributes data of the brick to be removed among other bricks, so that resulted distribution is also fair. Portion of data stripes moved during such rebalancing is equal to the relative capacity of the brick to be removed (that it to the portion of capacity that the brick added to LV's capacity). It can happen, that the command above completes with error (like other user-space applications). In this case check the status of your LV: # volume.reiser4 /mnt If volume is not balanced, then simply complete balancing manually: # volume.reiser4 -b /mnt Otherwise, check the number of the bricks in your logical volume - it should be the same as before the failed operation. The error -ENOSPC indicates that free space on other bricks is not enough to fit all the data of the brick you want to remove. On success update your volume configuration: remove the information about the brick /dev/vdb2 at #3 and #4. Check your kernel logs: it should contain a message that brick /dev/vdb2 has been unregistered. Now device /dev/vdb2 doesn't belong to the logical volume any more, and you can reuse it for other purposes (re-format, etc). = Changing brick's capacity = At any time (in the assumption that no other volume operation is in progress) you can change abstract capacity of any brick to some new value, different from 0. Changing capacity always changes volume partitioning, and therefore, breaks fairness of distribution, so Reiser5 automatically launches rebalancing to make sure that resulted distribution is fair for the new set of capacities. 
In particular, increasing bricks capacity will move some data from other bricks to the brick, whose capacity was increased. Decreasing bricks capacity will move some data from the brick, whose capacity was decreased, to other bricks. To change abstract capacity of a brick /dev/vdb1 to a new value (e.g. 200000), simply run # volume.reiser4 -z /dev/vdb1 -c 200000 /mnt Pronounced as "resize brick /dev/vdb1 to new capacity 200000 in volume mounted at /mnt". The operation of changing capacity can return error. Most likely, it is -ENOSPC, which is a side effect of concurrent regular file writes. In this case check the status of your LV. If it is unbalanced, then consider removing some files from your LV and complete balancing by running # volume.reiser4 -b /mnt Otherwise, repeat the operation from scratch. Comment. Changing bricks capacity to 0 is undefined and will return error. Consider brick removal operation instead. = Operations with meta-data brick = Meta-data brick can also contain data stripes and participate in data distribution like other data bricks. So that all the volume operations described above are also applicable to meta-data brick. Note, however, that it is impossible to completely remove meta-data brick from the logical volume for obvious reasons (meta-data need to be stored somewhere), so brick removal operation applied to the meta-data brick actually removes it from Data-Storage Array (DSA), not from the logical volume. DSA is a subset of LV consisting of bricks, participating in data distribution. Once you remove meta-data brick from DSA, that brick will be used only to store meta-data. Operation of adding a brick, being applied to a meta-data brick, returns the last one back to DSA. Important: Reiser5 doesn't count busy data and meta-data blocks separately. So in contrast with data bricks (which contain only data) you are not able to find out real space occupied by data blocks on the meta-data brick - Reiser5 knows only total space occupied. 
To check the status of meta-data brick simply run # volume.reiser4 /mnt and compare values of "bricks total" and "bricks in DSA". If they are equal, then meta-data brick participates in data distribution. Otherwise, "bricks total" should be 1 more than "bricks in DSA" - it indicates that meta-data brick doesn't participate in data distribution (and therefore, doesn't contain data blocks). Note that other cases are impossible: for data bricks participation in LV and DSA is always equivalent. = Unmounting a logical volume = To terminate a mount session just issue usual umount command with the mount point specified. Note that after unmounting the volume all bricks by default remain to be registered in the system till system shutdown. If you want to unregister a brick before system shutdown, then simply issue the following command: # volume.reiser4 -u BRICK_NAME = Deploying a logical volume after correct unmount = Make sure (by checking your volume configuration) that all bricks of the volume are registered in the system. To register a brick issue the following command: # volume.reiser4 -g BRICK_NAME The list of all volumes and bricks registered in the system can be found in the output of the following command: # volume.reisrer4 -l Issue usual mount(8) command against one of the bricks of your volume. It is recommended to issue it against meta-data brick. NOTE: Reiser5 will refuse to mount a logical volume, in the case, when a wrong (incomplete or redundant) set of bricks is registered in the system. Redundant set of bricks appears, for example, when you mistakenly register a brick that was earlier removed from the logical volume. = Deploying a logical volume after correct shutdown = To mount your LV, first, make sure that all its bricks (data and meta-data) are registered in the system. Important: Reiser5 will refuse to mount a logical volume, in the case, when a wrong (incomplete or redundant) set of bricks is registered in the system. 
Redundant set of bricks appears, for example, when you mistakenly register a brick that was removed from the logical volume. For this reasons we strongly recommend for user to keep a track of his LV - store its configuration somewhere, but not in this volume! And don't forget to update that configuration after _every_ volume operation. If you lost configuration of your LV and don't remember it (wich is most likely for large volumes), then it will be rather painful to restore it: currently there is no tools for to manage offline logical volumes. So that, users are prompted to do this on their own. It is not at all difficult. To register a brick in the system use the following command: # volume.reiser4 -g BRICK_NAME To print a list of all registered bricks use # volume.reiser4 -l To mount your LV simply issue a mount(8) command against one of the bricks of your LV. We recommend to issue it against meta-data brick. Comment. Reiser5 always tries to register the brick which is passed to the mount command as an argument, so it is not necessarily to preregister the brick you want to issue a mount command against. = Deploying a logical volume after hard reset or system crash = If no volume operations were interrupted by hard reset or system crash, then just follow the instructions in this [https://reiser4.wiki.kernel.org/index.php?title=Logical_Volumes_Administration#Deploying_a_logical_volume_after_correct_shutdown section]. In Reiser5 only restricted number of bricks participate in every transaction. Maximal number of such bricks can be specified by user. At mount time a transaction replay procedure will be launched on each such brick independently in parallel. Depending on a kind of interrupted volume operation, perform one of the following actions: == Adding a brick was interrupted == Check your volume configuration. Register the old set of bricks (that is, the set of brick that the volume had before applying the operation) and try to mount. 
In case of error, register also the brick you wanted to add and try to mount again. Check the status of your LV by running

# volume.reiser4 /mnt

If the volume is unbalanced, then complete the balancing manually by running

# volume.reiser4 -b /mnt

Check "bricks total" of your LV in the output of

# volume.reiser4 /mnt

Compare it with the old number of bricks in the configuration. The new value should be the old one incremented by 1. If the number of bricks is the same, then your operation of adding a brick was completely rolled back by the transaction manager, and you need to repeat it from scratch. Otherwise, your operation was successfully completed: update your volume configuration accordingly.

== Brick removal was interrupted ==

Check your volume configuration. Register the old set of bricks (that is, the set of bricks that the volume had before the interrupted operation) except the brick you wanted to remove. Try to mount the volume. In case of error, register also the brick you wanted to remove and try to mount again. Check the status of your LV:

# volume.reiser4 /mnt

If the volume is unbalanced, then complete the balancing manually by running

# volume.reiser4 -b /mnt

Comment. After successful completion of balancing, the brick will be automatically removed from the volume. Make sure of it by checking the status of your LV:

# volume.reiser4 /mnt

Update your volume configuration accordingly.

== Another volume operation was interrupted ==

Using the volume configuration, register the new set of bricks and try to mount the volume. The mount should be successful.
Check the status of your LV:

# volume.reiser4 /mnt

If the volume is unbalanced, then complete the balancing manually by running

# volume.reiser4 -b /mnt

= LV monitoring =

Common info about the LV mounted at /mnt:

# volume.reiser4 /mnt

 ID:             Volume UUID
 volume:         ID of the plugin managing the volume
 distribution:   ID of the distribution plugin
 stripe:         Stripe size in bytes
 segments:       Number of hash space segments (for distribution)
 bricks total:   Total number of bricks in the volume
 bricks in DSA:  Number of bricks participating in data distribution
 balanced:       Balanced status of the volume

Info about any of its bricks, of index J:

# volume.reiser4 -p J /mnt

 internal ID:    Brick's internal ID and its status in the volume
 external ID:    Brick's UUID
 device name:    Name of the block device associated with the brick
 block count:    Size of the block device in blocks
 blocks used:    Total number of occupied blocks on the device
 system blocks:  Minimal possible number of busy blocks on that device
 data capacity:  Abstract capacity of the brick
 space usage:    Portion of occupied blocks on the device
 in DSA:         Participation in regular data distribution
 is proxy:       Participation in data tiering (Burst Buffers, etc.)

Comment. When retrieving a brick's info, make sure that no volume operations are in progress on that volume. Otherwise the command above will return an error (EBUSY).

WARNING. Brick info provided this way is not necessarily the most recent. To get actual info, run sync(1) and make sure that no regular file operations are in progress.

= Checking free space =

To check the number of available free blocks on a volume mounted at /mnt, make sure that no regular file operations or volume operations are in progress on that volume, then run

# sync
# df --block-size=4K /mnt

To check the number of free blocks on the brick of index J, run

# volume.reiser4 -p J /mnt

and calculate the difference between "block count" and "blocks used".

Comment. Not all free blocks on a brick/volume are available for use.
The number of available free blocks is always ~95% of the total number of free blocks (Reiser4 reserves 5% to make sure that regular file truncate operations won't fail).

NOTE: volume.reiser4 shows the total number of free blocks, whereas df(1) shows the number of available free blocks. The "space usage" statistic shows the portion of busy blocks on an individual brick. For the reasons explained above, "space usage" on any brick can not be more than 0.95.

= Checking quality of data distribution =

Quality of data distribution is a measure of the deviation of real data space usage from the ideal one defined by the volume partitioning. The smaller the deviation, the better the distribution quality. Checking quality of distribution makes sense only when your volume partitioning is space-based, or coincides with the space-based one. If your partitioning is throughput-based and doesn't coincide with the space-based one, then the quality of the actual data distribution can be rather bad: in that case the file system takes care that low-performance devices do not become a bottleneck, and effective space usage is not a high priority.

Checking quality of data distribution is based on the free-block accounting provided by the file system. Note that the file system doesn't count busy data and meta-data blocks separately, so you are not able to find real data space usage, and hence to check quality of distribution, when the meta-data brick contains data blocks.

To check quality of distribution:

1) make sure that the meta-data brick doesn't contain data blocks;
2) make sure that no regular file or volume operations are currently in progress;
3) find the "blocks used", "system blocks" and "data capacity" statistics for each data brick:

# sync
# volume.reiser4 -p 1 /mnt
...
# volume.reiser4 -p N /mnt

4) find the real data space usage on each brick;
5) calculate the partitioning and the ideal data space usage on each data brick;
6) find the deviation of (4) from (5).

Example.
Let's build an LV of 3 bricks (one 10G meta-data brick vdb1, and two data bricks: vdc1 (10G) and vdd1 (5G)) with space-based partitioning:

# VOL_ID=`uuid -v4`
# echo "Using uuid $VOL_ID"
# mkfs.reiser4 -U $VOL_ID -y -t 256K /dev/vdb1
# mkfs.reiser4 -U $VOL_ID -y -a -t 256K /dev/vdc1
# mkfs.reiser4 -U $VOL_ID -y -a -t 256K /dev/vdd1
# mount /dev/vdb1 /mnt

Fill the meta-data brick with data:

# dd if=/dev/zero of=/mnt/myfile bs=256K
No space left on device...

Add data bricks /dev/vdc1 and /dev/vdd1 to the volume:

# volume.reiser4 -a /dev/vdc1 /mnt
# volume.reiser4 -a /dev/vdd1 /mnt

Move all data blocks to the newly added bricks:

# volume.reiser4 -r /dev/vdb1 /mnt
# sync

Now the meta-data brick doesn't contain data blocks (only meta-data ones), so we can calculate the quality of data distribution:

# volume.reiser4 /mnt -p0
 blocks used: 503
# volume.reiser4 /mnt -p1
 blocks used: 1657203
 system blocks: 115
 data capacity: 2621069
# volume.reiser4 /mnt -p2
 blocks used: 833001
 system blocks: 73
 data capacity: 1310391

Based on the statistics above, calculate the quality of distribution.

Total data capacity of the volume:
 C = 2621069 + 1310391 = 3931460
Relative capacities of the data bricks:
 C1 = 2621069 / 3931460 = 0.6667
 C2 = 1310391 / 3931460 = 0.3333
Real space usage on the data bricks (blocks used - system blocks):
 R1 = 1657203 - 115 = 1657088
 R2 = 833001 - 73 = 832928
Space usage on the volume:
 R = R1 + R2 = 1657088 + 832928 = 2490016
Ideal data space usage on the data bricks:
 I1 = C1 * R = 0.6667 * 2490016 = 1660094
 I2 = C2 * R = 0.3333 * 2490016 = 829922
Deviation:
 (D1, D2) = (R1, R2) - (I1, I2) = (-3006, 3006)
Relative deviation:
 (D1, D2)/R = (-0.0012, 0.0012)
Quality of distribution:
 Q = 1 - max(|D1|, |D2|)/R = 1 - 0.0012 = 0.9988

Comment. For any specified number of bricks N and quality of distribution Q it is possible to find a configuration of a logical volume composed of N bricks such that the quality of distribution on that volume is better than Q.

Comment.
Quality of distribution Q doesn't depend on the number of bricks in the logical volume. This is a theorem, which can be strictly proven.

= FAQ =

Q. What happens if I lose a device-component (due to a breakdown, etc.) of my logical volume?

A. The bodies of some of your regular files will become "punched" in random places. The portion of such files depends on the relative capacity of the lost brick, on the number of bricks in the logical volume, and on other factors. Fsck will be able to detect and remove such files with corrupted bodies. Nevertheless, we recommend considering mirroring your bricks (e.g. by software or hardware RAID-1) to avoid such highly unpleasant situations.

A logical volume (LV) can be composed of any number of block devices differing in physical and geometric parameters. However, the optimal configuration (true parallelism) imposes some restrictions and dependencies on the sizes of such devices.

WARNING: This code is not stable. Don't put important data on logical volumes managed by software of release number 5.X.Y. Also don't mount your old partitions in kernels with Reiser4 of SFRN 5.X.Y before its stabilization.

IMPORTANT: Currently there are no tools to manage Reiser5 logical volumes off-line, so it is strongly recommended to save/update the configuration of your LV in a file which doesn't belong to that volume.

= Basic definitions. Volume configuration. Brick's capacity. Partitioning. Fair distribution. Balancing =

The basic configuration of a logical volume is the following information:

1) Volume UUID;
2) Number of bricks in the volume;
3) List of brick names or UUIDs in the volume;
4) UUID or name of the brick to be added/removed (if any). That brick is not counted in (2) and (3).

For each volume, its configuration should be stored somewhere (but not on that volume!)
and properly updated before and after each volume operation performed on that volume. We make the user responsible for this. The volume configuration is needed to facilitate deploying the volume.

'''Abstract capacity''' (or simply capacity) of a brick is a positive integer. Capacity is a brick property defined by the user. Don't confuse it with the size of the block device. Think of it as the brick's "weight" in some units. It is the user who decides which property of the brick to assign as its abstract capacity and in which units. In particular, it can be the size of the block device in kilobytes, or its size in megabytes, or its throughput in MB/sec, or another geometric or physical parameter of the device associated with the brick. It is important that the capacities of all bricks of the same logical volume are measured in the same units. Also, it would be utterly pointless to assign different properties as abstract capacities for bricks of the same LV - for example, the size of the block device for one brick and disk bandwidth for another.

The capacity of each brick gets initialized by the mkfs utility. By default it is calculated as the number of free blocks on the device at the very end of the formatting procedure. For the meta-data brick it is calculated as 70% of that amount. The capacity of any brick can be changed on-line by the user.

'''Capacity of a logical volume''' is defined as the sum of the capacities of its component bricks.

'''Relative capacity of a brick''' is the ratio of the brick's capacity to the volume's capacity. Relative capacity defines the portion of IO-requests that will be issued against that brick. The array of relative capacities (C1, C2, ...) of all bricks is called the volume partitioning. Obviously, C1 + C2 + ... = 1.

'''(Real) data space usage''' on a brick is the number of data blocks stored on that brick.

'''Ideal (or expected) data space usage''' on a brick is T*C, where T is the total number of data blocks stored in the volume and C is the relative capacity of the brick.
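These definitions can be tried out numerically. A minimal sketch that, given the abstract capacities and the real data space usage of each brick, computes the partitioning, the ideal usage, and the quality of distribution discussed later in this document (all input numbers are illustrative, not output of a real volume):

```shell
# Sketch: compute volume partitioning, ideal data space usage per brick,
# and quality of distribution (1 minus the worst relative deviation).
# Arguments: a space-separated list of capacities and a matching list of
# real per-brick data space usages, both in the same order.
distribution_stats() {
    awk -v caps="$1" -v used="$2" 'BEGIN {
        n = split(caps, c, " "); split(used, r, " ")
        for (i = 1; i <= n; i++) { ctot += c[i]; rtot += r[i] }
        worst = 0
        for (i = 1; i <= n; i++) {
            ci = c[i] / ctot                  # relative capacity C_i
            ideal = ci * rtot                 # ideal usage T * C_i
            d = (r[i] - ideal) / rtot         # relative deviation
            if (d < 0) d = -d
            if (d > worst) worst = d
            printf "brick %d: C=%.4f ideal=%d\n", i, ci, int(ideal + 0.5)
        }
        printf "quality: %.4f\n", 1 - worst
    }'
}
# Usage (capacities, then real usages, of two data bricks):
# distribution_stats "2621069 1310391" "1657088 832928"
```

Note that exact per-brick figures differ slightly from the rounded hand calculation in the example later in the document, since no intermediate rounding of relative capacities is done here.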
It is recommended to compose volumes so that the space-based partitioning coincides with the throughput-based one - this is the optimal volume configuration, which provides true parallelism. If that is impossible for some reason, then choose a preferred partitioning method (space-based or throughput-based). Note that space-based partitioning saves volume space, whereas throughput-based partitioning saves volume throughput.

When performing regular file operations, Reiser5 distributes data stripes throughout the volume evenly and fairly. This means that the portion of IO-requests issued against each brick is equal to its relative capacity, that is, to the portion of capacity that the brick adds to the total volume capacity.

Most volume operations are accompanied by rebalancing, which maintains fairness of the distribution. For example, adding a brick to a logical volume changes its partitioning and hence breaks fairness of the distribution, so some data stripes need to be moved to the new brick to make the distribution fair again. Likewise, you can not simply remove a brick from a logical volume - all data stripes first have to be moved from that brick to the other bricks of the logical volume.

Every time the user performs a volume operation, Reiser5 marks the LV as "not balanced". After successful balancing, the status of the LV is changed back to "balanced". If the balancing procedure fails for some reason, it should be resumed manually (with the volume.reiser4 utility). It is allowed to perform regular file operations on an unbalanced LV. However, in this case:

a) we don't guarantee a good quality of data distribution on your LV;
b) you won't be able to perform volume operations on your LV except balancing - any other volume operation will return an error (EBUSY).

So don't forget to bring your LV to the balanced state as soon as possible!

= Prepare Software and Hardware =

Build, install and boot a kernel with Reiser4 of software framework release number 5.X.Y.
Kernel patches can be found [https://sourceforge.net/projects/reiser4/files/v5-unstable/ here]. Note that the Linux kernel and GNU utilities still recognize the testing code as "Reiser4". Make sure there is the following message in the kernel logs:

 "Loading Reiser4 (Software Framework Release: 5.X.Y)"

Build and install the latest [https://sourceforge.net/projects/reiser4/files/reiser4-utils/libaal/ libaal]. Download, build and install the latest version 2.A.B of the [https://sourceforge.net/projects/reiser4/files/v5-unstable/ Reiser4progs package]. Make sure that the utility for managing logical volumes is installed (as a part of the reiser4progs package) on your machine:

# volume.reiser4 -?

= Creating a logical volume =

Start by choosing a unique ID (UUID) for your volume. By default it is generated by the mkfs utility. However, the user can generate it with suitable tools (e.g. uuid(1)) and store it in an environment variable for convenience:

# VOL_ID=`uuid -v4`
# echo "Using uuid $VOL_ID"

Choose a stripe size for your logical volume. For a good quality of distribution it is recommended that the stripe size not exceed 1/10000 of the volume size. On the other hand, too small a stripe size will increase space consumption on your meta-data brick. In our example we choose a stripe size of 256K:

# STRIPE=256K
# echo "Using stripe size $STRIPE"

Start by creating the first brick of your volume - the meta-data brick - passing the volume ID and stripe size to the mkfs.reiser4 utility:

# mkfs.reiser4 -U $VOL_ID -t $STRIPE /dev/vdb1

Currently only one meta-data brick per volume is supported, so it is recommended that the block device for the meta-data brick not be too small. In most cases it will be enough if your meta-data brick is not smaller than 1/200 of the maximal volume size. For example, a 100G meta-data brick will be able to service a ~20T logical volume.
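The two sizing rules of thumb above (stripe at most 1/10000 of the volume size, meta-data brick at least roughly 1/200 of the maximal volume size) can be checked before formatting. A minimal sketch; the function name and the byte-based interface are illustrative, and the ratios are simply the recommendations quoted above, so treat the result as guidance rather than a hard limit:

```shell
# Sketch: sanity-check a planned volume layout against the rules of thumb
# above. Arguments: planned volume size, stripe size, and meta-data brick
# size, all in bytes.
plan_check() {
    vol=$1 stripe=$2 meta=$3
    max_stripe=$((vol / 10000))   # recommended upper bound on stripe size
    min_meta=$((vol / 200))       # recommended lower bound on meta brick
    [ "$stripe" -le "$max_stripe" ] && echo "stripe ok" || echo "stripe too large"
    [ "$meta" -ge "$min_meta" ] && echo "meta-data brick ok" || echo "meta-data brick too small"
}
# Usage: plan_check 1000000000000 262144 10000000000
#        (a 1T volume with a 256K stripe and a 10G meta-data brick)
```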
Currently there is no special option to inform the mkfs utility that we want to create specifically a meta-data brick: the first brick in the volume automatically becomes the meta-data brick, and the other bricks are interpreted as data bricks.

Mount your initial logical volume consisting of one meta-data brick:

# mount /dev/vdb1 /mnt

Find the record about your volume in the output of the following command:

# volume.reiser4 -l

Create the configuration of your logical volume (its definition is above) and store it somewhere, but not on that volume! Your logical volume is now on-line and ready to use. You can perform regular file operations and volume operations (e.g. add a data brick to your LV).

= Adding a data brick to LV =

At any time you are able to add a data brick to your LV. You can do it in parallel with regular file operations executing on this volume. Make sure, however, that no other volume operation (e.g. removing a brick) is in progress on your volume, otherwise your operation will fail with EBUSY. Obviously, adding a brick will increase the capacity of your volume.

Choose a block device for the new data brick. Make sure that it is neither too large nor too small: the capacities of any 2 bricks of the same logical volume can not differ by more than 2^19 (~0.5 million) times. E.g. your logical volume can not contain both 1M and 2T bricks. Any attempt to add a brick of improper capacity will fail with an error.

Format it the same way as the meta-data brick, but specify also the "-a" option (to let mkfs know that it is a data brick):

# mkfs.reiser4 -U $VOL_ID -t 256K -a /dev/vdb2

Important: make sure you specified the same volume ID and stripe size as the other bricks of the logical volume have. Otherwise, the operation of adding the data brick will fail.

Update the configuration of your volume with the UUID or name of the brick you want to add (item #4).
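The capacity-ratio restriction above can be checked before formatting the new brick. A minimal sketch; it uses integer division, so ratios right at the boundary are approximate, and the function name is illustrative:

```shell
# Sketch: check whether two abstract capacities are compatible within the
# same logical volume, i.e. differ by no more than 2^19 times (the limit
# quoted above). Capacities are in whatever units the volume uses.
caps_compatible() {
    a=$1 b=$2
    if [ "$a" -lt "$b" ]; then t=$a; a=$b; b=$t; fi   # ensure a >= b
    # 524288 = 2^19, the maximal allowed capacity ratio
    if [ $((a / b)) -le 524288 ]; then
        echo "compatible"
    else
        echo "incompatible"
    fi
}
# Usage: caps_compatible 2621069 1310391
```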
To add a brick, simply pass its name as an argument to the "-a" option and specify your LV via its mount point:

# volume.reiser4 -a /dev/vdb2 /mnt

The procedure of adding a brick automatically invokes rebalancing, which moves a portion of data stripes to the newly added brick (so that the resulting distribution is fair). The portion of data blocks moved during such rebalancing is equal to the relative capacity of the new brick, that is, to the portion of capacity that the new brick adds to the updated LV's capacity. This important property defines the cost of the balancing procedure: if the portion of capacity added by a brick is small, then the number of stripes moved during balancing is also small.

Like other user-space utilities, the operation of adding a brick can return an error, even assuming that the brick you wanted to add is properly formatted. In this case check the status of your LV:

# volume.reiser4 /mnt

If the volume is unbalanced, then simply complete the balancing manually:

# volume.reiser4 -b /mnt

Otherwise, check the number of bricks in your LV. Most likely it is the same as it was before the failed operation. In this case simply repeat the operation of adding a brick from scratch. Upon successful completion, update your volume configuration. That is, increment (#2), add info about the new brick to (#3) and remove the record at (#4).

= Removing a data brick from LV =

At any time you are able to remove a data brick from your LV. You can do it in parallel with regular file operations executing on this volume. Make sure, however, that no other volume operation (e.g. adding a brick) is in progress on your volume, otherwise your operation will fail with EBUSY. Obviously, removing a brick will decrease the abstract capacity of your LV. Note that the other bricks should have enough space to store all data blocks of the brick you want to remove; otherwise, the removal operation will return an error (ENOSPC).
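Both adding and removing a brick move a portion of data stripes equal to that brick's relative capacity, so the rebalancing cost can be estimated ahead of time. A minimal sketch (the function name and the example capacities are illustrative):

```shell
# Sketch: estimate the fraction of data stripes that rebalancing will
# move when a brick is added to (or removed from) the volume.
# Arguments: capacity of that brick, then capacities of the other bricks.
rebalance_fraction() {
    new=$1; shift
    awk -v new="$new" -v rest="$*" 'BEGIN {
        n = split(rest, c, " ")
        total = new
        for (i = 1; i <= n; i++) total += c[i]   # resulting LV capacity
        printf "%.4f\n", new / total             # relative capacity = cost
    }'
}
# Usage: rebalance_fraction 1310391 2621069
```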
Suppose you want to remove the brick /dev/vdb2 from your LV mounted at /mnt. Update your volume configuration with the UUID and name of the brick you want to remove (item #4). To remove a brick, simply pass its name as an argument to the "-r" option and specify the logical volume by its mount point:

# volume.reiser4 -r /dev/vdb2 /mnt

The procedure of brick removal automatically invokes rebalancing, which distributes the data of the brick to be removed among the other bricks, so that the resulting distribution is also fair. The portion of data stripes moved during such rebalancing is equal to the relative capacity of the brick to be removed (that is, to the portion of capacity that the brick added to the LV's capacity).

It can happen that the command above completes with an error (like other user-space applications). In this case check the status of your LV:

# volume.reiser4 /mnt

If the volume is not balanced, then simply complete the balancing manually:

# volume.reiser4 -b /mnt

Otherwise, check the number of bricks in your logical volume - it should be the same as before the failed operation. The error -ENOSPC indicates that the free space on the other bricks is not enough to fit all the data of the brick you want to remove.

On success, update your volume configuration: remove the information about the brick /dev/vdb2 at (#3) and (#4). Check your kernel logs: they should contain a message that brick /dev/vdb2 has been unregistered. Now the device /dev/vdb2 doesn't belong to the logical volume any more, and you can reuse it for other purposes (re-format, etc).

= Changing brick's capacity =

At any time (assuming that no other volume operation is in progress) you can change the abstract capacity of any brick to some new value different from 0. Changing capacity always changes the volume partitioning and therefore breaks fairness of the distribution, so Reiser5 automatically launches rebalancing to make sure that the resulting distribution is fair for the new set of capacities.
In particular, increasing a brick's capacity will move some data from other bricks to the brick whose capacity was increased. Decreasing a brick's capacity will move some data from the brick whose capacity was decreased to other bricks. To change the abstract capacity of the brick /dev/vdb1 to a new value (e.g. 200000), simply run

# volume.reiser4 -z /dev/vdb1 -c 200000 /mnt

Pronounced as "resize brick /dev/vdb1 to new capacity 200000 in the volume mounted at /mnt".

The operation of changing capacity can return an error. Most likely it is -ENOSPC, which is a side effect of concurrent regular file writes. In this case check the status of your LV. If it is unbalanced, then consider removing some files from your LV and complete the balancing by running

# volume.reiser4 -b /mnt

Otherwise, repeat the operation from scratch.

Comment. Changing a brick's capacity to 0 is undefined and will return an error. Consider the brick removal operation instead.

= Operations with meta-data brick =

The meta-data brick can also contain data stripes and participate in data distribution like the other data bricks, so all the volume operations described above are also applicable to the meta-data brick. Note, however, that it is impossible to completely remove the meta-data brick from the logical volume for obvious reasons (meta-data needs to be stored somewhere), so the brick removal operation applied to the meta-data brick actually removes it from the Data Storage Array (DSA), not from the logical volume. The DSA is the subset of the LV consisting of the bricks participating in data distribution. Once you remove the meta-data brick from the DSA, that brick will be used only to store meta-data. The operation of adding a brick, applied to the meta-data brick, returns it back to the DSA.

Important: Reiser5 doesn't count busy data and meta-data blocks separately. So in contrast with data bricks (which contain only data), you are not able to find out the real space occupied by data blocks on the meta-data brick - Reiser5 knows only the total space occupied.
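Whether the meta-data brick currently participates in the DSA can be decided by comparing the "bricks total" and "bricks in DSA" fields of volume.reiser4 output, as described in the deployment sections. A minimal sketch of that check; the field names come from the LV monitoring section, but the exact output layout of volume.reiser4 is assumed and may need adjusting:

```shell
# Sketch: read volume.reiser4 output on stdin and report whether the
# meta-data brick is in the DSA (the counts are equal) or not (bricks
# total exceeds bricks in DSA by one).
meta_brick_in_dsa() {
    awk -F: '
        /bricks total/  { total = $2 + 0 }
        /bricks in DSA/ { dsa   = $2 + 0 }
        END { if (total == dsa) print "yes"; else print "no" }
    '
}
# Usage: volume.reiser4 /mnt | meta_brick_in_dsa
```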
and properly updated before and after every volume operation performed on that volume. We make the user responsible for this. The volume configuration is needed to facilitate deploying the volume.

'''Abstract capacity''' (or simply capacity) of a brick is a positive integer. Capacity is a property of the brick defined by the user; don't confuse it with the size of the block device. Think of it as the brick's "weight" in some units. It is the user who decides which property of the brick to use as its abstract capacity, and in which units. In particular, it can be the size of the block device in kilobytes or megabytes, its throughput in MB/s, or any other geometric or physical parameter of the device associated with the brick. It is important that the capacities of all bricks of the same logical volume are measured in the same units. It would be pointless to use different properties as abstract capacities for bricks of the same LV - for example, block device size for one brick and disk bandwidth for another.

The capacity of each brick is initialized by the mkfs utility. By default it is calculated as the number of free blocks on the device at the very end of the formatting procedure; for the meta-data brick it is 70% of that amount. The capacity of any brick can be changed on-line by the user.

'''Capacity of a logical volume''' is defined as the sum of the capacities of its component bricks.

'''Relative capacity of a brick''' is the ratio of the brick's capacity to the volume's capacity. Relative capacity defines the portion of IO requests that will be issued against that brick. The array of relative capacities (C1, C2, ...) of all bricks is called the volume partitioning. Obviously, C1 + C2 + ... = 1.

'''(Real) data space usage''' on a brick is the number of data blocks stored on that brick.

'''Ideal (or expected) data space usage''' on a brick is T*C, where T is the total number of data blocks stored in the volume and C is the relative capacity of the brick.
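The definitions above can be illustrated with a small calculation. This is only a sketch with made-up capacities; `relative_capacity` and `ideal_usage` are hypothetical helpers, not reiser4progs tools:

```shell
#!/bin/sh
# Illustration of the definitions above (hypothetical numbers):
# relative capacity and ideal data space usage per brick.

# relative_capacity CAP TOTAL -> Ci / (C1 + C2 + ...), 4 decimal places
relative_capacity() {
    awk -v c="$1" -v t="$2" 'BEGIN { printf "%.4f\n", c / t }'
}

# ideal_usage CAP TOTAL DATA_BLOCKS -> T * Ci, rounded to whole blocks
ideal_usage() {
    awk -v c="$1" -v t="$2" -v T="$3" 'BEGIN { printf "%d\n", T * c / t + 0.5 }'
}

# Two hypothetical data bricks with capacities 2000000 and 1000000,
# and 900000 data blocks stored in the volume:
total=$((2000000 + 1000000))
relative_capacity 2000000 "$total"   # 0.6667
ideal_usage 2000000 "$total" 900000  # 600000
```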
It is recommended to compose volumes so that the space-based partitioning coincides with the throughput-based one - that is the optimal volume configuration, which provides true parallelism. If that is impossible for some reason, choose a preferred partitioning method (space-based or throughput-based). Note that space-based partitioning saves volume space, whereas throughput-based partitioning saves volume throughput.

When performing regular file operations, Reiser5 distributes data stripes throughout the volume evenly and fairly. This means that the portion of IO requests issued against each brick is equal to its relative capacity, that is, to the portion of capacity that the brick contributes to the total volume capacity.

Most volume operations are accompanied by rebalancing, which preserves fairness of distribution. For example, adding a brick to a logical volume changes its partitioning and hence breaks fairness of the distribution, so some data stripes have to be moved to the new brick to make the distribution fair again. Likewise, you cannot simply remove a brick from a logical volume - all data stripes first have to be moved from that brick to the other bricks of the volume.

Every time the user performs a volume operation, Reiser5 marks the LV as "not balanced". After successful balancing, the status of the LV changes to "balanced". If the balancing procedure fails for some reason, it should be resumed manually (with the volume.reiser4 utility). It is allowed to perform regular file operations on a non-balanced LV. However, in this case:

a) we don't guarantee a good quality of data distribution on your LV;
b) you won't be able to perform any volume operation on your LV except balancing - any other volume operation will return an error (EBUSY).

So don't forget to bring your LV to the balanced state as soon as possible!

= Prepare Software and Hardware =

Build, install and boot a kernel with Reiser4 of software framework release number 5.X.Y.
Kernel patches can be found [https://sourceforge.net/projects/reiser4/files/v5-unstable/ here]. Note that the Linux kernel and GNU utilities still recognize this testing code as "Reiser4". Make sure the following message appears in the kernel logs:

"Loading Reiser4 (Software Framework Release: 5.X.Y)"

Build and install the latest [https://sourceforge.net/projects/reiser4/files/reiser4-utils/libaal/ libaal].

Download, build and install the latest version 2.A.B of the [https://sourceforge.net/projects/reiser4/files/v5-unstable/ Reiser4progs package]. Make sure that the utility for managing logical volumes is installed (as a part of the reiser4progs package) on your machine:

# volume.reiser4 -?

= Creating a logical volume =

Start by choosing a unique ID (UUID) for your volume. By default it is generated by the mkfs utility. However, the user can generate it with suitable tools (e.g. uuid(1)) and store it in an environment variable for convenience:

# VOL_ID=`uuid -v4`
# echo "Using uuid $VOL_ID"

Choose a stripe size for your logical volume. For a good quality of distribution it is recommended that the stripe not exceed 1/10000 of the volume size. On the other hand, too small a stripe will increase space consumption on your meta-data brick. In our example we choose a stripe size of 256K:

# STRIPE=256K
# echo "Using stripe size $STRIPE"

Start by creating the first brick of your volume - the meta-data brick - passing the volume ID and stripe size to the mkfs.reiser4 utility:

# mkfs.reiser4 -U $VOL_ID -t $STRIPE /dev/vdb1

Currently only one meta-data brick per volume is supported, so the block device for the meta-data brick should not be too small. In most cases it is enough for the meta-data brick to be no smaller than 1/200 of the maximal volume size. For example, a 100G meta-data brick will be able to service a ~20T logical volume.
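The two sizing rules above (stripe at most 1/10000 of the volume size, meta-data brick at least 1/200 of the maximal volume size) can be sketched as shell arithmetic. The helper names are made up for illustration; sizes are kept in kilobytes for integer math:

```shell
#!/bin/sh
# Rule-of-thumb sizing from the text (a sketch, not a reiser4 tool).

# max_stripe_kb VOLUME_KB -> largest recommended stripe size, in KB
max_stripe_kb() {
    echo $(( $1 / 10000 ))
}

# min_mdbrick_kb VOLUME_KB -> smallest recommended meta-data brick, in KB
min_mdbrick_kb() {
    echo $(( $1 / 200 ))
}

# A hypothetical 20T volume (20 * 2^30 KB); requires a 64-bit shell:
vol_kb=$(( 20 * 1024 * 1024 * 1024 ))
max_stripe_kb "$vol_kb"    # 2147483 KB (~2G); a 256K stripe is far below
min_mdbrick_kb "$vol_kb"   # 107374182 KB (~100G, matching the example above)
```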
Currently there is no special option to tell the mkfs utility that we want to create a "meta-data brick": the first brick in the volume automatically becomes the meta-data brick, and all other bricks are interpreted as data bricks.

Mount your logical volume, which so far consists of one meta-data brick:

# mount /dev/vdb1 /mnt

Find the record about your volume in the output of the following command:

# volume.reiser4 -l

Create the configuration of your logical volume (defined above) and store it somewhere - but not on that volume! Your logical volume is now on-line and ready to use. You can perform regular file operations and volume operations (e.g. add a data brick to your LV).

= Adding a data brick to LV =

At any time you can add a data brick to your LV, even in parallel with regular file operations executing on the volume. Make sure, however, that no other volume operation (e.g. removing a brick) is in progress on your volume, otherwise your operation will fail with EBUSY. Obviously, adding a brick increases the capacity of your volume.

Choose a block device for the new data brick. Make sure that it is not too large or too small: the capacities of any two bricks of the same logical volume cannot differ by more than 2^19 (about half a million) times. E.g. your logical volume cannot contain both a 1M brick and a 2T brick. Any attempt to add a brick of improper capacity will fail with an error.

Format it in the same way as the meta-data brick, but also specify the "-a" option (to let mkfs know that it is a data brick):

# mkfs.reiser4 -U $VOL_ID -t 256K -a /dev/vdb2

Important: make sure you specify the same volume ID and stripe size as the other bricks of the logical volume have. Otherwise, the operation of adding a data brick will fail.

Update the configuration of your volume with the UUID or name of the brick you want to add (item #4).
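The configuration bookkeeping (items (1)-(4) above) can be as simple as a plain text file stored outside the volume. The file format below is invented for illustration - reiser4progs does not prescribe one:

```shell
#!/bin/sh
# A sketch of the recommended bookkeeping: a plain-text volume
# configuration kept OUTSIDE the volume. The format is made up here.

config='uuid: <VOL_ID>
bricks: 2
brick: /dev/vdb1
brick: /dev/vdb2
pending: none'

# mark_pending NAME -- print the configuration from stdin with the
# "pending" record (item #4) set to NAME
mark_pending() {
    sed "s|^pending: .*|pending: $1|"
}

# Before adding a brick, record it as pending:
echo "$config" | mark_pending /dev/vdb3
```

After the operation succeeds, the pending record is cleared and the brick count and list are updated, per the steps described in the text.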
To add a brick, simply pass its name as an argument to the option "-a" and specify your LV via its mount point:

# volume.reiser4 -a /dev/vdb2 /mnt

The procedure of adding a brick automatically invokes re-balancing, which moves a portion of data stripes to the newly added brick (so that the resulting distribution is fair). The portion of data blocks moved during such rebalancing is equal to the relative capacity of the new brick, that is, to the portion of capacity that the new brick adds to the updated LV's capacity. This important property defines the cost of the balancing procedure: if the portion of capacity added by a brick is small, then the number of stripes moved during balancing is also small.

Like other user-space utilities, the operation of adding a brick can return an error, even assuming that the brick you wanted to add was properly formatted. In this case check the status of your LV:

# volume.reiser4 /mnt

If the volume is unbalanced, then simply complete balancing manually:

# volume.reiser4 -b /mnt

Otherwise, check the number of bricks in your LV. Most likely it is the same as it was before the failed operation; in this case simply repeat the operation of adding a brick from scratch.

Upon successful completion, update your volume configuration. That is, increment (#2), add info about the new brick to (#3) and remove the record at (#4).

= Removing a data brick from LV =

At any time you can remove a data brick from your LV, even in parallel with regular file operations executing on the volume. Make sure, however, that no other volume operation (e.g. adding a brick) is in progress on your volume, otherwise your operation will fail with EBUSY. Obviously, removing a brick decreases the abstract capacity of your LV. Note that the other bricks must have enough space to store all data blocks of the brick you want to remove; otherwise the removal operation will return an error (ENOSPC).
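Before starting a removal you can estimate whether ENOSPC is likely: the data of the brick to be removed must fit into the free space of the remaining bricks. The pre-check below is a hypothetical sketch, assuming the ~5% availability reserve described later under "Checking free space":

```shell
#!/bin/sh
# Rough ENOSPC pre-check for brick removal (hypothetical helper).
# Arguments are block counts, as reported by `volume.reiser4 -p`.

# can_remove USED_ON_BRICK FREE_ON_OTHER_BRICKS -> yes / no
can_remove() {
    # assumption: only ~95% of free blocks are actually available
    avail=$(( $2 * 95 / 100 ))
    if [ "$avail" -ge "$1" ]; then
        echo yes
    else
        echo "no (ENOSPC expected)"
    fi
}

can_remove 832928 2000000   # yes (1900000 available >= 832928 used)
can_remove 832928 800000    # no (ENOSPC expected)
```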
Suppose you want to remove the brick /dev/vdb2 from your LV mounted at /mnt. Update your volume configuration with the UUID and name of the brick you want to remove (item #4). To remove the brick, simply pass its name as an argument to option "-r" and specify the logical volume by its mount point:

# volume.reiser4 -r /dev/vdb2 /mnt

The procedure of brick removal automatically invokes re-balancing, which distributes the data of the brick to be removed among the other bricks, so that the resulting distribution is again fair. The portion of data stripes moved during such rebalancing is equal to the relative capacity of the brick to be removed (that is, to the portion of capacity that the brick contributed to the LV's capacity).

It can happen that the command above completes with an error (like any other user-space application). In this case check the status of your LV:

# volume.reiser4 /mnt

If the volume is not balanced, then simply complete balancing manually:

# volume.reiser4 -b /mnt

Otherwise, check the number of bricks in your logical volume - it should be the same as before the failed operation. The error -ENOSPC indicates that the free space on the other bricks is not enough to fit all the data of the brick you want to remove.

On success, update your volume configuration: remove the information about the brick /dev/vdb2 at #3 and #4. Check your kernel logs: they should contain a message that brick /dev/vdb2 has been unregistered. The device /dev/vdb2 no longer belongs to the logical volume, and you can reuse it for other purposes (re-format it, etc).

= Changing brick's capacity =

At any time (assuming that no other volume operation is in progress) you can change the abstract capacity of any brick to some new non-zero value. Changing capacity always changes the volume partitioning and therefore breaks fairness of distribution, so Reiser5 automatically launches rebalancing to make sure that the resulting distribution is fair for the new set of capacities.
In particular, increasing a brick's capacity moves some data from other bricks to the brick whose capacity was increased; decreasing a brick's capacity moves some data from the brick whose capacity was decreased to the other bricks. To change the abstract capacity of a brick /dev/vdb1 to a new value (e.g. 200000), simply run

# volume.reiser4 -z /dev/vdb1 -c 200000 /mnt

pronounced as "resize brick /dev/vdb1 to new capacity 200000 in the volume mounted at /mnt".

The operation of changing capacity can return an error. Most likely it is -ENOSPC, a side effect of concurrent regular file writes. In this case check the status of your LV. If it is unbalanced, then consider removing some files from your LV and complete balancing by running

# volume.reiser4 -b /mnt

Otherwise, repeat the operation from scratch.

Comment. Changing a brick's capacity to 0 is undefined and will return an error. Consider the brick removal operation instead.

= Operations with meta-data brick =

The meta-data brick can also contain data stripes and participate in data distribution like the data bricks, so all the volume operations described above are applicable to the meta-data brick as well. Note, however, that it is impossible to completely remove the meta-data brick from the logical volume, for obvious reasons (meta-data has to be stored somewhere), so the brick removal operation applied to the meta-data brick actually removes it from the Data Storage Array (DSA), not from the logical volume. The DSA is the subset of the LV consisting of the bricks participating in data distribution. Once you remove the meta-data brick from the DSA, that brick will be used only to store meta-data. The operation of adding a brick, applied to the meta-data brick, returns it back to the DSA.

Important: Reiser5 doesn't count busy data and meta-data blocks separately. So, in contrast with data bricks (which contain only data), you cannot find out the real space occupied by data blocks on the meta-data brick - Reiser5 knows only the total space occupied.
To check the status of the meta-data brick, simply run

# volume.reiser4 /mnt

and compare the values of "bricks total" and "bricks in DSA". If they are equal, then the meta-data brick participates in data distribution. Otherwise, "bricks total" will be 1 more than "bricks in DSA", indicating that the meta-data brick doesn't participate in data distribution (and therefore doesn't contain data blocks). Other cases are impossible: for data bricks, participation in the LV and in the DSA is always equivalent.

= Unmounting a logical volume =

To terminate a mount session, just issue the usual umount command with the mount point specified. Note that after unmounting the volume, all bricks by default remain registered in the system until system shutdown. If you want to unregister a brick before system shutdown, simply issue the following command:

# volume.reiser4 -u BRICK_NAME

= Deploying a logical volume after correct unmount =

Make sure (by checking your volume configuration) that all bricks of the volume are registered in the system. To register a brick, issue the following command:

# volume.reiser4 -g BRICK_NAME

The list of all volumes and bricks registered in the system can be found in the output of the following command:

# volume.reiser4 -l

Issue the usual mount(8) command against one of the bricks of your volume. It is recommended to issue it against the meta-data brick.

NOTE: Reiser5 will refuse to mount a logical volume when a wrong (incomplete or redundant) set of bricks is registered in the system. A redundant set of bricks appears, for example, when you mistakenly register a brick that was earlier removed from the logical volume.

= Deploying a logical volume after correct shutdown =

To mount your LV, first make sure that all its bricks (data and meta-data) are registered in the system. Important: Reiser5 will refuse to mount a logical volume when a wrong (incomplete or redundant) set of bricks is registered in the system.
A redundant set of bricks appears, for example, when you mistakenly register a brick that was removed from the logical volume. For this reason we strongly recommend that the user keep track of his LV: store its configuration somewhere - but not on that volume! - and don't forget to update that configuration after _every_ volume operation. If you lose the configuration of your LV and don't remember it (which is likely for large volumes), it will be rather painful to restore: currently there are no tools to manage logical volumes off-line, so users have to do this on their own. It is not at all difficult.

To register a brick in the system, use the following command:

# volume.reiser4 -g BRICK_NAME

To print the list of all registered bricks, use

# volume.reiser4 -l

To mount your LV, simply issue a mount(8) command against one of the bricks of your LV. We recommend issuing it against the meta-data brick.

Comment. Reiser5 always tries to register the brick which is passed to the mount command as an argument, so it is not necessary to pre-register the brick you issue the mount command against.

= Deploying a logical volume after hard reset or system crash =

If no volume operations were interrupted by the hard reset or system crash, then just follow the instructions in this [https://reiser4.wiki.kernel.org/index.php?title=Logical_Volumes_Administration#Deploying_a_logical_volume_after_correct_shutdown section].

In Reiser5 only a restricted number of bricks participate in every transaction; the maximal number of such bricks can be specified by the user. At mount time a transaction replay procedure will be launched on each such brick independently, in parallel. Depending on the kind of interrupted volume operation, perform one of the following actions:

== Adding a brick was interrupted ==

Check your volume configuration. Register the old set of bricks (that is, the set of bricks that the volume had before the operation) and try to mount.
In case of error, register also the brick you wanted to add and try to mount again. Check the status of your LV by running

# volume.reiser4 /mnt

If the volume is unbalanced, then complete balancing manually by running

# volume.reiser4 -b /mnt

Check "bricks total" of your LV in the output of

# volume.reiser4 /mnt

and compare it with the old number of bricks in the configuration. The new value should be the old one incremented by 1. If the number of bricks is the same, then your operation of adding a brick was completely rolled back by the transaction manager, and you need to repeat it from scratch. Otherwise, your operation was successfully completed - update your volume configuration accordingly.

== Brick removal was interrupted ==

Check your volume configuration. Register the old set of bricks (that is, the set of bricks that the volume had before the interrupted operation) except the brick you wanted to remove. Try to mount the volume. In case of error, register also the brick you wanted to remove and try to mount again. Check the status of your LV:

# volume.reiser4 /mnt

If the volume is unbalanced, then complete balancing manually by running

# volume.reiser4 -b /mnt

Comment. After successful completion of balancing, the brick will be automatically removed from the volume. Verify this by checking the status of your LV:

# volume.reiser4 /mnt

Update your volume configuration accordingly.

== Another volume operation was interrupted ==

Using the volume configuration, register the new set of bricks and try to mount the volume. The mount should be successful.
Check the status of your LV:

# volume.reiser4 /mnt

If the volume is unbalanced, then complete balancing manually by running

# volume.reiser4 -b /mnt

= LV monitoring =

Common info about the LV mounted at /mnt:

# volume.reiser4 /mnt

 ID: Volume UUID
 volume: ID of the plugin managing the volume
 distribution: ID of the distribution plugin
 stripe: Stripe size in bytes
 segments: Number of hash space segments (for distribution)
 bricks total: Total number of bricks in the volume
 bricks in DSA: Number of bricks participating in data distribution
 balanced: Balanced status of the volume

Info about any of its bricks, of index J:

# volume.reiser4 -p J /mnt

 internal ID: Brick's "internal ID" and its status in the volume
 external ID: Brick's UUID
 device name: Name of the block device associated with the brick
 block count: Size of the block device in blocks
 blocks used: Total number of occupied blocks on the device
 system blocks: Minimal possible number of busy blocks on that device
 data capacity: Abstract capacity of the brick
 space usage: Portion of occupied blocks on the device
 in DSA: Participation in regular data distribution
 is proxy: Participation in data tiering (Burst Buffers, etc)

Comment. When retrieving a brick's info, make sure that no volume operations are in progress on that volume. Otherwise the command above will return an error (EBUSY).

WARNING. Brick info obtained this way is not necessarily the most recent. To get up-to-date info, run sync(1) and make sure that no regular file operations are in progress.

= Checking free space =

To check the number of available free blocks on a volume mounted at /mnt, make sure that no regular file operations or volume operations are in progress on that volume, then run

# sync
# df --block-size=4K /mnt

To check the number of free blocks on the brick of index J, run

# volume.reiser4 -p J /mnt

then calculate the difference between "block count" and "blocks used". Note that not all free blocks on a brick/volume are available for use.
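That subtraction can be scripted against the `-p` report. The sample report below is mocked up from the field list in "LV monitoring"; the real utility's output formatting may differ:

```shell
#!/bin/sh
# Sketch: free blocks on a brick = "block count" - "blocks used".
# The report text is a mock-up of `volume.reiser4 -p J /mnt` output.

sample='block count: 2621440
blocks used: 1657203'

# free_blocks REPORT -> "block count" minus "blocks used"
free_blocks() {
    echo "$1" | awk -F': *' '
        /^block count/ { total = $2 }
        /^blocks used/ { used  = $2 }
        END            { print total - used }'
}

free_blocks "$sample"   # 964237
```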
The number of available free blocks is always ~95% of the total number of free blocks (Reiser4 reserves 5% to make sure that regular file truncate operations won't fail).

NOTE: volume.reiser4 shows the total number of free blocks, whereas df(1) shows the number of available free blocks. The "space usage" statistic shows the portion of busy blocks on an individual brick. For the reasons explained above, "space usage" on any brick cannot exceed 0.95.

= Checking quality of data distribution =

Quality of data distribution is a measure of the deviation of the real data space usage from the ideal one defined by the volume partitioning. The smaller the deviation, the better the distribution quality. Checking quality of distribution makes sense only when your volume partitioning is space-based, or coincides with the space-based one. If your partitioning is throughput-based and doesn't coincide with the space-based one, then the quality of the actual data distribution can be rather bad: in that case the file system takes care that low-performance devices do not become a bottleneck, and effective space usage is not a high priority.

Checking quality of data distribution is based on the free-blocks accounting provided by the file system. Note that the file system doesn't count busy data and meta-data blocks separately, so you cannot find the real data space usage - and hence cannot check the quality of distribution - when the meta-data brick contains data blocks.

To check quality of distribution:

1) make sure that the meta-data brick doesn't contain data blocks;
2) make sure that no regular file or volume operations are currently in progress;
3) find the "blocks used", "system blocks" and "data capacity" statistics for each data brick:

# sync
# volume.reiser4 -p 1 /mnt
...
# volume.reiser4 -p N /mnt

4) find the real data space usage on each brick;
5) calculate the partitioning and the ideal data space usage on each data brick;
6) find the deviation of (4) from (5).

Example.
Let's build an LV of 3 bricks (one 10G meta-data brick vdb1, and two data bricks: vdc1 (10G) and vdd1 (5G)) with space-based partitioning:

# VOL_ID=`uuid -v4`
# echo "Using uuid $VOL_ID"
# mkfs.reiser4 -U $VOL_ID -y -t 256K /dev/vdb1
# mkfs.reiser4 -U $VOL_ID -y -a -t 256K /dev/vdc1
# mkfs.reiser4 -U $VOL_ID -y -a -t 256K /dev/vdd1
# mount /dev/vdb1 /mnt

Fill the meta-data brick with data:

# dd if=/dev/zero of=/mnt/myfile bs=256K
No space left on device...

Add the data bricks /dev/vdc1 and /dev/vdd1 to the volume:

# volume.reiser4 -a /dev/vdc1 /mnt
# volume.reiser4 -a /dev/vdd1 /mnt

Move all data blocks to the newly added bricks:

# volume.reiser4 -r /dev/vdb1 /mnt
# sync

Now the meta-data brick doesn't contain data blocks (only meta-data), so we can calculate the quality of data distribution:

# volume.reiser4 /mnt -p0
 blocks used: 503
# volume.reiser4 /mnt -p1
 blocks used: 1657203
 system blocks: 115
 data capacity: 2621069
# volume.reiser4 /mnt -p2
 blocks used: 833001
 system blocks: 73
 data capacity: 1310391

Based on the statistics above, calculate the quality of distribution.

Total data capacity of the volume:
 C = 2621069 + 1310391 = 3931460
Relative capacities of the data bricks:
 C1 = 2621069 / 3931460 = 0.6667
 C2 = 1310391 / 3931460 = 0.3333
Real space usage on the data bricks (blocks used - system blocks):
 R1 = 1657203 - 115 = 1657088
 R2 = 833001 - 73 = 832928
Space usage on the volume:
 R = R1 + R2 = 1657088 + 832928 = 2490016
Ideal data space usage on the data bricks:
 I1 = C1 * R = 0.6667 * 2490016 = 1660094
 I2 = C2 * R = 0.3333 * 2490016 = 829922
Deviation:
 (D1, D2) = (R1, R2) - (I1, I2) = (-3006, 3006)
Relative deviation:
 (D1/R, D2/R) = (-0.0012, 0.0012)
Quality of distribution:
 Q = 1 - max(|D1/R|, |D2/R|) = 1 - 0.0012 = 0.9988

Comment. For any specified number of bricks N and quality of distribution Q, it is possible to find a configuration of a logical volume composed of N bricks such that the quality of distribution on that volume is better than Q.
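The computation in this example can be reproduced with a small script (a sketch; it takes the "blocks used", "system blocks" and "data capacity" values of the two data bricks as arguments):

```shell
#!/bin/sh
# Quality of distribution for a two-data-brick volume, following the
# worked example: Q = 1 - max(|D1/R|, |D2/R|).

# quality USED1 SYS1 CAP1 USED2 SYS2 CAP2 -> Q, 4 decimal places
quality() {
    awk -v u1="$1" -v s1="$2" -v c1="$3" \
        -v u2="$4" -v s2="$5" -v c2="$6" '
    BEGIN {
        r1 = u1 - s1; r2 = u2 - s2              # real data space usage
        r  = r1 + r2                            # data blocks on the volume
        i1 = r * c1 / (c1 + c2)                 # ideal usage, brick 1
        i2 = r * c2 / (c1 + c2)                 # ideal usage, brick 2
        d1 = (r1 - i1) / r; d2 = (r2 - i2) / r  # relative deviations
        a1 = (d1 < 0 ? -d1 : d1)
        a2 = (d2 < 0 ? -d2 : d2)
        printf "%.4f\n", 1 - (a1 > a2 ? a1 : a2)
    }'
}

# The numbers printed by `volume.reiser4 -p1` and `-p2` above:
quality 1657203 115 2621069 833001 73 1310391   # 0.9988
```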
Quality of distribution Q does not depend on the number of bricks in the logical volume. This is a theorem, which can be strictly proven.

= FAQ =

Q. What happens if I lose a device component of my logical volume (due to a breakdown, etc)?

A. The bodies of some of your regular files will become "punched" in random places. The portion of such files depends on the relative capacity of the lost brick, on the number of bricks in the logical volume, and on other factors. Fsck will be able to detect and remove files with corrupted bodies. Nevertheless, we recommend considering mirroring your bricks (e.g. by software or hardware RAID-1) to avoid such highly unpleasant situations.

a732a34c17654f0768f18852ba4d67605b27bba1
We make the user responsible for this. Volume configuration is needed to facilitate deploying a volume. '''Abstract capacity''' (or simply capacity) of a brick is a positive integer number. Capacity is a brick's property defined by user. Don't confuse it with the size of block device. Think of it as of brick's "weight" in some units. And this is the user, who decides, which property of the brick to assign as its abstract capacity and in which units. In particular, it can be size of the block device in kilobytes, or its size in megabytes, or its throughput in M/sec, or other geometric or physical parameter of the device, associated with the brick. It is important that capacities of all bricks of the same logical volume are measured in the same units. Also, it would be utterly pointless to assign different properties as abstract capacities for bricks of the same LV. For example, size of block device for one brick, and disk bandwidth for another one. Capacity of each brick gets initialized by mkfs utility. By default it is calculated as number of free blocks on the device at the very end of the formatting procedure. For meta-data brick it is calculated as 70% of such amount. Capacity of any brick can be changed on-line by user. '''Capacity of a logical volume''' is defined as a sum of capacities of its bricks-components. '''Relative capacity of a brick''' is the ratio of brick's capacity to volume's capacity. Relative capacity defines a portion of IO-requests that will be issued against that brick. Array of relative capacities (C1, C2, ...) of all bricks is called volume partitioning. Obviously, C1 + C2 + ... = 1. '''(Real) data space usage''' on a brick is number of data blocks, stored on that brick. '''Ideal (or expected) data space usage''' on a brick is T*C, where T is total number of data blocks stored in the volume. C is relative capacity of the brick. 
It is recommended to compose volumes so that the space-based partitioning coincides with the throughput-based one - this is the optimal volume configuration, which provides true parallelism. If that is impossible for some reason, then choose a preferred partitioning method (space-based or throughput-based). Note that space-based partitioning saves volume space, whereas throughput-based partitioning saves volume throughput.

When performing regular file operations, Reiser5 distributes data stripes throughout the volume evenly and fairly. This means that the portion of IO requests issued against each brick is equal to its relative capacity, that is, to the portion of capacity that the brick contributes to the total volume capacity.

Most volume operations are accompanied by rebalancing, which preserves fairness of distribution. For example, adding a brick to a logical volume changes its partitioning, and hence breaks fairness of the distribution, so some data stripes need to be moved to the new brick to make the distribution fair again. Likewise, you cannot simply remove a brick from a logical volume - all data stripes first have to be moved from that brick to the other bricks of the logical volume.

Every time the user performs a volume operation, Reiser5 marks the LV as "not balanced". After successful balancing the status of the LV is changed back to "balanced". If the balancing procedure fails for some reason, it should be resumed manually (with the volume.reiser4 utility). It is allowed to perform regular file operations on a non-balanced LV. However, in this case:

a) we don't guarantee a good quality of data distribution on your LV;
b) you won't be able to perform volume operations on your LV except balancing - any other volume operation will return an error (EBUSY).

So don't forget to bring your LV to the balanced state as soon as possible!

= Prepare Software and Hardware =

Build, install and boot a kernel with Reiser4 of software framework release number 5.X.Y.
Kernel patches can be found [https://sourceforge.net/projects/reiser4/files/v5-unstable/ here]. Note that the Linux kernel and GNU utilities still recognize the testing code as "Reiser4". Make sure the following message appears in the kernel logs: "Loading Reiser4 (Software Framework Release: 5.X.Y)".

Build and install the latest [https://sourceforge.net/projects/reiser4/files/reiser4-utils/libaal/ libaal]. Download, build and install the latest version 2.A.B of the [https://sourceforge.net/projects/reiser4/files/v5-unstable/ Reiser4progs package]. Make sure that the utility for managing logical volumes is installed (as a part of reiser4progs) on your machine:

 # volume.reiser4 -?

= Creating a logical volume =

Start by choosing a unique ID (UUID) for your volume. By default it is generated by the mkfs utility. However, the user can also generate it with a suitable tool (e.g. uuid(1)) and store it in an environment variable for convenience:

 # VOL_ID=`uuid -v4`
 # echo "Using uuid $VOL_ID"

Choose a stripe size for your logical volume. For a good quality of distribution it is recommended that the stripe not exceed 1/10000 of the volume size. On the other hand, too small a stripe will increase space consumption on your meta-data brick. In our example we choose a stripe size of 256K.

Start by creating the first brick of your volume - the meta-data brick - passing the volume ID and stripe size to the mkfs.reiser4 utility:

 # mkfs.reiser4 -U $VOL_ID -t 256K /dev/vdb1

Currently only one meta-data brick per volume is supported, so it is recommended that the block device for the meta-data brick not be too small. In most cases it will be enough if your meta-data brick is not smaller than 1/200 of the maximal volume size. For example, a 100G meta-data brick will be able to service a ~20T logical volume.
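The two sizing rules above (stripe at most 1/10000 of the volume size; meta-data brick at least 1/200 of the maximal volume size) are easy to check numerically. A minimal sketch, with made-up helper names:

```python
# Sketch of the sizing recommendations above (helper names are
# hypothetical, not part of reiser4progs):
# - stripe size should not exceed 1/10000 of the volume size;
# - the single meta-data brick should be at least 1/200 of the
#   maximal volume size it is expected to service.

def max_stripe_size(volume_size):
    return volume_size // 10000

def min_metadata_brick_size(max_volume_size):
    return max_volume_size // 200

TiB = 1024 ** 4
GiB = 1024 ** 3

# The "100G brick services ~20T volume" figure from the text:
print(min_metadata_brick_size(20 * TiB) // GiB)  # -> 102 (i.e. ~100 GiB)
print(max_stripe_size(20 * TiB))                 # largest recommended stripe, in bytes
```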
Mount your logical volume, consisting of one meta-data brick:

 # mount /dev/vdb1 /mnt

Find the record about your volume in the output of the following command:

 # volume.reiser4 -l

Create the configuration of your logical volume (its definition is above) and store it somewhere - but not on that volume! Your logical volume is now on-line and ready to use. You can perform regular file operations and volume operations (e.g. add a data brick to your LV).

= Adding a data brick to LV =

At any time you are able to add a data brick to your LV. You can do it in parallel with regular file operations executing on the volume. Make sure, however, that no other volume operation (e.g. removing a brick) is in progress on your volume, otherwise the operation will fail with EBUSY. Obviously, adding a brick will increase the capacity of your volume.

Choose a block device for the new data brick. Make sure that it is neither too large nor too small: capacities of any two bricks of the same logical volume cannot differ by more than 2^19 (~500,000) times. E.g. your logical volume cannot contain both a 1M and a 2T brick. Any attempt to add a brick of improper capacity will fail with an error.

Format it the same way as the meta-data brick, but specify also the "-a" option (to let mkfs know that it is a data brick):

 # mkfs.reiser4 -U $VOL_ID -t 256K -a /dev/vdb2

Important: make sure you specified the same volume ID and stripe size that the other bricks of the logical volume have. Otherwise, the operation of adding a data brick will fail.

Update the configuration of your volume with the UUID or name of the brick you want to add (item #4). To add a brick, simply pass its name as an argument to the "-a" option and specify your LV via its mount point:

 # volume.reiser4 -a /dev/vdb2 /mnt

The procedure of adding a brick automatically invokes rebalancing, which moves a portion of data stripes to the newly added brick (so that the resulting distribution will be fair).
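Two numeric properties of brick addition can be sketched in Python: the 2^19 capacity-ratio limit between bricks, and the fraction of data stripes that rebalancing moves to a newly added brick (equal to its relative capacity in the updated volume). The helper names are hypothetical:

```python
# Sketch of two properties described in the text (helper names are
# made up): the 2^19 capacity-ratio limit between any two bricks of
# a volume, and the expected fraction of stripes moved when adding
# a brick.

MAX_CAPACITY_RATIO = 2 ** 19  # 524288, i.e. ~500,000

def ratio_ok(capacities):
    """No two bricks may differ in capacity by more than 2^19 times."""
    return max(capacities) <= min(capacities) * MAX_CAPACITY_RATIO

def rebalance_fraction(old_capacities, new_capacity):
    """Fraction of data stripes moved to a newly added brick: its
    relative capacity in the updated volume."""
    return new_capacity / (sum(old_capacities) + new_capacity)

# 1M-block vs 2T-block bricks differ by 2^21 times -> not allowed:
print(ratio_ok([2 ** 20, 2 ** 41]))         # -> False
# Adding a brick that contributes 10% of capacity moves ~10% of stripes:
print(rebalance_fraction([900, 900], 200))  # -> 0.1
```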
The portion of data blocks moved during such rebalancing is equal to the relative capacity of the new brick, that is, to the portion of capacity that the new brick adds to the updated LV's capacity. This important property defines the cost of the balancing procedure: if the portion of capacity added by a brick is small, then the number of stripes moved during balancing is also small.

Like other user-space utilities, the operation of adding a brick can return an error, even assuming that the brick you wanted to add is properly formatted. In this case check the status of your LV:

 # volume.reiser4 /mnt

If the volume is unbalanced, then simply complete balancing manually:

 # volume.reiser4 -b /mnt

Otherwise, check the number of bricks in your LV. Most likely it is the same as it was before the failed operation; in this case simply repeat the operation of adding a brick from scratch.

Upon successful completion update your volume configuration. That is, increment (#2), add info about the new brick to (#3) and remove the record at (#4).

= Removing a data brick from LV =

At any time you are able to remove a data brick from your LV. You can do it in parallel with regular file operations executing on the volume. Make sure, however, that no other volume operation (e.g. adding a brick) is in progress on your volume, otherwise the operation will fail with EBUSY. Obviously, removing a brick will decrease the abstract capacity of your LV. Note that the other bricks must have enough space to store all data blocks of the brick you want to remove, otherwise the removal operation will return an error (ENOSPC).

Suppose you want to remove brick /dev/vdb2 from your LV mounted at /mnt. Update your volume configuration with the UUID and name of the brick you want to remove (item #4).
To remove a brick, simply pass its name as an argument to the "-r" option and specify the logical volume by its mount point:

 # volume.reiser4 -r /dev/vdb2 /mnt

The procedure of brick removal automatically invokes rebalancing, which distributes the data of the brick to be removed among the other bricks, so that the resulting distribution is also fair. The portion of data stripes moved during such rebalancing is equal to the relative capacity of the brick to be removed (that is, to the portion of capacity that the brick contributed to the LV's capacity).

It can happen that the command above completes with an error (like other user-space applications). In this case check the status of your LV:

 # volume.reiser4 /mnt

If the volume is not balanced, then simply complete balancing manually:

 # volume.reiser4 -b /mnt

Otherwise, check the number of bricks in your logical volume - it should be the same as before the failed operation. The error -ENOSPC indicates that the free space on the other bricks is not enough to fit all the data of the brick you want to remove.

On success, update your volume configuration: remove the information about the brick /dev/vdb2 at (#3) and (#4). Check your kernel logs: they should contain a message that brick /dev/vdb2 has been unregistered. Now the device /dev/vdb2 doesn't belong to the logical volume any more, and you can reuse it for other purposes (re-format it, etc).

= Changing brick's capacity =

At any time (assuming that no other volume operation is in progress) you can change the abstract capacity of any brick to some new non-zero value. Changing capacity always changes the volume partitioning, and therefore breaks fairness of distribution, so Reiser5 automatically launches rebalancing to make sure that the resulting distribution is fair for the new set of capacities. In particular, increasing a brick's capacity will move some data from other bricks to the brick whose capacity was increased.
Decreasing a brick's capacity will move some data from the brick whose capacity was decreased to the other bricks. To change the abstract capacity of brick /dev/vdb1 to a new value (e.g. 200000), simply run:

 # volume.reiser4 -z /dev/vdb1 -c 200000 /mnt

Pronounced as "resize brick /dev/vdb1 to new capacity 200000 in the volume mounted at /mnt".

The operation of changing capacity can return an error. Most likely it is -ENOSPC, a side effect of concurrent regular file writes. In this case check the status of your LV. If it is unbalanced, then consider removing some files from your LV and complete balancing by running:

 # volume.reiser4 -b /mnt

Otherwise, repeat the operation from scratch.

Comment. Changing a brick's capacity to 0 is undefined and will return an error. Consider the brick removal operation instead.

= Operations with meta-data brick =

The meta-data brick can also contain data stripes and participate in data distribution like other data bricks, so all the volume operations described above are also applicable to the meta-data brick. Note, however, that it is impossible to completely remove the meta-data brick from the logical volume for obvious reasons (meta-data needs to be stored somewhere), so the brick removal operation applied to the meta-data brick actually removes it from the Data Storage Array (DSA), not from the logical volume. The DSA is the subset of the LV consisting of the bricks participating in data distribution. Once you remove the meta-data brick from the DSA, that brick will be used only to store meta-data. The operation of adding a brick, applied to the meta-data brick, returns it back to the DSA.

Important: Reiser5 doesn't count busy data and meta-data blocks separately. So, in contrast with data bricks (which contain only data), you are not able to find out the real space occupied by data blocks on the meta-data brick - Reiser5 knows only the total space occupied.

To check the status of the meta-data brick simply run:

 # volume.reiser4 /mnt

and compare the values of "bricks total" and "bricks in DSA".
If they are equal, then the meta-data brick participates in data distribution. Otherwise, "bricks total" should be one more than "bricks in DSA" - this indicates that the meta-data brick doesn't participate in data distribution (and therefore doesn't contain data blocks). Other cases are impossible: for data bricks, participation in the LV and in the DSA are always equivalent.

= Unmounting a logical volume =

To terminate a mount session, just issue the usual umount command with the mount point specified. Note that after unmounting the volume all bricks by default remain registered in the system until system shutdown. If you want to unregister a brick before system shutdown, then simply issue the following command:

 # volume.reiser4 -u BRICK_NAME

= Deploying a logical volume after correct unmount =

Make sure (by checking your volume configuration) that all bricks of the volume are registered in the system. To register a brick, issue the following command:

 # volume.reiser4 -g BRICK_NAME

The list of all volumes and bricks registered in the system can be found in the output of the following command:

 # volume.reiser4 -l

Issue the usual mount(8) command against one of the bricks of your volume. It is recommended to issue it against the meta-data brick.

NOTE: Reiser5 will refuse to mount a logical volume if a wrong (incomplete or redundant) set of bricks is registered in the system. A redundant set of bricks appears, for example, when you mistakenly register a brick that was earlier removed from the logical volume.

= Deploying a logical volume after correct shutdown =

To mount your LV, first make sure that all its bricks (data and meta-data) are registered in the system.

Important: Reiser5 will refuse to mount a logical volume if a wrong (incomplete or redundant) set of bricks is registered in the system. A redundant set of bricks appears, for example, when you mistakenly register a brick that was removed from the logical volume.
For this reason we strongly recommend that users keep track of their LV - store its configuration somewhere, but not on the volume itself! And don't forget to update that configuration after _every_ volume operation. If you lost the configuration of your LV and don't remember it (which is most likely for large volumes), then it will be rather painful to restore: currently there are no tools to manage logical volumes off-line, so users have to do this on their own. It is not at all difficult.

To register a brick in the system, use the following command:

 # volume.reiser4 -g BRICK_NAME

To print the list of all registered bricks, use:

 # volume.reiser4 -l

To mount your LV, simply issue a mount(8) command against one of the bricks of your LV. We recommend issuing it against the meta-data brick.

Comment. Reiser5 always tries to register the brick which is passed to the mount command as an argument, so it is not necessary to pre-register the brick you want to issue the mount command against.

= Deploying a logical volume after hard reset or system crash =

If no volume operations were interrupted by the hard reset or system crash, then just follow the instructions in this [https://reiser4.wiki.kernel.org/index.php?title=Logical_Volumes_Administration#Deploying_a_logical_volume_after_correct_shutdown section]. In Reiser5 only a restricted number of bricks participate in every transaction; the maximal number of such bricks can be specified by the user. At mount time a transaction replay procedure will be launched on each such brick independently, in parallel.

Depending on the kind of interrupted volume operation, perform one of the following actions:

== Adding a brick was interrupted ==

Check your volume configuration. Register the old set of bricks (that is, the set of bricks that the volume had before the operation was applied) and try to mount. In case of error, register also the brick you wanted to add and try to mount again.
Check the status of your LV by running:

 # volume.reiser4 /mnt

If the volume is unbalanced, then complete balancing manually by running:

 # volume.reiser4 -b /mnt

Check "bricks total" of your LV in the output of:

 # volume.reiser4 /mnt

Compare it with the old number of bricks in the configuration. The new value should be one more than the old one. If the number of bricks is the same, then your operation of adding a brick was completely rolled back by the transaction manager, and you need to repeat it from scratch. Otherwise, your operation was successfully completed - update your volume configuration accordingly.

== Brick removal was interrupted ==

Check your volume configuration. Register the old set of bricks (that is, the set of bricks that the volume had before the interrupted operation was applied) except the brick you wanted to remove. Try to mount the volume. In case of error, register also the brick you wanted to remove and try to mount again. Check the status of your LV:

 # volume.reiser4 /mnt

If the volume is unbalanced, then complete balancing manually by running:

 # volume.reiser4 -b /mnt

Comment. After successful completion of balancing the brick will be automatically removed from the volume. Make sure of it by checking the status of your LV:

 # volume.reiser4 /mnt

Update your volume configuration accordingly.

== Another volume operation was interrupted ==

Using the volume configuration, register the new set of bricks and try to mount the volume. The mount should be successful.
Check the status of your LV:

 # volume.reiser4 /mnt

If the volume is unbalanced, then complete balancing manually by running:

 # volume.reiser4 -b /mnt

= LV monitoring =

Common info about the LV mounted at /mnt:

 # volume.reiser4 /mnt

 ID:             Volume UUID
 volume:         ID of the plugin managing the volume
 distribution:   ID of the distribution plugin
 stripe:         Stripe size in bytes
 segments:       Number of hash space segments (for distribution)
 bricks total:   Total number of bricks in the volume
 bricks in DSA:  Number of bricks participating in data distribution
 balanced:       Balanced status of the volume

Info about any of its bricks of index J:

 # volume.reiser4 -p J /mnt

 internal ID:    Brick's "internal ID" and its status in the volume
 external ID:    Brick's UUID
 device name:    Name of the block device associated with the brick
 block count:    Size of the block device in blocks
 blocks used:    Total number of occupied blocks on the device
 system blocks:  Minimal possible number of busy blocks on that device
 data capacity:  Abstract capacity of the brick
 space usage:    Portion of occupied blocks on the device
 in DSA:         Participation in regular data distribution
 is proxy:       Participation in data tiering (Burst Buffers, etc)

Comment. When retrieving a brick's info, make sure that no volume operations on that volume are in progress. Otherwise the command above will return an error (EBUSY).

WARNING. Brick info provided this way is not necessarily the most recent. To get up-to-date info, run sync(1) and make sure that no regular file operations are in progress.

= Checking free space =

To check the number of available free blocks on a volume mounted at /mnt, make sure that no regular file operations or volume operations are in progress on that volume, then run:

 # sync
 # df --block-size=4K /mnt

To check the number of free blocks on the brick of index J, run:

 # volume.reiser4 -p J /mnt

and calculate the difference between "block count" and "blocks used".

Comment. Not all free blocks on a brick/volume are available for use.
The number of available free blocks is always ~95% of the total number of free blocks (Reiser4 reserves 5% to make sure that regular file truncate operations won't fail).

NOTE: volume.reiser4 shows the total number of free blocks, whereas df(1) shows the number of available free blocks. The "space usage" statistic shows the portion of busy blocks on an individual brick. For the reasons explained above, "space usage" on any brick cannot exceed 0.95.

= Checking quality of data distribution =

Quality of data distribution is a measure of the deviation of the real data space usage from the ideal one defined by the volume partitioning. The smaller the deviation, the better the distribution quality. Checking quality of distribution makes sense only when your volume partitioning is space-based, or coincides with the space-based one. If your partitioning is throughput-based and doesn't coincide with the space-based one, then the quality of the actual data distribution can be rather bad: in that case the file system takes care that low-performance devices do not become a bottleneck, and effective space usage is not a high priority.

Checking quality of data distribution is based on the free-block accounting provided by the file system. Note that the file system doesn't count busy data and meta-data blocks separately, so you are not able to find the real data space usage - and hence to check quality of distribution - in the case when the meta-data brick contains data blocks.

To check quality of distribution:

1) make sure that the meta-data brick doesn't contain data blocks;
2) make sure that no regular file or volume operations are currently in progress;
3) find the "blocks used", "system blocks" and "data capacity" statistics for each data brick:

 # sync
 # volume.reiser4 -p 1 /mnt
 ...
 # volume.reiser4 -p N /mnt

4) find the real data space usage on each brick;
5) calculate the partitioning and the ideal data space usage on each data brick;
6) find the deviation of (4) from (5).

Example.
Let's build an LV of 3 bricks (one 10G meta-data brick vdb1, and two data bricks: vdc1 (10G) and vdd1 (5G)) with space-based partitioning:

 # VOL_ID=`uuid -v4`
 # echo "Using uuid $VOL_ID"
 # mkfs.reiser4 -U $VOL_ID -y -t 256K /dev/vdb1
 # mkfs.reiser4 -U $VOL_ID -y -a -t 256K /dev/vdc1
 # mkfs.reiser4 -U $VOL_ID -y -a -t 256K /dev/vdd1
 # mount /dev/vdb1 /mnt

Fill the meta-data brick with data:

 # dd if=/dev/zero of=/mnt/myfile bs=256K
 No space left on device...

Add data bricks /dev/vdc1 and /dev/vdd1 to the volume:

 # volume.reiser4 -a /dev/vdc1 /mnt
 # volume.reiser4 -a /dev/vdd1 /mnt

Move all data blocks to the newly added bricks by removing the meta-data brick from the DSA:

 # volume.reiser4 -r /dev/vdb1 /mnt
 # sync

Now the meta-data brick doesn't contain data blocks (only meta-data ones), so we can calculate the quality of data distribution:

 # volume.reiser4 /mnt -p0
 blocks used: 503
 # volume.reiser4 /mnt -p1
 blocks used: 1657203
 system blocks: 115
 data capacity: 2621069
 # volume.reiser4 /mnt -p2
 blocks used: 833001
 system blocks: 73
 data capacity: 1310391

Based on the statistics above, calculate the quality of distribution.

Total data capacity of the volume:

 C = 2621069 + 1310391 = 3931460

Relative capacities of the data bricks:

 C1 = 2621069 / 3931460 = 0.6667
 C2 = 1310391 / 3931460 = 0.3333

Real space usage on the data bricks (blocks used - system blocks):

 R1 = 1657203 - 115 = 1657088
 R2 = 833001 - 73 = 832928

Space usage on the volume:

 R = R1 + R2 = 1657088 + 832928 = 2490016

Ideal data space usage on the data bricks:

 I1 = C1 * R = 0.6667 * 2490016 = 1660094
 I2 = C2 * R = 0.3333 * 2490016 = 829922

Deviation:

 D = (R1, R2) - (I1, I2) = (-3006, 3006)

Relative deviation:

 D/R = (-0.0012, 0.0012)

Quality of distribution:

 Q = 1 - max(|D1|, |D2|)/R = 1 - 0.0012 = 0.9988

Comment. For any specified number of bricks N and quality of distribution Q, it is possible to find a configuration of a logical volume composed of N bricks such that the quality of distribution on that volume is better than Q.

Comment.
Quality of distribution Q doesn't depend on the number of bricks in the logical volume. This is a theorem, which can be strictly proven.

= FAQ =

Q. What happens if I lose a device component (due to a breakdown, etc.) of my logical volume?

A. The bodies of some of your regular files will become "punched" in random places. The portion of such files depends on the relative capacity of the lost brick, on the number of bricks in the logical volume, and on other factors. Fsck will be able to detect and remove such files with corrupted bodies. Nevertheless, we recommend considering mirroring your bricks (e.g. by software or hardware RAID-1) to avoid such highly unpleasant situations.
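The quality-of-distribution arithmetic from the worked example on this page can be reproduced with a short script. This is a sketch (the function name is made up); the inputs are the "blocks used", "system blocks" and "data capacity" figures printed by volume.reiser4:

```python
# Sketch: quality of distribution Q, computed from per-brick stats
# as in the worked example (bricks that hold no data are excluded).

def distribution_quality(bricks):
    """bricks: list of (blocks_used, system_blocks, data_capacity)
    tuples for the data bricks of the volume."""
    caps = [cap for (_, _, cap) in bricks]
    total_cap = sum(caps)
    # Real data usage R_i = blocks used - system blocks:
    real = [used - sys for (used, sys, _) in bricks]
    total = sum(real)
    # Ideal usage I_i = C_i * R, where C_i is relative capacity:
    ideal = [total * cap / total_cap for cap in caps]
    # Relative deviation |R_i - I_i| / R per brick:
    dev = [abs(r - i) / total for r, i in zip(real, ideal)]
    return 1 - max(dev)

# Statistics of the two data bricks from the example:
bricks = [(1657203, 115, 2621069),
          (833001, 73, 1310391)]
print(round(distribution_quality(bricks), 4))  # -> 0.9988
```

This matches the value Q = 0.9988 obtained by hand in the example.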
We make the user responsible for this. Volume configuration is needed to facilitate deploying a volume. '''Abstract capacity''' (or simply capacity) of a brick is a positive integer number. Capacity is a brick's property defined by user. Don't confuse it with the size of block device. Think of it as of brick's "weight" in some units. And this is the user, who decides, which property of the brick to assign as its abstract capacity and in which units. In particular, it can be size of the block device in kilobytes, or its size in megabytes, or its throughput in M/sec, or other geometric or physical parameter of the device, associated with the brick. It is important that capacities of all bricks of the same logical volume are measured in the same units. Also, it would be utterly pointless to assign different properties as abstract capacities for bricks of the same LV. For example, size of block device for one brick, and disk bandwidth for another one. Capacity of each brick gets initialized by mkfs utility. By default it is calculated as number of free blocks on the device at the very end of the formatting procedure. For meta-data brick it is calculated as 70% of such amount. Capacity of any brick can be changed on-line by user. '''Capacity of a logical volume''' is defined as a sum of capacities of its bricks-components. '''Relative capacity of a brick''' is the ratio of brick's capacity to volume's capacity. Relative capacity defines a portion of IO-requests that will be issued against that brick. Array of relative capacities (C1, C2, ...) of all bricks is called volume partitioning. Obviously, C1 + C2 + ... = 1. '''(Real) data space usage''' on a brick is number of data blocks, stored on that brick. '''Ideal (or expected) data space usage''' on a brick is T*C, where T is total number of data blocks stored in the volume. C is relative capacity of the brick. 
It is recommended to compose volumes in the way so that space-based partitioning coincides with throughput-based one - it would be the optimal volume configuration, which provides true parallelism. If it is impossible for some reason, then choose a preferred partitioning method (space-based, or throughput-based). Note that space-based partitioning saves volume space, whereas throughput based one saves volume throughput. When performing regular file operations, Reiser5 distributes data stripes throughout the volume evenly and fairly. It means that portion of IO-requests issued against each brick is equal to its relative capacity, that is, to the portion of capacity that the brick adds to the total volume's capacity. Most volume operations are accompanied by rebalancing, which keeps fairness of distribution. For example, adding a brick to a logical volume changes its partitioning, and hence, breaks fairness of the distribution, so we need to move some data stripes to the new brick to make distribution fair. Also you can not simply remove a brick from a logical volume - all data stripes should be moved from that brick to other bricks of the logical volume. Every time when user performs a volume operation, Reiser5 marks LV as "not balanced". After successful balancing the status of LV is changed to "balanced". If balancing procedure fails for some reasons, it should be resumed manually (with volume.reiser4 utility). It is allowed to perform regular file operations on not balanced LV. However, in this case: a) we don't guarantee a good quality of data distribution on your LV. b) you won't be able to perform volume operations on your LV except balancing - any other volume operation will return error (EBUSY). So, don't forget to bring your LV to the balanced state as soon as possible! = Prepare Software and Hardware = Build, install and boot kernel with Reiser4 of software framework release number 5.X.Y. 
Kernel patches can be found [https://sourceforge.net/projects/reiser4/files/v5-unstable/ here]. Note that by Linux kernel and GNU utilities the testing stuff is still recognized as "Reiser4". Make sure there is the following message in kernel logs: "Loading Reiser4 (Software Framework Release: 5.X.Y)" Build and install the latest [https://sourceforge.net/projects/reiser4/files/reiser4-utils/libaal/ libaal] Download, build and install the latest version 2.A.B of [https://sourceforge.net/projects/reiser4/files/v5-unstable/ Reiser4progs package]. Make sure that utility for managing logical volumes is installed (as a part of reiser4progs package) on your machine: # volume.reiser4 -? = Creating a logical volume = Start from choosing a unique ID (uuid) of your volume. By default it is generated by mkfs utility. However, user can generate it himself by proper tools (e.g. uuid(1)) and store in an environment variable for convenience: # VOL_ID=`uuid -v4` # echo "Using uuid $VOL_ID" Choose a stripe size for your logical volume. For a good quality of distribution it is recommended that stripe doesn't exceed 1/10000 of volume size. On the other hand, too small stripes will increase space consumption on your meta-data brick. In our example we choose stripe size 256K. Start from creating the first brick of your volume - meta-data brick, passing volume-ID and stripe size to mkfs.reiser4 utility: # mkfs.reiser4 -U $VOL_ID -t 256K /dev/vdb1 Currently only one meta-data brick per volume is supported, so it is recommended that size of block device for meta-data brick in not too small. In most cases it will be enough, if your meta-data brick is not smaller than 1/200 of maximal volume size. For example, 100G meta-data brick will be able to service ~20T logical volume. 
Mount your logical volume consisting of one meta-data brick: # mount /dev/vdb1 /mnt Find a record about your volume in the output of the following command: # volume.reiser4 -l Create configuration of your logical volume (its definition is above) and store it somewhere, but not on that volume! Your logical volume is now on-line and ready to use. You can perform regular file operations and volume operations (e.g. add a data brick to your LV). = Adding a data brick to LV = At any time you are able to add a data brick to your LV. You can do it in parallel with regular file operations executing on this volume. Make sure, however, that there is no other volume operations (e.g. removing a brick) over your volume in progress, otherwise your operation will fail with EBUSY. Obviously, adding a brick will increase capacity of your volume. Choose a block device for the new data brick. Make sure that it is not too large, or too small. Capacities of any 2 bricks of the same logical volume can not differ more than 2^19 (~1 million) times. E.g. your logical volume can not contain both, 1M and 2T bricks. Any attempts to add a brick of improper capacity will fail with error. Format it by the same way as meta-data brick, but specify also "-a" option (to let mkfs know that it is data brick). # mkfs.reiser4 -U $VOL_ID -t 256K -a /dev/vdb2 Important: make sure you specified the same volume ID and stripe size as other bricks of the logical volume do have. Otherwise, operation of adding a data brick will fail. Update configuration of your volume with UUID or name of the brick you want to add (item #4). To add a brick simply pass its name as an argument for the option "-a" and specify your LV via its mount point: # volume.reiser4 -a /dev/vdb2 /mnt The procedure of adding a brick automatically invokes re-balancing, which moves a portion of data stripes to the newly added brick (so that the resulted distribution will fair). 
The portion of data blocks moved during such rebalancing is equal to the relative capacity of the new brick, that is, to the portion of capacity that the new brick adds to the updated LV's capacity. This important property defines the cost of the balancing procedure: if the portion of capacity added by a brick is small, then the number of stripes moved during balancing is also small.

Like other user-space utilities, the operation of adding a brick can return an error, even when the brick you wanted to add is properly formatted. In this case check the status of your LV:

 # volume.reiser4 /mnt

If the volume is unbalanced, then simply complete balancing manually:

 # volume.reiser4 -b /mnt

Otherwise, check the number of bricks in your LV. Most likely it is the same as it was before the failed operation; in this case simply repeat the operation of adding a brick from scratch. Upon successful completion update your volume configuration: increment (#2), add info about the new brick to (#3), and remove the record at (#4).

= Removing a data brick from LV =

You can remove a data brick from your LV at any time, in parallel with regular file operations executing on the volume. Make sure, however, that no other volume operation (e.g. adding a brick) is in progress on the volume; otherwise your operation will fail with EBUSY. Obviously, removing a brick decreases the abstract capacity of your LV. Note that the other bricks must have enough space to store all data blocks of the brick you want to remove; otherwise the removal operation will return an error (ENOSPC).

Suppose you want to remove brick /dev/vdb2 from your LV mounted at /mnt. Update your volume configuration with the UUID and name of the brick you want to remove (item #4).
To remove a brick, simply pass its name as the argument of the "-r" option and specify the logical volume by its mount point:

 # volume.reiser4 -r /dev/vdb2 /mnt

The procedure of brick removal automatically invokes re-balancing, which distributes the data of the brick to be removed among the other bricks, so that the resulting distribution is also fair. The portion of data stripes moved during such rebalancing is equal to the relative capacity of the brick to be removed (that is, to the portion of capacity that the brick added to the LV's capacity).

It can happen that the command above completes with an error (like other user-space applications). In this case check the status of your LV:

 # volume.reiser4 /mnt

If the volume is not balanced, then simply complete balancing manually:

 # volume.reiser4 -b /mnt

Otherwise, check the number of bricks in your logical volume - it should be the same as before the failed operation. The error -ENOSPC indicates that the free space on the other bricks is not enough to fit all the data of the brick you want to remove.

On success update your volume configuration: remove the information about the brick /dev/vdb2 at (#3) and (#4). Check your kernel logs: they should contain a message that brick /dev/vdb2 has been unregistered. Now the device /dev/vdb2 doesn't belong to the logical volume any more, and you can reuse it for other purposes (re-format it, etc).

= Changing brick's capacity =

At any time (assuming that no other volume operation is in progress) you can change the abstract capacity of any brick to a new non-zero value. Changing capacity always changes the volume partitioning, and therefore breaks fairness of distribution, so Reiser5 automatically launches rebalancing to make sure the resulting distribution is fair for the new set of capacities. In particular, increasing a brick's capacity will move some data from other bricks to the brick whose capacity was increased.
Decreasing a brick's capacity will move some data from the brick whose capacity was decreased to the other bricks. To change the abstract capacity of brick /dev/vdb1 to a new value (e.g. 200000), simply run

 # volume.reiser4 -z /dev/vdb1 -c 200000 /mnt

pronounced as "resize brick /dev/vdb1 to new capacity 200000 in the volume mounted at /mnt".

The operation of changing capacity can return an error. Most likely it is -ENOSPC, a side effect of concurrent regular file writes. In this case check the status of your LV. If it is unbalanced, then consider removing some files from your LV and complete balancing by running

 # volume.reiser4 -b /mnt

Otherwise, repeat the operation from scratch.

Comment. Changing a brick's capacity to 0 is undefined and will return an error. Consider the brick removal operation instead.

= Operations with meta-data brick =

The meta-data brick can also contain data stripes and participate in data distribution like the data bricks, so all the volume operations described above are also applicable to the meta-data brick. Note, however, that it is impossible to completely remove the meta-data brick from the logical volume for obvious reasons (meta-data needs to be stored somewhere), so the brick removal operation applied to the meta-data brick actually removes it from the Data-Storage Array (DSA), not from the logical volume. The DSA is the subset of the LV consisting of the bricks participating in data distribution. Once you remove the meta-data brick from the DSA, that brick will be used only to store meta-data. The operation of adding a brick, applied to the meta-data brick, returns it to the DSA.

Important: Reiser5 doesn't count busy data and meta-data blocks separately. So, in contrast with data bricks (which contain only data), you are not able to find out the real space occupied by data blocks on the meta-data brick - Reiser5 knows only the total space occupied. To check the status of the meta-data brick simply run

 # volume.reiser4 /mnt

and compare the values of "bricks total" and "bricks in DSA".
If they are equal, then the meta-data brick participates in data distribution. Otherwise, "bricks total" should be 1 more than "bricks in DSA" - this indicates that the meta-data brick doesn't participate in data distribution (and therefore doesn't contain data blocks). Note that other cases are impossible: for data bricks, participation in the LV and in the DSA is always equivalent.

= Unmounting a logical volume =

To terminate a mount session, just issue the usual umount command with the mount point specified. Note that after unmounting the volume, all bricks by default remain registered in the system until system shutdown. If you want to unregister a brick before system shutdown, simply issue the following command:

 # volume.reiser4 -u BRICK_NAME

= Deploying a logical volume after correct unmount =

Make sure (by checking your volume configuration) that all bricks of the volume are registered in the system. To register a brick, issue the following command:

 # volume.reiser4 -g BRICK_NAME

The list of all volumes and bricks registered in the system can be found in the output of the following command:

 # volume.reiser4 -l

Issue the usual mount(8) command against one of the bricks of your volume. It is recommended to issue it against the meta-data brick.

NOTE: Reiser5 will refuse to mount a logical volume when a wrong (incomplete or redundant) set of bricks is registered in the system. A redundant set of bricks appears, for example, when you mistakenly register a brick that was earlier removed from the logical volume.

= Deploying a logical volume after correct shutdown =

To mount your LV, first make sure that all its bricks (data and meta-data) are registered in the system.

Important: Reiser5 will refuse to mount a logical volume when a wrong (incomplete or redundant) set of bricks is registered in the system. A redundant set of bricks appears, for example, when you mistakenly register a brick that was removed from the logical volume.
For this reason we strongly recommend that the user keep track of his LV - store its configuration somewhere, but not on this volume! And don't forget to update that configuration after _every_ volume operation. If you lost the configuration of your LV and don't remember it (which is most likely for large volumes), then it will be rather painful to restore: currently there are no tools to manage offline logical volumes, so users are expected to do this on their own. It is not at all difficult.

To register a brick in the system, use the following command:

 # volume.reiser4 -g BRICK_NAME

To print a list of all registered bricks, use

 # volume.reiser4 -l

To mount your LV, simply issue a mount(8) command against one of the bricks of your LV. We recommend issuing it against the meta-data brick.

Comment. Reiser5 always tries to register the brick which is passed to the mount command as an argument, so it is not necessary to preregister the brick you want to issue the mount command against.

= Deploying a logical volume after hard reset or system crash =

If no volume operations were interrupted by the hard reset or system crash, then just follow the instructions in this [https://reiser4.wiki.kernel.org/index.php?title=Logical_Volumes_Administration#Deploying_a_logical_volume_after_correct_shutdown section].

In Reiser5 only a restricted number of bricks participate in every transaction; the maximal number of such bricks can be specified by the user. At mount time a transaction replay procedure will be launched on each such brick independently, in parallel. Depending on the kind of interrupted volume operation, perform one of the following actions:

== Adding a brick was interrupted ==

Check your volume configuration. Register the old set of bricks (that is, the set of bricks that the volume had before applying the operation) and try to mount. In case of error, register also the brick you wanted to add and try to mount again.
Check the status of your LV by running

 # volume.reiser4 /mnt

If the volume is unbalanced, then complete balancing manually by running

 # volume.reiser4 -b /mnt

Check "bricks total" of your LV in the output of

 # volume.reiser4 /mnt

Compare it with the old number of bricks in the configuration. The new value should be the old one incremented by 1. If the number of bricks is the same, then your operation of adding a brick was completely rolled back by the transaction manager, and you need to repeat it from scratch. Otherwise, your operation was successfully completed - update your volume configuration accordingly.

== Brick removal was interrupted ==

Check your volume configuration. Register the old set of bricks (that is, the set of bricks that the volume had before applying the interrupted operation) except the brick you wanted to remove. Try to mount the volume. In case of error, register also the brick you wanted to remove and try to mount again. Check the status of your LV:

 # volume.reiser4 /mnt

If the volume is unbalanced, then complete balancing manually by running

 # volume.reiser4 -b /mnt

Comment. After successful completion of balancing, the brick will be automatically removed from the volume. Make sure of it by checking the status of your LV:

 # volume.reiser4 /mnt

Update your volume configuration accordingly.

== Another volume operation was interrupted ==

Using the volume configuration, register the new set of bricks and try to mount the volume. The mount should be successful.
Check the status of your LV:

 # volume.reiser4 /mnt

If the volume is unbalanced, then complete balancing manually by running

 # volume.reiser4 -b /mnt

= LV monitoring =

Common info about the LV mounted at /mnt:

 # volume.reiser4 /mnt

 ID:             Volume UUID
 volume:         ID of the plugin managing the volume
 distribution:   ID of the distribution plugin
 stripe:         Stripe size in bytes
 segments:       Number of hash space segments (for distribution)
 bricks total:   Total number of bricks in the volume
 bricks in DSA:  Number of bricks participating in data distribution
 balanced:       Balanced status of the volume

Info about any of its bricks of index J:

 # volume.reiser4 -p J /mnt

 internal ID:    Brick's "internal ID" and its status in the volume
 external ID:    Brick's UUID
 device name:    Name of the block device associated with the brick
 block count:    Size of the block device in blocks
 blocks used:    Total number of occupied blocks on the device
 system blocks:  Minimal possible number of busy blocks on that device
 data capacity:  Abstract capacity of the brick
 space usage:    Portion of occupied blocks on the device
 in DSA:         Participation in regular data distribution
 is proxy:       Participation in data tiering (Burst Buffers, etc)

Comment. When retrieving a brick's info, make sure that no volume operations are in progress on that volume; otherwise the command above will return an error (EBUSY).

WARNING. Brick info provided this way is not necessarily the most recent. To get actual info, run sync(1) and make sure that no regular file operations are in progress.

= Checking free space =

To check the number of available free blocks on a volume mounted at /mnt, make sure that no regular file operations or volume operations are in progress on that volume, then run

 # sync
 # df --block-size=4K /mnt

To check the number of free blocks on the brick of index J, run

 # volume.reiser4 -p J /mnt

then calculate the difference between "block count" and "blocks used".

Comment. Not all free blocks on a brick/volume are available for use.
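The free-space arithmetic can be sketched in shell. The sample numbers and variable names are illustrative (a 10G device with 4K blocks); the 95% factor reflects the 5% that Reiser4 reserves so that regular file truncate operations won't fail:

```shell
# Free blocks on a brick = "block count" - "blocks used",
# both taken from the output of `volume.reiser4 -p J /mnt`.
block_count=2621440     # sample: a 10G device with 4K blocks
blocks_used=1657203     # sample value

free_blocks=$(( block_count - blocks_used ))

# Only ~95% of the free blocks are actually available: Reiser4 reserves 5%
# so that regular file truncate operations won't fail.
available=$(( free_blocks * 95 / 100 ))

echo "free: ${free_blocks} blocks, available: ~${available} blocks"
```

This also explains why df(1) (which reports available blocks) and volume.reiser4 (which reports total free blocks) show different numbers.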
The number of available free blocks is always ~95% of the total number of free blocks (Reiser4 reserves 5% to make sure that regular file truncate operations won't fail). NOTE: volume.reiser4 shows the total number of free blocks, whereas df(1) shows the number of available free blocks. The "space usage" statistic shows the portion of busy blocks on an individual brick. For the reasons explained above, "space usage" on any brick can not be more than 0.95.

= Checking quality of data distribution =

Quality of data distribution is a measure of the deviation of the real data space usage from the ideal one defined by the volume partitioning. The smaller the deviation, the better the distribution quality. Checking quality of distribution makes sense only when your volume partitioning is space-based, or coincides with the space-based one. If your partitioning is throughput-based and doesn't coincide with the space-based one, then the quality of the actual data distribution can be rather bad: in that case the file system takes care that low-performance devices don't become a bottleneck, and effective space usage is not a high priority.

Checking quality of data distribution is based on the free-block accounting provided by the file system. Note that the file system doesn't count busy data and meta-data blocks separately, so you are not able to find the real data space usage - and hence to check the quality of distribution - when the meta-data brick contains data blocks.

To check quality of distribution:
* make sure that the meta-data brick doesn't contain data blocks;
* make sure that no regular file operations or volume operations are currently in progress;
* find the "blocks used", "system blocks" and "data capacity" statistics for each data brick:

 # sync
 # volume.reiser4 -p 1 /mnt
 ...
 # volume.reiser4 -p N /mnt

* find the real data space usage on each brick;
* calculate the partitioning and the ideal data space usage on each data brick;
* find the deviation of the real usage from the ideal one.

Example.
Let's build an LV of 3 bricks (one 10G meta-data brick vdb1, and two data bricks: vdc1 (10G) and vdd1 (5G)) with space-based partitioning:

 # VOL_ID=`uuid -v4`
 # echo "Using uuid $VOL_ID"
 # mkfs.reiser4 -U $VOL_ID -y -t 256K /dev/vdb1
 # mkfs.reiser4 -U $VOL_ID -y -a -t 256K /dev/vdc1
 # mkfs.reiser4 -U $VOL_ID -y -a -t 256K /dev/vdd1
 # mount /dev/vdb1 /mnt

Fill the meta-data brick with data:

 # dd if=/dev/zero of=/mnt/myfile bs=256K
 No space left on device...

Add data bricks /dev/vdc1 and /dev/vdd1 to the volume:

 # volume.reiser4 -a /dev/vdc1 /mnt
 # volume.reiser4 -a /dev/vdd1 /mnt

Move all data blocks to the newly added bricks:

 # volume.reiser4 -r /dev/vdb1 /mnt
 # sync

Now the meta-data brick doesn't contain data blocks (only meta-data ones), so we can calculate the quality of data distribution:

 # volume.reiser4 /mnt -p0
 blocks used: 503
 # volume.reiser4 /mnt -p1
 blocks used: 1657203
 system blocks: 115
 data capacity: 2621069
 # volume.reiser4 /mnt -p2
 blocks used: 833001
 system blocks: 73
 data capacity: 1310391

Based on the statistics above, calculate the quality of distribution.

Total data capacity of the volume:
 C = 2621069 + 1310391 = 3931460
Relative capacities of the data bricks:
 C1 = 2621069 / 3931460 = 0.6667
 C2 = 1310391 / 3931460 = 0.3333
Real space usage on the data bricks (blocks used - system blocks):
 R1 = 1657203 - 115 = 1657088
 R2 = 833001 - 73 = 832928
Space usage on the volume:
 R = R1 + R2 = 1657088 + 832928 = 2490016
Ideal data space usage on the data bricks:
 I1 = C1 * R = 0.6667 * 2490016 = 1660094
 I2 = C2 * R = 0.3333 * 2490016 = 829922
Deviation:
 D = (R1, R2) - (I1, I2) = (-3006, 3006)
Relative deviation:
 D/R = (-0.0012, 0.0012)
Quality of distribution:
 Q = 1 - max(|D1|, |D2|)/R = 1 - 0.0012 = 0.9988

Comment. For any specified number of bricks N and quality of distribution Q, it is possible to find a configuration of a logical volume composed of N bricks such that the quality of distribution on that volume is better than Q.
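The calculation above can be scripted. A minimal sketch using awk for the floating-point steps; the per-brick statistics are pasted in by hand (a real script would parse the volume.reiser4 output):

```shell
# Quality of distribution from per-brick statistics (worked example above).
Q=$(awk 'BEGIN {
    cap1 = 2621069; used1 = 1657203; sys1 = 115   # data brick 1
    cap2 = 1310391; used2 = 833001;  sys2 = 73    # data brick 2

    C  = cap1 + cap2                  # total data capacity of the volume
    R1 = used1 - sys1; R2 = used2 - sys2
    R  = R1 + R2                      # data blocks stored in the volume

    I1 = (cap1 / C) * R               # ideal usage per brick
    I2 = (cap2 / C) * R

    D1 = R1 - I1; D2 = R2 - I2        # deviation from the ideal usage
    m  = (D1 < 0 ? -D1 : D1)
    if ((D2 < 0 ? -D2 : D2) > m) m = (D2 < 0 ? -D2 : D2)

    printf "%.4f", 1 - m / R          # quality of distribution
}')
echo "Q = $Q"
```

Because awk keeps full floating-point precision for C1 and C2 (instead of rounding them to four digits), the intermediate deviations differ slightly from the hand calculation, but the resulting Q agrees.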
Comment. The quality of distribution Q doesn't depend on the number of bricks in the logical volume. This is a theorem, which can be strictly proven.

= FAQ =

Q. What happens if I lose a device component of my logical volume (due to a breakdown, etc)?

A. The bodies of some of your regular files will become "punched" in random places. The portion of such files depends on the relative capacity of the lost brick, on the number of bricks in the logical volume, and on other factors. Fsck will be able to detect and remove such files with corrupted bodies. Nevertheless, we recommend mirroring your bricks (e.g. by software or hardware RAID-1) to avoid such highly unpleasant situations.

A logical volume (LV) can be composed of any number of block devices, different in physical and geometric parameters. However, the optimal configuration (true parallelism) imposes some restrictions and dependencies on the sizes of such devices.

WARNING: This stuff is not stable. Don't put important data on logical volumes managed by software of release number 5.X.Y. Also, don't mount your old partitions in kernels with Reiser4 of SFRN 5.X.Y before its stabilization.

IMPORTANT: Currently there are no tools to manage Reiser5 logical volumes off-line, so it is strongly recommended to save/update the configuration of your LV in a file which doesn't belong to that volume.

= Basic definitions. Volume configuration. Brick's capacity. Partitioning. Fair distribution. Balancing =

The basic configuration of a logical volume is the following information:

1) volume UUID;
2) number of bricks in the volume;
3) list of brick names or UUIDs in the volume;
4) UUID or name of the brick to be added/removed (if any). That brick is not counted in (2) and (3).

For each volume, its configuration should be stored somewhere (but not on that volume!)
and properly updated before and after each volume operation performed on that volume. We make the user responsible for this. The volume configuration is needed to facilitate deploying the volume.

'''Abstract capacity''' (or simply capacity) of a brick is a positive integer. Capacity is a brick property defined by the user; don't confuse it with the size of the block device. Think of it as the brick's "weight" in some units. It is the user who decides which property of the brick to assign as its abstract capacity, and in which units. In particular, it can be the size of the block device in kilobytes, or its size in megabytes, or its throughput in MB/sec, or any other geometric or physical parameter of the device associated with the brick. It is important that the capacities of all bricks of the same logical volume are measured in the same units. Also, it would be utterly pointless to assign different properties as abstract capacities for bricks of the same LV - for example, block device size for one brick and disk bandwidth for another.

The capacity of each brick gets initialized by the mkfs utility. By default it is calculated as the number of free blocks on the device at the very end of the formatting procedure; for the meta-data brick it is calculated as 70% of that amount. The capacity of any brick can be changed on-line by the user.

'''Capacity of a logical volume''' is defined as the sum of the capacities of its component bricks.

'''Relative capacity of a brick''' is the ratio of the brick's capacity to the volume's capacity. Relative capacity defines the portion of IO requests that will be issued against that brick. The array of relative capacities (C1, C2, ...) of all bricks is called the volume partitioning. Obviously, C1 + C2 + ... = 1.

'''(Real) data space usage''' on a brick is the number of data blocks stored on that brick.

'''Ideal (or expected) data space usage''' on a brick is T*C, where T is the total number of data blocks stored in the volume and C is the relative capacity of the brick.
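The definitions above reduce to simple arithmetic. A minimal shell sketch with hypothetical capacities (the variable names are illustrative):

```shell
# Volume partitioning: relative capacities of the bricks (they sum to 1).
cap1=2621069; cap2=1310391   # per-brick abstract capacities (sample values)

total=$(( cap1 + cap2 ))     # capacity of the logical volume

C1=$(awk -v c="$cap1" -v t="$total" 'BEGIN { printf "%.4f", c / t }')
C2=$(awk -v c="$cap2" -v t="$total" 'BEGIN { printf "%.4f", c / t }')
echo "partitioning: ($C1, $C2)"

# Ideal data space usage on brick 1, given T data blocks in the volume.
T=2490016
I1=$(awk -v C="$C1" -v T="$T" 'BEGIN { printf "%d", T * C }')
echo "ideal usage on brick 1: $I1 blocks"
```

The relative capacity C1 is also the portion of IO requests issued against brick 1, and the portion of data stripes that rebalancing would move to it if it were newly added.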
It is recommended to compose volumes so that the space-based partitioning coincides with the throughput-based one - that is the optimal volume configuration, which provides true parallelism. If that is impossible for some reason, then choose a preferred partitioning method (space-based or throughput-based). Note that space-based partitioning saves volume space, whereas throughput-based partitioning saves volume throughput.

When performing regular file operations, Reiser5 distributes data stripes throughout the volume evenly and fairly. This means that the portion of IO requests issued against each brick is equal to its relative capacity, that is, to the portion of capacity that the brick adds to the total volume capacity.

Most volume operations are accompanied by rebalancing, which preserves fairness of the distribution. For example, adding a brick to a logical volume changes its partitioning and hence breaks fairness of the distribution, so we need to move some data stripes to the new brick to make the distribution fair again. Likewise, you can not simply remove a brick from a logical volume - all data stripes first have to be moved from that brick to the other bricks of the logical volume.

Every time the user performs a volume operation, Reiser5 marks the LV as "not balanced". After successful balancing the status of the LV is changed back to "balanced". If the balancing procedure fails for some reason, it should be resumed manually (with the volume.reiser4 utility). It is allowed to perform regular file operations on a non-balanced LV. However, in this case:

a) we don't guarantee good quality of data distribution on your LV;
b) you won't be able to perform volume operations on your LV except balancing - any other volume operation will return an error (EBUSY).

So don't forget to bring your LV to the balanced state as soon as possible!

= Prepare Software and Hardware =

Build, install and boot a kernel with Reiser4 of software framework release number 5.X.Y.
Check the status of your LV by running # volume.reiser4 /mnt In the volume is unbalanced, then complete balancing manually by running # volume.reiser4 -b /mnt Check "bricks total" of your LV in the output of # volume.reiser4 /mnt Compare it with the old number of bricks in the configuration. The new value should be an increment of the old one. If the number of bricks is the same, then your operation of adding a brick was completely rolled back by the transaction manager, so that you need to repeat it from scratch. Otherwise, your operation was successfully completed - update your volume configuration respectively. == Brick removal was interrupted == Check your volume configuration. Register the old set of bricks (that is, the set of brick that volume had before applying the interrupted operation) except the brick you wanted to remove. Try to mount the volume. In the case of error register also the brick you wanted to remove and try to mount again. Check the status of your LV: # volume.reiser4 /mnt If the volume is unbalanced then complete balancing manually by running # volume.reiser4 -b /mnt Comment. After sucessful balancing completion the brick will be automatically removed form the volume. Make sure of it by checking status of your LV: # volume.reiser4 /mnt Update your volume configuration respectively. == Another volume operation was interrupted == Using the volume configuration, register the new set of bricks and try to mount the volume. The mount should be successful. 
Check the status of your LV: # volume.reiser4 /mnt If the volume is unbalanced then complete balancing manually by running # volume.reiser4 -b /mnt = LV monitoring = Common info about LV mounted at /mnt # volume.reiser4 /mnt ID: Volume UUID volume: ID of plugin managing the volume distribution: ID of distribution plugin stripe: Stripe size in bytes segments: Number of hash space segments (for distribution) bricks total: Total number of bricks in the volume bricks in DSA: Number of bricks participating in data distribution balanced: Balanced status of the volume Info about any its brick of index J # volume.reiser4 -p J /mnt internal ID: Brick's "internal ID" and its status in the volume external ID: Brick's UUID device name: Name of the block device associated with the brick block count: Size of the block device in blocks blocks used: Total number of occupied blocks on the device system blocks: Minimal possible number of busy blocks on that device data capacity: Abstract capacity of the brick space usage: Portion of occupied blocks on the device Comment. When retrieving brick's info make sure that no volume operations over that volume are in progress. Otherwise the command above will return error (EBUSY). WARNING. Bricks info provided by such way is not necessarily the most recent one. To get an actual info run sync(1) and make sure that no regular file operations are in progress. = Checking free space = To check number of available free blocks on a volume mounted at /mnt, make sure that no regular file operations, as well as volume operations, are in progress on that volume, then run # sync # df --block-size=4K /mnt To check number of free blocks on the brick of index J run # volume.reiser4 -p J /mnt Then calculate the difference between block count and blocks used Comment. Not all free blocks on a brick/volume are available for use. 
Number of available free blocks is always ~95% of total number of free blocks (Reiser4 reserves 5% to make sure that regular file truncate operations won't fail). NOTE: volume.reiser4 shows total number of free blocks, whereas df(1) shows number of available free blocks. "Space usage" statistics shows a portion of busy blocks on individual brick. For the reasons explained above "space usage" on any brick can not be more than 0.95 = Checking quality of data distribution = Quality of data distribution is a measure of deviation of the real data space usage from the ideal one defined by volume partitioning. The smaller the deviation, the better the distribution quality. Checking quality of distribution makes sense only in the case when your volume partitioning is space-based, or if it coincides with the space-based one. If your partitioning is throughput-based, and it doesn't coincide with the space-based one, then quality of actual data distribution can be rather bad, as in this case the file system is worried for low-performance devices to not become a bottleneck, and effective space usage in this case is not a high priority. Checking quality of data distribution is based on the free blocks accounting, provided by the file system. Note that file system doesn't count busy data and meta-data blocks separately, so you are not able to find real data space usage, and hence to check quality of distribution in the case when meta-data brick contains data blocks. To check quality of distribution * make sure that meta-data brick doesn't contain data blocks; * make sure that no regular file and volume operations are currently in progress; * find "blocks used", "system blocks" and "data capacity" statistics for each data brick: # sync # volume.reiser4 -p 1 /mnt ... # volume.reiser4 -p N /mnt * find real data space usage on each brick; * calculate partitioning and ideal data space usage on each data brick; * find deviation of (4) from (5). Example. 
Let' build a LV of 3 bricks (one 10G meta-data brick sdb1, and two data bricks: sdc1 (10G), sdd1(5G)) with space-based partitioning: # VOL_ID=`uuid -v4` # echo "Using uuid $VOL_ID" # mkfs.reiser4 -U $VOL_ID -y -t 256K /dev/vdb1 # mkfs.reiser4 -U $VOL_ID -y -a -t 256K /dev/vdc1 # mkfs.reiser4 -U $VOL_ID -y -a -t 256K /dev/vdd1 # mount /dev/vdb1 /mnt Fill the meta-data brick with data: # dd if=/dev/zero of=/mnt/myfile bs=256K No space left on device... Add data-bricks /dev/sdc1 and dev/sdd1 to the volume: # volume.reiser4 -a /dev/vdc1 /mnt # volume.reiser4 -a /dev/vdd1 /mnt Move all data blocks to the newly added bricks: # volume.reiser4 -r /dev/vdb1 /mnt # sync Now meta-data brick doesn't contain data blocks (only meta-data ones), so that we can calculate quality of data distribution # volume.reiser4 /mnt -p0 blocks used: 503 # volume.reiser4 /mnt -p1 blocks used: 1657203 system blocks: 115 data capacity: 2621069 # volume.reiser4 /mnt -p2 blocks used: 833001 system blocks: 73 data capacity: 1310391 Basing on the statistics above calculate quality of distribution. Total data capacity of the volume: C = 2621069 + 1310391 = 3931460 Relative capacities of data bricks: C1 = 2621069 /(2621069 + 1310391) = 0.6667 C2 = 1310464 /(2621069 + 1310391) = 0.3333 Real space usage on data bricks (blocks used - system blocks): R1 = 1657203 - 115 = 1657088 R2 = 833001 - 73 = 832928 Space usage on the volume: R = R1 + R2 = 1657088 + 832928 = 2490016 Ideal data space usage on data bricks: I1 = C1 * R = 0.6667 * 2490016 = 1660094 I2 = C2 * R = 0.3333 * 2490016 = 829922 Deviation: D = (R1, R2) - (I1, I2) = (3006, -3006) Relative deviation: D/R = (-0.0012, 0.0012) Quality of distribution: Q = 1 - max(|D1|, |D1|) = 1 - 0.0012 = 0.9988 Comment. For any specified number of bricks N and quality of distribution Q it is possible to find a configuration of a logical volume composed of N bricks, so that quality of distribution on that volume will be better than Q. Comment. 
A logical volume (LV) can be composed of any number of block devices with different physical and geometric parameters. However, the optimal configuration (true parallelism) imposes some restrictions and dependencies on the sizes of those devices.

WARNING: This code is not yet stable. Don't put important data on logical volumes managed by software of release number 5.X.Y, and don't mount your old partitions in kernels with Reiser4 of SFRN 5.X.Y before its stabilization.

IMPORTANT: Currently there are no tools to manage Reiser5 logical volumes off-line, so it is strongly recommended to save/update the configuration of your LV in a file which doesn't belong to that volume.

= Basic definitions. Volume configuration. Brick's capacity. Partitioning. Fair distribution. Balancing =

The basic configuration of a logical volume is the following information:

1) Volume UUID;
2) Number of bricks in the volume;
3) List of brick names or UUIDs in the volume;
4) UUID or name of the brick to be added/removed (if any). That brick is not counted in (2) and (3).

For each volume this configuration should be stored somewhere (but not on that volume!)
and properly updated before and after each volume operation performed on that volume. The user is responsible for this. The volume configuration is needed to facilitate deploying the volume.

'''Abstract capacity''' (or simply capacity) of a brick is a positive integer. Capacity is a property of the brick defined by the user; don't confuse it with the size of the block device. Think of it as the brick's "weight" in some units. It is the user who decides which property of the brick to assign as its abstract capacity, and in which units. In particular, it can be the size of the block device in kilobytes or megabytes, its throughput in MB/s, or any other geometric or physical parameter of the device associated with the brick. It is important that the capacities of all bricks of the same logical volume are measured in the same units. It would also be pointless to assign different properties as abstract capacities to bricks of the same LV - for example, device size for one brick and disk bandwidth for another.

The capacity of each brick is initialized by the mkfs utility. By default it is calculated as the number of free blocks on the device at the very end of the formatting procedure; for the meta-data brick it is 70% of that amount. The capacity of any brick can be changed on-line by the user.

'''Capacity of a logical volume''' is the sum of the capacities of its component bricks.

'''Relative capacity of a brick''' is the ratio of the brick's capacity to the volume's capacity. Relative capacity defines the portion of IO requests that will be issued against that brick. The array of relative capacities (C1, C2, ...) of all bricks is called the volume partitioning. Obviously, C1 + C2 + ... = 1.

'''(Real) data space usage''' on a brick is the number of data blocks stored on that brick.

'''Ideal (or expected) data space usage''' on a brick is T*C, where T is the total number of data blocks stored in the volume and C is the relative capacity of the brick.
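As a small worked example of these definitions (the capacities below are hypothetical, chosen only for illustration), the partitioning and ideal data space usage can be computed like this:

```shell
# Hypothetical capacities of two bricks, in the same units (e.g. device size in KB).
CAP1=300000
CAP2=100000

# Capacity of the logical volume is the sum of the brick capacities.
CAP_VOL=$((CAP1 + CAP2))

# Relative capacity of each brick: its share of the volume capacity,
# and hence the portion of IO requests issued against it.
C1=$(awk "BEGIN { printf \"%.2f\", $CAP1 / $CAP_VOL }")
C2=$(awk "BEGIN { printf \"%.2f\", $CAP2 / $CAP_VOL }")
echo "partitioning: ($C1, $C2)"          # C1 + C2 = 1

# Ideal data space usage on each brick is T * C, where T is the total
# number of data blocks stored on the volume.
T=1000000
I1=$(awk "BEGIN { printf \"%.0f\", $T * $C1 }")
I2=$(awk "BEGIN { printf \"%.0f\", $T * $C2 }")
echo "ideal usage: $I1 + $I2 = $T blocks"
```

The same fractions also give the cost of rebalancing: the portion of stripes moved when a brick is added or removed equals that brick's relative capacity.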
It is recommended to compose volumes so that the space-based partitioning coincides with the throughput-based one - this is the optimal volume configuration, which provides true parallelism. If that is impossible for some reason, choose a preferred partitioning method (space-based or throughput-based). Note that space-based partitioning saves volume space, whereas throughput-based partitioning saves volume throughput.

When performing regular file operations, Reiser5 distributes data stripes throughout the volume evenly and fairly. This means that the portion of IO requests issued against each brick is equal to its relative capacity, that is, to the portion of capacity that the brick contributes to the total volume capacity.

Most volume operations are accompanied by rebalancing, which maintains fairness of distribution. For example, adding a brick to a logical volume changes its partitioning and hence breaks fairness of the distribution, so some data stripes have to be moved to the new brick to make the distribution fair again. Likewise, you cannot simply remove a brick from a logical volume - all data stripes first have to be moved from that brick to the other bricks of the logical volume.

Every time the user performs a volume operation, Reiser5 marks the LV as "not balanced". After successful balancing the status of the LV is changed back to "balanced". If the balancing procedure fails for some reason, it should be resumed manually (with the volume.reiser4 utility). It is allowed to perform regular file operations on an unbalanced LV. However, in this case:

a) we don't guarantee a good quality of data distribution on your LV;
b) you won't be able to perform volume operations on your LV except balancing - any other volume operation will return an error (EBUSY).

So don't forget to bring your LV back to the balanced state as soon as possible!

= Prepare Software and Hardware =

Build, install and boot a kernel with Reiser4 of software framework release number 5.X.Y.
Kernel patches can be found [https://sourceforge.net/projects/reiser4/files/v5-unstable/ here]. Note that the Linux kernel and GNU utilities still recognize the testing code as "Reiser4". Make sure the following message appears in the kernel logs:

 Loading Reiser4 (Software Framework Release: 5.X.Y)

Build and install the latest [https://sourceforge.net/projects/reiser4/files/reiser4-utils/libaal/ libaal]. Download, build and install the latest version 2.A.B of the [https://sourceforge.net/projects/reiser4/files/v5-unstable/ reiser4progs package]. Make sure that the utility for managing logical volumes is installed (as part of reiser4progs) on your machine:

 # volume.reiser4 -?

= Creating a logical volume =

Start by choosing a unique ID (UUID) for your volume. By default it is generated by the mkfs utility; however, you can generate it yourself with a suitable tool (e.g. uuid(1)) and store it in an environment variable for convenience:

 # VOL_ID=`uuid -v4`
 # echo "Using uuid $VOL_ID"

Choose a stripe size for your logical volume. For a good quality of distribution it is recommended that the stripe not exceed 1/10000 of the volume size. On the other hand, too small a stripe will increase space consumption on your meta-data brick. In our example we choose a stripe size of 256K.

Start by creating the first brick of your volume - the meta-data brick - passing the volume ID and stripe size to the mkfs.reiser4 utility:

 # mkfs.reiser4 -U $VOL_ID -t 256K /dev/vdb1

Currently only one meta-data brick per volume is supported, so it is recommended that the block device for the meta-data brick not be too small. In most cases it is enough if your meta-data brick is not smaller than 1/200 of the maximal volume size. For example, a 100G meta-data brick will be able to service a ~20T logical volume.
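The two rules of thumb above (stripe size at most 1/10000 of the volume, meta-data brick at least 1/200 of the maximal volume size) can be turned into a quick sizing check; the sizes below are illustrative, not from the source:

```shell
# Rule of thumb: the meta-data brick should be at least 1/200 of the
# maximal volume size. Sizes here are in gigabytes (illustrative).
MAX_VOLUME_G=20480                       # planned maximal volume size, ~20T
MIN_METADATA_G=$((MAX_VOLUME_G / 200))
echo "meta-data brick: at least ${MIN_METADATA_G}G for a ${MAX_VOLUME_G}G volume"

# Rule of thumb: the stripe should not exceed 1/10000 of the volume size.
# For a 256K stripe, the smallest volume that keeps this ratio is:
STRIPE_K=256
MIN_VOLUME_K=$((STRIPE_K * 10000))       # in kilobytes (~2.5G)
echo "256K stripes suit volumes of roughly ${MIN_VOLUME_K}K or larger"
```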
Mount your logical volume consisting of one meta-data brick:

 # mount /dev/vdb1 /mnt

Find the record about your volume in the output of the following command:

 # volume.reiser4 -l

Create the configuration of your logical volume (its definition is above) and store it somewhere - but not on that volume! Your logical volume is now on-line and ready to use. You can perform regular file operations and volume operations (e.g. add a data brick to your LV).

= Adding a data brick to LV =

At any time you can add a data brick to your LV. You can do it in parallel with regular file operations executing on this volume. Make sure, however, that no other volume operation (e.g. removing a brick) is in progress on your volume, otherwise your operation will fail with EBUSY. Obviously, adding a brick will increase the capacity of your volume.

Choose a block device for the new data brick. Make sure that it is not too large or too small: the capacities of any two bricks of the same logical volume cannot differ by more than 2^19 (~500,000) times. E.g. your logical volume cannot contain both a 1M and a 2T brick. Any attempt to add a brick of improper capacity will fail with an error.

Format it the same way as the meta-data brick, but also specify the "-a" option (to let mkfs know that it is a data brick):

 # mkfs.reiser4 -U $VOL_ID -t 256K -a /dev/vdb2

Important: make sure you specify the same volume ID and stripe size as the other bricks of the logical volume have; otherwise the operation of adding the data brick will fail.

Update the configuration of your volume with the UUID or name of the brick you want to add (item #4). To add the brick, simply pass its name as an argument to the option "-a" and specify your LV via its mount point:

 # volume.reiser4 -a /dev/vdb2 /mnt

The procedure of adding a brick automatically invokes rebalancing, which moves a portion of data stripes to the newly added brick (so that the resulting distribution is fair).
The portion of data blocks moved during such rebalancing is equal to the relative capacity of the new brick, that is, to the portion of capacity that the new brick adds to the updated LV's capacity. This important property defines the cost of the balancing procedure: if the portion of capacity added by a brick is small, then the number of stripes moved during balancing is also small.

Like other user-space utilities, the operation of adding a brick can return an error, even assuming that the brick you wanted to add is properly formatted. In this case check the status of your LV:

 # volume.reiser4 /mnt

If the volume is unbalanced, simply complete balancing manually:

 # volume.reiser4 -b /mnt

Otherwise, check the number of bricks in your LV. Most likely it is the same as before the failed operation; in that case simply repeat the operation of adding a brick from scratch. Upon successful completion update your volume configuration: increment (#2), add info about the new brick to (#3) and remove the record at (#4).

= Removing a data brick from LV =

At any time you can remove a data brick from your LV. You can do it in parallel with regular file operations executing on this volume. Make sure, however, that no other volume operation (e.g. adding a brick) is in progress on your volume, otherwise your operation will fail with EBUSY. Obviously, removing a brick will decrease the abstract capacity of your LV. Note that the other bricks must have enough space to store all data blocks of the brick you want to remove; otherwise the removal operation will return an error (ENOSPC).

Suppose you want to remove brick /dev/vdb2 from your LV mounted at /mnt. Update your volume configuration with the UUID and name of the brick you want to remove (item #4).
To remove the brick, simply pass its name as an argument to option "-r" and specify the logical volume by its mount point:

 # volume.reiser4 -r /dev/vdb2 /mnt

The procedure of brick removal automatically invokes rebalancing, which distributes the data of the brick being removed among the other bricks, so that the resulting distribution is also fair. The portion of data stripes moved during such rebalancing is equal to the relative capacity of the brick being removed (that is, to the portion of capacity that the brick contributed to the LV's capacity).

It can happen that the command above completes with an error (like other user-space applications). In this case check the status of your LV:

 # volume.reiser4 /mnt

If the volume is not balanced, simply complete balancing manually:

 # volume.reiser4 -b /mnt

Otherwise, check the number of bricks in your logical volume - it should be the same as before the failed operation. The error -ENOSPC indicates that there is not enough free space on the other bricks to fit all the data of the brick you want to remove.

On success update your volume configuration: remove the information about the brick /dev/vdb2 from (#3) and (#4). Check your kernel logs: they should contain a message that brick /dev/vdb2 has been unregistered. Now the device /dev/vdb2 doesn't belong to the logical volume any more, and you can reuse it for other purposes (re-format, etc).

= Changing brick's capacity =

At any time (assuming no other volume operation is in progress) you can change the abstract capacity of any brick to a new value different from 0. Changing capacity always changes the volume partitioning and therefore breaks fairness of distribution, so Reiser5 automatically launches rebalancing to make sure that the resulting distribution is fair for the new set of capacities. In particular, increasing a brick's capacity will move some data from other bricks to the brick whose capacity was increased.
Decreasing a brick's capacity will move some data from the brick whose capacity was decreased to other bricks. To change the abstract capacity of a brick /dev/vdb1 to a new value (e.g. 200000), simply run

 # volume.reiser4 -z /dev/vdb1 -c 200000 /mnt

Pronounced as "resize brick /dev/vdb1 to new capacity 200000 in the volume mounted at /mnt".

The operation of changing capacity can return an error. Most likely it is -ENOSPC, which is a side effect of concurrent regular file writes. In this case check the status of your LV. If it is unbalanced, consider removing some files from your LV and complete balancing by running

 # volume.reiser4 -b /mnt

Otherwise, repeat the operation from scratch.

Comment. Changing a brick's capacity to 0 is undefined and will return an error. Consider the brick removal operation instead.

= Operations with meta-data brick =

The meta-data brick can also contain data stripes and participate in data distribution like other data bricks, so all the volume operations described above are also applicable to the meta-data brick. Note, however, that it is impossible to completely remove the meta-data brick from the logical volume for obvious reasons (meta-data need to be stored somewhere), so the brick removal operation applied to the meta-data brick actually removes it from the Data Storage Array (DSA), not from the logical volume. The DSA is the subset of the LV consisting of the bricks participating in data distribution. Once you remove the meta-data brick from the DSA, that brick will be used only to store meta-data. The operation of adding a brick, applied to the meta-data brick, returns it back to the DSA.

Important: Reiser5 doesn't count busy data and meta-data blocks separately. So, in contrast with data bricks (which contain only data), you are not able to find out the real space occupied by data blocks on the meta-data brick - Reiser5 knows only the total space occupied.

To check the status of the meta-data brick, simply run

 # volume.reiser4 /mnt

and compare the values of "bricks total" and "bricks in DSA".
If they are equal, the meta-data brick participates in data distribution. Otherwise, "bricks total" should be 1 more than "bricks in DSA", which indicates that the meta-data brick doesn't participate in data distribution (and therefore doesn't contain data blocks). Other cases are impossible: for data bricks, participation in the LV and in the DSA is always equivalent.

= Unmounting a logical volume =

To terminate a mount session, just issue the usual umount command with the mount point specified. Note that after unmounting the volume all bricks by default remain registered in the system until system shutdown. If you want to unregister a brick before system shutdown, simply issue the following command:

 # volume.reiser4 -u BRICK_NAME

= Deploying a logical volume after correct unmount =

Make sure (by checking your volume configuration) that all bricks of the volume are registered in the system. To register a brick, issue the following command:

 # volume.reiser4 -g BRICK_NAME

The list of all volumes and bricks registered in the system can be found in the output of the following command:

 # volume.reiser4 -l

Issue the usual mount(8) command against one of the bricks of your volume. It is recommended to issue it against the meta-data brick.

NOTE: Reiser5 will refuse to mount a logical volume when a wrong (incomplete or redundant) set of bricks is registered in the system. A redundant set of bricks appears, for example, when you mistakenly register a brick that was earlier removed from the logical volume.

= Deploying a logical volume after correct shutdown =

To mount your LV, first make sure that all its bricks (data and meta-data) are registered in the system.

Important: Reiser5 will refuse to mount a logical volume when a wrong (incomplete or redundant) set of bricks is registered in the system. A redundant set of bricks appears, for example, when you mistakenly register a brick that was removed from the logical volume.
For these reasons we strongly recommend keeping track of your LV - store its configuration somewhere, but not on that volume! And don't forget to update that configuration after _every_ volume operation. If you lose the configuration of your LV and don't remember it (which is most likely for large volumes), it will be rather painful to restore: currently there are no tools to manage offline logical volumes, so users have to keep the configuration themselves. It is not at all difficult.

To register a brick in the system, use the following command:

 # volume.reiser4 -g BRICK_NAME

To print a list of all registered bricks, use

 # volume.reiser4 -l

To mount your LV, simply issue a mount(8) command against one of the bricks of your LV. We recommend issuing it against the meta-data brick.

Comment. Reiser5 always tries to register the brick which is passed to the mount command as an argument, so it is not necessary to preregister the brick against which you want to issue the mount command.

= Deploying a logical volume after hard reset or system crash =

If no volume operations were interrupted by the hard reset or system crash, just follow the instructions in the section "Deploying a logical volume after correct shutdown" above.

In Reiser5 only a restricted number of bricks participate in every transaction. The maximal number of such bricks can be specified by the user. At mount time a transaction replay procedure will be launched on each such brick independently, in parallel. Depending on the kind of interrupted volume operation, perform one of the following actions:

== Adding a brick was interrupted ==

Check your volume configuration. Register the old set of bricks (that is, the set of bricks that the volume had before applying the operation) and try to mount. In case of error, register also the brick you wanted to add and try to mount again.
Check the status of your LV by running

 # volume.reiser4 /mnt

If the volume is unbalanced, complete balancing manually by running

 # volume.reiser4 -b /mnt

Check "bricks total" of your LV in the output of

 # volume.reiser4 /mnt

Compare it with the old number of bricks in the configuration. The new value should be the old one plus 1. If the number of bricks is the same, then your operation of adding a brick was completely rolled back by the transaction manager, and you need to repeat it from scratch. Otherwise, your operation was successfully completed - update your volume configuration accordingly.

== Brick removal was interrupted ==

Check your volume configuration. Register the old set of bricks (that is, the set of bricks that the volume had before applying the interrupted operation) except the brick you wanted to remove. Try to mount the volume. In case of error, register also the brick you wanted to remove and try to mount again. Check the status of your LV:

 # volume.reiser4 /mnt

If the volume is unbalanced, complete balancing manually by running

 # volume.reiser4 -b /mnt

Comment. After successful completion of balancing the brick will be automatically removed from the volume. Make sure of it by checking the status of your LV:

 # volume.reiser4 /mnt

Update your volume configuration accordingly.

== Another volume operation was interrupted ==

Using the volume configuration, register the new set of bricks and try to mount the volume. The mount should be successful.
Check the status of your LV:

 # volume.reiser4 /mnt

If the volume is unbalanced, complete balancing manually by running

 # volume.reiser4 -b /mnt

= LV monitoring =

Common info about the LV mounted at /mnt:

 # volume.reiser4 /mnt

 ID:             Volume UUID
 volume:         ID of the plugin managing the volume
 distribution:   ID of the distribution plugin
 stripe:         Stripe size in bytes
 segments:       Number of hash space segments (for distribution)
 bricks total:   Total number of bricks in the volume
 bricks in DSA:  Number of bricks participating in data distribution
 balanced:       Balance status of the volume

Info about any of its bricks, of index J:

 # volume.reiser4 -p J /mnt

 internal ID:    Brick's "internal ID" and its status in the volume
 external ID:    Brick's UUID
 device name:    Name of the block device associated with the brick
 block count:    Size of the block device in blocks
 blocks used:    Total number of occupied blocks on the device
 system blocks:  Minimal possible number of busy blocks on that device
 data capacity:  Abstract capacity of the brick
 space usage:    Portion of occupied blocks on the device

Comment. When retrieving a brick's info, make sure that no volume operations on that volume are in progress; otherwise the command above will return an error (EBUSY).

WARNING. Brick info provided this way is not necessarily the most recent. To get up-to-date info, run sync(1) and make sure that no regular file operations are in progress.

= Checking free space =

To check the number of available free blocks on a volume mounted at /mnt, make sure that no regular file operations or volume operations are in progress on that volume, then run

 # sync
 # df --block-size=4K /mnt

To check the number of free blocks on the brick of index J, run

 # volume.reiser4 -p J /mnt

and calculate the difference between "block count" and "blocks used".

Comment. Not all free blocks on a brick/volume are available for use.
The number of available free blocks is always ~95% of the total number of free blocks (Reiser4 reserves 5% to make sure that regular file truncate operations won't fail).

NOTE: volume.reiser4 shows the total number of free blocks, whereas df(1) shows the number of available free blocks.

The "space usage" statistic shows the portion of busy blocks on an individual brick. For the reasons explained above, "space usage" on any brick can not be more than 0.95.

= Checking quality of data distribution =

Quality of data distribution is a measure of the deviation of the real data space usage from the ideal one defined by volume partitioning. The smaller the deviation, the better the distribution quality.

Checking quality of distribution makes sense only when your volume partitioning is space-based, or when it coincides with the space-based one. If your partitioning is throughput-based and doesn't coincide with the space-based one, then the quality of the actual data distribution can be rather bad: in this case the file system tries to prevent low-performance devices from becoming a bottleneck, and effective space usage is not a high priority.

Checking quality of data distribution is based on the free-blocks accounting provided by the file system. Note that the file system doesn't count busy data and meta-data blocks separately, so you are not able to find the real data space usage, and hence to check quality of distribution, in the case when the meta-data brick contains data blocks.

To check quality of distribution:

# make sure that the meta-data brick doesn't contain data blocks;
# make sure that no regular file and volume operations are currently in progress;
# find the "blocks used", "system blocks" and "data capacity" statistics for each data brick:
#: # sync
#: # volume.reiser4 -p 1 /mnt
#: ...
#: # volume.reiser4 -p N /mnt
# find the real data space usage on each brick;
# calculate the partitioning and the ideal data space usage on each data brick;
# find the deviation of (4) from (5).

Example.
Let's build an LV of 3 bricks (one 10G meta-data brick vdb1, and two data bricks: vdc1 (10G) and vdd1 (5G)) with space-based partitioning:

 # VOL_ID=`uuid -v4`
 # echo "Using uuid $VOL_ID"
 # mkfs.reiser4 -U $VOL_ID -y -t 256K /dev/vdb1
 # mkfs.reiser4 -U $VOL_ID -y -a -t 256K /dev/vdc1
 # mkfs.reiser4 -U $VOL_ID -y -a -t 256K /dev/vdd1
 # mount /dev/vdb1 /mnt

Fill the meta-data brick with data:

 # dd if=/dev/zero of=/mnt/myfile bs=256K
 No space left on device...

Add data bricks /dev/vdc1 and /dev/vdd1 to the volume:

 # volume.reiser4 -a /dev/vdc1 /mnt
 # volume.reiser4 -a /dev/vdd1 /mnt

Move all data blocks to the newly added bricks:

 # volume.reiser4 -r /dev/vdb1 /mnt
 # sync

Now the meta-data brick doesn't contain data blocks (only meta-data ones), so we can calculate the quality of data distribution:

 # volume.reiser4 /mnt -p0
 blocks used: 503
 # volume.reiser4 /mnt -p1
 blocks used: 1657203
 system blocks: 115
 data capacity: 2621069
 # volume.reiser4 /mnt -p2
 blocks used: 833001
 system blocks: 73
 data capacity: 1310391

Based on the statistics above, calculate the quality of distribution.

Total data capacity of the volume:

 C = 2621069 + 1310391 = 3931460

Relative capacities of data bricks:

 C1 = 2621069 / (2621069 + 1310391) = 0.6667
 C2 = 1310391 / (2621069 + 1310391) = 0.3333

Real space usage on data bricks (blocks used - system blocks):

 R1 = 1657203 - 115 = 1657088
 R2 = 833001 - 73 = 832928

Space usage on the volume:

 R = R1 + R2 = 1657088 + 832928 = 2490016

Ideal data space usage on data bricks:

 I1 = C1 * R = 0.6667 * 2490016 = 1660094
 I2 = C2 * R = 0.3333 * 2490016 = 829922

Deviation:

 D = (R1, R2) - (I1, I2) = (-3006, 3006)

Relative deviation:

 D/R = (-0.0012, 0.0012)

Quality of distribution:

 Q = 1 - max(|D1|, |D2|)/R = 1 - 0.0012 = 0.9988

Comment. For any specified number of bricks N and quality of distribution Q, it is possible to find a configuration of a logical volume composed of N bricks such that the quality of distribution on that volume will be better than Q.
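The arithmetic of the example above can be reproduced with a short script. This is a sketch only: the per-brick statistics are plugged in by hand rather than read from volume.reiser4.

```python
# Sketch: compute quality of data distribution from per-brick statistics,
# using the numbers from the example above (data bricks 1 and 2; the
# meta-data brick holds no data blocks and is excluded).

bricks = [
    # (blocks used, system blocks, data capacity)
    (1657203, 115, 2621069),  # brick 1
    (833001,   73, 1310391),  # brick 2
]

total_capacity = sum(cap for _, _, cap in bricks)
# Relative capacities C_i define the volume partitioning:
rel_caps = [cap / total_capacity for _, _, cap in bricks]
# Real data space usage R_i = blocks used - system blocks:
real = [used - system for used, system, _ in bricks]
total_real = sum(real)
# Ideal usage I_i = C_i * R, deviation D_i = R_i - I_i:
ideal = [c * total_real for c in rel_caps]
deviation = [r - i for r, i in zip(real, ideal)]
# Quality: 1 minus the largest relative deviation:
quality = 1 - max(abs(d) for d in deviation) / total_real
print(round(quality, 4))
```

Because the script keeps the exact ratios instead of the hand-rounded 0.6667/0.3333, the deviation comes out near ±2983 rather than ±3006, but the quality still rounds to 0.9988, matching the calculation above.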
Comment. Quality of distribution Q doesn't depend on the number of bricks in the logical volume. This is a theorem, which can be strictly proven.

= FAQ =

Q. What happens if I lose a device-component (due to a breakdown, etc.) of my logical volume?

A. The bodies of some of your regular files will become "punched" in random places. The portion of such files depends on the relative capacity of the lost brick, on the number of bricks in the logical volume, and on other factors. Fsck will be able to detect and remove files with corrupted bodies. Nevertheless, we recommend mirroring your bricks (e.g. by software or hardware RAID-1) to avoid such highly unpleasant situations.
Check the status of your LV by running # volume.reiser4 /mnt In the volume is unbalanced, then complete balancing manually by running # volume.reiser4 -b /mnt Check "bricks total" of your LV in the output of # volume.reiser4 /mnt Compare it with the old number of bricks in the configuration. The new value should be an increment of the old one. If the number of bricks is the same, then your operation of adding a brick was completely rolled back by the transaction manager, so that you need to repeat it from scratch. Otherwise, your operation was successfully completed - update your volume configuration respectively. == Brick removal was interrupted == Check your volume configuration. Register the old set of bricks (that is, the set of brick that volume had before applying the interrupted operation) except the brick you wanted to remove. Try to mount the volume. In the case of error register also the brick you wanted to remove and try to mount again. Check the status of your LV: # volume.reiser4 /mnt If the volume is unbalanced then complete balancing manually by running # volume.reiser4 -b /mnt Comment. After sucessful balancing completion the brick will be automatically removed form the volume. Make sure of it by checking status of your LV: # volume.reiser4 /mnt Update your volume configuration respectively. == Another volume operation was interrupted == Using the volume configuration, register the new set of bricks and try to mount the volume. The mount should be successful. 
Check the status of your LV: # volume.reiser4 /mnt If the volume is unbalanced then complete balancing manually by running # volume.reiser4 -b /mnt = LV monitoring = Common info about LV mounted at /mnt # volume.reiser4 /mnt ID: Volume UUID volume: ID of plugin managing the volume distribution: ID of distribution plugin stripe: Stripe size in bytes segments: Number of hash space segments (for distribution) bricks total: Total number of bricks in the volume bricks in DSA: Number of bricks participating in data distribution balanced: Balanced status of the volume Info about any its brick of index J # volume.reiser4 -p J /mnt internal ID: Brick's "internal ID" and its status in the volume external ID: Brick's UUID device name: Name of the block device associated with the brick block count: Size of the block device in blocks blocks used: Total number of occupied blocks on the device system blocks: Minimal possible number of busy blocks on that device data capacity: Abstract capacity of the brick space usage: Portion of occupied blocks on the device Comment. When retrieving brick's info make sure that no volume operations over that volume are in progress. Otherwise the command above will return error (EBUSY). WARNING. Bricks info provided by such way is not necessarily the most recent one. To get an actual info run sync(1) and make sure that no regular file operations are in progress. = Checking free space = To check number of available free blocks on a volume mounted at /mnt, make sure that no regular file operations, as well as volume operations, are in progress on that volume, then run # sync # df --block-size=4K /mnt To check number of free blocks on the brick of index J run # volume.reiser4 -p J /mnt Then calculate the difference between block count and blocks used Comment. Not all free blocks on a brick/volume are available for use. 
Number of available free blocks is always ~95% of total number of free blocks (Reiser4 reserves 5% to make sure that regular file truncate operations won't fail). NOTE: volume.reiser4 shows total number of free blocks, whereas df(1) shows number of available free blocks. "Space usage" statistics shows a portion of busy blocks on individual brick. For the reasons explained above "space usage" on any brick can not be more than 0.95 = Checking quality of data distribution = Quality of data distribution is a measure of deviation of the real data space usage from the ideal one defined by volume partitioning. The smaller the deviation, the better the distribution quality. Checking quality of distribution makes sense only in the case when your volume partitioning is space-based, or if it coincides with the space-based one. If your partitioning is throughput-based, and it doesn't coincide with the space-based one, then quality of actual data distribution can be rather bad, as in this case the file system is worried for low-performance devices to not become a bottleneck, and effective space usage in this case is not a high priority. Checking quality of data distribution is based on the free blocks accounting, provided by the file system. Note that file system doesn't count busy data and meta-data blocks separately, so you are not able to find real data space usage, and hence to check quality of distribution in the case when meta-data brick contains data blocks. To check quality of distribution * make sure that meta-data brick doesn't contain data blocks; * make sure that no regular file and volume operations are currently in progress; * find "blocks used", "system blocks" and "data capacity" statistics for each data brick: # sync # volume.reiser4 -p 1 /mnt ... # volume.reiser4 -p N /mnt * find real data space usage on each brick; * calculate partitioning and ideal data space usage on each data brick; * find deviation of (4) from (5). Example. 
Let' build a LV of 3 bricks (one 10G meta-data brick sdb1, and two data bricks: sdc1 (10G), sdd1(5G)) with space-based partitioning: # VOL_ID=`uuid -v4` # echo "Using uuid $VOL_ID" # mkfs.reiser4 -U $VOL_ID -y -t 256K /dev/vdb1 # mkfs.reiser4 -U $VOL_ID -y -a -t 256K /dev/vdc1 # mkfs.reiser4 -U $VOL_ID -y -a -t 256K /dev/vdd1 # mount /dev/vdb1 /mnt Fill the meta-data brick with data: # dd if=/dev/zero of=/mnt/myfile bs=256K No space left on device... Add data-bricks /dev/sdc1 and dev/sdd1 to the volume: # volume.reiser4 -a /dev/vdc1 /mnt # volume.reiser4 -a /dev/vdd1 /mnt Move all data blocks to the newly added bricks: # volume.reiser4 -r /dev/vdb1 /mnt # sync Now meta-data brick doesn't contain data blocks (only meta-data ones), so that we can calculate quality of data distribution # volume.reiser4 /mnt -p0 blocks used: 503 # volume.reiser4 /mnt -p1 blocks used: 1657203 system blocks: 115 data capacity: 2621069 # volume.reiser4 /mnt -p2 blocks used: 833001 system blocks: 73 data capacity: 1310391 Basing on the statistics above calculate quality of distribution. Total data capacity of the volume: C = 2621069 + 1310391 = 3931460 Relative capacities of data bricks: C1 = 2621069 /(2621069 + 1310391) = 0.6667 C2 = 1310464 /(2621069 + 1310391) = 0.3333 Real space usage on data bricks (blocks used - system blocks): R1 = 1657203 - 115 = 1657088 R2 = 833001 - 73 = 832928 Space usage on the volume: R = R1 + R2 = 1657088 + 832928 = 2490016 Ideal data space usage on data bricks: I1 = C1 * R = 0.6667 * 2490016 = 1660094 I2 = C2 * R = 0.3333 * 2490016 = 829922 Deviation: D = (R1, R2) - (I1, I2) = (3006, -3006) Relative deviation: D/R = (-0.0012, 0.0012) Quality of distribution: Q = 1 - max(|D1|, |D1|) = 1 - 0.0012 = 0.9988 Comment. For any specified number of bricks N and quality of distribution Q it is possible to find a configuration of a logical volume composed of N bricks, so that quality of distribution on that volume will be better than Q. Comment. 
Quality of distribution Q doesn't depend on the number of bricks in the logical volume. This is a theorem, which can be strictly proven. = FAQ = Q. What happens if I lose a device-component (due to a breakdown, etc) of my logical volume? A. Bodies of some your regular files will become "punched" in random places. Portion of such files depends on the relative capacity of the lost brick, on the number of bricks in the logical volume, and on other factors. Fsck will be able to detect and remove such files with corrupted bodies. Nevertheless, we recommend to consider mirroring your bricks (e.g. by software, or hardware RAID-1) to avoid such highly unpleasant situations. 988ef5da4a1c0a6796eb823ce0b1330ed38a274d 4347 4346 2020-01-05T14:17:36Z Edward 4 /* Deploying a logical volume after correct unmount */ Logical volume (LV) can be composed of any number of block devices, different in physical and geometric parameters. However the optimal configuration (true parallelism) imposes some restrictions and dependencies on the size of such devices. WARNING: The stuff is not stable. Don't put important data to logical volumes managed by software of release number 5.X.Y. Also don't mount your old partitions in kernels with Reiser4 of SFRN 5.X.Y before its stabilization IMPORTANT: Currently there is no tools to manage Reiser5 logical volumes off-line, so it it strongly recommended to save/update configurations of your LV in a file, which doesn't belong to that volume. = Basic definitions. Volume configuration. Brick's capacity. Partitioning. Fair distribution. Balancing = Basic configuration of a logical volume is the following information: 1) Volume UUID; 2) Number of bricks in the volume; 3) List of brick names or UUIDs in the volume; 4) UUID or name of the brick to be added/removed (if any). That brick is not counted in (2) and (3). For each volume its configuration should be stored somewhere (but not on that volume!) 
and properly updated before and after each volume operation performed on that volume. The user is responsible for this. The volume configuration is needed to facilitate deploying the volume.

'''Abstract capacity''' (or simply capacity) of a brick is a positive integer. Capacity is a property of a brick defined by the user; do not confuse it with the size of the block device. Think of it as the brick's "weight" in some units. It is the user who decides which property of the brick to assign as its abstract capacity, and in which units. In particular, it can be the size of the block device in kilobytes, its size in megabytes, its throughput in MB/sec, or any other geometric or physical parameter of the device associated with the brick. It is important that the capacities of all bricks of the same logical volume are measured in the same units. It would also be utterly pointless to assign different properties as abstract capacities for bricks of the same LV - for example, block device size for one brick and disk bandwidth for another.

The capacity of each brick is initialized by the mkfs utility. By default it is calculated as the number of free blocks on the device at the very end of the formatting procedure. For the meta-data brick it is calculated as 70% of that amount. The capacity of any brick can be changed on-line by the user.

'''Capacity of a logical volume''' is defined as the sum of the capacities of its component bricks.

'''Relative capacity of a brick''' is the ratio of the brick's capacity to the volume's capacity. Relative capacity defines the portion of IO requests that will be issued against that brick. The array of relative capacities (C1, C2, ...) of all bricks is called the volume partitioning. Obviously, C1 + C2 + ... = 1.

'''(Real) data space usage''' on a brick is the number of data blocks stored on that brick.

'''Ideal (or expected) data space usage''' on a brick is T*C, where T is the total number of data blocks stored in the volume and C is the relative capacity of the brick.
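As an illustration, the partitioning arithmetic above can be checked with a small shell helper. This is a sketch only: the function name relcap is ours (not part of reiser4progs), and the capacity numbers are hypothetical.

```shell
# relcap: print the relative capacity of each brick, given the list of
# brick capacities (all measured in the same units).
# Illustrative helper only, not part of reiser4progs.
relcap() {
    echo "$@" | awk '{
        vol = 0
        for (i = 1; i <= NF; i++) vol += $i    # capacity of the whole volume
        for (i = 1; i <= NF; i++) printf "%.4f\n", $i / vol
    }'
}

# Two data bricks with hypothetical capacities 2621069 and 1310391:
relcap 2621069 1310391    # prints 0.6667 and 0.3333
```

Note that the printed values necessarily sum to 1 (up to rounding), matching the definition of volume partitioning.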
It is recommended to compose volumes so that the space-based partitioning coincides with the throughput-based one - this is the optimal volume configuration, which provides true parallelism. If that is impossible for some reason, then choose a preferred partitioning method (space-based or throughput-based). Note that space-based partitioning saves volume space, whereas throughput-based partitioning saves volume throughput.

When performing regular file operations, Reiser5 distributes data stripes throughout the volume evenly and fairly. This means that the portion of IO requests issued against each brick is equal to its relative capacity, that is, to the portion of capacity that the brick adds to the total volume capacity.

Most volume operations are accompanied by rebalancing, which maintains fairness of distribution. For example, adding a brick to a logical volume changes its partitioning and hence breaks fairness of distribution, so some data stripes must be moved to the new brick to make the distribution fair again. Similarly, you cannot simply remove a brick from a logical volume - all data stripes must first be moved from that brick to the other bricks of the volume.

Every time the user performs a volume operation, Reiser5 marks the LV as "not balanced". After successful balancing the status of the LV changes to "balanced". If the balancing procedure fails for some reason, it should be resumed manually (with the volume.reiser4 utility).

It is allowed to perform regular file operations on a not-balanced LV. However, in this case:

a) a good quality of data distribution on your LV is not guaranteed;
b) you will not be able to perform volume operations on your LV except balancing - any other volume operation will return an error (EBUSY).

So don't forget to bring your LV to the balanced state as soon as possible!

= Prepare Software and Hardware =

Build, install and boot a kernel with Reiser4 of software framework release number 5.X.Y.
Kernel patches can be found [https://sourceforge.net/projects/reiser4/files/v5-unstable/ here]. Note that the Linux kernel and GNU utilities still recognize the testing software as "Reiser4". Make sure the following message appears in the kernel logs:

 "Loading Reiser4 (Software Framework Release: 5.X.Y)"

Build and install the latest [https://sourceforge.net/projects/reiser4/files/reiser4-utils/libaal/ libaal]. Download, build and install the latest version 2.A.B of the [https://sourceforge.net/projects/reiser4/files/v5-unstable/ Reiser4progs package]. Make sure that the utility for managing logical volumes is installed (as a part of the reiser4progs package) on your machine:

 # volume.reiser4 -?

= Creating a logical volume =

Start by choosing a unique ID (UUID) for your volume. By default it is generated by the mkfs utility. However, the user can generate it with suitable tools (e.g. uuid(1)) and store it in an environment variable for convenience:

 # VOL_ID=`uuid -v4`
 # echo "Using uuid $VOL_ID"

Choose a stripe size for your logical volume. For a good quality of distribution it is recommended that the stripe not exceed 1/10000 of the volume size. On the other hand, too small a stripe will increase space consumption on your meta-data brick. In our example we choose a stripe size of 256K.

Start by creating the first brick of your volume - the meta-data brick - passing the volume ID and stripe size to the mkfs.reiser4 utility:

 # mkfs.reiser4 -U $VOL_ID -t 256K /dev/vdb1

Currently only one meta-data brick per volume is supported, so it is recommended that the block device for the meta-data brick not be too small. In most cases it will be enough if your meta-data brick is not smaller than 1/200 of the maximal volume size. For example, a 100G meta-data brick will be able to service a ~20T logical volume.
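The two sizing rules of thumb above (stripe at most 1/10000 of the volume size, meta-data brick at least 1/200 of the maximal volume size) can be sanity-checked with plain shell arithmetic. The sizes below are hypothetical, in decimal bytes:

```shell
# Planned sizes in bytes (hypothetical, decimal units):
VOLUME_SIZE=20000000000000   # ~20T logical volume
META_SIZE=100000000000       # 100G meta-data brick
STRIPE=262144                # 256K stripe

# Stripe should not exceed 1/10000 of the volume size:
if [ "$STRIPE" -le $((VOLUME_SIZE / 10000)) ]; then
    echo "stripe OK"
else
    echo "stripe too large"
fi

# Meta-data brick should be at least 1/200 of the maximal volume size:
if [ "$META_SIZE" -ge $((VOLUME_SIZE / 200)) ]; then
    echo "meta-data brick OK"
else
    echo "meta-data brick too small"
fi
```

For the numbers above both checks pass, matching the 100G-per-~20T example in the text.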
Mount your logical volume, consisting so far of one meta-data brick:

 # mount /dev/vdb1 /mnt

Find the record about your volume in the output of the following command:

 # volume.reiser4 -l

Create the configuration of your logical volume (its definition is above) and store it somewhere - but not on that volume! Your logical volume is now on-line and ready to use. You can perform regular file operations and volume operations (e.g. add a data brick to your LV).

= Adding a data brick to LV =

At any time you can add a data brick to your LV. You can do it in parallel with regular file operations executing on this volume. Make sure, however, that no other volume operation (e.g. removing a brick) is in progress on your volume, otherwise your operation will fail with EBUSY. Obviously, adding a brick will increase the capacity of your volume.

Choose a block device for the new data brick. Make sure that it is not too large or too small: the capacities of any two bricks of the same logical volume cannot differ by more than 2^19 (~500,000) times. E.g. your logical volume cannot contain both a 1M and a 2T brick. Any attempt to add a brick of improper capacity will fail with an error.

Format it in the same way as the meta-data brick, but also specify the "-a" option (to let mkfs know that it is a data brick):

 # mkfs.reiser4 -U $VOL_ID -t 256K -a /dev/vdb2

Important: make sure you specify the same volume ID and stripe size as the other bricks of the logical volume have. Otherwise, the operation of adding the data brick will fail.

Update the configuration of your volume with the UUID or name of the brick you want to add (item #4). To add the brick, simply pass its name as an argument to the option "-a" and specify your LV via its mount point:

 # volume.reiser4 -a /dev/vdb2 /mnt

The procedure of adding a brick automatically invokes re-balancing, which moves a portion of data stripes to the newly added brick (so that the resulting distribution will be fair).
The portion of data blocks moved during such rebalancing is equal to the relative capacity of the new brick, that is, to the portion of capacity that the new brick adds to the updated LV's capacity. This important property defines the cost of the balancing procedure: if the portion of capacity added by a brick is small, then the number of stripes moved during balancing is also small.

Like other user-space utilities, the operation of adding a brick can return an error, even assuming that the brick you wanted to add is properly formatted. In this case check the status of your LV:

 # volume.reiser4 /mnt

If the volume is unbalanced, then simply complete balancing manually:

 # volume.reiser4 -b /mnt

Otherwise, check the number of bricks in your LV. Most likely it is the same as before the failed operation; in this case simply repeat the operation of adding a brick from scratch. Upon successful completion update your volume configuration: increment (#2), add info about the new brick to (#3) and remove the record at (#4).

= Removing a data brick from LV =

At any time you can remove a data brick from your LV. You can do it in parallel with regular file operations executing on this volume. Make sure, however, that no other volume operation (e.g. adding a brick) is in progress on your volume, otherwise your operation will fail with EBUSY. Obviously, removing a brick will decrease the abstract capacity of your LV. Note that the other bricks must have enough space to store all data blocks of the brick you want to remove, otherwise the removal operation will return an error (ENOSPC).

Suppose you want to remove brick /dev/vdb2 from your LV mounted at /mnt. Update your volume configuration with the UUID and name of the brick you want to remove (item #4).
To remove the brick, simply pass its name as an argument to option "-r" and specify the logical volume by its mount point:

 # volume.reiser4 -r /dev/vdb2 /mnt

The procedure of brick removal automatically invokes re-balancing, which distributes the data of the brick to be removed among the other bricks, so that the resulting distribution is also fair. The portion of data stripes moved during such rebalancing is equal to the relative capacity of the brick to be removed (that is, to the portion of capacity that the brick added to the LV's capacity).

It can happen that the command above completes with an error (like other user-space applications). In this case check the status of your LV:

 # volume.reiser4 /mnt

If the volume is not balanced, then simply complete balancing manually:

 # volume.reiser4 -b /mnt

Otherwise, check the number of bricks in your logical volume - it should be the same as before the failed operation. The error -ENOSPC indicates that the free space on the other bricks is not enough to fit all the data of the brick you want to remove.

On success, update your volume configuration: remove the information about the brick /dev/vdb2 at (#3) and (#4). Check your kernel logs: they should contain a message that brick /dev/vdb2 has been unregistered. Now the device /dev/vdb2 doesn't belong to the logical volume any more, and you can reuse it for other purposes (re-format, etc).

= Changing brick's capacity =

At any time (assuming that no other volume operation is in progress) you can change the abstract capacity of any brick to some new non-zero value. Changing capacity always changes the volume partitioning and therefore breaks fairness of distribution, so Reiser5 automatically launches rebalancing to make sure that the resulting distribution is fair for the new set of capacities. In particular, increasing a brick's capacity will move some data from other bricks to the brick whose capacity was increased.
Decreasing a brick's capacity will move some data from the brick whose capacity was decreased to the other bricks. To change the abstract capacity of brick /dev/vdb1 to a new value (e.g. 200000), simply run:

 # volume.reiser4 -z /dev/vdb1 -c 200000 /mnt

Pronounced as "resize brick /dev/vdb1 to new capacity 200000 in the volume mounted at /mnt".

The operation of changing capacity can return an error. Most likely it is -ENOSPC, which is a side effect of concurrent regular file writes. In this case check the status of your LV. If it is unbalanced, then consider removing some files from your LV and complete balancing by running:

 # volume.reiser4 -b /mnt

Otherwise, repeat the operation from scratch.

Comment. Changing a brick's capacity to 0 is undefined and will return an error. Consider the brick removal operation instead.

= Operations with meta-data brick =

The meta-data brick can also contain data stripes and participate in data distribution like the data bricks, so all the volume operations described above are also applicable to the meta-data brick. Note, however, that it is impossible to completely remove the meta-data brick from the logical volume for obvious reasons (meta-data needs to be stored somewhere), so the brick removal operation applied to the meta-data brick actually removes it from the Data Storage Array (DSA), not from the logical volume. The DSA is the subset of the LV consisting of the bricks participating in data distribution. Once you remove the meta-data brick from the DSA, that brick will be used only to store meta-data. The operation of adding a brick, applied to the meta-data brick, returns it back to the DSA.

Important: Reiser5 doesn't count busy data and meta-data blocks separately. So, in contrast with data bricks (which contain only data), you cannot find out the real space occupied by data blocks on the meta-data brick - Reiser5 knows only the total space occupied.

To check the status of the meta-data brick simply run:

 # volume.reiser4 /mnt

and compare the values of "bricks total" and "bricks in DSA".
If they are equal, then the meta-data brick participates in data distribution. Otherwise, "bricks total" should be 1 more than "bricks in DSA" - this indicates that the meta-data brick doesn't participate in data distribution (and therefore doesn't contain data blocks). Other cases are impossible: for data bricks, participation in the LV and in the DSA are always equivalent.

= Unmounting a logical volume =

To terminate a mount session just issue the usual umount command with the mount point specified. Note that after unmounting the volume all bricks by default remain registered in the system until system shutdown. If you want to unregister a brick before system shutdown, then simply issue the following command:

 # volume.reiser4 -u BRICK_NAME

= Deploying a logical volume after correct unmount =

Make sure (by checking your volume configuration) that all bricks of the volume are registered in the system. The list of all volumes and bricks registered in the system can be found in the output of the following command:

 # volume.reiser4 -l

Issue the usual mount(8) command against one of the bricks of your volume. It is recommended to issue it against the meta-data brick.

NOTE: Reiser5 will refuse to mount a logical volume when a wrong (incomplete or redundant) set of bricks is registered in the system. A redundant set of bricks appears, for example, when you mistakenly register a brick that was earlier removed from the logical volume.

= Deploying a logical volume after correct shutdown =

To mount your LV, first make sure that all its bricks (data and meta-data) are registered in the system. Important: Reiser5 will refuse to mount a logical volume when a wrong (incomplete or redundant) set of bricks is registered in the system. A redundant set of bricks appears, for example, when you mistakenly register a brick that was removed from the logical volume.
For these reasons we strongly recommend that the user keep track of his LV - store its configuration somewhere, but not on this volume! And don't forget to update that configuration after _every_ volume operation. If you lost the configuration of your LV and don't remember it (which is most likely for large volumes), then it will be rather painful to restore: currently there are no tools to manage logical volumes off-line, so users are expected to do this on their own. It is not at all difficult.

To register a brick in the system use the following command:

 # volume.reiser4 -g BRICK_NAME

To print a list of all registered bricks use:

 # volume.reiser4 -l

To mount your LV just issue a mount command for any one brick of your LV.

Comment. Reiser5 always tries to register the brick which is passed to the mount command as an argument, so there is no need to pre-register the brick you are going to issue the mount command against.

= Deploying a logical volume after hard reset or system crash =

If no volume operations were interrupted by the hard reset or system crash, then just follow the instructions of the previous section. In Reiser5 only a restricted number of bricks participate in every transaction; the maximal number of such bricks can be specified by the user. At mount time a transaction replay procedure will be launched on each such brick independently, in parallel.

Depending on the kind of interrupted volume operation, perform one of the following actions:

== Adding a brick was interrupted ==

Check your volume configuration. Register the old set of bricks (that is, the set of bricks that the volume had before applying the operation) and try to mount. In the case of an error, register also the brick you wanted to add and try to mount again.
Check the status of your LV by running:

 # volume.reiser4 /mnt

If the volume is unbalanced, then complete balancing manually by running:

 # volume.reiser4 -b /mnt

Check "bricks total" of your LV in the output of:

 # volume.reiser4 /mnt

Compare it with the old number of bricks in the configuration. The new value should be the old one incremented by 1. If the number of bricks is the same, then your operation of adding a brick was completely rolled back by the transaction manager, and you need to repeat it from scratch. Otherwise, your operation completed successfully - update your volume configuration accordingly.

== Brick removal was interrupted ==

Check your volume configuration. Register the old set of bricks (that is, the set of bricks that the volume had before applying the interrupted operation) except the brick you wanted to remove. Try to mount the volume. In the case of an error, register also the brick you wanted to remove and try to mount again. Check the status of your LV:

 # volume.reiser4 /mnt

If the volume is unbalanced, then complete balancing manually by running:

 # volume.reiser4 -b /mnt

Comment. After successful completion of balancing the brick will be automatically removed from the volume. Make sure of it by checking the status of your LV:

 # volume.reiser4 /mnt

Update your volume configuration accordingly.

== Another volume operation was interrupted ==

Using the volume configuration, register the new set of bricks and try to mount the volume. The mount should be successful.
Check the status of your LV:

 # volume.reiser4 /mnt

If the volume is unbalanced, then complete balancing manually by running:

 # volume.reiser4 -b /mnt

= LV monitoring =

Common info about the LV mounted at /mnt:

 # volume.reiser4 /mnt

 ID:             Volume UUID
 volume:         ID of the plugin managing the volume
 distribution:   ID of the distribution plugin
 stripe:         Stripe size in bytes
 segments:       Number of hash space segments (for distribution)
 bricks total:   Total number of bricks in the volume
 bricks in DSA:  Number of bricks participating in data distribution
 balanced:       Balanced status of the volume

Info about any of its bricks of index J:

 # volume.reiser4 -p J /mnt

 internal ID:    Brick's "internal ID" and its status in the volume
 external ID:    Brick's UUID
 device name:    Name of the block device associated with the brick
 block count:    Size of the block device in blocks
 blocks used:    Total number of occupied blocks on the device
 system blocks:  Minimal possible number of busy blocks on that device
 data capacity:  Abstract capacity of the brick
 space usage:    Portion of occupied blocks on the device

Comment. When retrieving a brick's info, make sure that no volume operations are in progress on that volume. Otherwise the command above will return an error (EBUSY).

WARNING. Brick info provided this way is not necessarily the most recent. To get actual info, run sync(1) and make sure that no regular file operations are in progress.

= Checking free space =

To check the number of available free blocks on a volume mounted at /mnt, make sure that no regular file operations or volume operations are in progress on that volume, then run:

 # sync
 # df --block-size=4K /mnt

To check the number of free blocks on the brick of index J, run:

 # volume.reiser4 -p J /mnt

Then calculate the difference between "block count" and "blocks used".

Comment. Not all free blocks on a brick/volume are available for use.
The number of available free blocks is always ~95% of the total number of free blocks (Reiser4 reserves 5% to make sure that regular file truncate operations won't fail).

NOTE: volume.reiser4 shows the total number of free blocks, whereas df(1) shows the number of available free blocks. The "space usage" statistic shows the portion of busy blocks on an individual brick. For the reasons explained above, "space usage" on any brick cannot be more than 0.95.

= Checking quality of data distribution =

Quality of data distribution is a measure of the deviation of the real data space usage from the ideal one defined by the volume partitioning. The smaller the deviation, the better the distribution quality.

Checking quality of distribution makes sense only when your volume partitioning is space-based, or coincides with the space-based one. If your partitioning is throughput-based and doesn't coincide with the space-based one, then the quality of the actual data distribution can be rather bad: in that case the file system takes care that low-performance devices do not become a bottleneck, and effective space usage is not a high priority.

Checking quality of data distribution is based on the free blocks accounting provided by the file system. Note that the file system doesn't count busy data and meta-data blocks separately, so you cannot find the real data space usage - and hence check the quality of distribution - when the meta-data brick contains data blocks.

To check quality of distribution:

* make sure that the meta-data brick doesn't contain data blocks;
* make sure that no regular file or volume operations are currently in progress;
* find the "blocks used", "system blocks" and "data capacity" statistics for each data brick:

 # sync
 # volume.reiser4 -p 1 /mnt
 ...
 # volume.reiser4 -p N /mnt

* find the real data space usage on each brick;
* calculate the partitioning and the ideal data space usage on each data brick;
* find the deviation of the real usage from the ideal one.

Example.
Let's build an LV of 3 bricks (one 10G meta-data brick vdb1, and two data bricks: vdc1 (10G) and vdd1 (5G)) with space-based partitioning:

 # VOL_ID=`uuid -v4`
 # echo "Using uuid $VOL_ID"
 # mkfs.reiser4 -U $VOL_ID -y -t 256K /dev/vdb1
 # mkfs.reiser4 -U $VOL_ID -y -a -t 256K /dev/vdc1
 # mkfs.reiser4 -U $VOL_ID -y -a -t 256K /dev/vdd1
 # mount /dev/vdb1 /mnt

Fill the meta-data brick with data:

 # dd if=/dev/zero of=/mnt/myfile bs=256K
 No space left on device...

Add data bricks /dev/vdc1 and /dev/vdd1 to the volume:

 # volume.reiser4 -a /dev/vdc1 /mnt
 # volume.reiser4 -a /dev/vdd1 /mnt

Move all data blocks to the newly added bricks:

 # volume.reiser4 -r /dev/vdb1 /mnt
 # sync

Now the meta-data brick doesn't contain data blocks (only meta-data ones), so we can calculate the quality of data distribution:

 # volume.reiser4 /mnt -p0
 blocks used: 503
 # volume.reiser4 /mnt -p1
 blocks used: 1657203
 system blocks: 115
 data capacity: 2621069
 # volume.reiser4 /mnt -p2
 blocks used: 833001
 system blocks: 73
 data capacity: 1310391

Based on the statistics above, calculate the quality of distribution.

Total data capacity of the volume:

 C = 2621069 + 1310391 = 3931460

Relative capacities of the data bricks:

 C1 = 2621069 / 3931460 = 0.6667
 C2 = 1310391 / 3931460 = 0.3333

Real space usage on the data bricks (blocks used - system blocks):

 R1 = 1657203 - 115 = 1657088
 R2 = 833001 - 73 = 832928

Space usage on the volume:

 R = R1 + R2 = 1657088 + 832928 = 2490016

Ideal data space usage on the data bricks:

 I1 = C1 * R = 0.6667 * 2490016 = 1660094
 I2 = C2 * R = 0.3333 * 2490016 = 829922

Deviation:

 D = (R1, R2) - (I1, I2) = (-3006, 3006)

Relative deviation:

 D/R = (-0.0012, 0.0012)

Quality of distribution:

 Q = 1 - max(|D1|, |D2|)/R = 1 - 0.0012 = 0.9988

Comment. For any specified number of bricks N and quality of distribution Q, it is possible to find a configuration of a logical volume composed of N bricks such that the quality of distribution on that volume is better than Q.
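The arithmetic of this example can be reproduced with a short shell snippet. The input numbers are the "blocks used", "system blocks" and "data capacity" statistics from the example above; the snippet itself is only an illustration of the calculation, not a reiser4progs tool.

```shell
# Recompute the quality of distribution from the example statistics.
Q=$(awk 'BEGIN {
    u1 = 1657203; s1 = 115; c1 = 2621069   # data brick of index 1
    u2 = 833001;  s2 = 73;  c2 = 1310391   # data brick of index 2
    r1 = u1 - s1; r2 = u2 - s2             # real data space usage
    r  = r1 + r2                           # data blocks on the whole volume
    i1 = r * c1 / (c1 + c2)                # ideal usage: C1 * R
    i2 = r * c2 / (c1 + c2)                # ideal usage: C2 * R
    d1 = r1 - i1; d2 = r2 - i2             # deviations from the ideal
    a1 = (d1 < 0 ? -d1 : d1)
    a2 = (d2 < 0 ? -d2 : d2)
    dmax = (a1 > a2 ? a1 : a2)
    printf "%.4f", 1 - dmax / r            # quality of distribution
}')
echo "Q = $Q"    # prints Q = 0.9988
```

Note that the exact (unrounded) deviations differ slightly from the hand-rounded ones above, but the resulting quality agrees to four decimal places.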
The quality of distribution Q doesn't depend on the number of bricks in the logical volume. This is a theorem, which can be strictly proven.

= FAQ =

Q. What happens if I lose a device component of my logical volume (due to a breakdown, etc.)?

A. The bodies of some of your regular files will become "punched" in random places. The portion of such files depends on the relative capacity of the lost brick, on the number of bricks in the logical volume, and on other factors. Fsck will be able to detect and remove files with corrupted bodies. Nevertheless, we recommend considering mirroring your bricks (e.g. by software or hardware RAID-1) to avoid such highly unpleasant situations.
and properly updated before and after each volume operation performed on that volume. We make the user responsible for this. Volume configuration is needed to facilitate deploying a volume. '''Abstract capacity''' (or simply capacity) of a brick is a positive integer number. Capacity is a brick's property defined by user. Don't confuse it with the size of block device. Think of it as of brick's "weight" in some units. And this is the user, who decides, which property of the brick to assign as its abstract capacity and in which units. In particular, it can be size of the block device in kilobytes, or its size in megabytes, or its throughput in M/sec, or other geometric or physical parameter of the device, associated with the brick. It is important that capacities of all bricks of the same logical volume are measured in the same units. Also, it would be utterly pointless to assign different properties as abstract capacities for bricks of the same LV. For example, size of block device for one brick, and disk bandwidth for another one. Capacity of each brick gets initialized by mkfs utility. By default it is calculated as number of free blocks on the device at the very end of the formatting procedure. For meta-data brick it is calculated as 70% of such amount. Capacity of any brick can be changed on-line by user. '''Capacity of a logical volume''' is defined as a sum of capacities of its bricks-components. '''Relative capacity of a brick''' is the ratio of brick's capacity to volume's capacity. Relative capacity defines a portion of IO-requests that will be issued against that brick. Array of relative capacities (C1, C2, ...) of all bricks is called volume partitioning. Obviously, C1 + C2 + ... = 1. '''(Real) data space usage''' on a brick is number of data blocks, stored on that brick. '''Ideal (or expected) data space usage''' on a brick is T*C, where T is total number of data blocks stored in the volume. C is relative capacity of the brick. 
It is recommended to compose volumes in the way so that space-based partitioning coincides with throughput-based one - it would be the optimal volume configuration, which provides true parallelism. If it is impossible for some reason, then choose a preferred partitioning method (space-based, or throughput-based). Note that space-based partitioning saves volume space, whereas throughput based one saves volume throughput. When performing regular file operations, Reiser5 distributes data stripes throughout the volume evenly and fairly. It means that portion of IO-requests issued against each brick is equal to its relative capacity, that is, to the portion of capacity that the brick adds to the total volume's capacity. Most volume operations are accompanied by rebalancing, which keeps fairness of distribution. For example, adding a brick to a logical volume changes its partitioning, and hence, breaks fairness of the distribution, so we need to move some data stripes to the new brick to make distribution fair. Also you can not simply remove a brick from a logical volume - all data stripes should be moved from that brick to other bricks of the logical volume. Every time when user performs a volume operation, Reiser5 marks LV as "not balanced". After successful balancing the status of LV is changed to "balanced". If balancing procedure fails for some reasons, it should be resumed manually (with volume.reiser4 utility). It is allowed to perform regular file operations on not balanced LV. However, in this case: a) we don't guarantee a good quality of data distribution on your LV. b) you won't be able to perform volume operations on your LV except balancing - any other volume operation will return error (EBUSY). So, don't forget to bring your LV to the balanced state as soon as possible! = Prepare Software and Hardware = Build, install and boot kernel with Reiser4 of software framework release number 5.X.Y. 
Kernel patches can be found [https://sourceforge.net/projects/reiser4/files/v5-unstable/ here]. Note that by Linux kernel and GNU utilities the testing stuff is still recognized as "Reiser4". Make sure there is the following message in kernel logs: "Loading Reiser4 (Software Framework Release: 5.X.Y)" Build and install the latest [https://sourceforge.net/projects/reiser4/files/reiser4-utils/libaal/ libaal] Download, build and install the latest version 2.A.B of [https://sourceforge.net/projects/reiser4/files/v5-unstable/ Reiser4progs package]. Make sure that utility for managing logical volumes is installed (as a part of reiser4progs package) on your machine: # volume.reiser4 -? = Creating a logical volume = Start from choosing a unique ID (uuid) of your volume. By default it is generated by mkfs utility. However, user can generate it himself by proper tools (e.g. uuid(1)) and store in an environment variable for convenience: # VOL_ID=`uuid -v4` # echo "Using uuid $VOL_ID" Choose a stripe size for your logical volume. For a good quality of distribution it is recommended that stripe doesn't exceed 1/10000 of volume size. On the other hand, too small stripes will increase space consumption on your meta-data brick. In our example we choose stripe size 256K. Start from creating the first brick of your volume - meta-data brick, passing volume-ID and stripe size to mkfs.reiser4 utility: # mkfs.reiser4 -U $VOL_ID -t 256K /dev/vdb1 Currently only one meta-data brick per volume is supported, so it is recommended that size of block device for meta-data brick in not too small. In most cases it will be enough, if your meta-data brick is not smaller than 1/200 of maximal volume size. For example, 100G meta-data brick will be able to service ~20T logical volume. 
Mount your logical volume consisting of one meta-data brick: # mount /dev/vdb1 /mnt Find a record about your volume in the output of the following command: # volume.reiser4 -l Create configuration of your logical volume (its definition is above) and store it somewhere, but not on that volume! Your logical volume is now on-line and ready to use. You can perform regular file operations and volume operations (e.g. add a data brick to your LV). = Adding a data brick to LV = At any time you are able to add a data brick to your LV. You can do it in parallel with regular file operations executing on this volume. Make sure, however, that there is no other volume operations (e.g. removing a brick) over your volume in progress, otherwise your operation will fail with EBUSY. Obviously, adding a brick will increase capacity of your volume. Choose a block device for the new data brick. Make sure that it is not too large, or too small. Capacities of any 2 bricks of the same logical volume can not differ more than 2^19 (~1 million) times. E.g. your logical volume can not contain both, 1M and 2T bricks. Any attempts to add a brick of improper capacity will fail with error. Format it by the same way as meta-data brick, but specify also "-a" option (to let mkfs know that it is data brick). # mkfs.reiser4 -U $VOL_ID -t 256K -a /dev/vdb2 Important: make sure you specified the same volume ID and stripe size as other bricks of the logical volume do have. Otherwise, operation of adding a data brick will fail. Update configuration of your volume with UUID or name of the brick you want to add (item #4). To add a brick simply pass its name as an argument for the option "-a" and specify your LV via its mount point: # volume.reiser4 -a /dev/vdb2 /mnt The procedure of adding a brick automatically invokes re-balancing, which moves a portion of data stripes to the newly added brick (so that the resulted distribution will fair). 
The portion of data blocks moved during such rebalancing is equal to the relative capacity of the new brick, that is, to the portion of capacity that the new brick adds to the updated LV's capacity. This important property defines the cost of the balancing procedure: if the portion of capacity added by a brick is small, the number of stripes moved during balancing is also small.

Like other user-space utilities, the operation of adding a brick can return an error, even when the brick you wanted to add is properly formatted. In this case check the status of your LV:

 # volume.reiser4 /mnt

If the volume is unbalanced, simply complete balancing manually:

 # volume.reiser4 -b /mnt

Otherwise, check the number of bricks in your LV. Most likely it is the same as before the failed operation; in that case simply repeat the operation of adding a brick from scratch. Upon successful completion update your volume configuration: increment (item #2), add info about the new brick to (item #3), and remove the record at (item #4).

= Removing a data brick from LV =

At any time you can remove a data brick from your LV, even in parallel with regular file operations executing on the volume. Make sure, however, that no other volume operation (e.g. adding a brick) is in progress on the volume; otherwise your operation will fail with EBUSY. Removing a brick decreases the abstract capacity of your LV. Note that the other bricks must have enough space to store all data blocks of the brick you want to remove; otherwise the removal operation will return an error (ENOSPC).

Suppose you want to remove brick /dev/vdb2 from your LV mounted at /mnt. Update your volume configuration with the UUID and name of the brick you want to remove (item #4).
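The rebalancing cost stated above (the fraction of stripes moved equals the relative capacity of the brick being added or removed) can be estimated ahead of time. A sketch with hypothetical capacities:

```python
# Fraction of data stripes relocated by rebalancing, per the rule above:
# it equals the relative capacity of the brick being added or removed.

def fraction_moved_on_add(existing_capacities, new_capacity):
    # relative capacity of the new brick in the grown volume
    return new_capacity / (sum(existing_capacities) + new_capacity)

def fraction_moved_on_remove(capacities, index):
    # relative capacity of the brick in the volume before shrinking
    return capacities[index] / sum(capacities)

# Adding a brick that contributes 10% of the new total capacity
# relocates about 10% of the stripes:
print(fraction_moved_on_add([3, 3, 3], 1))  # 0.1
```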
To remove the brick, simply pass its name as the argument of the "-r" option and specify the logical volume by its mount point:

 # volume.reiser4 -r /dev/vdb2 /mnt

The procedure of brick removal automatically invokes re-balancing, which distributes the data of the brick to be removed among the other bricks, so that the resulting distribution is again fair. The portion of data stripes moved during such rebalancing is equal to the relative capacity of the brick to be removed (that is, to the portion of capacity that the brick contributed to the LV's capacity).

It can happen that the command above completes with an error (like any other user-space application). In this case check the status of your LV:

 # volume.reiser4 /mnt

If the volume is not balanced, simply complete balancing manually:

 # volume.reiser4 -b /mnt

Otherwise, check the number of bricks in your logical volume; it should be the same as before the failed operation. The error -ENOSPC indicates that the free space on the other bricks is not enough to fit all the data of the brick you want to remove.

On success, update your volume configuration: remove the information about the brick /dev/vdb2 at items #3 and #4. Check your kernel logs: they should contain a message that brick /dev/vdb2 has been unregistered. The device /dev/vdb2 no longer belongs to the logical volume, and you can reuse it for other purposes (re-format it, etc).

= Changing brick's capacity =

At any time (assuming no other volume operation is in progress) you can change the abstract capacity of any brick to some new non-zero value. Changing capacity always changes the volume partitioning and therefore breaks fairness of distribution, so Reiser5 automatically launches rebalancing to make sure the resulting distribution is fair for the new set of capacities. In particular, increasing a brick's capacity will move some data from the other bricks to the brick whose capacity was increased.
Decreasing a brick's capacity will move some data from the brick whose capacity was decreased to the other bricks. To change the abstract capacity of brick /dev/vdb1 to a new value (e.g. 200000), simply run

 # volume.reiser4 -z /dev/vdb1 -c 200000 /mnt

pronounced as "resize brick /dev/vdb1 to new capacity 200000 in the volume mounted at /mnt".

The operation of changing capacity can return an error. Most likely it is -ENOSPC, a side effect of concurrent regular file writes. In this case check the status of your LV. If it is unbalanced, consider removing some files from your LV and complete balancing by running

 # volume.reiser4 -b /mnt

Otherwise, repeat the operation from scratch.

Comment. Changing a brick's capacity to 0 is undefined and will return an error; consider the brick removal operation instead.

= Operations with meta-data brick =

The meta-data brick can also contain data stripes and participate in data distribution like the data bricks, so all the volume operations described above are applicable to the meta-data brick as well. Note, however, that it is impossible to completely remove the meta-data brick from the logical volume for obvious reasons (meta-data needs to be stored somewhere), so the brick removal operation applied to the meta-data brick actually removes it from the Data-Storage Array (DSA), not from the logical volume. The DSA is the subset of the LV consisting of the bricks participating in data distribution. Once you remove the meta-data brick from the DSA, that brick will be used only to store meta-data. The operation of adding a brick, applied to the meta-data brick, returns it to the DSA.

Important: Reiser5 doesn't count busy data and meta-data blocks separately. So, in contrast with data bricks (which contain only data), you cannot find out the real space occupied by data blocks on the meta-data brick; Reiser5 knows only the total space occupied.

To check the status of the meta-data brick, simply run

 # volume.reiser4 /mnt

and compare the values of "bricks total" and "bricks in DSA".
If they are equal, the meta-data brick participates in data distribution. Otherwise, "bricks total" should be 1 more than "bricks in DSA", which indicates that the meta-data brick doesn't participate in data distribution (and therefore doesn't contain data blocks). Other cases are impossible: for data bricks, membership in the LV and in the DSA is always equivalent.

= Unmounting a logical volume =

To terminate a mount session, just issue the usual umount command with the mount point specified. Note that after unmounting the volume all bricks by default remain registered in the system until system shutdown. If you want to unregister a brick before system shutdown, simply issue the following command:

 # volume.reiser4 -u BRICK_NAME

= Deploying a logical volume after correct unmount =

Make sure (by checking your volume configuration) that all bricks of the volume are registered in the system. The list of all volumes and bricks registered in the system can be found in the output of the following command:

 # volume.reiser4 -l

Issue the usual mount command against one of the bricks of your volume; it is recommended to specify the meta-data brick. If not all bricks of the volume are registered, attempts to mount the volume will fail with a respective kernel message.

NOTE: Reiser5 will refuse to mount a logical volume when a wrong (incomplete or redundant) set of bricks is registered in the system. A redundant set of bricks appears, for example, when you mistakenly register a brick that was removed from the logical volume.

= Deploying a logical volume after correct shutdown =

To mount your LV, first make sure that all its bricks (data and meta-data) are registered in the system. Important: Reiser5 will refuse to mount a logical volume when a wrong (incomplete or redundant) set of bricks is registered in the system.
A redundant set of bricks appears, for example, when you mistakenly register a brick that was removed from the logical volume. For this reason we strongly recommend keeping track of your LV: store its configuration somewhere, but not on the volume itself! And don't forget to update that configuration after ''every'' volume operation. If you lose the configuration of your LV and don't remember it (which is most likely for large volumes), it will be rather painful to restore: currently there are no tools to manage off-line logical volumes, so users have to do this on their own. It is not at all difficult.

To register a brick in the system, use the following command:

 # volume.reiser4 -g BRICK_NAME

To print the list of all registered bricks, use

 # volume.reiser4 -l

To mount your LV, just issue a mount command for any one brick of the LV.

Comment. Reiser5 always tries to register the brick that is passed to the mount command as an argument, so there is no need to pre-register the brick you issue the mount command against.

= Deploying a logical volume after hard reset or system crash =

If no volume operations were interrupted by the hard reset or system crash, just follow the instructions in section 9. In Reiser5 only a restricted number of bricks participate in every transaction; the maximal number of such bricks can be specified by the user. At mount time a transaction replay procedure is launched on each such brick independently, in parallel. Depending on the kind of interrupted volume operation, perform one of the following actions:

== Adding a brick was interrupted ==

Check your volume configuration. Register the old set of bricks (that is, the set of bricks the volume had before the operation was applied) and try to mount. In case of error, register also the brick you wanted to add and try to mount again.
Check the status of your LV by running

 # volume.reiser4 /mnt

If the volume is unbalanced, complete balancing manually by running

 # volume.reiser4 -b /mnt

Check "bricks total" of your LV in the output of

 # volume.reiser4 /mnt

and compare it with the old number of bricks in the configuration. The new value should be the old one incremented by 1. If the number of bricks is the same, your operation of adding a brick was completely rolled back by the transaction manager, and you need to repeat it from scratch. Otherwise, the operation was successfully completed; update your volume configuration accordingly.

== Brick removal was interrupted ==

Check your volume configuration. Register the old set of bricks (that is, the set of bricks the volume had before the interrupted operation was applied) except the brick you wanted to remove, and try to mount the volume. In case of error, register also the brick you wanted to remove and try to mount again. Check the status of your LV:

 # volume.reiser4 /mnt

If the volume is unbalanced, complete balancing manually by running

 # volume.reiser4 -b /mnt

Comment. After successful completion of balancing, the brick will be automatically removed from the volume. Make sure of it by checking the status of your LV:

 # volume.reiser4 /mnt

Update your volume configuration accordingly.

== Another volume operation was interrupted ==

Using the volume configuration, register the new set of bricks and try to mount the volume. The mount should be successful.
Check the status of your LV:

 # volume.reiser4 /mnt

If the volume is unbalanced, complete balancing manually by running

 # volume.reiser4 -b /mnt

= LV monitoring =

Common info about the LV mounted at /mnt:

 # volume.reiser4 /mnt
 ID:             volume UUID
 volume:         ID of the plugin managing the volume
 distribution:   ID of the distribution plugin
 stripe:         stripe size in bytes
 segments:       number of hash space segments (for distribution)
 bricks total:   total number of bricks in the volume
 bricks in DSA:  number of bricks participating in data distribution
 balanced:       balanced status of the volume

Info about any of its bricks of index J:

 # volume.reiser4 -p J /mnt
 internal ID:    brick's "internal ID" and its status in the volume
 external ID:    brick's UUID
 device name:    name of the block device associated with the brick
 block count:    size of the block device in blocks
 blocks used:    total number of occupied blocks on the device
 system blocks:  minimal possible number of busy blocks on that device
 data capacity:  abstract capacity of the brick
 space usage:    portion of occupied blocks on the device

Comment. When retrieving a brick's info, make sure that no volume operations are in progress on that volume; otherwise the command above will return an error (EBUSY).

WARNING. Brick info obtained this way is not necessarily the most recent. To get actual info, run sync(1) and make sure that no regular file operations are in progress.

= Checking free space =

To check the number of available free blocks on a volume mounted at /mnt, make sure that no regular file operations or volume operations are in progress on that volume, then run

 # sync
 # df --block-size=4K /mnt

To check the number of free blocks on the brick of index J, run

 # volume.reiser4 -p J /mnt

and calculate the difference between "block count" and "blocks used".

Comment. Not all free blocks on a brick/volume are available for use.
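The free-space arithmetic can be sketched as follows; the per-brick figures are hypothetical, the counter names are those printed by volume.reiser4 -p, and the 95% factor reflects the ~5% reservation described next:

```python
# Free blocks on a brick: "block count" minus "blocks used"
# (counters as printed by volume.reiser4 -p J /mnt).
def free_blocks(block_count, blocks_used):
    return block_count - blocks_used

# Roughly what df(1) reports: ~95% of the free blocks, since Reiser4
# reserves ~5%. Integer arithmetic avoids float rounding surprises.
def available_blocks(block_count, blocks_used):
    return free_blocks(block_count, blocks_used) * 95 // 100

# hypothetical 10G brick (4K blocks), largely full:
print(free_blocks(2621440, 1657203))       # total free blocks
print(available_blocks(2621440, 1657203))  # what df would roughly show
```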
The number of available free blocks is always ~95% of the total number of free blocks (Reiser4 reserves 5% to make sure that regular file truncate operations won't fail).

NOTE: volume.reiser4 shows the total number of free blocks, whereas df(1) shows the number of available free blocks. The "space usage" statistic shows the portion of busy blocks on an individual brick; for the reasons explained above, "space usage" on any brick cannot exceed 0.95.

= Checking quality of data distribution =

Quality of data distribution is a measure of the deviation of the real data space usage from the ideal one defined by the volume partitioning. The smaller the deviation, the better the distribution quality. Checking quality of distribution makes sense only when your volume partitioning is space-based, or coincides with the space-based one. If your partitioning is throughput-based and doesn't coincide with the space-based one, the quality of the actual data distribution can be rather bad: in that case the file system takes care that low-performance devices don't become a bottleneck, and effective space usage is not a high priority.

Checking quality of data distribution is based on the free-block accounting provided by the file system. Note that the file system doesn't count busy data and meta-data blocks separately, so you cannot find the real data space usage, and hence cannot check the quality of distribution, when the meta-data brick contains data blocks.

To check quality of distribution:
* make sure that the meta-data brick doesn't contain data blocks;
* make sure that no regular file or volume operations are currently in progress;
* find the "blocks used", "system blocks" and "data capacity" statistics for each data brick:

 # sync
 # volume.reiser4 -p 1 /mnt
 ...
 # volume.reiser4 -p N /mnt

* find the real data space usage on each brick;
* calculate the partitioning and the ideal data space usage on each data brick;
* find the deviation of the real usage from the ideal one.

Example.
Let's build an LV of 3 bricks (one 10G meta-data brick /dev/vdb1, and two data bricks: /dev/vdc1 (10G) and /dev/vdd1 (5G)) with space-based partitioning:

 # VOL_ID=`uuid -v4`
 # echo "Using uuid $VOL_ID"
 # mkfs.reiser4 -U $VOL_ID -y -t 256K /dev/vdb1
 # mkfs.reiser4 -U $VOL_ID -y -a -t 256K /dev/vdc1
 # mkfs.reiser4 -U $VOL_ID -y -a -t 256K /dev/vdd1
 # mount /dev/vdb1 /mnt

Fill the meta-data brick with data:

 # dd if=/dev/zero of=/mnt/myfile bs=256K
 No space left on device...

Add data bricks /dev/vdc1 and /dev/vdd1 to the volume:

 # volume.reiser4 -a /dev/vdc1 /mnt
 # volume.reiser4 -a /dev/vdd1 /mnt

Move all data blocks to the newly added bricks:

 # volume.reiser4 -r /dev/vdb1 /mnt
 # sync

Now the meta-data brick contains no data blocks (only meta-data ones), so we can calculate the quality of data distribution:

 # volume.reiser4 /mnt -p0
 blocks used: 503
 # volume.reiser4 /mnt -p1
 blocks used: 1657203
 system blocks: 115
 data capacity: 2621069
 # volume.reiser4 /mnt -p2
 blocks used: 833001
 system blocks: 73
 data capacity: 1310391

Based on the statistics above, calculate the quality of distribution.

Total data capacity of the volume:
 C = 2621069 + 1310391 = 3931460
Relative capacities of the data bricks:
 C1 = 2621069 / 3931460 = 0.6667
 C2 = 1310391 / 3931460 = 0.3333
Real space usage on the data bricks (blocks used - system blocks):
 R1 = 1657203 - 115 = 1657088
 R2 = 833001 - 73 = 832928
Space usage on the volume:
 R = R1 + R2 = 1657088 + 832928 = 2490016
Ideal data space usage on the data bricks:
 I1 = C1 * R = 0.6667 * 2490016 = 1660094
 I2 = C2 * R = 0.3333 * 2490016 = 829922
Deviation:
 D = (R1, R2) - (I1, I2) = (-3006, 3006)
Relative deviation:
 D/R = (-0.0012, 0.0012)
Quality of distribution:
 Q = 1 - max(|D1|, |D2|)/R = 1 - 0.0012 = 0.9988

Comment. For any specified number of bricks N and quality of distribution Q, it is possible to find a configuration of a logical volume composed of N bricks such that the quality of distribution on that volume is better than Q.
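The worked example above can be reproduced programmatically. A sketch using the quoted statistics, with exact relative capacities instead of the rounded 0.6667/0.3333 (the result agrees to four digits):

```python
# Reproduce the worked example: quality of data distribution Q.
# Figures are the volume.reiser4 -p statistics from the example above.
bricks = {
    "vdc1": {"blocks_used": 1657203, "system_blocks": 115, "capacity": 2621069},
    "vdd1": {"blocks_used": 833001,  "system_blocks": 73,  "capacity": 1310391},
}

total_cap = sum(b["capacity"] for b in bricks.values())
# real data usage per brick: "blocks used" minus "system blocks"
real = {n: b["blocks_used"] - b["system_blocks"] for n, b in bricks.items()}
R = sum(real.values())
# ideal usage: total data usage times the brick's relative capacity
ideal = {n: R * b["capacity"] / total_cap for n, b in bricks.items()}
# relative deviation per brick, and the resulting quality
rel_dev = {n: (real[n] - ideal[n]) / R for n in bricks}
Q = 1 - max(abs(d) for d in rel_dev.values())

print(f"Q = {Q:.4f}")  # 0.9988
```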
Quality of distribution Q doesn't depend on the number of bricks in the logical volume. This is a theorem, which can be strictly proven. = FAQ = Q. What happens if I lose a device-component (due to a breakdown, etc) of my logical volume? A. Bodies of some your regular files will become "punched" in random places. Portion of such files depends on the relative capacity of the lost brick, on the number of bricks in the logical volume, and on other factors. Fsck will be able to detect and remove such files with corrupted bodies. Nevertheless, we recommend to consider mirroring your bricks (e.g. by software, or hardware RAID-1) to avoid such highly unpleasant situations. 55607742142f46a91a43d582e93fa0dc46c4a2aa 4345 4344 2020-01-05T13:58:39Z Edward 4 /* Removing a data brick from LV */ Logical volume (LV) can be composed of any number of block devices, different in physical and geometric parameters. However the optimal configuration (true parallelism) imposes some restrictions and dependencies on the size of such devices. WARNING: The stuff is not stable. Don't put important data to logical volumes managed by software of release number 5.X.Y. Also don't mount your old partitions in kernels with Reiser4 of SFRN 5.X.Y before its stabilization IMPORTANT: Currently there is no tools to manage Reiser5 logical volumes off-line, so it it strongly recommended to save/update configurations of your LV in a file, which doesn't belong to that volume. = Basic definitions. Volume configuration. Brick's capacity. Partitioning. Fair distribution. Balancing = Basic configuration of a logical volume is the following information: * Volume UUID; * Number of bricks in the volume; * List of brick names or UUIDs in the volume; * UUID or name of the brick to be added/removed (if any). That brick is not counted in (2) and (3). For each volume its configuration should be stored somewhere (but not on that volume!) 
and properly updated before and after each volume operation performed on that volume. We make the user responsible for this. Volume configuration is needed to facilitate deploying a volume. '''Abstract capacity''' (or simply capacity) of a brick is a positive integer number. Capacity is a brick's property defined by user. Don't confuse it with the size of block device. Think of it as of brick's "weight" in some units. And this is the user, who decides, which property of the brick to assign as its abstract capacity and in which units. In particular, it can be size of the block device in kilobytes, or its size in megabytes, or its throughput in M/sec, or other geometric or physical parameter of the device, associated with the brick. It is important that capacities of all bricks of the same logical volume are measured in the same units. Also, it would be utterly pointless to assign different properties as abstract capacities for bricks of the same LV. For example, size of block device for one brick, and disk bandwidth for another one. Capacity of each brick gets initialized by mkfs utility. By default it is calculated as number of free blocks on the device at the very end of the formatting procedure. For meta-data brick it is calculated as 70% of such amount. Capacity of any brick can be changed on-line by user. '''Capacity of a logical volume''' is defined as a sum of capacities of its bricks-components. '''Relative capacity of a brick''' is the ratio of brick's capacity to volume's capacity. Relative capacity defines a portion of IO-requests that will be issued against that brick. Array of relative capacities (C1, C2, ...) of all bricks is called volume partitioning. Obviously, C1 + C2 + ... = 1. '''(Real) data space usage''' on a brick is number of data blocks, stored on that brick. '''Ideal (or expected) data space usage''' on a brick is T*C, where T is total number of data blocks stored in the volume. C is relative capacity of the brick. 
It is recommended to compose volumes in the way so that space-based partitioning coincides with throughput-based one - it would be the optimal volume configuration, which provides true parallelism. If it is impossible for some reason, then choose a preferred partitioning method (space-based, or throughput-based). Note that space-based partitioning saves volume space, whereas throughput based one saves volume throughput. When performing regular file operations, Reiser5 distributes data stripes throughout the volume evenly and fairly. It means that portion of IO-requests issued against each brick is equal to its relative capacity, that is, to the portion of capacity that the brick adds to the total volume's capacity. Most volume operations are accompanied by rebalancing, which keeps fairness of distribution. For example, adding a brick to a logical volume changes its partitioning, and hence, breaks fairness of the distribution, so we need to move some data stripes to the new brick to make distribution fair. Also you can not simply remove a brick from a logical volume - all data stripes should be moved from that brick to other bricks of the logical volume. Every time when user performs a volume operation, Reiser5 marks LV as "not balanced". After successful balancing the status of LV is changed to "balanced". If balancing procedure fails for some reasons, it should be resumed manually (with volume.reiser4 utility). It is allowed to perform regular file operations on not balanced LV. However, in this case: a) we don't guarantee a good quality of data distribution on your LV. b) you won't be able to perform volume operations on your LV except balancing - any other volume operation will return error (EBUSY). So, don't forget to bring your LV to the balanced state as soon as possible! = Prepare Software and Hardware = Build, install and boot kernel with Reiser4 of software framework release number 5.X.Y. 
Kernel patches can be found [https://sourceforge.net/projects/reiser4/files/v5-unstable/ here]. Note that by Linux kernel and GNU utilities the testing stuff is still recognized as "Reiser4". Make sure there is the following message in kernel logs: "Loading Reiser4 (Software Framework Release: 5.X.Y)" Build and install the latest [https://sourceforge.net/projects/reiser4/files/reiser4-utils/libaal/ libaal] Download, build and install the latest version 2.A.B of [https://sourceforge.net/projects/reiser4/files/v5-unstable/ Reiser4progs package]. Make sure that utility for managing logical volumes is installed (as a part of reiser4progs package) on your machine: # volume.reiser4 -? = Creating a logical volume = Start from choosing a unique ID (uuid) of your volume. By default it is generated by mkfs utility. However, user can generate it himself by proper tools (e.g. uuid(1)) and store in an environment variable for convenience: # VOL_ID=`uuid -v4` # echo "Using uuid $VOL_ID" Choose a stripe size for your logical volume. For a good quality of distribution it is recommended that stripe doesn't exceed 1/10000 of volume size. On the other hand, too small stripes will increase space consumption on your meta-data brick. In our example we choose stripe size 256K. Start from creating the first brick of your volume - meta-data brick, passing volume-ID and stripe size to mkfs.reiser4 utility: # mkfs.reiser4 -U $VOL_ID -t 256K /dev/vdb1 Currently only one meta-data brick per volume is supported, so it is recommended that size of block device for meta-data brick in not too small. In most cases it will be enough, if your meta-data brick is not smaller than 1/200 of maximal volume size. For example, 100G meta-data brick will be able to service ~20T logical volume. 
Mount your logical volume consisting of one meta-data brick: # mount /dev/vdb1 /mnt Find a record about your volume in the output of the following command: # volume.reiser4 -l Create configuration of your logical volume (its definition is above) and store it somewhere, but not on that volume! Your logical volume is now on-line and ready to use. You can perform regular file operations and volume operations (e.g. add a data brick to your LV). = Adding a data brick to LV = At any time you are able to add a data brick to your LV. You can do it in parallel with regular file operations executing on this volume. Make sure, however, that there is no other volume operations (e.g. removing a brick) over your volume in progress, otherwise your operation will fail with EBUSY. Obviously, adding a brick will increase capacity of your volume. Choose a block device for the new data brick. Make sure that it is not too large, or too small. Capacities of any 2 bricks of the same logical volume can not differ more than 2^19 (~1 million) times. E.g. your logical volume can not contain both, 1M and 2T bricks. Any attempts to add a brick of improper capacity will fail with error. Format it by the same way as meta-data brick, but specify also "-a" option (to let mkfs know that it is data brick). # mkfs.reiser4 -U $VOL_ID -t 256K -a /dev/vdb2 Important: make sure you specified the same volume ID and stripe size as other bricks of the logical volume do have. Otherwise, operation of adding a data brick will fail. Update configuration of your volume with UUID or name of the brick you want to add (item #4). To add a brick simply pass its name as an argument for the option "-a" and specify your LV via its mount point: # volume.reiser4 -a /dev/vdb2 /mnt The procedure of adding a brick automatically invokes re-balancing, which moves a portion of data stripes to the newly added brick (so that the resulted distribution will fair). 
Portion of data blocks moved during such rebalancing is equal to the relative capacity of the new brick, that is to the portion of capacity that the new brick adds to updated LV's capacity. This important property defines the cost of balancing procedure. If the portion of capacity added by a brick is small, then number of stripes moved during balancing is also small. Like other user-space utilities, the operation of adding a brick can return error, even in the assumption that the brick you wanted to add is properly formatted. In this case check the status of your LV: # volume.reiser4 /mnt If the volume is unbalanced, then simply complete balancing manually: # volume.reiser4 -b /mnt Otherwise, check number of bricks in your LV. Most likely that it is the same as it was before the failed operation. In this case simply repeat the operation of adding a brick from scratch. Upon successful completion update your volume configuration. That is, increment (#2), add info about the new brick to (#3) and remove records at (#4). = Removing a data brick from LV = At any time you are able to remove a data brick from your LV. You can do it in parallel with regular file operations executing on this volume. Make sure, however, that there is no other volume operations (e.g. adding a brick) over your volume in progress, otherwise your operation will fail with EBUSY. Obviously, removing a brick will decrease abstract capacity of your LV. Note that other bricks should have enough space to store all data blocks of the brick you want to remove, otherwise, the removal operation will return error (ENOSPC). Suppose you want to remove brick /dev/vdb2 from your LV mounted at /mnt. Update your volume configuration with the UUID and name of the brick you want to remove (#item #4). 
To remove a brick simply pass its name as an argument for option "-r" and specify the logical volume by its mount point: # volume.reiser4 -r /dev/vdb2 /mnt The procedure of brick removal automatically invokes re-balancing, which distributes data of the brick to be removed among other bricks, so that resulted distribution is also fair. Portion of data stripes moved during such rebalancing is equal to the relative capacity of the brick to be removed (that it to the portion of capacity that the brick added to LV's capacity). It can happen, that the command above completes with error (like other user-space applications). In this case check the status of your LV: # volume.reiser4 /mnt If volume is not balanced, then simply complete balancing manually: # volume.reiser4 -b /mnt Otherwise, check the number of the bricks in your logical volume - it should be the same as before the failed operation. The error -ENOSPC indicates that free space on other bricks is not enough to fit all the data of the brick you want to remove. On success update your volume configuration: remove the information about the brick /dev/vdb2 at #3 and #4. Check your kernel logs: it should contain a message that brick /dev/vdb2 has been unregistered. Now device /dev/vdb2 doesn't belong to the logical volume any more, and you can reuse it for other purposes (re-format, etc). = Changing brick's capacity = At any time (in the assumption that no other volume operation is in progress) you can change abstract capacity of any brick to some new value, different from 0. Changing capacity always changes volume partitioning, and therefore, breaks fairness of distribution, so Reiser5 automatically launches rebalancing to make sure that resulted distribution is fair for the new set of capacities. In particular, increasing bricks capacity will move some data from other bricks to the brick, whose capacity was increased. 
Decreasing bricks capacity will move some data from the brick, whose capacity was decreased, to other bricks. To change abstract capacity of a brick /dev/vdb1 to a new value (e.g. 200000), simply run # volume.reiser4 -z /dev/vdb1 -c 200000 /mnt Pronounced as "resize brick /dev/vdb1 to new capacity 200000 in volume mounted at /mnt". The operation of changing capacity can return error. Most likely, it is -ENOSPC, which is a side effect of concurrent regular file writes. In this case check the status of your LV. If it is unbalanced, then consider removing some files from your LV and complete balancing by running # volume.reiser4 -b /mnt Otherwise, repeat the operation from scratch. Comment. Changing bricks capacity to 0 is undefined and will return error. Consider brick removal operation instead. = Operations with meta-data brick = Meta-data brick can also contain data stripes and participate in data distribution like other data bricks. So that all the volume operations described above are also applicable to meta-data brick. Note, however, that it is impossible to completely remove meta-data brick from the logical volume for obvious reasons (meta-data need to be stored somewhere), so brick removal operation applied to the meta-data brick actually removes it from Data-Storage Array (DSA), not from the logical volume. DSA is a subset of LV consisting of bricks, participating in data distribution. Once you remove meta-data brick from DSA, that brick will be used only to store meta-data. Operation of adding a brick, being applied to a meta-data brick, returns the last one back to DSA. Important: Reiser5 doesn't count busy data and meta-data blocks separately. So in contrast with data bricks (which contain only data) you are not able to find out real space occupied by data blocks on the meta-data brick - Reiser5 knows only total space occupied. To check the status of meta-data brick simply run # volume.reiser4 /mnt and compare values of "bricks total" and "bricks in DSA". 
If they are equal, then meta-data brick participates in data distribution. Otherwise, "bricks total" should be 1 more than "bricks in DSA" - it indicates that meta-data brick doesn't participate in data distribution (and therefore, doesn't contain data blocks). Note that other cases are impossible: for data bricks participation in LV and DSA is always equivalent. = Unmounting a logical volume = To terminate a mount session just issue usual umount command with the mount point specified. Note that after unmounting the volume all bricks by default remain to be registered in the system till system shutdown. If you want to unregister a brick before system shutdown, then simply issue the following command: # volume.reiser4 -u BRICK_NAME = Deploying a logical volume after correct unmount = Make sure (by checking your volume configuration) that all bricks of the volume are registered in the system. The list of all volumes and bricks registered in the system can be found in the output of the following command: # volume.reisrer4 -l Issue usual mount command against one of the bricks of your volume. It is recommended to specify meta-data brick in the mount command. If not all bricks of the volume are registered, then attempts to mount such volume will fail with a respective kernel message. NOTE: Reiser5 will refuse to mount a logical volume, in the case, when a wrong (incomplete or redundant) set of bricks is registered in the system. Redundant set of bricks appears, for example, when you mistakenly register a brick that was removed from the logical volume. = Deploying a logical volume after correct shutdown = To mount your LV, first, make sure that all its bricks (data and meta-data) are registered in the system. Important: Reiser5 will refuse to mount a logical volume, in the case, when a wrong (incomplete or redundant) set of bricks is registered in the system. 
A redundant set of bricks appears, for example, when you mistakenly register a brick that was removed from the logical volume. For this reason we strongly recommend that users keep track of their LV - store its configuration somewhere, but not on that volume! And don't forget to update that configuration after _every_ volume operation. If you have lost the configuration of your LV and don't remember it (which is most likely for large volumes), then it will be rather painful to restore: currently there are no tools to manage offline logical volumes, so users have to do this on their own. It is not at all difficult.

To register a brick in the system use the following command:

 # volume.reiser4 -g BRICK_NAME

To print a list of all registered bricks use

 # volume.reiser4 -l

To mount your LV just issue a mount command for any one brick of it.

Comment. Reiser5 always tries to register the brick which is passed to the mount command as an argument, so there is no need to preregister the brick you issue the mount command against.

= Deploying a logical volume after hard reset or system crash =

If no volume operations were interrupted by the hard reset or system crash, then just follow the instructions of the section "Deploying a logical volume after correct shutdown". In Reiser5 only a restricted number of bricks participate in every transaction; the maximal number of such bricks can be specified by the user. At mount time a transaction replay procedure will be launched on each such brick independently, in parallel.

If a volume operation was interrupted, perform one of the following actions, depending on its kind:

== Adding a brick was interrupted ==

Check your volume configuration. Register the old set of bricks (that is, the set of bricks that the volume had before the interrupted operation) and try to mount. In the case of error, register also the brick you wanted to add and try to mount again.
Check the status of your LV by running

 # volume.reiser4 /mnt

If the volume is unbalanced, then complete the balancing manually by running

 # volume.reiser4 -b /mnt

Check "bricks total" of your LV in the output of

 # volume.reiser4 /mnt

Compare it with the old number of bricks in the configuration. The new value should be the old one incremented by 1. If the number of bricks is the same, then your operation of adding a brick was completely rolled back by the transaction manager, so you need to repeat it from scratch. Otherwise, your operation was successfully completed - update your volume configuration accordingly.

== Brick removal was interrupted ==

Check your volume configuration. Register the old set of bricks (that is, the set of bricks that the volume had before the interrupted operation) except the brick you wanted to remove. Try to mount the volume. In the case of error, register also the brick you wanted to remove and try to mount again. Check the status of your LV:

 # volume.reiser4 /mnt

If the volume is unbalanced, then complete the balancing manually by running

 # volume.reiser4 -b /mnt

Comment. After successful completion of the balancing the brick will be automatically removed from the volume. Make sure of it by checking the status of your LV:

 # volume.reiser4 /mnt

Update your volume configuration accordingly.

== Another volume operation was interrupted ==

Using the volume configuration, register the new set of bricks and try to mount the volume. The mount should be successful.
Check the status of your LV:

 # volume.reiser4 /mnt

If the volume is unbalanced, then complete the balancing manually by running

 # volume.reiser4 -b /mnt

= LV monitoring =

Common info about the LV mounted at /mnt:

 # volume.reiser4 /mnt

 ID:             Volume UUID
 volume:         ID of the plugin managing the volume
 distribution:   ID of the distribution plugin
 stripe:         Stripe size in bytes
 segments:       Number of hash space segments (for distribution)
 bricks total:   Total number of bricks in the volume
 bricks in DSA:  Number of bricks participating in data distribution
 balanced:       Balanced status of the volume

Info about its brick of index J:

 # volume.reiser4 -p J /mnt

 internal ID:    Brick's "internal ID" and its status in the volume
 external ID:    Brick's UUID
 device name:    Name of the block device associated with the brick
 block count:    Size of the block device in blocks
 blocks used:    Total number of occupied blocks on the device
 system blocks:  Minimal possible number of busy blocks on that device
 data capacity:  Abstract capacity of the brick
 space usage:    Portion of occupied blocks on the device

Comment. When retrieving a brick's info make sure that no volume operations over that volume are in progress. Otherwise the command above will return an error (EBUSY).

WARNING. Brick info obtained this way is not necessarily the most recent one. To get up-to-date info, run sync(1) and make sure that no regular file operations are in progress.

= Checking free space =

To check the number of available free blocks on a volume mounted at /mnt, make sure that no regular file operations, as well as no volume operations, are in progress on that volume, then run

 # sync
 # df --block-size=4K /mnt

To check the number of free blocks on the brick of index J, run

 # volume.reiser4 -p J /mnt

and calculate the difference between "block count" and "blocks used".

Comment. Not all free blocks on a brick/volume are available for use.
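The arithmetic can be sketched in a few lines of Python. The numbers below are made up for illustration; "block count" and "blocks used" are the volume.reiser4 fields, and the ~5% reservation is the one explained next.

```python
# Sketch: free vs. available blocks on a brick, computed from the
# volume.reiser4 fields "block count" and "blocks used". Reiser4 keeps
# a ~5% reserve of the free blocks, hence the subtraction below.

def free_blocks(block_count: int, blocks_used: int) -> int:
    """Total free blocks: what volume.reiser4 lets you derive."""
    return block_count - blocks_used

def available_blocks(block_count: int, blocks_used: int) -> int:
    """Free blocks minus the ~5% reserve: roughly what df(1) reports."""
    free = free_blocks(block_count, blocks_used)
    return free - free * 5 // 100

# Made-up numbers for illustration:
print(free_blocks(2621440, 1657203))       # → 964237
print(available_blocks(2621440, 1657203))  # → 916026
```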
The number of available free blocks is always ~95% of the total number of free blocks (Reiser4 reserves 5% to make sure that regular file truncate operations won't fail).

NOTE: volume.reiser4 shows the total number of free blocks, whereas df(1) shows the number of available free blocks. The "space usage" statistic shows the portion of busy blocks on an individual brick. For the reasons explained above, "space usage" on any brick can not be more than 0.95.

= Checking quality of data distribution =

Quality of data distribution is a measure of the deviation of the real data space usage from the ideal one defined by the volume partitioning. The smaller the deviation, the better the distribution quality. Checking the quality of distribution makes sense only when your volume partitioning is space-based, or coincides with the space-based one. If your partitioning is throughput-based and doesn't coincide with the space-based one, then the quality of the actual data distribution can be rather bad: in that case the file system takes care that low-performance devices don't become a bottleneck, and effective space usage is not a high priority.

Checking the quality of data distribution is based on the free blocks accounting provided by the file system. Note that the file system doesn't count busy data and meta-data blocks separately, so you are not able to find the real data space usage, and hence to check the quality of distribution, in the case when the meta-data brick contains data blocks.

To check the quality of distribution:

* make sure that the meta-data brick doesn't contain data blocks;
* make sure that no regular file and volume operations are currently in progress;
* find the "blocks used", "system blocks" and "data capacity" statistics for each data brick:

 # sync
 # volume.reiser4 -p 1 /mnt
 ...
 # volume.reiser4 -p N /mnt

* find the real data space usage on each brick;
* calculate the partitioning and the ideal data space usage on each data brick;
* find the deviation of the real usage from the ideal one.

Example.
Let's build an LV of 3 bricks (one 10G meta-data brick vdb1 and two data bricks: vdc1 (10G), vdd1 (5G)) with space-based partitioning:

 # VOL_ID=`uuid -v4`
 # echo "Using uuid $VOL_ID"
 # mkfs.reiser4 -U $VOL_ID -y -t 256K /dev/vdb1
 # mkfs.reiser4 -U $VOL_ID -y -a -t 256K /dev/vdc1
 # mkfs.reiser4 -U $VOL_ID -y -a -t 256K /dev/vdd1
 # mount /dev/vdb1 /mnt

Fill the meta-data brick with data:

 # dd if=/dev/zero of=/mnt/myfile bs=256K
 No space left on device...

Add the data bricks /dev/vdc1 and /dev/vdd1 to the volume:

 # volume.reiser4 -a /dev/vdc1 /mnt
 # volume.reiser4 -a /dev/vdd1 /mnt

Move all data blocks to the newly added bricks:

 # volume.reiser4 -r /dev/vdb1 /mnt
 # sync

Now the meta-data brick doesn't contain data blocks (only meta-data ones), so we can calculate the quality of data distribution:

 # volume.reiser4 /mnt -p0
 blocks used: 503
 # volume.reiser4 /mnt -p1
 blocks used: 1657203
 system blocks: 115
 data capacity: 2621069
 # volume.reiser4 /mnt -p2
 blocks used: 833001
 system blocks: 73
 data capacity: 1310391

Based on the statistics above, calculate the quality of distribution.

Total data capacity of the volume:

 C = 2621069 + 1310391 = 3931460

Relative capacities of the data bricks:

 C1 = 2621069 / 3931460 = 0.6667
 C2 = 1310391 / 3931460 = 0.3333

Real space usage on the data bricks (blocks used - system blocks):

 R1 = 1657203 - 115 = 1657088
 R2 = 833001 - 73 = 832928

Space usage on the volume:

 R = R1 + R2 = 1657088 + 832928 = 2490016

Ideal data space usage on the data bricks:

 I1 = C1 * R = 0.6667 * 2490016 = 1660094
 I2 = C2 * R = 0.3333 * 2490016 = 829922

Deviation:

 (D1, D2) = (R1, R2) - (I1, I2) = (-3006, 3006)

Relative deviation:

 (D1, D2)/R = (-0.0012, 0.0012)

Quality of distribution:

 Q = 1 - max(|D1|, |D2|)/R = 1 - 0.0012 = 0.9988

Comment. For any specified number of bricks N and quality of distribution Q it is possible to find a configuration of a logical volume composed of N bricks, such that the quality of distribution on that volume is better than Q.
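The worked example can be cross-checked with a short script. It is a sketch that just redoes the arithmetic on the statistics printed above, without the intermediate rounding of C1 and C2 to four digits:

```python
# Recompute relative capacities, ideal usage, deviation and quality Q
# from the brick statistics in the example above (vdc1 and vdd1).

bricks = [  # (blocks used, system blocks, data capacity)
    (1657203, 115, 2621069),  # /dev/vdc1
    (833001,   73, 1310391),  # /dev/vdd1
]

capacity_total = sum(cap for _, _, cap in bricks)
real = [used - system for used, system, _ in bricks]                 # R1, R2
used_total = sum(real)                                               # R
ideal = [cap / capacity_total * used_total for _, _, cap in bricks]  # I1, I2
deviation = [r - i for r, i in zip(real, ideal)]                     # D1, D2
quality = 1 - max(abs(d) for d in deviation) / used_total            # Q

print(real)               # → [1657088, 832928]
print(round(quality, 4))  # → 0.9988
```

Without the rounding of the relative capacities the deviation comes out slightly smaller than in the hand calculation, but Q still rounds to 0.9988.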
Comment. The quality of distribution Q doesn't depend on the number of bricks in the logical volume. This is a theorem, which can be strictly proven.

= FAQ =

Q. What happens if I lose a device-component of my logical volume (due to a breakdown, etc.)?

A. The bodies of some of your regular files will become "punched" in random places. The portion of such files depends on the relative capacity of the lost brick, on the number of bricks in the logical volume, and on other factors. Fsck will be able to detect and remove the files with corrupted bodies. Nevertheless, we recommend considering mirroring of your bricks (e.g. with software or hardware RAID-1) to avoid such highly unpleasant situations.
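As a rough illustration of the answer above (an idealized model of our own, not something Reiser5 computes): if stripes are placed fairly and independently, a file of k stripes survives the loss of a brick of relative capacity c with probability about (1 - c)^k.

```python
# Idealized sketch: chance that a regular file keeps its body intact
# after a brick is lost, assuming fair, independent stripe placement.

def survival_probability(relative_capacity: float, stripes_per_file: int) -> float:
    """Probability that none of the file's stripes lived on the lost brick."""
    return (1 - relative_capacity) ** stripes_per_file

# Losing a brick that held 1/3 of the data: a 4-stripe file survives
# with probability (2/3)**4, i.e. roughly 20%.
print(round(survival_probability(1 / 3, 4), 3))  # → 0.198
```

This is why the portion of damaged files grows with both the lost brick's relative capacity and the file size, as the answer states.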
We make the user responsible for this. Volume configuration is needed to facilitate deploying a volume. '''Abstract capacity''' (or simply capacity) of a brick is a positive integer number. Capacity is a brick's property defined by user. Don't confuse it with the size of block device. Think of it as of brick's "weight" in some units. And this is the user, who decides, which property of the brick to assign as its abstract capacity and in which units. In particular, it can be size of the block device in kilobytes, or its size in megabytes, or its throughput in M/sec, or other geometric or physical parameter of the device, associated with the brick. It is important that capacities of all bricks of the same logical volume are measured in the same units. Also, it would be utterly pointless to assign different properties as abstract capacities for bricks of the same LV. For example, size of block device for one brick, and disk bandwidth for another one. Capacity of each brick gets initialized by mkfs utility. By default it is calculated as number of free blocks on the device at the very end of the formatting procedure. For meta-data brick it is calculated as 70% of such amount. Capacity of any brick can be changed on-line by user. '''Capacity of a logical volume''' is defined as a sum of capacities of its bricks-components. '''Relative capacity of a brick''' is the ratio of brick's capacity to volume's capacity. Relative capacity defines a portion of IO-requests that will be issued against that brick. Array of relative capacities (C1, C2, ...) of all bricks is called volume partitioning. Obviously, C1 + C2 + ... = 1. '''(Real) data space usage''' on a brick is number of data blocks, stored on that brick. '''Ideal (or expected) data space usage''' on a brick is T*C, where T is total number of data blocks stored in the volume. C is relative capacity of the brick. 
It is recommended to compose volumes in the way so that space-based partitioning coincides with throughput-based one - it would be the optimal volume configuration, which provides true parallelism. If it is impossible for some reason, then choose a preferred partitioning method (space-based, or throughput-based). Note that space-based partitioning saves volume space, whereas throughput based one saves volume throughput. When performing regular file operations, Reiser5 distributes data stripes throughout the volume evenly and fairly. It means that portion of IO-requests issued against each brick is equal to its relative capacity, that is, to the portion of capacity that the brick adds to the total volume's capacity. Most volume operations are accompanied by rebalancing, which keeps fairness of distribution. For example, adding a brick to a logical volume changes its partitioning, and hence, breaks fairness of the distribution, so we need to move some data stripes to the new brick to make distribution fair. Also you can not simply remove a brick from a logical volume - all data stripes should be moved from that brick to other bricks of the logical volume. Every time when user performs a volume operation, Reiser5 marks LV as "not balanced". After successful balancing the status of LV is changed to "balanced". If balancing procedure fails for some reasons, it should be resumed manually (with volume.reiser4 utility). It is allowed to perform regular file operations on not balanced LV. However, in this case: a) we don't guarantee a good quality of data distribution on your LV. b) you won't be able to perform volume operations on your LV except balancing - any other volume operation will return error (EBUSY). So, don't forget to bring your LV to the balanced state as soon as possible! = Prepare Software and Hardware = Build, install and boot kernel with Reiser4 of software framework release number 5.X.Y. 
Kernel patches can be found [https://sourceforge.net/projects/reiser4/files/v5-unstable/ here]. Note that by Linux kernel and GNU utilities the testing stuff is still recognized as "Reiser4". Make sure there is the following message in kernel logs: "Loading Reiser4 (Software Framework Release: 5.X.Y)" Build and install the latest [https://sourceforge.net/projects/reiser4/files/reiser4-utils/libaal/ libaal] Download, build and install the latest version 2.A.B of [https://sourceforge.net/projects/reiser4/files/v5-unstable/ Reiser4progs package]. Make sure that utility for managing logical volumes is installed (as a part of reiser4progs package) on your machine: # volume.reiser4 -? = Creating a logical volume = Start from choosing a unique ID (uuid) of your volume. By default it is generated by mkfs utility. However, user can generate it himself by proper tools (e.g. uuid(1)) and store in an environment variable for convenience: # VOL_ID=`uuid -v4` # echo "Using uuid $VOL_ID" Choose a stripe size for your logical volume. For a good quality of distribution it is recommended that stripe doesn't exceed 1/10000 of volume size. On the other hand, too small stripes will increase space consumption on your meta-data brick. In our example we choose stripe size 256K. Start from creating the first brick of your volume - meta-data brick, passing volume-ID and stripe size to mkfs.reiser4 utility: # mkfs.reiser4 -U $VOL_ID -t 256K /dev/vdb1 Currently only one meta-data brick per volume is supported, so it is recommended that size of block device for meta-data brick in not too small. In most cases it will be enough, if your meta-data brick is not smaller than 1/200 of maximal volume size. For example, 100G meta-data brick will be able to service ~20T logical volume. 
Mount your logical volume consisting of one meta-data brick: # mount /dev/vdb1 /mnt Find a record about your volume in the output of the following command: # volume.reiser4 -l Create configuration of your logical volume (its definition is above) and store it somewhere, but not on that volume! Your logical volume is now on-line and ready to use. You can perform regular file operations and volume operations (e.g. add a data brick to your LV). = Adding a data brick to LV = At any time you are able to add a data brick to your LV. You can do it in parallel with regular file operations executing on this volume. Make sure, however, that there is no other volume operations (e.g. removing a brick) over your volume in progress, otherwise your operation will fail with EBUSY. Obviously, adding a brick will increase capacity of your volume. Choose a block device for the new data brick. Make sure that it is not too large, or too small. Capacities of any 2 bricks of the same logical volume can not differ more than 2^19 (~1 million) times. E.g. your logical volume can not contain both, 1M and 2T bricks. Any attempts to add a brick of improper capacity will fail with error. Format it by the same way as meta-data brick, but specify also "-a" option (to let mkfs know that it is data brick). # mkfs.reiser4 -U $VOL_ID -t 256K -a /dev/vdb2 Important: make sure you specified the same volume ID and stripe size as other bricks of the logical volume do have. Otherwise, operation of adding a data brick will fail. Update configuration of your volume with UUID or name of the brick you want to add (item #4). To add a brick simply pass its name as an argument for the option "-a" and specify your LV via its mount point: # volume.reiser4 -a /dev/vdb2 /mnt The procedure of adding a brick automatically invokes re-balancing, which moves a portion of data stripes to the newly added brick (so that the resulted distribution will fair). 
Portion of data blocks moved during such rebalancing is equal to the relative capacity of the new brick, that is to the portion of capacity that the new brick adds to updated LV's capacity. This important property defines the cost of balancing procedure. If the portion of capacity added by a brick is small, then number of stripes moved during balancing is also small. Like other user-space utilities, the operation of adding a brick can return error, even in the assumption that the brick you wanted to add is properly formatted. In this case check the status of your LV: # volume.reiser4 /mnt If the volume is unbalanced, then simply complete balancing manually: # volume.reiser4 -b /mnt Otherwise, check number of bricks in your LV. Most likely that it is the same as it was before the failed operation. In this case simply repeat the operation of adding a brick from scratch. Upon successful completion update your volume configuration. That is, increment (#2), add info about the new brick to (#3) and remove records at (#4). = Removing a data brick from LV = At any time you are able to remove a data brick from your LV. You can do it in parallel with regular file operations executing on this volume. Make sure, however, that there is no other volume operations (e.g. adding a brick) over your volume in progress, otherwise your operation will fail with EBUSY. Obviously, removing a brick will decrease abstract capacity of your LV. Note that other bricks should have enough space to store all data blocks of the brick you want to remove, otherwise, the removal operation will return error (ENOSPC). Suppose you want to remove brick /dev/vdb2 from your LV mounted at /mnt. Update your volume configuration with the UUID and name of the brick you want to remove (#item #4). 
To remove a brick simply pass its name as an argument for option "-r" and specify the logical volume by its mount point: # volume.reiser4 -r /dev/vdb2 /mnt The procedure of brick removal automatically invokes re-balancing, which distributes data of the brick to be removed among other bricks, so that resulted distribution is also fair. Portion of data stripes moved during such rebalancing is equal to the relative capacity of the brick to be removed (that it to the portion of capacity that the brick added to LV's capacity). It can happen, that the command above completes with error (like other user-space applications). In this case check the status of your LV: # volume.reiser4 /mnt If volume is not balanced, then simply complete balancing manually: # volume.reiser4 -b /mnt Otherwise, check the number of the bricks in your logical volume - it should be the same as before the failed operation. The error -ENOSPC indicates that free space on other bricks is not enough to fit all the data of the brick you want to remove. On success update your volume configuration: remove information about the removed brick at #3 and #4. = Changing brick's capacity = At any time (in the assumption that no other volume operation is in progress) you can change abstract capacity of any brick to some new value, different from 0. Changing capacity always changes volume partitioning, and therefore, breaks fairness of distribution, so Reiser5 automatically launches rebalancing to make sure that resulted distribution is fair for the new set of capacities. In particular, increasing bricks capacity will move some data from other bricks to the brick, whose capacity was increased. Decreasing bricks capacity will move some data from the brick, whose capacity was decreased, to other bricks. To change abstract capacity of a brick /dev/vdb1 to a new value (e.g. 
200000), simply run # volume.reiser4 -z /dev/vdb1 -c 200000 /mnt Pronounced as "resize brick /dev/vdb1 to new capacity 200000 in volume mounted at /mnt". The operation of changing capacity can return error. Most likely, it is -ENOSPC, which is a side effect of concurrent regular file writes. In this case check the status of your LV. If it is unbalanced, then consider removing some files from your LV and complete balancing by running # volume.reiser4 -b /mnt Otherwise, repeat the operation from scratch. Comment. Changing bricks capacity to 0 is undefined and will return error. Consider brick removal operation instead. = Operations with meta-data brick = Meta-data brick can also contain data stripes and participate in data distribution like other data bricks. So that all the volume operations described above are also applicable to meta-data brick. Note, however, that it is impossible to completely remove meta-data brick from the logical volume for obvious reasons (meta-data need to be stored somewhere), so brick removal operation applied to the meta-data brick actually removes it from Data-Storage Array (DSA), not from the logical volume. DSA is a subset of LV consisting of bricks, participating in data distribution. Once you remove meta-data brick from DSA, that brick will be used only to store meta-data. Operation of adding a brick, being applied to a meta-data brick, returns the last one back to DSA. Important: Reiser5 doesn't count busy data and meta-data blocks separately. So in contrast with data bricks (which contain only data) you are not able to find out real space occupied by data blocks on the meta-data brick - Reiser5 knows only total space occupied. To check the status of meta-data brick simply run # volume.reiser4 /mnt and compare values of "bricks total" and "bricks in DSA". If they are equal, then meta-data brick participates in data distribution. 
Otherwise, "bricks total" should be 1 more than "bricks in DSA" - it indicates that meta-data brick doesn't participate in data distribution (and therefore, doesn't contain data blocks). Note that other cases are impossible: for data bricks participation in LV and DSA is always equivalent. = Unmounting a logical volume = To terminate a mount session just issue usual umount command with the mount point specified. Note that after unmounting the volume all bricks by default remain to be registered in the system till system shutdown. If you want to unregister a brick before system shutdown, then simply issue the following command: # volume.reiser4 -u BRICK_NAME = Deploying a logical volume after correct unmount = Make sure (by checking your volume configuration) that all bricks of the volume are registered in the system. The list of all volumes and bricks registered in the system can be found in the output of the following command: # volume.reisrer4 -l Issue usual mount command against one of the bricks of your volume. It is recommended to specify meta-data brick in the mount command. If not all bricks of the volume are registered, then attempts to mount such volume will fail with a respective kernel message. NOTE: Reiser5 will refuse to mount a logical volume, in the case, when a wrong (incomplete or redundant) set of bricks is registered in the system. Redundant set of bricks appears, for example, when you mistakenly register a brick that was removed from the logical volume. = Deploying a logical volume after correct shutdown = To mount your LV, first, make sure that all its bricks (data and meta-data) are registered in the system. Important: Reiser5 will refuse to mount a logical volume, in the case, when a wrong (incomplete or redundant) set of bricks is registered in the system. Redundant set of bricks appears, for example, when you mistakenly register a brick that was removed from the logical volume. 
For this reasons we strongly recommend for user to keep a track of his LV - store its configuration somewhere, but not in this volume! And don't forget to update that configuration after _every_ volume operation. If you lost configuration of your LV and don't remember it (wich is most likely for large volumes), then it will be rather painful to restore it: currently there is no tools for to manage offline logical volumes. So that, users are prompted to do this on their own. It is not at all difficult. To register a brick in the system use the following command: # volume.reiser4 -g BRICK_NAME To print a list of all registered bricks use # volume.reiser4 -l To mount your LV just issue a mount command for any one brick of your LV. Comment. Reiser5 always tries to register the brick which is passed to the mount command as an argument, so it is not needed to preregister bricks you want to issue a mount command against. = Deploying a logical volume after hard reset or system crash = If no volume operations were interrupted by hard reset or system crash, then just follow the instructions in section 9. In Reiser5 only restricted number of bricks participate in every transaction. Maximal number of such bricks can be specified by user. At mount time a transaction replay procedure will be launched on each such brick independently in parallel. Depending on a kind of interrupted volume operation, perform one of the following actions: == Adding a brick was interrupted == Check your volume configuration. Register the old set of bricks (that is, the set of brick that the volume had before applying the operation) and try to mount. In the case of error register also the brick you wanted to add and try to mount again. 
Check the status of your LV by running # volume.reiser4 /mnt In the volume is unbalanced, then complete balancing manually by running # volume.reiser4 -b /mnt Check "bricks total" of your LV in the output of # volume.reiser4 /mnt Compare it with the old number of bricks in the configuration. The new value should be an increment of the old one. If the number of bricks is the same, then your operation of adding a brick was completely rolled back by the transaction manager, so that you need to repeat it from scratch. Otherwise, your operation was successfully completed - update your volume configuration respectively. == Brick removal was interrupted == Check your volume configuration. Register the old set of bricks (that is, the set of brick that volume had before applying the interrupted operation) except the brick you wanted to remove. Try to mount the volume. In the case of error register also the brick you wanted to remove and try to mount again. Check the status of your LV: # volume.reiser4 /mnt If the volume is unbalanced then complete balancing manually by running # volume.reiser4 -b /mnt Comment. After sucessful balancing completion the brick will be automatically removed form the volume. Make sure of it by checking status of your LV: # volume.reiser4 /mnt Update your volume configuration respectively. == Another volume operation was interrupted == Using the volume configuration, register the new set of bricks and try to mount the volume. The mount should be successful. 
Check the status of your LV: # volume.reiser4 /mnt If the volume is unbalanced then complete balancing manually by running # volume.reiser4 -b /mnt = LV monitoring = Common info about LV mounted at /mnt # volume.reiser4 /mnt ID: Volume UUID volume: ID of plugin managing the volume distribution: ID of distribution plugin stripe: Stripe size in bytes segments: Number of hash space segments (for distribution) bricks total: Total number of bricks in the volume bricks in DSA: Number of bricks participating in data distribution balanced: Balanced status of the volume Info about any its brick of index J # volume.reiser4 -p J /mnt internal ID: Brick's "internal ID" and its status in the volume external ID: Brick's UUID device name: Name of the block device associated with the brick block count: Size of the block device in blocks blocks used: Total number of occupied blocks on the device system blocks: Minimal possible number of busy blocks on that device data capacity: Abstract capacity of the brick space usage: Portion of occupied blocks on the device Comment. When retrieving brick's info make sure that no volume operations over that volume are in progress. Otherwise the command above will return error (EBUSY). WARNING. Bricks info provided by such way is not necessarily the most recent one. To get an actual info run sync(1) and make sure that no regular file operations are in progress. = Checking free space = To check number of available free blocks on a volume mounted at /mnt, make sure that no regular file operations, as well as volume operations, are in progress on that volume, then run # sync # df --block-size=4K /mnt To check number of free blocks on the brick of index J run # volume.reiser4 -p J /mnt Then calculate the difference between block count and blocks used Comment. Not all free blocks on a brick/volume are available for use. 
Number of available free blocks is always ~95% of total number of free blocks (Reiser4 reserves 5% to make sure that regular file truncate operations won't fail). NOTE: volume.reiser4 shows total number of free blocks, whereas df(1) shows number of available free blocks. "Space usage" statistics shows a portion of busy blocks on individual brick. For the reasons explained above "space usage" on any brick can not be more than 0.95 = Checking quality of data distribution = Quality of data distribution is a measure of deviation of the real data space usage from the ideal one defined by volume partitioning. The smaller the deviation, the better the distribution quality. Checking quality of distribution makes sense only in the case when your volume partitioning is space-based, or if it coincides with the space-based one. If your partitioning is throughput-based, and it doesn't coincide with the space-based one, then quality of actual data distribution can be rather bad, as in this case the file system is worried for low-performance devices to not become a bottleneck, and effective space usage in this case is not a high priority. Checking quality of data distribution is based on the free blocks accounting, provided by the file system. Note that file system doesn't count busy data and meta-data blocks separately, so you are not able to find real data space usage, and hence to check quality of distribution in the case when meta-data brick contains data blocks. To check quality of distribution * make sure that meta-data brick doesn't contain data blocks; * make sure that no regular file and volume operations are currently in progress; * find "blocks used", "system blocks" and "data capacity" statistics for each data brick: # sync # volume.reiser4 -p 1 /mnt ... # volume.reiser4 -p N /mnt * find real data space usage on each brick; * calculate partitioning and ideal data space usage on each data brick; * find deviation of (4) from (5). Example. 
Let's build an LV of 3 bricks (one 10G meta-data brick vdb1, and two data bricks: vdc1 (10G) and vdd1 (5G)) with space-based partitioning:

 # VOL_ID=`uuid -v4`
 # echo "Using uuid $VOL_ID"
 # mkfs.reiser4 -U $VOL_ID -y -t 256K /dev/vdb1
 # mkfs.reiser4 -U $VOL_ID -y -a -t 256K /dev/vdc1
 # mkfs.reiser4 -U $VOL_ID -y -a -t 256K /dev/vdd1
 # mount /dev/vdb1 /mnt

Fill the meta-data brick with data:

 # dd if=/dev/zero of=/mnt/myfile bs=256K
 No space left on device...

Add data bricks /dev/vdc1 and /dev/vdd1 to the volume:

 # volume.reiser4 -a /dev/vdc1 /mnt
 # volume.reiser4 -a /dev/vdd1 /mnt

Move all data blocks to the newly added bricks:

 # volume.reiser4 -r /dev/vdb1 /mnt
 # sync

Now the meta-data brick doesn't contain data blocks (only meta-data ones), so we can calculate the quality of data distribution:

 # volume.reiser4 /mnt -p0
 blocks used: 503
 # volume.reiser4 /mnt -p1
 blocks used: 1657203
 system blocks: 115
 data capacity: 2621069
 # volume.reiser4 /mnt -p2
 blocks used: 833001
 system blocks: 73
 data capacity: 1310391

Based on the statistics above, calculate the quality of distribution.

Total data capacity of the volume:

 C = 2621069 + 1310391 = 3931460

Relative capacities of the data bricks:

 C1 = 2621069 / 3931460 = 0.6667
 C2 = 1310391 / 3931460 = 0.3333

Real space usage on the data bricks (blocks used - system blocks):

 R1 = 1657203 - 115 = 1657088
 R2 = 833001 - 73 = 832928

Space usage on the volume:

 R = R1 + R2 = 1657088 + 832928 = 2490016

Ideal data space usage on the data bricks:

 I1 = C1 * R = 0.6667 * 2490016 = 1660094
 I2 = C2 * R = 0.3333 * 2490016 = 829922

Deviation:

 D = (R1 - I1, R2 - I2) = (-3006, 3006)

Relative deviation:

 D/R = (-0.0012, 0.0012)

Quality of distribution:

 Q = 1 - max(|D1|, |D2|)/R = 1 - 0.0012 = 0.9988

Comment. For any specified number of bricks N and quality of distribution Q, it is possible to find a configuration of a logical volume composed of N bricks such that the quality of distribution on that volume is better than Q.
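The arithmetic of the example above can be scripted. A minimal sketch using awk, with the brick statistics from the example hard-coded:

```shell
#!/bin/sh
# Quality of distribution for the two data bricks in the example above.
# Inputs per brick: data capacity, blocks used, system blocks.
Q=$(awk 'BEGIN {
    cap1 = 2621069; used1 = 1657203; sys1 = 115
    cap2 = 1310391; used2 =  833001; sys2 = 73
    C  = cap1 + cap2                        # total data capacity
    R1 = used1 - sys1; R2 = used2 - sys2    # real usage per brick
    R  = R1 + R2                            # data blocks in the volume
    I1 = R * cap1 / C; I2 = R * cap2 / C    # ideal usage per brick
    d1 = (R1 > I1 ? R1 - I1 : I1 - R1) / R  # relative deviations
    d2 = (R2 > I2 ? R2 - I2 : I2 - R2) / R
    printf "%.4f", 1 - (d1 > d2 ? d1 : d2)
}')
echo "quality of distribution: $Q"
```

Substituting the "blocks used", "system blocks" and "data capacity" figures of your own bricks gives the quality of distribution of your volume.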
Quality of distribution Q doesn't depend on the number of bricks in the logical volume. This is a theorem, which can be strictly proven.

= FAQ =

Q. What happens if I lose a device-component of my logical volume (due to a breakdown, etc.)?

A. The bodies of some of your regular files will become "punched" in random places. The portion of such files depends on the relative capacity of the lost brick, on the number of bricks in the logical volume, and on other factors. Fsck will be able to detect and remove files with corrupted bodies. Nevertheless, we recommend considering mirroring your bricks (e.g. by software or hardware RAID-1) to avoid such highly unpleasant situations.

A logical volume (LV) can be composed of any number of block devices differing in physical and geometric parameters. However, the optimal configuration (true parallelism) imposes some restrictions and dependencies on the sizes of such devices.

WARNING: This stuff is not stable. Don't put important data on logical volumes managed by software of release number 5.X.Y. Also, don't mount your old partitions in kernels with Reiser4 of SFRN 5.X.Y before its stabilization.

IMPORTANT: Currently there are no tools to manage Reiser5 logical volumes off-line, so it is strongly recommended to save/update the configuration of your LV in a file which doesn't belong to that volume.

= Basic definitions. Volume configuration. Brick's capacity. Partitioning. Fair distribution. Balancing =

The basic configuration of a logical volume is the following information:

# Volume UUID;
# Number of bricks in the volume;
# List of brick names or UUIDs in the volume;
# UUID or name of the brick being added/removed (if any). That brick is not counted in (2) and (3).

For each volume, its configuration should be stored somewhere (but not on that volume!)
and properly updated before and after each volume operation performed on that volume. We make the user responsible for this. The volume configuration is needed to facilitate deploying a volume.

'''Abstract capacity''' (or simply capacity) of a brick is a positive integer number. Capacity is a brick property defined by the user. Don't confuse it with the size of the block device - think of it as the brick's "weight" in some units. It is the user who decides which property of the brick to assign as its abstract capacity, and in which units. In particular, it can be the size of the block device in kilobytes, or its size in megabytes, or its throughput in M/sec, or another geometric or physical parameter of the device associated with the brick. It is important that the capacities of all bricks of the same logical volume are measured in the same units. Also, it would be utterly pointless to assign different properties as abstract capacities for bricks of the same LV - for example, block device size for one brick and disk bandwidth for another.

The capacity of each brick gets initialized by the mkfs utility. By default it is calculated as the number of free blocks on the device at the very end of the formatting procedure. For the meta-data brick it is calculated as 70% of that amount. The capacity of any brick can be changed on-line by the user.

'''Capacity of a logical volume''' is defined as the sum of the capacities of its component bricks.

'''Relative capacity of a brick''' is the ratio of the brick's capacity to the volume's capacity. Relative capacity defines the portion of IO-requests that will be issued against that brick. The array of relative capacities (C1, C2, ...) of all bricks is called the volume partitioning. Obviously, C1 + C2 + ... = 1.

'''(Real) data space usage''' on a brick is the number of data blocks stored on that brick.

'''Ideal (or expected) data space usage''' on a brick is T*C, where T is the total number of data blocks stored in the volume and C is the relative capacity of the brick.
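The IMPORTANT note above leaves maintaining the volume configuration to the user. As a sketch of keeping such a configuration in a file outside the volume - the file format, paths and UUID below are hypothetical examples, not anything defined by Reiser5:

```shell
#!/bin/sh
# Record the LV configuration items listed above in a file that must
# NOT live on the volume itself. All values are example placeholders.
CONF="${CONF:-./lv-backup.conf}"   # example path; keep it off the LV

save_config() {
    vol_uuid=$1; shift             # remaining arguments: brick names
    {
        echo "volume-uuid: $vol_uuid"
        echo "bricks: $#"
        for brick in "$@"; do
            echo "brick: $brick"
        done
        echo "pending: none"       # brick being added/removed, if any
    } > "$CONF"
}

# Re-run this before and after every volume operation, e.g.:
save_config 11111111-2222-3333-4444-555555555555 /dev/vdb1 /dev/vdc1
```

Having the brick list in one flat file makes it easy to re-register the bricks when deploying the volume later.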
It is recommended to compose volumes so that the space-based partitioning coincides with the throughput-based one - this is the optimal volume configuration, which provides true parallelism. If that is impossible for some reason, then choose a preferred partitioning method (space-based or throughput-based). Note that space-based partitioning saves volume space, whereas throughput-based partitioning saves volume throughput.

When performing regular file operations, Reiser5 distributes data stripes throughout the volume evenly and fairly. This means that the portion of IO-requests issued against each brick is equal to its relative capacity, that is, to the portion of capacity that the brick contributes to the total volume capacity.

Most volume operations are accompanied by rebalancing, which maintains fairness of distribution. For example, adding a brick to a logical volume changes its partitioning and hence breaks fairness of the distribution, so some data stripes have to be moved to the new brick to make the distribution fair again. Likewise, you can not simply remove a brick from a logical volume - all data stripes first have to be moved from that brick to the other bricks of the logical volume.

Every time the user performs a volume operation, Reiser5 marks the LV as "not balanced". After successful balancing the status of the LV is changed back to "balanced". If the balancing procedure fails for some reason, it should be resumed manually (with the volume.reiser4 utility). It is allowed to perform regular file operations on an unbalanced LV. However, in this case:

a) we don't guarantee a good quality of data distribution on your LV;

b) you won't be able to perform volume operations on your LV except balancing - any other volume operation will return an error (EBUSY).

So don't forget to bring your LV to the balanced state as soon as possible!

= Prepare Software and Hardware =

Build, install and boot a kernel with Reiser4 of software framework release number 5.X.Y.
Kernel patches can be found [https://sourceforge.net/projects/reiser4/files/v5-unstable/ here]. Note that the Linux kernel and GNU utilities still recognize the testing stuff as "Reiser4". Make sure there is the following message in the kernel logs: "Loading Reiser4 (Software Framework Release: 5.X.Y)"

Build and install the latest [https://sourceforge.net/projects/reiser4/files/reiser4-utils/libaal/ libaal].

Download, build and install the latest version 2.A.B of the [https://sourceforge.net/projects/reiser4/files/v5-unstable/ Reiser4progs package]. Make sure that the utility for managing logical volumes is installed (as a part of the reiser4progs package) on your machine:

 # volume.reiser4 -?

= Creating a logical volume =

Start by choosing a unique ID (UUID) for your volume. By default it is generated by the mkfs utility. However, the user can generate it himself with suitable tools (e.g. uuid(1)) and store it in an environment variable for convenience:

 # VOL_ID=`uuid -v4`
 # echo "Using uuid $VOL_ID"

Choose a stripe size for your logical volume. For a good quality of distribution it is recommended that the stripe doesn't exceed 1/10000 of the volume size. On the other hand, too small a stripe will increase space consumption on your meta-data brick. In our example we choose a stripe size of 256K.

Start by creating the first brick of your volume - the meta-data brick - passing the volume ID and stripe size to the mkfs.reiser4 utility:

 # mkfs.reiser4 -U $VOL_ID -t 256K /dev/vdb1

Currently only one meta-data brick per volume is supported, so it is recommended that the block device for the meta-data brick is not too small. In most cases it will be enough if your meta-data brick is not smaller than 1/200 of the maximal volume size. For example, a 100G meta-data brick will be able to service a ~20T logical volume.
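The two sizing rules above (stripe at most 1/10000 of the volume size, meta-data brick at least 1/200 of the volume size) can be sanity-checked with a little shell arithmetic. A sketch with hypothetical sizes:

```shell
#!/bin/sh
# Sanity-check the sizing rules from the text for a planned volume.
# All sizes are hypothetical examples, in bytes.
VOL_SIZE=$((10 * 1024 * 1024 * 1024 * 1024))   # planned ~10T volume
STRIPE=$((256 * 1024))                          # chosen 256K stripe
META=$((100 * 1024 * 1024 * 1024))              # planned 100G meta-data brick

MAX_STRIPE=$((VOL_SIZE / 10000))   # stripe should not exceed this
MIN_META=$((VOL_SIZE / 200))       # meta-data brick should not be below this

[ "$STRIPE" -le "$MAX_STRIPE" ] && STRIPE_OK=yes || STRIPE_OK=no
[ "$META" -ge "$MIN_META" ] && META_OK=yes || META_OK=no
echo "stripe ok: $STRIPE_OK, meta-data brick ok: $META_OK"
```

Both rules are recommendations from this document, not hard limits enforced by the tools.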
Mount your logical volume, consisting so far of one meta-data brick:

 # mount /dev/vdb1 /mnt

Find the record about your volume in the output of the following command:

 # volume.reiser4 -l

Create the configuration of your logical volume (its definition is above) and store it somewhere - but not on that volume!

Your logical volume is now on-line and ready to use. You can perform regular file operations and volume operations (e.g. add a data brick to your LV).

= Adding a data brick to LV =

At any time you are able to add a data brick to your LV. You can do it in parallel with regular file operations executing on this volume. Make sure, however, that no other volume operation (e.g. removing a brick) is in progress on your volume, otherwise your operation will fail with EBUSY. Obviously, adding a brick will increase the capacity of your volume.

Choose a block device for the new data brick. Make sure that it is not too large or too small: the capacities of any 2 bricks of the same logical volume can not differ by more than 2^19 (~500,000) times. E.g. your logical volume can not contain both 1M and 2T bricks. Any attempt to add a brick of improper capacity will fail with an error.

Format it the same way as the meta-data brick, but also specify the "-a" option (to let mkfs know that it is a data brick):

 # mkfs.reiser4 -U $VOL_ID -t 256K -a /dev/vdb2

Important: make sure you specified the same volume ID and stripe size as the other bricks of the logical volume have. Otherwise, the operation of adding the data brick will fail.

Update the configuration of your volume with the UUID or name of the brick you want to add (item #4). To add the brick, simply pass its name as an argument to the option "-a" and specify your LV via its mount point:

 # volume.reiser4 -a /dev/vdb2 /mnt

The procedure of adding a brick automatically invokes re-balancing, which moves a portion of data stripes to the newly added brick (so that the resulting distribution is fair).
The portion of data blocks moved during such rebalancing is equal to the relative capacity of the new brick, that is, to the portion of capacity that the new brick adds to the updated LV's capacity. This important property defines the cost of the balancing procedure: if the portion of capacity added by a brick is small, then the number of stripes moved during balancing is also small.

Like other user-space utilities, the operation of adding a brick can return an error, even assuming that the brick you wanted to add is properly formatted. In this case check the status of your LV:

 # volume.reiser4 /mnt

If the volume is unbalanced, then simply complete balancing manually:

 # volume.reiser4 -b /mnt

Otherwise, check the number of bricks in your LV. Most likely it is the same as it was before the failed operation. In this case simply repeat the operation of adding a brick from scratch.

Upon successful completion update your volume configuration. That is, increment (#2), add info about the new brick to (#3) and remove the record at (#4).

= Removing a data brick from LV =

At any time you are able to remove a data brick from your LV. You can do it in parallel with regular file operations executing on this volume. Make sure, however, that no other volume operation (e.g. adding a brick) is in progress on your volume, otherwise your operation will fail with EBUSY. Obviously, removing a brick will decrease the abstract capacity of your LV. Note that the other bricks should have enough space to store all the data blocks of the brick you want to remove, otherwise the removal operation will return an error (ENOSPC).

Suppose you want to remove the brick /dev/vdb2 from your LV mounted at /mnt. Update your volume configuration with the UUID and name of the brick you want to remove (item #4).
To remove the brick, simply pass its name as an argument to the option "-r" and specify the logical volume by its mount point:

 # volume.reiser4 -r /dev/vdb2 /mnt

The procedure of brick removal automatically invokes re-balancing, which distributes the data of the brick being removed among the other bricks, so that the resulting distribution is also fair. The portion of data stripes moved during such rebalancing is equal to the relative capacity of the brick being removed (that is, to the portion of capacity that the brick added to the LV's capacity).

It can happen that the command above completes with an error (like other user-space applications). In this case check the status of your LV:

 # volume.reiser4 /mnt

If the volume is not balanced, then simply complete balancing manually:

 # volume.reiser4 -b /mnt

Otherwise, check the number of bricks in your logical volume - it should be the same as before the failed operation. The error ENOSPC indicates that the free space on the other bricks is not enough to fit all the data of the brick you want to remove.

On success update your volume configuration: remove the information about the removed brick at #3 and #4.

= Changing brick's capacity =

At any time (assuming that no other volume operation is in progress) you can change the abstract capacity of any brick to some new value different from 0. Changing capacity always changes the volume partitioning and therefore breaks fairness of distribution, so Reiser5 automatically launches rebalancing to make sure that the resulting distribution is fair for the new set of capacities. In particular, increasing a brick's capacity will move some data from other bricks to the brick whose capacity was increased; decreasing a brick's capacity will move some data from the brick whose capacity was decreased to the other bricks.

To change the abstract capacity of a brick /dev/vdb1 to a new value (e.g.
200000), simply run

 # volume.reiser4 -z /dev/vdb1 -c 200000 /mnt

pronounced as "resize brick /dev/vdb1 to new capacity 200000 in the volume mounted at /mnt".

The operation of changing capacity can return an error. Most likely it is ENOSPC, which is a side effect of concurrent regular file writes. In this case check the status of your LV. If it is unbalanced, then consider removing some files from your LV and complete balancing by running

 # volume.reiser4 -b /mnt

Otherwise, repeat the operation from scratch.

Comment. Changing a brick's capacity to 0 is undefined and will return an error. Consider the brick removal operation instead.

= Operations with meta-data brick =

The meta-data brick can also contain data stripes and participate in data distribution like the data bricks, so all the volume operations described above are also applicable to the meta-data brick. Note, however, that it is impossible to completely remove the meta-data brick from the logical volume, for obvious reasons (meta-data need to be stored somewhere), so the brick removal operation applied to the meta-data brick actually removes it from the Data Storage Array (DSA), not from the logical volume. The DSA is the subset of the LV consisting of the bricks participating in data distribution. Once you remove the meta-data brick from the DSA, that brick will be used only to store meta-data. The operation of adding a brick, applied to the meta-data brick, returns it back to the DSA.

Important: Reiser5 doesn't count busy data and meta-data blocks separately. So, in contrast with data bricks (which contain only data), you are not able to find out the real space occupied by data blocks on the meta-data brick - Reiser5 knows only the total space occupied.

To check the status of the meta-data brick, simply run

 # volume.reiser4 /mnt

and compare the values of "bricks total" and "bricks in DSA". If they are equal, then the meta-data brick participates in data distribution.
Otherwise, "bricks total" should be 1 more than "bricks in DSA" - this indicates that the meta-data brick doesn't participate in data distribution (and therefore doesn't contain data blocks). Note that other cases are impossible: for data bricks, participation in the LV and in the DSA is always equivalent.

= Unmounting a logical volume =

To terminate a mount session, just issue the usual umount command with the mount point specified. Note that after unmounting the volume all bricks by default remain registered in the system until system shutdown. If you want to unregister a brick before system shutdown, then simply issue the following command:

 # volume.reiser4 -u BRICK_NAME

= Deploying a logical volume after correct unmount =

Make sure (by checking your volume configuration) that all bricks of the volume are registered in the system. The list of all volumes and bricks registered in the system can be found in the output of the following command:

 # volume.reiser4 -l

Issue the usual mount command against one of the bricks of your volume. It is recommended to specify the meta-data brick in the mount command. If not all bricks of the volume are registered, then the attempt to mount the volume will fail with a respective kernel message.

NOTE: Reiser5 will refuse to mount a logical volume when a wrong (incomplete or redundant) set of bricks is registered in the system. A redundant set of bricks appears, for example, when you mistakenly register a brick that was removed from the logical volume.

= Deploying a logical volume after correct shutdown =

To mount your LV, first make sure that all its bricks (data and meta-data) are registered in the system.

Important: Reiser5 will refuse to mount a logical volume when a wrong (incomplete or redundant) set of bricks is registered in the system. A redundant set of bricks appears, for example, when you mistakenly register a brick that was removed from the logical volume.
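The registration step can be driven from a saved configuration file. A dry-run sketch that only prints the commands it would run - the config format is the hypothetical one from the earlier configuration sketch, and only the documented `volume.reiser4 -g` registration command and a plain mount are assumed:

```shell
#!/bin/sh
# Print the registration commands for every brick recorded in a
# configuration file kept outside the volume (dry run, example data).
CONF="${CONF:-./lv-backup.conf}"

# A sample config file (hypothetical format and values):
cat > "$CONF" <<'EOF'
volume-uuid: 11111111-2222-3333-4444-555555555555
bricks: 2
brick: /dev/vdb1
brick: /dev/vdc1
pending: none
EOF

# Turn each "brick:" record into a registration command.
CMDS=$(sed -n 's/^brick: /volume.reiser4 -g /p' "$CONF")
printf '%s\n' "$CMDS"
echo "mount /dev/vdb1 /mnt"   # then mount via any registered brick
```

For a real deployment, replace the printing with actually executing the commands (e.g. pipe the output through `sh` after reviewing it).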
For this reason we strongly recommend that the user keep track of his LV - store its configuration somewhere, but not on this volume! And don't forget to update that configuration after _every_ volume operation. If you have lost the configuration of your LV and don't remember it (which is most likely for large volumes), then it will be rather painful to restore: currently there are no tools to manage logical volumes off-line, so users have to do this on their own. It is not at all difficult.

To register a brick in the system, use the following command:

 # volume.reiser4 -g BRICK_NAME

To print the list of all registered bricks, use

 # volume.reiser4 -l

To mount your LV, just issue a mount command for any one brick of your LV.

Comment. Reiser5 always tries to register the brick which is passed to the mount command as an argument, so there is no need to preregister the brick you are going to issue the mount command against.

= Deploying a logical volume after hard reset or system crash =

If no volume operations were interrupted by the hard reset or system crash, then just follow the instructions in section 9 ("Deploying a logical volume after correct unmount").

In Reiser5 only a restricted number of bricks participates in each transaction. The maximal number of such bricks can be specified by the user. At mount time a transaction replay procedure will be launched on each such brick independently, in parallel. Depending on the kind of interrupted volume operation, perform one of the following actions:

== Adding a brick was interrupted ==

Check your volume configuration. Register the old set of bricks (that is, the set of bricks that the volume had before applying the operation) and try to mount. In the case of an error, register also the brick you wanted to add and try to mount again.
Check the status of your LV by running

 # volume.reiser4 /mnt

If the volume is unbalanced, then complete balancing manually by running

 # volume.reiser4 -b /mnt

Check "bricks total" of your LV in the output of

 # volume.reiser4 /mnt

and compare it with the old number of bricks in the configuration. The new value should be the old one plus 1. If the number of bricks is the same, then your operation of adding a brick was completely rolled back by the transaction manager, and you need to repeat it from scratch. Otherwise, your operation was successfully completed - update your volume configuration accordingly.

== Brick removal was interrupted ==

Check your volume configuration. Register the old set of bricks (that is, the set of bricks that the volume had before applying the interrupted operation) except the brick you wanted to remove. Try to mount the volume. In the case of an error, register also the brick you wanted to remove and try to mount again. Check the status of your LV:

 # volume.reiser4 /mnt

If the volume is unbalanced, then complete balancing manually by running

 # volume.reiser4 -b /mnt

Comment. After successful completion of balancing the brick will be automatically removed from the volume. Make sure of it by checking the status of your LV:

 # volume.reiser4 /mnt

Update your volume configuration accordingly.

== Another volume operation was interrupted ==

Using the volume configuration, register the new set of bricks and try to mount the volume. The mount should be successful.
Check the status of your LV:

 # volume.reiser4 /mnt

If the volume is unbalanced, then complete balancing manually by running

 # volume.reiser4 -b /mnt
Number of available free blocks is always ~95% of total number of free blocks (Reiser4 reserves 5% to make sure that regular file truncate operations won't fail). NOTE: volume.reiser4 shows total number of free blocks, whereas df(1) shows number of available free blocks. "Space usage" statistics shows a portion of busy blocks on individual brick. For the reasons explained above "space usage" on any brick can not be more than 0.95 = Checking quality of data distribution = Quality of data distribution is a measure of deviation of the real data space usage from the ideal one defined by volume partitioning. The smaller the deviation, the better the distribution quality. Checking quality of distribution makes sense only in the case when your volume partitioning is space-based, or if it coincides with the space-based one. If your partitioning is throughput-based, and it doesn't coincide with the space-based one, then quality of actual data distribution can be rather bad, as in this case the file system is worried for low-performance devices to not become a bottleneck, and effective space usage in this case is not a high priority. Checking quality of data distribution is based on the free blocks accounting, provided by the file system. Note that file system doesn't count busy data and meta-data blocks separately, so you are not able to find real data space usage, and hence to check quality of distribution in the case when meta-data brick contains data blocks. To check quality of distribution * make sure that meta-data brick doesn't contain data blocks; * make sure that no regular file and volume operations are currently in progress; * find "blocks used", "system blocks" and "data capacity" statistics for each data brick: # sync # volume.reiser4 -p 1 /mnt ... # volume.reiser4 -p N /mnt * find real data space usage on each brick; * calculate partitioning and ideal data space usage on each data brick; * find deviation of (4) from (5). Example. 
Let' build a LV of 3 bricks (one 10G meta-data brick sdb1, and two data bricks: sdc1 (10G), sdd1(5G)) with space-based partitioning: # VOL_ID=`uuid -v4` # echo "Using uuid $VOL_ID" # mkfs.reiser4 -U $VOL_ID -y -t 256K /dev/vdb1 # mkfs.reiser4 -U $VOL_ID -y -a -t 256K /dev/vdc1 # mkfs.reiser4 -U $VOL_ID -y -a -t 256K /dev/vdd1 # mount /dev/vdb1 /mnt Fill the meta-data brick with data: # dd if=/dev/zero of=/mnt/myfile bs=256K No space left on device... Add data-bricks /dev/sdc1 and dev/sdd1 to the volume: # volume.reiser4 -a /dev/vdc1 /mnt # volume.reiser4 -a /dev/vdd1 /mnt Move all data blocks to the newly added bricks: # volume.reiser4 -r /dev/vdb1 /mnt # sync Now meta-data brick doesn't contain data blocks (only meta-data ones), so that we can calculate quality of data distribution # volume.reiser4 /mnt -p0 blocks used: 503 # volume.reiser4 /mnt -p1 blocks used: 1657203 system blocks: 115 data capacity: 2621069 # volume.reiser4 /mnt -p2 blocks used: 833001 system blocks: 73 data capacity: 1310391 Basing on the statistics above calculate quality of distribution. Total data capacity of the volume: C = 2621069 + 1310391 = 3931460 Relative capacities of data bricks: C1 = 2621069 /(2621069 + 1310391) = 0.6667 C2 = 1310464 /(2621069 + 1310391) = 0.3333 Real space usage on data bricks (blocks used - system blocks): R1 = 1657203 - 115 = 1657088 R2 = 833001 - 73 = 832928 Space usage on the volume: R = R1 + R2 = 1657088 + 832928 = 2490016 Ideal data space usage on data bricks: I1 = C1 * R = 0.6667 * 2490016 = 1660094 I2 = C2 * R = 0.3333 * 2490016 = 829922 Deviation: D = (R1, R2) - (I1, I2) = (3006, -3006) Relative deviation: D/R = (-0.0012, 0.0012) Quality of distribution: Q = 1 - max(|D1|, |D1|) = 1 - 0.0012 = 0.9988 Comment. For any specified number of bricks N and quality of distribution Q it is possible to find a configuration of a logical volume composed of N bricks, so that quality of distribution on that volume will be better than Q. Comment. 
Quality of distribution Q doesn't depend on the number of bricks in the logical volume. This is a theorem, which can be strictly proven. 3a7bd1efc7fcf6d457f6306b0301fdae7e2c39b4 4342 4341 2020-01-05T11:46:18Z Edward 4 /* Deploying a logical volume after correct shutdown */ Logical volume (LV) can be composed of any number of block devices, different in physical and geometric parameters. However the optimal configuration (true parallelism) imposes some restrictions and dependencies on the size of such devices. WARNING: The stuff is not stable. Don't put important data to logical volumes managed by software of release number 5.X.Y. Also don't mount your old partitions in kernels with Reiser4 of SFRN 5.X.Y before its stabilization IMPORTANT: Currently there is no tools to manage Reiser5 logical volumes off-line, so it it strongly recommended to save/update configurations of your LV in a file, which doesn't belong to that volume. = Basic definitions. Volume configuration. Brick's capacity. Partitioning. Fair distribution. Balancing = Basic configuration of a logical volume is the following information: * Volume UUID; * Number of bricks in the volume; * List of brick names or UUIDs in the volume; * UUID or name of the brick to be added/removed (if any). That brick is not counted in (2) and (3). For each volume its configuration should be stored somewhere (but not on that volume!) and properly updated before and after each volume operation performed on that volume. We make the user responsible for this. Volume configuration is needed to facilitate deploying a volume. '''Abstract capacity''' (or simply capacity) of a brick is a positive integer number. Capacity is a brick's property defined by user. Don't confuse it with the size of block device. Think of it as of brick's "weight" in some units. And this is the user, who decides, which property of the brick to assign as its abstract capacity and in which units. 
In particular, it can be size of the block device in kilobytes, or its size in megabytes, or its throughput in M/sec, or other geometric or physical parameter of the device, associated with the brick. It is important that capacities of all bricks of the same logical volume are measured in the same units. Also, it would be utterly pointless to assign different properties as abstract capacities for bricks of the same LV. For example, size of block device for one brick, and disk bandwidth for another one. Capacity of each brick gets initialized by mkfs utility. By default it is calculated as number of free blocks on the device at the very end of the formatting procedure. For meta-data brick it is calculated as 70% of such amount. Capacity of any brick can be changed on-line by user. '''Capacity of a logical volume''' is defined as a sum of capacities of its bricks-components. '''Relative capacity of a brick''' is the ratio of brick's capacity to volume's capacity. Relative capacity defines a portion of IO-requests that will be issued against that brick. Array of relative capacities (C1, C2, ...) of all bricks is called volume partitioning. Obviously, C1 + C2 + ... = 1. '''(Real) data space usage''' on a brick is number of data blocks, stored on that brick. '''Ideal (or expected) data space usage''' on a brick is T*C, where T is total number of data blocks stored in the volume. C is relative capacity of the brick. It is recommended to compose volumes in the way so that space-based partitioning coincides with throughput-based one - it would be the optimal volume configuration, which provides true parallelism. If it is impossible for some reason, then choose a preferred partitioning method (space-based, or throughput-based). Note that space-based partitioning saves volume space, whereas throughput based one saves volume throughput. When performing regular file operations, Reiser5 distributes data stripes throughout the volume evenly and fairly. 
It means that portion of IO-requests issued against each brick is equal to its relative capacity, that is, to the portion of capacity that the brick adds to the total volume's capacity. Most volume operations are accompanied by rebalancing, which keeps fairness of distribution. For example, adding a brick to a logical volume changes its partitioning, and hence, breaks fairness of the distribution, so we need to move some data stripes to the new brick to make distribution fair. Also you can not simply remove a brick from a logical volume - all data stripes should be moved from that brick to other bricks of the logical volume. Every time when user performs a volume operation, Reiser5 marks LV as "not balanced". After successful balancing the status of LV is changed to "balanced". If balancing procedure fails for some reasons, it should be resumed manually (with volume.reiser4 utility). It is allowed to perform regular file operations on not balanced LV. However, in this case: a) we don't guarantee a good quality of data distribution on your LV. b) you won't be able to perform volume operations on your LV except balancing - any other volume operation will return error (EBUSY). So, don't forget to bring your LV to the balanced state as soon as possible! = Prepare Software and Hardware = Build, install and boot kernel with Reiser4 of software framework release number 5.X.Y. Kernel patches can be found [https://sourceforge.net/projects/reiser4/files/v5-unstable/ here]. Note that by Linux kernel and GNU utilities the testing stuff is still recognized as "Reiser4". Make sure there is the following message in kernel logs: "Loading Reiser4 (Software Framework Release: 5.X.Y)" Build and install the latest [https://sourceforge.net/projects/reiser4/files/reiser4-utils/libaal/ libaal] Download, build and install the latest version 2.A.B of [https://sourceforge.net/projects/reiser4/files/v5-unstable/ Reiser4progs package]. 
Make sure that the utility for managing logical volumes is installed on your machine (as a part of the reiser4progs package):

 # volume.reiser4 -?

= Creating a logical volume =

Start by choosing a unique ID (uuid) for your volume. By default it is generated by the mkfs utility. However, the user can generate it with suitable tools (e.g. uuid(1)) and store it in an environment variable for convenience:

 # VOL_ID=`uuid -v4`
 # echo "Using uuid $VOL_ID"

Choose a stripe size for your logical volume. For a good quality of distribution it is recommended that the stripe size not exceed 1/10000 of the volume size. On the other hand, too small stripes will increase space consumption on your meta-data brick. In our example we choose a stripe size of 256K.

Start by creating the first brick of your volume - the meta-data brick - passing the volume ID and stripe size to the mkfs.reiser4 utility:

 # mkfs.reiser4 -U $VOL_ID -t 256K /dev/vdb1

Currently only one meta-data brick per volume is supported, so it is recommended that the size of the block device for the meta-data brick is not too small. In most cases it will be enough if your meta-data brick is not smaller than 1/200 of the maximal volume size. For example, a 100G meta-data brick will be able to service a ~20T logical volume.

Mount your logical volume consisting of one meta-data brick:

 # mount /dev/vdb1 /mnt

Find a record about your volume in the output of the following command:

 # volume.reiser4 -l

Create the configuration of your logical volume (its definition is above) and store it somewhere - but not on that volume! Your logical volume is now on-line and ready to use. You can perform regular file operations and volume operations (e.g. add a data brick to your LV).

= Adding a data brick to LV =

At any time you are able to add a data brick to your LV. You can do it in parallel with regular file operations executing on this volume. Make sure, however, that there is no other volume operation (e.g.
removing a brick) in progress on your volume, otherwise your operation will fail with EBUSY. Obviously, adding a brick will increase the capacity of your volume.

Choose a block device for the new data brick. Make sure that it is neither too large nor too small: the capacities of any two bricks of the same logical volume can not differ by more than 2^19 (~500,000) times. E.g. your logical volume can not contain both a 1M and a 2T brick. Any attempt to add a brick of improper capacity will fail with an error.

Format it in the same way as the meta-data brick, but also specify the "-a" option (to let mkfs know that it is a data brick):

 # mkfs.reiser4 -U $VOL_ID -t 256K -a /dev/vdb2

Important: make sure you specified the same volume ID and stripe size as the other bricks of the logical volume have. Otherwise, the operation of adding a data brick will fail.

Update the configuration of your volume with the UUID or name of the brick you want to add (item #4). To add the brick, simply pass its name as an argument to the option "-a" and specify your LV via its mount point:

 # volume.reiser4 -a /dev/vdb2 /mnt

The procedure of adding a brick automatically invokes re-balancing, which moves a portion of the data stripes to the newly added brick (so that the resulting distribution is fair). The portion of data blocks moved during such rebalancing is equal to the relative capacity of the new brick, that is, to the portion of capacity that the new brick adds to the updated LV's capacity. This important property defines the cost of the balancing procedure: if the portion of capacity added by a brick is small, then the number of stripes moved during balancing is also small.

Like other user-space utilities, the operation of adding a brick can return an error, even assuming that the brick you wanted to add is properly formatted. In this case check the status of your LV:

 # volume.reiser4 /mnt

If the volume is unbalanced, then simply complete balancing manually:

 # volume.reiser4 -b /mnt

Otherwise, check the number of bricks in your LV.
Most likely it is the same as it was before the failed operation; in that case simply repeat the operation of adding a brick from scratch. Upon successful completion update your volume configuration. That is, increment (#2), add info about the new brick to (#3), and remove the record at (#4).

= Removing a data brick from LV =

At any time you are able to remove a data brick from your LV. You can do it in parallel with regular file operations executing on this volume. Make sure, however, that no other volume operation (e.g. adding a brick) is in progress on your volume, otherwise your operation will fail with EBUSY. Obviously, removing a brick will decrease the abstract capacity of your LV. Note that the other bricks should have enough space to store all the data blocks of the brick you want to remove, otherwise the removal operation will return an error (ENOSPC).

Suppose you want to remove brick /dev/vdb2 from your LV mounted at /mnt. Update your volume configuration with the UUID and name of the brick you want to remove (item #4). To remove the brick, simply pass its name as an argument to the option "-r" and specify the logical volume by its mount point:

 # volume.reiser4 -r /dev/vdb2 /mnt

The procedure of brick removal automatically invokes re-balancing, which distributes the data of the brick to be removed among the other bricks, so that the resulting distribution is also fair. The portion of data stripes moved during such rebalancing is equal to the relative capacity of the brick to be removed (that is, the portion of capacity that the brick added to the LV's capacity).

It can happen that the command above completes with an error (like other user-space applications). In this case check the status of your LV:

 # volume.reiser4 /mnt

If the volume is not balanced, then simply complete balancing manually:

 # volume.reiser4 -b /mnt

Otherwise, check the number of bricks in your logical volume - it should be the same as before the failed operation.
The error ENOSPC indicates that the free space on the other bricks is not enough to fit all the data of the brick you want to remove. On success, update your volume configuration: remove information about the removed brick at (#3) and (#4).

= Changing brick's capacity =

At any time (assuming that no other volume operation is in progress) you can change the abstract capacity of any brick to some new non-zero value. Changing capacity always changes the volume partitioning and therefore breaks fairness of distribution, so Reiser5 automatically launches rebalancing to make sure that the resulting distribution is fair for the new set of capacities. In particular, increasing a brick's capacity will move some data from other bricks to the brick whose capacity was increased. Decreasing a brick's capacity will move some data from the brick whose capacity was decreased to other bricks.

To change the abstract capacity of brick /dev/vdb1 to a new value (e.g. 200000), simply run

 # volume.reiser4 -z /dev/vdb1 -c 200000 /mnt

Pronounced as "resize brick /dev/vdb1 to new capacity 200000 in the volume mounted at /mnt".

The operation of changing capacity can return an error. Most likely it is ENOSPC, which is a side effect of concurrent regular file writes. In this case check the status of your LV. If it is unbalanced, then consider removing some files from your LV and complete balancing by running

 # volume.reiser4 -b /mnt

Otherwise, repeat the operation from scratch.

Comment. Changing a brick's capacity to 0 is undefined and will return an error. Consider the brick removal operation instead.

= Operations with meta-data brick =

The meta-data brick can also contain data stripes and participate in data distribution like the data bricks, so all the volume operations described above are also applicable to the meta-data brick.
Note, however, that it is impossible to completely remove the meta-data brick from the logical volume for obvious reasons (meta-data needs to be stored somewhere), so the brick removal operation applied to the meta-data brick actually removes it from the Data Storage Array (DSA), not from the logical volume. The DSA is the subset of the LV consisting of the bricks participating in data distribution. Once you remove the meta-data brick from the DSA, that brick will be used only to store meta-data. The operation of adding a brick, applied to the meta-data brick, returns it to the DSA.

Important: Reiser5 doesn't count busy data and meta-data blocks separately. So in contrast with data bricks (which contain only data) you are not able to find out the real space occupied by data blocks on the meta-data brick - Reiser5 knows only the total space occupied.

To check the status of the meta-data brick simply run

 # volume.reiser4 /mnt

and compare the values of "bricks total" and "bricks in DSA". If they are equal, then the meta-data brick participates in data distribution. Otherwise, "bricks total" should be 1 more than "bricks in DSA" - this indicates that the meta-data brick doesn't participate in data distribution (and therefore doesn't contain data blocks). Note that other cases are impossible: for data bricks, participation in the LV and in the DSA is always equivalent.

= Unmounting a logical volume =

To terminate a mount session just issue the usual umount command with the mount point specified. Note that after unmounting the volume all bricks by default remain registered in the system until system shutdown. If you want to unregister a brick before system shutdown, then simply issue the following command:

 # volume.reiser4 -u BRICK_NAME

= Deploying a logical volume after correct unmount =

Make sure (by checking your volume configuration) that all bricks of the volume are registered in the system.
The list of all volumes and bricks registered in the system can be found in the output of the following command:

 # volume.reiser4 -l

Issue the usual mount command against one of the bricks of your volume. It is recommended to specify the meta-data brick in the mount command. If not all bricks of the volume are registered, then attempts to mount the volume will fail with a respective kernel message.

NOTE: Reiser5 will refuse to mount a logical volume when a wrong (incomplete or redundant) set of bricks is registered in the system. A redundant set of bricks appears, for example, when you mistakenly register a brick that was removed from the logical volume.

= Deploying a logical volume after correct shutdown =

To be able to mount your LV, make sure that all its bricks (data and meta-data) are registered in the system.

Important: Reiser5 will refuse to mount a logical volume when a wrong (incomplete or redundant) set of bricks is registered in the system. A redundant set of bricks appears, for example, when you mistakenly register a brick that was removed from the logical volume. For this reason we strongly recommend keeping track of your LV - store its configuration somewhere, but not on that volume! And don't forget to update that configuration after _every_ volume operation. If you lose the configuration of your LV and don't remember it (which is most likely for large volumes), then it will be rather painful to restore: currently there are no tools to manage off-line logical volumes, so users have to maintain the configuration on their own. It is not at all difficult.

To register a brick in the system use the following command:

 # volume.reiser4 -g BRICK_NAME

To print a list of all registered bricks use

 # volume.reiser4 -l

To mount your LV just issue a mount command for any one brick of your LV.

Comment.
Reiser5 always tries to register the brick which is passed to the mount command as an argument, so there is no need to pre-register the brick you will issue the mount command against.

= Deploying a logical volume after hard reset or system crash =

If no volume operations were interrupted by the hard reset or system crash, then just follow the instructions in the previous sections on deploying a logical volume. In Reiser5 only a restricted number of bricks participate in every transaction; the maximal number of such bricks can be specified by the user. At mount time a transaction replay procedure will be launched on each such brick independently, in parallel. Depending on the kind of interrupted volume operation, perform one of the following actions:

== Adding a brick was interrupted ==

Check your volume configuration. Register the old set of bricks (that is, the set of bricks that the volume had before applying the operation) and try to mount. In the case of an error, register also the brick you wanted to add and try to mount again. Check the status of your LV by running

 # volume.reiser4 /mnt

If the volume is unbalanced, then complete balancing manually by running

 # volume.reiser4 -b /mnt

Check "bricks total" of your LV in the output of

 # volume.reiser4 /mnt

Compare it with the old number of bricks in the configuration. The new value should be one more than the old one. If the number of bricks is the same, then your operation of adding a brick was completely rolled back by the transaction manager, so you need to repeat it from scratch. Otherwise, your operation was successfully completed - update your volume configuration respectively.

== Brick removal was interrupted ==

Check your volume configuration. Register the old set of bricks (that is, the set of bricks that the volume had before applying the interrupted operation) except the brick you wanted to remove. Try to mount the volume. In the case of an error, register also the brick you wanted to remove and try to mount again.
Check the status of your LV:

 # volume.reiser4 /mnt

If the volume is unbalanced, then complete balancing manually by running

 # volume.reiser4 -b /mnt

Comment. After successful completion of balancing the brick will be automatically removed from the volume. Make sure of it by checking the status of your LV:

 # volume.reiser4 /mnt

Update your volume configuration respectively.

== Another volume operation was interrupted ==

Using the volume configuration, register the new set of bricks and try to mount the volume. The mount should be successful. Check the status of your LV:

 # volume.reiser4 /mnt

If the volume is unbalanced, then complete balancing manually by running

 # volume.reiser4 -b /mnt

= LV monitoring =

Common info about the LV mounted at /mnt:

 # volume.reiser4 /mnt

 ID:             Volume UUID
 volume:         ID of the plugin managing the volume
 distribution:   ID of the distribution plugin
 stripe:         Stripe size in bytes
 segments:       Number of hash space segments (for distribution)
 bricks total:   Total number of bricks in the volume
 bricks in DSA:  Number of bricks participating in data distribution
 balanced:       Balance status of the volume

Info about its brick of index J:

 # volume.reiser4 -p J /mnt

 internal ID:    Brick's "internal ID" and its status in the volume
 external ID:    Brick's UUID
 device name:    Name of the block device associated with the brick
 block count:    Size of the block device in blocks
 blocks used:    Total number of occupied blocks on the device
 system blocks:  Minimal possible number of busy blocks on that device
 data capacity:  Abstract capacity of the brick
 space usage:    Portion of occupied blocks on the device

Comment. When retrieving a brick's info make sure that no volume operations on that volume are in progress. Otherwise the command above will return an error (EBUSY).

WARNING. Brick info provided this way is not necessarily the most recent. To get up-to-date info run sync(1) and make sure that no regular file operations are in progress.
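As a monitoring sketch, the per-brick fields listed above can be scraped with awk. The here-doc stands in for real `volume.reiser4 -p J /mnt` output, and the exact field formatting is assumed from the listing above; the 0.90 warning threshold is an arbitrary choice below the 0.95 ceiling:

```shell
# Sketch: flag bricks whose "space usage" approaches the 0.95 ceiling.
# The here-doc is illustrative sample output, not from a real system.
awk -F': *' '/^space usage/ && $2 + 0 > 0.90 {
    print "brick nearly full, space usage", $2
}' <<'EOF'
device name: /dev/vdc1
block count: 2621440
blocks used: 2480000
space usage: 0.94
EOF
```

On a real system you would pipe the command's output into the awk filter instead of the here-doc, after a sync as the WARNING above advises.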
= Checking free space =

To check the number of available free blocks on a volume mounted at /mnt, make sure that no regular file operations or volume operations are in progress on that volume, then run

 # sync
 # df --block-size=4K /mnt

To check the number of free blocks on the brick of index J run

 # volume.reiser4 -p J /mnt

then calculate the difference between "block count" and "blocks used".

Comment. Not all free blocks on a brick/volume are available for use. The number of available free blocks is always ~95% of the total number of free blocks (Reiser4 reserves 5% to make sure that regular file truncate operations won't fail). NOTE: volume.reiser4 shows the total number of free blocks, whereas df(1) shows the number of available free blocks. The "space usage" statistic shows the portion of busy blocks on an individual brick. For the reasons explained above, "space usage" on any brick can not be more than 0.95.

= Checking quality of data distribution =

Quality of data distribution is a measure of the deviation of the real data space usage from the ideal one defined by the volume partitioning. The smaller the deviation, the better the distribution quality. Checking quality of distribution makes sense only in the case when your volume partitioning is space-based, or coincides with the space-based one. If your partitioning is throughput-based and doesn't coincide with the space-based one, then the quality of the actual data distribution can be rather bad: in this case the file system takes care that low-performance devices do not become a bottleneck, and effective space usage is not a high priority.

Checking quality of data distribution is based on the free blocks accounting provided by the file system. Note that the file system doesn't count busy data and meta-data blocks separately, so you are not able to find the real data space usage, and hence to check quality of distribution, in the case when the meta-data brick contains data blocks.
To check quality of distribution:
* make sure that the meta-data brick doesn't contain data blocks;
* make sure that no regular file or volume operations are currently in progress;
* find the "blocks used", "system blocks" and "data capacity" statistics for each data brick:

 # sync
 # volume.reiser4 -p 1 /mnt
 ...
 # volume.reiser4 -p N /mnt

* find the real data space usage on each brick;
* calculate the partitioning and the ideal data space usage on each data brick;
* find the deviation of the real usage from the ideal one.

Example. Let's build an LV of 3 bricks (one 10G meta-data brick vdb1, and two data bricks: vdc1 (10G) and vdd1 (5G)) with space-based partitioning:

 # VOL_ID=`uuid -v4`
 # echo "Using uuid $VOL_ID"
 # mkfs.reiser4 -U $VOL_ID -y -t 256K /dev/vdb1
 # mkfs.reiser4 -U $VOL_ID -y -a -t 256K /dev/vdc1
 # mkfs.reiser4 -U $VOL_ID -y -a -t 256K /dev/vdd1
 # mount /dev/vdb1 /mnt

Fill the meta-data brick with data:

 # dd if=/dev/zero of=/mnt/myfile bs=256K
 No space left on device...

Add data bricks /dev/vdc1 and /dev/vdd1 to the volume:

 # volume.reiser4 -a /dev/vdc1 /mnt
 # volume.reiser4 -a /dev/vdd1 /mnt

Move all data blocks to the newly added bricks:

 # volume.reiser4 -r /dev/vdb1 /mnt
 # sync

Now the meta-data brick doesn't contain data blocks (only meta-data ones), so we can calculate the quality of data distribution:

 # volume.reiser4 /mnt -p0
 blocks used:    503
 # volume.reiser4 /mnt -p1
 blocks used:    1657203
 system blocks:  115
 data capacity:  2621069
 # volume.reiser4 /mnt -p2
 blocks used:    833001
 system blocks:  73
 data capacity:  1310391

Based on the statistics above, calculate the quality of distribution.
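The hand calculation that follows can also be scripted. This awk sketch reuses the numbers printed above (data bricks only, since the meta-data brick holds no data blocks here) and avoids the rounding of intermediate values:

```shell
# Sketch: quality of distribution Q from the statistics printed above.
awk 'BEGIN {
    u1 = 1657203; s1 = 115; c1 = 2621069    # brick 1: used, system, capacity
    u2 = 833001;  s2 = 73;  c2 = 1310391    # brick 2
    r1 = u1 - s1; r2 = u2 - s2; r = r1 + r2           # real data usage
    C1 = c1 / (c1 + c2); C2 = c2 / (c1 + c2)          # relative capacities
    d1 = (r1 - C1 * r) / r; d2 = (r2 - C2 * r) / r    # relative deviations
    m = d1 < 0 ? -d1 : d1
    if ((d2 < 0 ? -d2 : d2) > m) m = (d2 < 0 ? -d2 : d2)
    printf "Q = %.4f\n", 1 - m
}'
```

To check another volume, substitute its own "blocks used", "system blocks" and "data capacity" values for each data brick.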
Total data capacity of the volume:

 C = 2621069 + 1310391 = 3931460

Relative capacities of the data bricks:

 C1 = 2621069 / 3931460 = 0.6667
 C2 = 1310391 / 3931460 = 0.3333

Real space usage on the data bricks (blocks used - system blocks):

 R1 = 1657203 - 115 = 1657088
 R2 = 833001 - 73 = 832928

Space usage on the volume:

 R = R1 + R2 = 1657088 + 832928 = 2490016

Ideal data space usage on the data bricks:

 I1 = C1 * R = 0.6667 * 2490016 = 1660094
 I2 = C2 * R = 0.3333 * 2490016 = 829922

Deviation:

 D = (R1, R2) - (I1, I2) = (-3006, 3006)

Relative deviation:

 D/R = (-0.0012, 0.0012)

Quality of distribution:

 Q = 1 - max(|D1|, |D2|)/R = 1 - 0.0012 = 0.9988

Comment. For any specified number of bricks N and quality of distribution Q it is possible to find a configuration of a logical volume composed of N bricks such that the quality of distribution on that volume is better than Q.

Comment. The quality of distribution Q doesn't depend on the number of bricks in the logical volume. This is a theorem, which can be strictly proven.
Balancing = Basic configuration of a logical volume is the following information: * Volume UUID; * Number of bricks in the volume; * List of brick names or UUIDs in the volume; * UUID or name of the brick to be added/removed (if any). That brick is not counted in (2) and (3). For each volume its configuration should be stored somewhere (but not on that volume!) and properly updated before and after each volume operation performed on that volume. We make the user responsible for this. Volume configuration is needed to facilitate deploying a volume. '''Abstract capacity''' (or simply capacity) of a brick is a positive integer number. Capacity is a brick's property defined by user. Don't confuse it with the size of block device. Think of it as of brick's "weight" in some units. And this is the user, who decides, which property of the brick to assign as its abstract capacity and in which units. In particular, it can be size of the block device in kilobytes, or its size in megabytes, or its throughput in M/sec, or other geometric or physical parameter of the device, associated with the brick. It is important that capacities of all bricks of the same logical volume are measured in the same units. Also, it would be utterly pointless to assign different properties as abstract capacities for bricks of the same LV. For example, size of block device for one brick, and disk bandwidth for another one. Capacity of each brick gets initialized by mkfs utility. By default it is calculated as number of free blocks on the device at the very end of the formatting procedure. For meta-data brick it is calculated as 70% of such amount. Capacity of any brick can be changed on-line by user. '''Capacity of a logical volume''' is defined as a sum of capacities of its bricks-components. '''Relative capacity of a brick''' is the ratio of brick's capacity to volume's capacity. Relative capacity defines a portion of IO-requests that will be issued against that brick. 
Array of relative capacities (C1, C2, ...) of all bricks is called volume partitioning. Obviously, C1 + C2 + ... = 1. '''(Real) data space usage''' on a brick is number of data blocks, stored on that brick. '''Ideal (or expected) data space usage''' on a brick is T*C, where T is total number of data blocks stored in the volume. C is relative capacity of the brick. It is recommended to compose volumes in the way so that space-based partitioning coincides with throughput-based one - it would be the optimal volume configuration, which provides true parallelism. If it is impossible for some reason, then choose a preferred partitioning method (space-based, or throughput-based). Note that space-based partitioning saves volume space, whereas throughput based one saves volume throughput. When performing regular file operations, Reiser5 distributes data stripes throughout the volume evenly and fairly. It means that portion of IO-requests issued against each brick is equal to its relative capacity, that is, to the portion of capacity that the brick adds to the total volume's capacity. Most volume operations are accompanied by rebalancing, which keeps fairness of distribution. For example, adding a brick to a logical volume changes its partitioning, and hence, breaks fairness of the distribution, so we need to move some data stripes to the new brick to make distribution fair. Also you can not simply remove a brick from a logical volume - all data stripes should be moved from that brick to other bricks of the logical volume. Every time when user performs a volume operation, Reiser5 marks LV as "not balanced". After successful balancing the status of LV is changed to "balanced". If balancing procedure fails for some reasons, it should be resumed manually (with volume.reiser4 utility). It is allowed to perform regular file operations on not balanced LV. However, in this case: a) we don't guarantee a good quality of data distribution on your LV. 
b) you won't be able to perform volume operations on your LV except balancing - any other volume operation will return error (EBUSY). So, don't forget to bring your LV to the balanced state as soon as possible! = Prepare Software and Hardware = Build, install and boot kernel with Reiser4 of software framework release number 5.X.Y. Kernel patches can be found [https://sourceforge.net/projects/reiser4/files/v5-unstable/ here]. Note that by Linux kernel and GNU utilities the testing stuff is still recognized as "Reiser4". Make sure there is the following message in kernel logs: "Loading Reiser4 (Software Framework Release: 5.X.Y)" Build and install the latest [https://sourceforge.net/projects/reiser4/files/reiser4-utils/libaal/ libaal] Download, build and install the latest version 2.A.B of [https://sourceforge.net/projects/reiser4/files/v5-unstable/ Reiser4progs package]. Make sure that utility for managing logical volumes is installed (as a part of reiser4progs package) on your machine: # volume.reiser4 -? = Creating a logical volume = Start from choosing a unique ID (uuid) of your volume. By default it is generated by mkfs utility. However, user can generate it himself by proper tools (e.g. uuid(1)) and store in an environment variable for convenience: # VOL_ID=`uuid -v4` # echo "Using uuid $VOL_ID" Choose a stripe size for your logical volume. For a good quality of distribution it is recommended that stripe doesn't exceed 1/10000 of volume size. On the other hand, too small stripes will increase space consumption on your meta-data brick. In our example we choose stripe size 256K. Start from creating the first brick of your volume - meta-data brick, passing volume-ID and stripe size to mkfs.reiser4 utility: # mkfs.reiser4 -U $VOL_ID -t 256K /dev/vdb1 Currently only one meta-data brick per volume is supported, so it is recommended that size of block device for meta-data brick in not too small. 
In most cases it will be enough, if your meta-data brick is not smaller than 1/200 of maximal volume size. For example, 100G meta-data brick will be able to service ~20T logical volume. Mount your logical volume consisting of one meta-data brick: # mount /dev/vdb1 /mnt Find a record about your volume in the output of the following command: # volume.reiser4 -l Create configuration of your logical volume (its definition is above) and store it somewhere, but not on that volume! Your logical volume is now on-line and ready to use. You can perform regular file operations and volume operations (e.g. add a data brick to your LV). = Adding a data brick to LV = At any time you are able to add a data brick to your LV. You can do it in parallel with regular file operations executing on this volume. Make sure, however, that there is no other volume operations (e.g. removing a brick) over your volume in progress, otherwise your operation will fail with EBUSY. Obviously, adding a brick will increase capacity of your volume. Choose a block device for the new data brick. Make sure that it is not too large, or too small. Capacities of any 2 bricks of the same logical volume can not differ more than 2^19 (~1 million) times. E.g. your logical volume can not contain both, 1M and 2T bricks. Any attempts to add a brick of improper capacity will fail with error. Format it by the same way as meta-data brick, but specify also "-a" option (to let mkfs know that it is data brick). # mkfs.reiser4 -U $VOL_ID -t 256K -a /dev/vdb2 Important: make sure you specified the same volume ID and stripe size as other bricks of the logical volume do have. Otherwise, operation of adding a data brick will fail. Update configuration of your volume with UUID or name of the brick you want to add (item #4). 
To add a brick simply pass its name as an argument for the option "-a" and specify your LV via its mount point: # volume.reiser4 -a /dev/vdb2 /mnt The procedure of adding a brick automatically invokes re-balancing, which moves a portion of data stripes to the newly added brick (so that the resulted distribution will fair). Portion of data blocks moved during such rebalancing is equal to the relative capacity of the new brick, that is to the portion of capacity that the new brick adds to updated LV's capacity. This important property defines the cost of balancing procedure. If the portion of capacity added by a brick is small, then number of stripes moved during balancing is also small. Like other user-space utilities, the operation of adding a brick can return error, even in the assumption that the brick you wanted to add is properly formatted. In this case check the status of your LV: # volume.reiser4 /mnt If the volume is unbalanced, then simply complete balancing manually: # volume.reiser4 -b /mnt Otherwise, check number of bricks in your LV. Most likely that it is the same as it was before the failed operation. In this case simply repeat the operation of adding a brick from scratch. Upon successful completion update your volume configuration. That is, increment (#2), add info about the new brick to (#3) and remove records at (#4). = Removing a data brick from LV = At any time you are able to remove a data brick from your LV. You can do it in parallel with regular file operations executing on this volume. Make sure, however, that there is no other volume operations (e.g. adding a brick) over your volume in progress, otherwise your operation will fail with EBUSY. Obviously, removing a brick will decrease abstract capacity of your LV. Note that other bricks should have enough space to store all data blocks of the brick you want to remove, otherwise, the removal operation will return error (ENOSPC). 
Suppose you want to remove brick /dev/vdb2 from your LV mounted at /mnt. Update your volume configuration with the UUID and name of the brick you want to remove (#item #4). To remove a brick simply pass its name as an argument for option "-r" and specify the logical volume by its mount point: # volume.reiser4 -r /dev/vdb2 /mnt The procedure of brick removal automatically invokes re-balancing, which distributes data of the brick to be removed among other bricks, so that resulted distribution is also fair. Portion of data stripes moved during such rebalancing is equal to the relative capacity of the brick to be removed (that it to the portion of capacity that the brick added to LV's capacity). It can happen, that the command above completes with error (like other user-space applications). In this case check the status of your LV: # volume.reiser4 /mnt If volume is not balanced, then simply complete balancing manually: # volume.reiser4 -b /mnt Otherwise, check the number of the bricks in your logical volume - it should be the same as before the failed operation. The error -ENOSPC indicates that free space on other bricks is not enough to fit all the data of the brick you want to remove. On success update your volume configuration: remove information about the removed brick at #3 and #4. = Changing brick's capacity = At any time (in the assumption that no other volume operation is in progress) you can change abstract capacity of any brick to some new value, different from 0. Changing capacity always changes volume partitioning, and therefore, breaks fairness of distribution, so Reiser5 automatically launches rebalancing to make sure that resulted distribution is fair for the new set of capacities. In particular, increasing bricks capacity will move some data from other bricks to the brick, whose capacity was increased. Decreasing bricks capacity will move some data from the brick, whose capacity was decreased, to other bricks. 
To change the abstract capacity of a brick /dev/vdb1 to a new value (e.g. 200000), simply run

 # volume.reiser4 -z /dev/vdb1 -c 200000 /mnt

pronounced as "resize brick /dev/vdb1 to new capacity 200000 in the volume mounted at /mnt". The operation of changing capacity can return an error. Most likely it is -ENOSPC, which is a side effect of concurrent regular file writes. In this case check the status of your LV. If it is unbalanced, then consider removing some files from your LV and complete balancing by running

 # volume.reiser4 -b /mnt

Otherwise, repeat the operation from scratch.

Comment. Changing a brick's capacity to 0 is undefined and will return an error. Consider the brick removal operation instead.

= Operations with meta-data brick =

The meta-data brick can also contain data stripes and participate in data distribution like the data bricks, so all the volume operations described above are also applicable to the meta-data brick. Note, however, that it is impossible to completely remove the meta-data brick from the logical volume for obvious reasons (meta-data needs to be stored somewhere), so the brick removal operation applied to the meta-data brick actually removes it from the Data-Storage Array (DSA), not from the logical volume. The DSA is the subset of the LV consisting of the bricks participating in data distribution. Once you remove the meta-data brick from the DSA, that brick will be used only to store meta-data. The operation of adding a brick, applied to the meta-data brick, returns it back to the DSA.

Important: Reiser5 doesn't count busy data and meta-data blocks separately. So, in contrast with data bricks (which contain only data), you are not able to find out the real space occupied by data blocks on the meta-data brick - Reiser5 knows only the total space occupied.

To check the status of the meta-data brick, simply run

 # volume.reiser4 /mnt

and compare the values of "bricks total" and "bricks in DSA". If they are equal, then the meta-data brick participates in data distribution.
Otherwise, "bricks total" should be 1 more than "bricks in DSA", which indicates that the meta-data brick doesn't participate in data distribution (and therefore doesn't contain data blocks). Note that other cases are impossible: for data bricks, participation in the LV and in the DSA is always equivalent.

= Unmounting a logical volume =

To terminate a mount session, just issue the usual umount command with the mount point specified. Note that after unmounting the volume, all bricks by default remain registered in the system till system shutdown. If you want to unregister a brick before system shutdown, then simply issue the following command:

 # volume.reiser4 -u BRICK_NAME

= Deploying a logical volume after correct unmount =

Make sure (by checking your volume configuration) that all bricks of the volume are registered in the system. The list of all volumes and bricks registered in the system can be found in the output of the following command:

 # volume.reiser4 -l

Issue the usual mount command against one of the bricks of your volume. It is recommended to specify the meta-data brick in the mount command. If not all bricks of the volume are registered, then attempts to mount the volume will fail with a respective kernel message.

NOTE: Reiser5 will refuse to mount a logical volume when a wrong (incomplete or redundant) set of bricks is registered in the system. A redundant set of bricks appears, for example, when you mistakenly register a brick that was removed from the logical volume.

= Deploying a logical volume after correct shutdown =

To be able to mount your LV, make sure that all its bricks (data and meta-data) are registered in the system. If not all bricks of the volume are registered, then attempts to mount the volume will fail with a respective kernel message. For this reason we strongly recommend keeping track of your LV - store its configuration somewhere, but not on this volume!
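Such a configuration record can be as simple as a small file kept outside the volume. A hypothetical sketch in Python - the record format here is our own invention, not something reiser4progs reads:

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class VolumeConfig:
    """The items the text asks you to track: (1) volume UUID,
    (2) number of bricks, (3) list of brick names/UUIDs,
    (4) the brick being added/removed, if any."""
    volume_uuid: str
    bricks: list              # item (3); item (2) is len(bricks)
    pending_brick: str = ""   # item (4), empty when no operation pending

    def save(self, path):
        # keep this file somewhere NOT on the logical volume itself
        with open(path, "w") as f:
            json.dump(asdict(self), f, indent=2)

cfg = VolumeConfig("00000000-0000-0000-0000-000000000000",
                   ["/dev/vdb1", "/dev/vdb2"])
print(len(cfg.bricks))   # item (2): 2
```

Updating the record before and after each volume operation (set `pending_brick`, perform the operation, then move the brick into or out of `bricks`) mirrors the #2/#3/#4 bookkeeping described in the sections above.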
And don't forget to update that configuration after _every_ volume operation. If you lost the configuration of your LV and don't remember it (which is most likely for large volumes), then it will be rather painful to restore: currently there are no tools to manage offline logical volumes, so users are prompted to do this on their own. It is not at all difficult.

To register a brick in the system, use the following command:

 # volume.reiser4 -g BRICK_NAME

To print a list of all registered bricks, use

 # volume.reiser4 -l

To mount your LV, just issue a mount command for any one brick of your LV.

Comment. Reiser5 always tries to register the brick which is passed to the mount command as an argument, so there is no need to preregister the brick you issue a mount command against.

= Deploying a logical volume after hard reset or system crash =

If no volume operations were interrupted by the hard reset or system crash, then just follow the instructions in section 9. In Reiser5 only a restricted number of bricks participate in every transaction. The maximal number of such bricks can be specified by the user. At mount time a transaction replay procedure will be launched on each such brick independently, in parallel. Depending on the kind of interrupted volume operation, perform one of the following actions:

== Adding a brick was interrupted ==

Check your volume configuration. Register the old set of bricks (that is, the set of bricks that the volume had before applying the operation) and try to mount. In the case of an error, register also the brick you wanted to add and try to mount again. Check the status of your LV by running

 # volume.reiser4 /mnt

If the volume is unbalanced, then complete balancing manually by running

 # volume.reiser4 -b /mnt

Check "bricks total" of your LV in the output of

 # volume.reiser4 /mnt

Compare it with the old number of bricks in the configuration. The new value should be one greater than the old one.
If the number of bricks is the same, then your operation of adding a brick was completely rolled back by the transaction manager, so you need to repeat it from scratch. Otherwise, your operation was successfully completed - update your volume configuration respectively.

== Brick removal was interrupted ==

Check your volume configuration. Register the old set of bricks (that is, the set of bricks that the volume had before applying the interrupted operation) except the brick you wanted to remove. Try to mount the volume. In the case of an error, register also the brick you wanted to remove and try to mount again. Check the status of your LV:

 # volume.reiser4 /mnt

If the volume is unbalanced, then complete balancing manually by running

 # volume.reiser4 -b /mnt

Comment. After successful completion of balancing the brick will be automatically removed from the volume. Make sure of it by checking the status of your LV:

 # volume.reiser4 /mnt

Update your volume configuration respectively.

== Another volume operation was interrupted ==

Using the volume configuration, register the new set of bricks and try to mount the volume. The mount should be successful.
Check the status of your LV:

 # volume.reiser4 /mnt

If the volume is unbalanced, then complete balancing manually by running

 # volume.reiser4 -b /mnt

= LV monitoring =

Common info about the LV mounted at /mnt:

 # volume.reiser4 /mnt

 ID:             Volume UUID
 volume:         ID of the plugin managing the volume
 distribution:   ID of the distribution plugin
 stripe:         Stripe size in bytes
 segments:       Number of hash space segments (for distribution)
 bricks total:   Total number of bricks in the volume
 bricks in DSA:  Number of bricks participating in data distribution
 balanced:       Balanced status of the volume

Info about its brick of index J:

 # volume.reiser4 -p J /mnt

 internal ID:    Brick's "internal ID" and its status in the volume
 external ID:    Brick's UUID
 device name:    Name of the block device associated with the brick
 block count:    Size of the block device in blocks
 blocks used:    Total number of occupied blocks on the device
 system blocks:  Minimal possible number of busy blocks on that device
 data capacity:  Abstract capacity of the brick
 space usage:    Portion of occupied blocks on the device

Comment. When retrieving a brick's info, make sure that no volume operations on that volume are in progress. Otherwise the command above will return an error (EBUSY).

WARNING. The brick info provided this way is not necessarily the most recent. To get actual info, run sync(1) and make sure that no regular file operations are in progress.

= Checking free space =

To check the number of available free blocks on a volume mounted at /mnt, make sure that no regular file operations, as well as no volume operations, are in progress on that volume, then run

 # sync
 # df --block-size=4K /mnt

To check the number of free blocks on the brick of index J, run

 # volume.reiser4 -p J /mnt

and calculate the difference between "block count" and "blocks used".

Comment. Not all free blocks on a brick/volume are available for use.
The number of available free blocks is always ~95% of the total number of free blocks (Reiser4 reserves 5% to make sure that regular file truncate operations won't fail).

NOTE: volume.reiser4 shows the total number of free blocks, whereas df(1) shows the number of available free blocks. The "space usage" statistic shows the portion of busy blocks on an individual brick. For the reasons explained above, "space usage" on any brick can not be more than 0.95.

= Checking quality of data distribution =

Quality of data distribution is a measure of the deviation of the real data space usage from the ideal one defined by the volume partitioning. The smaller the deviation, the better the distribution quality. Checking quality of distribution makes sense only when your volume partitioning is space-based, or when it coincides with the space-based one. If your partitioning is throughput-based and doesn't coincide with the space-based one, then the quality of the actual data distribution can be rather bad: in that case the file system takes care that low-performance devices do not become a bottleneck, and effective space usage is not a high priority.

Checking quality of data distribution is based on the free blocks accounting provided by the file system. Note that the file system doesn't count busy data and meta-data blocks separately, so you are not able to find the real data space usage, and hence to check the quality of distribution, in the case when the meta-data brick contains data blocks.

To check quality of distribution:

* make sure that the meta-data brick doesn't contain data blocks;
* make sure that no regular file and volume operations are currently in progress;
* find the "blocks used", "system blocks" and "data capacity" statistics for each data brick:

 # sync
 # volume.reiser4 -p 1 /mnt
 ...
 # volume.reiser4 -p N /mnt

* find the real data space usage on each brick;
* calculate the partitioning and the ideal data space usage on each data brick;
* find the deviation of (4) from (5).

Example.
Let's build an LV of 3 bricks (one 10G meta-data brick vdb1, and two data bricks: vdc1 (10G) and vdd1 (5G)) with space-based partitioning:

 # VOL_ID=`uuid -v4`
 # echo "Using uuid $VOL_ID"
 # mkfs.reiser4 -U $VOL_ID -y -t 256K /dev/vdb1
 # mkfs.reiser4 -U $VOL_ID -y -a -t 256K /dev/vdc1
 # mkfs.reiser4 -U $VOL_ID -y -a -t 256K /dev/vdd1
 # mount /dev/vdb1 /mnt

Fill the meta-data brick with data:

 # dd if=/dev/zero of=/mnt/myfile bs=256K
 No space left on device...

Add data bricks /dev/vdc1 and /dev/vdd1 to the volume:

 # volume.reiser4 -a /dev/vdc1 /mnt
 # volume.reiser4 -a /dev/vdd1 /mnt

Move all data blocks to the newly added bricks by removing the meta-data brick from the DSA:

 # volume.reiser4 -r /dev/vdb1 /mnt
 # sync

Now the meta-data brick doesn't contain data blocks (only meta-data ones), so we can calculate the quality of data distribution:

 # volume.reiser4 /mnt -p0
 blocks used: 503
 # volume.reiser4 /mnt -p1
 blocks used: 1657203
 system blocks: 115
 data capacity: 2621069
 # volume.reiser4 /mnt -p2
 blocks used: 833001
 system blocks: 73
 data capacity: 1310391

Based on the statistics above, calculate the quality of distribution.

Total data capacity of the volume:

 C = 2621069 + 1310391 = 3931460

Relative capacities of the data bricks:

 C1 = 2621069 / 3931460 = 0.6667
 C2 = 1310391 / 3931460 = 0.3333

Real space usage on the data bricks (blocks used - system blocks):

 R1 = 1657203 - 115 = 1657088
 R2 = 833001 - 73 = 832928

Space usage on the volume:

 R = R1 + R2 = 1657088 + 832928 = 2490016

Ideal data space usage on the data bricks:

 I1 = C1 * R = 0.6667 * 2490016 = 1660094
 I2 = C2 * R = 0.3333 * 2490016 = 829922

Deviation:

 D = (R1 - I1, R2 - I2) = (-3006, 3006)

Relative deviation:

 D/R = (-0.0012, 0.0012)

Quality of distribution:

 Q = 1 - max(|D1|/R, |D2|/R) = 1 - 0.0012 = 0.9988

Comment. For any specified number of bricks N and quality of distribution Q it is possible to find a configuration of a logical volume composed of N bricks, such that the quality of distribution on that volume will be better than Q.

Comment.
Quality of distribution Q doesn't depend on the number of bricks in the logical volume. This is a theorem, which can be strictly proven.
Total data capacity of the volume: C = 2621069 + 1310391 = 3931460 Relative capacities of data bricks: C1 = 2621069 /(2621069 + 1310391) = 0.6667 C2 = 1310464 /(2621069 + 1310391) = 0.3333 Real space usage on data bricks (blocks used - system blocks): R1 = 1657203 - 115 = 1657088 R2 = 833001 - 73 = 832928 Space usage on the volume: R = R1 + R2 = 1657088 + 832928 = 2490016 Ideal data space usage on data bricks: I1 = C1 * R = 0.6667 * 2490016 = 1660094 I2 = C2 * R = 0.3333 * 2490016 = 829922 Deviation: D = (R1, R2) - (I1, I2) = (3006, -3006) Relative deviation: D/R = (-0.0012, 0.0012) Quality of distribution: Q = 1 - max(|D1|, |D1|) = 1 - 0.0012 = 0.9988 Comment. For any specified number of bricks N and quality of distribution Q it is possible to find a configuration of a logical volume composed of N bricks, so that quality of distribution on that volume will be better than Q. Comment. Quality of distribution Q doesn't depend on the number of bricks in the logical volume. This is a theorem, which can be strictly proven. cc736f9cb4aa9ad7b41155e96c120d7aa5c36049 4339 4338 2019-12-31T12:57:25Z Edward 4 /* LV monitoring */ Logical volume (LV) can be composed of any number of block devices, different in physical and geometric parameters. However the optimal configuration (true parallelism) imposes some restrictions and dependencies on the size of such devices. WARNING: The stuff is not stable. Don't put important data to logical volumes managed by software of release number 5.X.Y IMPORTANT: Currently there is no tools to manage Reiser5 logical volumes off-line, so it it strongly recommended to save/update configurations of your LV in a file, which doesn't belong to that volume. = Basic definitions. Volume configuration. Brick's capacity. Partitioning. Fair distribution. 
Balancing =

The basic configuration of a logical volume is the following information:
* Volume UUID;
* Number of bricks in the volume;
* List of brick names or UUIDs in the volume;
* UUID or name of the brick to be added/removed (if any). That brick is not counted in (2) and (3).

For each volume this configuration should be stored somewhere (but not on that volume!) and properly updated before and after each volume operation performed on that volume. The user is responsible for this. The volume configuration is needed to facilitate deploying the volume.

'''Abstract capacity''' (or simply capacity) of a brick is a positive integer. Capacity is a brick property defined by the user. Don't confuse it with the size of the block device. Think of it as the brick's "weight" in some units. It is the user who decides which property of the brick to assign as its abstract capacity, and in which units. In particular, it can be the size of the block device in kilobytes, or its size in megabytes, or its throughput in MB/sec, or any other geometric or physical parameter of the device associated with the brick. It is important that the capacities of all bricks of the same logical volume are measured in the same units. It would also be pointless to assign different properties as abstract capacities for bricks of the same LV - for example, block device size for one brick and disk bandwidth for another.

The capacity of each brick is initialized by the mkfs utility. By default it is calculated as the number of free blocks on the device at the very end of the formatting procedure. For the meta-data brick it is calculated as 70% of that amount. The capacity of any brick can be changed on-line by the user.

'''Capacity of a logical volume''' is defined as the sum of the capacities of its component bricks.

'''Relative capacity of a brick''' is the ratio of the brick's capacity to the volume's capacity. Relative capacity defines the portion of IO requests that will be issued against that brick.
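The capacity arithmetic above can be illustrated with a few lines of Python. This is a toy sketch; the brick names and capacity values are made up for illustration:

```python
# Toy illustration of abstract capacity, volume capacity and relative
# capacity.  Brick names and capacity values are hypothetical.
capacities = {"/dev/vdb1": 100000, "/dev/vdc1": 200000, "/dev/vdd1": 100000}

# Capacity of the logical volume: the sum of the brick capacities.
volume_capacity = sum(capacities.values())

# Relative capacity of each brick: its share of the volume capacity.
# The shares necessarily sum to 1.
relative = {name: c / volume_capacity for name, c in capacities.items()}
```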
The array of relative capacities (C1, C2, ...) of all bricks is called the volume partitioning. Obviously, C1 + C2 + ... = 1.

'''(Real) data space usage''' on a brick is the number of data blocks stored on that brick.

'''Ideal (or expected) data space usage''' on a brick is T*C, where T is the total number of data blocks stored in the volume and C is the relative capacity of the brick.

It is recommended to compose volumes so that the space-based partitioning coincides with the throughput-based one - that is the optimal volume configuration, which provides true parallelism. If that is impossible for some reason, then choose a preferred partitioning method (space-based or throughput-based). Note that space-based partitioning saves volume space, whereas throughput-based partitioning saves volume throughput.

When performing regular file operations, Reiser5 distributes data stripes throughout the volume evenly and fairly. This means that the portion of IO requests issued against each brick is equal to its relative capacity, that is, to the portion of capacity that the brick contributes to the total volume capacity.

Most volume operations are accompanied by rebalancing, which maintains fairness of distribution. For example, adding a brick to a logical volume changes its partitioning and hence breaks fairness of the distribution, so some data stripes have to be moved to the new brick to make the distribution fair again. Likewise, you cannot simply remove a brick from a logical volume - all data stripes first have to be moved from that brick to the other bricks of the logical volume.

Every time the user performs a volume operation, Reiser5 marks the LV as "not balanced". After successful balancing the status of the LV is changed back to "balanced". If the balancing procedure fails for some reason, it should be resumed manually (with the volume.reiser4 utility). It is allowed to perform regular file operations on a not-balanced LV. However, in this case:

a) we don't guarantee a good quality of data distribution on your LV;
b) you won't be able to perform any volume operation on your LV except balancing - any other volume operation will return an error (EBUSY).

So don't forget to bring your LV to the balanced state as soon as possible!

= Prepare Software and Hardware =

Build, install and boot a kernel with Reiser4 of software framework release number 5.X.Y. Kernel patches can be found [https://sourceforge.net/projects/reiser4/files/v5-unstable/ here]. Note that the Linux kernel and GNU utilities still recognize this testing code as "Reiser4". Make sure the following message appears in the kernel logs:

 "Loading Reiser4 (Software Framework Release: 5.X.Y)"

Build and install the latest [https://sourceforge.net/projects/reiser4/files/reiser4-utils/libaal/ libaal]. Download, build and install the latest version 2.A.B of the [https://sourceforge.net/projects/reiser4/files/v5-unstable/ Reiser4progs package]. Make sure that the utility for managing logical volumes is installed (as a part of the reiser4progs package) on your machine:

 # volume.reiser4 -?

= Creating a logical volume =

Start by choosing a unique ID (UUID) for your volume. By default it is generated by the mkfs utility. However, the user can generate it with appropriate tools (e.g. uuid(1)) and store it in an environment variable for convenience:

 # VOL_ID=`uuid -v4`
 # echo "Using uuid $VOL_ID"

Choose a stripe size for your logical volume. For a good quality of distribution it is recommended that the stripe doesn't exceed 1/10000 of the volume size. On the other hand, too small a stripe will increase space consumption on your meta-data brick. In our example we choose a stripe size of 256K.

Start by creating the first brick of your volume - the meta-data brick - passing the volume ID and stripe size to the mkfs.reiser4 utility:

 # mkfs.reiser4 -U $VOL_ID -t 256K /dev/vdb1

Currently only one meta-data brick per volume is supported, so it is recommended that the block device for the meta-data brick is not too small.
In most cases it will be enough if your meta-data brick is not smaller than 1/200 of the maximal volume size. For example, a 100G meta-data brick will be able to service a ~20T logical volume.

Mount your logical volume, which so far consists of one meta-data brick:

 # mount /dev/vdb1 /mnt

Find a record about your volume in the output of the following command:

 # volume.reiser4 -l

Create the configuration of your logical volume (its definition is above) and store it somewhere - but not on that volume!

Your logical volume is now on-line and ready to use. You can perform regular file operations and volume operations (e.g. add a data brick to your LV).

= Adding a data brick to LV =

At any time you can add a data brick to your LV. You can do it in parallel with regular file operations executing on this volume. Make sure, however, that no other volume operation (e.g. removing a brick) is in progress on your volume, otherwise your operation will fail with EBUSY. Obviously, adding a brick will increase the capacity of your volume.

Choose a block device for the new data brick. Make sure that it is not too large or too small: the capacities of any two bricks of the same logical volume cannot differ by more than 2^19 (~500,000) times. E.g. your logical volume cannot contain both a 1M and a 2T brick. Any attempt to add a brick of improper capacity will fail with an error.

Format it the same way as the meta-data brick, but also specify the "-a" option (to let mkfs know that it is a data brick):

 # mkfs.reiser4 -U $VOL_ID -t 256K -a /dev/vdb2

Important: make sure you specified the same volume ID and stripe size as the other bricks of the logical volume have. Otherwise, the operation of adding the data brick will fail.

Update the configuration of your volume with the UUID or name of the brick you want to add (item #4).
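The configuration bookkeeping that this document keeps insisting on can be scripted. Below is a minimal Python sketch; the dictionary layout is our own invention (reiser4progs defines no such format), and it should be persisted, e.g. with the json module, to a file that does NOT live on the volume itself:

```python
# Sketch of the volume-configuration bookkeeping described in "Basic
# definitions" (items 1-4).  The dictionary layout is hypothetical;
# persist it (e.g. with json.dump) to a file kept off the volume.

def record_pending_add(cfg, brick):
    """Before running volume.reiser4 -a: note the brick being added (item 4)."""
    cfg = dict(cfg)                       # don't mutate the caller's copy
    cfg["pending"] = {"op": "add", "brick": brick}
    return cfg

def commit_add(cfg):
    """After a successful add: fold the pending brick into items 2 and 3."""
    cfg = dict(cfg)
    brick = cfg.pop("pending")["brick"]
    cfg["bricks"] = cfg["bricks"] + [brick]
    cfg["brick_count"] = len(cfg["bricks"])
    return cfg
```

An analogous pair of helpers would cover brick removal; the point is simply that the pending brick is recorded before the operation and folded in (or dropped) afterwards.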
To add the brick, simply pass its name as an argument to the "-a" option and specify your LV via its mount point:

 # volume.reiser4 -a /dev/vdb2 /mnt

The procedure of adding a brick automatically invokes rebalancing, which moves a portion of the data stripes to the newly added brick (so that the resulting distribution is fair). The portion of data blocks moved during such rebalancing is equal to the relative capacity of the new brick, that is, to the portion of capacity that the new brick adds to the updated LV's capacity. This important property defines the cost of the balancing procedure: if the portion of capacity added by a brick is small, then the number of stripes moved during balancing is also small.

Like other user-space utilities, the operation of adding a brick can return an error, even assuming that the brick you wanted to add is properly formatted. In this case check the status of your LV:

 # volume.reiser4 /mnt

If the volume is unbalanced, then simply complete balancing manually:

 # volume.reiser4 -b /mnt

Otherwise, check the number of bricks in your LV. Most likely it is the same as it was before the failed operation; in this case simply repeat the operation of adding a brick from scratch.

Upon successful completion update your volume configuration. That is, increment (#2), add info about the new brick to (#3) and remove the record at (#4).

= Removing a data brick from LV =

At any time you can remove a data brick from your LV. You can do it in parallel with regular file operations executing on this volume. Make sure, however, that no other volume operation (e.g. adding a brick) is in progress on your volume, otherwise your operation will fail with EBUSY. Obviously, removing a brick will decrease the abstract capacity of your LV.

Note that the other bricks must have enough space to store all data blocks of the brick you want to remove; otherwise the removal operation will return an error (ENOSPC).
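The ENOSPC precondition can be estimated beforehand from volume.reiser4 -p output (see the "LV monitoring" section for the field names). A sketch, assuming you have already parsed "block count", "blocks used" and "system blocks" for each brick:

```python
# Pre-flight check for brick removal: the victim's data blocks must fit
# in the free space of the remaining bricks, or volume.reiser4 -r will
# fail with ENOSPC.  Inputs are the "block count", "blocks used" and
# "system blocks" figures from volume.reiser4 -p J (parsing of the
# utility's output is omitted here).

def removal_fits(bricks, victim):
    """bricks: dict name -> (block_count, blocks_used, system_blocks)."""
    _count, used, system = bricks[victim]
    data_to_move = used - system          # data blocks living on the victim
    free_elsewhere = sum(count - used_
                         for name, (count, used_, _sys) in bricks.items()
                         if name != victim)
    # Only ~95% of free blocks are actually available (Reiser4 keeps a
    # 5% reserve), so compare against the conservative figure.
    return data_to_move <= 0.95 * free_elsewhere
```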
Suppose you want to remove brick /dev/vdb2 from your LV mounted at /mnt.

Update your volume configuration with the UUID and name of the brick you want to remove (item #4).

To remove the brick, simply pass its name as an argument to the "-r" option and specify the logical volume by its mount point:

 # volume.reiser4 -r /dev/vdb2 /mnt

The procedure of brick removal automatically invokes rebalancing, which distributes the data of the brick to be removed among the other bricks, so that the resulting distribution is also fair. The portion of data stripes moved during such rebalancing is equal to the relative capacity of the brick to be removed (that is, to the portion of capacity that the brick added to the LV's capacity).

It can happen that the command above completes with an error (like other user-space applications). In this case check the status of your LV:

 # volume.reiser4 /mnt

If the volume is not balanced, then simply complete balancing manually:

 # volume.reiser4 -b /mnt

Otherwise, check the number of bricks in your logical volume - it should be the same as before the failed operation. The error ENOSPC indicates that the free space on the other bricks is not enough to fit all the data of the brick you want to remove.

On success update your volume configuration: remove information about the removed brick at (#3) and (#4).

= Changing brick's capacity =

At any time (assuming that no other volume operation is in progress) you can change the abstract capacity of any brick to a new non-zero value. Changing capacity always changes the volume partitioning and therefore breaks fairness of distribution, so Reiser5 automatically launches rebalancing to make sure that the resulting distribution is fair for the new set of capacities. In particular, increasing a brick's capacity will move some data from other bricks to the brick whose capacity was increased; decreasing a brick's capacity will move some data from the brick whose capacity was decreased to the other bricks.
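Since balancing restores fair distribution for the new partitioning, the fraction of data that a capacity change moves can be estimated from the partitionings before and after. A sketch derived from the definitions above (an estimate, not an exact account of the kernel's behaviour):

```python
# Estimate the fraction of data stripes that rebalancing moves when the
# partitioning changes (capacity change, brick addition, ...).  Each
# brick's ideal share of the data goes from its old relative capacity
# to its new one; the moved fraction is the sum of the shares that grew.

def moved_fraction(old_caps, new_caps):
    """old_caps/new_caps: dict brick-name -> abstract capacity."""
    old_total = sum(old_caps.values())
    new_total = sum(new_caps.values())
    return sum(max(0.0, cap / new_total - old_caps.get(b, 0) / old_total)
               for b, cap in new_caps.items())
```

For instance, doubling the capacity of one of two equal bricks moves about 1/6 of the data. Adding a brick of relative capacity 0.5 moves about half of the data, matching the claim in "Adding a data brick to LV" that the moved portion equals the new brick's relative capacity.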
To change the abstract capacity of brick /dev/vdb1 to a new value (e.g. 200000), simply run

 # volume.reiser4 -z /dev/vdb1 -c 200000 /mnt

Read this as "resize brick /dev/vdb1 to new capacity 200000 in the volume mounted at /mnt".

The operation of changing capacity can return an error. Most likely it is ENOSPC, a side effect of concurrent regular file writes. In this case check the status of your LV. If it is unbalanced, then consider removing some files from your LV and complete balancing by running

 # volume.reiser4 -b /mnt

Otherwise, repeat the operation from scratch.

Comment. Changing a brick's capacity to 0 is undefined and will return an error. Consider the brick removal operation instead.

= Operations with meta-data brick =

The meta-data brick can also contain data stripes and participate in data distribution like the data bricks, so all the volume operations described above are also applicable to the meta-data brick. Note, however, that it is impossible to completely remove the meta-data brick from the logical volume for obvious reasons (meta-data needs to be stored somewhere), so the brick removal operation applied to the meta-data brick actually removes it from the Data Storage Array (DSA), not from the logical volume. The DSA is the subset of the LV consisting of the bricks participating in data distribution. Once you remove the meta-data brick from the DSA, that brick will be used only to store meta-data. The operation of adding a brick, applied to the meta-data brick, returns it back to the DSA.

Important: Reiser5 doesn't count busy data and meta-data blocks separately. So in contrast with data bricks (which contain only data) you cannot find out the real space occupied by data blocks on the meta-data brick - Reiser5 knows only the total space occupied.

To check the status of the meta-data brick simply run

 # volume.reiser4 /mnt

and compare the values of "bricks total" and "bricks in DSA". If they are equal, then the meta-data brick participates in data distribution.
Otherwise, "bricks total" should be 1 more than "bricks in DSA" - this indicates that the meta-data brick doesn't participate in data distribution (and therefore doesn't contain data blocks). Note that other cases are impossible: for data bricks, participation in the LV and in the DSA is always equivalent.

= Unmounting a logical volume =

To terminate a mount session just issue the usual umount command with the mount point specified. Note that after unmounting the volume, all bricks by default remain registered in the system until system shutdown. If you want to unregister a brick before system shutdown, simply issue the following command:

 # volume.reiser4 -u BRICK_NAME

= Deploying a logical volume after correct unmount =

Make sure (by checking your volume configuration) that all bricks of the volume are registered in the system. The list of all volumes and bricks registered in the system can be found in the output of the following command:

 # volume.reiser4 -l

Issue the usual mount command against one of the bricks of your volume. It is recommended to specify the meta-data brick in the mount command. If not all bricks of the volume are registered, attempts to mount such a volume will fail with a respective kernel message.

NOTE: Reiser5 will refuse to mount a logical volume when a wrong set of bricks is registered in the system. This can happen due to careless handling of off-line volumes, leading to the appearance of "artifacts" in the list of registered bricks. If you want to re-format a brick, make sure it is unregistered.

= Deploying a logical volume after correct shutdown =

To be able to mount your LV, make sure that all its bricks (data and meta-data) are registered in the system. If not all bricks of the volume are registered, attempts to mount such a volume will fail with a respective kernel message. For this reason we strongly recommend that you keep track of your LV - store its configuration somewhere, but not on this volume!
And don't forget to update that configuration after _every_ volume operation. If you lost the configuration of your LV and don't remember it (which is most likely for large volumes), then it will be rather painful to restore: currently there are no tools to manage off-line logical volumes, so users have to do this on their own. It is not at all difficult.

To register a brick in the system use the following command:

 # volume.reiser4 -g BRICK_NAME

To print a list of all registered bricks use

 # volume.reiser4 -l

To mount your LV just issue a mount command for any one brick of your LV.

Comment. Reiser5 always tries to register the brick which is passed to the mount command as an argument, so there is no need to preregister the brick you want to issue the mount command against.

= Deploying a logical volume after hard reset or system crash =

If no volume operations were interrupted by the hard reset or system crash, then just follow the instructions in section 9. In Reiser5 only a restricted number of bricks participate in every transaction. The maximal number of such bricks can be specified by the user. At mount time a transaction replay procedure will be launched on each such brick independently, in parallel.

Depending on the kind of interrupted volume operation, perform one of the following actions:

== Adding a brick was interrupted ==

Check your volume configuration. Register the old set of bricks (that is, the set of bricks that the volume had before applying the operation) and try to mount. In case of error, also register the brick you wanted to add and try to mount again.

Check the status of your LV by running

 # volume.reiser4 /mnt

If the volume is unbalanced, then complete balancing manually by running

 # volume.reiser4 -b /mnt

Check "bricks total" of your LV in the output of

 # volume.reiser4 /mnt

Compare it with the old number of bricks in the configuration. The new value should be the old one incremented by 1.
If the number of bricks is the same, then your operation of adding a brick was completely rolled back by the transaction manager, and you need to repeat it from scratch. Otherwise, your operation was successfully completed - update your volume configuration accordingly.

== Brick removal was interrupted ==

Check your volume configuration. Register the old set of bricks (that is, the set of bricks that the volume had before applying the interrupted operation) except the brick you wanted to remove. Try to mount the volume. In case of error, also register the brick you wanted to remove and try to mount again.

Check the status of your LV:

 # volume.reiser4 /mnt

If the volume is unbalanced, then complete balancing manually by running

 # volume.reiser4 -b /mnt

Comment. After successful completion of balancing, the brick will be automatically removed from the volume. Make sure of it by checking the status of your LV:

 # volume.reiser4 /mnt

Update your volume configuration accordingly.

== Another volume operation was interrupted ==

Using the volume configuration, register the new set of bricks and try to mount the volume. The mount should be successful.
Check the status of your LV:

 # volume.reiser4 /mnt

If the volume is unbalanced, then complete balancing manually by running

 # volume.reiser4 -b /mnt

= LV monitoring =

Common info about the LV mounted at /mnt:

 # volume.reiser4 /mnt

* ID: Volume UUID
* volume: ID of the plugin managing the volume
* distribution: ID of the distribution plugin
* stripe: Stripe size in bytes
* segments: Number of hash space segments (for distribution)
* bricks total: Total number of bricks in the volume
* bricks in DSA: Number of bricks participating in data distribution
* balanced: Balanced status of the volume

Info about any of its bricks, of index J:

 # volume.reiser4 -p J /mnt

* internal ID: Brick's "internal ID" and its status in the volume
* external ID: Brick's UUID
* device name: Name of the block device associated with the brick
* block count: Size of the block device in blocks
* blocks used: Total number of occupied blocks on the device
* system blocks: Minimal possible number of busy blocks on that device
* data capacity: Abstract capacity of the brick
* space usage: Portion of occupied blocks on the device

Comment. When retrieving a brick's info, make sure that no volume operations are in progress on that volume. Otherwise the command above will return an error (EBUSY).

WARNING. Brick info obtained this way is not necessarily the most recent. To get up-to-date info, run sync(1) and make sure that no regular file operations are in progress.

= Checking free space =

To check the number of available free blocks on a volume mounted at /mnt, make sure that no regular file operations or volume operations are in progress on that volume, then run

 # sync
 # df --block-size=4K /mnt

To check the number of free blocks on the brick of index J, run

 # volume.reiser4 -p J /mnt

and calculate the difference between "block count" and "blocks used".

Comment. Not all free blocks on a brick/volume are available for use.
The number of available free blocks is always ~95% of the total number of free blocks (Reiser4 reserves 5% to make sure that regular file truncate operations won't fail).

NOTE: volume.reiser4 shows the total number of free blocks, whereas df(1) shows the number of available free blocks.

The "space usage" statistic shows the portion of busy blocks on an individual brick. For the reasons explained above, "space usage" on any brick cannot be more than 0.95.

= Checking quality of data distribution =

Quality of data distribution is a measure of the deviation of the real data space usage from the ideal one defined by the volume partitioning. The smaller the deviation, the better the distribution quality.

Checking the quality of distribution makes sense only when your volume partitioning is space-based, or if it coincides with the space-based one. If your partitioning is throughput-based and doesn't coincide with the space-based one, then the quality of the actual data distribution can be rather bad: in this case the file system takes care that low-performance devices don't become a bottleneck, and effective space usage is not a high priority.

Checking the quality of data distribution is based on the free-blocks accounting provided by the file system. Note that the file system doesn't count busy data and meta-data blocks separately, so you cannot find the real data space usage - and hence cannot check the quality of distribution - when the meta-data brick contains data blocks.

To check the quality of distribution:
* (1) make sure that the meta-data brick doesn't contain data blocks;
* (2) make sure that no regular file or volume operations are currently in progress;
* (3) find the "blocks used", "system blocks" and "data capacity" statistics for each data brick:
 # sync
 # volume.reiser4 -p 1 /mnt
 ...
 # volume.reiser4 -p N /mnt
* (4) find the real data space usage on each brick;
* (5) calculate the partitioning and the ideal data space usage on each data brick;
* (6) find the deviation of (4) from (5).

Example.
Let's build an LV of 3 bricks (one 10G meta-data brick vdb1, and two data bricks: vdc1 (10G) and vdd1 (5G)) with space-based partitioning:

 # VOL_ID=`uuid -v4`
 # echo "Using uuid $VOL_ID"
 # mkfs.reiser4 -U $VOL_ID -y -t 256K /dev/vdb1
 # mkfs.reiser4 -U $VOL_ID -y -a -t 256K /dev/vdc1
 # mkfs.reiser4 -U $VOL_ID -y -a -t 256K /dev/vdd1
 # mount /dev/vdb1 /mnt

Fill the meta-data brick with data:

 # dd if=/dev/zero of=/mnt/myfile bs=256K
 No space left on device...

Add data bricks /dev/vdc1 and /dev/vdd1 to the volume:

 # volume.reiser4 -a /dev/vdc1 /mnt
 # volume.reiser4 -a /dev/vdd1 /mnt

Move all data blocks to the newly added bricks:

 # volume.reiser4 -r /dev/vdb1 /mnt
 # sync

Now the meta-data brick doesn't contain data blocks (only meta-data ones), so we can calculate the quality of data distribution:

 # volume.reiser4 /mnt -p0
 blocks used: 503
 # volume.reiser4 /mnt -p1
 blocks used: 1657203
 system blocks: 115
 data capacity: 2621069
 # volume.reiser4 /mnt -p2
 blocks used: 833001
 system blocks: 73
 data capacity: 1310391

Based on the statistics above, calculate the quality of distribution.

Total data capacity of the volume:
 C = 2621069 + 1310391 = 3931460
Relative capacities of the data bricks:
 C1 = 2621069 / 3931460 = 0.6667
 C2 = 1310391 / 3931460 = 0.3333
Real space usage on the data bricks (blocks used - system blocks):
 R1 = 1657203 - 115 = 1657088
 R2 = 833001 - 73 = 832928
Space usage on the volume:
 R = R1 + R2 = 1657088 + 832928 = 2490016
Ideal data space usage on the data bricks:
 I1 = C1 * R = 0.6667 * 2490016 = 1660094
 I2 = C2 * R = 0.3333 * 2490016 = 829922
Deviation:
 (D1, D2) = (R1, R2) - (I1, I2) = (-3006, 3006)
Relative deviation:
 (D1, D2)/R = (-0.0012, 0.0012)
Quality of distribution:
 Q = 1 - max(|D1|, |D2|)/R = 1 - 0.0012 = 0.9988

Comment. For any specified number of bricks N and quality of distribution Q it is possible to find a configuration of a logical volume composed of N bricks such that the quality of distribution on that volume is better than Q.
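The arithmetic of this example can be re-checked with a short Python script (the input numbers are copied from the volume.reiser4 output above):

```python
# Re-check of the worked example: quality of data distribution from the
# "blocks used" / "system blocks" / "data capacity" figures printed by
# volume.reiser4 for the two data bricks.
used     = [1657203, 833001]
system   = [115, 73]
capacity = [2621069, 1310391]

real      = [u - s for u, s in zip(used, system)]    # R1, R2
total_cap = sum(capacity)
rel_cap   = [c / total_cap for c in capacity]        # C1, C2
total     = sum(real)                                # R
ideal     = [c * total for c in rel_cap]             # I1, I2
rel_dev   = [(r - i) / total for r, i in zip(real, ideal)]
quality   = 1 - max(abs(d) for d in rel_dev)         # ~0.9988
```

Note that the relative deviations always sum to zero, since the ideal usages are, by construction, a redistribution of the same total R.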
Quality of distribution Q doesn't depend on the number of bricks in the logical volume. This is a theorem, which can be strictly proven. 428efd8b0f7b286c6892524b08cf26ed47f07362 4338 4337 2019-12-31T12:52:47Z Edward 4 /* LV monitoring */ Logical volume (LV) can be composed of any number of block devices, different in physical and geometric parameters. However the optimal configuration (true parallelism) imposes some restrictions and dependencies on the size of such devices. WARNING: The stuff is not stable. Don't put important data to logical volumes managed by software of release number 5.X.Y IMPORTANT: Currently there is no tools to manage Reiser5 logical volumes off-line, so it it strongly recommended to save/update configurations of your LV in a file, which doesn't belong to that volume. = Basic definitions. Volume configuration. Brick's capacity. Partitioning. Fair distribution. Balancing = Basic configuration of a logical volume is the following information: * Volume UUID; * Number of bricks in the volume; * List of brick names or UUIDs in the volume; * UUID or name of the brick to be added/removed (if any). That brick is not counted in (2) and (3). For each volume its configuration should be stored somewhere (but not on that volume!) and properly updated before and after each volume operation performed on that volume. We make the user responsible for this. Volume configuration is needed to facilitate deploying a volume. '''Abstract capacity''' (or simply capacity) of a brick is a positive integer number. Capacity is a brick's property defined by user. Don't confuse it with the size of block device. Think of it as of brick's "weight" in some units. And this is the user, who decides, which property of the brick to assign as its abstract capacity and in which units. 
In particular, it can be size of the block device in kilobytes, or its size in megabytes, or its throughput in M/sec, or other geometric or physical parameter of the device, associated with the brick. It is important that capacities of all bricks of the same logical volume are measured in the same units. Also, it would be utterly pointless to assign different properties as abstract capacities for bricks of the same LV. For example, size of block device for one brick, and disk bandwidth for another one. Capacity of each brick gets initialized by mkfs utility. By default it is calculated as number of free blocks on the device at the very end of the formatting procedure. For meta-data brick it is calculated as 70% of such amount. Capacity of any brick can be changed on-line by user. '''Capacity of a logical volume''' is defined as a sum of capacities of its bricks-components. '''Relative capacity of a brick''' is the ratio of brick's capacity to volume's capacity. Relative capacity defines a portion of IO-requests that will be issued against that brick. Array of relative capacities (C1, C2, ...) of all bricks is called volume partitioning. Obviously, C1 + C2 + ... = 1. '''(Real) data space usage''' on a brick is number of data blocks, stored on that brick. '''Ideal (or expected) data space usage''' on a brick is T*C, where T is total number of data blocks stored in the volume. C is relative capacity of the brick. It is recommended to compose volumes in the way so that space-based partitioning coincides with throughput-based one - it would be the optimal volume configuration, which provides true parallelism. If it is impossible for some reason, then choose a preferred partitioning method (space-based, or throughput-based). Note that space-based partitioning saves volume space, whereas throughput based one saves volume throughput. When performing regular file operations, Reiser5 distributes data stripes throughout the volume evenly and fairly. 
This means that the portion of IO requests issued against each brick is equal to its relative capacity, that is, to the portion of capacity that the brick adds to the total volume capacity.

Most volume operations are accompanied by rebalancing, which maintains fairness of distribution. For example, adding a brick to a logical volume changes its partitioning and hence breaks fairness of the distribution, so some data stripes need to be moved to the new brick to make the distribution fair again. Likewise, you cannot simply remove a brick from a logical volume - all data stripes have to be moved from that brick to the other bricks of the logical volume first.

Every time the user performs a volume operation, Reiser5 marks the LV as "not balanced". After successful balancing the status of the LV is changed back to "balanced". If the balancing procedure fails for some reason, it should be resumed manually (with the volume.reiser4 utility). It is allowed to perform regular file operations on a not-balanced LV. However, in this case:

a) we don't guarantee a good quality of data distribution on your LV;

b) you won't be able to perform volume operations on your LV other than balancing - any other volume operation will return an error (EBUSY).

So don't forget to bring your LV to the balanced state as soon as possible!

= Prepare Software and Hardware =

Build, install and boot a kernel with Reiser4 of software framework release number 5.X.Y. Kernel patches can be found [https://sourceforge.net/projects/reiser4/files/v5-unstable/ here]. Note that the Linux kernel and GNU utilities still recognize this testing software as "Reiser4". Make sure the following message appears in the kernel logs:

 "Loading Reiser4 (Software Framework Release: 5.X.Y)"

Build and install the latest [https://sourceforge.net/projects/reiser4/files/reiser4-utils/libaal/ libaal]. Download, build and install the latest version 2.A.B of the [https://sourceforge.net/projects/reiser4/files/v5-unstable/ Reiser4progs package].
Make sure that the utility for managing logical volumes is installed on your machine (as a part of the reiser4progs package):

 # volume.reiser4 -?

= Creating a logical volume =

Start by choosing a unique ID (UUID) for your volume. By default it is generated by the mkfs utility. However, the user can generate it with a suitable tool (e.g. uuid(1)) and store it in an environment variable for convenience:

 # VOL_ID=`uuid -v4`
 # echo "Using uuid $VOL_ID"

Choose a stripe size for your logical volume. For a good quality of distribution it is recommended that the stripe not exceed 1/10000 of the volume size. On the other hand, stripes that are too small will increase space consumption on your meta-data brick. In our example we choose a stripe size of 256K.

Start by creating the first brick of your volume - the meta-data brick - passing the volume ID and stripe size to the mkfs.reiser4 utility:

 # mkfs.reiser4 -U $VOL_ID -t 256K /dev/vdb1

Currently only one meta-data brick per volume is supported, so it is recommended that the block device for the meta-data brick not be too small. In most cases it will be enough if your meta-data brick is not smaller than 1/200 of the maximal volume size. For example, a 100G meta-data brick will be able to service a ~20T logical volume.

Mount your logical volume consisting of one meta-data brick:

 # mount /dev/vdb1 /mnt

Find a record about your volume in the output of the following command:

 # volume.reiser4 -l

Create the configuration of your logical volume (its definition is above) and store it somewhere - but not on that volume! Your logical volume is now on-line and ready to use. You can perform regular file operations and volume operations (e.g. add a data brick to your LV).

= Adding a data brick to LV =

At any time you are able to add a data brick to your LV. You can do it in parallel with regular file operations executing on this volume. Make sure, however, that no other volume operation (e.g.
removing a brick) is in progress on your volume, otherwise your operation will fail with EBUSY. Obviously, adding a brick will increase the capacity of your volume.

Choose a block device for the new data brick. Make sure that it is neither too large nor too small: the capacities of any two bricks of the same logical volume cannot differ by more than 2^19 (~500,000) times. E.g. your logical volume cannot contain both a 1M and a 2T brick. Any attempt to add a brick of improper capacity will fail with an error.

Format the device the same way as the meta-data brick, but also specify the "-a" option (to let mkfs know that it is a data brick):

 # mkfs.reiser4 -U $VOL_ID -t 256K -a /dev/vdb2

Important: make sure you specify the same volume ID and stripe size as the other bricks of the logical volume have. Otherwise the operation of adding a data brick will fail.

Update the configuration of your volume with the UUID or name of the brick you want to add (item #4). To add the brick, simply pass its name as an argument to the option "-a" and specify your LV via its mount point:

 # volume.reiser4 -a /dev/vdb2 /mnt

The procedure of adding a brick automatically invokes rebalancing, which moves a portion of data stripes to the newly added brick (so that the resulting distribution is fair). The portion of data blocks moved during such rebalancing is equal to the relative capacity of the new brick, that is, to the portion of capacity that the new brick adds to the updated LV's capacity. This important property defines the cost of the balancing procedure: if the portion of capacity added by a brick is small, then the number of stripes moved during balancing is also small.

Like other user-space utilities, the operation of adding a brick can return an error, even assuming that the brick you wanted to add is properly formatted. In this case check the status of your LV:

 # volume.reiser4 /mnt

If the volume is unbalanced, then simply complete balancing manually:

 # volume.reiser4 -b /mnt

Otherwise, check the number of bricks in your LV.
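The cost estimate above is easy to check with a quick calculation. A minimal sketch with hypothetical capacities: a new brick whose capacity is one third of the existing volume capacity owns one quarter of the updated capacity, so rebalancing moves one quarter of the data:

```shell
# Hypothetical capacities: the volume before the operation, and the new brick.
OLD_CAP=3000000
NEW_BRICK=1000000

# Rebalancing after "volume.reiser4 -a" moves a portion of the data equal to
# the new brick's relative capacity in the *updated* volume.
MOVED=$(awk -v o="$OLD_CAP" -v n="$NEW_BRICK" 'BEGIN { printf "%.4f", n / (o + n) }')
echo "expected portion of data moved: $MOVED"
```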
Most likely it is the same as it was before the failed operation. In this case simply repeat the operation of adding a brick from scratch. Upon successful completion update your volume configuration: increment (#2), add info about the new brick to (#3) and remove the record at (#4).

= Removing a data brick from LV =

At any time you are able to remove a data brick from your LV. You can do it in parallel with regular file operations executing on this volume. Make sure, however, that no other volume operation (e.g. adding a brick) is in progress on your volume, otherwise your operation will fail with EBUSY. Obviously, removing a brick will decrease the abstract capacity of your LV. Note that the other bricks must have enough space to store all the data blocks of the brick you want to remove; otherwise the removal operation will return an error (ENOSPC).

Suppose you want to remove brick /dev/vdb2 from your LV mounted at /mnt. Update your volume configuration with the UUID and name of the brick you want to remove (item #4). To remove the brick, simply pass its name as an argument to option "-r" and specify the logical volume by its mount point:

 # volume.reiser4 -r /dev/vdb2 /mnt

The procedure of brick removal automatically invokes rebalancing, which distributes the data of the brick to be removed among the other bricks, so that the resulting distribution is also fair. The portion of data stripes moved during such rebalancing is equal to the relative capacity of the brick to be removed (that is, to the portion of capacity that the brick added to the LV's capacity).

It can happen that the command above completes with an error (like other user-space applications). In this case check the status of your LV:

 # volume.reiser4 /mnt

If the volume is not balanced, then simply complete balancing manually:

 # volume.reiser4 -b /mnt

Otherwise, check the number of bricks in your logical volume - it should be the same as before the failed operation.
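Before removing a brick, you can estimate whether the operation will hit ENOSPC. A hedged sketch with hypothetical block counts; it applies the ~5% free-space reserve that Reiser4 keeps for regular file operations, which is this example's assumption about how much of the free space a removal can actually use:

```shell
# Hypothetical block counts: data blocks on the brick to be removed, and free
# blocks available on the remaining bricks of the volume.
REMOVE_DATA=832928
FREE_OTHERS=1200000

# Only ~95% of free blocks are actually available (Reiser4 reserves 5%), so
# compare the data to be moved against 95% of the free space elsewhere.
OK=$(awk -v d="$REMOVE_DATA" -v f="$FREE_OTHERS" \
    'BEGIN { print ((d <= 0.95 * f) ? "yes" : "no") }')
echo "removal likely to succeed: $OK"
```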
The error -ENOSPC indicates that there is not enough free space on the other bricks to fit all the data of the brick you want to remove. On success, update your volume configuration: remove the information about the removed brick at #3 and #4.

= Changing brick's capacity =

At any time (assuming no other volume operation is in progress) you can change the abstract capacity of any brick to some new non-zero value. Changing capacity always changes the volume partitioning and therefore breaks fairness of distribution, so Reiser5 automatically launches rebalancing to make sure that the resulting distribution is fair for the new set of capacities. In particular, increasing a brick's capacity will move some data from other bricks to the brick whose capacity was increased. Decreasing a brick's capacity will move some data from the brick whose capacity was decreased to the other bricks.

To change the abstract capacity of brick /dev/vdb1 to a new value (e.g. 200000), simply run

 # volume.reiser4 -z /dev/vdb1 -c 200000 /mnt

Pronounced as "resize brick /dev/vdb1 to new capacity 200000 in the volume mounted at /mnt".

The operation of changing capacity can return an error. Most likely it is -ENOSPC, which is a side effect of concurrent regular file writes. In this case check the status of your LV. If it is unbalanced, then consider removing some files from your LV and complete balancing by running

 # volume.reiser4 -b /mnt

Otherwise, repeat the operation from scratch.

Comment. Changing a brick's capacity to 0 is undefined and will return an error. Consider the brick removal operation instead.

= Operations with meta-data brick =

The meta-data brick can also contain data stripes and participate in data distribution like the data bricks, so all the volume operations described above are also applicable to the meta-data brick.
Note, however, that it is impossible to completely remove the meta-data brick from the logical volume for obvious reasons (meta-data needs to be stored somewhere), so the brick removal operation applied to the meta-data brick actually removes it from the Data Storage Array (DSA), not from the logical volume. The DSA is the subset of the LV consisting of the bricks participating in data distribution. Once you remove the meta-data brick from the DSA, that brick will be used only to store meta-data. The operation of adding a brick, applied to the meta-data brick, returns it back to the DSA.

Important: Reiser5 doesn't count busy data and meta-data blocks separately. So, in contrast with data bricks (which contain only data), you are not able to find out the real space occupied by data blocks on the meta-data brick - Reiser5 knows only the total space occupied.

To check the status of the meta-data brick, simply run

 # volume.reiser4 /mnt

and compare the values of "bricks total" and "bricks in DSA". If they are equal, then the meta-data brick participates in data distribution. Otherwise "bricks total" should be 1 more than "bricks in DSA", which indicates that the meta-data brick doesn't participate in data distribution (and therefore doesn't contain data blocks). Note that other cases are impossible: for data bricks, participation in the LV and in the DSA are always equivalent.

= Unmounting a logical volume =

To terminate a mount session, just issue the usual umount command with the mount point specified. Note that after unmounting the volume all bricks by default remain registered in the system until system shutdown. If you want to unregister a brick before system shutdown, simply issue the following command:

 # volume.reiser4 -u BRICK_NAME

= Deploying a logical volume after correct unmount =

Make sure (by checking your volume configuration) that all bricks of the volume are registered in the system.
The list of all volumes and bricks registered in the system can be found in the output of the following command:

 # volume.reiser4 -l

Issue the usual mount command against one of the bricks of your volume. It is recommended to specify the meta-data brick in the mount command. If not all bricks of the volume are registered, then attempts to mount the volume will fail with a respective kernel message.

NOTE: Reiser5 will refuse to mount a logical volume when a wrong set of bricks is registered in the system. This can happen due to careless handling of off-line volumes, leading to the appearance of "artifacts" in the list of registered bricks. If you want to re-format a brick, make sure it is unregistered.

= Deploying a logical volume after correct shutdown =

To be able to mount your LV, make sure that all its bricks (data and meta-data) are registered in the system. If not all bricks of the volume are registered, then attempts to mount the volume will fail with a respective kernel message. For this reason we strongly recommend that the user keep track of his LV - store its configuration somewhere, but not on that volume! And don't forget to update that configuration after _every_ volume operation. If you lost the configuration of your LV and don't remember it (which is most likely for large volumes), then it will be rather painful to restore: currently there are no tools to manage off-line logical volumes, so users are prompted to do this on their own. It is not at all difficult.

To register a brick in the system, use the following command:

 # volume.reiser4 -g BRICK_NAME

To print a list of all registered bricks, use

 # volume.reiser4 -l

To mount your LV, just issue a mount command for any one brick of your LV.

Comment. Reiser5 always tries to register the brick which is passed to the mount command as an argument, so there is no need to pre-register the brick you issue the mount command against.
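The registration step can be scripted from the saved configuration. A minimal sketch, assuming the configuration file simply lists one brick device name per line (the file name and format are this example's assumption, not a reiser4progs convention); the function prints the commands instead of running them:

```shell
# Print the "volume.reiser4 -g" registration command for every brick listed in
# a saved LV configuration file (one device name per line, blank lines ignored).
# Drop the "echo" to actually run the commands.
register_bricks() {
    while IFS= read -r brick; do
        [ -n "$brick" ] || continue
        echo volume.reiser4 -g "$brick"
    done < "$1"
}

# Example usage with a hypothetical saved configuration file:
#   register_bricks /root/my-lv.conf
```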
= Deploying a logical volume after hard reset or system crash =

If no volume operation was interrupted by the hard reset or system crash, then just follow the instructions in the section "Deploying a logical volume after correct shutdown" above. In Reiser5 only a restricted number of bricks participates in every transaction. The maximal number of such bricks can be specified by the user. At mount time a transaction replay procedure will be launched on each such brick independently, in parallel. Depending on the kind of interrupted volume operation, perform one of the following actions.

== Adding a brick was interrupted ==

Check your volume configuration. Register the old set of bricks (that is, the set of bricks that the volume had before the operation was applied) and try to mount. In the case of an error, register also the brick you wanted to add and try to mount again. Check the status of your LV by running

 # volume.reiser4 /mnt

If the volume is unbalanced, then complete balancing manually by running

 # volume.reiser4 -b /mnt

Check "bricks total" of your LV in the output of

 # volume.reiser4 /mnt

Compare it with the old number of bricks in the configuration. The new value should be the old one incremented by 1. If the number of bricks is the same, then your operation of adding a brick was completely rolled back by the transaction manager, and you need to repeat it from scratch. Otherwise, your operation was successfully completed - update your volume configuration respectively.

== Brick removal was interrupted ==

Check your volume configuration. Register the old set of bricks (that is, the set of bricks that the volume had before the interrupted operation was applied) except the brick you wanted to remove. Try to mount the volume. In the case of an error, register also the brick you wanted to remove and try to mount again. Check the status of your LV:

 # volume.reiser4 /mnt

If the volume is unbalanced, then complete balancing manually by running

 # volume.reiser4 -b /mnt

Comment.
After successful completion of balancing, the brick will be automatically removed from the volume. Make sure of it by checking the status of your LV:

 # volume.reiser4 /mnt

Update your volume configuration respectively.

== Another volume operation was interrupted ==

Using the volume configuration, register the new set of bricks and try to mount the volume. The mount should be successful. Check the status of your LV:

 # volume.reiser4 /mnt

If the volume is unbalanced, then complete balancing manually by running

 # volume.reiser4 -b /mnt

= LV monitoring =

Common info about the LV mounted at /mnt:

 # volume.reiser4 /mnt
 ID:            Volume UUID
 volume:        ID of the plugin managing the volume
 distribution:  ID of the distribution plugin
 stripe:        Stripe size in bytes
 segments:      Number of hash space segments (for distribution)
 bricks total:  Total number of bricks in the volume
 bricks in DSA: Number of bricks participating in data distribution
 balanced:      Balanced status of the volume

Info about any of its bricks, of index J:

 # volume.reiser4 -p J /mnt
 internal ID:   Brick's "internal ID" and its status in the volume
 external ID:   Brick's UUID
 device name:   Name of the block device associated with the brick
 block count:   Size of the block device in blocks
 blocks used:   Total number of occupied blocks on the device
 system blocks: Minimal possible number of busy blocks on that device
 data capacity: Abstract capacity of the brick
 space usage:   Portion of occupied blocks on the device

Comment. When retrieving a brick's info, make sure that no volume operations on that volume are in progress. Otherwise the command above will return an error (EBUSY).

WARNING. Brick info provided this way is not necessarily the most recent. To get actual info, run sync(1) and make sure that no regular file operations are in progress.
= Checking free space =

To check the number of available free blocks on a volume mounted at /mnt, make sure that no regular file operations or volume operations are in progress on that volume, then run

 # sync
 # df --block-size=4K /mnt

To check the number of free blocks on the brick of index J, run

 # volume.reiser4 -p J /mnt

then calculate the difference between "block count" and "blocks used".

Comment. Not all free blocks on a brick/volume are available for use. The number of available free blocks is always ~95% of the total number of free blocks (Reiser4 reserves 5% to make sure that regular file truncate operations won't fail).

NOTE: volume.reiser4 shows the total number of free blocks, whereas df(1) shows the number of available free blocks. The "space usage" statistic shows the portion of busy blocks on an individual brick. For the reasons explained above, "space usage" on any brick cannot be more than 0.95.

= Checking quality of data distribution =

Quality of data distribution is a measure of the deviation of the real data space usage from the ideal one defined by the volume partitioning. The smaller the deviation, the better the distribution quality. Checking quality of distribution makes sense only when your volume partitioning is space-based, or coincides with the space-based one. If your partitioning is throughput-based and doesn't coincide with the space-based one, then the quality of the actual data distribution can be rather bad: in this case the file system takes care that low-performance devices do not become a bottleneck, and effective space usage is not a high priority.

Checking quality of data distribution is based on the free-blocks accounting provided by the file system. Note that the file system doesn't count busy data and meta-data blocks separately, so you are not able to find the real data space usage - and hence to check quality of distribution - in the case when the meta-data brick contains data blocks.
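The deviation-based metric defined above is easy to compute from per-brick statistics. A minimal sketch for a volume with two data bricks; the "blocks used" / "system blocks" / "data capacity" values are hypothetical stand-ins for what volume.reiser4 would report:

```shell
# Hypothetical statistics for two data bricks, as printed by "volume.reiser4 -p J".
USED1=1657203; SYS1=115; CAP1=2621069
USED2=833001;  SYS2=73;  CAP2=1310391

Q=$(awk -v u1="$USED1" -v s1="$SYS1" -v c1="$CAP1" \
       -v u2="$USED2" -v s2="$SYS2" -v c2="$CAP2" 'BEGIN {
    r1 = u1 - s1; r2 = u2 - s2        # real data space usage per brick
    r  = r1 + r2                      # data blocks on the whole volume
    i1 = c1 / (c1 + c2) * r           # ideal usage = relative capacity * total
    i2 = c2 / (c1 + c2) * r
    d1 = (r1 - i1) / r; if (d1 < 0) d1 = -d1   # absolute relative deviations
    d2 = (r2 - i2) / r; if (d2 < 0) d2 = -d2
    printf "%.4f", 1 - (d1 > d2 ? d1 : d2)     # quality = 1 - max deviation
}')
echo "quality of distribution: $Q"
```

A value close to 1 means the real usage tracks the partitioning closely.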
To check quality of distribution:
* make sure that the meta-data brick doesn't contain data blocks;
* make sure that no regular file or volume operations are currently in progress;
* find the "blocks used", "system blocks" and "data capacity" statistics for each data brick:

 # sync
 # volume.reiser4 -p 1 /mnt
 ...
 # volume.reiser4 -p N /mnt

* find the real data space usage on each brick;
* calculate the partitioning and the ideal data space usage on each data brick;
* find the deviation of (4) from (5).

Example. Let's build an LV of 3 bricks (one 10G meta-data brick vdb1, and two data bricks: vdc1 (10G) and vdd1 (5G)) with space-based partitioning:

 # VOL_ID=`uuid -v4`
 # echo "Using uuid $VOL_ID"
 # mkfs.reiser4 -U $VOL_ID -y -t 256K /dev/vdb1
 # mkfs.reiser4 -U $VOL_ID -y -a -t 256K /dev/vdc1
 # mkfs.reiser4 -U $VOL_ID -y -a -t 256K /dev/vdd1
 # mount /dev/vdb1 /mnt

Fill the meta-data brick with data:

 # dd if=/dev/zero of=/mnt/myfile bs=256K
 No space left on device...

Add data bricks /dev/vdc1 and /dev/vdd1 to the volume:

 # volume.reiser4 -a /dev/vdc1 /mnt
 # volume.reiser4 -a /dev/vdd1 /mnt

Move all data blocks to the newly added bricks:

 # volume.reiser4 -r /dev/vdb1 /mnt
 # sync

Now the meta-data brick doesn't contain data blocks (only meta-data ones), so we can calculate the quality of data distribution:

 # volume.reiser4 /mnt -p0
 blocks used: 503
 # volume.reiser4 /mnt -p1
 blocks used: 1657203
 system blocks: 115
 data capacity: 2621069
 # volume.reiser4 /mnt -p2
 blocks used: 833001
 system blocks: 73
 data capacity: 1310391

Based on the statistics above, calculate the quality of distribution.
Total data capacity of the volume: C = 2621069 + 1310391 = 3931460 Relative capacities of data bricks: C1 = 2621069 /(2621069 + 1310391) = 0.6667 C2 = 1310464 /(2621069 + 1310391) = 0.3333 Real space usage on data bricks (blocks used - system blocks): R1 = 1657203 - 115 = 1657088 R2 = 833001 - 73 = 832928 Space usage on the volume: R = R1 + R2 = 1657088 + 832928 = 2490016 Ideal data space usage on data bricks: I1 = C1 * R = 0.6667 * 2490016 = 1660094 I2 = C2 * R = 0.3333 * 2490016 = 829922 Deviation: D = (R1, R2) - (I1, I2) = (3006, -3006) Relative deviation: D/R = (-0.0012, 0.0012) Quality of distribution: Q = 1 - max(|D1|, |D1|) = 1 - 0.0012 = 0.9988 Comment. For any specified number of bricks N and quality of distribution Q it is possible to find a configuration of a logical volume composed of N bricks, so that quality of distribution on that volume will be better than Q. Comment. Quality of distribution Q doesn't depend on the number of bricks in the logical volume. This is a theorem, which can be strictly proven. ed8c2642661563ca11076560c6ee350668993638 4337 4336 2019-12-31T12:07:16Z Edward 4 /* Checking quality of data distribution */ Logical volume (LV) can be composed of any number of block devices, different in physical and geometric parameters. However the optimal configuration (true parallelism) imposes some restrictions and dependencies on the size of such devices. WARNING: The stuff is not stable. Don't put important data to logical volumes managed by software of release number 5.X.Y IMPORTANT: Currently there is no tools to manage Reiser5 logical volumes off-line, so it it strongly recommended to save/update configurations of your LV in a file, which doesn't belong to that volume. = Basic definitions. Volume configuration. Brick's capacity. Partitioning. Fair distribution. 
Balancing = Basic configuration of a logical volume is the following information: * Volume UUID; * Number of bricks in the volume; * List of brick names or UUIDs in the volume; * UUID or name of the brick to be added/removed (if any). That brick is not counted in (2) and (3). For each volume its configuration should be stored somewhere (but not on that volume!) and properly updated before and after each volume operation performed on that volume. We make the user responsible for this. Volume configuration is needed to facilitate deploying a volume. '''Abstract capacity''' (or simply capacity) of a brick is a positive integer number. Capacity is a brick's property defined by user. Don't confuse it with the size of block device. Think of it as of brick's "weight" in some units. And this is the user, who decides, which property of the brick to assign as its abstract capacity and in which units. In particular, it can be size of the block device in kilobytes, or its size in megabytes, or its throughput in M/sec, or other geometric or physical parameter of the device, associated with the brick. It is important that capacities of all bricks of the same logical volume are measured in the same units. Also, it would be utterly pointless to assign different properties as abstract capacities for bricks of the same LV. For example, size of block device for one brick, and disk bandwidth for another one. Capacity of each brick gets initialized by mkfs utility. By default it is calculated as number of free blocks on the device at the very end of the formatting procedure. For meta-data brick it is calculated as 70% of such amount. Capacity of any brick can be changed on-line by user. '''Capacity of a logical volume''' is defined as a sum of capacities of its bricks-components. '''Relative capacity of a brick''' is the ratio of brick's capacity to volume's capacity. Relative capacity defines a portion of IO-requests that will be issued against that brick. 
Array of relative capacities (C1, C2, ...) of all bricks is called volume partitioning. Obviously, C1 + C2 + ... = 1. '''(Real) data space usage''' on a brick is number of data blocks, stored on that brick. '''Ideal (or expected) data space usage''' on a brick is T*C, where T is total number of data blocks stored in the volume. C is relative capacity of the brick. It is recommended to compose volumes in the way so that space-based partitioning coincides with throughput-based one - it would be the optimal volume configuration, which provides true parallelism. If it is impossible for some reason, then choose a preferred partitioning method (space-based, or throughput-based). Note that space-based partitioning saves volume space, whereas throughput based one saves volume throughput. When performing regular file operations, Reiser5 distributes data stripes throughout the volume evenly and fairly. It means that portion of IO-requests issued against each brick is equal to its relative capacity, that is, to the portion of capacity that the brick adds to the total volume's capacity. Most volume operations are accompanied by rebalancing, which keeps fairness of distribution. For example, adding a brick to a logical volume changes its partitioning, and hence, breaks fairness of the distribution, so we need to move some data stripes to the new brick to make distribution fair. Also you can not simply remove a brick from a logical volume - all data stripes should be moved from that brick to other bricks of the logical volume. Every time when user performs a volume operation, Reiser5 marks LV as "not balanced". After successful balancing the status of LV is changed to "balanced". If balancing procedure fails for some reasons, it should be resumed manually (with volume.reiser4 utility). It is allowed to perform regular file operations on not balanced LV. However, in this case: a) we don't guarantee a good quality of data distribution on your LV. 
b) you won't be able to perform volume operations on your LV except balancing - any other volume operation will return error (EBUSY). So, don't forget to bring your LV to the balanced state as soon as possible! = Prepare Software and Hardware = Build, install and boot kernel with Reiser4 of software framework release number 5.X.Y. Kernel patches can be found [https://sourceforge.net/projects/reiser4/files/v5-unstable/ here]. Note that by Linux kernel and GNU utilities the testing stuff is still recognized as "Reiser4". Make sure there is the following message in kernel logs: "Loading Reiser4 (Software Framework Release: 5.X.Y)" Build and install the latest [https://sourceforge.net/projects/reiser4/files/reiser4-utils/libaal/ libaal] Download, build and install the latest version 2.A.B of [https://sourceforge.net/projects/reiser4/files/v5-unstable/ Reiser4progs package]. Make sure that utility for managing logical volumes is installed (as a part of reiser4progs package) on your machine: # volume.reiser4 -? = Creating a logical volume = Start from choosing a unique ID (uuid) of your volume. By default it is generated by mkfs utility. However, user can generate it himself by proper tools (e.g. uuid(1)) and store in an environment variable for convenience: # VOL_ID=`uuid -v4` # echo "Using uuid $VOL_ID" Choose a stripe size for your logical volume. For a good quality of distribution it is recommended that stripe doesn't exceed 1/10000 of volume size. On the other hand, too small stripes will increase space consumption on your meta-data brick. In our example we choose stripe size 256K. Start from creating the first brick of your volume - meta-data brick, passing volume-ID and stripe size to mkfs.reiser4 utility: # mkfs.reiser4 -U $VOL_ID -t 256K /dev/vdb1 Currently only one meta-data brick per volume is supported, so it is recommended that size of block device for meta-data brick in not too small. 
In most cases it will be enough, if your meta-data brick is not smaller than 1/200 of maximal volume size. For example, 100G meta-data brick will be able to service ~20T logical volume. Mount your logical volume consisting of one meta-data brick: # mount /dev/vdb1 /mnt Find a record about your volume in the output of the following command: # volume.reiser4 -l Create configuration of your logical volume (its definition is above) and store it somewhere, but not on that volume! Your logical volume is now on-line and ready to use. You can perform regular file operations and volume operations (e.g. add a data brick to your LV). = Adding a data brick to LV = At any time you are able to add a data brick to your LV. You can do it in parallel with regular file operations executing on this volume. Make sure, however, that there is no other volume operations (e.g. removing a brick) over your volume in progress, otherwise your operation will fail with EBUSY. Obviously, adding a brick will increase capacity of your volume. Choose a block device for the new data brick. Make sure that it is not too large, or too small. Capacities of any 2 bricks of the same logical volume can not differ more than 2^19 (~1 million) times. E.g. your logical volume can not contain both, 1M and 2T bricks. Any attempts to add a brick of improper capacity will fail with error. Format it by the same way as meta-data brick, but specify also "-a" option (to let mkfs know that it is data brick). # mkfs.reiser4 -U $VOL_ID -t 256K -a /dev/vdb2 Important: make sure you specified the same volume ID and stripe size as other bricks of the logical volume do have. Otherwise, operation of adding a data brick will fail. Update configuration of your volume with UUID or name of the brick you want to add (item #4). 
To add a brick simply pass its name as an argument for the option "-a" and specify your LV via its mount point: # volume.reiser4 -a /dev/vdb2 /mnt The procedure of adding a brick automatically invokes re-balancing, which moves a portion of data stripes to the newly added brick (so that the resulted distribution will fair). Portion of data blocks moved during such rebalancing is equal to the relative capacity of the new brick, that is to the portion of capacity that the new brick adds to updated LV's capacity. This important property defines the cost of balancing procedure. If the portion of capacity added by a brick is small, then number of stripes moved during balancing is also small. Like other user-space utilities, the operation of adding a brick can return error, even in the assumption that the brick you wanted to add is properly formatted. In this case check the status of your LV: # volume.reiser4 /mnt If the volume is unbalanced, then simply complete balancing manually: # volume.reiser4 -b /mnt Otherwise, check number of bricks in your LV. Most likely that it is the same as it was before the failed operation. In this case simply repeat the operation of adding a brick from scratch. Upon successful completion update your volume configuration. That is, increment (#2), add info about the new brick to (#3) and remove records at (#4). = Removing a data brick from LV = At any time you are able to remove a data brick from your LV. You can do it in parallel with regular file operations executing on this volume. Make sure, however, that there is no other volume operations (e.g. adding a brick) over your volume in progress, otherwise your operation will fail with EBUSY. Obviously, removing a brick will decrease abstract capacity of your LV. Note that other bricks should have enough space to store all data blocks of the brick you want to remove, otherwise, the removal operation will return error (ENOSPC). 
Suppose you want to remove the brick /dev/vdb2 from your LV mounted at /mnt. Update your volume configuration with the UUID and name of the brick you want to remove (item #4). To remove the brick, simply pass its name as an argument to the "-r" option and specify the logical volume by its mount point:

 # volume.reiser4 -r /dev/vdb2 /mnt

The procedure of brick removal automatically invokes re-balancing, which distributes the data of the brick to be removed among the other bricks, so that the resulting distribution is also fair. The portion of data stripes moved during such rebalancing is equal to the relative capacity of the brick to be removed (that is, to the portion of capacity that the brick added to the LV's capacity).

It can happen that the command above completes with an error (like other user-space applications). In this case check the status of your LV:

 # volume.reiser4 /mnt

If the volume is not balanced, simply complete the balancing manually:

 # volume.reiser4 -b /mnt

Otherwise, check the number of bricks in your logical volume - it should be the same as before the failed operation. The error -ENOSPC indicates that the free space on the other bricks is not enough to fit all the data of the brick you want to remove. On success update your volume configuration: remove the information about the removed brick at (#3) and (#4).

= Changing brick's capacity =

At any time (assuming that no other volume operation is in progress) you can change the abstract capacity of any brick to some new non-zero value. Changing capacity always changes the volume partitioning and therefore breaks fairness of distribution, so Reiser5 automatically launches rebalancing to make sure that the resulting distribution is fair for the new set of capacities. In particular, increasing a brick's capacity will move some data from the other bricks to the brick whose capacity was increased; decreasing a brick's capacity will move some data from the brick whose capacity was decreased to the other bricks.
To change the abstract capacity of the brick /dev/vdb1 to a new value (e.g. 200000), simply run

 # volume.reiser4 -z /dev/vdb1 -c 200000 /mnt

pronounced as "resize brick /dev/vdb1 to the new capacity 200000 in the volume mounted at /mnt".

The operation of changing capacity can return an error. Most likely it is -ENOSPC, which is a side effect of concurrent regular file writes. In this case check the status of your LV. If it is unbalanced, then consider removing some files from your LV and complete the balancing by running

 # volume.reiser4 -b /mnt

Otherwise, repeat the operation from scratch.

Comment. Changing a brick's capacity to 0 is undefined and will return an error. Consider the brick removal operation instead.

= Operations with meta-data brick =

The meta-data brick can also contain data stripes and participate in data distribution like the other (data) bricks, so all the volume operations described above are applicable to the meta-data brick as well. Note, however, that it is impossible to completely remove the meta-data brick from the logical volume for obvious reasons (meta-data needs to be stored somewhere), so the brick removal operation applied to the meta-data brick actually removes it from the Data Storage Array (DSA), not from the logical volume. The DSA is the subset of the LV consisting of the bricks participating in data distribution. Once you remove the meta-data brick from the DSA, that brick will be used only to store meta-data. The operation of adding a brick, applied to the meta-data brick, returns it back to the DSA.

Important: Reiser5 doesn't count busy data and meta-data blocks separately. So, in contrast with data bricks (which contain only data), you are not able to find out the real space occupied by data blocks on the meta-data brick - Reiser5 knows only the total space occupied.

To check the status of the meta-data brick simply run

 # volume.reiser4 /mnt

and compare the values of "bricks total" and "bricks in DSA". If they are equal, then the meta-data brick participates in data distribution.
Otherwise, "bricks total" should be 1 more than "bricks in DSA" - it indicates that the meta-data brick doesn't participate in data distribution (and therefore doesn't contain data blocks). Note that other cases are impossible: for data bricks, participation in the LV and in the DSA is always equivalent.

= Unmounting a logical volume =

To terminate a mount session just issue the usual umount command with the mount point specified. Note that after unmounting the volume all bricks by default remain registered in the system until system shutdown. If you want to unregister a brick before system shutdown, simply issue the following command:

 # volume.reiser4 -u BRICK_NAME

= Deploying a logical volume after correct unmount =

Make sure (by checking your volume configuration) that all bricks of the volume are registered in the system. The list of all volumes and bricks registered in the system can be found in the output of the following command:

 # volume.reiser4 -l

Issue the usual mount command against one of the bricks of your volume; it is recommended to specify the meta-data brick in the mount command. If not all bricks of the volume are registered, then attempts to mount the volume will fail with a respective kernel message.

NOTE: Reiser5 will refuse to mount a logical volume when a wrong set of bricks is registered in the system. This can happen due to careless handling of off-line volumes, leading to the appearance of "artifacts" in the list of registered bricks. If you want to re-format a brick, make sure it is unregistered.

= Deploying a logical volume after correct shutdown =

To be able to mount your LV, make sure that all its bricks (data and meta-data) are registered in the system. If not all bricks of the volume are registered, then attempts to mount the volume will fail with a respective kernel message. For this reason we strongly recommend keeping track of your LV - store its configuration somewhere, but not on that volume!
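There is no mandated format for the saved configuration; one possible convention is a plain text file whose entries follow the numbered items of the basic volume configuration (1 - volume UUID, 2 - number of bricks, 3 - brick list, 4 - brick currently being added/removed). The file name, location, and UUID below are only illustrative.

```shell
#!/bin/sh
# Sketch: keep the LV configuration in a plain file OUTSIDE the volume.
# /tmp is used here only so the sketch is self-contained; in practice pick
# a location that survives reboot and does not live on the LV itself.
CONF=${TMPDIR:-/tmp}/lv-home.conf

cat > "$CONF" <<EOF
# 1. volume UUID
f81d4fae-7dec-11d0-a765-00a0c91e6bf6
# 2. number of bricks
2
# 3. bricks
/dev/vdb1
/dev/vdb2
# 4. brick being added/removed (empty when no operation is in progress)

EOF
```

Updating the file before and after every volume operation (as described above) is what makes later deployment and crash recovery straightforward.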
And don't forget to update that configuration after _every_ volume operation. If you lost the configuration of your LV and don't remember it (which is most likely for large volumes), then it will be rather painful to restore: currently there are no tools to manage off-line logical volumes, so users are expected to do this on their own. It is not at all difficult.

To register a brick in the system use the following command:

 # volume.reiser4 -g BRICK_NAME

To print a list of all registered bricks use

 # volume.reiser4 -l

To mount your LV just issue a mount command for any one brick of your LV.

Comment. Reiser5 always tries to register the brick which is passed to the mount command as an argument, so there is no need to pre-register the brick you are going to issue the mount command against.

= Deploying a logical volume after hard reset or system crash =

If no volume operations were interrupted by the hard reset or system crash, then just follow the instructions in section 9. In Reiser5 only a restricted number of bricks participate in every transaction; the maximal number of such bricks can be specified by the user. At mount time a transaction replay procedure will be launched on each such brick independently, in parallel. Depending on the kind of interrupted volume operation, perform one of the following actions:

== Adding a brick was interrupted ==

Check your volume configuration. Register the old set of bricks (that is, the set of bricks that the volume had before applying the operation) and try to mount. In the case of error, register also the brick you wanted to add and try to mount again. Check the status of your LV by running

 # volume.reiser4 /mnt

If the volume is unbalanced, then complete the balancing manually by running

 # volume.reiser4 -b /mnt

Check "bricks total" of your LV in the output of

 # volume.reiser4 /mnt

Compare it with the old number of bricks in the configuration. The new value should be the old one incremented by 1.
If the number of bricks is the same, then your operation of adding a brick was completely rolled back by the transaction manager, so you need to repeat it from scratch. Otherwise, your operation was successfully completed - update your volume configuration respectively.

== Brick removal was interrupted ==

Check your volume configuration. Register the old set of bricks (that is, the set of bricks that the volume had before applying the interrupted operation) except the brick you wanted to remove. Try to mount the volume. In the case of error, register also the brick you wanted to remove and try to mount again. Check the status of your LV:

 # volume.reiser4 /mnt

If the volume is unbalanced, then complete the balancing manually by running

 # volume.reiser4 -b /mnt

Comment. After successful completion of the balancing the brick will be automatically removed from the volume. Make sure of it by checking the status of your LV:

 # volume.reiser4 /mnt

Update your volume configuration respectively.

== Another volume operation was interrupted ==

Using the volume configuration, register the new set of bricks and try to mount the volume. The mount should be successful.
Check the status of your LV:

 # volume.reiser4 /mnt

If the volume is unbalanced, then complete the balancing manually by running

 # volume.reiser4 -b /mnt

= LV monitoring =

Common info about the LV mounted at /mnt:

 # volume.reiser4 /mnt

 ID:             Volume UUID
 volume:         ID of the plugin managing the volume
 distribution:   ID of the distribution plugin
 stripe:         Stripe size in bytes
 segments:       Number of hash space segments (for distribution)
 bricks total:   Total number of bricks in the volume
 bricks in DSA:  Number of bricks participating in data distribution
 balanced:       Balanced status of the volume

Info about any of its bricks, of index J:

 # volume.reiser4 -p J /mnt

 internal ID:    Brick's "internal ID" and its status in the volume
 external ID:    Brick's UUID
 device name:    Name of the block device associated with the brick
 block count:    Size of the block device in blocks
 blocks used:    Total number of occupied blocks on the device
 system blocks:  Minimal possible number of busy blocks on that device
 data capacity:  Abstract capacity of the brick
 space usage:    Portion of occupied blocks on the device

Comment. When retrieving a brick's info make sure that no volume operations on that volume are in progress; otherwise the command above will return an error (EBUSY).

WARNING. Brick info provided this way is not necessarily the most recent one. To get actual info, run sync(1) and make sure that no regular file operations are in progress.

= Checking free space =

To check the number of available free blocks on a volume mounted at /mnt, make sure that no regular file operations or volume operations are in progress on that volume, then run

 # sync
 # df --block-size=4K /mnt

To check the number of free blocks on the brick of index J, run

 # volume.reiser4 -p J /mnt

then calculate the difference between "block count" and "blocks used".

Comment. Not all free blocks on a brick/volume are available for use.
The number of available free blocks is always ~95% of the total number of free blocks (Reiser4 reserves 5% to make sure that regular file truncate operations won't fail).

NOTE: volume.reiser4 shows the total number of free blocks, whereas df(1) shows the number of available free blocks. The "space usage" statistic shows the portion of busy blocks on an individual brick. For the reasons explained above, "space usage" on any brick can not be more than 0.95.

= Checking quality of data distribution =

Quality of data distribution is a measure of the deviation of the real data space usage from the ideal one defined by the volume partitioning. The smaller the deviation, the better the distribution quality. Checking quality of distribution makes sense only in the case when your volume partitioning is space-based, or coincides with the space-based one. If your partitioning is throughput-based and doesn't coincide with the space-based one, then the quality of the actual data distribution can be rather bad: in this case the file system takes care that low-performance devices do not become a bottleneck, and effective space usage is not a high priority.

Checking quality of data distribution is based on the free blocks accounting provided by the file system. Note that the file system doesn't count busy data and meta-data blocks separately, so you are not able to find the real data space usage, and hence to check quality of distribution, in the case when the meta-data brick contains data blocks.

To check quality of distribution:

* make sure that the meta-data brick doesn't contain data blocks;
* make sure that no regular file or volume operations are currently in progress;
* find the "blocks used", "system blocks" and "data capacity" statistics for each data brick:

 # sync
 # volume.reiser4 -p 1 /mnt
 ...
 # volume.reiser4 -p N /mnt

* find the real data space usage on each brick;
* calculate the partitioning and the ideal data space usage on each data brick;
* find the deviation of the real usage from the ideal one.

Example.
Let's build an LV of 3 bricks (one 10G meta-data brick /dev/vdb1, and two data bricks: /dev/vdc1 (10G) and /dev/vdd1 (5G)) with space-based partitioning:

 # VOL_ID=`uuid -v4`
 # echo "Using uuid $VOL_ID"
 # mkfs.reiser4 -U $VOL_ID -y -t 256K /dev/vdb1
 # mkfs.reiser4 -U $VOL_ID -y -a -t 256K /dev/vdc1
 # mkfs.reiser4 -U $VOL_ID -y -a -t 256K /dev/vdd1
 # mount /dev/vdb1 /mnt

Fill the meta-data brick with data:

 # dd if=/dev/zero of=/mnt/myfile bs=256K
 No space left on device...

Add the data bricks /dev/vdc1 and /dev/vdd1 to the volume:

 # volume.reiser4 -a /dev/vdc1 /mnt
 # volume.reiser4 -a /dev/vdd1 /mnt

Move all data blocks to the newly added bricks:

 # volume.reiser4 -r /dev/vdb1 /mnt
 # sync

Now the meta-data brick doesn't contain data blocks (only meta-data ones), so we can calculate the quality of data distribution:

 # volume.reiser4 /mnt -p0
   blocks used: 503
 # volume.reiser4 /mnt -p1
   blocks used: 1657203
   system blocks: 115
   data capacity: 2621069
 # volume.reiser4 /mnt -p2
   blocks used: 833001
   system blocks: 73
   data capacity: 1310391

Based on the statistics above, calculate the quality of distribution.

Total data capacity of the volume:

 C = 2621069 + 1310391 = 3931460

Relative capacities of the data bricks:

 C1 = 2621069 / 3931460 = 0.6667
 C2 = 1310391 / 3931460 = 0.3333

Real space usage on the data bricks (blocks used - system blocks):

 R1 = 1657203 - 115 = 1657088
 R2 = 833001 - 73 = 832928

Space usage on the volume:

 R = R1 + R2 = 1657088 + 832928 = 2490016

Ideal data space usage on the data bricks:

 I1 = C1 * R = 0.6667 * 2490016 = 1660094
 I2 = C2 * R = 0.3333 * 2490016 = 829922

Deviation:

 D = (R1, R2) - (I1, I2) = (-3006, 3006)

Relative deviation:

 D/R = (-0.0012, 0.0012)

Quality of distribution:

 Q = 1 - max(|D1|, |D2|)/R = 1 - 0.0012 = 0.9988

Comment. For any specified number of bricks N and quality of distribution Q it is possible to find a configuration of a logical volume composed of N bricks such that the quality of distribution on that volume is better than Q.

Comment.
Quality of distribution Q does not depend on the number of bricks in the logical volume. This is a theorem, which can be strictly proven.
To check quality of distribution * make sure that meta-data brick doesn't contain data blocks; * make sure that no regular file and volume operations are currently in progress; * find "blocks used", "system blocks" and "data capacity" statistics for each data brick: # sync # volume.reiser4 -p 1 /mnt ... # volume.reiser4 -p N /mnt * find real data space usage on each brick; * calculate partitioning and ideal data space usage on each data brick; * find deviation of (4) from (5). Example. Let' build a LV of 3 bricks (one 10G meta-data brick sdb1, and two data bricks: sdc1 (10G), sdd1(5G)) with space-based partitioning: # VOL_ID=`uuid -v4` # echo "Using uuid $VOL_ID" # mkfs.reiser4 -U $VOL_ID -y -t 256K /dev/vdb1 # mkfs.reiser4 -U $VOL_ID -y -a -t 256K /dev/vdc1 # mkfs.reiser4 -U $VOL_ID -y -a -t 256K /dev/vdd1 # mount /dev/vdb1 /mnt Fill the meta-data brick with data: # dd if=/dev/zero of=/mnt/myfile bs=256K No space left on device... Add data-bricks /dev/sdc1 and dev/sdd1 to the volume: # volume.reiser4 -a /dev/vdc1 /mnt # volume.reiser4 -a /dev/vdd1 /mnt Move all data blocks to the newly added bricks: # volume.reiser4 -r /dev/vdb1 /mnt # sync Now meta-data brick doesn't contain data blocks (only meta-data ones), so that we can calculate quality of data distribution # volume.reiser4 /mnt -p0 blocks used: 503 # volume.reiser4 /mnt -p1 blocks used: 1657203 system blocks: 115 data capacity: 2621069 # volume.reiser4 /mnt -p2 blocks used: 833001 system blocks: 73 data capacity: 1310391 Basing on the statistics above calculate quality of distribution. 
A logical volume (LV) can be composed of any number of block devices that differ in physical and geometric parameters. However, the optimal configuration (true parallelism) imposes some restrictions and dependencies on the size of such devices.

WARNING: This functionality is not stable. Don't put important data on logical volumes managed by software of release number 5.X.Y.

IMPORTANT: Currently there are no tools to manage Reiser5 logical volumes off-line, so it is strongly recommended to save/update the configuration of your LV in a file which does not belong to that volume.

= Basic definitions. Volume configuration. Brick's capacity. Partitioning. Fair distribution. Balancing =

The basic configuration of a logical volume is the following information:
* Volume UUID;
* Number of bricks in the volume;
* List of brick names or UUIDs in the volume;
* UUID or name of the brick to be added or removed (if any). That brick is not counted in (2) and (3).

For each volume this configuration should be stored somewhere (but not on that volume!) and properly updated before and after each volume operation performed on that volume. The user is responsible for this. The volume configuration is needed to facilitate deploying the volume.

'''Abstract capacity''' (or simply capacity) of a brick is a positive integer. Capacity is a brick property defined by the user; don't confuse it with the size of the block device. Think of it as the brick's "weight" in some units. The user decides which property of the brick to assign as its abstract capacity and in which units: it can be the size of the block device in kilobytes or megabytes, its throughput in MB/s, or another geometric or physical parameter of the device associated with the brick. It is important that the capacities of all bricks of the same logical volume are measured in the same units. Likewise, it would be pointless to assign different properties as abstract capacities for bricks of the same LV (for example, block device size for one brick and disk bandwidth for another).

The capacity of each brick is initialized by the mkfs utility. By default it is calculated as the number of free blocks on the device at the very end of the formatting procedure; for a meta-data brick it is 70% of that amount. The capacity of any brick can be changed on-line by the user.

'''Capacity of a logical volume''' is defined as the sum of the capacities of its component bricks.

'''Relative capacity of a brick''' is the ratio of the brick's capacity to the volume's capacity. Relative capacity defines the portion of IO requests that will be issued against that brick.
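The definitions above can be expressed directly. A minimal Python sketch (the capacity numbers are hypothetical, not reiser4progs output):

```python
# Sketch of the definitions above: volume partitioning (relative
# capacities C1, C2, ...) and ideal data space usage T*C per brick.
def partitioning(capacities):
    """Relative capacities; by construction they sum to 1."""
    total = sum(capacities)
    return [c / total for c in capacities]

def ideal_usage(capacities, total_data_blocks):
    """Ideal (expected) data space usage T*C for each brick."""
    return [total_data_blocks * c for c in partitioning(capacities)]

# Hypothetical capacities of two data bricks:
caps = [2621069, 1310391]
print(partitioning(caps))
print(ideal_usage(caps, 2490016))
```

Note that the ideal usages sum back to T, just as the relative capacities sum to 1.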
The array of relative capacities (C1, C2, ...) of all bricks is called the volume partitioning. Obviously, C1 + C2 + ... = 1.

'''(Real) data space usage''' on a brick is the number of data blocks stored on that brick.

'''Ideal (or expected) data space usage''' on a brick is T*C, where T is the total number of data blocks stored in the volume and C is the relative capacity of the brick.

It is recommended to compose volumes so that the space-based partitioning coincides with the throughput-based one - this is the optimal volume configuration, which provides true parallelism. If that is impossible for some reason, choose a preferred partitioning method (space-based or throughput-based). Note that space-based partitioning saves volume space, whereas throughput-based partitioning saves volume throughput.

When performing regular file operations, Reiser5 distributes data stripes throughout the volume evenly and fairly. This means that the portion of IO requests issued against each brick is equal to its relative capacity, that is, to the portion of capacity that the brick contributes to the total volume capacity.

Most volume operations are accompanied by rebalancing, which maintains fairness of the distribution. For example, adding a brick to a logical volume changes its partitioning and hence breaks fairness, so some data stripes have to be moved to the new brick to make the distribution fair again. Likewise, you cannot simply remove a brick from a logical volume - all data stripes first have to be moved from that brick to the other bricks of the volume.

Every time the user performs a volume operation, Reiser5 marks the LV as "not balanced". After successful balancing the status of the LV is changed back to "balanced". If the balancing procedure fails for some reason, it should be resumed manually (with the volume.reiser4 utility). It is allowed to perform regular file operations on an unbalanced LV. However, in this case:

a) we don't guarantee a good quality of data distribution on your LV;
b) you won't be able to perform any volume operation on your LV except balancing - any other volume operation will return an error (EBUSY).

So, don't forget to bring your LV to the balanced state as soon as possible!

= Prepare Software and Hardware =

Build, install and boot a kernel with Reiser4 of software framework release number 5.X.Y. Kernel patches can be found [https://sourceforge.net/projects/reiser4/files/v5-unstable/ here]. Note that the Linux kernel and GNU utilities still recognize this testing software as "Reiser4". Make sure the following message appears in the kernel logs: "Loading Reiser4 (Software Framework Release: 5.X.Y)"

Build and install the latest [https://sourceforge.net/projects/reiser4/files/reiser4-utils/libaal/ libaal]. Download, build and install the latest version 2.A.B of the [https://sourceforge.net/projects/reiser4/files/v5-unstable/ Reiser4progs package]. Make sure that the utility for managing logical volumes (part of the reiser4progs package) is installed on your machine:
 # volume.reiser4 -?

= Creating a logical volume =

Start by choosing a unique ID (uuid) for your volume. By default it is generated by the mkfs utility; however, you can generate it yourself with a suitable tool (e.g. uuid(1)) and store it in an environment variable for convenience:
 # VOL_ID=`uuid -v4`
 # echo "Using uuid $VOL_ID"

Choose a stripe size for your logical volume. For a good quality of distribution it is recommended that the stripe does not exceed 1/10000 of the volume size. On the other hand, too small a stripe will increase space consumption on your meta-data brick. In our example we choose a stripe size of 256K.

Start by creating the first brick of your volume - the meta-data brick - passing the volume ID and stripe size to the mkfs.reiser4 utility:
 # mkfs.reiser4 -U $VOL_ID -t 256K /dev/vdb1

Currently only one meta-data brick per volume is supported, so it is recommended that the block device for the meta-data brick is not too small.
In most cases it is enough if your meta-data brick is not smaller than 1/200 of the maximal volume size. For example, a 100G meta-data brick will be able to service a ~20T logical volume.

Mount your logical volume, which so far consists of one meta-data brick:
 # mount /dev/vdb1 /mnt

Find the record about your volume in the output of the following command:
 # volume.reiser4 -l

Create the configuration of your logical volume (its definition is above) and store it somewhere - but not on that volume! Your logical volume is now on-line and ready to use. You can perform regular file operations and volume operations (e.g. add a data brick to your LV).

= Adding a data brick to LV =

At any time you can add a data brick to your LV. You can do it in parallel with regular file operations executing on this volume. Make sure, however, that no other volume operation (e.g. removing a brick) is in progress on your volume, otherwise your operation will fail with EBUSY. Obviously, adding a brick will increase the capacity of your volume.

Choose a block device for the new data brick. Make sure that it is not too large or too small: the capacities of any two bricks of the same logical volume cannot differ by more than 2^19 (~500,000) times. E.g. your logical volume cannot contain both a 1M and a 2T brick. Any attempt to add a brick of improper capacity will fail with an error.

Format the device in the same way as the meta-data brick, but also specify the "-a" option (to let mkfs know that it is a data brick):
 # mkfs.reiser4 -U $VOL_ID -t 256K -a /dev/vdb2

Important: make sure you specify the same volume ID and stripe size as the other bricks of the logical volume have; otherwise the operation of adding the data brick will fail.

Update the configuration of your volume with the UUID or name of the brick you want to add (item #4).
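The sizing recommendations above (stripe at most 1/10000 of the volume size, meta-data brick at least 1/200 of the maximal volume size) are easy to sanity-check. A minimal sketch with illustrative helper names (not part of reiser4progs), using decimal units:

```python
TB = 10**12
GB = 10**9

def max_recommended_stripe(volume_size):
    """Stripe should not exceed 1/10000 of the volume size."""
    return volume_size // 10000

def min_metadata_brick(max_volume_size):
    """Meta-data brick should be at least 1/200 of the maximal volume size."""
    return max_volume_size // 200

# A 100G meta-data brick services a ~20T logical volume:
print(min_metadata_brick(20 * TB) // GB)   # -> 100
```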
To add a brick, simply pass its name as an argument to the "-a" option and specify your LV by its mount point:
 # volume.reiser4 -a /dev/vdb2 /mnt

The procedure of adding a brick automatically invokes rebalancing, which moves a portion of data stripes to the newly added brick (so that the resulting distribution is fair). The portion of data blocks moved during such rebalancing is equal to the relative capacity of the new brick, that is, to the portion of capacity that the new brick adds to the updated LV's capacity. This important property defines the cost of the balancing procedure: if the portion of capacity added by a brick is small, then the number of stripes moved during balancing is also small.

Like other user-space utilities, the operation of adding a brick can return an error, even assuming the brick you wanted to add is properly formatted. In this case check the status of your LV:
 # volume.reiser4 /mnt

If the volume is unbalanced, simply complete balancing manually:
 # volume.reiser4 -b /mnt

Otherwise, check the number of bricks in your LV. Most likely it is the same as before the failed operation; in that case simply repeat the operation of adding a brick from scratch. Upon successful completion update your volume configuration: increment (#2), add info about the new brick to (#3) and remove the record at (#4).

= Removing a data brick from LV =

At any time you can remove a data brick from your LV. You can do it in parallel with regular file operations executing on this volume. Make sure, however, that no other volume operation (e.g. adding a brick) is in progress on your volume, otherwise your operation will fail with EBUSY. Obviously, removing a brick will decrease the abstract capacity of your LV. Note that the other bricks must have enough space to store all data blocks of the brick you want to remove; otherwise the removal operation will return an error (ENOSPC).
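The balancing-cost property stated above - the fraction of data moved equals the relative capacity of the added (or removed) brick - can be sketched as follows (hypothetical capacities; not reiser4progs code):

```python
def moved_fraction_on_add(existing_capacities, new_capacity):
    """Fraction of data stripes moved to a newly added brick:
    its relative capacity in the updated volume."""
    return new_capacity / (sum(existing_capacities) + new_capacity)

def moved_fraction_on_remove(capacities, removed_capacity):
    """Fraction of data stripes moved off a brick being removed:
    its relative capacity in the current volume."""
    return removed_capacity / sum(capacities)

# A brick contributing 1/4 of the new total capacity receives 25%
# of the data stripes during rebalancing:
print(moved_fraction_on_add([1000, 2000], 1000))   # -> 0.25
```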
Suppose you want to remove brick /dev/vdb2 from your LV mounted at /mnt. Update your volume configuration with the UUID and name of the brick you want to remove (item #4).

To remove a brick, simply pass its name as an argument to the "-r" option and specify the logical volume by its mount point:
 # volume.reiser4 -r /dev/vdb2 /mnt

The procedure of brick removal automatically invokes rebalancing, which distributes the data of the brick to be removed among the other bricks, so that the resulting distribution is again fair. The portion of data stripes moved during such rebalancing is equal to the relative capacity of the brick to be removed (that is, to the portion of capacity that the brick added to the LV's capacity).

It can happen that the command above completes with an error (like other user-space applications). In this case check the status of your LV:
 # volume.reiser4 /mnt

If the volume is not balanced, simply complete balancing manually:
 # volume.reiser4 -b /mnt

Otherwise, check the number of bricks in your logical volume - it should be the same as before the failed operation. The error -ENOSPC indicates that the free space on the other bricks is not enough to fit all the data of the brick you want to remove. On success, update your volume configuration: remove information about the removed brick at (#3) and (#4).

= Changing brick's capacity =

At any time (assuming no other volume operation is in progress) you can change the abstract capacity of any brick to some new non-zero value. Changing capacity always changes the volume partitioning and therefore breaks fairness of distribution, so Reiser5 automatically launches rebalancing to make sure that the resulting distribution is fair for the new set of capacities. In particular, increasing a brick's capacity will move some data from other bricks to the brick whose capacity was increased; decreasing a brick's capacity will move some data from that brick to the other bricks.
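The direction of data movement after a capacity change follows from the fairness requirement (each brick's ideal usage is T*C). The amount moved can be estimated as below - this is an illustration derived from the fair-distribution definition, an assumption of this sketch rather than documented reiser4progs behavior:

```python
def moved_fraction(old_capacities, new_capacities):
    """Estimated fraction of data that rebalancing must move so that
    the distribution is fair for the new set of capacities."""
    old_rel = [c / sum(old_capacities) for c in old_capacities]
    new_rel = [c / sum(new_capacities) for c in new_capacities]
    # Data flowing into the bricks whose relative capacity grew:
    return sum(max(n - o, 0.0) for o, n in zip(old_rel, new_rel))

# Doubling one of two equal capacities shifts shares from (1/2, 1/2)
# to (2/3, 1/3), so 1/6 of the data moves to the enlarged brick:
print(moved_fraction([100, 100], [200, 100]))
```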
To change the abstract capacity of brick /dev/vdb1 to a new value (e.g. 200000), simply run
 # volume.reiser4 -z /dev/vdb1 -c 200000 /mnt
pronounced as "resize brick /dev/vdb1 to new capacity 200000 in the volume mounted at /mnt".

The operation of changing capacity can return an error. Most likely it is -ENOSPC, which is a side effect of concurrent regular file writes. In this case check the status of your LV. If it is unbalanced, consider removing some files from your LV and complete balancing by running
 # volume.reiser4 -b /mnt
Otherwise, repeat the operation from scratch.

Comment. Changing a brick's capacity to 0 is undefined and will return an error. Consider the brick removal operation instead.

= Operations with meta-data brick =

The meta-data brick can also contain data stripes and participate in data distribution like the data bricks, so all the volume operations described above are also applicable to the meta-data brick. Note, however, that it is impossible to completely remove the meta-data brick from the logical volume for obvious reasons (meta-data need to be stored somewhere), so the brick removal operation applied to the meta-data brick actually removes it from the Data Storage Array (DSA), not from the logical volume. The DSA is the subset of the LV consisting of the bricks participating in data distribution. Once you remove the meta-data brick from the DSA, that brick will be used only to store meta-data. The operation of adding a brick, applied to the meta-data brick, returns it to the DSA.

Important: Reiser5 doesn't count busy data and meta-data blocks separately. So, in contrast with data bricks (which contain only data), you cannot find out the real space occupied by data blocks on the meta-data brick - Reiser5 knows only the total space occupied.

To check the status of the meta-data brick simply run
 # volume.reiser4 /mnt
and compare the values of "bricks total" and "bricks in DSA". If they are equal, then the meta-data brick participates in data distribution.
Otherwise, "bricks total" should be one more than "bricks in DSA" - this indicates that the meta-data brick doesn't participate in data distribution (and therefore doesn't contain data blocks). Note that other cases are impossible: for data bricks, participation in the LV and in the DSA is always equivalent.

= Unmounting a logical volume =

To terminate a mount session just issue the usual umount command with the mount point specified. Note that after unmounting the volume all bricks by default remain registered in the system until system shutdown. If you want to unregister a brick before system shutdown, simply issue the following command:
 # volume.reiser4 -u BRICK_NAME

= Deploying a logical volume after correct unmount =

Make sure (by checking your volume configuration) that all bricks of the volume are registered in the system. The list of all volumes and bricks registered in the system can be found in the output of the following command:
 # volume.reiser4 -l

Issue the usual mount command against one of the bricks of your volume. It is recommended to specify the meta-data brick in the mount command. If not all bricks of the volume are registered, then attempts to mount the volume will fail with a respective kernel message.

NOTE: Reiser5 will refuse to mount a logical volume when a wrong set of bricks is registered in the system. This can happen due to careless handling of off-line volumes, leading to the appearance of "artifacts" in the list of registered bricks. If you want to re-format a brick, make sure it is unregistered.

= Deploying a logical volume after correct shutdown =

To be able to mount your LV, make sure that all its bricks (data and meta-data) are registered in the system. If not all bricks of the volume are registered, then attempts to mount the volume will fail with a respective kernel message. For this reason we strongly recommend keeping track of your LV - store its configuration somewhere, but not on that volume!
And don't forget to update that configuration after _every_ volume operation. If you lost the configuration of your LV and don't remember it (which is likely for large volumes), it will be rather painful to restore: currently there are no tools to manage off-line logical volumes, so users have to do this on their own. It is not at all difficult.

To register a brick in the system use the following command:
 # volume.reiser4 -g BRICK_NAME

To print a list of all registered bricks use
 # volume.reiser4 -l

To mount your LV just issue a mount command for any one brick of your LV.

Comment. Reiser5 always tries to register the brick which is passed to the mount command as an argument, so there is no need to preregister the brick you want to issue the mount command against.

= Deploying a logical volume after hard reset or system crash =

If no volume operations were interrupted by the hard reset or system crash, just follow the instructions in section 9. In Reiser5 only a restricted number of bricks participate in every transaction; the maximal number of such bricks can be specified by the user. At mount time a transaction replay procedure will be launched on each such brick independently, in parallel. Depending on the kind of interrupted volume operation, perform one of the following actions:

== Adding a brick was interrupted ==

Check your volume configuration. Register the old set of bricks (that is, the set of bricks that the volume had before the interrupted operation) and try to mount. In the case of an error, register also the brick you wanted to add and try to mount again. Check the status of your LV by running
 # volume.reiser4 /mnt

If the volume is unbalanced, complete balancing manually by running
 # volume.reiser4 -b /mnt

Check "bricks total" of your LV in the output of
 # volume.reiser4 /mnt
and compare it with the old number of bricks in the configuration. The new value should be the old one plus one.
If the number of bricks is the same, then your operation of adding a brick was completely rolled back by the transaction manager, and you need to repeat it from scratch. Otherwise, your operation was successfully completed - update your volume configuration accordingly.

== Brick removal was interrupted ==

Check your volume configuration. Register the old set of bricks (that is, the set of bricks that the volume had before the interrupted operation) except the brick you wanted to remove. Try to mount the volume. In the case of an error, register also the brick you wanted to remove and try to mount again. Check the status of your LV:
 # volume.reiser4 /mnt

If the volume is unbalanced, complete balancing manually by running
 # volume.reiser4 -b /mnt

Comment. After successful completion of balancing the brick will be automatically removed from the volume. Make sure of it by checking the status of your LV:
 # volume.reiser4 /mnt

Update your volume configuration accordingly.

== Another volume operation was interrupted ==

Using the volume configuration, register the new set of bricks and try to mount the volume. The mount should be successful.
Check the status of your LV:
 # volume.reiser4 /mnt

If the volume is unbalanced, complete balancing manually by running
 # volume.reiser4 -b /mnt

= LV monitoring =

Common info about the LV mounted at /mnt:
 # volume.reiser4 /mnt
* ID: Volume UUID
* volume: ID of the plugin managing the volume
* distribution: ID of the distribution plugin
* stripe: Stripe size in bytes
* segments: Number of hash space segments (for distribution)
* bricks total: Total number of bricks in the volume
* bricks in DSA: Number of bricks participating in data distribution
* balanced: Balanced status of the volume

Info about its brick of index J:
 # volume.reiser4 -p J /mnt
* internal ID: Brick's "internal ID" and its status in the volume
* external ID: Brick's UUID
* device name: Name of the block device associated with the brick
* block count: Size of the block device in blocks
* blocks used: Total number of occupied blocks on the device
* system blocks: Minimal possible number of busy blocks on that device
* data capacity: Abstract capacity of the brick
* space usage: Portion of occupied blocks on the device

Comment. When retrieving brick info, make sure that no volume operations on that volume are in progress; otherwise the command above will return an error (EBUSY).

WARNING. Brick info obtained this way is not necessarily the most recent. To get up-to-date info, run sync(1) and make sure that no regular file operations are in progress.

= Checking free space =

To check the number of available free blocks on a volume mounted at /mnt, make sure that no regular file operations or volume operations are in progress on that volume, then run
 # sync
 # df --block-size=4K /mnt

To check the number of free blocks on the brick of index J, run
 # volume.reiser4 -p J /mnt
then calculate the difference between "block count" and "blocks used".

Comment. Not all free blocks on a brick/volume are available for use.
The number of available free blocks is always ~95% of the total number of free blocks (Reiser4 reserves 5% to make sure that regular file truncate operations won't fail).

NOTE: volume.reiser4 shows the total number of free blocks, whereas df(1) shows the number of available free blocks. The "space usage" statistic shows the portion of busy blocks on an individual brick. For the reasons explained above, "space usage" on any brick cannot be more than 0.95.

= Checking quality of data distribution =

Quality of data distribution is a measure of the deviation of the real data space usage from the ideal one defined by the volume partitioning. The smaller the deviation, the better the distribution quality.

Checking quality of distribution makes sense only when your volume partitioning is space-based, or when it coincides with the space-based one. If your partitioning is throughput-based and does not coincide with the space-based one, then the quality of the actual data distribution can be rather poor: in this case the file system tries to prevent low-performance devices from becoming a bottleneck, and effective space usage is not a high priority.

Checking quality of data distribution is based on the free-blocks accounting provided by the file system. Note that the file system doesn't count busy data and meta-data blocks separately, so you cannot find the real data space usage, and hence cannot check the quality of distribution, when the meta-data brick contains data blocks.

To check quality of distribution:
# make sure that the meta-data brick doesn't contain data blocks;
# make sure that no regular file or volume operations are currently in progress;
# find the "blocks used", "system blocks" and "data capacity" statistics for each data brick:
#: # sync
#: # volume.reiser4 -p 1 /mnt
#: ...
#: # volume.reiser4 -p N /mnt
# find the real data space usage on each brick;
# calculate the partitioning and the ideal data space usage on each data brick;
# find the deviation of (4) from (5).

Example.
Let's build an LV of 3 bricks (one 10G meta-data brick vdb1, and two data bricks: vdc1 (10G) and vdd1 (5G)) with space-based partitioning:
 # VOL_ID=`uuid -v4`
 # echo "Using uuid $VOL_ID"
 # mkfs.reiser4 -U $VOL_ID -y -t 256K /dev/vdb1
 # mkfs.reiser4 -U $VOL_ID -y -a -t 256K /dev/vdc1
 # mkfs.reiser4 -U $VOL_ID -y -a -t 256K /dev/vdd1
 # mount /dev/vdb1 /mnt

Fill the meta-data brick with data:
 # dd if=/dev/zero of=/mnt/myfile bs=256K
 No space left on device...

Add the data bricks /dev/vdc1 and /dev/vdd1 to the volume:
 # volume.reiser4 -a /dev/vdc1 /mnt
 # volume.reiser4 -a /dev/vdd1 /mnt

Move all data blocks to the newly added bricks:
 # volume.reiser4 -r /dev/vdb1 /mnt
 # sync

Now the meta-data brick doesn't contain data blocks (only meta-data), so we can calculate the quality of data distribution:
 # volume.reiser4 /mnt -p0
 blocks used: 503
 # volume.reiser4 /mnt -p1
 blocks used: 1657203
 system blocks: 115
 data capacity: 2621069
 # volume.reiser4 /mnt -p2
 blocks used: 833001
 system blocks: 73
 data capacity: 1310391

Based on the statistics above, calculate the quality of distribution.

Total data capacity of the volume:
 C = 2621069 + 1310391 = 3931460
Relative capacities of the data bricks:
 C1 = 2621069 / 3931460 = 0.6667
 C2 = 1310391 / 3931460 = 0.3333
Real space usage on the data bricks (blocks used - system blocks):
 R1 = 1657203 - 115 = 1657088
 R2 = 833001 - 73 = 832928
Real space usage on the volume:
 R = R1 + R2 = 1657088 + 832928 = 2490016
Ideal data space usage on the data bricks (with T = R):
 I1 = C1 * T = 0.6667 * 2490016 = 1660094
 I2 = C2 * T = 0.3333 * 2490016 = 829922
Deviation:
 D = (R1, R2) - (I1, I2) = (-3006, 3006)
Relative deviation:
 D/R = (-0.0012, 0.0012)
Quality of distribution:
 Q = 1 - max(|D1|, |D2|)/R = 1 - 0.0012 = 0.9988

Comment. For any specified number of bricks N and quality of distribution Q it is possible to find a configuration of a logical volume composed of N bricks such that the quality of distribution on that volume is better than Q.

Comment.
Quality of distribution Q doesn't depend on the number of bricks in the logical volume. This is a theorem, which can be strictly proven. 5a663dd9c1d1d6e3c760ef714e5945e7b5539d83 4334 4333 2019-12-31T11:58:38Z Edward 4 Logical volume (LV) can be composed of any number of block devices, different in physical and geometric parameters. However the optimal configuration (true parallelism) imposes some restrictions and dependencies on the size of such devices. WARNING: The stuff is not stable. Don't put important data to logical volumes managed by software of release number 5.X.Y IMPORTANT: Currently there is no tools to manage Reiser5 logical volumes off-line, so it it strongly recommended to save/update configurations of your LV in a file, which doesn't belong to that volume. = Basic definitions. Volume configuration. Brick's capacity. Partitioning. Fair distribution. Balancing = Basic configuration of a logical volume is the following information: * Volume UUID; * Number of bricks in the volume; * List of brick names or UUIDs in the volume; * UUID or name of the brick to be added/removed (if any). That brick is not counted in (2) and (3). For each volume its configuration should be stored somewhere (but not on that volume!) and properly updated before and after each volume operation performed on that volume. We make the user responsible for this. Volume configuration is needed to facilitate deploying a volume. Abstract capacity (or simply capacity) of a brick is a positive integer number. Capacity is a brick's property defined by user. Don't confuse it with the size of block device. Think of it as of brick's "weight" in some units. And this is the user, who decides, which property of the brick to assign as its abstract capacity and in which units. In particular, it can be size of the block device in kilobytes, or its size in megabytes, or its throughput in M/sec, or other geometric or physical parameter of the device, associated with the brick. 
It is important that the capacities of all bricks of the same logical volume are measured in the same units. Also, it would be utterly pointless to assign different properties as abstract capacities for bricks of the same LV - for example, the size of the block device for one brick and disk bandwidth for another.

The capacity of each brick gets initialized by the mkfs utility. By default it is calculated as the number of free blocks on the device at the very end of the formatting procedure. For the meta-data brick it is calculated as 70% of that amount. The capacity of any brick can be changed on-line by the user.

The capacity of a logical volume is defined as the sum of the capacities of its component bricks. The relative capacity of a brick is the ratio of the brick's capacity to the volume's capacity. Relative capacity defines the portion of IO requests that will be issued against that brick. The array of relative capacities (C1, C2, ...) of all bricks is called the volume partitioning. Obviously, C1 + C2 + ... = 1.

(Real) data space usage on a brick is the number of data blocks stored on that brick. Ideal (or expected) data space usage on a brick is T*C, where T is the total number of data blocks stored in the volume and C is the relative capacity of the brick.

It is recommended to compose volumes so that space-based partitioning coincides with throughput-based partitioning - this is the optimal volume configuration, which provides true parallelism. If that is impossible for some reason, then choose a preferred partitioning method (space-based or throughput-based). Note that space-based partitioning saves volume space, whereas throughput-based partitioning saves volume throughput.

When performing regular file operations, Reiser5 distributes data stripes throughout the volume evenly and fairly. This means that the portion of IO requests issued against each brick is equal to its relative capacity, that is, to the portion of capacity that the brick adds to the total volume capacity.
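As a concrete illustration of partitioning and ideal data space usage, here is a small awk sketch (not part of reiser4progs) that computes the relative capacity and the ideal usage T*C for each brick from a list of abstract capacities; the numbers fed to it are the data-brick capacities from the worked example at the end of this page:

```shell
# partitioning T CAP1 CAP2 ...
# T is the total number of data blocks stored in the volume;
# the remaining arguments are the bricks' abstract capacities.
partitioning() {
    T=$1; shift
    echo "$@" | awk -v T="$T" '{
        for (i = 1; i <= NF; i++) vol += $i        # volume capacity = sum of capacities
        for (i = 1; i <= NF; i++)                  # C_i = cap_i / vol; ideal usage = C_i * T
            printf "brick %d: C=%.4f ideal=%.0f\n", i, $i / vol, $i / vol * T
    }'
}

partitioning 2490016 2621069 1310391
```

Note that C1 + C2 = 1 by construction, and the two ideal usages sum to T (up to rounding).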
Most volume operations are accompanied by rebalancing, which keeps the distribution fair. For example, adding a brick to a logical volume changes its partitioning and hence breaks fairness of the distribution, so some data stripes have to be moved to the new brick to make the distribution fair again. Likewise, you cannot simply remove a brick from a logical volume - all data stripes first have to be moved from that brick to the other bricks of the logical volume.

Every time the user performs a volume operation, Reiser5 marks the LV as "not balanced". After successful balancing the status of the LV is changed back to "balanced". If the balancing procedure fails for some reason, it should be resumed manually (with the volume.reiser4 utility). It is allowed to perform regular file operations on an unbalanced LV. However, in this case:

a) we don't guarantee a good quality of data distribution on your LV;
b) you won't be able to perform any volume operation on your LV except balancing - any other volume operation will return an error (EBUSY).

So don't forget to bring your LV to the balanced state as soon as possible!

= Prepare Software and Hardware =

Build, install and boot a kernel with Reiser4 of software framework release number 5.X.Y. Kernel patches can be found [https://sourceforge.net/projects/reiser4/files/v5-unstable/ here]. Note that the Linux kernel and GNU utilities still recognize the testing software as "Reiser4". Make sure the following message appears in the kernel logs:

 "Loading Reiser4 (Software Framework Release: 5.X.Y)"

Build and install the latest [https://sourceforge.net/projects/reiser4/files/reiser4-utils/libaal/ libaal]. Download, build and install the latest version 2.A.B of the [https://sourceforge.net/projects/reiser4/files/v5-unstable/ Reiser4progs package]. Make sure that the utility for managing logical volumes is installed (as a part of the reiser4progs package) on your machine:

 # volume.reiser4 -?

= Creating a logical volume =

Start by choosing a unique ID (UUID) for your volume.
By default it is generated by the mkfs utility. However, the user can generate it himself with a suitable tool (e.g. uuid(1)) and store it in an environment variable for convenience:

 # VOL_ID=`uuid -v4`
 # echo "Using uuid $VOL_ID"

Choose a stripe size for your logical volume. For a good quality of distribution it is recommended that the stripe not exceed 1/10000 of the volume size. On the other hand, too small a stripe will increase space consumption on your meta-data brick. In our example we choose a stripe size of 256K.

Start by creating the first brick of your volume - the meta-data brick - passing the volume ID and stripe size to the mkfs.reiser4 utility:

 # mkfs.reiser4 -U $VOL_ID -t 256K /dev/vdb1

Currently only one meta-data brick per volume is supported, so it is recommended that the block device for the meta-data brick not be too small. In most cases it will be enough if your meta-data brick is not smaller than 1/200 of the maximal volume size. For example, a 100G meta-data brick will be able to service a ~20T logical volume.

Mount your logical volume, which so far consists of one meta-data brick:

 # mount /dev/vdb1 /mnt

Find the record about your volume in the output of the following command:

 # volume.reiser4 -l

Create the configuration of your logical volume (its definition is above) and store it somewhere - but not on that volume! Your logical volume is now on-line and ready to use. You can perform regular file operations and volume operations (e.g. add a data brick to your LV).

= Adding a data brick to LV =

At any time you are able to add a data brick to your LV. You can do it in parallel with regular file operations executing on this volume. Make sure, however, that no other volume operation (e.g. removing a brick) is in progress on your volume, otherwise your operation will fail with EBUSY. Obviously, adding a brick will increase the capacity of your volume.

Choose a block device for the new data brick. Make sure that it is not too large or too small.
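The two sizing rules of thumb earlier in this section (stripe at most 1/10000 of the volume size, meta-data brick at least 1/200 of the maximal volume size) are easy to turn into shell arithmetic; the helper names below are only for illustration:

```shell
# Upper bound for the stripe size: 1/10000 of the planned volume size (bytes).
max_stripe()   { echo $(( $1 / 10000 )); }
# Lower bound for the meta-data brick: 1/200 of the maximal volume size (bytes).
min_md_brick() { echo $(( $1 / 200 )); }

VOL_SIZE=$(( 20 * 1024 * 1024 * 1024 * 1024 ))   # a planned ~20T volume
echo "stripe must not exceed:   $(max_stripe $VOL_SIZE) bytes"
echo "meta-data brick at least: $(min_md_brick $VOL_SIZE) bytes"
```

For a 20T volume this gives a meta-data brick of roughly 102G, which agrees with the "100G meta-data brick services a ~20T volume" estimate above.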
Capacities of any two bricks of the same logical volume cannot differ by more than 2^19 (524288) times. E.g. your logical volume cannot contain both 1M and 2T bricks. Any attempt to add a brick of improper capacity will fail with an error.

Format it in the same way as the meta-data brick, but also specify the "-a" option (to let mkfs know that it is a data brick):

 # mkfs.reiser4 -U $VOL_ID -t 256K -a /dev/vdb2

Important: make sure you specify the same volume ID and stripe size as the other bricks of the logical volume have. Otherwise, the operation of adding the data brick will fail.

Update the configuration of your volume with the UUID or name of the brick you want to add (item #4). To add the brick, simply pass its name as an argument of the option "-a" and specify your LV via its mount point:

 # volume.reiser4 -a /dev/vdb2 /mnt

The procedure of adding a brick automatically invokes rebalancing, which moves a portion of data stripes to the newly added brick (so that the resulting distribution will be fair). The portion of data blocks moved during such rebalancing is equal to the relative capacity of the new brick, that is, to the portion of capacity that the new brick adds to the updated LV's capacity. This important property defines the cost of the balancing procedure: if the portion of capacity added by a brick is small, then the number of stripes moved during balancing is also small.

Like other user-space utilities, the operation of adding a brick can return an error, even assuming that the brick you wanted to add is properly formatted. In this case check the status of your LV:

 # volume.reiser4 /mnt

If the volume is unbalanced, then simply complete balancing manually:

 # volume.reiser4 -b /mnt

Otherwise, check the number of bricks in your LV. Most likely it is the same as it was before the failed operation; in this case simply repeat the operation of adding a brick from scratch. Upon successful completion update your volume configuration.
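The 2^19 capacity-ratio restriction above can be pre-checked before formatting a new brick. This is a sketch of such a check, not something volume.reiser4 itself provides:

```shell
# capacity_ratio_ok CAP_A CAP_B
# Succeeds if the two capacities (same units!) differ by no more than 2^19 times.
capacity_ratio_ok() {
    a=$1; b=$2
    if [ "$a" -gt "$b" ]; then t=$a; a=$b; b=$t; fi   # ensure a <= b
    [ $(( b / a )) -le $(( 1 << 19 )) ]
}

# 1M vs 2T (in 1K units: 1024 vs 2147483648) differ by 2^21 times -- rejected:
capacity_ratio_ok 1024 2147483648 || echo "capacities too far apart"
```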
That is, increment (#2), add info about the new brick to (#3) and remove the record at (#4).

= Removing a data brick from LV =

At any time you are able to remove a data brick from your LV. You can do it in parallel with regular file operations executing on this volume. Make sure, however, that no other volume operation (e.g. adding a brick) is in progress on your volume, otherwise your operation will fail with EBUSY. Obviously, removing a brick will decrease the abstract capacity of your LV. Note that the other bricks should have enough space to store all data blocks of the brick you want to remove; otherwise the removal operation will return an error (ENOSPC).

Suppose you want to remove the brick /dev/vdb2 from your LV mounted at /mnt. Update your volume configuration with the UUID and name of the brick you want to remove (item #4). To remove the brick, simply pass its name as an argument of the option "-r" and specify the logical volume by its mount point:

 # volume.reiser4 -r /dev/vdb2 /mnt

The procedure of brick removal automatically invokes rebalancing, which distributes the data of the brick being removed among the other bricks, so that the resulting distribution is also fair. The portion of data stripes moved during such rebalancing is equal to the relative capacity of the brick being removed (that is, to the portion of capacity that the brick added to the LV's capacity).

It can happen that the command above completes with an error (like other user-space applications). In this case check the status of your LV:

 # volume.reiser4 /mnt

If the volume is not balanced, then simply complete balancing manually:

 # volume.reiser4 -b /mnt

Otherwise, check the number of bricks in your logical volume - it should be the same as before the failed operation. The error ENOSPC indicates that the free space on the other bricks is not enough to fit all the data of the brick you want to remove. On success update your volume configuration: remove the information about the removed brick at #3 and #4.
= Changing brick's capacity =

At any time (assuming that no other volume operation is in progress) you can change the abstract capacity of any brick to some new value different from 0. Changing capacity always changes the volume partitioning and therefore breaks fairness of distribution, so Reiser5 automatically launches rebalancing to make sure that the resulting distribution is fair for the new set of capacities. In particular, increasing a brick's capacity will move some data from other bricks to the brick whose capacity was increased; decreasing a brick's capacity will move some data from the brick whose capacity was decreased to other bricks.

To change the abstract capacity of a brick /dev/vdb1 to a new value (e.g. 200000), simply run

 # volume.reiser4 -z /dev/vdb1 -c 200000 /mnt

Pronounced as "resize brick /dev/vdb1 to new capacity 200000 in the volume mounted at /mnt".

The operation of changing capacity can return an error. Most likely it is ENOSPC, which is a side effect of concurrent regular file writes. In this case check the status of your LV. If it is unbalanced, then consider removing some files from your LV and complete balancing by running

 # volume.reiser4 -b /mnt

Otherwise, repeat the operation from scratch.

Comment. Changing a brick's capacity to 0 is undefined and will return an error. Consider the brick removal operation instead.

= Operations with meta-data brick =

The meta-data brick can also contain data stripes and participate in data distribution like the data bricks, so all the volume operations described above are also applicable to the meta-data brick. Note, however, that it is impossible to completely remove the meta-data brick from the logical volume for obvious reasons (meta-data need to be stored somewhere), so the brick removal operation applied to the meta-data brick actually removes it from the Data Storage Array (DSA), not from the logical volume. The DSA is the subset of the LV consisting of the bricks participating in data distribution.
Once you remove the meta-data brick from the DSA, that brick will be used only to store meta-data. The operation of adding a brick, applied to the meta-data brick, returns it back to the DSA.

Important: Reiser5 doesn't count busy data and meta-data blocks separately. So, in contrast with data bricks (which contain only data), you are not able to find out the real space occupied by data blocks on the meta-data brick - Reiser5 knows only the total space occupied.

To check the status of the meta-data brick simply run

 # volume.reiser4 /mnt

and compare the values of "bricks total" and "bricks in DSA". If they are equal, then the meta-data brick participates in data distribution. Otherwise, "bricks total" should be 1 more than "bricks in DSA" - this indicates that the meta-data brick doesn't participate in data distribution (and therefore doesn't contain data blocks). Note that other cases are impossible: for data bricks, participation in the LV and in the DSA are always equivalent.

= Unmounting a logical volume =

To terminate a mount session, just issue the usual umount command with the mount point specified. Note that after unmounting the volume all bricks by default remain registered in the system until system shutdown. If you want to unregister a brick before system shutdown, then simply issue the following command:

 # volume.reiser4 -u BRICK_NAME

= Deploying a logical volume after correct unmount =

Make sure (by checking your volume configuration) that all bricks of the volume are registered in the system. The list of all volumes and bricks registered in the system can be found in the output of the following command:

 # volume.reiser4 -l

Issue the usual mount command against one of the bricks of your volume. It is recommended to specify the meta-data brick in the mount command. If not all bricks of the volume are registered, then attempts to mount such a volume will fail with a respective kernel message.
NOTE: Reiser5 will refuse to mount a logical volume in the case when a wrong set of bricks is registered in the system. This can happen due to careless handling of off-line volumes, leading to the appearance of "artifacts" in the list of registered bricks. If you want to re-format a brick, make sure it is unregistered.

= Deploying a logical volume after correct shutdown =

To be able to mount your LV, make sure that all its bricks (data and meta-data) are registered in the system. If not all bricks of the volume are registered, then attempts to mount such a volume will fail with a respective kernel message. For this reason we strongly recommend that the user keep track of his LV - store its configuration somewhere, but not on that volume! And don't forget to update that configuration after _every_ volume operation. If you lost the configuration of your LV and don't remember it (which is most likely for large volumes), then it will be rather painful to restore: currently there are no tools to manage off-line logical volumes, so users have to do this on their own. It is not at all difficult.

To register a brick in the system, use the following command:

 # volume.reiser4 -g BRICK_NAME

To print a list of all registered bricks, use

 # volume.reiser4 -l

To mount your LV, just issue a mount command for any one brick of your LV.

Comment. Reiser5 always tries to register the brick which is passed to the mount command as an argument, so there is no need to pre-register the brick you want to issue the mount command against.

= Deploying a logical volume after hard reset or system crash =

If no volume operations were interrupted by the hard reset or system crash, then just follow the instructions in section 9. In Reiser5 only a restricted number of bricks participate in every transaction. The maximal number of such bricks can be specified by the user. At mount time a transaction replay procedure will be launched on each such brick independently, in parallel.
Depending on the kind of interrupted volume operation, perform one of the following actions:

== Adding a brick was interrupted ==

Check your volume configuration. Register the old set of bricks (that is, the set of bricks that the volume had before the operation was applied) and try to mount. In the case of an error, register also the brick you wanted to add and try to mount again. Check the status of your LV by running

 # volume.reiser4 /mnt

If the volume is unbalanced, then complete balancing manually by running

 # volume.reiser4 -b /mnt

Check "bricks total" of your LV in the output of

 # volume.reiser4 /mnt

Compare it with the old number of bricks in the configuration. The new value should be one more than the old one. If the number of bricks is the same, then your operation of adding a brick was completely rolled back by the transaction manager, and you need to repeat it from scratch. Otherwise, your operation was successfully completed - update your volume configuration respectively.

== Brick removal was interrupted ==

Check your volume configuration. Register the old set of bricks (that is, the set of bricks that the volume had before the interrupted operation was applied) except the brick you wanted to remove. Try to mount the volume. In the case of an error, register also the brick you wanted to remove and try to mount again. Check the status of your LV:

 # volume.reiser4 /mnt

If the volume is unbalanced, then complete balancing manually by running

 # volume.reiser4 -b /mnt

Comment. After successful completion of balancing the brick will be automatically removed from the volume. Make sure of it by checking the status of your LV:

 # volume.reiser4 /mnt

Update your volume configuration respectively.

== Another volume operation was interrupted ==

Using the volume configuration, register the new set of bricks and try to mount the volume. The mount should be successful.
Check the status of your LV:

 # volume.reiser4 /mnt

If the volume is unbalanced, then complete balancing manually by running

 # volume.reiser4 -b /mnt

= LV monitoring =

Common info about the LV mounted at /mnt:

 # volume.reiser4 /mnt

 ID:             Volume UUID
 volume:         ID of the plugin managing the volume
 distribution:   ID of the distribution plugin
 stripe:         Stripe size in bytes
 segments:       Number of hash space segments (for distribution)
 bricks total:   Total number of bricks in the volume
 bricks in DSA:  Number of bricks participating in data distribution
 balanced:       Balanced status of the volume

Info about any of its bricks, of index J:

 # volume.reiser4 -p J /mnt

 internal ID:    Brick's "internal ID" and its status in the volume
 external ID:    Brick's UUID
 device name:    Name of the block device associated with the brick
 block count:    Size of the block device in blocks
 blocks used:    Total number of occupied blocks on the device
 system blocks:  Minimal possible number of busy blocks on that device
 data capacity:  Abstract capacity of the brick
 space usage:    Portion of occupied blocks on the device

Comment. When retrieving a brick's info, make sure that no volume operations are in progress on that volume. Otherwise the command above will return an error (EBUSY).

WARNING. Brick info provided this way is not necessarily the most recent. To get up-to-date info, run sync(1) and make sure that no regular file operations are in progress.

= Checking free space =

To check the number of available free blocks on a volume mounted at /mnt, make sure that no regular file operations, and no volume operations, are in progress on that volume, then run

 # sync
 # df --block-size=4K /mnt

To check the number of free blocks on the brick of index J, run

 # volume.reiser4 -p J /mnt

then calculate the difference between "block count" and "blocks used".

Comment. Not all free blocks on a brick/volume are available for use.
The number of available free blocks is always ~95% of the total number of free blocks (Reiser4 reserves 5% to make sure that regular file truncate operations won't fail).

NOTE: volume.reiser4 shows the total number of free blocks, whereas df(1) shows the number of available free blocks. The "space usage" statistic shows the portion of busy blocks on an individual brick. For the reasons explained above, "space usage" on any brick cannot be more than 0.95.

= Checking quality of data distribution =

Quality of data distribution is a measure of the deviation of the real data space usage from the ideal one defined by the volume partitioning. The smaller the deviation, the better the distribution quality. Checking quality of distribution makes sense only in the case when your volume partitioning is space-based, or coincides with the space-based one. If your partitioning is throughput-based and doesn't coincide with the space-based one, then the quality of the actual data distribution can be rather bad: in this case the file system tries to prevent low-performance devices from becoming a bottleneck, and effective space usage is not a high priority.

Checking quality of data distribution is based on the free-blocks accounting provided by the file system. Note that the file system doesn't count busy data and meta-data blocks separately, so you are not able to find the real data space usage, and hence to check quality of distribution, in the case when the meta-data brick contains data blocks.

To check quality of distribution:
* make sure that the meta-data brick doesn't contain data blocks;
* make sure that no regular file operations and no volume operations are currently in progress;
* find the "blocks used", "system blocks" and "data capacity" statistics for each data brick:

 # sync
 # volume.reiser4 -p 1 /mnt
 ...
 # volume.reiser4 -p N /mnt

* find the real data space usage on each brick;
* calculate the partitioning and the ideal data space usage on each data brick;
* find the deviation of the real usage (step 4) from the ideal one (step 5).

Example.
Let's build an LV of 3 bricks (one 10G meta-data brick vdb1, and two data bricks: vdc1 (10G), vdd1 (5G)) with space-based partitioning:

 # VOL_ID=`uuid -v4`
 # echo "Using uuid $VOL_ID"
 # mkfs.reiser4 -U $VOL_ID -y -t 256K /dev/vdb1
 # mkfs.reiser4 -U $VOL_ID -y -a -t 256K /dev/vdc1
 # mkfs.reiser4 -U $VOL_ID -y -a -t 256K /dev/vdd1
 # mount /dev/vdb1 /mnt

Fill the meta-data brick with data:

 # dd if=/dev/zero of=/mnt/myfile bs=256K
 No space left on device...

Add the data bricks /dev/vdc1 and /dev/vdd1 to the volume:

 # volume.reiser4 -a /dev/vdc1 /mnt
 # volume.reiser4 -a /dev/vdd1 /mnt

Move all data blocks to the newly added bricks:

 # volume.reiser4 -r /dev/vdb1 /mnt
 # sync

Now the meta-data brick doesn't contain data blocks (only meta-data ones), so we can calculate the quality of data distribution:

 # volume.reiser4 /mnt -p0
 blocks used: 503
 # volume.reiser4 /mnt -p1
 blocks used: 1657203
 system blocks: 115
 data capacity: 2621069
 # volume.reiser4 /mnt -p2
 blocks used: 833001
 system blocks: 73
 data capacity: 1310391

Based on the statistics above, calculate the quality of distribution.

Total data capacity of the volume:
 C = 2621069 + 1310391 = 3931460
Relative capacities of the data bricks:
 C1 = 2621069 / 3931460 = 0.6667
 C2 = 1310391 / 3931460 = 0.3333
Real space usage on the data bricks (blocks used - system blocks):
 R1 = 1657203 - 115 = 1657088
 R2 = 833001 - 73 = 832928
Real space usage on the volume:
 R = R1 + R2 = 1657088 + 832928 = 2490016
Ideal data space usage on the data bricks:
 I1 = C1 * R = 0.6667 * 2490016 = 1660094
 I2 = C2 * R = 0.3333 * 2490016 = 829922
Deviation:
 (D1, D2) = (R1, R2) - (I1, I2) = (-3006, 3006)
Relative deviation:
 (D1, D2)/R = (-0.0012, 0.0012)
Quality of distribution:
 Q = 1 - max(|D1|, |D2|)/R = 1 - 0.0012 = 0.9988

Comment. For any specified number of bricks N and quality of distribution Q it is possible to find a configuration of a logical volume composed of N bricks such that the quality of distribution on that volume will be better than Q.
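The hand calculation above can be reproduced with a short awk program (a sketch; the inputs are the volume.reiser4 figures from this example, and tiny differences against the intermediate hand results come from rounding C1 and C2 to four digits there):

```shell
# Quality of distribution, recomputed from the volume.reiser4 statistics
# of the two data bricks in the example above.
quality() {
    awk 'BEGIN {
        u1 = 1657203; s1 = 115; c1 = 2621069   # brick 1: blocks used, system blocks, data capacity
        u2 = 833001;  s2 = 73;  c2 = 1310391   # brick 2
        C1 = c1 / (c1 + c2); C2 = c2 / (c1 + c2)       # relative capacities
        R1 = u1 - s1; R2 = u2 - s2                     # real data space usage per brick
        R  = R1 + R2                                   # real data space usage on the volume
        D1 = (R1 - C1 * R) / R; if (D1 < 0) D1 = -D1   # |relative deviation|, brick 1
        D2 = (R2 - C2 * R) / R; if (D2 < 0) D2 = -D2   # |relative deviation|, brick 2
        printf "Q=%.4f\n", 1 - (D1 > D2 ? D1 : D2)     # Q = 1 - max(|D1|, |D2|)
    }'
}
quality
```

This prints Q=0.9988, in agreement with the hand calculation.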
Quality of distribution Q doesn't depend on the number of bricks in the logical volume. This is a theorem that can be strictly proven.
It is important that capacities of all bricks of the same logical volume are measured in the same units. Also, it would be utterly pointless to assign different properties as abstract capacities for bricks of the same LV. For example, size of block device for one brick, and disk bandwidth for another one. Capacity of each brick gets initialized by mkfs utility. By default it is calculated as number of free blocks on the device at the very end of the formatting procedure. For meta-data brick it is calculated as 70% of such amount. Capacity of any brick can be changed on-line by user. Capacity of a logical volume is defined as a sum of capacities of its bricks-components. Relative capacity of a brick is the ratio of brick's capacity to volume's capacity. Relative capacity defines a portion of IO-requests that will be issued against that brick. Array of relative capacities (C1, C2, ...) of all bricks is called volume partitioning. Obviously, C1 + C2 + ... = 1. (Real) data space usage on a brick is number of data blocks, stored on that brick. Ideal (or expected) data space usage on a brick is T*C, where T is total number of data blocks stored in the volume. C is relative capacity of the brick. It is recommended to compose volumes in the way so that space-based partitioning coincides with throughput-based one - it would be the optimal volume configuration, which provides true parallelism. If it is impossible for some reason, then choose a preferred partitioning method (space-based, or throughput-based). Note that space-based partitioning saves volume space, whereas throughput based one saves volume throughput. When performing regular file operations, Reiser5 distributes data stripes throughout the volume evenly and fairly. It means that portion of IO-requests issued against each brick is equal to its relative capacity, that is, to the portion of capacity that the brick adds to the total volume's capacity. 
Most volume operations are accompanied by rebalancing, which keeps fairness of distribution. For example, adding a brick to a logical volume changes its partitioning, and hence, breaks fairness of the distribution, so we need to move some data stripes to the new brick to make distribution fair. Also you can not simply remove a brick from a logical volume - all data stripes should be moved from that brick to other bricks of the logical volume. Every time when user performs a volume operation, Reiser5 marks LV as "not balanced". After successful balancing the status of LV is changed to "balanced". If balancing procedure fails for some reasons, it should be resumed manually (with volume.reiser4 utility). It is allowed to perform regular file operations on not balanced LV. However, in this case: a) we don't guarantee a good quality of data distribution on your LV. b) you won't be able to perform volume operations on your LV except balancing - any other volume operation will return error (EBUSY). So, don't forget to bring your LV to the balanced state as soon as possible! = Prepare Software and Hardware = Build, install and boot kernel with Reiser4 of software framework release number 5.X.Y. Kernel patches can be found here: https://sourceforge.net/projects/reiser4/files/v5-unstable/ Note that by Linux kernel and GNU utilities the testing stuff is still recognized as "Reiser4". Make sure there is the following message in kernel logs: "Loading Reiser4 (Software Framework Release: 5.X.Y)" Build and install the latest libaal library: https://sourceforge.net/projects/reiser4/files/reiser4-utils/libaal/ Download, build and install the latest version 2.A.B of Reiser4progs package: https://sourceforge.net/projects/reiser4/files/v5-unstable/ Make sure that utility for managing logical volumes is installed (as a part of reiser4progs package) on your machine: # volume.reiser4 -? = Creating a logical volume = Start from choosing a unique ID (uuid) of your volume. 
By default it is generated by mkfs utility. However, user can generate it himself by proper tools (e.g. uuid(1)) and store in an environment variable for convenience: # VOL_ID=`uuid -v4` # echo "Using uuid $VOL_ID" Choose a stripe size for your logical volume. For a good quality of distribution it is recommended that stripe doesn't exceed 1/10000 of volume size. On the other hand, too small stripes will increase space consumption on your meta-data brick. In our example we choose stripe size 256K. Start from creating the first brick of your volume - meta-data brick, passing volume-ID and stripe size to mkfs.reiser4 utility: # mkfs.reiser4 -U $VOL_ID -t 256K /dev/vdb1 Currently only one meta-data brick per volume is supported, so it is recommended that size of block device for meta-data brick in not too small. In most cases it will be enough, if your meta-data brick is not smaller than 1/200 of maximal volume size. For example, 100G meta-data brick will be able to service ~20T logical volume. Mount your logical volume consisting of one meta-data brick: # mount /dev/vdb1 /mnt Find a record about your volume in the output of the following command: # volume.reiser4 -l Create configuration of your logical volume (its definition is above) and store it somewhere, but not on that volume! Your logical volume is now on-line and ready to use. You can perform regular file operations and volume operations (e.g. add a data brick to your LV). = Adding a data brick to LV = At any time you are able to add a data brick to your LV. You can do it in parallel with regular file operations executing on this volume. Make sure, however, that there is no other volume operations (e.g. removing a brick) over your volume in progress, otherwise your operation will fail with EBUSY. Obviously, adding a brick will increase capacity of your volume. Choose a block device for the new data brick. Make sure that it is not too large, or too small. 
Capacities of any 2 bricks of the same logical volume can not differ more than 2^19 (~1 million) times. E.g. your logical volume can not contain both, 1M and 2T bricks. Any attempts to add a brick of improper capacity will fail with error. Format it by the same way as meta-data brick, but specify also "-a" option (to let mkfs know that it is data brick). # mkfs.reiser4 -U $VOL_ID -t 256K -a /dev/vdb2 Important: make sure you specified the same volume ID and stripe size as other bricks of the logical volume do have. Otherwise, operation of adding a data brick will fail. Update configuration of your volume with UUID or name of the brick you want to add (item #4). To add a brick simply pass its name as an argument for the option "-a" and specify your LV via its mount point: # volume.reiser4 -a /dev/vdb2 /mnt The procedure of adding a brick automatically invokes re-balancing, which moves a portion of data stripes to the newly added brick (so that the resulted distribution will fair). Portion of data blocks moved during such rebalancing is equal to the relative capacity of the new brick, that is to the portion of capacity that the new brick adds to updated LV's capacity. This important property defines the cost of balancing procedure. If the portion of capacity added by a brick is small, then number of stripes moved during balancing is also small. Like other user-space utilities, the operation of adding a brick can return error, even in the assumption that the brick you wanted to add is properly formatted. In this case check the status of your LV: # volume.reiser4 /mnt If the volume is unbalanced, then simply complete balancing manually: # volume.reiser4 -b /mnt Otherwise, check number of bricks in your LV. Most likely that it is the same as it was before the failed operation. In this case simply repeat the operation of adding a brick from scratch. Upon successful completion update your volume configuration. 
That is, increment (#2), add info about the new brick to (#3), and remove the record at (#4).

= Removing a data brick from LV =

At any time you can remove a data brick from your LV. You can do this in parallel with regular file operations executing on the volume. Make sure, however, that no other volume operation (e.g. adding a brick) is in progress on the volume, otherwise your operation will fail with EBUSY. Obviously, removing a brick decreases the abstract capacity of your LV. Note that the other bricks must have enough space to store all data blocks of the brick you want to remove; otherwise, the removal operation will return an error (ENOSPC).

Suppose you want to remove the brick /dev/vdb2 from your LV mounted at /mnt. Update your volume configuration with the UUID and name of the brick you want to remove (item #4). To remove a brick, simply pass its name as an argument to the option "-r" and specify the logical volume by its mount point:

 # volume.reiser4 -r /dev/vdb2 /mnt

The procedure of brick removal automatically invokes re-balancing, which distributes the data of the brick to be removed among the other bricks, so that the resulting distribution is also fair. The portion of data stripes moved during such rebalancing equals the relative capacity of the brick to be removed (that is, the portion of capacity that the brick added to the LV's capacity).

It can happen that the command above completes with an error (like other user-space applications). In this case check the status of your LV:

 # volume.reiser4 /mnt

If the volume is not balanced, then simply complete balancing manually:

 # volume.reiser4 -b /mnt

Otherwise, check the number of bricks in your logical volume - it should be the same as before the failed operation. The error -ENOSPC indicates that there is not enough free space on the other bricks to fit all the data of the brick you want to remove. On success update your volume configuration: remove information about the removed brick at #3 and #4.
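The -ENOSPC precondition for removal can be sketched as follows (illustrative helper name; note that in practice only ~95% of free blocks are usable, see "Checking free space" below):

```python
def removal_fits(data_blocks_on_brick: int, free_blocks_on_others: list) -> bool:
    """A brick can be removed only if its data blocks fit into the free
    space of the remaining bricks; otherwise removal fails with -ENOSPC."""
    return data_blocks_on_brick <= sum(free_blocks_on_others)

# 800k data blocks fit into 500k + 400k free blocks elsewhere...
print(removal_fits(800_000, [500_000, 400_000]))  # -> True
# ...but not into 500k + 200k.
print(removal_fits(800_000, [500_000, 200_000]))  # -> False
```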
= Changing brick's capacity =

At any time (assuming that no other volume operation is in progress) you can change the abstract capacity of any brick to some new value different from 0. Changing a capacity always changes the volume partitioning and therefore breaks fairness of distribution, so Reiser5 automatically launches rebalancing to make sure that the resulting distribution is fair for the new set of capacities. In particular, increasing a brick's capacity will move some data from other bricks to the brick whose capacity was increased. Decreasing a brick's capacity will move some data from the brick whose capacity was decreased to other bricks.

To change the abstract capacity of the brick /dev/vdb1 to a new value (e.g. 200000), simply run

 # volume.reiser4 -z /dev/vdb1 -c 200000 /mnt

pronounced as "resize brick /dev/vdb1 to new capacity 200000 in the volume mounted at /mnt".

The operation of changing a capacity can return an error. Most likely it is -ENOSPC, a side effect of concurrent regular file writes. In this case check the status of your LV. If it is unbalanced, then consider removing some files from your LV and complete balancing by running

 # volume.reiser4 -b /mnt

Otherwise, repeat the operation from scratch.

Comment. Changing a brick's capacity to 0 is undefined and will return an error. Consider the brick removal operation instead.

= Operations with meta-data brick =

The meta-data brick can also contain data stripes and participate in data distribution like the data bricks, so all the volume operations described above are also applicable to the meta-data brick. Note, however, that it is impossible to completely remove the meta-data brick from the logical volume for obvious reasons (meta-data needs to be stored somewhere), so the brick removal operation applied to the meta-data brick actually removes it from the Data-Storage Array (DSA), not from the logical volume. The DSA is the subset of the LV consisting of the bricks participating in data distribution.
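The re-partitioning caused by a capacity change, described above, follows directly from the definition of relative capacity; a sketch with hypothetical numbers:

```python
def partitioning(capacities):
    """Relative capacities (C1, C2, ...) of the bricks; they always sum to 1."""
    total = sum(capacities)
    return [c / total for c in capacities]

# Hypothetical volume: three bricks with capacities 100000, 100000, 200000.
before = partitioning([100_000, 100_000, 200_000])
# Resize the first brick to 200000: its share grows from 0.25 to 0.4,
# so rebalancing moves data toward it and away from the others.
after = partitioning([200_000, 100_000, 200_000])
print(before)  # -> [0.25, 0.25, 0.5]
print(after)   # -> [0.4, 0.2, 0.4]
```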
Once you remove the meta-data brick from the DSA, that brick will be used only to store meta-data. The operation of adding a brick, applied to the meta-data brick, returns it to the DSA.

Important: Reiser5 doesn't count busy data and meta-data blocks separately. So, in contrast with data bricks (which contain only data), you are not able to find out the real space occupied by data blocks on the meta-data brick - Reiser5 knows only the total space occupied.

To check the status of the meta-data brick, simply run

 # volume.reiser4 /mnt

and compare the values of "bricks total" and "bricks in DSA". If they are equal, then the meta-data brick participates in data distribution. Otherwise "bricks total" should be 1 more than "bricks in DSA", which indicates that the meta-data brick doesn't participate in data distribution (and therefore doesn't contain data blocks). Note that other cases are impossible: for data bricks, membership in the LV and in the DSA is always equivalent.

= Unmounting a logical volume =

To terminate a mount session, just issue the usual umount command with the mount point specified. Note that after unmounting the volume all bricks by default remain registered in the system until system shutdown. If you want to unregister a brick before system shutdown, then simply issue the following command:

 # volume.reiser4 -u BRICK_NAME

= Deploying a logical volume after correct unmount =

Make sure (by checking your volume configuration) that all bricks of the volume are registered in the system. The list of all volumes and bricks registered in the system can be found in the output of the following command:

 # volume.reiser4 -l

Issue the usual mount command against one of the bricks of your volume. It is recommended to specify the meta-data brick in the mount command. If not all bricks of the volume are registered, then attempts to mount the volume will fail with a respective kernel message.
NOTE: Reiser5 will refuse to mount a logical volume when a wrong set of bricks is registered in the system. This can happen due to careless handling of off-line volumes, leading to the appearance of "artifacts" in the list of registered bricks. If you want to re-format a brick, make sure it is unregistered.

= Deploying a logical volume after correct shutdown =

To be able to mount your LV, make sure that all its bricks (data and meta-data) are registered in the system. If not all bricks of the volume are registered, then attempts to mount the volume will fail with a respective kernel message. For this reason we strongly recommend keeping track of your LV - store its configuration somewhere, but not on this volume! And don't forget to update that configuration after _every_ volume operation. If you lose the configuration of your LV and don't remember it (which is most likely for large volumes), then it will be rather painful to restore it: currently there are no tools to manage off-line logical volumes, so users are expected to do this on their own. It is not at all difficult.

To register a brick in the system, use the following command:

 # volume.reiser4 -g BRICK_NAME

To print a list of all registered bricks, use

 # volume.reiser4 -l

To mount your LV, just issue a mount command for any one brick of your LV.

Comment. Reiser5 always tries to register the brick which is passed to the mount command as an argument, so there is no need to pre-register the brick you are going to issue the mount command against.

= Deploying a logical volume after hard reset or system crash =

If no volume operations were interrupted by the hard reset or system crash, then just follow the instructions of the section "Deploying a logical volume after correct unmount". In Reiser5 only a restricted number of bricks participates in every transaction. The maximal number of such bricks can be specified by the user. At mount time a transaction replay procedure will be launched on each such brick independently, in parallel.
Depending on the kind of interrupted volume operation, perform one of the following actions.

== Adding a brick was interrupted ==

Check your volume configuration. Register the old set of bricks (that is, the set of bricks that the volume had before the operation) and try to mount. In the case of an error, register also the brick you wanted to add and try to mount again. Check the status of your LV by running

 # volume.reiser4 /mnt

If the volume is unbalanced, then complete balancing manually by running

 # volume.reiser4 -b /mnt

Check "bricks total" of your LV in the output of

 # volume.reiser4 /mnt

Compare it with the old number of bricks in the configuration. The new value should be the old one incremented by 1. If the number of bricks is the same, then your operation of adding a brick was completely rolled back by the transaction manager, and you need to repeat it from scratch. Otherwise, your operation was successfully completed - update your volume configuration respectively.

== Brick removal was interrupted ==

Check your volume configuration. Register the old set of bricks (that is, the set of bricks that the volume had before the interrupted operation) except the brick you wanted to remove. Try to mount the volume. In the case of an error, register also the brick you wanted to remove and try to mount again. Check the status of your LV:

 # volume.reiser4 /mnt

If the volume is unbalanced, then complete balancing manually by running

 # volume.reiser4 -b /mnt

Comment. After successful completion of balancing, the brick will be automatically removed from the volume. Make sure of it by checking the status of your LV:

 # volume.reiser4 /mnt

Update your volume configuration respectively.

== Another volume operation was interrupted ==

Using the volume configuration, register the new set of bricks and try to mount the volume. The mount should be successful.
Check the status of your LV:

 # volume.reiser4 /mnt

If the volume is unbalanced, then complete balancing manually by running

 # volume.reiser4 -b /mnt

= LV monitoring =

Common info about the LV mounted at /mnt:

 # volume.reiser4 /mnt

 ID:             Volume UUID
 volume:         ID of the plugin managing the volume
 distribution:   ID of the distribution plugin
 stripe:         Stripe size in bytes
 segments:       Number of hash space segments (for distribution)
 bricks total:   Total number of bricks in the volume
 bricks in DSA:  Number of bricks participating in data distribution
 balanced:       Balanced status of the volume

Info about its brick of index J:

 # volume.reiser4 -p J /mnt

 internal ID:    Brick's "internal ID" and its status in the volume
 external ID:    Brick's UUID
 device name:    Name of the block device associated with the brick
 block count:    Size of the block device in blocks
 blocks used:    Total number of occupied blocks on the device
 system blocks:  Minimal possible number of busy blocks on that device
 data capacity:  Abstract capacity of the brick
 space usage:    Portion of occupied blocks on the device

Comment. When retrieving a brick's info, make sure that no volume operations are in progress on that volume. Otherwise the command above will return an error (EBUSY).

WARNING. Brick info provided this way is not necessarily the most recent. To get actual info, run sync(1) and make sure that no regular file operations are in progress.

= Checking free space =

To check the number of available free blocks on a volume mounted at /mnt, make sure that no regular file operations, and no volume operations, are in progress on that volume, then run

 # sync
 # df --block-size=4K /mnt

To check the number of free blocks on the brick of index J, run

 # volume.reiser4 -p J /mnt

and calculate the difference between "block count" and "blocks used".

Comment. Not all free blocks on a brick/volume are available for use.
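The free-space arithmetic, including the ~5% reserve Reiser4 keeps (explained next), can be sketched as (illustrative helper names; the brick numbers are hypothetical):

```python
RESERVED_FRACTION = 0.05  # Reiser4 keeps ~5% of free blocks in reserve

def free_blocks(block_count: int, blocks_used: int) -> int:
    """Free blocks on a brick: "block count" minus "blocks used"."""
    return block_count - blocks_used

def available_blocks(block_count: int, blocks_used: int) -> int:
    """Free blocks actually usable, as df(1) reports: ~95% of free."""
    return int(free_blocks(block_count, blocks_used) * (1 - RESERVED_FRACTION))

# Hypothetical brick: 2621184 blocks total, 1657203 used.
print(free_blocks(2_621_184, 1_657_203))       # -> 963981
print(available_blocks(2_621_184, 1_657_203))  # -> 915781
```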
The number of available free blocks is always ~95% of the total number of free blocks (Reiser4 reserves 5% to make sure that regular file truncate operations won't fail).

NOTE: volume.reiser4 shows the total number of free blocks, whereas df(1) shows the number of available free blocks.

The "space usage" statistic shows the portion of busy blocks on an individual brick. For the reasons explained above, "space usage" on any brick cannot exceed 0.95.

= Checking quality of data distribution =

The quality of data distribution is a measure of the deviation of the real data space usage from the ideal one defined by the volume partitioning. The smaller the deviation, the better the distribution quality.

Checking the quality of distribution makes sense only when your volume partitioning is space-based, or coincides with the space-based one. If your partitioning is throughput-based and doesn't coincide with the space-based one, then the quality of the actual data distribution can be rather bad: in this case the file system takes care that low-performance devices don't become a bottleneck, and effective space usage is not a high priority.

Checking the quality of data distribution is based on the free-blocks accounting provided by the file system. Note that the file system doesn't count busy data and meta-data blocks separately, so you cannot find the real data space usage, and hence cannot check the quality of distribution, when the meta-data brick contains data blocks.

To check the quality of distribution:
* make sure that the meta-data brick doesn't contain data blocks;
* make sure that no regular file and volume operations are currently in progress;
* find the "blocks used", "system blocks" and "data capacity" statistics for each data brick:

 # sync
 # volume.reiser4 -p 1 /mnt
 ...
 # volume.reiser4 -p N /mnt

* find the real data space usage on each brick;
* calculate the partitioning and the ideal data space usage on each data brick;
* find the deviation of the real usage from the ideal one.

Example.
Let's build an LV of 3 bricks (one 10G meta-data brick vdb1, and two data bricks: vdc1 (10G) and vdd1 (5G)) with space-based partitioning:

 # VOL_ID=`uuid -v4`
 # echo "Using uuid $VOL_ID"
 # mkfs.reiser4 -U $VOL_ID -y -t 256K /dev/vdb1
 # mkfs.reiser4 -U $VOL_ID -y -a -t 256K /dev/vdc1
 # mkfs.reiser4 -U $VOL_ID -y -a -t 256K /dev/vdd1
 # mount /dev/vdb1 /mnt

Fill the meta-data brick with data:

 # dd if=/dev/zero of=/mnt/myfile bs=256K
 No space left on device...

Add the data bricks /dev/vdc1 and /dev/vdd1 to the volume:

 # volume.reiser4 -a /dev/vdc1 /mnt
 # volume.reiser4 -a /dev/vdd1 /mnt

Move all data blocks to the newly added bricks:

 # volume.reiser4 -r /dev/vdb1 /mnt
 # sync

Now the meta-data brick doesn't contain data blocks (only meta-data), so we can calculate the quality of data distribution:

 # volume.reiser4 /mnt -p0
 blocks used: 503
 # volume.reiser4 /mnt -p1
 blocks used: 1657203
 system blocks: 115
 data capacity: 2621069
 # volume.reiser4 /mnt -p2
 blocks used: 833001
 system blocks: 73
 data capacity: 1310391

Based on the statistics above, calculate the quality of distribution.

Total data capacity of the volume:
 C = 2621069 + 1310391 = 3931460
Relative capacities of the data bricks:
 C1 = 2621069 / 3931460 = 0.6667
 C2 = 1310391 / 3931460 = 0.3333
Real space usage on the data bricks (blocks used - system blocks):
 R1 = 1657203 - 115 = 1657088
 R2 = 833001 - 73 = 832928
Real space usage on the volume:
 R = R1 + R2 = 1657088 + 832928 = 2490016
Ideal data space usage on the data bricks:
 I1 = C1 * R = 0.6667 * 2490016 = 1660094
 I2 = C2 * R = 0.3333 * 2490016 = 829922
Deviation:
 D = (R1 - I1, R2 - I2) = (-3006, 3006)
Relative deviation:
 D/R = (-0.0012, 0.0012)
Quality of distribution:
 Q = 1 - max(|D1|, |D2|)/R = 1 - 0.0012 = 0.9988

Comment. For any specified number of bricks N and quality of distribution Q, it is possible to find a configuration of a logical volume composed of N bricks such that the quality of distribution on that volume is better than Q.

Comment.
The quality of distribution Q doesn't depend on the number of bricks in the logical volume. This is a theorem, which can be strictly proven.
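The arithmetic of the worked example above can be replayed in a short script (an illustrative reimplementation of the document's formulas, using the statistics printed by volume.reiser4):

```python
def distribution_quality(bricks):
    """Quality of data distribution: Q = 1 - (maximal relative deviation of
    the real data space usage from the ideal one defined by partitioning).
    Each brick is a tuple (blocks_used, system_blocks, data_capacity)."""
    real = [used - system for used, system, _ in bricks]            # R1, R2, ...
    total_real = sum(real)                                          # R
    total_cap = sum(cap for _, _, cap in bricks)                    # C
    ideal = [cap / total_cap * total_real for _, _, cap in bricks]  # I1, I2, ...
    max_rel_dev = max(abs(r - i) for r, i in zip(real, ideal)) / total_real
    return 1 - max_rel_dev

# The two data bricks from the example (the meta-data brick holds no data):
q = distribution_quality([
    (1657203, 115, 2621069),   # /dev/vdc1
    (833001,   73, 1310391),   # /dev/vdd1
])
print(round(q, 4))  # -> 0.9988
```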
By default it is generated by mkfs utility. However, user can generate it himself by proper tools (e.g. uuid(1)) and store in an environment variable for convenience: # VOL_ID=`uuid -v4` # echo "Using uuid $VOL_ID" Choose a stripe size for your logical volume. For a good quality of distribution it is recommended that stripe doesn't exceed 1/10000 of volume size. On the other hand, too small stripes will increase space consumption on your meta-data brick. In our example we choose stripe size 256K. Start from creating the first brick of your volume - meta-data brick, passing volume-ID and stripe size to mkfs.reiser4 utility: # mkfs.reiser4 -U $VOL_ID -t 256K /dev/vdb1 Currently only one meta-data brick per volume is supported, so it is recommended that size of block device for meta-data brick in not too small. In most cases it will be enough, if your meta-data brick is not smaller than 1/200 of maximal volume size. For example, 100G meta-data brick will be able to service ~20T logical volume. Mount your logical volume consisting of one meta-data brick: # mount /dev/vdb1 /mnt Find a record about your volume in the output of the following command: # volume.reiser4 -l Create configuration of your logical volume (its definition is above) and store it somewhere, but not on that volume! Your logical volume is now on-line and ready to use. You can perform regular file operations and volume operations (e.g. add a data brick to your LV). = Adding a data brick to LV = At any time you are able to add a data brick to your LV. You can do it in parallel with regular file operations executing on this volume. Make sure, however, that there is no other volume operations (e.g. removing a brick) over your volume in progress, otherwise your operation will fail with EBUSY. Obviously, adding a brick will increase capacity of your volume. Choose a block device for the new data brick. Make sure that it is not too large, or too small. 
Capacities of any 2 bricks of the same logical volume can not differ more than 2^19 (~1 million) times. E.g. your logical volume can not contain both, 1M and 2T bricks. Any attempts to add a brick of improper capacity will fail with error. Format it by the same way as meta-data brick, but specify also "-a" option (to let mkfs know that it is data brick). # mkfs.reiser4 -U $VOL_ID -t 256K -a /dev/vdb2 Important: make sure you specified the same volume ID and stripe size as other bricks of the logical volume do have. Otherwise, operation of adding a data brick will fail. Update configuration of your volume with UUID or name of the brick you want to add (item #4). To add a brick simply pass its name as an argument for the option "-a" and specify your LV via its mount point: # volume.reiser4 -a /dev/vdb2 /mnt The procedure of adding a brick automatically invokes re-balancing, which moves a portion of data stripes to the newly added brick (so that the resulted distribution will fair). Portion of data blocks moved during such rebalancing is equal to the relative capacity of the new brick, that is to the portion of capacity that the new brick adds to updated LV's capacity. This important property defines the cost of balancing procedure. If the portion of capacity added by a brick is small, then number of stripes moved during balancing is also small. Like other user-space utilities, the operation of adding a brick can return error, even in the assumption that the brick you wanted to add is properly formatted. In this case check the status of your LV: # volume.reiser4 /mnt If the volume is unbalanced, then simply complete balancing manually: # volume.reiser4 -b /mnt Otherwise, check number of bricks in your LV. Most likely that it is the same as it was before the failed operation. In this case simply repeat the operation of adding a brick from scratch. Upon successful completion update your volume configuration. 
That is, increment (#2), add info about the new brick to (#3) and remove records at (#4). = Removing a data brick from LV = At any time you are able to remove a data brick from your LV. You can do it in parallel with regular file operations executing on this volume. Make sure, however, that there is no other volume operations (e.g. adding a brick) over your volume in progress, otherwise your operation will fail with EBUSY. Obviously, removing a brick will decrease abstract capacity of your LV. Note that other bricks should have enough space to store all data blocks of the brick you want to remove, otherwise, the removal operation will return error (ENOSPC). Suppose you want to remove brick /dev/vdb2 from your LV mounted at /mnt. Update your volume configuration with the UUID and name of the brick you want to remove (#item #4). To remove a brick simply pass its name as an argument for option "-r" and specify the logical volume by its mount point: # volume.reiser4 -r /dev/vdb2 /mnt The procedure of brick removal automatically invokes re-balancing, which distributes data of the brick to be removed among other bricks, so that resulted distribution is also fair. Portion of data stripes moved during such rebalancing is equal to the relative capacity of the brick to be removed (that it to the portion of capacity that the brick added to LV's capacity). It can happen, that the command above completes with error (like other user-space applications). In this case check the status of your LV: # volume.reiser4 /mnt If volume is not balanced, then simply complete balancing manually: # volume.reiser4 -b /mnt Otherwise, check the number of the bricks in your logical volume - it should be the same as before the failed operation. The error -ENOSPC indicates that free space on other bricks is not enough to fit all the data of the brick you want to remove. On success update your volume configuration: remove information about the removed brick at #3 and #4. 
= Changing brick's capacity At any time (in the assumption that no other volume operation is in progress) you can change abstract capacity of any brick to some new value, different from 0. Changing capacity always changes volume partitioning, and therefore, breaks fairness of distribution, so Reiser5 automatically launches rebalancing to make sure that resulted distribution is fair for the new set of capacities. In particular, increasing bricks capacity will move some data from other bricks to the brick, whose capacity was increased. Decreasing bricks capacity will move some data from the brick, whose capacity was decreased, to other bricks. To change abstract capacity of a brick /dev/vdb1 to a new value (e.g. 200000), simply run # volume.reiser4 -z /dev/vdb1 -c 200000 /mnt Pronounced as "resize brick /dev/vdb1 to new capacity 200000 in volume mounted at /mnt". The operation of changing capacity can return error. Most likely, it is -ENOSPC, which is a side effect of concurrent regular file writes. In this case check the status of your LV. If it is unbalanced, then consider removing some files from your LV and complete balancing by running # volume.reiser4 -b /mnt Otherwise, repeat the operation from scratch. Comment. Changing bricks capacity to 0 is undefined and will return error. Consider brick removal operation instead. = Operations with meta-data brick = Meta-data brick can also contain data stripes and participate in data distribution like other data bricks. So that all the volume operations described above are also applicable to meta-data brick. Note, however, that it is impossible to completely remove meta-data brick from the logical volume for obvious reasons (meta-data need to be stored somewhere), so brick removal operation applied to the meta-data brick actually removes it from Data-Storage Array (DSA), not from the logical volume. DSA is a subset of LV consisting of bricks, participating in data distribution. 
Once you remove meta-data brick from DSA, that brick will be used only to store meta-data. Operation of adding a brick, being applied to a meta-data brick, returns the last one back to DSA. Important: Reiser5 doesn't count busy data and meta-data blocks separately. So in contrast with data bricks (which contain only data) you are not able to find out real space occupied by data blocks on the meta-data brick - Reiser5 knows only total space occupied. To check the status of meta-data brick simply run # volume.reiser4 /mnt and compare values of "bricks total" and "bricks in DSA". If they are equal, then meta-data brick participates in data distribution. Otherwise, "bricks total" should be 1 more than "bricks in DSA" - it indicates that meta-data brick doesn't participate in data distribution (and therefore, doesn't contain data blocks). Note that other cases are impossible: for data bricks participation in LV and DSA is always equivalent. = Unmounting a logical volume = To terminate a mount session just issue usual umount command with the mount point specified. Note that after unmounting the volume all bricks by default remain to be registered in the system till system shutdown. If you want to unregister a brick before system shutdown, then simply issue the following command: # volume.reiser4 -u BRICK_NAME = Deploying a logical volume after correct unmount = Make sure (by checking your volume configuration) that all bricks of the volume are registered in the system. The list of all volumes and bricks registered in the system can be found in the output of the following command: # volume.reisrer4 -l Issue usual mount command against one of the bricks of your volume. It is recommended to specify meta-data brick in the mount command. If not all bricks of the volume are registered, then attempts to mount such volume will fail with a respective kernel message. 
NOTE: Reiser5 will refuse to mount a logical volume, in the case, when a wrong set of bricks is registered in the system. It can happen due to careless handling of off-line volumes, leading to the appearance of "artifacts" in the list of registered bricks. If you want to re-format a brick, make sure it is unregistered. = Deploying a logical volume after correct shutdown = To be able to mount your LV make sure that all its bricks (data and meta-data) are registered in the system. If not all bricks of the volume are registered, then attempts to mount such volume will fail with a respective kernel message. For this reasons we strongly recommend for user to keep a track of his LV - store its configuration somewhere, but not in this volume! And don't forget to update that configuration after _every_ volume operation. If you lost configuration of your LV and don't remember it (wich is most likely for large volumes), then it will be rather painful to restore it: currently there is no tools for to manage offline logical volumes. So that, users are prompted to do this on their own. It is not at all difficult. To register a brick in the system use the following command: # volume.reiser4 -g BRICK_NAME To print a list of all registered bricks use # volume.reiser4 -l To mount your LV just issue a mount command for any one brick of your LV. Comment. Reiser5 always tries to register the brick which is passed to the mount command as an argument, so it is not needed to preregister bricks you want to issue a mount command against. = Deploying a logical volume after hard reset or system crash = If no volume operations were interrupted by hard reset or system crash, then just follow the instructions in section 9. In Reiser5 only restricted number of bricks participate in every transaction. Maximal number of such bricks can be specified by user. At mount time a transaction replay procedure will be launched on each such brick independently in parallel. 
Depending on the kind of interrupted volume operation, perform one of the following actions:

== Adding a brick was interrupted ==

Check your volume configuration. Register the old set of bricks (that is, the set of bricks that the volume had before applying the operation) and try to mount. In the case of error, register also the brick you wanted to add and try to mount again. Check the status of your LV by running

# volume.reiser4 /mnt

If the volume is unbalanced, then complete balancing manually by running

# volume.reiser4 -b /mnt

Check "bricks total" of your LV in the output of

# volume.reiser4 /mnt

Compare it with the old number of bricks in the configuration. The new value should exceed the old one by 1. If the number of bricks is the same, then your operation of adding a brick was completely rolled back by the transaction manager, so you need to repeat it from scratch. Otherwise, your operation was successfully completed - update your volume configuration accordingly.

== Brick removal was interrupted ==

Check your volume configuration. Register the old set of bricks (that is, the set of bricks that the volume had before applying the interrupted operation) except the brick you wanted to remove. Try to mount the volume. In the case of error, register also the brick you wanted to remove and try to mount again. Check the status of your LV:

# volume.reiser4 /mnt

If the volume is unbalanced, then complete balancing manually by running

# volume.reiser4 -b /mnt

Comment. After successful completion of balancing, the brick will be automatically removed from the volume. Make sure of it by checking the status of your LV:

# volume.reiser4 /mnt

Update your volume configuration accordingly.

== Another volume operation was interrupted ==

Using the volume configuration, register the new set of bricks and try to mount the volume. The mount should be successful.
Check the status of your LV:

# volume.reiser4 /mnt

If the volume is unbalanced, then complete balancing manually by running

# volume.reiser4 -b /mnt

= LV monitoring =

Common info about the LV mounted at /mnt:

# volume.reiser4 /mnt

ID: Volume UUID
volume: ID of the plugin managing the volume
distribution: ID of the distribution plugin
stripe: Stripe size in bytes
segments: Number of hash space segments (for distribution)
bricks total: Total number of bricks in the volume
bricks in DSA: Number of bricks participating in data distribution
balanced: Balanced status of the volume

Info about any of its bricks, of index J:

# volume.reiser4 -p J /mnt

internal ID: Brick's "internal ID" and its status in the volume
external ID: Brick's UUID
device name: Name of the block device associated with the brick
block count: Size of the block device in blocks
blocks used: Total number of occupied blocks on the device
system blocks: Minimal possible number of busy blocks on that device
data capacity: Abstract capacity of the brick
space usage: Portion of occupied blocks on the device

Comment. When retrieving a brick's info, make sure that no volume operations on that volume are in progress. Otherwise the command above will return an error (EBUSY).

WARNING. Brick info retrieved this way is not necessarily the most recent. To get up-to-date info, run sync(1) and make sure that no regular file operations are in progress.

= Checking free space =

To check the number of available free blocks on a volume mounted at /mnt, make sure that no regular file operations, as well as no volume operations, are in progress on that volume, then run

# sync
# df --block-size=4K /mnt

To check the number of free blocks on the brick of index J, run

# volume.reiser4 -p J /mnt

then calculate the difference between "block count" and "blocks used".

Comment. Not all free blocks on a brick/volume are available for use.
The number of available free blocks is always ~95% of the total number of free blocks (Reiser4 reserves 5% to make sure that regular file truncate operations won't fail). NOTE: volume.reiser4 shows the total number of free blocks, whereas df(1) shows the number of available free blocks. The "space usage" statistic shows the portion of busy blocks on an individual brick. For the reasons explained above, "space usage" on any brick can not exceed 0.95.

= Checking quality of data distribution =

Quality of data distribution is a measure of the deviation of the real data space usage from the ideal one defined by the volume partitioning. The smaller the deviation, the better the distribution quality. Checking quality of distribution makes sense only when your volume partitioning is space-based, or coincides with the space-based one. If your partitioning is throughput-based, and it doesn't coincide with the space-based one, then the quality of the actual data distribution can be rather bad: in this case the file system takes care that low-performance devices don't become a bottleneck, and effective space usage is not a high priority. Checking quality of data distribution is based on the free blocks accounting provided by the file system. Note that the file system doesn't count busy data and meta-data blocks separately, so you are not able to find the real data space usage, and hence to check the quality of distribution, in the case when the meta-data brick contains data blocks.

To check quality of distribution:

* make sure that the meta-data brick doesn't contain data blocks;
* make sure that no regular file operations and no volume operations are currently in progress;
* find the "blocks used", "system blocks" and "data capacity" statistics for each data brick:

# sync
# volume.reiser4 -p 1 /mnt
...
# volume.reiser4 -p N /mnt

* find the real data space usage on each brick;
* calculate the partitioning and the ideal data space usage on each data brick;
* find the deviation of the real usage from the ideal one.

Example.
Let's build an LV of 3 bricks (one 10G meta-data brick vdb1, and two data bricks: vdc1 (10G), vdd1 (5G)) with space-based partitioning:

# VOL_ID=`uuid -v4`
# echo "Using uuid $VOL_ID"
# mkfs.reiser4 -U $VOL_ID -y -t 256K /dev/vdb1
# mkfs.reiser4 -U $VOL_ID -y -a -t 256K /dev/vdc1
# mkfs.reiser4 -U $VOL_ID -y -a -t 256K /dev/vdd1
# mount /dev/vdb1 /mnt

Fill the meta-data brick with data:

# dd if=/dev/zero of=/mnt/myfile bs=256K
No space left on device...

Add data bricks /dev/vdc1 and /dev/vdd1 to the volume:

# volume.reiser4 -a /dev/vdc1 /mnt
# volume.reiser4 -a /dev/vdd1 /mnt

Move all data blocks to the newly added bricks:

# volume.reiser4 -r /dev/vdb1 /mnt
# sync

Now the meta-data brick doesn't contain data blocks (only meta-data ones), so we can calculate the quality of data distribution:

# volume.reiser4 /mnt -p0
blocks used: 503
# volume.reiser4 /mnt -p1
blocks used: 1657203
system blocks: 115
data capacity: 2621069
# volume.reiser4 /mnt -p2
blocks used: 833001
system blocks: 73
data capacity: 1310391

Based on the statistics above, calculate the quality of distribution.

Total data capacity of the volume: 2621069 + 1310391 = 3931460
Relative capacities of the data bricks:
C1 = 2621069 / (2621069 + 1310391) = 0.6667
C2 = 1310391 / (2621069 + 1310391) = 0.3333
Real space usage on the data bricks (blocks used - system blocks):
R1 = 1657203 - 115 = 1657088
R2 = 833001 - 73 = 832928
Real space usage on the volume:
T = R1 + R2 = 1657088 + 832928 = 2490016
Ideal data space usage on the data bricks:
I1 = C1 * T = 0.6667 * 2490016 = 1660094
I2 = C2 * T = 0.3333 * 2490016 = 829922
Deviation:
(D1, D2) = (R1, R2) - (I1, I2) = (-3006, 3006)
Relative deviation:
(D1, D2)/T = (-0.0012, 0.0012)
Quality of distribution:
Q = 1 - max(|D1|, |D2|)/T = 1 - 0.0012 = 0.9988

Comment. For any specified number of bricks N and quality of distribution Q it is possible to find a configuration of a logical volume composed of N bricks, such that the quality of distribution on that volume is better than Q.
Quality of distribution Q doesn't depend on the number of bricks in the logical volume. This is a theorem, which can be strictly proven. 21a099c0166e43bcb47f20981ae5213619382409 4331 4330 2019-12-31T11:34:51Z Edward 4 Improve formatting Logical volume (LV) can be composed of any number of block devices, different in physical and geometric parameters. However the optimal configuration (true parallelism) imposes some restrictions and dependencies on the size of such devices. WARNING: The stuff is not stable. Don't put important data to logical volumes managed by software of release number 5.X.Y IMPORTANT: Currently there is no tools to manage Reiser5 logical volumes off-line, so it it strongly recommended to save/update configurations of your LV in a file, which doesn't belong to that volume. = Basic definitions. Volume configuration. Brick's capacity. Partitioning. Fair distribution. Balancing = Basic configuration of a logical volume is the following information: * Volume UUID; * Number of bricks in the volume; * List of brick names or UUIDs in the volume; * UUID or name of the brick to be added/removed (if any). That brick is not counted in (2) and (3). For each volume its configuration should be stored somewhere (but not on that volume!) and properly updated before and after each volume operation performed on that volume. We make the user responsible for this. Volume configuration is needed to facilitate deploying a volume. Abstract capacity (or simply capacity) of a brick is a positive integer number. Capacity is a brick's property defined by user. Don't confuse it with the size of block device. Think of it as of brick's "weight" in some units. And this is the user, who decides, which property of the brick to assign as its abstract capacity and in which units. In particular, it can be size of the block device in kilobytes, or its size in megabytes, or its throughput in M/sec, or other geometric or physical parameter of the device, associated with the brick. 
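The arithmetic of the worked example above can be automated. Below is a minimal Python sketch (an illustration, not part of reiser4progs); the per-brick numbers are the ones reported by volume.reiser4 in the example:

```python
# Compute quality of data distribution Q from per-brick statistics
# ("blocks used", "system blocks", "data capacity") as reported by
# "volume.reiser4 -p J /mnt" for the data bricks.

def distribution_quality(bricks):
    """bricks: list of (blocks_used, system_blocks, data_capacity) tuples."""
    total_capacity = sum(cap for _, _, cap in bricks)
    # relative capacities C_j (the volume partitioning)
    C = [cap / total_capacity for _, _, cap in bricks]
    # real data space usage R_j = blocks used - system blocks
    R = [used - sys for used, sys, _ in bricks]
    T = sum(R)  # total number of data blocks on the volume
    # relative deviation of real usage from ideal usage I_j = C_j * T
    deviation = [abs(r - c * T) / T for r, c in zip(R, C)]
    return 1 - max(deviation)

# numbers from the worked example above (data bricks of index 1 and 2)
q = distribution_quality([(1657203, 115, 2621069), (833001, 73, 1310391)])
print(round(q, 4))  # prints 0.9988
```

Note that the sketch uses the exact relative capacities instead of the rounded values 0.6667/0.3333 from the example, so the intermediate deviation differs slightly, while Q still rounds to 0.9988.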
It is important that capacities of all bricks of the same logical volume are measured in the same units. Also, it would be utterly pointless to assign different properties as abstract capacities for bricks of the same LV. For example, size of block device for one brick, and disk bandwidth for another one. Capacity of each brick gets initialized by mkfs utility. By default it is calculated as number of free blocks on the device at the very end of the formatting procedure. For meta-data brick it is calculated as 70% of such amount. Capacity of any brick can be changed on-line by user. Capacity of a logical volume is defined as a sum of capacities of its bricks-components. Relative capacity of a brick is the ratio of brick's capacity to volume's capacity. Relative capacity defines a portion of IO-requests that will be issued against that brick. Array of relative capacities (C1, C2, ...) of all bricks is called volume partitioning. Obviously, C1 + C2 + ... = 1. (Real) data space usage on a brick is number of data blocks, stored on that brick. Ideal (or expected) data space usage on a brick is T*C, where T is total number of data blocks stored in the volume. C is relative capacity of the brick. It is recommended to compose volumes in the way so that space-based partitioning coincides with throughput-based one - it would be the optimal volume configuration, which provides true parallelism. If it is impossible for some reason, then choose a preferred partitioning method (space-based, or throughput-based). Note that space-based partitioning saves volume space, whereas throughput based one saves volume throughput. When performing regular file operations, Reiser5 distributes data stripes throughout the volume evenly and fairly. It means that portion of IO-requests issued against each brick is equal to its relative capacity, that is, to the portion of capacity that the brick adds to the total volume's capacity. 
Most volume operations are accompanied by rebalancing, which keeps fairness of distribution. For example, adding a brick to a logical volume changes its partitioning, and hence, breaks fairness of the distribution, so we need to move some data stripes to the new brick to make distribution fair. Also you can not simply remove a brick from a logical volume - all data stripes should be moved from that brick to other bricks of the logical volume. Every time when user performs a volume operation, Reiser5 marks LV as "not balanced". After successful balancing the status of LV is changed to "balanced". If balancing procedure fails for some reasons, it should be resumed manually (with volume.reiser4 utility). It is allowed to perform regular file operations on not balanced LV. However, in this case: a) we don't guarantee a good quality of data distribution on your LV. b) you won't be able to perform volume operations on your LV except balancing - any other volume operation will return error (EBUSY). So, don't forget to bring your LV to the balanced state as soon as possible! = Prepare Software and Hardware = Build, install and boot kernel with Reiser4 of software framework release number 5.X.Y. Kernel patches can be found here: https://sourceforge.net/projects/reiser4/files/v5-unstable/ Note that by Linux kernel and GNU utilities the testing stuff is still recognized as "Reiser4". Make sure there is the following message in kernel logs: "Loading Reiser4 (Software Framework Release: 5.X.Y)" Build and install the latest libaal library: https://sourceforge.net/projects/reiser4/files/reiser4-utils/libaal/ Download, build and install the latest version 2.A.B of Reiser4progs package: https://sourceforge.net/projects/reiser4/files/v5-unstable/ Make sure that utility for managing logical volumes is installed (as a part of reiser4progs package) on your machine: # volume.reiser4 -? = Creating a logical volume = Start from choosing a unique ID (uuid) of your volume. 
By default it is generated by mkfs utility. However, user can generate it himself by proper tools (e.g. uuid(1)) and store in an environment variable for convenience: # VOL_ID=`uuid -v4` # echo "Using uuid $VOL_ID" Choose a stripe size for your logical volume. For a good quality of distribution it is recommended that stripe doesn't exceed 1/10000 of volume size. On the other hand, too small stripes will increase space consumption on your meta-data brick. In our example we choose stripe size 256K. Start from creating the first brick of your volume - meta-data brick, passing volume-ID and stripe size to mkfs.reiser4 utility: # mkfs.reiser4 -U $VOL_ID -t 256K /dev/vdb1 Currently only one meta-data brick per volume is supported, so it is recommended that size of block device for meta-data brick in not too small. In most cases it will be enough, if your meta-data brick is not smaller than 1/200 of maximal volume size. For example, 100G meta-data brick will be able to service ~20T logical volume. Mount your logical volume consisting of one meta-data brick: # mount /dev/vdb1 /mnt Find a record about your volume in the output of the following command: # volume.reiser4 -l Create configuration of your logical volume (its definition is above) and store it somewhere, but not on that volume! Your logical volume is now on-line and ready to use. You can perform regular file operations and volume operations (e.g. add a data brick to your LV). = Adding a data brick to LV = At any time you are able to add a data brick to your LV. You can do it in parallel with regular file operations executing on this volume. Make sure, however, that there is no other volume operations (e.g. removing a brick) over your volume in progress, otherwise your operation will fail with EBUSY. Obviously, adding a brick will increase capacity of your volume. Choose a block device for the new data brick. Make sure that it is not too large, or too small. 
Capacities of any 2 bricks of the same logical volume can not differ more than 2^19 (~1 million) times. E.g. your logical volume can not contain both, 1M and 2T bricks. Any attempts to add a brick of improper capacity will fail with error. Format it by the same way as meta-data brick, but specify also "-a" option (to let mkfs know that it is data brick). # mkfs.reiser4 -U $VOL_ID -t 256K -a /dev/vdb2 Important: make sure you specified the same volume ID and stripe size as other bricks of the logical volume do have. Otherwise, operation of adding a data brick will fail. Update configuration of your volume with UUID or name of the brick you want to add (item #4). To add a brick simply pass its name as an argument for the option "-a" and specify your LV via its mount point: # volume.reiser4 -a /dev/vdb2 /mnt The procedure of adding a brick automatically invokes re-balancing, which moves a portion of data stripes to the newly added brick (so that the resulted distribution will fair). Portion of data blocks moved during such rebalancing is equal to the relative capacity of the new brick, that is to the portion of capacity that the new brick adds to updated LV's capacity. This important property defines the cost of balancing procedure. If the portion of capacity added by a brick is small, then number of stripes moved during balancing is also small. Like other user-space utilities, the operation of adding a brick can return error, even in the assumption that the brick you wanted to add is properly formatted. In this case check the status of your LV: # volume.reiser4 /mnt If the volume is unbalanced, then simply complete balancing manually: # volume.reiser4 -b /mnt Otherwise, check number of bricks in your LV. Most likely that it is the same as it was before the failed operation. In this case simply repeat the operation of adding a brick from scratch. Upon successful completion update your volume configuration. 
That is, increment (#2), add info about the new brick to (#3) and remove records at (#4). = Removing a data brick from LV = At any time you are able to remove a data brick from your LV. You can do it in parallel with regular file operations executing on this volume. Make sure, however, that there is no other volume operations (e.g. adding a brick) over your volume in progress, otherwise your operation will fail with EBUSY. Obviously, removing a brick will decrease abstract capacity of your LV. Note that other bricks should have enough space to store all data blocks of the brick you want to remove, otherwise, the removal operation will return error (ENOSPC). Suppose you want to remove brick /dev/vdb2 from your LV mounted at /mnt. Update your volume configuration with the UUID and name of the brick you want to remove (#item #4). To remove a brick simply pass its name as an argument for option "-r" and specify the logical volume by its mount point: # volume.reiser4 -r /dev/vdb2 /mnt The procedure of brick removal automatically invokes re-balancing, which distributes data of the brick to be removed among other bricks, so that resulted distribution is also fair. Portion of data stripes moved during such rebalancing is equal to the relative capacity of the brick to be removed (that it to the portion of capacity that the brick added to LV's capacity). It can happen, that the command above completes with error (like other user-space applications). In this case check the status of your LV: # volume.reiser4 /mnt If volume is not balanced, then simply complete balancing manually: # volume.reiser4 -b /mnt Otherwise, check the number of the bricks in your logical volume - it should be the same as before the failed operation. The error -ENOSPC indicates that free space on other bricks is not enough to fit all the data of the brick you want to remove. On success update your volume configuration: remove information about the removed brick at #3 and #4. 
= Changing brick's capacity At any time (in the assumption that no other volume operation is in progress) you can change abstract capacity of any brick to some new value, different from 0. Changing capacity always changes volume partitioning, and therefore, breaks fairness of distribution, so Reiser5 automatically launches rebalancing to make sure that resulted distribution is fair for the new set of capacities. In particular, increasing bricks capacity will move some data from other bricks to the brick, whose capacity was increased. Decreasing bricks capacity will move some data from the brick, whose capacity was decreased, to other bricks. To change abstract capacity of a brick /dev/vdb1 to a new value (e.g. 200000), simply run # volume.reiser4 -z /dev/vdb1 -c 200000 /mnt Pronounced as "resize brick /dev/vdb1 to new capacity 200000 in volume mounted at /mnt". The operation of changing capacity can return error. Most likely, it is -ENOSPC, which is a side effect of concurrent regular file writes. In this case check the status of your LV. If it is unbalanced, then consider removing some files from your LV and complete balancing by running # volume.reiser4 -b /mnt Otherwise, repeat the operation from scratch. Comment. Changing bricks capacity to 0 is undefined and will return error. Consider brick removal operation instead. = Operations with meta-data brick = Meta-data brick can also contain data stripes and participate in data distribution like other data bricks. So that all the volume operations described above are also applicable to meta-data brick. Note, however, that it is impossible to completely remove meta-data brick from the logical volume for obvious reasons (meta-data need to be stored somewhere), so brick removal operation applied to the meta-data brick actually removes it from Data-Storage Array (DSA), not from the logical volume. DSA is a subset of LV consisting of bricks, participating in data distribution. 
Once you remove meta-data brick from DSA, that brick will be used only to store meta-data. Operation of adding a brick, being applied to a meta-data brick, returns the last one back to DSA. Important: Reiser5 doesn't count busy data and meta-data blocks separately. So in contrast with data bricks (which contain only data) you are not able to find out real space occupied by data blocks on the meta-data brick - Reiser5 knows only total space occupied. To check the status of meta-data brick simply run # volume.reiser4 /mnt and compare values of "bricks total" and "bricks in DSA". If they are equal, then meta-data brick participates in data distribution. Otherwise, "bricks total" should be 1 more than "bricks in DSA" - it indicates that meta-data brick doesn't participate in data distribution (and therefore, doesn't contain data blocks). Note that other cases are impossible: for data bricks participation in LV and DSA is always equivalent. = Unmounting a logical volume = To terminate a mount session just issue usual umount command with the mount point specified. Note that after unmounting the volume all bricks by default remain to be registered in the system till system shutdown. If you want to unregister a brick before system shutdown, then simply issue the following command: # volume.reiser4 -u BRICK_NAME = Deploying a logical volume after correct unmount = Make sure (by checking your volume configuration) that all bricks of the volume are registered in the system. The list of all volumes and bricks registered in the system can be found in the output of the following command: # volume.reisrer4 -l Issue usual mount command against one of the bricks of your volume. It is recommended to specify meta-data brick in the mount command. If not all bricks of the volume are registered, then attempts to mount such volume will fail with a respective kernel message. 
NOTE: Reiser5 will refuse to mount a logical volume, in the case, when a wrong set of bricks is registered in the system. It can happen due to careless handling of off-line volumes, leading to the appearance of "artifacts" in the list of registered bricks. If you want to re-format a brick, make sure it is unregistered. = Deploying a logical volume after correct shutdown = To be able to mount your LV make sure that all its bricks (data and meta-data) are registered in the system. If not all bricks of the volume are registered, then attempts to mount such volume will fail with a respective kernel message. For this reasons we strongly recommend for user to keep a track of his LV - store its configuration somewhere, but not in this volume! And don't forget to update that configuration after _every_ volume operation. If you lost configuration of your LV and don't remember it (wich is most likely for large volumes), then it will be rather painful to restore it: currently there is no tools for to manage offline logical volumes. So that, users are prompted to do this on their own. It is not at all difficult. To register a brick in the system use the following command: # volume.reiser4 -g BRICK_NAME To print a list of all registered bricks use # volume.reiser4 -l To mount your LV just issue a mount command for any one brick of your LV. Comment. Reiser5 always tries to register the brick which is passed to the mount command as an argument, so it is not needed to preregister bricks you want to issue a mount command against. = Deploying a logical volume after hard reset or system crash = If no volume operations were interrupted by hard reset or system crash, then just follow the instructions in section 9. In Reiser5 only restricted number of bricks participate in every transaction. Maximal number of such bricks can be specified by user. At mount time a transaction replay procedure will be launched on each such brick independently in parallel. 
Depending on a kind of interrupted volume operation, perform one of the following actions: a. Adding a brick was interrupted. Check your volume configuration. Register the old set of bricks (that is, the set of brick that the volume had before applying the operation) and try to mount. In the case of error register also the brick you wanted to add and try to mount again. Check the status of your LV by running # volume.reiser4 /mnt In the volume is unbalanced, then complete balancing manually by running # volume.reiser4 -b /mnt Check "bricks total" of your LV in the output of # volume.reiser4 /mnt Compare it with the old number of bricks in the configuration. The new value should be an increment of the old one. If the number of bricks is the same, then your operation of adding a brick was completely rolled back by the transaction manager, so that you need to repeat it from scratch. Otherwise, your operation was successfully completed - update your volume configuration respectively. b. Brick removal was interrupted. Check your volume configuration. Register the old set of bricks (that is, the set of brick that volume had before applying the interrupted operation) except the brick you wanted to remove. Try to mount the volume. In the case of error register also the brick you wanted to remove and try to mount again. Check the status of your LV: # volume.reiser4 /mnt If the volume is unbalanced then complete balancing manually by running # volume.reiser4 -b /mnt Comment. After sucessful balancing completion the brick will be automatically removed form the volume. Make sure of it by checking status of your LV: # volume.reiser4 /mnt Update your volume configuration respectively. c. Another volume operation was interrupted Using the volume configuration, register the new set of bricks and try to mount the volume. The mount should be successful. 
Check the status of your LV: # volume.reiser4 /mnt If the volume is unbalanced then complete balancing manually by running # volume.reiser4 -b /mnt = LV monitoring = Common info about LV mounted at /mnt # volume.reiser4 /mnt ID: Volume UUID volume: ID of plugin managing the volume distribution: ID of distribution plugin stripe: Stripe size in bytes segments: Number of hash space segments (for distribution) bricks total: Total number of bricks in the volume bricks in DSA: Number of bricks participating in data distribution balanced: Balanced status of the volume Info about any its brick of index J # volume.reiser4 -p J /mnt internal ID: Brick's "internal ID" and its status in the volume external ID: Brick's UUID device name: Name of the block device associated with the brick block count: Size of the block device in blocks blocks used: Total number of occupied blocks on the device system blocks: Minimal possible number of busy blocks on that device data capacity: Abstract capacity of the brick space usage: Portion of occupied blocks on the device Comment. When retrieving brick's info make sure that no volume operations over that volume are in progress. Otherwise the command above will return error (EBUSY). WARNING. Bricks info provided by such way is not necessarily the most recent one. To get an actual info run sync(1) and make sure that no regular file operations are in progress. = Checking free space = To check number of available free blocks on a volume mounted at /mnt, make sure that no regular file operations, as well as volume operations, are in progress on that volume, then run # sync # df --block-size=4K /mnt To check number of free blocks on the brick of index J run # volume.reiser4 -p J /mnt Then calculate the difference between block count and blocks used Comment. Not all free blocks on a brick/volume are available for use. 
Number of available free blocks is always ~95% of total number of free blocks (Reiser4 reserves 5% to make sure that regular file truncate operations won't fail). NOTE: volume.reiser4 shows total number of free blocks, whereas df(1) shows number of available free blocks. "Space usage" statistics shows a portion of busy blocks on individual brick. For the reasons explained above "space usage" on any brick can not be more than 0.95 = Checking quality of data distribution = Quality of data distribution is a measure of deviation of the real data space usage from the ideal one defined by volume partitioning. The smaller the deviation, the better the distribution quality. Checking quality of distribution makes sense only in the case when your volume partitioning is space-based, or if it coincides with the space-based one. If your partitioning is throughput-based, and it doesn't coincide with the space-based one, then quality of actual data distribution can be rather bad, as in this case the file system is worried for low-performance devices to not become a bottleneck, and effective space usage in this case is not a high priority. Checking quality of data distribution is based on the free blocks accounting, provided by the file system. Note that file system doesn't count busy data and meta-data blocks separately, so you are not able to find real data space usage, and hence to check quality of distribution in the case when meta-data brick contains data blocks. To check quality of distribution * make sure that meta-data brick doesn't contain data blocks; * make sure that no regular file and volume operations are currently in progress; * find "blocks used", "system blocks" and "data capacity" statistics for each data brick: # sync # volume.reiser4 -p 1 /mnt ... # volume.reiser4 -p N /mnt * find real data space usage on each brick; * calculate partitioning and ideal data space usage on each data brick; * find deviation of (4) from (5). Example. 
Let' build a LV of 3 bricks (one 10G meta-data brick sdb1, and two data bricks: sdc1 (10G), sdd1(5G)) with space-based partitioning: # VOL_ID=`uuid -v4` # echo "Using uuid $VOL_ID" # mkfs.reiser4 -U $VOL_ID -y -t 256K /dev/vdb1 # mkfs.reiser4 -U $VOL_ID -y -a -t 256K /dev/vdc1 # mkfs.reiser4 -U $VOL_ID -y -a -t 256K /dev/vdd1 # mount /dev/vdb1 /mnt Fill the meta-data brick with data: # dd if=/dev/zero of=/mnt/myfile bs=256K No space left on device... Add data-bricks /dev/sdc1 and dev/sdd1 to the volume: # volume.reiser4 -a /dev/vdc1 /mnt # volume.reiser4 -a /dev/vdd1 /mnt Move all data blocks to the newly added bricks: # volume.reiser4 -r /dev/vdb1 /mnt # sync Now meta-data brick doesn't contain data blocks (only meta-data ones), so that we can calculate quality of data distribution # volume.reiser4 /mnt -p0 blocks used: 503 # volume.reiser4 /mnt -p1 blocks used: 1657203 system blocks: 115 data capacity: 2621069 # volume.reiser4 /mnt -p2 blocks used: 833001 system blocks: 73 data capacity: 1310391 Basing on the statistics above calculate quality of distribution. Total data capacity of the volume: 2621069 + 1310391 = 3931460 Relative capacities of data bricks: C1 = 2621069 /(2621069 + 1310391) = 0.6667 C2 = 1310464 /(2621069 + 1310391) = 0.3333 Real space usage on data bricks (blocks used - system blocks): R1 = 1657203 - 115 = 1657088 R2 = 833001 - 73 = 832928 Real space usage on the volume: R = R1 + R2 = 1657088 + 832928 = 2490016 Ideal data space usage on data bricks: I1 = C1 * T = 0.6667 * 2490016 = 1660094 I2 = C2 * T = 0.3333 * 2490016 = 829922 Deviation: D = (R1, R2) - (I1, I2) = (3006, -3006) Relative deviation: D/R = (-0.0012, 0.0012) Quality of distribution: Q = 1 - max(|D1|, |D1|) = 1 - 0.0012 = 0.9988 Comment. For any specified number of bricks N and quality of distribution Q it is possible to find a configuration of a logical volume composed of N bricks, so that quality of distribution on that volume will be better than Q. Comment. 
Quality of distribution Q doesn't depend on the number of bricks in the logical volume. This is a theorem, which can be strictly proven. 0f04f7151b79c30995f59495740dd9d4985d49b4 4330 4329 2019-12-31T11:18:02Z Edward 4 Improve formatting Logical volume (LV) can be composed of any number of block devices, different in physical and geometric parameters. However the optimal configuration (true parallelism) imposes some restrictions and dependencies on the size of such devices. WARNING: The stuff is not stable. Don't put important data to logical volumes managed by software of release number 5.X.Y IMPORTANT: Currently there is no tools to manage Reiser5 logical volumes off-line, so it it strongly recommended to save/update configurations of your LV in a file, which doesn't belong to that volume. = Basic definitions. Volume configuration. Brick's capacity. Partitioning. Fair distribution. Balancing = Basic configuration of a logical volume is the following information: * Volume UUID; * Number of bricks in the volume; * List of brick names or UUIDs in the volume; * UUID or name of the brick to be added/removed (if any). That brick is not counted in (2) and (3). For each volume its configuration should be stored somewhere (but not on that volume!) and properly updated before and after each volume operation performed on that volume. We make the user responsible for this. Volume configuration is needed to facilitate deploying a volume. Abstract capacity (or simply capacity) of a brick is a positive integer number. Capacity is a brick's property defined by user. Don't confuse it with the size of block device. Think of it as of brick's "weight" in some units. And this is the user, who decides, which property of the brick to assign as its abstract capacity and in which units. In particular, it can be size of the block device in kilobytes, or its size in megabytes, or its throughput in M/sec, or other geometric or physical parameter of the device, associated with the brick. 
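The arithmetic of the example above can be scripted. A minimal sketch in Python (the function name is illustrative, not part of any reiser4 tool; the numbers are the brick statistics from the example, and exact fractions are used instead of the rounded capacities, so intermediate values differ slightly from the hand calculation):

```python
# Quality of data distribution, computed from per-brick statistics
# as reported by volume.reiser4 -p for the data bricks only.

def distribution_quality(blocks_used, system_blocks, data_capacity):
    # Real data space usage per brick: blocks used minus system blocks.
    real = [u - s for u, s in zip(blocks_used, system_blocks)]
    total = sum(real)                       # T: all data blocks in the volume
    cap_total = sum(data_capacity)
    # Relative capacities C_i (the volume partitioning); they sum to 1.
    rel_cap = [c / cap_total for c in data_capacity]
    # Ideal usage I_i = C_i * T and deviation D_i = R_i - I_i.
    ideal = [c * total for c in rel_cap]
    deviation = [r - i for r, i in zip(real, ideal)]
    # Q = 1 - max_i |D_i| / T
    return 1 - max(abs(d) for d in deviation) / total

q = distribution_quality(
    blocks_used=[1657203, 833001],
    system_blocks=[115, 73],
    data_capacity=[2621069, 1310391],
)
print(round(q, 4))  # prints 0.9988
```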
It is important that capacities of all bricks of the same logical volume are measured in the same units. Also, it would be utterly pointless to assign different properties as abstract capacities for bricks of the same LV. For example, size of block device for one brick, and disk bandwidth for another one. Capacity of each brick gets initialized by mkfs utility. By default it is calculated as number of free blocks on the device at the very end of the formatting procedure. For meta-data brick it is calculated as 70% of such amount. Capacity of any brick can be changed on-line by user. Capacity of a logical volume is defined as a sum of capacities of its bricks-components. Relative capacity of a brick is the ratio of brick's capacity to volume's capacity. Relative capacity defines a portion of IO-requests that will be issued against that brick. Array of relative capacities (C1, C2, ...) of all bricks is called volume partitioning. Obviously, C1 + C2 + ... = 1. (Real) data space usage on a brick is number of data blocks, stored on that brick. Ideal (or expected) data space usage on a brick is T*C, where T is total number of data blocks stored in the volume. C is relative capacity of the brick. It is recommended to compose volumes in the way so that space-based partitioning coincides with throughput-based one - it would be the optimal volume configuration, which provides true parallelism. If it is impossible for some reason, then choose a preferred partitioning method (space-based, or throughput-based). Note that space-based partitioning saves volume space, whereas throughput based one saves volume throughput. When performing regular file operations, Reiser5 distributes data stripes throughout the volume evenly and fairly. It means that portion of IO-requests issued against each brick is equal to its relative capacity, that is, to the portion of capacity that the brick adds to the total volume's capacity. 
Most volume operations are accompanied by rebalancing, which keeps fairness of distribution. For example, adding a brick to a logical volume changes its partitioning, and hence, breaks fairness of the distribution, so we need to move some data stripes to the new brick to make distribution fair. Also you can not simply remove a brick from a logical volume - all data stripes should be moved from that brick to other bricks of the logical volume. Every time when user performs a volume operation, Reiser5 marks LV as "not balanced". After successful balancing the status of LV is changed to "balanced". If balancing procedure fails for some reasons, it should be resumed manually (with volume.reiser4 utility). It is allowed to perform regular file operations on not balanced LV. However, in this case: a) we don't guarantee a good quality of data distribution on your LV. b) you won't be able to perform volume operations on your LV except balancing - any other volume operation will return error (EBUSY). So, don't forget to bring your LV to the balanced state as soon as possible! 2. Prepare Software and Hardware Build, install and boot kernel with Reiser4 of software framework release number 5.X.Y. Kernel patches can be found here: https://sourceforge.net/projects/reiser4/files/v5-unstable/ Note that by Linux kernel and GNU utilities the testing stuff is still recognized as "Reiser4". Make sure there is the following message in kernel logs: "Loading Reiser4 (Software Framework Release: 5.X.Y)" Build and install the latest libaal library: https://sourceforge.net/projects/reiser4/files/reiser4-utils/libaal/ Download, build and install the latest version 2.A.B of Reiser4progs package: https://sourceforge.net/projects/reiser4/files/v5-unstable/ Make sure that utility for managing logical volumes is installed (as a part of reiser4progs package) on your machine: # volume.reiser4 -? 3. Creating a logical volume Start from choosing a unique ID (uuid) of your volume. 
By default it is generated by mkfs utility. However, user can generate it himself by proper tools (e.g. uuid(1)) and store in an environment variable for convenience: # VOL_ID=`uuid -v4` # echo "Using uuid $VOL_ID" Choose a stripe size for your logical volume. For a good quality of distribution it is recommended that stripe doesn't exceed 1/10000 of volume size. On the other hand, too small stripes will increase space consumption on your meta-data brick. In our example we choose stripe size 256K. Start from creating the first brick of your volume - meta-data brick, passing volume-ID and stripe size to mkfs.reiser4 utility: # mkfs.reiser4 -U $VOL_ID -t 256K /dev/vdb1 Currently only one meta-data brick per volume is supported, so it is recommended that size of block device for meta-data brick in not too small. In most cases it will be enough, if your meta-data brick is not smaller than 1/200 of maximal volume size. For example, 100G meta-data brick will be able to service ~20T logical volume. Mount your logical volume consisting of one meta-data brick: # mount /dev/vdb1 /mnt Find a record about your volume in the output of the following command: # volume.reiser4 -l Create configuration of your logical volume (its definition is above) and store it somewhere, but not on that volume! Your logical volume is now on-line and ready to use. You can perform regular file operations and volume operations (e.g. add a data brick to your LV). 4. Adding a data brick to LV. At any time you are able to add a data brick to your LV. You can do it in parallel with regular file operations executing on this volume. Make sure, however, that there is no other volume operations (e.g. removing a brick) over your volume in progress, otherwise your operation will fail with EBUSY. Obviously, adding a brick will increase capacity of your volume. Choose a block device for the new data brick. Make sure that it is not too large, or too small. 
Capacities of any 2 bricks of the same logical volume can not differ more than 2^19 (~1 million) times. E.g. your logical volume can not contain both, 1M and 2T bricks. Any attempts to add a brick of improper capacity will fail with error. Format it by the same way as meta-data brick, but specify also "-a" option (to let mkfs know that it is data brick). # mkfs.reiser4 -U $VOL_ID -t 256K -a /dev/vdb2 Important: make sure you specified the same volume ID and stripe size as other bricks of the logical volume do have. Otherwise, operation of adding a data brick will fail. Update configuration of your volume with UUID or name of the brick you want to add (item #4). To add a brick simply pass its name as an argument for the option "-a" and specify your LV via its mount point: # volume.reiser4 -a /dev/vdb2 /mnt The procedure of adding a brick automatically invokes re-balancing, which moves a portion of data stripes to the newly added brick (so that the resulted distribution will fair). Portion of data blocks moved during such rebalancing is equal to the relative capacity of the new brick, that is to the portion of capacity that the new brick adds to updated LV's capacity. This important property defines the cost of balancing procedure. If the portion of capacity added by a brick is small, then number of stripes moved during balancing is also small. Like other user-space utilities, the operation of adding a brick can return error, even in the assumption that the brick you wanted to add is properly formatted. In this case check the status of your LV: # volume.reiser4 /mnt If the volume is unbalanced, then simply complete balancing manually: # volume.reiser4 -b /mnt Otherwise, check number of bricks in your LV. Most likely that it is the same as it was before the failed operation. In this case simply repeat the operation of adding a brick from scratch. Upon successful completion update your volume configuration. 
That is, increment (#2), add info about the new brick to (#3) and remove records at (#4). 5. Removing a data brick from LV At any time you are able to remove a data brick from your LV. You can do it in parallel with regular file operations executing on this volume. Make sure, however, that there is no other volume operations (e.g. adding a brick) over your volume in progress, otherwise your operation will fail with EBUSY. Obviously, removing a brick will decrease abstract capacity of your LV. Note that other bricks should have enough space to store all data blocks of the brick you want to remove, otherwise, the removal operation will return error (ENOSPC). Suppose you want to remove brick /dev/vdb2 from your LV mounted at /mnt. Update your volume configuration with the UUID and name of the brick you want to remove (#item #4). To remove a brick simply pass its name as an argument for option "-r" and specify the logical volume by its mount point: # volume.reiser4 -r /dev/vdb2 /mnt The procedure of brick removal automatically invokes re-balancing, which distributes data of the brick to be removed among other bricks, so that resulted distribution is also fair. Portion of data stripes moved during such rebalancing is equal to the relative capacity of the brick to be removed (that it to the portion of capacity that the brick added to LV's capacity). It can happen, that the command above completes with error (like other user-space applications). In this case check the status of your LV: # volume.reiser4 /mnt If volume is not balanced, then simply complete balancing manually: # volume.reiser4 -b /mnt Otherwise, check the number of the bricks in your logical volume - it should be the same as before the failed operation. The error -ENOSPC indicates that free space on other bricks is not enough to fit all the data of the brick you want to remove. On success update your volume configuration: remove information about the removed brick at #3 and #4. 6. 
Changing brick's capacity At any time (in the assumption that no other volume operation is in progress) you can change abstract capacity of any brick to some new value, different from 0. Changing capacity always changes volume partitioning, and therefore, breaks fairness of distribution, so Reiser5 automatically launches rebalancing to make sure that resulted distribution is fair for the new set of capacities. In particular, increasing bricks capacity will move some data from other bricks to the brick, whose capacity was increased. Decreasing bricks capacity will move some data from the brick, whose capacity was decreased, to other bricks. To change abstract capacity of a brick /dev/vdb1 to a new value (e.g. 200000), simply run # volume.reiser4 -z /dev/vdb1 -c 200000 /mnt Pronounced as "resize brick /dev/vdb1 to new capacity 200000 in volume mounted at /mnt". The operation of changing capacity can return error. Most likely, it is -ENOSPC, which is a side effect of concurrent regular file writes. In this case check the status of your LV. If it is unbalanced, then consider removing some files from your LV and complete balancing by running # volume.reiser4 -b /mnt Otherwise, repeat the operation from scratch. Comment. Changing bricks capacity to 0 is undefined and will return error. Consider brick removal operation instead. 7. Operations with meta-data brick Meta-data brick can also contain data stripes and participate in data distribution like other data bricks. So that all the volume operations described above are also applicable to meta-data brick. Note, however, that it is impossible to completely remove meta-data brick from the logical volume for obvious reasons (meta-data need to be stored somewhere), so brick removal operation applied to the meta-data brick actually removes it from Data-Storage Array (DSA), not from the logical volume. DSA is a subset of LV consisting of bricks, participating in data distribution. 
Once you remove meta-data brick from DSA, that brick will be used only to store meta-data. Operation of adding a brick, being applied to a meta-data brick, returns the last one back to DSA. Important: Reiser5 doesn't count busy data and meta-data blocks separately. So in contrast with data bricks (which contain only data) you are not able to find out real space occupied by data blocks on the meta-data brick - Reiser5 knows only total space occupied. To check the status of meta-data brick sumply run # volume.reiser4 /mnt and compare values of "bricks total" and "bricks in DSA". If they are equal, then meta-data brick participates in data distribution. Otherwise, "bricks total" should be 1 more than "bricks in DSA" - it indicates that meta-data brick doesn't participate in data distribution (and therefore, doesn't contain data blocks). Note that other cases are impossible: for data bricks participation in LV and DSA is always equivalent. 8. Unmounting a logical volume To terminate a mount session just issue usual umount command with the mount point specified. Note that after unmounting the volume all bricks by default remain to be registered in the system till system shutdown. If you want to unregister a brick before system shutdown, then simply issue the following command: # volume.reiser4 -u BRICK_NAME 9. Deploying a logical volume after correct unmount Make sure (by checking your volume configuration) that all bricks of the volume are registered in the system. The list of all volumes and bricks registered in the system can be found in the output of the following command: # volume.reisrer4 -l Issue usual mount command against one of the bricks of your volume. It is recommended to specify meta-data brick in the mount command. If not all bricks of the volume are registered, then attempts to mount such volume will fail with a respective kernel message. 
NOTE: Reiser5 will refuse to mount a logical volume, in the case, when a wrong set of bricks is registered in the system. It can happen due to careless handling of off-line volumes, leading to the appearance of "artifacts" in the list of registered bricks. If you want to re-format a brick, make sure it is unregistered. 10. Deploying a logical volume after correct shutdown To be able to mount your LV make sure that all its bricks (data and meta-data) are registered in the system. If not all bricks of the volume are registered, then attempts to mount such volume will fail with a respective kernel message. For this reasons we strongly recommend for user to keep a track of his LV - store its configuration somewhere, but not in this volume! And don't forget to update that configuration after _every_ volume operation. If you lost configuration of your LV and don't remember it (wich is most likely for large volumes), then it will be rather painful to restore it: currently there is no tools for to manage offline logical volumes. So that, users are prompted to do this on their own. It is not at all difficult. To register a brick in the system use the following command: # volume.reiser4 -g BRICK_NAME To print a list of all registered bricks use # volume.reiser4 -l To mount your LV just issue a mount command for any one brick of your LV. Comment. Reiser5 always tries to register the brick which is passed to the mount command as an argument, so it is not needed to preregister bricks you want to issue a mount command against. 11. Deploying a logical volume after hard reset or system crash If no volume operations were interrupted by hard reset or system crash, then just follow the instructions in section 9. In Reiser5 only restricted number of bricks participate in every transaction. Maximal number of such bricks can be specified by user. At mount time a transaction replay procedure will be launched on each such brick independently in parallel. 
Depending on a kind of interrupted volume operation, perform one of the following actions: a. Adding a brick was interrupted. Check your volume configuration. Register the old set of bricks (that is, the set of brick that the volume had before applying the operation) and try to mount. In the case of error register also the brick you wanted to add and try to mount again. Check the status of your LV by running # volume.reiser4 /mnt In the volume is unbalanced, then complete balancing manually by running # volume.reiser4 -b /mnt Check "bricks total" of your LV in the output of # volume.reiser4 /mnt Compare it with the old number of bricks in the configuration. The new value should be an increment of the old one. If the number of bricks is the same, then your operation of adding a brick was completely rolled back by the transaction manager, so that you need to repeat it from scratch. Otherwise, your operation was successfully completed - update your volume configuration respectively. b. Brick removal was interrupted. Check your volume configuration. Register the old set of bricks (that is, the set of brick that volume had before applying the interrupted operation) except the brick you wanted to remove. Try to mount the volume. In the case of error register also the brick you wanted to remove and try to mount again. Check the status of your LV: # volume.reiser4 /mnt If the volume is unbalanced then complete balancing manually by running # volume.reiser4 -b /mnt Comment. After sucessful balancing completion the brick will be automatically removed form the volume. Make sure of it by checking status of your LV: # volume.reiser4 /mnt Update your volume configuration respectively. c. Another volume operation was interrupted Using the volume configuration, register the new set of bricks and try to mount the volume. The mount should be successful. 
Check the status of your LV: # volume.reiser4 /mnt If the volume is unbalanced then complete balancing manually by running # volume.reiser4 -b /mnt 12. LV monitoring. Common info about LV mounted at /mnt # volume.reiser4 /mnt ID: Volume UUID volume: ID of plugin managing the volume distribution: ID of distribution plugin stripe: Stripe size in bytes segments: Number of hash space segments (for distribution) bricks total: Total number of bricks in the volume bricks in DSA: Number of bricks participating in data distribution balanced: Balanced status of the volume Info about any its brick of index J # volume.reiser4 -p J /mnt internal ID: Brick's "internal ID" and its status in the volume external ID: Brick's UUID device name: Name of the block device associated with the brick block count: Size of the block device in blocks blocks used: Total number of occupied blocks on the device system blocks: Minimal possible number of busy blocks on that device data capacity: Abstract capacity of the brick space usage: Portion of occupied blocks on the device Comment. When retrieving brick's info make sure that no volume operations over that volume are in progress. Otherwise the command above will return error (EBUSY). WARNING. Bricks info provided by such way is not necessarily the most recent one. To get an actual info run sync(1) and make sure that no regular file operations are in progress. 13. Checking free space To check number of available free blocks on a volume mounted at /mnt, make sure that no regular file operations, as well as volume operations, are in progress on that volume, then run # sync # df --block-size=4K /mnt To check number of free blocks on the brick of index J run # volume.reiser4 -p J /mnt Then calculate the difference between block count and blocks used Comment. Not all free blocks on a brick/volume are available for use. 
Number of available free blocks is always ~95% of total number of free blocks (Reiser4 reserves 5% to make sure that regular file truncate operations won't fail). NOTE: volume.reiser4 shows total number of free blocks, whereas df(1) shows number of available free blocks. "Space usage" statistics shows a portion of busy blocks on individual brick. For the reasons explained above "space usage" on any brick can not be more than 0.95 14. Checking quality of data distribution Quality of data distribution is a measure of deviation of the real data space usage from the ideal one defined by volume partitioning. The smaller the deviation, the better the distribution quality. Checking quality of distribution makes sense only in the case when your volume partitioning is space-based, or if it coincides with the space-based one. If your partitioning is throughput-based, and it doesn't coincide with the space-based one, then quality of actual data distribution can be rather bad, as in this case the file system is worried for low-performance devices to not become a bottleneck, and effective space usage in this case is not a high priority. Checking quality of data distribution is based on the free blocks accounting, provided by the file system. Note that file system doesn't count busy data and meta-data blocks separately, so you are not able to find real data space usage, and hence to check quality of distribution in the case when meta-data brick contains data blocks. To check quality of distribution 1) make sure that meta-data brick doesn't contain data blocks; 2) make sure that no regular file and volume operations are currently in progress; 3) find "blocks used", "system blocks" and "data capacity" statistics for each data brick: # sync # volume.reiser4 -p 1 /mnt ... # volume.reiser4 -p N /mnt 4) find real data space usage on each brick; 5) calculate partitioning and ideal data space usage on each data brick; 6) find deviation of (4) from (5). Example. 
Let' build a LV of 3 bricks (one 10G meta-data brick sdb1, and two data bricks: sdc1 (10G), sdd1(5G)) with space-based partitioning: # VOL_ID=`uuid -v4` # echo "Using uuid $VOL_ID" # mkfs.reiser4 -U $VOL_ID -y -t 256K /dev/vdb1 # mkfs.reiser4 -U $VOL_ID -y -a -t 256K /dev/vdc1 # mkfs.reiser4 -U $VOL_ID -y -a -t 256K /dev/vdd1 # mount /dev/vdb1 /mnt Fill the meta-data brick with data: # dd if=/dev/zero of=/mnt/myfile bs=256K No space left on device... Add data-bricks /dev/sdc1 and dev/sdd1 to the volume: # volume.reiser4 -a /dev/vdc1 /mnt # volume.reiser4 -a /dev/vdd1 /mnt Move all data blocks to the newly added bricks: # volume.reiser4 -r /dev/vdb1 /mnt # sync Now meta-data brick doesn't contain data blocks (only meta-data ones), so that we can calculate quality of data distribution # volume.reiser4 /mnt -p0 blocks used: 503 # volume.reiser4 /mnt -p1 blocks used: 1657203 system blocks: 115 data capacity: 2621069 # volume.reiser4 /mnt -p2 blocks used: 833001 system blocks: 73 data capacity: 1310391 Basing on the statistics above calculate quality of distribution. Total data capacity of the volume: 2621069 + 1310391 = 3931460 Relative capacities of data bricks: C1 = 2621069 /(2621069 + 1310391) = 0.6667 C2 = 1310464 /(2621069 + 1310391) = 0.3333 Real space usage on data bricks (blocks used - system blocks): R1 = 1657203 - 115 = 1657088 R2 = 833001 - 73 = 832928 Real space usage on the volume: R = R1 + R2 = 1657088 + 832928 = 2490016 Ideal data space usage on data bricks: I1 = C1 * T = 0.6667 * 2490016 = 1660094 I2 = C2 * T = 0.3333 * 2490016 = 829922 Deviation: D = (R1, R2) - (I1, I2) = (3006, -3006) Relative deviation: D/R = (-0.0012, 0.0012) Quality of distribution: Q = 1 - max(|D1|, |D1|) = 1 - 0.0012 = 0.9988 Comment. For any specified number of bricks N and quality of distribution Q it is possible to find a configuration of a logical volume composed of N bricks, so that quality of distribution on that volume will be better than Q. Comment. 
Quality of distribution Q doesn't depend on the number of bricks in the logical volume. This is a theorem, which can be strictly proven. 9f5b05fecf2c37f88fd01f6289f736460c3f14ab 4329 2019-12-31T11:00:30Z Edward 4 Added "Logical Volumes Administration" page The most recent version of this document will be available here: https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration Reiser5 logical volumes with builtin fair distribution and transparent data migration capabilities. Administration guide - getting started. Logical volume (LV) can be composed of any number of block devices, different in physical and geometric parameters. However the optimal configuration (true parallelism) imposes some restrictions and dependencies on the size of such devices. WARNING: The stuff is not stable. Don't put important data to logical volumes managed by software of release number 5.X.Y IMPORTANT: Currently there is no tools to manage Reiser5 logical volumes off-line, so it it strongly recommended to save/update configurations of your LV in a file, which doesn't belong to that volume. 1. Basic definitions. Volume configuration. Brick's capacity. Partitioning. Fair distribution. Balancing Basic configuration of a logical volume is the following information: 1) Volume UUID; 2) Number of bricks in the volume; 3) List of brick names or UUIDs in the volume; 4) UUID or name of the brick to be added/removed (if any). That brick is not counted in (2) and (3). For each volume its configuration should be stored somewhere (but not on that volume!) and properly updated before and after each volume operation performed on that volume. We make the user responsible for this. Volume configuration is needed to facilitate deploying a volume. Abstract capacity (or simply capacity) of a brick is a positive integer number. Capacity is a brick's property defined by user. Don't confuse it with the size of block device. Think of it as of brick's "weight" in some units. 
And this is the user, who decides, which property of the brick to assign as its abstract capacity and in which units. In particular, it can be size of the block device in kilobytes, or its size in megabytes, or its throughput in M/sec, or other geometric or physical parameter of the device, associated with the brick. It is important that capacities of all bricks of the same logical volume are measured in the same units. Also, it would be utterly pointless to assign different properties as abstract capacities for bricks of the same LV. For example, size of block device for one brick, and disk bandwidth for another one. Capacity of each brick gets initialized by mkfs utility. By default it is calculated as number of free blocks on the device at the very end of the formatting procedure. For meta-data brick it is calculated as 70% of such amount. Capacity of any brick can be changed on-line by user. Capacity of a logical volume is defined as a sum of capacities of its bricks-components. Relative capacity of a brick is the ratio of brick's capacity to volume's capacity. Relative capacity defines a portion of IO-requests that will be issued against that brick. Array of relative capacities (C1, C2, ...) of all bricks is called volume partitioning. Obviously, C1 + C2 + ... = 1. (Real) data space usage on a brick is number of data blocks, stored on that brick. Ideal (or expected) data space usage on a brick is T*C, where T is total number of data blocks stored in the volume. C is relative capacity of the brick. It is recommended to compose volumes in the way so that space-based partitioning coincides with throughput-based one - it would be the optimal volume configuration, which provides true parallelism. If it is impossible for some reason, then choose a preferred partitioning method (space-based, or throughput-based). Note that space-based partitioning saves volume space, whereas throughput based one saves volume throughput. 
When performing regular file operations, Reiser5 distributes data stripes throughout the volume evenly and fairly. It means that portion of IO-requests issued against each brick is equal to its relative capacity, that is, to the portion of capacity that the brick adds to the total volume's capacity. Most volume operations are accompanied by rebalancing, which keeps fairness of distribution. For example, adding a brick to a logical volume changes its partitioning, and hence, breaks fairness of the distribution, so we need to move some data stripes to the new brick to make distribution fair. Also you can not simply remove a brick from a logical volume - all data stripes should be moved from that brick to other bricks of the logical volume. Every time when user performs a volume operation, Reiser5 marks LV as "not balanced". After successful balancing the status of LV is changed to "balanced". If balancing procedure fails for some reasons, it should be resumed manually (with volume.reiser4 utility). It is allowed to perform regular file operations on not balanced LV. However, in this case: a) we don't guarantee a good quality of data distribution on your LV. b) you won't be able to perform volume operations on your LV except balancing - any other volume operation will return error (EBUSY). So, don't forget to bring your LV to the balanced state as soon as possible! 2. Prepare Software and Hardware Build, install and boot kernel with Reiser4 of software framework release number 5.X.Y. Kernel patches can be found here: https://sourceforge.net/projects/reiser4/files/v5-unstable/ Note that by Linux kernel and GNU utilities the testing stuff is still recognized as "Reiser4". 
Make sure there is the following message in kernel logs: "Loading Reiser4 (Software Framework Release: 5.X.Y)" Build and install the latest libaal library: https://sourceforge.net/projects/reiser4/files/reiser4-utils/libaal/ Download, build and install the latest version 2.A.B of Reiser4progs package: https://sourceforge.net/projects/reiser4/files/v5-unstable/ Make sure that utility for managing logical volumes is installed (as a part of reiser4progs package) on your machine: # volume.reiser4 -? 3. Creating a logical volume Start from choosing a unique ID (uuid) of your volume. By default it is generated by mkfs utility. However, user can generate it himself by proper tools (e.g. uuid(1)) and store in an environment variable for convenience: # VOL_ID=`uuid -v4` # echo "Using uuid $VOL_ID" Choose a stripe size for your logical volume. For a good quality of distribution it is recommended that stripe doesn't exceed 1/10000 of volume size. On the other hand, too small stripes will increase space consumption on your meta-data brick. In our example we choose stripe size 256K. Start from creating the first brick of your volume - meta-data brick, passing volume-ID and stripe size to mkfs.reiser4 utility: # mkfs.reiser4 -U $VOL_ID -t 256K /dev/vdb1 Currently only one meta-data brick per volume is supported, so it is recommended that size of block device for meta-data brick in not too small. In most cases it will be enough, if your meta-data brick is not smaller than 1/200 of maximal volume size. For example, 100G meta-data brick will be able to service ~20T logical volume. Mount your logical volume consisting of one meta-data brick: # mount /dev/vdb1 /mnt Find a record about your volume in the output of the following command: # volume.reiser4 -l Create configuration of your logical volume (its definition is above) and store it somewhere, but not on that volume! Your logical volume is now on-line and ready to use. 
You can now perform regular file operations and volume operations (e.g. add a data brick to your LV).

4. Adding a data brick to an LV

At any time you can add a data brick to your LV, in parallel with regular file operations executing on the volume. Make sure, however, that no other volume operation (e.g. removing a brick) is in progress on the volume; otherwise your operation will fail with EBUSY. Obviously, adding a brick increases the capacity of your volume.

Choose a block device for the new data brick. Make sure that it is neither too large nor too small: the capacities of any two bricks of the same logical volume cannot differ by more than 2^19 (~500 thousand) times. E.g. your logical volume cannot contain both a 1M and a 2T brick. Any attempt to add a brick of improper capacity will fail with an error.

Format the device the same way as the meta-data brick, but also specify the "-a" option (to let mkfs know that it is a data brick):

# mkfs.reiser4 -U $VOL_ID -t 256K -a /dev/vdb2

Important: make sure you specified the same volume ID and stripe size that the other bricks of the logical volume have; otherwise the operation of adding the data brick will fail.

Update the configuration of your volume with the UUID or name of the brick you want to add (item #4). To add the brick, simply pass its name as an argument to the option "-a" and specify your LV via its mount point:

# volume.reiser4 -a /dev/vdb2 /mnt

The procedure of adding a brick automatically invokes rebalancing, which moves a portion of data stripes to the newly added brick (so that the resulting distribution is fair). The portion of data blocks moved during such rebalancing is equal to the relative capacity of the new brick, i.e. to the portion of capacity that the new brick adds to the updated LV's capacity. This important property defines the cost of the balancing procedure: if the portion of capacity added by a brick is small, then the number of stripes moved during balancing is also small.
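The cost statement above can be made concrete with a small sketch (our own helper, hypothetical numbers): the fraction of stripes moved by the rebalancing equals the capacity the new brick contributes to the updated volume:

```python
# Sketch: fraction of data stripes moved by the rebalancing that follows
# an "add brick" operation (per the property stated above).
def rebalance_fraction(old_total_capacity, new_brick_capacity):
    return new_brick_capacity / (old_total_capacity + new_brick_capacity)

# Hypothetical: adding a 100G brick to a 900G volume moves ~10% of the
# data stripes; adding a 10G brick would move only ~1%.
print(rebalance_fraction(900, 100))
```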
Like any user-space utility, the operation of adding a brick can return an error, even assuming the brick you wanted to add is properly formatted. In this case check the status of your LV:

# volume.reiser4 /mnt

If the volume is unbalanced, simply complete the balancing manually:

# volume.reiser4 -b /mnt

Otherwise, check the number of bricks in your LV. Most likely it is the same as before the failed operation; in this case simply repeat the operation of adding a brick from scratch.

Upon successful completion update your volume configuration. That is, increment (#2), add info about the new brick to (#3) and remove records at (#4).

5. Removing a data brick from an LV

At any time you can remove a data brick from your LV, in parallel with regular file operations executing on the volume. Make sure, however, that no other volume operation (e.g. adding a brick) is in progress on the volume; otherwise your operation will fail with EBUSY. Obviously, removing a brick decreases the abstract capacity of your LV. Note that the other bricks must have enough space to store all the data blocks of the brick you want to remove; otherwise the removal operation will return an error (ENOSPC).

Suppose you want to remove brick /dev/vdb2 from your LV mounted at /mnt. Update your volume configuration with the UUID and name of the brick you want to remove (item #4). To remove the brick, simply pass its name as an argument to the option "-r" and specify the logical volume by its mount point:

# volume.reiser4 -r /dev/vdb2 /mnt

The procedure of brick removal automatically invokes rebalancing, which distributes the data of the brick to be removed among the other bricks, so that the resulting distribution is also fair. The portion of data stripes moved during such rebalancing is equal to the relative capacity of the brick to be removed (i.e. to the portion of capacity that the brick added to the LV's capacity).
It can happen that the command above completes with an error (like any other user-space application). In this case check the status of your LV:

# volume.reiser4 /mnt

If the volume is not balanced, simply complete the balancing manually:

# volume.reiser4 -b /mnt

Otherwise, check the number of bricks in your logical volume - it should be the same as before the failed operation. The error -ENOSPC indicates that the free space on the other bricks is not enough to fit all the data of the brick you want to remove. On success, update your volume configuration: remove the information about the removed brick at #3 and #4.

6. Changing a brick's capacity

At any time (assuming no other volume operation is in progress) you can change the abstract capacity of any brick to some new non-zero value. Changing capacity always changes the volume partitioning and therefore breaks fairness of distribution, so Reiser5 automatically launches rebalancing to make sure that the resulting distribution is fair for the new set of capacities. In particular, increasing a brick's capacity will move some data from other bricks to the brick whose capacity was increased; decreasing a brick's capacity will move some data from the brick whose capacity was decreased to the other bricks.

To change the abstract capacity of brick /dev/vdb1 to a new value (e.g. 200000), simply run

# volume.reiser4 -z /dev/vdb1 -c 200000 /mnt

pronounced as "resize brick /dev/vdb1 to new capacity 200000 in the volume mounted at /mnt".

The operation of changing capacity can return an error. Most likely it is -ENOSPC, which is a side effect of concurrent regular file writes. In this case check the status of your LV. If it is unbalanced, then consider removing some files from your LV and complete the balancing by running

# volume.reiser4 -b /mnt

Otherwise, repeat the operation from scratch.

Comment. Changing a brick's capacity to 0 is undefined and will return an error; consider the brick removal operation instead.

7.
Operations with the meta-data brick

The meta-data brick can also contain data stripes and participate in data distribution like the data bricks, so all the volume operations described above are applicable to the meta-data brick as well. Note, however, that it is impossible to completely remove the meta-data brick from the logical volume for obvious reasons (meta-data needs to be stored somewhere), so the brick removal operation applied to the meta-data brick actually removes it from the Data Storage Array (DSA), not from the logical volume. The DSA is the subset of the LV consisting of the bricks participating in data distribution. Once you remove the meta-data brick from the DSA, that brick will be used only to store meta-data. The operation of adding a brick, applied to the meta-data brick, returns it back to the DSA.

Important: Reiser5 doesn't count busy data and meta-data blocks separately. So, in contrast with data bricks (which contain only data), you are not able to find out the real space occupied by data blocks on the meta-data brick - Reiser5 knows only the total space occupied.

To check the status of the meta-data brick simply run

# volume.reiser4 /mnt

and compare the values of "bricks total" and "bricks in DSA". If they are equal, then the meta-data brick participates in data distribution. Otherwise, "bricks total" should be 1 more than "bricks in DSA", which indicates that the meta-data brick doesn't participate in data distribution (and therefore doesn't contain data blocks). Note that other cases are impossible: for data bricks, participation in the LV and in the DSA are always equivalent.

8. Unmounting a logical volume

To terminate a mount session, just issue the usual umount command with the mount point specified. Note that after unmounting the volume all bricks by default remain registered in the system until system shutdown. If you want to unregister a brick before system shutdown, simply issue the following command:

# volume.reiser4 -u BRICK_NAME

9.
Deploying a logical volume after correct unmount

Make sure (by checking your volume configuration) that all bricks of the volume are registered in the system. The list of all volumes and bricks registered in the system can be found in the output of the following command:

# volume.reiser4 -l

Issue the usual mount command against one of the bricks of your volume; it is recommended to specify the meta-data brick in the mount command. If not all bricks of the volume are registered, then attempts to mount the volume will fail with a respective kernel message.

NOTE: Reiser5 will refuse to mount a logical volume when a wrong set of bricks is registered in the system. This can happen due to careless handling of off-line volumes, leading to the appearance of "artifacts" in the list of registered bricks. If you want to re-format a brick, make sure it is unregistered.

10. Deploying a logical volume after correct shutdown

To be able to mount your LV, make sure that all its bricks (data and meta-data) are registered in the system. If not all bricks of the volume are registered, then attempts to mount the volume will fail with a respective kernel message. For this reason we strongly recommend that the user keep track of his LV: store its configuration somewhere - but not on that volume! And don't forget to update that configuration after _every_ volume operation. If you lost the configuration of your LV and don't remember it (which is most likely for large volumes), then it will be rather painful to restore: currently there are no tools to manage off-line logical volumes, so users are expected to do this on their own. It is not at all difficult.

To register a brick in the system use the following command:

# volume.reiser4 -g BRICK_NAME

To print the list of all registered bricks use

# volume.reiser4 -l

To mount your LV just issue a mount command for any one brick of your LV.

Comment.
Reiser5 always tries to register the brick that is passed to the mount command as an argument, so you don't need to pre-register the brick you are going to issue the mount command against.

11. Deploying a logical volume after hard reset or system crash

If no volume operation was interrupted by the hard reset or system crash, then just follow the instructions of section 9. In Reiser5 only a restricted number of bricks participate in every transaction; the maximal number of such bricks can be specified by the user. At mount time a transaction replay procedure will be launched on each such brick independently, in parallel.

Depending on the kind of interrupted volume operation, perform one of the following actions:

a. Adding a brick was interrupted.

Check your volume configuration. Register the old set of bricks (that is, the set of bricks that the volume had before the operation) and try to mount. In the case of an error, register also the brick you wanted to add and try to mount again. Check the status of your LV by running

# volume.reiser4 /mnt

If the volume is unbalanced, then complete the balancing manually by running

# volume.reiser4 -b /mnt

Check "bricks total" of your LV in the output of

# volume.reiser4 /mnt

and compare it with the old number of bricks in the configuration. The new value should be the old one incremented by 1. If the number of bricks is the same, then your operation of adding a brick was completely rolled back by the transaction manager, and you need to repeat it from scratch. Otherwise, your operation completed successfully - update your volume configuration respectively.

b. Brick removal was interrupted.

Check your volume configuration. Register the old set of bricks (that is, the set of bricks that the volume had before the interrupted operation) except the brick you wanted to remove. Try to mount the volume. In the case of an error, register also the brick you wanted to remove and try to mount again.
Check the status of your LV:

# volume.reiser4 /mnt

If the volume is unbalanced, then complete the balancing manually by running

# volume.reiser4 -b /mnt

Comment. After successful completion of the balancing the brick will be automatically removed from the volume. Make sure of it by checking the status of your LV:

# volume.reiser4 /mnt

Update your volume configuration respectively.

c. Another volume operation was interrupted.

Using the volume configuration, register the new set of bricks and try to mount the volume. The mount should be successful. Check the status of your LV:

# volume.reiser4 /mnt

If the volume is unbalanced, then complete the balancing manually by running

# volume.reiser4 -b /mnt

12. LV monitoring

Common info about the LV mounted at /mnt:

# volume.reiser4 /mnt

 ID:             volume UUID
 volume:         ID of the plugin managing the volume
 distribution:   ID of the distribution plugin
 stripe:         stripe size in bytes
 segments:       number of hash space segments (for distribution)
 bricks total:   total number of bricks in the volume
 bricks in DSA:  number of bricks participating in data distribution
 balanced:       balanced status of the volume

Info about any of its bricks, of index J:

# volume.reiser4 -p J /mnt

 internal ID:    brick's "internal ID" and its status in the volume
 external ID:    brick's UUID
 device name:    name of the block device associated with the brick
 block count:    size of the block device in blocks
 blocks used:    total number of occupied blocks on the device
 system blocks:  minimal possible number of busy blocks on that device
 data capacity:  abstract capacity of the brick
 space usage:    portion of occupied blocks on the device

Comment. When retrieving a brick's info, make sure that no volume operation is in progress on the volume; otherwise the command above will return an error (EBUSY).

WARNING. Brick info provided this way is not necessarily the most recent one. To get actual info, run sync(1) and make sure that no regular file operations are in progress.

13.
Checking free space

To check the number of available free blocks on a volume mounted at /mnt, make sure that no regular file operations or volume operations are in progress on that volume, then run

# sync
# df --block-size=4K /mnt

To check the number of free blocks on the brick of index J, run

# volume.reiser4 -p J /mnt

and calculate the difference between "block count" and "blocks used".

Comment. Not all free blocks on a brick/volume are available for use. The number of available free blocks is always ~95% of the total number of free blocks (Reiser4 reserves 5% to make sure that regular file truncate operations won't fail).

NOTE: volume.reiser4 shows the total number of free blocks, whereas df(1) shows the number of available free blocks. The "space usage" statistic shows the portion of busy blocks on an individual brick. For the reasons explained above, "space usage" on any brick cannot exceed 0.95.

14. Checking the quality of data distribution

The quality of data distribution is a measure of the deviation of the real data space usage from the ideal one defined by the volume partitioning. The smaller the deviation, the better the distribution quality. Checking the quality of distribution makes sense only when your volume partitioning is space-based, or coincides with the space-based one. If your partitioning is throughput-based and doesn't coincide with the space-based one, then the quality of the actual data distribution can be rather bad: in that case the file system takes care that low-performance devices not become a bottleneck, and effective space usage is not a high priority.

Checking the quality of data distribution is based on the free blocks accounting provided by the file system. Note that the file system doesn't count busy data and meta-data blocks separately, so you are not able to find the real data space usage - and hence to check the quality of distribution - in the case when the meta-data brick contains data blocks.
To check the quality of distribution:

1) make sure that the meta-data brick doesn't contain data blocks;
2) make sure that no regular file or volume operations are currently in progress;
3) find the "blocks used", "system blocks" and "data capacity" statistics for each data brick:

# sync
# volume.reiser4 -p 1 /mnt
...
# volume.reiser4 -p N /mnt

4) find the real data space usage on each brick;
5) calculate the partitioning and the ideal data space usage on each data brick;
6) find the deviation of (4) from (5).

Example. Let's build an LV of 3 bricks (one 10G meta-data brick vdb1 and two data bricks: vdc1 (10G) and vdd1 (5G)) with space-based partitioning:

# VOL_ID=`uuid -v4`
# echo "Using uuid $VOL_ID"
# mkfs.reiser4 -U $VOL_ID -y -t 256K /dev/vdb1
# mkfs.reiser4 -U $VOL_ID -y -a -t 256K /dev/vdc1
# mkfs.reiser4 -U $VOL_ID -y -a -t 256K /dev/vdd1
# mount /dev/vdb1 /mnt

Fill the meta-data brick with data:

# dd if=/dev/zero of=/mnt/myfile bs=256K
No space left on device...

Add data bricks /dev/vdc1 and /dev/vdd1 to the volume:

# volume.reiser4 -a /dev/vdc1 /mnt
# volume.reiser4 -a /dev/vdd1 /mnt

Move all data blocks to the newly added bricks:

# volume.reiser4 -r /dev/vdb1 /mnt
# sync

Now the meta-data brick doesn't contain data blocks (only meta-data ones), so we can calculate the quality of data distribution:

# volume.reiser4 /mnt -p0
 blocks used: 503
# volume.reiser4 /mnt -p1
 blocks used: 1657203
 system blocks: 115
 data capacity: 2621069
# volume.reiser4 /mnt -p2
 blocks used: 833001
 system blocks: 73
 data capacity: 1310391

Based on the statistics above, calculate the quality of distribution.
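The arithmetic can also be scripted; the snippet below (ours, not part of reiser4progs) plugs in the statistics printed by volume.reiser4:

```python
# Recompute the distribution quality from the brick statistics above.
used     = [1657203, 833001]    # "blocks used" on data bricks 1 and 2
system   = [115, 73]            # "system blocks"
capacity = [2621069, 1310391]   # "data capacity"

total_cap = sum(capacity)
C = [c / total_cap for c in capacity]             # relative capacities
R = [u - s for u, s in zip(used, system)]         # real data space usage
total_R = sum(R)
I = [c * total_R for c in C]                      # ideal usage per brick
rel_dev = [(r - i) / total_R for r, i in zip(R, I)]
Q = 1 - max(abs(d) for d in rel_dev)
print(round(Q, 4))   # 0.9988
```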
Total data capacity of the volume:

 2621069 + 1310391 = 3931460

Relative capacities of the data bricks:

 C1 = 2621069 / (2621069 + 1310391) = 0.6667
 C2 = 1310391 / (2621069 + 1310391) = 0.3333

Real space usage on the data bricks (blocks used - system blocks):

 R1 = 1657203 - 115 = 1657088
 R2 = 833001 - 73 = 832928

Real space usage on the volume:

 R = R1 + R2 = 1657088 + 832928 = 2490016

Ideal data space usage on the data bricks:

 I1 = C1 * R = 0.6667 * 2490016 = 1660094
 I2 = C2 * R = 0.3333 * 2490016 = 829922

Deviation:

 D = (R1, R2) - (I1, I2) = (-3006, 3006)

Relative deviation:

 D/R = (-0.0012, 0.0012)

Quality of distribution:

 Q = 1 - max(|D1|, |D2|)/R = 1 - 0.0012 = 0.9988

Comment. For any specified number of bricks N and quality of distribution Q it is possible to find a configuration of a logical volume composed of N bricks such that the quality of distribution on that volume is better than Q.

Comment. The quality of distribution Q doesn't depend on the number of bricks in the logical volume. This is a theorem, which can be strictly proven.

Logical Volumes Background

Reiser4 offers a brand new method of aggregating block devices into logical volumes on a local machine. It is a qualitatively new level in file system (and operating system) development: local volumes with '''parallel scaling out'''. Reiser4 doesn't implement its own block layer like ZFS etc.; in our approach scaling out is performed by file system means rather than by block layer means, and the flow of I/O requests issued against each device is controlled by the user.

To add a device to a logical volume with parallel scaling out, you first need to format that device - this is the difference between parallel and non-parallel scaling at first glance. The principal difference between parallel and non-parallel scaling out will be discussed below.
Systems with parallel scaling out provide better scalability and resolve a number of problems inherent in non-parallel ones. In logical volumes with parallel scaling out, devices of smaller size and/or throughput don't become "bottlenecks", as happens e.g. in RAID-0 and its popular modifications.

= Fundamental shortcomings of logical volumes composed by block-layer means =

1. Local file systems don't participate in scaling out. They just face a huge virtual block device, for which they need to maintain a free space map. Such maps grow as the volume fills with data. This results in increasing latency of free block allocation and consequently in an essential performance drop on large volumes which are almost full.

2. Loss of disk resources (space and throughput) on logical volumes composed of devices with different physical and geometric parameters (because of the poor translations provided by "classic" RAID levels). Low-performance devices become a bottleneck in RAID arrays.

Attempts to replace RAID levels with better algorithms lead to inevitable and unfixable degradation of logical volumes. Indeed, defragmentation tools work only in the space of virtual disk addresses. If you use classic RAID levels, then everything is fine here: reducing fragmentation on the virtual device always results in reduced fragmentation on the physical ones. However, if you use more sophisticated translations to save disk space and bandwidth, then fragmentation on the real devices tends to accumulate, and you are not able to reduce it just by defragmenting the virtual device. Note that the interest is always in the real devices - no one actually cares what happens on virtual ones.

3. With block-layer means alone it is impossible to build heterogeneous storage composed of devices of different nature. You are not able to apply different approaches to the component devices of the same logical volume (e.g. defragment only rotational drives, issue discard requests only for solid state drives, etc).

4.
It is impossible to efficiently implement data migration (and hence data tiering) on logical volumes composed by block-layer means.

= The previous art =

Previously, there was only one method of scaling out local volumes - by block layer means. That is, the file system deals only with virtual disk addresses (allocation, defragmentation, etc), and the block layer translates virtual disk addresses to real ones and back.

The most common complaint is about the performance drop on such logical volumes once they are large and more than 70% full. Mostly it is related to disk space allocators, which are, to put it mildly, not excellent, and introduce big latency when searching for free blocks on extra-large volumes. Moreover, nothing better has been invented for the past 30 years. It may easily be that the best algorithms for free space management simply do not exist.

Some file systems (ZFS and the like) implement their own block layers. This helps to implement failover; however, the mentioned problem doesn't disappear - if the block layer does its job very well, then the file system, again, faces a huge logical block device, which is hard to handle.

Significant progress in scaling out was made by parallel network file systems (GPFS, Lustre, etc). However, it was unclear how to apply their technologies to a local FS. Mostly this is because local file systems don't have such a luxury as "backend storage", as the network ones do. What a local FS does have is only an extremely poor interface of interaction with the block layer: for example, in Linux a local FS can only compose and issue an I/O request against some buffer (page). In other words, it was unclear what a "local parallel file system" is.

= Our approach. O(1) space allocator =

Around 2010 we realized that the first approach (implementing an own block layer inside a local FS) is totally wrong. Instead we need to pay attention to parallel network file systems and adopt their methods.
However, as I mentioned, there is nothing even close to a direct analogy - for a local FS we need to design "parallelism" from scratch. The same applies to distribution algorithms - we are totally unhappy with the existing ones. Of course, you can deploy a network FS over a number of block devices on the same local machine, but that would not be serious. We state that a serious analogy can be defined and implemented in a properly designed local FS - meet Reiser5.

The basic idea is pretty simple: do not mess with large free space maps (whose sizes depend on the volume size). Instead, manage many small ones of limited size. At any moment the file system should be able to pick a proper such small space map and work only with it. Needless to say, for any logical volume, however big, the search time in such a map is bounded by some value which doesn't depend on the logical volume size. For this reason we'll call it an O(1) space allocator.

The simplest way is to maintain one space map per block device composing the logical volume. If some device is too large, simply split it into a number of partitions to make sure that no space map exceeds some upper limit. Thus, users also should put in some effort from their side to keep the space allocator O(1).

= Parallel scaling out as disk resource conservation. Definitions and examples =

Here we'll consider an abstract subsystem S of the operating system managing a logical volume composed of one or more removable components.

''Definition 1''. We'll say that S '''saves disk space and throughput''' of the logical volume, if

1) its data capacity is the sum of the data capacities of its components;
2) its disk bandwidth is the sum of the disk bandwidths of its components.

We'll say that an LV managed by such a system is with '''parallel scaling out''' (PSO).
There is a good analogy to understand the feature of PSO: imagine that it rains and you put out several cylindrical buckets with different-sized holes for collecting water. In this example raindrops represent I/O requests and the set of buckets represents a logical volume. Note that the amount of water falling into each bucket is proportional to the area of its hole (considered as throughput). In this example all buckets are filled with water evenly and fairly: if one bucket is full, then the other ones are also full. Note that a non-cylindrical form of the buckets would likely break the fairness of water distribution between them, so PSO won't take place in that case.

In practice, however, I/O systems are more complicated: I/O requests are distributed, queued, etc. And conservation of disk resources usually doesn't take place: the disk bandwidth of a logical volume turns out to be smaller than the sum of those of its components. Nevertheless, if the loss of resources is small and doesn't increase with the growth of the volume, then we'll say that such a system features '''parallel scaling out'''.

In complex I/O systems the "leak" of disk bandwidth has a complex nature and can happen in any subsystem: in the file system, in the block layer, etc. The loss can also be caused by interface properties, etc. The fundamental reason for almost all resource leaks is that the mentioned subsystems were poorly designed (because better algorithms were not known at the time, or for other reasons).

The classic example of disk space and throughput loss is RAID arrays. Linear RAID saves disk space, but always drops disk bandwidth. RAID-0, composed of devices of different size and bandwidth, drops both the disk space and the disk bandwidth of the resulting logical volume. The same holds for their modifications like RAID-5. In all mentioned examples the loss of disk bandwidth is caused by poor algorithms (specifically, by the fact that I/O requests are directed to the components in wrong proportions).

''Definition 2''.
A file system managing a logical volume is said to be with '''parallel scaling out''' if it saves the disk space and bandwidth of that logical volume - in other words, if it doesn't drop the mentioned disk resources.

Note that the file system is only a part of an I/O subsystem, and it can easily happen that the file system saves disk resources while the whole system does not - for example, because of a poorly designed block layer which puts I/O requests issued for different block devices into the same queue on a local machine, etc.

As an example, let's calculate the disk bandwidth of a logical volume composed of 2 devices, the first of which has disk bandwidth 200 M/sec, the second 300 M/sec. We'll consider 3 systems: in the first the mentioned devices compose a linear RAID, in the second a striped RAID (RAID-0), and in the third they are managed by a file system with parallel scaling out.

Linear RAID distributes I/O requests unevenly: first we write to the first device; once it is full, we write to the second one. The disk bandwidth of linear RAID is defined by the throughput of the device we are currently writing to. Thus it is never more than the throughput of the faster device, i.e. 300 M/sec.

RAID-0 distributes I/O requests evenly (but not fairly). In the same interval of time the same number N/2 of I/O requests is issued against each device. On the first device they will be written in N/400 sec; on the second device in N/600 sec. The first device is slower, so we have to wait N/400 sec for all N I/O requests to be written to the array. So the throughput of RAID-0 in our case is N/(N/400) = 400 M/sec.

An FS with parallel scaling out distributes I/O requests evenly and fairly. In the same interval of time the number of blocks issued against each device is N*C, where C is the relative throughput of the device. The relative throughput of the first device is 200/(200+300) = 0.4.
The relative throughput of the second one is 300/(200+300) = 0.6. The portions of I/O requests issued for each device will be written in parallel in the same time: 0.4N/200 = 0.6N/300 sec. Therefore the throughput of our logical volume in this case is N/(0.4N/200) = 500 M/sec.

The resulting table of throughput:

 Linear RAID:             < 300 M/sec
 RAID-0:                    400 M/sec
 Parallel scaling out FS:   500 M/sec

According to the definitions above, any local file system built on top of RAID/LVM does NOT possess parallel scaling out (first, because RAID and LVM don't save disk resources; second, because the latency introduced by the free space allocator grows with the volume). For the same reasons any local FS which implements its own block layer (ZFS, Btrfs, etc) does NOT possess parallel scaling out. Note that any network FS built on top of two or more local file systems managing simple partitions as backend saves disk resources.

= Overhead of parallelism for a local FS =

As mentioned above, the characteristic feature of any FS with PSO is that before adding a device to a logical volume you should format it. Of course, this adds some overhead to the system; however, that overhead is not dramatically large. Specifically, with the reiser4 disk format40 specifications the disk overhead is 80K at the beginning of each component device. Next, for each device Reiser5 reads the on-disk super-block and loads it into memory; thus the memory overhead is one persistent memory super-block (~500 bytes) per component device of a logical volume. That is, a logical volume composed of one million devices will take ~500M of (pinned) memory. I think that a person maintaining such a volume will be able to find $30 for an additional memory card. That overhead is the single disadvantage of an FS with PSO; at least, we don't know of others.

= Asymmetric logical volumes. Data and meta-data bricks =

So, any logical volume with parallel scaling out is composed of block devices formatted by the mkfs utility.
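Before going further: the three-way throughput comparison from the previous section can be reproduced numerically (N is an arbitrary amount of data; the script is ours):

```python
# Reproduce the linear RAID / RAID-0 / PSO throughput comparison above.
bw = [200, 300]   # device bandwidths, M/sec
N = 1200.0        # arbitrary amount of data, M

# Linear RAID: bounded by the device currently being written to,
# hence never more than max(bw) = 300 M/sec.
linear_bound = max(bw)

# RAID-0: each device receives N/2; completion is set by the slower one.
raid0 = N / max((N / 2) / b for b in bw)               # 400 M/sec

# PSO: each device receives a share proportional to its relative
# bandwidth, so both devices finish at (approximately) the same moment.
shares = [b / sum(bw) for b in bw]                     # [0.4, 0.6]
pso = N / max(s * N / b for s, b in zip(shares, bw))   # ~500 M/sec
```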
Such a device has a special name: '''brick''', or '''storage subvolume''', of a logical volume. To begin with, we have implemented the simplest approach, where meta-data is located on dedicated block devices - we'll call them '''meta-data bricks'''. I remind you that in reiser4 the notion of "meta-data" includes all kinds of items (keyed records in the storage tree), while the notion of data means unformatted blocks pointed to by "extent pointers". Such unformatted nodes are used to store the bodies of regular files. Meta-data bricks are allowed to contain unformatted data blocks. '''Data bricks''' contain only unformatted data blocks. For obvious reasons such logical volumes are called "asymmetric".

= Stripes. Fibers. Distribution, allocation and migration. Basic definitions =

A '''stripe''' is a logical unit of distribution, that is, a minimal object no parts of which can be stored on different bricks. A set of distribution units that got to the same brick is called a '''fiber'''.

Comment. In the previous art fibers were called stripes (in the case of RAID-0), and logical units of distribution didn't have a special name; for a number of adjacent sectors forming such a unit the notion of "stripe width" was used.

A '''data stripe''' is a stripe of a file body, that is, a logical unit of distribution in a file. '''Meta-data striping''' can also be defined, but we don't consider it here for simplicity.

A '''data block''' is a logical unit of allocation in a file. The data block size is always equal to the file system block size. From these definitions it directly follows that a data block cannot contain more than one stripe. On the other hand, one data stripe can consist of many data blocks.

For any data stripe its '''full address''' in a logical volume is defined as a pair (brick-ID, V), where V is a vector of extents representing the disk addresses of the blocks that the data stripe consists of. A data block is said to be '''allocated''' if it got a disk address on some brick (i.e.
a physical block number was allocated for that logical block). A stripe is said to be '''allocated''' if all the data blocks it consists of got disk addresses on some brick. A stripe is said to be '''distributed''' if it got the first component (brick-ID) of its full address in the logical volume. A stripe is said to be '''migrated''' if its old disk addresses got released and new ones (possibly on another brick) got allocated. The core difference between parallel and non-parallel scaling out in terms of distribution and allocation: in systems with parallel scaling out any stripe first gets distributed, then allocated; in systems with non-parallel scaling out it is the other way around - any stripe first gets allocated, then distributed. An example of a system with non-parallel scaling out is any local FS built on top of a RAID-0 array. Indeed, such an FS first allocates a virtual disk address for a logical block, then the block layer assigns a real device-ID and translates that virtual address to a real one.

= Data distribution and migration. Fiber-Striping. Burst Buffers =

Distribution defines which device-component of a logical volume an IO request composed for a dirty buffer (page) will be issued against. In file systems without PSO the "destination" device is always defined by the virtual disk address allocated for that request. E.g. for RAID-0 the ID of the destination device is defined as (N % M), where N is the virtual address (block number) allocated by the file system and M is the number of disks in the array. In our approach (the O(1) space allocator) we allocate disk addresses on each physical device independently, so for every IO request we first need to assign a destination device, then ask the block allocator managing that device to allocate a block number for the request. So, in our approach distribution doesn't depend on allocation. By default Reiser5 offers distribution based on algorithms (so-called '''fiber-striping''') invented by Eduard Shishkin (patented).
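The two orderings above can be illustrated with a small sketch. The names and the modulo-based distribution here are illustrative assumptions only - the actual fiber-striping algorithm is patented and is not reproduced here:

```python
# Illustrative sketch of "allocate then distribute" (RAID-0 style)
# versus "distribute then allocate" (PSO style); the modulo rule
# stands in for the real distribution algorithm.

NR_DEVICES = 4

def raid0_write(virtual_block):
    # Non-parallel scaling out: the FS allocates a virtual disk
    # address first; the block layer then derives the device-ID
    # from it (distribution follows allocation).
    device_id = virtual_block % NR_DEVICES
    physical_block = virtual_block // NR_DEVICES
    return device_id, physical_block

class PsoVolume:
    # Parallel scaling out: a destination brick is chosen first,
    # then that brick's own allocator hands out a block number
    # (allocation follows distribution).
    def __init__(self, nr_bricks):
        self.next_free = [0] * nr_bricks  # one small allocator per brick

    def write_stripe(self, stripe_id):
        brick_id = stripe_id % len(self.next_free)  # distribute first
        block = self.next_free[brick_id]            # then allocate
        self.next_free[brick_id] += 1
        return brick_id, block
```

Note how in the PSO sketch each brick's allocator works independently of the others, which is what keeps every free space map small.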
With our algorithms all your data will be distributed evenly and fairly among all device-components of the logical volume. This means that the portion of IO requests issued against each device is equal to the relative capacity of that device assigned by the user. The operation of adding/removing a device to/from a logical volume automatically invokes data migration, so that the resulting distribution is also fair. The portion of migrated data is always equal to the relative capacity of the added/removed device, and the speed of data migration is mostly determined by the throughput of that device. Alternatively, Reiser4 allows users to control data distribution and migration themselves. An important application of distribution and migration is '''data tiering''' in the HPC area, as so-called '''Burst Buffers''' (a dump of "hot data" onto a high-performance proxy device, followed by its migration to "persistent storage" in background mode). In all cases the file system memorizes stripe locations.

= Atomicity of volume operations =

Almost all volume operations (adding/removing a brick, changing brick capacities, etc.) involve re-balancing (i.e. massive migration of data blocks), so it is technically difficult to implement full atomicity of such operations. Instead, we issue 2 checkpoints (the first before re-balancing, the second after), and handle 2 cases depending on where, in relation to those points, the volume operation was interrupted. In the first case the user should repeat the operation; in the second case the user should complete the operation (in background mode) using the volume.reiser4 utility. See the administration guide on reiser4 logical volumes for details.

= Limitations on asymmetric logical volumes =

Maximal number of bricks in a logical volume:
* in the "builtin" distribution mode - 2^32
* in the "custom" distribution mode - 2^64

In the "builtin" distribution mode any 2 bricks of the same logical volume can not differ in size by more than 2^19 (~500,000) times.
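The fairness rule from the distribution section above - each brick receives a share of IO equal to its relative capacity, and adding a brick migrates exactly that brick's relative share of data - can be sketched as follows (a hypothetical helper, not reiser4 code):

```python
# Hypothetical sketch of fair distribution by relative capacity.

def relative_capacities(capacities):
    # Each brick's share of IO requests equals its capacity divided
    # by the total capacity of the volume.
    total = sum(capacities)
    return [c / total for c in capacities]

old = [200, 300]                 # capacities of the existing bricks
print(relative_capacities(old))  # [0.4, 0.6]

new = old + [500]                # add a brick of capacity 500
# The portion of data migrated to the new brick equals its
# relative capacity in the enlarged volume:
print(relative_capacities(new))  # [0.2, 0.3, 0.5]
print(500 / sum(new))            # 0.5
```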
For example, your logical volume can not contain both 1M and 2T bricks.

Maximal number of stripe pointers held by one 4K meta-data block: 75 (for the node40 format). Maximal number of data blocks served by 1 meta-data block: 75*S, where S is the stripe width in file system blocks. For example, for 128K stripes and 4K blocks (S=32) one meta-data block can serve not more than 2400 data blocks. In particular, when all bricks are of equal capacity, this means that one meta-data brick can serve not more than 2400 data bricks.

For the best quality of "builtin" distribution it is recommended that:
* a) the stripe size is not larger than 1/10000 of the total volume size;
* b) the number of bricks in your logical volume is a power of 2 (i.e. 2, 4, 8, 16, etc.). If you cannot afford that, then make sure that the number of hash space segments (a property of your logical volume, which can be increased online) is not smaller than 100 * number-of-bricks.

Not more than one volume operation on the same logical volume can be executed at a time. If some volume operation is not completed, then attempts to execute other ones will return an error (EBUSY).

= Security issues =

"Builtin" distribution combines random and deterministic methods. It is "salted" with the volume-ID, which is known only to root. Once it is compromised (revealed), the logical volume can be subjected to a "free space attack": with a known volume-ID an attacker (a non-privileged user) will be able to fill some data brick up to 100% while the others have a lot of free space, so that nobody will be able to write to that volume anymore. So, keep your volume-ID a secret!

= Software and Disk Version 5.1.3. Compatibility =

To implement parallel scaling out we upgraded the Reiser4 code base with the following new plugins:
# "asymmetric" volume plugin (new interface);
# "fsx32" distribution plugin (new interface);
# "striped" file plugin (existing interface);
# "extent41" item plugin (existing interface);
# "format41" disk format plugin (existing interface).
In the best traditions we increment version numbers. The old disk and software version was 4.0.2. The "minor" number (2) is incremented because of (1-4). The "major" number (0) is incremented because of (5) and changes in the format super-block. The "principal" number (4) is incremented because of changes in the master super-block. For more details about compatibility see [https://reiser4.wiki.kernel.org/index.php/Reiser4_development_model this page].

Old reiser4 partitions (of format 4.0.X) will be supported by the Reiser5 kernel module. For this you need to enable the option "support Plan-A key allocation scheme" (not default) when configuring the kernel. Note that it will automatically disable support of logical volumes; such mutual exclusiveness is due to performance reasons. Reiser4progs of software release number 5.X.Y don't support old reiser4 partitions of format 4.0.X. To fsck the latter, use reiser4progs of software release number 4.X.Y - it will exist as a separate branch.

= TODO =

* Upgrading FSCK to work with logical volumes;
* Asymmetric LV with more than 1 meta-data brick per volume;
* Symmetric logical volumes (meta-data on all bricks);
* 3D-snapshots of LV (snapshots with an ability to roll back not only file operations, but also volume operations);
* Global (networking) logical volumes.

[[category:Reiser4]]
To add a device to a logical volume with parallel scaling out, you first need to format that device - this is the difference between parallel and non-parallel scaling at first glance. The principal difference between parallel and non-parallel scaling out will be discussed below. Systems with parallel scaling out provide better scalability and resolve a number of problems inherent to non-parallel ones. In logical volumes with parallel scaling out devices of smaller size and(or) throughput don't become a "bottlenecks", as it happens e.g. in RAID-0 and its popular modifications. = Fundamental shortcomings of logical volumes composed by block-layer means = 1. Local file systems don't take participation in scaling out. They just face a huge virtual block device, for which they need to maintain a free space map. Such maps grow as the volume fills with data. It results in increasing latency on free blocks allocation and consequently in essential performance drop on large volumes which are almost full. 2. Loss of disk resources (space and throughput) on logical volumes composed of devices with different physical and geometric parameters (because of poor translations provided by "classic" RAID levels). Low-performance devices become a bottleneck in RAID arrays Attempts to replace RAID levels with better algorithms lead to inevitable and unfixable degradation of logical volumes. Indeed, defragmentation tools work only on the space of virtual disk addresses. If you use classic RAID levels, then everything is fine here: reducing fragmentation on virtual device always results in reducing fragmentation on physical ones. However, if you use more sophisticated translations to save disk space and bandwidth, then fragmentations on real devices tends to accumulate, and you are not able to reduce it just defragmenting the virtual device. Note that the interest is always real devices - no one actually cares what happens on virtual ones. 3. 
With only block layer means it is impossible to build heterogeneous storage composed of devices of different nature. You are not able to use different approaches to devices-components of the same logical volume (e.g. defragment only rotational drives, issue discard requests only for solid state drives, etc). 4. It is impossible to efficiently implement data migration (and, hence, data tiering) on logical volumes composed by block-layer means. = The previous art = Previously, there was only one method for scaling out local volumes - by block layer means. That is, file system deals only with virtual disk addresses (allocation, defragmentation, etc), and the block layer translates virtual disk addresses to real ones and backward. The most common complaint is about performance drop on such logical volumes, which are large and more than 70% full. Mostly it is related to disk space allocators, which are, to put it mildly, not excellent, and introduce big latency when searching for a free blocks on extra-large volumes. Moreover, nothing better has been invented for the past 30 years. Also, it easily may be that the best algorithms for free space management simply do not exist. Some file systems (ZFS and like) implement their own block layers. It helps to implement a failover, however, the mentioned problem doesn't disappear - if the block layer does its job very well, then the file system, again, faces a huge logical block device, which is hard to handle. Significant progress in scaling out was made by parallel network file systems (GPFS, Lustre, etc). However it was unclear, how to apply their technologies to a local FS. Mostly, it is because local file systems don't have such luxury like "backend storage" as the network ones do. What local FS does have - is only extremely poor interface of interaction with the block layer. For example, in Linux local FS can only compose and issue an IO request against some buffer (page). 
In other words, it was unclear, what a "local parallel file system" is. = Our approach. O(1) space allocator = In ~2010 we had realized that the first approach (implementation an own block layer inside a local FS) is totally wrong. Instead we need to pay attention to parallel network file systems to adopt their methods. However, as I mentioned, there is no something even close to a direct analogy - it means that for local FS we need to design "parallelism" from scratch. The same about distribution algorithms - we are totally unhappy with existing ones. Of course, you can deploy a networking FS on the same local machine for a number of block devices, but it will be something not serious. We state that a serious analogy can be defined and implemented in properly designed local FS - meet Reiser5. The basic idea is pretty simple - to not mess with large free space maps (whose sizes depend on the volume size). Instead, we need to manage many small ones of limited size. At any moment the file system should be able to pick up a proper such small space map, and work only with it. Needless to say, that for any logical volume, which is as big as you want, search time in a such map will be also limited by some value, which doesn't depend on logical volume size. For this reason, we'll call it O(1) - space allocator. The simplest way is to maintain one space map per each block device, which is a component of the logical volume. If some device is too large, simply split it into a number of partitions to make sure that any space map does not exceed some upper limit. Thus, users also should put some efforts from their side to make the space allocator be O(1). = Parallel scaling out as disk resources conservation. Definitions and examples = Here we'll consider an abstract subsystem S of the operating system managing a logical volume composed of one, or more removable components. ''Definition 1''. 
We'll say that S '''saves disk space and throughput''' of the logical volume, if 1) its data capacity is a sum of data capacities of its components 2) its disk bandwidth is a sum of disk bandwidths of its components We'll say that LV managed by such system is with '''parallel scaling out''' (PSO). There is a good analogy to understand the feature of PSO: imagine that it rains and you put several cylindrical buckets with different sized holes for collecting water. In this example raindrops represent IO-requests, the set of bucket represents a logical volume. Note that amount of water felt to each bucket is proportional to the square of its hole (considered as throughput). In this example all buckets are filled with water evenly and fairly: if one bucket is full, then other ones is also full. Note, that non-cylindrical form of buckets will likely break fairness of water distribution between them, so that PSO won't take place in this case. In practice, however, IO-systems are more complicated: IO requests are distributed, queued, etc. And conservation of disk resources usually doesn't take place: disk bandwidth of any logical volume turns out to be smaller than the sum of ones of its components. Nevertheless, if the loss of resources is small, and doesn't increase with the growth of the volume, then we'll say that such system features '''parallel scaling out'''. In complex IO-systems "leak" of disk bandwidth has complex nature and can happen on every its subsystem: on the file system, on the block layer, etc. The loss can also be caused by interface properties, etc. The fundamental reason of almost all resource leaks is that mentioned subsystems were poorly designed (because better algorithms were not known at that moment, or because of other reasons). The classic example of disk space and throughput loss is RAID arrays. Linear RAID saves disk space, but always drops disk bandwidth. 
RAID-0, composed of different size and bandwidth devices, drops both, disk space and disk bandwidth of the resulted logical volume. The same is for their modifications like RAID-5. In all mentioned examples the loss of disk bandwidth is caused by poor algorithms (specifically, by the fact that IO requests are directed to every component in wrong proportions). ''Definition 2''. A file system managing a logical volume is said to be with '''parallel scaling out''', if it saves disk space and bandwidth of that logical volume. In other words, if it doesn't drop the mentioned disk resources. Note that file system is only a part of an IO-subsystem. And it can easily happen that the file system saves disk resources, while the whole system is not. For example, because of poorly designed block layer, who puts IO requests issued for different block devices to the same queue on a local machine, etc. As an example, let's calculate disk bandwidth of a logical volume composed of 2 devices, the first of which has disk bandwidth 200M/sec, second - 300M/sec. We'll consider 3 systems: in the first one the mentioned devices compose linear RAID, in the second one - striped RAID (RAID-0), in the third one they are managed by a file system with parallel scaling out. Linear RAID distributes IO requests not evenly: first we write to the first device. Once it is full, we write to the second one. Disk bandwidth of linear RAID is defined by the throughput of the device we are currently writing to. Thus it always is not more than throughput of the faster device, i.e. 300 M/sec. RAID-0 distributes IO requests evenly (but not fairly). In the same interval of time the same number N/2 of IO-requests will be issued against each device. On the first device it will be written in N/400 sec. On the second device it will be written in N/600 sec. Note that the first device is slower, therefore we should wait N/400 sec for all N IO-requests to be written to the array. 
So throughput of RAID-0 in our case is N/(N/400) = 400 M/sec. FS with parallel scaling out distributes IO requests evenly and fairly. In the same interval of time the number of blocks issued against each device is N*C, where C is relative throughput of the device. Relative throughput of the first device is 200/(200+300) = 0.4. Of the second one - 300/(200+300) = 0.6 Portion of IO-requests issued for each device will be written in parallel in the same time 0.4N/200 = 0.6N/300 sec. Therefore, throughput of our logical volume in this case is N/(0.4N/200) = 500 M/sec. The resulted table of throughput: Linear RAID: <300 M/sec RAID-0: 400 M/sec Parallel scaling out FS 500 M/sec According to definitions above any local file system built on a top of RAID/LVM does NOT possesses parallel scaling out (first, because RAID and LVM don't save disk resources, second, because latency introduced by free space allocator grows with volume. For the same reasons any local FS, which implements its own block layer (ZFS, Btrfs, etc) does NOT possesses parallel scaling out. Note that any network FS built on a top of two or more local FS managing simple partitions as backend saves disk resources. = Overhead of parallelism for local FS = As we mentioned above, the characteristic feature of any FS with PSO is that before adding a device to a logical volume you should format it. Of course, it adds some overhead to the system. However, that overhead is not dramatically large. Specifically, with reiser4 disk format40 specifications the disk overhead includes 80K at the beginning of each device-component. Next, for each device Reiser5 reads on-disk super-block and loads its to memory, Thus, memory overhead includes one persistent memory super-block (~500 bytes) per each device-component of a logical volume. That is, a logical volume composed of one million devices will take ~500M of memory (pinned). I think that a person maintaining such volume will be able to find $30 for additional memory card. 
That overhead is a single disadvantage of FS with PSO. At least, we don't know other ones. = Аsymmetric logical volumes. Data and meta-data bricks = So, any logical volume with parallel scaling out is composed of block devices formatted by mkfs utility. Such device has a special name '''brick''', or '''storage subvolume''' of a logical volume. For the beginning we have implemented the simplest approach, when meta-data is located on dedicated block devices - we'll call them '''meta-data bricks'''. I remind that in reiser4 the notion of "meta-data" includes all kind of items (key'ed records in the storage tree). And the notion of data means unformatted blocks pointed out by "extent pointers". Such unformatted nodes are used to store bodies of regular files. Meta-data bricks are allowed to contain unformatted data blocks. '''Data bricks''' contain only unformatted data blocks. For obvious reasons such logical volumes are called "asymmetric". = Stripes. Fibers. Distribution, allocation and migration. Basic definitions = '''Stripe''' is a logical unit of distribution, that is a minimal object, any parts of which can not be stored on different bricks. A set of distribution units got to the same brick is called '''fiber'''. Comment. In the previous art fibers were called stripes (case of RAID-0), and logical units of distribution didn't have a special name. For a number of adjacent sectors forming such a unit a notion of "stripe width" was used. '''Data stripe''' is a stripe of a file body. That is, logical unit of distribution in a file. '''Meta-data striping''' also can be defined, but we don't consider it here for simplicity. '''Data block''' is a logical unit of allocation in a file. Data block size is always equal to the file system block size. From these definitions it directly follows that data block can not contain more than one stripe. On the other hand, one data stripe can consist of many data blocks. 
For any data block its '''full address''' in a logical volume is defined as a pair (brick-ID, disk-address). Data block is said '''allocated''', if it got disk address on some brick (i.e. a physical block number was allocated for that logical block). Stripe is said '''allocated''', if all data blocks, it consists of, got disk addresses on some brick. Stripe is said '''distributed''', if it got the first component (brick-ID) of its full address in the logical volume. Stripe is said '''migrated''', if its old disk addresses got released, and new ones (possibly on another brick) got allocated. The core difference between parallel and non-parallel scaling out in terms of distribution and allocation: In systems with parallel scaling out any stripe firstly gets distributed, then allocated. In systems with non-parallel scaling out it is other way around - any stripe firstly gets allocated, then distributed. An example of a system with not parallel scaling out is any local FS built a top of RAID-0 array. Indeed, at first, such FS allocates a virtual disk address for a logical block, then the block layer assigns a real device-ID and translates that virtual address to real one. = Data distribution and migration. Fiber-Striping. Burst Buffers = Distribution defines what device-component of a logical volume an IO request composed for a dirty buffer(page) will be issued against. In file systems with PSO "destination" device is always defined by a virtual disk address allocated for that request. E.g. for RAID-0 ID of destination device is defined as (N % M), where N is a virtual address (block number), allocated by the file system, M is number of disks in the array. In our approach (O(1) space allocator) we allocate disk addresses on each physical device independently, so for every IO-request we first need to assign a destination device, then ask a block allocator managing that device to allocate a block number for this request. 
So, in our approach distribution doesn't depend on allocation. By default Reiser5 offers distribution based on algorithms (so-called '''fiber-striping''') invented by Eduard Shishkin (patented stuff). With our algorithms all your data will be distributed evenly and fairly among all devices-components of the logical volume. It means that portion of IO requests issued against each device is equal to relative capacity of that device assigned by user. Operation of adding/removing a device to/from a logical volume automatically invokes data migration, so that resulted distribution is also fair. Portion of migrated data is always equal to relative capacity of the added/removed device. The speed of data migration is mostly determined by throughput of the device to be added/removed. Alternatively, Reiser4 allows users to control data distribution and migration themselves. An important application distribution and migration find as '''data tiering''' in HPC area as so-called '''Burst Buffers''' (dump of "hot data" on high-performance proxy-device with its following migration to "persistent storage" in background mode). In all cases the file system memorizes stripes location. = Atomicity of volume operations = Almost all volume operations (adding/removing a brick, changing bricks capacity, etc) involve re-balancing (i.e. massive migration of data blocks), so it is technically difficult to implement full atomicity of such operations. Instead, we issue 2 checkpoints (first before re-balancing, second - after), and handle 2 cases depending on where in relation to those points the volume operation was interrupted. In the first case user should repeat the operation again, in the second case user should complete the operation (in the background mode) using volume.reiser4 utility. See administration guide on reiser4 logical volumes for details. 
= Limitations on asymmetric logical volumes = Maximal number of bricks in a logical volume: in the "builtin" distribution mode - 2^32 in the "custom" distribution mode - 2^64 In the "builtin" distribution mode any 2 bricks of the same logical volume can not differ in size more than 2^19 (~1 million) times. For example, your logical volume can not contain both, 1M and 2T bricks. Maximal number of stripe pointers held by one 4K-metadata block: 75 (for node40 format). Maximal number of data blocks served by 1 meta-data block: 75*S, where S is stripe width in file system blocks. For example, for 128K- stripes and 4K blocks (S=32) one meta-data block can serve not more than 2400 data blocks. In particular, when all bricks are of equal capacity, it means that one meta-data brick can serve not more than 2400 data bricks. For the best quality of "builtin" distribution it is recommended that: a) stripe size is not larger than 1/10000 of total volume size. b) number of bricks in your logical volume is a power of 2 (i.e. 2, 4, 8, 16, etc). If you cannot afford it, then make sure that number of hash space segments (a property of your logical volume, which can be increased online) is not smaller than 100 * number-of-bricks. Not more than one volume operations on the same logical volume can be executed in parallel. If some volume operation is not completed, then attempts to execute other ones will return error (EBUSY). = Security issues = "Builtin" distribution combines random and deterministic methods. It is "salted" with volume-ID, which is known only to root. Once it is compromised (revealed), the logical volume can be subjected to "free space attack" - with known volume-ID an attacker (non-privileged user) will be able to fill some data brick up to 100%, while others have a lot of free space. Thus, nobody will be able to write anymore to that volume. So, keep your volume-ID a secret! = Software and Disk Version 5.1.3. 
Compatibility = To implement parallel scaling out we upgraded Reiser4 code base with the following new plugins: 1) "asymmetric" volume plugin (new interface); 2) "fsx32" distribution plugin (new interface); 3) "striped" file plugin (existing interface); 4) "extent41" item plugin (existing interface); 5) "format41" disk format plugin (existing interface). In the best traditions we increment version numbers. The old disk and software version was 4.0.2. "Minor" number (2) is incremented because of (1-4). "Major" number (0) is incremented because of (5) and changes in the format super-block. "Principal" number (4) is incremented because of changes in master super-block. For more details about compatibility see [https://reiser4.wiki.kernel.org/index.php/Reiser4_development_model this] Old reiser4 partitions (of format 4.0.X) will be supported by Reiser5 kernel module. For this you need to enable option "support "Plan-A key allocation scheme" (not default), when configuring the kernel. Note that it will automatically disable support of logical volumes. Such mutual exclusiveness is due to performance reasons. Reiser4progs of software release number 5.X.Y don't support old reiser4 partitions of format 4.0.X. To fsck the last ones use reiser4progs of software release number 4.X.Y - it will exist as a separate branch. = TODO = . Upgrading FSCK to work with logical volumes; . Asymmetric LV w/ more than 1 meta-data brick per-volume; . Symmetric logical volumes (meta-data on all bricks); . 3D-snapshots of LV (snapshots with an ability to roll back not only file operations, but also volume operations); . Global (networking) logical volumes. [[category:Reiser4]] 1590e4556e0015a0f1768bd9efaadc1c44762b5c 4451 4450 2020-12-10T01:40:18Z Edward 4 /* Stripes. Fibers. Distribution, allocation and migration. Basic definitions */ Reiser4 offers a brand new method of aggregation of block devices into logical volumes on a local machine. 
It is a qualitatively new level in file systems (and operating systems) development - local volumes with '''parallel scaling out'''. Reiser4 doesn't implement its own block layer like ZFS etc. In our approach scaling out is performed by file system means, rather than by block layer means. The flow of IO-requests issued against each device is controlled by user. To add a device to a logical volume with parallel scaling out, you first need to format that device - this is the difference between parallel and non-parallel scaling at first glance. The principal difference between parallel and non-parallel scaling out will be discussed below. Systems with parallel scaling out provide better scalability and resolve a number of problems inherent to non-parallel ones. In logical volumes with parallel scaling out devices of smaller size and(or) throughput don't become a "bottlenecks", as it happens e.g. in RAID-0 and its popular modifications. = Fundamental shortcomings of logical volumes composed by block-layer means = 1. Local file systems don't take participation in scaling out. They just face a huge virtual block device, for which they need to maintain a free space map. Such maps grow as the volume fills with data. It results in increasing latency on free blocks allocation and consequently in essential performance drop on large volumes which are almost full. 2. Loss of disk resources (space and throughput) on logical volumes composed of devices with different physical and geometric parameters (because of poor translations provided by "classic" RAID levels). Low-performance devices become a bottleneck in RAID arrays Attempts to replace RAID levels with better algorithms lead to inevitable and unfixable degradation of logical volumes. Indeed, defragmentation tools work only on the space of virtual disk addresses. If you use classic RAID levels, then everything is fine here: reducing fragmentation on virtual device always results in reducing fragmentation on physical ones. 
However, if you use more sophisticated translations to save disk space and bandwidth, then fragmentations on real devices tends to accumulate, and you are not able to reduce it just defragmenting the virtual device. Note that the interest is always real devices - no one actually cares what happens on virtual ones. 3. With only block layer means it is impossible to build heterogeneous storage composed of devices of different nature. You are not able to use different approaches to devices-components of the same logical volume (e.g. defragment only rotational drives, issue discard requests only for solid state drives, etc). 4. It is impossible to efficiently implement data migration (and, hence, data tiering) on logical volumes composed by block-layer means. = The previous art = Previously, there was only one method for scaling out local volumes - by block layer means. That is, file system deals only with virtual disk addresses (allocation, defragmentation, etc), and the block layer translates virtual disk addresses to real ones and backward. The most common complaint is about performance drop on such logical volumes, which are large and more than 70% full. Mostly it is related to disk space allocators, which are, to put it mildly, not excellent, and introduce big latency when searching for a free blocks on extra-large volumes. Moreover, nothing better has been invented for the past 30 years. Also, it easily may be that the best algorithms for free space management simply do not exist. Some file systems (ZFS and like) implement their own block layers. It helps to implement a failover, however, the mentioned problem doesn't disappear - if the block layer does its job very well, then the file system, again, faces a huge logical block device, which is hard to handle. Significant progress in scaling out was made by parallel network file systems (GPFS, Lustre, etc). However it was unclear, how to apply their technologies to a local FS. 
Mostly, this is because local file systems don't have the luxury of "backend storage" that network ones do. All a local FS has is an extremely poor interface of interaction with the block layer: in Linux, for example, a local FS can only compose and issue an IO request against some buffer (page). In other words, it was unclear what a "local parallel file system" even is.

= Our approach. O(1) space allocator =

Around 2010 we realized that the first approach (implementing an own block layer inside a local FS) is totally wrong. Instead, we need to pay attention to parallel network file systems and adopt their methods. However, as mentioned, there is nothing even close to a direct analogy, which means that for a local FS we need to design "parallelism" from scratch. The same goes for distribution algorithms: we are totally unhappy with the existing ones. Of course, you can deploy a network FS on the same local machine over a number of block devices, but that is nothing serious. We state that a serious analogue can be defined and implemented in a properly designed local FS - meet Reiser5.

The basic idea is pretty simple: do not mess with large free space maps (whose sizes depend on the volume size). Instead, manage many small ones of limited size. At any moment the file system should be able to pick a proper small space map and work only with it. Then, for any logical volume, as big as you want, search time in such a map is bounded by some value that doesn't depend on the logical volume size. For this reason we call it an O(1) space allocator. The simplest way is to maintain one space map per block device composing the logical volume. If some device is too large, simply split it into a number of partitions to make sure that no space map exceeds some upper limit. Thus, users should also put in some effort from their side to keep the space allocator O(1).
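The idea above can be sketched in a few lines. This is purely illustrative (the class and method names are invented for this sketch, not the actual Reiser5 interfaces): each brick keeps its own bounded free-space map, allocation first picks a brick and then searches only that brick's small map, so the search cost is capped regardless of the total volume size.

```python
# Illustrative sketch (not real Reiser5 code): one small, bounded
# free-space map per brick, so allocation never scans a volume-sized map.

class BrickAllocator:
    """Free-space bitmap of a single brick. Its size is capped by
    MAX_BLOCKS no matter how large the whole logical volume is."""
    MAX_BLOCKS = 1 << 20   # hypothetical per-map upper limit

    def __init__(self, nr_blocks):
        assert nr_blocks <= self.MAX_BLOCKS  # split larger devices
        self.free = [True] * nr_blocks

    def alloc(self):
        # Search cost is bounded by MAX_BLOCKS, not by volume size.
        for blk, is_free in enumerate(self.free):
            if is_free:
                self.free[blk] = False
                return blk
        return None        # this brick is full

class Volume:
    def __init__(self, brick_sizes):
        self.bricks = [BrickAllocator(n) for n in brick_sizes]

    def alloc_block(self, brick_id):
        # Distribution has already chosen the brick; allocation touches
        # only that brick's small map.
        return (brick_id, self.bricks[brick_id].alloc())

vol = Volume([4, 4])
print(vol.alloc_block(1))  # → (1, 0)
print(vol.alloc_block(1))  # → (1, 1)
```

Splitting oversized devices into partitions, as the text recommends, corresponds here to the `assert` that caps every map at `MAX_BLOCKS`.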
= Parallel scaling out as disk resources conservation. Definitions and examples =

Here we consider an abstract subsystem S of the operating system managing a logical volume composed of one or more removable components.

''Definition 1''. We say that S '''saves disk space and throughput''' of the logical volume if

1) its data capacity is the sum of the data capacities of its components;

2) its disk bandwidth is the sum of the disk bandwidths of its components.

We say that a LV managed by such a system is with '''parallel scaling out''' (PSO).

There is a good analogy for understanding PSO: imagine that it rains and you put out several cylindrical buckets with different-sized holes to collect water. In this analogy raindrops represent IO requests and the set of buckets represents a logical volume. The amount of water falling into each bucket is proportional to the area of its hole (considered as throughput). All buckets fill with water evenly and fairly: if one bucket is full, then the other ones are also full. Note that a non-cylindrical bucket shape would likely break the fairness of water distribution between them, so PSO wouldn't take place in that case.

In practice, however, IO systems are more complicated: IO requests are distributed, queued, etc. And conservation of disk resources usually doesn't hold: the disk bandwidth of a logical volume turns out to be smaller than the sum of the bandwidths of its components. Nevertheless, if the loss of resources is small and doesn't increase with the growth of the volume, then we say that such a system features '''parallel scaling out'''. In complex IO systems the "leak" of disk bandwidth has a complex nature and can happen in any subsystem: in the file system, in the block layer, etc. The loss can also be caused by interface properties, etc.
The fundamental reason for almost all resource leaks is that the mentioned subsystems were poorly designed (because better algorithms were not known at the time, or for other reasons). The classic example of disk space and throughput loss is RAID arrays. Linear RAID saves disk space but always drops disk bandwidth. RAID-0 composed of devices of different size and bandwidth drops both the disk space and the disk bandwidth of the resulting logical volume. The same goes for modifications like RAID-5. In all the mentioned examples the loss of disk bandwidth is caused by poor algorithms (specifically, by the fact that IO requests are directed to each component in the wrong proportions).

''Definition 2''. A file system managing a logical volume is said to be with '''parallel scaling out''' if it saves the disk space and bandwidth of that logical volume; in other words, if it doesn't drop the mentioned disk resources.

Note that the file system is only a part of an IO subsystem, and it can easily happen that the file system saves disk resources while the system as a whole does not - for example, because of a poorly designed block layer that puts IO requests issued for different block devices into the same queue on a local machine, etc.

As an example, let's calculate the disk bandwidth of a logical volume composed of 2 devices, the first with disk bandwidth 200 MB/sec, the second with 300 MB/sec. We'll consider 3 systems: in the first the devices compose a linear RAID, in the second a striped RAID (RAID-0), and in the third they are managed by a file system with parallel scaling out.

Linear RAID distributes IO requests unevenly: first we write to the first device, and once it is full, we write to the second one. The disk bandwidth of a linear RAID is defined by the throughput of the device currently being written to, so it is never more than the throughput of the faster device, i.e. 300 MB/sec.

RAID-0 distributes IO requests evenly (but not fairly).
In the same interval of time the same number N/2 of IO requests will be issued against each device. The first device writes its share in N/400 sec, the second in N/600 sec. The first device is slower, so we must wait N/400 sec for all N IO requests to be written to the array. Thus the throughput of RAID-0 in our case is N/(N/400) = 400 MB/sec.

An FS with parallel scaling out distributes IO requests evenly ''and'' fairly. In the same interval of time the number of blocks issued against each device is N*C, where C is the relative throughput of the device. The relative throughput of the first device is 200/(200+300) = 0.4; of the second, 300/(200+300) = 0.6. The portions of IO requests issued to the devices are written in parallel in the same time: 0.4N/200 = 0.6N/300 sec. Therefore the throughput of our logical volume in this case is N/(0.4N/200) = 500 MB/sec.

The resulting throughput table:

* Linear RAID: <300 MB/sec
* RAID-0: 400 MB/sec
* Parallel scaling out FS: 500 MB/sec

By the definitions above, no local file system built on top of RAID/LVM possesses parallel scaling out (first, because RAID and LVM don't save disk resources; second, because the latency introduced by the free space allocator grows with the volume). For the same reasons, no local FS that implements its own block layer (ZFS, Btrfs, etc) possesses parallel scaling out. Note that any network FS built on top of two or more local FSs managing simple partitions as a backend does save disk resources.

= Overhead of parallelism for local FS =

As mentioned above, the characteristic feature of any FS with PSO is that before adding a device to a logical volume you should format it. Of course, this adds some overhead to the system; however, that overhead is not dramatically large. Specifically, per the reiser4 format40 disk format specification, the disk overhead is 80K at the beginning of each component device.
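The arithmetic of this example can be checked mechanically. A small sketch (devices are given as a list of bandwidths in MB/sec; the function names are ours, chosen for illustration):

```python
# Reproduces the worked example: two devices of 200 and 300 MB/sec.

def raid0_throughput(bws):
    # RAID-0 gives every device an equal share of the N requests,
    # so completion time is dictated by the slowest device:
    # N / ((N / len(bws)) / min(bws)) = len(bws) * min(bws).
    return len(bws) * min(bws)

def pso_throughput(bws):
    # PSO shares requests in proportion to relative throughput, so all
    # devices finish simultaneously and bandwidths simply add up.
    return sum(bws)

print(raid0_throughput([200, 300]))  # → 400
print(pso_throughput([200, 300]))    # → 500
# Linear RAID writes to one device at a time: at most max(bws) = 300.
```

Note how RAID-0 is dragged down by the slower device, while the fair split used by PSO lets the bandwidths add up.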
Next, for each device Reiser5 reads the on-disk super-block and loads it into memory. Thus the memory overhead is one pinned in-memory super-block (~500 bytes) per component device of a logical volume. That is, a logical volume composed of one million devices will take ~500M of memory. I think a person maintaining such a volume will be able to find $30 for an additional memory card. That overhead is the single disadvantage of an FS with PSO; at least, we don't know of others.

= Asymmetric logical volumes. Data and meta-data bricks =

So, any logical volume with parallel scaling out is composed of block devices formatted by the mkfs utility. Such a device has a special name: '''brick''', or '''storage subvolume''' of a logical volume. To begin with, we have implemented the simplest approach, where meta-data is located on dedicated block devices - we call them '''meta-data bricks'''. Recall that in reiser4 the notion of "meta-data" includes all kinds of items (keyed records in the storage tree), while "data" means unformatted blocks pointed to by "extent pointers"; such unformatted nodes are used to store the bodies of regular files. Meta-data bricks are allowed to contain unformatted data blocks. '''Data bricks''' contain only unformatted data blocks. For obvious reasons such logical volumes are called "asymmetric".

= Stripes. Fibers. Distribution, allocation and migration. Basic definitions =

A '''stripe''' is a logical unit of distribution, that is, a minimal object no parts of which can be stored on different bricks. A set of distribution units that went to the same brick is called a '''fiber'''.

Comment: in the previous art, fibers were called stripes (in the case of RAID-0), and logical units of distribution didn't have a special name; for a number of adjacent sectors forming such a unit the notion of "stripe width" was used.

A '''data stripe''' is a logical unit of distribution in a file.
'''Meta-data striping''' can also be defined, but we don't consider it here for simplicity. A '''file system block''' is, as usual, an allocation unit on some brick.

A stripe is said to be '''allocated''' if all its logical parts got disk addresses on some brick. From these definitions it directly follows that a file system block cannot contain more than one stripe; on the other hand, an allocated stripe can occupy many blocks. For any file system block its '''full address''' in a logical volume is defined as the pair (brick-ID, disk-address). A stripe is said to be '''distributed''' if it got the first component (brick-ID) of its full address in the logical volume. A stripe is said to be '''migrated''' if its old disk addresses got released and new ones (possibly on another brick) got allocated.

The core difference between parallel and non-parallel scaling out in terms of distribution and allocation: in systems with parallel scaling out, any stripe first gets distributed, then allocated. In systems with non-parallel scaling out it is the other way around: any stripe first gets allocated, then distributed. An example of a system with non-parallel scaling out is any local FS built on top of a RAID-0 array: such an FS first allocates a virtual disk address for a logical block, then the block layer assigns a real device-ID and translates the virtual address to a real one.

= Data distribution and migration. Fiber-Striping. Burst Buffers =

Distribution defines which component device of a logical volume the IO request composed for a dirty buffer (page) will be issued against. In file systems without PSO the "destination" device is always derived from the virtual disk address allocated for that request: e.g. for RAID-0 the ID of the destination device is (N % M), where N is the virtual address (block number) allocated by the file system and M is the number of disks in the array.
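The two orderings can be contrasted in miniature. In this sketch the names and the trivial modulo "distribution" are illustrative stand-ins only (reiser4's actual distribution is the fsx32 plugin, which is not shown here):

```python
M = 4  # number of disks (RAID-0) / bricks (PSO)

def raid0_place(virtual_block):
    # Allocate first, distribute second: the FS hands out a virtual
    # block number; the block layer then *derives* the device ID and
    # the real address from it.
    return (virtual_block % M, virtual_block // M)

next_free = [0] * M  # toy per-brick allocators

def pso_place(stripe_id):
    # Distribute first, allocate second: choose the brick, then ask
    # that brick's own allocator for a disk address.
    brick_id = stripe_id % M          # stand-in distribution step
    addr = next_free[brick_id]        # per-brick allocation step
    next_free[brick_id] += 1
    return (brick_id, addr)

print(raid0_place(9))  # → (1, 2): device derived from the address
print(pso_place(9))    # → (1, 0): address chosen on the brick itself
```

In the PSO case the brick choice is independent of any global address space, which is exactly what lets each brick keep its own small O(1) space map.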
In our approach (the O(1) space allocator) we allocate disk addresses on each physical device independently, so for every IO request we first need to assign a destination device, and then ask the block allocator managing that device to allocate a block number for the request. So, in our approach distribution doesn't depend on allocation.

By default, Reiser5 offers distribution based on algorithms (so-called '''fiber-striping''') invented by Eduard Shishkin (patented). With our algorithms all your data is distributed evenly and fairly among all component devices of the logical volume: the portion of IO requests issued against each device equals the relative capacity of that device assigned by the user. The operation of adding/removing a device to/from a logical volume automatically invokes data migration, so that the resulting distribution is also fair. The portion of migrated data always equals the relative capacity of the added/removed device, and the speed of data migration is mostly determined by the throughput of the device being added/removed. Alternatively, reiser4 allows users to control data distribution and migration themselves. An important application of distribution and migration is '''data tiering''' in the HPC area, as so-called '''Burst Buffers''' (a dump of "hot data" to a high-performance proxy device with its subsequent migration to "persistent storage" in background mode). In all cases the file system memorizes stripe locations.

= Atomicity of volume operations =

Almost all volume operations (adding/removing a brick, changing brick capacities, etc) involve re-balancing (i.e. massive migration of data blocks), so it is technically difficult to implement full atomicity of such operations. Instead, we issue 2 checkpoints (the first before re-balancing, the second after) and handle 2 cases depending on where, in relation to those points, the volume operation was interrupted.
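The invariant "the portion of migrated data equals the relative capacity of the added device" follows directly from fairness: after the addition, the new brick must hold a share of all data proportional to its relative capacity, and that share is exactly what has to move. A one-line illustration (the helper function is hypothetical, not a reiser4 interface):

```python
def migrated_fraction(old_capacities, new_capacity):
    # A fair distribution places on the new brick a share of all data
    # equal to its relative capacity - and that share is what migrates.
    return new_capacity / (sum(old_capacities) + new_capacity)

# Two 1 TB bricks; adding a 2 TB brick migrates half of the data:
print(migrated_fraction([1, 1], 2))  # → 0.5
```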
In the first case the user should repeat the operation; in the second case the user should complete the operation (in background mode) using the volume.reiser4 utility. See the administration guide on reiser4 logical volumes for details.

= Limitations on asymmetric logical volumes =

Maximal number of bricks in a logical volume:

* in the "builtin" distribution mode - 2^32
* in the "custom" distribution mode - 2^64

In the "builtin" distribution mode no 2 bricks of the same logical volume can differ in size by more than 2^19 (~half a million) times. For example, your logical volume cannot contain both 1M and 2T bricks.

Maximal number of stripe pointers held by one 4K meta-data block: 75 (for the node40 format). Maximal number of data blocks served by 1 meta-data block: 75*S, where S is the stripe width in file system blocks. For example, for 128K stripes and 4K blocks (S=32), one meta-data block can serve at most 75*32 = 2400 data blocks. In particular, when all bricks are of equal capacity, this means that one meta-data brick can serve at most 2400 data bricks.

For the best quality of "builtin" distribution it is recommended that:

a) the stripe size is not larger than 1/10000 of the total volume size;

b) the number of bricks in your logical volume is a power of 2 (i.e. 2, 4, 8, 16, etc). If you cannot afford that, then make sure that the number of hash space segments (a property of your logical volume, which can be increased online) is not smaller than 100 * number-of-bricks.

No more than one volume operation on the same logical volume can be executed at a time. If some volume operation is not completed, attempts to execute others will return an error (EBUSY).

= Security issues =

The "builtin" distribution combines random and deterministic methods. It is "salted" with the volume-ID, which is known only to root.
Once it is compromised (revealed), the logical volume can be subjected to a "free space attack": with a known volume-ID, an attacker (a non-privileged user) can fill some data brick up to 100% while the others still have plenty of free space, so that nobody can write to the volume anymore. So, keep your volume-ID secret!

= Software and Disk Version 5.1.3. Compatibility =

To implement parallel scaling out we upgraded the Reiser4 code base with the following new plugins:

1) "asymmetric" volume plugin (new interface);

2) "fsx32" distribution plugin (new interface);

3) "striped" file plugin (existing interface);

4) "extent41" item plugin (existing interface);

5) "format41" disk format plugin (existing interface).

In the best traditions we increment version numbers. The old disk and software version was 4.0.2. The "minor" number (2) is incremented because of (1-4). The "major" number (0) is incremented because of (5) and changes in the format super-block. The "principal" number (4) is incremented because of changes in the master super-block. For more details about compatibility see [https://reiser4.wiki.kernel.org/index.php/Reiser4_development_model this]

Old reiser4 partitions (of format 4.0.X) will be supported by the Reiser5 kernel module. For this you need to enable the option "support Plan-A key allocation scheme" (not the default) when configuring the kernel. Note that it automatically disables support of logical volumes; this mutual exclusiveness is for performance reasons. Reiser4progs of software release number 5.X.Y don't support old reiser4 partitions of format 4.0.X. To fsck those, use reiser4progs of software release number 4.X.Y, which will exist as a separate branch.

= TODO =

* Upgrading FSCK to work with logical volumes;
* Asymmetric LV with more than 1 meta-data brick per volume;
* Symmetric logical volumes (meta-data on all bricks);
* 3D-snapshots of LV (snapshots with an ability to roll back not only file operations, but also volume operations);
Global (networking) logical volumes. [[category:Reiser4]] 67062721e5825925f174aef1d3c99f68ea118450 4450 4436 2020-12-10T01:36:55Z Edward 4 /* Stripes. Fibers. Distribution, allocation and migration. Basic definitions */ Reiser4 offers a brand new method of aggregation of block devices into logical volumes on a local machine. It is a qualitatively new level in file systems (and operating systems) development - local volumes with '''parallel scaling out'''. Reiser4 doesn't implement its own block layer like ZFS etc. In our approach scaling out is performed by file system means, rather than by block layer means. The flow of IO-requests issued against each device is controlled by user. To add a device to a logical volume with parallel scaling out, you first need to format that device - this is the difference between parallel and non-parallel scaling at first glance. The principal difference between parallel and non-parallel scaling out will be discussed below. Systems with parallel scaling out provide better scalability and resolve a number of problems inherent to non-parallel ones. In logical volumes with parallel scaling out devices of smaller size and(or) throughput don't become a "bottlenecks", as it happens e.g. in RAID-0 and its popular modifications. = Fundamental shortcomings of logical volumes composed by block-layer means = 1. Local file systems don't take participation in scaling out. They just face a huge virtual block device, for which they need to maintain a free space map. Such maps grow as the volume fills with data. It results in increasing latency on free blocks allocation and consequently in essential performance drop on large volumes which are almost full. 2. Loss of disk resources (space and throughput) on logical volumes composed of devices with different physical and geometric parameters (because of poor translations provided by "classic" RAID levels). 
Low-performance devices become a bottleneck in RAID arrays Attempts to replace RAID levels with better algorithms lead to inevitable and unfixable degradation of logical volumes. Indeed, defragmentation tools work only on the space of virtual disk addresses. If you use classic RAID levels, then everything is fine here: reducing fragmentation on virtual device always results in reducing fragmentation on physical ones. However, if you use more sophisticated translations to save disk space and bandwidth, then fragmentations on real devices tends to accumulate, and you are not able to reduce it just defragmenting the virtual device. Note that the interest is always real devices - no one actually cares what happens on virtual ones. 3. With only block layer means it is impossible to build heterogeneous storage composed of devices of different nature. You are not able to use different approaches to devices-components of the same logical volume (e.g. defragment only rotational drives, issue discard requests only for solid state drives, etc). 4. It is impossible to efficiently implement data migration (and, hence, data tiering) on logical volumes composed by block-layer means. = The previous art = Previously, there was only one method for scaling out local volumes - by block layer means. That is, file system deals only with virtual disk addresses (allocation, defragmentation, etc), and the block layer translates virtual disk addresses to real ones and backward. The most common complaint is about performance drop on such logical volumes, which are large and more than 70% full. Mostly it is related to disk space allocators, which are, to put it mildly, not excellent, and introduce big latency when searching for a free blocks on extra-large volumes. Moreover, nothing better has been invented for the past 30 years. Also, it easily may be that the best algorithms for free space management simply do not exist. Some file systems (ZFS and like) implement their own block layers. 
It helps to implement a failover, however, the mentioned problem doesn't disappear - if the block layer does its job very well, then the file system, again, faces a huge logical block device, which is hard to handle. Significant progress in scaling out was made by parallel network file systems (GPFS, Lustre, etc). However it was unclear, how to apply their technologies to a local FS. Mostly, it is because local file systems don't have such luxury like "backend storage" as the network ones do. What local FS does have - is only extremely poor interface of interaction with the block layer. For example, in Linux local FS can only compose and issue an IO request against some buffer (page). In other words, it was unclear, what a "local parallel file system" is. = Our approach. O(1) space allocator = In ~2010 we had realized that the first approach (implementation an own block layer inside a local FS) is totally wrong. Instead we need to pay attention to parallel network file systems to adopt their methods. However, as I mentioned, there is no something even close to a direct analogy - it means that for local FS we need to design "parallelism" from scratch. The same about distribution algorithms - we are totally unhappy with existing ones. Of course, you can deploy a networking FS on the same local machine for a number of block devices, but it will be something not serious. We state that a serious analogy can be defined and implemented in properly designed local FS - meet Reiser5. The basic idea is pretty simple - to not mess with large free space maps (whose sizes depend on the volume size). Instead, we need to manage many small ones of limited size. At any moment the file system should be able to pick up a proper such small space map, and work only with it. Needless to say, that for any logical volume, which is as big as you want, search time in a such map will be also limited by some value, which doesn't depend on logical volume size. 
For this reason, we'll call it O(1) - space allocator. The simplest way is to maintain one space map per each block device, which is a component of the logical volume. If some device is too large, simply split it into a number of partitions to make sure that any space map does not exceed some upper limit. Thus, users also should put some efforts from their side to make the space allocator be O(1). = Parallel scaling out as disk resources conservation. Definitions and examples = Here we'll consider an abstract subsystem S of the operating system managing a logical volume composed of one, or more removable components. ''Definition 1''. We'll say that S '''saves disk space and throughput''' of the logical volume, if 1) its data capacity is a sum of data capacities of its components 2) its disk bandwidth is a sum of disk bandwidths of its components We'll say that LV managed by such system is with '''parallel scaling out''' (PSO). There is a good analogy to understand the feature of PSO: imagine that it rains and you put several cylindrical buckets with different sized holes for collecting water. In this example raindrops represent IO-requests, the set of bucket represents a logical volume. Note that amount of water felt to each bucket is proportional to the square of its hole (considered as throughput). In this example all buckets are filled with water evenly and fairly: if one bucket is full, then other ones is also full. Note, that non-cylindrical form of buckets will likely break fairness of water distribution between them, so that PSO won't take place in this case. In practice, however, IO-systems are more complicated: IO requests are distributed, queued, etc. And conservation of disk resources usually doesn't take place: disk bandwidth of any logical volume turns out to be smaller than the sum of ones of its components. 
Nevertheless, if the loss of resources is small, and doesn't increase with the growth of the volume, then we'll say that such system features '''parallel scaling out'''. In complex IO-systems "leak" of disk bandwidth has complex nature and can happen on every its subsystem: on the file system, on the block layer, etc. The loss can also be caused by interface properties, etc. The fundamental reason of almost all resource leaks is that mentioned subsystems were poorly designed (because better algorithms were not known at that moment, or because of other reasons). The classic example of disk space and throughput loss is RAID arrays. Linear RAID saves disk space, but always drops disk bandwidth. RAID-0, composed of different size and bandwidth devices, drops both, disk space and disk bandwidth of the resulted logical volume. The same is for their modifications like RAID-5. In all mentioned examples the loss of disk bandwidth is caused by poor algorithms (specifically, by the fact that IO requests are directed to every component in wrong proportions). ''Definition 2''. A file system managing a logical volume is said to be with '''parallel scaling out''', if it saves disk space and bandwidth of that logical volume. In other words, if it doesn't drop the mentioned disk resources. Note that file system is only a part of an IO-subsystem. And it can easily happen that the file system saves disk resources, while the whole system is not. For example, because of poorly designed block layer, who puts IO requests issued for different block devices to the same queue on a local machine, etc. As an example, let's calculate disk bandwidth of a logical volume composed of 2 devices, the first of which has disk bandwidth 200M/sec, second - 300M/sec. We'll consider 3 systems: in the first one the mentioned devices compose linear RAID, in the second one - striped RAID (RAID-0), in the third one they are managed by a file system with parallel scaling out. 
Linear RAID distributes IO requests not evenly: first we write to the first device. Once it is full, we write to the second one. Disk bandwidth of linear RAID is defined by the throughput of the device we are currently writing to. Thus it always is not more than throughput of the faster device, i.e. 300 M/sec. RAID-0 distributes IO requests evenly (but not fairly). In the same interval of time the same number N/2 of IO-requests will be issued against each device. On the first device it will be written in N/400 sec. On the second device it will be written in N/600 sec. Note that the first device is slower, therefore we should wait N/400 sec for all N IO-requests to be written to the array. So throughput of RAID-0 in our case is N/(N/400) = 400 M/sec. FS with parallel scaling out distributes IO requests evenly and fairly. In the same interval of time the number of blocks issued against each device is N*C, where C is relative throughput of the device. Relative throughput of the first device is 200/(200+300) = 0.4. Of the second one - 300/(200+300) = 0.6 Portion of IO-requests issued for each device will be written in parallel in the same time 0.4N/200 = 0.6N/300 sec. Therefore, throughput of our logical volume in this case is N/(0.4N/200) = 500 M/sec. The resulted table of throughput: Linear RAID: <300 M/sec RAID-0: 400 M/sec Parallel scaling out FS 500 M/sec According to definitions above any local file system built on a top of RAID/LVM does NOT possesses parallel scaling out (first, because RAID and LVM don't save disk resources, second, because latency introduced by free space allocator grows with volume. For the same reasons any local FS, which implements its own block layer (ZFS, Btrfs, etc) does NOT possesses parallel scaling out. Note that any network FS built on a top of two or more local FS managing simple partitions as backend saves disk resources. 
= Overhead of parallelism for local FS = As we mentioned above, the characteristic feature of any FS with PSO is that before adding a device to a logical volume you should format it. Of course, it adds some overhead to the system. However, that overhead is not dramatically large. Specifically, with reiser4 disk format40 specifications the disk overhead includes 80K at the beginning of each device-component. Next, for each device Reiser5 reads on-disk super-block and loads its to memory, Thus, memory overhead includes one persistent memory super-block (~500 bytes) per each device-component of a logical volume. That is, a logical volume composed of one million devices will take ~500M of memory (pinned). I think that a person maintaining such volume will be able to find $30 for additional memory card. That overhead is a single disadvantage of FS with PSO. At least, we don't know other ones. = Аsymmetric logical volumes. Data and meta-data bricks = So, any logical volume with parallel scaling out is composed of block devices formatted by mkfs utility. Such device has a special name '''brick''', or '''storage subvolume''' of a logical volume. For the beginning we have implemented the simplest approach, when meta-data is located on dedicated block devices - we'll call them '''meta-data bricks'''. I remind that in reiser4 the notion of "meta-data" includes all kind of items (key'ed records in the storage tree). And the notion of data means unformatted blocks pointed out by "extent pointers". Such unformatted nodes are used to store bodies of regular files. Meta-data bricks are allowed to contain unformatted data blocks. '''Data bricks''' contain only unformatted data blocks. For obvious reasons such logical volumes are called "asymmetric". = Stripes. Fibers. Distribution, allocation and migration. Basic definitions = '''Stripe''' is a logical unit of distribution, that is a minimal object, any parts of which can not be stored on different bricks. 
A set of distribution units got to the same brick is called '''fiber'''. Comment. In the previous art fibers were called stripes (case of RAID-0), and logical units of distribution didn't have a special name. For a number of adjacent sectors forming such a unit a notion of "stripe width" was used. '''Data stripe''' is a logical block of some size at some offset in a file. '''Meta-data striping''' also can be defined, but we don't consider it here for simplicity. '''File system block''' is, as usual, an allocation unit on some brick. Stripe is said '''allocated''', if all its logical parts got disk addresses on some brick. From these definitions it directly follows that file system block can not contain more than one stripe. On the other hand, an allocated stripe can occupy many blocks. For any file system block its '''full address''' in a logical volume is defined as a pair (brick-ID, disk-address). Stripe is said '''distributed''', if it got the first component (brick-ID) of its full address in the logical volume. Stripe is said '''migrated''', if its old disk addresses got released, and new ones (possibly on another brick) got allocated. The core difference between parallel and non-parallel scaling out in terms of distribution and allocation: In systems with parallel scaling out any stripe firstly gets distributed, then allocated. In systems with non-parallel scaling out it is other way around - any stripe firstly gets allocated, then distributed. An example of a system with not parallel scaling out is any local FS built a top of RAID-0 array. Indeed, at first, such FS allocates a virtual disk address for a logical block, then the block layer assigns a real device-ID and translates that virtual address to real one. = Data distribution and migration. Fiber-Striping. Burst Buffers = Distribution defines what device-component of a logical volume an IO request composed for a dirty buffer(page) will be issued against. 
In file systems without PSO, the "destination" device is always determined by the virtual disk address allocated for the request. E.g. for RAID-0 the ID of the destination device is defined as (N % M), where N is the virtual address (block number) allocated by the file system, and M is the number of disks in the array. In our approach (the O(1) space allocator) we allocate disk addresses on each physical device independently, so for every IO request we first need to assign a destination device, then ask the block allocator managing that device to allocate a block number for the request. So, in our approach distribution doesn't depend on allocation.

By default Reiser5 offers distribution based on algorithms (so-called '''fiber-striping''') invented by Eduard Shishkin (patented stuff). With our algorithms all your data will be distributed evenly and fairly among all device-components of the logical volume. That means the portion of IO requests issued against each device is equal to the relative capacity of that device, as assigned by the user. The operation of adding/removing a device to/from a logical volume automatically invokes data migration, so that the resulting distribution is also fair. The portion of migrated data is always equal to the relative capacity of the added/removed device, and the speed of data migration is mostly determined by the throughput of that device.

Alternatively, Reiser4 allows users to control data distribution and migration themselves. An important application of distribution and migration is '''data tiering''' in the HPC area, in the form of so-called '''Burst Buffers''' (dumping "hot" data to a high-performance proxy device, followed by its migration to "persistent storage" in background mode). In all cases the file system keeps track of stripe locations.

= Atomicity of volume operations =

Almost all volume operations (adding/removing a brick, changing brick capacities, etc.) involve re-balancing (i.e.
massive migration of data blocks), so it is technically difficult to make such operations fully atomic. Instead, we issue 2 checkpoints (the first before re-balancing, the second after), and handle 2 cases depending on where, relative to those checkpoints, the volume operation was interrupted. In the first case the user should repeat the operation; in the second case the user should complete the operation (in background mode) using the volume.reiser4 utility. See the administration guide on reiser4 logical volumes for details.

= Limitations on asymmetric logical volumes =

Maximal number of bricks in a logical volume:
* in the "builtin" distribution mode - 2^32
* in the "custom" distribution mode - 2^64

In the "builtin" distribution mode any 2 bricks of the same logical volume can not differ in size by more than 2^19 (~500 thousand) times. For example, your logical volume can not contain both a 1M and a 2T brick.

Maximal number of stripe pointers held by one 4K meta-data block: 75 (for the node40 format). Maximal number of data blocks served by 1 meta-data block: 75*S, where S is the stripe width in file system blocks. For example, for 128K stripes and 4K blocks (S=32) one meta-data block can serve at most 75*32 = 2400 data blocks. In particular, when all bricks are of equal capacity, this means that one meta-data brick can serve at most 2400 data bricks.

For the best quality of "builtin" distribution it is recommended that:
a) the stripe size is not larger than 1/10000 of the total volume size;
b) the number of bricks in your logical volume is a power of 2 (i.e. 2, 4, 8, 16, etc). If you cannot afford that, then make sure that the number of hash space segments (a property of your logical volume, which can be increased online) is not smaller than 100 * number-of-bricks.

No more than one volume operation on the same logical volume can be executed at a time.
= Security issues = "Builtin" distribution combines random and deterministic methods. It is "salted" with volume-ID, which is known only to root. Once it is compromised (revealed), the logical volume can be subjected to "free space attack" - with known volume-ID an attacker (non-privileged user) will be able to fill some data brick up to 100%, while others have a lot of free space. Thus, nobody will be able to write anymore to that volume. So, keep your volume-ID a secret! = Software and Disk Version 5.1.3. Compatibility = To implement parallel scaling out we upgraded Reiser4 code base with the following new plugins: 1) "asymmetric" volume plugin (new interface); 2) "fsx32" distribution plugin (new interface); 3) "striped" file plugin (existing interface); 4) "extent41" item plugin (existing interface); 5) "format41" disk format plugin (existing interface). In the best traditions we increment version numbers. The old disk and software version was 4.0.2. "Minor" number (2) is incremented because of (1-4). "Major" number (0) is incremented because of (5) and changes in the format super-block. "Principal" number (4) is incremented because of changes in master super-block. For more details about compatibility see [https://reiser4.wiki.kernel.org/index.php/Reiser4_development_model this] Old reiser4 partitions (of format 4.0.X) will be supported by Reiser5 kernel module. For this you need to enable option "support "Plan-A key allocation scheme" (not default), when configuring the kernel. Note that it will automatically disable support of logical volumes. Such mutual exclusiveness is due to performance reasons. Reiser4progs of software release number 5.X.Y don't support old reiser4 partitions of format 4.0.X. To fsck the last ones use reiser4progs of software release number 4.X.Y - it will exist as a separate branch. = TODO = . Upgrading FSCK to work with logical volumes; . Asymmetric LV w/ more than 1 meta-data brick per-volume; . 
* Symmetric logical volumes (meta-data on all bricks);
* 3D-snapshots of LV (snapshots with the ability to roll back not only file operations, but also volume operations);
* Global (networking) logical volumes.

[[category:Reiser4]]

674a210551c0eccb25aa12a93e59cccb91d70360 4436 4391 2020-11-12T13:05:06Z Edward 4 /* TODO */

Reiser4 offers a brand new method of aggregating block devices into logical volumes on a local machine. It is a qualitatively new level in file system (and operating system) development: local volumes with '''parallel scaling out'''. Reiser4 doesn't implement its own block layer like ZFS etc. In our approach scaling out is performed by file system means, rather than by block layer means. The flow of IO requests issued against each device is controlled by the user.

To add a device to a logical volume with parallel scaling out, you first need to format that device - this is the difference between parallel and non-parallel scaling out at first glance. The principal difference between the two will be discussed below. Systems with parallel scaling out provide better scalability and resolve a number of problems inherent to non-parallel ones. In logical volumes with parallel scaling out, devices of smaller size and/or throughput don't become "bottlenecks", as happens e.g. in RAID-0 and its popular modifications.

= Fundamental shortcomings of logical volumes composed by block-layer means =

1. Local file systems don't participate in scaling out. They just face a huge virtual block device, for which they need to maintain a free space map. Such maps grow as the volume fills with data, which results in increasing latency of free-block allocation and, consequently, an essential performance drop on large volumes that are almost full.

2.
Loss of disk resources (space and throughput) on logical volumes composed of devices with different physical and geometric parameters, because of the poor translations provided by "classic" RAID levels: low-performance devices become a bottleneck in RAID arrays. Attempts to replace RAID levels with better algorithms lead to inevitable and unfixable degradation of logical volumes. Indeed, defragmentation tools work only in the space of virtual disk addresses. If you use classic RAID levels, then everything is fine here: reducing fragmentation on the virtual device always reduces fragmentation on the physical ones. However, if you use more sophisticated translations to save disk space and bandwidth, then fragmentation on the real devices tends to accumulate, and you are not able to reduce it just by defragmenting the virtual device. Note that the real devices are always what matters - no one actually cares what happens on virtual ones.

3. With block layer means alone it is impossible to build heterogeneous storage composed of devices of different nature. You are not able to apply different policies to device-components of the same logical volume (e.g. defragment only rotational drives, issue discard requests only for solid state drives, etc).

4. It is impossible to efficiently implement data migration (and, hence, data tiering) on logical volumes composed by block-layer means.

= The previous art =

Previously, there was only one method for scaling out local volumes - by block layer means. That is, the file system deals only with virtual disk addresses (allocation, defragmentation, etc), and the block layer translates virtual disk addresses to real ones and back. The most common complaint about such logical volumes is the performance drop once they are large and more than 70% full. Mostly it is related to disk space allocators, which are, to put it mildly, not excellent, and introduce big latency when searching for free blocks on extra-large volumes.
Moreover, nothing better has been invented over the past 30 years, and it may well be that the best algorithms for free space management simply do not exist. Some file systems (ZFS and the like) implement their own block layers. That helps to implement failover; however, the mentioned problem doesn't disappear - even if the block layer does its job very well, the file system still faces a huge logical block device, which is hard to handle.

Significant progress in scaling out was made by parallel network file systems (GPFS, Lustre, etc). However, it was unclear how to apply their technologies to a local FS - mostly because local file systems don't have such a luxury as "backend storage", which the network ones do. What a local FS does have is only an extremely poor interface for interaction with the block layer: in Linux, for example, a local FS can only compose and issue an IO request against some buffer (page). In other words, it was unclear what a "local parallel file system" is.

= Our approach. O(1) space allocator =

Around 2010 we realized that the first approach (implementing one's own block layer inside a local FS) is totally wrong. Instead, we need to pay attention to parallel network file systems and adopt their methods. However, as I mentioned, there is nothing even close to a direct analogy - which means that for a local FS we need to design "parallelism" from scratch. The same holds for distribution algorithms - we are totally unhappy with the existing ones. Of course, you can deploy a network FS over a number of block devices on the same local machine, but that is not a serious solution. We state that a serious analogy can be defined and implemented in a properly designed local FS - meet Reiser5.

The basic idea is pretty simple: do not mess with large free space maps (whose sizes depend on the volume size). Instead, we need to manage many small ones of limited size.
At any moment the file system should be able to pick the proper small space map and work only with it. Then, for any logical volume, however big, the search time in such a map is bounded by some value that doesn't depend on the logical volume size. For this reason we call it an O(1) space allocator. The simplest way is to maintain one space map per block device-component of the logical volume. If some device is too large, simply split it into a number of partitions, to make sure that no space map exceeds some upper limit. Thus, users should also put in some effort from their side to keep the space allocator O(1).

= Parallel scaling out as disk resources conservation. Definitions and examples =

Here we'll consider an abstract subsystem S of the operating system managing a logical volume composed of one or more removable components.

''Definition 1''. We'll say that S '''saves disk space and throughput''' of the logical volume if:
1) its data capacity is the sum of the data capacities of its components;
2) its disk bandwidth is the sum of the disk bandwidths of its components.
We'll say that an LV managed by such a system is with '''parallel scaling out''' (PSO).

There is a good analogy for understanding PSO: imagine that it rains and you put out several cylindrical buckets with different-sized holes for collecting water. In this example raindrops represent IO requests, and the set of buckets represents a logical volume. Note that the amount of water falling into each bucket is proportional to the area of its hole (considered as its throughput). In this example all buckets are filled with water evenly and fairly: if one bucket is full, then the others are also full. Note that a non-cylindrical form of the buckets would likely break the fairness of water distribution between them, so that PSO wouldn't take place in that case. In practice, however, IO systems are more complicated: IO requests are distributed, queued, etc.
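The bucket analogy above can be put into a toy model (all numbers here are made up for the example): if intake is proportional to hole area and, as for cylindrical buckets, capacity is proportional to that same area, every bucket reaches the same fill level at the same time.

```python
# Toy model of the bucket analogy: rain is distributed among the
# buckets in proportion to hole area ("throughput"), and capacity
# plays the role of disk space. Numbers are invented for the example.
buckets = [
    {"capacity": 10.0, "area": 2.0},   # liters, dm^2
    {"capacity": 20.0, "area": 4.0},
    {"capacity": 5.0,  "area": 1.0},
]

def fill(buckets, rain):
    """Distribute `rain` liters proportionally to hole area and
    return the resulting fill fraction of every bucket."""
    total_area = sum(b["area"] for b in buckets)
    return [rain * b["area"] / total_area / b["capacity"] for b in buckets]

# Capacity is proportional to area here ("cylindrical buckets"),
# so all three fractions come out equal:
print(fill(buckets, 17.5))   # [0.5, 0.5, 0.5]
```

Making one bucket non-cylindrical (capacity out of proportion to its hole area) breaks this equality, which is the analogy's point about fairness.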
And conservation of disk resources usually doesn't take place: the disk bandwidth of a logical volume turns out to be smaller than the sum of the bandwidths of its components. Nevertheless, if the loss of resources is small and doesn't increase with the growth of the volume, then we'll say that such a system features '''parallel scaling out'''.

In complex IO systems the "leak" of disk bandwidth has a complex nature and can happen in any subsystem: in the file system, in the block layer, etc. The loss can also be caused by interface properties, etc. The fundamental reason for almost all resource leaks is that the subsystems in question were poorly designed (because better algorithms were not known at the time, or for other reasons).

The classic example of disk space and throughput loss is RAID arrays. Linear RAID saves disk space, but always drops disk bandwidth. RAID-0 composed of devices of different size and bandwidth drops both the disk space and the disk bandwidth of the resulting logical volume. The same holds for their modifications like RAID-5. In all the mentioned examples the loss of disk bandwidth is caused by poor algorithms (specifically, by the fact that IO requests are directed to the components in wrong proportions).

''Definition 2''. A file system managing a logical volume is said to be with '''parallel scaling out''' if it saves the disk space and bandwidth of that logical volume - in other words, if it doesn't drop the mentioned disk resources.

Note that the file system is only a part of an IO subsystem, and it can easily happen that the file system saves disk resources while the whole system does not - for example, because of a poorly designed block layer which puts IO requests issued for different block devices into the same queue on a local machine, etc.

As an example, let's calculate the disk bandwidth of a logical volume composed of 2 devices, the first of which has disk bandwidth 200M/sec, the second - 300M/sec.
We'll consider 3 systems: in the first one the mentioned devices compose a linear RAID, in the second one - a striped RAID (RAID-0), and in the third one they are managed by a file system with parallel scaling out.

Linear RAID distributes IO requests unevenly: first we write to the first device; once it is full, we write to the second one. The disk bandwidth of linear RAID is defined by the throughput of the device we are currently writing to. Thus it is never more than the throughput of the faster device, i.e. 300 M/sec.

RAID-0 distributes IO requests evenly (but not fairly). In the same interval of time the same number N/2 of IO requests will be issued against each device. On the first device they will be written in N/400 sec; on the second device, in N/600 sec. The first device is slower, therefore we have to wait N/400 sec for all N IO requests to be written to the array. So the throughput of RAID-0 in our case is N/(N/400) = 400 M/sec.

An FS with parallel scaling out distributes IO requests evenly and fairly. In the same interval of time the number of blocks issued against each device is N*C, where C is the relative throughput of the device. The relative throughput of the first device is 200/(200+300) = 0.4; of the second one - 300/(200+300) = 0.6. The portions of IO requests issued against the devices will be written in parallel, in the same time 0.4N/200 = 0.6N/300 sec. Therefore, the throughput of our logical volume in this case is N/(0.4N/200) = 500 M/sec.

The resulting table of throughput:
* Linear RAID: <300 M/sec
* RAID-0: 400 M/sec
* FS with parallel scaling out: 500 M/sec

According to the definitions above, any local file system built on top of RAID/LVM does NOT possess parallel scaling out (first, because RAID and LVM don't save disk resources; second, because the latency introduced by the free space allocator grows with the volume). For the same reasons any local FS which implements its own block layer (ZFS, Btrfs, etc) does NOT possess parallel scaling out.
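The arithmetic of this three-way comparison can be replayed with a short script (a sketch; the device speeds are the ones from the example above, and the three little models merely encode the reasoning just given):

```python
# Replaying the bandwidth comparison above for two devices
# with throughputs of 200 and 300 M/sec.
speeds = [200, 300]  # M/sec

def linear_raid_bw(speeds):
    # Writes go to one device at a time, so the volume's bandwidth
    # never exceeds that of the fastest device.
    return max(speeds)

def raid0_bw(speeds):
    # Even (but unfair) split: each device gets 1/n of the requests;
    # the slowest device finishes last and defines the total time.
    n = len(speeds)
    slowest_time = max((1 / n) / s for s in speeds)  # time per unit of data
    return 1 / slowest_time

def pso_bw(speeds):
    # Fair split: each device gets a share equal to its relative
    # throughput, so all devices finish at the same moment and the
    # volume's bandwidth is the plain sum of the device bandwidths.
    total = sum(speeds)
    times = [(s / total) / s for s in speeds]  # all entries are equal
    return 1 / max(times)

print(linear_raid_bw(speeds), round(raid0_bw(speeds)), round(pso_bw(speeds)))
# prints: 300 400 500
```

The unfairness of RAID-0 shows up directly in `raid0_bw`: the slower device receives the same share as the faster one, so it alone determines the completion time.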
Note that any network FS built on a top of two or more local FS managing simple partitions as backend saves disk resources. = Overhead of parallelism for local FS = As we mentioned above, the characteristic feature of any FS with PSO is that before adding a device to a logical volume you should format it. Of course, it adds some overhead to the system. However, that overhead is not dramatically large. Specifically, with reiser4 disk format40 specifications the disk overhead includes 80K at the beginning of each device-component. Next, for each device Reiser5 reads on-disk super-block and loads its to memory, Thus, memory overhead includes one persistent memory super-block (~500 bytes) per each device-component of a logical volume. That is, a logical volume composed of one million devices will take ~500M of memory (pinned). I think that a person maintaining such volume will be able to find $30 for additional memory card. That overhead is a single disadvantage of FS with PSO. At least, we don't know other ones. = Аsymmetric logical volumes. Data and meta-data bricks = So, any logical volume with parallel scaling out is composed of block devices formatted by mkfs utility. Such device has a special name '''brick''', or '''storage subvolume''' of a logical volume. For the beginning we have implemented the simplest approach, when meta-data is located on dedicated block devices - we'll call them '''meta-data bricks'''. I remind that in reiser4 the notion of "meta-data" includes all kind of items (key'ed records in the storage tree). And the notion of data means unformatted blocks pointed out by "extent pointers". Such unformatted nodes are used to store bodies of regular files. Meta-data bricks are allowed to contain unformatted data blocks. '''Data bricks''' contain only unformatted data blocks. For obvious reasons such logical volumes are called "asymmetric". = Stripes. Fibers. Distribution, allocation and migration. 
Basic definitions = '''Stripe''' is a logical unit of distribution, that is a minimal object, any parts of which can not be stored on different bricks. A set of distribution units dispatched to the same brick is called '''fiber'''. Comment. In the previous art fibers were called stripes (case of RAID-0), and logical units of distribution didn't have a special name. For a number of adjacent sectors forming such a unit a notion of "stripe width" was used. '''Data stripe''' is a logical block of some size at some offset in a file. '''Meta-data striping''' also can be defined, but we don't consider it here for simplicity. '''File system block''' is, as usual, an allocation unit on some brick. Stripe is said '''allocated''', if all its parts got disk addresses on some brick. From these definitions it directly follows that file system block can not contain more than one stripe. On the other hand, an allocated stripe can occupy many blocks. For any file system block its '''full address''' in a logical volume is defined as a pair (brick-ID, disk-address). Stripe is said '''dispatched''', if it got the first component (brick-ID) of its full address in the logical volume. Stripe is said '''migrated''', if its old disk addresses got released, and new ones (possibly on another brick) got allocated. The core difference between parallel and non-parallel scaling out in terms of distribution and allocation: In PSO-systems any stripe firstly gets distributed, then allocated. In systems with non-parallel scaling out it is other way around - any stripe firstly gets allocated, then distributed. An example is any local FS built a top of RAID-0 array. Indeed, at first, such FS allocates a virtual disk address for a logical block, then block layer assigns a real device-ID and translates that virtual address to real one. = Data distribution and migration. Fiber-Striping. 
Burst Buffers = Distribution defines what device-component of a logical volume an IO request composed for a dirty buffer(page) will be issued against. In file systems with PSO "destination" device is always defined by a virtual disk address allocated for that request. E.g. for RAID-0 ID of destination device is defined as (N % M), where N is a virtual address (block number), allocated by the file system, M is number of disks in the array. In our approach (O(1) space allocator) we allocate disk addresses on each physical device independently, so for every IO-request we first need to assign a destination device, then ask a block allocator managing that device to allocate a block number for this request. So, in our approach distribution doesn't depend on allocation. By default Reiser5 offers distribution based on algorithms (so-called '''fiber-striping''') invented by Eduard Shishkin (patented stuff). With our algorithms all your data will be distributed evenly and fairly among all devices-components of the logical volume. It means that portion of IO requests issued against each device is equal to relative capacity of that device assigned by user. Operation of adding/removing a device to/from a logical volume automatically invokes data migration, so that resulted distribution is also fair. Portion of migrated data is always equal to relative capacity of the added/removed device. The speed of data migration is mostly determined by throughput of the device to be added/removed. Alternatively, Reiser4 allows users to control data distribution and migration themselves. An important application distribution and migration find as '''data tiering''' in HPC area as so-called '''Burst Buffers''' (dump of "hot data" on high-performance proxy-device with its following migration to "persistent storage" in background mode). In all cases the file system memorizes stripes location. 
= Atomicity of volume operations = Almost all volume operations (adding/removing a brick, changing bricks capacity, etc) involve re-balancing (i.e. massive migration of data blocks), so it is technically difficult to implement full atomicity of such operations. Instead, we issue 2 checkpoints (first before re-balancing, second - after), and handle 2 cases depending on where in relation to those points the volume operation was interrupted. In the first case user should repeat the operation again, in the second case user should complete the operation (in the background mode) using volume.reiser4 utility. See administration guide on reiser4 logical volumes for details. = Limitations on asymmetric logical volumes = Maximal number of bricks in a logical volume: in the "builtin" distribution mode - 2^32 in the "custom" distribution mode - 2^64 In the "builtin" distribution mode any 2 bricks of the same logical volume can not differ in size more than 2^19 (~1 million) times. For example, your logical volume can not contain both, 1M and 2T bricks. Maximal number of stripe pointers held by one 4K-metadata block: 75 (for node40 format). Maximal number of data blocks served by 1 meta-data block: 75*S, where S is stripe width in file system blocks. For example, for 128K- stripes and 4K blocks (S=32) one meta-data block can serve not more than 2400 data blocks. In particular, when all bricks are of equal capacity, it means that one meta-data brick can serve not more than 2400 data bricks. For the best quality of "builtin" distribution it is recommended that: a) stripe size is not larger than 1/10000 of total volume size. b) number of bricks in your logical volume is a power of 2 (i.e. 2, 4, 8, 16, etc). If you cannot afford it, then make sure that number of hash space segments (a property of your logical volume, which can be increased online) is not smaller than 100 * number-of-bricks. Not more than one volume operations on the same logical volume can be executed in parallel. 
If some volume operation is not completed, then attempts to execute other ones will return error (EBUSY). = Security issues = "Builtin" distribution combines random and deterministic methods. It is "salted" with volume-ID, which is known only to root. Once it is compromised (revealed), the logical volume can be subjected to "free space attack" - with known volume-ID an attacker (non-privileged user) will be able to fill some data brick up to 100%, while others have a lot of free space. Thus, nobody will be able to write anymore to that volume. So, keep your volume-ID a secret! = Software and Disk Version 5.1.3. Compatibility = To implement parallel scaling out we upgraded Reiser4 code base with the following new plugins: 1) "asymmetric" volume plugin (new interface); 2) "fsx32" distribution plugin (new interface); 3) "striped" file plugin (existing interface); 4) "extent41" item plugin (existing interface); 5) "format41" disk format plugin (existing interface). In the best traditions we increment version numbers. The old disk and software version was 4.0.2. "Minor" number (2) is incremented because of (1-4). "Major" number (0) is incremented because of (5) and changes in the format super-block. "Principal" number (4) is incremented because of changes in master super-block. For more details about compatibility see [https://reiser4.wiki.kernel.org/index.php/Reiser4_development_model this] Old reiser4 partitions (of format 4.0.X) will be supported by Reiser5 kernel module. For this you need to enable option "support "Plan-A key allocation scheme" (not default), when configuring the kernel. Note that it will automatically disable support of logical volumes. Such mutual exclusiveness is due to performance reasons. Reiser4progs of software release number 5.X.Y don't support old reiser4 partitions of format 4.0.X. To fsck the last ones use reiser4progs of software release number 4.X.Y - it will exist as a separate branch. = TODO = . 
Upgrading FSCK to work with logical volumes; . Asymmetric LV w/ more than 1 meta-data brick per-volume; . Symmetric logical volumes (meta-data on all bricks); . 3D-snapshots of LV (snapshots with an ability to roll back not only file operations, but also volume operations); . Global (networking) logical volumes. [[category:Reiser4]] 0c32727d254b5c02452c2310613a31b259253baa 4391 4355 2020-08-16T23:33:17Z Edward 4 Reiser4 offers a brand new method of aggregation of block devices into logical volumes on a local machine. It is a qualitatively new level in file systems (and operating systems) development - local volumes with '''parallel scaling out'''. Reiser4 doesn't implement its own block layer like ZFS etc. In our approach scaling out is performed by file system means, rather than by block layer means. The flow of IO-requests issued against each device is controlled by user. To add a device to a logical volume with parallel scaling out, you first need to format that device - this is the difference between parallel and non-parallel scaling at first glance. The principal difference between parallel and non-parallel scaling out will be discussed below. Systems with parallel scaling out provide better scalability and resolve a number of problems inherent to non-parallel ones. In logical volumes with parallel scaling out devices of smaller size and(or) throughput don't become a "bottlenecks", as it happens e.g. in RAID-0 and its popular modifications. = Fundamental shortcomings of logical volumes composed by block-layer means = 1. Local file systems don't take participation in scaling out. They just face a huge virtual block device, for which they need to maintain a free space map. Such maps grow as the volume fills with data. It results in increasing latency on free blocks allocation and consequently in essential performance drop on large volumes which are almost full. 2. 
Loss of disk resources (space and throughput) on logical volumes composed of devices with different physical and geometric parameters (because of poor translations provided by "classic" RAID levels). Low-performance devices become a bottleneck in RAID arrays Attempts to replace RAID levels with better algorithms lead to inevitable and unfixable degradation of logical volumes. Indeed, defragmentation tools work only on the space of virtual disk addresses. If you use classic RAID levels, then everything is fine here: reducing fragmentation on virtual device always results in reducing fragmentation on physical ones. However, if you use more sophisticated translations to save disk space and bandwidth, then fragmentations on real devices tends to accumulate, and you are not able to reduce it just defragmenting the virtual device. Note that the interest is always real devices - no one actually cares what happens on virtual ones. 3. With only block layer means it is impossible to build heterogeneous storage composed of devices of different nature. You are not able to use different approaches to devices-components of the same logical volume (e.g. defragment only rotational drives, issue discard requests only for solid state drives, etc). 4. It is impossible to efficiently implement data migration (and, hence, data tiering) on logical volumes composed by block-layer means. = The previous art = Previously, there was only one method for scaling out local volumes - by block layer means. That is, file system deals only with virtual disk addresses (allocation, defragmentation, etc), and the block layer translates virtual disk addresses to real ones and backward. The most common complaint is about performance drop on such logical volumes, which are large and more than 70% full. Mostly it is related to disk space allocators, which are, to put it mildly, not excellent, and introduce big latency when searching for a free blocks on extra-large volumes. 
Moreover, nothing better has been invented for the past 30 years. Also, it easily may be that the best algorithms for free space management simply do not exist. Some file systems (ZFS and like) implement their own block layers. It helps to implement a failover, however, the mentioned problem doesn't disappear - if the block layer does its job very well, then the file system, again, faces a huge logical block device, which is hard to handle. Significant progress in scaling out was made by parallel network file systems (GPFS, Lustre, etc). However it was unclear, how to apply their technologies to a local FS. Mostly, it is because local file systems don't have such luxury like "backend storage" as the network ones do. What local FS does have - is only extremely poor interface of interaction with the block layer. For example, in Linux local FS can only compose and issue an IO request against some buffer (page). In other words, it was unclear, what a "local parallel file system" is. = Our approach. O(1) space allocator = In ~2010 we had realized that the first approach (implementation an own block layer inside a local FS) is totally wrong. Instead we need to pay attention to parallel network file systems to adopt their methods. However, as I mentioned, there is no something even close to a direct analogy - it means that for local FS we need to design "parallelism" from scratch. The same about distribution algorithms - we are totally unhappy with existing ones. Of course, you can deploy a networking FS on the same local machine for a number of block devices, but it will be something not serious. We state that a serious analogy can be defined and implemented in properly designed local FS - meet Reiser5. The basic idea is pretty simple - to not mess with large free space maps (whose sizes depend on the volume size). Instead, we need to manage many small ones of limited size. 
At any moment the file system should be able to pick the proper small space map and work only with it. Then for any logical volume, however big, search time in such a map is bounded by some value that doesn't depend on the logical volume size. For this reason we'll call it an O(1) space allocator.

The simplest way is to maintain one space map per block device that is a component of the logical volume. If some device is too large, simply split it into a number of partitions to make sure that no space map exceeds some upper limit. Thus, users should also put in some effort on their side to keep the space allocator O(1).

= Parallel scaling out as disk resources conservation. Definitions and examples =

Here we'll consider an abstract subsystem S of the operating system managing a logical volume composed of one or more removable components.

''Definition 1''. We'll say that S '''saves disk space and throughput''' of the logical volume, if

1) its data capacity is the sum of the data capacities of its components;

2) its disk bandwidth is the sum of the disk bandwidths of its components.

We'll say that a logical volume managed by such a system is a volume with '''parallel scaling out''' (PSO).

There is a good analogy for understanding PSO: imagine that it rains and you put out several cylindrical buckets with different-sized holes for collecting water. In this example raindrops represent IO-requests, and the set of buckets represents a logical volume. The amount of water falling into each bucket is proportional to the area of its hole (considered as throughput). All buckets are filled with water evenly and fairly: if one bucket is full, then the other ones are also full. Note that a non-cylindrical bucket shape would likely break the fairness of water distribution between them, so that PSO wouldn't take place in that case. In practice, however, IO-systems are more complicated: IO requests are distributed, queued, etc.
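The O(1) allocator described above can be sketched in a few lines of Python. This is a toy illustration under assumptions of ours - all names (<code>SpaceMap</code>, <code>Volume</code>, <code>MAX_MAP_BLOCKS</code>) are hypothetical and not the actual Reiser5 interfaces:

```python
# Toy sketch of an O(1) space allocator: one bounded free-space map per
# brick, so allocation never searches a structure whose size grows with
# the total volume size. All names here are illustrative, not Reiser5's.

MAX_MAP_BLOCKS = 1 << 20  # upper limit on blocks covered by one space map

class SpaceMap:
    """Free-space map of a single brick (bounded size)."""
    def __init__(self, nr_blocks):
        # Oversized devices should be split into partitions by the user,
        # as the text recommends, to keep every map under the limit.
        assert nr_blocks <= MAX_MAP_BLOCKS, "split the device into partitions"
        self.free = set(range(nr_blocks))   # free block numbers

    def alloc(self):
        # Search cost is bounded by MAX_MAP_BLOCKS, not by volume size.
        return self.free.pop() if self.free else None

class Volume:
    """A logical volume as a collection of per-brick space maps."""
    def __init__(self, brick_sizes):
        self.maps = [SpaceMap(n) for n in brick_sizes]

    def alloc(self, brick_id):
        # Pick the proper small map (by brick-ID), then work only with it.
        return self.maps[brick_id].alloc()

vol = Volume([1024, 2048])
blk = vol.alloc(1)   # allocate one block on brick #1
```

Adding a brick only adds one more bounded map; no existing map grows, which is the whole point of the approach.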
And conservation of disk resources usually doesn't take place: the disk bandwidth of a logical volume turns out to be smaller than the sum of the bandwidths of its components. Nevertheless, if the loss of resources is small and doesn't increase with the growth of the volume, then we'll say that such a system features '''parallel scaling out'''.

In complex IO-systems the "leak" of disk bandwidth has a complex nature and can happen in every subsystem: in the file system, in the block layer, etc. The loss can also be caused by interface properties, etc. The fundamental reason for almost all resource leaks is that the mentioned subsystems were poorly designed (because better algorithms were not known at the time, or for other reasons).

The classic example of disk space and throughput loss is RAID arrays. Linear RAID saves disk space, but always drops disk bandwidth. RAID-0, composed of devices of different size and bandwidth, drops both the disk space and the disk bandwidth of the resulting logical volume. The same holds for modifications like RAID-5. In all the mentioned examples the loss of disk bandwidth is caused by poor algorithms (specifically, by the fact that IO requests are directed to the components in wrong proportions).

''Definition 2''. A file system managing a logical volume is said to be a file system with '''parallel scaling out''' if it saves the disk space and bandwidth of that logical volume - in other words, if it doesn't drop the mentioned disk resources.

Note that the file system is only one part of an IO-subsystem, and it can easily happen that the file system saves disk resources while the whole system does not - for example, because of a poorly designed block layer which puts IO requests issued for different block devices into the same queue on a local machine, etc.

As an example, let's calculate the disk bandwidth of a logical volume composed of 2 devices, the first of which has disk bandwidth 200 M/sec, the second - 300 M/sec.
We'll consider 3 systems: in the first the mentioned devices compose a linear RAID, in the second - a striped RAID (RAID-0), and in the third they are managed by a file system with parallel scaling out.

Linear RAID distributes IO requests unevenly: first we write to the first device; once it is full, we write to the second one. The disk bandwidth of a linear RAID is defined by the throughput of the device we are currently writing to, so it is never more than the throughput of the faster device, i.e. 300 M/sec.

RAID-0 distributes IO requests evenly (but not fairly). In the same interval of time the same number N/2 of IO-requests will be issued against each device. The first device's portion is written in (N/2)/200 = N/400 sec, the second's in (N/2)/300 = N/600 sec. The first device is slower, so we must wait N/400 sec for all N IO-requests to be written to the array. Hence the throughput of RAID-0 in our case is N/(N/400) = 400 M/sec.

An FS with parallel scaling out distributes IO requests evenly and fairly. In the same interval of time the number of blocks issued against each device is N*C, where C is the relative throughput of that device. The relative throughput of the first device is 200/(200+300) = 0.4, and of the second one 300/(200+300) = 0.6. The portions of IO-requests issued for the two devices are written in parallel in the same time: 0.4N/200 = 0.6N/300 sec. Therefore the throughput of our logical volume in this case is N/(0.4N/200) = 500 M/sec.

The resulting table of throughputs:

* Linear RAID: < 300 M/sec
* RAID-0: 400 M/sec
* Parallel scaling out FS: 500 M/sec

According to the definitions above, any local file system built on top of RAID/LVM does NOT possess parallel scaling out (first, because RAID and LVM don't save disk resources; second, because the latency introduced by the free space allocator grows with the volume). For the same reasons any local FS which implements its own block layer (ZFS, Btrfs, etc) does NOT possess parallel scaling out.
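The worked example above can be reproduced with a few lines of Python (straightforward arithmetic on the same 200/300 M/sec devices; the variable names are ours):

```python
# Reproduce the worked example: two devices with bandwidths 200 and 300 MB/s.
N = 1200.0          # total megabytes written (any value - it cancels out)
bw = [200.0, 300.0]

# Linear RAID writes to one device at a time, so throughput is at most
# the faster device's bandwidth.
linear = max(bw)

# RAID-0: each device gets N/2; the array waits for the slower device.
raid0 = N / max((N / 2) / b for b in bw)

# PSO file system: each device gets a share proportional to its relative
# throughput (0.4 and 0.6 here), so both devices finish at the same time.
total = sum(bw)
pso = N / max((N * b / total) / b for b in bw)

print(linear, raid0, pso)   # 300.0 400.0 500.0
```

Note how in the PSO case the per-device completion times are equal by construction, which is exactly what "fair" distribution means here.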
Note that any network FS built on top of two or more local FS managing simple partitions as backend saves disk resources.

= Overhead of parallelism for local FS =

As we mentioned above, the characteristic feature of any FS with PSO is that before adding a device to a logical volume you should format it. Of course, this adds some overhead to the system. However, that overhead is not dramatically large. Specifically, with the reiser4 format40 disk specifications the disk overhead is 80K at the beginning of each device-component. Next, for each device Reiser5 reads the on-disk super-block and loads it to memory. Thus, the memory overhead is one pinned in-memory super-block (~500 bytes) per device-component of a logical volume. That is, a logical volume composed of one million devices will take ~500M of (pinned) memory. I think that a person maintaining such a volume will be able to find $30 for an additional memory card. That overhead is the single disadvantage of an FS with PSO. At least, we don't know of other ones.

= Asymmetric logical volumes. Data and meta-data bricks =

So, any logical volume with parallel scaling out is composed of block devices formatted by the mkfs utility. Such a device has a special name - '''brick''', or '''storage subvolume''' of a logical volume. To begin with, we have implemented the simplest approach, where meta-data is located on dedicated block devices - we'll call them '''meta-data bricks'''. I remind that in reiser4 the notion of "meta-data" includes all kinds of items (keyed records in the storage tree), while the notion of data means unformatted blocks pointed to by "extent pointers". Such unformatted nodes are used to store the bodies of regular files. Meta-data bricks are allowed to contain unformatted data blocks; '''data bricks''' contain only unformatted data blocks. For obvious reasons such logical volumes are called "asymmetric".

= Stripes. Fibers. Distribution, allocation and migration. Basic definitions =

'''Stripe''' is a logical unit of distribution, that is, a minimal object no parts of which can be stored on different bricks. A set of distribution units dispatched to the same brick is called a '''fiber'''.

Comment. In the previous art fibers were called stripes (the RAID-0 case), and logical units of distribution didn't have a special name; for the number of adjacent sectors forming such a unit the notion of "stripe width" was used.

'''Data stripe''' is a logical block of some size at some offset in a file. '''Meta-data striping''' can also be defined, but we don't consider it here for simplicity. '''File system block''' is, as usual, an allocation unit on some brick.

A stripe is said to be '''allocated''' if all its parts got disk addresses on some brick. From these definitions it directly follows that a file system block cannot contain more than one stripe; on the other hand, an allocated stripe can occupy many blocks. For any file system block its '''full address''' in a logical volume is defined as the pair (brick-ID, disk-address). A stripe is said to be '''dispatched''' if it got the first component (brick-ID) of its full address in the logical volume. A stripe is said to be '''migrated''' if its old disk addresses got released and new ones (possibly on another brick) got allocated.

The core difference between parallel and non-parallel scaling out in terms of distribution and allocation: in PSO-systems any stripe first gets distributed, then allocated. In systems with non-parallel scaling out it is the other way around - any stripe first gets allocated, then distributed. An example is any local FS built on top of a RAID-0 array: first such an FS allocates a virtual disk address for a logical block, then the block layer assigns a real device-ID and translates that virtual address to a real one.

= Data distribution and migration. Fiber-Striping. Burst Buffers =

Distribution defines which device-component of a logical volume the IO request composed for a dirty buffer (page) will be issued against. In file systems without PSO the "destination" device is always defined by the virtual disk address allocated for that request: e.g. for RAID-0 the ID of the destination device is defined as (N % M), where N is the virtual address (block number) allocated by the file system and M is the number of disks in the array. In our approach (O(1) space allocator) we allocate disk addresses on each physical device independently, so for every IO-request we first need to assign a destination device, and then ask the block allocator managing that device to allocate a block number for the request. So, in our approach distribution doesn't depend on allocation.

By default Reiser5 offers distribution based on algorithms (so-called '''fiber-striping''') invented by Eduard Shishkin (patented stuff). With our algorithms all your data will be distributed evenly and fairly among all device-components of the logical volume. It means that the portion of IO requests issued against each device is equal to the relative capacity of that device assigned by the user. The operation of adding/removing a device to/from a logical volume automatically invokes data migration, so that the resulting distribution is also fair. The portion of migrated data is always equal to the relative capacity of the added/removed device. The speed of data migration is mostly determined by the throughput of the device being added/removed. Alternatively, Reiser4 allows users to control data distribution and migration themselves.

An important application of distribution and migration is '''data tiering''' in the HPC area, in the form of so-called '''Burst Buffers''' (dumping "hot data" to a high-performance proxy device with its subsequent migration to "persistent storage" in background mode). In all cases the file system memorizes the stripes' locations.
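A toy illustration of distribute-then-allocate, contrasted with the RAID-0 rule. This is a generic weighted-hash sketch of our own, NOT Shishkin's patented fsx32 algorithm; all function names are hypothetical:

```python
# Distribute-then-allocate (PSO): a stripe is first dispatched to a brick
# in proportion to the brick's relative capacity; only afterwards would
# that brick's own allocator assign a disk address.
import hashlib

def dispatch(stripe_key, capacities):
    """Pick a brick-ID; the share of keys per brick ~ relative capacity."""
    # Deterministic pseudo-random point in [0, 1) derived from the key.
    h = int(hashlib.sha256(stripe_key.encode()).hexdigest(), 16)
    point = (h % 10**9) / 10**9
    total = sum(capacities)
    acc = 0.0
    for brick_id, cap in enumerate(capacities):
        acc += cap / total
        if point < acc:
            return brick_id
    return len(capacities) - 1

# Allocate-then-distribute (RAID-0): the device merely follows from an
# already-allocated virtual block number, as N % M.
def raid0_device(virtual_block, nr_disks):
    return virtual_block % nr_disks

caps = [200, 300]        # relative capacities of two bricks
counts = [0, 0]
for i in range(100_000):
    counts[dispatch(f"file1:{i}", caps)] += 1
# counts approach the fair 0.4 / 0.6 split
```

The key property: `dispatch` consults only the stripe key and the capacity table, never a disk address, so allocation on each brick can stay fully independent.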
= Atomicity of volume operations =

Almost all volume operations (adding/removing a brick, changing brick capacities, etc) involve re-balancing (i.e. massive migration of data blocks), so it is technically difficult to implement full atomicity of such operations. Instead, we issue 2 checkpoints (the first before re-balancing, the second after), and handle 2 cases depending on where, in relation to those points, the volume operation was interrupted. In the first case the user should repeat the operation; in the second case the user should complete the operation (in background mode) using the volume.reiser4 utility. See the administration guide on reiser4 logical volumes for details.

= Limitations on asymmetric logical volumes =

Maximal number of bricks in a logical volume:

* in the "builtin" distribution mode - 2^32
* in the "custom" distribution mode - 2^64

In the "builtin" distribution mode any 2 bricks of the same logical volume cannot differ in size by more than 2^19 (~500,000) times. For example, your logical volume cannot contain both 1M and 2T bricks.

Maximal number of stripe pointers held by one 4K meta-data block: 75 (for the node40 format). Maximal number of data blocks served by 1 meta-data block: 75*S, where S is the stripe width in file system blocks. For example, for 128K stripes and 4K blocks (S=32) one meta-data block can serve not more than 2400 data blocks. In particular, when all bricks are of equal capacity, it means that one meta-data brick can serve not more than 2400 data bricks.

For the best quality of "builtin" distribution it is recommended that:

a) the stripe size is not larger than 1/10000 of the total volume size;

b) the number of bricks in your logical volume is a power of 2 (i.e. 2, 4, 8, 16, etc). If you cannot afford that, then make sure that the number of hash space segments (a property of your logical volume, which can be increased online) is not smaller than 100 * number-of-bricks.

Not more than one volume operation on the same logical volume can be executed at a time.
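The meta-data capacity figures in this section follow from simple arithmetic, checked here in Python (the function name is ours, for illustration only):

```python
# One 4K meta-data block in node40 format holds at most 75 stripe pointers,
# so it serves at most 75*S data blocks, where S is the stripe width
# expressed in file system blocks.
POINTERS_PER_METADATA_BLOCK = 75   # node40 format, 4K nodes

def max_data_blocks(stripe_bytes, block_bytes=4096):
    stripe_width = stripe_bytes // block_bytes   # S, in fs blocks
    return POINTERS_PER_METADATA_BLOCK * stripe_width

# 128K stripes with 4K blocks give S = 32, hence 75 * 32 = 2400 - which is
# also the bound on data bricks per meta-data brick when all bricks are of
# equal capacity.
print(max_data_blocks(128 * 1024))   # 2400
```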
If some volume operation is not completed, then attempts to execute other ones will return an error (EBUSY).

= Security issues =

"Builtin" distribution combines random and deterministic methods. It is "salted" with the volume-ID, which is known only to root. Once it is compromised (revealed), the logical volume can be subjected to a "free space attack": with a known volume-ID an attacker (a non-privileged user) is able to fill some data brick up to 100% while the others still have a lot of free space, so that nobody is able to write to that volume anymore. So, keep your volume-ID secret!

= Software and Disk Version 5.1.3. Compatibility =

To implement parallel scaling out we upgraded the Reiser4 code base with the following new plugins:

1) "asymmetric" volume plugin (new interface);
2) "fsx32" distribution plugin (new interface);
3) "striped" file plugin (existing interface);
4) "extent41" item plugin (existing interface);
5) "format41" disk format plugin (existing interface).

In the best traditions we increment version numbers. The old disk and software version was 4.0.2. The "minor" number (2) is incremented because of (1-4). The "major" number (0) is incremented because of (5) and changes in the format super-block. The "principal" number (4) is incremented because of changes in the master super-block. For more details about compatibility see [https://reiser4.wiki.kernel.org/index.php/Reiser4_development_model this]

Old reiser4 partitions (of format 4.0.X) will be supported by the Reiser5 kernel module. For this you need to enable the option "support Plan-A key allocation scheme" (not default) when configuring the kernel. Note that it will automatically disable support of logical volumes. Such mutual exclusiveness is due to performance reasons. Reiser4progs of software release number 5.X.Y don't support old reiser4 partitions of format 4.0.X. To fsck the latter, use reiser4progs of software release number 4.X.Y - it will exist as a separate branch.

= TODO =
* Interface for user-defined data distribution and migration (Burst Buffers);
* Upgrading FSCK to work with logical volumes;
* Asymmetric LV with more than 1 meta-data brick per volume;
* Symmetric logical volumes (meta-data on all bricks);
* 3D-snapshots of LV (snapshots with an ability to roll back not only file operations, but also volume operations);
* Global (networking) logical volumes.

[[category:Reiser4]]
It results in increasing latency on free blocks allocation and consequently in essential performance drop on large volumes which are almost full. 2. Loss of disk resources (space and throughput) on logical volumes composed of devices with different physical and geometric parameters (because of poor translations provided by "classic" RAID levels). Low-performance devices become a bottleneck in RAID arrays Attempts to replace RAID levels with better algorithms lead to inevitable and unfixable degradation of logical volumes. Indeed, defragmentation tools work only on the space of virtual disk addresses. If you use classic RAID levels, then everything is fine here: reducing fragmentation on virtual device always results in reducing fragmentation on physical ones. However, if you use more sophisticated translations to save disk space and bandwidth, then fragmentations on real devices tends to accumulate, and you are not able to reduce it just defragmenting the virtual device. Note that the interest is always real devices - no one actually cares what happens on virtual ones. 3. With only block layer means it is impossible to build heterogeneous storage composed of devices of different nature. You are not able to use different approaches to devices-components of the same logical volume (e.g. defragment only rotational drives, issue discard requests only for solid state drives, etc). 4. It is impossible to efficiently implement data migration (and, hence, data tiering) on logical volumes composed by block-layer means. = The previous art = Previously, there was only one method for scaling out local volumes - by block layer means. That is, file system deals only with virtual disk addresses (allocation, defragmentation, etc), and the block layer translates virtual disk addresses to real ones and backward. The most common complaint is about performance drop on such logical volumes, which are large and more than 70% full. 
Mostly it is related to disk space allocators, which are, to put it mildly, not excellent, and introduce big latency when searching for a free blocks on extra-large volumes. Moreover, nothing better has been invented for the past 30 years. Also, it easily may be that the best algorithms for free space management simply do not exist. Some file systems (ZFS and like) implement their own block layers. It helps to implement a failover, however, the mentioned problem doesn't disappear - if the block layer does its job very well, then the file system, again, faces a huge logical block device, which is hard to handle. Significant progress in scaling out was made by parallel network file systems (GPFS, Lustre, etc). However it was unclear, how to apply their technologies to a local FS. Mostly, it is because local file systems don't have such luxury like "backend storage" as the network ones do. What local FS does have - is only extremely poor interface of interaction with the block layer. For example, in Linux local FS can only compose and issue an IO request against some buffer (page). In other words, it was unclear, what a "local parallel file system" is. = Our approach. O(1) space allocator = In ~2010 we had realized that the first approach (implementation an own block layer inside a local FS) is totally wrong. Instead we need to pay attention to parallel network file systems to adopt their methods. However, as I mentioned, there is no something even close to a direct analogy - it means that for local FS we need to design "parallelism" from scratch. The same about distribution algorithms - we are totally unhappy with existing ones. Of course, you can deploy a networking FS on the same local machine for a number of block devices, but it will be something not serious. We state that a serious analogy can be defined and implemented in properly designed local FS - meet Reiser5. 
The basic idea is pretty simple - to not mess with large free space maps (whose sizes depend on the volume size). Instead, we need to manage many small ones of limited size. At any moment the file system should be able to pick up a proper such small space map, and work only with it. Needless to say, that for any logical volume, which is as big as you want, search time in a such map will be also limited by some value, which doesn't depend on logical volume size. For this reason, we'll call it O(1) - space allocator. The simplest way is to maintain one space map per each block device, which is a component of the logical volume. If some device is too large, simply split it into a number of partitions to make sure that any space map does not exceed some upper limit. Thus, users also should put some efforts from their side to make the space allocator be O(1). = Parallel scaling out as disk resources conservation. Definitions and examples = Here we'll consider an abstract subsystem S of the operating system managing a logical volume composed of one, or more removable components. ''Definition 1''. We'll say that S '''saves disk space and throughput''' of the logical volume, if 1) its data capacity is a sum of data capacities of its components 2) its disk bandwidth is a sum of disk bandwidths of its components We'll say that LV managed by such system is with '''parallel scaling out''' (PSO). There is a good analogy to understand the feature of PSO: imagine that it rains and you put several cylindrical buckets with different sized holes for collecting water. In this example raindrops represent IO-requests, the set of bucket represents a logical volume. Note that amount of water felt to each bucket is proportional to the square of its hole (considered as throughput). In this example all buckets are filled with water evenly and fairly: if one bucket is full, then other ones is also full. 
Note, that non-cylindrical form of buckets will likely break fairness of water distribution between them, so that PSO won't take place in this case. In practice, however, IO-systems are more complicated: IO requests are distributed, queued, etc. And conservation of disk resources usually doesn't take place: disk bandwidth of any logical volume turns out to be smaller than the sum of ones of its components. Nevertheless, if the loss of resources is small, and doesn't increase with the growth of the volume, then we'll say that such system features '''parallel scaling out'''. In complex IO-systems "leak" of disk bandwidth has complex nature and can happen on every its subsystem: on the file system, on the block layer, etc. The loss can also be caused by interface properties, etc. The fundamental reason of almost all resource leaks is that mentioned subsystems were poorly designed (because better algorithms were not known at that moment, or because of other reasons). The classic example of disk space and throughput loss is RAID arrays. Linear RAID saves disk space, but always drops disk bandwidth. RAID-0, composed of different size and bandwidth devices, drops both, disk space and disk bandwidth of the resulted logical volume. The same is for their modifications like RAID-5. In all mentioned examples the loss of disk bandwidth is caused by poor algorithms (specifically, by the fact that IO requests are directed to every component in wrong proportions). ''Definition 2''. A file system managing a logical volume is said to be with '''parallel scaling out''', if it saves disk space and bandwidth of that logical volume. In other words, if it doesn't drop the mentioned disk resources. Note that file system is only a part of an IO-subsystem. And it can easily happen that the file system saves disk resources, while the whole system is not. 
For example, because of poorly designed block layer, who puts IO requests issued for different block devices to the same queue on a local machine, etc. As an example, let's calculate disk bandwidth of a logical volume composed of 2 devices, the first of which has disk bandwidth 200M/sec, second - 300M/sec. We'll consider 3 systems: in the first one the mentioned devices compose linear RAID, in the second one - striped RAID (RAID-0), in the third one they are managed by a file system with parallel scaling out. Linear RAID distributes IO requests not evenly: first we write to the first device. Once it is full, we write to the second one. Disk bandwidth of linear RAID is defined by the throughput of the device we are currently writing to. Thus it always is not more than throughput of the faster device, i.e. 300 M/sec. RAID-0 distributes IO requests evenly (but not fairly). In the same interval of time the same number N/2 of IO-requests will be issued against each device. On the first device it will be written in N/400 sec. On the second device it will be written in N/600 sec. Note that the first device is slower, therefore we should wait N/400 sec for all N IO-requests to be written to the array. So throughput of RAID-0 in our case is N/(N/400) = 400 M/sec. FS with parallel scaling out distributes IO requests evenly and fairly. In the same interval of time the number of blocks issued against each device is N*C, where C is relative throughput of the device. Relative throughput of the first device is 200/(200+300) = 0.4. Of the second one - 300/(200+300) = 0.6 Portion of IO-requests issued for each device will be written in parallel in the same time 0.4N/200 = 0.6N/300 sec. Therefore, throughput of our logical volume in this case is N/(0.4N/200) = 500 M/sec. 
The resulted table of throughput: Linear RAID: <300 M/sec RAID-0: 400 M/sec Parallel scaling out FS 500 M/sec According to definitions above any local file system built on a top of RAID/LVM does NOT possesses parallel scaling out (first, because RAID and LVM don't save disk resources, second, because latency introduced by free space allocator grows with volume. For the same reasons any local FS, which implements its own block layer (ZFS, Btrfs, etc) does NOT possesses parallel scaling out. Note that any network FS built on a top of two or more local FS managing simple partitions as backend saves disk resources. = Overhead of parallelism for local FS = As we mentioned above, the characteristic feature of any FS with PSO is that before adding a device to a logical volume you should format it. Of course, it adds some overhead to the system. However, that overhead is not dramatically large. Specifically, with reiser4 disk format40 specifications the disk overhead includes 80K at the beginning of each device-component. Next, for each device Reiser5 reads on-disk super-block and loads its to memory, Thus, memory overhead includes one persistent memory super-block (~500 bytes) per each device-component of a logical volume. That is, a logical volume composed of one million devices will take ~500M of memory (pinned). I think that a person maintaining such volume will be able to find $30 for additional memory card. That overhead is a single disadvantage of FS with PSO. At least, we don't know other ones. = Аsymmetric logical volumes. Data and meta-data bricks = So, any logical volume with parallel scaling out is composed of block devices formatted by mkfs utility. Such device has a special name '''brick''', or '''storage subvolume''' of a logical volume. For the beginning we have implemented the simplest approach, when meta-data is located on dedicated block devices - we'll call them '''meta-data bricks'''. 
I remind that in reiser4 the notion of "meta-data" includes all kind of items (key'ed records in the storage tree). And the notion of data means unformatted blocks pointed out by "extent pointers". Such unformatted nodes are used to store bodies of regular files. Meta-data bricks are allowed to contain unformatted data blocks. '''Data bricks''' contain only unformatted data blocks. For obvious reasons such logical volumes are called "asymmetric". = Stripes. Fibers. Distribution, allocation and migration. Basic definitions = '''Stripe''' is a logical unit of distribution, that is a minimal object, any parts of which can not be stored on different bricks. A set of distribution units dispatched to the same brick is called '''fiber'''. Comment. In the previous art fibers were called stripes (case of RAID-0), and logical units of distribution didn't have a special name. For a number of adjacent sectors forming such a unit a notion of "stripe width" was used. '''Data stripe''' is a logical block of some size at some offset in a file. '''Meta-data striping''' also can be defined, but we don't consider it here for simplicity. '''File system block''' is, as usual, an allocation unit on some brick. Stripe is said '''allocated''', if all its parts got disk addresses on some brick. From these definitions it directly follows that file system block can not contain more than one stripe. On the other hand, an allocated stripe can occupy many blocks. For any file system block its '''full address''' in a logical volume is defined as a pair (brick-ID, disk-address). Stripe is said '''dispatched''', if it got the first component (brick-ID) of its full address in the logical volume. Stripe is said '''migrated''', if its old disk addresses got released, and new ones (possibly on another brick) got allocated. The core difference between parallel and non-parallel scaling out in terms of distribution and allocation: In PSO-systems any stripe firstly gets distributed, then allocated. 
In systems with non-parallel scaling out it is other way around - any stripe firstly gets allocated, then distributed. An example is any local FS built a top of RAID-0 array. Indeed, at first, such FS allocates a virtual disk address for a logical block, then block layer assigns a real device-ID and translates that virtual address to real one. = Data distribution and migration. Fiber-Striping. Burst Buffers = Distribution defines what device-component of a logical volume an IO request composed for a dirty buffer(page) will be issued against. In file systems with PSO "destination" device is always defined by a virtual disk address allocated for that request. E.g. for RAID-0 ID of destination device is defined as (N % M), where N is a virtual address (block number), allocated by the file system, M is number of disks in the array. In our approach (O(1) space allocator) we allocate disk addresses on each physical device independently, so for every IO-request we first need to assign a destination device, then ask a block allocator managing that device to allocate a block number for this request. So, in our approach distribution doesn't depend on allocation. By default Reiser5 offers distribution based on algorithms (so-called '''fiber-striping''') invented by Eduard Shishkin (patented stuff). With our algorithms all your data will be distributed evenly and fairly among all devices-components of the logical volume. It means that portion of IO requests issued against each device is equal to relative capacity of that device assigned by user. Operation of adding/removing a device to/from a logical volume automatically invokes data migration, so that resulted distribution is also fair. Portion of migrated data is always equal to relative capacity of the added/removed device. The speed of data migration is mostly determined by throughput of the device to be added/removed. Alternatively, Reiser4 allows users to control data distribution and migration themselves. 
Distribution and migration find an important application as '''data tiering''' in the HPC area, in the form of so-called '''Burst Buffers''': "hot" data is dumped to a high-performance proxy device and then migrated to "persistent storage" in background mode. In all cases the file system memorizes the locations of stripes.

= Atomicity of volume operations =

Almost all volume operations (adding/removing a brick, changing brick capacities, etc.) involve re-balancing (i.e. massive migration of data blocks), so it is technically difficult to implement full atomicity of such operations. Instead, we issue two checkpoints (the first before re-balancing, the second after) and handle two cases, depending on where the volume operation was interrupted relative to those points. In the first case the user should simply repeat the operation; in the second case the user should complete the operation (in background mode) using the volume.reiser4 utility. See the administration guide on reiser4 logical volumes for details.

= Limitations on asymmetric logical volumes =

Maximal number of bricks in a logical volume:
* in the "builtin" distribution mode: 2^32
* in the "custom" distribution mode: 2^64

In the "builtin" distribution mode no two bricks of the same logical volume may differ in size by more than a factor of 2^19 (about half a million). For example, your logical volume cannot contain both a 1M brick and a 2T brick.

Maximal number of stripe pointers held by one 4K meta-data block: 75 (for the node40 format). Hence the maximal number of data blocks served by one meta-data block is 75*S, where S is the stripe width in file system blocks. For example, for 128K stripes and 4K blocks (S=32) one meta-data block can serve at most 2400 data blocks. In particular, when all bricks are of equal capacity, this means that one meta-data brick can serve at most 2400 data bricks.

For the best quality of "builtin" distribution it is recommended that:

a) the stripe size is not larger than 1/10000 of the total volume size;
b) the number of bricks in your logical volume is a power of 2 (i.e. 2, 4, 8, 16, etc.). If you cannot afford that, then make sure that the number of hash space segments (a property of your logical volume, which can be increased online) is not smaller than 100 * number-of-bricks.

No more than one volume operation on the same logical volume can be executed at a time. While some volume operation is incomplete, attempts to execute other ones will return an error (EBUSY).

= Security issues =

The "builtin" distribution combines random and deterministic methods. It is "salted" with the volume-ID, which is known only to root. Once the volume-ID is compromised (revealed), the logical volume can be subjected to a "free space attack": with a known volume-ID an attacker (a non-privileged user) is able to fill some data brick up to 100% while the others still have plenty of free space, so that nobody can write to that volume anymore. So, keep your volume-ID secret!

= Software and Disk Version 5.1.3. Compatibility =

To implement parallel scaling out we upgraded the Reiser4 code base with the following new plugins:

1) "asymmetric" volume plugin (new interface);
2) "fsx32" distribution plugin (new interface);
3) "striped" file plugin (existing interface);
4) "extent41" item plugin (existing interface);
5) "format41" disk format plugin (existing interface).

In the best traditions we increment version numbers. The old disk and software version was 4.0.2. The "minor" number (2) is incremented because of (1-4). The "major" number (0) is incremented because of (5) and changes in the format super-block. The "principal" number (4) is incremented because of changes in the master super-block. For more details about compatibility see [https://reiser4.wiki.kernel.org/index.php/Reiser4_development_model this page].

Old reiser4 partitions (of format 4.0.X) will be supported by the Reiser5 kernel module. For this you need to enable the option "Plan-A key allocation scheme" (not the default) when configuring the kernel.
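The capacity limit and the distribution recommendations in the limitations section above reduce to simple arithmetic. Here is a small sketch; the figures 75, 1/10000 and 100 come from the text, while the function names and the returned hint strings are purely illustrative:

```python
# Arithmetic behind the limitations on asymmetric logical volumes.
# Figures are taken from the text above; names are illustrative.

MAX_STRIPE_POINTERS = 75   # per 4K meta-data block, node40 format

def data_blocks_per_md_block(stripe_size: int, block_size: int = 4096) -> int:
    """Max data blocks served by one meta-data block: 75 * S,
    where S is the stripe width in file system blocks."""
    return MAX_STRIPE_POINTERS * (stripe_size // block_size)

def builtin_distribution_hints(volume_size, stripe_size,
                               nr_bricks, hash_segments):
    """Warn when the recommendations for "builtin" distribution
    quality are not met."""
    hints = []
    # a) stripe size should be at most 1/10000 of the volume size
    if stripe_size * 10000 > volume_size:
        hints.append("stripe too large for volume")
    # b) number of bricks should be a power of two; otherwise the
    #    number of hash space segments should be >= 100 * bricks
    if nr_bricks & (nr_bricks - 1):
        if hash_segments < 100 * nr_bricks:
            hints.append("increase hash space segments")
    return hints

# 128K stripes, 4K blocks: S = 32, so one meta-data block serves
# at most 75 * 32 = 2400 data blocks.
```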
Note that this option automatically disables support of logical volumes. Such mutual exclusiveness is due to performance reasons. Reiser4progs of software release number 5.X.Y do not support old reiser4 partitions of format 4.0.X; to fsck the latter, use reiser4progs of software release number 4.X.Y, which will exist as a separate branch.

= TODO =

* Interface for user-defined data distribution and migration (Burst Buffers);
* Upgrading FSCK to work with logical volumes;
* Asymmetric LV with more than 1 meta-data brick per volume;
* Symmetric logical volumes (meta-data on all bricks);
* 3D-snapshots of LV (snapshots with an ability to roll back not only file operations, but also volume operations);
* Global (networking) logical volumes.

= Logical Volumes Howto =

'''WARNING! FSCK doesn't work correctly on volumes with scaling-out abilities. Any attempt to repair such volumes will result in data loss.'''

Disk format 5.1.3 allows creating volumes with '''scaling-out''' abilities. That is, you can easily add a '''brick''' (i.e. a formatted block device) to your volume, thus increasing its capacity. Bricks don't need to be of equal size: data will be distributed '''fairly''' among bricks of different sizes. At any time you can also remove such a brick. Moreover, at any time you can change the abstract capacity of any brick, thus controlling the portion of IO requests issued against each brick in your logical volume. More information can be found in [[Logical Volumes Background]]. Finally, you can add a '''proxy-brick''' created over a high-performance block device (e.g. SSD or NVRAM) to your logical volume; it increases the performance of file operations when your volume is composed of slow media.

To get started with logical volumes you will need a Reiser4 kernel module of SFRN (Software Framework Release Number) 5.1.3 or higher.
The patches against stock kernels can be found [https://sourceforge.net/projects/reiser4/files/v5-unstable/kernel/ here]. When building the kernel, make sure that "Plan-A key allocation scheme" is disabled in the configuration menu (equivalently, that CONFIG_REISER4_OLD is not set). An important note: volumes based on the old disk format (''"format40"'') are not scalable (you cannot add a brick to such a volume). Moreover, it is impossible to mount volumes managed by the new and the old disk format on the same system: you have to choose, by enabling or disabling the mentioned configuration option. This is for performance reasons.

To work with scalable volumes you will need [https://sourceforge.net/projects/reiser4/files/v5-unstable/progs/ reiser4progs-2.0.3] or higher. It requires [https://sourceforge.net/projects/reiser4/files/reiser4-utils/libaal/ libaal-1.0.7].

To create a scalable volume, just build the first brick of your volume by formatting some block device with the mkfs.reiser4 utility, and mount it. Then you will be able to grow the volume by adding new bricks with the volume.reiser4 utility. See the [[Logical_Volumes_Administration|logical volumes administration guide]] for details. For managing proxy-bricks see the [[Proxy_Device_Administration|proxy-device administration guide]].

[[category:Reiser4]]
= Mailinglists =

You can:
* [mailto:majordomo@vger.kernel.org?body=subscribe&nbsp;reiserfs-devel subscribe to reiserfs-devel]
* [mailto:majordomo@vger.kernel.org?body=unsubscribe&nbsp;reiserfs-devel unsubscribe from reiserfs-devel]
* [mailto:majordomo@vger.kernel.org?body=help get instructions on mailing list usage]

See the [http://vger.kernel.org/majordomo-info.html VGER Majordomo] site for more information.
= Archives =

There are several archives of <tt>reiserfs-devel</tt> on the net:
* [http://www.spinics.net/lists/reiserfs-devel/ spinics.net] (from ~June 2007)
* [http://marc.info/?l=reiserfs-devel marc.info] (from ~April 1998, including the former <tt>reiserfs-list</tt> archives)
* [http://www.mail-archive.com/reiserfs-list@namesys.com/ mail-archive.com] (2005-10-14 until 2007-08-06, only the former <tt>reiserfs-list</tt> archives)

[[category:ReiserFS]] [[category:Reiser4]]
= Main Page =

__NOTOC__
{| width="100%"
|-
| style="vertical-align:top" |
<!-- Documentation Block -->
<div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#dcf5ff;">
=== Documentation ===
* [[Why_Reiser4 | Why Reiser4?]]
* [[Reiser4_Howto | Getting started with ReiserFS/Reiser4]]
* [[Logical Volumes Howto | Getting started with Logical Volumes (NEW)]]
* [[Debug_Reiser4 | Debug Reiser4]]
* [[Debug_Reiser4progs | Debug Reiser4progs]]
* [[FAQ|Frequently Asked Questions]]
* [[Manpages]]
* [[Publications | Articles and Publications]]
* [[Benchmarks]]
* [[Testimonials]]
</div>
<!-- Utilities Block -->
<div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#F8F8FF;">
=== Utilities ===
* [[reiser4progs]]
* [[reiserfsprogs]]
* [[Filesystem Testing Tools]]
</div>
<!-- Development Block -->
<div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#fffff0;">
=== Development ===
* [[Reiser4 patchsets]] and packages
* [[Reiser4 development model]]
* [[Bugs]]
* [[TODO]] list
* [[Credits]]
</div>
<!-- Further Information Block -->
<div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#dfefdf">
=== Further Information ===
* [[mailinglists|Mailing Lists]]
* [[IRC|IRC Channels]]
</div>
| width="55%" style="vertical-align:top" |
<!-- News Block -->
<div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#f0e0d0;">
=== News ===
* 2020-08-17 - Reiser4 for Linux-5.8.1 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-5.x/ released]
* 2020-06-07 - Reiser4 for Linux-5.7.1 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-5.x/ released]
* 2020-05-30 - Reiser4 for Linux-5.6.0 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-5.x/ released]
* 2020-02-22 - Reiser4 for Linux-5.5.5 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-5.x/ released]
* 2020-02-22 - Reiser4 for Linux-5.4.21 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-5.x/ released]
* 2020-02-15 - Reiser4 for Linux-5.5.1 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-5.x/ released]
* 2020-02-05 - Reiser4 for Linux-5.4.17 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-5.x/ released]
* 2019-12-31 - Experimental Format 5.1.3 [https://lwn.net/Articles/808323/ released]
* 2019-12-31 - Reiser4 for Linux-5.4.6 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-5.x/ released]
* 2019-11-10 - Reiser4 for Linux-5.3.0 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-5.x/ released]
* 2019-08-13 - Reiser4 for Linux-5.2 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-5.x/ released]
* 2019-08-13 - Reiser4 for Linux-5.1 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-5.x/ released]
* 2019-04-12 - Reiser4 for Linux-5.0 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-5.x/ released]

&rarr; See the [[News Archive]] for older news.
</div>
|}
</div> |} feb456eb21b896c6d2e57d3354031f60d278c3f4 4393 4373 2020-08-18T13:36:35Z Edward 4 /* Documentation */ __NOTOC__ {| width="100%" |- | style="vertical-align:top" | <!-- Documentation Block --> <div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#dcf5ff;"> === Documentation === * [[Why_Reiser4 | Why Reiser4?]] * [[Reiser4_Howto | Getting started with ReiserFS/Reiser4]] * [[Logical Volumes Howto | Getting started with Logical Volumes (NEW)]] * [[Debug_Reiser4 | Debug Reiser4]] * [[Debug_Reiser4progs | Debug Reiser4progs]] * [[FAQ|Frequently Asked Questions]] * [[Manpages]] * [[Publications | Articles and Publications]] * [[Benchmarks]] * [[Testimonials]] </div> <!-- Utilities Block --> <div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#F8F8FF;"> === Utilities === * [[reiser4progs]] * [[reiserfsprogs]] * [[Filesystem Testing Tools]] </div> <!-- Development Block --> <div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#fffff0;"> === Development === * [[Reiser4 patchsets]] and packages * [[Reiser4 development model]] * [[Bugs]] * [[TODO]] list * [[Credits]] </div> <!-- Further Information Block --> <div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#dfefdf"> === Further Information === * [[mailinglists|Mailing Lists]] * [[IRC|IRC Channels]] </div> | width="55%" style="vertical-align:top" | <!-- News Block --> <div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#f0e0d0;"> === News === 2020-06-07 - Reiser4 for Linux-5.7.1 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-5.x/ released] 2020-05-30 - Reiser4 for Linux-5.6.0 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-5.x/ released] 2020-02-22 - Reiser4 for Linux-5.5.5 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-5.x/ released] 
2020-02-22 - Reiser4 for Linux-5.4.21 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-5.x/ released] 2020-02-15 - Reiser4 for Linux-5.5.1 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-5.x/ released] 2020-02-05 - Reiser4 for Linux-5.4.17 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-5.x/ released] 2019-12-31 - Reiser4 for Linux-5.4.6 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-5.x/ released] 2019-11-10 - Reiser4 for Linux-5.3.0 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-5.x/ released] 2019-08-13 - Reiser4 for Linux-5.2 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-5.x/ released] 2019-08-13 - Reiser4 for Linux-5.1 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-5.x/ released] 2019-04-12 - Reiser4 for Linux-5.0 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-5.x/ released] &rarr; See the [[News Archive]] for older news. </div> |} 5646d3da0f0d341eed187cb4c699704436c45472 4373 4371 2020-07-30T18:34:29Z Chris goe 2 older news moved to [[News_Archive]] __NOTOC__ {| width="100%" |- | style="vertical-align:top" | <!-- Documentation Block --> <div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#dcf5ff;"> === Documentation === * [[Why_Reiser4 | Why Reiser4?]] * [[Reiser4_Howto | Getting started with ReiserFS/Reiser4]] * [[Debug_Reiser4 | Debug Reiser4]] * [[Debug_Reiser4progs | Debug Reiser4progs]] * [[FAQ|Frequently Asked Questions]] * [[Manpages]] * [[Publications | Articles and Publications]] * [[Benchmarks]] * [[Testimonials]] </div> <!-- Utilities Block --> <div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#F8F8FF;"> === Utilities === * [[reiser4progs]] * [[reiserfsprogs]] * [[Filesystem Testing Tools]] </div> <!-- Development Block --> <div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; 
background-color:#fffff0;"> === Development === * [[Reiser4 patchsets]] and packages * [[Reiser4 development model]] * [[Bugs]] * [[TODO]] list * [[Credits]] </div> <!-- Further Information Block --> <div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#dfefdf"> === Further Information === * [[mailinglists|Mailing Lists]] * [[IRC|IRC Channels]] </div> | width="55%" style="vertical-align:top" | <!-- News Block --> <div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#f0e0d0;"> === News === 2020-06-07 - Reiser4 for Linux-5.7.1 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-5.x/ released] 2020-05-30 - Reiser4 for Linux-5.6.0 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-5.x/ released] 2020-02-22 - Reiser4 for Linux-5.5.5 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-5.x/ released] 2020-02-22 - Reiser4 for Linux-5.4.21 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-5.x/ released] 2020-02-15 - Reiser4 for Linux-5.5.1 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-5.x/ released] 2020-02-05 - Reiser4 for Linux-5.4.17 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-5.x/ released] 2019-12-31 - Reiser4 for Linux-5.4.6 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-5.x/ released] 2019-11-10 - Reiser4 for Linux-5.3.0 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-5.x/ released] 2019-08-13 - Reiser4 for Linux-5.2 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-5.x/ released] 2019-08-13 - Reiser4 for Linux-5.1 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-5.x/ released] 2019-04-12 - Reiser4 for Linux-5.0 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-5.x/ released] &rarr; See the [[News Archive]] for older news. 
</div> |} 002057bbf291f7aeef7d03d1cce4e50dea76e76a 4371 4327 2020-07-30T18:33:14Z Chris goe 2 lots of new Reiser4 releases __NOTOC__ {| width="100%" |- | style="vertical-align:top" | <!-- Documentation Block --> <div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#dcf5ff;"> === Documentation === * [[Why_Reiser4 | Why Reiser4?]] * [[Reiser4_Howto | Getting started with ReiserFS/Reiser4]] * [[Debug_Reiser4 | Debug Reiser4]] * [[Debug_Reiser4progs | Debug Reiser4progs]] * [[FAQ|Frequently Asked Questions]] * [[Manpages]] * [[Publications | Articles and Publications]] * [[Benchmarks]] * [[Testimonials]] </div> <!-- Utilities Block --> <div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#F8F8FF;"> === Utilities === * [[reiser4progs]] * [[reiserfsprogs]] * [[Filesystem Testing Tools]] </div> <!-- Development Block --> <div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#fffff0;"> === Development === * [[Reiser4 patchsets]] and packages * [[Reiser4 development model]] * [[Bugs]] * [[TODO]] list * [[Credits]] </div> <!-- Further Information Block --> <div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#dfefdf"> === Further Information === * [[mailinglists|Mailing Lists]] * [[IRC|IRC Channels]] </div> | width="55%" style="vertical-align:top" | <!-- News Block --> <div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#f0e0d0;"> === News === 2020-06-07 - Reiser4 for Linux-5.7.1 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-5.x/ released] 2020-05-30 - Reiser4 for Linux-5.6.0 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-5.x/ released] 2020-02-22 - Reiser4 for Linux-5.5.5 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-5.x/ released] 2020-02-22 - Reiser4 for Linux-5.4.21 
[https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-5.x/ released] 2020-02-15 - Reiser4 for Linux-5.5.1 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-5.x/ released] 2020-02-05 - Reiser4 for Linux-5.4.17 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-5.x/ released] 2019-12-31 - Reiser4 for Linux-5.4.6 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-5.x/ released] 2019-11-10 - Reiser4 for Linux-5.3.0 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-5.x/ released] 2019-08-13 - Reiser4 for Linux-5.2 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-5.x/ released] 2019-08-13 - Reiser4 for Linux-5.1 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-5.x/ released] 2019-04-12 - Reiser4 for Linux-5.0 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-5.x/ released] 2018-06-27 - Reiser4 for Linux-4.17 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-4.x/ released] 2018-04-05 - Reiser4 for Linux-4.16 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-4.x/ released] 2018-03-28 - Reiser4 for Linux-4.15 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-4.x/ released] 2017-11-26 - [https://www.spinics.net/lists/reiserfs-devel/msg05650.html reiser4: port for Linux-4.14] released 2017-09-06 - Reiser4 for Linux-4.13 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-4.x/ released] 2017-08-14 - [https://www.spinics.net/lists/reiserfs-devel/msg05603.html Reiser4 for Linux-4.12] released 2017-06-03 - [https://www.spinics.net/lists/reiserfs-devel/msg05519.html Reiser4: Port for Linux-4.11] released 2017-02-21 - [https://www.spinics.net/lists/reiserfs-devel/msg05385.html reiser4: port for Linux-4.10] released &rarr; See the [[News Archive]] for older news. 
</div> |} 042bb78b175c69a2405fc8a18201b839b65c2ee9 4327 4309 2019-10-10T19:17:16Z Chris goe 2 new reiser4 updates __NOTOC__ {| width="100%" |- | style="vertical-align:top" | <!-- Documentation Block --> <div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#dcf5ff;"> === Documentation === * [[Why_Reiser4 | Why Reiser4?]] * [[Reiser4_Howto | Getting started with ReiserFS/Reiser4]] * [[Debug_Reiser4 | Debug Reiser4]] * [[Debug_Reiser4progs | Debug Reiser4progs]] * [[FAQ|Frequently Asked Questions]] * [[Manpages]] * [[Publications | Articles and Publications]] * [[Benchmarks]] * [[Testimonials]] </div> <!-- Utilities Block --> <div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#F8F8FF;"> === Utilities === * [[reiser4progs]] * [[reiserfsprogs]] * [[Filesystem Testing Tools]] </div> <!-- Development Block --> <div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#fffff0;"> === Development === * [[Reiser4 patchsets]] and packages * [[Reiser4 development model]] * [[Bugs]] * [[TODO]] list * [[Credits]] </div> <!-- Further Information Block --> <div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#dfefdf"> === Further Information === * [[mailinglists|Mailing Lists]] * [[IRC|IRC Channels]] </div> | width="55%" style="vertical-align:top" | <!-- News Block --> <div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#f0e0d0;"> === News === 2019-08-13 - Reiser4 for Linux-5.2 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-5.x/ released] 2019-08-13 - Reiser4 for Linux-5.1 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-5.x/ released] 2019-04-12 - Reiser4 for Linux-5.0 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-5.x/ released] 2018-06-27 - Reiser4 for Linux-4.17 
[https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-4.x/ released] 2018-04-05 - Reiser4 for Linux-4.16 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-4.x/ released] 2018-03-28 - Reiser4 for Linux-4.15 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-4.x/ released] 2017-11-26 - [https://www.spinics.net/lists/reiserfs-devel/msg05650.html reiser4: port for Linux-4.14] released 2017-09-06 - Reiser4 for Linux-4.13 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-4.x/ released] 2017-08-14 - [https://www.spinics.net/lists/reiserfs-devel/msg05603.html Reiser4 for Linux-4.12] released 2017-06-03 - [https://www.spinics.net/lists/reiserfs-devel/msg05519.html Reiser4: Port for Linux-4.11] released 2017-02-21 - [https://www.spinics.net/lists/reiserfs-devel/msg05385.html reiser4: port for Linux-4.10] released &rarr; See the [[News Archive]] for older news. </div> |} 74eba5106b0c6bc0a99010e0ae37b26da6c6a8ed 4309 4303 2019-04-13T16:02:05Z Edward 4 /* News */ __NOTOC__ {| width="100%" |- | style="vertical-align:top" | <!-- Documentation Block --> <div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#dcf5ff;"> === Documentation === * [[Why_Reiser4 | Why Reiser4?]] * [[Reiser4_Howto | Getting started with ReiserFS/Reiser4]] * [[Debug_Reiser4 | Debug Reiser4]] * [[Debug_Reiser4progs | Debug Reiser4progs]] * [[FAQ|Frequently Asked Questions]] * [[Manpages]] * [[Publications | Articles and Publications]] * [[Benchmarks]] * [[Testimonials]] </div> <!-- Utilities Block --> <div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#F8F8FF;"> === Utilities === * [[reiser4progs]] * [[reiserfsprogs]] * [[Filesystem Testing Tools]] </div> <!-- Development Block --> <div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#fffff0;"> === Development === * [[Reiser4 patchsets]] and packages * [[Reiser4 
development model]] * [[Bugs]] * [[TODO]] list * [[Credits]] </div> <!-- Further Information Block --> <div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#dfefdf"> === Further Information === * [[mailinglists|Mailing Lists]] * [[IRC|IRC Channels]] </div> | width="55%" style="vertical-align:top" | <!-- News Block --> <div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#f0e0d0;"> === News === 2019-04-12 - Reiser4 for Linux-5.0 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-5.x/ released] 2018-06-27 - Reiser4 for Linux-4.17 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-4.x/ released] 2018-04-05 - Reiser4 for Linux-4.16 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-4.x/ released] 2018-03-28 - Reiser4 for Linux-4.15 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-4.x/ released] 2017-11-26 - [https://www.spinics.net/lists/reiserfs-devel/msg05650.html reiser4: port for Linux-4.14] released 2017-09-06 - Reiser4 for Linux-4.13 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-4.x/ released] 2017-08-14 - [https://www.spinics.net/lists/reiserfs-devel/msg05603.html Reiser4 for Linux-4.12] released 2017-06-03 - [https://www.spinics.net/lists/reiserfs-devel/msg05519.html Reiser4: Port for Linux-4.11] released 2017-02-21 - [https://www.spinics.net/lists/reiserfs-devel/msg05385.html reiser4: port for Linux-4.10] released &rarr; See the [[News Archive]] for older news. 
</div> |} 1389c98c01ae1cc0b32bb9c14be5729d0d14e5cb 4303 4301 2018-09-18T19:02:26Z Chris goe 2 postings referenced __NOTOC__ {| width="100%" |- | style="vertical-align:top" | <!-- Documentation Block --> <div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#dcf5ff;"> === Documentation === * [[Why_Reiser4 | Why Reiser4?]] * [[Reiser4_Howto | Getting started with ReiserFS/Reiser4]] * [[Debug_Reiser4 | Debug Reiser4]] * [[Debug_Reiser4progs | Debug Reiser4progs]] * [[FAQ|Frequently Asked Questions]] * [[Manpages]] * [[Publications | Articles and Publications]] * [[Benchmarks]] * [[Testimonials]] </div> <!-- Utilities Block --> <div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#F8F8FF;"> === Utilities === * [[reiser4progs]] * [[reiserfsprogs]] * [[Filesystem Testing Tools]] </div> <!-- Development Block --> <div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#fffff0;"> === Development === * [[Reiser4 patchsets]] and packages * [[Reiser4 development model]] * [[Bugs]] * [[TODO]] list * [[Credits]] </div> <!-- Further Information Block --> <div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#dfefdf"> === Further Information === * [[mailinglists|Mailing Lists]] * [[IRC|IRC Channels]] </div> | width="55%" style="vertical-align:top" | <!-- News Block --> <div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#f0e0d0;"> === News === 2018-06-27 - Reiser4 for Linux-4.17 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-4.x/ released] 2018-04-05 - Reiser4 for Linux-4.16 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-4.x/ released] 2018-03-28 - Reiser4 for Linux-4.15 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-4.x/ released] 2017-11-26 - [https://www.spinics.net/lists/reiserfs-devel/msg05650.html reiser4: 
port for Linux-4.14] released 2017-09-06 - Reiser4 for Linux-4.13 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-4.x/ released] 2017-08-14 - [https://www.spinics.net/lists/reiserfs-devel/msg05603.html Reiser4 for Linux-4.12] released 2017-06-03 - [https://www.spinics.net/lists/reiserfs-devel/msg05519.html Reiser4: Port for Linux-4.11] released 2017-02-21 - [https://www.spinics.net/lists/reiserfs-devel/msg05385.html reiser4: port for Linux-4.10] released &rarr; See the [[News Archive]] for older news. </div> |} 4198c4690cb91bd08b8e9d26996a2168219a3b7e 4301 4255 2018-09-18T19:00:34Z Chris goe 2 news updated; old news moved into [[News_Archive]] __NOTOC__ {| width="100%" |- | style="vertical-align:top" | <!-- Documentation Block --> <div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#dcf5ff;"> === Documentation === * [[Why_Reiser4 | Why Reiser4?]] * [[Reiser4_Howto | Getting started with ReiserFS/Reiser4]] * [[Debug_Reiser4 | Debug Reiser4]] * [[Debug_Reiser4progs | Debug Reiser4progs]] * [[FAQ|Frequently Asked Questions]] * [[Manpages]] * [[Publications | Articles and Publications]] * [[Benchmarks]] * [[Testimonials]] </div> <!-- Utilities Block --> <div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#F8F8FF;"> === Utilities === * [[reiser4progs]] * [[reiserfsprogs]] * [[Filesystem Testing Tools]] </div> <!-- Development Block --> <div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#fffff0;"> === Development === * [[Reiser4 patchsets]] and packages * [[Reiser4 development model]] * [[Bugs]] * [[TODO]] list * [[Credits]] </div> <!-- Further Information Block --> <div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#dfefdf"> === Further Information === * [[mailinglists|Mailing Lists]] * [[IRC|IRC Channels]] </div> | width="55%" style="vertical-align:top" | <!-- News Block 
--> <div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#f0e0d0;"> === News === 2018-06-27 - Reiser4 for Linux-4.17 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-4.x/ released] 2018-04-05 - Reiser4 for Linux-4.16 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-4.x/ released] 2018-03-28 - Reiser4 for Linux-4.15 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-4.x/ released] 2017-11-26 - [https://www.spinics.net/lists/reiserfs-devel/msg05650.html reiser4: port for Linux-4.14] released 2017-09-06 - Reiser4 for Linux-4.13 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-4.x/ released] 2017-08-14 - [https://www.spinics.net/lists/reiserfs-devel/msg05603.html Reiser4 for Linux-4.12] released 2017-06-03 - Reiser4 for Linux-4.11 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-4.x/ released] 2017-02-21 - Reiser4 for Linux-4.10 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-4.x/ released] &rarr; See the [[News Archive]] for older news. 
</div> |} a66b0622f8f25db2645163287a2dbdd32644cdfc 4255 4235 2017-06-24T21:29:53Z Edward 4 Added "Debug Reiser4progs" entry __NOTOC__ {| width="100%" |- | style="vertical-align:top" | <!-- Documentation Block --> <div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#dcf5ff;"> === Documentation === * [[Why_Reiser4 | Why Reiser4?]] * [[Reiser4_Howto | Getting started with ReiserFS/Reiser4]] * [[Debug_Reiser4 | Debug Reiser4]] * [[Debug_Reiser4progs | Debug Reiser4progs]] * [[FAQ|Frequently Asked Questions]] * [[Manpages]] * [[Publications | Articles and Publications]] * [[Benchmarks]] * [[Testimonials]] </div> <!-- Utilities Block --> <div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#F8F8FF;"> === Utilities === * [[reiser4progs]] * [[reiserfsprogs]] * [[Filesystem Testing Tools]] </div> <!-- Development Block --> <div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#fffff0;"> === Development === * [[Reiser4 patchsets]] and packages * [[Reiser4 development model]] * [[Bugs]] * [[TODO]] list * [[Credits]] </div> <!-- Further Information Block --> <div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#dfefdf"> === Further Information === * [[mailinglists|Mailing Lists]] * [[IRC|IRC Channels]] </div> | width="55%" style="vertical-align:top" | <!-- News Block --> <div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#f0e0d0;"> === News === 2017-06-03 - Reiser4 for Linux-4.11 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-4.x/ released] 2017-02-21 - Reiser4 for Linux-4.10 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-4.x/ released] 2016-12-17 - Reiser4 for Linux-4.9 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-4.x/ released] 2016-11-16 - Reiser4 for Linux-4.8 
[https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-4.x/ released] 2016-09-24 - Reiser4 mirrors and failover [https://www.spinics.net/lists/reiserfs-devel/msg05174.html announced] 2016-09-24 - [http://www.spinics.net/lists/reiserfs-devel/msg05173.html Edward created] Git trees for [[Reiser4progs|libaal and reiser4progs]] and for [https://github.com/edward6/reiser4 fs/reiser4] - yay! 2016-08-09 - Reiser4 for Linux-4.7 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-4.x/ released] 2016-06-06 - [[reiserfsprogs]] v3.6.25 has been [http://www.spinics.net/lists/reiserfs-devel/msg05096.html released] 2016-05-20 - Reiser4 for Linux-4.6 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-4.x/ released] 2016-05-06 - Reiser4 for Linux-4.5.3 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-4.x/ released] 2016-03-30 - Reiser4 for Linux-4.5 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-4.x/ released] 2016-01-13 - Reiser4 for Linux-4.4 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-4.x/ released] 2015-11-16 - Reiser4 for Linux-4.3 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-4.x/ released] 2015-09-22 - Reiser4 for Linux-4.2 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-4.x/ released] 2015-08-31 - Reiser4 format 4.0.1 [https://marc.info/?l=reiserfs-devel&m=144103447029219&w=2 released] (with [[Reiser4 checksums|(meta)data checksums]]) 2015-08-07 - Reiser4 for Linux-4.1 [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-4.x/ released] 2015-05-05 - Reiser4 for Linux-4.0 [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-4.x/ released] 2015-04-20 - Reiser4 for Linux-3.19 [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-3.x/ released] &rarr; See the [[News Archive]] for older news. 
</div> |} 7579f64e83b50435305ea5ac31c633982bc8b5fb 4235 4231 2017-06-20T13:22:19Z Edward 4 Added Debug Reiser4 entry __NOTOC__ {| width="100%" |- | style="vertical-align:top" | <!-- Documentation Block --> <div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#dcf5ff;"> === Documentation === * [[Why_Reiser4 | Why Reiser4?]] * [[Reiser4_Howto | Getting started with ReiserFS/Reiser4]] * [[Debug_Reiser4 | Debug Reiser4]] * [[FAQ|Frequently Asked Questions]] * [[Manpages]] * [[Publications | Articles and Publications]] * [[Benchmarks]] * [[Testimonials]] </div> <!-- Utilities Block --> <div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#F8F8FF;"> === Utilities === * [[reiser4progs]] * [[reiserfsprogs]] * [[Filesystem Testing Tools]] </div> <!-- Development Block --> <div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#fffff0;"> === Development === * [[Reiser4 patchsets]] and packages * [[Reiser4 development model]] * [[Bugs]] * [[TODO]] list * [[Credits]] </div> <!-- Further Information Block --> <div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#dfefdf"> === Further Information === * [[mailinglists|Mailing Lists]] * [[IRC|IRC Channels]] </div> | width="55%" style="vertical-align:top" | <!-- News Block --> <div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#f0e0d0;"> === News === 2017-06-03 - Reiser4 for Linux-4.11 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-4.x/ released] 2017-02-21 - Reiser4 for Linux-4.10 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-4.x/ released] 2016-12-17 - Reiser4 for Linux-4.9 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-4.x/ released] 2016-11-16 - Reiser4 for Linux-4.8 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-4.x/ released] 
2016-09-24 - Reiser4 mirrors and failover [https://www.spinics.net/lists/reiserfs-devel/msg05174.html announced] 2016-09-24 - [http://www.spinics.net/lists/reiserfs-devel/msg05173.html Edward created] Git trees for [[Reiser4progs|libaal and reiser4progs]] and for [https://github.com/edward6/reiser4 fs/reiser4] - yay! 2016-08-09 - Reiser4 for Linux-4.7 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-4.x/ released] 2016-06-06 - [[reiserfsprogs]] v3.6.25 has been [http://www.spinics.net/lists/reiserfs-devel/msg05096.html released] 2016-05-20 - Reiser4 for Linux-4.6 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-4.x/ released] 2016-05-06 - Reiser4 for Linux-4.5.3 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-4.x/ released] 2016-03-30 - Reiser4 for Linux-4.5 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-4.x/ released] 2016-01-13 - Reiser4 for Linux-4.4 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-4.x/ released] 2015-11-16 - Reiser4 for Linux-4.3 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-4.x/ released] 2015-09-22 - Reiser4 for Linux-4.2 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-4.x/ released] 2015-08-31 - Reiser4 format 4.0.1 [https://marc.info/?l=reiserfs-devel&m=144103447029219&w=2 released] (with [[Reiser4 checksums|(meta)data checksums]]) 2015-08-07 - Reiser4 for Linux-4.1 [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-4.x/ released] 2015-05-05 - Reiser4 for Linux-4.0 [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-4.x/ released] 2015-04-20 - Reiser4 for Linux-3.19 [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-3.x/ released] &rarr; See the [[News Archive]] for older news. 
</div> |} 321c484f6976b288e0a4baed15372c25a8810048 4231 4223 2017-06-08T04:50:17Z Chris goe 2 Reiser4 for Linux-4.11 released __NOTOC__ {| width="100%" |- | style="vertical-align:top" | <!-- Documentation Block --> <div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#dcf5ff;"> === Documentation === * [[Why_Reiser4 | Why Reiser4?]] * [[Reiser4_Howto | Getting started with ReiserFS/Reiser4]] * [[FAQ|Frequently Asked Questions]] * [[Manpages]] * [[Publications | Articles and Publications]] * [[Benchmarks]] * [[Testimonials]] </div> <!-- Utilities Block --> <div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#F8F8FF;"> === Utilities === * [[reiser4progs]] * [[reiserfsprogs]] * [[Filesystem Testing Tools]] </div> <!-- Development Block --> <div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#fffff0;"> === Development === * [[Reiser4 patchsets]] and packages * [[Reiser4 development model]] * [[Bugs]] * [[TODO]] list * [[Credits]] </div> <!-- Further Information Block --> <div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#dfefdf"> === Further Information === * [[mailinglists|Mailing Lists]] * [[IRC|IRC Channels]] </div> | width="55%" style="vertical-align:top" | <!-- News Block --> <div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#f0e0d0;"> === News === 2017-06-03 - Reiser4 for Linux-4.11 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-4.x/ released] 2017-02-21 - Reiser4 for Linux-4.10 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-4.x/ released] 2016-12-17 - Reiser4 for Linux-4.9 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-4.x/ released] 2016-11-16 - Reiser4 for Linux-4.8 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-4.x/ released] 2016-09-24 - Reiser4 mirrors and 
failover [https://www.spinics.net/lists/reiserfs-devel/msg05174.html announced] 2016-09-24 - [http://www.spinics.net/lists/reiserfs-devel/msg05173.html Edward created] Git trees for [[Reiser4progs|libaal and reiser4progs]] and for [https://github.com/edward6/reiser4 fs/reiser4] - yay! 2016-08-09 - Reiser4 for Linux-4.7 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-4.x/ released] 2016-06-06 - [[reiserfsprogs]] v3.6.25 has been [http://www.spinics.net/lists/reiserfs-devel/msg05096.html released] 2016-05-20 - Reiser4 for Linux-4.6 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-4.x/ released] 2016-05-06 - Reiser4 for Linux-4.5.3 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-4.x/ released] 2016-03-30 - Reiser4 for Linux-4.5 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-4.x/ released] 2016-01-13 - Reiser4 for Linux-4.4 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-4.x/ released] 2015-11-16 - Reiser4 for Linux-4.3 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-4.x/ released] 2015-09-22 - Reiser4 for Linux-4.2 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-4.x/ released] 2015-08-31 - Reiser4 format 4.0.1 [https://marc.info/?l=reiserfs-devel&m=144103447029219&w=2 released] (with [[Reiser4 checksums|(meta)data checksums]]) 2015-08-07 - Reiser4 for Linux-4.1 [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-4.x/ released] 2015-05-05 - Reiser4 for Linux-4.0 [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-4.x/ released] 2015-04-20 - Reiser4 for Linux-3.19 [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-3.x/ released] &rarr; See the [[News Archive]] for older news. 
</div> |} 4b2a14536a6412b71464876d632822f6b99fefab 4223 4217 2017-05-09T13:03:22Z Edward 4 Added "Why Reiser4" item __NOTOC__ {| width="100%" |- | style="vertical-align:top" | <!-- Documentation Block --> <div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#dcf5ff;"> === Documentation === * [[Why_Reiser4 | Why Reiser4?]] * [[Reiser4_Howto | Getting started with ReiserFS/Reiser4]] * [[FAQ|Frequently Asked Questions]] * [[Manpages]] * [[Publications | Articles and Publications]] * [[Benchmarks]] * [[Testimonials]] </div> <!-- Utilities Block --> <div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#F8F8FF;"> === Utilities === * [[reiser4progs]] * [[reiserfsprogs]] * [[Filesystem Testing Tools]] </div> <!-- Development Block --> <div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#fffff0;"> === Development === * [[Reiser4 patchsets]] and packages * [[Reiser4 development model]] * [[Bugs]] * [[TODO]] list * [[Credits]] </div> <!-- Further Information Block --> <div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#dfefdf"> === Further Information === * [[mailinglists|Mailing Lists]] * [[IRC|IRC Channels]] </div> | width="55%" style="vertical-align:top" | <!-- News Block --> <div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#f0e0d0;"> === News === 2017-02-21 - Reiser4 for Linux-4.10 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-4.x/ released] 2016-12-17 - Reiser4 for Linux-4.9 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-4.x/ released] 2016-11-16 - Reiser4 for Linux-4.8 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-4.x/ released] 2016-09-24 - Reiser4 mirrors and failover [https://www.spinics.net/lists/reiserfs-devel/msg05174.html announced] 2016-09-24 - 
[http://www.spinics.net/lists/reiserfs-devel/msg05173.html Edward created] Git trees for [[Reiser4progs|libaal and reiser4progs]] and for [https://github.com/edward6/reiser4 fs/reiser4] - yay! 2016-08-09 - Reiser4 for Linux-4.7 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-4.x/ released] 2016-06-06 - [[reiserfsprogs]] v3.6.25 has been [http://www.spinics.net/lists/reiserfs-devel/msg05096.html released] 2016-05-20 - Reiser4 for Linux-4.6 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-4.x/ released] 2016-05-06 - Reiser4 for Linux-4.5.3 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-4.x/ released] 2016-03-30 - Reiser4 for Linux-4.5 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-4.x/ released] 2016-01-13 - Reiser4 for Linux-4.4 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-4.x/ released] 2015-11-16 - Reiser4 for Linux-4.3 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-4.x/ released] 2015-09-22 - Reiser4 for Linux-4.2 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-4.x/ released] 2015-08-31 - Reiser4 format 4.0.1 [https://marc.info/?l=reiserfs-devel&m=144103447029219&w=2 released] (with [[Reiser4 checksums|(meta)data checksums]]) 2015-08-07 - Reiser4 for Linux-4.1 [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-4.x/ released] 2015-05-05 - Reiser4 for Linux-4.0 [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-4.x/ released] 2015-04-20 - Reiser4 for Linux-3.19 [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-3.x/ released] &rarr; See the [[News Archive]] for older news. 
</div> |} f3fe1bf37efa0ba5f2fb5250a4209c8f35aa96da 4217 4201 2017-02-26T23:34:08Z Edward 4 /* News */ __NOTOC__ {| width="100%" |- | style="vertical-align:top" | <!-- Documentation Block --> <div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#dcf5ff;"> === Documentation === * [[Reiser4_Howto | Getting started with ReiserFS/Reiser4]] * [[FAQ|Frequently Asked Questions]] * [[Manpages]] * [[Publications | Articles and Publications]] * [[Benchmarks]] * [[Testimonials]] </div> <!-- Utilities Block --> <div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#F8F8FF;"> === Utilities === * [[reiser4progs]] * [[reiserfsprogs]] * [[Filesystem Testing Tools]] </div> <!-- Development Block --> <div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#fffff0;"> === Development === * [[Reiser4 patchsets]] and packages * [[Reiser4 development model]] * [[Bugs]] * [[TODO]] list * [[Credits]] </div> <!-- Further Information Block --> <div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#dfefdf"> === Further Information === * [[mailinglists|Mailing Lists]] * [[IRC|IRC Channels]] </div> | width="55%" style="vertical-align:top" | <!-- News Block --> <div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#f0e0d0;"> === News === 2017-02-21 - Reiser4 for Linux-4.10 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-4.x/ released] 2016-12-17 - Reiser4 for Linux-4.9 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-4.x/ released] 2016-11-16 - Reiser4 for Linux-4.8 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-4.x/ released] 2016-09-24 - Reiser4 mirrors and failover [https://www.spinics.net/lists/reiserfs-devel/msg05174.html announced] 2016-09-24 - [http://www.spinics.net/lists/reiserfs-devel/msg05173.html Edward created] Git trees 
for [[Reiser4progs|libaal and reiser4progs]] and for [https://github.com/edward6/reiser4 fs/reiser4] - yay! 2016-08-09 - Reiser4 for Linux-4.7 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-4.x/ released] 2016-06-06 - [[reiserfsprogs]] v3.6.25 has been [http://www.spinics.net/lists/reiserfs-devel/msg05096.html released] 2016-05-20 - Reiser4 for Linux-4.6 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-4.x/ released] 2016-05-06 - Reiser4 for Linux-4.5.3 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-4.x/ released] 2016-03-30 - Reiser4 for Linux-4.5 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-4.x/ released] 2016-01-13 - Reiser4 for Linux-4.4 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-4.x/ released] 2015-11-16 - Reiser4 for Linux-4.3 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-4.x/ released] 2015-09-22 - Reiser4 for Linux-4.2 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-4.x/ released] 2015-08-31 - Reiser4 format 4.0.1 [https://marc.info/?l=reiserfs-devel&m=144103447029219&w=2 released] (with [[Reiser4 checksums|(meta)data checksums]]) 2015-08-07 - Reiser4 for Linux-4.1 [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-4.x/ released] 2015-05-05 - Reiser4 for Linux-4.0 [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-4.x/ released] 2015-04-20 - Reiser4 for Linux-3.19 [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-3.x/ released] &rarr; See the [[News Archive]] for older news. 
</div> |} 0ddb9893ec71df22325a77b23defc02cade503fb 4201 4197 2016-12-25T19:39:33Z DusanC 30310 /* News */ __NOTOC__ {| width="100%" |- | style="vertical-align:top" | <!-- Documentation Block --> <div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#dcf5ff;"> === Documentation === * [[Reiser4_Howto | Getting started with ReiserFS/Reiser4]] * [[FAQ|Frequently Asked Questions]] * [[Manpages]] * [[Publications | Articles and Publications]] * [[Benchmarks]] * [[Testimonials]] </div> <!-- Utilities Block --> <div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#F8F8FF;"> === Utilities === * [[reiser4progs]] * [[reiserfsprogs]] * [[Filesystem Testing Tools]] </div> <!-- Development Block --> <div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#fffff0;"> === Development === * [[Reiser4 patchsets]] and packages * [[Reiser4 development model]] * [[Bugs]] * [[TODO]] list * [[Credits]] </div> <!-- Further Information Block --> <div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#dfefdf"> === Further Information === * [[mailinglists|Mailing Lists]] * [[IRC|IRC Channels]] </div> | width="55%" style="vertical-align:top" | <!-- News Block --> <div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#f0e0d0;"> === News === 2016-12-17 - Reiser4 for Linux-4.9 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-4.x/ released] 2016-11-16 - Reiser4 for Linux-4.8 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-4.x/ released] 2016-09-24 - Reiser4 mirrors and failover [https://www.spinics.net/lists/reiserfs-devel/msg05174.html announced] 2016-09-24 - [http://www.spinics.net/lists/reiserfs-devel/msg05173.html Edward created] Git trees for [[Reiser4progs|libaal and reiser4progs]] and for [https://github.com/edward6/reiser4 fs/reiser4] - yay! 
2016-08-09 - Reiser4 for Linux-4.7 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-4.x/ released] 2016-06-06 - [[reiserfsprogs]] v3.6.25 has been [http://www.spinics.net/lists/reiserfs-devel/msg05096.html released] 2016-05-20 - Reiser4 for Linux-4.6 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-4.x/ released] 2016-05-06 - Reiser4 for Linux-4.5.3 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-4.x/ released] 2016-03-30 - Reiser4 for Linux-4.5 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-4.x/ released] 2016-01-13 - Reiser4 for Linux-4.4 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-4.x/ released] 2015-11-16 - Reiser4 for Linux-4.3 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-4.x/ released] 2015-09-22 - Reiser4 for Linux-4.2 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-4.x/ released] 2015-08-31 - Reiser4 format 4.0.1 [https://marc.info/?l=reiserfs-devel&m=144103447029219&w=2 released] (with [[Reiser4 checksums|(meta)data checksums]]) 2015-08-07 - Reiser4 for Linux-4.1 [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-4.x/ released] 2015-05-05 - Reiser4 for Linux-4.0 [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-4.x/ released] 2015-04-20 - Reiser4 for Linux-3.19 [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-3.x/ released] &rarr; See the [[News Archive]] for older news. 
</div> |} d3fa8ac844f068d71816fb17c02d8d8bc68be338 4197 4175 2016-11-18T10:41:28Z Edward 4 /* News */ __NOTOC__ {| width="100%" |- | style="vertical-align:top" | <!-- Documentation Block --> <div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#dcf5ff;"> === Documentation === * [[Reiser4_Howto | Getting started with ReiserFS/Reiser4]] * [[FAQ|Frequently Asked Questions]] * [[Manpages]] * [[Publications | Articles and Publications]] * [[Benchmarks]] * [[Testimonials]] </div> <!-- Utilities Block --> <div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#F8F8FF;"> === Utilities === * [[reiser4progs]] * [[reiserfsprogs]] * [[Filesystem Testing Tools]] </div> <!-- Development Block --> <div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#fffff0;"> === Development === * [[Reiser4 patchsets]] and packages * [[Reiser4 development model]] * [[Bugs]] * [[TODO]] list * [[Credits]] </div> <!-- Further Information Block --> <div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#dfefdf"> === Further Information === * [[mailinglists|Mailing Lists]] * [[IRC|IRC Channels]] </div> | width="55%" style="vertical-align:top" | <!-- News Block --> <div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#f0e0d0;"> === News === 2016-11-16 - Reiser4 for Linux-4.8 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-4.x/ released] 2016-09-24 - Reiser4 mirrors and failover [https://www.spinics.net/lists/reiserfs-devel/msg05174.html announced] 2016-09-24 - [http://www.spinics.net/lists/reiserfs-devel/msg05173.html Edward created] Git trees for [[Reiser4progs|libaal and reiser4progs]] and for [https://github.com/edward6/reiser4 fs/reiser4] - yay! 
2016-08-09 - Reiser4 for Linux-4.7 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-4.x/ released] 2016-06-06 - [[reiserfsprogs]] v3.6.25 has been [http://www.spinics.net/lists/reiserfs-devel/msg05096.html released] 2016-05-20 - Reiser4 for Linux-4.6 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-4.x/ released] 2016-05-06 - Reiser4 for Linux-4.5.3 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-4.x/ released] 2016-03-30 - Reiser4 for Linux-4.5 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-4.x/ released] 2016-01-13 - Reiser4 for Linux-4.4 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-4.x/ released] 2015-11-16 - Reiser4 for Linux-4.3 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-4.x/ released] 2015-09-22 - Reiser4 for Linux-4.2 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-4.x/ released] 2015-08-31 - Reiser4 format 4.0.1 [https://marc.info/?l=reiserfs-devel&m=144103447029219&w=2 released] (with [[Reiser4 checksums|(meta)data checksums]]) 2015-08-07 - Reiser4 for Linux-4.1 [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-4.x/ released] 2015-05-05 - Reiser4 for Linux-4.0 [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-4.x/ released] 2015-04-20 - Reiser4 for Linux-3.19 [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-3.x/ released] &rarr; See the [[News Archive]] for older news. 
</div> |} 6a4003fe4d70f341eb951bdacae685acc406a4d0 4175 4165 2016-09-25T19:21:26Z Edward 4 /* News */ __NOTOC__ {| width="100%" |- | style="vertical-align:top" | <!-- Documentation Block --> <div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#dcf5ff;"> === Documentation === * [[Reiser4_Howto | Getting started with ReiserFS/Reiser4]] * [[FAQ|Frequently Asked Questions]] * [[Manpages]] * [[Publications | Articles and Publications]] * [[Benchmarks]] * [[Testimonials]] </div> <!-- Utilities Block --> <div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#F8F8FF;"> === Utilities === * [[reiser4progs]] * [[reiserfsprogs]] * [[Filesystem Testing Tools]] </div> <!-- Development Block --> <div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#fffff0;"> === Development === * [[Reiser4 patchsets]] and packages * [[Reiser4 development model]] * [[Bugs]] * [[TODO]] list * [[Credits]] </div> <!-- Further Information Block --> <div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#dfefdf"> === Further Information === * [[mailinglists|Mailing Lists]] * [[IRC|IRC Channels]] </div> | width="55%" style="vertical-align:top" | <!-- News Block --> <div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#f0e0d0;"> === News === 2016-09-24 - Reiser4 mirrors and failover [https://www.spinics.net/lists/reiserfs-devel/msg05174.html announced] 2016-09-24 - [http://www.spinics.net/lists/reiserfs-devel/msg05173.html Edward created] Git trees for [[Reiser4progs|libaal and reiser4progs]] and for [https://github.com/edward6/reiser4 fs/reiser4] - yay! 
2016-08-09 - Reiser4 for Linux-4.7 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-4.x/ released] 2016-06-06 - [[reiserfsprogs]] v3.6.25 has been [http://www.spinics.net/lists/reiserfs-devel/msg05096.html released] 2016-05-20 - Reiser4 for Linux-4.6 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-4.x/ released] 2016-05-06 - Reiser4 for Linux-4.5.3 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-4.x/ released] 2016-03-30 - Reiser4 for Linux-4.5 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-4.x/ released] 2016-01-13 - Reiser4 for Linux-4.4 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-4.x/ released] 2015-11-16 - Reiser4 for Linux-4.3 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-4.x/ released] 2015-09-22 - Reiser4 for Linux-4.2 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-4.x/ released] 2015-08-31 - Reiser4 format 4.0.1 [https://marc.info/?l=reiserfs-devel&m=144103447029219&w=2 released] (with [[Reiser4 checksums|(meta)data checksums]]) 2015-08-07 - Reiser4 for Linux-4.1 [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-4.x/ released] 2015-05-05 - Reiser4 for Linux-4.0 [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-4.x/ released] 2015-04-20 - Reiser4 for Linux-3.19 [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-3.x/ released] &rarr; See the [[News Archive]] for older news. 
</div> |} 789f3ec3bc04c1c35ad8c6bd72f8d6c67a6298bb 4165 4161 2016-09-24T22:02:24Z Chris goe 2 Git trees for libaal and reiser4progs and for fs/reiser4 __NOTOC__ {| width="100%" |- | style="vertical-align:top" | <!-- Documentation Block --> <div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#dcf5ff;"> === Documentation === * [[Reiser4_Howto | Getting started with ReiserFS/Reiser4]] * [[FAQ|Frequently Asked Questions]] * [[Manpages]] * [[Publications | Articles and Publications]] * [[Benchmarks]] * [[Testimonials]] </div> <!-- Utilities Block --> <div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#F8F8FF;"> === Utilities === * [[reiser4progs]] * [[reiserfsprogs]] * [[Filesystem Testing Tools]] </div> <!-- Development Block --> <div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#fffff0;"> === Development === * [[Reiser4 patchsets]] and packages * [[Reiser4 development model]] * [[Bugs]] * [[TODO]] list * [[Credits]] </div> <!-- Further Information Block --> <div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#dfefdf"> === Further Information === * [[mailinglists|Mailing Lists]] * [[IRC|IRC Channels]] </div> | width="55%" style="vertical-align:top" | <!-- News Block --> <div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#f0e0d0;"> === News === 2016-09-24 - [http://www.spinics.net/lists/reiserfs-devel/msg05173.html Edward created] Git trees for [[Reiser4progs|libaal and reiser4progs]] and for [https://github.com/edward6/reiser4 fs/reiser4] - yay! 
2016-08-09 - Reiser4 for Linux-4.7 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-4.x/ released] 2016-06-06 - [[reiserfsprogs]] v3.6.25 has been [http://www.spinics.net/lists/reiserfs-devel/msg05096.html released] 2016-05-20 - Reiser4 for Linux-4.6 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-4.x/ released] 2016-05-06 - Reiser4 for Linux-4.5.3 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-4.x/ released] 2016-03-30 - Reiser4 for Linux-4.5 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-4.x/ released] 2016-01-13 - Reiser4 for Linux-4.4 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-4.x/ released] 2015-11-16 - Reiser4 for Linux-4.3 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-4.x/ released] 2015-09-22 - Reiser4 for Linux-4.2 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-4.x/ released] 2015-08-31 - Reiser4 format 4.0.1 [https://marc.info/?l=reiserfs-devel&m=144103447029219&w=2 released] (with [[Reiser4 checksums|(meta)data checksums]]) 2015-08-07 - Reiser4 for Linux-4.1 [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-4.x/ released] 2015-05-05 - Reiser4 for Linux-4.0 [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-4.x/ released] 2015-04-20 - Reiser4 for Linux-3.19 [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-3.x/ released] &rarr; See the [[News Archive]] for older news. 
</div> |} cbb7d0b0fc32e971d4b8a68790e3053287623e37 4161 4149 2016-09-08T22:07:06Z Edward 4 /* News */ __NOTOC__ {| width="100%" |- | style="vertical-align:top" | <!-- Documentation Block --> <div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#dcf5ff;"> === Documentation === * [[Reiser4_Howto | Getting started with ReiserFS/Reiser4]] * [[FAQ|Frequently Asked Questions]] * [[Manpages]] * [[Publications | Articles and Publications]] * [[Benchmarks]] * [[Testimonials]] </div> <!-- Utilities Block --> <div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#F8F8FF;"> === Utilities === * [[reiser4progs]] * [[reiserfsprogs]] * [[Filesystem Testing Tools]] </div> <!-- Development Block --> <div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#fffff0;"> === Development === * [[Reiser4 patchsets]] and packages * [[Reiser4 development model]] * [[Bugs]] * [[TODO]] list * [[Credits]] </div> <!-- Further Information Block --> <div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#dfefdf"> === Further Information === * [[mailinglists|Mailing Lists]] * [[IRC|IRC Channels]] </div> | width="55%" style="vertical-align:top" | <!-- News Block --> <div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#f0e0d0;"> === News === 2016-08-09 - Reiser4 for Linux-4.7 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-4.x/ released] 2016-06-06 - [[reiserfsprogs]] v3.6.25 has been [http://www.spinics.net/lists/reiserfs-devel/msg05096.html released] 2016-05-20 - Reiser4 for Linux-4.6 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-4.x/ released] 2016-05-06 - Reiser4 for Linux-4.5.3 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-4.x/ released] 2016-03-30 - Reiser4 for Linux-4.5 
[https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-4.x/ released] 2016-01-13 - Reiser4 for Linux-4.4 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-4.x/ released] 2015-11-16 - Reiser4 for Linux-4.3 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-4.x/ released] 2015-09-22 - Reiser4 for Linux-4.2 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-4.x/ released] 2015-08-31 - Reiser4 format 4.0.1 [https://marc.info/?l=reiserfs-devel&m=144103447029219&w=2 released] (with [[Reiser4 checksums|(meta)data checksums]]) 2015-08-07 - Reiser4 for Linux-4.1 [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-4.x/ released] 2015-05-05 - Reiser4 for Linux-4.0 [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-4.x/ released] 2015-04-20 - Reiser4 for Linux-3.19 [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-3.x/ released] &rarr; See the [[News Archive]] for older news. </div> |} 7519ec67bfd6d864fad73c8221c324197b473c3b 4149 4145 2016-06-23T18:33:37Z Chris goe 2 reiserfsprogs v3.6.25 __NOTOC__ {| width="100%" |- | style="vertical-align:top" | <!-- Documentation Block --> <div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#dcf5ff;"> === Documentation === * [[Reiser4_Howto | Getting started with ReiserFS/Reiser4]] * [[FAQ|Frequently Asked Questions]] * [[Manpages]] * [[Publications | Articles and Publications]] * [[Benchmarks]] * [[Testimonials]] </div> <!-- Utilities Block --> <div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#F8F8FF;"> === Utilities === * [[reiser4progs]] * [[reiserfsprogs]] * [[Filesystem Testing Tools]] </div> <!-- Development Block --> <div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#fffff0;"> === Development === * [[Reiser4 patchsets]] and packages * [[Reiser4 development model]] * [[Bugs]] * [[TODO]] list * 
[[Credits]] </div> <!-- Further Information Block --> <div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#dfefdf"> === Further Information === * [[mailinglists|Mailing Lists]] * [[IRC|IRC Channels]] </div> | width="55%" style="vertical-align:top" | <!-- News Block --> <div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#f0e0d0;"> === News === 2016-06-06 - [[reiserfsprogs]] v3.6.25 has been [http://www.spinics.net/lists/reiserfs-devel/msg05096.html released] 2016-05-20 - Reiser4 for Linux-4.6 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-4.x/ released] 2016-05-06 - Reiser4 for Linux-4.5.3 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-4.x/ released] 2016-03-30 - Reiser4 for Linux-4.5 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-4.x/ released] 2016-01-13 - Reiser4 for Linux-4.4 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-4.x/ released] 2015-11-16 - Reiser4 for Linux-4.3 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-4.x/ released] 2015-09-22 - Reiser4 for Linux-4.2 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-4.x/ released] 2015-08-31 - Reiser4 format 4.0.1 [https://marc.info/?l=reiserfs-devel&m=144103447029219&w=2 released] (with [[Reiser4 checksums|(meta)data checksums]]) 2015-08-07 - Reiser4 for Linux-4.1 [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-4.x/ released] 2015-05-05 - Reiser4 for Linux-4.0 [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-4.x/ released] 2015-04-20 - Reiser4 for Linux-3.19 [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-3.x/ released] &rarr; See the [[News Archive]] for older news. 
</div> |} 3727ebe70483e977ad3f416f1e18ae7c0cf7ceca 4145 4123 2016-06-03T01:57:30Z Chris goe 2 patches for 4.5.3 and 4.6 were released; moving older news to the archive __NOTOC__ {| width="100%" |- | style="vertical-align:top" | <!-- Documentation Block --> <div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#dcf5ff;"> === Documentation === * [[Reiser4_Howto | Getting started with ReiserFS/Reiser4]] * [[FAQ|Frequently Asked Questions]] * [[Manpages]] * [[Publications | Articles and Publications]] * [[Benchmarks]] * [[Testimonials]] </div> <!-- Utilities Block --> <div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#F8F8FF;"> === Utilities === * [[reiser4progs]] * [[reiserfsprogs]] * [[Filesystem Testing Tools]] </div> <!-- Development Block --> <div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#fffff0;"> === Development === * [[Reiser4 patchsets]] and packages * [[Reiser4 development model]] * [[Bugs]] * [[TODO]] list * [[Credits]] </div> <!-- Further Information Block --> <div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#dfefdf"> === Further Information === * [[mailinglists|Mailing Lists]] * [[IRC|IRC Channels]] </div> | width="55%" style="vertical-align:top" | <!-- News Block --> <div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#f0e0d0;"> === News === 2016-05-20 - Reiser4 for Linux-4.6 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-4.x/ released] 2016-05-06 - Reiser4 for Linux-4.5.3 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-4.x/ released] 2016-03-30 - Reiser4 for Linux-4.5 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-4.x/ released] 2016-01-13 - Reiser4 for Linux-4.4 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-4.x/ released] 2015-11-16 - Reiser4 for 
Linux-4.3 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-4.x/ released]

2015-09-22 - Reiser4 for Linux-4.2 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-4.x/ released]

2015-08-31 - Reiser4 format 4.0.1 [https://marc.info/?l=reiserfs-devel&m=144103447029219&w=2 released] (with [[Reiser4 checksums|(meta)data checksums]])

2015-08-07 - Reiser4 for Linux-4.1 [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-4.x/ released]

2015-05-05 - Reiser4 for Linux-4.0 [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-4.x/ released]

2015-04-20 - Reiser4 for Linux-3.19 [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-3.x/ released]

&rarr; See the [[News Archive]] for older news.
</div>
|}

41f03b18f3697465bce58ad0e79587f98cbbcd42 4061 4060 2015-08-07T15:40:16Z Edward 4 /* News */ __NOTOC__ {| width="100%" |- | style="vertical-align:top" | <!-- Documentation Block --> <div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#dcf5ff;"> === Documentation === * [[Reiser4_Howto | Getting started with ReiserFS/Reiser4]] * [[FAQ|Frequently Asked Questions]] * [[Manpages]] * [[Publications | Articles and Publications]] * [[Benchmarks]] * [[Testimonials]] </div> <!-- Utilities Block --> <div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#F8F8FF;"> === Utilities === * [[reiser4progs]] * [[reiserfsprogs]] * [[Filesystem Testing Tools]] </div> <!-- Development Block --> <div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#fffff0;"> === Development === *
[[Reiser4 patchsets]] and packages * [[Bugs]] * [[TODO]] list * [[Credits]] </div> | width="55%" style="vertical-align:top" | <!-- News Block --> <div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#f0e0d0;"> === News === <div style="font-size:small"> 2015-08-07 - Reiser4 for Linux-4.1 [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-4.x/ released] 2015-05-05 - Reiser4 for Linux-4.0 [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-4.x/ released] 2015-04-20 - Reiser4 for Linux-3.19 [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-3.x/ released] 2014-10-26 - Reiser4 for Linux-3.17 [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-3.x/ released] 2014-10-13 - Reiser4 for Linux-3.16 [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-3.x/ released] 2014-08-24 - Reiser4 for Linux-3.15 [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-3.x/ released] 2014-06-29 - [http://sourceforge.net/projects/reiser4/files/reiser4-utils/libaal libaal-1.0.6] and [http://sourceforge.net/projects/reiser4/files/reiser4-utils/reiser4progs reiser4progs-1.0.9] released. 
2014-05-09 - Glenn [http://www.spinics.net/lists/reiserfs-devel/msg03897.html provided] new [https://build.opensuse.org/package/show/home:doiggl/kernel-reiser4 Reiser4-enabled openSUSE kernels] 2014-05-07 - Reiser4 for Linux-3.14 [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-3.x/ released] 2014-05-06 - Ivan Shapovalov [http://marc.info/?l=reiserfs-devel&m=139935424207357&w=2 announced] discard support in Reiser4 2014-04-23 - Jeff Mahoney published a big [http://www.spinics.net/lists/reiserfs-devel/msg03814.html reiserfs cleanup patchset] 2014-03-11 - Different transaction models in Reiser4 [http://marc.info/?l=reiserfs-devel&m=139449965000686&w=2 announced] 2014-02-05 - Reiser4 for Linux-3.13 [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-3.x/ released] &rarr; See the [[News Archive]] for older news. </div> </div> <!-- Further Information Block --> <div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#dfefdf"> === Further Information === * [[mailinglists|Mailing Lists]] * [[IRC|IRC Channels]] </div> |} d07b14196434e1a4cb1e785b1c4f1860f7557b80 4060 4059 2015-06-07T02:01:04Z Chris goe 2 sourceforge link fixed __NOTOC__ {| width="100%" |- | style="vertical-align:top" | <!-- Documentation Block --> <div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#dcf5ff;"> === Documentation === * [[Reiser4_Howto | Getting started with ReiserFS/Reiser4]] * [[FAQ|Frequently Asked Questions]] * [[Manpages]] * [[Publications | Articles and Publications]] * [[Benchmarks]] * [[Testimonials]] </div> <!-- Utilities Block --> <div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#F8F8FF;"> === Utilities === * [[reiser4progs]] * [[reiserfsprogs]] * [[Filesystem Testing Tools]] </div> <!-- Development Block --> <div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#fffff0;"> === 
Development === * [[Reiser4 patchsets]] and packages * [[Bugs]] * [[TODO]] list * [[Credits]] </div> | width="55%" style="vertical-align:top" | <!-- News Block --> <div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#f0e0d0;"> === News === <div style="font-size:small"> 2015-05-05 - Reiser4 for Linux-4.0 [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-4.x/ released] 2015-04-20 - Reiser4 for Linux-3.19 [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-3.x/ released] 2014-10-26 - Reiser4 for Linux-3.17 [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-3.x/ released] 2014-10-13 - Reiser4 for Linux-3.16 [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-3.x/ released] 2014-08-24 - Reiser4 for Linux-3.15 [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-3.x/ released] 2014-06-29 - [http://sourceforge.net/projects/reiser4/files/reiser4-utils/libaal libaal-1.0.6] and [http://sourceforge.net/projects/reiser4/files/reiser4-utils/reiser4progs reiser4progs-1.0.9] released. 2014-05-09 - Glenn [http://www.spinics.net/lists/reiserfs-devel/msg03897.html provided] new [https://build.opensuse.org/package/show/home:doiggl/kernel-reiser4 Reiser4-enabled openSUSE kernels] 2014-05-07 - Reiser4 for Linux-3.14 [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-3.x/ released] 2014-05-06 - Ivan Shapovalov [http://marc.info/?l=reiserfs-devel&m=139935424207357&w=2 announced] discard support in Reiser4 2014-04-23 - Jeff Mahoney published a big [http://www.spinics.net/lists/reiserfs-devel/msg03814.html reiserfs cleanup patchset] 2014-03-11 - Different transaction models in Reiser4 [http://marc.info/?l=reiserfs-devel&m=139449965000686&w=2 announced] 2014-02-05 - Reiser4 for Linux-3.13 [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-3.x/ released] &rarr; See the [[News Archive]] for older news. 
</div> </div> <!-- Further Information Block --> <div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#dfefdf"> === Further Information === * [[mailinglists|Mailing Lists]] * [[IRC|IRC Channels]] </div> |} 0983c5f59335d1170bf62f017d8e1aea28d89435 4059 4057 2015-06-05T14:42:33Z Edward 4 /* News */ __NOTOC__ {| width="100%" |- | style="vertical-align:top" | <!-- Documentation Block --> <div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#dcf5ff;"> === Documentation === * [[Reiser4_Howto | Getting started with ReiserFS/Reiser4]] * [[FAQ|Frequently Asked Questions]] * [[Manpages]] * [[Publications | Articles and Publications]] * [[Benchmarks]] * [[Testimonials]] </div> <!-- Utilities Block --> <div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#F8F8FF;"> === Utilities === * [[reiser4progs]] * [[reiserfsprogs]] * [[Filesystem Testing Tools]] </div> <!-- Development Block --> <div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#fffff0;"> === Development === * [[Reiser4 patchsets]] and packages * [[Bugs]] * [[TODO]] list * [[Credits]] </div> | width="55%" style="vertical-align:top" | <!-- News Block --> <div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#f0e0d0;"> === News === <div style="font-size:small"> 2015-05-05 - Reiser4 for Linux-4.0 [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-3.x/ released] 2015-04-20 - Reiser4 for Linux-3.19 [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-3.x/ released] 2014-10-26 - Reiser4 for Linux-3.17 [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-3.x/ released] 2014-10-13 - Reiser4 for Linux-3.16 [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-3.x/ released] 2014-08-24 - Reiser4 for Linux-3.15 
[http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-3.x/ released] 2014-06-29 - [http://sourceforge.net/projects/reiser4/files/reiser4-utils/libaal libaal-1.0.6] and [http://sourceforge.net/projects/reiser4/files/reiser4-utils/reiser4progs reiser4progs-1.0.9] released. 2014-05-09 - Glenn [http://www.spinics.net/lists/reiserfs-devel/msg03897.html provided] new [https://build.opensuse.org/package/show/home:doiggl/kernel-reiser4 Reiser4-enabled openSUSE kernels] 2014-05-07 - Reiser4 for Linux-3.14 [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-3.x/ released] 2014-05-06 - Ivan Shapovalov [http://marc.info/?l=reiserfs-devel&m=139935424207357&w=2 announced] discard support in Reiser4 2014-04-23 - Jeff Mahoney published a big [http://www.spinics.net/lists/reiserfs-devel/msg03814.html reiserfs cleanup patchset] 2014-03-11 - Different transaction models in Reiser4 [http://marc.info/?l=reiserfs-devel&m=139449965000686&w=2 announced] 2014-02-05 - Reiser4 for Linux-3.13 [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-3.x/ released] &rarr; See the [[News Archive]] for older news. 
</div> </div> <!-- Further Information Block --> <div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#dfefdf"> === Further Information === * [[mailinglists|Mailing Lists]] * [[IRC|IRC Channels]] </div> |} 9e6555e80caa1b999d3eafd5dc15eeaf53ba78d2 4057 4056 2015-04-20T09:15:58Z Chris goe 2 2013 moved to [[News_Archive]] __NOTOC__ {| width="100%" |- | style="vertical-align:top" | <!-- Documentation Block --> <div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#dcf5ff;"> === Documentation === * [[Reiser4_Howto | Getting started with ReiserFS/Reiser4]] * [[FAQ|Frequently Asked Questions]] * [[Manpages]] * [[Publications | Articles and Publications]] * [[Benchmarks]] * [[Testimonials]] </div> <!-- Utilities Block --> <div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#F8F8FF;"> === Utilities === * [[reiser4progs]] * [[reiserfsprogs]] * [[Filesystem Testing Tools]] </div> <!-- Development Block --> <div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#fffff0;"> === Development === * [[Reiser4 patchsets]] and packages * [[Bugs]] * [[TODO]] list * [[Credits]] </div> | width="55%" style="vertical-align:top" | <!-- News Block --> <div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#f0e0d0;"> === News === <div style="font-size:small"> 2015-04-20 - Reiser4 for Linux-3.19 [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-3.x/ released] 2014-10-26 - Reiser4 for Linux-3.17 [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-3.x/ released] 2014-10-13 - Reiser4 for Linux-3.16 [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-3.x/ released] 2014-08-24 - Reiser4 for Linux-3.15 [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-3.x/ released] 2014-06-29 - 
[http://sourceforge.net/projects/reiser4/files/reiser4-utils/libaal libaal-1.0.6] and [http://sourceforge.net/projects/reiser4/files/reiser4-utils/reiser4progs reiser4progs-1.0.9] released. 2014-05-09 - Glenn [http://www.spinics.net/lists/reiserfs-devel/msg03897.html provided] new [https://build.opensuse.org/package/show/home:doiggl/kernel-reiser4 Reiser4-enabled openSUSE kernels] 2014-05-07 - Reiser4 for Linux-3.14 [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-3.x/ released] 2014-05-06 - Ivan Shapovalov [http://marc.info/?l=reiserfs-devel&m=139935424207357&w=2 announced] discard support in Reiser4 2014-04-23 - Jeff Mahoney published a big [http://www.spinics.net/lists/reiserfs-devel/msg03814.html reiserfs cleanup patchset] 2014-03-11 - Different transaction models in Reiser4 [http://marc.info/?l=reiserfs-devel&m=139449965000686&w=2 announced] 2014-02-05 - Reiser4 for Linux-3.13 [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-3.x/ released] &rarr; See the [[News Archive]] for older news. 
</div> </div> <!-- Further Information Block --> <div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#dfefdf"> === Further Information === * [[mailinglists|Mailing Lists]] * [[IRC|IRC Channels]] </div> |} 1bbb49424683d937dc234cb6835f30d1623301a3 4056 4041 2015-04-20T08:52:57Z Edward 4 /* News */ __NOTOC__ {| width="100%" |- | style="vertical-align:top" | <!-- Documentation Block --> <div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#dcf5ff;"> === Documentation === * [[Reiser4_Howto | Getting started with ReiserFS/Reiser4]] * [[FAQ|Frequently Asked Questions]] * [[Manpages]] * [[Publications | Articles and Publications]] * [[Benchmarks]] * [[Testimonials]] </div> <!-- Utilities Block --> <div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#F8F8FF;"> === Utilities === * [[reiser4progs]] * [[reiserfsprogs]] * [[Filesystem Testing Tools]] </div> <!-- Development Block --> <div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#fffff0;"> === Development === * [[Reiser4 patchsets]] and packages * [[Bugs]] * [[TODO]] list * [[Credits]] </div> | width="55%" style="vertical-align:top" | <!-- News Block --> <div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#f0e0d0;"> === News === <div style="font-size:small"> 2015-04-20 - Reiser4 for Linux-3.19 [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-3.x/ released] 2014-10-26 - Reiser4 for Linux-3.17 [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-3.x/ released] 2014-10-13 - Reiser4 for Linux-3.16 [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-3.x/ released] 2014-08-24 - Reiser4 for Linux-3.15 [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-3.x/ released] 2014-06-29 - [http://sourceforge.net/projects/reiser4/files/reiser4-utils/libaal 
libaal-1.0.6] and [http://sourceforge.net/projects/reiser4/files/reiser4-utils/reiser4progs reiser4progs-1.0.9] released. 2014-05-09 - Glenn [http://www.spinics.net/lists/reiserfs-devel/msg03897.html provided] new [https://build.opensuse.org/package/show/home:doiggl/kernel-reiser4 Reiser4-enabled openSUSE kernels] 2014-05-07 - Reiser4 for Linux-3.14 [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-3.x/ released] 2014-05-06 - Ivan Shapovalov [http://marc.info/?l=reiserfs-devel&m=139935424207357&w=2 announced] discard support in Reiser4 2014-04-23 - Jeff Mahoney published a big [http://www.spinics.net/lists/reiserfs-devel/msg03814.html reiserfs cleanup patchset] 2014-03-11 - Different transaction models in Reiser4 [http://marc.info/?l=reiserfs-devel&m=139449965000686&w=2 announced] 2014-02-05 - Reiser4 for Linux-3.13 [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-3.x/ released] 2013-12-20 - Reiser4 for Linux-3.12 [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-3.x/ released] 2013-09-23 - Reiser4 for Linux-3.11 [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-3.x/ released] 2013-07-16 - Reiser4 for Linux-3.10 [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-3.x/ released] 2013-07-01 - [http://www.spinics.net/lists/reiserfs-devel/msg03466.html reiserfsprogs 3.6.23 released] 2013-05-25 - Reiser4 for Linux-3.9.2 [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-3.x/ released] 2013-05-04 - reiser4progs-1.0.8 [http://sourceforge.net/projects/reiser4/files/reiser4-utils/reiser4progs/ released] 2013-04-05 - Reiser4 for Linux-3.8.5 [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-3.x/ released] 2013-01-07 - Reiser4 for Linux-3.7.1 [http://marc.info/?l=reiserfs-devel&m=135750493615146&w=2 released]. &rarr; See the [[News Archive]] for older news. 
</div> </div> <!-- Further Information Block --> <div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#dfefdf"> === Further Information === * [[mailinglists|Mailing Lists]] * [[IRC|IRC Channels]] </div> |} e9f95a623d755436c38ec2f32ad04336b258c443 4041 4031 2014-10-26T23:14:23Z Edward 4 /* News */ __NOTOC__ {| width="100%" |- | style="vertical-align:top" | <!-- Documentation Block --> <div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#dcf5ff;"> === Documentation === * [[Reiser4_Howto | Getting started with ReiserFS/Reiser4]] * [[FAQ|Frequently Asked Questions]] * [[Manpages]] * [[Publications | Articles and Publications]] * [[Benchmarks]] * [[Testimonials]] </div> <!-- Utilities Block --> <div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#F8F8FF;"> === Utilities === * [[reiser4progs]] * [[reiserfsprogs]] * [[Filesystem Testing Tools]] </div> <!-- Development Block --> <div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#fffff0;"> === Development === * [[Reiser4 patchsets]] and packages * [[Bugs]] * [[TODO]] list * [[Credits]] </div> | width="55%" style="vertical-align:top" | <!-- News Block --> <div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#f0e0d0;"> === News === <div style="font-size:small"> 2014-10-26 - Reiser4 for Linux-3.17 [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-3.x/ released] 2014-10-13 - Reiser4 for Linux-3.16 [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-3.x/ released] 2014-08-24 - Reiser4 for Linux-3.15 [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-3.x/ released] 2014-06-29 - [http://sourceforge.net/projects/reiser4/files/reiser4-utils/libaal libaal-1.0.6] and [http://sourceforge.net/projects/reiser4/files/reiser4-utils/reiser4progs reiser4progs-1.0.9] released. 
2014-05-09 - Glenn [http://www.spinics.net/lists/reiserfs-devel/msg03897.html provided] new [https://build.opensuse.org/package/show/home:doiggl/kernel-reiser4 Reiser4-enabled openSUSE kernels] 2014-05-07 - Reiser4 for Linux-3.14 [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-3.x/ released] 2014-05-06 - Ivan Shapovalov [http://marc.info/?l=reiserfs-devel&m=139935424207357&w=2 announced] discard support in Reiser4 2014-04-23 - Jeff Mahoney published a big [http://www.spinics.net/lists/reiserfs-devel/msg03814.html reiserfs cleanup patchset] 2014-03-11 - Different transaction models in Reiser4 [http://marc.info/?l=reiserfs-devel&m=139449965000686&w=2 announced] 2014-02-05 - Reiser4 for Linux-3.13 [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-3.x/ released] 2013-12-20 - Reiser4 for Linux-3.12 [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-3.x/ released] 2013-09-23 - Reiser4 for Linux-3.11 [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-3.x/ released] 2013-07-16 - Reiser4 for Linux-3.10 [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-3.x/ released] 2013-07-01 - [http://www.spinics.net/lists/reiserfs-devel/msg03466.html reiserfsprogs 3.6.23 released] 2013-05-25 - Reiser4 for Linux-3.9.2 [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-3.x/ released] 2013-05-04 - reiser4progs-1.0.8 [http://sourceforge.net/projects/reiser4/files/reiser4-utils/reiser4progs/ released] 2013-04-05 - Reiser4 for Linux-3.8.5 [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-3.x/ released] 2013-01-07 - Reiser4 for Linux-3.7.1 [http://marc.info/?l=reiserfs-devel&m=135750493615146&w=2 released]. &rarr; See the [[News Archive]] for older news. 
</div> </div> <!-- Further Information Block --> <div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#dfefdf"> === Further Information === * [[mailinglists|Mailing Lists]] * [[IRC|IRC Channels]] </div> |} 827fdfe5fe689434d82f61db563b65d1a3e1def4 4031 4021 2014-10-12T15:24:51Z Edward 4 /* News */ __NOTOC__ {| width="100%" |- | style="vertical-align:top" | <!-- Documentation Block --> <div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#dcf5ff;"> === Documentation === * [[Reiser4_Howto | Getting started with ReiserFS/Reiser4]] * [[FAQ|Frequently Asked Questions]] * [[Manpages]] * [[Publications | Articles and Publications]] * [[Benchmarks]] * [[Testimonials]] </div> <!-- Utilities Block --> <div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#F8F8FF;"> === Utilities === * [[reiser4progs]] * [[reiserfsprogs]] * [[Filesystem Testing Tools]] </div> <!-- Development Block --> <div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#fffff0;"> === Development === * [[Reiser4 patchsets]] and packages * [[Bugs]] * [[TODO]] list * [[Credits]] </div> | width="55%" style="vertical-align:top" | <!-- News Block --> <div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#f0e0d0;"> === News === <div style="font-size:small"> 2014-10-13 - Reiser4 for Linux-3.16 [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-3.x/ released] 2014-08-24 - Reiser4 for Linux-3.15 [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-3.x/ released] 2014-06-29 - [http://sourceforge.net/projects/reiser4/files/reiser4-utils/libaal libaal-1.0.6] and [http://sourceforge.net/projects/reiser4/files/reiser4-utils/reiser4progs reiser4progs-1.0.9] released. 
2014-05-09 - Glenn [http://www.spinics.net/lists/reiserfs-devel/msg03897.html provided] new [https://build.opensuse.org/package/show/home:doiggl/kernel-reiser4 Reiser4-enabled openSUSE kernels] 2014-05-07 - Reiser4 for Linux-3.14 [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-3.x/ released] 2014-05-06 - Ivan Shapovalov [http://marc.info/?l=reiserfs-devel&m=139935424207357&w=2 announced] discard support in Reiser4 2014-04-23 - Jeff Mahoney published a big [http://www.spinics.net/lists/reiserfs-devel/msg03814.html reiserfs cleanup patchset] 2014-03-11 - Different transaction models in Reiser4 [http://marc.info/?l=reiserfs-devel&m=139449965000686&w=2 announced] 2014-02-05 - Reiser4 for Linux-3.13 [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-3.x/ released] 2013-12-20 - Reiser4 for Linux-3.12 [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-3.x/ released] 2013-09-23 - Reiser4 for Linux-3.11 [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-3.x/ released] 2013-07-16 - Reiser4 for Linux-3.10 [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-3.x/ released] 2013-07-01 - [http://www.spinics.net/lists/reiserfs-devel/msg03466.html reiserfsprogs 3.6.23 released] 2013-05-25 - Reiser4 for Linux-3.9.2 [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-3.x/ released] 2013-05-04 - reiser4progs-1.0.8 [http://sourceforge.net/projects/reiser4/files/reiser4-utils/reiser4progs/ released] 2013-04-05 - Reiser4 for Linux-3.8.5 [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-3.x/ released] 2013-01-07 - Reiser4 for Linux-3.7.1 [http://marc.info/?l=reiserfs-devel&m=135750493615146&w=2 released]. &rarr; See the [[News Archive]] for older news. 
</div> </div> <!-- Further Information Block --> <div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#dfefdf"> === Further Information === * [[mailinglists|Mailing Lists]] * [[IRC|IRC Channels]] </div> |} 033150803efef2641644a0d7a4df9ba275dc3087 4021 3991 2014-08-24T12:34:04Z Edward 4 __NOTOC__ {| width="100%" |- | style="vertical-align:top" | <!-- Documentation Block --> <div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#dcf5ff;"> === Documentation === * [[Reiser4_Howto | Getting started with ReiserFS/Reiser4]] * [[FAQ|Frequently Asked Questions]] * [[Manpages]] * [[Publications | Articles and Publications]] * [[Benchmarks]] * [[Testimonials]] </div> <!-- Utilities Block --> <div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#F8F8FF;"> === Utilities === * [[reiser4progs]] * [[reiserfsprogs]] * [[Filesystem Testing Tools]] </div> <!-- Development Block --> <div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#fffff0;"> === Development === * [[Reiser4 patchsets]] and packages * [[Bugs]] * [[TODO]] list * [[Credits]] </div> | width="55%" style="vertical-align:top" | <!-- News Block --> <div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#f0e0d0;"> === News === <div style="font-size:small"> 2014-08-24 - Reiser4 for Linux-3.15 [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-3.x/ released] 2014-06-29 - [http://sourceforge.net/projects/reiser4/files/reiser4-utils/libaal libaal-1.0.6] and [http://sourceforge.net/projects/reiser4/files/reiser4-utils/reiser4progs reiser4progs-1.0.9] released. 
2014-05-09 - Glenn [http://www.spinics.net/lists/reiserfs-devel/msg03897.html provided] new [https://build.opensuse.org/package/show/home:doiggl/kernel-reiser4 Reiser4-enabled openSUSE kernels] 2014-05-07 - Reiser4 for Linux-3.14 [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-3.x/ released] 2014-05-06 - Ivan Shapovalov [http://marc.info/?l=reiserfs-devel&m=139935424207357&w=2 announced] discard support in Reiser4 2014-04-23 - Jeff Mahoney published a big [http://www.spinics.net/lists/reiserfs-devel/msg03814.html reiserfs cleanup patchset] 2014-03-11 - Different transaction models in Reiser4 [http://marc.info/?l=reiserfs-devel&m=139449965000686&w=2 announced] 2014-02-05 - Reiser4 for Linux-3.13 [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-3.x/ released] 2013-12-20 - Reiser4 for Linux-3.12 [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-3.x/ released] 2013-09-23 - Reiser4 for Linux-3.11 [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-3.x/ released] 2013-07-16 - Reiser4 for Linux-3.10 [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-3.x/ released] 2013-07-01 - [http://www.spinics.net/lists/reiserfs-devel/msg03466.html reiserfsprogs 3.6.23 released] 2013-05-25 - Reiser4 for Linux-3.9.2 [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-3.x/ released] 2013-05-04 - reiser4progs-1.0.8 [http://sourceforge.net/projects/reiser4/files/reiser4-utils/reiser4progs/ released] 2013-04-05 - Reiser4 for Linux-3.8.5 [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-3.x/ released] 2013-01-07 - Reiser4 for Linux-3.7.1 [http://marc.info/?l=reiserfs-devel&m=135750493615146&w=2 released]. &rarr; See the [[News Archive]] for older news. 
</div> </div> <!-- Further Information Block --> <div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#dfefdf"> === Further Information === * [[mailinglists|Mailing Lists]] * [[IRC|IRC Channels]] </div> |} f8395328c8c2b48090086e295e3aabdab317ab13 3991 3941 2014-08-11T11:30:51Z Edward 4 Update news __NOTOC__ {| width="100%" |- | style="vertical-align:top" | <!-- Documentation Block --> <div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#dcf5ff;"> === Documentation === * [[Reiser4_Howto | Getting started with ReiserFS/Reiser4]] * [[FAQ|Frequently Asked Questions]] * [[Manpages]] * [[Publications | Articles and Publications]] * [[Benchmarks]] * [[Testimonials]] </div> <!-- Utilities Block --> <div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#F8F8FF;"> === Utilities === * [[reiser4progs]] * [[reiserfsprogs]] * [[Filesystem Testing Tools]] </div> <!-- Development Block --> <div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#fffff0;"> === Development === * [[Reiser4 patchsets]] and packages * [[Bugs]] * [[TODO]] list * [[Credits]] </div> | width="55%" style="vertical-align:top" | <!-- News Block --> <div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#f0e0d0;"> === News === <div style="font-size:small"> 2014-06-29 - [http://sourceforge.net/projects/reiser4/files/reiser4-utils/libaal libaal-1.0.6] and [http://sourceforge.net/projects/reiser4/files/reiser4-utils/reiser4progs reiser4progs-1.0.9] released. 
2014-05-09 - Glenn [http://www.spinics.net/lists/reiserfs-devel/msg03897.html provided] new [https://build.opensuse.org/package/show/home:doiggl/kernel-reiser4 Reiser4-enabled openSUSE kernels] 2014-05-07 - Reiser4 for Linux-3.14 [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-3.x/ released] 2014-05-06 - Ivan Shapovalov [http://marc.info/?l=reiserfs-devel&m=139935424207357&w=2 announced] discard support in Reiser4 2014-04-23 - Jeff Mahoney published a big [http://www.spinics.net/lists/reiserfs-devel/msg03814.html reiserfs cleanup patchset] 2014-03-11 - Different transaction models in Reiser4 [http://marc.info/?l=reiserfs-devel&m=139449965000686&w=2 announced] 2014-02-05 - Reiser4 for Linux-3.13 [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-3.x/ released] 2013-12-20 - Reiser4 for Linux-3.12 [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-3.x/ released] 2013-09-23 - Reiser4 for Linux-3.11 [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-3.x/ released] 2013-07-16 - Reiser4 for Linux-3.10 [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-3.x/ released] 2013-07-01 - [http://www.spinics.net/lists/reiserfs-devel/msg03466.html reiserfsprogs 3.6.23 released] 2013-05-25 - Reiser4 for Linux-3.9.2 [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-3.x/ released] 2013-05-04 - reiser4progs-1.0.8 [http://sourceforge.net/projects/reiser4/files/reiser4-utils/reiser4progs/ released] 2013-04-05 - Reiser4 for Linux-3.8.5 [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-3.x/ released] 2013-01-07 - Reiser4 for Linux-3.7.1 [http://marc.info/?l=reiserfs-devel&m=135750493615146&w=2 released]. &rarr; See the [[News Archive]] for older news. 
</div>
</div>
<!-- Further Information Block -->
<div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#dfefdf">
=== Further Information ===
* [[mailinglists|Mailing Lists]]
* [[IRC|IRC Channels]]
</div>
|}

d1918a5eb4d4b4034499634267b46b1f3e740049 3591 3581 2013-07-03T17:53:10Z Chris goe 2 moved to archive

__NOTOC__
{| width="100%"
|-
| style="vertical-align:top" |
<!-- Documentation Block -->
<div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#dcf5ff;">
=== Documentation ===
* [[Reiser4_Howto | Getting started with ReiserFS/Reiser4]]
* [[FAQ|Frequently Asked Questions]]
* [[Manpages]]
* [[Publications | Articles and Publications]]
* [[Benchmarks]]
* [[Testimonials]]
</div>
<!-- Utilities Block -->
<div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#F8F8FF;">
=== Utilities ===
* [[reiser4progs]]
* [[reiserfsprogs]]
* [[Filesystem Testing Tools]]
</div>
<!-- Development Block -->
<div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#fffff0;">
=== Development ===
* [[Reiser4 patchsets]] and packages
* [[Bugs]]
* [[TODO]] list
* [[Credits]]
</div>
|
width="55%" style="vertical-align:top" | <!-- News Block --> <div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#f0e0d0;"> === News === <div style="font-size:small"> 2013-07-01 - [http://www.spinics.net/lists/reiserfs-devel/msg03466.html reiserfsprogs 3.6.23 released] 2013-05-25 - Reiser4 for Linux-3.9.2 [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-3.x/ released] 2013-05-04 - reiser4progs-1.0.8 [http://sourceforge.net/projects/reiser4/files/reiser4-utils/reiser4progs/ released] 2013-04-05 - Reiser4 for Linux-3.8.5 [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-3.x/ released] 2013-01-07 - Reiser4 for Linux-3.7.1 [http://marc.info/?l=reiserfs-devel&m=135750493615146&w=2 released]. 2012-10-28 - Reiser4 for Linux-3.6.4 [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-3.x/ released]. 2012-10-16 - [http://www.spinics.net/lists/reiserfs-devel/msg03276.html reiserfsprogs 3.6.22 released] 2012-10-10 - [http://marc.info/?l=reiserfs-devel&m=134988188217051&w=2 Jeff Mahoney] created a [https://ftp.kernel.org/pub/linux/kernel/people/jeffm/reiserfsprogs new home for reiserfsprogs] and a [https://git.kernel.org/?p=linux/kernel/git/jeffm/reiserfsprogs.git;a=summary Git repository] too! &rarr; See the [[News Archive]] for older news. 
</div> </div> <!-- Further Information Block --> <div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#dfefdf"> === Further Information === * [[mailinglists|Mailing Lists]] * [[IRC|IRC Channels]] </div> |} d71cc1502f8c13a65d7df66cfcb6874c7d56c0f6 3581 3541 2013-07-03T17:51:42Z Chris goe 2 missing </div> __NOTOC__ {| width="100%" |- | style="vertical-align:top" | <!-- Documentation Block --> <div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#dcf5ff;"> === Documentation === * [[Reiser4_Howto | Getting started with ReiserFS/Reiser4]] * [[FAQ|Frequently Asked Questions]] * [[Manpages]] * [[Publications | Articles and Publications]] * [[Benchmarks]] * [[Testimonials]] </div> <!-- Utilities Block --> <div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#F8F8FF;"> === Utilities === * [[reiser4progs]] * [[reiserfsprogs]] * [[Filesystem Testing Tools]] </div> <!-- Development Block --> <div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#fffff0;"> === Development === * [[Reiser4 patchsets]] and packages * [[Bugs]] * [[TODO]] list * [[Credits]] </div> | width="55%" style="vertical-align:top" | <!-- News Block --> <div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#f0e0d0;"> === News === <div style="font-size:small"> 2013-07-01 - [http://www.spinics.net/lists/reiserfs-devel/msg03466.html reiserfsprogs 3.6.23 released] 2013-05-25 - Reiser4 for Linux-3.9.2 [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-3.x/ released] 2013-05-04 - reiser4progs-1.0.8 [http://sourceforge.net/projects/reiser4/files/reiser4-utils/reiser4progs/ released] 2013-04-05 - Reiser4 for Linux-3.8.5 [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-3.x/ released] 2013-01-07 - Reiser4 for Linux-3.7.1 
[http://marc.info/?l=reiserfs-devel&m=135750493615146&w=2 released]. 2012-10-28 - Reiser4 for Linux-3.6.4 [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-3.x/ released]. 2012-10-16 - [http://www.spinics.net/lists/reiserfs-devel/msg03276.html reiserfsprogs 3.6.22 released] 2012-10-10 - [http://marc.info/?l=reiserfs-devel&m=134988188217051&w=2 Jeff Mahoney] created a [https://ftp.kernel.org/pub/linux/kernel/people/jeffm/reiserfsprogs new home for reiserfsprogs] and a [https://git.kernel.org/?p=linux/kernel/git/jeffm/reiserfsprogs.git;a=summary Git repository] too! 2012-09-11 - [http://www.spinics.net/lists/reiserfs-devel/msg03233.html Glenn] posted [https://build.opensuse.org/package/binaries?package=kernel-reiser4&project=home%3Adoiggl&repository=openSUSE_12.2 Reiser4-enabled kernel RPMs] for openSUSE 12.2. 2012-09-08 - [http://www.spinics.net/lists/reiserfs-devel/msg03230.html Reiser4 for Linux-3.5.3] [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-3.x/ released]. 2011-04-08 - [http://www.spinics.net/lists/reiserfs-devel/msg02830.html Glenn] posted [https://build.opensuse.org/package/binaries?package=kernel-reiser4&project=home%3Adoiggl&repository=openSUSE_11.4 Reiser4-enabled kernel RPMs] for openSUSE 11.4. 2011-04-03 - Patches for [http://www.kernel.org/pub/linux/kernel/people/edward/reiser4/reiser4-for-2.6/ 2.6.38] have been released. 2011-01-26 - Patches for [http://www.kernel.org/pub/linux/kernel/people/edward/reiser4/reiser4-for-2.6/ 2.6.37] have been released. &rarr; See the [[News Archive]] for older news. 
</div> </div> <!-- Further Information Block --> <div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#dfefdf"> === Further Information === * [[mailinglists|Mailing Lists]] * [[IRC|IRC Channels]] </div> |} 063a929f84efe566ad19943435c4404fdd5d5f8a 3541 2761 2013-07-03T08:41:00Z Chris goe 2 reiserfsprogs 3.6.23 released __NOTOC__ {| width="100%" |- |style="vertical-align:top" | <!-- Documentation Block --> <div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#dcf5ff;"> === Documentation === * [[Reiser4_Howto | Getting started with ReiserFS/Reiser4]] * [[FAQ|Frequently Asked Questions]] * [[Manpages]] * [[Publications | Articles and Publications]] * [[Benchmarks]] * [[Testimonials]] </div> <!-- Utilities Block --> <div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#F8F8FF;"> === Utilities === * [[reiser4progs]] * [[reiserfsprogs]] * [[Filesystem Testing Tools]] </div> <!-- Development Block --> <div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#fffff0;"> === Development === * [[Reiser4 patchsets]] and packages * [[Bugs]] * [[TODO]] list * [[Credits]] </div> | width="55%" style="vertical-align:top" | <!-- News Block --> <div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#f0e0d0;"> === News === <div style="font-size:small"> 2013-07-01 - [http://www.spinics.net/lists/reiserfs-devel/msg03466.html reiserfsprogs 3.6.23 released] 2013-05-25 - Reiser4 for Linux-3.9.2 [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-3.x/ released] 2013-05-04 - reiser4progs-1.0.8 [http://sourceforge.net/projects/reiser4/files/reiser4-utils/reiser4progs/ released] 2013-04-05 - Reiser4 for Linux-3.8.5 [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-3.x/ released] 2013-01-07 - Reiser4 for Linux-3.7.1 
[http://marc.info/?l=reiserfs-devel&m=135750493615146&w=2 released]. 2012-10-28 - Reiser4 for Linux-3.6.4 [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-3.x/ released]. 2012-10-16 - [http://www.spinics.net/lists/reiserfs-devel/msg03276.html reiserfsprogs 3.6.22 released] 2012-10-10 - [http://marc.info/?l=reiserfs-devel&m=134988188217051&w=2 Jeff Mahoney] created a [https://ftp.kernel.org/pub/linux/kernel/people/jeffm/reiserfsprogs new home for reiserfsprogs] and a [https://git.kernel.org/?p=linux/kernel/git/jeffm/reiserfsprogs.git;a=summary Git repository] too! 2012-09-11 - [http://www.spinics.net/lists/reiserfs-devel/msg03233.html Glenn] posted [https://build.opensuse.org/package/binaries?package=kernel-reiser4&project=home%3Adoiggl&repository=openSUSE_12.2 Reiser4-enabled kernel RPMs] for openSUSE 12.2. 2012-09-08 - [http://www.spinics.net/lists/reiserfs-devel/msg03230.html Reiser4 for Linux-3.5.3] [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-3.x/ released]. 2011-04-08 - [http://www.spinics.net/lists/reiserfs-devel/msg02830.html Glenn] posted [https://build.opensuse.org/package/binaries?package=kernel-reiser4&project=home%3Adoiggl&repository=openSUSE_11.4 Reiser4-enabled kernel RPMs] for openSUSE 11.4. 2011-04-03 - Patches for [http://www.kernel.org/pub/linux/kernel/people/edward/reiser4/reiser4-for-2.6/ 2.6.38] have been released. 2011-01-26 - Patches for [http://www.kernel.org/pub/linux/kernel/people/edward/reiser4/reiser4-for-2.6/ 2.6.37] have been released. &rarr; See the [[News Archive]] for older news. 
</div> <!-- Further Information Block --> <div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#dfefdf"> === Further Information === * [[mailinglists|Mailing Lists]] * [[IRC|IRC Channels]] </div> |} 1dec895bc8d6a0dde518ed5731a218ccf0c52722 2761 2711 2013-05-25T17:44:57Z Edward 4 __NOTOC__ {| width="100%" |- |style="vertical-align:top" | <!-- Documentation Block --> <div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#dcf5ff;"> === Documentation === * [[Reiser4_Howto | Getting started with ReiserFS/Reiser4]] * [[FAQ|Frequently Asked Questions]] * [[Manpages]] * [[Publications | Articles and Publications]] * [[Benchmarks]] * [[Testimonials]] </div> <!-- Utilities Block --> <div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#F8F8FF;"> === Utilities === * [[reiser4progs]] * [[reiserfsprogs]] * [[Filesystem Testing Tools]] </div> <!-- Development Block --> <div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#fffff0;"> === Development === * [[Reiser4 patchsets]] and packages * [[Bugs]] * [[TODO]] list * [[Credits]] </div> | width="55%" style="vertical-align:top" | <!-- News Block --> <div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#f0e0d0;"> === News === <div style="font-size:small"> 2013-05-25 - Reiser4 for Linux-3.9.2 [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-3.x/ released] 2013-05-04 - reiser4progs-1.0.8 [http://sourceforge.net/projects/reiser4/files/reiser4-utils/reiser4progs/ released] 2013-04-05 - Reiser4 for Linux-3.8.5 [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-3.x/ released] 2013-01-07 - Reiser4 for Linux-3.7.1 [http://marc.info/?l=reiserfs-devel&m=135750493615146&w=2 released]. 
2012-10-28 - Reiser4 for Linux-3.6.4 [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-3.x/ released]. 2012-10-16 - [http://www.spinics.net/lists/reiserfs-devel/msg03276.html reiserfsprogs 3.6.22 released] 2012-10-10 - [http://marc.info/?l=reiserfs-devel&m=134988188217051&w=2 Jeff Mahoney] created a [https://ftp.kernel.org/pub/linux/kernel/people/jeffm/reiserfsprogs new home for reiserfsprogs] and a [https://git.kernel.org/?p=linux/kernel/git/jeffm/reiserfsprogs.git;a=summary Git repository] too! 2012-09-11 - [http://www.spinics.net/lists/reiserfs-devel/msg03233.html Glenn] posted [https://build.opensuse.org/package/binaries?package=kernel-reiser4&project=home%3Adoiggl&repository=openSUSE_12.2 Reiser4-enabled kernel RPMs] for openSUSE 12.2. 2012-09-08 - [http://www.spinics.net/lists/reiserfs-devel/msg03230.html Reiser4 for Linux-3.5.3] [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-3.x/ released]. 2011-04-08 - [http://www.spinics.net/lists/reiserfs-devel/msg02830.html Glenn] posted [https://build.opensuse.org/package/binaries?package=kernel-reiser4&project=home%3Adoiggl&repository=openSUSE_11.4 Reiser4-enabled kernel RPMs] for openSUSE 11.4. 2011-04-03 - Patches for [http://www.kernel.org/pub/linux/kernel/people/edward/reiser4/reiser4-for-2.6/ 2.6.38] have been released. 2011-01-26 - Patches for [http://www.kernel.org/pub/linux/kernel/people/edward/reiser4/reiser4-for-2.6/ 2.6.37] have been released. &rarr; See the [[News Archive]] for older news. 
</div> <!-- Further Information Block --> <div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#dfefdf"> === Further Information === * [[mailinglists|Mailing Lists]] * [[IRC|IRC Channels]] </div> |} 9721e01d358bfb2f31c512df3d3c2da1d5b9f06c 2711 2701 2013-05-05T01:53:33Z Edward 4 /* News */ __NOTOC__ {| width="100%" |- |style="vertical-align:top" | <!-- Documentation Block --> <div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#dcf5ff;"> === Documentation === * [[Reiser4_Howto | Getting started with ReiserFS/Reiser4]] * [[FAQ|Frequently Asked Questions]] * [[Manpages]] * [[Publications | Articles and Publications]] * [[Benchmarks]] * [[Testimonials]] </div> <!-- Utilities Block --> <div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#F8F8FF;"> === Utilities === * [[reiser4progs]] * [[reiserfsprogs]] * [[Filesystem Testing Tools]] </div> <!-- Development Block --> <div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#fffff0;"> === Development === * [[Reiser4 patchsets]] and packages * [[Bugs]] * [[TODO]] list * [[Credits]] </div> | width="55%" style="vertical-align:top" | <!-- News Block --> <div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#f0e0d0;"> === News === <div style="font-size:small"> 2013-05-04 - reiser4progs-1.0.8 [http://sourceforge.net/projects/reiser4/files/reiser4-utils/reiser4progs/ released] 2013-04-05 - Reiser4 for Linux-3.8.5 [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-3.x/ released] 2013-01-07 - Reiser4 for Linux-3.7.1 [http://marc.info/?l=reiserfs-devel&m=135750493615146&w=2 released]. 2012-10-28 - Reiser4 for Linux-3.6.4 [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-3.x/ released]. 
2012-10-16 - [http://www.spinics.net/lists/reiserfs-devel/msg03276.html reiserfsprogs 3.6.22 released] 2012-10-10 - [http://marc.info/?l=reiserfs-devel&m=134988188217051&w=2 Jeff Mahoney] created a [https://ftp.kernel.org/pub/linux/kernel/people/jeffm/reiserfsprogs new home for reiserfsprogs] and a [https://git.kernel.org/?p=linux/kernel/git/jeffm/reiserfsprogs.git;a=summary Git repository] too! 2012-09-11 - [http://www.spinics.net/lists/reiserfs-devel/msg03233.html Glenn] posted [https://build.opensuse.org/package/binaries?package=kernel-reiser4&project=home%3Adoiggl&repository=openSUSE_12.2 Reiser4-enabled kernel RPMs] for openSUSE 12.2. 2012-09-08 - [http://www.spinics.net/lists/reiserfs-devel/msg03230.html Reiser4 for Linux-3.5.3] [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-3.x/ released]. 2011-04-08 - [http://www.spinics.net/lists/reiserfs-devel/msg02830.html Glenn] posted [https://build.opensuse.org/package/binaries?package=kernel-reiser4&project=home%3Adoiggl&repository=openSUSE_11.4 Reiser4-enabled kernel RPMs] for openSUSE 11.4. 2011-04-03 - Patches for [http://www.kernel.org/pub/linux/kernel/people/edward/reiser4/reiser4-for-2.6/ 2.6.38] have been released. 2011-01-26 - Patches for [http://www.kernel.org/pub/linux/kernel/people/edward/reiser4/reiser4-for-2.6/ 2.6.37] have been released. &rarr; See the [[News Archive]] for older news. 
</div> <!-- Further Information Block --> <div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#dfefdf"> === Further Information === * [[mailinglists|Mailing Lists]] * [[IRC|IRC Channels]] </div> |} 5712b09c3984fc8342741038c9739a15c347028d 2701 2641 2013-04-05T18:50:44Z Edward 4 __NOTOC__ {| width="100%" |- |style="vertical-align:top" | <!-- Documentation Block --> <div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#dcf5ff;"> === Documentation === * [[Reiser4_Howto | Getting started with ReiserFS/Reiser4]] * [[FAQ|Frequently Asked Questions]] * [[Manpages]] * [[Publications | Articles and Publications]] * [[Benchmarks]] * [[Testimonials]] </div> <!-- Utilities Block --> <div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#F8F8FF;"> === Utilities === * [[reiser4progs]] * [[reiserfsprogs]] * [[Filesystem Testing Tools]] </div> <!-- Development Block --> <div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#fffff0;"> === Development === * [[Reiser4 patchsets]] and packages * [[Bugs]] * [[TODO]] list * [[Credits]] </div> | width="55%" style="vertical-align:top" | <!-- News Block --> <div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#f0e0d0;"> === News === <div style="font-size:small"> 2013-04-05 - Reiser4 for Linux-3.8.5 [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-3.x/ released] 2013-01-07 - Reiser4 for Linux-3.7.1 [http://marc.info/?l=reiserfs-devel&m=135750493615146&w=2 released]. 2012-10-28 - Reiser4 for Linux-3.6.4 [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-3.x/ released]. 
2012-10-16 - [http://www.spinics.net/lists/reiserfs-devel/msg03276.html reiserfsprogs 3.6.22 released] 2012-10-10 - [http://marc.info/?l=reiserfs-devel&m=134988188217051&w=2 Jeff Mahoney] created a [https://ftp.kernel.org/pub/linux/kernel/people/jeffm/reiserfsprogs new home for reiserfsprogs] and a [https://git.kernel.org/?p=linux/kernel/git/jeffm/reiserfsprogs.git;a=summary Git repository] too! 2012-09-11 - [http://www.spinics.net/lists/reiserfs-devel/msg03233.html Glenn] posted [https://build.opensuse.org/package/binaries?package=kernel-reiser4&project=home%3Adoiggl&repository=openSUSE_12.2 Reiser4-enabled kernel RPMs] for openSUSE 12.2. 2012-09-08 - [http://www.spinics.net/lists/reiserfs-devel/msg03230.html Reiser4 for Linux-3.5.3] [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-3.x/ released]. 2011-04-08 - [http://www.spinics.net/lists/reiserfs-devel/msg02830.html Glenn] posted [https://build.opensuse.org/package/binaries?package=kernel-reiser4&project=home%3Adoiggl&repository=openSUSE_11.4 Reiser4-enabled kernel RPMs] for openSUSE 11.4. 2011-04-03 - Patches for [http://www.kernel.org/pub/linux/kernel/people/edward/reiser4/reiser4-for-2.6/ 2.6.38] have been released. 2011-01-26 - Patches for [http://www.kernel.org/pub/linux/kernel/people/edward/reiser4/reiser4-for-2.6/ 2.6.37] have been released. &rarr; See the [[News Archive]] for older news. 
</div> <!-- Further Information Block --> <div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#dfefdf"> === Further Information === * [[mailinglists|Mailing Lists]] * [[IRC|IRC Channels]] </div> |} c6919638968a1f9189f0a90ae9a3cbcdbe6491cf 2641 2631 2013-01-07T17:13:34Z Edward 4 /* News */ __NOTOC__ {| width="100%" |- |style="vertical-align:top" | <!-- Documentation Block --> <div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#dcf5ff;"> === Documentation === * [[Reiser4_Howto | Getting started with ReiserFS/Reiser4]] * [[FAQ|Frequently Asked Questions]] * [[Manpages]] * [[Publications | Articles and Publications]] * [[Benchmarks]] * [[Testimonials]] </div> <!-- Utilities Block --> <div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#F8F8FF;"> === Utilities === * [[reiser4progs]] * [[reiserfsprogs]] * [[Filesystem Testing Tools]] </div> <!-- Development Block --> <div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#fffff0;"> === Development === * [[Reiser4 patchsets]] and packages * [[Bugs]] * [[TODO]] list * [[Credits]] </div> | width="55%" style="vertical-align:top" | <!-- News Block --> <div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#f0e0d0;"> === News === <div style="font-size:small"> 2013-01-07 - Reiser4 for Linux-3.7.1 [http://marc.info/?l=reiserfs-devel&m=135750493615146&w=2 released]. 2012-10-28 - Reiser4 for Linux-3.6.4 [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-3.x/ released]. 
2012-10-16 - [http://www.spinics.net/lists/reiserfs-devel/msg03276.html reiserfsprogs 3.6.22 released] 2012-10-10 - [http://marc.info/?l=reiserfs-devel&m=134988188217051&w=2 Jeff Mahoney] created a [https://ftp.kernel.org/pub/linux/kernel/people/jeffm/reiserfsprogs new home for reiserfsprogs] and a [https://git.kernel.org/?p=linux/kernel/git/jeffm/reiserfsprogs.git;a=summary Git repository] too! 2012-09-11 - [http://www.spinics.net/lists/reiserfs-devel/msg03233.html Glenn] posted [https://build.opensuse.org/package/binaries?package=kernel-reiser4&project=home%3Adoiggl&repository=openSUSE_12.2 Reiser4-enabled kernel RPMs] for openSUSE 12.2. 2012-09-08 - [http://www.spinics.net/lists/reiserfs-devel/msg03230.html Reiser4 for Linux-3.5.3] [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-3.x/ released]. 2011-04-08 - [http://www.spinics.net/lists/reiserfs-devel/msg02830.html Glenn] posted [https://build.opensuse.org/package/binaries?package=kernel-reiser4&project=home%3Adoiggl&repository=openSUSE_11.4 Reiser4-enabled kernel RPMs] for openSUSE 11.4. 2011-04-03 - Patches for [http://www.kernel.org/pub/linux/kernel/people/edward/reiser4/reiser4-for-2.6/ 2.6.38] have been released. 2011-01-26 - Patches for [http://www.kernel.org/pub/linux/kernel/people/edward/reiser4/reiser4-for-2.6/ 2.6.37] have been released. &rarr; See the [[News Archive]] for older news. 
</div> <!-- Further Information Block --> <div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#dfefdf"> === Further Information === * [[mailinglists|Mailing Lists]] * [[IRC|IRC Channels]] </div> |} eb9fe3aea84f809066959565c2ae7171804f40c6 2631 2611 2013-01-07T17:10:02Z Edward 4 /* News */ __NOTOC__ {| width="100%" |- |style="vertical-align:top" | <!-- Documentation Block --> <div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#dcf5ff;"> === Documentation === * [[Reiser4_Howto | Getting started with ReiserFS/Reiser4]] * [[FAQ|Frequently Asked Questions]] * [[Manpages]] * [[Publications | Articles and Publications]] * [[Benchmarks]] * [[Testimonials]] </div> <!-- Utilities Block --> <div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#F8F8FF;"> === Utilities === * [[reiser4progs]] * [[reiserfsprogs]] * [[Filesystem Testing Tools]] </div> <!-- Development Block --> <div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#fffff0;"> === Development === * [[Reiser4 patchsets]] and packages * [[Bugs]] * [[TODO]] list * [[Credits]] </div> | width="55%" style="vertical-align:top" | <!-- News Block --> <div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#f0e0d0;"> === News === <div style="font-size:small"> 2013-01-07 - Reiser4 for Linux-3.7.1 [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-3.x/ released]. 2012-10-28 - Reiser4 for Linux-3.6.4 [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-3.x/ released]. 
2012-10-16 - [http://www.spinics.net/lists/reiserfs-devel/msg03276.html reiserfsprogs 3.6.22 released] 2012-10-10 - [http://marc.info/?l=reiserfs-devel&m=134988188217051&w=2 Jeff Mahoney] created a [https://ftp.kernel.org/pub/linux/kernel/people/jeffm/reiserfsprogs new home for reiserfsprogs] and a [https://git.kernel.org/?p=linux/kernel/git/jeffm/reiserfsprogs.git;a=summary Git repository] too! 2012-09-11 - [http://www.spinics.net/lists/reiserfs-devel/msg03233.html Glenn] posted [https://build.opensuse.org/package/binaries?package=kernel-reiser4&project=home%3Adoiggl&repository=openSUSE_12.2 Reiser4-enabled kernel RPMs] for openSUSE 12.2. 2012-09-08 - [http://www.spinics.net/lists/reiserfs-devel/msg03230.html Reiser4 for Linux-3.5.3] [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-3.x/ released]. 2011-04-08 - [http://www.spinics.net/lists/reiserfs-devel/msg02830.html Glenn] posted [https://build.opensuse.org/package/binaries?package=kernel-reiser4&project=home%3Adoiggl&repository=openSUSE_11.4 Reiser4-enabled kernel RPMs] for openSUSE 11.4. 2011-04-03 - Patches for [http://www.kernel.org/pub/linux/kernel/people/edward/reiser4/reiser4-for-2.6/ 2.6.38] have been released. 2011-01-26 - Patches for [http://www.kernel.org/pub/linux/kernel/people/edward/reiser4/reiser4-for-2.6/ 2.6.37] have been released. &rarr; See the [[News Archive]] for older news. 
</div> <!-- Further Information Block --> <div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#dfefdf"> === Further Information === * [[mailinglists|Mailing Lists]] * [[IRC|IRC Channels]] </div> |} c442c19276ab2a88875decb936b6a1bd75f2d147 2611 2601 2012-11-01T22:58:43Z Edward 4 /* News */ __NOTOC__ {| width="100%" |- |style="vertical-align:top" | <!-- Documentation Block --> <div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#dcf5ff;"> === Documentation === * [[Reiser4_Howto | Getting started with ReiserFS/Reiser4]] * [[FAQ|Frequently Asked Questions]] * [[Manpages]] * [[Publications | Articles and Publications]] * [[Benchmarks]] * [[Testimonials]] </div> <!-- Utilities Block --> <div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#F8F8FF;"> === Utilities === * [[reiser4progs]] * [[reiserfsprogs]] * [[Filesystem Testing Tools]] </div> <!-- Development Block --> <div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#fffff0;"> === Development === * [[Reiser4 patchsets]] and packages * [[Bugs]] * [[TODO]] list * [[Credits]] </div> | width="55%" style="vertical-align:top" | <!-- News Block --> <div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#f0e0d0;"> === News === <div style="font-size:small"> 2012-10-28 - Reiser4 for Linux-3.6.4 [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-3.x/ released]. 2012-10-16 - [http://www.spinics.net/lists/reiserfs-devel/msg03276.html reiserfsprogs 3.6.22 released] 2012-10-10 - [http://marc.info/?l=reiserfs-devel&m=134988188217051&w=2 Jeff Mahoney] created a [https://ftp.kernel.org/pub/linux/kernel/people/jeffm/reiserfsprogs new home for reiserfsprogs] and a [https://git.kernel.org/?p=linux/kernel/git/jeffm/reiserfsprogs.git;a=summary Git repository] too! 
2012-09-11 - [http://www.spinics.net/lists/reiserfs-devel/msg03233.html Glenn] posted [https://build.opensuse.org/package/binaries?package=kernel-reiser4&project=home%3Adoiggl&repository=openSUSE_12.2 Reiser4-enabled kernel RPMs] for openSUSE 12.2. 2012-09-08 - [http://www.spinics.net/lists/reiserfs-devel/msg03230.html Reiser4 for Linux-3.5.3] [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-3.x/ released]. 2011-04-08 - [http://www.spinics.net/lists/reiserfs-devel/msg02830.html Glenn] posted [https://build.opensuse.org/package/binaries?package=kernel-reiser4&project=home%3Adoiggl&repository=openSUSE_11.4 Reiser4-enabled kernel RPMs] for openSUSE 11.4. 2011-04-03 - Patches for [http://www.kernel.org/pub/linux/kernel/people/edward/reiser4/reiser4-for-2.6/ 2.6.38] have been released. 2011-01-26 - Patches for [http://www.kernel.org/pub/linux/kernel/people/edward/reiser4/reiser4-for-2.6/ 2.6.37] have been released. &rarr; See the [[News Archive]] for older news. </div> <!-- Further Information Block --> <div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#dfefdf"> === Further Information === * [[mailinglists|Mailing Lists]] * [[IRC|IRC Channels]] </div> |} 7c9227d6a93d8e4636aa6dcbd8ea30771d264249 2601 2591 2012-10-18T19:49:46Z Chris goe 2 reiserfsprogs 3.6.22 released __NOTOC__ {| width="100%" |- |style="vertical-align:top" | <!-- Documentation Block --> <div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#dcf5ff;"> === Documentation === * [[Reiser4_Howto | Getting started with ReiserFS/Reiser4]] * [[FAQ|Frequently Asked Questions]] * [[Manpages]] * [[Publications | Articles and Publications]] * [[Benchmarks]] * [[Testimonials]] </div> <!-- Utilities Block --> <div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#F8F8FF;"> === Utilities === * [[reiser4progs]] * [[reiserfsprogs]] * [[Filesystem Testing Tools]] </div> <!-- 
Development Block --> <div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#fffff0;"> === Development === * [[Reiser4 patchsets]] and packages * [[Bugs]] * [[TODO]] list * [[Credits]] </div> | width="55%" style="vertical-align:top" | <!-- News Block --> <div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#f0e0d0;"> === News === <div style="font-size:small"> 2012-10-16 - [http://www.spinics.net/lists/reiserfs-devel/msg03276.html reiserfsprogs 3.6.22 released] 2012-10-10 - [http://marc.info/?l=reiserfs-devel&m=134988188217051&w=2 Jeff Mahoney] created a [https://ftp.kernel.org/pub/linux/kernel/people/jeffm/reiserfsprogs new home for reiserfsprogs] and a [https://git.kernel.org/?p=linux/kernel/git/jeffm/reiserfsprogs.git;a=summary Git repository] too! 2012-09-11 - [http://www.spinics.net/lists/reiserfs-devel/msg03233.html Glenn] posted [https://build.opensuse.org/package/binaries?package=kernel-reiser4&project=home%3Adoiggl&repository=openSUSE_12.2 Reiser4-enabled kernel RPMs] for openSUSE 12.2. 2012-09-08 - [http://www.spinics.net/lists/reiserfs-devel/msg03230.html Edward] posted [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-3.x/ patches for Linux 3.5] and [http://www.spinics.net/lists/reiserfs-devel/msg03220.html added]: <tt>"Make sure that you have the latest stuff (reiser4-for-2.6.39) and tail packing is turned off (format your partition with the option -o "formatting=extents")."</tt> 2011-04-08 - [http://www.spinics.net/lists/reiserfs-devel/msg02830.html Glenn] posted [https://build.opensuse.org/package/binaries?package=kernel-reiser4&project=home%3Adoiggl&repository=openSUSE_11.4 Reiser4-enabled kernel RPMs] for openSUSE 11.4. 2011-04-03 - Patches for [http://www.kernel.org/pub/linux/kernel/people/edward/reiser4/reiser4-for-2.6/ 2.6.38] have been released. 
2011-01-26 - Patches for [http://www.kernel.org/pub/linux/kernel/people/edward/reiser4/reiser4-for-2.6/ 2.6.37] have been released. &rarr; See the [[News Archive]] for older news. </div> </div> <!-- Further Information Block --> <div style="margin-top:10px; border:1px solid #dfdfdf; padding:0em 1em 1em 1em; background-color:#dfefdf"> === Further Information === * [[mailinglists|Mailing Lists]] * [[IRC|IRC Channels]] </div> |} abb313afc35a0037e0227c943ce3e7aa567bbce6 2331 2272 2012-09-16T08:02:00Z Chris goe 2 +updates __NOTOC__ {| width="100%" |- |style="vertical-align:top" | <!-- Documentation Block --> <div style="margin:0; margin-top:10px; margin-right:10px; border:1px solid #dfdfdf; padding:0 1em 1em 1em; background-color:#dcf5ff; align:right;"> === Documentation === * [[Reiser4_Howto | Getting started with ReiserFS/Reiser4]] * [[FAQ|Frequently Asked Questions]] * [[Manpages]] * [[Publications | Articles and Publications]] * [[Benchmarks]] * [[Testimonials]] </div> <!-- Utilities Block --> <div style="margin:0; margin-top:10px; margin-right:10px; border:1px solid #dfdfdf; padding:0 1em 1em 1em; background-color:#F8F8FF; align:right;"> === Utilities === * [[reiser4progs]] * [[reiserfsprogs]] * [[Filesystem Testing Tools]] </div> <!-- Development Block --> <div style="margin:0; margin-top:10px; margin-right:10px; border:1px solid #dfdfdf; padding:0 1em 1em 1em; background-color:#fffff0; align:right; "> === Development === * [[Reiser4 patchsets]] and packages * [[Bugs]] * [[TODO]] list * [[Credits]] </div> <!-- Further Information Block --> <div style="margin:0; margin-top:10px; border:1px solid #dfdfdf; padding: 0em 1em 1em 1em; background-color:#dfefdf; align:left; margin-top:10px"> === Further Information === * [[mailinglists|Mailing Lists]] * [[IRC|IRC Channels]] </div> | width="60%" style="vertical-align:top" | <!-- Wiki News Block --> <div style="margin:0; margin-top:10px; border:1px solid #dfdfdf; padding: 0em 1em 1em 3em; background-color:#f0e0d0; align:left; text-indent:-2em;"> === News === <div style="font-size:small"> 2012-09-11 - [http://www.spinics.net/lists/reiserfs-devel/msg03233.html Glenn] posted [https://build.opensuse.org/package/binaries?package=kernel-reiser4&project=home%3Adoiggl&repository=openSUSE_12.2 Reiser4-enabled kernel RPMs] for openSUSE 12.2.
2012-09-08 - [http://www.spinics.net/lists/reiserfs-devel/msg03230.html Edward] posted [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-3.x/ patches for Linux 3.5] and [http://www.spinics.net/lists/reiserfs-devel/msg03220.html added]: <tt>"Make sure that you have the latest stuff (reiser4-for-2.6.39) and tail packing is turned off (format your partition with the option -o "formatting=extents")."</tt> 2011-04-08 - [http://www.spinics.net/lists/reiserfs-devel/msg02830.html Glenn] posted [https://build.opensuse.org/package/binaries?package=kernel-reiser4&project=home%3Adoiggl&repository=openSUSE_11.4 Reiser4-enabled kernel RPMs] for openSUSE 11.4. 2011-04-03 - Patches for [http://www.kernel.org/pub/linux/kernel/people/edward/reiser4/reiser4-for-2.6/ 2.6.38] have been released. 2011-01-26 - Patches for [http://www.kernel.org/pub/linux/kernel/people/edward/reiser4/reiser4-for-2.6/ 2.6.37] have been released. 2010-11-20 - Edward released [http://www.kernel.org/pub/linux/kernel/people/edward/reiser4/reiser4-for-2.6/ Reiser4 patches for Linux 2.6.36] - Thanks! 2010-10-18 - <s>[http://www.spinics.net/lists/reiserfs-devel/msg02494.html Viji V Nair] posted [http://viji.fedorapeople.org/reiser4/F13/x86_64/ Reiser4-enabled kernel RPMs] for Fedora 13.</s> - it's gone :( :::Please test and post results to the [[Mailinglists|list]]! 2010-10-14 - [http://www.spinics.net/lists/reiserfs-devel/msg02493.html Glenn] posted [https://build.opensuse.org/package/binaries?package=kernel-reiser4&project=home%3Adoiggl&repository=openSUSE_11.3 Reiser4-enabled kernel RPMs] for OpenSUSE 11.3. :::Please test and post results to the [[Mailinglists|list]]! 2010-08-04 - [http://www.kernel.org/pub/linux/kernel/people/edward/reiser4/reiser4-for-2.6/?C=M;O=D Reiser4 patches for Linux 2.6.35] - Please [http://www.spinics.net/lists/reiserfs-devel/msg02373.html test]! 
2010-05-26 - Apparently [http://chichkin_i.zelnet.ru Edward] released [http://www.kernel.org/pub/linux/kernel/people/edward/reiser4/reiser4-for-2.6/?C=M;O=D Reiser4 patches for Linux 2.6.34] ::: - Testers welcome! :-) 2010-04-27 - A benchmark of reiser4 was published on [http://www.phoronix.com/scan.php?page=article&item=reiser4_benchmarks&num=1 Phoronix]. 2010-03-04 - [http://chichkin_i.zelnet.ru Edward] released [http://www.kernel.org/pub/linux/kernel/people/edward/reiser4/reiser4-for-2.6/?C=M;O=D Reiser4 patches for Linux 2.6.33] - Thanks! 2010-02-15 - The [http://git.zen-kernel.org/?p=kernel/zen.git;a=shortlog;h=refs/heads/reiser4 Zen Kernel] works for [http://nerdbynature.de/benchmarks/v40z/2010-02-15/bonnie.html 2.6.33 too] - hey! :-) 2009-11-24 - [http://zen-kernel.org/ zen-kernel.org] <small>(also hosting the [[Reiser4_patchsets|MMOTM kernel]])</small> ships with Reiser4 :::and [http://www.spinics.net/lists/reiserfs-devel/msg01999.html is said to work] for [http://downloads.zen-kernel.org/2.6.32/ 2.6.32] 2009-11-10 - [http://www.phoronix.com/scan.php?page=news_item&px=NzY4OQ Reiser4 May Go For Mainline Inclusion In 2010] 2009-10-26 - <s>Viji V Nair [http://www.spinics.net/lists/reiserfs-devel/msg01957.html released] Fedora 11 kernel [http://fedoraproject.org/wiki/User:Viji#Fedora_kernel_rpm_with_reiser4_support RPMs with Reiser4 support]</s> - it's gone :( 2009-10-05 - Edward [http://marc.info/?l=reiserfs-devel&m=125470523000355&w=2 released] Reiser4 [http://www.kernel.org/pub/linux/kernel/people/edward/reiser4/reiser4-for-2.6/?C=M;O=D patches for 2.6.31] - please test! 2009-09-11 - Reiser4 patches for 2.6.30 are [http://kerneltrap.org/mailarchive/reiserfs-devel/2009/9/11/6399383 said to work] for :::the recently released 2.6.31 kernel as well. 2009-06-22 - [http://www.kernel.org/pub/linux/kernel/people/edward/reiser4/reiser4-for-2.6/?C=M;O=D Reiser4 patches for Linux 2.6.30 released!] 2009-04-25 - [[TODO]] list updated, we're now down to 5 open issues. 
2009-01-17 - [http://lwn.net/Articles/315509/ Reiser4progs-1.0.7 released] 2009-01-10 - [http://lwn.net/Articles/314451/ Reiserfsprogs-3.6.21 released] 2009-01-08 - Reiser4 kernel packages for [http://download.opensuse.org/repositories/drivers://filesystems/openSUSE_11.0/ openSUSE 11.0] and [http://download.opensuse.org/repositories/drivers://filesystems/openSUSE_11.1/ 11.1] have been built. 2007-04-25 - [http://kerneltrap.org/node/8102 Reiser4's future] 2005-05-15 - The [http://grml.org/ grml] recovery CD comes with [http://grml.org/changelogs/README-0.4.txt Reiser4 support] </div> </div> |} cd2c94b61ca18360d63f6a170c6adbcffd0085b1 2242 2192 2011-04-04T17:49:16Z Chris goe 2 2.6.38 __NOTOC__ {| width="100%" |- |style="vertical-align:top" | <!-- Documentation Block --> <div style="margin:0; margin-top:10px; margin-right:10px; border:1px solid #dfdfdf; padding:0 1em 1em 1em; background-color:#dcf5ff; align:right;"> === Documentation === * [[Reiser4_Howto | Getting started with ReiserFS/Reiser4]] * [[FAQ|Frequently Asked Questions]] * [[Manpages]] * [[Publications | Articles and Publications]] * [[Benchmarks]] * [[Testimonials]] </div> <!-- Utilities Block --> <div style="margin:0; margin-top:10px; margin-right:10px; border:1px solid #dfdfdf; padding:0 1em 1em 1em; background-color:#F8F8FF; align:right;"> === Utilities === * [[reiser4progs]] * [[reiserfsprogs]] * [[Filesystem Testing Tools]] </div> <!-- Development Block --> <div style="margin:0; margin-top:10px; margin-right:10px; border:1px solid #dfdfdf; padding:0 1em 1em 1em; background-color:#fffff0; align:right; "> === Development === * [[Reiser4 patchsets]] and packages * [[Bugs]] * [[TODO]] list * [[Credits]] </div> <!-- Further Information Block --> <div style="margin:0; margin-top:10px; border:1px solid #dfdfdf; padding: 0em 1em 1em 1em; background-color:#dfefdf; align:left; margin-top:10px"> === Further Information === * [[mailinglists|Mailing Lists]] * [[IRC|IRC Channels]] </div> | width="60%"
style="vertical-align:top" | <!-- Wiki News Block --> <div style="margin:0; margin-top:10px; border:1px solid #dfdfdf; padding: 0em 1em 1em 3em; background-color:#f0e0d0; align:left; text-indent:-2em;"> === News === <div style="font-size:small"> 2011-04-03 - Patches for [http://www.kernel.org/pub/linux/kernel/people/edward/reiser4/reiser4-for-2.6/ 2.6.38] have been released. 2011-01-26 - Patches for [http://www.kernel.org/pub/linux/kernel/people/edward/reiser4/reiser4-for-2.6/ 2.6.37] have been released. 2010-11-20 - Edward released [http://www.kernel.org/pub/linux/kernel/people/edward/reiser4/reiser4-for-2.6/ Reiser4 patches for Linux 2.6.36] - Thanks! 2010-10-18 - [http://www.spinics.net/lists/reiserfs-devel/msg02494.html Viji V Nair] posted [http://viji.fedorapeople.org/reiser4/F13/x86_64/ Reiser4-enabled kernel RPMs] for Fedora 13. :::Please test and post results to the [[Mailinglists|list]]! 2010-10-14 - [http://www.spinics.net/lists/reiserfs-devel/msg02493.html Glenn] posted [https://build.opensuse.org/package/binaries?package=kernel-reiser4&project=home%3Adoiggl&repository=openSUSE_11.3 Reiser4-enabled kernel RPMs] for OpenSUSE 11.3. :::Please test and post results to the [[Mailinglists|list]]! 2010-08-04 - [http://www.kernel.org/pub/linux/kernel/people/edward/reiser4/reiser4-for-2.6/?C=M;O=D Reiser4 patches for Linux 2.6.35] - Please [http://www.spinics.net/lists/reiserfs-devel/msg02373.html test]! 2010-05-26 - Apparently [http://chichkin_i.zelnet.ru Edward] released [http://www.kernel.org/pub/linux/kernel/people/edward/reiser4/reiser4-for-2.6/?C=M;O=D Reiser4 patches for Linux 2.6.34] ::: - Testers welcome! :-) 2010-04-27 - A benchmark of reiser4 was published on [http://www.phoronix.com/scan.php?page=article&item=reiser4_benchmarks&num=1 Phoronix]. 2010-03-04 - [http://chichkin_i.zelnet.ru Edward] released [http://www.kernel.org/pub/linux/kernel/people/edward/reiser4/reiser4-for-2.6/?C=M;O=D Reiser4 patches for Linux 2.6.33] - Thanks! 
2010-02-15 - The [http://git.zen-kernel.org/?p=kernel/zen.git;a=shortlog;h=refs/heads/reiser4 Zen Kernel] works for [http://nerdbynature.de/benchmarks/v40z/2010-02-15/bonnie.html 2.6.33 too] - hey! :-) 2009-11-24 - [http://zen-kernel.org/ zen-kernel.org] <small>(also hosting the [[Reiser4_patchsets|MMOTM kernel]])</small> ships with Reiser4 :::and [http://www.spinics.net/lists/reiserfs-devel/msg01999.html is said to work] for [http://downloads.zen-kernel.org/2.6.32/ 2.6.32] 2009-11-10 - [http://www.phoronix.com/scan.php?page=news_item&px=NzY4OQ Reiser4 May Go For Mainline Inclusion In 2010] 2009-10-26 - Viji V Nair [http://www.spinics.net/lists/reiserfs-devel/msg01957.html released] Fedora 11 kernel [http://fedoraproject.org/wiki/User:Viji#Fedora_kernel_rpm_with_reiser4_support RPMs with Reiser4 support] 2009-10-05 - Edward [http://marc.info/?l=reiserfs-devel&m=125470523000355&w=2 released] Reiser4 [http://www.kernel.org/pub/linux/kernel/people/edward/reiser4/reiser4-for-2.6/?C=M;O=D patches for 2.6.31] - please test! 2009-09-11 - Reiser4 patches for 2.6.30 are [http://kerneltrap.org/mailarchive/reiserfs-devel/2009/9/11/6399383 said to work] for :::the recently released 2.6.31 kernel as well. 2009-06-22 - [http://www.kernel.org/pub/linux/kernel/people/edward/reiser4/reiser4-for-2.6/?C=M;O=D Reiser4 patches for Linux 2.6.30 released!] 2009-04-25 - [[TODO]] list updated, we're now down to 5 open issues. 2009-01-17 - [http://lwn.net/Articles/315509/ Reiser4progs-1.0.7 released] 2009-01-10 - [http://lwn.net/Articles/314451/ Reiserfsprogs-3.6.21 released] 2009-01-08 - Reiser4 kernel packages for [http://download.opensuse.org/repositories/drivers://filesystems/openSUSE_11.0/ openSUSE 11.0] and [http://download.opensuse.org/repositories/drivers://filesystems/openSUSE_11.1/ 11.1] have been built. 
2007-04-25 - [http://kerneltrap.org/node/8102 Reiser4's future] 2005-05-15 - The [http://grml.org/ grml] recovery CD comes with [http://grml.org/changelogs/README-0.4.txt Reiser4 support] </div> </div> |} 3b304ee80b3c49311f3e3799defb6957939766ac 2192 2172 2011-03-07T18:33:38Z Chris goe 2 2.6.37 __NOTOC__ {| width="100%" |- |style="vertical-align:top" | <!-- Documentation Block --> <div style="margin:0; margin-top:10px; margin-right:10px; border:1px solid #dfdfdf; padding:0 1em 1em 1em; background-color:#dcf5ff; align:right;"> === Documentation === * [[Reiser4_Howto | Getting started with ReiserFS/Reiser4]] * [[FAQ|Frequently Asked Questions]] * [[Manpages]] * [[Publications | Articles and Publications]] * [[Benchmarks]] * [[Testimonials]] </div> <!-- Utilities Block --> <div style="margin:0; margin-top:10px; margin-right:10px; border:1px solid #dfdfdf; padding:0 1em 1em 1em; background-color:#F8F8FF; align:right;"> === Utilities === * [[reiser4progs]] * [[reiserfsprogs]] * [[Filesystem Testing Tools]] </div> <!-- Development Block --> <div style="margin:0; margin-top:10px; margin-right:10px; border:1px solid #dfdfdf; padding:0 1em 1em 1em; background-color:#fffff0; align:right; "> === Development === * [[Reiser4 patchsets]] and packages * [[Bugs]] * [[TODO]] list * [[Credits]] </div> <!-- Further Information Block --> <div style="margin:0; margin-top:10px; border:1px solid #dfdfdf; padding: 0em 1em 1em 1em; background-color:#dfefdf; align:left; margin-top:10px"> === Further Information === * [[mailinglists|Mailing Lists]] * [[IRC|IRC Channels]] </div> | width="60%" style="vertical-align:top" | <!-- Wiki News Block --> <div style="margin:0; margin-top:10px; border:1px solid #dfdfdf; padding: 0em 1em 1em 3em; background-color:#f0e0d0; align:left; text-indent:-2em;"> === News === <div style="font-size:small"> 2011-01-26 - Patches for [http://www.kernel.org/pub/linux/kernel/people/edward/reiser4/reiser4-for-2.6/ 2.6.37] have been released. 
2010-11-20 - Edward released [http://www.kernel.org/pub/linux/kernel/people/edward/reiser4/reiser4-for-2.6/ Reiser4 patches for Linux 2.6.36] - Thanks! 2010-10-18 - [http://www.spinics.net/lists/reiserfs-devel/msg02494.html Viji V Nair] posted [http://viji.fedorapeople.org/reiser4/F13/x86_64/ Reiser4-enabled kernel RPMs] for Fedora 13. :::Please test and post results to the [[Mailinglists|list]]! 2010-10-14 - [http://www.spinics.net/lists/reiserfs-devel/msg02493.html Glenn] posted [https://build.opensuse.org/package/binaries?package=kernel-reiser4&project=home%3Adoiggl&repository=openSUSE_11.3 Reiser4-enabled kernel RPMs] for OpenSUSE 11.3. :::Please test and post results to the [[Mailinglists|list]]! 2010-08-04 - [http://www.kernel.org/pub/linux/kernel/people/edward/reiser4/reiser4-for-2.6/?C=M;O=D Reiser4 patches for Linux 2.6.35] - Please [http://www.spinics.net/lists/reiserfs-devel/msg02373.html test]! 2010-05-26 - Apparently [http://chichkin_i.zelnet.ru Edward] released [http://www.kernel.org/pub/linux/kernel/people/edward/reiser4/reiser4-for-2.6/?C=M;O=D Reiser4 patches for Linux 2.6.34] ::: - Testers welcome! :-) 2010-04-27 - A benchmark of reiser4 was published on [http://www.phoronix.com/scan.php?page=article&item=reiser4_benchmarks&num=1 Phoronix]. 2010-03-04 - [http://chichkin_i.zelnet.ru Edward] released [http://www.kernel.org/pub/linux/kernel/people/edward/reiser4/reiser4-for-2.6/?C=M;O=D Reiser4 patches for Linux 2.6.33] - Thanks! 2010-02-15 - The [http://git.zen-kernel.org/?p=kernel/zen.git;a=shortlog;h=refs/heads/reiser4 Zen Kernel] works for [http://nerdbynature.de/benchmarks/v40z/2010-02-15/bonnie.html 2.6.33 too] - hey! 
:-) 2009-11-24 - [http://zen-kernel.org/ zen-kernel.org] <small>(also hosting the [[Reiser4_patchsets|MMOTM kernel]])</small> ships with Reiser4 :::and [http://www.spinics.net/lists/reiserfs-devel/msg01999.html is said to work] for [http://downloads.zen-kernel.org/2.6.32/ 2.6.32] 2009-11-10 - [http://www.phoronix.com/scan.php?page=news_item&px=NzY4OQ Reiser4 May Go For Mainline Inclusion In 2010] 2009-10-26 - Viji V Nair [http://www.spinics.net/lists/reiserfs-devel/msg01957.html released] Fedora 11 kernel [http://fedoraproject.org/wiki/User:Viji#Fedora_kernel_rpm_with_reiser4_support RPMs with Reiser4 support] 2009-10-05 - Edward [http://marc.info/?l=reiserfs-devel&m=125470523000355&w=2 released] Reiser4 [http://www.kernel.org/pub/linux/kernel/people/edward/reiser4/reiser4-for-2.6/?C=M;O=D patches for 2.6.31] - please test! 2009-09-11 - Reiser4 patches for 2.6.30 are [http://kerneltrap.org/mailarchive/reiserfs-devel/2009/9/11/6399383 said to work] for :::the recently released 2.6.31 kernel as well. 2009-06-22 - [http://www.kernel.org/pub/linux/kernel/people/edward/reiser4/reiser4-for-2.6/?C=M;O=D Reiser4 patches for Linux 2.6.30 released!] 2009-04-25 - [[TODO]] list updated, we're now down to 5 open issues. 2009-01-17 - [http://lwn.net/Articles/315509/ Reiser4progs-1.0.7 released] 2009-01-10 - [http://lwn.net/Articles/314451/ Reiserfsprogs-3.6.21 released] 2009-01-08 - Reiser4 kernel packages for [http://download.opensuse.org/repositories/drivers://filesystems/openSUSE_11.0/ openSUSE 11.0] and [http://download.opensuse.org/repositories/drivers://filesystems/openSUSE_11.1/ 11.1] have been built. 
2007-04-25 - [http://kerneltrap.org/node/8102 Reiser4's future] 2005-05-15 - The [http://grml.org/ grml] recovery CD comes with [http://grml.org/changelogs/README-0.4.txt Reiser4 support] </div> </div> |} 0a2c377108049749a97a00b3263a8dac896b3a25 2172 2062 2010-11-21T02:40:04Z Chris goe 2 Reiser4 for Linux-2.6.36 __NOTOC__ {| width="100%" |- |style="vertical-align:top" | <!-- Documentation Block --> <div style="margin:0; margin-top:10px; margin-right:10px; border:1px solid #dfdfdf; padding:0 1em 1em 1em; background-color:#dcf5ff; align:right;"> === Documentation === * [[Reiser4_Howto | Getting started with ReiserFS/Reiser4]] * [[FAQ|Frequently Asked Questions]] * [[Manpages]] * [[Publications | Articles and Publications]] * [[Benchmarks]] * [[Testimonials]] </div> <!-- Utilities Block --> <div style="margin:0; margin-top:10px; margin-right:10px; border:1px solid #dfdfdf; padding:0 1em 1em 1em; background-color:#F8F8FF; align:right;"> === Utilities === * [[reiser4progs]] * [[reiserfsprogs]] * [[Filesystem Testing Tools]] </div> <!-- Development Block --> <div style="margin:0; margin-top:10px; margin-right:10px; border:1px solid #dfdfdf; padding:0 1em 1em 1em; background-color:#fffff0; align:right; "> === Development === * [[Reiser4 patchsets]] and packages * [[Bugs]] * [[TODO]] list * [[Credits]] </div> <!-- Further Information Block --> <div style="margin:0; margin-top:10px; border:1px solid #dfdfdf; padding: 0em 1em 1em 1em; background-color:#dfefdf; align:left; margin-top:10px"> === Further Information === * [[mailinglists|Mailing Lists]] * [[IRC|IRC Channels]] </div> | width="60%" style="vertical-align:top" | <!-- Wiki News Block --> <div style="margin:0; margin-top:10px; border:1px solid #dfdfdf; padding: 0em 1em 1em 3em; background-color:#f0e0d0; align:left; text-indent:-2em;"> === News === <div style="font-size:small"> 2010-11-20 - Edward released [http://www.kernel.org/pub/linux/kernel/people/edward/reiser4/reiser4-for-2.6/ Reiser4 patches for Linux 2.6.36] - 
Thanks! 2010-10-18 - [http://www.spinics.net/lists/reiserfs-devel/msg02494.html Viji V Nair] posted [http://viji.fedorapeople.org/reiser4/F13/x86_64/ Reiser4-enabled kernel RPMs] for Fedora 13. :::Please test and post results to the [[Mailinglists|list]]! 2010-10-14 - [http://www.spinics.net/lists/reiserfs-devel/msg02493.html Glenn] posted [https://build.opensuse.org/package/binaries?package=kernel-reiser4&project=home%3Adoiggl&repository=openSUSE_11.3 Reiser4-enabled kernel RPMs] for OpenSUSE 11.3. :::Please test and post results to the [[Mailinglists|list]]! 2010-08-04 - [http://www.kernel.org/pub/linux/kernel/people/edward/reiser4/reiser4-for-2.6/?C=M;O=D Reiser4 patches for Linux 2.6.35] - Please [http://www.spinics.net/lists/reiserfs-devel/msg02373.html test]! 2010-05-26 - Apparently [http://chichkin_i.zelnet.ru Edward] released [http://www.kernel.org/pub/linux/kernel/people/edward/reiser4/reiser4-for-2.6/?C=M;O=D Reiser4 patches for Linux 2.6.34] ::: - Testers welcome! :-) 2010-04-27 - A benchmark of reiser4 was published on [http://www.phoronix.com/scan.php?page=article&item=reiser4_benchmarks&num=1 Phoronix]. 2010-03-04 - [http://chichkin_i.zelnet.ru Edward] released [http://www.kernel.org/pub/linux/kernel/people/edward/reiser4/reiser4-for-2.6/?C=M;O=D Reiser4 patches for Linux 2.6.33] - Thanks! 2010-02-15 - The [http://git.zen-kernel.org/?p=kernel/zen.git;a=shortlog;h=refs/heads/reiser4 Zen Kernel] works for [http://nerdbynature.de/benchmarks/v40z/2010-02-15/bonnie.html 2.6.33 too] - hey! 
:-) 2009-11-24 - [http://zen-kernel.org/ zen-kernel.org] <small>(also hosting the [[Reiser4_patchsets|MMOTM kernel]])</small> ships with Reiser4 :::and [http://www.spinics.net/lists/reiserfs-devel/msg01999.html is said to work] for [http://downloads.zen-kernel.org/2.6.32/ 2.6.32] 2009-11-10 - [http://www.phoronix.com/scan.php?page=news_item&px=NzY4OQ Reiser4 May Go For Mainline Inclusion In 2010] 2009-10-26 - Viji V Nair [http://www.spinics.net/lists/reiserfs-devel/msg01957.html released] Fedora 11 kernel [http://fedoraproject.org/wiki/User:Viji#Fedora_kernel_rpm_with_reiser4_support RPMs with Reiser4 support] 2009-10-05 - Edward [http://marc.info/?l=reiserfs-devel&m=125470523000355&w=2 released] Reiser4 [http://www.kernel.org/pub/linux/kernel/people/edward/reiser4/reiser4-for-2.6/?C=M;O=D patches for 2.6.31] - please test! 2009-09-11 - Reiser4 patches for 2.6.30 are [http://kerneltrap.org/mailarchive/reiserfs-devel/2009/9/11/6399383 said to work] for :::the recently released 2.6.31 kernel as well. 2009-06-22 - [http://www.kernel.org/pub/linux/kernel/people/edward/reiser4/reiser4-for-2.6/?C=M;O=D Reiser4 patches for Linux 2.6.30 released!] 2009-04-25 - [[TODO]] list updated, we're now down to 5 open issues. 2009-01-17 - [http://lwn.net/Articles/315509/ Reiser4progs-1.0.7 released] 2009-01-10 - [http://lwn.net/Articles/314451/ Reiserfsprogs-3.6.21 released] 2009-01-08 - Reiser4 kernel packages for [http://download.opensuse.org/repositories/drivers://filesystems/openSUSE_11.0/ openSUSE 11.0] and [http://download.opensuse.org/repositories/drivers://filesystems/openSUSE_11.1/ 11.1] have been built. 
2007-04-25 - [http://kerneltrap.org/node/8102 Reiser4's future] 2005-05-15 - The [http://grml.org/ grml] recovery CD comes with [http://grml.org/changelogs/README-0.4.txt Reiser4 support] </div> </div> |} d852a01e65bebb36a2ff7c3608060b8939c3105b 2062 1922 2010-10-27T22:45:12Z Chris goe 2 one old newsitem resurrected :) __NOTOC__ {| width="100%" |- |style="vertical-align:top" | <!-- Documentation Block --> <div style="margin:0; margin-top:10px; margin-right:10px; border:1px solid #dfdfdf; padding:0 1em 1em 1em; background-color:#dcf5ff; align:right;"> === Documentation === * [[Reiser4_Howto | Getting started with ReiserFS/Reiser4]] * [[FAQ|Frequently Asked Questions]] * [[Manpages]] * [[Publications | Articles and Publications]] * [[Benchmarks]] * [[Testimonials]] </div> <!-- Utilities Block --> <div style="margin:0; margin-top:10px; margin-right:10px; border:1px solid #dfdfdf; padding:0 1em 1em 1em; background-color:#F8F8FF; align:right;"> === Utilities === * [[reiser4progs]] * [[reiserfsprogs]] * [[Filesystem Testing Tools]] </div> <!-- Development Block --> <div style="margin:0; margin-top:10px; margin-right:10px; border:1px solid #dfdfdf; padding:0 1em 1em 1em; background-color:#fffff0; align:right; "> === Development === * [[Reiser4 patchsets]] and packages * [[Bugs]] * [[TODO]] list * [[Credits]] </div> <!-- Further Information Block --> <div style="margin:0; margin-top:10px; border:1px solid #dfdfdf; padding: 0em 1em 1em 1em; background-color:#dfefdf; align:left; margin-top:10px"> === Further Information === * [[mailinglists|Mailing Lists]] * [[IRC|IRC Channels]] </div> | width="60%" style="vertical-align:top" | <!-- Wiki News Block --> <div style="margin:0; margin-top:10px; border:1px solid #dfdfdf; padding: 0em 1em 1em 3em; background-color:#f0e0d0; align:left; text-indent:-2em;"> === News === <div style="font-size:small"> 2010-10-18 - [http://www.spinics.net/lists/reiserfs-devel/msg02494.html Viji V Nair] posted 
[http://viji.fedorapeople.org/reiser4/F13/x86_64/ Reiser4-enabled kernel RPMs] for Fedora 13. :::Please test and post results to the [[Mailinglists|list]]! 2010-10-14 - [http://www.spinics.net/lists/reiserfs-devel/msg02493.html Glenn] posted [https://build.opensuse.org/package/binaries?package=kernel-reiser4&project=home%3Adoiggl&repository=openSUSE_11.3 Reiser4-enabled kernel RPMs] for OpenSUSE 11.3. :::Please test and post results to the [[Mailinglists|list]]! 2010-08-04 - [http://www.kernel.org/pub/linux/kernel/people/edward/reiser4/reiser4-for-2.6/?C=M;O=D Reiser4 patches for Linux 2.6.35] - Please [http://www.spinics.net/lists/reiserfs-devel/msg02373.html test]! 2010-05-26 - Apparently [http://chichkin_i.zelnet.ru Edward] released [http://www.kernel.org/pub/linux/kernel/people/edward/reiser4/reiser4-for-2.6/?C=M;O=D Reiser4 patches for Linux 2.6.34] ::: - Testers welcome! :-) 2010-04-27 - A benchmark of reiser4 was published on [http://www.phoronix.com/scan.php?page=article&item=reiser4_benchmarks&num=1 Phoronix]. 2010-03-04 - [http://chichkin_i.zelnet.ru Edward] released [http://www.kernel.org/pub/linux/kernel/people/edward/reiser4/reiser4-for-2.6/?C=M;O=D Reiser4 patches for Linux 2.6.33] - Thanks! 2010-02-15 - The [http://git.zen-kernel.org/?p=kernel/zen.git;a=shortlog;h=refs/heads/reiser4 Zen Kernel] works for [http://nerdbynature.de/benchmarks/v40z/2010-02-15/bonnie.html 2.6.33 too] - hey! 
:-) 2009-11-24 - [http://zen-kernel.org/ zen-kernel.org] <small>(also hosting the [[Reiser4_patchsets|MMOTM kernel]])</small> ships with Reiser4 :::and [http://www.spinics.net/lists/reiserfs-devel/msg01999.html is said to work] for [http://downloads.zen-kernel.org/2.6.32/ 2.6.32] 2009-11-10 - [http://www.phoronix.com/scan.php?page=news_item&px=NzY4OQ Reiser4 May Go For Mainline Inclusion In 2010] 2009-10-26 - Viji V Nair [http://www.spinics.net/lists/reiserfs-devel/msg01957.html released] Fedora 11 kernel [http://fedoraproject.org/wiki/User:Viji#Fedora_kernel_rpm_with_reiser4_support RPMs with Reiser4 support] 2009-10-05 - Edward [http://marc.info/?l=reiserfs-devel&m=125470523000355&w=2 released] Reiser4 [http://www.kernel.org/pub/linux/kernel/people/edward/reiser4/reiser4-for-2.6/?C=M;O=D patches for 2.6.31] - please test! 2009-09-11 - Reiser4 patches for 2.6.30 are [http://kerneltrap.org/mailarchive/reiserfs-devel/2009/9/11/6399383 said to work] for :::the recently released 2.6.31 kernel as well. 2009-06-22 - [http://www.kernel.org/pub/linux/kernel/people/edward/reiser4/reiser4-for-2.6/?C=M;O=D Reiser4 patches for Linux 2.6.30 released!] 2009-04-25 - [[TODO]] list updated, we're now down to 5 open issues. 2009-01-17 - [http://lwn.net/Articles/315509/ Reiser4progs-1.0.7 released] 2009-01-10 - [http://lwn.net/Articles/314451/ Reiserfsprogs-3.6.21 released] 2009-01-08 - Reiser4 kernel packages for [http://download.opensuse.org/repositories/drivers://filesystems/openSUSE_11.0/ openSUSE 11.0] and [http://download.opensuse.org/repositories/drivers://filesystems/openSUSE_11.1/ 11.1] have been built. 
2007-04-25 - [http://kerneltrap.org/node/8102 Reiser4's future] 2005-05-15 - The [http://grml.org/ grml] recovery CD comes with [http://grml.org/changelogs/README-0.4.txt Reiser4 support] </div> </div> |} d46211d53146c5f6a08ee51c37ad90c7b9d2b0f0 1922 1912 2010-10-27T22:04:00Z Chris goe 2 namespace cleanup __NOTOC__ {| width="100%" |- |style="vertical-align:top" | <!-- Documentation Block --> <div style="margin:0; margin-top:10px; margin-right:10px; border:1px solid #dfdfdf; padding:0 1em 1em 1em; background-color:#dcf5ff; align:right;"> === Documentation === * [[Reiser4_Howto | Getting started with ReiserFS/Reiser4]] * [[FAQ|Frequently Asked Questions]] * [[Manpages]] * [[Publications | Articles and Publications]] * [[Benchmarks]] * [[Testimonials]] </div> <!-- Utilities Block --> <div style="margin:0; margin-top:10px; margin-right:10px; border:1px solid #dfdfdf; padding:0 1em 1em 1em; background-color:#F8F8FF; align:right;"> === Utilities === * [[reiser4progs]] * [[reiserfsprogs]] * [[Filesystem Testing Tools]] </div> <!-- Development Block --> <div style="margin:0; margin-top:10px; margin-right:10px; border:1px solid #dfdfdf; padding:0 1em 1em 1em; background-color:#fffff0; align:right; "> === Development === * [[Reiser4 patchsets]] and packages * [[Bugs]] * [[TODO]] list * [[Credits]] </div> <!-- Further Information Block --> <div style="margin:0; margin-top:10px; border:1px solid #dfdfdf; padding: 0em 1em 1em 1em; background-color:#dfefdf; align:left; margin-top:10px"> === Further Information === * [[mailinglists|Mailing Lists]] * [[IRC|IRC Channels]] </div> | width="60%" style="vertical-align:top" | <!-- Wiki News Block --> <div style="margin:0; margin-top:10px; border:1px solid #dfdfdf; padding: 0em 1em 1em 3em; background-color:#f0e0d0; align:left; text-indent:-2em;"> === News === <div style="font-size:small"> 2010-10-18 - [http://www.spinics.net/lists/reiserfs-devel/msg02494.html Viji V Nair] posted [http://viji.fedorapeople.org/reiser4/F13/x86_64/ 
Reiser4-enabled kernel RPMs] for Fedora 13. :::Please test and post results to the [[Mailinglists|list]]! 2010-10-14 - [http://www.spinics.net/lists/reiserfs-devel/msg02493.html Glenn] posted [https://build.opensuse.org/package/binaries?package=kernel-reiser4&project=home%3Adoiggl&repository=openSUSE_11.3 Reiser4-enabled kernel RPMs] for OpenSUSE 11.3. :::Please test and post results to the [[Mailinglists|list]]! 2010-08-04 - [http://www.kernel.org/pub/linux/kernel/people/edward/reiser4/reiser4-for-2.6/?C=M;O=D Reiser4 patches for Linux 2.6.35] - Please [http://www.spinics.net/lists/reiserfs-devel/msg02373.html test]! 2010-05-26 - Apparently [http://chichkin_i.zelnet.ru Edward] released [http://www.kernel.org/pub/linux/kernel/people/edward/reiser4/reiser4-for-2.6/?C=M;O=D Reiser4 patches for Linux 2.6.34] ::: - Testers welcome! :-) 2010-04-27 - A benchmark of reiser4 was published on [http://www.phoronix.com/scan.php?page=article&item=reiser4_benchmarks&num=1 Phoronix]. 2010-03-04 - [http://chichkin_i.zelnet.ru Edward] released [http://www.kernel.org/pub/linux/kernel/people/edward/reiser4/reiser4-for-2.6/?C=M;O=D Reiser4 patches for Linux 2.6.33] - Thanks! 2010-02-15 - The [http://git.zen-kernel.org/?p=kernel/zen.git;a=shortlog;h=refs/heads/reiser4 Zen Kernel] works for [http://nerdbynature.de/benchmarks/v40z/2010-02-15/bonnie.html 2.6.33 too] - hey! 
:-) 2009-11-24 - [http://zen-kernel.org/ zen-kernel.org] <small>(also hosting the [[Reiser4_patchsets|MMOTM kernel]])</small> ships with Reiser4 :::and [http://www.spinics.net/lists/reiserfs-devel/msg01999.html is said to work] for [http://downloads.zen-kernel.org/2.6.32/ 2.6.32] 2009-11-10 - [http://www.phoronix.com/scan.php?page=news_item&px=NzY4OQ Reiser4 May Go For Mainline Inclusion In 2010] 2009-10-26 - Viji V Nair [http://www.spinics.net/lists/reiserfs-devel/msg01957.html released] Fedora 11 kernel [http://fedoraproject.org/wiki/User:Viji#Fedora_kernel_rpm_with_reiser4_support RPMs with Reiser4 support] 2009-10-05 - Edward [http://marc.info/?l=reiserfs-devel&m=125470523000355&w=2 released] Reiser4 [http://www.kernel.org/pub/linux/kernel/people/edward/reiser4/reiser4-for-2.6/?C=M;O=D patches for 2.6.31] - please test! 2009-09-11 - Reiser4 patches for 2.6.30 are [http://kerneltrap.org/mailarchive/reiserfs-devel/2009/9/11/6399383 said to work] for :::the recently released 2.6.31 kernel as well. 2009-06-22 - [http://www.kernel.org/pub/linux/kernel/people/edward/reiser4/reiser4-for-2.6/?C=M;O=D Reiser4 patches for Linux 2.6.30 released!] 2009-04-25 - [[TODO]] list updated, we're now down to 5 open issues. 2009-01-17 - [http://lwn.net/Articles/315509/ Reiser4progs-1.0.7 released] 2009-01-10 - [http://lwn.net/Articles/314451/ Reiserfsprogs-3.6.21 released] 2009-01-08 - Reiser4 kernel packages for [http://download.opensuse.org/repositories/drivers://filesystems/openSUSE_11.0/ openSUSE 11.0] and [http://download.opensuse.org/repositories/drivers://filesystems/openSUSE_11.1/ 11.1] have been built. 
2007-04-25 - [http://kerneltrap.org/node/8102 Reiser4's future] </div> </div> |} 4e2e27d02bd24c7ac3160f84fe9060619fc44076 1912 1695 2010-10-18T01:17:57Z Chris goe 2 moved to Sitenotice __NOTOC__ {| width="100%" |- |style="vertical-align:top" | <!-- Documentation Block --> <div style="margin:0; margin-top:10px; margin-right:10px; border:1px solid #dfdfdf; padding:0 1em 1em 1em; background-color:#dcf5ff; align:right;"> === Documentation === * [[Reiser4_Howto | Getting started with ReiserFS/Reiser4]] * [[FAQ|Frequently Asked Questions]] * [[Manpages]] * [[Publications | Articles and Publications]] * [[Benchmarks]] * [[Testimonials]] </div> <!-- Utilities Block --> <div style="margin:0; margin-top:10px; margin-right:10px; border:1px solid #dfdfdf; padding:0 1em 1em 1em; background-color:#F8F8FF; align:right;"> === Utilities === * [[reiser4progs]] * [[reiserfsprogs]] * [[Filesystem Testing Tools]] </div> <!-- Development Block --> <div style="margin:0; margin-top:10px; margin-right:10px; border:1px solid #dfdfdf; padding:0 1em 1em 1em; background-color:#fffff0; align:right; "> === Development === * [[Reiser4 patchsets]] and packages * [[Bugs]] * [[TODO]] list * [[Credits]] </div> <!-- Further Information Block --> <div style="margin:0; margin-top:10px; border:1px solid #dfdfdf; padding: 0em 1em 1em 1em; background-color:#dfefdf; align:left; margin-top:10px"> === Further Information === * [[mailinglists|Mailing Lists]] * [[IRC|IRC Channels]] </div> | width="60%" style="vertical-align:top" | <!-- Wiki News Block --> <div style="margin:0; margin-top:10px; border:1px solid #dfdfdf; padding: 0em 1em 1em 3em; background-color:#f0e0d0; align:left; text-indent:-2em;"> === News === <div style="font-size:small">{{Reiser4:News_Contents}} </div> <div align="right"><small>'''More [[Reiser4:News|News]]'''</small></div> </div> |} 1b2c33ece9fb93f9318ce8bfd4b09f82cdb8678b 1695 1615 2010-04-16T06:57:06Z Chris goe 2 formatting fixes __NOTOC__ <!-- Welcome Block --> <div style="margin:0; 
margin-top:10px; margin-right:10px; border:1px solid #dfdfdf; padding:0 1em 1em 1em; background-color:#fff0e0; align:right;"><br> Welcome to the '''Reiser4 Wiki''', the Wiki for users and developers of the [[ReiserFS]] and [[Reiser4]] filesystems. </div> {| width="100%" |- |style="vertical-align:top" | <!-- Documentation Block --> <div style="margin:0; margin-top:10px; margin-right:10px; border:1px solid #dfdfdf; padding:0 1em 1em 1em; background-color:#dcf5ff; align:right;"> === Documentation === * [[Reiser4_Howto | Getting started with ReiserFS/Reiser4]] * [[FAQ|Frequently Asked Questions]] * [[Manpages]] * [[Publications | Articles and Publications]] * [[Benchmarks]] * [[Testimonials]] </div> <!-- Utilities Block --> <div style="margin:0; margin-top:10px; margin-right:10px; border:1px solid #dfdfdf; padding:0 1em 1em 1em; background-color:#F8F8FF; align:right;"> === Utilities === * [[reiser4progs]] * [[reiserfsprogs]] * [[Filesystem Testing Tools]] </div> <!-- Development Block --> <div style="margin:0; margin-top:10px; margin-right:10px; border:1px solid #dfdfdf; padding:0 1em 1em 1em; background-color:#fffff0; align:right; "> === Development === * [[Reiser4 patchsets]] and packages * [[Bugs]] * [[TODO]] list * [[Credits]] </div> <!-- Further Information Block --> <div style="margin:0; margin-top:10px; border:1px solid #dfdfdf; padding: 0em 1em 1em 1em; background-color:#dfefdf; align:left; margin-top:10px"> === Further Information === * [[mailinglists|Mailing Lists]] * [[IRC|IRC Channels]] </div> | width="60%" style="vertical-align:top" | <!-- Wiki News Block --> <div style="margin:0; margin-top:10px; border:1px solid #dfdfdf; padding: 0em 1em 1em 3em; background-color:#f0e0d0; align:left; text-indent:-2em;"> === News === <div style="font-size:small">{{Reiser4:News_Contents}} </div> <div align="right"><small>'''More [[Reiser4:News|News]]'''</small></div> </div> |} 6017718ae241591d383101cb25a582c90422d0ae 1615 1454 2009-07-21T06:25:51Z Chris goe 2 info moved to 
:sitenotice __NOTOC__ {| |- | nowrap style="vertical-align: top; font: bold xx-large sans-serif; " | Reiser4 (and ReiserFS) Wiki |} <!-- Welcome Block --> <div style="margin:0; margin-top:10px; margin-right:10px; border:1px solid #dfdfdf; padding:0 1em 1em 1em; background-color:#fff0e0; align:right;"> Welcome to the '''Reiser4 Wiki''', the Wiki for users and developers of the [[ReiserFS]] and [[Reiser4]] filesystems. </div> {| width="100%" |- |style="vertical-align:top" | <!-- Documentation Block --> <div style="margin:0; margin-top:10px; margin-right:10px; border:1px solid #dfdfdf; padding:0 1em 1em 1em; background-color:#dcf5ff; align:right;"> === Documentation === * [[Reiser4_Howto | Getting started with ReiserFS/Reiser4]] * [[FAQ|Frequently Asked Questions]] * [[Manpages]] * [[Publications | Articles and Publications]] * [[Benchmarks]] * [[Testimonials]] </div> <!-- Utilities Block --> <div style="margin:0; margin-top:10px; margin-right:10px; border:1px solid #dfdfdf; padding:0 1em 1em 1em; background-color:#F8F8FF; align:right;"> === Utilities === * [[reiser4progs]] * [[reiserfsprogs]] * [[Filesystem Testing Tools]] </div> <!-- Development Block --> <div style="margin:0; margin-top:10px; margin-right:10px; border:1px solid #dfdfdf; padding:0 1em 1em 1em; background-color:#fffff0; align:right; "> === Development === * [[Reiser4 patchsets]] and packages * [[Bugs]] * [[TODO]] list * [[Credits]] </div> | width="60%" style="vertical-align:top" | <!-- Wiki News Block --> <div style="margin:0; margin-top:10px; border:1px solid #dfdfdf; padding: 0em 1em 1em 3em; background-color:#f0e0d0; align:left; text-indent:-2em;"> === News === <div style="font-size:small">{{Reiser4:News_Contents}} </div> <div align="right"><small>'''More [[Reiser4:News|News]]'''</small></div> </div> <!-- Further Information Block --> <div style="margin:0; margin-top:10px; border:1px solid #dfdfdf; padding: 0em 1em 1em 1em; background-color:#dfefdf; align:left; margin-top:10px"> === Further 
Information === * [[mailinglists|Mailing Lists]] * [[IRC|IRC Channels]] </div> |} b0a02ca4d9014f640ab2b4235c498f822b61067d 1454 1429 2009-06-27T03:54:25Z Chris goe 2 reiser4progs split __NOTOC__ {| |- | nowrap style="vertical-align: top; font: bold xx-large sans-serif; " | Reiser4 (and ReiserFS) Wiki |} <!-- Welcome Block --> <div style="margin:0; margin-top:10px; margin-right:10px; border:1px solid #dfdfdf; padding:0 1em 1em 1em; background-color:#fff0e0; align:right;"> Welcome to the '''Reiser4 Wiki''', the Wiki for users and developers of the [[ReiserFS]] and [[Reiser4]] filesystems. * For now, most of the documentation is just a [http://web.archive.org/web/20070929195459/http://www.namesys.com/ snapshot of the old Namesys site] (archive.org, 2007-09-29). * There was also a [http://web.archive.org/web/20070706050724/http://pub.namesys.com/ Reiser4 Wiki] (archive.org, 2007-07-06) once on <tt>pub.namesys.com</tt>. </div> {| width="100%" |- |style="vertical-align:top" | <!-- Documentation Block --> <div style="margin:0; margin-top:10px; margin-right:10px; border:1px solid #dfdfdf; padding:0 1em 1em 1em; background-color:#dcf5ff; align:right;"> === Documentation === * [[Reiser4_Howto | Getting started with ReiserFS/Reiser4]] * [[FAQ|Frequently Asked Questions]] * [[Manpages]] * [[Publications | Articles and Publications]] * [[Benchmarks]] * [[Testimonials]] </div> <!-- Utilities Block --> <div style="margin:0; margin-top:10px; margin-right:10px; border:1px solid #dfdfdf; padding:0 1em 1em 1em; background-color:#F8F8FF; align:right;"> === Utilities === * [[reiser4progs]] * [[reiserfsprogs]] * [[Filesystem Testing Tools]] </div> <!-- Development Block --> <div style="margin:0; margin-top:10px; margin-right:10px; border:1px solid #dfdfdf; padding:0 1em 1em 1em; background-color:#fffff0; align:right; "> === Development === * [[Reiser4 patchsets]] and packages * [[Bugs]] * [[TODO]] list * [[Credits]] </div> | width="60%" style="vertical-align:top" | <!-- Wiki News Block 
--> <div style="margin:0; margin-top:10px; border:1px solid #dfdfdf; padding: 0em 1em 1em 3em; background-color:#f0e0d0; align:left; text-indent:-2em;"> === News === <div style="font-size:small">{{Reiser4:News_Contents}} </div> <div align="right"><small>'''More [[Reiser4:News|News]]'''</small></div> </div> <!-- Further Information Block --> <div style="margin:0; margin-top:10px; border:1px solid #dfdfdf; padding: 0em 1em 1em 1em; background-color:#dfefdf; align:left; margin-top:10px"> === Further Information === * [[mailinglists|Mailing Lists]] * [[IRC|IRC Channels]] </div> |} 8eacce06206a3374e911e9b4850ad39b7e527f93 1429 1411 2009-06-27T01:19:01Z Chris goe 2 -> FAQ __NOTOC__ {| |- | nowrap style="vertical-align: top; font: bold xx-large sans-serif; " | Reiser4 (and ReiserFS) Wiki |} <!-- Welcome Block --> <div style="margin:0; margin-top:10px; margin-right:10px; border:1px solid #dfdfdf; padding:0 1em 1em 1em; background-color:#fff0e0; align:right;"> Welcome to the '''Reiser4 Wiki''', the Wiki for users and developers of the [[ReiserFS]] and [[Reiser4]] filesystems. * For now, most of the documentation is just a [http://web.archive.org/web/20070929195459/http://www.namesys.com/ snapshot of the old Namesys site] (archive.org, 2007-09-29). * There was also a [http://web.archive.org/web/20070706050724/http://pub.namesys.com/ Reiser4 Wiki] (archive.org, 2007-07-06) once on <tt>pub.namesys.com</tt>. 
</div> {| width="100%" |- |style="vertical-align:top" | <!-- Documentation Block --> <div style="margin:0; margin-top:10px; margin-right:10px; border:1px solid #dfdfdf; padding:0 1em 1em 1em; background-color:#dcf5ff; align:right;"> === Documentation === * [[Reiser4_Howto | Getting started with ReiserFS/Reiser4]] * [[FAQ|Frequently Asked Questions]] * [[Manpages]] * [[Publications | Articles and Publications]] * [[Benchmarks]] * [[Testimonials]] </div> <!-- Utilities Block --> <div style="margin:0; margin-top:10px; margin-right:10px; border:1px solid #dfdfdf; padding:0 1em 1em 1em; background-color:#F8F8FF; align:right;"> === Utilities === * [[Reiser4progs|reiserfsprogs/reiser4progs]] * [[Filesystem Testing Tools]] </div> <!-- Development Block --> <div style="margin:0; margin-top:10px; margin-right:10px; border:1px solid #dfdfdf; padding:0 1em 1em 1em; background-color:#fffff0; align:right; "> === Development === * [[Reiser4 patchsets]] and packages * [[Bugs]] * [[TODO]] list * [[Credits]] </div> | width="60%" style="vertical-align:top" | <!-- Wiki News Block --> <div style="margin:0; margin-top:10px; border:1px solid #dfdfdf; padding: 0em 1em 1em 3em; background-color:#f0e0d0; align:left; text-indent:-2em;"> === News === <div style="font-size:small">{{Reiser4:News_Contents}} </div> <div align="right"><small>'''More [[Reiser4:News|News]]'''</small></div> </div> <!-- Further Information Block --> <div style="margin:0; margin-top:10px; border:1px solid #dfdfdf; padding: 0em 1em 1em 1em; background-color:#dfefdf; align:left; margin-top:10px"> === Further Information === * [[mailinglists|Mailing Lists]] * [[IRC|IRC Channels]] </div> |} 8434695830dfbf219695e49994ce2a83273688ce 1411 1405 2009-06-26T21:37:00Z Chris goe 2 pub.namesys.com wiki added __NOTOC__ {| |- | nowrap style="vertical-align: top; font: bold xx-large sans-serif; " | Reiser4 (and ReiserFS) Wiki |} <!-- Welcome Block --> <div style="margin:0; margin-top:10px; margin-right:10px; border:1px solid #dfdfdf; 
padding:0 1em 1em 1em; background-color:#fff0e0; align:right;"> Welcome to the '''Reiser4 Wiki''', the Wiki for users and developers of the [[ReiserFS]] and [[Reiser4]] filesystems. * For now, most of the documentation is just a [http://web.archive.org/web/20070929195459/http://www.namesys.com/ snapshot of the old Namesys site] (archive.org, 2007-09-29). * There was also a [http://web.archive.org/web/20070706050724/http://pub.namesys.com/ Reiser4 Wiki] (archive.org, 2007-07-06) once on <tt>pub.namesys.com</tt>. </div> {| width="100%" |- |style="vertical-align:top" | <!-- Documentation Block --> <div style="margin:0; margin-top:10px; margin-right:10px; border:1px solid #dfdfdf; padding:0 1em 1em 1em; background-color:#dcf5ff; align:right;"> === Documentation === * [[Reiser4_Howto | Getting started with ReiserFS/Reiser4]] * [[Frequently Asked Questions]] * [[Manpages]] * [[Publications | Articles and Publications]] * [[Benchmarks]] * [[Testimonials]] </div> <!-- Utilities Block --> <div style="margin:0; margin-top:10px; margin-right:10px; border:1px solid #dfdfdf; padding:0 1em 1em 1em; background-color:#F8F8FF; align:right;"> === Utilities === * [[Reiser4progs|reiserfsprogs/reiser4progs]] * [[Filesystem Testing Tools]] </div> <!-- Development Block --> <div style="margin:0; margin-top:10px; margin-right:10px; border:1px solid #dfdfdf; padding:0 1em 1em 1em; background-color:#fffff0; align:right; "> === Development === * [[Reiser4 patchsets]] and packages * [[Bugs]] * [[TODO]] list * [[Credits]] </div> | width="60%" style="vertical-align:top" | <!-- Wiki News Block --> <div style="margin:0; margin-top:10px; border:1px solid #dfdfdf; padding: 0em 1em 1em 3em; background-color:#f0e0d0; align:left; text-indent:-2em;"> === News === <div style="font-size:small">{{Reiser4:News_Contents}} </div> <div align="right"><small>'''More [[Reiser4:News|News]]'''</small></div> </div> <!-- Further Information Block --> <div style="margin:0; margin-top:10px; border:1px solid #dfdfdf; 
padding: 0em 1em 1em 1em; background-color:#dfefdf; align:left; margin-top:10px"> === Further Information === * [[mailinglists|Mailing Lists]] * [[IRC|IRC Channels]] </div> |} 8ea5bf7b857bbacff8134ac919106323f1139e8f 1405 1401 2009-06-26T21:13:02Z Chris goe 2 more news :) __NOTOC__ {| |- | nowrap style="vertical-align: top; font: bold xx-large sans-serif; " | Reiser4 (and ReiserFS) Wiki |} <!-- Welcome Block --> <div style="margin:0; margin-top:10px; margin-right:10px; border:1px solid #dfdfdf; padding:0 1em 1em 1em; background-color:#fff0e0; align:right;"> Welcome to the '''Reiser4 Wiki''', the Wiki for users and developers of the [[ReiserFS]] and [[Reiser4]] filesystems. NOTE: For now, most of the documentation is just a [http://web.archive.org/web/20070929195459/http://www.namesys.com/ snapshot of the old Namesys site] (archive.org, 2007-09-29). </div> {| width="100%" |- |style="vertical-align:top" | <!-- Documentation Block --> <div style="margin:0; margin-top:10px; margin-right:10px; border:1px solid #dfdfdf; padding:0 1em 1em 1em; background-color:#dcf5ff; align:right;"> === Documentation === * [[Reiser4_Howto | Getting started with ReiserFS/Reiser4]] * [[Frequently Asked Questions]] * [[Manpages]] * [[Publications | Articles and Publications]] * [[Benchmarks]] * [[Testimonials]] </div> <!-- Utilities Block --> <div style="margin:0; margin-top:10px; margin-right:10px; border:1px solid #dfdfdf; padding:0 1em 1em 1em; background-color:#F8F8FF; align:right;"> === Utilities === * [[Reiser4progs|reiserfsprogs/reiser4progs]] * [[Filesystem Testing Tools]] </div> <!-- Development Block --> <div style="margin:0; margin-top:10px; margin-right:10px; border:1px solid #dfdfdf; padding:0 1em 1em 1em; background-color:#fffff0; align:right; "> === Development === * [[Reiser4 patchsets]] and packages * [[Bugs]] * [[TODO]] list * [[Credits]] </div> | width="60%" style="vertical-align:top" | <!-- Wiki News Block --> <div style="margin:0; margin-top:10px; border:1px solid 
#dfdfdf; padding: 0em 1em 1em 3em; background-color:#f0e0d0; align:left; text-indent:-2em;"> === News === <div style="font-size:small">{{Reiser4:News_Contents}} </div> <div align="right"><small>'''More [[Reiser4:News|News]]'''</small></div> </div> <!-- Further Information Block --> <div style="margin:0; margin-top:10px; border:1px solid #dfdfdf; padding: 0em 1em 1em 1em; background-color:#dfefdf; align:left; margin-top:10px"> === Further Information === * [[mailinglists|Mailing Lists]] * [[IRC|IRC Channels]] </div> |} ca4e8ff9d994014a8dc96daa59cf9a999c733797 1401 1353 2009-06-26T20:43:02Z Chris goe 2 ...and packages. __NOTOC__ {| |- | nowrap style="vertical-align: top; font: bold xx-large sans-serif; " | Reiser4 (and ReiserFS) Wiki |} <!-- Welcome Block --> <div style="margin:0; margin-top:10px; margin-right:10px; border:1px solid #dfdfdf; padding:0 1em 1em 1em; background-color:#fff0e0; align:right;"> Welcome to the '''Reiser4 Wiki''', the Wiki for users and developers of the [[ReiserFS]] and [[Reiser4]] filesystems. NOTE: For now, most of the documentation is just a [http://web.archive.org/web/20070929195459/http://www.namesys.com/ snapshot of the old Namesys site] (archive.org, 2007-09-29). 
</div> {| width="100%" |- |style="vertical-align:top" | <!-- Documentation Block --> <div style="margin:0; margin-top:10px; margin-right:10px; border:1px solid #dfdfdf; padding:0 1em 1em 1em; background-color:#dcf5ff; align:right;"> === Documentation === * [[Reiser4_Howto | Getting started with ReiserFS/Reiser4]] * [[Frequently Asked Questions]] * [[Manpages]] * [[Publications | Articles and Publications]] * [[Benchmarks]] * [[Testimonials]] </div> <!-- Utilities Block : DO NOT EDIT HERE --> <div style="margin:0; margin-top:10px; margin-right:10px; border:1px solid #dfdfdf; padding:0 1em 1em 1em; background-color:#F8F8FF; align:right;"> === Utilities === * [[Reiser4progs|reiserfsprogs/reiser4progs]] * [[Filesystem Testing Tools]] </div> <!-- Development Block --> <div style="margin:0; margin-top:10px; margin-right:10px; border:1px solid #dfdfdf; padding:0 1em 1em 1em; background-color:#fffff0; align:right; "> === Development === * [[Reiser4 patchsets]] and packages * [[Bugs]] * [[TODO]] list * [[Credits]] </div> | width="50%" style="vertical-align:top" | <!-- Wiki News Block : DO NOT EDIT HERE --> <div style="margin:0; margin-top:10px; border:1px solid #dfdfdf; padding: 0em 1em 1em 3em; background-color:#f0e0d0; align:left; text-indent:-2em;"> === News === <div style="font-size:small">{{Reiser4:News_Contents}} </div> <div align="right"><small>'''More [[Reiser4:News|News]]'''</small></div> </div> <!-- Further Information Block --> <div style="margin:0; margin-top:10px; border:1px solid #dfdfdf; padding: 0em 1em 1em 1em; background-color:#dfefdf; align:left; margin-top:10px"> === Further Information === * [[mailinglists|Mailing Lists ]] * [[IRC|IRC Channels ]] </div> |} 594c6199f5749f9939af54febc9f55509ba4fe4e 1353 1347 2009-06-25T08:31:58Z Chris goe 2 -list __NOTOC__ {| |- | nowrap style="vertical-align: top; font: bold xx-large sans-serif; " | Reiser4 (and ReiserFS) Wiki |} <!-- Welcome Block --> <div style="margin:0; margin-top:10px; margin-right:10px; border:1px 
solid #dfdfdf; padding:0 1em 1em 1em; background-color:#fff0e0; align:right;"> Welcome to the '''Reiser4 Wiki''', the Wiki for users and developers of the [[ReiserFS]] and [[Reiser4]] filesystems. NOTE: For now, most of the documentation is just a [http://web.archive.org/web/20070929195459/http://www.namesys.com/ snapshot of the old Namesys site] (archive.org, 2007-09-29). </div> {| width="100%" |- |style="vertical-align:top" | <!-- Documentation Block --> <div style="margin:0; margin-top:10px; margin-right:10px; border:1px solid #dfdfdf; padding:0 1em 1em 1em; background-color:#dcf5ff; align:right;"> === Documentation === * [[Reiser4_Howto | Getting started with ReiserFS/Reiser4]] * [[Frequently Asked Questions]] * [[Manpages]] * [[Publications | Articles and Publications]] * [[Benchmarks]] * [[Testimonials]] </div> <!-- Utilities Block : DO NOT EDIT HERE --> <div style="margin:0; margin-top:10px; margin-right:10px; border:1px solid #dfdfdf; padding:0 1em 1em 1em; background-color:#F8F8FF; align:right;"> === Utilities === * [[Reiser4progs|reiserfsprogs/reiser4progs]] * [[Filesystem Testing Tools]] </div> <!-- Development Block --> <div style="margin:0; margin-top:10px; margin-right:10px; border:1px solid #dfdfdf; padding:0 1em 1em 1em; background-color:#fffff0; align:right; "> === Development === * [[Reiser4 patchsets]] * [[Bugs]] * [[TODO]] list * [[Credits]] </div> | width="50%" style="vertical-align:top" | <!-- Wiki News Block : DO NOT EDIT HERE --> <div style="margin:0; margin-top:10px; border:1px solid #dfdfdf; padding: 0em 1em 1em 3em; background-color:#f0e0d0; align:left; text-indent:-2em;"> === News === <div style="font-size:small">{{Reiser4:News_Contents}} </div> <div align="right"><small>'''More [[Reiser4:News|News]]'''</small></div> </div> <!-- Further Information Block --> <div style="margin:0; margin-top:10px; border:1px solid #dfdfdf; padding: 0em 1em 1em 1em; background-color:#dfefdf; align:left; margin-top:10px"> === Further Information === * 
[[mailinglists|Mailing Lists ]] * [[IRC|IRC Channels ]] </div> |} ea35fac661229c9d22d2407621d6369dfa487719 1347 1344 2009-06-25T08:19:48Z Chris goe 2 take #2 __NOTOC__ {| |- | nowrap style="vertical-align: top; font: bold xx-large sans-serif; " | Reiser4 (and ReiserFS) Wiki |} <!-- Welcome Block --> <div style="margin:0; margin-top:10px; margin-right:10px; border:1px solid #dfdfdf; padding:0 1em 1em 1em; background-color:#fff0e0; align:right;"> Welcome to the '''Reiser4 Wiki''', the Wiki for users and developers of the [[ReiserFS]] and [[Reiser4]] filesystems. NOTE: For now, most of the documentation is just a [http://web.archive.org/web/20070929195459/http://www.namesys.com/ snapshot of the old Namesys site] (archive.org, 2007-09-29). </div> {| width="100%" |- |style="vertical-align:top" | <!-- Documentation Block --> <div style="margin:0; margin-top:10px; margin-right:10px; border:1px solid #dfdfdf; padding:0 1em 1em 1em; background-color:#dcf5ff; align:right;"> === Documentation === * [[Reiser4_Howto | Getting started with ReiserFS/Reiser4]] * [[Frequently Asked Questions]] * [[Manpages]] * [[Publications | Articles and Publications]] * [[Benchmarks]] * [[Testimonials]] </div> <!-- Utilities Block : DO NOT EDIT HERE --> <div style="margin:0; margin-top:10px; margin-right:10px; border:1px solid #dfdfdf; padding:0 1em 1em 1em; background-color:#F8F8FF; align:right;"> === Utilities === * [[Reiser4progs|reiserfsprogs/reiser4progs]] * [[Filesystem Testing Tools]] </div> <!-- Development Block --> <div style="margin:0; margin-top:10px; margin-right:10px; border:1px solid #dfdfdf; padding:0 1em 1em 1em; background-color:#fffff0; align:right; "> === Development === * [[Reiser4 patchsets]] * [[Bugs]] * [[TODO list]] * [[Credits]] </div> | width="50%" style="vertical-align:top" | <!-- Wiki News Block : DO NOT EDIT HERE --> <div style="margin:0; margin-top:10px; border:1px solid #dfdfdf; padding: 0em 1em 1em 3em; background-color:#f0e0d0; align:left; text-indent:-2em;"> 
=== News === <div style="font-size:small">{{Reiser4:News_Contents}} </div> <div align="right"><small>'''More [[Reiser4:News|News]]'''</small></div> </div> <!-- Further Information Block --> <div style="margin:0; margin-top:10px; border:1px solid #dfdfdf; padding: 0em 1em 1em 1em; background-color:#dfefdf; align:left; margin-top:10px"> === Further Information === * [[mailinglists|Mailing Lists ]] * [[IRC|IRC Channels ]] </div> |} c5b851f03f762f55e94de9f86210a96ab675896b 1344 1342 2009-06-25T08:10:25Z Chris goe 2 credits added __NOTOC__ {| |- | nowrap style="vertical-align: top; font: bold xx-large sans-serif; " | Reiser4 (and ReiserFS) Wiki |} <!-- Welcome Block --> <div style="margin:0; margin-top:10px; margin-right:10px; border:1px solid #dfdfdf; padding:0 1em 1em 1em; background-color:#fff0e0; align:right;"> Welcome to the '''Reiser4 Wiki''', the Wiki for users and developers of the [[ReiserFS]] and [[Reiser4]] filesystems. NOTE: For now, most of the documentation is just a [http://web.archive.org/web/20070929195459/http://www.namesys.com/ snapshot of the old Namesys site] (archive.org, 2007-09-29). 
</div> {| width="100%" |- |style="vertical-align:top" | <!-- Documentation Block --> <div style="margin:0; margin-top:10px; margin-right:10px; border:1px solid #dfdfdf; padding:0 1em 1em 1em; background-color:#dcf5ff; align:right;"> === Documentation === * [[Reiser4_Howto | Getting started with ReiserFS/Reiser4]] * [[Frequently Asked Questions]] * [[Manpages]] * [[Publications | Articles and Publications]] * [[Benchmarks]] * [[Testimonials]] </div> <!-- Utilities Block : DO NOT EDIT HERE --> <div style="margin:0; margin-top:10px; margin-right:10px; border:1px solid #dfdfdf; padding:0 1em 1em 1em; background-color:#F8F8FF; align:right;"> === Utilities === * [[Reiser4progs|reiserfsprogs/reiser4progs]] * [[Filesystem Testing Tools]] </div> <!-- Development Block --> <div style="margin:0; margin-top:10px; margin-right:10px; border:1px solid #dfdfdf; padding:0 1em 1em 1em; background-color:#fffff0; align:right; "> === Development === * [[Reiser4 patchsets]] * [[Bugs]] * [[TODO list]] * [[Credits]] </div> | width="50%" style="vertical-align:top" | <!-- Wiki News Block : DO NOT EDIT HERE --> <div style="margin:0; margin-top:10px; border:1px solid #dfdfdf; padding: 0em 1em 1em 3em; background-color:#f0e0d0; align:left; text-indent:-2em;"> === News === <div style="font-size:small">{{Reiser4:News}} </div> </div> <!-- Further Information Block --> <div style="margin:0; margin-top:10px; border:1px solid #dfdfdf; padding: 0em 1em 1em 1em; background-color:#dfefdf; align:left; margin-top:10px"> === Further Information === * [[mailinglists|Mailing Lists ]] * [[IRC|IRC Channels ]] </div> |} d379cfa3f3e0867824cd6a2a17285a12804f297b 1342 1337 2009-06-25T08:07:30Z Chris goe 2 Testimonials added __NOTOC__ {| |- | nowrap style="vertical-align: top; font: bold xx-large sans-serif; " | Reiser4 (and ReiserFS) Wiki |} <!-- Welcome Block --> <div style="margin:0; margin-top:10px; margin-right:10px; border:1px solid #dfdfdf; padding:0 1em 1em 1em; background-color:#fff0e0; align:right;"> 
Welcome to the '''Reiser4 Wiki''', the Wiki for users and developers of the [[ReiserFS]] and [[Reiser4]] filesystems. NOTE: For now, most of the documentation is just a [http://web.archive.org/web/20070929195459/http://www.namesys.com/ snapshot of the old Namesys site] (archive.org, 2007-09-29). </div> {| width="100%" |- |style="vertical-align:top" | <!-- Documentation Block --> <div style="margin:0; margin-top:10px; margin-right:10px; border:1px solid #dfdfdf; padding:0 1em 1em 1em; background-color:#dcf5ff; align:right;"> === Documentation === * [[Reiser4_Howto | Getting started with ReiserFS/Reiser4]] * [[Frequently Asked Questions]] * [[Manpages]] * [[Publications | Articles and Publications]] * [[Benchmarks]] * [[Testimonials]] </div> <!-- Utilities Block : DO NOT EDIT HERE --> <div style="margin:0; margin-top:10px; margin-right:10px; border:1px solid #dfdfdf; padding:0 1em 1em 1em; background-color:#F8F8FF; align:right;"> === Utilities === * [[Reiser4progs|reiserfsprogs/reiser4progs]] * [[Filesystem Testing Tools]] </div> <!-- Development Block --> <div style="margin:0; margin-top:10px; margin-right:10px; border:1px solid #dfdfdf; padding:0 1em 1em 1em; background-color:#fffff0; align:right; "> === Development === * [[Reiser4 patchsets]] * [[Bugs]] * [[TODO list]] </div> | width="50%" style="vertical-align:top" | <!-- Wiki News Block : DO NOT EDIT HERE --> <div style="margin:0; margin-top:10px; border:1px solid #dfdfdf; padding: 0em 1em 1em 3em; background-color:#f0e0d0; align:left; text-indent:-2em;"> === News === <div style="font-size:small">{{Reiser4:News}} </div> </div> <!-- Further Information Block --> <div style="margin:0; margin-top:10px; border:1px solid #dfdfdf; padding: 0em 1em 1em 1em; background-color:#dfefdf; align:left; margin-top:10px"> === Further Information === * [[mailinglists|Mailing Lists ]] * [[IRC|IRC Channels ]] </div> |} 1f2b1f7bb033206b87fedb798b8da4fc1fe42779 1337 1324 2009-06-25T07:56:58Z Chris goe 2 Benchmarks added __NOTOC__ {| 
|- | nowrap style="vertical-align: top; font: bold xx-large sans-serif; " | Reiser4 (and ReiserFS) Wiki |} <!-- Welcome Block --> <div style="margin:0; margin-top:10px; margin-right:10px; border:1px solid #dfdfdf; padding:0 1em 1em 1em; background-color:#fff0e0; align:right;"> Welcome to the '''Reiser4 Wiki''', the Wiki for users and developers of the [[ReiserFS]] and [[Reiser4]] filesystems. NOTE: For now, most of the documentation is just a [http://web.archive.org/web/20070929195459/http://www.namesys.com/ snapshot of the old Namesys site] (archive.org, 2007-09-29). </div> {| width="100%" |- |style="vertical-align:top" | <!-- Documentation Block --> <div style="margin:0; margin-top:10px; margin-right:10px; border:1px solid #dfdfdf; padding:0 1em 1em 1em; background-color:#dcf5ff; align:right;"> === Documentation === * [[Reiser4_Howto | Getting started with ReiserFS/Reiser4]] * [[Frequently Asked Questions]] * [[Publications | Articles and Publications]] * [[Benchmarks]] * [[Manpages]] </div> <!-- Utilities Block : DO NOT EDIT HERE --> <div style="margin:0; margin-top:10px; margin-right:10px; border:1px solid #dfdfdf; padding:0 1em 1em 1em; background-color:#F8F8FF; align:right;"> === Utilities === * [[Reiser4progs|reiserfsprogs/reiser4progs]] * [[Filesystem Testing Tools]] </div> <!-- Development Block --> <div style="margin:0; margin-top:10px; margin-right:10px; border:1px solid #dfdfdf; padding:0 1em 1em 1em; background-color:#fffff0; align:right; "> === Development === * [[Reiser4 patchsets]] * [[Bugs]] * [[TODO list]] </div> | width="50%" style="vertical-align:top" | <!-- Wiki News Block : DO NOT EDIT HERE --> <div style="margin:0; margin-top:10px; border:1px solid #dfdfdf; padding: 0em 1em 1em 3em; background-color:#f0e0d0; align:left; text-indent:-2em;"> === News === <div style="font-size:small">{{Reiser4:News}} </div> </div> <!-- Further Information Block --> <div style="margin:0; margin-top:10px; border:1px solid #dfdfdf; padding: 0em 1em 1em 1em; 
background-color:#dfefdf; align:left; margin-top:10px"> === Further Information === * [[mailinglists|Mailing Lists ]] * [[IRC|IRC Channels ]] </div> |} 484e4d55b2057862783114317cfd3952da8b22ff 1324 1321 2009-06-25T07:27:18Z Chris goe 2 manpages added __NOTOC__ {| |- | nowrap style="vertical-align: top; font: bold xx-large sans-serif; " | Reiser4 (and ReiserFS) Wiki |} <!-- Welcome Block --> <div style="margin:0; margin-top:10px; margin-right:10px; border:1px solid #dfdfdf; padding:0 1em 1em 1em; background-color:#fff0e0; align:right;"> Welcome to the '''Reiser4 Wiki''', the Wiki for users and developers of the [[ReiserFS]] and [[Reiser4]] filesystems. NOTE: For now, most of the documentation is just a [http://web.archive.org/web/20070929195459/http://www.namesys.com/ snapshot of the old Namesys site] (archive.org, 2007-09-29). </div> {| width="100%" |- |style="vertical-align:top" | <!-- Documentation Block --> <div style="margin:0; margin-top:10px; margin-right:10px; border:1px solid #dfdfdf; padding:0 1em 1em 1em; background-color:#dcf5ff; align:right;"> === Documentation === * [[Reiser4_Howto | Getting started with ReiserFS/Reiser4]] * [[Frequently Asked Questions]] * [[Publications | Articles and Publications]] * [[Manpages]] </div> <!-- Utilities Block : DO NOT EDIT HERE --> <div style="margin:0; margin-top:10px; margin-right:10px; border:1px solid #dfdfdf; padding:0 1em 1em 1em; background-color:#F8F8FF; align:right;"> === Utilities === * [[Reiser4progs|reiserfsprogs/reiser4progs]] * [[Filesystem Testing Tools]] </div> <!-- Development Block --> <div style="margin:0; margin-top:10px; margin-right:10px; border:1px solid #dfdfdf; padding:0 1em 1em 1em; background-color:#fffff0; align:right; "> === Development === * [[Reiser4 patchsets]] * [[Bugs]] * [[TODO list]] </div> | width="50%" style="vertical-align:top" | <!-- Wiki News Block : DO NOT EDIT HERE --> <div style="margin:0; margin-top:10px; border:1px solid #dfdfdf; padding: 0em 1em 1em 3em; 
background-color:#f0e0d0; align:left; text-indent:-2em;"> === News === <div style="font-size:small">{{Reiser4:News}} </div> </div> <!-- Further Information Block --> <div style="margin:0; margin-top:10px; border:1px solid #dfdfdf; padding: 0em 1em 1em 1em; background-color:#dfefdf; align:left; margin-top:10px"> === Further Information === * [[mailinglists|Mailing Lists ]] * [[IRC|IRC Channels ]] </div> |} 9c437a25830a9a4a4d2a70a90d2608670a2633b9 1321 1320 2009-06-25T07:23:05Z Chris goe 2 junk removed __NOTOC__ {| |- | nowrap style="vertical-align: top; font: bold xx-large sans-serif; " | Reiser4 (and ReiserFS) Wiki |} <!-- Welcome Block --> <div style="margin:0; margin-top:10px; margin-right:10px; border:1px solid #dfdfdf; padding:0 1em 1em 1em; background-color:#fff0e0; align:right;"> Welcome to the '''Reiser4 Wiki''', the Wiki for users and developers of the [[ReiserFS]] and [[Reiser4]] filesystems. NOTE: For now, most of the documentation is just a [http://web.archive.org/web/20070929195459/http://www.namesys.com/ snapshot of the old Namesys site] (archive.org, 2007-09-29). 
</div> {| width="100%" |- |style="vertical-align:top" | <!-- Documentation Block --> <div style="margin:0; margin-top:10px; margin-right:10px; border:1px solid #dfdfdf; padding:0 1em 1em 1em; background-color:#dcf5ff; align:right;"> === Documentation === * [[Reiser4_Howto | Getting started with ReiserFS/Reiser4]] * [[Frequently Asked Questions]] * [[publications|Articles and Publications ]] * [[:Category:Glossary|Glossary]] </div> <!-- Utilities Block : DO NOT EDIT HERE --> <div style="margin:0; margin-top:10px; margin-right:10px; border:1px solid #dfdfdf; padding:0 1em 1em 1em; background-color:#F8F8FF; align:right;"> === Utilities === * [[Reiser4progs|reiserfsprogs/reiser4progs]] * [[Filesystem Testing Tools]] </div> <!-- Development Block --> <div style="margin:0; margin-top:10px; margin-right:10px; border:1px solid #dfdfdf; padding:0 1em 1em 1em; background-color:#fffff0; align:right; "> === Development === * [[Reiser4 patchsets]] * [[Bugs]] * [[TODO list]] </div> | width="50%" style="vertical-align:top" | <!-- Wiki News Block : DO NOT EDIT HERE --> <div style="margin:0; margin-top:10px; border:1px solid #dfdfdf; padding: 0em 1em 1em 3em; background-color:#f0e0d0; align:left; text-indent:-2em;"> === News === <div style="font-size:small">{{Reiser4:News}} </div> </div> <!-- Further Information Block --> <div style="margin:0; margin-top:10px; border:1px solid #dfdfdf; padding: 0em 1em 1em 1em; background-color:#dfefdf; align:left; margin-top:10px"> === Further Information === * [[mailinglists|Mailing Lists ]] * [[IRC|IRC Channels ]] </div> |} 627fdff586cb2babfd701059f2d3ed9b5b0b4b14 1320 1317 2009-06-25T07:22:37Z Chris goe 2 junk removed __NOTOC__ {| |- | nowrap style="vertical-align: top; font: bold xx-large sans-serif; " | Reiser4 (and ReiserFS) Wiki |} <!-- Welcome Block --> <div style="margin:0; margin-top:10px; margin-right:10px; border:1px solid #dfdfdf; padding:0 1em 1em 1em; background-color:#fff0e0; align:right;"> Welcome to the '''Reiser4 Wiki''', the 
Wiki for users and developers of the [[ReiserFS]] and [[Reiser4]] filesystems. NOTE: For now, most of the documentation is just a [http://web.archive.org/web/20070929195459/http://www.namesys.com/ snapshot of the old Namesys site] (archive.org, 2007-09-29). </div> {| width="100%" |- |style="vertical-align:top" | <!-- Documentation Block --> <div style="margin:0; margin-top:10px; margin-right:10px; border:1px solid #dfdfdf; padding:0 1em 1em 1em; background-color:#dcf5ff; align:right;"> === Documentation === * [[Reiser4_Howto | Getting started with ReiserFS/Reiser4]] * [[Frequently Asked Questions]] * [[publications|Articles and Publications ]] * [[:Category:Glossary|Glossary]] </div> <!-- Utilities Block : DO NOT EDIT HERE --> <div style="margin:0; margin-top:10px; margin-right:10px; border:1px solid #dfdfdf; padding:0 1em 1em 1em; background-color:#F8F8FF; align:right;"> === Utilities === * [[Reiser4progs|reiserfsprogs/reiser4progs]] * [[Filesystem Testing Tools]] </div> <!-- Development Block --> <div style="margin:0; margin-top:10px; margin-right:10px; border:1px solid #dfdfdf; padding:0 1em 1em 1em; background-color:#fffff0; align:right; "> === Development === * [[Reiser4 patchsets]] * [[Bugs]] * [[TODO list]] </div> | width="50%" style="vertical-align:top" | <!-- Wiki News Block : DO NOT EDIT HERE --> <div style="margin:0; margin-top:10px; border:1px solid #dfdfdf; padding: 0em 1em 1em 3em; background-color:#f0e0d0; align:left; text-indent:-2em;"> === News === <div style="font-size:small">{{Reiser4:News}} </div> </div> <!-- Further Information Block --> <div style="margin:0; margin-top:10px; border:1px solid #dfdfdf; padding: 0em 1em 1em 1em; background-color:#dfefdf; align:left; margin-top:10px"> === Further Information === * [[mailinglists|Mailing Lists ]] *[[links|Links to other ReiserFS/Reiser4 resources on the web]] * [[IRC|IRC Channels ]] </div> |} fa912487db6c71cc12e2edab2510a7eece181837 1317 1312 2009-06-25T07:19:51Z Chris goe 2 __NOTOC__ {| |- | 
Manpages 0 21 2491 1693 2012-09-25T17:36:11Z Chris goe 2

= reiser4 =
* [[debugfs.reiser4]]
* [[fsck.reiser4]]
* [[measurefs.reiser4]]
* [[mkfs.reiser4]]
* [[mount4|mount options]]

= reiserfs =
* [[debugreiserfs]]
* [[mkreiserfs]]
* [[reiserfsck]]
* [[reiserfstune]]
* [[resize_reiserfs]]
* [[mount|mount options]]

[[category:Reiser4]] [[category:ReiserFS]]
Measurefs.reiser4 0 85 1681 1677 2010-02-10T11:33:45Z Chris goe 2

=== NAME ===
measurefs.reiser4 - a program for measuring reiser4 filesystem parameters (fragmentation, node packing, etc.).

=== SYNOPSIS ===
measurefs.reiser4 [ options ] FILE

=== DESCRIPTION ===
measurefs.reiser4 is the reiser4 filesystem measurement program. It can be used to estimate reiser4 filesystem fragmentation, node packing, and other structural parameters.

=== COMMON OPTIONS ===
; -V, --version
: prints the program version.
; -?, -h, --help
: prints the program help.
; -y, --yes
: assumes an answer 'yes' to all questions.
; -f, --force
: forces measurefs to use the whole disk, not a block device or mounted partition.
; -c, --cache N
: sets the tree cache node count to the given value. This strongly affects the behavior of libreiser4: speed, tree allocation, etc.

=== MEASUREMENT OPTIONS ===
; -S, --tree-stat
: shows various tree statistics (node packing, internal nodes, leaves, etc.).
; -T, --tree-frag
: measures total tree fragmentation. The result is a fragmentation factor: a value from 0.00000 (minimal fragmentation) to 1.00000 (maximal). This factor is likely to affect sequential read performance.
; -D, --data-frag
: measures average file fragmentation: the fragmentation of each file in the filesystem is measured separately and the results are averaged.
: The result is a fragmentation factor: a value from 0.00000 (minimal fragmentation) to 1.00000 (maximal). Note that for a freshly created filesystem, even one completely filled with data, this value will be quite small.
; -F, --file-frag FILE
: measures the fragmentation of the specified file. The result is a fragmentation factor: a value from 0.00000 (minimal fragmentation) to 1.00000 (maximal). Note that the fragmentation of a small file made up of tail items (depending on the tail policy in use) is not a very reliable value, because tail items are constantly moved around by tree balancing.
; -E, --show-file
: shows per-file fragmentation when --data-frag is specified.

=== PLUGIN OPTIONS ===
; -p, --print-profile
: prints the plugin profile. This is the set of default plugins used for all parts of a filesystem: format, nodes, files, directories, hashes, etc. If --override is specified, the modified plugins are printed.
; -l, --print-plugins
: prints all plugins libreiser4 knows about.
; -o, --override TYPE=PLUGIN, ...
: overrides the default plugin of type TYPE with the plugin PLUGIN in the plugin profile.

=== EXAMPLES ===
 measurefs.reiser4 -o nodeptr=nodeptr41,hash=rupasov_hash /dev/hda2
 measurefs.reiser4 -F /usr/bin /dev/hda2
 measurefs.reiser4 -F /bin/bash /dev/hda2

=== REPORTING BUGS ===
Report bugs to {{listaddress}}

=== SEE ALSO ===
* [[debugfs.reiser4|debugfs.reiser4(8)]]
* [[mkfs.reiser4|mkfs.reiser4(8)]]
* [[fsck.reiser4|fsck.reiser4(8)]]

=== AUTHOR ===
This manual page was written by Yury Umanets <umka@namesys.com>

[[category:Reiser4]]
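The fragmentation factor lends itself to simple scripting. Below is a minimal, hypothetical sketch of acting on a factor obtained from <code>measurefs.reiser4 -T</code>; the factor value and the 0.30 threshold are made up for illustration, and POSIX shell delegates the floating-point comparison to awk.

```shell
#!/bin/sh
# Hypothetical fragmentation check. In practice the factor would be
# parsed from the output of:  measurefs.reiser4 -T /dev/hda2
frag=0.01234        # made-up factor in the documented 0.00000..1.00000 range
threshold=0.30      # made-up "needs attention" cutoff

# POSIX sh cannot compare floats; awk's exit status carries the verdict
# (exit 0 means "above threshold" here, so the if-branch fires on it).
if awk -v f="$frag" -v t="$threshold" 'BEGIN { exit !(f > t) }'; then
    echo "high fragmentation: $frag"
else
    echo "fragmentation ok: $frag"
fi
```

A periodic job could run such a check per volume and flag filesystems whose tree fragmentation has drifted upward since creation.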
=== SYNOPSIS === measurefs.reiser4 [ options ] FILE === DESCRIPTION === measurefs.reiser4 is reiser4 filesystem measure program. You can estimate reiser4 filesystem fragmentation, packingm etc. structures by using it. === COMMON OPTIONS === -V, --version prints program version. -?, -h, --help prints program help. -y, --yes assumes an answer ’yes’ to all questions. -f, --force forces measurefs to use whole disk, not block device or mounted partition. -c, --cache N sets tree cache node number to passed value. This affects very much behavior of libreiser4. It affects speed, tree allocation, etc. === MEASUREMENT OPTIONS === -S, --tree-stat shows different tree statistics (node packing, internal nodes, leaves, etc) -T, --tree-frag measures total tree fragmentation. The result is fragmentation factor - value from 0.00000 (minimal fragmentation) to 1.00000 (maximal one). Most probably, this factor may affect sequential read performance. -D, --data-frag measures average files fragmentation. This means, that fragmentation of each file in filesystem will be measured separately and results will be averaged. The result is fragmentation factor - value from 0.00000 (minimal fragmentation) to 1.00000 (maximal one). Note, that for the fresh filesystem (created not very long time ago) and even fully filled by data, this value will be pretty small. -F, --file-frag FILE measures fragmentation of the specified file. The result is fragmentation factor - value from 0.00000 (minimal fragmentation) to 1.00000 (maximal one). Note, that fragmentation of a small file (depends of used tail policy), which consists of tail items, is not very reliable value. That is because, they is always afoot due to balancing. -E, --show-file show file fragmentation for each file if --data-frag is specified. === PLUGIN OPTIONS === -p, --print-profile prints the plugin profile. This is the set of default plugins used for all parts of a filesystem -- format, nodes, files, directories, hashes, etc. 
If --override is specified, then prints modified plugins. -l, --print-plugins prints all plugins libreiser4 know about. -o, --override TYPE=PLUGIN, ... overrides the default plugin of the type "TYPE" by the plugin "PLUGIN" in the plugin profile. === EXAMPLES === measurefs.reiser4 -o nodeptr=nodeptr41,hash=rupasov_hash /dev/hda2 measurefs.reiser4 -F /usr/bin /dev/hda2 measurefs.reiser4 -F /bin/bash /dev/hda2 === REPORTING BUGS === Report bugs to {{listaddress}} === SEE ALSO === * [[debugfs.reiser4|debugfs.reiser4(8)]] * [[mkfs.reiser4|mkfs.reiser4(8)]] * [[fsck.reiser4|fsck.reiser4(8)]] === AUTHOR === This manual page was written by Yury Umanets <umka@namesys.com> 5c00522cb5e3c708cae3c7c81eced5e6c74e6355 Mkfs.reiser4 0 86 4071 1684 2015-08-30T16:28:20Z Edward 4 add "-d, --discard" mkfs option to the manpage === NAME === mkfs.reiser4 - the program for creating reiser4 filesystems === SYNOPSIS === mkfs.reiser4 [ options ] FILE1 FILE2 ... [ size[K|M|G] ] === DESCRIPTION === mkfs.reiser4 is reiser4 filesystem creation program. It is based on new libreiser4 library. Since libreiser4 is fully plugin-based, we have the potential to create not just reiser4 partitions, but any filesystem or database format, which is based on balanced trees. === COMMON OPTIONS === -V, --version prints program version. -?, -h, --help prints program help. -y, --yes assumes an answer ’yes’ to all questions. -f, --force forces mkfs to use whole disk, not block device or mounted partition. === MKFS OPTIONS === -b, --block-size N block size to be used (architecture page size by default) -L, --label LABEL volume label to be used -U, --uuid UUID universally unique identifier to be used -s, --lost-found forces mkfs to create lost+found directory. -d, --discard tells mkfs to discard given device before creating the filesystem (for solid state drives). === PLUGIN OPTIONS === -p, --print-profile prints the plugin profile. 
This is the set of default plugins used for all parts of a filesystem -- format, nodes, files, directories, hashes, etc. If --override is specified, prints the modified plugins.

-l, --print-plugins prints all plugins libreiser4 knows about.

-o, --override TYPE=PLUGIN, ... overrides the default plugin of type "TYPE" with the plugin "PLUGIN" in the plugin profile.

=== EXAMPLES ===
Assign the short key plugin to the "key" field in order to create a filesystem with the short keys policy:
 mkfs.reiser4 -yf -o key=key_short /dev/hda2

=== REPORTING BUGS ===
Report bugs to {{listaddress}}

=== SEE ALSO ===
* [[measurefs.reiser4|measurefs.reiser4(8)]]
* [[debugfs.reiser4|debugfs.reiser4(8)]]
* [[fsck.reiser4|fsck.reiser4(8)]]

=== AUTHOR ===
This manual page was written by Yury Umanets <umka@namesys.com>

[[category:Reiser4]]

Mkreiserfs

=== NAME ===
mkreiserfs - The tool to create a [[ReiserFS]] filesystem.

=== SYNOPSIS ===
 mkreiserfs [ -dfV ] [ -b | --block-size N ] [ -h | --hash HASH ]
 [ -u | --uuid UUID ] [ -l | --label LABEL ] [ --format FORMAT ]
 [ -q | --quiet ] [ -j | --journal-device FILE ] [ -s | --journal-size N ]
 [ -o | --journal-offset N ] [ -t | --transaction-max-size N ]
 [ -B | --badblocks file ] ''device'' [ ''filesystem-size'' ]

=== DESCRIPTION ===
<tt>mkreiserfs</tt> creates a ReiserFS filesystem on a device (usually a disk partition). ''device'' is the special file corresponding to a device or to a partition (e.g. /dev/hdXX for an IDE disk partition or /dev/sdXX for a SCSI disk partition). ''filesystem-size'' is the size of the filesystem, in blocks.
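For instance, the size in blocks can be computed from the block size. A minimal sketch (the device name /dev/sdb1, the label, and the 1 GiB target are hypothetical, not from this page):

```shell
#!/bin/sh
# Hypothetical example: /dev/sdb1 is a placeholder for an unused partition.
# With the default 4096-byte block size, 1 GiB corresponds to 262144 blocks.
DEV=/dev/sdb1
BLOCKS=$((1024 * 1024 * 1024 / 4096))
echo "would run: mkreiserfs -f -l backup $DEV $BLOCKS"
# Remove the echo to actually format the device (destructive!).
```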
If omitted, <tt>mkreiserfs</tt> will automatically set it.

=== OPTIONS ===
-b | --block-size N N is the block size in bytes. It may only be set to a power of 2 within the 512-8192 interval.

-h | --hash HASH HASH specifies which hash function will sort the names in directories. Choose from r5, rupasov, or tea. r5 is the default.

--format FORMAT FORMAT specifies the format for the new filesystem. Choose format 3.5 or 3.6. If none is specified, mkreiserfs creates format 3.6 when running under a 2.4 or later kernel, format 3.5 under a 2.2 kernel, and refuses to create a filesystem under any other kernel.

-u | --uuid UUID Sets the Universally Unique IDentifier of the filesystem to UUID (see also [http://manpages.ubuntu.com/manpages/karmic/en/man1/uuidgen.1.html uuidgen(1)]). The format of the UUID is a series of hex digits separated by hyphens, e.g.: "c1b9d5a2-f162-11cf-9ece-0020afc76f16". If the option is skipped, mkreiserfs will generate a new UUID by default.

-l | --label LABEL Sets the volume label of the filesystem. LABEL can be at most 16 characters long; if it is longer, mkreiserfs will truncate it.

-q | --quiet Sets mkreiserfs to work quietly, without producing messages, progress output or questions. This is useful when mkreiserfs is run from a script.

-j | --journal-device FILE FILE is the name of the block device on which the filesystem journal is to be placed.

-o | --journal-offset N N is the offset at which the journal starts when it is on a separate device. The default is 0. N has no effect when the journal is on the host device.

-s | --journal-size N N is the size of the journal in blocks. When the journal is on a separate device, its size defaults to the number of blocks that device has. When the journal is on the host device, its size defaults to 8193 and the maximal possible size is 32749 (for a 4k block size).
The minimum size is 513 blocks (whether the journal is on the host or on a separate device).

-t | --transaction-max-size N N is the maximum transaction size parameter for the journal. The default, and maximum possible, value is 1024 blocks. It should be less than half the size of the journal. If specified incorrectly, it will be adjusted automatically.

-B | --badblocks file File is the name of a file containing the list of blocks to be marked as bad on the filesystem. This list can be created by [[FAQ/bad-block-handling|badblocks]] -b block-size device.

-f Forces mkreiserfs to continue even when the device is the whole disk, looks mounted, or is not a block device. If -f is specified more than once, mkreiserfs will not ask for confirmation.

-d Sets mkreiserfs to print debugging information while it runs.

-V Prints the version and then exits.

=== AUTHOR ===
This version of mkreiserfs has been written by Edward Shishkin <edward.shishkin@gmail.com>.

=== BUGS ===
Please report bugs to the ReiserFS developers {{listaddress}}, providing as much information as possible: your hardware, kernel, patches, settings, and all printed messages; check the syslog file for any related information.

=== SEE ALSO ===
* [[reiserfsck|reiserfsck(8)]]
* [[debugreiserfs|debugreiserfs(8)]]
* [[reiserfstune|reiserfstune(8)]]

[[category:ReiserFS]]

Mongo

Mongo is the main benchmark script we use for comparing [[ReiserFS]] variations. Untar the archive in a directory and read the Introduction to the Mongo Testsuites.

== Introduction to the Mongo Testsuites ==
Mongo is a set of programs to test Linux filesystems for performance and functionality. The main program is the <tt>mongo.pl</tt> script, which creates a set of statistics for the filesystem variations specified by special mongo options. The <tt>mongo_parser.pl</tt> script parses those statistics and creates a comparative HTML table from them.

== The <tt>mongo.pl</tt> script ==
 # ./mongo.pl opt11=val11 opt12=val12 ... \
 RUN [opt21=val21 opt22=val22 ... \
 RUN opt31=val31 opt32=val32 ... \
 RUN ... | @<file to include>],
where <tt>opt1j</tt> (j = 1, 2, ...) are the required (and possibly other) mongo options, and <tt>optij</tt> (i = 2, ...; j = 1, 2, ...) are further mongo options. The expression <tt>optij=valij</tt> means that the mongo option <tt>optij</tt> is given the value <tt>valij</tt>. Here is a description of the acceptable values of all mongo options:

=== Required mongo options ===
* FSTYPE - filesystem type (e.g. ext3)
* DEV - device file name (e.g. /dev/hda9)
* DIR - mount-point for the filesystem (e.g. /mnt/testfs)
* FILE_SIZE - file size in bytes (e.g. 10000) used in reiser_fract_tree; this is passed to the main generator function determine_size() (see below).
* BYTES - file set size in bytes (e.g. 250000000) created by all instances of reiser_fract_tree in one pass. To keep the results free from buffer cache influence, it has to satisfy the property <tt>BYTES * REP_COUNTER > ramsize</tt>.

=== Other mongo options ===
* MKFS - path to the executable that creates the filesystem under test (e.g. [[mkreiserfs]]). By default (if the filesystem is not reiserfs or ext2) mongo.pl tries to create it with the command <tt>mkfs.''filesystem_name''</tt>, so make sure that command is available.
* MOUNT_OPTIONS - list of [[mount|mount options]] separated as usual by commas (e.g. rw,notail).
* NPROC - number of processes running simultaneously (3 by default).
* REP_COUNTER - number of passes of each mongo phase (3 by default). Each mongo statistic is the average of REP_COUNTER results, so using REP_COUNTER > 1 reduces dispersion and improves the statistics.
* SYNC - this option takes one of two strings: "on"/"off" ("off" by default). "on" forces iozone to sync regular files in the create, copy, append, and modify phases.
* WRITE_BUFFER - read/write buffer size in bytes for the mongo utilities (4096 by default).
* GAMMA - the exponent of the core file size distribution of the random value generator determine_size() used in mongo_fract_tree() (see below). GAMMA values are in [0,1] (e.g. 0.2; the default value is 0.0).
* JOURNAL_DEV - journal device name. This option exists only for reiserfs with non-standard journal support. By default [[mkreiserfs]] creates the journal on the main device (DEV).
* JOURNAL_SIZE - journal size in blocks, including the journal header (e.g. 513). This option exists only for reiserfs with non-standard journal support. By default [[mkreiserfs]] creates a journal of standard size (8193).
* DD_MBCOUNT - size in megabytes of the large file to be read (written) by the <tt>dd(1)</tt> program.
If this option is specified, mongo executes the two special phases <tt>dd_reading_largefile</tt> and <tt>dd_writing_largefile</tt> (see the [[#What is Mongo doing?|mongo phases description]] below).

=== Special options ===
* LOG - the name of the file in which to store the statistics result tree that <tt>mongo.pl</tt> creates for each mongo run (see below). Regardless of this option, <tt>mongo.pl</tt> writes all the results to stdout, but we recommend specifying it for each file system variation you want to compare, as it will enable you to create a comparative HTML table with the <tt>mongo_parser.pl</tt> script.
* INFO_R4 - a string describing the benchmarked [[Reiser4]] version. This option is required if <tt>FSTYPE=reiser4</tt> is set.

=== Mongo phases settings options ===
(see the [[#What is Mongo doing?|mongo phases description]] below).
* PHASE_CREATE - setting for the [[#Create phase|create phase]]: <tt>on/off</tt> (<tt>on</tt> by default).
* PHASE_COPY - setting for the [[#Copy phase|copy phase]]. This option requires one of the following values: <tt>off/cp/list</tt>. In "cp" mode <tt>cp(1)</tt> is invoked to copy files. In "list" mode (default) [[#Copy phase|mongo_copy]] is used to copy files. See <tt>mongo_copy.c</tt> for details.
* PHASE_APPEND - setting for the [[#Append phase|append phase]]: <tt>on/off</tt> (<tt>on</tt> by default).
* PHASE_MODIFY - setting for the [[#Modify phase|modify phase]]: <tt>on/off</tt> (<tt>on</tt> by default).
* PHASE_OVERWRITE - setting for the [[#Overwrite phase|overwrite phase]]: <tt>on/off</tt> (<tt>on</tt> by default).
* PHASE_READ - setting for the [[#Read phase|read phase]]. The required values are <tt>off/find/list</tt>. In <tt>"find"</tt> mode, <tt>find(1)</tt> is used to read the files. In <tt>"list"</tt> mode (default) [[#Read phase|mongo_read]] is used. See <tt>mongo_read.c</tt> for details.
* PHASE_STATS - setting for the [[#Stats phase|stats phase]]: <tt>on/off</tt> (<tt>on</tt> by default).
* PHASE_DELETE - setting for the [[#Delete phase|delete phase]].
This option requires one of the following values: <tt>off/rm/list</tt>. In "rm" mode <tt>rm(1)</tt> is used to delete the working file set. In "list" mode (default) [[#Delete phase|mongo_delete]] is used. See <tt>mongo_delete.c</tt> for details.

=== Special required command ===
* RUN - defines one mongo run (while the whole command string defines one mongo session), which starts all default and possibly some special mongo phases (see [[#What is Mongo doing?|below]]) defined by the options specified before this command. The mongo options keep their values (specified or default) for the whole <tt>mongo.pl</tt> session unless you respecify them. Example:
 # ./mongo.pl LOG=/tmp/logfile1 file_size=10000 \
   bytes=10000000 fstype=reiserfs dev=/dev/hda9 \
   dir=/mnt/testfs RUN log=/tmp/logfile2 \
   mount_options=notail RUN
* <file_to_include> - We recommend specifying all the mongo options you want in one file instead of on the command line, since a file is more convenient to edit than the command string. Each specification must occupy one line in this file. For example, the previous command can be rewritten, if you place all the options up to the first "RUN" in the file <tt>"mongo.opts"</tt>, as:
 # ./mongo.pl log=/tmp/logfile1 @mongo.opts \
   log=/tmp/logfile2 mount_options=notail RUN

<tt>mongo.pl</tt> executes one or more mongo runs defined by the specified options. For each run mongo.pl creates a tree of mongo statistics (the statistics result tree).

'''WARNING: <tt>mongo.pl</tt> will format each specified device DEV with <tt>mkfs.xxx</tt> and mount it at the MNT directory.'''

== The <tt>mongo_parser.pl</tt> script ==

 # ./mongo_parser.pl log1 [log2 log3 ...] > comparative_table.html

where <tt>log1, log2, log3, ...</tt> are the names of files containing statistics result trees created by mongo.pl. Each of those files should contain only one statistics result tree.

'''WARNING: The result trees of all specified files file1, file2, file3, ...
must be mutually phase-isomorphic.'''

Example: The result trees of logfile1, logfile2 from the example above are phase-isomorphic. On the other hand, specifying log1, log2, log3 from the following example is not allowed, since the result trees of log2, log3 are non-isomorphic (different file_size):
 ./mongo.pl log=log1 file_size=10000 bytes=10000000 \
   fstype=reiserfs dev=/dev/hda9 dir=/mnt/testfs RUN \
   log=log2 mount_options=notail RUN log=log3 file_size=20000

<tt>mongo_parser.pl</tt> creates a comparative HTML table of the specified result trees.

== What is Mongo doing? ==

For each run Mongo executes 8 default and possibly some special phases. In each phase Mongo runs NPROC processes (the parent with (NPROC - 1) children) defined by the appropriate mongo utility and creates a set of mongo statistics. Currently mongo supports three kinds of statistics: REAL_TIME, CPU_TIME, and DF. REAL_TIME and CPU_TIME are timing statistics for the run of the specified number (NPROC) of processes of the appropriate phase. REAL_TIME is the elapsed real time (in seconds) between invocation and termination. CPU_TIME is the system CPU time (in CPU-seconds): the sum of the tms_stime and tms_cstime values in a struct tms as returned by times(2). DF is the space usage statistic of the specified device DEV. For default phases DF means disk space usage in bytes after all the previous phases including the current one. For the special dd_writing_largefile and dd_reading_largefile phases DF means the size in bytes of the file created during those phases. The default mongo phases model basic user processes which use the file API. To run the special mongo phases you must specify the special phase-options. Currently Mongo supports 8 default and 2 special phases.
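The timing measurements described above can be sketched as follows. This is a hypothetical Python helper, not part of mongo itself (which is written in Perl and C): REAL_TIME is elapsed wall-clock time, and CPU_TIME follows the times(2) definition, summing the system CPU time of the measuring process and its children.

```python
import os
import subprocess
import time

def run_phase(commands):
    """Run one benchmark phase as concurrent child processes and measure it
    the way the text defines: returns (real_time, cpu_time), where real_time
    is elapsed wall-clock seconds and cpu_time is system CPU seconds
    (tms_stime + tms_cstime, cf. times(2) / os.times())."""
    t0 = os.times()
    start = time.monotonic()
    # one child per NPROC slot; commands are shell strings as mongo would run
    procs = [subprocess.Popen(cmd, shell=True) for cmd in commands]
    for p in procs:
        p.wait()
    real_time = time.monotonic() - start
    t1 = os.times()
    cpu_time = ((t1.system - t0.system) +
                (t1.children_system - t0.children_system))
    return real_time, cpu_time
```

For example, <tt>run_phase(["sync"] * 3)</tt> would time three concurrent sync processes, analogous to one mongo phase with NPROC=3.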
Each phase is defined by the appropriate mongo utility:

=== Create phase ===

The reiser_fract_tree program creates files in a tree of random depth and branching (optionally fsyncing each file):
 # ./reiser_fract_tree <bytes_to_consume> <median_file_size> \
   <max_file_size> <median_dir_nr_files> <max_directory_nr_files> \
   <median_dir_branching> <max_dir_branching> <write_buffer_size> \
   <testfs_mount_point> <print_stats_flag> <max_fname> <flist_name> \
   <sync_flag> <gamma_exponent>

Files vary in size randomly according to the core file size generator (<tt>off_t determine_size( off_t F, off_t max_size)</tt>) used in reiser_fract_tree. This generator is constructed from random variables that have uniform distributions; see fig. 1: [[image:file_size_dist.png]]

FIGURE 1: The distribution function of the main generator <tt>determine_size()</tt>.

Each such variable is obtained by mapping the standard GNU pseudo-random generator <tt>rand()</tt>, defined on <tt>[0, RAND_MAX]</tt>, onto <tt>[A, B]</tt> for suitable A, B, using the high-order bits. The file sizes of the first ''uniform chunk'' are in <tt>[0, F]</tt>, and <tt>P(file_size in [0, F]) = 1-gamma</tt>. The probability mass (square) of each following ''uniform chunk'' depends exponentially on its number with exponent <tt>gamma</tt>, and the size of the stride depends exponentially on its number with exponent ''scale'' (we use <tt>scale=10</tt>). <tt>F</tt> is the range of the first ''uniform chunk'' in bytes (the value of the option <tt>FILE_SIZE</tt> in <tt>mongo.pl</tt>).

Median file size is hypothesized to be proportional to the average per-file space wastage. (Notice how that implies that, with a more efficient filesystem, file size usage patterns will in the long term move to a lower median file size.) A file has a maximum size of <tt>max_file_size</tt>. Directories vary in size according to the same distribution function, but with separate parameters to control both the median and maximum number of files within them, and the number of subdirectories within them.
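The file-size distribution described above can be sketched as follows. This is a Python reconstruction from the prose description, not the original C generator in reiser_fract_tree; the chunk-selection logic and parameter handling are assumptions made for illustration.

```python
import random

SCALE = 10  # stride growth factor between uniform chunks; the text uses scale=10

def determine_size(F, max_size, gamma=0.0, rng=random):
    """Hypothetical sketch of the core file-size generator: with probability
    1-gamma the size is uniform in [0, F]; otherwise the size falls in one of
    the following 'uniform chunks', whose strides widen by a factor of SCALE
    and which are reached with geometrically decreasing probability (exponent
    gamma). Results are capped at max_size."""
    if rng.random() < 1.0 - gamma:
        return rng.randint(0, F)
    lo, hi = F, F * SCALE
    # walk out through exponentially wider chunks with decreasing probability
    while hi * SCALE <= max_size and rng.random() < gamma:
        lo, hi = hi, hi * SCALE
    return min(rng.randint(lo, hi), max_size)
```

With <tt>gamma=0.0</tt> this degenerates to a uniform distribution on [0, F], matching the default GAMMA value of the <tt>mongo.pl</tt> option.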
This program prunes some empty subdirectories in a manner that causes the parents of leaf directories to branch less than the <tt>median_dir_branching</tt>. To avoid having one large file distort the results such that you have to benchmark many times, set max_file_size to not more than <tt>bytes_to_consume/10</tt>. If <tt>maximum/median</tt> is a small integer, then randomness will be very poor. For isolating the performance consequences of design variations on particular file or directory size ranges, try setting their <tt>median_size</tt> and <tt>max_size</tt> both equal to the maximum size of the file size range you want to test. In order to provide the same conditions for the various file systems under test in the subsequent phases, <tt>mongo_fract_tree</tt> creates in <tt>/var/tmp</tt> a list of all files, sorted in the order they were created in.

=== Copy phase ===

The <tt>mongo_copy</tt> program copies the files created by reiser_fract_tree in the specified order (optionally fsyncing each new file). The order and the files are specified by flist:
 # ./mongo_copy <source_dir> <dest_dir> <writebuffer_size> <flist> <sync_flag>

=== Append phase ===

The <tt>mongo_append</tt> program reads filenames from stdin and appends (filesize * append_factor) bytes to each file, optionally fsyncing it:
 # ./mongo_append <append_factor> <writebuffer_size> <sync_flag>

=== Modify phase ===

The <tt>mongo_modify</tt> program reads filenames from stdin and modifies (filesize * modify_factor) bytes of each file starting at a random position, optionally fsyncing it:
 # ./mongo_modify <modify_factor> <writebuffer_size> <sync_flag>

=== Overwrite phase ===

This phase uses the <tt>mongo_modify</tt> program with modify_factor = 1, so it modifies filesize bytes, i.e. overwrites (and optionally <tt>fsync()</tt>s) each file.

=== Read phase ===

The <tt>mongo_read</tt> program reads the files created by mongo_fract_tree in the specified order.

=== Stats phase ===

We do <tt>find -type f</tt> on the partition under test.
Zam believes that this should be enough to stat all files.

=== Delete phase ===

We do <tt>rm -r</tt> on all files and directories.

=== dd_writing_largefile phase ===

This is a special mongo phase which requires the option DD_MBCOUNT to be specified. We do <tt>dd if=/dev/zero of=DIR/largefile bs=1M count=DD_MBCOUNT</tt>.

=== dd_reading_largefile phase ===

This is a special mongo phase which requires the option DD_MBCOUNT to be specified. We do <tt>dd if=DIR/largefile of=/dev/null bs=1M count=DD_MBCOUNT</tt>.

Look at the source code if you need more information than this introduction contains.

== Mongo output ==

The main purpose of Mongo is to compare file system variations. The following mongo options (fs-options) specify these variations: SYSTEM, FSTYPE, DEV, DIR, MOUNT_OPTIONS, SYNC, JOURNAL_DEV, JOURNAL_SIZE. Note that SYSTEM is a "fake" fs-option which names the kernel version that mongo was run under. For example: SYSTEM = linux-2.4.19-rc1+01-relocation4.patch+02-commit_super-8-relocation.patch+03-data-logging-24.patch.

For the same file system variation Mongo defines one or more phase variations by the following mongo phase-options: REP_COUNTER, WRITE_BUFFER, GAMMA, FILE_SIZE, BYTES, DD_MBCOUNT. These options specify the parameters which are passed to the mongo utilities. For each file system variation <tt>mongo.pl</tt> prepares statistics for one or more phase variations.

The <tt>mongo_parser.pl</tt> script prepares a comparative table for one or more file system variations. First, prepare the appropriate mongo output files for each variation using the <tt>mongo.pl</tt> script. Make sure that all these files contain phase-isomorphic result trees (see above). Then pass these filenames to <tt>mongo_parser.pl</tt> (see the usage above), which will create a comparative HTML table (on stdout by default). The file system variations are represented in this table by columns of statistics marked with the letters A, B, C, etc.,
in the order they were specified in <tt>mongo_parser.pl</tt>. The header of this table contains specifications of each this variation as the set of the same mongo fs-options which have different values. Absence of any fs-option means it was specified by default value. If this table represents more then one file system variations, we assume by default that A is main, and B, C, ... is A-relative variations. It means that all the statistics of B, C,... are divided on the appropriate statistics of A. The options specified by identical values for all the file system variations locates in special header. The statistics of each phase variation are specified by subheading (numerated by #1, #2, ...). [https://web.archive.org/web/20061129061537/http://www.namesys.com/benchmarks/journal_relocation_to_NVRAM.html Table I. Mongo comparative results for reiserfs variations (standard journal, journal on external device (NVRAM))] (archive.org, 2006-11-29) [[category:ReiserFS]] [[category:Reiser4]] 2d93246bdfe40a1f899fe822224618a90ba04b0f 4321 1544 2019-04-16T08:57:53Z Chris goe 2 link to archive.org instead Mongo is the main benchmark script we use for comparing [[ReiserFS]] variations. Untar the [http://nerdbynature.de/bits/thebsh/benchmarks/dist/ archive] in a directory and read the Introduction to Mongo Testsuites. == Introduction to the Mongo Testsuites == Mongo is a set of the programs to test linux filesystems for performance and functionality. The main program is <tt>mongo.pl</tt> script which creates the set of statistics for the file system variations specified by special mongo options. The <tt>mongo_parser.pl</tt> script parses those statistics and creates for them comparative html-table. == The <tt>mongo.pl</tt> script == # ./mongo.pl opt11=val11 opt12=val12 ... \ RUN [opt21=val21 opt22=val22 ... \ RUN opt31=val31 opt32=val32 ... \ RUN ... | @<file to include>], where <tt>opt1j</tt> (j = 1, 2, ...) 
are required and maybe another mongo options, <tt>optij</tt> (i = 2, ...; j = 1, 2, ...) - mongo options. The expression <tt>optij=valij</tt> means that mongo option <tt>optij</tt> was specified by the value <tt>valij</tt>. Here is a description of acceptable values of all mongo options: === Required mongo options === * FSTYPE - filesystem type (e.g. ext3) * DEV - device file name (e.g. /dev/hda9) * DIR - mount-point for the filesystem (e.g. /mnt/testfs) * FILE_SIZE - file size in bytes (e.g. 10000) used in reiser_fract_tree, this is passed to the main generator function determine_size() (see below). * BYTES - file set size in bytes (e.g. 250000000) created by all instances of reiser_fract_tree in one pass. To have results free from buffer cache influence, it has to satisfy to the property: <tt>BYTES * REP_COUNTER > ramsize</tt>. === Other mongo options === * MKFS - path to the executable file that creates testing filesystem (e.g. [[mkreiserfs]]). By default (if it is not reiserfs or ext2) mongo.pl tries to create it by the command <tt>mkfs.''filesystem_name''</tt>, so make sure it is available. * MOUNT_OPTIONS - list of [[mount|mount options]] separated as usual by commas (e.g. rw,notail). * NPROC - number of processes running simultaneously (3 by default). * REP_COUNTER - number of passes of each mongo phase (3 by default). Each mongo statistics is an average value of REP_COUNTER results. So using REP_COUNTER > 1 reduces dispersion and improves mongo statistics. * SYNC - this option requires one of two strings :"on"/"off" ("off" by default). "on" means forcing of syncing to iozone of regular files in create, copy, append, modify phases. * WRITE_BUFFER - read/write buffer size in bytes for mongo utilities (4096 by default). * GAMMA - the exponent of the core file size distribution of the random value generator determine_size() used in mongo_fract_tree() (see below). GAMMA values are in [0,1] (e.g. 0.2, default value is 0.0). * JOURNAL_DEV - journal device name. 
This is an option only for reiserfs with non-standard journal support. By default [[mkreiserfs]] creates journal on main device (DEV). * JOURNAL_SIZE - journal size in blocks including journal header (e.g. 513). This is an option only for reiserfs with non-standard journal support. By default [[mkreiserfs]] creates journal of standard size (8193). * DD_MBCOUNT - size in megabytes of the large file that we want to read (write) by <tt>dd(1)</tt> program. If this option specified mongo executes two special phases <tt>dd_reading_largefile</tt> and <tt>dd_writing_largefile</tt> (see [[#What is Mongo doing?|mongo phases description]] below). === Special options === * LOG - the name of the file where you wish to store statistics result tree that <tt>mongo.pl</tt> creates for each mongo run (see below). Regardless of this option, <tt>mongo.pl</tt> writes all the results into stdout, but we recommend specify it for each file system variations you want to compare, as it will enable you to create comparative html-table by <tt>mongo_parser.pl</tt> script. * INFO_R4 - string information the benchmarked [[Reiser4]] version about. This is required option if <tt>FSTYPE=reiser4</tt> is set. === Mongo phases settings options === (see [[#What is Mongo doing?|mongo phases description]] below). * PHASE_CREATE - setting for [[#Create phase|create phase]]: <tt>on/off</tt> (<tt>on</tt> by default). * PHASE_COPY - setting for [[#Copy phase|copy phase]]. This option requires one of the following values: <tt>off/cp/list"</tt>. In "cp" mode <tt>cp(1)</tt> is invoked to copy files. In "list" mode (deafult) uses [[#Copy phase|mongo_copy]] to copy files. See <tt>mongo_copy.c</tt> for details. * PHASE_APPEND - setting for [[#Append phase|append phase]]: <tt>on/off</tt> (<tt>on</tt> by default). * PHASE_MODIFY - setting for [[#Modify phase|modify phase]]: <tt>on/off</tt> (<tt>on</tt> by default). 
* PHASE_OVERWRITE - setting for [[#Overwrite phase|overwrite phase]]: <tt>on/off</tt> (<tt>on</tt> by default). * PHASE_READ - setting for [[#Read phase|read phase]]. The required values are <tt>off/find/list</tt>. In <tt>"find"</tt> mode, <tt>find(1)</tt> is used to read the files. In <tt>"list"</tt> mode (deafult) [[#Read phase|mongo_read]] is used. See <tt>mongo_read.c</tt> for details. * PHASE_STATS - setting for [[#Stats phase|stats phase]]: <tt>on/off</tt> (<tt>on</tt> by default). * PHASE_DELETE - setting for [[#Delete phase|delete phase]]. This option requires one of the following values: <tt>off/rm/list</tt>. In <tt>"cp"</tt> mode <tt>rm(1)</tt> is used to delete the working file set. In <tt>"list"</tt> mode (deafult) [[#Delete phase|mongo_delete]] is used. See <tt>mongo_delete.c</tt> for details. === Special required command === * RUN - defines one mongo run (while the whole string defines one mongo session) which starts all default and maybe some special mongo phases (see [[#What is Mongo doing?|below]]) defined by the options specified before this command. The mongo options keep its values (specified or default) during all the <tt>mongo.pl</tt> session unless you respecify another ones. Example: # ./mongo.pl LOG=/tmp/logfile1 file_size=10000 \ bytes=10000000 fstype=reiserfs dev=/dev/hda9 \ dir=/mnt/testfs RUN log=/tmp/logfile2 \ mount_options=notail RUN * <file_to_include> - We recommend to specify all the mongo options you want in one file instead of command string, since to edit a file is more convenient then the command string. Each specification must occupy one string in this file. For example, previous command can be rewritten if you place all the options with first "RUN" in the file <tt>"mongo.opts"</tt>: # ./mongo.pl log=/tmp/logfile1 @mongo.opts \ log=/tmp/logfile2 mount_options=notail RUN <tt>mongo.pl</tt> executes one or more mongo runs defined by specified options. 
For each run mongo.pl creates the tree of mongo statistics (statistics result tree). '''WARNING: <tt>mongo.pl</tt> will format each specified device DEV by <tt>mkfs.xxx</tt> and mount it at MNT directory.''' == The <tt>mongo_parser.pl</tt> script == #./mongo_parser.pl log1 [log2 log3 ...] > comparative_table.html where <tt>log1, log2, log3, ...</tt> are names of the files which contains statistics result trees created by mongo.pl. Each those file should contain only one statistics result tree. '''WARNING: The result trees of all specified files file1, file2, file3, ... must be mutually phase-isomorphic.''' Example: The result trees of logfile1, logfile2 from the example above are phase-isomorphic. On the other hand, specifying of log1, log2, log3 from the following example is not available, since the result trees of log2, log3 are non-isomorphic (different file_size): ./mongo.pl log=log1 file_size=10000 bytes=10000000 \ fstype=reiserfs dev=/dev/hda9 dir=/mnt/testfs RUN \ log=log2 mount_options=notail RUN log=log3 file_size=20000 <tt>mongo_parser.pl</tt> creates a comparative html-table of specified result trees. == What is Mongo doing? == For each run Mongo executes 8 default, and maybe some special phases. In each phase Mongo runs NPROC processes (the parent one with (NPROC - 1) children) defined by appropriate mongo utility and creates the set of mongo statistics. Currently mongo supports three kind of statistics: REAL_TIME, CPU_TIME, and DF. REAL_TIME and CPU_TIME are timing statistics about the run of the specified number (NPROC) of processes of appropriate phase. REAL_TIME is the elapsed real (in seconds) time between invocation and termination. CPU_TIME is the system CPU time (in CPU-seconds) - the sum of the tms_stime and tms_cstime values in a struct tms as returned by times(2). DF is space usage statistic of the specified device DEV. For default phases DF means disc space usage in bytes after all the previous phases including the current one. 
For the special dd_writing_largefile, dd_reading_largefile phases DF means the size in bytes of the file created during appropriate phases. The default mongo phases model the basic user's processes which use file API. In order to run special mongo phases you should specify special phase-options. Currently Mongo supports 8 default and 2 special phases. Each phase defined by appropriate mongo utility: === Create phase === The reiser_fract_tree program creates files in a tree of random depth and branching (maybe fsync each files) # ./reiser_fract_tree <bytes_to_consume> <median_file_size> \ <max_file_size> <median_dir_nr_files> <max_directory_nr_files> \ <median_dir_branching> <max_dir_branching> <write_buffer_size> \ <testfs_mount_point> <print_stats_flag> <max_fname> <flist_name> \ <sync_flag> <gamma_exponent> Files vary in size randomly according to the core file size generator (<tt>off_t determine_size( off_t F, off_t max_size)</tt>) used in reiser_fract_tree. This generator is constructed by random variables that have uniform distributions, see fig.1: [[image:file_size_dist.png]]). FIGURE 1: The distribution function of the main generator <tt>determine_size()</tt>. Every this variable we get by mapping of standard gnu pseudo-random generator <tt>rand()</tt> defined on <tt>[0, RAND_MAX]</tt> onto <tt>[A, B]</tt> for suitable A,B by using high-order bits. The file sizes of first ''uniform chunk'' are in <tt>[0, F]</tt>, and <tt>P(file_size in [0, F]) = 1-gamma</tt>. The square of next ''uniform chunks'' exponentially depends on its number with exponent <tt>gamma</tt>, and the size of the stride exponentially depends on its number with exponent ''scale'' (we use <tt>scale=10</tt>). <tt>F</tt> is the range of first ''uniform chunk'' in bytes (the value of the option <tt>FILE_SIZE</tt> in <tt>mongo.pl</tt>). Median file size is hypothesized to be proportional to the average per file space wastage. 
Notice how that implies that, with a more efficient filesystem, file size usage patterns will in the long term move to a lower median file size.) It has a maximum size of <tt>max_file_size</tt>. Directories vary in size according to the same distribution function, but with separate parameters to control both the median and maximum size for the number of files within them, and the number of subdirectories within them. This program prunes some empty subdirectories in a manner that causes the parents of leaf directories to branch less than the <tt>median_dir_branching</tt>. To avoid having one large file distort the results such that you have to benchmark many times, set max_file_size to not more than <tt>bytes_to_consume/10</tt>. If the <tt>maximum/median</tt> is a small integer, then randomness will be very poor. For isolating the performance consequences of design variations on particular file or directory size ranges, try setting their <tt>median_size</tt> and <tt>max_size</tt> to both equal the max size of the file size range you want to test. In order to provide the same conditions for various testing file systems in next phases <tt>mongo_fract_tree</tt> creates in <tt>/var/tmp</tt> a list of all files sorted in the order they were created in. === Copy phase === <tt>mongo_copy()</tt> program copies files created by reiser_fract_tree in specified order (maybe fsync each new file). 
The order and the files specified by flist: # ./mongo_copy <source_dir> <dest_dir> <writebuffer_size> <flist> <sync_flag> === Append phase === The <tt>mongo_append</tt> program reads filenames from stdin and appends to each file (filesize * append_factor) bytes, and maybe fsync it: # ./mongo_append <append_factor> <writebuffer_size> <sync_flag> === Modify phase === The <tt>mongo_modify</tt> program reads filenames from stdin and modifies its (filesize * modify_factor) bytes starting with random position, and maybe fsync it: # ./mongo_modify <modify_factor> <writebuffer_size> <sync_flag> === Overwrite phase === This phase uses <tt>mongo_modify</tt> program with modify_factor = 1, so it modifies filesize bytes, i.e. overwrites (and maybe <tt>fsync()</tt>) it. === Read phase === The <tt>mongo_read</tt> program reads files created by mongo_fract_tree in specified order. === Stats phase === We do <tt>find -type f</tt> on the expected partition. Zam believes that it should be enough for stat for all files. === Delete phase === We do <tt>"rm -r"</tt> on all files and directories. === dd_writing_largefile phase === This is a special mongo phase which requires the option DD_MBCOUNT to be specified. We do <tt>dd if=/dev/zero of=DIR/largefile bs=1M count=DD_MBCOUNT"</tt>. === dd_reading_largefile phase === This is a special mongo phase which requires the option DD_MBCOUNT to be specified. We do <tt>dd if=DIR/largefile of=/dev/null bs=1M count=DD_MBCOUNT"</tt>. Look at the source code if you need more information than this introduction contains. == Mongo output == The main purpose of Mongo is comparing of file system variations. The following mongo options (fs-options) are to specify these variations: SYSTEM, FSTYPE, DEV, DIR, MOUNT_OPTIONS, SYNC, JOURNAL_DEV, JOURNAL_SIZE. Note, that SYSTEM is a "fake" fs-option which means the kernel version that the mongo was run under. 
For example: SYSTEM = linux-2.4.19-rc1+01-relocation4.patch+02-commit_super-8-relocation.patch+03-data-logging-24.patch. For the same file system variation Mongo define one or more phase variations by following mongo phase-options: REP_COUNTER, WRITE_BUFFER, GAMMA, FILE_SIZE, BYTES, DD_MBCOUNT. These options specify the parameters which are passed to the mongo utilities. For each file system variations <tt>mongo.pl</tt> prepares statistics for one or more phase variations. <tt>mongo_parser.pl</tt> script is to prepare comparative table for one or more file system variations. First you should prepare the appropriate mongo output files for each variation by using <tt>mongo.pl</tt> script. Make sure that all these files contain phase-isomorphic result trees (see above). Then specify these filenames for <tt>mongo_parser.pl</tt> (see the usage above) which will create comparative html table (by default in stdout). The file system variations are represented in this table by the columns of statistics marked by letter A, B, C, etc.. in the order they were specified in <tt>mongo_parser.pl</tt>. The header of this table contains specifications of each this variation as the set of the same mongo fs-options which have different values. Absence of any fs-option means it was specified by default value. If this table represents more then one file system variations, we assume by default that A is main, and B, C, ... is A-relative variations. It means that all the statistics of B, C,... are divided on the appropriate statistics of A. The options specified by identical values for all the file system variations locates in special header. The statistics of each phase variation are specified by subheading (numerated by #1, #2, ...). [https://web.archive.org/web/20061129061537/http://www.namesys.com/benchmarks/journal_relocation_to_NVRAM.html Table I. 
Mongo comparative results for reiserfs variations (standard journal, journal on external device (NVRAM))] (archive.org, 2006-11-29) [[category:ReiserFS]] [[category:Reiser4]] 46459899b61450d5882a686d774aa37436b24b86 1544 1517 2009-07-02T19:09:03Z Chris goe 2 file_size_dist.png included Mongo is the main benchmark script we use for comparing [[ReiserFS]] variations. Untar the [http://nerdbynature.de/bits/thebsh/benchmarks/dist/ archive] in a directory and read the Introduction to Mongo Testsuites. == Introduction to the Mongo Testsuites == Mongo is a set of the programs to test linux filesystems for performance and functionality. The main program is <tt>mongo.pl</tt> script which creates the set of statistics for the file system variations specified by special mongo options. The <tt>mongo_parser.pl</tt> script parses those statistics and creates for them comparative html-table. == The <tt>mongo.pl</tt> script == # ./mongo.pl opt11=val11 opt12=val12 ... \ RUN [opt21=val21 opt22=val22 ... \ RUN opt31=val31 opt32=val32 ... \ RUN ... | @<file to include>], where <tt>opt1j</tt> (j = 1, 2, ...) are required and maybe another mongo options, <tt>optij</tt> (i = 2, ...; j = 1, 2, ...) - mongo options. The expression <tt>optij=valij</tt> means that mongo option <tt>optij</tt> was specified by the value <tt>valij</tt>. Here is a description of acceptable values of all mongo options: === Required mongo options === * FSTYPE - filesystem type (e.g. ext3) * DEV - device file name (e.g. /dev/hda9) * DIR - mount-point for the filesystem (e.g. /mnt/testfs) * FILE_SIZE - file size in bytes (e.g. 10000) used in reiser_fract_tree, this is passed to the main generator function determine_size() (see below). * BYTES - file set size in bytes (e.g. 250000000) created by all instances of reiser_fract_tree in one pass. To have results free from buffer cache influence, it has to satisfy to the property: <tt>BYTES * REP_COUNTER > ramsize</tt>. 
=== Other mongo options === * MKFS - path to the executable file that creates testing filesystem (e.g. [[mkreiserfs]]). By default (if it is not reiserfs or ext2) mongo.pl tries to create it by the command <tt>mkfs.''filesystem_name''</tt>, so make sure it is available. * MOUNT_OPTIONS - list of [[mount|mount options]] separated as usual by commas (e.g. rw,notail). * NPROC - number of processes running simultaneously (3 by default). * REP_COUNTER - number of passes of each mongo phase (3 by default). Each mongo statistics is an average value of REP_COUNTER results. So using REP_COUNTER > 1 reduces dispersion and improves mongo statistics. * SYNC - this option requires one of two strings :"on"/"off" ("off" by default). "on" means forcing of syncing to iozone of regular files in create, copy, append, modify phases. * WRITE_BUFFER - read/write buffer size in bytes for mongo utilities (4096 by default). * GAMMA - the exponent of the core file size distribution of the random value generator determine_size() used in mongo_fract_tree() (see below). GAMMA values are in [0,1] (e.g. 0.2, default value is 0.0). * JOURNAL_DEV - journal device name. This is an option only for reiserfs with non-standard journal support. By default [[mkreiserfs]] creates journal on main device (DEV). * JOURNAL_SIZE - journal size in blocks including journal header (e.g. 513). This is an option only for reiserfs with non-standard journal support. By default [[mkreiserfs]] creates journal of standard size (8193). * DD_MBCOUNT - size in megabytes of the large file that we want to read (write) by <tt>dd(1)</tt> program. If this option specified mongo executes two special phases <tt>dd_reading_largefile</tt> and <tt>dd_writing_largefile</tt> (see [[#What is Mongo doing?|mongo phases description]] below). === Special options === * LOG - the name of the file where you wish to store statistics result tree that <tt>mongo.pl</tt> creates for each mongo run (see below). 
Regardless of this option, <tt>mongo.pl</tt> writes all results to stdout, but we recommend specifying LOG for each filesystem variation you want to compare, since it enables you to create a comparative HTML table with the <tt>mongo_parser.pl</tt> script.
* INFO_R4 - a string describing the benchmarked [[Reiser4]] version. This option is required if <tt>FSTYPE=reiser4</tt> is set.

=== Mongo phases settings options ===
(see the [[#What is Mongo doing?|mongo phases description]] below)
* PHASE_CREATE - setting for the [[#Create phase|create phase]]: <tt>on/off</tt> (<tt>on</tt> by default).
* PHASE_COPY - setting for the [[#Copy phase|copy phase]]. This option takes one of the values <tt>off/cp/list</tt>. In <tt>"cp"</tt> mode, <tt>cp(1)</tt> is invoked to copy the files. In <tt>"list"</tt> mode (the default), [[#Copy phase|mongo_copy]] is used to copy the files. See <tt>mongo_copy.c</tt> for details.
* PHASE_APPEND - setting for the [[#Append phase|append phase]]: <tt>on/off</tt> (<tt>on</tt> by default).
* PHASE_MODIFY - setting for the [[#Modify phase|modify phase]]: <tt>on/off</tt> (<tt>on</tt> by default).
* PHASE_OVERWRITE - setting for the [[#Overwrite phase|overwrite phase]]: <tt>on/off</tt> (<tt>on</tt> by default).
* PHASE_READ - setting for the [[#Read phase|read phase]]. This option takes one of the values <tt>off/find/list</tt>. In <tt>"find"</tt> mode, <tt>find(1)</tt> is used to read the files. In <tt>"list"</tt> mode (the default), [[#Read phase|mongo_read]] is used. See <tt>mongo_read.c</tt> for details.
* PHASE_STATS - setting for the [[#Stats phase|stats phase]]: <tt>on/off</tt> (<tt>on</tt> by default).
* PHASE_DELETE - setting for the [[#Delete phase|delete phase]]. This option takes one of the values <tt>off/rm/list</tt>. In <tt>"rm"</tt> mode, <tt>rm(1)</tt> is used to delete the working file set. In <tt>"list"</tt> mode (the default), [[#Delete phase|mongo_delete]] is used. See <tt>mongo_delete.c</tt> for details.
=== Special required command ===
* RUN - defines one mongo run (while the whole command string defines one mongo session). A run starts all default, and possibly some special, mongo phases (see [[#What is Mongo doing?|below]]) as defined by the options specified before this command. The mongo options keep their values (specified or default) for the whole <tt>mongo.pl</tt> session unless you respecify them. Example:

 # ./mongo.pl LOG=/tmp/logfile1 file_size=10000 \
   bytes=10000000 fstype=reiserfs dev=/dev/hda9 \
   dir=/mnt/testfs RUN log=/tmp/logfile2 \
   mount_options=notail RUN

* <file_to_include> - We recommend putting all the mongo options you want into a file instead of the command string, since a file is more convenient to edit. Each option specification must occupy one line of this file. For example, the previous command can be rewritten, if you place all the options up to and including the first "RUN" in the file <tt>mongo.opts</tt>, as:

 # ./mongo.pl log=/tmp/logfile1 @mongo.opts \
   log=/tmp/logfile2 mount_options=notail RUN

<tt>mongo.pl</tt> executes one or more mongo runs defined by the specified options. For each run, mongo.pl creates a tree of mongo statistics (the statistics result tree).

'''WARNING: <tt>mongo.pl</tt> will format each specified device DEV with <tt>mkfs.xxx</tt> and mount it at the mount point DIR.'''

== The <tt>mongo_parser.pl</tt> script ==

 # ./mongo_parser.pl log1 [log2 log3 ...] > comparative_table.html

where <tt>log1, log2, log3, ...</tt> are the names of files containing statistics result trees created by mongo.pl. Each such file should contain only one statistics result tree.

'''WARNING: The result trees of all specified files log1, log2, log3, ... must be mutually phase-isomorphic.'''

Example: the result trees of logfile1 and logfile2 from the example above are phase-isomorphic.
On the other hand, specifying log1, log2, log3 from the following example is not allowed, since the result trees of log2 and log3 are non-isomorphic (different file_size):

 # ./mongo.pl log=log1 file_size=10000 bytes=10000000 \
   fstype=reiserfs dev=/dev/hda9 dir=/mnt/testfs RUN \
   log=log2 mount_options=notail RUN log=log3 file_size=20000 RUN

<tt>mongo_parser.pl</tt> creates a comparative HTML table of the specified result trees.

== What is Mongo doing? ==

For each run, Mongo executes 8 default phases and possibly some special phases. In each phase, Mongo runs NPROC processes (the parent plus (NPROC - 1) children) defined by the appropriate mongo utility, and collects a set of mongo statistics. Currently mongo supports three kinds of statistics: REAL_TIME, CPU_TIME, and DF. REAL_TIME and CPU_TIME are timing statistics for the run of the specified number (NPROC) of processes of the corresponding phase. REAL_TIME is the elapsed real time (in seconds) between invocation and termination. CPU_TIME is the system CPU time (in CPU-seconds): the sum of the tms_stime and tms_cstime values in a struct tms as returned by times(2). DF is the space-usage statistic of the specified device DEV. For the default phases, DF is the disk space usage in bytes after all previous phases including the current one. For the special dd_writing_largefile and dd_reading_largefile phases, DF is the size in bytes of the file created during the corresponding phase. The default mongo phases model basic user processes that use the file API. To run the special mongo phases you must specify the special phase options. Currently Mongo supports 8 default and 2 special phases.
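The REAL_TIME and CPU_TIME bookkeeping described above can be sketched as follows. This is an illustrative C reconstruction, not the actual mongo source; <tt>example_phase</tt> is a stand-in for forking and waiting for the NPROC worker processes:

```c
#include <sys/times.h>
#include <time.h>
#include <unistd.h>

/* Stand-in for a real mongo phase: a real implementation would fork
 * NPROC worker processes here and wait for them all to terminate. */
static void example_phase(void)
{
    for (volatile long i = 0; i < 1000000; i++)
        ;
}

/* Sketch of the timing statistics: REAL_TIME is elapsed wall-clock
 * seconds between invocation and termination; CPU_TIME is the system
 * CPU time of the process and its waited-for children, i.e. the sum of
 * tms_stime and tms_cstime from times(2), in CPU-seconds. */
static void measure_phase(void (*run_phase)(void),
                          double *real_time, double *cpu_time)
{
    struct tms t;
    long hz = sysconf(_SC_CLK_TCK);  /* clock ticks per second */
    clock_t start = times(&t);

    run_phase();

    clock_t end = times(&t);         /* t now holds cumulative CPU times */
    *real_time = (double)(end - start) / hz;
    *cpu_time  = (double)(t.tms_stime + t.tms_cstime) / hz;
}
```

Note that tms_cstime only accumulates once the children have been waited for, which is why each phase waits for all NPROC processes before sampling.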
Each phase is defined by the appropriate mongo utility:

=== Create phase ===

The reiser_fract_tree program creates files in a tree of random depth and branching (optionally fsyncing each file):

 # ./reiser_fract_tree <bytes_to_consume> <median_file_size> \
   <max_file_size> <median_dir_nr_files> <max_directory_nr_files> \
   <median_dir_branching> <max_dir_branching> <write_buffer_size> \
   <testfs_mount_point> <print_stats_flag> <max_fname> <flist_name> \
   <sync_flag> <gamma_exponent>

Files vary in size randomly according to the core file-size generator (<tt>off_t determine_size(off_t F, off_t max_size)</tt>) used in reiser_fract_tree. This generator is constructed from random variables with uniform distributions, see fig. 1:

[[image:file_size_dist.png]]

FIGURE 1: The distribution function of the main generator <tt>determine_size()</tt>.

Each of these variables is obtained by mapping the standard GNU pseudo-random generator <tt>rand()</tt>, defined on <tt>[0, RAND_MAX]</tt>, onto <tt>[A, B]</tt> for suitable A, B, using the high-order bits. The file sizes of the first ''uniform chunk'' lie in <tt>[0, F]</tt>, and <tt>P(file_size in [0, F]) = 1-gamma</tt>. The area of each subsequent ''uniform chunk'' depends exponentially on its number with exponent <tt>gamma</tt>, and the size of the stride depends exponentially on its number with exponent ''scale'' (we use <tt>scale=10</tt>). <tt>F</tt> is the range of the first ''uniform chunk'' in bytes (the value of the option <tt>FILE_SIZE</tt> in <tt>mongo.pl</tt>).

The median file size is hypothesized to be proportional to the average per-file space wastage. Notice that this implies that, with a more efficient filesystem, file-size usage patterns will in the long term move to a lower median file size. A file has a maximum size of <tt>max_file_size</tt>. Directories vary in size according to the same distribution function, but with separate parameters to control both the median and maximum number of files within them, and the number of subdirectories within them.
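The shape of the distribution described above can be condensed into a sketch. This is a hypothetical reconstruction only, NOT the code from reiser_fract_tree.c (the real <tt>determine_size()</tt> uses the high-order bits of <tt>rand()</tt>, and its exact chunk weighting is not reproduced here):

```c
#include <stdlib.h>
#include <sys/types.h>

/* Hypothetical sketch of the file-size generator: with probability
 * (1 - gamma) the size is uniform in [0, F]; otherwise it falls into a
 * later "uniform chunk" whose probability mass decays geometrically
 * (exponent gamma) and whose stride grows by a factor of scale = 10
 * per chunk, clamped to max_size. */
static off_t determine_size_sketch(off_t F, off_t max_size, double gamma)
{
    const off_t scale = 10;
    double u = (double)rand() / RAND_MAX;
    double mass = 1.0 - gamma;           /* mass of the first chunk */
    off_t lo = 0, hi = F;
    off_t size;

    while (u >= mass && hi < max_size) { /* walk into the next chunk */
        u -= mass;
        mass *= gamma;                   /* geometric decay of chunk mass */
        lo = hi;
        hi *= scale;                     /* stride grows with scale = 10 */
    }
    size = lo + (off_t)(((double)rand() / RAND_MAX) * (double)(hi - lo));
    return size > max_size ? max_size : size;
}
```

With the default GAMMA=0.0 this degenerates to a plain uniform draw from [0, F], which matches the description of the first uniform chunk.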
This program prunes some empty subdirectories in a manner that causes the parents of leaf directories to branch less than <tt>median_dir_branching</tt>. To avoid having one large file distort the results (which would force you to benchmark many times), set max_file_size to no more than <tt>bytes_to_consume/10</tt>. If <tt>maximum/median</tt> is a small integer, the randomness will be very poor. For isolating the performance consequences of design variations on particular file or directory size ranges, try setting their <tt>median_size</tt> and <tt>max_size</tt> to both equal the maximum size of the range you want to test. In order to provide the same conditions for the various filesystems under test in subsequent phases, <tt>reiser_fract_tree</tt> creates in <tt>/var/tmp</tt> a list of all files, sorted in the order they were created.

=== Copy phase ===

The <tt>mongo_copy</tt> program copies the files created by reiser_fract_tree in the specified order (optionally fsyncing each new file). The files and their order are given by flist:

 # ./mongo_copy <source_dir> <dest_dir> <writebuffer_size> <flist> <sync_flag>

=== Append phase ===

The <tt>mongo_append</tt> program reads filenames from stdin and appends (filesize * append_factor) bytes to each file, optionally fsyncing it:

 # ./mongo_append <append_factor> <writebuffer_size> <sync_flag>

=== Modify phase ===

The <tt>mongo_modify</tt> program reads filenames from stdin and modifies (filesize * modify_factor) bytes of each file starting at a random position, optionally fsyncing it:

 # ./mongo_modify <modify_factor> <writebuffer_size> <sync_flag>

=== Overwrite phase ===

This phase uses the <tt>mongo_modify</tt> program with modify_factor = 1, so it modifies filesize bytes, i.e. overwrites (and optionally <tt>fsync()</tt>s) each file.

=== Read phase ===

The <tt>mongo_read</tt> program reads the files created by reiser_fract_tree in the specified order.

=== Stats phase ===

We do <tt>find -type f</tt> on the test partition.
Zam believes that this should be enough to stat all the files.

=== Delete phase ===

We do <tt>rm -r</tt> on all files and directories.

=== dd_writing_largefile phase ===

This is a special mongo phase, which requires the option DD_MBCOUNT to be specified. We do <tt>dd if=/dev/zero of=DIR/largefile bs=1M count=DD_MBCOUNT</tt>.

=== dd_reading_largefile phase ===

This is a special mongo phase, which requires the option DD_MBCOUNT to be specified. We do <tt>dd if=DIR/largefile of=/dev/null bs=1M count=DD_MBCOUNT</tt>.

Look at the source code if you need more information than this introduction contains.

== Mongo output ==

The main purpose of Mongo is comparing filesystem variations. The following mongo options (fs-options) specify these variations: SYSTEM, FSTYPE, DEV, DIR, MOUNT_OPTIONS, SYNC, JOURNAL_DEV, JOURNAL_SIZE. Note that SYSTEM is a "fake" fs-option, which names the kernel version that mongo was run under. For example: SYSTEM = linux-2.4.19-rc1+01-relocation4.patch+02-commit_super-8-relocation.patch+03-data-logging-24.patch. For the same filesystem variation, Mongo defines one or more phase variations by the following mongo phase-options: REP_COUNTER, WRITE_BUFFER, GAMMA, FILE_SIZE, BYTES, DD_MBCOUNT. These options specify the parameters which are passed to the mongo utilities. For each filesystem variation, <tt>mongo.pl</tt> prepares statistics for one or more phase variations.

The <tt>mongo_parser.pl</tt> script prepares a comparative table for one or more filesystem variations. First prepare the appropriate mongo output files for each variation using the <tt>mongo.pl</tt> script. Make sure that all these files contain phase-isomorphic result trees (see above). Then pass these filenames to <tt>mongo_parser.pl</tt> (see the usage above), which will create a comparative HTML table (on stdout by default). The filesystem variations are represented in this table by columns of statistics, marked by the letters A, B, C, etc.
in the order they were specified in <tt>mongo_parser.pl</tt>. The header of this table contains specifications of each this variation as the set of the same mongo fs-options which have different values. Absence of any fs-option means it was specified by default value. If this table represents more then one file system variations, we assume by default that A is main, and B, C, ... is A-relative variations. It means that all the statistics of B, C,... are divided on the appropriate statistics of A. The options specified by identical values for all the file system variations locates in special header. The statistics of each phase variation are specified by subheading (numerated by #1, #2, ...). [[Mongo/journal_relocation_to_NVRAM|Here]] is an example of Mongo comparative table. [[category:ReiserFS]] [[category:Reiser4]] ed631743fc784eb80b89507a68a6e4334c61a2ed 1517 1515 2009-06-27T19:13:12Z Chris goe 2 new url Mongo is the main benchmark script we use for comparing [[ReiserFS]] variations. Untar the [http://nerdbynature.de/bits/thebsh/benchmarks/dist/ archive] in a directory and read the Introduction to Mongo Testsuites. == Introduction to the Mongo Testsuites == Mongo is a set of the programs to test linux filesystems for performance and functionality. The main program is <tt>mongo.pl</tt> script which creates the set of statistics for the file system variations specified by special mongo options. The <tt>mongo_parser.pl</tt> script parses those statistics and creates for them comparative html-table. == The <tt>mongo.pl</tt> script == # ./mongo.pl opt11=val11 opt12=val12 ... \ RUN [opt21=val21 opt22=val22 ... \ RUN opt31=val31 opt32=val32 ... \ RUN ... | @<file to include>], where <tt>opt1j</tt> (j = 1, 2, ...) are required and maybe another mongo options, <tt>optij</tt> (i = 2, ...; j = 1, 2, ...) - mongo options. The expression <tt>optij=valij</tt> means that mongo option <tt>optij</tt> was specified by the value <tt>valij</tt>. 
Here is a description of acceptable values of all mongo options: === Required mongo options === * FSTYPE - filesystem type (e.g. ext3) * DEV - device file name (e.g. /dev/hda9) * DIR - mount-point for the filesystem (e.g. /mnt/testfs) * FILE_SIZE - file size in bytes (e.g. 10000) used in reiser_fract_tree, this is passed to the main generator function determine_size() (see below). * BYTES - file set size in bytes (e.g. 250000000) created by all instances of reiser_fract_tree in one pass. To have results free from buffer cache influence, it has to satisfy to the property: <tt>BYTES * REP_COUNTER > ramsize</tt>. === Other mongo options === * MKFS - path to the executable file that creates testing filesystem (e.g. [[mkreiserfs]]). By default (if it is not reiserfs or ext2) mongo.pl tries to create it by the command <tt>mkfs.''filesystem_name''</tt>, so make sure it is available. * MOUNT_OPTIONS - list of [[mount|mount options]] separated as usual by commas (e.g. rw,notail). * NPROC - number of processes running simultaneously (3 by default). * REP_COUNTER - number of passes of each mongo phase (3 by default). Each mongo statistics is an average value of REP_COUNTER results. So using REP_COUNTER > 1 reduces dispersion and improves mongo statistics. * SYNC - this option requires one of two strings :"on"/"off" ("off" by default). "on" means forcing of syncing to iozone of regular files in create, copy, append, modify phases. * WRITE_BUFFER - read/write buffer size in bytes for mongo utilities (4096 by default). * GAMMA - the exponent of the core file size distribution of the random value generator determine_size() used in mongo_fract_tree() (see below). GAMMA values are in [0,1] (e.g. 0.2, default value is 0.0). * JOURNAL_DEV - journal device name. This is an option only for reiserfs with non-standard journal support. By default [[mkreiserfs]] creates journal on main device (DEV). * JOURNAL_SIZE - journal size in blocks including journal header (e.g. 513). 
This is an option only for reiserfs with non-standard journal support. By default [[mkreiserfs]] creates journal of standard size (8193). * DD_MBCOUNT - size in megabytes of the large file that we want to read (write) by <tt>dd(1)</tt> program. If this option specified mongo executes two special phases <tt>dd_reading_largefile</tt> and <tt>dd_writing_largefile</tt> (see [[#What is Mongo doing?|mongo phases description]] below). === Special options === * LOG - the name of the file where you wish to store statistics result tree that <tt>mongo.pl</tt> creates for each mongo run (see below). Regardless of this option, <tt>mongo.pl</tt> writes all the results into stdout, but we recommend specify it for each file system variations you want to compare, as it will enable you to create comparative html-table by <tt>mongo_parser.pl</tt> script. * INFO_R4 - string information the benchmarked [[Reiser4]] version about. This is required option if <tt>FSTYPE=reiser4</tt> is set. === Mongo phases settings options === (see [[#What is Mongo doing?|mongo phases description]] below). * PHASE_CREATE - setting for [[#Create phase|create phase]]: <tt>on/off</tt> (<tt>on</tt> by default). * PHASE_COPY - setting for [[#Copy phase|copy phase]]. This option requires one of the following values: <tt>off/cp/list"</tt>. In "cp" mode <tt>cp(1)</tt> is invoked to copy files. In "list" mode (deafult) uses [[#Copy phase|mongo_copy]] to copy files. See <tt>mongo_copy.c</tt> for details. * PHASE_APPEND - setting for [[#Append phase|append phase]]: <tt>on/off</tt> (<tt>on</tt> by default). * PHASE_MODIFY - setting for [[#Modify phase|modify phase]]: <tt>on/off</tt> (<tt>on</tt> by default). * PHASE_OVERWRITE - setting for [[#Overwrite phase|overwrite phase]]: <tt>on/off</tt> (<tt>on</tt> by default). * PHASE_READ - setting for [[#Read phase|read phase]]. The required values are <tt>off/find/list</tt>. In <tt>"find"</tt> mode, <tt>find(1)</tt> is used to read the files. 
In <tt>"list"</tt> mode (deafult) [[#Read phase|mongo_read]] is used. See <tt>mongo_read.c</tt> for details. * PHASE_STATS - setting for [[#Stats phase|stats phase]]: <tt>on/off</tt> (<tt>on</tt> by default). * PHASE_DELETE - setting for [[#Delete phase|delete phase]]. This option requires one of the following values: <tt>off/rm/list</tt>. In <tt>"cp"</tt> mode <tt>rm(1)</tt> is used to delete the working file set. In <tt>"list"</tt> mode (deafult) [[#Delete phase|mongo_delete]] is used. See <tt>mongo_delete.c</tt> for details. === Special required command === * RUN - defines one mongo run (while the whole string defines one mongo session) which starts all default and maybe some special mongo phases (see [[#What is Mongo doing?|below]]) defined by the options specified before this command. The mongo options keep its values (specified or default) during all the <tt>mongo.pl</tt> session unless you respecify another ones. Example: # ./mongo.pl LOG=/tmp/logfile1 file_size=10000 \ bytes=10000000 fstype=reiserfs dev=/dev/hda9 \ dir=/mnt/testfs RUN log=/tmp/logfile2 \ mount_options=notail RUN * <file_to_include> - We recommend to specify all the mongo options you want in one file instead of command string, since to edit a file is more convenient then the command string. Each specification must occupy one string in this file. For example, previous command can be rewritten if you place all the options with first "RUN" in the file <tt>"mongo.opts"</tt>: # ./mongo.pl log=/tmp/logfile1 @mongo.opts \ log=/tmp/logfile2 mount_options=notail RUN <tt>mongo.pl</tt> executes one or more mongo runs defined by specified options. For each run mongo.pl creates the tree of mongo statistics (statistics result tree). '''WARNING: <tt>mongo.pl</tt> will format each specified device DEV by <tt>mkfs.xxx</tt> and mount it at MNT directory.''' == The <tt>mongo_parser.pl</tt> script == #./mongo_parser.pl log1 [log2 log3 ...] 
> comparative_table.html where <tt>log1, log2, log3, ...</tt> are names of the files which contains statistics result trees created by mongo.pl. Each those file should contain only one statistics result tree. '''WARNING: The result trees of all specified files file1, file2, file3, ... must be mutually phase-isomorphic.''' Example: The result trees of logfile1, logfile2 from the example above are phase-isomorphic. On the other hand, specifying of log1, log2, log3 from the following example is not available, since the result trees of log2, log3 are non-isomorphic (different file_size): ./mongo.pl log=log1 file_size=10000 bytes=10000000 \ fstype=reiserfs dev=/dev/hda9 dir=/mnt/testfs RUN \ log=log2 mount_options=notail RUN log=log3 file_size=20000 <tt>mongo_parser.pl</tt> creates a comparative html-table of specified result trees. == What is Mongo doing? == For each run Mongo executes 8 default, and maybe some special phases. In each phase Mongo runs NPROC processes (the parent one with (NPROC - 1) children) defined by appropriate mongo utility and creates the set of mongo statistics. Currently mongo supports three kind of statistics: REAL_TIME, CPU_TIME, and DF. REAL_TIME and CPU_TIME are timing statistics about the run of the specified number (NPROC) of processes of appropriate phase. REAL_TIME is the elapsed real (in seconds) time between invocation and termination. CPU_TIME is the system CPU time (in CPU-seconds) - the sum of the tms_stime and tms_cstime values in a struct tms as returned by times(2). DF is space usage statistic of the specified device DEV. For default phases DF means disc space usage in bytes after all the previous phases including the current one. For the special dd_writing_largefile, dd_reading_largefile phases DF means the size in bytes of the file created during appropriate phases. The default mongo phases model the basic user's processes which use file API. In order to run special mongo phases you should specify special phase-options. 
Currently Mongo supports 8 default and 2 special phases. Each phase defined by appropriate mongo utility: === Create phase === The reiser_fract_tree program creates files in a tree of random depth and branching (maybe fsync each files) # ./reiser_fract_tree <bytes_to_consume> <median_file_size> \ <max_file_size> <median_dir_nr_files> <max_directory_nr_files> \ <median_dir_branching> <max_dir_branching> <write_buffer_size> \ <testfs_mount_point> <print_stats_flag> <max_fname> <flist_name> \ <sync_flag> <gamma_exponent> Files vary in size randomly according to the core file size generator (<tt>off_t determine_size( off_t F, off_t max_size)</tt>) used in reiser_fract_tree. This generator is constructed by random variables that have uniform distributions (see [http://nerdbynature.de/bits/thebsh/benchmarks/file_size_dist.jpg fig.1]). [http://nerdbynature.de/bits/thebsh/benchmarks/file_size_dist.jpg FIGURE 1]: The distribution function of the main generator <tt>determine_size()</tt>. Every this variable we get by mapping of standard gnu pseudo-random generator <tt>rand()</tt> defined on <tt>[0, RAND_MAX]</tt> onto <tt>[A, B]</tt> for suitable A,B by using high-order bits. The file sizes of first ''uniform chunk'' are in <tt>[0, F]</tt>, and <tt>P(file_size in [0, F]) = 1-gamma</tt>. The square of next ''uniform chunks'' exponentially depends on its number with exponent <tt>gamma</tt>, and the size of the stride exponentially depends on its number with exponent ''scale'' (we use <tt>scale=10</tt>). <tt>F</tt> is the range of first ''uniform chunk'' in bytes (the value of the option <tt>FILE_SIZE</tt> in <tt>mongo.pl</tt>). Median file size is hypothesized to be proportional to the average per file space wastage. Notice how that implies that, with a more efficient filesystem, file size usage patterns will in the long term move to a lower median file size.) It has a maximum size of <tt>max_file_size</tt>. 
Directories vary in size according to the same distribution function, but with separate parameters to control both the median and maximum size for the number of files within them, and the number of subdirectories within them. This program prunes some empty subdirectories in a manner that causes the parents of leaf directories to branch less than the <tt>median_dir_branching</tt>. To avoid having one large file distort the results such that you have to benchmark many times, set max_file_size to not more than <tt>bytes_to_consume/10</tt>. If the <tt>maximum/median</tt> is a small integer, then randomness will be very poor. For isolating the performance consequences of design variations on particular file or directory size ranges, try setting their <tt>median_size</tt> and <tt>max_size</tt> to both equal the max size of the file size range you want to test. In order to provide the same conditions for various testing file systems in next phases <tt>mongo_fract_tree</tt> creates in <tt>/var/tmp</tt> a list of all files sorted in the order they were created in. === Copy phase === <tt>mongo_copy()</tt> program copies files created by reiser_fract_tree in specified order (maybe fsync each new file). The order and the files specified by flist: # ./mongo_copy <source_dir> <dest_dir> <writebuffer_size> <flist> <sync_flag> === Append phase === The <tt>mongo_append</tt> program reads filenames from stdin and appends to each file (filesize * append_factor) bytes, and maybe fsync it: # ./mongo_append <append_factor> <writebuffer_size> <sync_flag> === Modify phase === The <tt>mongo_modify</tt> program reads filenames from stdin and modifies its (filesize * modify_factor) bytes starting with random position, and maybe fsync it: # ./mongo_modify <modify_factor> <writebuffer_size> <sync_flag> === Overwrite phase === This phase uses <tt>mongo_modify</tt> program with modify_factor = 1, so it modifies filesize bytes, i.e. overwrites (and maybe <tt>fsync()</tt>) it. 
=== Read phase === The <tt>mongo_read</tt> program reads files created by mongo_fract_tree in specified order. === Stats phase === We do <tt>find -type f</tt> on the expected partition. Zam believes that it should be enough for stat for all files. === Delete phase === We do <tt>"rm -r"</tt> on all files and directories. === dd_writing_largefile phase === This is a special mongo phase which requires the option DD_MBCOUNT to be specified. We do <tt>dd if=/dev/zero of=DIR/largefile bs=1M count=DD_MBCOUNT"</tt>. === dd_reading_largefile phase === This is a special mongo phase which requires the option DD_MBCOUNT to be specified. We do <tt>dd if=DIR/largefile of=/dev/null bs=1M count=DD_MBCOUNT"</tt>. Look at the source code if you need more information than this introduction contains. == Mongo output == The main purpose of Mongo is comparing of file system variations. The following mongo options (fs-options) are to specify these variations: SYSTEM, FSTYPE, DEV, DIR, MOUNT_OPTIONS, SYNC, JOURNAL_DEV, JOURNAL_SIZE. Note, that SYSTEM is a "fake" fs-option which means the kernel version that the mongo was run under. For example: SYSTEM = linux-2.4.19-rc1+01-relocation4.patch+02-commit_super-8-relocation.patch+03-data-logging-24.patch. For the same file system variation Mongo define one or more phase variations by following mongo phase-options: REP_COUNTER, WRITE_BUFFER, GAMMA, FILE_SIZE, BYTES, DD_MBCOUNT. These options specify the parameters which are passed to the mongo utilities. For each file system variations <tt>mongo.pl</tt> prepares statistics for one or more phase variations. <tt>mongo_parser.pl</tt> script is to prepare comparative table for one or more file system variations. First you should prepare the appropriate mongo output files for each variation by using <tt>mongo.pl</tt> script. Make sure that all these files contain phase-isomorphic result trees (see above). 
Then specify these filenames for <tt>mongo_parser.pl</tt> (see the usage above) which will create comparative html table (by default in stdout). The file system variations are represented in this table by the columns of statistics marked by letter A, B, C, etc.. in the order they were specified in <tt>mongo_parser.pl</tt>. The header of this table contains specifications of each this variation as the set of the same mongo fs-options which have different values. Absence of any fs-option means it was specified by default value. If this table represents more then one file system variations, we assume by default that A is main, and B, C, ... is A-relative variations. It means that all the statistics of B, C,... are divided on the appropriate statistics of A. The options specified by identical values for all the file system variations locates in special header. The statistics of each phase variation are specified by subheading (numerated by #1, #2, ...). [[Mongo/journal_relocation_to_NVRAM|Here]] is an example of Mongo comparative table. [[category:ReiserFS]] [[category:Reiser4]] 528583227f9006cd745621fcf666a5d9a2c01f17 1515 1514 2009-06-27T18:50:51Z Chris goe 2 /* Mongo output */ Mongo is the main benchmark script we use for comparing [[ReiserFS]] variations. Untar the [http://nerdbynature.de/bits/mongo/ archive] in a directory and read the Introduction to Mongo Testsuites. == Introduction to the Mongo Testsuites == Mongo is a set of the programs to test linux filesystems for performance and functionality. The main program is <tt>mongo.pl</tt> script which creates the set of statistics for the file system variations specified by special mongo options. The <tt>mongo_parser.pl</tt> script parses those statistics and creates for them comparative html-table. == The <tt>mongo.pl</tt> script == # ./mongo.pl opt11=val11 opt12=val12 ... \ RUN [opt21=val21 opt22=val22 ... \ RUN opt31=val31 opt32=val32 ... \ RUN ... | @<file to include>], where <tt>opt1j</tt> (j = 1, 2, ...) 
are required and maybe another mongo options, <tt>optij</tt> (i = 2, ...; j = 1, 2, ...) - mongo options. The expression <tt>optij=valij</tt> means that mongo option <tt>optij</tt> was specified by the value <tt>valij</tt>. Here is a description of acceptable values of all mongo options: === Required mongo options === * FSTYPE - filesystem type (e.g. ext3) * DEV - device file name (e.g. /dev/hda9) * DIR - mount-point for the filesystem (e.g. /mnt/testfs) * FILE_SIZE - file size in bytes (e.g. 10000) used in reiser_fract_tree, this is passed to the main generator function determine_size() (see below). * BYTES - file set size in bytes (e.g. 250000000) created by all instances of reiser_fract_tree in one pass. To have results free from buffer cache influence, it has to satisfy to the property: <tt>BYTES * REP_COUNTER > ramsize</tt>. === Other mongo options === * MKFS - path to the executable file that creates testing filesystem (e.g. [[mkreiserfs]]). By default (if it is not reiserfs or ext2) mongo.pl tries to create it by the command <tt>mkfs.''filesystem_name''</tt>, so make sure it is available. * MOUNT_OPTIONS - list of [[mount|mount options]] separated as usual by commas (e.g. rw,notail). * NPROC - number of processes running simultaneously (3 by default). * REP_COUNTER - number of passes of each mongo phase (3 by default). Each mongo statistics is an average value of REP_COUNTER results. So using REP_COUNTER > 1 reduces dispersion and improves mongo statistics. * SYNC - this option requires one of two strings :"on"/"off" ("off" by default). "on" means forcing of syncing to iozone of regular files in create, copy, append, modify phases. * WRITE_BUFFER - read/write buffer size in bytes for mongo utilities (4096 by default). * GAMMA - the exponent of the core file size distribution of the random value generator determine_size() used in mongo_fract_tree() (see below). GAMMA values are in [0,1] (e.g. 0.2, default value is 0.0). * JOURNAL_DEV - journal device name. 
This is an option only for reiserfs with non-standard journal support. By default [[mkreiserfs]] creates journal on main device (DEV). * JOURNAL_SIZE - journal size in blocks including journal header (e.g. 513). This is an option only for reiserfs with non-standard journal support. By default [[mkreiserfs]] creates journal of standard size (8193). * DD_MBCOUNT - size in megabytes of the large file that we want to read (write) by <tt>dd(1)</tt> program. If this option specified mongo executes two special phases <tt>dd_reading_largefile</tt> and <tt>dd_writing_largefile</tt> (see [[#What is Mongo doing?|mongo phases description]] below). === Special options === * LOG - the name of the file where you wish to store statistics result tree that <tt>mongo.pl</tt> creates for each mongo run (see below). Regardless of this option, <tt>mongo.pl</tt> writes all the results into stdout, but we recommend specify it for each file system variations you want to compare, as it will enable you to create comparative html-table by <tt>mongo_parser.pl</tt> script. * INFO_R4 - string information the benchmarked [[Reiser4]] version about. This is required option if <tt>FSTYPE=reiser4</tt> is set. === Mongo phases settings options === (see [[#What is Mongo doing?|mongo phases description]] below). * PHASE_CREATE - setting for [[#Create phase|create phase]]: <tt>on/off</tt> (<tt>on</tt> by default). * PHASE_COPY - setting for [[#Copy phase|copy phase]]. This option requires one of the following values: <tt>off/cp/list"</tt>. In "cp" mode <tt>cp(1)</tt> is invoked to copy files. In "list" mode (deafult) uses [[#Copy phase|mongo_copy]] to copy files. See <tt>mongo_copy.c</tt> for details. * PHASE_APPEND - setting for [[#Append phase|append phase]]: <tt>on/off</tt> (<tt>on</tt> by default). * PHASE_MODIFY - setting for [[#Modify phase|modify phase]]: <tt>on/off</tt> (<tt>on</tt> by default). 
* PHASE_OVERWRITE - setting for the [[#Overwrite phase|overwrite phase]]: <tt>on/off</tt> (<tt>on</tt> by default).
* PHASE_READ - setting for the [[#Read phase|read phase]]. This option takes one of the values <tt>off/find/list</tt>. In <tt>"find"</tt> mode <tt>find(1)</tt> is used to read the files; in <tt>"list"</tt> mode (the default) [[#Read phase|mongo_read]] is used. See <tt>mongo_read.c</tt> for details.
* PHASE_STATS - setting for the [[#Stats phase|stats phase]]: <tt>on/off</tt> (<tt>on</tt> by default).
* PHASE_DELETE - setting for the [[#Delete phase|delete phase]]. This option takes one of the values <tt>off/rm/list</tt>. In <tt>"rm"</tt> mode <tt>rm(1)</tt> is used to delete the working file set; in <tt>"list"</tt> mode (the default) [[#Delete phase|mongo_delete]] is used. See <tt>mongo_delete.c</tt> for details.

=== Special required command ===
* RUN - defines one mongo run (while the whole command string defines one mongo session). It starts all default and possibly some special mongo phases (see [[#What is Mongo doing?|below]]), as defined by the options specified before this command. Mongo options keep their values (specified or default) for the whole <tt>mongo.pl</tt> session unless you respecify them. Example:
 # ./mongo.pl LOG=/tmp/logfile1 file_size=10000 \
   bytes=10000000 fstype=reiserfs dev=/dev/hda9 \
   dir=/mnt/testfs RUN log=/tmp/logfile2 \
   mount_options=notail RUN
* <file_to_include> - we recommend placing all the mongo options you want in a file instead of on the command line, since a file is more convenient to edit. Each specification must occupy one line of this file. For example, the previous command can be rewritten, with all the options up to and including the first "RUN" placed in the file <tt>"mongo.opts"</tt>, as:
 # ./mongo.pl log=/tmp/logfile1 @mongo.opts \
   log=/tmp/logfile2 mount_options=notail RUN
<tt>mongo.pl</tt> executes one or more mongo runs defined by the specified options.
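The session semantics described above (options persist across RUN commands unless respecified, and a <tt>@file</tt> argument splices in options read from a file) can be sketched as follows. This is an illustrative Python model under stated assumptions, not the actual <tt>mongo.pl</tt> parser, and the function names are hypothetical:

```python
# Illustrative model of mongo.pl session semantics (not the real parser):
# options persist across RUN commands until respecified; an "@name"
# token splices in tokens read from that file.

def expand_includes(tokens, read_file):
    """Replace each "@name" token with the tokens found in that file
    (nested includes are not handled in this sketch)."""
    out = []
    for tok in tokens:
        if tok.startswith("@"):
            out.extend(read_file(tok[1:]).split())
        else:
            out.append(tok)
    return out

def parse_session(tokens, read_file=lambda name: open(name).read()):
    """Return one options dict per RUN command (keys lower-cased).
    Each RUN snapshots the options accumulated so far, so later runs
    inherit earlier settings unless they are respecified."""
    runs, opts = [], {}
    for tok in expand_includes(tokens, read_file):
        if tok == "RUN":
            runs.append(dict(opts))
        else:
            key, _, val = tok.partition("=")
            opts[key.lower()] = val
    return runs
```

Replaying the two-run example above through this model yields a first run with <tt>log=/tmp/logfile1</tt> and a second run that inherits <tt>fstype</tt>, <tt>dev</tt>, etc. while overriding <tt>log</tt> and adding <tt>mount_options</tt>.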
For each run, mongo.pl creates a tree of mongo statistics (the statistics result tree).

'''WARNING: <tt>mongo.pl</tt> will format each specified device DEV with <tt>mkfs.xxx</tt> and mount it at the DIR directory.'''

== The <tt>mongo_parser.pl</tt> script ==
 # ./mongo_parser.pl log1 [log2 log3 ...] > comparative_table.html
where <tt>log1, log2, log3, ...</tt> are the names of files containing statistics result trees created by mongo.pl. Each of these files should contain exactly one statistics result tree.

'''WARNING: The result trees in all specified files log1, log2, log3, ... must be mutually phase-isomorphic.'''

Example: the result trees in logfile1 and logfile2 from the example above are phase-isomorphic. On the other hand, specifying log1, log2, log3 from the following example is not allowed, since the result trees of log2 and log3 are non-isomorphic (different file_size):
 ./mongo.pl log=log1 file_size=10000 bytes=10000000 \
   fstype=reiserfs dev=/dev/hda9 dir=/mnt/testfs RUN \
   log=log2 mount_options=notail RUN log=log3 file_size=20000
<tt>mongo_parser.pl</tt> creates a comparative HTML table of the specified result trees.

== What is Mongo doing? ==
For each run, Mongo executes 8 default and possibly some special phases. In each phase Mongo runs NPROC processes (the parent plus (NPROC - 1) children) of the appropriate mongo utility and collects a set of mongo statistics. Currently mongo supports three kinds of statistics: REAL_TIME, CPU_TIME, and DF. REAL_TIME and CPU_TIME are timing statistics for the run of the specified number (NPROC) of processes of the given phase. REAL_TIME is the elapsed real time (in seconds) between invocation and termination. CPU_TIME is the system CPU time (in CPU-seconds): the sum of the tms_stime and tms_cstime values in a struct tms as returned by times(2). DF is a space-usage statistic of the specified device DEV. For default phases, DF is the disk space usage in bytes after all previous phases including the current one.
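The two timing statistics can be measured as sketched below. This is a minimal Python illustration of the definitions above (the real scripts are Perl and C); <tt>phase_times</tt> is a hypothetical helper name, and <tt>os.times()</tt> stands in for the <tt>times(2)</tt> call:

```python
import os
import time

def phase_times(run_phase):
    """Measure a phase the way Mongo's timing statistics are defined:
    REAL_TIME is wall-clock seconds between invocation and termination;
    CPU_TIME is system CPU seconds of this process plus its reaped
    children (tms_stime + tms_cstime as returned by times(2))."""
    t0, wall0 = os.times(), time.monotonic()
    run_phase()
    t1, wall1 = os.times(), time.monotonic()
    real_time = wall1 - wall0
    cpu_time = (t1.system - t0.system) + (t1.children_system - t0.children_system)
    return real_time, cpu_time
```

Wrapping a phase that spawns child processes picks up their system time through the <tt>children_system</tt> field, matching the tms_cstime term.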
For the special dd_writing_largefile and dd_reading_largefile phases, DF is the size in bytes of the file created during the corresponding phase. The default mongo phases model basic user processes that use the file API. To run the special mongo phases you must specify the corresponding phase options. Currently Mongo supports 8 default and 2 special phases. Each phase is implemented by the appropriate mongo utility:

=== Create phase ===
The reiser_fract_tree program creates files in a tree of random depth and branching (optionally fsyncing each file):
 # ./reiser_fract_tree <bytes_to_consume> <median_file_size> \
   <max_file_size> <median_dir_nr_files> <max_directory_nr_files> \
   <median_dir_branching> <max_dir_branching> <write_buffer_size> \
   <testfs_mount_point> <print_stats_flag> <max_fname> <flist_name> \
   <sync_flag> <gamma_exponent>
Files vary in size randomly according to the core file-size generator (<tt>off_t determine_size(off_t F, off_t max_size)</tt>) used in reiser_fract_tree. This generator is constructed from random variables with uniform distributions (see [http://nerdbynature.de/bits/mongo/file_size_dist.jpg fig. 1]).

[http://nerdbynature.de/bits/mongo/file_size_dist.jpg FIGURE 1]: The distribution function of the main generator <tt>determine_size()</tt>.

Each of these uniform variables is obtained by mapping the standard GNU pseudo-random generator <tt>rand()</tt>, defined on <tt>[0, RAND_MAX]</tt>, onto a suitable interval <tt>[A, B]</tt> using its high-order bits. The file sizes of the first ''uniform chunk'' lie in <tt>[0, F]</tt>, and <tt>P(file_size in [0, F]) = 1 - gamma</tt>. The probability mass of each subsequent ''uniform chunk'' depends exponentially on its index with exponent <tt>gamma</tt>, and the width of each stride depends exponentially on its index with exponent ''scale'' (we use <tt>scale=10</tt>). <tt>F</tt> is the range of the first ''uniform chunk'' in bytes (the value of the option <tt>FILE_SIZE</tt> in <tt>mongo.pl</tt>).
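The piecewise-uniform distribution described above can be sketched as follows. This is a hedged reconstruction from the prose, not the real <tt>determine_size()</tt> in reiser_fract_tree.c: the exact decay law for the higher chunks' probability mass is an assumption, and the underlying generator here is Python's, not GNU <tt>rand()</tt>:

```python
import random

def determine_size(F, max_size, gamma=0.0, scale=10, rng=random):
    """Sketch of the file-size generator described in the text.
    With probability (1 - gamma) the size is uniform in [0, F];
    otherwise a higher "uniform chunk" is chosen, chunk k covering
    [F * scale**(k-1), F * scale**k], with an exponentially decaying
    share of the remaining mass (ASSUMED geometric decay here --
    see determine_size() in reiser_fract_tree.c for the real code).
    Results are clamped to max_size."""
    if gamma == 0.0 or rng.random() < 1.0 - gamma:
        return min(rng.randint(0, F), max_size)
    # geometric choice of the chunk index: keep climbing with
    # probability gamma while the next stride still fits
    k = 1
    while rng.random() < gamma and F * scale ** k < max_size:
        k += 1
    hi = min(F * scale ** k, max_size)
    lo = min(F * scale ** (k - 1), hi)
    return rng.randint(lo, hi)
```

With <tt>gamma=0</tt> this degenerates to a plain uniform draw on <tt>[0, F]</tt>, matching the default GAMMA=0.0 behavior stated earlier.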
The median file size is hypothesized to be proportional to the average per-file space wastage. (Notice that this implies that, with a more efficient filesystem, file size usage patterns will in the long term move to a lower median file size.) Each file has a maximum size of <tt>max_file_size</tt>. Directories vary in size according to the same distribution function, but with separate parameters controlling the median and maximum number of files within them, and the number of subdirectories within them. The program prunes some empty subdirectories in a manner that causes the parents of leaf directories to branch less than <tt>median_dir_branching</tt>. To avoid having one large file distort the results (which would force you to benchmark many times), set max_file_size to no more than <tt>bytes_to_consume/10</tt>. If <tt>maximum/median</tt> is a small integer, the randomness will be very poor. To isolate the performance consequences of design variations on particular file or directory size ranges, try setting the <tt>median_size</tt> and <tt>max_size</tt> to both equal the maximum of the size range you want to test. In order to provide the same conditions for the various tested filesystems in subsequent phases, <tt>mongo_fract_tree</tt> creates in <tt>/var/tmp</tt> a list of all files, sorted in the order in which they were created.
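A very simplified sketch of the tree layout that the create phase produces is shown below. It only plans (path, size) pairs in memory until the byte budget is spent; the real reiser_fract_tree draws every quantity from determine_size(), prunes empty subdirectories, and writes actual files, none of which this hypothetical <tt>plan_tree</tt> helper does:

```python
import random

def plan_tree(bytes_to_consume, median_file_size, max_file_size,
              median_dir_nr_files, median_dir_branching,
              rng=random, prefix="d0"):
    """Plan a random tree of files as (path, size) pairs, stopping
    when the byte budget or the tree is exhausted.  Plain randint
    stands in for the real distribution; this is only a sketch."""
    plan, budget = [], bytes_to_consume
    stack, i = [prefix], 0
    while stack and budget > 0:
        d = stack.pop()
        # emit a random number of files in this directory
        for _ in range(rng.randint(1, 2 * median_dir_nr_files)):
            size = min(rng.randint(1, 2 * median_file_size),
                       max_file_size, budget)
            plan.append((f"{d}/f{i}", size))
            i += 1
            budget -= size
            if budget <= 0:
                break
        # push a random number of subdirectories to visit later
        for b in range(rng.randint(0, median_dir_branching)):
            stack.append(f"{d}/s{b}")
    return plan
```

The global file counter keeps paths unique, loosely mirroring the flist that mongo_fract_tree records so later phases visit files in a fixed order.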
The order and the files are specified by flist:
 # ./mongo_copy <source_dir> <dest_dir> <writebuffer_size> <flist> <sync_flag>

=== Append phase ===
The <tt>mongo_append</tt> program reads filenames from stdin and appends (filesize * append_factor) bytes to each file, optionally fsyncing it:
 # ./mongo_append <append_factor> <writebuffer_size> <sync_flag>

=== Modify phase ===
The <tt>mongo_modify</tt> program reads filenames from stdin and modifies (filesize * modify_factor) bytes of each file, starting at a random position, optionally fsyncing it:
 # ./mongo_modify <modify_factor> <writebuffer_size> <sync_flag>

=== Overwrite phase ===
This phase uses the <tt>mongo_modify</tt> program with modify_factor = 1, so it modifies filesize bytes, i.e. overwrites (and optionally <tt>fsync()</tt>s) each file.

=== Read phase ===
The <tt>mongo_read</tt> program reads the files created by mongo_fract_tree in the specified order.

=== Stats phase ===
We run <tt>find -type f</tt> on the test partition. Zam believes this is enough to stat all files.

=== Delete phase ===
We run <tt>rm -r</tt> on all files and directories.

=== dd_writing_largefile phase ===
This is a special mongo phase which requires the option DD_MBCOUNT to be specified. We run <tt>dd if=/dev/zero of=DIR/largefile bs=1M count=DD_MBCOUNT</tt>.

=== dd_reading_largefile phase ===
This is a special mongo phase which requires the option DD_MBCOUNT to be specified. We run <tt>dd if=DIR/largefile of=/dev/null bs=1M count=DD_MBCOUNT</tt>.

Look at the source code if you need more information than this introduction contains.

== Mongo output ==
The main purpose of Mongo is to compare filesystem variations. The following mongo options (fs-options) specify these variations: SYSTEM, FSTYPE, DEV, DIR, MOUNT_OPTIONS, SYNC, JOURNAL_DEV, JOURNAL_SIZE. Note that SYSTEM is a "fake" fs-option: it names the kernel version that mongo was run under.
For example: SYSTEM = linux-2.4.19-rc1+01-relocation4.patch+02-commit_super-8-relocation.patch+03-data-logging-24.patch. For the same filesystem variation, Mongo defines one or more phase variations via the following mongo phase-options: REP_COUNTER, WRITE_BUFFER, GAMMA, FILE_SIZE, BYTES, DD_MBCOUNT. These options specify the parameters that are passed to the mongo utilities. For each filesystem variation, <tt>mongo.pl</tt> prepares statistics for one or more phase variations. The <tt>mongo_parser.pl</tt> script prepares a comparative table for one or more filesystem variations. First prepare the mongo output file for each variation using the <tt>mongo.pl</tt> script, making sure that all of these files contain phase-isomorphic result trees (see above). Then pass the filenames to <tt>mongo_parser.pl</tt> (see the usage above), which creates a comparative HTML table (on stdout by default). The filesystem variations are represented in this table by columns of statistics labeled A, B, C, etc., in the order they were specified to <tt>mongo_parser.pl</tt>. The header of the table gives the specification of each variation as the set of mongo fs-options whose values differ between the variations. The absence of an fs-option means it had its default value. If the table represents more than one filesystem variation, we assume by default that A is the main variation and B, C, ... are A-relative variations: all the statistics of B, C, ... are divided by the corresponding statistics of A. Options that have identical values across all filesystem variations are listed in a separate header. The statistics of each phase variation appear under a subheading (numbered #1, #2, ...). [[Mongo/journal_relocation_to_NVRAM|Here]] is an example of a Mongo comparative table.
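The A-relative normalization described above amounts to an entry-wise division, which can be sketched as follows. This is a hedged Python illustration of the arithmetic only; the actual <tt>mongo_parser.pl</tt> is Perl and emits HTML, and the <tt>relative_stats</tt> name is hypothetical:

```python
def relative_stats(columns):
    """Divide the statistics of variations B, C, ... entry-wise by
    those of the first (main) variation A, as in the A-relative
    comparative table.  `columns` maps variation label -> list of
    numbers, in table order; a zero entry in A yields infinity."""
    names = list(columns)
    base = columns[names[0]]
    rel = {names[0]: [1.0] * len(base)}
    for name in names[1:]:
        rel[name] = [v / a if a else float("inf")
                     for v, a in zip(columns[name], base)]
    return rel
```

For timing statistics, a B entry below 1.0 then means variation B was faster than A on that phase, which is how the comparative tables read.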
[[category:ReiserFS]] [[category:Reiser4]] fea0aac3d8e542cb537a862d178cf43c4b70a9d5 1514 1513 2009-06-27T18:50:24Z Chris goe 2 http://web.archive.org/web/20061113154921/http://www.namesys.com/benchmarks/journal_relocation_to_NVRAM.html Mongo is the main benchmark script we use for comparing [[ReiserFS]] variations. Untar the [http://nerdbynature.de/bits/mongo/ archive] in a directory and read the Introduction to Mongo Testsuites. == Introduction to the Mongo Testsuites == Mongo is a set of the programs to test linux filesystems for performance and functionality. The main program is <tt>mongo.pl</tt> script which creates the set of statistics for the file system variations specified by special mongo options. The <tt>mongo_parser.pl</tt> script parses those statistics and creates for them comparative html-table. == The <tt>mongo.pl</tt> script == # ./mongo.pl opt11=val11 opt12=val12 ... \ RUN [opt21=val21 opt22=val22 ... \ RUN opt31=val31 opt32=val32 ... \ RUN ... | @<file to include>], where <tt>opt1j</tt> (j = 1, 2, ...) are required and maybe another mongo options, <tt>optij</tt> (i = 2, ...; j = 1, 2, ...) - mongo options. The expression <tt>optij=valij</tt> means that mongo option <tt>optij</tt> was specified by the value <tt>valij</tt>. Here is a description of acceptable values of all mongo options: === Required mongo options === * FSTYPE - filesystem type (e.g. ext3) * DEV - device file name (e.g. /dev/hda9) * DIR - mount-point for the filesystem (e.g. /mnt/testfs) * FILE_SIZE - file size in bytes (e.g. 10000) used in reiser_fract_tree, this is passed to the main generator function determine_size() (see below). * BYTES - file set size in bytes (e.g. 250000000) created by all instances of reiser_fract_tree in one pass. To have results free from buffer cache influence, it has to satisfy to the property: <tt>BYTES * REP_COUNTER > ramsize</tt>. === Other mongo options === * MKFS - path to the executable file that creates testing filesystem (e.g. [[mkreiserfs]]). 
By default (if it is not reiserfs or ext2) mongo.pl tries to create it by the command <tt>mkfs.''filesystem_name''</tt>, so make sure it is available. * MOUNT_OPTIONS - list of [[mount|mount options]] separated as usual by commas (e.g. rw,notail). * NPROC - number of processes running simultaneously (3 by default). * REP_COUNTER - number of passes of each mongo phase (3 by default). Each mongo statistics is an average value of REP_COUNTER results. So using REP_COUNTER > 1 reduces dispersion and improves mongo statistics. * SYNC - this option requires one of two strings :"on"/"off" ("off" by default). "on" means forcing of syncing to iozone of regular files in create, copy, append, modify phases. * WRITE_BUFFER - read/write buffer size in bytes for mongo utilities (4096 by default). * GAMMA - the exponent of the core file size distribution of the random value generator determine_size() used in mongo_fract_tree() (see below). GAMMA values are in [0,1] (e.g. 0.2, default value is 0.0). * JOURNAL_DEV - journal device name. This is an option only for reiserfs with non-standard journal support. By default [[mkreiserfs]] creates journal on main device (DEV). * JOURNAL_SIZE - journal size in blocks including journal header (e.g. 513). This is an option only for reiserfs with non-standard journal support. By default [[mkreiserfs]] creates journal of standard size (8193). * DD_MBCOUNT - size in megabytes of the large file that we want to read (write) by <tt>dd(1)</tt> program. If this option specified mongo executes two special phases <tt>dd_reading_largefile</tt> and <tt>dd_writing_largefile</tt> (see [[#What is Mongo doing?|mongo phases description]] below). === Special options === * LOG - the name of the file where you wish to store statistics result tree that <tt>mongo.pl</tt> creates for each mongo run (see below). 
Regardless of this option, <tt>mongo.pl</tt> writes all the results into stdout, but we recommend specify it for each file system variations you want to compare, as it will enable you to create comparative html-table by <tt>mongo_parser.pl</tt> script. * INFO_R4 - string information the benchmarked [[Reiser4]] version about. This is required option if <tt>FSTYPE=reiser4</tt> is set. === Mongo phases settings options === (see [[#What is Mongo doing?|mongo phases description]] below). * PHASE_CREATE - setting for [[#Create phase|create phase]]: <tt>on/off</tt> (<tt>on</tt> by default). * PHASE_COPY - setting for [[#Copy phase|copy phase]]. This option requires one of the following values: <tt>off/cp/list"</tt>. In "cp" mode <tt>cp(1)</tt> is invoked to copy files. In "list" mode (deafult) uses [[#Copy phase|mongo_copy]] to copy files. See <tt>mongo_copy.c</tt> for details. * PHASE_APPEND - setting for [[#Append phase|append phase]]: <tt>on/off</tt> (<tt>on</tt> by default). * PHASE_MODIFY - setting for [[#Modify phase|modify phase]]: <tt>on/off</tt> (<tt>on</tt> by default). * PHASE_OVERWRITE - setting for [[#Overwrite phase|overwrite phase]]: <tt>on/off</tt> (<tt>on</tt> by default). * PHASE_READ - setting for [[#Read phase|read phase]]. The required values are <tt>off/find/list</tt>. In <tt>"find"</tt> mode, <tt>find(1)</tt> is used to read the files. In <tt>"list"</tt> mode (deafult) [[#Read phase|mongo_read]] is used. See <tt>mongo_read.c</tt> for details. * PHASE_STATS - setting for [[#Stats phase|stats phase]]: <tt>on/off</tt> (<tt>on</tt> by default). * PHASE_DELETE - setting for [[#Delete phase|delete phase]]. This option requires one of the following values: <tt>off/rm/list</tt>. In <tt>"cp"</tt> mode <tt>rm(1)</tt> is used to delete the working file set. In <tt>"list"</tt> mode (deafult) [[#Delete phase|mongo_delete]] is used. See <tt>mongo_delete.c</tt> for details. 
=== Special required command === * RUN - defines one mongo run (while the whole string defines one mongo session) which starts all default and maybe some special mongo phases (see [[#What is Mongo doing?|below]]) defined by the options specified before this command. The mongo options keep its values (specified or default) during all the <tt>mongo.pl</tt> session unless you respecify another ones. Example: # ./mongo.pl LOG=/tmp/logfile1 file_size=10000 \ bytes=10000000 fstype=reiserfs dev=/dev/hda9 \ dir=/mnt/testfs RUN log=/tmp/logfile2 \ mount_options=notail RUN * <file_to_include> - We recommend to specify all the mongo options you want in one file instead of command string, since to edit a file is more convenient then the command string. Each specification must occupy one string in this file. For example, previous command can be rewritten if you place all the options with first "RUN" in the file <tt>"mongo.opts"</tt>: # ./mongo.pl log=/tmp/logfile1 @mongo.opts \ log=/tmp/logfile2 mount_options=notail RUN <tt>mongo.pl</tt> executes one or more mongo runs defined by specified options. For each run mongo.pl creates the tree of mongo statistics (statistics result tree). '''WARNING: <tt>mongo.pl</tt> will format each specified device DEV by <tt>mkfs.xxx</tt> and mount it at MNT directory.''' == The <tt>mongo_parser.pl</tt> script == #./mongo_parser.pl log1 [log2 log3 ...] > comparative_table.html where <tt>log1, log2, log3, ...</tt> are names of the files which contains statistics result trees created by mongo.pl. Each those file should contain only one statistics result tree. '''WARNING: The result trees of all specified files file1, file2, file3, ... must be mutually phase-isomorphic.''' Example: The result trees of logfile1, logfile2 from the example above are phase-isomorphic. 
On the other hand, specifying of log1, log2, log3 from the following example is not available, since the result trees of log2, log3 are non-isomorphic (different file_size): ./mongo.pl log=log1 file_size=10000 bytes=10000000 \ fstype=reiserfs dev=/dev/hda9 dir=/mnt/testfs RUN \ log=log2 mount_options=notail RUN log=log3 file_size=20000 <tt>mongo_parser.pl</tt> creates a comparative html-table of specified result trees. == What is Mongo doing? == For each run Mongo executes 8 default, and maybe some special phases. In each phase Mongo runs NPROC processes (the parent one with (NPROC - 1) children) defined by appropriate mongo utility and creates the set of mongo statistics. Currently mongo supports three kind of statistics: REAL_TIME, CPU_TIME, and DF. REAL_TIME and CPU_TIME are timing statistics about the run of the specified number (NPROC) of processes of appropriate phase. REAL_TIME is the elapsed real (in seconds) time between invocation and termination. CPU_TIME is the system CPU time (in CPU-seconds) - the sum of the tms_stime and tms_cstime values in a struct tms as returned by times(2). DF is space usage statistic of the specified device DEV. For default phases DF means disc space usage in bytes after all the previous phases including the current one. For the special dd_writing_largefile, dd_reading_largefile phases DF means the size in bytes of the file created during appropriate phases. The default mongo phases model the basic user's processes which use file API. In order to run special mongo phases you should specify special phase-options. Currently Mongo supports 8 default and 2 special phases. 
Each phase defined by appropriate mongo utility: === Create phase === The reiser_fract_tree program creates files in a tree of random depth and branching (maybe fsync each files) # ./reiser_fract_tree <bytes_to_consume> <median_file_size> \ <max_file_size> <median_dir_nr_files> <max_directory_nr_files> \ <median_dir_branching> <max_dir_branching> <write_buffer_size> \ <testfs_mount_point> <print_stats_flag> <max_fname> <flist_name> \ <sync_flag> <gamma_exponent> Files vary in size randomly according to the core file size generator (<tt>off_t determine_size( off_t F, off_t max_size)</tt>) used in reiser_fract_tree. This generator is constructed by random variables that have uniform distributions (see [http://nerdbynature.de/bits/mongo/file_size_dist.jpg fig.1]). [http://nerdbynature.de/bits/mongo/file_size_dist.jpg FIGURE 1]: The distribution function of the main generator <tt>determine_size()</tt>. Every this variable we get by mapping of standard gnu pseudo-random generator <tt>rand()</tt> defined on <tt>[0, RAND_MAX]</tt> onto <tt>[A, B]</tt> for suitable A,B by using high-order bits. The file sizes of first ''uniform chunk'' are in <tt>[0, F]</tt>, and <tt>P(file_size in [0, F]) = 1-gamma</tt>. The square of next ''uniform chunks'' exponentially depends on its number with exponent <tt>gamma</tt>, and the size of the stride exponentially depends on its number with exponent ''scale'' (we use <tt>scale=10</tt>). <tt>F</tt> is the range of first ''uniform chunk'' in bytes (the value of the option <tt>FILE_SIZE</tt> in <tt>mongo.pl</tt>). Median file size is hypothesized to be proportional to the average per file space wastage. Notice how that implies that, with a more efficient filesystem, file size usage patterns will in the long term move to a lower median file size.) It has a maximum size of <tt>max_file_size</tt>. 
Directories vary in size according to the same distribution function, but with separate parameters to control both the median and maximum size for the number of files within them, and the number of subdirectories within them. This program prunes some empty subdirectories in a manner that causes the parents of leaf directories to branch less than the <tt>median_dir_branching</tt>. To avoid having one large file distort the results such that you have to benchmark many times, set max_file_size to not more than <tt>bytes_to_consume/10</tt>. If the <tt>maximum/median</tt> is a small integer, then randomness will be very poor. For isolating the performance consequences of design variations on particular file or directory size ranges, try setting their <tt>median_size</tt> and <tt>max_size</tt> to both equal the max size of the file size range you want to test. In order to provide the same conditions for various testing file systems in next phases <tt>mongo_fract_tree</tt> creates in <tt>/var/tmp</tt> a list of all files sorted in the order they were created in. === Copy phase === <tt>mongo_copy()</tt> program copies files created by reiser_fract_tree in specified order (maybe fsync each new file). The order and the files specified by flist: # ./mongo_copy <source_dir> <dest_dir> <writebuffer_size> <flist> <sync_flag> === Append phase === The <tt>mongo_append</tt> program reads filenames from stdin and appends to each file (filesize * append_factor) bytes, and maybe fsync it: # ./mongo_append <append_factor> <writebuffer_size> <sync_flag> === Modify phase === The <tt>mongo_modify</tt> program reads filenames from stdin and modifies its (filesize * modify_factor) bytes starting with random position, and maybe fsync it: # ./mongo_modify <modify_factor> <writebuffer_size> <sync_flag> === Overwrite phase === This phase uses <tt>mongo_modify</tt> program with modify_factor = 1, so it modifies filesize bytes, i.e. overwrites (and maybe <tt>fsync()</tt>) it. 
=== Read phase === The <tt>mongo_read</tt> program reads files created by mongo_fract_tree in specified order. === Stats phase === We do <tt>find -type f</tt> on the expected partition. Zam believes that it should be enough for stat for all files. === Delete phase === We do <tt>"rm -r"</tt> on all files and directories. === dd_writing_largefile phase === This is a special mongo phase which requires the option DD_MBCOUNT to be specified. We do <tt>dd if=/dev/zero of=DIR/largefile bs=1M count=DD_MBCOUNT"</tt>. === dd_reading_largefile phase === This is a special mongo phase which requires the option DD_MBCOUNT to be specified. We do <tt>dd if=DIR/largefile of=/dev/null bs=1M count=DD_MBCOUNT"</tt>. Look at the source code if you need more information than this introduction contains. == Mongo output == The main purpose of Mongo is comparing of file system variations. The following mongo options (fs-options) are to specify these variations: SYSTEM, FSTYPE, DEV, DIR, MOUNT_OPTIONS, SYNC, JOURNAL_DEV, JOURNAL_SIZE. Note, that SYSTEM is a "fake" fs-option which means the kernel version that the mongo was run under. For example: SYSTEM = linux-2.4.19-rc1+01-relocation4.patch+02-commit_super-8-relocation.patch+03-data-logging-24.patch. For the same file system variation Mongo define one or more phase variations by following mongo phase-options: REP_COUNTER, WRITE_BUFFER, GAMMA, FILE_SIZE, BYTES, DD_MBCOUNT. These options specify the parameters which are passed to the mongo utilities. For each file system variations <tt>mongo.pl</tt> prepares statistics for one or more phase variations. <tt>mongo_parser.pl</tt> script is to prepare comparative table for one or more file system variations. First you should prepare the appropriate mongo output files for each variation by using <tt>mongo.pl</tt> script. Make sure that all these files contain phase-isomorphic result trees (see above). 
Then specify these filenames for <tt>mongo_parser.pl</tt> (see the usage above) which will create comparative html table (by default in stdout). The file system variations are represented in this table by the columns of statistics marked by letter A, B, C, etc.. in the order they were specified in <tt>mongo_parser.pl</tt>. The header of this table contains specifications of each this variation as the set of the same mongo fs-options which have different values. Absence of any fs-option means it was specified by default value. If this table represents more then one file system variations, we assume by default that A is main, and B, C, ... is A-relative variations. It means that all the statistics of B, C,... are divided on the appropriate statistics of A. The options specified by identical values for all the file system variations locates in special header. The statistics of each phase variation are specified by subheading (numerated by #1, #2, ...). [[Mongo/journal_relocation_to_NVRAM|Here] is an example of Mongo comparative table. [[category:ReiserFS]] [[category:Reiser4]] 22fcefa588e69cc404b45463913b6c653e9618cb 1513 1512 2009-06-27T18:48:21Z Chris goe 2 formatting fixes Mongo is the main benchmark script we use for comparing [[ReiserFS]] variations. Untar the [http://nerdbynature.de/bits/mongo/ archive] in a directory and read the Introduction to Mongo Testsuites. == Introduction to the Mongo Testsuites == Mongo is a set of the programs to test linux filesystems for performance and functionality. The main program is <tt>mongo.pl</tt> script which creates the set of statistics for the file system variations specified by special mongo options. The <tt>mongo_parser.pl</tt> script parses those statistics and creates for them comparative html-table. == The <tt>mongo.pl</tt> script == # ./mongo.pl opt11=val11 opt12=val12 ... \ RUN [opt21=val21 opt22=val22 ... \ RUN opt31=val31 opt32=val32 ... \ RUN ... | @<file to include>], where <tt>opt1j</tt> (j = 1, 2, ...) 
are required and maybe another mongo options, <tt>optij</tt> (i = 2, ...; j = 1, 2, ...) - mongo options. The expression <tt>optij=valij</tt> means that mongo option <tt>optij</tt> was specified by the value <tt>valij</tt>. Here is a description of acceptable values of all mongo options: === Required mongo options === * FSTYPE - filesystem type (e.g. ext3) * DEV - device file name (e.g. /dev/hda9) * DIR - mount-point for the filesystem (e.g. /mnt/testfs) * FILE_SIZE - file size in bytes (e.g. 10000) used in reiser_fract_tree, this is passed to the main generator function determine_size() (see below). * BYTES - file set size in bytes (e.g. 250000000) created by all instances of reiser_fract_tree in one pass. To have results free from buffer cache influence, it has to satisfy to the property: <tt>BYTES * REP_COUNTER > ramsize</tt>. === Other mongo options === * MKFS - path to the executable file that creates testing filesystem (e.g. [[mkreiserfs]]). By default (if it is not reiserfs or ext2) mongo.pl tries to create it by the command <tt>mkfs.''filesystem_name''</tt>, so make sure it is available. * MOUNT_OPTIONS - list of [[mount|mount options]] separated as usual by commas (e.g. rw,notail). * NPROC - number of processes running simultaneously (3 by default). * REP_COUNTER - number of passes of each mongo phase (3 by default). Each mongo statistics is an average value of REP_COUNTER results. So using REP_COUNTER > 1 reduces dispersion and improves mongo statistics. * SYNC - this option requires one of two strings :"on"/"off" ("off" by default). "on" means forcing of syncing to iozone of regular files in create, copy, append, modify phases. * WRITE_BUFFER - read/write buffer size in bytes for mongo utilities (4096 by default). * GAMMA - the exponent of the core file size distribution of the random value generator determine_size() used in mongo_fract_tree() (see below). GAMMA values are in [0,1] (e.g. 0.2, default value is 0.0). * JOURNAL_DEV - journal device name. 
This is an option only for reiserfs with non-standard journal support. By default [[mkreiserfs]] creates journal on main device (DEV). * JOURNAL_SIZE - journal size in blocks including journal header (e.g. 513). This is an option only for reiserfs with non-standard journal support. By default [[mkreiserfs]] creates journal of standard size (8193). * DD_MBCOUNT - size in megabytes of the large file that we want to read (write) by <tt>dd(1)</tt> program. If this option specified mongo executes two special phases <tt>dd_reading_largefile</tt> and <tt>dd_writing_largefile</tt> (see [[#What is Mongo doing?|mongo phases description]] below). === Special options === * LOG - the name of the file where you wish to store statistics result tree that <tt>mongo.pl</tt> creates for each mongo run (see below). Regardless of this option, <tt>mongo.pl</tt> writes all the results into stdout, but we recommend specify it for each file system variations you want to compare, as it will enable you to create comparative html-table by <tt>mongo_parser.pl</tt> script. * INFO_R4 - string information the benchmarked [[Reiser4]] version about. This is required option if <tt>FSTYPE=reiser4</tt> is set. === Mongo phases settings options === (see [[#What is Mongo doing?|mongo phases description]] below). * PHASE_CREATE - setting for [[#Create phase|create phase]]: <tt>on/off</tt> (<tt>on</tt> by default). * PHASE_COPY - setting for [[#Copy phase|copy phase]]. This option requires one of the following values: <tt>off/cp/list"</tt>. In "cp" mode <tt>cp(1)</tt> is invoked to copy files. In "list" mode (deafult) uses [[#Copy phase|mongo_copy]] to copy files. See <tt>mongo_copy.c</tt> for details. * PHASE_APPEND - setting for [[#Append phase|append phase]]: <tt>on/off</tt> (<tt>on</tt> by default). * PHASE_MODIFY - setting for [[#Modify phase|modify phase]]: <tt>on/off</tt> (<tt>on</tt> by default). 
* PHASE_OVERWRITE - setting for [[#Overwrite phase|overwrite phase]]: <tt>on/off</tt> (<tt>on</tt> by default). * PHASE_READ - setting for [[#Read phase|read phase]]. The required values are <tt>off/find/list</tt>. In <tt>"find"</tt> mode, <tt>find(1)</tt> is used to read the files. In <tt>"list"</tt> mode (deafult) [[#Read phase|mongo_read]] is used. See <tt>mongo_read.c</tt> for details. * PHASE_STATS - setting for [[#Stats phase|stats phase]]: <tt>on/off</tt> (<tt>on</tt> by default). * PHASE_DELETE - setting for [[#Delete phase|delete phase]]. This option requires one of the following values: <tt>off/rm/list</tt>. In <tt>"cp"</tt> mode <tt>rm(1)</tt> is used to delete the working file set. In <tt>"list"</tt> mode (deafult) [[#Delete phase|mongo_delete]] is used. See <tt>mongo_delete.c</tt> for details. === Special required command === * RUN - defines one mongo run (while the whole string defines one mongo session) which starts all default and maybe some special mongo phases (see [[#What is Mongo doing?|below]]) defined by the options specified before this command. The mongo options keep its values (specified or default) during all the <tt>mongo.pl</tt> session unless you respecify another ones. Example: # ./mongo.pl LOG=/tmp/logfile1 file_size=10000 \ bytes=10000000 fstype=reiserfs dev=/dev/hda9 \ dir=/mnt/testfs RUN log=/tmp/logfile2 \ mount_options=notail RUN * <file_to_include> - We recommend to specify all the mongo options you want in one file instead of command string, since to edit a file is more convenient then the command string. Each specification must occupy one string in this file. For example, previous command can be rewritten if you place all the options with first "RUN" in the file <tt>"mongo.opts"</tt>: # ./mongo.pl log=/tmp/logfile1 @mongo.opts \ log=/tmp/logfile2 mount_options=notail RUN <tt>mongo.pl</tt> executes one or more mongo runs defined by specified options. 
For each run mongo.pl creates a tree of mongo statistics (a statistics result tree).

'''WARNING: <tt>mongo.pl</tt> will format each specified device DEV with <tt>mkfs.xxx</tt> and mount it at the DIR directory.'''

== The <tt>mongo_parser.pl</tt> script ==
 # ./mongo_parser.pl log1 [log2 log3 ...] > comparative_table.html
where <tt>log1, log2, log3, ...</tt> are the names of files containing statistics result trees created by mongo.pl. Each of those files should contain only one statistics result tree.

'''WARNING: The result trees of all specified files log1, log2, log3, ... must be mutually phase-isomorphic.'''

Example: the result trees of logfile1 and logfile2 from the example above are phase-isomorphic. On the other hand, specifying log1, log2, log3 from the following example is not allowed, since the result trees of log2 and log3 are non-isomorphic (different file_size):
 ./mongo.pl log=log1 file_size=10000 bytes=10000000 \
   fstype=reiserfs dev=/dev/hda9 dir=/mnt/testfs RUN \
   log=log2 mount_options=notail RUN log=log3 file_size=20000 RUN
<tt>mongo_parser.pl</tt> creates a comparative HTML table of the specified result trees.

== What is Mongo doing? ==
For each run Mongo executes 8 default, and possibly some special, phases. In each phase Mongo runs NPROC processes (the parent plus (NPROC - 1) children) defined by the appropriate mongo utility and creates a set of mongo statistics. Currently mongo supports three kinds of statistics: REAL_TIME, CPU_TIME, and DF. REAL_TIME and CPU_TIME are timing statistics for the run of the specified number (NPROC) of processes of the corresponding phase. REAL_TIME is the elapsed real time (in seconds) between invocation and termination. CPU_TIME is the system CPU time (in CPU-seconds) - the sum of the tms_stime and tms_cstime values in a struct tms as returned by times(2). DF is the space usage statistic of the specified device DEV. For the default phases DF means disk space usage in bytes after all the previous phases, including the current one.
For the special dd_writing_largefile and dd_reading_largefile phases DF means the size in bytes of the file created during the corresponding phase. The default mongo phases model the basic user processes which use the file API. In order to run the special mongo phases you should specify the special phase-options. Currently Mongo supports 8 default and 2 special phases. Each phase is defined by the appropriate mongo utility:

=== Create phase ===
The reiser_fract_tree program creates files in a tree of random depth and branching (optionally fsync'ing each file):
 # ./reiser_fract_tree <bytes_to_consume> <median_file_size> \
   <max_file_size> <median_dir_nr_files> <max_directory_nr_files> \
   <median_dir_branching> <max_dir_branching> <write_buffer_size> \
   <testfs_mount_point> <print_stats_flag> <max_fname> <flist_name> \
   <sync_flag> <gamma_exponent>
Files vary in size randomly according to the core file size generator (<tt>off_t determine_size(off_t F, off_t max_size)</tt>) used in reiser_fract_tree. This generator is constructed from random variables that have uniform distributions (see [http://nerdbynature.de/bits/mongo/file_size_dist.jpg fig.1]).

[http://nerdbynature.de/bits/mongo/file_size_dist.jpg FIGURE 1]: The distribution function of the main generator <tt>determine_size()</tt>.

Each of these variables is obtained by mapping the standard GNU pseudo-random generator <tt>rand()</tt>, defined on <tt>[0, RAND_MAX]</tt>, onto <tt>[A, B]</tt> for suitable A, B, using high-order bits. The file sizes of the first ''uniform chunk'' lie in <tt>[0, F]</tt>, and <tt>P(file_size in [0, F]) = 1-gamma</tt>. The area of each subsequent ''uniform chunk'' depends exponentially on its number with exponent <tt>gamma</tt>, and the size of the stride depends exponentially on its number with exponent ''scale'' (we use <tt>scale=10</tt>). <tt>F</tt> is the range of the first ''uniform chunk'' in bytes (the value of the option <tt>FILE_SIZE</tt> in <tt>mongo.pl</tt>).
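The chunked distribution described above can be sketched in Python. This is an illustration only, not the actual C generator: the way later chunks are selected here (continue to the next, <tt>scale</tt>-times-wider chunk with probability <tt>gamma</tt>) is an assumption consistent with the description, not code taken from <tt>reiser_fract_tree.c</tt>:

```python
import random

def determine_size(F, max_size, gamma=0.2, scale=10):
    """Sketch of the chunked file-size distribution: with probability
    (1 - gamma) the size is uniform in [0, F] (the first "uniform
    chunk"); each subsequent chunk is `scale` times wider and is
    reached with geometrically decaying probability gamma**n."""
    lo, hi = 0, F
    while hi < max_size and random.random() < gamma:
        lo, hi = hi, min(hi * scale, max_size)  # stride grows by `scale`
    return random.randint(lo, hi)

random.seed(1)
sizes = [determine_size(10000, 1000000) for _ in range(1000)]
frac_small = sum(s <= 10000 for s in sizes) / len(sizes)  # close to 1 - gamma
```

With <tt>gamma=0.2</tt> roughly 80% of the generated sizes fall in the first chunk, matching <tt>P(file_size in [0, F]) = 1-gamma</tt>.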
The median file size is hypothesized to be proportional to the average per-file space wastage. (Notice that this implies that, with a more efficient filesystem, file size usage patterns will in the long term move to a lower median file size.) Files have a maximum size of <tt>max_file_size</tt>. Directories vary in size according to the same distribution function, but with separate parameters to control both the median and maximum number of files within them, and the number of subdirectories within them. The program prunes some empty subdirectories in a manner that causes the parents of leaf directories to branch less than <tt>median_dir_branching</tt>. To avoid having one large file distort the results such that you have to benchmark many times, set max_file_size to not more than <tt>bytes_to_consume/10</tt>. If the <tt>maximum/median</tt> ratio is a small integer, the randomness will be very poor. For isolating the performance consequences of design variations on particular file or directory size ranges, try setting their <tt>median_size</tt> and <tt>max_size</tt> both equal to the maximum size of the file size range you want to test. In order to provide the same conditions for the various file systems under test in the following phases, <tt>reiser_fract_tree</tt> creates in <tt>/var/tmp</tt> a list of all files sorted in the order they were created.

=== Copy phase ===
The <tt>mongo_copy</tt> program copies the files created by reiser_fract_tree in the specified order (optionally fsync'ing each new file).
The order and the files are specified by flist:
 # ./mongo_copy <source_dir> <dest_dir> <writebuffer_size> <flist> <sync_flag>

=== Append phase ===
The <tt>mongo_append</tt> program reads filenames from stdin and appends (filesize * append_factor) bytes to each file, optionally fsync'ing it:
 # ./mongo_append <append_factor> <writebuffer_size> <sync_flag>

=== Modify phase ===
The <tt>mongo_modify</tt> program reads filenames from stdin and modifies (filesize * modify_factor) of each file's bytes starting at a random position, optionally fsync'ing it:
 # ./mongo_modify <modify_factor> <writebuffer_size> <sync_flag>

=== Overwrite phase ===
This phase uses the <tt>mongo_modify</tt> program with modify_factor = 1, so it modifies filesize bytes, i.e. overwrites (and optionally <tt>fsync()</tt>s) each file.

=== Read phase ===
The <tt>mongo_read</tt> program reads the files created by reiser_fract_tree in the specified order.

=== Stats phase ===
We run <tt>find -type f</tt> on the partition in question. Zam believes that this is enough to stat all files.

=== Delete phase ===
We run <tt>rm -r</tt> on all files and directories.

=== dd_writing_largefile phase ===
This is a special mongo phase which requires the option DD_MBCOUNT to be specified. We run <tt>dd if=/dev/zero of=DIR/largefile bs=1M count=DD_MBCOUNT</tt>.

=== dd_reading_largefile phase ===
This is a special mongo phase which requires the option DD_MBCOUNT to be specified. We run <tt>dd if=DIR/largefile of=/dev/null bs=1M count=DD_MBCOUNT</tt>.

Look at the source code if you need more information than this introduction contains.

== Mongo output ==
The main purpose of Mongo is to compare file system variations. The following mongo options (fs-options) specify these variations: SYSTEM, FSTYPE, DEV, DIR, MOUNT_OPTIONS, SYNC, JOURNAL_DEV, JOURNAL_SIZE. Note that SYSTEM is a "fake" fs-option which names the kernel version that mongo was run under.
For example: SYSTEM = linux-2.4.19-rc1+01-relocation4.patch+02-commit_super-8-relocation.patch+03-data-logging-24.patch. For the same file system variation Mongo defines one or more phase variations via the following mongo phase-options: REP_COUNTER, WRITE_BUFFER, GAMMA, FILE_SIZE, BYTES, DD_MBCOUNT. These options specify the parameters which are passed to the mongo utilities. For each file system variation <tt>mongo.pl</tt> prepares statistics for one or more phase variations. The <tt>mongo_parser.pl</tt> script prepares a comparative table for one or more file system variations. First prepare the appropriate mongo output files for each variation using the <tt>mongo.pl</tt> script. Make sure that all these files contain phase-isomorphic result trees (see above). Then pass these filenames to <tt>mongo_parser.pl</tt> (see the usage above), which will create the comparative HTML table (on stdout by default). The file system variations are represented in this table by columns of statistics marked with the letters A, B, C, etc., in the order they were specified to <tt>mongo_parser.pl</tt>. The header of this table contains the specification of each variation as the set of mongo fs-options which have different values. The absence of an fs-option means it was specified with its default value. If the table represents more than one file system variation, we assume by default that A is the main one, and B, C, ... are A-relative variations. This means that all the statistics of B, C, ... are divided by the corresponding statistics of A. The options specified with identical values for all the file system variations are placed in a special header. The statistics of each phase variation are identified by a subheading (numbered #1, #2, ...). Here is an example of a Mongo comparative table.
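The REAL_TIME/CPU_TIME bookkeeping described in [[#What is Mongo doing?|What is Mongo doing?]] can be sketched as follows. This is a minimal illustration rather than mongo's actual Perl code; Python's <tt>os.times()</tt> exposes the same <tt>tms_stime</tt>/<tt>tms_cstime</tt> fields that times(2) returns:

```python
import os
import subprocess
import time

def timed_phase(argv, nproc=3):
    """Run NPROC copies of a phase command and collect the two timing
    statistics mongo reports: REAL_TIME (wall-clock seconds between
    invocation and termination) and CPU_TIME (system CPU time of this
    process plus its waited-for children: tms_stime + tms_cstime)."""
    start = time.time()
    t0 = os.times()
    procs = [subprocess.Popen(argv) for _ in range(nproc)]
    for p in procs:
        p.wait()  # children's CPU time is credited only after wait()
    t1 = os.times()
    real_time = time.time() - start
    cpu_time = (t1.system - t0.system) + (t1.children_system - t0.children_system)
    return real_time, cpu_time

real, cpu = timed_phase(["true"])  # time three trivial processes
```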
[[category:ReiserFS]] [[category:Reiser4]] d9d7df230522115362ffe4a57ab16024fc58b8ab 1512 1511 2009-06-27T18:43:42Z Chris goe 2 paragraph added Mongo is the main benchmark script we use for comparing [[ReiserFS]] variations. Untar the [http://nerdbynature.de/bits/mongo/ archive] in a directory and read the Introduction to Mongo Testsuites. == Introduction to the Mongo Testsuites == Mongo is a set of the programs to test linux filesystems for performance and functionality. The main program is <tt>mongo.pl</tt> script which creates the set of statistics for the file system variations specified by special mongo options. The <tt>mongo_parser.pl</tt> script parses those statistics and creates for them comparative html-table. == The <tt>mongo.pl</tt> script == # ./mongo.pl opt11=val11 opt12=val12 ... \ RUN [opt21=val21 opt22=val22 ... \ RUN opt31=val31 opt32=val32 ... \ RUN ... | @<file to include>], where <tt>opt1j</tt> (j = 1, 2, ...) are required and maybe another mongo options, <tt>optij</tt> (i = 2, ...; j = 1, 2, ...) - mongo options. The expression <tt>optij=valij</tt> means that mongo option <tt>optij</tt> was specified by the value <tt>valij</tt>. Here is a description of acceptable values of all mongo options: === Required mongo options === * FSTYPE - filesystem type (e.g. ext3) * DEV - device file name (e.g. /dev/hda9) * DIR - mount-point for the filesystem (e.g. /mnt/testfs) * FILE_SIZE - file size in bytes (e.g. 10000) used in reiser_fract_tree, this is passed to the main generator function determine_size() (see below). * BYTES - file set size in bytes (e.g. 250000000) created by all instances of reiser_fract_tree in one pass. To have results free from buffer cache influence, it has to satisfy to the property: <tt>BYTES * REP_COUNTER > ramsize</tt>. === Other mongo options === * MKFS - path to the executable file that creates testing filesystem (e.g. [[mkreiserfs]]). 
By default (if it is not reiserfs or ext2) mongo.pl tries to create it by the command <tt>mkfs.''filesystem_name''</tt>, so make sure it is available. * MOUNT_OPTIONS - list of [[mount|mount options]] separated as usual by commas (e.g. rw,notail). * NPROC - number of processes running simultaneously (3 by default). * REP_COUNTER - number of passes of each mongo phase (3 by default). Each mongo statistics is an average value of REP_COUNTER results. So using REP_COUNTER > 1 reduces dispersion and improves mongo statistics. * SYNC - this option requires one of two strings :"on"/"off" ("off" by default). "on" means forcing of syncing to iozone of regular files in create, copy, append, modify phases. * WRITE_BUFFER - read/write buffer size in bytes for mongo utilities (4096 by default). * GAMMA - the exponent of the core file size distribution of the random value generator determine_size() used in mongo_fract_tree() (see below). GAMMA values are in [0,1] (e.g. 0.2, default value is 0.0). * JOURNAL_DEV - journal device name. This is an option only for reiserfs with non-standard journal support. By default [[mkreiserfs]] creates journal on main device (DEV). * JOURNAL_SIZE - journal size in blocks including journal header (e.g. 513). This is an option only for reiserfs with non-standard journal support. By default [[mkreiserfs]] creates journal of standard size (8193). * DD_MBCOUNT - size in megabytes of the large file that we want to read (write) by <tt>dd(1)</tt> program. If this option specified mongo executes two special phases <tt>dd_reading_largefile</tt> and <tt>dd_writing_largefile</tt> (see [[#What is Mongo doing?|mongo phases description]] below). === Special options === * LOG - the name of the file where you wish to store statistics result tree that <tt>mongo.pl</tt> creates for each mongo run (see below). 
Regardless of this option, <tt>mongo.pl</tt> writes all the results into stdout, but we recommend specify it for each file system variations you want to compare, as it will enable you to create comparative html-table by <tt>mongo_parser.pl</tt> script. * INFO_R4 - string information the benchmarked [[Reiser4]] version about. This is required option if <tt>FSTYPE=reiser4</tt> is set. === Mongo phases settings options === (see [[#What is Mongo doing?|mongo phases description]] below). * PHASE_CREATE - setting for [[#Create phase|create phase]]: <tt>on/off</tt> (<tt>on</tt> by default). * PHASE_COPY - setting for [[#Copy phase|copy phase]]. This option requires one of the following values: <tt>off/cp/list"</tt>. In "cp" mode <tt>cp(1)</tt> is invoked to copy files. In "list" mode (deafult) uses [[#Copy phase|mongo_copy]] to copy files. See <tt>mongo_copy.c</tt> for details. * PHASE_APPEND - setting for [[#Append phase|append phase]]: <tt>on/off</tt> (<tt>on</tt> by default). * PHASE_MODIFY - setting for [[#Modify phase|modify phase]]: <tt>on/off</tt> (<tt>on</tt> by default). * PHASE_OVERWRITE - setting for [[#Overwrite phase|overwrite phase]]: <tt>on/off</tt> (<tt>on</tt> by default). * PHASE_READ - setting for [[#Read phase|read phase]]. The required values are <tt>off/find/list</tt>. In <tt>"find"</tt> mode, <tt>find(1)</tt> is used to read the files. In <tt>"list"</tt> mode (deafult) [[#Read phase|mongo_read]] is used. See <tt>mongo_read.c</tt> for details. * PHASE_STATS - setting for [[#Stats phase|stats phase]]: <tt>on/off</tt> (<tt>on</tt> by default). * PHASE_DELETE - setting for [[#Delete phase|delete phase]]. This option requires one of the following values: <tt>off/rm/list</tt>. In <tt>"cp"</tt> mode <tt>rm(1)</tt> is used to delete the working file set. In <tt>"list"</tt> mode (deafult) [[#Delete phase|mongo_delete]] is used. See <tt>mongo_delete.c</tt> for details. 
=== Special required command === * RUN - defines one mongo run (while the whole string defines one mongo session) which starts all default and maybe some special mongo phases (see [[#What is Mongo doing?|below]]) defined by the options specified before this command. The mongo options keep its values (specified or default) during all the <tt>mongo.pl</tt> session unless you respecify another ones. Example: # ./mongo.pl LOG=/tmp/logfile1 file_size=10000 \ bytes=10000000 fstype=reiserfs dev=/dev/hda9 \ dir=/mnt/testfs RUN log=/tmp/logfile2 \ mount_options=notail RUN * <file_to_include> - We recommend to specify all the mongo options you want in one file instead of command string, since to edit a file is more convenient then the command string. Each specification must occupy one string in this file. For example, previous command can be rewritten if you place all the options with first "RUN" in the file <tt>"mongo.opts"</tt>: # ./mongo.pl log=/tmp/logfile1 @mongo.opts \ log=/tmp/logfile2 mount_options=notail RUN <tt>mongo.pl</tt> executes one or more mongo runs defined by specified options. For each run mongo.pl creates the tree of mongo statistics (statistics result tree). '''WARNING: <tt>mongo.pl</tt> will format each specified device DEV by <tt>mkfs.xxx</tt> and mount it at MNT directory.''' == The <tt>mongo_parser.pl</tt> script == #./mongo_parser.pl log1 [log2 log3 ...] > comparative_table.html where <tt>log1, log2, log3, ...</tt> are names of the files which contains statistics result trees created by mongo.pl. Each those file should contain only one statistics result tree. '''WARNING: The result trees of all specified files file1, file2, file3, ... must be mutually phase-isomorphic.''' Example: The result trees of logfile1, logfile2 from the example above are phase-isomorphic. 
On the other hand, specifying of log1, log2, log3 from the following example is not available, since the result trees of log2, log3 are non-isomorphic (different file_size): ./mongo.pl log=log1 file_size=10000 bytes=10000000 \ fstype=reiserfs dev=/dev/hda9 dir=/mnt/testfs RUN \ log=log2 mount_options=notail RUN log=log3 file_size=20000 <tt>mongo_parser.pl</tt> creates a comparative html-table of specified result trees. == What is Mongo doing? == For each run Mongo executes 8 default, and maybe some special phases. In each phase Mongo runs NPROC processes (the parent one with (NPROC - 1) children) defined by appropriate mongo utility and creates the set of mongo statistics. Currently mongo supports three kind of statistics: REAL_TIME, CPU_TIME, and DF. REAL_TIME and CPU_TIME are timing statistics about the run of the specified number (NPROC) of processes of appropriate phase. REAL_TIME is the elapsed real (in seconds) time between invocation and termination. CPU_TIME is the system CPU time (in CPU-seconds) - the sum of the tms_stime and tms_cstime values in a struct tms as returned by times(2). DF is space usage statistic of the specified device DEV. For default phases DF means disc space usage in bytes after all the previous phases including the current one. For the special dd_writing_largefile, dd_reading_largefile phases DF means the size in bytes of the file created during appropriate phases. The default mongo phases model the basic user's processes which use file API. In order to run special mongo phases you should specify special phase-options. Currently Mongo supports 8 default and 2 special phases. 
Each phase defined by appropriate mongo utility: === Create phase === The reiser_fract_tree program creates files in a tree of random depth and branching (maybe fsync each files) # ./reiser_fract_tree <bytes_to_consume> <median_file_size> \ <max_file_size> <median_dir_nr_files> <max_directory_nr_files> \ <median_dir_branching> <max_dir_branching> <write_buffer_size> \ <testfs_mount_point> <print_stats_flag> <max_fname> <flist_name> \ <sync_flag> <gamma_exponent> Files vary in size randomly according to the core file size generator (<tt>off_t determine_size( off_t F, off_t max_size)</tt>) used in reiser_fract_tree. This generator is constructed by random variables that have uniform distributions (see fig.1). FIGURE 1. The distribution function of the main generator determine_size(). Every this variable we get by mapping of standard gnu pseudo-random generator rand() defined on [0, RAND_MAX] onto [A, B] for suitable A,B by using high-order bits. The file sizes of first 'uniform chunk' are in [0, F], and P(file_size in [0, F]) = 1 - gamma. The square of next 'uniform chunks' exponentially depends on its number with exponent 'gamma', and the size of the stride exponentially depends on its number with exponent 'scale' (we use scale = 10). F is the range of first 'uniform chunk' in bytes (the value of the option FILE_SIZE in mongo.pl). Median file size is hypothesized to be proportional to the average per file space wastage. Notice how that implies that, with a more efficient filesystem, file size usage patterns will in the long term move to a lower median file size.) It has a maximum size of max_file_size. Directories vary in size according to the same distribution function, but with separate parameters to control both the median and maximum size for the number of files within them, and the number of subdirectories within them. 
This program prunes some empty subdirectories in a manner that causes the parents of leaf directories to branch less than the median_dir_branching. To avoid having one large file distort the results such that you have to benchmark many times, set max_file_size to not more than bytes_to_consume/10. If the maximum/median is a small integer, then randomness will be very poor. For isolating the performance consequences of design variations on particular file or directory size ranges, try setting their median_size and max_size to both equal the max size of the file size range you want to test. <tt>In order to provide the same conditions for various testing file systems in next phases mongo_fract_tree</tt> creates in /var/tmp a list of all files sorted in the order they were created in. === Copy phase === <tt>mongo_copy()</tt> program copies files created by reiser_fract_tree in specified order (maybe fsync each new file). The order and the files specified by flist: # ./mongo_copy <source_dir> <dest_dir> <writebuffer_size> <flist> <sync_flag> === Append phase === The <tt>mongo_append</tt> program reads filenames from stdin and appends to each file (filesize * append_factor) bytes, and maybe fsync it: # ./mongo_append <append_factor> <writebuffer_size> <sync_flag> === Modify phase === The <tt>mongo_modify</tt> program reads filenames from stdin and modifies its (filesize * modify_factor) bytes starting with random position, and maybe fsync it: # ./mongo_modify <modify_factor> <writebuffer_size> <sync_flag> === Overwrite phase === This phase uses <tt>mongo_modify</tt> program with modify_factor = 1, so it modifies filesize bytes, i.e. overwrites (and maybe <tt>fsync()</tt>) it. === Read phase === The <tt>mongo_read</tt> program reads files created by mongo_fract_tree in specified order. === Stats phase === We do <tt>find -type f</tt> on the expected partition. Zam believes that it should be enough for stat for all files. 
=== Delete phase === We do <tt>"rm -r"</tt> on all files and directories. === dd_writing_largefile phase === This is a special mongo phase which requires the option DD_MBCOUNT to be specified. We do <tt>dd if=/dev/zero of=DIR/largefile bs=1M count=DD_MBCOUNT"</tt>. === dd_reading_largefile phase === This is a special mongo phase which requires the option DD_MBCOUNT to be specified. We do <tt>dd if=DIR/largefile of=/dev/null bs=1M count=DD_MBCOUNT"</tt>. Look at the source code if you need more information than this introduction contains. == Mongo output == The main purpose of Mongo is comparing of file system variations. The following mongo options (fs-options) are to specify these variations: SYSTEM, FSTYPE, DEV, DIR, MOUNT_OPTIONS, SYNC, JOURNAL_DEV, JOURNAL_SIZE. Note, that SYSTEM is a "fake" fs-option which means the kernel version that the mongo was run under. For example: SYSTEM = linux-2.4.19-rc1+01-relocation4.patch+02-commit_super-8-relocation.patch+03-data-logging-24.patch. For the same file system variation Mongo define one or more phase variations by following mongo phase-options: REP_COUNTER, WRITE_BUFFER, GAMMA, FILE_SIZE, BYTES, DD_MBCOUNT. These options specify the parameters which are passed to the mongo utilities. For each file system variations <tt>mongo.pl</tt> prepares statistics for one or more phase variations. <tt>mongo_parser.pl</tt> script is to prepare comparative table for one or more file system variations. First you should prepare the appropriate mongo output files for each variation by using <tt>mongo.pl</tt> script. Make sure that all these files contain phase-isomorphic result trees (see above). Then specify these filenames for <tt>mongo_parser.pl</tt> (see the usage above) which will create comparative html table (by default in stdout). The file system variations are represented in this table by the columns of statistics marked by letter A, B, C, etc.. in the order they were specified in <tt>mongo_parser.pl</tt>. 
The header of this table contains specifications of each this variation as the set of the same mongo fs-options which have different values. Absence of any fs-option means it was specified by default value. If this table represents more then one file system variations, we assume by default that A is main, and B, C, ... is A-relative variations. It means that all the statistics of B, C,... are divided on the appropriate statistics of A. The options specified by identical values for all the file system variations locates in special header. The statistics of each phase variation are specified by subheading (numerated by #1, #2, ...). Here is an example of Mongo comparative table. [[category:ReiserFS]] [[category:Reiser4]] ab96b5a6870dafbdb7888e519bcbac4826af4150 1511 1510 2009-06-27T18:42:11Z Chris goe 2 formatting fixes Mongo is the main benchmark script we use for comparing [[ReiserFS]] variations. Untar the [http://nerdbynature.de/bits/mongo/ archive] in a directory and read the Introduction to Mongo Testsuites. == Introduction to the Mongo Testsuites == Mongo is a set of the programs to test linux filesystems for performance and functionality. The main program is <tt>mongo.pl</tt> script which creates the set of statistics for the file system variations specified by special mongo options. The <tt>mongo_parser.pl</tt> script parses those statistics and creates for them comparative html-table. == The <tt>mongo.pl</tt> script == # ./mongo.pl opt11=val11 opt12=val12 ... \ RUN [opt21=val21 opt22=val22 ... \ RUN opt31=val31 opt32=val32 ... \ RUN ... | @<file to include>], where <tt>opt1j</tt> (j = 1, 2, ...) are required and maybe another mongo options, <tt>optij</tt> (i = 2, ...; j = 1, 2, ...) - mongo options. The expression <tt>optij=valij</tt> means that mongo option <tt>optij</tt> was specified by the value <tt>valij</tt>. Here is a description of acceptable values of all mongo options: === Required mongo options === * FSTYPE - filesystem type (e.g. 
ext3) * DEV - device file name (e.g. /dev/hda9) * DIR - mount-point for the filesystem (e.g. /mnt/testfs) * FILE_SIZE - file size in bytes (e.g. 10000) used in reiser_fract_tree, this is passed to the main generator function determine_size() (see below). * BYTES - file set size in bytes (e.g. 250000000) created by all instances of reiser_fract_tree in one pass. To have results free from buffer cache influence, it has to satisfy to the property: <tt>BYTES * REP_COUNTER > ramsize</tt>. === Other mongo options === * MKFS - path to the executable file that creates testing filesystem (e.g. [[mkreiserfs]]). By default (if it is not reiserfs or ext2) mongo.pl tries to create it by the command <tt>mkfs.''filesystem_name''</tt>, so make sure it is available. * MOUNT_OPTIONS - list of [[mount|mount options]] separated as usual by commas (e.g. rw,notail). * NPROC - number of processes running simultaneously (3 by default). * REP_COUNTER - number of passes of each mongo phase (3 by default). Each mongo statistics is an average value of REP_COUNTER results. So using REP_COUNTER > 1 reduces dispersion and improves mongo statistics. * SYNC - this option requires one of two strings :"on"/"off" ("off" by default). "on" means forcing of syncing to iozone of regular files in create, copy, append, modify phases. * WRITE_BUFFER - read/write buffer size in bytes for mongo utilities (4096 by default). * GAMMA - the exponent of the core file size distribution of the random value generator determine_size() used in mongo_fract_tree() (see below). GAMMA values are in [0,1] (e.g. 0.2, default value is 0.0). * JOURNAL_DEV - journal device name. This is an option only for reiserfs with non-standard journal support. By default [[mkreiserfs]] creates journal on main device (DEV). * JOURNAL_SIZE - journal size in blocks including journal header (e.g. 513). This is an option only for reiserfs with non-standard journal support. By default [[mkreiserfs]] creates journal of standard size (8193). 
* DD_MBCOUNT - size in megabytes of the large file that we want to read (write) by <tt>dd(1)</tt> program. If this option specified mongo executes two special phases <tt>dd_reading_largefile</tt> and <tt>dd_writing_largefile</tt> (see [[#What is Mongo doing?|mongo phases description]] below). === Special options === * LOG - the name of the file where you wish to store statistics result tree that <tt>mongo.pl</tt> creates for each mongo run (see below). Regardless of this option, <tt>mongo.pl</tt> writes all the results into stdout, but we recommend specify it for each file system variations you want to compare, as it will enable you to create comparative html-table by <tt>mongo_parser.pl</tt> script. * INFO_R4 - string information the benchmarked [[Reiser4]] version about. This is required option if <tt>FSTYPE=reiser4</tt> is set. === Mongo phases settings options === (see [[#What is Mongo doing?|mongo phases description]] below). * PHASE_CREATE - setting for [[#Create phase|create phase]]: <tt>on/off</tt> (<tt>on</tt> by default). * PHASE_COPY - setting for [[#Copy phase|copy phase]]. This option requires one of the following values: <tt>off/cp/list"</tt>. In "cp" mode <tt>cp(1)</tt> is invoked to copy files. In "list" mode (deafult) uses [[#Copy phase|mongo_copy]] to copy files. See <tt>mongo_copy.c</tt> for details. * PHASE_APPEND - setting for [[#Append phase|append phase]]: <tt>on/off</tt> (<tt>on</tt> by default). * PHASE_MODIFY - setting for [[#Modify phase|modify phase]]: <tt>on/off</tt> (<tt>on</tt> by default). * PHASE_OVERWRITE - setting for [[#Overwrite phase|overwrite phase]]: <tt>on/off</tt> (<tt>on</tt> by default). * PHASE_READ - setting for [[#Read phase|read phase]]. The required values are <tt>off/find/list</tt>. In <tt>"find"</tt> mode, <tt>find(1)</tt> is used to read the files. In <tt>"list"</tt> mode (deafult) [[#Read phase|mongo_read]] is used. See <tt>mongo_read.c</tt> for details. 
* PHASE_STATS - setting for [[#Stats phase|stats phase]]: <tt>on/off</tt> (<tt>on</tt> by default). * PHASE_DELETE - setting for [[#Delete phase|delete phase]]. This option requires one of the following values: <tt>off/rm/list</tt>. In <tt>"cp"</tt> mode <tt>rm(1)</tt> is used to delete the working file set. In <tt>"list"</tt> mode (deafult) [[#Delete phase|mongo_delete]] is used. See <tt>mongo_delete.c</tt> for details. === Special required command === * RUN - defines one mongo run (while the whole string defines one mongo session) which starts all default and maybe some special mongo phases (see [[#What is Mongo doing?|below]]) defined by the options specified before this command. The mongo options keep its values (specified or default) during all the <tt>mongo.pl</tt> session unless you respecify another ones. Example: # ./mongo.pl LOG=/tmp/logfile1 file_size=10000 \ bytes=10000000 fstype=reiserfs dev=/dev/hda9 \ dir=/mnt/testfs RUN log=/tmp/logfile2 \ mount_options=notail RUN * <file_to_include> - We recommend to specify all the mongo options you want in one file instead of command string, since to edit a file is more convenient then the command string. Each specification must occupy one string in this file. For example, previous command can be rewritten if you place all the options with first "RUN" in the file <tt>"mongo.opts"</tt>: # ./mongo.pl log=/tmp/logfile1 @mongo.opts \ log=/tmp/logfile2 mount_options=notail RUN <tt>mongo.pl</tt> executes one or more mongo runs defined by specified options. For each run mongo.pl creates the tree of mongo statistics (statistics result tree). '''WARNING: <tt>mongo.pl</tt> will format each specified device DEV by <tt>mkfs.xxx</tt> and mount it at MNT directory.''' The <tt>mongo_parser.pl</tt> script: #./mongo_parser.pl log1 [log2 log3 ...] > comparative_table.html where <tt>log1, log2, log3, ...</tt> are names of the files which contains statistics result trees created by mongo.pl. 
Each those file should contain only one statistics result tree. '''WARNING: The result trees of all specified files file1, file2, file3, ... must be mutually phase-isomorphic.''' Example: The result trees of logfile1, logfile2 from the example above are phase-isomorphic. On the other hand, specifying of log1, log2, log3 from the following example is not available, since the result trees of log2, log3 are non-isomorphic (different file_size): ./mongo.pl log=log1 file_size=10000 bytes=10000000 \ fstype=reiserfs dev=/dev/hda9 dir=/mnt/testfs RUN \ log=log2 mount_options=notail RUN log=log3 file_size=20000 <tt>mongo_parser.pl</tt> creates a comparative html-table of specified result trees. == What is Mongo doing? == For each run Mongo executes 8 default, and maybe some special phases. In each phase Mongo runs NPROC processes (the parent one with (NPROC - 1) children) defined by appropriate mongo utility and creates the set of mongo statistics. Currently mongo supports three kind of statistics: REAL_TIME, CPU_TIME, and DF. REAL_TIME and CPU_TIME are timing statistics about the run of the specified number (NPROC) of processes of appropriate phase. REAL_TIME is the elapsed real (in seconds) time between invocation and termination. CPU_TIME is the system CPU time (in CPU-seconds) - the sum of the tms_stime and tms_cstime values in a struct tms as returned by times(2). DF is space usage statistic of the specified device DEV. For default phases DF means disc space usage in bytes after all the previous phases including the current one. For the special dd_writing_largefile, dd_reading_largefile phases DF means the size in bytes of the file created during appropriate phases. The default mongo phases model the basic user's processes which use file API. In order to run special mongo phases you should specify special phase-options. Currently Mongo supports 8 default and 2 special phases. 
Each phase is defined by the appropriate mongo utility:

=== Create phase ===

The reiser_fract_tree program creates files in a tree of random depth and branching (optionally fsync'ing each file):

 # ./reiser_fract_tree <bytes_to_consume> <median_file_size> \
   <max_file_size> <median_dir_nr_files> <max_directory_nr_files> \
   <median_dir_branching> <max_dir_branching> <write_buffer_size> \
   <testfs_mount_point> <print_stats_flag> <max_fname> <flist_name> \
   <sync_flag> <gamma_exponent>

Files vary in size randomly according to the core file size generator (<tt>off_t determine_size(off_t F, off_t max_size)</tt>) used in reiser_fract_tree. This generator is constructed from random variables that have uniform distributions (see fig. 1).

FIGURE 1. The distribution function of the main generator determine_size().

Each of these variables is obtained by mapping the standard GNU pseudo-random generator rand(), defined on [0, RAND_MAX], onto [A, B] for suitable A and B, using the high-order bits. The file sizes of the first 'uniform chunk' lie in [0, F], and P(file_size in [0, F]) = 1 - gamma. The probability mass of each subsequent 'uniform chunk' depends exponentially on its number with exponent 'gamma', and the width of the stride depends exponentially on its number with exponent 'scale' (we use scale = 10). F is the range of the first 'uniform chunk' in bytes (the value of the option FILE_SIZE in mongo.pl).

Median file size is hypothesized to be proportional to the average per-file space wastage. (Notice how that implies that, with a more efficient filesystem, file size usage patterns will in the long term move to a lower median file size.) A file has a maximum size of max_file_size.

Directories vary in size according to the same distribution function, but with separate parameters to control both the median and maximum number of files within them, and the number of subdirectories within them.
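A rough Python model may make this distribution clearer. This is a sketch under the reading above, not the actual reiser_fract_tree code: the first chunk is uniform on [0, F] with probability 1 - gamma, the tail chunks widen by a factor of 'scale', and their probability decays geometrically with 'gamma':

```python
import random

def determine_size(F, max_size, gamma=0.2, scale=10):
    # First 'uniform chunk': size uniform on [0, F],
    # chosen with probability 1 - gamma.
    if random.random() < 1.0 - gamma:
        size = random.randint(0, F)
    else:
        # Tail chunks: chunk n covers [F * scale**(n-1), F * scale**n];
        # its probability decays geometrically with exponent gamma.
        n = 1
        while random.random() < gamma:
            n += 1
        size = random.randint(F * scale ** (n - 1), F * scale ** n)
    return min(size, max_size)   # clamp at max_file_size
```

With gamma = 0 (the mongo.pl default for GAMMA) every size falls in the first chunk, i.e. sizes are simply uniform on [0, FILE_SIZE].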
This program prunes some empty subdirectories in a manner that causes the parents of leaf directories to branch less than median_dir_branching.

To avoid having one large file distort the results such that you have to benchmark many times, set max_file_size to not more than bytes_to_consume/10. If the maximum/median ratio is a small integer, then randomness will be very poor. For isolating the performance consequences of design variations on particular file or directory size ranges, try setting their median_size and max_size to both equal the maximum size of the file size range you want to test.

In order to provide the same conditions for the various filesystems under test in the following phases, <tt>reiser_fract_tree</tt> creates in /var/tmp a list of all files, sorted in the order they were created.

=== Copy phase ===

The <tt>mongo_copy</tt> program copies the files created by reiser_fract_tree in the specified order (optionally fsync'ing each new file). The files and their order are given by flist:

 # ./mongo_copy <source_dir> <dest_dir> <writebuffer_size> <flist> <sync_flag>

=== Append phase ===

The <tt>mongo_append</tt> program reads filenames from stdin and appends (filesize * append_factor) bytes to each file, optionally fsync'ing it:

 # ./mongo_append <append_factor> <writebuffer_size> <sync_flag>

=== Modify phase ===

The <tt>mongo_modify</tt> program reads filenames from stdin and modifies (filesize * modify_factor) bytes of each file starting at a random position, optionally fsync'ing it:

 # ./mongo_modify <modify_factor> <writebuffer_size> <sync_flag>

=== Overwrite phase ===

This phase uses the <tt>mongo_modify</tt> program with modify_factor = 1, so it modifies filesize bytes, i.e. overwrites (and optionally <tt>fsync()</tt>s) each file.

=== Read phase ===

The <tt>mongo_read</tt> program reads the files created by reiser_fract_tree in the specified order.

=== Stats phase ===

We run <tt>find -type f</tt> on the test partition. Zam believes that this should be enough to stat all the files.
=== Delete phase ===

We run <tt>rm -r</tt> on all files and directories.

=== dd_writing_largefile phase ===

This is a special mongo phase which requires the option DD_MBCOUNT to be specified. We run <tt>dd if=/dev/zero of=DIR/largefile bs=1M count=DD_MBCOUNT</tt>.

=== dd_reading_largefile phase ===

This is a special mongo phase which requires the option DD_MBCOUNT to be specified. We run <tt>dd if=DIR/largefile of=/dev/null bs=1M count=DD_MBCOUNT</tt>.

Look at the source code if you need more information than this introduction contains.

== Mongo output ==

The main purpose of Mongo is to compare filesystem variations. The following mongo options (fs-options) specify these variations: SYSTEM, FSTYPE, DEV, DIR, MOUNT_OPTIONS, SYNC, JOURNAL_DEV, JOURNAL_SIZE. Note that SYSTEM is a "fake" fs-option which names the kernel version that mongo was run under. For example: SYSTEM = linux-2.4.19-rc1+01-relocation4.patch+02-commit_super-8-relocation.patch+03-data-logging-24.patch.

For the same filesystem variation Mongo defines one or more phase variations via the following mongo phase-options: REP_COUNTER, WRITE_BUFFER, GAMMA, FILE_SIZE, BYTES, DD_MBCOUNT. These options specify the parameters which are passed to the mongo utilities. For each filesystem variation <tt>mongo.pl</tt> prepares statistics for one or more phase variations.

The <tt>mongo_parser.pl</tt> script prepares a comparative table for one or more filesystem variations. First, prepare the mongo output files for each variation using the <tt>mongo.pl</tt> script. Make sure that all these files contain phase-isomorphic result trees (see above). Then pass these filenames to <tt>mongo_parser.pl</tt> (see the usage above), which creates a comparative html table (on stdout by default). The filesystem variations are represented in this table by columns of statistics marked with the letters A, B, C, etc., in the order they were specified to <tt>mongo_parser.pl</tt>.
The header of this table contains the specification of each variation as the set of mongo fs-options which have different values. Absence of an fs-option means it was left at its default value. If the table represents more than one filesystem variation, we assume by default that A is the main one, and B, C, ... are A-relative variations. This means that all the statistics of B, C, ... are divided by the corresponding statistics of A. The options which have identical values for all the filesystem variations are placed in a special header. The statistics of each phase variation are given under a subheading (numbered #1, #2, ...). Here is an example of a Mongo comparative table.

[[category:ReiserFS]]
[[category:Reiser4]]
ext3) * DEV - device file name (e.g. /dev/hda9) * DIR - mount-point for the filesystem (e.g. /mnt/testfs) * FILE_SIZE - file size in bytes (e.g. 10000) used in reiser_fract_tree, this is passed to the main generator function determine_size() (see below). * BYTES - file set size in bytes (e.g. 250000000) created by all instances of reiser_fract_tree in one pass. To have results free from buffer cache influence, it has to satisfy to the property: <tt>BYTES * REP_COUNTER > ramsize</tt>. === Other mongo options === * MKFS - path to the executable file that creates testing filesystem (e.g. [[mkreiserfs]]). By default (if it is not reiserfs or ext2) mongo.pl tries to create it by the command <tt>mkfs.''filesystem_name''</tt>, so make sure it is available. * MOUNT_OPTIONS - list of [[mount|mount options]] separated as usual by commas (e.g. rw,notail). * NPROC - number of processes running simultaneously (3 by default). * REP_COUNTER - number of passes of each mongo phase (3 by default). Each mongo statistics is an average value of REP_COUNTER results. So using REP_COUNTER > 1 reduces dispersion and improves mongo statistics. * SYNC - this option requires one of two strings :"on"/"off" ("off" by default). "on" means forcing of syncing to iozone of regular files in create, copy, append, modify phases. * WRITE_BUFFER - read/write buffer size in bytes for mongo utilities (4096 by default). * GAMMA - the exponent of the core file size distribution of the random value generator determine_size() used in mongo_fract_tree() (see below). GAMMA values are in [0,1] (e.g. 0.2, default value is 0.0). * JOURNAL_DEV - journal device name. This is an option only for reiserfs with non-standard journal support. By default [[mkreiserfs]] creates journal on main device (DEV). * JOURNAL_SIZE - journal size in blocks including journal header (e.g. 513). This is an option only for reiserfs with non-standard journal support. By default [[mkreiserfs]] creates journal of standard size (8193). 
* DD_MBCOUNT - size in megabytes of the large file that we want to read (write) by <tt>dd(1)</tt> program. If this option specified mongo executes two special phases <tt>dd_reading_largefile</tt> and <tt>dd_writing_largefile</tt> (see [[#What is Mongo doing?|mongo phases description]] below). === Special options === * LOG - the name of the file where you wish to store statistics result tree that <tt>mongo.pl</tt> creates for each mongo run (see below). Regardless of this option, <tt>mongo.pl</tt> writes all the results into stdout, but we recommend specify it for each file system variations you want to compare, as it will enable you to create comparative html-table by <tt>mongo_parser.pl</tt> script. * INFO_R4 - string information the benchmarked [[Reiser4]] version about. This is required option if <tt>FSTYPE=reiser4</tt> is set. === Mongo phases settings options === (see [[#What is Mongo doing?|mongo phases description]] below). * PHASE_CREATE - setting for [[#Create phase|create phase]]: <tt>on/off</tt> (<tt>on</tt> by default). * PHASE_COPY - setting for [[#Copy phase|copy phase]]. This option requires one of the following values: <tt>off/cp/list"</tt>. In "cp" mode <tt>cp(1)</tt> is invoked to copy files. In "list" mode (deafult) uses [[#Copy phase|mongo_copy]] to copy files. See <tt>mongo_copy.c</tt> for details. * PHASE_APPEND - setting for [[#Append phase|append phase]]: <tt>on/off</tt> (<tt>on</tt> by default). * PHASE_MODIFY - setting for [[#Modify phase|modify phase]]: <tt>on/off</tt> (<tt>on</tt> by default). * PHASE_OVERWRITE - setting for [[#Overwrite phase|overwrite phase]]: <tt>on/off</tt> (<tt>on</tt> by default). * PHASE_READ - setting for [[#Read phase|read phase]]. The required values are <tt>off/find/list</tt>. In <tt>"find"</tt> mode, <tt>find(1)</tt> is used to read the files. In <tt>"list"</tt> mode (deafult) [[#Read phase|mongo_read]] is used. See <tt>mongo_read.c</tt> for details. 
* PHASE_STATS - setting for [[#Stats phase|stats phase]]: <tt>on/off</tt> (<tt>on</tt> by default). * PHASE_DELETE - setting for [[#Delete phase|delete phase]]. This option requires one of the following values: <tt>off/rm/list</tt>. In <tt>"cp"</tt> mode <tt>rm(1)</tt> is used to delete the working file set. In <tt>"list"</tt> mode (deafult) [[#Delete phase|mongo_delete]] is used. See <tt>mongo_delete.c</tt> for details. === Special required command === * RUN - defines one mongo run (while the whole string defines one mongo session) which starts all default and maybe some special mongo phases (see [[#What is Mongo doing?|below]]) defined by the options specified before this command. The mongo options keep its values (specified or default) during all the <tt>mongo.pl</tt> session unless you respecify another ones. Example: # ./mongo.pl LOG=/tmp/logfile1 file_size=10000 \ bytes=10000000 fstype=reiserfs dev=/dev/hda9 \ dir=/mnt/testfs RUN log=/tmp/logfile2 \ mount_options=notail RUN * <file_to_include> - We recommend to specify all the mongo options you want in one file instead of command string, since to edit a file is more convenient then the command string. Each specification must occupy one string in this file. For example, previous command can be rewritten if you place all the options with first "RUN" in the file <tt>"mongo.opts"</tt>: # ./mongo.pl log=/tmp/logfile1 @mongo.opts \ log=/tmp/logfile2 mount_options=notail RUN <tt>mongo.pl</tt> executes one or more mongo runs defined by specified options. For each run mongo.pl creates the tree of mongo statistics (statistics result tree). WARNING: <tt>mongo.pl</tt> will format each specified device DEV by <tt>mkfs.xxx</tt> and mount it at MNT directory. The <tt>mongo_parser.pl</tt> script: #./mongo_parser.pl log1 [log2 log3 ...] > comparative_table.html where <tt>log1, log2, log3, ...</tt> are names of the files which contains statistics result trees created by mongo.pl. 
Each those file should contain only one statistics result tree. '''WARNING: The result trees of all specified files file1, file2, file3, ... must be mutually phase-isomorphic.''' Example: The result trees of logfile1, logfile2 from the example above are phase-isomorphic. On the other hand, specifying of log1, log2, log3 from the following example is not available, since the result trees of log2, log3 are non-isomorphic (different file_size): ./mongo.pl log=log1 file_size=10000 bytes=10000000 \ fstype=reiserfs dev=/dev/hda9 dir=/mnt/testfs RUN \ log=log2 mount_options=notail RUN log=log3 file_size=20000 <tt>mongo_parser.pl</tt> creates a comparative html-table of specified result trees. == What is Mongo doing? == For each run Mongo executes 8 default, and maybe some special phases. In each phase Mongo runs NPROC processes (the parent one with (NPROC - 1) children) defined by appropriate mongo utility and creates the set of mongo statistics. Currently mongo supports three kind of statistics: REAL_TIME, CPU_TIME, and DF. REAL_TIME and CPU_TIME are timing statistics about the run of the specified number (NPROC) of processes of appropriate phase. REAL_TIME is the elapsed real (in seconds) time between invocation and termination. CPU_TIME is the system CPU time (in CPU-seconds) - the sum of the tms_stime and tms_cstime values in a struct tms as returned by times(2). DF is space usage statistic of the specified device DEV. For default phases DF means disc space usage in bytes after all the previous phases including the current one. For the special dd_writing_largefile, dd_reading_largefile phases DF means the size in bytes of the file created during appropriate phases. The default mongo phases model the basic user's processes which use file API. In order to run special mongo phases you should specify special phase-options. Currently Mongo supports 8 default and 2 special phases. 
Each phase defined by appropriate mongo utility: === Create phase === The reiser_fract_tree program creates files in a tree of random depth and branching (maybe fsync each files) # ./reiser_fract_tree <bytes_to_consume> <median_file_size> \ <max_file_size> <median_dir_nr_files> <max_directory_nr_files> \ <median_dir_branching> <max_dir_branching> <write_buffer_size> \ <testfs_mount_point> <print_stats_flag> <max_fname> <flist_name> \ <sync_flag> <gamma_exponent> Files vary in size randomly according to the core file size generator (<tt>off_t determine_size( off_t F, off_t max_size)</tt>) used in reiser_fract_tree. This generator is constructed by random variables that have uniform distributions (see fig.1). FIGURE 1. The distribution function of the main generator determine_size(). Every this variable we get by mapping of standard gnu pseudo-random generator rand() defined on [0, RAND_MAX] onto [A, B] for suitable A,B by using high-order bits. The file sizes of first 'uniform chunk' are in [0, F], and P(file_size in [0, F]) = 1 - gamma. The square of next 'uniform chunks' exponentially depends on its number with exponent 'gamma', and the size of the stride exponentially depends on its number with exponent 'scale' (we use scale = 10). F is the range of first 'uniform chunk' in bytes (the value of the option FILE_SIZE in mongo.pl). Median file size is hypothesized to be proportional to the average per file space wastage. Notice how that implies that, with a more efficient filesystem, file size usage patterns will in the long term move to a lower median file size.) It has a maximum size of max_file_size. Directories vary in size according to the same distribution function, but with separate parameters to control both the median and maximum size for the number of files within them, and the number of subdirectories within them. 
This program prunes some empty subdirectories in a manner that causes the parents of leaf directories to branch less than the median_dir_branching. To avoid having one large file distort the results such that you have to benchmark many times, set max_file_size to not more than bytes_to_consume/10. If the maximum/median is a small integer, then randomness will be very poor. For isolating the performance consequences of design variations on particular file or directory size ranges, try setting their median_size and max_size to both equal the max size of the file size range you want to test. <tt>In order to provide the same conditions for various testing file systems in next phases mongo_fract_tree</tt> creates in /var/tmp a list of all files sorted in the order they were created in. === Copy phase === <tt>mongo_copy()</tt> program copies files created by reiser_fract_tree in specified order (maybe fsync each new file). The order and the files specified by flist: # ./mongo_copy <source_dir> <dest_dir> <writebuffer_size> <flist> <sync_flag> === Append phase === The <tt>mongo_append</tt> program reads filenames from stdin and appends to each file (filesize * append_factor) bytes, and maybe fsync it: # ./mongo_append <append_factor> <writebuffer_size> <sync_flag> === Modify phase === The <tt>mongo_modify</tt> program reads filenames from stdin and modifies its (filesize * modify_factor) bytes starting with random position, and maybe fsync it: # ./mongo_modify <modify_factor> <writebuffer_size> <sync_flag> === Overwrite phase === This phase uses <tt>mongo_modify</tt> program with modify_factor = 1, so it modifies filesize bytes, i.e. overwrites (and maybe <tt>fsync()</tt>) it. === Read phase === The <tt>mongo_read</tt> program reads files created by mongo_fract_tree in specified order. === Stats phase === We do <tt>find -type f</tt> on the expected partition. Zam believes that it should be enough for stat for all files. 
=== Delete phase === We do <tt>"rm -r"</tt> on all files and directories. === dd_writing_largefile phase === This is a special mongo phase which requires the option DD_MBCOUNT to be specified. We do <tt>dd if=/dev/zero of=DIR/largefile bs=1M count=DD_MBCOUNT"</tt>. === dd_reading_largefile phase === This is a special mongo phase which requires the option DD_MBCOUNT to be specified. We do <tt>dd if=DIR/largefile of=/dev/null bs=1M count=DD_MBCOUNT"</tt>. Look at the source code if you need more information than this introduction contains. == Mongo output == The main purpose of Mongo is comparing of file system variations. The following mongo options (fs-options) are to specify these variations: SYSTEM, FSTYPE, DEV, DIR, MOUNT_OPTIONS, SYNC, JOURNAL_DEV, JOURNAL_SIZE. Note, that SYSTEM is a "fake" fs-option which means the kernel version that the mongo was run under. For example: SYSTEM = linux-2.4.19-rc1+01-relocation4.patch+02-commit_super-8-relocation.patch+03-data-logging-24.patch. For the same file system variation Mongo define one or more phase variations by following mongo phase-options: REP_COUNTER, WRITE_BUFFER, GAMMA, FILE_SIZE, BYTES, DD_MBCOUNT. These options specify the parameters which are passed to the mongo utilities. For each file system variations <tt>mongo.pl</tt> prepares statistics for one or more phase variations. <tt>mongo_parser.pl</tt> script is to prepare comparative table for one or more file system variations. First you should prepare the appropriate mongo output files for each variation by using <tt>mongo.pl</tt> script. Make sure that all these files contain phase-isomorphic result trees (see above). Then specify these filenames for <tt>mongo_parser.pl</tt> (see the usage above) which will create comparative html table (by default in stdout). The file system variations are represented in this table by the columns of statistics marked by letter A, B, C, etc.. in the order they were specified in <tt>mongo_parser.pl</tt>. 
The header of this table contains specifications of each this variation as the set of the same mongo fs-options which have different values. Absence of any fs-option means it was specified by default value. If this table represents more then one file system variations, we assume by default that A is main, and B, C, ... is A-relative variations. It means that all the statistics of B, C,... are divided on the appropriate statistics of A. The options specified by identical values for all the file system variations locates in special header. The statistics of each phase variation are specified by subheading (numerated by #1, #2, ...). Here is an example of Mongo comparative table. [[category:ReiserFS]] [[category:Reiser4]] a816c211ddc40a5a84a30aae810d2421b4a48fde 1509 1508 2009-06-27T18:39:36Z Chris goe 2 minor formatting fixes, archive url changed Mongo is the main benchmark script we use for comparing [[ReiserFS]] variations. Untar the [http://nerdbynature.de/bits/mongo/ archive] in a directory and read the Introduction to Mongo Testsuites. == Introduction to the Mongo Testsuites == Mongo is a set of the programs to test linux filesystems for performance and functionality. The main program is <tt>mongo.pl</tt> script which creates the set of statistics for the file system variations specified by special mongo options. The <tt>mongo_parser.pl</tt> script parses those statistics and creates for them comparative html-table. == The <tt>mongo.pl</tt> script == # ./mongo.pl opt11=val11 opt12=val12 ... \ RUN [opt21=val21 opt22=val22 ... \ RUN opt31=val31 opt32=val32 ... \ RUN ... | @<file to include>], where <tt>opt1j</tt> (j = 1, 2, ...) are required and maybe another mongo options, <tt>optij</tt> (i = 2, ...; j = 1, 2, ...) - mongo options. The expression <tt>optij=valij</tt> means that mongo option <tt>optij</tt> was specified by the value <tt>valij</tt>. 
Here is a description of acceptable values of all mongo options: === Required mongo options === * FSTYPE - filesystem type (e.g. ext3) * DEV - device file name (e.g. /dev/hda9) * DIR - mount-point for the filesystem (e.g. /mnt/testfs) * FILE_SIZE - file size in bytes (e.g. 10000) used in reiser_fract_tree, this is passed to the main generator function determine_size() (see below). * BYTES - file set size in bytes (e.g. 250000000) created by all instances of reiser_fract_tree in one pass. To have results free from buffer cache influence, it has to satisfy to the property: <tt>BYTES * REP_COUNTER > ramsize</tt>. === Other mongo options === * MKFS - path to the executable file that creates testing filesystem (e.g. [[mkreiserfs]]). By default (if it is not reiserfs or ext2) mongo.pl tries to create it by the command <tt>mkfs.''filesystem_name''</tt>, so make sure it is available. * MOUNT_OPTIONS - list of [[mount|mount options]] separated as usual by commas (e.g. rw,notail). * NPROC - number of processes running simultaneously (3 by default). * REP_COUNTER - number of passes of each mongo phase (3 by default). Each mongo statistics is an average value of REP_COUNTER results. So using REP_COUNTER > 1 reduces dispersion and improves mongo statistics. * SYNC - this option requires one of two strings :"on"/"off" ("off" by default). "on" means forcing of syncing to iozone of regular files in create, copy, append, modify phases. * WRITE_BUFFER - read/write buffer size in bytes for mongo utilities (4096 by default). * GAMMA - the exponent of the core file size distribution of the random value generator determine_size() used in mongo_fract_tree() (see below). GAMMA values are in [0,1] (e.g. 0.2, default value is 0.0). * JOURNAL_DEV - journal device name. This is an option only for reiserfs with non-standard journal support. By default [[mkreiserfs]] creates journal on main device (DEV). * JOURNAL_SIZE - journal size in blocks including journal header (e.g. 513). 
This is an option only for reiserfs with non-standard journal support. By default [[mkreiserfs]] creates journal of standard size (8193). * DD_MBCOUNT - size in megabytes of the large file that we want to read (write) by <tt>dd(1)</tt> program. If this option specified mongo executes two special phases <tt>dd_reading_largefile</tt> and <tt>dd_writing_largefile</tt> (see [[#What is Mongo doing?|mongo phases description]] below). === Special options === * LOG - the name of the file where you wish to store statistics result tree that <tt>mongo.pl</tt> creates for each mongo run (see below). Regardless of this option, <tt>mongo.pl</tt> writes all the results into stdout, but we recommend specify it for each file system variations you want to compare, as it will enable you to create comparative html-table by <tt>mongo_parser.pl</tt> script. * INFO_R4 - string information the benchmarked [[Reiser4]] version about. This is required option if <tt>FSTYPE=reiser4</tt> is set. === Mongo phases settings options === (see [[#What is Mongo doing?|mongo phases description]] below). * PHASE_CREATE - setting for [[#Create phase|create phase]]: <tt>on/off</tt> (<tt>on</tt> by default). * PHASE_COPY - setting for copy phase. This option requires one of the following values: <tt>off/cp/list"</tt>. In "cp" mode <tt>cp(1)</tt> is invoked to copy files. In "list" mode (deafult) uses [[#Copy phase|mongo_copy]] to copy files. See <tt>mongo_copy.c</tt> for details. * PHASE_APPEND - setting for [[#Append phase|append phase]]: <tt>on/off</tt> (<tt>on</tt> by default). * PHASE_MODIFY - setting for [[#Modify phase|modify phase]]: <tt>on/off</tt> (<tt>on</tt> by default). * PHASE_OVERWRITE - setting for [[#Overwrite phase|overwrite phase]]: <tt>on/off</tt> (<tt>on</tt> by default). * PHASE_READ - setting for [[#Read phase|read phase]]. The required values are <tt>off/find/list</tt>. In <tt>"find"</tt> mode, <tt>find(1)</tt> is used to read the files. 
In <tt>"list"</tt> mode (deafult) [[#Read phase|mongo_read]] is used. See <tt>mongo_read.c</tt> for details. * PHASE_STATS - setting for [[#Stats phase|stats phase]]: <tt>on/off</tt> (<tt>on</tt> by default). * PHASE_DELETE - setting for [[#Delete phase|delete phase]]. This option requires one of the following values: <tt>off/rm/list</tt>. In <tt>"cp"</tt> mode <tt>rm(1)</tt> is used to delete the working file set. In <tt>"list"</tt> mode (deafult) [[#Delete phase|mongo_delete]] is used. See <tt>mongo_delete.c</tt> for details. === Special required command === * RUN - defines one mongo run (while the whole string defines one mongo session) which starts all default and maybe some special mongo phases (see [[#What is Mongo doing?|below]]) defined by the options specified before this command. The mongo options keep its values (specified or default) during all the <tt>mongo.pl</tt> session unless you respecify another ones. Example: # ./mongo.pl LOG=/tmp/logfile1 file_size=10000 \ bytes=10000000 fstype=reiserfs dev=/dev/hda9 \ dir=/mnt/testfs RUN log=/tmp/logfile2 \ mount_options=notail RUN * <file_to_include> - We recommend to specify all the mongo options you want in one file instead of command string, since to edit a file is more convenient then the command string. Each specification must occupy one string in this file. For example, previous command can be rewritten if you place all the options with first "RUN" in the file <tt>"mongo.opts"</tt>: # ./mongo.pl log=/tmp/logfile1 @mongo.opts \ log=/tmp/logfile2 mount_options=notail RUN <tt>mongo.pl</tt> executes one or more mongo runs defined by specified options. For each run mongo.pl creates the tree of mongo statistics (statistics result tree). WARNING: <tt>mongo.pl</tt> will format each specified device DEV by <tt>mkfs.xxx</tt> and mount it at MNT directory. The <tt>mongo_parser.pl</tt> script: #./mongo_parser.pl log1 [log2 log3 ...] 
> comparative_table.html where <tt>log1, log2, log3, ...</tt> are names of the files which contains statistics result trees created by mongo.pl. Each those file should contain only one statistics result tree. '''WARNING: The result trees of all specified files file1, file2, file3, ... must be mutually phase-isomorphic.''' Example: The result trees of logfile1, logfile2 from the example above are phase-isomorphic. On the other hand, specifying of log1, log2, log3 from the following example is not available, since the result trees of log2, log3 are non-isomorphic (different file_size): ./mongo.pl log=log1 file_size=10000 bytes=10000000 \ fstype=reiserfs dev=/dev/hda9 dir=/mnt/testfs RUN \ log=log2 mount_options=notail RUN log=log3 file_size=20000 <tt>mongo_parser.pl</tt> creates a comparative html-table of specified result trees. == What is Mongo doing? == For each run Mongo executes 8 default, and maybe some special phases. In each phase Mongo runs NPROC processes (the parent one with (NPROC - 1) children) defined by appropriate mongo utility and creates the set of mongo statistics. Currently mongo supports three kind of statistics: REAL_TIME, CPU_TIME, and DF. REAL_TIME and CPU_TIME are timing statistics about the run of the specified number (NPROC) of processes of appropriate phase. REAL_TIME is the elapsed real (in seconds) time between invocation and termination. CPU_TIME is the system CPU time (in CPU-seconds) - the sum of the tms_stime and tms_cstime values in a struct tms as returned by times(2). DF is space usage statistic of the specified device DEV. For default phases DF means disc space usage in bytes after all the previous phases including the current one. For the special dd_writing_largefile, dd_reading_largefile phases DF means the size in bytes of the file created during appropriate phases. The default mongo phases model the basic user's processes which use file API. In order to run special mongo phases you should specify special phase-options. 
Currently Mongo supports 8 default and 2 special phases. Each phase defined by appropriate mongo utility: === Create phase === The reiser_fract_tree program creates files in a tree of random depth and branching (maybe fsync each files) # ./reiser_fract_tree <bytes_to_consume> <median_file_size> \ <max_file_size> <median_dir_nr_files> <max_directory_nr_files> \ <median_dir_branching> <max_dir_branching> <write_buffer_size> \ <testfs_mount_point> <print_stats_flag> <max_fname> <flist_name> \ <sync_flag> <gamma_exponent> Files vary in size randomly according to the core file size generator (<tt>off_t determine_size( off_t F, off_t max_size)</tt>) used in reiser_fract_tree. This generator is constructed by random variables that have uniform distributions (see fig.1). FIGURE 1. The distribution function of the main generator determine_size(). Every this variable we get by mapping of standard gnu pseudo-random generator rand() defined on [0, RAND_MAX] onto [A, B] for suitable A,B by using high-order bits. The file sizes of first 'uniform chunk' are in [0, F], and P(file_size in [0, F]) = 1 - gamma. The square of next 'uniform chunks' exponentially depends on its number with exponent 'gamma', and the size of the stride exponentially depends on its number with exponent 'scale' (we use scale = 10). F is the range of first 'uniform chunk' in bytes (the value of the option FILE_SIZE in mongo.pl). Median file size is hypothesized to be proportional to the average per file space wastage. Notice how that implies that, with a more efficient filesystem, file size usage patterns will in the long term move to a lower median file size.) It has a maximum size of max_file_size. Directories vary in size according to the same distribution function, but with separate parameters to control both the median and maximum size for the number of files within them, and the number of subdirectories within them. 
This program prunes some empty subdirectories in a manner that causes the parents of leaf directories to branch less than median_dir_branching. To avoid having one large file distort the results (which would force you to benchmark many times), set max_file_size to no more than bytes_to_consume/10. If maximum/median is a small integer, the randomness will be very poor. To isolate the performance consequences of design variations on particular file or directory size ranges, try setting their median_size and max_size both equal to the maximum of the file-size range you want to test. To provide the same conditions for the various filesystems under test in subsequent phases, <tt>mongo_fract_tree</tt> creates in /var/tmp a list of all files, sorted in the order they were created. === Copy phase === The <tt>mongo_copy</tt> program copies the files created by reiser_fract_tree in the specified order (optionally fsyncing each new file). The order and the files are specified by flist: # ./mongo_copy <source_dir> <dest_dir> <writebuffer_size> <flist> <sync_flag> === Append phase === The <tt>mongo_append</tt> program reads filenames from stdin and appends (filesize * append_factor) bytes to each file, optionally fsyncing it: # ./mongo_append <append_factor> <writebuffer_size> <sync_flag> === Modify phase === The <tt>mongo_modify</tt> program reads filenames from stdin and modifies (filesize * modify_factor) bytes of each file starting at a random position, optionally fsyncing it: # ./mongo_modify <modify_factor> <writebuffer_size> <sync_flag> === Overwrite phase === This phase uses the <tt>mongo_modify</tt> program with modify_factor = 1, so it modifies filesize bytes, i.e. overwrites the whole file (and optionally <tt>fsync()</tt>s it). === Read phase === The <tt>mongo_read</tt> program reads the files created by mongo_fract_tree in the specified order. === Stats phase === We run <tt>find -type f</tt> on the partition under test. Zam believes this is enough to stat all files.
=== Delete phase === We run <tt>rm -r</tt> on all files and directories. === dd_writing_largefile phase === This is a special mongo phase which requires the option DD_MBCOUNT to be specified. We run <tt>dd if=/dev/zero of=DIR/largefile bs=1M count=DD_MBCOUNT</tt>. === dd_reading_largefile phase === This is a special mongo phase which requires the option DD_MBCOUNT to be specified. We run <tt>dd if=DIR/largefile of=/dev/null bs=1M count=DD_MBCOUNT</tt>. Look at the source code if you need more information than this introduction contains. == Mongo output == The main purpose of Mongo is comparing filesystem variations. The following mongo options (fs-options) specify these variations: SYSTEM, FSTYPE, DEV, DIR, MOUNT_OPTIONS, SYNC, JOURNAL_DEV, JOURNAL_SIZE. Note that SYSTEM is a "fake" fs-option which names the kernel version that mongo was run under. For example: SYSTEM = linux-2.4.19-rc1+01-relocation4.patch+02-commit_super-8-relocation.patch+03-data-logging-24.patch. For the same filesystem variation, Mongo defines one or more phase variations via the following mongo phase-options: REP_COUNTER, WRITE_BUFFER, GAMMA, FILE_SIZE, BYTES, DD_MBCOUNT. These options specify the parameters passed to the mongo utilities. For each filesystem variation, <tt>mongo.pl</tt> prepares statistics for one or more phase variations. The <tt>mongo_parser.pl</tt> script prepares a comparative table for one or more filesystem variations. First, prepare the appropriate mongo output files for each variation using the <tt>mongo.pl</tt> script. Make sure that all these files contain phase-isomorphic result trees (see above). Then pass these filenames to <tt>mongo_parser.pl</tt> (see the usage above), which will create a comparative HTML table (on stdout by default). The filesystem variations are represented in this table by columns of statistics labeled A, B, C, etc., in the order they were specified to <tt>mongo_parser.pl</tt>.
The header of this table contains the specification of each variation as the set of mongo fs-options that have different values. The absence of an fs-option means it was given its default value. If the table represents more than one filesystem variation, we assume by default that A is the main variation and B, C, ... are variations relative to A; that is, all statistics of B, C, ... are divided by the corresponding statistics of A. Options that have identical values for all filesystem variations are listed in a separate header. The statistics of each phase variation appear under a subheading (numbered #1, #2, ...). Here is an example of a Mongo comparative table. [[category:ReiserFS]] [[category:Reiser4]] Mongo is the main benchmark script we use for comparing [[ReiserFS]] variations. Untar the [http://web.archive.org/web/20061115034149/thebsh.namesys.com/benchmarks/dist/ archive] in a directory and read the Introduction to Mongo Testsuites. == Introduction to the Mongo Testsuites == Mongo is a set of programs for testing Linux filesystems for performance and functionality. The main program is the <tt>mongo.pl</tt> script, which creates a set of statistics for the filesystem variations specified by special mongo options. The <tt>mongo_parser.pl</tt> script parses those statistics and builds a comparative HTML table from them. == The <tt>mongo.pl</tt> script == # ./mongo.pl opt11=val11 opt12=val12 ... \ RUN [opt21=val21 opt22=val22 ... \ RUN opt31=val31 opt32=val32 ... \ RUN ... | @<file to include>], where <tt>opt1j</tt> (j = 1, 2, ...) are the required (and possibly other) mongo options, and <tt>optij</tt> (i = 2, ...; j = 1, 2, ...) are mongo options. The expression <tt>optij=valij</tt> means that the mongo option <tt>optij</tt> is set to the value <tt>valij</tt>.
Here is a description of the acceptable values of all mongo options: === Required mongo options === * FSTYPE - filesystem type (e.g. ext3) * DEV - device file name (e.g. /dev/hda9) * DIR - mount point for the filesystem (e.g. /mnt/testfs) * FILE_SIZE - file size in bytes (e.g. 10000) used in reiser_fract_tree; this is passed to the main generator function determine_size() (see below). * BYTES - size in bytes of the file set (e.g. 250000000) created by all instances of reiser_fract_tree in one pass. To keep the results free from buffer-cache influence, it must satisfy <tt>BYTES * REP_COUNTER > ramsize</tt>. === Other mongo options === * MKFS - path to the executable that creates the test filesystem (e.g. [[mkreiserfs]]). By default (if the filesystem is not reiserfs or ext2), mongo.pl tries to create it with the command <tt>mkfs.''filesystem_name''</tt>, so make sure that command is available. * MOUNT_OPTIONS - list of [[mount|mount options]], separated as usual by commas (e.g. rw,notail). * NPROC - number of processes running simultaneously (3 by default). * REP_COUNTER - number of passes of each mongo phase (3 by default). Each mongo statistic is the average of REP_COUNTER results, so using REP_COUNTER > 1 reduces dispersion and improves the statistics. * SYNC - one of the two strings "on"/"off" ("off" by default). "on" forces syncing of regular files in the create, copy, append, and modify phases. * WRITE_BUFFER - read/write buffer size in bytes for the mongo utilities (4096 by default). * GAMMA - the exponent of the core file-size distribution of the random-value generator determine_size() used in mongo_fract_tree() (see below). GAMMA values lie in [0,1] (e.g. 0.2; the default is 0.0). * JOURNAL_DEV - journal device name. This option applies only to reiserfs with non-standard journal support. By default [[mkreiserfs]] creates the journal on the main device (DEV). * JOURNAL_SIZE - journal size in blocks, including the journal header (e.g. 513).
This option applies only to reiserfs with non-standard journal support. By default [[mkreiserfs]] creates a journal of the standard size (8193 blocks). * DD_MBCOUNT - size in megabytes of the large file that we want to read (write) with the <tt>dd(1)</tt> program. If this option is specified, mongo executes the two special phases <tt>dd_reading_largefile</tt> and <tt>dd_writing_largefile</tt> (see [[#What is Mongo doing?|mongo phases description]] below). === Special options === * LOG - the name of the file in which to store the statistics result tree that <tt>mongo.pl</tt> creates for each mongo run (see below). Regardless of this option, <tt>mongo.pl</tt> writes all results to stdout, but we recommend specifying LOG for each filesystem variation you want to compare, as it will enable you to create a comparative HTML table with the <tt>mongo_parser.pl</tt> script. * INFO_R4 - a string describing the benchmarked [[Reiser4]] version. This option is required if <tt>FSTYPE=reiser4</tt> is set. === Mongo phases settings options === (see [[#What is Mongo doing?|mongo phases description]] below). * PHASE_CREATE - setting for the create phase: "on"/"off" ("on" by default). * PHASE_COPY - setting for the copy phase. This option takes one of the following values: "off"/"cp"/"list". In "cp" mode cp(1) is invoked to copy the files. In "list" mode (the default) mongo_copy is used to copy the files. See mongo_copy.c for details. * PHASE_APPEND - setting for the append phase: "on"/"off" ("on" by default). * PHASE_MODIFY - setting for the modify phase: "on"/"off" ("on" by default). * PHASE_OVERWRITE - setting for the overwrite phase: "on"/"off" ("on" by default). * PHASE_READ - setting for the read phase. The accepted values are "off"/"find"/"list". In "find" mode find(1) is used to read the files. In "list" mode (the default) mongo_read is used. See mongo_read.c for details. * PHASE_STATS - setting for the stats phase: "on"/"off" ("on" by default). * PHASE_DELETE - setting for the delete phase.
This option takes one of the following values: "off"/"rm"/"list". In "rm" mode rm(1) is used to delete the working file set. In "list" mode (the default) mongo_delete is used. See mongo_delete.c for details. === Special required command === * RUN - defines one mongo run (while the whole command string defines one mongo session), which starts all default and possibly some special mongo phases (see [[#What is Mongo doing?|below]]) defined by the options specified before this command. The mongo options keep their values (specified or default) for the whole mongo.pl session unless you respecify them. Example: # ./mongo.pl LOG=/tmp/logfile1 file_size=10000 \ bytes=10000000 fstype=reiserfs dev=/dev/hda9 \ dir=/mnt/testfs RUN log=/tmp/logfile2 \ mount_options=notail RUN * <file_to_include> - We recommend specifying all the mongo options you want in a file rather than on the command line, since editing a file is more convenient than editing a command string. Each specification must occupy one line of this file. For example, the previous command can be rewritten by placing all the options up to and including the first "RUN" in the file <tt>"mongo.opts"</tt>: # ./mongo.pl log=/tmp/logfile1 @mongo.opts \ log=/tmp/logfile2 mount_options=notail RUN <tt>mongo.pl</tt> executes one or more mongo runs defined by the specified options. For each run, mongo.pl creates a tree of mongo statistics (the statistics result tree). WARNING: <tt>mongo.pl</tt> will format each specified device DEV with <tt>mkfs.xxx</tt> and mount it at the MNT directory. The <tt>mongo_parser.pl</tt> script: #./mongo_parser.pl log1 [log2 log3 ...] > comparative_table.html where <tt>log1, log2, log3, ...</tt> are the names of files containing statistics result trees created by mongo.pl. Each of those files should contain exactly one statistics result tree. '''WARNING: The result trees of all specified files file1, file2, file3, ... must be mutually phase-isomorphic.''' Example: the result trees of logfile1 and logfile2 from the example above are phase-isomorphic.
On the other hand, specifying of log1, log2, log3 from the following example is not available, since the result trees of log2, log3 are non-isomorphic (different file_size): ./mongo.pl log=log1 file_size=10000 bytes=10000000 \ fstype=reiserfs dev=/dev/hda9 dir=/mnt/testfs RUN \ log=log2 mount_options=notail RUN log=log3 file_size=20000 <tt>mongo_parser.pl</tt> creates a comparative html-table of specified result trees. == What is Mongo doing? == For each run Mongo executes 8 default, and maybe some special phases. In each phase Mongo runs NPROC processes (the parent one with (NPROC - 1) children) defined by appropriate mongo utility and creates the set of mongo statistics. Currently mongo supports three kind of statistics: REAL_TIME, CPU_TIME, and DF. REAL_TIME and CPU_TIME are timing statistics about the run of the specified number (NPROC) of processes of appropriate phase. REAL_TIME is the elapsed real (in seconds) time between invocation and termination. CPU_TIME is the system CPU time (in CPU-seconds) - the sum of the tms_stime and tms_cstime values in a struct tms as returned by times(2). DF is space usage statistic of the specified device DEV. For default phases DF means disc space usage in bytes after all the previous phases including the current one. For the special dd_writing_largefile, dd_reading_largefile phases DF means the size in bytes of the file created during appropriate phases. The default mongo phases model the basic user's processes which use file API. In order to run special mongo phases you should specify special phase-options. Currently Mongo supports 8 default and 2 special phases. 
Each phase defined by appropriate mongo utility: === Create phase === The reiser_fract_tree program creates files in a tree of random depth and branching (maybe fsync each files) # ./reiser_fract_tree <bytes_to_consume> <median_file_size> \ <max_file_size> <median_dir_nr_files> <max_directory_nr_files> \ <median_dir_branching> <max_dir_branching> <write_buffer_size> \ <testfs_mount_point> <print_stats_flag> <max_fname> <flist_name> \ <sync_flag> <gamma_exponent> Files vary in size randomly according to the core file size generator (<tt>off_t determine_size( off_t F, off_t max_size)</tt>) used in reiser_fract_tree. This generator is constructed by random variables that have uniform distributions (see fig.1). FIGURE 1. The distribution function of the main generator determine_size(). Every this variable we get by mapping of standard gnu pseudo-random generator rand() defined on [0, RAND_MAX] onto [A, B] for suitable A,B by using high-order bits. The file sizes of first 'uniform chunk' are in [0, F], and P(file_size in [0, F]) = 1 - gamma. The square of next 'uniform chunks' exponentially depends on its number with exponent 'gamma', and the size of the stride exponentially depends on its number with exponent 'scale' (we use scale = 10). F is the range of first 'uniform chunk' in bytes (the value of the option FILE_SIZE in mongo.pl). Median file size is hypothesized to be proportional to the average per file space wastage. Notice how that implies that, with a more efficient filesystem, file size usage patterns will in the long term move to a lower median file size.) It has a maximum size of max_file_size. Directories vary in size according to the same distribution function, but with separate parameters to control both the median and maximum size for the number of files within them, and the number of subdirectories within them. 
This program prunes some empty subdirectories in a manner that causes the parents of leaf directories to branch less than the median_dir_branching. To avoid having one large file distort the results such that you have to benchmark many times, set max_file_size to not more than bytes_to_consume/10. If the maximum/median is a small integer, then randomness will be very poor. For isolating the performance consequences of design variations on particular file or directory size ranges, try setting their median_size and max_size to both equal the max size of the file size range you want to test. <tt>In order to provide the same conditions for various testing file systems in next phases mongo_fract_tree</tt> creates in /var/tmp a list of all files sorted in the order they were created in. === Copy phase === <tt>mongo_copy()</tt> program copies files created by reiser_fract_tree in specified order (maybe fsync each new file). The order and the files specified by flist: # ./mongo_copy <source_dir> <dest_dir> <writebuffer_size> <flist> <sync_flag> === Append phase === The <tt>mongo_append</tt> program reads filenames from stdin and appends to each file (filesize * append_factor) bytes, and maybe fsync it: # ./mongo_append <append_factor> <writebuffer_size> <sync_flag> === Modify phase === The <tt>mongo_modify</tt> program reads filenames from stdin and modifies its (filesize * modify_factor) bytes starting with random position, and maybe fsync it: # ./mongo_modify <modify_factor> <writebuffer_size> <sync_flag> === Overwrite phase === This phase uses <tt>mongo_modify</tt> program with modify_factor = 1, so it modifies filesize bytes, i.e. overwrites (and maybe <tt>fsync()</tt>) it. === Read phase === The <tt>mongo_read</tt> program reads files created by mongo_fract_tree in specified order. === Stats phase === We do <tt>find -type f</tt> on the expected partition. Zam believes that it should be enough for stat for all files. 
=== Delete phase === We do <tt>"rm -r"</tt> on all files and directories. === dd_writing_largefile phase === This is a special mongo phase which requires the option DD_MBCOUNT to be specified. We do <tt>dd if=/dev/zero of=DIR/largefile bs=1M count=DD_MBCOUNT"</tt>. === dd_reading_largefile phase === This is a special mongo phase which requires the option DD_MBCOUNT to be specified. We do <tt>dd if=DIR/largefile of=/dev/null bs=1M count=DD_MBCOUNT"</tt>. Look at the source code if you need more information than this introduction contains. == Mongo output == The main purpose of Mongo is comparing of file system variations. The following mongo options (fs-options) are to specify these variations: SYSTEM, FSTYPE, DEV, DIR, MOUNT_OPTIONS, SYNC, JOURNAL_DEV, JOURNAL_SIZE. Note, that SYSTEM is a "fake" fs-option which means the kernel version that the mongo was run under. For example: SYSTEM = linux-2.4.19-rc1+01-relocation4.patch+02-commit_super-8-relocation.patch+03-data-logging-24.patch. For the same file system variation Mongo define one or more phase variations by following mongo phase-options: REP_COUNTER, WRITE_BUFFER, GAMMA, FILE_SIZE, BYTES, DD_MBCOUNT. These options specify the parameters which are passed to the mongo utilities. For each file system variations <tt>mongo.pl</tt> prepares statistics for one or more phase variations. <tt>mongo_parser.pl</tt> script is to prepare comparative table for one or more file system variations. First you should prepare the appropriate mongo output files for each variation by using <tt>mongo.pl</tt> script. Make sure that all these files contain phase-isomorphic result trees (see above). Then specify these filenames for <tt>mongo_parser.pl</tt> (see the usage above) which will create comparative html table (by default in stdout). The file system variations are represented in this table by the columns of statistics marked by letter A, B, C, etc.. in the order they were specified in <tt>mongo_parser.pl</tt>. 
The header of this table contains specifications of each this variation as the set of the same mongo fs-options which have different values. Absence of any fs-option means it was specified by default value. If this table represents more then one file system variations, we assume by default that A is main, and B, C, ... is A-relative variations. It means that all the statistics of B, C,... are divided on the appropriate statistics of A. The options specified by identical values for all the file system variations locates in special header. The statistics of each phase variation are specified by subheading (numerated by #1, #2, ...). Here is an example of Mongo comparative table. [[category:ReiserFS]] [[category:Reiser4]] c03a0d61addae634e0e0e2ed08f815e38e3c3a55 1496 1495 2009-06-27T09:51:42Z Chris goe 2 formatting fixes Mongo is the main benchmark script we use for comparing [[ReiserFS]] variations. Untar the [http://web.archive.org/web/20061115034149/thebsh.namesys.com/benchmarks/dist/ archive] in a directory and read the Introduction to Mongo Testsuites. === Introduction to the Mongo Testsuites === Mongo is a set of the programs to test linux filesystems for performance and functionality. The main program is <tt>mongo.pl</tt> script which creates the set of statistics for the file system variations specified by special mongo options. The <tt>mongo_parser.pl</tt> script parses those statistics and creates for them comparative html-table. The <tt>mongo.pl</tt> script: # ./mongo.pl opt11=val11 opt12=val12 ... \ RUN [opt21=val21 opt22=val22 ... \ RUN opt31=val31 opt32=val32 ... \ RUN ... | @<file to include>], where <tt>opt1j</tt> (j = 1, 2, ...) are required and maybe another mongo options, <tt>optij</tt> (i = 2, ...; j = 1, 2, ...) - mongo options. The expression <tt>optij=valij</tt> means that mongo option <tt>optij</tt> was specified by the value <tt>valij</tt>. 
Here is a description of acceptable values of all mongo options: Required mongo options: * FSTYPE - filesystem type (e.g. ext3) * DEV - device file name (e.g. /dev/hda9) * DIR - mount-point for the filesystem (e.g. /mnt/testfs) * FILE_SIZE - file size in bytes (e.g. 10000) used in reiser_fract_tree, this is passed to the main generator function determine_size() (see below). * BYTES - file set size in bytes (e.g. 250000000) created by all instances of reiser_fract_tree in one pass. To have results free from buffer cache influence, it has to satisfy to the property: <tt>BYTES * REP_COUNTER > ramsize</tt>. Other mongo options: * MKFS - path to the executable file that creates testing filesystem (e.g. /sbin/mkreiserfs). By default (if it is not reiserfs or ext2) mongo.pl tries to create it by the command <tt>mkfs.''filesystem_name''</tt>, so make sure it is available. * MOUNT_OPTIONS - list of mount options separated as usual by commas (e.g. rw,notail). * NPROC - number of processes running simultaneously (3 by default). * REP_COUNTER - number of passes of each mongo phase (3 by default). Each mongo statistics is an average value of REP_COUNTER results. So using REP_COUNTER > 1 reduces dispersion and improves mongo statistics. * SYNC - this option requires one of two strings :"on"/"off" ("off" by default). "on" means forcing of syncing to iozone of regular files in create, copy, append, modify phases. * WRITE_BUFFER - read/write buffer size in bytes for mongo utilities (4096 by default). ( GAMMA - the exponent of the core file size distribution of the random value generator determine_size() used in mongo_fract_tree() (see below). GAMMA values are in [0,1] (e.g. 0.2, default value is 0.0). * JOURNAL_DEV - journal device name. This is an option only for reiserfs with non-standard journal support. By default mkreiserfs creates journal on main device (DEV). * JOURNAL_SIZE - journal size in blocks including journal header (e.g. 513). 
This is an option only for reiserfs with non-standard journal support. By default mkreiserfs creates journal of standard size (8193). * DD_MBCOUNT - size in megabytes of the large file that we want to read (write) by "dd" program. If this option specified mongo executes two special phases "dd_reading_largefile" and "dd_writing_largefile" (see mongo phases description below). Special options: * LOG - the name of the file where you wish to store statistics result tree that mongo.pl creates for each mongo run (see below). Regardless of this option, mongo.pl writes all the results into stdout, but we recommend specify it for each file system variations you want to compare, as it will enable you to create comparative html-table by mongo_parser.pl script. * INFO_R4 - string information the benchmarked reiser4 version about. This is required option if FSTYPE=reiser4 is set. Mongo phases settings options (see mongo phases description below). * PHASE_CREATE - setting for create phase : "on"/"off" ("on" by default). * PHASE_COPY - setting for copy phase. This option requires one of the following values: "off"/"cp"/"list". In "cp" mode cp(1) is invoked to copy files. In "list" mode (deafult) uses mongo_copy to copy files. See mongo_copy.c for details. * PHASE_APPEND - setting for append phase : "on"/"off" ("on" by default). PHASE_MODIFY - setting for modify phase : "on"/"off" ("on" by default). * PHASE_OVERWRITE - setting for overwrite phase : "on"/"off" ("on" by default). * PHASE_READ - setting for read phase. The required values are "off"/"find"/"list". In "find" mode find(1) is used to read the files. In "list" mode (deafult) mongo_read is used. See mongo_read.c for details. * PHASE_STATS - setting for stats phase : "on"/"off" ("on" by default). * PHASE_DELETE - setting for delete phase. This option requires one of the following values: "off"/"rm"/"list". In "cp" mode rm(1) is used to delete the working file set. In "list" mode (deafult) mongo_delete is used. 
See mongo_delete.c for details. Special required command: * RUN - defines one mongo run (while the whole string defines one mongo session) which starts all default and maybe some special mongo phases (see below) defined by the options specified before this command. The mongo options keep its values (specified or default) during all the mongo.pl session unless you respecify another ones. Example: # ./mongo.pl LOG=/tmp/logfile1 file_size=10000 \ bytes=10000000 fstype=reiserfs dev=/dev/hda9 \ dir=/mnt/testfs RUN log=/tmp/logfile2 \ mount_options=notail RUN * <file_to_include> - We recommend to specify all the mongo options you want in one file instead of command string, since to edit a file is more convenient then the command string. Each specification must occupy one string in this file. For example, previous command can be rewritten if you place all the options with first "RUN" in the file "mongo.opts": # ./mongo.pl log=/tmp/logfile1 @mongo.opts \ log=/tmp/logfile2 mount_options=notail RUN <tt>mongo.pl</tt> executes one or more mongo runs defined by specified options. For each run mongo.pl creates the tree of mongo statistics (statistics result tree). WARNING: <tt>mongo.pl</tt> will format each specified device DEV by 'mkfs.xxx' and mount it at MNT directory. The <tt>mongo_parser.pl</tt> script: #./mongo_parser.pl log1 [log2 log3 ...] > comparative_table.html where log1, log2, log3, ... - names of the files which contains statistics result trees created by mongo.pl. Each those file should contain only one statistics result tree. WARNING: The result trees of all specified files file1, file2, file3, ... must be mutually phase-isomorphic. Example: The result trees of logfile1, logfile2 from the example above are phase-isomorphic. 
On the other hand, specifying of log1, log2, log3 from the following example is not available, since the result trees of log2, log3 are non-isomorphic (different file_size).: ./mongo.pl log=log1 file_size=10000 bytes=10000000 fstype=reiserfs dev=/dev/hda9 dir=/mnt/testfs RUN log=log2 mount_options=notail RUN log=log3 file_size=20000 mongo_parser.pl creates a comparative html-table of specified result trees. What is Mongo doing? For each run Mongo executes 8 default, and maybe some special phases. In each phase Mongo runs NPROC processes (the parent one with (NPROC - 1) children) defined by appropriate mongo utility and creates the set of mongo statistics. Currently mongo supports three kind of statistics: REAL_TIME, CPU_TIME, and DF. REAL_TIME and CPU_TIME are timing statistics about the run of the specified number (NPROC) of processes of appropriate phase. REAL_TIME is the elapsed real (in seconds) time between invocation and termination. CPU_TIME is the system CPU time (in CPU-seconds) - the sum of the tms_stime and tms_cstime values in a struct tms as returned by times(2). DF is space usage statistic of the specified device DEV . For default phases DF means disc space usage in bytes after all the previous phases including the current one. For the special dd_writing_largefile, dd_reading_largefile phases DF means the size in bytes of the file created during appropriate phases. The default mongo phases model the basic user's processes which use file API. In order to run special mongo phases you should specify special phase-options. # Currently Mongo supports 8 default and 2 special phases. 
Each phase defined by appropriate mongo utility: Create phase: The reiser_fract_tree program creates files in a tree of random depth and branching (maybe fsync each files) #./reiser_fract_tree <bytes_to_consume> <median_file_size> <max_file_size> <median_dir_nr_files> <max_directory_nr_files> <median_dir_branching> <max_dir_branching> <write_buffer_size> <testfs_mount_point> <print_stats_flag> <max_fname> <flist_name> <sync_flag> <gamma_exponent> Files vary in size randomly according to the core file size generator (off_t determine_size( off_t F, off_t max_size )) used in reiser_fract_tree. This generator is constructed by random variables that have uniform distributions (see fig.1). FIGURE 1. The distribution function of the main generator determine_size(). Every this variable we get by mapping of standard gnu pseudo-random generator rand() defined on [0, RAND_MAX] onto [A, B] for suitable A,B by using high-order bits. The file sizes of first 'uniform chunk' are in [0, F], and P(file_size in [0, F]) = 1 - gamma. The square of next 'uniform chunks' exponentially depends on its number with exponent 'gamma', and the size of the stride exponentially depends on its number with exponent 'scale' (we use scale = 10). F is the range of first 'uniform chunk' in bytes (the value of the option FILE_SIZE in mongo.pl). Median file size is hypothesized to be proportional to the average per file space wastage. Notice how that implies that, with a more efficient filesystem, file size usage patterns will in the long term move to a lower median file size.) It has a maximum size of max_file_size. Directories vary in size according to the same distribution function, but with separate parameters to control both the median and maximum size for the number of files within them, and the number of subdirectories within them. This program prunes some empty subdirectories in a manner that causes the parents of leaf directories to branch less than the median_dir_branching. 
To avoid having one large file distort the results such that you have to benchmark many times, set max_file_size to not more than bytes_to_consume/10. If the maximum/median is a small integer, then randomness will be very poor. For isolating the performance consequences of design variations on particular file or directory size ranges, try setting their median_size and max_size to both equal the max size of the file size range you want to test. In order to provide the same conditions for various testing file systems in next phases mongo_fract_tree creates in /var/tmp a list of all files sorted in the order they were created in. # Copy phase: mongo_copy() program copies files created by reiser_fract_tree in specified order (maybe fsync each new file). The order and the files specified by flist: #./mongo_copy <source_dir> <dest_dir> <writebuffer_size> <flist> <sync_flag> # Append phase: The mongo_append program reads filenames from stdin and appends to each file (filesize * append_factor) bytes, and maybe fsync it: #./mongo_append <append_factor> <writebuffer_size> <sync_flag> # Modify phase: The mongo_modify program reads filenames from stdin and modifies its (filesize * modify_factor) bytes starting with random position, and maybe fsync it: #./mongo_modify <modify_factor> <writebuffer_size> <sync_flag> # Overwrite phase: This phase uses mongo_modify program with modify_factor = 1, so it modifies filesize bytes, i.e. overwrites (and maybe fsync) it. # Read phase: The mongo_read program reads files created by mongo_fract_tree in specified order. # Stats phase: We do "find -type f" on the expected partition. Zam believes that it should be enough for stat for all files. # Delete phase: We do "rm -r" on all files and directories. # dd_writing_largefile phase: This is a special mongo phase which requires the option DD_MBCOUNT to be specified. We do "dd if=/dev/zero of=DIR/largefile bs=1M count=DD_MBCOUNT". 
# dd_reading_largefile phase: This is a special mongo phase which requires the DD_MBCOUNT option to be specified. We run "dd if=DIR/largefile of=/dev/null bs=1M count=DD_MBCOUNT".
Look at the source code if you need more information than this introduction contains.

Mongo output. The main purpose of Mongo is to compare file system variations. The following mongo options (fs-options) specify these variations: SYSTEM, FSTYPE, DEV, DIR, MOUNT_OPTIONS, SYNC, JOURNAL_DEV, JOURNAL_SIZE. Note that SYSTEM is a "fake" fs-option naming the kernel version that mongo was run under. For example: SYSTEM = linux-2.4.19-rc1+01-relocation4.patch+02-commit_super-8-relocation.patch+03-data-logging-24.patch. For the same file system variation, Mongo defines one or more phase variations via the following mongo phase-options: REP_COUNTER, WRITE_BUFFER, GAMMA, FILE_SIZE, BYTES, DD_MBCOUNT. These options specify the parameters that are passed to the mongo utilities. For each file system variation, mongo.pl prepares statistics for one or more phase variations. The mongo_parser.pl script prepares a comparative table for one or more file system variations. First prepare the appropriate mongo output files for each variation using the mongo.pl script. Make sure that all these files contain phase-isomorphic result trees (see above). Then pass these filenames to mongo_parser.pl (see the usage above), which will create a comparative html table (on stdout by default). The file system variations are represented in this table by columns of statistics labelled A, B, C, etc., in the order they were specified to mongo_parser.pl. The header of this table specifies each variation as the set of mongo fs-options whose values differ; the absence of an fs-option means it was left at its default value. If the table represents more than one file system variation, we assume by default that A is the baseline and B, C, ... are variations relative to A.
It means that all the statistics of B, C, ... are divided by the corresponding statistics of A. Options that have identical values for all the file system variations are placed in a separate header. The statistics of each phase variation are set off by a subheading (numbered #1, #2, ...). Here is an example of a Mongo comparative table. maintained by edward@namesys.com Last modified: Fri Dec 20 13:32:25 MSK 2002 [[category:ReiserFS]] [[category:Reiser4]] ecb8b370e7cf1385cc661a074573cc980116faaf 1495 2009-06-27T09:42:34Z Chris goe 2 http://web.archive.org/web/20061113154921/www.namesys.com/benchmarks/mongo_readme.html 23749d0448ad56189f9a1117f8d2be7b55d9b571
Mount 0 28 1667 1562 2010-02-10T05:13:50Z Chris goe 2 formatting fixes
== ReiserFS mount options ==
=== acl === Enable POSIX Access Control Lists. See the [http://acl.bestbits.at/man/man.shtml acl(5) manual page].
Example: mount -t reiserfs -o acl /dev/sdb1 /mnt/scsi-disk-b
=== conv === Instructs version 3.6 ReiserFS code to mount a 3.5 filesystem, using the 3.6 format for newly created objects. After this you can no longer access it with 3.5 ReiserFS tools. This option converts the old-format super block to the new format. If it is not specified, the old partition is handled in 3.5 fashion. Example: mount -t reiserfs -o conv /dev/sdb1 /mnt/scsi-disk-b
=== nolog === Disables journalling. This gives a slight performance improvement in some situations, at the cost of losing fast recovery from crashes. Even with this option turned on, ReiserFS still performs all journalling paraphernalia except the actual writes into the journalling area. Implementation of a real nolog is work in progress. Example: mount -t reiserfs -o nolog /dev/sdb1 /mnt/scsi-disk-b
=== notail === By default, ReiserFS stores small files and 'file tails' directly in the tree. This confuses some utilities, such as LILO. This option disables packing of files into the tree. Example: mount -t reiserfs -o notail /dev/sdb1 /mnt/scsi-disk-b
=== replayonly === Replays the transactions in the journal, but does not actually mount the filesystem. Used mostly by fsck. Example: mount -t reiserfs -o replayonly /dev/sdb1 /mnt/scsi-disk-b
=== jdev === Specifies an external device to be used as the journal device. Example: mount -t reiserfs -o jdev=/dev/sdb2 /dev/sdb1 /mnt/scsi-disk-b
=== user_xattr === Enables Extended User Attributes. See the [http://acl.bestbits.at/man/man.shtml attr(5) manual page]. Example: mount -t reiserfs -o user_xattr /dev/sdb1 /mnt/scsi-disk-b
=== resize === A remount option that allows expanding a ReiserFS partition on-line: it makes ReiserFS treat the device as having NUMBER blocks. Useful with [http://sources.redhat.com/lvm2/ LVM] devices. There is a special resizer utility called [[resize_reiserfs]]. Example: mount -t reiserfs -o resize=680000 /dev/sdb1 /mnt/scsi-disk-b
=== block-allocator === Tunes the block allocator.
This option is used for testing experimental features; it makes benchmarking new features with and without them more convenient, and should (ideally) never be used in any code shipped to users.
* hashed_relocation - May give a performance improvement in some situations.
* no_unhashed_relocation - May give a performance improvement in some situations.
* noborder - Disables the 'border allocator algorithm' invented by [mailto:yura@yura.polnet.botik.ru Yury Yu. Rupasov]. This may give a performance improvement in some situations.
* border - Enables the 'border allocator algorithm' invented by [mailto:yura@yura.polnet.botik.ru Yury Yu. Rupasov]. This may give a performance improvement in some situations.
Example: mount -t reiserfs -o block-allocator=border /dev/sdb1 /mnt/scsi-disk-b
== Linux 2.6 specific mount options ==
=== data === Specifies the journalling mode for file data. Metadata is always journalled.
* journal - All data is committed into the journal prior to being written into the main file system.
* ordered - This is the default mode. All data is forced directly out to the main file system prior to its metadata being committed to the journal.
* writeback - Data ordering is not preserved: data may be written into the main file system after its metadata has been committed to the journal. This is rumoured to be the highest-throughput option. It guarantees internal file system integrity, but it can allow old data to appear in files after a crash and journal recovery.
Example: mount -t reiserfs -o data=writeback /dev/sdb1 /mnt/scsi-disk-b
== Linux 2.4 specific mount options ==
=== hash === Chooses the hash function ReiserFS uses to locate files within directories. Long ago ReiserFS had only one hash, so the hash code was not recorded in the filesystem superblock. Then additional hashes became available, so the hash code had to be stored in the super block, and the old hash was made non-default.
At that time there were already a number of filesystems whose super block had no hash code set, so this mount option was created to make it possible to write the proper hash value into the super block. The relative merits of the hash functions were discussed at great length on the ReiserFS mailing list. (Try this query.) Roughly speaking: 99% of the time this option is not required. If the normal autodetection code can't determine which hash to use (because both hashes produced the same value for a file), use this option to force a specific hash. '''It won't allow you to override the existing hash on the filesystem, so if you have a <tt>tea</tt> hash disk and mount with <tt>-o hash=rupasov</tt>, the mount will fail.'''
* rupasov - This hash was invented by [mailto:yura@yura.polnet.botik.ru Yury Yu. Rupasov]. It is fast and preserves locality, mapping lexicographically close file names to close hash values. '''Never use it, as it has a high probability of hash collisions.'''
* [http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.45.281 tea] - This hash is a [http://en.wikipedia.org/wiki/One-way_compression_function#Davies-Meyer Davies-Meyer] function implemented by [mailto:jeremy@goop.org Jeremy Fitzhardinge]. It permutes the bits of the name thoroughly, achieving high randomness and therefore a low probability of hash collisions, but at a cost in performance. Use it if you got <tt>-EHASHCOLLISION</tt> with the <tt>r5</tt> hash.
* r5 - This hash is a modified version of the <tt>rupasov</tt> hash. It is the default, and it is best to stick with it unless you have to support huge directories and unusual file-name patterns.
* detect - Instructs mount to detect the hash function in use by the filesystem instance being mounted and write this information into the superblock. This is only useful on the first mount of an old filesystem.
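To make the locality/collision trade-off of <tt>r5</tt> concrete, here is a small Python sketch of the hash, written from memory of the kernel's fs/reiserfs/hashes.c; treat it as illustrative and verify against the actual source.

```python
def r5_hash(name: bytes) -> int:
    """r5-style name hash (from memory of fs/reiserfs/hashes.c --
    illustrative, not authoritative). Fast, and lexicographically
    close names map to numerically close hash values."""
    a = 0
    for c in name:
        a += (c << 4) + (c >> 4)   # mix low and high nibbles of each byte
        a = (a * 11) & 0xFFFFFFFF  # multiply by 11, truncate to 32 bits like u32
    return a
```

For example, r5_hash(b"file1") and r5_hash(b"file2") differ by only 176: exactly the locality that keeps related directory entries near each other, and also why pathological generated name sets can collide, which is what the <tt>tea</tt> hash's thorough bit permutation avoids.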
Example: mount -t reiserfs -o hash=r5 /dev/sdb1 /mnt/scsi-disk-b [[category:ReiserFS]] 6b3531274574789dc29040eb881a0733e8a4adc5 1562 1561 2009-07-03T07:46:27Z Chris goe 2 tea? == ReiserFS mount options == === acl === Enable POSIX Access Control Lists. See the acl(5) manual page. Example: mount -t reiserfs -o acl /dev/sdb1 /mnt/scsi-disk-b === conv === Instructs 3.6 ReiserFS code to mount 3.5 filesystem, using 3.6 format for newly created objects. After this you cannot use it through 3.5 ReiserFS tools anymore. This option causes conversion of old format super block to the new format. If not specified - old partition will be dealt with in a manner of 3.5. Example: mount -t reiserfs -o conv /dev/sdb1 /mnt/scsi-disk-b === nolog === Disable journalling. This will get you slight performance improvement in some situations at the cost of losing fast recovery from crashes. Actually even with this option turned on, ReiserFS still performs all journalling paraphernalia, save for actual writes into journalling area. Implementation of real nolog is work in progress. Example: mount -t reiserfs -o nolog /dev/sdb1 /mnt/scsi-disk-b === notail === By default, ReiserFS stores small files and `file tails' directly into the tree. This confuses some utilities like LILO. This option is used to disable packing of files into the tree. Example: mount -t reiserfs -o notail /dev/sdb1 /mnt/scsi-disk-b === replayonly === Replay transactions in journal, but don't actually mount filesystem. Used by fsck, mostly. Example: mount -t reiserfs -o replayonly /dev/sdb1 /mnt/scsi-disk-b === jdev === Specifies an external device to be used as the journal device. Example: mount -t reiserfs -o jdev=/dev/sdb2 /dev/sdb1 /mnt/scsi-disk-b === user_xattr === Enable Extended User Attributes. See the attr(5) manual page. Example: mount -t reiserfs -o user_xattr /dev/sdb1 /mnt/scsi-disk-b === resize === Remount option allowing to expand ReiserFS partition on-line. Make ReiserFS think that device has NUMBER blocks. 
Useful with LVM devices. There is a special resizer utility called [[resize_reiserfs]]. Example: mount -t reiserfs -o resize=680000 /dev/sdb1 /mnt/scsi-disk-b === block-allocator === Tunes block allocator. This option is used for testing experimental features, makes benchmarking new features with and without more convenient, should never be used by users in any code shipped to users (ideally). * hashed_relocation Tunes block allocator. This may give you performance improvements in some situations. * no_unhashed_relocation Tunes block allocator. This may give you performance improvements in some situations. * noborder Disable `border allocator algorithm' invented by Yury Yu. Rupasov <yura@yura.polnet.botik.ru>. This may give you performance improvements in some situations. * border Enable `border allocator algorithm' invented by Yury Yu. Rupasov <yura@yura.polnet.botik.ru>. This may give you performance improvements in some situations. Example: mount -t reiserfs -o block-allocator=border /dev/sdb1 /mnt/scsi-disk-b == Linux 2.4 specific mount options == === hash === Choose hash function ReiserFS will use to find files within directories. Long time ago ReiserFS had only one hash, so hash code was not marked in filesystem superblock. Then additional hashes became available so we had to put hash code into super block. Also, old hash was made notdefault. At that time there were already a number of filesystems with not set hash code in super block. So, mount option was created to make it possible to write proper hash value into super block. Relative merits of hash functions were subjected to discussions of great length on the ReiserFS mailing list. (Try this query.) Roughly speaking: 99% of the time, this option is not required. If the normal autodection code can't determine which hash to use (because both hases had the same value for a file) use this option to force a specific hash. 
'''It won't allow you to override the existing hash on the filesystem, so if you have a <tt>tea</tt> hash disk, and mount with <tt>-o hash=rupasov</tt>, the mount will fail.''' * rupasov This hash is invented by Yury Yu. Rupasov <yura@yura.polnet.botik.ru>. It is fast and preserves locality, mapping lexicographically close file names to the close hash values. '''Never use it, as it has a high probability of hash collisions.''' * [http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.45.281 tea] This hash is a [http://en.wikipedia.org/wiki/One-way_compression_function#Davies-Meyer Davies-Meyer] function implemented by Jeremy Fitzhardinge <jeremy@goop.org>. It is hash permuting bits in the name thoroughly. It gets high randomness and, therefore, low probability of hash collision, but this costs performance. Use this if you got <tt>-EHASHCOLLISION</tt> with <tt>r5</tt> hash. * r5 This hash is a modified version of <tt>rupasov</tt> hash. It is used by default and it is better to stick here until you have to support huge directories and unusual file-name patterns. * detect This is the instructs mount to detect hash function in use by instance of filesystem being mounted and write this information into superblock. This is only useful on the first mount of old filesystem. Example: mount -t reiserfs -o hash=r5 /dev/sdb1 /mnt/scsi-disk-b == Linux 2.6 specific mount options == === data === Specifies the journalling mode for file data. Metadata is always journaled. * journal All data is committed into the journal prior to being written into the main file system. * ordered This is the default mode. All data is forced directly out to the main file system prior to its metadata being committed to the journal. * writeback Data ordering is not preserved - data may be written into the main file system after its metadata has been committed to the journal. This is rumoured to be the highest-throughput option. 
It guarantees internal file system integrity, however it can allow old data to appear in files after a crash and journal recovery. Example: mount -t reiserfs -o data=writeback /dev/sdb1 /mnt/scsi-disk-b [[category:ReiserFS]] cf84cc11393919151318f0ec219b7608a6d53230 1561 1560 2009-07-03T03:54:49Z Chris goe 2 formatting fixes == ReiserFS mount options == === acl === Enable POSIX Access Control Lists. See the acl(5) manual page. Example: mount -t reiserfs -o acl /dev/sdb1 /mnt/scsi-disk-b === conv === Instructs 3.6 ReiserFS code to mount 3.5 filesystem, using 3.6 format for newly created objects. After this you cannot use it through 3.5 ReiserFS tools anymore. This option causes conversion of old format super block to the new format. If not specified - old partition will be dealt with in a manner of 3.5. Example: mount -t reiserfs -o conv /dev/sdb1 /mnt/scsi-disk-b === nolog === Disable journalling. This will get you slight performance improvement in some situations at the cost of losing fast recovery from crashes. Actually even with this option turned on, ReiserFS still performs all journalling paraphernalia, save for actual writes into journalling area. Implementation of real nolog is work in progress. Example: mount -t reiserfs -o nolog /dev/sdb1 /mnt/scsi-disk-b === notail === By default, ReiserFS stores small files and `file tails' directly into the tree. This confuses some utilities like LILO. This option is used to disable packing of files into the tree. Example: mount -t reiserfs -o notail /dev/sdb1 /mnt/scsi-disk-b === replayonly === Replay transactions in journal, but don't actually mount filesystem. Used by fsck, mostly. Example: mount -t reiserfs -o replayonly /dev/sdb1 /mnt/scsi-disk-b === jdev === Specifies an external device to be used as the journal device. Example: mount -t reiserfs -o jdev=/dev/sdb2 /dev/sdb1 /mnt/scsi-disk-b === user_xattr === Enable Extended User Attributes. See the attr(5) manual page. 
Example: mount -t reiserfs -o user_xattr /dev/sdb1 /mnt/scsi-disk-b === resize === Remount option allowing to expand ReiserFS partition on-line. Make ReiserFS think that device has NUMBER blocks. Useful with LVM devices. There is a special resizer utility called [[resize_reiserfs]]. Example: mount -t reiserfs -o resize=680000 /dev/sdb1 /mnt/scsi-disk-b === block-allocator === Tunes block allocator. This option is used for testing experimental features, makes benchmarking new features with and without more convenient, should never be used by users in any code shipped to users (ideally). * hashed_relocation Tunes block allocator. This may give you performance improvements in some situations. * no_unhashed_relocation Tunes block allocator. This may give you performance improvements in some situations. * noborder Disable `border allocator algorithm' invented by Yury Yu. Rupasov <yura@yura.polnet.botik.ru>. This may give you performance improvements in some situations. * border Enable `border allocator algorithm' invented by Yury Yu. Rupasov <yura@yura.polnet.botik.ru>. This may give you performance improvements in some situations. Example: mount -t reiserfs -o block-allocator=border /dev/sdb1 /mnt/scsi-disk-b == Linux 2.4 specific mount options == === hash === Choose hash function ReiserFS will use to find files within directories. Long time ago ReiserFS had only one hash, so hash code was not marked in filesystem superblock. Then additional hashes became available so we had to put hash code into super block. Also, old hash was made notdefault. At that time there were already a number of filesystems with not set hash code in super block. So, mount option was created to make it possible to write proper hash value into super block. Relative merits of hash functions were subjected to discussions of great length on the ReiserFS mailing list. (Try this query.) Roughly speaking: 99% of the time, this option is not required. 
If the normal autodetection code cannot determine which hash to use (because two hashes produced the same value for a file), use this option to force a specific hash. It will not let you override the existing hash on the filesystem: if you have a tea-hash disk and mount with -o hash=rupasov, the mount will fail.

* rupasov: invented by Yury Yu. Rupasov <yura@yura.polnet.botik.ru>. It is fast and preserves locality, mapping lexicographically close file names to close hash values. Never use it, as it has a high probability of hash collisions.
* tea: a Davies-Meyer function implemented by Jeremy Fitzhardinge <jeremy@zip.com.au>. It permutes the bits of the name thoroughly, achieving high randomness and therefore a low probability of hash collisions, at a cost in performance. Use it if you get EHASHCOLLISION with the r5 hash.
* r5: a modified version of the rupasov hash. It is the default, and it is best to stay with it unless you have to support huge directories and unusual file-name patterns.
* detect: instructs mount to detect the hash function in use by the filesystem instance being mounted and to write this information into the superblock. This is only useful on the first mount of an old filesystem.

Example:
 mount -t reiserfs -o hash=r5 /dev/sdb1 /mnt/scsi-disk-b

== Linux 2.6 specific mount options ==

=== data ===
Specifies the journalling mode for file data. Metadata is always journaled.

* journal: all data is committed into the journal prior to being written into the main file system.
* ordered: the default mode. All data is forced directly out to the main file system prior to its metadata being committed to the journal.
* writeback: data ordering is not preserved; data may be written into the main file system after its metadata has been committed to the journal. This is rumoured to be the highest-throughput option.
It guarantees internal file system integrity; however, it can allow old data to appear in files after a crash and journal recovery.

Example:
 mount -t reiserfs -o data=writeback /dev/sdb1 /mnt/scsi-disk-b

[[category:ReiserFS]]
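To make a chosen set of mount options persistent across reboots, they can be listed in /etc/fstab. A minimal sketch; the device, mount point, and option combination are illustrative, matching the hypothetical names used in the examples on this page:

```shell
# Sketch of a persistent /etc/fstab entry. The device, mount point and
# option combination are illustrative, not prescriptive.
FSTAB_LINE='/dev/sdb1  /mnt/scsi-disk-b  reiserfs  acl,user_xattr,notail  0  2'
printf '%s\n' "$FSTAB_LINE"
# To install it (as root):
#   printf '%s\n' "$FSTAB_LINE" >> /etc/fstab
#   mount -a
```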
Make ReiserFS think that device has NUMBER blocks. Useful with LVM devices. There is a special resizer utility called [[resize_reiserfs]]. Example: mount -t reiserfs -o resize=680000 /dev/sdb1 /mnt/scsi-disk-b === hash === Choose hash function ReiserFS will use to find files within directories. Long time ago ReiserFS had only one hash, so hash code was not marked in filesystem superblock. Then additional hashes became available so we had to put hash code into super block. Also, old hash was made notdefault. At that time there were already a number of filesystems with not set hash code in super block. So, mount option was created to make it possible to write proper hash value into super block. Relative merits of hash functions were subjected to discussions of great length on the ReiserFS mailing list. (Try this query.) Roughly speaking: 99% of the time, this option is not required. If the normal autodection code can't determine which hash to use (because both hases had the same value for a file) use this option to force a specific hash. It won't allow you to override the existing hash on the FS, so if you have a tea hash disk, and mount with -o hash=rupasov, the mount will fail. * rupasov This hash is invented by Yury Yu. Rupasov <yura@yura.polnet.botik.ru>. It is fast and preserves locality, mapping lexicographically close file names to the close hash values. Never use it, as it has high probability of hash collisions. * tea This hash is a Davis-Meyer function implemented by Jeremy Fitzhardinge <jeremy@zip.com.au>. It is hash permuting bits in the name thoroughly. It gets high randomness and, therefore, low probability of hash collision, but this costs performance. Use this if you got EHASHCOLLISION with r5 hash. * r5 This hash is a modified version of rupasov hash. It is used by default and it is better to stick here until you have to support huge directories and unusual file-name patterns. 
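For illustration, the r5 hash discussed above boils down to a small per-character mixing step. A minimal sketch, assuming the update rule matches the kernel's fs/reiserfs/hashes.c (a += c << 4; a += c >> 4; a *= 11, wrapping at 32 bits):

```shell
# Minimal sketch of the r5 name hash; assumes the per-character update
# from the kernel's fs/reiserfs/hashes.c, truncated to 32 bits.
r5_hash() {
    name=$1
    a=0
    i=1
    while [ "$i" -le "${#name}" ]; do
        ch=$(printf '%s' "$name" | cut -c "$i")   # i-th character
        c=$(printf '%d' "'$ch")                   # its character code
        a=$(( ((a + (c << 4) + (c >> 4)) * 11) & 0xFFFFFFFF ))
        i=$(( i + 1 ))
    done
    printf '%s\n' "$a"
}

r5_hash a     # prints 17138
```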
* detect This is the instructs mount to detect hash function in use by instance of filesystem being mounted and write this information into superblock. This is only useful on the first mount of old filesystem. Example: mount -t reiserfs -o hash=r5 /dev/sdb1 /mnt/scsi-disk-b === block-allocator === Tunes block allocator. This option is used for testing experimental features, makes benchmarking new features with and without more convenient, should never be used by users in any code shipped to users (ideally). * hashed_relocation Tunes block allocator. This may give you performance improvements in some situations. * no_unhashed_relocation Tunes block allocator. This may give you performance improvements in some situations. * noborder Disable `border allocator algorithm' invented by Yury Yu. Rupasov <yura@yura.polnet.botik.ru>. This may give you performance improvements in some situations. * border Enable `border allocator algorithm' invented by Yury Yu. Rupasov <yura@yura.polnet.botik.ru>. This may give you performance improvements in some situations. Example: mount -t reiserfs -o block-allocator=border /dev/sdb1 /mnt/scsi-disk-b == Linux 2.6 == === conv === Instructs 3.6 ReiserFS code to mount 3.5 filesystem, using 3.6 format for newly created objects. After this you cannot use it through 3.5 ReiserFS tools anymore. This option causes conversion of old format super block to the new format. If not specified - old partition will be dealt with in a manner of 3.5. Example: mount -t reiserfs -o conv /dev/sdb1 /mnt/scsi-disk-b === nolog === Disable journalling. This will get you slight performance improvement in some situations at the cost of losing fast recovery from crashes. Actually even with this option turned on, ReiserFS still performs all journalling paraphernalia, save for actual writes into journalling area. Implementation of real nolog is work in progress. 
Example: mount -t reiserfs -o nolog /dev/sdb1 /mnt/scsi-disk-b === notail === By default, ReiserFS stores small files and `file tails' directly into the tree. This confuses some utilities like LILO. This option is used to disable packing of files into the tree. Example: mount -t reiserfs -o notail /dev/sdb1 /mnt/scsi-disk-b === replayonly === Replay transactions in journal, but don't actually mount filesystem. Used by fsck, mostly. Example: mount -t reiserfs -o replayonly /dev/sdb1 /mnt/scsi-disk-b === jdev === Example: mount -t reiserfs -o jdev=/dev/sdb2 /dev/sdb1 /mnt/scsi-disk-b === attrs / noattrs === Example: mount -t reiserfs -o attrs /dev/sdb1 /mnt/scsi-disk-b mount -t reiserfs -o noattrs /dev/sdb1 /mnt/scsi-disk-b === resize === Remount option allowing to expand ReiserFS partition on-line. Make ReiserFS think that device has NUMBER blocks. Useful with LVM devices. There is a special resizer utility called [[resize_reiserfs]]. Example: mount -t reiserfs -o resize=680000 /dev/sdb1 /mnt/scsi-disk-b === data === Specifies the journalling mode for file data. Metadata is always journaled. * journal All data is committed into the journal prior to being written into the main file system. * ordered This is the default mode. All data is forced directly out to the main file system prior to its metadata being committed to the journal. * writeback Data ordering is not preserved - data may be written into the main file system after its metadata has been committed to the journal. This is rumoured to be the highest-throughput option. It guarantees internal file system integrity, however it can allow old data to appear in files after a crash and journal recovery. Example: mount -t reiserfs -o data=writeback /dev/sdb1 /mnt/scsi-disk-b === block-allocator === Tunes block allocator. This option is used for testing experimental features, makes benchmarking new features with and without more convenient, should never be used by users in any code shipped to users (ideally). 
* hashed_relocation Tunes block allocator. This may give you performance improvements in some situations. * no_unhashed_relocation Tunes block allocator. This may give you performance improvements in some situations. * noborder Disable `border allocator algorithm' invented by Yury Yu. Rupasov <yura@yura.polnet.botik.ru>. This may give you performance improvements in some situations. * border Enable `border allocator algorithm' invented by Yury Yu. Rupasov <yura@yura.polnet.botik.ru>. This may give you performance improvements in some situations. Example: mount -t reiserfs -o block-allocator=border /dev/sdb1 /mnt/scsi-disk-b [[category:ReiserFS]] 0625ef673116e12adfc7e62a3bffc36d71decb88 1559 1558 2009-07-03T03:43:34Z Chris goe 2 -> resize_reiserfs ReiserFS mount options == Linux 2.4 == === conv === Instructs 3.6 ReiserFS code to mount 3.5 filesystem, using 3.6 format for newly created objects. After this you cannot use it through 3.5 ReiserFS tools anymore. This option causes conversion of old format super block to the new format. If not specified - old partition will be dealt with in a manner of 3.5. Example: mount -t reiserfs -o conv /dev/sdb1 /mnt/scsi-disk-b === nolog === Disable journalling. This will get you slight performance improvement in some situations at the cost of losing fast recovery from crashes. Actually even with this option turned on, ReiserFS still performs all journalling paraphernalia, save for actual writes into journalling area. Implementation of real nolog is work in progress. Example: mount -t reiserfs -o nolog /dev/sdb1 /mnt/scsi-disk-b === notail === By default, ReiserFS stores small files and `file tails' directly into the tree. This confuses some utilities like LILO. This option is used to disable packing of files into the tree. Example: mount -t reiserfs -o notail /dev/sdb1 /mnt/scsi-disk-b === replayonly === Replay transactions in journal, but don't actually mount filesystem. Used by fsck, mostly. 
Example: mount -t reiserfs -o replayonly /dev/sdb1 /mnt/scsi-disk-b === jdev === Specifies an external device to be used as the journal device. Example: mount -t reiserfs -o jdev=/dev/sdb2 /dev/sdb1 /mnt/scsi-disk-b === attrs/noattrs === Example: mount -t reiserfs -o attrs /dev/sdb1 /mnt/scsi-disk-b mount -t reiserfs -o noattrs /dev/sdb1 /mnt/scsi-disk-b === resize === Remount option allowing to expand ReiserFS partition on-line. Make ReiserFS think that device has NUMBER blocks. Useful with LVM devices. There is a special resizer utility called [[resize_reiserfs]]. Example: mount -t reiserfs -o resize=680000 /dev/sdb1 /mnt/scsi-disk-b === hash === Choose hash function ReiserFS will use to find files within directories. Long time ago ReiserFS had only one hash, so hash code was not marked in filesystem superblock. Then additional hashes became available so we had to put hash code into super block. Also, old hash was made notdefault. At that time there were already a number of filesystems with not set hash code in super block. So, mount option was created to make it possible to write proper hash value into super block. Relative merits of hash functions were subjected to discussions of great length on the ReiserFS mailing list. (Try this query.) Roughly speaking: 99% of the time, this option is not required. If the normal autodection code can't determine which hash to use (because both hases had the same value for a file) use this option to force a specific hash. It won't allow you to override the existing hash on the FS, so if you have a tea hash disk, and mount with -o hash=rupasov, the mount will fail. * rupasov This hash is invented by Yury Yu. Rupasov <yura@yura.polnet.botik.ru>. It is fast and preserves locality, mapping lexicographically close file names to the close hash values. Never use it, as it has high probability of hash collisions. * tea This hash is a Davis-Meyer function implemented by Jeremy Fitzhardinge <jeremy@zip.com.au>. 
It is hash permuting bits in the name thoroughly. It gets high randomness and, therefore, low probability of hash collision, but this costs performance. Use this if you got EHASHCOLLISION with r5 hash. * r5 This hash is a modified version of rupasov hash. It is used by default and it is better to stick here until you have to support huge directories and unusual file-name patterns. * detect This is the instructs mount to detect hash function in use by instance of filesystem being mounted and write this information into superblock. This is only useful on the first mount of old filesystem. Example: mount -t reiserfs -o hash=r5 /dev/sdb1 /mnt/scsi-disk-b === block-allocator === Tunes block allocator. This option is used for testing experimental features, makes benchmarking new features with and without more convenient, should never be used by users in any code shipped to users (ideally). * hashed_relocation Tunes block allocator. This may give you performance improvements in some situations. * no_unhashed_relocation Tunes block allocator. This may give you performance improvements in some situations. * noborder Disable `border allocator algorithm' invented by Yury Yu. Rupasov <yura@yura.polnet.botik.ru>. This may give you performance improvements in some situations. * border Enable `border allocator algorithm' invented by Yury Yu. Rupasov <yura@yura.polnet.botik.ru>. This may give you performance improvements in some situations. Example: mount -t reiserfs -o block-allocator=border /dev/sdb1 /mnt/scsi-disk-b == Linux 2.6 == === conv === Instructs 3.6 ReiserFS code to mount 3.5 filesystem, using 3.6 format for newly created objects. After this you cannot use it through 3.5 ReiserFS tools anymore. This option causes conversion of old format super block to the new format. If not specified - old partition will be dealt with in a manner of 3.5. Example: mount -t reiserfs -o conv /dev/sdb1 /mnt/scsi-disk-b === nolog === Disable journalling. 
This will get you slight performance improvement in some situations at the cost of losing fast recovery from crashes. Actually even with this option turned on, ReiserFS still performs all journalling paraphernalia, save for actual writes into journalling area. Implementation of real nolog is work in progress. Example: mount -t reiserfs -o nolog /dev/sdb1 /mnt/scsi-disk-b === notail === By default, ReiserFS stores small files and `file tails' directly into the tree. This confuses some utilities like LILO. This option is used to disable packing of files into the tree. Example: mount -t reiserfs -o notail /dev/sdb1 /mnt/scsi-disk-b === replayonly === Replay transactions in journal, but don't actually mount filesystem. Used by fsck, mostly. Example: mount -t reiserfs -o replayonly /dev/sdb1 /mnt/scsi-disk-b === jdev === Example: mount -t reiserfs -o jdev=/dev/sdb2 /dev/sdb1 /mnt/scsi-disk-b === attrs / noattrs === Example: mount -t reiserfs -o attrs /dev/sdb1 /mnt/scsi-disk-b mount -t reiserfs -o noattrs /dev/sdb1 /mnt/scsi-disk-b === resize === Remount option allowing to expand ReiserFS partition on-line. Make ReiserFS think that device has NUMBER blocks. Useful with LVM devices. There is a special resizer utility which can be obtained from ftp://ftp.namesys.com/pub/reiserfsprogs Example: mount -t reiserfs -o resize=680000 /dev/sdb1 /mnt/scsi-disk-b === data === Specifies the journalling mode for file data. Metadata is always journaled. * journal All data is committed into the journal prior to being written into the main file system. * ordered This is the default mode. All data is forced directly out to the main file system prior to its metadata being committed to the journal. * writeback Data ordering is not preserved - data may be written into the main file system after its metadata has been committed to the journal. This is rumoured to be the highest-throughput option. 
It guarantees internal file system integrity, however it can allow old data to appear in files after a crash and journal recovery. Example: mount -t reiserfs -o data=writeback /dev/sdb1 /mnt/scsi-disk-b === block-allocator === Tunes block allocator. This option is used for testing experimental features, makes benchmarking new features with and without more convenient, should never be used by users in any code shipped to users (ideally). * hashed_relocation Tunes block allocator. This may give you performance improvements in some situations. * no_unhashed_relocation Tunes block allocator. This may give you performance improvements in some situations. * noborder Disable `border allocator algorithm' invented by Yury Yu. Rupasov <yura@yura.polnet.botik.ru>. This may give you performance improvements in some situations. * border Enable `border allocator algorithm' invented by Yury Yu. Rupasov <yura@yura.polnet.botik.ru>. This may give you performance improvements in some situations. Example: mount -t reiserfs -o block-allocator=border /dev/sdb1 /mnt/scsi-disk-b [[category:ReiserFS]] ab48aff9f82306b790b7c0301935aa37dac026bb 1558 1362 2009-07-03T03:42:44Z Chris goe 2 formatting fixes ReiserFS mount options == Linux 2.4 == === conv === Instructs 3.6 ReiserFS code to mount 3.5 filesystem, using 3.6 format for newly created objects. After this you cannot use it through 3.5 ReiserFS tools anymore. This option causes conversion of old format super block to the new format. If not specified - old partition will be dealt with in a manner of 3.5. Example: mount -t reiserfs -o conv /dev/sdb1 /mnt/scsi-disk-b === nolog === Disable journalling. This will get you slight performance improvement in some situations at the cost of losing fast recovery from crashes. Actually even with this option turned on, ReiserFS still performs all journalling paraphernalia, save for actual writes into journalling area. Implementation of real nolog is work in progress. 
Example: mount -t reiserfs -o nolog /dev/sdb1 /mnt/scsi-disk-b === notail === By default, ReiserFS stores small files and `file tails' directly into the tree. This confuses some utilities like LILO. This option is used to disable packing of files into the tree. Example: mount -t reiserfs -o notail /dev/sdb1 /mnt/scsi-disk-b === replayonly === Replay transactions in journal, but don't actually mount filesystem. Used by fsck, mostly. Example: mount -t reiserfs -o replayonly /dev/sdb1 /mnt/scsi-disk-b === jdev === Specifies an external device to be used as the journal device. Example: mount -t reiserfs -o jdev=/dev/sdb2 /dev/sdb1 /mnt/scsi-disk-b === attrs/noattrs === Example: mount -t reiserfs -o attrs /dev/sdb1 /mnt/scsi-disk-b mount -t reiserfs -o noattrs /dev/sdb1 /mnt/scsi-disk-b === resize === Remount option allowing to expand ReiserFS partition on-line. Make ReiserFS think that device has NUMBER blocks. Useful with LVM devices. There is a special resizer utility which can be obtained from ftp://ftp.namesys.com/pub/reiserfsprogs Example: mount -t reiserfs -o resize=680000 /dev/sdb1 /mnt/scsi-disk-b === hash === Choose hash function ReiserFS will use to find files within directories. Long time ago ReiserFS had only one hash, so hash code was not marked in filesystem superblock. Then additional hashes became available so we had to put hash code into super block. Also, old hash was made notdefault. At that time there were already a number of filesystems with not set hash code in super block. So, mount option was created to make it possible to write proper hash value into super block. Relative merits of hash functions were subjected to discussions of great length on the ReiserFS mailing list. (Try this query.) Roughly speaking: 99% of the time, this option is not required. If the normal autodection code can't determine which hash to use (because both hases had the same value for a file) use this option to force a specific hash. 
It won't allow you to override the existing hash on the FS, so if you have a tea hash disk, and mount with -o hash=rupasov, the mount will fail. * rupasov This hash is invented by Yury Yu. Rupasov <yura@yura.polnet.botik.ru>. It is fast and preserves locality, mapping lexicographically close file names to the close hash values. Never use it, as it has high probability of hash collisions. * tea This hash is a Davis-Meyer function implemented by Jeremy Fitzhardinge <jeremy@zip.com.au>. It is hash permuting bits in the name thoroughly. It gets high randomness and, therefore, low probability of hash collision, but this costs performance. Use this if you got EHASHCOLLISION with r5 hash. * r5 This hash is a modified version of rupasov hash. It is used by default and it is better to stick here until you have to support huge directories and unusual file-name patterns. * detect This is the instructs mount to detect hash function in use by instance of filesystem being mounted and write this information into superblock. This is only useful on the first mount of old filesystem. Example: mount -t reiserfs -o hash=r5 /dev/sdb1 /mnt/scsi-disk-b === block-allocator === Tunes block allocator. This option is used for testing experimental features, makes benchmarking new features with and without more convenient, should never be used by users in any code shipped to users (ideally). * hashed_relocation Tunes block allocator. This may give you performance improvements in some situations. * no_unhashed_relocation Tunes block allocator. This may give you performance improvements in some situations. * noborder Disable `border allocator algorithm' invented by Yury Yu. Rupasov <yura@yura.polnet.botik.ru>. This may give you performance improvements in some situations. * border Enable `border allocator algorithm' invented by Yury Yu. Rupasov <yura@yura.polnet.botik.ru>. This may give you performance improvements in some situations. 
Example: mount -t reiserfs -o block-allocator=border /dev/sdb1 /mnt/scsi-disk-b == Linux 2.6 == === conv === Instructs 3.6 ReiserFS code to mount 3.5 filesystem, using 3.6 format for newly created objects. After this you cannot use it through 3.5 ReiserFS tools anymore. This option causes conversion of old format super block to the new format. If not specified - old partition will be dealt with in a manner of 3.5. Example: mount -t reiserfs -o conv /dev/sdb1 /mnt/scsi-disk-b === nolog === Disable journalling. This will get you slight performance improvement in some situations at the cost of losing fast recovery from crashes. Actually even with this option turned on, ReiserFS still performs all journalling paraphernalia, save for actual writes into journalling area. Implementation of real nolog is work in progress. Example: mount -t reiserfs -o nolog /dev/sdb1 /mnt/scsi-disk-b === notail === By default, ReiserFS stores small files and `file tails' directly into the tree. This confuses some utilities like LILO. This option is used to disable packing of files into the tree. Example: mount -t reiserfs -o notail /dev/sdb1 /mnt/scsi-disk-b === replayonly === Replay transactions in journal, but don't actually mount filesystem. Used by fsck, mostly. Example: mount -t reiserfs -o replayonly /dev/sdb1 /mnt/scsi-disk-b === jdev === Example: mount -t reiserfs -o jdev=/dev/sdb2 /dev/sdb1 /mnt/scsi-disk-b === attrs / noattrs === Example: mount -t reiserfs -o attrs /dev/sdb1 /mnt/scsi-disk-b mount -t reiserfs -o noattrs /dev/sdb1 /mnt/scsi-disk-b === resize === Remount option allowing to expand ReiserFS partition on-line. Make ReiserFS think that device has NUMBER blocks. Useful with LVM devices. There is a special resizer utility which can be obtained from ftp://ftp.namesys.com/pub/reiserfsprogs Example: mount -t reiserfs -o resize=680000 /dev/sdb1 /mnt/scsi-disk-b === data === Specifies the journalling mode for file data. Metadata is always journaled. 
* journal All data is committed into the journal prior to being written into the main file system. * ordered This is the default mode. All data is forced directly out to the main file system prior to its metadata being committed to the journal. * writeback Data ordering is not preserved - data may be written into the main file system after its metadata has been committed to the journal. This is rumoured to be the highest-throughput option. It guarantees internal file system integrity, however it can allow old data to appear in files after a crash and journal recovery. Example: mount -t reiserfs -o data=writeback /dev/sdb1 /mnt/scsi-disk-b === block-allocator === Tunes block allocator. This option is used for testing experimental features, makes benchmarking new features with and without more convenient, should never be used by users in any code shipped to users (ideally). * hashed_relocation Tunes block allocator. This may give you performance improvements in some situations. * no_unhashed_relocation Tunes block allocator. This may give you performance improvements in some situations. * noborder Disable `border allocator algorithm' invented by Yury Yu. Rupasov <yura@yura.polnet.botik.ru>. This may give you performance improvements in some situations. * border Enable `border allocator algorithm' invented by Yury Yu. Rupasov <yura@yura.polnet.botik.ru>. This may give you performance improvements in some situations. Example: mount -t reiserfs -o block-allocator=border /dev/sdb1 /mnt/scsi-disk-b [[category:ReiserFS]] ecdad207bb85e62be8af3dcf830c4b01303b64b1 1362 1336 2009-06-25T09:05:53Z Chris goe 2 category added ReiserFS Mount Options linux kernels 2.4.x conv Instructs 3.6 ReiserFS code to mount 3.5 filesystem, using 3.6 format for newly created objects. After this you cannot use it through 3.5 ReiserFS tools anymore. This option causes conversion of old format super block to the new format. If not specified - old partition will be dealt with in a manner of 3.5. 
Example: mount -t reiserfs -o conv /dev/sdb1 /mnt/scsi-disk-b nolog Disable journalling. This will get you slight performance improvement in some situations at the cost of losing fast recovery from crashes. Actually even with this option turned on, ReiserFS still performs all journalling paraphernalia, save for actual writes into journalling area. Implementation of real nolog is work in progress. Example: mount -t reiserfs -o nolog /dev/sdb1 /mnt/scsi-disk-b notail By default, ReiserFS stores small files and `file tails' directly into the tree. This confuses some utilities like LILO. This option is used to disable packing of files into the tree. Example: mount -t reiserfs -o notail /dev/sdb1 /mnt/scsi-disk-b replayonly Replay transactions in journal, but don't actually mount filesystem. Used by fsck, mostly. Example: mount -t reiserfs -o replayonly /dev/sdb1 /mnt/scsi-disk-b jdev=journal_device Example: mount -t reiserfs -o jdev=/dev/sdb2 /dev/sdb1 /mnt/scsi-disk-b attrs Example: mount -t reiserfs -o attrs /dev/sdb1 /mnt/scsi-disk-b noattrs Example: mount -t reiserfs -o noattrs /dev/sdb1 /mnt/scsi-disk-b resize=NUMBER Remount option allowing to expand ReiserFS partition on-line. Make ReiserFS think that device has NUMBER blocks. Useful with LVM devices. There is a special resizer utility which can be obtained from ftp://ftp.namesys.com/pub/reiserfsprogs Example: mount -t reiserfs -o resize=680000 /dev/sdb1 /mnt/scsi-disk-b hash=rupasov / tea / r5 / detect Choose hash function ReiserFS will use to find files within directories. Long time ago ReiserFS had only one hash, so hash code was not marked in filesystem superblock. Then additional hashes became available so we had to put hash code into super block. Also, old hash was made notdefault. At that time there were already a number of filesystems with not set hash code in super block. So, mount option was created to make it possible to write proper hash value into super block. 
Relative merits of hash functions were subjected to discussions of great length on the ReiserFS mailing list. (Try this query.) Roughly speaking: 99% of the time, this option is not required. If the normal autodection code can't determine which hash to use (because both hases had the same value for a file) use this option to force a specific hash. It won't allow you to override the existing hash on the FS, so if you have a tea hash disk, and mount with -o hash=rupasov, the mount will fail. rupasov This hash is invented by Yury Yu. Rupasov <yura@yura.polnet.botik.ru>. It is fast and preserves locality, mapping lexicographically close file names to the close hash values. Never use it, as it has high probability of hash collisions. tea This hash is a Davis-Meyer function implemented by Jeremy Fitzhardinge <jeremy@zip.com.au>. It is hash permuting bits in the name thoroughly. It gets high randomness and, therefore, low probability of hash collision, but this costs performance. Use this if you got EHASHCOLLISION with r5 hash. r5 This hash is a modified version of rupasov hash. It is used by default and it is better to stick here until you have to support huge directories and unusual file-name patterns. detect This is the instructs mount to detect hash function in use by instance of filesystem being mounted and write this information into superblock. This is only useful on the first mount of old filesystem. Example: mount -t reiserfs -o hash=r5 /dev/sdb1 /mnt/scsi-disk-b block-allocator=hashed_relocation / no_unhashed_relocation / noborder / border Tunes block allocator. This option is used for testing experimental features, makes benchmarking new features with and without more convenient, should never be used by users in any code shipped to users (ideally). hashed_relocation Tunes block allocator. This may give you performance improvements in some situations. no_unhashed_relocation Tunes block allocator. This may give you performance improvements in some situations. 
noborder Disable `border allocator algorithm' invented by Yury Yu. Rupasov <yura@yura.polnet.botik.ru>. This may give you performance improvements in some situations. block-allocator=border Enable `border allocator algorithm' invented by Yury Yu. Rupasov <yura@yura.polnet.botik.ru>. This may give you performance improvements in some situations. Example: mount -t reiserfs -o block-allocator=border /dev/sdb1 /mnt/scsi-disk-b linux kernels 2.6.x conv Instructs 3.6 ReiserFS code to mount 3.5 filesystem, using 3.6 format for newly created objects. After this you cannot use it through 3.5 ReiserFS tools anymore. This option causes conversion of old format super block to the new format. If not specified - old partition will be dealt with in a manner of 3.5. Example: mount -t reiserfs -o conv /dev/sdb1 /mnt/scsi-disk-b nolog Disable journalling. This will get you slight performance improvement in some situations at the cost of losing fast recovery from crashes. Actually even with this option turned on, ReiserFS still performs all journalling paraphernalia, save for actual writes into journalling area. Implementation of real nolog is work in progress. Example: mount -t reiserfs -o nolog /dev/sdb1 /mnt/scsi-disk-b notail By default, ReiserFS stores small files and `file tails' directly into the tree. This confuses some utilities like LILO. This option is used to disable packing of files into the tree. Example: mount -t reiserfs -o notail /dev/sdb1 /mnt/scsi-disk-b replayonly Replay transactions in journal, but don't actually mount filesystem. Used by fsck, mostly. Example: mount -t reiserfs -o replayonly /dev/sdb1 /mnt/scsi-disk-b jdev=journal_device Example: mount -t reiserfs -o jdev=/dev/sdb2 /dev/sdb1 /mnt/scsi-disk-b attrs Example: mount -t reiserfs -o attrs /dev/sdb1 /mnt/scsi-disk-b noattrs Example: mount -t reiserfs -o noattrs /dev/sdb1 /mnt/scsi-disk-b resize=NUMBER Remount option allowing to expand ReiserFS partition on-line. 
Make ReiserFS think that the device has NUMBER blocks. Useful with LVM devices. There is a special resizer utility which can be obtained from ftp://ftp.namesys.com/pub/reiserfsprogs Example: mount -t reiserfs -o resize=680000 /dev/sdb1 /mnt/scsi-disk-b data=ordered / journal / writeback Specifies the journalling mode for file data. Metadata is always journaled. journal All data is committed into the journal prior to being written into the main file system. ordered This is the default mode. All data is forced directly out to the main file system prior to its metadata being committed to the journal. writeback Data ordering is not preserved - data may be written into the main file system after its metadata has been committed to the journal. This is rumoured to be the highest-throughput option. It guarantees internal file system integrity, but it can allow old data to appear in files after a crash and journal recovery. Example: mount -t reiserfs -o data=writeback /dev/sdb1 /mnt/scsi-disk-b block-allocator=hashed_relocation / no_unhashed_relocation / noborder / border Tunes the block allocator. This option is used for testing experimental features and makes benchmarking new features with and without them more convenient; it should (ideally) never be used in code shipped to users. hashed_relocation Tunes the block allocator. This may give you performance improvements in some situations. no_unhashed_relocation Tunes the block allocator. This may give you performance improvements in some situations. noborder Disable the `border allocator algorithm' invented by Yury Yu. Rupasov <yura@yura.polnet.botik.ru>. This may give you performance improvements in some situations. block-allocator=border Enable the `border allocator algorithm' invented by Yury Yu. Rupasov <yura@yura.polnet.botik.ru>. This may give you performance improvements in some situations.
Example: mount -t reiserfs -o block-allocator=border /dev/sdb1 /mnt/scsi-disk-b

Maintainer: grev@namesys.com (source: http://web.archive.org/web/20061113154827/www.namesys.com/mount-options.html)

[[category:ReiserFS]]

Mount4

The mount options for [[Reiser4]] are currently only "documented" in [https://github.com/edward6/reiser4/blob/master/init_super.c init_super.c]:

* '''tmgr.atom_max_size''' - Atoms containing more than N blocks will be forced to commit. N is decimal.
* '''tmgr.atom_max_age''' - Atoms older than N seconds will be forced to commit. N is decimal.
* '''tmgr.atom_min_size''' - In committing an atom to free dirty pages, force an atom smaller than N to fuse with another one.
* '''tmgr.atom_max_flushers''' - Limit on concurrent flushers for one atom. 0 means no limit.
* '''tree.cbk_cache.nr_slots''' - Number of slots in the cbk cache.
* '''flush.relocate_threshold''' - If flush finds more than FLUSH_RELOCATE_THRESHOLD adjacent dirty leaf-level blocks, it will force them to be relocated.
* '''flush.relocate_distance''' - If flush can find a block allocation within FLUSH_RELOCATE_DISTANCE of the preceder, it will relocate to that position.
* '''flush.written_threshold''' - If we have written this many blocks or more before encountering a busy jnode in the flush list, abort flushing in the hope that by the next time we are called this jnode will already be clean, saving some seeks.
* '''flush.scan_maxnodes''' - The maximum number of nodes to scan left on a level during flush.
* '''optimal_io_size''' - The preferred I/O size.
* '''tree.carry.new_node_flags''' - Carry flags used for insertion of new nodes.
* '''tree.carry.new_extent_flags''' - Carry flags used for insertion of new extents.
* '''tree.carry.paste_flags''' - Carry flags used for paste operations.
* '''tree.carry.insert_flags''' - Carry flags used for insert operations.
* '''altsuper''' - Alternative master superblock location, in case its original location is not writable/accessible. This is an offset in BYTES.

[[category:Reiser4]]

NamesysBenchmarks

<font color="red">This page is a disaster, do we even want to clean this up? It's all stale benchmarks anyway :-\</font>

__TOC__

== Benchmarks Of Reiser4 ==

The <tt>htree</tt> (<tt>-O dir_index</tt>) feature is a recent attempt by the ext3 developers to handle large directories as well as ReiserFS does, using better-than-linear search algorithms. One of the interesting results in this benchmark was that <tt>htree</tt> does bad things to ext3 performance, at least for this benchmark. This means that trying to get usable performance for large directories with ext3 can severely impact your performance for the non-large case. You'll note that in our latest benchmark at the top here we use larger filesets. It seems that ext3 does a poor job of utilizing its write cache when the fileset uses a lot of memory without exceeding it, and by increasing the size of the fileset we get a fairer (read: better for ext3) benchmark for the create phase. The use of filesets small enough to barely fit into RAM for the create (but not the copy) phase was due to my being lax in supervising the benchmarking, but it did reveal something interesting. Probably Andrew Morton will fix that pretty quickly --- it's most likely not a deep fix to make like fixing <tt>htree</tt> would be. If anyone knows where the tail combining patch for ext3 went to, let us know so we can benchmark that....
Good tail-combining performance is not trivial to get right, and I am wondering whether there is a performance reason it did not go in. Keep in mind that these benchmarks are still evolving and maturing, and I need to give the mongo code a complete review again, as it has been worked on by others quite a bit. Note that while I like the mongo benchmarks, those who are concerned that they may be stacked in our favor can look at the benchmarks run by others on lkml, one of which is at the bottom of this page; while not as elaborate and detailed as mongo, it comes up with roughly the same result. Andrew Morton wrote some beautiful readahead code in the VM, many thanks to him for what it contributes to V4 performance; unfortunately it should be confessed that these benchmarks utterly fail to measure its cleverness for real-world usage patterns. In fact, these benchmarks basically access everything once in each pass, which is not at all realistic in representing typical server workloads. So understand them as validly illuminating some aspects of performance, not all aspects, if you could be so generous. We ran data=ordered ext3 benchmarks at the suggestion of Andrew Morton, but they came out slower for this benchmark. We need to increase the base size range to 8k and run again. [[Reiser4]] is a fully atomic filesystem; keep in mind that these performance numbers are with every FS operation performed as a fully atomic transaction. We are the first to make that performance-effective to do. Look for a user-space transactions interface to come out soon.
Finally, remember that Reiser4 is more space efficient than [[ReiserFS]]; the <tt>df(1)</tt> measurements are there for looking at....;-) === mongo 2.6.15-mm4 === [[Mongo]] comparison: ext3 vs. reiser4 with the "unixfile" regular file plugin vs. reiser4 with the [ftp://ftp.namesys.com/pub/tmp/cryptcompress_patches cryptcompress] regular file plugin. * 2.6.15-mm4 #1 Sat Feb 11 20:00:11 MSK 2006 * cryptcompress-4.patch * mem total = 516312 KB * Intel(R) Xeon(TM) CPU 2.40GHz, running UP kernel <p>Legend:</p> <ul> <li><tt>A</tt> reiser4 with "cryptcompress" regular file plugin</li> <li><tt>B</tt> reiser4 with "unixfile" regular file plugin</li> <li><tt>C</tt> ext3</li> </ul> <p> The table presents absolute values (elapsed time, CPU usage, CPU utilization, disk usage) for reiser4 with the "cryptcompress" regular file plugin, and ratios against that configuration for reiser4 with the "unixfile" regular file plugin and for ext3. A <font color=red>red</font> number means the ratio is larger than <tt>1.0</tt>, i.e. reiser4 with "cryptcompress" is better in that test. A <font color=green>green</font> number means that it loses in that test.
</p> <table cols=13 cellpadding=2 cellspacing=2 noborder> <tr><td bgcolor=black colspan=13><font color=white></td></tr> <tr> <th bgcolor=#303030 colspan=13 align=left><font color=white>A.MKFS=mkfs.reiser4 -y -o create=create_ccreg40,compressMode=col8 MOUNT_OPTIONS=noatime FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=13 align=left><font color=white>B.MKFS=mkfs.reiser4 -y MOUNT_OPTIONS=noatime FSTYPE=reiser4 (unixfile regular file plugin)</font></th> </tr> <tr> <th bgcolor=#303030 colspan=13 align=left><font color=white>C.MOUNT_OPTIONS=noatime,data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <td colspan=13 bgcolor=#606060><b><font color=white>#0:</font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=3><b>REAL_TIME</b></td> <td colspan=3><b>CPU_TIME</b></td> <td colspan=3><b>CPU_UTIL</b></td> <td colspan=3><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>CREATE</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 53.36</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.234 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 4.249 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>28.79</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.493</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>94.36</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.255 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.155</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 775856</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font 
color=red> 2.550 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.825 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>COPY</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 137.6</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.543 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.931 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>40.91</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.716</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.975 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>59.94</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.257 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.183</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1551756</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.550 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.825 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>READ</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 161.17</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.087 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.077 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>48.35</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.433 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.195</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>33.23</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.487 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.291</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1551756</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.550 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font 
color=red> 2.825 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>STATS</b></td> <td bgcolor=#E0E0C0 align=right><tt>24.12</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.936</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.927</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>6.76</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.941 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.624</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>27.97</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.005 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.676</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1551756</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.550 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.825 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>DELETE</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 155.26</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.091 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 0.989</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>38.76</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.824 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.108</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>26.33</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.758 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.104</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>4</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.000 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> </tt></td> </tr> <tr> <td 
colspan=13 bgcolor=#606060><b><font color=white>#1:DD_MBCOUNT=5000 </font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=3><b>REAL_TIME</b></td> <td colspan=3><b>CPU_TIME</b></td> <td colspan=3><b>CPU_UTIL</b></td> <td colspan=3><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_writing_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 116.02</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.430 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.553 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>38.65</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.514</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.619 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>92.86</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.155 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.149</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1909012</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.682 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.685 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_reading_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 153.76</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 0.996</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>58.11</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.192 </font></tt></td> <td bgcolor=#E0E0C0 
align=right><tt><font color=green> <U> 0.147</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>38.73</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.224 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.152</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1909012</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.682 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.685 </font></tt></td> </tt> <tr><td bgcolor=black colspan=13><font color=white> <tr> <td colspan=13 bgcolor=#303030><b><font color=white>DIR=/mnt1 GAMMA=0.2 WRITE_BUFFER=131072 PHASE_APPEND=off SYNC=off PHASE_DELETE=rm NPROC=1 DEV=/dev/hda9 DD_MBCOUNT=5000 FILE_SIZE=8192 REP_COUNTER=1 PHASE_COPY=cp INFO_R4=2.6.15-mm4 cryptcompress-4.patch PHASE_READ=find BYTES=1024000000 PHASE_OVERWRITE=off PHASE_MODIFY=off </td></tr> Legend: <font color="green">green</font> color means the result is better (less) than reference value from the first column, results marked as <font color="red">red</font> are worse than reference value, best results are <u>underlined</u> other results which fit into 2% margin of the best result are underlined also. === mongo 2.6.11 === [[mongo]] comparison against xfs and ext2 <dl> <dt>reiser4 </dt> <dd>reiser4-for-2.6.11-5.patch from <a href="ftp://ftp.namesys.com/pub/reiser4-for-2.6/2.6.11">ftp://ftp.namesys.com/pub/reiser4-for-2.6/2.6.11</a> </dd> <dt>mem total</dt> <dd>254496</dd> <dt>machine </dt> <dd>bones</dd> <dt>kernel </dt> <dd>2.6.11-reiser4-5 #2 SMP Sat Jun 4 20:06:47 MSD 2005</dd> <dt>date </dt> <dd>Fri Jun 17 23:52:17 2005</dd> </dl> <p> In this test 81% of files are chosen from the 0-10k size range and 19% from the 10-100k size range. 
</p> <!-- File stats: Units are decimal (1k = 1000) files 0-100 : 1433 files 100-1K : 12597 files 1K-10K : 103101 files 10K-100K : 28131 files 100K-1M : 0 files 1M-10M : 0 files 10M-larger : 0 total bytes written : 1886585039 --> <p>Legend:</p> <ul> <li><tt>A</tt> reiser4</li> <li><tt>B</tt> reiserfs <tt>v3 (notail)</tt></li> <li><tt>C</tt> ext2</li> <li><tt>D</tt> xfs default</li> </ul> <p> Table presents absolute values (of elapsed time, CPU usage, CPU utilization, disk usage) for reiser4, and ratios against reiser4 for all other configurations. <font color=red>Red</font> number means ratio is larger than <tt>1.0</tt>, that is, reiser4 is better in this test. <font color=green>Green</font> number means that reiser4 loses in this test. </p> <table cols=17 cellpadding=2 cellspacing=2 noborder> <tr><td bgcolor=black colspan=17><font color=white></td></tr> <tr> <th bgcolor=#303030 colspan=17 align=left><font color=white>A.FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=17 align=left><font color=white>B.FSTYPE=reiserfs MOUNT_OPTIONS=notail </font></th> </tr> <tr> <th bgcolor=#303030 colspan=17 align=left><font color=white>C.FSTYPE=ext2 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=17 align=left><font color=white>D.MKFS=mkfs.xfs -f FSTYPE=xfs </font></th> </tr> <tr> <td colspan=17 bgcolor=#606060><b><font color=white>#0:</font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=4><b>REAL_TIME</b></td> <td colspan=4><b>CPU_TIME</b></td> <td colspan=4><b>CPU_UTIL</b></td> <td colspan=4><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>CREATE</b></td> <td bgcolor=#E0E0C0 
align=right><tt><U> 66.12</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.022 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.686 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 4.288 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>34.98</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.901</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.114 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.445 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>29.86</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.424 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.398</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.398</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1623204</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.086 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.098 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>COPY</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 187.77</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.438 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.751 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.733 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>44.8</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.883</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.124 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.161 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>14.85</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.606 </font></tt></td> <td 
bgcolor=#E0E0C0 align=right><tt><font color=green> 0.611 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.353</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 3245428</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.087 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.098 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>READ</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 151.01</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.459 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.113 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.978 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>44.34</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.607 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.470</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.535 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>18.54</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.444</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.500 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.724 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 3245428</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.087 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.098 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>STATS</b></td> <td bgcolor=#E0E0C0 align=right><tt>22.04</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.314 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.812</U> 
</font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.871 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>8.61</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.698 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.571</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 4.591 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>20.11</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.528</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.709 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.579 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 3245428</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.087 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.098 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>DELETE</b></td> <td bgcolor=#E0E0C0 align=right><tt>108.77</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.313</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.193 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.071 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>41</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.637 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.091</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.795 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>21.45</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.795 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.077</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.556 </font></tt></td> </tt></td> <td 
bgcolor=#E0E0C0 align=right><tt>4</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 14877.000 </font></tt></td> </tt></td> </tr> <tr> <td colspan=17 bgcolor=#606060><b><font color=white>#1:DD_MBCOUNT=5000 </font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=4><b>REAL_TIME</b></td> <td colspan=4><b>CPU_TIME</b></td> <td colspan=4><b>CPU_UTIL</b></td> <td colspan=4><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_writing_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 536.06</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.005 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.017 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 0.982</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>122.28</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.826 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.819</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.806</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>14.99</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.771 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.711</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.742 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 
align=right><tt><U> 5120008</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.012</U> </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_reading_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt>145.32</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.031 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.965</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 0.982</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>157.51</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.947 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.890</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.880</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>57.01</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.901</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.909 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.884</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 5120008</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.012</U> </font></tt></td> </tt></td> </tr> <tr><td bgcolor=black colspan=17><font color=white></td></tr> <tr><td colspan=17 align=right> <tr> <td colspan=17 bgcolor=#303030><b><font color=white>INFO_R4=2.6.11 + reiser4-5 REP_COUNTER=1 DEV=/dev/hda5 DD_MBCOUNT=5000 PHASE_OVERWRITE=off FILE_SIZE=8192 NPROC=3 PHASE_READ=find PHASE_DELETE=rm PHASE_APPEND=off WRITE_BUFFER=131072 
DIR=/mnt1 PHASE_MODIFY=off BYTES=1024000000 PHASE_COPY=cp GAMMA=0.2 SYNC=off </td></tr> <tr><td colspan=17 align=right> <font size=-2>Produced by <a href=http://namesys.com/benchmarks/mongo_readme.html>Mongo</a> benchmark suite.</font></td></tr> </table> === mongo 2.6.8.1-mm3 === [[mongo]] comparison against ext3 <dl> <dt>reiser4 </dt> <dd>large key</dd> <dt>mem total</dt> <dd>254324</dd> <dt>machine </dt> <dd>bones</dd> <dt>kernel </dt> <dd>2.6.8.1-mm3 #3 SMP Mon Aug 23 19:33:13 MSD 2004</dd> <dt>date </dt> <dd>Tue Aug 31 15:47:51 2004</dd> </dl> <p> In this test 81% of files are chosen from the 0-10k size range and 19% from the 10-100k size range. </p> <!-- File stats: Units are decimal (1k = 1000) files 0-100 : 1433 files 100-1K : 12597 files 1K-10K : 103101 files 10K-100K : 28131 files 100K-1M : 0 files 1M-10M : 0 files 10M-larger : 0 total bytes written : 1886585039 --> <p>Legend:</p> <ul> <li><tt>A</tt> reiser4</li> <li><tt>B</tt> reiser4, extents only</li> <li><tt>C</tt> reiserfs <tt>v3 (notail)</tt></li> <li><tt>D</tt> ext3 in <tt>data=writeback</tt> mode (meta-data only journalling)</li> <li><tt>E</tt> ext3 in <tt>data=journal</tt> mode</li> <li><tt>F</tt> ext3 in <tt>data=ordered</tt> mode</li> </ul> <img src="http://www.namesys.com/intbenchmarks/mongo/04.08.26/256MB.RAM/one-thread-8k.g02.charts/CREATE.0.png"> <img src="http://www.namesys.com/intbenchmarks/mongo/04.08.26/256MB.RAM/one-thread-8k.g02.charts/COPY.0.png"> <img src="http://www.namesys.com/intbenchmarks/mongo/04.08.26/256MB.RAM/one-thread-8k.g02.charts/READ.0.png"> <img src="http://www.namesys.com/intbenchmarks/mongo/04.08.26/256MB.RAM/one-thread-8k.g02.charts/STATS.0.png"> <img src="http://www.namesys.com/intbenchmarks/mongo/04.08.26/256MB.RAM/one-thread-8k.g02.charts/DELETE.0.png"> <img src="http://www.namesys.com/intbenchmarks/mongo/04.08.26/256MB.RAM/one-thread-8k.g02.charts/dd_writing_largefile.1.png"> <img 
src="http://www.namesys.com/intbenchmarks/mongo/04.08.26/256MB.RAM/one-thread-8k.g02.charts/dd_reading_largefile.1.png"> <p> The table presents absolute values (elapsed time, CPU usage, CPU utilization, and disk usage) for reiser4, and ratios against reiser4 for all other configurations. A <font color=red>red</font> number means the ratio is larger than <tt>1.0</tt>, that is, reiser4 is better in this test. A <font color=green>green</font> number means that reiser4 loses in this test. </p> <table cols=25 cellpadding=2 cellspacing=2 noborder> <tr><td bgcolor=black colspan=25><font color=white></td></tr> <tr> <th bgcolor=#303030 colspan=25 align=left><font color=white>A.FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=25 align=left><font color=white>B.FSTYPE=reiser4 MKFS=mkfs.reiser4 -q -o extent=extent40 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=25 align=left><font color=white>C.MOUNT_OPTIONS=notail FSTYPE=reiserfs </font></th> </tr> <tr> <th bgcolor=#303030 colspan=25 align=left><font color=white>D.MOUNT_OPTIONS="data=writeback" FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=25 align=left><font color=white>E.MOUNT_OPTIONS="data=journal" FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=25 align=left><font color=white>F.MOUNT_OPTIONS="data=ordered" FSTYPE=ext3 </font></th> </tr> <tr> <td colspan=25 bgcolor=#606060><b><font color=white>#0:</font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=6><b>REAL_TIME</b></td> <td colspan=6><b>CPU_TIME</b></td> <td colspan=6><b>CPU_UTIL</b></td> <td colspan=6><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A
</b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>CREATE</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 91.6</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 0.988</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.983 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.592 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.010 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.256 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>31.13</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.965 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.826</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.577 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.529 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.802 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>22.63</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.981 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.350</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.791 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.738 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.000 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1978440</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.088 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font 
color=red> 1.108 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>COPY</b></td> <td bgcolor=#E0E0C0 align=right><tt>219.5</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.968</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.674 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.241 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.105 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.819 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>54.04</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.938 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.792</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.694 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.004 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.860 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>16.01</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.996 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.460</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.663 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.839 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.890 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 3956708</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.088 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>READ</b></td> <td 
bgcolor=#E0E0C0 align=right><tt><U> 187.34</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.007</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.617 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.282 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.295 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.250 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>38.61</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.002 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.711 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.615</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.622</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.615</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>13.05</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.995 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.441</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.520 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.517 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.533 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 3956708</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.088 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>STATS</b></td> <td bgcolor=#E0E0C0 align=right><tt>23.71</tt></td> <td bgcolor=#E0E0C0 
align=right><tt><font color=green> 0.968 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.162 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.943</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.943</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.943</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>10.91</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.944 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.717 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.661</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.674 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.658</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>24.46</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.971 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.587</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.700 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.707 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.697 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 3956708</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.088 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>DELETE</b></td> <td bgcolor=#E0E0C0 align=right><tt>156.84</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.993 </font></tt></td> 
<td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.233</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.264 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.270 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.216 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>53.05</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.938 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.440 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.209</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.215 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.214 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>18.23</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.947 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.758 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.157</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.160 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.167 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>4</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.000 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> </tt></td> </tr> <tr> <td colspan=25 bgcolor=#606060><b><font color=white>#1:DD_MBCOUNT=768 </font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=6><b>REAL_TIME</b></td> <td colspan=6><b>CPU_TIME</b></td> <td colspan=6><b>CPU_UTIL</b></td> 
<td colspan=6><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_writing_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 30.09</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.006</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.286 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.342 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.473 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.311 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>5.24</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.996 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.966</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.286 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.393 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.437 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>11.43</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.994 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.631</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.796 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.655 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.967 </font></tt></td> 
</tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 786436</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_reading_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt>28.38</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.969</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.010 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 0.980</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 0.982</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.999 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>4.37</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.979 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.014 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.911</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.895</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.936 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>8.88</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.030 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.922 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.858</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.854</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.867</U> 
</font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 786436</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr><td bgcolor=black colspan=25><font color=white></td></tr> <tr><td colspan=25 align=right> <tr> <td colspan=25 bgcolor=#303030><b><font color=white>REP_COUNTER=1 PHASE_COPY=cp INFO_R4=2.6.8.1-mm3 + parse_options.patch FILE_SIZE=8192 DEV=/dev/hda6 PHASE_MODIFY=off DD_MBCOUNT=768 PHASE_APPEND=off PHASE_OVERWRITE=off SYNC=off DIR=/mnt1 PHASE_DELETE=rm NPROC=1 BYTES=1024000000 GAMMA=0.2 PHASE_READ=find WRITE_BUFFER=131072 </td></tr> <tr><td colspan=25 align=right> <font size=-2>Produced by the <a href=http://namesys.com/>Mongo</a> benchmark suite.</font></td></tr> </table> === slow.c 2004-03-26 === [[:File:Slow.c.txt|slow.c]] comparison against ext2 and ext3, 2004-03-26 <p> These are the <a href="http://www.jburgess.uklinux.net/slow.c">slow.c</a> benchmark results for the 2004.03.26 reiser4 snapshot. </p> <p> <b>slow.c</b> is a simple program by Jon Burgess which writes and reads multiple data streams. For details and the source code, see <a href="http://marc.theaimsgroup.com/?l=linux-kernel&m=107652683608384&w=2">the discussion</a> in the linux-kernel mailing list.
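The access pattern slow.c exercises can be paraphrased in Python (a simplified, hypothetical stand-in with made-up file names; the real benchmark is the C program linked above, which also times the I/O to report Mb/s): several writers append fixed-size chunks to separate files in round-robin order, so the filesystem must allocate blocks for interleaved streams.

```python
import os
import shutil
import tempfile

def interleaved_write(nstreams, chunk=65536, chunks_per_stream=16):
    """Append chunk-sized writes to nstreams files in round-robin order,
    mimicking slow.c's interleaved data streams, and return the resulting
    file sizes. Simplified sketch: no timing, small fixed data volume."""
    tmpdir = tempfile.mkdtemp()
    try:
        paths = [os.path.join(tmpdir, "foo.%d" % i) for i in range(nstreams)]
        files = [open(p, "wb") for p in paths]
        data = b"\0" * chunk
        for _ in range(chunks_per_stream):
            for f in files:  # one chunk per stream per pass -> interleaved allocation
                f.write(data)
        for f in files:
            f.close()
        return [os.path.getsize(p) for p in paths]
    finally:
        shutil.rmtree(tmpdir)

sizes = interleaved_write(4)
print(sizes)  # [1048576, 1048576, 1048576, 1048576]
```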
</p> <p> kernel : 2.6.5-rc2</p> <p> RAM : 256Mb</p> <p> reiser4 : <a href="http://www.namesys.com/snapshots/2004.03.26/">2004.03.26 snapshot</a></p> <p>Hardware specs:</p> <pre> processor : 1 vendor_id : AuthenticAMD cpu family : 6 model : 6 model name : AMD Athlon(tm) Processor stepping : 2 cpu MHz : 1460.098 cache size : 256 KB bogomips : 2916.35 Dual CPU AMD Athlon(tm) 1.4Ghz </pre> <pre> # hdparm /dev/hda6: multcount = 16 (on) IO_support = 1 (32-bit) unmaskirq = 1 (on) using_dma = 1 (on) keepsettings = 0 (off) readonly = 0 (off) readahead = 256 (on) geometry = 65535/16/63, sectors = 35937342, start = 84164598 </pre> <pre> # hdparm -t /dev/hda6 /dev/hda6: Timing buffered disk reads: 84 MB in 3.07 seconds = 27.39 MB/sec </pre> <pre> # hdparm -i /dev/hda /dev/hda: Model=IC35L060AVER07-0, FwRev=ER6OA44A, SerialNo=SZPTZMB6154 Config={ HardSect NotMFM HdSw>15uSec Fixed DTR>10Mbs } RawCHS=16383/16/63, TrkSize=0, SectSize=0, ECCbytes=40 BuffType=DualPortCache, BuffSize=1916kB, MaxMultSect=16, MultSect=16 CurCHS=16383/16/63, CurSects=16514064, LBA=yes, LBAsects=120103200 IORDY=on/off, tPIO={min:240,w/IORDY:120}, tDMA={min:120,rec:120} PIO modes: pio0 pio1 pio2 pio3 pio4 DMA modes: mdma0 mdma1 mdma2 UDMA modes: udma0 udma1 udma2 AdvancedPM=yes: disabled (255) WriteCache=enabled Drive conforms to: ATA/ATAPI-5 T13 1321D revision 1: * signifies the current active mode </pre> <pre> <!-- (500Mb of data) test : ./slow foo 500 Results : ============================================================== | 1 stream | 2 streams --------------+----------------------------------------------- | WRITE READ | WRITE READ --------------+----------------------------------------------- ext2 25.08Mb/s 27.08Mb/s 13.72Mb/s 14.04Mb/s reiser4 26.31Mb/s 26.99Mb/s 24.03Mb/s 26.84Mb/s reiser4-extents 25.28Mb/s 27.40Mb/s 24.12Mb/s 26.85Mb/s ext3-ordered 20.99Mb/s 26.40Mb/s 12.01Mb/s 13.34Mb/s ext3-journal 10.13Mb/s 24.48Mb/s 8.87Mb/s 13.26Mb/s reiserfs 20.42Mb/s 27.67Mb/s 12.98Mb/s 13.13Mb/s 
reiserfs-notail 20.07Mb/s 27.58Mb/s 13.04Mb/s 13.25Mb/s ============================================================== -->
(1000Mb of data)
test    : ./slow foo 1000
Results :
==============================================================================================================
                |  1 stream             |  2 streams            |  4 streams            |  8 streams
----------------+-----------------------+-----------------------+-----------------------+---------------------
                |  WRITE      READ      |  WRITE      READ      |  WRITE      READ      |  WRITE      READ
----------------+-----------------------+-----------------------+-----------------------+---------------------
ext2            24.66Mb/s  27.56Mb/s     13.40Mb/s  13.67Mb/s     7.73Mb/s   6.94Mb/s     6.69Mb/s   3.52Mb/s
reiser4         25.42Mb/s  27.71Mb/s     23.96Mb/s  26.34Mb/s    24.55Mb/s  26.58Mb/s    24.90Mb/s  26.76Mb/s
reiser4-extents 25.60Mb/s  27.68Mb/s     24.19Mb/s  25.92Mb/s    25.24Mb/s  27.12Mb/s    25.39Mb/s  26.72Mb/s
ext3-ordered    20.05Mb/s  26.46Mb/s     11.06Mb/s  13.12Mb/s     9.63Mb/s   6.76Mb/s    10.02Mb/s   3.48Mb/s
ext3-journal    10.10Mb/s  26.81Mb/s      8.87Mb/s  13.08Mb/s     8.59Mb/s   6.84Mb/s     8.14Mb/s   3.47Mb/s
reiserfs        20.19Mb/s  27.48Mb/s     12.69Mb/s  13.03Mb/s     8.27Mb/s   6.84Mb/s     7.87Mb/s   4.13Mb/s
reiserfs-notail 20.31Mb/s  27.10Mb/s     12.74Mb/s  13.09Mb/s     8.33Mb/s   6.89Mb/s     7.87Mb/s   4.17Mb/s
==============================================================================================================
</pre> <table> <tr> <td><img src="http://www.namesys.com/intbenchmarks/slow/04.03.25-int.snapshot.bones/wr.1.png"></td> <td><img src="http://www.namesys.com/intbenchmarks/slow/04.03.25-int.snapshot.bones/wr.2.png"></td> <td><img src="http://www.namesys.com/intbenchmarks/slow/04.03.25-int.snapshot.bones/wr.4.png"></td> <td><img src="http://www.namesys.com/intbenchmarks/slow/04.03.25-int.snapshot.bones/wr.8.png"></td> </tr> <tr> <td><img src="http://www.namesys.com/intbenchmarks/slow/04.03.25-int.snapshot.bones/rd.1.png"></td> <td><img src="http://www.namesys.com/intbenchmarks/slow/04.03.25-int.snapshot.bones/rd.2.png"></td> <td><img src="http://www.namesys.com/intbenchmarks/slow/04.03.25-int.snapshot.bones/rd.4.png"></td> <td><img src="http://www.namesys.com/intbenchmarks/slow/04.03.25-int.snapshot.bones/rd.8.png"></td> </tr>
</table> === mongo 2003-11-20 === [[mongo]] comparison against ext3, 2003-11-20 <dl> <dt>reiser4 </dt> <dd>''</dd> <dt>mem total</dt> <dd>255716</dd> <dt>machine </dt> <dd>belka</dd> <dt>kernel </dt> <dd>2.6.0-test9 #2 SMP Thu Nov 20 16:08:42 MSK 2003</dd> <dt>date </dt> <dd>Thu Nov 20 16:16:50 2003</dd> </dl> <p> In this test 80% of files are chosen from the 0-8k size range, 16% from the 0-80k size range, 3.2% (0.8 x 4%) from the 0-800k size range, etc. Most files are small; most bytes are in large files. </p> <p>Legend:</p> <ul> <li><tt>A</tt> reiser4</li> <li><tt>B</tt> reiser4, extents only</li> <li><tt>C</tt> reiserfs <tt>v3</tt></li> <li><tt>D</tt> ext3 in <tt>data=writeback</tt> mode (meta-data only journalling)</li> <li><tt>E</tt> ext3 in <tt>data=journal</tt> mode</li> <li><tt>F</tt> ext3 in <tt>data=ordered</tt> mode</li> <li><tt>G</tt> ext3 with htree (hashed directories)</li> </ul> <p> The table presents absolute values of elapsed time, CPU usage, and disk usage for reiser4, and ratios against reiser4 for all other configurations. A <font color=red>red</font> number means the ratio is larger than <tt>1.0</tt>, i.e. reiser4 is better in this test. A <font color=green>green</font> number means that reiser4 loses in this test.
</p> <table cols=22 cellpadding=2 cellspacing=2 noborder> <tr><td bgcolor=black colspan=22><font color=white></td></tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>A.INFO_R4='' FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>B.INFO_R4='' MKFS=mkfs.reiser4 -q -o policy=extents FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>C.FSTYPE=reiserfs </font></th> </tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>D.MOUNT_OPTIONS=data=writeback FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>E.MOUNT_OPTIONS=data=journal FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>F.MOUNT_OPTIONS=data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>G.MKFS=mkfs.ext3 -O dir_index MOUNT_OPTIONS=data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <td colspan=22 bgcolor=#606060><b><font color=white>#0:</font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=7><b>REAL_TIME</b></td> <td colspan=7><b>CPU_TIME</b></td> <td colspan=7><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>CREATE</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 21.81</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.171 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.983 </font></tt></td> <td bgcolor=#E0E0C0 
align=right><tt><font color=red> 3.253 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.702 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.161 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.212 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>6.38</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.130 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.020 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.461 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.461 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.354 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.851</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 607612</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.091 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.035 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>COPY</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 64.37</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.089 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.046 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.980 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.834 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.929 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 6.246 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>11.55</tt></td> 
<td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.047 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.797 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.590 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.725 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.542 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.698</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1214992</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.091 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.034 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>READ</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 45.38</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.026 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.406 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.248 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.307 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.232 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 7.192 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>10.13</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.934 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.517 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.454 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.453</U> </font></tt></td> <td bgcolor=#E0E0C0 
align=right><tt><font color=green> <U> 0.444</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.504 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1214992</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.091 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.034 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>STATS</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 5.74</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.030 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.413 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.014</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.033 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.021 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.634 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>2.34</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.000 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.936 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.761 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.791 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.774 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.744</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1214992</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.091 </font></tt></td> <td bgcolor=#E0E0C0 
align=right><tt><font color=red> 1.034 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>DELETE</b></td> <td bgcolor=#E0E0C0 align=right><tt>46.94</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.424</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.520 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.017 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.043 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.956 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.315 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>14.19</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.743 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.443 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.200</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.206 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.201</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.234 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>4</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.000 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td 
bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> </tt></td> </tr> <tr> <td colspan=22 bgcolor=#606060><b><font color=white>#1:DD_MBCOUNT=768 </font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=7><b>REAL_TIME</b></td> <td colspan=7><b>CPU_TIME</b></td> <td colspan=7><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_writing_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 29.33</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.026 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.184 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.102 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.499 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.097 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.098 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>2.61</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.008 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.659</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.437 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.054 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.556 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.571 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 786436</U></tt></td> 
<td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_reading_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 22.96</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.056 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.003</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.004</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.003</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.006</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>2.26</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.991 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.912 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.796 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.765</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.779</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.783 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 786436</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 
align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr><td bgcolor=black colspan=22><font color=white></td></tr> <tr> <td colspan=22 bgcolor=#303030><b><font color=white>NPROC=1 DIR=/mnt/testfs SYNC=off PHASE_COPY=cp REP_COUNTER=1 GAMMA=0.2 PHASE_OVERWRITE=off FILE_SIZE=8192 BYTES=512000000 PHASE_APPEND=off PHASE_READ=find DEV=/dev/hdb3 DD_MBCOUNT=768 WRITE_BUFFER=131072 PHASE_DELETE=rm PHASE_MODIFY=off </td></tr> <tr><td colspan=22 align=right> <font size=-2>Produced by the <a href=http://namesys.com/benchmarks/mongo_readme.html>Mongo</a> benchmark suite.</font></td></tr> </table> === mongo 2003-09-25 === [[mongo]] comparison against ext3, 2003-09-25 <dl> <dt>reiser4 </dt> <dd>''</dd> <dt>mem total</dt> <dd>255048</dd> <dt>machine </dt> <dd>belka</dd> <dt>kernel </dt> <dd>2.6.0-test5 #33 SMP Thu Sep 25 15:45:38 MSD 2003</dd> <dt>date </dt> <dd>Thu Sep 25 15:57:38 2003</dd> </dl> <p> In this test 80% of files are chosen from the 0-8k size range, 16% from the 0-80k size range, 3.2% (0.8 x 4%) from the 0-800k size range, etc. Most files are small; most bytes are in large files. </p> <p>Legend:</p> <ul> <li><tt>A</tt> reiser4</li> <li><tt>B</tt> reiser4, extents only</li> <li><tt>C</tt> reiserfs <tt>v3</tt></li> <li><tt>D</tt> ext3 in <tt>data=writeback</tt> mode (meta-data only journalling)</li> <li><tt>E</tt> ext3 in <tt>data=journal</tt> mode</li> <li><tt>F</tt> ext3 in <tt>data=ordered</tt> mode</li> <li><tt>G</tt> ext3 with htree (hashed directories)</li> </ul> <p> The table presents absolute values of elapsed time, CPU usage, and disk usage for reiser4, and ratios against reiser4 for all other configurations.
<font color=red>Red</font> number means ratio is larger than <tt>1.0</tt>, that is, reiser4 is better in this test. <font color=green>Green</font> number means that reiser4 loses in this test. </p> <table cols=22 cellpadding=2 cellspacing=2 noborder> <tr><td bgcolor=black colspan=22><font color=white></td></tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>A.INFO_R4='' FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>B.INFO_R4='' MKFS=mkfs.reiser4 -q -o policy=extents FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>C.FSTYPE=reiserfs </font></th> </tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>D.MOUNT_OPTIONS=data=writeback FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>E.MOUNT_OPTIONS=data=journal FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>F.MOUNT_OPTIONS=data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>G.MKFS=mkfs.ext3 -O dir_index MOUNT_OPTIONS=data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <td colspan=22 bgcolor=#606060><b><font color=white>#0:</font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=7><b>REAL_TIME</b></td> <td colspan=7><b>CPU_TIME</b></td> <td colspan=7><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>CREATE</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 
23.57</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.158 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.714 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.263 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.234 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.020 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.376 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>6.66</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.075 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.947 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.240 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.357 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.264 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.835</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 608548</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.090 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.034 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.105 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.105 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.105 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>COPY</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 64.98</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.083 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.050 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.023 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.810 </font></tt></td> <td bgcolor=#E0E0C0 
align=right><tt><font color=red> 1.908 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 6.850 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>12.18</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.057 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.776 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.507 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.603 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.518 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.743</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1216784</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.090 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.033 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.105 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.105 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.105 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>READ</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 44.65</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.028 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.733 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.237 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.114 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.179 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 7.694 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>10.28</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.933 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.590</U> 
</font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.608 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.593</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.608 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.620 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1216784</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.090 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.033 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.105 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.105 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.105 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>STATS</b></td> <td bgcolor=#E0E0C0 align=right><tt>5.88</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.998 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.139 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.981 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.020 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.929</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.655 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>2.29</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.987 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.900 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.747</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.782 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.747</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 
0.755</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1216784</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.090 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.033 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.105 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.105 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.105 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>DELETE</b></td> <td bgcolor=#E0E0C0 align=right><tt>46.65</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.438</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.504 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.109 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.023 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.022 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.376 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>14.19</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.746 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.431 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.206</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.211 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.211 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.232 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>4</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.000 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> 
</font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> </tt></td> </tr> <tr> <td colspan=22 bgcolor=#606060><b><font color=white>#1:DD_MBCOUNT=768 </font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=7><b>REAL_TIME</b></td> <td colspan=7><b>CPU_TIME</b></td> <td colspan=7><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_writing_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 30.78</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.017</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.177 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.063 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.394 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.066 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.056 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>3.11</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.981 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.553</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.701 </font></tt></td> <td bgcolor=#E0E0C0 
align=right><tt><font color=red> 1.296 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.318 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 786436</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_reading_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 22.96</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.045 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.005</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.005</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.004</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.006</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>2.41</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.996 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.867 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.739 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.718</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.739 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.722</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 786436</U></tt></td> 
<td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr><td bgcolor=black colspan=22><font color=white></td></tr> <tr> <td colspan=22 bgcolor=#303030><b><font color=white>NPROC=1 DIR=/mnt/testfs SYNC=off PHASE_COPY=cp REP_COUNTER=1 GAMMA=0.2 PHASE_OVERWRITE=off FILE_SIZE=8192 BYTES=512000000 PHASE_APPEND=off PHASE_READ=find DEV=/dev/hdb3 DD_MBCOUNT=768 WRITE_BUFFER=131072 PHASE_DELETE=rm PHASE_MODIFY=off </td></tr> <tr><td colspan=22 align=right> <font size=-2>Produced by the <a href=http://namesys.com/benchmarks/mongo_readme.html>Mongo</a> benchmark suite.</font></td></tr> </table> === mongo 2003-08-28 === [[mongo]] comparison against ext3, 2003-08-28 <dl> <dt>reiser4 </dt> <dd>''</dd> <dt>mem total</dt> <dd>256276</dd> <dt>machine </dt> <dd>belka</dd> <dt>kernel </dt> <dd>2.6.0-test4 #194 SMP Thu Aug 28 17:18:47 MSD 2003</dd> <dt>date </dt> <dd>Thu Aug 28 17:20:18 2003</dd> </dl> <p> In this test 80% of files are chosen from the 0-8k size range, 16% from the 0-80k size range, 3.2% (0.8 x 4%) from the 0-800k size range, etc. Most files are small; most bytes are in large files.
</p> <p>Legend:</p> <ul> <li><tt>A</tt> reiser4</li> <li><tt>B</tt> reiser4, extents only</li> <li><tt>C</tt> reiserfs (v3)</li> <li><tt>D</tt> ext3 in <tt>data=writeback</tt> mode (meta-data only journalling)</li> <li><tt>E</tt> ext3 in <tt>data=journal</tt> mode</li> <li><tt>F</tt> ext3 in <tt>data=ordered</tt> mode</li> <li><tt>G</tt> ext3 with htree (hashed directories)</li> </ul> <p> The table presents absolute values (of elapsed time, CPU usage, and disk usage) for reiser4, and ratios against reiser4 for all other configurations. A <font color=red>red</font> number means the ratio is larger than <tt>1.0</tt>, that is, reiser4 is better in that test. A <font color=green>green</font> number means that reiser4 loses in that test. </p> <table cols=22 cellpadding=2 cellspacing=2 noborder> <tr><td bgcolor=black colspan=22><font color=white></font></td></tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>A.INFO_R4='' FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>B.INFO_R4='' MKFS=mkfs.reiser4 -q -o policy=extents FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>C.FSTYPE=reiserfs </font></th> </tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>D.MOUNT_OPTIONS=data=writeback FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>E.MOUNT_OPTIONS=data=journal FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>F.MOUNT_OPTIONS=data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>G.MKFS=mkfs.ext3 -O dir_index MOUNT_OPTIONS=data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <td colspan=22 bgcolor=#606060><b><font color=white>#0:</font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=7><b>REAL_TIME</b></td> <td colspan=7><b>CPU_TIME</b></td> <td colspan=7><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td>
<td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>CREATE</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 21.94</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.056 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.957 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.049 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.430 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.399 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.558 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>6.7</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.104 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.913 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.213 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.334 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.345 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.821</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 608452</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.091 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.034 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.105 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.105 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.105 </font></tt></td> <td bgcolor=#E0E0C0 
align=right><tt><font color=red> 1.106 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>COPY</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 64.05</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.078 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.112 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.964 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.703 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.022 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 7.356 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>11.37</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.039 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.819 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.538 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.692 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.568 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.708</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1216572</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.091 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.033 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>READ</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 52.53</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.072 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.882 </font></tt></td> <td 
bgcolor=#E0E0C0 align=right><tt><font color=red> 1.056 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.126 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.124 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 7.158 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>9.8</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.914 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.538 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.489 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.467 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.456</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.551 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1216572</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.091 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.033 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>STATS</b></td> <td bgcolor=#E0E0C0 align=right><tt>5.82</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.973</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.251 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.040 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.009 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.048 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.641 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 
align=right><tt>2.29</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.991 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.926 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.755 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.742</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.751 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.734</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1216572</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.091 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.033 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>DELETE</b></td> <td bgcolor=#E0E0C0 align=right><tt>46.96</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.409</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.491 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.949 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.988 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.987 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.382 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>13.89</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.734 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.453 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.210 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font 
color=green> <U> 0.204</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.202</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.238 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>4</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.000 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> </tt></td> </tr> <tr> <td colspan=22 bgcolor=#606060><b><font color=white>#1:DD_MBCOUNT=768 </font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=7><b>REAL_TIME</b></td> <td colspan=7><b>CPU_TIME</b></td> <td colspan=7><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_writing_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 26.1</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.006</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.205 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.066 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.353 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 
1.068 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.070 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>3.18</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.028 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.547</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.173 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.708 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.327 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.296 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 786436</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_reading_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 18.99</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.009</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.072 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.009</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.007</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.006</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.008</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>2.12</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.000 
</font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.925 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.877 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.844 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.830 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.811</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 786436</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tr> <tr><td bgcolor=black colspan=22><font color=white></font></td></tr> <tr> <td colspan=22 bgcolor=#303030><b><font color=white>NPROC=1 DIR=/mnt/testfs SYNC=off PHASE_COPY=cp REP_COUNTER=1 GAMMA=0.2 PHASE_OVERWRITE=off FILE_SIZE=8192 BYTES=512000000 PHASE_APPEND=off PHASE_READ=find DEV=/dev/hdb3 DD_MBCOUNT=768 WRITE_BUFFER=131072 PHASE_DELETE=rm PHASE_MODIFY=off</font></b></td></tr> <tr><td colspan=22 align=right> <font size=-2>Produced by <a href=http://namesys.com/benchmarks/mongo_readme.html>Mongo</a> benchmark suite.</font></td></tr> </table> === mongo 2003-08-27 === [[mongo]] comparison against ext3 <dl> <dt>reiser4 </dt> <dd>''</dd> <dt>mem total</dt> <dd>256276</dd> <dt>machine </dt> <dd>belka</dd> <dt>kernel </dt> <dd>2.6.0-test4 #189 SMP Wed Aug 27 20:36:51 MSD 2003</dd> <dt>date </dt> <dd>Wed Aug 27 20:44:02 2003</dd> </dl> <p> In this test 80% of files are chosen from the 0-8k size range, 16% from the 0-80k size range, 0.8 x 4%
from the 0-800k size range, etc. Most files are small, most bytes are in large files. </p> <p>Legend:</p> <ul> <li><tt>A</tt> reiser4</li> <li><tt>B</tt> reiser4, extents only</li> <li><tt>C</tt> ext3 in <tt>data=writeback</tt> mode (meta-data only journalling)</li> <li><tt>D</tt> ext3 in <tt>data=journal</tt> mode</li> <li><tt>E</tt> ext3 in <tt>data=ordered</tt> mode</li> <li><tt>F</tt> ext3 with htree (hashed directories)</li> </ul> <p> The table presents absolute values (of elapsed time, CPU usage, and disk usage) for reiser4, and ratios against reiser4 for all other configurations. A <font color=red>red</font> number means the ratio is larger than <tt>1.0</tt>, that is, reiser4 is better in that test. A <font color=green>green</font> number means that reiser4 loses in that test. </p> <table cols=19 cellpadding=2 cellspacing=2 noborder> <tr><td bgcolor=black colspan=19><font color=white></font></td></tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>A.INFO_R4='' FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>B.INFO_R4='' MKFS=mkfs.reiser4 -q -o policy=extents FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>C.MOUNT_OPTIONS=data=writeback FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>D.MOUNT_OPTIONS=data=journal FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>E.MOUNT_OPTIONS=data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>F.MKFS=mkfs.ext3 -O dir_index MOUNT_OPTIONS=data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <td colspan=19 bgcolor=#606060><b><font color=white>#0:</font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=6><b>REAL_TIME</b></td> <td colspan=6><b>CPU_TIME</b></td> <td colspan=6><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A
</b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>CREATE</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 22.41</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.673 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.325 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.975 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.213 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>7.66</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.069 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.347 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.415 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.410 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.708</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 635264</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.096 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.110 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.110 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.110 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.111 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>COPY</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 90.92</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.099 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.471 
</font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.221 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.470 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 4.989 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>12.14</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.068 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.066 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.241 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.094 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.668</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1269840</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.096 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.110 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.110 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.110 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.112 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>READ</b></td> <td bgcolor=#E0E0C0 align=right><tt>82.21</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.063 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.861 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.852 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.791</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 4.417 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>10.57</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.914 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.400</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.428 </font></tt></td> <td 
bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.402</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.534 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1269840</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.096 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.110 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.110 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.110 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.112 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>STATS</b></td> <td bgcolor=#E0E0C0 align=right><tt>8.52</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.993 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.822</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.816</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.811</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.335 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>2.96</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.997 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.561</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.564</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.584 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.608 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1269840</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.096 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.110 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.110 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.110 
</font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.112 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>DELETE</b></td> <td bgcolor=#E0E0C0 align=right><tt>69.69</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.301</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.749 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.717 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.659 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.912 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>14.73</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.703 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.208</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.207</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.213 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.237 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>4</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.000 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> </tt></td> </tr> <tr> <td colspan=19 bgcolor=#606060><b><font color=white>#1:DD_MBCOUNT=768 </font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=6><b>REAL_TIME</b></td> <td colspan=6><b>CPU_TIME</b></td> <td colspan=6><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> 
<td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_writing_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 25.85</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.092 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.335 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.085 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.095 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 3.27</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 0.982</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.159 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.648 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.251 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.254 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 786436</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_reading_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 19</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 0.999</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font 
color=black> <U> 1.005</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.007</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.007</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.007</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt>2.18</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.963 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.807 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.803</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.789</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.803</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 786436</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tr> <tr><td bgcolor=black colspan=19><font color=white></font></td></tr> <tr> <td colspan=19 bgcolor=#303030><b><font color=white>NPROC=1 DIR=/mnt/testfs SYNC=off PHASE_COPY=cp REP_COUNTER=1 GAMMA=0.2 PHASE_OVERWRITE=off FILE_SIZE=8000 BYTES=512000000 PHASE_APPEND=off PHASE_READ=find DEV=/dev/hdb3 DD_MBCOUNT=768 WRITE_BUFFER=131072 PHASE_DELETE=rm PHASE_MODIFY=off</font></b></td></tr> <tr><td colspan=19 align=right> <font size=-2>Produced by <a href=http://namesys.com/benchmarks/mongo_readme.html>Mongo</a> benchmark suite.</font></td></tr> </table> <hr> <p> This is the same test as above, but with base file size 4k, that is, in this test 80% of files are chosen from
the 0-4k size range, 16% from the 0-40k size range, 0.8 x 4% from the 0-400k size range, etc. </p> <hr> <dl> <dt>reiser4 </dt> <dd>''</dd> <dt>mem total</dt> <dd>255580</dd> <dt>machine </dt> <dd>belka</dd> <dt>kernel </dt> <dd>2.6.0-test4 #176 SMP Tue Aug 26 19:09:38 MSD 2003</dd> <dt>date </dt> <dd>Wed Aug 27 12:41:54 2003</dd> </dl> <table cols=19 cellpadding=2 cellspacing=2 noborder> <tr><td bgcolor=black colspan=19><font color=white></td></tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>A.INFO_R4='' FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>B.INFO_R4='' MKFS=mkfs.reiser4 -q -o policy=extents FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>C.MOUNT_OPTIONS=data=writeback FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>D.MOUNT_OPTIONS=data=journal FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>E.MOUNT_OPTIONS=data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>F.MKFS=mkfs.ext3 -O dir_index MOUNT_OPTIONS=data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <td colspan=19 bgcolor=#606060><b><font color=white>#0:</font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=6><b>REAL_TIME</b></td> <td colspan=6><b>CPU_TIME</b></td> <td colspan=6><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>CREATE</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 33.86</U></tt></td> <td 
bgcolor=#E0E0C0 align=right><tt><font color=red> 1.223 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.305 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.895 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.549 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.298 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>14.11</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.118 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.967 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.046 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.045 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.647</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 789424</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.208 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.181 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>COPY</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 119.68</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.228 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.237 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.397 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.277 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 7.061 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>23.05</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.484 
</font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.683 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.515 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.691</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1578216</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.208 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.182 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>READ</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 118.5</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.217 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.041 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.065 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.020</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 6.585 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>19.84</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.993 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.436</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.446 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.431</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.540 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1578216</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.208 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 
</font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.182 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>STATS</b></td> <td bgcolor=#E0E0C0 align=right><tt>24.69</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.951 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.677</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.696 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.677</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.151 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>7.75</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.008 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.590</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.582</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.583</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.645 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1578216</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.208 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.182 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>DELETE</b></td> <td bgcolor=#E0E0C0 align=right><tt>114.49</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.438 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.174</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.188 </font></tt></td> <td 
bgcolor=#E0E0C0 align=right><tt><font color=green> 0.177 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.257 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>32.64</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.790 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.193</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.199 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.194</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.223 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>4</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.000 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> </tt></td> </tr> <tr> <td colspan=19 bgcolor=#606060><b><font color=white>#1:DD_MBCOUNT=768 </font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=6><b>REAL_TIME</b></td> <td colspan=6><b>CPU_TIME</b></td> <td colspan=6><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_writing_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 26.24</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.002</U> </font></tt></td> <td 
bgcolor=#E0E0C0 align=right><tt><font color=red> 1.066 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.311 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.056 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.063 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 3.25</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 0.997</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.138 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.622 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.286 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.298 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 786436</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_reading_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 19.04</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 0.994</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.002</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.003</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.002</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>2.08</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.038 </font></tt></td> <td 
bgcolor=#E0E0C0 align=right><tt><font color=green> 0.870 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.870 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.870 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.837</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 786436</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr><td bgcolor=black colspan=19><font color=white></td></tr> <tr><td colspan=19 align=right> <tr> <td colspan=19 bgcolor=#303030><b><font color=white>NPROC=1 DIR=/mnt/testfs SYNC=off PHASE_COPY=cp REP_COUNTER=1 GAMMA=0.2 PHASE_OVERWRITE=off FILE_SIZE=4000 BYTES=512000000 PHASE_APPEND=off PHASE_READ=find DEV=/dev/hdb3 DD_MBCOUNT=768 WRITE_BUFFER=131072 PHASE_DELETE=rm PHASE_MODIFY=off </td></tr> <tr><td colspan=19 align=right> <font size=-2>Produced by <a href=http://namesys.com/benchmarks/mongo_readme.html>Mongo</a> benchmark suite.</font></td></tr> </table> === mongo 2003-08-26 === [[mongo]] comparison against ext3 <dl> <dt>reiser4 </dt> <dd>''</dd> <dt>mem total</dt> <dd>904048</dd> <dt>machine </dt> <dd>belka</dd> <dt>kernel </dt> <dd>2.6.0-test4 #176 SMP Tue Aug 26 19:09:38 MSD 2003</dd> <dt>date </dt> <dd>Tue Aug 26 19:34:39 2003</dd> </dl> <p> In this test 80% of files are chosen from the 0-4k size range, 16% from the 0-40k size range, 0.8 x 4% from the 0-400k size range, etc. Most files are small, most bytes are in large files. 
</p> <p>Legend:</p> <ul> <li><tt>A</tt> reiser4</li> <li><tt>B</tt> reiser4, extents only</li> <li><tt>C</tt> ext3 in <tt>data=writeback</tt> mode (meta-data-only journalling)</li> <li><tt>D</tt> ext3 in <tt>data=journal</tt> mode</li> <li><tt>E</tt> ext3 in <tt>data=ordered</tt> mode</li> <li><tt>F</tt> ext3 with htree (hashed directories)</li> </ul> <p> The table presents absolute values (of elapsed time, CPU usage, and disk usage) for reiser4, and ratios against reiser4 for all other configurations. A <font color=red>red</font> number means the ratio is larger than <tt>1.0</tt>, that is, reiser4 is better in this test; a <font color=green>green</font> number means that reiser4 loses in this test. </p> <table cols=19 cellpadding=2 cellspacing=2 noborder> <tr><td bgcolor=black colspan=19><font color=white></td></tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>A.INFO_R4='' FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>B.INFO_R4='' MKFS=mkfs.reiser4 -q -o policy=extents FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>C.MOUNT_OPTIONS=data=writeback FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>D.MOUNT_OPTIONS=data=journal FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>E.MOUNT_OPTIONS=data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>F.MKFS=mkfs.ext3 -O dir_index MOUNT_OPTIONS=data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <td colspan=19 bgcolor=#606060><b><font color=white>#0:</font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=6><b>REAL_TIME</b></td> <td colspan=6><b>CPU_TIME</b></td> <td colspan=6><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A
</b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>CREATE</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 27.6</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.311 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.567 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.538 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.668 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.566 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>13.55</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.166 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.035 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.162 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.189 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.670</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 788884</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.208 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.181 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.181 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.181 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.182 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>COPY</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 113.71</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.237 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.167 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.460 
</font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.227 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 7.387 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>23.13</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.169 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.498 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.691 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.591 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.709</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1577560</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.208 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.181 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.181 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.181 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.183 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>READ</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 111.51</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.239 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.157 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.176 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.096 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 7.017 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>20.76</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.042 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.424 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.415</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.416</U> </font></tt></td> <td 
bgcolor=#E0E0C0 align=right><tt><font color=green> 0.521 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1577560</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.208 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.181 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.181 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.181 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.183 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>STATS</b></td> <td bgcolor=#E0E0C0 align=right><tt>20.22</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.034 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.834</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.827</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.832</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.439 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>7.47</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.009 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.590</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.585</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.584</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.631 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1577560</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.208 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.181 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.181 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.181 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.183 </font></tt></td> 
</tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>DELETE</b></td> <td bgcolor=#E0E0C0 align=right><tt>110.98</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.437 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.183</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.180</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.185 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.277 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>33.03</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.838 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.196 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.192</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.193</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.221 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>4</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.000 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> </tt></td> </tr> <tr> <td colspan=19 bgcolor=#606060><b><font color=white>#1:DD_MBCOUNT=768 </font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=6><b>REAL_TIME</b></td> <td colspan=6><b>CPU_TIME</b></td> <td colspan=6><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A 
</b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_writing_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 26.03</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.096 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.340 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.092 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.080 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 3.48</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.011</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.083 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.583 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.187 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.190 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 786436</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_reading_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 19</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 0.995</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 
<U> 0.999</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 0.999</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>2.28</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.018 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.741 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.737</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.741 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.724</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 786436</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr><td bgcolor=black colspan=19><font color=white></td></tr> <tr><td colspan=19 align=right> <tr> <td colspan=19 bgcolor=#303030><b><font color=white>NPROC=1 DIR=/mnt/testfs SYNC=off PHASE_COPY=cp REP_COUNTER=1 GAMMA=0.2 PHASE_OVERWRITE=off FILE_SIZE=4000 BYTES=512000000 PHASE_APPEND=off PHASE_READ=find DEV=/dev/hdb3 DD_MBCOUNT=768 WRITE_BUFFER=131072 PHASE_DELETE=rm PHASE_MODIFY=off </td></tr> <tr><td colspan=19 align=right> <font size=-2>Produced by <a href=http://namesys.com/benchmarks/mongo_readme.html>Mongo</a> benchmark suite.</font></td></tr> </table> === mongo 2003-08-18 === [[mongo]] comparison against ext3 <dl> <dt>reiser4 </dt> <dd></dd> <dt>mem total</dt> <dd>255992</dd> <dt>machine </dt> <dd>belka</dd> <dt>kernel </dt> <dd>2.6.0-test3 #37 SMP Mon Aug 18 18:12:14 MSD
2003</dd> <dt>date </dt> <dd>Mon 18 Aug 2003 20:24:16</dd> </dl> <p> In this test 80% of files are chosen from the 0-8k size range, 16% from the 0-80k size range, 0.8 x 4% from the 0-800k size range, etc. Most files are small, most bytes are in large files. </p> <p>Legend:</p> <ul> <li><tt>A</tt> reiser4</li> <li><tt>B</tt> reiser4, extents only</li> <li><tt>C</tt> ext3 in <tt>data=writeback</tt> mode (meta-data-only journalling)</li> <li><tt>D</tt> ext3 in <tt>data=journal</tt> mode</li> <li><tt>E</tt> ext3 in <tt>data=ordered</tt> mode</li> <li><tt>F</tt> ext3 with htree (hashed directories)</li> </ul> <p> The table presents absolute values (of elapsed time, CPU usage, and disk usage) for reiser4, and ratios against reiser4 for all other configurations. A <font color=red>red</font> number means the ratio is larger than <tt>1.0</tt>, that is, reiser4 is better in this test; a <font color=green>green</font> number means that reiser4 loses in this test. </p> <table cols=19 cellpadding=2 cellspacing=2 noborder> <tr><td bgcolor=black colspan=19><font color=white></td></tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>A.INFO_R4= FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>B.INFO_R4=ext MKFS=mkfs.reiser4 -q -o policy=extents FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>C.MOUNT_OPTIONS=data=writeback FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>D.MOUNT_OPTIONS=data=journal FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>E.MOUNT_OPTIONS=data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>F.MKFS=mkfs.ext3 -O dir_index MOUNT_OPTIONS=data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <td colspan=19 bgcolor=#606060><b><font color=white>#0:</font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td
colspan=6><b>REAL_TIME</b></td> <td colspan=6><b>CPU_TIME</b></td> <td colspan=6><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>CREATE</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 29.16</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.220 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.422 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.779 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.491 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.645 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>13.52</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.182 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.013 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.087 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.997 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.657</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 789364</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.208 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.181 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>COPY</b></td> <td bgcolor=#E0E0C0 
align=right><tt><U> 119.64</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.211 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.191 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.473 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.230 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 7.288 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>21.98</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.152 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.515 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.746 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.520 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.695</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1578116</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.208 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.182 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>READ</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 116.55</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.213 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.177 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.025 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.134 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 6.850 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>18.35</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.035 </font></tt></td> <td 
bgcolor=#E0E0C0 align=right><tt><font color=green> 0.447 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.436</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.431</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.569 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1578116</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.208 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.182 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>STATS</b></td> <td bgcolor=#E0E0C0 align=right><tt>21.65</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.050 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.779</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.811 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.782</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.358 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>7.56</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.001 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.599</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.612 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.611</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.638 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1578116</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.208 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 
</font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.182 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>DELETE</b></td> <td bgcolor=#E0E0C0 align=right><tt>112.37</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.434 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.179</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.198 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.177</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.281 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>30.62</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.851 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.205</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.205</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.203</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.230 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>4</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.000 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> </tt></td> </tr> <tr> <td colspan=19 bgcolor=#606060><b><font color=white>#1:DD_MBCOUNT=768 </font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=6><b>REAL_TIME</b></td> <td colspan=6><b>CPU_TIME</b></td> <td colspan=6><b>DF</b></td> 
</tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_writing_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 26.11</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.011</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.090 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.388 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.076 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.083 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>3.25</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.945</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.092 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.640 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.255 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.231 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 786436</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_reading_largefile</b></td> <td bgcolor=#E0E0C0 
align=right><tt><U> 19.09</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.005</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 0.999</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 0.996</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.004</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.011</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>2.09</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.019 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.847</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.856 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.833</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.842</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 786436</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr><td bgcolor=black colspan=19><font color=white></td></tr> <tr><td colspan=19 align=right> <tr> <td colspan=19 bgcolor=#303030><b><font color=white>NPROC=1 DIR=/mnt/testfs SYNC=off PHASE_COPY=cp REP_COUNTER=1 GAMMA=0.2 PHASE_OVERWRITE=off FILE_SIZE=4000 BYTES=512000000 PHASE_APPEND=off PHASE_READ=find DEV=/dev/hdb3 DD_MBCOUNT=768 WRITE_BUFFER=131072 PHASE_DELETE=rm PHASE_MODIFY=off </td></tr> <tr><td colspan=19 align=right> <font size=-2>Produced by <a 
href=http://namesys.com/benchmarks/mongo_readme.html>Mongo</a> benchmark suite.</font></td></tr> </table> === mongo, 2003-08-12 === [[mongo]] comparison against ext3 <dl> <dt>mem total</dt> <dd>513284</dd> <dt>machine </dt> <dd>strelka</dd> <dt>kernel </dt> <dd>2.6.0-test2 #52 SMP Tue Aug 12 15:17:12 MSD 2003</dd> <dt>date </dt> <dd>Tue Aug 12 15:38:47 2003</dd> </dl> <p> This is a comparison of the latest (2003-08-12) version of reiser4 with ext3. Reiser4 is an atomic filesystem, so the comparison with the data journaling mode of ext3 is the fairest, but since most users run ext3 in data ordering mode, we compare against that as well. </p> <p> In this test, 80% of files are chosen from the 0-8k size range, 16% from the 0-80k size range, 0.8 x 4% (= 3.2%) from the 0-800k size range, etc. Most files are small; most bytes are in large files. </p> <p>Legend:</p> <ul> <li><tt>A</tt> reiser4</li> <li><tt>B</tt> ext3 in <tt>data=writeback</tt> mode (meta-data only journalling)</li> <li><tt>C</tt> ext3 in <tt>data=journal</tt> mode</li> <li><tt>D</tt> ext3 in <tt>data=ordered</tt> mode</li> <li><tt>E</tt> ext3 with htree (hashed directories)</li> <li><tt>F</tt> ext3 with support for filetypes in <tt>readdir()</tt></li> </ul> <p> The table presents absolute values (of elapsed time, CPU usage, and disk usage) for reiser4, and ratios against reiser4 for all other configurations. A <font color=red>red</font> number means the ratio is larger than <tt>1.0</tt>, i.e. reiser4 is better in that test. A <font color=green>green</font> number means that reiser4 loses in that test. 
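The file-size distribution described above (80% of files from 0-8k, 16% from 0-80k, 3.2% from 0-800k, ...) is what the run parameters' <tt>GAMMA=0.2</tt> and <tt>FILE_SIZE=8192</tt> encode: each size decade above the base is reached with probability gamma. The following is a hypothetical sketch of such a sampler, not the actual generator from the Mongo suite (truncation at three decades is an assumption for illustration):

```python
import random

def mongo_file_size(base=8192, gamma=0.2, max_decades=3):
    """Sample a file size following the distribution in the text:
    a fraction (1 - gamma) of files fall in 1..base, a fraction
    gamma * (1 - gamma) in 1..10*base, gamma^2 * (1 - gamma) in
    1..100*base, etc.  With gamma = 0.2 that gives 80%, 16%, 3.2%, ...
    Hypothetical reconstruction; max_decades caps the "etc." tail."""
    limit = base
    for _ in range(max_decades):
        # Escalate to the next size decade with probability gamma.
        if random.random() >= gamma:
            break
        limit *= 10
    return random.randrange(1, limit + 1)
```

Since each escalated range is sampled uniformly, most files end up small while most of the total bytes sit in the few large files, exactly the property the text notes.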
</p> <table cols=19 cellpadding=2 cellspacing=2 noborder> <tr><td bgcolor=black colspan=19><font color=white></td></tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>A.INFO_R4= MKFS=/usr/local/sbin/mkfs.reiser4 -qf FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>B.MOUNT_OPTIONS=data=writeback FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>C.MOUNT_OPTIONS=data=journal FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>D.MOUNT_OPTIONS=data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>E.MKFS=/usr/local/sbin/mkfs.ext3 -O dir_index MOUNT_OPTIONS=data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>F.MKFS=/usr/local/sbin/mkfs.ext3 -O filetype MOUNT_OPTIONS=data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <td colspan=19 bgcolor=#606060><b><font color=white>#0:</font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=6><b>REAL_TIME</b></td> <td colspan=6><b>CPU_TIME</b></td> <td colspan=6><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>CREATE</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 14.06</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.317 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.248 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.050 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font 
color=red> 3.016 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.077 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>5.3</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.558 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.692 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.602 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.823</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.592 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 458224</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>COPY</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 43.62</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.982 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.733 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.033 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 6.685 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.904 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>9.19</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.163 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.286 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.230 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.706</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.200 </font></tt></td> </tt></td> 
<td bgcolor=#E0E0C0 align=right><tt><U> 916172</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>READ</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 39.86</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.091 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.091 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.140 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 6.003 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.119 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>8.22</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.467 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.454 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.464 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.529 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.443</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 916172</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>STATS</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1.54</U></tt></td> <td 
bgcolor=#E0E0C0 align=right><tt><font color=red> 1.987 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.896 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.942 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.649 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.883 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 0.26</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.115 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.115 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.115 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.385 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.962 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 916172</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>DELETE</b></td> <td bgcolor=#E0E0C0 align=right><tt>37.85</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.833 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.825 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.867 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.133 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.760</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>11.11</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.223</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font 
color=green> <U> 0.223</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.220</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.254 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.222</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>4</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> </tt></td> </tr> <tr> <td colspan=19 bgcolor=#606060><b><font color=white>#1:DD_MBCOUNT=500 </font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=6><b>REAL_TIME</b></td> <td colspan=6><b>CPU_TIME</b></td> <td colspan=6><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_writing_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 42.15</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.062 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.534 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.066 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.071 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.073 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 
align=right><tt><U> 7.86</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.094 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.500 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.206 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.211 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.198 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 512004</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_reading_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 36.5</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.005</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.008</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.005</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.007</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.007</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>4.7</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.745</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.732</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.743</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.736</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.734</U> </font></tt></td> 
</tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 512004</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr><td bgcolor=black colspan=19><font color=white></td></tr> <tr><td colspan=19 align=right> <tr> <td colspan=19 bgcolor=#303030><b><font color=white>NPROC=1 DIR=/data1 SYNC=off PHASE_COPY=cp REP_COUNTER=3 GAMMA=0.2 PHASE_OVERWRITE=off PHASE_STATS=find FILE_SIZE=8192 BYTES=134217728 PHASE_APPEND=off PHASE_READ=find DEV=/dev/hdb1 DD_MBCOUNT=500 WRITE_BUFFER=131072 PHASE_DELETE=rm PHASE_MODIFY=off </td></tr> <tr><td colspan=19 align=right> <font size=-2>Produced by <a href=http://namesys.com/benchmarks/mongo_readme.html>Mongo</a> benchmark suite.</font></td></tr> </table> === mongo 2003-07-10 === [[mongo]] comparison, reiserfs vs. reiser4, 2003-07-10, obtained before [http://mail.fsfeurope.org/pipermail/booth/2003-February/000083.html LinuxTAG 2003] <table cols=10 cellpadding=2 cellspacing=2 noborder> <tr><td bgcolor=black colspan=10><font color=white></td></tr> <tr> <th bgcolor=#303030 colspan=10 align=left><font color=white>A. reiser4</th> </tr> <tr> <th bgcolor=#303030 colspan=10 align=left><font color=white>B. ext3 data journalling</th> </tr> <tr> <th bgcolor=#303030 colspan=10 align=left><font color=white>C. 
ext3 </font></th> </tr> <tr> <td colspan=10 bgcolor=#606060><b><font color=white>#0:</font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=3><b>REAL_TIME</b></td> <td colspan=3><b>CPU_TIME</b></td> <td colspan=3><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>CREATE</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 14.19</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.221 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.592 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 5.66</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.610 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.475 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 458692</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>COPY</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 49.01</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.586 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.783 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 9.08</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.308 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.176 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 916668</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>READ</b></td> <td bgcolor=#E0E0C0 
align=right><tt>43.39</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.970</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.017 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>8.1</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.452</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.453</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 916668</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>STATS</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1.93</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.534 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.549 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 0.27</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.000 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.963 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 916668</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>DELETE</b></td> <td bgcolor=#E0E0C0 align=right><tt>40.13</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.797</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.837 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>11.26</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.217 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.210</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>4</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font 
color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> </tt></td> </tr> <tr> <td colspan=10 bgcolor=#606060><b><font color=white>#1:DD_MBCOUNT=500 </font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=3><b>REAL_TIME</b></td> <td colspan=3><b>CPU_TIME</b></td> <td colspan=3><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_writing_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 42.27</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.527 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.057 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 7.78</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.497 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.189 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 512004</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_reading_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 36.57</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.005</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.005</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>4.8</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.760</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.777 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 512004</U></tt></td> <td bgcolor=#E0E0C0 
align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr><td bgcolor=black colspan=10><font color=white></td></tr> <tr><td colspan=10 align=right> <tr> <td colspan=10 bgcolor=#303030><b><font color=white>NPROC=1 DIR=/data1 SYNC=off PHASE_COPY=cp REP_COUNTER=3 GAMMA=0.2 PHASE_OVERWRITE=off PHASE_STATS=find FILE_SIZE=8192 BYTES=134217728 PHASE_APPEND=off PHASE_READ=find DEV=/dev/hdb1 DD_MBCOUNT=500 WRITE_BUFFER=131072 PHASE_DELETE=rm PHASE_MODIFY=off </td></tr> <tr><td colspan=10 align=right> <font size=-2>Produced by <a href=http://namesys.com/benchmarks/mongo_readme.html>Mongo</a> benchmark suite.</font></td></tr> </table> <hr> <a name="mongo.2003.07.10"> <p> Below are some older benchmarks taken just before LinuxTAG 2003. In these, note that gamma is the fraction of files that are larger than the base size by 10x. It is set either to 0.2 (as in the benchmark above), in an attempt to mimic observed real-world usage patterns, or to 0, in an attempt to measure a single file size range's performance in isolation. Note that V3 performs poorly in the 0-8k size range, while V4 performs well; this is the result of deep design changes you can read about at <a href="http://www.namesys.com/v4/v4.html">http://www.namesys.com/v4/v4.html</a>. </p> <dl><dt>mem total</dt><dd>513748</dd><dt>machine </dt><dd>strelka</dd><dt>kernel </dt><dd>2.5.74 #213 SMP Thu Jul 10 22:53:23 MSD 2003</dd><dt>date </dt><dd>Thu Jul 10 22:48:56 2003</dd><dt>.config </dt><dd><a href="http://www.namesys.com/intbenchmarks/mongo/03.07.11.nikita/.config">here</a></dd><dt>NPROC</dt><dd>1</dd><dt>DIR</dt><dd>/data1</dd><dt>SYNC</dt><dd>off</dd><dt>REP_COUNTER</dt><dd>3</dd><dt>All phases are in readdir order</dt><dd></dd><dt>BYTES</dt><dd>100M</dd><dt>DEV</dt><dd>/dev/hdb1</dd><dt>WRITE_BUFFER</dt><dd><b>256k</b></dd></dl> <p>Everywhere, <b>A</b> is reiserfs and <b>B</b> is reiser4. 
Green numbers mean reiser4 is better.</p> <table cols="7" cellpadding="2" cellspacing="2" noborder=""> <tbody><tr><td bgcolor="black" colspan="7"><font color="white"></font></td></tr> <tr> <th bgcolor="#303030" colspan="7" align="left"><font color="white">median file size 8k</font></th> </tr> <tr align="center" bgcolor="#c0c0c0"> <td></td> <td colspan="2"><b>REAL_TIME</b></td> <td colspan="2"><b>CPU_TIME</b></td> <td colspan="2"><b>DF</b></td> </tr> <tr align="center" bgcolor="#c0c0c0"> <td></td> <td><b>A</b></td><td><b>B/A </b></td> <td><b>A</b></td><td><b>B/A </b></td> <td><b>A</b></td><td><b>B/A </b></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>CREATE</b></td> <td bgcolor="#e0e0c0" align="right"><tt>41.26</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.246</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>3.93</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.908</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>321632</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.961</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>COPY</b></td> <td bgcolor="#e0e0c0" align="right"><tt>154.09</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.504</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 5.17</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.217 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>642624</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.962</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>APPEND</b></td> <td bgcolor="#e0e0c0" align="right"><tt>282.09</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.573</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 6.6</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.392 </font></tt></td> <td bgcolor="#e0e0c0" 
align="right"><tt>944428</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 0.980</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>MODIFY</b></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 284.52</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 0.986</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 3.29</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.489 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 943592</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 0.981</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>OVERWRITE</b></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 298.19</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.263 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 5.33</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.608 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>943548</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.968</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>READ</b></td> <td bgcolor="#e0e0c0" align="right"><tt>245.22</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.940</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 3.85</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.753 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>943548</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.968</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>STATS</b></td> <td bgcolor="#e0e0c0" align="right"><tt>20.58</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.099</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 0.48</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.292 </font></tt></td> <td 
bgcolor="#e0e0c0" align="right"><tt>943548</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.968</u> </font></tt></td> </tr> <tr> <td colspan="7" bgcolor="#a0a0a0"><b><font color="white">GAMMA=0.2 FILE_SIZE=8192 <a href="http://www.namesys.com/intbenchmarks/mongo/03.07.11.nikita/8k.heavy.v3.profile">A profile</a> <a href="http://www.namesys.com/intbenchmarks/mongo/03.07.11.nikita/8k.heavy.v4.profile">B profile</a></font></b></td></tr> <tr><td bgcolor="white" colspan="7"><font color="white"></font></td></tr> <tr><td bgcolor="white" colspan="7"><font color="white"></font></td></tr> <tr><td bgcolor="white" colspan="7"><font color="white"></font></td></tr> <tr><td bgcolor="black" colspan="7"><font color="white"></font></td></tr> <tr> <th bgcolor="#303030" colspan="7" align="left"><font color="white">median file size 4k</font></th> </tr> <tr align="center" bgcolor="#c0c0c0"> <td></td> <td colspan="2"><b>REAL_TIME</b></td> <td colspan="2"><b>CPU_TIME</b></td> <td colspan="2"><b>DF</b></td> </tr> <tr align="center" bgcolor="#c0c0c0"> <td></td> <td><b>A</b></td><td><b>B/A </b></td> <td><b>A</b></td><td><b>B/A </b></td> <td><b>A</b></td><td><b>B/A </b></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>CREATE</b></td> <td bgcolor="#e0e0c0" align="right"><tt>117.32</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.176</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>15.57</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.758</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 667652</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.000</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>COPY</b></td> <td bgcolor="#e0e0c0" align="right"><tt>524.67</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.365</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 19.16</u></tt></td> <td bgcolor="#e0e0c0" 
align="right"><tt><font color="red"> 1.059 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 1332856</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.002</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>APPEND</b></td> <td bgcolor="#e0e0c0" align="right"><tt>1068.43</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.363</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>31.27</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.937</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>2073420</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.950</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>MODIFY</b></td> <td bgcolor="#e0e0c0" align="right"><tt>1081.23</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.670</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 18.61</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.048 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>2066536</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.953</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>OVERWRITE</b></td> <td bgcolor="#e0e0c0" align="right"><tt>1050.55</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.885</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 22.81</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.017</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>2066424</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.948</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>READ</b></td> <td bgcolor="#e0e0c0" align="right"><tt>974.43</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.644</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 
12.28</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.635 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>2066424</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.948</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>STATS</b></td> <td bgcolor="#e0e0c0" align="right"><tt>83.44</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.075</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>1.26</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.802</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>2066424</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.948</u> </font></tt></td> </tr> <tr> <td colspan="7" bgcolor="#a0a0a0"><b><font color="white">GAMMA=0.2 FILE_SIZE=4096 <a href="http://www.namesys.com/intbenchmarks/mongo/03.07.11.nikita/4k.heavy.v3.profile">A profile</a> <a href="http://www.namesys.com/intbenchmarks/mongo/03.07.11.nikita/4k.heavy.v4.profile">B profile</a></font></b></td></tr> <tr><td bgcolor="white" colspan="7"><font color="white"></font></td></tr> <tr><td bgcolor="white" colspan="7"><font color="white"></font></td></tr> <tr><td bgcolor="white" colspan="7"><font color="white"></font></td></tr> <tr><td bgcolor="black" colspan="7"><font color="white"></font></td></tr> <tr> <th bgcolor="#303030" colspan="7" align="left"><font color="white">maximal file size 4k</font></th> </tr> <tr align="center" bgcolor="#c0c0c0"> <td></td> <td colspan="2"><b>REAL_TIME</b></td> <td colspan="2"><b>CPU_TIME</b></td> <td colspan="2"><b>DF</b></td> </tr> <tr align="center" bgcolor="#c0c0c0"> <td></td> <td><b>A</b></td><td><b>B/A </b></td> <td><b>A</b></td><td><b>B/A </b></td> <td><b>A</b></td><td><b>B/A </b></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>CREATE</b></td> <td bgcolor="#e0e0c0" align="right"><tt>77.34</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.309</u> 
</font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>21.86</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.938</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>452252</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.923</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>COPY</b></td> <td bgcolor="#e0e0c0" align="right"><tt>412.28</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.300</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 35.11</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.013</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>893408</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.934</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>APPEND</b></td> <td bgcolor="#e0e0c0" align="right"><tt>1198.9</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.164</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>67.06</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.694</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>1631992</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.749</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>MODIFY</b></td> <td bgcolor="#e0e0c0" align="right"><tt>1305.14</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.351</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>43.77</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.762</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>1613124</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.758</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>OVERWRITE</b></td> <td bgcolor="#e0e0c0" align="right"><tt>1390.94</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font 
color="green"> <u> 0.239</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>44.22</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.777</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>1610948</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.759</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>READ</b></td> <td bgcolor="#e0e0c0" align="right"><tt>1093.6</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.256</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 19.46</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.743 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>1610948</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.759</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>STATS</b></td> <td bgcolor="#e0e0c0" align="right"><tt>115.76</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.200</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>2.6</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.735</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>1610948</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.759</u> </font></tt></td> </tr> <tr> <td colspan="7" bgcolor="#a0a0a0"><b><font color="white">GAMMA=0.0 FILE_SIZE=4096 <a href="http://www.namesys.com/intbenchmarks/mongo/03.07.11.nikita/100.heavy.v3.profile">A profile</a> <a href="http://www.namesys.com/intbenchmarks/mongo/03.07.11.nikita/100.heavy.v4.profile">B profile</a></font></b></td></tr> <tr><td bgcolor="white" colspan="7"><font color="white"></font></td></tr> <tr><td bgcolor="white" colspan="7"><font color="white"></font></td></tr> <tr><td bgcolor="white" colspan="7"><font color="white"></font></td></tr> <tr> <th bgcolor="#303030" colspan="7" align="left"><font color="white">median file size 8k</font></th> </tr> 
<tr align="center" bgcolor="#c0c0c0"> <td></td> <td colspan="2"><b>REAL_TIME</b></td> <td colspan="2"><b>CPU_TIME</b></td> <td colspan="2"><b>DF</b></td> </tr> <tr align="center" bgcolor="#c0c0c0"> <td></td> <td><b>A</b></td><td><b>B/A </b></td> <td><b>A</b></td><td><b>B/A </b></td> <td><b>A</b></td><td><b>B/A </b></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>CREATE</b></td> <td bgcolor="#e0e0c0" align="right"><tt>40.54</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.248</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>4.01</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.895</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>321632</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.961</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>COPY</b></td> <td bgcolor="#e0e0c0" align="right"><tt>152.82</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.506</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 5.2</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.215 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>642624</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.962</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>READ</b></td> <td bgcolor="#e0e0c0" align="right"><tt>141.8</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.563</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 3.03</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.762 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>642624</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.962</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>STATS</b></td> <td bgcolor="#e0e0c0" align="right"><tt>14.91</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.084</u> 
</font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 0.59</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.051 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>642624</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.962</u> </font></tt></td> </tr> <tr><td bgcolor="black" colspan="7"><font color="white"></font></td></tr> <tr><td colspan="7" align="right"> </td></tr><tr> <td colspan="7" bgcolor="#303030"><b><font color="white">GAMMA=0.2 FILE_SIZE=8192</font></b></td></tr> <tr><td bgcolor="white" colspan="7"><font color="white"></font></td></tr> <tr><td bgcolor="white" colspan="7"><font color="white"></font></td></tr> <tr><td bgcolor="white" colspan="7"><font color="white"></font></td></tr> <tr> <th bgcolor="#303030" colspan="7" align="left"><font color="white">median file size 4k</font></th> </tr> <tr align="center" bgcolor="#c0c0c0"> <td></td> <td colspan="2"><b>REAL_TIME</b></td> <td colspan="2"><b>CPU_TIME</b></td> <td colspan="2"><b>DF</b></td> </tr> <tr align="center" bgcolor="#c0c0c0"> <td></td> <td><b>A</b></td><td><b>B/A </b></td> <td><b>A</b></td><td><b>B/A </b></td> <td><b>A</b></td><td><b>B/A </b></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>CREATE</b></td> <td bgcolor="#e0e0c0" align="right"><tt>115.6</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.174</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>14.84</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.772</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 667652</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.000</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>COPY</b></td> <td bgcolor="#e0e0c0" align="right"><tt>528.83</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.361</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 18.91</u></tt></td> <td bgcolor="#e0e0c0" 
align="right"><tt><font color="red"> 1.058 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 1332856</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.002</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>READ</b></td> <td bgcolor="#e0e0c0" align="right"><tt>532.06</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.372</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 10.87</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.589 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 1332856</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.002</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>STATS</b></td> <td bgcolor="#e0e0c0" align="right"><tt>51.99</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.069</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>1.67</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.581</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 1332856</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.002</u> </font></tt></td> </tr> <tr><td bgcolor="black" colspan="7"><font color="white"></font></td></tr> <tr><td colspan="7" align="right"> </td></tr><tr> <td colspan="7" bgcolor="#303030"><b><font color="white">GAMMA=0.2 FILE_SIZE=4096</font></b></td></tr> <tr><td bgcolor="white" colspan="7"><font color="white"></font></td></tr> <tr><td bgcolor="white" colspan="7"><font color="white"></font></td></tr> <tr><td bgcolor="white" colspan="7"><font color="white"></font></td></tr> <tr> <th bgcolor="#303030" colspan="7" align="left"><font color="white">maximal file size 4k</font></th> </tr> <tr align="center" bgcolor="#c0c0c0"> <td></td> <td colspan="2"><b>REAL_TIME</b></td> <td colspan="2"><b>CPU_TIME</b></td> <td colspan="2"><b>DF</b></td> </tr> <tr align="center" bgcolor="#c0c0c0"> 
<td></td> <td><b>A</b></td><td><b>B/A </b></td> <td><b>A</b></td><td><b>B/A </b></td> <td><b>A</b></td><td><b>B/A </b></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>CREATE</b></td> <td bgcolor="#e0e0c0" align="right"><tt>77.5</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.309</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>22.24</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.910</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>452252</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.923</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>COPY</b></td> <td bgcolor="#e0e0c0" align="right"><tt>415.84</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.297</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 34.9</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.009</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>893408</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.934</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>READ</b></td> <td bgcolor="#e0e0c0" align="right"><tt>469.97</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.273</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 20.14</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.454 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>893408</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.934</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>STATS</b></td> <td bgcolor="#e0e0c0" align="right"><tt>65.49</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.162</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>3.09</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.599</u> </font></tt></td> <td bgcolor="#e0e0c0" 
align="right"><tt>893408</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.934</u> </font></tt></td> </tr> <tr><td bgcolor="black" colspan="7"><font color="white"></font></td></tr> <tr><td colspan="7" align="right"> </td></tr><tr> <td colspan="7" bgcolor="#303030"><b><font color="white">GAMMA=0.0 FILE_SIZE=4096</font></b></td></tr> </tbody></table> <hr> <h1>Mongo benchmark results</h1> <h2>create, copy, read, stats, delete phases</h2> <dl><dt>reiser4 </dt><dd>ChangeSet@1.1095, 2003-07-10 15:22:17+04:00, god@laputa.namesys.com oops ChangeSet@1.1094, 2003-07-10 15:14:06+04:00, god@laputa.namesys.com repairing compilation damage. </dd><dt>mem total</dt><dd>256624</dd><dt>machine </dt><dd>belka</dd><dt>kernel </dt><dd>2.5.74 #28 Thu Jul 10 18:36:03 MSD 2003</dd><dt>date </dt><dd>Thu Jul 10 19:21:06 2003</dd><dt><a href="http://namesys.com/intbenchmarks/mongo/03.07.11.light/dot.config">.config</a></dt></dl> <table cols="19" cellpadding="2" cellspacing="2" noborder=""> <tbody><tr><td bgcolor="black" colspan="19"><font color="white"></font></td></tr> <tr> <th bgcolor="#303030" colspan="19" align="left"><font color="white">A.INFO_R4=test FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor="#303030" colspan="19" align="left"><font color="white">B.INFO_R4=test FSTYPE=reiser4 MKFS=mkfs.reiser4 -q -e extent40 </font></th> </tr> <tr> <th bgcolor="#303030" colspan="19" align="left"><font color="white">C.FSTYPE=reiserfs </font></th> </tr> <tr> <th bgcolor="#303030" colspan="19" align="left"><font color="white">D.FSTYPE=reiserfs MOUNT_OPTIONS=notail </font></th> </tr> <tr> <th bgcolor="#303030" colspan="19" align="left"><font color="white">E.FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor="#303030" colspan="19" align="left"><font color="white">F.FSTYPE=ext3 MOUNT_OPTIONS=data=journal </font></th> </tr> <tr> <td colspan="19" bgcolor="#606060"><b><font color="white">#0:FILE_SIZE=4000 </font></b></td></tr> <tr align="center" bgcolor="#c0c0c0"> <td></td> <td 
colspan="6"><b>REAL_TIME</b></td> <td colspan="6"><b>CPU_TIME</b></td> <td colspan="6"><b>DF</b></td> </tr> <tr align="center" bgcolor="#c0c0c0"> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>CREATE</b></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 20.47</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.404 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 3.037 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 2.024 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 2.513 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 3.324 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>12.72</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.143 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.270 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.873 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.615</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.606</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 416332</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.934 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.088 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.909 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.858 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.858 
</font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>COPY</b></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 65.25</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.484 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 2.953 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 2.020 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.986 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 2.267 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>21.98</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.032 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.098 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.732 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.529</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.699 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 832640</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.934 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.088 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.910 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.858 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.858 </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>READ</b></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 75.56</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.349 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 2.868 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 2.218 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.902 </font></tt></td> <td bgcolor="#e0e0c0" 
align="right"><tt><font color="red"> 1.925 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>17.36</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.213 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.745 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.857 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.695 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.681</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 832640</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.934 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.088 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.910 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.858 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.858 </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>STATS</b></td> <td bgcolor="#e0e0c0" align="right"><tt>132.18</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> 0.996 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.963</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> 0.994 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.967</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.950</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>2.63</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.977</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.970</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 0.989</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 0.981</u> 
</font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> 1.008 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 832640</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.934 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.088 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.910 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.858 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.858 </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>DELETE</b></td> <td bgcolor="#e0e0c0" align="right"><tt>85.32</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.627 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.239 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.442 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.403</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.449 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>33.57</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.856 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.780 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.623 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.157</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.154</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>4</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> 1.000 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.000</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.000</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font 
color="green"> <u> 0.000</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.000</u> </font></tt></td> </tr> <tr> <td colspan="19" bgcolor="#606060"><b><font color="white">#1:FILE_SIZE=8000 </font></b></td></tr> <tr align="center" bgcolor="#c0c0c0"> <td></td> <td colspan="6"><b>REAL_TIME</b></td> <td colspan="6"><b>CPU_TIME</b></td> <td colspan="6"><b>DF</b></td> </tr> <tr align="center" bgcolor="#c0c0c0"> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>CREATE</b></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 15.07</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.009</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 8.875 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.709 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 2.237 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 3.321 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>8.62</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.945 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.932 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.729 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.517</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.522</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 399788</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.000</u> 
</font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.243 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.461 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.434 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.434 </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>COPY</b></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 52.24</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.007</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 4.998 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.492 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.562 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.879 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>13.42</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.026 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.264 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.700 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.487</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.635 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 799488</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.000</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.243 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.461 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.434 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.434 </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>READ</b></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 60.91</u></tt></td> <td bgcolor="#e0e0c0" 
align="right"><tt><font color="black"> <u> 1.013</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 3.738 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.606 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.333 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.340 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>11.66</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> 1.018 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.526</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.749 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.547 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.547 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 799488</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.000</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.243 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.461 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.434 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.434 </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>STATS</b></td> <td bgcolor="#e0e0c0" align="right"><tt>126.53</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.951</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.958</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> 0.991 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> 1.004 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.966</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 
2.57</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.023 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.027 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 0.988</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> 1.016 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> 1.012 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 799488</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.000</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.243 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.461 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.434 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.434 </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>DELETE</b></td> <td bgcolor="#e0e0c0" align="right"><tt>73.21</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.116 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.746 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.242</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.301 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.396 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>19.93</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> 1.013 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.584 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.530 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.126 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.123</u> </font></tt></td> <td bgcolor="#e0e0c0" 
align="right"><tt>4</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> 1.000 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.000</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.000</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.000</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.000</u> </font></tt></td> </tr> <tr><td bgcolor="black" colspan="19"><font color="white"></font></td></tr> <tr><td colspan="19" align="right"> </td></tr><tr> <td colspan="19" bgcolor="#303030"><b><font color="white">PHASE_APPEND=off NPROC=1 DIR=/mnt/testfs SYNC=off REP_COUNTER=3 GAMMA=0.0 PHASE_OVERWRITE=off DEV=/dev/hdb3 WRITE_BUFFER=4096 BYTES=128000000 PHASE_MODIFY=off </font></b></td></tr> <tr><td colspan="19" align="right"> <font size="-2">Produced by <a href="http://namesys.com/benchmarks/mongo_readme.html">Mongo</a> benchmark suite.</font></td></tr> </tbody></table> <h2>dd of a large file phase</h2> <dl><dt>reiser4 </dt><dd>ChangeSet@1.1095, 2003-07-10 15:22:17+04:00, god@laputa.namesys.com oops ChangeSet@1.1094, 2003-07-10 15:14:06+04:00, god@laputa.namesys.com repairing compilation damage. 
</dd><dt>mem total</dt><dd>256624</dd><dt>machine </dt><dd>belka</dd><dt>kernel </dt><dd>2.5.74 #28 Thu Jul 10 18:36:03 MSD 2003</dd><dt>date </dt><dd>Thu Jul 10 21:36:22 2003</dd><dt><a href="http://namesys.com/intbenchmarks/mongo/03.07.11.light/dot.config">.config</a></dt></dl> <table cols="19" cellpadding="2" cellspacing="2" noborder=""> <tbody><tr><td bgcolor="black" colspan="19"><font color="white"></font></td></tr> <tr> <th bgcolor="#303030" colspan="19" align="left"><font color="white">A.INFO_R4=test FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor="#303030" colspan="19" align="left"><font color="white">B.INFO_R4=test FSTYPE=reiser4 MKFS=mkfs.reiser4 -q -e extent40 </font></th> </tr> <tr> <th bgcolor="#303030" colspan="19" align="left"><font color="white">C.FSTYPE=reiserfs </font></th> </tr> <tr> <th bgcolor="#303030" colspan="19" align="left"><font color="white">D.FSTYPE=reiserfs MOUNT_OPTIONS=notail </font></th> </tr> <tr> <th bgcolor="#303030" colspan="19" align="left"><font color="white">E.FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor="#303030" colspan="19" align="left"><font color="white">F.FSTYPE=ext3 MOUNT_OPTIONS=data=journal </font></th> </tr> <tr> <td colspan="19" bgcolor="#606060"><b><font color="white">#0:DD_MBCOUNT=768 </font></b></td></tr> <tr align="center" bgcolor="#c0c0c0"> <td></td> <td colspan="6"><b>REAL_TIME</b></td> <td colspan="6"><b>CPU_TIME</b></td> <td colspan="6"><b>DF</b></td> </tr> <tr align="center" bgcolor="#c0c0c0"> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>dd_writing_largefile</b></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 76.29</u></tt></td> <td bgcolor="#e0e0c0" 
align="right"><tt><font color="black"> <u> 0.997</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.137 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.149 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.062 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 2.217 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>7.47</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.027 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.545</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.549</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.803 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.835 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 786432</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.000</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.001</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.001</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.001</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.001</u> </font></tt></td> </tr> <tr><td bgcolor="black" colspan="19"><font color="white"></font></td></tr> <tr><td colspan="19" align="right"> </td></tr><tr> <td colspan="19" bgcolor="#303030"><b><font color="white">NPROC=1 DIR=/mnt/testfs SYNC=off REP_COUNTER=3 GAMMA=0.0 DD_MBCOUNT=768 DEV=/dev/hdb3 WRITE_BUFFER=4096 FILE_SIZE=8000 BYTES=128000000 </font></b></td></tr> <tr><td colspan="19" align="right"> <font size="-2">Produced by <a href="http://namesys.com/benchmarks/mongo_readme.html">Mongo</a> benchmark suite.</font></td></tr> </tbody></table> === bonnie++ 2003-09-30 === Bonnie++ 
comparison, ext3 vs reiser4 (2003-09-30) This is bonnie++ output for reiser4 and ext3. This has been done in an attempt to analyze <a href="http://fsbench.netnation.com/">results</a> obtained by Mike Benoit. Hardware specs: <pre> processor : 3 vendor_id : GenuineIntel cpu family : 15 model : 2 model name : Intel(R) Xeon(TM) CPU 2.40GHz stepping : 7 cpu MHz : 2379.253 cache size : 512 KB bogomips : 4751.36 </pre> Dual CPU with hyper-threading Memory: 128M HDD: <pre> # hdparm /dev/hdb1 /dev/hdb1: multcount = 16 (on) IO_support = 0 (default 16-bit) unmaskirq = 0 (off) using_dma = 1 (on) keepsettings = 0 (off) readonly = 0 (off) readahead = 256 (on) geometry = 65535/16/63, sectors = 117226242, start = 63 # hdparm -t /dev/hdb1 /dev/hdb1: Timing buffered disk reads: 64 MB in 1.60 seconds = 39.91 MB/sec # hdparm -i /dev/hdb /dev/hdb: Model=ST360021A, FwRev=3.19, SerialNo=3HR173RB Config={ HardSect NotMFM HdSw>15uSec Fixed DTR>10Mbs RotSpdTol>.5% } RawCHS=16383/16/63, TrkSize=0, SectSize=0, ECCbytes=4 BuffType=unknown, BuffSize=2048kB, MaxMultSect=16, MultSect=16 CurCHS=16383/16/63, CurSects=16514064, LBA=yes, LBAsects=117231408 IORDY=on/off, tPIO={min:240,w/IORDY:120}, tDMA={min:120,rec:120} PIO modes: pio0 pio1 pio2 pio3 pio4 DMA modes: mdma0 mdma1 mdma2 UDMA modes: udma0 udma1 udma2 udma3 udma4 *udma5 AdvancedPM=no WriteCache=enabled Drive conforms to: device does not report version: 1 2 3 4 5 </pre> <pre> ./bonnie++ -s 1g -n 10 -x 5 Version 1.03 ------Sequential Output------ --Sequential Input- --Random- -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks-- Machine Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP v4.128M 1G 19903 89 37911 20 15392 11 13624 58 41807 12 131.0 0 v4.128M 1G 19965 89 37600 20 15845 11 13730 58 41751 12 130.0 0 v4.128M 1G 19937 89 37746 20 15404 11 13624 58 41793 12 132.1 0 v4.128M 1G 19998 89 37184 19 15007 10 13393 56 41611 11 130.2 0 v4.128M 1G 19771 89 37679 20 15206 11 13466 57 41808 11 130.2 1 ext3.128M 1G 21236 99 
37258 22 11357 4 13460 56 41748 6 120.0 0 ext3.128M 1G 20821 99 36838 23 12176 5 13154 55 40671 6 120.7 0 ext3.128M 1G 20755 99 37032 24 12069 4 12908 54 40851 5 120.2 0 ext3.128M 1G 20651 99 37094 24 11817 5 13038 54 40842 6 121.3 0 ext3.128M 1G 20928 99 37300 23 12287 4 13067 55 41404 6 120.1 0 ------Sequential Create------ --------Random Create-------- -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete-- files:max:min /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP v4.128M 10 18503 100 +++++ +++ 9488 99 10158 99 +++++ +++ 11635 99 v4.128M 10 19760 99 +++++ +++ 9696 99 10441 100 +++++ +++ 11831 99 v4.128M 10 19583 100 +++++ +++ 9672 100 10597 99 +++++ +++ 11846 100 v4.128M 10 19720 100 +++++ +++ 9577 99 10126 100 +++++ +++ 11924 100 v4.128M 10 19682 100 +++++ +++ 9683 100 10461 100 +++++ +++ 11834 100 ext3.128M 10 3279 97 +++++ +++ +++++ +++ 3406 100 +++++ +++ 8951 95 ext3.128M 10 3303 98 +++++ +++ +++++ +++ 3423 99 +++++ +++ 8558 96 ext3.128M 10 3317 98 +++++ +++ +++++ +++ 3402 100 +++++ +++ 8721 93 ext3.128M 10 3325 98 +++++ +++ +++++ +++ 3390 100 +++++ +++ 9242 100 ext3.128M 10 3315 97 +++++ +++ +++++ +++ 3439 100 +++++ +++ 8896 96 </pre> <pre> ./bonnie++ -f -d . 
-s 3072 -n 10:100000:10:10 -x 1 Version 1.03 ------Sequential Output------ --Sequential Input- --Random- -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks-- Machine Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP v4 3G 37579 19 15657 11 41531 11 105.8 0 v4 3G 37993 20 15478 11 41632 11 105.4 0 ext3 3G 35221 22 10987 4 41105 6 90.9 0 ext3 3G 35099 22 11517 4 41416 6 90.7 0 ------Sequential Create------ --------Random Create-------- -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete-- files:max:min /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP v4 10:100000:10/10 570 39 746 17 1435 23 513 40 104 2 951 15 v4 10:100000:10/10 566 40 765 17 1385 23 509 41 104 2 904 14 ext3 10:100000:10/10 221 8 364 4 853 4 204 7 99 1 306 2 ext3 10:100000:10/10 221 7 368 4 839 5 206 7 91 1 309 2 </pre> <hr> <a name="grant"></a> Benchmarks performed by <a href="mailto:mine0057@mrs.umn.edu">Grant Miner</a>. He used the <a href="http://epoxy.mrs.umn.edu/~minerg/fstests/bench.scm">bench.scm</a> script (which requires <a href="http://www.scsh.net/">scsh</a>). Results (copied from <a href="http://epoxy.mrs.umn.edu/~minerg/fstests/results.html">http://epoxy.mrs.umn.edu/~minerg/fstests/results.html</a>): <p>2.6.0-test3</p> <p>mkfs ran with default options</p> <p>Each test has three columns: the first, headed by the canonical name of the test, is the elapsed time in seconds; the second is system CPU time; the third is user CPU time. The "total" column is the total elapsed time; "sys" is the total system time, "usr" is the total user time, and "total cpu" is the sum of the total system and user times.
</p> <p><b>all values are in seconds thus lower is better</b></p> <table border cellspacing=0 cellpadding=5> <caption>Filesystem Performance</caption> <colgroup> <col> <col bgcolor="gray"> </colgroup> <tr> <th>fs</th> <td bgcolor="lightgray">bigdir</td> <td>sys</td> <td>usr</td> <td bgcolor="lightgray">cp</td> <td>sys</td> <td>usr</td> <td bgcolor="lightgray">cp2</td> <td>sys</td> <td>usr</td> <td bgcolor="lightgray">cp3</td> <td>sys</td> <td>usr</td> <td bgcolor="lightgray">cp4</td> <td>sys</td> <td>usr</td> <td bgcolor="lightgray">cp5</td> <td>sys</td> <td>usr</td> <td bgcolor="lightgray">rm</td> <td>sys</td> <td>usr</td> <td bgcolor="lightgray">rm2</td> <td>sys</td> <td>usr</td> <td bgcolor="lightgray">rm3</td> <td>sys</td> <td>usr</td> <td bgcolor="lightgray">sync</td> <td>sys</td> <td>usr</td> <td bgcolor="lightgray">total</td> <td>sys</td> <td>usr</td> <td bgcolor="lightgray">total cpu</td> <th>fs</th> </tr> <tr> <th>reiserfs</th> <td bgcolor="lightgray">40.03</td> <td>12.22</td> <td>0.76</td> <td bgcolor="lightgray">77.75</td> <td>10.72</td> <td>0.45</td> <td bgcolor="lightgray">62.9</td> <td>10.82</td> <td>0.43</td> <td bgcolor="lightgray">60.26</td> <td>11.03</td> <td>0.43</td> <td bgcolor="lightgray">61.33</td> <td>11.13</td> <td>0.43</td> <td bgcolor="lightgray">66.08</td> <td>11.31</td> <td>0.45</td> <td bgcolor="lightgray">10.86</td> <td>3.74</td> <td>0.07</td> <td bgcolor="lightgray">4.62</td> <td>3.36</td> <td>0.09</td> <td bgcolor="lightgray">8.22</td> <td>3.5</td> <td>0.09</td> <td bgcolor="lightgray">1.78</td> <td>0.03</td> <td>0.</td> <td bgcolor="lightgray">393.83</td> <td>77.86</td> <td>3.2</td> <td bgcolor="lightgray">81.06</td> <th>reiserfs</th> </tr> <tr> <th>jfs</th> <td bgcolor="lightgray">47.2</td> <td>8.9</td> <td>0.77</td> <td bgcolor="lightgray">109.75</td> <td>5.5</td> <td>0.3</td> <td bgcolor="lightgray">110.71</td> <td>5.49</td> <td>0.35</td> <td bgcolor="lightgray">114.69</td> <td>5.6</td> <td>0.29</td> <td 
bgcolor="lightgray">117.97</td> <td>5.65</td> <td>0.35</td> <td bgcolor="lightgray">125.48</td> <td>5.82</td> <td>0.29</td> <td bgcolor="lightgray">38.68</td> <td>0.74</td> <td>0.05</td> <td bgcolor="lightgray">16.25</td> <td>1.08</td> <td>0.07</td> <td bgcolor="lightgray">37.46</td> <td>0.74</td> <td>0.04</td> <td bgcolor="lightgray">0.07</td> <td>0.</td> <td>0.</td> <td bgcolor="lightgray">718.26</td> <td>39.52</td> <td>2.51</td> <td bgcolor="lightgray">42.03</td> <th>jfs</th> </tr> <tr> <th>xfs</th> <td bgcolor="lightgray">44.77</td> <td>13.3</td> <td>0.94</td> <td bgcolor="lightgray">105.36</td> <td>13.33</td> <td>0.53</td> <td bgcolor="lightgray">110.27</td> <td>14.36</td> <td>0.5</td> <td bgcolor="lightgray">110.17</td> <td>14.37</td> <td>0.51</td> <td bgcolor="lightgray">111.03</td> <td>14.43</td> <td>0.53</td> <td bgcolor="lightgray">118.84</td> <td>14.87</td> <td>0.55</td> <td bgcolor="lightgray">31.85</td> <td>6.44</td> <td>0.15</td> <td bgcolor="lightgray">15.2</td> <td>5.45</td> <td>0.14</td> <td bgcolor="lightgray">34.32</td> <td>5.87</td> <td>0.14</td> <td bgcolor="lightgray">0.03</td> <td>0.</td> <td>0.</td> <td bgcolor="lightgray">681.84</td> <td>102.42</td> <td>3.99</td> <td bgcolor="lightgray">106.41</td> <th>xfs</th> </tr> <tr> <th>reiser4</th> <td bgcolor="lightgray">33.51</td> <td>10.85</td> <td>0.69</td> <td bgcolor="lightgray">33.9</td> <td>10.65</td> <td>0.65</td> <td bgcolor="lightgray">32.9</td> <td>10.79</td> <td>0.67</td> <td bgcolor="lightgray">34.</td> <td>10.87</td> <td>0.65</td> <td bgcolor="lightgray">33.62</td> <td>10.87</td> <td>0.69</td> <td bgcolor="lightgray">31.31</td> <td>10.83</td> <td>0.76</td> <td bgcolor="lightgray">17.45</td> <td>4.07</td> <td>0.3</td> <td bgcolor="lightgray">11.54</td> <td>4.49</td> <td>0.3</td> <td bgcolor="lightgray">13.08</td> <td>4.27</td> <td>0.27</td> <td bgcolor="lightgray">0.52</td> <td>0.</td> <td>0.</td> <td bgcolor="lightgray">241.83</td> <td>77.69</td> <td>4.98</td> <td 
bgcolor="lightgray">82.67</td> <th>reiser4</th> </tr> <tr> <th>ext3</th> <td bgcolor="lightgray">38.79</td> <td>9.35</td> <td>0.7</td> <td bgcolor="lightgray">91.57</td> <td>7.21</td> <td>0.36</td> <td bgcolor="lightgray">62.6</td> <td>7.44</td> <td>0.36</td> <td bgcolor="lightgray">62.74</td> <td>7.5</td> <td>0.37</td> <td bgcolor="lightgray">60.62</td> <td>7.52</td> <td>0.34</td> <td bgcolor="lightgray">69.82</td> <td>7.59</td> <td>0.39</td> <td bgcolor="lightgray">26.21</td> <td>1.67</td> <td>0.05</td> <td bgcolor="lightgray">8.73</td> <td>1.66</td> <td>0.04</td> <td bgcolor="lightgray">13.79</td> <td>1.63</td> <td>0.06</td> <td bgcolor="lightgray">4.76</td> <td>0.01</td> <td>0.</td> <td bgcolor="lightgray">439.63</td> <td>51.58</td> <td>2.67</td> <td bgcolor="lightgray">54.25</td> <th>ext3</th> </tr> <tr> <th>ext2</th> <td bgcolor="lightgray">32.78</td> <td>7.61</td> <td>0.64</td> <td bgcolor="lightgray">37.28</td> <td>5.24</td> <td>0.34</td> <td bgcolor="lightgray">43.55</td> <td>5.34</td> <td>0.35</td> <td bgcolor="lightgray">45.41</td> <td>5.34</td> <td>0.37</td> <td bgcolor="lightgray">47.72</td> <td>5.48</td> <td>0.34</td> <td bgcolor="lightgray">50.5</td> <td>5.41</td> <td>0.32</td> <td bgcolor="lightgray">16.28</td> <td>0.67</td> <td>0.06</td> <td bgcolor="lightgray">7.54</td> <td>0.66</td> <td>0.05</td> <td bgcolor="lightgray">15.31</td> <td>0.71</td> <td>0.05</td> <td bgcolor="lightgray">0.24</td> <td>0.</td> <td>0.</td> <td bgcolor="lightgray">296.61</td> <td>36.46</td> <td>2.52</td> <td bgcolor="lightgray">38.98</td> <th>ext2</th> </tr> </table> <hr> </body> </html> <hr> <address><a href="mailto:reiser@namesys.com">Hans Reiser</a></address> <!-- Created: Sat Aug 23 00:28:46 MSD 2003 --> <!-- hhmts start --> Last modified: Thu Nov 20 17:51:10 MSK 2003 <!-- hhmts end --> [[category:Reiser4]] [[category:formatting-fixes-needed]] <font color="red">This page is a disaster, do we even want to clean this up?
It's all stale benchmarks anyway :-\</font> __TOC__ == Benchmarks Of Reiser4 == The <tt>htree</tt> (<tt>-O dir_index</tt>) feature is a recent attempt by the ext3 developers to handle large directories as well as ReiserFS does, by using better-than-linear search algorithms. One of the interesting results was that <tt>htree</tt> hurts ext3 performance, at least for this benchmark. This means that trying to get usable large-directory performance out of ext3 can severely impact performance in the non-large case. You'll note that in our latest benchmark at the top we use larger filesets. It seems that ext3 does a poor job of utilizing its write cache when the fileset uses a lot of memory without exceeding it, and by increasing the size of the fileset we get a fairer (read: better for ext3) benchmark for the create phase. The use of filesets small enough to barely fit into RAM for the create (but not the copy) phase was due to my being lax in supervising the benchmarking, but it did reveal something interesting. Andrew Morton will probably fix that pretty quickly; it's most likely not a deep fix like fixing <tt>htree</tt> would be. If anyone knows where the tail-combining patch for ext3 went, let us know so we can benchmark it; good tail-combining performance is not trivial to get right, and I wonder whether there is a performance reason it was not merged. Keep in mind that these benchmarks are still evolving and maturing, and I need to give the mongo code a complete review again, as it has been worked on by others quite a bit. Note that while I like the mongo benchmarks, those who are concerned that they may be stacked in our favor can look at the benchmarks run by others on lkml, one of which appears at the bottom of this page and which, while not as elaborate and detailed as mongo, comes to roughly the same result.
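The red/green convention used throughout the Mongo comparison tables on this page can be stated in a few lines. The sketch below is illustrative only and is not part of the benchmark suite; the neutral "black" band around 1.0 is an assumption inferred from the tables (the legends define only red, green, and the 2% underline margin).

```python
def style_ratio(ratio, row_best, margin=0.02):
    """Classify one time-ratio cell (other filesystem / reference filesystem).

    Lower is better for time columns, so a ratio above 1.0 means the
    reference filesystem won (rendered red in the tables) and a ratio
    below 1.0 means it lost (green).  The +/-2% neutral band for "black"
    is an assumption; the 2% underline margin comes from the legend.
    """
    if ratio > 1.0 + margin:
        color = "red"        # reference filesystem is faster here
    elif ratio < 1.0 - margin:
        color = "green"      # reference filesystem is slower here
    else:
        color = "black"      # effectively a tie
    # The best value in a row is underlined, as is anything within 2% of it.
    underline = ratio <= row_best * (1.0 + margin)
    return color, underline

# Example: the CREATE REAL_TIME row of the 2.6.15-mm4 table below
# (A is the reference, so its implicit ratio is 1.0; B/A = 1.234, C/A = 4.249).
ratios = {"A": 1.0, "B/A": 1.234, "C/A": 4.249}
best = min(ratios.values())
styled = {name: style_ratio(r, best) for name, r in ratios.items()}
# styled["C/A"] == ("red", False): ext3 took 4.249x as long as the reference.
```

This matches the rendering in the tables: 1.234 and 4.249 are red and not underlined, while the reference column's absolute value is underlined.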
Andrew Morton wrote some beautiful readahead code for the VM; many thanks to him for what it contributes to V4 performance. Unfortunately, it must be confessed that these benchmarks utterly fail to measure its cleverness for real-world usage patterns: they access everything exactly once in each pass, which is not at all representative of typical server workloads. So understand them as validly illuminating some aspects of performance, not all aspects, if you could be so generous. We ran data-ordered ext3 benchmarks at the suggestion of Andrew Morton, but they came out slower for this benchmark. We need to increase the base size range to 8k and run again. [[Reiser4]] is a fully atomic filesystem; keep in mind that these performance numbers were obtained with every FS operation performed as a fully atomic transaction. We are the first to make that performance-effective. Look for a user-space transactions interface to come out soon. Finally, remember that Reiser4 is more space-efficient than [[ReiserFS]]; the <tt>df(1)</tt> measurements are there for looking at... ;-) === mongo 2.6.15-mm4 === Comparative results of the [[Mongo]] benchmark: ext3 vs reiser4 with the "unixfile" regular file plugin vs reiser4 with the [ftp://ftp.namesys.com/pub/tmp/cryptcompress_patches cryptcompress] regular file plugin.
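As a quick sanity check on the DF (disk usage) column in the table that follows: column A is the cryptcompress filesystem and B/A, C/A are relative disk usage, so the effective space saving from compression falls out directly. A small illustrative calculation using the COPY-phase numbers from the table:

```python
# DF (disk usage, KB) after the COPY phase; A is the cryptcompress reference.
df_a = 1_551_756            # reiser4 + cryptcompress
ratio_b = 2.550             # plain reiser4 ("unixfile") used 2.550x as much
ratio_c = 2.825             # ext3 used 2.825x as much

df_b = df_a * ratio_b       # ~3_956_978 KB for plain reiser4
df_c = df_a * ratio_c       # ~4_383_711 KB for ext3

# Space saved by cryptcompress relative to plain reiser4:
saving = 1 - df_a / df_b    # ~0.61, i.e. roughly 61% less disk used
```

In other words, for this fileset the col8 compression mode stored the same data in about 39% of the space that plain reiser4 needed.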
* 2.6.15-mm4 #1 Sat Feb 11 20:00:11 MSK 2006 * cryptcompress-4.patch * mem total = 516312 KB * Intel(R) Xeon(TM) CPU 2.40GHz, running UP kernel <p>Legend:</p> <ul> <li><tt>A</tt> reiser4 with "cryptcompress" regular file plugin</li> <li><tt>B</tt> reiser4 with "unixfile" regular file plugin</li> <li><tt>C</tt> ext3</li> </ul> <p> Table presents absolute values (of elapsed time, CPU usage, CPU utilization, disk usage) for reiser4 with "cryptcompress" regular file plugin, and ratios against this reiser4 for reiser4 with "unixfile" regular file plugin and ext3. <font color=red>Red</font> number means ratio is larger than <tt>1.0</tt>, that is, reiser4 with "cryptcompress" regular file plugin is better in this test. <font color=green>Green</font> number means that it loses in this test. </p> <table cols=13 cellpadding=2 cellspacing=2 noborder> <tr><td bgcolor=black colspan=13><font color=white></td></tr> <tr> <th bgcolor=#303030 colspan=13 align=left><font color=white>A.MKFS=mkfs.reiser4 -y -o create=create_ccreg40,compressMode=col8 MOUNT_OPTIONS=noatime FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=13 align=left><font color=white>B.MKFS=mkfs.reiser4 -y MOUNT_OPTIONS=noatime FSTYPE=reiser4 (unixfile regular file plugin)</font></th> </tr> <tr> <th bgcolor=#303030 colspan=13 align=left><font color=white>C.MOUNT_OPTIONS=noatime,data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <td colspan=13 bgcolor=#606060><b><font color=white>#0:</font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=3><b>REAL_TIME</b></td> <td colspan=3><b>CPU_TIME</b></td> <td colspan=3><b>CPU_UTIL</b></td> <td colspan=3><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td> </tr> <tr> <td 
bgcolor=#C0C0C0><b>CREATE</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 53.36</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.234 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 4.249 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>28.79</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.493</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>94.36</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.255 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.155</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 775856</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.550 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.825 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>COPY</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 137.6</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.543 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.931 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>40.91</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.716</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.975 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>59.94</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.257 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.183</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1551756</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.550 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.825 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>READ</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 
161.17</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.087 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.077 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>48.35</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.433 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.195</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>33.23</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.487 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.291</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1551756</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.550 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.825 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>STATS</b></td> <td bgcolor=#E0E0C0 align=right><tt>24.12</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.936</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.927</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>6.76</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.941 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.624</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>27.97</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.005 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.676</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1551756</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.550 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.825 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>DELETE</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 155.26</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font 
color=red> 1.091 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 0.989</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>38.76</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.824 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.108</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>26.33</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.758 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.104</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>4</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.000 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> </tt></td> </tr> <tr> <td colspan=13 bgcolor=#606060><b><font color=white>#1:DD_MBCOUNT=5000 </font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=3><b>REAL_TIME</b></td> <td colspan=3><b>CPU_TIME</b></td> <td colspan=3><b>CPU_UTIL</b></td> <td colspan=3><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_writing_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 116.02</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.430 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.553 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>38.65</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.514</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.619 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>92.86</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font 
color=green> 0.155 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.149</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1909012</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.682 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.685 </font></tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_reading_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 153.76</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 0.996</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt>58.11</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.192 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.147</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt>38.73</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.224 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.152</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1909012</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.682 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.685 </font></tt></td> </tr> <tr><td bgcolor=black colspan=13><font color=white></font></td></tr> <tr> <td colspan=13 bgcolor=#303030><b><font color=white>DIR=/mnt1 GAMMA=0.2 WRITE_BUFFER=131072 PHASE_APPEND=off SYNC=off PHASE_DELETE=rm NPROC=1 DEV=/dev/hda9 DD_MBCOUNT=5000 FILE_SIZE=8192 REP_COUNTER=1 PHASE_COPY=cp INFO_R4=2.6.15-mm4 cryptcompress-4.patch PHASE_READ=find BYTES=1024000000 PHASE_OVERWRITE=off PHASE_MODIFY=off </font></b></td></tr> </table> Legend: a <font color="green">green</font> value means the result is better (smaller) than the reference value in the first column; results marked <font color="red">red</font> are worse than the reference value. The best result in each group is <u>underlined</u>; other results within a 2% margin of the best are underlined as well. === mongo 2.6.11 === [[mongo]] comparison against xfs and ext2 <dl> <dt>reiser4 </dt> <dd>reiser4-for-2.6.11-5.patch from <a href="ftp://ftp.namesys.com/pub/reiser4-for-2.6/2.6.11">ftp://ftp.namesys.com/pub/reiser4-for-2.6/2.6.11</a> </dd> <dt>mem total</dt> <dd>254496</dd> <dt>machine </dt> <dd>bones</dd> <dt>kernel </dt> <dd>2.6.11-reiser4-5 #2 SMP Sat Jun 4 20:06:47 MSD 2005</dd> <dt>date </dt> <dd>Fri Jun 17 23:52:17 2005</dd> </dl> <p> In this test 81% of the files are chosen from the 0-10k size range and 19% from the 10-100k size range. </p> <!-- File stats: Units are decimal (1k = 1000) files 0-100 : 1433 files 100-1K : 12597 files 1K-10K : 103101 files 10K-100K : 28131 files 100K-1M : 0 files 1M-10M : 0 files 10M-larger : 0 total bytes written : 1886585039 --> <p>Legend:</p> <ul> <li><tt>A</tt> reiser4</li> <li><tt>B</tt> reiserfs <tt>v3 (notail)</tt></li> <li><tt>C</tt> ext2</li> <li><tt>D</tt> xfs default</li> </ul> <p> The table presents absolute values (of elapsed time, CPU usage, CPU utilization, and disk usage) for reiser4, and ratios against reiser4 for all other configurations. A <font color=red>red</font> number means the ratio is larger than <tt>1.0</tt>, that is, reiser4 is better in this test. A <font color=green>green</font> number means that reiser4 loses in this test.
</p> <table cols=17 cellpadding=2 cellspacing=2 noborder> <tr><td bgcolor=black colspan=17><font color=white></td></tr> <tr> <th bgcolor=#303030 colspan=17 align=left><font color=white>A.FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=17 align=left><font color=white>B.FSTYPE=reiserfs MOUNT_OPTIONS=notail </font></th> </tr> <tr> <th bgcolor=#303030 colspan=17 align=left><font color=white>C.FSTYPE=ext2 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=17 align=left><font color=white>D.MKFS=mkfs.xfs -f FSTYPE=xfs </font></th> </tr> <tr> <td colspan=17 bgcolor=#606060><b><font color=white>#0:</font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=4><b>REAL_TIME</b></td> <td colspan=4><b>CPU_TIME</b></td> <td colspan=4><b>CPU_UTIL</b></td> <td colspan=4><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>CREATE</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 66.12</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.022 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.686 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 4.288 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>34.98</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.901</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.114 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.445 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>29.86</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.424 </font></tt></td> <td bgcolor=#E0E0C0 
align=right><tt><font color=green> <U> 0.398</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.398</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1623204</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.086 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.098 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>COPY</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 187.77</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.438 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.751 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.733 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>44.8</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.883</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.124 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.161 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>14.85</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.606 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.611 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.353</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 3245428</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.087 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.098 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>READ</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 151.01</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.459 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.113 </font></tt></td> 
<td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.978 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>44.34</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.607 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.470</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.535 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>18.54</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.444</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.500 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.724 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 3245428</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.087 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.098 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>STATS</b></td> <td bgcolor=#E0E0C0 align=right><tt>22.04</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.314 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.812</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.871 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>8.61</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.698 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.571</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 4.591 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>20.11</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.528</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.709 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.579 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 
align=right><tt><U> 3245428</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.087 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.098 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>DELETE</b></td> <td bgcolor=#E0E0C0 align=right><tt>108.77</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.313</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.193 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.071 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>41</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.637 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.091</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.795 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>21.45</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.795 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.077</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.556 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>4</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 14877.000 </font></tt></td> </tt></td> </tr> <tr> <td colspan=17 bgcolor=#606060><b><font color=white>#1:DD_MBCOUNT=5000 </font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=4><b>REAL_TIME</b></td> <td colspan=4><b>CPU_TIME</b></td> <td colspan=4><b>CPU_UTIL</b></td> <td colspan=4><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td> 
<td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_writing_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 536.06</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.005 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.017 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 0.982</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>122.28</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.826 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.819</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.806</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>14.99</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.771 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.711</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.742 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 5120008</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.012</U> </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_reading_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt>145.32</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.031 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.965</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 0.982</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 
align=right><tt>157.51</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.947 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.890</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.880</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>57.01</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.901</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.909 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.884</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 5120008</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.012</U> </font></tt></td> </tt></td> </tr> <tr><td bgcolor=black colspan=17><font color=white></td></tr> <tr><td colspan=17 align=right> <tr> <td colspan=17 bgcolor=#303030><b><font color=white>INFO_R4=2.6.11 + reiser4-5 REP_COUNTER=1 DEV=/dev/hda5 DD_MBCOUNT=5000 PHASE_OVERWRITE=off FILE_SIZE=8192 NPROC=3 PHASE_READ=find PHASE_DELETE=rm PHASE_APPEND=off WRITE_BUFFER=131072 DIR=/mnt1 PHASE_MODIFY=off BYTES=1024000000 PHASE_COPY=cp GAMMA=0.2 SYNC=off </td></tr> <tr><td colspan=17 align=right> <font size=-2>Produced by <a href=http://namesys.com/benchmarks/mongo_readme.html>Mongo</a> benchmark suite.</font></td></tr> </table> === mongo 2.6.8.1-mm3 === [[mongo]] comparison against ext3 <dl> <dt>reiser4 </dt> <dd>large key</dd> <dt>mem total</dt> <dd>254324</dd> <dt>machine </dt> <dd>bones</dd> <dt>kernel </dt> <dd>2.6.8.1-mm3 #3 SMP Mon Aug 23 19:33:13 MSD 2004</dd> <dt>date </dt> <dd>Tue Aug 31 15:47:51 2004</dd> </dl> <p> In this test 81% of files are chosen from the 0-10k size range and 19% from the 10-100k size range. 
</p> <!-- File stats: Units are decimal (1k = 1000) files 0-100 : 1433 files 100-1K : 12597 files 1K-10K : 103101 files 10K-100K : 28131 files 100K-1M : 0 files 1M-10M : 0 files 10M-larger : 0 total bytes written : 1886585039 --> <p>Legend:</p> <ul> <li><tt>A</tt> reiser4</li> <li><tt>B</tt> reiser4, extents only</li> <li><tt>C</tt> reiserfs <tt>v3 (notail)</tt></li> <li><tt>D</tt> ext3 in <tt>data=writeback</tt> mode (meta-data only journalling)</li> <li><tt>E</tt> ext3 in <tt>data=journal</tt> mode</li> <li><tt>F</tt> ext3 in <tt>data=ordered</tt> mode</li> </ul> <img src="http://www.namesys.com/intbenchmarks/mongo/04.08.26/256MB.RAM/one-thread-8k.g02.charts/CREATE.0.png"> <img src="http://www.namesys.com/intbenchmarks/mongo/04.08.26/256MB.RAM/one-thread-8k.g02.charts/COPY.0.png"> <img src="http://www.namesys.com/intbenchmarks/mongo/04.08.26/256MB.RAM/one-thread-8k.g02.charts/READ.0.png"> <img src="http://www.namesys.com/intbenchmarks/mongo/04.08.26/256MB.RAM/one-thread-8k.g02.charts/STATS.0.png"> <img src="http://www.namesys.com/intbenchmarks/mongo/04.08.26/256MB.RAM/one-thread-8k.g02.charts/DELETE.0.png"> <img src="http://www.namesys.com/intbenchmarks/mongo/04.08.26/256MB.RAM/one-thread-8k.g02.charts/dd_writing_largefile.1.png"> <img src="http://www.namesys.com/intbenchmarks/mongo/04.08.26/256MB.RAM/one-thread-8k.g02.charts/dd_reading_largefile.1.png"> <p> The table presents absolute values (elapsed time, CPU usage, CPU utilization, disk usage) for reiser4, and ratios relative to reiser4 for all other configurations. A <font color=red>red</font> number means the ratio is greater than <tt>1.0</tt>, i.e. reiser4 is better in this test; a <font color=green>green</font> number means that reiser4 loses in this test.
</p> <table cols=25 cellpadding=2 cellspacing=2 noborder> <tr><td bgcolor=black colspan=25><font color=white></td></tr> <tr> <th bgcolor=#303030 colspan=25 align=left><font color=white>A.FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=25 align=left><font color=white>B.FSTYPE=reiser4 MKFS=mkfs.reiser4 -q -o extent=extent40 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=25 align=left><font color=white>C.MOUNT_OPTIONS=notail FSTYPE=reiserfs </font></th> </tr> <tr> <th bgcolor=#303030 colspan=25 align=left><font color=white>D.MOUNT_OPTIONS="data=writeback" FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=25 align=left><font color=white>E.MOUNT_OPTIONS="data=journal" FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=25 align=left><font color=white>F.MOUNT_OPTIONS="data=ordered" FSTYPE=ext3 </font></th> </tr> <tr> <td colspan=25 bgcolor=#606060><b><font color=white>#0:</font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=6><b>REAL_TIME</b></td> <td colspan=6><b>CPU_TIME</b></td> <td colspan=6><b>CPU_UTIL</b></td> <td colspan=6><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>CREATE</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 91.6</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 0.988</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.983 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.592 
</font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.010 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.256 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>31.13</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.965 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.826</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.577 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.529 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.802 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>22.63</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.981 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.350</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.791 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.738 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.000 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1978440</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.088 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>COPY</b></td> <td bgcolor=#E0E0C0 align=right><tt>219.5</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.968</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.674 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.241 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.105 </font></tt></td> <td 
bgcolor=#E0E0C0 align=right><tt><font color=red> 1.819 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>54.04</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.938 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.792</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.694 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.004 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.860 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>16.01</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.996 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.460</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.663 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.839 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.890 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 3956708</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.088 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>READ</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 187.34</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.007</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.617 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.282 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.295 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.250 </font></tt></td> </tt></td> <td 
bgcolor=#E0E0C0 align=right><tt>38.61</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.002 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.711 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.615</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.622</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.615</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>13.05</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.995 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.441</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.520 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.517 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.533 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 3956708</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.088 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>STATS</b></td> <td bgcolor=#E0E0C0 align=right><tt>23.71</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.968 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.162 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.943</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.943</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.943</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>10.91</tt></td> <td 
bgcolor=#E0E0C0 align=right><tt><font color=green> 0.944 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.717 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.661</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.674 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.658</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>24.46</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.971 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.587</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.700 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.707 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.697 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 3956708</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.088 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>DELETE</b></td> <td bgcolor=#E0E0C0 align=right><tt>156.84</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.993 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.233</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.264 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.270 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.216 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>53.05</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.938 </font></tt></td> <td 
bgcolor=#E0E0C0 align=right><tt><font color=green> 0.440 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.209</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.215 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.214 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>18.23</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.947 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.758 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.157</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.160 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.167 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>4</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.000 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> </tt></td> </tr> <tr> <td colspan=25 bgcolor=#606060><b><font color=white>#1:DD_MBCOUNT=768 </font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=6><b>REAL_TIME</b></td> <td colspan=6><b>CPU_TIME</b></td> <td colspan=6><b>CPU_UTIL</b></td> <td colspan=6><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A 
</b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_writing_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 30.09</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.006</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.286 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.342 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.473 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.311 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>5.24</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.996 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.966</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.286 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.393 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.437 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>11.43</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.994 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.631</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.796 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.655 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.967 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 786436</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font 
color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_reading_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt>28.38</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.969</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.010 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 0.980</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 0.982</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.999 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>4.37</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.979 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.014 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.911</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.895</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.936 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>8.88</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.030 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.922 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.858</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.854</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.867</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 786436</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 
align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr><td bgcolor=black colspan=25><font color=white></td></tr> <tr><td colspan=25 align=right> <tr> <td colspan=25 bgcolor=#303030><b><font color=white>REP_COUNTER=1 PHASE_COPY=cp INFO_R4=2.6.8.1-mm3 + parse_options.patch FILE_SIZE=8192 DEV=/dev/hda6 PHASE_MODIFY=off DD_MBCOUNT=768 PHASE_APPEND=off PHASE_OVERWRITE=off SYNC=off DIR=/mnt1 PHASE_DELETE=rm NPROC=1 BYTES=1024000000 GAMMA=0.2 PHASE_READ=find WRITE_BUFFER=131072 </td></tr> <tr><td colspan=25 align=right> <font size=-2>Produced by <a href=http://namesys.com/>Mongo</a> benchmark suite.</font></td></tr> </table> === slow.c 2004-03-26 === [[slow.c]] comparison against ext2 and ext3, 2004-03-26 <p> These are <a href="http://www.jburgess.uklinux.net/slow.c">slow.c</a> benchmark results for the latest 2004.03.26 reiser4 snapshot. </p> <p> <b>slow.c</b> is a simple program by Jon Burgess which writes and reads multiple data streams. For the details and the source code, see <a href="http://marc.theaimsgroup.com/?l=linux-kernel&m=107652683608384&w=2">the discussion</a> on the linux-kernel mailing list.
</p> <p> kernel : 2.6.5-rc2</p> <p> RAM : 256Mb</p> <p> reiser4 : <a href="http://www.namesys.com/snapshots/2004.03.26/">2004.03.26 snapshot</a></p> <p>Hardware specs:</p> <pre> processor : 1 vendor_id : AuthenticAMD cpu family : 6 model : 6 model name : AMD Athlon(tm) Processor stepping : 2 cpu MHz : 1460.098 cache size : 256 KB bogomips : 2916.35 Dual CPU AMD Athlon(tm) 1.4Ghz </pre> <pre> # hdparm /dev/hda6: multcount = 16 (on) IO_support = 1 (32-bit) unmaskirq = 1 (on) using_dma = 1 (on) keepsettings = 0 (off) readonly = 0 (off) readahead = 256 (on) geometry = 65535/16/63, sectors = 35937342, start = 84164598 </pre> <pre> # hdparm -t /dev/hda6 /dev/hda6: Timing buffered disk reads: 84 MB in 3.07 seconds = 27.39 MB/sec </pre> <pre> # hdparm -i /dev/hda /dev/hda: Model=IC35L060AVER07-0, FwRev=ER6OA44A, SerialNo=SZPTZMB6154 Config={ HardSect NotMFM HdSw>15uSec Fixed DTR>10Mbs } RawCHS=16383/16/63, TrkSize=0, SectSize=0, ECCbytes=40 BuffType=DualPortCache, BuffSize=1916kB, MaxMultSect=16, MultSect=16 CurCHS=16383/16/63, CurSects=16514064, LBA=yes, LBAsects=120103200 IORDY=on/off, tPIO={min:240,w/IORDY:120}, tDMA={min:120,rec:120} PIO modes: pio0 pio1 pio2 pio3 pio4 DMA modes: mdma0 mdma1 mdma2 UDMA modes: udma0 udma1 udma2 AdvancedPM=yes: disabled (255) WriteCache=enabled Drive conforms to: ATA/ATAPI-5 T13 1321D revision 1: * signifies the current active mode </pre> <pre> <!-- (500Mb of data) test : ./slow foo 500 Results : ============================================================== | 1 stream | 2 streams --------------+----------------------------------------------- | WRITE READ | WRITE READ --------------+----------------------------------------------- ext2 25.08Mb/s 27.08Mb/s 13.72Mb/s 14.04Mb/s reiser4 26.31Mb/s 26.99Mb/s 24.03Mb/s 26.84Mb/s reiser4-extents 25.28Mb/s 27.40Mb/s 24.12Mb/s 26.85Mb/s ext3-ordered 20.99Mb/s 26.40Mb/s 12.01Mb/s 13.34Mb/s ext3-journal 10.13Mb/s 24.48Mb/s 8.87Mb/s 13.26Mb/s reiserfs 20.42Mb/s 27.67Mb/s 12.98Mb/s 13.13Mb/s 
reiserfs-notail 20.07Mb/s 27.58Mb/s 13.04Mb/s 13.25Mb/s ============================================================== --> (1000Mb of data) test : ./slow foo 1000 Results :
==============================================================================================================
                |       1 stream        |       2 streams       |       4 streams       |       8 streams
----------------+---------------------------------------------------------------------------------------------
                |  WRITE      READ      |  WRITE      READ      |  WRITE      READ      |  WRITE      READ
----------------+---------------------------------------------------------------------------------------------
ext2              24.66Mb/s  27.56Mb/s    13.40Mb/s  13.67Mb/s     7.73Mb/s   6.94Mb/s     6.69Mb/s   3.52Mb/s
reiser4           25.42Mb/s  27.71Mb/s    23.96Mb/s  26.34Mb/s    24.55Mb/s  26.58Mb/s    24.90Mb/s  26.76Mb/s
reiser4-extents   25.60Mb/s  27.68Mb/s    24.19Mb/s  25.92Mb/s    25.24Mb/s  27.12Mb/s    25.39Mb/s  26.72Mb/s
ext3-ordered      20.05Mb/s  26.46Mb/s    11.06Mb/s  13.12Mb/s     9.63Mb/s   6.76Mb/s    10.02Mb/s   3.48Mb/s
ext3-journal      10.10Mb/s  26.81Mb/s     8.87Mb/s  13.08Mb/s     8.59Mb/s   6.84Mb/s     8.14Mb/s   3.47Mb/s
reiserfs          20.19Mb/s  27.48Mb/s    12.69Mb/s  13.03Mb/s     8.27Mb/s   6.84Mb/s     7.87Mb/s   4.13Mb/s
reiserfs-notail   20.31Mb/s  27.10Mb/s    12.74Mb/s  13.09Mb/s     8.33Mb/s   6.89Mb/s     7.87Mb/s   4.17Mb/s
==============================================================================================================
</pre> <table> <tr> <td><img src="intbenchmarks/slow/04.03.25-int.snapshot.bones/wr.1.png"></td> <td><img src="intbenchmarks/slow/04.03.25-int.snapshot.bones/wr.2.png"></td> <td><img src="intbenchmarks/slow/04.03.25-int.snapshot.bones/wr.4.png"></td> <td><img src="intbenchmarks/slow/04.03.25-int.snapshot.bones/wr.8.png"></td> </tr> <tr> <td><img src="intbenchmarks/slow/04.03.25-int.snapshot.bones/rd.1.png"></td> <td><img src="intbenchmarks/slow/04.03.25-int.snapshot.bones/rd.2.png"></td> <td><img src="intbenchmarks/slow/04.03.25-int.snapshot.bones/rd.4.png"></td> <td><img src="intbenchmarks/slow/04.03.25-int.snapshot.bones/rd.8.png"></td> </tr> 
</table> === mongo 2003-11-20 === [[mongo]] comparison against ext3, 2003-11-20 <dl> <dt>reiser4 </dt> <dd>''</dd> <dt>mem total</dt> <dd>255716</dd> <dt>machine </dt> <dd>belka</dd> <dt>kernel </dt> <dd>2.6.0-test9 #2 SMP Thu Nov 20 16:08:42 MSK 2003</dd> <dt>date </dt> <dd>Thu Nov 20 16:16:50 2003</dd> </dl> <p> In this test 80% of files are chosen from the 0-8k size range, 16% from the 0-80k size range, 0.8 x 4% from the 0-800k size range, etc. Most files are small, most bytes are in large files. </p> <p>Legend:</p> <ul> <li><tt>A</tt> reiser4</li> <li><tt>B</tt> reiser4, extents only</li> <li><tt>C</tt> reiserfs <tt>v3</tt></li> <li><tt>D</tt> ext3 in <tt>data=writeback</tt> mode (meta-data only journalling)</li> <li><tt>E</tt> ext3 in <tt>data=journal</tt> mode</li> <li><tt>F</tt> ext3 in <tt>data=ordered</tt> mode</li> <li><tt>G</tt> ext3 with htree (hashed directories)</li> </ul> <p> Table presents absolute values (of elapsed time, CPU usage, and disk usage) for reiser4, and ratios against reiser4 for all other configurations. <font color=red>Red</font> number means ratio is larger than <tt>1.0</tt>, that is, reiser4 is better in this test. <font color=green>Green</font> number means that reiser4 loses in this test. 
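The file-size distribution above is geometric: each successive decade (0-8k, 0-80k, 0-800k, ...) is chosen with probability proportional to 0.2^k, matching the GAMMA=0.2 parameter in the run settings. A minimal sampler sketch, assuming sizes are uniform within the chosen range (the exact rule mongo uses may differ):

```python
import random

def mongo_file_size(gamma=0.2, base=8192, rng=random):
    """Sample a file size: range 0..8k with probability 1-gamma,
    0..80k with (1-gamma)*gamma, 0..800k with (1-gamma)*gamma**2,
    and so on. Uniform within the chosen range (an assumption)."""
    limit = base
    while rng.random() < gamma:   # escalate to the next decade
        limit *= 10
    return rng.randrange(limit)
```

With these parameters about 82% of sampled sizes fall below 8 KB, while the expected bytes contributed by decade k grow as 2^k (probability 0.2^k times size 10^k), so the rare large files dominate the total volume — hence "most files are small, most bytes are in large files."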
</p> <table cols=22 cellpadding=2 cellspacing=2 noborder> <tr><td bgcolor=black colspan=22><font color=white></td></tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>A.INFO_R4='' FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>B.INFO_R4='' MKFS=mkfs.reiser4 -q -o policy=extents FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>C.FSTYPE=reiserfs </font></th> </tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>D.MOUNT_OPTIONS=data=writeback FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>E.MOUNT_OPTIONS=data=journal FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>F.MOUNT_OPTIONS=data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>G.MKFS=mkfs.ext3 -O dir_index MOUNT_OPTIONS=data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <td colspan=22 bgcolor=#606060><b><font color=white>#0:</font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=7><b>REAL_TIME</b></td> <td colspan=7><b>CPU_TIME</b></td> <td colspan=7><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>CREATE</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 21.81</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.171 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.983 </font></tt></td> <td bgcolor=#E0E0C0 
align=right><tt><font color=red> 3.253 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.702 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.161 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.212 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>6.38</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.130 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.020 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.461 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.461 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.354 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.851</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 607612</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.091 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.035 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>COPY</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 64.37</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.089 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.046 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.980 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.834 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.929 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 6.246 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>11.55</tt></td> 
<td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.047 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.797 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.590 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.725 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.542 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.698</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1214992</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.091 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.034 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>READ</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 45.38</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.026 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.406 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.248 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.307 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.232 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 7.192 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>10.13</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.934 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.517 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.454 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.453</U> </font></tt></td> <td bgcolor=#E0E0C0 
align=right><tt><font color=green> <U> 0.444</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.504 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1214992</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.091 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.034 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>STATS</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 5.74</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.030 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.413 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.014</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.033 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.021 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.634 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>2.34</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.000 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.936 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.761 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.791 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.774 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.744</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1214992</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.091 </font></tt></td> <td bgcolor=#E0E0C0 
align=right><tt><font color=red> 1.034 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>DELETE</b></td> <td bgcolor=#E0E0C0 align=right><tt>46.94</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.424</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.520 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.017 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.043 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.956 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.315 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>14.19</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.743 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.443 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.200</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.206 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.201</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.234 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>4</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.000 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td 
bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> </tt></td> </tr> <tr> <td colspan=22 bgcolor=#606060><b><font color=white>#1:DD_MBCOUNT=768 </font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=7><b>REAL_TIME</b></td> <td colspan=7><b>CPU_TIME</b></td> <td colspan=7><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_writing_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 29.33</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.026 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.184 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.102 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.499 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.097 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.098 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>2.61</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.008 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.659</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.437 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.054 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.556 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.571 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 786436</U></tt></td> 
<td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_reading_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 22.96</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.056 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.003</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.004</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.003</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.006</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>2.26</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.991 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.912 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.796 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.765</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.779</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.783 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 786436</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 
align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr><td bgcolor=black colspan=22><font color=white></td></tr> <tr><td colspan=22 align=right> <tr> <td colspan=22 bgcolor=#303030><b><font color=white>NPROC=1 DIR=/mnt/testfs SYNC=off PHASE_COPY=cp REP_COUNTER=1 GAMMA=0.2 PHASE_OVERWRITE=off FILE_SIZE=8192 BYTES=512000000 PHASE_APPEND=off PHASE_READ=find DEV=/dev/hdb3 DD_MBCOUNT=768 WRITE_BUFFER=131072 PHASE_DELETE=rm PHASE_MODIFY=off </td></tr> <tr><td colspan=22 align=right> <font size=-2>Produced by <a href=http://namesys.com/benchmarks/mongo_readme.html>Mongo</a> benchmark suite.</font></td></tr> </table> === mongo 2003-09-25 === [[mongo]] comparison against ext3, 2003-09-25 <dl> <dt>reiser4 </dt> <dd>''</dd> <dt>mem total</dt> <dd>255048</dd> <dt>machine </dt> <dd>belka</dd> <dt>kernel </dt> <dd>2.6.0-test5 #33 SMP Thu Sep 25 15:45:38 MSD 2003</dd> <dt>date </dt> <dd>Thu Sep 25 15:57:38 2003</dd> </dl> <p> In this test 80% of files are chosen from the 0-8k size range, 16% from the 0-80k size range, 0.8 x 4% from the 0-800k size range, etc. Most files are small, most bytes are in large files. </p> <p>Legend:</p> <ul> <li><tt>A</tt> reiser4</li> <li><tt>B</tt> reiser4, extents only</li> <li><tt>C</tt> reiserfs <tt>v3</tt></li> <li><tt>D</tt> ext3 in <tt>data=writeback</tt> mode (meta-data only journalling)</li> <li><tt>E</tt> ext3 in <tt>data=journal</tt> mode</li> <li><tt>F</tt> ext3 in <tt>data=ordered</tt> mode</li> <li><tt>G</tt> ext3 with htree (hashed directories)</li> </ul> <p> Table presents absolute values (of elapsed time, CPU usage, and disk usage) for reiser4, and ratios against reiser4 for all other configurations. 
<font color=red>Red</font> number means ratio is larger than <tt>1.0</tt>, that is, reiser4 is better in this test. <font color=green>Green</font> number means that reiser4 loses in this test. </p> <table cols=22 cellpadding=2 cellspacing=2 noborder> <tr><td bgcolor=black colspan=22><font color=white></td></tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>A.INFO_R4='' FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>B.INFO_R4='' MKFS=mkfs.reiser4 -q -o policy=extents FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>C.FSTYPE=reiserfs </font></th> </tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>D.MOUNT_OPTIONS=data=writeback FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>E.MOUNT_OPTIONS=data=journal FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>F.MOUNT_OPTIONS=data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>G.MKFS=mkfs.ext3 -O dir_index MOUNT_OPTIONS=data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <td colspan=22 bgcolor=#606060><b><font color=white>#0:</font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=7><b>REAL_TIME</b></td> <td colspan=7><b>CPU_TIME</b></td> <td colspan=7><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>CREATE</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 
23.57</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.158 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.714 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.263 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.234 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.020 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.376 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>6.66</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.075 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.947 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.240 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.357 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.264 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.835</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 608548</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.090 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.034 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.105 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.105 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.105 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>COPY</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 64.98</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.083 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.050 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.023 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.810 </font></tt></td> <td bgcolor=#E0E0C0 
align=right><tt><font color=red> 1.908 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 6.850 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>12.18</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.057 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.776 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.507 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.603 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.518 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.743</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1216784</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.090 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.033 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.105 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.105 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.105 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>READ</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 44.65</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.028 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.733 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.237 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.114 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.179 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 7.694 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>10.28</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.933 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.590</U> 
</font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.608 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.593</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.608 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.620 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1216784</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.090 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.033 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.105 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.105 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.105 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>STATS</b></td> <td bgcolor=#E0E0C0 align=right><tt>5.88</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.998 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.139 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.981 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.020 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.929</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.655 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>2.29</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.987 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.900 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.747</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.782 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.747</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 
0.755</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1216784</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.090 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.033 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.105 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.105 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.105 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>DELETE</b></td> <td bgcolor=#E0E0C0 align=right><tt>46.65</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.438</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.504 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.109 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.023 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.022 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.376 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>14.19</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.746 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.431 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.206</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.211 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.211 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.232 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>4</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.000 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> 
</font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> </tt></td> </tr> <tr> <td colspan=22 bgcolor=#606060><b><font color=white>#1:DD_MBCOUNT=768 </font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=7><b>REAL_TIME</b></td> <td colspan=7><b>CPU_TIME</b></td> <td colspan=7><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_writing_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 30.78</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.017</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.177 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.063 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.394 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.066 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.056 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>3.11</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.981 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.553</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.701 </font></tt></td> <td bgcolor=#E0E0C0 
align=right><tt><font color=red> 1.296 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.318 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 786436</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_reading_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 22.96</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.045 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.005</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.005</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.004</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.006</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>2.41</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.996 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.867 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.739 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.718</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.739 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.722</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 786436</U></tt></td> 
<td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tr> <tr><td bgcolor=black colspan=22><font color=white></font></td></tr> <tr> <td colspan=22 bgcolor=#303030><b><font color=white>NPROC=1 DIR=/mnt/testfs SYNC=off PHASE_COPY=cp REP_COUNTER=1 GAMMA=0.2 PHASE_OVERWRITE=off FILE_SIZE=8192 BYTES=512000000 PHASE_APPEND=off PHASE_READ=find DEV=/dev/hdb3 DD_MBCOUNT=768 WRITE_BUFFER=131072 PHASE_DELETE=rm PHASE_MODIFY=off </font></b></td></tr> <tr><td colspan=22 align=right> <font size=-2>Produced by <a href=http://namesys.com/benchmarks/mongo_readme.html>Mongo</a> benchmark suite.</font></td></tr> </table> === mongo 2003-08-28 === [[mongo]] comparison against reiserfs and ext3, 2003-08-28 <dl> <dt>reiser4 </dt> <dd>''</dd> <dt>mem total</dt> <dd>256276</dd> <dt>machine </dt> <dd>belka</dd> <dt>kernel </dt> <dd>2.6.0-test4 #194 SMP Thu Aug 28 17:18:47 MSD 2003</dd> <dt>date </dt> <dd>Thu Aug 28 17:20:18 2003</dd> </dl> <p> In this test 80% of files are chosen from the 0-8k size range, 16% from the 0-80k size range, 0.8 x 4% from the 0-800k size range, etc. Most files are small, most bytes are in large files.
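As a rough illustration (this is not the actual mongo source; the helper name and the uniform-within-tier assumption are mine), the size distribution described above — with GAMMA=0.2 and FILE_SIZE=8192 taken from the parameter line — can be sampled like this:

```python
import random

def sample_file_size(base=8192, gamma=0.2, rng=random):
    # Pick a size tier k with probability (1 - gamma) * gamma**k:
    # 80% for k=0, 16% for k=1, 0.8 x 4% for k=2, and so on.
    k = 0
    while rng.random() < gamma:
        k += 1
    # Assumption: within a tier the size is uniform over 1..base * 10**k.
    return rng.randint(1, base * 10 ** k)
```

With these parameters roughly 80% of the files stay at or below 8k, yet the rare high tiers account for most of the bytes written — which is exactly what "most files are small, most bytes are in large files" means.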
</p> <p>Legend:</p> <ul> <li><tt>A</tt> reiser4</li> <li><tt>B</tt> reiser4, extents only</li> <li><tt>C</tt> reiserfs (v3)</li> <li><tt>D</tt> ext3 in <tt>data=writeback</tt> mode (meta-data only journalling)</li> <li><tt>E</tt> ext3 in <tt>data=journal</tt> mode</li> <li><tt>F</tt> ext3 in <tt>data=ordered</tt> mode</li> <li><tt>G</tt> ext3 with htree (hashed directories)</li> </ul> <p> The table presents absolute values (of elapsed time, CPU usage, and disk usage) for reiser4, and ratios against reiser4 for all other configurations. A <font color=red>red</font> number means the ratio is larger than <tt>1.0</tt>, i.e. reiser4 wins this test; a <font color=green>green</font> number means reiser4 loses it. </p> <table cols=22 cellpadding=2 cellspacing=2 noborder> <tr><td bgcolor=black colspan=22><font color=white></td></tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>A.INFO_R4='' FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>B.INFO_R4='' MKFS=mkfs.reiser4 -q -o policy=extents FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>C.FSTYPE=reiserfs </font></th> </tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>D.MOUNT_OPTIONS=data=writeback FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>E.MOUNT_OPTIONS=data=journal FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>F.MOUNT_OPTIONS=data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>G.MKFS=mkfs.ext3 -O dir_index MOUNT_OPTIONS=data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <td colspan=22 bgcolor=#606060><b><font color=white>#0:</font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=7><b>REAL_TIME</b></td> <td colspan=7><b>CPU_TIME</b></td> <td colspan=7><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td>
<td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>CREATE</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 21.94</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.056 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.957 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.049 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.430 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.399 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.558 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>6.7</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.104 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.913 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.213 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.334 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.345 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.821</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 608452</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.091 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.034 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.105 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.105 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.105 </font></tt></td> <td bgcolor=#E0E0C0 
align=right><tt><font color=red> 1.106 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>COPY</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 64.05</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.078 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.112 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.964 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.703 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.022 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 7.356 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>11.37</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.039 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.819 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.538 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.692 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.568 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.708</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1216572</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.091 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.033 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>READ</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 52.53</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.072 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.882 </font></tt></td> <td 
bgcolor=#E0E0C0 align=right><tt><font color=red> 1.056 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.126 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.124 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 7.158 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>9.8</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.914 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.538 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.489 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.467 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.456</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.551 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1216572</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.091 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.033 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>STATS</b></td> <td bgcolor=#E0E0C0 align=right><tt>5.82</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.973</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.251 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.040 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.009 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.048 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.641 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 
align=right><tt>2.29</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.991 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.926 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.755 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.742</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.751 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.734</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1216572</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.091 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.033 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>DELETE</b></td> <td bgcolor=#E0E0C0 align=right><tt>46.96</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.409</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.491 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.949 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.988 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.987 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.382 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>13.89</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.734 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.453 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.210 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font 
color=green> <U> 0.204</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.202</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.238 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>4</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.000 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> </tt></td> </tr> <tr> <td colspan=22 bgcolor=#606060><b><font color=white>#1:DD_MBCOUNT=768 </font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=7><b>REAL_TIME</b></td> <td colspan=7><b>CPU_TIME</b></td> <td colspan=7><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_writing_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 26.1</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.006</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.205 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.066 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.353 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 
1.068 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.070 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>3.18</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.028 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.547</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.173 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.708 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.327 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.296 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 786436</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_reading_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 18.99</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.009</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.072 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.009</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.007</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.006</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.008</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>2.12</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.000 
</font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.925 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.877 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.844 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.830 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.811</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 786436</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr><td bgcolor=black colspan=22><font color=white></td></tr> <tr><td colspan=22 align=right> <tr> <td colspan=22 bgcolor=#303030><b><font color=white>NPROC=1 DIR=/mnt/testfs SYNC=off PHASE_COPY=cp REP_COUNTER=1 GAMMA=0.2 PHASE_OVERWRITE=off FILE_SIZE=8192 BYTES=512000000 PHASE_APPEND=off PHASE_READ=find DEV=/dev/hdb3 DD_MBCOUNT=768 WRITE_BUFFER=131072 PHASE_DELETE=rm PHASE_MODIFY=off </td></tr> <tr><td colspan=22 align=right> <font size=-2>Produced by <a href=http://namesys.com/benchmarks/mongo_readme.html>Mongo</a> benchmark suite.</font></td></tr> </table> === mongo 2003-08-27 === [[mongo]] comparison against ext3 <dl> <dt>reiser4 </dt> <dd>''</dd> <dt>mem total</dt> <dd>256276</dd> <dt>machine </dt> <dd>belka</dd> <dt>kernel </dt> <dd>2.6.0-test4 #189 SMP Wed Aug 27 20:36:51 MSD 2003</dd> <dt>date </dt> <dd>Wed Aug 27 20:44:02 2003</dd> </dl> <p> In this test 80% of files are chosen from the 0-8k size range, 16% from the 0-80k size range, 0.8 x 4% 
from the 0-800k size range, etc. Most files are small, most bytes are in large files. </p> <p>Legend:</p> <ul> <li><tt>A</tt> reiser4</li> <li><tt>B</tt> reiser4, extents only</li> <li><tt>C</tt> ext3 in <tt>data=writeback</tt> mode (meta-data only journalling)</li> <li><tt>D</tt> ext3 in <tt>data=journal</tt> mode</li> <li><tt>E</tt> ext3 in <tt>data=ordered</tt> mode</li> <li><tt>F</tt> ext3 with htree (hashed directories)</li> </ul> <p> The table presents absolute values (of elapsed time, CPU usage, and disk usage) for reiser4, and ratios against reiser4 for all other configurations. A <font color=red>red</font> number means the ratio is larger than <tt>1.0</tt>, i.e. reiser4 wins this test; a <font color=green>green</font> number means reiser4 loses it. </p> <table cols=19 cellpadding=2 cellspacing=2 noborder> <tr><td bgcolor=black colspan=19><font color=white></td></tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>A.INFO_R4='' FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>B.INFO_R4='' MKFS=mkfs.reiser4 -q -o policy=extents FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>C.MOUNT_OPTIONS=data=writeback FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>D.MOUNT_OPTIONS=data=journal FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>E.MOUNT_OPTIONS=data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>F.MKFS=mkfs.ext3 -O dir_index MOUNT_OPTIONS=data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <td colspan=19 bgcolor=#606060><b><font color=white>#0:</font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=6><b>REAL_TIME</b></td> <td colspan=6><b>CPU_TIME</b></td> <td colspan=6><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A
</b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>CREATE</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 22.41</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.673 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.325 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.975 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.213 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>7.66</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.069 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.347 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.415 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.410 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.708</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 635264</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.096 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.110 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.110 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.110 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.111 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>COPY</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 90.92</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.099 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.471 
</font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.221 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.470 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 4.989 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>12.14</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.068 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.066 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.241 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.094 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.668</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1269840</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.096 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.110 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.110 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.110 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.112 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>READ</b></td> <td bgcolor=#E0E0C0 align=right><tt>82.21</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.063 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.861 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.852 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.791</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 4.417 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>10.57</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.914 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.400</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.428 </font></tt></td> <td 
bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.402</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.534 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1269840</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.096 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.110 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.110 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.110 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.112 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>STATS</b></td> <td bgcolor=#E0E0C0 align=right><tt>8.52</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.993 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.822</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.816</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.811</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.335 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>2.96</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.997 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.561</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.564</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.584 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.608 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1269840</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.096 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.110 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.110 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.110 
</font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.112 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>DELETE</b></td> <td bgcolor=#E0E0C0 align=right><tt>69.69</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.301</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.749 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.717 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.659 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.912 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>14.73</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.703 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.208</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.207</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.213 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.237 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>4</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.000 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> </tt></td> </tr> <tr> <td colspan=19 bgcolor=#606060><b><font color=white>#1:DD_MBCOUNT=768 </font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=6><b>REAL_TIME</b></td> <td colspan=6><b>CPU_TIME</b></td> <td colspan=6><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> 
<td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_writing_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 25.85</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.092 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.335 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.085 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.095 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 3.27</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 0.982</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.159 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.648 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.251 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.254 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 786436</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_reading_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 19</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 0.999</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font 
color=black> <U> 1.005</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.007</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.007</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.007</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>2.18</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.963 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.807 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.803</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.789</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.803</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 786436</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr><td bgcolor=black colspan=19><font color=white></td></tr> <tr><td colspan=19 align=right> <tr> <td colspan=19 bgcolor=#303030><b><font color=white>NPROC=1 DIR=/mnt/testfs SYNC=off PHASE_COPY=cp REP_COUNTER=1 GAMMA=0.2 PHASE_OVERWRITE=off FILE_SIZE=8000 BYTES=512000000 PHASE_APPEND=off PHASE_READ=find DEV=/dev/hdb3 DD_MBCOUNT=768 WRITE_BUFFER=131072 PHASE_DELETE=rm PHASE_MODIFY=off </td></tr> <tr><td colspan=19 align=right> <font size=-2>Produced by <a href=http://namesys.com/benchmarks/mongo_readme.html>Mongo</a> benchmark suite.</font></td></tr> </table> <hr> <p> This is the same test as above, but with base file size 4k, that is, in this test 80% of files are chosen from 
the 0-4k size range, 16% from the 0-40k size range, 0.8 x 4% from the 0-400k size range, etc. </p> <hr> <dl> <dt>reiser4 </dt> <dd>''</dd> <dt>mem total</dt> <dd>255580</dd> <dt>machine </dt> <dd>belka</dd> <dt>kernel </dt> <dd>2.6.0-test4 #176 SMP Tue Aug 26 19:09:38 MSD 2003</dd> <dt>date </dt> <dd>Wed Aug 27 12:41:54 2003</dd> </dl> <table cols=19 cellpadding=2 cellspacing=2 noborder> <tr><td bgcolor=black colspan=19><font color=white></td></tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>A.INFO_R4='' FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>B.INFO_R4='' MKFS=mkfs.reiser4 -q -o policy=extents FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>C.MOUNT_OPTIONS=data=writeback FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>D.MOUNT_OPTIONS=data=journal FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>E.MOUNT_OPTIONS=data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>F.MKFS=mkfs.ext3 -O dir_index MOUNT_OPTIONS=data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <td colspan=19 bgcolor=#606060><b><font color=white>#0:</font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=6><b>REAL_TIME</b></td> <td colspan=6><b>CPU_TIME</b></td> <td colspan=6><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>CREATE</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 33.86</U></tt></td> <td 
bgcolor=#E0E0C0 align=right><tt><font color=red> 1.223 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.305 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.895 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.549 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.298 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>14.11</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.118 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.967 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.046 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.045 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.647</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 789424</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.208 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.181 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>COPY</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 119.68</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.228 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.237 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.397 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.277 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 7.061 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>23.05</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.484 
</font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.683 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.515 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.691</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1578216</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.208 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.182 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>READ</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 118.5</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.217 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.041 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.065 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.020</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 6.585 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>19.84</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.993 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.436</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.446 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.431</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.540 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1578216</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.208 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 
</font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.182 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>STATS</b></td> <td bgcolor=#E0E0C0 align=right><tt>24.69</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.951 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.677</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.696 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.677</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.151 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>7.75</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.008 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.590</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.582</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.583</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.645 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1578216</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.208 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.182 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>DELETE</b></td> <td bgcolor=#E0E0C0 align=right><tt>114.49</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.438 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.174</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.188 </font></tt></td> <td 
bgcolor=#E0E0C0 align=right><tt><font color=green> 0.177 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.257 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>32.64</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.790 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.193</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.199 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.194</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.223 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>4</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.000 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> </tt></td> </tr> <tr> <td colspan=19 bgcolor=#606060><b><font color=white>#1:DD_MBCOUNT=768 </font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=6><b>REAL_TIME</b></td> <td colspan=6><b>CPU_TIME</b></td> <td colspan=6><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_writing_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 26.24</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.002</U> </font></tt></td> <td 
bgcolor=#E0E0C0 align=right><tt><font color=red> 1.066 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.311 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.056 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.063 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 3.25</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 0.997</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.138 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.622 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.286 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.298 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 786436</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_reading_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 19.04</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 0.994</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.002</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.003</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.002</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>2.08</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.038 </font></tt></td> <td 
bgcolor=#E0E0C0 align=right><tt><font color=green> 0.870 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.870 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.870 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.837</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 786436</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr><td bgcolor=black colspan=19><font color=white></td></tr> <tr><td colspan=19 align=right> <tr> <td colspan=19 bgcolor=#303030><b><font color=white>NPROC=1 DIR=/mnt/testfs SYNC=off PHASE_COPY=cp REP_COUNTER=1 GAMMA=0.2 PHASE_OVERWRITE=off FILE_SIZE=4000 BYTES=512000000 PHASE_APPEND=off PHASE_READ=find DEV=/dev/hdb3 DD_MBCOUNT=768 WRITE_BUFFER=131072 PHASE_DELETE=rm PHASE_MODIFY=off </td></tr> <tr><td colspan=19 align=right> <font size=-2>Produced by <a href=http://namesys.com/benchmarks/mongo_readme.html>Mongo</a> benchmark suite.</font></td></tr> </table> === mongo 2003-08-26 === [[mongo]] comparison against ext3 <dl> <dt>reiser4 </dt> <dd>''</dd> <dt>mem total</dt> <dd>904048</dd> <dt>machine </dt> <dd>belka</dd> <dt>kernel </dt> <dd>2.6.0-test4 #176 SMP Tue Aug 26 19:09:38 MSD 2003</dd> <dt>date </dt> <dd>Tue Aug 26 19:34:39 2003</dd> </dl> <p> In this test 80% of files are chosen from the 0-4k size range, 16% from the 0-40k size range, 0.8 x 4% from the 0-400k size range, etc. Most files are small, most bytes are in large files. 
</p> <p>Legend:</p> <ul> <li><tt>A</tt> reiser4</li> <li><tt>B</tt> reiser4, extents only</li> <li><tt>C</tt> ext3 in <tt>data=writeback</tt> mode (meta-data-only journalling)</li> <li><tt>D</tt> ext3 in <tt>data=journal</tt> mode</li> <li><tt>E</tt> ext3 in <tt>data=ordered</tt> mode</li> <li><tt>F</tt> ext3 with htree (hashed directories)</li> </ul> <p> The table presents absolute values (of elapsed time, CPU usage, and disk usage) for reiser4, and ratios against reiser4 for all other configurations. A <font color=red>red</font> number means the ratio is larger than <tt>1.0</tt>, that is, reiser4 is better in this test. A <font color=green>green</font> number means that reiser4 loses in this test. </p> <table cols=19 cellpadding=2 cellspacing=2 noborder> <tr><td bgcolor=black colspan=19><font color=white></td></tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>A.INFO_R4='' FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>B.INFO_R4='' MKFS=mkfs.reiser4 -q -o policy=extents FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>C.MOUNT_OPTIONS=data=writeback FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>D.MOUNT_OPTIONS=data=journal FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>E.MOUNT_OPTIONS=data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>F.MKFS=mkfs.ext3 -O dir_index MOUNT_OPTIONS=data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <td colspan=19 bgcolor=#606060><b><font color=white>#0:</font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=6><b>REAL_TIME</b></td> <td colspan=6><b>CPU_TIME</b></td> <td colspan=6><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A
</b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>CREATE</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 27.6</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.311 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.567 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.538 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.668 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.566 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>13.55</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.166 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.035 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.162 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.189 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.670</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 788884</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.208 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.181 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.181 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.181 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.182 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>COPY</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 113.71</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.237 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.167 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.460 
</font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.227 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 7.387 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>23.13</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.169 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.498 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.691 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.591 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.709</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1577560</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.208 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.181 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.181 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.181 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.183 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>READ</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 111.51</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.239 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.157 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.176 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.096 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 7.017 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>20.76</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.042 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.424 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.415</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.416</U> </font></tt></td> <td 
bgcolor=#E0E0C0 align=right><tt><font color=green> 0.521 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1577560</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.208 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.181 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.181 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.181 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.183 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>STATS</b></td> <td bgcolor=#E0E0C0 align=right><tt>20.22</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.034 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.834</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.827</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.832</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.439 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>7.47</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.009 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.590</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.585</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.584</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.631 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1577560</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.208 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.181 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.181 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.181 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.183 </font></tt></td> 
</tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>DELETE</b></td> <td bgcolor=#E0E0C0 align=right><tt>110.98</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.437 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.183</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.180</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.185 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.277 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>33.03</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.838 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.196 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.192</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.193</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.221 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>4</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.000 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> </tt></td> </tr> <tr> <td colspan=19 bgcolor=#606060><b><font color=white>#1:DD_MBCOUNT=768 </font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=6><b>REAL_TIME</b></td> <td colspan=6><b>CPU_TIME</b></td> <td colspan=6><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A 
</b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_writing_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 26.03</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.096 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.340 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.092 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.080 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 3.48</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.011</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.083 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.583 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.187 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.190 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 786436</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_reading_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 19</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 0.995</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 
<U> 0.999</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 0.999</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>2.28</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.018 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.741 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.737</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.741 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.724</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 786436</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr><td bgcolor=black colspan=19><font color=white></td></tr> <tr><td colspan=19 align=right> <tr> <td colspan=19 bgcolor=#303030><b><font color=white>NPROC=1 DIR=/mnt/testfs SYNC=off PHASE_COPY=cp REP_COUNTER=1 GAMMA=0.2 PHASE_OVERWRITE=off FILE_SIZE=4000 BYTES=512000000 PHASE_APPEND=off PHASE_READ=find DEV=/dev/hdb3 DD_MBCOUNT=768 WRITE_BUFFER=131072 PHASE_DELETE=rm PHASE_MODIFY=off </td></tr> <tr><td colspan=19 align=right> <font size=-2>Produced by <a href=http://namesys.com/benchmarks/mongo_readme.html>Mongo</a> benchmark suite.</font></td></tr> </table> === mongo 2003-08-18 === [[mongo]] comparison against ext3 <dl> <dt>reiser4 </dt> <dd></dd> <dt>mem total</dt> <dd>255992</dd> <dt>machine </dt> <dd>belka</dd> <dt>kernel </dt> <dd>2.6.0-test3 #37 SMP Mon Aug 18 18:12:14 MSD
2003</dd> <dt>date </dt> <dd>Mon 18 Aug 2003 20:24:16</dd> </dl> <p> In this test 80% of files are chosen from the 0-8k size range, 16% from the 0-80k size range, 0.8 x 4% from the 0-800k size range, etc. Most files are small, most bytes are in large files. </p> <p>Legend:</p> <ul> <li><tt>A</tt> reiser4</li> <li><tt>B</tt> reiser4, extents only</li> <li><tt>C</tt> ext3 in <tt>data=writeback</tt> mode (meta-data-only journalling)</li> <li><tt>D</tt> ext3 in <tt>data=journal</tt> mode</li> <li><tt>E</tt> ext3 in <tt>data=ordered</tt> mode</li> <li><tt>F</tt> ext3 with htree (hashed directories)</li> </ul> <p> The table presents absolute values (of elapsed time, CPU usage, and disk usage) for reiser4, and ratios against reiser4 for all other configurations. A <font color=red>red</font> number means the ratio is larger than <tt>1.0</tt>, that is, reiser4 is better in this test. A <font color=green>green</font> number means that reiser4 loses in this test. </p> <table cols=19 cellpadding=2 cellspacing=2 noborder> <tr><td bgcolor=black colspan=19><font color=white></td></tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>A.INFO_R4= FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>B.INFO_R4=ext MKFS=mkfs.reiser4 -q -o policy=extents FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>C.MOUNT_OPTIONS=data=writeback FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>D.MOUNT_OPTIONS=data=journal FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>E.MOUNT_OPTIONS=data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>F.MKFS=mkfs.ext3 -O dir_index MOUNT_OPTIONS=data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <td colspan=19 bgcolor=#606060><b><font color=white>#0:</font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td
colspan=6><b>REAL_TIME</b></td> <td colspan=6><b>CPU_TIME</b></td> <td colspan=6><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>CREATE</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 29.16</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.220 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.422 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.779 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.491 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.645 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>13.52</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.182 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.013 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.087 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.997 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.657</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 789364</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.208 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.181 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>COPY</b></td> <td bgcolor=#E0E0C0 
align=right><tt><U> 119.64</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.211 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.191 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.473 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.230 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 7.288 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>21.98</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.152 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.515 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.746 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.520 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.695</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1578116</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.208 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.182 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>READ</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 116.55</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.213 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.177 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.025 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.134 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 6.850 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>18.35</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.035 </font></tt></td> <td 
bgcolor=#E0E0C0 align=right><tt><font color=green> 0.447 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.436</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.431</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.569 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1578116</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.208 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.182 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>STATS</b></td> <td bgcolor=#E0E0C0 align=right><tt>21.65</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.050 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.779</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.811 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.782</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.358 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>7.56</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.001 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.599</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.612 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.611</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.638 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1578116</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.208 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 
</font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.182 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>DELETE</b></td> <td bgcolor=#E0E0C0 align=right><tt>112.37</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.434 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.179</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.198 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.177</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.281 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>30.62</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.851 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.205</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.205</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.203</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.230 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>4</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.000 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> </tt></td> </tr> <tr> <td colspan=19 bgcolor=#606060><b><font color=white>#1:DD_MBCOUNT=768 </font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=6><b>REAL_TIME</b></td> <td colspan=6><b>CPU_TIME</b></td> <td colspan=6><b>DF</b></td> 
</tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_writing_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 26.11</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.011</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.090 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.388 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.076 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.083 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>3.25</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.945</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.092 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.640 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.255 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.231 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 786436</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_reading_largefile</b></td> <td bgcolor=#E0E0C0 
align=right><tt><U> 19.09</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.005</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 0.999</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 0.996</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.004</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.011</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>2.09</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.019 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.847</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.856 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.833</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.842</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 786436</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr><td bgcolor=black colspan=19><font color=white></td></tr> <tr><td colspan=19 align=right> <tr> <td colspan=19 bgcolor=#303030><b><font color=white>NPROC=1 DIR=/mnt/testfs SYNC=off PHASE_COPY=cp REP_COUNTER=1 GAMMA=0.2 PHASE_OVERWRITE=off FILE_SIZE=4000 BYTES=512000000 PHASE_APPEND=off PHASE_READ=find DEV=/dev/hdb3 DD_MBCOUNT=768 WRITE_BUFFER=131072 PHASE_DELETE=rm PHASE_MODIFY=off </td></tr> <tr><td colspan=19 align=right> <font size=-2>Produced by <a 
href=http://namesys.com/benchmarks/mongo_readme.html>Mongo</a> benchmark suite.</font></td></tr> </table> === mongo, 2003-08-12 === [[mongo]] comparison against ext3 <dl> <dt>mem total</dt> <dd>513284</dd> <dt>machine </dt> <dd>strelka</dd> <dt>kernel </dt> <dd>2.6.0-test2 #52 SMP Tue Aug 12 15:17:12 MSD 2003</dd> <dt>date </dt> <dd>Tue Aug 12 15:38:47 2003</dd> </dl> <p> This is a comparison of the latest (2003-08-12) version of reiser4 with ext3. Reiser4 is an atomic filesystem, so the fairest comparison is against ext3's data journalling mode (<tt>data=journal</tt>); but since most users run ext3 in <tt>data=ordered</tt> mode, we compare against that as well. </p> <p> In this test 80% of files are chosen from the 0-8k size range, 16% from the 0-80k size range, 3.2% from the 0-800k size range, and so on. Most files are small, but most bytes are in large files. </p> <p>Legend:</p> <ul> <li><tt>A</tt> reiser4</li> <li><tt>B</tt> ext3 in <tt>data=writeback</tt> mode (metadata-only journalling)</li> <li><tt>C</tt> ext3 in <tt>data=journal</tt> mode</li> <li><tt>D</tt> ext3 in <tt>data=ordered</tt> mode</li> <li><tt>E</tt> ext3 with htree (hashed directories)</li> <li><tt>F</tt> ext3 with support for filetypes in <tt>readdir()</tt></li> </ul> <p> The table presents absolute values (elapsed time, CPU usage, and disk usage) for reiser4, and ratios against reiser4 for all other configurations. A <font color=red>red</font> number means the ratio is larger than <tt>1.0</tt>, i.e. reiser4 is better in that test. A <font color=green>green</font> number means that reiser4 loses in that test.
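The heavy-tailed size distribution described above (80% of files from 0-8k, 16% from 0-80k, and so on; this is what the benchmark's GAMMA=0.2 parameter controls) can be sketched in a few lines of Python. This is a hypothetical illustration of the distribution, not the actual Mongo generator code; the function name and parameters are invented for the sketch.

```python
import random

def sample_file_size(base=8 * 1024, gamma=0.2, max_decades=4):
    """Pick a file size from the heavy-tailed distribution described
    above: with probability 1 - gamma the size is uniform in [0, base);
    otherwise the upper bound grows by 10x and we try again, so about
    80% of files are 0-8k, 16% are 0-80k, 3.2% are 0-800k, etc."""
    limit = base
    for _ in range(max_decades):
        if random.random() >= gamma:
            break
        limit *= 10  # promote this file to the next size decade
    return random.randrange(limit)

# "Most files are small, but most bytes are in large files":
random.seed(0)
sizes = [sample_file_size() for _ in range(100_000)]
small = [s for s in sizes if s < 8 * 1024]
print(len(small) / len(sizes))  # close to 0.8
print(sum(small) / sum(sizes))  # small files hold only a small share of bytes
```

Running the sketch confirms the point made in the text: roughly 80% of the sampled files fall under 8k, yet they account for only a few percent of the total bytes written.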
</p> <table cols=19 cellpadding=2 cellspacing=2 noborder> <tr><td bgcolor=black colspan=19><font color=white></td></tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>A.INFO_R4= MKFS=/usr/local/sbin/mkfs.reiser4 -qf FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>B.MOUNT_OPTIONS=data=writeback FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>C.MOUNT_OPTIONS=data=journal FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>D.MOUNT_OPTIONS=data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>E.MKFS=/usr/local/sbin/mkfs.ext3 -O dir_index MOUNT_OPTIONS=data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>F.MKFS=/usr/local/sbin/mkfs.ext3 -O filetype MOUNT_OPTIONS=data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <td colspan=19 bgcolor=#606060><b><font color=white>#0:</font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=6><b>REAL_TIME</b></td> <td colspan=6><b>CPU_TIME</b></td> <td colspan=6><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>CREATE</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 14.06</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.317 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.248 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.050 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font 
color=red> 3.016 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.077 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>5.3</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.558 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.692 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.602 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.823</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.592 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 458224</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>COPY</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 43.62</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.982 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.733 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.033 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 6.685 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.904 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>9.19</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.163 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.286 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.230 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.706</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.200 </font></tt></td> </tt></td> 
<td bgcolor=#E0E0C0 align=right><tt><U> 916172</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>READ</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 39.86</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.091 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.091 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.140 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 6.003 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.119 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>8.22</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.467 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.454 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.464 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.529 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.443</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 916172</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>STATS</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1.54</U></tt></td> <td 
bgcolor=#E0E0C0 align=right><tt><font color=red> 1.987 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.896 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.942 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.649 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.883 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 0.26</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.115 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.115 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.115 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.385 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.962 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 916172</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>DELETE</b></td> <td bgcolor=#E0E0C0 align=right><tt>37.85</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.833 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.825 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.867 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.133 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.760</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>11.11</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.223</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font 
color=green> <U> 0.223</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.220</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.254 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.222</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>4</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> </tt></td> </tr> <tr> <td colspan=19 bgcolor=#606060><b><font color=white>#1:DD_MBCOUNT=500 </font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=6><b>REAL_TIME</b></td> <td colspan=6><b>CPU_TIME</b></td> <td colspan=6><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_writing_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 42.15</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.062 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.534 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.066 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.071 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.073 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 
align=right><tt><U> 7.86</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.094 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.500 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.206 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.211 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.198 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 512004</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_reading_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 36.5</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.005</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.008</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.005</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.007</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.007</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>4.7</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.745</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.732</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.743</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.736</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.734</U> </font></tt></td> 
</tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 512004</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr><td bgcolor=black colspan=19><font color=white></td></tr> <tr><td colspan=19 align=right> <tr> <td colspan=19 bgcolor=#303030><b><font color=white>NPROC=1 DIR=/data1 SYNC=off PHASE_COPY=cp REP_COUNTER=3 GAMMA=0.2 PHASE_OVERWRITE=off PHASE_STATS=find FILE_SIZE=8192 BYTES=134217728 PHASE_APPEND=off PHASE_READ=find DEV=/dev/hdb1 DD_MBCOUNT=500 WRITE_BUFFER=131072 PHASE_DELETE=rm PHASE_MODIFY=off </td></tr> <tr><td colspan=19 align=right> <font size=-2>Produced by <a href=http://namesys.com/benchmarks/mongo_readme.html>Mongo</a> benchmark suite.</font></td></tr> </table> === mongo 2003-07-10 === [[mongo]] comparison, reiserfs vs. reiser4, 2003-07-10, obtained before [http://mail.fsfeurope.org/pipermail/booth/2003-February/000083.html LinuxTAG 2003] <table cols=10 cellpadding=2 cellspacing=2 noborder> <tr><td bgcolor=black colspan=10><font color=white></td></tr> <tr> <th bgcolor=#303030 colspan=10 align=left><font color=white>A. reiser4</th> </tr> <tr> <th bgcolor=#303030 colspan=10 align=left><font color=white>B. ext3 data journalling</th> </tr> <tr> <th bgcolor=#303030 colspan=10 align=left><font color=white>C. 
ext3 </font></th> </tr> <tr> <td colspan=10 bgcolor=#606060><b><font color=white>#0:</font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=3><b>REAL_TIME</b></td> <td colspan=3><b>CPU_TIME</b></td> <td colspan=3><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>CREATE</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 14.19</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.221 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.592 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 5.66</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.610 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.475 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 458692</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>COPY</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 49.01</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.586 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.783 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 9.08</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.308 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.176 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 916668</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>READ</b></td> <td bgcolor=#E0E0C0 
align=right><tt>43.39</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.970</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.017 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>8.1</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.452</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.453</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 916668</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>STATS</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1.93</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.534 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.549 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 0.27</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.000 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.963 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 916668</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>DELETE</b></td> <td bgcolor=#E0E0C0 align=right><tt>40.13</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.797</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.837 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>11.26</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.217 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.210</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>4</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font 
color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> </tt></td> </tr> <tr> <td colspan=10 bgcolor=#606060><b><font color=white>#1:DD_MBCOUNT=500 </font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=3><b>REAL_TIME</b></td> <td colspan=3><b>CPU_TIME</b></td> <td colspan=3><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_writing_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 42.27</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.527 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.057 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 7.78</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.497 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.189 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 512004</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_reading_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 36.57</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.005</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.005</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>4.8</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.760</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.777 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 512004</U></tt></td> <td bgcolor=#E0E0C0 
align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr><td bgcolor=black colspan=10><font color=white></td></tr> <tr><td colspan=10 align=right> <tr> <td colspan=10 bgcolor=#303030><b><font color=white>NPROC=1 DIR=/data1 SYNC=off PHASE_COPY=cp REP_COUNTER=3 GAMMA=0.2 PHASE_OVERWRITE=off PHASE_STATS=find FILE_SIZE=8192 BYTES=134217728 PHASE_APPEND=off PHASE_READ=find DEV=/dev/hdb1 DD_MBCOUNT=500 WRITE_BUFFER=131072 PHASE_DELETE=rm PHASE_MODIFY=off </td></tr> <tr><td colspan=10 align=right> <font size=-2>Produced by <a href=http://namesys.com/benchmarks/mongo_readme.html>Mongo</a> benchmark suite.</font></td></tr> </table> <hr> <a name="mongo.2003.07.10"> <p> Below are some older benchmarks from just before Linux Tag. In these, note that gamma is the fraction of files that are larger than the base size by 10x. It is set either to 0.2 (as in the benchmark above), in an attempt to mimic observed real-world usage patterns, or to 0, in an attempt to measure a file size range's performance qualities in isolation. Note that V3 performs poorly in the 0-8k size range, while V4 performs well; this is the result of deep design changes you can read about at <a href="http://www.namesys.com/v4/v4.html">http://www.namesys.com/v4/v4.html</a>. </p> <dl><dt>mem total</dt><dd>513748</dd><dt>machine </dt><dd>strelka</dd><dt>kernel </dt><dd>2.5.74 #213 SMP Thu Jul 10 22:53:23 MSD 2003</dd><dt>date </dt><dd>Thu Jul 10 22:48:56 2003</dd><dt>.config </dt><dd><a href="http://www.namesys.com/intbenchmarks/mongo/03.07.11.nikita/.config">here</a></dd><dt>NPROC</dt><dd>1</dd><dt>DIR</dt><dd>/data1</dd><dt>SYNC</dt><dd>off</dd><dt>REP_COUNTER</dt><dd>3</dd><dt>All phases are in readdir order</dt><dd></dd><dt>BYTES</dt><dd>100M</dd><dt>DEV</dt><dd>/dev/hdb1</dd><dt>WRITE_BUFFER</dt><dd><b>256k</b></dd></dl> <p>Everywhere <b>A</b> is reiserfs and <b>B</b> is reiser4.
Green numbers mean reiser4 is better.</p> <table cols="7" cellpadding="2" cellspacing="2" noborder=""> <tbody><tr><td bgcolor="black" colspan="7"><font color="white"></font></td></tr> <tr> <th bgcolor="#303030" colspan="7" align="left"><font color="white">median file size 8k</font></th> </tr> <tr align="center" bgcolor="#c0c0c0"> <td></td> <td colspan="2"><b>REAL_TIME</b></td> <td colspan="2"><b>CPU_TIME</b></td> <td colspan="2"><b>DF</b></td> </tr> <tr align="center" bgcolor="#c0c0c0"> <td></td> <td><b>A</b></td><td><b>B/A </b></td> <td><b>A</b></td><td><b>B/A </b></td> <td><b>A</b></td><td><b>B/A </b></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>CREATE</b></td> <td bgcolor="#e0e0c0" align="right"><tt>41.26</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.246</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>3.93</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.908</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>321632</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.961</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>COPY</b></td> <td bgcolor="#e0e0c0" align="right"><tt>154.09</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.504</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 5.17</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.217 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>642624</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.962</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>APPEND</b></td> <td bgcolor="#e0e0c0" align="right"><tt>282.09</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.573</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 6.6</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.392 </font></tt></td> <td bgcolor="#e0e0c0" 
align="right"><tt>944428</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 0.980</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>MODIFY</b></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 284.52</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 0.986</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 3.29</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.489 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 943592</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 0.981</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>OVERWRITE</b></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 298.19</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.263 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 5.33</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.608 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>943548</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.968</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>READ</b></td> <td bgcolor="#e0e0c0" align="right"><tt>245.22</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.940</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 3.85</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.753 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>943548</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.968</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>STATS</b></td> <td bgcolor="#e0e0c0" align="right"><tt>20.58</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.099</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 0.48</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.292 </font></tt></td> <td 
bgcolor="#e0e0c0" align="right"><tt>943548</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.968</u> </font></tt></td> </tr> <tr> <td colspan="7" bgcolor="#a0a0a0"><b><font color="white">GAMMA=0.2 FILE_SIZE=8192 <a href="http://www.namesys.com/intbenchmarks/mongo/03.07.11.nikita/8k.heavy.v3.profile">A profile</a> <a href="http://www.namesys.com/intbenchmarks/mongo/03.07.11.nikita/8k.heavy.v4.profile">B profile</a></font></b></td></tr> <tr><td bgcolor="white" colspan="7"><font color="white"></font></td></tr> <tr><td bgcolor="white" colspan="7"><font color="white"></font></td></tr> <tr><td bgcolor="white" colspan="7"><font color="white"></font></td></tr> <tr><td bgcolor="black" colspan="7"><font color="white"></font></td></tr> <tr> <th bgcolor="#303030" colspan="7" align="left"><font color="white">median file size 4k</font></th> </tr> <tr align="center" bgcolor="#c0c0c0"> <td></td> <td colspan="2"><b>REAL_TIME</b></td> <td colspan="2"><b>CPU_TIME</b></td> <td colspan="2"><b>DF</b></td> </tr> <tr align="center" bgcolor="#c0c0c0"> <td></td> <td><b>A</b></td><td><b>B/A </b></td> <td><b>A</b></td><td><b>B/A </b></td> <td><b>A</b></td><td><b>B/A </b></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>CREATE</b></td> <td bgcolor="#e0e0c0" align="right"><tt>117.32</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.176</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>15.57</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.758</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 667652</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.000</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>COPY</b></td> <td bgcolor="#e0e0c0" align="right"><tt>524.67</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.365</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 19.16</u></tt></td> <td bgcolor="#e0e0c0" 
align="right"><tt><font color="red"> 1.059 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 1332856</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.002</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>APPEND</b></td> <td bgcolor="#e0e0c0" align="right"><tt>1068.43</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.363</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>31.27</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.937</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>2073420</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.950</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>MODIFY</b></td> <td bgcolor="#e0e0c0" align="right"><tt>1081.23</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.670</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 18.61</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.048 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>2066536</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.953</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>OVERWRITE</b></td> <td bgcolor="#e0e0c0" align="right"><tt>1050.55</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.885</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 22.81</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.017</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>2066424</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.948</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>READ</b></td> <td bgcolor="#e0e0c0" align="right"><tt>974.43</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.644</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 
12.28</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.635 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>2066424</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.948</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>STATS</b></td> <td bgcolor="#e0e0c0" align="right"><tt>83.44</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.075</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>1.26</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.802</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>2066424</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.948</u> </font></tt></td> </tr> <tr> <td colspan="7" bgcolor="#a0a0a0"><b><font color="white">GAMMA=0.2 FILE_SIZE=4096 <a href="http://www.namesys.com/intbenchmarks/mongo/03.07.11.nikita/4k.heavy.v3.profile">A profile</a> <a href="http://www.namesys.com/intbenchmarks/mongo/03.07.11.nikita/4k.heavy.v4.profile">B profile</a></font></b></td></tr> <tr><td bgcolor="white" colspan="7"><font color="white"></font></td></tr> <tr><td bgcolor="white" colspan="7"><font color="white"></font></td></tr> <tr><td bgcolor="white" colspan="7"><font color="white"></font></td></tr> <tr><td bgcolor="black" colspan="7"><font color="white"></font></td></tr> <tr> <th bgcolor="#303030" colspan="7" align="left"><font color="white">maximal file size 4k</font></th> </tr> <tr align="center" bgcolor="#c0c0c0"> <td></td> <td colspan="2"><b>REAL_TIME</b></td> <td colspan="2"><b>CPU_TIME</b></td> <td colspan="2"><b>DF</b></td> </tr> <tr align="center" bgcolor="#c0c0c0"> <td></td> <td><b>A</b></td><td><b>B/A </b></td> <td><b>A</b></td><td><b>B/A </b></td> <td><b>A</b></td><td><b>B/A </b></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>CREATE</b></td> <td bgcolor="#e0e0c0" align="right"><tt>77.34</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.309</u> 
</font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>21.86</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.938</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>452252</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.923</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>COPY</b></td> <td bgcolor="#e0e0c0" align="right"><tt>412.28</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.300</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 35.11</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.013</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>893408</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.934</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>APPEND</b></td> <td bgcolor="#e0e0c0" align="right"><tt>1198.9</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.164</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>67.06</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.694</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>1631992</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.749</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>MODIFY</b></td> <td bgcolor="#e0e0c0" align="right"><tt>1305.14</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.351</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>43.77</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.762</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>1613124</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.758</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>OVERWRITE</b></td> <td bgcolor="#e0e0c0" align="right"><tt>1390.94</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font 
color="green"> <u> 0.239</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>44.22</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.777</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>1610948</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.759</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>READ</b></td> <td bgcolor="#e0e0c0" align="right"><tt>1093.6</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.256</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 19.46</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.743 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>1610948</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.759</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>STATS</b></td> <td bgcolor="#e0e0c0" align="right"><tt>115.76</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.200</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>2.6</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.735</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>1610948</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.759</u> </font></tt></td> </tr> <tr> <td colspan="7" bgcolor="#a0a0a0"><b><font color="white">GAMMA=0.0 FILE_SIZE=4096 <a href="http://www.namesys.com/intbenchmarks/mongo/03.07.11.nikita/100.heavy.v3.profile">A profile</a> <a href="http://www.namesys.com/intbenchmarks/mongo/03.07.11.nikita/100.heavy.v4.profile">B profile</a></font></b></td></tr> <tr><td bgcolor="white" colspan="7"><font color="white"></font></td></tr> <tr><td bgcolor="white" colspan="7"><font color="white"></font></td></tr> <tr><td bgcolor="white" colspan="7"><font color="white"></font></td></tr> <tr> <th bgcolor="#303030" colspan="7" align="left"><font color="white">median file size 8k</font></th> </tr> 
<tr align="center" bgcolor="#c0c0c0"> <td></td> <td colspan="2"><b>REAL_TIME</b></td> <td colspan="2"><b>CPU_TIME</b></td> <td colspan="2"><b>DF</b></td> </tr> <tr align="center" bgcolor="#c0c0c0"> <td></td> <td><b>A</b></td><td><b>B/A </b></td> <td><b>A</b></td><td><b>B/A </b></td> <td><b>A</b></td><td><b>B/A </b></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>CREATE</b></td> <td bgcolor="#e0e0c0" align="right"><tt>40.54</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.248</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>4.01</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.895</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>321632</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.961</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>COPY</b></td> <td bgcolor="#e0e0c0" align="right"><tt>152.82</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.506</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 5.2</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.215 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>642624</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.962</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>READ</b></td> <td bgcolor="#e0e0c0" align="right"><tt>141.8</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.563</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 3.03</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.762 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>642624</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.962</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>STATS</b></td> <td bgcolor="#e0e0c0" align="right"><tt>14.91</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.084</u> 
</font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 0.59</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.051 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>642624</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.962</u> </font></tt></td> </tr> <tr><td bgcolor="black" colspan="7"><font color="white"></font></td></tr> <tr><td colspan="7" align="right"> </td></tr><tr> <td colspan="7" bgcolor="#303030"><b><font color="white">GAMMA=0.2 FILE_SIZE=8192</font></b></td></tr> <tr><td bgcolor="white" colspan="7"><font color="white"></font></td></tr> <tr><td bgcolor="white" colspan="7"><font color="white"></font></td></tr> <tr><td bgcolor="white" colspan="7"><font color="white"></font></td></tr> <tr> <th bgcolor="#303030" colspan="7" align="left"><font color="white">median file size 4k</font></th> </tr> <tr align="center" bgcolor="#c0c0c0"> <td></td> <td colspan="2"><b>REAL_TIME</b></td> <td colspan="2"><b>CPU_TIME</b></td> <td colspan="2"><b>DF</b></td> </tr> <tr align="center" bgcolor="#c0c0c0"> <td></td> <td><b>A</b></td><td><b>B/A </b></td> <td><b>A</b></td><td><b>B/A </b></td> <td><b>A</b></td><td><b>B/A </b></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>CREATE</b></td> <td bgcolor="#e0e0c0" align="right"><tt>115.6</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.174</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>14.84</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.772</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 667652</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.000</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>COPY</b></td> <td bgcolor="#e0e0c0" align="right"><tt>528.83</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.361</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 18.91</u></tt></td> <td bgcolor="#e0e0c0" 
align="right"><tt><font color="red"> 1.058 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 1332856</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.002</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>READ</b></td> <td bgcolor="#e0e0c0" align="right"><tt>532.06</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.372</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 10.87</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.589 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 1332856</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.002</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>STATS</b></td> <td bgcolor="#e0e0c0" align="right"><tt>51.99</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.069</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>1.67</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.581</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 1332856</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.002</u> </font></tt></td> </tr> <tr><td bgcolor="black" colspan="7"><font color="white"></font></td></tr> <tr><td colspan="7" align="right"> </td></tr><tr> <td colspan="7" bgcolor="#303030"><b><font color="white">GAMMA=0.2 FILE_SIZE=4096</font></b></td></tr> <tr><td bgcolor="white" colspan="7"><font color="white"></font></td></tr> <tr><td bgcolor="white" colspan="7"><font color="white"></font></td></tr> <tr><td bgcolor="white" colspan="7"><font color="white"></font></td></tr> <tr> <th bgcolor="#303030" colspan="7" align="left"><font color="white">maximal file size 4k</font></th> </tr> <tr align="center" bgcolor="#c0c0c0"> <td></td> <td colspan="2"><b>REAL_TIME</b></td> <td colspan="2"><b>CPU_TIME</b></td> <td colspan="2"><b>DF</b></td> </tr> <tr align="center" bgcolor="#c0c0c0"> 
<td></td> <td><b>A</b></td><td><b>B/A </b></td> <td><b>A</b></td><td><b>B/A </b></td> <td><b>A</b></td><td><b>B/A </b></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>CREATE</b></td> <td bgcolor="#e0e0c0" align="right"><tt>77.5</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.309</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>22.24</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.910</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>452252</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.923</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>COPY</b></td> <td bgcolor="#e0e0c0" align="right"><tt>415.84</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.297</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 34.9</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.009</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>893408</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.934</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>READ</b></td> <td bgcolor="#e0e0c0" align="right"><tt>469.97</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.273</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 20.14</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.454 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>893408</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.934</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>STATS</b></td> <td bgcolor="#e0e0c0" align="right"><tt>65.49</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.162</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>3.09</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.599</u> </font></tt></td> <td bgcolor="#e0e0c0" 
align="right"><tt>893408</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.934</u> </font></tt></td> </tr> <tr><td bgcolor="black" colspan="7"><font color="white"></font></td></tr> <tr><td colspan="7" align="right"> </td></tr><tr> <td colspan="7" bgcolor="#303030"><b><font color="white">GAMMA=0.0 FILE_SIZE=4096</font></b></td></tr> </tbody></table> <hr> <h1>Mongo benchmark results</h1> <h2>create, copy, read, stats, delete phases</h2> <dl><dt>reiser4 </dt><dd>ChangeSet@1.1095, 2003-07-10 15:22:17+04:00, god@laputa.namesys.com oops ChangeSet@1.1094, 2003-07-10 15:14:06+04:00, god@laputa.namesys.com repairing compilation damage. </dd><dt>mem total</dt><dd>256624</dd><dt>machine </dt><dd>belka</dd><dt>kernel </dt><dd>2.5.74 #28 Thu Jul 10 18:36:03 MSD 2003</dd><dt>date </dt><dd>Thu Jul 10 19:21:06 2003</dd><dt><a href="http://namesys.com/intbenchmarks/mongo/03.07.11.light/dot.config">.config</a></dt></dl> <table cols="19" cellpadding="2" cellspacing="2" noborder=""> <tbody><tr><td bgcolor="black" colspan="19"><font color="white"></font></td></tr> <tr> <th bgcolor="#303030" colspan="19" align="left"><font color="white">A.INFO_R4=test FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor="#303030" colspan="19" align="left"><font color="white">B.INFO_R4=test FSTYPE=reiser4 MKFS=mkfs.reiser4 -q -e extent40 </font></th> </tr> <tr> <th bgcolor="#303030" colspan="19" align="left"><font color="white">C.FSTYPE=reiserfs </font></th> </tr> <tr> <th bgcolor="#303030" colspan="19" align="left"><font color="white">D.FSTYPE=reiserfs MOUNT_OPTIONS=notail </font></th> </tr> <tr> <th bgcolor="#303030" colspan="19" align="left"><font color="white">E.FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor="#303030" colspan="19" align="left"><font color="white">F.FSTYPE=ext3 MOUNT_OPTIONS=data=journal </font></th> </tr> <tr> <td colspan="19" bgcolor="#606060"><b><font color="white">#0:FILE_SIZE=4000 </font></b></td></tr> <tr align="center" bgcolor="#c0c0c0"> <td></td> <td 
colspan="6"><b>REAL_TIME</b></td> <td colspan="6"><b>CPU_TIME</b></td> <td colspan="6"><b>DF</b></td> </tr> <tr align="center" bgcolor="#c0c0c0"> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>CREATE</b></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 20.47</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.404 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 3.037 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 2.024 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 2.513 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 3.324 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>12.72</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.143 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.270 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.873 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.615</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.606</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 416332</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.934 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.088 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.909 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.858 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.858 
</font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>COPY</b></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 65.25</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.484 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 2.953 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 2.020 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.986 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 2.267 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>21.98</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.032 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.098 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.732 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.529</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.699 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 832640</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.934 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.088 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.910 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.858 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.858 </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>READ</b></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 75.56</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.349 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 2.868 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 2.218 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.902 </font></tt></td> <td bgcolor="#e0e0c0" 
align="right"><tt><font color="red"> 1.925 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>17.36</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.213 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.745 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.857 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.695 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.681</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 832640</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.934 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.088 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.910 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.858 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.858 </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>STATS</b></td> <td bgcolor="#e0e0c0" align="right"><tt>132.18</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> 0.996 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.963</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> 0.994 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.967</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.950</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>2.63</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.977</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.970</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 0.989</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 0.981</u> 
</font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> 1.008 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 832640</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.934 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.088 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.910 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.858 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.858 </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>DELETE</b></td> <td bgcolor="#e0e0c0" align="right"><tt>85.32</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.627 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.239 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.442 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.403</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.449 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>33.57</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.856 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.780 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.623 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.157</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.154</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>4</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> 1.000 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.000</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.000</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font 
color="green"> <u> 0.000</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.000</u> </font></tt></td> </tr> <tr> <td colspan="19" bgcolor="#606060"><b><font color="white">#1:FILE_SIZE=8000 </font></b></td></tr> <tr align="center" bgcolor="#c0c0c0"> <td></td> <td colspan="6"><b>REAL_TIME</b></td> <td colspan="6"><b>CPU_TIME</b></td> <td colspan="6"><b>DF</b></td> </tr> <tr align="center" bgcolor="#c0c0c0"> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>CREATE</b></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 15.07</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.009</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 8.875 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.709 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 2.237 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 3.321 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>8.62</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.945 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.932 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.729 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.517</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.522</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 399788</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.000</u> 
</font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.243 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.461 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.434 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.434 </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>COPY</b></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 52.24</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.007</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 4.998 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.492 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.562 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.879 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>13.42</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.026 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.264 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.700 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.487</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.635 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 799488</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.000</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.243 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.461 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.434 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.434 </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>READ</b></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 60.91</u></tt></td> <td bgcolor="#e0e0c0" 
align="right"><tt><font color="black"> <u> 1.013</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 3.738 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.606 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.333 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.340 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>11.66</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> 1.018 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.526</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.749 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.547 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.547 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 799488</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.000</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.243 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.461 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.434 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.434 </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>STATS</b></td> <td bgcolor="#e0e0c0" align="right"><tt>126.53</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.951</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.958</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> 0.991 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> 1.004 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.966</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 
2.57</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.023 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.027 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 0.988</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> 1.016 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> 1.012 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 799488</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.000</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.243 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.461 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.434 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.434 </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>DELETE</b></td> <td bgcolor="#e0e0c0" align="right"><tt>73.21</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.116 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.746 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.242</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.301 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.396 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>19.93</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> 1.013 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.584 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.530 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.126 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.123</u> </font></tt></td> <td bgcolor="#e0e0c0" 
align="right"><tt>4</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> 1.000 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.000</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.000</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.000</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.000</u> </font></tt></td> </tr> <tr><td bgcolor="black" colspan="19"><font color="white"></font></td></tr> <tr><td colspan="19" align="right"> </td></tr><tr> <td colspan="19" bgcolor="#303030"><b><font color="white">PHASE_APPEND=off NPROC=1 DIR=/mnt/testfs SYNC=off REP_COUNTER=3 GAMMA=0.0 PHASE_OVERWRITE=off DEV=/dev/hdb3 WRITE_BUFFER=4096 BYTES=128000000 PHASE_MODIFY=off </font></b></td></tr> <tr><td colspan="19" align="right"> <font size="-2">Produced by <a href="http://namesys.com/benchmarks/mongo_readme.html">Mongo</a> benchmark suite.</font></td></tr> </tbody></table> <h2>dd of a large file phase</h2> <dl><dt>reiser4 </dt><dd>ChangeSet@1.1095, 2003-07-10 15:22:17+04:00, god@laputa.namesys.com oops ChangeSet@1.1094, 2003-07-10 15:14:06+04:00, god@laputa.namesys.com repairing compilation damage. 
</dd><dt>mem total</dt><dd>256624</dd><dt>machine </dt><dd>belka</dd><dt>kernel </dt><dd>2.5.74 #28 Thu Jul 10 18:36:03 MSD 2003</dd><dt>date </dt><dd>Thu Jul 10 21:36:22 2003</dd><dt><a href="http://namesys.com/intbenchmarks/mongo/03.07.11.light/dot.config">.config</a></dt></dl> <table cols="19" cellpadding="2" cellspacing="2" noborder=""> <tbody><tr><td bgcolor="black" colspan="19"><font color="white"></font></td></tr> <tr> <th bgcolor="#303030" colspan="19" align="left"><font color="white">A.INFO_R4=test FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor="#303030" colspan="19" align="left"><font color="white">B.INFO_R4=test FSTYPE=reiser4 MKFS=mkfs.reiser4 -q -e extent40 </font></th> </tr> <tr> <th bgcolor="#303030" colspan="19" align="left"><font color="white">C.FSTYPE=reiserfs </font></th> </tr> <tr> <th bgcolor="#303030" colspan="19" align="left"><font color="white">D.FSTYPE=reiserfs MOUNT_OPTIONS=notail </font></th> </tr> <tr> <th bgcolor="#303030" colspan="19" align="left"><font color="white">E.FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor="#303030" colspan="19" align="left"><font color="white">F.FSTYPE=ext3 MOUNT_OPTIONS=data=journal </font></th> </tr> <tr> <td colspan="19" bgcolor="#606060"><b><font color="white">#0:DD_MBCOUNT=768 </font></b></td></tr> <tr align="center" bgcolor="#c0c0c0"> <td></td> <td colspan="6"><b>REAL_TIME</b></td> <td colspan="6"><b>CPU_TIME</b></td> <td colspan="6"><b>DF</b></td> </tr> <tr align="center" bgcolor="#c0c0c0"> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>dd_writing_largefile</b></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 76.29</u></tt></td> <td bgcolor="#e0e0c0" 
align="right"><tt><font color="black"> <u> 0.997</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.137 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.149 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.062 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 2.217 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>7.47</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.027 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.545</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.549</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.803 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.835 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 786432</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.000</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.001</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.001</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.001</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.001</u> </font></tt></td> </tr> <tr><td bgcolor="black" colspan="19"><font color="white"></font></td></tr> <tr><td colspan="19" align="right"> </td></tr><tr> <td colspan="19" bgcolor="#303030"><b><font color="white">NPROC=1 DIR=/mnt/testfs SYNC=off REP_COUNTER=3 GAMMA=0.0 DD_MBCOUNT=768 DEV=/dev/hdb3 WRITE_BUFFER=4096 FILE_SIZE=8000 BYTES=128000000 </font></b></td></tr> <tr><td colspan="19" align="right"> <font size="-2">Produced by <a href="http://namesys.com/benchmarks/mongo_readme.html">Mongo</a> benchmark suite.</font></td></tr> </tbody></table> === bonnie++ 2003-09-30 === Bonnie++ 
comparison, ext3 vs reiser4 (2003-09-30) This is bonnie++ output for reiser4 and ext3. This has been done in an attempt to analyze <a href="http://fsbench.netnation.com/">results</a> obtained by Mike Benoit. Hardware specs: <pre> processor : 3 vendor_id : GenuineIntel cpu family : 15 model : 2 model name : Intel(R) Xeon(TM) CPU 2.40GHz stepping : 7 cpu MHz : 2379.253 cache size : 512 KB bogomips : 4751.36 </pre> Dual CPU with hyper-threading Memory: 128M HDD: <pre> # hdparm /dev/hdb1 /dev/hdb1: multcount = 16 (on) IO_support = 0 (default 16-bit) unmaskirq = 0 (off) using_dma = 1 (on) keepsettings = 0 (off) readonly = 0 (off) readahead = 256 (on) geometry = 65535/16/63, sectors = 117226242, start = 63 # hdparm -t /dev/hdb1 /dev/hdb1: Timing buffered disk reads: 64 MB in 1.60 seconds = 39.91 MB/sec # hdparm -i /dev/hdb /dev/hdb: Model=ST360021A, FwRev=3.19, SerialNo=3HR173RB Config={ HardSect NotMFM HdSw>15uSec Fixed DTR>10Mbs RotSpdTol>.5% } RawCHS=16383/16/63, TrkSize=0, SectSize=0, ECCbytes=4 BuffType=unknown, BuffSize=2048kB, MaxMultSect=16, MultSect=16 CurCHS=16383/16/63, CurSects=16514064, LBA=yes, LBAsects=117231408 IORDY=on/off, tPIO={min:240,w/IORDY:120}, tDMA={min:120,rec:120} PIO modes: pio0 pio1 pio2 pio3 pio4 DMA modes: mdma0 mdma1 mdma2 UDMA modes: udma0 udma1 udma2 udma3 udma4 *udma5 AdvancedPM=no WriteCache=enabled Drive conforms to: device does not report version: 1 2 3 4 5 </pre> <pre> ./bonnie++ -s 1g -n 10 -x 5 Version 1.03 ------Sequential Output------ --Sequential Input- --Random- -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks-- Machine Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP v4.128M 1G 19903 89 37911 20 15392 11 13624 58 41807 12 131.0 0 v4.128M 1G 19965 89 37600 20 15845 11 13730 58 41751 12 130.0 0 v4.128M 1G 19937 89 37746 20 15404 11 13624 58 41793 12 132.1 0 v4.128M 1G 19998 89 37184 19 15007 10 13393 56 41611 11 130.2 0 v4.128M 1G 19771 89 37679 20 15206 11 13466 57 41808 11 130.2 1 ext3.128M 1G 21236 99 
37258 22 11357 4 13460 56 41748 6 120.0 0 ext3.128M 1G 20821 99 36838 23 12176 5 13154 55 40671 6 120.7 0 ext3.128M 1G 20755 99 37032 24 12069 4 12908 54 40851 5 120.2 0 ext3.128M 1G 20651 99 37094 24 11817 5 13038 54 40842 6 121.3 0 ext3.128M 1G 20928 99 37300 23 12287 4 13067 55 41404 6 120.1 0 ------Sequential Create------ --------Random Create-------- -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete-- files:max:min /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP v4.128M 10 18503 100 +++++ +++ 9488 99 10158 99 +++++ +++ 11635 99 v4.128M 10 19760 99 +++++ +++ 9696 99 10441 100 +++++ +++ 11831 99 v4.128M 10 19583 100 +++++ +++ 9672 100 10597 99 +++++ +++ 11846 100 v4.128M 10 19720 100 +++++ +++ 9577 99 10126 100 +++++ +++ 11924 100 v4.128M 10 19682 100 +++++ +++ 9683 100 10461 100 +++++ +++ 11834 100 ext3.128M 10 3279 97 +++++ +++ +++++ +++ 3406 100 +++++ +++ 8951 95 ext3.128M 10 3303 98 +++++ +++ +++++ +++ 3423 99 +++++ +++ 8558 96 ext3.128M 10 3317 98 +++++ +++ +++++ +++ 3402 100 +++++ +++ 8721 93 ext3.128M 10 3325 98 +++++ +++ +++++ +++ 3390 100 +++++ +++ 9242 100 ext3.128M 10 3315 97 +++++ +++ +++++ +++ 3439 100 +++++ +++ 8896 96 </pre> <pre> ./bonnie++ -f -d . 
-s 3072 -n 10:100000:10:10 -x 1 Version 1.03 ------Sequential Output------ --Sequential Input- --Random- -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks-- Machine Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP v4 3G 37579 19 15657 11 41531 11 105.8 0 v4 3G 37993 20 15478 11 41632 11 105.4 0 ext3 3G 35221 22 10987 4 41105 6 90.9 0 ext3 3G 35099 22 11517 4 41416 6 90.7 0 ------Sequential Create------ --------Random Create-------- -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete-- files:max:min /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP v4 10:100000:10/10 570 39 746 17 1435 23 513 40 104 2 951 15 v4 10:100000:10/10 566 40 765 17 1385 23 509 41 104 2 904 14 ext3 10:100000:10/10 221 8 364 4 853 4 204 7 99 1 306 2 ext3 10:100000:10/10 221 7 368 4 839 5 206 7 91 1 309 2 </pre> <hr> <a name="grant"></a> Benchmarks performed by <a href="mailto:mine0057@mrs.umn.edu">Grant Miner</a>. He used the <a href="http://epoxy.mrs.umn.edu/~minerg/fstests/bench.scm">bench.scm</a> script (requires <a href="http://www.scsh.net/">scsh</a>). Results (copied from <a href="http://epoxy.mrs.umn.edu/~minerg/fstests/results.html">http://epoxy.mrs.umn.edu/~minerg/fstests/results.html</a>): <p>2.6.0-test3</p> <p>mkfs was run with default options</p> <p>Each test has three columns: the first is the canonical name of the test, with the time the test took in seconds; the second column is system CPU time; the third column is user CPU time. The "total" column is the total time; "sys" is the total system CPU time; "usr" is the total user CPU time; "total cpu" is the sum of the total sys and usr times.
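As a reading aid only (this is not part of the original bench.scm harness), the derived columns just described can be recomputed from the per-test (real, sys, usr) triples. A minimal Python sketch, using the reiser4 row of the table below:

```python
# Recompute the summary columns: "total" is the sum of per-test real
# times, "sys"/"usr" are the summed CPU times, and "total cpu" is
# total sys + total usr. Triples are (real, sys, usr) in seconds,
# copied from the reiser4 row of the table.
reiser4 = {
    "bigdir": (33.51, 10.85, 0.69),
    "cp":     (33.90, 10.65, 0.65),
    "cp2":    (32.90, 10.79, 0.67),
    "cp3":    (34.00, 10.87, 0.65),
    "cp4":    (33.62, 10.87, 0.69),
    "cp5":    (31.31, 10.83, 0.76),
    "rm":     (17.45,  4.07, 0.30),
    "rm2":    (11.54,  4.49, 0.30),
    "rm3":    (13.08,  4.27, 0.27),
    "sync":   ( 0.52,  0.00, 0.00),
}

total_real = round(sum(r for r, _, _ in reiser4.values()), 2)
total_sys  = round(sum(s for _, s, _ in reiser4.values()), 2)
total_usr  = round(sum(u for _, _, u in reiser4.values()), 2)
total_cpu  = round(total_sys + total_usr, 2)

print(total_real, total_sys, total_usr, total_cpu)
# -> 241.83 77.69 4.98 82.67, matching the table's total columns
```

The recomputed values agree with the total/sys/usr/"total cpu" cells of the reiser4 row, which is a quick way to check a row when reading the table.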
</p> <p><b>all values are in seconds thus lower is better</b></p> <table border cellspacing=0 cellpadding=5> <caption>Filesystem Performance</caption> <colgroup> <col> <col bgcolor="gray"> </colgroup> <tr> <th>fs</th> <td bgcolor="lightgray">bigdir</td> <td>sys</td> <td>usr</td> <td bgcolor="lightgray">cp</td> <td>sys</td> <td>usr</td> <td bgcolor="lightgray">cp2</td> <td>sys</td> <td>usr</td> <td bgcolor="lightgray">cp3</td> <td>sys</td> <td>usr</td> <td bgcolor="lightgray">cp4</td> <td>sys</td> <td>usr</td> <td bgcolor="lightgray">cp5</td> <td>sys</td> <td>usr</td> <td bgcolor="lightgray">rm</td> <td>sys</td> <td>usr</td> <td bgcolor="lightgray">rm2</td> <td>sys</td> <td>usr</td> <td bgcolor="lightgray">rm3</td> <td>sys</td> <td>usr</td> <td bgcolor="lightgray">sync</td> <td>sys</td> <td>usr</td> <td bgcolor="lightgray">total</td> <td>sys</td> <td>usr</td> <td bgcolor="lightgray">total cpu</td> <th>fs</th> </tr> <tr> <th>reiserfs</th> <td bgcolor="lightgray">40.03</td> <td>12.22</td> <td>0.76</td> <td bgcolor="lightgray">77.75</td> <td>10.72</td> <td>0.45</td> <td bgcolor="lightgray">62.9</td> <td>10.82</td> <td>0.43</td> <td bgcolor="lightgray">60.26</td> <td>11.03</td> <td>0.43</td> <td bgcolor="lightgray">61.33</td> <td>11.13</td> <td>0.43</td> <td bgcolor="lightgray">66.08</td> <td>11.31</td> <td>0.45</td> <td bgcolor="lightgray">10.86</td> <td>3.74</td> <td>0.07</td> <td bgcolor="lightgray">4.62</td> <td>3.36</td> <td>0.09</td> <td bgcolor="lightgray">8.22</td> <td>3.5</td> <td>0.09</td> <td bgcolor="lightgray">1.78</td> <td>0.03</td> <td>0.</td> <td bgcolor="lightgray">393.83</td> <td>77.86</td> <td>3.2</td> <td bgcolor="lightgray">81.06</td> <th>reiserfs</th> </tr> <tr> <th>jfs</th> <td bgcolor="lightgray">47.2</td> <td>8.9</td> <td>0.77</td> <td bgcolor="lightgray">109.75</td> <td>5.5</td> <td>0.3</td> <td bgcolor="lightgray">110.71</td> <td>5.49</td> <td>0.35</td> <td bgcolor="lightgray">114.69</td> <td>5.6</td> <td>0.29</td> <td 
bgcolor="lightgray">117.97</td> <td>5.65</td> <td>0.35</td> <td bgcolor="lightgray">125.48</td> <td>5.82</td> <td>0.29</td> <td bgcolor="lightgray">38.68</td> <td>0.74</td> <td>0.05</td> <td bgcolor="lightgray">16.25</td> <td>1.08</td> <td>0.07</td> <td bgcolor="lightgray">37.46</td> <td>0.74</td> <td>0.04</td> <td bgcolor="lightgray">0.07</td> <td>0.</td> <td>0.</td> <td bgcolor="lightgray">718.26</td> <td>39.52</td> <td>2.51</td> <td bgcolor="lightgray">42.03</td> <th>jfs</th> </tr> <tr> <th>xfs</th> <td bgcolor="lightgray">44.77</td> <td>13.3</td> <td>0.94</td> <td bgcolor="lightgray">105.36</td> <td>13.33</td> <td>0.53</td> <td bgcolor="lightgray">110.27</td> <td>14.36</td> <td>0.5</td> <td bgcolor="lightgray">110.17</td> <td>14.37</td> <td>0.51</td> <td bgcolor="lightgray">111.03</td> <td>14.43</td> <td>0.53</td> <td bgcolor="lightgray">118.84</td> <td>14.87</td> <td>0.55</td> <td bgcolor="lightgray">31.85</td> <td>6.44</td> <td>0.15</td> <td bgcolor="lightgray">15.2</td> <td>5.45</td> <td>0.14</td> <td bgcolor="lightgray">34.32</td> <td>5.87</td> <td>0.14</td> <td bgcolor="lightgray">0.03</td> <td>0.</td> <td>0.</td> <td bgcolor="lightgray">681.84</td> <td>102.42</td> <td>3.99</td> <td bgcolor="lightgray">106.41</td> <th>xfs</th> </tr> <tr> <th>reiser4</th> <td bgcolor="lightgray">33.51</td> <td>10.85</td> <td>0.69</td> <td bgcolor="lightgray">33.9</td> <td>10.65</td> <td>0.65</td> <td bgcolor="lightgray">32.9</td> <td>10.79</td> <td>0.67</td> <td bgcolor="lightgray">34.</td> <td>10.87</td> <td>0.65</td> <td bgcolor="lightgray">33.62</td> <td>10.87</td> <td>0.69</td> <td bgcolor="lightgray">31.31</td> <td>10.83</td> <td>0.76</td> <td bgcolor="lightgray">17.45</td> <td>4.07</td> <td>0.3</td> <td bgcolor="lightgray">11.54</td> <td>4.49</td> <td>0.3</td> <td bgcolor="lightgray">13.08</td> <td>4.27</td> <td>0.27</td> <td bgcolor="lightgray">0.52</td> <td>0.</td> <td>0.</td> <td bgcolor="lightgray">241.83</td> <td>77.69</td> <td>4.98</td> <td 
bgcolor="lightgray">82.67</td> <th>reiser4</th> </tr> <tr> <th>ext3</th> <td bgcolor="lightgray">38.79</td> <td>9.35</td> <td>0.7</td> <td bgcolor="lightgray">91.57</td> <td>7.21</td> <td>0.36</td> <td bgcolor="lightgray">62.6</td> <td>7.44</td> <td>0.36</td> <td bgcolor="lightgray">62.74</td> <td>7.5</td> <td>0.37</td> <td bgcolor="lightgray">60.62</td> <td>7.52</td> <td>0.34</td> <td bgcolor="lightgray">69.82</td> <td>7.59</td> <td>0.39</td> <td bgcolor="lightgray">26.21</td> <td>1.67</td> <td>0.05</td> <td bgcolor="lightgray">8.73</td> <td>1.66</td> <td>0.04</td> <td bgcolor="lightgray">13.79</td> <td>1.63</td> <td>0.06</td> <td bgcolor="lightgray">4.76</td> <td>0.01</td> <td>0.</td> <td bgcolor="lightgray">439.63</td> <td>51.58</td> <td>2.67</td> <td bgcolor="lightgray">54.25</td> <th>ext3</th> </tr> <tr> <th>ext2</th> <td bgcolor="lightgray">32.78</td> <td>7.61</td> <td>0.64</td> <td bgcolor="lightgray">37.28</td> <td>5.24</td> <td>0.34</td> <td bgcolor="lightgray">43.55</td> <td>5.34</td> <td>0.35</td> <td bgcolor="lightgray">45.41</td> <td>5.34</td> <td>0.37</td> <td bgcolor="lightgray">47.72</td> <td>5.48</td> <td>0.34</td> <td bgcolor="lightgray">50.5</td> <td>5.41</td> <td>0.32</td> <td bgcolor="lightgray">16.28</td> <td>0.67</td> <td>0.06</td> <td bgcolor="lightgray">7.54</td> <td>0.66</td> <td>0.05</td> <td bgcolor="lightgray">15.31</td> <td>0.71</td> <td>0.05</td> <td bgcolor="lightgray">0.24</td> <td>0.</td> <td>0.</td> <td bgcolor="lightgray">296.61</td> <td>36.46</td> <td>2.52</td> <td bgcolor="lightgray">38.98</td> <th>ext2</th> </tr> </table> <hr> </body> </html> <hr> <address><a href="mailto:reiser@namesys.com">Hans Reiser</a></address> <!-- Created: Sat Aug 23 00:28:46 MSD 2003 --> <!-- hhmts start --> Last modified: Thu Nov 20 17:51:10 MSK 2003 <!-- hhmts end --> [[category:Reiser4]] [[category:formatting-fixes-needed]] fe1b29ec50fe34c5a894eb31aff8d8f09833ee2f 1628 1626 2009-08-31T07:42:28Z Chris goe 2 formatting fixes == Benchmarks Of Reiser4 == The <tt>htree</tt> (<tt>-O dir_index</tt>) feature is a recent attempt by the ext3 developers to handle large directories as well as ReiserFS by using better
than linear search algorithms. One of the interesting results here was that <tt>htree</tt> does bad things to ext3 performance, at least for this benchmark. This means that trying to get usable performance for large directories out of ext3 can severely impact performance for the non-large case. You'll note that in our latest benchmark at the top of the page we use larger filesets. It seems that ext3 does a poor job of utilizing its write cache when the fileset uses a lot of memory without exceeding it, and by increasing the size of the fileset we get a fairer (read: better for ext3) benchmark for the create phase. The use of filesets small enough to barely fit into RAM for the create (but not the copy) phase was due to my being lax in supervising the benchmarking, but it did reveal something interesting. Probably Andrew Morton will fix that pretty quickly --- it's most likely not a deep fix to make, as fixing <tt>htree</tt> would be. If anyone knows where the tail-combining patch for ext3 went, let us know so we can benchmark that... good tail-combining performance is not trivial to get right, and I am wondering if there is a performance reason it did not go in. Keep in mind that these benchmarks are still evolving and maturing, and I need to give the mongo code a complete review again, as it has been worked on by others quite a bit. Note that while I like the mongo benchmarks, those who are concerned they may be stacked in our favor can look at the benchmarks run by others on lkml, one of which is at the bottom of this page; while not as elaborate and detailed as mongo, it comes up with roughly the same result. Andrew Morton wrote some beautiful readahead code in the VM, and many thanks to him for what it contributes to V4 performance; unfortunately, it should be confessed that these benchmarks utterly fail to measure its cleverness for real world usage patterns.
In fact, these benchmarks basically access everything once in each pass, which is not at all realistic as a representation of typical server workloads. So understand them as validly illuminating some aspects of performance, not all aspects, if you could be so generous. We ran data-ordered ext3 benchmarks at the suggestion of Andrew Morton, but they came out slower for this benchmark. We need to increase the base size range to 8k and run again. [[Reiser4]] is a fully atomic filesystem: keep in mind that these performance numbers are achieved with every FS operation performed as a fully atomic transaction. We are the first to make that practical performance-wise. Look for a user-space transactions interface to come out soon. Finally, remember that Reiser4 is more space efficient than [[ReiserFS]]; the <tt>df(1)</tt> measurements are there for looking at... ;-) === mongo 2.6.15-mm4 === [[Mongo]] comparison: ext3 vs. reiser4 with the "unixfile" regular file plugin vs. reiser4 with the [ftp://ftp.namesys.com/pub/tmp/cryptcompress_patches cryptcompress] regular file plugin. * 2.6.15-mm4 #1 Sat Feb 11 20:00:11 MSK 2006 * cryptcompress-4.patch * mem total = 516312 KB * Intel(R) Xeon(TM) CPU 2.40GHz, running UP kernel <p>Legend:</p> <ul> <li><tt>A</tt> reiser4 with "cryptcompress" regular file plugin</li> <li><tt>B</tt> reiser4 with "unixfile" regular file plugin</li> <li><tt>C</tt> ext3</li> </ul> <p> The table presents absolute values (of elapsed time, CPU usage, CPU utilization, and disk usage) for reiser4 with the "cryptcompress" regular file plugin, and ratios against this reiser4 for reiser4 with the "unixfile" regular file plugin and for ext3. A <font color=red>red</font> number means the ratio is larger than <tt>1.0</tt>, that is, reiser4 with the "cryptcompress" regular file plugin is better in this test.
<font color=green>Green</font> number means that it loses in this test. </p> <table cols=13 cellpadding=2 cellspacing=2 noborder> <tr><td bgcolor=black colspan=13><font color=white></td></tr> <tr> <th bgcolor=#303030 colspan=13 align=left><font color=white>A.MKFS=mkfs.reiser4 -y -o create=create_ccreg40,compressMode=col8 MOUNT_OPTIONS=noatime FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=13 align=left><font color=white>B.MKFS=mkfs.reiser4 -y MOUNT_OPTIONS=noatime FSTYPE=reiser4 (unixfile regular file plugin)</font></th> </tr> <tr> <th bgcolor=#303030 colspan=13 align=left><font color=white>C.MOUNT_OPTIONS=noatime,data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <td colspan=13 bgcolor=#606060><b><font color=white>#0:</font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=3><b>REAL_TIME</b></td> <td colspan=3><b>CPU_TIME</b></td> <td colspan=3><b>CPU_UTIL</b></td> <td colspan=3><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>CREATE</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 53.36</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.234 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 4.249 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>28.79</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.493</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>94.36</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.255 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.155</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 
775856</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.550 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.825 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>COPY</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 137.6</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.543 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.931 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>40.91</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.716</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.975 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>59.94</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.257 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.183</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1551756</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.550 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.825 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>READ</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 161.17</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.087 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.077 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>48.35</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.433 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.195</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>33.23</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.487 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.291</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1551756</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.550 
</font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.825 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>STATS</b></td> <td bgcolor=#E0E0C0 align=right><tt>24.12</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.936</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.927</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>6.76</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.941 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.624</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>27.97</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.005 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.676</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1551756</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.550 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.825 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>DELETE</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 155.26</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.091 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 0.989</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>38.76</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.824 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.108</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>26.33</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.758 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.104</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>4</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.000 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font 
color=green> <U> 0.000</U> </font></tt></td> </tt></td> </tr> <tr> <td colspan=13 bgcolor=#606060><b><font color=white>#1:DD_MBCOUNT=5000 </font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=3><b>REAL_TIME</b></td> <td colspan=3><b>CPU_TIME</b></td> <td colspan=3><b>CPU_UTIL</b></td> <td colspan=3><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_writing_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 116.02</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.430 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.553 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>38.65</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.514</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.619 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>92.86</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.155 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.149</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1909012</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.682 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.685 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_reading_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 153.76</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 0.996</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>58.11</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font 
color=green> 0.192 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.147</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>38.73</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.224 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.152</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1909012</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.682 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.685 </font></tt></td> </tr> <tr><td bgcolor=black colspan=13><font color=white></font></td></tr> <tr> <td colspan=13 bgcolor=#303030><b><font color=white>DIR=/mnt1 GAMMA=0.2 WRITE_BUFFER=131072 PHASE_APPEND=off SYNC=off PHASE_DELETE=rm NPROC=1 DEV=/dev/hda9 DD_MBCOUNT=5000 FILE_SIZE=8192 REP_COUNTER=1 PHASE_COPY=cp INFO_R4=2.6.15-mm4 cryptcompress-4.patch PHASE_READ=find BYTES=1024000000 PHASE_OVERWRITE=off PHASE_MODIFY=off </font></b></td></tr> Legend: <font color="green">green</font> means the result is better (lower) than the reference value in the first column; results marked <font color="red">red</font> are worse than the reference value; the best results are <u>underlined</u>, and other results that fall within a 2% margin of the best are underlined as well. === mongo 2.6.11 === [[mongo]] comparison of reiser4 against reiserfs v3 (notail), ext2 and xfs <dl> <dt>reiser4 </dt> <dd>reiser4-for-2.6.11-5.patch from <a href="ftp://ftp.namesys.com/pub/reiser4-for-2.6/2.6.11">ftp://ftp.namesys.com/pub/reiser4-for-2.6/2.6.11</a> </dd> <dt>mem total</dt> <dd>254496</dd> <dt>machine </dt> <dd>bones</dd> <dt>kernel </dt> <dd>2.6.11-reiser4-5 #2 SMP Sat Jun 4 20:06:47 MSD 2005</dd> <dt>date </dt> <dd>Fri Jun 17 23:52:17 2005</dd> </dl> <p> In this test 81% of files are chosen from the 0-10k size range and 19% from the 10-100k size range.
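For intuition only, the quoted size distribution can be mimicked in a few lines of Python. The file count and random seed here are illustrative assumptions, not the parameters of the actual mongo run:

```python
import random

# Mimic the quoted mongo fileset distribution: 81% of files drawn
# from the 0-10k size range, 19% from the 10k-100k range.
# NFILES and the seed are arbitrary illustrative choices.
random.seed(0)
NFILES = 10000

sizes = [random.randint(0, 10_000) if random.random() < 0.81
         else random.randint(10_001, 100_000)
         for _ in range(NFILES)]

small = sum(1 for s in sizes if s <= 10_000)
print(small / NFILES)  # close to 0.81
```

Such a size mix keeps most files within a node or two of the tree while still exercising extent handling for the larger files.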
</p> <!-- File stats: Units are decimal (1k = 1000) files 0-100 : 1433 files 100-1K : 12597 files 1K-10K : 103101 files 10K-100K : 28131 files 100K-1M : 0 files 1M-10M : 0 files 10M-larger : 0 total bytes written : 1886585039 --> <p>Legend:</p> <ul> <li><tt>A</tt> reiser4</li> <li><tt>B</tt> reiserfs <tt>v3 (notail)</tt></li> <li><tt>C</tt> ext2</li> <li><tt>D</tt> xfs default</li> </ul> <p> The table presents absolute values (elapsed time, CPU usage, CPU utilization, and disk usage) for reiser4, and ratios against reiser4 for all other configurations. A <font color=red>red</font> number means the ratio is larger than <tt>1.0</tt>, that is, reiser4 is better in this test; a <font color=green>green</font> number means that reiser4 loses in this test. </p> <table cols=17 cellpadding=2 cellspacing=2 noborder> <tr><td bgcolor=black colspan=17><font color=white></td></tr> <tr> <th bgcolor=#303030 colspan=17 align=left><font color=white>A.FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=17 align=left><font color=white>B.FSTYPE=reiserfs MOUNT_OPTIONS=notail </font></th> </tr> <tr> <th bgcolor=#303030 colspan=17 align=left><font color=white>C.FSTYPE=ext2 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=17 align=left><font color=white>D.MKFS=mkfs.xfs -f FSTYPE=xfs </font></th> </tr> <tr> <td colspan=17 bgcolor=#606060><b><font color=white>#0:</font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=4><b>REAL_TIME</b></td> <td colspan=4><b>CPU_TIME</b></td> <td colspan=4><b>CPU_UTIL</b></td> <td colspan=4><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>CREATE</b></td> <td bgcolor=#E0E0C0
align=right><tt><U> 66.12</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.022 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.686 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 4.288 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>34.98</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.901</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.114 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.445 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>29.86</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.424 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.398</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.398</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1623204</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.086 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.098 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>COPY</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 187.77</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.438 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.751 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.733 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>44.8</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.883</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.124 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.161 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>14.85</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.606 </font></tt></td> <td 
bgcolor=#E0E0C0 align=right><tt><font color=green> 0.611 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.353</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 3245428</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.087 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.098 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>READ</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 151.01</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.459 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.113 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.978 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>44.34</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.607 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.470</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.535 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>18.54</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.444</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.500 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.724 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 3245428</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.087 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.098 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>STATS</b></td> <td bgcolor=#E0E0C0 align=right><tt>22.04</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.314 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.812</U> 
</font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.871 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>8.61</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.698 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.571</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 4.591 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>20.11</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.528</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.709 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.579 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 3245428</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.087 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.098 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>DELETE</b></td> <td bgcolor=#E0E0C0 align=right><tt>108.77</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.313</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.193 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.071 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>41</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.637 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.091</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.795 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>21.45</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.795 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.077</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.556 </font></tt></td> </tt></td> <td 
bgcolor=#E0E0C0 align=right><tt>4</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 14877.000 </font></tt></td> </tt></td> </tr> <tr> <td colspan=17 bgcolor=#606060><b><font color=white>#1:DD_MBCOUNT=5000 </font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=4><b>REAL_TIME</b></td> <td colspan=4><b>CPU_TIME</b></td> <td colspan=4><b>CPU_UTIL</b></td> <td colspan=4><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_writing_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 536.06</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.005 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.017 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 0.982</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>122.28</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.826 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.819</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.806</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>14.99</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.771 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.711</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.742 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 
align=right><tt><U> 5120008</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.012</U> </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_reading_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt>145.32</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.031 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.965</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 0.982</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>157.51</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.947 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.890</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.880</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>57.01</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.901</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.909 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.884</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 5120008</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.012</U> </font></tt></td> </tt></td> </tr> <tr><td bgcolor=black colspan=17><font color=white></td></tr> <tr><td colspan=17 align=right> <tr> <td colspan=17 bgcolor=#303030><b><font color=white>INFO_R4=2.6.11 + reiser4-5 REP_COUNTER=1 DEV=/dev/hda5 DD_MBCOUNT=5000 PHASE_OVERWRITE=off FILE_SIZE=8192 NPROC=3 PHASE_READ=find PHASE_DELETE=rm PHASE_APPEND=off WRITE_BUFFER=131072 
DIR=/mnt1 PHASE_MODIFY=off BYTES=1024000000 PHASE_COPY=cp GAMMA=0.2 SYNC=off </td></tr> <tr><td colspan=17 align=right> <font size=-2>Produced by <a href=http://namesys.com/benchmarks/mongo_readme.html>Mongo</a> benchmark suite.</font></td></tr> </table> === mongo 2.6.8.1-mm3 === [[mongo]] comparison against ext3 <dl> <dt>reiser4 </dt> <dd>large key</dd> <dt>mem total</dt> <dd>254324</dd> <dt>machine </dt> <dd>bones</dd> <dt>kernel </dt> <dd>2.6.8.1-mm3 #3 SMP Mon Aug 23 19:33:13 MSD 2004</dd> <dt>date </dt> <dd>Tue Aug 31 15:47:51 2004</dd> </dl> <p> In this test 81% of files are chosen from the 0-10k size range and 19% from the 10-100k size range. </p> <!-- File stats: Units are decimal (1k = 1000) files 0-100 : 1433 files 100-1K : 12597 files 1K-10K : 103101 files 10K-100K : 28131 files 100K-1M : 0 files 1M-10M : 0 files 10M-larger : 0 total bytes written : 1886585039 --> <p>Legend:</p> <ul> <li><tt>A</tt> reiser4</li> <li><tt>B</tt> reiser4, extents only</li> <li><tt>C</tt> reiserfs <tt>v3 (notail)</tt></li> <li><tt>D</tt> ext3 in <tt>data=writeback</tt> mode (meta-data only journalling)</li> <li><tt>E</tt> ext3 in <tt>data=journal</tt> mode</li> <li><tt>F</tt> ext3 in <tt>data=ordered</tt> mode</li> </ul> <img src="http://www.namesys.com/intbenchmarks/mongo/04.08.26/256MB.RAM/one-thread-8k.g02.charts/CREATE.0.png"> <img src="http://www.namesys.com/intbenchmarks/mongo/04.08.26/256MB.RAM/one-thread-8k.g02.charts/COPY.0.png"> <img src="http://www.namesys.com/intbenchmarks/mongo/04.08.26/256MB.RAM/one-thread-8k.g02.charts/READ.0.png"> <img src="http://www.namesys.com/intbenchmarks/mongo/04.08.26/256MB.RAM/one-thread-8k.g02.charts/STATS.0.png"> <img src="http://www.namesys.com/intbenchmarks/mongo/04.08.26/256MB.RAM/one-thread-8k.g02.charts/DELETE.0.png"> <img src="http://www.namesys.com/intbenchmarks/mongo/04.08.26/256MB.RAM/one-thread-8k.g02.charts/dd_writing_largefile.1.png"> <img 
src="http://www.namesys.com/intbenchmarks/mongo/04.08.26/256MB.RAM/one-thread-8k.g02.charts/dd_reading_largefile.1.png"> <p> The table presents absolute values (elapsed time, CPU usage, CPU utilization, and disk usage) for reiser4, and ratios against reiser4 for all other configurations. A <font color=red>red</font> number means the ratio is larger than <tt>1.0</tt>, that is, reiser4 is better in this test; a <font color=green>green</font> number means that reiser4 loses in this test. </p> <table cols=25 cellpadding=2 cellspacing=2 noborder> <tr><td bgcolor=black colspan=25><font color=white></td></tr> <tr> <th bgcolor=#303030 colspan=25 align=left><font color=white>A.FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=25 align=left><font color=white>B.FSTYPE=reiser4 MKFS=mkfs.reiser4 -q -o extent=extent40 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=25 align=left><font color=white>C.MOUNT_OPTIONS=notail FSTYPE=reiserfs </font></th> </tr> <tr> <th bgcolor=#303030 colspan=25 align=left><font color=white>D.MOUNT_OPTIONS="data=writeback" FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=25 align=left><font color=white>E.MOUNT_OPTIONS="data=journal" FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=25 align=left><font color=white>F.MOUNT_OPTIONS="data=ordered" FSTYPE=ext3 </font></th> </tr> <tr> <td colspan=25 bgcolor=#606060><b><font color=white>#0:</font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=6><b>REAL_TIME</b></td> <td colspan=6><b>CPU_TIME</b></td> <td colspan=6><b>CPU_UTIL</b></td> <td colspan=6><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A
</b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>CREATE</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 91.6</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 0.988</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.983 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.592 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.010 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.256 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>31.13</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.965 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.826</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.577 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.529 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.802 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>22.63</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.981 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.350</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.791 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.738 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.000 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1978440</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.088 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font 
color=red> 1.108 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>COPY</b></td> <td bgcolor=#E0E0C0 align=right><tt>219.5</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.968</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.674 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.241 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.105 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.819 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>54.04</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.938 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.792</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.694 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.004 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.860 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>16.01</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.996 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.460</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.663 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.839 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.890 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 3956708</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.088 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>READ</b></td> <td 
bgcolor=#E0E0C0 align=right><tt><U> 187.34</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.007</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.617 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.282 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.295 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.250 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>38.61</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.002 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.711 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.615</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.622</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.615</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>13.05</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.995 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.441</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.520 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.517 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.533 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 3956708</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.088 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>STATS</b></td> <td bgcolor=#E0E0C0 align=right><tt>23.71</tt></td> <td bgcolor=#E0E0C0 
align=right><tt><font color=green> 0.968 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.162 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.943</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.943</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.943</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>10.91</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.944 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.717 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.661</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.674 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.658</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>24.46</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.971 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.587</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.700 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.707 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.697 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 3956708</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.088 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>DELETE</b></td> <td bgcolor=#E0E0C0 align=right><tt>156.84</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.993 </font></tt></td> 
<td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.233</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.264 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.270 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.216 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>53.05</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.938 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.440 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.209</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.215 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.214 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>18.23</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.947 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.758 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.157</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.160 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.167 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>4</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.000 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> </tt></td> </tr> <tr> <td colspan=25 bgcolor=#606060><b><font color=white>#1:DD_MBCOUNT=768 </font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=6><b>REAL_TIME</b></td> <td colspan=6><b>CPU_TIME</b></td> <td colspan=6><b>CPU_UTIL</b></td> 
<td colspan=6><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_writing_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 30.09</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.006</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.286 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.342 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.473 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.311 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>5.24</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.996 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.966</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.286 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.393 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.437 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>11.43</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.994 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.631</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.796 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.655 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.967 </font></tt></td> 
</tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 786436</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_reading_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt>28.38</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.969</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.010 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 0.980</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 0.982</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.999 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>4.37</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.979 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.014 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.911</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.895</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.936 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>8.88</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.030 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.922 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.858</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.854</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.867</U> 
</font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 786436</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr><td bgcolor=black colspan=25><font color=white></td></tr> <tr><td colspan=25 align=right> <tr> <td colspan=25 bgcolor=#303030><b><font color=white>REP_COUNTER=1 PHASE_COPY=cp INFO_R4=2.6.8.1-mm3 + parse_options.patch FILE_SIZE=8192 DEV=/dev/hda6 PHASE_MODIFY=off DD_MBCOUNT=768 PHASE_APPEND=off PHASE_OVERWRITE=off SYNC=off DIR=/mnt1 PHASE_DELETE=rm NPROC=1 BYTES=1024000000 GAMMA=0.2 PHASE_READ=find WRITE_BUFFER=131072 </td></tr> <tr><td colspan=25 align=right> <font size=-2>Produced by <a href=http://namesys.com/>Mongo</a> benchmark suite.</font></td></tr> </table> === slow.c 2004-03-26 === [[slow.c]] comparison against ext2 and ext3, 2004-03-26 <p> These are the <a href="http://www.jburgess.uklinux.net/slow.c">slow.c</a> benchmark results for the latest 2004.03.26 reiser4 snapshot. </p> <p> <b>slow.c</b> is a simple program by Jon Burgess which writes and reads multiple data streams. For the details and the source code, see <a href="http://marc.theaimsgroup.com/?l=linux-kernel&m=107652683608384&w=2">the discussion</a> on the linux-kernel mailing list.
</p> <p> kernel : 2.6.5-rc2</p> <p> RAM : 256Mb</p> <p> reiser4 : <a href="http://www.namesys.com/snapshots/2004.03.26/">2004.03.26 snapshot</a></p> <p>Hardware specs:</p> <pre> processor : 1 vendor_id : AuthenticAMD cpu family : 6 model : 6 model name : AMD Athlon(tm) Processor stepping : 2 cpu MHz : 1460.098 cache size : 256 KB bogomips : 2916.35 Dual CPU AMD Athlon(tm) 1.4Ghz </pre> <pre> # hdparm /dev/hda6: multcount = 16 (on) IO_support = 1 (32-bit) unmaskirq = 1 (on) using_dma = 1 (on) keepsettings = 0 (off) readonly = 0 (off) readahead = 256 (on) geometry = 65535/16/63, sectors = 35937342, start = 84164598 </pre> <pre> # hdparm -t /dev/hda6 /dev/hda6: Timing buffered disk reads: 84 MB in 3.07 seconds = 27.39 MB/sec </pre> <pre> # hdparm -i /dev/hda /dev/hda: Model=IC35L060AVER07-0, FwRev=ER6OA44A, SerialNo=SZPTZMB6154 Config={ HardSect NotMFM HdSw>15uSec Fixed DTR>10Mbs } RawCHS=16383/16/63, TrkSize=0, SectSize=0, ECCbytes=40 BuffType=DualPortCache, BuffSize=1916kB, MaxMultSect=16, MultSect=16 CurCHS=16383/16/63, CurSects=16514064, LBA=yes, LBAsects=120103200 IORDY=on/off, tPIO={min:240,w/IORDY:120}, tDMA={min:120,rec:120} PIO modes: pio0 pio1 pio2 pio3 pio4 DMA modes: mdma0 mdma1 mdma2 UDMA modes: udma0 udma1 udma2 AdvancedPM=yes: disabled (255) WriteCache=enabled Drive conforms to: ATA/ATAPI-5 T13 1321D revision 1: * signifies the current active mode </pre> <pre> <!-- (500Mb of data) test : ./slow foo 500 Results : ============================================================== | 1 stream | 2 streams --------------+----------------------------------------------- | WRITE READ | WRITE READ --------------+----------------------------------------------- ext2 25.08Mb/s 27.08Mb/s 13.72Mb/s 14.04Mb/s reiser4 26.31Mb/s 26.99Mb/s 24.03Mb/s 26.84Mb/s reiser4-extents 25.28Mb/s 27.40Mb/s 24.12Mb/s 26.85Mb/s ext3-ordered 20.99Mb/s 26.40Mb/s 12.01Mb/s 13.34Mb/s ext3-journal 10.13Mb/s 24.48Mb/s 8.87Mb/s 13.26Mb/s reiserfs 20.42Mb/s 27.67Mb/s 12.98Mb/s 13.13Mb/s 
reiserfs-notail 20.07Mb/s 27.58Mb/s 13.04Mb/s 13.25Mb/s ============================================================== --> (1000Mb of data) test : ./slow foo 1000 Results : <!-- ============================================================================================================== | 1 stream | 2 streams | 4 streams | 8 stream --------------+----------------------------------------------------------------------------------------------- | WRITE READ | WRITE READ | WRITE READ | WRITE READ --------------+----------------------------------------------------------------------------------------------- ext2 24.66Mb/s 27.56Mb/s 13.40Mb/s 13.67Mb/s 7.73Mb/s 6.94Mb/s 6.69Mb/s 3.52Mb/s reiser4 25.42Mb/s 27.71Mb/s 23.96Mb/s 26.34Mb/s 24.55Mb/s 26.58Mb/s 24.90Mb/s 26.76Mb/s reiser4-extents 25.60Mb/s 27.68Mb/s 24.19Mb/s 25.92Mb/s 25.24Mb/s 27.12Mb/s 25.39Mb/s 26.72Mb/s ext3-ordered 20.05Mb/s 26.46Mb/s 11.06Mb/s 13.12Mb/s 9.63Mb/s 6.76Mb/s 10.02Mb/s 3.48Mb/s ext3-journal 10.10Mb/s 26.81Mb/s 8.87Mb/s 13.08Mb/s 8.59Mb/s 6.84Mb/s 8.14Mb/s 3.47Mb/s reiserfs 20.19Mb/s 27.48Mb/s 12.69Mb/s 13.03Mb/s 8.27Mb/s 6.84Mb/s 7.87Mb/s 4.13Mb/s reiserfs-notail 20.31Mb/s 27.10Mb/s 12.74Mb/s 13.09Mb/s 8.33Mb/s 6.89Mb/s 7.87Mb/s 4.17Mb/s ============================================================================================================= --> </pre> <table> <tr> <td><img src="intbenchmarks/slow/04.03.25-int.snapshot.bones/wr.1.png"></td> <td><img src="intbenchmarks/slow/04.03.25-int.snapshot.bones/wr.2.png"></td> <td><img src="intbenchmarks/slow/04.03.25-int.snapshot.bones/wr.4.png"></td> <td><img src="intbenchmarks/slow/04.03.25-int.snapshot.bones/wr.8.png"></td> </tr> <tr> <td><img src="intbenchmarks/slow/04.03.25-int.snapshot.bones/rd.1.png"></td> <td><img src="intbenchmarks/slow/04.03.25-int.snapshot.bones/rd.2.png"></td> <td><img src="intbenchmarks/slow/04.03.25-int.snapshot.bones/rd.4.png"></td> <td><img src="intbenchmarks/slow/04.03.25-int.snapshot.bones/rd.8.png"></td> </tr> 
</table> === mongo 2003-11-20 === [[mongo]] comparison against ext3, 2003-11-20 <dl> <dt>reiser4 </dt> <dd>''</dd> <dt>mem total</dt> <dd>255716</dd> <dt>machine </dt> <dd>belka</dd> <dt>kernel </dt> <dd>2.6.0-test9 #2 SMP Thu Nov 20 16:08:42 MSK 2003</dd> <dt>date </dt> <dd>Thu Nov 20 16:16:50 2003</dd> </dl> <p> In this test 80% of files are chosen from the 0-8k size range, 16% from the 0-80k size range, 0.8 x 4% from the 0-800k size range, etc. Most files are small, most bytes are in large files. </p> <p>Legend:</p> <ul> <li><tt>A</tt> reiser4</li> <li><tt>B</tt> reiser4, extents only</li> <li><tt>C</tt> reiserfs <tt>v3</tt></li> <li><tt>D</tt> ext3 in <tt>data=writeback</tt> mode (meta-data only journalling)</li> <li><tt>E</tt> ext3 in <tt>data=journal</tt> mode</li> <li><tt>F</tt> ext3 in <tt>data=ordered</tt> mode</li> <li><tt>G</tt> ext3 with htree (hashed directories)</li> </ul> <p> Table presents absolute values (of elapsed time, CPU usage, and disk usage) for reiser4, and ratios against reiser4 for all other configurations. <font color=red>Red</font> number means ratio is larger than <tt>1.0</tt>, that is, reiser4 is better in this test. <font color=green>Green</font> number means that reiser4 loses in this test. 
</p> <table cols=22 cellpadding=2 cellspacing=2 noborder> <tr><td bgcolor=black colspan=22><font color=white></td></tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>A.INFO_R4='' FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>B.INFO_R4='' MKFS=mkfs.reiser4 -q -o policy=extents FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>C.FSTYPE=reiserfs </font></th> </tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>D.MOUNT_OPTIONS=data=writeback FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>E.MOUNT_OPTIONS=data=journal FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>F.MOUNT_OPTIONS=data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>G.MKFS=mkfs.ext3 -O dir_index MOUNT_OPTIONS=data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <td colspan=22 bgcolor=#606060><b><font color=white>#0:</font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=7><b>REAL_TIME</b></td> <td colspan=7><b>CPU_TIME</b></td> <td colspan=7><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>CREATE</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 21.81</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.171 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.983 </font></tt></td> <td bgcolor=#E0E0C0 
align=right><tt><font color=red> 3.253 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.702 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.161 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.212 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>6.38</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.130 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.020 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.461 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.461 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.354 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.851</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 607612</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.091 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.035 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>COPY</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 64.37</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.089 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.046 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.980 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.834 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.929 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 6.246 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>11.55</tt></td> 
<td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.047 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.797 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.590 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.725 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.542 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.698</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1214992</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.091 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.034 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>READ</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 45.38</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.026 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.406 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.248 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.307 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.232 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 7.192 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>10.13</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.934 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.517 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.454 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.453</U> </font></tt></td> <td bgcolor=#E0E0C0 
align=right><tt><font color=green> <U> 0.444</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.504 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1214992</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.091 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.034 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>STATS</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 5.74</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.030 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.413 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.014</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.033 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.021 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.634 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>2.34</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.000 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.936 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.761 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.791 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.774 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.744</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1214992</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.091 </font></tt></td> <td bgcolor=#E0E0C0 
align=right><tt><font color=red> 1.034 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>DELETE</b></td> <td bgcolor=#E0E0C0 align=right><tt>46.94</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.424</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.520 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.017 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.043 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.956 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.315 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>14.19</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.743 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.443 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.200</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.206 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.201</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.234 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>4</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.000 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td 
bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> </tt></td> </tr> <tr> <td colspan=22 bgcolor=#606060><b><font color=white>#1:DD_MBCOUNT=768 </font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=7><b>REAL_TIME</b></td> <td colspan=7><b>CPU_TIME</b></td> <td colspan=7><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_writing_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 29.33</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.026 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.184 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.102 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.499 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.097 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.098 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>2.61</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.008 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.659</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.437 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.054 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.556 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.571 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 786436</U></tt></td> 
<td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_reading_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 22.96</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.056 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.003</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.004</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.003</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.006</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>2.26</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.991 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.912 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.796 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.765</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.779</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.783 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 786436</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 
align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr><td bgcolor=black colspan=22><font color=white></td></tr> <tr><td colspan=22 align=right> <tr> <td colspan=22 bgcolor=#303030><b><font color=white>NPROC=1 DIR=/mnt/testfs SYNC=off PHASE_COPY=cp REP_COUNTER=1 GAMMA=0.2 PHASE_OVERWRITE=off FILE_SIZE=8192 BYTES=512000000 PHASE_APPEND=off PHASE_READ=find DEV=/dev/hdb3 DD_MBCOUNT=768 WRITE_BUFFER=131072 PHASE_DELETE=rm PHASE_MODIFY=off </td></tr> <tr><td colspan=22 align=right> <font size=-2>Produced by <a href=http://namesys.com/benchmarks/mongo_readme.html>Mongo</a> benchmark suite.</font></td></tr> </table> === mongo 2003-09-25 === [[mongo]] comparison against ext3, 2003-09-25 <dl> <dt>reiser4 </dt> <dd>''</dd> <dt>mem total</dt> <dd>255048</dd> <dt>machine </dt> <dd>belka</dd> <dt>kernel </dt> <dd>2.6.0-test5 #33 SMP Thu Sep 25 15:45:38 MSD 2003</dd> <dt>date </dt> <dd>Thu Sep 25 15:57:38 2003</dd> </dl> <p> In this test 80% of files are chosen from the 0-8k size range, 16% from the 0-80k size range, 0.8 x 4% from the 0-800k size range, etc. Most files are small, most bytes are in large files. </p> <p>Legend:</p> <ul> <li><tt>A</tt> reiser4</li> <li><tt>B</tt> reiser4, extents only</li> <li><tt>C</tt> reiserfs <tt>v3</tt></li> <li><tt>D</tt> ext3 in <tt>data=writeback</tt> mode (meta-data only journalling)</li> <li><tt>E</tt> ext3 in <tt>data=journal</tt> mode</li> <li><tt>F</tt> ext3 in <tt>data=ordered</tt> mode</li> <li><tt>G</tt> ext3 with htree (hashed directories)</li> </ul> <p> Table presents absolute values (of elapsed time, CPU usage, and disk usage) for reiser4, and ratios against reiser4 for all other configurations. 
<font color=red>Red</font> number means ratio is larger than <tt>1.0</tt>, that is, reiser4 is better in this test. <font color=green>Green</font> number means that reiser4 loses in this test. </p> <table cols=22 cellpadding=2 cellspacing=2 noborder> <tr><td bgcolor=black colspan=22><font color=white></td></tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>A.INFO_R4='' FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>B.INFO_R4='' MKFS=mkfs.reiser4 -q -o policy=extents FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>C.FSTYPE=reiserfs </font></th> </tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>D.MOUNT_OPTIONS=data=writeback FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>E.MOUNT_OPTIONS=data=journal FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>F.MOUNT_OPTIONS=data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>G.MKFS=mkfs.ext3 -O dir_index MOUNT_OPTIONS=data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <td colspan=22 bgcolor=#606060><b><font color=white>#0:</font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=7><b>REAL_TIME</b></td> <td colspan=7><b>CPU_TIME</b></td> <td colspan=7><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>CREATE</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 
23.57</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.158 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.714 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.263 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.234 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.020 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.376 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>6.66</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.075 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.947 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.240 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.357 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.264 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.835</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 608548</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.090 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.034 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.105 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.105 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.105 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>COPY</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 64.98</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.083 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.050 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.023 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.810 </font></tt></td> <td bgcolor=#E0E0C0 
align=right><tt><font color=red> 1.908 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 6.850 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>12.18</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.057 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.776 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.507 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.603 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.518 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.743</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1216784</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.090 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.033 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.105 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.105 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.105 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>READ</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 44.65</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.028 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.733 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.237 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.114 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.179 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 7.694 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>10.28</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.933 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.590</U> 
</font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.608 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.593</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.608 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.620 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1216784</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.090 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.033 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.105 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.105 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.105 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>STATS</b></td> <td bgcolor=#E0E0C0 align=right><tt>5.88</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.998 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.139 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.981 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.020 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.929</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.655 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>2.29</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.987 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.900 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.747</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.782 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.747</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 
0.755</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1216784</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.090 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.033 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.105 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.105 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.105 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>DELETE</b></td> <td bgcolor=#E0E0C0 align=right><tt>46.65</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.438</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.504 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.109 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.023 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.022 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.376 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>14.19</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.746 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.431 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.206</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.211 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.211 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.232 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>4</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.000 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> 
</font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> </tt></td> </tr> <tr> <td colspan=22 bgcolor=#606060><b><font color=white>#1:DD_MBCOUNT=768 </font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=7><b>REAL_TIME</b></td> <td colspan=7><b>CPU_TIME</b></td> <td colspan=7><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_writing_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 30.78</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.017</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.177 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.063 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.394 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.066 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.056 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>3.11</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.981 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.553</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.701 </font></tt></td> <td bgcolor=#E0E0C0 
align=right><tt><font color=red> 1.296 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.318 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 786436</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_reading_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 22.96</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.045 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.005</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.005</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.004</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.006</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>2.41</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.996 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.867 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.739 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.718</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.739 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.722</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 786436</U></tt></td> 
<td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr><td bgcolor=black colspan=22><font color=white></td></tr> <tr> <td colspan=22 bgcolor=#303030><b><font color=white>NPROC=1 DIR=/mnt/testfs SYNC=off PHASE_COPY=cp REP_COUNTER=1 GAMMA=0.2 PHASE_OVERWRITE=off FILE_SIZE=8192 BYTES=512000000 PHASE_APPEND=off PHASE_READ=find DEV=/dev/hdb3 DD_MBCOUNT=768 WRITE_BUFFER=131072 PHASE_DELETE=rm PHASE_MODIFY=off </font></b></td></tr> <tr><td colspan=22 align=right> <font size=-2>Produced by <a href=http://namesys.com/benchmarks/mongo_readme.html>Mongo</a> benchmark suite.</font></td></tr> </table> === mongo 2003-08-28 === [[mongo]] comparison against reiserfs and ext3, 2003-08-28 <dl> <dt>reiser4 </dt> <dd>''</dd> <dt>mem total</dt> <dd>256276</dd> <dt>machine </dt> <dd>belka</dd> <dt>kernel </dt> <dd>2.6.0-test4 #194 SMP Thu Aug 28 17:18:47 MSD 2003</dd> <dt>date </dt> <dd>Thu Aug 28 17:20:18 2003</dd> </dl> <p> In this test 80% of files are chosen from the 0-8k size range, 16% from the 0-80k size range, 0.8 x 4% from the 0-800k size range, etc. Most files are small, most bytes are in large files.
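</p>
<p>As an illustration only (this is not the Mongo suite's actual code), the size distribution just described can be sketched in a few lines of Python. The base size of 8192 bytes and the decay factor 0.2 mirror the FILE_SIZE and GAMMA parameters shown in the table footer; the function name and everything else are assumptions:</p>

```python
import random

def sample_file_size(base=8192, gamma=0.2, rng=random):
    """Pick a file size the way the text describes: with probability
    (1 - gamma) the size falls in (0, base], with probability
    (1 - gamma) * gamma in (0, 10 * base], and so on, one decade up
    each step.  With gamma = 0.2 that gives 80% of files in 0-8k,
    16% in 0-80k, etc.  Most files are small, but because each decade
    is 10x wider and only 5x rarer, most bytes land in large files."""
    limit = base
    # Climb one decade with probability gamma per step.
    while rng.random() < gamma:
        limit *= 10
    return rng.randint(1, limit)
```

<p>Drawing many sizes from this sketch reproduces the stated property: the 0-8k files dominate by count, while files above 8k dominate the total number of bytes.</p>
<p>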
</p> <p>Legend:</p> <ul> <li><tt>A</tt> reiser4</li> <li><tt>B</tt> reiser4, extents only</li> <li><tt>C</tt> reiserfs</li> <li><tt>D</tt> ext3 in <tt>data=writeback</tt> mode (metadata-only journalling)</li> <li><tt>E</tt> ext3 in <tt>data=journal</tt> mode</li> <li><tt>F</tt> ext3 in <tt>data=ordered</tt> mode</li> <li><tt>G</tt> ext3 with htree (hashed directories)</li> </ul> <p> The table presents absolute values (of elapsed time, CPU usage, and disk usage) for reiser4, and ratios against reiser4 for all other configurations. A <font color=red>red</font> number means the ratio is larger than <tt>1.0</tt>, that is, reiser4 is better in this test. A <font color=green>green</font> number means that reiser4 loses in this test. </p> <table cols=22 cellpadding=2 cellspacing=2 noborder> <tr><td bgcolor=black colspan=22><font color=white></td></tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>A.INFO_R4='' FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>B.INFO_R4='' MKFS=mkfs.reiser4 -q -o policy=extents FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>C.FSTYPE=reiserfs </font></th> </tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>D.MOUNT_OPTIONS=data=writeback FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>E.MOUNT_OPTIONS=data=journal FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>F.MOUNT_OPTIONS=data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>G.MKFS=mkfs.ext3 -O dir_index MOUNT_OPTIONS=data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <td colspan=22 bgcolor=#606060><b><font color=white>#0:</font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=7><b>REAL_TIME</b></td> <td colspan=7><b>CPU_TIME</b></td> <td colspan=7><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td>
<td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>CREATE</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 21.94</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.056 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.957 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.049 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.430 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.399 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.558 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>6.7</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.104 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.913 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.213 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.334 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.345 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.821</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 608452</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.091 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.034 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.105 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.105 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.105 </font></tt></td> <td bgcolor=#E0E0C0 
align=right><tt><font color=red> 1.106 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>COPY</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 64.05</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.078 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.112 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.964 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.703 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.022 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 7.356 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>11.37</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.039 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.819 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.538 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.692 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.568 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.708</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1216572</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.091 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.033 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>READ</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 52.53</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.072 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.882 </font></tt></td> <td 
bgcolor=#E0E0C0 align=right><tt><font color=red> 1.056 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.126 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.124 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 7.158 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>9.8</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.914 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.538 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.489 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.467 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.456</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.551 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1216572</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.091 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.033 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>STATS</b></td> <td bgcolor=#E0E0C0 align=right><tt>5.82</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.973</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.251 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.040 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.009 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.048 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.641 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 
align=right><tt>2.29</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.991 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.926 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.755 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.742</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.751 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.734</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1216572</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.091 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.033 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>DELETE</b></td> <td bgcolor=#E0E0C0 align=right><tt>46.96</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.409</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.491 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.949 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.988 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.987 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.382 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>13.89</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.734 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.453 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.210 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font 
color=green> <U> 0.204</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.202</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.238 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>4</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.000 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> </tt></td> </tr> <tr> <td colspan=22 bgcolor=#606060><b><font color=white>#1:DD_MBCOUNT=768 </font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=7><b>REAL_TIME</b></td> <td colspan=7><b>CPU_TIME</b></td> <td colspan=7><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_writing_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 26.1</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.006</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.205 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.066 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.353 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 
1.068 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.070 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>3.18</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.028 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.547</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.173 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.708 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.327 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.296 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 786436</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_reading_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 18.99</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.009</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.072 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.009</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.007</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.006</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.008</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>2.12</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.000 
</font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.925 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.877 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.844 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.830 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.811</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 786436</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr><td bgcolor=black colspan=22><font color=white></td></tr> <tr><td colspan=22 align=right> <tr> <td colspan=22 bgcolor=#303030><b><font color=white>NPROC=1 DIR=/mnt/testfs SYNC=off PHASE_COPY=cp REP_COUNTER=1 GAMMA=0.2 PHASE_OVERWRITE=off FILE_SIZE=8192 BYTES=512000000 PHASE_APPEND=off PHASE_READ=find DEV=/dev/hdb3 DD_MBCOUNT=768 WRITE_BUFFER=131072 PHASE_DELETE=rm PHASE_MODIFY=off </td></tr> <tr><td colspan=22 align=right> <font size=-2>Produced by <a href=http://namesys.com/benchmarks/mongo_readme.html>Mongo</a> benchmark suite.</font></td></tr> </table> === mongo 2003-08-27 === [[mongo]] comparison against ext3 <dl> <dt>reiser4 </dt> <dd>''</dd> <dt>mem total</dt> <dd>256276</dd> <dt>machine </dt> <dd>belka</dd> <dt>kernel </dt> <dd>2.6.0-test4 #189 SMP Wed Aug 27 20:36:51 MSD 2003</dd> <dt>date </dt> <dd>Wed Aug 27 20:44:02 2003</dd> </dl> <p> In this test 80% of files are chosen from the 0-8k size range, 16% from the 0-80k size range, 0.8 x 4% 
from the 0-800k size range, etc. Most files are small, most bytes are in large files. </p> <p>Legend:</p> <ul> <li><tt>A</tt> reiser4</li> <li><tt>B</tt> reiser4, extents only</li> <li><tt>C</tt> ext3 in <tt>data=writeback</tt> mode (metadata-only journalling)</li> <li><tt>D</tt> ext3 in <tt>data=journal</tt> mode</li> <li><tt>E</tt> ext3 in <tt>data=ordered</tt> mode</li> <li><tt>F</tt> ext3 with htree (hashed directories)</li> </ul> <p> The table presents absolute values (of elapsed time, CPU usage, and disk usage) for reiser4, and ratios against reiser4 for all other configurations. A <font color=red>red</font> number means the ratio is larger than <tt>1.0</tt>, that is, reiser4 is better in this test. A <font color=green>green</font> number means that reiser4 loses in this test. </p> <table cols=19 cellpadding=2 cellspacing=2 noborder> <tr><td bgcolor=black colspan=19><font color=white></td></tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>A.INFO_R4='' FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>B.INFO_R4='' MKFS=mkfs.reiser4 -q -o policy=extents FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>C.MOUNT_OPTIONS=data=writeback FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>D.MOUNT_OPTIONS=data=journal FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>E.MOUNT_OPTIONS=data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>F.MKFS=mkfs.ext3 -O dir_index MOUNT_OPTIONS=data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <td colspan=19 bgcolor=#606060><b><font color=white>#0:</font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=6><b>REAL_TIME</b></td> <td colspan=6><b>CPU_TIME</b></td> <td colspan=6><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A
</b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>CREATE</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 22.41</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.673 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.325 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.975 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.213 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>7.66</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.069 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.347 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.415 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.410 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.708</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 635264</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.096 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.110 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.110 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.110 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.111 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>COPY</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 90.92</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.099 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.471 
</font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.221 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.470 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 4.989 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>12.14</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.068 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.066 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.241 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.094 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.668</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1269840</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.096 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.110 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.110 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.110 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.112 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>READ</b></td> <td bgcolor=#E0E0C0 align=right><tt>82.21</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.063 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.861 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.852 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.791</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 4.417 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>10.57</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.914 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.400</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.428 </font></tt></td> <td 
bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.402</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.534 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1269840</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.096 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.110 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.110 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.110 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.112 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>STATS</b></td> <td bgcolor=#E0E0C0 align=right><tt>8.52</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.993 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.822</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.816</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.811</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.335 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>2.96</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.997 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.561</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.564</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.584 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.608 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1269840</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.096 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.110 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.110 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.110 
</font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.112 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>DELETE</b></td> <td bgcolor=#E0E0C0 align=right><tt>69.69</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.301</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.749 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.717 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.659 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.912 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>14.73</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.703 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.208</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.207</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.213 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.237 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>4</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.000 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> </tt></td> </tr> <tr> <td colspan=19 bgcolor=#606060><b><font color=white>#1:DD_MBCOUNT=768 </font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=6><b>REAL_TIME</b></td> <td colspan=6><b>CPU_TIME</b></td> <td colspan=6><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> 
<td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_writing_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 25.85</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.092 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.335 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.085 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.095 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 3.27</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 0.982</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.159 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.648 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.251 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.254 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 786436</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_reading_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 19</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 0.999</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font 
color=black> <U> 1.005</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.007</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.007</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.007</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>2.18</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.963 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.807 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.803</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.789</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.803</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 786436</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr><td bgcolor=black colspan=19><font color=white></td></tr> <tr><td colspan=19 align=right> <tr> <td colspan=19 bgcolor=#303030><b><font color=white>NPROC=1 DIR=/mnt/testfs SYNC=off PHASE_COPY=cp REP_COUNTER=1 GAMMA=0.2 PHASE_OVERWRITE=off FILE_SIZE=8000 BYTES=512000000 PHASE_APPEND=off PHASE_READ=find DEV=/dev/hdb3 DD_MBCOUNT=768 WRITE_BUFFER=131072 PHASE_DELETE=rm PHASE_MODIFY=off </td></tr> <tr><td colspan=19 align=right> <font size=-2>Produced by <a href=http://namesys.com/benchmarks/mongo_readme.html>Mongo</a> benchmark suite.</font></td></tr> </table> <hr> <p> This is the same test as above, but with base file size 4k, that is, in this test 80% of files are chosen from 
the 0-4k size range, 16% from the 0-40k size range, 0.8 x 4% from the 0-400k size range, etc. </p> <hr> <dl> <dt>reiser4 </dt> <dd>''</dd> <dt>mem total</dt> <dd>255580</dd> <dt>machine </dt> <dd>belka</dd> <dt>kernel </dt> <dd>2.6.0-test4 #176 SMP Tue Aug 26 19:09:38 MSD 2003</dd> <dt>date </dt> <dd>Wed Aug 27 12:41:54 2003</dd> </dl> <table cols=19 cellpadding=2 cellspacing=2 noborder> <tr><td bgcolor=black colspan=19><font color=white></td></tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>A.INFO_R4='' FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>B.INFO_R4='' MKFS=mkfs.reiser4 -q -o policy=extents FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>C.MOUNT_OPTIONS=data=writeback FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>D.MOUNT_OPTIONS=data=journal FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>E.MOUNT_OPTIONS=data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>F.MKFS=mkfs.ext3 -O dir_index MOUNT_OPTIONS=data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <td colspan=19 bgcolor=#606060><b><font color=white>#0:</font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=6><b>REAL_TIME</b></td> <td colspan=6><b>CPU_TIME</b></td> <td colspan=6><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>CREATE</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 33.86</U></tt></td> <td 
bgcolor=#E0E0C0 align=right><tt><font color=red> 1.223 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.305 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.895 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.549 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.298 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>14.11</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.118 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.967 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.046 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.045 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.647</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 789424</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.208 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.181 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>COPY</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 119.68</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.228 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.237 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.397 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.277 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 7.061 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>23.05</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.484 
</font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.683 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.515 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.691</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1578216</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.208 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.182 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>READ</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 118.5</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.217 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.041 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.065 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.020</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 6.585 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>19.84</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.993 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.436</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.446 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.431</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.540 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1578216</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.208 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 
</font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.182 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>STATS</b></td> <td bgcolor=#E0E0C0 align=right><tt>24.69</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.951 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.677</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.696 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.677</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.151 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>7.75</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.008 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.590</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.582</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.583</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.645 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1578216</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.208 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.182 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>DELETE</b></td> <td bgcolor=#E0E0C0 align=right><tt>114.49</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.438 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.174</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.188 </font></tt></td> <td 
bgcolor=#E0E0C0 align=right><tt><font color=green> 0.177 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.257 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>32.64</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.790 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.193</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.199 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.194</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.223 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>4</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.000 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> </tt></td> </tr> <tr> <td colspan=19 bgcolor=#606060><b><font color=white>#1:DD_MBCOUNT=768 </font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=6><b>REAL_TIME</b></td> <td colspan=6><b>CPU_TIME</b></td> <td colspan=6><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_writing_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 26.24</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.002</U> </font></tt></td> <td 
bgcolor=#E0E0C0 align=right><tt><font color=red> 1.066 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.311 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.056 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.063 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 3.25</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 0.997</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.138 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.622 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.286 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.298 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 786436</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_reading_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 19.04</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 0.994</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.002</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.003</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.002</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>2.08</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.038 </font></tt></td> <td 
bgcolor=#E0E0C0 align=right><tt><font color=green> 0.870 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.870 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.870 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.837</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 786436</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr><td bgcolor=black colspan=19><font color=white></td></tr> <tr><td colspan=19 align=right> <tr> <td colspan=19 bgcolor=#303030><b><font color=white>NPROC=1 DIR=/mnt/testfs SYNC=off PHASE_COPY=cp REP_COUNTER=1 GAMMA=0.2 PHASE_OVERWRITE=off FILE_SIZE=4000 BYTES=512000000 PHASE_APPEND=off PHASE_READ=find DEV=/dev/hdb3 DD_MBCOUNT=768 WRITE_BUFFER=131072 PHASE_DELETE=rm PHASE_MODIFY=off </td></tr> <tr><td colspan=19 align=right> <font size=-2>Produced by <a href=http://namesys.com/benchmarks/mongo_readme.html>Mongo</a> benchmark suite.</font></td></tr> </table> === mongo 2003-08-26 === [[mongo]] comparison against ext3 <dl> <dt>reiser4 </dt> <dd>''</dd> <dt>mem total</dt> <dd>904048</dd> <dt>machine </dt> <dd>belka</dd> <dt>kernel </dt> <dd>2.6.0-test4 #176 SMP Tue Aug 26 19:09:38 MSD 2003</dd> <dt>date </dt> <dd>Tue Aug 26 19:34:39 2003</dd> </dl> <p> In this test 80% of files are chosen from the 0-4k size range, 16% from the 0-40k size range, 0.8 x 4% from the 0-400k size range, etc. Most files are small, most bytes are in large files. 
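The tiered size distribution described above (80% of files from the smallest range, 16% from a range ten times larger, and so on, matching the GAMMA=0.2 setting shown in the table footer) can be sketched as follows. This is an illustrative reconstruction, not the actual Mongo generator code; the function name and tier cap are assumptions.

```python
import random

def sample_file_size(base=4000, gamma=0.2, max_tiers=4):
    """Pick a size tier k with probability (1 - gamma) * gamma**k,
    i.e. 80%, 16%, 3.2%, ... for gamma = 0.2, then draw a size
    uniformly from that tier. Each tier spans a range ten times
    larger than the previous one (0-4k, 0-40k, 0-400k, ...)."""
    k = 0
    while k < max_tiers - 1 and random.random() < gamma:
        k += 1
    return random.randint(1, base * 10 ** k)

sizes = [sample_file_size() for _ in range(100000)]
small = sum(1 for s in sizes if s <= 4000) / len(sizes)
# 'small' comes out slightly above 0.8, since the larger tiers
# also contain some files under 4k: most files are small, while
# most bytes end up in the rare large files.
```

Note that the tiers overlap: a file drawn from the 0-40k tier can still be under 4k, which is why slightly more than 80% of files land in the smallest range.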
</p> <p>Legend:</p> <ul> <li><tt>A</tt> reiser4</li> <li><tt>B</tt> reiser4, extents only</li> <li><tt>C</tt> ext3 in <tt>data=writeback</tt> mode (meta-data only journalling)</li> <li><tt>D</tt> ext3 in <tt>data=journal</tt> mode</li> <li><tt>E</tt> ext3 in <tt>data=ordered</tt> mode</li> <li><tt>F</tt> ext3 with htree (hashed directories)</li> </ul> <p> Table presents absolute values (of elapsed time, CPU usage, and disk usage) for reiser4, and ratios against reiser4 for all other configurations. <font color=red>Red</font> number means ratio is larger than <tt>1.0</tt>, that is, reiser4 is better in this test. <font color=green>Green</font> number means that reiser4 loses in this test. </p> <table cols=19 cellpadding=2 cellspacing=2 noborder> <tr><td bgcolor=black colspan=19><font color=white></td></tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>A.INFO_R4='' FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>B.INFO_R4='' MKFS=mkfs.reiser4 -q -o policy=extents FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>C.MOUNT_OPTIONS=data=writeback FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>D.MOUNT_OPTIONS=data=journal FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>E.MOUNT_OPTIONS=data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>F.MKFS=mkfs.ext3 -O dir_index MOUNT_OPTIONS=data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <td colspan=19 bgcolor=#606060><b><font color=white>#0:</font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=6><b>REAL_TIME</b></td> <td colspan=6><b>CPU_TIME</b></td> <td colspan=6><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A 
</b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>CREATE</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 27.6</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.311 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.567 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.538 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.668 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.566 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>13.55</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.166 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.035 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.162 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.189 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.670</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 788884</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.208 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.181 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.181 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.181 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.182 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>COPY</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 113.71</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.237 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.167 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.460 
</font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.227 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 7.387 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>23.13</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.169 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.498 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.691 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.591 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.709</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1577560</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.208 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.181 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.181 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.181 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.183 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>READ</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 111.51</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.239 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.157 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.176 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.096 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 7.017 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>20.76</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.042 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.424 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.415</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.416</U> </font></tt></td> <td 
bgcolor=#E0E0C0 align=right><tt><font color=green> 0.521 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1577560</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.208 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.181 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.181 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.181 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.183 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>STATS</b></td> <td bgcolor=#E0E0C0 align=right><tt>20.22</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.034 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.834</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.827</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.832</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.439 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>7.47</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.009 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.590</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.585</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.584</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.631 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1577560</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.208 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.181 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.181 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.181 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.183 </font></tt></td> 
</tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>DELETE</b></td> <td bgcolor=#E0E0C0 align=right><tt>110.98</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.437 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.183</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.180</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.185 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.277 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>33.03</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.838 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.196 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.192</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.193</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.221 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>4</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.000 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> </tt></td> </tr> <tr> <td colspan=19 bgcolor=#606060><b><font color=white>#1:DD_MBCOUNT=768 </font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=6><b>REAL_TIME</b></td> <td colspan=6><b>CPU_TIME</b></td> <td colspan=6><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A 
</b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_writing_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 26.03</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.096 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.340 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.092 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.080 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 3.48</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.011</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.083 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.583 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.187 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.190 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 786436</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_reading_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 19</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 0.995</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 
<U> 0.999</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 0.999</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>2.28</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.018 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.741 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.737</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.741 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.724</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 786436</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr><td bgcolor=black colspan=19><font color=white></td></tr> <tr><td colspan=19 align=right> <tr> <td colspan=19 bgcolor=#303030><b><font color=white>NPROC=1 DIR=/mnt/testfs SYNC=off PHASE_COPY=cp REP_COUNTER=1 GAMMA=0.2 PHASE_OVERWRITE=off FILE_SIZE=4000 BYTES=512000000 PHASE_APPEND=off PHASE_READ=find DEV=/dev/hdb3 DD_MBCOUNT=768 WRITE_BUFFER=131072 PHASE_DELETE=rm PHASE_MODIFY=off </td></tr> <tr><td colspan=19 align=right> <font size=-2>Produced by <a href=http://namesys.com/benchmarks/mongo_readme.html>Mongo</a> benchmark suite.</font></td></tr> </table> === mongo, 2003-08-18 === [[mongo]] comparison against ext3 <dl> <dt>reiser4 </dt> <dd></dd> <dt>mem total</dt> <dd>255992</dd> <dt>machine </dt> <dd>belka</dd> <dt>kernel </dt> <dd>2.6.0-test3 #37 SMP Mon Aug 18 18:12:14 MSD
2003</dd> <dt>date </dt> <dd>Mon 18 Aug 2003 20:24:16</dd> </dl> <p> In this test 80% of files are chosen from the 0-8k size range, 16% from the 0-80k size range, 0.8 x 4% from the 0-800k size range, etc. Most files are small, most bytes are in large files. </p> <p>Legend:</p> <ul> <li><tt>A</tt> reiser4</li> <li><tt>B</tt> reiser4, extents only</li> <li><tt>C</tt> ext3 in <tt>data=writeback</tt> mode (meta-data only journalling)</li> <li><tt>D</tt> ext3 in <tt>data=journal</tt> mode</li> <li><tt>E</tt> ext3 in <tt>data=ordered</tt> mode</li> <li><tt>F</tt> ext3 with htree (hashed directories)</li> </ul> <p> Table presents absolute values (of elapsed time, CPU usage, and disk usage) for reiser4, and ratios against reiser4 for all other configurations. <font color=red>Red</font> number means ratio is larger than <tt>1.0</tt>, that is, reiser4 is better in this test. <font color=green>Green</font> number means that reiser4 loses in this test. </p> <table cols=19 cellpadding=2 cellspacing=2 noborder> <tr><td bgcolor=black colspan=19><font color=white></td></tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>A.INFO_R4= FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>B.INFO_R4=ext MKFS=mkfs.reiser4 -q -o policy=extents FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>C.MOUNT_OPTIONS=data=writeback FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>D.MOUNT_OPTIONS=data=journal FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>E.MOUNT_OPTIONS=data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>F.MKFS=mkfs.ext3 -O dir_index MOUNT_OPTIONS=data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <td colspan=19 bgcolor=#606060><b><font color=white>#0:</font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td
colspan=6><b>REAL_TIME</b></td> <td colspan=6><b>CPU_TIME</b></td> <td colspan=6><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>CREATE</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 29.16</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.220 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.422 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.779 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.491 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.645 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>13.52</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.182 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.013 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.087 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.997 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.657</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 789364</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.208 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.181 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>COPY</b></td> <td bgcolor=#E0E0C0 
align=right><tt><U> 119.64</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.211 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.191 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.473 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.230 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 7.288 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>21.98</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.152 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.515 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.746 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.520 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.695</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1578116</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.208 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.182 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>READ</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 116.55</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.213 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.177 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.025 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.134 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 6.850 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>18.35</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.035 </font></tt></td> <td 
bgcolor=#E0E0C0 align=right><tt><font color=green> 0.447 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.436</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.431</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.569 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1578116</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.208 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.182 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>STATS</b></td> <td bgcolor=#E0E0C0 align=right><tt>21.65</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.050 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.779</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.811 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.782</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.358 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>7.56</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.001 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.599</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.612 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.611</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.638 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1578116</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.208 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 
</font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.182 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>DELETE</b></td> <td bgcolor=#E0E0C0 align=right><tt>112.37</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.434 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.179</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.198 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.177</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.281 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>30.62</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.851 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.205</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.205</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.203</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.230 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>4</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.000 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> </tt></td> </tr> <tr> <td colspan=19 bgcolor=#606060><b><font color=white>#1:DD_MBCOUNT=768 </font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=6><b>REAL_TIME</b></td> <td colspan=6><b>CPU_TIME</b></td> <td colspan=6><b>DF</b></td> 
</tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_writing_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 26.11</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.011</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.090 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.388 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.076 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.083 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>3.25</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.945</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.092 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.640 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.255 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.231 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 786436</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_reading_largefile</b></td> <td bgcolor=#E0E0C0 
align=right><tt><U> 19.09</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.005</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 0.999</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 0.996</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.004</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.011</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>2.09</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.019 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.847</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.856 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.833</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.842</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 786436</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr><td bgcolor=black colspan=19><font color=white></td></tr> <tr><td colspan=19 align=right> <tr> <td colspan=19 bgcolor=#303030><b><font color=white>NPROC=1 DIR=/mnt/testfs SYNC=off PHASE_COPY=cp REP_COUNTER=1 GAMMA=0.2 PHASE_OVERWRITE=off FILE_SIZE=4000 BYTES=512000000 PHASE_APPEND=off PHASE_READ=find DEV=/dev/hdb3 DD_MBCOUNT=768 WRITE_BUFFER=131072 PHASE_DELETE=rm PHASE_MODIFY=off </td></tr> <tr><td colspan=19 align=right> <font size=-2>Produced by <a 
href=http://namesys.com/benchmarks/mongo_readme.html>Mongo</a> benchmark suite.</font></td></tr> </table> === mongo, 2003-08-12 === [[mongo]] comparison against ext3 <dl> <dt>mem total</dt> <dd>513284</dd> <dt>machine </dt> <dd>strelka</dd> <dt>kernel </dt> <dd>2.6.0-test2 #52 SMP Tue Aug 12 15:17:12 MSD 2003</dd> <dt>date </dt> <dd>Tue Aug 12 15:38:47 2003</dd> </dl> <p> This is a comparison of the latest (2003-08-12) version of reiser4 with ext3. Reiser4 is an atomic filesystem, so the comparison with the data-journaling mode of ext3 is the fairest, but since most users run ext3 in data-ordered mode, we compare against that as well. </p> <p> In this test 80% of files are chosen from the 0-8k size range, 16% from the 0-80k size range, 0.8 x 4% from the 0-800k size range, and so on. Most files are small; most bytes are in large files. </p> <p>Legend:</p> <ul> <li><tt>A</tt> reiser4</li> <li><tt>B</tt> ext3 in <tt>data=writeback</tt> mode (metadata-only journaling)</li> <li><tt>C</tt> ext3 in <tt>data=journal</tt> mode</li> <li><tt>D</tt> ext3 in <tt>data=ordered</tt> mode</li> <li><tt>E</tt> ext3 with htree (hashed directories)</li> <li><tt>F</tt> ext3 with support for filetypes in <tt>readdir()</tt></li> </ul> <p> The table presents absolute values (of elapsed time, CPU usage, and disk usage) for reiser4, and ratios against reiser4 for all other configurations. A <font color=red>red</font> number means the ratio is larger than <tt>1.0</tt>, i.e. reiser4 is better in this test. A <font color=green>green</font> number means that reiser4 loses in this test.
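The size distribution above (each successive 10x-larger range receives a share that shrinks by the GAMMA=0.2 factor shown in the table footers) can be sketched in Python. This is an illustrative model of how such a workload can be generated, not mongo's actual code; the function name and parameters are assumptions.

```python
import random

def sample_file_size(base=8 * 1024, gamma=0.2, max_tier=6, rng=random):
    """Illustrative model of the size distribution described above:
    a file lands in the range 0..base with probability 1 - gamma,
    in 0..base*10 with probability (1 - gamma) * gamma, and so on.
    With gamma = 0.2 and base = 8k: 80% in 0-8k, 16% in 0-80k,
    0.8 x 4% = 3.2% in 0-800k, etc."""
    tier = 0
    # Escalate to the next 10x-larger size range with probability gamma.
    while tier < max_tier and rng.random() < gamma:
        tier += 1
    # Size is uniform within the chosen range (at least 1 byte).
    return rng.randrange(1, base * 10 ** tier + 1)
```

Sampling many sizes from this model reproduces the property the prose points out: most files are small, yet the rare large tiers account for most of the bytes.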
</p> <table cols=19 cellpadding=2 cellspacing=2 noborder> <tr><td bgcolor=black colspan=19><font color=white></td></tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>A.INFO_R4= MKFS=/usr/local/sbin/mkfs.reiser4 -qf FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>B.MOUNT_OPTIONS=data=writeback FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>C.MOUNT_OPTIONS=data=journal FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>D.MOUNT_OPTIONS=data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>E.MKFS=/usr/local/sbin/mkfs.ext3 -O dir_index MOUNT_OPTIONS=data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>F.MKFS=/usr/local/sbin/mkfs.ext3 -O filetype MOUNT_OPTIONS=data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <td colspan=19 bgcolor=#606060><b><font color=white>#0:</font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=6><b>REAL_TIME</b></td> <td colspan=6><b>CPU_TIME</b></td> <td colspan=6><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>CREATE</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 14.06</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.317 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.248 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.050 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font 
color=red> 3.016 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.077 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>5.3</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.558 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.692 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.602 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.823</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.592 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 458224</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>COPY</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 43.62</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.982 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.733 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.033 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 6.685 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.904 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>9.19</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.163 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.286 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.230 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.706</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.200 </font></tt></td> </tt></td> 
<td bgcolor=#E0E0C0 align=right><tt><U> 916172</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>READ</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 39.86</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.091 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.091 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.140 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 6.003 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.119 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>8.22</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.467 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.454 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.464 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.529 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.443</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 916172</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>STATS</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1.54</U></tt></td> <td 
bgcolor=#E0E0C0 align=right><tt><font color=red> 1.987 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.896 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.942 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.649 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.883 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 0.26</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.115 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.115 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.115 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.385 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.962 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 916172</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>DELETE</b></td> <td bgcolor=#E0E0C0 align=right><tt>37.85</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.833 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.825 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.867 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.133 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.760</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>11.11</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.223</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font 
color=green> <U> 0.223</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.220</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.254 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.222</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>4</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> </tt></td> </tr> <tr> <td colspan=19 bgcolor=#606060><b><font color=white>#1:DD_MBCOUNT=500 </font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=6><b>REAL_TIME</b></td> <td colspan=6><b>CPU_TIME</b></td> <td colspan=6><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_writing_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 42.15</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.062 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.534 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.066 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.071 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.073 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 
align=right><tt><U> 7.86</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.094 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.500 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.206 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.211 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.198 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 512004</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_reading_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 36.5</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.005</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.008</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.005</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.007</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.007</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>4.7</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.745</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.732</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.743</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.736</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.734</U> </font></tt></td> 
</tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 512004</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr><td bgcolor=black colspan=19><font color=white></td></tr> <tr><td colspan=19 align=right></td></tr> <tr> <td colspan=19 bgcolor=#303030><b><font color=white>NPROC=1 DIR=/data1 SYNC=off PHASE_COPY=cp REP_COUNTER=3 GAMMA=0.2 PHASE_OVERWRITE=off PHASE_STATS=find FILE_SIZE=8192 BYTES=134217728 PHASE_APPEND=off PHASE_READ=find DEV=/dev/hdb1 DD_MBCOUNT=500 WRITE_BUFFER=131072 PHASE_DELETE=rm PHASE_MODIFY=off </td></tr> <tr><td colspan=19 align=right> <font size=-2>Produced by <a href=http://namesys.com/benchmarks/mongo_readme.html>Mongo</a> benchmark suite.</font></td></tr> </table> === mongo, 2003-07-10 === [[mongo]] comparison, reiser4 vs. ext3, 2003-07-10, obtained before [http://mail.fsfeurope.org/pipermail/booth/2003-February/000083.html LinuxTAG 2003] <table cols=10 cellpadding=2 cellspacing=2 noborder> <tr><td bgcolor=black colspan=10><font color=white></td></tr> <tr> <th bgcolor=#303030 colspan=10 align=left><font color=white>A. reiser4</font></th> </tr> <tr> <th bgcolor=#303030 colspan=10 align=left><font color=white>B. ext3 data journalling</font></th> </tr> <tr> <th bgcolor=#303030 colspan=10 align=left><font color=white>C. 
ext3 </font></th> </tr> <tr> <td colspan=10 bgcolor=#606060><b><font color=white>#0:</font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=3><b>REAL_TIME</b></td> <td colspan=3><b>CPU_TIME</b></td> <td colspan=3><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>CREATE</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 14.19</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.221 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.592 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 5.66</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.610 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.475 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 458692</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>COPY</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 49.01</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.586 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.783 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 9.08</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.308 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.176 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 916668</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>READ</b></td> <td bgcolor=#E0E0C0 
align=right><tt>43.39</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.970</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.017 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>8.1</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.452</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.453</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 916668</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>STATS</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1.93</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.534 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.549 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 0.27</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.000 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.963 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 916668</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>DELETE</b></td> <td bgcolor=#E0E0C0 align=right><tt>40.13</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.797</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.837 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>11.26</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.217 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.210</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>4</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font 
color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> </tt></td> </tr> <tr> <td colspan=10 bgcolor=#606060><b><font color=white>#1:DD_MBCOUNT=500 </font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=3><b>REAL_TIME</b></td> <td colspan=3><b>CPU_TIME</b></td> <td colspan=3><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_writing_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 42.27</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.527 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.057 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 7.78</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.497 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.189 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 512004</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_reading_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 36.57</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.005</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.005</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>4.8</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.760</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.777 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 512004</U></tt></td> <td bgcolor=#E0E0C0 
align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr><td bgcolor=black colspan=10><font color=white></td></tr> <tr><td colspan=10 align=right></td></tr> <tr> <td colspan=10 bgcolor=#303030><b><font color=white>NPROC=1 DIR=/data1 SYNC=off PHASE_COPY=cp REP_COUNTER=3 GAMMA=0.2 PHASE_OVERWRITE=off PHASE_STATS=find FILE_SIZE=8192 BYTES=134217728 PHASE_APPEND=off PHASE_READ=find DEV=/dev/hdb1 DD_MBCOUNT=500 WRITE_BUFFER=131072 PHASE_DELETE=rm PHASE_MODIFY=off </td></tr> <tr><td colspan=10 align=right> <font size=-2>Produced by <a href=http://namesys.com/benchmarks/mongo_readme.html>Mongo</a> benchmark suite.</font></td></tr> </table> <hr> <a name="mongo.2003.07.10"></a> <p> Below are some older benchmarks from just before LinuxTAG. In these, note that gamma is the fraction of files that are larger than the base size by 10x. It is set either to 0.2 (as in the benchmark above), to mimic observed real-world usage patterns, or to 0, to measure a file-size range's performance in isolation. Note that V3 performs poorly in the 0-8k size range while V4 performs well; this is the result of deep design changes you can read about at <a href="http://www.namesys.com/v4/v4.html">http://www.namesys.com/v4/v4.html</a>. </p> <dl><dt>mem total</dt><dd>513748</dd><dt>machine </dt><dd>strelka</dd><dt>kernel </dt><dd>2.5.74 #213 SMP Thu Jul 10 22:53:23 MSD 2003</dd><dt>date </dt><dd>Thu Jul 10 22:48:56 2003</dd><dt>.config </dt><dd><a href="http://www.namesys.com/intbenchmarks/mongo/03.07.11.nikita/.config">here</a></dd><dt>NPROC</dt><dd>1</dd><dt>DIR</dt><dd>/data1</dd><dt>SYNC</dt><dd>off</dd><dt>REP_COUNTER</dt><dd>3</dd><dt>All phases are in readdir order</dt><dd></dd><dt>BYTES</dt><dd>100M</dd><dt>DEV</dt><dd>/dev/hdb1</dd><dt>WRITE_BUFFER</dt><dd><b>256k</b></dd></dl> <p>Everywhere, <b>A</b> is reiserfs and <b>B</b> is reiser4. 
Green numbers mean reiser4 is better.</p> <table cols="7" cellpadding="2" cellspacing="2" noborder=""> <tbody><tr><td bgcolor="black" colspan="7"><font color="white"></font></td></tr> <tr> <th bgcolor="#303030" colspan="7" align="left"><font color="white">median file size 8k</font></th> </tr> <tr align="center" bgcolor="#c0c0c0"> <td></td> <td colspan="2"><b>REAL_TIME</b></td> <td colspan="2"><b>CPU_TIME</b></td> <td colspan="2"><b>DF</b></td> </tr> <tr align="center" bgcolor="#c0c0c0"> <td></td> <td><b>A</b></td><td><b>B/A </b></td> <td><b>A</b></td><td><b>B/A </b></td> <td><b>A</b></td><td><b>B/A </b></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>CREATE</b></td> <td bgcolor="#e0e0c0" align="right"><tt>41.26</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.246</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>3.93</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.908</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>321632</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.961</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>COPY</b></td> <td bgcolor="#e0e0c0" align="right"><tt>154.09</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.504</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 5.17</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.217 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>642624</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.962</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>APPEND</b></td> <td bgcolor="#e0e0c0" align="right"><tt>282.09</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.573</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 6.6</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.392 </font></tt></td> <td bgcolor="#e0e0c0" 
align="right"><tt>944428</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 0.980</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>MODIFY</b></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 284.52</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 0.986</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 3.29</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.489 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 943592</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 0.981</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>OVERWRITE</b></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 298.19</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.263 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 5.33</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.608 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>943548</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.968</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>READ</b></td> <td bgcolor="#e0e0c0" align="right"><tt>245.22</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.940</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 3.85</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.753 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>943548</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.968</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>STATS</b></td> <td bgcolor="#e0e0c0" align="right"><tt>20.58</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.099</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 0.48</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.292 </font></tt></td> <td 
bgcolor="#e0e0c0" align="right"><tt>943548</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.968</u> </font></tt></td> </tr> <tr> <td colspan="7" bgcolor="#a0a0a0"><b><font color="white">GAMMA=0.2 FILE_SIZE=8192 <a href="http://www.namesys.com/intbenchmarks/mongo/03.07.11.nikita/8k.heavy.v3.profile">A profile</a> <a href="http://www.namesys.com/intbenchmarks/mongo/03.07.11.nikita/8k.heavy.v4.profile">B profile</a></font></b></td></tr> <tr><td bgcolor="white" colspan="7"><font color="white"></font></td></tr> <tr><td bgcolor="white" colspan="7"><font color="white"></font></td></tr> <tr><td bgcolor="white" colspan="7"><font color="white"></font></td></tr> <tr><td bgcolor="black" colspan="7"><font color="white"></font></td></tr> <tr> <th bgcolor="#303030" colspan="7" align="left"><font color="white">median file size 4k</font></th> </tr> <tr align="center" bgcolor="#c0c0c0"> <td></td> <td colspan="2"><b>REAL_TIME</b></td> <td colspan="2"><b>CPU_TIME</b></td> <td colspan="2"><b>DF</b></td> </tr> <tr align="center" bgcolor="#c0c0c0"> <td></td> <td><b>A</b></td><td><b>B/A </b></td> <td><b>A</b></td><td><b>B/A </b></td> <td><b>A</b></td><td><b>B/A </b></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>CREATE</b></td> <td bgcolor="#e0e0c0" align="right"><tt>117.32</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.176</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>15.57</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.758</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 667652</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.000</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>COPY</b></td> <td bgcolor="#e0e0c0" align="right"><tt>524.67</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.365</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 19.16</u></tt></td> <td bgcolor="#e0e0c0" 
align="right"><tt><font color="red"> 1.059 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 1332856</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.002</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>APPEND</b></td> <td bgcolor="#e0e0c0" align="right"><tt>1068.43</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.363</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>31.27</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.937</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>2073420</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.950</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>MODIFY</b></td> <td bgcolor="#e0e0c0" align="right"><tt>1081.23</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.670</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 18.61</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.048 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>2066536</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.953</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>OVERWRITE</b></td> <td bgcolor="#e0e0c0" align="right"><tt>1050.55</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.885</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 22.81</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.017</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>2066424</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.948</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>READ</b></td> <td bgcolor="#e0e0c0" align="right"><tt>974.43</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.644</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 
12.28</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.635 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>2066424</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.948</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>STATS</b></td> <td bgcolor="#e0e0c0" align="right"><tt>83.44</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.075</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>1.26</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.802</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>2066424</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.948</u> </font></tt></td> </tr> <tr> <td colspan="7" bgcolor="#a0a0a0"><b><font color="white">GAMMA=0.2 FILE_SIZE=4096 <a href="http://www.namesys.com/intbenchmarks/mongo/03.07.11.nikita/4k.heavy.v3.profile">A profile</a> <a href="http://www.namesys.com/intbenchmarks/mongo/03.07.11.nikita/4k.heavy.v4.profile">B profile</a></font></b></td></tr> <tr><td bgcolor="white" colspan="7"><font color="white"></font></td></tr> <tr><td bgcolor="white" colspan="7"><font color="white"></font></td></tr> <tr><td bgcolor="white" colspan="7"><font color="white"></font></td></tr> <tr><td bgcolor="black" colspan="7"><font color="white"></font></td></tr> <tr> <th bgcolor="#303030" colspan="7" align="left"><font color="white">maximal file size 4k</font></th> </tr> <tr align="center" bgcolor="#c0c0c0"> <td></td> <td colspan="2"><b>REAL_TIME</b></td> <td colspan="2"><b>CPU_TIME</b></td> <td colspan="2"><b>DF</b></td> </tr> <tr align="center" bgcolor="#c0c0c0"> <td></td> <td><b>A</b></td><td><b>B/A </b></td> <td><b>A</b></td><td><b>B/A </b></td> <td><b>A</b></td><td><b>B/A </b></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>CREATE</b></td> <td bgcolor="#e0e0c0" align="right"><tt>77.34</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.309</u> 
</font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>21.86</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.938</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>452252</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.923</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>COPY</b></td> <td bgcolor="#e0e0c0" align="right"><tt>412.28</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.300</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 35.11</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.013</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>893408</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.934</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>APPEND</b></td> <td bgcolor="#e0e0c0" align="right"><tt>1198.9</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.164</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>67.06</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.694</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>1631992</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.749</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>MODIFY</b></td> <td bgcolor="#e0e0c0" align="right"><tt>1305.14</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.351</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>43.77</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.762</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>1613124</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.758</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>OVERWRITE</b></td> <td bgcolor="#e0e0c0" align="right"><tt>1390.94</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font 
color="green"> <u> 0.239</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>44.22</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.777</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>1610948</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.759</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>READ</b></td> <td bgcolor="#e0e0c0" align="right"><tt>1093.6</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.256</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 19.46</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.743 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>1610948</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.759</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>STATS</b></td> <td bgcolor="#e0e0c0" align="right"><tt>115.76</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.200</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>2.6</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.735</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>1610948</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.759</u> </font></tt></td> </tr> <tr> <td colspan="7" bgcolor="#a0a0a0"><b><font color="white">GAMMA=0.0 FILE_SIZE=4096 <a href="http://www.namesys.com/intbenchmarks/mongo/03.07.11.nikita/100.heavy.v3.profile">A profile</a> <a href="http://www.namesys.com/intbenchmarks/mongo/03.07.11.nikita/100.heavy.v4.profile">B profile</a></font></b></td></tr> <tr><td bgcolor="white" colspan="7"><font color="white"></font></td></tr> <tr><td bgcolor="white" colspan="7"><font color="white"></font></td></tr> <tr><td bgcolor="white" colspan="7"><font color="white"></font></td></tr> <tr> <th bgcolor="#303030" colspan="7" align="left"><font color="white">median file size 8k</font></th> </tr> 
<tr align="center" bgcolor="#c0c0c0"> <td></td> <td colspan="2"><b>REAL_TIME</b></td> <td colspan="2"><b>CPU_TIME</b></td> <td colspan="2"><b>DF</b></td> </tr> <tr align="center" bgcolor="#c0c0c0"> <td></td> <td><b>A</b></td><td><b>B/A </b></td> <td><b>A</b></td><td><b>B/A </b></td> <td><b>A</b></td><td><b>B/A </b></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>CREATE</b></td> <td bgcolor="#e0e0c0" align="right"><tt>40.54</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.248</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>4.01</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.895</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>321632</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.961</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>COPY</b></td> <td bgcolor="#e0e0c0" align="right"><tt>152.82</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.506</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 5.2</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.215 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>642624</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.962</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>READ</b></td> <td bgcolor="#e0e0c0" align="right"><tt>141.8</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.563</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 3.03</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.762 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>642624</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.962</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>STATS</b></td> <td bgcolor="#e0e0c0" align="right"><tt>14.91</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.084</u> 
</font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 0.59</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.051 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>642624</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.962</u> </font></tt></td> </tr> <tr><td bgcolor="black" colspan="7"><font color="white"></font></td></tr> <tr><td colspan="7" align="right"> </td></tr><tr> <td colspan="7" bgcolor="#303030"><b><font color="white">GAMMA=0.2 FILE_SIZE=8192</font></b></td></tr> <tr><td bgcolor="white" colspan="7"><font color="white"></font></td></tr> <tr><td bgcolor="white" colspan="7"><font color="white"></font></td></tr> <tr><td bgcolor="white" colspan="7"><font color="white"></font></td></tr> <tr> <th bgcolor="#303030" colspan="7" align="left"><font color="white">median file size 4k</font></th> </tr> <tr align="center" bgcolor="#c0c0c0"> <td></td> <td colspan="2"><b>REAL_TIME</b></td> <td colspan="2"><b>CPU_TIME</b></td> <td colspan="2"><b>DF</b></td> </tr> <tr align="center" bgcolor="#c0c0c0"> <td></td> <td><b>A</b></td><td><b>B/A </b></td> <td><b>A</b></td><td><b>B/A </b></td> <td><b>A</b></td><td><b>B/A </b></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>CREATE</b></td> <td bgcolor="#e0e0c0" align="right"><tt>115.6</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.174</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>14.84</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.772</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 667652</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.000</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>COPY</b></td> <td bgcolor="#e0e0c0" align="right"><tt>528.83</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.361</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 18.91</u></tt></td> <td bgcolor="#e0e0c0" 
align="right"><tt><font color="red"> 1.058 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 1332856</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.002</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>READ</b></td> <td bgcolor="#e0e0c0" align="right"><tt>532.06</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.372</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 10.87</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.589 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 1332856</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.002</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>STATS</b></td> <td bgcolor="#e0e0c0" align="right"><tt>51.99</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.069</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>1.67</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.581</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 1332856</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.002</u> </font></tt></td> </tr> <tr><td bgcolor="black" colspan="7"><font color="white"></font></td></tr> <tr><td colspan="7" align="right"> </td></tr><tr> <td colspan="7" bgcolor="#303030"><b><font color="white">GAMMA=0.2 FILE_SIZE=4096</font></b></td></tr> <tr><td bgcolor="white" colspan="7"><font color="white"></font></td></tr> <tr><td bgcolor="white" colspan="7"><font color="white"></font></td></tr> <tr><td bgcolor="white" colspan="7"><font color="white"></font></td></tr> <tr> <th bgcolor="#303030" colspan="7" align="left"><font color="white">maximal file size 4k</font></th> </tr> <tr align="center" bgcolor="#c0c0c0"> <td></td> <td colspan="2"><b>REAL_TIME</b></td> <td colspan="2"><b>CPU_TIME</b></td> <td colspan="2"><b>DF</b></td> </tr> <tr align="center" bgcolor="#c0c0c0"> 
<td></td> <td><b>A</b></td><td><b>B/A </b></td> <td><b>A</b></td><td><b>B/A </b></td> <td><b>A</b></td><td><b>B/A </b></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>CREATE</b></td> <td bgcolor="#e0e0c0" align="right"><tt>77.5</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.309</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>22.24</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.910</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>452252</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.923</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>COPY</b></td> <td bgcolor="#e0e0c0" align="right"><tt>415.84</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.297</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 34.9</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.009</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>893408</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.934</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>READ</b></td> <td bgcolor="#e0e0c0" align="right"><tt>469.97</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.273</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 20.14</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.454 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>893408</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.934</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>STATS</b></td> <td bgcolor="#e0e0c0" align="right"><tt>65.49</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.162</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>3.09</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.599</u> </font></tt></td> <td bgcolor="#e0e0c0" 
align="right"><tt>893408</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.934</u> </font></tt></td> </tr> <tr><td bgcolor="black" colspan="7"><font color="white"></font></td></tr> <tr><td colspan="7" align="right"> </td></tr><tr> <td colspan="7" bgcolor="#303030"><b><font color="white">GAMMA=0.0 FILE_SIZE=4096</font></b></td></tr> </tbody></table> <hr> <h1>Mongo benchmark results</h1> <h2>create, copy, read, stats, delete phases</h2> <dl><dt>reiser4 </dt><dd>ChangeSet@1.1095, 2003-07-10 15:22:17+04:00, god@laputa.namesys.com oops ChangeSet@1.1094, 2003-07-10 15:14:06+04:00, god@laputa.namesys.com repairing compilation damage. </dd><dt>mem total</dt><dd>256624 kB</dd><dt>machine </dt><dd>belka</dd><dt>kernel </dt><dd>2.5.74 #28 Thu Jul 10 18:36:03 MSD 2003</dd><dt>date </dt><dd>Thu Jul 10 19:21:06 2003</dd><dt><a href="http://namesys.com/intbenchmarks/mongo/03.07.11.light/dot.config">.config</a></dt></dl> <table cols="19" cellpadding="2" cellspacing="2" noborder=""> <tbody><tr><td bgcolor="black" colspan="19"><font color="white"></font></td></tr> <tr> <th bgcolor="#303030" colspan="19" align="left"><font color="white">A.INFO_R4=test FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor="#303030" colspan="19" align="left"><font color="white">B.INFO_R4=test FSTYPE=reiser4 MKFS=mkfs.reiser4 -q -e extent40 </font></th> </tr> <tr> <th bgcolor="#303030" colspan="19" align="left"><font color="white">C.FSTYPE=reiserfs </font></th> </tr> <tr> <th bgcolor="#303030" colspan="19" align="left"><font color="white">D.FSTYPE=reiserfs MOUNT_OPTIONS=notail </font></th> </tr> <tr> <th bgcolor="#303030" colspan="19" align="left"><font color="white">E.FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor="#303030" colspan="19" align="left"><font color="white">F.FSTYPE=ext3 MOUNT_OPTIONS=data=journal </font></th> </tr> <tr> <td colspan="19" bgcolor="#606060"><b><font color="white">#0:FILE_SIZE=4000 </font></b></td></tr> <tr align="center" bgcolor="#c0c0c0"> <td></td> <td 
colspan="6"><b>REAL_TIME</b></td> <td colspan="6"><b>CPU_TIME</b></td> <td colspan="6"><b>DF</b></td> </tr> <tr align="center" bgcolor="#c0c0c0"> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>CREATE</b></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 20.47</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.404 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 3.037 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 2.024 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 2.513 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 3.324 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>12.72</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.143 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.270 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.873 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.615</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.606</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 416332</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.934 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.088 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.909 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.858 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.858 
</font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>COPY</b></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 65.25</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.484 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 2.953 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 2.020 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.986 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 2.267 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>21.98</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.032 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.098 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.732 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.529</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.699 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 832640</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.934 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.088 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.910 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.858 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.858 </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>READ</b></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 75.56</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.349 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 2.868 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 2.218 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.902 </font></tt></td> <td bgcolor="#e0e0c0" 
align="right"><tt><font color="red"> 1.925 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>17.36</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.213 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.745 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.857 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.695 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.681</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 832640</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.934 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.088 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.910 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.858 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.858 </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>STATS</b></td> <td bgcolor="#e0e0c0" align="right"><tt>132.18</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> 0.996 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.963</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> 0.994 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.967</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.950</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>2.63</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.977</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.970</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 0.989</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 0.981</u> 
</font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> 1.008 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 832640</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.934 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.088 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.910 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.858 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.858 </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>DELETE</b></td> <td bgcolor="#e0e0c0" align="right"><tt>85.32</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.627 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.239 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.442 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.403</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.449 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>33.57</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.856 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.780 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.623 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.157</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.154</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>4</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> 1.000 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.000</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.000</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font 
color="green"> <u> 0.000</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.000</u> </font></tt></td> </tr> <tr> <td colspan="19" bgcolor="#606060"><b><font color="white">#1:FILE_SIZE=8000 </font></b></td></tr> <tr align="center" bgcolor="#c0c0c0"> <td></td> <td colspan="6"><b>REAL_TIME</b></td> <td colspan="6"><b>CPU_TIME</b></td> <td colspan="6"><b>DF</b></td> </tr> <tr align="center" bgcolor="#c0c0c0"> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>CREATE</b></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 15.07</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.009</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 8.875 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.709 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 2.237 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 3.321 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>8.62</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.945 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.932 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.729 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.517</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.522</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 399788</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.000</u> 
</font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.243 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.461 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.434 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.434 </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>COPY</b></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 52.24</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.007</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 4.998 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.492 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.562 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.879 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>13.42</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.026 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.264 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.700 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.487</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.635 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 799488</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.000</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.243 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.461 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.434 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.434 </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>READ</b></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 60.91</u></tt></td> <td bgcolor="#e0e0c0" 
align="right"><tt><font color="black"> <u> 1.013</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 3.738 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.606 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.333 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.340 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>11.66</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> 1.018 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.526</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.749 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.547 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.547 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 799488</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.000</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.243 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.461 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.434 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.434 </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>STATS</b></td> <td bgcolor="#e0e0c0" align="right"><tt>126.53</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.951</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.958</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> 0.991 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> 1.004 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.966</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 
2.57</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.023 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.027 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 0.988</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> 1.016 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> 1.012 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 799488</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.000</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.243 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.461 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.434 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.434 </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>DELETE</b></td> <td bgcolor="#e0e0c0" align="right"><tt>73.21</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.116 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.746 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.242</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.301 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.396 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>19.93</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> 1.013 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.584 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.530 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.126 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.123</u> </font></tt></td> <td bgcolor="#e0e0c0" 
align="right"><tt>4</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> 1.000 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.000</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.000</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.000</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.000</u> </font></tt></td> </tr> <tr><td bgcolor="black" colspan="19"><font color="white"></font></td></tr> <tr><td colspan="19" align="right"> </td></tr><tr> <td colspan="19" bgcolor="#303030"><b><font color="white">PHASE_APPEND=off NPROC=1 DIR=/mnt/testfs SYNC=off REP_COUNTER=3 GAMMA=0.0 PHASE_OVERWRITE=off DEV=/dev/hdb3 WRITE_BUFFER=4096 BYTES=128000000 PHASE_MODIFY=off </font></b></td></tr> <tr><td colspan="19" align="right"> <font size="-2">Produced by <a href="http://namesys.com/benchmarks/mongo_readme.html">Mongo</a> benchmark suite.</font></td></tr> </tbody></table> <h2>dd of a large file phase</h2> <dl><dt>reiser4 </dt><dd>ChangeSet@1.1095, 2003-07-10 15:22:17+04:00, god@laputa.namesys.com oops ChangeSet@1.1094, 2003-07-10 15:14:06+04:00, god@laputa.namesys.com repairing compilation damage. 
</dd><dt>mem total</dt><dd>256624</dd><dt>machine </dt><dd>belka</dd><dt>kernel </dt><dd>2.5.74 #28 Thu Jul 10 18:36:03 MSD 2003</dd><dt>date </dt><dd>Thu Jul 10 21:36:22 2003</dd><dt><a href="http://namesys.com/intbenchmarks/mongo/03.07.11.light/dot.config">.config</a></dt></dl> <table cols="19" cellpadding="2" cellspacing="2" noborder=""> <tbody><tr><td bgcolor="black" colspan="19"><font color="white"></font></td></tr> <tr> <th bgcolor="#303030" colspan="19" align="left"><font color="white">A.INFO_R4=test FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor="#303030" colspan="19" align="left"><font color="white">B.INFO_R4=test FSTYPE=reiser4 MKFS=mkfs.reiser4 -q -e extent40 </font></th> </tr> <tr> <th bgcolor="#303030" colspan="19" align="left"><font color="white">C.FSTYPE=reiserfs </font></th> </tr> <tr> <th bgcolor="#303030" colspan="19" align="left"><font color="white">D.FSTYPE=reiserfs MOUNT_OPTIONS=notail </font></th> </tr> <tr> <th bgcolor="#303030" colspan="19" align="left"><font color="white">E.FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor="#303030" colspan="19" align="left"><font color="white">F.FSTYPE=ext3 MOUNT_OPTIONS=data=journal </font></th> </tr> <tr> <td colspan="19" bgcolor="#606060"><b><font color="white">#0:DD_MBCOUNT=768 </font></b></td></tr> <tr align="center" bgcolor="#c0c0c0"> <td></td> <td colspan="6"><b>REAL_TIME</b></td> <td colspan="6"><b>CPU_TIME</b></td> <td colspan="6"><b>DF</b></td> </tr> <tr align="center" bgcolor="#c0c0c0"> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>dd_writing_largefile</b></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 76.29</u></tt></td> <td bgcolor="#e0e0c0" 
align="right"><tt><font color="black"> <u> 0.997</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.137 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.149 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.062 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 2.217 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>7.47</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.027 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.545</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.549</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.803 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.835 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 786432</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.000</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.001</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.001</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.001</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.001</u> </font></tt></td> </tr> <tr><td bgcolor="black" colspan="19"><font color="white"></font></td></tr> <tr><td colspan="19" align="right"> </td></tr><tr> <td colspan="19" bgcolor="#303030"><b><font color="white">NPROC=1 DIR=/mnt/testfs SYNC=off REP_COUNTER=3 GAMMA=0.0 DD_MBCOUNT=768 DEV=/dev/hdb3 WRITE_BUFFER=4096 FILE_SIZE=8000 BYTES=128000000 </font></b></td></tr> <tr><td colspan="19" align="right"> <font size="-2">Produced by <a href="http://namesys.com/benchmarks/mongo_readme.html">Mongo</a> benchmark suite.</font></td></tr> </tbody></table> === bonnie++ 2003-09-30 === Bonnie++ 
comparison, ext3 vs reiser4 (2003-09-30) This is bonnie++ output for reiser4 and ext3. This has been done in an attempt to analyze <a href="http://fsbench.netnation.com/">results</a> obtained by Mike Benoit. Hardware specs: <pre> processor : 3 vendor_id : GenuineIntel cpu family : 15 model : 2 model name : Intel(R) Xeon(TM) CPU 2.40GHz stepping : 7 cpu MHz : 2379.253 cache size : 512 KB bogomips : 4751.36 </pre> Dual CPU with hyper-threading Memory: 128M HDD: <pre> # hdparm /dev/hdb1 /dev/hdb1: multcount = 16 (on) IO_support = 0 (default 16-bit) unmaskirq = 0 (off) using_dma = 1 (on) keepsettings = 0 (off) readonly = 0 (off) readahead = 256 (on) geometry = 65535/16/63, sectors = 117226242, start = 63 # hdparm -t /dev/hdb1 /dev/hdb1: Timing buffered disk reads: 64 MB in 1.60 seconds = 39.91 MB/sec # hdparm -i /dev/hdb /dev/hdb: Model=ST360021A, FwRev=3.19, SerialNo=3HR173RB Config={ HardSect NotMFM HdSw>15uSec Fixed DTR>10Mbs RotSpdTol>.5% } RawCHS=16383/16/63, TrkSize=0, SectSize=0, ECCbytes=4 BuffType=unknown, BuffSize=2048kB, MaxMultSect=16, MultSect=16 CurCHS=16383/16/63, CurSects=16514064, LBA=yes, LBAsects=117231408 IORDY=on/off, tPIO={min:240,w/IORDY:120}, tDMA={min:120,rec:120} PIO modes: pio0 pio1 pio2 pio3 pio4 DMA modes: mdma0 mdma1 mdma2 UDMA modes: udma0 udma1 udma2 udma3 udma4 *udma5 AdvancedPM=no WriteCache=enabled Drive conforms to: device does not report version: 1 2 3 4 5 </pre> <pre> ./bonnie++ -s 1g -n 10 -x 5 Version 1.03 ------Sequential Output------ --Sequential Input- --Random- -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks-- Machine Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP v4.128M 1G 19903 89 37911 20 15392 11 13624 58 41807 12 131.0 0 v4.128M 1G 19965 89 37600 20 15845 11 13730 58 41751 12 130.0 0 v4.128M 1G 19937 89 37746 20 15404 11 13624 58 41793 12 132.1 0 v4.128M 1G 19998 89 37184 19 15007 10 13393 56 41611 11 130.2 0 v4.128M 1G 19771 89 37679 20 15206 11 13466 57 41808 11 130.2 1 ext3.128M 1G 21236 99 
37258 22 11357 4 13460 56 41748 6 120.0 0 ext3.128M 1G 20821 99 36838 23 12176 5 13154 55 40671 6 120.7 0 ext3.128M 1G 20755 99 37032 24 12069 4 12908 54 40851 5 120.2 0 ext3.128M 1G 20651 99 37094 24 11817 5 13038 54 40842 6 121.3 0 ext3.128M 1G 20928 99 37300 23 12287 4 13067 55 41404 6 120.1 0 ------Sequential Create------ --------Random Create-------- -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete-- files:max:min /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP v4.128M 10 18503 100 +++++ +++ 9488 99 10158 99 +++++ +++ 11635 99 v4.128M 10 19760 99 +++++ +++ 9696 99 10441 100 +++++ +++ 11831 99 v4.128M 10 19583 100 +++++ +++ 9672 100 10597 99 +++++ +++ 11846 100 v4.128M 10 19720 100 +++++ +++ 9577 99 10126 100 +++++ +++ 11924 100 v4.128M 10 19682 100 +++++ +++ 9683 100 10461 100 +++++ +++ 11834 100 ext3.128M 10 3279 97 +++++ +++ +++++ +++ 3406 100 +++++ +++ 8951 95 ext3.128M 10 3303 98 +++++ +++ +++++ +++ 3423 99 +++++ +++ 8558 96 ext3.128M 10 3317 98 +++++ +++ +++++ +++ 3402 100 +++++ +++ 8721 93 ext3.128M 10 3325 98 +++++ +++ +++++ +++ 3390 100 +++++ +++ 9242 100 ext3.128M 10 3315 97 +++++ +++ +++++ +++ 3439 100 +++++ +++ 8896 96 </pre> <pre> ./bonnie++ -f -d . 
-s 3072 -n 10:100000:10:10 -x 1 Version 1.03 ------Sequential Output------ --Sequential Input- --Random- -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks-- Machine Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP v4 3G 37579 19 15657 11 41531 11 105.8 0 v4 3G 37993 20 15478 11 41632 11 105.4 0 ext3 3G 35221 22 10987 4 41105 6 90.9 0 ext3 3G 35099 22 11517 4 41416 6 90.7 0 ------Sequential Create------ --------Random Create-------- -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete-- files:max:min /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP v4 10:100000:10/10 570 39 746 17 1435 23 513 40 104 2 951 15 v4 10:100000:10/10 566 40 765 17 1385 23 509 41 104 2 904 14 ext3 10:100000:10/10 221 8 364 4 853 4 204 7 99 1 306 2 ext3 10:100000:10/10 221 7 368 4 839 5 206 7 91 1 309 2 </pre> <hr> <a name="grant"></a> Benchmarks performed by <a href="mailto:mine0057@mrs.umn.edu">Grant Miner</a>. He used <a href="http://epoxy.mrs.umn.edu/~minerg/fstests/bench.scm">bench.scm</a> script (requires <a href="http://www.scsh.net/">scsh</a>). Results (copied from <a href="http://epoxy.mrs.umn.edu/~minerg/fstests/results.html">http://epoxy.mrs.umn.edu/~minerg/fstests/results.html</a>): <p>2.6.0-test3</p> <p>mkfs ran with default options</p> <p>Each test has three columns. First is a canoninical name of the test, with time test took in seconds. Second column is system cpu time. Third column is user cpu time. Last column "total" is total time; sys is total sys time; usr is total usr time; total cpu is sum of total sys time and total usr time. 
</p> <p><b>all values are in seconds thus lower is better</b></p> <table border cellspacing=0 cellpadding=5> <caption>Filesystem Performance</caption> <colgroup> <col> <col bgcolor="gray"> </colgroup> <tr> <th>fs</th> <td bgcolor="lightgray">bigdir</td> <td>sys</td> <td>usr</td> <td bgcolor="lightgray">cp</td> <td>sys</td> <td>usr</td> <td bgcolor="lightgray">cp2</td> <td>sys</td> <td>usr</td> <td bgcolor="lightgray">cp3</td> <td>sys</td> <td>usr</td> <td bgcolor="lightgray">cp4</td> <td>sys</td> <td>usr</td> <td bgcolor="lightgray">cp5</td> <td>sys</td> <td>usr</td> <td bgcolor="lightgray">rm</td> <td>sys</td> <td>usr</td> <td bgcolor="lightgray">rm2</td> <td>sys</td> <td>usr</td> <td bgcolor="lightgray">rm3</td> <td>sys</td> <td>usr</td> <td bgcolor="lightgray">sync</td> <td>sys</td> <td>usr</td> <td bgcolor="lightgray">total</td> <td>sys</td> <td>usr</td> <td bgcolor="lightgray">total cpu</td> <th>fs</th> </tr> <tr> <th>reiserfs</th> <td bgcolor="lightgray">40.03</td> <td>12.22</td> <td>0.76</td> <td bgcolor="lightgray">77.75</td> <td>10.72</td> <td>0.45</td> <td bgcolor="lightgray">62.9</td> <td>10.82</td> <td>0.43</td> <td bgcolor="lightgray">60.26</td> <td>11.03</td> <td>0.43</td> <td bgcolor="lightgray">61.33</td> <td>11.13</td> <td>0.43</td> <td bgcolor="lightgray">66.08</td> <td>11.31</td> <td>0.45</td> <td bgcolor="lightgray">10.86</td> <td>3.74</td> <td>0.07</td> <td bgcolor="lightgray">4.62</td> <td>3.36</td> <td>0.09</td> <td bgcolor="lightgray">8.22</td> <td>3.5</td> <td>0.09</td> <td bgcolor="lightgray">1.78</td> <td>0.03</td> <td>0.</td> <td bgcolor="lightgray">393.83</td> <td>77.86</td> <td>3.2</td> <td bgcolor="lightgray">81.06</td> <th>reiserfs</th> </tr> <tr> <th>jfs</th> <td bgcolor="lightgray">47.2</td> <td>8.9</td> <td>0.77</td> <td bgcolor="lightgray">109.75</td> <td>5.5</td> <td>0.3</td> <td bgcolor="lightgray">110.71</td> <td>5.49</td> <td>0.35</td> <td bgcolor="lightgray">114.69</td> <td>5.6</td> <td>0.29</td> <td 
bgcolor="lightgray">117.97</td> <td>5.65</td> <td>0.35</td> <td bgcolor="lightgray">125.48</td> <td>5.82</td> <td>0.29</td> <td bgcolor="lightgray">38.68</td> <td>0.74</td> <td>0.05</td> <td bgcolor="lightgray">16.25</td> <td>1.08</td> <td>0.07</td> <td bgcolor="lightgray">37.46</td> <td>0.74</td> <td>0.04</td> <td bgcolor="lightgray">0.07</td> <td>0.</td> <td>0.</td> <td bgcolor="lightgray">718.26</td> <td>39.52</td> <td>2.51</td> <td bgcolor="lightgray">42.03</td> <th>jfs</th> </tr> <tr> <th>xfs</th> <td bgcolor="lightgray">44.77</td> <td>13.3</td> <td>0.94</td> <td bgcolor="lightgray">105.36</td> <td>13.33</td> <td>0.53</td> <td bgcolor="lightgray">110.27</td> <td>14.36</td> <td>0.5</td> <td bgcolor="lightgray">110.17</td> <td>14.37</td> <td>0.51</td> <td bgcolor="lightgray">111.03</td> <td>14.43</td> <td>0.53</td> <td bgcolor="lightgray">118.84</td> <td>14.87</td> <td>0.55</td> <td bgcolor="lightgray">31.85</td> <td>6.44</td> <td>0.15</td> <td bgcolor="lightgray">15.2</td> <td>5.45</td> <td>0.14</td> <td bgcolor="lightgray">34.32</td> <td>5.87</td> <td>0.14</td> <td bgcolor="lightgray">0.03</td> <td>0.</td> <td>0.</td> <td bgcolor="lightgray">681.84</td> <td>102.42</td> <td>3.99</td> <td bgcolor="lightgray">106.41</td> <th>xfs</th> </tr> <tr> <th>reiser4</th> <td bgcolor="lightgray">33.51</td> <td>10.85</td> <td>0.69</td> <td bgcolor="lightgray">33.9</td> <td>10.65</td> <td>0.65</td> <td bgcolor="lightgray">32.9</td> <td>10.79</td> <td>0.67</td> <td bgcolor="lightgray">34.</td> <td>10.87</td> <td>0.65</td> <td bgcolor="lightgray">33.62</td> <td>10.87</td> <td>0.69</td> <td bgcolor="lightgray">31.31</td> <td>10.83</td> <td>0.76</td> <td bgcolor="lightgray">17.45</td> <td>4.07</td> <td>0.3</td> <td bgcolor="lightgray">11.54</td> <td>4.49</td> <td>0.3</td> <td bgcolor="lightgray">13.08</td> <td>4.27</td> <td>0.27</td> <td bgcolor="lightgray">0.52</td> <td>0.</td> <td>0.</td> <td bgcolor="lightgray">241.83</td> <td>77.69</td> <td>4.98</td> <td 
bgcolor="lightgray">82.67</td> <th>reiser4</th> </tr> <tr> <th>ext3</th> <td bgcolor="lightgray">38.79</td> <td>9.35</td> <td>0.7</td> <td bgcolor="lightgray">91.57</td> <td>7.21</td> <td>0.36</td> <td bgcolor="lightgray">62.6</td> <td>7.44</td> <td>0.36</td> <td bgcolor="lightgray">62.74</td> <td>7.5</td> <td>0.37</td> <td bgcolor="lightgray">60.62</td> <td>7.52</td> <td>0.34</td> <td bgcolor="lightgray">69.82</td> <td>7.59</td> <td>0.39</td> <td bgcolor="lightgray">26.21</td> <td>1.67</td> <td>0.05</td> <td bgcolor="lightgray">8.73</td> <td>1.66</td> <td>0.04</td> <td bgcolor="lightgray">13.79</td> <td>1.63</td> <td>0.06</td> <td bgcolor="lightgray">4.76</td> <td>0.01</td> <td>0.</td> <td bgcolor="lightgray">439.63</td> <td>51.58</td> <td>2.67</td> <td bgcolor="lightgray">54.25</td> <th>ext3</th> </tr> <tr> <th>ext2</th> <td bgcolor="lightgray">32.78</td> <td>7.61</td> <td>0.64</td> <td bgcolor="lightgray">37.28</td> <td>5.24</td> <td>0.34</td> <td bgcolor="lightgray">43.55</td> <td>5.34</td> <td>0.35</td> <td bgcolor="lightgray">45.41</td> <td>5.34</td> <td>0.37</td> <td bgcolor="lightgray">47.72</td> <td>5.48</td> <td>0.34</td> <td bgcolor="lightgray">50.5</td> <td>5.41</td> <td>0.32</td> <td bgcolor="lightgray">16.28</td> <td>0.67</td> <td>0.06</td> <td bgcolor="lightgray">7.54</td> <td>0.66</td> <td>0.05</td> <td bgcolor="lightgray">15.31</td> <td>0.71</td> <td>0.05</td> <td bgcolor="lightgray">0.24</td> <td>0.</td> <td>0.</td> <td bgcolor="lightgray">296.61</td> <td>36.46</td> <td>2.52</td> <td bgcolor="lightgray">38.98</td> <th>ext2</th> </tr> </table> <hr> </body> </html> <hr> <address><a href="mailto:reiser@namesys.com">Hans Reiser</a></address> <!-- Created: Sat Aug 23 00:28:46 MSD 2003 --> <!-- hhmts start --> Last modified: Thu Nov 20 17:51:10 MSK 2003 <!-- hhmts end --> </body> <SCRIPT language="Javascript"> <!-- // FILE ARCHIVED ON 20061113154648 AND RETRIEVED FROM THE // INTERNET ARCHIVE ON 20090625075531. 
// JAVASCRIPT APPENDED BY WAYBACK MACHINE, COPYRIGHT INTERNET ARCHIVE. // ALL OTHER CONTENT MAY ALSO BE PROTECTED BY COPYRIGHT (17 U.S.C. // SECTION 108(a)(3)). var sWayBackCGI = "http://web.archive.org/web/20061113154648/"; function xResolveUrl(url) { var image = new Image(); image.src = url; return image.src; } function xLateUrl(aCollection, sProp) { var i = 0; for(i = 0; i < aCollection.length; i++) { var url = aCollection[i][sProp]; if (typeof(url) == "string") { if (url.indexOf("mailto:") == -1 && url.indexOf("javascript:") == -1 && url.length > 0) { if(url.indexOf("http") != 0) { url = xResolveUrl(url); } url = url.replace('.wstub.archive.org',''); aCollection[i][sProp] = sWayBackCGI + url; } } } } xLateUrl(document.getElementsByTagName("IMG"),"src"); xLateUrl(document.getElementsByTagName("A"),"href"); xLateUrl(document.getElementsByTagName("AREA"),"href"); xLateUrl(document.getElementsByTagName("OBJECT"),"codebase"); xLateUrl(document.getElementsByTagName("OBJECT"),"data"); xLateUrl(document.getElementsByTagName("APPLET"),"codebase"); xLateUrl(document.getElementsByTagName("APPLET"),"archive"); xLateUrl(document.getElementsByTagName("EMBED"),"src"); xLateUrl(document.getElementsByTagName("BODY"),"background"); xLateUrl(document.getElementsByTagName("TD"),"background"); xLateUrl(document.getElementsByTagName("INPUT"),"src"); var forms = document.getElementsByTagName("FORM"); if (forms) { var j = 0; for (j = 0; j < forms.length; j++) { f = forms[j]; if (typeof(f.action) == "string") { if(typeof(f.method) == "string") { if(typeof(f.method) != "post") { f.action = sWayBackCGI + f.action; } } } } } //--> </SCRIPT> </html> [[category:Reiser4]] [[category:formatting-fixes-needed]] c34becc6a1c31aefc570392bb33e7f6906f00a08 1626 1625 2009-08-31T07:26:42Z Chris goe 2 formatting fixes == Benchmarks Of Reiser4 == The <tt>htree</tt> (<tt>-O dir_index</tt>) feature is the recent attempt by ext3 developers to handle large directories as well as ReiserFS by using better 
than linear search algorithms. One of the interesting results in this benchmark was that <tt>htree</tt> does bad things to ext3 performance, at least for this benchmark. This means that trying to have usable performance for large directories with ext3 can severely impact your performance for the non-large case. You'll note that in our latest benchmark at the top here we use larger filesets. It seems that ext3 does a poor job of utilizing its write cache for the case where the fileset uses a lot of memory without exceeding it, and by increasing the size of the fileset we get a fairer (read, better for ext3) benchmark for the create phase. The use of filesets small enough to barely fit into RAM for the create (but not the copy) phase was due to my being lax in supervising the benchmarking, but it did reveal something interesting. Probably Andrew Morton will fix that pretty quick --- it's most likely not a deep fix to make like fixing <tt>htree</tt> would be. If anyone knows where the tail combining patch for ext3 went to, let us know so we can benchmark that.... good tail combining performance is not trivial to get right and I am wondering if there is a performance reason it did not go in. Keep in mind that these benchmarks are still evolving and maturing, and I need to give the mongo code a complete review again as it has been worked on by others quite a bit. Note that while I like the mongo benchmarks, those who are concerned it may be stacked in our favor can look at the benchmarks run by others on lkml, one of which is at the bottom of this, which while not as elaborate and detailed as mongo, comes up with roughly the same result. Andrew Morton wrote some beautiful readahead code in VM, many thanks to him for what it contributes to V4 performance, unfortunately it should be confessed that these benchmarks utterly fail to measure its cleverness for real world usage patterns. 
In fact, these benchmarks basically access everything once in each pass, which is not at all realistic in representing typical server workloads. So understand them as validly illuminating some aspects of performance, not all aspects, if you could be so generous. We ran data ordered ext3 benchmarks at the suggestion of Andrew Morton, but they came out slower for this benchmark. We need to increase the base size range to 8k and run again. [[Reiser4]] is a fully atomic filesystem, keep in mind that these performance numbers are with every FS operation performed as a fully atomic transaction. We are the first to make that performance effective to do. Look for a user space transactions interface to come out soon. Finally, remember that Reiser4 is more space efficient than [[ReiserFS]], the <tt>df(1)</tt> measurements are there for looking at....;-) === mongo 2.6.15-mm4 === [[Mongo]] comparison, ext3 vs reiser4 with "unixfile" regular file plugin and reiser4 with "cryptcompress" regular file plugin Comparative results of mongo benchmark for ext3 vs reiser4 with "unixfile" regular file plugin vs reiser4 with [ftp://ftp.namesys.com/pub/tmp/cryptcompress_patches cryptcompress] regular file plugin. 
<dl> <dt>reiser4 </dt> <dd>2.6.15-mm4 cryptcompress-4.patch</dd> <dt>mem total</dt> <dd>516312</dd> <dt>machine </dt> <dd>Intel(R) Xeon(TM) CPU 2.40GHz, <b>running UP kernel</b></dd> <dt>kernel </dt> <dd>2.6.15-mm4 #1 Sat Feb 11 20:00:11 MSK 2006</dd> <dt>date </dt> <dd>Sat Feb 11 21:03:21 2006</dd> <dd>Sat Feb 11 21:18:43 2006</dd> <dd>Sat Feb 11 21:37:52 2006</dd> </dl> <p>Legend:</p> <ul> <li><tt>A</tt> reiser4 with "cryptcompress" regular file plugin</li> <li><tt>B</tt> reiser4 with "unixfile" regular file plugin</li> <li><tt>C</tt> ext3</li> </ul> <p> Table presents absolute values (of elapsed time, CPU usage, CPU utilization, disk usage) for reiser4 with "cryptcompress" regular file plugin, and ratios against this reiser4 for reiser4 with "unixfile" regular file plugin and ext3. <font color=red>Red</font> number means ratio is larger than <tt>1.0</tt>, that is, reiser4 with "cryptcompress" regular file plugin is better in this test. <font color=green>Green</font> number means that it loses in this test. 
</p> <table cols=13 cellpadding=2 cellspacing=2 noborder> <tr><td bgcolor=black colspan=13><font color=white></td></tr> <tr> <th bgcolor=#303030 colspan=13 align=left><font color=white>A.MKFS=mkfs.reiser4 -y -o create=create_ccreg40,compressMode=col8 MOUNT_OPTIONS=noatime FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=13 align=left><font color=white>B.MKFS=mkfs.reiser4 -y MOUNT_OPTIONS=noatime FSTYPE=reiser4 (unixfile regular file plugin)</font></th> </tr> <tr> <th bgcolor=#303030 colspan=13 align=left><font color=white>C.MOUNT_OPTIONS=noatime,data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <td colspan=13 bgcolor=#606060><b><font color=white>#0:</font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=3><b>REAL_TIME</b></td> <td colspan=3><b>CPU_TIME</b></td> <td colspan=3><b>CPU_UTIL</b></td> <td colspan=3><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>CREATE</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 53.36</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.234 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 4.249 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>28.79</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.493</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>94.36</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.255 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.155</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 775856</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font 
color=red> 2.550 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.825 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>COPY</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 137.6</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.543 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.931 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>40.91</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.716</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.975 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>59.94</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.257 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.183</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1551756</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.550 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.825 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>READ</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 161.17</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.087 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.077 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>48.35</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.433 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.195</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>33.23</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.487 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.291</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1551756</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.550 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font 
color=red> 2.825 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>STATS</b></td> <td bgcolor=#E0E0C0 align=right><tt>24.12</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.936</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.927</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>6.76</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.941 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.624</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>27.97</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.005 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.676</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1551756</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.550 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.825 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>DELETE</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 155.26</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.091 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 0.989</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>38.76</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.824 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.108</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>26.33</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.758 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.104</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>4</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.000 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> </tt></td> </tr> <tr> <td 
colspan=13 bgcolor=#606060><b><font color=white>#1:DD_MBCOUNT=5000 </font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=3><b>REAL_TIME</b></td> <td colspan=3><b>CPU_TIME</b></td> <td colspan=3><b>CPU_UTIL</b></td> <td colspan=3><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_writing_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 116.02</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.430 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.553 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>38.65</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.514</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.619 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>92.86</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.155 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.149</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1909012</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.682 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.685 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_reading_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 153.76</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 0.996</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>58.11</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.192 </font></tt></td> <td bgcolor=#E0E0C0 
align=right><tt><font color=green> <U> 0.147</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>38.73</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.224 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.152</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1909012</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.682 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.685 </font></tt></td> </tt></td> </tr> <tr><td bgcolor=black colspan=13><font color=white></td></tr> <tr><td colspan=13 align=right> <tr> <td colspan=13 bgcolor=#303030><b><font color=white>DIR=/mnt1 GAMMA=0.2 WRITE_BUFFER=131072 PHASE_APPEND=off SYNC=off PHASE_DELETE=rm NPROC=1 DEV=/dev/hda9 DD_MBCOUNT=5000 FILE_SIZE=8192 REP_COUNTER=1 PHASE_COPY=cp INFO_R4=2.6.15-mm4 cryptcompress-4.patch PHASE_READ=find BYTES=1024000000 PHASE_OVERWRITE=off PHASE_MODIFY=off </td></tr> <tr><td colspan=13 align=right> <font size=-2>Produced by <a href=http://namesys.com/benchmarks/mongo_readme.html>Mongo</a> benchmark suite.</font></td></tr> </table> <!-- <p><b>Legend:</b> <font color="green">green</font> color means the result is better (less) than reference value from the first column, results marked as <font color="red">red</font> are worse than reference value, best results are <u>underlined</u> other results which fit into 2% margin of the best result are underlined also.</p> --><p><a href="http://www.namesys.com/intbenchmarks/mongo/06.02.11.belka.crc/charts/comp.html">The same results in the charts</a></p> === mongo 2.6.11 === [[mongo]] comparison against xfs and ext2 <dl> <dt>reiser4 </dt> <dd>reiser4-for-2.6.11-5.patch from <a href="ftp://ftp.namesys.com/pub/reiser4-for-2.6/2.6.11">ftp://ftp.namesys.com/pub/reiser4-for-2.6/2.6.11</a> </dd> <dt>mem total</dt> <dd>254496</dd> <dt>machine </dt> <dd>bones</dd> <dt>kernel </dt> <dd>2.6.11-reiser4-5 #2 SMP Sat Jun 4 20:06:47 MSD 2005</dd> 
<dt>date </dt> <dd>Fri Jun 17 23:52:17 2005</dd> </dl> <p> In this test, 81% of the files are chosen from the 0-10k size range and 19% from the 10-100k size range. </p> <!-- File stats: Units are decimal (1k = 1000) files 0-100 : 1433 files 100-1K : 12597 files 1K-10K : 103101 files 10K-100K : 28131 files 100K-1M : 0 files 1M-10M : 0 files 10M-larger : 0 total bytes written : 1886585039 --> <p>Legend:</p> <ul> <li><tt>A</tt> reiser4</li> <li><tt>B</tt> reiserfs <tt>v3 (notail)</tt></li> <li><tt>C</tt> ext2</li> <li><tt>D</tt> xfs default</li> </ul> <p> The table presents absolute values (elapsed time, CPU usage, CPU utilization, disk usage) for reiser4, and ratios against reiser4 for all other configurations. A <font color=red>red</font> number means the ratio is larger than <tt>1.0</tt>, i.e. reiser4 is better in that test; a <font color=green>green</font> number means reiser4 loses. </p> <table cols=17 cellpadding=2 cellspacing=2 noborder> <tr><td bgcolor=black colspan=17><font color=white></td></tr> <tr> <th bgcolor=#303030 colspan=17 align=left><font color=white>A.FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=17 align=left><font color=white>B.FSTYPE=reiserfs MOUNT_OPTIONS=notail </font></th> </tr> <tr> <th bgcolor=#303030 colspan=17 align=left><font color=white>C.FSTYPE=ext2 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=17 align=left><font color=white>D.MKFS=mkfs.xfs -f FSTYPE=xfs </font></th> </tr> <tr> <td colspan=17 bgcolor=#606060><b><font color=white>#0:</font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=4><b>REAL_TIME</b></td> <td colspan=4><b>CPU_TIME</b></td> <td colspan=4><b>CPU_UTIL</b></td> <td colspan=4><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A 
</b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>CREATE</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 66.12</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.022 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.686 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 4.288 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>34.98</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.901</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.114 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.445 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>29.86</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.424 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.398</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.398</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1623204</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.086 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.098 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>COPY</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 187.77</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.438 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.751 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.733 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>44.8</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.883</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.124 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.161 
</font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>14.85</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.606 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.611 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.353</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 3245428</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.087 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.098 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>READ</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 151.01</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.459 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.113 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.978 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>44.34</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.607 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.470</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.535 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>18.54</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.444</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.500 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.724 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 3245428</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.087 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.098 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>STATS</b></td> <td bgcolor=#E0E0C0 
align=right><tt>22.04</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.314 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.812</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.871 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>8.61</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.698 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.571</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 4.591 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>20.11</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.528</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.709 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.579 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 3245428</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.087 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.098 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>DELETE</b></td> <td bgcolor=#E0E0C0 align=right><tt>108.77</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.313</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.193 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.071 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>41</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.637 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.091</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.795 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>21.45</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.795 </font></tt></td> <td 
bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.077</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.556 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>4</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 14877.000 </font></tt></td> </tt></td> </tr> <tr> <td colspan=17 bgcolor=#606060><b><font color=white>#1:DD_MBCOUNT=5000 </font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=4><b>REAL_TIME</b></td> <td colspan=4><b>CPU_TIME</b></td> <td colspan=4><b>CPU_UTIL</b></td> <td colspan=4><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_writing_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 536.06</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.005 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.017 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 0.982</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>122.28</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.826 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.819</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.806</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>14.99</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.771 </font></tt></td> <td bgcolor=#E0E0C0 
align=right><tt><font color=green> <U> 0.711</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.742 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 5120008</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.012</U> </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_reading_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt>145.32</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.031 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.965</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 0.982</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>157.51</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.947 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.890</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.880</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>57.01</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.901</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.909 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.884</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 5120008</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.012</U> </font></tt></td> </tt></td> </tr> <tr><td bgcolor=black colspan=17><font color=white></td></tr> <tr><td colspan=17 align=right> <tr> <td colspan=17 bgcolor=#303030><b><font 
color=white>INFO_R4=2.6.11 + reiser4-5 REP_COUNTER=1 DEV=/dev/hda5 DD_MBCOUNT=5000 PHASE_OVERWRITE=off FILE_SIZE=8192 NPROC=3 PHASE_READ=find PHASE_DELETE=rm PHASE_APPEND=off WRITE_BUFFER=131072 DIR=/mnt1 PHASE_MODIFY=off BYTES=1024000000 PHASE_COPY=cp GAMMA=0.2 SYNC=off </td></tr> <tr><td colspan=17 align=right> <font size=-2>Produced by <a href=http://namesys.com/benchmarks/mongo_readme.html>Mongo</a> benchmark suite.</font></td></tr> </table> === mongo 2.6.8.1-mm3 === [[mongo]] comparison against ext3 <dl> <dt>reiser4 </dt> <dd>large key</dd> <dt>mem total</dt> <dd>254324</dd> <dt>machine </dt> <dd>bones</dd> <dt>kernel </dt> <dd>2.6.8.1-mm3 #3 SMP Mon Aug 23 19:33:13 MSD 2004</dd> <dt>date </dt> <dd>Tue Aug 31 15:47:51 2004</dd> </dl> <p> In this test 81% of files are chosen from the 0-10k size range and 19% from the 10-100k size range. </p> <!-- File stats: Units are decimal (1k = 1000) files 0-100 : 1433 files 100-1K : 12597 files 1K-10K : 103101 files 10K-100K : 28131 files 100K-1M : 0 files 1M-10M : 0 files 10M-larger : 0 total bytes written : 1886585039 --> <p>Legend:</p> <ul> <li><tt>A</tt> reiser4</li> <li><tt>B</tt> reiser4, extents only</li> <li><tt>C</tt> reiserfs <tt>v3 (notail)</tt></li> <li><tt>D</tt> ext3 in <tt>data=writeback</tt> mode (meta-data only journalling)</li> <li><tt>E</tt> ext3 in <tt>data=journal</tt> mode</li> <li><tt>F</tt> ext3 in <tt>data=ordered</tt> mode</li> </ul> <img src="http://www.namesys.com/intbenchmarks/mongo/04.08.26/256MB.RAM/one-thread-8k.g02.charts/CREATE.0.png"> <img src="http://www.namesys.com/intbenchmarks/mongo/04.08.26/256MB.RAM/one-thread-8k.g02.charts/COPY.0.png"> <img src="http://www.namesys.com/intbenchmarks/mongo/04.08.26/256MB.RAM/one-thread-8k.g02.charts/READ.0.png"> <img src="http://www.namesys.com/intbenchmarks/mongo/04.08.26/256MB.RAM/one-thread-8k.g02.charts/STATS.0.png"> <img src="http://www.namesys.com/intbenchmarks/mongo/04.08.26/256MB.RAM/one-thread-8k.g02.charts/DELETE.0.png"> <img 
src="http://www.namesys.com/intbenchmarks/mongo/04.08.26/256MB.RAM/one-thread-8k.g02.charts/dd_writing_largefile.1.png"> <img src="http://www.namesys.com/intbenchmarks/mongo/04.08.26/256MB.RAM/one-thread-8k.g02.charts/dd_reading_largefile.1.png"> <p> The table presents absolute values (elapsed time, CPU usage, CPU utilization, disk usage) for reiser4, and ratios against reiser4 for all other configurations. A <font color=red>red</font> number means the ratio is larger than <tt>1.0</tt>, i.e. reiser4 is better in that test; a <font color=green>green</font> number means reiser4 loses. </p> <table cols=25 cellpadding=2 cellspacing=2 noborder> <tr><td bgcolor=black colspan=25><font color=white></td></tr> <tr> <th bgcolor=#303030 colspan=25 align=left><font color=white>A.FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=25 align=left><font color=white>B.FSTYPE=reiser4 MKFS=mkfs.reiser4 -q -o extent=extent40 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=25 align=left><font color=white>C.MOUNT_OPTIONS=notail FSTYPE=reiserfs </font></th> </tr> <tr> <th bgcolor=#303030 colspan=25 align=left><font color=white>D.MOUNT_OPTIONS="data=writeback" FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=25 align=left><font color=white>E.MOUNT_OPTIONS="data=journal" FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=25 align=left><font color=white>F.MOUNT_OPTIONS="data=ordered" FSTYPE=ext3 </font></th> </tr> <tr> <td colspan=25 bgcolor=#606060><b><font color=white>#0:</font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=6><b>REAL_TIME</b></td> <td colspan=6><b>CPU_TIME</b></td> <td colspan=6><b>CPU_UTIL</b></td> <td colspan=6><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A 
</b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>CREATE</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 91.6</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 0.988</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.983 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.592 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.010 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.256 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>31.13</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.965 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.826</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.577 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.529 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.802 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>22.63</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.981 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.350</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.791 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.738 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.000 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1978440</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.088 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 
</font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>COPY</b></td> <td bgcolor=#E0E0C0 align=right><tt>219.5</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.968</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.674 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.241 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.105 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.819 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>54.04</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.938 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.792</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.694 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.004 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.860 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>16.01</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.996 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.460</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.663 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.839 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.890 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 3956708</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.088 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> <td 
bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>READ</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 187.34</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.007</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.617 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.282 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.295 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.250 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>38.61</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.002 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.711 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.615</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.622</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.615</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>13.05</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.995 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.441</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.520 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.517 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.533 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 3956708</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.088 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 
</font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>STATS</b></td> <td bgcolor=#E0E0C0 align=right><tt>23.71</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.968 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.162 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.943</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.943</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.943</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>10.91</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.944 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.717 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.661</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.674 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.658</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>24.46</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.971 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.587</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.700 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.707 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.697 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 3956708</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.088 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> </tt></td> </tr> <tr> <td 
bgcolor=#C0C0C0><b>DELETE</b></td> <td bgcolor=#E0E0C0 align=right><tt>156.84</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.993 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.233</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.264 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.270 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.216 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>53.05</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.938 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.440 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.209</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.215 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.214 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>18.23</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.947 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.758 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.157</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.160 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.167 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>4</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.000 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> </tt></td> </tr> <tr> <td colspan=25 bgcolor=#606060><b><font color=white>#1:DD_MBCOUNT=768 
</font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=6><b>REAL_TIME</b></td> <td colspan=6><b>CPU_TIME</b></td> <td colspan=6><b>CPU_UTIL</b></td> <td colspan=6><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_writing_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 30.09</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.006</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.286 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.342 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.473 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.311 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>5.24</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.996 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.966</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.286 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.393 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.437 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>11.43</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.994 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.631</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.796 
</font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.655 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.967 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 786436</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_reading_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt>28.38</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.969</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.010 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 0.980</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 0.982</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.999 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>4.37</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.979 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.014 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.911</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.895</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.936 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>8.88</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.030 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.922 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.858</U> 
</font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.854</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.867</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 786436</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr><td bgcolor=black colspan=25><font color=white></td></tr> <tr><td colspan=25 align=right> <tr> <td colspan=25 bgcolor=#303030><b><font color=white>REP_COUNTER=1 PHASE_COPY=cp INFO_R4=2.6.8.1-mm3 + parse_options.patch FILE_SIZE=8192 DEV=/dev/hda6 PHASE_MODIFY=off DD_MBCOUNT=768 PHASE_APPEND=off PHASE_OVERWRITE=off SYNC=off DIR=/mnt1 PHASE_DELETE=rm NPROC=1 BYTES=1024000000 GAMMA=0.2 PHASE_READ=find WRITE_BUFFER=131072 </td></tr> <tr><td colspan=25 align=right> <font size=-2>Produced by <a href=http://namesys.com/>Mongo</a> benchmark suite.</font></td></tr> </table> === slow.c 2004-03-26 === [[slow.c]] comparison against ext2 and ext3, 2004-03-26 <p> These are the results of the <a href="http://www.jburgess.uklinux.net/slow.c">slow.c</a> benchmark for the 2004.03.26 reiser4 snapshot. </p> <p> <b>slow.c</b> is a simple program by Jon Burgess that writes and reads multiple data streams. For details and the source code, see <a href="http://marc.theaimsgroup.com/?l=linux-kernel&m=107652683608384&w=2">the discussion</a> on the linux-kernel mailing list. 
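</p>
<p> The multi-stream workload can be sketched in a few lines of C. This is a hypothetical illustration only, not Jon Burgess's actual program; the function name <tt>stream_write</tt>, the chunk size, and all parameters are invented for the sketch: each of the streams receives fixed-size chunks round-robin, the interleaved allocation pattern that the benchmark stresses. </p>

```c
/* Hypothetical sketch of a slow.c-style workload (NOT the original
 * slow.c): write `mb` megabytes spread round-robin across `nstreams`
 * files in 64 KiB chunks. */
#include <stdio.h>
#include <string.h>

enum { CHUNK = 64 * 1024, MAX_STREAMS = 16 };

/* Returns total bytes written, or -1 on error. */
long stream_write(const char *prefix, int nstreams, int mb)
{
    FILE *f[MAX_STREAMS];
    char name[256];
    static char buf[CHUNK];
    long total = 0, chunks = (long)mb * 1024 * 1024 / CHUNK;

    if (nstreams < 1 || nstreams > MAX_STREAMS)
        return -1;
    memset(buf, 'x', sizeof buf);
    for (int i = 0; i < nstreams; i++) {
        snprintf(name, sizeof name, "%s.%d", prefix, i);
        if (!(f[i] = fopen(name, "wb")))
            return -1;
    }
    /* Interleave: one chunk per stream in turn, like N concurrent writers. */
    for (long c = 0; c < chunks; c++)
        total += (long)fwrite(buf, 1, CHUNK, f[c % nstreams]);
    for (int i = 0; i < nstreams; i++)
        fclose(f[i]);
    return total;
}
```

<p> Timing such a call and dividing the bytes written by the elapsed seconds yields an Mb/s figure; a matching round-robin read pass over the same files measures read throughput. </p>
<p>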
</p>
<p> kernel : 2.6.5-rc2</p>
<p> RAM : 256Mb</p>
<p> reiser4 : <a href="http://www.namesys.com/snapshots/2004.03.26/">2004.03.26 snapshot</a></p>
<p>Hardware specs:</p>
<pre>
processor   : 1
vendor_id   : AuthenticAMD
cpu family  : 6
model       : 6
model name  : AMD Athlon(tm) Processor
stepping    : 2
cpu MHz     : 1460.098
cache size  : 256 KB
bogomips    : 2916.35
Dual CPU AMD Athlon(tm) 1.4Ghz
</pre>
<pre>
# hdparm
/dev/hda6:
 multcount    = 16 (on)
 IO_support   = 1 (32-bit)
 unmaskirq    = 1 (on)
 using_dma    = 1 (on)
 keepsettings = 0 (off)
 readonly     = 0 (off)
 readahead    = 256 (on)
 geometry     = 65535/16/63, sectors = 35937342, start = 84164598
</pre>
<pre>
# hdparm -t /dev/hda6
/dev/hda6:
 Timing buffered disk reads: 84 MB in 3.07 seconds = 27.39 MB/sec
</pre>
<pre>
# hdparm -i /dev/hda
/dev/hda:
 Model=IC35L060AVER07-0, FwRev=ER6OA44A, SerialNo=SZPTZMB6154
 Config={ HardSect NotMFM HdSw>15uSec Fixed DTR>10Mbs }
 RawCHS=16383/16/63, TrkSize=0, SectSize=0, ECCbytes=40
 BuffType=DualPortCache, BuffSize=1916kB, MaxMultSect=16, MultSect=16
 CurCHS=16383/16/63, CurSects=16514064, LBA=yes, LBAsects=120103200
 IORDY=on/off, tPIO={min:240,w/IORDY:120}, tDMA={min:120,rec:120}
 PIO modes: pio0 pio1 pio2 pio3 pio4
 DMA modes: mdma0 mdma1 mdma2
 UDMA modes: udma0 udma1 udma2
 AdvancedPM=yes: disabled (255) WriteCache=enabled
 Drive conforms to: ATA/ATAPI-5 T13 1321D revision 1:
 * signifies the current active mode
</pre>
<pre>
<!--
(500Mb of data)
test : ./slow foo 500
Results :
==============================================================
                |  1 stream             |  2 streams
 ---------------+----------------------------------------------
                |  WRITE      READ      |  WRITE      READ
 ---------------+----------------------------------------------
ext2              25.08Mb/s  27.08Mb/s    13.72Mb/s  14.04Mb/s
reiser4           26.31Mb/s  26.99Mb/s    24.03Mb/s  26.84Mb/s
reiser4-extents   25.28Mb/s  27.40Mb/s    24.12Mb/s  26.85Mb/s
ext3-ordered      20.99Mb/s  26.40Mb/s    12.01Mb/s  13.34Mb/s
ext3-journal      10.13Mb/s  24.48Mb/s     8.87Mb/s  13.26Mb/s
reiserfs          20.42Mb/s  27.67Mb/s    12.98Mb/s  13.13Mb/s
reiserfs-notail   20.07Mb/s  27.58Mb/s    13.04Mb/s  13.25Mb/s
==============================================================
-->
(1000Mb of data)
test : ./slow foo 1000
Results :
==============================================================================================================
                |  1 stream             |  2 streams            |  4 streams            |  8 streams
 ---------------+---------------------------------------------------------------------------------------------
                |  WRITE      READ      |  WRITE      READ      |  WRITE      READ      |  WRITE      READ
 ---------------+---------------------------------------------------------------------------------------------
ext2              24.66Mb/s  27.56Mb/s    13.40Mb/s  13.67Mb/s     7.73Mb/s   6.94Mb/s     6.69Mb/s   3.52Mb/s
reiser4           25.42Mb/s  27.71Mb/s    23.96Mb/s  26.34Mb/s    24.55Mb/s  26.58Mb/s    24.90Mb/s  26.76Mb/s
reiser4-extents   25.60Mb/s  27.68Mb/s    24.19Mb/s  25.92Mb/s    25.24Mb/s  27.12Mb/s    25.39Mb/s  26.72Mb/s
ext3-ordered      20.05Mb/s  26.46Mb/s    11.06Mb/s  13.12Mb/s     9.63Mb/s   6.76Mb/s    10.02Mb/s   3.48Mb/s
ext3-journal      10.10Mb/s  26.81Mb/s     8.87Mb/s  13.08Mb/s     8.59Mb/s   6.84Mb/s     8.14Mb/s   3.47Mb/s
reiserfs          20.19Mb/s  27.48Mb/s    12.69Mb/s  13.03Mb/s     8.27Mb/s   6.84Mb/s     7.87Mb/s   4.13Mb/s
reiserfs-notail   20.31Mb/s  27.10Mb/s    12.74Mb/s  13.09Mb/s     8.33Mb/s   6.89Mb/s     7.87Mb/s   4.17Mb/s
=============================================================================================================
</pre>
<p><i>Write and read throughput graphs for 1, 2, 4 and 8 streams (wr.*.png, rd.*.png) accompanied these results on the original page; the images are not available here.</i></p>
=== mongo 2003-11-20 ===
[[mongo]] comparison against ext3, 2003-11-20
<dl>
<dt>reiser4 </dt> <dd>''</dd>
<dt>mem total</dt> <dd>255716</dd>
<dt>machine </dt> <dd>belka</dd>
<dt>kernel </dt> <dd>2.6.0-test9 #2 SMP Thu Nov 20 16:08:42 MSK 2003</dd>
<dt>date </dt> <dd>Thu Nov 20 16:16:50 2003</dd>
</dl>
<p> In this test, 80% of files are chosen from the 0-8k size range, 16% from the 0-80k size range, 0.8 x 4% from the 0-800k size range, etc. Most files are small; most bytes are in large files. </p>
<p>Legend:</p>
<ul>
<li><tt>A</tt> reiser4</li>
<li><tt>B</tt> reiser4, extents only</li>
<li><tt>C</tt> reiserfs <tt>v3</tt></li>
<li><tt>D</tt> ext3 in <tt>data=writeback</tt> mode (meta-data only journalling)</li>
<li><tt>E</tt> ext3 in <tt>data=journal</tt> mode</li>
<li><tt>F</tt> ext3 in <tt>data=ordered</tt> mode</li>
<li><tt>G</tt> ext3 with htree (hashed directories)</li>
</ul>
<p> The table presents absolute values (of elapsed time, CPU usage, and disk usage) for reiser4, and ratios against reiser4 for all other configurations. A <font color=red>red</font> number means the ratio is larger than <tt>1.0</tt>, that is, reiser4 is better in this test. A <font color=green>green</font> number means that reiser4 loses in this test.
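</p>
<p> The file-size distribution described above can be sketched as follows. This is our reading of the description (and of the <tt>GAMMA=0.2</tt> setting in the configuration shown below the table), not the actual mongo source: each successive, ten-times-larger size range is chosen with probability <tt>GAMMA</tt>, and a size is then drawn uniformly inside the chosen range, which reproduces the 80% / 16% / 0.8 x 4% proportions. </p>
```python
# Hedged sketch of the mongo file-size distribution; not mongo's own code.
import random

GAMMA = 0.2          # matches GAMMA=0.2 in the benchmark configuration
BASE = 8 * 1024      # upper bound of the first (0-8k) size range

def sample_size(rng=random):
    """Pick a size range, then a size uniformly inside it."""
    limit = BASE
    # Escalate to the next, ten-times-larger range with probability GAMMA,
    # so range k is chosen with probability (1 - GAMMA) * GAMMA**k:
    # 0.8, 0.16, 0.032 (= 0.8 x 4%), ...
    while rng.random() < GAMMA:
        limit *= 10
    return rng.randrange(limit)

if __name__ == "__main__":
    rng = random.Random(1)
    sizes = [sample_size(rng) for _ in range(100000)]
    small = [s for s in sizes if s <= BASE]
    # Illustrates "most files are small, most bytes are in large files".
    print("files <= 8k: %.1f%% of files, %.1f%% of bytes"
          % (100.0 * len(small) / len(sizes),
             100.0 * sum(small) / sum(sizes)))
```
<p>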
</p> <table cols=22 cellpadding=2 cellspacing=2 noborder> <tr><td bgcolor=black colspan=22><font color=white></td></tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>A.INFO_R4='' FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>B.INFO_R4='' MKFS=mkfs.reiser4 -q -o policy=extents FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>C.FSTYPE=reiserfs </font></th> </tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>D.MOUNT_OPTIONS=data=writeback FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>E.MOUNT_OPTIONS=data=journal FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>F.MOUNT_OPTIONS=data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>G.MKFS=mkfs.ext3 -O dir_index MOUNT_OPTIONS=data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <td colspan=22 bgcolor=#606060><b><font color=white>#0:</font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=7><b>REAL_TIME</b></td> <td colspan=7><b>CPU_TIME</b></td> <td colspan=7><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>CREATE</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 21.81</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.171 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.983 </font></tt></td> <td bgcolor=#E0E0C0 
align=right><tt><font color=red> 3.253 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.702 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.161 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.212 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>6.38</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.130 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.020 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.461 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.461 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.354 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.851</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 607612</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.091 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.035 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>COPY</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 64.37</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.089 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.046 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.980 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.834 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.929 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 6.246 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>11.55</tt></td> 
<td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.047 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.797 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.590 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.725 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.542 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.698</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1214992</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.091 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.034 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>READ</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 45.38</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.026 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.406 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.248 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.307 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.232 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 7.192 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>10.13</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.934 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.517 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.454 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.453</U> </font></tt></td> <td bgcolor=#E0E0C0 
align=right><tt><font color=green> <U> 0.444</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.504 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1214992</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.091 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.034 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>STATS</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 5.74</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.030 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.413 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.014</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.033 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.021 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.634 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>2.34</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.000 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.936 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.761 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.791 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.774 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.744</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1214992</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.091 </font></tt></td> <td bgcolor=#E0E0C0 
align=right><tt><font color=red> 1.034 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>DELETE</b></td> <td bgcolor=#E0E0C0 align=right><tt>46.94</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.424</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.520 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.017 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.043 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.956 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.315 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>14.19</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.743 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.443 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.200</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.206 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.201</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.234 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>4</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.000 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td 
bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> </tt></td> </tr> <tr> <td colspan=22 bgcolor=#606060><b><font color=white>#1:DD_MBCOUNT=768 </font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=7><b>REAL_TIME</b></td> <td colspan=7><b>CPU_TIME</b></td> <td colspan=7><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_writing_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 29.33</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.026 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.184 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.102 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.499 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.097 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.098 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>2.61</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.008 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.659</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.437 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.054 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.556 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.571 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 786436</U></tt></td> 
<td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_reading_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 22.96</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.056 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.003</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.004</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.003</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.006</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>2.26</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.991 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.912 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.796 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.765</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.779</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.783 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 786436</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 
align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr>
<tr><td bgcolor=black colspan=22><font color=white></td></tr>
<tr><td colspan=22 align=right> <tr> <td colspan=22 bgcolor=#303030><b><font color=white>NPROC=1 DIR=/mnt/testfs SYNC=off PHASE_COPY=cp REP_COUNTER=1 GAMMA=0.2 PHASE_OVERWRITE=off FILE_SIZE=8192 BYTES=512000000 PHASE_APPEND=off PHASE_READ=find DEV=/dev/hdb3 DD_MBCOUNT=768 WRITE_BUFFER=131072 PHASE_DELETE=rm PHASE_MODIFY=off </td></tr>
<tr><td colspan=22 align=right> <font size=-2>Produced by <a href=http://namesys.com/benchmarks/mongo_readme.html>Mongo</a> benchmark suite.</font></td></tr>
</table>
=== mongo 2003-09-25 ===
[[mongo]] comparison against ext3, 2003-09-25
<dl>
<dt>reiser4 </dt> <dd>''</dd>
<dt>mem total</dt> <dd>255048</dd>
<dt>machine </dt> <dd>belka</dd>
<dt>kernel </dt> <dd>2.6.0-test5 #33 SMP Thu Sep 25 15:45:38 MSD 2003</dd>
<dt>date </dt> <dd>Thu Sep 25 15:57:38 2003</dd>
</dl>
<p> In this test, 80% of files are chosen from the 0-8k size range, 16% from the 0-80k size range, 0.8 x 4% from the 0-800k size range, etc. Most files are small; most bytes are in large files. </p>
<p>Legend:</p>
<ul>
<li><tt>A</tt> reiser4</li>
<li><tt>B</tt> reiser4, extents only</li>
<li><tt>C</tt> reiserfs <tt>v3</tt></li>
<li><tt>D</tt> ext3 in <tt>data=writeback</tt> mode (meta-data only journalling)</li>
<li><tt>E</tt> ext3 in <tt>data=journal</tt> mode</li>
<li><tt>F</tt> ext3 in <tt>data=ordered</tt> mode</li>
<li><tt>G</tt> ext3 with htree (hashed directories)</li>
</ul>
<p> The table presents absolute values (of elapsed time, CPU usage, and disk usage) for reiser4, and ratios against reiser4 for all other configurations. A <font color=red>red</font> number means the ratio is larger than <tt>1.0</tt>, that is, reiser4 is better in this test. A <font color=green>green</font> number means that reiser4 loses in this test. </p>
<table cols=22 cellpadding=2 cellspacing=2 noborder>
<tr><td bgcolor=black colspan=22><font color=white></td></tr>
<tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>A.INFO_R4='' FSTYPE=reiser4 </font></th> </tr>
<tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>B.INFO_R4='' MKFS=mkfs.reiser4 -q -o policy=extents FSTYPE=reiser4 </font></th> </tr>
<tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>C.FSTYPE=reiserfs </font></th> </tr>
<tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>D.MOUNT_OPTIONS=data=writeback FSTYPE=ext3 </font></th> </tr>
<tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>E.MOUNT_OPTIONS=data=journal FSTYPE=ext3 </font></th> </tr>
<tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>F.MOUNT_OPTIONS=data=ordered FSTYPE=ext3 </font></th> </tr>
<tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>G.MKFS=mkfs.ext3 -O dir_index MOUNT_OPTIONS=data=ordered FSTYPE=ext3 </font></th> </tr>
<tr> <td colspan=22 bgcolor=#606060><b><font color=white>#0:</font></b></td></tr>
<tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=7><b>REAL_TIME</b></td> <td colspan=7><b>CPU_TIME</b></td> <td colspan=7><b>DF</b></td> </tr>
<tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> </tr>
<tr> <td bgcolor=#C0C0C0><b>CREATE</b></td> <td bgcolor=#E0E0C0 align=right><tt><U>
23.57</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.158 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.714 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.263 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.234 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.020 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.376 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>6.66</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.075 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.947 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.240 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.357 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.264 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.835</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 608548</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.090 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.034 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.105 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.105 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.105 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>COPY</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 64.98</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.083 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.050 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.023 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.810 </font></tt></td> <td bgcolor=#E0E0C0 
align=right><tt><font color=red> 1.908 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 6.850 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>12.18</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.057 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.776 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.507 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.603 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.518 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.743</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1216784</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.090 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.033 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.105 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.105 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.105 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>READ</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 44.65</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.028 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.733 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.237 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.114 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.179 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 7.694 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>10.28</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.933 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.590</U> 
</font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.608 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.593</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.608 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.620 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1216784</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.090 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.033 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.105 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.105 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.105 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>STATS</b></td> <td bgcolor=#E0E0C0 align=right><tt>5.88</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.998 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.139 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.981 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.020 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.929</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.655 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>2.29</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.987 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.900 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.747</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.782 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.747</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 
0.755</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1216784</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.090 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.033 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.105 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.105 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.105 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>DELETE</b></td> <td bgcolor=#E0E0C0 align=right><tt>46.65</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.438</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.504 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.109 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.023 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.022 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.376 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>14.19</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.746 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.431 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.206</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.211 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.211 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.232 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>4</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.000 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> 
</font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> </tt></td> </tr> <tr> <td colspan=22 bgcolor=#606060><b><font color=white>#1:DD_MBCOUNT=768 </font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=7><b>REAL_TIME</b></td> <td colspan=7><b>CPU_TIME</b></td> <td colspan=7><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_writing_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 30.78</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.017</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.177 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.063 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.394 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.066 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.056 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>3.11</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.981 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.553</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.701 </font></tt></td> <td bgcolor=#E0E0C0 
align=right><tt><font color=red> 1.296 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.318 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 786436</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_reading_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 22.96</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.045 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.005</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.005</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.004</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.006</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>2.41</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.996 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.867 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.739 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.718</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.739 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.722</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 786436</U></tt></td> 
<td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr><td bgcolor=black colspan=22><font color=white></td></tr> <tr><td colspan=22 align=right> <tr> <td colspan=22 bgcolor=#303030><b><font color=white>NPROC=1 DIR=/mnt/testfs SYNC=off PHASE_COPY=cp REP_COUNTER=1 GAMMA=0.2 PHASE_OVERWRITE=off FILE_SIZE=8192 BYTES=512000000 PHASE_APPEND=off PHASE_READ=find DEV=/dev/hdb3 DD_MBCOUNT=768 WRITE_BUFFER=131072 PHASE_DELETE=rm PHASE_MODIFY=off </td></tr> <tr><td colspan=22 align=right> <font size=-2>Produced by <a href=http://namesys.com/benchmarks/mongo_readme.html>Mongo</a> benchmark suite.</font></td></tr> </table> === mongo 2003-08-28 === [[mongo]] comparison against ext3, 2003-08-28 <dl> <dt>reiser4 </dt> <dd>''</dd> <dt>mem total</dt> <dd>256276</dd> <dt>machine </dt> <dd>belka</dd> <dt>kernel </dt> <dd>2.6.0-test4 #194 SMP Thu Aug 28 17:18:47 MSD 2003</dd> <dt>date </dt> <dd>Thu Aug 28 17:20:18 2003</dd> </dl> <p> In this test 80% of files are chosen from the 0-8k size range, 16% from the 0-80k size range, 0.8 x 4% from the 0-800k size range, etc. Most files are small, most bytes are in large files.
</p> <p>Legend:</p> <ul> <li><tt>A</tt> reiser4</li> <li><tt>B</tt> reiser4, extents only</li> <li><tt>C</tt> reiserfs</li> <li><tt>D</tt> ext3 in <tt>data=writeback</tt> mode (meta-data only journalling)</li> <li><tt>E</tt> ext3 in <tt>data=journal</tt> mode</li> <li><tt>F</tt> ext3 in <tt>data=ordered</tt> mode</li> <li><tt>G</tt> ext3 with htree (hashed directories)</li> </ul> <p> The table presents absolute values (elapsed time, CPU usage, and disk usage) for reiser4, and ratios against reiser4 for every other configuration. A <font color=red>red</font> number means the ratio is larger than <tt>1.0</tt>, i.e. reiser4 wins in this test; a <font color=green>green</font> number means reiser4 loses. </p> <table cols=22 cellpadding=2 cellspacing=2 noborder> <tr><td bgcolor=black colspan=22><font color=white></td></tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>A.INFO_R4='' FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>B.INFO_R4='' MKFS=mkfs.reiser4 -q -o policy=extents FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>C.FSTYPE=reiserfs </font></th> </tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>D.MOUNT_OPTIONS=data=writeback FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>E.MOUNT_OPTIONS=data=journal FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>F.MOUNT_OPTIONS=data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>G.MKFS=mkfs.ext3 -O dir_index MOUNT_OPTIONS=data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <td colspan=22 bgcolor=#606060><b><font color=white>#0:</font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=7><b>REAL_TIME</b></td> <td colspan=7><b>CPU_TIME</b></td> <td colspan=7><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td>
<td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>CREATE</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 21.94</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.056 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.957 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.049 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.430 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.399 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.558 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>6.7</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.104 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.913 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.213 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.334 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.345 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.821</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 608452</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.091 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.034 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.105 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.105 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.105 </font></tt></td> <td bgcolor=#E0E0C0 
align=right><tt><font color=red> 1.106 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>COPY</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 64.05</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.078 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.112 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.964 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.703 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.022 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 7.356 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>11.37</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.039 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.819 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.538 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.692 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.568 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.708</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1216572</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.091 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.033 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>READ</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 52.53</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.072 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.882 </font></tt></td> <td 
bgcolor=#E0E0C0 align=right><tt><font color=red> 1.056 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.126 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.124 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 7.158 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>9.8</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.914 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.538 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.489 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.467 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.456</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.551 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1216572</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.091 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.033 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>STATS</b></td> <td bgcolor=#E0E0C0 align=right><tt>5.82</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.973</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.251 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.040 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.009 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.048 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.641 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 
align=right><tt>2.29</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.991 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.926 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.755 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.742</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.751 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.734</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1216572</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.091 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.033 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>DELETE</b></td> <td bgcolor=#E0E0C0 align=right><tt>46.96</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.409</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.491 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.949 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.988 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.987 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.382 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>13.89</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.734 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.453 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.210 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font 
color=green> <U> 0.204</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.202</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.238 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>4</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.000 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> </tt></td> </tr> <tr> <td colspan=22 bgcolor=#606060><b><font color=white>#1:DD_MBCOUNT=768 </font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=7><b>REAL_TIME</b></td> <td colspan=7><b>CPU_TIME</b></td> <td colspan=7><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_writing_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 26.1</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.006</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.205 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.066 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.353 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 
1.068 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.070 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>3.18</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.028 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.547</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.173 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.708 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.327 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.296 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 786436</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_reading_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 18.99</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.009</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.072 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.009</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.007</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.006</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.008</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>2.12</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.000 
</font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.925 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.877 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.844 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.830 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.811</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 786436</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr><td bgcolor=black colspan=22><font color=white></td></tr> <tr><td colspan=22 align=right> <tr> <td colspan=22 bgcolor=#303030><b><font color=white>NPROC=1 DIR=/mnt/testfs SYNC=off PHASE_COPY=cp REP_COUNTER=1 GAMMA=0.2 PHASE_OVERWRITE=off FILE_SIZE=8192 BYTES=512000000 PHASE_APPEND=off PHASE_READ=find DEV=/dev/hdb3 DD_MBCOUNT=768 WRITE_BUFFER=131072 PHASE_DELETE=rm PHASE_MODIFY=off </td></tr> <tr><td colspan=22 align=right> <font size=-2>Produced by <a href=http://namesys.com/benchmarks/mongo_readme.html>Mongo</a> benchmark suite.</font></td></tr> </table> === mongo 2003-08-27 === [[mongo]] comparison against ext3 <dl> <dt>reiser4 </dt> <dd>''</dd> <dt>mem total</dt> <dd>256276</dd> <dt>machine </dt> <dd>belka</dd> <dt>kernel </dt> <dd>2.6.0-test4 #189 SMP Wed Aug 27 20:36:51 MSD 2003</dd> <dt>date </dt> <dd>Wed Aug 27 20:44:02 2003</dd> </dl> <p> In this test 80% of files are chosen from the 0-8k size range, 16% from the 0-80k size range, 0.8 x 4% 
from the 0-800k size range, etc. Most files are small, most bytes are in large files. </p> <p>Legend:</p> <ul> <li><tt>A</tt> reiser4</li> <li><tt>B</tt> reiser4, extents only</li> <li><tt>C</tt> ext3 in <tt>data=writeback</tt> mode (meta-data only journalling)</li> <li><tt>D</tt> ext3 in <tt>data=journal</tt> mode</li> <li><tt>E</tt> ext3 in <tt>data=ordered</tt> mode</li> <li><tt>F</tt> ext3 with htree (hashed directories)</li> </ul> <p> The table presents absolute values (elapsed time, CPU usage, and disk usage) for reiser4, and ratios against reiser4 for every other configuration. A <font color=red>red</font> number means the ratio is larger than <tt>1.0</tt>, i.e. reiser4 wins in this test; a <font color=green>green</font> number means reiser4 loses. </p> <table cols=19 cellpadding=2 cellspacing=2 noborder> <tr><td bgcolor=black colspan=19><font color=white></td></tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>A.INFO_R4='' FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>B.INFO_R4='' MKFS=mkfs.reiser4 -q -o policy=extents FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>C.MOUNT_OPTIONS=data=writeback FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>D.MOUNT_OPTIONS=data=journal FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>E.MOUNT_OPTIONS=data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>F.MKFS=mkfs.ext3 -O dir_index MOUNT_OPTIONS=data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <td colspan=19 bgcolor=#606060><b><font color=white>#0:</font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=6><b>REAL_TIME</b></td> <td colspan=6><b>CPU_TIME</b></td> <td colspan=6><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A
</b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>CREATE</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 22.41</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.673 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.325 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.975 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.213 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>7.66</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.069 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.347 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.415 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.410 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.708</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 635264</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.096 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.110 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.110 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.110 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.111 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>COPY</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 90.92</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.099 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.471 
</font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.221 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.470 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 4.989 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>12.14</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.068 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.066 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.241 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.094 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.668</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1269840</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.096 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.110 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.110 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.110 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.112 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>READ</b></td> <td bgcolor=#E0E0C0 align=right><tt>82.21</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.063 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.861 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.852 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.791</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 4.417 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>10.57</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.914 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.400</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.428 </font></tt></td> <td 
bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.402</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.534 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1269840</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.096 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.110 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.110 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.110 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.112 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>STATS</b></td> <td bgcolor=#E0E0C0 align=right><tt>8.52</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.993 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.822</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.816</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.811</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.335 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>2.96</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.997 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.561</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.564</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.584 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.608 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1269840</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.096 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.110 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.110 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.110 
</font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.112 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>DELETE</b></td> <td bgcolor=#E0E0C0 align=right><tt>69.69</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.301</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.749 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.717 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.659 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.912 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>14.73</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.703 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.208</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.207</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.213 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.237 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>4</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.000 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> </tt></td> </tr> <tr> <td colspan=19 bgcolor=#606060><b><font color=white>#1:DD_MBCOUNT=768 </font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=6><b>REAL_TIME</b></td> <td colspan=6><b>CPU_TIME</b></td> <td colspan=6><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> 
<td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_writing_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 25.85</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.092 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.335 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.085 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.095 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 3.27</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 0.982</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.159 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.648 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.251 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.254 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 786436</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_reading_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 19</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 0.999</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font 
color=black> <U> 1.005</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.007</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.007</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.007</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>2.18</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.963 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.807 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.803</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.789</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.803</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 786436</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr><td bgcolor=black colspan=19><font color=white></td></tr> <tr><td colspan=19 align=right> <tr> <td colspan=19 bgcolor=#303030><b><font color=white>NPROC=1 DIR=/mnt/testfs SYNC=off PHASE_COPY=cp REP_COUNTER=1 GAMMA=0.2 PHASE_OVERWRITE=off FILE_SIZE=8000 BYTES=512000000 PHASE_APPEND=off PHASE_READ=find DEV=/dev/hdb3 DD_MBCOUNT=768 WRITE_BUFFER=131072 PHASE_DELETE=rm PHASE_MODIFY=off </td></tr> <tr><td colspan=19 align=right> <font size=-2>Produced by <a href=http://namesys.com/benchmarks/mongo_readme.html>Mongo</a> benchmark suite.</font></td></tr> </table> <hr> <p> This is the same test as above, but with base file size 4k, that is, in this test 80% of files are chosen from 
the 0-4k size range, 16% from the 0-40k size range, 0.8 x 4% from the 0-400k size range, etc. </p> <hr> <dl> <dt>reiser4 </dt> <dd>''</dd> <dt>mem total</dt> <dd>255580</dd> <dt>machine </dt> <dd>belka</dd> <dt>kernel </dt> <dd>2.6.0-test4 #176 SMP Tue Aug 26 19:09:38 MSD 2003</dd> <dt>date </dt> <dd>Wed Aug 27 12:41:54 2003</dd> </dl> <table cols=19 cellpadding=2 cellspacing=2 noborder> <tr><td bgcolor=black colspan=19><font color=white></td></tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>A.INFO_R4='' FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>B.INFO_R4='' MKFS=mkfs.reiser4 -q -o policy=extents FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>C.MOUNT_OPTIONS=data=writeback FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>D.MOUNT_OPTIONS=data=journal FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>E.MOUNT_OPTIONS=data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>F.MKFS=mkfs.ext3 -O dir_index MOUNT_OPTIONS=data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <td colspan=19 bgcolor=#606060><b><font color=white>#0:</font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=6><b>REAL_TIME</b></td> <td colspan=6><b>CPU_TIME</b></td> <td colspan=6><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>CREATE</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 33.86</U></tt></td> <td 
bgcolor=#E0E0C0 align=right><tt><font color=red> 1.223 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.305 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.895 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.549 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.298 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>14.11</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.118 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.967 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.046 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.045 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.647</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 789424</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.208 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.181 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>COPY</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 119.68</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.228 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.237 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.397 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.277 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 7.061 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>23.05</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.484 
</font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.683 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.515 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.691</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1578216</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.208 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.182 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>READ</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 118.5</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.217 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.041 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.065 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.020</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 6.585 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>19.84</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.993 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.436</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.446 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.431</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.540 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1578216</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.208 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 
</font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.182 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>STATS</b></td> <td bgcolor=#E0E0C0 align=right><tt>24.69</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.951 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.677</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.696 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.677</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.151 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>7.75</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.008 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.590</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.582</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.583</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.645 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1578216</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.208 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.182 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>DELETE</b></td> <td bgcolor=#E0E0C0 align=right><tt>114.49</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.438 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.174</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.188 </font></tt></td> <td 
bgcolor=#E0E0C0 align=right><tt><font color=green> 0.177 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.257 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>32.64</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.790 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.193</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.199 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.194</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.223 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>4</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.000 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> </tt></td> </tr> <tr> <td colspan=19 bgcolor=#606060><b><font color=white>#1:DD_MBCOUNT=768 </font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=6><b>REAL_TIME</b></td> <td colspan=6><b>CPU_TIME</b></td> <td colspan=6><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_writing_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 26.24</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.002</U> </font></tt></td> <td 
bgcolor=#E0E0C0 align=right><tt><font color=red> 1.066 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.311 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.056 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.063 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 3.25</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 0.997</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.138 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.622 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.286 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.298 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 786436</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_reading_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 19.04</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 0.994</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.002</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.003</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.002</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>2.08</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.038 </font></tt></td> <td 
bgcolor=#E0E0C0 align=right><tt><font color=green> 0.870 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.870 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.870 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.837</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 786436</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr><td bgcolor=black colspan=19><font color=white></td></tr> <tr><td colspan=19 align=right> <tr> <td colspan=19 bgcolor=#303030><b><font color=white>NPROC=1 DIR=/mnt/testfs SYNC=off PHASE_COPY=cp REP_COUNTER=1 GAMMA=0.2 PHASE_OVERWRITE=off FILE_SIZE=4000 BYTES=512000000 PHASE_APPEND=off PHASE_READ=find DEV=/dev/hdb3 DD_MBCOUNT=768 WRITE_BUFFER=131072 PHASE_DELETE=rm PHASE_MODIFY=off </td></tr> <tr><td colspan=19 align=right> <font size=-2>Produced by <a href=http://namesys.com/benchmarks/mongo_readme.html>Mongo</a> benchmark suite.</font></td></tr> </table> === mongo 2003-08-26 === [[mongo]] comparison against ext3 <dl> <dt>reiser4 </dt> <dd>''</dd> <dt>mem total</dt> <dd>904048</dd> <dt>machine </dt> <dd>belka</dd> <dt>kernel </dt> <dd>2.6.0-test4 #176 SMP Tue Aug 26 19:09:38 MSD 2003</dd> <dt>date </dt> <dd>Tue Aug 26 19:34:39 2003</dd> </dl> <p> In this test 80% of files are chosen from the 0-4k size range, 16% from the 0-40k size range, 0.8 x 4% from the 0-400k size range, etc. Most files are small, most bytes are in large files. 
</p> <p>Legend:</p> <ul> <li><tt>A</tt> reiser4</li> <li><tt>B</tt> reiser4, extents only</li> <li><tt>C</tt> ext3 in <tt>data=writeback</tt> mode (metadata-only journalling)</li> <li><tt>D</tt> ext3 in <tt>data=journal</tt> mode</li> <li><tt>E</tt> ext3 in <tt>data=ordered</tt> mode</li> <li><tt>F</tt> ext3 with htree (hashed directories)</li> </ul> <p> The table presents absolute values (elapsed time, CPU usage, and disk usage) for reiser4, and ratios against reiser4 for each of the other configurations. A <font color=red>red</font> number means the ratio is larger than <tt>1.0</tt>, i.e. reiser4 wins that test; a <font color=green>green</font> number means reiser4 loses it. </p> <table cols=19 cellpadding=2 cellspacing=2 noborder> <tr><td bgcolor=black colspan=19><font color=white></td></tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>A.INFO_R4='' FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>B.INFO_R4='' MKFS=mkfs.reiser4 -q -o policy=extents FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>C.MOUNT_OPTIONS=data=writeback FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>D.MOUNT_OPTIONS=data=journal FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>E.MOUNT_OPTIONS=data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>F.MKFS=mkfs.ext3 -O dir_index MOUNT_OPTIONS=data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <td colspan=19 bgcolor=#606060><b><font color=white>#0:</font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=6><b>REAL_TIME</b></td> <td colspan=6><b>CPU_TIME</b></td> <td colspan=6><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A
</b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>CREATE</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 27.6</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.311 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.567 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.538 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.668 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.566 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>13.55</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.166 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.035 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.162 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.189 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.670</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 788884</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.208 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.181 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.181 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.181 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.182 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>COPY</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 113.71</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.237 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.167 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.460 
</font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.227 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 7.387 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>23.13</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.169 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.498 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.691 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.591 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.709</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1577560</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.208 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.181 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.181 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.181 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.183 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>READ</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 111.51</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.239 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.157 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.176 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.096 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 7.017 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>20.76</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.042 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.424 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.415</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.416</U> </font></tt></td> <td 
bgcolor=#E0E0C0 align=right><tt><font color=green> 0.521 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1577560</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.208 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.181 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.181 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.181 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.183 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>STATS</b></td> <td bgcolor=#E0E0C0 align=right><tt>20.22</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.034 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.834</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.827</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.832</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.439 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>7.47</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.009 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.590</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.585</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.584</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.631 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1577560</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.208 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.181 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.181 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.181 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.183 </font></tt></td> 
</tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>DELETE</b></td> <td bgcolor=#E0E0C0 align=right><tt>110.98</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.437 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.183</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.180</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.185 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.277 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>33.03</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.838 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.196 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.192</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.193</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.221 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>4</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.000 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> </tt></td> </tr> <tr> <td colspan=19 bgcolor=#606060><b><font color=white>#1:DD_MBCOUNT=768 </font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=6><b>REAL_TIME</b></td> <td colspan=6><b>CPU_TIME</b></td> <td colspan=6><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A 
</b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_writing_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 26.03</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.096 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.340 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.092 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.080 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 3.48</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.011</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.083 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.583 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.187 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.190 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 786436</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_reading_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 19</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 0.995</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 
<U> 0.999</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 0.999</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>2.28</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.018 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.741 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.737</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.741 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.724</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 786436</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr><td bgcolor=black colspan=19><font color=white></td></tr> <tr><td colspan=19 align=right> <tr> <td colspan=19 bgcolor=#303030><b><font color=white>NPROC=1 DIR=/mnt/testfs SYNC=off PHASE_COPY=cp REP_COUNTER=1 GAMMA=0.2 PHASE_OVERWRITE=off FILE_SIZE=4000 BYTES=512000000 PHASE_APPEND=off PHASE_READ=find DEV=/dev/hdb3 DD_MBCOUNT=768 WRITE_BUFFER=131072 PHASE_DELETE=rm PHASE_MODIFY=off </td></tr> <tr><td colspan=19 align=right> <font size=-2>Produced by <a href=http://namesys.com/benchmarks/mongo_readme.html>Mongo</a> benchmark suite.</font></td></tr> </table> === mongo 2003-08-18 === [[mongo]] comparison against ext3 <dl> <dt>reiser4 </dt> <dd></dd> <dt>mem total</dt> <dd>255992</dd> <dt>machine </dt> <dd>belka</dd> <dt>kernel </dt> <dd>2.6.0-test3 #37 SMP Mon Aug 18 18:12:14 MSD
2003</dd> <dt>date </dt> <dd>Mon 18 Aug 2003 20:24:16</dd> </dl> <p> In this test 80% of files are chosen from the 0-8k size range, 16% from the 0-80k size range, 0.8 x 4% from the 0-800k size range, etc. Most files are small, most bytes are in large files. </p> <p>Legend:</p> <ul> <li><tt>A</tt> reiser4</li> <li><tt>B</tt> reiser4, extents only</li> <li><tt>C</tt> ext3 in <tt>data=writeback</tt> mode (metadata-only journalling)</li> <li><tt>D</tt> ext3 in <tt>data=journal</tt> mode</li> <li><tt>E</tt> ext3 in <tt>data=ordered</tt> mode</li> <li><tt>F</tt> ext3 with htree (hashed directories)</li> </ul> <p> The table presents absolute values (elapsed time, CPU usage, and disk usage) for reiser4, and ratios against reiser4 for each of the other configurations. A <font color=red>red</font> number means the ratio is larger than <tt>1.0</tt>, i.e. reiser4 wins that test; a <font color=green>green</font> number means reiser4 loses it. </p> <table cols=19 cellpadding=2 cellspacing=2 noborder> <tr><td bgcolor=black colspan=19><font color=white></td></tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>A.INFO_R4= FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>B.INFO_R4=ext MKFS=mkfs.reiser4 -q -o policy=extents FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>C.MOUNT_OPTIONS=data=writeback FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>D.MOUNT_OPTIONS=data=journal FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>E.MOUNT_OPTIONS=data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>F.MKFS=mkfs.ext3 -O dir_index MOUNT_OPTIONS=data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <td colspan=19 bgcolor=#606060><b><font color=white>#0:</font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td
colspan=6><b>REAL_TIME</b></td> <td colspan=6><b>CPU_TIME</b></td> <td colspan=6><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>CREATE</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 29.16</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.220 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.422 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.779 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.491 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.645 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>13.52</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.182 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.013 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.087 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.997 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.657</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 789364</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.208 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.181 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>COPY</b></td> <td bgcolor=#E0E0C0 
align=right><tt><U> 119.64</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.211 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.191 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.473 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.230 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 7.288 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>21.98</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.152 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.515 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.746 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.520 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.695</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1578116</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.208 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.182 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>READ</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 116.55</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.213 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.177 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.025 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.134 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 6.850 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>18.35</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.035 </font></tt></td> <td 
bgcolor=#E0E0C0 align=right><tt><font color=green> 0.447 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.436</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.431</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.569 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1578116</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.208 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.182 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>STATS</b></td> <td bgcolor=#E0E0C0 align=right><tt>21.65</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.050 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.779</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.811 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.782</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.358 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>7.56</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.001 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.599</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.612 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.611</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.638 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1578116</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.208 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 
</font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.182 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>DELETE</b></td> <td bgcolor=#E0E0C0 align=right><tt>112.37</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.434 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.179</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.198 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.177</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.281 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>30.62</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.851 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.205</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.205</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.203</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.230 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>4</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.000 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> </tt></td> </tr> <tr> <td colspan=19 bgcolor=#606060><b><font color=white>#1:DD_MBCOUNT=768 </font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=6><b>REAL_TIME</b></td> <td colspan=6><b>CPU_TIME</b></td> <td colspan=6><b>DF</b></td> 
</tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_writing_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 26.11</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.011</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.090 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.388 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.076 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.083 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>3.25</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.945</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.092 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.640 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.255 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.231 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 786436</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_reading_largefile</b></td> <td bgcolor=#E0E0C0 
align=right><tt><U> 19.09</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.005</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 0.999</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 0.996</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.004</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.011</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>2.09</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.019 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.847</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.856 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.833</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.842</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 786436</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr><td bgcolor=black colspan=19><font color=white></td></tr> <tr><td colspan=19 align=right> <tr> <td colspan=19 bgcolor=#303030><b><font color=white>NPROC=1 DIR=/mnt/testfs SYNC=off PHASE_COPY=cp REP_COUNTER=1 GAMMA=0.2 PHASE_OVERWRITE=off FILE_SIZE=4000 BYTES=512000000 PHASE_APPEND=off PHASE_READ=find DEV=/dev/hdb3 DD_MBCOUNT=768 WRITE_BUFFER=131072 PHASE_DELETE=rm PHASE_MODIFY=off </td></tr> <tr><td colspan=19 align=right> <font size=-2>Produced by <a 
href=http://namesys.com/benchmarks/mongo_readme.html>Mongo</a> benchmark suite.</font></td></tr> </table> === mongo, 2003-08-12 === [[mongo]] comparison against ext3 <dl> <dt>mem total</dt> <dd>513284</dd> <dt>machine </dt> <dd>strelka</dd> <dt>kernel </dt> <dd>2.6.0-test2 #52 SMP Tue Aug 12 15:17:12 MSD 2003</dd> <dt>date </dt> <dd>Tue Aug 12 15:38:47 2003</dd> </dl> <p> This is a comparison of the latest (2003-08-12) version of reiser4 with ext3. Reiser4 is an atomic filesystem, so the comparison against ext3's data-journaling mode is the fairest; but since most users run ext3 in data-ordered mode, we compare against that as well. </p> <p> In this test 80% of files are chosen from the 0-8k size range, 16% from the 0-80k size range, 3.2% (i.e. 0.8 × 4%) from the 0-800k size range, and so on. Most files are small; most bytes are in large files. </p> <p>Legend:</p> <ul> <li><tt>A</tt> reiser4</li> <li><tt>B</tt> ext3 in <tt>data=writeback</tt> mode (metadata-only journalling)</li> <li><tt>C</tt> ext3 in <tt>data=journal</tt> mode</li> <li><tt>D</tt> ext3 in <tt>data=ordered</tt> mode</li> <li><tt>E</tt> ext3 with htree (hashed directories)</li> <li><tt>F</tt> ext3 with support for filetypes in <tt>readdir()</tt></li> </ul> <p> The table presents absolute values (elapsed time, CPU usage, and disk usage) for reiser4, and ratios relative to reiser4 for all other configurations. A <font color=red>red</font> number means the ratio is larger than <tt>1.0</tt>, i.e. reiser4 wins that test; a <font color=green>green</font> number means reiser4 loses.
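</p>
<p> The size mixture just described can be sketched as a geometric escalation over size decades. The snippet below is a minimal illustrative sketch, not the actual mongo source; the names <tt>GAMMA</tt>, <tt>BASE</tt>, and <tt>sample_file_size</tt> are hypothetical, and only the 80% / 16% / 3.2% shares come from the description above. </p>

```python
import random

# Illustrative sketch (an assumption, not the actual mongo implementation):
# with GAMMA = 0.2 and an 8k base range, 80% of files fall in 0-8k,
# 16% in 0-80k, 3.2% in 0-800k, and so on -- each successive decade
# receives GAMMA times the previous decade's share.
GAMMA = 0.2
BASE = 8 * 1024  # upper bound of the median size range (FILE_SIZE=8192)

def sample_file_size(rng=random):
    """Pick a size decade geometrically, then a uniform size within it."""
    limit = BASE
    # With probability GAMMA, escalate to the next decade (x10), repeatedly.
    while rng.random() < GAMMA:
        limit *= 10
    return rng.randint(0, limit)
```

<p> Setting <tt>GAMMA</tt> to 0 would make the loop never escalate, so every file comes from the base range alone, matching the "in isolation" runs mentioned later on this page. </p>
<p>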
</p> <table cols=19 cellpadding=2 cellspacing=2 noborder> <tr><td bgcolor=black colspan=19><font color=white></td></tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>A.INFO_R4= MKFS=/usr/local/sbin/mkfs.reiser4 -qf FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>B.MOUNT_OPTIONS=data=writeback FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>C.MOUNT_OPTIONS=data=journal FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>D.MOUNT_OPTIONS=data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>E.MKFS=/usr/local/sbin/mkfs.ext3 -O dir_index MOUNT_OPTIONS=data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>F.MKFS=/usr/local/sbin/mkfs.ext3 -O filetype MOUNT_OPTIONS=data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <td colspan=19 bgcolor=#606060><b><font color=white>#0:</font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=6><b>REAL_TIME</b></td> <td colspan=6><b>CPU_TIME</b></td> <td colspan=6><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>CREATE</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 14.06</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.317 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.248 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.050 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font 
color=red> 3.016 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.077 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>5.3</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.558 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.692 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.602 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.823</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.592 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 458224</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>COPY</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 43.62</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.982 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.733 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.033 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 6.685 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.904 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>9.19</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.163 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.286 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.230 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.706</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.200 </font></tt></td> </tt></td> 
<td bgcolor=#E0E0C0 align=right><tt><U> 916172</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>READ</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 39.86</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.091 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.091 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.140 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 6.003 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.119 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>8.22</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.467 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.454 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.464 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.529 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.443</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 916172</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>STATS</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1.54</U></tt></td> <td 
bgcolor=#E0E0C0 align=right><tt><font color=red> 1.987 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.896 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.942 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.649 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.883 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 0.26</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.115 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.115 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.115 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.385 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.962 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 916172</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>DELETE</b></td> <td bgcolor=#E0E0C0 align=right><tt>37.85</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.833 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.825 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.867 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.133 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.760</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>11.11</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.223</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font 
color=green> <U> 0.223</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.220</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.254 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.222</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>4</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> </tt></td> </tr> <tr> <td colspan=19 bgcolor=#606060><b><font color=white>#1:DD_MBCOUNT=500 </font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=6><b>REAL_TIME</b></td> <td colspan=6><b>CPU_TIME</b></td> <td colspan=6><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_writing_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 42.15</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.062 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.534 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.066 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.071 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.073 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 
align=right><tt><U> 7.86</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.094 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.500 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.206 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.211 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.198 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 512004</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_reading_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 36.5</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.005</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.008</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.005</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.007</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.007</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>4.7</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.745</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.732</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.743</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.736</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.734</U> </font></tt></td> 
</tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 512004</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr><td bgcolor=black colspan=19><font color=white></td></tr> <tr><td colspan=19 align=right> <tr> <td colspan=19 bgcolor=#303030><b><font color=white>NPROC=1 DIR=/data1 SYNC=off PHASE_COPY=cp REP_COUNTER=3 GAMMA=0.2 PHASE_OVERWRITE=off PHASE_STATS=find FILE_SIZE=8192 BYTES=134217728 PHASE_APPEND=off PHASE_READ=find DEV=/dev/hdb1 DD_MBCOUNT=500 WRITE_BUFFER=131072 PHASE_DELETE=rm PHASE_MODIFY=off </td></tr> <tr><td colspan=19 align=right> <font size=-2>Produced by <a href=http://namesys.com/benchmarks/mongo_readme.html>Mongo</a> benchmark suite.</font></td></tr> </table> === mongo 2003-07-10 === [[mongo]] comparison, reiser4 vs. ext3 (see the table legend), 2003-07-10, obtained before [http://mail.fsfeurope.org/pipermail/booth/2003-February/000083.html LinuxTAG 2003] <table cols=10 cellpadding=2 cellspacing=2 noborder> <tr><td bgcolor=black colspan=10><font color=white></td></tr> <tr> <th bgcolor=#303030 colspan=10 align=left><font color=white>A. reiser4</th> </tr> <tr> <th bgcolor=#303030 colspan=10 align=left><font color=white>B. ext3 data journalling</th> </tr> <tr> <th bgcolor=#303030 colspan=10 align=left><font color=white>C.
ext3 </font></th> </tr> <tr> <td colspan=10 bgcolor=#606060><b><font color=white>#0:</font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=3><b>REAL_TIME</b></td> <td colspan=3><b>CPU_TIME</b></td> <td colspan=3><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>CREATE</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 14.19</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.221 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.592 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 5.66</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.610 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.475 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 458692</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>COPY</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 49.01</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.586 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.783 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 9.08</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.308 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.176 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 916668</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>READ</b></td> <td bgcolor=#E0E0C0 
align=right><tt>43.39</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.970</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.017 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>8.1</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.452</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.453</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 916668</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>STATS</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1.93</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.534 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.549 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 0.27</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.000 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.963 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 916668</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>DELETE</b></td> <td bgcolor=#E0E0C0 align=right><tt>40.13</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.797</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.837 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>11.26</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.217 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.210</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>4</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font 
color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> </tt></td> </tr> <tr> <td colspan=10 bgcolor=#606060><b><font color=white>#1:DD_MBCOUNT=500 </font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=3><b>REAL_TIME</b></td> <td colspan=3><b>CPU_TIME</b></td> <td colspan=3><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_writing_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 42.27</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.527 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.057 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 7.78</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.497 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.189 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 512004</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_reading_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 36.57</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.005</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.005</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>4.8</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.760</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.777 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 512004</U></tt></td> <td bgcolor=#E0E0C0 
align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr><td bgcolor=black colspan=10><font color=white></td></tr> <tr><td colspan=10 align=right> <tr> <td colspan=10 bgcolor=#303030><b><font color=white>NPROC=1 DIR=/data1 SYNC=off PHASE_COPY=cp REP_COUNTER=3 GAMMA=0.2 PHASE_OVERWRITE=off PHASE_STATS=find FILE_SIZE=8192 BYTES=134217728 PHASE_APPEND=off PHASE_READ=find DEV=/dev/hdb1 DD_MBCOUNT=500 WRITE_BUFFER=131072 PHASE_DELETE=rm PHASE_MODIFY=off </td></tr> <tr><td colspan=10 align=right> <font size=-2>Produced by <a href=http://namesys.com/benchmarks/mongo_readme.html>Mongo</a> benchmark suite.</font></td></tr> </table> <hr> <a name="mongo.2003.07.10"> <p> Below are some older benchmarks from just before LinuxTag. In these, note that gamma is the fraction of files that are 10x larger than the base size. It is set either to 0.2 (as in the benchmark above), to mimic observed real-world usage patterns, or to 0, to measure a single file-size range's performance in isolation. Note that V3 performs poorly in the 0-8k size range while V4 performs well; this is the result of deep design changes you can read about at <a href="http://www.namesys.com/v4/v4.html">http://www.namesys.com/v4/v4.html</a>. <dl><dt>mem total</dt><dd>513748</dd><dt>machine </dt><dd>strelka</dd><dt>kernel </dt><dd>2.5.74 #213 SMP Thu Jul 10 22:53:23 MSD 2003</dd><dt>date </dt><dd>Thu Jul 10 22:48:56 2003</dd><dt>.config </dt><dd><a href="http://www.namesys.com/intbenchmarks/mongo/03.07.11.nikita/.config">here</a></dd><dt>NPROC</dt><dd>1</dd><dt>DIR</dt><dd>/data1</dd><dt>SYNC</dt><dd>off</dd><dt>REP_COUNTER</dt><dd>3</dd><dt>All phases are in readdir order</dt><dd></dd><dt>BYTES</dt><dd>100M</dd><dt>DEV</dt><dd>/dev/hdb1</dd><dt>WRITE_BUFFER</dt><dd><b>256k</b></dd></dl> <p>Everywhere, <b>A</b> is reiserfs and <b>B</b> is reiser4.
Green numbers mean reiser4 is better.</p> <table cols="7" cellpadding="2" cellspacing="2" noborder=""> <tbody><tr><td bgcolor="black" colspan="7"><font color="white"></font></td></tr> <tr> <th bgcolor="#303030" colspan="7" align="left"><font color="white">median file size 8k</font></th> </tr> <tr align="center" bgcolor="#c0c0c0"> <td></td> <td colspan="2"><b>REAL_TIME</b></td> <td colspan="2"><b>CPU_TIME</b></td> <td colspan="2"><b>DF</b></td> </tr> <tr align="center" bgcolor="#c0c0c0"> <td></td> <td><b>A</b></td><td><b>B/A </b></td> <td><b>A</b></td><td><b>B/A </b></td> <td><b>A</b></td><td><b>B/A </b></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>CREATE</b></td> <td bgcolor="#e0e0c0" align="right"><tt>41.26</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.246</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>3.93</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.908</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>321632</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.961</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>COPY</b></td> <td bgcolor="#e0e0c0" align="right"><tt>154.09</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.504</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 5.17</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.217 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>642624</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.962</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>APPEND</b></td> <td bgcolor="#e0e0c0" align="right"><tt>282.09</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.573</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 6.6</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.392 </font></tt></td> <td bgcolor="#e0e0c0" 
align="right"><tt>944428</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 0.980</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>MODIFY</b></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 284.52</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 0.986</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 3.29</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.489 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 943592</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 0.981</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>OVERWRITE</b></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 298.19</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.263 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 5.33</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.608 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>943548</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.968</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>READ</b></td> <td bgcolor="#e0e0c0" align="right"><tt>245.22</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.940</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 3.85</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.753 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>943548</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.968</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>STATS</b></td> <td bgcolor="#e0e0c0" align="right"><tt>20.58</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.099</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 0.48</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.292 </font></tt></td> <td 
bgcolor="#e0e0c0" align="right"><tt>943548</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.968</u> </font></tt></td> </tr> <tr> <td colspan="7" bgcolor="#a0a0a0"><b><font color="white">GAMMA=0.2 FILE_SIZE=8192 <a href="http://www.namesys.com/intbenchmarks/mongo/03.07.11.nikita/8k.heavy.v3.profile">A profile</a> <a href="http://www.namesys.com/intbenchmarks/mongo/03.07.11.nikita/8k.heavy.v4.profile">B profile</a></font></b></td></tr> <tr><td bgcolor="white" colspan="7"><font color="white"></font></td></tr> <tr><td bgcolor="white" colspan="7"><font color="white"></font></td></tr> <tr><td bgcolor="white" colspan="7"><font color="white"></font></td></tr> <tr><td bgcolor="black" colspan="7"><font color="white"></font></td></tr> <tr> <th bgcolor="#303030" colspan="7" align="left"><font color="white">median file size 4k</font></th> </tr> <tr align="center" bgcolor="#c0c0c0"> <td></td> <td colspan="2"><b>REAL_TIME</b></td> <td colspan="2"><b>CPU_TIME</b></td> <td colspan="2"><b>DF</b></td> </tr> <tr align="center" bgcolor="#c0c0c0"> <td></td> <td><b>A</b></td><td><b>B/A </b></td> <td><b>A</b></td><td><b>B/A </b></td> <td><b>A</b></td><td><b>B/A </b></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>CREATE</b></td> <td bgcolor="#e0e0c0" align="right"><tt>117.32</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.176</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>15.57</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.758</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 667652</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.000</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>COPY</b></td> <td bgcolor="#e0e0c0" align="right"><tt>524.67</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.365</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 19.16</u></tt></td> <td bgcolor="#e0e0c0" 
align="right"><tt><font color="red"> 1.059 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 1332856</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.002</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>APPEND</b></td> <td bgcolor="#e0e0c0" align="right"><tt>1068.43</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.363</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>31.27</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.937</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>2073420</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.950</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>MODIFY</b></td> <td bgcolor="#e0e0c0" align="right"><tt>1081.23</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.670</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 18.61</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.048 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>2066536</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.953</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>OVERWRITE</b></td> <td bgcolor="#e0e0c0" align="right"><tt>1050.55</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.885</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 22.81</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.017</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>2066424</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.948</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>READ</b></td> <td bgcolor="#e0e0c0" align="right"><tt>974.43</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.644</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 
12.28</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.635 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>2066424</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.948</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>STATS</b></td> <td bgcolor="#e0e0c0" align="right"><tt>83.44</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.075</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>1.26</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.802</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>2066424</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.948</u> </font></tt></td> </tr> <tr> <td colspan="7" bgcolor="#a0a0a0"><b><font color="white">GAMMA=0.2 FILE_SIZE=4096 <a href="http://www.namesys.com/intbenchmarks/mongo/03.07.11.nikita/4k.heavy.v3.profile">A profile</a> <a href="http://www.namesys.com/intbenchmarks/mongo/03.07.11.nikita/4k.heavy.v4.profile">B profile</a></font></b></td></tr> <tr><td bgcolor="white" colspan="7"><font color="white"></font></td></tr> <tr><td bgcolor="white" colspan="7"><font color="white"></font></td></tr> <tr><td bgcolor="white" colspan="7"><font color="white"></font></td></tr> <tr><td bgcolor="black" colspan="7"><font color="white"></font></td></tr> <tr> <th bgcolor="#303030" colspan="7" align="left"><font color="white">maximal file size 4k</font></th> </tr> <tr align="center" bgcolor="#c0c0c0"> <td></td> <td colspan="2"><b>REAL_TIME</b></td> <td colspan="2"><b>CPU_TIME</b></td> <td colspan="2"><b>DF</b></td> </tr> <tr align="center" bgcolor="#c0c0c0"> <td></td> <td><b>A</b></td><td><b>B/A </b></td> <td><b>A</b></td><td><b>B/A </b></td> <td><b>A</b></td><td><b>B/A </b></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>CREATE</b></td> <td bgcolor="#e0e0c0" align="right"><tt>77.34</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.309</u> 
</font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>21.86</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.938</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>452252</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.923</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>COPY</b></td> <td bgcolor="#e0e0c0" align="right"><tt>412.28</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.300</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 35.11</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.013</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>893408</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.934</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>APPEND</b></td> <td bgcolor="#e0e0c0" align="right"><tt>1198.9</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.164</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>67.06</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.694</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>1631992</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.749</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>MODIFY</b></td> <td bgcolor="#e0e0c0" align="right"><tt>1305.14</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.351</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>43.77</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.762</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>1613124</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.758</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>OVERWRITE</b></td> <td bgcolor="#e0e0c0" align="right"><tt>1390.94</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font 
color="green"> <u> 0.239</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>44.22</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.777</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>1610948</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.759</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>READ</b></td> <td bgcolor="#e0e0c0" align="right"><tt>1093.6</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.256</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 19.46</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.743 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>1610948</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.759</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>STATS</b></td> <td bgcolor="#e0e0c0" align="right"><tt>115.76</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.200</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>2.6</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.735</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>1610948</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.759</u> </font></tt></td> </tr> <tr> <td colspan="7" bgcolor="#a0a0a0"><b><font color="white">GAMMA=0.0 FILE_SIZE=4096 <a href="http://www.namesys.com/intbenchmarks/mongo/03.07.11.nikita/100.heavy.v3.profile">A profile</a> <a href="http://www.namesys.com/intbenchmarks/mongo/03.07.11.nikita/100.heavy.v4.profile">B profile</a></font></b></td></tr> <tr><td bgcolor="white" colspan="7"><font color="white"></font></td></tr> <tr><td bgcolor="white" colspan="7"><font color="white"></font></td></tr> <tr><td bgcolor="white" colspan="7"><font color="white"></font></td></tr> <tr> <th bgcolor="#303030" colspan="7" align="left"><font color="white">median file size 8k</font></th> </tr> 
<tr align="center" bgcolor="#c0c0c0"> <td></td> <td colspan="2"><b>REAL_TIME</b></td> <td colspan="2"><b>CPU_TIME</b></td> <td colspan="2"><b>DF</b></td> </tr> <tr align="center" bgcolor="#c0c0c0"> <td></td> <td><b>A</b></td><td><b>B/A </b></td> <td><b>A</b></td><td><b>B/A </b></td> <td><b>A</b></td><td><b>B/A </b></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>CREATE</b></td> <td bgcolor="#e0e0c0" align="right"><tt>40.54</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.248</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>4.01</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.895</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>321632</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.961</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>COPY</b></td> <td bgcolor="#e0e0c0" align="right"><tt>152.82</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.506</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 5.2</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.215 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>642624</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.962</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>READ</b></td> <td bgcolor="#e0e0c0" align="right"><tt>141.8</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.563</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 3.03</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.762 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>642624</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.962</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>STATS</b></td> <td bgcolor="#e0e0c0" align="right"><tt>14.91</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.084</u> 
</font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 0.59</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.051 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>642624</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.962</u> </font></tt></td> </tr> <tr><td bgcolor="black" colspan="7"><font color="white"></font></td></tr> <tr><td colspan="7" align="right"> </td></tr><tr> <td colspan="7" bgcolor="#303030"><b><font color="white">GAMMA=0.2 FILE_SIZE=8192</font></b></td></tr> <tr><td bgcolor="white" colspan="7"><font color="white"></font></td></tr> <tr><td bgcolor="white" colspan="7"><font color="white"></font></td></tr> <tr><td bgcolor="white" colspan="7"><font color="white"></font></td></tr> <tr> <th bgcolor="#303030" colspan="7" align="left"><font color="white">median file size 4k</font></th> </tr> <tr align="center" bgcolor="#c0c0c0"> <td></td> <td colspan="2"><b>REAL_TIME</b></td> <td colspan="2"><b>CPU_TIME</b></td> <td colspan="2"><b>DF</b></td> </tr> <tr align="center" bgcolor="#c0c0c0"> <td></td> <td><b>A</b></td><td><b>B/A </b></td> <td><b>A</b></td><td><b>B/A </b></td> <td><b>A</b></td><td><b>B/A </b></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>CREATE</b></td> <td bgcolor="#e0e0c0" align="right"><tt>115.6</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.174</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>14.84</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.772</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 667652</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.000</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>COPY</b></td> <td bgcolor="#e0e0c0" align="right"><tt>528.83</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.361</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 18.91</u></tt></td> <td bgcolor="#e0e0c0" 
align="right"><tt><font color="red"> 1.058 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 1332856</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.002</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>READ</b></td> <td bgcolor="#e0e0c0" align="right"><tt>532.06</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.372</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 10.87</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.589 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 1332856</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.002</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>STATS</b></td> <td bgcolor="#e0e0c0" align="right"><tt>51.99</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.069</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>1.67</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.581</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 1332856</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.002</u> </font></tt></td> </tr> <tr><td bgcolor="black" colspan="7"><font color="white"></font></td></tr> <tr><td colspan="7" align="right"> </td></tr><tr> <td colspan="7" bgcolor="#303030"><b><font color="white">GAMMA=0.2 FILE_SIZE=4096</font></b></td></tr> <tr><td bgcolor="white" colspan="7"><font color="white"></font></td></tr> <tr><td bgcolor="white" colspan="7"><font color="white"></font></td></tr> <tr><td bgcolor="white" colspan="7"><font color="white"></font></td></tr> <tr> <th bgcolor="#303030" colspan="7" align="left"><font color="white">maximal file size 4k</font></th> </tr> <tr align="center" bgcolor="#c0c0c0"> <td></td> <td colspan="2"><b>REAL_TIME</b></td> <td colspan="2"><b>CPU_TIME</b></td> <td colspan="2"><b>DF</b></td> </tr> <tr align="center" bgcolor="#c0c0c0"> 
<td></td> <td><b>A</b></td><td><b>B/A </b></td> <td><b>A</b></td><td><b>B/A </b></td> <td><b>A</b></td><td><b>B/A </b></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>CREATE</b></td> <td bgcolor="#e0e0c0" align="right"><tt>77.5</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.309</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>22.24</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.910</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>452252</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.923</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>COPY</b></td> <td bgcolor="#e0e0c0" align="right"><tt>415.84</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.297</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 34.9</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.009</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>893408</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.934</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>READ</b></td> <td bgcolor="#e0e0c0" align="right"><tt>469.97</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.273</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 20.14</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.454 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>893408</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.934</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>STATS</b></td> <td bgcolor="#e0e0c0" align="right"><tt>65.49</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.162</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>3.09</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.599</u> </font></tt></td> <td bgcolor="#e0e0c0" 
align="right"><tt>893408</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.934</u> </font></tt></td> </tr> <tr><td bgcolor="black" colspan="7"><font color="white"></font></td></tr> <tr><td colspan="7" align="right"> </td></tr><tr> <td colspan="7" bgcolor="#303030"><b><font color="white">GAMMA=0.0 FILE_SIZE=4096</font></b></td></tr> </tbody></table> <hr> <h1>Mongo benchmark results</h1> <h2>create, copy, read, stats, delete phases</h2> <dl><dt>reiser4 </dt><dd>ChangeSet@1.1095, 2003-07-10 15:22:17+04:00, god@laputa.namesys.com oops; ChangeSet@1.1094, 2003-07-10 15:14:06+04:00, god@laputa.namesys.com repairing compilation damage. </dd><dt>mem total</dt><dd>256624</dd><dt>machine </dt><dd>belka</dd><dt>kernel </dt><dd>2.5.74 #28 Thu Jul 10 18:36:03 MSD 2003</dd><dt>date </dt><dd>Thu Jul 10 19:21:06 2003</dd><dt><a href="http://namesys.com/intbenchmarks/mongo/03.07.11.light/dot.config">.config</a></dt></dl> <table cols="19" cellpadding="2" cellspacing="2" noborder=""> <tbody><tr><td bgcolor="black" colspan="19"><font color="white"></font></td></tr> <tr> <th bgcolor="#303030" colspan="19" align="left"><font color="white">A: INFO_R4=test FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor="#303030" colspan="19" align="left"><font color="white">B: INFO_R4=test FSTYPE=reiser4 MKFS=mkfs.reiser4 -q -e extent40 </font></th> </tr> <tr> <th bgcolor="#303030" colspan="19" align="left"><font color="white">C: FSTYPE=reiserfs </font></th> </tr> <tr> <th bgcolor="#303030" colspan="19" align="left"><font color="white">D: FSTYPE=reiserfs MOUNT_OPTIONS=notail </font></th> </tr> <tr> <th bgcolor="#303030" colspan="19" align="left"><font color="white">E: FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor="#303030" colspan="19" align="left"><font color="white">F: FSTYPE=ext3 MOUNT_OPTIONS=data=journal </font></th> </tr> <tr> <td colspan="19" bgcolor="#606060"><b><font color="white">#0:FILE_SIZE=4000 </font></b></td></tr> <tr align="center" bgcolor="#c0c0c0"> <td></td> <td
colspan="6"><b>REAL_TIME</b></td> <td colspan="6"><b>CPU_TIME</b></td> <td colspan="6"><b>DF</b></td> </tr> <tr align="center" bgcolor="#c0c0c0"> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>CREATE</b></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 20.47</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.404 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 3.037 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 2.024 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 2.513 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 3.324 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>12.72</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.143 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.270 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.873 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.615</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.606</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 416332</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.934 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.088 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.909 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.858 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.858 
</font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>COPY</b></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 65.25</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.484 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 2.953 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 2.020 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.986 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 2.267 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>21.98</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.032 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.098 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.732 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.529</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.699 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 832640</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.934 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.088 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.910 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.858 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.858 </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>READ</b></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 75.56</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.349 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 2.868 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 2.218 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.902 </font></tt></td> <td bgcolor="#e0e0c0" 
align="right"><tt><font color="red"> 1.925 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>17.36</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.213 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.745 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.857 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.695 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.681</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 832640</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.934 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.088 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.910 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.858 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.858 </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>STATS</b></td> <td bgcolor="#e0e0c0" align="right"><tt>132.18</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> 0.996 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.963</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> 0.994 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.967</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.950</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>2.63</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.977</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.970</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 0.989</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 0.981</u> 
</font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> 1.008 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 832640</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.934 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.088 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.910 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.858 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.858 </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>DELETE</b></td> <td bgcolor="#e0e0c0" align="right"><tt>85.32</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.627 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.239 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.442 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.403</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.449 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>33.57</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.856 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.780 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.623 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.157</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.154</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>4</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> 1.000 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.000</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.000</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font 
color="green"> <u> 0.000</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.000</u> </font></tt></td> </tr> <tr> <td colspan="19" bgcolor="#606060"><b><font color="white">#1:FILE_SIZE=8000 </font></b></td></tr> <tr align="center" bgcolor="#c0c0c0"> <td></td> <td colspan="6"><b>REAL_TIME</b></td> <td colspan="6"><b>CPU_TIME</b></td> <td colspan="6"><b>DF</b></td> </tr> <tr align="center" bgcolor="#c0c0c0"> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>CREATE</b></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 15.07</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.009</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 8.875 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.709 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 2.237 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 3.321 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>8.62</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.945 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.932 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.729 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.517</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.522</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 399788</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.000</u> 
</font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.243 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.461 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.434 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.434 </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>COPY</b></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 52.24</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.007</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 4.998 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.492 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.562 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.879 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>13.42</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.026 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.264 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.700 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.487</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.635 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 799488</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.000</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.243 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.461 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.434 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.434 </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>READ</b></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 60.91</u></tt></td> <td bgcolor="#e0e0c0" 
align="right"><tt><font color="black"> <u> 1.013</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 3.738 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.606 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.333 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.340 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>11.66</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> 1.018 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.526</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.749 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.547 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.547 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 799488</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.000</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.243 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.461 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.434 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.434 </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>STATS</b></td> <td bgcolor="#e0e0c0" align="right"><tt>126.53</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.951</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.958</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> 0.991 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> 1.004 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.966</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 
2.57</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.023 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.027 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 0.988</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> 1.016 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> 1.012 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 799488</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.000</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.243 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.461 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.434 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.434 </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>DELETE</b></td> <td bgcolor="#e0e0c0" align="right"><tt>73.21</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.116 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.746 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.242</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.301 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.396 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>19.93</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> 1.013 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.584 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.530 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.126 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.123</u> </font></tt></td> <td bgcolor="#e0e0c0" 
align="right"><tt>4</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> 1.000 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.000</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.000</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.000</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.000</u> </font></tt></td> </tr> <tr><td bgcolor="black" colspan="19"><font color="white"></font></td></tr> <tr><td colspan="19" align="right"> </td></tr><tr> <td colspan="19" bgcolor="#303030"><b><font color="white">PHASE_APPEND=off NPROC=1 DIR=/mnt/testfs SYNC=off REP_COUNTER=3 GAMMA=0.0 PHASE_OVERWRITE=off DEV=/dev/hdb3 WRITE_BUFFER=4096 BYTES=128000000 PHASE_MODIFY=off </font></b></td></tr> <tr><td colspan="19" align="right"> <font size="-2">Produced by <a href="http://namesys.com/benchmarks/mongo_readme.html">Mongo</a> benchmark suite.</font></td></tr> </tbody></table> <h2>dd of a large file phase</h2> <dl><dt>reiser4 </dt><dd>ChangeSet@1.1095, 2003-07-10 15:22:17+04:00, god@laputa.namesys.com oops ChangeSet@1.1094, 2003-07-10 15:14:06+04:00, god@laputa.namesys.com repairing compilation damage. 
</dd><dt>mem total</dt><dd>256624</dd><dt>machine </dt><dd>belka</dd><dt>kernel </dt><dd>2.5.74 #28 Thu Jul 10 18:36:03 MSD 2003</dd><dt>date </dt><dd>Thu Jul 10 21:36:22 2003</dd><dt><a href="http://namesys.com/intbenchmarks/mongo/03.07.11.light/dot.config">.config</a></dt></dl> <table cols="19" cellpadding="2" cellspacing="2" noborder=""> <tbody><tr><td bgcolor="black" colspan="19"><font color="white"></font></td></tr> <tr> <th bgcolor="#303030" colspan="19" align="left"><font color="white">A.INFO_R4=test FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor="#303030" colspan="19" align="left"><font color="white">B.INFO_R4=test FSTYPE=reiser4 MKFS=mkfs.reiser4 -q -e extent40 </font></th> </tr> <tr> <th bgcolor="#303030" colspan="19" align="left"><font color="white">C.FSTYPE=reiserfs </font></th> </tr> <tr> <th bgcolor="#303030" colspan="19" align="left"><font color="white">D.FSTYPE=reiserfs MOUNT_OPTIONS=notail </font></th> </tr> <tr> <th bgcolor="#303030" colspan="19" align="left"><font color="white">E.FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor="#303030" colspan="19" align="left"><font color="white">F.FSTYPE=ext3 MOUNT_OPTIONS=data=journal </font></th> </tr> <tr> <td colspan="19" bgcolor="#606060"><b><font color="white">#0:DD_MBCOUNT=768 </font></b></td></tr> <tr align="center" bgcolor="#c0c0c0"> <td></td> <td colspan="6"><b>REAL_TIME</b></td> <td colspan="6"><b>CPU_TIME</b></td> <td colspan="6"><b>DF</b></td> </tr> <tr align="center" bgcolor="#c0c0c0"> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>dd_writing_largefile</b></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 76.29</u></tt></td> <td bgcolor="#e0e0c0" 
align="right"><tt><font color="black"> <u> 0.997</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.137 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.149 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.062 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 2.217 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>7.47</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.027 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.545</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.549</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.803 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.835 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 786432</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.000</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.001</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.001</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.001</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.001</u> </font></tt></td> </tr> <tr><td bgcolor="black" colspan="19"><font color="white"></font></td></tr> <tr><td colspan="19" align="right"> </td></tr><tr> <td colspan="19" bgcolor="#303030"><b><font color="white">NPROC=1 DIR=/mnt/testfs SYNC=off REP_COUNTER=3 GAMMA=0.0 DD_MBCOUNT=768 DEV=/dev/hdb3 WRITE_BUFFER=4096 FILE_SIZE=8000 BYTES=128000000 </font></b></td></tr> <tr><td colspan="19" align="right"> <font size="-2">Produced by <a href="http://namesys.com/benchmarks/mongo_readme.html">Mongo</a> benchmark suite.</font></td></tr> </tbody></table> === bonnie++ 2003-09-30 === Bonnie++ 
comparison, ext3 vs reiser4 (2003-09-30) This is bonnie++ output for reiser4 and ext3. This has been done in an attempt to analyze <a href="http://fsbench.netnation.com/">results</a> obtained by Mike Benoit. Hardware specs: <pre> processor : 3 vendor_id : GenuineIntel cpu family : 15 model : 2 model name : Intel(R) Xeon(TM) CPU 2.40GHz stepping : 7 cpu MHz : 2379.253 cache size : 512 KB bogomips : 4751.36 </pre> Dual CPU with hyper-threading Memory: 128M HDD: <pre> # hdparm /dev/hdb1 /dev/hdb1: multcount = 16 (on) IO_support = 0 (default 16-bit) unmaskirq = 0 (off) using_dma = 1 (on) keepsettings = 0 (off) readonly = 0 (off) readahead = 256 (on) geometry = 65535/16/63, sectors = 117226242, start = 63 # hdparm -t /dev/hdb1 /dev/hdb1: Timing buffered disk reads: 64 MB in 1.60 seconds = 39.91 MB/sec # hdparm -i /dev/hdb /dev/hdb: Model=ST360021A, FwRev=3.19, SerialNo=3HR173RB Config={ HardSect NotMFM HdSw>15uSec Fixed DTR>10Mbs RotSpdTol>.5% } RawCHS=16383/16/63, TrkSize=0, SectSize=0, ECCbytes=4 BuffType=unknown, BuffSize=2048kB, MaxMultSect=16, MultSect=16 CurCHS=16383/16/63, CurSects=16514064, LBA=yes, LBAsects=117231408 IORDY=on/off, tPIO={min:240,w/IORDY:120}, tDMA={min:120,rec:120} PIO modes: pio0 pio1 pio2 pio3 pio4 DMA modes: mdma0 mdma1 mdma2 UDMA modes: udma0 udma1 udma2 udma3 udma4 *udma5 AdvancedPM=no WriteCache=enabled Drive conforms to: device does not report version: 1 2 3 4 5 </pre> <pre> ./bonnie++ -s 1g -n 10 -x 5 Version 1.03 ------Sequential Output------ --Sequential Input- --Random- -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks-- Machine Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP v4.128M 1G 19903 89 37911 20 15392 11 13624 58 41807 12 131.0 0 v4.128M 1G 19965 89 37600 20 15845 11 13730 58 41751 12 130.0 0 v4.128M 1G 19937 89 37746 20 15404 11 13624 58 41793 12 132.1 0 v4.128M 1G 19998 89 37184 19 15007 10 13393 56 41611 11 130.2 0 v4.128M 1G 19771 89 37679 20 15206 11 13466 57 41808 11 130.2 1 ext3.128M 1G 21236 99 
37258 22 11357 4 13460 56 41748 6 120.0 0 ext3.128M 1G 20821 99 36838 23 12176 5 13154 55 40671 6 120.7 0 ext3.128M 1G 20755 99 37032 24 12069 4 12908 54 40851 5 120.2 0 ext3.128M 1G 20651 99 37094 24 11817 5 13038 54 40842 6 121.3 0 ext3.128M 1G 20928 99 37300 23 12287 4 13067 55 41404 6 120.1 0 ------Sequential Create------ --------Random Create-------- -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete-- files:max:min /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP v4.128M 10 18503 100 +++++ +++ 9488 99 10158 99 +++++ +++ 11635 99 v4.128M 10 19760 99 +++++ +++ 9696 99 10441 100 +++++ +++ 11831 99 v4.128M 10 19583 100 +++++ +++ 9672 100 10597 99 +++++ +++ 11846 100 v4.128M 10 19720 100 +++++ +++ 9577 99 10126 100 +++++ +++ 11924 100 v4.128M 10 19682 100 +++++ +++ 9683 100 10461 100 +++++ +++ 11834 100 ext3.128M 10 3279 97 +++++ +++ +++++ +++ 3406 100 +++++ +++ 8951 95 ext3.128M 10 3303 98 +++++ +++ +++++ +++ 3423 99 +++++ +++ 8558 96 ext3.128M 10 3317 98 +++++ +++ +++++ +++ 3402 100 +++++ +++ 8721 93 ext3.128M 10 3325 98 +++++ +++ +++++ +++ 3390 100 +++++ +++ 9242 100 ext3.128M 10 3315 97 +++++ +++ +++++ +++ 3439 100 +++++ +++ 8896 96 </pre> <pre> ./bonnie++ -f -d . 
-s 3072 -n 10:100000:10:10 -x 1 Version 1.03 ------Sequential Output------ --Sequential Input- --Random- -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks-- Machine Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP v4 3G 37579 19 15657 11 41531 11 105.8 0 v4 3G 37993 20 15478 11 41632 11 105.4 0 ext3 3G 35221 22 10987 4 41105 6 90.9 0 ext3 3G 35099 22 11517 4 41416 6 90.7 0 ------Sequential Create------ --------Random Create-------- -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete-- files:max:min /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP v4 10:100000:10/10 570 39 746 17 1435 23 513 40 104 2 951 15 v4 10:100000:10/10 566 40 765 17 1385 23 509 41 104 2 904 14 ext3 10:100000:10/10 221 8 364 4 853 4 204 7 99 1 306 2 ext3 10:100000:10/10 221 7 368 4 839 5 206 7 91 1 309 2 </pre> <hr> <a name="grant"></a> Benchmarks performed by <a href="mailto:mine0057@mrs.umn.edu">Grant Miner</a>. He used the <a href="http://epoxy.mrs.umn.edu/~minerg/fstests/bench.scm">bench.scm</a> script (requires <a href="http://www.scsh.net/">scsh</a>). Results (copied from <a href="http://epoxy.mrs.umn.edu/~minerg/fstests/results.html">http://epoxy.mrs.umn.edu/~minerg/fstests/results.html</a>): <p>2.6.0-test3</p> <p>mkfs ran with default options</p> <p>Each test has three columns: the first gives the canonical name of the test together with the time the test took in seconds, the second is system CPU time, and the third is user CPU time. The "total" column is the total elapsed time; "sys" is the total system time; "usr" is the total user time; "total cpu" is the sum of total system and total user time.
</p> <p><b>all values are in seconds thus lower is better</b></p> <table border cellspacing=0 cellpadding=5> <caption>Filesystem Performance</caption> <colgroup> <col> <col bgcolor="gray"> </colgroup> <tr> <th>fs</th> <td bgcolor="lightgray">bigdir</td> <td>sys</td> <td>usr</td> <td bgcolor="lightgray">cp</td> <td>sys</td> <td>usr</td> <td bgcolor="lightgray">cp2</td> <td>sys</td> <td>usr</td> <td bgcolor="lightgray">cp3</td> <td>sys</td> <td>usr</td> <td bgcolor="lightgray">cp4</td> <td>sys</td> <td>usr</td> <td bgcolor="lightgray">cp5</td> <td>sys</td> <td>usr</td> <td bgcolor="lightgray">rm</td> <td>sys</td> <td>usr</td> <td bgcolor="lightgray">rm2</td> <td>sys</td> <td>usr</td> <td bgcolor="lightgray">rm3</td> <td>sys</td> <td>usr</td> <td bgcolor="lightgray">sync</td> <td>sys</td> <td>usr</td> <td bgcolor="lightgray">total</td> <td>sys</td> <td>usr</td> <td bgcolor="lightgray">total cpu</td> <th>fs</th> </tr> <tr> <th>reiserfs</th> <td bgcolor="lightgray">40.03</td> <td>12.22</td> <td>0.76</td> <td bgcolor="lightgray">77.75</td> <td>10.72</td> <td>0.45</td> <td bgcolor="lightgray">62.9</td> <td>10.82</td> <td>0.43</td> <td bgcolor="lightgray">60.26</td> <td>11.03</td> <td>0.43</td> <td bgcolor="lightgray">61.33</td> <td>11.13</td> <td>0.43</td> <td bgcolor="lightgray">66.08</td> <td>11.31</td> <td>0.45</td> <td bgcolor="lightgray">10.86</td> <td>3.74</td> <td>0.07</td> <td bgcolor="lightgray">4.62</td> <td>3.36</td> <td>0.09</td> <td bgcolor="lightgray">8.22</td> <td>3.5</td> <td>0.09</td> <td bgcolor="lightgray">1.78</td> <td>0.03</td> <td>0.</td> <td bgcolor="lightgray">393.83</td> <td>77.86</td> <td>3.2</td> <td bgcolor="lightgray">81.06</td> <th>reiserfs</th> </tr> <tr> <th>jfs</th> <td bgcolor="lightgray">47.2</td> <td>8.9</td> <td>0.77</td> <td bgcolor="lightgray">109.75</td> <td>5.5</td> <td>0.3</td> <td bgcolor="lightgray">110.71</td> <td>5.49</td> <td>0.35</td> <td bgcolor="lightgray">114.69</td> <td>5.6</td> <td>0.29</td> <td 
bgcolor="lightgray">117.97</td> <td>5.65</td> <td>0.35</td> <td bgcolor="lightgray">125.48</td> <td>5.82</td> <td>0.29</td> <td bgcolor="lightgray">38.68</td> <td>0.74</td> <td>0.05</td> <td bgcolor="lightgray">16.25</td> <td>1.08</td> <td>0.07</td> <td bgcolor="lightgray">37.46</td> <td>0.74</td> <td>0.04</td> <td bgcolor="lightgray">0.07</td> <td>0.</td> <td>0.</td> <td bgcolor="lightgray">718.26</td> <td>39.52</td> <td>2.51</td> <td bgcolor="lightgray">42.03</td> <th>jfs</th> </tr> <tr> <th>xfs</th> <td bgcolor="lightgray">44.77</td> <td>13.3</td> <td>0.94</td> <td bgcolor="lightgray">105.36</td> <td>13.33</td> <td>0.53</td> <td bgcolor="lightgray">110.27</td> <td>14.36</td> <td>0.5</td> <td bgcolor="lightgray">110.17</td> <td>14.37</td> <td>0.51</td> <td bgcolor="lightgray">111.03</td> <td>14.43</td> <td>0.53</td> <td bgcolor="lightgray">118.84</td> <td>14.87</td> <td>0.55</td> <td bgcolor="lightgray">31.85</td> <td>6.44</td> <td>0.15</td> <td bgcolor="lightgray">15.2</td> <td>5.45</td> <td>0.14</td> <td bgcolor="lightgray">34.32</td> <td>5.87</td> <td>0.14</td> <td bgcolor="lightgray">0.03</td> <td>0.</td> <td>0.</td> <td bgcolor="lightgray">681.84</td> <td>102.42</td> <td>3.99</td> <td bgcolor="lightgray">106.41</td> <th>xfs</th> </tr> <tr> <th>reiser4</th> <td bgcolor="lightgray">33.51</td> <td>10.85</td> <td>0.69</td> <td bgcolor="lightgray">33.9</td> <td>10.65</td> <td>0.65</td> <td bgcolor="lightgray">32.9</td> <td>10.79</td> <td>0.67</td> <td bgcolor="lightgray">34.</td> <td>10.87</td> <td>0.65</td> <td bgcolor="lightgray">33.62</td> <td>10.87</td> <td>0.69</td> <td bgcolor="lightgray">31.31</td> <td>10.83</td> <td>0.76</td> <td bgcolor="lightgray">17.45</td> <td>4.07</td> <td>0.3</td> <td bgcolor="lightgray">11.54</td> <td>4.49</td> <td>0.3</td> <td bgcolor="lightgray">13.08</td> <td>4.27</td> <td>0.27</td> <td bgcolor="lightgray">0.52</td> <td>0.</td> <td>0.</td> <td bgcolor="lightgray">241.83</td> <td>77.69</td> <td>4.98</td> <td 
bgcolor="lightgray">82.67</td> <th>reiser4</th> </tr> <tr> <th>ext3</th> <td bgcolor="lightgray">38.79</td> <td>9.35</td> <td>0.7</td> <td bgcolor="lightgray">91.57</td> <td>7.21</td> <td>0.36</td> <td bgcolor="lightgray">62.6</td> <td>7.44</td> <td>0.36</td> <td bgcolor="lightgray">62.74</td> <td>7.5</td> <td>0.37</td> <td bgcolor="lightgray">60.62</td> <td>7.52</td> <td>0.34</td> <td bgcolor="lightgray">69.82</td> <td>7.59</td> <td>0.39</td> <td bgcolor="lightgray">26.21</td> <td>1.67</td> <td>0.05</td> <td bgcolor="lightgray">8.73</td> <td>1.66</td> <td>0.04</td> <td bgcolor="lightgray">13.79</td> <td>1.63</td> <td>0.06</td> <td bgcolor="lightgray">4.76</td> <td>0.01</td> <td>0.</td> <td bgcolor="lightgray">439.63</td> <td>51.58</td> <td>2.67</td> <td bgcolor="lightgray">54.25</td> <th>ext3</th> </tr> <tr> <th>ext2</th> <td bgcolor="lightgray">32.78</td> <td>7.61</td> <td>0.64</td> <td bgcolor="lightgray">37.28</td> <td>5.24</td> <td>0.34</td> <td bgcolor="lightgray">43.55</td> <td>5.34</td> <td>0.35</td> <td bgcolor="lightgray">45.41</td> <td>5.34</td> <td>0.37</td> <td bgcolor="lightgray">47.72</td> <td>5.48</td> <td>0.34</td> <td bgcolor="lightgray">50.5</td> <td>5.41</td> <td>0.32</td> <td bgcolor="lightgray">16.28</td> <td>0.67</td> <td>0.06</td> <td bgcolor="lightgray">7.54</td> <td>0.66</td> <td>0.05</td> <td bgcolor="lightgray">15.31</td> <td>0.71</td> <td>0.05</td> <td bgcolor="lightgray">0.24</td> <td>0.</td> <td>0.</td> <td bgcolor="lightgray">296.61</td> <td>36.46</td> <td>2.52</td> <td bgcolor="lightgray">38.98</td> <th>ext2</th> </tr> </table> <hr> </body> </html> <hr> <address><a href="mailto:reiser@namesys.com">Hans Reiser</a></address> <!-- Created: Sat Aug 23 00:28:46 MSD 2003 --> <!-- hhmts start --> Last modified: Thu Nov 20 17:51:10 MSK 2003 <!-- hhmts end --> [[category:Reiser4]] [[category:formatting-fixes-needed]] 530fa137e99854c9e73d69cd5a4b6340271c352b 1625 1518 2009-08-31T07:20:04Z Chris goe 2 formatting fixes == Benchmarks Of Reiser4 == The <tt>htree</tt> (<tt>-O dir_index</tt>) feature is the recent attempt by ext3 developers to handle large directories as well as ReiserFS by using better
than linear search algorithms. One of the interesting results in this benchmark was that <tt>htree</tt> hurts ext3 performance, at least for this workload. This means that trying to get usable large-directory performance out of ext3 can severely impact its performance in the non-large case. You'll note that in our latest benchmark at the top we use larger filesets. It seems that ext3 does a poor job of utilizing its write cache when the fileset uses a lot of memory without exceeding it, and by increasing the size of the fileset we get a fairer (read: better for ext3) benchmark for the create phase. The use of filesets small enough to barely fit into RAM for the create (but not the copy) phase was due to my being lax in supervising the benchmarking, but it did reveal something interesting. Andrew Morton will probably fix that quickly --- it is most likely not as deep a fix as fixing <tt>htree</tt> would be. If anyone knows where the tail-combining patch for ext3 went, let us know so we can benchmark it; good tail-combining performance is not trivial to get right, and I wonder whether there is a performance reason it did not go in. Keep in mind that these benchmarks are still evolving and maturing, and I need to give the mongo code a complete review again, as it has been worked on by others quite a bit. While I like the mongo benchmarks, those concerned that they may be stacked in our favor can look at the benchmarks run by others on lkml, one of which is at the bottom of this page; while not as elaborate and detailed as mongo, it comes up with roughly the same result. Andrew Morton wrote some beautiful readahead code in the VM, and many thanks to him for what it contributes to V4 performance; it should be confessed, however, that these benchmarks utterly fail to measure its cleverness for real-world usage patterns.
In fact, these benchmarks basically access everything once in each pass, which is not at all realistic as a representation of typical server workloads. So understand them as validly illuminating some aspects of performance, not all aspects, if you could be so generous. We ran data-ordered ext3 benchmarks at the suggestion of Andrew Morton, but they came out slower for this benchmark. We need to increase the base size range to 8k and run again. [[Reiser4]] is a fully atomic filesystem: keep in mind that these performance numbers were obtained with every FS operation performed as a fully atomic transaction. We are the first to make that perform well. Look for a user-space transactions interface to come out soon. Finally, remember that Reiser4 is more space efficient than [[ReiserFS]]; the <tt>df(1)</tt> measurements are there for looking at... ;-) * external benchmarks [[#grant|by Grant Miner]] === mongo 2.6.15-mm4 === [[Mongo]] comparison, ext3 vs reiser4 with the "unixfile" regular file plugin and reiser4 with the "cryptcompress" regular file plugin. Comparative results of the mongo benchmark for ext3 vs reiser4 with the "unixfile" regular file plugin vs reiser4 with the [ftp://ftp.namesys.com/pub/tmp/cryptcompress_patches cryptcompress] regular file plugin.
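As an aside on reading the tables below: the <tt>B/A</tt> and <tt>C/A</tt> columns are simply each filesystem's measurement divided by the baseline's (column <tt>A</tt>), so for time-based rows a ratio above 1.0 means the baseline was faster. A minimal Python sketch of that arithmetic; the raw times chosen for B and C here are hypothetical values back-calculated to reproduce the published CREATE-phase ratios, not figures from the report:

```python
# How mongo-style comparison columns are derived: divide each filesystem's
# measurement by the baseline value (column A), rounding to three decimals
# as the tables do. A ratio above 1.0 (rendered red) means the baseline won.

def ratio_row(baseline, others):
    """Return {name: value / baseline} rounded mongo-style to 3 decimals."""
    return {name: round(value / baseline, 3) for name, value in others.items()}

# CREATE phase, REAL_TIME: baseline A (cryptcompress) took 53.36 s.
# 65.85 s and 226.73 s are illustrative raw times that yield the table's
# published ratios of 1.234 (B/A) and 4.249 (C/A).
print(ratio_row(53.36, {"B": 65.85, "C": 226.73}))  # {'B': 1.234, 'C': 4.249}
```

The same division is applied per-row to CPU_TIME, CPU_UTIL, and DF, which is why a single baseline column plus ratio columns can summarize all three filesystems compactly.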
<dl> <dt>reiser4 </dt> <dd>2.6.15-mm4 cryptcompress-4.patch</dd> <dt>mem total</dt> <dd>516312</dd> <dt>machine </dt> <dd>Intel(R) Xeon(TM) CPU 2.40GHz, <b>running UP kernel</b></dd> <dt>kernel </dt> <dd>2.6.15-mm4 #1 Sat Feb 11 20:00:11 MSK 2006</dd> <dt>date </dt> <dd>Sat Feb 11 21:03:21 2006</dd> <dd>Sat Feb 11 21:18:43 2006</dd> <dd>Sat Feb 11 21:37:52 2006</dd> </dl> <p>Legend:</p> <ul> <li><tt>A</tt> reiser4 with "cryptcompress" regular file plugin</li> <li><tt>B</tt> reiser4 with "unixfile" regular file plugin</li> <li><tt>C</tt> ext3</li> </ul> <p> The table presents absolute values (elapsed time, CPU usage, CPU utilization, and disk usage) for reiser4 with the "cryptcompress" regular file plugin, and ratios against it for reiser4 with the "unixfile" regular file plugin and for ext3. A <font color=red>red</font> number means the ratio is larger than <tt>1.0</tt>, i.e. reiser4 with the "cryptcompress" regular file plugin is better in that test; a <font color=green>green</font> number means it loses in that test.
</p> <table cols=13 cellpadding=2 cellspacing=2 noborder> <tr><td bgcolor=black colspan=13><font color=white></td></tr> <tr> <th bgcolor=#303030 colspan=13 align=left><font color=white>A.MKFS=mkfs.reiser4 -y -o create=create_ccreg40,compressMode=col8 MOUNT_OPTIONS=noatime FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=13 align=left><font color=white>B.MKFS=mkfs.reiser4 -y MOUNT_OPTIONS=noatime FSTYPE=reiser4 (unixfile regular file plugin)</font></th> </tr> <tr> <th bgcolor=#303030 colspan=13 align=left><font color=white>C.MOUNT_OPTIONS=noatime,data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <td colspan=13 bgcolor=#606060><b><font color=white>#0:</font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=3><b>REAL_TIME</b></td> <td colspan=3><b>CPU_TIME</b></td> <td colspan=3><b>CPU_UTIL</b></td> <td colspan=3><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>CREATE</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 53.36</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.234 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 4.249 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>28.79</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.493</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>94.36</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.255 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.155</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 775856</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font 
color=red> 2.550 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.825 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>COPY</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 137.6</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.543 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.931 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>40.91</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.716</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.975 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>59.94</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.257 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.183</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1551756</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.550 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.825 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>READ</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 161.17</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.087 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.077 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>48.35</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.433 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.195</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>33.23</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.487 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.291</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1551756</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.550 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font 
color=red> 2.825 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>STATS</b></td> <td bgcolor=#E0E0C0 align=right><tt>24.12</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.936</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.927</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>6.76</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.941 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.624</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>27.97</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.005 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.676</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1551756</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.550 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.825 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>DELETE</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 155.26</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.091 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 0.989</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>38.76</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.824 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.108</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>26.33</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.758 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.104</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>4</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.000 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> </tt></td> </tr> <tr> <td 
colspan=13 bgcolor=#606060><b><font color=white>#1:DD_MBCOUNT=5000 </font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=3><b>REAL_TIME</b></td> <td colspan=3><b>CPU_TIME</b></td> <td colspan=3><b>CPU_UTIL</b></td> <td colspan=3><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_writing_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 116.02</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.430 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.553 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>38.65</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.514</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.619 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>92.86</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.155 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.149</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1909012</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.682 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.685 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_reading_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 153.76</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 0.996</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>58.11</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.192 </font></tt></td> <td bgcolor=#E0E0C0 
align=right><tt><font color=green> <U> 0.147</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>38.73</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.224 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.152</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1909012</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.682 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.685 </font></tt></td> </tt></td> </tr> <tr><td bgcolor=black colspan=13><font color=white></td></tr> <tr><td colspan=13 align=right> <tr> <td colspan=13 bgcolor=#303030><b><font color=white>DIR=/mnt1 GAMMA=0.2 WRITE_BUFFER=131072 PHASE_APPEND=off SYNC=off PHASE_DELETE=rm NPROC=1 DEV=/dev/hda9 DD_MBCOUNT=5000 FILE_SIZE=8192 REP_COUNTER=1 PHASE_COPY=cp INFO_R4=2.6.15-mm4 cryptcompress-4.patch PHASE_READ=find BYTES=1024000000 PHASE_OVERWRITE=off PHASE_MODIFY=off </td></tr> <tr><td colspan=13 align=right> <font size=-2>Produced by <a href=http://namesys.com/benchmarks/mongo_readme.html>Mongo</a> benchmark suite.</font></td></tr> </table> <!-- <p><b>Legend:</b> <font color="green">green</font> color means the result is better (less) than reference value from the first column, results marked as <font color="red">red</font> are worse than reference value, best results are <u>underlined</u> other results which fit into 2% margin of the best result are underlined also.</p> --><p><a href="http://www.namesys.com/intbenchmarks/mongo/06.02.11.belka.crc/charts/comp.html">The same results in the charts</a></p> === mongo 2.6.11 === [[mongo]] comparison against xfs and ext2 <dl> <dt>reiser4 </dt> <dd>reiser4-for-2.6.11-5.patch from <a href="ftp://ftp.namesys.com/pub/reiser4-for-2.6/2.6.11">ftp://ftp.namesys.com/pub/reiser4-for-2.6/2.6.11</a> </dd> <dt>mem total</dt> <dd>254496</dd> <dt>machine </dt> <dd>bones</dd> <dt>kernel </dt> <dd>2.6.11-reiser4-5 #2 SMP Sat Jun 4 20:06:47 MSD 2005</dd> 
<dt>date </dt> <dd>Fri Jun 17 23:52:17 2005</dd> </dl> <p> In this test 81% of files are chosen from the 0-10k size range and 19% from the 10-100k size range. </p> <!-- File stats: Units are decimal (1k = 1000) files 0-100 : 1433 files 100-1K : 12597 files 1K-10K : 103101 files 10K-100K : 28131 files 100K-1M : 0 files 1M-10M : 0 files 10M-larger : 0 total bytes written : 1886585039 --> <p>Legend:</p> <ul> <li><tt>A</tt> reiser4</li> <li><tt>B</tt> reiserfs <tt>v3 (notail)</tt></li> <li><tt>C</tt> ext2</li> <li><tt>D</tt> xfs default</li> </ul> <p> The table presents absolute values (of elapsed time, CPU usage, CPU utilization, and disk usage) for reiser4, and ratios against reiser4 for all other configurations. A <font color=red>red</font> number means the ratio is larger than <tt>1.0</tt>, i.e. reiser4 is better in this test; a <font color=green>green</font> number means that reiser4 loses in this test. </p> <table cols=17 cellpadding=2 cellspacing=2 noborder> <tr><td bgcolor=black colspan=17><font color=white></td></tr> <tr> <th bgcolor=#303030 colspan=17 align=left><font color=white>A.FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=17 align=left><font color=white>B.FSTYPE=reiserfs MOUNT_OPTIONS=notail </font></th> </tr> <tr> <th bgcolor=#303030 colspan=17 align=left><font color=white>C.FSTYPE=ext2 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=17 align=left><font color=white>D.MKFS=mkfs.xfs -f FSTYPE=xfs </font></th> </tr> <tr> <td colspan=17 bgcolor=#606060><b><font color=white>#0:</font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=4><b>REAL_TIME</b></td> <td colspan=4><b>CPU_TIME</b></td> <td colspan=4><b>CPU_UTIL</b></td> <td colspan=4><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A
</b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>CREATE</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 66.12</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.022 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.686 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 4.288 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>34.98</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.901</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.114 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.445 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>29.86</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.424 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.398</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.398</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1623204</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.086 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.098 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>COPY</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 187.77</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.438 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.751 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.733 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>44.8</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.883</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.124 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.161 
</font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>14.85</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.606 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.611 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.353</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 3245428</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.087 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.098 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>READ</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 151.01</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.459 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.113 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.978 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>44.34</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.607 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.470</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.535 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>18.54</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.444</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.500 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.724 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 3245428</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.087 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.098 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>STATS</b></td> <td bgcolor=#E0E0C0 
align=right><tt>22.04</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.314 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.812</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.871 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>8.61</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.698 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.571</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 4.591 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>20.11</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.528</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.709 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.579 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 3245428</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.087 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.098 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>DELETE</b></td> <td bgcolor=#E0E0C0 align=right><tt>108.77</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.313</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.193 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.071 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>41</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.637 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.091</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.795 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>21.45</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.795 </font></tt></td> <td 
bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.077</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.556 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>4</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 14877.000 </font></tt></td> </tt></td> </tr> <tr> <td colspan=17 bgcolor=#606060><b><font color=white>#1:DD_MBCOUNT=5000 </font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=4><b>REAL_TIME</b></td> <td colspan=4><b>CPU_TIME</b></td> <td colspan=4><b>CPU_UTIL</b></td> <td colspan=4><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_writing_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 536.06</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.005 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.017 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 0.982</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>122.28</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.826 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.819</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.806</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>14.99</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.771 </font></tt></td> <td bgcolor=#E0E0C0 
align=right><tt><font color=green> <U> 0.711</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.742 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 5120008</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.012</U> </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_reading_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt>145.32</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.031 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.965</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 0.982</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>157.51</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.947 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.890</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.880</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>57.01</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.901</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.909 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.884</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 5120008</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.012</U> </font></tt></td> </tt></td> </tr> <tr><td bgcolor=black colspan=17><font color=white></td></tr> <tr><td colspan=17 align=right> <tr> <td colspan=17 bgcolor=#303030><b><font 
color=white>INFO_R4=2.6.11 + reiser4-5 REP_COUNTER=1 DEV=/dev/hda5 DD_MBCOUNT=5000 PHASE_OVERWRITE=off FILE_SIZE=8192 NPROC=3 PHASE_READ=find PHASE_DELETE=rm PHASE_APPEND=off WRITE_BUFFER=131072 DIR=/mnt1 PHASE_MODIFY=off BYTES=1024000000 PHASE_COPY=cp GAMMA=0.2 SYNC=off </td></tr> <tr><td colspan=17 align=right> <font size=-2>Produced by <a href=http://namesys.com/benchmarks/mongo_readme.html>Mongo</a> benchmark suite.</font></td></tr> </table> === mongo 2.6.8.1-mm3 === [[mongo]] comparison against ext3 <dl> <dt>reiser4 </dt> <dd>large key</dd> <dt>mem total</dt> <dd>254324</dd> <dt>machine </dt> <dd>bones</dd> <dt>kernel </dt> <dd>2.6.8.1-mm3 #3 SMP Mon Aug 23 19:33:13 MSD 2004</dd> <dt>date </dt> <dd>Tue Aug 31 15:47:51 2004</dd> </dl> <p> In this test 81% of files are chosen from the 0-10k size range and 19% from the 10-100k size range. </p> <!-- File stats: Units are decimal (1k = 1000) files 0-100 : 1433 files 100-1K : 12597 files 1K-10K : 103101 files 10K-100K : 28131 files 100K-1M : 0 files 1M-10M : 0 files 10M-larger : 0 total bytes written : 1886585039 --> <p>Legend:</p> <ul> <li><tt>A</tt> reiser4</li> <li><tt>B</tt> reiser4, extents only</li> <li><tt>C</tt> reiserfs <tt>v3 (notail)</tt></li> <li><tt>D</tt> ext3 in <tt>data=writeback</tt> mode (meta-data only journalling)</li> <li><tt>E</tt> ext3 in <tt>data=journal</tt> mode</li> <li><tt>F</tt> ext3 in <tt>data=ordered</tt> mode</li> </ul> <img src="http://www.namesys.com/intbenchmarks/mongo/04.08.26/256MB.RAM/one-thread-8k.g02.charts/CREATE.0.png"> <img src="http://www.namesys.com/intbenchmarks/mongo/04.08.26/256MB.RAM/one-thread-8k.g02.charts/COPY.0.png"> <img src="http://www.namesys.com/intbenchmarks/mongo/04.08.26/256MB.RAM/one-thread-8k.g02.charts/READ.0.png"> <img src="http://www.namesys.com/intbenchmarks/mongo/04.08.26/256MB.RAM/one-thread-8k.g02.charts/STATS.0.png"> <img src="http://www.namesys.com/intbenchmarks/mongo/04.08.26/256MB.RAM/one-thread-8k.g02.charts/DELETE.0.png"> <img 
src="http://www.namesys.com/intbenchmarks/mongo/04.08.26/256MB.RAM/one-thread-8k.g02.charts/dd_writing_largefile.1.png"> <img src="http://www.namesys.com/intbenchmarks/mongo/04.08.26/256MB.RAM/one-thread-8k.g02.charts/dd_reading_largefile.1.png"> <p> The table presents absolute values (of elapsed time, CPU usage, CPU utilization, and disk usage) for reiser4, and ratios against reiser4 for all other configurations. A <font color=red>red</font> number means the ratio is larger than <tt>1.0</tt>, i.e. reiser4 is better in this test; a <font color=green>green</font> number means that reiser4 loses in this test. </p> <table cols=25 cellpadding=2 cellspacing=2 noborder> <tr><td bgcolor=black colspan=25><font color=white></td></tr> <tr> <th bgcolor=#303030 colspan=25 align=left><font color=white>A.FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=25 align=left><font color=white>B.FSTYPE=reiser4 MKFS=mkfs.reiser4 -q -o extent=extent40 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=25 align=left><font color=white>C.MOUNT_OPTIONS=notail FSTYPE=reiserfs </font></th> </tr> <tr> <th bgcolor=#303030 colspan=25 align=left><font color=white>D.MOUNT_OPTIONS="data=writeback" FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=25 align=left><font color=white>E.MOUNT_OPTIONS="data=journal" FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=25 align=left><font color=white>F.MOUNT_OPTIONS="data=ordered" FSTYPE=ext3 </font></th> </tr> <tr> <td colspan=25 bgcolor=#606060><b><font color=white>#0:</font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=6><b>REAL_TIME</b></td> <td colspan=6><b>CPU_TIME</b></td> <td colspan=6><b>CPU_UTIL</b></td> <td colspan=6><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A
</b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>CREATE</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 91.6</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 0.988</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.983 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.592 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.010 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.256 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>31.13</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.965 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.826</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.577 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.529 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.802 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>22.63</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.981 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.350</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.791 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.738 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.000 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1978440</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.088 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 
</font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>COPY</b></td> <td bgcolor=#E0E0C0 align=right><tt>219.5</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.968</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.674 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.241 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.105 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.819 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>54.04</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.938 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.792</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.694 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.004 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.860 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>16.01</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.996 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.460</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.663 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.839 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.890 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 3956708</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.088 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> <td 
bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>READ</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 187.34</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.007</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.617 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.282 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.295 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.250 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>38.61</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.002 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.711 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.615</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.622</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.615</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>13.05</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.995 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.441</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.520 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.517 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.533 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 3956708</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.088 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 
</font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>STATS</b></td> <td bgcolor=#E0E0C0 align=right><tt>23.71</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.968 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.162 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.943</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.943</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.943</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>10.91</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.944 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.717 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.661</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.674 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.658</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>24.46</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.971 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.587</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.700 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.707 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.697 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 3956708</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.088 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> </tt></td> </tr> <tr> <td 
bgcolor=#C0C0C0><b>DELETE</b></td> <td bgcolor=#E0E0C0 align=right><tt>156.84</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.993 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.233</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.264 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.270 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.216 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>53.05</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.938 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.440 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.209</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.215 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.214 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>18.23</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.947 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.758 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.157</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.160 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.167 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>4</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.000 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> </tt></td> </tr> <tr> <td colspan=25 bgcolor=#606060><b><font color=white>#1:DD_MBCOUNT=768 
</font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=6><b>REAL_TIME</b></td> <td colspan=6><b>CPU_TIME</b></td> <td colspan=6><b>CPU_UTIL</b></td> <td colspan=6><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_writing_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 30.09</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.006</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.286 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.342 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.473 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.311 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>5.24</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.996 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.966</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.286 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.393 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.437 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>11.43</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.994 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.631</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.796 
</font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.655 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.967 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 786436</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_reading_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt>28.38</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.969</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.010 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 0.980</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 0.982</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.999 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>4.37</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.979 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.014 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.911</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.895</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.936 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>8.88</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.030 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.922 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.858</U> 
</font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.854</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.867</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 786436</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr><td bgcolor=black colspan=25><font color=white></td></tr> <tr><td colspan=25 align=right> <tr> <td colspan=25 bgcolor=#303030><b><font color=white>REP_COUNTER=1 PHASE_COPY=cp INFO_R4=2.6.8.1-mm3 + parse_options.patch FILE_SIZE=8192 DEV=/dev/hda6 PHASE_MODIFY=off DD_MBCOUNT=768 PHASE_APPEND=off PHASE_OVERWRITE=off SYNC=off DIR=/mnt1 PHASE_DELETE=rm NPROC=1 BYTES=1024000000 GAMMA=0.2 PHASE_READ=find WRITE_BUFFER=131072 </td></tr> <tr><td colspan=25 align=right> <font size=-2>Produced by <a href=http://namesys.com/>Mongo</a> benchmark suite.</font></td></tr> </table> === slow.c 2004-03-26 === [[slow.c]] comparison against ext2 and ext3, 2004-03-26 <p> These are <a href="http://www.jburgess.uklinux.net/slow.c">slow.c</a> benchmark results for the latest 2004.03.26 reiser4 snapshot. </p> <p> <b>slow.c</b> is a simple program by Jon Burgess that writes and reads multiple data streams. For details and the source code, see <a href="http://marc.theaimsgroup.com/?l=linux-kernel&m=107652683608384&w=2">the discussion</a> in the linux-kernel mailing list.
</p> <p> kernel : 2.6.5-rc2</p> <p> RAM : 256Mb</p> <p> reiser4 : <a href="http://www.namesys.com/snapshots/2004.03.26/">2004.03.26 snapshot</a></p> <p>Hardware specs:</p> <pre> processor : 1 vendor_id : AuthenticAMD cpu family : 6 model : 6 model name : AMD Athlon(tm) Processor stepping : 2 cpu MHz : 1460.098 cache size : 256 KB bogomips : 2916.35 Dual CPU AMD Athlon(tm) 1.4Ghz </pre> <pre> # hdparm /dev/hda6: multcount = 16 (on) IO_support = 1 (32-bit) unmaskirq = 1 (on) using_dma = 1 (on) keepsettings = 0 (off) readonly = 0 (off) readahead = 256 (on) geometry = 65535/16/63, sectors = 35937342, start = 84164598 </pre> <pre> # hdparm -t /dev/hda6 /dev/hda6: Timing buffered disk reads: 84 MB in 3.07 seconds = 27.39 MB/sec </pre> <pre> # hdparm -i /dev/hda /dev/hda: Model=IC35L060AVER07-0, FwRev=ER6OA44A, SerialNo=SZPTZMB6154 Config={ HardSect NotMFM HdSw>15uSec Fixed DTR>10Mbs } RawCHS=16383/16/63, TrkSize=0, SectSize=0, ECCbytes=40 BuffType=DualPortCache, BuffSize=1916kB, MaxMultSect=16, MultSect=16 CurCHS=16383/16/63, CurSects=16514064, LBA=yes, LBAsects=120103200 IORDY=on/off, tPIO={min:240,w/IORDY:120}, tDMA={min:120,rec:120} PIO modes: pio0 pio1 pio2 pio3 pio4 DMA modes: mdma0 mdma1 mdma2 UDMA modes: udma0 udma1 udma2 AdvancedPM=yes: disabled (255) WriteCache=enabled Drive conforms to: ATA/ATAPI-5 T13 1321D revision 1: * signifies the current active mode </pre> <pre> <!-- (500Mb of data) test : ./slow foo 500 Results : ============================================================== | 1 stream | 2 streams --------------+----------------------------------------------- | WRITE READ | WRITE READ --------------+----------------------------------------------- ext2 25.08Mb/s 27.08Mb/s 13.72Mb/s 14.04Mb/s reiser4 26.31Mb/s 26.99Mb/s 24.03Mb/s 26.84Mb/s reiser4-extents 25.28Mb/s 27.40Mb/s 24.12Mb/s 26.85Mb/s ext3-ordered 20.99Mb/s 26.40Mb/s 12.01Mb/s 13.34Mb/s ext3-journal 10.13Mb/s 24.48Mb/s 8.87Mb/s 13.26Mb/s reiserfs 20.42Mb/s 27.67Mb/s 12.98Mb/s 13.13Mb/s 
reiserfs-notail 20.07Mb/s 27.58Mb/s 13.04Mb/s 13.25Mb/s ============================================================== --> (1000Mb of data) test : ./slow foo 1000 Results : <!-- ============================================================================================================== | 1 stream | 2 streams | 4 streams | 8 stream --------------+----------------------------------------------------------------------------------------------- | WRITE READ | WRITE READ | WRITE READ | WRITE READ --------------+----------------------------------------------------------------------------------------------- ext2 24.66Mb/s 27.56Mb/s 13.40Mb/s 13.67Mb/s 7.73Mb/s 6.94Mb/s 6.69Mb/s 3.52Mb/s reiser4 25.42Mb/s 27.71Mb/s 23.96Mb/s 26.34Mb/s 24.55Mb/s 26.58Mb/s 24.90Mb/s 26.76Mb/s reiser4-extents 25.60Mb/s 27.68Mb/s 24.19Mb/s 25.92Mb/s 25.24Mb/s 27.12Mb/s 25.39Mb/s 26.72Mb/s ext3-ordered 20.05Mb/s 26.46Mb/s 11.06Mb/s 13.12Mb/s 9.63Mb/s 6.76Mb/s 10.02Mb/s 3.48Mb/s ext3-journal 10.10Mb/s 26.81Mb/s 8.87Mb/s 13.08Mb/s 8.59Mb/s 6.84Mb/s 8.14Mb/s 3.47Mb/s reiserfs 20.19Mb/s 27.48Mb/s 12.69Mb/s 13.03Mb/s 8.27Mb/s 6.84Mb/s 7.87Mb/s 4.13Mb/s reiserfs-notail 20.31Mb/s 27.10Mb/s 12.74Mb/s 13.09Mb/s 8.33Mb/s 6.89Mb/s 7.87Mb/s 4.17Mb/s ============================================================================================================= --> </pre> <table> <tr> <td><img src="intbenchmarks/slow/04.03.25-int.snapshot.bones/wr.1.png"></td> <td><img src="intbenchmarks/slow/04.03.25-int.snapshot.bones/wr.2.png"></td> <td><img src="intbenchmarks/slow/04.03.25-int.snapshot.bones/wr.4.png"></td> <td><img src="intbenchmarks/slow/04.03.25-int.snapshot.bones/wr.8.png"></td> </tr> <tr> <td><img src="intbenchmarks/slow/04.03.25-int.snapshot.bones/rd.1.png"></td> <td><img src="intbenchmarks/slow/04.03.25-int.snapshot.bones/rd.2.png"></td> <td><img src="intbenchmarks/slow/04.03.25-int.snapshot.bones/rd.4.png"></td> <td><img src="intbenchmarks/slow/04.03.25-int.snapshot.bones/rd.8.png"></td> </tr> 
</table> === mongo 2003-11-20 === [[mongo]] comparison against ext3, 2003-11-20 <dl> <dt>reiser4 </dt> <dd>''</dd> <dt>mem total</dt> <dd>255716</dd> <dt>machine </dt> <dd>belka</dd> <dt>kernel </dt> <dd>2.6.0-test9 #2 SMP Thu Nov 20 16:08:42 MSK 2003</dd> <dt>date </dt> <dd>Thu Nov 20 16:16:50 2003</dd> </dl> <p> In this test, 80% of files are chosen from the 0-8k size range, 16% from the 0-80k size range, 0.8 x 4% from the 0-800k size range, etc. Most files are small; most bytes are in large files. </p> <p>Legend:</p> <ul> <li><tt>A</tt> reiser4</li> <li><tt>B</tt> reiser4, extents only</li> <li><tt>C</tt> reiserfs <tt>v3</tt></li> <li><tt>D</tt> ext3 in <tt>data=writeback</tt> mode (meta-data only journalling)</li> <li><tt>E</tt> ext3 in <tt>data=journal</tt> mode</li> <li><tt>F</tt> ext3 in <tt>data=ordered</tt> mode</li> <li><tt>G</tt> ext3 with htree (hashed directories)</li> </ul> <p> The table presents absolute values (of elapsed time, CPU usage, and disk usage) for reiser4, and ratios against reiser4 for all other configurations. A <font color=red>red</font> number means the ratio is larger than <tt>1.0</tt>, that is, reiser4 is better in this test. A <font color=green>green</font> number means reiser4 loses in this test.
</p> <table cols=22 cellpadding=2 cellspacing=2 noborder> <tr><td bgcolor=black colspan=22><font color=white></td></tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>A.INFO_R4='' FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>B.INFO_R4='' MKFS=mkfs.reiser4 -q -o policy=extents FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>C.FSTYPE=reiserfs </font></th> </tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>D.MOUNT_OPTIONS=data=writeback FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>E.MOUNT_OPTIONS=data=journal FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>F.MOUNT_OPTIONS=data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>G.MKFS=mkfs.ext3 -O dir_index MOUNT_OPTIONS=data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <td colspan=22 bgcolor=#606060><b><font color=white>#0:</font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=7><b>REAL_TIME</b></td> <td colspan=7><b>CPU_TIME</b></td> <td colspan=7><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>CREATE</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 21.81</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.171 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.983 </font></tt></td> <td bgcolor=#E0E0C0 
align=right><tt><font color=red> 3.253 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.702 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.161 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.212 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>6.38</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.130 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.020 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.461 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.461 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.354 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.851</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 607612</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.091 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.035 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>COPY</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 64.37</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.089 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.046 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.980 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.834 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.929 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 6.246 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>11.55</tt></td> 
<td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.047 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.797 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.590 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.725 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.542 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.698</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1214992</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.091 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.034 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>READ</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 45.38</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.026 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.406 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.248 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.307 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.232 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 7.192 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>10.13</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.934 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.517 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.454 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.453</U> </font></tt></td> <td bgcolor=#E0E0C0 
align=right><tt><font color=green> <U> 0.444</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.504 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1214992</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.091 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.034 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>STATS</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 5.74</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.030 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.413 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.014</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.033 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.021 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.634 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>2.34</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.000 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.936 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.761 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.791 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.774 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.744</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1214992</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.091 </font></tt></td> <td bgcolor=#E0E0C0 
align=right><tt><font color=red> 1.034 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>DELETE</b></td> <td bgcolor=#E0E0C0 align=right><tt>46.94</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.424</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.520 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.017 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.043 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.956 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.315 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>14.19</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.743 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.443 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.200</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.206 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.201</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.234 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>4</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.000 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td 
bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> </tt></td> </tr> <tr> <td colspan=22 bgcolor=#606060><b><font color=white>#1:DD_MBCOUNT=768 </font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=7><b>REAL_TIME</b></td> <td colspan=7><b>CPU_TIME</b></td> <td colspan=7><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_writing_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 29.33</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.026 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.184 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.102 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.499 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.097 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.098 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>2.61</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.008 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.659</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.437 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.054 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.556 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.571 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 786436</U></tt></td> 
<td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_reading_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 22.96</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.056 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.003</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.004</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.003</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.006</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>2.26</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.991 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.912 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.796 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.765</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.779</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.783 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 786436</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 
align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr><td bgcolor=black colspan=22><font color=white></td></tr> <tr><td colspan=22 align=right> <tr> <td colspan=22 bgcolor=#303030><b><font color=white>NPROC=1 DIR=/mnt/testfs SYNC=off PHASE_COPY=cp REP_COUNTER=1 GAMMA=0.2 PHASE_OVERWRITE=off FILE_SIZE=8192 BYTES=512000000 PHASE_APPEND=off PHASE_READ=find DEV=/dev/hdb3 DD_MBCOUNT=768 WRITE_BUFFER=131072 PHASE_DELETE=rm PHASE_MODIFY=off </td></tr> <tr><td colspan=22 align=right> <font size=-2>Produced by <a href=http://namesys.com/benchmarks/mongo_readme.html>Mongo</a> benchmark suite.</font></td></tr> </table> === mongo 2003-09-25 === [[mongo]] comparison against ext3, 2003-09-25 <dl> <dt>reiser4 </dt> <dd>''</dd> <dt>mem total</dt> <dd>255048</dd> <dt>machine </dt> <dd>belka</dd> <dt>kernel </dt> <dd>2.6.0-test5 #33 SMP Thu Sep 25 15:45:38 MSD 2003</dd> <dt>date </dt> <dd>Thu Sep 25 15:57:38 2003</dd> </dl> <p> In this test, 80% of files are chosen from the 0-8k size range, 16% from the 0-80k size range, 0.8 x 4% from the 0-800k size range, etc. Most files are small; most bytes are in large files. </p> <p>Legend:</p> <ul> <li><tt>A</tt> reiser4</li> <li><tt>B</tt> reiser4, extents only</li> <li><tt>C</tt> reiserfs <tt>v3</tt></li> <li><tt>D</tt> ext3 in <tt>data=writeback</tt> mode (meta-data only journalling)</li> <li><tt>E</tt> ext3 in <tt>data=journal</tt> mode</li> <li><tt>F</tt> ext3 in <tt>data=ordered</tt> mode</li> <li><tt>G</tt> ext3 with htree (hashed directories)</li> </ul> <p> The table presents absolute values (of elapsed time, CPU usage, and disk usage) for reiser4, and ratios against reiser4 for all other configurations.
A <font color=red>red</font> number means the ratio is larger than <tt>1.0</tt>, that is, reiser4 is better in this test. A <font color=green>green</font> number means reiser4 loses in this test. </p> <table cols=22 cellpadding=2 cellspacing=2 noborder> <tr><td bgcolor=black colspan=22><font color=white></td></tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>A.INFO_R4='' FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>B.INFO_R4='' MKFS=mkfs.reiser4 -q -o policy=extents FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>C.FSTYPE=reiserfs </font></th> </tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>D.MOUNT_OPTIONS=data=writeback FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>E.MOUNT_OPTIONS=data=journal FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>F.MOUNT_OPTIONS=data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>G.MKFS=mkfs.ext3 -O dir_index MOUNT_OPTIONS=data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <td colspan=22 bgcolor=#606060><b><font color=white>#0:</font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=7><b>REAL_TIME</b></td> <td colspan=7><b>CPU_TIME</b></td> <td colspan=7><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>CREATE</b></td> <td bgcolor=#E0E0C0 align=right><tt><U>
23.57</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.158 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.714 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.263 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.234 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.020 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.376 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>6.66</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.075 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.947 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.240 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.357 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.264 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.835</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 608548</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.090 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.034 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.105 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.105 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.105 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>COPY</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 64.98</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.083 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.050 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.023 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.810 </font></tt></td> <td bgcolor=#E0E0C0 
align=right><tt><font color=red> 1.908 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 6.850 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>12.18</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.057 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.776 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.507 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.603 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.518 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.743</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1216784</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.090 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.033 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.105 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.105 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.105 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>READ</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 44.65</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.028 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.733 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.237 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.114 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.179 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 7.694 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>10.28</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.933 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.590</U> 
</font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.608 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.593</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.608 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.620 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1216784</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.090 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.033 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.105 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.105 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.105 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>STATS</b></td> <td bgcolor=#E0E0C0 align=right><tt>5.88</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.998 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.139 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.981 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.020 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.929</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.655 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>2.29</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.987 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.900 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.747</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.782 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.747</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 
0.755</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1216784</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.090 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.033 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.105 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.105 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.105 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>DELETE</b></td> <td bgcolor=#E0E0C0 align=right><tt>46.65</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.438</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.504 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.109 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.023 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.022 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.376 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>14.19</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.746 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.431 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.206</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.211 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.211 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.232 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>4</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.000 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> 
</font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> </tt></td> </tr> <tr> <td colspan=22 bgcolor=#606060><b><font color=white>#1:DD_MBCOUNT=768 </font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=7><b>REAL_TIME</b></td> <td colspan=7><b>CPU_TIME</b></td> <td colspan=7><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_writing_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 30.78</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.017</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.177 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.063 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.394 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.066 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.056 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>3.11</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.981 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.553</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.701 </font></tt></td> <td bgcolor=#E0E0C0 
align=right><tt><font color=red> 1.296 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.318 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 786436</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_reading_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 22.96</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.045 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.005</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.005</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.004</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.006</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>2.41</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.996 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.867 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.739 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.718</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.739 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.722</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 786436</U></tt></td> 
<td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tr> <tr><td bgcolor=black colspan=22><font color=white></font></td></tr> <tr> <td colspan=22 bgcolor=#303030><b><font color=white>NPROC=1 DIR=/mnt/testfs SYNC=off PHASE_COPY=cp REP_COUNTER=1 GAMMA=0.2 PHASE_OVERWRITE=off FILE_SIZE=8192 BYTES=512000000 PHASE_APPEND=off PHASE_READ=find DEV=/dev/hdb3 DD_MBCOUNT=768 WRITE_BUFFER=131072 PHASE_DELETE=rm PHASE_MODIFY=off </font></b></td></tr> <tr><td colspan=22 align=right> <font size=-2>Produced by <a href=http://namesys.com/benchmarks/mongo_readme.html>Mongo</a> benchmark suite.</font></td></tr> </table>

=== mongo 2003-08-28 ===

[[mongo]] comparison against ext3, 2003-08-28 <dl> <dt>reiser4 </dt> <dd>''</dd> <dt>mem total</dt> <dd>256276</dd> <dt>machine </dt> <dd>belka</dd> <dt>kernel </dt> <dd>2.6.0-test4 #194 SMP Thu Aug 28 17:18:47 MSD 2003</dd> <dt>date </dt> <dd>Thu Aug 28 17:20:18 2003</dd> </dl> <p> In this test 80% of files are chosen from the 0-8k size range, 16% from the 0-80k size range, 0.8 x 4% from the 0-800k size range, etc. Most files are small, most bytes are in large files.
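The size distribution described above is geometric in the size decade, matching the <tt>FILE_SIZE=8192</tt> and <tt>GAMMA=0.2</tt> run parameters shown under each table. A minimal sketch of such a sampler, assuming a decade is escalated with probability GAMMA and sizes are uniform within the chosen range; <tt>sample_file_size</tt> is a hypothetical illustration, not code from the Mongo suite:

```python
import random

# Run parameters as reported under the tables (assumed meanings).
BASE_SIZE = 8192   # FILE_SIZE: upper bound of the smallest size range, bytes
GAMMA = 0.2        # probability of escalating to the next (10x larger) range

def sample_file_size(rng=random):
    """Pick a file size: ~80% fall in [0, 8k), ~16% in [0, 80k),
    ~0.8 * 4% in [0, 800k), etc.; uniform within the chosen range."""
    decade = 0
    while rng.random() < GAMMA:   # P(decade k) = (1 - GAMMA) * GAMMA**k
        decade += 1
    return rng.randrange(BASE_SIZE * 10 ** decade)
```

With GAMMA=0.2 the decade-k range is chosen with probability 0.8 &times; 0.2<sup>k</sup>, which reproduces the stated property: most files are small, while most bytes end up in the rare large files.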
</p> <p>Legend:</p> <ul> <li><tt>A</tt> reiser4</li> <li><tt>B</tt> reiser4, extents only</li> <li><tt>C</tt> reiserfs (v3)</li> <li><tt>D</tt> ext3 in <tt>data=writeback</tt> mode (meta-data only journalling)</li> <li><tt>E</tt> ext3 in <tt>data=journal</tt> mode</li> <li><tt>F</tt> ext3 in <tt>data=ordered</tt> mode</li> <li><tt>G</tt> ext3 with htree (hashed directories)</li> </ul> <p> The table presents absolute values (elapsed time, CPU usage, and disk usage) for reiser4, and ratios against reiser4 for all other configurations. A <font color=red>red</font> number means the ratio is larger than <tt>1.0</tt>, i.e. reiser4 did better in that test; a <font color=green>green</font> number means reiser4 did worse. </p> <table cols=22 cellpadding=2 cellspacing=2 noborder> <tr><td bgcolor=black colspan=22><font color=white></font></td></tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>A.INFO_R4='' FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>B.INFO_R4='' MKFS=mkfs.reiser4 -q -o policy=extents FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>C.FSTYPE=reiserfs </font></th> </tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>D.MOUNT_OPTIONS=data=writeback FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>E.MOUNT_OPTIONS=data=journal FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>F.MOUNT_OPTIONS=data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>G.MKFS=mkfs.ext3 -O dir_index MOUNT_OPTIONS=data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <td colspan=22 bgcolor=#606060><b><font color=white>#0:</font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=7><b>REAL_TIME</b></td> <td colspan=7><b>CPU_TIME</b></td> <td colspan=7><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td>
<td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>CREATE</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 21.94</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.056 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.957 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.049 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.430 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.399 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.558 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>6.7</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.104 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.913 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.213 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.334 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.345 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.821</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 608452</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.091 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.034 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.105 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.105 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.105 </font></tt></td> <td bgcolor=#E0E0C0 
align=right><tt><font color=red> 1.106 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>COPY</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 64.05</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.078 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.112 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.964 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.703 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.022 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 7.356 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>11.37</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.039 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.819 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.538 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.692 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.568 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.708</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1216572</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.091 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.033 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>READ</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 52.53</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.072 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.882 </font></tt></td> <td 
bgcolor=#E0E0C0 align=right><tt><font color=red> 1.056 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.126 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.124 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 7.158 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>9.8</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.914 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.538 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.489 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.467 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.456</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.551 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1216572</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.091 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.033 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>STATS</b></td> <td bgcolor=#E0E0C0 align=right><tt>5.82</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.973</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.251 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.040 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.009 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.048 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.641 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 
align=right><tt>2.29</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.991 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.926 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.755 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.742</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.751 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.734</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1216572</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.091 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.033 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>DELETE</b></td> <td bgcolor=#E0E0C0 align=right><tt>46.96</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.409</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.491 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.949 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.988 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.987 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.382 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>13.89</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.734 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.453 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.210 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font 
color=green> <U> 0.204</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.202</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.238 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>4</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.000 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> </tt></td> </tr> <tr> <td colspan=22 bgcolor=#606060><b><font color=white>#1:DD_MBCOUNT=768 </font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=7><b>REAL_TIME</b></td> <td colspan=7><b>CPU_TIME</b></td> <td colspan=7><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_writing_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 26.1</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.006</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.205 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.066 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.353 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 
1.068 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.070 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>3.18</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.028 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.547</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.173 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.708 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.327 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.296 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 786436</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_reading_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 18.99</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.009</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.072 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.009</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.007</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.006</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.008</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>2.12</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.000 
</font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.925 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.877 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.844 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.830 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.811</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 786436</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tr> <tr><td bgcolor=black colspan=22><font color=white></font></td></tr> <tr> <td colspan=22 bgcolor=#303030><b><font color=white>NPROC=1 DIR=/mnt/testfs SYNC=off PHASE_COPY=cp REP_COUNTER=1 GAMMA=0.2 PHASE_OVERWRITE=off FILE_SIZE=8192 BYTES=512000000 PHASE_APPEND=off PHASE_READ=find DEV=/dev/hdb3 DD_MBCOUNT=768 WRITE_BUFFER=131072 PHASE_DELETE=rm PHASE_MODIFY=off </font></b></td></tr> <tr><td colspan=22 align=right> <font size=-2>Produced by <a href=http://namesys.com/benchmarks/mongo_readme.html>Mongo</a> benchmark suite.</font></td></tr> </table>

=== mongo 2003-08-27 ===

[[mongo]] comparison against ext3 <dl> <dt>reiser4 </dt> <dd>''</dd> <dt>mem total</dt> <dd>256276</dd> <dt>machine </dt> <dd>belka</dd> <dt>kernel </dt> <dd>2.6.0-test4 #189 SMP Wed Aug 27 20:36:51 MSD 2003</dd> <dt>date </dt> <dd>Wed Aug 27 20:44:02 2003</dd> </dl> <p> In this test 80% of files are chosen from the 0-8k size range, 16% from the 0-80k size range, 0.8 x 4%
from the 0-800k size range, etc. Most files are small, most bytes are in large files. </p> <p>Legend:</p> <ul> <li><tt>A</tt> reiser4</li> <li><tt>B</tt> reiser4, extents only</li> <li><tt>C</tt> ext3 in <tt>data=writeback</tt> mode (meta-data only journalling)</li> <li><tt>D</tt> ext3 in <tt>data=journal</tt> mode</li> <li><tt>E</tt> ext3 in <tt>data=ordered</tt> mode</li> <li><tt>F</tt> ext3 with htree (hashed directories)</li> </ul> <p> The table presents absolute values (elapsed time, CPU usage, and disk usage) for reiser4, and ratios against reiser4 for all other configurations. A <font color=red>red</font> number means the ratio is larger than <tt>1.0</tt>, i.e. reiser4 did better in that test; a <font color=green>green</font> number means reiser4 did worse. </p> <table cols=19 cellpadding=2 cellspacing=2 noborder> <tr><td bgcolor=black colspan=19><font color=white></font></td></tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>A.INFO_R4='' FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>B.INFO_R4='' MKFS=mkfs.reiser4 -q -o policy=extents FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>C.MOUNT_OPTIONS=data=writeback FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>D.MOUNT_OPTIONS=data=journal FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>E.MOUNT_OPTIONS=data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>F.MKFS=mkfs.ext3 -O dir_index MOUNT_OPTIONS=data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <td colspan=19 bgcolor=#606060><b><font color=white>#0:</font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=6><b>REAL_TIME</b></td> <td colspan=6><b>CPU_TIME</b></td> <td colspan=6><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A
</b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>CREATE</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 22.41</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.673 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.325 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.975 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.213 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>7.66</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.069 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.347 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.415 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.410 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.708</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 635264</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.096 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.110 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.110 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.110 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.111 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>COPY</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 90.92</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.099 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.471 
</font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.221 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.470 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 4.989 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>12.14</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.068 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.066 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.241 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.094 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.668</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1269840</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.096 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.110 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.110 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.110 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.112 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>READ</b></td> <td bgcolor=#E0E0C0 align=right><tt>82.21</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.063 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.861 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.852 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.791</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 4.417 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>10.57</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.914 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.400</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.428 </font></tt></td> <td 
bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.402</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.534 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1269840</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.096 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.110 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.110 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.110 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.112 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>STATS</b></td> <td bgcolor=#E0E0C0 align=right><tt>8.52</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.993 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.822</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.816</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.811</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.335 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>2.96</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.997 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.561</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.564</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.584 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.608 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1269840</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.096 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.110 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.110 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.110 
</font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.112 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>DELETE</b></td> <td bgcolor=#E0E0C0 align=right><tt>69.69</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.301</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.749 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.717 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.659 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.912 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>14.73</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.703 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.208</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.207</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.213 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.237 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>4</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.000 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> </tt></td> </tr> <tr> <td colspan=19 bgcolor=#606060><b><font color=white>#1:DD_MBCOUNT=768 </font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=6><b>REAL_TIME</b></td> <td colspan=6><b>CPU_TIME</b></td> <td colspan=6><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> 
<td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_writing_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 25.85</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.092 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.335 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.085 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.095 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 3.27</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 0.982</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.159 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.648 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.251 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.254 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 786436</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_reading_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 19</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 0.999</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font 
color=black> <U> 1.005</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.007</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.007</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.007</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>2.18</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.963 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.807 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.803</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.789</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.803</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 786436</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr><td bgcolor=black colspan=19><font color=white></td></tr> <tr> <td colspan=19 bgcolor=#303030><b><font color=white>NPROC=1 DIR=/mnt/testfs SYNC=off PHASE_COPY=cp REP_COUNTER=1 GAMMA=0.2 PHASE_OVERWRITE=off FILE_SIZE=8000 BYTES=512000000 PHASE_APPEND=off PHASE_READ=find DEV=/dev/hdb3 DD_MBCOUNT=768 WRITE_BUFFER=131072 PHASE_DELETE=rm PHASE_MODIFY=off </font></b></td></tr> <tr><td colspan=19 align=right> <font size=-2>Produced by <a href=http://namesys.com/benchmarks/mongo_readme.html>Mongo</a> benchmark suite.</font></td></tr> </table> <hr> <p> This is the same test as above, but with a base file size of 4k; that is, in this test 80% of files are chosen from
the 0-4k size range, 16% from the 0-40k size range, 0.8 x 4% from the 0-400k size range, etc. </p> <hr> <dl> <dt>reiser4 </dt> <dd>''</dd> <dt>mem total</dt> <dd>255580</dd> <dt>machine </dt> <dd>belka</dd> <dt>kernel </dt> <dd>2.6.0-test4 #176 SMP Tue Aug 26 19:09:38 MSD 2003</dd> <dt>date </dt> <dd>Wed Aug 27 12:41:54 2003</dd> </dl> <table cols=19 cellpadding=2 cellspacing=2 noborder> <tr><td bgcolor=black colspan=19><font color=white></td></tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>A.INFO_R4='' FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>B.INFO_R4='' MKFS=mkfs.reiser4 -q -o policy=extents FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>C.MOUNT_OPTIONS=data=writeback FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>D.MOUNT_OPTIONS=data=journal FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>E.MOUNT_OPTIONS=data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>F.MKFS=mkfs.ext3 -O dir_index MOUNT_OPTIONS=data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <td colspan=19 bgcolor=#606060><b><font color=white>#0:</font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=6><b>REAL_TIME</b></td> <td colspan=6><b>CPU_TIME</b></td> <td colspan=6><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>CREATE</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 33.86</U></tt></td> <td 
bgcolor=#E0E0C0 align=right><tt><font color=red> 1.223 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.305 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.895 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.549 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.298 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>14.11</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.118 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.967 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.046 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.045 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.647</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 789424</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.208 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.181 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>COPY</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 119.68</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.228 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.237 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.397 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.277 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 7.061 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>23.05</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.484 
</font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.683 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.515 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.691</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1578216</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.208 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.182 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>READ</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 118.5</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.217 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.041 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.065 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.020</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 6.585 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>19.84</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.993 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.436</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.446 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.431</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.540 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1578216</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.208 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 
</font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.182 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>STATS</b></td> <td bgcolor=#E0E0C0 align=right><tt>24.69</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.951 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.677</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.696 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.677</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.151 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>7.75</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.008 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.590</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.582</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.583</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.645 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1578216</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.208 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.182 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>DELETE</b></td> <td bgcolor=#E0E0C0 align=right><tt>114.49</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.438 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.174</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.188 </font></tt></td> <td 
bgcolor=#E0E0C0 align=right><tt><font color=green> 0.177 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.257 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>32.64</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.790 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.193</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.199 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.194</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.223 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>4</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.000 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> </tt></td> </tr> <tr> <td colspan=19 bgcolor=#606060><b><font color=white>#1:DD_MBCOUNT=768 </font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=6><b>REAL_TIME</b></td> <td colspan=6><b>CPU_TIME</b></td> <td colspan=6><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_writing_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 26.24</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.002</U> </font></tt></td> <td 
bgcolor=#E0E0C0 align=right><tt><font color=red> 1.066 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.311 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.056 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.063 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 3.25</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 0.997</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.138 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.622 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.286 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.298 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 786436</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_reading_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 19.04</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 0.994</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.002</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.003</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.002</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>2.08</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.038 </font></tt></td> <td 
bgcolor=#E0E0C0 align=right><tt><font color=green> 0.870 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.870 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.870 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.837</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 786436</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr><td bgcolor=black colspan=19><font color=white></td></tr> <tr> <td colspan=19 bgcolor=#303030><b><font color=white>NPROC=1 DIR=/mnt/testfs SYNC=off PHASE_COPY=cp REP_COUNTER=1 GAMMA=0.2 PHASE_OVERWRITE=off FILE_SIZE=4000 BYTES=512000000 PHASE_APPEND=off PHASE_READ=find DEV=/dev/hdb3 DD_MBCOUNT=768 WRITE_BUFFER=131072 PHASE_DELETE=rm PHASE_MODIFY=off </font></b></td></tr> <tr><td colspan=19 align=right> <font size=-2>Produced by <a href=http://namesys.com/benchmarks/mongo_readme.html>Mongo</a> benchmark suite.</font></td></tr> </table> === mongo 2003-08-26 === [[mongo]] comparison against ext3 <dl> <dt>reiser4 </dt> <dd>''</dd> <dt>mem total</dt> <dd>904048</dd> <dt>machine </dt> <dd>belka</dd> <dt>kernel </dt> <dd>2.6.0-test4 #176 SMP Tue Aug 26 19:09:38 MSD 2003</dd> <dt>date </dt> <dd>Tue Aug 26 19:34:39 2003</dd> </dl> <p> In this test 80% of files are chosen from the 0-4k size range, 16% from the 0-40k size range, 0.8 x 4% from the 0-400k size range, etc. Most files are small; most bytes are in large files.
</p> <p>Legend:</p> <ul> <li><tt>A</tt> reiser4</li> <li><tt>B</tt> reiser4, extents only</li> <li><tt>C</tt> ext3 in <tt>data=writeback</tt> mode (metadata-only journalling)</li> <li><tt>D</tt> ext3 in <tt>data=journal</tt> mode</li> <li><tt>E</tt> ext3 in <tt>data=ordered</tt> mode</li> <li><tt>F</tt> ext3 with htree (hashed directories)</li> </ul> <p> The table presents absolute values (elapsed time, CPU usage, and disk usage) for reiser4, and ratios relative to reiser4 for all other configurations. A <font color=red>red</font> number means the ratio is larger than <tt>1.0</tt>, i.e. reiser4 wins in that test; a <font color=green>green</font> number means reiser4 loses. </p> <table cols=19 cellpadding=2 cellspacing=2 noborder> <tr><td bgcolor=black colspan=19><font color=white></td></tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>A.INFO_R4='' FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>B.INFO_R4='' MKFS=mkfs.reiser4 -q -o policy=extents FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>C.MOUNT_OPTIONS=data=writeback FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>D.MOUNT_OPTIONS=data=journal FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>E.MOUNT_OPTIONS=data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>F.MKFS=mkfs.ext3 -O dir_index MOUNT_OPTIONS=data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <td colspan=19 bgcolor=#606060><b><font color=white>#0:</font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=6><b>REAL_TIME</b></td> <td colspan=6><b>CPU_TIME</b></td> <td colspan=6><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A
</b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>CREATE</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 27.6</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.311 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.567 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.538 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.668 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.566 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>13.55</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.166 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.035 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.162 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.189 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.670</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 788884</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.208 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.181 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.181 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.181 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.182 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>COPY</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 113.71</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.237 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.167 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.460 
</font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.227 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 7.387 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>23.13</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.169 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.498 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.691 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.591 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.709</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1577560</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.208 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.181 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.181 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.181 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.183 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>READ</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 111.51</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.239 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.157 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.176 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.096 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 7.017 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>20.76</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.042 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.424 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.415</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.416</U> </font></tt></td> <td 
bgcolor=#E0E0C0 align=right><tt><font color=green> 0.521 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1577560</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.208 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.181 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.181 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.181 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.183 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>STATS</b></td> <td bgcolor=#E0E0C0 align=right><tt>20.22</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.034 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.834</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.827</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.832</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.439 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>7.47</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.009 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.590</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.585</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.584</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.631 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1577560</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.208 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.181 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.181 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.181 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.183 </font></tt></td> 
</tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>DELETE</b></td> <td bgcolor=#E0E0C0 align=right><tt>110.98</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.437 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.183</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.180</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.185 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.277 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>33.03</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.838 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.196 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.192</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.193</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.221 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>4</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.000 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> </tt></td> </tr> <tr> <td colspan=19 bgcolor=#606060><b><font color=white>#1:DD_MBCOUNT=768 </font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=6><b>REAL_TIME</b></td> <td colspan=6><b>CPU_TIME</b></td> <td colspan=6><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A 
</b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_writing_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 26.03</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.096 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.340 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.092 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.080 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 3.48</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.011</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.083 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.583 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.187 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.190 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 786436</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_reading_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 19</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 0.995</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 
<U> 0.999</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 0.999</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>2.28</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.018 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.741 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.737</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.741 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.724</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 786436</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr><td bgcolor=black colspan=19><font color=white></td></tr> <tr> <td colspan=19 bgcolor=#303030><b><font color=white>NPROC=1 DIR=/mnt/testfs SYNC=off PHASE_COPY=cp REP_COUNTER=1 GAMMA=0.2 PHASE_OVERWRITE=off FILE_SIZE=4000 BYTES=512000000 PHASE_APPEND=off PHASE_READ=find DEV=/dev/hdb3 DD_MBCOUNT=768 WRITE_BUFFER=131072 PHASE_DELETE=rm PHASE_MODIFY=off </font></b></td></tr> <tr><td colspan=19 align=right> <font size=-2>Produced by <a href=http://namesys.com/benchmarks/mongo_readme.html>Mongo</a> benchmark suite.</font></td></tr> </table> === mongo, 2003-08-18 === [[mongo]] comparison against ext3 <dl> <dt>reiser4 </dt> <dd></dd> <dt>mem total</dt> <dd>255992</dd> <dt>machine </dt> <dd>belka</dd> <dt>kernel </dt> <dd>2.6.0-test3 #37 SMP Mon Aug 18 18:12:14 MSD
2003</dd> <dt>date </dt> <dd>Mon 18 Aug 2003 20:24:16</dd> </dl> <p> In this test 80% of files are chosen from the 0-8k size range, 16% from the 0-80k size range, 0.8 x 4% from the 0-800k size range, etc. Most files are small; most bytes are in large files. </p> <p>Legend:</p> <ul> <li><tt>A</tt> reiser4</li> <li><tt>B</tt> reiser4, extents only</li> <li><tt>C</tt> ext3 in <tt>data=writeback</tt> mode (metadata-only journalling)</li> <li><tt>D</tt> ext3 in <tt>data=journal</tt> mode</li> <li><tt>E</tt> ext3 in <tt>data=ordered</tt> mode</li> <li><tt>F</tt> ext3 with htree (hashed directories)</li> </ul> <p> The table presents absolute values (elapsed time, CPU usage, and disk usage) for reiser4, and ratios relative to reiser4 for all other configurations. A <font color=red>red</font> number means the ratio is larger than <tt>1.0</tt>, i.e. reiser4 wins in that test; a <font color=green>green</font> number means reiser4 loses. </p> <table cols=19 cellpadding=2 cellspacing=2 noborder> <tr><td bgcolor=black colspan=19><font color=white></td></tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>A.INFO_R4= FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>B.INFO_R4=ext MKFS=mkfs.reiser4 -q -o policy=extents FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>C.MOUNT_OPTIONS=data=writeback FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>D.MOUNT_OPTIONS=data=journal FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>E.MOUNT_OPTIONS=data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>F.MKFS=mkfs.ext3 -O dir_index MOUNT_OPTIONS=data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <td colspan=19 bgcolor=#606060><b><font color=white>#0:</font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td
colspan=6><b>REAL_TIME</b></td> <td colspan=6><b>CPU_TIME</b></td> <td colspan=6><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>CREATE</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 29.16</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.220 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.422 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.779 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.491 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.645 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>13.52</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.182 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.013 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.087 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.997 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.657</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 789364</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.208 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.181 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>COPY</b></td> <td bgcolor=#E0E0C0 
align=right><tt><U> 119.64</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.211 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.191 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.473 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.230 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 7.288 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>21.98</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.152 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.515 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.746 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.520 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.695</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1578116</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.208 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.182 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>READ</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 116.55</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.213 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.177 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.025 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.134 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 6.850 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>18.35</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.035 </font></tt></td> <td 
bgcolor=#E0E0C0 align=right><tt><font color=green> 0.447 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.436</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.431</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.569 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1578116</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.208 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.182 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>STATS</b></td> <td bgcolor=#E0E0C0 align=right><tt>21.65</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.050 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.779</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.811 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.782</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.358 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>7.56</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.001 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.599</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.612 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.611</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.638 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1578116</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.208 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 
</font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.182 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>DELETE</b></td> <td bgcolor=#E0E0C0 align=right><tt>112.37</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.434 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.179</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.198 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.177</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.281 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>30.62</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.851 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.205</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.205</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.203</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.230 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>4</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.000 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> </tt></td> </tr> <tr> <td colspan=19 bgcolor=#606060><b><font color=white>#1:DD_MBCOUNT=768 </font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=6><b>REAL_TIME</b></td> <td colspan=6><b>CPU_TIME</b></td> <td colspan=6><b>DF</b></td> 
</tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_writing_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 26.11</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.011</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.090 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.388 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.076 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.083 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>3.25</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.945</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.092 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.640 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.255 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.231 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 786436</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_reading_largefile</b></td> <td bgcolor=#E0E0C0 
align=right><tt><U> 19.09</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.005</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 0.999</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 0.996</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.004</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.011</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>2.09</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.019 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.847</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.856 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.833</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.842</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 786436</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr><td bgcolor=black colspan=19><font color=white></td></tr> <tr><td colspan=19 align=right> <tr> <td colspan=19 bgcolor=#303030><b><font color=white>NPROC=1 DIR=/mnt/testfs SYNC=off PHASE_COPY=cp REP_COUNTER=1 GAMMA=0.2 PHASE_OVERWRITE=off FILE_SIZE=4000 BYTES=512000000 PHASE_APPEND=off PHASE_READ=find DEV=/dev/hdb3 DD_MBCOUNT=768 WRITE_BUFFER=131072 PHASE_DELETE=rm PHASE_MODIFY=off </td></tr> <tr><td colspan=19 align=right> <font size=-2>Produced by <a 
href=http://namesys.com/benchmarks/mongo_readme.html>Mongo</a> benchmark suite.</font></td></tr> </table> === mongo, 2003-08-12 === [[mongo]] comparison against ext3 <dl> <dt>mem total</dt> <dd>513284</dd> <dt>machine </dt> <dd>strelka</dd> <dt>kernel </dt> <dd>2.6.0-test2 #52 SMP Tue Aug 12 15:17:12 MSD 2003</dd> <dt>date </dt> <dd>Tue Aug 12 15:38:47 2003</dd> </dl> <p> This is a comparison of the latest (2003-08-12) version of reiser4 with ext3. Reiser4 is an atomic filesystem, so the comparison with ext3's data journalling mode is the fairest, but since most users run ext3 in data ordering mode, we compare against that as well. </p> <p> In this test 80% of files are chosen from the 0-8k size range, 16% from the 0-80k size range, 0.8 x 4% from the 0-800k size range, etc. Most files are small; most bytes are in large files. </p> <p>Legend:</p> <ul> <li><tt>A</tt> reiser4</li> <li><tt>B</tt> ext3 in <tt>data=writeback</tt> mode (metadata-only journalling)</li> <li><tt>C</tt> ext3 in <tt>data=journal</tt> mode</li> <li><tt>D</tt> ext3 in <tt>data=ordered</tt> mode</li> <li><tt>E</tt> ext3 with htree (hashed directories)</li> <li><tt>F</tt> ext3 with support for filetypes in <tt>readdir()</tt></li> </ul> <p> The table presents absolute values (of elapsed time, CPU usage, and disk usage) for reiser4, and ratios against reiser4 for all other configurations. A <font color=red>red</font> number means the ratio is larger than <tt>1.0</tt>, that is, reiser4 is better in this test. A <font color=green>green</font> number means that reiser4 loses in this test.
</p> <table cols=19 cellpadding=2 cellspacing=2 noborder> <tr><td bgcolor=black colspan=19><font color=white></td></tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>A.INFO_R4= MKFS=/usr/local/sbin/mkfs.reiser4 -qf FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>B.MOUNT_OPTIONS=data=writeback FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>C.MOUNT_OPTIONS=data=journal FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>D.MOUNT_OPTIONS=data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>E.MKFS=/usr/local/sbin/mkfs.ext3 -O dir_index MOUNT_OPTIONS=data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>F.MKFS=/usr/local/sbin/mkfs.ext3 -O filetype MOUNT_OPTIONS=data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <td colspan=19 bgcolor=#606060><b><font color=white>#0:</font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=6><b>REAL_TIME</b></td> <td colspan=6><b>CPU_TIME</b></td> <td colspan=6><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>CREATE</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 14.06</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.317 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.248 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.050 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font 
color=red> 3.016 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.077 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>5.3</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.558 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.692 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.602 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.823</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.592 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 458224</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>COPY</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 43.62</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.982 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.733 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.033 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 6.685 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.904 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>9.19</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.163 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.286 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.230 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.706</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.200 </font></tt></td> </tt></td> 
<td bgcolor=#E0E0C0 align=right><tt><U> 916172</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>READ</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 39.86</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.091 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.091 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.140 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 6.003 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.119 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>8.22</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.467 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.454 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.464 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.529 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.443</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 916172</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>STATS</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1.54</U></tt></td> <td 
bgcolor=#E0E0C0 align=right><tt><font color=red> 1.987 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.896 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.942 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.649 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.883 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 0.26</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.115 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.115 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.115 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.385 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.962 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 916172</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>DELETE</b></td> <td bgcolor=#E0E0C0 align=right><tt>37.85</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.833 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.825 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.867 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.133 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.760</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>11.11</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.223</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font 
color=green> <U> 0.223</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.220</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.254 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.222</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>4</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> </tt></td> </tr> <tr> <td colspan=19 bgcolor=#606060><b><font color=white>#1:DD_MBCOUNT=500 </font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=6><b>REAL_TIME</b></td> <td colspan=6><b>CPU_TIME</b></td> <td colspan=6><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_writing_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 42.15</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.062 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.534 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.066 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.071 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.073 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 
align=right><tt><U> 7.86</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.094 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.500 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.206 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.211 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.198 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 512004</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_reading_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 36.5</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.005</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.008</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.005</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.007</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.007</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>4.7</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.745</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.732</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.743</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.736</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.734</U> </font></tt></td> 
</tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 512004</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr><td bgcolor=black colspan=19><font color=white></td></tr> <tr><td colspan=19 align=right> <tr> <td colspan=19 bgcolor=#303030><b><font color=white>NPROC=1 DIR=/data1 SYNC=off PHASE_COPY=cp REP_COUNTER=3 GAMMA=0.2 PHASE_OVERWRITE=off PHASE_STATS=find FILE_SIZE=8192 BYTES=134217728 PHASE_APPEND=off PHASE_READ=find DEV=/dev/hdb1 DD_MBCOUNT=500 WRITE_BUFFER=131072 PHASE_DELETE=rm PHASE_MODIFY=off </td></tr> <tr><td colspan=19 align=right> <font size=-2>Produced by <a href=http://namesys.com/benchmarks/mongo_readme.html>Mongo</a> benchmark suite.</font></td></tr> </table> === mongo, 2003-07-10 === [[mongo]] comparisons from 2003-07-10 (reiser4 vs. ext3, then reiserfs vs. reiser4), obtained before [http://mail.fsfeurope.org/pipermail/booth/2003-February/000083.html LinuxTAG 2003] <table cols=10 cellpadding=2 cellspacing=2 noborder> <tr><td bgcolor=black colspan=10><font color=white></td></tr> <tr> <th bgcolor=#303030 colspan=10 align=left><font color=white>A. reiser4</th> </tr> <tr> <th bgcolor=#303030 colspan=10 align=left><font color=white>B. ext3 data journalling</th> </tr> <tr> <th bgcolor=#303030 colspan=10 align=left><font color=white>C.
ext3 </font></th> </tr> <tr> <td colspan=10 bgcolor=#606060><b><font color=white>#0:</font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=3><b>REAL_TIME</b></td> <td colspan=3><b>CPU_TIME</b></td> <td colspan=3><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>CREATE</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 14.19</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.221 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.592 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 5.66</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.610 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.475 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 458692</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>COPY</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 49.01</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.586 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.783 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 9.08</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.308 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.176 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 916668</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>READ</b></td> <td bgcolor=#E0E0C0 
align=right><tt>43.39</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.970</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.017 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>8.1</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.452</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.453</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 916668</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>STATS</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1.93</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.534 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.549 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 0.27</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.000 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.963 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 916668</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>DELETE</b></td> <td bgcolor=#E0E0C0 align=right><tt>40.13</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.797</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.837 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>11.26</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.217 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.210</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>4</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font 
color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> </tt></td> </tr> <tr> <td colspan=10 bgcolor=#606060><b><font color=white>#1:DD_MBCOUNT=500 </font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=3><b>REAL_TIME</b></td> <td colspan=3><b>CPU_TIME</b></td> <td colspan=3><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_writing_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 42.27</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.527 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.057 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 7.78</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.497 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.189 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 512004</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_reading_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 36.57</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.005</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.005</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>4.8</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.760</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.777 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 512004</U></tt></td> <td bgcolor=#E0E0C0 
align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr><td bgcolor=black colspan=10><font color=white></td></tr> <tr><td colspan=10 align=right> <tr> <td colspan=10 bgcolor=#303030><b><font color=white>NPROC=1 DIR=/data1 SYNC=off PHASE_COPY=cp REP_COUNTER=3 GAMMA=0.2 PHASE_OVERWRITE=off PHASE_STATS=find FILE_SIZE=8192 BYTES=134217728 PHASE_APPEND=off PHASE_READ=find DEV=/dev/hdb1 DD_MBCOUNT=500 WRITE_BUFFER=131072 PHASE_DELETE=rm PHASE_MODIFY=off </td></tr> <tr><td colspan=10 align=right> <font size=-2>Produced by <a href=http://namesys.com/benchmarks/mongo_readme.html>Mongo</a> benchmark suite.</font></td></tr> </table> <hr> <a name="mongo.2003.07.10"> <p> Below are some older benchmarks from just before LinuxTAG. In these, note that gamma is the fraction of files that are larger than the base size by 10x. It is set either to 0.2 (as in the benchmark above) to mimic observed real usage patterns, or to 0 to measure a file size range's performance qualities in isolation. Note that V3 performs poorly in the 0-8k size range, while V4 performs well; this is the result of deep design changes you can read about at <a href="http://www.namesys.com/v4/v4.html">http://www.namesys.com/v4/v4.html</a>. </p> <dl><dt>mem total</dt><dd>513748</dd><dt>machine </dt><dd>strelka</dd><dt>kernel </dt><dd>2.5.74 #213 SMP Thu Jul 10 22:53:23 MSD 2003</dd><dt>date </dt><dd>Thu Jul 10 22:48:56 2003</dd><dt>.config </dt><dd><a href="http://www.namesys.com/intbenchmarks/mongo/03.07.11.nikita/.config">here</a></dd><dt>NPROC</dt><dd>1</dd><dt>DIR</dt><dd>/data1</dd><dt>SYNC</dt><dd>off</dd><dt>REP_COUNTER</dt><dd>3</dd><dt>All phases are in readdir order</dt><dd></dd><dt>BYTES</dt><dd>100M</dd><dt>DEV</dt><dd>/dev/hdb1</dd><dt>WRITE_BUFFER</dt><dd><b>256k</b></dd></dl> <p>Everywhere, <b>A</b> is reiserfs and <b>B</b> is reiser4.
Green numbers mean reiser4 is better.</p> <table cols="7" cellpadding="2" cellspacing="2" noborder=""> <tbody><tr><td bgcolor="black" colspan="7"><font color="white"></font></td></tr> <tr> <th bgcolor="#303030" colspan="7" align="left"><font color="white">median file size 8k</font></th> </tr> <tr align="center" bgcolor="#c0c0c0"> <td></td> <td colspan="2"><b>REAL_TIME</b></td> <td colspan="2"><b>CPU_TIME</b></td> <td colspan="2"><b>DF</b></td> </tr> <tr align="center" bgcolor="#c0c0c0"> <td></td> <td><b>A</b></td><td><b>B/A </b></td> <td><b>A</b></td><td><b>B/A </b></td> <td><b>A</b></td><td><b>B/A </b></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>CREATE</b></td> <td bgcolor="#e0e0c0" align="right"><tt>41.26</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.246</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>3.93</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.908</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>321632</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.961</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>COPY</b></td> <td bgcolor="#e0e0c0" align="right"><tt>154.09</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.504</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 5.17</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.217 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>642624</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.962</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>APPEND</b></td> <td bgcolor="#e0e0c0" align="right"><tt>282.09</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.573</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 6.6</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.392 </font></tt></td> <td bgcolor="#e0e0c0" 
align="right"><tt>944428</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 0.980</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>MODIFY</b></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 284.52</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 0.986</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 3.29</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.489 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 943592</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 0.981</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>OVERWRITE</b></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 298.19</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.263 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 5.33</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.608 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>943548</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.968</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>READ</b></td> <td bgcolor="#e0e0c0" align="right"><tt>245.22</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.940</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 3.85</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.753 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>943548</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.968</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>STATS</b></td> <td bgcolor="#e0e0c0" align="right"><tt>20.58</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.099</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 0.48</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.292 </font></tt></td> <td 
bgcolor="#e0e0c0" align="right"><tt>943548</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.968</u> </font></tt></td> </tr> <tr> <td colspan="7" bgcolor="#a0a0a0"><b><font color="white">GAMMA=0.2 FILE_SIZE=8192 <a href="http://www.namesys.com/intbenchmarks/mongo/03.07.11.nikita/8k.heavy.v3.profile">A profile</a> <a href="http://www.namesys.com/intbenchmarks/mongo/03.07.11.nikita/8k.heavy.v4.profile">B profile</a></font></b></td></tr> <tr><td bgcolor="white" colspan="7"><font color="white"></font></td></tr> <tr><td bgcolor="white" colspan="7"><font color="white"></font></td></tr> <tr><td bgcolor="white" colspan="7"><font color="white"></font></td></tr> <tr><td bgcolor="black" colspan="7"><font color="white"></font></td></tr> <tr> <th bgcolor="#303030" colspan="7" align="left"><font color="white">median file size 4k</font></th> </tr> <tr align="center" bgcolor="#c0c0c0"> <td></td> <td colspan="2"><b>REAL_TIME</b></td> <td colspan="2"><b>CPU_TIME</b></td> <td colspan="2"><b>DF</b></td> </tr> <tr align="center" bgcolor="#c0c0c0"> <td></td> <td><b>A</b></td><td><b>B/A </b></td> <td><b>A</b></td><td><b>B/A </b></td> <td><b>A</b></td><td><b>B/A </b></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>CREATE</b></td> <td bgcolor="#e0e0c0" align="right"><tt>117.32</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.176</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>15.57</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.758</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 667652</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.000</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>COPY</b></td> <td bgcolor="#e0e0c0" align="right"><tt>524.67</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.365</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 19.16</u></tt></td> <td bgcolor="#e0e0c0" 
align="right"><tt><font color="red"> 1.059 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 1332856</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.002</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>APPEND</b></td> <td bgcolor="#e0e0c0" align="right"><tt>1068.43</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.363</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>31.27</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.937</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>2073420</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.950</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>MODIFY</b></td> <td bgcolor="#e0e0c0" align="right"><tt>1081.23</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.670</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 18.61</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.048 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>2066536</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.953</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>OVERWRITE</b></td> <td bgcolor="#e0e0c0" align="right"><tt>1050.55</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.885</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 22.81</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.017</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>2066424</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.948</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>READ</b></td> <td bgcolor="#e0e0c0" align="right"><tt>974.43</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.644</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 
12.28</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.635 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>2066424</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.948</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>STATS</b></td> <td bgcolor="#e0e0c0" align="right"><tt>83.44</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.075</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>1.26</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.802</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>2066424</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.948</u> </font></tt></td> </tr> <tr> <td colspan="7" bgcolor="#a0a0a0"><b><font color="white">GAMMA=0.2 FILE_SIZE=4096 <a href="http://www.namesys.com/intbenchmarks/mongo/03.07.11.nikita/4k.heavy.v3.profile">A profile</a> <a href="http://www.namesys.com/intbenchmarks/mongo/03.07.11.nikita/4k.heavy.v4.profile">B profile</a></font></b></td></tr> <tr><td bgcolor="white" colspan="7"><font color="white"></font></td></tr> <tr><td bgcolor="white" colspan="7"><font color="white"></font></td></tr> <tr><td bgcolor="white" colspan="7"><font color="white"></font></td></tr> <tr><td bgcolor="black" colspan="7"><font color="white"></font></td></tr> <tr> <th bgcolor="#303030" colspan="7" align="left"><font color="white">maximal file size 4k</font></th> </tr> <tr align="center" bgcolor="#c0c0c0"> <td></td> <td colspan="2"><b>REAL_TIME</b></td> <td colspan="2"><b>CPU_TIME</b></td> <td colspan="2"><b>DF</b></td> </tr> <tr align="center" bgcolor="#c0c0c0"> <td></td> <td><b>A</b></td><td><b>B/A </b></td> <td><b>A</b></td><td><b>B/A </b></td> <td><b>A</b></td><td><b>B/A </b></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>CREATE</b></td> <td bgcolor="#e0e0c0" align="right"><tt>77.34</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.309</u> 
</font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>21.86</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.938</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>452252</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.923</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>COPY</b></td> <td bgcolor="#e0e0c0" align="right"><tt>412.28</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.300</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 35.11</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.013</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>893408</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.934</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>APPEND</b></td> <td bgcolor="#e0e0c0" align="right"><tt>1198.9</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.164</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>67.06</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.694</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>1631992</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.749</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>MODIFY</b></td> <td bgcolor="#e0e0c0" align="right"><tt>1305.14</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.351</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>43.77</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.762</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>1613124</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.758</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>OVERWRITE</b></td> <td bgcolor="#e0e0c0" align="right"><tt>1390.94</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font 
color="green"> <u> 0.239</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>44.22</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.777</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>1610948</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.759</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>READ</b></td> <td bgcolor="#e0e0c0" align="right"><tt>1093.6</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.256</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 19.46</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.743 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>1610948</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.759</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>STATS</b></td> <td bgcolor="#e0e0c0" align="right"><tt>115.76</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.200</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>2.6</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.735</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>1610948</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.759</u> </font></tt></td> </tr> <tr> <td colspan="7" bgcolor="#a0a0a0"><b><font color="white">GAMMA=0.0 FILE_SIZE=4096 <a href="http://www.namesys.com/intbenchmarks/mongo/03.07.11.nikita/100.heavy.v3.profile">A profile</a> <a href="http://www.namesys.com/intbenchmarks/mongo/03.07.11.nikita/100.heavy.v4.profile">B profile</a></font></b></td></tr> <tr><td bgcolor="white" colspan="7"><font color="white"></font></td></tr> <tr><td bgcolor="white" colspan="7"><font color="white"></font></td></tr> <tr><td bgcolor="white" colspan="7"><font color="white"></font></td></tr> <tr> <th bgcolor="#303030" colspan="7" align="left"><font color="white">median file size 8k</font></th> </tr> 
<tr align="center" bgcolor="#c0c0c0"> <td></td> <td colspan="2"><b>REAL_TIME</b></td> <td colspan="2"><b>CPU_TIME</b></td> <td colspan="2"><b>DF</b></td> </tr> <tr align="center" bgcolor="#c0c0c0"> <td></td> <td><b>A</b></td><td><b>B/A </b></td> <td><b>A</b></td><td><b>B/A </b></td> <td><b>A</b></td><td><b>B/A </b></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>CREATE</b></td> <td bgcolor="#e0e0c0" align="right"><tt>40.54</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.248</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>4.01</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.895</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>321632</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.961</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>COPY</b></td> <td bgcolor="#e0e0c0" align="right"><tt>152.82</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.506</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 5.2</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.215 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>642624</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.962</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>READ</b></td> <td bgcolor="#e0e0c0" align="right"><tt>141.8</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.563</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 3.03</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.762 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>642624</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.962</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>STATS</b></td> <td bgcolor="#e0e0c0" align="right"><tt>14.91</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.084</u> 
</font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 0.59</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.051 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>642624</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.962</u> </font></tt></td> </tr> <tr><td bgcolor="black" colspan="7"><font color="white"></font></td></tr> <tr><td colspan="7" align="right"> </td></tr><tr> <td colspan="7" bgcolor="#303030"><b><font color="white">GAMMA=0.2 FILE_SIZE=8192</font></b></td></tr> <tr><td bgcolor="white" colspan="7"><font color="white"></font></td></tr> <tr><td bgcolor="white" colspan="7"><font color="white"></font></td></tr> <tr><td bgcolor="white" colspan="7"><font color="white"></font></td></tr> <tr> <th bgcolor="#303030" colspan="7" align="left"><font color="white">median file size 4k</font></th> </tr> <tr align="center" bgcolor="#c0c0c0"> <td></td> <td colspan="2"><b>REAL_TIME</b></td> <td colspan="2"><b>CPU_TIME</b></td> <td colspan="2"><b>DF</b></td> </tr> <tr align="center" bgcolor="#c0c0c0"> <td></td> <td><b>A</b></td><td><b>B/A </b></td> <td><b>A</b></td><td><b>B/A </b></td> <td><b>A</b></td><td><b>B/A </b></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>CREATE</b></td> <td bgcolor="#e0e0c0" align="right"><tt>115.6</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.174</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>14.84</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.772</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 667652</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.000</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>COPY</b></td> <td bgcolor="#e0e0c0" align="right"><tt>528.83</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.361</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 18.91</u></tt></td> <td bgcolor="#e0e0c0" 
align="right"><tt><font color="red"> 1.058 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 1332856</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.002</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>READ</b></td> <td bgcolor="#e0e0c0" align="right"><tt>532.06</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.372</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 10.87</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.589 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 1332856</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.002</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>STATS</b></td> <td bgcolor="#e0e0c0" align="right"><tt>51.99</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.069</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>1.67</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.581</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 1332856</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.002</u> </font></tt></td> </tr> <tr><td bgcolor="black" colspan="7"><font color="white"></font></td></tr> <tr><td colspan="7" align="right"> </td></tr><tr> <td colspan="7" bgcolor="#303030"><b><font color="white">GAMMA=0.2 FILE_SIZE=4096</font></b></td></tr> <tr><td bgcolor="white" colspan="7"><font color="white"></font></td></tr> <tr><td bgcolor="white" colspan="7"><font color="white"></font></td></tr> <tr><td bgcolor="white" colspan="7"><font color="white"></font></td></tr> <tr> <th bgcolor="#303030" colspan="7" align="left"><font color="white">maximal file size 4k</font></th> </tr> <tr align="center" bgcolor="#c0c0c0"> <td></td> <td colspan="2"><b>REAL_TIME</b></td> <td colspan="2"><b>CPU_TIME</b></td> <td colspan="2"><b>DF</b></td> </tr> <tr align="center" bgcolor="#c0c0c0"> 
<td></td> <td><b>A</b></td><td><b>B/A </b></td> <td><b>A</b></td><td><b>B/A </b></td> <td><b>A</b></td><td><b>B/A </b></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>CREATE</b></td> <td bgcolor="#e0e0c0" align="right"><tt>77.5</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.309</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>22.24</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.910</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>452252</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.923</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>COPY</b></td> <td bgcolor="#e0e0c0" align="right"><tt>415.84</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.297</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 34.9</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.009</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>893408</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.934</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>READ</b></td> <td bgcolor="#e0e0c0" align="right"><tt>469.97</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.273</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 20.14</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.454 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>893408</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.934</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>STATS</b></td> <td bgcolor="#e0e0c0" align="right"><tt>65.49</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.162</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>3.09</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.599</u> </font></tt></td> <td bgcolor="#e0e0c0" 
align="right"><tt>893408</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.934</u> </font></tt></td> </tr> <tr><td bgcolor="black" colspan="7"><font color="white"></font></td></tr> <tr><td colspan="7" align="right"> </td></tr><tr> <td colspan="7" bgcolor="#303030"><b><font color="white">GAMMA=0.0 FILE_SIZE=4096</font></b></td></tr> </tbody></table> <hr> <h1>Mongo benchmark results</h1> <h2>create, copy, read, stats, delete phases</h2> <dl><dt>reiser4 </dt><dd>ChangeSet@1.1095, 2003-07-10 15:22:17+04:00, god@laputa.namesys.com oops ChangeSet@1.1094, 2003-07-10 15:14:06+04:00, god@laputa.namesys.com repairing compilation damage. </dd><dt>mem total</dt><dd>256624</dd><dt>machine </dt><dd>belka</dd><dt>kernel </dt><dd>2.5.74 #28 Thu Jul 10 18:36:03 MSD 2003</dd><dt>date </dt><dd>Thu Jul 10 19:21:06 2003</dd><dt><a href="http://namesys.com/intbenchmarks/mongo/03.07.11.light/dot.config">.config</a></dt></dl> <table cols="19" cellpadding="2" cellspacing="2" noborder=""> <tbody><tr><td bgcolor="black" colspan="19"><font color="white"></font></td></tr> <tr> <th bgcolor="#303030" colspan="19" align="left"><font color="white">A.INFO_R4=test FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor="#303030" colspan="19" align="left"><font color="white">B.INFO_R4=test FSTYPE=reiser4 MKFS=mkfs.reiser4 -q -e extent40 </font></th> </tr> <tr> <th bgcolor="#303030" colspan="19" align="left"><font color="white">C.FSTYPE=reiserfs </font></th> </tr> <tr> <th bgcolor="#303030" colspan="19" align="left"><font color="white">D.FSTYPE=reiserfs MOUNT_OPTIONS=notail </font></th> </tr> <tr> <th bgcolor="#303030" colspan="19" align="left"><font color="white">E.FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor="#303030" colspan="19" align="left"><font color="white">F.FSTYPE=ext3 MOUNT_OPTIONS=data=journal </font></th> </tr> <tr> <td colspan="19" bgcolor="#606060"><b><font color="white">#0:FILE_SIZE=4000 </font></b></td></tr> <tr align="center" bgcolor="#c0c0c0"> <td></td> <td 
colspan="6"><b>REAL_TIME</b></td> <td colspan="6"><b>CPU_TIME</b></td> <td colspan="6"><b>DF</b></td> </tr> <tr align="center" bgcolor="#c0c0c0"> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>CREATE</b></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 20.47</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.404 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 3.037 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 2.024 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 2.513 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 3.324 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>12.72</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.143 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.270 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.873 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.615</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.606</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 416332</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.934 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.088 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.909 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.858 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.858 
</font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>COPY</b></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 65.25</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.484 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 2.953 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 2.020 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.986 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 2.267 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>21.98</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.032 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.098 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.732 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.529</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.699 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 832640</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.934 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.088 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.910 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.858 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.858 </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>READ</b></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 75.56</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.349 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 2.868 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 2.218 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.902 </font></tt></td> <td bgcolor="#e0e0c0" 
align="right"><tt><font color="red"> 1.925 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>17.36</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.213 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.745 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.857 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.695 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.681</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 832640</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.934 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.088 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.910 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.858 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.858 </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>STATS</b></td> <td bgcolor="#e0e0c0" align="right"><tt>132.18</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> 0.996 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.963</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> 0.994 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.967</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.950</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>2.63</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.977</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.970</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 0.989</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 0.981</u> 
</font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> 1.008 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 832640</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.934 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.088 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.910 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.858 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.858 </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>DELETE</b></td> <td bgcolor="#e0e0c0" align="right"><tt>85.32</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.627 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.239 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.442 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.403</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.449 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>33.57</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.856 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.780 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.623 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.157</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.154</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>4</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> 1.000 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.000</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.000</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font 
color="green"> <u> 0.000</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.000</u> </font></tt></td> </tr> <tr> <td colspan="19" bgcolor="#606060"><b><font color="white">#1:FILE_SIZE=8000 </font></b></td></tr> <tr align="center" bgcolor="#c0c0c0"> <td></td> <td colspan="6"><b>REAL_TIME</b></td> <td colspan="6"><b>CPU_TIME</b></td> <td colspan="6"><b>DF</b></td> </tr> <tr align="center" bgcolor="#c0c0c0"> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>CREATE</b></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 15.07</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.009</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 8.875 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.709 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 2.237 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 3.321 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>8.62</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.945 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.932 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.729 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.517</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.522</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 399788</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.000</u> 
</font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.243 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.461 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.434 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.434 </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>COPY</b></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 52.24</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.007</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 4.998 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.492 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.562 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.879 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>13.42</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.026 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.264 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.700 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.487</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.635 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 799488</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.000</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.243 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.461 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.434 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.434 </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>READ</b></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 60.91</u></tt></td> <td bgcolor="#e0e0c0" 
align="right"><tt><font color="black"> <u> 1.013</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 3.738 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.606 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.333 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.340 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>11.66</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> 1.018 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.526</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.749 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.547 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.547 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 799488</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.000</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.243 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.461 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.434 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.434 </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>STATS</b></td> <td bgcolor="#e0e0c0" align="right"><tt>126.53</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.951</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.958</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> 0.991 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> 1.004 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.966</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 
2.57</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.023 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.027 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 0.988</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> 1.016 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> 1.012 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 799488</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.000</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.243 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.461 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.434 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.434 </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>DELETE</b></td> <td bgcolor="#e0e0c0" align="right"><tt>73.21</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.116 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.746 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.242</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.301 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.396 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>19.93</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> 1.013 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.584 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.530 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.126 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.123</u> </font></tt></td> <td bgcolor="#e0e0c0" 
align="right"><tt>4</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> 1.000 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.000</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.000</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.000</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.000</u> </font></tt></td> </tr> <tr><td bgcolor="black" colspan="19"><font color="white"></font></td></tr> <tr><td colspan="19" align="right"> </td></tr><tr> <td colspan="19" bgcolor="#303030"><b><font color="white">PHASE_APPEND=off NPROC=1 DIR=/mnt/testfs SYNC=off REP_COUNTER=3 GAMMA=0.0 PHASE_OVERWRITE=off DEV=/dev/hdb3 WRITE_BUFFER=4096 BYTES=128000000 PHASE_MODIFY=off </font></b></td></tr> <tr><td colspan="19" align="right"> <font size="-2">Produced by <a href="http://namesys.com/benchmarks/mongo_readme.html">Mongo</a> benchmark suite.</font></td></tr> </tbody></table> <h2>dd of a large file phase</h2> <dl><dt>reiser4 </dt><dd>ChangeSet@1.1095, 2003-07-10 15:22:17+04:00, god@laputa.namesys.com oops ChangeSet@1.1094, 2003-07-10 15:14:06+04:00, god@laputa.namesys.com repairing compilation damage. 
</dd><dt>mem total</dt><dd>256624</dd><dt>machine </dt><dd>belka</dd><dt>kernel </dt><dd>2.5.74 #28 Thu Jul 10 18:36:03 MSD 2003</dd><dt>date </dt><dd>Thu Jul 10 21:36:22 2003</dd><dt><a href="http://namesys.com/intbenchmarks/mongo/03.07.11.light/dot.config">.config</a></dt></dl> <table cols="19" cellpadding="2" cellspacing="2" noborder=""> <tbody><tr><td bgcolor="black" colspan="19"><font color="white"></font></td></tr> <tr> <th bgcolor="#303030" colspan="19" align="left"><font color="white">A.INFO_R4=test FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor="#303030" colspan="19" align="left"><font color="white">B.INFO_R4=test FSTYPE=reiser4 MKFS=mkfs.reiser4 -q -e extent40 </font></th> </tr> <tr> <th bgcolor="#303030" colspan="19" align="left"><font color="white">C.FSTYPE=reiserfs </font></th> </tr> <tr> <th bgcolor="#303030" colspan="19" align="left"><font color="white">D.FSTYPE=reiserfs MOUNT_OPTIONS=notail </font></th> </tr> <tr> <th bgcolor="#303030" colspan="19" align="left"><font color="white">E.FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor="#303030" colspan="19" align="left"><font color="white">F.FSTYPE=ext3 MOUNT_OPTIONS=data=journal </font></th> </tr> <tr> <td colspan="19" bgcolor="#606060"><b><font color="white">#0:DD_MBCOUNT=768 </font></b></td></tr> <tr align="center" bgcolor="#c0c0c0"> <td></td> <td colspan="6"><b>REAL_TIME</b></td> <td colspan="6"><b>CPU_TIME</b></td> <td colspan="6"><b>DF</b></td> </tr> <tr align="center" bgcolor="#c0c0c0"> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>dd_writing_largefile</b></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 76.29</u></tt></td> <td bgcolor="#e0e0c0" 
align="right"><tt><font color="black"> <u> 0.997</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.137 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.149 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.062 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 2.217 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>7.47</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.027 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.545</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.549</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.803 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.835 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 786432</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.000</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.001</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.001</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.001</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.001</u> </font></tt></td> </tr> <tr><td bgcolor="black" colspan="19"><font color="white"></font></td></tr> <tr><td colspan="19" align="right"> </td></tr><tr> <td colspan="19" bgcolor="#303030"><b><font color="white">NPROC=1 DIR=/mnt/testfs SYNC=off REP_COUNTER=3 GAMMA=0.0 DD_MBCOUNT=768 DEV=/dev/hdb3 WRITE_BUFFER=4096 FILE_SIZE=8000 BYTES=128000000 </font></b></td></tr> <tr><td colspan="19" align="right"> <font size="-2">Produced by <a href="http://namesys.com/benchmarks/mongo_readme.html">Mongo</a> benchmark suite.</font></td></tr> </tbody></table> === bonnie++ 2003-09-30 === Bonnie++ 
comparison, ext3 vs reiser4 (2003-09-30) This is bonnie++ output for reiser4 and ext3. This has been done in an attempt to analyze <a href="http://fsbench.netnation.com/">results</a> obtained by Mike Benoit. Hardware specs: <pre> processor : 3 vendor_id : GenuineIntel cpu family : 15 model : 2 model name : Intel(R) Xeon(TM) CPU 2.40GHz stepping : 7 cpu MHz : 2379.253 cache size : 512 KB bogomips : 4751.36 </pre> Dual CPU with hyper-threading Memory: 128M HDD: <pre> # hdparm /dev/hdb1 /dev/hdb1: multcount = 16 (on) IO_support = 0 (default 16-bit) unmaskirq = 0 (off) using_dma = 1 (on) keepsettings = 0 (off) readonly = 0 (off) readahead = 256 (on) geometry = 65535/16/63, sectors = 117226242, start = 63 # hdparm -t /dev/hdb1 /dev/hdb1: Timing buffered disk reads: 64 MB in 1.60 seconds = 39.91 MB/sec # hdparm -i /dev/hdb /dev/hdb: Model=ST360021A, FwRev=3.19, SerialNo=3HR173RB Config={ HardSect NotMFM HdSw>15uSec Fixed DTR>10Mbs RotSpdTol>.5% } RawCHS=16383/16/63, TrkSize=0, SectSize=0, ECCbytes=4 BuffType=unknown, BuffSize=2048kB, MaxMultSect=16, MultSect=16 CurCHS=16383/16/63, CurSects=16514064, LBA=yes, LBAsects=117231408 IORDY=on/off, tPIO={min:240,w/IORDY:120}, tDMA={min:120,rec:120} PIO modes: pio0 pio1 pio2 pio3 pio4 DMA modes: mdma0 mdma1 mdma2 UDMA modes: udma0 udma1 udma2 udma3 udma4 *udma5 AdvancedPM=no WriteCache=enabled Drive conforms to: device does not report version: 1 2 3 4 5 </pre> <pre> ./bonnie++ -s 1g -n 10 -x 5 Version 1.03 ------Sequential Output------ --Sequential Input- --Random- -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks-- Machine Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP v4.128M 1G 19903 89 37911 20 15392 11 13624 58 41807 12 131.0 0 v4.128M 1G 19965 89 37600 20 15845 11 13730 58 41751 12 130.0 0 v4.128M 1G 19937 89 37746 20 15404 11 13624 58 41793 12 132.1 0 v4.128M 1G 19998 89 37184 19 15007 10 13393 56 41611 11 130.2 0 v4.128M 1G 19771 89 37679 20 15206 11 13466 57 41808 11 130.2 1 ext3.128M 1G 21236 99 
37258 22 11357 4 13460 56 41748 6 120.0 0 ext3.128M 1G 20821 99 36838 23 12176 5 13154 55 40671 6 120.7 0 ext3.128M 1G 20755 99 37032 24 12069 4 12908 54 40851 5 120.2 0 ext3.128M 1G 20651 99 37094 24 11817 5 13038 54 40842 6 121.3 0 ext3.128M 1G 20928 99 37300 23 12287 4 13067 55 41404 6 120.1 0 ------Sequential Create------ --------Random Create-------- -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete-- files:max:min /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP v4.128M 10 18503 100 +++++ +++ 9488 99 10158 99 +++++ +++ 11635 99 v4.128M 10 19760 99 +++++ +++ 9696 99 10441 100 +++++ +++ 11831 99 v4.128M 10 19583 100 +++++ +++ 9672 100 10597 99 +++++ +++ 11846 100 v4.128M 10 19720 100 +++++ +++ 9577 99 10126 100 +++++ +++ 11924 100 v4.128M 10 19682 100 +++++ +++ 9683 100 10461 100 +++++ +++ 11834 100 ext3.128M 10 3279 97 +++++ +++ +++++ +++ 3406 100 +++++ +++ 8951 95 ext3.128M 10 3303 98 +++++ +++ +++++ +++ 3423 99 +++++ +++ 8558 96 ext3.128M 10 3317 98 +++++ +++ +++++ +++ 3402 100 +++++ +++ 8721 93 ext3.128M 10 3325 98 +++++ +++ +++++ +++ 3390 100 +++++ +++ 9242 100 ext3.128M 10 3315 97 +++++ +++ +++++ +++ 3439 100 +++++ +++ 8896 96 </pre> <pre> ./bonnie++ -f -d . 
-s 3072 -n 10:100000:10:10 -x 1 Version 1.03 ------Sequential Output------ --Sequential Input- --Random- -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks-- Machine Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP v4 3G 37579 19 15657 11 41531 11 105.8 0 v4 3G 37993 20 15478 11 41632 11 105.4 0 ext3 3G 35221 22 10987 4 41105 6 90.9 0 ext3 3G 35099 22 11517 4 41416 6 90.7 0 ------Sequential Create------ --------Random Create-------- -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete-- files:max:min /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP v4 10:100000:10/10 570 39 746 17 1435 23 513 40 104 2 951 15 v4 10:100000:10/10 566 40 765 17 1385 23 509 41 104 2 904 14 ext3 10:100000:10/10 221 8 364 4 853 4 204 7 99 1 306 2 ext3 10:100000:10/10 221 7 368 4 839 5 206 7 91 1 309 2 </pre> <hr> <a name="grant"></a> Benchmarks performed by <a href="mailto:mine0057@mrs.umn.edu">Grant Miner</a>. He used the <a href="http://epoxy.mrs.umn.edu/~minerg/fstests/bench.scm">bench.scm</a> script (which requires <a href="http://www.scsh.net/">scsh</a>). Results (copied from <a href="http://epoxy.mrs.umn.edu/~minerg/fstests/results.html">http://epoxy.mrs.umn.edu/~minerg/fstests/results.html</a>): <p>2.6.0-test3</p> <p>mkfs ran with default options</p> <p>Each test has three columns. The first is the canonical name of the test, with the time the test took in seconds. The second column is system CPU time, and the third is user CPU time. The last column group, "total", gives total time; "sys" is total system time; "usr" is total user time; "total cpu" is the sum of total system time and total user time.
</p> <p><b>all values are in seconds thus lower is better</b></p> <table border cellspacing=0 cellpadding=5> <caption>Filesystem Performance</caption> <colgroup> <col> <col bgcolor="gray"> </colgroup> <tr> <th>fs</th> <td bgcolor="lightgray">bigdir</td> <td>sys</td> <td>usr</td> <td bgcolor="lightgray">cp</td> <td>sys</td> <td>usr</td> <td bgcolor="lightgray">cp2</td> <td>sys</td> <td>usr</td> <td bgcolor="lightgray">cp3</td> <td>sys</td> <td>usr</td> <td bgcolor="lightgray">cp4</td> <td>sys</td> <td>usr</td> <td bgcolor="lightgray">cp5</td> <td>sys</td> <td>usr</td> <td bgcolor="lightgray">rm</td> <td>sys</td> <td>usr</td> <td bgcolor="lightgray">rm2</td> <td>sys</td> <td>usr</td> <td bgcolor="lightgray">rm3</td> <td>sys</td> <td>usr</td> <td bgcolor="lightgray">sync</td> <td>sys</td> <td>usr</td> <td bgcolor="lightgray">total</td> <td>sys</td> <td>usr</td> <td bgcolor="lightgray">total cpu</td> <th>fs</th> </tr> <tr> <th>reiserfs</th> <td bgcolor="lightgray">40.03</td> <td>12.22</td> <td>0.76</td> <td bgcolor="lightgray">77.75</td> <td>10.72</td> <td>0.45</td> <td bgcolor="lightgray">62.9</td> <td>10.82</td> <td>0.43</td> <td bgcolor="lightgray">60.26</td> <td>11.03</td> <td>0.43</td> <td bgcolor="lightgray">61.33</td> <td>11.13</td> <td>0.43</td> <td bgcolor="lightgray">66.08</td> <td>11.31</td> <td>0.45</td> <td bgcolor="lightgray">10.86</td> <td>3.74</td> <td>0.07</td> <td bgcolor="lightgray">4.62</td> <td>3.36</td> <td>0.09</td> <td bgcolor="lightgray">8.22</td> <td>3.5</td> <td>0.09</td> <td bgcolor="lightgray">1.78</td> <td>0.03</td> <td>0.</td> <td bgcolor="lightgray">393.83</td> <td>77.86</td> <td>3.2</td> <td bgcolor="lightgray">81.06</td> <th>reiserfs</th> </tr> <tr> <th>jfs</th> <td bgcolor="lightgray">47.2</td> <td>8.9</td> <td>0.77</td> <td bgcolor="lightgray">109.75</td> <td>5.5</td> <td>0.3</td> <td bgcolor="lightgray">110.71</td> <td>5.49</td> <td>0.35</td> <td bgcolor="lightgray">114.69</td> <td>5.6</td> <td>0.29</td> <td 
bgcolor="lightgray">117.97</td> <td>5.65</td> <td>0.35</td> <td bgcolor="lightgray">125.48</td> <td>5.82</td> <td>0.29</td> <td bgcolor="lightgray">38.68</td> <td>0.74</td> <td>0.05</td> <td bgcolor="lightgray">16.25</td> <td>1.08</td> <td>0.07</td> <td bgcolor="lightgray">37.46</td> <td>0.74</td> <td>0.04</td> <td bgcolor="lightgray">0.07</td> <td>0.</td> <td>0.</td> <td bgcolor="lightgray">718.26</td> <td>39.52</td> <td>2.51</td> <td bgcolor="lightgray">42.03</td> <th>jfs</th> </tr> <tr> <th>xfs</th> <td bgcolor="lightgray">44.77</td> <td>13.3</td> <td>0.94</td> <td bgcolor="lightgray">105.36</td> <td>13.33</td> <td>0.53</td> <td bgcolor="lightgray">110.27</td> <td>14.36</td> <td>0.5</td> <td bgcolor="lightgray">110.17</td> <td>14.37</td> <td>0.51</td> <td bgcolor="lightgray">111.03</td> <td>14.43</td> <td>0.53</td> <td bgcolor="lightgray">118.84</td> <td>14.87</td> <td>0.55</td> <td bgcolor="lightgray">31.85</td> <td>6.44</td> <td>0.15</td> <td bgcolor="lightgray">15.2</td> <td>5.45</td> <td>0.14</td> <td bgcolor="lightgray">34.32</td> <td>5.87</td> <td>0.14</td> <td bgcolor="lightgray">0.03</td> <td>0.</td> <td>0.</td> <td bgcolor="lightgray">681.84</td> <td>102.42</td> <td>3.99</td> <td bgcolor="lightgray">106.41</td> <th>xfs</th> </tr> <tr> <th>reiser4</th> <td bgcolor="lightgray">33.51</td> <td>10.85</td> <td>0.69</td> <td bgcolor="lightgray">33.9</td> <td>10.65</td> <td>0.65</td> <td bgcolor="lightgray">32.9</td> <td>10.79</td> <td>0.67</td> <td bgcolor="lightgray">34.</td> <td>10.87</td> <td>0.65</td> <td bgcolor="lightgray">33.62</td> <td>10.87</td> <td>0.69</td> <td bgcolor="lightgray">31.31</td> <td>10.83</td> <td>0.76</td> <td bgcolor="lightgray">17.45</td> <td>4.07</td> <td>0.3</td> <td bgcolor="lightgray">11.54</td> <td>4.49</td> <td>0.3</td> <td bgcolor="lightgray">13.08</td> <td>4.27</td> <td>0.27</td> <td bgcolor="lightgray">0.52</td> <td>0.</td> <td>0.</td> <td bgcolor="lightgray">241.83</td> <td>77.69</td> <td>4.98</td> <td 
bgcolor="lightgray">82.67</td> <th>reiser4</th> </tr> <tr> <th>ext3</th> <td bgcolor="lightgray">38.79</td> <td>9.35</td> <td>0.7</td> <td bgcolor="lightgray">91.57</td> <td>7.21</td> <td>0.36</td> <td bgcolor="lightgray">62.6</td> <td>7.44</td> <td>0.36</td> <td bgcolor="lightgray">62.74</td> <td>7.5</td> <td>0.37</td> <td bgcolor="lightgray">60.62</td> <td>7.52</td> <td>0.34</td> <td bgcolor="lightgray">69.82</td> <td>7.59</td> <td>0.39</td> <td bgcolor="lightgray">26.21</td> <td>1.67</td> <td>0.05</td> <td bgcolor="lightgray">8.73</td> <td>1.66</td> <td>0.04</td> <td bgcolor="lightgray">13.79</td> <td>1.63</td> <td>0.06</td> <td bgcolor="lightgray">4.76</td> <td>0.01</td> <td>0.</td> <td bgcolor="lightgray">439.63</td> <td>51.58</td> <td>2.67</td> <td bgcolor="lightgray">54.25</td> <th>ext3</th> </tr> <tr> <th>ext2</th> <td bgcolor="lightgray">32.78</td> <td>7.61</td> <td>0.64</td> <td bgcolor="lightgray">37.28</td> <td>5.24</td> <td>0.34</td> <td bgcolor="lightgray">43.55</td> <td>5.34</td> <td>0.35</td> <td bgcolor="lightgray">45.41</td> <td>5.34</td> <td>0.37</td> <td bgcolor="lightgray">47.72</td> <td>5.48</td> <td>0.34</td> <td bgcolor="lightgray">50.5</td> <td>5.41</td> <td>0.32</td> <td bgcolor="lightgray">16.28</td> <td>0.67</td> <td>0.06</td> <td bgcolor="lightgray">7.54</td> <td>0.66</td> <td>0.05</td> <td bgcolor="lightgray">15.31</td> <td>0.71</td> <td>0.05</td> <td bgcolor="lightgray">0.24</td> <td>0.</td> <td>0.</td> <td bgcolor="lightgray">296.61</td> <td>36.46</td> <td>2.52</td> <td bgcolor="lightgray">38.98</td> <th>ext2</th> </tr> </table> <hr> </body> </html> <hr> <address><a href="mailto:reiser@namesys.com">Hans Reiser</a></address> <!-- Created: Sat Aug 23 00:28:46 MSD 2003 --> <!-- hhmts start --> Last modified: Thu Nov 20 17:51:10 MSK 2003 <!-- hhmts end --> </body> </html> [[category:Reiser4]] [[category:formatting-fixes-needed]] f12af5043a5703a05635d786fb1010f9e4a8eeb4 1518 1494 2009-06-27T19:16:42Z Chris goe 2 category added == Benchmarks Of Reiser4 == The <tt>htree</tt> (<tt>-O dir_index</tt>) feature is the recent attempt by ext3 developers to handle large directories as well as ReiserFS by using better than
linear search algorithms. One of the interesting results was that <tt>htree</tt> does bad things to ext3 performance, at least for this benchmark. This means that trying to get usable performance for large directories with ext3 can severely impact your performance for the non-large case. You'll note that in our latest benchmark at the top here we use larger filesets. It seems that ext3 does a poor job of utilizing its write cache for the case where the fileset uses a lot of memory without exceeding it, and by increasing the size of the fileset we get a fairer (read: better for ext3) benchmark for the create phase. The use of filesets small enough to barely fit into RAM for the create (but not the copy) phase was due to my being lax in supervising the benchmarking, but it did reveal something interesting. Probably Andrew Morton will fix that pretty quickly --- it's most likely not a deep fix to make, as fixing <tt>htree</tt> would be. If anyone knows where the tail combining patch for ext3 went, let us know so we can benchmark that. Good tail combining performance is not trivial to get right, and I am wondering if there is a performance reason it did not go in. Keep in mind that these benchmarks are still evolving and maturing, and I need to give the mongo code a complete review again, as it has been worked on by others quite a bit. Note that while I like the mongo benchmarks, those who are concerned they may be stacked in our favor can look at the benchmarks run by others on lkml, one of which is at the bottom of this page; while not as elaborate and detailed as mongo, it comes up with roughly the same result. Andrew Morton wrote some beautiful readahead code in the VM; many thanks to him for what it contributes to V4 performance. Unfortunately, it should be confessed that these benchmarks utterly fail to measure its cleverness for real-world usage patterns.
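Stepping back to the large-directory point above: the gap between ext3's classic linear directory search and an htree-style index comes down to per-lookup cost. Below is a minimal, purely illustrative Python sketch of the two strategies; the function names and the in-memory dict are stand-ins chosen here, not ext3's actual on-disk htree format.

```python
# Illustrative toy, NOT ext3's on-disk htree format: it contrasts the
# O(n) linear scan of a flat directory with the near-O(1) hashed lookup
# that a dir_index-style structure approximates.

def linear_lookup(entries, name):
    """Scan every entry until a match, as a flat ext2/ext3 directory does."""
    for position, entry in enumerate(entries):
        if entry == name:
            return position
    return -1

def build_hash_index(entries):
    """One-time pass building a name -> position index."""
    return {entry: position for position, entry in enumerate(entries)}

def indexed_lookup(index, name):
    """Average constant-time lookup via the prebuilt index."""
    return index.get(name, -1)

if __name__ == "__main__":
    names = ["file%06d" % i for i in range(100000)]
    index = build_hash_index(names)
    # Both strategies agree on results; only the per-lookup cost differs.
    assert linear_lookup(names, "file099999") == indexed_lookup(index, "file099999") == 99999
    assert linear_lookup(names, "no-such-file") == indexed_lookup(index, "no-such-file") == -1
```

The flip side, as the text notes, is that maintaining such an index is pure overhead for small directories.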
In fact, these benchmarks basically access everything once in each pass, which is not at all realistic in representing typical server workloads. So understand them as validly illuminating some aspects of performance, not all aspects, if you could be so generous. We ran data=ordered ext3 benchmarks at the suggestion of Andrew Morton, but they came out slower for this benchmark. We need to increase the base size range to 8k and run again. [[Reiser4]] is a fully atomic filesystem; keep in mind that these performance numbers are with every FS operation performed as a fully atomic transaction. We are the first to make that performance-effective. Look for a user space transactions interface to come out soon. Finally, remember that Reiser4 is more space efficient than [[ReiserFS]]; the <tt>df(1)</tt> measurements are there for looking at... ;-) * [[#mongo.2.6.15-mm4|linux-2.6.15-mm4]] mongo comparison, ext3 vs reiser4 with "unixfile" regular file plugin and reiser4 with "cryptcompress" regular file plugin * [[#mongo.2.6.11|linux-2.6.11]] mongo comparison against xfs and ext2 * [[#mongo.2.6.8.1-mm3|linux-2.6.8.1-mm3]] mongo comparison against ext3 * [[#slow.2004.03.26|slow.c]] comparison against ext2 and ext3 (2004-03-26) * [[#mongo.2003.11.20|mongo]] comparison against ext3 (2003-11-20) * [[#bonnie++.2003.09.30|Bonnie++]] comparison, ext3 vs reiser4 (2003-09-30) * [[#mongo.2003.09.25|mongo]] comparison against ext3 (2003-09-25) <!-- <li>2003.08.28 mongo <a href="#mongo.2003.08.28">comparison</a> against <tt>ext3</tt> </li> <li>2003.08.27 mongo <a href="#mongo.2003.08.27">comparison</a> against <tt>ext3</tt> </li> <li>2003.08.26 mongo <a href="#mongo.2003.08.26">comparison</a> against <tt>ext3</tt> </li> <li>2003.08.18 mongo <a href="#mongo.2003.08.18">comparison</a> against <tt>ext3</tt> </li> <li>2003.08.12 mongo <a href="#mongo.2003.08.12">comparison</a> against <tt>ext3</tt> </li> --> * [[#mongo.2003.08.28|older mongo results]] (2003-08-28) *
[[#mongo.2003.07.10|mongo]] comparison, reiserfs vs. reiser4 (2003-07-10, obtained before [http://mail.fsfeurope.org/pipermail/booth/2003-February/000083.html LinuxTAG 2003]) * external benchmarks [[#grant|by Grant Miner]] === mongo.2.6.15-mm4 === * linux-2.6.15-mm4, [[mongo]] results Comparative results of the mongo benchmark for ext3 vs reiser4 with "unixfile" regular file plugin vs reiser4 with [ftp://ftp.namesys.com/pub/tmp/cryptcompress_patches cryptcompress] regular file plugin. <dl> <dt>reiser4 </dt> <dd>2.6.15-mm4 cryptcompress-4.patch</dd> <dt>mem total</dt> <dd>516312</dd> <dt>machine </dt> <dd>Intel(R) Xeon(TM) CPU 2.40GHz, <b>running UP kernel</b></dd> <dt>kernel </dt> <dd>2.6.15-mm4 #1 Sat Feb 11 20:00:11 MSK 2006</dd> <dt>date </dt> <dd>Sat Feb 11 21:03:21 2006</dd> <dd>Sat Feb 11 21:18:43 2006</dd> <dd>Sat Feb 11 21:37:52 2006</dd> </dl> <p>Legend:</p> <ul> <li><tt>A</tt> reiser4 with "cryptcompress" regular file plugin</li> <li><tt>B</tt> reiser4 with "unixfile" regular file plugin</li> <li><tt>C</tt> ext3</li> </ul> <p> The table presents absolute values (of elapsed time, CPU usage, CPU utilization, disk usage) for reiser4 with the "cryptcompress" regular file plugin, and ratios against it for reiser4 with the "unixfile" regular file plugin and for ext3. A <font color=red>red</font> number means the ratio is larger than <tt>1.0</tt>, that is, reiser4 with the "cryptcompress" regular file plugin is better in this test. A <font color=green>green</font> number means that it loses in this test.
</p> <table cols=13 cellpadding=2 cellspacing=2 noborder> <tr><td bgcolor=black colspan=13><font color=white></td></tr> <tr> <th bgcolor=#303030 colspan=13 align=left><font color=white>A.MKFS=mkfs.reiser4 -y -o create=create_ccreg40,compressMode=col8 MOUNT_OPTIONS=noatime FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=13 align=left><font color=white>B.MKFS=mkfs.reiser4 -y MOUNT_OPTIONS=noatime FSTYPE=reiser4 (unixfile regular file plugin)</font></th> </tr> <tr> <th bgcolor=#303030 colspan=13 align=left><font color=white>C.MOUNT_OPTIONS=noatime,data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <td colspan=13 bgcolor=#606060><b><font color=white>#0:</font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=3><b>REAL_TIME</b></td> <td colspan=3><b>CPU_TIME</b></td> <td colspan=3><b>CPU_UTIL</b></td> <td colspan=3><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>CREATE</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 53.36</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.234 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 4.249 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>28.79</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.493</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>94.36</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.255 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.155</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 775856</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font 
color=red> 2.550 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.825 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>COPY</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 137.6</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.543 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.931 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>40.91</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.716</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.975 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>59.94</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.257 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.183</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1551756</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.550 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.825 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>READ</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 161.17</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.087 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.077 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>48.35</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.433 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.195</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>33.23</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.487 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.291</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1551756</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.550 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font 
color=red> 2.825 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>STATS</b></td> <td bgcolor=#E0E0C0 align=right><tt>24.12</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.936</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.927</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>6.76</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.941 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.624</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>27.97</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.005 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.676</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1551756</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.550 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.825 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>DELETE</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 155.26</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.091 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 0.989</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>38.76</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.824 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.108</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>26.33</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.758 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.104</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>4</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.000 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> </tt></td> </tr> <tr> <td 
colspan=13 bgcolor=#606060><b><font color=white>#1:DD_MBCOUNT=5000 </font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=3><b>REAL_TIME</b></td> <td colspan=3><b>CPU_TIME</b></td> <td colspan=3><b>CPU_UTIL</b></td> <td colspan=3><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_writing_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 116.02</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.430 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.553 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>38.65</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.514</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.619 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>92.86</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.155 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.149</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1909012</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.682 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.685 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_reading_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 153.76</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 0.996</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>58.11</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.192 </font></tt></td> <td bgcolor=#E0E0C0 
align=right><tt><font color=green> <U> 0.147</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>38.73</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.224 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.152</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1909012</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.682 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.685 </font></tt></td> </tt></td> </tr> <tr><td bgcolor=black colspan=13><font color=white></td></tr> <tr><td colspan=13 align=right> <tr> <td colspan=13 bgcolor=#303030><b><font color=white>DIR=/mnt1 GAMMA=0.2 WRITE_BUFFER=131072 PHASE_APPEND=off SYNC=off PHASE_DELETE=rm NPROC=1 DEV=/dev/hda9 DD_MBCOUNT=5000 FILE_SIZE=8192 REP_COUNTER=1 PHASE_COPY=cp INFO_R4=2.6.15-mm4 cryptcompress-4.patch PHASE_READ=find BYTES=1024000000 PHASE_OVERWRITE=off PHASE_MODIFY=off </td></tr> <tr><td colspan=13 align=right> <font size=-2>Produced by <a href=http://namesys.com/benchmarks/mongo_readme.html>Mongo</a> benchmark suite.</font></td></tr> </table> <!-- <p><b>Legend:</b> <font color="green">green</font> color means the result is better (less) than reference value from the first column, results marked as <font color="red">red</font> are worse than reference value, best results are <u>underlined</u> other results which fit into 2% margin of the best result are underlined also.</p> --><p><a href="http://www.namesys.com/intbenchmarks/mongo/06.02.11.belka.crc/charts/comp.html">The same results in the charts</a></p> <hr> <a name="mongo.2.6.11"></a> linux-2.6.11 <a href="benchmarks/mongo_readme.html">mongo</a> results <dl> <dt>reiser4 </dt> <dd>reiser4-for-2.6.11-5.patch from <a href="ftp://ftp.namesys.com/pub/reiser4-for-2.6/2.6.11">ftp://ftp.namesys.com/pub/reiser4-for-2.6/2.6.11</a> </dd> <dt>mem total</dt> <dd>254496</dd> <dt>machine </dt> <dd>bones</dd> <dt>kernel </dt> <dd>2.6.11-reiser4-5 
#2 SMP Sat Jun 4 20:06:47 MSD 2005</dd> <dt>date </dt> <dd>Fri Jun 17 23:52:17 2005</dd> </dl> <p> In this test 81% of files are chosen from the 0-10k size range and 19% from the 10-100k size range. </p> <!-- File stats: Units are decimal (1k = 1000) files 0-100 : 1433 files 100-1K : 12597 files 1K-10K : 103101 files 10K-100K : 28131 files 100K-1M : 0 files 1M-10M : 0 files 10M-larger : 0 total bytes written : 1886585039 --> <p>Legend:</p> <ul> <li><tt>A</tt> reiser4</li> <li><tt>B</tt> reiserfs <tt>v3 (notail)</tt></li> <li><tt>C</tt> ext2</li> <li><tt>D</tt> xfs default</li> </ul> <p> Table presents absolute values (of elapsed time, CPU usage, CPU utilization, disk usage) for reiser4, and ratios against reiser4 for all other configurations. <font color=red>Red</font> number means ratio is larger than <tt>1.0</tt>, that is, reiser4 is better in this test. <font color=green>Green</font> number means that reiser4 loses in this test. </p> <table cols=17 cellpadding=2 cellspacing=2 noborder> <tr><td bgcolor=black colspan=17><font color=white></td></tr> <tr> <th bgcolor=#303030 colspan=17 align=left><font color=white>A.FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=17 align=left><font color=white>B.FSTYPE=reiserfs MOUNT_OPTIONS=notail </font></th> </tr> <tr> <th bgcolor=#303030 colspan=17 align=left><font color=white>C.FSTYPE=ext2 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=17 align=left><font color=white>D.MKFS=mkfs.xfs -f FSTYPE=xfs </font></th> </tr> <tr> <td colspan=17 bgcolor=#606060><b><font color=white>#0:</font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=4><b>REAL_TIME</b></td> <td colspan=4><b>CPU_TIME</b></td> <td colspan=4><b>CPU_UTIL</b></td> <td colspan=4><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td> <td><b>A</b></td><td><b>B/A 
</b></td><td><b>C/A </b></td><td><b>D/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>CREATE</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 66.12</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.022 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.686 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 4.288 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>34.98</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.901</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.114 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.445 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>29.86</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.424 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.398</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.398</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1623204</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.086 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.098 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>COPY</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 187.77</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.438 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.751 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.733 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>44.8</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.883</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.124 </font></tt></td> <td bgcolor=#E0E0C0 
align=right><tt><font color=red> 1.161 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>14.85</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.606 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.611 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.353</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 3245428</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.087 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.098 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>READ</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 151.01</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.459 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.113 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.978 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>44.34</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.607 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.470</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.535 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>18.54</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.444</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.500 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.724 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 3245428</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.087 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.098 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>STATS</b></td> 
<td bgcolor=#E0E0C0 align=right><tt>22.04</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.314 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.812</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.871 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>8.61</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.698 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.571</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 4.591 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>20.11</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.528</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.709 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.579 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 3245428</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.087 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.098 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>DELETE</b></td> <td bgcolor=#E0E0C0 align=right><tt>108.77</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.313</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.193 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.071 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>41</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.637 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.091</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.795 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>21.45</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.795 
</font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.077</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.556 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>4</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 14877.000 </font></tt></td> </tt></td> </tr> <tr> <td colspan=17 bgcolor=#606060><b><font color=white>#1:DD_MBCOUNT=5000 </font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=4><b>REAL_TIME</b></td> <td colspan=4><b>CPU_TIME</b></td> <td colspan=4><b>CPU_UTIL</b></td> <td colspan=4><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_writing_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 536.06</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.005 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.017 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 0.982</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>122.28</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.826 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.819</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.806</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>14.99</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.771 </font></tt></td> <td 
bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.711</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.742 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 5120008</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.012</U> </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_reading_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt>145.32</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.031 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.965</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 0.982</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>157.51</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.947 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.890</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.880</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>57.01</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.901</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.909 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.884</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 5120008</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.012</U> </font></tt></td> </tt></td> </tr> <tr><td bgcolor=black colspan=17><font color=white></td></tr> <tr><td colspan=17 align=right> <tr> <td colspan=17 bgcolor=#303030><b><font 
color=white>INFO_R4=2.6.11 + reiser4-5 REP_COUNTER=1 DEV=/dev/hda5 DD_MBCOUNT=5000 PHASE_OVERWRITE=off FILE_SIZE=8192 NPROC=3 PHASE_READ=find PHASE_DELETE=rm PHASE_APPEND=off WRITE_BUFFER=131072 DIR=/mnt1 PHASE_MODIFY=off BYTES=1024000000 PHASE_COPY=cp GAMMA=0.2 SYNC=off </td></tr> <tr><td colspan=17 align=right> <font size=-2>Produced by <a href=http://namesys.com/benchmarks/mongo_readme.html>Mongo</a> benchmark suite.</font></td></tr> </table> <hr> <a name="mongo.2.6.8.1-mm3"></a> linux-2.6.8.1-mm3 <a href="benchmarks/mongo_readme.html">mongo</a> results <dl> <dt>reiser4 </dt> <dd>large key</dd> <dt>mem total</dt> <dd>254324</dd> <dt>machine </dt> <dd>bones</dd> <dt>kernel </dt> <dd>2.6.8.1-mm3 #3 SMP Mon Aug 23 19:33:13 MSD 2004</dd> <dt>date </dt> <dd>Tue Aug 31 15:47:51 2004</dd> </dl> <p> In this test 81% of files are chosen from the 0-10k size range and 19% from the 10-100k size range. </p> <!-- File stats: Units are decimal (1k = 1000) files 0-100 : 1433 files 100-1K : 12597 files 1K-10K : 103101 files 10K-100K : 28131 files 100K-1M : 0 files 1M-10M : 0 files 10M-larger : 0 total bytes written : 1886585039 --> <p>Legend:</p> <ul> <li><tt>A</tt> reiser4</li> <li><tt>B</tt> reiser4, extents only</li> <li><tt>C</tt> reiserfs <tt>v3 (notail)</tt></li> <li><tt>D</tt> ext3 in <tt>data=writeback</tt> mode (meta-data only journalling)</li> <li><tt>E</tt> ext3 in <tt>data=journal</tt> mode</li> <li><tt>F</tt> ext3 in <tt>data=ordered</tt> mode</li> </ul> <img src="http://www.namesys.com/intbenchmarks/mongo/04.08.26/256MB.RAM/one-thread-8k.g02.charts/CREATE.0.png"> <img src="http://www.namesys.com/intbenchmarks/mongo/04.08.26/256MB.RAM/one-thread-8k.g02.charts/COPY.0.png"> <img src="http://www.namesys.com/intbenchmarks/mongo/04.08.26/256MB.RAM/one-thread-8k.g02.charts/READ.0.png"> <img src="http://www.namesys.com/intbenchmarks/mongo/04.08.26/256MB.RAM/one-thread-8k.g02.charts/STATS.0.png"> <img 
src="http://www.namesys.com/intbenchmarks/mongo/04.08.26/256MB.RAM/one-thread-8k.g02.charts/DELETE.0.png"> <img src="http://www.namesys.com/intbenchmarks/mongo/04.08.26/256MB.RAM/one-thread-8k.g02.charts/dd_writing_largefile.1.png"> <img src="http://www.namesys.com/intbenchmarks/mongo/04.08.26/256MB.RAM/one-thread-8k.g02.charts/dd_reading_largefile.1.png"> <p> Table presents absolute values (of elapsed time, CPU usage, CPU utilization, disk usage) for reiser4, and ratios against reiser4 for all other configurations. <font color=red>Red</font> number means ratio is larger than <tt>1.0</tt>, that is, reiser4 is better in this test. <font color=green>Green</font> number means that reiser4 loses in this test. </p> <table cols=25 cellpadding=2 cellspacing=2 noborder> <tr><td bgcolor=black colspan=25><font color=white></td></tr> <tr> <th bgcolor=#303030 colspan=25 align=left><font color=white>A.FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=25 align=left><font color=white>B.FSTYPE=reiser4 MKFS=mkfs.reiser4 -q -o extent=extent40 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=25 align=left><font color=white>C.MOUNT_OPTIONS=notail FSTYPE=reiserfs </font></th> </tr> <tr> <th bgcolor=#303030 colspan=25 align=left><font color=white>D.MOUNT_OPTIONS="data=writeback" FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=25 align=left><font color=white>E.MOUNT_OPTIONS="data=journal" FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=25 align=left><font color=white>F.MOUNT_OPTIONS="data=ordered" FSTYPE=ext3 </font></th> </tr> <tr> <td colspan=25 bgcolor=#606060><b><font color=white>#0:</font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=6><b>REAL_TIME</b></td> <td colspan=6><b>CPU_TIME</b></td> <td colspan=6><b>CPU_UTIL</b></td> <td colspan=6><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A 
</b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>CREATE</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 91.6</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 0.988</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.983 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.592 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.010 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.256 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>31.13</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.965 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.826</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.577 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.529 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.802 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>22.63</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.981 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.350</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.791 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.738 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.000 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1978440</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 
align=right><tt><font color=red> 1.088 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>COPY</b></td> <td bgcolor=#E0E0C0 align=right><tt>219.5</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.968</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.674 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.241 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.105 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.819 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>54.04</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.938 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.792</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.694 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.004 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.860 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>16.01</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.996 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.460</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.663 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.839 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.890 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 3956708</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.088 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font 
color=red> 1.108 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>READ</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 187.34</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.007</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.617 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.282 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.295 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.250 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>38.61</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.002 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.711 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.615</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.622</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.615</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>13.05</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.995 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.441</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.520 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.517 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.533 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 3956708</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.088 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> <td bgcolor=#E0E0C0 
align=right><tt><font color=red> 1.108 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>STATS</b></td> <td bgcolor=#E0E0C0 align=right><tt>23.71</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.968 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.162 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.943</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.943</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.943</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>10.91</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.944 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.717 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.661</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.674 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.658</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>24.46</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.971 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.587</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.700 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.707 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.697 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 3956708</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.088 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> <td 
bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>DELETE</b></td> <td bgcolor=#E0E0C0 align=right><tt>156.84</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.993 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.233</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.264 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.270 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.216 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>53.05</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.938 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.440 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.209</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.215 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.214 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>18.23</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.947 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.758 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.157</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.160 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.167 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>4</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.000 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> 
</tt></td> </tr> <tr> <td colspan=25 bgcolor=#606060><b><font color=white>#1:DD_MBCOUNT=768 </font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=6><b>REAL_TIME</b></td> <td colspan=6><b>CPU_TIME</b></td> <td colspan=6><b>CPU_UTIL</b></td> <td colspan=6><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_writing_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 30.09</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.006</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.286 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.342 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.473 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.311 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>5.24</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.996 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.966</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.286 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.393 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.437 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>11.43</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.994 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 
0.631</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.796 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.655 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.967 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 786436</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_reading_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt>28.38</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.969</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.010 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 0.980</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 0.982</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.999 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>4.37</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.979 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.014 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.911</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.895</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.936 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>8.88</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.030 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.922 
</font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.858</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.854</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.867</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 786436</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr><td bgcolor=black colspan=25><font color=white></td></tr> <tr><td colspan=25 align=right> <tr> <td colspan=25 bgcolor=#303030><b><font color=white>REP_COUNTER=1 PHASE_COPY=cp INFO_R4=2.6.8.1-mm3 + parse_options.patch FILE_SIZE=8192 DEV=/dev/hda6 PHASE_MODIFY=off DD_MBCOUNT=768 PHASE_APPEND=off PHASE_OVERWRITE=off SYNC=off DIR=/mnt1 PHASE_DELETE=rm NPROC=1 BYTES=1024000000 GAMMA=0.2 PHASE_READ=find WRITE_BUFFER=131072 </td></tr> <tr><td colspan=25 align=right> <font size=-2>Produced by <a href=http://namesys.com/>Mongo</a> benchmark suite.</font></td></tr> </table> <hr> <a name="slow.2004.03.26">2004.03.26 slow.c benchmark results</a> <p> These are the <a href="http://www.jburgess.uklinux.net/slow.c">slow.c</a> benchmark results for the latest 2004.03.26 reiser4 snapshot. </p> <p> <b>slow.c</b> is a simple program by Jon Burgess which writes and reads multiple data streams. For the details and the source code, look at <a href="http://marc.theaimsgroup.com/?l=linux-kernel&m=107652683608384&w=2">the discussion</a> on the linux-kernel mailing list.
</p>
<p> kernel : 2.6.5-rc2</p>
<p> RAM : 256Mb</p>
<p> reiser4 : <a href="http://www.namesys.com/snapshots/2004.03.26/">2004.03.26 snapshot</a></p>
<p>Hardware specs:</p>
<pre>
processor  : 1
vendor_id  : AuthenticAMD
cpu family : 6
model      : 6
model name : AMD Athlon(tm) Processor
stepping   : 2
cpu MHz    : 1460.098
cache size : 256 KB
bogomips   : 2916.35

Dual CPU AMD Athlon(tm) 1.4Ghz
</pre>
<pre>
# hdparm

/dev/hda6:
 multcount    = 16 (on)
 IO_support   = 1 (32-bit)
 unmaskirq    = 1 (on)
 using_dma    = 1 (on)
 keepsettings = 0 (off)
 readonly     = 0 (off)
 readahead    = 256 (on)
 geometry     = 65535/16/63, sectors = 35937342, start = 84164598
</pre>
<pre>
# hdparm -t /dev/hda6

/dev/hda6:
 Timing buffered disk reads:  84 MB in 3.07 seconds = 27.39 MB/sec
</pre>
<pre>
# hdparm -i /dev/hda

/dev/hda:

 Model=IC35L060AVER07-0, FwRev=ER6OA44A, SerialNo=SZPTZMB6154
 Config={ HardSect NotMFM HdSw>15uSec Fixed DTR>10Mbs }
 RawCHS=16383/16/63, TrkSize=0, SectSize=0, ECCbytes=40
 BuffType=DualPortCache, BuffSize=1916kB, MaxMultSect=16, MultSect=16
 CurCHS=16383/16/63, CurSects=16514064, LBA=yes, LBAsects=120103200
 IORDY=on/off, tPIO={min:240,w/IORDY:120}, tDMA={min:120,rec:120}
 PIO modes: pio0 pio1 pio2 pio3 pio4
 DMA modes: mdma0 mdma1 mdma2
 UDMA modes: udma0 udma1 udma2
 AdvancedPM=yes: disabled (255) WriteCache=enabled
 Drive conforms to: ATA/ATAPI-5 T13 1321D revision 1

 * signifies the current active mode
</pre>
<pre>
<!--
(500Mb of data)
test : ./slow foo 500
Results :
==============================================================
                |      1 stream        |      2 streams
 ---------------+----------------------------------------------
                |  WRITE      READ     |  WRITE      READ
 ---------------+----------------------------------------------
ext2             25.08Mb/s  27.08Mb/s   13.72Mb/s  14.04Mb/s
reiser4          26.31Mb/s  26.99Mb/s   24.03Mb/s  26.84Mb/s
reiser4-extents  25.28Mb/s  27.40Mb/s   24.12Mb/s  26.85Mb/s
ext3-ordered     20.99Mb/s  26.40Mb/s   12.01Mb/s  13.34Mb/s
ext3-journal     10.13Mb/s  24.48Mb/s    8.87Mb/s  13.26Mb/s
reiserfs         20.42Mb/s  27.67Mb/s   12.98Mb/s  13.13Mb/s
reiserfs-notail  20.07Mb/s  27.58Mb/s   13.04Mb/s  13.25Mb/s
==============================================================
-->

(1000Mb of data)
test : ./slow foo 1000
Results :
<!--
==============================================================================================================
                |      1 stream        |      2 streams       |      4 streams       |      8 streams
 ---------------+----------------------------------------------------------------------------------------------
                |  WRITE      READ     |  WRITE      READ     |  WRITE      READ     |  WRITE      READ
 ---------------+----------------------------------------------------------------------------------------------
ext2             24.66Mb/s  27.56Mb/s   13.40Mb/s  13.67Mb/s    7.73Mb/s   6.94Mb/s    6.69Mb/s   3.52Mb/s
reiser4          25.42Mb/s  27.71Mb/s   23.96Mb/s  26.34Mb/s   24.55Mb/s  26.58Mb/s   24.90Mb/s  26.76Mb/s
reiser4-extents  25.60Mb/s  27.68Mb/s   24.19Mb/s  25.92Mb/s   25.24Mb/s  27.12Mb/s   25.39Mb/s  26.72Mb/s
ext3-ordered     20.05Mb/s  26.46Mb/s   11.06Mb/s  13.12Mb/s    9.63Mb/s   6.76Mb/s   10.02Mb/s   3.48Mb/s
ext3-journal     10.10Mb/s  26.81Mb/s    8.87Mb/s  13.08Mb/s    8.59Mb/s   6.84Mb/s    8.14Mb/s   3.47Mb/s
reiserfs         20.19Mb/s  27.48Mb/s   12.69Mb/s  13.03Mb/s    8.27Mb/s   6.84Mb/s    7.87Mb/s   4.13Mb/s
reiserfs-notail  20.31Mb/s  27.10Mb/s   12.74Mb/s  13.09Mb/s    8.33Mb/s   6.89Mb/s    7.87Mb/s   4.17Mb/s
==============================================================================================================
-->
</pre>
<table>
<tr>
<td><img src="intbenchmarks/slow/04.03.25-int.snapshot.bones/wr.1.png"></td>
<td><img src="intbenchmarks/slow/04.03.25-int.snapshot.bones/wr.2.png"></td>
<td><img src="intbenchmarks/slow/04.03.25-int.snapshot.bones/wr.4.png"></td>
<td><img src="intbenchmarks/slow/04.03.25-int.snapshot.bones/wr.8.png"></td>
</tr>
<tr>
<td><img src="intbenchmarks/slow/04.03.25-int.snapshot.bones/rd.1.png"></td>
<td><img src="intbenchmarks/slow/04.03.25-int.snapshot.bones/rd.2.png"></td>
<td><img src="intbenchmarks/slow/04.03.25-int.snapshot.bones/rd.4.png"></td>
<td><img src="intbenchmarks/slow/04.03.25-int.snapshot.bones/rd.8.png"></td>
</tr>
</table> <hr> <a name="mongo.2003.11.20"></a>2003.11.20 <a href="benchmarks/mongo_readme.html">mongo</a> results
<dl>
<dt>reiser4 </dt> <dd>''</dd>
<dt>mem total</dt> <dd>255716</dd>
<dt>machine </dt> <dd>belka</dd>
<dt>kernel </dt> <dd>2.6.0-test9 #2 SMP Thu Nov 20 16:08:42 MSK 2003</dd>
<dt>date </dt> <dd>Thu Nov 20 16:16:50 2003</dd>
</dl>
<p> In this test 80% of files are chosen from the 0-8k size range, 16% from the 0-80k size range, 0.8 x 4% from the 0-800k size range, etc. Most files are small; most bytes are in large files. </p>
<p>Legend:</p>
<ul>
<li><tt>A</tt> reiser4</li>
<li><tt>B</tt> reiser4, extents only</li>
<li><tt>C</tt> reiserfs <tt>v3</tt></li>
<li><tt>D</tt> ext3 in <tt>data=writeback</tt> mode (meta-data only journalling)</li>
<li><tt>E</tt> ext3 in <tt>data=journal</tt> mode</li>
<li><tt>F</tt> ext3 in <tt>data=ordered</tt> mode</li>
<li><tt>G</tt> ext3 with htree (hashed directories)</li>
</ul>
<p> The table presents absolute values (of elapsed time, CPU usage, and disk usage) for reiser4, and ratios against reiser4 for all other configurations. A <font color=red>red</font> number means the ratio is larger than <tt>1.0</tt>, that is, reiser4 is better in this test. A <font color=green>green</font> number means that reiser4 loses in this test.
</p> <table cols=22 cellpadding=2 cellspacing=2 noborder> <tr><td bgcolor=black colspan=22><font color=white></td></tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>A.INFO_R4='' FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>B.INFO_R4='' MKFS=mkfs.reiser4 -q -o policy=extents FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>C.FSTYPE=reiserfs </font></th> </tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>D.MOUNT_OPTIONS=data=writeback FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>E.MOUNT_OPTIONS=data=journal FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>F.MOUNT_OPTIONS=data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>G.MKFS=mkfs.ext3 -O dir_index MOUNT_OPTIONS=data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <td colspan=22 bgcolor=#606060><b><font color=white>#0:</font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=7><b>REAL_TIME</b></td> <td colspan=7><b>CPU_TIME</b></td> <td colspan=7><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>CREATE</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 21.81</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.171 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.983 </font></tt></td> <td bgcolor=#E0E0C0 
align=right><tt><font color=red> 3.253 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.702 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.161 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.212 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>6.38</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.130 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.020 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.461 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.461 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.354 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.851</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 607612</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.091 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.035 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>COPY</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 64.37</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.089 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.046 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.980 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.834 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.929 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 6.246 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>11.55</tt></td> 
<td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.047 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.797 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.590 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.725 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.542 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.698</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1214992</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.091 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.034 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>READ</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 45.38</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.026 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.406 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.248 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.307 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.232 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 7.192 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>10.13</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.934 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.517 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.454 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.453</U> </font></tt></td> <td bgcolor=#E0E0C0 
align=right><tt><font color=green> <U> 0.444</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.504 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1214992</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.091 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.034 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>STATS</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 5.74</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.030 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.413 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.014</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.033 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.021 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.634 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>2.34</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.000 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.936 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.761 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.791 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.774 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.744</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1214992</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.091 </font></tt></td> <td bgcolor=#E0E0C0 
align=right><tt><font color=red> 1.034 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>DELETE</b></td> <td bgcolor=#E0E0C0 align=right><tt>46.94</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.424</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.520 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.017 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.043 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.956 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.315 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>14.19</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.743 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.443 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.200</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.206 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.201</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.234 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>4</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.000 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td 
bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> </tt></td> </tr> <tr> <td colspan=22 bgcolor=#606060><b><font color=white>#1:DD_MBCOUNT=768 </font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=7><b>REAL_TIME</b></td> <td colspan=7><b>CPU_TIME</b></td> <td colspan=7><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_writing_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 29.33</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.026 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.184 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.102 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.499 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.097 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.098 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>2.61</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.008 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.659</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.437 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.054 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.556 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.571 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 786436</U></tt></td> 
<td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_reading_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 22.96</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.056 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.003</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.004</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.003</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.006</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>2.26</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.991 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.912 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.796 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.765</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.779</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.783 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 786436</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 
align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr><td bgcolor=black colspan=22><font color=white></td></tr> <tr><td colspan=22 align=right> <tr> <td colspan=22 bgcolor=#303030><b><font color=white>NPROC=1 DIR=/mnt/testfs SYNC=off PHASE_COPY=cp REP_COUNTER=1 GAMMA=0.2 PHASE_OVERWRITE=off FILE_SIZE=8192 BYTES=512000000 PHASE_APPEND=off PHASE_READ=find DEV=/dev/hdb3 DD_MBCOUNT=768 WRITE_BUFFER=131072 PHASE_DELETE=rm PHASE_MODIFY=off </td></tr> <tr><td colspan=22 align=right> <font size=-2>Produced by <a href=http://namesys.com/benchmarks/mongo_readme.html>Mongo</a> benchmark suite.</font></td></tr> </table> <hr> <a name="mongo.2003.09.25"></a>2003.09.25 <a href="benchmarks/mongo_readme.html">mongo</a> results
<dl>
<dt>reiser4 </dt> <dd>''</dd>
<dt>mem total</dt> <dd>255048</dd>
<dt>machine </dt> <dd>belka</dd>
<dt>kernel </dt> <dd>2.6.0-test5 #33 SMP Thu Sep 25 15:45:38 MSD 2003</dd>
<dt>date </dt> <dd>Thu Sep 25 15:57:38 2003</dd>
</dl>
<p> In this test 80% of files are chosen from the 0-8k size range, 16% from the 0-80k size range, 0.8 x 4% from the 0-800k size range, etc. Most files are small; most bytes are in large files. </p>
<p>Legend:</p>
<ul>
<li><tt>A</tt> reiser4</li>
<li><tt>B</tt> reiser4, extents only</li>
<li><tt>C</tt> reiserfs <tt>v3</tt></li>
<li><tt>D</tt> ext3 in <tt>data=writeback</tt> mode (meta-data only journalling)</li>
<li><tt>E</tt> ext3 in <tt>data=journal</tt> mode</li>
<li><tt>F</tt> ext3 in <tt>data=ordered</tt> mode</li>
<li><tt>G</tt> ext3 with htree (hashed directories)</li>
</ul>
<p> The table presents absolute values (of elapsed time, CPU usage, and disk usage) for reiser4, and ratios against reiser4 for all other configurations.
A <font color=red>red</font> number means the ratio is larger than <tt>1.0</tt>, that is, reiser4 is better in this test. A <font color=green>green</font> number means that reiser4 loses in this test. </p>
<table cols=22 cellpadding=2 cellspacing=2 noborder>
<tr><td bgcolor=black colspan=22><font color=white></td></tr>
<tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>A.INFO_R4='' FSTYPE=reiser4 </font></th> </tr>
<tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>B.INFO_R4='' MKFS=mkfs.reiser4 -q -o policy=extents FSTYPE=reiser4 </font></th> </tr>
<tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>C.FSTYPE=reiserfs </font></th> </tr>
<tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>D.MOUNT_OPTIONS=data=writeback FSTYPE=ext3 </font></th> </tr>
<tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>E.MOUNT_OPTIONS=data=journal FSTYPE=ext3 </font></th> </tr>
<tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>F.MOUNT_OPTIONS=data=ordered FSTYPE=ext3 </font></th> </tr>
<tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>G.MKFS=mkfs.ext3 -O dir_index MOUNT_OPTIONS=data=ordered FSTYPE=ext3 </font></th> </tr>
<tr> <td colspan=22 bgcolor=#606060><b><font color=white>#0:</font></b></td></tr>
<tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=7><b>REAL_TIME</b></td> <td colspan=7><b>CPU_TIME</b></td> <td colspan=7><b>DF</b></td> </tr>
<tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> </tr>
<tr> <td bgcolor=#C0C0C0><b>CREATE</b></td> <td bgcolor=#E0E0C0 align=right><tt><U>
23.57</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.158 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.714 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.263 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.234 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.020 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.376 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>6.66</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.075 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.947 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.240 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.357 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.264 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.835</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 608548</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.090 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.034 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.105 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.105 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.105 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>COPY</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 64.98</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.083 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.050 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.023 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.810 </font></tt></td> <td bgcolor=#E0E0C0 
align=right><tt><font color=red> 1.908 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 6.850 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>12.18</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.057 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.776 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.507 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.603 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.518 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.743</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1216784</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.090 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.033 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.105 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.105 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.105 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>READ</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 44.65</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.028 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.733 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.237 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.114 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.179 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 7.694 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>10.28</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.933 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.590</U> 
</font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.608 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.593</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.608 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.620 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1216784</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.090 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.033 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.105 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.105 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.105 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>STATS</b></td> <td bgcolor=#E0E0C0 align=right><tt>5.88</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.998 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.139 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.981 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.020 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.929</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.655 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>2.29</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.987 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.900 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.747</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.782 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.747</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 
0.755</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1216784</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.090 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.033 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.105 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.105 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.105 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>DELETE</b></td> <td bgcolor=#E0E0C0 align=right><tt>46.65</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.438</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.504 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.109 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.023 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.022 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.376 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>14.19</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.746 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.431 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.206</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.211 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.211 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.232 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>4</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.000 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> 
</font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> </tt></td> </tr> <tr> <td colspan=22 bgcolor=#606060><b><font color=white>#1:DD_MBCOUNT=768 </font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=7><b>REAL_TIME</b></td> <td colspan=7><b>CPU_TIME</b></td> <td colspan=7><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_writing_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 30.78</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.017</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.177 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.063 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.394 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.066 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.056 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>3.11</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.981 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.553</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.701 </font></tt></td> <td bgcolor=#E0E0C0 
align=right><tt><font color=red> 1.296 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.318 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 786436</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_reading_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 22.96</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.045 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.005</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.005</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.004</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.006</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>2.41</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.996 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.867 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.739 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.718</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.739 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.722</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 786436</U></tt></td> 
<td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr><td bgcolor=black colspan=22><font color=white></td></tr> <tr> <td colspan=22 bgcolor=#303030><b><font color=white>NPROC=1 DIR=/mnt/testfs SYNC=off PHASE_COPY=cp REP_COUNTER=1 GAMMA=0.2 PHASE_OVERWRITE=off FILE_SIZE=8192 BYTES=512000000 PHASE_APPEND=off PHASE_READ=find DEV=/dev/hdb3 DD_MBCOUNT=768 WRITE_BUFFER=131072 PHASE_DELETE=rm PHASE_MODIFY=off </font></b></td></tr> <tr><td colspan=22 align=right> <font size=-2>Produced by <a href=http://namesys.com/benchmarks/mongo_readme.html>Mongo</a> benchmark suite.</font></td></tr> </table> <hr> <a name="mongo.2003.08.28"></a>2003.08.28 <a href="benchmarks/mongo_readme.html">mongo</a> results <dl> <dt>reiser4 </dt> <dd>''</dd> <dt>mem total</dt> <dd>256276</dd> <dt>machine </dt> <dd>belka</dd> <dt>kernel </dt> <dd>2.6.0-test4 #194 SMP Thu Aug 28 17:18:47 MSD 2003</dd> <dt>date </dt> <dd>Thu Aug 28 17:20:18 2003</dd> </dl> <p> In this test 80% of files are chosen from the 0-8k size range, 16% from the 0-80k size range, 0.8 x 4% (= 3.2%) from the 0-800k size range, etc. Most files are small, but most bytes are in large files.
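The size distribution described above corresponds to the <tt>GAMMA=0.2</tt> parameter visible in the run configuration: with probability 0.8 &times; 0.2<sup>k</sup> a file size is drawn uniformly from the k-th decade. A minimal Python sketch of such a sampler (illustrative only; <tt>sample_file_size_kb</tt> and the tier cap are assumptions, not the Mongo suite's actual generator):

```python
import random

def sample_file_size_kb(gamma=0.2, base_kb=8, tiers=5, rng=random):
    """Sample a file size (in KB) from the distribution described above:
    with probability (1 - gamma) * gamma**k the size is drawn uniformly
    from 0..base_kb * 10**k, so ~80% of files fall in 0-8k, ~16% in
    0-80k, ~3.2% in 0-800k, and so on."""
    k = 0
    # Escalate to the next decade with probability gamma at each step.
    while k < tiers - 1 and rng.random() < gamma:
        k += 1
    return rng.randint(0, base_kb * 10 ** k)
```

Most samples land in the smallest range, yet the rare large tiers dominate the total byte count, matching "most files are small, but most bytes are in large files".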
</p> <p>Legend:</p> <ul> <li><tt>A</tt> reiser4</li> <li><tt>B</tt> reiser4, extents only</li> <li><tt>C</tt> reiserfs</li> <li><tt>D</tt> ext3 in <tt>data=writeback</tt> mode (meta-data-only journalling)</li> <li><tt>E</tt> ext3 in <tt>data=journal</tt> mode</li> <li><tt>F</tt> ext3 in <tt>data=ordered</tt> mode</li> <li><tt>G</tt> ext3 with htree (hashed directories)</li> </ul> <p> The table presents absolute values (elapsed time, CPU usage, and disk usage) for reiser4, and ratios against reiser4 for every other configuration. A <font color=red>red</font> number means the ratio is larger than <tt>1.0</tt>, i.e. reiser4 wins that test; a <font color=green>green</font> number means reiser4 loses. </p> <table cols=22 cellpadding=2 cellspacing=2 noborder> <tr><td bgcolor=black colspan=22><font color=white></td></tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>A.INFO_R4='' FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>B.INFO_R4='' MKFS=mkfs.reiser4 -q -o policy=extents FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>C.FSTYPE=reiserfs </font></th> </tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>D.MOUNT_OPTIONS=data=writeback FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>E.MOUNT_OPTIONS=data=journal FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>F.MOUNT_OPTIONS=data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>G.MKFS=mkfs.ext3 -O dir_index MOUNT_OPTIONS=data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <td colspan=22 bgcolor=#606060><b><font color=white>#0:</font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=7><b>REAL_TIME</b></td> <td colspan=7><b>CPU_TIME</b></td> <td colspan=7><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> 
<td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>CREATE</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 21.94</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.056 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.957 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.049 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.430 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.399 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.558 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>6.7</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.104 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.913 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.213 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.334 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.345 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.821</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 608452</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.091 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.034 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.105 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.105 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.105 </font></tt></td> <td bgcolor=#E0E0C0 
align=right><tt><font color=red> 1.106 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>COPY</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 64.05</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.078 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.112 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.964 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.703 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.022 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 7.356 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>11.37</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.039 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.819 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.538 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.692 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.568 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.708</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1216572</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.091 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.033 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>READ</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 52.53</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.072 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.882 </font></tt></td> <td 
bgcolor=#E0E0C0 align=right><tt><font color=red> 1.056 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.126 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.124 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 7.158 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>9.8</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.914 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.538 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.489 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.467 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.456</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.551 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1216572</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.091 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.033 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>STATS</b></td> <td bgcolor=#E0E0C0 align=right><tt>5.82</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.973</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.251 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.040 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.009 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.048 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.641 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 
align=right><tt>2.29</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.991 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.926 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.755 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.742</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.751 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.734</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1216572</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.091 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.033 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>DELETE</b></td> <td bgcolor=#E0E0C0 align=right><tt>46.96</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.409</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.491 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.949 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.988 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.987 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.382 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>13.89</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.734 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.453 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.210 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font 
color=green> <U> 0.204</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.202</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.238 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>4</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.000 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> </tt></td> </tr> <tr> <td colspan=22 bgcolor=#606060><b><font color=white>#1:DD_MBCOUNT=768 </font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=7><b>REAL_TIME</b></td> <td colspan=7><b>CPU_TIME</b></td> <td colspan=7><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_writing_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 26.1</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.006</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.205 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.066 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.353 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 
1.068 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.070 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>3.18</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.028 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.547</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.173 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.708 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.327 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.296 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 786436</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_reading_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 18.99</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.009</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.072 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.009</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.007</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.006</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.008</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>2.12</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.000 
</font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.925 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.877 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.844 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.830 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.811</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 786436</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr><td bgcolor=black colspan=22><font color=white></td></tr> <tr><td colspan=22 align=right> <tr> <td colspan=22 bgcolor=#303030><b><font color=white>NPROC=1 DIR=/mnt/testfs SYNC=off PHASE_COPY=cp REP_COUNTER=1 GAMMA=0.2 PHASE_OVERWRITE=off FILE_SIZE=8192 BYTES=512000000 PHASE_APPEND=off PHASE_READ=find DEV=/dev/hdb3 DD_MBCOUNT=768 WRITE_BUFFER=131072 PHASE_DELETE=rm PHASE_MODIFY=off </td></tr> <tr><td colspan=22 align=right> <font size=-2>Produced by <a href=http://namesys.com/benchmarks/mongo_readme.html>Mongo</a> benchmark suite.</font></td></tr> </table> <hr> <a name="mongo.2003.08.27"></a>2003.08.27 <a href="benchmarks/mongo_readme.html">mongo</a> results <dl> <dt>reiser4 </dt> <dd>''</dd> <dt>mem total</dt> <dd>256276</dd> <dt>machine </dt> <dd>belka</dd> <dt>kernel </dt> <dd>2.6.0-test4 #189 SMP Wed Aug 27 20:36:51 MSD 2003</dd> <dt>date </dt> <dd>Wed Aug 27 20:44:02 2003</dd> </dl> <p> In this test 80% of files are chosen from the 0-8k size 
range, 16% from the 0-80k size range, 0.8 x 4% (= 3.2%) from the 0-800k size range, etc. Most files are small, but most bytes are in large files. </p> <p>Legend:</p> <ul> <li><tt>A</tt> reiser4</li> <li><tt>B</tt> reiser4, extents only</li> <li><tt>C</tt> ext3 in <tt>data=writeback</tt> mode (meta-data-only journalling)</li> <li><tt>D</tt> ext3 in <tt>data=journal</tt> mode</li> <li><tt>E</tt> ext3 in <tt>data=ordered</tt> mode</li> <li><tt>F</tt> ext3 with htree (hashed directories)</li> </ul> <p> The table presents absolute values (elapsed time, CPU usage, and disk usage) for reiser4, and ratios against reiser4 for every other configuration. A <font color=red>red</font> number means the ratio is larger than <tt>1.0</tt>, i.e. reiser4 wins that test; a <font color=green>green</font> number means reiser4 loses. </p> <table cols=19 cellpadding=2 cellspacing=2 noborder> <tr><td bgcolor=black colspan=19><font color=white></td></tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>A.INFO_R4='' FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>B.INFO_R4='' MKFS=mkfs.reiser4 -q -o policy=extents FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>C.MOUNT_OPTIONS=data=writeback FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>D.MOUNT_OPTIONS=data=journal FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>E.MOUNT_OPTIONS=data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>F.MKFS=mkfs.ext3 -O dir_index MOUNT_OPTIONS=data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <td colspan=19 bgcolor=#606060><b><font color=white>#0:</font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=6><b>REAL_TIME</b></td> <td colspan=6><b>CPU_TIME</b></td> <td colspan=6><b>DF</b></td> </tr> <tr align=center 
bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>CREATE</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 22.41</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.673 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.325 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.975 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.213 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>7.66</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.069 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.347 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.415 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.410 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.708</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 635264</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.096 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.110 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.110 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.110 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.111 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>COPY</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 90.92</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.099 </font></tt></td> <td 
bgcolor=#E0E0C0 align=right><tt><font color=red> 1.471 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.221 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.470 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 4.989 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>12.14</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.068 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.066 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.241 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.094 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.668</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1269840</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.096 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.110 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.110 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.110 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.112 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>READ</b></td> <td bgcolor=#E0E0C0 align=right><tt>82.21</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.063 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.861 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.852 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.791</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 4.417 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>10.57</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.914 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.400</U> </font></tt></td> <td bgcolor=#E0E0C0 
align=right><tt><font color=green> 0.428 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.402</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.534 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1269840</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.096 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.110 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.110 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.110 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.112 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>STATS</b></td> <td bgcolor=#E0E0C0 align=right><tt>8.52</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.993 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.822</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.816</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.811</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.335 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>2.96</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.997 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.561</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.564</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.584 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.608 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1269840</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.096 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.110 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.110 </font></tt></td> <td 
bgcolor=#E0E0C0 align=right><tt><font color=red> 1.110 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.112 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>DELETE</b></td> <td bgcolor=#E0E0C0 align=right><tt>69.69</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.301</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.749 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.717 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.659 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.912 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>14.73</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.703 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.208</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.207</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.213 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.237 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>4</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.000 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> </tt></td> </tr> <tr> <td colspan=19 bgcolor=#606060><b><font color=white>#1:DD_MBCOUNT=768 </font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=6><b>REAL_TIME</b></td> <td colspan=6><b>CPU_TIME</b></td> <td colspan=6><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A 
</b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_writing_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 25.85</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.092 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.335 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.085 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.095 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 3.27</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 0.982</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.159 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.648 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.251 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.254 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 786436</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_reading_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 19</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 0.999</U> 
</font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.005</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.007</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.007</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.007</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>2.18</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.963 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.807 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.803</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.789</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.803</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 786436</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr><td bgcolor=black colspan=19><font color=white></td></tr> <tr><td colspan=19 align=right> <tr> <td colspan=19 bgcolor=#303030><b><font color=white>NPROC=1 DIR=/mnt/testfs SYNC=off PHASE_COPY=cp REP_COUNTER=1 GAMMA=0.2 PHASE_OVERWRITE=off FILE_SIZE=8000 BYTES=512000000 PHASE_APPEND=off PHASE_READ=find DEV=/dev/hdb3 DD_MBCOUNT=768 WRITE_BUFFER=131072 PHASE_DELETE=rm PHASE_MODIFY=off </td></tr> <tr><td colspan=19 align=right> <font size=-2>Produced by <a href=http://namesys.com/benchmarks/mongo_readme.html>Mongo</a> benchmark suite.</font></td></tr> </table> <hr> <p> This is the same test as above, but with base file 
size 4k, that is, in this test 80% of files are chosen from the 0-4k size range, 16% from the 0-40k size range, 0.8 x 4% from the 0-400k size range, etc. </p> <hr> <dl> <dt>reiser4 </dt> <dd>''</dd> <dt>mem total</dt> <dd>255580</dd> <dt>machine </dt> <dd>belka</dd> <dt>kernel </dt> <dd>2.6.0-test4 #176 SMP Tue Aug 26 19:09:38 MSD 2003</dd> <dt>date </dt> <dd>Wed Aug 27 12:41:54 2003</dd> </dl> <table cols=19 cellpadding=2 cellspacing=2 noborder> <tr><td bgcolor=black colspan=19><font color=white></td></tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>A.INFO_R4='' FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>B.INFO_R4='' MKFS=mkfs.reiser4 -q -o policy=extents FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>C.MOUNT_OPTIONS=data=writeback FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>D.MOUNT_OPTIONS=data=journal FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>E.MOUNT_OPTIONS=data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>F.MKFS=mkfs.ext3 -O dir_index MOUNT_OPTIONS=data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <td colspan=19 bgcolor=#606060><b><font color=white>#0:</font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=6><b>REAL_TIME</b></td> <td colspan=6><b>CPU_TIME</b></td> <td colspan=6><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>CREATE</b></td> <td 
bgcolor=#E0E0C0 align=right><tt><U> 33.86</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.223 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.305 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.895 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.549 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.298 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>14.11</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.118 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.967 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.046 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.045 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.647</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 789424</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.208 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.181 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>COPY</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 119.68</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.228 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.237 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.397 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.277 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 7.061 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>23.05</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 
</font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.484 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.683 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.515 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.691</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1578216</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.208 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.182 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>READ</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 118.5</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.217 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.041 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.065 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.020</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 6.585 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>19.84</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.993 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.436</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.446 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.431</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.540 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1578216</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.208 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 
</font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.182 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>STATS</b></td> <td bgcolor=#E0E0C0 align=right><tt>24.69</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.951 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.677</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.696 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.677</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.151 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>7.75</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.008 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.590</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.582</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.583</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.645 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1578216</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.208 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.182 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>DELETE</b></td> <td bgcolor=#E0E0C0 align=right><tt>114.49</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.438 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.174</U> </font></tt></td> <td 
bgcolor=#E0E0C0 align=right><tt><font color=green> 0.188 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.177 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.257 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>32.64</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.790 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.193</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.199 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.194</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.223 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>4</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.000 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> </tt></td> </tr> <tr> <td colspan=19 bgcolor=#606060><b><font color=white>#1:DD_MBCOUNT=768 </font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=6><b>REAL_TIME</b></td> <td colspan=6><b>CPU_TIME</b></td> <td colspan=6><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_writing_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 26.24</U></tt></td> <td bgcolor=#E0E0C0 
align=right><tt><font color=black> <U> 1.002</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.066 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.311 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.056 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.063 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 3.25</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 0.997</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.138 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.622 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.286 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.298 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 786436</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_reading_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 19.04</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 0.994</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.002</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.003</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.002</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>2.08</tt></td> <td bgcolor=#E0E0C0 
align=right><tt><font color=red> 1.038 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.870 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.870 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.870 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.837</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 786436</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr><td bgcolor=black colspan=19><font color=white></td></tr> <tr><td colspan=19 align=right> <tr> <td colspan=19 bgcolor=#303030><b><font color=white>NPROC=1 DIR=/mnt/testfs SYNC=off PHASE_COPY=cp REP_COUNTER=1 GAMMA=0.2 PHASE_OVERWRITE=off FILE_SIZE=4000 BYTES=512000000 PHASE_APPEND=off PHASE_READ=find DEV=/dev/hdb3 DD_MBCOUNT=768 WRITE_BUFFER=131072 PHASE_DELETE=rm PHASE_MODIFY=off </td></tr> <tr><td colspan=19 align=right> <font size=-2>Produced by <a href=http://namesys.com/benchmarks/mongo_readme.html>Mongo</a> benchmark suite.</font></td></tr> </table> <hr> <a name="mongo.2003.08.26"></a>2003.08.26 <a href="benchmarks/mongo_readme.html">mongo</a> results <dl> <dt>reiser4 </dt> <dd>''</dd> <dt>mem total</dt> <dd>904048</dd> <dt>machine </dt> <dd>belka</dd> <dt>kernel </dt> <dd>2.6.0-test4 #176 SMP Tue Aug 26 19:09:38 MSD 2003</dd> <dt>date </dt> <dd>Tue Aug 26 19:34:39 2003</dd> </dl> <p> In this test 80% of files are chosen from the 0-4k size range, 16% from the 0-40k size range, 0.8 x 4% from the 0-400k size range, etc. 
Most files are small, most bytes are in large files. </p> <p>Legend:</p> <ul> <li><tt>A</tt> reiser4</li> <li><tt>B</tt> reiser4, extents only</li> <li><tt>C</tt> ext3 in <tt>data=writeback</tt> mode (meta-data only journalling)</li> <li><tt>D</tt> ext3 in <tt>data=journal</tt> mode</li> <li><tt>E</tt> ext3 in <tt>data=ordered</tt> mode</li> <li><tt>F</tt> ext3 with htree (hashed directories)</li> </ul> <p> Table presents absolute values (of elapsed time, CPU usage, and disk usage) for reiser4, and ratios against reiser4 for all other configurations. <font color=red>Red</font> number means ratio is larger than <tt>1.0</tt>, that is, reiser4 is better in this test. <font color=green>Green</font> number means that reiser4 loses in this test. </p> <table cols=19 cellpadding=2 cellspacing=2 noborder> <tr><td bgcolor=black colspan=19><font color=white></td></tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>A.INFO_R4='' FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>B.INFO_R4='' MKFS=mkfs.reiser4 -q -o policy=extents FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>C.MOUNT_OPTIONS=data=writeback FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>D.MOUNT_OPTIONS=data=journal FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>E.MOUNT_OPTIONS=data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>F.MKFS=mkfs.ext3 -O dir_index MOUNT_OPTIONS=data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <td colspan=19 bgcolor=#606060><b><font color=white>#0:</font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=6><b>REAL_TIME</b></td> <td colspan=6><b>CPU_TIME</b></td> <td colspan=6><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A 
</b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>CREATE</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 27.6</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.311 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.567 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.538 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.668 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.566 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>13.55</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.166 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.035 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.162 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.189 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.670</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 788884</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.208 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.181 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.181 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.181 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.182 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>COPY</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 113.71</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.237 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.167 </font></tt></td> <td 
bgcolor=#E0E0C0 align=right><tt><font color=red> 1.460 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.227 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 7.387 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>23.13</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.169 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.498 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.691 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.591 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.709</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1577560</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.208 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.181 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.181 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.181 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.183 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>READ</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 111.51</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.239 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.157 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.176 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.096 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 7.017 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>20.76</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.042 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.424 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.415</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font 
color=green> <U> 0.416</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.521 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1577560</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.208 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.181 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.181 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.181 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.183 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>STATS</b></td> <td bgcolor=#E0E0C0 align=right><tt>20.22</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.034 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.834</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.827</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.832</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.439 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>7.47</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.009 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.590</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.585</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.584</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.631 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1577560</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.208 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.181 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.181 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.181 </font></tt></td> <td bgcolor=#E0E0C0 
align=right><tt><font color=red> 1.183 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>DELETE</b></td> <td bgcolor=#E0E0C0 align=right><tt>110.98</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.437 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.183</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.180</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.185 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.277 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>33.03</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.838 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.196 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.192</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.193</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.221 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>4</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.000 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> </tt></td> </tr> <tr> <td colspan=19 bgcolor=#606060><b><font color=white>#1:DD_MBCOUNT=768 </font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=6><b>REAL_TIME</b></td> <td colspan=6><b>CPU_TIME</b></td> <td colspan=6><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A 
</b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_writing_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 26.03</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.096 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.340 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.092 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.080 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 3.48</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.011</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.083 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.583 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.187 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.190 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 786436</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_reading_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 19</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 0.995</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> 
</font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 0.999</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 0.999</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>2.28</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.018 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.741 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.737</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.741 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.724</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 786436</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr><td bgcolor=black colspan=19><font color=white></td></tr> <tr><td colspan=19 align=right> <tr> <td colspan=19 bgcolor=#303030><b><font color=white>NPROC=1 DIR=/mnt/testfs SYNC=off PHASE_COPY=cp REP_COUNTER=1 GAMMA=0.2 PHASE_OVERWRITE=off FILE_SIZE=4000 BYTES=512000000 PHASE_APPEND=off PHASE_READ=find DEV=/dev/hdb3 DD_MBCOUNT=768 WRITE_BUFFER=131072 PHASE_DELETE=rm PHASE_MODIFY=off </td></tr> <tr><td colspan=19 align=right> <font size=-2>Produced by <a href=http://namesys.com/benchmarks/mongo_readme.html>Mongo</a> benchmark suite.</font></td></tr> </table> <hr> <a name="mongo.2003.08.18"></a>2003.08.18 <a href="benchmarks/mongo_readme.html">mongo</a> results <dl> <dt>reiser4 </dt> <dd></dd> <dt>mem total</dt> 
<dd>255992</dd> <dt>machine </dt> <dd>belka</dd> <dt>kernel </dt> <dd>2.6.0-test3 #37 SMP Mon Aug 18 18:12:14 MSD 2003</dd> <dt>date </dt> <dd>Mon 18 Aug 2003 20:24:16</dd> </dl> <p> In this test 80% of files are chosen from the 0-8k size range, 16% from the 0-80k size range, 0.8 x 4% from the 0-800k size range, etc. Most files are small, most bytes are in large files. </p> <p>Legend:</p> <ul> <li><tt>A</tt> reiser4</li> <li><tt>B</tt> reiser4, extents only</li> <li><tt>C</tt> ext3 in <tt>data=writeback</tt> mode (meta-data only journalling)</li> <li><tt>D</tt> ext3 in <tt>data=journal</tt> mode</li> <li><tt>E</tt> ext3 in <tt>data=ordered</tt> mode</li> <li><tt>F</tt> ext3 with htree (hashed directories)</li> </ul> <p> Table presents absolute values (of elapsed time, CPU usage, and disk usage) for reiser4, and ratios against reiser4 for all other configurations. <font color=red>Red</font> number means ratio is larger than <tt>1.0</tt>, that is, reiser4 is better in this test. <font color=green>Green</font> number means that reiser4 loses in this test.
</p> <table cols=19 cellpadding=2 cellspacing=2 noborder> <tr><td bgcolor=black colspan=19><font color=white></td></tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>A.INFO_R4= FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>B.INFO_R4=ext MKFS=mkfs.reiser4 -q -o policy=extents FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>C.MOUNT_OPTIONS=data=writeback FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>D.MOUNT_OPTIONS=data=journal FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>E.MOUNT_OPTIONS=data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>F.MKFS=mkfs.ext3 -O dir_index MOUNT_OPTIONS=data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <td colspan=19 bgcolor=#606060><b><font color=white>#0:</font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=6><b>REAL_TIME</b></td> <td colspan=6><b>CPU_TIME</b></td> <td colspan=6><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>CREATE</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 29.16</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.220 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.422 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.779 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.491 </font></tt></td> <td bgcolor=#E0E0C0 
align=right><tt><font color=red> 1.645 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>13.52</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.182 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.013 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.087 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.997 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.657</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 789364</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.208 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.181 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>COPY</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 119.64</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.211 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.191 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.473 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.230 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 7.288 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>21.98</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.152 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.515 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.746 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.520 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.695</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 
1578116</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.208 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.182 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>READ</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 116.55</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.213 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.177 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.025 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.134 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 6.850 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>18.35</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.035 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.447 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.436</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.431</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.569 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1578116</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.208 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.182 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>STATS</b></td> <td bgcolor=#E0E0C0 align=right><tt>21.65</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font 
color=red> 1.050 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.779</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.811 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.782</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.358 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>7.56</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.001 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.599</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.612 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.611</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.638 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1578116</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.208 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.182 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>DELETE</b></td> <td bgcolor=#E0E0C0 align=right><tt>112.37</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.434 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.179</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.198 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.177</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.281 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>30.62</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.851 </font></tt></td> <td bgcolor=#E0E0C0 
align=right><tt><font color=green> <U> 0.205</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.205</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.203</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.230 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>4</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.000 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> </tt></td> </tr> <tr> <td colspan=19 bgcolor=#606060><b><font color=white>#1:DD_MBCOUNT=768 </font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=6><b>REAL_TIME</b></td> <td colspan=6><b>CPU_TIME</b></td> <td colspan=6><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_writing_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 26.11</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.011</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.090 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.388 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.076 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.083 </font></tt></td> </tt></td> 
<td bgcolor=#E0E0C0 align=right><tt>3.25</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.945</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.092 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.640 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.255 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.231 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 786436</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_reading_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 19.09</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.005</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 0.999</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 0.996</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.004</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.011</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>2.09</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.019 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.847</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.856 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.833</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.842</U> 
</font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 786436</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr><td bgcolor=black colspan=19><font color=white></td></tr> <tr><td colspan=19 align=right> <tr> <td colspan=19 bgcolor=#303030><b><font color=white>NPROC=1 DIR=/mnt/testfs SYNC=off PHASE_COPY=cp REP_COUNTER=1 GAMMA=0.2 PHASE_OVERWRITE=off FILE_SIZE=4000 BYTES=512000000 PHASE_APPEND=off PHASE_READ=find DEV=/dev/hdb3 DD_MBCOUNT=768 WRITE_BUFFER=131072 PHASE_DELETE=rm PHASE_MODIFY=off </td></tr> <tr><td colspan=19 align=right> <font size=-2>Produced by <a href=http://namesys.com/benchmarks/mongo_readme.html>Mongo</a> benchmark suite.</font></td></tr> </table> <hr> <a name="mongo.2003.08.12"></a>2003.08.12 <a href="benchmarks/mongo_readme.html">mongo</a> results <dl> <dt>mem total</dt> <dd>513284</dd> <dt>machine </dt> <dd>strelka</dd> <dt>kernel </dt> <dd>2.6.0-test2 #52 SMP Tue Aug 12 15:17:12 MSD 2003</dd> <dt>date </dt> <dd>Tue Aug 12 15:38:47 2003</dd> </dl> <p> This is a comparison of the latest (2003.08.12) version of reiser4 with ext3. Reiser4 is an atomic filesystem, so the comparison against ext3's data-journalling mode is the fairest, but since most users run ext3 in data-ordered mode, we compare against that as well. </p> <p> In this test 80% of files are chosen from the 0-8k size range, 16% from the 0-80k size range, 0.8 x 4% from the 0-800k size range, etc. Most files are small, most bytes are in large files.
</p> <p>Legend:</p> <ul> <li><tt>A</tt> reiser4</li> <li><tt>B</tt> ext3 in <tt>data=writeback</tt> mode (meta-data only journalling)</li> <li><tt>C</tt> ext3 in <tt>data=journal</tt> mode</li> <li><tt>D</tt> ext3 in <tt>data=ordered</tt> mode</li> <li><tt>E</tt> ext3 with htree (hashed directories)</li> <li><tt>F</tt> ext3 with support for filetypes in <tt>readdir()</tt></li> </ul> <p> The table presents absolute values (of elapsed time, CPU usage, and disk usage) for reiser4, and ratios against reiser4 for all other configurations. A <font color=red>red</font> number means the ratio is larger than <tt>1.0</tt>, that is, reiser4 is better in this test. A <font color=green>green</font> number means that reiser4 loses in this test. </p> <table cols=19 cellpadding=2 cellspacing=2 noborder> <tr><td bgcolor=black colspan=19><font color=white></td></tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>A.INFO_R4= MKFS=/usr/local/sbin/mkfs.reiser4 -qf FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>B.MOUNT_OPTIONS=data=writeback FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>C.MOUNT_OPTIONS=data=journal FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>D.MOUNT_OPTIONS=data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>E.MKFS=/usr/local/sbin/mkfs.ext3 -O dir_index MOUNT_OPTIONS=data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>F.MKFS=/usr/local/sbin/mkfs.ext3 -O filetype MOUNT_OPTIONS=data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <td colspan=19 bgcolor=#606060><b><font color=white>#0:</font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=6><b>REAL_TIME</b></td> <td colspan=6><b>CPU_TIME</b></td> <td colspan=6><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td>
<td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>CREATE</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 14.06</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.317 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.248 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.050 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.016 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.077 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>5.3</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.558 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.692 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.602 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.823</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.592 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 458224</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>COPY</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 43.62</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.982 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font 
color=red> 1.733 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.033 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 6.685 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.904 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>9.19</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.163 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.286 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.230 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.706</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.200 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 916172</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>READ</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 39.86</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.091 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.091 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.140 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 6.003 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.119 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>8.22</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.467 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.454 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.464 </font></tt></td> <td 
bgcolor=#E0E0C0 align=right><tt><font color=green> 0.529 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.443</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 916172</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>STATS</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1.54</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.987 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.896 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.942 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.649 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.883 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 0.26</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.115 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.115 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.115 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.385 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.962 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 916172</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font 
color=red> 1.107 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>DELETE</b></td> <td bgcolor=#E0E0C0 align=right><tt>37.85</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.833 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.825 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.867 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.133 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.760</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>11.11</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.223</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.223</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.220</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.254 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.222</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>4</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> </tt></td> </tr> <tr> <td colspan=19 bgcolor=#606060><b><font color=white>#1:DD_MBCOUNT=500 </font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=6><b>REAL_TIME</b></td> <td colspan=6><b>CPU_TIME</b></td> <td colspan=6><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A 
</b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_writing_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 42.15</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.062 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.534 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.066 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.071 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.073 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 7.86</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.094 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.500 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.206 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.211 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.198 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 512004</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_reading_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 36.5</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.005</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.008</U> </font></tt></td> <td 
bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.005</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.007</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.007</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>4.7</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.745</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.732</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.743</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.736</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.734</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 512004</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr><td bgcolor=black colspan=19><font color=white></td></tr> <tr><td colspan=19 align=right> <tr> <td colspan=19 bgcolor=#303030><b><font color=white>NPROC=1 DIR=/data1 SYNC=off PHASE_COPY=cp REP_COUNTER=3 GAMMA=0.2 PHASE_OVERWRITE=off PHASE_STATS=find FILE_SIZE=8192 BYTES=134217728 PHASE_APPEND=off PHASE_READ=find DEV=/dev/hdb1 DD_MBCOUNT=500 WRITE_BUFFER=131072 PHASE_DELETE=rm PHASE_MODIFY=off </td></tr> <tr><td colspan=19 align=right> <font size=-2>Produced by <a href=http://namesys.com/benchmarks/mongo_readme.html>Mongo</a> benchmark suite.</font></td></tr> </table> <hr> <p> <a name="mongo.2003.07.23"></a> Below are older (2003.07.23) mongo results.
</p> <table cols=10 cellpadding=2 cellspacing=2 noborder> <tr><td bgcolor=black colspan=10><font color=white></td></tr> <tr> <th bgcolor=#303030 colspan=10 align=left><font color=white>A. reiser4</th> </tr> <tr> <th bgcolor=#303030 colspan=10 align=left><font color=white>B. ext3 data journalling</th> </tr> <tr> <th bgcolor=#303030 colspan=10 align=left><font color=white>C. ext3 </font></th> </tr> <tr> <td colspan=10 bgcolor=#606060><b><font color=white>#0:</font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=3><b>REAL_TIME</b></td> <td colspan=3><b>CPU_TIME</b></td> <td colspan=3><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>CREATE</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 14.19</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.221 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.592 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 5.66</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.610 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.475 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 458692</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>COPY</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 49.01</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.586 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.783 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 9.08</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.308 </font></tt></td> <td 
bgcolor=#E0E0C0 align=right><tt><font color=red> 1.176 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 916668</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>READ</b></td> <td bgcolor=#E0E0C0 align=right><tt>43.39</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.970</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.017 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>8.1</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.452</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.453</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 916668</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>STATS</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1.93</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.534 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.549 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 0.27</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.000 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.963 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 916668</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>DELETE</b></td> <td bgcolor=#E0E0C0 align=right><tt>40.13</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.797</U> </font></tt></td> <td bgcolor=#E0E0C0 
align=right><tt><font color=green> 0.837 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>11.26</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.217 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.210</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>4</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> </tt></td> </tr> <tr> <td colspan=10 bgcolor=#606060><b><font color=white>#1:DD_MBCOUNT=500 </font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=3><b>REAL_TIME</b></td> <td colspan=3><b>CPU_TIME</b></td> <td colspan=3><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_writing_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 42.27</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.527 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.057 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 7.78</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.497 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.189 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 512004</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_reading_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 36.57</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.005</U> </font></tt></td> <td bgcolor=#E0E0C0 
align=right><tt><font color=black> <U> 1.005</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt>4.8</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.760</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.777 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 512004</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tr> <tr><td bgcolor=black colspan=10><font color=white></font></td></tr> <tr><td colspan=10 align=right></td></tr> <tr> <td colspan=10 bgcolor=#303030><b><font color=white>NPROC=1 DIR=/data1 SYNC=off PHASE_COPY=cp REP_COUNTER=3 GAMMA=0.2 PHASE_OVERWRITE=off PHASE_STATS=find FILE_SIZE=8192 BYTES=134217728 PHASE_APPEND=off PHASE_READ=find DEV=/dev/hdb1 DD_MBCOUNT=500 WRITE_BUFFER=131072 PHASE_DELETE=rm PHASE_MODIFY=off </font></b></td></tr> <tr><td colspan=10 align=right> <font size=-2>Produced by <a href=http://namesys.com/benchmarks/mongo_readme.html>Mongo</a> benchmark suite.</font></td></tr> </table> <hr> <a name="mongo.2003.07.10"></a> <p> Below are some older benchmarks taken just before LinuxTag. In these, gamma is the fraction of files that are 10x larger than the base size. It is set either to 0.2 (as in the benchmark above), to mimic observed real-world usage patterns, or to 0, to measure a file size range's performance in isolation. Note that V3 performs poorly in the 0-8k size range while V4 performs well; this is the result of deep design changes described at <a href="http://www.namesys.com/v4/v4.html">http://www.namesys.com/v4/v4.html</a>.</p>
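<p>As a rough illustration of the workload model, here is a minimal Python sketch of the file-size distribution implied by the gamma parameter described above. It assumes gamma simply marks each generated file as 10x the base size with probability gamma; the actual Mongo generator may draw sizes differently, so treat this as a sketch of the idea, not the benchmark's exact behavior.</p>

```python
import random

def sample_file_size(base_size: int, gamma: float) -> int:
    # Assumed model: with probability `gamma` a file is 10x the base
    # size, otherwise it is the base size itself. The real Mongo
    # generator may vary sizes around the base; this only captures
    # the "fraction of large files" notion from the text above.
    return base_size * 10 if random.random() < gamma else base_size

random.seed(0)
# GAMMA=0.2, FILE_SIZE=8192, as in the parameter row above:
sizes = [sample_file_size(8192, 0.2) for _ in range(10_000)]
large_fraction = sizes.count(81920) / len(sizes)
```

<p>With GAMMA=0.2 roughly one file in five comes out at ~80 KiB; with GAMMA=0 every file stays at the base size, which corresponds to the tables below that measure a single size range in isolation.</p>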
<dl><dt>mem total</dt><dd>513748</dd><dt>machine </dt><dd>strelka</dd><dt>kernel </dt><dd>2.5.74 #213 SMP Thu Jul 10 22:53:23 MSD 2003</dd><dt>date </dt><dd>Thu Jul 10 22:48:56 2003</dd><dt>.config </dt><dd><a href="http://www.namesys.com/intbenchmarks/mongo/03.07.11.nikita/.config">here</a></dd><dt>NPROC</dt><dd>1</dd><dt>DIR</dt><dd>/data1</dd><dt>SYNC</dt><dd>off</dd><dt>REP_COUNTER</dt><dd>3</dd><dt>All phases are in readdir order</dt><dd></dd><dt>BYTES</dt><dd>100M</dd><dt>DEV</dt><dd>/dev/hdb1</dd><dt>WRITE_BUFFER</dt><dd><b>256k</b></dd></dl> <p>Everywhere, <b>A</b> is reiserfs and <b>B</b> is reiser4; green numbers mean reiser4 is better.</p> <table cols="7" cellpadding="2" cellspacing="2" noborder=""> <tbody><tr><td bgcolor="black" colspan="7"><font color="white"></font></td></tr> <tr> <th bgcolor="#303030" colspan="7" align="left"><font color="white">median file size 8k</font></th> </tr> <tr align="center" bgcolor="#c0c0c0"> <td></td> <td colspan="2"><b>REAL_TIME</b></td> <td colspan="2"><b>CPU_TIME</b></td> <td colspan="2"><b>DF</b></td> </tr> <tr align="center" bgcolor="#c0c0c0"> <td></td> <td><b>A</b></td><td><b>B/A </b></td> <td><b>A</b></td><td><b>B/A </b></td> <td><b>A</b></td><td><b>B/A </b></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>CREATE</b></td> <td bgcolor="#e0e0c0" align="right"><tt>41.26</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.246</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>3.93</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.908</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>321632</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.961</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>COPY</b></td> <td bgcolor="#e0e0c0" align="right"><tt>154.09</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.504</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 5.17</u></tt></td> <td 
bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.217 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>642624</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.962</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>APPEND</b></td> <td bgcolor="#e0e0c0" align="right"><tt>282.09</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.573</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 6.6</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.392 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>944428</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 0.980</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>MODIFY</b></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 284.52</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 0.986</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 3.29</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.489 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 943592</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 0.981</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>OVERWRITE</b></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 298.19</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.263 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 5.33</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.608 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>943548</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.968</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>READ</b></td> <td bgcolor="#e0e0c0" align="right"><tt>245.22</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.940</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 
3.85</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.753 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>943548</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.968</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>STATS</b></td> <td bgcolor="#e0e0c0" align="right"><tt>20.58</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.099</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 0.48</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.292 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>943548</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.968</u> </font></tt></td> </tr> <tr> <td colspan="7" bgcolor="#a0a0a0"><b><font color="white">GAMMA=0.2 FILE_SIZE=8192 <a href="http://www.namesys.com/intbenchmarks/mongo/03.07.11.nikita/8k.heavy.v3.profile">A profile</a> <a href="http://www.namesys.com/intbenchmarks/mongo/03.07.11.nikita/8k.heavy.v4.profile">B profile</a></font></b></td></tr> <tr><td bgcolor="white" colspan="7"><font color="white"></font></td></tr> <tr><td bgcolor="white" colspan="7"><font color="white"></font></td></tr> <tr><td bgcolor="white" colspan="7"><font color="white"></font></td></tr> <tr><td bgcolor="black" colspan="7"><font color="white"></font></td></tr> <tr> <th bgcolor="#303030" colspan="7" align="left"><font color="white">median file size 4k</font></th> </tr> <tr align="center" bgcolor="#c0c0c0"> <td></td> <td colspan="2"><b>REAL_TIME</b></td> <td colspan="2"><b>CPU_TIME</b></td> <td colspan="2"><b>DF</b></td> </tr> <tr align="center" bgcolor="#c0c0c0"> <td></td> <td><b>A</b></td><td><b>B/A </b></td> <td><b>A</b></td><td><b>B/A </b></td> <td><b>A</b></td><td><b>B/A </b></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>CREATE</b></td> <td bgcolor="#e0e0c0" align="right"><tt>117.32</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.176</u> </font></tt></td> 
<td bgcolor="#e0e0c0" align="right"><tt>15.57</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.758</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 667652</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.000</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>COPY</b></td> <td bgcolor="#e0e0c0" align="right"><tt>524.67</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.365</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 19.16</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.059 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 1332856</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.002</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>APPEND</b></td> <td bgcolor="#e0e0c0" align="right"><tt>1068.43</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.363</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>31.27</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.937</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>2073420</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.950</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>MODIFY</b></td> <td bgcolor="#e0e0c0" align="right"><tt>1081.23</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.670</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 18.61</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.048 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>2066536</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.953</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>OVERWRITE</b></td> <td bgcolor="#e0e0c0" align="right"><tt>1050.55</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 
0.885</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 22.81</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.017</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>2066424</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.948</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>READ</b></td> <td bgcolor="#e0e0c0" align="right"><tt>974.43</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.644</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 12.28</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.635 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>2066424</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.948</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>STATS</b></td> <td bgcolor="#e0e0c0" align="right"><tt>83.44</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.075</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>1.26</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.802</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>2066424</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.948</u> </font></tt></td> </tr> <tr> <td colspan="7" bgcolor="#a0a0a0"><b><font color="white">GAMMA=0.2 FILE_SIZE=4096 <a href="http://www.namesys.com/intbenchmarks/mongo/03.07.11.nikita/4k.heavy.v3.profile">A profile</a> <a href="http://www.namesys.com/intbenchmarks/mongo/03.07.11.nikita/4k.heavy.v4.profile">B profile</a></font></b></td></tr> <tr><td bgcolor="white" colspan="7"><font color="white"></font></td></tr> <tr><td bgcolor="white" colspan="7"><font color="white"></font></td></tr> <tr><td bgcolor="white" colspan="7"><font color="white"></font></td></tr> <tr><td bgcolor="black" colspan="7"><font color="white"></font></td></tr> <tr> <th bgcolor="#303030" colspan="7" 
align="left"><font color="white">maximal file size 4k</font></th> </tr> <tr align="center" bgcolor="#c0c0c0"> <td></td> <td colspan="2"><b>REAL_TIME</b></td> <td colspan="2"><b>CPU_TIME</b></td> <td colspan="2"><b>DF</b></td> </tr> <tr align="center" bgcolor="#c0c0c0"> <td></td> <td><b>A</b></td><td><b>B/A </b></td> <td><b>A</b></td><td><b>B/A </b></td> <td><b>A</b></td><td><b>B/A </b></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>CREATE</b></td> <td bgcolor="#e0e0c0" align="right"><tt>77.34</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.309</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>21.86</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.938</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>452252</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.923</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>COPY</b></td> <td bgcolor="#e0e0c0" align="right"><tt>412.28</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.300</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 35.11</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.013</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>893408</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.934</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>APPEND</b></td> <td bgcolor="#e0e0c0" align="right"><tt>1198.9</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.164</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>67.06</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.694</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>1631992</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.749</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>MODIFY</b></td> <td bgcolor="#e0e0c0" 
align="right"><tt>1305.14</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.351</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>43.77</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.762</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>1613124</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.758</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>OVERWRITE</b></td> <td bgcolor="#e0e0c0" align="right"><tt>1390.94</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.239</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>44.22</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.777</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>1610948</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.759</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>READ</b></td> <td bgcolor="#e0e0c0" align="right"><tt>1093.6</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.256</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 19.46</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.743 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>1610948</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.759</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>STATS</b></td> <td bgcolor="#e0e0c0" align="right"><tt>115.76</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.200</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>2.6</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.735</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>1610948</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.759</u> </font></tt></td> </tr> <tr> <td colspan="7" bgcolor="#a0a0a0"><b><font 
color="white">GAMMA=0.0 FILE_SIZE=4096 <a href="http://www.namesys.com/intbenchmarks/mongo/03.07.11.nikita/100.heavy.v3.profile">A profile</a> <a href="http://www.namesys.com/intbenchmarks/mongo/03.07.11.nikita/100.heavy.v4.profile">B profile</a></font></b></td></tr> <tr><td bgcolor="white" colspan="7"><font color="white"></font></td></tr> <tr><td bgcolor="white" colspan="7"><font color="white"></font></td></tr> <tr><td bgcolor="white" colspan="7"><font color="white"></font></td></tr> <tr> <th bgcolor="#303030" colspan="7" align="left"><font color="white">median file size 8k</font></th> </tr> <tr align="center" bgcolor="#c0c0c0"> <td></td> <td colspan="2"><b>REAL_TIME</b></td> <td colspan="2"><b>CPU_TIME</b></td> <td colspan="2"><b>DF</b></td> </tr> <tr align="center" bgcolor="#c0c0c0"> <td></td> <td><b>A</b></td><td><b>B/A </b></td> <td><b>A</b></td><td><b>B/A </b></td> <td><b>A</b></td><td><b>B/A </b></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>CREATE</b></td> <td bgcolor="#e0e0c0" align="right"><tt>40.54</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.248</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>4.01</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.895</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>321632</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.961</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>COPY</b></td> <td bgcolor="#e0e0c0" align="right"><tt>152.82</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.506</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 5.2</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.215 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>642624</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.962</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>READ</b></td> <td bgcolor="#e0e0c0" 
align="right"><tt>141.8</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.563</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 3.03</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.762 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>642624</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.962</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>STATS</b></td> <td bgcolor="#e0e0c0" align="right"><tt>14.91</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.084</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 0.59</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.051 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>642624</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.962</u> </font></tt></td> </tr> <tr><td bgcolor="black" colspan="7"><font color="white"></font></td></tr> <tr><td colspan="7" align="right"> </td></tr><tr> <td colspan="7" bgcolor="#303030"><b><font color="white">GAMMA=0.2 FILE_SIZE=8192</font></b></td></tr> <tr><td bgcolor="white" colspan="7"><font color="white"></font></td></tr> <tr><td bgcolor="white" colspan="7"><font color="white"></font></td></tr> <tr><td bgcolor="white" colspan="7"><font color="white"></font></td></tr> <tr> <th bgcolor="#303030" colspan="7" align="left"><font color="white">median file size 4k</font></th> </tr> <tr align="center" bgcolor="#c0c0c0"> <td></td> <td colspan="2"><b>REAL_TIME</b></td> <td colspan="2"><b>CPU_TIME</b></td> <td colspan="2"><b>DF</b></td> </tr> <tr align="center" bgcolor="#c0c0c0"> <td></td> <td><b>A</b></td><td><b>B/A </b></td> <td><b>A</b></td><td><b>B/A </b></td> <td><b>A</b></td><td><b>B/A </b></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>CREATE</b></td> <td bgcolor="#e0e0c0" align="right"><tt>115.6</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.174</u> 
</font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>14.84</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.772</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 667652</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.000</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>COPY</b></td> <td bgcolor="#e0e0c0" align="right"><tt>528.83</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.361</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 18.91</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.058 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 1332856</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.002</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>READ</b></td> <td bgcolor="#e0e0c0" align="right"><tt>532.06</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.372</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 10.87</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.589 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 1332856</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.002</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>STATS</b></td> <td bgcolor="#e0e0c0" align="right"><tt>51.99</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.069</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>1.67</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.581</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 1332856</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.002</u> </font></tt></td> </tr> <tr><td bgcolor="black" colspan="7"><font color="white"></font></td></tr> <tr><td colspan="7" align="right"> </td></tr><tr> <td colspan="7" 
bgcolor="#303030"><b><font color="white">GAMMA=0.2 FILE_SIZE=4096</font></b></td></tr> <tr><td bgcolor="white" colspan="7"><font color="white"></font></td></tr> <tr><td bgcolor="white" colspan="7"><font color="white"></font></td></tr> <tr><td bgcolor="white" colspan="7"><font color="white"></font></td></tr> <tr> <th bgcolor="#303030" colspan="7" align="left"><font color="white">maximal file size 4k</font></th> </tr> <tr align="center" bgcolor="#c0c0c0"> <td></td> <td colspan="2"><b>REAL_TIME</b></td> <td colspan="2"><b>CPU_TIME</b></td> <td colspan="2"><b>DF</b></td> </tr> <tr align="center" bgcolor="#c0c0c0"> <td></td> <td><b>A</b></td><td><b>B/A </b></td> <td><b>A</b></td><td><b>B/A </b></td> <td><b>A</b></td><td><b>B/A </b></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>CREATE</b></td> <td bgcolor="#e0e0c0" align="right"><tt>77.5</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.309</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>22.24</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.910</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>452252</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.923</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>COPY</b></td> <td bgcolor="#e0e0c0" align="right"><tt>415.84</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.297</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 34.9</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.009</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>893408</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.934</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>READ</b></td> <td bgcolor="#e0e0c0" align="right"><tt>469.97</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.273</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 
20.14</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.454 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>893408</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.934</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>STATS</b></td> <td bgcolor="#e0e0c0" align="right"><tt>65.49</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.162</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>3.09</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.599</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>893408</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.934</u> </font></tt></td> </tr> <tr><td bgcolor="black" colspan="7"><font color="white"></font></td></tr> <tr><td colspan="7" align="right"> </td></tr><tr> <td colspan="7" bgcolor="#303030"><b><font color="white">GAMMA=0.0 FILE_SIZE=4096</font></b></td></tr> </tbody></table> <hr> <h1>Mongo benchmark results</h1> <h2>create, copy, read, stats, delete phases</h2> <dl><dt>reiser4 </dt><dd>ChangeSet@1.1095, 2003-07-10 15:22:17+04:00, god@laputa.namesys.com oops ChangeSet@1.1094, 2003-07-10 15:14:06+04:00, god@laputa.namesys.com repairing compilation damage. 
</dd><dt>mem total</dt><dd>256624</dd><dt>machine </dt><dd>belka</dd><dt>kernel </dt><dd>2.5.74 #28 Thu Jul 10 18:36:03 MSD 2003</dd><dt>date </dt><dd>Thu Jul 10 19:21:06 2003</dd><dt><a href="http://namesys.com/intbenchmarks/mongo/03.07.11.light/dot.config">.config</a></dt></dl> <table cols="19" cellpadding="2" cellspacing="2" noborder=""> <tbody><tr><td bgcolor="black" colspan="19"><font color="white"></font></td></tr> <tr> <th bgcolor="#303030" colspan="19" align="left"><font color="white">A.INFO_R4=test FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor="#303030" colspan="19" align="left"><font color="white">B.INFO_R4=test FSTYPE=reiser4 MKFS=mkfs.reiser4 -q -e extent40 </font></th> </tr> <tr> <th bgcolor="#303030" colspan="19" align="left"><font color="white">C.FSTYPE=reiserfs </font></th> </tr> <tr> <th bgcolor="#303030" colspan="19" align="left"><font color="white">D.FSTYPE=reiserfs MOUNT_OPTIONS=notail </font></th> </tr> <tr> <th bgcolor="#303030" colspan="19" align="left"><font color="white">E.FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor="#303030" colspan="19" align="left"><font color="white">F.FSTYPE=ext3 MOUNT_OPTIONS=data=journal </font></th> </tr> <tr> <td colspan="19" bgcolor="#606060"><b><font color="white">#0:FILE_SIZE=4000 </font></b></td></tr> <tr align="center" bgcolor="#c0c0c0"> <td></td> <td colspan="6"><b>REAL_TIME</b></td> <td colspan="6"><b>CPU_TIME</b></td> <td colspan="6"><b>DF</b></td> </tr> <tr align="center" bgcolor="#c0c0c0"> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>CREATE</b></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 20.47</u></tt></td> <td bgcolor="#e0e0c0" 
align="right"><tt><font color="red"> 1.404 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 3.037 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 2.024 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 2.513 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 3.324 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>12.72</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.143 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.270 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.873 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.615</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.606</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 416332</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.934 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.088 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.909 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.858 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.858 </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>COPY</b></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 65.25</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.484 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 2.953 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 2.020 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.986 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 2.267 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>21.98</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font 
color="red"> 1.032 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.098 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.732 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.529</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.699 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 832640</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.934 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.088 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.910 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.858 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.858 </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>READ</b></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 75.56</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.349 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 2.868 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 2.218 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.902 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.925 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>17.36</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.213 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.745 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.857 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.695 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.681</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 832640</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font 
color="red"> 1.934 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.088 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.910 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.858 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.858 </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>STATS</b></td> <td bgcolor="#e0e0c0" align="right"><tt>132.18</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> 0.996 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.963</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> 0.994 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.967</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.950</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>2.63</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.977</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.970</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 0.989</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 0.981</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> 1.008 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 832640</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.934 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.088 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.910 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.858 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.858 </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>DELETE</b></td> <td bgcolor="#e0e0c0" 
align="right"><tt>85.32</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.627 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.239 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.442 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.403</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.449 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>33.57</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.856 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.780 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.623 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.157</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.154</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>4</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> 1.000 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.000</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.000</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.000</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.000</u> </font></tt></td> </tr> <tr> <td colspan="19" bgcolor="#606060"><b><font color="white">#1:FILE_SIZE=8000 </font></b></td></tr> <tr align="center" bgcolor="#c0c0c0"> <td></td> <td colspan="6"><b>REAL_TIME</b></td> <td colspan="6"><b>CPU_TIME</b></td> <td colspan="6"><b>DF</b></td> </tr> <tr align="center" bgcolor="#c0c0c0"> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A 
</b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>CREATE</b></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 15.07</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.009</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 8.875 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.709 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 2.237 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 3.321 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>8.62</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.945 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.932 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.729 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.517</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.522</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 399788</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.000</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.243 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.461 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.434 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.434 </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>COPY</b></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 52.24</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.007</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 4.998 </font></tt></td> <td bgcolor="#e0e0c0" 
align="right"><tt><font color="red"> 1.492 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.562 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.879 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>13.42</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.026 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.264 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.700 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.487</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.635 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 799488</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.000</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.243 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.461 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.434 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.434 </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>READ</b></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 60.91</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.013</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 3.738 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.606 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.333 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.340 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>11.66</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> 1.018 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.526</u> </font></tt></td> <td bgcolor="#e0e0c0" 
align="right"><tt><font color="green"> 0.749 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.547 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.547 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 799488</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.000</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.243 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.461 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.434 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.434 </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>STATS</b></td> <td bgcolor="#e0e0c0" align="right"><tt>126.53</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.951</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.958</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> 0.991 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> 1.004 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.966</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 2.57</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.023 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.027 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 0.988</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> 1.016 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> 1.012 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 799488</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.000</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.243 
</font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.461 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.434 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.434 </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>DELETE</b></td> <td bgcolor="#e0e0c0" align="right"><tt>73.21</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.116 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.746 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.242</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.301 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.396 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>19.93</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> 1.013 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.584 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.530 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.126 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.123</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>4</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> 1.000 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.000</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.000</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.000</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.000</u> </font></tt></td> </tr> <tr><td bgcolor="black" colspan="19"><font color="white"></font></td></tr> <tr><td colspan="19" align="right"> </td></tr><tr> <td colspan="19" bgcolor="#303030"><b><font 
color="white">PHASE_APPEND=off NPROC=1 DIR=/mnt/testfs SYNC=off REP_COUNTER=3 GAMMA=0.0 PHASE_OVERWRITE=off DEV=/dev/hdb3 WRITE_BUFFER=4096 BYTES=128000000 PHASE_MODIFY=off </font></b></td></tr> <tr><td colspan="19" align="right"> <font size="-2">Produced by <a href="http://namesys.com/benchmarks/mongo_readme.html">Mongo</a> benchmark suite.</font></td></tr> </tbody></table> <h2>dd of a large file phase</h2> <dl><dt>reiser4 </dt><dd>ChangeSet@1.1095, 2003-07-10 15:22:17+04:00, god@laputa.namesys.com oops ChangeSet@1.1094, 2003-07-10 15:14:06+04:00, god@laputa.namesys.com repairing compilation damage. </dd><dt>mem total</dt><dd>256624</dd><dt>machine </dt><dd>belka</dd><dt>kernel </dt><dd>2.5.74 #28 Thu Jul 10 18:36:03 MSD 2003</dd><dt>date </dt><dd>Thu Jul 10 21:36:22 2003</dd><dt><a href="http://namesys.com/intbenchmarks/mongo/03.07.11.light/dot.config">.config</a></dt></dl> <table cols="19" cellpadding="2" cellspacing="2" noborder=""> <tbody><tr><td bgcolor="black" colspan="19"><font color="white"></font></td></tr> <tr> <th bgcolor="#303030" colspan="19" align="left"><font color="white">A.INFO_R4=test FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor="#303030" colspan="19" align="left"><font color="white">B.INFO_R4=test FSTYPE=reiser4 MKFS=mkfs.reiser4 -q -e extent40 </font></th> </tr> <tr> <th bgcolor="#303030" colspan="19" align="left"><font color="white">C.FSTYPE=reiserfs </font></th> </tr> <tr> <th bgcolor="#303030" colspan="19" align="left"><font color="white">D.FSTYPE=reiserfs MOUNT_OPTIONS=notail </font></th> </tr> <tr> <th bgcolor="#303030" colspan="19" align="left"><font color="white">E.FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor="#303030" colspan="19" align="left"><font color="white">F.FSTYPE=ext3 MOUNT_OPTIONS=data=journal </font></th> </tr> <tr> <td colspan="19" bgcolor="#606060"><b><font color="white">#0:DD_MBCOUNT=768 </font></b></td></tr> <tr align="center" bgcolor="#c0c0c0"> <td></td> <td colspan="6"><b>REAL_TIME</b></td> <td 
colspan="6"><b>CPU_TIME</b></td> <td colspan="6"><b>DF</b></td> </tr> <tr align="center" bgcolor="#c0c0c0"> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>dd_writing_largefile</b></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 76.29</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 0.997</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.137 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.149 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.062 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 2.217 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>7.47</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.027 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.545</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.549</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.803 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.835 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 786432</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.000</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.001</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.001</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.001</u> </font></tt></td> <td bgcolor="#e0e0c0" 
align="right"><tt><font color="black"> <u> 1.001</u> </font></tt></td> </tr> <tr><td bgcolor="black" colspan="19"><font color="white"></font></td></tr> <tr><td colspan="19" align="right"> </td></tr><tr> <td colspan="19" bgcolor="#303030"><b><font color="white">NPROC=1 DIR=/mnt/testfs SYNC=off REP_COUNTER=3 GAMMA=0.0 DD_MBCOUNT=768 DEV=/dev/hdb3 WRITE_BUFFER=4096 FILE_SIZE=8000 BYTES=128000000 </font></b></td></tr> <tr><td colspan="19" align="right"> <font size="-2">Produced by <a href="http://namesys.com/benchmarks/mongo_readme.html">Mongo</a> benchmark suite.</font></td></tr> </tbody></table> <hr> <a name="bonnie++.2003.09.30"> This is bonnie++ output for reiser4 and ext3. This has been done in an attempt to analyze <a href="http://fsbench.netnation.com/">results</a> obtained by Mike Benoit. Hardware specs: <pre> processor : 3 vendor_id : GenuineIntel cpu family : 15 model : 2 model name : Intel(R) Xeon(TM) CPU 2.40GHz stepping : 7 cpu MHz : 2379.253 cache size : 512 KB bogomips : 4751.36 </pre> Dual CPU with hyper-threading Memory: 128M HDD: <pre> # hdparm /dev/hdb1 /dev/hdb1: multcount = 16 (on) IO_support = 0 (default 16-bit) unmaskirq = 0 (off) using_dma = 1 (on) keepsettings = 0 (off) readonly = 0 (off) readahead = 256 (on) geometry = 65535/16/63, sectors = 117226242, start = 63 # hdparm -t /dev/hdb1 /dev/hdb1: Timing buffered disk reads: 64 MB in 1.60 seconds = 39.91 MB/sec # hdparm -i /dev/hdb /dev/hdb: Model=ST360021A, FwRev=3.19, SerialNo=3HR173RB Config={ HardSect NotMFM HdSw>15uSec Fixed DTR>10Mbs RotSpdTol>.5% } RawCHS=16383/16/63, TrkSize=0, SectSize=0, ECCbytes=4 BuffType=unknown, BuffSize=2048kB, MaxMultSect=16, MultSect=16 CurCHS=16383/16/63, CurSects=16514064, LBA=yes, LBAsects=117231408 IORDY=on/off, tPIO={min:240,w/IORDY:120}, tDMA={min:120,rec:120} PIO modes: pio0 pio1 pio2 pio3 pio4 DMA modes: mdma0 mdma1 mdma2 UDMA modes: udma0 udma1 udma2 udma3 udma4 *udma5 AdvancedPM=no WriteCache=enabled Drive conforms to: device does not report version: 1 
2 3 4 5 </pre> <pre> ./bonnie++ -s 1g -n 10 -x 5 Version 1.03 ------Sequential Output------ --Sequential Input- --Random- -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks-- Machine Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP v4.128M 1G 19903 89 37911 20 15392 11 13624 58 41807 12 131.0 0 v4.128M 1G 19965 89 37600 20 15845 11 13730 58 41751 12 130.0 0 v4.128M 1G 19937 89 37746 20 15404 11 13624 58 41793 12 132.1 0 v4.128M 1G 19998 89 37184 19 15007 10 13393 56 41611 11 130.2 0 v4.128M 1G 19771 89 37679 20 15206 11 13466 57 41808 11 130.2 1 ext3.128M 1G 21236 99 37258 22 11357 4 13460 56 41748 6 120.0 0 ext3.128M 1G 20821 99 36838 23 12176 5 13154 55 40671 6 120.7 0 ext3.128M 1G 20755 99 37032 24 12069 4 12908 54 40851 5 120.2 0 ext3.128M 1G 20651 99 37094 24 11817 5 13038 54 40842 6 121.3 0 ext3.128M 1G 20928 99 37300 23 12287 4 13067 55 41404 6 120.1 0 ------Sequential Create------ --------Random Create-------- -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete-- files:max:min /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP v4.128M 10 18503 100 +++++ +++ 9488 99 10158 99 +++++ +++ 11635 99 v4.128M 10 19760 99 +++++ +++ 9696 99 10441 100 +++++ +++ 11831 99 v4.128M 10 19583 100 +++++ +++ 9672 100 10597 99 +++++ +++ 11846 100 v4.128M 10 19720 100 +++++ +++ 9577 99 10126 100 +++++ +++ 11924 100 v4.128M 10 19682 100 +++++ +++ 9683 100 10461 100 +++++ +++ 11834 100 ext3.128M 10 3279 97 +++++ +++ +++++ +++ 3406 100 +++++ +++ 8951 95 ext3.128M 10 3303 98 +++++ +++ +++++ +++ 3423 99 +++++ +++ 8558 96 ext3.128M 10 3317 98 +++++ +++ +++++ +++ 3402 100 +++++ +++ 8721 93 ext3.128M 10 3325 98 +++++ +++ +++++ +++ 3390 100 +++++ +++ 9242 100 ext3.128M 10 3315 97 +++++ +++ +++++ +++ 3439 100 +++++ +++ 8896 96 </pre> <pre> ./bonnie++ -f -d . 
-s 3072 -n 10:100000:10:10 -x 1 Version 1.03 ------Sequential Output------ --Sequential Input- --Random- -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks-- Machine Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP v4 3G 37579 19 15657 11 41531 11 105.8 0 v4 3G 37993 20 15478 11 41632 11 105.4 0 ext3 3G 35221 22 10987 4 41105 6 90.9 0 ext3 3G 35099 22 11517 4 41416 6 90.7 0 ------Sequential Create------ --------Random Create-------- -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete-- files:max:min /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP v4 10:100000:10/10 570 39 746 17 1435 23 513 40 104 2 951 15 v4 10:100000:10/10 566 40 765 17 1385 23 509 41 104 2 904 14 ext3 10:100000:10/10 221 8 364 4 853 4 204 7 99 1 306 2 ext3 10:100000:10/10 221 7 368 4 839 5 206 7 91 1 309 2 </pre> <hr> <a name="grant"></a> Benchmarks performed by <a href="mailto:mine0057@mrs.umn.edu">Grant Miner</a>. He used the <a href="http://epoxy.mrs.umn.edu/~minerg/fstests/bench.scm">bench.scm</a> script (requires <a href="http://www.scsh.net/">scsh</a>). Results (copied from <a href="http://epoxy.mrs.umn.edu/~minerg/fstests/results.html">http://epoxy.mrs.umn.edu/~minerg/fstests/results.html</a>): <p>2.6.0-test3</p> <p>mkfs ran with default options</p> <p>Each test has three columns: the first is the canonical name of the test, with the time the test took in seconds; the second column is system CPU time; the third column is user CPU time. The last column, "total", is the total time; "sys" is the total system time; "usr" is the total user time; "total cpu" is the sum of total system time and total user time.
</p> <p><b>all values are in seconds thus lower is better</b></p> <table border cellspacing=0 cellpadding=5> <caption>Filesystem Performance</caption> <colgroup> <col> <col bgcolor="gray"> </colgroup> <tr> <th>fs</th> <td bgcolor="lightgray">bigdir</td> <td>sys</td> <td>usr</td> <td bgcolor="lightgray">cp</td> <td>sys</td> <td>usr</td> <td bgcolor="lightgray">cp2</td> <td>sys</td> <td>usr</td> <td bgcolor="lightgray">cp3</td> <td>sys</td> <td>usr</td> <td bgcolor="lightgray">cp4</td> <td>sys</td> <td>usr</td> <td bgcolor="lightgray">cp5</td> <td>sys</td> <td>usr</td> <td bgcolor="lightgray">rm</td> <td>sys</td> <td>usr</td> <td bgcolor="lightgray">rm2</td> <td>sys</td> <td>usr</td> <td bgcolor="lightgray">rm3</td> <td>sys</td> <td>usr</td> <td bgcolor="lightgray">sync</td> <td>sys</td> <td>usr</td> <td bgcolor="lightgray">total</td> <td>sys</td> <td>usr</td> <td bgcolor="lightgray">total cpu</td> <th>fs</th> </tr> <tr> <th>reiserfs</th> <td bgcolor="lightgray">40.03</td> <td>12.22</td> <td>0.76</td> <td bgcolor="lightgray">77.75</td> <td>10.72</td> <td>0.45</td> <td bgcolor="lightgray">62.9</td> <td>10.82</td> <td>0.43</td> <td bgcolor="lightgray">60.26</td> <td>11.03</td> <td>0.43</td> <td bgcolor="lightgray">61.33</td> <td>11.13</td> <td>0.43</td> <td bgcolor="lightgray">66.08</td> <td>11.31</td> <td>0.45</td> <td bgcolor="lightgray">10.86</td> <td>3.74</td> <td>0.07</td> <td bgcolor="lightgray">4.62</td> <td>3.36</td> <td>0.09</td> <td bgcolor="lightgray">8.22</td> <td>3.5</td> <td>0.09</td> <td bgcolor="lightgray">1.78</td> <td>0.03</td> <td>0.</td> <td bgcolor="lightgray">393.83</td> <td>77.86</td> <td>3.2</td> <td bgcolor="lightgray">81.06</td> <th>reiserfs</th> </tr> <tr> <th>jfs</th> <td bgcolor="lightgray">47.2</td> <td>8.9</td> <td>0.77</td> <td bgcolor="lightgray">109.75</td> <td>5.5</td> <td>0.3</td> <td bgcolor="lightgray">110.71</td> <td>5.49</td> <td>0.35</td> <td bgcolor="lightgray">114.69</td> <td>5.6</td> <td>0.29</td> <td 
bgcolor="lightgray">117.97</td> <td>5.65</td> <td>0.35</td> <td bgcolor="lightgray">125.48</td> <td>5.82</td> <td>0.29</td> <td bgcolor="lightgray">38.68</td> <td>0.74</td> <td>0.05</td> <td bgcolor="lightgray">16.25</td> <td>1.08</td> <td>0.07</td> <td bgcolor="lightgray">37.46</td> <td>0.74</td> <td>0.04</td> <td bgcolor="lightgray">0.07</td> <td>0.</td> <td>0.</td> <td bgcolor="lightgray">718.26</td> <td>39.52</td> <td>2.51</td> <td bgcolor="lightgray">42.03</td> <th>jfs</th> </tr> <tr> <th>xfs</th> <td bgcolor="lightgray">44.77</td> <td>13.3</td> <td>0.94</td> <td bgcolor="lightgray">105.36</td> <td>13.33</td> <td>0.53</td> <td bgcolor="lightgray">110.27</td> <td>14.36</td> <td>0.5</td> <td bgcolor="lightgray">110.17</td> <td>14.37</td> <td>0.51</td> <td bgcolor="lightgray">111.03</td> <td>14.43</td> <td>0.53</td> <td bgcolor="lightgray">118.84</td> <td>14.87</td> <td>0.55</td> <td bgcolor="lightgray">31.85</td> <td>6.44</td> <td>0.15</td> <td bgcolor="lightgray">15.2</td> <td>5.45</td> <td>0.14</td> <td bgcolor="lightgray">34.32</td> <td>5.87</td> <td>0.14</td> <td bgcolor="lightgray">0.03</td> <td>0.</td> <td>0.</td> <td bgcolor="lightgray">681.84</td> <td>102.42</td> <td>3.99</td> <td bgcolor="lightgray">106.41</td> <th>xfs</th> </tr> <tr> <th>reiser4</th> <td bgcolor="lightgray">33.51</td> <td>10.85</td> <td>0.69</td> <td bgcolor="lightgray">33.9</td> <td>10.65</td> <td>0.65</td> <td bgcolor="lightgray">32.9</td> <td>10.79</td> <td>0.67</td> <td bgcolor="lightgray">34.</td> <td>10.87</td> <td>0.65</td> <td bgcolor="lightgray">33.62</td> <td>10.87</td> <td>0.69</td> <td bgcolor="lightgray">31.31</td> <td>10.83</td> <td>0.76</td> <td bgcolor="lightgray">17.45</td> <td>4.07</td> <td>0.3</td> <td bgcolor="lightgray">11.54</td> <td>4.49</td> <td>0.3</td> <td bgcolor="lightgray">13.08</td> <td>4.27</td> <td>0.27</td> <td bgcolor="lightgray">0.52</td> <td>0.</td> <td>0.</td> <td bgcolor="lightgray">241.83</td> <td>77.69</td> <td>4.98</td> <td 
bgcolor="lightgray">82.67</td> <th>reiser4</th> </tr> <tr> <th>ext3</th> <td bgcolor="lightgray">38.79</td> <td>9.35</td> <td>0.7</td> <td bgcolor="lightgray">91.57</td> <td>7.21</td> <td>0.36</td> <td bgcolor="lightgray">62.6</td> <td>7.44</td> <td>0.36</td> <td bgcolor="lightgray">62.74</td> <td>7.5</td> <td>0.37</td> <td bgcolor="lightgray">60.62</td> <td>7.52</td> <td>0.34</td> <td bgcolor="lightgray">69.82</td> <td>7.59</td> <td>0.39</td> <td bgcolor="lightgray">26.21</td> <td>1.67</td> <td>0.05</td> <td bgcolor="lightgray">8.73</td> <td>1.66</td> <td>0.04</td> <td bgcolor="lightgray">13.79</td> <td>1.63</td> <td>0.06</td> <td bgcolor="lightgray">4.76</td> <td>0.01</td> <td>0.</td> <td bgcolor="lightgray">439.63</td> <td>51.58</td> <td>2.67</td> <td bgcolor="lightgray">54.25</td> <th>ext3</th> </tr> <tr> <th>ext2</th> <td bgcolor="lightgray">32.78</td> <td>7.61</td> <td>0.64</td> <td bgcolor="lightgray">37.28</td> <td>5.24</td> <td>0.34</td> <td bgcolor="lightgray">43.55</td> <td>5.34</td> <td>0.35</td> <td bgcolor="lightgray">45.41</td> <td>5.34</td> <td>0.37</td> <td bgcolor="lightgray">47.72</td> <td>5.48</td> <td>0.34</td> <td bgcolor="lightgray">50.5</td> <td>5.41</td> <td>0.32</td> <td bgcolor="lightgray">16.28</td> <td>0.67</td> <td>0.06</td> <td bgcolor="lightgray">7.54</td> <td>0.66</td> <td>0.05</td> <td bgcolor="lightgray">15.31</td> <td>0.71</td> <td>0.05</td> <td bgcolor="lightgray">0.24</td> <td>0.</td> <td>0.</td> <td bgcolor="lightgray">296.61</td> <td>36.46</td> <td>2.52</td> <td bgcolor="lightgray">38.98</td> <th>ext2</th> </tr> </table> <hr> </body> </html> <hr> <address><a href="mailto:reiser@namesys.com">Hans Reiser</a></address> <!-- Created: Sat Aug 23 00:28:46 MSD 2003 --> <!-- hhmts start --> Last modified: Thu Nov 20 17:51:10 MSK 2003 <!-- hhmts end --> </body> </html> [[category:Reiser4]] [[category:formatting-fixes-needed]] 1798007ac61883eb3f0b0fc6b703f6d7d10fc5ab 1494 1493 2009-06-27T09:41:32Z Chris goe 2 /* t */ == Benchmarks Of Reiser4 == The <tt>htree</tt> (<tt>-O dir_index</tt>) feature is the recent attempt by ext3 developers to handle large directories as well as ReiserFS by using better than linear
search algorithms. One of the interesting results here was that <tt>htree</tt> does bad things to ext3 performance, at least for this benchmark. This means that trying to get usable performance for large directories with ext3 can severely impact performance for the non-large case. You'll note that in our latest benchmark at the top we use larger filesets. It seems that ext3 does a poor job of utilizing its write cache when the fileset uses a lot of memory without exceeding it, and by increasing the size of the fileset we get a fairer (read: better for ext3) benchmark for the create phase. The use of filesets small enough to barely fit into RAM for the create (but not the copy) phase was due to my being lax in supervising the benchmarking, but it did reveal something interesting. Probably Andrew Morton will fix that pretty quickly; it's most likely not a deep fix to make, as fixing <tt>htree</tt> would be. If anyone knows where the tail-combining patch for ext3 went, let us know so we can benchmark it; good tail-combining performance is not trivial to get right, and I wonder whether there is a performance reason it did not go in. Keep in mind that these benchmarks are still evolving and maturing, and I need to give the mongo code a complete review again, as it has been worked on by others quite a bit. Note that while I like the mongo benchmarks, those concerned they may be stacked in our favor can look at the benchmarks run by others on lkml, one of which is at the bottom of this page; while not as elaborate and detailed as mongo, it comes up with roughly the same result. Andrew Morton wrote some beautiful readahead code in the VM, and many thanks to him for what it contributes to V4 performance. Unfortunately, it should be confessed that these benchmarks utterly fail to measure its cleverness for real-world usage patterns.
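The <tt>htree</tt> trade-off discussed above is easy to probe directly. As a minimal sketch (assuming the standard e2fsprogs tools <tt>mkfs.ext3</tt> and <tt>dumpe2fs</tt>; the image paths and sizes are arbitrary and are not part of the original benchmark scripts), two small ext3 images can be formatted with and without <tt>dir_index</tt> and then benchmarked side by side:

```shell
# Build two small ext3 images, one with htree directory indexing
# (dir_index) enabled and one with it disabled, for an A/B run.
dd if=/dev/zero of=/tmp/htree.img bs=1M count=64 2>/dev/null
dd if=/dev/zero of=/tmp/plain.img bs=1M count=64 2>/dev/null
mkfs.ext3 -q -F -O dir_index    /tmp/htree.img
mkfs.ext3 -q -F -O '^dir_index' /tmp/plain.img
# Check which image actually carries the feature:
dumpe2fs -h /tmp/htree.img 2>/dev/null | grep 'Filesystem features'
dumpe2fs -h /tmp/plain.img 2>/dev/null | grep 'Filesystem features'
```

Each image can then be loop-mounted (as root) and handed to mongo via its <tt>DIR=</tt> parameter, so the same fileset runs against both configurations.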
In fact, these benchmarks basically access everything once in each pass, which is not at all realistic as a representation of typical server workloads. So understand them as validly illuminating some aspects of performance, not all aspects, if you could be so generous. We ran data-ordered ext3 benchmarks at the suggestion of Andrew Morton, but they came out slower for this benchmark. We need to increase the base size range to 8k and run again. [[Reiser4]] is a fully atomic filesystem; keep in mind that these performance numbers are obtained with every FS operation performed as a fully atomic transaction. We are the first to make that performance-effective. Look for a user-space transactions interface to come out soon. Finally, remember that Reiser4 is more space efficient than [[ReiserFS]]; the <tt>df(1)</tt> measurements are there for looking at....;-) * [[#mongo.2.6.15-mm4|linux-2.6.15-mm4]] mongo comparison, ext3 vs reiser4 with "unixfile" regular file plugin and reiser4 with "cryptcompress" regular file plugin * [[#mongo.2.6.11|linux-2.6.11]] mongo comparison against xfs and ext2 * [[#mongo.2.6.8.1-mm3|linux-2.6.8.1-mm3]] mongo comparison against ext3 * [[#slow.2004.03.26|slow.c]] comparison against ext2 and ext3 (2004-03-26) * [[#mongo.2003.11.20|mongo]] comparison against ext3 (2003-11-20) * [[#bonnie++.2003.09.30|Bonnie++]] comparison, ext3 vs reiser4 (2003-09-30) * [[#mongo.2003.09.25|mongo]] comparison against ext3 (2003-09-25) <!-- <li>2003.08.28 mongo <a href="#mongo.2003.08.28">comparison</a> against <tt>ext3</tt> </li> <li>2003.08.27 mongo <a href="#mongo.2003.08.27">comparison</a> against <tt>ext3</tt> </li> <li>2003.08.26 mongo <a href="#mongo.2003.08.26">comparison</a> against <tt>ext3</tt> </li> <li>2003.08.18 mongo <a href="#mongo.2003.08.18">comparison</a> against <tt>ext3</tt> </li> <li>2003.08.12 mongo <a href="#mongo.2003.08.12">comparison</a> against <tt>ext3</tt> </li> --> * [[#mongo.2003.08.28|older mongo results]] (2003-08-28) * 
[[#mongo.2003.07.10|mongo]] comparison, reiserfs vs. reiser4 (2003-07-10, obtained before [http://mail.fsfeurope.org/pipermail/booth/2003-February/000083.html LinuxTAG 2003]) * external benchmarks [[#grant|by Grant Miner]] === mongo.2.6.15-mm4 === * linux-2.6.15-mm4, [[mongo]] results Comparative results of the mongo benchmark for ext3 vs. reiser4 with the "unixfile" regular file plugin vs. reiser4 with the [ftp://ftp.namesys.com/pub/tmp/cryptcompress_patches cryptcompress] regular file plugin. <dl> <dt>reiser4 </dt> <dd>2.6.15-mm4 cryptcompress-4.patch</dd> <dt>mem total</dt> <dd>516312</dd> <dt>machine </dt> <dd>Intel(R) Xeon(TM) CPU 2.40GHz, <b>running UP kernel</b></dd> <dt>kernel </dt> <dd>2.6.15-mm4 #1 Sat Feb 11 20:00:11 MSK 2006</dd> <dt>date </dt> <dd>Sat Feb 11 21:03:21 2006</dd> <dd>Sat Feb 11 21:18:43 2006</dd> <dd>Sat Feb 11 21:37:52 2006</dd> </dl> <p>Legend:</p> <ul> <li><tt>A</tt> reiser4 with "cryptcompress" regular file plugin</li> <li><tt>B</tt> reiser4 with "unixfile" regular file plugin</li> <li><tt>C</tt> ext3</li> </ul> <p> The table presents absolute values (of elapsed time, CPU usage, CPU utilization, and disk usage) for reiser4 with the "cryptcompress" regular file plugin, and ratios against it for reiser4 with the "unixfile" regular file plugin and for ext3. A <font color=red>red</font> number means the ratio is larger than <tt>1.0</tt>, that is, reiser4 with the "cryptcompress" regular file plugin is better in this test. A <font color=green>green</font> number means that it loses in this test.
</p> <table cols=13 cellpadding=2 cellspacing=2 noborder> <tr><td bgcolor=black colspan=13><font color=white></td></tr> <tr> <th bgcolor=#303030 colspan=13 align=left><font color=white>A.MKFS=mkfs.reiser4 -y -o create=create_ccreg40,compressMode=col8 MOUNT_OPTIONS=noatime FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=13 align=left><font color=white>B.MKFS=mkfs.reiser4 -y MOUNT_OPTIONS=noatime FSTYPE=reiser4 (unixfile regular file plugin)</font></th> </tr> <tr> <th bgcolor=#303030 colspan=13 align=left><font color=white>C.MOUNT_OPTIONS=noatime,data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <td colspan=13 bgcolor=#606060><b><font color=white>#0:</font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=3><b>REAL_TIME</b></td> <td colspan=3><b>CPU_TIME</b></td> <td colspan=3><b>CPU_UTIL</b></td> <td colspan=3><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>CREATE</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 53.36</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.234 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 4.249 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>28.79</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.493</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>94.36</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.255 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.155</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 775856</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font 
color=red> 2.550 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.825 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>COPY</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 137.6</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.543 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.931 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>40.91</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.716</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.975 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>59.94</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.257 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.183</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1551756</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.550 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.825 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>READ</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 161.17</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.087 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.077 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>48.35</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.433 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.195</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>33.23</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.487 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.291</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1551756</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.550 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font 
color=red> 2.825 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>STATS</b></td> <td bgcolor=#E0E0C0 align=right><tt>24.12</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.936</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.927</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>6.76</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.941 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.624</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>27.97</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.005 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.676</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1551756</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.550 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.825 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>DELETE</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 155.26</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.091 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 0.989</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>38.76</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.824 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.108</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>26.33</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.758 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.104</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>4</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.000 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> </tt></td> </tr> <tr> <td 
colspan=13 bgcolor=#606060><b><font color=white>#1:DD_MBCOUNT=5000 </font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=3><b>REAL_TIME</b></td> <td colspan=3><b>CPU_TIME</b></td> <td colspan=3><b>CPU_UTIL</b></td> <td colspan=3><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_writing_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 116.02</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.430 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.553 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>38.65</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.514</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.619 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>92.86</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.155 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.149</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1909012</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.682 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.685 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_reading_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 153.76</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 0.996</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>58.11</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.192 </font></tt></td> <td bgcolor=#E0E0C0 
align=right><tt><font color=green> <U> 0.147</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>38.73</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.224 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.152</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1909012</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.682 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.685 </font></tt></td> </tt></td> </tr> <tr><td bgcolor=black colspan=13><font color=white></td></tr> <tr><td colspan=13 align=right> <tr> <td colspan=13 bgcolor=#303030><b><font color=white>DIR=/mnt1 GAMMA=0.2 WRITE_BUFFER=131072 PHASE_APPEND=off SYNC=off PHASE_DELETE=rm NPROC=1 DEV=/dev/hda9 DD_MBCOUNT=5000 FILE_SIZE=8192 REP_COUNTER=1 PHASE_COPY=cp INFO_R4=2.6.15-mm4 cryptcompress-4.patch PHASE_READ=find BYTES=1024000000 PHASE_OVERWRITE=off PHASE_MODIFY=off </td></tr> <tr><td colspan=13 align=right> <font size=-2>Produced by <a href=http://namesys.com/benchmarks/mongo_readme.html>Mongo</a> benchmark suite.</font></td></tr> </table> <!-- <p><b>Legend:</b> <font color="green">green</font> color means the result is better (less) than reference value from the first column, results marked as <font color="red">red</font> are worse than reference value, best results are <u>underlined</u> other results which fit into 2% margin of the best result are underlined also.</p> --><p><a href="http://www.namesys.com/intbenchmarks/mongo/06.02.11.belka.crc/charts/comp.html">The same results in the charts</a></p> <hr> <a name="mongo.2.6.11"></a> linux-2.6.11 <a href="benchmarks/mongo_readme.html">mongo</a> results <dl> <dt>reiser4 </dt> <dd>reiser4-for-2.6.11-5.patch from <a href="ftp://ftp.namesys.com/pub/reiser4-for-2.6/2.6.11">ftp://ftp.namesys.com/pub/reiser4-for-2.6/2.6.11</a> </dd> <dt>mem total</dt> <dd>254496</dd> <dt>machine </dt> <dd>bones</dd> <dt>kernel </dt> <dd>2.6.11-reiser4-5 
#2 SMP Sat Jun 4 20:06:47 MSD 2005</dd> <dt>date </dt> <dd>Fri Jun 17 23:52:17 2005</dd> </dl> <p> In this test 81% of files are chosen from the 0-10k size range and 19% from the 10-100k size range. </p> <!-- File stats: Units are decimal (1k = 1000) files 0-100 : 1433 files 100-1K : 12597 files 1K-10K : 103101 files 10K-100K : 28131 files 100K-1M : 0 files 1M-10M : 0 files 10M-larger : 0 total bytes written : 1886585039 --> <p>Legend:</p> <ul> <li><tt>A</tt> reiser4</li> <li><tt>B</tt> reiserfs <tt>v3 (notail)</tt></li> <li><tt>C</tt> ext2</li> <li><tt>D</tt> xfs default</li> </ul> <p> Table presents absolute values (of elapsed time, CPU usage, CPU utilization, disk usage) for reiser4, and ratios against reiser4 for all other configurations. <font color=red>Red</font> number means ratio is larger than <tt>1.0</tt>, that is, reiser4 is better in this test. <font color=green>Green</font> number means that reiser4 loses in this test. </p> <table cols=17 cellpadding=2 cellspacing=2 noborder> <tr><td bgcolor=black colspan=17><font color=white></td></tr> <tr> <th bgcolor=#303030 colspan=17 align=left><font color=white>A.FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=17 align=left><font color=white>B.FSTYPE=reiserfs MOUNT_OPTIONS=notail </font></th> </tr> <tr> <th bgcolor=#303030 colspan=17 align=left><font color=white>C.FSTYPE=ext2 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=17 align=left><font color=white>D.MKFS=mkfs.xfs -f FSTYPE=xfs </font></th> </tr> <tr> <td colspan=17 bgcolor=#606060><b><font color=white>#0:</font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=4><b>REAL_TIME</b></td> <td colspan=4><b>CPU_TIME</b></td> <td colspan=4><b>CPU_UTIL</b></td> <td colspan=4><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td> <td><b>A</b></td><td><b>B/A 
</b></td><td><b>C/A </b></td><td><b>D/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>CREATE</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 66.12</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.022 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.686 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 4.288 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>34.98</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.901</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.114 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.445 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>29.86</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.424 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.398</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.398</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1623204</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.086 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.098 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>COPY</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 187.77</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.438 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.751 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.733 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>44.8</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.883</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.124 </font></tt></td> <td bgcolor=#E0E0C0 
align=right><tt><font color=red> 1.161 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>14.85</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.606 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.611 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.353</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 3245428</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.087 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.098 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>READ</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 151.01</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.459 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.113 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.978 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>44.34</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.607 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.470</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.535 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>18.54</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.444</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.500 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.724 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 3245428</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.087 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.098 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>STATS</b></td> 
<td bgcolor=#E0E0C0 align=right><tt>22.04</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.314 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.812</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.871 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>8.61</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.698 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.571</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 4.591 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>20.11</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.528</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.709 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.579 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 3245428</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.087 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.098 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>DELETE</b></td> <td bgcolor=#E0E0C0 align=right><tt>108.77</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.313</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.193 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.071 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>41</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.637 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.091</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.795 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>21.45</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.795 
</font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.077</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.556 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>4</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 14877.000 </font></tt></td> </tt></td> </tr> <tr> <td colspan=17 bgcolor=#606060><b><font color=white>#1:DD_MBCOUNT=5000 </font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=4><b>REAL_TIME</b></td> <td colspan=4><b>CPU_TIME</b></td> <td colspan=4><b>CPU_UTIL</b></td> <td colspan=4><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_writing_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 536.06</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.005 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.017 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 0.982</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>122.28</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.826 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.819</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.806</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>14.99</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.771 </font></tt></td> <td 
bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.711</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.742 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 5120008</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.012</U> </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_reading_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt>145.32</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.031 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.965</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 0.982</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>157.51</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.947 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.890</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.880</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>57.01</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.901</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.909 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.884</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 5120008</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.012</U> </font></tt></td> </tt></td> </tr> <tr><td bgcolor=black colspan=17><font color=white></td></tr> <tr><td colspan=17 align=right> <tr> <td colspan=17 bgcolor=#303030><b><font 
color=white>INFO_R4=2.6.11 + reiser4-5 REP_COUNTER=1 DEV=/dev/hda5 DD_MBCOUNT=5000 PHASE_OVERWRITE=off FILE_SIZE=8192 NPROC=3 PHASE_READ=find PHASE_DELETE=rm PHASE_APPEND=off WRITE_BUFFER=131072 DIR=/mnt1 PHASE_MODIFY=off BYTES=1024000000 PHASE_COPY=cp GAMMA=0.2 SYNC=off </td></tr> <tr><td colspan=17 align=right> <font size=-2>Produced by <a href=http://namesys.com/benchmarks/mongo_readme.html>Mongo</a> benchmark suite.</font></td></tr> </table> <hr> <a name="mongo.2.6.8.1-mm3"></a> linux-2.6.8.1-mm3 <a href="benchmarks/mongo_readme.html">mongo</a> results <dl> <dt>reiser4 </dt> <dd>large key</dd> <dt>mem total</dt> <dd>254324</dd> <dt>machine </dt> <dd>bones</dd> <dt>kernel </dt> <dd>2.6.8.1-mm3 #3 SMP Mon Aug 23 19:33:13 MSD 2004</dd> <dt>date </dt> <dd>Tue Aug 31 15:47:51 2004</dd> </dl> <p> In this test 81% of files are chosen from the 0-10k size range and 19% from the 10-100k size range. </p> <!-- File stats: Units are decimal (1k = 1000) files 0-100 : 1433 files 100-1K : 12597 files 1K-10K : 103101 files 10K-100K : 28131 files 100K-1M : 0 files 1M-10M : 0 files 10M-larger : 0 total bytes written : 1886585039 --> <p>Legend:</p> <ul> <li><tt>A</tt> reiser4</li> <li><tt>B</tt> reiser4, extents only</li> <li><tt>C</tt> reiserfs <tt>v3 (notail)</tt></li> <li><tt>D</tt> ext3 in <tt>data=writeback</tt> mode (meta-data only journalling)</li> <li><tt>E</tt> ext3 in <tt>data=journal</tt> mode</li> <li><tt>F</tt> ext3 in <tt>data=ordered</tt> mode</li> </ul> <img src="http://www.namesys.com/intbenchmarks/mongo/04.08.26/256MB.RAM/one-thread-8k.g02.charts/CREATE.0.png"> <img src="http://www.namesys.com/intbenchmarks/mongo/04.08.26/256MB.RAM/one-thread-8k.g02.charts/COPY.0.png"> <img src="http://www.namesys.com/intbenchmarks/mongo/04.08.26/256MB.RAM/one-thread-8k.g02.charts/READ.0.png"> <img src="http://www.namesys.com/intbenchmarks/mongo/04.08.26/256MB.RAM/one-thread-8k.g02.charts/STATS.0.png"> <img 
src="http://www.namesys.com/intbenchmarks/mongo/04.08.26/256MB.RAM/one-thread-8k.g02.charts/DELETE.0.png"> <img src="http://www.namesys.com/intbenchmarks/mongo/04.08.26/256MB.RAM/one-thread-8k.g02.charts/dd_writing_largefile.1.png"> <img src="http://www.namesys.com/intbenchmarks/mongo/04.08.26/256MB.RAM/one-thread-8k.g02.charts/dd_reading_largefile.1.png"> <p> Table presents absolute values (of elapsed time, CPU usage, CPU utilization, disk usage) for reiser4, and ratios against reiser4 for all other configurations. <font color=red>Red</font> number means ratio is larger than <tt>1.0</tt>, that is, reiser4 is better in this test. <font color=green>Green</font> number means that reiser4 loses in this test. </p> <table cols=25 cellpadding=2 cellspacing=2 noborder> <tr><td bgcolor=black colspan=25><font color=white></td></tr> <tr> <th bgcolor=#303030 colspan=25 align=left><font color=white>A.FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=25 align=left><font color=white>B.FSTYPE=reiser4 MKFS=mkfs.reiser4 -q -o extent=extent40 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=25 align=left><font color=white>C.MOUNT_OPTIONS=notail FSTYPE=reiserfs </font></th> </tr> <tr> <th bgcolor=#303030 colspan=25 align=left><font color=white>D.MOUNT_OPTIONS="data=writeback" FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=25 align=left><font color=white>E.MOUNT_OPTIONS="data=journal" FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=25 align=left><font color=white>F.MOUNT_OPTIONS="data=ordered" FSTYPE=ext3 </font></th> </tr> <tr> <td colspan=25 bgcolor=#606060><b><font color=white>#0:</font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=6><b>REAL_TIME</b></td> <td colspan=6><b>CPU_TIME</b></td> <td colspan=6><b>CPU_UTIL</b></td> <td colspan=6><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A 
</b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>CREATE</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 91.6</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 0.988</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.983 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.592 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.010 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.256 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>31.13</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.965 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.826</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.577 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.529 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.802 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>22.63</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.981 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.350</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.791 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.738 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.000 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1978440</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 
align=right><tt><font color=red> 1.088 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>COPY</b></td> <td bgcolor=#E0E0C0 align=right><tt>219.5</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.968</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.674 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.241 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.105 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.819 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>54.04</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.938 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.792</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.694 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.004 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.860 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>16.01</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.996 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.460</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.663 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.839 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.890 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 3956708</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.088 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font 
color=red> 1.108 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>READ</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 187.34</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.007</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.617 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.282 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.295 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.250 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>38.61</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.002 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.711 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.615</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.622</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.615</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>13.05</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.995 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.441</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.520 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.517 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.533 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 3956708</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.088 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> <td bgcolor=#E0E0C0 
align=right><tt><font color=red> 1.108 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>STATS</b></td> <td bgcolor=#E0E0C0 align=right><tt>23.71</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.968 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.162 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.943</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.943</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.943</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>10.91</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.944 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.717 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.661</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.674 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.658</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>24.46</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.971 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.587</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.700 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.707 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.697 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 3956708</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.088 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> <td 
bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>DELETE</b></td> <td bgcolor=#E0E0C0 align=right><tt>156.84</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.993 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.233</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.264 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.270 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.216 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>53.05</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.938 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.440 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.209</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.215 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.214 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>18.23</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.947 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.758 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.157</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.160 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.167 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>4</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.000 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> 
</tt></td> </tr> <tr> <td colspan=25 bgcolor=#606060><b><font color=white>#1:DD_MBCOUNT=768 </font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=6><b>REAL_TIME</b></td> <td colspan=6><b>CPU_TIME</b></td> <td colspan=6><b>CPU_UTIL</b></td> <td colspan=6><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_writing_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 30.09</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.006</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.286 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.342 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.473 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.311 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>5.24</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.996 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.966</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.286 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.393 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.437 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>11.43</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.994 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 
0.631</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.796 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.655 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.967 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 786436</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_reading_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt>28.38</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.969</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.010 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 0.980</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 0.982</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.999 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>4.37</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.979 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.014 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.911</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.895</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.936 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>8.88</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.030 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.922 
</font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.858</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.854</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.867</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 786436</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr><td bgcolor=black colspan=25><font color=white></td></tr> <tr><td colspan=25 align=right> <tr> <td colspan=25 bgcolor=#303030><b><font color=white>REP_COUNTER=1 PHASE_COPY=cp INFO_R4=2.6.8.1-mm3 + parse_options.patch FILE_SIZE=8192 DEV=/dev/hda6 PHASE_MODIFY=off DD_MBCOUNT=768 PHASE_APPEND=off PHASE_OVERWRITE=off SYNC=off DIR=/mnt1 PHASE_DELETE=rm NPROC=1 BYTES=1024000000 GAMMA=0.2 PHASE_READ=find WRITE_BUFFER=131072 </td></tr> <tr><td colspan=25 align=right> <font size=-2>Produced by <a href=http://namesys.com/>Mongo</a> benchmark suite.</font></td></tr> </table> <hr> <a name="slow.2004.03.26">2004.03.26 slow.c benchmark results</a> <p> These are the <a href="http://www.jburgess.uklinux.net/slow.c">slow.c</a> benchmark results for the latest 2004.03.26 reiser4 snapshot. </p> <p> <b>slow.c</b> is a simple program by Jon Burgess which writes and reads multiple data streams. For the details and the source code, see <a href="http://marc.theaimsgroup.com/?l=linux-kernel&m=107652683608384&w=2"> the discussion</a> in the linux-kernel mailing list. 
</p> <p> kernel : 2.6.5-rc2</p> <p> RAM : 256Mb</p> <p> reiser4 : <a href="http://www.namesys.com/snapshots/2004.03.26/">2004.03.26 snapshot</a></p> <p>Hardware specs:</p> <pre> processor : 1 vendor_id : AuthenticAMD cpu family : 6 model : 6 model name : AMD Athlon(tm) Processor stepping : 2 cpu MHz : 1460.098 cache size : 256 KB bogomips : 2916.35 Dual CPU AMD Athlon(tm) 1.4Ghz </pre> <pre> # hdparm /dev/hda6: multcount = 16 (on) IO_support = 1 (32-bit) unmaskirq = 1 (on) using_dma = 1 (on) keepsettings = 0 (off) readonly = 0 (off) readahead = 256 (on) geometry = 65535/16/63, sectors = 35937342, start = 84164598 </pre> <pre> # hdparm -t /dev/hda6 /dev/hda6: Timing buffered disk reads: 84 MB in 3.07 seconds = 27.39 MB/sec </pre> <pre> # hdparm -i /dev/hda /dev/hda: Model=IC35L060AVER07-0, FwRev=ER6OA44A, SerialNo=SZPTZMB6154 Config={ HardSect NotMFM HdSw>15uSec Fixed DTR>10Mbs } RawCHS=16383/16/63, TrkSize=0, SectSize=0, ECCbytes=40 BuffType=DualPortCache, BuffSize=1916kB, MaxMultSect=16, MultSect=16 CurCHS=16383/16/63, CurSects=16514064, LBA=yes, LBAsects=120103200 IORDY=on/off, tPIO={min:240,w/IORDY:120}, tDMA={min:120,rec:120} PIO modes: pio0 pio1 pio2 pio3 pio4 DMA modes: mdma0 mdma1 mdma2 UDMA modes: udma0 udma1 udma2 AdvancedPM=yes: disabled (255) WriteCache=enabled Drive conforms to: ATA/ATAPI-5 T13 1321D revision 1: * signifies the current active mode </pre> <pre> <!-- (500Mb of data) test : ./slow foo 500 Results : ============================================================== | 1 stream | 2 streams --------------+----------------------------------------------- | WRITE READ | WRITE READ --------------+----------------------------------------------- ext2 25.08Mb/s 27.08Mb/s 13.72Mb/s 14.04Mb/s reiser4 26.31Mb/s 26.99Mb/s 24.03Mb/s 26.84Mb/s reiser4-extents 25.28Mb/s 27.40Mb/s 24.12Mb/s 26.85Mb/s ext3-ordered 20.99Mb/s 26.40Mb/s 12.01Mb/s 13.34Mb/s ext3-journal 10.13Mb/s 24.48Mb/s 8.87Mb/s 13.26Mb/s reiserfs 20.42Mb/s 27.67Mb/s 12.98Mb/s 13.13Mb/s 
reiserfs-notail 20.07Mb/s 27.58Mb/s 13.04Mb/s 13.25Mb/s ============================================================== --> (1000Mb of data) test : ./slow foo 1000 Results : <!-- ============================================================================================================== | 1 stream | 2 streams | 4 streams | 8 stream --------------+----------------------------------------------------------------------------------------------- | WRITE READ | WRITE READ | WRITE READ | WRITE READ --------------+----------------------------------------------------------------------------------------------- ext2 24.66Mb/s 27.56Mb/s 13.40Mb/s 13.67Mb/s 7.73Mb/s 6.94Mb/s 6.69Mb/s 3.52Mb/s reiser4 25.42Mb/s 27.71Mb/s 23.96Mb/s 26.34Mb/s 24.55Mb/s 26.58Mb/s 24.90Mb/s 26.76Mb/s reiser4-extents 25.60Mb/s 27.68Mb/s 24.19Mb/s 25.92Mb/s 25.24Mb/s 27.12Mb/s 25.39Mb/s 26.72Mb/s ext3-ordered 20.05Mb/s 26.46Mb/s 11.06Mb/s 13.12Mb/s 9.63Mb/s 6.76Mb/s 10.02Mb/s 3.48Mb/s ext3-journal 10.10Mb/s 26.81Mb/s 8.87Mb/s 13.08Mb/s 8.59Mb/s 6.84Mb/s 8.14Mb/s 3.47Mb/s reiserfs 20.19Mb/s 27.48Mb/s 12.69Mb/s 13.03Mb/s 8.27Mb/s 6.84Mb/s 7.87Mb/s 4.13Mb/s reiserfs-notail 20.31Mb/s 27.10Mb/s 12.74Mb/s 13.09Mb/s 8.33Mb/s 6.89Mb/s 7.87Mb/s 4.17Mb/s ============================================================================================================= --> </pre> <table> <tr> <td><img src="intbenchmarks/slow/04.03.25-int.snapshot.bones/wr.1.png"></td> <td><img src="intbenchmarks/slow/04.03.25-int.snapshot.bones/wr.2.png"></td> <td><img src="intbenchmarks/slow/04.03.25-int.snapshot.bones/wr.4.png"></td> <td><img src="intbenchmarks/slow/04.03.25-int.snapshot.bones/wr.8.png"></td> </tr> <tr> <td><img src="intbenchmarks/slow/04.03.25-int.snapshot.bones/rd.1.png"></td> <td><img src="intbenchmarks/slow/04.03.25-int.snapshot.bones/rd.2.png"></td> <td><img src="intbenchmarks/slow/04.03.25-int.snapshot.bones/rd.4.png"></td> <td><img src="intbenchmarks/slow/04.03.25-int.snapshot.bones/rd.8.png"></td> </tr> 
</table> <hr> <a name="mongo.2003.11.20"></a>2003.11.20 <a href="benchmarks/mongo_readme.html">mongo</a> results <dl> <dt>reiser4 </dt> <dd>''</dd> <dt>mem total</dt> <dd>255716</dd> <dt>machine </dt> <dd>belka</dd> <dt>kernel </dt> <dd>2.6.0-test9 #2 SMP Thu Nov 20 16:08:42 MSK 2003</dd> <dt>date </dt> <dd>Thu Nov 20 16:16:50 2003</dd> </dl> <p> In this test 80% of files are chosen from the 0-8k size range, 16% from the 0-80k size range, 3.2% (i.e. 0.8 x 4%) from the 0-800k size range, etc. Most files are small, but most bytes are in large files. </p> <p>Legend:</p> <ul> <li><tt>A</tt> reiser4</li> <li><tt>B</tt> reiser4, extents only</li> <li><tt>C</tt> reiserfs <tt>v3</tt></li> <li><tt>D</tt> ext3 in <tt>data=writeback</tt> mode (meta-data only journalling)</li> <li><tt>E</tt> ext3 in <tt>data=journal</tt> mode</li> <li><tt>F</tt> ext3 in <tt>data=ordered</tt> mode</li> <li><tt>G</tt> ext3 with htree (hashed directories)</li> </ul> <p> The table presents absolute values (of elapsed time, CPU usage, and disk usage) for reiser4, and ratios against reiser4 for all other configurations. A <font color=red>red</font> number means the ratio is larger than <tt>1.0</tt>, that is, reiser4 is better in this test. A <font color=green>green</font> number means that reiser4 loses in this test. 
</p> <table cols=22 cellpadding=2 cellspacing=2 noborder> <tr><td bgcolor=black colspan=22><font color=white></td></tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>A.INFO_R4='' FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>B.INFO_R4='' MKFS=mkfs.reiser4 -q -o policy=extents FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>C.FSTYPE=reiserfs </font></th> </tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>D.MOUNT_OPTIONS=data=writeback FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>E.MOUNT_OPTIONS=data=journal FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>F.MOUNT_OPTIONS=data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>G.MKFS=mkfs.ext3 -O dir_index MOUNT_OPTIONS=data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <td colspan=22 bgcolor=#606060><b><font color=white>#0:</font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=7><b>REAL_TIME</b></td> <td colspan=7><b>CPU_TIME</b></td> <td colspan=7><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>CREATE</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 21.81</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.171 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.983 </font></tt></td> <td bgcolor=#E0E0C0 
align=right><tt><font color=red> 3.253 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.702 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.161 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.212 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>6.38</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.130 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.020 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.461 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.461 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.354 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.851</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 607612</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.091 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.035 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>COPY</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 64.37</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.089 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.046 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.980 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.834 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.929 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 6.246 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>11.55</tt></td> 
<td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.047 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.797 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.590 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.725 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.542 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.698</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1214992</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.091 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.034 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>READ</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 45.38</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.026 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.406 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.248 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.307 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.232 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 7.192 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>10.13</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.934 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.517 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.454 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.453</U> </font></tt></td> <td bgcolor=#E0E0C0 
align=right><tt><font color=green> <U> 0.444</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.504 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1214992</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.091 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.034 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>STATS</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 5.74</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.030 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.413 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.014</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.033 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.021 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.634 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>2.34</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.000 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.936 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.761 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.791 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.774 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.744</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1214992</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.091 </font></tt></td> <td bgcolor=#E0E0C0 
align=right><tt><font color=red> 1.034 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>DELETE</b></td> <td bgcolor=#E0E0C0 align=right><tt>46.94</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.424</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.520 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.017 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.043 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.956 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.315 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>14.19</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.743 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.443 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.200</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.206 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.201</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.234 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>4</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.000 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td 
bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> </tt></td> </tr> <tr> <td colspan=22 bgcolor=#606060><b><font color=white>#1:DD_MBCOUNT=768 </font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=7><b>REAL_TIME</b></td> <td colspan=7><b>CPU_TIME</b></td> <td colspan=7><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_writing_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 29.33</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.026 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.184 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.102 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.499 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.097 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.098 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>2.61</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.008 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.659</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.437 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.054 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.556 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.571 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 786436</U></tt></td> 
<td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_reading_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 22.96</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.056 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.003</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.004</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.003</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.006</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>2.26</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.991 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.912 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.796 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.765</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.779</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.783 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 786436</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 
align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr><td bgcolor=black colspan=22><font color=white></td></tr> <tr><td colspan=22 align=right> <tr> <td colspan=22 bgcolor=#303030><b><font color=white>NPROC=1 DIR=/mnt/testfs SYNC=off PHASE_COPY=cp REP_COUNTER=1 GAMMA=0.2 PHASE_OVERWRITE=off FILE_SIZE=8192 BYTES=512000000 PHASE_APPEND=off PHASE_READ=find DEV=/dev/hdb3 DD_MBCOUNT=768 WRITE_BUFFER=131072 PHASE_DELETE=rm PHASE_MODIFY=off </td></tr> <tr><td colspan=22 align=right> <font size=-2>Produced by <a href=http://namesys.com/benchmarks/mongo_readme.html>Mongo</a> benchmark suite.</font></td></tr> </table> <hr> <a name="mongo.2003.09.25"></a>2003.09.25 <a href="benchmarks/mongo_readme.html">mongo</a> results <dl> <dt>reiser4 </dt> <dd>''</dd> <dt>mem total</dt> <dd>255048</dd> <dt>machine </dt> <dd>belka</dd> <dt>kernel </dt> <dd>2.6.0-test5 #33 SMP Thu Sep 25 15:45:38 MSD 2003</dd> <dt>date </dt> <dd>Thu Sep 25 15:57:38 2003</dd> </dl> <p> In this test 80% of files are chosen from the 0-8k size range, 16% from the 0-80k size range, 3.2% (i.e. 0.8 x 4%) from the 0-800k size range, etc. Most files are small, but most bytes are in large files. </p> <p>Legend:</p> <ul> <li><tt>A</tt> reiser4</li> <li><tt>B</tt> reiser4, extents only</li> <li><tt>C</tt> reiserfs <tt>v3</tt></li> <li><tt>D</tt> ext3 in <tt>data=writeback</tt> mode (meta-data only journalling)</li> <li><tt>E</tt> ext3 in <tt>data=journal</tt> mode</li> <li><tt>F</tt> ext3 in <tt>data=ordered</tt> mode</li> <li><tt>G</tt> ext3 with htree (hashed directories)</li> </ul> <p> The table presents absolute values (of elapsed time, CPU usage, and disk usage) for reiser4, and ratios against reiser4 for all other configurations. 
A <font color=red>red</font> number means the ratio is larger than <tt>1.0</tt>, that is, reiser4 is better in this test. A <font color=green>green</font> number means that reiser4 loses in this test. </p> <table cols=22 cellpadding=2 cellspacing=2 noborder> <tr><td bgcolor=black colspan=22><font color=white></td></tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>A.INFO_R4='' FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>B.INFO_R4='' MKFS=mkfs.reiser4 -q -o policy=extents FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>C.FSTYPE=reiserfs </font></th> </tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>D.MOUNT_OPTIONS=data=writeback FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>E.MOUNT_OPTIONS=data=journal FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>F.MOUNT_OPTIONS=data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>G.MKFS=mkfs.ext3 -O dir_index MOUNT_OPTIONS=data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <td colspan=22 bgcolor=#606060><b><font color=white>#0:</font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=7><b>REAL_TIME</b></td> <td colspan=7><b>CPU_TIME</b></td> <td colspan=7><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>CREATE</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 
23.57</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.158 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.714 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.263 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.234 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.020 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.376 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>6.66</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.075 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.947 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.240 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.357 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.264 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.835</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 608548</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.090 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.034 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.105 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.105 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.105 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>COPY</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 64.98</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.083 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.050 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.023 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.810 </font></tt></td> <td bgcolor=#E0E0C0 
align=right><tt><font color=red> 1.908 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 6.850 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>12.18</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.057 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.776 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.507 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.603 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.518 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.743</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1216784</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.090 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.033 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.105 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.105 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.105 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>READ</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 44.65</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.028 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.733 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.237 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.114 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.179 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 7.694 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>10.28</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.933 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.590</U> 
</font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.608 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.593</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.608 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.620 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1216784</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.090 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.033 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.105 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.105 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.105 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>STATS</b></td> <td bgcolor=#E0E0C0 align=right><tt>5.88</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.998 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.139 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.981 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.020 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.929</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.655 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>2.29</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.987 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.900 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.747</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.782 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.747</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 
0.755</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1216784</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.090 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.033 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.105 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.105 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.105 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>DELETE</b></td> <td bgcolor=#E0E0C0 align=right><tt>46.65</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.438</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.504 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.109 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.023 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.022 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.376 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>14.19</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.746 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.431 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.206</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.211 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.211 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.232 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>4</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.000 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> 
</font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> </tt></td> </tr> <tr> <td colspan=22 bgcolor=#606060><b><font color=white>#1:DD_MBCOUNT=768 </font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=7><b>REAL_TIME</b></td> <td colspan=7><b>CPU_TIME</b></td> <td colspan=7><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_writing_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 30.78</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.017</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.177 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.063 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.394 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.066 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.056 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>3.11</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.981 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.553</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.701 </font></tt></td> <td bgcolor=#E0E0C0 
align=right><tt><font color=red> 1.296 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.318 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 786436</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_reading_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 22.96</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.045 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.005</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.005</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.004</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.006</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>2.41</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.996 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.867 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.739 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.718</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.739 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.722</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 786436</U></tt></td> 
<td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tr> <tr><td bgcolor=black colspan=22><font color=white></font></td></tr> <tr> <td colspan=22 bgcolor=#303030><b><font color=white>NPROC=1 DIR=/mnt/testfs SYNC=off PHASE_COPY=cp REP_COUNTER=1 GAMMA=0.2 PHASE_OVERWRITE=off FILE_SIZE=8192 BYTES=512000000 PHASE_APPEND=off PHASE_READ=find DEV=/dev/hdb3 DD_MBCOUNT=768 WRITE_BUFFER=131072 PHASE_DELETE=rm PHASE_MODIFY=off </font></b></td></tr> <tr><td colspan=22 align=right> <font size=-2>Produced by <a href=http://namesys.com/benchmarks/mongo_readme.html>Mongo</a> benchmark suite.</font></td></tr> </table> <hr> <a name="mongo.2003.08.28"></a>2003.08.28 <a href="benchmarks/mongo_readme.html">mongo</a> results <dl> <dt>reiser4 </dt> <dd>''</dd> <dt>mem total</dt> <dd>256276</dd> <dt>machine </dt> <dd>belka</dd> <dt>kernel </dt> <dd>2.6.0-test4 #194 SMP Thu Aug 28 17:18:47 MSD 2003</dd> <dt>date </dt> <dd>Thu Aug 28 17:20:18 2003</dd> </dl> <p> In this test, 80% of files are chosen from the 0-8k size range, 16% from the 0-80k size range, 3.2% (0.8 x 4%) from the 0-800k size range, and so on: most files are small, but most bytes are in large files.
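The escalating size distribution described above appears to correspond to the GAMMA=0.2 setting shown in the parameter rows. As a rough sketch (this helper is not part of the Mongo suite; the function name and structure are my own), one way to sample file sizes with that 80% / 16% / 3.2% decade split:

```python
import random

def sample_file_size(gamma=0.2, base=8 * 1024, max_decades=4):
    """Pick a file size: with probability 1 - gamma (80%) draw uniformly
    from 0-8k; otherwise widen the upper bound by a decade and repeat,
    giving 16% for 0-80k, 3.2% for 0-800k, and so on."""
    limit = base
    for _ in range(max_decades):
        if random.random() >= gamma:  # stop escalating with probability 0.8
            break
        limit *= 10                   # move to the next size decade
    return random.randrange(limit)    # uniform within the chosen range
```

Sampled this way, roughly 80% of files land in the 0-8k bucket, yet the rare large files dominate the total byte count — exactly the "most files are small, most bytes are in large files" property the test relies on.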
</p> <p>Legend:</p> <ul> <li><tt>A</tt> reiser4</li> <li><tt>B</tt> reiser4, extents only</li> <li><tt>C</tt> reiserfs</li> <li><tt>D</tt> ext3 in <tt>data=writeback</tt> mode (meta-data only journalling)</li> <li><tt>E</tt> ext3 in <tt>data=journal</tt> mode</li> <li><tt>F</tt> ext3 in <tt>data=ordered</tt> mode</li> <li><tt>G</tt> ext3 with htree (hashed directories)</li> </ul> <p> The table presents absolute values (elapsed time, CPU usage, and disk usage) for reiser4, and for every other configuration its ratio against reiser4. A <font color=red>red</font> number means the ratio is larger than <tt>1.0</tt>, that is, reiser4 is better in this test; a <font color=green>green</font> number means reiser4 loses in this test. </p> <table cols=22 cellpadding=2 cellspacing=2 noborder> <tr><td bgcolor=black colspan=22><font color=white></font></td></tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>A.INFO_R4='' FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>B.INFO_R4='' MKFS=mkfs.reiser4 -q -o policy=extents FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>C.FSTYPE=reiserfs </font></th> </tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>D.MOUNT_OPTIONS=data=writeback FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>E.MOUNT_OPTIONS=data=journal FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>F.MOUNT_OPTIONS=data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>G.MKFS=mkfs.ext3 -O dir_index MOUNT_OPTIONS=data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <td colspan=22 bgcolor=#606060><b><font color=white>#0:</font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=7><b>REAL_TIME</b></td> <td colspan=7><b>CPU_TIME</b></td> <td colspan=7><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td>
<td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>CREATE</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 21.94</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.056 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.957 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.049 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.430 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.399 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.558 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>6.7</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.104 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.913 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.213 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.334 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.345 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.821</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 608452</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.091 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.034 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.105 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.105 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.105 </font></tt></td> <td bgcolor=#E0E0C0 
align=right><tt><font color=red> 1.106 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>COPY</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 64.05</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.078 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.112 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.964 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.703 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.022 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 7.356 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>11.37</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.039 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.819 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.538 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.692 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.568 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.708</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1216572</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.091 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.033 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>READ</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 52.53</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.072 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.882 </font></tt></td> <td 
bgcolor=#E0E0C0 align=right><tt><font color=red> 1.056 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.126 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.124 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 7.158 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>9.8</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.914 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.538 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.489 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.467 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.456</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.551 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1216572</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.091 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.033 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>STATS</b></td> <td bgcolor=#E0E0C0 align=right><tt>5.82</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.973</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.251 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.040 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.009 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.048 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.641 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 
align=right><tt>2.29</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.991 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.926 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.755 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.742</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.751 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.734</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1216572</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.091 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.033 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>DELETE</b></td> <td bgcolor=#E0E0C0 align=right><tt>46.96</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.409</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.491 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.949 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.988 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.987 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.382 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>13.89</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.734 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.453 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.210 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font 
color=green> <U> 0.204</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.202</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.238 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>4</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.000 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> </tt></td> </tr> <tr> <td colspan=22 bgcolor=#606060><b><font color=white>#1:DD_MBCOUNT=768 </font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=7><b>REAL_TIME</b></td> <td colspan=7><b>CPU_TIME</b></td> <td colspan=7><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_writing_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 26.1</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.006</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.205 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.066 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.353 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 
1.068 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.070 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>3.18</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.028 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.547</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.173 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.708 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.327 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.296 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 786436</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_reading_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 18.99</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.009</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.072 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.009</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.007</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.006</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.008</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>2.12</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.000 
</font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.925 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.877 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.844 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.830 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.811</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 786436</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tr> <tr><td bgcolor=black colspan=22><font color=white></font></td></tr> <tr> <td colspan=22 bgcolor=#303030><b><font color=white>NPROC=1 DIR=/mnt/testfs SYNC=off PHASE_COPY=cp REP_COUNTER=1 GAMMA=0.2 PHASE_OVERWRITE=off FILE_SIZE=8192 BYTES=512000000 PHASE_APPEND=off PHASE_READ=find DEV=/dev/hdb3 DD_MBCOUNT=768 WRITE_BUFFER=131072 PHASE_DELETE=rm PHASE_MODIFY=off </font></b></td></tr> <tr><td colspan=22 align=right> <font size=-2>Produced by <a href=http://namesys.com/benchmarks/mongo_readme.html>Mongo</a> benchmark suite.</font></td></tr> </table> <hr> <a name="mongo.2003.08.27"></a>2003.08.27 <a href="benchmarks/mongo_readme.html">mongo</a> results <dl> <dt>reiser4 </dt> <dd>''</dd> <dt>mem total</dt> <dd>256276</dd> <dt>machine </dt> <dd>belka</dd> <dt>kernel </dt> <dd>2.6.0-test4 #189 SMP Wed Aug 27 20:36:51 MSD 2003</dd> <dt>date </dt> <dd>Wed Aug 27 20:44:02 2003</dd> </dl> <p> In this test, 80% of files are chosen from the 0-8k size
range, 16% from the 0-80k size range, 3.2% (0.8 x 4%) from the 0-800k size range, and so on: most files are small, but most bytes are in large files. </p> <p>Legend:</p> <ul> <li><tt>A</tt> reiser4</li> <li><tt>B</tt> reiser4, extents only</li> <li><tt>C</tt> ext3 in <tt>data=writeback</tt> mode (meta-data only journalling)</li> <li><tt>D</tt> ext3 in <tt>data=journal</tt> mode</li> <li><tt>E</tt> ext3 in <tt>data=ordered</tt> mode</li> <li><tt>F</tt> ext3 with htree (hashed directories)</li> </ul> <p> The table presents absolute values (elapsed time, CPU usage, and disk usage) for reiser4, and for every other configuration its ratio against reiser4. A <font color=red>red</font> number means the ratio is larger than <tt>1.0</tt>, that is, reiser4 is better in this test; a <font color=green>green</font> number means reiser4 loses in this test. </p> <table cols=19 cellpadding=2 cellspacing=2 noborder> <tr><td bgcolor=black colspan=19><font color=white></font></td></tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>A.INFO_R4='' FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>B.INFO_R4='' MKFS=mkfs.reiser4 -q -o policy=extents FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>C.MOUNT_OPTIONS=data=writeback FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>D.MOUNT_OPTIONS=data=journal FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>E.MOUNT_OPTIONS=data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>F.MKFS=mkfs.ext3 -O dir_index MOUNT_OPTIONS=data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <td colspan=19 bgcolor=#606060><b><font color=white>#0:</font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=6><b>REAL_TIME</b></td> <td colspan=6><b>CPU_TIME</b></td> <td colspan=6><b>DF</b></td> </tr> <tr align=center
bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>CREATE</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 22.41</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.673 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.325 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.975 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.213 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>7.66</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.069 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.347 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.415 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.410 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.708</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 635264</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.096 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.110 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.110 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.110 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.111 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>COPY</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 90.92</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.099 </font></tt></td> <td 
bgcolor=#E0E0C0 align=right><tt><font color=red> 1.471 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.221 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.470 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 4.989 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>12.14</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.068 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.066 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.241 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.094 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.668</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1269840</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.096 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.110 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.110 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.110 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.112 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>READ</b></td> <td bgcolor=#E0E0C0 align=right><tt>82.21</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.063 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.861 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.852 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.791</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 4.417 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>10.57</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.914 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.400</U> </font></tt></td> <td bgcolor=#E0E0C0 
align=right><tt><font color=green> 0.428 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.402</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.534 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1269840</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.096 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.110 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.110 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.110 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.112 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>STATS</b></td> <td bgcolor=#E0E0C0 align=right><tt>8.52</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.993 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.822</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.816</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.811</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.335 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>2.96</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.997 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.561</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.564</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.584 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.608 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1269840</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.096 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.110 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.110 </font></tt></td> <td 
bgcolor=#E0E0C0 align=right><tt><font color=red> 1.110 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.112 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>DELETE</b></td> <td bgcolor=#E0E0C0 align=right><tt>69.69</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.301</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.749 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.717 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.659 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.912 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>14.73</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.703 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.208</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.207</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.213 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.237 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>4</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.000 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> </tt></td> </tr> <tr> <td colspan=19 bgcolor=#606060><b><font color=white>#1:DD_MBCOUNT=768 </font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=6><b>REAL_TIME</b></td> <td colspan=6><b>CPU_TIME</b></td> <td colspan=6><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A 
</b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_writing_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 25.85</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.092 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.335 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.085 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.095 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 3.27</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 0.982</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.159 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.648 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.251 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.254 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 786436</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_reading_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 19</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 0.999</U> 
</font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.005</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.007</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.007</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.007</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>2.18</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.963 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.807 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.803</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.789</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.803</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 786436</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr><td bgcolor=black colspan=19><font color=white></td></tr> <tr><td colspan=19 align=right> <tr> <td colspan=19 bgcolor=#303030><b><font color=white>NPROC=1 DIR=/mnt/testfs SYNC=off PHASE_COPY=cp REP_COUNTER=1 GAMMA=0.2 PHASE_OVERWRITE=off FILE_SIZE=8000 BYTES=512000000 PHASE_APPEND=off PHASE_READ=find DEV=/dev/hdb3 DD_MBCOUNT=768 WRITE_BUFFER=131072 PHASE_DELETE=rm PHASE_MODIFY=off </td></tr> <tr><td colspan=19 align=right> <font size=-2>Produced by <a href=http://namesys.com/benchmarks/mongo_readme.html>Mongo</a> benchmark suite.</font></td></tr> </table> <hr> <p> This is the same test as above, but with base file 
size 4k, that is, in this test 80% of files are chosen from the 0-4k size range, 16% from the 0-40k size range, 0.8 x 4% from the 0-400k size range, etc. </p> <hr> <dl> <dt>reiser4 </dt> <dd>''</dd> <dt>mem total</dt> <dd>255580</dd> <dt>machine </dt> <dd>belka</dd> <dt>kernel </dt> <dd>2.6.0-test4 #176 SMP Tue Aug 26 19:09:38 MSD 2003</dd> <dt>date </dt> <dd>Wed Aug 27 12:41:54 2003</dd> </dl> <table cols=19 cellpadding=2 cellspacing=2 noborder> <tr><td bgcolor=black colspan=19><font color=white></td></tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>A.INFO_R4='' FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>B.INFO_R4='' MKFS=mkfs.reiser4 -q -o policy=extents FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>C.MOUNT_OPTIONS=data=writeback FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>D.MOUNT_OPTIONS=data=journal FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>E.MOUNT_OPTIONS=data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>F.MKFS=mkfs.ext3 -O dir_index MOUNT_OPTIONS=data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <td colspan=19 bgcolor=#606060><b><font color=white>#0:</font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=6><b>REAL_TIME</b></td> <td colspan=6><b>CPU_TIME</b></td> <td colspan=6><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>CREATE</b></td> <td 
bgcolor=#E0E0C0 align=right><tt><U> 33.86</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.223 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.305 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.895 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.549 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.298 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>14.11</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.118 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.967 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.046 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.045 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.647</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 789424</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.208 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.181 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>COPY</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 119.68</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.228 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.237 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.397 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.277 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 7.061 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>23.05</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 
</font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.484 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.683 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.515 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.691</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1578216</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.208 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.182 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>READ</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 118.5</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.217 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.041 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.065 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.020</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 6.585 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>19.84</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.993 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.436</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.446 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.431</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.540 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1578216</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.208 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 
</font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.182 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>STATS</b></td> <td bgcolor=#E0E0C0 align=right><tt>24.69</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.951 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.677</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.696 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.677</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.151 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>7.75</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.008 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.590</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.582</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.583</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.645 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1578216</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.208 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.182 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>DELETE</b></td> <td bgcolor=#E0E0C0 align=right><tt>114.49</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.438 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.174</U> </font></tt></td> <td 
bgcolor=#E0E0C0 align=right><tt><font color=green> 0.188 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.177 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.257 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>32.64</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.790 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.193</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.199 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.194</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.223 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>4</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.000 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> </tt></td> </tr> <tr> <td colspan=19 bgcolor=#606060><b><font color=white>#1:DD_MBCOUNT=768 </font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=6><b>REAL_TIME</b></td> <td colspan=6><b>CPU_TIME</b></td> <td colspan=6><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_writing_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 26.24</U></tt></td> <td bgcolor=#E0E0C0 
align=right><tt><font color=black> <U> 1.002</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.066 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.311 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.056 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.063 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 3.25</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 0.997</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.138 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.622 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.286 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.298 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 786436</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_reading_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 19.04</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 0.994</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.002</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.003</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.002</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>2.08</tt></td> <td bgcolor=#E0E0C0 
align=right><tt><font color=red> 1.038 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.870 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.870 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.870 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.837</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 786436</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr><td bgcolor=black colspan=19><font color=white></td></tr> <tr><td colspan=19 align=right> <tr> <td colspan=19 bgcolor=#303030><b><font color=white>NPROC=1 DIR=/mnt/testfs SYNC=off PHASE_COPY=cp REP_COUNTER=1 GAMMA=0.2 PHASE_OVERWRITE=off FILE_SIZE=4000 BYTES=512000000 PHASE_APPEND=off PHASE_READ=find DEV=/dev/hdb3 DD_MBCOUNT=768 WRITE_BUFFER=131072 PHASE_DELETE=rm PHASE_MODIFY=off </td></tr> <tr><td colspan=19 align=right> <font size=-2>Produced by <a href=http://namesys.com/benchmarks/mongo_readme.html>Mongo</a> benchmark suite.</font></td></tr> </table> <hr> <a name="mongo.2003.08.26"></a>2003.08.26 <a href="benchmarks/mongo_readme.html">mongo</a> results <dl> <dt>reiser4 </dt> <dd>''</dd> <dt>mem total</dt> <dd>904048</dd> <dt>machine </dt> <dd>belka</dd> <dt>kernel </dt> <dd>2.6.0-test4 #176 SMP Tue Aug 26 19:09:38 MSD 2003</dd> <dt>date </dt> <dd>Tue Aug 26 19:34:39 2003</dd> </dl> <p> In this test 80% of files are chosen from the 0-4k size range, 16% from the 0-40k size range, 0.8 x 4% from the 0-400k size range, etc. 
Most files are small, most bytes are in large files. </p> <p>Legend:</p> <ul> <li><tt>A</tt> reiser4</li> <li><tt>B</tt> reiser4, extents only</li> <li><tt>C</tt> ext3 in <tt>data=writeback</tt> mode (meta-data only journalling)</li> <li><tt>D</tt> ext3 in <tt>data=journal</tt> mode</li> <li><tt>E</tt> ext3 in <tt>data=ordered</tt> mode</li> <li><tt>F</tt> ext3 with htree (hashed directories)</li> </ul> <p> Table presents absolute values (of elapsed time, CPU usage, and disk usage) for reiser4, and ratios against reiser4 for all other configurations. <font color=red>Red</font> number means ratio is larger than <tt>1.0</tt>, that is, reiser4 is better in this test. <font color=green>Green</font> number means that reiser4 loses in this test. </p> <table cols=19 cellpadding=2 cellspacing=2 noborder> <tr><td bgcolor=black colspan=19><font color=white></td></tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>A.INFO_R4='' FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>B.INFO_R4='' MKFS=mkfs.reiser4 -q -o policy=extents FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>C.MOUNT_OPTIONS=data=writeback FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>D.MOUNT_OPTIONS=data=journal FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>E.MOUNT_OPTIONS=data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>F.MKFS=mkfs.ext3 -O dir_index MOUNT_OPTIONS=data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <td colspan=19 bgcolor=#606060><b><font color=white>#0:</font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=6><b>REAL_TIME</b></td> <td colspan=6><b>CPU_TIME</b></td> <td colspan=6><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A 
</b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>CREATE</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 27.6</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.311 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.567 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.538 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.668 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.566 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>13.55</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.166 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.035 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.162 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.189 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.670</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 788884</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.208 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.181 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.181 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.181 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.182 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>COPY</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 113.71</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.237 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.167 </font></tt></td> <td 
bgcolor=#E0E0C0 align=right><tt><font color=red> 1.460 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.227 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 7.387 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>23.13</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.169 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.498 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.691 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.591 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.709</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1577560</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.208 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.181 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.181 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.181 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.183 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>READ</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 111.51</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.239 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.157 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.176 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.096 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 7.017 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>20.76</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.042 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.424 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.415</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font 
color=green> <U> 0.416</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.521 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1577560</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.208 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.181 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.181 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.181 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.183 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>STATS</b></td> <td bgcolor=#E0E0C0 align=right><tt>20.22</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.034 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.834</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.827</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.832</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.439 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>7.47</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.009 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.590</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.585</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.584</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.631 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1577560</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.208 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.181 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.181 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.181 </font></tt></td> <td bgcolor=#E0E0C0 
align=right><tt><font color=red> 1.183 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>DELETE</b></td> <td bgcolor=#E0E0C0 align=right><tt>110.98</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.437 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.183</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.180</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.185 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.277 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>33.03</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.838 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.196 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.192</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.193</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.221 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>4</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.000 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> </tt></td> </tr> <tr> <td colspan=19 bgcolor=#606060><b><font color=white>#1:DD_MBCOUNT=768 </font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=6><b>REAL_TIME</b></td> <td colspan=6><b>CPU_TIME</b></td> <td colspan=6><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A 
</b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_writing_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 26.03</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.096 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.340 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.092 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.080 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 3.48</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.011</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.083 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.583 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.187 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.190 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 786436</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_reading_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 19</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 0.995</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> 
</font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 0.999</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 0.999</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>2.28</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.018 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.741 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.737</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.741 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.724</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 786436</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr><td bgcolor=black colspan=19><font color=white></td></tr> <tr><td colspan=19 align=right> <tr> <td colspan=19 bgcolor=#303030><b><font color=white>NPROC=1 DIR=/mnt/testfs SYNC=off PHASE_COPY=cp REP_COUNTER=1 GAMMA=0.2 PHASE_OVERWRITE=off FILE_SIZE=4000 BYTES=512000000 PHASE_APPEND=off PHASE_READ=find DEV=/dev/hdb3 DD_MBCOUNT=768 WRITE_BUFFER=131072 PHASE_DELETE=rm PHASE_MODIFY=off </td></tr> <tr><td colspan=19 align=right> <font size=-2>Produced by <a href=http://namesys.com/benchmarks/mongo_readme.html>Mongo</a> benchmark suite.</font></td></tr> </table> <hr> <a name="mongo.2003.08.18"></a>2003.08.18 <a href="benchmarks/mongo_readme.html">mongo</a> results <dl> <dt>reiser4 </dt> <dd></dd> <dt>mem total</dt> 
<dd>255992</dd> <dt>machine </dt> <dd>belka</dd> <dt>kernel </dt> <dd>2.6.0-test3 #37 SMP Mon Aug 18 18:12:14 MSD 2003</dd> <dt>date </dt> <dd>Mon 18 Aug 2003 20:24:16</dd> </dl> <p> In this test 80% of files are chosen from the 0-8k size range, 16% from the 0-80k size range, 3.2% (0.8 × 4%) from the 0-800k size range, etc. Most files are small; most bytes are in large files. </p> <p>Legend:</p> <ul> <li><tt>A</tt> reiser4</li> <li><tt>B</tt> reiser4, extents only</li> <li><tt>C</tt> ext3 in <tt>data=writeback</tt> mode (meta-data only journalling)</li> <li><tt>D</tt> ext3 in <tt>data=journal</tt> mode</li> <li><tt>E</tt> ext3 in <tt>data=ordered</tt> mode</li> <li><tt>F</tt> ext3 with htree (hashed directories)</li> </ul> <p> The table presents absolute values (of elapsed time, CPU usage, and disk usage) for reiser4, and ratios relative to reiser4 for every other configuration; for example, a REAL_TIME ratio of 2.0 means that configuration took twice as long as reiser4. A <font color=red>red</font> number means the ratio is larger than <tt>1.0</tt>, that is, reiser4 is better in this test. A <font color=green>green</font> number means that reiser4 loses in this test.
</p> <table cols=19 cellpadding=2 cellspacing=2 noborder> <tr><td bgcolor=black colspan=19><font color=white></td></tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>A.INFO_R4= FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>B.INFO_R4=ext MKFS=mkfs.reiser4 -q -o policy=extents FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>C.MOUNT_OPTIONS=data=writeback FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>D.MOUNT_OPTIONS=data=journal FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>E.MOUNT_OPTIONS=data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>F.MKFS=mkfs.ext3 -O dir_index MOUNT_OPTIONS=data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <td colspan=19 bgcolor=#606060><b><font color=white>#0:</font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=6><b>REAL_TIME</b></td> <td colspan=6><b>CPU_TIME</b></td> <td colspan=6><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>CREATE</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 29.16</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.220 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.422 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.779 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.491 </font></tt></td> <td bgcolor=#E0E0C0 
align=right><tt><font color=red> 1.645 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>13.52</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.182 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.013 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.087 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.997 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.657</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 789364</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.208 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.181 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>COPY</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 119.64</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.211 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.191 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.473 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.230 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 7.288 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>21.98</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.152 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.515 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.746 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.520 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.695</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 
1578116</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.208 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.182 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>READ</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 116.55</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.213 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.177 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.025 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.134 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 6.850 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>18.35</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.035 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.447 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.436</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.431</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.569 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1578116</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.208 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.182 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>STATS</b></td> <td bgcolor=#E0E0C0 align=right><tt>21.65</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font 
color=red> 1.050 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.779</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.811 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.782</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.358 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>7.56</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.001 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.599</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.612 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.611</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.638 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1578116</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.208 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.182 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>DELETE</b></td> <td bgcolor=#E0E0C0 align=right><tt>112.37</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.434 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.179</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.198 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.177</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.281 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>30.62</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.851 </font></tt></td> <td bgcolor=#E0E0C0 
align=right><tt><font color=green> <U> 0.205</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.205</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.203</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.230 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>4</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.000 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> </tt></td> </tr> <tr> <td colspan=19 bgcolor=#606060><b><font color=white>#1:DD_MBCOUNT=768 </font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=6><b>REAL_TIME</b></td> <td colspan=6><b>CPU_TIME</b></td> <td colspan=6><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_writing_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 26.11</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.011</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.090 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.388 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.076 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.083 </font></tt></td> </tt></td> 
<td bgcolor=#E0E0C0 align=right><tt>3.25</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.945</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.092 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.640 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.255 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.231 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 786436</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_reading_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 19.09</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.005</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 0.999</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 0.996</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.004</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.011</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>2.09</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.019 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.847</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.856 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.833</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.842</U> 
</font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 786436</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr><td bgcolor=black colspan=19><font color=white></td></tr> <tr><td colspan=19 align=right> <tr> <td colspan=19 bgcolor=#303030><b><font color=white>NPROC=1 DIR=/mnt/testfs SYNC=off PHASE_COPY=cp REP_COUNTER=1 GAMMA=0.2 PHASE_OVERWRITE=off FILE_SIZE=4000 BYTES=512000000 PHASE_APPEND=off PHASE_READ=find DEV=/dev/hdb3 DD_MBCOUNT=768 WRITE_BUFFER=131072 PHASE_DELETE=rm PHASE_MODIFY=off </td></tr> <tr><td colspan=19 align=right> <font size=-2>Produced by <a href=http://namesys.com/benchmarks/mongo_readme.html>Mongo</a> benchmark suite.</font></td></tr> </table> <hr> <a name="mongo.2003.08.12"></a>2003.08.12 <a href="benchmarks/mongo_readme.html">mongo</a> results <dl> <dt>mem total</dt> <dd>513284</dd> <dt>machine </dt> <dd>strelka</dd> <dt>kernel </dt> <dd>2.6.0-test2 #52 SMP Tue Aug 12 15:17:12 MSD 2003</dd> <dt>date </dt> <dd>Tue Aug 12 15:38:47 2003</dd> </dl> <p> This is a comparison of the latest (2003.08.12) version of reiser4 with ext3. Reiser4 is an atomic filesystem, so the comparison with the data journaling mode of ext3 is the fairest, but since most users run ext3 in data ordering mode, we compare against that as well. </p> <p> In this test 80% of files are chosen from the 0-8k size range, 16% from the 0-80k size range, 3.2% (0.8 × 4%) from the 0-800k size range, etc. Most files are small; most bytes are in large files.
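The file-size distribution described above can be sketched in a few lines of code. This is a hypothetical re-creation, not Mongo's own generator: it assumes the suite escalates to the next size decade with probability GAMMA=0.2 (the value shown in the config rows) and picks a size uniformly within the chosen range; the cap on the largest decade and all names are illustrative.

```python
import random

random.seed(0)
GAMMA = 0.2      # from the Mongo config row; assumed per-decade escalation probability
BASE = 8 * 1024  # the smallest size range described above: 0-8k

def sample_size():
    """With prob. 0.8 pick from 0-8k, 0.16 from 0-80k, 0.032 from 0-800k, ..."""
    limit = BASE
    # move up one decade with probability GAMMA; cap is an assumption,
    # the text only says "etc."
    while random.random() < GAMMA and limit < 10**6:
        limit *= 10
    return random.randrange(limit)

sizes = [sample_size() for _ in range(100_000)]
small_files = sum(s < BASE for s in sizes) / len(sizes)
large_bytes = sum(s for s in sizes if s >= BASE) / sum(sizes)
print(f"files under 8k: {small_files:.0%}; bytes in files over 8k: {large_bytes:.0%}")
```

Running this reproduces the claim in the text: roughly four files in five are small, while the bulk of the bytes live in the rare large files.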
</p> <p>Legend:</p> <ul> <li><tt>A</tt> reiser4</li> <li><tt>B</tt> ext3 in <tt>data=writeback</tt> mode (meta-data only journalling)</li> <li><tt>C</tt> ext3 in <tt>data=journal</tt> mode</li> <li><tt>D</tt> ext3 in <tt>data=ordered</tt> mode</li> <li><tt>E</tt> ext3 with htree (hashed directories)</li> <li><tt>F</tt> ext3 with support for filetypes in <tt>readdir()</tt></li> </ul> <p> The table presents absolute values (of elapsed time, CPU usage, and disk usage) for reiser4, and ratios relative to reiser4 for every other configuration; for example, a REAL_TIME ratio of 2.0 means that configuration took twice as long as reiser4. A <font color=red>red</font> number means the ratio is larger than <tt>1.0</tt>, that is, reiser4 is better in this test. A <font color=green>green</font> number means that reiser4 loses in this test. </p> <table cols=19 cellpadding=2 cellspacing=2 noborder> <tr><td bgcolor=black colspan=19><font color=white></td></tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>A.INFO_R4= MKFS=/usr/local/sbin/mkfs.reiser4 -qf FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>B.MOUNT_OPTIONS=data=writeback FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>C.MOUNT_OPTIONS=data=journal FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>D.MOUNT_OPTIONS=data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>E.MKFS=/usr/local/sbin/mkfs.ext3 -O dir_index MOUNT_OPTIONS=data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>F.MKFS=/usr/local/sbin/mkfs.ext3 -O filetype MOUNT_OPTIONS=data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <td colspan=19 bgcolor=#606060><b><font color=white>#0:</font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=6><b>REAL_TIME</b></td> <td colspan=6><b>CPU_TIME</b></td> <td colspan=6><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td>
<td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>CREATE</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 14.06</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.317 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.248 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.050 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.016 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.077 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>5.3</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.558 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.692 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.602 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.823</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.592 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 458224</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>COPY</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 43.62</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.982 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font 
color=red> 1.733 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.033 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 6.685 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.904 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>9.19</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.163 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.286 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.230 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.706</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.200 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 916172</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>READ</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 39.86</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.091 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.091 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.140 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 6.003 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.119 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>8.22</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.467 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.454 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.464 </font></tt></td> <td 
bgcolor=#E0E0C0 align=right><tt><font color=green> 0.529 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.443</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 916172</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>STATS</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1.54</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.987 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.896 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.942 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.649 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.883 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 0.26</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.115 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.115 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.115 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.385 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.962 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 916172</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font 
color=red> 1.107 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>DELETE</b></td> <td bgcolor=#E0E0C0 align=right><tt>37.85</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.833 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.825 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.867 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.133 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.760</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>11.11</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.223</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.223</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.220</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.254 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.222</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>4</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> </tt></td> </tr> <tr> <td colspan=19 bgcolor=#606060><b><font color=white>#1:DD_MBCOUNT=500 </font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=6><b>REAL_TIME</b></td> <td colspan=6><b>CPU_TIME</b></td> <td colspan=6><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A 
</b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_writing_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 42.15</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.062 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.534 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.066 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.071 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.073 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 7.86</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.094 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.500 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.206 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.211 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.198 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 512004</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_reading_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 36.5</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.005</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.008</U> </font></tt></td> <td 
bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.005</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.007</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.007</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>4.7</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.745</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.732</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.743</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.736</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.734</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 512004</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr><td bgcolor=black colspan=19><font color=white></td></tr> <tr><td colspan=19 align=right> <tr> <td colspan=19 bgcolor=#303030><b><font color=white>NPROC=1 DIR=/data1 SYNC=off PHASE_COPY=cp REP_COUNTER=3 GAMMA=0.2 PHASE_OVERWRITE=off PHASE_STATS=find FILE_SIZE=8192 BYTES=134217728 PHASE_APPEND=off PHASE_READ=find DEV=/dev/hdb1 DD_MBCOUNT=500 WRITE_BUFFER=131072 PHASE_DELETE=rm PHASE_MODIFY=off </td></tr> <tr><td colspan=19 align=right> <font size=-2>Produced by <a href=http://namesys.com/benchmarks/mongo_readme.html>Mongo</a> benchmark suite.</font></td></tr> </table> <hr> <p> <a name="mongo.2003.07.23"></a> Below are the older (2003.07.23) mongo results.
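The <tt>dd_writing_largefile</tt> / <tt>dd_reading_largefile</tt> rows measure sequential large-file I/O. Judging from the config rows (DD_MBCOUNT, WRITE_BUFFER=131072), the phase amounts to streaming DD_MBCOUNT megabytes through a fixed-size buffer; the sketch below is a minimal re-creation under that assumption, with a deliberately small size (the runs above used 500 and 768 MB) and hypothetical file names.

```python
import os
import time

DD_MBCOUNT = 8            # the actual runs used 500 or 768 MB; kept small here
WRITE_BUFFER = 131072     # buffer size taken from the config row
path = "/tmp/mongo_largefile_demo"
buf = b"\0" * WRITE_BUFFER
chunks = DD_MBCOUNT * 1024 * 1024 // WRITE_BUFFER

t0 = time.time()
with open(path, "wb") as f:          # dd_writing_largefile: sequential write
    for _ in range(chunks):
        f.write(buf)
write_s = time.time() - t0

total = 0
t0 = time.time()
with open(path, "rb") as f:          # dd_reading_largefile: sequential read-back
    while chunk := f.read(WRITE_BUFFER):
        total += len(chunk)
read_s = time.time() - t0

print(f"{DD_MBCOUNT} MB: write {write_s:.3f}s, read {read_s:.3f}s")
os.remove(path)
```

Note that with SYNC=off, as in the runs above, no fsync is issued, so such timings reflect the page cache as well as the device.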
</p> <table cols=10 cellpadding=2 cellspacing=2 noborder> <tr><td bgcolor=black colspan=10><font color=white></td></tr> <tr> <th bgcolor=#303030 colspan=10 align=left><font color=white>A. reiser4</th> </tr> <tr> <th bgcolor=#303030 colspan=10 align=left><font color=white>B. ext3 data journalling</th> </tr> <tr> <th bgcolor=#303030 colspan=10 align=left><font color=white>C. ext3 </font></th> </tr> <tr> <td colspan=10 bgcolor=#606060><b><font color=white>#0:</font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=3><b>REAL_TIME</b></td> <td colspan=3><b>CPU_TIME</b></td> <td colspan=3><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>CREATE</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 14.19</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.221 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.592 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 5.66</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.610 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.475 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 458692</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>COPY</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 49.01</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.586 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.783 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 9.08</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.308 </font></tt></td> <td 
bgcolor=#E0E0C0 align=right><tt><font color=red> 1.176 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 916668</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>READ</b></td> <td bgcolor=#E0E0C0 align=right><tt>43.39</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.970</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.017 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>8.1</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.452</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.453</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 916668</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>STATS</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1.93</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.534 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.549 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 0.27</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.000 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.963 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 916668</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>DELETE</b></td> <td bgcolor=#E0E0C0 align=right><tt>40.13</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.797</U> </font></tt></td> <td bgcolor=#E0E0C0 
align=right><tt><font color=green> 0.837 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>11.26</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.217 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.210</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>4</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> </tt></td> </tr> <tr> <td colspan=10 bgcolor=#606060><b><font color=white>#1:DD_MBCOUNT=500 </font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=3><b>REAL_TIME</b></td> <td colspan=3><b>CPU_TIME</b></td> <td colspan=3><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_writing_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 42.27</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.527 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.057 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 7.78</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.497 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.189 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 512004</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_reading_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 36.57</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.005</U> </font></tt></td> <td bgcolor=#E0E0C0 
align=right><tt><font color=black> <U> 1.005</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt>4.8</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.760</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.777 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 512004</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tr> <tr><td bgcolor=black colspan=10><font color=white></font></td></tr> <tr><td colspan=10 align=right> </td></tr><tr> <td colspan=10 bgcolor=#303030><b><font color=white>NPROC=1 DIR=/data1 SYNC=off PHASE_COPY=cp REP_COUNTER=3 GAMMA=0.2 PHASE_OVERWRITE=off PHASE_STATS=find FILE_SIZE=8192 BYTES=134217728 PHASE_APPEND=off PHASE_READ=find DEV=/dev/hdb1 DD_MBCOUNT=500 WRITE_BUFFER=131072 PHASE_DELETE=rm PHASE_MODIFY=off </font></b></td></tr> <tr><td colspan=10 align=right> <font size=-2>Produced by <a href=http://namesys.com/benchmarks/mongo_readme.html>Mongo</a> benchmark suite.</font></td></tr> </table> <hr> <a name="mongo.2003.07.10"></a> <p> Below are some older benchmarks taken just before LinuxTag. In these, note that gamma is the fraction of files that are 10x larger than the base size. It is set either to 0.2 (as in the benchmark above), to mimic observed real-world usage patterns, or to 0, to measure a file-size range's performance in isolation. Note that V3 performs poorly in the 0-8k size range, while V4 performs well; this is the result of deep design changes described at <a href="http://www.namesys.com/v4/v4.html">http://www.namesys.com/v4/v4.html</a>. 
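The gamma parameter above amounts to a two-point file-size distribution. The following is a hypothetical sketch of that model, not code from the Mongo suite itself; the function name and structure are mine, only the GAMMA/FILE_SIZE semantics come from the description above:

```python
import random

def mongo_file_size(base_size, gamma, rng=random):
    # With probability gamma the file is 10x the base size;
    # otherwise it is exactly the base size. GAMMA=0 disables
    # large files entirely, as in the "in isolation" runs.
    return base_size * 10 if rng.random() < gamma else base_size

# Sample the GAMMA=0.2 FILE_SIZE=8192 workload.
# Analytic mean: 0.8*8192 + 0.2*81920 = 22937.6 bytes
random.seed(0)
sizes = [mongo_file_size(8192, 0.2) for _ in range(100_000)]
print(round(sum(sizes) / len(sizes)))
```

With gamma at 0.2, roughly a fifth of the bytes-per-file budget is spent on 80k files, which is why the GAMMA=0.2 runs stress both the tail-packing (small file) and extent (large file) paths at once.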
<dl><dt>mem total</dt><dd>513748</dd><dt>machine </dt><dd>strelka</dd><dt>kernel </dt><dd>2.5.74 #213 SMP Thu Jul 10 22:53:23 MSD 2003</dd><dt>date </dt><dd>Thu Jul 10 22:48:56 2003</dd><dt>.config </dt><dd><a href="http://www.namesys.com/intbenchmarks/mongo/03.07.11.nikita/.config">here</a></dd><dt>NPROC</dt><dd>1</dd><dt>DIR</dt><dd>/data1</dd><dt>SYNC</dt><dd>off</dd><dt>REP_COUNTER</dt><dd>3</dd><dt>All phases are in readdir order</dt><dd></dd><dt>BYTES</dt><dd>100M</dd><dt>DEV</dt><dd>/dev/hdb1</dd><dt>WRITE_BUFFER</dt><dd><b>256k</b></dd></dl> <p>everywhere <b>A</b> is reiserfs and <b>B</b> is reiser4. Green numbers mean reiser4 is better.</p> <table cols="7" cellpadding="2" cellspacing="2" noborder=""> <tbody><tr><td bgcolor="black" colspan="7"><font color="white"></font></td></tr> <tr> <th bgcolor="#303030" colspan="7" align="left"><font color="white">median file size 8k</font></th> </tr> <tr align="center" bgcolor="#c0c0c0"> <td></td> <td colspan="2"><b>REAL_TIME</b></td> <td colspan="2"><b>CPU_TIME</b></td> <td colspan="2"><b>DF</b></td> </tr> <tr align="center" bgcolor="#c0c0c0"> <td></td> <td><b>A</b></td><td><b>B/A </b></td> <td><b>A</b></td><td><b>B/A </b></td> <td><b>A</b></td><td><b>B/A </b></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>CREATE</b></td> <td bgcolor="#e0e0c0" align="right"><tt>41.26</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.246</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>3.93</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.908</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>321632</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.961</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>COPY</b></td> <td bgcolor="#e0e0c0" align="right"><tt>154.09</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.504</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 5.17</u></tt></td> <td 
bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.217 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>642624</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.962</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>APPEND</b></td> <td bgcolor="#e0e0c0" align="right"><tt>282.09</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.573</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 6.6</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.392 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>944428</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 0.980</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>MODIFY</b></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 284.52</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 0.986</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 3.29</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.489 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 943592</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 0.981</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>OVERWRITE</b></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 298.19</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.263 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 5.33</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.608 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>943548</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.968</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>READ</b></td> <td bgcolor="#e0e0c0" align="right"><tt>245.22</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.940</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 
3.85</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.753 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>943548</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.968</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>STATS</b></td> <td bgcolor="#e0e0c0" align="right"><tt>20.58</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.099</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 0.48</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.292 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>943548</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.968</u> </font></tt></td> </tr> <tr> <td colspan="7" bgcolor="#a0a0a0"><b><font color="white">GAMMA=0.2 FILE_SIZE=8192 <a href="http://www.namesys.com/intbenchmarks/mongo/03.07.11.nikita/8k.heavy.v3.profile">A profile</a> <a href="http://www.namesys.com/intbenchmarks/mongo/03.07.11.nikita/8k.heavy.v4.profile">B profile</a></font></b></td></tr> <tr><td bgcolor="white" colspan="7"><font color="white"></font></td></tr> <tr><td bgcolor="white" colspan="7"><font color="white"></font></td></tr> <tr><td bgcolor="white" colspan="7"><font color="white"></font></td></tr> <tr><td bgcolor="black" colspan="7"><font color="white"></font></td></tr> <tr> <th bgcolor="#303030" colspan="7" align="left"><font color="white">median file size 4k</font></th> </tr> <tr align="center" bgcolor="#c0c0c0"> <td></td> <td colspan="2"><b>REAL_TIME</b></td> <td colspan="2"><b>CPU_TIME</b></td> <td colspan="2"><b>DF</b></td> </tr> <tr align="center" bgcolor="#c0c0c0"> <td></td> <td><b>A</b></td><td><b>B/A </b></td> <td><b>A</b></td><td><b>B/A </b></td> <td><b>A</b></td><td><b>B/A </b></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>CREATE</b></td> <td bgcolor="#e0e0c0" align="right"><tt>117.32</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.176</u> </font></tt></td> 
<td bgcolor="#e0e0c0" align="right"><tt>15.57</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.758</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 667652</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.000</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>COPY</b></td> <td bgcolor="#e0e0c0" align="right"><tt>524.67</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.365</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 19.16</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.059 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 1332856</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.002</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>APPEND</b></td> <td bgcolor="#e0e0c0" align="right"><tt>1068.43</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.363</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>31.27</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.937</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>2073420</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.950</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>MODIFY</b></td> <td bgcolor="#e0e0c0" align="right"><tt>1081.23</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.670</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 18.61</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.048 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>2066536</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.953</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>OVERWRITE</b></td> <td bgcolor="#e0e0c0" align="right"><tt>1050.55</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 
0.885</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 22.81</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.017</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>2066424</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.948</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>READ</b></td> <td bgcolor="#e0e0c0" align="right"><tt>974.43</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.644</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 12.28</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.635 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>2066424</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.948</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>STATS</b></td> <td bgcolor="#e0e0c0" align="right"><tt>83.44</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.075</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>1.26</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.802</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>2066424</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.948</u> </font></tt></td> </tr> <tr> <td colspan="7" bgcolor="#a0a0a0"><b><font color="white">GAMMA=0.2 FILE_SIZE=4096 <a href="http://www.namesys.com/intbenchmarks/mongo/03.07.11.nikita/4k.heavy.v3.profile">A profile</a> <a href="http://www.namesys.com/intbenchmarks/mongo/03.07.11.nikita/4k.heavy.v4.profile">B profile</a></font></b></td></tr> <tr><td bgcolor="white" colspan="7"><font color="white"></font></td></tr> <tr><td bgcolor="white" colspan="7"><font color="white"></font></td></tr> <tr><td bgcolor="white" colspan="7"><font color="white"></font></td></tr> <tr><td bgcolor="black" colspan="7"><font color="white"></font></td></tr> <tr> <th bgcolor="#303030" colspan="7" 
align="left"><font color="white">maximal file size 4k</font></th> </tr> <tr align="center" bgcolor="#c0c0c0"> <td></td> <td colspan="2"><b>REAL_TIME</b></td> <td colspan="2"><b>CPU_TIME</b></td> <td colspan="2"><b>DF</b></td> </tr> <tr align="center" bgcolor="#c0c0c0"> <td></td> <td><b>A</b></td><td><b>B/A </b></td> <td><b>A</b></td><td><b>B/A </b></td> <td><b>A</b></td><td><b>B/A </b></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>CREATE</b></td> <td bgcolor="#e0e0c0" align="right"><tt>77.34</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.309</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>21.86</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.938</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>452252</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.923</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>COPY</b></td> <td bgcolor="#e0e0c0" align="right"><tt>412.28</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.300</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 35.11</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.013</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>893408</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.934</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>APPEND</b></td> <td bgcolor="#e0e0c0" align="right"><tt>1198.9</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.164</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>67.06</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.694</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>1631992</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.749</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>MODIFY</b></td> <td bgcolor="#e0e0c0" 
align="right"><tt>1305.14</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.351</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>43.77</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.762</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>1613124</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.758</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>OVERWRITE</b></td> <td bgcolor="#e0e0c0" align="right"><tt>1390.94</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.239</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>44.22</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.777</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>1610948</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.759</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>READ</b></td> <td bgcolor="#e0e0c0" align="right"><tt>1093.6</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.256</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 19.46</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.743 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>1610948</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.759</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>STATS</b></td> <td bgcolor="#e0e0c0" align="right"><tt>115.76</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.200</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>2.6</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.735</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>1610948</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.759</u> </font></tt></td> </tr> <tr> <td colspan="7" bgcolor="#a0a0a0"><b><font 
color="white">GAMMA=0.0 FILE_SIZE=4096 <a href="http://www.namesys.com/intbenchmarks/mongo/03.07.11.nikita/100.heavy.v3.profile">A profile</a> <a href="http://www.namesys.com/intbenchmarks/mongo/03.07.11.nikita/100.heavy.v4.profile">B profile</a></font></b></td></tr> <tr><td bgcolor="white" colspan="7"><font color="white"></font></td></tr> <tr><td bgcolor="white" colspan="7"><font color="white"></font></td></tr> <tr><td bgcolor="white" colspan="7"><font color="white"></font></td></tr> <tr> <th bgcolor="#303030" colspan="7" align="left"><font color="white">median file size 8k</font></th> </tr> <tr align="center" bgcolor="#c0c0c0"> <td></td> <td colspan="2"><b>REAL_TIME</b></td> <td colspan="2"><b>CPU_TIME</b></td> <td colspan="2"><b>DF</b></td> </tr> <tr align="center" bgcolor="#c0c0c0"> <td></td> <td><b>A</b></td><td><b>B/A </b></td> <td><b>A</b></td><td><b>B/A </b></td> <td><b>A</b></td><td><b>B/A </b></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>CREATE</b></td> <td bgcolor="#e0e0c0" align="right"><tt>40.54</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.248</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>4.01</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.895</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>321632</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.961</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>COPY</b></td> <td bgcolor="#e0e0c0" align="right"><tt>152.82</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.506</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 5.2</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.215 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>642624</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.962</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>READ</b></td> <td bgcolor="#e0e0c0" 
align="right"><tt>141.8</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.563</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 3.03</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.762 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>642624</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.962</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>STATS</b></td> <td bgcolor="#e0e0c0" align="right"><tt>14.91</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.084</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 0.59</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.051 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>642624</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.962</u> </font></tt></td> </tr> <tr><td bgcolor="black" colspan="7"><font color="white"></font></td></tr> <tr><td colspan="7" align="right"> </td></tr><tr> <td colspan="7" bgcolor="#303030"><b><font color="white">GAMMA=0.2 FILE_SIZE=8192</font></b></td></tr> <tr><td bgcolor="white" colspan="7"><font color="white"></font></td></tr> <tr><td bgcolor="white" colspan="7"><font color="white"></font></td></tr> <tr><td bgcolor="white" colspan="7"><font color="white"></font></td></tr> <tr> <th bgcolor="#303030" colspan="7" align="left"><font color="white">median file size 4k</font></th> </tr> <tr align="center" bgcolor="#c0c0c0"> <td></td> <td colspan="2"><b>REAL_TIME</b></td> <td colspan="2"><b>CPU_TIME</b></td> <td colspan="2"><b>DF</b></td> </tr> <tr align="center" bgcolor="#c0c0c0"> <td></td> <td><b>A</b></td><td><b>B/A </b></td> <td><b>A</b></td><td><b>B/A </b></td> <td><b>A</b></td><td><b>B/A </b></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>CREATE</b></td> <td bgcolor="#e0e0c0" align="right"><tt>115.6</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.174</u> 
</font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>14.84</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.772</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 667652</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.000</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>COPY</b></td> <td bgcolor="#e0e0c0" align="right"><tt>528.83</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.361</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 18.91</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.058 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 1332856</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.002</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>READ</b></td> <td bgcolor="#e0e0c0" align="right"><tt>532.06</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.372</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 10.87</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.589 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 1332856</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.002</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>STATS</b></td> <td bgcolor="#e0e0c0" align="right"><tt>51.99</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.069</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>1.67</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.581</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 1332856</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.002</u> </font></tt></td> </tr> <tr><td bgcolor="black" colspan="7"><font color="white"></font></td></tr> <tr><td colspan="7" align="right"> </td></tr><tr> <td colspan="7" 
bgcolor="#303030"><b><font color="white">GAMMA=0.2 FILE_SIZE=4096</font></b></td></tr> <tr><td bgcolor="white" colspan="7"><font color="white"></font></td></tr> <tr><td bgcolor="white" colspan="7"><font color="white"></font></td></tr> <tr><td bgcolor="white" colspan="7"><font color="white"></font></td></tr> <tr> <th bgcolor="#303030" colspan="7" align="left"><font color="white">maximal file size 4k</font></th> </tr> <tr align="center" bgcolor="#c0c0c0"> <td></td> <td colspan="2"><b>REAL_TIME</b></td> <td colspan="2"><b>CPU_TIME</b></td> <td colspan="2"><b>DF</b></td> </tr> <tr align="center" bgcolor="#c0c0c0"> <td></td> <td><b>A</b></td><td><b>B/A </b></td> <td><b>A</b></td><td><b>B/A </b></td> <td><b>A</b></td><td><b>B/A </b></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>CREATE</b></td> <td bgcolor="#e0e0c0" align="right"><tt>77.5</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.309</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>22.24</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.910</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>452252</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.923</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>COPY</b></td> <td bgcolor="#e0e0c0" align="right"><tt>415.84</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.297</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 34.9</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.009</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>893408</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.934</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>READ</b></td> <td bgcolor="#e0e0c0" align="right"><tt>469.97</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.273</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 
20.14</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.454 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>893408</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.934</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>STATS</b></td> <td bgcolor="#e0e0c0" align="right"><tt>65.49</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.162</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>3.09</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.599</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>893408</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.934</u> </font></tt></td> </tr> <tr><td bgcolor="black" colspan="7"><font color="white"></font></td></tr> <tr><td colspan="7" align="right"> </td></tr><tr> <td colspan="7" bgcolor="#303030"><b><font color="white">GAMMA=0.0 FILE_SIZE=4096</font></b></td></tr> </tbody></table> <hr> <h1>Mongo benchmark results</h1> <h2>create, copy, read, stats, delete phases</h2> <dl><dt>reiser4 </dt><dd>ChangeSet@1.1095, 2003-07-10 15:22:17+04:00, god@laputa.namesys.com oops ChangeSet@1.1094, 2003-07-10 15:14:06+04:00, god@laputa.namesys.com repairing compilation damage. 
</dd><dt>mem total</dt><dd>256624</dd><dt>machine </dt><dd>belka</dd><dt>kernel </dt><dd>2.5.74 #28 Thu Jul 10 18:36:03 MSD 2003</dd><dt>date </dt><dd>Thu Jul 10 19:21:06 2003</dd><dt><a href="http://namesys.com/intbenchmarks/mongo/03.07.11.light/dot.config">.config</a></dt></dl> <table cols="19" cellpadding="2" cellspacing="2" noborder=""> <tbody><tr><td bgcolor="black" colspan="19"><font color="white"></font></td></tr> <tr> <th bgcolor="#303030" colspan="19" align="left"><font color="white">A.INFO_R4=test FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor="#303030" colspan="19" align="left"><font color="white">B.INFO_R4=test FSTYPE=reiser4 MKFS=mkfs.reiser4 -q -e extent40 </font></th> </tr> <tr> <th bgcolor="#303030" colspan="19" align="left"><font color="white">C.FSTYPE=reiserfs </font></th> </tr> <tr> <th bgcolor="#303030" colspan="19" align="left"><font color="white">D.FSTYPE=reiserfs MOUNT_OPTIONS=notail </font></th> </tr> <tr> <th bgcolor="#303030" colspan="19" align="left"><font color="white">E.FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor="#303030" colspan="19" align="left"><font color="white">F.FSTYPE=ext3 MOUNT_OPTIONS=data=journal </font></th> </tr> <tr> <td colspan="19" bgcolor="#606060"><b><font color="white">#0:FILE_SIZE=4000 </font></b></td></tr> <tr align="center" bgcolor="#c0c0c0"> <td></td> <td colspan="6"><b>REAL_TIME</b></td> <td colspan="6"><b>CPU_TIME</b></td> <td colspan="6"><b>DF</b></td> </tr> <tr align="center" bgcolor="#c0c0c0"> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>CREATE</b></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 20.47</u></tt></td> <td bgcolor="#e0e0c0" 
align="right"><tt><font color="red"> 1.404 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 3.037 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 2.024 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 2.513 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 3.324 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>12.72</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.143 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.270 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.873 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.615</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.606</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 416332</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.934 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.088 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.909 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.858 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.858 </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>COPY</b></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 65.25</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.484 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 2.953 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 2.020 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.986 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 2.267 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>21.98</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font 
color="red"> 1.032 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.098 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.732 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.529</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.699 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 832640</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.934 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.088 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.910 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.858 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.858 </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>READ</b></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 75.56</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.349 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 2.868 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 2.218 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.902 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.925 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>17.36</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.213 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.745 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.857 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.695 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.681</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 832640</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font 
color="red"> 1.934 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.088 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.910 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.858 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.858 </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>STATS</b></td> <td bgcolor="#e0e0c0" align="right"><tt>132.18</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> 0.996 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.963</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> 0.994 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.967</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.950</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>2.63</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.977</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.970</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 0.989</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 0.981</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> 1.008 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 832640</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.934 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.088 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.910 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.858 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.858 </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>DELETE</b></td> <td bgcolor="#e0e0c0" 
align="right"><tt>85.32</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.627 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.239 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.442 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.403</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.449 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>33.57</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.856 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.780 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.623 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.157</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.154</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>4</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> 1.000 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.000</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.000</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.000</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.000</u> </font></tt></td> </tr> <tr> <td colspan="19" bgcolor="#606060"><b><font color="white">#1:FILE_SIZE=8000 </font></b></td></tr> <tr align="center" bgcolor="#c0c0c0"> <td></td> <td colspan="6"><b>REAL_TIME</b></td> <td colspan="6"><b>CPU_TIME</b></td> <td colspan="6"><b>DF</b></td> </tr> <tr align="center" bgcolor="#c0c0c0"> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A 
</b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>CREATE</b></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 15.07</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.009</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 8.875 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.709 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 2.237 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 3.321 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>8.62</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.945 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.932 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.729 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.517</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.522</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 399788</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.000</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.243 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.461 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.434 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.434 </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>COPY</b></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 52.24</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.007</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 4.998 </font></tt></td> <td bgcolor="#e0e0c0" 
align="right"><tt><font color="red"> 1.492 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.562 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.879 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>13.42</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.026 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.264 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.700 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.487</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.635 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 799488</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.000</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.243 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.461 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.434 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.434 </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>READ</b></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 60.91</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.013</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 3.738 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.606 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.333 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.340 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>11.66</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> 1.018 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.526</u> </font></tt></td> <td bgcolor="#e0e0c0" 
align="right"><tt><font color="green"> 0.749 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.547 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.547 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 799488</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.000</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.243 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.461 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.434 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.434 </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>STATS</b></td> <td bgcolor="#e0e0c0" align="right"><tt>126.53</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.951</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.958</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> 0.991 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> 1.004 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.966</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 2.57</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.023 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.027 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 0.988</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> 1.016 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> 1.012 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 799488</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.000</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.243 
</font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.461 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.434 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.434 </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>DELETE</b></td> <td bgcolor="#e0e0c0" align="right"><tt>73.21</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.116 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.746 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.242</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.301 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.396 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>19.93</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> 1.013 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.584 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.530 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.126 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.123</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>4</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> 1.000 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.000</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.000</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.000</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.000</u> </font></tt></td> </tr> <tr><td bgcolor="black" colspan="19"><font color="white"></font></td></tr> <tr><td colspan="19" align="right"> </td></tr><tr> <td colspan="19" bgcolor="#303030"><b><font 
color="white">PHASE_APPEND=off NPROC=1 DIR=/mnt/testfs SYNC=off REP_COUNTER=3 GAMMA=0.0 PHASE_OVERWRITE=off DEV=/dev/hdb3 WRITE_BUFFER=4096 BYTES=128000000 PHASE_MODIFY=off </font></b></td></tr> <tr><td colspan="19" align="right"> <font size="-2">Produced by <a href="http://namesys.com/benchmarks/mongo_readme.html">Mongo</a> benchmark suite.</font></td></tr> </tbody></table> <h2>dd of a large file phase</h2> <dl><dt>reiser4 </dt><dd>ChangeSet@1.1095, 2003-07-10 15:22:17+04:00, god@laputa.namesys.com oops ChangeSet@1.1094, 2003-07-10 15:14:06+04:00, god@laputa.namesys.com repairing compilation damage. </dd><dt>mem total</dt><dd>256624</dd><dt>machine </dt><dd>belka</dd><dt>kernel </dt><dd>2.5.74 #28 Thu Jul 10 18:36:03 MSD 2003</dd><dt>date </dt><dd>Thu Jul 10 21:36:22 2003</dd><dt><a href="http://namesys.com/intbenchmarks/mongo/03.07.11.light/dot.config">.config</a></dt></dl> <table cols="19" cellpadding="2" cellspacing="2" noborder=""> <tbody><tr><td bgcolor="black" colspan="19"><font color="white"></font></td></tr> <tr> <th bgcolor="#303030" colspan="19" align="left"><font color="white">A.INFO_R4=test FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor="#303030" colspan="19" align="left"><font color="white">B.INFO_R4=test FSTYPE=reiser4 MKFS=mkfs.reiser4 -q -e extent40 </font></th> </tr> <tr> <th bgcolor="#303030" colspan="19" align="left"><font color="white">C.FSTYPE=reiserfs </font></th> </tr> <tr> <th bgcolor="#303030" colspan="19" align="left"><font color="white">D.FSTYPE=reiserfs MOUNT_OPTIONS=notail </font></th> </tr> <tr> <th bgcolor="#303030" colspan="19" align="left"><font color="white">E.FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor="#303030" colspan="19" align="left"><font color="white">F.FSTYPE=ext3 MOUNT_OPTIONS=data=journal </font></th> </tr> <tr> <td colspan="19" bgcolor="#606060"><b><font color="white">#0:DD_MBCOUNT=768 </font></b></td></tr> <tr align="center" bgcolor="#c0c0c0"> <td></td> <td colspan="6"><b>REAL_TIME</b></td> <td 
colspan="6"><b>CPU_TIME</b></td> <td colspan="6"><b>DF</b></td> </tr> <tr align="center" bgcolor="#c0c0c0"> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>dd_writing_largefile</b></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 76.29</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 0.997</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.137 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.149 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.062 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 2.217 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>7.47</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.027 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.545</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.549</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.803 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.835 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 786432</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.000</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.001</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.001</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.001</u> </font></tt></td> <td bgcolor="#e0e0c0" 
align="right"><tt><font color="black"> <u> 1.001</u> </font></tt></td> </tr> <tr><td bgcolor="black" colspan="19"><font color="white"></font></td></tr> <tr><td colspan="19" align="right"> </td></tr><tr> <td colspan="19" bgcolor="#303030"><b><font color="white">NPROC=1 DIR=/mnt/testfs SYNC=off REP_COUNTER=3 GAMMA=0.0 DD_MBCOUNT=768 DEV=/dev/hdb3 WRITE_BUFFER=4096 FILE_SIZE=8000 BYTES=128000000 </font></b></td></tr> <tr><td colspan="19" align="right"> <font size="-2">Produced by <a href="http://namesys.com/benchmarks/mongo_readme.html">Mongo</a> benchmark suite.</font></td></tr> </tbody></table> <hr> <a name="bonnie++.2003.09.30"> This is bonnie++ output for reiser4 and ext3. This has been done in an attempt to analyze <a href="http://fsbench.netnation.com/">results</a> obtained by Mike Benoit. Hardware specs: <pre> processor : 3 vendor_id : GenuineIntel cpu family : 15 model : 2 model name : Intel(R) Xeon(TM) CPU 2.40GHz stepping : 7 cpu MHz : 2379.253 cache size : 512 KB bogomips : 4751.36 </pre> Dual CPU with hyper-threading Memory: 128M HDD: <pre> # hdparm /dev/hdb1 /dev/hdb1: multcount = 16 (on) IO_support = 0 (default 16-bit) unmaskirq = 0 (off) using_dma = 1 (on) keepsettings = 0 (off) readonly = 0 (off) readahead = 256 (on) geometry = 65535/16/63, sectors = 117226242, start = 63 # hdparm -t /dev/hdb1 /dev/hdb1: Timing buffered disk reads: 64 MB in 1.60 seconds = 39.91 MB/sec # hdparm -i /dev/hdb /dev/hdb: Model=ST360021A, FwRev=3.19, SerialNo=3HR173RB Config={ HardSect NotMFM HdSw>15uSec Fixed DTR>10Mbs RotSpdTol>.5% } RawCHS=16383/16/63, TrkSize=0, SectSize=0, ECCbytes=4 BuffType=unknown, BuffSize=2048kB, MaxMultSect=16, MultSect=16 CurCHS=16383/16/63, CurSects=16514064, LBA=yes, LBAsects=117231408 IORDY=on/off, tPIO={min:240,w/IORDY:120}, tDMA={min:120,rec:120} PIO modes: pio0 pio1 pio2 pio3 pio4 DMA modes: mdma0 mdma1 mdma2 UDMA modes: udma0 udma1 udma2 udma3 udma4 *udma5 AdvancedPM=no WriteCache=enabled Drive conforms to: device does not report version: 1 
2 3 4 5 </pre> <pre> ./bonnie++ -s 1g -n 10 -x 5 Version 1.03 ------Sequential Output------ --Sequential Input- --Random- -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks-- Machine Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP v4.128M 1G 19903 89 37911 20 15392 11 13624 58 41807 12 131.0 0 v4.128M 1G 19965 89 37600 20 15845 11 13730 58 41751 12 130.0 0 v4.128M 1G 19937 89 37746 20 15404 11 13624 58 41793 12 132.1 0 v4.128M 1G 19998 89 37184 19 15007 10 13393 56 41611 11 130.2 0 v4.128M 1G 19771 89 37679 20 15206 11 13466 57 41808 11 130.2 1 ext3.128M 1G 21236 99 37258 22 11357 4 13460 56 41748 6 120.0 0 ext3.128M 1G 20821 99 36838 23 12176 5 13154 55 40671 6 120.7 0 ext3.128M 1G 20755 99 37032 24 12069 4 12908 54 40851 5 120.2 0 ext3.128M 1G 20651 99 37094 24 11817 5 13038 54 40842 6 121.3 0 ext3.128M 1G 20928 99 37300 23 12287 4 13067 55 41404 6 120.1 0 ------Sequential Create------ --------Random Create-------- -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete-- files:max:min /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP v4.128M 10 18503 100 +++++ +++ 9488 99 10158 99 +++++ +++ 11635 99 v4.128M 10 19760 99 +++++ +++ 9696 99 10441 100 +++++ +++ 11831 99 v4.128M 10 19583 100 +++++ +++ 9672 100 10597 99 +++++ +++ 11846 100 v4.128M 10 19720 100 +++++ +++ 9577 99 10126 100 +++++ +++ 11924 100 v4.128M 10 19682 100 +++++ +++ 9683 100 10461 100 +++++ +++ 11834 100 ext3.128M 10 3279 97 +++++ +++ +++++ +++ 3406 100 +++++ +++ 8951 95 ext3.128M 10 3303 98 +++++ +++ +++++ +++ 3423 99 +++++ +++ 8558 96 ext3.128M 10 3317 98 +++++ +++ +++++ +++ 3402 100 +++++ +++ 8721 93 ext3.128M 10 3325 98 +++++ +++ +++++ +++ 3390 100 +++++ +++ 9242 100 ext3.128M 10 3315 97 +++++ +++ +++++ +++ 3439 100 +++++ +++ 8896 96 </pre> <pre> ./bonnie++ -f -d . 
-s 3072 -n 10:100000:10:10 -x 1 Version 1.03 ------Sequential Output------ --Sequential Input- --Random- -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks-- Machine Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP v4 3G 37579 19 15657 11 41531 11 105.8 0 v4 3G 37993 20 15478 11 41632 11 105.4 0 ext3 3G 35221 22 10987 4 41105 6 90.9 0 ext3 3G 35099 22 11517 4 41416 6 90.7 0 ------Sequential Create------ --------Random Create-------- -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete-- files:max:min /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP v4 10:100000:10/10 570 39 746 17 1435 23 513 40 104 2 951 15 v4 10:100000:10/10 566 40 765 17 1385 23 509 41 104 2 904 14 ext3 10:100000:10/10 221 8 364 4 853 4 204 7 99 1 306 2 ext3 10:100000:10/10 221 7 368 4 839 5 206 7 91 1 309 2 </pre> <hr> <a name="grant"></a> Benchmarks performed by <a href="mailto:mine0057@mrs.umn.edu">Grant Miner</a>. He used the <a href="http://epoxy.mrs.umn.edu/~minerg/fstests/bench.scm">bench.scm</a> script (requires <a href="http://www.scsh.net/">scsh</a>). Results (copied from <a href="http://epoxy.mrs.umn.edu/~minerg/fstests/results.html">http://epoxy.mrs.umn.edu/~minerg/fstests/results.html</a>): <p>2.6.0-test3</p> <p>mkfs was run with default options</p> <p>Each test has three columns. The first is the canonical name of the test, with the time the test took in seconds. The second column is system CPU time; the third is user CPU time. The last column, "total", is total time; sys is total sys time; usr is total usr time; total cpu is the sum of total sys time and total usr time.
</p> <p><b>all values are in seconds thus lower is better</b></p> <table border cellspacing=0 cellpadding=5> <caption>Filesystem Performance</caption> <colgroup> <col> <col bgcolor="gray"> </colgroup> <tr> <th>fs</th> <td bgcolor="lightgray">bigdir</td> <td>sys</td> <td>usr</td> <td bgcolor="lightgray">cp</td> <td>sys</td> <td>usr</td> <td bgcolor="lightgray">cp2</td> <td>sys</td> <td>usr</td> <td bgcolor="lightgray">cp3</td> <td>sys</td> <td>usr</td> <td bgcolor="lightgray">cp4</td> <td>sys</td> <td>usr</td> <td bgcolor="lightgray">cp5</td> <td>sys</td> <td>usr</td> <td bgcolor="lightgray">rm</td> <td>sys</td> <td>usr</td> <td bgcolor="lightgray">rm2</td> <td>sys</td> <td>usr</td> <td bgcolor="lightgray">rm3</td> <td>sys</td> <td>usr</td> <td bgcolor="lightgray">sync</td> <td>sys</td> <td>usr</td> <td bgcolor="lightgray">total</td> <td>sys</td> <td>usr</td> <td bgcolor="lightgray">total cpu</td> <th>fs</th> </tr> <tr> <th>reiserfs</th> <td bgcolor="lightgray">40.03</td> <td>12.22</td> <td>0.76</td> <td bgcolor="lightgray">77.75</td> <td>10.72</td> <td>0.45</td> <td bgcolor="lightgray">62.9</td> <td>10.82</td> <td>0.43</td> <td bgcolor="lightgray">60.26</td> <td>11.03</td> <td>0.43</td> <td bgcolor="lightgray">61.33</td> <td>11.13</td> <td>0.43</td> <td bgcolor="lightgray">66.08</td> <td>11.31</td> <td>0.45</td> <td bgcolor="lightgray">10.86</td> <td>3.74</td> <td>0.07</td> <td bgcolor="lightgray">4.62</td> <td>3.36</td> <td>0.09</td> <td bgcolor="lightgray">8.22</td> <td>3.5</td> <td>0.09</td> <td bgcolor="lightgray">1.78</td> <td>0.03</td> <td>0.</td> <td bgcolor="lightgray">393.83</td> <td>77.86</td> <td>3.2</td> <td bgcolor="lightgray">81.06</td> <th>reiserfs</th> </tr> <tr> <th>jfs</th> <td bgcolor="lightgray">47.2</td> <td>8.9</td> <td>0.77</td> <td bgcolor="lightgray">109.75</td> <td>5.5</td> <td>0.3</td> <td bgcolor="lightgray">110.71</td> <td>5.49</td> <td>0.35</td> <td bgcolor="lightgray">114.69</td> <td>5.6</td> <td>0.29</td> <td 
bgcolor="lightgray">117.97</td> <td>5.65</td> <td>0.35</td> <td bgcolor="lightgray">125.48</td> <td>5.82</td> <td>0.29</td> <td bgcolor="lightgray">38.68</td> <td>0.74</td> <td>0.05</td> <td bgcolor="lightgray">16.25</td> <td>1.08</td> <td>0.07</td> <td bgcolor="lightgray">37.46</td> <td>0.74</td> <td>0.04</td> <td bgcolor="lightgray">0.07</td> <td>0.</td> <td>0.</td> <td bgcolor="lightgray">718.26</td> <td>39.52</td> <td>2.51</td> <td bgcolor="lightgray">42.03</td> <th>jfs</th> </tr> <tr> <th>xfs</th> <td bgcolor="lightgray">44.77</td> <td>13.3</td> <td>0.94</td> <td bgcolor="lightgray">105.36</td> <td>13.33</td> <td>0.53</td> <td bgcolor="lightgray">110.27</td> <td>14.36</td> <td>0.5</td> <td bgcolor="lightgray">110.17</td> <td>14.37</td> <td>0.51</td> <td bgcolor="lightgray">111.03</td> <td>14.43</td> <td>0.53</td> <td bgcolor="lightgray">118.84</td> <td>14.87</td> <td>0.55</td> <td bgcolor="lightgray">31.85</td> <td>6.44</td> <td>0.15</td> <td bgcolor="lightgray">15.2</td> <td>5.45</td> <td>0.14</td> <td bgcolor="lightgray">34.32</td> <td>5.87</td> <td>0.14</td> <td bgcolor="lightgray">0.03</td> <td>0.</td> <td>0.</td> <td bgcolor="lightgray">681.84</td> <td>102.42</td> <td>3.99</td> <td bgcolor="lightgray">106.41</td> <th>xfs</th> </tr> <tr> <th>reiser4</th> <td bgcolor="lightgray">33.51</td> <td>10.85</td> <td>0.69</td> <td bgcolor="lightgray">33.9</td> <td>10.65</td> <td>0.65</td> <td bgcolor="lightgray">32.9</td> <td>10.79</td> <td>0.67</td> <td bgcolor="lightgray">34.</td> <td>10.87</td> <td>0.65</td> <td bgcolor="lightgray">33.62</td> <td>10.87</td> <td>0.69</td> <td bgcolor="lightgray">31.31</td> <td>10.83</td> <td>0.76</td> <td bgcolor="lightgray">17.45</td> <td>4.07</td> <td>0.3</td> <td bgcolor="lightgray">11.54</td> <td>4.49</td> <td>0.3</td> <td bgcolor="lightgray">13.08</td> <td>4.27</td> <td>0.27</td> <td bgcolor="lightgray">0.52</td> <td>0.</td> <td>0.</td> <td bgcolor="lightgray">241.83</td> <td>77.69</td> <td>4.98</td> <td 
bgcolor="lightgray">82.67</td> <th>reiser4</th> </tr> <tr> <th>ext3</th> <td bgcolor="lightgray">38.79</td> <td>9.35</td> <td>0.7</td> <td bgcolor="lightgray">91.57</td> <td>7.21</td> <td>0.36</td> <td bgcolor="lightgray">62.6</td> <td>7.44</td> <td>0.36</td> <td bgcolor="lightgray">62.74</td> <td>7.5</td> <td>0.37</td> <td bgcolor="lightgray">60.62</td> <td>7.52</td> <td>0.34</td> <td bgcolor="lightgray">69.82</td> <td>7.59</td> <td>0.39</td> <td bgcolor="lightgray">26.21</td> <td>1.67</td> <td>0.05</td> <td bgcolor="lightgray">8.73</td> <td>1.66</td> <td>0.04</td> <td bgcolor="lightgray">13.79</td> <td>1.63</td> <td>0.06</td> <td bgcolor="lightgray">4.76</td> <td>0.01</td> <td>0.</td> <td bgcolor="lightgray">439.63</td> <td>51.58</td> <td>2.67</td> <td bgcolor="lightgray">54.25</td> <th>ext3</th> </tr> <tr> <th>ext2</th> <td bgcolor="lightgray">32.78</td> <td>7.61</td> <td>0.64</td> <td bgcolor="lightgray">37.28</td> <td>5.24</td> <td>0.34</td> <td bgcolor="lightgray">43.55</td> <td>5.34</td> <td>0.35</td> <td bgcolor="lightgray">45.41</td> <td>5.34</td> <td>0.37</td> <td bgcolor="lightgray">47.72</td> <td>5.48</td> <td>0.34</td> <td bgcolor="lightgray">50.5</td> <td>5.41</td> <td>0.32</td> <td bgcolor="lightgray">16.28</td> <td>0.67</td> <td>0.06</td> <td bgcolor="lightgray">7.54</td> <td>0.66</td> <td>0.05</td> <td bgcolor="lightgray">15.31</td> <td>0.71</td> <td>0.05</td> <td bgcolor="lightgray">0.24</td> <td>0.</td> <td>0.</td> <td bgcolor="lightgray">296.61</td> <td>36.46</td> <td>2.52</td> <td bgcolor="lightgray">38.98</td> <th>ext2</th> </tr> </table> <hr> </body> </html> <hr> <address><a href="mailto:reiser@namesys.com">Hans Reiser</a></address> <!-- Created: Sat Aug 23 00:28:46 MSD 2003 --> <!-- hhmts start --> Last modified: Thu Nov 20 17:51:10 MSK 2003 <!-- hhmts end --> </body> <SCRIPT language="Javascript"> <!-- // FILE ARCHIVED ON 20061113154648 AND RETRIEVED FROM THE // INTERNET ARCHIVE ON 20090625075531. 
//--> </SCRIPT> </html> [[category:ReiserFS]] 0bb312e7eae551cfb56fcd74bb2874f740e9386d 1493 1378 2009-06-27T09:20:05Z Chris goe 2 formatting fixes == Benchmarks Of Reiser4 == The <tt>htree</tt> (<tt>-O dir_index</tt>) feature is a recent attempt by ext3 developers to handle large directories as well as ReiserFS does, by using better-than-linear search algorithms.
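The large-directory effect that <tt>htree</tt> targets can be probed with a trivial lookup micro-benchmark. The sketch below is illustrative only and is not part of the mongo suite discussed on this page; the function and file names are arbitrary, and on a warm dentry cache it measures cached-lookup cost rather than on-disk directory search, so treat it as a rough probe:

```python
# Illustrative sketch: time stat() lookups in directories of different
# sizes. On a filesystem with linear directory search the per-lookup
# time grows with the entry count; with an indexed directory (htree,
# a balanced tree, etc.) it should stay roughly flat.
import os
import tempfile
import time

def lookup_benchmark(nfiles, nlookups=1000):
    """Create `nfiles` empty files, then time `nlookups` stat() calls."""
    with tempfile.TemporaryDirectory() as d:
        for i in range(nfiles):
            open(os.path.join(d, "f%06d" % i), "w").close()
        names = ["f%06d" % (i % nfiles) for i in range(nlookups)]
        t0 = time.perf_counter()
        for name in names:
            os.stat(os.path.join(d, name))
        # mean seconds per lookup
        return (time.perf_counter() - t0) / nlookups

if __name__ == "__main__":
    for n in (100, 5000):
        print("%5d entries: %.2f us/lookup" % (n, lookup_benchmark(n) * 1e6))
```

Comparing the per-lookup time at a small versus a large entry count (ideally on a cold cache, on the filesystem under test) gives a feel for whether directory lookup cost grows with directory size.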
One of the interesting results was that <tt>htree</tt> does bad things to ext3 performance, at least for this benchmark. This means that trying to get usable performance for large directories with ext3 can severely impact performance for the non-large case. You'll note that in our latest benchmark at the top we use larger filesets. It seems that ext3 does a poor job of utilizing its write cache when the fileset uses a lot of memory without exceeding it, and by increasing the size of the fileset we get a fairer (read: better for ext3) benchmark for the create phase. The use of filesets small enough to barely fit into RAM for the create (but not the copy) phase was due to my being lax in supervising the benchmarking, but it did reveal something interesting. Andrew Morton will probably fix that pretty quickly; it's most likely not a deep fix to make, as fixing <tt>htree</tt> would be. If anyone knows where the tail-combining patch for ext3 went, let us know so we can benchmark it; good tail-combining performance is not trivial to get right, and I wonder whether there is a performance reason it did not go in. Keep in mind that these benchmarks are still evolving and maturing, and I need to give the mongo code a complete review again, as it has been worked on by others quite a bit. While I like the mongo benchmarks, those concerned that they may be stacked in our favor can look at the benchmarks run by others on lkml; one of them, included at the bottom of this page, while not as elaborate and detailed as mongo, comes up with roughly the same result. Andrew Morton wrote some beautiful readahead code in the VM, and many thanks to him for what it contributes to V4 performance; unfortunately, it must be confessed that these benchmarks utterly fail to measure its cleverness for real-world usage patterns.
In fact, these benchmarks basically access everything once in each pass, which is not at all realistic in representing typical server workloads. So understand them as validly illuminating some aspects of performance, not all aspects, if you could be so generous.

We ran <tt>data=ordered</tt> ext3 benchmarks at the suggestion of Andrew Morton, but they came out slower for this benchmark. We need to increase the base size range to 8k and run again.

[[Reiser4]] is a fully atomic filesystem; keep in mind that these performance numbers are with every FS operation performed as a fully atomic transaction. We are the first to make doing so performance-effective. Look for a user-space transactions interface to come out soon.

Finally, remember that Reiser4 is more space-efficient than [[ReiserFS]]; the <tt>df(1)</tt> measurements are there for looking at... ;-)

=== Results ===
<html> <hr> <ul>
<li><font color=red>linux-2.6.15-mm4</font> : mongo <a href="#mongo.2.6.15-mm4"> comparison</a> <tt>ext3 vs reiser4 with "unixfile" regular file plugin and reiser4 with "cryptcompress" regular file plugin</tt> </li>
<li>linux-2.6.11 : mongo <a href="#mongo.2.6.11"> comparison</a> against <tt>xfs and ext2</tt> </li>
<li>linux-2.6.8.1-mm3 : mongo <a href="#mongo.2.6.8.1-mm3"> comparison</a> against <tt>ext3</tt> </li>
<li>2004.03.26 slow.c <a href="#slow.2004.03.26">comparison</a> against <tt>ext2, ext3</tt> </li>
<li>2003.11.20 mongo <a href="#mongo.2003.11.20">comparison</a> against <tt>ext3</tt> </li>
<li>Bonnie++ <a href="#bonnie++.2003.09.30">comparison</a> of <tt>reiser4</tt> and <tt>ext3</tt> done at 2003.09.30.
</li>
<li>2003.09.25 mongo <a href="#mongo.2003.09.25">comparison</a> against <tt>ext3</tt> </li>
<!-- <li>2003.08.28 mongo <a href="#mongo.2003.08.28">comparison</a> against <tt>ext3</tt> </li> <li>2003.08.27 mongo <a href="#mongo.2003.08.27">comparison</a> against <tt>ext3</tt> </li> <li>2003.08.26 mongo <a href="#mongo.2003.08.26">comparison</a> against <tt>ext3</tt> </li> <li>2003.08.18 mongo <a href="#mongo.2003.08.18">comparison</a> against <tt>ext3</tt> </li> <li>2003.08.12 mongo <a href="#mongo.2003.08.12">comparison</a> against <tt>ext3</tt> </li> -->
<li>Older mongo <a href="#mongo.2003.08.28">results</a> (2003.08.28).</li>
<li>mongo <a href="#mongo.2003.07.10">results</a> obtained before LinuxTAG (2003.07.10). Here reiser4 is compared with reiserfs.</li>
<li>External benchmarks <a href="#grant">by Grant Miner</a>.</li>
</ul> <hr>
<a name="mongo.2.6.15-mm4"></a> linux-2.6.15-mm4 <a href="benchmarks/mongo_readme.html">mongo</a> results
<p><b>Comparative results of mongo benchmark for ext3 vs reiser4 with "unixfile" regular file plugin vs reiser4 with "cryptcompress" regular file plugin</b></p>
<p>The cryptcompress patch against 2.6.15-mm4 and a new version of reiser4progs are available from <br> ftp://ftp.namesys.com/pub/tmp/cryptcompress_patches </p>
<dl> <dt>reiser4 </dt> <dd>2.6.15-mm4 cryptcompress-4.patch</dd> <dt>mem total</dt> <dd>516312</dd> <dt>machine </dt> <dd>Intel(R) Xeon(TM) CPU 2.40GHz, <b>running UP kernel</b></dd> <dt>kernel </dt> <dd>2.6.15-mm4 #1 Sat Feb 11 20:00:11 MSK 2006</dd> <dt>date </dt> <dd>Sat Feb 11 21:03:21 2006</dd> <dd>Sat Feb 11 21:18:43 2006</dd> <dd>Sat Feb 11 21:37:52 2006</dd> </dl>
<p>Legend:</p> <ul> <li><tt>A</tt> reiser4 with "cryptcompress" regular file plugin</li> <li><tt>B</tt> reiser4 with "unixfile" regular file plugin</li> <li><tt>C</tt> ext3</li> </ul>
<p> The table presents absolute values (of elapsed time, CPU usage, CPU utilization, disk usage) for reiser4 with "cryptcompress" regular file plugin, and ratios against this
reiser4 for reiser4 with "unixfile" regular file plugin and ext3. <font color=red>Red</font> number means ratio is larger than <tt>1.0</tt>, that is, reiser4 with "cryptcompress" regular file plugin is better in this test. <font color=green>Green</font> number means that it loses in this test. </p> <table cols=13 cellpadding=2 cellspacing=2 noborder> <tr><td bgcolor=black colspan=13><font color=white></td></tr> <tr> <th bgcolor=#303030 colspan=13 align=left><font color=white>A.MKFS=mkfs.reiser4 -y -o create=create_ccreg40,compressMode=col8 MOUNT_OPTIONS=noatime FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=13 align=left><font color=white>B.MKFS=mkfs.reiser4 -y MOUNT_OPTIONS=noatime FSTYPE=reiser4 (unixfile regular file plugin)</font></th> </tr> <tr> <th bgcolor=#303030 colspan=13 align=left><font color=white>C.MOUNT_OPTIONS=noatime,data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <td colspan=13 bgcolor=#606060><b><font color=white>#0:</font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=3><b>REAL_TIME</b></td> <td colspan=3><b>CPU_TIME</b></td> <td colspan=3><b>CPU_UTIL</b></td> <td colspan=3><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>CREATE</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 53.36</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.234 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 4.249 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>28.79</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.493</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 
align=right><tt>94.36</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.255 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.155</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 775856</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.550 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.825 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>COPY</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 137.6</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.543 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.931 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>40.91</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.716</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.975 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>59.94</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.257 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.183</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1551756</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.550 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.825 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>READ</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 161.17</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.087 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.077 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>48.35</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.433 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.195</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>33.23</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font 
color=green> 0.487 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.291</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1551756</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.550 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.825 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>STATS</b></td> <td bgcolor=#E0E0C0 align=right><tt>24.12</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.936</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.927</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>6.76</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.941 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.624</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>27.97</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.005 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.676</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1551756</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.550 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.825 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>DELETE</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 155.26</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.091 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 0.989</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>38.76</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.824 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.108</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>26.33</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.758 </font></tt></td> <td bgcolor=#E0E0C0 
align=right><tt><font color=green> <U> 0.104</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>4</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.000 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> </tt></td> </tr> <tr> <td colspan=13 bgcolor=#606060><b><font color=white>#1:DD_MBCOUNT=5000 </font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=3><b>REAL_TIME</b></td> <td colspan=3><b>CPU_TIME</b></td> <td colspan=3><b>CPU_UTIL</b></td> <td colspan=3><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_writing_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 116.02</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.430 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.553 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>38.65</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.514</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.619 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>92.86</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.155 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.149</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1909012</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.682 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.685 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_reading_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 153.76</U></tt></td> <td bgcolor=#E0E0C0 
align=right><tt><font color=black> <U> 0.996</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>58.11</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.192 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.147</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>38.73</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.224 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.152</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1909012</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.682 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.685 </font></tt></td> </tt></td> </tr> <tr><td bgcolor=black colspan=13><font color=white></td></tr> <tr><td colspan=13 align=right> <tr> <td colspan=13 bgcolor=#303030><b><font color=white>DIR=/mnt1 GAMMA=0.2 WRITE_BUFFER=131072 PHASE_APPEND=off SYNC=off PHASE_DELETE=rm NPROC=1 DEV=/dev/hda9 DD_MBCOUNT=5000 FILE_SIZE=8192 REP_COUNTER=1 PHASE_COPY=cp INFO_R4=2.6.15-mm4 cryptcompress-4.patch PHASE_READ=find BYTES=1024000000 PHASE_OVERWRITE=off PHASE_MODIFY=off </td></tr> <tr><td colspan=13 align=right> <font size=-2>Produced by <a href=http://namesys.com/benchmarks/mongo_readme.html>Mongo</a> benchmark suite.</font></td></tr> </table> <!-- <p><b>Legend:</b> <font color="green">green</font> color means the result is better (less) than reference value from the first column, results marked as <font color="red">red</font> are worse than reference value, best results are <u>underlined</u> other results which fit into 2% margin of the best result are underlined also.</p> --><p><a href="http://www.namesys.com/intbenchmarks/mongo/06.02.11.belka.crc/charts/comp.html">The same results in the charts</a></p> <hr> <a name="mongo.2.6.11"></a> linux-2.6.11 <a 
href="benchmarks/mongo_readme.html">mongo</a> results
<dl> <dt>reiser4 </dt> <dd>reiser4-for-2.6.11-5.patch from <a href="ftp://ftp.namesys.com/pub/reiser4-for-2.6/2.6.11">ftp://ftp.namesys.com/pub/reiser4-for-2.6/2.6.11</a> </dd> <dt>mem total</dt> <dd>254496</dd> <dt>machine </dt> <dd>bones</dd> <dt>kernel </dt> <dd>2.6.11-reiser4-5 #2 SMP Sat Jun 4 20:06:47 MSD 2005</dd> <dt>date </dt> <dd>Fri Jun 17 23:52:17 2005</dd> </dl>
<p> In this test 81% of files are chosen from the 0-10k size range and 19% from the 10-100k size range. </p>
<!-- File stats: Units are decimal (1k = 1000) files 0-100 : 1433 files 100-1K : 12597 files 1K-10K : 103101 files 10K-100K : 28131 files 100K-1M : 0 files 1M-10M : 0 files 10M-larger : 0 total bytes written : 1886585039 -->
<p>Legend:</p> <ul> <li><tt>A</tt> reiser4</li> <li><tt>B</tt> reiserfs <tt>v3 (notail)</tt></li> <li><tt>C</tt> ext2</li> <li><tt>D</tt> xfs default</li> </ul>
<p> The table presents absolute values (of elapsed time, CPU usage, CPU utilization, disk usage) for reiser4, and ratios against reiser4 for all other configurations. <font color=red>Red</font> number means ratio is larger than <tt>1.0</tt>, that is, reiser4 is better in this test. <font color=green>Green</font> number means that reiser4 loses in this test.
</p> <table cols=17 cellpadding=2 cellspacing=2 noborder> <tr><td bgcolor=black colspan=17><font color=white></td></tr> <tr> <th bgcolor=#303030 colspan=17 align=left><font color=white>A.FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=17 align=left><font color=white>B.FSTYPE=reiserfs MOUNT_OPTIONS=notail </font></th> </tr> <tr> <th bgcolor=#303030 colspan=17 align=left><font color=white>C.FSTYPE=ext2 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=17 align=left><font color=white>D.MKFS=mkfs.xfs -f FSTYPE=xfs </font></th> </tr> <tr> <td colspan=17 bgcolor=#606060><b><font color=white>#0:</font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=4><b>REAL_TIME</b></td> <td colspan=4><b>CPU_TIME</b></td> <td colspan=4><b>CPU_UTIL</b></td> <td colspan=4><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>CREATE</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 66.12</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.022 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.686 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 4.288 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>34.98</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.901</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.114 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.445 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>29.86</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.424 </font></tt></td> <td bgcolor=#E0E0C0 
align=right><tt><font color=green> <U> 0.398</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.398</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1623204</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.086 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.098 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>COPY</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 187.77</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.438 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.751 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.733 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>44.8</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.883</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.124 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.161 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>14.85</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.606 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.611 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.353</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 3245428</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.087 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.098 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>READ</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 151.01</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.459 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.113 </font></tt></td> 
<td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.978 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>44.34</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.607 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.470</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.535 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>18.54</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.444</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.500 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.724 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 3245428</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.087 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.098 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>STATS</b></td> <td bgcolor=#E0E0C0 align=right><tt>22.04</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.314 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.812</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.871 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>8.61</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.698 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.571</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 4.591 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>20.11</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.528</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.709 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.579 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 
align=right><tt><U> 3245428</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.087 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.098 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>DELETE</b></td> <td bgcolor=#E0E0C0 align=right><tt>108.77</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.313</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.193 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.071 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>41</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.637 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.091</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.795 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>21.45</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.795 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.077</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.556 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>4</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 14877.000 </font></tt></td> </tt></td> </tr> <tr> <td colspan=17 bgcolor=#606060><b><font color=white>#1:DD_MBCOUNT=5000 </font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=4><b>REAL_TIME</b></td> <td colspan=4><b>CPU_TIME</b></td> <td colspan=4><b>CPU_UTIL</b></td> <td colspan=4><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td> 
<td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_writing_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 536.06</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.005 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.017 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 0.982</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>122.28</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.826 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.819</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.806</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>14.99</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.771 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.711</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.742 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 5120008</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.012</U> </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_reading_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt>145.32</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.031 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.965</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 0.982</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 
align=right><tt>157.51</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.947 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.890</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.880</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>57.01</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.901</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.909 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.884</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 5120008</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.012</U> </font></tt></td> </tt></td> </tr> <tr><td bgcolor=black colspan=17><font color=white></td></tr> <tr><td colspan=17 align=right> <tr> <td colspan=17 bgcolor=#303030><b><font color=white>INFO_R4=2.6.11 + reiser4-5 REP_COUNTER=1 DEV=/dev/hda5 DD_MBCOUNT=5000 PHASE_OVERWRITE=off FILE_SIZE=8192 NPROC=3 PHASE_READ=find PHASE_DELETE=rm PHASE_APPEND=off WRITE_BUFFER=131072 DIR=/mnt1 PHASE_MODIFY=off BYTES=1024000000 PHASE_COPY=cp GAMMA=0.2 SYNC=off </td></tr> <tr><td colspan=17 align=right> <font size=-2>Produced by <a href=http://namesys.com/benchmarks/mongo_readme.html>Mongo</a> benchmark suite.</font></td></tr> </table> <hr> <a name="mongo.2.6.8.1-mm3"></a> linux-2.6.8.1-mm3 <a href="benchmarks/mongo_readme.html">mongo</a> results <dl> <dt>reiser4 </dt> <dd>large key</dd> <dt>mem total</dt> <dd>254324</dd> <dt>machine </dt> <dd>bones</dd> <dt>kernel </dt> <dd>2.6.8.1-mm3 #3 SMP Mon Aug 23 19:33:13 MSD 2004</dd> <dt>date </dt> <dd>Tue Aug 31 15:47:51 2004</dd> </dl> <p> In this test 81% of files are chosen from the 0-10k size range and 19% from the 10-100k size range. 
</p>
<!-- File stats: Units are decimal (1k = 1000) files 0-100 : 1433 files 100-1K : 12597 files 1K-10K : 103101 files 10K-100K : 28131 files 100K-1M : 0 files 1M-10M : 0 files 10M-larger : 0 total bytes written : 1886585039 -->
<p>Legend:</p> <ul> <li><tt>A</tt> reiser4</li> <li><tt>B</tt> reiser4, extents only</li> <li><tt>C</tt> reiserfs <tt>v3 (notail)</tt></li> <li><tt>D</tt> ext3 in <tt>data=writeback</tt> mode (meta-data only journalling)</li> <li><tt>E</tt> ext3 in <tt>data=journal</tt> mode</li> <li><tt>F</tt> ext3 in <tt>data=ordered</tt> mode</li> </ul>
<img src="http://www.namesys.com/intbenchmarks/mongo/04.08.26/256MB.RAM/one-thread-8k.g02.charts/CREATE.0.png">
<img src="http://www.namesys.com/intbenchmarks/mongo/04.08.26/256MB.RAM/one-thread-8k.g02.charts/COPY.0.png">
<img src="http://www.namesys.com/intbenchmarks/mongo/04.08.26/256MB.RAM/one-thread-8k.g02.charts/READ.0.png">
<img src="http://www.namesys.com/intbenchmarks/mongo/04.08.26/256MB.RAM/one-thread-8k.g02.charts/STATS.0.png">
<img src="http://www.namesys.com/intbenchmarks/mongo/04.08.26/256MB.RAM/one-thread-8k.g02.charts/DELETE.0.png">
<img src="http://www.namesys.com/intbenchmarks/mongo/04.08.26/256MB.RAM/one-thread-8k.g02.charts/dd_writing_largefile.1.png">
<img src="http://www.namesys.com/intbenchmarks/mongo/04.08.26/256MB.RAM/one-thread-8k.g02.charts/dd_reading_largefile.1.png">
<p> The table presents absolute values (of elapsed time, CPU usage, CPU utilization, disk usage) for reiser4, and ratios against reiser4 for all other configurations. <font color=red>Red</font> number means ratio is larger than <tt>1.0</tt>, that is, reiser4 is better in this test. <font color=green>Green</font> number means that reiser4 loses in this test.
</p> <table cols=25 cellpadding=2 cellspacing=2 noborder> <tr><td bgcolor=black colspan=25><font color=white></td></tr> <tr> <th bgcolor=#303030 colspan=25 align=left><font color=white>A.FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=25 align=left><font color=white>B.FSTYPE=reiser4 MKFS=mkfs.reiser4 -q -o extent=extent40 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=25 align=left><font color=white>C.MOUNT_OPTIONS=notail FSTYPE=reiserfs </font></th> </tr> <tr> <th bgcolor=#303030 colspan=25 align=left><font color=white>D.MOUNT_OPTIONS="data=writeback" FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=25 align=left><font color=white>E.MOUNT_OPTIONS="data=journal" FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=25 align=left><font color=white>F.MOUNT_OPTIONS="data=ordered" FSTYPE=ext3 </font></th> </tr> <tr> <td colspan=25 bgcolor=#606060><b><font color=white>#0:</font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=6><b>REAL_TIME</b></td> <td colspan=6><b>CPU_TIME</b></td> <td colspan=6><b>CPU_UTIL</b></td> <td colspan=6><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>CREATE</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 91.6</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 0.988</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.983 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.592 
</font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.010 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.256 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>31.13</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.965 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.826</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.577 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.529 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.802 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>22.63</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.981 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.350</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.791 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.738 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.000 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1978440</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.088 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>COPY</b></td> <td bgcolor=#E0E0C0 align=right><tt>219.5</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.968</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.674 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.241 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.105 </font></tt></td> <td 
bgcolor=#E0E0C0 align=right><tt><font color=red> 1.819 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>54.04</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.938 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.792</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.694 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.004 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.860 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>16.01</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.996 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.460</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.663 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.839 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.890 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 3956708</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.088 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>READ</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 187.34</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.007</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.617 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.282 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.295 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.250 </font></tt></td> </tt></td> <td 
bgcolor=#E0E0C0 align=right><tt>38.61</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.002 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.711 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.615</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.622</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.615</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>13.05</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.995 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.441</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.520 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.517 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.533 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 3956708</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.088 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>STATS</b></td> <td bgcolor=#E0E0C0 align=right><tt>23.71</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.968 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.162 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.943</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.943</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.943</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>10.91</tt></td> <td 
bgcolor=#E0E0C0 align=right><tt><font color=green> 0.944 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.717 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.661</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.674 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.658</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>24.46</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.971 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.587</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.700 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.707 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.697 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 3956708</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.088 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>DELETE</b></td> <td bgcolor=#E0E0C0 align=right><tt>156.84</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.993 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.233</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.264 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.270 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.216 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>53.05</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.938 </font></tt></td> <td 
bgcolor=#E0E0C0 align=right><tt><font color=green> 0.440 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.209</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.215 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.214 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>18.23</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.947 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.758 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.157</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.160 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.167 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>4</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.000 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> </tt></td> </tr> <tr> <td colspan=25 bgcolor=#606060><b><font color=white>#1:DD_MBCOUNT=768 </font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=6><b>REAL_TIME</b></td> <td colspan=6><b>CPU_TIME</b></td> <td colspan=6><b>CPU_UTIL</b></td> <td colspan=6><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A 
</b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_writing_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 30.09</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.006</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.286 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.342 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.473 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.311 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>5.24</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.996 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.966</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.286 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.393 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.437 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>11.43</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.994 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.631</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.796 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.655 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.967 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 786436</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font 
color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_reading_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt>28.38</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.969</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.010 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 0.980</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 0.982</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.999 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>4.37</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.979 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.014 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.911</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.895</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.936 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>8.88</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.030 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.922 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.858</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.854</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.867</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 786436</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 
align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr><td bgcolor=black colspan=25><font color=white></td></tr> <tr><td colspan=25 align=right> <tr> <td colspan=25 bgcolor=#303030><b><font color=white>REP_COUNTER=1 PHASE_COPY=cp INFO_R4=2.6.8.1-mm3 + parse_options.patch FILE_SIZE=8192 DEV=/dev/hda6 PHASE_MODIFY=off DD_MBCOUNT=768 PHASE_APPEND=off PHASE_OVERWRITE=off SYNC=off DIR=/mnt1 PHASE_DELETE=rm NPROC=1 BYTES=1024000000 GAMMA=0.2 PHASE_READ=find WRITE_BUFFER=131072 </td></tr> <tr><td colspan=25 align=right> <font size=-2>Produced by <a href=http://namesys.com/>Mongo</a> benchmark suite.</font></td></tr> </table> <hr> <a name="slow.2004.03.26">2004.03.26 slow.c benchmark results</a> <p> This is <a href="http://www.jburgess.uklinux.net/slow.c">slow.c</a> benchmark resutls for the latest 2004.03.26 reiser4 snapshot. </p> <p> <b>slow.c</b> is a simple program by Jon Burgess which writes and reads multiple data streams. For the details and the source code look at <a href="http://marc.theaimsgroup.com/?l=linux-kernel&m=107652683608384&w=2"> the discussion<a> in the linux-kernel mailing list. 
</p> <p> kernel : 2.6.5-rc2</p> <p> RAM : 256Mb</p> <p> reiser4 : <a href="http://www.namesys.com/snapshots/2004.03.26/">2004.03.26 snapshot</a></p> <p>Hardware specs:</p> <pre> processor : 1 vendor_id : AuthenticAMD cpu family : 6 model : 6 model name : AMD Athlon(tm) Processor stepping : 2 cpu MHz : 1460.098 cache size : 256 KB bogomips : 2916.35 Dual CPU AMD Athlon(tm) 1.4Ghz </pre> <pre> # hdparm /dev/hda6: multcount = 16 (on) IO_support = 1 (32-bit) unmaskirq = 1 (on) using_dma = 1 (on) keepsettings = 0 (off) readonly = 0 (off) readahead = 256 (on) geometry = 65535/16/63, sectors = 35937342, start = 84164598 </pre> <pre> # hdparm -t /dev/hda6 /dev/hda6: Timing buffered disk reads: 84 MB in 3.07 seconds = 27.39 MB/sec </pre> <pre> # hdparm -i /dev/hda /dev/hda: Model=IC35L060AVER07-0, FwRev=ER6OA44A, SerialNo=SZPTZMB6154 Config={ HardSect NotMFM HdSw>15uSec Fixed DTR>10Mbs } RawCHS=16383/16/63, TrkSize=0, SectSize=0, ECCbytes=40 BuffType=DualPortCache, BuffSize=1916kB, MaxMultSect=16, MultSect=16 CurCHS=16383/16/63, CurSects=16514064, LBA=yes, LBAsects=120103200 IORDY=on/off, tPIO={min:240,w/IORDY:120}, tDMA={min:120,rec:120} PIO modes: pio0 pio1 pio2 pio3 pio4 DMA modes: mdma0 mdma1 mdma2 UDMA modes: udma0 udma1 udma2 AdvancedPM=yes: disabled (255) WriteCache=enabled Drive conforms to: ATA/ATAPI-5 T13 1321D revision 1: * signifies the current active mode </pre> <pre> <!-- (500Mb of data) test : ./slow foo 500 Results : ============================================================== | 1 stream | 2 streams --------------+----------------------------------------------- | WRITE READ | WRITE READ --------------+----------------------------------------------- ext2 25.08Mb/s 27.08Mb/s 13.72Mb/s 14.04Mb/s reiser4 26.31Mb/s 26.99Mb/s 24.03Mb/s 26.84Mb/s reiser4-extents 25.28Mb/s 27.40Mb/s 24.12Mb/s 26.85Mb/s ext3-ordered 20.99Mb/s 26.40Mb/s 12.01Mb/s 13.34Mb/s ext3-journal 10.13Mb/s 24.48Mb/s 8.87Mb/s 13.26Mb/s reiserfs 20.42Mb/s 27.67Mb/s 12.98Mb/s 13.13Mb/s 
reiserfs-notail 20.07Mb/s 27.58Mb/s 13.04Mb/s 13.25Mb/s ============================================================== --> (1000Mb of data) test : ./slow foo 1000 Results : <!-- ============================================================================================================== | 1 stream | 2 streams | 4 streams | 8 stream --------------+----------------------------------------------------------------------------------------------- | WRITE READ | WRITE READ | WRITE READ | WRITE READ --------------+----------------------------------------------------------------------------------------------- ext2 24.66Mb/s 27.56Mb/s 13.40Mb/s 13.67Mb/s 7.73Mb/s 6.94Mb/s 6.69Mb/s 3.52Mb/s reiser4 25.42Mb/s 27.71Mb/s 23.96Mb/s 26.34Mb/s 24.55Mb/s 26.58Mb/s 24.90Mb/s 26.76Mb/s reiser4-extents 25.60Mb/s 27.68Mb/s 24.19Mb/s 25.92Mb/s 25.24Mb/s 27.12Mb/s 25.39Mb/s 26.72Mb/s ext3-ordered 20.05Mb/s 26.46Mb/s 11.06Mb/s 13.12Mb/s 9.63Mb/s 6.76Mb/s 10.02Mb/s 3.48Mb/s ext3-journal 10.10Mb/s 26.81Mb/s 8.87Mb/s 13.08Mb/s 8.59Mb/s 6.84Mb/s 8.14Mb/s 3.47Mb/s reiserfs 20.19Mb/s 27.48Mb/s 12.69Mb/s 13.03Mb/s 8.27Mb/s 6.84Mb/s 7.87Mb/s 4.13Mb/s reiserfs-notail 20.31Mb/s 27.10Mb/s 12.74Mb/s 13.09Mb/s 8.33Mb/s 6.89Mb/s 7.87Mb/s 4.17Mb/s ============================================================================================================= --> </pre> <table> <tr> <td><img src="intbenchmarks/slow/04.03.25-int.snapshot.bones/wr.1.png"></td> <td><img src="intbenchmarks/slow/04.03.25-int.snapshot.bones/wr.2.png"></td> <td><img src="intbenchmarks/slow/04.03.25-int.snapshot.bones/wr.4.png"></td> <td><img src="intbenchmarks/slow/04.03.25-int.snapshot.bones/wr.8.png"></td> </tr> <tr> <td><img src="intbenchmarks/slow/04.03.25-int.snapshot.bones/rd.1.png"></td> <td><img src="intbenchmarks/slow/04.03.25-int.snapshot.bones/rd.2.png"></td> <td><img src="intbenchmarks/slow/04.03.25-int.snapshot.bones/rd.4.png"></td> <td><img src="intbenchmarks/slow/04.03.25-int.snapshot.bones/rd.8.png"></td> </tr> 
</table> <hr> <a name="mongo.2003.11.20"></a>2003.11.20 <a href="benchmarks/mongo_readme.html">mongo</a> results <dl> <dt>reiser4 </dt> <dd>''</dd> <dt>mem total</dt> <dd>255716</dd> <dt>machine </dt> <dd>belka</dd> <dt>kernel </dt> <dd>2.6.0-test9 #2 SMP Thu Nov 20 16:08:42 MSK 2003</dd> <dt>date </dt> <dd>Thu Nov 20 16:16:50 2003</dd> </dl> <p> In this test 80% of files are chosen from the 0-8k size range, 16% from the 0-80k size range, 0.8 x 4% from the 0-800k size range, etc. Most files are small, most bytes are in large files. </p> <p>Legend:</p> <ul> <li><tt>A</tt> reiser4</li> <li><tt>B</tt> reiser4, extents only</li> <li><tt>C</tt> reiserfs <tt>v3</tt></li> <li><tt>D</tt> ext3 in <tt>data=writeback</tt> mode (meta-data only journalling)</li> <li><tt>E</tt> ext3 in <tt>data=journal</tt> mode</li> <li><tt>F</tt> ext3 in <tt>data=ordered</tt> mode</li> <li><tt>G</tt> ext3 with htree (hashed directories)</li> </ul> <p> Table presents absolute values (of elapsed time, CPU usage, and disk usage) for reiser4, and ratios against reiser4 for all other configurations. <font color=red>Red</font> number means ratio is larger than <tt>1.0</tt>, that is, reiser4 is better in this test. <font color=green>Green</font> number means that reiser4 loses in this test. 
</p> <table cols=22 cellpadding=2 cellspacing=2 noborder> <tr><td bgcolor=black colspan=22><font color=white></td></tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>A.INFO_R4='' FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>B.INFO_R4='' MKFS=mkfs.reiser4 -q -o policy=extents FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>C.FSTYPE=reiserfs </font></th> </tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>D.MOUNT_OPTIONS=data=writeback FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>E.MOUNT_OPTIONS=data=journal FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>F.MOUNT_OPTIONS=data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>G.MKFS=mkfs.ext3 -O dir_index MOUNT_OPTIONS=data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <td colspan=22 bgcolor=#606060><b><font color=white>#0:</font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=7><b>REAL_TIME</b></td> <td colspan=7><b>CPU_TIME</b></td> <td colspan=7><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>CREATE</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 21.81</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.171 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.983 </font></tt></td> <td bgcolor=#E0E0C0 
align=right><tt><font color=red> 3.253 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.702 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.161 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.212 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>6.38</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.130 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.020 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.461 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.461 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.354 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.851</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 607612</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.091 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.035 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>COPY</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 64.37</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.089 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.046 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.980 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.834 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.929 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 6.246 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>11.55</tt></td> 
<td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.047 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.797 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.590 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.725 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.542 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.698</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1214992</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.091 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.034 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>READ</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 45.38</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.026 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.406 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.248 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.307 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.232 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 7.192 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>10.13</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.934 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.517 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.454 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.453</U> </font></tt></td> <td bgcolor=#E0E0C0 
align=right><tt><font color=green> <U> 0.444</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.504 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1214992</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.091 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.034 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>STATS</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 5.74</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.030 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.413 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.014</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.033 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.021 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.634 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>2.34</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.000 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.936 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.761 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.791 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.774 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.744</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1214992</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.091 </font></tt></td> <td bgcolor=#E0E0C0 
align=right><tt><font color=red> 1.034 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>DELETE</b></td> <td bgcolor=#E0E0C0 align=right><tt>46.94</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.424</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.520 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.017 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.043 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.956 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.315 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>14.19</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.743 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.443 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.200</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.206 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.201</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.234 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>4</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.000 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td 
bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> </tt></td> </tr> <tr> <td colspan=22 bgcolor=#606060><b><font color=white>#1:DD_MBCOUNT=768 </font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=7><b>REAL_TIME</b></td> <td colspan=7><b>CPU_TIME</b></td> <td colspan=7><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_writing_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 29.33</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.026 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.184 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.102 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.499 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.097 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.098 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>2.61</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.008 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.659</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.437 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.054 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.556 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.571 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 786436</U></tt></td> 
<td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_reading_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 22.96</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.056 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.003</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.004</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.003</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.006</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>2.26</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.991 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.912 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.796 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.765</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.779</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.783 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 786436</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 
align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tr> <tr><td bgcolor=black colspan=22><font color=white></font></td></tr> <tr> <td colspan=22 bgcolor=#303030><b><font color=white>NPROC=1 DIR=/mnt/testfs SYNC=off PHASE_COPY=cp REP_COUNTER=1 GAMMA=0.2 PHASE_OVERWRITE=off FILE_SIZE=8192 BYTES=512000000 PHASE_APPEND=off PHASE_READ=find DEV=/dev/hdb3 DD_MBCOUNT=768 WRITE_BUFFER=131072 PHASE_DELETE=rm PHASE_MODIFY=off </font></b></td></tr> <tr><td colspan=22 align=right> <font size=-2>Produced by <a href=http://namesys.com/benchmarks/mongo_readme.html>Mongo</a> benchmark suite.</font></td></tr> </table> <hr> <a name="mongo.2003.09.25"></a>2003.09.25 <a href="benchmarks/mongo_readme.html">mongo</a> results <dl> <dt>reiser4 </dt> <dd>''</dd> <dt>mem total</dt> <dd>255048</dd> <dt>machine </dt> <dd>belka</dd> <dt>kernel </dt> <dd>2.6.0-test5 #33 SMP Thu Sep 25 15:45:38 MSD 2003</dd> <dt>date </dt> <dd>Thu Sep 25 15:57:38 2003</dd> </dl> <p> In this test, 80% of files are chosen from the 0-8k size range, 16% from the 0-80k size range, 0.8 x 4% from the 0-800k size range, etc. Most files are small; most bytes are in large files. </p> <p>Legend:</p> <ul> <li><tt>A</tt> reiser4</li> <li><tt>B</tt> reiser4, extents only</li> <li><tt>C</tt> reiserfs <tt>v3</tt></li> <li><tt>D</tt> ext3 in <tt>data=writeback</tt> mode (meta-data only journalling)</li> <li><tt>E</tt> ext3 in <tt>data=journal</tt> mode</li> <li><tt>F</tt> ext3 in <tt>data=ordered</tt> mode</li> <li><tt>G</tt> ext3 with htree (hashed directories)</li> </ul> <p> The table presents absolute values (of elapsed time, CPU usage, and disk usage) for reiser4, and ratios against reiser4 for all other configurations.
A <font color=red>red</font> number means the ratio is larger than <tt>1.0</tt>, i.e. reiser4 is better in this test; a <font color=green>green</font> number means reiser4 loses in this test. </p> <table cols=22 cellpadding=2 cellspacing=2 noborder> <tr><td bgcolor=black colspan=22><font color=white></font></td></tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>A.INFO_R4='' FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>B.INFO_R4='' MKFS=mkfs.reiser4 -q -o policy=extents FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>C.FSTYPE=reiserfs </font></th> </tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>D.MOUNT_OPTIONS=data=writeback FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>E.MOUNT_OPTIONS=data=journal FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>F.MOUNT_OPTIONS=data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>G.MKFS=mkfs.ext3 -O dir_index MOUNT_OPTIONS=data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <td colspan=22 bgcolor=#606060><b><font color=white>#0:</font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=7><b>REAL_TIME</b></td> <td colspan=7><b>CPU_TIME</b></td> <td colspan=7><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>CREATE</b></td> <td bgcolor=#E0E0C0 align=right><tt><U>
23.57</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.158 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.714 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.263 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.234 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.020 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.376 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>6.66</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.075 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.947 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.240 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.357 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.264 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.835</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 608548</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.090 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.034 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.105 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.105 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.105 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>COPY</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 64.98</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.083 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.050 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.023 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.810 </font></tt></td> <td bgcolor=#E0E0C0 
align=right><tt><font color=red> 1.908 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 6.850 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>12.18</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.057 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.776 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.507 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.603 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.518 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.743</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1216784</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.090 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.033 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.105 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.105 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.105 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>READ</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 44.65</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.028 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.733 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.237 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.114 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.179 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 7.694 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>10.28</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.933 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.590</U> 
</font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.608 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.593</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.608 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.620 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1216784</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.090 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.033 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.105 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.105 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.105 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>STATS</b></td> <td bgcolor=#E0E0C0 align=right><tt>5.88</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.998 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.139 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.981 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.020 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.929</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.655 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>2.29</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.987 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.900 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.747</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.782 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.747</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 
0.755</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1216784</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.090 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.033 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.105 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.105 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.105 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>DELETE</b></td> <td bgcolor=#E0E0C0 align=right><tt>46.65</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.438</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.504 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.109 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.023 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.022 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.376 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>14.19</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.746 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.431 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.206</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.211 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.211 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.232 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>4</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.000 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> 
</font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> </tt></td> </tr> <tr> <td colspan=22 bgcolor=#606060><b><font color=white>#1:DD_MBCOUNT=768 </font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=7><b>REAL_TIME</b></td> <td colspan=7><b>CPU_TIME</b></td> <td colspan=7><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_writing_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 30.78</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.017</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.177 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.063 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.394 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.066 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.056 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>3.11</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.981 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.553</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.701 </font></tt></td> <td bgcolor=#E0E0C0 
align=right><tt><font color=red> 1.296 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.318 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 786436</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_reading_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 22.96</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.045 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.005</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.005</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.004</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.006</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>2.41</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.996 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.867 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.739 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.718</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.739 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.722</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 786436</U></tt></td> 
<td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tr> <tr><td bgcolor=black colspan=22><font color=white></font></td></tr> <tr> <td colspan=22 bgcolor=#303030><b><font color=white>NPROC=1 DIR=/mnt/testfs SYNC=off PHASE_COPY=cp REP_COUNTER=1 GAMMA=0.2 PHASE_OVERWRITE=off FILE_SIZE=8192 BYTES=512000000 PHASE_APPEND=off PHASE_READ=find DEV=/dev/hdb3 DD_MBCOUNT=768 WRITE_BUFFER=131072 PHASE_DELETE=rm PHASE_MODIFY=off </font></b></td></tr> <tr><td colspan=22 align=right> <font size=-2>Produced by <a href=http://namesys.com/benchmarks/mongo_readme.html>Mongo</a> benchmark suite.</font></td></tr> </table> <hr> <a name="mongo.2003.08.28"></a>2003.08.28 <a href="benchmarks/mongo_readme.html">mongo</a> results <dl> <dt>reiser4 </dt> <dd>''</dd> <dt>mem total</dt> <dd>256276</dd> <dt>machine </dt> <dd>belka</dd> <dt>kernel </dt> <dd>2.6.0-test4 #194 SMP Thu Aug 28 17:18:47 MSD 2003</dd> <dt>date </dt> <dd>Thu Aug 28 17:20:18 2003</dd> </dl> <p> In this test, 80% of files are chosen from the 0-8k size range, 16% from the 0-80k size range, 0.8 x 4% from the 0-800k size range, etc. Most files are small; most bytes are in large files.
</p> <p>Legend:</p> <ul> <li><tt>A</tt> reiser4</li> <li><tt>B</tt> reiser4, extents only</li> <li><tt>C</tt> reiserfs <tt>v3</tt></li> <li><tt>D</tt> ext3 in <tt>data=writeback</tt> mode (meta-data only journalling)</li> <li><tt>E</tt> ext3 in <tt>data=journal</tt> mode</li> <li><tt>F</tt> ext3 in <tt>data=ordered</tt> mode</li> <li><tt>G</tt> ext3 with htree (hashed directories)</li> </ul> <p> The table presents absolute values (of elapsed time, CPU usage, and disk usage) for reiser4, and ratios against reiser4 for all other configurations. A <font color=red>red</font> number means the ratio is larger than <tt>1.0</tt>, i.e. reiser4 is better in this test; a <font color=green>green</font> number means reiser4 loses in this test. </p> <table cols=22 cellpadding=2 cellspacing=2 noborder> <tr><td bgcolor=black colspan=22><font color=white></font></td></tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>A.INFO_R4='' FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>B.INFO_R4='' MKFS=mkfs.reiser4 -q -o policy=extents FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>C.FSTYPE=reiserfs </font></th> </tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>D.MOUNT_OPTIONS=data=writeback FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>E.MOUNT_OPTIONS=data=journal FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>F.MOUNT_OPTIONS=data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>G.MKFS=mkfs.ext3 -O dir_index MOUNT_OPTIONS=data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <td colspan=22 bgcolor=#606060><b><font color=white>#0:</font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=7><b>REAL_TIME</b></td> <td colspan=7><b>CPU_TIME</b></td> <td colspan=7><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td>
<td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>CREATE</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 21.94</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.056 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.957 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.049 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.430 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.399 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.558 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>6.7</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.104 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.913 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.213 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.334 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.345 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.821</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 608452</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.091 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.034 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.105 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.105 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.105 </font></tt></td> <td bgcolor=#E0E0C0 
align=right><tt><font color=red> 1.106 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>COPY</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 64.05</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.078 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.112 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.964 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.703 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.022 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 7.356 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>11.37</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.039 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.819 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.538 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.692 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.568 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.708</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1216572</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.091 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.033 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>READ</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 52.53</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.072 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.882 </font></tt></td> <td 
bgcolor=#E0E0C0 align=right><tt><font color=red> 1.056 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.126 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.124 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 7.158 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>9.8</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.914 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.538 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.489 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.467 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.456</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.551 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1216572</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.091 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.033 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>STATS</b></td> <td bgcolor=#E0E0C0 align=right><tt>5.82</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.973</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.251 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.040 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.009 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.048 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.641 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 
align=right><tt>2.29</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.991 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.926 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.755 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.742</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.751 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.734</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1216572</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.091 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.033 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>DELETE</b></td> <td bgcolor=#E0E0C0 align=right><tt>46.96</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.409</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.491 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.949 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.988 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.987 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.382 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>13.89</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.734 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.453 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.210 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font 
color=green> <U> 0.204</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.202</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.238 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>4</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.000 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> </tt></td> </tr> <tr> <td colspan=22 bgcolor=#606060><b><font color=white>#1:DD_MBCOUNT=768 </font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=7><b>REAL_TIME</b></td> <td colspan=7><b>CPU_TIME</b></td> <td colspan=7><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_writing_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 26.1</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.006</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.205 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.066 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.353 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 
1.068 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.070 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>3.18</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.028 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.547</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.173 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.708 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.327 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.296 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 786436</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_reading_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 18.99</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.009</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.072 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.009</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.007</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.006</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.008</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>2.12</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.000 
</font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.925 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.877 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.844 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.830 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.811</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 786436</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tr> <tr><td bgcolor=black colspan=22><font color=white></font></td></tr> <tr> <td colspan=22 bgcolor=#303030><b><font color=white>NPROC=1 DIR=/mnt/testfs SYNC=off PHASE_COPY=cp REP_COUNTER=1 GAMMA=0.2 PHASE_OVERWRITE=off FILE_SIZE=8192 BYTES=512000000 PHASE_APPEND=off PHASE_READ=find DEV=/dev/hdb3 DD_MBCOUNT=768 WRITE_BUFFER=131072 PHASE_DELETE=rm PHASE_MODIFY=off </font></b></td></tr> <tr><td colspan=22 align=right> <font size=-2>Produced by <a href=http://namesys.com/benchmarks/mongo_readme.html>Mongo</a> benchmark suite.</font></td></tr> </table> <hr> <a name="mongo.2003.08.27"></a>2003.08.27 <a href="benchmarks/mongo_readme.html">mongo</a> results <dl> <dt>reiser4 </dt> <dd>''</dd> <dt>mem total</dt> <dd>256276</dd> <dt>machine </dt> <dd>belka</dd> <dt>kernel </dt> <dd>2.6.0-test4 #189 SMP Wed Aug 27 20:36:51 MSD 2003</dd> <dt>date </dt> <dd>Wed Aug 27 20:44:02 2003</dd> </dl> <p> In this test, 80% of files are chosen from the 0-8k size
range, 16% from the 0-80k size range, 0.8 x 4% from the 0-800k size range, etc. Most files are small; most bytes are in large files. </p> <p>Legend:</p> <ul> <li><tt>A</tt> reiser4</li> <li><tt>B</tt> reiser4, extents only</li> <li><tt>C</tt> ext3 in <tt>data=writeback</tt> mode (meta-data-only journalling)</li> <li><tt>D</tt> ext3 in <tt>data=journal</tt> mode</li> <li><tt>E</tt> ext3 in <tt>data=ordered</tt> mode</li> <li><tt>F</tt> ext3 with htree (hashed directories)</li> </ul> <p> The table presents absolute values (elapsed time, CPU usage, and disk usage) for reiser4, and ratios against reiser4 for all other configurations. A <font color=red>red</font> number means the ratio is larger than <tt>1.0</tt>, that is, reiser4 is better in this test. A <font color=green>green</font> number means that reiser4 loses in this test. </p> <table cols=19 cellpadding=2 cellspacing=2 noborder> <tr><td bgcolor=black colspan=19><font color=white></td></tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>A.INFO_R4='' FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>B.INFO_R4='' MKFS=mkfs.reiser4 -q -o policy=extents FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>C.MOUNT_OPTIONS=data=writeback FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>D.MOUNT_OPTIONS=data=journal FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>E.MOUNT_OPTIONS=data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>F.MKFS=mkfs.ext3 -O dir_index MOUNT_OPTIONS=data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <td colspan=19 bgcolor=#606060><b><font color=white>#0:</font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=6><b>REAL_TIME</b></td> <td colspan=6><b>CPU_TIME</b></td> <td colspan=6><b>DF</b></td> </tr> <tr align=center 
bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>CREATE</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 22.41</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.673 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.325 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.975 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.213 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>7.66</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.069 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.347 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.415 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.410 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.708</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 635264</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.096 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.110 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.110 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.110 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.111 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>COPY</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 90.92</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.099 </font></tt></td> <td 
bgcolor=#E0E0C0 align=right><tt><font color=red> 1.471 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.221 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.470 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 4.989 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>12.14</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.068 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.066 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.241 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.094 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.668</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1269840</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.096 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.110 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.110 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.110 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.112 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>READ</b></td> <td bgcolor=#E0E0C0 align=right><tt>82.21</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.063 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.861 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.852 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.791</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 4.417 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>10.57</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.914 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.400</U> </font></tt></td> <td bgcolor=#E0E0C0 
align=right><tt><font color=green> 0.428 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.402</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.534 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1269840</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.096 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.110 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.110 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.110 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.112 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>STATS</b></td> <td bgcolor=#E0E0C0 align=right><tt>8.52</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.993 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.822</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.816</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.811</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.335 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>2.96</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.997 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.561</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.564</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.584 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.608 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1269840</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.096 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.110 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.110 </font></tt></td> <td 
bgcolor=#E0E0C0 align=right><tt><font color=red> 1.110 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.112 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>DELETE</b></td> <td bgcolor=#E0E0C0 align=right><tt>69.69</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.301</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.749 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.717 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.659 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.912 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>14.73</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.703 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.208</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.207</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.213 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.237 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>4</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.000 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> </tt></td> </tr> <tr> <td colspan=19 bgcolor=#606060><b><font color=white>#1:DD_MBCOUNT=768 </font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=6><b>REAL_TIME</b></td> <td colspan=6><b>CPU_TIME</b></td> <td colspan=6><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A 
</b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_writing_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 25.85</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.092 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.335 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.085 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.095 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 3.27</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 0.982</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.159 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.648 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.251 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.254 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 786436</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_reading_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 19</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 0.999</U> 
</font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.005</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.007</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.007</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.007</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>2.18</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.963 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.807 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.803</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.789</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.803</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 786436</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr><td bgcolor=black colspan=19><font color=white></td></tr> <tr><td colspan=19 align=right> <tr> <td colspan=19 bgcolor=#303030><b><font color=white>NPROC=1 DIR=/mnt/testfs SYNC=off PHASE_COPY=cp REP_COUNTER=1 GAMMA=0.2 PHASE_OVERWRITE=off FILE_SIZE=8000 BYTES=512000000 PHASE_APPEND=off PHASE_READ=find DEV=/dev/hdb3 DD_MBCOUNT=768 WRITE_BUFFER=131072 PHASE_DELETE=rm PHASE_MODIFY=off </td></tr> <tr><td colspan=19 align=right> <font size=-2>Produced by <a href=http://namesys.com/benchmarks/mongo_readme.html>Mongo</a> benchmark suite.</font></td></tr> </table> <hr> <p> This is the same test as above, but with base file 
size 4k, that is, in this test 80% of files are chosen from the 0-4k size range, 16% from the 0-40k size range, 0.8 x 4% from the 0-400k size range, etc. </p> <hr> <dl> <dt>reiser4 </dt> <dd>''</dd> <dt>mem total</dt> <dd>255580</dd> <dt>machine </dt> <dd>belka</dd> <dt>kernel </dt> <dd>2.6.0-test4 #176 SMP Tue Aug 26 19:09:38 MSD 2003</dd> <dt>date </dt> <dd>Wed Aug 27 12:41:54 2003</dd> </dl> <table cols=19 cellpadding=2 cellspacing=2 noborder> <tr><td bgcolor=black colspan=19><font color=white></td></tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>A.INFO_R4='' FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>B.INFO_R4='' MKFS=mkfs.reiser4 -q -o policy=extents FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>C.MOUNT_OPTIONS=data=writeback FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>D.MOUNT_OPTIONS=data=journal FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>E.MOUNT_OPTIONS=data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>F.MKFS=mkfs.ext3 -O dir_index MOUNT_OPTIONS=data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <td colspan=19 bgcolor=#606060><b><font color=white>#0:</font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=6><b>REAL_TIME</b></td> <td colspan=6><b>CPU_TIME</b></td> <td colspan=6><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>CREATE</b></td> <td 
bgcolor=#E0E0C0 align=right><tt><U> 33.86</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.223 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.305 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.895 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.549 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.298 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>14.11</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.118 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.967 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.046 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.045 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.647</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 789424</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.208 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.181 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>COPY</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 119.68</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.228 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.237 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.397 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.277 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 7.061 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>23.05</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 
</font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.484 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.683 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.515 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.691</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1578216</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.208 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.182 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>READ</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 118.5</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.217 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.041 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.065 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.020</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 6.585 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>19.84</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.993 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.436</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.446 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.431</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.540 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1578216</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.208 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 
</font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.182 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>STATS</b></td> <td bgcolor=#E0E0C0 align=right><tt>24.69</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.951 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.677</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.696 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.677</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.151 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>7.75</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.008 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.590</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.582</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.583</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.645 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1578216</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.208 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.182 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>DELETE</b></td> <td bgcolor=#E0E0C0 align=right><tt>114.49</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.438 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.174</U> </font></tt></td> <td 
bgcolor=#E0E0C0 align=right><tt><font color=green> 0.188 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.177 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.257 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>32.64</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.790 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.193</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.199 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.194</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.223 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>4</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.000 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> </tt></td> </tr> <tr> <td colspan=19 bgcolor=#606060><b><font color=white>#1:DD_MBCOUNT=768 </font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=6><b>REAL_TIME</b></td> <td colspan=6><b>CPU_TIME</b></td> <td colspan=6><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_writing_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 26.24</U></tt></td> <td bgcolor=#E0E0C0 
align=right><tt><font color=black> <U> 1.002</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.066 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.311 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.056 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.063 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 3.25</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 0.997</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.138 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.622 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.286 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.298 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 786436</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_reading_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 19.04</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 0.994</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.002</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.003</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.002</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>2.08</tt></td> <td bgcolor=#E0E0C0 
align=right><tt><font color=red> 1.038 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.870 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.870 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.870 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.837</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 786436</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr><td bgcolor=black colspan=19><font color=white></td></tr> <tr><td colspan=19 align=right> <tr> <td colspan=19 bgcolor=#303030><b><font color=white>NPROC=1 DIR=/mnt/testfs SYNC=off PHASE_COPY=cp REP_COUNTER=1 GAMMA=0.2 PHASE_OVERWRITE=off FILE_SIZE=4000 BYTES=512000000 PHASE_APPEND=off PHASE_READ=find DEV=/dev/hdb3 DD_MBCOUNT=768 WRITE_BUFFER=131072 PHASE_DELETE=rm PHASE_MODIFY=off </td></tr> <tr><td colspan=19 align=right> <font size=-2>Produced by <a href=http://namesys.com/benchmarks/mongo_readme.html>Mongo</a> benchmark suite.</font></td></tr> </table> <hr> <a name="mongo.2003.08.26"></a>2003.08.26 <a href="benchmarks/mongo_readme.html">mongo</a> results <dl> <dt>reiser4 </dt> <dd>''</dd> <dt>mem total</dt> <dd>904048</dd> <dt>machine </dt> <dd>belka</dd> <dt>kernel </dt> <dd>2.6.0-test4 #176 SMP Tue Aug 26 19:09:38 MSD 2003</dd> <dt>date </dt> <dd>Tue Aug 26 19:34:39 2003</dd> </dl> <p> In this test 80% of files are chosen from the 0-4k size range, 16% from the 0-40k size range, 0.8 x 4% from the 0-400k size range, etc. 
Most files are small; most bytes are in large files. </p> <p>Legend:</p> <ul> <li><tt>A</tt> reiser4</li> <li><tt>B</tt> reiser4, extents only</li> <li><tt>C</tt> ext3 in <tt>data=writeback</tt> mode (meta-data-only journalling)</li> <li><tt>D</tt> ext3 in <tt>data=journal</tt> mode</li> <li><tt>E</tt> ext3 in <tt>data=ordered</tt> mode</li> <li><tt>F</tt> ext3 with htree (hashed directories)</li> </ul> <p> The table presents absolute values (elapsed time, CPU usage, and disk usage) for reiser4, and ratios against reiser4 for all other configurations. A <font color=red>red</font> number means the ratio is larger than <tt>1.0</tt>, that is, reiser4 is better in this test. A <font color=green>green</font> number means that reiser4 loses in this test. </p> <table cols=19 cellpadding=2 cellspacing=2 noborder> <tr><td bgcolor=black colspan=19><font color=white></td></tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>A.INFO_R4='' FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>B.INFO_R4='' MKFS=mkfs.reiser4 -q -o policy=extents FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>C.MOUNT_OPTIONS=data=writeback FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>D.MOUNT_OPTIONS=data=journal FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>E.MOUNT_OPTIONS=data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>F.MKFS=mkfs.ext3 -O dir_index MOUNT_OPTIONS=data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <td colspan=19 bgcolor=#606060><b><font color=white>#0:</font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=6><b>REAL_TIME</b></td> <td colspan=6><b>CPU_TIME</b></td> <td colspan=6><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A
</b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>CREATE</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 27.6</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.311 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.567 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.538 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.668 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.566 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>13.55</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.166 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.035 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.162 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.189 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.670</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 788884</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.208 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.181 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.181 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.181 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.182 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>COPY</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 113.71</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.237 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.167 </font></tt></td> <td 
bgcolor=#E0E0C0 align=right><tt><font color=red> 1.460 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.227 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 7.387 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>23.13</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.169 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.498 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.691 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.591 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.709</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1577560</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.208 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.181 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.181 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.181 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.183 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>READ</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 111.51</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.239 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.157 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.176 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.096 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 7.017 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>20.76</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.042 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.424 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.415</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font 
color=green> <U> 0.416</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.521 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1577560</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.208 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.181 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.181 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.181 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.183 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>STATS</b></td> <td bgcolor=#E0E0C0 align=right><tt>20.22</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.034 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.834</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.827</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.832</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.439 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>7.47</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.009 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.590</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.585</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.584</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.631 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1577560</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.208 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.181 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.181 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.181 </font></tt></td> <td bgcolor=#E0E0C0 
align=right><tt><font color=red> 1.183 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>DELETE</b></td> <td bgcolor=#E0E0C0 align=right><tt>110.98</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.437 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.183</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.180</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.185 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.277 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>33.03</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.838 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.196 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.192</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.193</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.221 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>4</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.000 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> </tt></td> </tr> <tr> <td colspan=19 bgcolor=#606060><b><font color=white>#1:DD_MBCOUNT=768 </font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=6><b>REAL_TIME</b></td> <td colspan=6><b>CPU_TIME</b></td> <td colspan=6><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A 
</b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_writing_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 26.03</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.096 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.340 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.092 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.080 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 3.48</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.011</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.083 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.583 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.187 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.190 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 786436</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_reading_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 19</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 0.995</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> 
</font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 0.999</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 0.999</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>2.28</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.018 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.741 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.737</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.741 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.724</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 786436</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr><td bgcolor=black colspan=19><font color=white></td></tr> <tr><td colspan=19 align=right> <tr> <td colspan=19 bgcolor=#303030><b><font color=white>NPROC=1 DIR=/mnt/testfs SYNC=off PHASE_COPY=cp REP_COUNTER=1 GAMMA=0.2 PHASE_OVERWRITE=off FILE_SIZE=4000 BYTES=512000000 PHASE_APPEND=off PHASE_READ=find DEV=/dev/hdb3 DD_MBCOUNT=768 WRITE_BUFFER=131072 PHASE_DELETE=rm PHASE_MODIFY=off </td></tr> <tr><td colspan=19 align=right> <font size=-2>Produced by <a href=http://namesys.com/benchmarks/mongo_readme.html>Mongo</a> benchmark suite.</font></td></tr> </table> <hr> <a name="mongo.2003.08.18"></a>2003.08.18 <a href="benchmarks/mongo_readme.html">mongo</a> results <dl> <dt>reiser4 </dt> <dd></dd> <dt>mem total</dt> 
<dd>255992</dd> <dt>machine </dt> <dd>belka</dd> <dt>kernel </dt> <dd>2.6.0-test3 #37 SMP Mon Aug 18 18:12:14 MSD 2003</dd> <dt>date </dt> <dd>Mon 18 Aug 2003 20:24:16</dd> </dl> <p> In this test, 80% of files are chosen from the 0-8k size range, 16% from the 0-80k range, 3.2% (0.8 x 4%) from the 0-800k range, and so on. Most files are small, but most bytes are in large files. </p> <p>Legend:</p> <ul> <li><tt>A</tt> reiser4</li> <li><tt>B</tt> reiser4, extents only</li> <li><tt>C</tt> ext3 in <tt>data=writeback</tt> mode (meta-data-only journalling)</li> <li><tt>D</tt> ext3 in <tt>data=journal</tt> mode</li> <li><tt>E</tt> ext3 in <tt>data=ordered</tt> mode</li> <li><tt>F</tt> ext3 with htree (hashed directories)</li> </ul> <p> The table presents absolute values (elapsed time, CPU usage, and disk usage) for reiser4, and ratios against reiser4 for every other configuration. A <font color=red>red</font> number means the ratio is larger than <tt>1.0</tt>, i.e. reiser4 is better in this test; a <font color=green>green</font> number means reiser4 loses in this test. 
</p> <table cols=19 cellpadding=2 cellspacing=2 noborder> <tr><td bgcolor=black colspan=19><font color=white></td></tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>A.INFO_R4= FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>B.INFO_R4=ext MKFS=mkfs.reiser4 -q -o policy=extents FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>C.MOUNT_OPTIONS=data=writeback FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>D.MOUNT_OPTIONS=data=journal FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>E.MOUNT_OPTIONS=data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>F.MKFS=mkfs.ext3 -O dir_index MOUNT_OPTIONS=data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <td colspan=19 bgcolor=#606060><b><font color=white>#0:</font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=6><b>REAL_TIME</b></td> <td colspan=6><b>CPU_TIME</b></td> <td colspan=6><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>CREATE</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 29.16</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.220 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.422 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.779 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.491 </font></tt></td> <td bgcolor=#E0E0C0 
align=right><tt><font color=red> 1.645 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>13.52</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.182 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.013 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.087 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.997 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.657</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 789364</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.208 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.181 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>COPY</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 119.64</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.211 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.191 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.473 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.230 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 7.288 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>21.98</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.152 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.515 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.746 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.520 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.695</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 
1578116</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.208 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.182 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>READ</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 116.55</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.213 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.177 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.025 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.134 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 6.850 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>18.35</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.035 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.447 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.436</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.431</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.569 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1578116</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.208 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.182 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>STATS</b></td> <td bgcolor=#E0E0C0 align=right><tt>21.65</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font 
color=red> 1.050 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.779</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.811 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.782</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.358 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>7.56</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.001 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.599</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.612 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.611</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.638 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1578116</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.208 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.182 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>DELETE</b></td> <td bgcolor=#E0E0C0 align=right><tt>112.37</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.434 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.179</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.198 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.177</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.281 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>30.62</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.851 </font></tt></td> <td bgcolor=#E0E0C0 
align=right><tt><font color=green> <U> 0.205</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.205</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.203</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.230 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>4</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.000 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> </tt></td> </tr> <tr> <td colspan=19 bgcolor=#606060><b><font color=white>#1:DD_MBCOUNT=768 </font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=6><b>REAL_TIME</b></td> <td colspan=6><b>CPU_TIME</b></td> <td colspan=6><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_writing_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 26.11</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.011</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.090 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.388 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.076 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.083 </font></tt></td> </tt></td> 
<td bgcolor=#E0E0C0 align=right><tt>3.25</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.945</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.092 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.640 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.255 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.231 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 786436</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_reading_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 19.09</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.005</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 0.999</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 0.996</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.004</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.011</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>2.09</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.019 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.847</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.856 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.833</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.842</U> 
</font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 786436</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr><td bgcolor=black colspan=19><font color=white></td></tr> <tr><td colspan=19 align=right> <tr> <td colspan=19 bgcolor=#303030><b><font color=white>NPROC=1 DIR=/mnt/testfs SYNC=off PHASE_COPY=cp REP_COUNTER=1 GAMMA=0.2 PHASE_OVERWRITE=off FILE_SIZE=4000 BYTES=512000000 PHASE_APPEND=off PHASE_READ=find DEV=/dev/hdb3 DD_MBCOUNT=768 WRITE_BUFFER=131072 PHASE_DELETE=rm PHASE_MODIFY=off </td></tr> <tr><td colspan=19 align=right> <font size=-2>Produced by <a href=http://namesys.com/benchmarks/mongo_readme.html>Mongo</a> benchmark suite.</font></td></tr> </table> <hr> <a name="mongo.2003.08.12"></a>2003.08.12 <a href="benchmarks/mongo_readme.html">mongo</a> results <dl> <dt>mem total</dt> <dd>513284</dd> <dt>machine </dt> <dd>strelka</dd> <dt>kernel </dt> <dd>2.6.0-test2 #52 SMP Tue Aug 12 15:17:12 MSD 2003</dd> <dt>date </dt> <dd>Tue Aug 12 15:38:47 2003</dd> </dl> <p> This is a comparison of the latest (2003.08.12) version of reiser4 with ext3. Reiser4 is an atomic filesystem, so the comparison with ext3's <tt>data=journal</tt> mode is the fairest; but since most users run ext3 in <tt>data=ordered</tt> mode, we compare against that as well. </p> <p> In this test, 80% of files are chosen from the 0-8k size range, 16% from the 0-80k range, 3.2% (0.8 x 4%) from the 0-800k range, and so on. Most files are small, but most bytes are in large files. 
</p> <p>Legend:</p> <ul> <li><tt>A</tt> reiser4</li> <li><tt>B</tt> ext3 in <tt>data=writeback</tt> mode (meta-data-only journalling)</li> <li><tt>C</tt> ext3 in <tt>data=journal</tt> mode</li> <li><tt>D</tt> ext3 in <tt>data=ordered</tt> mode</li> <li><tt>E</tt> ext3 with htree (hashed directories)</li> <li><tt>F</tt> ext3 with support for filetypes in <tt>readdir()</tt></li> </ul> <p> The table presents absolute values (elapsed time, CPU usage, and disk usage) for reiser4, and ratios against reiser4 for every other configuration. A <font color=red>red</font> number means the ratio is larger than <tt>1.0</tt>, i.e. reiser4 is better in this test; a <font color=green>green</font> number means reiser4 loses in this test. </p> <table cols=19 cellpadding=2 cellspacing=2 noborder> <tr><td bgcolor=black colspan=19><font color=white></td></tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>A.INFO_R4= MKFS=/usr/local/sbin/mkfs.reiser4 -qf FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>B.MOUNT_OPTIONS=data=writeback FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>C.MOUNT_OPTIONS=data=journal FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>D.MOUNT_OPTIONS=data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>E.MKFS=/usr/local/sbin/mkfs.ext3 -O dir_index MOUNT_OPTIONS=data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>F.MKFS=/usr/local/sbin/mkfs.ext3 -O filetype MOUNT_OPTIONS=data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <td colspan=19 bgcolor=#606060><b><font color=white>#0:</font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=6><b>REAL_TIME</b></td> <td colspan=6><b>CPU_TIME</b></td> <td colspan=6><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> 
<td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>CREATE</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 14.06</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.317 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.248 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.050 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.016 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.077 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>5.3</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.558 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.692 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.602 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.823</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.592 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 458224</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>COPY</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 43.62</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.982 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font 
color=red> 1.733 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.033 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 6.685 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.904 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>9.19</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.163 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.286 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.230 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.706</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.200 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 916172</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>READ</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 39.86</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.091 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.091 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.140 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 6.003 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.119 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>8.22</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.467 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.454 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.464 </font></tt></td> <td 
bgcolor=#E0E0C0 align=right><tt><font color=green> 0.529 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.443</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 916172</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>STATS</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1.54</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.987 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.896 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.942 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.649 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.883 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 0.26</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.115 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.115 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.115 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.385 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.962 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 916172</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font 
color=red> 1.107 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>DELETE</b></td> <td bgcolor=#E0E0C0 align=right><tt>37.85</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.833 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.825 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.867 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.133 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.760</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>11.11</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.223</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.223</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.220</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.254 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.222</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>4</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> </tt></td> </tr> <tr> <td colspan=19 bgcolor=#606060><b><font color=white>#1:DD_MBCOUNT=500 </font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=6><b>REAL_TIME</b></td> <td colspan=6><b>CPU_TIME</b></td> <td colspan=6><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A 
</b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_writing_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 42.15</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.062 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.534 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.066 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.071 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.073 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 7.86</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.094 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.500 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.206 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.211 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.198 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 512004</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_reading_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 36.5</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.005</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.008</U> </font></tt></td> <td 
bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.005</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.007</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.007</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>4.7</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.745</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.732</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.743</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.736</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.734</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 512004</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr><td bgcolor=black colspan=19><font color=white></td></tr> <tr><td colspan=19 align=right></td></tr> <tr> <td colspan=19 bgcolor=#303030><b><font color=white>NPROC=1 DIR=/data1 SYNC=off PHASE_COPY=cp REP_COUNTER=3 GAMMA=0.2 PHASE_OVERWRITE=off PHASE_STATS=find FILE_SIZE=8192 BYTES=134217728 PHASE_APPEND=off PHASE_READ=find DEV=/dev/hdb1 DD_MBCOUNT=500 WRITE_BUFFER=131072 PHASE_DELETE=rm PHASE_MODIFY=off </font></b></td></tr> <tr><td colspan=19 align=right> <font size=-2>Produced by <a href=http://namesys.com/benchmarks/mongo_readme.html>Mongo</a> benchmark suite.</font></td></tr> </table> <hr> <p> <a name="mongo.2003.07.23"></a> Below are the older (2003.07.23) mongo results.
</p> <table cols=10 cellpadding=2 cellspacing=2 noborder> <tr><td bgcolor=black colspan=10><font color=white></td></tr> <tr> <th bgcolor=#303030 colspan=10 align=left><font color=white>A. reiser4</th> </tr> <tr> <th bgcolor=#303030 colspan=10 align=left><font color=white>B. ext3 data journalling</th> </tr> <tr> <th bgcolor=#303030 colspan=10 align=left><font color=white>C. ext3 </font></th> </tr> <tr> <td colspan=10 bgcolor=#606060><b><font color=white>#0:</font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=3><b>REAL_TIME</b></td> <td colspan=3><b>CPU_TIME</b></td> <td colspan=3><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>CREATE</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 14.19</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.221 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.592 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 5.66</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.610 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.475 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 458692</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>COPY</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 49.01</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.586 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.783 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 9.08</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.308 </font></tt></td> <td 
bgcolor=#E0E0C0 align=right><tt><font color=red> 1.176 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 916668</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>READ</b></td> <td bgcolor=#E0E0C0 align=right><tt>43.39</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.970</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.017 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>8.1</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.452</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.453</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 916668</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>STATS</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1.93</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.534 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.549 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 0.27</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.000 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.963 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 916668</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>DELETE</b></td> <td bgcolor=#E0E0C0 align=right><tt>40.13</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.797</U> </font></tt></td> <td bgcolor=#E0E0C0 
align=right><tt><font color=green> 0.837 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>11.26</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.217 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.210</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>4</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> </tt></td> </tr> <tr> <td colspan=10 bgcolor=#606060><b><font color=white>#1:DD_MBCOUNT=500 </font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=3><b>REAL_TIME</b></td> <td colspan=3><b>CPU_TIME</b></td> <td colspan=3><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_writing_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 42.27</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.527 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.057 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 7.78</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.497 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.189 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 512004</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_reading_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 36.57</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.005</U> </font></tt></td> <td bgcolor=#E0E0C0 
align=right><tt><font color=black> <U> 1.005</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>4.8</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.760</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.777 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 512004</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr><td bgcolor=black colspan=10><font color=white></td></tr> <tr><td colspan=10 align=right></td></tr> <tr> <td colspan=10 bgcolor=#303030><b><font color=white>NPROC=1 DIR=/data1 SYNC=off PHASE_COPY=cp REP_COUNTER=3 GAMMA=0.2 PHASE_OVERWRITE=off PHASE_STATS=find FILE_SIZE=8192 BYTES=134217728 PHASE_APPEND=off PHASE_READ=find DEV=/dev/hdb1 DD_MBCOUNT=500 WRITE_BUFFER=131072 PHASE_DELETE=rm PHASE_MODIFY=off </font></b></td></tr> <tr><td colspan=10 align=right> <font size=-2>Produced by <a href=http://namesys.com/benchmarks/mongo_readme.html>Mongo</a> benchmark suite.</font></td></tr> </table> <hr> <a name="mongo.2003.07.10"></a> <p> Below are some older benchmarks from just before Linux Tag. In these, note that gamma is the fraction of files that are larger than the base size by 10x. It is set either to 0.2 (as in the benchmark above), in an attempt to mimic observed real usage patterns, or to 0, in an attempt to measure a file size range's performance qualities in isolation. Note that V3 performs poorly in the 0-8k size range, and V4 performs well. This is the result of deep design changes you can read about at <a href="http://www.namesys.com/v4/v4.html">http://www.namesys.com/v4/v4.html</a>.
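The gamma parameter described above can be illustrated with a minimal sketch. This is not taken from the actual mongo sources: the function name and the assumption that the non-large files are uniformly distributed up to the base size are illustrative only; all the source states is that a fraction gamma of the files are 10x the base size.

```python
import random

def mongo_file_sizes(n_files, base_size, gamma, seed=0):
    """Sketch of the file-size mix described above: a fraction
    `gamma` of the files are 10x the base size.  The uniform
    spread of the remaining files up to `base_size` is an
    assumption, not documented mongo behavior."""
    rng = random.Random(seed)
    sizes = []
    for _ in range(n_files):
        if rng.random() < gamma:
            sizes.append(10 * base_size)        # the "large" tail
        else:
            sizes.append(rng.randint(1, base_size))
    return sizes

# GAMMA=0.2, FILE_SIZE=8192 as in the runs above
sizes = mongo_file_sizes(10000, 8192, 0.2)
large = sum(1 for s in sizes if s == 10 * 8192)
print(large / len(sizes))   # roughly 0.2
```

With gamma set to 0, every file stays within the base-size range, which matches the "maximal file size 4k" tables below (GAMMA=0.0 FILE_SIZE=4096) measuring small-file performance in isolation.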
<dl><dt>mem total</dt><dd>513748</dd><dt>machine </dt><dd>strelka</dd><dt>kernel </dt><dd>2.5.74 #213 SMP Thu Jul 10 22:53:23 MSD 2003</dd><dt>date </dt><dd>Thu Jul 10 22:48:56 2003</dd><dt>.config </dt><dd><a href="http://www.namesys.com/intbenchmarks/mongo/03.07.11.nikita/.config">here</a></dd><dt>NPROC</dt><dd>1</dd><dt>DIR</dt><dd>/data1</dd><dt>SYNC</dt><dd>off</dd><dt>REP_COUNTER</dt><dd>3</dd><dt>All phases are in readdir order</dt><dd></dd><dt>BYTES</dt><dd>100M</dd><dt>DEV</dt><dd>/dev/hdb1</dd><dt>WRITE_BUFFER</dt><dd><b>256k</b></dd></dl> <p>everywhere <b>A</b> is reiserfs and <b>B</b> is reiser4. Green numbers mean reiser4 is better.</p> <table cols="7" cellpadding="2" cellspacing="2" noborder=""> <tbody><tr><td bgcolor="black" colspan="7"><font color="white"></font></td></tr> <tr> <th bgcolor="#303030" colspan="7" align="left"><font color="white">median file size 8k</font></th> </tr> <tr align="center" bgcolor="#c0c0c0"> <td></td> <td colspan="2"><b>REAL_TIME</b></td> <td colspan="2"><b>CPU_TIME</b></td> <td colspan="2"><b>DF</b></td> </tr> <tr align="center" bgcolor="#c0c0c0"> <td></td> <td><b>A</b></td><td><b>B/A </b></td> <td><b>A</b></td><td><b>B/A </b></td> <td><b>A</b></td><td><b>B/A </b></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>CREATE</b></td> <td bgcolor="#e0e0c0" align="right"><tt>41.26</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.246</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>3.93</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.908</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>321632</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.961</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>COPY</b></td> <td bgcolor="#e0e0c0" align="right"><tt>154.09</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.504</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 5.17</u></tt></td> <td 
bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.217 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>642624</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.962</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>APPEND</b></td> <td bgcolor="#e0e0c0" align="right"><tt>282.09</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.573</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 6.6</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.392 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>944428</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 0.980</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>MODIFY</b></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 284.52</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 0.986</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 3.29</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.489 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 943592</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 0.981</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>OVERWRITE</b></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 298.19</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.263 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 5.33</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.608 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>943548</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.968</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>READ</b></td> <td bgcolor="#e0e0c0" align="right"><tt>245.22</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.940</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 
3.85</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.753 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>943548</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.968</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>STATS</b></td> <td bgcolor="#e0e0c0" align="right"><tt>20.58</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.099</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 0.48</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.292 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>943548</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.968</u> </font></tt></td> </tr> <tr> <td colspan="7" bgcolor="#a0a0a0"><b><font color="white">GAMMA=0.2 FILE_SIZE=8192 <a href="http://www.namesys.com/intbenchmarks/mongo/03.07.11.nikita/8k.heavy.v3.profile">A profile</a> <a href="http://www.namesys.com/intbenchmarks/mongo/03.07.11.nikita/8k.heavy.v4.profile">B profile</a></font></b></td></tr> <tr><td bgcolor="white" colspan="7"><font color="white"></font></td></tr> <tr><td bgcolor="white" colspan="7"><font color="white"></font></td></tr> <tr><td bgcolor="white" colspan="7"><font color="white"></font></td></tr> <tr><td bgcolor="black" colspan="7"><font color="white"></font></td></tr> <tr> <th bgcolor="#303030" colspan="7" align="left"><font color="white">median file size 4k</font></th> </tr> <tr align="center" bgcolor="#c0c0c0"> <td></td> <td colspan="2"><b>REAL_TIME</b></td> <td colspan="2"><b>CPU_TIME</b></td> <td colspan="2"><b>DF</b></td> </tr> <tr align="center" bgcolor="#c0c0c0"> <td></td> <td><b>A</b></td><td><b>B/A </b></td> <td><b>A</b></td><td><b>B/A </b></td> <td><b>A</b></td><td><b>B/A </b></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>CREATE</b></td> <td bgcolor="#e0e0c0" align="right"><tt>117.32</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.176</u> </font></tt></td> 
<td bgcolor="#e0e0c0" align="right"><tt>15.57</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.758</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 667652</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.000</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>COPY</b></td> <td bgcolor="#e0e0c0" align="right"><tt>524.67</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.365</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 19.16</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.059 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 1332856</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.002</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>APPEND</b></td> <td bgcolor="#e0e0c0" align="right"><tt>1068.43</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.363</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>31.27</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.937</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>2073420</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.950</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>MODIFY</b></td> <td bgcolor="#e0e0c0" align="right"><tt>1081.23</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.670</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 18.61</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.048 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>2066536</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.953</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>OVERWRITE</b></td> <td bgcolor="#e0e0c0" align="right"><tt>1050.55</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 
0.885</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 22.81</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.017</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>2066424</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.948</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>READ</b></td> <td bgcolor="#e0e0c0" align="right"><tt>974.43</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.644</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 12.28</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.635 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>2066424</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.948</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>STATS</b></td> <td bgcolor="#e0e0c0" align="right"><tt>83.44</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.075</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>1.26</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.802</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>2066424</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.948</u> </font></tt></td> </tr> <tr> <td colspan="7" bgcolor="#a0a0a0"><b><font color="white">GAMMA=0.2 FILE_SIZE=4096 <a href="http://www.namesys.com/intbenchmarks/mongo/03.07.11.nikita/4k.heavy.v3.profile">A profile</a> <a href="http://www.namesys.com/intbenchmarks/mongo/03.07.11.nikita/4k.heavy.v4.profile">B profile</a></font></b></td></tr> <tr><td bgcolor="white" colspan="7"><font color="white"></font></td></tr> <tr><td bgcolor="white" colspan="7"><font color="white"></font></td></tr> <tr><td bgcolor="white" colspan="7"><font color="white"></font></td></tr> <tr><td bgcolor="black" colspan="7"><font color="white"></font></td></tr> <tr> <th bgcolor="#303030" colspan="7" 
align="left"><font color="white">maximal file size 4k</font></th> </tr> <tr align="center" bgcolor="#c0c0c0"> <td></td> <td colspan="2"><b>REAL_TIME</b></td> <td colspan="2"><b>CPU_TIME</b></td> <td colspan="2"><b>DF</b></td> </tr> <tr align="center" bgcolor="#c0c0c0"> <td></td> <td><b>A</b></td><td><b>B/A </b></td> <td><b>A</b></td><td><b>B/A </b></td> <td><b>A</b></td><td><b>B/A </b></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>CREATE</b></td> <td bgcolor="#e0e0c0" align="right"><tt>77.34</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.309</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>21.86</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.938</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>452252</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.923</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>COPY</b></td> <td bgcolor="#e0e0c0" align="right"><tt>412.28</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.300</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 35.11</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.013</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>893408</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.934</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>APPEND</b></td> <td bgcolor="#e0e0c0" align="right"><tt>1198.9</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.164</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>67.06</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.694</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>1631992</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.749</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>MODIFY</b></td> <td bgcolor="#e0e0c0" 
align="right"><tt>1305.14</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.351</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>43.77</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.762</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>1613124</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.758</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>OVERWRITE</b></td> <td bgcolor="#e0e0c0" align="right"><tt>1390.94</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.239</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>44.22</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.777</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>1610948</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.759</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>READ</b></td> <td bgcolor="#e0e0c0" align="right"><tt>1093.6</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.256</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 19.46</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.743 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>1610948</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.759</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>STATS</b></td> <td bgcolor="#e0e0c0" align="right"><tt>115.76</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.200</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>2.6</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.735</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>1610948</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.759</u> </font></tt></td> </tr> <tr> <td colspan="7" bgcolor="#a0a0a0"><b><font 
color="white">GAMMA=0.0 FILE_SIZE=4096 <a href="http://www.namesys.com/intbenchmarks/mongo/03.07.11.nikita/100.heavy.v3.profile">A profile</a> <a href="http://www.namesys.com/intbenchmarks/mongo/03.07.11.nikita/100.heavy.v4.profile">B profile</a></font></b></td></tr> <tr><td bgcolor="white" colspan="7"><font color="white"></font></td></tr> <tr><td bgcolor="white" colspan="7"><font color="white"></font></td></tr> <tr><td bgcolor="white" colspan="7"><font color="white"></font></td></tr> <tr> <th bgcolor="#303030" colspan="7" align="left"><font color="white">median file size 8k</font></th> </tr> <tr align="center" bgcolor="#c0c0c0"> <td></td> <td colspan="2"><b>REAL_TIME</b></td> <td colspan="2"><b>CPU_TIME</b></td> <td colspan="2"><b>DF</b></td> </tr> <tr align="center" bgcolor="#c0c0c0"> <td></td> <td><b>A</b></td><td><b>B/A </b></td> <td><b>A</b></td><td><b>B/A </b></td> <td><b>A</b></td><td><b>B/A </b></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>CREATE</b></td> <td bgcolor="#e0e0c0" align="right"><tt>40.54</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.248</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>4.01</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.895</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>321632</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.961</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>COPY</b></td> <td bgcolor="#e0e0c0" align="right"><tt>152.82</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.506</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 5.2</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.215 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>642624</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.962</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>READ</b></td> <td bgcolor="#e0e0c0" 
align="right"><tt>141.8</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.563</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 3.03</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.762 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>642624</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.962</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>STATS</b></td> <td bgcolor="#e0e0c0" align="right"><tt>14.91</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.084</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 0.59</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.051 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>642624</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.962</u> </font></tt></td> </tr> <tr><td bgcolor="black" colspan="7"><font color="white"></font></td></tr> <tr><td colspan="7" align="right"> </td></tr><tr> <td colspan="7" bgcolor="#303030"><b><font color="white">GAMMA=0.2 FILE_SIZE=8192</font></b></td></tr> <tr><td bgcolor="white" colspan="7"><font color="white"></font></td></tr> <tr><td bgcolor="white" colspan="7"><font color="white"></font></td></tr> <tr><td bgcolor="white" colspan="7"><font color="white"></font></td></tr> <tr> <th bgcolor="#303030" colspan="7" align="left"><font color="white">median file size 4k</font></th> </tr> <tr align="center" bgcolor="#c0c0c0"> <td></td> <td colspan="2"><b>REAL_TIME</b></td> <td colspan="2"><b>CPU_TIME</b></td> <td colspan="2"><b>DF</b></td> </tr> <tr align="center" bgcolor="#c0c0c0"> <td></td> <td><b>A</b></td><td><b>B/A </b></td> <td><b>A</b></td><td><b>B/A </b></td> <td><b>A</b></td><td><b>B/A </b></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>CREATE</b></td> <td bgcolor="#e0e0c0" align="right"><tt>115.6</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.174</u> 
</font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>14.84</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.772</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 667652</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.000</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>COPY</b></td> <td bgcolor="#e0e0c0" align="right"><tt>528.83</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.361</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 18.91</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.058 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 1332856</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.002</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>READ</b></td> <td bgcolor="#e0e0c0" align="right"><tt>532.06</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.372</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 10.87</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.589 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 1332856</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.002</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>STATS</b></td> <td bgcolor="#e0e0c0" align="right"><tt>51.99</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.069</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>1.67</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.581</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 1332856</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.002</u> </font></tt></td> </tr> <tr><td bgcolor="black" colspan="7"><font color="white"></font></td></tr> <tr><td colspan="7" align="right"> </td></tr><tr> <td colspan="7" 
bgcolor="#303030"><b><font color="white">GAMMA=0.2 FILE_SIZE=4096</font></b></td></tr> <tr><td bgcolor="white" colspan="7"><font color="white"></font></td></tr> <tr><td bgcolor="white" colspan="7"><font color="white"></font></td></tr> <tr><td bgcolor="white" colspan="7"><font color="white"></font></td></tr> <tr> <th bgcolor="#303030" colspan="7" align="left"><font color="white">maximal file size 4k</font></th> </tr> <tr align="center" bgcolor="#c0c0c0"> <td></td> <td colspan="2"><b>REAL_TIME</b></td> <td colspan="2"><b>CPU_TIME</b></td> <td colspan="2"><b>DF</b></td> </tr> <tr align="center" bgcolor="#c0c0c0"> <td></td> <td><b>A</b></td><td><b>B/A </b></td> <td><b>A</b></td><td><b>B/A </b></td> <td><b>A</b></td><td><b>B/A </b></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>CREATE</b></td> <td bgcolor="#e0e0c0" align="right"><tt>77.5</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.309</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>22.24</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.910</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>452252</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.923</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>COPY</b></td> <td bgcolor="#e0e0c0" align="right"><tt>415.84</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.297</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 34.9</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.009</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>893408</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.934</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>READ</b></td> <td bgcolor="#e0e0c0" align="right"><tt>469.97</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.273</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 
20.14</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.454 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>893408</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.934</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>STATS</b></td> <td bgcolor="#e0e0c0" align="right"><tt>65.49</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.162</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>3.09</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.599</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>893408</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.934</u> </font></tt></td> </tr> <tr><td bgcolor="black" colspan="7"><font color="white"></font></td></tr> <tr><td colspan="7" align="right"> </td></tr><tr> <td colspan="7" bgcolor="#303030"><b><font color="white">GAMMA=0.0 FILE_SIZE=4096</font></b></td></tr> </tbody></table> <hr> <h1>Mongo benchmark results</h1> <h2>create, copy, read, stats, delete phases</h2> <dl><dt>reiser4 </dt><dd>ChangeSet@1.1095, 2003-07-10 15:22:17+04:00, god@laputa.namesys.com oops ChangeSet@1.1094, 2003-07-10 15:14:06+04:00, god@laputa.namesys.com repairing compilation damage. 
</dd><dt>mem total</dt><dd>256624</dd><dt>machine </dt><dd>belka</dd><dt>kernel </dt><dd>2.5.74 #28 Thu Jul 10 18:36:03 MSD 2003</dd><dt>date </dt><dd>Thu Jul 10 19:21:06 2003</dd><dt><a href="http://namesys.com/intbenchmarks/mongo/03.07.11.light/dot.config">.config</a></dt></dl> <table cols="19" cellpadding="2" cellspacing="2" noborder=""> <tbody><tr><td bgcolor="black" colspan="19"><font color="white"></font></td></tr> <tr> <th bgcolor="#303030" colspan="19" align="left"><font color="white">A.INFO_R4=test FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor="#303030" colspan="19" align="left"><font color="white">B.INFO_R4=test FSTYPE=reiser4 MKFS=mkfs.reiser4 -q -e extent40 </font></th> </tr> <tr> <th bgcolor="#303030" colspan="19" align="left"><font color="white">C.FSTYPE=reiserfs </font></th> </tr> <tr> <th bgcolor="#303030" colspan="19" align="left"><font color="white">D.FSTYPE=reiserfs MOUNT_OPTIONS=notail </font></th> </tr> <tr> <th bgcolor="#303030" colspan="19" align="left"><font color="white">E.FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor="#303030" colspan="19" align="left"><font color="white">F.FSTYPE=ext3 MOUNT_OPTIONS=data=journal </font></th> </tr> <tr> <td colspan="19" bgcolor="#606060"><b><font color="white">#0:FILE_SIZE=4000 </font></b></td></tr> <tr align="center" bgcolor="#c0c0c0"> <td></td> <td colspan="6"><b>REAL_TIME</b></td> <td colspan="6"><b>CPU_TIME</b></td> <td colspan="6"><b>DF</b></td> </tr> <tr align="center" bgcolor="#c0c0c0"> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>CREATE</b></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 20.47</u></tt></td> <td bgcolor="#e0e0c0" 
align="right"><tt><font color="red"> 1.404 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 3.037 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 2.024 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 2.513 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 3.324 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>12.72</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.143 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.270 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.873 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.615</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.606</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 416332</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.934 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.088 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.909 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.858 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.858 </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>COPY</b></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 65.25</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.484 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 2.953 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 2.020 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.986 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 2.267 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>21.98</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font 
color="red"> 1.032 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.098 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.732 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.529</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.699 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 832640</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.934 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.088 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.910 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.858 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.858 </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>READ</b></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 75.56</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.349 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 2.868 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 2.218 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.902 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.925 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>17.36</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.213 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.745 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.857 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.695 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.681</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 832640</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font 
color="red"> 1.934 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.088 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.910 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.858 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.858 </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>STATS</b></td> <td bgcolor="#e0e0c0" align="right"><tt>132.18</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> 0.996 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.963</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> 0.994 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.967</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.950</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>2.63</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.977</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.970</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 0.989</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 0.981</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> 1.008 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 832640</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.934 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.088 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.910 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.858 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.858 </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>DELETE</b></td> <td bgcolor="#e0e0c0" 
align="right"><tt>85.32</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.627 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.239 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.442 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.403</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.449 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>33.57</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.856 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.780 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.623 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.157</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.154</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>4</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> 1.000 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.000</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.000</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.000</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.000</u> </font></tt></td> </tr> <tr> <td colspan="19" bgcolor="#606060"><b><font color="white">#1:FILE_SIZE=8000 </font></b></td></tr> <tr align="center" bgcolor="#c0c0c0"> <td></td> <td colspan="6"><b>REAL_TIME</b></td> <td colspan="6"><b>CPU_TIME</b></td> <td colspan="6"><b>DF</b></td> </tr> <tr align="center" bgcolor="#c0c0c0"> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A 
</b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>CREATE</b></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 15.07</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.009</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 8.875 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.709 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 2.237 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 3.321 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>8.62</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.945 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.932 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.729 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.517</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.522</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 399788</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.000</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.243 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.461 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.434 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.434 </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>COPY</b></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 52.24</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.007</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 4.998 </font></tt></td> <td bgcolor="#e0e0c0" 
align="right"><tt><font color="red"> 1.492 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.562 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.879 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>13.42</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.026 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.264 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.700 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.487</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.635 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 799488</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.000</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.243 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.461 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.434 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.434 </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>READ</b></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 60.91</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.013</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 3.738 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.606 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.333 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.340 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>11.66</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> 1.018 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.526</u> </font></tt></td> <td bgcolor="#e0e0c0" 
align="right"><tt><font color="green"> 0.749 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.547 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.547 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 799488</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.000</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.243 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.461 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.434 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.434 </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>STATS</b></td> <td bgcolor="#e0e0c0" align="right"><tt>126.53</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.951</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.958</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> 0.991 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> 1.004 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.966</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 2.57</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.023 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.027 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 0.988</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> 1.016 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> 1.012 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 799488</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.000</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.243 
</font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.461 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.434 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.434 </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>DELETE</b></td> <td bgcolor="#e0e0c0" align="right"><tt>73.21</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.116 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.746 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.242</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.301 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.396 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>19.93</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> 1.013 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.584 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.530 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.126 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.123</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>4</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> 1.000 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.000</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.000</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.000</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.000</u> </font></tt></td> </tr> <tr><td bgcolor="black" colspan="19"><font color="white"></font></td></tr> <tr><td colspan="19" align="right"> </td></tr><tr> <td colspan="19" bgcolor="#303030"><b><font 
color="white">PHASE_APPEND=off NPROC=1 DIR=/mnt/testfs SYNC=off REP_COUNTER=3 GAMMA=0.0 PHASE_OVERWRITE=off DEV=/dev/hdb3 WRITE_BUFFER=4096 BYTES=128000000 PHASE_MODIFY=off </font></b></td></tr> <tr><td colspan="19" align="right"> <font size="-2">Produced by <a href="http://namesys.com/benchmarks/mongo_readme.html">Mongo</a> benchmark suite.</font></td></tr> </tbody></table> <h2>dd of a large file phase</h2> <dl><dt>reiser4 </dt><dd>ChangeSet@1.1095, 2003-07-10 15:22:17+04:00, god@laputa.namesys.com oops ChangeSet@1.1094, 2003-07-10 15:14:06+04:00, god@laputa.namesys.com repairing compilation damage. </dd><dt>mem total</dt><dd>256624</dd><dt>machine </dt><dd>belka</dd><dt>kernel </dt><dd>2.5.74 #28 Thu Jul 10 18:36:03 MSD 2003</dd><dt>date </dt><dd>Thu Jul 10 21:36:22 2003</dd><dt><a href="http://namesys.com/intbenchmarks/mongo/03.07.11.light/dot.config">.config</a></dt></dl> <table cols="19" cellpadding="2" cellspacing="2" noborder=""> <tbody><tr><td bgcolor="black" colspan="19"><font color="white"></font></td></tr> <tr> <th bgcolor="#303030" colspan="19" align="left"><font color="white">A.INFO_R4=test FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor="#303030" colspan="19" align="left"><font color="white">B.INFO_R4=test FSTYPE=reiser4 MKFS=mkfs.reiser4 -q -e extent40 </font></th> </tr> <tr> <th bgcolor="#303030" colspan="19" align="left"><font color="white">C.FSTYPE=reiserfs </font></th> </tr> <tr> <th bgcolor="#303030" colspan="19" align="left"><font color="white">D.FSTYPE=reiserfs MOUNT_OPTIONS=notail </font></th> </tr> <tr> <th bgcolor="#303030" colspan="19" align="left"><font color="white">E.FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor="#303030" colspan="19" align="left"><font color="white">F.FSTYPE=ext3 MOUNT_OPTIONS=data=journal </font></th> </tr> <tr> <td colspan="19" bgcolor="#606060"><b><font color="white">#0:DD_MBCOUNT=768 </font></b></td></tr> <tr align="center" bgcolor="#c0c0c0"> <td></td> <td colspan="6"><b>REAL_TIME</b></td> <td 
colspan="6"><b>CPU_TIME</b></td> <td colspan="6"><b>DF</b></td> </tr> <tr align="center" bgcolor="#c0c0c0"> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>dd_writing_largefile</b></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 76.29</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 0.997</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.137 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.149 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.062 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 2.217 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>7.47</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.027 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.545</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.549</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.803 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.835 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 786432</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.000</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.001</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.001</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.001</u> </font></tt></td> <td bgcolor="#e0e0c0" 
align="right"><tt><font color="black"> <u> 1.001</u> </font></tt></td> </tr> <tr><td bgcolor="black" colspan="19"><font color="white"></font></td></tr> <tr><td colspan="19" align="right"> </td></tr><tr> <td colspan="19" bgcolor="#303030"><b><font color="white">NPROC=1 DIR=/mnt/testfs SYNC=off REP_COUNTER=3 GAMMA=0.0 DD_MBCOUNT=768 DEV=/dev/hdb3 WRITE_BUFFER=4096 FILE_SIZE=8000 BYTES=128000000 </font></b></td></tr> <tr><td colspan="19" align="right"> <font size="-2">Produced by <a href="http://namesys.com/benchmarks/mongo_readme.html">Mongo</a> benchmark suite.</font></td></tr> </tbody></table> <hr> <a name="bonnie++.2003.09.30"> This is bonnie++ output for reiser4 and ext3. This has been done in an attempt to analyze <a href="http://fsbench.netnation.com/">results</a> obtained by Mike Benoit. Hardware specs: <pre> processor : 3 vendor_id : GenuineIntel cpu family : 15 model : 2 model name : Intel(R) Xeon(TM) CPU 2.40GHz stepping : 7 cpu MHz : 2379.253 cache size : 512 KB bogomips : 4751.36 </pre> Dual CPU with hyper-threading Memory: 128M HDD: <pre> # hdparm /dev/hdb1 /dev/hdb1: multcount = 16 (on) IO_support = 0 (default 16-bit) unmaskirq = 0 (off) using_dma = 1 (on) keepsettings = 0 (off) readonly = 0 (off) readahead = 256 (on) geometry = 65535/16/63, sectors = 117226242, start = 63 # hdparm -t /dev/hdb1 /dev/hdb1: Timing buffered disk reads: 64 MB in 1.60 seconds = 39.91 MB/sec # hdparm -i /dev/hdb /dev/hdb: Model=ST360021A, FwRev=3.19, SerialNo=3HR173RB Config={ HardSect NotMFM HdSw>15uSec Fixed DTR>10Mbs RotSpdTol>.5% } RawCHS=16383/16/63, TrkSize=0, SectSize=0, ECCbytes=4 BuffType=unknown, BuffSize=2048kB, MaxMultSect=16, MultSect=16 CurCHS=16383/16/63, CurSects=16514064, LBA=yes, LBAsects=117231408 IORDY=on/off, tPIO={min:240,w/IORDY:120}, tDMA={min:120,rec:120} PIO modes: pio0 pio1 pio2 pio3 pio4 DMA modes: mdma0 mdma1 mdma2 UDMA modes: udma0 udma1 udma2 udma3 udma4 *udma5 AdvancedPM=no WriteCache=enabled Drive conforms to: device does not report version: 1 
2 3 4 5 </pre> <pre> ./bonnie++ -s 1g -n 10 -x 5 Version 1.03 ------Sequential Output------ --Sequential Input- --Random- -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks-- Machine Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP v4.128M 1G 19903 89 37911 20 15392 11 13624 58 41807 12 131.0 0 v4.128M 1G 19965 89 37600 20 15845 11 13730 58 41751 12 130.0 0 v4.128M 1G 19937 89 37746 20 15404 11 13624 58 41793 12 132.1 0 v4.128M 1G 19998 89 37184 19 15007 10 13393 56 41611 11 130.2 0 v4.128M 1G 19771 89 37679 20 15206 11 13466 57 41808 11 130.2 1 ext3.128M 1G 21236 99 37258 22 11357 4 13460 56 41748 6 120.0 0 ext3.128M 1G 20821 99 36838 23 12176 5 13154 55 40671 6 120.7 0 ext3.128M 1G 20755 99 37032 24 12069 4 12908 54 40851 5 120.2 0 ext3.128M 1G 20651 99 37094 24 11817 5 13038 54 40842 6 121.3 0 ext3.128M 1G 20928 99 37300 23 12287 4 13067 55 41404 6 120.1 0 ------Sequential Create------ --------Random Create-------- -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete-- files:max:min /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP v4.128M 10 18503 100 +++++ +++ 9488 99 10158 99 +++++ +++ 11635 99 v4.128M 10 19760 99 +++++ +++ 9696 99 10441 100 +++++ +++ 11831 99 v4.128M 10 19583 100 +++++ +++ 9672 100 10597 99 +++++ +++ 11846 100 v4.128M 10 19720 100 +++++ +++ 9577 99 10126 100 +++++ +++ 11924 100 v4.128M 10 19682 100 +++++ +++ 9683 100 10461 100 +++++ +++ 11834 100 ext3.128M 10 3279 97 +++++ +++ +++++ +++ 3406 100 +++++ +++ 8951 95 ext3.128M 10 3303 98 +++++ +++ +++++ +++ 3423 99 +++++ +++ 8558 96 ext3.128M 10 3317 98 +++++ +++ +++++ +++ 3402 100 +++++ +++ 8721 93 ext3.128M 10 3325 98 +++++ +++ +++++ +++ 3390 100 +++++ +++ 9242 100 ext3.128M 10 3315 97 +++++ +++ +++++ +++ 3439 100 +++++ +++ 8896 96 </pre> <pre> ./bonnie++ -f -d . 
-s 3072 -n 10:100000:10:10 -x 1 Version 1.03 ------Sequential Output------ --Sequential Input- --Random- -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks-- Machine Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP v4 3G 37579 19 15657 11 41531 11 105.8 0 v4 3G 37993 20 15478 11 41632 11 105.4 0 ext3 3G 35221 22 10987 4 41105 6 90.9 0 ext3 3G 35099 22 11517 4 41416 6 90.7 0 ------Sequential Create------ --------Random Create-------- -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete-- files:max:min /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP v4 10:100000:10/10 570 39 746 17 1435 23 513 40 104 2 951 15 v4 10:100000:10/10 566 40 765 17 1385 23 509 41 104 2 904 14 ext3 10:100000:10/10 221 8 364 4 853 4 204 7 99 1 306 2 ext3 10:100000:10/10 221 7 368 4 839 5 206 7 91 1 309 2 </pre> <hr> <a name="grant"></a> Benchmarks performed by <a href="mailto:mine0057@mrs.umn.edu">Grant Miner</a>. He used the <a href="http://epoxy.mrs.umn.edu/~minerg/fstests/bench.scm">bench.scm</a> script (requires <a href="http://www.scsh.net/">scsh</a>). Results (copied from <a href="http://epoxy.mrs.umn.edu/~minerg/fstests/results.html">http://epoxy.mrs.umn.edu/~minerg/fstests/results.html</a>): <p>2.6.0-test3</p> <p>mkfs ran with default options</p> <p>Each test occupies three columns: the first gives the canonical name of the test and the wall-clock time it took, in seconds; the second gives system CPU time; the third gives user CPU time. In the final group of columns, "total" is the total wall-clock time, "sys" the total system CPU time, and "usr" the total user CPU time; "total cpu" is the sum of total sys and total usr time.
</p> <p><b>all values are in seconds thus lower is better</b></p> <table border cellspacing=0 cellpadding=5> <caption>Filesystem Performance</caption> <colgroup> <col> <col bgcolor="gray"> </colgroup> <tr> <th>fs</th> <td bgcolor="lightgray">bigdir</td> <td>sys</td> <td>usr</td> <td bgcolor="lightgray">cp</td> <td>sys</td> <td>usr</td> <td bgcolor="lightgray">cp2</td> <td>sys</td> <td>usr</td> <td bgcolor="lightgray">cp3</td> <td>sys</td> <td>usr</td> <td bgcolor="lightgray">cp4</td> <td>sys</td> <td>usr</td> <td bgcolor="lightgray">cp5</td> <td>sys</td> <td>usr</td> <td bgcolor="lightgray">rm</td> <td>sys</td> <td>usr</td> <td bgcolor="lightgray">rm2</td> <td>sys</td> <td>usr</td> <td bgcolor="lightgray">rm3</td> <td>sys</td> <td>usr</td> <td bgcolor="lightgray">sync</td> <td>sys</td> <td>usr</td> <td bgcolor="lightgray">total</td> <td>sys</td> <td>usr</td> <td bgcolor="lightgray">total cpu</td> <th>fs</th> </tr> <tr> <th>reiserfs</th> <td bgcolor="lightgray">40.03</td> <td>12.22</td> <td>0.76</td> <td bgcolor="lightgray">77.75</td> <td>10.72</td> <td>0.45</td> <td bgcolor="lightgray">62.9</td> <td>10.82</td> <td>0.43</td> <td bgcolor="lightgray">60.26</td> <td>11.03</td> <td>0.43</td> <td bgcolor="lightgray">61.33</td> <td>11.13</td> <td>0.43</td> <td bgcolor="lightgray">66.08</td> <td>11.31</td> <td>0.45</td> <td bgcolor="lightgray">10.86</td> <td>3.74</td> <td>0.07</td> <td bgcolor="lightgray">4.62</td> <td>3.36</td> <td>0.09</td> <td bgcolor="lightgray">8.22</td> <td>3.5</td> <td>0.09</td> <td bgcolor="lightgray">1.78</td> <td>0.03</td> <td>0.</td> <td bgcolor="lightgray">393.83</td> <td>77.86</td> <td>3.2</td> <td bgcolor="lightgray">81.06</td> <th>reiserfs</th> </tr> <tr> <th>jfs</th> <td bgcolor="lightgray">47.2</td> <td>8.9</td> <td>0.77</td> <td bgcolor="lightgray">109.75</td> <td>5.5</td> <td>0.3</td> <td bgcolor="lightgray">110.71</td> <td>5.49</td> <td>0.35</td> <td bgcolor="lightgray">114.69</td> <td>5.6</td> <td>0.29</td> <td 
bgcolor="lightgray">117.97</td> <td>5.65</td> <td>0.35</td> <td bgcolor="lightgray">125.48</td> <td>5.82</td> <td>0.29</td> <td bgcolor="lightgray">38.68</td> <td>0.74</td> <td>0.05</td> <td bgcolor="lightgray">16.25</td> <td>1.08</td> <td>0.07</td> <td bgcolor="lightgray">37.46</td> <td>0.74</td> <td>0.04</td> <td bgcolor="lightgray">0.07</td> <td>0.</td> <td>0.</td> <td bgcolor="lightgray">718.26</td> <td>39.52</td> <td>2.51</td> <td bgcolor="lightgray">42.03</td> <th>jfs</th> </tr> <tr> <th>xfs</th> <td bgcolor="lightgray">44.77</td> <td>13.3</td> <td>0.94</td> <td bgcolor="lightgray">105.36</td> <td>13.33</td> <td>0.53</td> <td bgcolor="lightgray">110.27</td> <td>14.36</td> <td>0.5</td> <td bgcolor="lightgray">110.17</td> <td>14.37</td> <td>0.51</td> <td bgcolor="lightgray">111.03</td> <td>14.43</td> <td>0.53</td> <td bgcolor="lightgray">118.84</td> <td>14.87</td> <td>0.55</td> <td bgcolor="lightgray">31.85</td> <td>6.44</td> <td>0.15</td> <td bgcolor="lightgray">15.2</td> <td>5.45</td> <td>0.14</td> <td bgcolor="lightgray">34.32</td> <td>5.87</td> <td>0.14</td> <td bgcolor="lightgray">0.03</td> <td>0.</td> <td>0.</td> <td bgcolor="lightgray">681.84</td> <td>102.42</td> <td>3.99</td> <td bgcolor="lightgray">106.41</td> <th>xfs</th> </tr> <tr> <th>reiser4</th> <td bgcolor="lightgray">33.51</td> <td>10.85</td> <td>0.69</td> <td bgcolor="lightgray">33.9</td> <td>10.65</td> <td>0.65</td> <td bgcolor="lightgray">32.9</td> <td>10.79</td> <td>0.67</td> <td bgcolor="lightgray">34.</td> <td>10.87</td> <td>0.65</td> <td bgcolor="lightgray">33.62</td> <td>10.87</td> <td>0.69</td> <td bgcolor="lightgray">31.31</td> <td>10.83</td> <td>0.76</td> <td bgcolor="lightgray">17.45</td> <td>4.07</td> <td>0.3</td> <td bgcolor="lightgray">11.54</td> <td>4.49</td> <td>0.3</td> <td bgcolor="lightgray">13.08</td> <td>4.27</td> <td>0.27</td> <td bgcolor="lightgray">0.52</td> <td>0.</td> <td>0.</td> <td bgcolor="lightgray">241.83</td> <td>77.69</td> <td>4.98</td> <td 
bgcolor="lightgray">82.67</td> <th>reiser4</th> </tr> <tr> <th>ext3</th> <td bgcolor="lightgray">38.79</td> <td>9.35</td> <td>0.7</td> <td bgcolor="lightgray">91.57</td> <td>7.21</td> <td>0.36</td> <td bgcolor="lightgray">62.6</td> <td>7.44</td> <td>0.36</td> <td bgcolor="lightgray">62.74</td> <td>7.5</td> <td>0.37</td> <td bgcolor="lightgray">60.62</td> <td>7.52</td> <td>0.34</td> <td bgcolor="lightgray">69.82</td> <td>7.59</td> <td>0.39</td> <td bgcolor="lightgray">26.21</td> <td>1.67</td> <td>0.05</td> <td bgcolor="lightgray">8.73</td> <td>1.66</td> <td>0.04</td> <td bgcolor="lightgray">13.79</td> <td>1.63</td> <td>0.06</td> <td bgcolor="lightgray">4.76</td> <td>0.01</td> <td>0.</td> <td bgcolor="lightgray">439.63</td> <td>51.58</td> <td>2.67</td> <td bgcolor="lightgray">54.25</td> <th>ext3</th> </tr> <tr> <th>ext2</th> <td bgcolor="lightgray">32.78</td> <td>7.61</td> <td>0.64</td> <td bgcolor="lightgray">37.28</td> <td>5.24</td> <td>0.34</td> <td bgcolor="lightgray">43.55</td> <td>5.34</td> <td>0.35</td> <td bgcolor="lightgray">45.41</td> <td>5.34</td> <td>0.37</td> <td bgcolor="lightgray">47.72</td> <td>5.48</td> <td>0.34</td> <td bgcolor="lightgray">50.5</td> <td>5.41</td> <td>0.32</td> <td bgcolor="lightgray">16.28</td> <td>0.67</td> <td>0.06</td> <td bgcolor="lightgray">7.54</td> <td>0.66</td> <td>0.05</td> <td bgcolor="lightgray">15.31</td> <td>0.71</td> <td>0.05</td> <td bgcolor="lightgray">0.24</td> <td>0.</td> <td>0.</td> <td bgcolor="lightgray">296.61</td> <td>36.46</td> <td>2.52</td> <td bgcolor="lightgray">38.98</td> <th>ext2</th> </tr> </table> <hr> </body> </html> <hr> <address><a href="mailto:reiser@namesys.com">Hans Reiser</a></address> <!-- Created: Sat Aug 23 00:28:46 MSD 2003 --> <!-- hhmts start --> Last modified: Thu Nov 20 17:51:10 MSK 2003 <!-- hhmts end --> </body> </html> [[category:ReiserFS]] f5ecbd55db3abe23774ebfbd0076f5f841c0c3ee 1378 2009-06-25T10:01:36Z Chris goe 2 http://web.archive.org/web/20061113154555/http://www.namesys.com/benchmarks.html <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"> <html> <head> <BASE HREF="http://www.namesys.com.wstub.archive.org/benchmarks.html">
<title>Benchmarks Of Reiser4</title> </head> <body> <h1>Benchmarks Of ReiserFS Version 4</h1> <body> <hr> <H1>Remarks</H1> <p> Htree (-O dir_index) is the recent attempt by ext3 developers to handle large directories as well as reiserfs does, by using better-than-linear search algorithms. One of the interesting results here was that htree actually degrades ext3 performance, at least for this benchmark. This means that trying to get usable performance for large directories with ext3 can severely impact your performance for the non-large case. <p> You'll note that in our latest benchmark at the top we use larger filesets. It seems that ext3 does a poor job of utilizing its write cache when the fileset uses a lot of memory without exceeding it, and by increasing the size of the fileset we get a fairer (read: better for ext3) benchmark for the create phase. The use of filesets small enough to barely fit into RAM for the create (but not the copy) phase was due to my being lax in supervising the benchmarking, but it did reveal something interesting. Andrew Morton will probably fix that quickly; it's most likely not a deep fix like fixing htrees would be. <p> If anyone knows where the tail-combining patch for ext3 went, let us know so we can benchmark it; good tail-combining performance is not trivial to get right, and I wonder whether there is a performance reason it was not merged. <p> Keep in mind that these benchmarks are still evolving and maturing, and I need to review the mongo code completely again, as it has been worked on by others quite a bit. Note that while I like the mongo benchmarks, those concerned that they may be stacked in our favor can look at the benchmarks run by others on lkml, one of which is at the bottom of this page; while not as elaborate and detailed as mongo, it comes to roughly the same result.
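<p> The large-directory point above can be probed independently of mongo: create many files in a single directory, then time random stat() lookups against them. With a linear directory search the per-lookup cost grows with directory size, while an indexed directory (htree on ext3, the B*-tree in reiserfs/reiser4) should stay roughly flat. The following is a minimal, hypothetical sketch, not part of the mongo suite; the file counts and names are arbitrary choices:

```python
import os
import random
import tempfile
import time

def time_dir_lookups(nfiles=2000, nlookups=500):
    """Create `nfiles` empty files in one directory, then time
    `nlookups` random stat() calls against them.  Returns the
    elapsed lookup time in seconds.  This only measures; it does
    not toggle the dir_index feature on the filesystem."""
    with tempfile.TemporaryDirectory() as d:
        names = ["f%06d" % i for i in range(nfiles)]
        for name in names:
            open(os.path.join(d, name), "w").close()
        sample = random.choices(names, k=nlookups)
        t0 = time.perf_counter()
        for name in sample:
            os.stat(os.path.join(d, name))
        return time.perf_counter() - t0

if __name__ == "__main__":
    # Compare a small and a large directory on the filesystem under test.
    for n in (100, 2000):
        print("%5d files: %.4f s" % (n, time_dir_lookups(n)))
```

Note that with a warm dentry cache the kernel largely bypasses the on-disk directory format, so for meaningful numbers the caches should be dropped (e.g. by writing to /proc/sys/vm/drop_caches) between the create and lookup phases, and the directory must be large enough that its index actually matters.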
<p> Andrew Morton wrote some beautiful readahead code in the VM; many thanks to him for what it contributes to V4 performance. Unfortunately, it must be confessed that these benchmarks utterly fail to measure its cleverness for real-world usage patterns. In fact, these benchmarks basically access everything once in each pass, which is not at all realistic in representing typical server workloads. So understand them as validly illuminating some aspects of performance, not all aspects, if you could be so generous. <p> We ran data-ordered ext3 benchmarks at the suggestion of Andrew Morton, but they came out slower for this benchmark. We need to increase the base size range to 8k and run again. <p> V4 is a fully atomic filesystem; keep in mind that these performance numbers are with every FS operation performed as a fully atomic transaction. We are the first to make that performance-effective. Look for a user-space transactions interface to come out soon.... <p> Finally, remember that reiser4 is more space-efficient than V3; the df measurements are there for looking at....;-) <hr> <ul> <li><font color=red>linux-2.6.15-mm4</font> : mongo <a href="#mongo.2.6.15-mm4"> comparison</a> <tt>ext3 vs reiser4 with "unixfile" regular file plugin and reiser4 with "cryptcompress" regular file plugin</tt> </li> <li>linux-2.6.11 : mongo <a href="#mongo.2.6.11"> comparison</a> against <tt>xfs and ext2</tt> </li> <li>linux-2.6.8.1-mm3 : mongo <a href="#mongo.2.6.8.1-mm3"> comparison</a> against <tt>ext3</tt> </li> <li>2004.03.26 slow.c <a href="#slow.2004.03.26">comparison</a> against <tt>ext2, ext3</tt> </li> <li>2003.11.20 mongo <a href="#mongo.2003.11.20">comparison</a> against <tt>ext3</tt> </li> <li>Bonnie++ <a href="#bonnie++.2003.09.30">comparison</a> of <tt>reiser4</tt> and <tt>ext3</tt> done at 2003.09.30.
</li> <li>2003.09.25 mongo <a href="#mongo.2003.09.25">comparison</a> against <tt>ext3</tt> </li> <!-- <li>2003.08.28 mongo <a href="#mongo.2003.08.28">comparison</a> against <tt>ext3</tt> </li> <li>2003.08.27 mongo <a href="#mongo.2003.08.27">comparison</a> against <tt>ext3</tt> </li> <li>2003.08.26 mongo <a href="#mongo.2003.08.26">comparison</a> against <tt>ext3</tt> </li> <li>2003.08.18 mongo <a href="#mongo.2003.08.18">comparison</a> against <tt>ext3</tt> </li> <li>2003.08.12 mongo <a href="#mongo.2003.08.12">comparison</a> against <tt>ext3</tt> </li> --> <li>Older mongo <a href="#mongo.2003.08.28">results</a> (2003.08.28).</li> <li>mongo <a href="#mongo.2003.07.10">results</a> obtained before LinuxTAG (2003.07.10). Here reiser4 is compared with reiserfs.</li> <li>External benchmarks <a href="#grant">by Grant Miner</a>.</li> </ul> <hr> <a name="mongo.2.6.15-mm4"></a> linux-2.6.15-mm4 <a href="benchmarks/mongo_readme.html">mongo</a> results <p><b>Comparative results of mongo benchmark for ext3 vs reiser4 with "unixfile" regular file plugin vs reiser4 with "cryptcompress" regular file plugin</b> <p> <p>The cryptcompress patch against 2.6.15-mm4 and new version of reiser4progs are from <br> ftp://ftp.namesys.com/pub/tmp/cryptcompress_patches </p> <dl> <dt>reiser4 </dt> <dd>2.6.15-mm4 cryptcompress-4.patch</dd> <dt>mem total</dt> <dd>516312</dd> <dt>machine </dt> <dd>Intel(R) Xeon(TM) CPU 2.40GHz, <b>running UP kernel</b></dd> <dt>kernel </dt> <dd>2.6.15-mm4 #1 Sat Feb 11 20:00:11 MSK 2006</dd> <dt>date </dt> <dd>Sat Feb 11 21:03:21 2006</dd> <dd>Sat Feb 11 21:18:43 2006</dd> <dd>Sat Feb 11 21:37:52 2006</dd> </dl> <p>Legend:</p> <ul> <li><tt>A</tt> reiser4 with "cryptcompress" regular file plugin</li> <li><tt>B</tt> reiser4 with "unixfile" regular file plugin</li> <li><tt>C</tt> ext3</li> </ul> <p> Table presents absolute values (of elapsed time, CPU usage, CPU utilization, disk usage) for reiser4 with "cryptcompress" regular file plugin, and ratios against this 
reiser4 for reiser4 with "unixfile" regular file plugin and ext3. <font color=red>Red</font> number means ratio is larger than <tt>1.0</tt>, that is, reiser4 with "cryptcompress" regular file plugin is better in this test. <font color=green>Green</font> number means that it loses in this test. </p> <table cols=13 cellpadding=2 cellspacing=2 noborder> <tr><td bgcolor=black colspan=13><font color=white></td></tr> <tr> <th bgcolor=#303030 colspan=13 align=left><font color=white>A.MKFS=mkfs.reiser4 -y -o create=create_ccreg40,compressMode=col8 MOUNT_OPTIONS=noatime FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=13 align=left><font color=white>B.MKFS=mkfs.reiser4 -y MOUNT_OPTIONS=noatime FSTYPE=reiser4 (unixfile regular file plugin)</font></th> </tr> <tr> <th bgcolor=#303030 colspan=13 align=left><font color=white>C.MOUNT_OPTIONS=noatime,data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <td colspan=13 bgcolor=#606060><b><font color=white>#0:</font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=3><b>REAL_TIME</b></td> <td colspan=3><b>CPU_TIME</b></td> <td colspan=3><b>CPU_UTIL</b></td> <td colspan=3><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>CREATE</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 53.36</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.234 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 4.249 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>28.79</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.493</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 
align=right><tt>94.36</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.255 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.155</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 775856</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.550 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.825 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>COPY</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 137.6</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.543 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.931 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>40.91</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.716</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.975 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>59.94</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.257 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.183</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1551756</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.550 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.825 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>READ</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 161.17</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.087 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.077 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>48.35</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.433 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.195</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>33.23</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font 
color=green> 0.487 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.291</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1551756</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.550 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.825 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>STATS</b></td> <td bgcolor=#E0E0C0 align=right><tt>24.12</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.936</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.927</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>6.76</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.941 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.624</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>27.97</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.005 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.676</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1551756</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.550 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.825 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>DELETE</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 155.26</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.091 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 0.989</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>38.76</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.824 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.108</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>26.33</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.758 </font></tt></td> <td bgcolor=#E0E0C0 
align=right><tt><font color=green> <U> 0.104</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>4</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.000 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> </tt></td> </tr> <tr> <td colspan=13 bgcolor=#606060><b><font color=white>#1:DD_MBCOUNT=5000 </font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=3><b>REAL_TIME</b></td> <td colspan=3><b>CPU_TIME</b></td> <td colspan=3><b>CPU_UTIL</b></td> <td colspan=3><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_writing_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 116.02</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.430 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.553 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>38.65</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.514</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.619 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>92.86</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.155 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.149</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1909012</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.682 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.685 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_reading_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 153.76</U></tt></td> <td bgcolor=#E0E0C0 
align=right><tt><font color=black> <U> 0.996</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>58.11</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.192 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.147</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>38.73</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.224 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.152</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1909012</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.682 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.685 </font></tt></td> </tt></td> </tr> <tr><td bgcolor=black colspan=13><font color=white></td></tr> <tr><td colspan=13 align=right> <tr> <td colspan=13 bgcolor=#303030><b><font color=white>DIR=/mnt1 GAMMA=0.2 WRITE_BUFFER=131072 PHASE_APPEND=off SYNC=off PHASE_DELETE=rm NPROC=1 DEV=/dev/hda9 DD_MBCOUNT=5000 FILE_SIZE=8192 REP_COUNTER=1 PHASE_COPY=cp INFO_R4=2.6.15-mm4 cryptcompress-4.patch PHASE_READ=find BYTES=1024000000 PHASE_OVERWRITE=off PHASE_MODIFY=off </td></tr> <tr><td colspan=13 align=right> <font size=-2>Produced by <a href=http://namesys.com/benchmarks/mongo_readme.html>Mongo</a> benchmark suite.</font></td></tr> </table> <!-- <p><b>Legend:</b> <font color="green">green</font> color means the result is better (less) than reference value from the first column, results marked as <font color="red">red</font> are worse than reference value, best results are <u>underlined</u> other results which fit into 2% margin of the best result are underlined also.</p> --><p><a href="http://www.namesys.com/intbenchmarks/mongo/06.02.11.belka.crc/charts/comp.html">The same results in the charts</a></p> <hr> <a name="mongo.2.6.11"></a> linux-2.6.11 <a 
href="benchmarks/mongo_readme.html">mongo</a> results <dl> <dt>reiser4 </dt> <dd>reiser4-for-2.6.11-5.patch from <a href="ftp://ftp.namesys.com/pub/reiser4-for-2.6/2.6.11">ftp://ftp.namesys.com/pub/reiser4-for-2.6/2.6.11</a> </dd> <dt>mem total</dt> <dd>254496</dd> <dt>machine </dt> <dd>bones</dd> <dt>kernel </dt> <dd>2.6.11-reiser4-5 #2 SMP Sat Jun 4 20:06:47 MSD 2005</dd> <dt>date </dt> <dd>Fri Jun 17 23:52:17 2005</dd> </dl> <p> In this test 81% of files are chosen from the 0-10k size range and 19% from the 10-100k size range. </p> <!-- File stats: Units are decimal (1k = 1000) files 0-100 : 1433 files 100-1K : 12597 files 1K-10K : 103101 files 10K-100K : 28131 files 100K-1M : 0 files 1M-10M : 0 files 10M-larger : 0 total bytes written : 1886585039 --> <p>Legend:</p> <ul> <li><tt>A</tt> reiser4</li> <li><tt>B</tt> reiserfs <tt>v3 (notail)</tt></li> <li><tt>C</tt> ext2</li> <li><tt>D</tt> xfs default</li> </ul> <p> The table presents absolute values (of elapsed time, CPU usage, CPU utilization, and disk usage) for reiser4, and ratios against reiser4 for all other configurations. A <font color=red>red</font> number means the ratio is larger than <tt>1.0</tt>, that is, reiser4 is better in this test; a <font color=green>green</font> number means that reiser4 loses in this test.
</p> <table cols=17 cellpadding=2 cellspacing=2 noborder> <tr><td bgcolor=black colspan=17><font color=white></td></tr> <tr> <th bgcolor=#303030 colspan=17 align=left><font color=white>A.FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=17 align=left><font color=white>B.FSTYPE=reiserfs MOUNT_OPTIONS=notail </font></th> </tr> <tr> <th bgcolor=#303030 colspan=17 align=left><font color=white>C.FSTYPE=ext2 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=17 align=left><font color=white>D.MKFS=mkfs.xfs -f FSTYPE=xfs </font></th> </tr> <tr> <td colspan=17 bgcolor=#606060><b><font color=white>#0:</font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=4><b>REAL_TIME</b></td> <td colspan=4><b>CPU_TIME</b></td> <td colspan=4><b>CPU_UTIL</b></td> <td colspan=4><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>CREATE</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 66.12</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.022 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.686 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 4.288 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>34.98</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.901</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.114 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.445 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>29.86</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.424 </font></tt></td> <td bgcolor=#E0E0C0 
align=right><tt><font color=green> <U> 0.398</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.398</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1623204</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.086 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.098 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>COPY</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 187.77</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.438 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.751 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.733 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>44.8</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.883</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.124 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.161 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>14.85</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.606 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.611 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.353</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 3245428</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.087 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.098 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>READ</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 151.01</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.459 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.113 </font></tt></td> 
<td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.978 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>44.34</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.607 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.470</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.535 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>18.54</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.444</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.500 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.724 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 3245428</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.087 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.098 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>STATS</b></td> <td bgcolor=#E0E0C0 align=right><tt>22.04</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.314 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.812</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.871 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>8.61</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.698 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.571</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 4.591 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>20.11</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.528</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.709 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.579 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 
align=right><tt><U> 3245428</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.087 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.098 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>DELETE</b></td> <td bgcolor=#E0E0C0 align=right><tt>108.77</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.313</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.193 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.071 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>41</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.637 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.091</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.795 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>21.45</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.795 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.077</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.556 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>4</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 14877.000 </font></tt></td> </tt></td> </tr> <tr> <td colspan=17 bgcolor=#606060><b><font color=white>#1:DD_MBCOUNT=5000 </font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=4><b>REAL_TIME</b></td> <td colspan=4><b>CPU_TIME</b></td> <td colspan=4><b>CPU_UTIL</b></td> <td colspan=4><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td> 
<td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_writing_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 536.06</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.005 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.017 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 0.982</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>122.28</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.826 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.819</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.806</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>14.99</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.771 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.711</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.742 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 5120008</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.012</U> </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_reading_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt>145.32</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.031 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.965</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 0.982</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 
align=right><tt>157.51</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.947 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.890</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.880</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>57.01</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.901</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.909 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.884</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 5120008</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.012</U> </font></tt></td> </tt></td> </tr> <tr><td bgcolor=black colspan=17><font color=white></td></tr> <tr><td colspan=17 align=right> <tr> <td colspan=17 bgcolor=#303030><b><font color=white>INFO_R4=2.6.11 + reiser4-5 REP_COUNTER=1 DEV=/dev/hda5 DD_MBCOUNT=5000 PHASE_OVERWRITE=off FILE_SIZE=8192 NPROC=3 PHASE_READ=find PHASE_DELETE=rm PHASE_APPEND=off WRITE_BUFFER=131072 DIR=/mnt1 PHASE_MODIFY=off BYTES=1024000000 PHASE_COPY=cp GAMMA=0.2 SYNC=off </td></tr> <tr><td colspan=17 align=right> <font size=-2>Produced by <a href=http://namesys.com/benchmarks/mongo_readme.html>Mongo</a> benchmark suite.</font></td></tr> </table> <hr> <a name="mongo.2.6.8.1-mm3"></a> linux-2.6.8.1-mm3 <a href="benchmarks/mongo_readme.html">mongo</a> results <dl> <dt>reiser4 </dt> <dd>large key</dd> <dt>mem total</dt> <dd>254324</dd> <dt>machine </dt> <dd>bones</dd> <dt>kernel </dt> <dd>2.6.8.1-mm3 #3 SMP Mon Aug 23 19:33:13 MSD 2004</dd> <dt>date </dt> <dd>Tue Aug 31 15:47:51 2004</dd> </dl> <p> In this test 81% of files are chosen from the 0-10k size range and 19% from the 10-100k size range. 
</p> <!-- File stats: Units are decimal (1k = 1000) files 0-100 : 1433 files 100-1K : 12597 files 1K-10K : 103101 files 10K-100K : 28131 files 100K-1M : 0 files 1M-10M : 0 files 10M-larger : 0 total bytes written : 1886585039 --> <p>Legend:</p> <ul> <li><tt>A</tt> reiser4</li> <li><tt>B</tt> reiser4, extents only</li> <li><tt>C</tt> reiserfs <tt>v3 (notail)</tt></li> <li><tt>D</tt> ext3 in <tt>data=writeback</tt> mode (meta-data only journalling)</li> <li><tt>E</tt> ext3 in <tt>data=journal</tt> mode</li> <li><tt>F</tt> ext3 in <tt>data=ordered</tt> mode</li> </ul> <img src="http://www.namesys.com/intbenchmarks/mongo/04.08.26/256MB.RAM/one-thread-8k.g02.charts/CREATE.0.png"> <img src="http://www.namesys.com/intbenchmarks/mongo/04.08.26/256MB.RAM/one-thread-8k.g02.charts/COPY.0.png"> <img src="http://www.namesys.com/intbenchmarks/mongo/04.08.26/256MB.RAM/one-thread-8k.g02.charts/READ.0.png"> <img src="http://www.namesys.com/intbenchmarks/mongo/04.08.26/256MB.RAM/one-thread-8k.g02.charts/STATS.0.png"> <img src="http://www.namesys.com/intbenchmarks/mongo/04.08.26/256MB.RAM/one-thread-8k.g02.charts/DELETE.0.png"> <img src="http://www.namesys.com/intbenchmarks/mongo/04.08.26/256MB.RAM/one-thread-8k.g02.charts/dd_writing_largefile.1.png"> <img src="http://www.namesys.com/intbenchmarks/mongo/04.08.26/256MB.RAM/one-thread-8k.g02.charts/dd_reading_largefile.1.png"> <p> The table gives absolute values (elapsed time, CPU usage, CPU utilization, and disk usage) for reiser4, and for every other configuration the ratio against reiser4. A <font color=red>red</font> number means the ratio is larger than <tt>1.0</tt>, i.e. reiser4 is better in that test; a <font color=green>green</font> number means reiser4 loses.
</p> <table cols=25 cellpadding=2 cellspacing=2 noborder> <tr><td bgcolor=black colspan=25><font color=white></td></tr> <tr> <th bgcolor=#303030 colspan=25 align=left><font color=white>A.FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=25 align=left><font color=white>B.FSTYPE=reiser4 MKFS=mkfs.reiser4 -q -o extent=extent40 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=25 align=left><font color=white>C.MOUNT_OPTIONS=notail FSTYPE=reiserfs </font></th> </tr> <tr> <th bgcolor=#303030 colspan=25 align=left><font color=white>D.MOUNT_OPTIONS="data=writeback" FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=25 align=left><font color=white>E.MOUNT_OPTIONS="data=journal" FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=25 align=left><font color=white>F.MOUNT_OPTIONS="data=ordered" FSTYPE=ext3 </font></th> </tr> <tr> <td colspan=25 bgcolor=#606060><b><font color=white>#0:</font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=6><b>REAL_TIME</b></td> <td colspan=6><b>CPU_TIME</b></td> <td colspan=6><b>CPU_UTIL</b></td> <td colspan=6><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>CREATE</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 91.6</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 0.988</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.983 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.592 
</font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.010 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.256 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>31.13</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.965 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.826</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.577 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.529 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.802 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>22.63</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.981 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.350</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.791 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.738 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.000 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1978440</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.088 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>COPY</b></td> <td bgcolor=#E0E0C0 align=right><tt>219.5</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.968</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.674 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.241 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.105 </font></tt></td> <td 
bgcolor=#E0E0C0 align=right><tt><font color=red> 1.819 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>54.04</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.938 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.792</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.694 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.004 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.860 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>16.01</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.996 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.460</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.663 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.839 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.890 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 3956708</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.088 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>READ</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 187.34</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.007</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.617 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.282 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.295 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.250 </font></tt></td> </tt></td> <td 
bgcolor=#E0E0C0 align=right><tt>38.61</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.002 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.711 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.615</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.622</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.615</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>13.05</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.995 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.441</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.520 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.517 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.533 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 3956708</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.088 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>STATS</b></td> <td bgcolor=#E0E0C0 align=right><tt>23.71</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.968 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.162 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.943</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.943</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.943</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>10.91</tt></td> <td 
bgcolor=#E0E0C0 align=right><tt><font color=green> 0.944 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.717 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.661</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.674 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.658</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>24.46</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.971 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.587</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.700 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.707 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.697 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 3956708</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.088 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>DELETE</b></td> <td bgcolor=#E0E0C0 align=right><tt>156.84</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.993 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.233</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.264 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.270 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.216 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>53.05</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.938 </font></tt></td> <td 
bgcolor=#E0E0C0 align=right><tt><font color=green> 0.440 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.209</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.215 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.214 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>18.23</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.947 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.758 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.157</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.160 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.167 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>4</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.000 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> </tt></td> </tr> <tr> <td colspan=25 bgcolor=#606060><b><font color=white>#1:DD_MBCOUNT=768 </font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=6><b>REAL_TIME</b></td> <td colspan=6><b>CPU_TIME</b></td> <td colspan=6><b>CPU_UTIL</b></td> <td colspan=6><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A 
</b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_writing_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 30.09</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.006</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.286 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.342 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.473 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.311 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>5.24</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.996 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.966</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.286 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.393 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.437 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>11.43</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.994 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.631</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.796 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.655 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.967 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 786436</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font 
color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_reading_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt>28.38</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.969</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.010 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 0.980</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 0.982</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.999 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>4.37</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.979 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.014 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.911</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.895</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.936 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>8.88</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.030 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.922 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.858</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.854</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.867</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 786436</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 
align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr><td bgcolor=black colspan=25><font color=white></td></tr> <tr><td colspan=25 align=right> <tr> <td colspan=25 bgcolor=#303030><b><font color=white>REP_COUNTER=1 PHASE_COPY=cp INFO_R4=2.6.8.1-mm3 + parse_options.patch FILE_SIZE=8192 DEV=/dev/hda6 PHASE_MODIFY=off DD_MBCOUNT=768 PHASE_APPEND=off PHASE_OVERWRITE=off SYNC=off DIR=/mnt1 PHASE_DELETE=rm NPROC=1 BYTES=1024000000 GAMMA=0.2 PHASE_READ=find WRITE_BUFFER=131072 </td></tr> <tr><td colspan=25 align=right> <font size=-2>Produced by <a href=http://namesys.com/>Mongo</a> benchmark suite.</font></td></tr> </table> <hr> <a name="slow.2004.03.26">2004.03.26 slow.c benchmark results</a> <p> These are <a href="http://www.jburgess.uklinux.net/slow.c">slow.c</a> benchmark results for the 2004.03.26 reiser4 snapshot. </p> <p> <b>slow.c</b> is a simple program by Jon Burgess which writes and reads multiple data streams. For the details and the source code, see <a href="http://marc.theaimsgroup.com/?l=linux-kernel&m=107652683608384&w=2">the discussion</a> in the linux-kernel mailing list.
</p> <p> kernel : 2.6.5-rc2</p> <p> RAM : 256Mb</p> <p> reiser4 : <a href="http://www.namesys.com/snapshots/2004.03.26/">2004.03.26 snapshot</a></p> <p>Hardware specs:</p> <pre> processor : 1 vendor_id : AuthenticAMD cpu family : 6 model : 6 model name : AMD Athlon(tm) Processor stepping : 2 cpu MHz : 1460.098 cache size : 256 KB bogomips : 2916.35 Dual CPU AMD Athlon(tm) 1.4Ghz </pre> <pre> # hdparm /dev/hda6: multcount = 16 (on) IO_support = 1 (32-bit) unmaskirq = 1 (on) using_dma = 1 (on) keepsettings = 0 (off) readonly = 0 (off) readahead = 256 (on) geometry = 65535/16/63, sectors = 35937342, start = 84164598 </pre> <pre> # hdparm -t /dev/hda6 /dev/hda6: Timing buffered disk reads: 84 MB in 3.07 seconds = 27.39 MB/sec </pre> <pre> # hdparm -i /dev/hda /dev/hda: Model=IC35L060AVER07-0, FwRev=ER6OA44A, SerialNo=SZPTZMB6154 Config={ HardSect NotMFM HdSw>15uSec Fixed DTR>10Mbs } RawCHS=16383/16/63, TrkSize=0, SectSize=0, ECCbytes=40 BuffType=DualPortCache, BuffSize=1916kB, MaxMultSect=16, MultSect=16 CurCHS=16383/16/63, CurSects=16514064, LBA=yes, LBAsects=120103200 IORDY=on/off, tPIO={min:240,w/IORDY:120}, tDMA={min:120,rec:120} PIO modes: pio0 pio1 pio2 pio3 pio4 DMA modes: mdma0 mdma1 mdma2 UDMA modes: udma0 udma1 udma2 AdvancedPM=yes: disabled (255) WriteCache=enabled Drive conforms to: ATA/ATAPI-5 T13 1321D revision 1: * signifies the current active mode </pre> <pre> <!-- (500Mb of data) test : ./slow foo 500 Results : ============================================================== | 1 stream | 2 streams --------------+----------------------------------------------- | WRITE READ | WRITE READ --------------+----------------------------------------------- ext2 25.08Mb/s 27.08Mb/s 13.72Mb/s 14.04Mb/s reiser4 26.31Mb/s 26.99Mb/s 24.03Mb/s 26.84Mb/s reiser4-extents 25.28Mb/s 27.40Mb/s 24.12Mb/s 26.85Mb/s ext3-ordered 20.99Mb/s 26.40Mb/s 12.01Mb/s 13.34Mb/s ext3-journal 10.13Mb/s 24.48Mb/s 8.87Mb/s 13.26Mb/s reiserfs 20.42Mb/s 27.67Mb/s 12.98Mb/s 13.13Mb/s 
reiserfs-notail 20.07Mb/s 27.58Mb/s 13.04Mb/s 13.25Mb/s ============================================================== --> (1000Mb of data) test : ./slow foo 1000 Results : <!-- ============================================================================================================== | 1 stream | 2 streams | 4 streams | 8 stream --------------+----------------------------------------------------------------------------------------------- | WRITE READ | WRITE READ | WRITE READ | WRITE READ --------------+----------------------------------------------------------------------------------------------- ext2 24.66Mb/s 27.56Mb/s 13.40Mb/s 13.67Mb/s 7.73Mb/s 6.94Mb/s 6.69Mb/s 3.52Mb/s reiser4 25.42Mb/s 27.71Mb/s 23.96Mb/s 26.34Mb/s 24.55Mb/s 26.58Mb/s 24.90Mb/s 26.76Mb/s reiser4-extents 25.60Mb/s 27.68Mb/s 24.19Mb/s 25.92Mb/s 25.24Mb/s 27.12Mb/s 25.39Mb/s 26.72Mb/s ext3-ordered 20.05Mb/s 26.46Mb/s 11.06Mb/s 13.12Mb/s 9.63Mb/s 6.76Mb/s 10.02Mb/s 3.48Mb/s ext3-journal 10.10Mb/s 26.81Mb/s 8.87Mb/s 13.08Mb/s 8.59Mb/s 6.84Mb/s 8.14Mb/s 3.47Mb/s reiserfs 20.19Mb/s 27.48Mb/s 12.69Mb/s 13.03Mb/s 8.27Mb/s 6.84Mb/s 7.87Mb/s 4.13Mb/s reiserfs-notail 20.31Mb/s 27.10Mb/s 12.74Mb/s 13.09Mb/s 8.33Mb/s 6.89Mb/s 7.87Mb/s 4.17Mb/s ============================================================================================================= --> </pre> <table> <tr> <td><img src="intbenchmarks/slow/04.03.25-int.snapshot.bones/wr.1.png"></td> <td><img src="intbenchmarks/slow/04.03.25-int.snapshot.bones/wr.2.png"></td> <td><img src="intbenchmarks/slow/04.03.25-int.snapshot.bones/wr.4.png"></td> <td><img src="intbenchmarks/slow/04.03.25-int.snapshot.bones/wr.8.png"></td> </tr> <tr> <td><img src="intbenchmarks/slow/04.03.25-int.snapshot.bones/rd.1.png"></td> <td><img src="intbenchmarks/slow/04.03.25-int.snapshot.bones/rd.2.png"></td> <td><img src="intbenchmarks/slow/04.03.25-int.snapshot.bones/rd.4.png"></td> <td><img src="intbenchmarks/slow/04.03.25-int.snapshot.bones/rd.8.png"></td> </tr> 
</table> <hr> <a name="mongo.2003.11.20"></a>2003.11.20 <a href="benchmarks/mongo_readme.html">mongo</a> results <dl> <dt>reiser4 </dt> <dd>''</dd> <dt>mem total</dt> <dd>255716</dd> <dt>machine </dt> <dd>belka</dd> <dt>kernel </dt> <dd>2.6.0-test9 #2 SMP Thu Nov 20 16:08:42 MSK 2003</dd> <dt>date </dt> <dd>Thu Nov 20 16:16:50 2003</dd> </dl> <p> In this test 80% of files are chosen from the 0-8k size range, 16% from the 0-80k size range, 3.2% (0.8 x 4%) from the 0-800k size range, and so on. Most files are small; most bytes are in large files. </p> <p>Legend:</p> <ul> <li><tt>A</tt> reiser4</li> <li><tt>B</tt> reiser4, extents only</li> <li><tt>C</tt> reiserfs <tt>v3</tt></li> <li><tt>D</tt> ext3 in <tt>data=writeback</tt> mode (meta-data only journalling)</li> <li><tt>E</tt> ext3 in <tt>data=journal</tt> mode</li> <li><tt>F</tt> ext3 in <tt>data=ordered</tt> mode</li> <li><tt>G</tt> ext3 with htree (hashed directories)</li> </ul> <p> The table gives absolute values (elapsed time, CPU usage, and disk usage) for reiser4, and for every other configuration the ratio against reiser4. A <font color=red>red</font> number means the ratio is larger than <tt>1.0</tt>, i.e. reiser4 is better in that test; a <font color=green>green</font> number means reiser4 loses.
</p> <table cols=22 cellpadding=2 cellspacing=2 noborder> <tr><td bgcolor=black colspan=22><font color=white></td></tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>A.INFO_R4='' FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>B.INFO_R4='' MKFS=mkfs.reiser4 -q -o policy=extents FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>C.FSTYPE=reiserfs </font></th> </tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>D.MOUNT_OPTIONS=data=writeback FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>E.MOUNT_OPTIONS=data=journal FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>F.MOUNT_OPTIONS=data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>G.MKFS=mkfs.ext3 -O dir_index MOUNT_OPTIONS=data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <td colspan=22 bgcolor=#606060><b><font color=white>#0:</font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=7><b>REAL_TIME</b></td> <td colspan=7><b>CPU_TIME</b></td> <td colspan=7><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>CREATE</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 21.81</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.171 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.983 </font></tt></td> <td bgcolor=#E0E0C0 
align=right><tt><font color=red> 3.253 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.702 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.161 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.212 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>6.38</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.130 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.020 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.461 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.461 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.354 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.851</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 607612</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.091 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.035 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>COPY</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 64.37</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.089 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.046 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.980 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.834 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.929 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 6.246 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>11.55</tt></td> 
<td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.047 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.797 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.590 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.725 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.542 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.698</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1214992</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.091 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.034 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>READ</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 45.38</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.026 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.406 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.248 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.307 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.232 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 7.192 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>10.13</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.934 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.517 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.454 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.453</U> </font></tt></td> <td bgcolor=#E0E0C0 
align=right><tt><font color=green> <U> 0.444</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.504 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1214992</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.091 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.034 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>STATS</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 5.74</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.030 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.413 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.014</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.033 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.021 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.634 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>2.34</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.000 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.936 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.761 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.791 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.774 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.744</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1214992</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.091 </font></tt></td> <td bgcolor=#E0E0C0 
align=right><tt><font color=red> 1.034 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>DELETE</b></td> <td bgcolor=#E0E0C0 align=right><tt>46.94</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.424</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.520 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.017 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.043 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.956 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.315 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>14.19</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.743 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.443 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.200</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.206 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.201</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.234 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>4</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.000 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td 
bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> </tt></td> </tr> <tr> <td colspan=22 bgcolor=#606060><b><font color=white>#1:DD_MBCOUNT=768 </font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=7><b>REAL_TIME</b></td> <td colspan=7><b>CPU_TIME</b></td> <td colspan=7><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_writing_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 29.33</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.026 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.184 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.102 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.499 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.097 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.098 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>2.61</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.008 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.659</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.437 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.054 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.556 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.571 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 786436</U></tt></td> 
<td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_reading_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 22.96</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.056 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.003</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.004</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.003</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.006</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>2.26</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.991 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.912 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.796 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.765</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.779</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.783 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 786436</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 
align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr><td bgcolor=black colspan=22><font color=white></td></tr> <tr><td colspan=22 align=right> <tr> <td colspan=22 bgcolor=#303030><b><font color=white>NPROC=1 DIR=/mnt/testfs SYNC=off PHASE_COPY=cp REP_COUNTER=1 GAMMA=0.2 PHASE_OVERWRITE=off FILE_SIZE=8192 BYTES=512000000 PHASE_APPEND=off PHASE_READ=find DEV=/dev/hdb3 DD_MBCOUNT=768 WRITE_BUFFER=131072 PHASE_DELETE=rm PHASE_MODIFY=off </td></tr> <tr><td colspan=22 align=right> <font size=-2>Produced by <a href=http://namesys.com/benchmarks/mongo_readme.html>Mongo</a> benchmark suite.</font></td></tr> </table> <hr> <a name="mongo.2003.09.25"></a>2003.09.25 <a href="benchmarks/mongo_readme.html">mongo</a> results <dl> <dt>reiser4 </dt> <dd>''</dd> <dt>mem total</dt> <dd>255048</dd> <dt>machine </dt> <dd>belka</dd> <dt>kernel </dt> <dd>2.6.0-test5 #33 SMP Thu Sep 25 15:45:38 MSD 2003</dd> <dt>date </dt> <dd>Thu Sep 25 15:57:38 2003</dd> </dl> <p> In this test 80% of files are chosen from the 0-8k size range, 16% from the 0-80k size range, 0.8 x 4% from the 0-800k size range, etc. Most files are small, most bytes are in large files. </p> <p>Legend:</p> <ul> <li><tt>A</tt> reiser4</li> <li><tt>B</tt> reiser4, extents only</li> <li><tt>C</tt> reiserfs <tt>v3</tt></li> <li><tt>D</tt> ext3 in <tt>data=writeback</tt> mode (meta-data only journalling)</li> <li><tt>E</tt> ext3 in <tt>data=journal</tt> mode</li> <li><tt>F</tt> ext3 in <tt>data=ordered</tt> mode</li> <li><tt>G</tt> ext3 with htree (hashed directories)</li> </ul> <p> Table presents absolute values (of elapsed time, CPU usage, and disk usage) for reiser4, and ratios against reiser4 for all other configurations. 
A <font color=red>red</font> number means the ratio is larger than <tt>1.0</tt>, i.e. reiser4 is better in this test; a <font color=green>green</font> number means reiser4 loses in this test. </p> <table cols=22 cellpadding=2 cellspacing=2 noborder> <tr><td bgcolor=black colspan=22><font color=white></font></td></tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>A.INFO_R4='' FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>B.INFO_R4='' MKFS=mkfs.reiser4 -q -o policy=extents FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>C.FSTYPE=reiserfs </font></th> </tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>D.MOUNT_OPTIONS=data=writeback FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>E.MOUNT_OPTIONS=data=journal FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>F.MOUNT_OPTIONS=data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>G.MKFS=mkfs.ext3 -O dir_index MOUNT_OPTIONS=data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <td colspan=22 bgcolor=#606060><b><font color=white>#0:</font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=7><b>REAL_TIME</b></td> <td colspan=7><b>CPU_TIME</b></td> <td colspan=7><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>CREATE</b></td> <td bgcolor=#E0E0C0 align=right><tt><U>
23.57</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.158 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.714 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.263 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.234 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.020 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.376 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>6.66</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.075 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.947 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.240 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.357 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.264 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.835</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 608548</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.090 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.034 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.105 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.105 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.105 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>COPY</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 64.98</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.083 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.050 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.023 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.810 </font></tt></td> <td bgcolor=#E0E0C0 
align=right><tt><font color=red> 1.908 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 6.850 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>12.18</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.057 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.776 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.507 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.603 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.518 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.743</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1216784</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.090 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.033 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.105 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.105 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.105 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>READ</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 44.65</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.028 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.733 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.237 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.114 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.179 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 7.694 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>10.28</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.933 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.590</U> 
</font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.608 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.593</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.608 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.620 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1216784</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.090 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.033 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.105 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.105 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.105 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>STATS</b></td> <td bgcolor=#E0E0C0 align=right><tt>5.88</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.998 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.139 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.981 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.020 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.929</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.655 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>2.29</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.987 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.900 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.747</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.782 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.747</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 
0.755</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1216784</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.090 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.033 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.105 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.105 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.105 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>DELETE</b></td> <td bgcolor=#E0E0C0 align=right><tt>46.65</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.438</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.504 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.109 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.023 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.022 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.376 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>14.19</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.746 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.431 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.206</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.211 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.211 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.232 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>4</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.000 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> 
</font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> </tt></td> </tr> <tr> <td colspan=22 bgcolor=#606060><b><font color=white>#1:DD_MBCOUNT=768 </font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=7><b>REAL_TIME</b></td> <td colspan=7><b>CPU_TIME</b></td> <td colspan=7><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_writing_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 30.78</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.017</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.177 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.063 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.394 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.066 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.056 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>3.11</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.981 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.553</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.701 </font></tt></td> <td bgcolor=#E0E0C0 
align=right><tt><font color=red> 1.296 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.318 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 786436</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_reading_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 22.96</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.045 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.005</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.005</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.004</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.006</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>2.41</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.996 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.867 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.739 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.718</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.739 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.722</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 786436</U></tt></td> 
<td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tr> <tr><td bgcolor=black colspan=22><font color=white></font></td></tr> <tr> <td colspan=22 bgcolor=#303030><b><font color=white>NPROC=1 DIR=/mnt/testfs SYNC=off PHASE_COPY=cp REP_COUNTER=1 GAMMA=0.2 PHASE_OVERWRITE=off FILE_SIZE=8192 BYTES=512000000 PHASE_APPEND=off PHASE_READ=find DEV=/dev/hdb3 DD_MBCOUNT=768 WRITE_BUFFER=131072 PHASE_DELETE=rm PHASE_MODIFY=off </font></b></td></tr> <tr><td colspan=22 align=right> <font size=-2>Produced by <a href=http://namesys.com/benchmarks/mongo_readme.html>Mongo</a> benchmark suite.</font></td></tr> </table> <hr> <a name="mongo.2003.08.28"></a>2003.08.28 <a href="benchmarks/mongo_readme.html">mongo</a> results <dl> <dt>reiser4 </dt> <dd>''</dd> <dt>mem total</dt> <dd>256276</dd> <dt>machine </dt> <dd>belka</dd> <dt>kernel </dt> <dd>2.6.0-test4 #194 SMP Thu Aug 28 17:18:47 MSD 2003</dd> <dt>date </dt> <dd>Thu Aug 28 17:20:18 2003</dd> </dl> <p> In this test 80% of files are chosen from the 0-8k size range, 16% from the 0-80k size range, 3.2% from the 0-800k size range, etc. Most files are small, most bytes are in large files.
</p> <p>Legend:</p> <ul> <li><tt>A</tt> reiser4</li> <li><tt>B</tt> reiser4, extents only</li> <li><tt>C</tt> reiserfs <tt>v3</tt></li> <li><tt>D</tt> ext3 in <tt>data=writeback</tt> mode (meta-data only journalling)</li> <li><tt>E</tt> ext3 in <tt>data=journal</tt> mode</li> <li><tt>F</tt> ext3 in <tt>data=ordered</tt> mode</li> <li><tt>G</tt> ext3 with htree (hashed directories)</li> </ul> <p> The table presents absolute values (of elapsed time, CPU usage, and disk usage) for reiser4, and ratios against reiser4 for all other configurations. A <font color=red>red</font> number means the ratio is larger than <tt>1.0</tt>, i.e. reiser4 is better in this test; a <font color=green>green</font> number means reiser4 loses in this test. </p> <table cols=22 cellpadding=2 cellspacing=2 noborder> <tr><td bgcolor=black colspan=22><font color=white></font></td></tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>A.INFO_R4='' FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>B.INFO_R4='' MKFS=mkfs.reiser4 -q -o policy=extents FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>C.FSTYPE=reiserfs </font></th> </tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>D.MOUNT_OPTIONS=data=writeback FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>E.MOUNT_OPTIONS=data=journal FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>F.MOUNT_OPTIONS=data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=22 align=left><font color=white>G.MKFS=mkfs.ext3 -O dir_index MOUNT_OPTIONS=data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <td colspan=22 bgcolor=#606060><b><font color=white>#0:</font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=7><b>REAL_TIME</b></td> <td colspan=7><b>CPU_TIME</b></td> <td colspan=7><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> 
<td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>CREATE</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 21.94</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.056 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.957 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.049 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.430 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.399 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.558 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>6.7</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.104 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.913 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.213 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.334 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.345 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.821</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 608452</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.091 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.034 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.105 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.105 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.105 </font></tt></td> <td bgcolor=#E0E0C0 
align=right><tt><font color=red> 1.106 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>COPY</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 64.05</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.078 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.112 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.964 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.703 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.022 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 7.356 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>11.37</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.039 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.819 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.538 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.692 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.568 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.708</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1216572</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.091 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.033 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>READ</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 52.53</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.072 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.882 </font></tt></td> <td 
bgcolor=#E0E0C0 align=right><tt><font color=red> 1.056 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.126 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.124 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 7.158 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>9.8</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.914 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.538 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.489 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.467 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.456</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.551 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1216572</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.091 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.033 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>STATS</b></td> <td bgcolor=#E0E0C0 align=right><tt>5.82</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.973</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.251 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.040 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.009 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.048 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.641 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 
align=right><tt>2.29</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.991 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.926 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.755 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.742</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.751 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.734</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1216572</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.091 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.033 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>DELETE</b></td> <td bgcolor=#E0E0C0 align=right><tt>46.96</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.409</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.491 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.949 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.988 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.987 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.382 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>13.89</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.734 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.453 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.210 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font 
color=green> <U> 0.204</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.202</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.238 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>4</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.000 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> </tt></td> </tr> <tr> <td colspan=22 bgcolor=#606060><b><font color=white>#1:DD_MBCOUNT=768 </font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=7><b>REAL_TIME</b></td> <td colspan=7><b>CPU_TIME</b></td> <td colspan=7><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td><td><b>G/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_writing_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 26.1</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.006</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.205 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.066 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.353 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 
1.068 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.070 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>3.18</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.028 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.547</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.173 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.708 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.327 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.296 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 786436</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_reading_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 18.99</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.009</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.072 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.009</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.007</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.006</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.008</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>2.12</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.000 
</font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.925 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.877 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.844 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.830 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.811</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 786436</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr><td bgcolor=black colspan=22><font color=white></td></tr> <tr><td colspan=22 align=right> <tr> <td colspan=22 bgcolor=#303030><b><font color=white>NPROC=1 DIR=/mnt/testfs SYNC=off PHASE_COPY=cp REP_COUNTER=1 GAMMA=0.2 PHASE_OVERWRITE=off FILE_SIZE=8192 BYTES=512000000 PHASE_APPEND=off PHASE_READ=find DEV=/dev/hdb3 DD_MBCOUNT=768 WRITE_BUFFER=131072 PHASE_DELETE=rm PHASE_MODIFY=off </td></tr> <tr><td colspan=22 align=right> <font size=-2>Produced by <a href=http://namesys.com/benchmarks/mongo_readme.html>Mongo</a> benchmark suite.</font></td></tr> </table> <hr> <a name="mongo.2003.08.27"></a>2003.08.27 <a href="benchmarks/mongo_readme.html">mongo</a> results <dl> <dt>reiser4 </dt> <dd>''</dd> <dt>mem total</dt> <dd>256276</dd> <dt>machine </dt> <dd>belka</dd> <dt>kernel </dt> <dd>2.6.0-test4 #189 SMP Wed Aug 27 20:36:51 MSD 2003</dd> <dt>date </dt> <dd>Wed Aug 27 20:44:02 2003</dd> </dl> <p> In this test 80% of files are chosen from the 0-8k size 
range, 16% from the 0-80k size range, 0.8 x 4% from the 0-800k size range, etc. Most files are small, most bytes are in large files. </p> <p>Legend:</p> <ul> <li><tt>A</tt> reiser4</li> <li><tt>B</tt> reiser4, extents only</li> <li><tt>C</tt> ext3 in <tt>data=writeback</tt> mode (meta-data only journalling)</li> <li><tt>D</tt> ext3 in <tt>data=journal</tt> mode</li> <li><tt>E</tt> ext3 in <tt>data=ordered</tt> mode</li> <li><tt>F</tt> ext3 with htree (hashed directories)</li> </ul> <p> The table gives absolute values (elapsed time, CPU usage, and disk usage) for reiser4, and, for every other configuration, its ratio against reiser4. A <font color=red>red</font> number means the ratio is greater than <tt>1.0</tt>, i.e. reiser4 does better in that test; a <font color=green>green</font> number means reiser4 loses that test. </p> <table cols=19 cellpadding=2 cellspacing=2 noborder> <tr><td bgcolor=black colspan=19><font color=white></font></td></tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>A.INFO_R4='' FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>B.INFO_R4='' MKFS=mkfs.reiser4 -q -o policy=extents FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>C.MOUNT_OPTIONS=data=writeback FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>D.MOUNT_OPTIONS=data=journal FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>E.MOUNT_OPTIONS=data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>F.MKFS=mkfs.ext3 -O dir_index MOUNT_OPTIONS=data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <td colspan=19 bgcolor=#606060><b><font color=white>#0:</font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=6><b>REAL_TIME</b></td> <td colspan=6><b>CPU_TIME</b></td> <td colspan=6><b>DF</b></td> </tr> <tr align=center
bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>CREATE</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 22.41</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.673 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.325 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.975 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.213 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>7.66</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.069 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.347 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.415 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.410 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.708</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 635264</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.096 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.110 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.110 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.110 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.111 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>COPY</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 90.92</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.099 </font></tt></td> <td 
bgcolor=#E0E0C0 align=right><tt><font color=red> 1.471 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.221 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.470 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 4.989 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>12.14</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.068 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.066 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.241 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.094 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.668</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1269840</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.096 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.110 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.110 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.110 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.112 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>READ</b></td> <td bgcolor=#E0E0C0 align=right><tt>82.21</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.063 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.861 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.852 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.791</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 4.417 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>10.57</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.914 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.400</U> </font></tt></td> <td bgcolor=#E0E0C0 
align=right><tt><font color=green> 0.428 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.402</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.534 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1269840</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.096 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.110 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.110 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.110 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.112 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>STATS</b></td> <td bgcolor=#E0E0C0 align=right><tt>8.52</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.993 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.822</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.816</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.811</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.335 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>2.96</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.997 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.561</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.564</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.584 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.608 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1269840</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.096 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.110 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.110 </font></tt></td> <td 
bgcolor=#E0E0C0 align=right><tt><font color=red> 1.110 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.112 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>DELETE</b></td> <td bgcolor=#E0E0C0 align=right><tt>69.69</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.301</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.749 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.717 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.659 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.912 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>14.73</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.703 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.208</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.207</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.213 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.237 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>4</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.000 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> </tt></td> </tr> <tr> <td colspan=19 bgcolor=#606060><b><font color=white>#1:DD_MBCOUNT=768 </font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=6><b>REAL_TIME</b></td> <td colspan=6><b>CPU_TIME</b></td> <td colspan=6><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A 
</b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_writing_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 25.85</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.092 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.335 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.085 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.095 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 3.27</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 0.982</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.159 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.648 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.251 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.254 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 786436</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_reading_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 19</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 0.999</U> 
</font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.005</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.007</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.007</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.007</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>2.18</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.963 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.807 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.803</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.789</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.803</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 786436</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr><td bgcolor=black colspan=19><font color=white></td></tr> <tr><td colspan=19 align=right> <tr> <td colspan=19 bgcolor=#303030><b><font color=white>NPROC=1 DIR=/mnt/testfs SYNC=off PHASE_COPY=cp REP_COUNTER=1 GAMMA=0.2 PHASE_OVERWRITE=off FILE_SIZE=8000 BYTES=512000000 PHASE_APPEND=off PHASE_READ=find DEV=/dev/hdb3 DD_MBCOUNT=768 WRITE_BUFFER=131072 PHASE_DELETE=rm PHASE_MODIFY=off </td></tr> <tr><td colspan=19 align=right> <font size=-2>Produced by <a href=http://namesys.com/benchmarks/mongo_readme.html>Mongo</a> benchmark suite.</font></td></tr> </table> <hr> <p> This is the same test as above, but with base file 
size 4k, that is, in this test 80% of files are chosen from the 0-4k size range, 16% from the 0-40k size range, 0.8 x 4% from the 0-400k size range, etc. </p> <hr> <dl> <dt>reiser4 </dt> <dd>''</dd> <dt>mem total</dt> <dd>255580</dd> <dt>machine </dt> <dd>belka</dd> <dt>kernel </dt> <dd>2.6.0-test4 #176 SMP Tue Aug 26 19:09:38 MSD 2003</dd> <dt>date </dt> <dd>Wed Aug 27 12:41:54 2003</dd> </dl> <table cols=19 cellpadding=2 cellspacing=2 noborder> <tr><td bgcolor=black colspan=19><font color=white></td></tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>A.INFO_R4='' FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>B.INFO_R4='' MKFS=mkfs.reiser4 -q -o policy=extents FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>C.MOUNT_OPTIONS=data=writeback FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>D.MOUNT_OPTIONS=data=journal FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>E.MOUNT_OPTIONS=data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>F.MKFS=mkfs.ext3 -O dir_index MOUNT_OPTIONS=data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <td colspan=19 bgcolor=#606060><b><font color=white>#0:</font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=6><b>REAL_TIME</b></td> <td colspan=6><b>CPU_TIME</b></td> <td colspan=6><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>CREATE</b></td> <td 
bgcolor=#E0E0C0 align=right><tt><U> 33.86</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.223 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.305 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.895 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.549 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.298 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>14.11</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.118 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.967 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.046 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.045 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.647</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 789424</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.208 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.181 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>COPY</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 119.68</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.228 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.237 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.397 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.277 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 7.061 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>23.05</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 
</font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.484 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.683 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.515 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.691</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1578216</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.208 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.182 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>READ</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 118.5</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.217 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.041 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.065 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.020</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 6.585 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>19.84</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 0.993 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.436</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.446 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.431</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.540 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1578216</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.208 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 
</font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.182 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>STATS</b></td> <td bgcolor=#E0E0C0 align=right><tt>24.69</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.951 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.677</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.696 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.677</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.151 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>7.75</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.008 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.590</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.582</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.583</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.645 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1578216</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.208 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.182 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>DELETE</b></td> <td bgcolor=#E0E0C0 align=right><tt>114.49</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.438 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.174</U> </font></tt></td> <td 
bgcolor=#E0E0C0 align=right><tt><font color=green> 0.188 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.177 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.257 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>32.64</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.790 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.193</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.199 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.194</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.223 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>4</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.000 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> </tt></td> </tr> <tr> <td colspan=19 bgcolor=#606060><b><font color=white>#1:DD_MBCOUNT=768 </font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=6><b>REAL_TIME</b></td> <td colspan=6><b>CPU_TIME</b></td> <td colspan=6><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_writing_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 26.24</U></tt></td> <td bgcolor=#E0E0C0 
align=right><tt><font color=black> <U> 1.002</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.066 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.311 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.056 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.063 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 3.25</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 0.997</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.138 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.622 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.286 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.298 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 786436</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_reading_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 19.04</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 0.994</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.002</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.003</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.002</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>2.08</tt></td> <td bgcolor=#E0E0C0 
align=right><tt><font color=red> 1.038 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.870 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.870 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.870 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.837</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 786436</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr><td bgcolor=black colspan=19><font color=white></td></tr> <tr><td colspan=19 align=right> <tr> <td colspan=19 bgcolor=#303030><b><font color=white>NPROC=1 DIR=/mnt/testfs SYNC=off PHASE_COPY=cp REP_COUNTER=1 GAMMA=0.2 PHASE_OVERWRITE=off FILE_SIZE=4000 BYTES=512000000 PHASE_APPEND=off PHASE_READ=find DEV=/dev/hdb3 DD_MBCOUNT=768 WRITE_BUFFER=131072 PHASE_DELETE=rm PHASE_MODIFY=off </td></tr> <tr><td colspan=19 align=right> <font size=-2>Produced by <a href=http://namesys.com/benchmarks/mongo_readme.html>Mongo</a> benchmark suite.</font></td></tr> </table> <hr> <a name="mongo.2003.08.26"></a>2003.08.26 <a href="benchmarks/mongo_readme.html">mongo</a> results <dl> <dt>reiser4 </dt> <dd>''</dd> <dt>mem total</dt> <dd>904048</dd> <dt>machine </dt> <dd>belka</dd> <dt>kernel </dt> <dd>2.6.0-test4 #176 SMP Tue Aug 26 19:09:38 MSD 2003</dd> <dt>date </dt> <dd>Tue Aug 26 19:34:39 2003</dd> </dl> <p> In this test 80% of files are chosen from the 0-4k size range, 16% from the 0-40k size range, 0.8 x 4% from the 0-400k size range, etc. 
Most files are small, most bytes are in large files. </p> <p>Legend:</p> <ul> <li><tt>A</tt> reiser4</li> <li><tt>B</tt> reiser4, extents only</li> <li><tt>C</tt> ext3 in <tt>data=writeback</tt> mode (meta-data only journalling)</li> <li><tt>D</tt> ext3 in <tt>data=journal</tt> mode</li> <li><tt>E</tt> ext3 in <tt>data=ordered</tt> mode</li> <li><tt>F</tt> ext3 with htree (hashed directories)</li> </ul> <p> The table presents absolute values (of elapsed time, CPU usage, and disk usage) for reiser4, and ratios against reiser4 for all other configurations. A <font color=red>red</font> number means the ratio is larger than <tt>1.0</tt>, i.e. reiser4 does better in that test. A <font color=green>green</font> number means that reiser4 does worse. </p> <table cols=19 cellpadding=2 cellspacing=2 noborder> <tr><td bgcolor=black colspan=19><font color=white></td></tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>A.INFO_R4='' FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>B.INFO_R4='' MKFS=mkfs.reiser4 -q -o policy=extents FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>C.MOUNT_OPTIONS=data=writeback FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>D.MOUNT_OPTIONS=data=journal FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>E.MOUNT_OPTIONS=data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>F.MKFS=mkfs.ext3 -O dir_index MOUNT_OPTIONS=data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <td colspan=19 bgcolor=#606060><b><font color=white>#0:</font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=6><b>REAL_TIME</b></td> <td colspan=6><b>CPU_TIME</b></td> <td colspan=6><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A
</b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>CREATE</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 27.6</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.311 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.567 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.538 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.668 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.566 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>13.55</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.166 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.035 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.162 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.189 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.670</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 788884</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.208 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.181 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.181 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.181 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.182 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>COPY</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 113.71</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.237 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.167 </font></tt></td> <td 
bgcolor=#E0E0C0 align=right><tt><font color=red> 1.460 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.227 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 7.387 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>23.13</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.169 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.498 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.691 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.591 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.709</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1577560</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.208 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.181 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.181 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.181 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.183 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>READ</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 111.51</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.239 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.157 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.176 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.096 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 7.017 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>20.76</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.042 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.424 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.415</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font 
color=green> <U> 0.416</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.521 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1577560</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.208 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.181 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.181 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.181 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.183 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>STATS</b></td> <td bgcolor=#E0E0C0 align=right><tt>20.22</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.034 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.834</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.827</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.832</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.439 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>7.47</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.009 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.590</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.585</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.584</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.631 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1577560</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.208 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.181 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.181 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.181 </font></tt></td> <td bgcolor=#E0E0C0 
align=right><tt><font color=red> 1.183 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>DELETE</b></td> <td bgcolor=#E0E0C0 align=right><tt>110.98</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.437 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.183</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.180</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.185 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.277 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>33.03</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.838 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.196 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.192</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.193</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.221 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>4</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.000 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> </tt></td> </tr> <tr> <td colspan=19 bgcolor=#606060><b><font color=white>#1:DD_MBCOUNT=768 </font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=6><b>REAL_TIME</b></td> <td colspan=6><b>CPU_TIME</b></td> <td colspan=6><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A 
</b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_writing_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 26.03</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.096 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.340 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.092 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.080 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 3.48</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.011</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.083 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.583 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.187 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.190 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 786436</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_reading_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 19</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 0.995</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> 
</font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 0.999</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 0.999</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>2.28</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.018 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.741 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.737</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.741 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.724</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 786436</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr><td bgcolor=black colspan=19><font color=white></td></tr> <tr><td colspan=19 align=right> <tr> <td colspan=19 bgcolor=#303030><b><font color=white>NPROC=1 DIR=/mnt/testfs SYNC=off PHASE_COPY=cp REP_COUNTER=1 GAMMA=0.2 PHASE_OVERWRITE=off FILE_SIZE=4000 BYTES=512000000 PHASE_APPEND=off PHASE_READ=find DEV=/dev/hdb3 DD_MBCOUNT=768 WRITE_BUFFER=131072 PHASE_DELETE=rm PHASE_MODIFY=off </td></tr> <tr><td colspan=19 align=right> <font size=-2>Produced by <a href=http://namesys.com/benchmarks/mongo_readme.html>Mongo</a> benchmark suite.</font></td></tr> </table> <hr> <a name="mongo.2003.08.18"></a>2003.08.18 <a href="benchmarks/mongo_readme.html">mongo</a> results <dl> <dt>reiser4 </dt> <dd></dd> <dt>mem total</dt> 
<dd>255992</dd> <dt>machine </dt> <dd>belka</dd> <dt>kernel </dt> <dd>2.6.0-test3 #37 SMP Mon Aug 18 18:12:14 MSD 2003</dd> <dt>date </dt> <dd>Mon Aug 18 20:24:16 2003</dd> </dl> <p> In this test 80% of files are chosen from the 0-8k size range, 16% from the 0-80k size range, 0.8 x 4% from the 0-800k size range, etc. Most files are small, most bytes are in large files. </p> <p>Legend:</p> <ul> <li><tt>A</tt> reiser4</li> <li><tt>B</tt> reiser4, extents only</li> <li><tt>C</tt> ext3 in <tt>data=writeback</tt> mode (meta-data only journalling)</li> <li><tt>D</tt> ext3 in <tt>data=journal</tt> mode</li> <li><tt>E</tt> ext3 in <tt>data=ordered</tt> mode</li> <li><tt>F</tt> ext3 with htree (hashed directories)</li> </ul> <p> The table presents absolute values (of elapsed time, CPU usage, and disk usage) for reiser4, and ratios against reiser4 for all other configurations. A <font color=red>red</font> number means the ratio is larger than <tt>1.0</tt>, i.e. reiser4 does better in that test. A <font color=green>green</font> number means that reiser4 does worse.
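The size distribution described above is geometric: each successive size range is ten times wider but five times less likely (this matches the GAMMA=0.2 setting shown in the parameter rows of these reports). As a rough illustrative sketch only, not the actual Mongo code, a sampler with those proportions could look like this; the function name and the tier cap are invented for the example:

```python
import random

def sample_file_size(base=8 * 1024, gamma=0.2, max_tier=4):
    """Illustrative sketch of the distribution described above:
    tier k is chosen with probability (1 - gamma) * gamma**k,
    i.e. 80% of files in 0-8k, 16% in 0-80k, 3.2% in 0-800k, etc.;
    within a tier the size is drawn uniformly."""
    r = random.random()
    p = 1.0 - gamma              # probability mass of tier 0
    tier = 0
    while r >= p and tier < max_tier:
        r -= p                   # skip past the current tier's mass
        p *= gamma               # each further tier is gamma times less likely
        tier += 1
    return random.randint(0, base * 10 ** tier)
```

With these parameters most files are small, but the rare large tiers dominate the total byte count, which is why the dd phases are reported separately.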
</p> <table cols=19 cellpadding=2 cellspacing=2 noborder> <tr><td bgcolor=black colspan=19><font color=white></td></tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>A.INFO_R4= FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>B.INFO_R4=ext MKFS=mkfs.reiser4 -q -o policy=extents FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>C.MOUNT_OPTIONS=data=writeback FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>D.MOUNT_OPTIONS=data=journal FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>E.MOUNT_OPTIONS=data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>F.MKFS=mkfs.ext3 -O dir_index MOUNT_OPTIONS=data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <td colspan=19 bgcolor=#606060><b><font color=white>#0:</font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=6><b>REAL_TIME</b></td> <td colspan=6><b>CPU_TIME</b></td> <td colspan=6><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>CREATE</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 29.16</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.220 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.422 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.779 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.491 </font></tt></td> <td bgcolor=#E0E0C0 
align=right><tt><font color=red> 1.645 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>13.52</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.182 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.013 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.087 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.997 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.657</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 789364</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.208 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.181 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>COPY</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 119.64</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.211 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.191 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.473 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.230 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 7.288 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>21.98</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.152 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.515 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.746 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.520 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.695</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 
1578116</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.208 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.182 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>READ</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 116.55</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.213 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.177 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.025 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.134 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 6.850 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>18.35</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.035 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.447 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.436</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.431</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.569 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1578116</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.208 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.182 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>STATS</b></td> <td bgcolor=#E0E0C0 align=right><tt>21.65</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font 
color=red> 1.050 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.779</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.811 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.782</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.358 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>7.56</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.001 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.599</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.612 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.611</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.638 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1578116</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.208 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.180 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.182 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>DELETE</b></td> <td bgcolor=#E0E0C0 align=right><tt>112.37</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.434 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.179</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.198 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.177</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.281 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>30.62</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.851 </font></tt></td> <td bgcolor=#E0E0C0 
align=right><tt><font color=green> <U> 0.205</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.205</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.203</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.230 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>4</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.000 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> </tt></td> </tr> <tr> <td colspan=19 bgcolor=#606060><b><font color=white>#1:DD_MBCOUNT=768 </font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=6><b>REAL_TIME</b></td> <td colspan=6><b>CPU_TIME</b></td> <td colspan=6><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_writing_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 26.11</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.011</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.090 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.388 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.076 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.083 </font></tt></td> </tt></td> 
<td bgcolor=#E0E0C0 align=right><tt>3.25</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.945</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.092 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.640 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.255 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.231 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 786436</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_reading_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 19.09</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.005</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 0.999</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 0.996</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.004</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.011</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>2.09</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.019 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.847</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.856 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.833</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.842</U> 
</font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 786436</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr><td bgcolor=black colspan=19><font color=white></td></tr> <tr><td colspan=19 align=right> <tr> <td colspan=19 bgcolor=#303030><b><font color=white>NPROC=1 DIR=/mnt/testfs SYNC=off PHASE_COPY=cp REP_COUNTER=1 GAMMA=0.2 PHASE_OVERWRITE=off FILE_SIZE=4000 BYTES=512000000 PHASE_APPEND=off PHASE_READ=find DEV=/dev/hdb3 DD_MBCOUNT=768 WRITE_BUFFER=131072 PHASE_DELETE=rm PHASE_MODIFY=off </td></tr> <tr><td colspan=19 align=right> <font size=-2>Produced by <a href=http://namesys.com/benchmarks/mongo_readme.html>Mongo</a> benchmark suite.</font></td></tr> </table> <hr> <a name="mongo.2003.08.12"></a>2003.08.12 <a href="benchmarks/mongo_readme.html">mongo</a> results <dl> <dt>mem total</dt> <dd>513284</dd> <dt>machine </dt> <dd>strelka</dd> <dt>kernel </dt> <dd>2.6.0-test2 #52 SMP Tue Aug 12 15:17:12 MSD 2003</dd> <dt>date </dt> <dd>Tue Aug 12 15:38:47 2003</dd> </dl> <p> This is a comparison of the latest (2003.08.12) version of reiser4 with ext3. Reiser4 is an atomic filesystem, so the comparison with ext3's data-journalling mode is the fairest, but since most users run ext3 in data-ordered mode, we compare against that as well. </p> <p> In this test 80% of files are chosen from the 0-8k size range, 16% from the 0-80k size range, 0.8 x 4% from the 0-800k size range, etc. Most files are small, most bytes are in large files.
</p> <p>Legend:</p> <ul> <li><tt>A</tt> reiser4</li> <li><tt>B</tt> ext3 in <tt>data=writeback</tt> mode (meta-data only journalling)</li> <li><tt>C</tt> ext3 in <tt>data=journal</tt> mode</li> <li><tt>D</tt> ext3 in <tt>data=ordered</tt> mode</li> <li><tt>E</tt> ext3 with htree (hashed directories)</li> <li><tt>F</tt> ext3 with support for filetypes in <tt>readdir()</tt></li> </ul> <p> The table presents absolute values (of elapsed time, CPU usage, and disk usage) for reiser4, and ratios against reiser4 for all other configurations. A <font color=red>red</font> number means the ratio is larger than <tt>1.0</tt>, i.e. reiser4 does better in that test. A <font color=green>green</font> number means that reiser4 does worse. </p> <table cols=19 cellpadding=2 cellspacing=2 noborder> <tr><td bgcolor=black colspan=19><font color=white></td></tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>A.INFO_R4= MKFS=/usr/local/sbin/mkfs.reiser4 -qf FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>B.MOUNT_OPTIONS=data=writeback FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>C.MOUNT_OPTIONS=data=journal FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>D.MOUNT_OPTIONS=data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>E.MKFS=/usr/local/sbin/mkfs.ext3 -O dir_index MOUNT_OPTIONS=data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor=#303030 colspan=19 align=left><font color=white>F.MKFS=/usr/local/sbin/mkfs.ext3 -O filetype MOUNT_OPTIONS=data=ordered FSTYPE=ext3 </font></th> </tr> <tr> <td colspan=19 bgcolor=#606060><b><font color=white>#0:</font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=6><b>REAL_TIME</b></td> <td colspan=6><b>CPU_TIME</b></td> <td colspan=6><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td>
<td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>CREATE</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 14.06</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.317 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.248 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.050 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.016 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.077 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>5.3</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.558 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.692 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.602 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.823</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.592 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 458224</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>COPY</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 43.62</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.982 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font 
color=red> 1.733 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.033 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 6.685 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.904 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>9.19</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.163 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.286 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.230 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.706</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.200 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 916172</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>READ</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 39.86</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.091 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.091 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.140 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 6.003 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.119 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>8.22</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.467 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.454 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.464 </font></tt></td> <td 
bgcolor=#E0E0C0 align=right><tt><font color=green> 0.529 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.443</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 916172</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>STATS</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1.54</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.987 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.896 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.942 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.649 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.883 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 0.26</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.115 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.115 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.115 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.385 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.962 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 916172</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.107 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.108 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font 
color=red> 1.107 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>DELETE</b></td> <td bgcolor=#E0E0C0 align=right><tt>37.85</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.833 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.825 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.867 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.133 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.760</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>11.11</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.223</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.223</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.220</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.254 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.222</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>4</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> </tt></td> </tr> <tr> <td colspan=19 bgcolor=#606060><b><font color=white>#1:DD_MBCOUNT=500 </font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=6><b>REAL_TIME</b></td> <td colspan=6><b>CPU_TIME</b></td> <td colspan=6><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A 
</b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_writing_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 42.15</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.062 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.534 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.066 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.071 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.073 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 7.86</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.094 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.500 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.206 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.211 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.198 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 512004</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_reading_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 36.5</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.005</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.008</U> </font></tt></td> <td 
bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.005</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.007</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.007</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>4.7</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.745</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.732</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.743</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.736</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.734</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 512004</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr><td bgcolor=black colspan=19><font color=white></td></tr> <tr><td colspan=19 align=right> <tr> <td colspan=19 bgcolor=#303030><b><font color=white>NPROC=1 DIR=/data1 SYNC=off PHASE_COPY=cp REP_COUNTER=3 GAMMA=0.2 PHASE_OVERWRITE=off PHASE_STATS=find FILE_SIZE=8192 BYTES=134217728 PHASE_APPEND=off PHASE_READ=find DEV=/dev/hdb1 DD_MBCOUNT=500 WRITE_BUFFER=131072 PHASE_DELETE=rm PHASE_MODIFY=off </td></tr> <tr><td colspan=19 align=right> <font size=-2>Produced by <a href=http://namesys.com/benchmarks/mongo_readme.html>Mongo</a> benchmark suite.</font></td></tr> </table> <hr> <p> <a name="mongo.2003.07.23"></a> Below are older (2003.07.23) mongo results.
</p> <table cols=10 cellpadding=2 cellspacing=2 noborder> <tr><td bgcolor=black colspan=10><font color=white></td></tr> <tr> <th bgcolor=#303030 colspan=10 align=left><font color=white>A. reiser4</th> </tr> <tr> <th bgcolor=#303030 colspan=10 align=left><font color=white>B. ext3 data journalling</th> </tr> <tr> <th bgcolor=#303030 colspan=10 align=left><font color=white>C. ext3 </font></th> </tr> <tr> <td colspan=10 bgcolor=#606060><b><font color=white>#0:</font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=3><b>REAL_TIME</b></td> <td colspan=3><b>CPU_TIME</b></td> <td colspan=3><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>CREATE</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 14.19</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.221 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 3.592 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 5.66</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.610 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.475 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 458692</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>COPY</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 49.01</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.586 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.783 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 9.08</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.308 </font></tt></td> <td 
bgcolor=#E0E0C0 align=right><tt><font color=red> 1.176 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 916668</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>READ</b></td> <td bgcolor=#E0E0C0 align=right><tt>43.39</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.970</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> 1.017 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>8.1</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.452</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.453</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 916668</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>STATS</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 1.93</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.534 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.549 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 0.27</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.000 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.963 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 916668</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.106 </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>DELETE</b></td> <td bgcolor=#E0E0C0 align=right><tt>40.13</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.797</U> </font></tt></td> <td bgcolor=#E0E0C0 
align=right><tt><font color=green> 0.837 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>11.26</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.217 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.210</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>4</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.000</U> </font></tt></td> </tt></td> </tr> <tr> <td colspan=10 bgcolor=#606060><b><font color=white>#1:DD_MBCOUNT=500 </font></b></td></tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td colspan=3><b>REAL_TIME</b></td> <td colspan=3><b>CPU_TIME</b></td> <td colspan=3><b>DF</b></td> </tr> <tr align=center bgcolor=#C0C0C0> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_writing_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 42.27</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 2.527 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.057 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 7.78</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.497 </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=red> 1.189 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 512004</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr> <td bgcolor=#C0C0C0><b>dd_reading_largefile</b></td> <td bgcolor=#E0E0C0 align=right><tt><U> 36.57</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.005</U> </font></tt></td> <td bgcolor=#E0E0C0 
align=right><tt><font color=black> <U> 1.005</U> </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt>4.8</tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> <U> 0.760</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=green> 0.777 </font></tt></td> </tt></td> <td bgcolor=#E0E0C0 align=right><tt><U> 512004</U></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> <td bgcolor=#E0E0C0 align=right><tt><font color=black> <U> 1.001</U> </font></tt></td> </tt></td> </tr> <tr><td bgcolor=black colspan=10><font color=white></td></tr> <tr><td colspan=10 align=right> <tr> <td colspan=10 bgcolor=#303030><b><font color=white>NPROC=1 DIR=/data1 SYNC=off PHASE_COPY=cp REP_COUNTER=3 GAMMA=0.2 PHASE_OVERWRITE=off PHASE_STATS=find FILE_SIZE=8192 BYTES=134217728 PHASE_APPEND=off PHASE_READ=find DEV=/dev/hdb1 DD_MBCOUNT=500 WRITE_BUFFER=131072 PHASE_DELETE=rm PHASE_MODIFY=off </td></tr> <tr><td colspan=10 align=right> <font size=-2>Produced by <a href=http://namesys.com/benchmarks/mongo_readme.html>Mongo</a> benchmark suite.</font></td></tr> </table> <hr> <a name="mongo.2003.07.10"> <p> Below are some older benchmarks from just before Linux Tag. In these, note that gamma is the fraction of files that are larger than the base size by 10x. It is set either to 0.2 (as in the benchmark above), to mimic observed real-world usage patterns, or to 0, to measure a file size range's performance in isolation. Note that V3 performs poorly in the 0-8k size range, and V4 performs well. This is the result of deep design changes you can read about at <a href="http://www.namesys.com/v4/v4.html">http://www.namesys.com/v4/v4.html</a>.
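The gamma parameter can be illustrated with a short sketch. This is a hypothetical re-implementation of the size model described above, not code from the Mongo sources: with FILE_SIZE=8192 and GAMMA=0.2 (the settings used in the runs above), roughly one file in five is generated at ten times the base size; the uniform random draw and the helper name are illustrative assumptions.

```python
import random

def mongo_like_sizes(n_files, file_size=8192, gamma=0.2, seed=0):
    """Sketch of the Mongo size model: a gamma fraction of files are
    10x the base file_size, the rest are at the base size.
    (Hypothetical helper, not taken from the Mongo sources.)"""
    rng = random.Random(seed)
    return [file_size * 10 if rng.random() < gamma else file_size
            for _ in range(n_files)]

# With gamma=0.2, about 20% of the generated files are 80k, the rest 8k.
sizes = mongo_like_sizes(10000)
big_fraction = sum(1 for s in sizes if s == 81920) / len(sizes)
print(big_fraction)  # close to gamma
```

Setting gamma=0 reduces this to a fixed-size workload, which is what the GAMMA=0.0 tables further down measure.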
<dl><dt>mem total</dt><dd>513748</dd><dt>machine </dt><dd>strelka</dd><dt>kernel </dt><dd>2.5.74 #213 SMP Thu Jul 10 22:53:23 MSD 2003</dd><dt>date </dt><dd>Thu Jul 10 22:48:56 2003</dd><dt>.config </dt><dd><a href="http://www.namesys.com/intbenchmarks/mongo/03.07.11.nikita/.config">here</a></dd><dt>NPROC</dt><dd>1</dd><dt>DIR</dt><dd>/data1</dd><dt>SYNC</dt><dd>off</dd><dt>REP_COUNTER</dt><dd>3</dd><dt>All phases are in readdir order</dt><dd></dd><dt>BYTES</dt><dd>100M</dd><dt>DEV</dt><dd>/dev/hdb1</dd><dt>WRITE_BUFFER</dt><dd><b>256k</b></dd></dl> <p>everywhere <b>A</b> is reiserfs and <b>B</b> is reiser4. Green numbers mean reiser4 is better.</p> <table cols="7" cellpadding="2" cellspacing="2" noborder=""> <tbody><tr><td bgcolor="black" colspan="7"><font color="white"></font></td></tr> <tr> <th bgcolor="#303030" colspan="7" align="left"><font color="white">median file size 8k</font></th> </tr> <tr align="center" bgcolor="#c0c0c0"> <td></td> <td colspan="2"><b>REAL_TIME</b></td> <td colspan="2"><b>CPU_TIME</b></td> <td colspan="2"><b>DF</b></td> </tr> <tr align="center" bgcolor="#c0c0c0"> <td></td> <td><b>A</b></td><td><b>B/A </b></td> <td><b>A</b></td><td><b>B/A </b></td> <td><b>A</b></td><td><b>B/A </b></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>CREATE</b></td> <td bgcolor="#e0e0c0" align="right"><tt>41.26</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.246</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>3.93</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.908</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>321632</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.961</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>COPY</b></td> <td bgcolor="#e0e0c0" align="right"><tt>154.09</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.504</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 5.17</u></tt></td> <td 
bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.217 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>642624</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.962</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>APPEND</b></td> <td bgcolor="#e0e0c0" align="right"><tt>282.09</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.573</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 6.6</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.392 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>944428</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 0.980</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>MODIFY</b></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 284.52</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 0.986</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 3.29</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.489 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 943592</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 0.981</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>OVERWRITE</b></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 298.19</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.263 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 5.33</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.608 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>943548</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.968</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>READ</b></td> <td bgcolor="#e0e0c0" align="right"><tt>245.22</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.940</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 
3.85</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.753 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>943548</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.968</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>STATS</b></td> <td bgcolor="#e0e0c0" align="right"><tt>20.58</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.099</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 0.48</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.292 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>943548</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.968</u> </font></tt></td> </tr> <tr> <td colspan="7" bgcolor="#a0a0a0"><b><font color="white">GAMMA=0.2 FILE_SIZE=8192 <a href="http://www.namesys.com/intbenchmarks/mongo/03.07.11.nikita/8k.heavy.v3.profile">A profile</a> <a href="http://www.namesys.com/intbenchmarks/mongo/03.07.11.nikita/8k.heavy.v4.profile">B profile</a></font></b></td></tr> <tr><td bgcolor="white" colspan="7"><font color="white"></font></td></tr> <tr><td bgcolor="white" colspan="7"><font color="white"></font></td></tr> <tr><td bgcolor="white" colspan="7"><font color="white"></font></td></tr> <tr><td bgcolor="black" colspan="7"><font color="white"></font></td></tr> <tr> <th bgcolor="#303030" colspan="7" align="left"><font color="white">median file size 4k</font></th> </tr> <tr align="center" bgcolor="#c0c0c0"> <td></td> <td colspan="2"><b>REAL_TIME</b></td> <td colspan="2"><b>CPU_TIME</b></td> <td colspan="2"><b>DF</b></td> </tr> <tr align="center" bgcolor="#c0c0c0"> <td></td> <td><b>A</b></td><td><b>B/A </b></td> <td><b>A</b></td><td><b>B/A </b></td> <td><b>A</b></td><td><b>B/A </b></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>CREATE</b></td> <td bgcolor="#e0e0c0" align="right"><tt>117.32</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.176</u> </font></tt></td> 
<td bgcolor="#e0e0c0" align="right"><tt>15.57</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.758</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 667652</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.000</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>COPY</b></td> <td bgcolor="#e0e0c0" align="right"><tt>524.67</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.365</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 19.16</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.059 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 1332856</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.002</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>APPEND</b></td> <td bgcolor="#e0e0c0" align="right"><tt>1068.43</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.363</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>31.27</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.937</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>2073420</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.950</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>MODIFY</b></td> <td bgcolor="#e0e0c0" align="right"><tt>1081.23</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.670</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 18.61</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.048 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>2066536</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.953</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>OVERWRITE</b></td> <td bgcolor="#e0e0c0" align="right"><tt>1050.55</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 
0.885</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 22.81</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.017</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>2066424</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.948</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>READ</b></td> <td bgcolor="#e0e0c0" align="right"><tt>974.43</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.644</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 12.28</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.635 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>2066424</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.948</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>STATS</b></td> <td bgcolor="#e0e0c0" align="right"><tt>83.44</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.075</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>1.26</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.802</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>2066424</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.948</u> </font></tt></td> </tr> <tr> <td colspan="7" bgcolor="#a0a0a0"><b><font color="white">GAMMA=0.2 FILE_SIZE=4096 <a href="http://www.namesys.com/intbenchmarks/mongo/03.07.11.nikita/4k.heavy.v3.profile">A profile</a> <a href="http://www.namesys.com/intbenchmarks/mongo/03.07.11.nikita/4k.heavy.v4.profile">B profile</a></font></b></td></tr> <tr><td bgcolor="white" colspan="7"><font color="white"></font></td></tr> <tr><td bgcolor="white" colspan="7"><font color="white"></font></td></tr> <tr><td bgcolor="white" colspan="7"><font color="white"></font></td></tr> <tr><td bgcolor="black" colspan="7"><font color="white"></font></td></tr> <tr> <th bgcolor="#303030" colspan="7" 
align="left"><font color="white">maximal file size 4k</font></th> </tr> <tr align="center" bgcolor="#c0c0c0"> <td></td> <td colspan="2"><b>REAL_TIME</b></td> <td colspan="2"><b>CPU_TIME</b></td> <td colspan="2"><b>DF</b></td> </tr> <tr align="center" bgcolor="#c0c0c0"> <td></td> <td><b>A</b></td><td><b>B/A </b></td> <td><b>A</b></td><td><b>B/A </b></td> <td><b>A</b></td><td><b>B/A </b></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>CREATE</b></td> <td bgcolor="#e0e0c0" align="right"><tt>77.34</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.309</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>21.86</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.938</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>452252</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.923</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>COPY</b></td> <td bgcolor="#e0e0c0" align="right"><tt>412.28</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.300</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 35.11</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.013</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>893408</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.934</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>APPEND</b></td> <td bgcolor="#e0e0c0" align="right"><tt>1198.9</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.164</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>67.06</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.694</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>1631992</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.749</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>MODIFY</b></td> <td bgcolor="#e0e0c0" 
align="right"><tt>1305.14</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.351</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>43.77</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.762</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>1613124</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.758</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>OVERWRITE</b></td> <td bgcolor="#e0e0c0" align="right"><tt>1390.94</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.239</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>44.22</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.777</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>1610948</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.759</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>READ</b></td> <td bgcolor="#e0e0c0" align="right"><tt>1093.6</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.256</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 19.46</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.743 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>1610948</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.759</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>STATS</b></td> <td bgcolor="#e0e0c0" align="right"><tt>115.76</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.200</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>2.6</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.735</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>1610948</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.759</u> </font></tt></td> </tr> <tr> <td colspan="7" bgcolor="#a0a0a0"><b><font 
color="white">GAMMA=0.0 FILE_SIZE=4096 <a href="http://www.namesys.com/intbenchmarks/mongo/03.07.11.nikita/100.heavy.v3.profile">A profile</a> <a href="http://www.namesys.com/intbenchmarks/mongo/03.07.11.nikita/100.heavy.v4.profile">B profile</a></font></b></td></tr> <tr><td bgcolor="white" colspan="7"><font color="white"></font></td></tr> <tr><td bgcolor="white" colspan="7"><font color="white"></font></td></tr> <tr><td bgcolor="white" colspan="7"><font color="white"></font></td></tr> <tr> <th bgcolor="#303030" colspan="7" align="left"><font color="white">median file size 8k</font></th> </tr> <tr align="center" bgcolor="#c0c0c0"> <td></td> <td colspan="2"><b>REAL_TIME</b></td> <td colspan="2"><b>CPU_TIME</b></td> <td colspan="2"><b>DF</b></td> </tr> <tr align="center" bgcolor="#c0c0c0"> <td></td> <td><b>A</b></td><td><b>B/A </b></td> <td><b>A</b></td><td><b>B/A </b></td> <td><b>A</b></td><td><b>B/A </b></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>CREATE</b></td> <td bgcolor="#e0e0c0" align="right"><tt>40.54</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.248</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>4.01</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.895</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>321632</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.961</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>COPY</b></td> <td bgcolor="#e0e0c0" align="right"><tt>152.82</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.506</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 5.2</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.215 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>642624</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.962</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>READ</b></td> <td bgcolor="#e0e0c0" 
align="right"><tt>141.8</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.563</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 3.03</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.762 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>642624</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.962</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>STATS</b></td> <td bgcolor="#e0e0c0" align="right"><tt>14.91</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.084</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 0.59</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.051 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>642624</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.962</u> </font></tt></td> </tr> <tr><td bgcolor="black" colspan="7"><font color="white"></font></td></tr> <tr><td colspan="7" align="right"> </td></tr><tr> <td colspan="7" bgcolor="#303030"><b><font color="white">GAMMA=0.2 FILE_SIZE=8192</font></b></td></tr> <tr><td bgcolor="white" colspan="7"><font color="white"></font></td></tr> <tr><td bgcolor="white" colspan="7"><font color="white"></font></td></tr> <tr><td bgcolor="white" colspan="7"><font color="white"></font></td></tr> <tr> <th bgcolor="#303030" colspan="7" align="left"><font color="white">median file size 4k</font></th> </tr> <tr align="center" bgcolor="#c0c0c0"> <td></td> <td colspan="2"><b>REAL_TIME</b></td> <td colspan="2"><b>CPU_TIME</b></td> <td colspan="2"><b>DF</b></td> </tr> <tr align="center" bgcolor="#c0c0c0"> <td></td> <td><b>A</b></td><td><b>B/A </b></td> <td><b>A</b></td><td><b>B/A </b></td> <td><b>A</b></td><td><b>B/A </b></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>CREATE</b></td> <td bgcolor="#e0e0c0" align="right"><tt>115.6</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.174</u> 
</font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>14.84</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.772</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 667652</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.000</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>COPY</b></td> <td bgcolor="#e0e0c0" align="right"><tt>528.83</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.361</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 18.91</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.058 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 1332856</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.002</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>READ</b></td> <td bgcolor="#e0e0c0" align="right"><tt>532.06</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.372</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 10.87</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.589 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 1332856</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.002</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>STATS</b></td> <td bgcolor="#e0e0c0" align="right"><tt>51.99</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.069</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>1.67</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.581</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 1332856</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.002</u> </font></tt></td> </tr> <tr><td bgcolor="black" colspan="7"><font color="white"></font></td></tr> <tr><td colspan="7" align="right"> </td></tr><tr> <td colspan="7" 
bgcolor="#303030"><b><font color="white">GAMMA=0.2 FILE_SIZE=4096</font></b></td></tr> <tr><td bgcolor="white" colspan="7"><font color="white"></font></td></tr> <tr><td bgcolor="white" colspan="7"><font color="white"></font></td></tr> <tr><td bgcolor="white" colspan="7"><font color="white"></font></td></tr> <tr> <th bgcolor="#303030" colspan="7" align="left"><font color="white">maximal file size 4k</font></th> </tr> <tr align="center" bgcolor="#c0c0c0"> <td></td> <td colspan="2"><b>REAL_TIME</b></td> <td colspan="2"><b>CPU_TIME</b></td> <td colspan="2"><b>DF</b></td> </tr> <tr align="center" bgcolor="#c0c0c0"> <td></td> <td><b>A</b></td><td><b>B/A </b></td> <td><b>A</b></td><td><b>B/A </b></td> <td><b>A</b></td><td><b>B/A </b></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>CREATE</b></td> <td bgcolor="#e0e0c0" align="right"><tt>77.5</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.309</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>22.24</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.910</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>452252</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.923</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>COPY</b></td> <td bgcolor="#e0e0c0" align="right"><tt>415.84</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.297</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 34.9</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.009</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>893408</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.934</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>READ</b></td> <td bgcolor="#e0e0c0" align="right"><tt>469.97</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.273</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 
20.14</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.454 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>893408</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.934</u> </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>STATS</b></td> <td bgcolor="#e0e0c0" align="right"><tt>65.49</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.162</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>3.09</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.599</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>893408</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.934</u> </font></tt></td> </tr> <tr><td bgcolor="black" colspan="7"><font color="white"></font></td></tr> <tr><td colspan="7" align="right"> </td></tr><tr> <td colspan="7" bgcolor="#303030"><b><font color="white">GAMMA=0.0 FILE_SIZE=4096</font></b></td></tr> </tbody></table> <hr> <h1>Mongo benchmark results</h1> <h2>create, copy, read, stats, delete phases</h2> <dl><dt>reiser4 </dt><dd>ChangeSet@1.1095, 2003-07-10 15:22:17+04:00, god@laputa.namesys.com oops ChangeSet@1.1094, 2003-07-10 15:14:06+04:00, god@laputa.namesys.com repairing compilation damage. 
</dd><dt>mem total</dt><dd>256624</dd><dt>machine </dt><dd>belka</dd><dt>kernel </dt><dd>2.5.74 #28 Thu Jul 10 18:36:03 MSD 2003</dd><dt>date </dt><dd>Thu Jul 10 19:21:06 2003</dd><dt><a href="http://namesys.com/intbenchmarks/mongo/03.07.11.light/dot.config">.config</a></dt></dl> <table cols="19" cellpadding="2" cellspacing="2" noborder=""> <tbody><tr><td bgcolor="black" colspan="19"><font color="white"></font></td></tr> <tr> <th bgcolor="#303030" colspan="19" align="left"><font color="white">A.INFO_R4=test FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor="#303030" colspan="19" align="left"><font color="white">B.INFO_R4=test FSTYPE=reiser4 MKFS=mkfs.reiser4 -q -e extent40 </font></th> </tr> <tr> <th bgcolor="#303030" colspan="19" align="left"><font color="white">C.FSTYPE=reiserfs </font></th> </tr> <tr> <th bgcolor="#303030" colspan="19" align="left"><font color="white">D.FSTYPE=reiserfs MOUNT_OPTIONS=notail </font></th> </tr> <tr> <th bgcolor="#303030" colspan="19" align="left"><font color="white">E.FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor="#303030" colspan="19" align="left"><font color="white">F.FSTYPE=ext3 MOUNT_OPTIONS=data=journal </font></th> </tr> <tr> <td colspan="19" bgcolor="#606060"><b><font color="white">#0:FILE_SIZE=4000 </font></b></td></tr> <tr align="center" bgcolor="#c0c0c0"> <td></td> <td colspan="6"><b>REAL_TIME</b></td> <td colspan="6"><b>CPU_TIME</b></td> <td colspan="6"><b>DF</b></td> </tr> <tr align="center" bgcolor="#c0c0c0"> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>CREATE</b></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 20.47</u></tt></td> <td bgcolor="#e0e0c0" 
align="right"><tt><font color="red"> 1.404 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 3.037 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 2.024 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 2.513 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 3.324 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>12.72</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.143 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.270 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.873 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.615</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.606</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 416332</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.934 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.088 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.909 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.858 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.858 </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>COPY</b></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 65.25</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.484 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 2.953 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 2.020 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.986 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 2.267 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>21.98</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font 
color="red"> 1.032 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.098 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.732 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.529</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.699 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 832640</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.934 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.088 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.910 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.858 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.858 </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>READ</b></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 75.56</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.349 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 2.868 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 2.218 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.902 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.925 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>17.36</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.213 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.745 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.857 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.695 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.681</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 832640</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font 
color="red"> 1.934 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.088 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.910 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.858 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.858 </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>STATS</b></td> <td bgcolor="#e0e0c0" align="right"><tt>132.18</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> 0.996 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.963</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> 0.994 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.967</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.950</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>2.63</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.977</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.970</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 0.989</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 0.981</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> 1.008 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 832640</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.934 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.088 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.910 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.858 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.858 </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>DELETE</b></td> <td bgcolor="#e0e0c0" 
align="right"><tt>85.32</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.627 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.239 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.442 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.403</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.449 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>33.57</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.856 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.780 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.623 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.157</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.154</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>4</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> 1.000 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.000</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.000</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.000</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.000</u> </font></tt></td> </tr> <tr> <td colspan="19" bgcolor="#606060"><b><font color="white">#1:FILE_SIZE=8000 </font></b></td></tr> <tr align="center" bgcolor="#c0c0c0"> <td></td> <td colspan="6"><b>REAL_TIME</b></td> <td colspan="6"><b>CPU_TIME</b></td> <td colspan="6"><b>DF</b></td> </tr> <tr align="center" bgcolor="#c0c0c0"> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A 
</b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>CREATE</b></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 15.07</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.009</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 8.875 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.709 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 2.237 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 3.321 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>8.62</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.945 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.932 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.729 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.517</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.522</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 399788</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.000</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.243 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.461 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.434 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.434 </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>COPY</b></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 52.24</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.007</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 4.998 </font></tt></td> <td bgcolor="#e0e0c0" 
align="right"><tt><font color="red"> 1.492 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.562 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.879 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>13.42</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.026 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.264 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.700 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.487</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.635 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 799488</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.000</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.243 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.461 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.434 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.434 </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>READ</b></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 60.91</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.013</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 3.738 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.606 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.333 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.340 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>11.66</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> 1.018 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.526</u> </font></tt></td> <td bgcolor="#e0e0c0" 
align="right"><tt><font color="green"> 0.749 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.547 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.547 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 799488</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.000</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.243 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.461 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.434 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.434 </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>STATS</b></td> <td bgcolor="#e0e0c0" align="right"><tt>126.53</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.951</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.958</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> 0.991 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> 1.004 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.966</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 2.57</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.023 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.027 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 0.988</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> 1.016 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> 1.012 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 799488</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.000</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.243 
</font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.461 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.434 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.434 </font></tt></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>DELETE</b></td> <td bgcolor="#e0e0c0" align="right"><tt>73.21</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.116 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.746 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.242</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.301 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.396 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>19.93</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> 1.013 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.584 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.530 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.126 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.123</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>4</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> 1.000 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.000</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.000</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.000</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.000</u> </font></tt></td> </tr> <tr><td bgcolor="black" colspan="19"><font color="white"></font></td></tr> <tr><td colspan="19" align="right"> </td></tr><tr> <td colspan="19" bgcolor="#303030"><b><font 
color="white">PHASE_APPEND=off NPROC=1 DIR=/mnt/testfs SYNC=off REP_COUNTER=3 GAMMA=0.0 PHASE_OVERWRITE=off DEV=/dev/hdb3 WRITE_BUFFER=4096 BYTES=128000000 PHASE_MODIFY=off </font></b></td></tr> <tr><td colspan="19" align="right"> <font size="-2">Produced by <a href="http://namesys.com/benchmarks/mongo_readme.html">Mongo</a> benchmark suite.</font></td></tr> </tbody></table> <h2>dd of a large file phase</h2> <dl><dt>reiser4 </dt><dd>ChangeSet@1.1095, 2003-07-10 15:22:17+04:00, god@laputa.namesys.com oops ChangeSet@1.1094, 2003-07-10 15:14:06+04:00, god@laputa.namesys.com repairing compilation damage. </dd><dt>mem total</dt><dd>256624</dd><dt>machine </dt><dd>belka</dd><dt>kernel </dt><dd>2.5.74 #28 Thu Jul 10 18:36:03 MSD 2003</dd><dt>date </dt><dd>Thu Jul 10 21:36:22 2003</dd><dt><a href="http://namesys.com/intbenchmarks/mongo/03.07.11.light/dot.config">.config</a></dt></dl> <table cols="19" cellpadding="2" cellspacing="2" noborder=""> <tbody><tr><td bgcolor="black" colspan="19"><font color="white"></font></td></tr> <tr> <th bgcolor="#303030" colspan="19" align="left"><font color="white">A.INFO_R4=test FSTYPE=reiser4 </font></th> </tr> <tr> <th bgcolor="#303030" colspan="19" align="left"><font color="white">B.INFO_R4=test FSTYPE=reiser4 MKFS=mkfs.reiser4 -q -e extent40 </font></th> </tr> <tr> <th bgcolor="#303030" colspan="19" align="left"><font color="white">C.FSTYPE=reiserfs </font></th> </tr> <tr> <th bgcolor="#303030" colspan="19" align="left"><font color="white">D.FSTYPE=reiserfs MOUNT_OPTIONS=notail </font></th> </tr> <tr> <th bgcolor="#303030" colspan="19" align="left"><font color="white">E.FSTYPE=ext3 </font></th> </tr> <tr> <th bgcolor="#303030" colspan="19" align="left"><font color="white">F.FSTYPE=ext3 MOUNT_OPTIONS=data=journal </font></th> </tr> <tr> <td colspan="19" bgcolor="#606060"><b><font color="white">#0:DD_MBCOUNT=768 </font></b></td></tr> <tr align="center" bgcolor="#c0c0c0"> <td></td> <td colspan="6"><b>REAL_TIME</b></td> <td 
colspan="6"><b>CPU_TIME</b></td> <td colspan="6"><b>DF</b></td> </tr> <tr align="center" bgcolor="#c0c0c0"> <td></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> <td><b>A</b></td><td><b>B/A </b></td><td><b>C/A </b></td><td><b>D/A </b></td><td><b>E/A </b></td><td><b>F/A </b></td> </tr> <tr> <td bgcolor="#c0c0c0"><b>dd_writing_largefile</b></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 76.29</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 0.997</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.137 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.149 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.062 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 2.217 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt>7.47</tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="red"> 1.027 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.545</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> <u> 0.549</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.803 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="green"> 0.835 </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><u> 786432</u></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.000</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.001</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.001</u> </font></tt></td> <td bgcolor="#e0e0c0" align="right"><tt><font color="black"> <u> 1.001</u> </font></tt></td> <td bgcolor="#e0e0c0" 
<tt><font color="black"> <u> 1.001</u> </font></tt></td> </tr> <tr><td bgcolor="black" colspan="19"><font color="white"></font></td></tr> <tr><td colspan="19" align="right"> </td></tr><tr> <td colspan="19" bgcolor="#303030"><b><font color="white">NPROC=1 DIR=/mnt/testfs SYNC=off REP_COUNTER=3 GAMMA=0.0 DD_MBCOUNT=768 DEV=/dev/hdb3 WRITE_BUFFER=4096 FILE_SIZE=8000 BYTES=128000000 </font></b></td></tr> <tr><td colspan="19" align="right"> <font size="-2">Produced by <a href="http://namesys.com/benchmarks/mongo_readme.html">Mongo</a> benchmark suite.</font></td></tr> </tbody></table> <hr> <a name="bonnie++.2003.09.30"></a> This is bonnie++ output for reiser4 and ext3, produced in an attempt to analyze the <a href="http://fsbench.netnation.com/">results</a> obtained by Mike Benoit. Hardware specs: <pre> processor : 3 vendor_id : GenuineIntel cpu family : 15 model : 2 model name : Intel(R) Xeon(TM) CPU 2.40GHz stepping : 7 cpu MHz : 2379.253 cache size : 512 KB bogomips : 4751.36 </pre> Dual CPU with hyper-threading. Memory: 128M. HDD: <pre> # hdparm /dev/hdb1 /dev/hdb1: multcount = 16 (on) IO_support = 0 (default 16-bit) unmaskirq = 0 (off) using_dma = 1 (on) keepsettings = 0 (off) readonly = 0 (off) readahead = 256 (on) geometry = 65535/16/63, sectors = 117226242, start = 63 # hdparm -t /dev/hdb1 /dev/hdb1: Timing buffered disk reads: 64 MB in 1.60 seconds = 39.91 MB/sec # hdparm -i /dev/hdb /dev/hdb: Model=ST360021A, FwRev=3.19, SerialNo=3HR173RB Config={ HardSect NotMFM HdSw>15uSec Fixed DTR>10Mbs RotSpdTol>.5% } RawCHS=16383/16/63, TrkSize=0, SectSize=0, ECCbytes=4 BuffType=unknown, BuffSize=2048kB, MaxMultSect=16, MultSect=16 CurCHS=16383/16/63, CurSects=16514064, LBA=yes, LBAsects=117231408 IORDY=on/off, tPIO={min:240,w/IORDY:120}, tDMA={min:120,rec:120} PIO modes: pio0 pio1 pio2 pio3 pio4 DMA modes: mdma0 mdma1 mdma2 UDMA modes: udma0 udma1 udma2 udma3 udma4 *udma5 AdvancedPM=no WriteCache=enabled Drive conforms to: device does not report version: 1
2 3 4 5 </pre> <pre> ./bonnie++ -s 1g -n 10 -x 5 Version 1.03 ------Sequential Output------ --Sequential Input- --Random- -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks-- Machine Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP v4.128M 1G 19903 89 37911 20 15392 11 13624 58 41807 12 131.0 0 v4.128M 1G 19965 89 37600 20 15845 11 13730 58 41751 12 130.0 0 v4.128M 1G 19937 89 37746 20 15404 11 13624 58 41793 12 132.1 0 v4.128M 1G 19998 89 37184 19 15007 10 13393 56 41611 11 130.2 0 v4.128M 1G 19771 89 37679 20 15206 11 13466 57 41808 11 130.2 1 ext3.128M 1G 21236 99 37258 22 11357 4 13460 56 41748 6 120.0 0 ext3.128M 1G 20821 99 36838 23 12176 5 13154 55 40671 6 120.7 0 ext3.128M 1G 20755 99 37032 24 12069 4 12908 54 40851 5 120.2 0 ext3.128M 1G 20651 99 37094 24 11817 5 13038 54 40842 6 121.3 0 ext3.128M 1G 20928 99 37300 23 12287 4 13067 55 41404 6 120.1 0 ------Sequential Create------ --------Random Create-------- -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete-- files:max:min /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP v4.128M 10 18503 100 +++++ +++ 9488 99 10158 99 +++++ +++ 11635 99 v4.128M 10 19760 99 +++++ +++ 9696 99 10441 100 +++++ +++ 11831 99 v4.128M 10 19583 100 +++++ +++ 9672 100 10597 99 +++++ +++ 11846 100 v4.128M 10 19720 100 +++++ +++ 9577 99 10126 100 +++++ +++ 11924 100 v4.128M 10 19682 100 +++++ +++ 9683 100 10461 100 +++++ +++ 11834 100 ext3.128M 10 3279 97 +++++ +++ +++++ +++ 3406 100 +++++ +++ 8951 95 ext3.128M 10 3303 98 +++++ +++ +++++ +++ 3423 99 +++++ +++ 8558 96 ext3.128M 10 3317 98 +++++ +++ +++++ +++ 3402 100 +++++ +++ 8721 93 ext3.128M 10 3325 98 +++++ +++ +++++ +++ 3390 100 +++++ +++ 9242 100 ext3.128M 10 3315 97 +++++ +++ +++++ +++ 3439 100 +++++ +++ 8896 96 </pre> <pre> ./bonnie++ -f -d . 
-s 3072 -n 10:100000:10:10 -x 1
Version  1.03       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
v4               3G           37579  19 15657  11           41531  11 105.8   0
v4               3G           37993  20 15478  11           41632  11 105.4   0
ext3             3G           35221  22 10987   4           41105   6  90.9   0
ext3             3G           35099  22 11517   4           41416   6  90.7   0
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
files:max:min        /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
v4   10:100000:10/10  570  39   746  17  1435  23   513  40   104   2   951  15
v4   10:100000:10/10  566  40   765  17  1385  23   509  41   104   2   904  14
ext3 10:100000:10/10  221   8   364   4   853   4   204   7    99   1   306   2
ext3 10:100000:10/10  221   7   368   4   839   5   206   7    91   1   309   2
</pre>
<hr>
<a name="grant"></a>
Benchmarks performed by <a href="mailto:mine0057@mrs.umn.edu">Grant Miner</a>. He used the <a href="http://epoxy.mrs.umn.edu/~minerg/fstests/bench.scm">bench.scm</a> script (requires <a href="http://www.scsh.net/">scsh</a>). Results (copied from <a href="http://epoxy.mrs.umn.edu/~minerg/fstests/results.html">http://epoxy.mrs.umn.edu/~minerg/fstests/results.html</a>):
<p>2.6.0-test3</p>
<p>mkfs ran with default options</p>
<p>Each test has three columns: the first gives the canonical name of the test together with the time the test took in seconds, the second gives the system CPU time, and the third the user CPU time. The summary columns are "total" (total time), "sys" (total system CPU time), "usr" (total user CPU time), and "total cpu" (the sum of total sys and total usr time).
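</p>

<p>To make the summary arithmetic concrete, the "sys", "usr" and "total cpu" figures can be recomputed from the per-test columns. A minimal sketch in Python, using the values from the reiserfs row of the table that follows (the variable names are ours, not part of bench.scm):</p>

```python
# Recompute the summary columns from the per-test (sys, usr) pairs of the
# reiserfs row (tests: bigdir, cp, cp2, cp3, cp4, cp5, rm, rm2, rm3, sync).
# All values are CPU seconds as reported in the table.
reiserfs = [(12.22, 0.76), (10.72, 0.45), (10.82, 0.43), (11.03, 0.43),
            (11.13, 0.43), (11.31, 0.45), (3.74, 0.07), (3.36, 0.09),
            (3.50, 0.09), (0.03, 0.00)]

total_sys = round(sum(s for s, _ in reiserfs), 2)  # total system CPU time
total_usr = round(sum(u for _, u in reiserfs), 2)  # total user CPU time
total_cpu = round(total_sys + total_usr, 2)        # the "total cpu" column

print(total_sys, total_usr, total_cpu)  # 77.86 3.2 81.06
```

<p>The three results match the reiserfs summary columns in the table (sys 77.86, usr 3.2, total cpu 81.06).</p> <p>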
</p> <p><b>all values are in seconds thus lower is better</b></p> <table border cellspacing=0 cellpadding=5> <caption>Filesystem Performance</caption> <colgroup> <col> <col bgcolor="gray"> </colgroup> <tr> <th>fs</th> <td bgcolor="lightgray">bigdir</td> <td>sys</td> <td>usr</td> <td bgcolor="lightgray">cp</td> <td>sys</td> <td>usr</td> <td bgcolor="lightgray">cp2</td> <td>sys</td> <td>usr</td> <td bgcolor="lightgray">cp3</td> <td>sys</td> <td>usr</td> <td bgcolor="lightgray">cp4</td> <td>sys</td> <td>usr</td> <td bgcolor="lightgray">cp5</td> <td>sys</td> <td>usr</td> <td bgcolor="lightgray">rm</td> <td>sys</td> <td>usr</td> <td bgcolor="lightgray">rm2</td> <td>sys</td> <td>usr</td> <td bgcolor="lightgray">rm3</td> <td>sys</td> <td>usr</td> <td bgcolor="lightgray">sync</td> <td>sys</td> <td>usr</td> <td bgcolor="lightgray">total</td> <td>sys</td> <td>usr</td> <td bgcolor="lightgray">total cpu</td> <th>fs</th> </tr> <tr> <th>reiserfs</th> <td bgcolor="lightgray">40.03</td> <td>12.22</td> <td>0.76</td> <td bgcolor="lightgray">77.75</td> <td>10.72</td> <td>0.45</td> <td bgcolor="lightgray">62.9</td> <td>10.82</td> <td>0.43</td> <td bgcolor="lightgray">60.26</td> <td>11.03</td> <td>0.43</td> <td bgcolor="lightgray">61.33</td> <td>11.13</td> <td>0.43</td> <td bgcolor="lightgray">66.08</td> <td>11.31</td> <td>0.45</td> <td bgcolor="lightgray">10.86</td> <td>3.74</td> <td>0.07</td> <td bgcolor="lightgray">4.62</td> <td>3.36</td> <td>0.09</td> <td bgcolor="lightgray">8.22</td> <td>3.5</td> <td>0.09</td> <td bgcolor="lightgray">1.78</td> <td>0.03</td> <td>0.</td> <td bgcolor="lightgray">393.83</td> <td>77.86</td> <td>3.2</td> <td bgcolor="lightgray">81.06</td> <th>reiserfs</th> </tr> <tr> <th>jfs</th> <td bgcolor="lightgray">47.2</td> <td>8.9</td> <td>0.77</td> <td bgcolor="lightgray">109.75</td> <td>5.5</td> <td>0.3</td> <td bgcolor="lightgray">110.71</td> <td>5.49</td> <td>0.35</td> <td bgcolor="lightgray">114.69</td> <td>5.6</td> <td>0.29</td> <td 
bgcolor="lightgray">117.97</td> <td>5.65</td> <td>0.35</td> <td bgcolor="lightgray">125.48</td> <td>5.82</td> <td>0.29</td> <td bgcolor="lightgray">38.68</td> <td>0.74</td> <td>0.05</td> <td bgcolor="lightgray">16.25</td> <td>1.08</td> <td>0.07</td> <td bgcolor="lightgray">37.46</td> <td>0.74</td> <td>0.04</td> <td bgcolor="lightgray">0.07</td> <td>0.</td> <td>0.</td> <td bgcolor="lightgray">718.26</td> <td>39.52</td> <td>2.51</td> <td bgcolor="lightgray">42.03</td> <th>jfs</th> </tr> <tr> <th>xfs</th> <td bgcolor="lightgray">44.77</td> <td>13.3</td> <td>0.94</td> <td bgcolor="lightgray">105.36</td> <td>13.33</td> <td>0.53</td> <td bgcolor="lightgray">110.27</td> <td>14.36</td> <td>0.5</td> <td bgcolor="lightgray">110.17</td> <td>14.37</td> <td>0.51</td> <td bgcolor="lightgray">111.03</td> <td>14.43</td> <td>0.53</td> <td bgcolor="lightgray">118.84</td> <td>14.87</td> <td>0.55</td> <td bgcolor="lightgray">31.85</td> <td>6.44</td> <td>0.15</td> <td bgcolor="lightgray">15.2</td> <td>5.45</td> <td>0.14</td> <td bgcolor="lightgray">34.32</td> <td>5.87</td> <td>0.14</td> <td bgcolor="lightgray">0.03</td> <td>0.</td> <td>0.</td> <td bgcolor="lightgray">681.84</td> <td>102.42</td> <td>3.99</td> <td bgcolor="lightgray">106.41</td> <th>xfs</th> </tr> <tr> <th>reiser4</th> <td bgcolor="lightgray">33.51</td> <td>10.85</td> <td>0.69</td> <td bgcolor="lightgray">33.9</td> <td>10.65</td> <td>0.65</td> <td bgcolor="lightgray">32.9</td> <td>10.79</td> <td>0.67</td> <td bgcolor="lightgray">34.</td> <td>10.87</td> <td>0.65</td> <td bgcolor="lightgray">33.62</td> <td>10.87</td> <td>0.69</td> <td bgcolor="lightgray">31.31</td> <td>10.83</td> <td>0.76</td> <td bgcolor="lightgray">17.45</td> <td>4.07</td> <td>0.3</td> <td bgcolor="lightgray">11.54</td> <td>4.49</td> <td>0.3</td> <td bgcolor="lightgray">13.08</td> <td>4.27</td> <td>0.27</td> <td bgcolor="lightgray">0.52</td> <td>0.</td> <td>0.</td> <td bgcolor="lightgray">241.83</td> <td>77.69</td> <td>4.98</td> <td 
bgcolor="lightgray">82.67</td> <th>reiser4</th> </tr>
<tr> <th>ext3</th> <td bgcolor="lightgray">38.79</td> <td>9.35</td> <td>0.7</td> <td bgcolor="lightgray">91.57</td> <td>7.21</td> <td>0.36</td> <td bgcolor="lightgray">62.6</td> <td>7.44</td> <td>0.36</td> <td bgcolor="lightgray">62.74</td> <td>7.5</td> <td>0.37</td> <td bgcolor="lightgray">60.62</td> <td>7.52</td> <td>0.34</td> <td bgcolor="lightgray">69.82</td> <td>7.59</td> <td>0.39</td> <td bgcolor="lightgray">26.21</td> <td>1.67</td> <td>0.05</td> <td bgcolor="lightgray">8.73</td> <td>1.66</td> <td>0.04</td> <td bgcolor="lightgray">13.79</td> <td>1.63</td> <td>0.06</td> <td bgcolor="lightgray">4.76</td> <td>0.01</td> <td>0.</td> <td bgcolor="lightgray">439.63</td> <td>51.58</td> <td>2.67</td> <td bgcolor="lightgray">54.25</td> <th>ext3</th> </tr>
<tr> <th>ext2</th> <td bgcolor="lightgray">32.78</td> <td>7.61</td> <td>0.64</td> <td bgcolor="lightgray">37.28</td> <td>5.24</td> <td>0.34</td> <td bgcolor="lightgray">43.55</td> <td>5.34</td> <td>0.35</td> <td bgcolor="lightgray">45.41</td> <td>5.34</td> <td>0.37</td> <td bgcolor="lightgray">47.72</td> <td>5.48</td> <td>0.34</td> <td bgcolor="lightgray">50.5</td> <td>5.41</td> <td>0.32</td> <td bgcolor="lightgray">16.28</td> <td>0.67</td> <td>0.06</td> <td bgcolor="lightgray">7.54</td> <td>0.66</td> <td>0.05</td> <td bgcolor="lightgray">15.31</td> <td>0.71</td> <td>0.05</td> <td bgcolor="lightgray">0.24</td> <td>0.</td> <td>0.</td> <td bgcolor="lightgray">296.61</td> <td>36.46</td> <td>2.52</td> <td bgcolor="lightgray">38.98</td> <th>ext2</th> </tr>
</table>
<hr>
<address><a href="mailto:reiser@namesys.com">Hans Reiser</a></address>
<!-- Created: Sat Aug 23 00:28:46 MSD 2003 -->
<!-- hhmts start --> Last modified: Thu Nov 20 17:51:10 MSK 2003 <!-- hhmts end -->
</body>
</html>

[[category:ReiserFS]] 432633776b9f7027663194b93b2e5284faeafa95 News Archive 0 171 4372 4299 2020-07-30T18:34:08Z Chris goe 2 archived

2018-06-27 - Reiser4 for Linux-4.17 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-4.x/ released]

2018-04-05 - Reiser4 for Linux-4.16
[https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-4.x/ released] 2018-03-28 - Reiser4 for Linux-4.15 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-4.x/ released] 2017-11-26 - [https://www.spinics.net/lists/reiserfs-devel/msg05650.html reiser4: port for Linux-4.14] released 2017-09-06 - Reiser4 for Linux-4.13 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-4.x/ released] 2017-08-14 - [https://www.spinics.net/lists/reiserfs-devel/msg05603.html Reiser4 for Linux-4.12] released 2017-06-03 - [https://www.spinics.net/lists/reiserfs-devel/msg05519.html Reiser4: Port for Linux-4.11] released 2017-02-21 - [https://www.spinics.net/lists/reiserfs-devel/msg05385.html reiser4: port for Linux-4.10] released 2016-12-17 - Reiser4 for Linux-4.9 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-4.x/ released] 2016-11-16 - Reiser4 for Linux-4.8 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-4.x/ released] 2016-09-24 - Reiser4 mirrors and failover [https://www.spinics.net/lists/reiserfs-devel/msg05174.html announced] 2016-09-24 - [http://www.spinics.net/lists/reiserfs-devel/msg05173.html Edward created] Git trees for [[Reiser4progs|libaal and reiser4progs]] and for [https://github.com/edward6/reiser4 fs/reiser4] - yay! 
2016-08-09 - Reiser4 for Linux-4.7 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-4.x/ released] 2016-06-06 - [[reiserfsprogs]] v3.6.25 has been [http://www.spinics.net/lists/reiserfs-devel/msg05096.html released] 2016-05-20 - Reiser4 for Linux-4.6 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-4.x/ released] 2016-05-06 - Reiser4 for Linux-4.5.3 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-4.x/ released] 2016-03-30 - Reiser4 for Linux-4.5 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-4.x/ released] 2016-01-13 - Reiser4 for Linux-4.4 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-4.x/ released] 2015-11-16 - Reiser4 for Linux-4.3 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-4.x/ released] 2015-09-22 - Reiser4 for Linux-4.2 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-4.x/ released] 2015-08-31 - Reiser4 format 4.0.1 [https://marc.info/?l=reiserfs-devel&m=144103447029219&w=2 released] (with [[Reiser4 checksums|(meta)data checksums]]) 2015-08-07 - Reiser4 for Linux-4.1 [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-4.x/ released] 2015-05-05 - Reiser4 for Linux-4.0 [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-4.x/ released] 2015-04-20 - Reiser4 for Linux-3.19 [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-3.x/ released] 2014-10-26 - Reiser4 for Linux-3.17 [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-3.x/ released] 2014-10-13 - Reiser4 for Linux-3.16 [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-3.x/ released] 2014-08-24 - Reiser4 for Linux-3.15 [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-3.x/ released] 2014-06-29 - [http://sourceforge.net/projects/reiser4/files/reiser4-utils/libaal libaal-1.0.6] and [http://sourceforge.net/projects/reiser4/files/reiser4-utils/reiser4progs reiser4progs-1.0.9] released. 
2014-05-09 - Glenn [https://www.spinics.net/lists/reiserfs-devel/msg03897.html provided] new [https://build.opensuse.org/package/show/home:doiggl/kernel-reiser4 Reiser4-enabled openSUSE kernels] 2014-05-07 - Reiser4 for Linux-3.14 [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-3.x/ released] 2014-05-06 - Ivan Shapovalov [https://marc.info/?l=reiserfs-devel&m=139935424207357&w=2 announced] discard support in Reiser4 2014-04-23 - Jeff Mahoney published a big [https://www.spinics.net/lists/reiserfs-devel/msg03814.html reiserfs cleanup patchset] 2014-03-11 - Different transaction models in Reiser4 [https://marc.info/?l=reiserfs-devel&m=139449965000686&w=2 announced] 2014-02-05 - Reiser4 for Linux-3.13 [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-3.x/ released] 2013-12-20 - Reiser4 for Linux-3.12 [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-3.x/ released] 2013-09-23 - Reiser4 for Linux-3.11 [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-3.x/ released] 2013-07-16 - Reiser4 for Linux-3.10 [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-3.x/ released] 2013-07-01 - [https://www.spinics.net/lists/reiserfs-devel/msg03466.html reiserfsprogs 3.6.23 released] 2013-05-25 - Reiser4 for Linux-3.9.2 [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-3.x/ released] 2013-05-04 - reiser4progs-1.0.8 [http://sourceforge.net/projects/reiser4/files/reiser4-utils/reiser4progs/ released] 2013-04-05 - Reiser4 for Linux-3.8.5 [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-3.x/ released] 2013-01-07 - Reiser4 for Linux-3.7.1 [https://marc.info/?l=reiserfs-devel&m=135750493615146&w=2 released]. 2012-10-28 - Reiser4 for Linux-3.6.4 [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-3.x/ released]. 
2012-10-16 - [https://www.spinics.net/lists/reiserfs-devel/msg03276.html reiserfsprogs 3.6.22 released] 2012-10-10 - Jeff Mahoney [https://marc.info/?l=reiserfs-devel&m=134988188217051&w=2 created] a [https://www.kernel.org/pub/linux/kernel/people/jeffm/reiserfsprogs/ new home for reiserfsprogs] and a [https://git.kernel.org/cgit/linux/kernel/git/jeffm/reiserfsprogs.git Git repository] too! 2012-09-11 - [https://www.spinics.net/lists/reiserfs-devel/msg03233.html Glenn] posted [https://build.opensuse.org/package/show/home:doiggl/kernel-reiser4 Reiser4-enabled kernel RPMs] for openSUSE 12.2. 2012-09-08 - [https://www.spinics.net/lists/reiserfs-devel/msg03230.html Reiser4 for Linux-3.5.3] [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-3.x/ released]. 2011-04-08 - [https://www.spinics.net/lists/reiserfs-devel/msg02830.html Glenn] posted [https://build.opensuse.org/package/show/home:doiggl/kernel-reiser4 Reiser4-enabled kernel RPMs] for openSUSE 11.4. 2011-04-03 - Patches for [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-2.6/ 2.6.38] have been released. 2011-01-26 - Patches for [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-2.6/ 2.6.37] have been released. 2010-11-20 - Edward released [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-2.6/ Reiser4 patches for Linux 2.6.36] - Thanks! 2010-10-18 - [https://www.spinics.net/lists/reiserfs-devel/msg02494.html Viji V Nair] posted <s>[https://viji.fedorapeople.org/reiser4/F13/x86_64/ Reiser4-enabled kernel RPMs] for Fedora 13.</s> - it's gone :( :::Please test and post results to the [[Mailinglists|list]]! 2010-10-14 - [https://www.spinics.net/lists/reiserfs-devel/msg02493.html Glenn] posted [https://build.opensuse.org/package/show/home:doiggl/kernel-reiser4 Reiser4-enabled kernel RPMs] for OpenSUSE 11.3. :::Please test and post results to the [[Mailinglists|list]]! 
2010-08-04 - [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-2.6/ Reiser4 patches for Linux 2.6.35] - Please [https://www.spinics.net/lists/reiserfs-devel/msg02373.html test]! 2010-05-26 - Apparently [http://chichkin_i.zelnet.ru Edward] released [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-2.6/ Reiser4 patches for Linux 2.6.34] ::: - Testers welcome! :-) 2010-04-27 - A benchmark of reiser4 was published on [http://www.phoronix.com/scan.php?page=article&item=reiser4_benchmarks&num=1 Phoronix]. 2010-03-04 - [http://chichkin_i.zelnet.ru Edward] released [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-2.6/ Reiser4 patches for Linux 2.6.33] - Thanks! 2010-02-15 - The [http://git.zen-kernel.org/?p=kernel/zen.git;a=shortlog;h=refs/heads/reiser4 Zen Kernel] works for [http://nerdbynature.de/benchmarks/v40z/2010-02-15/bonnie.html 2.6.33 too] - hey! :-) 2009-11-24 - [http://zen-kernel.org/ zen-kernel.org] <small>(also hosting the [[Reiser4_patchsets|MMOTM kernel]])</small> ships with Reiser4 :::and [https://www.spinics.net/lists/reiserfs-devel/msg01999.html is said to work] for [http://downloads.zen-kernel.org/2.6.32/ 2.6.32] 2009-11-10 - [http://www.phoronix.com/scan.php?page=news_item&px=NzY4OQ Reiser4 May Go For Mainline Inclusion In 2010] 2009-10-26 - <s>Viji V Nair [https://www.spinics.net/lists/reiserfs-devel/msg01957.html released] Fedora 11 kernel [http://fedoraproject.org/wiki/User:Viji#Fedora_kernel_rpm_with_reiser4_support RPMs with Reiser4 support]</s> - it's gone :( 2009-10-05 - Edward [https://marc.info/?l=reiserfs-devel&m=125470523000355&w=2 released] Reiser4 [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-2.6/ patches for 2.6.31] - please test! 2009-09-11 - Reiser4 patches for 2.6.30 are [http://kerneltrap.org/mailarchive/reiserfs-devel/2009/9/11/6399383 said to work] for :::the recently released 2.6.31 kernel as well. 
2009-06-22 - [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-2.6/ Reiser4 patches for Linux 2.6.30 released!] 2009-04-25 - [[TODO]] list updated, we're now down to 5 open issues. 2009-01-17 - [https://lwn.net/Articles/315509/ Reiser4progs-1.0.7 released] 2009-01-10 - [https://lwn.net/Articles/314451/ Reiserfsprogs-3.6.21 released] 2009-01-08 - Reiser4 kernel packages for [http://download.opensuse.org/repositories/drivers://filesystems/openSUSE_11.0/ openSUSE 11.0] and [http://download.opensuse.org/repositories/drivers://filesystems/openSUSE_11.1/ 11.1] have been built. 2007-04-25 - [http://kerneltrap.org/node/8102 Reiser4's future] 2005-05-15 - The [http://grml.org/ grml] recovery CD comes with [http://grml.org/changelogs/README-0.4.txt Reiser4 support] 48b85604dac7383a218027583a37f8c152647dac 4299 4147 2018-09-18T18:59:43Z Chris goe 2 moved from main page 2016-12-17 - Reiser4 for Linux-4.9 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-4.x/ released] 2016-11-16 - Reiser4 for Linux-4.8 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-4.x/ released] 2016-09-24 - Reiser4 mirrors and failover [https://www.spinics.net/lists/reiserfs-devel/msg05174.html announced] 2016-09-24 - [http://www.spinics.net/lists/reiserfs-devel/msg05173.html Edward created] Git trees for [[Reiser4progs|libaal and reiser4progs]] and for [https://github.com/edward6/reiser4 fs/reiser4] - yay! 
2016-08-09 - Reiser4 for Linux-4.7 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-4.x/ released] 2016-06-06 - [[reiserfsprogs]] v3.6.25 has been [http://www.spinics.net/lists/reiserfs-devel/msg05096.html released] 2016-05-20 - Reiser4 for Linux-4.6 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-4.x/ released] 2016-05-06 - Reiser4 for Linux-4.5.3 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-4.x/ released] 2016-03-30 - Reiser4 for Linux-4.5 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-4.x/ released] 2016-01-13 - Reiser4 for Linux-4.4 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-4.x/ released] 2015-11-16 - Reiser4 for Linux-4.3 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-4.x/ released] 2015-09-22 - Reiser4 for Linux-4.2 [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-4.x/ released] 2015-08-31 - Reiser4 format 4.0.1 [https://marc.info/?l=reiserfs-devel&m=144103447029219&w=2 released] (with [[Reiser4 checksums|(meta)data checksums]]) 2015-08-07 - Reiser4 for Linux-4.1 [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-4.x/ released] 2015-05-05 - Reiser4 for Linux-4.0 [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-4.x/ released] 2015-04-20 - Reiser4 for Linux-3.19 [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-3.x/ released] 2014-10-26 - Reiser4 for Linux-3.17 [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-3.x/ released] 2014-10-13 - Reiser4 for Linux-3.16 [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-3.x/ released] 2014-08-24 - Reiser4 for Linux-3.15 [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-3.x/ released] 2014-06-29 - [http://sourceforge.net/projects/reiser4/files/reiser4-utils/libaal libaal-1.0.6] and [http://sourceforge.net/projects/reiser4/files/reiser4-utils/reiser4progs reiser4progs-1.0.9] released. 
2014-05-09 - Glenn [https://www.spinics.net/lists/reiserfs-devel/msg03897.html provided] new [https://build.opensuse.org/package/show/home:doiggl/kernel-reiser4 Reiser4-enabled openSUSE kernels] 2014-05-07 - Reiser4 for Linux-3.14 [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-3.x/ released] 2014-05-06 - Ivan Shapovalov [https://marc.info/?l=reiserfs-devel&m=139935424207357&w=2 announced] discard support in Reiser4 2014-04-23 - Jeff Mahoney published a big [https://www.spinics.net/lists/reiserfs-devel/msg03814.html reiserfs cleanup patchset] 2014-03-11 - Different transaction models in Reiser4 [https://marc.info/?l=reiserfs-devel&m=139449965000686&w=2 announced] 2014-02-05 - Reiser4 for Linux-3.13 [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-3.x/ released] 2013-12-20 - Reiser4 for Linux-3.12 [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-3.x/ released] 2013-09-23 - Reiser4 for Linux-3.11 [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-3.x/ released] 2013-07-16 - Reiser4 for Linux-3.10 [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-3.x/ released] 2013-07-01 - [https://www.spinics.net/lists/reiserfs-devel/msg03466.html reiserfsprogs 3.6.23 released] 2013-05-25 - Reiser4 for Linux-3.9.2 [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-3.x/ released] 2013-05-04 - reiser4progs-1.0.8 [http://sourceforge.net/projects/reiser4/files/reiser4-utils/reiser4progs/ released] 2013-04-05 - Reiser4 for Linux-3.8.5 [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-3.x/ released] 2013-01-07 - Reiser4 for Linux-3.7.1 [https://marc.info/?l=reiserfs-devel&m=135750493615146&w=2 released]. 2012-10-28 - Reiser4 for Linux-3.6.4 [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-3.x/ released]. 
2012-10-16 - [https://www.spinics.net/lists/reiserfs-devel/msg03276.html reiserfsprogs 3.6.22 released] 2012-10-10 - Jeff Mahoney [https://marc.info/?l=reiserfs-devel&m=134988188217051&w=2 created] a [https://www.kernel.org/pub/linux/kernel/people/jeffm/reiserfsprogs/ new home for reiserfsprogs] and a [https://git.kernel.org/cgit/linux/kernel/git/jeffm/reiserfsprogs.git Git repository] too! 2012-09-11 - [https://www.spinics.net/lists/reiserfs-devel/msg03233.html Glenn] posted [https://build.opensuse.org/package/show/home:doiggl/kernel-reiser4 Reiser4-enabled kernel RPMs] for openSUSE 12.2. 2012-09-08 - [https://www.spinics.net/lists/reiserfs-devel/msg03230.html Reiser4 for Linux-3.5.3] [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-3.x/ released]. 2011-04-08 - [https://www.spinics.net/lists/reiserfs-devel/msg02830.html Glenn] posted [https://build.opensuse.org/package/show/home:doiggl/kernel-reiser4 Reiser4-enabled kernel RPMs] for openSUSE 11.4. 2011-04-03 - Patches for [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-2.6/ 2.6.38] have been released. 2011-01-26 - Patches for [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-2.6/ 2.6.37] have been released. 2010-11-20 - Edward released [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-2.6/ Reiser4 patches for Linux 2.6.36] - Thanks! 2010-10-18 - [https://www.spinics.net/lists/reiserfs-devel/msg02494.html Viji V Nair] posted <s>[https://viji.fedorapeople.org/reiser4/F13/x86_64/ Reiser4-enabled kernel RPMs] for Fedora 13.</s> - it's gone :( :::Please test and post results to the [[Mailinglists|list]]! 2010-10-14 - [https://www.spinics.net/lists/reiserfs-devel/msg02493.html Glenn] posted [https://build.opensuse.org/package/show/home:doiggl/kernel-reiser4 Reiser4-enabled kernel RPMs] for OpenSUSE 11.3. :::Please test and post results to the [[Mailinglists|list]]! 
2010-08-04 - [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-2.6/ Reiser4 patches for Linux 2.6.35] - Please [https://www.spinics.net/lists/reiserfs-devel/msg02373.html test]! 2010-05-26 - Apparently [http://chichkin_i.zelnet.ru Edward] released [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-2.6/ Reiser4 patches for Linux 2.6.34] ::: - Testers welcome! :-) 2010-04-27 - A benchmark of reiser4 was published on [http://www.phoronix.com/scan.php?page=article&item=reiser4_benchmarks&num=1 Phoronix]. 2010-03-04 - [http://chichkin_i.zelnet.ru Edward] released [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-2.6/ Reiser4 patches for Linux 2.6.33] - Thanks! 2010-02-15 - The [http://git.zen-kernel.org/?p=kernel/zen.git;a=shortlog;h=refs/heads/reiser4 Zen Kernel] works for [http://nerdbynature.de/benchmarks/v40z/2010-02-15/bonnie.html 2.6.33 too] - hey! :-) 2009-11-24 - [http://zen-kernel.org/ zen-kernel.org] <small>(also hosting the [[Reiser4_patchsets|MMOTM kernel]])</small> ships with Reiser4 :::and [https://www.spinics.net/lists/reiserfs-devel/msg01999.html is said to work] for [http://downloads.zen-kernel.org/2.6.32/ 2.6.32] 2009-11-10 - [http://www.phoronix.com/scan.php?page=news_item&px=NzY4OQ Reiser4 May Go For Mainline Inclusion In 2010] 2009-10-26 - <s>Viji V Nair [https://www.spinics.net/lists/reiserfs-devel/msg01957.html released] Fedora 11 kernel [http://fedoraproject.org/wiki/User:Viji#Fedora_kernel_rpm_with_reiser4_support RPMs with Reiser4 support]</s> - it's gone :( 2009-10-05 - Edward [https://marc.info/?l=reiserfs-devel&m=125470523000355&w=2 released] Reiser4 [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-2.6/ patches for 2.6.31] - please test! 2009-09-11 - Reiser4 patches for 2.6.30 are [http://kerneltrap.org/mailarchive/reiserfs-devel/2009/9/11/6399383 said to work] for :::the recently released 2.6.31 kernel as well. 
2009-06-22 - [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-2.6/ Reiser4 patches for Linux 2.6.30 released!] 2009-04-25 - [[TODO]] list updated, we're now down to 5 open issues. 2009-01-17 - [https://lwn.net/Articles/315509/ Reiser4progs-1.0.7 released] 2009-01-10 - [https://lwn.net/Articles/314451/ Reiserfsprogs-3.6.21 released] 2009-01-08 - Reiser4 kernel packages for [http://download.opensuse.org/repositories/drivers://filesystems/openSUSE_11.0/ openSUSE 11.0] and [http://download.opensuse.org/repositories/drivers://filesystems/openSUSE_11.1/ 11.1] have been built. 2007-04-25 - [http://kerneltrap.org/node/8102 Reiser4's future] 2005-05-15 - The [http://grml.org/ grml] recovery CD comes with [http://grml.org/changelogs/README-0.4.txt Reiser4 support] 64947877a4f798424357bbefa1e25834dba914c9 4147 4064 2016-06-03T01:57:50Z Chris goe 2 adding 2014 2014-10-26 - Reiser4 for Linux-3.17 [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-3.x/ released] 2014-10-13 - Reiser4 for Linux-3.16 [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-3.x/ released] 2014-08-24 - Reiser4 for Linux-3.15 [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-3.x/ released] 2014-06-29 - [http://sourceforge.net/projects/reiser4/files/reiser4-utils/libaal libaal-1.0.6] and [http://sourceforge.net/projects/reiser4/files/reiser4-utils/reiser4progs reiser4progs-1.0.9] released. 
2014-05-09 - Glenn [https://www.spinics.net/lists/reiserfs-devel/msg03897.html provided] new [https://build.opensuse.org/package/show/home:doiggl/kernel-reiser4 Reiser4-enabled openSUSE kernels] 2014-05-07 - Reiser4 for Linux-3.14 [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-3.x/ released] 2014-05-06 - Ivan Shapovalov [https://marc.info/?l=reiserfs-devel&m=139935424207357&w=2 announced] discard support in Reiser4 2014-04-23 - Jeff Mahoney published a big [https://www.spinics.net/lists/reiserfs-devel/msg03814.html reiserfs cleanup patchset] 2014-03-11 - Different transaction models in Reiser4 [https://marc.info/?l=reiserfs-devel&m=139449965000686&w=2 announced] 2014-02-05 - Reiser4 for Linux-3.13 [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-3.x/ released] 2013-12-20 - Reiser4 for Linux-3.12 [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-3.x/ released] 2013-09-23 - Reiser4 for Linux-3.11 [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-3.x/ released] 2013-07-16 - Reiser4 for Linux-3.10 [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-3.x/ released] 2013-07-01 - [https://www.spinics.net/lists/reiserfs-devel/msg03466.html reiserfsprogs 3.6.23 released] 2013-05-25 - Reiser4 for Linux-3.9.2 [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-3.x/ released] 2013-05-04 - reiser4progs-1.0.8 [http://sourceforge.net/projects/reiser4/files/reiser4-utils/reiser4progs/ released] 2013-04-05 - Reiser4 for Linux-3.8.5 [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-3.x/ released] 2013-01-07 - Reiser4 for Linux-3.7.1 [https://marc.info/?l=reiserfs-devel&m=135750493615146&w=2 released]. 2012-10-28 - Reiser4 for Linux-3.6.4 [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-3.x/ released]. 
2012-10-16 - [https://www.spinics.net/lists/reiserfs-devel/msg03276.html reiserfsprogs 3.6.22 released] 2012-10-10 - Jeff Mahoney [https://marc.info/?l=reiserfs-devel&m=134988188217051&w=2 created] a [https://www.kernel.org/pub/linux/kernel/people/jeffm/reiserfsprogs/ new home for reiserfsprogs] and a [https://git.kernel.org/cgit/linux/kernel/git/jeffm/reiserfsprogs.git Git repository] too! 2012-09-11 - [https://www.spinics.net/lists/reiserfs-devel/msg03233.html Glenn] posted [https://build.opensuse.org/package/show/home:doiggl/kernel-reiser4 Reiser4-enabled kernel RPMs] for openSUSE 12.2. 2012-09-08 - [https://www.spinics.net/lists/reiserfs-devel/msg03230.html Reiser4 for Linux-3.5.3] [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-3.x/ released]. 2011-04-08 - [https://www.spinics.net/lists/reiserfs-devel/msg02830.html Glenn] posted [https://build.opensuse.org/package/show/home:doiggl/kernel-reiser4 Reiser4-enabled kernel RPMs] for openSUSE 11.4. 2011-04-03 - Patches for [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-2.6/ 2.6.38] have been released. 2011-01-26 - Patches for [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-2.6/ 2.6.37] have been released. 2010-11-20 - Edward released [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-2.6/ Reiser4 patches for Linux 2.6.36] - Thanks! 2010-10-18 - [https://www.spinics.net/lists/reiserfs-devel/msg02494.html Viji V Nair] posted <s>[https://viji.fedorapeople.org/reiser4/F13/x86_64/ Reiser4-enabled kernel RPMs] for Fedora 13.</s> - it's gone :( :::Please test and post results to the [[Mailinglists|list]]! 2010-10-14 - [https://www.spinics.net/lists/reiserfs-devel/msg02493.html Glenn] posted [https://build.opensuse.org/package/show/home:doiggl/kernel-reiser4 Reiser4-enabled kernel RPMs] for OpenSUSE 11.3. :::Please test and post results to the [[Mailinglists|list]]! 
2010-08-04 - [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-2.6/ Reiser4 patches for Linux 2.6.35] - Please [https://www.spinics.net/lists/reiserfs-devel/msg02373.html test]!
2010-05-26 - Apparently [http://chichkin_i.zelnet.ru Edward] released [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-2.6/ Reiser4 patches for Linux 2.6.34]
::: - Testers welcome! :-)
2010-04-27 - A benchmark of reiser4 was published on [http://www.phoronix.com/scan.php?page=article&item=reiser4_benchmarks&num=1 Phoronix].
2010-03-04 - [http://chichkin_i.zelnet.ru Edward] released [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-2.6/ Reiser4 patches for Linux 2.6.33] - Thanks!
2010-02-15 - The [http://git.zen-kernel.org/?p=kernel/zen.git;a=shortlog;h=refs/heads/reiser4 Zen Kernel] works for [http://nerdbynature.de/benchmarks/v40z/2010-02-15/bonnie.html 2.6.33 too] - hey! :-)
2009-11-24 - [http://zen-kernel.org/ zen-kernel.org] <small>(also hosting the [[Reiser4_patchsets|MMOTM kernel]])</small> ships with Reiser4
:::and [https://www.spinics.net/lists/reiserfs-devel/msg01999.html is said to work] for [http://downloads.zen-kernel.org/2.6.32/ 2.6.32]
2009-11-10 - [http://www.phoronix.com/scan.php?page=news_item&px=NzY4OQ Reiser4 May Go For Mainline Inclusion In 2010]
2009-10-26 - <s>Viji V Nair [https://www.spinics.net/lists/reiserfs-devel/msg01957.html released] Fedora 11 kernel [http://fedoraproject.org/wiki/User:Viji#Fedora_kernel_rpm_with_reiser4_support RPMs with Reiser4 support]</s> - it's gone :(
2009-10-05 - Edward [https://marc.info/?l=reiserfs-devel&m=125470523000355&w=2 released] Reiser4 [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-2.6/ patches for 2.6.31] - please test!
2009-09-11 - Reiser4 patches for 2.6.30 are [http://kerneltrap.org/mailarchive/reiserfs-devel/2009/9/11/6399383 said to work] for
:::the recently released 2.6.31 kernel as well.
2009-06-22 - [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-2.6/ Reiser4 patches for Linux 2.6.30 released!]
2009-04-25 - [[TODO]] list updated, we're now down to 5 open issues.
2009-01-17 - [https://lwn.net/Articles/315509/ Reiser4progs-1.0.7 released]
2009-01-10 - [https://lwn.net/Articles/314451/ Reiserfsprogs-3.6.21 released]
2009-01-08 - Reiser4 kernel packages for [http://download.opensuse.org/repositories/drivers://filesystems/openSUSE_11.0/ openSUSE 11.0] and [http://download.opensuse.org/repositories/drivers://filesystems/openSUSE_11.1/ 11.1] have been built.
2007-04-25 - [http://kerneltrap.org/node/8102 Reiser4's future]
2005-05-15 - The [http://grml.org/ grml] recovery CD comes with [http://grml.org/changelogs/README-0.4.txt Reiser4 support]
== Precise real-time discard in Reiser4 for SSD devices ==

Reiser4 possesses a unique feature: an efficient implementation of real-time discard which doesn't lead to garbage accumulation on disk and hence eliminates the need to periodically run fstrim (batch discard) on the device.

== Introduction. Garbage on SSD disks ==

Real-time [http://en.wikipedia.org/wiki/Trim_(computing) discard support] means that the file system issues discard requests, i.e. informs the block layer about extents of freed space. Currently all Linux file systems with an announced real-time discard feature issue so-called "lazy" (or "non-precise") discard requests: such file systems report exactly the blocks that were freed. Since erase units in general don't coincide with file system blocks, this "lazy" technique leads to accumulation of garbage on disk.

'''DEFINITION'''.
''Garbage is a set of erase units on disk which were deallocated (marked free in the file system's space map) but for which discard requests were, for some reason, not issued.''

For example, if an erase unit is larger than a file system block, a "lazy" discard request can contain partial erase units, so the block layer will round the start of the request up and its end down. On the one hand, the trim operation is defined only for whole erase units; on the other hand, the block layer doesn't know the status of an erase unit that is only partially freed, so it assumes that the unit's other part is "busy" in the file system's space map (the alternative assumption could lead to data corruption). If that forced assumption is incorrect and the whole erase unit becomes free, the unit becomes garbage. With the lazy discard policy the user therefore needs to run special tools (e.g. fstrim(8)) once in a while to clean up the accumulated garbage.

So it would be nice to check the status of a partially freed erase unit and issue a discard request for the whole unit if its other part is also free (marked as free in the file system's space map). The block layer cannot perform such checks for obvious reasons: this is the file system's business. Below we prove that checking partially freed erase units and issuing discard requests for the padded extents (we'll call these "precise discard requests") doesn't lead to accumulation of garbage. Efficiently issuing precise discard requests without a performance drop and ugly workarounds is possible only if the file system possesses an advanced transaction manager like that of Reiser4.

The initial idea of precise discard and its implementation of complexity N_u (where N_u is the total number of erase units, including partial ones, in the resulting set of sorted and merged discard requests) belongs to Ivan Shapovalov.
Edward Shishkin suggested an implementation of complexity 2*N_e, where N_e is the total number of extents in that resulting set.

== (De)allocation, discard units and alignment. Non-precise and precise coordinates ==

The minimal unit of all (de)allocation operations in a file system is a file system block of size blk_size. The minimal unit of all discard operations is a so-called erase unit of size EUS. Every file system block can be addressed by its (block) number; in this case we speak of addressing in the system of non-precise coordinates 0Y. In contrast, we'll also consider a system 0X of precise coordinates, in which every individual byte on the disk can be addressed.

In the system 0Y we'll consider (non-precise) extents of blocks (U, V), where U is the number of the start block and V is the width of the extent in blocks. In the system 0X we'll consider precise extents of bytes [A, B], where A (A < B) is the offset of the first byte and (B-1) is the offset of the last byte of the extent, so the length of such a segment is B-A.

The erase unit size in bytes (EUS) is a property of the SSD drive. Generally erase units don't coincide with file system blocks, so we'll address erase units in the system 0X of precise coordinates by precise extents. In particular, every erase unit of an SSD partition is represented in precise coordinates as the extent [EUO + N * EUS, EUO + (N+1) * EUS] for some natural N, where EUO is the offset of the first complete erase unit (0 <= EUO < EUS). That is, EUO is a property of individual partitions of SSD drives; EUO is also called the "alignment".

== Lazy (non-precise) discard policy. Accumulation of garbage on disk ==

The policy of lazy (non-precise) discard is rather simple: if any extent of blocks (U, V) is freed by the file system, then we issue a discard request for the extent [U * blk_size, (U+V) * blk_size].
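The clipping that the block layer applies to such a lazy request can be sketched as follows (a minimal Python sketch; the function name and parameters are illustrative, not actual Reiser4 or kernel code):

```python
def lazy_discard_extent(U, V, blk_size, EUS, EUO):
    """How the block layer clips a lazy discard request for the freed
    block extent (U, V): round the start up and the end down to whole
    erase units of size EUS aligned at offset EUO."""
    start = U * blk_size              # byte offset of the first freed byte
    end = (U + V) * blk_size          # one past the last freed byte
    # First erase-unit boundary at or after `start` (round up):
    a = EUO + -(-(start - EUO) // EUS) * EUS
    # Last erase-unit boundary at or before `end` (round down):
    b = EUO + (end - EUO) // EUS * EUS
    # Only whole erase units inside the freed extent get trimmed:
    return (a, b) if a < b else None

# With blk_size = 4096, EUS = 2 * blk_size and alignment EUO = 4096,
# freeing the extent (2, 5), i.e. bytes [8192, 28672], gets clipped to
# [12288, 28672]: the head [8192, 12288] stays behind as garbage if
# the rest of that erase unit happens to be free as well.
print(lazy_discard_extent(2, 5, 4096, 8192, 4096))
```

Note that when the freed extent contains no complete erase unit at all (e.g. EUS = 16384, EUO = 0 for the same extent), the sketch returns None: the lazy request is clipped away entirely and every freed byte is potential garbage.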
Suppose now that EUS != blk_size or EUO != 0. Suppose the file system deallocates an extent (2, 5) and issues the respective "lazy" discard request [2 * blk_size, 7 * blk_size]; see the picture below. The block layer assumes that the neighboring blocks #1 and #7 are busy and hence issues a discard request for the smaller segment [A, B]. Note, however, that if this assumption is incorrect, and blocks #1 and (or) #7 were actually free, then after freeing the extent (2, 5) the whole erase units [A - EUS, A] and [B, B + EUS] are marked as free in the file system space map and hence replenish the garbage. So, the lazy (non-precise) discard policy leads to accumulation of garbage on disk.

 *       *       *       *       *       *       *       *       *  > Y
 0       1       2       3       4       5       6       7       8

 0       blk_size        3*blk_size
 *-------*-------*-------*-------*-------*-------*-------*-------*--> X
 ---+--------+--------+--------+--------+--------+--------+--------+--> X
 0  EUO      A-EUS    A                          B        B+EUS

Comment: there are two independent "sources" of garbage in the "lazy" discard policy:
* "bad" values of erase unit size (EUS != blk_size);
* "bad" values of alignment (EUO != 0).

== Precise discard policy ==

The idea is to check all "partially deallocated" erase units. If the whole of such a unit is marked as free in the file system space map, then we (the file system) issue a discard request for the whole unit. That is, in contrast with the lazy discard policy, the file system provides the correct status of every partially deallocated discard unit and issues a precise discard request for the larger (padded) extent. Let's consider the previous example. In accordance with the precise discard policy the file system checks the status of blocks #1 and #7. If both blocks #1 and #7 are free, then the file system issues a discard request [A - EUS, B + EUS]. If block #1 is free and block #7 is busy, then the file system issues a discard request [A - EUS, B]. If block #1 is busy and block #7 is free, then the file system issues a discard request [A, B + EUS].
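The example of freeing extent (2, 5) can be replayed numerically. In this sketch we assume blk_size = 4096 and EUS = 3 * blk_size with EUO = 0, so erase units and blocks don't line up; the function name is ours:

```python
# Replaying the lazy-discard example: freeing blocks 2..6 issues a lazy
# request that the block layer rounds inward, leaving partial erase
# units untrimmed. Sizes are illustrative assumptions.

BLK, EUS, EUO = 4096, 3 * 4096, 0

def lazy_discard(u, v):
    """Return (freed byte range, inward-rounded range actually trimmed)."""
    start, end = u * BLK, (u + v) * BLK
    a = EUO + -(-(start - EUO) // EUS) * EUS   # round start up to a boundary
    b = EUO + ((end - EUO) // EUS) * EUS       # round end down to a boundary
    return (start, end), (a, b)

freed, trimmed = lazy_discard(2, 5)
print(freed)    # (8192, 28672)  -- bytes marked free in the space map
print(trimmed)  # (12288, 24576) -- bytes actually discarded
# If block #1 (or #7) was in fact free, the partial erase unit adjoining
# the trimmed range is entirely free yet never discarded: garbage.
```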
Finally, if both blocks #1 and #7 are busy, then the file system issues a discard request [A, B]. Note that the block layer won't restrict such "precise" discard requests, and, moreover, the following statement holds:

'''THEOREM'''. ''The policy of precise discard doesn't lead to accumulation of garbage on disk.''

Proof (sketch). Suppose that the disk doesn't contain "garbage". That is, a discard request was issued for every erase unit which is marked as free in the file system space map. Suppose the file system deallocates the extent (2, 5). If block #1 is busy and block #7 is free, then, in accordance with the precise discard policy, the file system issues the "precise" discard request [A, B + EUS]. Note that we must not discard the unit [A - EUS, A], since it contains bytes of the busy block #1. Also, note that we don't need to discard other units due to the assumption that before the deallocation the disk didn't contain garbage. Thus, discard requests have been issued for every erase unit which is marked as free in the file system space map. In a similar way we can prove that the "precise" discard policy doesn't leave garbage on disk in the other 3 cases (block #1 free and block #7 busy, both blocks free, and both blocks busy).

== Implementation of precise discard ==

The straightforward solution is to check the status of partially deallocated erase units in the file system's space map. However, an efficient implementation of such a solution requires an advanced transaction manager and a no less advanced block allocator. In particular, you need to make sure that nobody will occupy the other parts of your partially deallocated erase units while you are issuing precise discard requests for them (otherwise data corruption is possible).
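The four-case analysis above amounts to padding the freed extent [A, B] on each side by one erase unit when the corresponding padding is free. A minimal sketch (not Reiser4 code; names are ours):

```python
# The precise-discard case analysis: pad the freed byte extent [A, B]
# left and/or right by one erase unit, depending on whether the rest of
# the partial head/tail erase unit is free in the space map.

def precise_extent(a, b, eus, head_padding_free, tail_padding_free):
    """Return the padded extent to be discarded."""
    start = a - eus if head_padding_free else a
    end = b + eus if tail_padding_free else b
    return start, end

A, B, EUS = 12288, 24576, 12288
print(precise_extent(A, B, EUS, True, True))    # (0, 36864)     both free
print(precise_extent(A, B, EUS, True, False))   # (0, 24576)     #1 free
print(precise_extent(A, B, EUS, False, True))   # (12288, 36864) #7 free
print(precise_extent(A, B, EUS, False, False))  # (12288, 24576) both busy
```

In every case the returned extent is aligned to erase-unit boundaries, so the block layer trims it in full and no free erase unit is left undiscarded.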
The Reiser4 block allocator manages the following in-memory data structures:
* the working space map (W)
* the commit space map (C)
* the deallocation set (D)

Allocation in Reiser4 always goes on the working space map:

 (1) W' = alloc(W, R);

that is, allocate a set R of block numbers in the working space map W. Deallocation is a bit more complicated: all freed block numbers are at first recorded in a special data structure, the deallocation set D:

 (2) D' = dealloc(D, R);

Before committing a transaction we update the commit space map C in the so-called pre_commit_hook():

 (3) C' = apply(C, D');

After committing the transaction, that is, after issuing all write requests (including the commit space map C), we prepare and issue discard requests in the so-called post_write_back_hook():

 (4) prepare_and_issue_discard_requests();

After issuing the discard requests we update the working space map:

 (5) W' = apply(W, D');

== Handling paddings of partially freed erase units ==

When preparing the discard set at stage (4) we check the head (tail) padding of every partial erase unit. If it is free, we allocate it in the working space map:

 W" = alloc(W', R');

At the same time we record the allocated paddings in the deallocation set:

 D" = dealloc(D', R');

Updating the working space map at stage (5) automatically deallocates the paddings:

 apply(W", D") = W" \ R' = (W' + R') \ R' = W'.

== How to test ==

Find out the erase unit size and alignment of your SSD partition. Apply the [http://sourceforge.net/projects/reiser4/files/patches/3.17.3-reiser4-precise-discard-support.patch.gz patch] against reiser4-for-3.17.3. Format a reiser4 partition with reiser4progs-1.0.9. Use the mkfs.reiser4 option -d to "discard" the whole partition on your SSD drive at format time. We recommend using transparent compression for SSD drives (by default it is turned on when formatting with mkfs.reiser4).
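The five stages and the padding trick can be modeled with plain sets. This is a toy model under our own naming (a set of busy blocks stands in for the working space map), not the actual allocator:

```python
# Toy model of the allocator stages: paddings R' of partial erase units
# are temporarily allocated in the working map and recorded in the
# deallocation set, so stage (5) frees them again automatically.

def alloc(space_map, r):           # mark blocks in r as busy
    return space_map | set(r)

def dealloc(d, r):                 # record freed blocks in the dealloc set
    return d | set(r)

def apply_dealloc(space_map, d):   # mark all recorded blocks as free
    return space_map - d

W, D = set(), set()
W = alloc(W, {2, 3, 4, 5, 6})      # (1) allocate extent (2,5)
D = dealloc(D, {2, 3, 4, 5, 6})    # (2) later, free that extent

# (4) while preparing discards: padding block 1 turns out to be free,
# so allocate it temporarily and record it for deallocation as well
W = alloc(W, {1})                  # W" = alloc(W', R')
D = dealloc(D, {1})                # D" = dealloc(D', R')

W = apply_dealloc(W, D)            # (5) apply(W", D")
print(W)                           # set() -- paddings are free again
```

While block 1 sits in W it cannot be handed out to a concurrent allocation, which is exactly the protection against the data-corruption race mentioned above.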
Mount the reiser4 partition with the mount option "discard", specifying (in bytes) the erase unit size and alignment of your partition via the mount options "discard.unit" and "discard.offset" respectively. For example, if the erase unit size of your partition is 1536K and the alignment is 0, then the mount command should look like the following:

 mount /dev/sdX -o discard,discard.unit=1572864,discard.offset=0 /mnt

'''WARNING: Incorrectly specified erase unit size and alignment will lead to data corruption!'''

Look for the kernel message about discard support:

 reiser4: sdX: enable discard support (erase unit 1572864 bytes, alignment 0 bytes)

We recommend using the Write-Anywhere (AKA Copy-On-Write) transaction model for SSD drives (mount option "txmod=wa"). We also recommend the mount option "noatime" for SSD drives.

[[category:Reiser4]]
''Garbage is a set of erase units on disk, which were deallocated (marked free in the file system space map), but discard requests for them were not issued for some reasons.'' Indeed, for example, if erase unit is larger than file system block, then it can happen that "lazy" discard request contains partial erase units, so that the block layer will round up the start and round down the end of such discard request. This is because on the one hand trim operation is defined only for whole erase units. On the other hand, the block layer doesn't know the status of erase unit, which is freed only partially, and hence it makes an assumption that it its other part is "busy" in the file system's space map (the alternative assumption can lead to data corruption). Note, however, that if such "forced" assumption is incorrect, and the whole erase unit becomes free, then such erase unit will become a garbage. With lazy discard policy user needs once in a while to run special tools (e.g fstrim(8)) to clean up the accumulated garbage. So, it would be nice to check the status of partially freed erase units and issue discard request for such unit, if its other part iDEFINITIONs also free (marked as free in the file system's space map). The block layer is not able to perform such checks for obvious reasons: this is a business of the file system. Below we prove that such checking of partially freed erase units and issuing discard requests for the padded extents (we'll call it "precise discard requests") doesn't lead to accumulation of garbage. Efficient issuing precise discard requests without performance drop and ugly workarounds is possible only if the file system possesses an advanced transaction manager like the one of Reiser4. Initial idea of precise discard and its implementation of complexity N_u (where N_u is total number of erase units (including partial ones) in the resulted set of sorted and merged discard requests) belongs to Ivan Shapovalov. 
Edward Shishkin suggested implementation of complexity 2*N_e, where N_e is total number of extents in such resulted set. == (De)allocation, discard units and alignment. Non-precise and precise coordinates == The minimal unit of all (de)allocation operations in a file system is a file system block of blk_size. The minimal unit of all discard operations is a so-called erase unit of EUS size. Every file system block can be addressed by its (block) number. In this case we'll say about addressing in the system of non-precise coordinates 0Y. In contrast with non-precise coordinates we'll also consider a system 0X of precise coordinates, where every individual byte on the disk can be addressed. In the system 0Y we'll consider (non-precise) extents of blocks (U,V), where U is the number of the start block, and V is the width of the extent (in blocks). In the system 0X we'll consider precise extents of bytes [AB], where A (A < B) is offset of the first byte and (B-1) is offset of the last byte of the extent. So, the length of such segment is B-A. Erase unit size in bytes (EUS) is a property of SSD drive. Generally erase units don't coincide with file system blocks, so we'll address erase units in the system OX of precise coordinates by precise extents. In particular, every erase unit of some SSD partition is represented in precise coordinates as extent [EUO + N * EUS, EUO + (N+1)*EUS] for some natural N, where EUO is the offset of the first complete erase unit (0 <= EUO < EUS). That is, EUO is a property of individual partitions of SSD drives. EUO is also called as "alignment". == Lazy (non-precise) discard policy. Accumulation of garbage on disk == The policy of lazy (non-precise) discard is rather simple: if any extent of blocks (U,V) is freed by the file system, then we issue discard request for the extent [U * blk_size, (U+V) * blk_size]. 
Suppose now that EUS != blk_size, or EUO != 0 Suppose, the file system deallocates an extent (2,5) and issue the respective "lazy" discard request [2 * blk_size, 7 * blk_size], see the picture below. The block layer assumes that the neighboring blocks #1 and #7 are busy, and, hence, issues discard request for the smaller segment [AB]. Note, however, that if this assumption is incorrect, and blocks #1 and (or) #7 were actually free, then after freeing the extent (2,5) we'll have that the whole erase units [A - EUS, A] and [B, B + EUS] are marked as free in the file system space map, and hence, will replenish the garbage. So, the lazy (non-precise) discard policy leads to accumulation of garbage on disk. * * * * * * * * * > Y 0 1 2 3 4 5 6 7 8 0 blk_size 3*blk_size *-------*-------*-------*-------*-------*-------*-------*-------*--> X ---+--------+--------+--------+--------+--------+--------+--------+> X 0 EUO A-EUS A B B+EUS Comment. There are 2 independent "sources" of garbage in "lazy" discard policy: * "bad" values of erase unit size (EUS != blk_size); * "bad" values of alignment (EUO != 0); == Precise discard policy == The idea is to check all "partially deallocated" erase units. If the whole such unit is marked as free in the file system space map, then we (file system) issue a discard request for the whole unit. That is, in contrast with the lazy discard policy, the file system provides correct status of every partially deallocated discard unit and issues precise discard request for the larger (padded) extents. Let's consider the previous example. In accordance with the precise discard policy file system checks the status of blocks #1 and #7. If both blocks #1 and #7 are free, then file system issues a discard request [A - EUS, B + EUS]. If block #1 is free and block #7 is busy, then file system issues a discard request [A - EUS, B]. If block #1 is busy and block #7 is free, then the file system will issue discard request [A, B + EUS]. 
Finally, if both blocks #1 and #7 are busy, then the file system issues discard request [A, B]. Note that block layer won't restrict such "precise" discard requests, and, moreover, the following statement takes place: '''THEOREM'''. ''The policy of precise discard doesn't lead to accumulation of garbage on disk''. Proof (sketch). Indeed, suppose that disk doesn't contain "garbage". That is dicard request was issued for every erase unit, which is marked as free in the file system space map. Suppose, the file system deallocates extent (2, 5). If the block #1 is busy and block #7 is free, then, in accordance with precise discard policy, the file system issues "precise" discard request [A, B + EUS]. Note that we must not discard the unit [A - EUS, A], since it contains bytes of the busy block #1. Also, note that we don't need to discard other units due to the assumption, that before deallocation disk didn't contain garbage. Thus, we have that discard requests have been issued for every erase unit, which is marked as free in the file system space map. By the similar way we can prove that "precise" discard policy doesn't leave garbage on disk in other 3 cases (when block #1 is free and block #7 is busy, both blocks #1 and #7 are free, and both blocks #1 and #7 are busy. == Implementation of precise discard == The straightforward solution is to check the status of partially deallocated erase units in the file system's space map. However, efficient implementation of such solution requires an advanced transaction manager and not less advanced block allocator. In particular, you need to make sure that nobody will occupy the other parts of your partially deallocated erase units while you are issuing precise discard requests for them (otherwise, data corruption is possible). 
Reiser4 block allocator manages the following in-memory data-structures: * working space map (W) * commit space map (C) * deallocation set (D) Allocation in Reise4 is always going on the working space map: (1) W' = alloc(W, R); - allocate a set R of block numbers in the working space map W. Deallocation is a bit more complicated: all freed block numbers at first are recorded in a special data structure - deallocation set D: (2) D' = dealloc(D, R); Before committing a transaction we update the commit space map C at so-called pre_commit_hook(): (3) C' = apply(C, D); After committing the transaction, that is after issuing all write requests (including the commit space map C) we prepare and issue discard requests in so-called post_write_back_hook(): (4) prepare_and_issue_discard_requests(); After issuing discard requests we update the working space map: (5) W' = apply(W, D'); == Handling paddings of partially freed erase units == When preparing discard set at stage (4) we check head (tail) padding of every partial erase unit. If it is free, we allocate it at the working space map: W" = alloc(W', R'); At the same time we record the allocated paddings to the deallocation set: D" = dealloc(D', R'); Updating the working space map at the stage (5) automatically deallocates the paddings: apply(W", D") = W" \ R' = (W' + R')\ R'= W'. == How to test == Find out erase unit size and alignment of your SSD partition. Apply the [http://sourceforge.net/projects/reiser4/files/patches/3.17.3-reiser4-precise-discard-support.patch.gz patch] against reiser4-for-3.17.3. Format a reiser4 partition with reiser4progs-1.0.9. Use mkfs.reiser4 option -d to "discard" the whole partition on your SSD drive at format time. We recommend to use transparent compression for SSD drives (by default it is turned on when formatting with mkfs.reiser4). 
Mount a reiser4 partition with mount option "discard", specifying (in bytes) erase unit size and alignment of your partition by mount options "discard.unit" and "discard.offset" respectively. For example, if erase unit size of your partition is 1536K and alignment is 0, then the mount command should look like the following: mount /dev/sdX -o discard,discard.unit=1572864,discard.offset=0 /mnt '''WARNING: Incorrectly specified erase unit size and alignment will lead to data corruption!''' Find a kernel message about discard support: reiser4: sdX: enable discard support (erase unit 1572864 bytes, alignment 0 bytes) We recommend to use Write-Anywhere (AKA Copy-On-Write) transaction model for SSD drives (mount option "txmod=wa"). Also we recommend to use mount option "noatime" for SSD drives. [[category:Reiser4]] 9ea0dd4b3e5ff97126cd3c4585e998b87fe7aafe 4117 4115 2016-02-12T18:07:34Z Edward 4 /* Introduction */ == Precise real-time discard in Reiser4 for SSD devices == Efficient implementation of real-time discard which doesn't lead to accumulation of garbage on disk (set of erase units which are marked as free in the file system space map, but discard requests wasn't issued for them), and, hence, rids of need to periodically run fstrim (batch discard) on the device == Introduction. Garbage on SSD disks == Real-time [http://en.wikipedia.org/wiki/Trim_(computing) discard support] means that file system issues discard requests, i.e. informs the block layer about extents of freed space. Currently all Linux file systems with announced feature of real-time discard support issue so-called "lazy" (or "non-precise") discard requests. It means that such file systems report exactly about blocks that were freed. Since erase units in general don't coincide with file system blocks, such "lazy" technique leads to accumulation of garbage on disk. '''DEFINITION'''. 
''Garbage is a set of erase units on disk, which are marked free in the file system space map, but discard requests for them were not issued.'' Indeed, for example, if erase unit is larger than file system block, then it can happen that "lazy" discard request contains partial erase units, so that the block layer will round up the start and round down the end of such discard request. This is because on the one hand trim operation is defined only for whole erase units. On the other hand, the block layer doesn't know the status of erase unit, which is freed only partially, and hence it makes an assumption that it its other part is "busy" in the file system's space map (the alternative assumption can lead to data corruption). Note, however, that if such "forced" assumption is incorrect, and the whole erase unit becomes free, then such erase unit will become a garbage. With lazy discard policy user needs once in a while to run special tools (e.g fstrim(8)) to clean up the accumulated garbage. So, it would be nice to check the status of partially freed erase units and issue discard request for such unit, if its other part iDEFINITIONs also free (marked as free in the file system's space map). The block layer is not able to perform such checks for obvious reasons: this is a business of the file system. Below we prove that such checking of partially freed erase units and issuing discard requests for the padded extents (we'll call it "precise discard requests") doesn't lead to accumulation of garbage. Efficient issuing precise discard requests without performance drop and ugly workarounds is possible only if the file system possesses an advanced transaction manager like the one of Reiser4. Initial idea of precise discard and its implementation of complexity N_u (where N_u is total number of erase units (including partial ones) in the resulted set of sorted and merged discard requests) belongs to Ivan Shapovalov. 
Edward Shishkin suggested implementation of complexity 2*N_e, where N_e is total number of extents in such resulted set. == (De)allocation, discard units and alignment. Non-precise and precise coordinates == The minimal unit of all (de)allocation operations in a file system is a file system block of blk_size. The minimal unit of all discard operations is a so-called erase unit of EUS size. Every file system block can be addressed by its (block) number. In this case we'll say about addressing in the system of non-precise coordinates 0Y. In contrast with non-precise coordinates we'll also consider a system 0X of precise coordinates, where every individual byte on the disk can be addressed. In the system 0Y we'll consider (non-precise) extents of blocks (U,V), where U is the number of the start block, and V is the width of the extent (in blocks). In the system 0X we'll consider precise extents of bytes [AB], where A (A < B) is offset of the first byte and (B-1) is offset of the last byte of the extent. So, the length of such segment is B-A. Erase unit size in bytes (EUS) is a property of SSD drive. Generally erase units don't coincide with file system blocks, so we'll address erase units in the system OX of precise coordinates by precise extents. In particular, every erase unit of some SSD partition is represented in precise coordinates as extent [EUO + N * EUS, EUO + (N+1)*EUS] for some natural N, where EUO is the offset of the first complete erase unit (0 <= EUO < EUS). That is, EUO is a property of individual partitions of SSD drives. EUO is also called as "alignment". == Lazy (non-precise) discard policy. Accumulation of garbage on disk == The policy of lazy (non-precise) discard is rather simple: if any extent of blocks (U,V) is freed by the file system, then we issue discard request for the extent [U * blk_size, (U+V) * blk_size]. 
Suppose now that EUS != blk_size, or EUO != 0 Suppose, the file system deallocates an extent (2,5) and issue the respective "lazy" discard request [2 * blk_size, 7 * blk_size], see the picture below. The block layer assumes that the neighboring blocks #1 and #7 are busy, and, hence, issues discard request for the smaller segment [AB]. Note, however, that if this assumption is incorrect, and blocks #1 and (or) #7 were actually free, then after freeing the extent (2,5) we'll have that the whole erase units [A - EUS, A] and [B, B + EUS] are marked as free in the file system space map, and hence, will replenish the garbage. So, the lazy (non-precise) discard policy leads to accumulation of garbage on disk. * * * * * * * * * > Y 0 1 2 3 4 5 6 7 8 0 blk_size 3*blk_size *-------*-------*-------*-------*-------*-------*-------*-------*--> X ---+--------+--------+--------+--------+--------+--------+--------+> X 0 EUO A-EUS A B B+EUS Comment. There are 2 independent "sources" of garbage in "lazy" discard policy: * "bad" values of erase unit size (EUS != blk_size); * "bad" values of alignment (EUO != 0); == Precise discard policy == The idea is to check all "partially deallocated" erase units. If the whole such unit is marked as free in the file system space map, then we (file system) issue a discard request for the whole unit. That is, in contrast with the lazy discard policy, the file system provides correct status of every partially deallocated discard unit and issues precise discard request for the larger (padded) extents. Let's consider the previous example. In accordance with the precise discard policy file system checks the status of blocks #1 and #7. If both blocks #1 and #7 are free, then file system issues a discard request [A - EUS, B + EUS]. If block #1 is free and block #7 is busy, then file system issues a discard request [A - EUS, B]. If block #1 is busy and block #7 is free, then the file system will issue discard request [A, B + EUS]. 
Finally, if both blocks #1 and #7 are busy, then the file system issues discard request [A, B]. Note that block layer won't restrict such "precise" discard requests, and, moreover, the following statement takes place: '''THEOREM'''. ''The policy of precise discard doesn't lead to accumulation of garbage on disk''. Proof (sketch). Indeed, suppose that disk doesn't contain "garbage". That is dicard request was issued for every erase unit, which is marked as free in the file system space map. Suppose, the file system deallocates extent (2, 5). If the block #1 is busy and block #7 is free, then, in accordance with precise discard policy, the file system issues "precise" discard request [A, B + EUS]. Note that we must not discard the unit [A - EUS, A], since it contains bytes of the busy block #1. Also, note that we don't need to discard other units due to the assumption, that before deallocation disk didn't contain garbage. Thus, we have that discard requests have been issued for every erase unit, which is marked as free in the file system space map. By the similar way we can prove that "precise" discard policy doesn't leave garbage on disk in other 3 cases (when block #1 is free and block #7 is busy, both blocks #1 and #7 are free, and both blocks #1 and #7 are busy. == Implementation of precise discard == The straightforward solution is to check the status of partially deallocated erase units in the file system's space map. However, efficient implementation of such solution requires an advanced transaction manager and not less advanced block allocator. In particular, you need to make sure that nobody will occupy the other parts of your partially deallocated erase units while you are issuing precise discard requests for them (otherwise, data corruption is possible). 
Reiser4 block allocator manages the following in-memory data-structures: * working space map (W) * commit space map (C) * deallocation set (D) Allocation in Reise4 is always going on the working space map: (1) W' = alloc(W, R); - allocate a set R of block numbers in the working space map W. Deallocation is a bit more complicated: all freed block numbers at first are recorded in a special data structure - deallocation set D: (2) D' = dealloc(D, R); Before committing a transaction we update the commit space map C at so-called pre_commit_hook(): (3) C' = apply(C, D); After committing the transaction, that is after issuing all write requests (including the commit space map C) we prepare and issue discard requests in so-called post_write_back_hook(): (4) prepare_and_issue_discard_requests(); After issuing discard requests we update the working space map: (5) W' = apply(W, D'); == Handling paddings of partially freed erase units == When preparing discard set at stage (4) we check head (tail) padding of every partial erase unit. If it is free, we allocate it at the working space map: W" = alloc(W', R'); At the same time we record the allocated paddings to the deallocation set: D" = dealloc(D', R'); Updating the working space map at the stage (5) automatically deallocates the paddings: apply(W", D") = W" \ R' = (W' + R')\ R'= W'. == How to test == Find out erase unit size and alignment of your SSD partition. Apply the [http://sourceforge.net/projects/reiser4/files/patches/3.17.3-reiser4-precise-discard-support.patch.gz patch] against reiser4-for-3.17.3. Format a reiser4 partition with reiser4progs-1.0.9. Use mkfs.reiser4 option -d to "discard" the whole partition on your SSD drive at format time. We recommend to use transparent compression for SSD drives (by default it is turned on when formatting with mkfs.reiser4). 
Mount a reiser4 partition with mount option "discard", specifying (in bytes) erase unit size and alignment of your partition by mount options "discard.unit" and "discard.offset" respectively. For example, if erase unit size of your partition is 1536K and alignment is 0, then the mount command should look like the following: mount /dev/sdX -o discard,discard.unit=1572864,discard.offset=0 /mnt '''WARNING: Incorrectly specified erase unit size and alignment will lead to data corruption!''' Find a kernel message about discard support: reiser4: sdX: enable discard support (erase unit 1572864 bytes, alignment 0 bytes) We recommend to use Write-Anywhere (AKA Copy-On-Write) transaction model for SSD drives (mount option "txmod=wa"). Also we recommend to use mount option "noatime" for SSD drives. [[category:Reiser4]] 334493ce4853eb8157c4081bd48ab38047729442 4115 4113 2016-02-12T18:00:58Z Edward 4 /* Clean up garbage */ == Precise real-time discard in Reiser4 for SSD devices == Efficient implementation of real-time discard which doesn't lead to accumulation of garbage on disk (set of erase units which are marked as free in the file system space map, but discard requests wasn't issued for them), and, hence, rids of need to periodically run fstrim (batch discard) on the device == Introduction == Real-time [http://en.wikipedia.org/wiki/Trim_(computing) discard support] means that file system issues discard requests, i.e. informs the block layer about extents of freed space. Currently all Linux file systems with announced feature of real-time discard support issue so-called "lazy" (or "non-precise") discard requests. It means that such file systems report exactly about blocks that were freed. Since erase units in general don't coincide with file system blocks, such "lazy" technique leads to accumulation of garbage on disk. '''DEFINITION'''. 
''Garbage is a set of erase units on disk, which are marked free in the file system space map, but discard requests for them were not issued.'' Indeed, for example, if erase unit is larger than file system block, then it can happen that "lazy" discard request contains partial erase units, so that the block layer will round up the start and round down the end of such discard request. This is because on the one hand trim operation is defined only for whole erase units. On the other hand, the block layer doesn't know the status of erase unit, which is freed only partially, and hence it makes an assumption that it its other part is "busy" in the file system's space map (the alternative assumption can lead to data corruption). Note, however, that if such "forced" assumption is incorrect, and the whole erase unit becomes free, then such erase unit will become a garbage. With lazy discard policy user needs to run special tools to clean up the accumulated garbage. So, it would be nice to check the status of partially freed erase units and issue discard request for such unit, if its other part iDEFINITIONs also free (marked as free in the file system's space map). The block layer is not able to perform such checks for obvious reasons: this is a business of the file system. Below we prove that such checking of partially freed erase units and issuing discard requests for the padded extents (we'll call it "precise discard requests") doesn't lead to accumulation of garbage. Efficient issuing precise discard requests without performance drop and ugly workarounds is possible only if the file system possesses an advanced transaction manager like the one of Reiser4. Initial idea of precise discard and its implementation of complexity N_u (where N_u is total number of erase units (including partial ones) in the resulted set of sorted and merged discard requests) belongs to Ivan Shapovalov. 
Edward Shishkin suggested an implementation of complexity 2*N_e, where N_e is the total number of extents in that resulting set.

== (De)allocation, discard units and alignment. Non-precise and precise coordinates ==

The minimal unit of all (de)allocation operations in a file system is a file system block of size blk_size. The minimal unit of all discard operations is a so-called erase unit of size EUS. Every file system block can be addressed by its (block) number; in this case we speak of addressing in the system of non-precise coordinates 0Y. In contrast with non-precise coordinates we'll also consider a system 0X of precise coordinates, in which every individual byte on the disk can be addressed. In the system 0Y we'll consider (non-precise) extents of blocks (U, V), where U is the number of the start block and V is the width of the extent (in blocks). In the system 0X we'll consider precise extents of bytes [A, B], where A (A < B) is the offset of the first byte and (B-1) is the offset of the last byte of the extent. So the length of such a segment is B-A.

The erase unit size in bytes (EUS) is a property of the SSD drive. Generally erase units don't coincide with file system blocks, so we'll address erase units in the system 0X of precise coordinates by precise extents. In particular, every erase unit of an SSD partition is represented in precise coordinates as the extent [EUO + N * EUS, EUO + (N+1) * EUS] for some natural N, where EUO is the offset of the first complete erase unit (0 <= EUO < EUS). That is, EUO is a property of an individual partition of the SSD drive. EUO is also called the "alignment".

== Lazy (non-precise) discard policy. Accumulation of garbage on disk ==

The policy of lazy (non-precise) discard is rather simple: if any extent of blocks (U, V) is freed by the file system, then we issue a discard request for the extent [U * blk_size, (U+V) * blk_size].
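The two coordinate systems above can be captured in a short sketch (Python). The blk_size, EUS and EUO values below are hypothetical examples chosen for illustration, not properties of any particular drive:

```python
BLK_SIZE = 4096   # hypothetical file system block size, in bytes
EUS = 6144        # hypothetical erase unit size (here 1.5 blocks)
EUO = 512         # hypothetical alignment: offset of the first complete erase unit

def block_extent_to_bytes(u, v, blk_size=BLK_SIZE):
    """Map a non-precise extent (U, V) in 0Y to the precise byte extent
    [U * blk_size, (U + V) * blk_size] in 0X."""
    return u * blk_size, (u + v) * blk_size

def erase_unit(n, eus=EUS, euo=EUO):
    """Precise extent of the N-th complete erase unit:
    [EUO + N * EUS, EUO + (N + 1) * EUS]."""
    return euo + n * eus, euo + (n + 1) * eus
```

For instance, with these values the block extent (2, 5) used in the examples below corresponds to the byte extent [8192, 28672].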
Suppose now that EUS != blk_size or EUO != 0. Suppose the file system deallocates an extent (2, 5) and issues the respective "lazy" discard request [2 * blk_size, 7 * blk_size]; see the picture below. The block layer assumes that the neighboring blocks #1 and #7 are busy, and hence issues a discard request for the smaller segment [A, B]. Note, however, that if this assumption is incorrect, and blocks #1 and (or) #7 were actually free, then after freeing the extent (2, 5) the whole erase units [A - EUS, A] and [B, B + EUS] are marked as free in the file system space map, and hence replenish the garbage. So, the lazy (non-precise) discard policy leads to the accumulation of garbage on disk.

    *       *       *       *       *       *       *       *       *  > Y
    0       1       2       3       4       5       6       7       8
    0   blk_size            3*blk_size
    *-------*-------*-------*-------*-------*-------*-------*-------*--> X
 ------+--------+--------+--------+--------+--------+--------+--------+> X
    0  EUO      A-EUS    A                          B        B+EUS

Comment. There are 2 independent "sources" of garbage in the "lazy" discard policy:
* "bad" values of erase unit size (EUS != blk_size);
* "bad" values of alignment (EUO != 0).

== Precise discard policy ==

The idea is to check all "partially deallocated" erase units. If the whole of such a unit is marked as free in the file system space map, then we (the file system) issue a discard request for the whole unit. That is, in contrast with the lazy discard policy, the file system provides the correct status of every partially deallocated discard unit and issues a precise discard request for the larger (padded) extent.

Let's consider the previous example. In accordance with the precise discard policy the file system checks the status of blocks #1 and #7:
* If both blocks #1 and #7 are free, the file system issues a discard request [A - EUS, B + EUS].
* If block #1 is free and block #7 is busy, the file system issues a discard request [A - EUS, B].
* If block #1 is busy and block #7 is free, the file system issues a discard request [A, B + EUS].
* Finally, if both blocks #1 and #7 are busy, the file system issues a discard request [A, B].

Note that the block layer won't restrict such "precise" discard requests, and, moreover, the following statement holds:

'''THEOREM'''. ''The policy of precise discard doesn't lead to the accumulation of garbage on disk.''

Proof (sketch). Suppose that the disk doesn't contain "garbage"; that is, a discard request was issued for every erase unit which is marked as free in the file system space map. Suppose the file system deallocates the extent (2, 5). If block #1 is busy and block #7 is free, then, in accordance with the precise discard policy, the file system issues the "precise" discard request [A, B + EUS]. Note that we must not discard the unit [A - EUS, A], since it contains bytes of the busy block #1. Also note that we don't need to discard any other units, due to the assumption that before the deallocation the disk didn't contain garbage. Thus, discard requests have been issued for every erase unit which is marked as free in the file system space map. In the same way we can prove that the "precise" discard policy doesn't leave garbage on disk in the other 3 cases (block #1 free and block #7 busy, both blocks free, and both blocks busy).

== Implementation of precise discard ==

The straightforward solution is to check the status of partially deallocated erase units in the file system's space map. However, an efficient implementation of this solution requires an advanced transaction manager and a no less advanced block allocator. In particular, you need to make sure that nobody occupies the other parts of your partially deallocated erase units while you are issuing precise discard requests for them (otherwise, data corruption is possible).
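As a minimal sketch of the two policies compared above (Python; the EUS/EUO values and the is_free space-map query are hypothetical illustrations, not Reiser4's actual interfaces), the lazy and precise requests differ only in whether the partially freed head and tail units are checked against the space map:

```python
def lazy_discard_range(start, end, eus, euo):
    """Block layer's view of a lazy request for bytes [start, end):
    round the start up and the end down to erase-unit boundaries EUO + N*EUS."""
    a = euo + -(-(start - euo) // eus) * eus  # first boundary >= start
    b = euo + ((end - euo) // eus) * eus      # last boundary <= end
    return (a, b) if a < b else None          # no whole unit to trim

def precise_discard_range(start, end, eus, euo, is_free):
    """File system's precise request: additionally discard the head unit
    [A - EUS, A] and/or the tail unit [B, B + EUS] when the space map says
    the padding bytes are also free. is_free(lo, hi) is a hypothetical
    query into the file system's space map."""
    r = lazy_discard_range(start, end, eus, euo)
    if r is None:
        return None
    a, b = r
    if a - eus >= euo and is_free(a - eus, start):  # head padding free?
        a -= eus
    if is_free(end, b + eus):                       # tail padding free?
        b += eus
    return a, b
```

With both neighbors busy this reproduces the rounded-down lazy range [A, B]; with both paddings free it also discards the two partial units, i.e. [A - EUS, B + EUS].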
The Reiser4 block allocator manages the following in-memory data structures:
* the working space map (W)
* the commit space map (C)
* the deallocation set (D)

Allocation in Reiser4 always goes through the working space map:

 (1) W' = alloc(W, R);

which allocates a set R of block numbers in the working space map W. Deallocation is a bit more complicated: all freed block numbers are first recorded in a special data structure, the deallocation set D:

 (2) D' = dealloc(D, R);

Before committing a transaction we update the commit space map C in the so-called pre_commit_hook():

 (3) C' = apply(C, D);

After committing the transaction, that is, after issuing all write requests (including the commit space map C), we prepare and issue discard requests in the so-called post_write_back_hook():

 (4) prepare_and_issue_discard_requests();

After issuing the discard requests we update the working space map:

 (5) W' = apply(W, D');

== Handling paddings of partially freed erase units ==

When preparing the discard set at stage (4) we check the head (tail) padding of every partial erase unit. If it is free, we allocate it in the working space map:

 W" = alloc(W', R');

At the same time we record the allocated paddings in the deallocation set:

 D" = dealloc(D', R');

Updating the working space map at stage (5) then automatically deallocates the paddings:

 apply(W", D") = W" \ R' = (W' + R') \ R' = W'.

== How to test ==

Find out the erase unit size and alignment of your SSD partition. Apply the [http://sourceforge.net/projects/reiser4/files/patches/3.17.3-reiser4-precise-discard-support.patch.gz patch] against reiser4-for-3.17.3. Format a reiser4 partition with reiser4progs-1.0.9; use the mkfs.reiser4 option -d to "discard" the whole partition on your SSD drive at format time. We recommend using transparent compression for SSD drives (it is turned on by default when formatting with mkfs.reiser4).
Mount the reiser4 partition with the mount option "discard", specifying (in bytes) the erase unit size and alignment of your partition with the mount options "discard.unit" and "discard.offset" respectively. For example, if the erase unit size of your partition is 1536K and the alignment is 0, the mount command should look like this:

 mount /dev/sdX -o discard,discard.unit=1572864,discard.offset=0 /mnt

'''WARNING: An incorrectly specified erase unit size or alignment will lead to data corruption!'''

Look for a kernel message confirming discard support:

 reiser4: sdX: enable discard support (erase unit 1572864 bytes, alignment 0 bytes)

We recommend using the Write-Anywhere (AKA Copy-On-Write) transaction model for SSD drives (mount option "txmod=wa"). We also recommend the mount option "noatime" for SSD drives.

[[category:Reiser4]]
If the block #1 is busy and block #7 is free, then, in accordance with precise discard policy, the file system issues "precise" discard request [A, B + EUS]. Note that we must not discard the unit [A - EUS, A], since it contains bytes of the busy block #1. Also, note that we don't need to discard other units due to the assumption, that before deallocation disk didn't contain garbage. Thus, we have that discard requests have been issued for every erase unit, which is marked as free in the file system space map. By the similar way we can prove that "precise" discard policy doesn't leave garbage on disk in other 3 cases (when block #1 is free and block #7 is busy, both blocks #1 and #7 are free, and both blocks #1 and #7 are busy. == Implementation of precise discard == The straightforward solution is to check the status of partially deallocated erase units in the file system's space map. However, efficient implementation of such solution requires an advanced transaction manager and not less advanced block allocator. In particular, you need to make sure that nobody will occupy the other parts of your partially deallocated erase units while you are issuing precise discard requests for them (otherwise, data corruption is possible). Reiser4 block allocator manages the following in-memory data-structures: * working space map (W) * commit space map (C) * deallocation set (D) Allocation in Reise4 is always going on the working space map: (1) W' = alloc(W, R); - allocate a set R of block numbers in the working space map W. 
Deallocation is a bit more complicated: all freed block numbers at first are recorded in a special data structure - deallocation set D: (2) D' = dealloc(D, R); Before committing a transaction we update the commit space map C at so-called pre_commit_hook(): (3) C' = apply(C, D); After committing the transaction, that is after issuing all write requests (including the commit space map C) we prepare and issue discard requests in so-called post_write_back_hook(): (4) prepare_and_issue_discard_requests(); After issuing discard requests we update the working space map: (5) W' = apply(W, D'); == Handling paddings of partially freed erase units == When preparing discard set at stage (4) we check head (tail) padding of every partial erase unit. If it is free, we allocate it at the working space map: W" = alloc(W', R'); At the same time we record the allocated paddings to the deallocation set: D" = dealloc(D', R'); Updating the working space map at the stage (5) automatically deallocates the paddings: apply(W", D") = W" \ R' = (W' + R')\ R'= W'. == How to test == Apply the [http://sourceforge.net/projects/reiser4/files/patches/3.17.3-reiser4-precise-discard-support.patch.gz patch] against reiser4-for-3.17.3. Format a reiser4 partition with reiser4progs-1.0.9. Use mkfs.reiser4 option -d to "discard" the whole partition on your SSD drive at format time. We recommend to use transparent compression for SSD drives (by default it is turned on when formatting with mkfs.reiser4). Mount a reiser4 partition with mount option "discard". Find a kernel message about discard support: reiser4: sdX: enable discard support (erase unit Y bytes, alignment Z bytes) We recommend to use Write-Anywhere (AKA Copy-On-Write) transaction model for SSD drives (mount option "txmod=wa"). Also we recommend to use mount option "noatime" for SSD drives. 
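The erase unit size and alignment reported in the kernel message above determine how discard requests are rounded to erase-unit boundaries. The arithmetic behind "lazy" versus "precise" discard requests can be sketched as follows (a simplified model with illustrative names, not the kernel implementation; extents are half-open byte ranges and `start >= EUO` is assumed):

```python
# Minimal model of discard-request arithmetic (illustrative, not reiser4 code).
# eus = erase unit size, euo = alignment, both in bytes, as reported in the
# kernel message above.  Extents are half-open byte ranges [start, end).

def _ceil(x, m):
    return -(-x // m)

def lazy_discard(start, end, eus, euo):
    """What the block layer trims for a "lazy" request: the start is
    rounded up and the end rounded down to erase-unit boundaries, so
    partially covered erase units are silently skipped."""
    lo = euo + _ceil(start - euo, eus) * eus
    hi = euo + ((end - euo) // eus) * eus
    return (lo, hi) if lo < hi else None   # None: nothing left to trim

def precise_discard(start, end, eus, euo, byte_is_free):
    """Precise policy: pad the request out to whole erase units wherever
    the file system's space map says the padding bytes are free."""
    lo = euo + ((start - euo) // eus) * eus        # head unit boundary
    hi = euo + _ceil(end - euo, eus) * eus         # tail unit boundary
    if lo < start and not all(byte_is_free(b) for b in range(lo, start)):
        lo = euo + _ceil(start - euo, eus) * eus   # head padding busy
    if end < hi and not all(byte_is_free(b) for b in range(end, hi)):
        hi = euo + ((end - euo) // eus) * eus      # tail padding busy
    return (lo, hi) if lo < hi else None
```

For example, with an erase unit of 8 bytes and zero alignment, a lazy request for bytes [2, 14) trims nothing at all, while the precise policy trims the whole range [0, 16) when the padding bytes are free in the space map.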
Proxy Device Administration

'''Proxy device''' of a logical volume is a brick which doesn't participate in regular data distribution, but has priority when allocating addresses for new data. In other words, any new data get to the proxy device first. All data on the proxy device are subject to flushing by a user-defined strategy (LFU, LRU, ARC, FIFO, MRU, etc.). The flushed data get automatically distributed over the rest of the volume in the regular way defined on it. Proxy bricks are supported by reiser5 (reiser4 experimental format 5.1.3). See [[Logical_Volumes_Howto|Logical Volumes stuff]] for more details.
Before working with proxy devices you need to understand the basic principles of reiser4 logical volumes, including the [[Logical_Volumes_Background|background]] and [[Logical_Volumes_Administration|administration]] bits.

= Adding a proxy device to a logical volume =

At any time you can add a proxy device to your logical volume. Currently only one proxy device per logical volume is supported. There is also a restriction that the volume must not be marked as a "volume with incomplete removal". In addition, a device which participates in regular data distribution cannot be a proxy device. After adding a proxy device to a logical volume, the capacity of the latter increases by exactly the capacity of that device. The operation of adding a proxy device automatically turns on the specified tiering policy. Currently there is a single one: Burst Buffers. Thus, once added to a logical volume, the proxy device gets absolute priority in block allocations (including journal blocks, i.e. wandering logs). Recall that in reiser4 all disk allocations are "delayed", that is, performed at commit time. In all other respects a proxy device is an ordinary brick, like the other devices-components of your logical volume. Before being added to a logical volume, a proxy device should be formatted like other bricks. Respectively, at format time you need to specify the UUID and stripe size of your logical volume:

 mkfs.reiser4 -U UUID -t STRIPE_SIZE DEV

In order to add a proxy device to your logical volume simply execute

 volume.reiser4 -x DEV MNT

where DEV is the name of your properly formatted proxy device and MNT is the mount point of your logical volume. The procedure of adding a proxy device is always quick and is not accompanied by any data migration.

= Flushing a proxy device =

After being added to your logical volume, the proxy device automatically becomes the home of all new allocations and hence steadily fills with data. So the user should take care of flushing the proxy device.
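The allocation priority described above can be modeled roughly as follows (a simplified sketch with illustrative names, not the actual reiser4 block allocator; the fallback stands in for the volume's regular distribution):

```python
# Simplified model: new data go to the proxy brick while it has room;
# otherwise allocation falls back to the regular distribution over the
# remaining bricks (crudely stood in for here by "most free space").

def pick_brick(nblocks, proxy_free, regular_free):
    """proxy_free: free blocks on the proxy brick;
    regular_free: dict mapping brick name -> free blocks."""
    if proxy_free >= nblocks:
        return "proxy"                      # absolute priority for new data
    # proxy is full: fall back to the regular distribution
    return max(regular_free, key=regular_free.get)
```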
Without it, the free space on your proxy device will soon run out, and the advantage of having a proxy device will disappear. Once the proxy device runs out of free space, all allocations automatically happen on the rest of the volume. Note, however, that by default this leads to a significant performance drop (because of attempts to commit all transactions at every write iteration). Optionally it is possible to choose a mode without any performance drop; however, in that mode proxy disk space will be used less efficiently. In any case, don't allow your proxy device to become completely filled with data, as that will nullify the advantage of having a proxy device! Flushing a proxy device is performed via the common migration procedure (the same procedure is used to restore fair distribution on the volume after adding/removing a device to/from the volume). So, in order to flush your proxy device, just execute

 volume.reiser4 -b MNT

where MNT is the mount point of your logical volume. Like every user-space application, the flushing procedure may return an error. The list of "regular" errors is:

* EBUSY: the flushing procedure gave way to some other process of higher priority in the competition for resources (usually long-term locks on the storage tree);
* ENOMEM: not enough memory for some flushing subroutine;
* ENOSPC: not enough space on the main storage.

If EBUSY is returned, just repeat the flushing procedure. In the other cases, put in some effort (free some disk space on your main storage, etc.) to make sure that such errors won't happen next time, and then repeat the flushing procedure. If the flushing procedure was interrupted for some reason (e.g. system crash, hard reset), simply repeat it in the next mount session for this logical volume. Once the Burst Buffers stuff becomes stable, we'll implement automatic flushing of the proxy device by a special kernel thread, which gets woken up every time some block allocation happens on the proxy device.
At the moment, we are making the user responsible for this. With the currently existing interface it is also possible to organize flushing efficiently. In the simplest case it can be done by various scripts like the following one:

 while true; do
     sync
     volume.reiser4 -b /mnt
     sleep 60
 done

A smarter script would check the space occupied by data on the proxy brick before flushing, etc.

= Removing a proxy device from a logical volume =

At any time you can remove the proxy device from your logical volume. This assumes that no volume operations like adding/removing a device to/from this logical volume are in progress at the moment of proxy device removal. Removing a proxy device is quite similar to removing a usual device from the logical volume. In particular, the removal operation always completes with data migration from the proxy device to be removed to the other devices-components of your logical volume. Before removing a proxy device, make sure that there is enough space on the other devices-components of your logical volume. Note that the disk space of the meta-data brick is not counted when the latter is not a member of the DSA (Data Storage Array), i.e. is used to store meta-data only. To remove the proxy device from your logical volume simply execute

 volume.reiser4 -r DEV MNT

where DEV is the name of the proxy device to be removed and MNT is the mount point of your logical volume. The procedure of removing a proxy device can return errors:

* EBUSY: the procedure of data migration gave way to some other process of higher priority in the competition for resources (usually long-term locks on the storage tree);
* ENOMEM: not enough memory for some data migration subroutine;
* ENOSPC: not enough space on the other devices.

These errors should be handled in the same way as in the case of [[Proxy_Device_Administration#Flushing_a_proxy_device|flushing]].
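The error-handling advice above (repeat on EBUSY; fix the condition and re-run otherwise) can be sketched as a small retry wrapper. How volume.reiser4 reports errors is an assumption here, and the `run` callable is injected so the policy can be shown without the tool itself:

```python
# Sketch of the suggested retry policy for flushing (and removal).
# ASSUMPTION: run() returns 0 on success or a positive errno on failure,
# e.g. by invoking `volume.reiser4 -b MNT` and mapping its exit status.
import errno
import time

def flush_until_done(run, retries=10, delay=1.0):
    for _ in range(retries):
        err = run()
        if err == 0:
            return True             # proxy device flushed
        if err == errno.EBUSY:
            time.sleep(delay)       # lost the lock competition: just retry
            continue
        # ENOMEM / ENOSPC: the user must free resources first,
        # then repeat the flushing procedure
        raise OSError(err, "fix the condition and re-run the flush")
    return False
```

A cron job or the loop script above could call such a wrapper instead of invoking the flush blindly.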
If procedure of removing a proxy device was interrupted because of some reasons (system crash, or hard reset), then just follow [[Logical_Volumes_Administration#Deploying_a_logical_volume_after_hard_reset_or_system_crash|instructions]] on deploying a logical volume after interrupted device removal. = Monitoring a proxy device = Monitoring a proxy device is performed by usual means, see e.g. [[Logical_Volumes_Administration#LV_monitoring|this]]. In order to check free space on your proxy device execute volume.reiser4 MNT -p N where N is serial number of the proxy device in your logical volume mounted at MNT. Don't forget to sync dirty pages before this! Number of busy data blocks can be found as difference (blocks used - system blocks). = Using meta-data brick as proxy device = It is possible to use meta-data brick of your logical volume as a proxy device. Before this make sure that it is not used to store data (otherwise, operation on its adding as proxy device will fail). To check it simply execute: volume.reiser4 MNT -p 0 and check value of the field "in DSA". It should be "No". Otherwise, remove meta-data brick from data storage array by executing volume.reiser4 -r MTD_NAME MNT where MTD_NAME is device name of the meta-data brick, MNT is mount point of your logical volume. After removal completion add the meta-data brick as proxy device: volume.reiser4 -x MTD_NAME MNT WARNING: When using meta-data brick as proxy device, requirements on [[Proxy_Device_Administration#Flushing_a_proxy_device|flushing]] are especially high, because in the case of no free space on meta-data brick you are not able to create new files on your logical volume. [[category:Reiser4]] 33763e343bfa721e68a2029cb7b3bfa234237c34 4466 4465 2022-07-13T08:05:40Z Edward 4 summary update '''Proxy device''' of a logical volume is a brick, which doesn't participate in regular data distribution, but has a priority when allocating addresses for new data. 
In other words, any new data will get first to the proxy device. All data on the proxy device is a subject of flushing by any user-defined strategy (LFU, LRU, ARC, FIFO, MRU, etc). The flushed data get automatically distributed on the rest of the volume by a regular way defined on it. Proxy bricks are supported by reiser5 (reiser4 experimental format 5.1.3). See [[Logical_Volumes_Howto|Logical Volumes stuff]] for more details. Before working with proxy devices you need to understand basic principles of reiser4 logical volumes, including [[Logical_Volumes_Background|background]] and [[Logical_Volumes_Administration|administration]] bits. = Adding a proxy device to a logical volume = At any time you are able to add a proxy device to your logical volume. Currently only one proxy device per logical volume is supported. Also there a restriction that the volume shouldn't be marked as a "volume with incomplete removal". Also any device, which participates in regular data distribution can not be a proxy device. After adding proxy device to logical volume, capacity of the last one gets increased precisely on the capacity of that device. Operation of adding a proxy device will automatically turn on specified tiering policy. Currently there is a single one - Burst Buffers. Thus, once being added to a logical volume, proxy device gets an absolute priority in block allocations (including journal blocks (wandering logs). I remind that in reiser4 all disk allocations are always "delayed", that is, performed at commit time. In other bits proxy device is the most ordinary brick like other devices-components of your logical volume. Before adding to a logical volume, proxy device should be formatted like other bricks. 
Respectively, at format time you need to specify UUID and stripe size of your logical volume: mkfs.reiser4 -U UUID -t STRIPE_SIZE DEV In order to add a proxy device to your logical volume simply execute volume.reiser4 -x DEV MNT where DEV is the name of your properly formatted proxy device, MNT is mount point of your logical volume. The procedure of adding a proxy device is always quick and is not accompanied with any data migration. = Flushing a proxy device = After being added to your logical volume, the proxy device automatically becomes a home of all new allocations, and hence, persistently gets filled with data. So user needs to take care on flushing data to the main storage. Without it the free space on your proxy device will end soon, and the advantage of having a proxy device will disappear. Once free space on the proxy device gets ended, all allocations will automatically happen on the main volume. Note, however, that by default it will drop the overall performance (because of trying to commit all transactions at every write iteration). Optionally it is possible to choose a mode without any performance drop, however, in that mode proxy disk space will be used less efficiently. Anyway, don't allow your proxy device to be completely filled with data, as it will nullify all the advantages of having a proxy-device! Flushing proxy device can be performed via common migration procedure (the same procedure is used to migrate data when adding/removing a device to/from a logical volume). So, in order to flush your proxy device, just execute volume.reiser4 -b MNT where MNT is mount point of your logical volume. Like every user-space application the flushing procedure may return error. 
The list of "regular" errors is: EBUSY means that flushing procedure gave way to some other process of higher priority in the competition for resources (usually long-term locks on the storage tree) ENOMEM means not enough memory for some flushing subroutine ENOSPC means not enough space on main storage In case of returned EBUSY you just need to repeat the flushing procedure. In other cases put efforts (free some disk space on your main storage, etc) to make sure that such errors won't happen next time and then repeat the flushing procedure. If the flushing procedure was interrupted for some reason (e.g. system crash, hard reset), then simply repeat it in the next mount session for this logical volume. Once the Burst Buffers stuff becomes stable, we'll implement automatic flushing of proxy device by a special kernel thread, which gets woken up every time when some block allocation happens on the proxy device. At the moment, we are making the user responsible for this. With currently existing interface it is also possible to organize flushing efficiently. In the simplest case it can be done by various scripts like the following one: while true; do do sync volume.reiser4 -b /mnt sleep 60 done Smarter script would check space occupied by data on the proxy brick before flushing, etc. = Removing a proxy device from a logical volume = At any time you are able to remove proxy device from your logical volume. It is in the assumption that no volume operations like adding/removing a device to/from this logical volume are in progress at the moment of proxy device removal. Removing a proxy device is absolutely similar to removing usual device form the logical volume. In particular, removal operation is always completed with data migration from the proxy-device to be removed to other devices-components of your logical volume. Before removing a proxy device make sure that there are enough space on other devices-components of your logical volume. 
Note that disk space of meta-data brick is not counted in the case when the last one is not a member of DSA (Data Storage Array), i.e. is used to store meta-data only. To remove proxy device from your logical volume simply execute volume.reiser4 -r DEV MNT where DEV is name of proxy device to be removed, MNT is mount point of your logical volume. The procedure of removing a proxy device can return errors: EBUSY means that the procedure of data migration gave way to some other process of higher priority in the competition for resources (usually long-term locks on the storage tree) ENOMEM means not enough memory for some data migration subroutine ENOSPC means not enough space on other devices The mentioned errors should be handled in the same way that in the case of [[Proxy_Device_Administration#Flushing_a_proxy_device|flushing]]. If procedure of removing a proxy device was interrupted because of some reasons (system crash, or hard reset), then just follow [[Logical_Volumes_Administration#Deploying_a_logical_volume_after_hard_reset_or_system_crash|instructions]] on deploying a logical volume after interrupted device removal. = Monitoring a proxy device = Monitoring a proxy device is performed by usual means, see e.g. [[Logical_Volumes_Administration#LV_monitoring|this]]. In order to check free space on your proxy device execute volume.reiser4 MNT -p N where N is serial number of the proxy device in your logical volume mounted at MNT. Don't forget to sync dirty pages before this! Number of busy data blocks can be found as difference (blocks used - system blocks). = Using meta-data brick as proxy device = It is possible to use meta-data brick of your logical volume as a proxy device. Before this make sure that it is not used to store data (otherwise, operation on its adding as proxy device will fail). To check it simply execute: volume.reiser4 MNT -p 0 and check value of the field "in DSA". It should be "No". 
Otherwise, remove the meta-data brick from the data storage array by executing

 volume.reiser4 -r MTD_NAME MNT

where MTD_NAME is the device name of the meta-data brick and MNT is the mount point of your logical volume. After the removal completes, add the meta-data brick as a proxy device:

 volume.reiser4 -x MTD_NAME MNT

WARNING: When using the meta-data brick as a proxy device, the requirements on [[Proxy_Device_Administration#Flushing_a_proxy_device|flushing]] are especially high, because with no free space left on the meta-data brick you are not able to create new files on your logical volume.

[[category:Reiser4]]
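The flushing section above mentions a "smarter" script that checks the space occupied on the proxy brick before flushing. A minimal sketch of such a loop is below; note that the awk patterns used to parse the <code>volume.reiser4 -p</code> report, the field names, and the <code>--run</code> guard are all assumptions, so adjust them to the actual report format on your system:

```shell
#!/bin/sh
# Sketch of a "smarter" proxy flushing loop.
# ASSUMPTION: the parsing of the `volume.reiser4 MNT -p N` report below
# is hypothetical -- check the actual output format before relying on it.

MNT=/mnt
PROXY_ID=1          # serial number of the proxy brick in the volume
THRESHOLD_PCT=50    # flush once the proxy brick is more than 50% full

# proxy_used_pct USED TOTAL -> integer percentage of used blocks
proxy_used_pct() {
    echo $(( $1 * 100 / $2 ))
}

flush_loop() {
    while true; do
        sync    # push dirty pages so the block counters are up to date
        report=$(volume.reiser4 "$MNT" -p "$PROXY_ID")
        used=$(echo "$report"  | awk '/blocks used/  {print $NF}')
        total=$(echo "$report" | awk '/total blocks/ {print $NF}')
        if [ "$(proxy_used_pct "$used" "$total")" -ge "$THRESHOLD_PCT" ]; then
            volume.reiser4 -b "$MNT"
        fi
        sleep 60
    done
}

# Only start the endless loop when invoked with --run, so the helper
# above can be sourced and tested on its own.
if [ "${1:-}" = "--run" ]; then
    flush_loop
fi
```

Flushing only past a threshold avoids committing transactions every minute when the proxy brick is mostly empty, at the cost of letting more dirty data accumulate between flushes.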
After adding proxy device to logical volume, capacity of the last one gets increased precisely on the capacity of that device. Operation of adding a proxy device will automatically turn on specified tiering policy. Currently there is a single one - Burst Buffers. Thus, once being added to a logical volume, proxy device gets an absolute priority in block allocations (including journal blocks (wandering logs). I remind that in reiser4 all disk allocations are always "delayed", that is, performed at commit time. In other bits proxy device is the most ordinary brick like other devices-components of your logical volume. Before adding to a logical volume, proxy device should be formatted like other bricks. Respectively, at format time you need to specify UUID and stripe size of your logical volume: mkfs.reiser4 -U UUID -t STRIPE_SIZE DEV In order to add a proxy device to your logical volume simply execute volume.reiser4 -x DEV MNT where DEV is the name of your properly formatted proxy device, MNT is mount point of your logical volume. The procedure of adding a proxy device is always quick and is not accompanied with any data migration. = Flushing a proxy device = After being added to your logical volume, the proxy device automatically becomes a home of all new allocations, and hence, persistently gets filled with data. So user needs to take care on flushing data to the main storage. Without it the free space on your proxy device will end soon, and the advantage of having a proxy device will disappear. Once free space on the proxy device gets ended, all allocations will automatically happen on the main volume. Note, however, that by default it will drop the overall performance (because of trying to commit all transactions at every write iteration). Optionally it is possible to choose a mode without any performance drop, however, in that mode proxy disk space will be used less efficiently. 
Anyway, don't allow your proxy device to be completely filled with data, as it will nullify all the advantages of having a proxy-device! Flushing proxy device can be performed via common migration procedure (the same procedure is used to migrate data when adding/removing a device to/from a logical volume). So, in order to flush your proxy device, just execute volume.reiser4 -b MNT where MNT is mount point of your logical volume. Like every user-space application the flushing procedure may return error. The list of "regular" errors is: EBUSY means that flushing procedure gave way to some other process of higher priority in the competition for resources (usually long-term locks on the storage tree) ENOMEM means not enough memory for some flushing subroutine ENOSPC means not enough space on main storage In case of returned EBUSY you just need to repeat the flushing procedure. In other cases put efforts (free some disk space on your main storage, etc) to make sure that such errors won't happen next time and then repeat the flushing procedure. If the flushing procedure was interrupted for some reason (e.g. system crash, hard reset), then simply repeat it in the next mount session for this logical volume. Once the Burst Buffers stuff becomes stable, we'll implement automatic flushing of proxy device by a special kernel thread, which gets woken up every time when some block allocation happens on the proxy device. At the moment, we are making the user responsible for this. With currently existing interface it is also possible to organize flushing efficiently. In the simplest case it can be done by various scripts like the following one: while true; do do sync volume.reiser4 -b /mnt sleep 60 done Smarter script would check space occupied by data on the proxy brick before flushing, etc. = Removing a proxy device from a logical volume = At any time you are able to remove proxy device from your logical volume. 
It is in the assumption that no volume operations like adding/removing a device to/from this logical volume are in progress at the moment of proxy device removal. Removing a proxy device is absolutely similar to removing usual device form the logical volume. In particular, removal operation is always completed with data migration from the proxy-device to be removed to other devices-components of your logical volume. Before removing a proxy device make sure that there are enough space on other devices-components of your logical volume. Note that disk space of meta-data brick is not counted in the case when the last one is not a member of DSA (Data Storage Array), i.e. is used to store meta-data only. To remove proxy device from your logical volume simply execute volume.reiser4 -r DEV MNT where DEV is name of proxy device to be removed, MNT is mount point of your logical volume. The procedure of removing a proxy device can return errors: EBUSY means that the procedure of data migration gave way to some other process of higher priority in the competition for resources (usually long-term locks on the storage tree) ENOMEM means not enough memory for some data migration subroutine ENOSPC means not enough space on other devices The mentioned errors should be handled in the same way that in the case of [[Proxy_Device_Administration#Flushing_a_proxy_device|flushing]]. If procedure of removing a proxy device was interrupted because of some reasons (system crash, or hard reset), then just follow [[Logical_Volumes_Administration#Deploying_a_logical_volume_after_hard_reset_or_system_crash|instructions]] on deploying a logical volume after interrupted device removal. = Monitoring a proxy device = Monitoring a proxy device is performed by usual means, see e.g. [[Logical_Volumes_Administration#LV_monitoring|this]]. In order to check free space on your proxy device execute volume.reiser4 MNT -p N where N is serial number of the proxy device in your logical volume mounted at MNT. 
Don't forget to sync dirty pages before this! Number of busy data blocks can be found as difference (blocks used - system blocks). = Using meta-data brick as proxy device = It is possible to use meta-data brick of your logical volume as a proxy device. Before this make sure that it is not used to store data (otherwise, operation on its adding as proxy device will fail). To check it simply execute: volume.reiser4 MNT -p 0 and check value of the field "in DSA". It should be "No". Otherwise, remove meta-data brick from data storage array by executing volume.reiser4 -r MTD_NAME MNT where MTD_NAME is device name of the meta-data brick, MNT is mount point of your logical volume. After removal completion add the meta-data brick as proxy device: volume.reiser4 -x MTD_NAME MNT WARNING: When using meta-data brick as proxy device, requirements on [[Proxy_Device_Administration#Flushing_a_proxy_device|flushing]] are especially high, because in the case of no free space on meta-data brick you are not able to create new files on your logical volume. [[category:Reiser4]] 84b302c69fee50c6e0f724e27d73c39baf4fdffe 4464 4463 2022-07-13T07:56:52Z Edward 4 summary update '''Proxy device''' of a logical volume is a brick, which doesn't participate in regular data distribution, but has a priority when allocating addresses for data stripes. In other words, any new data will get first to the proxy device. All data on the proxy device is a subject of flushing by any user-defined strategy (LFU, LRU, ARC, FIFO, MRU, etc). Proxy bricks are supported by reiser5 (reiser4 experimental format 5.1.3). See [[Logical_Volumes_Howto|Logical Volumes stuff]] for more details. Before working with proxy devices you need to understand basic principles of reiser4 logical volumes, including [[Logical_Volumes_Background|background]] and [[Logical_Volumes_Administration|administration]] bits. = Adding a proxy device to a logical volume = At any time you are able to add a proxy device to your logical volume. 
Currently only one proxy device per logical volume is supported. Also there a restriction that the volume shouldn't be marked as a "volume with incomplete removal". Also any device, which participates in regular data distribution can not be a proxy device. After adding proxy device to logical volume, capacity of the last one gets increased precisely on the capacity of that device. Operation of adding a proxy device will automatically turn on specified tiering policy. Currently there is a single one - Burst Buffers. Thus, once being added to a logical volume, proxy device gets an absolute priority in block allocations (including journal blocks (wandering logs). I remind that in reiser4 all disk allocations are always "delayed", that is, performed at commit time. In other bits proxy device is the most ordinary brick like other devices-components of your logical volume. Before adding to a logical volume, proxy device should be formatted like other bricks. Respectively, at format time you need to specify UUID and stripe size of your logical volume: mkfs.reiser4 -U UUID -t STRIPE_SIZE DEV In order to add a proxy device to your logical volume simply execute volume.reiser4 -x DEV MNT where DEV is the name of your properly formatted proxy device, MNT is mount point of your logical volume. The procedure of adding a proxy device is always quick and is not accompanied with any data migration. = Flushing a proxy device = After being added to your logical volume, the proxy device automatically becomes a home of all new allocations, and hence, persistently gets filled with data. So user needs to take care on flushing data to the main storage. Without it the free space on your proxy device will end soon, and the advantage of having a proxy device will disappear. Once free space on the proxy device gets ended, all allocations will automatically happen on the main volume. 
Note, however, that by default it will drop the overall performance (because of trying to commit all transactions at every write iteration). Optionally it is possible to choose a mode without any performance drop, however, in that mode proxy disk space will be used less efficiently. Anyway, don't allow your proxy device to be completely filled with data, as it will nullify all the advantages of having a proxy-device! Flushing proxy device can be performed via common migration procedure (the same procedure is used to migrate data when adding/removing a device to/from a logical volume). So, in order to flush your proxy device, just execute volume.reiser4 -b MNT where MNT is mount point of your logical volume. Like every user-space application the flushing procedure may return error. The list of "regular" errors is: EBUSY means that flushing procedure gave way to some other process of higher priority in the competition for resources (usually long-term locks on the storage tree) ENOMEM means not enough memory for some flushing subroutine ENOSPC means not enough space on main storage In case of returned EBUSY you just need to repeat the flushing procedure. In other cases put efforts (free some disk space on your main storage, etc) to make sure that such errors won't happen next time and then repeat the flushing procedure. If the flushing procedure was interrupted for some reason (e.g. system crash, hard reset), then simply repeat it in the next mount session for this logical volume. Once the Burst Buffers stuff becomes stable, we'll implement automatic flushing of proxy device by a special kernel thread, which gets woken up every time when some block allocation happens on the proxy device. At the moment, we are making the user responsible for this. With currently existing interface it is also possible to organize flushing efficiently. 
In the simplest case it can be done by various scripts like the following one: while true; do do sync volume.reiser4 -b /mnt sleep 60 done Smarter script would check space occupied by data on the proxy brick before flushing, etc. = Removing a proxy device from a logical volume = At any time you are able to remove proxy device from your logical volume. It is in the assumption that no volume operations like adding/removing a device to/from this logical volume are in progress at the moment of proxy device removal. Removing a proxy device is absolutely similar to removing usual device form the logical volume. In particular, removal operation is always completed with data migration from the proxy-device to be removed to other devices-components of your logical volume. Before removing a proxy device make sure that there are enough space on other devices-components of your logical volume. Note that disk space of meta-data brick is not counted in the case when the last one is not a member of DSA (Data Storage Array), i.e. is used to store meta-data only. To remove proxy device from your logical volume simply execute volume.reiser4 -r DEV MNT where DEV is name of proxy device to be removed, MNT is mount point of your logical volume. The procedure of removing a proxy device can return errors: EBUSY means that the procedure of data migration gave way to some other process of higher priority in the competition for resources (usually long-term locks on the storage tree) ENOMEM means not enough memory for some data migration subroutine ENOSPC means not enough space on other devices The mentioned errors should be handled in the same way that in the case of [[Proxy_Device_Administration#Flushing_a_proxy_device|flushing]]. 
If procedure of removing a proxy device was interrupted because of some reasons (system crash, or hard reset), then just follow [[Logical_Volumes_Administration#Deploying_a_logical_volume_after_hard_reset_or_system_crash|instructions]] on deploying a logical volume after interrupted device removal. = Monitoring a proxy device = Monitoring a proxy device is performed by usual means, see e.g. [[Logical_Volumes_Administration#LV_monitoring|this]]. In order to check free space on your proxy device execute volume.reiser4 MNT -p N where N is serial number of the proxy device in your logical volume mounted at MNT. Don't forget to sync dirty pages before this! Number of busy data blocks can be found as difference (blocks used - system blocks). = Using meta-data brick as proxy device = It is possible to use meta-data brick of your logical volume as a proxy device. Before this make sure that it is not used to store data (otherwise, operation on its adding as proxy device will fail). To check it simply execute: volume.reiser4 MNT -p 0 and check value of the field "in DSA". It should be "No". Otherwise, remove meta-data brick from data storage array by executing volume.reiser4 -r MTD_NAME MNT where MTD_NAME is device name of the meta-data brick, MNT is mount point of your logical volume. After removal completion add the meta-data brick as proxy device: volume.reiser4 -x MTD_NAME MNT WARNING: When using meta-data brick as proxy device, requirements on [[Proxy_Device_Administration#Flushing_a_proxy_device|flushing]] are especially high, because in the case of no free space on meta-data brick you are not able to create new files on your logical volume. [[category:Reiser4]] 9a58edfcd0342de79529b0cbaac93a88f394e198 4463 4445 2022-07-13T07:31:16Z Edward 4 '''Proxy device''' of a logical volume is a brick, which doesn't participate in regular data distribution, but has a priority when allocating addresses for data stripes. 
In other words, any new data will get first to the proxy device. Proxy bricks are supported by reiser5 (reiser4 experimental format 5.1.3). See [[Logical_Volumes_Howto|Logical Volumes stuff]] for more details. Before working with proxy devices you need to understand basic principles of reiser4 logical volumes, including [[Logical_Volumes_Background|background]] and [[Logical_Volumes_Administration|administration]] bits. = Adding a proxy device to a logical volume = At any time you are able to add a proxy device to your logical volume. Currently only one proxy device per logical volume is supported. Also there a restriction that the volume shouldn't be marked as a "volume with incomplete removal". Also any device, which participates in regular data distribution can not be a proxy device. After adding proxy device to logical volume, capacity of the last one gets increased precisely on the capacity of that device. Operation of adding a proxy device will automatically turn on specified tiering policy. Currently there is a single one - Burst Buffers. Thus, once being added to a logical volume, proxy device gets an absolute priority in block allocations (including journal blocks (wandering logs). I remind that in reiser4 all disk allocations are always "delayed", that is, performed at commit time. In other bits proxy device is the most ordinary brick like other devices-components of your logical volume. Before adding to a logical volume, proxy device should be formatted like other bricks. Respectively, at format time you need to specify UUID and stripe size of your logical volume: mkfs.reiser4 -U UUID -t STRIPE_SIZE DEV In order to add a proxy device to your logical volume simply execute volume.reiser4 -x DEV MNT where DEV is the name of your properly formatted proxy device, MNT is mount point of your logical volume. The procedure of adding a proxy device is always quick and is not accompanied with any data migration. 
= Flushing a proxy device = After being added to your logical volume, the proxy device automatically becomes a home of all new allocations, and hence, persistently gets filled with data. So user needs to take care on flushing data to the main storage. Without it the free space on your proxy device will end soon, and the advantage of having a proxy device will disappear. Once free space on the proxy device gets ended, all allocations will automatically happen on the main volume. Note, however, that by default it will drop the overall performance (because of trying to commit all transactions at every write iteration). Optionally it is possible to choose a mode without any performance drop, however, in that mode proxy disk space will be used less efficiently. Anyway, don't allow your proxy device to be completely filled with data, as it will nullify all the advantages of having a proxy-device! Flushing proxy device can be performed via common migration procedure (the same procedure is used to migrate data when adding/removing a device to/from a logical volume). So, in order to flush your proxy device, just execute volume.reiser4 -b MNT where MNT is mount point of your logical volume. Like every user-space application the flushing procedure may return error. The list of "regular" errors is: EBUSY means that flushing procedure gave way to some other process of higher priority in the competition for resources (usually long-term locks on the storage tree) ENOMEM means not enough memory for some flushing subroutine ENOSPC means not enough space on main storage In case of returned EBUSY you just need to repeat the flushing procedure. In other cases put efforts (free some disk space on your main storage, etc) to make sure that such errors won't happen next time and then repeat the flushing procedure. If the flushing procedure was interrupted for some reason (e.g. system crash, hard reset), then simply repeat it in the next mount session for this logical volume. 
Once the Burst Buffers stuff becomes stable, we'll implement automatic flushing of proxy device by a special kernel thread, which gets woken up every time when some block allocation happens on the proxy device. At the moment, we are making the user responsible for this. With currently existing interface it is also possible to organize flushing efficiently. In the simplest case it can be done by various scripts like the following one: while true; do do sync volume.reiser4 -b /mnt sleep 60 done Smarter script would check space occupied by data on the proxy brick before flushing, etc. = Removing a proxy device from a logical volume = At any time you are able to remove proxy device from your logical volume. It is in the assumption that no volume operations like adding/removing a device to/from this logical volume are in progress at the moment of proxy device removal. Removing a proxy device is absolutely similar to removing usual device form the logical volume. In particular, removal operation is always completed with data migration from the proxy-device to be removed to other devices-components of your logical volume. Before removing a proxy device make sure that there are enough space on other devices-components of your logical volume. Note that disk space of meta-data brick is not counted in the case when the last one is not a member of DSA (Data Storage Array), i.e. is used to store meta-data only. To remove proxy device from your logical volume simply execute volume.reiser4 -r DEV MNT where DEV is name of proxy device to be removed, MNT is mount point of your logical volume. 
The procedure of removing a proxy device can return errors: EBUSY means that the procedure of data migration gave way to some other process of higher priority in the competition for resources (usually long-term locks on the storage tree) ENOMEM means not enough memory for some data migration subroutine ENOSPC means not enough space on other devices The mentioned errors should be handled in the same way that in the case of [[Proxy_Device_Administration#Flushing_a_proxy_device|flushing]]. If procedure of removing a proxy device was interrupted because of some reasons (system crash, or hard reset), then just follow [[Logical_Volumes_Administration#Deploying_a_logical_volume_after_hard_reset_or_system_crash|instructions]] on deploying a logical volume after interrupted device removal. = Monitoring a proxy device = Monitoring a proxy device is performed by usual means, see e.g. [[Logical_Volumes_Administration#LV_monitoring|this]]. In order to check free space on your proxy device execute volume.reiser4 MNT -p N where N is serial number of the proxy device in your logical volume mounted at MNT. Don't forget to sync dirty pages before this! Number of busy data blocks can be found as difference (blocks used - system blocks). = Using meta-data brick as proxy device = It is possible to use meta-data brick of your logical volume as a proxy device. Before this make sure that it is not used to store data (otherwise, operation on its adding as proxy device will fail). To check it simply execute: volume.reiser4 MNT -p 0 and check value of the field "in DSA". It should be "No". Otherwise, remove meta-data brick from data storage array by executing volume.reiser4 -r MTD_NAME MNT where MTD_NAME is device name of the meta-data brick, MNT is mount point of your logical volume. 
After removal completion add the meta-data brick as proxy device: volume.reiser4 -x MTD_NAME MNT WARNING: When using meta-data brick as proxy device, requirements on [[Proxy_Device_Administration#Flushing_a_proxy_device|flushing]] are especially high, because in the case of no free space on meta-data brick you are not able to create new files on your logical volume. [[category:Reiser4]] ad330dd5738a3d173a84f54237bb840f1120f5b1 4445 4443 2020-11-12T16:59:30Z Chris goe 2 links wikified '''Proxy device''' of a logical volume is defined as a brick, which doesn't participate in regular data distribution, but has a priority when allocating addresses for data stripes. Proxy bricks are supported by reiser5 (reiser4 experimental format 5.1.3). See [[Logical_Volumes_Howto|Logical Volumes stuff]] for more details. Before working with proxy devices you need to understand basic principles of reiser4 logical volumes, including [[Logical_Volumes_Background|background]] and [[Logical_Volumes_Administration|administration]] bits. = Adding a proxy device to a logical volume = At any time you are able to add a proxy device to your logical volume. Currently only one proxy device per logical volume is supported. Also there a restriction that the volume shouldn't be marked as a "volume with incomplete removal". Also any device, which participates in regular data distribution can not be a proxy device. After adding proxy device to logical volume, capacity of the last one gets increased precisely on the capacity of that device. Operation of adding a proxy device will automatically turn on specified tiering policy. Currently there is a single one - Burst Buffers. Thus, once being added to a logical volume, proxy device gets an absolute priority in block allocations (including journal blocks (wandering logs). I remind that in reiser4 all disk allocations are always "delayed", that is, performed at commit time. 
In other bits proxy device is the most ordinary brick like other devices-components of your logical volume. Before adding to a logical volume, proxy device should be formatted like other bricks. Respectively, at format time you need to specify UUID and stripe size of your logical volume: mkfs.reiser4 -U UUID -t STRIPE_SIZE DEV In order to add a proxy device to your logical volume simply execute volume.reiser4 -x DEV MNT where DEV is the name of your properly formatted proxy device, MNT is mount point of your logical volume. The procedure of adding a proxy device is always quick and is not accompanied with any data migration. = Flushing a proxy device = After being added to your logical volume, the proxy device automatically becomes a home of all new allocations, and hence, persistently gets filled with data. So user needs to take care on flushing data to the main storage. Without it the free space on your proxy device will end soon, and the advantage of having a proxy device will disappear. Once free space on the proxy device gets ended, all allocations will automatically happen on the main volume. Note, however, that by default it will drop the overall performance (because of trying to commit all transactions at every write iteration). Optionally it is possible to choose a mode without any performance drop, however, in that mode proxy disk space will be used less efficiently. Anyway, don't allow your proxy device to be completely filled with data, as it will nullify all the advantages of having a proxy-device! Flushing proxy device can be performed via common migration procedure (the same procedure is used to migrate data when adding/removing a device to/from a logical volume). So, in order to flush your proxy device, just execute volume.reiser4 -b MNT where MNT is mount point of your logical volume. Like every user-space application the flushing procedure may return error. 
The list of "regular" errors is: EBUSY means that flushing procedure gave way to some other process of higher priority in the competition for resources (usually long-term locks on the storage tree) ENOMEM means not enough memory for some flushing subroutine ENOSPC means not enough space on main storage In case of returned EBUSY you just need to repeat the flushing procedure. In other cases put efforts (free some disk space on your main storage, etc) to make sure that such errors won't happen next time and then repeat the flushing procedure. If the flushing procedure was interrupted for some reason (e.g. system crash, hard reset), then simply repeat it in the next mount session for this logical volume. Once the Burst Buffers stuff becomes stable, we'll implement automatic flushing of proxy device by a special kernel thread, which gets woken up every time when some block allocation happens on the proxy device. At the moment, we are making the user responsible for this. With currently existing interface it is also possible to organize flushing efficiently. In the simplest case it can be done by various scripts like the following one: while true; do do sync volume.reiser4 -b /mnt sleep 60 done Smarter script would check space occupied by data on the proxy brick before flushing, etc. = Removing a proxy device from a logical volume = At any time you are able to remove proxy device from your logical volume. It is in the assumption that no volume operations like adding/removing a device to/from this logical volume are in progress at the moment of proxy device removal. Removing a proxy device is absolutely similar to removing usual device form the logical volume. In particular, removal operation is always completed with data migration from the proxy-device to be removed to other devices-components of your logical volume. Before removing a proxy device make sure that there are enough space on other devices-components of your logical volume. 
Note that disk space of meta-data brick is not counted in the case when the last one is not a member of DSA (Data Storage Array), i.e. is used to store meta-data only. To remove proxy device from your logical volume simply execute volume.reiser4 -r DEV MNT where DEV is name of proxy device to be removed, MNT is mount point of your logical volume. The procedure of removing a proxy device can return errors: EBUSY means that the procedure of data migration gave way to some other process of higher priority in the competition for resources (usually long-term locks on the storage tree) ENOMEM means not enough memory for some data migration subroutine ENOSPC means not enough space on other devices The mentioned errors should be handled in the same way that in the case of [[Proxy_Device_Administration#Flushing_a_proxy_device|flushing]]. If procedure of removing a proxy device was interrupted because of some reasons (system crash, or hard reset), then just follow [[Logical_Volumes_Administration#Deploying_a_logical_volume_after_hard_reset_or_system_crash|instructions]] on deploying a logical volume after interrupted device removal. = Monitoring a proxy device = Monitoring a proxy device is performed by usual means, see e.g. [[Logical_Volumes_Administration#LV_monitoring|this]]. In order to check free space on your proxy device execute volume.reiser4 MNT -p N where N is serial number of the proxy device in your logical volume mounted at MNT. Don't forget to sync dirty pages before this! Number of busy data blocks can be found as difference (blocks used - system blocks). = Using meta-data brick as proxy device = It is possible to use meta-data brick of your logical volume as a proxy device. Before this make sure that it is not used to store data (otherwise, operation on its adding as proxy device will fail). To check it simply execute: volume.reiser4 MNT -p 0 and check value of the field "in DSA". It should be "No". 
Otherwise, remove the meta-data brick from the data storage array by executing

 volume.reiser4 -r MTD_NAME MNT

where MTD_NAME is the device name of the meta-data brick and MNT is the mount point of your logical volume. After the removal completes, add the meta-data brick as a proxy device:

 volume.reiser4 -x MTD_NAME MNT

WARNING: When using the meta-data brick as a proxy device, the requirements on [[Proxy_Device_Administration#Flushing_a_proxy_device|flushing]] are especially high, because with no free space on the meta-data brick you are not able to create new files on your logical volume.

[[category:Reiser4]]

5973d24d31cf503a7d2412b7b6527126416a170f 4443 4442 2020-11-12T16:44:34Z Edward 4

'''Proxy device''' of a logical volume is defined as a brick which doesn't participate in regular data distribution, but has priority when allocating addresses for data stripes. Proxy bricks are supported by reiser5 (reiser4 experimental format 5.1.3). See [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Howto Logical Volumes stuff] for more details. Before working with proxy devices you need to understand the basic principles of reiser4 logical volumes, including the [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Background background] and [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration administration] bits.

= Adding a proxy device to a logical volume =

At any time you are able to add a proxy device to your logical volume. Currently only one proxy device per logical volume is supported. There is also a restriction that the volume shouldn't be marked as a "volume with incomplete removal". Furthermore, any device which participates in regular data distribution can not be a proxy device. After adding a proxy device to a logical volume, the capacity of the latter gets increased by precisely the capacity of that device. The operation of adding a proxy device automatically turns on the specified tiering policy. Currently there is a single one - Burst Buffers.
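As a concrete sketch of the two-command sequence described in detail below (formatting the brick, then attaching it with -x), the following wraps both steps in a dry-run helper. The UUID, device name and mount point are examples only, not taken from a real setup.

```shell
#!/bin/sh
# Sketch: format a brick and attach it as a proxy device.
# All names (UUID, device, mount point) are EXAMPLES.
# With DRY_RUN=1 the commands are only echoed, not executed.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then echo "$@"; else "$@"; fi
}

# add_proxy UUID STRIPE_SIZE DEV MNT
add_proxy() {
    uuid=$1; stripe=$2; dev=$3; mnt=$4
    # format with the UUID and stripe size of the target logical volume
    run mkfs.reiser4 -U "$uuid" -t "$stripe" "$dev"
    # attach the formatted brick as the proxy device (-x)
    run volume.reiser4 -x "$dev" "$mnt"
}

# Example: DRY_RUN=1 add_proxy <volume-uuid> 128K /dev/nvme0n1 /mnt
```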
Thus, once added to a logical volume, the proxy device gets absolute priority in block allocations (including journal blocks, i.e. wandering logs). Recall that in reiser4 all disk allocations are "delayed", that is, performed at commit time. In all other respects the proxy device is an ordinary brick like the other component devices of your logical volume. Before being added to a logical volume, a proxy device should be formatted like other bricks. Accordingly, at format time you need to specify the UUID and stripe size of your logical volume:

 mkfs.reiser4 -U UUID -t STRIPE_SIZE DEV

In order to add a proxy device to your logical volume simply execute

 volume.reiser4 -x DEV MNT

where DEV is the name of your properly formatted proxy device and MNT is the mount point of your logical volume. The procedure of adding a proxy device is always quick and is not accompanied by any data migration.

= Flushing a proxy device =

After being added to your logical volume, the proxy device automatically becomes the home of all new allocations, and hence persistently gets filled with data. So the user needs to take care of flushing data to the main storage. Without it, the free space on your proxy device will soon run out, and the advantage of having a proxy device will disappear. Once free space on the proxy device runs out, all allocations will automatically happen on the main volume. Note, however, that by default this will drop the overall performance (because of trying to commit all transactions at every write iteration). Optionally it is possible to choose a mode without any performance drop; however, in that mode proxy disk space will be used less efficiently. In any case, don't allow your proxy device to be completely filled with data, as that will nullify all the advantages of having a proxy device! Flushing a proxy device is performed via the common migration procedure (the same procedure is used to migrate data when adding/removing a device to/from a logical volume).
So, in order to flush your proxy device, just execute

 volume.reiser4 -b MNT

where MNT is the mount point of your logical volume. Like every user-space application, the flushing procedure may return an error.
Thus, once being added to a logical volume, proxy device gets an absolute priority in block allocations (including journal blocks (wandering logs). I remind that in reiser4 all disk allocations are always "delayed", that is, performed at commit time. In other bits proxy device is the most ordinary brick like other devices-components of your logical volume. Before adding to a logical volume, proxy device should be formatted like other bricks. Respectively, at format time you need to specify UUID and stripe size of your logical volume: mkfs.reiser4 -U UUID -t STRIPE_SIZE DEV In order to add a proxy device to your logical volume simply execute volume.reiser4 -x DEV MNT where DEV is the name of your properly formatted proxy device, MNT is mount point of your logical volume. The procedure of adding a proxy device is always quick and is not accompanied with any data migration. = Flushing a proxy device = After being added to your logical volume, the proxy device automatically becomes a home of all new allocations, and hence, persistently gets filled with data. So user needs to take care on flushing data to the main storage. Without it the free space on your proxy device will end soon, and the advantage of having a proxy device will disappear. Once free space on the proxy device gets ended, all allocations will automatically happen on the main volume. Note, however, that by default it will drop the overall performance (because of trying to commit all transactions at every write iteration). Optionally it is possible to choose a mode without any performance drop, however, in that mode proxy disk space will be used less efficiently. Anyway, don't allow your proxy device to be completely filled with data, as it will nullify all the advantages of having a proxy-device! Flushing proxy device can be performed via common migration procedure (the same procedure is used to migrate data when adding/removing a device to/from a logical volume). 
So, in order to flush your proxy device, just execute volume.reiser4 -b MNT where MNT is mount point of your logical volume. Like every user-space application the flushing procedure may return error. The list of "regular" errors is: EBUSY means that flushing procedure gave way to some other process of higher priority in the competition for resources (usually long-term locks on the storage tree) ENOMEM means not enough memory for some flushing subroutine ENOSPC means not enough space on main storage In case of returned EBUSY you just need to repeat the flushing procedure. In other cases put efforts (free some disk space on your main storage, etc) to make sure that such errors won't happen next time and then repeat the flushing procedure. If the flushing procedure was interrupted for some reason (e.g. system crash, hard reset), then simply repeat it in the next mount session for this logical volume. Once the Burst Buffers stuff becomes stable, we'll implement automatic flushing of proxy device by a special kernel thread, which gets woken up every time when some block allocation happens on the proxy device. At the moment, we are making the user responsible for this. With currently existing interface it is also possible to organize flushing efficiently. In the simplest case it can be done by various scripts like the following one: while true do sync volume.reiser4 -b /mnt sleep 60 done Smarter script would check space occupied by data on the proxy brick before flushing, etc. = Removing a proxy device from a logical volume = At any time you are able to remove proxy device from your logical volume. It is in the assumption that no volume operations like adding/removing a device to/from this logical volume are in progress at the moment of proxy device removal. Removing a proxy device is absolutely similar to removing usual device form the logical volume. 
In particular, removal operation is always completed with data migration from the proxy-device to be removed to other devices-components of your logical volume. Before removing a proxy device make sure that there are enough space on other devices-components of your logical volume. Note that disk space of meta-data brick is not counted in the case when the last one is not a member of DSA (Data Storage Array), i.e. is used to store meta-data only. To remove proxy device from your logical volume simply execute volume.reiser4 -r DEV MNT where DEV is name of proxy device to be removed, MNT is mount point of your logical volume. The procedure of removing a proxy device can return errors: EBUSY means that the procedure of data migration gave way to some other process of higher priority in the competition for resources (usually long-term locks on the storage tree) ENOMEM means not enough memory for some data migration subroutine ENOSPC means not enough space on other devices The mentioned errors should be handled in the same way that in the case of [https://reiser4.wiki.kernel.org/index.php/Proxy_Device_Administration#Flushing_a_proxy_device flushing]. If procedure of removing a proxy device was interrupted because of some reasons (system crash, or hard reset), then just follow [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Deploying_a_logical_volume_after_hard_reset_or_system_crash instructions] on deploying a logical volume after interrupted device removal. = Monitoring a proxy device = Monitoring a proxy device is performed by usual means, see e.g. [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#LV_monitoring this]. In order to check free space on your proxy device execute volume.reiser4 MNT -p N where N is serial number of the proxy device in your logical volume mounted at MNT. Don't forget to sync dirty pages before this! Number of busy data blocks can be found as difference (blocks used - system blocks). 
= Using meta-data brick as proxy device =

It is possible to use the meta-data brick of your logical volume as a proxy device. Before doing so, make sure that it is not used to store data (otherwise the operation of adding it as a proxy device will fail). To check this, simply execute

 volume.reiser4 MNT -p 0

and check the value of the field "in DSA". It should be "No". Otherwise, remove the meta-data brick from the data storage array by executing

 volume.reiser4 -r MTD_NAME MNT

where MTD_NAME is the device name of the meta-data brick and MNT is the mount point of your logical volume. Once the removal has completed, add the meta-data brick as a proxy device:

 volume.reiser4 -x MTD_NAME MNT

WARNING: When using the meta-data brick as a proxy device, the requirements on its [https://reiser4.wiki.kernel.org/index.php/Proxy_Device_Administration#Flushing_a_proxy_device flushing] are especially high, because when there is no free space on the meta-data brick you are not able to create new files on your logical volume.

[[category:Reiser4]]
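For the case where the meta-data brick is still in the DSA, the conversion boils down to the two volume.reiser4 invocations shown earlier. The helper below is a sketch introduced here (convert_metadata_brick is not part of reiser4progs); it routes each command through a runner so the sequence can be previewed before touching the volume.

```shell
#!/bin/sh
# convert_metadata_brick RUN MTD_NAME MNT: issue the two-step
# conversion, routing each command through RUN so the sequence
# can be previewed with RUN=echo instead of executed directly.
convert_metadata_brick() {
    run=$1 mtd=$2 mnt=$3
    "$run" volume.reiser4 -r "$mtd" "$mnt"   # leave the DSA first
    "$run" volume.reiser4 -x "$mtd" "$mnt"   # then add as proxy brick
}

# Preview the commands for a hypothetical brick /dev/sda2:
# convert_metadata_brick echo /dev/sda2 /mnt
```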
Also there a restriction that no volume operations (like adding/removing a device to/from this logical volume) should be in progress at the moment of adding a proxy device. Also any device, which participates in regular data distribution can not be a proxy device. After adding proxy device to logical volume, capacity of the last one gets increased precisely on the capacity of that proxy device. Operation of adding a proxyhttps://sourceforge.net/projects/reiser4/files/v5-unstable/progs/ device will automatically turn on specified tiering policy. Currently there is a single one - Burst Buffers. Thus, once being added to a logical volume, proxy device gets an absolute priority in block allocations (including journal blocks (wandering logs). I remind that in reiser4 all disk allocations are always "delayed", that is, performed at commit time. In other bits proxy device is the most ordinary brick like other devices-components of your logical volume. Before adding to a logical volume, proxy device should be formatted like other bricks. Respectively, at format time you need to specify UUID and stripe size of your logical volume: mkfs.reiser4 -U UUID -t STRIPE_SIZE DEV In order to add a proxy device to your logical volume simply execute volume.reiser4 -x DEV MNT where DEV is the name of your properly formatted proxy device, MNT is mount point of your logical volume. The procedure of adding a proxy device is always quick and is not accompanied with any data migration. = Flushing a proxy device = After being added to your logical volume, the proxy device automatically becomes a home of all new allocations, and hence, persistently gets filled with data. So user needs to take care on flushing data to the main storage. Without it the free space on your proxy device will end soon, and the advantage of having a proxy device will disappear. Once free space on the proxy device gets ended, all allocations will automatically happen on the main volume. 
Note, however, that by default it will drop the overall performance (because of trying to commit all transactions at every write iteration). Optionally it is possible to choose a mode without any performance drop, however, in that mode proxy disk space will be used less efficiently. Anyway, don't allow your proxy device to be completely filled with data, as it will nullify all the advantages of having a proxy-device! Flushing proxy device can be performed via common migration procedure (the same procedure is used to migrate data when adding/removing a device to/from a logical volume). So, in order to flush your proxy device, just execute volume.reiser4 -b MNT where MNT is mount point of your logical volume. Like every user-space application the flushing procedure may return error. The list of "regular" errors is: EBUSY means that flushing procedure gave way to some other process of higher priority in the competition for resources (usually long-term locks on the storage tree) ENOMEM means not enough memory for some flushing subroutine ENOSPC means not enough space on main storage In case of returned EBUSY you just need to repeat the flushing procedure. In other cases put efforts (free some disk space on your main storage, etc) to make sure that such errors won't happen next time and then repeat the flushing procedure. If the flushing procedure was interrupted for some reason (e.g. system crash, hard reset), then simply repeat it in the next mount session for this logical volume. Once the Burst Buffers stuff becomes stable, we'll implement automatic flushing of proxy device by a special kernel thread, which gets woken up every time when some block allocation happens on the proxy device. At the moment, we are making the user responsible for this. With currently existing interface it is also possible to organize flushing efficiently. 
In the simplest case it can be done by various scripts like the following one: while true do sync volume.reiser4 -b /mnt sleep 60 done Smarter script would check space occupied by data on the proxy brick before flushing, etc. = Removing a proxy device from a logical volume = At any time you are able to remove proxy device from your logical volume. It is in the assumption that no volume operations like adding/removing a device to/from this logical volume are in progress at the moment of proxy device removal. Removing a proxy device is absolutely similar to removing usual device form the logical volume. In particular, removal operation is always completed with data migration from the proxy-device to be removed to other devices-components of your logical volume. Before removing a proxy device make sure that there are enough space on other devices-components of your logical volume. Note that disk space of meta-data brick is not counted in the case when the last one is not a member of DSA (Data Storage Array), i.e. is used to store meta-data only. To remove proxy device from your logical volume simply execute volume.reiser4 -r DEV MNT where DEV is name of proxy device to be removed, MNT is mount point of your logical volume. The procedure of removing a proxy device can return errors: EBUSY means that the procedure of data migration gave way to some other process of higher priority in the competition for resources (usually long-term locks on the storage tree) ENOMEM means not enough memory for some data migration subroutine ENOSPC means not enough space on other devices The mentioned errors should be handled in the same way that in the case of [https://reiser4.wiki.kernel.org/index.php/Proxy_Device_Administration#Flushing_a_proxy_device flushing]. 
If procedure of removing a proxy device was interrupted because of some reasons (system crash, or hard reset), then just follow [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Deploying_a_logical_volume_after_hard_reset_or_system_crash instructions] on deploying a logical volume after interrupted device removal. = Monitoring a proxy device = Monitoring a proxy device is performed by usual means, see e.g. [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#LV_monitoring this]. In order to check free space on your proxy device execute volume.reiser4 MNT -p N where N is serial number of the proxy device in your logical volume mounted at MNT. Don't forget to sync dirty pages before this! Number of busy data blocks can be found as difference (blocks used - system blocks). = Using meta-data brick as proxy device = It is possible to use meta-data brick of your logical volume as a proxy device. Before this make sure that it is not used to store data (otherwise, operation on its adding as proxy device will fail). To check it simply execute: volume.reiser4 MNT -p 0 and check value of the field "in DSA". It should be "No". Otherwise, remove meta-data brick from data storage array by executing volume.reiser4 -r MTD_NAME MNT where MTD_NAME is device name of the meta-data brick, MNT is mount point of your logical volume. After removal completion add the meta-data brick as proxy device: volume.reiser4 -x MTD_NAME MNT WARNING: When using meta-data brick as proxy device, requirements on its [https://reiser4.wiki.kernel.org/index.php/Proxy_Device_Administration#Flushing_a_proxy_device flushing] are especially high, because in the case of no free space on meta-data brick you are not able to create new files on your logical volume. 0af1efee387aeda5d173fbef391068f23d41a3c6 4369 4368 2020-05-26T01:46:40Z Edward 4 /* Flushing a proxy device */ Proxy devices are supported by reiser4 experimental format 5.1.3. 
You'll need to patch your kernel with [https://sourceforge.net/projects/reiser4/files/v5-unstable/kernel/ reiser4-for-5.6.0.patch] (or newer) and install [https://sourceforge.net/projects/reiser4/files/v5-unstable/progs/ reiser4progs-2.0.1] (or newer). Before working with proxy devices you need to understand basic principles of reiser4 logical volumes, including [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Background background] and [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration administration] bits. = Adding a proxy device to a logical volume = You are able to add a proxy device to your logical volume at any time. Currently only one proxy device per logical volume is supported. Also there a restriction that no volume operations (like adding/removing a device to/from this logical volume) should be in progress at the moment of adding a proxy device. Also any device, which participates in regular data distribution can not be a proxy device. After adding proxy device to logical volume, capacity of the last one gets increased precisely on the capacity of that proxy device. Operation of adding a proxyhttps://sourceforge.net/projects/reiser4/files/v5-unstable/progs/ device will automatically turn on specified tiering policy. Currently there is a single one - Burst Buffers. Thus, once being added to a logical volume, proxy device gets an absolute priority in block allocations (including journal blocks (wandering logs). I remind that in reiser4 all disk allocations are always "delayed", that is, performed at commit time. In other bits proxy device is the most ordinary brick like other devices-components of your logical volume. Before adding to a logical volume, proxy device should be formatted like other bricks. 
Respectively, at format time you need to specify UUID and stripe size of your logical volume: mkfs.reiser4 -U UUID -t STRIPE_SIZE DEV In order to add a proxy device to your logical volume simply execute volume.reiser4 -x DEV MNT where DEV is the name of your properly formatted proxy device, MNT is mount point of your logical volume. The procedure of adding a proxy device is always quick and is not accompanied with any data migration. = Flushing a proxy device = After being added to your logical volume, the proxy device automatically becomes a home of all new allocations, and hence, persistently gets filled with data. So user needs to take care on flushing data to the main storage. Without it the free space on your proxy device will end soon, and the advantage of having a proxy device will disappear. Once free space on the proxy device gets ended, all allocations will automatically happen on the main volume. Note, however, that by default it will drop the overall performance (because of trying to commit all transactions at every write iteration). Optionally it is possible to choose a mode without any performance drop, however, in that mode proxy disk space will be used less efficiently. Anyway, don't allow your proxy device to be completely filled with data, as it will nullify all the advantages of having a proxy-device! Flushing proxy device cabn be performed via common migration procedure (the same procedure is used to migrate data when adding/removing a device to/from a logical volume). So, in order to flush your proxy device, just execute volume.reiser4 -b MNT where MNT is mount point of your logical volume. Like every user-space application the flushing procedure may return error. 
The list of "regular" errors is: EBUSY means that flushing procedure gave way to some other process of higher priority in the competition for resources (usually long-term locks on the storage tree) ENOMEM means not enough memory for some flushing subroutine ENOSPC means not enough space on main storage In case of returned EBUSY you just need to repeat the flushing procedure. In other cases put efforts (free some disk space on your main storage, etc) to make sure that such errors won't happen next time and then repeat the flushing procedure. If the flushing procedure was interrupted for some reason (e.g. system crash, hard reset), then simply repeat it in the next mount session for this logical volume. Once the Burst Buffers stuff becomes stable, we'll implement automatic flushing of proxy device by a special kernel thread, which gets woken up every time when some block allocation happens on the proxy device. At the moment, we are making the user responsible for this. With currently existing interface it is also possible to organize flushing efficiently. In the simplest case it can be done by various scripts like the following one: while true do sync volume.reiser4 -b /mnt sleep 60 done Smarter script would check space occupied by data on the proxy brick before flushing, etc. = Removing a proxy device from a logical volume = At any time you are able to remove proxy device from your logical volume. It is in the assumption that no volume operations like adding/removing a device to/from this logical volume are in progress at the moment of proxy device removal. Removing a proxy device is absolutely similar to removing usual device form the logical volume. In particular, removal operation is always completed with data migration from the proxy-device to be removed to other devices-components of your logical volume. Before removing a proxy device make sure that there are enough space on other devices-components of your logical volume. 
Note that disk space of meta-data brick is not counted in the case when the last one is not a member of DSA (Data Storage Array), i.e. is used to store meta-data only. To remove proxy device from your logical volume simply execute volume.reiser4 -r DEV MNT where DEV is name of proxy device to be removed, MNT is mount point of your logical volume. The procedure of removing a proxy device can return errors: EBUSY means that the procedure of data migration gave way to some other process of higher priority in the competition for resources (usually long-term locks on the storage tree) ENOMEM means not enough memory for some data migration subroutine ENOSPC means not enough space on other devices The mentioned errors should be handled in the same way that in the case of [https://reiser4.wiki.kernel.org/index.php/Proxy_Device_Administration#Flushing_a_proxy_device flushing]. If procedure of removing a proxy device was interrupted because of some reasons (system crash, or hard reset), then just follow [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Deploying_a_logical_volume_after_hard_reset_or_system_crash instructions] on deploying a logical volume after interrupted device removal. = Monitoring a proxy device = Monitoring a proxy device is performed by usual means, see e.g. [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#LV_monitoring this]. In order to check free space on your proxy device execute volume.reiser4 MNT -p N where N is serial number of the proxy device in your logical volume mounted at MNT. Don't forget to sync dirty pages before this! Number of busy data blocks can be found as difference (blocks used - system blocks). = Using meta-data brick as proxy device = It is possible to use meta-data brick of your logical volume as a proxy device. Before this make sure that it is not used to store data (otherwise, operation on its adding as proxy device will fail). 
To check it simply execute: volume.reiser4 MNT -p 0 and check value of the field "in DSA". It should be "No". Otherwise, remove meta-data brick from data storage array by executing volume.reiser4 -r MTD_NAME MNT where MTD_NAME is device name of the meta-data brick, MNT is mount point of your logical volume. After removal completion add the meta-data brick as proxy device: volume.reiser4 -x MTD_NAME MNT WARNING: When using meta-data brick as proxy device, requirements on its [https://reiser4.wiki.kernel.org/index.php/Proxy_Device_Administration#Flushing_a_proxy_device flushing] are especially high, because in the case of no free space on meta-data brick you are not able to create new files on your logical volume. 78da155f4b727969759e34cc101220bd3cdb6aac 4368 4367 2020-05-26T01:37:56Z Edward 4 /* Adding a proxy device to a logical volume */ Proxy devices are supported by reiser4 experimental format 5.1.3. You'll need to patch your kernel with [https://sourceforge.net/projects/reiser4/files/v5-unstable/kernel/ reiser4-for-5.6.0.patch] (or newer) and install [https://sourceforge.net/projects/reiser4/files/v5-unstable/progs/ reiser4progs-2.0.1] (or newer). Before working with proxy devices you need to understand basic principles of reiser4 logical volumes, including [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Background background] and [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration administration] bits. = Adding a proxy device to a logical volume = You are able to add a proxy device to your logical volume at any time. Currently only one proxy device per logical volume is supported. Also there a restriction that no volume operations (like adding/removing a device to/from this logical volume) should be in progress at the moment of adding a proxy device. Also any device, which participates in regular data distribution can not be a proxy device. 
After adding proxy device to logical volume, capacity of the last one gets increased precisely on the capacity of that proxy device. Operation of adding a proxyhttps://sourceforge.net/projects/reiser4/files/v5-unstable/progs/ device will automatically turn on specified tiering policy. Currently there is a single one - Burst Buffers. Thus, once being added to a logical volume, proxy device gets an absolute priority in block allocations (including journal blocks (wandering logs). I remind that in reiser4 all disk allocations are always "delayed", that is, performed at commit time. In other bits proxy device is the most ordinary brick like other devices-components of your logical volume. Before adding to a logical volume, proxy device should be formatted like other bricks. Respectively, at format time you need to specify UUID and stripe size of your logical volume: mkfs.reiser4 -U UUID -t STRIPE_SIZE DEV In order to add a proxy device to your logical volume simply execute volume.reiser4 -x DEV MNT where DEV is the name of your properly formatted proxy device, MNT is mount point of your logical volume. The procedure of adding a proxy device is always quick and is not accompanied with any data migration. = Flushing a proxy device = After being added to your logical volume, the proxy device automatically becomes a home of all new allocations, and hence, persistently gets filled with data. So user needs to take care on flushing data to the main storage. Without it the free space on your proxy device will be ended soon, and the advantage of having a proxy device will disappear. When free space on the proxy device gets ended, allocation will automatically happen on the main volume. Note, however, that by default it will slow down the overall performance (because of trying to commit all transactions at every write iteration). Optionally it is possible to choose a mode without any performance drop, however, in that mode proxy disk space will be used less efficiently. 
Anyway, don't allow your proxy device to be completely filled with data: it will nullify all the advantages of having a proxy-device! Flushing proxy device is performed via common migration procedure (the same procedure is used to migrate data when adding/removing a device to/from a logical volume). So, in order to flush your proxy device, just execute volume.reiser4 -b MNT where MNT is mount point of your logical volume. Like every user-space application the flushing procedure may return error. The list of "regular" errors is: EBUSY means that flushing procedure gave way to some other process of higher priority in the competition for resources (usually long-term locks on the storage tree) ENOMEM means not enough memory for some flushing subroutine ENOSPC means not enough space on main storage In case of returned EBUSY you just need to repeat the flushing procedure. In other cases put efforts (free some disk space on your main storage, etc) to make sure that such errors won't happen next time and then repeat the flushing procedure. If the flushing procedure was interrupted for some reason (e.g. system crash, hard reset), then simply repeat it in the next mount session for this logical volume. Once the tiering stuff is stable, we'll implement automatic flushing of proxy device by a special kernel thread, which gets woken up every time when some block allocation happens on the proxy device. At the moment, we are making the user responsible for this. With currently existing interface it is also possible to organize flushing efficiently. In the simplest case it can be done by various scripts like the following one: while true do sync volume.reiser4 -b /mnt sleep 60 done Smarter script would check space occupied by data on the proxy brick before flushing, etc. = Removing a proxy device from a logical volume = At any time you are able to remove proxy device from your logical volume. 
It is in the assumption that no volume operations like adding/removing a device to/from this logical volume are in progress at the moment of proxy device removal. Removing a proxy device is absolutely similar to removing usual device form the logical volume. In particular, removal operation is always completed with data migration from the proxy-device to be removed to other devices-components of your logical volume. Before removing a proxy device make sure that there are enough space on other devices-components of your logical volume. Note that disk space of meta-data brick is not counted in the case when the last one is not a member of DSA (Data Storage Array), i.e. is used to store meta-data only. To remove proxy device from your logical volume simply execute volume.reiser4 -r DEV MNT where DEV is name of proxy device to be removed, MNT is mount point of your logical volume. The procedure of removing a proxy device can return errors: EBUSY means that the procedure of data migration gave way to some other process of higher priority in the competition for resources (usually long-term locks on the storage tree) ENOMEM means not enough memory for some data migration subroutine ENOSPC means not enough space on other devices The mentioned errors should be handled in the same way that in the case of [https://reiser4.wiki.kernel.org/index.php/Proxy_Device_Administration#Flushing_a_proxy_device flushing]. If procedure of removing a proxy device was interrupted because of some reasons (system crash, or hard reset), then just follow [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Deploying_a_logical_volume_after_hard_reset_or_system_crash instructions] on deploying a logical volume after interrupted device removal. = Monitoring a proxy device = Monitoring a proxy device is performed by usual means, see e.g. [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#LV_monitoring this]. 
In order to check free space on your proxy device execute volume.reiser4 MNT -p N where N is serial number of the proxy device in your logical volume mounted at MNT. Don't forget to sync dirty pages before this! Number of busy data blocks can be found as difference (blocks used - system blocks). = Using meta-data brick as proxy device = It is possible to use meta-data brick of your logical volume as a proxy device. Before this make sure that it is not used to store data (otherwise, operation on its adding as proxy device will fail). To check it simply execute: volume.reiser4 MNT -p 0 and check value of the field "in DSA". It should be "No". Otherwise, remove meta-data brick from data storage array by executing volume.reiser4 -r MTD_NAME MNT where MTD_NAME is device name of the meta-data brick, MNT is mount point of your logical volume. After removal completion add the meta-data brick as proxy device: volume.reiser4 -x MTD_NAME MNT WARNING: When using meta-data brick as proxy device, requirements on its [https://reiser4.wiki.kernel.org/index.php/Proxy_Device_Administration#Flushing_a_proxy_device flushing] are especially high, because in the case of no free space on meta-data brick you are not able to create new files on your logical volume. d9c0acaf124b6acd7ff59a008cbe4b4e63ebc32f 4367 4366 2020-05-26T01:20:02Z Edward 4 /* Flushing a proxy device */ Proxy devices are supported by reiser4 experimental format 5.1.3. You'll need to patch your kernel with [https://sourceforge.net/projects/reiser4/files/v5-unstable/kernel/ reiser4-for-5.6.0.patch] (or newer) and install [https://sourceforge.net/projects/reiser4/files/v5-unstable/progs/ reiser4progs-2.0.1] (or newer). Before working with proxy devices you need to understand basic principles of reiser4 logical volumes, including [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Background background] and [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration administration] bits. 
= Adding a proxy device to a logical volume = You are able to add a proxy device to your logical volume at any time. Currently only one proxy device per logical volume is supported. Also there a restriction that no volume operations (like adding/removing a device to/from this logical volume) should be in progress at the moment of adding a proxy device. Also any device, which participates in regular data distribution can not be a proxy device. After adding proxy device to logical volume, capacity of the last one gets increased precisely on the capacity of that proxy device. Operation of adding a proxyhttps://sourceforge.net/projects/reiser4/files/v5-unstable/progs/ device will automatically turn on specified tiering policy. Currently there is a single one - Burst Buffers. Thus, once being added to a logical volume, proxy device gets an absolute priority in block allocations (including journal blocks (wandering logs). I remind that in reiser4 all disk allocations are always "delayed", that is, performed at commit time. In other bits proxy device is the most ordinary brick like other devices-components of your logical volume. Before adding to a logical volume, proxy device should be formatted like other bricks. Respectively, at format time you need to specify UUID and stripe size of your logical volume: mkfs.reiser4 -U UUID -t STRIPE_SIZE DEV In order to add a proxy device to your logical volume simply execute volume.reiser4 -x DEV MNT where DEV it the name of your properly formatted proxy device, MNT is mount point of your logical volume. The procedure of adding a proxy device is always quick and is not accompanied with any data migration. = Flushing a proxy device = After being added to your logical volume, the proxy device automatically becomes a home of all new allocations, and hence, persistently gets filled with data. So user needs to take care on flushing data to the main storage. 
Proxy devices are supported by reiser4 experimental format 5.1.3.
You'll need to patch your kernel with [https://sourceforge.net/projects/reiser4/files/v5-unstable/kernel/ reiser4-for-5.6.0.patch] (or newer) and install [https://sourceforge.net/projects/reiser4/files/v5-unstable/progs/ reiser4progs-2.0.1] (or newer). Before working with proxy devices you should understand the basic principles of reiser4 logical volumes, including the [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Background background] and [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration administration] pages.

= Adding a proxy device to a logical volume =

You can add a proxy device to your logical volume at any time. Currently only one proxy device per logical volume is supported. There is also a restriction that no volume operations (such as adding or removing a device to/from this logical volume) may be in progress at the moment the proxy device is added. Furthermore, a device that participates in regular data distribution cannot be a proxy device. After a proxy device is added, the capacity of the logical volume increases by exactly the capacity of that proxy device.

Adding a proxy device automatically turns on the specified tiering policy. Currently there is a single one: Burst Buffers. Once added to a logical volume, the proxy device gets absolute priority in block allocations (including journal blocks, i.e. wandering logs). Recall that in reiser4 all disk allocations are "delayed", that is, performed at commit time. In all other respects a proxy device is an ordinary brick, just like the other component devices of your logical volume.

Before being added to a logical volume, a proxy device must be formatted like any other brick.
Accordingly, at format time you need to specify the UUID and stripe size of your logical volume:

 mkfs.reiser4 -U UUID -t STRIPE_SIZE DEV

To add a proxy device to your logical volume, execute

 volume.reiser4 -x DEV MNT

where DEV is the name of your properly formatted proxy device and MNT is the mount point of your logical volume. Adding a proxy device is always quick and is not accompanied by any data migration.

= Flushing a proxy device =

After being added to your logical volume, the proxy device automatically becomes the home of all new allocations, and hence steadily fills with data. So the user needs to take care of flushing data to the main storage. Otherwise the free space on your proxy device will soon run out, and the advantage of having a proxy device will disappear. When free space on the proxy device runs out, allocation automatically happens on the main volume. Note, however, that by default this slows down overall performance (because of trying to commit all transactions at every write iteration). Optionally you can choose a mode without any performance drop; however, in that mode proxy disk space is used less efficiently. In any case, do not let your proxy device fill up completely with data: that nullifies all the advantages of having a proxy device!

Flushing a proxy device is performed via the common migration procedure (the same procedure used to migrate data when adding or removing a device to/from a logical volume). So, to flush your proxy device, just execute

 volume.reiser4 -b MNT

where MNT is the mount point of your logical volume. Like every user-space application, the flushing procedure may return an error.
The list of "regular" errors is:

* EBUSY: the flushing procedure gave way to some other process of higher priority in the competition for resources (usually long-term locks on the storage tree)
* ENOMEM: not enough memory for some flushing subroutine
* ENOSPC: not enough space on the main storage

If EBUSY is returned, you just need to repeat the flushing procedure. In the other cases, take measures (free some disk space on your main storage, etc.) to make sure the error won't happen next time, then repeat the flushing procedure. If the flushing procedure was interrupted for some reason (e.g. system crash or hard reset), simply repeat it in the next mount session of this logical volume.

Once the tiering support is stable, we'll implement automatic flushing of the proxy device by a special kernel thread, which gets woken up every time a block allocation happens on the proxy device. With the currently existing interface it is also possible to organize flushing efficiently. In the simplest case it can be done by a script like the following:

 while true
 do
     sync
     volume.reiser4 -b /mnt
     sleep 60
 done

A smarter script would check the space occupied by data on the proxy brick before flushing, etc.

= Removing a proxy device from a logical volume =

You can remove the proxy device from your logical volume at any time, on the assumption that no volume operations (such as adding or removing a device to/from this logical volume) are in progress at the moment of removal. Removing a proxy device is exactly like removing a regular device from the logical volume. In particular, the removal operation always completes with data migration from the proxy device being removed to the other component devices of your logical volume. Before removing a proxy device, make sure there is enough space on the other component devices of your logical volume.
Note that the disk space of the meta-data brick is not counted when the latter is not a member of the DSA (Data Storage Array), i.e. is used to store meta-data only. To remove the proxy device from your logical volume, execute

 volume.reiser4 -r DEV MNT

where DEV is the name of the proxy device to be removed and MNT is the mount point of your logical volume. The procedure of removing a proxy device can return errors:

* EBUSY: the data migration procedure gave way to some other process of higher priority in the competition for resources (usually long-term locks on the storage tree)
* ENOMEM: not enough memory for some data migration subroutine
* ENOSPC: not enough space on the other devices

These errors should be handled in the same way as in the case of [https://reiser4.wiki.kernel.org/index.php/Proxy_Device_Administration#Flushing_a_proxy_device flushing]. If the procedure of removing a proxy device was interrupted for some reason (system crash or hard reset), just follow the [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Deploying_a_logical_volume_after_hard_reset_or_system_crash instructions] on deploying a logical volume after an interrupted device removal.

= Monitoring a proxy device =

A proxy device is monitored by the usual means, see e.g. [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#LV_monitoring this]. To check free space on your proxy device, execute

 volume.reiser4 MNT -p N

where N is the serial number of the proxy device in your logical volume mounted at MNT. Don't forget to sync dirty pages before this! The number of busy data blocks can be found as the difference (blocks used - system blocks).

= Using the meta-data brick as a proxy device =

It is possible to use the meta-data brick of your logical volume as a proxy device. Before doing so, make sure it is not used to store data (otherwise the operation of adding it as a proxy device will fail).
To check this, execute

 volume.reiser4 MNT -p 0

and inspect the value of the field "in DSA". It should be "No". Otherwise, remove the meta-data brick from the data storage array by executing

 volume.reiser4 -r MTD_NAME MNT

where MTD_NAME is the device name of the meta-data brick and MNT is the mount point of your logical volume. After the removal completes, add the meta-data brick as a proxy device:

 volume.reiser4 -x MTD_NAME MNT

WARNING: When using the meta-data brick as a proxy device, the requirements on its [https://reiser4.wiki.kernel.org/index.php/Proxy_Device_Administration#Flushing_a_proxy_device flushing] are especially high, because if there is no free space on the meta-data brick, you are not able to create new files on your logical volume.
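The "smarter script" idea from the Flushing section, together with the busy-block arithmetic from the Monitoring section, can be sketched as below. This is an illustrative sketch only: the occupancy numbers (blocks used, system blocks, total blocks) are assumed to have been parsed from `volume.reiser4 MNT -p N` output, whose exact format is not documented here, so the parsing step and the real tool invocation are left as comments.

```shell
# Busy data blocks, per the Monitoring section: blocks used - system blocks.
busy_blocks() {
    echo $(( $1 - $2 ))          # $1 = blocks used, $2 = system blocks
}

# Succeed when occupancy (busy/total, in percent) reaches the threshold.
should_flush() {
    busy=$1; total=$2; threshold=$3
    [ $(( busy * 100 / total )) -ge "$threshold" ]
}

# Intended main loop (commented out, since it invokes the real tool and
# relies on USED/SYS/TOTAL being parsed from volume.reiser4 output):
#
#   while true; do
#       sync                                  # flush dirty pages first
#       busy=$(busy_blocks "$USED" "$SYS")
#       if should_flush "$busy" "$TOTAL" 75; then
#           volume.reiser4 -b /mnt            # migrate data to main storage
#       fi
#       sleep 60
#   done
```

The 75 percent threshold is arbitrary; the point is simply to skip the migration pass while the proxy brick is mostly empty, instead of committing a migration cycle every 60 seconds unconditionally.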
Operation of adding a proxyhttps://sourceforge.net/projects/reiser4/files/v5-unstable/progs/ device will automatically turn on specified tiering policy. Currently there is a single one - Burst Buffers. Thus, once being added to a logical volume, proxy device gets an absolute priority in block allocations (including journal blocks (wandering logs). I remind that in reiser4 all disk allocations are always "delayed", that is, performed at commit time. In other bits proxy device is the most ordinary brick like other devices-components of your logical volume. Before adding to a logical volume, proxy device should be formatted like other bricks. Respectively, at format time you need to specify UUID and stripe size of your logical volume: mkfs.reiser4 -U UUID -t STRIPE_SIZE DEV In order to add a proxy device to your logical volume simply execute volume.reiser4 -x DEV MNT where DEV it the name of your properly formatted proxy device, MNT is mount point of your logical volume. The procedure of adding a proxy device is always quick and is not accompanied with any data migration. = Flushing a proxy device = After being added to your logical volume, the proxy device automatically becomes a home of all new allocations, and hence, persistently gets filled with data. So user needs to take care on flushing data to the main storage. Without it the free space on your proxy device will be ended soon, and the advantage of having a proxy device will disappear. When free space on the proxy device gets ended, allocation will automatically happen on the main volume. Note, however, that by default it will slow down the overall performance (because of trying to commit all transactions at every write iteration). Optionally it is possible to choose a mode without any performance drop, however, in that mode proxy disk space will be used less efficiently. Anyway, don't allow your proxy device to be completely filled with data: it will nullify all the advantages of having a proxy-device! 
Flushing proxy device is performed via common migration procedure (the same procedure is used to migrate data when adding/removing a device to/from a logical volume). So, in order to flush your proxy device, just execute volume.reiser4 -b MNT where MNT is mount point of your logical volume. Like every user-space application the flushing procedure may return error. The list of "regular" errors is: EBUSY means that flushing procedure gave way to some other process of higher priority in the competition for resources (usually long-term locks on the storage tree) ENOMEM means not enough memory for some flushing subroutine ENOSPC means not enough space on main storage In case of returned EBUSY you just need to repeat the flushing procedure. In other cases put efforts (free some disk space on your main storage, etc) to make sure that such errors won't happen next time and then repeat the flushing procedure. If the flushing procedure was interrupted for some reason (e.g. system crash, hard reset), then simply repeat it in the next mount session for this logical volume. Once the tiering stuff is stable, we'll implement automatic flushing of proxy device by a special kernel thread, which gets woken up every time when some block allocation happens on the proxy device. With currently existing interface it is also possible to organize flushing efficiently. In the simplest case it can be done by various scripts like the following one: while true do sync volume.reiser4 -b /mnt sleep 60 done Smarter script would check space occupied by data on the proxy brick before flushing, etc. = Removing a proxy device from a logical volume = At any time you are able to remove proxy device from your logical volume. It is in the assumption that no volume operations like adding/removing a device to/from this logical volume are in progress at the moment of proxy device removal. Removing a proxy device is absolutely similar to removing usual device form the logical volume. 
In particular, removal operation is always completed with data migration from the proxy-device to be removed to other devices-components of your logical volume. Before removing a proxy device make sure that there are enough space on other devices-components of your logical volume. Note that disk space of meta-data brick is not counted in the case when the last one is not a member of DSA (Data Storage Array), i.e. is used to store meta-data only. To remove proxy device from your logical volume simply execute volume.reiser4 -r DEV MNT where DEV is name of proxy device to be removed, MNT is mount point of your logical volume. The procedure of removing a proxy device can return errors: EBUSY means that the procedure of data migration gave way to some other process of higher priority in the competition for resources (usually long-term locks on the storage tree) ENOMEM means not enough memory for some data migration subroutine ENOSPC means not enough space on other devices The mentioned errors should be handled in the same way that in the case of [https://reiser4.wiki.kernel.org/index.php/Proxy_Device_Administration#Flushing_a_proxy_device flushing]. If procedure of removing a proxy device was interrupted because of some reasons (system crash, or hard reset), then just follow [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Deploying_a_logical_volume_after_hard_reset_or_system_crash instructions] on deploying a logical volume after interrupted device removal. = Monitoring a proxy device = Monitoring a proxy device is performed by usual means, see e.g. [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#LV_monitoring this]. In order to check free space on your proxy device execute volume.reiser4 MNT -p N where N is serial number of the proxy device in your logical volume mounted at MNT. Don't forget to sync dirty pages before this! Number of busy data blocks can be found as difference (blocks used - system blocks). 
= Using meta-data brick as proxy device = It is possible to use meta-data brick of your logical volume as a proxy device. Before this make sure that it is not used to store data (otherwise, operation on its adding as proxy device will fail). To check it simply execute: volume.reiser4 MNT -p 0 and check value of the field "in DSA". It should be "No". Otherwise, remove meta-data brick from data storage array by executing volume.reiser4 -r MTD_NAME MNT where MTD_NAME is device name of the meta-data brick, MNT is mount point of your logical volume. After removal completion add the meta-data brick as proxy device: volume.reiser4 -x MTD_NAME MNT WARNING: When using meta-data brick as proxy device, requirements on its [https://reiser4.wiki.kernel.org/index.php/Proxy_Device_Administration#Flushing_a_proxy_device flushing] are especially high, because in the case of no free space on meta-data brick you are not able to create new files on your logical volume. c0a3bbf08335f3b9e4d03a3bc520f151774a472a 4364 4363 2020-05-25T17:33:30Z Edward 4 /* Flushing a proxy device */ Proxy devices are supported by reiser4 experimental format 5.1.3. You'll need to patch your kernel with reiser4-for-5.6 and install reiser4progs-2.0.1 Before working with proxy devices you need to understand basic principles of reiser4 logical volumes, including [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Background backgound] and [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration administration] bits. = Adding a proxy device to a logical volume = You are able to add a proxy device to your logical volume at any time. Currently only one proxy device per logical volume is supported. Also there a restriction that no volume operations (like adding/removing a device to/from this logical volume) should be in progress at the moment of adding a proxy device. Also any device, which participates in regular data distribution can not be a proxy device. 
After adding proxy device to logical volume, capacity of the last one gets increased precisely on the capacity of that proxy device. Operation of adding a proxyhttps://sourceforge.net/projects/reiser4/files/v5-unstable/progs/ device will automatically turn on specified tiering policy. Currently there is a single one - Burst Buffers. Thus, once being added to a logical volume, proxy device gets an absolute priority in block allocations (including journal blocks (wandering logs). I remind that in reiser4 all disk allocations are always "delayed", that is, performed at commit time. In other bits proxy device is the most ordinary brick like other devices-components of your logical volume. Before adding to a logical volume, proxy device should be formatted like other bricks. Respectively, at format time you need to specify UUID and stripe size of your logical volume: mkfs.reiser4 -U UUID -t STRIPE_SIZE DEV In order to add a proxy device to your logical volume simply execute volume.reiser4 -x DEV MNT where DEV it the name of your properly formatted proxy device, MNT is mount point of your logical volume. The procedure of adding a proxy device is always quick and is not accompanied with any data migration. = Flushing a proxy device = After being added to your logical volume, the proxy device automatically becomes a home of all new allocations, and hence, persistently gets filled with data. So user needs to take care on flushing data to the main storage. Without it the free space on your proxy device will be ended soon, and the advantage of having a proxy device will disappear. When free space on the proxy device gets ended, allocation will automatically happen on the main volume. Note, however, that by default it will slow down the overall performance (because of trying to commit all transactions at every write iteration). Optionally it is possible to choose a mode without any performance drop, however, in that mode proxy disk space will be used less efficiently. 
Anyway, don't allow your proxy device to be completely filled with data: it will nullify all the advantages of having a proxy-device! Flushing proxy device is performed via common migration procedure (the same procedure is used to migrate data when adding/removing a device to/from a logical volume). So, in order to flush your proxy device, just execute volume.reiser4 -b MNT where MNT is mount point of your logical volume. Like every user-space application the flushing procedure may return error. The list of "regular" errors is: EBUSY means that flushing procedure gave way to some other process of higher priority in the competition for resources (usually long-term locks on the storage tree) ENOMEM means not enough memory for some flushing subroutine ENOSPC means not enough space on main storage In case of returned EBUSY you just need to repeat the flushing procedure. In other cases put efforts (free some disk space on your main storage, etc) to make sure that such errors won't happen next time and then repeat the flushing procedure. If the flushing procedure was interrupted for some reason (e.g. system crash, hard reset), then simply repeat it in the next mount session for this logical volume. Once the tiering stuff is stable, we'll implement automatic flushing of proxy device by a special kernel thread, which gets woken up every time when some block allocation happens on the proxy device. With currently existing interface it is also possible to organize flushing efficiently. In the simplest case it can be done by various scripts like the following one: while true do sync volume.reiser4 -b /mnt sleep 60 done Smarter script would check space occupied by data on the proxy brick before flushing, etc. = Removing a proxy device from a logical volume = At any time you are able to remove proxy device from your logical volume. 
It is in the assumption that no volume operations like adding/removing a device to/from this logical volume are in progress at the moment of proxy device removal. Removing a proxy device is absolutely similar to removing usual device form the logical volume. In particular, removal operation is always completed with data migration from the proxy-device to be removed to other devices-components of your logical volume. Before removing a proxy device make sure that there are enough space on other devices-components of your logical volume. Note that disk space of meta-data brick is not counted in the case when the last one is not a member of DSA (Data Storage Array), i.e. is used to store meta-data only. To remove proxy device from your logical volume simply execute volume.reiser4 -r DEV MNT where DEV is name of proxy device to be removed, MNT is mount point of your logical volume. The procedure of removing a proxy device can return errors: EBUSY means that the procedure of data migration gave way to some other process of higher priority in the competition for resources (usually long-term locks on the storage tree) ENOMEM means not enough memory for some data migration subroutine ENOSPC means not enough space on other devices The mentioned errors should be handled in the same way that in the case of [https://reiser4.wiki.kernel.org/index.php/Proxy_Device_Administration#Flushing_a_proxy_device flushing]. If procedure of removing a proxy device was interrupted because of some reasons (system crash, or hard reset), then just follow [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Deploying_a_logical_volume_after_hard_reset_or_system_crash instructions] on deploying a logical volume after interrupted device removal. = Monitoring a proxy device = Monitoring a proxy device is performed by usual means, see e.g. [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#LV_monitoring this]. 
In order to check free space on your proxy device execute volume.reiser4 MNT -p N where N is serial number of the proxy device in your logical volume mounted at MNT. Don't forget to sync dirty pages before this! Number of busy data blocks can be found as difference (blocks used - system blocks). = Using meta-data brick as proxy device = It is possible to use meta-data brick of your logical volume as a proxy device. Before this make sure that it is not used to store data (otherwise, operation on its adding as proxy device will fail). To check it simply execute: volume.reiser4 MNT -p 0 and check value of the field "in DSA". It should be "No". Otherwise, remove meta-data brick from data storage array by executing volume.reiser4 -r MTD_NAME MNT where MTD_NAME is device name of the meta-data brick, MNT is mount point of your logical volume. After removal completion add the meta-data brick as proxy device: volume.reiser4 -x MTD_NAME MNT WARNING: When using meta-data brick as proxy device, requirements on its [https://reiser4.wiki.kernel.org/index.php/Proxy_Device_Administration#Flushing_a_proxy_device flushing] are especially high, because in the case of no free space on meta-data brick you are not able to create new files on your logical volume. 168f0316bf4e1db52fdce59417499f52f91f7f82 4363 4362 2020-05-25T17:28:18Z Edward 4 /* Flushing a proxy device */ Proxy devices are supported by reiser4 experimental format 5.1.3. You'll need to patch your kernel with reiser4-for-5.6 and install reiser4progs-2.0.1 Before working with proxy devices you need to understand basic principles of reiser4 logical volumes, including [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Background backgound] and [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration administration] bits. = Adding a proxy device to a logical volume = You are able to add a proxy device to your logical volume at any time. Currently only one proxy device per logical volume is supported. 
Also there a restriction that no volume operations (like adding/removing a device to/from this logical volume) should be in progress at the moment of adding a proxy device. Also any device, which participates in regular data distribution can not be a proxy device. After adding proxy device to logical volume, capacity of the last one gets increased precisely on the capacity of that proxy device. Operation of adding a proxyhttps://sourceforge.net/projects/reiser4/files/v5-unstable/progs/ device will automatically turn on specified tiering policy. Currently there is a single one - Burst Buffers. Thus, once being added to a logical volume, proxy device gets an absolute priority in block allocations (including journal blocks (wandering logs). I remind that in reiser4 all disk allocations are always "delayed", that is, performed at commit time. In other bits proxy device is the most ordinary brick like other devices-components of your logical volume. Before adding to a logical volume, proxy device should be formatted like other bricks. Respectively, at format time you need to specify UUID and stripe size of your logical volume: mkfs.reiser4 -U UUID -t STRIPE_SIZE DEV In order to add a proxy device to your logical volume simply execute volume.reiser4 -x DEV MNT where DEV it the name of your properly formatted proxy device, MNT is mount point of your logical volume. The procedure of adding a proxy device is always quick and is not accompanied with any data migration. = Flushing a proxy device = After being added to your logical volume, the proxy device automatically becomes a home of all new allocations, and hence, persistently gets filled with data. So user needs to take care on flushing data to the main storage. Without it the free space on your proxy device will be ended soon, and the advantage of having a proxy device will disappear. When free space on the proxy device gets ended, allocation will automatically happen on the main volume. 
Note, however, that by default it will slow down the overall performance (because of trying to commit all transactions at every write iteration). Optionally it is possible to choose a mode without any performance drop, however, in that mode proxy disk space will be used less efficiently. So don't allow your proxy device to be completely filled with data! There is one more option though: when reservation on the proxy device gets ended, allocation will happen on the main volume. In this case no performance drop will happen, however, disk space of your proxy device will be used less efficiently (as reservations are performed basing on estimations, some of them are rather rough). Flushing proxy device is performed via common migration procedure (the same procedure is used to migrate data when adding/removing a device to/from a logical volume). So, in order to flush your proxy device, just execute volume.reiser4 -b MNT where MNT is mount point of your logical volume. Like every user-space application the flushing procedure may return error. The list of "regular" errors is: EBUSY means that flushing procedure gave way to some other process of higher priority in the competition for resources (usually long-term locks on the storage tree) ENOMEM means not enough memory for some flushing subroutine ENOSPC means not enough space on main storage In case of returned EBUSY you just need to repeat the flushing procedure. In other cases put efforts (free some disk space on your main storage, etc) to make sure that such errors won't happen next time and then repeat the flushing procedure. If the flushing procedure was interrupted for some reason (e.g. system crash, hard reset), then simply repeat it in the next mount session for this logical volume. Once the tiering stuff is stable, we'll implement automatic flushing of proxy device by a special kernel thread, which gets woken up every time when some block allocation happens on the proxy device. 
With currently existing interface it is also possible to organize flushing efficiently. In the simplest case it can be done by various scripts like the following one: while true do sync volume.reiser4 -b /mnt sleep 60 done Smarter script would check space occupied by data on the proxy brick before flushing, etc. = Removing a proxy device from a logical volume = At any time you are able to remove proxy device from your logical volume. It is in the assumption that no volume operations like adding/removing a device to/from this logical volume are in progress at the moment of proxy device removal. Removing a proxy device is absolutely similar to removing usual device form the logical volume. In particular, removal operation is always completed with data migration from the proxy-device to be removed to other devices-components of your logical volume. Before removing a proxy device make sure that there are enough space on other devices-components of your logical volume. Note that disk space of meta-data brick is not counted in the case when the last one is not a member of DSA (Data Storage Array), i.e. is used to store meta-data only. To remove proxy device from your logical volume simply execute volume.reiser4 -r DEV MNT where DEV is name of proxy device to be removed, MNT is mount point of your logical volume. The procedure of removing a proxy device can return errors: EBUSY means that the procedure of data migration gave way to some other process of higher priority in the competition for resources (usually long-term locks on the storage tree) ENOMEM means not enough memory for some data migration subroutine ENOSPC means not enough space on other devices The mentioned errors should be handled in the same way that in the case of [https://reiser4.wiki.kernel.org/index.php/Proxy_Device_Administration#Flushing_a_proxy_device flushing]. 
If the procedure of removing a proxy device was interrupted for some reason (system crash or hard reset), just follow the [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Deploying_a_logical_volume_after_hard_reset_or_system_crash instructions] on deploying a logical volume after an interrupted device removal.

= Monitoring a proxy device =

Monitoring a proxy device is performed by the usual means, see e.g. [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#LV_monitoring this]. To check the free space on your proxy device, execute

 volume.reiser4 MNT -p N

where N is the serial number of the proxy device in your logical volume mounted at MNT. Don't forget to sync dirty pages before this! The number of busy data blocks can be found as the difference (blocks used - system blocks).

= Using meta-data brick as proxy device =

It is possible to use the meta-data brick of your logical volume as a proxy device. Before doing so, make sure that it is not used to store data (otherwise the operation of adding it as a proxy device will fail). To check this, simply execute

 volume.reiser4 MNT -p 0

and check the value of the field "in DSA". It should be "No". Otherwise, remove the meta-data brick from the data storage array by executing

 volume.reiser4 -r MTD_NAME MNT

where MTD_NAME is the device name of the meta-data brick and MNT is the mount point of your logical volume. Once the removal has completed, add the meta-data brick as a proxy device:

 volume.reiser4 -x MTD_NAME MNT

WARNING: When using the meta-data brick as a proxy device, the requirements on its [https://reiser4.wiki.kernel.org/index.php/Proxy_Device_Administration#Flushing_a_proxy_device flushing] are especially high, because with no free space on the meta-data brick you are not able to create new files on your logical volume.
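The meta-data-brick steps above can be combined into one helper. The "in DSA" field name is taken from the text, but its exact formatting in the report of `volume.reiser4 MNT -p 0`, and the `metadata_brick_to_proxy` helper itself, are assumptions for illustration only.

```shell
#!/bin/sh
# Sketch: convert the meta-data brick (slot 0) into the proxy device,
# following the steps above. Hypothetical helper; the parser assumes
# an "in DSA: Yes/No" line in the brick report.

# Read a brick report on stdin and print the "in DSA" value (Yes/No).
in_dsa() {
    awk -F: '/in DSA/ { gsub(/[[:space:]]/, "", $2); print $2 }'
}

metadata_brick_to_proxy() {
    mnt="$1" dev="$2"      # dev: device name of the meta-data brick
    if [ "$(volume.reiser4 "$mnt" -p 0 | in_dsa)" = "Yes" ]; then
        # Brick still belongs to the data storage array: remove it
        # first (this migrates its data to the other bricks).
        volume.reiser4 -r "$dev" "$mnt" || return 1
    fi
    volume.reiser4 -x "$dev" "$mnt"    # add it as the proxy device
}

# Example: metadata_brick_to_proxy /mnt /dev/nvme0n1p2
```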
You'll need to patch your kernel with reiser4-for-5.6 and install reiser4progs-2.0.1 Before working with proxy devices you need to understand basic principles of reiser4 logical volumes, including [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Background backgound] and [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration administration] bits. = Adding a proxy device to a logical volume = You are able to add a proxy device to your logical volume at any time. Currently only one proxy device per logical volume is supported. Also there a restriction that no volume operations (like adding/removing a device to/from this logical volume) should be in progress at the moment of adding a proxy device. Also any device, which participates in regular data distribution can not be a proxy device. After adding proxy device to logical volume, capacity of the last one gets increased precisely on the capacity of that proxy device. Operation of adding a proxyhttps://sourceforge.net/projects/reiser4/files/v5-unstable/progs/ device will automatically turn on specified tiering policy. Currently there is a single one - Burst Buffers. Thus, once being added to a logical volume, proxy device gets an absolute priority in block allocations (including journal blocks (wandering logs). I remind that in reiser4 all disk allocations are always "delayed", that is, performed at commit time. In other bits proxy device is the most ordinary brick like other devices-components of your logical volume. Before adding to a logical volume, proxy device should be formatted like other bricks. Respectively, at format time you need to specify UUID and stripe size of your logical volume: mkfs.reiser4 -U UUID -t STRIPE_SIZE DEV In order to add a proxy device to your logical volume simply execute volume.reiser4 -x DEV MNT where DEV it the name of your properly formatted proxy device, MNT is mount point of your logical volume. 
The procedure of adding a proxy device is always quick and is not accompanied with any data migration. = Flushing a proxy device = After being added to your logical volume, the proxy device automatically becomes a home of all new allocations, and hence, persistently gets filled with data. So user needs to take care on flushing data to the main storage. Without it the free space on your proxy device will be ended soon, and the advantage of having a proxy device will disappear. When free space on the proxy device gets ended, allocation will automatically happen on the main volume. Note, however, that by default it will slow down the overall performance, just because it is rather expensive to check, that free space on your proxy device has really ended. So don't allow your proxy device to be completely filled with data! There is one more option though: when reservation on the proxy device gets ended, allocation will happen on the main volume. In this case no performance drop will happen, however, disk space of your proxy device will be used less efficiently (as reservations are performed basing on estimations, some of them are rather rough). Flushing proxy device is performed via common migration procedure (the same procedure is used to migrate data when adding/removing a device to/from a logical volume). So, in order to flush your proxy device, just execute volume.reiser4 -b MNT where MNT is mount point of your logical volume. Like every user-space application the flushing procedure may return error. The list of "regular" errors is: EBUSY means that flushing procedure gave way to some other process of higher priority in the competition for resources (usually long-term locks on the storage tree) ENOMEM means not enough memory for some flushing subroutine ENOSPC means not enough space on main storage In case of returned EBUSY you just need to repeat the flushing procedure. 
In other cases put efforts (free some disk space on your main storage, etc) to make sure that such errors won't happen next time and then repeat the flushing procedure. If the flushing procedure was interrupted for some reason (e.g. system crash, hard reset), then simply repeat it in the next mount session for this logical volume. Once the tiering stuff is stable, we'll implement automatic flushing of proxy device by a special kernel thread, which gets woken up every time when some block allocation happens on the proxy device. With currently existing interface it is also possible to organize flushing efficiently. In the simplest case it can be done by various scripts like the following one: while true do sync volume.reiser4 -b /mnt sleep 60 done Smarter script would check space occupied by data on the proxy brick before flushing, etc. = Removing a proxy device from a logical volume = At any time you are able to remove proxy device from your logical volume. It is in the assumption that no volume operations like adding/removing a device to/from this logical volume are in progress at the moment of proxy device removal. Removing a proxy device is absolutely similar to removing usual device form the logical volume. In particular, removal operation is always completed with data migration from the proxy-device to be removed to other devices-components of your logical volume. Before removing a proxy device make sure that there are enough space on other devices-components of your logical volume. Note that disk space of meta-data brick is not counted in the case when the last one is not a member of DSA (Data Storage Array), i.e. is used to store meta-data only. To remove proxy device from your logical volume simply execute volume.reiser4 -r DEV MNT where DEV is name of proxy device to be removed, MNT is mount point of your logical volume. 
The procedure of removing a proxy device can return errors: EBUSY means that the procedure of data migration gave way to some other process of higher priority in the competition for resources (usually long-term locks on the storage tree) ENOMEM means not enough memory for some data migration subroutine ENOSPC means not enough space on other devices The mentioned errors should be handled in the same way that in the case of [https://reiser4.wiki.kernel.org/index.php/Proxy_Device_Administration#Flushing_a_proxy_device flushing]. If procedure of removing a proxy device was interrupted because of some reasons (system crash, or hard reset), then just follow [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Deploying_a_logical_volume_after_hard_reset_or_system_crash instructions] on deploying a logical volume after interrupted device removal. = Monitoring a proxy device = Monitoring a proxy device is performed by usual means, see e.g. [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#LV_monitoring this]. In order to check free space on your proxy device execute volume.reiser4 MNT -p N where N is serial number of the proxy device in your logical volume mounted at MNT. Don't forget to sync dirty pages before this! Number of busy data blocks can be found as difference (blocks used - system blocks). = Using meta-data brick as proxy device = It is possible to use meta-data brick of your logical volume as a proxy device. Before this make sure that it is not used to store data (otherwise, operation on its adding as proxy device will fail). To check it simply execute: volume.reiser4 MNT -p 0 and check value of the field "in DSA". It should be "No". Otherwise, remove meta-data brick from data storage array by executing volume.reiser4 -r MTD_NAME MNT where MTD_NAME is device name of the meta-data brick, MNT is mount point of your logical volume. 
After removal completion add the meta-data brick as proxy device: volume.reiser4 -x MTD_NAME MNT WARNING: When using meta-data brick as proxy device, requirements on its [https://reiser4.wiki.kernel.org/index.php/Proxy_Device_Administration#Flushing_a_proxy_device flushing] are especially high, because in the case of no free space on meta-data brick you are not able to create new files on your logical volume. 3b74c7707bc0d126089881ebce82bd4d3b6ccb04 4361 4360 2020-05-09T00:55:35Z Edward 4 Proxy device represents a tier in "Burst Buffers" strategy of data tiering. Before working with proxy devices you need to understand basic principles of reiser4 logical volumes, including [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Background backgound] and [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration administration] bits. = Adding a proxy device to a logical volume = You are able to add a proxy device to your logical volume at any time. Currently only one proxy device per logical volume is supported. Also there a restriction that no volume operations (like adding/removing a device to/from this logical volume) should be in progress at the moment of adding a proxy device. Also any device, which participates in regular data distribution can not be a proxy device. After adding proxy device to logical volume, capacity of the last one gets increased precisely on the capacity of that proxy device. Operation of adding a proxy device will automatically turn on specified tiering policy. Currently there is a single one - Burst Buffers. Thus, once being added to a logical volume, proxy device gets an absolute priority in block allocations (including journal blocks (wandering logs). I remind that in reiser4 all disk allocations are always "delayed", that is, performed at commit time. In other bits proxy device is the most ordinary brick like other devices-components of your logical volume. 
Before adding to a logical volume, proxy device should be formatted like other bricks. Respectively, at format time you need to specify UUID and stripe size of your logical volume: mkfs.reiser4 -U UUID -t STRIPE_SIZE DEV In order to add a proxy device to your logical volume simply execute volume.reiser4 -x DEV MNT where DEV it the name of your properly formatted proxy device, MNT is mount point of your logical volume. The procedure of adding a proxy device is always quick and is not accompanied with any data migration. = Flushing a proxy device = After being added to your logical volume, the proxy device automatically becomes a home of all new allocations, and hence, persistently gets filled with data. So user needs to take care on flushing data to the main storage. Without it the free space on your proxy device will be ended soon, and the advantage of having a proxy device will disappear. When free space on the proxy device gets ended, allocation will automatically happen on the main volume. Note, however, that by default it will slow down the overall performance, just because it is rather expensive to check, that free space on your proxy device has really ended. So don't allow your proxy device to be completely filled with data! There is one more option though: when reservation on the proxy device gets ended, allocation will happen on the main volume. In this case no performance drop will happen, however, disk space of your proxy device will be used less efficiently (as reservations are performed basing on estimations, some of them are rather rough). Flushing proxy device is performed via common migration procedure (the same procedure is used to migrate data when adding/removing a device to/from a logical volume). So, in order to flush your proxy device, just execute volume.reiser4 -b MNT where MNT is mount point of your logical volume. Like every user-space application the flushing procedure may return error. 
The list of "regular" errors is: EBUSY means that flushing procedure gave way to some other process of higher priority in the competition for resources (usually long-term locks on the storage tree) ENOMEM means not enough memory for some flushing subroutine ENOSPC means not enough space on main storage In case of returned EBUSY you just need to repeat the flushing procedure. In other cases put efforts (free some disk space on your main storage, etc) to make sure that such errors won't happen next time and then repeat the flushing procedure. If the flushing procedure was interrupted for some reason (e.g. system crash, hard reset), then simply repeat it in the next mount session for this logical volume. Once the tiering stuff is stable, we'll implement automatic flushing of proxy device by a special kernel thread, which gets woken up every time when some block allocation happens on the proxy device. With currently existing interface it is also possible to organize flushing efficiently. In the simplest case it can be done by various scripts like the following one: while true do sync volume.reiser4 -b /mnt sleep 60 done Smarter script would check space occupied by data on the proxy brick before flushing, etc. = Removing a proxy device from a logical volume = At any time you are able to remove proxy device from your logical volume. It is in the assumption that no volume operations like adding/removing a device to/from this logical volume are in progress at the moment of proxy device removal. Removing a proxy device is absolutely similar to removing usual device form the logical volume. In particular, removal operation is always completed with data migration from the proxy-device to be removed to other devices-components of your logical volume. Before removing a proxy device make sure that there are enough space on other devices-components of your logical volume. 
Note that disk space of meta-data brick is not counted in the case when the last one is not a member of DSA (Data Storage Array), i.e. is used to store meta-data only. To remove proxy device from your logical volume simply execute volume.reiser4 -r DEV MNT where DEV is name of proxy device to be removed, MNT is mount point of your logical volume. The procedure of removing a proxy device can return errors: EBUSY means that the procedure of data migration gave way to some other process of higher priority in the competition for resources (usually long-term locks on the storage tree) ENOMEM means not enough memory for some data migration subroutine ENOSPC means not enough space on other devices The mentioned errors should be handled in the same way that in the case of [https://reiser4.wiki.kernel.org/index.php/Proxy_Device_Administration#Flushing_a_proxy_device flushing]. If procedure of removing a proxy device was interrupted because of some reasons (system crash, or hard reset), then just follow [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Deploying_a_logical_volume_after_hard_reset_or_system_crash instructions] on deploying a logical volume after interrupted device removal. = Monitoring a proxy device = Monitoring a proxy device is performed by usual means, see e.g. [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#LV_monitoring this]. In order to check free space on your proxy device execute volume.reiser4 MNT -p N where N is serial number of the proxy device in your logical volume mounted at MNT. Don't forget to sync dirty pages before this! Number of busy data blocks can be found as difference (blocks used - system blocks). = Using meta-data brick as proxy device = It is possible to use meta-data brick of your logical volume as a proxy device. Before this make sure that it is not used to store data (otherwise, operation on its adding as proxy device will fail). 
To check it simply execute: volume.reiser4 MNT -p 0 and check value of the field "in DSA". It should be "No". Otherwise, remove meta-data brick from data storage array by executing volume.reiser4 -r MTD_NAME MNT where MTD_NAME is device name of the meta-data brick, MNT is mount point of your logical volume. After removal completion add the meta-data brick as proxy device: volume.reiser4 -x MTD_NAME MNT WARNING: When using meta-data brick as proxy device, requirements on its [https://reiser4.wiki.kernel.org/index.php/Proxy_Device_Administration#Flushing_a_proxy_device flushing] are especially high, because in the case of no free space on meta-data brick you are not able to create new files on your logical volume. d7539766f8e057e9f5356ce7351189a95b047507 4360 4359 2020-05-09T00:06:19Z Edward 4 /* Using meta-data brick as proxy device */ Before working with proxy devices you need to understand basic principles of reiser4 logical volumes, including [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Background backgound] and [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration administration] bits. = Adding a proxy device to a logical volume = You are able to add a proxy device to your logical volume at any time. Currently only one proxy device per logical volume is supported. Also there a restriction that no volume operations (like adding/removing a device to/from this logical volume) should be in progress at the moment of adding a proxy device. Also any device, which participates in regular data distribution can not be a proxy device. After adding proxy device to logical volume, capacity of the last one gets increased precisely on the capacity of that proxy device. Operation of adding a proxy device will automatically turn on specified tiering policy. Currently there is a single one - Burst Buffers. 
Thus, once being added to a logical volume, proxy device gets an absolute priority in block allocations (including journal blocks (wandering logs). I remind that in reiser4 all disk allocations are always "delayed", that is, performed at commit time. In other bits proxy device is the most ordinary brick like other devices-components of your logical volume. Before adding to a logical volume, proxy device should be formatted like other bricks. Respectively, at format time you need to specify UUID and stripe size of your logical volume: mkfs.reiser4 -U UUID -t STRIPE_SIZE DEV In order to add a proxy device to your logical volume simply execute volume.reiser4 -x DEV MNT where DEV it the name of your properly formatted proxy device, MNT is mount point of your logical volume. The procedure of adding a proxy device is always quick and is not accompanied with any data migration. = Flushing a proxy device = After being added to your logical volume, the proxy device automatically becomes a home of all new allocations, and hence, persistently gets filled with data. So user needs to take care on flushing data to the main storage. Without it the free space on your proxy device will be ended soon, and the advantage of having a proxy device will disappear. When free space on the proxy device gets ended, allocation will automatically happen on the main volume. Note, however, that by default it will slow down the overall performance, just because it is rather expensive to check, that free space on your proxy device has really ended. So don't allow your proxy device to be completely filled with data! There is one more option though: when reservation on the proxy device gets ended, allocation will happen on the main volume. In this case no performance drop will happen, however, disk space of your proxy device will be used less efficiently (as reservations are performed basing on estimations, some of them are rather rough). 
Flushing proxy device is performed via common migration procedure (the same procedure is used to migrate data when adding/removing a device to/from a logical volume). So, in order to flush your proxy device, just execute volume.reiser4 -b MNT where MNT is mount point of your logical volume. Like every user-space application the flushing procedure may return error. The list of "regular" errors is: EBUSY means that flushing procedure gave way to some other process of higher priority in the competition for resources (usually long-term locks on the storage tree) ENOMEM means not enough memory for some flushing subroutine ENOSPC means not enough space on main storage In case of returned EBUSY you just need to repeat the flushing procedure. In other cases put efforts (free some disk space on your main storage, etc) to make sure that such errors won't happen next time and then repeat the flushing procedure. If the flushing procedure was interrupted for some reason (e.g. system crash, hard reset), then simply repeat it in the next mount session for this logical volume. Once the tiering stuff is stable, we'll implement automatic flushing of proxy device by a special kernel thread, which gets woken up every time when some block allocation happens on the proxy device. With currently existing interface it is also possible to organize flushing efficiently. In the simplest case it can be done by various scripts like the following one: while true do sync volume.reiser4 -b /mnt sleep 60 done Smarter script would check space occupied by data on the proxy brick before flushing, etc. = Removing a proxy device from a logical volume = At any time you are able to remove proxy device from your logical volume. It is in the assumption that no volume operations like adding/removing a device to/from this logical volume are in progress at the moment of proxy device removal. Removing a proxy device is absolutely similar to removing usual device form the logical volume. 
In particular, removal operation is always completed with data migration from the proxy-device to be removed to other devices-components of your logical volume. Before removing a proxy device make sure that there are enough space on other devices-components of your logical volume. Note that disk space of meta-data brick is not counted in the case when the last one is not a member of DSA (Data Storage Array), i.e. is used to store meta-data only. To remove proxy device from your logical volume simply execute volume.reiser4 -r DEV MNT where DEV is name of proxy device to be removed, MNT is mount point of your logical volume. The procedure of removing a proxy device can return errors: EBUSY means that the procedure of data migration gave way to some other process of higher priority in the competition for resources (usually long-term locks on the storage tree) ENOMEM means not enough memory for some data migration subroutine ENOSPC means not enough space on other devices The mentioned errors should be handled in the same way that in the case of [https://reiser4.wiki.kernel.org/index.php/Proxy_Device_Administration#Flushing_a_proxy_device flushing]. If procedure of removing a proxy device was interrupted because of some reasons (system crash, or hard reset), then just follow [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Deploying_a_logical_volume_after_hard_reset_or_system_crash instructions] on deploying a logical volume after interrupted device removal. = Monitoring a proxy device = Monitoring a proxy device is performed by usual means, see e.g. [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#LV_monitoring this]. In order to check free space on your proxy device execute volume.reiser4 MNT -p N where N is serial number of the proxy device in your logical volume mounted at MNT. Don't forget to sync dirty pages before this! Number of busy data blocks can be found as difference (blocks used - system blocks). 
= Using meta-data brick as proxy device = It is possible to use meta-data brick of your logical volume as a proxy device. Before this make sure that it is not used to store data (otherwise, operation on its adding as proxy device will fail). To check it simply execute: volume.reiser4 MNT -p 0 and check value of the field "in DSA". It should be "No". Otherwise, remove meta-data brick from data storage array by executing volume.reiser4 -r MTD_NAME MNT where MTD_NAME is device name of the meta-data brick, MNT is mount point of your logical volume. After removal completion add the meta-data brick as proxy device: volume.reiser4 -x MTD_NAME MNT WARNING: When using meta-data brick as proxy device, requirements on its [https://reiser4.wiki.kernel.org/index.php/Proxy_Device_Administration#Flushing_a_proxy_device flushing] are especially high, because in the case of no free space on meta-data brick you are not able to create new files on your logical volume. 7155dfa54f9f7c60bdb3b063d08b28c7ac3f27cd 4359 4358 2020-05-09T00:03:36Z Edward 4 /* Removing a proxy device from a logical volume */ Before working with proxy devices you need to understand basic principles of reiser4 logical volumes, including [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Background backgound] and [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration administration] bits. = Adding a proxy device to a logical volume = You are able to add a proxy device to your logical volume at any time. Currently only one proxy device per logical volume is supported. Also there a restriction that no volume operations (like adding/removing a device to/from this logical volume) should be in progress at the moment of adding a proxy device. Also any device, which participates in regular data distribution can not be a proxy device. After adding proxy device to logical volume, capacity of the last one gets increased precisely on the capacity of that proxy device. 
Operation of adding a proxy device will automatically turn on specified tiering policy. Currently there is a single one - Burst Buffers. Thus, once being added to a logical volume, proxy device gets an absolute priority in block allocations (including journal blocks (wandering logs). I remind that in reiser4 all disk allocations are always "delayed", that is, performed at commit time. In other bits proxy device is the most ordinary brick like other devices-components of your logical volume. Before adding to a logical volume, proxy device should be formatted like other bricks. Respectively, at format time you need to specify UUID and stripe size of your logical volume: mkfs.reiser4 -U UUID -t STRIPE_SIZE DEV In order to add a proxy device to your logical volume simply execute volume.reiser4 -x DEV MNT where DEV it the name of your properly formatted proxy device, MNT is mount point of your logical volume. The procedure of adding a proxy device is always quick and is not accompanied with any data migration. = Flushing a proxy device = After being added to your logical volume, the proxy device automatically becomes a home of all new allocations, and hence, persistently gets filled with data. So user needs to take care on flushing data to the main storage. Without it the free space on your proxy device will be ended soon, and the advantage of having a proxy device will disappear. When free space on the proxy device gets ended, allocation will automatically happen on the main volume. Note, however, that by default it will slow down the overall performance, just because it is rather expensive to check, that free space on your proxy device has really ended. So don't allow your proxy device to be completely filled with data! There is one more option though: when reservation on the proxy device gets ended, allocation will happen on the main volume. 
In this case no performance drop will happen, however, disk space of your proxy device will be used less efficiently (as reservations are performed basing on estimations, some of them are rather rough). Flushing proxy device is performed via common migration procedure (the same procedure is used to migrate data when adding/removing a device to/from a logical volume). So, in order to flush your proxy device, just execute volume.reiser4 -b MNT where MNT is mount point of your logical volume. Like every user-space application the flushing procedure may return error. The list of "regular" errors is: EBUSY means that flushing procedure gave way to some other process of higher priority in the competition for resources (usually long-term locks on the storage tree) ENOMEM means not enough memory for some flushing subroutine ENOSPC means not enough space on main storage In case of returned EBUSY you just need to repeat the flushing procedure. In other cases put efforts (free some disk space on your main storage, etc) to make sure that such errors won't happen next time and then repeat the flushing procedure. If the flushing procedure was interrupted for some reason (e.g. system crash, hard reset), then simply repeat it in the next mount session for this logical volume. Once the tiering stuff is stable, we'll implement automatic flushing of proxy device by a special kernel thread, which gets woken up every time when some block allocation happens on the proxy device. With currently existing interface it is also possible to organize flushing efficiently. In the simplest case it can be done by various scripts like the following one: while true do sync volume.reiser4 -b /mnt sleep 60 done Smarter script would check space occupied by data on the proxy brick before flushing, etc. = Removing a proxy device from a logical volume = At any time you are able to remove proxy device from your logical volume. 
It is in the assumption that no volume operations like adding/removing a device to/from this logical volume are in progress at the moment of proxy device removal. Removing a proxy device is absolutely similar to removing usual device form the logical volume. In particular, removal operation is always completed with data migration from the proxy-device to be removed to other devices-components of your logical volume. Before removing a proxy device make sure that there are enough space on other devices-components of your logical volume. Note that disk space of meta-data brick is not counted in the case when the last one is not a member of DSA (Data Storage Array), i.e. is used to store meta-data only. To remove proxy device from your logical volume simply execute volume.reiser4 -r DEV MNT where DEV is name of proxy device to be removed, MNT is mount point of your logical volume. The procedure of removing a proxy device can return errors: EBUSY means that the procedure of data migration gave way to some other process of higher priority in the competition for resources (usually long-term locks on the storage tree) ENOMEM means not enough memory for some data migration subroutine ENOSPC means not enough space on other devices The mentioned errors should be handled in the same way that in the case of [https://reiser4.wiki.kernel.org/index.php/Proxy_Device_Administration#Flushing_a_proxy_device flushing]. If procedure of removing a proxy device was interrupted because of some reasons (system crash, or hard reset), then just follow [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Deploying_a_logical_volume_after_hard_reset_or_system_crash instructions] on deploying a logical volume after interrupted device removal. = Monitoring a proxy device = Monitoring a proxy device is performed by usual means, see e.g. [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#LV_monitoring this]. 
To check the free space on your proxy device, execute

 volume.reiser4 MNT -p N

where N is the serial number of the proxy device in your logical volume mounted at MNT. Don't forget to sync dirty pages before this! The number of busy data blocks can be found as the difference (blocks used - system blocks).

= Using the meta-data brick as a proxy device =

It is possible to use the meta-data brick of your logical volume as a proxy device. Before doing so, make sure it is not used to store data (otherwise the operation of adding it as a proxy device will fail). To check, simply execute

 volume.reiser4 MNT -p 0

and look at the value of the field "in DSA". It should be "No". Otherwise, remove the meta-data brick from the data storage array by executing

 volume.reiser4 -r MTD_NAME MNT

where MTD_NAME is the device name of the meta-data brick and MNT is the mount point of your logical volume. After the removal completes, add the meta-data brick as a proxy device:

 volume.reiser4 -x MTD_NAME MNT

WARNING: When using the meta-data brick as a proxy device, the requirements on its [https://reiser4.wiki.kernel.org/index.php/Proxy_Device_Administration#Flushing_a_proxy_device flushing] are especially high, because with no free space on the meta-data brick you cannot create new files on your logical volume.
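The "blocks used - system blocks" arithmetic from the monitoring section is easy to script. The sketch below reads a volume.reiser4 -p report on standard input and prints the number of busy data blocks; it assumes the report contains "name: value" lines such as "blocks used: 1234", which is a guess about the output layout rather than the tool's documented format.

```shell
#!/bin/sh
# Busy data blocks on a brick = "blocks used" - "system blocks".
# The "name: value" line layout of the report is an assumption.
busy_blocks() {
    awk -F':' '
        /blocks used/   { gsub(/ /, "", $2); used = $2 }
        /system blocks/ { gsub(/ /, "", $2); sys  = $2 }
        END { print used - sys }
    '
}

# Usage (sync dirty pages first, as the monitoring section advises):
#   sync
#   volume.reiser4 /mnt -p 1 | busy_blocks
```

A flushing script could compare this number against a threshold and trigger volume.reiser4 -b only when the proxy brick is getting full.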
Also there a restriction that no volume operations (like adding/removing a device to/from this logical volume) should be in progress at the moment of adding a proxy device. Also any device, which participates in regular data distribution can not be a proxy device. After adding proxy device to logical volume, capacity of the last one gets increased precisely on the capacity of that proxy device. Operation of adding a proxy device will automatically turn on specified tiering policy. Currently there is a single one - Burst Buffers. Thus, once being added to a logical volume, proxy device gets an absolute priority in block allocations (including journal blocks (wandering logs). I remind that in reiser4 all disk allocations are always "delayed", that is, performed at commit time. In other bits proxy device is the most ordinary brick like other devices-components of your logical volume. Before adding to a logical volume, proxy device should be formatted like other bricks. Respectively, at format time you need to specify UUID and stripe size of your logical volume: mkfs.reiser4 -U UUID -t STRIPE_SIZE DEV In order to add a proxy device to your logical volume simply execute volume.reiser4 -x DEV MNT where DEV it the name of your properly formatted proxy device, MNT is mount point of your logical volume. The procedure of adding a proxy device is always quick and is not accompanied with any data migration. = Flushing a proxy device = After being added to your logical volume, the proxy device automatically becomes a home of all new allocations, and hence, persistently gets filled with data. So user needs to take care on flushing data to the main storage. Without it the free space on your proxy device will be ended soon, and the advantage of having a proxy device will disappear. When free space on the proxy device gets ended, allocation will automatically happen on the main volume. 
Note, however, that by default it will slow down the overall performance, just because it is rather expensive to check, that free space on your proxy device has really ended. So don't allow your proxy device to be completely filled with data! There is one more option though: when reservation on the proxy device gets ended, allocation will happen on the main volume. In this case no performance drop will happen, however, disk space of your proxy device will be used less efficiently (as reservations are performed basing on estimations, some of them are rather rough). Flushing proxy device is performed via common migration procedure (the same procedure is used to migrate data when adding/removing a device to/from a logical volume). So, in order to flush your proxy device, just execute volume.reiser4 -b MNT where MNT is mount point of your logical volume. Like every user-space application the flushing procedure may return error. The list of "regular" errors is: EBUSY means that flushing procedure gave way to some other process of higher priority in the competition for resources (usually long-term locks on the storage tree) ENOMEM means not enough memory for some flushing subroutine ENOSPC means not enough space on main storage In case of returned EBUSY you just need to repeat the flushing procedure. In other cases put efforts (free some disk space on your main storage, etc) to make sure that such errors won't happen next time and then repeat the flushing procedure. If the flushing procedure was interrupted for some reason (e.g. system crash, hard reset), then simply repeat it in the next mount session for this logical volume. Once the tiering stuff is stable, we'll implement automatic flushing of proxy device by a special kernel thread, which gets woken up every time when some block allocation happens on the proxy device. With currently existing interface it is also possible to organize flushing efficiently. 
In the simplest case it can be done by various scripts like the following one: while true do sync volume.reiser4 -b /mnt sleep 60 done Smarter script would check space occupied by data on the proxy brick before flushing, etc. = Removing a proxy device from a logical volume = At any time you are able to remove proxy device from your logical volume. It is in the assumption that no volume operations like adding/removing a device to/from this logical volume are in progress at the moment of proxy device removal. Removing a proxy device is absolutely similar to removing usual device form the logical volume. In particular, removal operation is always completed with data migration from the proxy-device to be removed to other devices-components of your logical volume. Before removing a proxy device make sure that there are enough space on other devices-components of your logical volume. Note that disk space of meta-data brick is not counted in the case when the last one is not a member of DSA (Data Storage Array), i.e. is used to store meta-data only. To remove proxy device from your logical volume simply execute volume.reiser4 -r DEV MNT where DEV is name of proxy device to be removed, MNT is mount point of your logical volume. The procedure of removing a proxy device can return errors: EBUSY means that the procedure of data migration gave way to some other process of higher priority in the competition for resources (usually long-term locks on the storage tree) ENOMEM means not enough memory for some data migration subroutine ENOSPC means not enough space on other devices The mentioned errors should be handled in the same way that in the case of [https://reiser4.wiki.kernel.org/index.php/Proxy_Device_Administration#Flushing_Proxy_device flushing]. 
If procedure of removing a proxy device was interrupted because of some reasons (system crash, or hard reset), then just follow [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Deploying_a_logical_volume_after_hard_reset_or_system_crash instructions] on deploying a logical volume after interrupted device removal. = Monitoring a proxy device = Monitoring a proxy device is performed by usual means, see e.g. [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#LV_monitoring this]. In order to check free space on your proxy device execute volume.reiser4 MNT -p N where N is serial number of the proxy device in your logical volume mounted at MNT. Don't forget to sync dirty pages before this! Number of busy data blocks can be found as difference (blocks used - system blocks). = Using meta-data brick as proxy device = It is possible to use meta-data brick of your logical volume as a proxy device. Before this make sure that it is not used to store data (otherwise, operation on its adding as proxy device will fail). To check it simply execute: volume.reiser4 MNT -p 0 and check value of the field "in DSA". It should be "No". Otherwise, remove meta-data brick from data storage array by executing volume.reiser4 -r MTD_NAME MNT where MTD_NAME is device name of the meta-data brick, MNT is mount point of your logical volume. After removal completion add the meta-data brick as proxy device: volume.reiser4 -x MTD_NAME MNT WARNING: When using meta-data brick as proxy device, requirements on its [https://reiser4.wiki.kernel.org/index.php/Proxy_Device_Administration#Flushing_a_proxy_device flushing] are especially high, because in the case of no free space on meta-data brick you are not able to create new files on your logical volume. 
1666d39281e781de54b7b7c53b4115c903c9a751 4357 4356 2020-05-08T23:59:35Z Edward 4 /* Removing a proxy device from a logical volume */ Before working with proxy devices you need to understand basic principles of reiser4 logical volumes, including [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Background backgound] and [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration administration] bits. = Adding a proxy device to a logical volume = You are able to add a proxy device to your logical volume at any time. Currently only one proxy device per logical volume is supported. Also there a restriction that no volume operations (like adding/removing a device to/from this logical volume) should be in progress at the moment of adding a proxy device. Also any device, which participates in regular data distribution can not be a proxy device. After adding proxy device to logical volume, capacity of the last one gets increased precisely on the capacity of that proxy device. Operation of adding a proxy device will automatically turn on specified tiering policy. Currently there is a single one - Burst Buffers. Thus, once being added to a logical volume, proxy device gets an absolute priority in block allocations (including journal blocks (wandering logs). I remind that in reiser4 all disk allocations are always "delayed", that is, performed at commit time. In other bits proxy device is the most ordinary brick like other devices-components of your logical volume. Before adding to a logical volume, proxy device should be formatted like other bricks. Respectively, at format time you need to specify UUID and stripe size of your logical volume: mkfs.reiser4 -U UUID -t STRIPE_SIZE DEV In order to add a proxy device to your logical volume simply execute volume.reiser4 -x DEV MNT where DEV it the name of your properly formatted proxy device, MNT is mount point of your logical volume. 
The procedure of adding a proxy device is always quick and is not accompanied with any data migration. = Flushing a proxy device = After being added to your logical volume, the proxy device automatically becomes a home of all new allocations, and hence, persistently gets filled with data. So user needs to take care on flushing data to the main storage. Without it the free space on your proxy device will be ended soon, and the advantage of having a proxy device will disappear. When free space on the proxy device gets ended, allocation will automatically happen on the main volume. Note, however, that by default it will slow down the overall performance, just because it is rather expensive to check, that free space on your proxy device has really ended. So don't allow your proxy device to be completely filled with data! There is one more option though: when reservation on the proxy device gets ended, allocation will happen on the main volume. In this case no performance drop will happen, however, disk space of your proxy device will be used less efficiently (as reservations are performed basing on estimations, some of them are rather rough). Flushing proxy device is performed via common migration procedure (the same procedure is used to migrate data when adding/removing a device to/from a logical volume). So, in order to flush your proxy device, just execute volume.reiser4 -b MNT where MNT is mount point of your logical volume. Like every user-space application the flushing procedure may return error. The list of "regular" errors is: EBUSY means that flushing procedure gave way to some other process of higher priority in the competition for resources (usually long-term locks on the storage tree) ENOMEM means not enough memory for some flushing subroutine ENOSPC means not enough space on main storage In case of returned EBUSY you just need to repeat the flushing procedure. 
In other cases put efforts (free some disk space on your main storage, etc) to make sure that such errors won't happen next time and then repeat the flushing procedure. If the flushing procedure was interrupted for some reason (e.g. system crash, hard reset), then simply repeat it in the next mount session for this logical volume. Once the tiering stuff is stable, we'll implement automatic flushing of proxy device by a special kernel thread, which gets woken up every time when some block allocation happens on the proxy device. With currently existing interface it is also possible to organize flushing efficiently. In the simplest case it can be done by various scripts like the following one: while true do sync volume.reiser4 -b /mnt sleep 60 done Smarter script would check space occupied by data on the proxy brick before flushing, etc. = Removing a proxy device from a logical volume = At any time you are able to remove proxy device from your logical volume. It is in the assumption that no volume operations like adding/removing a device to/from this logical volume are in progress at the moment of proxy device removal. Removing a proxy device is absolutely similar to removing usual device form the logical volume. In particular, removal operation is always completed with data migration from the proxy-device to be removed to other devices-components of your logical volume. Before removing a proxy device make sure that there are enough space on other devices-components of your logical volume. Note that disk space of meta-data brick is not counted in the case when the last one is not a member of DSA (Data Storage Array), i.e. is used to store meta-data only. To remove proxy device from your logical volume simply execute volume.reiser4 -r DEV MNT where DEV is name of proxy device to be removed, MNT is mount point of your logical volume. 
The procedure of removing a proxy device can return errors: EBUSY means that the procedure of data migration gave way to some other process of higher priority in the competition for resources (usually long-term locks on the storage tree) ENOMEM means not enough memory for some data migration subroutine ENOSPC means not enough space on other devices The mentioned errors should be handled in the same way that in the case of [https://reiser4.wiki.kernel.org/index.php/Proxy_Device_Administration#Flushing_Proxy_device flushing]. If procedure of removing a proxy device was interrupted because of some reasons (system crash, or hard reset), then just follow [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Deploying_a_logical_volume_after_hard_reset_or_system_crash instructions] on deploying a logical volume after interrupted device removal. = Monitoring a proxy device = Monitoring a proxy device is performed by usual means, see e.g. [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#LV_monitoring this]. In order to check free space on your proxy device execute volume.reiser4 MNT -p N where N is serial number of the proxy device in your logical volume mounted at MNT. Don't forget to sync dirty pages before this! Number of busy data blocks can be found as difference (blocks used - system blocks). = Using meta-data brick as proxy device = It is possible to use meta-data brick of your logical volume as a proxy device. Before this make sure that it is not used to store data (otherwise, operation on its adding as proxy device will fail). To check it simply execute: volume.reiser4 MNT -p 0 and check value of the field "in DSA". It should be "No". Otherwise, remove meta-data brick from data storage array by executing volume.reiser4 -r MTD_NAME MNT where MTD_NAME is device name of the meta-data brick, MNT is mount point of your logical volume. 
After removal completion add the meta-data brick as proxy device: volume.reiser4 -x MTD_NAME MNT WARNING: When using meta-data brick as proxy device, requirements on its flushing are especially high, because in the case of no free space on meta-data brick you are not able to create new files on your logical volume. b5a52078eb7fa5b9050285dbbdbd53cea27266c1 4356 4354 2020-05-02T19:06:01Z Edward 4 Before working with proxy devices you need to understand basic principles of reiser4 logical volumes, including [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Background backgound] and [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration administration] bits. = Adding a proxy device to a logical volume = You are able to add a proxy device to your logical volume at any time. Currently only one proxy device per logical volume is supported. Also there a restriction that no volume operations (like adding/removing a device to/from this logical volume) should be in progress at the moment of adding a proxy device. Also any device, which participates in regular data distribution can not be a proxy device. After adding proxy device to logical volume, capacity of the last one gets increased precisely on the capacity of that proxy device. Operation of adding a proxy device will automatically turn on specified tiering policy. Currently there is a single one - Burst Buffers. Thus, once being added to a logical volume, proxy device gets an absolute priority in block allocations (including journal blocks (wandering logs). I remind that in reiser4 all disk allocations are always "delayed", that is, performed at commit time. In other bits proxy device is the most ordinary brick like other devices-components of your logical volume. Before adding to a logical volume, proxy device should be formatted like other bricks. 
Respectively, at format time you need to specify UUID and stripe size of your logical volume: mkfs.reiser4 -U UUID -t STRIPE_SIZE DEV In order to add a proxy device to your logical volume simply execute volume.reiser4 -x DEV MNT where DEV it the name of your properly formatted proxy device, MNT is mount point of your logical volume. The procedure of adding a proxy device is always quick and is not accompanied with any data migration. = Flushing a proxy device = After being added to your logical volume, the proxy device automatically becomes a home of all new allocations, and hence, persistently gets filled with data. So user needs to take care on flushing data to the main storage. Without it the free space on your proxy device will be ended soon, and the advantage of having a proxy device will disappear. When free space on the proxy device gets ended, allocation will automatically happen on the main volume. Note, however, that by default it will slow down the overall performance, just because it is rather expensive to check, that free space on your proxy device has really ended. So don't allow your proxy device to be completely filled with data! There is one more option though: when reservation on the proxy device gets ended, allocation will happen on the main volume. In this case no performance drop will happen, however, disk space of your proxy device will be used less efficiently (as reservations are performed basing on estimations, some of them are rather rough). Flushing proxy device is performed via common migration procedure (the same procedure is used to migrate data when adding/removing a device to/from a logical volume). So, in order to flush your proxy device, just execute volume.reiser4 -b MNT where MNT is mount point of your logical volume. Like every user-space application the flushing procedure may return error. 
The list of "regular" errors is: EBUSY means that flushing procedure gave way to some other process of higher priority in the competition for resources (usually long-term locks on the storage tree) ENOMEM means not enough memory for some flushing subroutine ENOSPC means not enough space on main storage In case of returned EBUSY you just need to repeat the flushing procedure. In other cases put efforts (free some disk space on your main storage, etc) to make sure that such errors won't happen next time and then repeat the flushing procedure. If the flushing procedure was interrupted for some reason (e.g. system crash, hard reset), then simply repeat it in the next mount session for this logical volume. Once the tiering stuff is stable, we'll implement automatic flushing of proxy device by a special kernel thread, which gets woken up every time when some block allocation happens on the proxy device. With currently existing interface it is also possible to organize flushing efficiently. In the simplest case it can be done by various scripts like the following one: while true do sync volume.reiser4 -b /mnt sleep 60 done Smarter script would check space occupied by data on the proxy brick before flushing, etc. = Removing a proxy device from a logical volume = At any time you are able to remove proxy device from your logical volume. It is in the assumption that no volume operations like adding/removing a device to/from this logical volume are in progress at the moment of proxy device removal. Removing a proxy device is absolutely similar to removing usual device form the logical volume. In particular, removal operation is always completed with data migration from the proxy-device to be removed to other devices-components of your logical volume. Before removing a proxy device make sure that there are enough space on other devices-components of your logical volume. 
Note that disk space of meta-data brick is not counted in the case when the last one is not a member of DSA (Data Storage Array), i.e. is used to store meta-data only. To remove proxy device from your logical volume simply execute volume.reiser4 -r DEV MNT where DEV is name of proxy device to be removed, MNT is mount point of your logical volume. The procedure of removing a proxy device can return errors: EBUSY means that the procedure of data migration gave way to some other process of higher priority in the competition for resources (usually long-term locks on the storage tree) ENOMEM means not enough memory for some data migration subroutine ENOSPC means not enough space on other devices The mentioned errors should be handled in the same way that in the case of flushing (see section above). If procedure of removing a proxy device was interrupted because of some reasons (system crash, or hard reset), then just follow [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Deploying_a_logical_volume_after_hard_reset_or_system_crash instructions] on deploying a logical volume after interrupted device removal. = Monitoring a proxy device = Monitoring a proxy device is performed by usual means, see e.g. [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#LV_monitoring this]. In order to check free space on your proxy device execute volume.reiser4 MNT -p N where N is serial number of the proxy device in your logical volume mounted at MNT. Don't forget to sync dirty pages before this! Number of busy data blocks can be found as difference (blocks used - system blocks). = Using meta-data brick as proxy device = It is possible to use meta-data brick of your logical volume as a proxy device. Before this make sure that it is not used to store data (otherwise, operation on its adding as proxy device will fail). To check it simply execute: volume.reiser4 MNT -p 0 and check value of the field "in DSA". It should be "No". 
Otherwise, remove meta-data brick from data storage array by executing volume.reiser4 -r MTD_NAME MNT where MTD_NAME is device name of the meta-data brick, MNT is mount point of your logical volume. After removal completion add the meta-data brick as proxy device: volume.reiser4 -x MTD_NAME MNT WARNING: When using meta-data brick as proxy device, requirements on its flushing are especially high, because in the case of no free space on meta-data brick you are not able to create new files on your logical volume. 68b0e4159eaf8d3299b50bcf4a2bc50bbaeeeff9 4354 4351 2020-05-02T17:25:14Z Edward 4 /* Adding a proxy device to a logical volume */ Before working with proxy devices you need to know basic principles of reiser5 logical volumes [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration administration]. = Adding a proxy device to a logical volume = You are able to add a proxy device to your logical volume at any time. Currently only one proxy device per logical volume is supported. Also there a restriction that no volume operations (like adding/removing a device to/from this logical volume) should be in progress at the moment of adding a proxy device. Also any device, which participates in regular data distribution can not be a proxy device. After adding proxy device to logical volume, capacity of the last one gets increased precisely on the capacity of that proxy device. Operation of adding a proxy device will automatically turn on specified tiering policy. Currently there is a single one - Burst Buffers. Thus, once being added to a logical volume, proxy device gets an absolute priority in block allocations (including journal blocks (wandering logs). I remind that in reiser4 all disk allocations are always "delayed", that is, performed at commit time. In other bits proxy device is the most ordinary brick like other devices-components of your logical volume. Before adding to a logical volume, proxy device should be formatted like other bricks. 
Respectively, at format time you need to specify UUID and stripe size of your logical volume: mkfs.reiser4 -U UUId -t STRIPE_SIZE DEV In order to add a proxy device to your logical volume simply execute volume.reiser4 -x DEV MNT where DEV it the name of your properly formatted proxy device, MNT is mount point of your logical volume. The procedure of adding a proxy device is always quick and is not accompanied with any data migration. = Flushing a proxy device = After being added to your logical volume, the proxy device automatically becomes a home of all new allocations, and hence, persistently gets filled with data. So user needs to take care on flushing data to the main storage. Without it the free space on your proxy device will be ended soon, and the advantage of having a proxy device will disappear. When free space on the proxy device gets ended, allocation will automatically happen on the main volume. Note, however, that by default it will slow down the overall performance, just because it is rather expensive to check, that free space on your proxy device has really ended. So don't allow your proxy device to be completely filled with data! There is one more option though: when reservation on the proxy device gets ended, allocation will happen on the main volume. In this case no performance drop will happen, however, disk space of your proxy device will be used less efficiently (as reservations are performed basing on estimations, some of them are rather rough). Flushing proxy device is performed via common migration procedure (the same procedure is used to migrate data when adding/removing a device to/from a logical volume). So, in order to flush your proxy device, just execute volume.reiser4 -b MNT where MNT is mount point of your logical volume. Like every user-space application the flushing procedure may return error. 
The list of "regular" errors is: EBUSY means that flushing procedure gave way to some other process of higher priority in the competition for resources (usually long-term locks on the storage tree) ENOMEM means not enough memory for some flushing subroutine ENOSPC means not enough space on main storage In case of returned EBUSY you just need to repeat the flushing procedure. In other cases put efforts (free some disk space on your main storage, etc) to make sure that such errors won't happen next time and then repeat the flushing procedure. If the flushing procedure was interrupted for some reason (e.g. system crash, hard reset), then simply repeat it in the next mount session for this logical volume. Once the tiering stuff is stable, we'll implement automatic flushing of proxy device by a special kernel thread, which gets woken up every time when some block allocation happens on the proxy device. With currently existing interface it is also possible to organize flushing efficiently. In the simplest case it can be done by various scripts like the following one: while true do sync volume.reiser4 -b /mnt sleep 60 done Smarter script would check space occupied by data on the proxy brick before flushing, etc. = Removing a proxy device from a logical volume = At any time you are able to remove proxy device from your logical volume. It is in the assumption that no volume operations like adding/removing a device to/from this logical volume are in progress at the moment of proxy device removal. Removing a proxy device is absolutely similar to removing usual device form the logical volume. In particular, removal operation is always completed with data migration from the proxy-device to be removed to other devices-components of your logical volume. Before removing a proxy device make sure that there are enough space on other devices-components of your logical volume. 
Note that disk space of meta-data brick is not counted in the case when the last one is not a member of DSA (Data Storage Array), i.e. is used to store meta-data only. To remove proxy device from your logical volume simply execute volume.reiser4 -r DEV MNT where DEV is name of proxy device to be removed, MNT is mount point of your logical volume. The procedure of removing a proxy device can return errors: EBUSY means that the procedure of data migration gave way to some other process of higher priority in the competition for resources (usually long-term locks on the storage tree) ENOMEM means not enough memory for some data migration subroutine ENOSPC means not enough space on other devices The mentioned errors should be handled in the same way that in the case of flushing (see section above). If procedure of removing a proxy device was interrupted because of some reasons (system crash, or hard reset), then just follow [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#Deploying_a_logical_volume_after_hard_reset_or_system_crash instructions] on deploying a logical volume after interrupted device removal. = Monitoring a proxy device = Monitoring a proxy device is performed by usual means, see e.g. [https://reiser4.wiki.kernel.org/index.php/Logical_Volumes_Administration#LV_monitoring this]. In order to check free space on your proxy device execute volume.reiser4 MNT -p N where N is serial number of the proxy device in your logical volume mounted at MNT. Don't forget to sync dirty pages before this! Number of busy data blocks can be found as difference (blocks used - system blocks). = Using meta-data brick as proxy device = It is possible to use meta-data brick of your logical volume as a proxy device. Before this make sure that it is not used to store data (otherwise, operation on its adding as proxy device will fail). To check it simply execute: volume.reiser4 MNT -p 0 and check value of the field "in DSA". It should be "No". 
Otherwise, remove the meta-data brick from the data storage array by executing

 volume.reiser4 -r MTD_NAME MNT

where MTD_NAME is the device name of the meta-data brick and MNT is the mount point of your logical volume. Once the removal completes, add the meta-data brick as proxy device:

 volume.reiser4 -x MTD_NAME MNT

WARNING: When using the meta-data brick as proxy device, the requirements on its flushing are especially high: once there is no free space left on the meta-data brick, you can no longer create new files on your logical volume.
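Putting the check-then-convert steps above together, the sequence might look like the sketch below. The "in DSA" field parsing and the device name are assumptions; only the volume.reiser4 invocations themselves come from this page.

```shell
MNT=/mnt
MTD=/dev/sdb1   # device name of the meta-data brick (hypothetical example)

# extract the "in DSA" field from `volume.reiser4 MNT -p 0` output (stdin);
# the output format is an assumption
in_dsa() {
    awk -F': *' '/in DSA/ {print $2}'
}

convert_mtd_to_proxy() {
    if [ "$(volume.reiser4 "$MNT" -p 0 | in_dsa)" = "Yes" ]; then
        # still a DSA member: evict its data from the data storage array first
        volume.reiser4 -r "$MTD" "$MNT"
    fi
    volume.reiser4 -x "$MTD" "$MNT"   # now add it as the proxy device
}
```

Remember the warning above: with this configuration, schedule flushing aggressively, since a full meta-data brick blocks creation of new files.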
Publications

= Publications =

* [[Future_Vision|Future Vision]] (Jan 25, 2001)
* [[txn-doc|Reiser4 Transaction Design Document]] (Apr 05, 2002)
* [[X0reiserfs|Three reasons why ReiserFS is great for you]] (2002)
* [[V4|Reasons why Reiser4 is great for you]] (Nov 13, 2006)

= Links =

* [http://forums.gentoo.org/viewtopic-t-706171.html Reiser4 Gentoo FAQ] (2014-12-10)
* [http://forums.gentoo.org/viewtopic-t-707465.html ReiserFS tuning thread]
* [https://en.wikipedia.org/wiki/Reiser4 Reiser4 Wikipedia article]
* [https://en.wikipedia.org/wiki/ReiserFS ReiserFS Wikipedia article]

[[category:ReiserFS]] [[category:Reiser4]]
Reiser4 Howto

= Reiser4 =

As <tt>reiser4</tt> is not in mainline yet, we have to apply the right [[Reiser4_patchsets|patch]] to get this working:

 wget http://downloads.sourceforge.net/project/reiser4/reiser4-for-linux-5.x/reiser4-for-5.8.10.patch.gz
 cd /usr/src/linux
 gzip -dc ~/reiser4-for-5.8.10.patch.gz | patch -p1

Now enable <tt>CONFIG_REISER4_FS</tt>, then build (and install) your kernel. Do not enable the debugging option: it is for developers only. Reboot. If your kernel is older than 4.10, make sure that your operating system uses a swap partition of the standard recommended size.

To create/check/debug Reiser4 filesystems, you'll need the [[Reiser4progs|reiser4progs]]. Format your partition with the mkfs.reiser4 utility. To protect your metadata with [[Reiser4_checksums|checksums]], use the mkfs option "-o node=node41". If you create reiser4 on an [[Reiser4_discard_support|SSD drive]], use the mkfs option "-d". NOTE: mkfs.reiser4 version 1.1.0 (and greater) turns intelligent compression on by default. To disable compression, override it with the mkfs option "create=reg40". Compression is highly recommended e.g. for root partitions, which contain system data.
This is because the default reiser4 intelligent compression heuristic works perfectly on a mix of well-compressible text files and non-compressible binaries. However, intelligent compression is suboptimal for large media files (ISO images, MP4, etc.). Currently it is impossible to specify compression per file or per directory, so for large media files we recommend a separate partition with compression disabled.

Choose the [[Reiser4_transaction_models|transaction model]] that suits you best:

{| class="wikitable"
|-
! Mount option
! Description
! Intended for
! Default
|-
| txmod=journal
| Classic journaling with wandering logs. All blocks of a transaction are overwritten.
| HDD users who perform a lot of random overwrites (e.g. databases)
| no
|-
| txmod=wa
| Classic Write-Anywhere. All blocks of a transaction except system ones get a new location on disk.
| SSD users
| no
|-
| txmod=hybrid
| Hybrid transaction model. Some blocks are overwritten, others are written to a new location on disk.
| HDD users who don't perform a lot of random overwrites
| yes
|}

Mount your reiser4 partition. Use the mount option "-o discard" for SSD drives. More details are [[Reiser4_discard_support|here]]. [[Bugs|Report bugs]] if something goes wrong.

= ReiserFS =

Since <tt>reiserfs</tt> is in mainline, just enable the following options in your kernel <tt>.config</tt>:

 CONFIG_REISERFS_FS
 CONFIG_REISERFS_FS_XATTR (optional)
 CONFIG_REISERFS_FS_POSIX_ACL (optional)
 CONFIG_REISERFS_FS_SECURITY (optional)

Today's distributions should have these options enabled already, so there is no need to build your own kernel. However, not every Linux distribution ''supports'' <tt>reiserfs</tt>. But if you disregard your distribution's recommended settings, you probably know what you're doing anyway.

* [[ReiserFS/kerneloptions|Compile-Time Options for Configuring ReiserFS]]

To create/check/debug/resize ReiserFS filesystems, you'll need the [[Reiserfsprogs|reiserfsprogs]].
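Before rebuilding anything, it is easy to check whether a given kernel config already has these options. A small sketch (the config path is the conventional one; this helper is not part of reiserfsprogs):

```shell
# Succeed if the option is built in (=y) or modular (=m)
# in the given kernel config file.
check_cfg() {
    grep -q "^${2}=[ym]" "$1"
}

# typical usage (path assumed; adjust for your distribution):
#   check_cfg "/boot/config-$(uname -r)" CONFIG_REISERFS_FS && echo "reiserfs enabled"
```

Lines like "# CONFIG_REISERFS_FS is not set" correctly fail the check, since the pattern is anchored at the start of the line.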
= Booting off a ReiserFS/Reiser4 partition =

Booting ''off a ReiserFS/Reiser4 partition'': what we mean here is that the kernel (usually a file in <tt>/boot</tt>) is actually located on a ReiserFS/Reiser4 partition. If you have a separate partition for <tt>/boot</tt> (e.g. a (readonly-mounted) ext2 partition at the beginning of your disk) and your ''root filesystem'' is on a ReiserFS/Reiser4 partition, you only need to make sure that ReiserFS/Reiser4 support is enabled in your kernel - but that's true for every filesystem and has nothing to do with the bootloader.

As far as the writer is informed, booting off a ReiserFS partition is fully supported by [http://freshmeat.net/projects/lilo/ LiLo] and [http://www.gnu.org/software/grub/grub.html GRUB]. For Reiser4, LiLo is [http://wiki.archlinux.org/index.php/Reiser4FShowto#Packages known to work] out of the box. To install GRUB on a Reiser4 partition, [[Reiser4_Howto/GRUB|a few more steps are needed]].

= Links =

* [[Reiser4 checksums]]
* [[Reiser4 Mirrors and Failover]]
* [[Debug Reiser4|Reiser4 Debugging]]
* [http://web.archive.org/web/20061113154749/www.namesys.com/install_v4.html Getting started with Reiser4] (from archive.org, 2006-11-13)

[[category:ReiserFS]] [[category:Reiser4]]
To create/check/debug Reiser4 filesystems, you'll need the [[Reiser4progs|reiser4progs]]. Format your partition with mkfs.reiser4 utility. To protect your metadata by [[Reiser4_checksums | checksums]] use mkfs option "-o node=node41". If you create reiser4 on [[Reiser4_discard_support | SSD drive]], then use mkfs option "-d". NOTE: mkfs.reiser4 of version 1.1.0 by default turns intelligent compression on. To disable compression, override it by mkfs option "create=reg40". Compression is highly recommended e.g. for root partitions, which contain system data. It is because default reiser4 intelligent compression heuristic works perfectly on a mix of well-compressible text files and non-compressible binaries. However, intelligent compression is suboptimal for large media-files (ISO images, MP4, etc). Currently it is impossible to specify compression per-file, or per-directory, so for large media-files we recommend to use a separate partition with disabled compression. Choose a [[Reiser4_transaction_models| transaction model]], which is most suitable for you: ----------------------------------------------------------------------------------------------- MOUNT OPTION DESCRIPTION INTENDED FOR DEFAULT ----------------------------------------------------------------------------------------------- txmod=journal Classic journalling HDD users, who performs a with wandering logs lot of random overwrites no (e.g. data bases) ----------------------------------------------------------------------------------------------- txmod=wa Classic Write-Anywhere SSD users no aka Copy-on-Write ----------------------------------------------------------------------------------------------- txmod=hybrid Hybrid transaction model HDD users, who don't perform (a part of blocks are a lot of random overwrites yes overwritten, other ones are written to new location on disk ----------------------------------------------------------------------------------------------- Mount your reiser4 partition. 
Use the mount option "-o discard" for SSD drives. More details are [[Reiser4_discard_support | here]]. [[Bugs|Report bugs]] if something is going wrong. = ReiserFS = Since <tt>reiserfs</tt> is in mainline, just enable the following options in your kernel <tt>.config</tt>: CONFIG_REISERFS_FS CONFIG_REISERFS_FS_XATTR (optional) CONFIG_REISERFS_FS_POSIX_ACL (optional) CONFIG_REISERFS_FS_SECURITY (optional) Todays distributions should have this options enabled already, no need to build your kernel. However, not every Linux distribution ''supports'' <tt>reiserfs</tt>. But if you disregard your distribution's recommended settings, you'll probably know what you're doing anyway. * [[ReiserFS/kerneloptions | Compile-Time Options for Configuring ReiserFS]] To create/check/debug/resize ReiserFS filesystems, you'll need the [[Reiserfsprogs|reiserfsprogs]]. = Booting off a ReiserFS/Reiser4 partition = Booting ''off a ReiserFS/Reiser4 partition'': what we mean here is that the kernel (usually a file in <tt>/boot</tt>) is actually located on a ReiserFS/Reiser4 partition. If you have a separate partition for <tt>/boot</tt> (e.g. a (readonly-mounted) ext2 partition at the beginning of your disk) and your ''root-filesystem'' is on a ReiserFS/Reiser4 partition, you only need to make sure that ReiserFS/Reiser4 support is enabled in your kernel - but that's true for every filesystem and has nothing to to with the bootloader. As far as the writer is informed, booting off a ReiserFS partition is fully supported by [http://freshmeat.net/projects/lilo/ LiLo] or [http://www.gnu.org/software/grub/grub.html GRUB]. For Reiser4, LiLo is [http://wiki.archlinux.org/index.php/Reiser4FShowto#Packages known to work] out of the box. To install GRUB on a Reiser4 partition, [[Reiser4_Howto/GRUB|a few more steps are needed]]. 
= Links = * [[Reiser4 checksums]] * [[Reiser4 Mirrors and Failover]] * [[Debug Reiser4|Reiser4 Debugging]] * [http://web.archive.org/web/20061113154749/www.namesys.com/install_v4.html Getting started with Reiser4] (from archive.org, 2006-11-13) [[category:ReiserFS]] [[category:Reiser4]] c59796f2b5493b4f389c64c72f1d1aca069b5962 4291 4281 2017-10-04T01:48:50Z Edward 4 /* Reiser4 */ = Reiser4 = As <tt>reiser4</tt> is not in mainline yet, we have to apply the right [[Reiser4_patchsets|patch]] to get this working: wget http://downloads.sourceforge.net/project/reiser4/reiser4-for-linux-4.x/reiser4-for-4.13.0.patch.gz cd /usr/src/linux gzip -dc ~/reiser4-for-4.13.0.patch.gz | patch -p1 Now enable <tt>CONFIG_REISER4_FS</tt> and build (and install) your kernel. Do not enable debugging option: this is for developers only. Reboot. If your kernel older than 4.10, make sure that your operating system uses a swap partition of standard recommended size. To create/check/debug Reiser4 filesystems, you'll need the [[Reiser4progs|reiser4progs]]. Format your partition with mkfs.reiser4 utility. To protect your metadata by [[Reiser4_checksums | checksums]] use mkfs option "-o node=node41". If you create reiser4 on [[Reiser4_discard_support | SSD drive]], then use mkfs option "-d". NOTE: mkfs.reiser4 of version 1.1.0 by default turns intelligent compression on. To disable compression, override it by mkfs option "create=reg40". Compression is highly recommended e.g. for root partitions, which contain system data. It is because default reiser4 intelligent compression heuristic works perfectly on a mix of well-compressible text files and non-compressible binaries. However, intelligent compression is suboptimal for large media-files (ISO images, MP4, etc). Currently it is impossible to specify compression per-file, or per-directory, so for large media-files we recommend to use a separate partition with disabled compression. 
Choose a [[Reiser4_transaction_models| transaction model]], which is most suitable for you: ------------------------------------------------------------------------------------------- MOUNT OPTION DESCRIPTION INTENDED FOR DEFAULT ------------------------------------------------------------------------------------------- txmod=journal Classic journalling HDD users no with wandering logs ------------------------------------------------------------------------------------------- txmod=wa Classic Write-Anywhere SSD users no aka Copy-on-Write ------------------------------------------------------------------------------------------- txmod=hybrid Hybrid transaction model HDD users, who don't perform yes provides parent-first order a lot of random overwrites on the storage tree nodes in terms of disk addresses ------------------------------------------------------------------------------------------- Mount your reiser4 partition. Use the mount option "-o discard" for SSD drives. More details are [[Reiser4_discard_support | here]]. [[Bugs|Report bugs]] if something is going wrong. = ReiserFS = Since <tt>reiserfs</tt> is in mainline, just enable the following options in your kernel <tt>.config</tt>: CONFIG_REISERFS_FS CONFIG_REISERFS_FS_XATTR (optional) CONFIG_REISERFS_FS_POSIX_ACL (optional) CONFIG_REISERFS_FS_SECURITY (optional) Todays distributions should have this options enabled already, no need to build your kernel. However, not every Linux distribution ''supports'' <tt>reiserfs</tt>. But if you disregard your distribution's recommended settings, you'll probably know what you're doing anyway. * [[ReiserFS/kerneloptions | Compile-Time Options for Configuring ReiserFS]] To create/check/debug/resize ReiserFS filesystems, you'll need the [[Reiserfsprogs|reiserfsprogs]]. 
= Booting off a ReiserFS/Reiser4 partition = Booting ''off a ReiserFS/Reiser4 partition'': what we mean here is that the kernel (usually a file in <tt>/boot</tt>) is actually located on a ReiserFS/Reiser4 partition. If you have a separate partition for <tt>/boot</tt> (e.g. a (readonly-mounted) ext2 partition at the beginning of your disk) and your ''root-filesystem'' is on a ReiserFS/Reiser4 partition, you only need to make sure that ReiserFS/Reiser4 support is enabled in your kernel - but that's true for every filesystem and has nothing to to with the bootloader. As far as the writer is informed, booting off a ReiserFS partition is fully supported by [http://freshmeat.net/projects/lilo/ LiLo] or [http://www.gnu.org/software/grub/grub.html GRUB]. For Reiser4, LiLo is [http://wiki.archlinux.org/index.php/Reiser4FShowto#Packages known to work] out of the box. To install GRUB on a Reiser4 partition, [[Reiser4_Howto/GRUB|a few more steps are needed]]. = Links = * [[Reiser4 checksums]] * [[Reiser4 Mirrors and Failover]] * [[Debug Reiser4|Reiser4 Debugging]] * [http://web.archive.org/web/20061113154749/www.namesys.com/install_v4.html Getting started with Reiser4] (from archive.org, 2006-11-13) [[category:ReiserFS]] [[category:Reiser4]] 8d0e9e49e2b0f4c2b18ad6407730b783462721e5 4281 4279 2017-07-12T10:27:21Z Edward 4 /* Reiser4 */ = Reiser4 = As <tt>reiser4</tt> is not in mainline yet, we have to apply the right [[Reiser4_patchsets|patch]] to get this working: wget http://downloads.sourceforge.net/project/reiser4/reiser4-for-linux-4.x/reiser4-for-4.10.0.patch.gz cd /usr/src/linux gzip -dc ~/reiser4-for-4.10.0.patch.gz | patch -p1 Now enable <tt>CONFIG_REISER4_FS</tt> and build (and install) your kernel. Do not enable debugging option: this is for developers only. Reboot. If your kernel older than 4.10, make sure that your operating system uses a swap partition of standard recommended size. 
To create/check/debug Reiser4 filesystems, you'll need the [[Reiser4progs|reiser4progs]]. Format your partition with mkfs.reiser4 utility. To protect your metadata by [[Reiser4_checksums | checksums]] use mkfs option "-o node=node41". If you create reiser4 on [[Reiser4_discard_support | SSD drive]], then use mkfs option "-d". NOTE: mkfs.reiser4 of version 1.1.0 by default turns intelligent compression on. To disable compression, override it by mkfs option "create=reg40". Compression is highly recommended e.g. for root partitions, which contain system data. It is because default reiser4 intelligent compression heuristic works perfectly on a mix of well-compressible text files and non-compressible binaries. However, intelligent compression is suboptimal for large media-files (ISO images, MP4, etc). Currently it is impossible to specify compression per-file, or per-directory, so for large media-files we recommend to use a separate partition with disabled compression. Choose a [[Reiser4_transaction_models| transaction model]], which is most suitable for you: ------------------------------------------------------------------------------------------- MOUNT OPTION DESCRIPTION INTENDED FOR DEFAULT ------------------------------------------------------------------------------------------- txmod=journal Classic journalling HDD users no with wandering logs ------------------------------------------------------------------------------------------- txmod=wa Classic Write-Anywhere SSD users no aka Copy-on-Write ------------------------------------------------------------------------------------------- txmod=hybrid Hybrid transaction model HDD users, who don't perform yes provides parent-first order a lot of random overwrites on the storage tree nodes in terms of disk addresses ------------------------------------------------------------------------------------------- Mount your reiser4 partition. Use the mount option "-o discard" for SSD drives. 
More details are [[Reiser4_discard_support | here]]. [[Bugs|Report bugs]] if something is going wrong. = ReiserFS = Since <tt>reiserfs</tt> is in mainline, just enable the following options in your kernel <tt>.config</tt>: CONFIG_REISERFS_FS CONFIG_REISERFS_FS_XATTR (optional) CONFIG_REISERFS_FS_POSIX_ACL (optional) CONFIG_REISERFS_FS_SECURITY (optional) Todays distributions should have this options enabled already, no need to build your kernel. However, not every Linux distribution ''supports'' <tt>reiserfs</tt>. But if you disregard your distribution's recommended settings, you'll probably know what you're doing anyway. * [[ReiserFS/kerneloptions | Compile-Time Options for Configuring ReiserFS]] To create/check/debug/resize ReiserFS filesystems, you'll need the [[Reiserfsprogs|reiserfsprogs]]. = Booting off a ReiserFS/Reiser4 partition = Booting ''off a ReiserFS/Reiser4 partition'': what we mean here is that the kernel (usually a file in <tt>/boot</tt>) is actually located on a ReiserFS/Reiser4 partition. If you have a separate partition for <tt>/boot</tt> (e.g. a (readonly-mounted) ext2 partition at the beginning of your disk) and your ''root-filesystem'' is on a ReiserFS/Reiser4 partition, you only need to make sure that ReiserFS/Reiser4 support is enabled in your kernel - but that's true for every filesystem and has nothing to to with the bootloader. As far as the writer is informed, booting off a ReiserFS partition is fully supported by [http://freshmeat.net/projects/lilo/ LiLo] or [http://www.gnu.org/software/grub/grub.html GRUB]. For Reiser4, LiLo is [http://wiki.archlinux.org/index.php/Reiser4FShowto#Packages known to work] out of the box. To install GRUB on a Reiser4 partition, [[Reiser4_Howto/GRUB|a few more steps are needed]]. 
= Links = * [[Reiser4 checksums]] * [[Reiser4 Mirrors and Failover]] * [[Debug Reiser4|Reiser4 Debugging]] * [http://web.archive.org/web/20061113154749/www.namesys.com/install_v4.html Getting started with Reiser4] (from archive.org, 2006-11-13) [[category:ReiserFS]] [[category:Reiser4]] b3339e81b032aed20ec9b40b828e0f359ecaaee1 4279 4273 2017-07-11T16:55:58Z Edward 4 More advises on using compression in reiser4 = Reiser4 = As <tt>reiser4</tt> is not in mainline yet, we have to apply the right [[Reiser4_patchsets|patch]] to get this working: wget http://downloads.sourceforge.net/project/reiser4/reiser4-for-linux-4.x/reiser4-for-4.10.0.patch.gz cd /usr/src/linux gzip -dc ~/reiser4-for-4.10.0.patch.gz | patch -p1 Now enable <tt>CONFIG_REISER4_FS</tt> and build (and install) your kernel. Reboot. If your kernel older than 4.10, make sure that your operating system uses a swap partition of standard recommended size. To create/check/debug Reiser4 filesystems, you'll need the [[Reiser4progs|reiser4progs]]. Format your partition with mkfs.reiser4 utility. To protect your metadata by [[Reiser4_checksums | checksums]] use mkfs option "-o node=node41". If you create reiser4 on [[Reiser4_discard_support | SSD drive]], then use mkfs option "-d". NOTE: mkfs.reiser4 of version 1.1.0 by default turns intelligent compression on. To disable compression, override it by mkfs option "create=reg40". Compression is highly recommended e.g. for root partitions, which contain system data. It is because default reiser4 intelligent compression heuristic works perfectly on a mix of well-compressible text files and non-compressible binaries. However, intelligent compression is suboptimal for large media-files (ISO images, MP4, etc). Currently it is impossible to specify compression per-file, or per-directory, so for large media-files we recommend to use a separate partition with disabled compression. 
Choose a [[Reiser4_transaction_models| transaction model]], which is most suitable for you: ------------------------------------------------------------------------------------------- MOUNT OPTION DESCRIPTION INTENDED FOR DEFAULT ------------------------------------------------------------------------------------------- txmod=journal Classic journalling HDD users no with wandering logs ------------------------------------------------------------------------------------------- txmod=wa Classic Write-Anywhere SSD users no aka Copy-on-Write ------------------------------------------------------------------------------------------- txmod=hybrid Hybrid transaction model HDD users, who don't perform yes provides parent-first order a lot of random overwrites on the storage tree nodes in terms of disk addresses ------------------------------------------------------------------------------------------- Mount your reiser4 partition. Use the mount option "-o discard" for SSD drives. More details are [[Reiser4_discard_support | here]]. [[Bugs|Report bugs]] if something is going wrong. = ReiserFS = Since <tt>reiserfs</tt> is in mainline, just enable the following options in your kernel <tt>.config</tt>: CONFIG_REISERFS_FS CONFIG_REISERFS_FS_XATTR (optional) CONFIG_REISERFS_FS_POSIX_ACL (optional) CONFIG_REISERFS_FS_SECURITY (optional) Todays distributions should have this options enabled already, no need to build your kernel. However, not every Linux distribution ''supports'' <tt>reiserfs</tt>. But if you disregard your distribution's recommended settings, you'll probably know what you're doing anyway. * [[ReiserFS/kerneloptions | Compile-Time Options for Configuring ReiserFS]] To create/check/debug/resize ReiserFS filesystems, you'll need the [[Reiserfsprogs|reiserfsprogs]]. 
= Booting off a ReiserFS/Reiser4 partition = Booting ''off a ReiserFS/Reiser4 partition'': what we mean here is that the kernel (usually a file in <tt>/boot</tt>) is actually located on a ReiserFS/Reiser4 partition. If you have a separate partition for <tt>/boot</tt> (e.g. a (readonly-mounted) ext2 partition at the beginning of your disk) and your ''root-filesystem'' is on a ReiserFS/Reiser4 partition, you only need to make sure that ReiserFS/Reiser4 support is enabled in your kernel - but that's true for every filesystem and has nothing to to with the bootloader. As far as the writer is informed, booting off a ReiserFS partition is fully supported by [http://freshmeat.net/projects/lilo/ LiLo] or [http://www.gnu.org/software/grub/grub.html GRUB]. For Reiser4, LiLo is [http://wiki.archlinux.org/index.php/Reiser4FShowto#Packages known to work] out of the box. To install GRUB on a Reiser4 partition, [[Reiser4_Howto/GRUB|a few more steps are needed]]. = Links = * [[Reiser4 checksums]] * [[Reiser4 Mirrors and Failover]] * [[Debug Reiser4|Reiser4 Debugging]] * [http://web.archive.org/web/20061113154749/www.namesys.com/install_v4.html Getting started with Reiser4] (from archive.org, 2006-11-13) [[category:ReiserFS]] [[category:Reiser4]] b8f14fd2ecc846de4193280000246059bbabea97 4273 4241 2017-07-11T15:33:56Z Edward 4 More precise description of Hybrid Transaction Model = Reiser4 = As <tt>reiser4</tt> is not in mainline yet, we have to apply the right [[Reiser4_patchsets|patch]] to get this working: wget http://downloads.sourceforge.net/project/reiser4/reiser4-for-linux-4.x/reiser4-for-4.10.0.patch.gz cd /usr/src/linux gzip -dc ~/reiser4-for-4.10.0.patch.gz | patch -p1 Now enable <tt>CONFIG_REISER4_FS</tt> and build (and install) your kernel. Reboot. If your kernel older than 4.10, make sure that your operating system uses a swap partition of standard recommended size. To create/check/debug Reiser4 filesystems, you'll need the [[Reiser4progs|reiser4progs]]. 
Format your partition with mkfs.reiser4 utility. To protect your metadata by [[Reiser4_checksums | checksums]] use mkfs option "-o node=node41". If you create reiser4 on [[Reiser4_discard_support | SSD drive]], then use mkfs option "-d". NOTE: mkfs.reiser4 of version 1.1.0 by default turns intelligent compression on. It can be suboptimal for large media files, etc. To disable compression, override it by the option "create=reg40". Choose a [[Reiser4_transaction_models| transaction model]], which is most suitable for you: ------------------------------------------------------------------------------------------- MOUNT OPTION DESCRIPTION INTENDED FOR DEFAULT ------------------------------------------------------------------------------------------- txmod=journal Classic journalling HDD users no with wandering logs ------------------------------------------------------------------------------------------- txmod=wa Classic Write-Anywhere SSD users no aka Copy-on-Write ------------------------------------------------------------------------------------------- txmod=hybrid Hybrid transaction model HDD users, who don't perform yes provides parent-first order a lot of random overwrites on the storage tree nodes in terms of disk addresses ------------------------------------------------------------------------------------------- Mount your reiser4 partition. Use the mount option "-o discard" for SSD drives. More details are [[Reiser4_discard_support | here]]. [[Bugs|Report bugs]] if something is going wrong. = ReiserFS = Since <tt>reiserfs</tt> is in mainline, just enable the following options in your kernel <tt>.config</tt>: CONFIG_REISERFS_FS CONFIG_REISERFS_FS_XATTR (optional) CONFIG_REISERFS_FS_POSIX_ACL (optional) CONFIG_REISERFS_FS_SECURITY (optional) Todays distributions should have this options enabled already, no need to build your kernel. However, not every Linux distribution ''supports'' <tt>reiserfs</tt>. 
But if you disregard your distribution's recommended settings, you'll probably know what you're doing anyway. * [[ReiserFS/kerneloptions | Compile-Time Options for Configuring ReiserFS]] To create/check/debug/resize ReiserFS filesystems, you'll need the [[Reiserfsprogs|reiserfsprogs]]. = Booting off a ReiserFS/Reiser4 partition = Booting ''off a ReiserFS/Reiser4 partition'': what we mean here is that the kernel (usually a file in <tt>/boot</tt>) is actually located on a ReiserFS/Reiser4 partition. If you have a separate partition for <tt>/boot</tt> (e.g. a (readonly-mounted) ext2 partition at the beginning of your disk) and your ''root-filesystem'' is on a ReiserFS/Reiser4 partition, you only need to make sure that ReiserFS/Reiser4 support is enabled in your kernel - but that's true for every filesystem and has nothing to to with the bootloader. As far as the writer is informed, booting off a ReiserFS partition is fully supported by [http://freshmeat.net/projects/lilo/ LiLo] or [http://www.gnu.org/software/grub/grub.html GRUB]. For Reiser4, LiLo is [http://wiki.archlinux.org/index.php/Reiser4FShowto#Packages known to work] out of the box. To install GRUB on a Reiser4 partition, [[Reiser4_Howto/GRUB|a few more steps are needed]]. 
= Links = * [[Reiser4 checksums]] * [[Reiser4 Mirrors and Failover]] * [[Debug Reiser4|Reiser4 Debugging]] * [http://web.archive.org/web/20061113154749/www.namesys.com/install_v4.html Getting started with Reiser4] (from archive.org, 2006-11-13) [[category:ReiserFS]] [[category:Reiser4]] 774daa294b4a3bf82efbe1e0508356f259759269 4241 4219 2017-06-20T23:22:17Z Chris goe 2 +2 more links = Reiser4 = As <tt>reiser4</tt> is not in mainline yet, we have to apply the right [[Reiser4_patchsets|patch]] to get this working: wget http://downloads.sourceforge.net/project/reiser4/reiser4-for-linux-4.x/reiser4-for-4.10.0.patch.gz cd /usr/src/linux gzip -dc ~/reiser4-for-4.10.0.patch.gz | patch -p1 Now enable <tt>CONFIG_REISER4_FS</tt> and build (and install) your kernel. Reboot. If your kernel older than 4.10, make sure that your operating system uses a swap partition of standard recommended size. To create/check/debug Reiser4 filesystems, you'll need the [[Reiser4progs|reiser4progs]]. Format your partition with mkfs.reiser4 utility. To protect your metadata by [[Reiser4_checksums | checksums]] use mkfs option "-o node=node41". If you create reiser4 on [[Reiser4_discard_support | SSD drive]], then use mkfs option "-d". NOTE: mkfs.reiser4 of version 1.1.0 by default turns intelligent compression on. It can be suboptimal for large media files, etc. To disable compression, override it by the option "create=reg40". 
Choose a [[Reiser4_transaction_models| transaction model]], which is most suitable for you: ------------------------------------------------------------------------------------------- MOUNT OPTION DESCRIPTION INTENDED FOR DEFAULT ------------------------------------------------------------------------------------------- txmod=journal Classic journalling HDD users no with wandering logs ------------------------------------------------------------------------------------------- txmod=wa Classic Write-Anywhere SSD users no aka Copy-on-Write ------------------------------------------------------------------------------------------- txmod=hybrid Hybrid transaction model HDD users, who don't perform yes a lot of random overwrites ------------------------------------------------------------------------------------------- Mount your reiser4 partition. Use the mount option "-o discard" for SSD drives. More details are [[Reiser4_discard_support | here]]. [[Bugs|Report bugs]] if something is going wrong. = ReiserFS = Since <tt>reiserfs</tt> is in mainline, just enable the following options in your kernel <tt>.config</tt>: CONFIG_REISERFS_FS CONFIG_REISERFS_FS_XATTR (optional) CONFIG_REISERFS_FS_POSIX_ACL (optional) CONFIG_REISERFS_FS_SECURITY (optional) Todays distributions should have this options enabled already, no need to build your kernel. However, not every Linux distribution ''supports'' <tt>reiserfs</tt>. But if you disregard your distribution's recommended settings, you'll probably know what you're doing anyway. * [[ReiserFS/kerneloptions | Compile-Time Options for Configuring ReiserFS]] To create/check/debug/resize ReiserFS filesystems, you'll need the [[Reiserfsprogs|reiserfsprogs]]. = Booting off a ReiserFS/Reiser4 partition = Booting ''off a ReiserFS/Reiser4 partition'': what we mean here is that the kernel (usually a file in <tt>/boot</tt>) is actually located on a ReiserFS/Reiser4 partition. If you have a separate partition for <tt>/boot</tt> (e.g. 
a (readonly-mounted) ext2 partition at the beginning of your disk) and your ''root-filesystem'' is on a ReiserFS/Reiser4 partition, you only need to make sure that ReiserFS/Reiser4 support is enabled in your kernel - but that's true for every filesystem and has nothing to to with the bootloader. As far as the writer is informed, booting off a ReiserFS partition is fully supported by [http://freshmeat.net/projects/lilo/ LiLo] or [http://www.gnu.org/software/grub/grub.html GRUB]. For Reiser4, LiLo is [http://wiki.archlinux.org/index.php/Reiser4FShowto#Packages known to work] out of the box. To install GRUB on a Reiser4 partition, [[Reiser4_Howto/GRUB|a few more steps are needed]]. = Links = * [[Reiser4 checksums]] * [[Reiser4 Mirrors and Failover]] * [[Debug Reiser4|Reiser4 Debugging]] * [http://web.archive.org/web/20061113154749/www.namesys.com/install_v4.html Getting started with Reiser4] (from archive.org, 2006-11-13) [[category:ReiserFS]] [[category:Reiser4]] 935f7060dc938da0fabda3804424a976e0f41510 4219 4099 2017-02-26T23:48:04Z Edward 4 /* Reiser4 */ = Reiser4 = As <tt>reiser4</tt> is not in mainline yet, we have to apply the right [[Reiser4_patchsets|patch]] to get this working: wget http://downloads.sourceforge.net/project/reiser4/reiser4-for-linux-4.x/reiser4-for-4.10.0.patch.gz cd /usr/src/linux gzip -dc ~/reiser4-for-4.10.0.patch.gz | patch -p1 Now enable <tt>CONFIG_REISER4_FS</tt> and build (and install) your kernel. Reboot. If your kernel older than 4.10, make sure that your operating system uses a swap partition of standard recommended size. To create/check/debug Reiser4 filesystems, you'll need the [[Reiser4progs|reiser4progs]]. Format your partition with mkfs.reiser4 utility. To protect your metadata by [[Reiser4_checksums | checksums]] use mkfs option "-o node=node41". If you create reiser4 on [[Reiser4_discard_support | SSD drive]], then use mkfs option "-d". NOTE: mkfs.reiser4 of version 1.1.0 by default turns intelligent compression on. 
It can be suboptimal for large media files, etc. To disable compression, override it by the option "create=reg40". Choose a [[Reiser4_transaction_models| transaction model]], which is most suitable for you: ------------------------------------------------------------------------------------------- MOUNT OPTION DESCRIPTION INTENDED FOR DEFAULT ------------------------------------------------------------------------------------------- txmod=journal Classic journalling HDD users no with wandering logs ------------------------------------------------------------------------------------------- txmod=wa Classic Write-Anywhere SSD users no aka Copy-on-Write ------------------------------------------------------------------------------------------- txmod=hybrid Hybrid transaction model HDD users, who don't perform yes a lot of random overwrites ------------------------------------------------------------------------------------------- Mount your reiser4 partition. Use the mount option "-o discard" for SSD drives. More details are [[Reiser4_discard_support | here]]. [[Bugs|Report bugs]] if something is going wrong. = ReiserFS = Since <tt>reiserfs</tt> is in mainline, just enable the following options in your kernel <tt>.config</tt>: CONFIG_REISERFS_FS CONFIG_REISERFS_FS_XATTR (optional) CONFIG_REISERFS_FS_POSIX_ACL (optional) CONFIG_REISERFS_FS_SECURITY (optional) Todays distributions should have this options enabled already, no need to build your kernel. However, not every Linux distribution ''supports'' <tt>reiserfs</tt>. But if you disregard your distribution's recommended settings, you'll probably know what you're doing anyway. * [[ReiserFS/kerneloptions | Compile-Time Options for Configuring ReiserFS]] To create/check/debug/resize ReiserFS filesystems, you'll need the [[Reiserfsprogs|reiserfsprogs]]. 
= Booting off a ReiserFS/Reiser4 partition =

Booting ''off a ReiserFS/Reiser4 partition'' means that the kernel (usually a file in <tt>/boot</tt>) is actually located on a ReiserFS/Reiser4 partition. If you have a separate partition for <tt>/boot</tt> (e.g. a readonly-mounted ext2 partition at the beginning of your disk) and your ''root filesystem'' is on a ReiserFS/Reiser4 partition, you only need to make sure that ReiserFS/Reiser4 support is enabled in your kernel - but that is true for every filesystem and has nothing to do with the bootloader. As far as the author knows, booting off a ReiserFS partition is fully supported by [http://freshmeat.net/projects/lilo/ LiLo] and [http://www.gnu.org/software/grub/grub.html GRUB]. For Reiser4, LiLo is [http://wiki.archlinux.org/index.php/Reiser4FShowto#Packages known to work] out of the box. To install GRUB on a Reiser4 partition, [[Reiser4_Howto/GRUB|a few more steps are needed]].

= Links =

* [[Reiser4 checksums]]
* [http://web.archive.org/web/20061113154749/www.namesys.com/install_v4.html Getting started with Reiser4] (from archive.org, 2006-11-13)

[[category:ReiserFS]] [[category:Reiser4]]
a (readonly-mounted) ext2 partition at the beginning of your disk) and your ''root-filesystem'' is on a ReiserFS/Reiser4 partition, you only need to make sure that ReiserFS/Reiser4 support is enabled in your kernel - but that's true for every filesystem and has nothing to to with the bootloader. As far as the writer is informed, booting off a ReiserFS partition is fully supported by [http://freshmeat.net/projects/lilo/ LiLo] or [http://www.gnu.org/software/grub/grub.html GRUB]. For Reiser4, LiLo is [http://wiki.archlinux.org/index.php/Reiser4FShowto#Packages known to work] out of the box. To install GRUB on a Reiser4 partition, [[Reiser4_Howto/GRUB|a few more steps are needed]]. = Links = * [http://web.archive.org/web/20061113154749/www.namesys.com/install_v4.html Getting started with Reiser4] (from archive.org, 2006-11-13) [[category:ReiserFS]] [[category:Reiser4]] ed4564c89ebbb2d6e0aafe9b4100a45e020b119c 2621 2561 2012-11-01T23:10:19Z Edward 4 = Reiser4 = As <tt>reiser4</tt> is not in mainline yet, we have to apply the right [[Reiser4_patchsets|patch]] to get this working: wget http://downloads.sourceforge.net/project/reiser4/reiser4-for-linux-3.x/reiser4-for-3.6.4.patch.gz cd /usr/src/linux gzip -dc ~/reiser4-for-3.6.4.patch.gz | patch -p1 Now enable <tt>CONFIG_REISER4_FS</tt> and build (and install) your kernel. Reboot, and don't forget to [[Bugs|report bugs]] if things go wrong. Even more important: don't forget your backups before messing with your filesystems! To create/check/debug Reiser4 filesystems, you'll need the [[Reiser4progs|reiser4progs]]. = ReiserFS = Since <tt>reiserfs</tt> is in mainline, just enable the following options in your kernel <tt>.config</tt>: CONFIG_REISERFS_FS CONFIG_REISERFS_FS_XATTR (optional) CONFIG_REISERFS_FS_POSIX_ACL (optional) CONFIG_REISERFS_FS_SECURITY (optional) Todays distributions should have this options enabled already, no need to build your kernel. However, not every Linux distribution ''supports'' <tt>reiserfs</tt>. 
But if you disregard your distribution's recommended settings, you'll probably know what you're doing anyway. * [[ReiserFS/kerneloptions | Compile-Time Options for Configuring ReiserFS]] To create/check/debug/resize ReiserFS filesystems, you'll need the [[Reiserfsprogs|reiserfsprogs]]. = Booting off a ReiserFS/Reiser4 partition = Booting ''off a ReiserFS/Reiser4 partition'': what we mean here is that the kernel (usually a file in <tt>/boot</tt>) is actually located on a ReiserFS/Reiser4 partition. If you have a separate partition for <tt>/boot</tt> (e.g. a (readonly-mounted) ext2 partition at the beginning of your disk) and your ''root-filesystem'' is on a ReiserFS/Reiser4 partition, you only need to make sure that ReiserFS/Reiser4 support is enabled in your kernel - but that's true for every filesystem and has nothing to to with the bootloader. As far as the writer is informed, booting off a ReiserFS partition is fully supported by [http://freshmeat.net/projects/lilo/ LiLo] or [http://www.gnu.org/software/grub/grub.html GRUB]. For Reiser4, LiLo is [http://wiki.archlinux.org/index.php/Reiser4FShowto#Packages known to work] out of the box. To install GRUB on a Reiser4 partition, [[Reiser4_Howto/GRUB|a few more steps are needed]]. = Links = * [http://web.archive.org/web/20061113154749/www.namesys.com/install_v4.html Getting started with Reiser4] (from archive.org, 2006-11-13) [[category:ReiserFS]] [[category:Reiser4]] 7594b87ff541e43077c8a7b6ebf94e218d246fce 2561 2481 2012-09-25T18:01:10Z Chris goe 2 URLs fixed = Reiser4 = As <tt>reiser4</tt> is not in mainline yet, we have to apply the right [[Reiser4_patchsets|patch]] to get this working: wget http://downloads.sourceforge.net/project/reiser4/reiser4-for-linux-3.x/reiser4-for-3.5.3.patch.gz cd /usr/src/linux gzip -dc ~/reiser4-for-3.5.3.patch.gz | patch -p1 Now enable <tt>CONFIG_REISER4_FS</tt> and build (and install) your kernel. Reboot, and don't forget to [[Bugs|report bugs]] if things go wrong. 
Even more important: don't forget your backups before messing with your filesystems! To create/check/debug Reiser4 filesystems, you'll need the [[Reiser4progs|reiser4progs]]. = ReiserFS = Since <tt>reiserfs</tt> is in mainline, just enable the following options in your kernel <tt>.config</tt>: CONFIG_REISERFS_FS CONFIG_REISERFS_FS_XATTR (optional) CONFIG_REISERFS_FS_POSIX_ACL (optional) CONFIG_REISERFS_FS_SECURITY (optional) Todays distributions should have this options enabled already, no need to build your kernel. However, not every Linux distribution ''supports'' <tt>reiserfs</tt>. But if you disregard your distribution's recommended settings, you'll probably know what you're doing anyway. * [[ReiserFS/kerneloptions | Compile-Time Options for Configuring ReiserFS]] To create/check/debug/resize ReiserFS filesystems, you'll need the [[Reiserfsprogs|reiserfsprogs]]. = Booting off a ReiserFS/Reiser4 partition = Booting ''off a ReiserFS/Reiser4 partition'': what we mean here is that the kernel (usually a file in <tt>/boot</tt>) is actually located on a ReiserFS/Reiser4 partition. If you have a separate partition for <tt>/boot</tt> (e.g. a (readonly-mounted) ext2 partition at the beginning of your disk) and your ''root-filesystem'' is on a ReiserFS/Reiser4 partition, you only need to make sure that ReiserFS/Reiser4 support is enabled in your kernel - but that's true for every filesystem and has nothing to to with the bootloader. As far as the writer is informed, booting off a ReiserFS partition is fully supported by [http://freshmeat.net/projects/lilo/ LiLo] or [http://www.gnu.org/software/grub/grub.html GRUB]. For Reiser4, LiLo is [http://wiki.archlinux.org/index.php/Reiser4FShowto#Packages known to work] out of the box. To install GRUB on a Reiser4 partition, [[Reiser4_Howto/GRUB|a few more steps are needed]]. 
= Links = * [http://web.archive.org/web/20061113154749/www.namesys.com/install_v4.html Getting started with Reiser4] (from archive.org, 2006-11-13) [[category:ReiserFS]] [[category:Reiser4]] eedbc8ac5f9ff1f140909be42760e30918c278be 2481 1638 2012-09-25T17:35:53Z Chris goe 2 -> = Reiser4 = As <tt>reiser4</tt> is not in mainline yet, we have to apply the right [[Reiser4_patchsets|patch]] to get this working: $ wget http://www.kernel.org/pub/linux/kernel/people/edward/reiser4/reiser4-for-2.6/reiser4-for-2.6.30.patch.bz2 $ wget http://www.kernel.org/pub/linux/kernel/people/edward/reiser4/reiser4-for-2.6/reiser4-for-2.6.30.patch.bz2.sign $ gpg --verify reiser4-for-2.6.30.patch.bz2.sign reiser4-for-2.6.30.patch.bz2 gpg: Signature made Tue 23 Jun 2009 12:43:47 AM CEST using DSA key ID 517D0F0E gpg: Good signature from "Linux Kernel Archives Verification Key <ftpadmin@kernel.org>" $ cd /usr/src/linux $ bzip2 -dc ~/reiser4-for-2.6.30.patch.bz2 | patch -p1 Now enable <tt>CONFIG_REISER4_FS</tt> and build (and install) your kernel. Reboot, and don't forget to [[Bugs|report bugs]] if things go wrong. Even more important: don't forget your backups before messing with your filesystems! To create/check/debug Reiser4 filesystems, you'll need the [[Reiser4progs|reiser4progs]]. = ReiserFS = Since <tt>reiserfs</tt> is in mainline, just enable the following options in your kernel <tt>.config</tt>: CONFIG_REISERFS_FS CONFIG_REISERFS_FS_XATTR (optional) CONFIG_REISERFS_FS_POSIX_ACL (optional) CONFIG_REISERFS_FS_SECURITY (optional) Todays distributions should have this options enabled already, no need to build your kernel. However, not every Linux distribution ''supports'' <tt>reiserfs</tt>. But if you disregard your distribution's recommended settings, you'll probably know what you're doing anyway. * [[ReiserFS/kerneloptions | Compile-Time Options for Configuring ReiserFS]] To create/check/debug/resize ReiserFS filesystems, you'll need the [[Reiserfsprogs|reiserfsprogs]]. 
= Booting off a ReiserFS/Reiser4 partition = Booting ''off a ReiserFS/Reiser4 partition'': what we mean here is that the kernel (usually a file in <tt>/boot</tt>) is actually located on a ReiserFS/Reiser4 partition. If you have a separate partition for <tt>/boot</tt> (e.g. a (readonly-mounted) ext2 partition at the beginning of your disk) and your ''root-filesystem'' is on a ReiserFS/Reiser4 partition, you only need to make sure that ReiserFS/Reiser4 support is enabled in your kernel - but that's true for every filesystem and has nothing to to with the bootloader. As far as the writer is informed, booting off a ReiserFS partition is fully supported by [http://freshmeat.net/projects/lilo/ LiLo] or [http://www.gnu.org/software/grub/grub.html GRUB]. For Reiser4, LiLo is [http://wiki.archlinux.org/index.php/Reiser4FShowto#Packages known to work] out of the box. To install GRUB on a Reiser4 partition, [[Reiser4_Howto/GRUB|a few more steps are needed]]. = Links = * [http://web.archive.org/web/20061113154749/www.namesys.com/install_v4.html Getting started with Reiser4] (from archive.org, 2006-11-13) [[category:ReiserFS]] [[category:Reiser4]] af00ce2213326853eb397a1a0204a8ab89a52ba4 1638 1637 2009-10-22T03:12:04Z Chris goe 2 reiserfsprogs can be resized too === Reiser4 === As <tt>reiser4</tt> is not in mainline yet, we have to apply the right [[Reiser4_patchsets|patch]] to get this working: $ wget http://www.kernel.org/pub/linux/kernel/people/edward/reiser4/reiser4-for-2.6/reiser4-for-2.6.30.patch.bz2 $ wget http://www.kernel.org/pub/linux/kernel/people/edward/reiser4/reiser4-for-2.6/reiser4-for-2.6.30.patch.bz2.sign $ gpg --verify reiser4-for-2.6.30.patch.bz2.sign reiser4-for-2.6.30.patch.bz2 gpg: Signature made Tue 23 Jun 2009 12:43:47 AM CEST using DSA key ID 517D0F0E gpg: Good signature from "Linux Kernel Archives Verification Key <ftpadmin@kernel.org>" $ cd /usr/src/linux $ bzip2 -dc ~/reiser4-for-2.6.30.patch.bz2 | patch -p1 Now enable <tt>CONFIG_REISER4_FS</tt> and 
build (and install) your kernel. Reboot, and don't forget to [[Bugs|report bugs]] if things go wrong. Even more important: don't forget your backups before messing with your filesystems! To create/check/debug Reiser4 filesystems, you'll need the [[Reiser4progs|reiser4progs]]. === ReiserFS === Since <tt>reiserfs</tt> is in mainline, just enable the following options in your kernel <tt>.config</tt>: CONFIG_REISERFS_FS CONFIG_REISERFS_FS_XATTR (optional) CONFIG_REISERFS_FS_POSIX_ACL (optional) CONFIG_REISERFS_FS_SECURITY (optional) Todays distributions should have this options enabled already, no need to build your kernel. However, not every Linux distribution ''supports'' <tt>reiserfs</tt>. But if you disregard your distribution's recommended settings, you'll probably know what you're doing anyway. * [[ReiserFS/kerneloptions | Compile-Time Options for Configuring ReiserFS]] To create/check/debug/resize ReiserFS filesystems, you'll need the [[Reiserfsprogs|reiserfsprogs]]. === Booting off a ReiserFS/Reiser4 partition === Booting ''off a ReiserFS/Reiser4 partition'': what we mean here is that the kernel (usually a file in <tt>/boot</tt>) is actually located on a ReiserFS/Reiser4 partition. If you have a separate partition for <tt>/boot</tt> (e.g. a (readonly-mounted) ext2 partition at the beginning of your disk) and your ''root-filesystem'' is on a ReiserFS/Reiser4 partition, you only need to make sure that ReiserFS/Reiser4 support is enabled in your kernel - but that's true for every filesystem and has nothing to to with the bootloader. As far as the writer is informed, booting off a ReiserFS partition is fully supported by [http://freshmeat.net/projects/lilo/ LiLo] or [http://www.gnu.org/software/grub/grub.html GRUB]. For Reiser4, LiLo is [http://wiki.archlinux.org/index.php/Reiser4FShowto#Packages known to work] out of the box. To install GRUB on a Reiser4 partition, [[Reiser4_Howto/GRUB|a few more steps are needed]]. 
=== Links === * [http://web.archive.org/web/20061113154749/www.namesys.com/install_v4.html Getting started with Reiser4] (from archive.org, 2006-11-13) [[category:ReiserFS]] [[category:Reiser4]] 813ebcedffbafca567fd0b59c838b5519997bd8c 1637 1636 2009-10-22T03:10:46Z Chris goe 2 Reiserfsprogs, Reiser4progs added === Reiser4 === As <tt>reiser4</tt> is not in mainline yet, we have to apply the right [[Reiser4_patchsets|patch]] to get this working: $ wget http://www.kernel.org/pub/linux/kernel/people/edward/reiser4/reiser4-for-2.6/reiser4-for-2.6.30.patch.bz2 $ wget http://www.kernel.org/pub/linux/kernel/people/edward/reiser4/reiser4-for-2.6/reiser4-for-2.6.30.patch.bz2.sign $ gpg --verify reiser4-for-2.6.30.patch.bz2.sign reiser4-for-2.6.30.patch.bz2 gpg: Signature made Tue 23 Jun 2009 12:43:47 AM CEST using DSA key ID 517D0F0E gpg: Good signature from "Linux Kernel Archives Verification Key <ftpadmin@kernel.org>" $ cd /usr/src/linux $ bzip2 -dc ~/reiser4-for-2.6.30.patch.bz2 | patch -p1 Now enable <tt>CONFIG_REISER4_FS</tt> and build (and install) your kernel. Reboot, and don't forget to [[Bugs|report bugs]] if things go wrong. Even more important: don't forget your backups before messing with your filesystems! To create/check/debug Reiser4 filesystems, you'll need the [[Reiser4progs|reiser4progs]]. === ReiserFS === Since <tt>reiserfs</tt> is in mainline, just enable the following options in your kernel <tt>.config</tt>: CONFIG_REISERFS_FS CONFIG_REISERFS_FS_XATTR (optional) CONFIG_REISERFS_FS_POSIX_ACL (optional) CONFIG_REISERFS_FS_SECURITY (optional) Todays distributions should have this options enabled already, no need to build your kernel. However, not every Linux distribution ''supports'' <tt>reiserfs</tt>. But if you disregard your distribution's recommended settings, you'll probably know what you're doing anyway. 
* [[ReiserFS/kerneloptions | Compile-Time Options for Configuring ReiserFS]] To create/check/debug ReiserFS filesystems, you'll need the [[Reiserfsprogs|reiserfsprogs]]. === Booting off a ReiserFS/Reiser4 partition === Booting ''off a ReiserFS/Reiser4 partition'': what we mean here is that the kernel (usually a file in <tt>/boot</tt>) is actually located on a ReiserFS/Reiser4 partition. If you have a separate partition for <tt>/boot</tt> (e.g. a (readonly-mounted) ext2 partition at the beginning of your disk) and your ''root-filesystem'' is on a ReiserFS/Reiser4 partition, you only need to make sure that ReiserFS/Reiser4 support is enabled in your kernel - but that's true for every filesystem and has nothing to to with the bootloader. As far as the writer is informed, booting off a ReiserFS partition is fully supported by [http://freshmeat.net/projects/lilo/ LiLo] or [http://www.gnu.org/software/grub/grub.html GRUB]. For Reiser4, LiLo is [http://wiki.archlinux.org/index.php/Reiser4FShowto#Packages known to work] out of the box. To install GRUB on a Reiser4 partition, [[Reiser4_Howto/GRUB|a few more steps are needed]]. 
=== Links === * [http://web.archive.org/web/20061113154749/www.namesys.com/install_v4.html Getting started with Reiser4] (from archive.org, 2006-11-13) [[category:ReiserFS]] [[category:Reiser4]] b024fdaeefaf50847c2050c23f934d38d7665c5c 1636 1635 2009-10-22T03:05:23Z Chris goe 2 archive.org link added === Reiser4 === As <tt>reiser4</tt> is not in mainline yet, we have to apply the right [[Reiser4_patchsets|patch]] to get this working: $ wget http://www.kernel.org/pub/linux/kernel/people/edward/reiser4/reiser4-for-2.6/reiser4-for-2.6.30.patch.bz2 $ wget http://www.kernel.org/pub/linux/kernel/people/edward/reiser4/reiser4-for-2.6/reiser4-for-2.6.30.patch.bz2.sign $ gpg --verify reiser4-for-2.6.30.patch.bz2.sign reiser4-for-2.6.30.patch.bz2 gpg: Signature made Tue 23 Jun 2009 12:43:47 AM CEST using DSA key ID 517D0F0E gpg: Good signature from "Linux Kernel Archives Verification Key <ftpadmin@kernel.org>" $ cd /usr/src/linux $ bzip2 -dc ~/reiser4-for-2.6.30.patch.bz2 | patch -p1 Now enable <tt>CONFIG_REISER4_FS</tt> and build (and install) your kernel. Reboot, and don't forget to [[Bugs|report bugs]] if things go wrong. Even more important: don't forget your backups before messing with your filesystems! === ReiserFS === Since <tt>reiserfs</tt> is in mainline, just enable the following options in your kernel <tt>.config</tt>: CONFIG_REISERFS_FS CONFIG_REISERFS_FS_XATTR (optional) CONFIG_REISERFS_FS_POSIX_ACL (optional) CONFIG_REISERFS_FS_SECURITY (optional) Todays distributions should have this options enabled already, no need to build your kernel. However, not every Linux distribution ''supports'' <tt>reiserfs</tt>. But if you disregard your distribution's recommended settings, you'll probably know what you're doing anyway. 
* [[ReiserFS/kerneloptions | Compile-Time Options for Configuring ReiserFS]] === Booting off a ReiserFS/Reiser4 partition === Booting ''off a ReiserFS/Reiser4 partition'': what we mean here is that the kernel (usually a file in <tt>/boot</tt>) is actually located on a ReiserFS/Reiser4 partition. If you have a separate partition for <tt>/boot</tt> (e.g. a (readonly-mounted) ext2 partition at the beginning of your disk) and your ''root-filesystem'' is on a ReiserFS/Reiser4 partition, you only need to make sure that ReiserFS/Reiser4 support is enabled in your kernel - but that's true for every filesystem and has nothing to to with the bootloader. As far as the writer is informed, booting off a ReiserFS partition is fully supported by [http://freshmeat.net/projects/lilo/ LiLo] or [http://www.gnu.org/software/grub/grub.html GRUB]. For Reiser4, LiLo is [http://wiki.archlinux.org/index.php/Reiser4FShowto#Packages known to work] out of the box. To install GRUB on a Reiser4 partition, [[Reiser4_Howto/GRUB|a few more steps are needed]]. 
=== Links === * [http://web.archive.org/web/20061113154749/www.namesys.com/install_v4.html Getting started with Reiser4] (from archive.org, 2006-11-13) [[category:ReiserFS]] [[category:Reiser4]] 67824a70e9179beacb7297e1a09ad224f8a6f7c8 1635 1566 2009-10-09T03:36:45Z Chris goe 2 be more clear about /boot === Reiser4 === As <tt>reiser4</tt> is not in mainline yet, we have to apply the right [[Reiser4_patchsets|patch]] to get this working: $ wget http://www.kernel.org/pub/linux/kernel/people/edward/reiser4/reiser4-for-2.6/reiser4-for-2.6.30.patch.bz2 $ wget http://www.kernel.org/pub/linux/kernel/people/edward/reiser4/reiser4-for-2.6/reiser4-for-2.6.30.patch.bz2.sign $ gpg --verify reiser4-for-2.6.30.patch.bz2.sign reiser4-for-2.6.30.patch.bz2 gpg: Signature made Tue 23 Jun 2009 12:43:47 AM CEST using DSA key ID 517D0F0E gpg: Good signature from "Linux Kernel Archives Verification Key <ftpadmin@kernel.org>" $ cd /usr/src/linux $ bzip2 -dc ~/reiser4-for-2.6.30.patch.bz2 | patch -p1 Now enable <tt>CONFIG_REISER4_FS</tt> and build (and install) your kernel. Reboot, and don't forget to [[Bugs|report bugs]] if things go wrong. Even more important: don't forget your backups before messing with your filesystems! === ReiserFS === Since <tt>reiserfs</tt> is in mainline, just enable the following options in your kernel <tt>.config</tt>: CONFIG_REISERFS_FS CONFIG_REISERFS_FS_XATTR (optional) CONFIG_REISERFS_FS_POSIX_ACL (optional) CONFIG_REISERFS_FS_SECURITY (optional) Todays distributions should have this options enabled already, no need to build your kernel. However, not every Linux distribution ''supports'' <tt>reiserfs</tt>. But if you disregard your distribution's recommended settings, you'll probably know what you're doing anyway. 
* [[ReiserFS/kerneloptions | Compile-Time Options for Configuring ReiserFS]] === Booting off a ReiserFS/Reiser4 partition === Booting ''off a ReiserFS/Reiser4 partition'': what we mean here is that the kernel (usually a file in <tt>/boot</tt>) is actually located on a ReiserFS/Reiser4 partition. If you have a separate partition for <tt>/boot</tt> (e.g. a (readonly-mounted) ext2 partition at the beginning of your disk) and your ''root-filesystem'' is on a ReiserFS/Reiser4 partition, you only need to make sure that ReiserFS/Reiser4 support is enabled in your kernel - but that's true for every filesystem and has nothing to to with the bootloader. As far as the writer is informed, booting off a ReiserFS partition is fully supported by [http://freshmeat.net/projects/lilo/ LiLo] or [http://www.gnu.org/software/grub/grub.html GRUB]. For Reiser4, LiLo is [http://wiki.archlinux.org/index.php/Reiser4FShowto#Packages known to work] out of the box. To install GRUB on a Reiser4 partition, [[Reiser4_Howto/GRUB|a few more steps are needed]]. [[category:ReiserFS]] [[category:Reiser4]] e7445cffd12f4f7dfc823c92721ee814088a92af 1566 1541 2009-07-03T19:54:00Z Chris goe 2 some notes what we mean here. === Reiser4 === As <tt>reiser4</tt> is not in mainline yet, we have to apply the right [[Reiser4_patchsets|patch]] to get this working: $ wget http://www.kernel.org/pub/linux/kernel/people/edward/reiser4/reiser4-for-2.6/reiser4-for-2.6.30.patch.bz2 $ wget http://www.kernel.org/pub/linux/kernel/people/edward/reiser4/reiser4-for-2.6/reiser4-for-2.6.30.patch.bz2.sign $ gpg --verify reiser4-for-2.6.30.patch.bz2.sign reiser4-for-2.6.30.patch.bz2 gpg: Signature made Tue 23 Jun 2009 12:43:47 AM CEST using DSA key ID 517D0F0E gpg: Good signature from "Linux Kernel Archives Verification Key <ftpadmin@kernel.org>" $ cd /usr/src/linux $ bzip2 -dc ~/reiser4-for-2.6.30.patch.bz2 | patch -p1 Now enable <tt>CONFIG_REISER4_FS</tt> and build (and install) your kernel. 
Reboot, and don't forget to [[Bugs|report bugs]] if things go wrong. Even more important: don't forget your backups before messing with your filesystems! === ReiserFS === Since <tt>reiserfs</tt> is in mainline, just enable the following options in your kernel <tt>.config</tt>: CONFIG_REISERFS_FS CONFIG_REISERFS_FS_XATTR (optional) CONFIG_REISERFS_FS_POSIX_ACL (optional) CONFIG_REISERFS_FS_SECURITY (optional) Todays distributions should have this options enabled already, no need to build your kernel. However, not every Linux distribution ''supports'' <tt>reiserfs</tt>. But if you disregard your distribution's recommended settings, you'll probably know what you're doing anyway. * [[ReiserFS/kerneloptions | Compile-Time Options for Configuring ReiserFS]] === Booting off a ReiserFS/Reiser4 partition === Booting ''off a ReiserFS/Reiser4 partition'': what we mean here is that the kernel (usually a file in <tt>/boot</tt>) is actually located on a ReiserFS/Reiser4 partition. If you have a separate partition (e.g. a (readonly-mounted) ext2 partition at the beginning of your disk) and your ''root-filesystem'' is on a ReiserFS/Reiser4 partition, you only need to make sure that ReiserFS/Reiser4 support is enabled in your kernel - but that's true for every filesystem and has nothing to to with the bootloader. As far as the writer is informed, booting off a ReiserFS partition is fully supported by [http://freshmeat.net/projects/lilo/ LiLo] or [http://www.gnu.org/software/grub/grub.html GRUB]. For Reiser4, LiLo is [http://wiki.archlinux.org/index.php/Reiser4FShowto#Packages known to work] out of the box. To install GRUB on a Reiser4 partition, [[Reiser4_Howto/GRUB|a few more steps are needed]]. 
[[category:ReiserFS]] [[category:Reiser4]] f5c60dfe9c61a61533d73374a73dbec77d4915ac 1541 1374 2009-07-02T18:45:58Z Chris goe 2 Reiser4_Howto/GRUB === Reiser4 === As <tt>reiser4</tt> is not in mainline yet, we have to apply the right [[Reiser4_patchsets|patch]] to get this working: $ wget http://www.kernel.org/pub/linux/kernel/people/edward/reiser4/reiser4-for-2.6/reiser4-for-2.6.30.patch.bz2 $ wget http://www.kernel.org/pub/linux/kernel/people/edward/reiser4/reiser4-for-2.6/reiser4-for-2.6.30.patch.bz2.sign $ gpg --verify reiser4-for-2.6.30.patch.bz2.sign reiser4-for-2.6.30.patch.bz2 gpg: Signature made Tue 23 Jun 2009 12:43:47 AM CEST using DSA key ID 517D0F0E gpg: Good signature from "Linux Kernel Archives Verification Key <ftpadmin@kernel.org>" $ cd /usr/src/linux $ bzip2 -dc ~/reiser4-for-2.6.30.patch.bz2 | patch -p1 Now enable <tt>CONFIG_REISER4_FS</tt> and build (and install) your kernel. Reboot, and don't forget to [[Bugs|report bugs]] if things go wrong. Even more important: don't forget your backups before messing with your filesystems! === ReiserFS === Since <tt>reiserfs</tt> is in mainline, just enable the following options in your kernel <tt>.config</tt>: CONFIG_REISERFS_FS CONFIG_REISERFS_FS_XATTR (optional) CONFIG_REISERFS_FS_POSIX_ACL (optional) CONFIG_REISERFS_FS_SECURITY (optional) Todays distributions should have this options enabled already, no need to build your kernel. However, not every Linux distribution ''supports'' <tt>reiserfs</tt>. But if you disregard your distribution's recommended settings, you'll probably know what you're doing anyway. * [[ReiserFS/kerneloptions | Compile-Time Options for Configuring ReiserFS]] === Booting off a ReiserFS/Reiser4 partition === As far as the reader is informed, booting off a ReiserFS partition is fully supported by [http://freshmeat.net/projects/lilo/ LiLo] or [http://www.gnu.org/software/grub/grub.html GRUB]. 
For Reiser4, LiLo is [http://wiki.archlinux.org/index.php/Reiser4FShowto#Packages known to work] out of the box. To use GRUB with a Reiser4 partition, [[Reiser4_Howto/GRUB|a few more steps are needed]]. [[category:ReiserFS]] [[category:Reiser4]] 8ac4825e9cc15dbc612c1b943ce2d7eade98e22e 1374 1322 2009-06-25T09:49:15Z Chris goe 2 link wikified === Reiser4 === As <tt>reiser4</tt> is not in mainline yet, we have to apply the right [[Reiser4_patchsets|patch]] to get this working: $ wget http://www.kernel.org/pub/linux/kernel/people/edward/reiser4/reiser4-for-2.6/reiser4-for-2.6.30.patch.bz2 $ wget http://www.kernel.org/pub/linux/kernel/people/edward/reiser4/reiser4-for-2.6/reiser4-for-2.6.30.patch.bz2.sign $ gpg --verify reiser4-for-2.6.30.patch.bz2.sign reiser4-for-2.6.30.patch.bz2 gpg: Signature made Tue 23 Jun 2009 12:43:47 AM CEST using DSA key ID 517D0F0E gpg: Good signature from "Linux Kernel Archives Verification Key <ftpadmin@kernel.org>" $ cd /usr/src/linux $ bzip2 -dc ~/reiser4-for-2.6.30.patch.bz2 | patch -p1 Now enable <tt>CONFIG_REISER4_FS</tt> and build (and install) your kernel. Reboot, and don't forget to [[Bugs|report bugs]] if things go wrong. Even more important: don't forget your backups before messing with your filesystems! === ReiserFS === Since <tt>reiserfs</tt> is in mainline, just enable the following options in your kernel <tt>.config</tt>: CONFIG_REISERFS_FS CONFIG_REISERFS_FS_XATTR (optional) CONFIG_REISERFS_FS_POSIX_ACL (optional) CONFIG_REISERFS_FS_SECURITY (optional) Todays distributions should have this options enabled already, no need to build your kernel. However, not every Linux distribution ''supports'' <tt>reiserfs</tt>. But if you disregard your distribution's recommended settings, you'll probably know what you're doing anyway. 
* [[ReiserFS/kerneloptions | Compile-Time Options for Configuring ReiserFS]] [[category:ReiserFS]] [[category:Reiser4]] 1864a56ada79d1045faa395e0ec26c564ee846f9 1322 1316 2009-06-25T07:25:05Z Chris goe 2 /* ReiserFS */ === Reiser4 === As <tt>reiser4</tt> is not in mainline yet, we have to apply the right [http://reiser4.wiki.kernel.org/index.php/Reiser4_patchsets patch] to get this working: $ wget http://www.kernel.org/pub/linux/kernel/people/edward/reiser4/reiser4-for-2.6/reiser4-for-2.6.30.patch.bz2 $ wget http://www.kernel.org/pub/linux/kernel/people/edward/reiser4/reiser4-for-2.6/reiser4-for-2.6.30.patch.bz2.sign $ gpg --verify reiser4-for-2.6.30.patch.bz2.sign reiser4-for-2.6.30.patch.bz2 gpg: Signature made Tue 23 Jun 2009 12:43:47 AM CEST using DSA key ID 517D0F0E gpg: Good signature from "Linux Kernel Archives Verification Key <ftpadmin@kernel.org>" $ cd /usr/src/linux $ bzip2 -dc ~/reiser4-for-2.6.30.patch.bz2 | patch -p1 Now enable <tt>CONFIG_REISER4_FS</tt> and build (and install) your kernel. Reboot, and don't forget to [[Bugs|report bugs]] if things go wrong. Even more important: don't forget your backups before messing with your filesystems! === ReiserFS === Since <tt>reiserfs</tt> is in mainline, just enable the following options in your kernel <tt>.config</tt>: CONFIG_REISERFS_FS CONFIG_REISERFS_FS_XATTR (optional) CONFIG_REISERFS_FS_POSIX_ACL (optional) CONFIG_REISERFS_FS_SECURITY (optional) Todays distributions should have this options enabled already, no need to build your kernel. However, not every Linux distribution ''supports'' <tt>reiserfs</tt>. But if you disregard your distribution's recommended settings, you'll probably know what you're doing anyway. 
* [[ReiserFS/kerneloptions | Compile-Time Options for Configuring ReiserFS]] [[category:ReiserFS]] [[category:Reiser4]] 7fe859079068ac50ad380bb14fb81aeb4c70b241 1316 2009-06-25T07:17:09Z Chris goe 2 Created page with '=== Reiser4 === As <tt>reiser4</tt> is not in mainline yet, we have to apply the right [http://reiser4.wiki.kernel.org/index.php/Reiser4_patchsets patch] to get this working: ...' === Reiser4 === As <tt>reiser4</tt> is not in mainline yet, we have to apply the right [http://reiser4.wiki.kernel.org/index.php/Reiser4_patchsets patch] to get this working: $ wget http://www.kernel.org/pub/linux/kernel/people/edward/reiser4/reiser4-for-2.6/reiser4-for-2.6.30.patch.bz2 $ wget http://www.kernel.org/pub/linux/kernel/people/edward/reiser4/reiser4-for-2.6/reiser4-for-2.6.30.patch.bz2.sign $ gpg --verify reiser4-for-2.6.30.patch.bz2.sign reiser4-for-2.6.30.patch.bz2 gpg: Signature made Tue 23 Jun 2009 12:43:47 AM CEST using DSA key ID 517D0F0E gpg: Good signature from "Linux Kernel Archives Verification Key <ftpadmin@kernel.org>" $ cd /usr/src/linux $ bzip2 -dc ~/reiser4-for-2.6.30.patch.bz2 | patch -p1 Now enable <tt>CONFIG_REISER4_FS</tt> and build (and install) your kernel. Reboot, and don't forget to [[Bugs|report bugs]] if things go wrong. Even more important: don't forget your backups before messing with your filesystems! === ReiserFS === Since <tt>reiserfs</tt> is in mainline, just enable the following options in your kernel <tt>.config</tt>: CONFIG_REISERFS_FS CONFIG_REISERFS_FS_XATTR (optional) CONFIG_REISERFS_FS_POSIX_ACL (optional) CONFIG_REISERFS_FS_SECURITY (optional) Todays distributions should have this options enabled already, no need to build your kernel. However, not every Linux distribution ''supports'' <tt>reiserfs</tt>. But if you disregard your distribution's recommended settings, you'll probably know what you're doing anyway. 
[[category:ReiserFS]] [[category:Reiser4]] 57ba8213e1b8f5b9de45f55529ad0d88ca7bf712 Reiser4 Howto/GRUB 0 58 1617 1616 2009-07-21T06:37:40Z Chris goe 2 s/much/many <font color=red>NOTE: this is currently WIP '''and NOT TESTED AT ALL'''</font> === GRUB === Before building GRUB with [[Reiser4]] support, we have to [[Reiser4progs|build and install (<tt>libaal</tt> and) <tt>reiser4progs</tt>]]. Once we have done that, we can build GRUB. Unfortunately, compilation was not that easy; see the [[{{TALKPAGENAME}}|talk page for details]]. In short: * compilation against <tt>reiser4progs-1.0.7</tt> fails * add <tt>-fno-stack-protector</tt> to disable [https://wiki.ubuntu.com/GccSsp SSP], since this seems to be the default for Ubuntu. Well, here is the whole procedure again; you may have to adjust the pathnames used here: $ wget http://www.kernel.org/pub/linux/utils/fs/reiser4/libaal/libaal-1.0.5.tar.bz2 $ wget http://www.kernel.org/pub/linux/utils/fs/reiser4/libaal/libaal-1.0.5.tar.bz2.sign $ gpg --recv-keys [http://kernel.org/signature.html 517D0F0E] $ gpg --verify libaal-1.0.5.tar.bz2.sign libaal-1.0.5.tar.bz2 $ tar -xjf libaal-1.0.5.tar.bz2 $ cd libaal-1.0.5 $ ./configure --prefix=/opt/libaal $ make && sudo make install $ wget http://www.kernel.org/pub/linux/utils/fs/reiser4/reiser4progs/reiser4progs-1.0.6.tar.gz $ wget http://www.kernel.org/pub/linux/utils/fs/reiser4/reiser4progs/reiser4progs-1.0.6.tar.gz.sign $ gpg --verify reiser4progs-1.0.6.tar.gz.sign reiser4progs-1.0.6.tar.gz $ tar -xzf reiser4progs-1.0.6.tar.gz $ cd reiser4progs-1.0.6 $ sed '999 s/^#elif/#else/' -i plugin/node/node40/node40.c $ CFLAGS="<font color="red">-fno-stack-protector</font> -I/opt/libaal/include" LDFLAGS="-L/opt/libaal/lib" \ ./configure --prefix=/opt/reiser4progs-1.0.6 $ make && sudo make install $ wget http://alpha.gnu.org/gnu/grub/grub-0.97.tar.gz $ wget http://alpha.gnu.org/gnu/grub/grub-0.97.tar.gz.sig $ gpg --recv-keys
[http://savannah.gnu.org/project/memberlist-gpgkeys.php?group=grub FE06BDEF] $ gpg --verify grub-0.97.tar.gz.sig grub-0.97.tar.gz $ tar -xzf grub-0.97.tar.gz $ cd grub-0.97 $ patch -p1 < ../[[Media:Grub-0.97-libaal-1.0.5-reiser4progs-1.0.5.patch.txt|grub-0.97-libaal-1.0.5-reiser4progs-1.0.5.patch]] $ CFLAGS="<font color="red">-fno-stack-protector</font> -I/opt/reiser4progs-1.0.6/include" \ CPPFLAGS="-I/opt/libaal/include -I/opt/reiser4progs-1.0.6/include" \ LDFLAGS="-L/opt/libaal/lib -L/opt/reiser4progs-1.0.6/lib" \ ./configure --prefix=/opt/grub-r4 $ make && sudo make install Strangely enough, the resulting GRUB binary seems to be linked to <tt>libaal</tt>, twice: $ ldd /opt/grub-r4/sbin/grub | grep libaal libaal-minimal.so.0 => not found libaal-minimal.so.0 => /opt/libaal/lib/libaal-minimal.so.0 (0xb7e69000) $ LD_LIBRARY_PATH=/opt/libaal/lib ldd /opt/grub-r4/sbin/grub | grep libaal libaal-minimal.so.0 => /opt/libaal/lib/libaal-minimal.so.0 (0xb807f000) Maybe I've added too many <tt>*FLAGS</tt> during <tt>./configure</tt>...
=== GRUB v2 === * 2009-06-17 - Reiser4 support for GRUB 2 is still on the [http://grub.enbug.org/TodoList TODO list] * [http://lists.gnu.org/archive/html/grub-devel/2008-02/msg00263.html 2008-02-09] - [http://www.linkedin.com/pub/yuriy-umanets/6/49a/856 Yuriy Umanets] asks if anybody is interested in Reiser4 support for GRUB 2; nobody replies :-\ === Misc === * [[file:Grub-0.97-libaal-1.0.5-reiser4progs-1.0.5.patch.txt]] (MD5: 423d04e95c4c2d90b840f67e8a3a5024) * [http://bugs.gentoo.org/46410 Gentoo #46410] - reiser4 support for grub * [http://wiki.archlinux.org/index.php/Reiser4FShowto Reiser4FShowto in ArchLinux] [[category:Reiser4]] 1ae31c0f0ca36dc933264710591b3ea7b37bb815
9c8dca31388519e3a4f1e13a45c0d0c859718cab Reiser4 Hybrid transaction model 0 1109 4285 2017-07-21T23:46:05Z Edward 4 Added Reiser4 Hybrid Transaction Model page The Reiser4 transaction model is a high-level block allocator. The user can specify (by mount option) any of the 3 currently existing transaction models. The first and second ones implement the classic journaling and write-anywhere (copy-on-write) transaction models, which are hard-coded in various other file systems. Here we describe the unique Hybrid Transaction Model suggested by Joshua Macdonald and Hans Reiser in 2001. = Relocation Decisions (relocate or overwrite) = In this transaction model a part of an atom's blocks is scheduled to be relocated (RELOCATE_SET), and another part to be overwritten (OVERWRITE_SET). All system blocks (super-block, bitmap blocks, etc.), for obvious reasons, always fall into the OVERWRITE_SET. All new nodes (which do not have block numbers yet) always fall into the RELOCATE_SET.
For other blocks of the atom, the relocation decision is based on statistics accumulated by a special scanning procedure. A positive decision (relocate) is made if enough (FLUSH_RELOCATE_THRESHOLD) nodes were found for flushing (see the comments and source code in flush.c for details). = Block allocation policy for RELOCATE_SET = To allocate blocks for some set of nodes means to arrange a mapping from that set to the space of disk addresses. Let m be a mapping between linearly ordered sets A and B (m: A -> B). '''Definition 1.''' Mapping m is said to keep the linear order on A iff the following implication is true: for all a, b in A: (a < b) => m(a) < m(b) '''Comment.''' "<" in the first part of the implication means the linear order on A. "<" in the second part of the implication means the linear order on B. The important property of the Hybrid Transaction Model is that this model keeps parent-first order on the RELOCATE_SET considered as a tree. '''Definition 2.''' Parent-first order on a tree is the linear order on tree nodes determined by the following recursive procedure: void parent_first(node) { print_node(node); if (node->level > leaf) { for (i = 0; i < num_children; i += 1) parent_first(node->child[i]); } } Specifically, (node1 < node2) in terms of parent-first order means that the procedure above prints node1 '''before''' node2. '''Comment.''' The definition above assumes that all tree nodes on the same level are ordered in some linear manner (by the array child[i]). In our case the mentioned order is determined by their natural enumeration "from left to right" in the tree. So, the Reiser4 hybrid transaction model keeps parent-first order on the RELOCATE_SET. Specifically, it means that the following implication is true: (node1 < node2) => (block_nr(node1) < block_nr(node2)) "<" in the second part of the implication means the usual linear order on the set of disk addresses (block numbers).
This is achieved by finding and setting a "preceder" for the RELOCATE_SET at the very beginning of its relocation. The preceder is the maximal block number (disk address) with the following properties:
# it is marked "busy" in the space map;
# it is smaller than all old block numbers of the given RELOCATE_SET.

Further, during allocation of new block numbers for the RELOCATE_SET the variable containing the preceder gets updated and is passed as a "hint" to the low-level space allocator.

[[category:Reiser4]]
e97c2b9ac99a1344e8ee22c9d3147f60bcacf231

Reiser4 Mirrors and Failover 0 1099 4215 4213 2017-02-20T15:32:35Z Edward 4

= Logical Volumes =

Reiser4 will support logical (compound) volumes. For now we have implemented the simplest ones: mirrors. As a supplement to the existing checksums, they will provide failover, an important feature which reduces the number of cases when your volume needs to be repaired by fsck.

A Reiser4 subvolume is a component of a logical volume. A subvolume is always associated with a physical or logical (built by RAID, LVM, etc. means) block device. Every subvolume possesses:
* a volume ID;
* a subvolume ID;
* a mirror ID;
* a number of replicas.

The mirror ID is a serial number from 0 to 65535. A subvolume with mirror ID 0 has a special name: the original. The other ones are called replicas. We say "original A has a replica B" (or "B replicates A", which is the same) iff A and B possess the same subvolume ID. An original with all its replicas is called "mirrors".

For subvolumes we have introduced a special disk format plugin, "format41". In accordance with the Reiser4 development model this means forward incompatibility. We have introduced it intentionally, for protection: for clear reasons users must not be able to RW-mount separate replicas (without their originals). The multi-device extension is backward compatible: all volumes of the old format (format40) are supported as logical volumes composed of only one (original) subvolume.
= Registration and activation of subvolumes =

For now, every Reiser4 logical volume has only one original subvolume. The number of replicas can be 0 or more.

A logical volume can be mounted by the usual mount command. Simply specify any of its subvolumes (the original, or one of its replicas). The only condition is that the original and all its replicas should be registered in the system. If the original or any of its replicas is not registered, then mount will fail with a respective kernel message.

Currently there is no tool to register a specified subvolume (TBD). However, the mount command always tries to register the specified device. The registration policy is "sticky": your device won't be unregistered after umount, nor after a failed mount. (You will be able to unregister it forcibly with a special tool, TBD.) The registration procedure reads the master super-block of the subvolume and puts the subvolume header on a special list of registered subvolumes.

Mounting a logical volume activates all its registered components. The activation procedure reads the format super-block of the subvolume and performs other actions, such as initialization of space maps, transaction replay, etc., as specified by the ->init_format() method of the respective disk format plugin. A pointer to an activated subvolume is placed in a special table of active subvolumes.

= Mirror operations =

So mirrors (an original subvolume with all its replicas) actually represent RAID-1 on the filesystem level.

COMMENT. We aren't engaged in the marketing fraud of collecting all features of the block layer's RAID and LVM. Reiser4 mirrors implement a failover that the block layer's RAID-1 is not able to provide.

It will be possible to "upgrade" or "downgrade" a reiser4 array of mirrors by attaching/detaching one or more replicas online, using special user-space tools (mirror.reiser4, TBD).
With those tools it will also be possible to swap the original with any of its replicas, or to make a new original from any replica if the old one is lost for some reason.

Fsck will refuse to check/repair a replica; fsck is supposed to work only with original subvolumes. After mounting an fsck-ed original, the kernel will automatically run a special on-line background procedure (scrub) in order to synchronize the repaired original with all its replicas. Once in a while the user has to check his array of mirrors by running scrub in the background mode.

WARNING: Bear in mind once and forever: a replica is not a backup!

= Technical Notes =

1. The Reiser4 Transaction Design document transfers to logical volumes without any modifications, but with a small addition: an atom is now composed of per-subvolume components.

2. By design, all mirrors differ only in their mirror IDs, which are stored in the master super-block. The format super-blocks of mirrors are identical. This approach provides the best performance and full parallelism in issuing IO requests to the mirrors. The minus is a small compromise in design, according to which the master super-block doesn't participate in transactions. It means that mirror operations such as upgrading/downgrading/swapping cannot spawn usual transactions, which could be committed and (re)played using the existing transaction manager. That is, mirror operations won't survive a system crash. If a system crash happens during a mirror operation, then the mirror structure should be checked/fixed offline by the mirror tools (the kernel will refuse to mount an unchecked array of mirrors). Fortunately, all critical mirror operations issue a small number of IO requests, so the probability of their interruption is close to zero.

3. We don't commit transactions on all mirrors, only on the original subvolume (this is the single functional difference between the original and its replicas). Transaction (re)play, of course, proceeds on all mirrors, using the wandering maps/blocks of the original subvolume.
= Failover =

Every time a block is loaded from disk to memory, Reiser4 verifies its checksum. If the checksum verification fails, then Reiser4 immediately re-issues the read IO request(s) against the replica device(s). Reiser4 doesn't need the scrub passes inherent to poorly designed file systems.

= How to test the feature =

Check out the "format41" branch of the upstream reiser4 and reiser4progs git repos at https://github.com/edward6 and build and install as usual.

Mirrors can be created with the mkfs.reiser4 option -m. If this option is specified, then the first listed device will be the original, and the other ones replicas. All the specified devices should have the same size in sectors; we'll lift that restriction later.

IMPORTANT: when creating mirrors, specify the node41 plugin (with checksum support). Otherwise, your mirrors will be no more useful than the block layer's RAID-1.

Register all your mirrors by trying to "mount" them one by one, in any order. If you have N mirrors (i.e. one original and N-1 replicas), then the first N-1 mount commands will fail. Of course, this is not too graceful, but it is a temporary solution. The N-th "attempt" should succeed. Have fun. Unmount as usual.

= Example =

Suppose we have 2 partitions, /dev/sda7 and /dev/sda8, of equal size. Let's create an array of 2 mirrors:

 mkfs.reiser4 -my -o node=node41 /dev/sda7 /dev/sda8

Take a look at the original subvolume:

 debugfs.reiser4 /dev/sda7

Take a look at the replica:

 debugfs.reiser4 /dev/sda8

Find the differences between the debugfs outputs. Register the original subvolume:

 mount /dev/sda7 /mnt
 mount: wrong fs type, bad option, bad superblock blablabla....
 dmesg
 reiser4[mount(20914)]: check_active_replicas (fs/reiser4/init_volume.c:268)[edward-1750]:
 WARNING: /dev/sda7 requires replicas, which are not registered.

Register the replica and mount the array:

 mount /dev/sda8 /mnt
 dmesg
 reiser4: registered subvolume (/dev/sda8)
 reiser4 (sda8): found disk format 4.0.1.
 reiser4 (/dev/sda7): using Hybrid Transaction Model.
Let's copy a file, /etc/services, to our array of mirrors:

 cp /etc/services /mnt/.

Unmount the array:

 umount /mnt

Find the root block; it goes first in the tree dump:

 debugfs.reiser4 -t /dev/sda7

In our case the root block has block number #79. Let's now take a look at how our failover works. The death-defying act: we erase the root block of the original subvolume:

 dd if=/dev/zero of=/dev/sda7 bs=4096 count=1 seek=79

We know that the mount procedure loads the root block. Let's try to mount our array with the corrupted root block:

 mount /dev/sda8 /mnt

Everything works. Take a look at the kernel messages:

 dmesg
 reiser4[mount(21224)]: __jload_gfp_failover[edward-1811]:
 WARNING: block 79 (/dev/sda7) looks corrupted.
 NOTICE: Loading from replica device /dev/sda8.

= TODO =

# Mirror tools (upgrade/downgrade/synchronize an array of mirrors, swap the original with a specified replica, convert a replica to an original subvolume, visualization of mirror arrays, etc.);
# Checksumming the format super-block;
# Issuing discard requests for replicas on SSD devices.

[[category:Reiser4]]
ffcc6317c2b2d94a2724dca6965758589c35e24b
It will be possible to "upgrade", or "downgrade" a reiser4 array of mirrors by attaching / detaching online one, or more replicas by special user-space tools (mirror.reiser4, TBD). Also by those tools it will be possible to swap original with any its replica, or make a new original from any replica, if the old one is lost for some reasons. Fsck will refuse to check/repir replica. Fsck is supposed to work only with original subvolumes. After mounting an fsck-ed original, kernel will automatically run a special on-line backgroud procedure (scrub) in order to synchronize the repaired original with all its replicas. Once in a while user has to check his array of mirrors by running scrub in the background mode. WARNING: Bear in mind once and forever: Replica is not a backup!!! = Technical Notes = 1. Reiser4 Transaction Design document is transferred to logical volumes without any modifications, but with a small addition. Atom is now composed of per-subvolume components. 2. By design all mirrors differ only in mirror-IDs which are stored in master super-block. Format super-blocks of mirrors are identical. This approach provides best performance and full parallelism in issuing IO requests for mirrors. The minus is a small compromise in design, according to which master super-block doesn't participate in transactions. It means that mirror operations on upgrading/degrading/ swapping can not spawn usual transactions, which can be committed and (re)played using existing transaction manager. That is, mirror operations won't survive a system crash. If a system crash happens during a mirror operation, then the mirror structure should be checked/fixed offline by the mirror tools (kernel will refuse to mount unchecked array of mirrors). Fortunately, all critical mirror operations issue small number of IO requests, so that probability of their interruption is close to zero. 3. 
We don't commit transactions on all mirrors, only on the original subvolume (this is the single functional difference of original and its replicas). Transaction (re)play, of course, is going on all mirrors using the wandering maps/blocks of the original subvolume. = Failover = Every time when a block is loaded from disk to memory, Reiser4 verifies its checksum. If checksum verification failed, then Reiser4 immediately re-issues read IO request(s) against replica device(s). = How to test the feature = Checkout branch "format41" of the upstream reiser4 and reiser4progs git repos on https://github.com/edward6 Build and install as usual. Mirrors can be created by mkfs.reiser4 option -m. If this option is specified, then the first listed device will be the original, other ones - replicas. All devices of an array should have the same size. Further we'll avoid that restriction. IMPORTANT: when creating mirrors specify node41 plugin (with checksum support). Otherwise, your mirrors won't be more useful than block layer's RAID0. Register all your mirrors, trying to "mount" them one-by-one in any order. If you have N mirrors (i.e. one original and N-1 replicas), then first N-1 mount commands will fail. Of course, it is not too graceful, but this is temporal solution. The N-th "attempt" should succeed. Have a fun. Unmount as usual. = Example = Suppose we have 2 partitions /dev/sda7 and /dev/sda8 of equal size. Let's create an array of 2 mirrors: mkfs.reiser4 -my -o node=node41 /dev/sda7 /dev/sda8 Take a look at original subvolume: debugfs.reiser4 /dev/sda7 Take a look at replica: debugfs.reiser4 /dev/sda8 Find differences Register the original subvolume mount /dev/sda7 /mnt mount: wrong fs type, bad option, bad superblock blablabla.... dmesg reiser4[mount(20914)]: check_active_replicas (fs/reiser4/init_volume.c:268)[edward-1750]: WARNING: /dev/sda7 requires replicas, which are not registered. 
Register the replica and mount the array: mount /dev/sda8 /mnt dmesg reiser4: registered subvolume (/dev/sda8) reiser4 (sda8): found disk format 4.0.1. reiser4 (/dev/sda7): using Hybrid Transaction Model. Let's copy a file /etc/services to our array of mirrors: cp /etc/services /mnt/. Unmount the array: umount /mnt Find a root block: it goes the first in the tree dump: debugfs.reiser4 -t /dev/sda7 In our case the root block has blocknumber #79 Let's now take a look on how our failover works. The death defying act: we erase the root block of the original subvolume: dd if=/dev/zero of=/dev/sda7 bs=4096 count=1 seek=79 We know that the mount procedure load the root block. Let's try to mount our array with the corrupted root block: mount /dev/sda8 /mnt Everything works.. Take a look at kernel messages: dmesg reiser4[mount(21224)]: __jload_gfp_failover[edward-1811]: WARNING: block 79 (/dev/sda7) looks corrupted. NOTICE: Loading from replica device /dev/sda8. = TODO = 1) Mirror tools (upgrade/downgrade/synchronize an array of mirrors, swap original and specified replica, convert replica to an original subvolume, visualization of mirror arrays, etc); 2) Checksumming format super-block; 3) Issuing discard requests for replicas on SSD devices. [[category:Reiser4]] 90f83ffa5faeda8a41750f6f895e687ccb79c184 4195 4193 2016-11-16T20:45:38Z Edward 4 = Logical Volumes = Reiser4 will support logical (compound) volumes. For now we have implemented the simplest ones - mirrors. As a supplement to existing checksums it will provide a failover - an important feature, which will reduce number of cases when your volume needs to be repaired by fsck. Reiser4 subvolume is a component of logical volume. Subvolume is always associated with a physical, or logical (built of RAID, LVM, etc means) block device. Every subvolume possesses: . volume ID; . subvolume ID; . mirror ID; . number of replicas. mirror ID is a serial number from 0 till 65535. 
Subvolume with mirror ID 0 has a special name - original. Other ones are called replicas. We use to say "original A has a replica B" (or "B replicates A", which is the same), iff A and B possess the same subvolume ID. Original with all its replicas are called "mirrors". For subvolumes we have introduced a special disk format plugin "format41". In accordance with Reiser4 development model it means forward incompatibility. We have introduced it intentionally, for protection. Indeed, for clear reasons users must not have possibility to RW-mount separate replicas (without originals). The multi-device extension is backward compatible: all volumes of the old format (format40) are supported as logical volumes composed of only one (original) subvolume. = Registration and activation of subvolumes = For now every Reiser4 logical volume has only one original subvolume. Number of replicas can be 0, or more. Logical volume can be mount by usual mount command. Simply specify any its subvolume (the original, or some its replica). The only condition is that original and all its replicas should be registered in the system. If original, or some its replica are not registered, then mount will fail with a respective kernel message. Currently there is no tool to register specified subvolume (TBD). However, mount command always tries to register the specified device. The registration policy is "sticky". It means that your device won't be unregistered after umount, as well as failed mount. (You will be able to unregister it mandatory by a special tool - TBD). Procedure of registration reads the master super-block of the subvolume and puts the subvolume header to a specilal list of registered subvolumes. Mounting a logical volume activates all its registered components. Procedure of activation reads format super-block of the subvolume, and performs other actions like initialization of space maps, transaction replay, etc. 
as specified by the method ->init_format() of respective disk format plugin. Pointer to an activated subvolume is placed to a special table of active subvolumes. = Mirror operations = So mirrors (an original subvolume with all its replicas) actually represent RAID-1 on the filesystem level. COMMENT. We aren't engaged in marketing fraud on collecting all features of the block layer's RAID and LVM. Reiser4 mirrors implement a failover, that block layers's RAID0 is not able to provide. It will be possible to "upgrade", or "downgrade" a reiser4 array of mirrors by attaching / detaching online one, or more replicas by special user-space tools (mirror.reiser4, TBD). Also by those tools it will be possible to swap original with any its replica, or make a new original from any replica, if the old one is lost for some reasons. Fsck will refuse to check/repir replica. Fsck is supposed to work only with original subvolumes. After mounting an fsck-ed original, kernel will automatically run a special on-line backgroud procedure (scrub) in order to synchronize the repaired original with all its replicas. Once in a while user has to check his array of mirrors by running scrub in the background mode. WARNING: Bear in mind once and forever: Replica is not a backup!!! = Technical Notes = 1. Reiser4 Transaction Design document is transferred to logical volumes without any modifications, but with a small addition. Atom is now composed of per-subvolume components. 2. By design all mirrors differ only in mirror-IDs which are stored in master super-block. Format super-blocks of mirrors are identical. This approach provides best performance and full parallelism in issuing IO requests for mirrors. The minus is a small compromise in design, according to which master super-block doesn't participate in transactions. It means that mirror operations on upgrading/degrading/ swapping can not spawn usual transactions, which can be committed and (re)played using existing transaction manager. 
That is, mirror operations won't survive a system crash. If a system crash happens during a mirror operation, then the mirror structure should be checked/fixed offline by the mirror tools (kernel will refuse to mount unchecked array of mirrors). Fortunately, all critical mirror operations issue small number of IO requests, so that probability of their interruption is close to zero. 3. We don't commit transactions on all mirrors, only on the original subvolume (this is the single functional difference of original and its replicas). Transaction (re)play, of course, is going on all mirrors using the wandering maps/blocks of the original subvolume. = Failover = Every time when a block is loaded from disk to memory, Reiser4 verifies its checksum. If checksum verification failed, then Reiser4 immediately re-issues read IO request(s) against replica device(s). = How to test the feature = Checkout branch "format41" of the upstream reiser4 and reiser4progs git repos on https://github.com/edward6 Build and install as usual. Mirrors can be created by mkfs.reiser4 option -m. If this option is specified, then the first listed device will be the original, other ones - replicas. All devices of an array should have the same size. Further we'll avoid that restriction. IMPORTANT: when creating mirrors specify node41 plugin (with checksum support). Otherwise, your mirrors won't be more useful than block layer's RAID0. Register all your mirrors, trying to "mount" them one-by-one in any order. If you have N mirrors (i.e. one original and N-1 replicas), then first N-1 mount commands will fail. Of course, it is not too graceful, but this is temporal solution. The N-th "attempt" should succeed. Have a fun. Unmount as usual. = Example = Suppose we have 2 partitions /dev/sda7 and /dev/sda8 of equal size. 
Let's create an array of 2 mirrors: mkfs.reiser4 -my -o node=node41 /dev/sda7 /dev/sda8 Take a look at original subvolume: debugfs.reiser4 /dev/sda7 Take a look at replica: debugfs.reiser4 /dev/sda8 Find differences Register the original subvolume mount /dev/sda7 /mnt mount: wrong fs type, bad option, bad superblock blablabla.... dmesg reiser4[mount(20914)]: check_active_replicas (fs/reiser4/init_volume.c:268)[edward-1750]: WARNING: /dev/sda7 requires replicas, which are not registered. Register the replica and mount the array: mount /dev/sda8 /mnt dmesg reiser4: registered subvolume (/dev/sda8) reiser4 (sda8): found disk format 4.0.1. reiser4 (/dev/sda7): using Hybrid Transaction Model. Let's copy a file /etc/services to our array of mirrors: cp /etc/services /mnt/. Unmount the array: umount /mnt Find a root block: it goes the first in the tree dump: debugfs.reiser4 -t /dev/sda7 In our case the root block has blocknumber #79 Let's now take a look on how our failover works. The death defying act: we erase the root block of the original subvolume: dd if=/dev/zero of=/dev/sda7 bs=4096 count=1 seek=79 We know that the mount procedure load the root block. Let's try to mount our array with the corrupted root block: mount /dev/sda8 /mnt Everything works.. Take a look at kernel messages: dmesg reiser4[mount(21224)]: __jload_gfp_failover[edward-1811]: WARNING: block 79 (/dev/sda7) looks corrupted. NOTICE: Loading from replica device /dev/sda8. = TODO = 1) Mirror tools (upgrade/downgrade/synchronize an array of mirrors, swap original and specified replica, convert replica to an original subvolume, visualization of mirror arrays, etc); 2) Checksumming format super-block; 3) Issuing discard requests for replicas on SSD devices. [[category:Reiser4]] 1f4d1602266fecd12f448ef09b6b8675b4399477 4193 4191 2016-11-16T15:52:09Z Edward 4 = Logical Volumes = Reiser4 will support logical (compound) volumes. For now we have implemented the simplest ones - mirrors. 
As a supplement to existing checksums it will provide a failover - an important feature, which will reduce number of cases when your volume needs to be repaired by fsck. Reiser4 subvolume is a component of logical volume. Subvolume is always associated with a physical, or logical (built of RAID, LVM, etc means) block device. Every subvolume possesses: . volume ID; . subvolume ID; . mirror ID; . number of replicas. mirror ID is a serial number from 0 till 65535. Subvolume with mirror ID 0 has a special name - original. Other ones are called replicas. We use to say "original A has a replica B" (or "B replicates A", which is the same), iff A and B possess the same subvolume ID. Original with all its replicas are called "mirrors". For subvolumes we have introduced a special disk format plugin "format41". In accordance with Reiser4 development model it means forward incompatibility. We have introduced it intentionally, for protection. Indeed, for clear reasons users must not have possibility to RW-mount separate replicas (without originals). The multi-device extension is backward compatible: all volumes of the old format (format40) are supported as logical volumes composed of only one (original) subvolume. = Registration and activation of subvolumes = For now every Reiser4 logical volume has only one original subvolume. Number of replicas can be 0, or more. Logical volume can be mount by usual mount command. Simply specify any its subvolume (the original, or some its replica). The only condition is that original and all its replicas should be registered in the system. If original, or some its replica are not registered, then mount will fail with a respective kernel message. Currently there is no tool to register specified subvolume (TBD). However, mount command always tries to register the specified device. The registration policy is "sticky". It means that your device won't be unregistered after umount, as well as failed mount. 
(You will be able to unregister it mandatory by a special tool - TBD). Procedure of registration reads the master super-block of the subvolume and puts the subvolume header to a specilal list of registered subvolumes. Mounting a logical volume activates all its registered components. Procedure of activation reads format super-block of the subvolume, and performs other actions like initialization of space maps, transaction replay, etc. as specified by the method ->init_format() of respective disk format plugin. Pointer to an activated subvolume is placed to a special table of active subvolumes. = Mirror operations = So mirrors (an original subvolume with all its replicas) actually represent RAID-1 on the filesystem level. COMMENT. We aren't engaged in marketing fraud on collecting all features of the block layer's RAID and LVM. Reiser4 mirrors implement a failover, that block layers's RAID0 is not able to provide. It will be possible to "upgrade", or "downgrade" a reiser4 array of mirrors by attaching / detaching online one, or more replicas by special user-space tools (mirror.reiser4, TBD). Also by those tools it will be possible to swap original with any its replica, or make a new original from any replica, if the old one is lost for some reasons. Fsck will refuse to check/repir replica. Fsck is supposed to work only with original subvolumes. After mounting an fsck-ed original, kernel will automatically run a special on-line backgroud procedure (scrub) in order to synchronize the repaired original with all its replicas. Once in a while user has to check his array of mirrors by running scrub in the background mode. WARNING: Bear in mind once and forever: Replica is not a backup!!! = Technical Notes = 1. Reiser4 Transaction Design document is transferred to logical volumes without any modifications, but with a small addition. Atom is now composed of per-subvolume components. 2. By design all mirrors differ only in mirror-IDs which are stored in master super-block. 
Format super-blocks of mirrors are identical. This approach provides best performance and full parallelism in issuing IO requests for mirrors. The minus is a small compromise in design, according to which master super-block doesn't participate in transactions. It means that mirror operations on upgrading/degrading/ swapping can not spawn usual transactions, which can be committed and (re)played using existing transaction manager. That is, mirror operations won't survive a system crash. If a system crash happens during a mirror operation, then the mirror structure should be checked/fixed offline by the mirror tools (kernel will refuse to mount unchecked array of mirrors). Fortunately, all critical mirror operations issue small number of IO requests, so that probability of their interruption is close to zero. 3. We don't commit transactions on all mirrors, only on the original subvolume (this is the single functional difference of original and its replicas). Transaction (re)play, of course, is going on all mirrors using the wandering maps/blocks of the original subvolume. = Failover = Every time when a block is loaded from disk to memory, Reiser4 verifies its checksum. If checksum verification failed, then Reiser4 immediately re-issues read IO request(s) against replica device(s). = How to test the feature = Checkout branch "format41" of the upstream reiser4 and reiser4progs git repos on https://github.com/edward6 Build and install as usual. Mirrors can be created by mkfs.reiser4 option -m. If this option is specified, then the first listed device will be the original, other ones - replicas. All devices of an array should have the same size. Further we'll avoid that restriction. IMPORTANT: when creating mirrors specify node41 plugin (with checksum support). Otherwise, your mirrors won't be more useful than block layer's RAID0. Register all your mirrors, trying to "mount" them one-by-one in any order. If you have N mirrors (i.e. 
one original and N-1 replicas), then first N-1 mount commands will fail. Of course, it is not too graceful, but this is temporal solution. The N-th "attempt" should succeed. Have a fun. Unmount as usual. = Example = Suppose we have 2 partitions /dev/sda7 and /dev/sda8 of equal size. Let's create an array of 2 mirrors: mkfs.reiser4 -my -o node=node41 /dev/sda7 /dev/sda8 Take a look at original subvolume: debugfs.reiser4 /dev/sda7 Take a look at replica: debugfs.reiser4 /dev/sda8 Find differences Register the original subvolume mount /dev/sda7 /mnt mount: wrong fs type, bad option, bad superblock blablabla.... dmesg reiser4[mount(20914)]: check_active_replicas (fs/reiser4/init_volume.c:268)[edward-1750]: WARNING: /dev/sda7 requires replicas, which are not registered. Register the replica and mount the array: mount /dev/sda8 /mnt dmesg reiser4: registered subvolume (/dev/sda8) reiser4 (sda8): found disk format 4.0.1. reiser4 (/dev/sda7): using Hybrid Transaction Model. Let's copy a file /etc/services to our array of mirrors: cp /etc/services /mnt/. Unmount the array: umount /mnt Find a root block: it goes the first in the tree dump: debugfs.reiser4 -t /dev/sda7 In our case the root block has blocknumber #79 Let's now take a look on how our failover works. The death defying act: we erase the root block of the original subvolume: dd if=/dev/zero of=/dev/sda7 bs=4096 count=1 seek=79 We know that the mount procedure load the root block. Let's try to mount our array with the corrupted root block: mount /dev/sda8 /mnt Everything works.. Take a look at kernel messages: dmesg reiser4[mount(21224)]: __jload_gfp_failover[edward-1811]: WARNING: block 79 (/dev/sda7) looks corrupted. NOTICE: Loading from replica device /dev/sda8. 
= TODO = 1) Mirror tools (upgrade/downgrade/synchronize an array of mirrors, swap original and specified replica, convert replica to an original subvolume, visualization of mirror arrays, etc); 2) Checksumming format super-block; 3) Issuing discard requests for replicas on SSD devices. e321dfa04b4a2824bcc050c4bb157b1a3eca40bb 4191 4189 2016-11-16T15:49:44Z Edward 4 = Logical Volumes = Reiser4 will support logical (compound) volumes. For now we have implemented the simplest ones - mirrors. As a supplement to existing checksums it will provide a failover - an important feature, which will reduce number of cases when your volume needs to be repaired by fsck. Reiser4 subvolume is a component of logical volume. Subvolume is always associated with a physical, or logical (built of RAID, LVM, etc means) block device. Every subvolume possesses: . volume ID; . subvolume ID; . mirror ID; . number of replicas. mirror ID is a serial number from 0 till 65535. Subvolume with mirror ID 0 has a special name - original. Other ones are called replicas. We use to say "original A has a replica B" (or "B replicates A", which is the same), iff A and B possess the same subvolume ID. Original with all its replicas are called "mirrors". For subvolumes we have introduced a special disk format plugin "format41". In accordance with Reiser4 development model it means forward incompatibility. We have introduced it intentionally, for protection. Indeed, for clear reasons users must not have possibility to RW-mount separate replicas (without originals). The multi-device extension is backward compatible: all volumes of the old format (format40) are supported as logical volumes composed of only one (original) subvolume. = Registration and activation of subvolumes = For now every Reiser4 logical volume has only one original subvolume. Number of replicas can be 0, or more. Logical volume can be mount by usual mount command. Simply specify any its subvolume (the original, or some its replica). 
The only condition is that original and all its replicas should be registered in the system. If original, or some its replica are not registered, then mount will fail with a respective kernel message. Currently there is no tool to register specified subvolume (TBD). However, mount command always tries to register the specified device. The registration policy is "sticky". It means that your device won't be unregistered after umount, as well as failed mount. (You will be able to unregister it mandatory by a special tool - TBD). Procedure of registration reads the master super-block of the subvolume and puts the subvolume header to a specilal list of registered subvolumes. Mounting a logical volume activates all its registered components. Procedure of activation reads format super-block of the subvolume, and performs other actions like initialization of space maps, transaction replay, etc. as specified by the method ->init_format() of respective disk format plugin. Pointer to an activated subvolume is placed to a special table of active subvolumes. = Mirror operations = So mirrors (an original subvolume with all its replicas) actually represent RAID-1 on the filesystem level. COMMENT. We aren't engaged in marketing fraud on collecting all features of the block layer's RAID and LVM. Reiser4 mirrors implement a failover, that block layers's RAID0 is not able to provide. It will be possible to "upgrade", or "downgrade" a reiser4 array of mirrors by attaching / detaching online one, or more replicas by special user-space tools (mirror.reiser4, TBD). Also by those tools it will be possible to swap original with any its replica, or make a new original from any replica, if the old one is lost for some reasons. Fsck will refuse to check/repir replica. Fsck is supposed to work only with original subvolumes. 
After mounting an fsck-ed original, kernel will automatically run a special on-line backgroud procedure (scrub) in order to synchronize the repaired original with all its replicas. Once in a while user has to check his array of mirrors by running scrub in the background mode. WARNING: Bear in mind once and forever: Replica is not a backup!!! = Technical Notes = 1. Reiser4 Transaction Design document is transferred to logical volumes without any modifications, but with a small addition. Atom is now composed of per-subvolume components. 2. By design all mirrors differ only in mirror-IDs which are stored in master super-block. Format super-blocks of mirrors are identical. This approach provides best performance and full parallelism in issuing IO requests for mirrors. The minus is a small compromise in design, according to which master super-block doesn't participate in transactions. It means that mirror operations on upgrading/degrading/ swapping can not spawn usual transactions, which can be committed and (re)played using existing transaction manager. That is, mirror operations won't survive a system crash. If a system crash happens during a mirror operation, then the mirror structure should be checked/fixed offline by the mirror tools (kernel will refuse to mount unchecked array of mirrors). Fortunately, all critical mirror operations issue small number of IO requests, so that probability of their interruption is close to zero. 3. We don't commit transactions on all mirrors, only on the original subvolume (this is the single functional difference of original and its replicas). Transaction (re)play, of course, is going on all mirrors using the wandering maps/blocks of the original subvolume. = Failover = Every time when a block is loaded from disk to memory, Reiser4 verifies its checksum. If checksum verification failed, then Reiser4 immediately re-issues read IO request(s) against replica device(s). 
= How to test the feature = Checkout branch "format41" of the upstream reiser4 and reiser4progs git repos on https://github.com/edward6 Build and install as usual. Mirrors can be created by mkfs.reiser4 option -m. If this option is specified, then the first listed device will be the original, other ones - replicas. All devices of an array should have the same size. Further we'll avoid that restriction. IMPORTANT: when creating mirrors specify node41 plugin (with checksum support). Otherwise, your mirrors won't be more useful than block layer's RAID0. Register all your mirrors, trying to "mount" them one-by-one in any order. If you have N mirrors (i.e. one original and N-1 replicas), then first N-1 mount commands will fail. Of course, it is not too graceful, but this is temporal solution. The N-th "attempt" should succeed. Have a fun. Unmount as usual. = Example = Suppose we have 2 partitions /dev/sda7 and /dev/sda8 of equal size. Let's create an array of 2 mirrors: mkfs.reiser4 -my -o node=node41 /dev/sda7 /dev/sda8 Take a look at original subvolume: debugfs.reiser4 /dev/sda7 Take a look at replica: debugfs.reiser4 /dev/sda8 Find differences Register the original subvolume mount /dev/sda7 /mnt mount: wrong fs type, bad option, bad superblock blablabla.... dmesg reiser4[mount(20914)]: check_active_replicas (fs/reiser4/init_volume.c:268)[edward-1750]: WARNING: /dev/sda7 requires replicas, which are not registered. Register the replica and mount the array: mount /dev/sda8 /mnt dmesg reiser4: registered subvolume (/dev/sda8) reiser4 (sda8): found disk format 4.0.1. reiser4 (/dev/sda7): using Hybrid Transaction Model. Let's copy a file /etc/services to our array of mirrors: cp /etc/services /mnt/. Unmount the array: umount /mnt Find a root block: it goes the first in the tree dump: debugfs.reiser4 -t /dev/sda7 In our case the root block has blocknumber #79 Let's now take a look on how our failover works. 
The death-defying act: erase the root block of the original subvolume:

 dd if=/dev/zero of=/dev/sda7 bs=4096 count=1 seek=79

We know that the mount procedure loads the root block. Let's try to mount our array with the corrupted root block:

 mount /dev/sda8 /mnt

Everything works. Take a look at the kernel messages:

 dmesg
 reiser4[mount(21224)]: __jload_gfp_failover[edward-1811]:
 WARNING: block 79 (/dev/sda7) looks corrupted.
 NOTICE: Loading from replica device /dev/sda8.

= TODO =

1) Mirror tools (upgrade/downgrade/synchronize an array of mirrors, swap the original with a specified replica, convert a replica into an original subvolume, visualization of mirror arrays, etc.);

2) Checksumming of the format super-block;

3) Issuing discard requests for replicas on SSD devices.
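As an aside, the "sticky" mount-driven registration described in "How to test the feature" can be modeled with a toy sketch. The class and method names are hypothetical, for illustration only; they are not part of reiser4progs:

```python
class VolumeRegistry:
    """Toy model of "sticky" subvolume registration: every mount attempt
    registers its device, and the array activates only once the original
    and all of its replicas are registered. Failed mounts and umounts do
    not unregister anything."""

    def __init__(self, mirrors):
        self.mirrors = set(mirrors)   # all devices of the array
        self.registered = set()       # sticky: survives failed mounts and umounts

    def mount(self, dev):
        self.registered.add(dev)      # registration happens even when mount fails
        return self.mirrors <= self.registered   # mountable once all are known
```

With two mirrors, the first mount attempt reports failure (only one subvolume is registered) and the second succeeds, matching the N-1 failed mounts described above.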
Indeed, for clear reasons users must not have possibility to RW-mount separate replicas (without originals). The multi-device extension is backward compatible: all volumes of the old format (format40) are supported as logical volumes composed of only one (original) subvolume. = Registration and activation of subvolumes = For now every Reiser4 logical volume has only one original subvolume. Number of replicas can be 0, or more. Logical volume can be mount by usual mount command. Simply specify any its subvolume (the original, or some its replica). The only condition is that original and all its replicas should be registered in the system. If original, or some its replica are not registered, then mount will fail with a respective kernel message. Currently there is no tool to register specified subvolume (TBD). However, mount command always tries to register the specified device. The registration policy is "sticky". It means that your device won't be unregistered after umount, as well as failed mount. (You will be able to unregister it mandatory by a special tool - TBD). Procedure of registration reads the master super-block of the subvolume and puts the subvolume header to a specilal list of registered subvolumes. Mounting a logical volume activates all its registered components. Procedure of activation reads format super-block of the subvolume, and performs other actions like initialization of space maps, transaction replay, etc. as specified by the method ->init_format() of respective disk format plugin. Pointer to an activated subvolume is placed to a special table of active subvolumes. = Mirror operations = So mirrors (an original subvolume with all its replicas) actually represent RAID-1 on the filesystem level. COMMENT. We aren't engaged in marketing fraud on collecting all features of the block layer's RAID and LVM. Reiser4 mirrors implement a failover, that block layers's RAID0 is not able to provide. 
It will be possible to "upgrade", or "downgrade" a reiser4 array of mirrors by attaching / detaching online one, or more replicas by special user-space tools (mirror.reiser4, TBD). Also by those tools it will be possible to swap original with any its replica, or make a new original from any replica, if the old one is lost for some reasons. Fsck will refuse to check/repir replica. Fsck is supposed to work only with original subvolumes. After mounting an fsck-ed original, kernel will automatically run a special on-line backgroud procedure (scrub) in order to synchronize the repaired original with all its replicas. Once in a while user has to check his array of mirrors by running scrub in the background mode. WARNING: Bear in mind once and forever: Replica is not a backup!!! = Technical Notes = 1. Reiser4 Transaction Design document is transferred to logical volumes without any modifications, but with a small addition. Atom is now composed of per-subvolume components. 2. By design all mirrors differ only in mirror-IDs which are stored in master super-block. Format super-blocks of mirrors are identical. This approach provides best performance and full parallelism in issuing IO requests for mirrors. The minus is a small compromise in design, according to which master super-block doesn't participate in transactions. It means that mirror operations on upgrading/degrading/ swapping can not spawn usual transactions, which can be committed and (re)played using existing transaction manager. That is, mirror operations won't survive a system crash. If a system crash happens during a mirror operation, then the mirror structure should be checked/fixed offline by the mirror tools (kernel will refuse to mount unchecked array of mirrors). Fortunately, all critical mirror operations issue small number of IO requests, so that probability of their interruption is close to zero. 3. 
We don't commit transactions on all mirrors, only on the original subvolume (this is the single functional difference of original and its replicas). Transaction (re)play, of course, is going on all mirrors using the wandering maps/blocks of the original subvolume. = Failover = Every time when a block is loaded from disk to memory, Reiser4 verifies its checksum. If checksum verification failed, then Reiser4 immediately re-issues read IO request(s) against replica device(s). = How to test the feature = Checkout branch "format41" of the upstream reiser4 and reiser4progs git repos on https://github.com/edward6 Build and install as usual. Mirrors can be created by mkfs.reiser4 option -m. If this option is specified, then the first listed device will be the original, other ones - replicas. All devices of an array should have the same size. Further we'll avoid that restriction. IMPORTANT: when creating mirrors specify node41 plugin (with checksum support). Otherwise, your mirrors won't be more useful than block layer's RAID0. Register all your mirrors, trying to "mount" them one-by-one in any order. If you have N mirrors (i.e. one original and N-1 replicas), then first N-1 mount commands will fail. Of course, it is not too graceful, but this is temporal solution. The N-th "attempt" should succeed. Have a fun. Unmount as usual. = Example = Suppose we have 2 partitions /dev/sda7 and /dev/sda8 of equal size. Let's create an array of 2 mirrors: mkfs.reiser4 -my -o node=node41 /dev/sda7 /dev/sda8 Take a look at original subvolume: debugfs.reiser4 /dev/sda7 Take a look at replica: debugfs.reiser4 /dev/sda8 Find differences Register the original subvolume mount /dev/sda7 /mnt mount: wrong fs type, bad option, bad superblock blablabla.... dmesg reiser4[mount(20914)]: check_active_replicas (fs/reiser4/init_volume.c:268)[edward-1750]: WARNING: /dev/sda7 requires replicas, which are not registered. 
Register the replica and mount the array: mount /dev/sda8 /mnt dmesg reiser4: registered subvolume (/dev/sda8) reiser4 (sda8): found disk format 4.0.1. reiser4 (/dev/sda7): using Hybrid Transaction Model. Let's copy a file /etc/services to our array of mirrors: cp /etc/services /mnt/. Unmount the array: umount /mnt Find a root block: it goes the first in the tree dump: debugfs.reiser4 -t /dev/sda7 In our case the root block has blocknumber #79 Let's now take a look on how our failover works. The death defying act: we erase the root block of the original subvolume: dd if=/dev/zero of=/dev/sda7 bs=4096 count=1 seek=79 We know that the mount procedure load the root block. Let's try to mount our array with the corrupted root block: mount /dev/sda8 /mnt Everything works.. Take a look at kernel messages: dmesg reiser4[mount(21224)]: __jload_gfp_failover[edward-1811]: WARNING: block 79 (/dev/sda7) looks corrupted. NOTICE: Loading from replica device /dev/sda8. = TODO = 1) Mirror tools (upgrade/downgrade/synchronize an array of mirrors, swap original and specified replica, convert replica to an original subvolume, visualization of mirror arrays, etc); 2) Checksumming format super-block; 3) Issuing discard requests for replicas on SSD devices. 78808b2c7681721c5097805ec80290656c4a9446 4187 4185 2016-11-16T15:44:45Z Edward 4 /* Mirror operations */ = Logical Volumes = Reiser4 will support logical (compound) volumes. For now we have implemented the simplest ones - mirrors. As a supplement to existing checksums it will provide a failover - an important feature, which will reduce number of cases when your volume needs to be repaired by fsck. Reiser4 subvolume is a component of logical volume. Subvolume is always associated with a physical, or logical (built of RAID, LVM, etc means) block device. Every subvolume possesses: . volume ID; . subvolume ID; . mirror ID; . number of replicas. mirror ID is a serial number from 0 till 65535. 
Subvolume with mirror ID 0 has a special name - original. Other ones are called replicas. We use to say "original A has a replica B" (or "B replicates A", which is the same), iff A and B possess the same subvolume ID. Original with all its replicas are called "mirrors". For subvolumes we have introduced a special disk format plugin "format41". In accordance with Reiser4 development model it means forward incompatibility. We have introduced it intentionally, for protection. Indeed, for clear reasons users must not have possibility to RW-mount separate replicas (without originals). The multi-device extension is backward compatible: all volumes of the old format (format40) are supported as logical volumes composed of only one (original) subvolume. = Registration and activation of subvolumes = For now every Reiser4 logical volume has only one original subvolume. Number of replicas can be 0, or more. Logical volume can be mount by usual mount command. Simply specify any its subvolume (the original, or some its replica). The only condition is that original and all its replicas should be registered in the system. If original, or some its replica are not registered, then mount will fail with a respective kernel message. Currently there is no tool to register specified subvolume (TBD). However, mount command always tries to register the specified device. The registration policy is "sticky". It means that your device won't be unregistered after umount, as well as failed mount. (You will be able to unregister it mandatory by a special tool - TBD). Procedure of registration reads the master super-block of the subvolume and puts the subvolume header to a specilal list of registered subvolumes. Mounting a logical volume activates all its registered components. Procedure of activation reads format super-block of the subvolume, and performs other actions like initialization of space maps, transaction replay, etc. 
as specified by the method ->init_format() of respective disk format plugin. Pointer to an activated subvolume is placed to a special table of active subvolumes. = Mirror operations = So original and mirrors actually represent RAID-1 on the filesystem level. COMMENT. We aren't engaged in marketing fraud on collecting all features of the block layer's RAID and LVM. Reiser4 mirrors implement a failover, that block layers's RAID0 is not able to provide. It will be possible to "upgrade", or "downgrade" a reiser4 array of mirrors by attaching / detaching online one, or more replicas by special user-space tools (mirror.reiser4, TBD). Also by those tools it will be possible to swap original with any its replica, or make a new original from any replica, if the old one is lost for some reasons. Fsck will refuse to check/repir replica. Fsck is supposed to work only with original subvolumes. After mounting an fsck-ed original, kernel will automatically run a special on-line backgroud procedure (scrub) in order to synchronize the repaired original with all its replicas. Once in a while user has to check his array of mirrors by running scrub in the background mode. WARNING: Bear in mind once and forever: Replica is not a backup!!! = Technical Notes = 1. Reiser4 Transaction Design document is transferred to logical volumes without any modifications, but with a small addition. Atom is now composed of per-subvolume components. 2. By design all mirrors differ only in mirror-IDs which are stored in master super-block. Format super-blocks of mirrors are identical. This approach provides best performance and full parallelism in issuing IO requests for mirrors. The minus is a small compromise in design, according to which master super-block doesn't participate in transactions. It means that mirror operations on upgrading/degrading/ swapping can not spawn usual transactions, which can be committed and (re)played using existing transaction manager. 
That is, mirror operations won't survive a system crash. If a system crash happens during a mirror operation, then the mirror structure should be checked/fixed offline by the mirror tools (kernel will refuse to mount unchecked array of mirrors). Fortunately, all critical mirror operations issue small number of IO requests, so that probability of their interruption is close to zero. 3. We don't commit transactions on all mirrors, only on the original subvolume (this is the single functional difference of original and its replicas). Transaction (re)play, of course, is going on all mirrors using the wandering maps/blocks of the original subvolume. = Failover = Every time when a block is loaded from disk to memory, Reiser4 verifies its checksum. If checksum verification failed, then Reiser4 immediately re-issues read IO request(s) against replica device(s). = How to test the feature = Checkout branch "format41" of the upstream reiser4 and reiser4progs git repos on https://github.com/edward6 Build and install as usual. Mirrors can be created by mkfs.reiser4 option -m. If this option is specified, then the first listed device will be the original, other ones - replicas. All devices of an array should have the same size. Further we'll avoid that restriction. IMPORTANT: when creating mirrors specify node41 plugin (with checksum support). Otherwise, your mirrors won't be more useful than block layer's RAID0. Register all your mirrors, trying to "mount" them one-by-one in any order. If you have N mirrors (i.e. one original and N-1 replicas), then first N-1 mount commands will fail. Of course, it is not too graceful, but this is temporal solution. The N-th "attempt" should succeed. Have a fun. Unmount as usual. = Example = Suppose we have 2 partitions /dev/sda7 and /dev/sda8 of equal size. 
Let's create an array of 2 mirrors: mkfs.reiser4 -my -o node=node41 /dev/sda7 /dev/sda8 Take a look at original subvolume: debugfs.reiser4 /dev/sda7 Take a look at replica: debugfs.reiser4 /dev/sda8 Find differences Register the original subvolume mount /dev/sda7 /mnt mount: wrong fs type, bad option, bad superblock blablabla.... dmesg reiser4[mount(20914)]: check_active_replicas (fs/reiser4/init_volume.c:268)[edward-1750]: WARNING: /dev/sda7 requires replicas, which are not registered. Register the replica and mount the array: mount /dev/sda8 /mnt dmesg reiser4: registered subvolume (/dev/sda8) reiser4 (sda8): found disk format 4.0.1. reiser4 (/dev/sda7): using Hybrid Transaction Model. Let's copy a file /etc/services to our array of mirrors: cp /etc/services /mnt/. Unmount the array: umount /mnt Find a root block: it goes the first in the tree dump: debugfs.reiser4 -t /dev/sda7 In our case the root block has blocknumber #79 Let's now take a look on how our failover works. The death defying act: we erase the root block of the original subvolume: dd if=/dev/zero of=/dev/sda7 bs=4096 count=1 seek=79 We know that the mount procedure load the root block. Let's try to mount our array with the corrupted root block: mount /dev/sda8 /mnt Everything works.. Take a look at kernel messages: dmesg reiser4[mount(21224)]: __jload_gfp_failover[edward-1811]: WARNING: block 79 (/dev/sda7) looks corrupted. NOTICE: Loading from replica device /dev/sda8. = TODO = 1) Mirror tools (upgrade/downgrade/synchronize an array of mirrors, swap original and specified replica, convert replica to an original subvolume, visualization of mirror arrays, etc); 2) Checksumming format super-block; 3) Issuing discard requests for replicas on SSD devices. 63c39110ebad00f99a0a2d73afbb4d48504dd3db 4185 4183 2016-11-16T15:22:21Z Edward 4 = Logical Volumes = Reiser4 will support logical (compound) volumes. For now we have implemented the simplest ones - mirrors. 
As a supplement to existing checksums it will provide a failover - an important feature, which will reduce number of cases when your volume needs to be repaired by fsck. Reiser4 subvolume is a component of logical volume. Subvolume is always associated with a physical, or logical (built of RAID, LVM, etc means) block device. Every subvolume possesses: . volume ID; . subvolume ID; . mirror ID; . number of replicas. mirror ID is a serial number from 0 till 65535. Subvolume with mirror ID 0 has a special name - original. Other ones are called replicas. We use to say "original A has a replica B" (or "B replicates A", which is the same), iff A and B possess the same subvolume ID. Original with all its replicas are called "mirrors". For subvolumes we have introduced a special disk format plugin "format41". In accordance with Reiser4 development model it means forward incompatibility. We have introduced it intentionally, for protection. Indeed, for clear reasons users must not have possibility to RW-mount separate replicas (without originals). The multi-device extension is backward compatible: all volumes of the old format (format40) are supported as logical volumes composed of only one (original) subvolume. = Registration and activation of subvolumes = For now every Reiser4 logical volume has only one original subvolume. Number of replicas can be 0, or more. Logical volume can be mount by usual mount command. Simply specify any its subvolume (the original, or some its replica). The only condition is that original and all its replicas should be registered in the system. If original, or some its replica are not registered, then mount will fail with a respective kernel message. Currently there is no tool to register specified subvolume (TBD). However, mount command always tries to register the specified device. The registration policy is "sticky". It means that your device won't be unregistered after umount, as well as failed mount. 
(You will be able to unregister it mandatory by a special tool - TBD). Procedure of registration reads the master super-block of the subvolume and puts the subvolume header to a specilal list of registered subvolumes. Mounting a logical volume activates all its registered components. Procedure of activation reads format super-block of the subvolume, and performs other actions like initialization of space maps, transaction replay, etc. as specified by the method ->init_format() of respective disk format plugin. Pointer to an activated subvolume is placed to a special table of active subvolumes. = Mirror operations = So original and mirrors actually represent RAID0 on the filesystem level. COMMENT. We aren't engaged in marketing fraud on collecting all features of the block layer's RAID and LVM. Reiser4 mirrors implement a failover, that block layers's RAID0 is not able to provide. It will be possible to "upgrade", or "downgrade" a reiser4 array of mirrors by attaching / detaching online one, or more replicas by special user-space tools (mirror.reiser4, TBD). Also by those tools it will be possible to swap original with any its replica, or make a new original from any replica, if the old one is lost for some reasons. Fsck will refuse to check/repir replica. Fsck is supposed to work only with original subvolumes. After mounting an fsck-ed original, kernel will automatically run a special on-line backgroud procedure (scrub) in order to synchronize the repaired original with all its replicas. Once in a while user has to check his array of mirrors by running scrub in the background mode. WARNING: Bear in mind once and forever: Replica is not a backup!!! = Technical Notes = 1. Reiser4 Transaction Design document is transferred to logical volumes without any modifications, but with a small addition. Atom is now composed of per-subvolume components. 2. By design all mirrors differ only in mirror-IDs which are stored in master super-block. 
Format super-blocks of mirrors are identical. This approach provides best performance and full parallelism in issuing IO requests for mirrors. The minus is a small compromise in design, according to which master super-block doesn't participate in transactions. It means that mirror operations on upgrading/degrading/ swapping can not spawn usual transactions, which can be committed and (re)played using existing transaction manager. That is, mirror operations won't survive a system crash. If a system crash happens during a mirror operation, then the mirror structure should be checked/fixed offline by the mirror tools (kernel will refuse to mount unchecked array of mirrors). Fortunately, all critical mirror operations issue small number of IO requests, so that probability of their interruption is close to zero. 3. We don't commit transactions on all mirrors, only on the original subvolume (this is the single functional difference of original and its replicas). Transaction (re)play, of course, is going on all mirrors using the wandering maps/blocks of the original subvolume. = Failover = Every time when a block is loaded from disk to memory, Reiser4 verifies its checksum. If checksum verification failed, then Reiser4 immediately re-issues read IO request(s) against replica device(s). = How to test the feature = Checkout branch "format41" of the upstream reiser4 and reiser4progs git repos on https://github.com/edward6 Build and install as usual. Mirrors can be created by mkfs.reiser4 option -m. If this option is specified, then the first listed device will be the original, other ones - replicas. All devices of an array should have the same size. Further we'll avoid that restriction. IMPORTANT: when creating mirrors specify node41 plugin (with checksum support). Otherwise, your mirrors won't be more useful than block layer's RAID0. Register all your mirrors, trying to "mount" them one-by-one in any order. If you have N mirrors (i.e. 
one original and N-1 replicas), then first N-1 mount commands will fail. Of course, it is not too graceful, but this is temporal solution. The N-th "attempt" should succeed. Have a fun. Unmount as usual. = Example = Suppose we have 2 partitions /dev/sda7 and /dev/sda8 of equal size. Let's create an array of 2 mirrors: mkfs.reiser4 -my -o node=node41 /dev/sda7 /dev/sda8 Take a look at original subvolume: debugfs.reiser4 /dev/sda7 Take a look at replica: debugfs.reiser4 /dev/sda8 Find differences Register the original subvolume mount /dev/sda7 /mnt mount: wrong fs type, bad option, bad superblock blablabla.... dmesg reiser4[mount(20914)]: check_active_replicas (fs/reiser4/init_volume.c:268)[edward-1750]: WARNING: /dev/sda7 requires replicas, which are not registered. Register the replica and mount the array: mount /dev/sda8 /mnt dmesg reiser4: registered subvolume (/dev/sda8) reiser4 (sda8): found disk format 4.0.1. reiser4 (/dev/sda7): using Hybrid Transaction Model. Let's copy a file /etc/services to our array of mirrors: cp /etc/services /mnt/. Unmount the array: umount /mnt Find a root block: it goes the first in the tree dump: debugfs.reiser4 -t /dev/sda7 In our case the root block has blocknumber #79 Let's now take a look on how our failover works. The death defying act: we erase the root block of the original subvolume: dd if=/dev/zero of=/dev/sda7 bs=4096 count=1 seek=79 We know that the mount procedure load the root block. Let's try to mount our array with the corrupted root block: mount /dev/sda8 /mnt Everything works.. Take a look at kernel messages: dmesg reiser4[mount(21224)]: __jload_gfp_failover[edward-1811]: WARNING: block 79 (/dev/sda7) looks corrupted. NOTICE: Loading from replica device /dev/sda8. 
= TODO = 1) Mirror tools (upgrade/downgrade/synchronize an array of mirrors, swap original and specified replica, convert replica to an original subvolume, visualization of mirror arrays, etc); 2) Checksumming format super-block; 3) Issuing discard requests for replicas on SSD devices. 8b690aa0d7678ac2fd727cd60bc9727d3bbbb774 4183 4181 2016-11-16T15:19:28Z Edward 4 Fix up formatting = Logical Volumes = Reiser4 will support logical (compound) volumes. For now we have implemented the simplest ones - mirrors. As a supplement to existing checksums it will provide a failover - an important feature, which will reduce number of cases when your volume needs to be repaired by fsck. Reiser4 subvolume is a component of logical volume. Subvolume is always associated with a physical, or logical (built of RAID, LVM, etc means) block device. Every subvolume possesses: . volume ID; . subvolume ID; . mirror ID; . number of replicas. mirror ID is a serial number from 0 till 65535. Subvolume with mirror ID 0 has a special name - original. Other ones are called replicas. We use to say "original A has a replica B" (or "B replicates A", which is the same), iff A and B possess the same subvolume ID. Original with all its replicas are called "mirrors". For subvolumes we have introduced a special disk format plugin "format41". In accordance with Reiser4 development model it means forward incompatibility. We have introduced it intentionally, for protection. Indeed, for clear reasons users must not have possibility to RW-mount separate replicas (without originals). The multi-device extension is backward compatible: all volumes of the old format (format40) are supported as logical volumes composed of only one (original) subvolume. = Registration and activation of subvolumes = For now every Reiser4 logical volume has only one original subvolume. Number of replicas can be 0, or more. Logical volume can be mount by usual mount command. 
Simply specify any its subvolume (the original, or some its replica). The only condition is that original and all its replicas should be registered in the system. If original, or some its replica are not registered, then mount will fail with a respective kernel message. Currently there is no tool to register specified subvolume (TBD). However, mount command always tries to register the specified device. The registration policy is "sticky". It means that your device won't be unregistered after umount, as well as failed mount. (You will be able to unregister it mandatory by a special tool - TBD). Procedure of registration reads the master super-block of the subvolume and puts the subvolume header to a specilal list of registered subvolumes. Mounting a logical volume activates all its registered components. Procedure of activation reads format super-block of the subvolume, and performs other actions like initialization of space maps, transaction replay, etc. as specified by the method ->init_format() of respective disk format plugin. Pointer to an activated subvolume is placed to a special table of active subvolumes. = Mirror operations = So original and mirrors actually represent RAID0 on the filesystem level. COMMENT. We aren't engaged in marketing fraud on collecting all features of the block layer's RAID and LVM. Reiser4 mirrors implement a failover, that block layers's RAID0 is not able to provide. It will be possible to "upgrade", or "downgrade" a reiser4 array of mirrors by attaching / detaching online one, or more replicas by special user-space tools (mirror.reiser4, TBD). Also by those tools it will be possible to swap original with any its replica, or make a new original from any replica, if the old one is lost for some reasons. Fsck will refuse to check/repir replica. Fsck is supposed to work only with original subvolumes. 
After mounting an fsck-ed original, kernel will automatically run a special on-line backgroud procedure (scrub) in order to synchronize the repaired original with all its replicas. Once in a while user has to check his array of mirrors by running scrub in the background mode. WARNING: Bear in mind once and forever: Replica is not a backup!!! = Technical Notes = 1. Reiser4 Transaction Design document is transferred to logical volumes without any modifications, but with a small addition. Atom is now composed of per-subvolume components. 2. By design all mirrors differ only in mirror-IDs which are stored in master super-block. Format super-blocks of mirrors are identical. This approach provides best performance and full parallelism in issuing IO requests for mirrors. The minus is a small compromise in design, according to which master super-block doesn't participate in transactions. It means that mirror operations on upgrading/degrading/ swapping can not spawn usual transactions, which can be committed and (re)played using existing transaction manager. That is, mirror operations won't survive a system crash. If a system crash happens during a mirror operation, then the mirror structure should be checked/fixed offline by the mirror tools (kernel will refuse to mount unchecked array of mirrors). Fortunately, all critical mirror operations issue small number of IO requests, so that probability of their interruption is close to zero. 3. We don't commit transactions on all mirrors, only on the original subvolume (this is the single functional difference of original and its replicas). Transaction (re)play, of course, is going on all mirrors using the wandering maps/blocks of the original subvolume. = Failover = Every time when a block is loaded from disk to memory, Reiser4 verifies its checksum. If checksum verification failed, then Reiser4 immediately issues read IO requests against replica devices. 
= How to test the feature = Checkout branch "format41" of the upstream reiser4 and reiser4progs git repos on https://github.com/edward6 Build and install as usual. Mirrors can be created by mkfs.reiser4 option -m. If this option is specified, then the first listed device will be the original, other ones - replicas. All devices of an array should have the same size. Further we'll avoid that restriction. IMPORTANT: when creating mirrors specify node41 plugin (with checksum support). Otherwise, your mirrors won't be more useful than block layer's RAID0. Register all your mirrors, trying to "mount" them one-by-one in any order. If you have N mirrors (i.e. one original and N-1 replicas), then first N-1 mount commands will fail. Of course, it is not too graceful, but this is temporal solution. The N-th "attempt" should succeed. Have a fun. Unmount as usual. = Example = Suppose we have 2 partitions /dev/sda7 and /dev/sda8 of equal size. Let's create an array of 2 mirrors: mkfs.reiser4 -my -o node=node41 /dev/sda7 /dev/sda8 Take a look at original subvolume: debugfs.reiser4 /dev/sda7 Take a look at replica: debugfs.reiser4 /dev/sda8 Find differences Register the original subvolume mount /dev/sda7 /mnt mount: wrong fs type, bad option, bad superblock blablabla.... dmesg reiser4[mount(20914)]: check_active_replicas (fs/reiser4/init_volume.c:268)[edward-1750]: WARNING: /dev/sda7 requires replicas, which are not registered. Register the replica and mount the array: mount /dev/sda8 /mnt dmesg reiser4: registered subvolume (/dev/sda8) reiser4 (sda8): found disk format 4.0.1. reiser4 (/dev/sda7): using Hybrid Transaction Model. Let's copy a file /etc/services to our array of mirrors: cp /etc/services /mnt/. Unmount the array: umount /mnt Find a root block: it goes the first in the tree dump: debugfs.reiser4 -t /dev/sda7 In our case the root block has blocknumber #79 Let's now take a look on how our failover works. 
The death-defying act: erase the root block of the original subvolume:

 dd if=/dev/zero of=/dev/sda7 bs=4096 count=1 seek=79

We know that the mount procedure loads the root block. Try to mount the array with the corrupted root block:

 mount /dev/sda8 /mnt

Everything works. Take a look at the kernel messages:

 dmesg
 reiser4[mount(21224)]: __jload_gfp_failover[edward-1811]:
 WARNING: block 79 (/dev/sda7) looks corrupted.
 NOTICE: Loading from replica device /dev/sda8.

= TODO =

1) Mirror tools (upgrade/downgrade/synchronize an array of mirrors, swap the original with a specified replica, convert a replica into an original subvolume, visualization of mirror arrays, etc.);
2) Checksumming of the format super-block;
3) Issuing discard requests for replicas on SSD devices.

= Logical Volumes =

Reiser4 will support logical (compound) volumes. For now we have implemented the simplest ones - mirrors. As a supplement to the existing checksums, they provide failover - an important feature which reduces the number of cases when your volume needs to be repaired by fsck.

A Reiser4 subvolume is a component of a logical volume. A subvolume is always associated with a physical or logical (built by RAID, LVM, etc. means) block device. Every subvolume possesses:

* volume ID;
* subvolume ID;
* mirror ID;
* number of replicas.

The mirror ID is a serial number from 0 to 65535. The subvolume with mirror ID 0 has a special name - the original. The other ones are called replicas. We say "original A has a replica B" (or "B replicates A", which is the same) iff A and B possess the same subvolume ID. An original with all its replicas is called an array of "mirrors".

For subvolumes we have introduced a special disk format plugin, "format41". In accordance with the Reiser4 development model this means forward incompatibility. We have introduced it intentionally, for protection.
Indeed, for obvious reasons users must not be able to RW-mount separate replicas (without their originals). The multi-device extension is backward compatible: all volumes of the old format (format40) are supported as logical volumes composed of only one (original) subvolume.

= Registration and activation of subvolumes =

For now, every Reiser4 logical volume has exactly one original subvolume. The number of replicas can be 0 or more. A logical volume can be mounted by the usual mount command: simply specify any of its subvolumes (the original, or any of its replicas). The only condition is that the original and all its replicas are registered in the system. If the original, or some replica, is not registered, then mount fails with a respective kernel message.

Currently there is no tool to register a specified subvolume (TBD). However, the mount command always tries to register the specified device. The registration policy is "sticky": your device won't be unregistered after umount, nor after a failed mount. (You will be able to unregister it explicitly with a special tool - TBD.)

The registration procedure reads the master super-block of the subvolume and puts the subvolume header on a special list of registered subvolumes. Mounting a logical volume activates all its registered components. The activation procedure reads the format super-block of the subvolume and performs other actions (initialization of space maps, transaction replay, etc.) as specified by the ->init_format() method of the respective disk format plugin. A pointer to an activated subvolume is placed in a special table of active subvolumes.

= Mirror operations =

So an original with its mirrors actually represents RAID1 at the filesystem level. COMMENT: We are not engaged in a marketing race to collect all the features of the block layer's RAID and LVM. Reiser4 mirrors implement failover, which the block layer's RAID1 is not able to provide.
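The registration and activation semantics described above - sticky registration, mount succeeding only once every mirror of the volume is registered - can be modeled in a few lines. This is a purely illustrative sketch (the VolumeManager class is hypothetical), not the kernel code:

```python
class VolumeManager:
    """Toy model of reiser4 subvolume registration and activation."""

    def __init__(self, volume_id, num_mirrors):
        self.volume_id = volume_id
        self.num_mirrors = num_mirrors   # one original + (num_mirrors - 1) replicas
        self.registered = set()          # "sticky": survives failed mounts and umount
        self.active = False

    def mount(self, device):
        # mount always tries to register the specified device first
        self.registered.add(device)
        if len(self.registered) < self.num_mirrors:
            # Some mirror is still unregistered: the mount fails,
            # but the registration is kept (sticky policy).
            return False
        self.active = True               # activate all registered components
        return True
```

This reproduces the behaviour described in "How to test the feature": with N mirrors, the first N-1 mount attempts fail and the N-th succeeds.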
It will be possible to "upgrade" or "downgrade" a reiser4 array of mirrors by attaching/detaching one or more replicas online, using special user-space tools (mirror.reiser4, TBD). With those tools it will also be possible to swap the original with any of its replicas, or to make a new original from any replica if the old one is lost for some reason.

Fsck will refuse to check/repair a replica: fsck is supposed to work only with original subvolumes.
Reiser4 checksums (last edited 2015-08-30 by Edward)

= Why protect (meta)data? =

We want to be protected against hardware problems such as data rot in memory and decay of storage media.
We want to be sure that our data structures are consistent, because working with corrupted data structures is dangerous. Strictly speaking, such protection is not the business of a file system. It would be more logical to assume that it is the business of the upper and lower subsystems: to be precise, protection against data rot in memory is the business of the memory controller, and protection against decay of storage media is the business of the block device controller/driver. However, those subsystems frequently don't provide such protection, for various reasons. As a result the file system suffers (becomes corrupted, inconsistent), and poor users start to blame the file system developers.

= Why "inline" checksums? =

Reiser4 stores the per-node checksum right in the node that we want to protect. This is much more efficient than using dedicated data structures for checksums, as we don't need to launch an expensive search procedure every time we need to access a checksum. Using dedicated data structures to store checksums is a design mistake.

= When do we check/update per-node checksums? =

Let's start with protection against storage media decay. (If someone wants protection against data rot in memory, then let me know.) Since we implement protection against storage media decay, it is enough to check a checksum right after IO completion, and to update it right before submitting an IO request. We don't need to update a checksum after every modification, so updating checksums in Reiser4 is a delayed action: Reiser4 updates the per-node checksum at commit time, right before writing the node to disk. At the moment of the checksum update, any process modifying this node will be blocked on an attempt to acquire exclusive access:

 longterm_lock_znode -> try_capture_block

Thus an updated checksum won't be "spoiled" before hitting the disk. Checksum verification happens right after read IO completion, in the ->parse() method of the node plugin.
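The delayed-update scheme above can be modeled with a toy node object: the checksum is refreshed only at commit time, and verification happens on "parse". This is an illustrative sketch (the Node class is hypothetical, and zlib.crc32 stands in for crc32c), not the reiser4 code:

```python
import zlib

def crc(data: bytes) -> int:
    # Stand-in for the crc32c used by the node41 format.
    return zlib.crc32(data) & 0xFFFFFFFF

class Node:
    """Toy model of a formatted node carrying an inline checksum."""

    def __init__(self, payload: bytes):
        self.payload = bytearray(payload)
        self.checksum = crc(bytes(self.payload))

    def modify(self, offset: int, data: bytes):
        # Deliberately no checksum update here:
        # checksum updates are delayed until commit time.
        self.payload[offset:offset + len(data)] = data

    def commit(self) -> bytes:
        # Update the checksum right before the node is written to disk.
        self.checksum = crc(bytes(self.payload))
        return bytes(self.payload)

    def parse(self) -> bool:
        # Verification, as done right after read IO completion.
        return crc(bytes(self.payload)) == self.checksum
```

Between a modification and the next commit the stored checksum is stale, which is why the kernel blocks modifying processes while the checksum is being refreshed for writeout.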
= How we handle corruptions =

If a node's checksum verification fails, then further work with that node is dangerous. In this case the kernel remounts the partition read-only, with a suggestion in the kernel logs to check it with fsck.

TODO: An online failover mode is planned. For this mode we need to support mirror(s). Every in-memory replica gets updated at the moment of the checksum update; at the finish of transaction commit all replicas have to be written to the mirror. If checksum verification fails, then we issue a read IO request for the replica block of the mirror. Comment: mirrors can be internal (replicas allocated on the same partition) or external (replicas allocated on a different device).

= Why use crc32c for checksums? =

Modern CPUs have instructions which allow computing a full 32-bit CRC step in 3 cycles.

= How to protect data? =

Currently we don't support checksums for unformatted blocks, where the bodies of large files are stored. If you want to protect your data (not only metadata), then you have 3 options:

1) Make sure that reiser4 stores the bodies of your files in fragments (i.e. "inline" data chunks). Fragments are always stored in formatted nodes, which are protected by checksums. This is possible with the mkfs option "formatting=tails" for files managed by the unix_file plugin (if you don't use compression), or "compressMode=latt" for files managed by the cryptcompress plugin (if you use compression). NOTE: This option will lead to performance degradation (especially for delete operations).

2) Protect your data yourself. If a file system guarantees consistency of metadata, then data protection can be successfully implemented in user space. Indeed, since a file's body is uniquely determined by its extent pointers, which are guaranteed to be consistent, checking the consistency of the file's body in user space is always a correct operation. So feel free to check your data in user space: we have provided the basis for this.
3) Implement checksums for unformatted nodes in reiser4. This option requires a new format of extent pointers (including a 32-bit field for the checksum) and, respectively, a new item plugin (extent-pointer-with-checksum, or so).

= How to enable checksum support in reiser4 =

Specify the node plugin with protection via the mkfs.reiser4 option "'''-o node=node41'''" when formatting your partition, and mount as usual.

= Compatibility with other features =

Checksums are compatible with all reiser4 features. Adding checksum support is a great example of how reiser4 resists the problem of creeping featurism. We just added a new node plugin, which manages nodes of a new format (node41) with a 32-bit field for the checksum. The new plugin mostly reuses the methods of the old one (node40), as you can see from the patches for [http://marc.info/?l=reiserfs-devel&m=142359111509525&w=2 reiser4] and [http://marc.info/?l=reiserfs-devel&m=142359112409527&w=2 reiser4progs].

= TODO =

A. Failover via mirroring (see the section "How we handle corruptions" for implementation hints).

B. Maintain checksums for the super-block and bitmap blocks. Comment: We already have such support for bitmap blocks; however, it uses adler32, and checksum update/verification is not invoked, for historical reasons. I suggest replacing adler32 with crc32c and enabling the update/verification. Comment: For super-block protection we need to add a 32-bit field to the disk super-block and update/verify it as in the case of formatted nodes.

[[category:Reiser4]]
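Option 2 above, user-space data protection, can be as simple as keeping a sidecar checksum next to each file and re-verifying it on read. A minimal sketch, assuming a hypothetical ".sha256" sidecar-file convention and using sha256 rather than crc32c:

```python
import hashlib

def write_protected(path: str, data: bytes):
    """Write data plus a sidecar file holding its sha256 digest."""
    with open(path, "wb") as f:
        f.write(data)
    with open(path + ".sha256", "w") as f:
        f.write(hashlib.sha256(data).hexdigest())

def read_verified(path: str) -> bytes:
    """Read data back and verify it against the sidecar digest."""
    with open(path, "rb") as f:
        data = f.read()
    with open(path + ".sha256") as f:
        expected = f.read().strip()
    if hashlib.sha256(data).hexdigest() != expected:
        raise IOError("data corruption detected in %s" % path)
    return data
```

Since reiser4 guarantees the consistency of the extent pointers describing the file body, a mismatch detected this way points at data corruption, not at metadata inconsistency.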
However, frequently the mentioned subsystems don't provide such protection for various reasons. As the result the file system suffers (becomes corrupted, inconsistent), and poor users start to blame file system developers. 2. Why "inline" checksums? Reiser4 stores per-node checksum right in the node that we want to protect. This is much more efficient than using dedicated data structures for checksums, as we don't need to launch expensive search procedures every time when we need to access a checksum. Using dedicated data structures to store checksums is a design mistake. 3. When we check/update per-node checksums? Let's start from protection against storage media decay. If someone wants protection against data rot in memory, then let me know. Since we implement protection against storage media decay, it is enough to check [update] a checksum right after IO completion [before submitting IO request]. We don't need to update a checksum after every modification. So, updating checksums in Reiser4 is a delayed action. Reiser4 updates per-node checksum at commit time right before writing the node to disk. At the moment of checksum update any process modifying this node will be blocked on an attempt to acquire an exclusive access: longterm_lock_znode -> try_capture_block Thus, updated checksum won't be "spoiled" before hitting the disk. Checksum verification is going right after read IO completion in the ->parse() method of node plugin. 4. How we handle corruptions If node's checksum verification failed, them further working with such node is dangerous. In this case the partition will be remounted by kernel as readonly with a suggestion in kernel logs to check it by fsck. TODO: Online failover mode is in plans. For this mode we need to support mirror(s). Every in-memory replica gets updated at the moment of the checksum update. At the finish of transaction commit all replicas have to be written to the mirror. 
If checksum verification failed, then we issue a read IO request for the replica block of the mirror. Comment. Mirrors can be internal (when we allocate replicas on the same partition) and external (when we allocate replicas on different device). 5. Why use crc32c for checksums? Modern CPUs have instructions, which allow to compute a full 32-bit CRC step in 3 cycles. 6. How to protect data? Currently we don't support checksums for unformatted blocks, where bodies of large files are stored. If you want to protect your data (not only metadata), then you have 3 options: 1) Make sure that reiser4 stores bodies of your files in fragments (i.e. "inline" data chunks). Fragments are always stored in formatted nodes, which are protected by checksums. It is possible with mkfs option "formatting=tails" for files managed by unix_file plugin (if you don't use compression) or "compressMode=latt" for files managed by cryptcompress plugin (if you use compression). NOTE. This option will lead to performance degradation (especially for delete operations). 2) Protect your data by yourself. If a file system guarantees consistency of metadata, then data protection can be successfuly implemented in the user-space. Indeed, since file body is uniquely determined by extent pointers, which are guaranteed to be consistent, then checking consistency of the file's body in user space is always a correct operation. So, feel free to check your data in the user-space: we have provided basis for this. 3) Implement checksums for unformatted nodes in reiser4. This option requires a new format for extent pointers (which will include a 32-bit field for checksum), and, respectively, a new item plugin (extent-pointer-with-checksum, or so). 7. How to enable checksum support in reiser4 Specify node plugin with protection by mkfs.reiser4 option "'''-o node=node41'''" when formatting your partition and mount as usual. 8. Compatibility with other features Checksums are compatible with all reiser4 features. 
Reiser4 development model

''(last edited 2017-09-28 by Edward)''

= Why worry about compatibility? =

Reiser4 is under permanent development: new plugins managing new disk objects are added systematically, and the common problem is that old software is unaware of the new objects. The biggest concern is old fsck utilities, which treat "unknown" objects as corruption. So we need a mechanism to restrict, or even prevent, access to new objects by old software.

= Reiser4 development policy =

Here we discuss what is needed to keep track of compatibility. Software upgrades that are fully backward and forward compatible (like adding discard/TRIM support) are not a concern. All other upgrades of the reiser4 kernel module (reiser4progs) amount to adding a plugin, or a set of plugins, of existing or new interfaces; hence, such upgrades are always backward compatible. With every such upgrade the developer has to increment the plugin library version number of both the reiser4 kernel module and reiser4progs. See the definition of the macro PLUGIN_LIBRARY_VERSION in the reiser4 kernel module and reiser4progs.

= Software Framework Release Numbers =

Any reiser4 kernel module (reiser4progs package) has a software framework release number (SFRN) 4.X.Y, where X and Y are the major and minor SFRN respectively. The major SFRN of the reiser4 module (reiser4progs) is defined as the maximal serial number of a supported disk format plugin (see the definition of struct disk_format_plugin). The minor SFRN of the reiser4 module (reiser4progs) is defined as the version number of the plugin library. The SFRN of the reiser4 kernel module is reported by the kernel (look for the string "Loading Reiser4 (format release: 4.X.Y)" in the kernel messages). The SFRN of reiser4progs is reported by every utility in the package when invoked with the option -V (look for the string "Format release 4.X.Y"). Don't confuse the SFRN with the version number of the reiser4progs package! The latter takes into account all software changes (including fully compatible ones).

= Compatibility of software upgrades =

Every upgrade of the reiser4 kernel module and reiser4progs performed in accordance with the development policy above is automatically backward compatible. Every upgrade of the reiser4progs SFRN is forward incompatible; that is, reiser4progs will refuse to work with any partition with a greater DFVN. This is for reliability: fsck treats any "unknown" objects as corruption. "Major" upgrades of the reiser4 kernel module (4.X.Y) -> (4.X+1.Y) are fully forward incompatible. "Minor" upgrades (4.X.Y) -> (4.X.Y+1) are partially forward compatible.

= Disk format version numbers =

Any reiser4 partition has a disk format version number (DFVN) 4.A.B, where A and B are respectively the major and minor DFVN. The major DFVN is defined as the id of the disk format plugin specified when formatting the partition; it never changes. The minor DFVN is defined as the minimal SFRN of the reiser4 kernel module (and reiser4progs) that is fully compatible with the partition. Initially every partition gets a minor DFVN equal to the minor SFRN of the mkfs.reiser4 utility that formatted it; later the minor DFVN can be upgraded.

Any partition 4.A.B can be correctly handled only by reiser4progs 4.U.V, where A <= U and B <= V. Otherwise, reiser4progs will refuse to work with such a partition (the user will be advised to upgrade reiser4progs). Any partition 4.A.B can be fully mounted only by a kernel module 4.U.V with A <= U and B <= V. If A > U, then mount will fail with a kernel warning about forward incompatibility and a suggestion to upgrade the kernel. When mounting a partition 4.A.B with a kernel module 4.U.V, where A <= U and B != V, two scenarios are possible. If B > V, then mount will succeed, but the kernel will warn about partial forward compatibility (meaning that not all disk objects will be accessible in that mount session). If B < V, then mount will succeed; the kernel will upgrade the DFVN in the primary superblock, with a suggestion to upgrade the DFVN in the superblock replicas (backup blocks) by running fsck.reiser4 --fix. The DFVN can be fully upgraded by any fsck.reiser4 utility of the proper SFRN; the reiser4 kernel module upgrades the DFVN only in the primary superblock. Downgrades of the DFVN are not supported or planned.

The DFVN is reported by the kernel at mount time (look for the string "reiser4: found disk format 4.A.B") and by the debugfs.reiser4 utility in its default mode.

[[category:Reiser4]]
So we need a mechanism to restrict, or even prevent access to new objects for old software. = Reiser4 development policy = Here we discuss things needed to keep a track of compatibility, so we don't consider software upgrades, which are fully backward and forward compatible (like adding discard/TRIM support). Other upgrades of reiser4 kernel module (reiser4progs) always look like adding a plugin or a set of plugins of existing, or new interfaces. Hence, such upgrades are always backward compatible. With every such upgrade developer has to increment plugin library version number of both, reiser4 kernel module and reiser4progs. See definition of the macro PLUGIN_LIBRARY_VERSION in reiser4 kernel module and reiser4progs. = Software Framework Release Numbers = Any reiser4 kernel module (reiser4progs package) has a software framework release number (SFRN) 4.X.Y, where X and Y are major and minor SFRN respectively. Major SFRN of reiser4 module (reiser4progs) is defined as the maximal serial number of supported disk format plugin (see definition of the struct disk_format_plugin). Minor SFRN of reiser4 module (reiser4progs) is defined as version number of the plugin library. SFRN of reiser4 kernel module is reported by kernel (look for the string "Loading Reiser4 (format release: 4.X.Y)" in kernel messages. SFRN of reiser4progs is reported by every utility provided by this package when specifying option -V (Look for the string "Format release 4.X.Y"). Don't confuse SFRN and the version number of reiser4progs package! The last one takes into account all software changes (including fully compatible ones). = Compatibility of software upgrades = Every upgrade of reiser4 kernel module and reiser4progs performed in accordance with the development model (section 1) is automatically backward compatible. Every upgrade of reiser4progs SFRN is forward incompatible. That is, reiser4progs will refuse to with any partition with greater DFVN. 
This is for more reliability: fsck treats any "unknown" objects as a corruption. "Major" upgrades of reiser4 kernel module (4.X.Y) -> (4.X+1.Y) are fully forward incompatible. "Minor" upgrades of reiser4 kernel module (4.X.Y) -> (4.X.Y+1) are partially forward compatible. = Disk format version numbers = Any reiser4 partition has a disk format versions number (DFVN) 4.A.B, where A and B are respectively major and minor DFVN. Major DFVN is defined as disk format plugin id specified when formatting the partition. Major DFVN never changes. Minor DFVN is defined as the minimal SFRN of reiser4 kernel module (and reiser4progs), which is fully compatible with this partition. Initially every partition gets minor DFVN equal to minor SFRN of the mkfs.reiser4 utility when formatting the partition. Further minor DFVN can be upgraded. Any partition 4.A.B can be correctly handled only by reiser4progs 4.U.V, where A <= U and B <= V. Otherwise, reiser4progs will refuse to work with such partition (user will be suggested to upgrade reiser4progs). Any partition 4.A.B can be fully mounted only by kernel module 4.U.B, where A <= U. If A > U, then mount will fail with a kernel warning about forward incompatibility and a suggestion to upgrade the kernel. When mounting a partition 4.A.B by kernel module 4.U.V, where A <= U, B != V, then 2 scenarios are possible: If B > V, then mount will succeed, but kernel will warn about partial forward compatibility (it means that not all disk objects will be accessible in that mount session). If B < V, then mount will succeed. Kernel will upgrade DFVN in the primary superblock with the suggestion to upgrade DFVN in the superblock replicas (backup blocks) by running fsck.reiser4 --fix. DFVN can be fully upgraded by any reiser4.fsck utility of proper SFRN. Reiser4 kernel module upgrades DFVN only in the primary superblock. Downgrades of DFVN are not supported/planned. 
DFVN is reported by kernel at mount time (look for the string: "reiser4: found disk format 4.A.B"). DFVN is also reported by debugfs.reiser4 utility in default mode. [[category:Reiser4]] 8ad82895c295b13993fd60af74308c74f9b38f61 4109 4107 2016-01-14T16:31:50Z Edward 4 /* Reiser4 development policy */ = Why worry about compatibility? = Reiser4 is in permanent development. New plugins managing new disk objects get added systematically. And the common problem is that old software is unaware of new objects. The biggest concern is old fsck utilities, which treat the new "unknown" objects as corruption. So we need a mechanism to restrict, or even prevent access to new objects for old software. = Reiser4 development policy = Here we discuss things needed to keep a track of compatibility, so we don't consider software upgrades, which are fully backward and forward compatible (like adding discard/TRIM support). Other upgrades of reiser4 kernel module (reiser4progs) always look like adding a plugin or a set of plugins of existing, or new interfaces. Hence, such upgrades are always backward compatible. With every such upgrade developer has to increment plugin library version number of both, reiser4 kernel module and reiser4progs. See definition of the macro PLUGIN_LIBRARY_VERSION in reiser4 kernel module and reiser4progs. = Software Format Release Numbers = Any reiser4 kernel module (reiser4progs package) has a software format release number (SFRN) 4.X.Y, where X and Y are major and minor SFRN respectively. Major SFRN of reiser4 module (reiser4progs) is defined as the maximal serial number of supported disk format plugin (see definition of the struct disk_format_plugin). Minor SFRN of reiser4 module (reiser4progs) is defined as version number of the plugin library. SFRN of reiser4 kernel module is reported by kernel (look for the string "Loading Reiser4 (format release: 4.X.Y)" in kernel messages. 
SFRN of reiser4progs is reported by every utility provided by this package when specifying option -V (Look for the string "Format release 4.X.Y"). Don't confuse SFRN and the version number of reiser4progs package! The last one takes into account all software changes (including fully compatible ones). = Compatibility of software upgrades = Every upgrade of reiser4 kernel module and reiser4progs performed in accordance with the development model (section 1) is automatically backward compatible. Every upgrade of reiser4progs SFRN is forward incompatible. That is, reiser4progs will refuse to with any partition with greater DFVN. This is for more reliability: fsck treats any "unknown" objects as a corruption. "Major" upgrades of reiser4 kernel module (4.X.Y) -> (4.X+1.Y) are fully forward incompatible. "Minor" upgrades of reiser4 kernel module (4.X.Y) -> (4.X.Y+1) are partially forward compatible. = Disk format version numbers = Any reiser4 partition has a disk format versions number (DFVN) 4.A.B, where A and B are respectively major and minor DFVN. Major DFVN is defined as disk format plugin id specified when formatting the partition. Major DFVN never changes. Minor DFVN is defined as the minimal SFRN of reiser4 kernel module (and reiser4progs), which is fully compatible with this partition. Initially every partition gets minor DFVN equal to minor SFRN of the mkfs.reiser4 utility when formatting the partition. Further minor DFVN can be upgraded. Any partition 4.A.B can be correctly handled only by reiser4progs 4.U.V, where A <= U and B <= V. Otherwise, reiser4progs will refuse to work with such partition (user will be suggested to upgrade reiser4progs). Any partition 4.A.B can be fully mounted only by kernel module 4.U.B, where A <= U. If A > U, then mount will fail with a kernel warning about forward incompatibility and a suggestion to upgrade the kernel. 
When mounting a partition 4.A.B by kernel module 4.U.V, where A <= U, B != V, then 2 scenarios are possible: If B > V, then mount will succeed, but kernel will warn about partial forward compatibility (it means that not all disk objects will be accessible in that mount session). If B < V, then mount will succeed. Kernel will upgrade DFVN in the primary superblock with the suggestion to upgrade DFVN in the superblock replicas (backup blocks) by running fsck.reiser4 --fix. DFVN can be fully upgraded by any reiser4.fsck utility of proper SFRN. Reiser4 kernel module upgrades DFVN only in the primary superblock. Downgrades of DFVN are not supported/planned. DFVN is reported by kernel at mount time (look for the string: "reiser4: found disk format 4.A.B"). DFVN is also reported by debugfs.reiser4 utility in default mode. [[category:Reiser4]] 3d979c937a05756d10015de5ae25b14a0b85b1b6 4107 4105 2016-01-14T11:34:40Z Edward 4 /* Disk format version numbers */ = Why worry about compatibility? = Reiser4 is in permanent development. New plugins managing new disk objects get added systematically. And the common problem is that old software is unaware of new objects. The biggest concern is old fsck utilities, which treat the new "unknown" objects as corruption. So we need a mechanism to restrict, or even prevent access to new objects for old software. = Reiser4 development policy = Here we discuss things needed to keep a track of compatibility, so we don't consider upgrades, which are fully backward and forward compatible (like adding discard/TRIM support). Other upgrades of reiser4 kernel module (reiser4progs) always look like adding a plugin or a set of plugins of existing, or new interfaces. Hence, such upgrades are always backward compatible. With every such upgrade developer has to increment plugin library version number of both, reiser4 kernel module and reiser4progs. See definition of the macro PLUGIN_LIBRARY_VERSION in reiser4 kernel module and reiser4progs. 
= Software Format Release Numbers = Any reiser4 kernel module (reiser4progs package) has a software format release number (SFRN) 4.X.Y, where X and Y are major and minor SFRN respectively. Major SFRN of reiser4 module (reiser4progs) is defined as the maximal serial number of supported disk format plugin (see definition of the struct disk_format_plugin). Minor SFRN of reiser4 module (reiser4progs) is defined as version number of the plugin library. SFRN of reiser4 kernel module is reported by kernel (look for the string "Loading Reiser4 (format release: 4.X.Y)" in kernel messages. SFRN of reiser4progs is reported by every utility provided by this package when specifying option -V (Look for the string "Format release 4.X.Y"). Don't confuse SFRN and the version number of reiser4progs package! The last one takes into account all software changes (including fully compatible ones). = Compatibility of software upgrades = Every upgrade of reiser4 kernel module and reiser4progs performed in accordance with the development model (section 1) is automatically backward compatible. Every upgrade of reiser4progs SFRN is forward incompatible. That is, reiser4progs will refuse to with any partition with greater DFVN. This is for more reliability: fsck treats any "unknown" objects as a corruption. "Major" upgrades of reiser4 kernel module (4.X.Y) -> (4.X+1.Y) are fully forward incompatible. "Minor" upgrades of reiser4 kernel module (4.X.Y) -> (4.X.Y+1) are partially forward compatible. = Disk format version numbers = Any reiser4 partition has a disk format versions number (DFVN) 4.A.B, where A and B are respectively major and minor DFVN. Major DFVN is defined as disk format plugin id specified when formatting the partition. Major DFVN never changes. Minor DFVN is defined as the minimal SFRN of reiser4 kernel module (and reiser4progs), which is fully compatible with this partition. 
Initially every partition gets minor DFVN equal to minor SFRN of the mkfs.reiser4 utility when formatting the partition. Further minor DFVN can be upgraded. Any partition 4.A.B can be correctly handled only by reiser4progs 4.U.V, where A <= U and B <= V. Otherwise, reiser4progs will refuse to work with such partition (user will be suggested to upgrade reiser4progs). Any partition 4.A.B can be fully mounted only by kernel module 4.U.B, where A <= U. If A > U, then mount will fail with a kernel warning about forward incompatibility and a suggestion to upgrade the kernel. When mounting a partition 4.A.B by kernel module 4.U.V, where A <= U, B != V, then 2 scenarios are possible: If B > V, then mount will succeed, but kernel will warn about partial forward compatibility (it means that not all disk objects will be accessible in that mount session). If B < V, then mount will succeed. Kernel will upgrade DFVN in the primary superblock with the suggestion to upgrade DFVN in the superblock replicas (backup blocks) by running fsck.reiser4 --fix. DFVN can be fully upgraded by any reiser4.fsck utility of proper SFRN. Reiser4 kernel module upgrades DFVN only in the primary superblock. Downgrades of DFVN are not supported/planned. DFVN is reported by kernel at mount time (look for the string: "reiser4: found disk format 4.A.B"). DFVN is also reported by debugfs.reiser4 utility in default mode. [[category:Reiser4]] fbbfa6be4e6642f9421c5aee2caa42eb1be84b01 4105 4103 2016-01-14T11:33:17Z Edward 4 /* Disk format version numbers */ = Why worry about compatibility? = Reiser4 is in permanent development. New plugins managing new disk objects get added systematically. And the common problem is that old software is unaware of new objects. The biggest concern is old fsck utilities, which treat the new "unknown" objects as corruption. So we need a mechanism to restrict, or even prevent access to new objects for old software. 
= Reiser4 development policy = Here we discuss things needed to keep a track of compatibility, so we don't consider upgrades, which are fully backward and forward compatible (like adding discard/TRIM support). Other upgrades of reiser4 kernel module (reiser4progs) always look like adding a plugin or a set of plugins of existing, or new interfaces. Hence, such upgrades are always backward compatible. With every such upgrade developer has to increment plugin library version number of both, reiser4 kernel module and reiser4progs. See definition of the macro PLUGIN_LIBRARY_VERSION in reiser4 kernel module and reiser4progs. = Software Format Release Numbers = Any reiser4 kernel module (reiser4progs package) has a software format release number (SFRN) 4.X.Y, where X and Y are major and minor SFRN respectively. Major SFRN of reiser4 module (reiser4progs) is defined as the maximal serial number of supported disk format plugin (see definition of the struct disk_format_plugin). Minor SFRN of reiser4 module (reiser4progs) is defined as version number of the plugin library. SFRN of reiser4 kernel module is reported by kernel (look for the string "Loading Reiser4 (format release: 4.X.Y)" in kernel messages. SFRN of reiser4progs is reported by every utility provided by this package when specifying option -V (Look for the string "Format release 4.X.Y"). Don't confuse SFRN and the version number of reiser4progs package! The last one takes into account all software changes (including fully compatible ones). = Compatibility of software upgrades = Every upgrade of reiser4 kernel module and reiser4progs performed in accordance with the development model (section 1) is automatically backward compatible. Every upgrade of reiser4progs SFRN is forward incompatible. That is, reiser4progs will refuse to with any partition with greater DFVN. This is for more reliability: fsck treats any "unknown" objects as a corruption. 
"Major" upgrades of reiser4 kernel module (4.X.Y) -> (4.X+1.Y) are fully forward incompatible. "Minor" upgrades of reiser4 kernel module (4.X.Y) -> (4.X.Y+1) are partially forward compatible. = Disk format version numbers = Any reiser4 partition has a disk format versions number (DFVN) 4.A.B, where A and B are respectively major and minor DFVN. Major DFVN is defined as disk format plugin id specified when formatting the partition. Major DFVN never changes. Minor DFVN is defined as the minimal software release number of reiser4 kernel module (and reiser4progs), which is fully compatible with this partition. Initially every partition gets minor DFVN equal to minor SFRN of the mkfs.reiser4 utility when formatting the partition. Further minor DFVN can be upgraded. Any partition 4.A.B can be correctly handled only by reiser4progs 4.U.V, where A <= U and B <= V. Otherwise, reiser4progs will refuse to work with such partition (user will be suggested to upgrade reiser4progs). Any partition 4.A.B can be fully mounted only by kernel module 4.U.B, where A <= U. If A > U, then mount will fail with a kernel warning about forward incompatibility and a suggestion to upgrade the kernel. When mounting a partition 4.A.B by kernel module 4.U.V, where A <= U, B != V, then 2 scenarios are possible: If B > V, then mount will succeed, but kernel will warn about partial forward compatibility (it means that not all disk objects will be accessible in that mount session). If B < V, then mount will succeed. Kernel will upgrade DFVN in the primary superblock with the suggestion to upgrade DFVN in the superblock replicas (backup blocks) by running fsck.reiser4 --fix. DFVN can be fully upgraded by any reiser4.fsck utility of proper SFRN. Reiser4 kernel module upgrades DFVN only in the primary superblock. Downgrades of DFVN are not supported/planned. DFVN is reported by kernel at mount time (look for the string: "reiser4: found disk format 4.A.B"). 
DFVN is also reported by the debugfs.reiser4 utility in its default mode.

[[category:Reiser4]]

Reiser4 discard support

Starting from reiser4-for-3.16.2, SSD users can mount their reiser4 partitions with the "discard" option, and the file system will issue discard requests to inform the device that blocks are no longer used. In reiser4, issuing discard requests is a delayed action (like many other actions, including block allocation, compression, etc.): discard requests are accumulated as blocks are released, and issued as a background process after all other regular requests. This delayed technique produces better-quality discard requests, because the requests get merged while they accumulate. Another advantage of delayed discard is that "non-queued TRIM" is not a problem for us.

= Implementation details =

Managing discard requests is the business of the reiser4 transaction manager.
This is because block deallocation is a kind of event tracked by the transaction manager. Deallocated extents are captured by the respective transaction atom. At commit time, all the extents are sorted, merged, and discarded right after the journalled blocks are overwritten at their permanent location on disk. After issuing the discard requests, we complete the transaction by deallocating the respective blocks in the working bitmap (which always resides in memory). Thus, we guarantee consistency (nobody touches our extents before discard completion).

We don't record information about discarded extents in the journal. The worst thing that can happen after a system crash is that a number of discard requests are missed. However, this doesn't break the consistency of the file system. For such unpleasant situations we'll provide support for the FITRIM ioctl later.

[[PreciseDiscard|Precise discard]] is available with the patch against reiser4-for-3.17.3, which can be found on [http://sourceforge.net/projects/reiser4/files/patches/3.17.3-reiser4-precise-discard-support.patch.gz SourceForge]. '''WARNING:''' Don't use it for important data for now. Even if no problems are visible, please check your partition with fsck.reiser4 as frequently as possible, and [[Mailinglists|report]] any inconsistencies found.

= Discard support in reiser4progs =

When formatting your SSD partition with mkfs.reiser4, use the option -d: it will issue a discard request for the whole partition before creating the reiser4 structures on it. This option is available in [http://sourceforge.net/projects/reiser4/files/reiser4-utils/reiser4progs reiser4progs-1.0.9]. In order to build reiser4progs-1.0.9 you will need [http://sourceforge.net/projects/reiser4/files/reiser4-utils/libaal libaal-1.0.6].
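Putting the mkfs and mount pieces together, a typical session for an SSD partition might look like the following. The device name is hypothetical; adjust it for your setup before running anything:

```shell
# Hypothetical device name -- replace /dev/sdX1 with your actual SSD partition.
# -d (reiser4progs >= 1.0.9) discards the whole partition before formatting:
mkfs.reiser4 -d /dev/sdX1
# The "discard" mount option enables the delayed online discard described above:
mount -t reiser4 -o discard /dev/sdX1 /mnt
```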
[[category:Reiser4]] 6351d6a652cf5b4b638df8cb2d8eabe6ddc21867 4042 4011 2015-02-03T17:47:30Z Edward 4 Starting from reiser4-for-3.16.2 SSD users can mount their reiser4 partitions with the option "discard" and the file system will issue discard requests to inform the device that blocks are not longer used. In reiser4 issuing discard requests is a delayed action (like many other actions including block allocation, compression, etc). It means that discard requests are accumulated as the release of blocks and issued as background process after issuing of all other usual requests. Such delayed technique allows to issue discard requests of better quality, because the discard requests get merged in the process of accumulation. Another advantage of the delayed discard is that the "non-queued TRIM" is not a problem for us. = Implementation details = Managing discard requests is a business of reiser4 transaction manager. This is because blocks deallocation is a kind of events which are tracked by the transaction manager. Deallocted extents are captured by a respective transaction atom. At commit time all the extents are sorted, merged and discarded right after overwriting journalled blocks at their permanent location on disk. After issuing the discard requests we complete the transaction by deallocating respective blocks at working bitmap (which always resides in memory). Thus, we guarantee consistency (nobody touches our extents before discard completion). We don't record information about discard extents in the journal. The worst thing that can happen after system crash is missing a number of discard requests. It doesn't break consistency of the file systems however. For such unpleasant situations we'll provide support of FITRIM ioctl later. [[PreciseDiscard|Precise discard]] is available with the patch against reiser4-for-3.17.3, which can be found on [http://sourceforge.net/projects/reiser4/files/patches/ SourceForge]. 
'''WARNING:''' Do not use it for important data yet. Even if you see no visible problems, please check your partition with fsck.reiser4 as often as possible, and [[Mailinglists|report]] any inconsistencies you find.

= Discard support in reiser4progs =

When formatting your SSD partition with mkfs.reiser4, use the option -d: it issues a discard request for the whole partition before creating the reiser4 structures on it. This option is available in [http://sourceforge.net/projects/reiser4/files/reiser4-utils/reiser4progs reiser4progs-1.0.9]. To build reiser4progs-1.0.9 you will need [http://sourceforge.net/projects/reiser4/files/reiser4-utils/libaal libaal-1.0.6].

[[category:Reiser4]]
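The build and format steps can be sketched as a shell session. The device name <code>/dev/sdb1</code> is a placeholder, and the autotools invocation is an assumption about how the release tarballs build; adjust both for your system:

```shell
# Build and install libaal first, then reiser4progs (assumed autotools layout)
tar xzf libaal-1.0.6.tar.gz && cd libaal-1.0.6
./configure && make && sudo make install
cd ..
tar xzf reiser4progs-1.0.9.tar.gz && cd reiser4progs-1.0.9
./configure && make && sudo make install
cd ..

# Discard the whole partition while formatting the SSD, then mount with
# online discard enabled; /dev/sdb1 is a hypothetical device
sudo mkfs.reiser4 -d /dev/sdb1
sudo mount -o discard /dev/sdb1 /mnt
```

With the "discard" mount option set, the file system will then issue the delayed discard requests described above as blocks are released.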
Reiser4 patchsets

'''Please help with [[Reiser4_Howto|testing Reiser4]] and [[mailinglists|report]] any issues to the mailinglist!'''

= Stable patchsets =

As <code>reiser4</code> is still not in [https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git mainline], Edward Shishkin is kind enough to provide [https://sourceforge.net/projects/reiser4/files/ patches] for the stable versions of the Linux kernel.

= Standalone <code>reiser4</code> tree =

According to [http://www.spinics.net/lists/reiserfs-devel/msg05173.html Reiser4 Upstream Git Repositories on GitHub], this is how the standalone tree can be used.

Download the latest Reiser4 kernel patch and apply it to the kernel:

 wget <nowiki>https://sourceforge.net/</nowiki>projects/reiser4/files/reiser4-for-linux-5.x/reiser4-for-5.0.0.patch.gz
 cd /usr/local/src/linux-git
 git checkout -b reiser-5 v5.0
 gzip -dc ~/reiser4-for-5.0.0.patch.gz | patch -p1

Replace <code>fs/reiser4</code> with the [https://github.com/edward6/reiser4 standalone version]:

 rm -r fs/reiser4
 cd ../
 git clone <nowiki>https://github.com/</nowiki>edward6/reiser4 reiser4-git
 cd reiser4-git/
 git archive --format=tar --prefix=reiser4<font color="red">/</font> HEAD | tar -C ../linux-git/fs/ -xvf -

Be sure to adjust the directories as necessary on your system! With that, we should be able to build the kernel now:

 cd ../linux-git/
 make menuconfig
 > enable CONFIG_BLOCK
 > enable CONFIG_REISER4_FS
 make ...
= Distribution packages =

== openSUSE ==

openSUSE is building reiser4 packages too:

* [https://build.opensuse.org/package/show/home:doiggl/kernel-reiser4 kernel-reiser4] (doiggl)
* [https://build.opensuse.org/package/show/home:doiggl/reiser4-kmp reiser4-kmp] (doiggl)
* [https://build.opensuse.org/package/show/filesystems/reiser4progs reiser4progs]

[[category:Reiser4]]
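The standalone-tree walkthrough above can be wrapped into a single script. This is only a sketch of the procedure described on this page: the source path, patch location, and branch name are the examples used here and will likely need adjusting on your system:

```shell
#!/bin/sh
# Sketch of the standalone reiser4 tree procedure; adjust paths and versions.
set -e
SRC=/usr/local/src/linux-git        # kernel git tree (assumed location)
PATCH=~/reiser4-for-5.0.0.patch.gz  # downloaded from SourceForge

cd "$SRC"
git checkout -b reiser-5 v5.0       # base the working branch on v5.0
gzip -dc "$PATCH" | patch -p1       # apply the stable reiser4 patch

rm -r fs/reiser4                    # drop the patched-in copy of fs/reiser4
git clone https://github.com/edward6/reiser4 ../reiser4-git
# Export the standalone tree into fs/reiser4
# (the trailing slash in --prefix is what creates the directory)
(cd ../reiser4-git && git archive --format=tar --prefix=reiser4/ HEAD) \
  | tar -C "$SRC/fs" -xf -

# Enable CONFIG_BLOCK and CONFIG_REISER4_FS in the menu, then build
make menuconfig
make
```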
With that, we should be able to build the kernel now: cd ../linux-git/ > enable CONFIG_BLOCK > enable CONFIG_REISER4_FS = Distribution packages = == openSUSE == openSUSE is building reiser4 packages too: * [https://build.opensuse.org/package/show?package=kernel-reiser4&project=home%3Adoiggl reiser4] (doiggl) * [https://build.opensuse.org/package/show?package=reiser4-kmp&project=drivers%3Afilesystems reiser4-kmp] (Jeff Mahoney) Please see [https://build.opensuse.org/project/show?project=drivers:filesystems drivers:filesystems] for more external file system modules. [[category:Reiser4]] ec7cb5cf83534f7d45f0d8b51b0f0682c8040325 4173 4171 2016-09-24T22:29:47Z Chris goe 2 Fedora no longer carries reiser4 packages :-\ '''Please help with [[Reiser4_Howto|testing Reiser4]] and [[mailinglists|report]] any issues to the mailinglist!''' = Stable patchsets = As <tt>reiser4</tt> is still not in [https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git mainline], Edward Shishkin is kind enough to provide [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-3.x/ patches] for the stable version of the Linux kernel. = Standalone <tt>reiser4</tt> tree = According to [http://www.spinics.net/lists/reiserfs-devel/msg05173.html Reiser4 Upstream Git Repositories on GitHub], this is how the the standalone tree can be used. Download the latest [https://sourceforge.net/projects/reiser4/ Reiser4] kernel patch and patch the kernel: wget https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-4.x/reiser4-for-4.7.0.patch.gz cd /usr/local/src/linux-git git checkout -b local-4.7 v4.7 gzip -dc ~/reiser4-for-4.7.0.patch.gz | patch -p1 Replace <tt>fs/reiser4</tt> with the standalone version: rm -r fs/reiser4 cd ../ git clone https://github.com/edward6/reiser4 reiser4-git cd reiser4-git/ git archive --format=tar --prefix=reiser4<font color="red">/</font> HEAD | tar -C ../linux-git/fs/ -xvf - Be sure to adjust the directories as necessary on your system! 
With that, we should be able to build the kernel now: cd ../linux-git/ > enable CONFIG_BLOCK > enable CONFIG_REISER4_FS = Distribution packages = == openSUSE == openSUSE is building reiser4 packages too: * [https://build.opensuse.org/package/show?package=kernel-reiser4&project=home%3Adoiggl reiser4] (doiggl) * [https://build.opensuse.org/package/show?package=reiser4-kmp&project=drivers%3Afilesystems reiser4-kmp] (Jeff Mahoney) Please see [https://build.opensuse.org/project/show?project=drivers:filesystems drivers:filesystems] for more external file system modules. [[category:Reiser4]] bca4fda5bf6f8ee7ff1c0ff6038aa67dd68ded17 4171 4169 2016-09-24T22:26:46Z Chris goe 2 the -mm tree no longer carries fs/reiser4 :-\ '''Please help with [[Reiser4_Howto|testing Reiser4]] and [[mailinglists|report]] any issues to the mailinglist!''' = Stable patchsets = As <tt>reiser4</tt> is still not in [https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git mainline], Edward Shishkin is kind enough to provide [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-3.x/ patches] for the stable version of the Linux kernel. = Standalone <tt>reiser4</tt> tree = According to [http://www.spinics.net/lists/reiserfs-devel/msg05173.html Reiser4 Upstream Git Repositories on GitHub], this is how the the standalone tree can be used. Download the latest [https://sourceforge.net/projects/reiser4/ Reiser4] kernel patch and patch the kernel: wget https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-4.x/reiser4-for-4.7.0.patch.gz cd /usr/local/src/linux-git git checkout -b local-4.7 v4.7 gzip -dc ~/reiser4-for-4.7.0.patch.gz | patch -p1 Replace <tt>fs/reiser4</tt> with the standalone version: rm -r fs/reiser4 cd ../ git clone https://github.com/edward6/reiser4 reiser4-git cd reiser4-git/ git archive --format=tar --prefix=reiser4<font color="red">/</font> HEAD | tar -C ../linux-git/fs/ -xvf - Be sure to adjust the directories as necessary on your system! 
With that, we should be able to build the kernel now: cd ../linux-git/ > enable CONFIG_BLOCK > enable CONFIG_REISER4_FS = Distribution packages = == openSUSE == openSUSE is building reiser4 packages too: * [https://build.opensuse.org/package/show?package=kernel-reiser4&project=home%3Adoiggl reiser4] (doiggl) * [https://build.opensuse.org/package/show?package=reiser4-kmp&project=drivers%3Afilesystems reiser4-kmp] (Jeff Mahoney) Please see [https://build.opensuse.org/project/show?project=drivers:filesystems drivers:filesystems] for more external file system modules. == Fedora == Fedora is hosting a few reiser4 packages too: * <s>[http://viji.fedorapeople.org/reiser4/F13/x86_64/ Fedora 13] (Viji V Nair)</s> [[category:Reiser4]] 45a977012a8bb80916c274c33aeb77fec781f1b3 4169 4167 2016-09-24T22:23:02Z Chris goe 2 how to compile with reiser4-git '''Please help with [[Reiser4_Howto|testing Reiser4]] and [[mailinglists|report]] any issues to the mailinglist!''' = Stable patchsets = As <tt>reiser4</tt> is still not in [https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git mainline], Edward Shishkin is kind enough to provide [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-3.x/ patches] for the stable version of the Linux kernel. = Standalone <tt>reiser4</tt> tree = According to [http://www.spinics.net/lists/reiserfs-devel/msg05173.html Reiser4 Upstream Git Repositories on GitHub], this is how the the standalone tree can be used. 
Download the latest [https://sourceforge.net/projects/reiser4/ Reiser4] kernel patch and patch the kernel: wget https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-4.x/reiser4-for-4.7.0.patch.gz cd /usr/local/src/linux-git git checkout -b local-4.7 v4.7 gzip -dc ~/reiser4-for-4.7.0.patch.gz | patch -p1 Replace <tt>fs/reiser4</tt> with the standalone version: rm -r fs/reiser4 cd ../ git clone https://github.com/edward6/reiser4 reiser4-git cd reiser4-git/ git archive --format=tar --prefix=reiser4<font color="red">/</font> HEAD | tar -C ../linux-git/fs/ -xvf - Be sure to adjust the directories as necessary on your system! With that, we should be able to build the kernel now: cd ../linux-git/ > enable CONFIG_BLOCK > enable CONFIG_REISER4_FS = Development patchsets = <s>Andrew Morton is maintaining the [https://git.kernel.org/?p=linux/kernel/git/mhocko/mm.git mm-tree] that includes [[Reiser4]] as well. Be aware that this tree in high flux and often filled with exotic stuff (yes, more exotic than reiser4 ;-))</s> = Distribution packages = == openSUSE == openSUSE is building reiser4 packages too: * [https://build.opensuse.org/package/show?package=kernel-reiser4&project=home%3Adoiggl reiser4] (doiggl) * [https://build.opensuse.org/package/show?package=reiser4-kmp&project=drivers%3Afilesystems reiser4-kmp] (Jeff Mahoney) Please see [https://build.opensuse.org/project/show?project=drivers:filesystems drivers:filesystems] for more external file system modules. 
== Fedora == Fedora is hosting a few reiser4 packages too: * <s>[http://viji.fedorapeople.org/reiser4/F13/x86_64/ Fedora 13] (Viji V Nair)</s> [[category:Reiser4]] e21ceb1707b56cacacb7c0a818d7b910cbb56e64 4167 4001 2016-09-24T22:08:37Z Chris goe 2 Standalone reiser4 tree '''Please help with [[Reiser4_Howto|testing Reiser4]] and [[mailinglists|report]] any issues to the mailinglist!''' = Stable patchsets = As <tt>reiser4</tt> is still not in [https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git mainline], Edward Shishkin is kind enough to provide [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-3.x/ patches] for the stable version of the Linux kernel. = Standalone <tt>reiser4</tt> tree = According to [http://www.spinics.net/lists/reiserfs-devel/msg05173.html Reiser4 Upstream Git Repositories on GitHub], this is how the the standalone tree can be used: TBD . Patch the respective kernel with the latest available stuff from Sourceforge; . cd to the "fs" directory; . delete the directory reiser4; . instead of the deleted stuff clone the standalone reiser4 repository from Github; . build and install as usual. = Development patchsets = <s>Andrew Morton is maintaining the [https://git.kernel.org/?p=linux/kernel/git/mhocko/mm.git mm-tree] that includes [[Reiser4]] as well. Be aware that this tree in high flux and often filled with exotic stuff (yes, more exotic than reiser4 ;-))</s> = Distribution packages = == openSUSE == openSUSE is building reiser4 packages too: * [https://build.opensuse.org/package/show?package=kernel-reiser4&project=home%3Adoiggl reiser4] (doiggl) * [https://build.opensuse.org/package/show?package=reiser4-kmp&project=drivers%3Afilesystems reiser4-kmp] (Jeff Mahoney) Please see [https://build.opensuse.org/project/show?project=drivers:filesystems drivers:filesystems] for more external file system modules. 
== Fedora == Fedora is hosting a few reiser4 packages too: * <s>[http://viji.fedorapeople.org/reiser4/F13/x86_64/ Fedora 13] (Viji V Nair)</s> [[category:Reiser4]] 3d75bf06742a393b9b79c96d738b2a5b94cf90e3 4001 2551 2014-08-11T21:01:18Z Chris goe 2 sf url updated '''Please help with [[Reiser4_Howto|testing Reiser4]] and [[mailinglists|report]] any issues to the mailinglist!''' = Stable patchsets = As <tt>reiser4</tt> is still not in [https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git mainline], Edward Shishkin is kind enough to provide [http://sourceforge.net/projects/reiser4/files/reiser4-for-linux-3.x/ patches] for the stable version of the Linux kernel. = Development patchsets = <s>Andrew Morton is maintaining the [https://git.kernel.org/?p=linux/kernel/git/mhocko/mm.git mm-tree] that includes [[Reiser4]] as well. Be aware that this tree in high flux and often filled with exotic stuff (yes, more exotic than reiser4 ;-))</s> = Distribution packages = == openSUSE == openSUSE is building reiser4 packages too: * [https://build.opensuse.org/package/show?package=kernel-reiser4&project=home%3Adoiggl reiser4] (doiggl) * [https://build.opensuse.org/package/show?package=reiser4-kmp&project=drivers%3Afilesystems reiser4-kmp] (Jeff Mahoney) Please see [https://build.opensuse.org/project/show?project=drivers:filesystems drivers:filesystems] for more external file system modules. 
== Fedora == Fedora is hosting a few reiser4 packages too: * <s>[http://viji.fedorapeople.org/reiser4/F13/x86_64/ Fedora 13] (Viji V Nair)</s> [[category:Reiser4]] ec519a97dd1bc67b2c66e410ca796ecf042dbc0c 2551 2541 2012-09-25T17:50:35Z Chris goe 2 no more reiser4 in -mm '''Please help with [[Reiser4_Howto|testing Reiser4]] and [[mailinglists|report]] any issues to the mailinglist!''' = Stable patchsets = As <tt>reiser4</tt> is still not in [https://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git mainline], Edward Shishkin is kind enough to provide [http://sourceforge.net/projects/reiser4/ patches] for the stable version of the Linux kernel. = Development patchsets = <s>Andrew Morton is maintaining the [https://git.kernel.org/?p=linux/kernel/git/mhocko/mm.git mm-tree] that includes [[Reiser4]] as well. Be aware that this tree in high flux and often filled with exotic stuff (yes, more exotic than reiser4 ;-))</s> = Distribution packages = == openSUSE == openSUSE is building reiser4 packages too: * [https://build.opensuse.org/package/show?package=kernel-reiser4&project=home%3Adoiggl reiser4] (doiggl) * [https://build.opensuse.org/package/show?package=reiser4-kmp&project=drivers%3Afilesystems reiser4-kmp] (Jeff Mahoney) Please see [https://build.opensuse.org/project/show?project=drivers:filesystems drivers:filesystems] for more external file system modules. 
== Fedora == Fedora is hosting a few reiser4 packages too: * <s>[http://viji.fedorapeople.org/reiser4/F13/x86_64/ Fedora 13] (Viji V Nair)</s> [[category:Reiser4]] 7a32a2d08c374b12d16d3949084454d6ae1e2980 2541 2152 2012-09-25T17:45:43Z Chris goe 2 url cleanup '''Please help with [[Reiser4_Howto|testing Reiser4]] and [[mailinglists|report]] any issues to the mailinglist!''' = Stable patchsets = As <tt>reiser4</tt> is still not in [http://git.kernel.org/?p=linux%2Fkernel%2Fgit%2Ftorvalds%2Flinux-2.6.git;a=summary mainline], Edward Shishkin is kind enough to provide [http://sourceforge.net/projects/reiser4/ patches] for the stable version of the Linux kernel. = Development patchsets = Andrew Morton is maintaining the [http://git.zen-kernel.org/mmotm/ MMOTM] tree ([http://www.kernel.org/patchtypes/mm.html mm]-of-the-moment) that includes [[Reiser4]] as well. Snapshot patches from that tree can be found on [http://userweb.kernel.org/~akpm/mmotm/ kernel.org]. Be aware that this tree in high flux and often filled with exotic stuff (yes, more exotic than reiser4 ;-)) = Distribution packages = == openSUSE == openSUSE is building reiser4 packages too: * [https://build.opensuse.org/package/show?package=kernel-reiser4&project=home%3Adoiggl reiser4] (doiggl) * [https://build.opensuse.org/package/show?package=reiser4-kmp&project=drivers%3Afilesystems reiser4-kmp] (Jeff Mahoney) Please see [https://build.opensuse.org/project/show?project=drivers:filesystems drivers:filesystems] for more external file system modules. 
== Fedora == Fedora is hosting a few reiser4 packages too: * <s>[http://viji.fedorapeople.org/reiser4/F13/x86_64/ Fedora 13] (Viji V Nair)</s> [[category:Reiser4]] ef90fecf1ed8e210cfa734240a8bf2ed2bc65b32 2152 2142 2010-10-27T23:08:50Z Chris goe 2 link removed '''Please help with [[Reiser4_Howto|testing Reiser4]] and [[mailinglists|report]] any issues to the mailinglist!''' == Stable patchsets == As <tt>reiser4</tt> is still not in [http://git.kernel.org/?p=linux%2Fkernel%2Fgit%2Ftorvalds%2Flinux-2.6.git;a=summary mainline], Edward Shishkin is kind enough to provide [http://www.kernel.org/pub/linux/kernel/people/edward/reiser4/reiser4-for-2.6/?C=M;O=D patches] for the stable version of the Linux kernel. == Development patchsets == Andrew Morton is maintaining the [http://git.zen-kernel.org/mmotm/ MMOTM] tree ([http://www.kernel.org/patchtypes/mm.html mm]-of-the-moment) that includes [[Reiser4]] as well. Snapshot patches from that tree can be found on [http://userweb.kernel.org/~akpm/mmotm/ kernel.org]. Be aware that this tree in high flux and often filled with exotic stuff (yes, more exotic than reiser4 ;-)) == Distribution packages == === openSUSE === openSUSE is building [https://build.opensuse.org/package/show?project=drivers:filesystems&package=reiser4-kmp build reiser4 packages] too: * [http://download.opensuse.org/repositories/drivers://filesystems/openSUSE_11.1/ openSUSE 11.1] (Jeff Mahoney) * [http://download.opensuse.org/repositories/drivers://filesystems/openSUSE_11.2/ openSUSE 11.2] (Jeff Mahoney) * [http://download.opensuse.org/repositories/home:/doiggl/openSUSE_11.3/ openSUSE 11.3] (doiggl) * Reiser4 for [http://download.opensuse.org/repositories/drivers://filesystems/openSUSE_Factory/ openSUSE Factory] (Jeff Mahoney) Please see [https://build.opensuse.org/project/show?project=drivers:filesystems drivers:filesystems] for more external file system modules. 
=== Fedora === Fedora is hosting a few reiser4 packages too: * [http://viji.fedorapeople.org/reiser4/F13/x86_64/ Fedora 13] (Viji V Nair) [[category:Reiser4]] 341c3f3692aa0ec565886f015cab0cb23b40b381 2142 2132 2010-10-27T23:08:35Z Chris goe 2 fedora added '''Please help with [[Reiser4_Howto|testing Reiser4]] and [[mailinglists|report]] any issues to the mailinglist!''' == Stable patchsets == As <tt>reiser4</tt> is still not in [http://git.kernel.org/?p=linux%2Fkernel%2Fgit%2Ftorvalds%2Flinux-2.6.git;a=summary mainline], Edward Shishkin is kind enough to provide [http://www.kernel.org/pub/linux/kernel/people/edward/reiser4/reiser4-for-2.6/?C=M;O=D patches] for the stable version of the Linux kernel. == Development patchsets == Andrew Morton is maintaining the [http://git.zen-kernel.org/mmotm/ MMOTM] tree ([http://www.kernel.org/patchtypes/mm.html mm]-of-the-moment) that includes [[Reiser4]] as well. Snapshot patches from that tree can be found on [http://userweb.kernel.org/~akpm/mmotm/ kernel.org]. Be aware that this tree in high flux and often filled with exotic stuff (yes, more exotic than reiser4 ;-)) == Distribution packages == === openSUSE === [http://opensuse.org/ openSUSE] is building [https://build.opensuse.org/package/show?project=drivers:filesystems&package=reiser4-kmp build reiser4 packages] too: * [http://download.opensuse.org/repositories/drivers://filesystems/openSUSE_11.1/ openSUSE 11.1] (Jeff Mahoney) * [http://download.opensuse.org/repositories/drivers://filesystems/openSUSE_11.2/ openSUSE 11.2] (Jeff Mahoney) * [http://download.opensuse.org/repositories/home:/doiggl/openSUSE_11.3/ openSUSE 11.3] (doiggl) * Reiser4 for [http://download.opensuse.org/repositories/drivers://filesystems/openSUSE_Factory/ openSUSE Factory] (Jeff Mahoney) Please see [https://build.opensuse.org/project/show?project=drivers:filesystems drivers:filesystems] for more external file system modules. 
=== Fedora === Fedora is hosting a few reiser4 packages too: * [http://viji.fedorapeople.org/reiser4/F13/x86_64/ Fedora 13] (Viji V Nair) [[category:Reiser4]] 10c9d047a50999b5f02d9da0d632c3b09255add8 2132 2122 2010-10-27T23:05:39Z Chris goe 2 cleanup '''Please help with [[Reiser4_Howto|testing Reiser4]] and [[mailinglists|report]] any issues to the mailinglist!''' == Stable patchsets == As <tt>reiser4</tt> is still not in [http://git.kernel.org/?p=linux%2Fkernel%2Fgit%2Ftorvalds%2Flinux-2.6.git;a=summary mainline], Edward Shishkin is kind enough to provide [http://www.kernel.org/pub/linux/kernel/people/edward/reiser4/reiser4-for-2.6/?C=M;O=D patches] for the stable version of the Linux kernel. == Development patchsets == Andrew Morton is maintaining the [http://git.zen-kernel.org/mmotm/ MMOTM] tree ([http://www.kernel.org/patchtypes/mm.html mm]-of-the-moment) that includes [[Reiser4]] as well. Snapshot patches from that tree can be found on [http://userweb.kernel.org/~akpm/mmotm/ kernel.org]. Be aware that this tree in high flux and often filled with exotic stuff (yes, more exotic than reiser4 ;-)) == Distribution packages == Apparently [http://opensuse.org/ openSUSE] is building [https://build.opensuse.org/package/show?project=drivers:filesystems&package=reiser4-kmp build reiser4 packages] too: * [http://download.opensuse.org/repositories/drivers://filesystems/openSUSE_11.1/ openSUSE 11.1] (Jeff Mahoney) * [http://download.opensuse.org/repositories/drivers://filesystems/openSUSE_11.2/ openSUSE 11.2] (Jeff Mahoney) * [http://download.opensuse.org/repositories/home:/doiggl/openSUSE_11.3/ openSUSE 11.3] (doiggl) * Reiser4 for [http://download.opensuse.org/repositories/drivers://filesystems/openSUSE_Factory/ openSUSE Factory] (Jeff Mahoney) Please see [https://build.opensuse.org/project/show?project=drivers:filesystems drivers:filesystems] for more external file system modules. 
[[category:Reiser4]] 22c83a9a145ee31143b8c04d9b73ad5f6f3b5758 2122 2112 2010-10-27T23:00:19Z Chris goe 2 11.0 is no more, 11.2 added '''Please help with [[Reiser4_Howto|testing Reiser4]] and [[mailinglists|report]] any issues to the mailinglist!''' == Stable patchsets == As <tt>reiser4</tt> is still not in [http://git.kernel.org/?p=linux%2Fkernel%2Fgit%2Ftorvalds%2Flinux-2.6.git;a=summary mainline], Edward Shishkin is kind enough to provide [http://www.kernel.org/pub/linux/kernel/people/edward/reiser4/reiser4-for-2.6/?C=M;O=D patches] for the stable version of the Linux kernel. == Development patchsets == Andrew Morton is maintaining the [http://git.zen-kernel.org/mmotm/ MMOTM] tree ([http://www.kernel.org/patchtypes/mm.html mm]-of-the-moment) that includes [[Reiser4]] as well. Snapshot patches from that tree can be found on [http://userweb.kernel.org/~akpm/mmotm/ kernel.org]. Be aware that this tree in high flux and often filled with exotic stuff (yes, more exotic than reiser4 ;-)) == Distribution packages == Apparently [http://opensuse.org/ openSUSE] is trying to [https://build.opensuse.org/package/show?project=drivers:filesystems&package=reiser4-kmp build reiser4 packages] in their ''development'' and ''factory'' (trunk) releases: * Reiser4 for [http://download.opensuse.org/repositories/drivers://filesystems/openSUSE_11.1/ openSUSE 11.1] * Reiser4 for [http://download.opensuse.org/repositories/drivers://filesystems/openSUSE_11.2/ openSUSE 11.2] * Reiser4 for [http://download.opensuse.org/repositories/drivers://filesystems/openSUSE_Factory/ openSUSE Factory] (current) Jeff Mahoney maintains these packages in openSUSE's [https://build.opensuse.org/project/show?project=drivers:filesystems build service]. 
<small>(login needed)</small> [[category:Reiser4]] 4e08fd77cc3a903b77829fa223860cb262a81a87 2112 1640 2010-10-27T22:58:02Z Chris goe 2 url has changed '''Please help with [[Reiser4_Howto|testing Reiser4]] and [[mailinglists|report]] any issues to the mailinglist!''' == Stable patchsets == As <tt>reiser4</tt> is still not in [http://git.kernel.org/?p=linux%2Fkernel%2Fgit%2Ftorvalds%2Flinux-2.6.git;a=summary mainline], Edward Shishkin is kind enough to provide [http://www.kernel.org/pub/linux/kernel/people/edward/reiser4/reiser4-for-2.6/?C=M;O=D patches] for the stable version of the Linux kernel. == Development patchsets == Andrew Morton is maintaining the [http://git.zen-kernel.org/mmotm/ MMOTM] tree ([http://www.kernel.org/patchtypes/mm.html mm]-of-the-moment) that includes [[Reiser4]] as well. Snapshot patches from that tree can be found on [http://userweb.kernel.org/~akpm/mmotm/ kernel.org]. Be aware that this tree in high flux and often filled with exotic stuff (yes, more exotic than reiser4 ;-)) == Distribution packages == Apparently [http://opensuse.org/ openSUSE] is trying to [https://build.opensuse.org/package/show?project=drivers:filesystems&package=reiser4-kmp build reiser4 packages] in their ''development'' and ''factory'' (trunk) releases: * Reiser4 for [http://download.opensuse.org/repositories/drivers://filesystems/openSUSE_11.0/ openSUSE 11.0] * Reiser4 for [http://download.opensuse.org/repositories/drivers://filesystems/openSUSE_11.1/ openSUSE 11.1] * Reiser4 for [http://download.opensuse.org/repositories/drivers://filesystems/openSUSE_Factory/ openSUSE Factory] Jeff Mahoney maintains these packages in openSUSE's [https://build.opensuse.org/project/show?project=drivers:filesystems build service]. <small>(login needed)</small> [[category:Reiser4]] dd019e50f387f95c3b9abb6081d888c2028a6e45 1640 1507 2009-11-18T19:08:08Z Chris goe 2 zen-sources is dead, long live zen-kernel! 
'''Please help with [[Reiser4_Howto|testing Reiser4]] and [[mailinglists|report]] any issues to the mailinglist!''' == Stable patchsets == As <tt>reiser4</tt> is still not in [http://git.kernel.org/?p=linux%2Fkernel%2Fgit%2Ftorvalds%2Flinux-2.6.git;a=summary mainline], Edward Shishkin is kind enough to provide [http://www.kernel.org/pub/linux/kernel/people/edward/reiser4/reiser4-for-2.6/?C=M;O=D patches] for the stable version of the Linux kernel. == Development patchsets == Andrew Morton is maintaining the [http://git.zen-kernel.org/?p=kernel/mmotm.git;a=summary MMOTM] tree ([http://www.kernel.org/patchtypes/mm.html mm]-of-the-moment) that includes [[Reiser4]] as well. Snapshot patches from that tree can be found on [http://userweb.kernel.org/~akpm/mmotm/ kernel.org]. Be aware that this tree in high flux and often filled with exotic stuff (yes, more exotic than reiser4 ;-)) == Distribution packages == Apparently [http://opensuse.org/ openSUSE] is trying to [https://build.opensuse.org/package/show?project=drivers:filesystems&package=reiser4-kmp build reiser4 packages] in their ''development'' and ''factory'' (trunk) releases: * Reiser4 for [http://download.opensuse.org/repositories/drivers://filesystems/openSUSE_11.0/ openSUSE 11.0] * Reiser4 for [http://download.opensuse.org/repositories/drivers://filesystems/openSUSE_11.1/ openSUSE 11.1] * Reiser4 for [http://download.opensuse.org/repositories/drivers://filesystems/openSUSE_Factory/ openSUSE Factory] Jeff Mahoney maintains these packages in openSUSE's [https://build.opensuse.org/project/show?project=drivers:filesystems build service]. <small>(login needed)</small> [[category:Reiser4]] de588903686ead3afd8aac809b9eb614b77f66ea 1507 1506 2009-06-27T17:33:41Z Chris goe 2 mm? 
'''Please help with [[Reiser4_Howto|testing Reiser4]] and [[mailinglists|report]] any issues to the mailinglist!''' == Stable patchsets == As <tt>reiser4</tt> is still not in [http://git.kernel.org/?p=linux%2Fkernel%2Fgit%2Ftorvalds%2Flinux-2.6.git;a=summary mainline], Edward Shishkin is kind enough to provide [http://www.kernel.org/pub/linux/kernel/people/edward/reiser4/reiser4-for-2.6/?C=M;O=D patches] for the stable version of the Linux kernel. == Development patchsets == Andrew Morton is maintaining the [http://git.zen-sources.org/?p=mmotm.git;a=summary MMOTM] tree ([http://www.kernel.org/patchtypes/mm.html mm]-of-the-moment) that includes [[Reiser4]] as well. Snapshot patches from that tree can be found on [http://userweb.kernel.org/~akpm/mmotm/ kernel.org]. Be aware that this tree in high flux and often filled with exotic stuff (yes, more exotic than reiser4 ;-)) == Distribution packages == Apparently [http://opensuse.org/ openSUSE] is trying to [https://build.opensuse.org/package/show?project=drivers:filesystems&package=reiser4-kmp build reiser4 packages] in their ''development'' and ''factory'' (trunk) releases: * Reiser4 for [http://download.opensuse.org/repositories/drivers://filesystems/openSUSE_11.0/ openSUSE 11.0] * Reiser4 for [http://download.opensuse.org/repositories/drivers://filesystems/openSUSE_11.1/ openSUSE 11.1] * Reiser4 for [http://download.opensuse.org/repositories/drivers://filesystems/openSUSE_Factory/ openSUSE Factory] Jeff Mahoney maintains these packages in openSUSE's [https://build.opensuse.org/project/show?project=drivers:filesystems build service]. 
[[category:Reiser4]]
Reiser4 transaction models 0 1061 4457 4287 2020-12-10T19:34:09Z Edward 4

Reiser4 supports multiple transaction models. All other file systems implement only a single transaction model: they are either journalling only (ext3/4, ReiserFS(v3), XFS, JFS, ...) or "write-anywhere" only (ZFS, Btrfs, etc.).

However, journalling file systems are not the best choice for SSD drives, as they issue a larger number of IOs because of double writes: first you write to the journal, and then to the permanent location on disk. A larger number of IOs means a performance drop and a reduced lifetime for SSD drives.

"Write-anywhere" file systems, in turn, work badly with HDD drives. Indeed, in this transaction model you cannot overwrite blocks on disk. Instead, you write the modified buffers to a different location and, after making sure they have been written successfully, deallocate the old blocks. Such mandatory relocations lead to rapid external fragmentation, especially when you perform a lot of overwrites at random offsets, and performance degrades accordingly. To improve the situation you need to run defragmentation tools incessantly.

Starting from reiser4-for-3.14.1, you can choose the transaction model that is most suitable for your device. This is very simple: just specify it via the respective mount option. Currently there are 3 options:

1) '''Journalling''' (mount option "txmod=journal")
In this mode all overwritten buffers (nodes) are committed via the journal, like in ReiserFS(v3), Ext4, XFS, etc. (Note that instead of obsolete "journal block devices" Reiser4 uses the more advanced technique of wandering logs.) This mode is for HDD users who complain about fragmentation of reiser4 volumes. It is not a 100% panacea against fragmentation, but it is better than nothing: in this mode the fragmentation situation should be no worse than in ReiserFS(v3)! Alas, the 100% panacea (the reiser4 repacker) is still a long-term todo.

2) '''Write-Anywhere''' (mount option "txmod=wa")

In this mode all modified nodes get a new location on disk (like in ZFS, Btrfs, etc.). reiser4 makes no active attempts to defragment atoms; it issues the minimal number of IOs, but reiser4 volumes fragment rapidly. This option is only for SSD users.

3) '''Hybrid transaction model''' (mount option "txmod=hybrid")

This is the default model, suggested by Hans Reiser and Josh MacDonald in ~2002. It uses an advanced feature of the reiser4 transaction manager, so-called "compound checkpoints": part of the dirty nodes is committed via the journal (overwrite), while another part is committed via the write-anywhere technique (i.e. gets another location on disk). All relocate-overwrite decisions in this mode are the result of attempts to defragment the locality of the atoms being committed. Clean nodes of this locality can also be involved in the commit process (their location on disk will be changed if it gives good results). More details can be found [[Reiser4_Hybrid_transaction_model | here]].

In this model the number of issued IOs is not as large as in the traditional journalling model, and fragmentation is not as rapid as in the traditional write-anywhere model.
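The three policies differ only in which dirty nodes keep their old disk location at commit time. A toy sketch in Python may make this concrete (purely illustrative: the function name <tt>decide</tt> and the locality heuristic are made up for this page and are not the kernel's actual TXMOD code):

```python
# Toy model of the three txmod policies: for each dirty node the policy
# decides "overwrite" (commit via journal, keep the block address) or
# "relocate" (write-anywhere, assign a new address).
# Hypothetical sketch -- not the actual reiser4 TXMOD plugin code.

def decide(txmod, dirty_blocks, improves_locality):
    """Return {block: 'overwrite' | 'relocate'} for one atom."""
    decisions = {}
    for blk in dirty_blocks:
        if txmod == "journal":
            decisions[blk] = "overwrite"   # everything via wandering log
        elif txmod == "wa":
            decisions[blk] = "relocate"    # everything gets a new location
        elif txmod == "hybrid":
            # Simplified hybrid heuristic: relocate a node only when a new
            # placement would defragment its locality; otherwise overwrite.
            decisions[blk] = ("relocate" if improves_locality(blk)
                              else "overwrite")
        else:
            raise ValueError("unknown txmod: " + txmod)
    return decisions

dirty = [25, 26, 27]
print(decide("journal", dirty, lambda b: False))  # all 'overwrite'
print(decide("wa", dirty, lambda b: False))       # all 'relocate'
```

Note how "journal" and "wa" fall out as the two boundary cases of the same decision, which is exactly the point made in the implementation notes below.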
However, such local defragmentation doesn't help much under some workloads, and we periodically get complaints from users about degradation of reiser4 volumes. So this model is for HDD users who don't perform a lot of random overwrites. Once the repacker is ready, we'll recommend this mode for all HDD users (simply because pure journalling is suboptimal for HDD drives anyway).

== Implementation details ==

We introduce a new layer/interface, TXMOD (Transaction MODel), called at flush time for reiser4 atoms. Every plugin of this interface is a high-level block allocator, which assigns block numbers to dirty nodes and thereby decides how those nodes will be committed. Every dirty node of a reiser4 atom can be committed in either of two ways: 1) via journal; 2) using the "write-anywhere" technique. If the allocator doesn't change the on-disk location of a node, that node is committed using the journalling technique (overwrite). Otherwise, it is committed via the write-anywhere technique (relocate):

 relocate <---- allocate ----> overwrite

So, in our interpretation, the two traditional "classic" strategies of committing transactions (journalling and "write-anywhere") are just two boundary cases: 1) all nodes are overwritten; 2) all nodes are relocated. Besides those 2 boundary cases we can implement an infinite set of their combinations, so that the user can choose what really suits his needs.

== How it looks in practice ==

Let's create a large enough file on a reiser4 partition (let it be a 645K /etc/services):

 # mkfs.reiser4 -o create=reg40 /dev/sdb5
 # mount /dev/sdb5 /mnt
 # cp /etc/services /mnt/.
 # umount /mnt
 # debugfs.reiser4 -t /dev/sdb5
 NODE (23) LEVEL=2 ITEMS=2 SPACE=3968 MKFS ID=0x4ed8c6de FLUSH=0x0
 #0 NPTR (nodeptr40): [29:1(SD):0:2a:0] OFF=28, LEN=8, flags=0x0 [24]
 ------------------------------------------------------------------------------
 #1 EXTENT (extent40): [2a:4(FB):73657276696365:10000:0] OFF=36, LEN=16, flags=0x0 UNITS=1 [25(162)]
 ==============================================================================

We can see that the file data is represented by a single extent of 162 blocks starting at block #25. Let's overwrite the first 100K of this file in the journalling transaction model:

 # mount /dev/sdb5 -o txmod=journal /mnt
 # dd if=/dev/zero of=/mnt/services bs=100K count=1 conv=notrunc
 # umount /mnt
 # debugfs.reiser4 -t /dev/sdb5
 NODE (23) LEVEL=2 ITEMS=2 SPACE=3968 MKFS ID=0x4ed8c6de FLUSH=0x0
 #0 NPTR (nodeptr40): [29:1(SD):0:2a:0] OFF=28, LEN=8, flags=0x0 [24]
 ------------------------------------------------------------------------------
 #1 EXTENT (extent40): [2a:4(FB):73657276696365:10000:0] OFF=36, LEN=16, flags=0x0 UNITS=1 [25(162)]
 ==============================================================================

We can see that the overwritten nodes occupy the same location on disk, and our extent hasn't been destroyed (fragmented). Moreover, the modified parent node occupies the same location on disk (block #23).
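The block counts in the dumps follow directly from the 4096-byte block size (assumed here to be the reiser4 default); a quick sanity check:

```python
import math

BLOCK = 4096  # assumed reiser4 block size, bytes

file_bytes = 645 * 1024       # the ~645K /etc/services
overwrite_bytes = 100 * 1024  # dd bs=100K count=1

# 645K rounds up to 162 blocks -> the single extent [25(162)] in the dump
file_blocks = math.ceil(file_bytes / BLOCK)

# the 100K overwrite touches exactly 25 blocks
overwritten = overwrite_bytes // BLOCK

print(file_blocks)  # 162
print(overwritten)  # 25
```

These are the same 162 and 25 that reappear in the write-anywhere dump below, where the 25 overwritten blocks split off into their own extent unit.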
Let's now overwrite the first 100K of this file in the Write-Anywhere transaction model:

 # mount /dev/sdb5 -o txmod=wa /mnt
 # dd if=/dev/zero of=/mnt/services bs=100K count=1 conv=notrunc
 # umount /mnt
 # debugfs.reiser4 -t /dev/sdb5
 NODE (213) LEVEL=2 ITEMS=2 SPACE=3952 MKFS ID=0x4ed8c6de FLUSH=0x0
 #0 NPTR (nodeptr40): [29:1(SD):0:2a:0] OFF=28, LEN=8, flags=0x0 [187]
 ------------------------------------------------------------------------------
 #1 EXTENT (extent40): [2a:4(FB):73657276696365:10000:0] OFF=36, LEN=32, flags=0x0 UNITS=2 [188(25) 50(137)]
 ==============================================================================

We can see that the first 100K (25 blocks) has been relocated in accordance with the "Write-Anywhere" transaction model: the initial extent has been split in two. The first unit consists of 25 relocated blocks starting at block #188, and the second unit consists of 137 blocks that occupy the same location on disk. The modified parent also got a new location (block #213, was #23).

Let's calculate the total number of IOs issued when overwriting the file in the different modes:

1) '''Journalling'''

 50 blocks were submitted for data modification (25 to the journal, 25 to the permanent location);
 2 blocks were submitted to modify the parent (block #23 in the dump) (1 to the journal, 1 to the permanent location);
 2 blocks to modify the bitmap (1 to the journal, 1 to the permanent location);
 2 blocks to modify the superblock (1 to the journal, 1 to the permanent location).
 --------------------
 Total: 56 blocks.

2) '''Write-Anywhere'''

 25 blocks were submitted (relocated) for data modifications;
 1 block was submitted to modify the parent, which got the new location #213;
 2 blocks were submitted to modify the bitmap (1 to the journal, 1 to the permanent location);
 2 blocks were submitted to modify the superblock (1 to the journal, 1 to the permanent location).
 ---------------------
 Total: 30 blocks.

NOTE: system blocks (bitmaps, superblock, etc.) cannot be relocated in reiser4, so they are always committed via the journal.
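The bookkeeping above can be cross-checked with a few lines of arithmetic (a sketch: each journalled block counts double because it is written once to the wandering log and once to its permanent location):

```python
def journalled(n):
    # a journalled block is written twice: journal + permanent location
    return 2 * n

# Journalling mode: data (25 blocks), parent, bitmap and superblock
# all go via the journal.
journal_total = journalled(25) + journalled(1) + journalled(1) + journalled(1)

# Write-anywhere mode: data and parent are relocated (one write each),
# but system blocks (bitmap, superblock) still must go via the journal.
wa_total = 25 + 1 + journalled(1) + journalled(1)

print(journal_total)  # 56
print(wa_total)       # 30
```

The 26 extra IOs of journalling mode are exactly the double writes of the 25 data blocks plus the parent node.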
So we have 56 IOs issued in journalling mode against 30 IOs in Write-Anywhere. However, fragmentation is the price for the smaller number of IOs in Write-Anywhere mode (see the last dump, where we have 2 extents). So this transaction model is only for SSD drives, as they are not sensitive to external fragmentation. Again: "journal" is for HDD and "wa" is for SSD; please don't confuse them!

 ----------------------------------------------------------------------
 MOUNT OPTION     INTENDED FOR                       DEFAULT
 ----------------------------------------------------------------------
 txmod=journal    HDD users                          no
 ----------------------------------------------------------------------
 txmod=wa         SSD users                          no
 ----------------------------------------------------------------------
 txmod=hybrid     HDD users who don't perform        yes
                  a lot of random overwrites
 ----------------------------------------------------------------------

[[category:Reiser4]]
Instead, you should write the modified buffers to different location, and after making sure that they have been written successfully, deallocate old blocks (sometimes this transaction model is called "Copy-on-Write", but we will use the historically first name "Write-Anywhere"). Such mandatory relocations lead to rapid external fragmentation, especially when you perform a lot of overwrites at random offsets. Respectively, the performance rapidly degrades. To improve the situation you need to incessantly run defragmentation tools. Starting from reiser4-for-3.14.1, you can choose a transaction model, which is most suitable for your device. This is very simple: just specify it by respective mount option. Currently there are 3 options: 1) '''Journalling''' (mount option "txmod=journal"). In this mode all overwritten buffers (nodes) will be committed via journal like in ReiserFS(v3), Ext4, XFS, etc. (We remind that instead of obsolete "journal block devices" Reiser4 uses more advanced technique of wandering logs). This mode is for HDD users, who complaint about fragmentation of reiser4 volumes. We imagine, that this is not a 100% panacea against fragmentation, but it is better than nothing: in this mode the situation with fragmentation has to be not worse than in ReiserFS(v3)! Alas, the 100% panacea (reiser4 repacker) is still a long-term todo. 2) '''Write-Anywhere, aka Copy-on-Write''' (mount option "txmod=wa") All modified nodes in this mode will get new location on disk (like in ZFS, Btrfs, etc). In this mode reiser4 doesn't make active attempts to defragment atoms. In this mode reiser4 will issue minimal number of IOs, however reiser4 volumes will be rapidly fragmented. This option is only for SSD users. 3) '''Hybrid transaction model''' (mount option "txmod=hybrid") This is the default model suggested by Hans Reiser and Josh MacDonald in ~2002. 
This model uses an advanced feature of reiser4 transaction manager, so-called "compound checkpoints", which means that a part of dirty nodes is committed via journal (overwrite), and another part is committed via write-anywhere technique (i.e. gets another location on disk). All relocate-overwrite decisions in this mode are results of attempts to defragment locality of atoms that are to be committed. Clean nodes of this locality also can be involved to the commit process (their location on disk will be changed, if it provides excellent results). More details can be found [[Reiser4_Hybrid_transaction_model | here]] In this model number of issued IOs is not so large as in traditional Journalling model, and fragmentation is not so rapid as in traditional Write-Anywhere (CoW) model. However, such local defragmentation doesn't help a lot in some cases of workload, and we periodically get complaints from users about degradation of reiser4 volumes. So, this model is for HDD users, who don't perform a lot of random overwrites. Once the repacker is ready, we'll recommend this mode for all HDD users (just because pure journalling is anyway suboptimal for HDD drives). Implementation details We introduce a new layer/interface TXMOD (Transaction MODel) called at flush time for reiser4 atoms. Every plugin of this interface is a high-level block allocator, which assigns block numbers to dirty nodes, and, thereby, decides, how those nodes will be committed. Every dirty node of reiser4 atom can be committed by either of the following two ways: 1) via journal; 2) using "write-anywhere" technique. If the allocator doesn't change on-disk location of a node, then this node will be committed using journalling technique (overwrite). 
Otherwise, it will be committed via write-anywhere technique (relocate) relocate <---- allocate --- > overwrite So, in our interpretation the two traditional "classic" strategies in committing transactions (journalling and "write-anywhere") are just two boundary cases: 1) when all nodes are overwritten, and 2) when all nodes are relocated. Besides those 2 boundary cases we can implement the infinite set of their various combinations, so that user can choose what is really suitable for his needs. How it looks in practice Let's create a large enough file on a reiser4 partition (let it be a 645K /etc/services): # mkfs.reiser4 -o create=reg40 /dev/sdb5 # mount /dev/sdb5 /mnt # cp /etc/services /mnt/. # umount /mnt # debugfs.reiser4 -t /dev/sdb5 NODE (23) LEVEL=2 ITEMS=2 SPACE=3968 MKFS ID=0x4ed8c6de FLUSH=0x0 #0 NPTR (nodeptr40): [29:1(SD):0:2a:0] OFF=28, LEN=8, flags=0x0 [24] ------------------------------------------------------------------------------ #1 EXTENT (extent40): [2a:4(FB):73657276696365:10000:0] OFF=36, LEN=16, flags=0x0 UNITS=1 [25(162)] ============================================================================== We can see that file data is represented by a single extent of 162 blocks starting at block #25. Let's overwrite first 100K of this file in journalling transaction model: # mount /dev/sdb5 -o txmod=journal /mnt # dd if=/dev/zero of=/mnt/services bs=100K count=1 conv=notrunc # umount /mnt # debugfs.reiser4 -t /dev/sdb5 NODE (23) LEVEL=2 ITEMS=2 SPACE=3968 MKFS ID=0x4ed8c6de FLUSH=0x0 #0 NPTR (nodeptr40): [29:1(SD):0:2a:0] OFF=28, LEN=8, flags=0x0 [24] ------------------------------------------------------------------------------ #1 EXTENT (extent40): [2a:4(FB):73657276696365:10000:0] OFF=36, LEN=16, flags=0x0 UNITS=1 [25(162)] ============================================================================== We can see that overwritten nodes occupy the same location on disk, and our extent hasn't beed destroyed (fragmented). 
Moreover, the modified parent node occupies the same location on disk (block #23). Let's now overwrite first 100K of this file in Write-Anywhere (Copy-on-Write) transaction mode: # mount /dev/sdb5 -o txmod=wa /mnt # dd if=/dev/zero of=/mnt/services bs=100K count=1 conv=notrunc # umount /mnt # debugfs.reiser4 -t /dev/sdb5 NODE (213) LEVEL=2 ITEMS=2 SPACE=3952 MKFS ID=0x4ed8c6de FLUSH=0x0 #0 NPTR (nodeptr40): [29:1(SD):0:2a:0] OFF=28, LEN=8, flags=0x0 [187] ------------------------------------------------------------------------------ #1 EXTENT (extent40): [2a:4(FB):73657276696365:10000:0] OFF=36, LEN=32, flags=0x0 UNITS=2 [188(25) 50(137)] ============================================================================== We can see, that first 100K (25 blocks) has been relocated in accordance with "Write-Anywhere" transaction model: initial extent has been split into 2 ones: first unit consists of 25 relocated blocks, which start at block #188, and second unit consists of 137 blocks, which occupy the same location on disk. Modified parent also got new location (block #213 - was #23). Let's calculate total number of IOs issued when overwriting the file in different modes: 1) '''Journalling''' 50 blocks were submitted for data modification (25 has been written to journal, and 25 to permanent location); 2 blocks were submitted to modify parent (block #23 in the dump) (1 to journal, and 1 to permanent location); 2 blocks to modify bitmap (1 to journal, and 1 to permanent location) 2 blocks to modify superblock (1 to journal, and 1 to permanent location) -------------------- Total: 56 blocks. 
2) '''Write-Anywhere (Copy-on-Write)''' 25 blocks were submitted (relocated) for data modifications; 1 block was submitted to modify parent, which got new location #213; 2 blocks were submitted to modify bitmap (1 to journal, and 1 to permanent location); 2 blocks were submitted to modify superblock (1 to journal, and 1 to permanent location); NOTE: system blocks (bitmaps, superblock, etc) can not be relocated in reiser4, so we always commit them via journal. --------------------- Total: 30 blocks. So we have 56 IOs issued in journalling mode against 30 IOs in Write-Anywhere. However, fragmentation is a payment for the smaller number of IOs in Write-Anywhere mode (see the last dump, where we have 2 extents). So this transaction model is only for SSD drives, as they are not sensitive to external fragmentation. Again, "journal" is for HDD, and "wa" is for SSD, please, don't confuse! ---------------------------------------------------------------------- MOUNT OPTION INTENDED FOR DEFAULT ---------------------------------------------------------------------- txmod=journal HDD users no ---------------------------------------------------------------------- txmod=wa SSD users no ---------------------------------------------------------------------- txmod=hybrid HDD users, who don't perform yes a lot of random overwrites ---------------------------------------------------------------------- [[category:Reiser4]] 3ca68f77e1e3e51f60c9a37def60debb7b3d2b77 4083 3831 2015-09-24T19:57:37Z Chris goe 2 category added Reiser4 supports multiple transaction models. As you probably know, all other file systems implement only a single transaction model. That is, they all are either only journalling (ext3/4, ReiserFS(v3), XFS, jfs, ...), or only "write-anywhere" (ZFS, Btrfs, etc). 
However, journalling file systems are not the best choice for SSD drives (as they issue larger number of IOs because of double writes - first you should write to journal, and then to the permanent location on disk. As you guess, larger number of IOs means performance drop and reduced life of SSD drives. As to "write-anywhere" file systems: they work badly with HDD drives. Indeed, in accordance with this transaction model you can not overwrite blocks on disk. Instead, you should write the modified buffers to different location, and after making sure that they have been written successfully, deallocate old blocks (sometimes this transaction model is called "Copy-on-Write", but we will use the historically first name "Write-Anywhere"). Such mandatory relocations lead to rapid external fragmentation, especially when you perform a lot of overwrites at random offsets. Respectively, the performance rapidly degrades. To improve the situation you need to incessantly run defragmentation tools. Starting from reiser4-for-3.14.1, you can choose a transaction model, which is most suitable for your device. This is very simple: just specify it by respective mount option. Currently there are 3 options: 1) '''Journalling''' (mount option "txmod=journal"). In this mode all overwritten buffers (nodes) will be committed via journal like in ReiserFS(v3), Ext4, XFS, etc. (We remind that instead of obsolete "journal block devices" Reiser4 uses more advanced technique of wandering logs). This mode is for HDD users, who complaint about fragmentation of reiser4 volumes. We imagine, that this is not a 100% panacea against fragmentation, but it is better than nothing: in this mode the situation with fragmentation has to be not worse than in ReiserFS(v3)! Alas, the 100% panacea (reiser4 repacker) is still a long-term todo. 2) '''Write-Anywhere, aka Copy-on-Write''' (mount option "txmod=wa") All modified nodes in this mode will get new location on disk (like in ZFS, Btrfs, etc). 
In this mode reiser4 doesn't make active attempts to defragment atoms. In this mode reiser4 will issue minimal number of IOs, however reiser4 volumes will be rapidly fragmented. This option is only for SSD users. 3) '''Hybrid transaction model''' (mount option "txmod=hybrid") This is the default model suggested by Hans Reiser and Josh MacDonald in ~2002. This model uses an advanced feature of reiser4 transaction manager, so-called "compound checkpoints", which means that a part of dirty nodes is committed via journal (overwrite), and another part is committed via write-anywhere technique (i.e. gets another location on disk). All relocate-overwrite decisions in this mode are results of attempts to defragment locality of atoms that are to be committed. Clean nodes of this locality also can be involved to the commit process (their location on disk will be changed, if it provides excellent results). In this model number of issued IOs is not so large as in traditional Journalling model, and fragmentation is not so rapid as in traditional Write-Anywhere (CoW) model. However, such local defragmentation doesn't help a lot in some cases of workload, and we periodically get complaints from users about degradation of reiser4 volumes. So, this model is for HDD users, who don't perform a lot of random overwrites. Once the repacker is ready, we'll recommend this mode for all HDD users (just because pure journalling is anyway suboptimal for HDD drives). Implementation details We introduce a new layer/interface TXMOD (Transaction MODel) called at flush time for reiser4 atoms. Every plugin of this interface is a high-level block allocator, which assigns block numbers to dirty nodes, and, thereby, decides, how those nodes will be committed. Every dirty node of reiser4 atom can be committed by either of the following two ways: 1) via journal; 2) using "write-anywhere" technique. 
If the allocator doesn't change on-disk location of a node, then this node will be committed using journalling technique (overwrite). Otherwise, it will be committed via write-anywhere technique (relocate) relocate <---- allocate --- > overwrite So, in our interpretation the two traditional "classic" strategies in committing transactions (journalling and "write-anywhere") are just two boundary cases: 1) when all nodes are overwritten, and 2) when all nodes are relocated. Besides those 2 boundary cases we can implement the infinite set of their various combinations, so that user can choose what is really suitable for his needs. How it looks in practice Let's create a large enough file on a reiser4 partition (let it be a 645K /etc/services): # mkfs.reiser4 -o create=reg40 /dev/sdb5 # mount /dev/sdb5 /mnt # cp /etc/services /mnt/. # umount /mnt # debugfs.reiser4 -t /dev/sdb5 NODE (23) LEVEL=2 ITEMS=2 SPACE=3968 MKFS ID=0x4ed8c6de FLUSH=0x0 #0 NPTR (nodeptr40): [29:1(SD):0:2a:0] OFF=28, LEN=8, flags=0x0 [24] ------------------------------------------------------------------------------ #1 EXTENT (extent40): [2a:4(FB):73657276696365:10000:0] OFF=36, LEN=16, flags=0x0 UNITS=1 [25(162)] ============================================================================== We can see that file data is represented by a single extent of 162 blocks starting at block #25. 
Let's overwrite first 100K of this file in journalling transaction model: # mount /dev/sdb5 -o txmod=journal /mnt # dd if=/dev/zero of=/mnt/services bs=100K count=1 conv=notrunc # umount /mnt # debugfs.reiser4 -t /dev/sdb5 NODE (23) LEVEL=2 ITEMS=2 SPACE=3968 MKFS ID=0x4ed8c6de FLUSH=0x0 #0 NPTR (nodeptr40): [29:1(SD):0:2a:0] OFF=28, LEN=8, flags=0x0 [24] ------------------------------------------------------------------------------ #1 EXTENT (extent40): [2a:4(FB):73657276696365:10000:0] OFF=36, LEN=16, flags=0x0 UNITS=1 [25(162)] ============================================================================== We can see that overwritten nodes occupy the same location on disk, and our extent hasn't beed destroyed (fragmented). Moreover, the modified parent node occupies the same location on disk (block #23). Let's now overwrite first 100K of this file in Write-Anywhere (Copy-on-Write) transaction mode: # mount /dev/sdb5 -o txmod=wa /mnt # dd if=/dev/zero of=/mnt/services bs=100K count=1 conv=notrunc # umount /mnt # debugfs.reiser4 -t /dev/sdb5 NODE (213) LEVEL=2 ITEMS=2 SPACE=3952 MKFS ID=0x4ed8c6de FLUSH=0x0 #0 NPTR (nodeptr40): [29:1(SD):0:2a:0] OFF=28, LEN=8, flags=0x0 [187] ------------------------------------------------------------------------------ #1 EXTENT (extent40): [2a:4(FB):73657276696365:10000:0] OFF=36, LEN=32, flags=0x0 UNITS=2 [188(25) 50(137)] ============================================================================== We can see, that first 100K (25 blocks) has been relocated in accordance with "Write-Anywhere" transaction model: initial extent has been split into 2 ones: first unit consists of 25 relocated blocks, which start at block #188, and second unit consists of 137 blocks, which occupy the same location on disk. Modified parent also got new location (block #213 - was #23). 
Let's count the total number of IOs issued when overwriting the file in each mode:

1) '''Journalling'''
* 50 blocks submitted for the data modification (25 written to the journal, 25 to the permanent location);
* 2 blocks to modify the parent (block #23 in the dump): 1 to the journal, 1 to the permanent location;
* 2 blocks to modify the bitmap: 1 to the journal, 1 to the permanent location;
* 2 blocks to modify the superblock: 1 to the journal, 1 to the permanent location.
Total: 56 blocks.

2) '''Write-anywhere (copy-on-write)'''
* 25 blocks submitted (relocated) for the data modification;
* 1 block to modify the parent, which got the new location #213;
* 2 blocks to modify the bitmap: 1 to the journal, 1 to the permanent location;
* 2 blocks to modify the superblock: 1 to the journal, 1 to the permanent location.
Total: 30 blocks.

NOTE: system blocks (bitmaps, the superblock, etc.) cannot be relocated in reiser4, so they are always committed via the journal.

So we have 56 IOs issued in journalling mode against 30 in write-anywhere. Fragmentation, however, is the price for the smaller number of IOs in write-anywhere mode (see the last dump, where we now have 2 extents). This transaction model is therefore only for SSDs, which are not sensitive to external fragmentation. Again: "journal" is for HDDs and "wa" is for SSDs; please don't mix them up!
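This accounting condenses into a toy cost model (a hypothetical helper, not reiser4 code): a journalled block costs two IOs (journal copy plus permanent location), a relocated block costs one, and system blocks are always journalled.

```python
def commit_ios(data, parent, system, txmod):
    # A journalled block is written twice (journal + permanent location),
    # a relocated block only once. System blocks (bitmap, superblock)
    # cannot be relocated in reiser4, so they always cost two IOs.
    per_block = 2 if txmod == "journal" else 1
    return per_block * (data + parent) + 2 * system

# 25 data blocks, 1 parent node, 2 system blocks (bitmap + superblock):
print(commit_ios(25, 1, 2, "journal"))  # 56
print(commit_ios(25, 1, 2, "wa"))       # 30
```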
 ----------------------------------------------------------------------
 MOUNT OPTION     INTENDED FOR                              DEFAULT
 ----------------------------------------------------------------------
 txmod=journal    HDD users                                 no
 ----------------------------------------------------------------------
 txmod=wa         SSD users                                 no
 ----------------------------------------------------------------------
 txmod=hybrid     HDD users who don't perform               yes
                  a lot of random overwrites
 ----------------------------------------------------------------------

[[category:Reiser4]]
This is very simple: just specify it by respective mount option. Currently there are 3 options: 1) '''Journalling''' (mount option "txmod=journal"). In this mode all overwritten buffers (nodes) will be committed via journal like in ReiserFS(v3), Ext4, XFS, etc. (We remind that instead of obsolete "journal block devices" Reiser4 uses more advanced technique of wandering logs). This mode is for HDD users, who complaint about fragmentation of reiser4 volumes. We imagine, that this is not a 100% panacea against fragmentation, but it is better than nothing: in this mode the situation with fragmentation has to be not worse than in ReiserFS(v3)! Alas, the 100% panacea (reiser4 repacker) is still a long-term todo. 2) '''Write-Anywhere, aka Copy-on-Write''' (mount option "txmod=wa") All modified nodes in this mode will get new location on disk (like in ZFS, Btrfs, etc). In this mode reiser4 doesn't make active attempts to defragment atoms. In this mode reiser4 will issue minimal number of IOs, however reiser4 volumes will be rapidly fragmented. This option is only for SSD users. 3) '''Hybrid transaction model''' (mount option "txmod=hybrid") This is the default model suggested by Hans Reiser and Josh MacDonald in ~2002. This model uses an advanced feature of reiser4 transaction manager, so-called "compound checkpoints", which means that a part of dirty nodes is committed via journal (overwrite), and another part is committed via write-anywhere technique (i.e. gets another location on disk). All relocate-overwrite decisions in this mode are results of attempts to defragment locality of atoms that are to be committed. Clean nodes of this locality also can be involved to the commit process (their location on disk will be changed, if it provides excellent results). In this model number of issued IOs is not so large as in traditional Journalling model, and fragmentation is not so rapid as in traditional Write-Anywhere (CoW) model. 
However, such local defragmentation doesn't help a lot in some cases of workload, and we periodically get complaints from users about degradation of reiser4 volumes. So, this model is for HDD users, who don't perform a lot of random overwrites. Once the repacker is ready, we'll recommend this mode for all HDD users (just because pure journalling is anyway suboptimal for HDD drives). Implementation details We introduce a new layer/interface TXMOD (Transaction MODel) called at flush time for reiser4 atoms. Every plugin of this interface is a high-level block allocator, which assigns block numbers to dirty nodes, and, thereby, decides, how those nodes will be committed. Every dirty node of reiser4 atom can be committed by either of the following two ways: 1) via journal; 2) using "write-anywhere" technique. If the allocator doesn't change on-disk location of a node, then this node will be committed using journalling technique (overwrite). Otherwise, it will be committed via write-anywhere technique (relocate) relocate <---- allocate --- > overwrite So, in our interpretation the two traditional "classic" strategies in committing transactions (journalling and "write-anywhere") are just two boundary cases: 1) when all nodes are overwritten, and 2) when all nodes are relocated. Besides those 2 boundary cases we can implement the infinite set of their various combinations, so that user can choose what is really suitable for his needs. How it looks in practice Let's create a large enough file on a reiser4 partition (let it be a 645K /etc/services): # mkfs.reiser4 -o create=reg40 /dev/sdb5 # mount /dev/sdb5 /mnt # cp /etc/services /mnt/. 
# umount /mnt # debugfs.reiser4 -t /dev/sdb5 NODE (23) LEVEL=2 ITEMS=2 SPACE=3968 MKFS ID=0x4ed8c6de FLUSH=0x0 #0 NPTR (nodeptr40): [29:1(SD):0:2a:0] OFF=28, LEN=8, flags=0x0 [24] ------------------------------------------------------------------------------ #1 EXTENT (extent40): [2a:4(FB):73657276696365:10000:0] OFF=36, LEN=16, flags=0x0 UNITS=1 [25(162)] ============================================================================== We can see that file data is represented by a single extent of 162 blocks starting at block #25. Let's overwrite first 100K of this file in journalling transaction model: # mount /dev/sdb5 -o txmod=journal /mnt # dd if=/dev/zero of=/mnt/services bs=100K count=1 conv=notrunc # umount /mnt # debugfs.reiser4 -t /dev/sdb5 NODE (23) LEVEL=2 ITEMS=2 SPACE=3968 MKFS ID=0x4ed8c6de FLUSH=0x0 #0 NPTR (nodeptr40): [29:1(SD):0:2a:0] OFF=28, LEN=8, flags=0x0 [24] ------------------------------------------------------------------------------ #1 EXTENT (extent40): [2a:4(FB):73657276696365:10000:0] OFF=36, LEN=16, flags=0x0 UNITS=1 [25(162)] ============================================================================== We can see that overwritten nodes occupy the same location on disk, and our extent hasn't beed destroyed (fragmented). Moreover, the modified parent node occupies the same location on disk (block #23). 
Let's now overwrite first 100K of this file in Write-Anywhere (Copy-on-Write) transaction mode: # mount /dev/sdb5 -o txmod=wa /mnt # dd if=/dev/zero of=/mnt/services bs=100K count=1 conv=notrunc # umount /mnt # debugfs.reiser4 -t /dev/sdb5 NODE (213) LEVEL=2 ITEMS=2 SPACE=3952 MKFS ID=0x4ed8c6de FLUSH=0x0 #0 NPTR (nodeptr40): [29:1(SD):0:2a:0] OFF=28, LEN=8, flags=0x0 [187] ------------------------------------------------------------------------------ #1 EXTENT (extent40): [2a:4(FB):73657276696365:10000:0] OFF=36, LEN=32, flags=0x0 UNITS=2 [188(25) 50(137)] ============================================================================== We can see, that first 100K (25 blocks) has been relocated in accordance with "Write-Anywhere" transaction model: initial extent has been split into 2 ones: first unit consists of 25 relocated blocks, which start at block #188, and second unit consists of 137 blocks, which occupy the same location on disk. Modified parent also got new location (block #213 - was #23). Let's calculate total number of IOs issued when overwriting the file in different modes: 1) '''Journalling''' 50 blocks were submitted for data modification (25 has been written to journal, and 25 to permanent location); 2 blocks were submitted to modify parent (block #23 in the dump) (1 to journal, and 1 to permanent location); 2 blocks to modify bitmap (1 to journal, and 1 to permanent location) 2 blocks to modify superblock (1 to journal, and 1 to permanent location) -------------------- Total: 56 blocks. 2) '''Write-Anywhere (Copy-on-Write)''' 25 blocks were submitted (relocated) for data modifications; 1 block was submitted to modify parent, which got new location #213; 2 blocks were submitted to modify bitmap (1 to journal, and 1 to permanent location); 2 blocks were submitted to modify superblock (1 to journal, and 1 to permanent location); NOTE: system blocks (bitmaps, superblock, etc) can not be relocated in reiser4, so we always commit them via journal. 
--------------------- Total: 30 blocks. So we have 56 IOs issued in journalling mode against 30 IOs in Write-Anywhere. However, fragmentation is a payment for the smaller number of IOs in Write-Anywhere mode (see the last dump, where we have 2 extents). So this transaction model is only for SSD drives, as they are not sensitive to external fragmentation. Again, "journal" is for HDD, and "wa" is for SSD, please, don't confuse! ---------------------------------------------------------------------- MOUNT OPTION INTENDED FOR DEFAULT ---------------------------------------------------------------------- txmod=journal HDD users no ---------------------------------------------------------------------- txmod=wa SSD users no ---------------------------------------------------------------------- txmod=hybrid HDD users, who don't perform yes a lot of random overwrites ---------------------------------------------------------------------- 1a85c87379c031e77fc3275f71ff396f60f5b0bb 3801 3771 2014-05-08T15:42:42Z Edward 4 Minor fixes Reiser4 supports multiple transaction models. As you probably know, all other file systems implement only a single transaction model. That is, they all are either only journalling (ext3/4, ReiserFS(v3), XFS, jfs, ...), or only "write-anywhere" (ZFS, Btrfs, etc). However, journalling file systems are not the best choice for SSD drives (as they issue larger number of IOs because of double writes - first you should write to journal, and then to the permanent location on disk. As you guess, larger number of IOs means performance drop and reduced life of SSD drives. As to "write-anywhere" file systems: they work badly with HDD drives. Indeed, in accordance with this transaction model you can not overwrite blocks on disk. 
Instead, you should write the modified buffers to different location, and after making sure that they have been written successfully, deallocate old blocks (sometimes this transaction model is called "Copy-on-Write", but we will use the historically first name "Write-Anywhere"). Such mandatory relocations lead to rapid external fragmentation, especially when you perform a lot of overwrites at random offsets. Respectively, the performance rapidly degrades. To improve the situation you need to incessantly run defragmentation tools. Starting from reiser4-for-3.14.1 patch, Reiser4 users now can choose a transaction model which is most suitable for their devices. This is very simple: just specify it by respective mount option. With the patch applied you will have 3 options: 1) '''Journalling''' (mount option "txmod=journal"). In this mode all overwritten buffers (nodes) will be committed via journal like in ReiserFS(v3), Ext4, XFS, etc. (We remind that instead of obsolete "journal block devices" Reiser4 uses more advanced technique of wandering logs). This mode is for HDD users, who complaint about fragmentation of reiser4 volumes. We imagine, that this is not a 100% panacea against fragmentation, but it is better than nothing: in this mode the situation with fragmentation has to be not worse than in ReiserFS(v3)! Alas, the 100% panacea (reiser4 repacker) is still a long-term todo. 2) '''Write-Anywhere, aka Copy-on-Write''' (mount option "txmod=wa") All modified nodes in this mode will get new location on disk (like in ZFS, Btrfs, etc). In this mode reiser4 doesn't make active attempts to defragment atoms. In this mode reiser4 will issue minimal number of IOs, however reiser4 volumes will be rapidly fragmented. This option is only for SSD users. 3) '''Hybrid transaction model''' (mount option "txmod=hybrid") This is the default model suggested by Hans Reiser and Josh MacDonald in ~2002. 
This model uses an advanced feature of reiser4 transaction manager, so-called "compound checkpoints", which means that a part of dirty nodes is committed via journal (overwrite), and another part is committed via write-anywhere technique (i.e. gets another location on disk). All relocate-overwrite decisions in this mode are results of attempts to defragment locality of atoms that are to be committed. Clean nodes of this locality also can be involved to the commit process (their location on disk will be changed, if it provides excellent results). In this model number of issued IOs is not so large as in traditional Journalling model, and fragmentation is not so rapid as in traditional Write-Anywhere (CoW) model. However, such local defragmentation doesn't help a lot in some cases of workload, and we periodically get complaints from users about degradation of reiser4 volumes. So, this model is for HDD users, who don't perform a lot of random overwrites. Once the repacker is ready, we'll recommend this mode for all HDD users (just because pure journalling is anyway suboptimal for HDD drives). Implementation details We introduce a new layer/interface TXMOD (Transaction MODel) called at flush time for reiser4 atoms. Every plugin of this interface is a high-level block allocator, which assigns block numbers to dirty nodes, and, thereby, decides, how those nodes will be committed. Every dirty node of reiser4 atom can be committed by either of the following two ways: 1) via journal; 2) using "write-anywhere" technique. If the allocator doesn't change on-disk location of a node, then this node will be committed using journalling technique (overwrite). 
Otherwise, it will be committed via write-anywhere technique (relocate) relocate <---- allocate --- > overwrite So, in our interpretation the two traditional "classic" strategies in committing transactions (journalling and "write-anywhere") are just two boundary cases: 1) when all nodes are overwritten, and 2) when all nodes are relocated. Besides those 2 boundary cases we can implement the infinite set of their various combinations, so that user can choose what is really suitable for his needs. How it looks in practice Let's create a large enough file on a reiser4 partition (let it be a 645K /etc/services): # mkfs.reiser4 -o create=reg40 /dev/sdb5 # mount /dev/sdb5 /mnt # cp /etc/services /mnt/. # umount /mnt # debugfs.reiser4 -t /dev/sdb5 NODE (23) LEVEL=2 ITEMS=2 SPACE=3968 MKFS ID=0x4ed8c6de FLUSH=0x0 #0 NPTR (nodeptr40): [29:1(SD):0:2a:0] OFF=28, LEN=8, flags=0x0 [24] ------------------------------------------------------------------------------ #1 EXTENT (extent40): [2a:4(FB):73657276696365:10000:0] OFF=36, LEN=16, flags=0x0 UNITS=1 [25(162)] ============================================================================== We can see that file data is represented by a single extent of 162 blocks starting at block #25. Let's overwrite first 100K of this file in journalling transaction model: # mount /dev/sdb5 -o txmod=journal /mnt # dd if=/dev/zero of=/mnt/services bs=100K count=1 conv=notrunc # umount /mnt # debugfs.reiser4 -t /dev/sdb5 NODE (23) LEVEL=2 ITEMS=2 SPACE=3968 MKFS ID=0x4ed8c6de FLUSH=0x0 #0 NPTR (nodeptr40): [29:1(SD):0:2a:0] OFF=28, LEN=8, flags=0x0 [24] ------------------------------------------------------------------------------ #1 EXTENT (extent40): [2a:4(FB):73657276696365:10000:0] OFF=36, LEN=16, flags=0x0 UNITS=1 [25(162)] ============================================================================== We can see that overwritten nodes occupy the same location on disk, and our extent hasn't beed destroyed (fragmented). 
Moreover, the modified parent node occupies the same location on disk (block #23). Let's now overwrite first 100K of this file in Write-Anywhere (Copy-on-Write) transaction mode: # mount /dev/sdb5 -o txmod=wa /mnt # dd if=/dev/zero of=/mnt/services bs=100K count=1 conv=notrunc # umount /mnt # debugfs.reiser4 -t /dev/sdb5 NODE (213) LEVEL=2 ITEMS=2 SPACE=3952 MKFS ID=0x4ed8c6de FLUSH=0x0 #0 NPTR (nodeptr40): [29:1(SD):0:2a:0] OFF=28, LEN=8, flags=0x0 [187] ------------------------------------------------------------------------------ #1 EXTENT (extent40): [2a:4(FB):73657276696365:10000:0] OFF=36, LEN=32, flags=0x0 UNITS=2 [188(25) 50(137)] ============================================================================== We can see, that first 100K (25 blocks) has been relocated in accordance with "Write-Anywhere" transaction model: initial extent has been split into 2 ones: first unit consists of 25 relocated blocks, which start at block #188, and second unit consists of 137 blocks, which occupy the same location on disk. Modified parent also got new location (block #213 - was #23). Let's calculate total number of IOs issued when overwriting the file in different modes: 1) '''Journalling''' 50 blocks were submitted for data modification (25 has been written to journal, and 25 to permanent location); 2 blocks were submitted to modify parent (block #23 in the dump) (1 to journal, and 1 to permanent location); 2 blocks to modify bitmap (1 to journal, and 1 to permanent location) 2 blocks to modify superblock (1 to journal, and 1 to permanent location) -------------------- Total: 56 blocks. 
2) '''Write-Anywhere (Copy-on-Write)''' 25 blocks were submitted (relocated) for data modifications; 1 block was submitted to modify parent, which got new location #213; 2 blocks were submitted to modify bitmap (1 to journal, and 1 to permanent location); 2 blocks were submitted to modify superblock (1 to journal, and 1 to permanent location); NOTE: system blocks (bitmaps, superblock, etc) can not be relocated in reiser4, so we always commit them via journal. --------------------- Total: 30 blocks. So we have 56 IOs issued in journalling mode against 30 IOs in Write-Anywhere. However, fragmentation is a payment for the smaller number of IOs in Write-Anywhere mode (see the last dump, where we have 2 extents). So this transaction model is only for SSD drives, as they are not sensitive to external fragmentation. Again, "journal" is for HDD, and "wa" is for SSD, please, don't confuse! ---------------------------------------------------------------------- MOUNT OPTION INTENDED FOR DEFAULT ---------------------------------------------------------------------- txmod=journal HDD users no ---------------------------------------------------------------------- txmod=wa SSD users no ---------------------------------------------------------------------- txmod=hybrid HDD users, who don't perform yes a lot of random overwrites ---------------------------------------------------------------------- e701e18836d57ded0a64e651dcae4a2cb4c6d339 3771 3761 2014-05-08T14:57:17Z Edward 4 Specify reiser4 stuff with the feature "different transaction models" Reiser4 supports multiple transaction models. As you probably know, all other file systems implement only a single transaction model. That is, they all are either only journalling (ext3/4, ReiserFS(v3), XFS, jfs, ...), or only "write-anywhere" (ZFS, Btrfs, etc). 
However, journalling file systems are not the best choice for SSD drives (as they issue larger number of IOs because of double writes - first you should write to journal, and then to the permanent location on disk. As you guess, larger number of IOs means performance drop and reduced life of SSD drives. As to "write-anywhere" file systems: they work badly with HDD drives. Indeed, in accordance with this transaction model you can not overwrite blocks on disk. Instead, you should write the modified buffers to different location, and after making sure that they have been written successfully, deallocate old blocks (sometimes this transaction model is called "Copy-on-Write", but we will use the historically first name "Write-Anywhere"). Such mandatory relocations lead to rapid external fragmentation, especially when you perform a lot of overwrites at random offsets. Respectively, the performance rapidly degrades. To improve the situation you need to incessantly run defragmentation tools. Starting from reiser4-for-3.14.1 patch, Reiser4 users now can choose a transaction model which is most suitable for their devices. This is very simple: just specify it by respective mount option. With the patch applied you will have 3 options: 1) '''Journalling''' (mount option "txmod=journal"). In this mode all overwritten buffers (nodes) will be committed via journal (I remind that instead of obsolete "journal block devices" Reiser4 uses more advanced technique of wandering logs). This mode is for HDD users, who complained about fragmentation of reiser4 volumes. I imagine, that this is not a 100% panacea against fragmentation, but it is better than nothing: in this mode the situation with fragmentation has to be not worse than in ReiserFS(v3)! Alas, the 100% panacea (reiser4 repacker) is still a long-term todo. 2) '''Write-Anywhere, aka Copy-on-Write''' (mount option "txmod=wa") All modified nodes in this mode will get new location on disk (like in ZFS, Btrfs, etc). 
In this mode reiser4 doesn't make active attempts to defragment atoms. In this mode reiser4 will issue minimal number of IOs, however reiser4 volumes will be rapidly fragmented. This option is only for SSD users. 3) '''Hybrid transaction model''' (mount option "txmod=hybrid") This is the default model suggested by Hans Reiser and Josh MacDonald in ~2002. This model uses an advanced feature of reiser4 transaction manager, so-called "compound checkpoints", which means that a part of dirty nodes is committed via journal (overwrite), and another part is committed via write-anywhere technique (i.e. gets another location on disk). All relocate-overwrite decisions in this mode are results of attempts to defragment locality of atoms that are to be committed. Clean nodes of this locality also can be involved to the commit process (their location on disk will be changed, if it provides excellent results). In this model number of issued IOs is not so large as in traditional Journalling model, and fragmentation is not so rapid as in traditional Write-Anywhere (CoW) model. However, such local defragmentation doesn't help a lot in some cases of workload, and I periodically get complaints from users about degradation of reiser4 volumes. So, this model is for HDD users, who don't perform a lot of random overwrites. Once the repacker is ready, I'll recommend this mode for all HDD users (just because pure journalling is anyway suboptimal for HDD drives). Implementation details We introduce a new layer/interface TXMOD (Transaction MODel) called at flush time for reiser4 atoms. Every plugin of this interface is a high-level block allocator, which assigns block numbers to dirty nodes, and, thereby, decides, how those nodes will be committed. Every dirty node of reiser4 atom can be committed by either of the following two ways: 1) via journal; 2) using "write-anywhere" technique. 
If the allocator doesn't change on-disk location of a node, then this node will be committed using journalling technique (overwrite). Otherwise, it will be committed via write-anywhere technique (relocate) relocate <---- allocate --- > overwrite So, in our interpretation the two traditional "classic" strategies in committing transactions (journalling and "write-anywhere") are just two boundary cases: 1) when all nodes are overwritten, and 2) when all nodes are relocated. Besides those 2 boundary cases we can implement the infinite set of their various combinations, so that user can choose what is really suitable for his needs. How it looks in practice Let's create a large enough file on a reiser4 partition (let it be a 645K /etc/services): # mkfs.reiser4 -o create=reg40 /dev/sdb5 # mount /dev/sdb5 /mnt # cp /etc/services /mnt/. # umount /mnt # debugfs.reiser4 -t /dev/sdb5 NODE (23) LEVEL=2 ITEMS=2 SPACE=3968 MKFS ID=0x4ed8c6de FLUSH=0x0 #0 NPTR (nodeptr40): [29:1(SD):0:2a:0] OFF=28, LEN=8, flags=0x0 [24] ------------------------------------------------------------------------------ #1 EXTENT (extent40): [2a:4(FB):73657276696365:10000:0] OFF=36, LEN=16, flags=0x0 UNITS=1 [25(162)] ============================================================================== We can see that file data is represented by a single extent of 162 blocks starting at block #25. 
Let's overwrite first 100K of this file in journalling transaction model: # mount /dev/sdb5 -o txmod=journal /mnt # dd if=/dev/zero of=/mnt/services bs=100K count=1 conv=notrunc # umount /mnt # debugfs.reiser4 -t /dev/sdb5 NODE (23) LEVEL=2 ITEMS=2 SPACE=3968 MKFS ID=0x4ed8c6de FLUSH=0x0 #0 NPTR (nodeptr40): [29:1(SD):0:2a:0] OFF=28, LEN=8, flags=0x0 [24] ------------------------------------------------------------------------------ #1 EXTENT (extent40): [2a:4(FB):73657276696365:10000:0] OFF=36, LEN=16, flags=0x0 UNITS=1 [25(162)] ============================================================================== We can see that overwritten nodes occupy the same location on disk, and our extent hasn't beed destroyed (fragmented). Moreover, the modified parent node occupies the same location on disk (block #23). Let's now overwrite first 100K of this file in Write-Anywhere (Copy-on-Write) transaction mode: # mount /dev/sdb5 -o txmod=wa /mnt # dd if=/dev/zero of=/mnt/services bs=100K count=1 conv=notrunc # umount /mnt # debugfs.reiser4 -t /dev/sdb5 NODE (213) LEVEL=2 ITEMS=2 SPACE=3952 MKFS ID=0x4ed8c6de FLUSH=0x0 #0 NPTR (nodeptr40): [29:1(SD):0:2a:0] OFF=28, LEN=8, flags=0x0 [187] ------------------------------------------------------------------------------ #1 EXTENT (extent40): [2a:4(FB):73657276696365:10000:0] OFF=36, LEN=32, flags=0x0 UNITS=2 [188(25) 50(137)] ============================================================================== We can see, that first 100K (25 blocks) has been relocated in accordance with "Write-Anywhere" transaction model: initial extent has been split into 2 ones: first unit consists of 25 relocated blocks, which start at block #188, and second unit consists of 137 blocks, which occupy the same location on disk. Modified parent also got new location (block #213 - was #23). 
Let's calculate total number of IOs issued when overwriting the file in different modes: 1) '''Journalling''' 50 blocks were submitted for data modification (25 has been written to journal, and 25 to permanent location); 2 blocks were submitted to modify parent (block #23 in the dump) (1 to journal, and 1 to permanent location); 2 blocks to modify bitmap (1 to journal, and 1 to permanent location) 2 blocks to modify superblock (1 to journal, and 1 to permanent location) -------------------- Total: 56 blocks. 2) '''Write-Anywhere (Copy-on-Write)''' 25 blocks were submitted (relocated) for data modifications; 1 block was submitted to modify parent, which got new location #213; 2 blocks were submitted to modify bitmap (1 to journal, and 1 to permanent location); 2 blocks were submitted to modify superblock (1 to journal, and 1 to permanent location); NOTE: system blocks (bitmaps, superblock, etc) can not be relocated in reiser4, so we always commit them via journal. --------------------- Total: 30 blocks. So we have 56 IOs issued in journalling mode against 30 IOs in Write-Anywhere. However, fragmentation is a payment for the smaller number of IOs in Write-Anywhere mode (see the last dump, where we have 2 extents). So this transaction model is only for SSD drives, as they are not sensitive to external fragmentation. Again, "journal" is for HDD, and "wa" is for SSD, please, don't confuse! 
---------------------------------------------------------------------- MOUNT OPTION INTENDED FOR DEFAULT ---------------------------------------------------------------------- txmod=journal HDD users no ---------------------------------------------------------------------- txmod=wa SSD users no ---------------------------------------------------------------------- txmod=hybrid HDD users, who don't perform yes a lot of random overwrites ---------------------------------------------------------------------- 7f1660b8a0bdc4ce9d2ebf3d338fb98b30248fda 3761 2014-05-08T14:54:02Z Edward 4 Create a page with the description of transaction models supported by Reiser4 Reiser4 supports multiple transaction models. As you probably know, all other file systems implement only a single transaction model. That is, they all are either only journalling (ext3/4, ReiserFS(v3), XFS, jfs, ...), or only "write-anywhere" (ZFS, Btrfs, etc). However, journalling file systems are not the best choice for SSD drives (as they issue larger number of IOs because of double writes - first you should write to journal, and then to the permanent location on disk. As you guess, larger number of IOs means performance drop and reduced life of SSD drives. As to "write-anywhere" file systems: they work badly with HDD drives. Indeed, in accordance with this transaction model you can not overwrite blocks on disk. Instead, you should write the modified buffers to different location, and after making sure that they have been written successfully, deallocate old blocks (sometimes this transaction model is called "Copy-on-Write", but we will use the historically first name "Write-Anywhere"). Such mandatory relocations lead to rapid external fragmentation, especially when you perform a lot of overwrites at random offsets. Respectively, the performance rapidly degrades. To improve the situation you need to incessantly run defragmentation tools. 
Reiser4 users now can choose a transaction model which is most suitable for their devices. This is very simple: just specify it by respective mount option. With the patch applied you will have 3 options: 1) '''Journalling''' (mount option "txmod=journal"). In this mode all overwritten buffers (nodes) will be committed via journal (I remind that instead of obsolete "journal block devices" Reiser4 uses more advanced technique of wandering logs). This mode is for HDD users, who complained about fragmentation of reiser4 volumes. I imagine, that this is not a 100% panacea against fragmentation, but it is better than nothing: in this mode the situation with fragmentation has to be not worse than in ReiserFS(v3)! Alas, the 100% panacea (reiser4 repacker) is still a long-term todo. 2) '''Write-Anywhere, aka Copy-on-Write''' (mount option "txmod=wa") All modified nodes in this mode will get new location on disk (like in ZFS, Btrfs, etc). In this mode reiser4 doesn't make active attempts to defragment atoms. In this mode reiser4 will issue minimal number of IOs, however reiser4 volumes will be rapidly fragmented. This option is only for SSD users. 3) '''Hybrid transaction model''' (mount option "txmod=hybrid") This is the default model suggested by Hans Reiser and Josh MacDonald in ~2002. This model uses an advanced feature of reiser4 transaction manager, so-called "compound checkpoints", which means that a part of dirty nodes is committed via journal (overwrite), and another part is committed via write-anywhere technique (i.e. gets another location on disk). All relocate-overwrite decisions in this mode are results of attempts to defragment locality of atoms that are to be committed. Clean nodes of this locality also can be involved to the commit process (their location on disk will be changed, if it provides excellent results). 
In this model the number of issued IOs is not as large as in the traditional journalling model, and fragmentation is not as rapid as in the traditional write-anywhere (CoW) model. However, such local defragmentation does not help much under some workloads, and I periodically get complaints from users about degradation of reiser4 volumes. So this model is for HDD users who do not perform a lot of random overwrites. Once the repacker is ready, I will recommend this mode for all HDD users (simply because pure journalling is suboptimal for HDDs anyway).

= Implementation details =

We introduce a new layer/interface, TXMOD (Transaction MODel), called at flush time for reiser4 atoms. Every plugin of this interface is a high-level block allocator which assigns block numbers to dirty nodes and thereby decides how those nodes will be committed. Every dirty node of a reiser4 atom can be committed in either of two ways: 1) via the journal; 2) using the write-anywhere technique. If the allocator does not change the on-disk location of a node, that node is committed via the journalling technique (overwrite); otherwise it is committed via the write-anywhere technique (relocate):

 relocate <---- allocate ----> overwrite

So, in our interpretation, the two traditional "classic" strategies for committing transactions (journalling and write-anywhere) are just two boundary cases: 1) all nodes are overwritten; 2) all nodes are relocated. Between those two boundary cases we can implement an infinite set of combinations, so that users can choose what really suits their needs.

= How it looks in practice =

Let's create a large enough file on a reiser4 partition (say, a 645K copy of /etc/services):

 # mkfs.reiser4 -o create=reg40 /dev/sdb5
 # mount /dev/sdb5 /mnt
 # cp /etc/services /mnt/.
 # umount /mnt
 # debugfs.reiser4 -t /dev/sdb5
 NODE (23) LEVEL=2 ITEMS=2 SPACE=3968 MKFS ID=0x4ed8c6de FLUSH=0x0
 #0 NPTR (nodeptr40): [29:1(SD):0:2a:0] OFF=28, LEN=8, flags=0x0 [24]
 ------------------------------------------------------------------------------
 #1 EXTENT (extent40): [2a:4(FB):73657276696365:10000:0] OFF=36, LEN=16, flags=0x0 UNITS=1 [25(162)]
 ==============================================================================

We can see that the file data is represented by a single extent of 162 blocks starting at block #25. Now let's overwrite the first 100K of this file in the journalling transaction model:

 # mount /dev/sdb5 -o txmod=journal /mnt
 # dd if=/dev/zero of=/mnt/services bs=100K count=1 conv=notrunc
 # umount /mnt
 # debugfs.reiser4 -t /dev/sdb5
 NODE (23) LEVEL=2 ITEMS=2 SPACE=3968 MKFS ID=0x4ed8c6de FLUSH=0x0
 #0 NPTR (nodeptr40): [29:1(SD):0:2a:0] OFF=28, LEN=8, flags=0x0 [24]
 ------------------------------------------------------------------------------
 #1 EXTENT (extent40): [2a:4(FB):73657276696365:10000:0] OFF=36, LEN=16, flags=0x0 UNITS=1 [25(162)]
 ==============================================================================

The overwritten nodes occupy the same location on disk, and our extent has not been destroyed (fragmented). Moreover, the modified parent node also keeps its old location on disk (block #23).
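As a quick sanity check, the UNITS field of the extent item tells you how many fragments the file occupies. A small sketch that pulls it out of the dump text (the line below is copied verbatim from the journal-mode dump above, not produced by a live debugfs.reiser4 run):

```shell
# Extent item copied verbatim from the journal-mode debugfs.reiser4 dump above.
line='#1 EXTENT (extent40): [2a:4(FB):73657276696365:10000:0] OFF=36, LEN=16, flags=0x0 UNITS=1 [25(162)]'
# Extract the digits after "UNITS=" with a POSIX sed capture group.
units=$(printf '%s\n' "$line" | sed -n 's/.*UNITS=\([0-9][0-9]*\).*/\1/p')
echo "$units"
```

UNITS=1 means the file is still a single extent; after the write-anywhere run below the same field reads UNITS=2.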
Now let's overwrite the first 100K of the file in the write-anywhere (copy-on-write) transaction model:

 # mount /dev/sdb5 -o txmod=wa /mnt
 # dd if=/dev/zero of=/mnt/services bs=100K count=1 conv=notrunc
 # umount /mnt
 # debugfs.reiser4 -t /dev/sdb5
 NODE (213) LEVEL=2 ITEMS=2 SPACE=3952 MKFS ID=0x4ed8c6de FLUSH=0x0
 #0 NPTR (nodeptr40): [29:1(SD):0:2a:0] OFF=28, LEN=8, flags=0x0 [187]
 ------------------------------------------------------------------------------
 #1 EXTENT (extent40): [2a:4(FB):73657276696365:10000:0] OFF=36, LEN=32, flags=0x0 UNITS=2 [188(25) 50(137)]
 ==============================================================================

The first 100K (25 blocks) has been relocated in accordance with the write-anywhere transaction model: the initial extent has been split in two. The first unit consists of the 25 relocated blocks starting at block #188; the second unit consists of 137 blocks which keep their old location on disk. The modified parent also got a new location (block #213; it was #23).

Let's count the total number of IOs issued when overwriting the file in the different modes:

1) '''Journalling'''
 50 blocks submitted for the data modification (25 written to the journal, 25 to the permanent location);
 2 blocks to modify the parent (block #23 in the dump) (1 to the journal, 1 to the permanent location);
 2 blocks to modify the bitmap (1 to the journal, 1 to the permanent location);
 2 blocks to modify the superblock (1 to the journal, 1 to the permanent location).
 --------------------
 Total: 56 blocks.

2) '''Write-anywhere (copy-on-write)'''
 25 blocks submitted (relocated) for the data modification;
 1 block to modify the parent, which got the new location #213;
 2 blocks to modify the bitmap (1 to the journal, 1 to the permanent location);
 2 blocks to modify the superblock (1 to the journal, 1 to the permanent location).

NOTE: system blocks (bitmaps, superblock, etc.) cannot be relocated in reiser4, so they are always committed via the journal.
 ---------------------
 Total: 30 blocks.

So we have 56 IOs issued in journalling mode against 30 IOs in write-anywhere mode. Fragmentation, however, is the price of the smaller number of IOs in write-anywhere mode (see the last dump, where we have two extents). So this transaction model is only for SSDs, which are not sensitive to external fragmentation. Again: "journal" is for HDD and "wa" is for SSD; please don't confuse them!

 ----------------------------------------------------------------------
 MOUNT OPTION    INTENDED FOR                         DEFAULT
 ----------------------------------------------------------------------
 txmod=journal   HDD users                            no
 ----------------------------------------------------------------------
 txmod=wa        SSD users                            no
 ----------------------------------------------------------------------
 txmod=hybrid    HDD users who don't perform          yes
                 a lot of random overwrites
 ----------------------------------------------------------------------

Reiser4progs

The tools to maintain a Reiser4 filesystem are called <tt>reiser4progs</tt> and can be found on [http://sourceforge.net/projects/reiser4/ Sourceforge]. If your distribution does not ship a pre-compiled package, you have to build them manually.

= <tt>libaal</tt> =

To compile <tt>reiser4progs</tt>, we'll need <tt>libaal</tt> too:

 VER=1.0.6
 wget https://downloads.sourceforge.net/project/reiser4/reiser4-utils/libaal/libaal-$VER.tar.gz
 tar -xzf libaal-$VER.tar.gz
 cd libaal-$VER

Alternatively, the source can also be checked out via [https://git-scm.com/ Git]:

 git clone https://github.com/edward6/libaal libaal-git
 cd libaal-git
 sh ./prepare

Continue with:

 ./configure --prefix=/opt/libaal
 make && sudo make install
 cd /opt/libaal && ln -s lib64 lib   # for 64-bit systems!
= reiser4progs =

Now we can build <tt>reiser4progs</tt>:

 sudo apt-get install libreadline-dev uuid-dev                  # Debian, Ubuntu
 sudo yum install readline-devel libuuid-devel glibc-static     # openSUSE, Fedora

 VER=1.1.0
 wget https://downloads.sourceforge.net/project/reiser4/reiser4-utils/reiser4progs/reiser4progs-$VER.tar.gz
 tar -xzf reiser4progs-$VER.tar.gz
 cd reiser4progs-$VER

Alternatively, the source can also be checked out via [https://git-scm.com/ Git]:

 git clone https://github.com/edward6/reiser4progs.git reiser4progs-git
 cd reiser4progs-git
 sh ./prepare

Continue with:

 ./configure --prefix=/opt/reiser4progs --with-libaal=/opt/libaal
 make && sudo make install

Note: if <tt>libaal</tt> has been installed from a distribution package (<tt>libaal-dev</tt> resp. <tt>libaal-devel</tt>), the <tt>--with-libaal</tt> flag can be omitted.

= See also =

* [[Debug Reiser4progs]]

[[category:Reiser4]]
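The <tt>VER</tt> variable used above is plain shell string interpolation: update one variable and every URL and directory name follows. A minimal sketch (1.1.0 is simply the release current when this page was written; check SourceForge for the latest):

```shell
# The VER pattern from this page: one variable drives the download URL.
VER=1.1.0   # example release; check SourceForge for the current version
url="https://downloads.sourceforge.net/project/reiser4/reiser4-utils/reiser4progs/reiser4progs-$VER.tar.gz"
echo "$url"
```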
= reiser4progs = Now we can build <tt>reiser4progs</tt>: sudo apt-get install libreadline-dev uuid-dev # Debian, Ubuntu sudo yum install readline-devel libuuid-devel <span class=plainlinks>[https://marc.info/?l=reiserfs-devel&m=114871738809424 glibc-static]</span> # openSUSE, Fedora VER=<font color="red">1.1.0</font> wget https://downloads.sourceforge.net/project/reiser4/reiser4-utils/reiser4progs/reiser4progs-$VER.tar.gz tar -xzf reiser4progs-$VER.tar.gz cd reiser4progs-$VER Alternatively, the source can also be checked out from via [https://git-scm.com/ Git]: git clone https://github.com/edward6/reiser4progs.git reiser4progs-git cd reiser4progs-git sh ./prepare Continue with: ./configure --prefix=/opt/reiser4progs ''--with-libaal=/opt/libaal'' make && sudo make install Note: if <tt>libaal</tt> has been installed from a distribution package (<tt>libaal-dev</tt> resp. <tt>libaal-devel</tt>), the <tt>--with-libaal</tt> flag can be omitted! = See also = * [[Debug Reiser4progs]] [[category:Reiser4]] 703a7ce5224a482340f87b5b93662700c65bba98 4259 4163 2017-06-25T00:20:31Z Chris goe 2 link to [[Debug Reiser4progs]] The tools to maintain a Reiser4 filesystem are called <tt>reiser4progs</tt> and can be found on [http://sourceforge.net/projects/reiser4/ Sourceforge]. If your distribution does not ship a pre-compiled package, we have to build this manually. = <tt>libaal</tt> = To compile <tt>reiser4progs</tt>, we'll need <tt>libaal</tt> too: VER=<font color="red">1.0.6</font> wget http://downloads.sourceforge.net/project/reiser4/reiser4-utils/libaal/libaal-$VER.tar.gz tar -xzf libaal-$VER.tar.gz cd libaal-$VER Alternatively, the source can also be checked out from via [https://git-scm.com/ Git]: git clone https://github.com/edward6/libaal libaal-git cd libaal-git sh ./prepare Continue with: ./configure --prefix=/opt/libaal make && sudo make install cd /opt/libaal && ln -s lib64 lib # For 64 bit systems! 
= reiser4progs = Now we can build <tt>reiser4progs</tt>: sudo apt-get install libreadline-dev uuid-dev # Debian, Ubuntu sudo yum install readline-devel libuuid-devel <span class=plainlinks>[http://marc.info/?l=reiserfs-devel&m=114871738809424 glibc-static]</span> # openSUSE, Fedora VER=<font color="red">1.1.0</font> wget http://downloads.sourceforge.net/project/reiser4/reiser4-utils/reiser4progs/reiser4progs-$VER.tar.gz tar -xzf reiser4progs-$VER.tar.gz cd reiser4progs-$VER Alternatively, the source can also be checked out from via [https://git-scm.com/ Git]: git clone https://github.com/edward6/reiser4progs.git reiser4progs-git cd reiser4progs-git sh ./prepare Continue with: ./configure --prefix=/opt/reiser4progs ''--with-libaal=/opt/libaal'' make && sudo make install Note: if <tt>libaal</tt> has been installed from a distribution package (<tt>libaal-dev</tt> resp. <tt>libaal-devel</tt>), the <tt>--with-libaal</tt> flag can be omitted! = See also = * [[Debug Reiser4progs]] [[category:Reiser4]] fe150b5addf9e61f1e4ad7571aa454b51d7e71a7 4163 4074 2016-09-24T21:58:15Z Chris goe 2 mention the new Git trees The tools to maintain a Reiser4 filesystem are called <tt>reiser4progs</tt> and can be found on [http://sourceforge.net/projects/reiser4/ Sourceforge]. If your distribution does not ship a pre-compiled package, we have to build this manually. = <tt>libaal</tt> = To compile <tt>reiser4progs</tt>, we'll need <tt>libaal</tt> too: VER=<font color="red">1.0.6</font> wget http://downloads.sourceforge.net/project/reiser4/reiser4-utils/libaal/libaal-$VER.tar.gz tar -xzf libaal-$VER.tar.gz cd libaal-$VER Alternatively, the source can also be checked out from via [https://git-scm.com/ Git]: git clone https://github.com/edward6/libaal libaal-git cd libaal-git sh ./prepare Continue with: ./configure --prefix=/opt/libaal make && sudo make install cd /opt/libaal && ln -s lib64 lib # For 64 bit systems! 
= reiser4progs = Now we can build <tt>reiser4progs</tt>: sudo apt-get install libreadline-dev uuid-dev # Debian, Ubuntu sudo yum install readline-devel libuuid-devel <span class=plainlinks>[http://marc.info/?l=reiserfs-devel&m=114871738809424 glibc-static]</span> # openSUSE, Fedora VER=<font color="red">1.1.0</font> wget http://downloads.sourceforge.net/project/reiser4/reiser4-utils/reiser4progs/reiser4progs-$VER.tar.gz tar -xzf reiser4progs-$VER.tar.gz cd reiser4progs-$VER Alternatively, the source can also be checked out from via [https://git-scm.com/ Git]: git clone https://github.com/edward6/reiser4progs.git reiser4progs-git cd reiser4progs-git sh ./prepare Continue with: ./configure --prefix=/opt/reiser4progs ''--with-libaal=/opt/libaal'' make && sudo make install Note: if <tt>libaal</tt> has been installed from a distribution package (<tt>libaal-dev</tt> resp. <tt>libaal-devel</tt>), the <tt>--with-libaal</tt> flag can be omitted! [[category:Reiser4]] 901c1bff1ddcba1f6b87defb1066108558d0345e 4074 4073 2015-08-30T23:28:45Z Chris goe 2 use $VER so we only have to update this variable and then copy & paste the rest; glibc-devel-static has been renamed to glibc-static in Fedora The tools to maintain a Reiser4 filesystem are called <tt>reiser4progs</tt> and can be found on [http://sourceforge.net/projects/reiser4/ Sourceforge]. If your distribution does not ship a pre-compiled package, we have to build this manually. To compile <tt>reiser4progs</tt>, we'll need <tt>libaal</tt> too: VER=<font color="red">1.0.6</font> wget http://downloads.sourceforge.net/project/reiser4/reiser4-utils/libaal/libaal-$VER.tar.gz tar -xzf libaal-$VER.tar.gz cd libaal-$VER ./configure --prefix=/opt/libaal make && sudo make install cd /opt/libaal && ln -s lib64 lib # For 64 bit systems! 
Now we can build <tt>reiser4progs</tt>: sudo apt-get install libreadline-dev uuid-dev # Debian, Ubuntu sudo yum install readline-devel libuuid-devel <span class=plainlinks>[http://marc.info/?l=reiserfs-devel&m=114871738809424 glibc-static]</span> # openSUSE, Fedora VER=<font color="red">1.1.0</font> wget http://downloads.sourceforge.net/project/reiser4/reiser4-utils/reiser4progs/reiser4progs-$VER.tar.gz tar -xzf reiser4progs-$VER.tar.gz cd reiser4progs-$VER ./configure --prefix=/opt/reiser4progs ''--with-libaal=/opt/libaal'' make && sudo make install Note: if <tt>libaal</tt> has been installed from a distribution package (<tt>libaal-dev</tt> resp. <tt>libaal-devel</tt>), the <tt>--with-libaal</tt> flag can be omitted! [[category:Reiser4]] f6b2b408fe15becaf668aae98d9c98347fd84aa6 4073 2721 2015-08-30T23:04:42Z Edward 4 The tools to maintain a Reiser4 filesystem are called <tt>reiser4progs</tt> and can be found on [http://sourceforge.net/projects/reiser4/ Sourceforge]. If your distribution does not ship a pre-compiled package, we have to build this manually. To compile <tt>reiser4progs</tt>, we'll need <tt>libaal</tt> too: wget http://downloads.sourceforge.net/project/reiser4/reiser4-utils/libaal/libaal-1.0.6.tar.gz tar -xzf libaal-1.0.6.tar.gz cd libaal-1.0.6 ./configure --prefix=/opt/libaal make && sudo make install cd /opt/libaal && ln -s lib64 lib # For 64 bit systems! 
Now we can build <tt>reiser4progs</tt>: sudo apt-get install libreadline-dev uuid-dev # Debian, Ubuntu sudo yum install readline-devel libuuid-devel <span class=plainlinks>[http://marc.info/?l=reiserfs-devel&m=114871738809424 glibc-devel-static]</span> # openSUSE, Fedora wget http://downloads.sourceforge.net/project/reiser4/reiser4-utils/reiser4progs/reiser4progs-1.1.0.tar.gz tar -xzf reiser4progs-1.1.0.tar.gz cd reiser4progs-1.1.0 ./configure --prefix=/opt/reiser4progs ''--with-libaal=/opt/libaal'' make && sudo make install Note: if <tt>libaal</tt> has been installed from a distribution package (<tt>libaal-dev</tt> resp. <tt>libaal-devel</tt>), the <tt>--with-libaal</tt> flag can be omitted! [[category:Reiser4]] a9075306d7b66b7b6ebbc223cec8483ba0dcf874 2721 2441 2013-05-10T18:05:05Z Edward 4 The tools to maintain a Reiser4 filesystem are called <tt>reiser4progs</tt> and can be found on [http://sourceforge.net/projects/reiser4/ Sourceforge]. If your distribution does not ship a pre-compiled package, we have to build this manually. To compile <tt>reiser4progs</tt>, we'll need <tt>libaal</tt> too: wget http://downloads.sourceforge.net/project/reiser4/reiser4-utils/libaal/libaal-1.0.5.tar.gz tar -xzf libaal-1.0.5.tar.gz cd libaal-1.0.5 ./configure --prefix=/opt/libaal make && sudo make install cd /opt/libaal && ln -s lib64 lib # For 64 bit systems! Now we can build <tt>reiser4progs</tt>: sudo apt-get install libreadline-dev uuid-dev # Debian, Ubuntu sudo yum install readline-devel libuuid-devel <span class=plainlinks>[http://marc.info/?l=reiserfs-devel&m=114871738809424 glibc-devel-static]</span> # openSUSE, Fedora wget http://downloads.sourceforge.net/project/reiser4/reiser4-utils/reiser4progs/reiser4progs-1.0.8.tar.gz tar -xzf reiser4progs-1.0.8.tar.gz cd reiser4progs-1.0.8 ./configure --prefix=/opt/reiser4progs ''--with-libaal=/opt/libaal'' make && sudo make install Note: if <tt>libaal</tt> has been installed from a distribution package (<tt>libaal-dev</tt> resp. 
<tt>libaal-devel</tt>), the <tt>--with-libaal</tt> flag can be omitted! [[category:Reiser4]] c2801e6defe6fad9d2c6c0722cc24bb2dfe56bbc 2441 2411 2012-09-25T16:54:32Z Chris goe 2 yum! The tools to maintain a Reiser4 filesystem are called <tt>reiser4progs</tt> and can be found on [http://sourceforge.net/projects/reiser4/ Sourceforge]. If your distribution does not ship a pre-compiled package, we have to build this manually. To compile <tt>reiser4progs</tt>, we'll need <tt>libaal</tt> too: wget http://downloads.sourceforge.net/project/reiser4/reiser4-utils/libaal/libaal-1.0.5.tar.gz tar -xzf libaal-1.0.5.tar.gz cd libaal-1.0.5 ./configure --prefix=/opt/libaal make && sudo make install cd /opt/libaal && ln -s lib64 lib # For 64 bit systems! Now we can build <tt>reiser4progs</tt>: sudo apt-get install libreadline-dev uuid-dev # Debian, Ubuntu sudo yum install readline-devel libuuid-devel <span class=plainlinks>[http://marc.info/?l=reiserfs-devel&m=114871738809424 glibc-devel-static]</span> # openSUSE, Fedora wget http://downloads.sourceforge.net/project/reiser4/reiser4-utils/reiser4progs/reiser4progs-1.0.7.tar.gz tar -xzf reiser4progs-1.0.7.tar.gz cd reiser4progs-1.0.7 ./configure --prefix=/opt/reiser4progs ''--with-libaal=/opt/libaal'' make && sudo make install Note: if <tt>libaal</tt> has been installed from a distribution package (<tt>libaal-dev</tt> resp. <tt>libaal-devel</tt>), the <tt>--with-libaal</tt> flag can be omitted! [[category:Reiser4]] 37aebdd838d76c3af2644b454acc5a63be2151a9 2411 2401 2012-09-24T23:10:19Z Chris goe 2 +glibc-devel-static The tools to maintain a Reiser4 filesystem are called <tt>reiser4progs</tt> and can be found on [http://sourceforge.net/projects/reiser4/ Sourceforge]. If your distribution does not ship a pre-compiled package, we have to build this manually. 
To compile <tt>reiser4progs</tt>, we'll need <tt>libaal</tt> too: wget http://downloads.sourceforge.net/project/reiser4/reiser4-utils/libaal/libaal-1.0.5.tar.gz tar -xzf libaal-1.0.5.tar.gz cd libaal-1.0.5 ./configure --prefix=/opt/libaal make && sudo make install cd /opt/libaal && ln -s lib64 lib # For 64 bit systems! Now we can build <tt>reiser4progs</tt>: sudo apt-get install libreadline-dev uuid-dev # Debian, Ubuntu sudo apt-get install readline-devel libuuid-devel <span class=plainlinks>[http://marc.info/?l=reiserfs-devel&m=114871738809424 glibc-devel-static]</span> # openSUSE, Fedora wget http://downloads.sourceforge.net/project/reiser4/reiser4-utils/reiser4progs/reiser4progs-1.0.7.tar.gz tar -xzf reiser4progs-1.0.7.tar.gz cd reiser4progs-1.0.7 ./configure --prefix=/opt/reiser4progs ''--with-libaal=/opt/libaal'' make && sudo make install Note: if <tt>libaal</tt> has been installed from a distribution package (<tt>libaal-dev</tt> resp. <tt>libaal-devel</tt>), the <tt>--with-libaal</tt> flag can be omitted! [[category:Reiser4]] 09cf64f736ef1b7b0c7b34adb3eebf3efc4b1ae3 2401 2391 2012-09-24T22:54:23Z Chris goe 2 --with-libaal instead of cflags/ldflags The tools to maintain a Reiser4 filesystem are called <tt>reiser4progs</tt> and can be found on [http://sourceforge.net/projects/reiser4/ Sourceforge]. If your distribution does not ship a pre-compiled package, we have to build this manually. 
To compile <tt>reiser4progs</tt>, we'll need <tt>libaal</tt> too: wget http://downloads.sourceforge.net/project/reiser4/reiser4-utils/libaal/libaal-1.0.5.tar.gz tar -xzf libaal-1.0.5.tar.gz cd libaal-1.0.5 ./configure --prefix=/opt/libaal make && sudo make install Now we can build <tt>reiser4progs</tt>: sudo apt-get install libreadline-dev uuid-dev # Debian, Ubuntu sudo apt-get install readline-devel libuuid-devel # openSUSE, Fedora wget http://downloads.sourceforge.net/project/reiser4/reiser4-utils/reiser4progs/reiser4progs-1.0.7.tar.gz tar -xzf reiser4progs-1.0.7.tar.gz cd reiser4progs-1.0.7 ./configure --prefix=/opt/reiser4progs ''--with-libaal=/opt/libaal'' make && sudo make install Note: if <tt>libaal</tt> has been installed from a distribution package (<tt>libaal-dev</tt> resp. <tt>libaal-devel</tt>), the <tt>--with-libaal</tt> flag can be omitted! [[category:Reiser4]] 7714cf2f998cb1977c4e754e2317ba82aa9c394e 2391 2381 2012-09-24T22:41:51Z Chris goe 2 +explain The tools to maintain a Reiser4 filesystem are called <tt>reiser4progs</tt> and can be found on [http://sourceforge.net/projects/reiser4/ Sourceforge]. If your distribution does not ship a pre-compiled package, we have to build this manually. 
To compile <tt>reiser4progs</tt>, we'll need <tt>libaal</tt> too: wget http://downloads.sourceforge.net/project/reiser4/reiser4-utils/libaal/libaal-1.0.5.tar.gz tar -xzf libaal-1.0.5.tar.gz cd libaal-1.0.5 ./configure --prefix=/opt/libaal make && sudo make install Now we can build <tt>reiser4progs</tt>: sudo apt-get install libreadline-dev uuid-dev # Debian, Ubuntu sudo apt-get install readline-devel libuuid-devel # openSUSE, Fedora wget http://downloads.sourceforge.net/project/reiser4/reiser4-utils/reiser4progs/reiser4progs-1.0.7.tar.gz tar -xzf reiser4progs-1.0.7.tar.gz cd reiser4progs-1.0.7 CFLAGS="-I/opt/libaal/include" LDFLAGS="-L/opt/libaal/lib" ./configure --prefix=/opt/reiser4progs make && sudo make install Note: * Use ''<tt>LDFLAGS="-L/opt/libaal/lib"</tt>'' for 64 bit systems! * If <tt>libaal</tt> has been installed from a distribution package (<tt>libaal-dev</tt> resp. <tt>libaal-devel</tt>), the <tt>CFLAGS</tt> and <tt>LDFLAGS</tt> can be omitted! [[category:Reiser4]] 1afbd11d416742c0aeaf572738edbd68572bcde0 2381 1622 2012-09-24T21:59:33Z Chris goe 2 reiser4progs moved to sourceforge The tools to maintain a Reiser4 filesystem are called <tt>reiser4progs</tt> and can be found on [http://sourceforge.net/projects/reiser4/ Sourceforge]. 
To compile <tt>reiser4progs</tt>, we'll need <tt>libaal</tt> too: wget http://downloads.sourceforge.net/project/reiser4/reiser4-utils/libaal/libaal-1.0.5.tar.gz tar -xzf libaal-1.0.5.tar.gz cd libaal-1.0.5 ./configure --prefix=/opt/libaal make && sudo make install Now we can build <tt>reiser4progs</tt>: sudo apt-get install libreadline-dev uuid-dev # Debian, Ubuntu sudo apt-get install readline-devel libuuid-devel # openSUSE, Fedora wget http://downloads.sourceforge.net/project/reiser4/reiser4-utils/reiser4progs/reiser4progs-1.0.7.tar.gz tar -xzf reiser4progs-1.0.7.tar.gz cd reiser4progs-1.0.7 CFLAGS="-I/opt/libaal/include" LDFLAGS="-L/opt/libaal/lib" ./configure --prefix=/opt/reiser4progs make && sudo make install Note: use ''<tt>LDFLAGS="-L/opt/libaal/lib"</tt>'' for 64 bit systems! [[category:Reiser4]] 134ea1b67e05c90e80e8a352b836f4b1848c2058 1622 1621 2009-08-25T20:57:05Z Chris goe 2 gpg --recv-keys added The tools to maintain a Reiser4 filesystem are called <tt>reiser4progs</tt> and can be found on [http://www.kernel.org/pub/linux/utils/fs/reiser4/reiser4progs/ kernel.org]. The current version is [http://www.kernel.org/pub/linux/utils/fs/reiser4/reiser4progs/reiser4progs-1.0.7.tar.bz2 v1.0.7]. 
To compile <tt>reiser4progs</tt>, we'll need <tt>libaal</tt> too: $ wget http://www.kernel.org/pub/linux/utils/fs/reiser4/libaal/libaal-1.0.5.tar.bz2 $ wget http://www.kernel.org/pub/linux/utils/fs/reiser4/libaal/libaal-1.0.5.tar.bz2.sign $ gpg --recv-keys [http://kernel.org/signature.html 517D0F0E] $ gpg --verify libaal-1.0.5.tar.bz2.sign libaal-1.0.5.tar.bz2 gpg: Signature made Sun Apr 20 01:21:05 2008 CEST using DSA key ID 517D0F0E gpg: Good signature from "Linux Kernel Archives Verification Key <ftpadmin@kernel.org>" $ tar -xjf libaal-1.0.5.tar.bz2 $ cd libaal-1.0.5 $ ./configure --prefix=/opt/libaal && make && sudo make install Now we can build <tt>reiser4progs</tt>: $ sudo apt-get install libreadline-dev uuid-dev $ wget http://www.kernel.org/pub/linux/utils/fs/reiser4/reiser4progs/reiser4progs-1.0.7.tar.bz2 $ wget http://www.kernel.org/pub/linux/utils/fs/reiser4/reiser4progs/reiser4progs-1.0.7.tar.bz2.sign $ gpg --verify reiser4progs-1.0.7.tar.bz2.sign reiser4progs-1.0.7.tar.bz2 gpg: Signature made Mon Feb 9 17:43:05 2009 CET using DSA key ID 517D0F0E gpg: Good signature from "Linux Kernel Archives Verification Key <ftpadmin@kernel.org>" $ tar -xjf reiser4progs-1.0.7.tar.bz2 $ cd reiser4progs-1.0.7 $ CFLAGS="-I/opt/libaal/include" LDFLAGS="-L/opt/libaal/lib" \ ./configure --prefix=/opt/reiser4progs && make && sudo make install [[category:Reiser4]] e7db509e3457753f003b54cfa72fdfc73ff91dcf 1621 1564 2009-08-25T20:55:06Z Chris goe 2 formatting fixes The tools to maintain a Reiser4 filesystem are called <tt>reiser4progs</tt> and can be found on [http://www.kernel.org/pub/linux/utils/fs/reiser4/reiser4progs/ kernel.org]. The current version is [http://www.kernel.org/pub/linux/utils/fs/reiser4/reiser4progs/reiser4progs-1.0.7.tar.bz2 v1.0.7]. 
To compile <tt>reiser4progs</tt>, we'll need <tt>libaal</tt> too: $ wget http://www.kernel.org/pub/linux/utils/fs/reiser4/libaal/libaal-1.0.5.tar.bz2 $ wget http://www.kernel.org/pub/linux/utils/fs/reiser4/libaal/libaal-1.0.5.tar.bz2.sign $ gpg --verify libaal-1.0.5.tar.bz2.sign libaal-1.0.5.tar.bz2 gpg: Signature made Sun Apr 20 01:21:05 2008 CEST using DSA key ID [http://kernel.org/signature.html 517D0F0E] gpg: Good signature from "Linux Kernel Archives Verification Key <ftpadmin@kernel.org>" $ tar -xjf libaal-1.0.5.tar.bz2 $ cd libaal-1.0.5 $ ./configure --prefix=/opt/libaal && make && sudo make install Now we can build <tt>reiser4progs</tt>: $ sudo apt-get install libreadline-dev uuid-dev $ wget http://www.kernel.org/pub/linux/utils/fs/reiser4/reiser4progs/reiser4progs-1.0.7.tar.bz2 $ wget http://www.kernel.org/pub/linux/utils/fs/reiser4/reiser4progs/reiser4progs-1.0.7.tar.bz2.sign $ gpg --verify reiser4progs-1.0.7.tar.bz2.sign reiser4progs-1.0.7.tar.bz2 gpg: Signature made Mon Feb 9 17:43:05 2009 CET using DSA key ID [http://kernel.org/signature.html 517D0F0E] gpg: Good signature from "Linux Kernel Archives Verification Key <ftpadmin@kernel.org>" $ tar -xjf reiser4progs-1.0.7.tar.bz2 $ cd reiser4progs-1.0.7 $ CFLAGS="-I/opt/libaal/include" LDFLAGS="-L/opt/libaal/lib" \ ./configure --prefix=/opt/reiser4progs && make && sudo make install [[category:Reiser4]] 42aa4ed503cc81bdd6952d8f1300865689718791 1564 1504 2009-07-03T18:46:29Z Chris goe 2 libreadline-dev needed too The tools to maintain a Reiser4 filesystem are called <tt>reiser4progs</tt> and can be found on [http://www.kernel.org/pub/linux/utils/fs/reiser4/reiser4progs/ kernel.org]. The current version is [http://www.kernel.org/pub/linux/utils/fs/reiser4/reiser4progs/reiser4progs-1.0.7.tar.bz2 v1.0.7]. 
To compile <tt>reiser4progs</tt>, we'll need <tt>libaal</tt> too: <pre> $ wget http://www.kernel.org/pub/linux/utils/fs/reiser4/libaal/libaal-1.0.5.tar.bz2 $ wget http://www.kernel.org/pub/linux/utils/fs/reiser4/libaal/libaal-1.0.5.tar.bz2.sign $ gpg --verify libaal-1.0.5.tar.bz2.sign libaal-1.0.5.tar.bz2 gpg: Signature made Sun Apr 20 01:21:05 2008 CEST using DSA key ID 517D0F0E gpg: Good signature from "Linux Kernel Archives Verification Key <ftpadmin@kernel.org>" $ tar -xjf libaal-1.0.5.tar.bz2 $ cd libaal-1.0.5 $ ./configure --prefix=/opt/libaal && make && sudo make install </pre> Now we can build <tt>reiser4progs</tt>: <pre> $ sudo apt-get install libreadline-dev uuid-dev $ wget http://www.kernel.org/pub/linux/utils/fs/reiser4/reiser4progs/reiser4progs-1.0.7.tar.bz2 $ wget http://www.kernel.org/pub/linux/utils/fs/reiser4/reiser4progs/reiser4progs-1.0.7.tar.bz2.sign $ gpg --verify reiser4progs-1.0.7.tar.bz2.sign reiser4progs-1.0.7.tar.bz2 gpg: Signature made Mon Feb 9 17:43:05 2009 CET using DSA key ID 517D0F0E gpg: Good signature from "Linux Kernel Archives Verification Key <ftpadmin@kernel.org>" $ tar -xjf reiser4progs-1.0.7.tar.bz2 $ cd reiser4progs-1.0.7 $ CFLAGS="-I/opt/libaal/include" LDFLAGS="-L/opt/libaal/lib" \ ./configure --prefix=/opt/reiser4progs && make && sudo make install </pre> [[category:Reiser4]] aa2e61bce09697cc4e74883e268c4c8757b3f7f4 1504 1457 2009-06-27T17:18:25Z Chris goe 2 uuid-dev is needed too The tools to maintain a Reiser4 filesystem are called <tt>reiser4progs</tt> and can be found on [http://www.kernel.org/pub/linux/utils/fs/reiser4/reiser4progs/ kernel.org]. The current version is [http://www.kernel.org/pub/linux/utils/fs/reiser4/reiser4progs/reiser4progs-1.0.7.tar.bz2 v1.0.7]. 
To compile <tt>reiser4progs</tt>, we'll need <tt>libaal</tt> too: <pre> $ wget http://www.kernel.org/pub/linux/utils/fs/reiser4/libaal/libaal-1.0.5.tar.bz2 $ wget http://www.kernel.org/pub/linux/utils/fs/reiser4/libaal/libaal-1.0.5.tar.bz2.sign $ gpg --verify libaal-1.0.5.tar.bz2.sign libaal-1.0.5.tar.bz2 gpg: Signature made Sun Apr 20 01:21:05 2008 CEST using DSA key ID 517D0F0E gpg: Good signature from "Linux Kernel Archives Verification Key <ftpadmin@kernel.org>" $ tar -xjf libaal-1.0.5.tar.bz2 $ cd libaal-1.0.5 $ ./configure --prefix=/opt/libaal && make && sudo make install </pre> Now we can build <tt>reiser4progs</tt>: <pre> $ sudo apt-get install uuid-dev $ wget http://www.kernel.org/pub/linux/utils/fs/reiser4/reiser4progs/reiser4progs-1.0.7.tar.bz2 $ wget http://www.kernel.org/pub/linux/utils/fs/reiser4/reiser4progs/reiser4progs-1.0.7.tar.bz2.sign $ gpg --verify reiser4progs-1.0.7.tar.bz2.sign reiser4progs-1.0.7.tar.bz2 gpg: Signature made Mon Feb 9 17:43:05 2009 CET using DSA key ID 517D0F0E gpg: Good signature from "Linux Kernel Archives Verification Key <ftpadmin@kernel.org>" $ tar -xjf reiser4progs-1.0.7.tar.bz2 $ cd reiser4progs-1.0.7 $ CFLAGS="-I/opt/libaal/include" LDFLAGS="-L/opt/libaal/lib" \ ./configure --prefix=/opt/reiser4progs && make && sudo make install </pre> [[category:Reiser4]] 86c23776a48516d5658cc58832abde9bb0e2dc48 1457 1455 2009-06-27T03:56:51Z Chris goe 2 wording The tools to maintain a Reiser4 filesystem are called <tt>reiser4progs</tt> and can be found on [http://www.kernel.org/pub/linux/utils/fs/reiser4/reiser4progs/ kernel.org]. The current version is [http://www.kernel.org/pub/linux/utils/fs/reiser4/reiser4progs/reiser4progs-1.0.7.tar.bz2 v1.0.7]. 
=== ReiserFSprogs ===
The tools to maintain a ReiserFS (Reiser v3) filesystem are called <tt>reiserfsprogs</tt> and can be found [http://www.kernel.org/pub/linux/utils/fs/reiserfs/ here]. NOTE: [http://www.kernel.org/pub/linux/utils/fs/reiserfs/reiserfsprogs-3.6.21.tar.bz2 Reiserfsprogs-3.6.21] is the current version and is considered ''stable''. It contains changes made by Jeff Mahoney (everything was tested as part of recent SuSE releases).
If <tt>reiserfsprogs</tt> is not already part of your distribution (unlikely; it should be available), we can build our own:
<pre>
$ wget http://www.kernel.org/pub/linux/utils/fs/reiserfs/reiserfsprogs-3.6.21.tar.bz2
$ wget http://www.kernel.org/pub/linux/utils/fs/reiserfs/reiserfsprogs-3.6.21.tar.bz2.sign
$ gpg --verify reiserfsprogs-3.6.21.tar.bz2.sign reiserfsprogs-3.6.21.tar.bz2
gpg: Signature made Sat Jan 10 16:15:04 2009 CET using DSA key ID 517D0F0E
gpg: Good signature from "Linux Kernel Archives Verification Key <ftpadmin@kernel.org>"
$ tar -xjf reiserfsprogs-3.6.21.tar.bz2
$ cd reiserfsprogs-3.6.21
$ ./configure --prefix=/opt/reiserfsprogs && make && sudo make install
</pre>
[[category:ReiserFS]] [[category:Reiser4]]
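Because both packages were installed under their own <tt>/opt</tt> prefixes, the binaries will not be on the default PATH. A minimal sketch of making them visible for the current shell session (the prefixes are the ones chosen above; the <tt>sbin</tt> subdirectory is where these packages install their administration tools):

```shell
# Put the freshly installed tools on PATH for this session.
PATH="/opt/reiser4progs/sbin:/opt/reiserfsprogs/sbin:$PATH"
export PATH

# Sanity check: these only resolve if the packages were actually installed.
command -v mkfs.reiser4 || echo "mkfs.reiser4 not found (is reiser4progs installed?)"
command -v mkreiserfs   || echo "mkreiserfs not found (is reiserfsprogs installed?)"
```

To make this permanent, the same two lines can go in your shell profile.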
== Compile-Time Options for Configuring ReiserFS ==
These are compile-time options that affect the functionality of [[ReiserFS]]. You can set them during the configuration stage of the Linux kernel build:
 make config
 make menuconfig
 make xconfig
=== CONFIG_REISERFS_FS ===
Build ReiserFS. It may be built either into the kernel or as a stand-alone kernel module.
=== CONFIG_REISERFS_CHECK ===
If you set this to yes, ReiserFS will perform every internal consistency check it can possibly imagine throughout its operation. It will also run substantially slower. This option allows our team to go all out in checking for consistency when debugging, without fear of its effect on the end user. If you are on the verge of sending in a bug report, say yes and you might get a useful error message. Almost everyone else should say no.
=== CONFIG_REISERFS_PROC_INFO ===
Creates a hierarchy of files under <tt>/proc/fs/reiserfs</tt> displaying various ReiserFS statistics and internal data, at the expense of making your kernel or module slightly larger (about 8 KB). It also increases the amount of kernel memory required for each mount. Almost everyone but ReiserFS developers and people fine-tuning ReiserFS or tracing problems should say no.
=== CONFIG_REISERFS_RAW ===
Setting this to yes enables a set of ioctls that provide a raw interface to the ReiserFS tree, bypass directories, and automatically remove aged files.
This is an experimental feature designed for squid cache directories; see <tt>Documentation/filesystems/reiserfs_raw.txt</tt>. It was designed specifically to use ReiserFS as a back-end for [http://www.squid-cache.org/ Squid]. The general idea is that it is possible to bypass all filesystem overhead and address the ReiserFS internal tree directly. '''This is not in the stock kernels.'''
=== REISERFS_HANDLE_BADBLOCKS ===
Enables an ioctl for manipulating the block bitmap. This can be used as a crude form of bad-block handling; the real solution is still underway. This variable is unavailable through the kernel configuration procedures: edit <tt>include/linux/reiserfs_fs.h</tt> manually, then take a look at the available [[mount|mount options]].
[[category:ReiserFS]]
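As an illustration, here is a fragment of a kernel <tt>.config</tt> matching the recommendations above (ReiserFS built as a module, the developer-oriented options left off). Exact option availability depends on your kernel version, and CONFIG_REISERFS_RAW and REISERFS_HANDLE_BADBLOCKS do not appear in stock kernels at all:

```text
CONFIG_REISERFS_FS=m
# CONFIG_REISERFS_CHECK is not set
# CONFIG_REISERFS_PROC_INFO is not set
```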
=== NAME ===
reiserfsck - the checking tool for the [[ReiserFS]] filesystem.
=== SYNOPSIS ===
 reiserfsck [ -afprVy ] [ --rebuild-sb | --check | --fix-fixable | --rebuild-tree | --clean-attributes ] [ -j | --journal device ] [ -z | --adjust-size ] [ -n | --nolog ] [ -B | --badblocks file ] [ -l | --logfile file ] [ -q | --quiet ] [ -y | --yes ] [ -S | --scan-whole-partition ] [ --no-journal-available ] ''device''
=== DESCRIPTION ===
<tt>reiserfsck</tt> searches for a ReiserFS filesystem on a ''device'', replays any necessary transactions, and either checks or repairs the file system. ''device'' is the special file corresponding to a device or partition (e.g. /dev/hdXX for an IDE disk partition or /dev/sdXX for a SCSI disk partition).
=== OPTIONS ===
;--rebuild-sb
:This option recovers the superblock on a ReiserFS partition. Normally you only need it if mount reports "read_super_block: can't find a reiserfs file system" and you are sure that a ReiserFS filesystem is there. But remember: if you have used some partition editor program and now cannot find the filesystem, something has probably gone wrong while repartitioning and the start of the partition has changed. If so, instead of rebuilding the superblock in the wrong place, you should find the correct start of the partition first.
;--check
:This default action checks filesystem consistency and reports, but does not repair, any corruption that it finds. This option may be used on a read-only file system mount.
;--fix-fixable
:This option recovers certain kinds of corruption that do not require rebuilding the entire filesystem tree (--rebuild-tree). Normally you only need it if the --check option reports "corruption that can be fixed with --fix-fixable". This includes: zeroing invalid data-block pointers, correcting st_size and st_blocks for directories, and deleting invalid directory entries.
;--rebuild-tree
:This option rebuilds the entire filesystem tree using the leaf nodes found on the device. Normally you only need it if reiserfsck --check reports "Running with --rebuild-tree is required". You are strongly encouraged to make a backup copy of the whole partition before attempting this option. Once reiserfsck --rebuild-tree is started it must finish its work (you should not interrupt it); otherwise the filesystem is left in an unmountable state to avoid subsequent data corruption.
;--clean-attributes
:This option cleans the reserved fields of stat-data items. There was a time when ReiserFS had no extended attributes; when they were implemented, old partitions needed to be cleaned first, since the ReiserFS code in the kernel did not care about unused fields in its structures.
:Thus if you have used one of the old (pre-attributes) kernels with a ReiserFS filesystem and you want to use extended attributes there, you should clean the filesystem first.
;--journal device, -j device
:This option supplies the device name of the current filesystem journal. It is required when the journal resides on a separate device from the main data device (although it can be avoided with the expert option --no-journal-available).
;--adjust-size, -z
:This option causes reiserfsck to correct file sizes that are larger than the offset of the last discovered byte. This implies that holes at the end of a file will be removed. File sizes that are smaller than the offset of the last discovered byte are corrected by --fix-fixable.
;--badblocks file, -B file
:This option sets the bad-block list to the list of blocks specified in the given ''file''. The filesystem [[FAQ/bad-block-handling|badblock list]] is cleared before the new list is added. It can be used with --fix-fixable to fix the list of bad blocks (see [[debugreiserfs]] -B). If the device has bad blocks, this option must be given every time --rebuild-tree is used.
;--logfile file, -l file
:This option causes reiserfsck to report any corruption it finds to the specified log file rather than to stderr.
;--nolog, -n
:This option prevents reiserfsck from reporting any kind of corruption.
;--quiet, -q
:This option prevents reiserfsck from reporting its rate of progress.
;--yes, -y
:This option inhibits reiserfsck from asking for confirmation after telling you what it is going to do; it will assume you confirm. For safety, it does not work with the --rebuild-tree option.
;-a, -p
:These options are usually passed by fsck -A during the automatic checking of the partitions listed in /etc/fstab. They cause reiserfsck to print some information about the specified filesystem, to check whether error flags in the superblock are set, and to do some lightweight checks.
:If these checks reveal corruption, or the flag indicating a (possibly fixable) corruption is found set in the superblock, reiserfsck switches to fix-fixable mode. If the flag indicating a fatal corruption is found set in the superblock, reiserfsck finishes with an error.
;-V
:This option prints the reiserfsprogs version and then exits.
;-r, -f
:These options are not yet operational and are therefore ignored.
=== EXPERT OPTIONS ===
'''DO NOT USE THESE OPTIONS UNLESS YOU KNOW WHAT YOU ARE DOING! WE ARE NOT RESPONSIBLE IF YOU LOSE DATA AS A RESULT OF THESE OPTIONS!'''
;--no-journal-available
:This option allows reiserfsck to proceed when the journal device is not available. It has no effect when the journal is located on the main data device. NOTE: after this operation you must use reiserfstune to specify a new journal device.
;--scan-whole-partition, -S
:This option causes --rebuild-tree to scan the whole partition, not only the used space on the partition.
=== EXAMPLES ===
# You think something may be wrong with a reiserfs partition on <tt>/dev/hda1</tt>, or you would just like to perform a periodic disk check.
# Run <tt>reiserfsck --check --logfile check.log /dev/hda1</tt>. If <tt>reiserfsck --check</tt> exits with status 0, no errors were discovered.
# If <tt>reiserfsck --check</tt> exits with status 1 (and reports fixable corruptions), run <tt>reiserfsck --fix-fixable --logfile fixable.log /dev/hda1</tt>.
# If <tt>reiserfsck --check</tt> exits with status 2 (and reports fatal corruptions), you need to run <tt>reiserfsck --rebuild-tree</tt>. If <tt>reiserfsck --check</tt> fails in some other way, you should also run <tt>reiserfsck --rebuild-tree</tt>, but we also encourage you to [[mailinglists|submit this as a bug report]].
# Before running <tt>reiserfsck --rebuild-tree</tt>, make a backup of the whole partition.
Then run <tt>reiserfsck --rebuild-tree --logfile rebuild.log /dev/hda1</tt>.
# If the <tt>reiserfsck --rebuild-tree</tt> step fails or does not recover what you expected, please [[mailinglists|submit this as a bug report]]. Try to provide as much information as possible, including your platform and Linux kernel version. We will try to help solve the problem.
=== EXIT CODES ===
<tt>reiserfsck</tt> uses the following exit codes:
* 0 - No errors.
* 1 - File system errors corrected.
* 2 - Reboot is needed.
* 4 - File system fatal errors left uncorrected; reiserfsck --rebuild-tree needs to be launched.
* 6 - File system fixable errors left uncorrected; reiserfsck --fix-fixable needs to be launched.
* 8 - Operational error.
* 16 - Usage or syntax error.
=== AUTHOR ===
This version of reiserfsck has been written by Vitaly Fertman.
=== BUGS ===
Please report bugs to the ReiserFS developers {{listaddress}}, providing as much information as possible: your hardware, kernel, patches, settings, all printed messages, and the logfile; also check the syslog for any related information.
=== TODO ===
Faster recovery, signal handling.
=== SEE ALSO ===
* [[mkreiserfs|mkreiserfs(8)]]
* [[reiserfstune|reiserfstune(8)]]
* [[resize_reiserfs|resize_reiserfs(8)]]
* [[debugreiserfs|debugreiserfs(8)]]
[[category:ReiserFS]]
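The exit codes above lend themselves to scripting. A hedged sketch of a wrapper (not part of reiserfsprogs; the function name is our own) that maps reiserfsck's exit status to the suggested next step:

```shell
# Map a reiserfsck exit status to the action suggested by the table above.
interpret_status() {
  case "$1" in
    0)  echo "clean" ;;
    1)  echo "errors corrected" ;;
    2)  echo "reboot needed" ;;
    4)  echo "run reiserfsck --rebuild-tree" ;;
    6)  echo "run reiserfsck --fix-fixable" ;;
    8)  echo "operational error" ;;
    16) echo "usage or syntax error" ;;
    *)  echo "unknown exit status: $1" ;;
  esac
}

# A read-only check requires reiserfsck on PATH and, usually, root
# privileges, so the actual invocation is shown here as a comment only:
# reiserfsck --check --yes /dev/hda1 >/dev/null 2>&1; interpret_status $?
```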
''device'' is the special file corresponding to a device or to a partition (e.g /dev/hdXX for an IDE disk partition or /dev/sdXX for a SCSI disk partition). === OPTIONS === --rebuild-sb This option recovers the superblock on a ReiserFS partition. Normally you only need this option if mount reports "read_super_block: can't find a reiserfs file system" and you are sure that a Reiserfs file system is there. But remember that if you have used some partition editor program and now you cannot find a filesystem, probably something has gone wrong while repartitioning and the start of the partition has been changed. If so, instead of rebuilding the super block on a wrong place you should find the correct start of the partition first. --check This default action checks filesystem consistency and reports, but does not repair any corruption that it finds. This option may be used on a read-only file system mount. --fix-fixable This option recovers certain kinds of corruption that do not require rebuilding the entire file system tree (--rebuild-tree). Normally you only need this option if the --check option reports "corruption that can be fixed with --fix-fixable". This includes: zeroing invalid data-block pointers, correcting st_size and st_blocks for directories, and deleting invalid directory entries. --rebuild-tree This option rebuilds the entire filesystem tree using leaf nodes found on the device. Normally you only need this option if the reiserfsck --check reports "Running with --rebuild-tree is required". You are strongly encouraged to make a backup copy of the whole partition before attempting the --rebuild-tree option. Once reiserfsck --rebuild-tree is started it must finish its work (and you should not interrupt it), otherwise the filesystem will be left in the unmountable state to avoid subsequent data corruptions. --clean-attributes This option cleans reserved fields of stat-data items. There were days when there were no extended attributes in ReiserFS. 
When they were implemented old partitions needed to be cleaned first -- ReiserFS code in the kernel did not care about not used fields in its strutures. Thus if you have used one of the old (pre-attrbutes) kernels with a ReiserFS filesystem and you want to use extented attribues there, you should clean the filesystem first. --journal device , -j device This option supplies the device name of the current file system journal. This option is required when the journal resides on a separate device from the main data device (although it can be avoided with the expert option --no-journal-available). --adjust-size, -z This option causes reiserfsck to correct file sizes that are larger than the offset of the last discovered byte. This implies that holes at the end of a file will be removed. File sizes that are smaller than the offset of the last discovered byte are corrected by --fix-fixable. --badblocks file, -B file This option sets the badblock list to be the list of blocks specified in the given 'file'. The filesystem [[FAQ/bad-block-handling|badblock list]] is cleared before the new list is added. It can be used with --fix-fixable to fix the list of badblocks (see [[debugreiserfs]] -B). If the device has bad blocks, every time it must be given with the --rebuild-tree option. --logfile file, -l file This option causes reiserfsck to report any corruption it finds to the specified log file rather than to stderr. --nolog, -n This option prevents reiserfsck from reporting any kinds of corruption. --quiet, -q This option prevents reiserfsck from reporting its rate of progress. --yes, -y This option inhibits reiserfsck from asking you for confirmation after telling you what it is going to do. It will assuem you confirm. For safety, it does not work with the --rebuild-tree option. -a, -p These options are usually passed by fsck -A during the automatic checking of those partitions listed in /etc/fstab. 
These options cause reiserfsck to print some information about the specified filesystem, to check if error flags in the superblock are set and to do some light-weight checks. If these checks reveal a corruption or the flag indicating a (possibly fixable) corruption is found set in the superblock, then reiserfsck switches to the fix-fixable mode. If the flag indicating a fatal corruption is found set in the superblock, then reiserfsck finishes with an error. -V This option prints the reiserfsprogs version and then exit. -r, -f These options are not yet operational and therefore are ignored. === EXPERT OPTIONS === '''DO NOT USE THESE OPTIONS UNLESS YOU KNOW WHAT YOU ARE DOING! WE ARE NOT RESPONSIBLE IF YOU LOSE DATA AS A RESULT OF THESE OPTIONS!''' --no-journal-available This option allows reiserfsck to proceed when the journal device is not available. This option has no effect when the journal is located on the main data device. NOTE: after this operation you must use reiserfstune to specify a new journal device. --scan-whole-partition, -S This option causes --rebuild-tree to scan the whole partition but not only the used space on the partition. === EXAMPLES === # You think something may be wrong with a reiserfs partition on <tt>/dev/hda1</tt> or you would just like to perform a periodic disk check. # Run <tt>reiserfsck --check --logfile check.log /dev/hda1</tt>. If <tt>reiserfsck --check</tt> exits with status 0 it means no errors were discovered. # If <tt>reiserfsck --check</tt> exits with status 1 (and reports about fixable corruptions) it means that you should run <tt>reiserfsck --fix-fixable --logfile fixable.log /dev/hda1</tt>. # If <tt>reiserfsck --check</tt> exits with status 2 (and reports about fatal corruptions) it means that you need to run <tt>reiserfsck --rebuild-tree</tt>. If <tt>reiserfsck --check</tt> fails in some way you should also run <tt>reiserfsck --rebuild-tree</tt>, but we also encourage you to [[mailinglists|submit this as a bug report]]. 
# Before running <tt>reiserfsck --rebuild-tree</tt>, please make a backup of the whole partition before proceeding. Then run <tt>reiserfsck --rebuild-tree --logfile rebuild.log /dev/hda1</tt>. # If the <tt>reiserfsck --rebuild-tree</tt> step fails or does not recover what you expected, please [[mailinglists|submit this as a bug report]]. Try to provide as much information as possible including your platform and Linux kernel version. We will try to help solve the problem. === EXIT CODES === <tt>reiserfsck</tt> uses the following exit codes: * 0 - No errors. * 1 - File system errors corrected. * 2 - Reboot is needed. * 4 - File system fatal errors left uncorrected, reiserfsck --rebuild-tree needs to be launched. * 6 - File system fixable errors left uncorrected, reiserfsck --fix-fixable needs to be launched. * 8 - Operational error. * 16 - Usage or syntax error. === AUTHOR === This version of reiserfsck has been written by Vitaly Fertman. === BUGS === Please [[mailinglists|report bugs to the ReiserFS developers]], providing as much information as possible--your hardware, kernel, patches, settings, all printed messages, the logfile; check the syslog file for any related information. === TODO === Faster recovering, signal handling. === SEE ALSO === * [[mkreiserfs|mkreiserfs(8)]] * [[reiserfstune|reiserfstune(8)]] * [[resize_reiserfs|resize_reiserfs(8)]] * [[debugreiserfs|debugreiserfs(8)]] [[category:ReiserFS]] e02b076da3e8c853ce6454e13415aef646494f0a 1531 1367 2009-06-27T21:13:01Z Chris goe 2 formatting fixes === NAME === reiserfsck - The checking tool for the [[ReiserFS]] filesystem. 
= Reiserfsprogs =

The tools to maintain a ReiserFS (Reiser v3) filesystem are called <tt>reiserfsprogs</tt> and should be shipped by most distributions. If yours does not ship them, they have to be built manually.

Install prerequisites:

 sudo apt-get install uuid-dev libacl1-dev comerr-dev             # Debian, Ubuntu
 sudo dnf install libuuid-devel libacl-devel libcom_err-devel     # Fedora
 sudo zypper install libuuid-devel libacl-devel libcom_err-devel  # openSUSE

Get the source:

 VER=<font color="red">3.6.25</font>
 wget https://www.kernel.org/pub/linux/kernel/people/jeffm/reiserfsprogs/v$VER/reiserfsprogs-$VER.tar.{sign,xz}
 xz -d reiserfsprogs-$VER.tar.xz
 gpg --recv-keys [http://pgp.mit.edu:11371/pks/lookup?search=0x2179E5B2 2179E5B2]
 gpg --verify reiserfsprogs-$VER.tar.sign
 tar -xf reiserfsprogs-$VER.tar && cd reiserfsprogs-$VER

Build & install:

 ./configure --prefix=/opt/reiserfsprogs && make   # Prefix with ''CFLAGS="$CFLAGS -std=gnu89"'' for [http://wiki.linuxfromscratch.org/blfs/changeset/16320#file4 GCC-5]
 sudo make install

Or, from the [https://git.kernel.org/cgit/linux/kernel/git/jeffm/reiserfsprogs.git/ Git tree]:

 git clone git://git.kernel.org/pub/scm/linux/kernel/git/jeffm/reiserfsprogs.git reiserfsprogs-git
 cd reiserfsprogs-git
 libtoolize --copy --install --force && aclocal && autoheader && autoconf && automake --add-missing
 ./configure --prefix=/opt/reiserfsprogs && make
 sudo make install

[[category:ReiserFS]]
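With <tt>--prefix=/opt/reiserfsprogs</tt>, the installed tools land outside the default search path. A small sketch of making them visible afterwards — the <tt>sbin</tt> subdirectory is an assumption based on the usual autoconf install layout, not something this page states:

```shell
#!/bin/sh
# Assumption: a prefixed install puts the admin tools under /opt/reiserfsprogs/sbin.
export PATH="/opt/reiserfsprogs/sbin:$PATH"
echo "$PATH" | cut -d: -f1

# Quick smoke test once installed (not executed here):
#   reiserfsck -V    # prints the reiserfsprogs version
```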
= Reiserfstune =

=== NAME ===
reiserfstune - The tuning tool for the [[ReiserFS]] filesystem.

=== SYNOPSIS ===
 reiserfstune [ -f ] [ -j | --journal-device FILE ] [ --no-journal-available ]
              [ --journal-new-device FILE ] [ --make-journal-standard ]
              [ -s | --journal-new-size N ] [ -o | --journal-new-offset N ]
              [ -t | --trans-max-size N ] [ -b | --add-badblocks file ]
              [ -B | --badblocks file ] [ -u | --uuid UUID ] [ -l | --label LABEL ] ''device''

=== DESCRIPTION ===
<tt>reiserfstune</tt> is used for tuning ReiserFS. It can change two journal parameters (the journal size and the maximum transaction size), and it can move the journal to a newly specified block device. (The old ReiserFS journal may be kept unused or discarded, at the user's option.) In addition, reiserfstune can store a bad-block list on the ReiserFS and set its UUID and LABEL.

Note: At the time of writing, the relocated journal was implemented for a special release of ReiserFS and was not expected to enter the mainstream kernel until approximately Linux 2.5. This means that if you have a stock kernel, you must apply a special patch.
Without this patch the kernel will refuse to mount the newly modified file system. We will charge $25 to explain this to you if you ask us why it doesn't work. Perhaps the most interesting application of this code is to put the journal on a solid-state disk.

''device'' is the special file corresponding to the newly specified block device (e.g. /dev/hdXX for an IDE disk partition, or /dev/sdXX for a SCSI disk partition).

=== OPTIONS ===
 -j | --journal-device FILE
FILE is the name of the block device that holds the file system's current journal (the one prior to running reiserfstune). This option is required when the journal already resides on a device separate from the main data device (although it can be avoided with --no-journal-available). If you do not specify a journal device with this option, reiserfstune assumes that the journal is on the main device.

 --no-journal-available
Allows reiserfstune to continue when the current journal's block device is no longer available. This might happen if a disk goes bad and you remove it (and run fsck).

 --journal-new-device FILE
FILE is the name of the block device that will contain the new journal for the file system. If you do not specify this, reiserfstune assumes that the journal device remains the same.

 -s | --journal-new-size N
N is the size parameter for the new journal. When the journal is on a separate device, its size defaults to the number of blocks that device has. When the journal is on the same device as the filesystem, its size defaults to the number of blocks allocated for the journal by [[mkreiserfs]] when it created the filesystem. The minimum is 513 blocks in both cases.

 -o | --journal-new-offset N
N is the offset in blocks at which the journal starts when it is on a separate device. The default is 0. This has no effect when the journal is on the same device as the filesystem. Most users have no need to use this feature.
It can be used when you want the journals of multiple filesystems to reside on the same device and you do not want to, or cannot, partition that device.

 -t | --trans-max-size N
N is the maximum transaction size parameter for the new journal. The default, and maximum possible, value is 1024 blocks. It should be less than half the size of the journal. If specified incorrectly, it will be adjusted.

 -b | --add-badblocks file
file is the name of the file that contains the list of blocks to be marked as bad on the filesystem. The list is added to the filesystem's existing list of bad blocks.

 -B | --badblocks file
file is the name of the file that contains the list of blocks to be marked as bad on the filesystem. The filesystem's bad-block list is cleared before the list specified in file is added.

 -f | --force
Normally reiserfstune will refuse to change the journal of a file system that was created before this journal-relocation code existed. This is because once you change the journal, you cannot go back (without the special option --make-journal-standard) to an old kernel that lacks this feature and still use your filesystem. This option forces reiserfstune to do it anyway; specified more than once, it also skips the confirmation prompt.

 --make-journal-standard
As mentioned above, if your file system has a non-standard journal, it cannot be mounted on a kernel without journal-relocation code. This can be changed; the only condition is that there is a reserved area on the main device of the standard journal size, 8193 blocks (this will be the case, for instance, if you converted a standard journal to a non-standard one). Just specify this option when you relocate the journal back, or without relocation if the journal is already on the main device.

 -u | --uuid UUID
Set the universally unique identifier (UUID) of the filesystem to UUID (see also [http://manpages.ubuntu.com/manpages/karmic/en/man1/uuidgen.1.html uuidgen(8)]).
The format of the UUID is a series of hex digits separated by hyphens, like this: "c1b9d5a2-f162-11cf-9ece-0020afc76f16".

 -l | --label LABEL
Set the volume label of the filesystem. LABEL can be at most 16 characters long; if it is longer, reiserfstune will truncate it.

=== EXAMPLES ===
* You have ReiserFS on /dev/hda1 and you wish to have it working with its journal on the device /dev/journal:
** boot a kernel patched with the special "relocatable journal support" patch
** <tt>reiserfstune /dev/hda1 --journal-new-device /dev/journal -f</tt>
** <tt>mount /dev/hda1</tt> and use it.
* You would like to change the maximum transaction size to 512 blocks:
** <tt>reiserfstune -t 512 /dev/hda1</tt>
* You would like to use your file system on another kernel that does not contain relocatable journal support:
** <tt>umount /dev/hda1</tt>
** <tt>reiserfstune /dev/hda1 -j /dev/journal --journal-new-device /dev/hda1 --make-journal-standard</tt>
* You would like to have ReiserFS on /dev/hda1 and be able to switch between different journals, including a journal located on the device containing the filesystem:
** boot a kernel patched with the special "relocatable journal support" patch
** <tt>mkreiserfs /dev/hda1</tt>
* You got a solid-state disk (perhaps /dev/sda; they typically look like SCSI disks):
** <tt>reiserfstune --journal-new-device /dev/sda1 -f /dev/hda1</tt>
* Your SCSI device dies, it is three in the morning, and you have an extra IDE device lying around:
** <tt>reiserfsck --no-journal-available /dev/hda1</tt> or
** <tt>reiserfsck --rebuild-tree --no-journal-available /dev/hda1</tt>
** <tt>reiserfstune --no-journal-available --journal-new-device /dev/hda1 /dev/hda1</tt>

=== AUTHOR ===
This version of reiserfstune has been written by Vladimir Demidov and Edward Shishkin <edward.shishkin@gmail.com>.
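The first journal-relocation scenario above can also be written down as a dry-run script. This is purely a sketch: the device names are the examples from the scenario, and the commands are only printed, never executed.

```shell
#!/bin/sh
# Dry-run sketch of the first scenario above: move the journal of
# /dev/hda1 to /dev/journal. Echo-only, so nothing is modified.
set -eu

DEV=/dev/hda1          # example filesystem device from the scenario
NEW_JOURNAL=/dev/journal   # example journal device from the scenario

CMD="reiserfstune $DEV --journal-new-device $NEW_JOURNAL -f"
echo "$CMD"
echo "mount $DEV"
```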
=== BUGS ===
Please report bugs to the ReiserFS developers {{Listaddress}}, providing as much information as possible: your hardware, kernel, patches, settings, and all printed messages; check the syslog file for any related information.

=== SEE ALSO ===
* [[reiserfsck|reiserfsck(8)]]
* [[debugreiserfs|debugreiserfs(8)]]
* [[mkreiserfs|mkreiserfs(8)]]

[[category:ReiserFS]]
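As a small illustration of the <tt>-l</tt> truncation rule described above, here is a hypothetical helper (not part of reiserfsprogs) that warns before reiserfstune would silently shorten a label.

```shell
#!/bin/sh
# Hypothetical helper: reiserfstune truncates volume labels longer than
# 16 characters (see the -l option above); warn before that happens.
check_label() {
    if [ "${#1}" -gt 16 ]; then
        # printf '%.16s' keeps only the first 16 characters
        echo "label will be truncated to: $(printf '%.16s' "$1")"
    else
        echo "label ok: $1"
    fi
}

check_label "backup-volume-2016-archive"
```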
''device'' is the special file corresponding to the newly specified block device (e.g /dev/hdXX for IDE disk partition or /dev/sdXX for the SCSI disk partition). === OPTIONS === -j | --journal-device FILE FILE is the file name of the block device the file system has the current journal (the one prior to running reiserfstune) on. This option is required when the journal is already on a separate device from the main data device (although it can be avoided with --no-journal-available). If you don't specify journal device by this option, reiserfstune suppose that journal is on main device. --no-journal-available allows reiserfstune to continue when the current journal's block device is no longer available. This might happen if a disk goes bad and you remove it (and run fsck). --journal-new-device FILE FILE is the file name of the block device which will contain the new journal for the file system. If you don't specify this, reiserfstune supposes that journal device remains the same. -s | --journal-new-size N N is the size parameter for the new journal. When journal is to be on a separate device - its size defaults to number of blocks that device has. When journal is to be on the same device as the filesytem - its size defaults to amount of blocks allocated for journal by [[mkreiserfs]] when it created the filesystem. Minimum is 513 for both cases. -o | --journal-new-offset N N is an offset in blocks where journal will starts from when journal is to be on a separate device. Default is 0. Has no effect when journal is to be on the same device as the filesystem. Most users have no need to use this feature. It can be used when you want the journals from multiple filesystems to reside on the same device, and you don't want to or cannot partition that device. -t | --trans-max-size N N is the maximum transaction size parameter for the new journal. The default, and max possible, value is 1024 blocks. It should be less than half the size of the journal. 
If specifed incorrectly, it will be adjusted. -b | --add-badblocks file File is the file name of the file that contains the list of blocks to be marked as bad on the fs. The list is added to the fs list of bad blocks. -B | --badblocks file File is the file name of the file that contains the list of blocks to be marked as bad on the fs. The bad block list on the fs is cleared before the list specified in the File is added to the fs. -f | --force Normally reiserfstune will refuse to change a journal of a file system that was created before this journal relocation code. This is because if you change the journal, you cannot go back (without special option --make-journal-standard) to an old kernel that lacks this feature and be able to use your filesytem. This option forces it to do that. Specified more than once it allows to avoid asking for confirmation. --make-journal-standard As it was mentioned above, if your file system has non-standard journal, it can not be mounted on the kernel without journal relocation code. The thing can be changed, the only condition is that there is reserved area on main device of the standard journal size 8193 blocks (it will be so for instance if you convert standard journal to non-standard). Just specify this option when you relocate journal back, or without relocation if you already have it on main device. -u | --uuid UUID Set the universally unique identifier (UUID) of the filesystem to UUID (see also [http://manpages.ubuntu.com/manpages/karmic/en/man1/uuidgen.1.html uuidgen(8)]). The format of the UUID is a series of hex digits separated by hyphens, like this: "c1b9d5a2-f162-11cf-9ece-0020afc76f16". -l | --label LABEL Set the volume label of the filesystem. LABEL can be at most 16 characters long; if it is longer than 16 characters, reiserfstune will truncate it. 
=== POSSIBLE SCENARIOS OF USING REISERFSTUNE: === * You have ReiserFS on /dev/hda1, and you wish to have it working with its journal on the device /dev/journal ** boot kernel patched with special "relocatable journal support" patch ** <tt>reiserfstune /dev/hda1 --journal-new-device /dev/journal -f</tt> ** <tt>mount /dev/hda1</tt> and use. * You would like to change max transaction size to 512 blocks ** <tt>reiserfstune -t 512 /dev/hda1</tt> * You would like to use your file system on another kernel that doesn't contain relocatable journal support. ** <tt>umount /dev/hda1</tt> ** <tt>reiserfstune /dev/hda1 -j /dev/journal --journal-new-device /dev/hda1 --make-journal-standard</tt> * You would like to have ReiserFS on /dev/hda1 and to be able to switch between different journals including journal located on the device containing the filesystem. ** boot kernel patched with special "relocatable journal support" patch ** <tt>mkreiserfs /dev/hda1</tt> * You got solid state disk (perhaps /dev/sda, they typically look like scsi disks) ** <tt>reiserfstune --journal-new-device /dev/sda1 -f /dev/hda1</tt> * Your scsi device dies, it is three in the morning, you have an extra IDE device lying around ** <tt>reiserfsck --no-journal-available /dev/hda1</tt> or ** <tt>reiserfsck --rebuild-tree --no-journal-available /dev/hda1</tt> ** <tt>reiserfstune --no-journal-available --journal-new-device /dev/hda1 /dev/hda1</tt> === AUTHOR === This version of reiserfstune has been written by Vladimir Demidov and Edward Shishkin <edward.shishkin@gmail.com>. === BUGS === Please report bugs to the ReiserFS developers {{Listaddress}}, providing as much information as possible--your hardware, kernel, patches, settings, all printed messages; check the syslog file for any related information. 
=== SEE ALSO === * [[reiserfsck|reiserfsck(8)]] * [[debugreiserfs|debugreiserfs(8)]] * [[mkreiserfs|mkreiserfs(8)]] [[category:ReiserFS]] 31a7d45995b6b4ffda323f14b6be8eec07f62d30 1534 1533 2009-06-28T18:22:59Z Chris goe 2 -mount === NAME === reiserfstune - The tuning tool for the [[ReiserFS]] filesystem. === SYNOPSIS === reiserfstune [ -f ] [ -j | --journal-device FILE ] [ --no-journal-available ] [ --journal-new-device FILE ] [ --make-journal-standard ] [ -s | --journal-new-size N ] [ -o | --journal-new-offset N ] [ -t | --trans-max-size N ] [ -b | --add-badblocks file ] [ -B | --badblocks file ] [ -u | --uuid UUID ] [ -l | --label LABEL ] ''device'' === DESCRIPTION === <tt>reiserfstune</tt> is used for tuning the ReiserFS. It can change two journal parameters (the journal size and the maximum transaction size), and it can move the journal's location to a new specified block device. (The old ReiserFS's journal may be kept unused, or discarded at the user's option.) Besides that reiserfstune can store the bad block list to the ReiserFS and set UUID and LABEL. Note: At the time of writing the relocated journal was implemented for a special release of ReiserFS, and was not expected to be put into the mainstream kernel until approximately Linux 2.5. This means that if you have the stock kernel you must apply a special patch. Without this patch the kernel will refuse to mount the newly modified file system. We will charge $25 to explain this to you if you ask us why it doesn't work. Perhaps the most interesting application of this code is to put the journal on a solid state disk. ''device'' is the special file corresponding to the newly specified block device (e.g /dev/hdXX for IDE disk partition or /dev/sdXX for the SCSI disk partition). === OPTIONS === -j | --journal-device FILE FILE is the file name of the block device the file system has the current journal (the one prior to running reiserfstune) on. 
This option is required when the journal is already on a separate device from the main data device (although it can be avoided with --no-journal-available). If you don't specify journal device by this option, reiserfstune suppose that journal is on main device. --no-journal-available allows reiserfstune to continue when the current journal's block device is no longer available. This might happen if a disk goes bad and you remove it (and run fsck). --journal-new-device FILE FILE is the file name of the block device which will contain the new journal for the file system. If you don't specify this, reiserfstune supposes that journal device remains the same. -s | --journal-new-size N N is the size parameter for the new journal. When journal is to be on a separate device - its size defaults to number of blocks that device has. When journal is to be on the same device as the filesytem - its size defaults to amount of blocks allocated for journal by [[mkreiserfs]] when it created the filesystem. Minimum is 513 for both cases. -o | --journal-new-offset N N is an offset in blocks where journal will starts from when journal is to be on a separate device. Default is 0. Has no effect when journal is to be on the same device as the filesystem. Most users have no need to use this feature. It can be used when you want the journals from multiple filesystems to reside on the same device, and you don't want to or cannot partition that device. -t | --trans-max-size N N is the maximum transaction size parameter for the new journal. The default, and max possible, value is 1024 blocks. It should be less than half the size of the journal. If specifed incorrectly, it will be adjusted. -b | --add-badblocks file File is the file name of the file that contains the list of blocks to be marked as bad on the fs. The list is added to the fs list of bad blocks. -B | --badblocks file File is the file name of the file that contains the list of blocks to be marked as bad on the fs. 
The bad block list on the fs is cleared before the list specified in the File is added to the fs. -f | --force Normally reiserfstune will refuse to change a journal of a file system that was created before this journal relocation code. This is because if you change the journal, you cannot go back (without special option --make-journal-standard) to an old kernel that lacks this feature and be able to use your filesytem. This option forces it to do that. Specified more than once it allows to avoid asking for confirmation. --make-journal-standard As it was mentioned above, if your file system has non-standard journal, it can not be mounted on the kernel without journal relocation code. The thing can be changed, the only condition is that there is reserved area on main device of the standard journal size 8193 blocks (it will be so for instance if you convert standard journal to non-standard). Just specify this option when you relocate journal back, or without relocation if you already have it on main device. -u | --uuid UUID Set the universally unique identifier (UUID) of the filesystem to UUID (see also [http://manpages.ubuntu.com/manpages/karmic/en/man1/uuidgen.1.html uuidgen(8)]). The format of the UUID is a series of hex digits separated by hyphens, like this: "c1b9d5a2-f162-11cf-9ece-0020afc76f16". -l | --label LABEL Set the volume label of the filesystem. LABEL can be at most 16 characters long; if it is longer than 16 characters, reiserfstune will truncate it. === POSSIBLE SCENARIOS OF USING REISERFSTUNE: === * You have ReiserFS on /dev/hda1, and you wish to have it working with its journal on the device /dev/journal ** boot kernel patched with special "relocatable journal support" patch ** <tt>reiserfstune /dev/hda1 --journal-new-device /dev/journal -f</tt> ** <tt>mount /dev/hda1</tt> and use. 
* You would like to change the maximum transaction size to 512 blocks:
** <tt>reiserfstune -t 512 /dev/hda1</tt>
* You would like to use your file system on another kernel that doesn't contain relocatable journal support:
** <tt>umount /dev/hda1</tt>
** <tt>reiserfstune /dev/hda1 -j /dev/journal --journal-new-device /dev/hda1 --make-journal-standard</tt>
* You would like to have ReiserFS on /dev/hda1 and be able to switch between different journals, including a journal located on the device containing the filesystem:
** boot a kernel patched with the special "relocatable journal support" patch
** <tt>mkreiserfs /dev/hda1</tt>
* You have a solid state disk (perhaps /dev/sda; they typically appear as SCSI disks):
** <tt>reiserfstune --journal-new-device /dev/sda1 -f /dev/hda1</tt>
* Your SCSI device dies, it is three in the morning, and you have a spare IDE device lying around:
** <tt>reiserfsck --no-journal-available /dev/hda1</tt> or
** <tt>reiserfsck --rebuild-tree --no-journal-available /dev/hda1</tt>
** <tt>reiserfstune --no-journal-available --journal-new-device /dev/hda1 /dev/hda1</tt>

=== AUTHOR ===
This version of reiserfstune has been written by Vladimir Demidov and Edward Shishkin <edward.shishkin@gmail.com>.

=== BUGS ===
Please [[mailinglists|report bugs to the ReiserFS developers]], providing as much information as possible: your hardware, kernel, patches, settings, and all printed messages; check the syslog file for any related information.

=== SEE ALSO ===
* [[reiserfsck|reiserfsck(8)]]
* [[debugreiserfs|debugreiserfs(8)]]
* [[mkreiserfs|mkreiserfs(8)]]

[[category:ReiserFS]]
Resize reiserfs

=== NAME ===
resize_reiserfs - resizer tool for the [[ReiserFS]] filesystem

=== SYNOPSIS ===
resize_reiserfs [ -s [+|-]size[K|M|G] ] [ -j dev ] [ -fqv ] ''device''

=== DESCRIPTION ===
The <tt>resize_reiserfs</tt> tool resizes an unmounted ReiserFS file system. It enlarges or shrinks a ReiserFS file system located on ''device'' so that it will have ''size'' bytes, or old_size +(-) size bytes if the + or - prefix is used. If the <tt>-s</tt> option is not specified, the filesystem will be resized to fill the given device. The size parameter may have one of the optional modifiers K, M, or G, which mean the size is given in kilo-, mega-, or gigabytes respectively.

The <tt>resize_reiserfs</tt> program does not manipulate the size of the ''device''. If you wish to enlarge a filesystem, you must make sure you expand the underlying device first.
This can be done using [http://manpages.ubuntu.com/manpages/karmic/man8/cfdisk.8.html cfdisk(8)] for partitions, by deleting the partition and recreating it with a larger size (assuming there is free space after the partition in question). '''Make sure you re-create it with the same starting disk cylinder as before! Otherwise, the resize operation will certainly not work, and you may lose your entire filesystem.'''

The <tt>resize_reiserfs</tt> program can '''grow''' a ReiserFS filesystem '''online''' if there is free space on the block device. If you wish to '''shrink''' a ReiserFS partition, first use <tt>resize_reiserfs</tt> to shrink the file system; you may then use <tt>cfdisk(8)</tt> to shrink the ''device''. '''When shrinking the device, make sure you do not make it smaller than the reduced size of the ReiserFS filesystem.'''

=== OPTIONS ===
-s [+|-]size Set the new size in bytes.

-j dev Set the journal device name.

-f Force; do not perform checks.

-q Do not print anything but error messages.

-v Turn on extra progress status messages (default).

=== EXIT CODES ===
* 0 Resizing successful.
* -1 Resizing not successful.

=== EXAMPLES ===
The following example shows how to test <tt>resize_reiserfs</tt>. Suppose a 2GB ReiserFS filesystem is created on the device <tt>/dev/hda8</tt> and is mounted on <tt>/mnt</tt>. To shrink the filesystem we need to unmount it first, then run <tt>resize_reiserfs</tt> with a size parameter (in this case -1GB):

 umount /mnt
 resize_reiserfs -s -1G /dev/hda8
 mount /dev/hda8 /mnt

=== AUTHOR ===
This version of resize_reiserfs has been written by Alexander Zarochentcev.

=== BUGS ===
Please report bugs to the ReiserFS developers {{listaddress}}, providing as much information as possible: your hardware, kernel, patches, settings, and all printed messages; check the syslog file for any related information.
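The [+|-]size[K|M|G] argument accepted by <tt>-s</tt> expands to a byte count as sketched below. The <tt>to_bytes</tt> helper is purely illustrative (resize_reiserfs performs this parsing internally); a leading + or - marks a relative change, and K, M, G scale by powers of 1024.

```shell
#!/bin/sh
# Expand a resize_reiserfs-style size spec (optional +/- prefix for a
# relative change, optional K/M/G suffix) into a byte count, keeping the sign.
to_bytes() {
    spec=$1
    sign=${spec%%[0-9]*}        # leading "+", "-", or empty for absolute
    num=${spec#"$sign"}
    case $num in
        *K) bytes=$(( ${num%K} * 1024 )) ;;
        *M) bytes=$(( ${num%M} * 1024 * 1024 )) ;;
        *G) bytes=$(( ${num%G} * 1024 * 1024 * 1024 )) ;;
        *)  bytes=$num ;;        # no modifier: plain bytes
    esac
    printf '%s%s\n' "$sign" "$bytes"
}

to_bytes -1G     # shrink by 1 GB   -> -1073741824
to_bytes +512M   # grow by 512 MB   -> +536870912
to_bytes 2G      # absolute 2 GB    -> 2147483648
```

So <tt>resize_reiserfs -s -1G /dev/hda8</tt> in the example above asks for a filesystem 1073741824 bytes smaller than the current one.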
=== SEE ALSO ===
* [[reiserfsck|reiserfsck(8)]]
* [[debugreiserfs|debugreiserfs(8)]]
* [http://manpages.ubuntu.com/manpages/karmic/man8/cfdisk.8.html cfdisk(8)]

[[category:ReiserFS]]
When shrinking the size of the device, make sure you do not make it smaller than the reduced size of the reiserfs filesystem. === OPTIONS === -s [+|-]size Set the new size in bytes. -j dev Set the journal device name. -f Force, do not perform checks. -q Do not print anything but error messages. -v Turn on extra progress status messages (default). === RETURN VALUES === 0 Resizing successful. -1 Resizing not successful. === EXAMPLES === The following example shows how to test resize_reiserfs. Suppose 2Gb reiserfs filesystem is created on the device /dev/hda8 and is mounted on /mnt. For shrinking the device we need to unmount it first, then run resize_reiserfs with a size parameter (in this case -1Gb): df umount /mnt resize_reiserfs -s -1G /dev/hda8 mount /dev/hda8 /mnt df /mnt === AUTHOR === This version of resize_reiserfs has been written by Alexander Zarochentcev <zam@namesys.com>. === BUGS === Please report bugs to the ReiserFS developers <reiserfs-dev@namesys.com>, providing as much information as possible--your hardware, kernel, patches, settings, all printed messages; check the syslog file for any related information. === SEE ALSO === [http://manpages.ubuntu.com/manpages/karmic/man8/cfdisk.8.html cfdisk(8)], [[reiserfsck|reiserfsck(8)]], [[debugreiserfs|debugreiserfs(8)]] [[category:ReiserFS]] 87ab0c3f46130e3f95399f8f0b93375b55327557 TODO 0 35 4323 4155 2019-04-16T09:03:23Z Chris goe 2 404 fixed When [https://en.wikipedia.org/wiki/Namesys Namesys] was still in place, there was a [https://web.archive.org/web/20071219121554/http://pub.namesys.com/Reiser4/ToDo TODO list] (archive.org) to work on to get [[Reiser4]] into mainline. 
This URL is no more, and a lot of changes have been made to Reiser4 since then. The most current TODO list we have is from April 2009, and there are still enough tasks left to be done:
* [https://marc.info/?l=reiserfs-devel&m=124069989217533 updated TODO list from 2009-04-25]
* [https://www.spinics.net/lists/reiserfs-devel/msg01028.html updated TODO list from 2008-08-05]
* [[TODO/2007-11-25|original TODO list from 2007-11-25]]
* Storage [[Containers]]

That being said, <s>reiser4 inclusion seems to be in sight</s>, rumor has it that another attempt will be made [https://marc.info/?l=reiserfs-devel&m=124904274311847 in late 2010] or [https://www.spinics.net/lists/reiserfs-devel/msg02496.html early 2011].

[[category:Reiser4]] 5f5d0c2039e75a2f3a22037c0b4db6db0e30b510 TODO/2007-11-25 0 39 1892 1490 2010-10-18T01:14:36Z Chris goe 2 source added {{wayback|http://pub.namesys.com/Reiser4/ToDo|2007-04-27}} Tasks are needed to done for getting reiser4 code into the kernel as found by AKPM in 2.6.18-rc2-mm1 (# means that an issue is done) Reiser4/ToDo (last edited 2007-04-27 10:20:13 by frink) <pre> Tasks are needed to done for getting reiser4 code into the kernel as found by AKPM in 2.6.18-rc2-mm1 (# means that an issue is done) 1. running igrab() in the writepage() path is really going to hammer inode_lock. Something else will need to be done here. 2. The preferred way of solving the above would be to mark the page as PageWriteback() with set_page_writeback() prior to unlocking it. That'll pin the page and the inode. It does require that the page actually get written later on. If we cannot do that then more thought is needed. 3. If poss, use wake_up_process() rather than wake_up(). That'll save some locking. 4. Running iput() in entd() is a bit surprising. iirc there are various ways in which this can recur into the filesystem, perform I/O, etc. I guess it works..
But again, it will hammer inode_lock. 5. the writeout logic in entd_flush() is interesting (as in "holy cow"). It's very central and really needs some good comments describing what's going on in there - what problems are being solved, which decisions were taken and why, etc. The big comment in page_cache.c is useful. Please maintain it well. Boy, it has some old stuff in it. 6. reiser4_wait_page_writeback() needs commenting. 7. reading the comment in txnmgr.c regarding MAP_SHARED pages: a number of things have changed since then. We have page-becoming-writeable notifications and probably soon we'll always take a pagefault when a MAP_SHARED page transitions from pte-clean to pte-dirty (although I wouldn't recommend that a filesystem rely upon the latter for a while yet). 8. page_cache.c: yes, mpage_end_io_write() and mpage_end_io_read() are pretty generic - we might as well export them. 9. truncate_jnodes_range() looks wrong to me. When we populate gang[], there can be any number of NULL entries placed in it. But the loop which iterates across the now-populated gang[] will bale out when it its the _first_ NULL entry. Any following entries will have a leaked jref() against them. 10. Waaaaaaaaaaaaaaay too many typedefs. 11. There are many coding-style nits. One I will mention is very large number of unneeded braces: * if (foo) { o bar(); } it'd be nice to fix these up sometime. Note: Easy to find and repair with checkpatch.pl script found at http://www.kernel.org/pub/linux/kernel/people/apw/checkpatch/ 12. General comment: the lack of support for extended attributes, access control lists and direct-io is a big problem and it's getting bigger. I don't see how a vendor could support reiser4 without these features and permanent lack of vendor support will hurt. What's the plan here? 13. (from CH) Another issue is the lack of support for blocksize < pagesize. This prevents it from being used across architectures. 
Even worse when I tried the last time it didn't allow me to create a 64k blocksize filesystem that I could actually test on ppc64. \ 14. set_page_dirty_internal() pokes around in VFS internals. Use set_page_dirty_no_buffers() or create a new library function in mm/page-writeback.c. In particular, it gets the radix-tree dirty tagging out of sync. 15. #wbq.sem should be using a completion for the "wait until entd finishes", not a semaphore. Because there's a teeny theoretical race when using semaphores this way which completions were designed to avoid. (The waker can still be playing with the semaphore when it has gone out of scope on the wakee's stack). 16. #write_page_by_ent(): the "spin until entd thread" thing is gross. This function is really lock-intensive. 17. #entd_flush(): bug: rq->wbc->range_start = rq->page->index << PAGE_CACHE_SHIFT; this can overflow on 32-bit. Need to cast rq->page->index to loff_t. 18. #writeout() is a poor name for a global function. Even things like "txn_restart" are a bit generic-sounding. Low-risk, but the kernel's getting bigger... If it were mine, I'd prefix all these symbols with "r4_". prepare_to_sleep(), page_io(), drop_page(), statfs_type(), pre_commit_hook(), etc, etc, etc, etc. Much namespace pollution. 19. #invalidate_list() is a poorly-chosen global identifier. We already have an invalidate_list() in fs/inode.c, too. Please audit all of reiser4's global identifiers (use nm *.o) for suitable naming choices. 20. #semaphores are deprecated. Please switch to mutexes and/or completions where appropriate and possible. 21. #drop_page() is a worry. Why _does_ reiser4 need to remove pages from pagecache? That isn't a filesystem function. drop_page() appears to leave the no-longer-present page tagged as dirty in the radix-tree. 22. #reiser4_invalidate_pages() is a mix of reiser4 things and of things-which-the-vfs-is-supposed-to-do. 
It is uncommented and I am unable to work out why it was necessary to do this, and hence what we can do about it. 23. #reiser4_readpages() shouldn't need to clean up the remaining pages on *pages. read_cache_pages() does that now. 24. #<wonders what formatted and unformatted nodes are> A brief glossary might help. 25. #REISER4_ERROR_CODE_BASE actually overlaps real errnos (see include/linux/errno.h). Suggest that it be changed to 1000000 or something. 26. #blocknr_set_add() modifies a list_head without any apparent locking. Certainly without any _documented_ locking... Ditto blocknr_set_destroy(). I'm sure there's locking, but it's harder than it should be to work out what it is. Given that proper locking is in place, the filesystem seems to use list_*_careful() a lot more than is necessary? 27. #It would be clearer to remove `struct blocknr_set' and just use list_head. Reiser4/ToDo (last edited 2007-11-25 21:57:14 by M9132) </pre> [[category:Reiser4]] 30f18e3f718a9baff34ea92989fe791c66aab460 Testimonials 0 33 2511 1376 2012-09-25T17:36:57Z Chris goe 2 ->

= from Philipp Guehring =
ReiserFS is the main engine behind our LivingXML database system. After we found that other XML databases simply cannot provide the needed scalability, we started to develop a native XML engine based on ReiserFS. With the great help of ReiserFS, we now have one of the best database systems, which is scalable, flexible, and just does what our customers need. Filesystems are the best databases we have, but only a few people seem to care or use them appropriately. We have created a database application with ~ 250 000 XML files, and searches, updates, ... are performing very well. Just visit http://www.livingxml.net/ to see the combination of XML and ReiserFS.

= from [http://sf.net SourceForge] =
The SourceForge FTP server has 850 GB of storage, half of which is reiserfs and half ext2. Both filesystems have been running flawlessly for > 4 months of production (actually longer, but it wasn't reiserfs before). That server pushes between 15 Mbit and 50 Mbit/sec, and pulls/syncs about 2-5 Mbit/sec, 24x7. reiserfs also powers the CVS tree filesystem for cvs-mirror.mozilla.org (also tokyojoe.sourceforge.net), which is the one and only anonymous CVS checkout point for mozilla. That server has run flawlessly under very heavy load since its birth.
I don't get involved in kernel politics, but as a production filesystem, reiserfs is ok in my book.

= from another happy user =

ReiserFS is running on our production squid server (20G spool) and our production news server (currently 40G, will be 500G soon), serving about 500 DSL customers. Never had a problem; we upgraded from 3.5 to 3.6 without a hitch.

= from Kenneth C. Arnold =

Wow -- what a filesystem. You know which one I'm talking about ;) Just had a major freeze (v. rare with Linux, but it happens) where I had to use the one-finger kick. I had converted /usr to ReiserFS, but hadn't gotten to /var yet. Well, needless to say, /usr survived, and /var... ever tried having duplicate blocks between (in my case) (/var)/lib/dpkg/info/libc6-prerm (a file) and (/var)/ (quite definitely a directory)? I have always thought of Linux filesystems as being very delicate in comparison to the filesystem for That Other OS ... until now.

= from Bryan Campbell =

As expected . . . the patch was quite successful. I am totally amazed at reiserfs:

 422GB of news on the spool at 1:01 a.m. (expire started)
 320GB of news on the spool at 1:03 a.m. (expire finished)

I have to admit that I am not privy to a lot of high-performance file systems of a proprietary nature. But I know of no file system that can release that kind of space on an innd installation in under 3 minutes. Mind you, that is the entire expiration time. The current history file is 146MB; those three minutes include parsing the history file. Even when it was broken, it would expire 250GB of news in under 7 minutes. With ext2 it took roughly twice the time. I also have a cache server (squid 2.4.blah) running over 4x10GB cache-dirs. It sustains loads of between 300-600 concurrent connections. ext2 just fell over at about 425 connections. I have yet to reach the limit of reiserfs on the cache server, and I still have yet to install the current patch. I am guessing that I will try reiser raw with squid before too long.
I know this is not supposed to be a testimonial, but how's this . . . try fsck'ing a 608GB news spool with ext2. NOT! All my thanks to all of you who have developed reiserfs and supported us on the list.

[[category:ReiserFS]] [[category:Reiser4]]

Transparent File Migration

Migration of data blocks in a logical volume can happen not only in the context of the volume balancing procedure, which aims to keep distribution fair across the whole logical volume. Reiser5 allows the user to migrate the data of any specified file to any specified brick of the logical volume. The user can also mark any regular file as "immobile", so that volume balancing procedures will ignore that file.
Moreover, the user can clear the "immobile" status of any specified file, making it movable again by volume balancing procedures. Finally, the user can run a special procedure that clears the "immobile" status of all files on the volume and redistributes their data fairly. In particular, this functionality lets the user push "hot" files out to a high-performance device (e.g. a proxy device) and pin them there.

= File Migration: API =

 /*
  * Migrate file to specified target device.
  * @fd: file descriptor
  * @idx: serial number of target device in the logical volume
  */

 /*
  * Provide the correct path here. This header file can be found in the
  * reiser4 kernel module, or in the reiser4progs sources.
  */
 #include "reiser4/ioctl.h"

 struct reiser4_vol_op_args args;

 memset(&args, 0, sizeof(args));
 args.opcode = REISER4_MIGRATE_FILE;
 args.val = idx;
 result = ioctl(fd, REISER4_IOC_VOLUME, &args);

COMMENT. After successful ioctl completion the file has not necessarily been written to the target device! To make sure of that, call fsync(2) after the ioctl completes, or open the file with the O_SYNC flag before migration.

COMMENT. File migration is serialized with brick removal procedures by the file system (to ensure we don't migrate data to a brick being removed by a concurrent procedure). An interrupted file migration procedure will be completed in the current or next mount session (depending on the reason for the interruption).

= Set file immobile status: API =

 /*
  * Set file "immobile".
  * @fd: file descriptor
  */
 #include "reiser4/ioctl.h"  /* see the note on the header path above */

 struct reiser4_vol_op_args args;

 memset(&args, 0, sizeof(args));
 args.opcode = REISER4_SET_FILE_IMMOBILE;
 result = ioctl(fd, REISER4_IOC_VOLUME, &args);

COMMENT. The immobile status guarantees that no data block of the file will migrate to another device-component of the logical volume.
Note, however, that such a block can still be relocated within the device where it currently resides (e.g. once the file system finds a better location for it).

NOTE: All balancing procedures that complete device removal ignore the "immobile" status of any file. After device removal completes successfully, all data blocks of "immobile" files will have been relocated to the remaining devices in accordance with the current distribution policy.

NOTE: The selective file migration described above ignores the "immobile" status of the file! So the "immobile" status is honored only by volume balancing procedures completing operations such as adding a device to the logical volume, changing the capacity of a device, or flushing a proxy device.

= Clear file immobile status: API =

 /*
  * Clear file "immobile" status.
  * @fd: file descriptor
  */
 #include "reiser4/ioctl.h"  /* see the note on the header path above */

 struct reiser4_vol_op_args args;

 memset(&args, 0, sizeof(args));
 args.opcode = REISER4_CLR_FILE_IMMOBILE;
 result = ioctl(fd, REISER4_IOC_VOLUME, &args);

NOTE: Selective file migration can make your distribution unfair! Currently it is strongly recommended to migrate files only to devices which don't participate in regular data distribution, e.g. to a proxy brick, or to the meta-data brick (on condition that it doesn't participate in regular data distribution). In the future it will be possible to turn off built-in distribution on any volume; in that case the user will be responsible for appointing a destination device for every file on that volume.

= File migration by the volume.reiser4 tool =

You can use the volume.reiser4(8) utility for file migration as well as for setting/clearing the file "immobile" status. To migrate a regular file, execute

 # volume.reiser4 -m N FILENAME

where N is the serial number of the target device (i.e. the device that the file is supposed to migrate to) and FILENAME is the name of the file to migrate.
To set immobile status, execute

 # volume.reiser4 -i FILENAME

To clear immobile status:

 # volume.reiser4 -e FILENAME

= Restore regular distribution on your logical volume =

By default, each data stripe in a logical volume is subject to the regular distribution policy, which provides fair distribution among all bricks. By migrating a file, the user violates the fairness of distribution, which can result in lost space-usage efficiency on the logical volume, unexpected ENOSPC errors, etc. The user may then want to restore regular fair distribution on the logical volume. This can be done by running the volume.reiser4 utility with the option -S (--restore-regular) on that volume:

 # volume.reiser4 -S MOUNTPOINT

This works like a usual balancing: it scans the volume and migrates the data of each file, clearing its "immobile" status.

= Holding "hot" files on a Proxy Device =

It makes sense to relocate the data of "hot" files to one or more devices which have the highest performance in the logical volume, e.g. to a [[Proxy_Device_Administration|proxy device]]. For this you will need to mark every such file as "immobile" and move it to the desired device, so that balancing procedures (including flushing a proxy device) ignore those files. See the Examples section below.

= Examples =

In this example we'll move a file to 1) a proxy brick and 2) a regular data brick, and pin it there.
Create an ID for the logical volume:

 # VOL_ID=`uuidgen`

Prepare 2 bricks for our logical volume, /dev/vdc2 as the meta-data brick and /dev/vdc3 as the proxy device:

 # DEV1=/dev/vdc2
 # DEV2=/dev/vdc3
 # mkfs.reiser4 -U $VOL_ID -y -t 256K $DEV1
 # mkfs.reiser4 -U $VOL_ID -y -a -t 256K $DEV2

Mount a logical volume consisting of one meta-data brick:

 # MNT=/mnt/test
 # mount $DEV1 $MNT

Add the proxy device to the logical volume:

 # volume.reiser4 -x $DEV2 $MNT

Create a 400K file (100 logical blocks) on our logical volume:

 # dd if=/dev/zero of=${MNT}/myfile bs=4K count=100
 # sync

Print all bricks:

 # volume.reiser4 $MNT -p0
 Brick Info:
  internal ID: 0 (meta-data brick)
  external ID: 6ee9927e-04c3-4683-a451-f1329de66222
  device name: /dev/vdc2
  num replicas: 0
  block count: 2621440
  blocks used: 116
  system blocks: 115
  data capacity: 1843119
  space usage: 0.000
  volinfo addr: 0 (none)
  in DSA: Yes
  is proxy: No

 # volume.reiser4 $MNT -p1
 Brick Info:
  internal ID: 1 (data brick)
  external ID: 2cc41c8a-b3cd-4690-b3fc-bd840e067131
  device name: /dev/vdc3
  num replicas: 0
  block count: 2621440
  blocks used: 215
  system blocks: 115
  data capacity: 2621325
  space usage: 0.000
  volinfo addr: 0 (none)
  in DSA: No
  is proxy: Yes

As we can see, the proxy device /dev/vdc3 contains 100 data blocks (blocks used - system blocks = 215 - 115).

Flush the proxy device:

 # volume.reiser4 -b $MNT

Print all bricks:

 # sync
 # volume.reiser4 $MNT -p0
 Brick Info:
  internal ID: 0 (meta-data brick)
  external ID: 6ee9927e-04c3-4683-a451-f1329de66222
  device name: /dev/vdc2
  num replicas: 0
  block count: 2621440
  blocks used: 216
  system blocks: 115
  data capacity: 1843119
  space usage: 0.000
  volinfo addr: 0 (none)
  in DSA: Yes
  is proxy: No

 # volume.reiser4 $MNT -p1
 Brick Info:
  internal ID: 1 (data brick)
  external ID: 2cc41c8a-b3cd-4690-b3fc-bd840e067131
  device name: /dev/vdc3
  num replicas: 0
  block count: 2621440
  blocks used: 115
  system blocks: 115
  data capacity: 2621325
  space usage: 0.000
  volinfo addr: 0 (none)
  in DSA: No
  is proxy: Yes

As we can see, all 100 data blocks were migrated to the meta-data brick /dev/vdc2 (blocks used = system blocks + data blocks + meta-data blocks = 115 + 100 + 1 = 216).

Mark myfile as immobile and migrate it to the proxy device:

 # volume.reiser4 -i ${MNT}/myfile
 # volume.reiser4 -m 1 ${MNT}/myfile

Print all bricks:

 # sync
 # volume.reiser4 $MNT -p0
 Brick Info:
  internal ID: 0 (meta-data brick)
  external ID: 6ee9927e-04c3-4683-a451-f1329de66222
  device name: /dev/vdc2
  num replicas: 0
  block count: 2621440
  blocks used: 116
  system blocks: 115
  data capacity: 1843119
  space usage: 0.000
  volinfo addr: 0 (none)
  in DSA: Yes
  is proxy: No

 # volume.reiser4 $MNT -p1
 Brick Info:
  internal ID: 1 (data brick)
  external ID: 2cc41c8a-b3cd-4690-b3fc-bd840e067131
  device name: /dev/vdc3
  num replicas: 0
  block count: 2621440
  blocks used: 215
  system blocks: 115
  data capacity: 2621325
  space usage: 0.000
  volinfo addr: 0 (none)
  in DSA: No
  is proxy: Yes

As we can see, the proxy device /dev/vdc3 again contains all the data blocks. NOTE: the file was migrated in spite of its immobile status, because selective migration ignores that status.

Now flush the proxy device and make sure that the file remains on the proxy device:

 # volume.reiser4 -b $MNT
 # sync
 # volume.reiser4 $MNT -p0
 # volume.reiser4 $MNT -p1

As we can see, the flushing procedure respects immobile status.

Finally, remove the proxy device from the logical volume:

 # volume.reiser4 -r $DEV2 $MNT

Print the single remaining brick of our logical volume:

 # volume.reiser4 $MNT -p0
 Brick Info:
  internal ID: 0 (meta-data brick)
  external ID: 6ee9927e-04c3-4683-a451-f1329de66222
  device name: /dev/vdc2
  num replicas: 0
  block count: 2621440
  blocks used: 216
  system blocks: 115
  data capacity: 1843119
  space usage: 0.000
  volinfo addr: 0 (none)
  in DSA: Yes
  is proxy: No

As we can see, the file was migrated to the remaining brick /dev/vdc2 in spite of its immobile status. This is because the operation of removing a device ignores that status. NOTE: the file remains immobile!
Now add /dev/vdc3 as regular device (not proxy) and move the file to that device: # volume.reiser4 -a $DEV2 $MNT # volume.reiser4 -m 1 ${MNT}/myfile Print info about all bricks: # sync # volume.reiser4 $MNT -p0 Brick Info: internal ID: 0 (meta-data brick) external ID: 6ee9927e-04c3-4683-a451-f1329de66222 device name: /dev/vdc2 num replicas: 0 block count: 2621440 blocks used: 116 system blocks: 115 data capacity: 1843119 space usage: 0.000 volinfo addr: 0 (none) in DSA: Yes is proxy: No # volume.reiser4 $MNT -p1 Brick Info: internal ID: 1 (data brick) external ID: 2cc41c8a-b3cd-4690-b3fc-bd840e067131 device name: /dev/vdc3 num replicas: 0 block count: 2621440 blocks used: 215 system blocks: 115 data capacity: 2621325 space usage: 0.000 volinfo addr: 0 (none) in DSA: Yes is proxy: No As we can see, all data blocks of the file now reside at /dev/vdc3 = FAQ = Q: How to find out serial number of device /dev/sdc1 in my logical volume mounted at /mnt A: Find out total number of devices in your logical volume, executing "volume.reiser4 /mnt". Then print all volume components by executing "volume.reiser4 /mnt -p i" in a loop for i = 0,.., N-1, where N - number of devices in your logical volume. Find out, which i is corresponding to /dev/sdc1. If you find this too complicated, feel free to send a patch for more simple procedure of serial number calculation :) [[category:Reiser4]] 7c18f3d048d83c25e102c2d4872b3cc719cee7cc 4448 4447 2020-11-12T17:05:51Z Edward 4 /* Examples */ Migration of data blocks in a logical volumes cah happen not only in the context of volume balancing procedure, aiming to keep fairness of distribution on the whole logical volume. Reiser5 allows user to migrate data of any specified file to any specified brick of the logical volume. Also, user can mark any regular file as "immobile", so that volume balancing procedures will ignore that file. 
Moreover, user can clear "immobile" status of any specified file, so that the file will be movable again by volume balancing procedures. Finally, user can run a special procedure, which will clear "immobile" status of all files on the volume and distribute their data in a fair manner. In particular, using this functionality, user is able to push out "hot" files on any high-performance device (e.g. proxy device) and pin them there. = File Migration: API = /* * Migrate file to specified target device. * @fd: file descriptor * @idx: serial number of target device in the logical volume */ /* * Provide correct path here. * This header file can be found in reiser4 kernel module, or * reiser4progs sources */ #include "reiser4/ioctl.h" struct reiser4_vol_op_args args; memset(&args, 0, sizeof(args)); args.opcode = REISER4_MIGRATE_FILE; args.val = idx; result = ioctl(fd, REISER4_IOC_VOLUME, &args); COMMENT. After ioctl successful completion the file is not necessarily written to the target device! To make sure of it, call fsync(2) after successful ioctl completion, or open the file with O_SYNC flag before migration. COMMENT. File migration is serialized with brick removal procedures by the file system (to ensure we don't migrate data to a brick removed by a concurrent procedure). Interrupted file migration procedure should be completed in the next or current mount session (depending on the interrupt reason). = Set file immobile status: API = /* * Set file "immobile". * @fd: file descriptor */ /* * Provide correct path here. * This header file can be found in reiser4 kernel module, or * reiser4progs sources */ #include "reiser4/ioctl.h" struct reiser4_vol_op_args args; memset(&args, 0, sizeof(args)); args.opcode = REISER4_SET_FILE_IMMOBILE; result = ioctl(fd, REISER4_IOC_VOLUME, &args); COMMENT. The immobile status guarantees that any data block of that file won't migrate to another device-component of the logical volume. 
Note, however, that such block can be easily relocated within device where it currently resides (once the file system finds better location for that block, etc). NOTE: All balancing procedures, which complete device removal, will ignore "immobile" status of any file. After device removal successful completion all data blocks of "immobile" files will be relocated to the remaining devices in accordance with current distribution policy. NOTE: Any selective file migration described above will ignore "immobile" status of the file! So the "immobile" status is honored only by volume balancing procedures, completing some operations such as adding a device to the logical volume, changing capacity of some device or flushing a proxy device. = Clear File immobile status: API = /* * Clear file "immobile" status. * @fd: file descriptor */ /* * Provide correct path here. * This header file can be found in reiser4 kernel module, or * reiser4progs sources */ #include "reiser4/ioctl.h" struct reiser4_vol_op_args args; memset(&args, 0, sizeof(args)); args.opcode = REISER4_CLR_FILE_IMMOBILE; result = ioctl(fd, REISER4_IOC_VOLUME, &args); NOTE: Selective file migration can make your distribution unfair! Currently it is strongly recommended to migrate files only to devices, which don't participate in regular data distribution e.g. to proxy brick, or to meta-data brick (on condition that it doesn't participate in regular data distribution). In the future it will be possible to turn off builtin distribution on any volume. in this case user will be responsible for appointing a destination device for any file on that volume. = File migration by volume.reiser4 tool = You can use volume.reiser4(8) utility for file migration as well as for setting/clearing file "immobile" status. To migrate a regular file just execute #volume.reiser4 -m N FILENAME where N is serial number of target device (i.e. device, that the file is supposed to migrate to), FILENAME is name of the file to migrate. 
To set the immobile status, execute

 # volume.reiser4 -i FILENAME

To clear the immobile status:

 # volume.reiser4 -e FILENAME

= Restore regular distribution on your logical volume =

By default, each data stripe in a logical volume is subject to a regular distribution policy, which provides fair distribution among all bricks. By migrating a file, the user violates the fairness of distribution, which can result in reduced space usage efficiency on the logical volume, unexpected ENOSPC errors, etc. The user may therefore want to restore regular fair distribution on the logical volume. This can be done by running the volume.reiser4 utility with the option -S (--restore-regular) on that volume:

 # volume.reiser4 -S MOUNTPOINT

This works like the usual balancing: it scans the volume and migrates the data of each file, clearing its "immobile" status.

= Holding "hot" files on a Proxy Device =

It makes sense to relocate the data of "hot" files to one or more devices with the highest performance in the logical volume, e.g. to a [[Proxy_Device_Administration|proxy device]]. For this you will need to mark every such file as "immobile" and move it to the desired device, so that balancing procedures (including flushing a proxy device) will ignore those files. See the Examples section below.

= Examples =

In this example we'll move a file to 1) a proxy brick and 2) a regular data brick, and pin it there.
Create an ID for the logical volume:

 # VOL_ID=`uuidgen`

Prepare 2 bricks for our logical volume, /dev/vdc2 as meta-data brick and /dev/vdc3 as proxy device:

 # DEV1=/dev/vdc2
 # DEV2=/dev/vdc3
 # mkfs.reiser4 -U $VOL_ID -y -t 256K $DEV1
 # mkfs.reiser4 -U $VOL_ID -y -a -t 256K $DEV2

Mount a logical volume consisting of one meta-data brick:

 # MNT=/mnt/test
 # mount $DEV1 $MNT

Add the proxy device to the logical volume:

 # volume.reiser4 -x $DEV2 $MNT

Create a 400K file (100 logical blocks) on our logical volume:

 # dd if=/dev/zero of=${MNT}/myfile bs=4K count=100
 # sync

Print all bricks:

 # volume.reiser4 $MNT -p0
 Brick Info:
  internal ID:   0 (meta-data brick)
  external ID:   6ee9927e-04c3-4683-a451-f1329de66222
  device name:   /dev/vdc2
  num replicas:  0
  block count:   2621440
  blocks used:   116
  system blocks: 115
  data capacity: 1843119
  space usage:   0.000
  volinfo addr:  0 (none)
  in DSA:        Yes
  is proxy:      No
 # volume.reiser4 $MNT -p1
 Brick Info:
  internal ID:   1 (data brick)
  external ID:   2cc41c8a-b3cd-4690-b3fc-bd840e067131
  device name:   /dev/vdc3
  num replicas:  0
  block count:   2621440
  blocks used:   215
  system blocks: 115
  data capacity: 2621325
  space usage:   0.000
  volinfo addr:  0 (none)
  in DSA:        No
  is proxy:      Yes

As we can see, the proxy device /dev/vdc3 contains 100 data blocks (blocks used - system blocks = 215 - 115).

Flush the proxy device:

 # volume.reiser4 -b $MNT

Print all bricks:

 # sync
 # volume.reiser4 $MNT -p0
 Brick Info:
  internal ID:   0 (meta-data brick)
  external ID:   6ee9927e-04c3-4683-a451-f1329de66222
  device name:   /dev/vdc2
  num replicas:  0
  block count:   2621440
  blocks used:   216
  system blocks: 115
  data capacity: 1843119
  space usage:   0.000
  volinfo addr:  0 (none)
  in DSA:        Yes
  is proxy:      No
 # volume.reiser4 $MNT -p1
 Brick Info:
  internal ID:   1 (data brick)
  external ID:   2cc41c8a-b3cd-4690-b3fc-bd840e067131
  device name:   /dev/vdc3
  num replicas:  0
  block count:   2621440
  blocks used:   115
  system blocks: 115
  data capacity: 2621325
  space usage:   0.000
  volinfo addr:  0 (none)
  in DSA:        No
  is proxy:      Yes

As we can see, all 100
data blocks were migrated to the meta-data brick /dev/vdc2 (blocks used = system blocks + data blocks + meta-data blocks = 115 + 100 + 1 = 216).

Mark myfile as immobile and migrate it to the proxy device:

 # volume.reiser4 -i ${MNT}/myfile
 # volume.reiser4 -m 1 ${MNT}/myfile

Print all bricks:

 # sync
 # volume.reiser4 $MNT -p0
 Brick Info:
  internal ID:   0 (meta-data brick)
  external ID:   6ee9927e-04c3-4683-a451-f1329de66222
  device name:   /dev/vdc2
  num replicas:  0
  block count:   2621440
  blocks used:   116
  system blocks: 115
  data capacity: 1843119
  space usage:   0.000
  volinfo addr:  0 (none)
  in DSA:        Yes
  is proxy:      No
 # volume.reiser4 $MNT -p1
 Brick Info:
  internal ID:   1 (data brick)
  external ID:   2cc41c8a-b3cd-4690-b3fc-bd840e067131
  device name:   /dev/vdc3
  num replicas:  0
  block count:   2621440
  blocks used:   215
  system blocks: 115
  data capacity: 2621325
  space usage:   0.000
  volinfo addr:  0 (none)
  in DSA:        No
  is proxy:      Yes

As we can see, the proxy device /dev/vdc3 again contains all the data blocks.

NOTE: the file was migrated in spite of its immobile status, because selective migration ignores that status.

Now flush the proxy device and make sure that the file remains on it:

 # volume.reiser4 -b $MNT
 # sync
 # volume.reiser4 $MNT -p0
 # volume.reiser4 $MNT -p1

As we can see, the flushing procedure respects the immobile status.

Finally, remove the proxy device from the logical volume:

 # volume.reiser4 -r $DEV2 $MNT

Print the single remaining brick of our logical volume:

 # volume.reiser4 $MNT -p0
 Brick Info:
  internal ID:   0 (meta-data brick)
  external ID:   6ee9927e-04c3-4683-a451-f1329de66222
  device name:   /dev/vdc2
  num replicas:  0
  block count:   2621440
  blocks used:   216
  system blocks: 115
  data capacity: 1843119
  space usage:   0.000
  volinfo addr:  0 (none)
  in DSA:        Yes
  is proxy:      No

As we can see, the file was migrated to the remaining brick /dev/vdc2 in spite of its immobile status. This is because the operation of removing a device ignores that status.

NOTE: the file remains immobile!
Now add /dev/vdc3 as a regular device (not a proxy) and move the file to that device:

 # volume.reiser4 -a $DEV2 $MNT
 # volume.reiser4 -m 1 ${MNT}/myfile

Print info about all bricks:

 # sync
 # volume.reiser4 $MNT -p0
 Brick Info:
  internal ID:   0 (meta-data brick)
  external ID:   6ee9927e-04c3-4683-a451-f1329de66222
  device name:   /dev/vdc2
  num replicas:  0
  block count:   2621440
  blocks used:   116
  system blocks: 115
  data capacity: 1843119
  space usage:   0.000
  volinfo addr:  0 (none)
  in DSA:        Yes
  is proxy:      No
 # volume.reiser4 $MNT -p1
 Brick Info:
  internal ID:   1 (data brick)
  external ID:   2cc41c8a-b3cd-4690-b3fc-bd840e067131
  device name:   /dev/vdc3
  num replicas:  0
  block count:   2621440
  blocks used:   215
  system blocks: 115
  data capacity: 2621325
  space usage:   0.000
  volinfo addr:  0 (none)
  in DSA:        Yes
  is proxy:      No

As we can see, all data blocks of the file now reside on /dev/vdc3.

= FAQ =

Q: How do I find out the serial number of device /dev/sdc1 in my logical volume mounted at /mnt?

A: Find out the total number of devices in your logical volume by executing "volume.reiser4 /mnt". Then print all volume components by executing "volume.reiser4 /mnt -p i" in a loop for i = 0, ..., N-1, where N is the number of devices in your logical volume, and find which i corresponds to /dev/sdc1. If you find this too complicated, feel free to send a patch for a simpler procedure of serial number calculation :)
Moreover, user can clear "immobile" status of any specified file, so that the file will be movable again by volume balancing procedures. Finally, user can run a special procedure, which will clear "immobile" status of all files on the volume and distribute their data in a fair manner. In particular, using this functionality, user is able to push out "hot" files on any high-performance device (e.g. proxy device) and pin them there. = File Migration: API = /* * Migrate file to specified target device. * @fd: file descriptor * @idx: serial number of target device in the logical volume */ /* * Provide correct path here. * This header file can be found in reiser4 kernel module, or * reiser4progs sources */ #include "reiser4/ioctl.h" struct reiser4_vol_op_args args; memset(&args, 0, sizeof(args)); args.opcode = REISER4_MIGRATE_FILE; args.val = idx; result = ioctl(fd, REISER4_IOC_VOLUME, &args); COMMENT. After ioctl successful completion the file is not necessarily written to the target device! To make sure of it, call fsync(2) after successful ioctl completion, or open the file with O_SYNC flag before migration. COMMENT. File migration is serialized with brick removal procedures by the file system (to ensure we don't migrate data to a brick removed by a concurrent procedure). Interrupted file migration procedure should be completed in the next or current mount session (depending on the interrupt reason). = Set file immobile status: API = /* * Set file "immobile". * @fd: file descriptor */ /* * Provide correct path here. * This header file can be found in reiser4 kernel module, or * reiser4progs sources */ #include "reiser4/ioctl.h" struct reiser4_vol_op_args args; memset(&args, 0, sizeof(args)); args.opcode = REISER4_SET_FILE_IMMOBILE; result = ioctl(fd, REISER4_IOC_VOLUME, &args); COMMENT. The immobile status guarantees that any data block of that file won't migrate to another device-component of the logical volume. 
Note, however, that such block can be easily relocated within device where it currently resides (once the file system finds better location for that block, etc). NOTE: All balancing procedures, which complete device removal, will ignore "immobile" status of any file. After device removal successful completion all data blocks of "immobile" files will be relocated to the remaining devices in accordance with current distribution policy. NOTE: Any selective file migration described above will ignore "immobile" status of the file! So the "immobile" status is honored only by volume balancing procedures, completing some operations such as adding a device to the logical volume, changing capacity of some device or flushing a proxy device. = Clear File immobile status: API = /* * Clear file "immobile" status. * @fd: file descriptor */ /* * Provide correct path here. * This header file can be found in reiser4 kernel module, or * reiser4progs sources */ #include "reiser4/ioctl.h" struct reiser4_vol_op_args args; memset(&args, 0, sizeof(args)); args.opcode = REISER4_CLR_FILE_IMMOBILE; result = ioctl(fd, REISER4_IOC_VOLUME, &args); NOTE: Selective file migration can make your distribution unfair! Currently it is strongly recommended to migrate files only to devices, which don't participate in regular data distribution e.g. to proxy brick, or to meta-data brick (on condition that it doesn't participate in regular data distribution). In the future it will be possible to turn off builtin distribution on any volume. in this case user will be responsible for appointing a destination device for any file on that volume. = File migration by volume.reiser4 tool = You can use volume.reiser4(8) utility for file migration as well as for setting/clearing file "immobile" status. To migrate a regular file just execute #volume.reiser4 -m N FILENAME where N is serial number of target device (i.e. device, that the file is supposed to migrate to), FILENAME is name of the file to migrate. 
To set immobile status simply execute #volume.reiser4 -i FILENAME To clear immobile status: #volume.reiser4 -e FILENAME = Restore regular distribution on your logical volume = By default each data stripe in a logical is a subject for a regular distribution policy, which provides fair distribution among all bricks. By migrating a file, user violates the fairness of distribution, which can result in losing space usage efficiency on your logical volume, unexpected ENOSPC errors, etc. And it can happen that user will want to restore regular fair distribution on his logical volume. It can be done by running volume.reiser4 utility with the option -S (--restore-regular) on that volume: #volume.reiser4 -S MOUNTPOINT Actually, it looks like usual balancing, which scans the volume and migrates data of each file, cleaning up its "immobile" status. = Holding "hot" files on Proxy Device = It makes sense to relocate data of "hot" files to one, or more devices, which have the highest performance in the logical volume, e.g. to [[Proxy_Device_Administration|proxy device]]. For this you will need to mark every such file as "immobile" and move it to the desired device, so that balancing procedures (including flushing a proxy device) will ignore those files. See Appendix below for example. = Examples = In this example we'll move a file to 1) proxy and 2) regular data brick and pin it there. 
Create ID of logical volume: # VOL_ID=`uuidgen` Prepare 2 bricks for our logical volume, /dev/vdc2 for meta-data brick and /dev/vdc3 for proxy-device: # DEV1=/dev/vdc2 # DEV2=/dev/vdc3 # mkfs.reiser4 -U $VOL_ID -y -t 256K $DEV1 # mkfs.reiser4 -U $VOL_ID -y -a -t 256K $DEV2 Mount a logical volume consisting of one meta-data brick: # MNT=/mnt/test # mount $DEV1 $MNT Add proxy-device to the logical volume # volume.reiser4 -x $DEV2 $MNT Create a 400K file (100 logical blocks) on our logical volume: # dd if=/dev/zero of=${MNT}/myfile bs=4K count=100 # sync Print all bricks: # volume.reiser4 $MNT -p0 Brick Info: internal ID: 0 (meta-data brick) external ID: 6ee9927e-04c3-4683-a451-f1329de66222 device name: /dev/vdc2 num replicas: 0 block count: 2621440 blocks used: 116 system blocks: 115 data capacity: 1843119 space usage: 0.000 volinfo addr: 0 (none) in DSA: Yes is proxy: No # volume.reiser4 $MNT -p1 Brick Info: internal ID: 1 (data brick) external ID: 2cc41c8a-b3cd-4690-b3fc-bd840e067131 device name: /dev/vdc3 num replicas: 0 block count: 2621440 blocks used: 215 system blocks: 115 data capacity: 2621325 space usage: 0.000 volinfo addr: 0 (none) in DSA: No is proxy: Yes As we can see, the proxy device /dev/vdc3 contains 100 data blocks (blocks used - system blocks) = 215 - 115 Flush proxy device: # volume.reiser4 -b $MNT Print all bricks: # sync # volume.reiser4 $MNT -p0 Brick Info: internal ID: 0 (meta-data brick) external ID: 6ee9927e-04c3-4683-a451-f1329de66222 device name: /dev/vdc2 num replicas: 0 block count: 2621440 blocks used: 216 system blocks: 115 data capacity: 1843119 space usage: 0.000 volinfo addr: 0 (none) in DSA: Yes is proxy: No # volume.reiser4 $MNT -p1 Brick Info: internal ID: 1 (data brick) external ID: 2cc41c8a-b3cd-4690-b3fc-bd840e067131 device name: /dev/vdc3 num replicas: 0 block count: 2621440 blocks used: 115 system blocks: 115 data capacity: 2621325 space usage: 0.000 volinfo addr: 0 (none) in DSA: No is proxy: Yes As we can see, all 100 
data blocks were migrated to the meta-data brick /dev/vdc2 (block used = system blocks + data blocks + meta-data blocks = 115 + 100 + 1 = 216) Mark myfile as immobile and migrate it to the proxy-device: # volume.reiser4 -i ${MNT}/myfile # volume.reiser4 -m 1 ${MNT}/myfile Print all bricks: # sync # volume.reiser4 $MNT -p0 Brick Info: internal ID: 0 (meta-data brick) external ID: 6ee9927e-04c3-4683-a451-f1329de66222 device name: /dev/vdc2 num replicas: 0 block count: 2621440 blocks used: 116 system blocks: 115 data capacity: 1843119 space usage: 0.000 volinfo addr: 0 (none) in DSA: Yes is proxy: No # volume.reiser4 $MNT -p1 Brick Info: internal ID: 1 (data brick) external ID: 2cc41c8a-b3cd-4690-b3fc-bd840e067131 device name: /dev/vdc3 num replicas: 0 block count: 2621440 blocks used: 215 system blocks: 115 data capacity: 2621325 space usage: 0.000 volinfo addr: 0 (none) in DSA: No is proxy: Yes As we can see, the proxy device /dev/vdc3 again contains all the data blocks. NOTE: file was migrated in spite of immobile status, because selective migration ignores that status. Now flush proxy device and make sure that the file remains on the proxy device: # volume.reiser4 -b $MNT # sync # volume.reiser4 $MNT -p0 # volume.reiser4 $MNT -p1 As we can see, flushing procedure respects immobile status. Finally, remove the proxy device from the logical volume: # volume.reiser4 -r $DEV2 $MNT Print the single remaining brick of our logical volume: # volume.reiser4 $MNT -p0 Brick Info: internal ID: 0 (meta-data brick) external ID: 6ee9927e-04c3-4683-a451-f1329de66222 device name: /dev/vdc2 num replicas: 0 block count: 2621440 blocks used: 216 system blocks: 115 data capacity: 1843119 space usage: 0.000 volinfo addr: 0 (none) in DSA: Yes is proxy: No As we can see, file was migrated to the remaining brick /dev/vdc2 in spite of its immobile status. This is because operation of removing a device ignores that status. NOTE: the file remains immobile! 
Now add /dev/vdc3 as regular device (not proxy) and move the file to that device: # volume.reiser4 -a $DEV2 $MNT # volume.reiser4 -m 1 ${MNT}/myfile Print info about all bricks: # sync # volume.reiser4 $MNT -p0 Brick Info: internal ID: 0 (meta-data brick) external ID: 6ee9927e-04c3-4683-a451-f1329de66222 device name: /dev/vdc2 num replicas: 0 block count: 2621440 blocks used: 116 system blocks: 115 data capacity: 1843119 space usage: 0.000 volinfo addr: 0 (none) in DSA: Yes is proxy: No # volume.reiser4 $MNT -p1 Brick Info: internal ID: 1 (data brick) external ID: 2cc41c8a-b3cd-4690-b3fc-bd840e067131 device name: /dev/vdc3 num replicas: 0 block count: 2621440 blocks used: 215 system blocks: 115 data capacity: 2621325 space usage: 0.000 volinfo addr: 0 (none) in DSA: Yes is proxy: No As we can see, all data blocks of the file now reside at /dev/vdc3 = FAQ = Q: How to find out serial number of device /dev/sdc1 in my logical volume mounted at /mnt A: Find out total number of devices in your logical volume, executing "volume.reiser4 /mnt". Then print all volume components by executing "volume.reiser4 /mnt -p i" in a loop for i = 0,.., N-1, where N - number of devices in your logical volume. Find out, which i is corresponding to /dev/sdc1. If you find this too complicated, feel free to send a patch for more simple procedure of serial number calculation :) a4ea28c7c5e263eae4004cf807cde3615258a370 4441 4439 2020-11-12T16:28:42Z Edward 4 /* Holding "hot" files on Proxy Device */ Migration of data blocks in a logical volumes cah happen not only in the context of volume balancing procedure, aiming to keep fairness of distribution on the whole logical volume. Reiser5 allows user to migrate data of any specified file to any specified brick of the logical volume. Also, user can mark any regular file as "immobile", so that volume balancing procedures will ignore that file. 
Moreover, user can clear "immobile" status of any specified file, so that the file will be movable again by volume balancing procedures. Finally, user can run a special procedure, which will clear "immobile" status of all files on the volume and distribute their data in a fair manner. In particular, using this functionality, user is able to push out "hot" files on any high-performance device (e.g. proxy device) and pin them there. = File Migration: API = /* * Migrate file to specified target device. * @fd: file descriptor * @idx: serial number of target device in the logical volume */ /* * Provide correct path here. * This header file can be found in reiser4 kernel module, or * reiser4progs sources */ #include "reiser4/ioctl.h" struct reiser4_vol_op_args args; memset(&args, 0, sizeof(args)); args.opcode = REISER4_MIGRATE_FILE; args.val = idx; result = ioctl(fd, REISER4_IOC_VOLUME, &args); COMMENT. After ioctl successful completion the file is not necessarily written to the target device! To make sure of it, call fsync(2) after successful ioctl completion, or open the file with O_SYNC flag before migration. COMMENT. File migration is serialized with brick removal procedures by the file system (to ensure we don't migrate data to a brick removed by a concurrent procedure). Interrupted file migration procedure should be completed in the next or current mount session (depending on the interrupt reason). = Set file immobile status: API = /* * Set file "immobile". * @fd: file descriptor */ /* * Provide correct path here. * This header file can be found in reiser4 kernel module, or * reiser4progs sources */ #include "reiser4/ioctl.h" struct reiser4_vol_op_args args; memset(&args, 0, sizeof(args)); args.opcode = REISER4_SET_FILE_IMMOBILE; result = ioctl(fd, REISER4_IOC_VOLUME, &args); COMMENT. The immobile status guarantees that any data block of that file won't migrate to another device-component of the logical volume. 
Note, however, that such block can be easily relocated within device where it currently resides (once the file system finds better location for that block, etc). NOTE: All balancing procedures, which complete device removal, will ignore "immobile" status of any file. After device removal successful completion all data blocks of "immobile" files will be relocated to the remaining devices in accordance with current distribution policy. NOTE: Any selective file migration described above will ignore "immobile" status of the file! So the "immobile" status is honored only by volume balancing procedures, completing some operations such as adding a device to the logical volume, changing capacity of some device or flushing a proxy device. = Clear File immobile status: API = /* * Clear file "immobile" status. * @fd: file descriptor */ /* * Provide correct path here. * This header file can be found in reiser4 kernel module, or * reiser4progs sources */ #include "reiser4/ioctl.h" struct reiser4_vol_op_args args; memset(&args, 0, sizeof(args)); args.opcode = REISER4_CLR_FILE_IMMOBILE; result = ioctl(fd, REISER4_IOC_VOLUME, &args); NOTE: Selective file migration can make your distribution unfair! Currently it is strongly recommended to migrate files only to devices, which don't participate in regular data distribution e.g. to proxy brick, or to meta-data brick (on condition that it doesn't participate in regular data distribution). In the future it will be possible to turn off builtin distribution on any volume. in this case user will be responsible for appointing a destination device for any file on that volume. = File migration by volume.reiser4 tool = You can use volume.reiser4(8) utility for file migration as well as for setting/clearing file "immobile" status. To migrate a regular file just execute #volume.reiser4 -m N FILENAME where N is serial number of target device (i.e. device, that the file is supposed to migrate to), FILENAME is name of the file to migrate. 
To set immobile status simply execute #volume.reiser4 -i FILENAME To clear immobile status: #volume.reiser4 -e FILENAME = Restore regular distribution on your logical volume = By default each data stripe in a logical is a subject for a regular distribution policy, which provides fair distribution among all bricks. By migrating a file, user violates the fairness of distribution, which can result in losing space usage efficiency on your logical volume, unexpected ENOSPC errors, etc. And it can happen that user will want to restore regular fair distribution on his logical volume. It can be done by running volume.reiser4 utility with the option -S (--restore-regular) on that volume: #volume.reiser4 -S MOUNTPOINT Actually, it looks like usual balancing, which scans the volume and migrates data of each file, cleaning up its "immobile" status. = Holding "hot" files on Proxy Device = It makes sense to relocate data of "hot" files to one, or more devices, which have the highest performance in the logical volume, e.g. to [https://reiser4.wiki.kernel.org/index.php/Proxy_Device_Administration proxy device]. For this you will need to mark every such file as "immobile" and move it to the desired device, so that balancing procedures (including flushing a proxy device) will ignore those files. See Appendix below for example. = Examples = In this example we'll move a file to 1) proxy and 2) regular data brick and pin it there. 
Create ID of logical volume: # VOL_ID=`uuidgen` Prepare 2 bricks for our logical volume, /dev/vdc2 for meta-data brick and /dev/vdc3 for proxy-device: # DEV1=/dev/vdc2 # DEV2=/dev/vdc3 # mkfs.reiser4 -U $VOL_ID -y -t 256K $DEV1 # mkfs.reiser4 -U $VOL_ID -y -a -t 256K $DEV2 Mount a logical volume consisting of one meta-data brick: # MNT=/mnt/test # mount $DEV1 $MNT Add proxy-device to the logical volume # volume.reiser4 -x $DEV2 $MNT Create a 400K file (100 logical blocks) on our logical volume: # dd if=/dev/zero of=${MNT}/myfile bs=4K count=100 # sync Print all bricks: # volume.reiser4 $MNT -p0 Brick Info: internal ID: 0 (meta-data brick) external ID: 6ee9927e-04c3-4683-a451-f1329de66222 device name: /dev/vdc2 num replicas: 0 block count: 2621440 blocks used: 116 system blocks: 115 data capacity: 1843119 space usage: 0.000 volinfo addr: 0 (none) in DSA: Yes is proxy: No # volume.reiser4 $MNT -p1 Brick Info: internal ID: 1 (data brick) external ID: 2cc41c8a-b3cd-4690-b3fc-bd840e067131 device name: /dev/vdc3 num replicas: 0 block count: 2621440 blocks used: 215 system blocks: 115 data capacity: 2621325 space usage: 0.000 volinfo addr: 0 (none) in DSA: No is proxy: Yes As we can see, the proxy device /dev/vdc3 contains 100 data blocks (blocks used - system blocks) = 215 - 115 Flush proxy device: # volume.reiser4 -b $MNT Print all bricks: # sync # volume.reiser4 $MNT -p0 Brick Info: internal ID: 0 (meta-data brick) external ID: 6ee9927e-04c3-4683-a451-f1329de66222 device name: /dev/vdc2 num replicas: 0 block count: 2621440 blocks used: 216 system blocks: 115 data capacity: 1843119 space usage: 0.000 volinfo addr: 0 (none) in DSA: Yes is proxy: No # volume.reiser4 $MNT -p1 Brick Info: internal ID: 1 (data brick) external ID: 2cc41c8a-b3cd-4690-b3fc-bd840e067131 device name: /dev/vdc3 num replicas: 0 block count: 2621440 blocks used: 115 system blocks: 115 data capacity: 2621325 space usage: 0.000 volinfo addr: 0 (none) in DSA: No is proxy: Yes As we can see, all 100 
data blocks were migrated to the meta-data brick /dev/vdc2 (block used = system blocks + data blocks + meta-data blocks = 115 + 100 + 1 = 216) Mark myfile as immobile and migrate it to the proxy-device: # volume.reiser4 -i ${MNT}/myfile # volume.reiser4 -m 1 ${MNT}/myfile Print all bricks: # sync # volume.reiser4 $MNT -p0 Brick Info: internal ID: 0 (meta-data brick) external ID: 6ee9927e-04c3-4683-a451-f1329de66222 device name: /dev/vdc2 num replicas: 0 block count: 2621440 blocks used: 116 system blocks: 115 data capacity: 1843119 space usage: 0.000 volinfo addr: 0 (none) in DSA: Yes is proxy: No # volume.reiser4 $MNT -p1 Brick Info: internal ID: 1 (data brick) external ID: 2cc41c8a-b3cd-4690-b3fc-bd840e067131 device name: /dev/vdc3 num replicas: 0 block count: 2621440 blocks used: 215 system blocks: 115 data capacity: 2621325 space usage: 0.000 volinfo addr: 0 (none) in DSA: No is proxy: Yes As we can see, the proxy device /dev/vdc3 again contains all the data blocks. NOTE: file was migrated in spite of immobile status, because selective migration ignores that status. Now flush proxy device and make sure that the file remains on the proxy device: # volume.reiser4 -b $MNT # sync # volume.reiser4 $MNT -p0 # volume.reiser4 $MNT -p1 As we can see, flushing procedure respects immobile status. Finally, remove the proxy device from the logical volume: # volume.reiser4 -r $DEV2 $MNT Print the single remaining brick of our logical volume: # volume.reiser4 $MNT -p0 Brick Info: internal ID: 0 (meta-data brick) external ID: 6ee9927e-04c3-4683-a451-f1329de66222 device name: /dev/vdc2 num replicas: 0 block count: 2621440 blocks used: 216 system blocks: 115 data capacity: 1843119 space usage: 0.000 volinfo addr: 0 (none) in DSA: Yes is proxy: No As we can see, file was migrated to the remaining brick /dev/vdc2 in spite of its immobile status. This is because operation of removing a device ignores that status. NOTE: the file remains immobile! 
Now add /dev/vdc3 as regular device (not proxy) and move the file to that device: # volume.reiser4 -a $DEV2 $MNT # volume.reiser4 -m 1 ${MNT}/myfile Print info about all bricks: # sync # volume.reiser4 $MNT -p0 Brick Info: internal ID: 0 (meta-data brick) external ID: 6ee9927e-04c3-4683-a451-f1329de66222 device name: /dev/vdc2 num replicas: 0 block count: 2621440 blocks used: 116 system blocks: 115 data capacity: 1843119 space usage: 0.000 volinfo addr: 0 (none) in DSA: Yes is proxy: No # volume.reiser4 $MNT -p1 Brick Info: internal ID: 1 (data brick) external ID: 2cc41c8a-b3cd-4690-b3fc-bd840e067131 device name: /dev/vdc3 num replicas: 0 block count: 2621440 blocks used: 215 system blocks: 115 data capacity: 2621325 space usage: 0.000 volinfo addr: 0 (none) in DSA: Yes is proxy: No As we can see, all data blocks of the file now reside at /dev/vdc3 = FAQ = Q: How to find out serial number of device /dev/sdc1 in my logical volume mounted at /mnt A: Find out total number of devices in your logical volume, executing "volume.reiser4 /mnt". Then print all volume components by executing "volume.reiser4 /mnt -p i" in a loop for i = 0,.., N-1, where N - number of devices in your logical volume. Find out, which i is corresponding to /dev/sdc1. If you find this too complicated, feel free to send a patch for more simple procedure of serial number calculation :) bc456c17df5191a807456af79ba0e3890b0e4f3c 4439 4438 2020-11-12T16:11:49Z Edward 4 /* Restore regular distribution on your logical volume */ Migration of data blocks in a logical volumes cah happen not only in the context of volume balancing procedure, aiming to keep fairness of distribution on the whole logical volume. Reiser5 allows user to migrate data of any specified file to any specified brick of the logical volume. Also, user can mark any regular file as "immobile", so that volume balancing procedures will ignore that file. 
Moreover, user can clear "immobile" status of any specified file, so that the file will be movable again by volume balancing procedures. Finally, user can run a special procedure, which will clear "immobile" status of all files on the volume and distribute their data in a fair manner. In particular, using this functionality, user is able to push out "hot" files on any high-performance device (e.g. proxy device) and pin them there. = File Migration: API = /* * Migrate file to specified target device. * @fd: file descriptor * @idx: serial number of target device in the logical volume */ /* * Provide correct path here. * This header file can be found in reiser4 kernel module, or * reiser4progs sources */ #include "reiser4/ioctl.h" struct reiser4_vol_op_args args; memset(&args, 0, sizeof(args)); args.opcode = REISER4_MIGRATE_FILE; args.val = idx; result = ioctl(fd, REISER4_IOC_VOLUME, &args); COMMENT. After ioctl successful completion the file is not necessarily written to the target device! To make sure of it, call fsync(2) after successful ioctl completion, or open the file with O_SYNC flag before migration. COMMENT. File migration is serialized with brick removal procedures by the file system (to ensure we don't migrate data to a brick removed by a concurrent procedure). Interrupted file migration procedure should be completed in the next or current mount session (depending on the interrupt reason). = Set file immobile status: API = /* * Set file "immobile". * @fd: file descriptor */ /* * Provide correct path here. * This header file can be found in reiser4 kernel module, or * reiser4progs sources */ #include "reiser4/ioctl.h" struct reiser4_vol_op_args args; memset(&args, 0, sizeof(args)); args.opcode = REISER4_SET_FILE_IMMOBILE; result = ioctl(fd, REISER4_IOC_VOLUME, &args); COMMENT. The immobile status guarantees that any data block of that file won't migrate to another device-component of the logical volume. 
Note, however, that such block can be easily relocated within device where it currently resides (once the file system finds better location for that block, etc). NOTE: All balancing procedures, which complete device removal, will ignore "immobile" status of any file. After device removal successful completion all data blocks of "immobile" files will be relocated to the remaining devices in accordance with current distribution policy. NOTE: Any selective file migration described above will ignore "immobile" status of the file! So the "immobile" status is honored only by volume balancing procedures, completing some operations such as adding a device to the logical volume, changing capacity of some device or flushing a proxy device. = Clear File immobile status: API = /* * Clear file "immobile" status. * @fd: file descriptor */ /* * Provide correct path here. * This header file can be found in reiser4 kernel module, or * reiser4progs sources */ #include "reiser4/ioctl.h" struct reiser4_vol_op_args args; memset(&args, 0, sizeof(args)); args.opcode = REISER4_CLR_FILE_IMMOBILE; result = ioctl(fd, REISER4_IOC_VOLUME, &args); NOTE: Selective file migration can make your distribution unfair! Currently it is strongly recommended to migrate files only to devices, which don't participate in regular data distribution e.g. to proxy brick, or to meta-data brick (on condition that it doesn't participate in regular data distribution). In the future it will be possible to turn off builtin distribution on any volume. in this case user will be responsible for appointing a destination device for any file on that volume. = File migration by volume.reiser4 tool = You can use volume.reiser4(8) utility for file migration as well as for setting/clearing file "immobile" status. To migrate a regular file just execute #volume.reiser4 -m N FILENAME where N is serial number of target device (i.e. device, that the file is supposed to migrate to), FILENAME is name of the file to migrate. 
To set the immobile status, execute

 # volume.reiser4 -i FILENAME

To clear the immobile status:

 # volume.reiser4 -e FILENAME

= Restore regular distribution on your logical volume =

By default, each data stripe in a logical volume is subject to a regular distribution policy, which provides fair distribution among all bricks. By migrating a file, the user violates the fairness of distribution, which can reduce space usage efficiency on the logical volume, lead to unexpected ENOSPC errors, etc. The user may therefore want to restore regular fair distribution on the logical volume. This can be done by running the volume.reiser4 utility with the option -S (--restore-regular) on that volume:

 # volume.reiser4 -S MOUNTPOINT

This works like a usual balancing: it scans the volume and migrates the data of each file, clearing its "immobile" status.

= Holding "hot" files on a proxy device =

It makes sense to relocate the data of "hot" files to one or more devices with the highest performance in the logical volume, e.g. to a proxy device. For this you will need to mark every such file as "immobile" and move it to the desired device, so that balancing procedures (including flushing a proxy device) will ignore those files. See the Examples section below.

= Examples =

In this example we'll move a file to (1) a proxy brick and (2) a regular data brick, and pin it there.
Create an ID for the logical volume:

 # VOL_ID=`uuidgen`

Prepare 2 bricks for our logical volume: /dev/vdc2 as the meta-data brick and /dev/vdc3 as the proxy device:

 # DEV1=/dev/vdc2
 # DEV2=/dev/vdc3
 # mkfs.reiser4 -U $VOL_ID -y -t 256K $DEV1
 # mkfs.reiser4 -U $VOL_ID -y -a -t 256K $DEV2

Mount a logical volume consisting of one meta-data brick:

 # MNT=/mnt/test
 # mount $DEV1 $MNT

Add the proxy device to the logical volume:

 # volume.reiser4 -x $DEV2 $MNT

Create a 400K file (100 logical blocks) on our logical volume:

 # dd if=/dev/zero of=${MNT}/myfile bs=4K count=100
 # sync

Print all bricks:

 # volume.reiser4 $MNT -p0
 Brick Info:
  internal ID:   0 (meta-data brick)
  external ID:   6ee9927e-04c3-4683-a451-f1329de66222
  device name:   /dev/vdc2
  num replicas:  0
  block count:   2621440
  blocks used:   116
  system blocks: 115
  data capacity: 1843119
  space usage:   0.000
  volinfo addr:  0 (none)
  in DSA:        Yes
  is proxy:      No
 # volume.reiser4 $MNT -p1
 Brick Info:
  internal ID:   1 (data brick)
  external ID:   2cc41c8a-b3cd-4690-b3fc-bd840e067131
  device name:   /dev/vdc3
  num replicas:  0
  block count:   2621440
  blocks used:   215
  system blocks: 115
  data capacity: 2621325
  space usage:   0.000
  volinfo addr:  0 (none)
  in DSA:        No
  is proxy:      Yes

As we can see, the proxy device /dev/vdc3 contains 100 data blocks (blocks used - system blocks = 215 - 115).

Flush the proxy device:

 # volume.reiser4 -b $MNT

Print all bricks:

 # sync
 # volume.reiser4 $MNT -p0
 Brick Info:
  internal ID:   0 (meta-data brick)
  external ID:   6ee9927e-04c3-4683-a451-f1329de66222
  device name:   /dev/vdc2
  num replicas:  0
  block count:   2621440
  blocks used:   216
  system blocks: 115
  data capacity: 1843119
  space usage:   0.000
  volinfo addr:  0 (none)
  in DSA:        Yes
  is proxy:      No
 # volume.reiser4 $MNT -p1
 Brick Info:
  internal ID:   1 (data brick)
  external ID:   2cc41c8a-b3cd-4690-b3fc-bd840e067131
  device name:   /dev/vdc3
  num replicas:  0
  block count:   2621440
  blocks used:   115
  system blocks: 115
  data capacity: 2621325
  space usage:   0.000
  volinfo addr:  0 (none)
  in DSA:        No
  is proxy:      Yes

As we can see, all 100
data blocks were migrated to the meta-data brick /dev/vdc2 (blocks used = system blocks + data blocks + meta-data blocks = 115 + 100 + 1 = 216).

Mark myfile as immobile and migrate it to the proxy device:

 # volume.reiser4 -i ${MNT}/myfile
 # volume.reiser4 -m 1 ${MNT}/myfile

Print all bricks:

 # sync
 # volume.reiser4 $MNT -p0
 Brick Info:
  internal ID:   0 (meta-data brick)
  external ID:   6ee9927e-04c3-4683-a451-f1329de66222
  device name:   /dev/vdc2
  num replicas:  0
  block count:   2621440
  blocks used:   116
  system blocks: 115
  data capacity: 1843119
  space usage:   0.000
  volinfo addr:  0 (none)
  in DSA:        Yes
  is proxy:      No
 # volume.reiser4 $MNT -p1
 Brick Info:
  internal ID:   1 (data brick)
  external ID:   2cc41c8a-b3cd-4690-b3fc-bd840e067131
  device name:   /dev/vdc3
  num replicas:  0
  block count:   2621440
  blocks used:   215
  system blocks: 115
  data capacity: 2621325
  space usage:   0.000
  volinfo addr:  0 (none)
  in DSA:        No
  is proxy:      Yes

As we can see, the proxy device /dev/vdc3 again contains all the data blocks.

NOTE: The file was migrated in spite of its immobile status, because selective migration ignores that status.

Now flush the proxy device and make sure that the file remains on the proxy device:

 # volume.reiser4 -b $MNT
 # sync
 # volume.reiser4 $MNT -p0
 # volume.reiser4 $MNT -p1

As we can see, the flushing procedure respects the immobile status.

Finally, remove the proxy device from the logical volume:

 # volume.reiser4 -r $DEV2 $MNT

Print the single remaining brick of our logical volume:

 # volume.reiser4 $MNT -p0
 Brick Info:
  internal ID:   0 (meta-data brick)
  external ID:   6ee9927e-04c3-4683-a451-f1329de66222
  device name:   /dev/vdc2
  num replicas:  0
  block count:   2621440
  blocks used:   216
  system blocks: 115
  data capacity: 1843119
  space usage:   0.000
  volinfo addr:  0 (none)
  in DSA:        Yes
  is proxy:      No

As we can see, the file was migrated to the remaining brick /dev/vdc2 in spite of its immobile status. This is because the device removal operation ignores that status.

NOTE: the file remains immobile!
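The block accounting used in this example follows directly from the counters printed by volume.reiser4. As a minimal sketch (the struct and helper below are illustrative, not part of any Reiser4 API; the field names simply mirror the "Brick Info" output):

```c
/* Counters as printed in the "Brick Info" output of volume.reiser4. */
struct brick_counters {
	unsigned long blocks_used;   /* "blocks used"   */
	unsigned long system_blocks; /* "system blocks" */
};

/*
 * Non-system (data plus meta-data) blocks held by a brick:
 * blocks used minus system blocks.  For the proxy brick above,
 * 215 - 115 = 100 data blocks; for the meta-data brick after the
 * flush, 216 - 115 = 101 (100 data blocks plus 1 meta-data block).
 */
unsigned long nonsystem_blocks(const struct brick_counters *b)
{
	return b->blocks_used - b->system_blocks;
}
```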
Now add /dev/vdc3 as a regular device (not a proxy) and move the file to that device:

 # volume.reiser4 -a $DEV2 $MNT
 # volume.reiser4 -m 1 ${MNT}/myfile

Print info about all bricks:

 # sync
 # volume.reiser4 $MNT -p0
 Brick Info:
  internal ID:   0 (meta-data brick)
  external ID:   6ee9927e-04c3-4683-a451-f1329de66222
  device name:   /dev/vdc2
  num replicas:  0
  block count:   2621440
  blocks used:   116
  system blocks: 115
  data capacity: 1843119
  space usage:   0.000
  volinfo addr:  0 (none)
  in DSA:        Yes
  is proxy:      No
 # volume.reiser4 $MNT -p1
 Brick Info:
  internal ID:   1 (data brick)
  external ID:   2cc41c8a-b3cd-4690-b3fc-bd840e067131
  device name:   /dev/vdc3
  num replicas:  0
  block count:   2621440
  blocks used:   215
  system blocks: 115
  data capacity: 2621325
  space usage:   0.000
  volinfo addr:  0 (none)
  in DSA:        Yes
  is proxy:      No

As we can see, all data blocks of the file now reside on /dev/vdc3.

= FAQ =

Q: How do I find out the serial number of device /dev/sdc1 in my logical volume mounted at /mnt?

A: Find out the total number of devices N in your logical volume by executing "volume.reiser4 /mnt". Then print all volume components by executing "volume.reiser4 /mnt -p i" in a loop for i = 0, ..., N-1, and note which i corresponds to /dev/sdc1. If you find this too complicated, feel free to send a patch implementing a simpler procedure for finding the serial number :)
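The lookup described in the answer can be wrapped in a small shell function. This is a sketch under assumptions: it relies on the "device name:" line of the -p output shown in the examples above, stops at the first index the tool refuses to print, and the function name and the VOLUME_TOOL override are inventions of this example:

```shell
# Print the serial number (index) of brick DEV in the volume mounted at MNT.
# VOLUME_TOOL defaults to volume.reiser4; it can be overridden for testing.
: "${VOLUME_TOOL:=volume.reiser4}"

find_brick_index() {
    dev="$1"; mnt="$2"; i=0
    # Scan indices 0, 1, ... until the tool fails or prints nothing.
    while out=$("$VOLUME_TOOL" "$mnt" -p"$i" 2>/dev/null) && [ -n "$out" ]; do
        if printf '%s\n' "$out" | grep -q "device name: *$dev\$"; then
            echo "$i"
            return 0
        fi
        i=$((i + 1))
    done
    return 1   # not found
}

# Usage:
#   find_brick_index /dev/sdc1 /mnt
```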
Txn-doc

{{wayback|http://www.namesys.com/txn-doc.html|2006-11-13}}

Reiser4 Transaction Design Document

Last Update: Apr. 5, 2002

Joshua MacDonald, Hans Reiser and Alex Zarochentcev

== Summary ==

Reiser4 will feature advanced new transaction capabilities. The transaction model we describe for version 4 allows the file system programmer to specify a set of operations and guarantees that all or none of those operations will survive a system failure (i.e., a crash). The name for this specialized notion of a transaction is a transcrash.
In traditional Unix semantics, a sequence of write() system calls is not expected to be atomic, meaning that an in-progress write could be interrupted by a crash and leave part new and part old data behind. Writes are not even guaranteed to be ordered in the traditional semantics, meaning that recently-written data could survive a crash even though less-recently-written data does not. Some file systems offer a kind of write-atomicity, known as data-journaling, in which an individual data block is written to a log file before overwriting its real location, but this only ensures that individual blocks are written atomically, not the entire buffer of a write() system call. This technique doubles the amount of data written to the disk, which becomes significant when the disk transfer rate is a limiting performance factor.

Something more clever is possible. Instead of writing every modified block twice, we can write the block only once, to a new location, and then update the block's address in its parent node in the file system. However, the parent modification must also be included in the transaction. The [http://en.wikipedia.org/wiki/Write_Anywhere_File_Layout WAFL] (Write Anywhere File Layout) technique [[#References|Hitz94]] handles this by propagating file modifications all the way to the root node of the file system, which is then updated atomically.

In general, it is possible to use either approach to update a block: log a copy of the block and overwrite its original location, or relocate the block and modify its parent block within the same transaction. In Reiser4 this decision is made independently for each block by a block-allocation plugin, based on the set of modified blocks, the current file system layout, and the associated costs of each update method.

== Definition of Atomicity ==

Most file systems perform write caching, meaning that modified data are not immediately written to the disk.
Writes are deferred for a period of time, which allows the system greater control over disk scheduling. A system crash can happen at any time, causing some recent modifications to be lost, and this can be a serious problem if an application has made several interdependent modifications, some of which are lost when others are not. Such an application is said to require atomicity: a guarantee that all or none of a sequence of interdependent operations will survive a crash. Without atomicity, a system crash can leave the file system in an inconsistent state.

Dependent modifications may also arise when an application reads modified data and then produces further output. Consider the following sequence of events:

* Process 1 writes file A
* Process 2 reads file A
* Process 2 writes file B

At this point, file B may be dependent on file A. If the write-caching strategy can reverse the commit order of these operations, meaning to commit file B before file A, these processes are exposed to possible inconsistency in the event of a crash. By committing the sequence of write operations atomically there is no exposure to inconsistency. It is still possible for the write-caching strategy to write these files in any order, as long as the commit mechanism realizes that both writes must complete for the transaction to commit successfully. This means that standard disk-scheduling techniques such as the elevator algorithm are not ruled out by atomicity requirements.

A transcrash is a set of operations, of which all or none must survive a crash. An atom maintains the collection of data blocks that a transcrash has attempted to modify, along with all data blocks of other atoms that fused with it. Two atoms fuse when one transcrash attempts to read or write data blocks that are part of another atom. There are two types of transcrash: read-write fusing and write-only fusing.
A write-only-fusing transcrash, by default, only causes atoms to fuse together as it writes to data blocks outside its own atom. A read-write-fusing transcrash causes atoms to fuse together whenever it reads or writes data blocks outside its own atom. One may always specify within a write-only-fusing transcrash that a specific operation is read-fusing. Put another way, read-write-fusing transcrashes assume a read dependency exists, whereas write-only-fusing transcrashes support explicit read dependencies. A block-capture request is the underlying mechanism used to dynamically associate transcrashes, data blocks, and atoms. Initially, transcrashes and data blocks have no associated atom. When a block-capture request specifies a transcrash and a block belonging to different atoms, those atoms are fused together (subject to a few restrictions discussed later). Persons familiar with the database literature will note that these definitions do not imply isolation or serializability between processes. Isolation requires the ability to undo a sequence of operations when lock conflicts cause a deadlock to occur. Rollback is the ability to abort and undo the effects of the operations in an uncommitted transcrash. Transcrashes do not provide isolation, which is needed to support separate rollback of separate transcrashes. We only support unified rollback of all transcrashes in progress at the time of crash recovery. However, our architecture is designed to support separate, concurrent atoms so that it can be expanded to implement fully isolated transactions in the future. Currently, the only reason a transcrash will be aborted is a system crash. The system cannot individually abort a transcrash, and this means that transcrashes are only made available to trusted plugins inside the kernel. Once we have implemented isolation it will be possible for untrusted applications to access the transcrash interface for creating (only) isolated transcrashes.
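As a rough illustration of the capture-and-fuse mechanism, atom membership can be modeled as a union-find structure: a block-capture request naming a transcrash and a block that belong to different atoms fuses those atoms into one. This is only a sketch of the idea; the names (`atoms_init`, `capture_block`, and so on) are hypothetical, and the real implementation adds locking and the fusion restrictions discussed later.

```c
#include <assert.h>

#define MAX_ATOMS 64

/* Each atom either stands alone or records the atom it fused into. */
static int fused_into[MAX_ATOMS];

void atoms_init(void)
{
    for (int i = 0; i < MAX_ATOMS; i++)
        fused_into[i] = i;          /* every atom starts independent */
}

/* Follow fusion links to the representative atom. */
int atom_find(int a)
{
    while (fused_into[a] != a)
        a = fused_into[a];
    return a;
}

/* A block-capture request names a transcrash's atom and a block's
 * atom; when they differ, the two atoms are fused together. */
int capture_block(int txn_atom, int block_atom)
{
    int ta = atom_find(txn_atom);
    int ba = atom_find(block_atom);
    if (ta != ba)
        fused_into[ba] = ta;        /* fuse: block's atom joins in */
    return ta;                      /* the block's atom after capture */
}
```

Here fusion is unconditional; in the design above, for example, a non-expired atom would block rather than fuse with an expired one.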
== Stage One: Capturing and Fusing ==

The initial stage starts when an atom begins. The beginning of an atom is controlled by the transaction manager itself, but the event is always triggered by a block-capture request. A transaction preserves the previous contents of all modified blocks in their original location on disk until the transaction commits, which means it has reached a state where it will be completed even if there is a crash. The dirty blocks of an atom (which were captured and subsequently modified) are divided into two sets, relocate and overwrite, each of which is preserved in a different manner. The relocatable set is the set of blocks that have a dirty parent in the atom. The relocate set is those members of the relocatable set that we choose to relocate rather than overwrite. Whether we relocate or overwrite is a decision made for performance reasons. By writing the relocate set to different locations we avoid writing a second copy of each block to the log. When the current location of a block is its optimal location, relocation is a possible cause of file system fragmentation. We discuss relocation policies in a later section. The overwrite set contains all dirty blocks not in the relocate set (i.e., those which do not have a dirty parent and those for which overwrite is the better policy). A wandered copy of each overwrite block is written as part of the log before the atom commits and a second write replaces the original contents after the atom commits. Note that the superblock is the parent of the root node and the free space bitmap blocks have no parent. By these definitions, the superblock and modified bitmap blocks are always part of the overwrite set. (An alternative definition is the minimum overwrite set, which uses the same definition as above with the following modification. If at least three dirty blocks have a common parent that is clean then its parent is added to the minimum overwrite set.
The parent's dirty children are removed from the overwrite set and placed in the relocate set. This optimization will be saved for a later version.)

The system responds to memory pressure by selecting dirty blocks to be flushed. When dirty blocks are written during stage one it is called early flushing, because the atom remains uncommitted. When early flushing is needed we only select blocks from the relocate set, because their buffers can be released, whereas the overwrite set remains pinned in memory until after the atom commits. We must ensure that atoms make progress so they can eventually commit. An atom can only commit when it has no open transcrashes, but allowing atoms to fuse allows open transcrashes to join an existing atom which may be trying to commit. For this reason, an age is associated with each atom, and when an atom reaches expiration it begins actively flushing to disk. An expired atom takes steps to avoid new transcrashes prolonging its lifetime: (1) an expired atom will not accept any new transcrashes and (2) non-expired atoms will block rather than fuse with an expired atom. An expired atom is still allowed to fuse with any other stage-one atom, to avoid stalling expired atoms. Once an expired atom has no open transcrashes it is ready to close, meaning that it is ready to begin commit processing. All repacking, balancing, and allocation tasks have been performed by this point. Applications that are required to wait for synchronous commit (e.g., using fsync()) may have to wait for a lot of unrelated blocks to flush, since a large atom may have captured the bitmaps. We will only provide an interface for lazy transcrash commit that closes a transcrash and waits for it to commit. An application that would like to synchronize its data as early as possible would perhaps benefit from logical logging, which is not currently supported by our architecture, or from NVRAM.
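The relocate/overwrite split described for stage one can be condensed into a small decision function. This is a hypothetical sketch (in Reiser4 the choice is made by a block-allocation plugin weighing layout and cost), but it captures the stated rules: the superblock and bitmap blocks always fall into the overwrite set, only blocks with a dirty parent in the atom are relocatable, and a policy chooses among the relocatable blocks.

```c
#include <assert.h>
#include <stdbool.h>

enum disposition { RELOCATE, OVERWRITE };

/* Decide how one dirty block will be preserved at commit.  The
 * superblock and bitmap blocks have no dirty parent in the tree and
 * are always overwrite-logged; other blocks are relocatable only when
 * their parent is dirty in the same atom, and a policy decision then
 * picks between the two update methods. */
enum disposition classify_dirty_block(bool is_superblock_or_bitmap,
                                      bool parent_is_dirty,
                                      bool policy_prefers_relocate)
{
    if (is_superblock_or_bitmap)
        return OVERWRITE;           /* always in the overwrite set */
    if (!parent_is_dirty)
        return OVERWRITE;           /* not in the relocatable set */
    return policy_prefers_relocate ? RELOCATE : OVERWRITE;
}
```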
To finish stage one we have:

* The in-memory free space bitmaps have been updated such that the new relocate block locations are now allocated.
* The old locations of the relocate set and any blocks deleted by this atom are not immediately deallocated, as they cannot be reused until this atom commits. We must maintain two bitmaps: commit_bitmap is logged to disk as part of the overwrite set prior to commit, and working_bitmap is the working in-memory copy. In the working_bitmap the old locations of the relocate set and deleted blocks are not deallocated until after commit.
* Each atom collects a data structure representing its deallocate set, which is a list of the blocks it must deallocate once it commits. The deallocate set can be represented in a number of ways: as a list of block locations, a set of bitmaps, or using extent-compression. We expect to use a bitmap representation in our first implementation. Regardless of the representation, the deallocate set data structure is included in the commit record of this atom, where it will be used during crash recovery. The deallocate set is also used after the atom commits to update the in-memory bitmaps.
* Wandered locations are allocated for the overwrite set, and a list of the associations between wandered and real overwrite block locations for this atom is included in the commit record.
* The final commit record is formatted now, although it is not needed until stage three.

== Stage Two: Completing Writes ==

At this stage we begin to write the remaining dirty blocks of the atom. Any blocks that were captured and never modified can be released immediately, since they do not take part in the commit operation. To "release" a block means to allow another atom to capture it freely. Relocate blocks and overwrite blocks are treated separately at this point.

=== Relocate Blocks ===

A relocate block can be released once it has been flushed to disk.
All relocate blocks that were early-flushed in stage one are considered clean at this point, so they are released immediately. The remaining non-flushed relocate blocks are written at this point. Now we consider what happens if another atom requests to capture the block while the write request is being serviced. A read-capture request is granted just as if the block did not belong to any atom at this point—it is considered clean despite belonging to a not-yet-committed atom. The only requirement on this interaction is that no atom can jump ahead in the commit ordering. Atoms must commit in the order that they reach stage two, or else read-capture from a non-committed atom must explicitly construct and maintain this dependency. A write-capture request can be granted by copying the block. This introduces the first major optimization, called copy-on-capture. The capturing process assumes control of the block, and the committing atom retains an anonymous copy. When the write request completes, the anonymous copy is released (freed). Copy-on-capture is an optimization not performed in ReiserFS version 3 (which creates a copy of each dirty page at commit), but in that version the optimization is less important because the copying does not apply to unformatted nodes. If a relocate block-write finishes before the block is captured, it is released without further processing. Despite releasing relocate blocks in stage two, the atom still requires a list of old relocate block locations for deallocation purposes.

=== Overwrite Blocks ===

The overwrite blocks (including modified bitmaps and the superblock) are written at this point to their wandered locations as part of the log. Unlike relocate blocks, overwrite blocks are still needed after these writes complete, as they must also be written back to their real location. Similar to relocate blocks, a read-capture request is granted as if the block did not belong to any atom.
A write-capture request is granted by copying the block using the copy-on-capture method described above.

=== Issues ===

One issue with the copy-on-capture approach is that it does not address the use of memory-mapped files, which can have their contents modified at any point by a process. One answer to this is to exclude mmap() writers from any atomicity guarantees. A second alternative is to use hardware-level copy-on-write protection. A third alternative is to unmap the mapped blocks and allow ordinary page faults to capture them back again.

== Stage Three: Commit ==

When all of the outstanding stage two disk writes have completed, the atom reaches stage three, at which time it finally commits by writing its commit record to the log. Once this record reaches the disk, crash recovery will replay the transaction.

== Stage Four: Post-commit Disk Writes ==

The fourth stage begins when the commit record has been forced to the log.

=== Overwrite Block-Writes ===

Overwrite blocks need to be written to their real locations at this point, but there is also an ordering constraint. If a number of atoms commit in sequence that involve the same overwrite blocks, they must be sure to overwrite them in the proper order. This requires synchronization for atoms that have reached stage four and are writing overwrite blocks back to their real locations. This also suggests the second major optimization, labeled steal-on-capture. The steal-on-capture optimization is an extension of the copy-on-capture optimization that applies only to the overwrite set. The idea is that only the last transaction to modify an overwrite block actually needs to write that block. This optimization, which is also present in ReiserFS version 3, means that frequently modified overwrite blocks will be written less than twice per transaction.
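The last-writer rule behind steal-on-capture can be sketched with a per-block record of the most recent modifying atom; at stage four an atom writes a block back only if it is still that block's last modifier. The names are hypothetical, and the real code must synchronize this bookkeeping between committing atoms.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_BLOCKS 16

/* Per-block record of the most recent atom to modify it. */
static int last_modifier[MAX_BLOCKS];

void blocks_init(void)
{
    for (int i = 0; i < MAX_BLOCKS; i++)
        last_modifier[i] = -1;      /* no modifier yet */
}

/* A later atom "steals" the block by becoming its last modifier. */
void modify_block(int block, int atom)
{
    last_modifier[block] = atom;
}

/* At stage four, only the last atom to modify the block performs the
 * write back to its real location; earlier atoms skip the write. */
bool should_write_back(int block, int atom)
{
    return last_modifier[block] == atom;
}
```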
With this optimization a frequently modified overwrite block may avoid being overwritten by a series of atoms; as a result, crash recovery must replay more atoms than without the optimization. If an atom has overwrite blocks stolen, the atom must be replayed during crash recovery until every stealing atom commits. When an overwrite block-write finishes, the block is released without further processing.

=== Deallocate-Set Processing ===

April 2002 note: We are revising our strategy for bitmap handling.

The deallocate set can be deallocated in the in-memory bitmap blocks at this point. The bitmap modifications are not considered part of this atom (since it has committed). Instead, the deallocations are performed in the context of a different stage-one atom (or atoms). We call this process repossession, whereby a stage-one atom assumes responsibility for committing bitmap modifications on behalf of another atom in due course. For each bitmap block with pending deallocations in this stage, a separate stage-one atom may be chosen to repossess and deallocate blocks in that bitmap. This avoids the need to fuse atoms as a result of deallocation. A stage-one atom that has already captured a particular bitmap block will repossess for that block; otherwise a new atom can be selected. For crash recovery purposes, each atom must maintain a list of atoms for which it repossesses bitmap blocks. This repossesses-for list is included in the commit record for each atom. The issue of crash recovery and deallocation will be treated in the next section.

== Stage Five: Commit Completion ==

When all of the outstanding stage four disk writes are complete and all of the atoms that stole from this atom commit, the atom no longer needs to be replayed during crash recovery—the overwrite set is either completely written or will be completely written by replaying later atoms. Before the log space occupied by this atom can be reclaimed, however, another topic must be discussed.
=== Wandered Overwrite-Block Allocation ===

Overwrite blocks were written to wandered locations during stage two. Wandered block locations are considered part of the log in most respects—they are only needed for crash recovery of an atom that completes stage three but does not complete stage five. In the simplest approach, wandered blocks are not allocated or deallocated in the ordinary sense; instead they are appended to a cyclical log area. There are some problems with this approach, especially when considering LVM configurations: (1) the overwrite set can be a bottleneck because it is entirely written to the same region of the logical disk, and (2) it places limits on the size of the overwrite set. For these reasons, we allow wandered blocks to be written anywhere in the disk, and as a consequence we allocate wandered blocks in stage one, similarly to the relocate set. For maximum performance, the wandered set should be written using a sequential write. To achieve sequential writes in the common case, we allow the system to be configured with an optional number of areas specially reserved for wandered block allocation. In an LVM configuration, for example, reserved wandered block areas can be spread throughout the logical disk space to avoid any single disk being a bottleneck for the wandered set. Wandered block locations still need to be deallocated with this approach, but we must prevent replay of the atom's overwrites before these blocks can be deallocated. At this point (stage five), a log record is written signifying that the atom's overwrites should not be replayed.

== Stage Six: Deallocating Wandered Blocks ==

Once the do-not-replay-overwrites record for this atom has been forced to the log, the wandered block locations are deallocated using repossession, the same process used for the deallocate set. At this point, a number of atoms may have repossessed bitmap blocks on behalf of this atom, for both the deallocate set and the wandered set.
This atom must wait for all of those atoms to commit (i.e., reach stage four) before the log can wrap around and destroy this atom's commit record. Until that point, the atom is still needed during crash recovery because its deallocations may be incomplete. This completes the life of an atom. Now we must discuss several special topics.

== Reserving Space ==

The file system must be able to ensure that there are adequate disk space reserves to complete all active transactions. Since the previous contents of modified blocks are preserved until a transaction commits, the transaction must reserve one block of free disk space for every block it modifies. Ordinarily, it would be possible to simply fail an operation that cannot reserve enough free space to complete. Such a failure leaves the transaction in a state where it likely cannot make further progress. With isolated transactions, it is possible to simply abort the transaction at this point, but another solution is needed to handle this situation without isolated transactions. There are several possible solutions:

# Explicit space reservation — allow the transaction to pre-reserve the amount of space it intends to use. The application makes calls to an interface that reserves the required space. The call to reserve space may fail, so the application should only request a reservation at points where it is possible to recover a consistent state without exceeding the previous reservation. This is the only general-purpose solution until there is support for isolated transactions; the other solutions are best avoided.
# Allow operations to fail when space reserves are exceeded. This risks file system inconsistency because it may not be possible to recover a consistent state.
# Crash the system. To avoid inconsistency the entire system can be artificially crashed, effectively aborting every non-committed atom in the system.
# Crash just one atom.
It is possible to abort a non-committed atom without taking down the entire system, but this has extreme implications. Every process that has taken part in the atom is affected by this act, not just the transcrash that has exceeded its reservation. We will implement explicit space reservation, but there is always the possibility that an application exceeds its own reservation, forcing us to use at least one of the other solutions as a backup measure. Space reservation is a service agreement between the transaction manager and the application, and as long as the application stays within its reservation it can expect to complete its transactions without failure or crashing. To allow some room for error, we will maintain emergency space reserves: disk space reserved for applications that make incorrect explicit reservations. This is an attempt to prevent faulty applications from failing or bringing down the system. The use of emergency space reserves will be reported to the system log so that faulty applications can be corrected. Note that these measures will not in general protect against attack: a malicious user could exploit a faulty application to bring down the system or compromise data integrity. All of these options will be configurable on a per-file-system basis:

* (1) how much emergency space to reserve (e.g., 5% of disk space) and
* (2) whether to fail the operation or crash the system when reserves are exceeded.

== Write Atomicity Options ==

The transcrash interface provides the application with the ability to make an entire sequence of operations atomic, including all write() system calls. Even unmodified applications use a transcrash internally for each system call to protect file system consistency, but this requires special treatment for the write() system call.
Atomically writing a large buffer over pre-existing contents requires a large space reservation, a reservation that is not required by the write semantics (this does not apply to create or append). It is acceptable for the system to break a large write into smaller atomic units to reduce space reservation requirements. We will provide a per-file-system option to limit the size of atomic writes when they are performed outside the scope of an existing transaction (i.e., when the system starts a transcrash internally to protect consistency). This allows the system administrator to choose atomic writes up to some size (the space reservation requirement), beyond which writes will be broken into smaller atomic units.

== Crash Recovery Algorithm ==

April 2002 note: We are revising our strategy for crash recovery of bitmaps.

Some atoms may not be completely processed at the time of a crash. The crash recovery algorithm is responsible for determining what steps must be taken to make the file system consistent again. This includes making sure that:

* (1) all overwrites are complete and
* (2) all blocks have been deallocated.

We avoid discussing potential optimizations of the algorithm at present, to reduce complexity. Assume that after a crash occurs, the recovery manager has a way to detect the active log fragment, which contains the relevant set of log records that must be reprocessed. Also assume that each step can be performed using separate forward and/or reverse passes through the log. Later on we may choose to optimize these points. Overwrite set processing is relatively simple. Every atom with a commit record found in the active log fragment, but without a corresponding do-not-replay record, has its overwrite set copied from wandered to real block locations. Overwrite recovery processing should be complete before deallocation processing begins.
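The overwrite-replay step of recovery can be sketched against a toy model of the active log fragment. This is an illustrative simplification (one wandered-to-real mapping per atom, an int array standing in for the disk, hypothetical names); the real commit record carries a full list of wandered/real associations per atom.

```c
#include <assert.h>
#include <stdbool.h>

#define DISK_BLOCKS 32

/* Minimal model of one atom's entries in the active log fragment:
 * whether a commit record and a do-not-replay record were found, plus
 * a single wandered->real mapping from its commit record. */
struct log_atom {
    bool committed;
    bool do_not_replay;
    int  wandered_loc;              /* where the block was logged */
    int  real_loc;                  /* where it must finally live */
};

/* Replay overwrites: every committed atom without a do-not-replay
 * record has its overwrite set copied from wandered to real block
 * locations.  Returns the number of blocks replayed. */
int replay_overwrites(const struct log_atom *recs, int n,
                      int disk[DISK_BLOCKS])
{
    int replayed = 0;
    for (int i = 0; i < n; i++) {
        if (recs[i].committed && !recs[i].do_not_replay) {
            disk[recs[i].real_loc] = disk[recs[i].wandered_loc];
            replayed++;
        }
    }
    return replayed;
}
```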
Deallocation processing must deal separately with deallocation of the deallocate set (from stage four—deleted blocks and the old relocate set) and the wandered set (from stage six). The procedure is the same in each case, but since each atom performs two deallocation steps the recovery algorithm must treat them separately as well. The deallocation of an atom may be found in three possible states depending on whether none, some, or all of the deallocate blocks were repossessed and later committed. For each bitmap that would be modified by an atom's deallocation, the recovery algorithm must determine whether a repossessing atom later commits the same bitmap block. For each atom with a commit record in the active log fragment, the recovery algorithm determines: (1) which bitmap blocks are committed as part of its overwrite set and (2) which bitmap blocks are affected by its deallocation. For every committed atom that repossesses for another atom, the committed bitmap blocks are subtracted from the deallocate-affected bitmap blocks of the repossessed-for atom. After performing this computation, we know the set of deallocate-affected blocks that were not committed by any repossessing atoms; these deallocations are then reapplied to the on-disk bitmap blocks. This completes the crash recovery algorithm.

== Relocation and Fragmentation ==

As previously discussed, the choice of which blocks to relocate (instead of overwrite) is a policy decision and, as such, not directly related to transaction management. However, this issue affects fragmentation in the file system and therefore influences performance of the transaction system in general. The basic tradeoff here is between optimizing read and write performance. The relocate policy optimizes write performance because it allows the system to write blocks without costly seeks whenever possible. This can adversely affect read performance, since blocks that were once adjacent may become scattered throughout the disk.
The overwrite policy optimizes read performance because it attempts to maintain on-disk locality by preserving the location of existing blocks. This comes at the cost of write performance, since each block must be written twice per transaction. Since system and application workloads vary, we will support several relocation policies:

* Always Relocate: This policy includes a block in the relocate set whenever it will reduce the number of blocks written to the disk.
* Never Relocate: This policy disables relocation. Blocks are always written to their original location using overwrite logging.
* Left Neighbor: This policy puts the block in the nearest available location to its left neighbor in the tree ordering. If that location is occupied by some member of the atom being written, it makes the block a member of the overwrite set; otherwise the policy makes the block a member of the relocate set. This policy is simple to code, effective in the absence of a repacker, and will integrate well with an online repacker once that is coded. It will be the default policy initially.

Much more complex optimizations are possible, but deferred for a later release. Unlike WAFL, we expect the use of a repacker to play an important role.

== Meta-Data Journaling ==

Meta-data journaling is a restricted operating mode in which only file system meta-data are subject to atomicity constraints. In meta-data journaling mode, file data blocks (unformatted nodes) are not captured and therefore need not be flushed as the result of transaction commit. In this case, file data blocks are not considered members of either the relocate or the overwrite set because they do not participate in the atomic update protocol—memory pressure and age are the only factors that cause unformatted nodes to be written to disk in the meta-data journaling mode. This mode is expected to be mostly of academic interest.

== Bitmap Blocks Special Handling ==

Reiser4 allocates temporary blocks for wandered logging.
That means there is a difference between the commit bitmap block content, which is what we should restore after a system crash, and the working bitmap block content, which is used for free block search/allocation. (The changes to the bitmap are logged data that we write to disk at atom commit.) We keep each bitmap block in memory in two versions: one for the WORKING BITMAP, and another for the COMMIT BITMAP. The working bitmap is used just for searching for free blocks: if their bits are not set in the working bitmap, the corresponding blocks can be allocated. The working bitmap gets updated at every block allocation. The commit bitmap reflects changes made by already-committed atoms or by the atom which is currently being committed (we assume that atom commits are serialized, and only one atom can be committed at a time). The commit bitmap is updated at every atom commit. No bitmap data conversion (WORKING -> COMMIT) is needed; we only update the COMMIT bitmap at each transaction commit. We should note that block allocation/deallocation does not touch the COMMIT BITMAP until an atom reaches the commit stage. At that stage we apply the atom's changes which were made during the transaction. We take deallocated block numbers from the atom's deleted set and freshly allocated block numbers from the atom's captured lists, and we apply those changes to the commit bitmap before we write the modified commit bitmap blocks to disk. After applying the changes, the commit bitmap blocks are added to the transaction as usual (try_capture() is called). Having two bitmaps in memory gives us a great advantage because it allows one particular bitmap-block-handling optimization: we can allow several independent atoms to modify one bitmap block. Any number of atoms are allowed to allocate new blocks in any bitmap block without capturing it.
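A minimal model of the two-bitmap scheme, with hypothetical names: allocation touches only the working bitmap, and the commit bitmap is brought up to date only when an atom commits, by applying its allocated and deleted block lists before the bitmap block goes to disk.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NBLOCKS 16

/* Two in-memory versions of one bitmap block: the working bitmap
 * serves free-block searches, the commit bitmap is what is written to
 * disk and restored after a crash. */
static bool working[NBLOCKS];
static bool commit_bm[NBLOCKS];

/* Allocation updates only the working bitmap - no capture needed. */
int alloc_block(void)
{
    for (int i = 0; i < NBLOCKS; i++) {
        if (!working[i]) {
            working[i] = true;
            return i;
        }
    }
    return -1;                      /* no free block */
}

/* At commit, apply the atom's freshly allocated and deleted block
 * numbers to the commit bitmap; deleted blocks also become reusable
 * in the working bitmap once the commit takes effect. */
void apply_at_commit(const int *allocated, int na,
                     const int *deleted, int nd)
{
    for (int i = 0; i < na; i++)
        commit_bm[allocated[i]] = true;
    for (int i = 0; i < nd; i++) {
        commit_bm[deleted[i]] = false;
        working[deleted[i]] = false;
    }
}
```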
Block deallocation should be deferred until the atom finishes the commit stage (one reason for this is the elimination of unnecessary dependence between atoms). This means that unnecessary atom fusion can be avoided. We can keep atoms independent as long as they touch different data blocks and different internal nodes (in principle we could keep atoms independent even if they touch the same non-data internal node blocks, but this would require logical versioning rather than simple node versioning, and in our implementation we do simple node versioning except for bitmap blocks). Another hot spot in the reiser4 filesystem is the superblock, which contains the free blocks counter. A similar technique should be applied to allow several atoms to modify the free blocks counter. In general, points of high contention between multiple atoms can benefit from logical versioning rather than node versioning, and as the system matures more flavors of logical versioning will be added.

== References ==

* [http://www.usenix.org/publications/library/proceedings/sf94/hitz.html Hitz94], Dave Hitz, James Lau, and Michael Malcolm, "[http://www.netapp.com/library/tr/3002.pdf File System Design for an NFS File Server Appliance]", Proceedings of the Winter 1994 USENIX Conference, San Francisco, CA, January 1994, 235-246.

[[category:Reiser4]]

Reiser4 Transaction Design Document

Last Update: Apr. 5, 2002

Joshua MacDonald, Hans Reiser and Alex Zarochentcev

== Summary ==

Reiser4 will feature advanced new transaction capabilities. The transaction model we describe for version 4 allows the file system programmer to specify a set of operations and guarantees that all or none of those operations will survive a system failure (i.e., crash). The name for this specialized notion of a transaction is a transcrash.
In traditional Unix semantics, a sequence of write() system calls are not expected to be atomic, meaning that an in-progress write could be interrupted by a crash and leave part new and part old data behind. Writes are not even guaranteed to be ordered in the traditional semantics, meaning that recently-written data could survive a crash even though less-recently-written data does not survive. Some file systems offer a kind of write-atomicity, known as data-journaling, in which an individual data block is written to a log file before overwriting its real location, but this only ensures that individual blocks are written atomically, not the entire buffer of a write() system call. This technique doubles the amount of data written to the disk, which becomes significant when the disk transfer rate is a limiting performance factor. Something more clever is possible. Instead of writing every modified block twice, we can write the block only once to a new location and then update the block's address in its parent node in the file system. However, the parent modification must also be included in the transaction. The [http://en.wikipedia.org/wiki/Write_Anywhere_File_Layout WAFL] (Write Anywhere File Layout) technique [[#References|Hitz94]] handles this by propagating file modifications all the way to the root node of the file system, which is then updated atomically. In general, it is possible to use either approach to update a block - log a copy of the block and overwrite its original location or relocate the block and modify its parent block within the same transaction. In Reiser4 this decision is made independently for each block by a block-allocation plugin based on the set of modified blocks, the current file system layout, and the associated costs of each update method. == Definition of Atomicity == Most file systems perform write caching, meaning that modified data are not immediately written to the disk. 
Writes are deferred for a period of time, which allows the system greater control over disk scheduling. A system crash can happen at any time causing some recent modifications to be lost, and this can be a serious problem if an application has made several interdependent modifications, some of which are lost when others are not. Such an application is said to require atomicity—a guarantee that all or none of a sequence of interdependent operations will survive a crash. Without atomicity, a system crash can leave the file system in an inconsistent state. Dependent modifications may also arise when an application reads modified data and then produces further output. Consider the following sequence of events: * Process 1 writes file A * Process 2 reads file A * Process 2 writes file B At this point, file B may be dependent on file A. If the write-caching strategy can reverse the commit order of these operations, meaning to commit file B before file A, these processes are exposed to possible inconsistency in the event of a crash. By commiting the sequence of write operations atomically there is no exposure to inconsistency. It is still possible for the write-caching strategy to write these files in any order, as long as the commit mechanism realizes that both writes must complete for the transaction to commit successfully. This means that standard disk-scheduling techniques such as the elevator algorithm are not ruled out by atomicity requirements. A transcrash is a set of operations, of which all or none must survive a crash. An atom maintains the collection of data blocks that a transcrash has attempted to modify along with all data blocks of other atoms that fused with it. Two atoms fuse when one transcrash attempts to read or write data blocks that are part of another atom./ There are two types of transcrash: read-write fusing, and write-only fusing. 
A write-only-fusing transcrash by default only causes atoms to fuse together as it writes to data blocks outside its own atom. A read-write-fusing transcrash causes atoms to fuse together whenever it reads or writes data blocks outside its own atom. One may always specify within a write-only-fusing transcrash that a specific operation is read-fusing. Put another way, read-write-fusing transcrashes assume a read dependency on every read, whereas write-only-fusing transcrashes require read dependencies to be declared explicitly. A block-capture request is the underlying mechanism used to dynamically associate transcrashes, data blocks, and atoms together. Initially, transcrashes and data blocks have no associated atom. When a block-capture request specifies a transcrash and block belonging to different atoms, those atoms are fused together (subject to a few restrictions discussed later). Persons familiar with the database literature will note that these definitions do not imply isolation or serializability between processes. Isolation requires the ability to undo a sequence of operations when lock conflicts cause a deadlock to occur. Rollback is the ability to abort and undo the effects of the operations in an uncommitted transcrash. Transcrashes do not provide isolation, which is needed to support separate rollback of separate transcrashes. We only support unified rollback of all transcrashes in progress at the time of crash recovery. However, our architecture is designed to support separate, concurrent atoms so that it can be expanded to implement fully isolated transactions in the future. Currently, the only reason a transcrash will be aborted is a system crash. The system cannot individually abort a transcrash, and this means that transcrashes are only made available to trusted plugins inside the kernel. Once we have implemented isolation it will be possible for untrusted applications to access the transcrash interface for creating (only) isolated transcrashes. 
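The capture-and-fuse mechanics described above can be modeled in a few lines. The following is an illustrative sketch only, not Reiser4 kernel code: the names (Atom, Transcrash, capture) and the dictionary-based block-ownership table are hypothetical, chosen to mirror the definitions in this section.

```python
# Toy model of block capture and atom fusion (illustrative only; all
# names here are hypothetical, not the Reiser4 kernel API).

class Atom:
    def __init__(self):
        self.blocks = set()        # blocks captured by this atom
        self.transcrashes = set()  # open transcrashes attached to it

class Transcrash:
    # read_fusing=True models a read-write-fusing transcrash (every access
    # fuses); False models write-only fusing (a read fuses only when an
    # explicit read dependency is declared).
    def __init__(self, read_fusing=True):
        self.read_fusing = read_fusing
        self.atom = None

block_atom = {}  # block id -> owning Atom

def capture(trans, block, write, read_dependency=False):
    """Block-capture request: associate the transcrash, the block, and an
    atom; fuse two atoms when the request spans them."""
    if not write and not trans.read_fusing and not read_dependency:
        return  # write-only-fusing transcrash: a plain read does not fuse
    if trans.atom is None:
        trans.atom = Atom()
        trans.atom.transcrashes.add(trans)
    atom = trans.atom
    owner = block_atom.get(block)
    if owner is not None and owner is not atom:
        # Fuse: merge the other atom's blocks and transcrashes into ours.
        atom.blocks |= owner.blocks
        atom.transcrashes |= owner.transcrashes
        for t in owner.transcrashes:
            t.atom = atom
        for b in owner.blocks:
            block_atom[b] = atom
    atom.blocks.add(block)
    block_atom[block] = atom
```

For example, if transcrash 1 writes block A and transcrash 2 writes block B, two independent atoms exist; as soon as transcrash 2 also writes A, the two atoms fuse into one, exactly as the text requires.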
== Stage One: Capturing and Fusing == The initial stage starts when an atom begins. The beginning of an atom is controlled by the transaction manager itself, but the event is always triggered by a block-capture request. A transaction preserves the previous contents of all modified blocks in their original location on disk until the transaction commits, which means it has reached a state where it will be completed even if there is a crash. The dirty blocks of an atom (which were captured and subsequently modified) are divided into two sets, relocate and overwrite, each of which is preserved in a different manner. The relocatable set is the set of blocks that have a dirty parent in the atom. The relocate set is those members of the relocatable set that we choose to relocate rather than overwrite. Whether we relocate or overwrite is a decision made for performance reasons. By writing the relocate set to different locations we avoid writing a second copy of each block to the log. When the current location of a block is its optimal location, relocation is a possible cause of file system fragmentation. We discuss relocation policies in a later section. The overwrite set contains all dirty blocks not in the relocate set (i.e., those which do not have a dirty parent and those for which overwrite is the better policy). A wandered copy of each overwrite block is written as part of the log before the atom commits and a second write replaces the original contents after the atom commits. Note that the superblock is the parent of the root node and the free space bitmap blocks have no parent. By these definitions, the superblock and modified bitmap blocks are always part of the overwrite set. (An alternative definition is the minimum overwrite set, which uses the same definition as above with the following modification. If at least three dirty blocks have a common parent that is clean then its parent is added to the minimum overwrite set. 
The parent's dirty children are removed from the overwrite set and placed in the relocate set. This optimization will be saved for a later version.) The system responds to memory pressure by selecting dirty blocks to be flushed. Writing dirty blocks during stage one is called early flushing because the atom remains uncommitted. When early flushing is needed we only select blocks from the relocate set because their buffers can be released, whereas the overwrite set remains pinned in memory until after the atom commits. We must enforce that atoms make progress so they can eventually commit. An atom can only commit when it has no open transcrashes, but allowing atoms to fuse allows open transcrashes to join an existing atom that may be trying to commit. For this reason, an age is associated with each atom, and when an atom reaches expiration it begins actively flushing to disk. An expired atom takes steps to avoid new transcrashes prolonging its lifetime: (1) an expired atom will not accept any new transcrashes and (2) non-expired atoms will block rather than fuse with an expired atom. An expired atom is still allowed to fuse with any other stage-one atom to avoid stalling expired atoms. Once an expired atom has no open transcrashes it is ready to close, meaning that it is ready to begin commit processing. All repacking, balancing, and allocation tasks have been performed by this point. Applications that are required to wait for synchronous commit (e.g., using fsync()) may have to wait for a lot of unrelated blocks to flush since a large atom may have captured the bitmaps. We will only provide an interface for lazy transcrash commit that closes a transcrash and waits for it to commit. An application that would like to synchronize its data as early as possible would perhaps benefit from logical logging, which is not currently supported by our architecture, or NVRAM. 
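The expiration rules above can be stated as two small predicates. This is a hedged sketch with hypothetical helper names; the real transaction manager expresses these rules as blocking and fusing decisions on capture requests, not as standalone functions.

```python
# The atom-expiration rules, as predicates (hypothetical helper names).

def accepts_new_transcrash(atom_expired):
    """Rule 1: an expired atom will not accept any new transcrashes."""
    return not atom_expired

def may_fuse(requester_expired, target_expired):
    """Rule 2, for two stage-one atoms: a non-expired atom blocks rather
    than fuse with an expired one, while an expired atom may still fuse
    with any other stage-one atom (so expired atoms are not stalled)."""
    if target_expired and not requester_expired:
        return False  # requester waits for the expired atom to commit
    return True
```

The asymmetry is the point: fusion initiated against an expired atom would prolong its lifetime, so it is refused, while fusion initiated by an expired atom helps it drain and commit.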
To finish stage one we have: * The in-memory free space bitmaps have been updated such that the new relocate block locations are now allocated. * The old locations of the relocate set and any blocks deleted by this atom are not immediately deallocated as they cannot be reused until this atom commits. We must maintain two bitmaps: commit_bitmap is logged to disk as part of the overwrite set prior to commit, and working_bitmap is the working in-memory copy. In the working_bitmap the old locations of the relocate set and deleted blocks are not deallocated until after commit. * Each atom collects a data structure representing its deallocate set, which is a list of the blocks it must deallocate once it commits. The deallocate set can be represented in a number of ways: as a list of block locations, a set of bitmaps, or using extent-compression. We expect to use a bitmap representation in our first implementation. Regardless of the representation, the deallocate set data structure is included in the commit record of this atom where it will be used during crash recovery. The deallocate set is also used after the atom commits to update the in-memory bitmaps. * Wandered locations are allocated for the overwrite set and a list of the association between wandered and real overwrite block locations for this atom is included in the commit record. * The final commit record is formatted now, although it is not needed until stage three. == Stage Two: Completing Writes == At this stage we begin to write the remaining dirty blocks of the atom. Any blocks that were captured and never modified can be released immediately, since they do not take part in the commit operation. To "release" a block means to allow another atom to capture it freely. Relocate blocks and overwrite blocks are treated separately at this point. === Relocate Blocks === A relocate block can be released once it has been flushed to disk. 
All relocate blocks that were early-flushed in stage one are considered clean at this point, so they are released immediately. The remaining non-flushed relocate blocks are written at this point. Now we consider what happens if another atom requests to capture the block while the write request is being serviced. A read-capture request is granted just as if the block did not belong to any atom at this point—it is considered clean despite belonging to a not-yet-committed atom. The only requirement on this interaction is that no atom can jump ahead in the commit ordering. Atoms must commit in the order that they reach stage two or else read-capture from a non-committed atom must explicitly construct and maintain this dependency. A write-capture request can be granted by copying the block. This introduces the first major optimization called copy-on-capture. The capturing process assumes control of the block, and the committing atom retains an anonymous copy. When the write request completes, the anonymous copy is released (freed). Copy-on-capture is an optimization not performed in ReiserFS version 3 (which creates a copy of each dirty page at commit), but in that version the optimization is less important because the copying does not apply to unformatted nodes. If a relocate block-write finishes before the block is captured it is released without further processing. Despite releasing relocate blocks in stage two, the atom still requires a list of old relocate block locations for deallocation purposes. === Overwrite Blocks === The overwrite blocks (including modified bitmaps and the superblock) are written at this point to their wandered locations as part of the log. Unlike relocate blocks, overwrite blocks are still needed after these writes complete as they must also be written back to their real location. Similar to relocate blocks, a read-capture request is granted as if the block did not belong to any atom. 
A write-capture request is granted by copying the block using the copy-on-capture method described above. === Issues === One issue with the copy-on-capture approach is that it does not address the use of memory-mapped files, which can have their contents modified at any point by a process. One answer to this is to exclude mmap() writers from any atomicity guarantees. A second alternative is to use hardware-level copy-on-write protection. A third alternative is to unmap the mapped blocks and allow ordinary page faults to capture them back again. == Stage Three: Commit == When all of the outstanding stage two disk writes have completed, the atom reaches stage three, at which time it finally commits by writing its commit record to the log. Once this record reaches the disk, crash recovery will replay the transaction. == Stage Four: Post-commit Disk Writes == The fourth stage begins when the commit record has been forced to the log. === Overwrite Block-Writes === Overwrite blocks need to be written to their real locations at this point, but there is also an ordering constraint. If a number of atoms that involve the same overwrite blocks commit in sequence, they must be sure to overwrite them in the proper order. This requires synchronization for atoms that have reached stage four and are writing overwrite blocks back to their real locations. This also suggests a second major optimization, called steal-on-capture. The steal-on-capture optimization is an extension of the copy-on-capture optimization that applies only to the overwrite set. The idea is that only the last transaction to modify an overwrite block actually needs to write that block. This optimization, which is also present in ReiserFS version 3, means that frequently modified overwrite blocks will be written fewer than two times per transaction. 
With this optimization a frequently modified overwrite block may avoid being overwritten by a series of atoms; as a result crash recovery must replay more atoms than without the optimization. If an atom has overwrite blocks stolen, the atom must be replayed during crash recovery until every stealing atom commits. When an overwrite block-write finishes, the block is released without further processing. === Deallocate-Set Processing === April 2002 note: We are revising our strategy for bitmap handling. The deallocate set can be deallocated in the in-memory bitmap blocks at this point. The bitmap modifications are not considered part of this atom (since it has committed). Instead, the deallocations are performed in the context of a different stage-one atom (or atoms). We call this process repossession, whereby a stage-one atom assumes responsibility for eventually committing bitmap modifications on behalf of another atom. For each bitmap block with pending deallocations in this stage, a separate stage-one atom may be chosen to repossess and deallocate blocks in that bitmap. This avoids the need to fuse atoms as a result of deallocation. A stage-one atom that has already captured a particular bitmap block will repossess for that block; otherwise a new atom can be selected. For crash recovery purposes, each atom must maintain a list of atoms for which it repossesses bitmap blocks. This repossesses-for list is included in the commit record for each atom. The issue of crash recovery and deallocation will be treated in the next section. == Stage Five: Commit Completion == When all of the outstanding stage four disk writes are complete and all of the atoms that stole from this atom commit, the atom no longer needs to be replayed during crash recovery—the overwrite set is either completely written or will be completely written by replaying later atoms. Before the log space occupied by this atom can be reclaimed, however, another topic must be discussed. 
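The stage-five condition (an atom stays replayable until every atom that stole from it commits) can be stated as a small predicate. This is a hedged sketch: the data structures, a set of committed atom ids and a map from an atom to the atoms that stole its overwrite blocks, are hypothetical, not the on-disk log format.

```python
# Sketch of the stage-five replay condition under steal-on-capture
# (hypothetical data structures, not the Reiser4 log layout).

def still_needs_replay(atom, committed, stolen_by):
    """A committed atom must still be replayed during crash recovery if
    any atom that stole one of its overwrite blocks has not itself
    committed, since the stolen block's final write never happened."""
    return any(thief not in committed for thief in stolen_by.get(atom, ()))
```

In other words, the atom's commit record may be discarded only once the chain of stealing atoms has fully committed, which is why stolen blocks lengthen the replayable portion of the log.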
=== Wandered Overwrite-Block Allocation === Overwrite blocks were written to wandered locations during stage two. Wandered block locations are considered part of the log in most respects—they are only needed for crash recovery of an atom that completes stage three but does not complete stage five. In the simplest approach, wandered blocks are not allocated or deallocated in the ordinary sense; instead, they are appended to a cyclical log area. There are some problems with this approach, especially when considering LVM configurations: (1) the overwrite set can be a bottleneck because it is entirely written to the same region of the logical disk and (2) it places limits on the size of the overwrite set. For these reasons, we allow wandered blocks to be written anywhere on the disk, and as a consequence we allocate wandered blocks in stage one similarly to the relocate set. For maximum performance, the wandered set should be written using a sequential write. To achieve sequential writes in the common case, we allow the system to be configured with an optional number of areas specially reserved for wandered block allocation. In an LVM configuration, for example, reserved wandered block areas can be spread throughout the logical disk space to avoid any single disk being a bottleneck for the wandered set. Wandered block locations still need to be deallocated with this approach, but we must prevent replay of the atom's overwrites before these blocks can be deallocated. At this point (stage five), a log record is written signifying that the atom's overwrites should not be replayed. == Stage Six: Deallocating Wandered Blocks == Once the do-not-replay-overwrites record for this atom has been forced to the log, the wandered block locations are deallocated using repossession, the same process used for the deallocate set. At this point, a number of atoms may have repossessed bitmap blocks on behalf of this atom, for both the deallocate set and the wandered set. 
This atom must wait for all of those atoms to commit (i.e., reach stage four) before the log can wrap around and destroy this atom's commit record. Until that point, the atom is still needed during crash recovery because its deallocations may be incomplete. This completes the life of an atom. Now we must discuss several special topics. == Reserving Space == The file system must be able to ensure that there are adequate disk space reserves to complete all active transactions. Since the previous contents of modified blocks are preserved until a transaction commits, the transaction must reserve one block of free disk space for every block it modifies. Ordinarily, it would be possible to simply fail an operation that cannot reserve enough free space to complete. Such a failure leaves the transaction in a state where it likely cannot make further progress. With isolated transactions, it is possible to simply abort the transaction at this point, but another solution is needed to handle this situation without isolated transactions. There are several possible solutions: # Explicit space reservation — allow the transaction to pre-reserve the amount of space it intends to use. The application makes calls to an interface that reserves the required space. The call to reserve space may fail, so the application should only request a reservation at points where it is possible to recover a consistent state without exceeding the previous reservation. This is the only general-purpose solution until there is support for isolated transactions. The other solutions are best avoided. #: # Allow operations to fail when space reserves are exceeded. This presents possible file system inconsistency because it may not be possible to recover a consistent state. #: # Crash the system. To avoid inconsistency the entire system can be artificially crashed, effectively aborting every non-committed atom in the system. #: # Crash just one atom. 
It is possible to abort a non-committed atom without taking down the entire system, but this has extreme implications. Every process that has taken part in the atom is affected by this act, not just the transcrash that has exceeded its reservation. We will implement explicit space reservation, but there is always the possibility that an application exceeds its own reservation, forcing us to use at least one of the other solutions as a backup measure. Space reservation is a service agreement between the transaction manager and the application, and as long as the application stays within its reservation it can expect to complete its transactions without failure or crashing. To allow some room for error, we will maintain emergency space reserves, disk space reserved for applications that make incorrect explicit reservations. This is an attempt to prevent faulty applications from failing or bringing down the system. The use of emergency space reserves will be reported to the system log so that faulty applications can be corrected. Note that these measures will not in general protect against attack: a malicious user could exploit a faulty application to bring down the system or compromise data integrity. All of these options will be configurable on a per-file system basis: * (1) how much emergency space to reserve (e.g., 5% of disk space) and * (2) whether to fail the operation or crash the system when reserves are exceeded. == Write Atomicity Options == The transcrash interface provides the application with the ability to make an entire sequence of operations atomic, including all write() system calls. Even unmodified applications use a transcrash internally for each system call to protect file system consistency, but this requires special treatment for the write() system call. 
Atomically writing a large buffer over pre-existing contents requires a large space reservation, a reservation that is not required by the write semantics (this does not apply to create or append). It is acceptable for the system to break a large write into smaller atomic units to reduce space reservation requirements. We will provide a per-file system option to limit the size of atomic writes when they are performed outside the scope of an existing transaction (i.e., when the system starts a transcrash internally to protect consistency). This allows the system administrator to choose atomic writes up to some size (the space reservation requirement), beyond which writes will be broken into smaller atomic units. == Crash Recovery Algorithm == April 2002 note: We are revising our strategy for crash recovery of bitmaps. Some atoms may not be completely processed at the time of a crash. The crash recovery algorithm is responsible for determining what steps must be taken to make the file system consistent again. This includes making sure that: * (1) all overwrites are complete and * (2) all blocks have been deallocated. We avoid discussing potential optimizations of the algorithm at present, to reduce complexity. Assume that after a crash occurs, the recovery manager has a way to detect the active log fragment, which contains the relevant set of log records that must be reprocessed. Also assume that each step can be performed using separate forward and/or reverse passes through the log. Later on we may choose to optimize these points. Overwrite set processing is relatively simple. Every atom with a commit record found in the active log fragment, but without a corresponding do-not-replay record, has its overwrite set copied from wandered to real block locations. Overwrite recovery processing should be complete before deallocation processing begins. 
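The overwrite-replay rule just described can be sketched as a scan of the active log fragment. This is an illustrative model with a hypothetical record format (dicts carrying a type, an atom id, and a wandered map from real to wandered locations), not the actual Reiser4 log layout.

```python
# Illustrative overwrite-replay pass (hypothetical log-record format).

def replay_overwrites(log_records, disk):
    """Copy the overwrite set from wandered to real locations for every
    atom that has a commit record but no do-not-replay record, in commit
    order (the ordering required by steal-on-capture)."""
    commits = {}   # atom id -> {real_location: wandered_location}, log order
    done = set()   # atoms with a do-not-replay record
    for rec in log_records:
        if rec["type"] == "commit":
            commits[rec["atom"]] = rec["wandered"]
        elif rec["type"] == "do_not_replay":
            done.add(rec["atom"])
    for atom, wandered in commits.items():  # dicts preserve insertion order
        if atom in done:
            continue
        for real, wnd in wandered.items():
            disk[real] = disk[wnd]  # replay: wandered copy -> real location
```

Replaying in commit order means that when several replayed atoms touch the same real location, the last committer wins, matching the stage-four ordering constraint.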
Deallocation processing must deal separately with deallocation of the deallocate set (from stage four—deleted blocks and the old relocate set) and the wandered set (from stage six). The procedure is the same in each case, but since each atom performs two deallocation steps the recovery algorithm must treat them separately as well. The deallocation of an atom may be found in three possible states depending on whether none, some, or all of the deallocate blocks were repossessed and later committed. For each bitmap that would be modified by an atom's deallocation, the recovery algorithm must determine whether a repossessing atom later commits the same bitmap block. For each atom with a commit record in the active log fragment, the recovery algorithm determines: (1) which bitmap blocks are committed as part of its overwrite set and (2) which bitmap blocks are affected by its deallocation. For every committed atom that repossesses for another atom, the committed bitmap blocks are subtracted from the deallocate-affected bitmap blocks of the repossessed-for atom. After performing this computation, we know the set of deallocate-affected blocks that were not committed by any repossessing atoms; these deallocations are then reapplied to the on-disk bitmap blocks. This completes the crash recovery algorithm. == Relocation and Fragmentation == As previously discussed, the choice of which blocks to relocate (instead of overwrite) is a policy decision and, as such, not directly related to transaction management. However, this issue affects fragmentation in the file system and therefore influences performance of the transaction system in general. The basic tradeoff here is between optimizing read and write performance. The relocate policy optimizes write performance because it allows the system to write blocks without costly seeks whenever possible. This can adversely affect read performance, since blocks that were once adjacent may become scattered throughout the disk. 
The overwrite policy optimizes read performance because it attempts to maintain on-disk locality by preserving the location of existing blocks. This comes at the cost of write performance, since each block must be written twice per transaction. Since system and application workloads vary, we will support several relocation policies: * Always Relocate: This policy includes a block in the relocate set whenever it will reduce the number of blocks written to the disk. * Never Relocate: This policy disables relocation. Blocks are always written to their original location using overwrite logging. * Left Neighbor: This policy puts the block in the nearest available location to its left neighbor in the tree ordering. If that location is occupied by some member of the atom being written, it makes the block a member of the overwrite set; otherwise the policy makes the block a member of the relocate set. This policy is simple to code, effective in the absence of a repacker, and will integrate well with an online repacker once that is coded. It will be the default policy initially. Much more complex optimizations are possible, but deferred for a later release. Unlike WAFL, we expect the use of a repacker to play an important role. == Meta-Data Journaling == Meta-data journaling is a restricted operating mode in which only file system meta-data are subject to atomicity constraints. In meta-data journaling mode, file data blocks (unformatted nodes) are not captured and therefore need not be flushed as the result of transaction commit. In this case, file data blocks are not considered members of either the relocate or the overwrite set because they do not participate in the atomic update protocol—memory pressure and age are the only factors that cause unformatted nodes to be written to disk in the meta-data journaling mode. This mode is expected to be mostly of academic interest. == Bitmap Blocks Special Handling == Reiser4 allocates temporary blocks for wandered logging. 
That means we have a difference between the commit bitmap block content, which is what we should restore after a system crash, and the working bitmap block content, which is used for free block search/allocation. (The changes to the bitmap are logged data that we write to disk at atom commit.) We keep each bitmap block in memory in two versions: one for the WORKING BITMAP, and another for the COMMIT BITMAP. The working bitmap is used only for searching for free blocks: if a block's bit is not set in the working bitmap, that block can be allocated. The working bitmap gets updated at every block allocation. The commit bitmap reflects changes made by already committed atoms or by the atom which is currently being committed (we assume that atom commits are serialized, and only one atom can be committed at a time). The commit bitmap is updated at every atom commit. No bitmap data conversion (WORKING -> COMMIT) is needed; we only update the COMMIT bitmap at each transaction commit. We should note that block allocation/deallocation does not touch the COMMIT BITMAP until an atom reaches the commit stage. At that stage we apply the atom's changes which were made during the transaction. We take deallocated block numbers from the atom's deleted set and freshly allocated block numbers from the atom's captured lists, and we apply those changes to the commit bitmap before we write the modified commit bitmap blocks to disk. After applying the changes, the commit bitmap blocks are added to the transaction as usual (try_capture() is called). Having two bitmaps in memory gives us a great advantage because it allows a particular bitmap-block-handling optimization: we can allow several independent atoms to modify one bitmap block. Any number of atoms are allowed to allocate new blocks in any bitmap block without capturing it. 
Block deallocation should be deferred until the atom finishes the commit stage (one reason for this is the elimination of unnecessary dependence between atoms). This means that unnecessary atom fusion can be avoided. We can keep atoms independent as long as they touch different data blocks and different internal nodes (in principle we could keep atoms independent even if they touch the same non-data internal node blocks, but this would require logical versioning rather than simple node versioning, and in our implementation we do simple node versioning except for bitmap blocks). Another hot spot in the Reiser4 filesystem is the superblock, which contains the free blocks counter. A similar technique should be applied to allow several atoms to modify the free blocks counter. In general, points of high contention between multiple atoms can benefit from logical versioning rather than node versioning, and as the system matures more flavors of logical versioning will be added. == References == * [http://www.usenix.org/publications/library/proceedings/sf94/hitz.html Hitz94], Dave Hitz, James Lau, and Michael Malcolm, "[http://www.netapp.com/library/tr/3002.pdf File System Design for an NFS File Server Appliance]", Proceedings of the Winter 1994 USENIX Conference, San Francisco, CA, January 1994, pp. 235-246. [[category:Reiser4]] 
In traditional Unix semantics, a sequence of write() system calls are not expected to be atomic, meaning that an in-progress write could be interrupted by a crash and leave part new and part old data behind. Writes are not even guaranteed to be ordered in the traditional semantics, meaning that recently-written data could survive a crash even though less-recently-written data does not survive. Some file systems offer a kind of write-atomicity, known as data-journaling, in which an individual data block is written to a log file before overwriting its real location, but this only ensures that individual blocks are written atomically, not the entire buffer of a write() system call. This technique doubles the amount of data written to the disk, which becomes significant when the disk transfer rate is a limiting performance factor. Something more clever is possible. Instead of writing every modified block twice, we can write the block only once to a new location and then update the block's address in its parent node in the file system. However, the parent modification must also be included in the transaction. The [http://en.wikipedia.org/wiki/Write_Anywhere_File_Layout WAFL] (Write Anywhere File Layout) technique [Hitz94] handles this by propagating file modifications all the way to the root node of the file system, which is then updated atomically. In general, it is possible to use either approach to update a block - log a copy of the block and overwrite its original location or relocate the block and modify its parent block within the same transaction. In Reiser4 this decision is made independently for each block by a block-allocation plugin based on the set of modified blocks, the current file system layout, and the associated costs of each update method. == Definition of Atomicity == Most file systems perform write caching, meaning that modified data are not immediately written to the disk. 
Writes are deferred for a period of time, which allows the system greater control over disk scheduling. A system crash can happen at any time causing some recent modifications to be lost, and this can be a serious problem if an application has made several interdependent modifications, some of which are lost when others are not. Such an application is said to require atomicity—a guarantee that all or none of a sequence of interdependent operations will survive a crash. Without atomicity, a system crash can leave the file system in an inconsistent state. Dependent modifications may also arise when an application reads modified data and then produces further output. Consider the following sequence of events: * Process 1 writes file A * Process 2 reads file A * Process 2 writes file B At this point, file B may be dependent on file A. If the write-caching strategy can reverse the commit order of these operations, meaning to commit file B before file A, these processes are exposed to possible inconsistency in the event of a crash. By commiting the sequence of write operations atomically there is no exposure to inconsistency. It is still possible for the write-caching strategy to write these files in any order, as long as the commit mechanism realizes that both writes must complete for the transaction to commit successfully. This means that standard disk-scheduling techniques such as the elevator algorithm are not ruled out by atomicity requirements. A transcrash is a set of operations, of which all or none must survive a crash. An atom maintains the collection of data blocks that a transcrash has attempted to modify along with all data blocks of other atoms that fused with it. Two atoms fuse when one transcrash attempts to read or write data blocks that are part of another atom./ There are two types of transcrash: read-write fusing, and write-only fusing. 
A write-only-fusing transcrash by default only causes atoms to fuse together as it writes to data blocks outside its own atom. A read-write-fusing transcrash causes atoms to fuse together whenever it reads or writes data blocks outside its own atom. One may always specify within a write-only-fusing transcrash that a specific operation is read-fusing. Put another way, read-write-fusing transcrashes assume there is a read dependency, whereas write-only-fusing transcrashes support explicit read dependency. A block-capture request is the underlying mechanism used to dynamically associate transcrashes, data blocks, and atoms together. Initially, transcrashes and data blocks have no associated atom. When a block-capture request specifies a transcrash and block belonging to different atoms, those atoms are fused together (subject to a few restrictions discussed later). Persons familiar with the database literature will note that these definitions do not imply isolation or serializability between processes. Isolation requires the ability to undo a sequence of operations when lock conflicts cause a deadlock to occur. Rollback is the ability to abort and undo the effects of the operations in an uncommitted transcrash. Transcrashes do not provide isolation, which is needed to support separate rollback of separate transcrashes. We only support unified rollback of all transcrashes in progress at the time of crash recovery. However, our architecture is designed to support separate, concurrent atoms so that it can be expanded to implement fully isolated transactions in the future. Currently, the only reason a transcrash will be aborted is a system crash. The system cannot individually abort a transcrash, and this means that transcrashes are only made available to trusted plugins inside the kernel. Once we have implemented isolation it will be possible for untrusted applications to access the transcrash interface for creating (only) isolated transcrashes. 
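The capture-and-fuse behavior described above can be modeled in a few lines of Python. This is a toy sketch only: the names `Transcrash`, `Atom`, and `capture` are ours, not the kernel's, and a real capture request must also honor the fusing restrictions discussed later.

```python
class Atom:
    def __init__(self):
        self.blocks = set()  # data blocks captured by this atom


class Transcrash:
    def __init__(self):
        self.atom = None     # no associated atom until the first capture


def capture(transcrash, block, block_owner):
    """Capture `block` for `transcrash`, fusing atoms when they differ.

    `block_owner` maps each block to its owning atom (absent = no atom).
    Returns the atom that ends up owning both the transcrash and the block.
    """
    atom = transcrash.atom or Atom()
    other = block_owner.get(block)
    if other is not None and other is not atom:
        # Fuse: migrate the smaller atom's blocks into the larger one.
        # (A real implementation would also repoint every transcrash
        # still attached to the smaller atom.)
        small, large = sorted((atom, other), key=lambda a: len(a.blocks))
        large.blocks |= small.blocks
        for b in small.blocks:
            block_owner[b] = large
        atom = large
    atom.blocks.add(block)
    block_owner[block] = atom
    transcrash.atom = atom
    return atom
```

For example, two transcrashes touching disjoint blocks keep separate atoms, but as soon as one writes a block belonging to the other's atom, the two atoms fuse into one.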
== Stage One: Capturing and Fusing == The initial stage starts when an atom begins. The beginning of an atom is controlled by the transaction manager itself, but the event is always triggered by a block-capture request. A transaction preserves the previous contents of all modified blocks in their original location on disk until the transaction commits, which means it has reached a state where it will be completed even if there is a crash. The dirty blocks of an atom (which were captured and subsequently modified) are divided into two sets, relocate and overwrite, each of which is preserved in a different manner. The relocatable set is the set of blocks that have a dirty parent in the atom. The relocate set is those members of the relocatable set that we choose to relocate rather than overwrite. Whether we relocate or overwrite is a decision made for performance reasons. By writing the relocate set to different locations we avoid writing a second copy of each block to the log. When the current location of a block is its optimal location, relocation is a possible cause of file system fragmentation. We discuss relocation policies in a later section. The overwrite set contains all dirty blocks not in the relocate set (i.e., those which do not have a dirty parent and those for which overwrite is the better policy). A wandered copy of each overwrite block is written as part of the log before the atom commits and a second write replaces the original contents after the atom commits. Note that the superblock is the parent of the root node and the free space bitmap blocks have no parent. By these definitions, the superblock and modified bitmap blocks are always part of the overwrite set. (An alternative definition is the minimum overwrite set, which uses the same definition as above with the following modification. If at least three dirty blocks have a common parent that is clean then its parent is added to the minimum overwrite set. 
The parent's dirty children are removed from the overwrite set and placed in the relocate set. This optimization will be saved for a later version.) The system responds to memory pressure by selecting dirty blocks to be flushed. When dirty blocks are written during stage one it is called early flushing because the atom remains uncommitted. When early flushing is needed we only select blocks from the relocate set because their buffers can be released, whereas the overwrite set remains pinned in memory until after the atom commits. We must enforce that atoms make progress so they can eventually commit. An atom can only commit when it has no open transcrashes, but because atoms can fuse, open transcrashes may join an existing atom that is trying to commit. For this reason, an age is associated with each atom and when an atom reaches expiration it begins actively flushing to disk. An expired atom takes steps to avoid new transcrashes prolonging its lifetime: (1) an expired atom will not accept any new transcrashes and (2) non-expired atoms will block rather than fuse with an expired atom. An expired atom is still allowed to fuse with any other stage-one atom to avoid stalling expired atoms. Once an expired atom has no open transcrashes it is ready to close, meaning that it is ready to begin commit processing. All repacking, balancing, and allocation tasks have been performed by this point. Applications that are required to wait for synchronous commit (e.g., using fsync()) may have to wait for a lot of unrelated blocks to flush since a large atom may have captured the bitmaps. We will only provide an interface for lazy transcrash commit that closes a transcrash and waits for it to commit. An application that would like to synchronize its data as early as possible would perhaps benefit from logical logging, which is not currently supported by our architecture, or NVRAM. 
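As a rough illustration of how an atom's dirty blocks divide into the two sets, consider this Python sketch. The names are hypothetical, and the per-block relocate-vs-overwrite decision, which in Reiser4 is made by a block-allocation plugin, is reduced here to a `policy` callback:

```python
def partition_dirty(dirty, parent, policy):
    """Split an atom's dirty blocks into relocate and overwrite sets.

    dirty  -- set of dirty block ids captured by the atom
    parent -- maps a block to its parent block; the superblock and
              bitmap blocks have no entry, so they always overwrite
    policy -- callable choosing relocate (True) or overwrite (False)
              for blocks that are eligible for relocation
    """
    # Relocatable set: dirty blocks whose parent is also dirty.
    relocatable = {b for b in dirty if parent.get(b) in dirty}
    # Relocate set: relocatable blocks the policy chooses to move.
    relocate = {b for b in relocatable if policy(b)}
    # Everything else is preserved via a wandered log copy.
    overwrite = dirty - relocate
    return relocate, overwrite
```

With an always-relocate policy, two dirty leaves under a dirty root land in the relocate set, while the root (whose own parent is clean) and a modified bitmap block fall into the overwrite set, matching the definitions above.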
To finish stage one we have: * The in-memory free space bitmaps have been updated such that the new relocate block locations are now allocated. * The old locations of the relocate set and any blocks deleted by this atom are not immediately deallocated as they cannot be reused until this atom commits. We must maintain two bitmaps: commit_bitmap is logged to disk as part of the overwrite set prior to commit, and working_bitmap is the working in-memory copy. In the working_bitmap the old locations of the relocate set and deleted blocks are not deallocated until after commit. * Each atom collects a data structure representing its deallocate set, which is a list of the blocks it must deallocate once it commits. The deallocate set can be represented in a number of ways: as a list of block locations, a set of bitmaps, or using extent-compression. We expect to use a bitmap representation in our first implementation. Regardless of the representation, the deallocate set data structure is included in the commit record of this atom where it will be used during crash recovery. The deallocate set is also used after the atom commits to update the in-memory bitmaps. * Wandered locations are allocated for the overwrite set and a list of the association between wandered and real overwrite block locations for this atom is included in the commit record. * The final commit record is formatted now, although it is not needed until stage three. == Stage Two: Completing Writes == At this stage we begin to write the remaining dirty blocks of the atom. Any blocks that were captured and never modified can be released immediately, since they do not take part in the commit operation. To "release" a block means to allow another atom to capture it freely. Relocate blocks and overwrite blocks are treated separately at this point. === Relocate Blocks === A relocate block can be released once it has been flushed to disk. 
All relocate blocks that were early-flushed in stage one are considered clean at this point, so they are released immediately. The remaining non-flushed relocate blocks are written at this point. Now we consider what happens if another atom requests to capture the block while the write request is being serviced. A read-capture request is granted just as if the block did not belong to any atom at this point—it is considered clean despite belonging to a not-yet-committed atom. The only requirement on this interaction is that no atom can jump ahead in the commit ordering. Atoms must commit in the order that they reach stage two or else read-capture from a non-committed atom must explicitly construct and maintain this dependency. A write-capture request can be granted by copying the block. This introduces the first major optimization called copy-on-capture. The capturing process assumes control of the block, and the committing atom retains an anonymous copy. When the write request completes, the anonymous copy is released (freed). Copy-on-capture is an optimization not performed in ReiserFS version 3 (which creates a copy of each dirty page at commit), but in that version the optimization is less important because the copying does not apply to unformatted nodes. If a relocate block-write finishes before the block is captured it is released without further processing. Despite releasing relocate blocks in stage two, the atom still requires a list of old relocate block locations for deallocation purposes. === Overwrite Blocks === The overwrite blocks (including modified bitmaps and the superblock) are written at this point to their wandered locations as part of the log. Unlike relocate blocks, overwrite blocks are still needed after these writes complete as they must also be written back to their real location. Similar to relocate blocks, a read-capture request is granted as if the block did not belong to any atom. 
A write-capture request is granted by copying the block using the copy-on-capture method described above. === Issues === One issue with the copy-on-capture approach is that it does not address the use of memory-mapped files, which can have their contents modified at any point by a process. One answer to this is to exclude mmap() writers from any atomicity guarantees. A second alternative is to use hardware-level copy-on-write protection. A third alternative is to unmap the mapped blocks and allow ordinary page faults to capture them back again. == Stage Three: Commit == When all of the outstanding stage two disk writes have completed, the atom reaches stage three, at which time it finally commits by writing its commit record to the log. Once this record reaches the disk, crash recovery will replay the transaction. == Stage Four: Post-commit Disk Writes == The fourth stage begins when the commit record has been forced to the log. === Overwrite Block-Writes === Overwrite blocks need to be written to their real locations at this point, but there is also an ordering constraint. If a number of atoms commit in sequence that involve the same overwrite blocks, they must be sure to overwrite them in the proper order. This requires synchronization for atoms that have reached stage four and are writing overwrite blocks back to their real locations. This also suggests the second major optimization, called steal-on-capture. The steal-on-capture optimization is an extension of the copy-on-capture optimization that applies only to the overwrite set. The idea is that only the last transaction to modify an overwrite block actually needs to write that block. This optimization, which is also present in ReiserFS version 3, means that frequently modified overwrite blocks will be written fewer than two times per transaction. 
With this optimization a frequently modified overwrite block may avoid being overwritten by a series of atoms; as a result crash recovery must replay more atoms than without the optimization. If an atom has overwrite blocks stolen, the atom must be replayed during crash recovery until every stealing-atom commits. When an overwrite block-write finishes the block is released without further processing. === Deallocate-Set Processing === April 2002 note: We are revising our strategy for bitmap handling. The deallocate set can be deallocated in the in-memory bitmap blocks at this point. The bitmap modifications are not considered part of this atom (since it has committed). Instead, the deallocations are performed in the context of a different stage-one atom (or atoms). We call this process repossession, whereby a stage-one atom assumes responsibility for eventually committing bitmap modifications on behalf of another atom. For each bitmap block with pending deallocations in this stage, a separate stage-one atom may be chosen to repossess and deallocate blocks in that bitmap. This avoids the need to fuse atoms as a result of deallocation. A stage-one atom that has already captured a particular bitmap block will repossess for that block, otherwise a new atom can be selected. For crash recovery purposes, each atom must maintain a list of atoms for which it repossesses bitmap blocks. This repossesses-for list is included in the commit record for each atom. The issue of crash recovery and deallocation will be treated in the next section. == Stage Five: Commit Completion == When all of the outstanding stage four disk writes are complete and all of the atoms that stole from this atom commit, the atom no longer needs to be replayed during crash recovery—the overwrite set is either completely written or will be completely written by replaying later atoms. Before the log space occupied by this atom can be reclaimed, however, another topic must be discussed. 
=== Wandered Overwrite-Block Allocation === Overwrite blocks were written to wandered locations during stage two. Wandered block locations are considered part of the log in most respects—they are only needed for crash recovery of an atom that completes stage three but does not complete stage five. In the simplest approach, wandered blocks are not allocated or deallocated in the ordinary sense; instead they are appended to a cyclical log area. There are some problems with this approach, especially when considering LVM configurations: (1) the overwrite set can be a bottleneck because it is entirely written to the same region of the logical disk and (2) it places limits on the size of the overwrite set. For these reasons, we allow wandered blocks to be written anywhere in the disk, and as a consequence we allocate wandered blocks in stage one similarly to the relocate set. For maximum performance, the wandered set should be written using a sequential write. To achieve sequential writes in the common case, we allow the system to be configured with an optional number of areas specially reserved for wandered block allocation. In an LVM configuration, for example, reserved wandered block areas can be spread throughout the logical disk space to avoid any single disk being a bottleneck for the wandered set. Wandered block locations still need to be deallocated with this approach, but we must prevent replay of the atom's overwrites before these blocks can be deallocated. At this point (stage five), a log record is written signifying that the atom's overwrites should not be replayed. == Stage Six: Deallocating Wandered Blocks == Once the do-not-replay-overwrites record for this atom has been forced to the log, the wandered block locations are deallocated using repossession, the same process used for the deallocate set. At this point, a number of atoms may have repossessed bitmap blocks on behalf of this atom, for both the deallocate set and the wandered set. 
This atom must wait for all of those atoms to commit (i.e., reach stage four) before the log can wrap around and destroy this atom's commit record. Until such point, the atom is still needed during crash recovery because its deallocations may be incomplete. This completes the life of an atom. Now we must discuss several special topics. == Reserving Space == The file system must be able to ensure that there are adequate disk space reserves to complete all active transactions. Since the previous contents of modified blocks are preserved until a transaction commits, the transaction must reserve one block of free disk space for every block it modifies. Ordinarily, it would be possible to simply fail an operation that cannot reserve enough free space to complete. Such a failure leaves the transaction in a state where it cannot likely make further progress. With isolated transactions, it is possible to simply abort the transaction at this point, but another solution is needed to handle this situation without isolated transactions. There are several possible solutions: # Explicit space reservation — allow the transaction to pre-reserve the amount of space it intends to use. The application makes calls to an interface that reserves the required space. The call to reserve space may fail, so the application should only request a reservation at points where it is possible to recover a consistent state without exceeding the previous reservation. This is the only general purpose solution until there is support for isolated transactions. The other solutions are best avoided. #: # Allow operations to fail when space reserves are exceeded. This presents possible file system inconsistency because it may not be possible to recover a consistent state. #: # Crash the system. To avoid inconsistency the entire system can be artificially crashed, effectively aborting every non-committed atom in the system. #: # Crash just one atom. 
It is possible to abort a non-committed atom without taking down the entire system, but this has extreme implications. Every process that has taken part in the atom is affected by this act, not just the transcrash that has exceeded its reservation. We will implement explicit space reservation, but there is always the possibility that an application exceeds its own reservation, forcing us to use at least one of the other solutions as a backup measure. Space reservation is a service agreement between the transaction manager and the application, and as long as the application stays within its reservation it can expect to complete its transactions without failure or crashing. To achieve some room for error, we will maintain emergency space reserves, disk space reserved for applications that make incorrect explicit reservations. This is an attempt to prevent faulty applications from failing or bringing down the system. The use of emergency space reserves will be reported to the system log so that faulty applications can be corrected. Note that these measures will not in general protect against attack: a malicious user could exploit a faulty application to bring down the system or compromise data integrity. All of these options will be configurable on a per-file system basis: * (1) how much emergency space to reserve (e.g., 5% of disk space) and * (2) whether to fail the operation or crash the system when reserves are exceeded. == Write Atomicity Options == The transcrash interface provides the application with the ability to make an entire sequence of operations atomic, including all write() system calls. Even unmodified applications use a transcrash internally for each system call to protect file system consistency, but this requires special treatment for the write() system call. 
Atomically writing a large buffer over pre-existing contents requires a large space reservation, a reservation that is not required by the write semantics (this does not apply to create or append). It is acceptable for the system to break a large write into smaller atomic units to reduce space reservation requirements. We will provide a per-file system option to limit the size of atomic writes when they are performed outside the scope of an existing transaction (i.e., when the system starts a transcrash internally to protect consistency). This allows the system administrator to choose atomic writes up to some size (the space reservation requirement), beyond which writes will be broken into smaller atomic units. == Crash Recovery Algorithm == April 2002 note: We are revising our strategy for crash recovery of bitmaps. Some atoms may not be completely processed at the time of a crash. The crash recovery algorithm is responsible for determining what steps must be taken to make the file system consistent again. This includes making sure that: * (1) all overwrites are complete and * (2) all blocks have been deallocated. We avoid discussing potential optimizations of the algorithm at present, to reduce complexity. Assume that after a crash occurs, the recovery manager has a way to detect the active log fragment, which contains the relevant set of log records that must be reprocessed. Also assume that each step can be performed using separate forward and/or reverse passes through the log. Later on we may choose to optimize these points. Overwrite set processing is relatively simple. Every atom with a commit record found in the active log fragment, but without a corresponding do-not-replay record, has its overwrite set copied from wandered to real block locations. Overwrite recovery processing should be complete before deallocation processing begins. 
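The overwrite-replay step just described can be modeled as follows. This is a sketch over assumed in-memory structures, not the on-disk log format: walk the active log fragment in commit order and copy wandered blocks back to their real locations for every committed atom that lacks a do-not-replay record. Replaying in commit order also keeps steal-on-capture safe, because the last stealing atom's copy is applied last.

```python
def replay_overwrites(active_log, disk):
    """Redo incomplete overwrites after a crash.

    active_log -- atom records in commit order; each record holds
                  'committed' and 'done' flags ('done' stands in for
                  the do-not-replay record) and 'wandered', a map of
                  real block location -> logged (wandered) contents
    disk       -- maps real block location -> current contents
    """
    for atom in active_log:
        if atom["committed"] and not atom["done"]:
            for real, contents in atom["wandered"].items():
                disk[real] = contents  # wandered copy -> real location
```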
Deallocation processing must deal separately with deallocation of the deallocate set (from stage four—deleted blocks and the old relocate set) and the wandered set (from stage six). The procedure is the same in each case, but since each atom performs two deallocation steps the recovery algorithm must treat them separately as well. The deallocation of an atom may be found in three possible states depending on whether none, some, or all of the deallocate blocks were repossessed and later committed. For each bitmap that would be modified by an atom's deallocation, the recovery algorithm must determine whether a repossessing atom later commits the same bitmap block. For each atom with a commit record in the active log fragment, the recovery algorithm determines: (1) which bitmap blocks are committed as part of its overwrite set and (2) which bitmap blocks are affected by its deallocation. For every committed atom that repossesses for another atom, the committed bitmap blocks are subtracted from the deallocate-affected bitmap blocks of the repossessed-for atom. After performing this computation, we know the set of deallocate-affected blocks that were not committed by any repossessing atoms; these deallocations are then reapplied to the on-disk bitmap blocks. This completes the crash recovery algorithm. == Relocation and Fragmentation == As previously discussed, the choice of which blocks to relocate (instead of overwrite) is a policy decision and, as such, not directly related to transaction management. However, this issue affects fragmentation in the file system and therefore influences performance of the transaction system in general. The basic tradeoff here is between optimizing read and write performance. The relocate policy optimizes write performance because it allows the system to write blocks without costly seeks whenever possible. This can adversely affect read performance, since blocks that were once adjacent may become scattered throughout the disk. 
The overwrite policy optimizes read performance because it attempts to maintain on-disk locality by preserving the location of existing blocks. This comes at the cost of write performance, since each block must be written twice per transaction. Since system and application workloads vary, we will support several relocation policies: * Always Relocate: This policy includes a block in the relocate set whenever it will reduce the number of blocks written to the disk. * Never Relocate: This policy disables relocation. Blocks are always written to their original location using overwrite logging. * Left Neighbor: This policy puts the block in the nearest available location to its left neighbor in the tree ordering. If that location is occupied by some member of the atom being written it makes the block a member of the overwrite set, otherwise the policy makes the block a member of the relocate set. This policy is simple to code, effective in the absence of a repacker, and will integrate well with an online repacker once that is coded. It will be the default policy initially. Much more complex optimizations are possible, but deferred for a later release. Unlike WAFL, we expect the use of a repacker to play an important role. == Meta-Data Journaling == Meta-data journaling is a restricted operating mode in which only file system meta-data are subject to atomicity constraints. In meta-data journaling mode, file data blocks (unformatted nodes) are not captured and therefore need not be flushed as the result of transaction commit. In this case, file data blocks are not considered members of either the relocate set or the overwrite set because they do not participate in the atomic update protocol—memory pressure and age are the only factors that cause unformatted nodes to be written to disk in the meta-data journaling mode. This mode is expected to be mostly of academic interest. == Bitmap Blocks Special Handling == Reiser4 allocates temporary blocks for wandered logging. 
That means we have a difference between the commit bitmap block content, which is what we should restore after a system crash, and the working bitmap block content, which is used for free block search/allocation. (The changes to the bitmap are logged data that we write to disk at atom commit.) We keep each bitmap block in memory in two versions: one for the WORKING BITMAP, and another one for the COMMIT BITMAP. The working bitmap is used just for searching for free blocks: blocks whose bits are not set in the working bitmap can be allocated. The working bitmap gets updated at every block allocation. The commit bitmap reflects changes which are done in already committed atoms or in the atom which is currently being committed (we assume that atom commits are serialized, and only one atom can be committed at a time). The commit bitmap is updated at every atom commit. No bitmap data conversion (WORKING -> COMMIT) is needed; we only update the COMMIT bitmap at each transaction commit. We should note that block allocation/deallocation does not touch the COMMIT BITMAP until an atom reaches the commit stage. At that stage we apply the atom's changes which were made during the transaction. We take deallocated block numbers from the atom's deleted set and freshly allocated block numbers from the atom's captured lists, and we apply those changes to the commit bitmap before we write the modified commit bitmap blocks to disk. After applying the changes, the commit bitmap blocks are added to the transaction as usual (try_capture() is called). Having two bitmaps in memory gives us a great advantage because it allows one particular bitmap block handling optimization. The optimization is that we can allow several independent atoms to modify one bitmap block. Any number of atoms are allowed to allocate new blocks in any bitmap block without capturing it. 
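The working/commit bitmap split can be sketched like this (illustrative Python, not the kernel's bitmap code; a real implementation operates on on-disk bitmap blocks and defers deallocation exactly as the surrounding text describes):

```python
class BitmapPair:
    """Two in-memory copies of one bitmap block (toy model).

    The working bitmap serves the free-block allocator; the commit
    bitmap is what crash recovery must be able to restore, and it is
    only brought up to date when an atom commits.
    """
    def __init__(self, nbits):
        self.working = [0] * nbits  # updated at every allocation
        self.commit = [0] * nbits   # updated only at atom commit

    def allocate(self):
        # A clear bit in the working bitmap means the block is free.
        i = self.working.index(0)
        self.working[i] = 1
        return i

    def commit_atom(self, allocated, deallocated):
        # Apply one atom's net effect just before the bitmap block is
        # logged: set its fresh allocations, clear its deleted set.
        for i in allocated:
            self.commit[i] = 1
        for i in deallocated:
            self.commit[i] = 0
            self.working[i] = 0  # deferred working-bitmap deallocation
```

Note how several atoms may call `allocate()` against the same `BitmapPair` without capturing it; only `commit_atom()` touches the commit copy, which is the point of the optimization described above.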
Block deallocation should be deferred until the atom finishes the commit stage (one reason for this is the elimination of unnecessary dependence between atoms). This means that unnecessary atom fusion can be avoided. We can keep atoms independent as long as they touch different data blocks and different internal nodes (in principle we could keep atoms independent even if they touch the same non-data internal node blocks, but this would require logical versioning rather than simple node versioning, and in our implementation we do simple node versioning except for bitmap blocks). Another hot spot in the reiser4 filesystem is the super block, which contains the free blocks counter. A similar technique should be applied to allow several atoms to modify the free blocks counter. In general, points of high contention between multiple atoms can benefit from logical versioning rather than node versioning, and as the system matures more flavors of logical versioning will be added. == References == * [http://www.usenix.org/publications/library/proceedings/sf94/hitz.html Hitz94], Dave Hitz, James Lau, and Michael Malcolm, "[http://www.netapp.com/library/tr/3002.pdf File System Design for an NFS File Server Appliance]", Proceedings of the Winter 1994 USENIX Conference, San Francisco, CA, January 1994, 235-246. [[category:Reiser4]] 
In traditional Unix semantics, a sequence of write() system calls are not expected to be atomic, meaning that an in-progress write could be interrupted by a crash and leave part new and part old data behind. Writes are not even guaranteed to be ordered in the traditional semantics, meaning that recently-written data could survive a crash even though less-recently-written data does not survive. Some file systems offer a kind of write-atomicity, known as data-journaling, in which an individual data block is written to a log file before overwriting its real location, but this only ensures that individual blocks are written atomically, not the entire buffer of a write() system call. This technique doubles the amount of data written to the disk, which becomes significant when the disk transfer rate is a limiting performance factor. Something more clever is possible. Instead of writing every modified block twice, we can write the block only once to a new location and then update the block's address in its parent node in the file system. However, the parent modification must also be included in the transaction. The [http://en.wikipedia.org/wiki/Write_Anywhere_File_Layout WAFL] (Write Anywhere File Layout) technique [Hitz94] handles this by propagating file modifications all the way to the root node of the file system, which is then updated atomically. In general, it is possible to use either approach to update a block - log a copy of the block and overwrite its original location or relocate the block and modify its parent block within the same transaction. In Reiser4 this decision is made independently for each block by a block-allocation plugin based on the set of modified blocks, the current file system layout, and the associated costs of each update method. == Definition of Atomicity == Most file systems perform write caching, meaning that modified data are not immediately written to the disk. 
Writes are deferred for a period of time, which allows the system greater control over disk scheduling. A system crash can happen at any time causing some recent modifications to be lost, and this can be a serious problem if an application has made several interdependent modifications, some of which are lost when others are not. Such an application is said to require atomicity—a guarantee that all or none of a sequence of interdependent operations will survive a crash. Without atomicity, a system crash can leave the file system in an inconsistent state. Dependent modifications may also arise when an application reads modified data and then produces further output. Consider the following sequence of events: * Process 1 writes file A * Process 2 reads file A * Process 2 writes file B At this point, file B may be dependent on file A. If the write-caching strategy can reverse the commit order of these operations, meaning to commit file B before file A, these processes are exposed to possible inconsistency in the event of a crash. By commiting the sequence of write operations atomically there is no exposure to inconsistency. It is still possible for the write-caching strategy to write these files in any order, as long as the commit mechanism realizes that both writes must complete for the transaction to commit successfully. This means that standard disk-scheduling techniques such as the elevator algorithm are not ruled out by atomicity requirements. A transcrash is a set of operations, of which all or none must survive a crash. An atom maintains the collection of data blocks that a transcrash has attempted to modify along with all data blocks of other atoms that fused with it. Two atoms fuse when one transcrash attempts to read or write data blocks that are part of another atom./ There are two types of transcrash: read-write fusing, and write-only fusing. 
A write-only-fusing transcrash by default only causes atoms to fuse together as it writes to data blocks outside its own atom. A read-write-fusing transcrash causes atoms to fuse together whenever it reads or writes data blocks outside its own atom. One may always specify within a write-only-fusing transcrash that a specific operation is read-fusing. Put another way, read-write-fusing transcrashes assume there is read dependency whereas write-only-fusing transcrashes support explicit read dependency. A block-capture request is the underlying mechanism used to dynamically associate transcrashes, data blocks, and atoms together. Initially, transcrashes and data blocks have no associated atom. When a block-capture request specifies a transcrash and block belonging to different atoms, those atoms are fused together (subject to a few restrictions discussed later). Persons familiar with the database literature will note that these definitions do not imply isolation or serializability between processes. Isolation requires the ability to undo a sequence of operations when lock conflicts cause a deadlock to occur. Rollback is the ability to abort and undo the effects of the operations in an uncommitted transcrash. Transcrashes do not provide isolation, which is needed to support separate rollback of separate transcrashes. We only support unified rollback of all transcrashes in progress at the time of crash recovery. However, our architecture is designed to support separate, concurrent atoms so that it can be expanded to implement fully isolated transactions in the future. Currently, the only reason a transcrash will be aborted is due to a system crash. The system cannot individually abort a transcrash, and this means that transcrashes are only made available to trusted plugins inside the kernel. Once we have implemented isolation it will be possible for untrusted applications to access the transcrash interface for creating (only) isolated transcrashes. 
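The capture-and-fuse behavior described above can be sketched as a union-find over atoms. This is an illustrative model, not Reiser4 code; all names are invented for the example.

```python
class Atom:
    """A set of captured blocks; fusing merges two atoms (illustrative sketch)."""
    def __init__(self):
        self.blocks = set()
        self.fused_into = None  # forwarding pointer, set when this atom fuses away

    def find(self):
        # Follow forwarding pointers to the surviving atom after fusions.
        a = self
        while a.fused_into is not None:
            a = a.fused_into
        return a

def capture(atom, block, owner):
    """Grant a block-capture request. If the block already belongs to a
    different atom, fuse the two atoms into one."""
    a = atom.find()
    if owner is None:            # block has no atom yet: just capture it
        a.blocks.add(block)
        return a
    b = owner.find()
    if a is not b:               # different atoms: fuse the smaller into the larger
        if len(a.blocks) < len(b.blocks):
            a, b = b, a
        a.blocks |= b.blocks
        b.blocks.clear()
        b.fused_into = a
    a.blocks.add(block)
    return a
```

A read-write-fusing transcrash would issue capture() for reads as well as writes; a write-only-fusing transcrash would issue it only for writes (plus any explicitly read-fusing operations).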
== Stage One: Capturing and Fusing ==

The initial stage starts when an atom begins. The beginning of an atom is controlled by the transaction manager itself, but the event is always triggered by a block-capture request. A transaction preserves the previous contents of all modified blocks in their original location on disk until the transaction commits, which means it has reached a state where it will be completed even if there is a crash. The dirty blocks of an atom (which were captured and subsequently modified) are divided into two sets, relocate and overwrite, each of which is preserved in a different manner. The relocatable set is the set of blocks that have a dirty parent in the atom. The relocate set is those members of the relocatable set that we choose to relocate rather than overwrite. Whether we relocate or overwrite is a decision made for performance reasons. By writing the relocate set to different locations we avoid writing a second copy of each block to the log. When the current location of a block is its optimal location, relocation is a possible cause of file system fragmentation. We discuss relocation policies in a later section. The overwrite set contains all dirty blocks not in the relocate set (i.e., those which do not have a dirty parent and those for which overwrite is the better policy). A wandered copy of each overwrite block is written as part of the log before the atom commits and a second write replaces the original contents after the atom commits. Note that the superblock is the parent of the root node and the free space bitmap blocks have no parent. By these definitions, the superblock and modified bitmap blocks are always part of the overwrite set. (An alternative definition is the minimum overwrite set, which uses the same definition as above with the following modification. If at least three dirty blocks have a common parent that is clean then its parent is added to the minimum overwrite set.
The parent's dirty children are removed from the overwrite set and placed in the relocate set. This optimization will be saved for a later version.) The system responds to memory pressure by selecting dirty blocks to be flushed. When dirty blocks are written during stage one, it is called early flushing because the atom remains uncommitted. When early flushing is needed we only select blocks from the relocate set because their buffers can be released, whereas the overwrite set remains pinned in memory until after the atom commits. We must enforce that atoms make progress so they can eventually commit. An atom can only commit when it has no open transcrashes, but allowing atoms to fuse allows open transcrashes to join an existing atom which may be trying to commit. For this reason, an age is associated with each atom and when an atom reaches expiration it begins actively flushing to disk. An expired atom takes steps to avoid new transcrashes prolonging its lifetime: (1) an expired atom will not accept any new transcrashes and (2) non-expired atoms will block rather than fuse with an expired atom. An expired atom is still allowed to fuse with any other stage-one atom to avoid stalling expired atoms. Once an expired atom has no open transcrashes it is ready to close, meaning that it is ready to begin commit processing. All repacking, balancing, and allocation tasks have been performed by this point. Applications that are required to wait for synchronous commit (e.g., using fsync()) may have to wait for a lot of unrelated blocks to flush since a large atom may have captured the bitmaps. We will only provide an interface for lazy transcrash commit that closes a transcrash and waits for it to commit. An application that would like to synchronize its data as early as possible would perhaps benefit from logical logging, which is not currently supported by our architecture, or NVRAM.
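The relocate/overwrite split described at the start of this stage can be sketched as follows. The function and callback names are illustrative assumptions, and the policy predicate stands in for the block-allocation plugin's performance decision.

```python
def partition_dirty_blocks(dirty, parent_of, prefer_relocate):
    """Divide an atom's dirty blocks into the relocate and overwrite sets.

    dirty           -- set of dirty block ids captured by the atom
    parent_of       -- maps a block to its parent block, or None for blocks
                       with no parent (e.g., the free-space bitmap blocks)
    prefer_relocate -- policy predicate choosing relocation for a
                       relocatable block (a performance decision)
    """
    # The relocatable set: blocks whose parent is also dirty in this atom.
    relocatable = {b for b in dirty if parent_of(b) in dirty}
    relocate = {b for b in relocatable if prefer_relocate(b)}
    # Everything else is preserved via a wandered copy in the log; this
    # always includes the superblock and modified bitmap blocks.
    overwrite = dirty - relocate
    return relocate, overwrite
```

Note how a dirty twig whose parent is clean lands in the overwrite set even when its own children relocate, matching the definitions above.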
To finish stage one we have:

* The in-memory free space bitmaps have been updated such that the new relocate block locations are now allocated.
* The old locations of the relocate set and any blocks deleted by this atom are not immediately deallocated as they cannot be reused until this atom commits. We must maintain two bitmaps: commit_bitmap is logged to disk as part of the overwrite set prior to commit, and working_bitmap is the working in-memory copy. In the working_bitmap the old locations of the relocate set and deleted blocks are not deallocated until after commit.
* Each atom collects a data structure representing its deallocate set, which is a list of the blocks it must deallocate once it commits. The deallocate set can be represented in a number of ways: as a list of block locations, a set of bitmaps, or using extent-compression. We expect to use a bitmap representation in our first implementation. Regardless of the representation, the deallocate set data structure is included in the commit record of this atom where it will be used during crash recovery. The deallocate set is also used after the atom commits to update the in-memory bitmaps.
* Wandered locations are allocated for the overwrite set and a list of the association between wandered and real overwrite block locations for this atom is included in the commit record.
* The final commit record is formatted now, although it is not needed until stage three.

== Stage Two: Completing Writes ==

At this stage we begin to write the remaining dirty blocks of the atom. Any blocks that were captured and never modified can be released immediately, since they do not take part in the commit operation. To "release" a block means to allow another atom to capture it freely. Relocate blocks and overwrite blocks are treated separately at this point.

=== Relocate Blocks ===

A relocate block can be released once it has been flushed to disk.
All relocate blocks that were early-flushed in stage one are considered clean at this point, so they are released immediately. The remaining non-flushed relocate blocks are written at this point. Now we consider what happens if another atom requests to capture the block while the write request is being serviced. A read-capture request is granted just as if the block did not belong to any atom at this point—it is considered clean despite belonging to a not-yet-committed atom. The only requirement on this interaction is that no atom can jump ahead in the commit ordering. Atoms must commit in the order that they reach stage two or else read-capture from a non-committed atom must explicitly construct and maintain this dependency. A write-capture request can be granted by copying the block. This introduces the first major optimization called copy-on-capture. The capturing process assumes control of the block, and the committing atom retains an anonymous copy. When the write request completes, the anonymous copy is released (freed). Copy-on-capture is an optimization not performed in ReiserFS version 3 (which creates a copy of each dirty page at commit), but in that version the optimization is less important because the copying does not apply to unformatted nodes. If a relocate block-write finishes before the block is captured, it is released without further processing. Despite releasing relocate blocks in stage two, the atom still requires a list of old relocate block locations for deallocation purposes.

=== Overwrite Blocks ===

The overwrite blocks (including modified bitmaps and the superblock) are written at this point to their wandered locations as part of the log. Unlike relocate blocks, overwrite blocks are still needed after these writes complete as they must also be written back to their real location. Similar to relocate blocks, a read-capture request is granted as if the block did not belong to any atom.
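The copy-on-capture interaction can be sketched as follows, assuming blocks and atoms are plain dictionaries; none of these names come from the actual Reiser4 source.

```python
def copy_on_capture(block, committing_atom, capturing_atom):
    """Grant a write-capture against a block a committing atom is still
    writing: the committing atom keeps an anonymous copy for its in-flight
    write, and the capturing atom assumes control of the live block."""
    anonymous = dict(block)                 # snapshot tied to the pending write
    committing_atom["anonymous"].append(anonymous)
    block["owner"] = capturing_atom         # capturer now controls the live block
    return anonymous

def write_completed(committing_atom, anonymous):
    """When the in-flight write finishes, the anonymous copy is released."""
    committing_atom["anonymous"].remove(anonymous)
```

The capturer can then modify the live block freely while the committing atom's write proceeds from the frozen snapshot.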
A write-capture request is granted by copying the block using the copy-on-capture method described above.

=== Issues ===

One issue with the copy-on-capture approach is that it does not address the use of memory-mapped files, which can have their contents modified at any point by a process. One answer to this is to exclude mmap() writers from any atomicity guarantees. A second alternative is to use hardware-level copy-on-write protection. A third alternative is to unmap the mapped blocks and allow ordinary page faults to capture them back again.

== Stage Three: Commit ==

When all of the outstanding stage two disk writes have completed, the atom reaches stage three, at which time it finally commits by writing its commit record to the log. Once this record reaches the disk, crash recovery will replay the transaction.

== Stage Four: Post-commit Disk Writes ==

The fourth stage begins when the commit record has been forced to the log.

=== Overwrite Block-Writes ===

Overwrite blocks need to be written to their real locations at this point, but there is also an ordering constraint. If a number of atoms commit in sequence that involve the same overwrite blocks, they must be sure to overwrite them in the proper order. This requires synchronization for atoms that have reached stage four and are writing overwrite blocks back to their real locations. This also suggests the second major optimization potential, which is labeled steal-on-capture. The steal-on-capture optimization is an extension of the copy-on-capture optimization that applies only to the overwrite set. The idea is that only the last transaction to modify an overwrite block actually needs to write that block. This optimization, which is also present in ReiserFS version 3, means that frequently modified overwrite blocks will be written fewer than two times per transaction.
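The steal-on-capture rule, that only the last atom in commit order to modify an overwrite block writes it back, can be sketched in a few lines (the data layout here is illustrative):

```python
def last_writer_per_block(commits):
    """Given committed atoms in commit order, each paired with its set of
    overwrite blocks, return which atom is responsible for writing each
    block back to its real location (steal-on-capture sketch)."""
    responsible = {}
    for atom, overwrite_blocks in commits:   # iterate in commit order
        for block in overwrite_blocks:
            responsible[block] = atom        # a later committer steals the write
    return responsible
```

If atoms A1, A2, A3 commit in that order and all modify block 7, only A3 writes block 7 back; A1 and A2 must remain replayable until A3 commits.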
With this optimization a frequently modified overwrite block may avoid being overwritten by a series of atoms; as a result crash recovery must replay more atoms than without the optimization. If an atom has overwrite blocks stolen, the atom must be replayed during crash recovery until every stealing atom commits. When an overwrite block-write finishes, the block is released without further processing.

=== Deallocate-Set Processing ===

April 2002 note: We are revising our strategy for bitmap handling.

The deallocate set can be deallocated in the in-memory bitmap blocks at this point. The bitmap modifications are not considered part of this atom (since it has committed). Instead, the deallocations are performed in the context of a different stage-one atom (or atoms). We call this process repossession, whereby a stage-one atom assumes responsibility for committing bitmap modifications on behalf of another atom in time. For each bitmap block with pending deallocations in this stage, a separate stage-one atom may be chosen to repossess and deallocate blocks in that bitmap. This avoids the need to fuse atoms as a result of deallocation. A stage-one atom that has already captured a particular bitmap block will repossess for that block, otherwise a new atom can be selected. For crash recovery purposes, each atom must maintain a list of atoms for which it repossesses bitmap blocks. This "repossesses for" list is included in the commit record for each atom. The issue of crash recovery and deallocation will be treated in the next section.

== Stage Five: Commit Completion ==

When all of the outstanding stage four disk writes are complete and all of the atoms that stole from this atom commit, the atom no longer needs to be replayed during crash recovery—the overwrite set is either completely written or will be completely written by replaying later atoms. Before the log space occupied by this atom can be reclaimed, however, another topic must be discussed.
=== Wandered Overwrite-Block Allocation ===

Overwrite blocks were written to wandered locations during stage two. Wandered block locations are considered part of the log in most respects—they are only needed for crash recovery of an atom that completes stage three but does not complete stage five. In the simplest approach, wandered blocks are not allocated or deallocated in the ordinary sense; instead they are appended to a cyclical log area. There are some problems with this approach, especially when considering LVM configurations: (1) the overwrite set can be a bottleneck because it is entirely written to the same region of the logical disk and (2) it places limits on the size of the overwrite set. For these reasons, we allow wandered blocks to be written anywhere in the disk, and as a consequence we allocate wandered blocks in stage one similarly to the relocate set. For maximum performance, the wandered set should be written using a sequential write. To achieve sequential writes in the common case, we allow the system to be configured with an optional number of areas specially reserved for wandered block allocation. In an LVM configuration, for example, reserved wandered block areas can be spread throughout the logical disk space to avoid any single disk being a bottleneck for the wandered set. Wandered block locations still need to be deallocated with this approach, but we must prevent replay of the atom's overwrites before these blocks can be deallocated. At this point (stage five), a log record is written signifying that the atom should not have its overwrites replayed.

== Stage Six: Deallocating Wandered Blocks ==

Once the do-not-replay-overwrites record for this atom has been forced to the log, the wandered block locations are deallocated using repossession, the same process used for the deallocate set. At this point, a number of atoms may have repossessed bitmap blocks on behalf of this atom, for both the deallocate set and the wandered set.
This atom must wait for all of those atoms to commit (i.e., reach stage four) before the log can wrap around and destroy this atom's commit record. Until such point, the atom is still needed during crash recovery because its deallocations may be incomplete. This completes the life of an atom. Now we must discuss several special topics.

== Reserving Space ==

The file system must be able to ensure that there are adequate disk space reserves to complete all active transactions. Since the previous contents of modified blocks are preserved until a transaction commits, the transaction must reserve one block of free disk space for every block it modifies. Ordinarily, it would be possible to simply fail an operation that cannot reserve enough free space to complete. Such a failure leaves the transaction in a state where it likely cannot make further progress. With isolated transactions, it is possible to simply abort the transaction at this point, but another solution is needed to handle this situation without isolated transactions. There are several possible solutions:

1. Explicit space reservation — allow the transaction to pre-reserve the amount of space it intends to use. The application makes calls to an interface that reserves the required space. The call to reserve space may fail, so the application should only request a reservation at points where it is possible to recover a consistent state without exceeding the previous reservation. This is the only general purpose solution until there is support for isolated transactions. The other solutions are best avoided.

2. Allow operations to fail when space reserves are exceeded. This presents possible file system inconsistency because it may not be possible to recover a consistent state.

3. Crash the system. To avoid inconsistency the entire system can be artificially crashed, effectively aborting every non-committed atom in the system.

4. Crash just one atom.
It is possible to abort a non-committed atom without taking down the entire system, but this has extreme implications. Every process that has taken part in the atom is affected by this act, not just the transcrash that has exceeded its reservation. We will implement explicit space reservation, but there is always the possibility that an application exceeds its own reservation, forcing us to use at least one of the other solutions as a backup measure. Space reservation is a service agreement between the transaction manager and the application, and as long as the application stays within its reservation it can expect to complete its transactions without failure or crashing. To achieve some room for error, we will maintain emergency space reserves, disk space reserved for applications that make incorrect explicit reservations. This is an attempt to prevent faulty applications from failing or bringing down the system. The use of emergency space reserves will be reported to the system log so that faulty applications can be corrected. Note that these measures will not in general protect against attack: a malicious user could exploit a faulty application to bring down the system or compromise data integrity. All of these options will be configurable on a per-file system basis:

* (1) how much emergency space to reserve (e.g., 5% of disk space) and
* (2) whether to fail the operation or crash the system when reserves are exceeded.

== Write Atomicity Options ==

The transcrash interface provides the application with the ability to make an entire sequence of operations atomic, including all write() system calls. Even unmodified applications use a transcrash internally for each system call to protect file system consistency, but this requires special treatment for the write() system call.
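The explicit-reservation scheme with emergency reserves described above might look like the following sketch; the 5% default, the class, and all method names are illustrative assumptions, not the Reiser4 interface.

```python
class SpaceReservation:
    """Per-file-system space accounting with an emergency reserve for
    applications that under-reserve (illustrative sketch)."""
    def __init__(self, free_blocks, emergency_fraction=0.05):
        self.emergency = int(free_blocks * emergency_fraction)
        self.available = free_blocks - self.emergency
        self.held = {}                    # transcrash id -> reserved blocks

    def reserve(self, tc, blocks):
        """May fail; the caller should only ask at points where it can
        recover a consistent state within its previous reservation."""
        if blocks > self.available:
            return False
        self.available -= blocks
        self.held[tc] = self.held.get(tc, 0) + blocks
        return True

    def consume(self, tc, blocks):
        """Charge modified blocks against the reservation, dipping into
        the emergency reserve (and logging) on overrun."""
        held = self.held.get(tc, 0)
        if blocks <= held:
            self.held[tc] = held - blocks
            return True
        overrun = blocks - held
        if overrun <= self.emergency:
            self.emergency -= overrun
            self.held[tc] = 0
            # In the real system this would go to the system log.
            print(f"transcrash {tc} exceeded its reservation by {overrun} blocks")
            return True
        return False   # fail the operation, or crash, per configuration
```

A well-behaved application never triggers the overrun path; the emergency reserve only buffers faulty reservations, as the text notes.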
Atomically writing a large buffer over pre-existing contents requires a large space reservation, a reservation that is not required by the write semantics (this does not apply to create or append). It is acceptable for the system to break a large write into smaller atomic units to reduce space reservation requirements. We will provide a per-file system option to limit the size of atomic writes when they are performed outside the scope of an existing transaction (i.e., when the system starts a transcrash internally to protect consistency). This allows the system administrator to choose atomic writes up to some size (the space reservation requirement), beyond which writes will be broken into smaller atomic units.

== Crash Recovery Algorithm ==

April 2002 note: We are revising our strategy for crash recovery of bitmaps.

Some atoms may not be completely processed at the time of a crash. The crash recovery algorithm is responsible for determining what steps must be taken to make the file system consistent again. This includes making sure that:

* (1) all overwrites are complete and
* (2) all blocks have been deallocated.

We avoid discussing potential optimizations of the algorithm at present, to reduce complexity. Assume that after a crash occurs, the recovery manager has a way to detect the active log fragment, which contains the relevant set of log records that must be reprocessed. Also assume that each step can be performed using separate forward and/or reverse passes through the log. Later on we may choose to optimize these points. Overwrite set processing is relatively simple. Every atom with a commit record found in the active log fragment, but without a corresponding do-not-replay record, has its overwrite set copied from wandered to real block locations. Overwrite recovery processing should be complete before deallocation processing begins.
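The overwrite-replay selection just described amounts to a single pass over the active log fragment; a sketch, with illustrative record shapes:

```python
def atoms_to_replay(active_log):
    """Replay the overwrite set of every atom whose commit record appears
    in the active log fragment without a matching do-not-replay record."""
    committed, finished = set(), set()
    for kind, atom_id in active_log:
        if kind == "commit":
            committed.add(atom_id)
        elif kind == "do-not-replay":
            finished.add(atom_id)
    return committed - finished
```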
Deallocation processing must deal separately with deallocation of the deallocate set (from stage four—deleted blocks and the old relocate set) and the wandered set (from stage six). The procedure is the same in each case, but since each atom performs two deallocation steps the recovery algorithm must treat them separately as well. The deallocation of an atom may be found in three possible states depending on whether none, some, or all of the deallocate blocks were repossessed and later committed. For each bitmap that would be modified by an atom's deallocation, the recovery algorithm must determine whether a repossessing atom later commits the same bitmap block. For each atom with a commit record in the active log fragment, the recovery algorithm determines: (1) which bitmap blocks are committed as part of its overwrite set and (2) which bitmap blocks are affected by its deallocation. For every committed atom that repossesses for another atom, the committed bitmap blocks are subtracted from the deallocate-affected bitmap blocks of the repossessed-for atom. After performing this computation, we know the set of deallocate-affected blocks that were not committed by any repossessing atoms; these deallocations are then reapplied to the on-disk bitmap blocks. This completes the crash recovery algorithm.

== Relocation and Fragmentation ==

As previously discussed, the choice of which blocks to relocate (instead of overwrite) is a policy decision and, as such, not directly related to transaction management. However, this issue affects fragmentation in the file system and therefore influences performance of the transaction system in general. The basic tradeoff here is between optimizing read and write performance. The relocate policy optimizes write performance because it allows the system to write blocks without costly seeks whenever possible. This can adversely affect read performance, since blocks that were once adjacent may become scattered throughout the disk.
The overwrite policy optimizes read performance because it attempts to maintain on-disk locality by preserving the location of existing blocks. This comes at the cost of write performance, since each block must be written twice per transaction. Since system and application workloads vary, we will support several relocation policies:

* Always Relocate: This policy includes a block in the relocate set whenever it will reduce the number of blocks written to the disk.
* Never Relocate: This policy disables relocation. Blocks are always written to their original location using overwrite logging.
* Left Neighbor: This policy puts the block in the nearest available location to its left neighbor in the tree ordering. If that location is occupied by some member of the atom being written it makes the block a member of the overwrite set, otherwise the policy makes the block a member of the relocate set. This policy is simple to code, effective in the absence of a repacker, and will integrate well with an online repacker once that is coded. It will be the default policy initially.

Much more complex optimizations are possible, but deferred for a later release. Unlike WAFL, we expect the use of a repacker to play an important role.

== Meta-Data Journaling ==

Meta-data journaling is a restricted operating mode in which only file system meta-data are subject to atomicity constraints. In meta-data journaling mode, file data blocks (unformatted nodes) are not captured and therefore need not be flushed as the result of transaction commit. In this case, file data blocks are not considered members of either the relocate or the overwrite set because they do not participate in the atomic update protocol—memory pressure and age are the only factors that cause unformatted nodes to be written to disk in the meta-data journaling mode. This mode is expected to be mostly of academic interest.

== Bitmap Blocks Special Handling ==

Reiser4 allocates temporary blocks for wandered logging.
That means we have a difference between the commit bitmap block content, which is what we should restore after a system crash, and the working bitmap block content, which is used for free block search/allocation. (The changes to the bitmap are logged data that we write to disk at atom commit.) We keep each bitmap block in memory in two versions: one for the WORKING BITMAP, and another one for the COMMIT BITMAP. The working bitmap is used just for searching for free blocks: if their bits are not set in the working bitmap, the corresponding blocks can be allocated. The working bitmap gets updated at every block allocation. The commit bitmap reflects changes which are done in already committed atoms or in the atom which is currently being committed (we assume that atom commits are serialized, and only one atom can be committed at a time). The commit bitmap is updated at every atom commit. No bitmap data conversion (WORKING -> COMMIT) is needed; we only update the COMMIT bitmap at each transaction commit. We should note that block allocation/deallocation does not touch the COMMIT BITMAP until an atom reaches the commit stage. At that stage we apply the atom's changes which were made during the transaction. We take deallocated block numbers from the atom's deleted set and we take freshly allocated block numbers from the atom's captured lists, and we apply those changes to the commit bitmap before we write the modified commit bitmap blocks to disk. After applying the changes, the commit bitmap blocks are added to the transaction as usual (try_capture() is called). Having two bitmaps in memory gives us a great advantage because it allows one particular bitmap block handling optimization. The optimization is that we can allow several independent atoms to modify one bitmap block. Any number of atoms are allowed to allocate new blocks in any bitmap block without capturing it.
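The two in-memory versions of one bitmap block can be sketched as below. The class and method names are illustrative; the real code sizes bitmaps to the block size and packs bits rather than bytes.

```python
class BitmapBlock:
    """One bitmap block kept in two versions: the working bitmap serves
    free-block search and is updated at every allocation; the commit
    bitmap is only updated when an atom commits (illustrative sketch)."""
    def __init__(self, nbits):
        self.working = bytearray(nbits)   # 0 = free, 1 = allocated
        self.commit = bytearray(nbits)

    def allocate(self):
        # Any atom may allocate here without capturing the bitmap block.
        for i, bit in enumerate(self.working):
            if bit == 0:
                self.working[i] = 1
                return i
        return None

    def apply_commit(self, allocated, deallocated):
        """At commit, apply the atom's allocations (from its captured
        lists) and its deferred deallocations (from its deleted set) to
        the commit bitmap before it is written to disk."""
        for i in allocated:
            self.commit[i] = 1
        for i in deallocated:
            self.commit[i] = 0
            self.working[i] = 0   # deallocation was deferred until commit
```

Because allocations only touch the working bitmap until commit, several independent atoms can allocate from the same bitmap block without fusing.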
Block deallocation should be deferred until the atom finishes its commit stage (one reason for this is the elimination of unnecessary dependence between atoms). This means that unnecessary atom fusion can be avoided. We can keep atoms independent as long as they touch different data blocks and different internal nodes (in principle we could keep atoms independent even if they touch the same non-data internal node blocks, but this would require logical versioning rather than simple node versioning, and in our implementation we do simple node versioning except for bitmap blocks). Another hot spot in the reiser4 filesystem is the super block, which contains the free blocks counter. A similar technique should be applied to allow several atoms to modify the free blocks counter. In general, points of high contention between multiple atoms can benefit from logical versioning rather than node versioning, and as the system matures more flavors of logical versioning will be added.

== References ==

* [http://www.usenix.org/publications/library/proceedings/sf94/hitz.html Hitz94], Dave Hitz, James Lau, and Michael Malcolm, "[http://www.netapp.com/library/tr/3002.pdf File System Design for an NFS File Server Appliance]", Proceedings of the Winter 1994 USENIX Conference, San Francisco, CA, January 1994, 235-246.

[[category:Reiser4]]

----
Reiser4 Transaction Design Document

Last Update: Apr. 5, 2002

Joshua MacDonald, Hans Reiser and Alex Zarochentcev

== Summary ==

Reiser4 will feature advanced new transaction capabilities. The transaction model we describe for version 4 allows the file system programmer to specify a set of operations and guarantees that all or none of those operations will survive a system failure (i.e., crash). The name for this specialized notion of a transaction is a transcrash.
In traditional Unix semantics, a sequence of write() system calls are not expected to be atomic, meaning that an in-progress write could be interrupted by a crash and leave part new and part old data behind. Writes are not even guaranteed to be ordered in the traditional semantics, meaning that recently-written data could survive a crash even though less-recently-written data does not survive. Some file systems offer a kind of write-atomicity, known as data-journaling, in which an individual data block is written to a log file before overwriting its real location, but this only ensures that individual blocks are written atomically, not the entire buffer of a write() system call. This technique doubles the amount of data written to the disk, which becomes significant when the disk transfer rate is a limiting performance factor. Something more clever is possible. Instead of writing every modified block twice, we can write the block only once to a new location and then update the block's address in its parent node in the file system. However, the parent modification must also be included in the transaction. The [http://en.wikipedia.org/wiki/Write_Anywhere_File_Layout WAFL] (Write Anywhere File Layout) technique [Hitz94] handles this by propagating file modifications all the way to the root node of the file system, which is then updated atomically. In general, it is possible to use either approach to update a block - log a copy of the block and overwrite its original location or relocate the block and modify its parent block within the same transaction. In Reiser4 this decision is made independently for each block by a block-allocation plugin based on the set of modified blocks, the current file system layout, and the associated costs of each update method. == Definition of Atomicity == Most file systems perform write caching, meaning that modified data are not immediately written to the disk. 
Writes are deferred for a period of time, which allows the system greater control over disk scheduling. A system crash can happen at any time causing some recent modifications to be lost, and this can be a serious problem if an application has made several interdependent modifications, some of which are lost when others are not. Such an application is said to require atomicity—a guarantee that all or none of a sequence of interdependent operations will survive a crash. Without atomicity, a system crash can leave the file system in an inconsistent state. Dependent modifications may also arise when an application reads modified data and then produces further output. Consider the following sequence of events: * Process 1 writes file A * Process 2 reads file A * Process 2 writes file B At this point, file B may be dependent on file A. If the write-caching strategy can reverse the commit order of these operations, meaning to commit file B before file A, these processes are exposed to possible inconsistency in the event of a crash. By commiting the sequence of write operations atomically there is no exposure to inconsistency. It is still possible for the write-caching strategy to write these files in any order, as long as the commit mechanism realizes that both writes must complete for the transaction to commit successfully. This means that standard disk-scheduling techniques such as the elevator algorithm are not ruled out by atomicity requirements. A transcrash is a set of operations, of which all or none must survive a crash. An atom maintains the collection of data blocks that a transcrash has attempted to modify along with all data blocks of other atoms that fused with it. Two atoms fuse when one transcrash attempts to read or write data blocks that are part of another atom./ There are two types of transcrash: read-write fusing, and write-only fusing. 
A write-only-fusing transcrash by default only causes atoms to fuse together as it writes to data blocks outside its own atom. A read-write-fusing transcrash causes atoms to fuse together whenever it reads or writes data blocks outside its own atom. One may always specify within a write-only-fusing transcrash that a specific operation is read-fusing. Put another way, read-write-fusing transcrashes assume there is read dependency whereas write-only-fusing transcrashes support explicit read dependency. A block-capture request is the underlying mechanism used to dynamically associate transcrashes, data blocks, and atoms together. Initially, transcrashes and data blocks have no associated atom. When a block-capture request specifies a transcrash and block belonging to different atoms, those atoms are fused together (subject to a few restrictions discussed later). Persons familiar with the database literature will note that these definitions do not imply isolation or serializability between processes. Isolation requires the ability to undo a sequence of operations when lock conflicts cause a deadlock to occur. Rollback is the ability to abort and undo the effects of the operations in an uncommitted transcrash. Transcrashes do not provide isolation, which is needed to support separate rollback of separate transcrashes. We only support unified rollback of all transcrashes in progress at the time of crash recovery. However, our architecture is designed to support separate, concurrent atoms so that it can be expanded to implement fully isolated transactions in the future. Currently, the only reason a transcrash will be aborted is due to a system crash. The system cannot individually abort a transcrash, and this means that transcrashes are only made available to trusted plugins inside the kernel. Once we have implemented isolation it will be possible for untrusted applications to access the transcrash interface for creating (only) isolated transcrashes. 
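The capture-and-fuse rules above can be sketched with a toy model. Atoms are fused via a minimal union-find; every name and structure here is invented for illustration, not taken from the actual Reiser4 transaction manager, and the restrictions on fusing discussed later are omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model of block capture and atom fusion (illustrative names only).
 * Fused atoms are represented with a minimal union-find structure. */
struct atom { struct atom *fused_into; };

static struct atom *atom_find(struct atom *a)
{
    while (a->fused_into != NULL)
        a = a->fused_into;
    return a;
}

struct transcrash {
    struct atom *atom;        /* assume already assigned, for brevity */
    bool read_write_fusing;   /* else write-only fusing */
};

struct block { struct atom *atom; /* NULL when not yet captured */ };

static void capture_block(struct transcrash *t, struct block *b, bool is_write)
{
    struct atom *ta = atom_find(t->atom);

    if (b->atom == NULL) {    /* unowned block: simply capture it */
        b->atom = ta;
        return;
    }
    struct atom *ba = atom_find(b->atom);
    if (ba == ta)
        return;               /* already the same atom */

    /* Writes always fuse; reads fuse only for read-write-fusing
     * transcrashes (or when explicitly marked read-fusing). */
    if (is_write || t->read_write_fusing)
        ba->fused_into = ta;
}
```

Note how a plain read from a write-only-fusing transcrash leaves the two atoms independent, exactly the behavior the text describes.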
== Stage One: Capturing and Fusing == The initial stage starts when an atom begins. The beginning of an atom is controlled by the transaction manager itself, but the event is always triggered by a block-capture request. A transaction preserves the previous contents of all modified blocks in their original location on disk until the transaction commits, which means it has reached a state where it will be completed even if there is a crash. The dirty blocks of an atom (which were captured and subsequently modified) are divided into two sets, relocate and overwrite, each of which is preserved in a different manner. The relocatable set is the set of blocks that have a dirty parent in the atom. The relocate set is those members of the relocatable set that we choose to relocate rather than overwrite. Whether we relocate or overwrite is a decision made for performance reasons. By writing the relocate set to different locations we avoid writing a second copy of each block to the log. When the current location of a block is its optimal location, relocation is a possible cause of file system fragmentation. We discuss relocation policies in a later section. The overwrite set contains all dirty blocks not in the relocate set (i.e., those which do not have a dirty parent and those for which overwrite is the better policy). A wandered copy of each overwrite block is written as part of the log before the atom commits and a second write replaces the original contents after the atom commits. Note that the superblock is the parent of the root node and the free space bitmap blocks have no parent. By these definitions, the superblock and modified bitmap blocks are always part of the overwrite set. (An alternative definition is the minimum overwrite set, which uses the same definition as above with the following modification. If at least three dirty blocks have a common parent that is clean then its parent is added to the minimum overwrite set. 
The parent's dirty children are removed from the overwrite set and placed in the relocate set. This optimization will be saved for a later version.) The system responds to memory pressure by selecting dirty blocks to be flushed. When dirty blocks are written during stage one it is called early flushing because the atom remains uncommitted. When early flushing is needed we only select blocks from the relocate set because their buffers can be released, whereas the overwrite set remains pinned in memory until after the atom commits. We must enforce that atoms make progress so they can eventually commit. An atom can only commit when it has no open transcrashes, but allowing atoms to fuse allows open transcrashes to join an existing atom which may be trying to commit. For this reason, an age is associated with each atom and when an atom reaches expiration it begins actively flushing to disk. An expired atom takes steps to avoid new transcrashes prolonging its lifetime: (1) an expired atom will not accept any new transcrashes and (2) non-expired atoms will block rather than fuse with an expired atom. An expired atom is still allowed to fuse with any other stage-one atom to avoid stalling expired atoms. Once an expired atom has no open transcrashes it is ready to close, meaning that it is ready to begin commit processing. All repacking, balancing, and allocation tasks have been performed by this point. Applications that are required to wait for synchronous commit (e.g., using fsync()) may have to wait for a lot of unrelated blocks to flush since a large atom may have captured the bitmaps. We will only provide an interface for lazy transcrash commit that closes a transcrash and waits for it to commit. An application that would like to synchronize its data as early as possible would perhaps benefit from logical logging, which is not currently supported by our architecture, or NVRAM.
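The expiration rules above might look like the following sketch. The age limit, time source, and function names are all invented for illustration; Reiser4's real tunables and structures may differ.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical aging check; ATOM_MAX_AGE and the time source are
 * placeholders, not actual Reiser4 tunables. */
#define ATOM_MAX_AGE 30  /* seconds, illustrative */

struct atom {
    long start_time;
    int  open_transcrashes;
    bool expired;
};

static void atom_check_age(struct atom *a, long now)
{
    if (now - a->start_time >= ATOM_MAX_AGE)
        a->expired = true;        /* start actively flushing */
}

/* An expired atom accepts no new transcrashes... */
static bool atom_can_join(const struct atom *a)
{
    return !a->expired;
}

/* ...and is ready to close (begin commit) once none remain open. */
static bool atom_ready_to_close(const struct atom *a)
{
    return a->expired && a->open_transcrashes == 0;
}
```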
To finish stage one we have: * The in-memory free space bitmaps have been updated such that the new relocate block locations are now allocated. * The old locations of the relocate set and any blocks deleted by this atom are not immediately deallocated as they cannot be reused until this atom commits. We must maintain two bitmaps: commit_bitmap is logged to disk as part of the overwrite set prior to commit, and working_bitmap is the working in-memory copy. In the working_bitmap the old locations of the relocate set and deleted blocks are not deallocated until after commit. * Each atom collects a data structure representing its deallocate set, which is a list of the blocks it must deallocate once it commits. The deallocate set can be represented in a number of ways: as a list of block locations, a set of bitmaps, or using extent-compression. We expect to use a bitmap representation in our first implementation. Regardless of the representation, the deallocate set data structure is included in the commit record of this atom where it will be used during crash recovery. The deallocate set is also used after the atom commits to update the in-memory bitmaps. * Wandered locations are allocated for the overwrite set and a list of the association between wandered and real overwrite block locations for this atom is included in the commit record. * The final commit record is formatted now, although it is not needed until stage three. == Stage Two: Completing Writes == At this stage we begin to write the remaining dirty blocks of the atom. Any blocks that were captured and never modified can be released immediately, since they do not take part in the commit operation. To "release" a block means to allow another atom to capture it freely. Relocate blocks and overwrite blocks are treated separately at this point. === Relocate Blocks === A relocate block can be released once it has been flushed to disk.
All relocate blocks that were early-flushed in stage one are considered clean at this point, so they are released immediately. The remaining non-flushed relocate blocks are written at this point. Now we consider what happens if another atom requests to capture the block while the write request is being serviced. A read-capture request is granted just as if the block did not belong to any atom at this point—it is considered clean despite belonging to a not-yet-committed atom. The only requirement on this interaction is that no atom can jump ahead in the commit ordering. Atoms must commit in the order that they reach stage two or else read-capture from a non-committed atom must explicitly construct and maintain this dependency. A write-capture request can be granted by copying the block. This introduces the first major optimization called copy-on-capture. The capturing process assumes control of the block, and the committing atom retains an anonymous copy. When the write request completes, the anonymous copy is released (freed). Copy-on-capture is an optimization not performed in ReiserFS version 3 (which creates a copy of each dirty page at commit), but in that version the optimization is less important because the copying does not apply to unformatted nodes. If a relocate block-write finishes before the block is captured it is released without further processing. Despite releasing relocate blocks in stage two, the atom still requires a list of old relocate block locations for deallocation purposes. === Overwrite Blocks === The overwrite blocks (including modified bitmaps and the superblock) are written at this point to their wandered locations as part of the log. Unlike relocate blocks, overwrite blocks are still needed after these writes complete as they must also be written back to their real location. Similar to relocate blocks, a read-capture request is granted as if the block did not belong to any atom. 
A write-capture request is granted by copying the block using the copy-on-capture method described above. === Issues === One issue with the copy-on-capture approach is that it does not address the use of memory-mapped files, which can have their contents modified at any point by a process. One answer to this is to exclude mmap() writers from any atomicity guarantees. A second alternative is to use hardware-level copy-on-write protection. A third alternative is to unmap the mapped blocks and allow ordinary page faults to capture them back again. == Stage Three: Commit == When all of the outstanding stage two disk writes have completed, the atom reaches stage three, at which time it finally commits by writing its commit record to the log. Once this record reaches the disk, crash recovery will replay the transaction. == Stage Four: Post-commit Disk Writes == The fourth stage begins when the commit record has been forced to the log. === Overwrite Block-Writes === Overwrite blocks need to be written to their real locations at this point, but there is also an ordering constraint. If a number of atoms commit in sequence that involve the same overwrite blocks, they must be sure to overwrite them in the proper order. This requires synchronization for atoms that have reached stage four and are writing overwrite blocks back to their real locations. This also suggests the second major optimization, called steal-on-capture. The steal-on-capture optimization is an extension of the copy-on-capture optimization that applies only to the overwrite set. The idea is that only the last transaction to modify an overwrite block actually needs to write that block. This optimization, which is also present in ReiserFS version 3, means that frequently modified overwrite blocks will be written fewer than two times per transaction.
With this optimization a frequently modified overwrite block may avoid being overwritten by a series of atoms; as a result crash recovery must replay more atoms than without the optimization. If an atom has overwrite blocks stolen, the atom must be replayed during crash recovery until every stealing atom commits. When an overwrite block-write finishes the block is released without further processing. === Deallocate-Set Processing === April 2002 note: We are revising our strategy for bitmap handling. The deallocate set can be deallocated in the in-memory bitmap blocks at this point. The bitmap modifications are not considered part of this atom (since it has committed). Instead, the deallocations are performed in the context of a different stage-one atom (or atoms). We call this process repossession, whereby a stage-one atom assumes responsibility for eventually committing bitmap modifications on behalf of another atom. For each bitmap block with pending deallocations in this stage, a separate stage-one atom may be chosen to repossess and deallocate blocks in that bitmap. This avoids the need to fuse atoms as a result of deallocation. A stage-one atom that has already captured a particular bitmap block will repossess for that block, otherwise a new atom can be selected. For crash recovery purposes, each atom must maintain a list of atoms for which it repossesses bitmap blocks. This ''repossesses-for'' list is included in the commit record for each atom. The issue of crash recovery and deallocation will be treated in the next section. == Stage Five: Commit Completion == When all of the outstanding stage four disk writes are complete and all of the atoms that stole from this atom commit, the atom no longer needs to be replayed during crash recovery—the overwrite set is either completely written or will be completely written by replaying later atoms. Before the log space occupied by this atom can be reclaimed, however, another topic must be discussed.
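The steal-on-capture bookkeeping can be illustrated with a toy model: each overwrite block tracks the last atom to modify it, only that atom writes the block back, and a stolen-from atom remains replayable until the stealing atoms commit. All structures and names here are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy steal-on-capture model (illustrative, not Reiser4's structures). */
struct ovw_block { long last_modifier; };   /* atom id of last writer */

static void steal_block(struct ovw_block *b, long atom_id)
{
    b->last_modifier = atom_id;  /* later atom steals the write-back duty */
}

/* Only the last modifier writes the block to its real location. */
static bool must_write_back(const struct ovw_block *b, long atom_id)
{
    return b->last_modifier == atom_id;
}

/* A stolen-from atom must remain replayable until the stealing atom(s)
 * commit, since its overwrite set is only completed by their replay. */
static bool replay_needed(bool has_stolen_blocks, bool stealers_committed)
{
    return has_stolen_blocks && !stealers_committed;
}
```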
=== Wandered Overwrite-Block Allocation === Overwrite blocks were written to wandered locations during stage two. Wandered block locations are considered part of the log in most respects—they are only needed for crash recovery of an atom that completes stage three but does not complete stage five. In the simplest approach, wandered blocks are not allocated or deallocated in the ordinary sense; instead they are appended to a cyclical log area. There are some problems with this approach, especially when considering LVM configurations: (1) the overwrite set can be a bottleneck because it is entirely written to the same region of the logical disk and (2) it places limits on the size of the overwrite set. For these reasons, we allow wandered blocks to be written anywhere in the disk, and as a consequence we allocate wandered blocks in stage one similarly to the relocate set. For maximum performance, the wandered set should be written using a sequential write. To achieve sequential writes in the common case, we allow the system to be configured with an optional number of areas specially reserved for wandered block allocation. In an LVM configuration, for example, reserved wandered block areas can be spread throughout the logical disk space to avoid any single disk being a bottleneck for the wandered set. Wandered block locations still need to be deallocated with this approach, but we must prevent replay of the atom's overwrites before these blocks can be deallocated. At this point (stage five), a log record is written signifying that the atom should not have its overwrites replayed. == Stage Six: Deallocating Wandered Blocks == Once the do-not-replay-overwrites record for this atom has been forced to the log, the wandered block locations are deallocated using repossession, the same process used for the deallocate set. At this point, a number of atoms may have repossessed bitmap blocks on behalf of this atom, for both the deallocate set and the wandered set.
This atom must wait for all of those atoms to commit (i.e., reach stage four) before the log can wrap around and destroy this atom's commit record. Until that point, the atom is still needed during crash recovery because its deallocations may be incomplete. This completes the life of an atom. Now we must discuss several special topics. == Reserving Space == The file system must be able to ensure that there are adequate disk space reserves to complete all active transactions. Since the previous contents of modified blocks are preserved until a transaction commits, the transaction must reserve one block of free disk space for every block it modifies. Ordinarily, it would be possible to simply fail an operation that cannot reserve enough free space to complete. Such a failure leaves the transaction in a state where it is unlikely to make further progress. With isolated transactions, it is possible to simply abort the transaction at this point, but another solution is needed to handle this situation without isolated transactions. There are several possible solutions: 1. Explicit space reservation — allow the transaction to pre-reserve the amount of space it intends to use. The application makes calls to an interface that reserves the required space. The call to reserve space may fail, so the application should only request a reservation at points where it is possible to recover a consistent state without exceeding the previous reservation. This is the only general purpose solution until there is support for isolated transactions. The other solutions are best avoided. 2. Allow operations to fail when space reserves are exceeded. This presents possible file system inconsistency because it may not be possible to recover a consistent state. 3. Crash the system. To avoid inconsistency the entire system can be artificially crashed, effectively aborting every non-committed atom in the system. 4. Crash just one atom.
It is possible to abort a non-committed atom without taking down the entire system, but this has extreme implications. Every process that has taken part in the atom is affected by this act, not just the transcrash that has exceeded its reservation. We will implement explicit space reservation, but there is always the possibility that an application exceeds its own reservation, forcing us to use at least one of the other solutions as a backup measure. Space reservation is a service agreement between the transaction manager and the application, and as long as the application stays within its reservation it can expect to complete its transactions without failure or crashing. To achieve some room for error, we will maintain emergency space reserves, disk space reserved for applications that make incorrect explicit reservations. This is an attempt to prevent faulty applications from failing or bringing down the system. The use of emergency space reserves will be reported to the system log so that faulty applications can be corrected. Note that these measures will not in general protect against attack: a malicious user could exploit a faulty application to bring down the system or compromise data integrity. All of these options will be configurable on a per-file system basis: * (1) how much emergency space to reserve (e.g., 5% of disk space) and * (2) whether to fail the operation or crash the system when reserves are exceeded. == Write Atomicity Options == The transcrash interface provides the application with the ability to make an entire sequence of operations atomic, including all write() system calls. Even unmodified applications use a transcrash internally for each system call to protect file system consistency, but this requires special treatment for the write() system call.
Atomically writing a large buffer over pre-existing contents requires a large space reservation, one that is not required by the write semantics (this does not apply to create or append). It is acceptable for the system to break a large write into smaller atomic units to reduce space reservation requirements. We will provide a per-file system option to limit the size of atomic writes when they are performed outside the scope of an existing transaction (i.e., when the system starts a transcrash internally to protect consistency). This allows the system administrator to choose atomic writes up to some size (the space reservation requirement), beyond which writes will be broken into smaller atomic units. == Crash Recovery Algorithm == April 2002 note: We are revising our strategy for crash recovery of bitmaps. Some atoms may not be completely processed at the time of a crash. The crash recovery algorithm is responsible for determining what steps must be taken to make the file system consistent again. This includes making sure that: * (1) all overwrites are complete and * (2) all blocks have been deallocated. We avoid discussing potential optimizations of the algorithm at present, to reduce complexity. Assume that after a crash occurs, the recovery manager has a way to detect the active log fragment, which contains the relevant set of log records that must be reprocessed. Also assume that each step can be performed using separate forward and/or reverse passes through the log. Later on we may choose to optimize these points. Overwrite set processing is relatively simple. Every atom with a commit record found in the active log fragment, but without a corresponding do-not-replay record, has its overwrite set copied from wandered to real block locations. Overwrite recovery processing should be complete before deallocation processing begins.
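The overwrite-replay pass just described could be sketched as below. The log-record types and field names are invented for illustration, I/O is reduced to a flat array copy, and atoms are assumed to appear in commit order.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative overwrite-replay pass: for every committed atom without a
 * do-not-replay record, copy each overwrite block from its wandered
 * location to its real location. "Disk" is a flat array here, and the
 * atoms array is assumed to be in commit order. */
struct wander_pair { size_t wandered; size_t real; };

struct atom_log {
    bool committed;       /* commit record found in active fragment */
    bool do_not_replay;   /* stage-five record found */
    const struct wander_pair *map;
    size_t n;
};

static void replay_overwrites(char *disk, const struct atom_log *atoms,
                              size_t n_atoms)
{
    for (size_t i = 0; i < n_atoms; i++) {
        if (!atoms[i].committed || atoms[i].do_not_replay)
            continue;     /* nothing to replay for this atom */
        for (size_t j = 0; j < atoms[i].n; j++)
            disk[atoms[i].map[j].real] = disk[atoms[i].map[j].wandered];
    }
}
```

Replaying atoms in commit order also reproduces the steal-on-capture behavior: a later atom's copy of a shared overwrite block lands last.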
Deallocation processing must deal separately with deallocation of the deallocate set (from stage four—deleted blocks and the old relocate set) and the wandered set (from stage six). The procedure is the same in each case, but since each atom performs two deallocation steps the recovery algorithm must treat them separately as well. The deallocation of an atom may be found in three possible states depending on whether none, some, or all of the deallocate blocks were repossessed and later committed. For each bitmap that would be modified by an atom's deallocation, the recovery algorithm must determine whether a repossessing atom later commits the same bitmap block. For each atom with a commit record in the active log fragment, the recovery algorithm determines: (1) which bitmap blocks are committed as part of its overwrite set and (2) which bitmap blocks are affected by its deallocation. For every committed atom that repossesses for another atom, the committed bitmap blocks are subtracted from the deallocate-affected bitmap blocks of the repossessed-for atom. After performing this computation, we know the set of deallocate-affected blocks that were not committed by any repossessing atoms; these deallocations are then reapplied to the on-disk bitmap blocks. This completes the crash recovery algorithm. == Relocation and Fragmentation == As previously discussed, the choice of which blocks to relocate (instead of overwrite) is a policy decision and, as such, not directly related to transaction management. However, this issue affects fragmentation in the file system and therefore influences performance of the transaction system in general. The basic tradeoff here is between optimizing read and write performance. The relocate policy optimizes write performance because it allows the system to write blocks without costly seeks whenever possible. This can adversely affect read performance, since blocks that were once adjacent may become scattered throughout the disk. 
The overwrite policy optimizes read performance because it attempts to maintain on-disk locality by preserving the location of existing blocks. This comes at the cost of write performance, since each block must be written twice per transaction. Since system and application workloads vary, we will support several relocation policies: * Always Relocate: This policy includes a block in the relocate set whenever it will reduce the number of blocks written to the disk. * Never Relocate: This policy disables relocation. Blocks are always written to their original location using overwrite logging. * Left Neighbor: This policy puts the block in the nearest available location to its left neighbor in the tree ordering. If that location is occupied by some member of the atom being written it makes the block a member of the overwrite set, otherwise the policy makes the block a member of the relocate set. This policy is simple to code, effective in the absence of a repacker, and will integrate well with an online repacker once that is coded. It will be the default policy initially. Much more complex optimizations are possible, but deferred for a later release. Unlike WAFL, we expect the use of a repacker to play an important role. == Meta-Data Journaling == Meta-data journaling is a restricted operating mode in which only file system meta-data are subject to atomicity constraints. In meta-data journaling mode, file data blocks (unformatted nodes) are not captured and therefore need not be flushed as the result of transaction commit. In this case, file data blocks are not considered members of either the relocate or the overwrite set because they do not participate in the atomic update protocol—memory pressure and age are the only factors that cause unformatted nodes to be written to disk in the meta-data journaling mode. This mode is expected to be mostly of academic interest. == Bitmap Blocks Special Handling == Reiser4 allocates temporary blocks for wandered logging.
That means we have a difference between the commit bitmap block content, which is what we should restore after a system crash, and the working bitmap block content which is used for free block search/allocation. (The changes to the bitmap are logged data that we write to disk at atom commit.) We keep each bitmap block in memory in two versions: one for the WORKING BITMAP, and another one for the COMMIT BITMAP. The working bitmap is used just for searching for free blocks: if a block's bit is not set in the working bitmap, the block can be allocated. The working bitmap gets updated at every block allocation. The commit bitmap reflects changes which are done in already committed atoms or in the atom which is currently being committed (we assume that atom commits are serialized, and only one atom can be committed at a time). The commit bitmap is updated at every atom commit. No bitmap data conversion (WORKING -> COMMIT) is needed; we only update the COMMIT bitmap at each transaction commit. We should note that block allocation/deallocation does not touch the COMMIT BITMAP until an atom reaches the commit stage. At that stage we apply the atom's changes which were made during the transaction. We take deallocated block numbers from the atom's deleted set and we take freshly allocated block numbers from the atom's captured lists, and we apply those changes to the commit bitmap before we write modified commit bitmap blocks to disk. After applying the changes, the commit bitmap blocks are added to the transaction as usual (try_capture() is called). Having two bitmaps in memory gives us a great advantage because it enables a particular bitmap-block handling optimization. The optimization is that we can allow several independent atoms to modify one bitmap block. Any number of atoms are allowed to allocate new blocks in any bitmap block without capturing it.
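A minimal sketch of the two-bitmap scheme, with invented names: the working bitmap changes at every allocation, while the commit bitmap is only brought up to date when an atom commits, by applying the atom's allocated and deleted block sets.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Two in-memory copies of one bitmap block (1 byte per block for clarity;
 * a real bitmap would pack one bit per block). Names are illustrative. */
#define NBLOCKS 16

struct bmap_block {
    unsigned char working[NBLOCKS];  /* used for free-block search */
    unsigned char commit[NBLOCKS];   /* what crash recovery should restore */
};

/* Allocation touches only the working bitmap; no atom capture is needed,
 * which is what lets independent atoms share one bitmap block. */
static void alloc_block(struct bmap_block *bm, size_t blk)
{
    bm->working[blk] = 1;
}

/* At commit, apply the atom's allocated and deleted sets to the commit
 * bitmap just before its blocks are written to disk. */
static void apply_at_commit(struct bmap_block *bm,
                            const size_t *allocated, size_t n_alloc,
                            const size_t *deleted, size_t n_del)
{
    for (size_t i = 0; i < n_alloc; i++)
        bm->commit[allocated[i]] = 1;
    for (size_t i = 0; i < n_del; i++)
        bm->commit[deleted[i]] = 0;
}
```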
Block deallocation should be deferred until the atom finishes the commit stage (one reason for this is the elimination of unnecessary dependence between atoms). This means that unnecessary atom fusion could be avoided. We can keep atoms independent as long as they touch different data blocks and different internal nodes (in principle we could keep atoms independent even if they touch the same non-data internal node blocks, but this would require logical versioning rather than simple node versioning, and in our implementation we do simple node versioning except for bitmap blocks). Another hot spot in the reiser4 filesystem is the super block, which contains the free blocks counter. A similar technique should be applied to allow several atoms to modify the free blocks counter. In general, points of high contention between multiple atoms can benefit from logical versioning rather than node versioning, and as the system matures more flavors of logical versioning will be added. == References == * [http://www.usenix.org/publications/library/proceedings/usits97/full_papers/christenson/christenson_html/christenson.html#Hitz94 Hitz94], Dave Hitz, James Lau, and Michael Malcolm, "[http://www.netapp.com/library/tr/3002.pdf File System Design for an NFS File Server Appliance]", Proceedings of the Winter 1994 USENIX Conference, San Francisco, CA, January 1994, 235-246. [[category:Reiser4]] Reiser4 Transaction Design Document. Last Update: Apr. 5, 2002. Joshua MacDonald, Hans Reiser and Alex Zarochentcev. Summary: Reiser4 will feature advanced new transaction capabilities. The transaction model we describe for version 4 allows the file system programmer to specify a set of operations and guarantees that all or none of those operations will survive a system failure (i.e., crash). The name for this specialized notion of a transaction is a transcrash.
In traditional Unix semantics, a sequence of write() system calls are not expected to be atomic, meaning that an in-progress write could be interrupted by a crash and leave part new and part old data behind. Writes are not even guaranteed to be ordered in the traditional semantics, meaning that recently-written data could survive a crash even though less-recently-written data does not survive. Some file systems offer a kind of write-atomicity, known as data-journaling, in which an individual data block is written to a log file before overwriting its real location, but this only ensures that individual blocks are written atomically, not the entire buffer of a write() system call. This technique doubles the amount of data written to the disk, which becomes significant when the disk transfer rate is a limiting performance factor. Something more clever is possible. Instead of writing every modified block twice, we can write the block only once to a new location and then update the block's address in its parent node in the file system. However, the parent modification must also be included in the transaction. The WAFL (Write Anywhere File Layout) technique [Hitz94] handles this by propagating file modifications all the way to the root node of the file system, which is then updated atomically. In general, it is possible to use either approach to update a block - log a copy of the block and overwrite its original location or relocate the block and modify its parent block within the same transaction. In Reiser4 this decision is made independently for each block by a block-allocation plugin based on the set of modified blocks, the current file system layout, and the associated costs of each update method. Definition of Atomicity Most file systems perform write caching, meaning that modified data are not immediately written to the disk. Writes are deferred for a period of time, which allows the system greater control over disk scheduling. 
A system crash can happen at any time causing some recent modifications to be lost, and this can be a serious problem if an application has made several interdependent modifications, some of which are lost when others are not. Such an application is said to require atomicity—a guarantee that all or none of a sequence of interdependent operations will survive a crash. Without atomicity, a system crash can leave the file system in an inconsistent state. Dependent modifications may also arise when an application reads modified data and then produces further output. Consider the following sequence of events: * Process 1 writes file A * Process 2 reads file A * Process 2 writes file B At this point, file B may be dependent on file A. If the write-caching strategy can reverse the commit order of these operations, meaning to commit file B before file A, these processes are exposed to possible inconsistency in the event of a crash. By commiting the sequence of write operations atomically there is no exposure to inconsistency. It is still possible for the write-caching strategy to write these files in any order, as long as the commit mechanism realizes that both writes must complete for the transaction to commit successfully. This means that standard disk-scheduling techniques such as the elevator algorithm are not ruled out by atomicity requirements. A transcrash is a set of operations, of which all or none must survive a crash. An atom maintains the collection of data blocks that a transcrash has attempted to modify along with all data blocks of other atoms that fused with it. Two atoms fuse when one transcrash attempts to read or write data blocks that are part of another atom./ There are two types of transcrash: read-write fusing, and write-only fusing. A write-only-fusing transcrash by default only causes atoms to fuse together as it writes to data blocks outside its own atom. 
A read-write-fusing transcrash causes atoms to fuse together whenever it reads or writes data blocks outside its own atom. One may always specify within a write-only-fusing transcrash that a specific operation is read-fusing. Put another way, read-write-fusing transcrashes assume there is a read dependency, whereas write-only-fusing transcrashes support explicit read dependency. A block-capture request is the underlying mechanism used to dynamically associate transcrashes, data blocks, and atoms together. Initially, transcrashes and data blocks have no associated atom. When a block-capture request specifies a transcrash and block belonging to different atoms, those atoms are fused together (subject to a few restrictions discussed later). Persons familiar with the database literature will note that these definitions do not imply isolation or serializability between processes. Isolation requires the ability to undo a sequence of operations when lock conflicts cause a deadlock to occur. Rollback is the ability to abort and undo the effects of the operations in an uncommitted transcrash. Transcrashes do not provide isolation, which is needed to support separate rollback of separate transcrashes. We only support unified rollback of all transcrashes in progress at the time of crash recovery. However, our architecture is designed to support separate, concurrent atoms so that it can be expanded to implement fully isolated transactions in the future. Currently, the only reason a transcrash will be aborted is a system crash. The system cannot individually abort a transcrash, and this means that transcrashes are only made available to trusted plugins inside the kernel. Once we have implemented isolation it will be possible for untrusted applications to access the transcrash interface for creating (only) isolated transcrashes.

= Stage One: Capturing and Fusing =

The initial stage starts when an atom begins.
The beginning of an atom is controlled by the transaction manager itself, but the event is always triggered by a block-capture request. A transaction preserves the previous contents of all modified blocks in their original location on disk until the transaction commits, which means it has reached a state where it will be completed even if there is a crash. The dirty blocks of an atom (which were captured and subsequently modified) are divided into two sets, relocate and overwrite, each of which is preserved in a different manner. The relocatable set is the set of blocks that have a dirty parent in the atom. The relocate set is those members of the relocatable set that we choose to relocate rather than overwrite. Whether we relocate or overwrite is a decision made for performance reasons. By writing the relocate set to different locations we avoid writing a second copy of each block to the log. When the current location of a block is its optimal location, relocation is a possible cause of file system fragmentation. We discuss relocation policies in a later section. The overwrite set contains all dirty blocks not in the relocate set (i.e., those which do not have a dirty parent and those for which overwrite is the better policy). A wandered copy of each overwrite block is written as part of the log before the atom commits and a second write replaces the original contents after the atom commits. Note that the superblock is the parent of the root node and the free space bitmap blocks have no parent. By these definitions, the superblock and modified bitmap blocks are always part of the overwrite set. (An alternative definition is the minimum overwrite set, which uses the same definition as above with the following modification. If at least three dirty blocks have a common parent that is clean then its parent is added to the minimum overwrite set. The parent's dirty children are removed from the overwrite set and placed in the relocate set. 
This optimization will be saved for a later version.) The system responds to memory pressure by selecting dirty blocks to be flushed. When dirty blocks are written during stage one it is called early flushing because the atom remains uncommitted. When early flushing is needed we only select blocks from the relocate set because their buffers can be released, whereas the overwrite set remains pinned in memory until after the atom commits. We must enforce that atoms make progress so they can eventually commit. An atom can only commit when it has no open transcrashes, but allowing atoms to fuse allows open transcrashes to join an existing atom which may be trying to commit. For this reason, an age is associated with each atom and when an atom reaches expiration it begins actively flushing to disk. An expired atom takes steps to avoid new transcrashes prolonging its lifetime: (1) an expired atom will not accept any new transcrashes and (2) non-expired atoms will block rather than fuse with an expired atom. An expired atom is still allowed to fuse with any other stage-one atom to avoid stalling expired atoms. Once an expired atom has no open transcrashes it is ready to close, meaning that it is ready to begin commit processing. All repacking, balancing, and allocation tasks have been performed by this point. Applications that are required to wait for synchronous commit (e.g., using fsync()) may have to wait for a lot of unrelated blocks to flush since a large atom may have captured the bitmaps. We will only provide an interface for lazy transcrash commit that closes a transcrash and waits for it to commit. An application that would like to synchronize its data as early as possible would perhaps benefit from logical logging, which is not currently supported by our architecture, or NVRAM. To finish stage one we have:

* The in-memory free space bitmaps have been updated such that the new relocate block locations are now allocated.
* The old locations of the relocate set and any blocks deleted by this atom are not immediately deallocated as they cannot be reused until this atom commits. We must maintain two bitmaps: commit_bitmap is logged to disk as part of the overwrite set prior to commit, and working_bitmap is the working in-memory copy. In the working_bitmap the old locations of the relocate set and deleted blocks are not deallocated until after commit.
* Each atom collects a data structure representing its deallocate set, which is a list of the blocks it must deallocate once it commits. The deallocate set can be represented in a number of ways: as a list of block locations, a set of bitmaps, or using extent-compression. We expect to use a bitmap representation in our first implementation. Regardless of the representation, the deallocate set data structure is included in the commit record of this atom where it will be used during crash recovery. The deallocate set is also used after the atom commits to update the in-memory bitmaps.
* Wandered locations are allocated for the overwrite set and a list of the association between wandered and real overwrite block locations for this atom is included in the commit record.
* The final commit record is formatted now, although it is not needed until stage three.

= Stage Two: Completing Writes =

At this stage we begin to write the remaining dirty blocks of the atom. Any blocks that were captured and never modified can be released immediately, since they do not take part in the commit operation. To "release" a block means to allow another atom to capture it freely. Relocate blocks and overwrite blocks are treated separately at this point.

== Relocate Blocks ==

A relocate block can be released once it has been flushed to disk. All relocate blocks that were early-flushed in stage one are considered clean at this point, so they are released immediately. The remaining non-flushed relocate blocks are written at this point.
Now we consider what happens if another atom requests to capture the block while the write request is being serviced. A read-capture request is granted just as if the block did not belong to any atom at this point—it is considered clean despite belonging to a not-yet-committed atom. The only requirement on this interaction is that no atom can jump ahead in the commit ordering. Atoms must commit in the order that they reach stage two, or else read-capture from a non-committed atom must explicitly construct and maintain this dependency. A write-capture request can be granted by copying the block. This introduces the first major optimization, called copy-on-capture. The capturing process assumes control of the block, and the committing atom retains an anonymous copy. When the write request completes, the anonymous copy is released (freed). Copy-on-capture is an optimization not performed in ReiserFS version 3 (which creates a copy of each dirty page at commit), but in that version the optimization is less important because the copying does not apply to unformatted nodes. If a relocate block-write finishes before the block is captured it is released without further processing. Despite releasing relocate blocks in stage two, the atom still requires a list of old relocate block locations for deallocation purposes.

== Overwrite Blocks ==

The overwrite blocks (including modified bitmaps and the superblock) are written at this point to their wandered locations as part of the log. Unlike relocate blocks, overwrite blocks are still needed after these writes complete as they must also be written back to their real location. Similar to relocate blocks, a read-capture request is granted as if the block did not belong to any atom. A write-capture request is granted by copying the block using the copy-on-capture method described above.
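Copy-on-capture can be sketched as follows. The function and dictionary names are invented for illustration; the point is only the ownership transfer: the capturing atom takes the live block, while the committing atom keeps a frozen anonymous copy for its in-flight disk write:

```python
def write_capture_committing(block_data, committing_atom, capturing_atom, block):
    """Grant a write-capture against a block owned by a committing atom by
    copying it (toy model of copy-on-capture)."""
    # Freeze the current contents for the committing atom's disk write.
    anonymous_copy = bytes(block_data[block])
    committing_atom.setdefault("anonymous", {})[block] = anonymous_copy
    # The capturing atom assumes control of the live, mutable block.
    capturing_atom.setdefault("blocks", set()).add(block)
    return anonymous_copy

def write_completed(committing_atom, block):
    # Once the write request completes, the anonymous copy is released (freed).
    committing_atom["anonymous"].pop(block)
```

The capturing transcrash can then modify the live block freely without delaying, or being delayed by, the committing atom's write.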
== Issues ==

One issue with the copy-on-capture approach is that it does not address the use of memory-mapped files, which can have their contents modified at any point by a process. One answer to this is to exclude mmap() writers from any atomicity guarantees. A second alternative is to use hardware-level copy-on-write protection. A third alternative is to unmap the mapped blocks and allow ordinary page faults to capture them back again.

= Stage Three: Commit =

When all of the outstanding stage two disk writes have completed, the atom reaches stage three, at which time it finally commits by writing its commit record to the log. Once this record reaches the disk, crash recovery will replay the transaction.

= Stage Four: Post-commit Disk Writes =

The fourth stage begins when the commit record has been forced to the log.

== Overwrite Block-Writes ==

Overwrite blocks need to be written to their real locations at this point, but there is also an ordering constraint. If a number of atoms involving the same overwrite blocks commit in sequence, they must overwrite those blocks in the proper order. This requires synchronization for atoms that have reached stage four and are writing overwrite blocks back to their real locations. This also suggests the second major optimization, labeled steal-on-capture. The steal-on-capture optimization is an extension of the copy-on-capture optimization that applies only to the overwrite set. The idea is that only the last transaction to modify an overwrite block actually needs to write that block. This optimization, which is also present in ReiserFS version 3, means that frequently modified overwrite blocks will be written fewer than two times per transaction. With this optimization a frequently modified overwrite block may avoid being overwritten by a series of atoms; as a result crash recovery must replay more atoms than without the optimization.
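The steal-on-capture rule amounts to "last committer wins" over the commit order. The sketch below is a toy model with invented names, not kernel code; it computes, for each atom in commit order, which overwrite blocks that atom actually writes back to their real locations:

```python
def blocks_to_overwrite(atoms_in_commit_order):
    """Given [(atom, overwrite_set), ...] in commit order, return the blocks
    each atom writes back, assuming only the last modifier of a shared
    overwrite block performs the write (steal-on-capture)."""
    last_writer = {}
    for atom, overwrite_set in atoms_in_commit_order:
        for block in overwrite_set:
            last_writer[block] = atom     # later atoms steal the write
    # Invert the map: per atom, the blocks it is still responsible for.
    writes = {atom: set() for atom, _ in atoms_in_commit_order}
    for block, atom in last_writer.items():
        writes[atom].add(block)
    return writes
```

Note the recovery consequence stated in the text: an atom whose writes were stolen must remain replayable until every stealing atom has committed.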
If an atom has overwrite blocks stolen, the atom must be replayed during crash recovery until every stealing atom commits. When an overwrite block-write finishes the block is released without further processing.

== Deallocate-Set Processing ==

April 2002 note: We are revising our strategy for bitmap handling.

The deallocate set can be deallocated in the in-memory bitmap blocks at this point. The bitmap modifications are not considered part of this atom (since it has committed). Instead, the deallocations are performed in the context of a different stage-one atom (or atoms). We call this process repossession, whereby a stage-one atom assumes responsibility for committing bitmap modifications on behalf of another atom in time. For each bitmap block with pending deallocations in this stage, a separate stage-one atom may be chosen to repossess and deallocate blocks in that bitmap. This avoids the need to fuse atoms as a result of deallocation. A stage-one atom that has already captured a particular bitmap block will repossess for that block, otherwise a new atom can be selected. For crash recovery purposes, each atom must maintain a list of atoms for which it repossesses bitmap blocks. This "repossesses for" list is included in the commit record for each atom. The issue of crash recovery and deallocation will be treated in the next section.

= Stage Five: Commit Completion =

When all of the outstanding stage four disk writes are complete and all of the atoms that stole from this atom commit, the atom no longer needs to be replayed during crash recovery—the overwrite set is either completely written or will be completely written by replaying later atoms. Before the log space occupied by this atom can be reclaimed, however, another topic must be discussed.

== Wandered Overwrite-Block Allocation ==

Overwrite blocks were written to wandered locations during stage two.
Wandered block locations are considered part of the log in most respects—they are only needed for crash recovery of an atom that completes stage three but does not complete stage five. In the simplest approach, wandered blocks are not allocated or deallocated in the ordinary sense; instead, they are appended to a cyclical log area. There are some problems with this approach, especially when considering LVM configurations: (1) the overwrite set can be a bottleneck because it is entirely written to the same region of the logical disk and (2) it places limits on the size of the overwrite set. For these reasons, we allow wandered blocks to be written anywhere in the disk, and as a consequence we allocate wandered blocks in stage one similarly to the relocate set. For maximum performance, the wandered set should be written using a sequential write. To achieve sequential writes in the common case, we allow the system to be configured with an optional number of areas specially reserved for wandered block allocation. In an LVM configuration, for example, reserved wandered block areas can be spread throughout the logical disk space to avoid any single disk being a bottleneck for the wandered set. Wandered block locations still need to be deallocated with this approach, but we must prevent replay of the atom's overwrites before these blocks can be deallocated. At this point (stage five), a log record is written signifying that the atom should not have its overwrites replayed.

= Stage Six: Deallocating Wandered Blocks =

Once the do-not-replay-overwrites record for this atom has been forced to the log, the wandered block locations are deallocated using repossession, the same process used for the deallocate set. At this point, a number of atoms may have repossessed bitmap blocks on behalf of this atom, for both the deallocate set and the wandered set.
This atom must wait for all of those atoms to commit (i.e., reach stage four) before the log can wrap around and destroy this atom's commit record. Until that point, the atom is still needed during crash recovery because its deallocations may be incomplete. This completes the life of an atom. Now we must discuss several special topics.

= Reserving Space =

The file system must be able to ensure that there are adequate disk space reserves to complete all active transactions. Since the previous contents of modified blocks are preserved until a transaction commits, the transaction must reserve one block of free disk space for every block it modifies. Ordinarily, it would be possible to simply fail an operation that cannot reserve enough free space to complete. Such a failure leaves the transaction in a state where it likely cannot make further progress. With isolated transactions, it is possible to simply abort the transaction at this point, but another solution is needed to handle this situation without isolated transactions. There are several possible solutions:

1. Explicit space reservation — allow the transaction to pre-reserve the amount of space it intends to use. The application makes calls to an interface that reserves the required space. The call to reserve space may fail, so the application should only request a reservation at points where it is possible to recover a consistent state without exceeding the previous reservation. This is the only general purpose solution until there is support for isolated transactions. The other solutions are best avoided.
2. Allow operations to fail when space reserves are exceeded. This presents possible file system inconsistency because it may not be possible to recover a consistent state.
3. Crash the system. To avoid inconsistency the entire system can be artificially crashed, effectively aborting every non-committed atom in the system.
4. Crash just one atom.
It is possible to abort a non-committed atom without taking down the entire system, but this has extreme implications. Every process that has taken part in the atom is affected by this act, not just the transcrash that has exceeded its reservation. We will implement explicit space reservation, but there is always the possibility that an application exceeds its own reservation, forcing us to use at least one of the other solutions as a backup measure. Space reservation is a service agreement between the transaction manager and the application, and as long as the application stays within its reservation it can expect to complete its transactions without failure or crashing. To achieve some room for error, we will maintain emergency space reserves, disk space reserved for applications that make incorrect explicit reservations. This is an attempt to prevent faulty applications from failing or bringing down the system. The use of emergency space reserves will be reported to the system log so that faulty applications can be corrected. Note that these measures will not in general protect against attack: a malicious user could exploit a faulty application to bring down the system or compromise data integrity. All of these options will be configurable on a per-file system basis: (1) how much emergency space to reserve (e.g., 5% of disk space) and (2) whether to fail the operation or crash the system when reserves are exceeded.

= Write Atomicity Options =

The transcrash interface provides the application with the ability to make an entire sequence of operations atomic, including all write() system calls. Even unmodified applications use a transcrash internally for each system call to protect file system consistency, but this requires special treatment for the write() system call. Atomically writing a large buffer over pre-existing contents requires a large space reservation, a reservation that is not required by the write semantics (this does not apply to create or append).
It is acceptable for the system to break a large write into smaller atomic units to reduce space reservation requirements. We will provide a per-file system option to limit the size of atomic writes when they are performed outside the scope of an existing transaction (i.e., when the system starts a transcrash internally to protect consistency). This allows the system administrator to choose atomic writes up to some size (the space reservation requirement), beyond which writes will be broken into smaller atomic units.

= Crash Recovery Algorithm =

April 2002 note: We are revising our strategy for crash recovery of bitmaps.

Some atoms may not be completely processed at the time of a crash. The crash recovery algorithm is responsible for determining what steps must be taken to make the file system consistent again. This includes making sure that: (1) all overwrites are complete and (2) all blocks have been deallocated. We avoid discussing potential optimizations of the algorithm at present, to reduce complexity. Assume that after a crash occurs, the recovery manager has a way to detect the active log fragment, which contains the relevant set of log records that must be reprocessed. Also assume that each step can be performed using separate forward and/or reverse passes through the log. Later on we may choose to optimize these points. Overwrite set processing is relatively simple. Every atom with a commit record found in the active log fragment, but without a corresponding do-not-replay record, has its overwrite set copied from wandered to real block locations. Overwrite recovery processing should be complete before deallocation processing begins. Deallocation processing must deal separately with deallocation of the deallocate set (from stage four—deleted blocks and the old relocate set) and the wandered set (from stage six). The procedure is the same in each case, but since each atom performs two deallocation steps the recovery algorithm must treat them separately as well.
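The overwrite-replay part of recovery can be sketched over a toy log. The record formats here are invented for illustration (the actual on-disk log layout is not specified in this document); the rule is exactly the one above: replay every atom with a commit record but no do-not-replay record:

```python
def recover_overwrites(log):
    """Return the (wandered, real) copy operations to replay, given a toy
    log of {'type': ..., 'atom': ..., ...} records in log order."""
    committed = {}          # atom -> wandered-to-real location map
    dont_replay = set()
    for rec in log:
        if rec["type"] == "commit":
            committed[rec["atom"]] = rec["wander_map"]
        elif rec["type"] == "do-not-replay":
            dont_replay.add(rec["atom"])
    replay = []
    for atom, wander_map in committed.items():
        if atom not in dont_replay:
            # copy each block from its wandered location to its real location
            for wandered, real in sorted(wander_map.items()):
                replay.append((wandered, real))
    return replay
```

Atoms that never reached stage three have no commit record and are implicitly rolled back, since their original block contents were never overwritten.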
The deallocation of an atom may be found in three possible states depending on whether none, some, or all of the deallocate blocks were repossessed and later committed. For each bitmap that would be modified by an atom's deallocation, the recovery algorithm must determine whether a repossessing atom later commits the same bitmap block. For each atom with a commit record in the active log fragment, the recovery algorithm determines: (1) which bitmap blocks are committed as part of its overwrite set and (2) which bitmap blocks are affected by its deallocation. For every committed atom that repossesses for another atom, the committed bitmap blocks are subtracted from the deallocate-affected bitmap blocks of the repossessed-for atom. After performing this computation, we know the set of deallocate-affected blocks that were not committed by any repossessing atoms; these deallocations are then reapplied to the on-disk bitmap blocks. This completes the crash recovery algorithm.

= Relocation and Fragmentation =

As previously discussed, the choice of which blocks to relocate (instead of overwrite) is a policy decision and, as such, not directly related to transaction management. However, this issue affects fragmentation in the file system and therefore influences performance of the transaction system in general. The basic tradeoff here is between optimizing read and write performance. The relocate policy optimizes write performance because it allows the system to write blocks without costly seeks whenever possible. This can adversely affect read performance, since blocks that were once adjacent may become scattered throughout the disk. The overwrite policy optimizes read performance because it attempts to maintain on-disk locality by preserving the location of existing blocks. This comes at the cost of write performance, since each block must be written twice per transaction.
Since system and application workloads vary, we will support several relocation policies:

* Always Relocate: This policy includes a block in the relocate set whenever it will reduce the number of blocks written to the disk.
* Never Relocate: This policy disables relocation. Blocks are always written to their original location using overwrite logging.
* Left Neighbor: This policy puts the block in the nearest available location to its left neighbor in the tree ordering. If that location is occupied by some member of the atom being written it makes the block a member of the overwrite set, otherwise the policy makes the block a member of the relocate set. This policy is simple to code, effective in the absence of a repacker, and will integrate well with an online repacker once that is coded. It will be the default policy initially.

Much more complex optimizations are possible, but deferred for a later release. Unlike WAFL, we expect the use of a repacker to play an important role.

= Meta-Data Journaling =

Meta-data journaling is a restricted operating mode in which only file system meta-data are subject to atomicity constraints. In meta-data journaling mode, file data blocks (unformatted nodes) are not captured and therefore need not be flushed as the result of transaction commit. In this case, file data blocks are not considered members of either the relocate or the overwrite set because they do not participate in the atomic update protocol—memory pressure and age are the only factors that cause unformatted nodes to be written to disk in the meta-data journaling mode. This mode is expected to be mostly of academic interest.

= Bitmap Blocks Special Handling =

Reiser4 allocates temporary blocks for wandered logging. That means we have a difference between the commit bitmap block content, which is what we should restore after a system crash, and the working bitmap block content, which is used for free block search/allocation.
(The changes to the bitmap are logged data that we write to disk at atom commit.) We keep each bitmap block in memory in two versions: one for the WORKING BITMAP, and another one for the COMMIT BITMAP. The working bitmap is used just for searching for free blocks: if their bits are not set in the working bitmap, the corresponding blocks can be allocated. The working bitmap gets updated at every block allocation. The commit bitmap reflects changes which are done in already committed atoms or in the atom which is currently being committed (we assume that atom commits are serialized, and only one atom can be committed at a time). The commit bitmap is updated at every atom commit. No bitmap data conversion (WORKING -> COMMIT) is needed; we only update the COMMIT bitmap at each transaction commit. We should note that block allocation/deallocation does not touch the COMMIT BITMAP until an atom reaches the commit stage. At that stage we apply the atom's changes which were made during the transaction. We take deallocated block numbers from the atom's deleted set and freshly allocated block numbers from the atom's captured lists, and apply those changes to the commit bitmap before we write the modified commit bitmap blocks to disk. After applying the changes, the commit bitmap blocks are added to the transaction as usual (try_capture() is called). Having two bitmaps in memory gives us a great advantage because it enables a particular bitmap-block handling optimization: we can allow several independent atoms to modify one bitmap block. Any number of atoms are allowed to allocate new blocks in any bitmap block without capturing it. Block deallocation should be deferred until the atom finishes the commit stage (one reason for this is the elimination of unnecessary dependence between atoms). This means that unnecessary atom fusion could be avoided.
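The working/commit bitmap split described above can be sketched as follows. This is an illustrative model with invented names (`BitmapPair`, list-of-bools bitmaps), not the kernel data structures; it shows only the protocol: searches consult the working bitmap, and the commit bitmap absorbs an atom's changes only at commit:

```python
class BitmapPair:
    """Toy WORKING/COMMIT bitmap pair for one bitmap block. True = allocated."""
    def __init__(self, nbits):
        self.working = [False] * nbits
        self.commit = [False] * nbits

    def allocate(self):
        # Free-block search uses only the working bitmap, which is updated
        # at every allocation; no atom needs to capture the bitmap block.
        i = self.working.index(False)
        self.working[i] = True
        return i

    def commit_atom(self, allocated, deallocated):
        # At commit, apply the atom's captured allocations and its deleted
        # set to the commit bitmap (before it is written to disk).
        for i in allocated:
            self.commit[i] = True
        for i in deallocated:
            self.commit[i] = False
            # Deallocation is deferred: old locations become reusable in the
            # working bitmap only once the owning atom commits.
            self.working[i] = False
```

Because allocations touch only the working bitmap, several independent atoms can allocate from the same bitmap block without fusing.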
We can keep atoms independent as long as they touch different data blocks and different internal nodes (in principle we could keep atoms independent even if they touch the same non-data internal node blocks, but this would require logical versioning rather than simple node versioning, and in our implementation we do simple node versioning except for bitmap blocks). Another hot spot in the reiser4 filesystem is the super block, which contains the free blocks counter. A similar technique should be applied to allow several atoms to modify the free blocks counter. In general, points of high contention between multiple atoms can benefit from logical versioning rather than node versioning, and as the system matures more flavors of logical versioning will be added.

= References =

[Hitz94] Dave Hitz, James Lau, and Michael Malcolm, "File System Design for an NFS File Server Appliance", Proceedings of the Winter 1994 USENIX Conference, San Francisco, CA, January 1994, pp. 235-246.

[[category:Reiser4]] [[category:Formatting-fixes-needed]]

Reiser4 Transaction Design Document
Last Update: Apr. 5, 2002
Joshua MacDonald, Hans Reiser and Alex Zarochentcev

= Summary =

Reiser4 will feature advanced new transaction capabilities. The transaction model we describe for version 4 allows the file system programmer to specify a set of operations and guarantees that all or none of those operations will survive a system failure (i.e., crash). The name for this specialized notion of a transaction is a transcrash.
Some file systems offer a kind of write-atomicity, known as data-journaling, in which an individual data block is written to a log file before overwriting its real location, but this only ensures that individual blocks are written atomically, not the entire buffer of a write() system call. This technique doubles the amount of data written to the disk, which becomes significant when the disk transfer rate is a limiting performance factor. Something more clever is possible. Instead of writing every modified block twice, we can write the block only once to a new location and then update the block's address in its parent node in the file system. However, the parent modification must also be included in the transaction. The WAFL (Write Anywhere File Layout) technique [Hitz94] handles this by propagating file modifications all the way to the root node of the file system, which is then updated atomically. In general, it is possible to use either approach to update a block - log a copy of the block and overwrite its original location or relocate the block and modify its parent block within the same transaction. In Reiser4 this decision is made independently for each block by a block-allocation plugin based on the set of modified blocks, the current file system layout, and the associated costs of each update method. Definition of Atomicity Most file systems perform write caching, meaning that modified data are not immediately written to the disk. Writes are deferred for a period of time, which allows the system greater control over disk scheduling. A system crash can happen at any time causing some recent modifications to be lost, and this can be a serious problem if an application has made several interdependent modifications, some of which are lost when others are not. Such an application is said to require atomicity—a guarantee that all or none of a sequence of interdependent operations will survive a crash. 
Without atomicity, a system crash can leave the file system in an inconsistent state. Dependent modifications may also arise when an application reads modified data and then produces further output. Consider the following sequence of events: * Process 1 writes file A * Process 2 reads file A * Process 2 writes file B At this point, file B may be dependent on file A. If the write-caching strategy can reverse the commit order of these operations, meaning to commit file B before file A, these processes are exposed to possible inconsistency in the event of a crash. By commiting the sequence of write operations atomically there is no exposure to inconsistency. It is still possible for the write-caching strategy to write these files in any order, as long as the commit mechanism realizes that both writes must complete for the transaction to commit successfully. This means that standard disk-scheduling techniques such as the elevator algorithm are not ruled out by atomicity requirements. A transcrash is a set of operations, of which all or none must survive a crash. An atom maintains the collection of data blocks that a transcrash has attempted to modify along with all data blocks of other atoms that fused with it. Two atoms fuse when one transcrash attempts to read or write data blocks that are part of another atom./ There are two types of transcrash: read-write fusing, and write-only fusing. A write-only-fusing transcrash by default only causes atoms to fuse together as it writes to data blocks outside its own atom. A read-write-fusing transcrash causes atoms to fuse together whenever it reads or writes data blocks outside its own atom. One may always specify within a write-only-fusing transcrash that a specific operation is read-fusing. Put another way, read-write-fusing transcrashes assume there is read dependency whereas write-only-fusing transcrashes support explicit read dependency. 
A block-capture request is the underlying mechanism used to dynamically associate transcrashes, data blocks, and atoms together. Initially, transcrashes and data blocks have no associated atom. When a block-capture request specifies a transcrash and block belonging to different atoms, those atoms are fused together (subject to a few restrictions discussed later).

Persons familiar with the database literature will note that these definitions do not imply isolation or serializability between processes. Isolation requires the ability to undo a sequence of operations when lock conflicts cause a deadlock to occur. Rollback is the ability to abort and undo the effects of the operations in an uncommitted transcrash. Transcrashes do not provide isolation, which is needed to support separate rollback of separate transcrashes. We only support unified rollback of all transcrashes in progress at the time of crash recovery. However, our architecture is designed to support separate, concurrent atoms so that it can be expanded to implement fully isolated transactions in the future. Currently, the only reason a transcrash will be aborted is a system crash. The system cannot individually abort a transcrash, and this means that transcrashes are only made available to trusted plugins inside the kernel. Once we have implemented isolation it will be possible for untrusted applications to access the transcrash interface for creating (only) isolated transcrashes.

= Stage One: Capturing and Fusing =

The initial stage starts when an atom begins. The beginning of an atom is controlled by the transaction manager itself, but the event is always triggered by a block-capture request. A transaction preserves the previous contents of all modified blocks in their original location on disk until the transaction commits, which means it has reached a state where it will be completed even if there is a crash.
The dirty blocks of an atom (which were captured and subsequently modified) are divided into two sets, relocate and overwrite, each of which is preserved in a different manner. The relocatable set is the set of blocks that have a dirty parent in the atom. The relocate set is those members of the relocatable set that we choose to relocate rather than overwrite. Whether we relocate or overwrite is a decision made for performance reasons. By writing the relocate set to different locations we avoid writing a second copy of each block to the log. When the current location of a block is its optimal location, relocation is a possible cause of file system fragmentation. We discuss relocation policies in a later section.

The overwrite set contains all dirty blocks not in the relocate set (i.e., those which do not have a dirty parent and those for which overwrite is the better policy). A wandered copy of each overwrite block is written as part of the log before the atom commits, and a second write replaces the original contents after the atom commits. Note that the superblock is the parent of the root node and the free space bitmap blocks have no parent. By these definitions, the superblock and modified bitmap blocks are always part of the overwrite set. (An alternative definition is the minimum overwrite set, which uses the same definition as above with the following modification. If at least three dirty blocks have a common parent that is clean then its parent is added to the minimum overwrite set. The parent's dirty children are removed from the overwrite set and placed in the relocate set. This optimization will be saved for a later version.)

The system responds to memory pressure by selecting dirty blocks to be flushed. When dirty blocks are written during stage one it is called early flushing because the atom remains uncommitted.
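The relocate/overwrite classification can be sketched as a small function. This is an illustrative model under invented names (split_dirty_blocks, a parent map, a policy predicate), not the actual block-allocation plugin interface.

```python
# Illustrative sketch of the relocate/overwrite split; not Reiser4 source.

def split_dirty_blocks(dirty, parent, prefer_relocate):
    """dirty: set of dirty block ids captured by the atom.
    parent: block id -> parent block id (None for parentless blocks
    such as bitmap blocks; the superblock is the root's parent).
    prefer_relocate: policy predicate choosing relocate vs overwrite
    for members of the relocatable set.
    Returns (relocate, overwrite)."""
    # The relocatable set: blocks whose parent is also dirty in the atom.
    relocatable = {b for b in dirty if parent.get(b) in dirty}
    relocate = {b for b in relocatable if prefer_relocate(b)}
    # Everything else: no dirty parent, or the policy says overwrite.
    overwrite = dirty - relocate
    return relocate, overwrite
```

As in the text, bitmap blocks (no parent) and the topmost dirty node (clean parent) always fall into the overwrite set regardless of policy.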
When early flushing is needed we only select blocks from the relocate set, because their buffers can be released, whereas the overwrite set remains pinned in memory until after the atom commits. We must enforce that atoms make progress so they can eventually commit. An atom can only commit when it has no open transcrashes, but allowing atoms to fuse allows open transcrashes to join an existing atom which may be trying to commit. For this reason, an age is associated with each atom, and when an atom reaches expiration it begins actively flushing to disk. An expired atom takes steps to avoid new transcrashes prolonging its lifetime: (1) an expired atom will not accept any new transcrashes, and (2) non-expired atoms will block rather than fuse with an expired atom. An expired atom is still allowed to fuse with any other stage-one atom, to avoid stalling expired atoms. Once an expired atom has no open transcrashes it is ready to close, meaning that it is ready to begin commit processing. All repacking, balancing, and allocation tasks have been performed by this point. Applications that are required to wait for synchronous commit (e.g., using fsync()) may have to wait for many unrelated blocks to flush, since a large atom may have captured the bitmaps. We will only provide an interface for lazy transcrash commit that closes a transcrash and waits for it to commit. An application that would like to synchronize its data as early as possible would perhaps benefit from logical logging, which is not currently supported by our architecture, or from NVRAM.

By the end of stage one:

* The in-memory free space bitmaps have been updated such that the new relocate block locations are now allocated.
* The old locations of the relocate set and any blocks deleted by this atom are not immediately deallocated, as they cannot be reused until this atom commits. We must maintain two bitmaps: commit_bitmap, which is logged to disk as part of the overwrite set prior to commit, and working_bitmap, the working in-memory copy. In the working_bitmap, the old locations of the relocate set and deleted blocks are not deallocated until after commit.
* Each atom collects a data structure representing its deallocate set, which is a list of the blocks it must deallocate once it commits. The deallocate set can be represented in a number of ways: as a list of block locations, a set of bitmaps, or using extent-compression. We expect to use a bitmap representation in our first implementation. Regardless of the representation, the deallocate set data structure is included in the commit record of this atom, where it will be used during crash recovery. The deallocate set is also used after the atom commits to update the in-memory bitmaps.
* Wandered locations are allocated for the overwrite set, and a list of the association between wandered and real overwrite block locations for this atom is included in the commit record.
* The final commit record is formatted now, although it is not needed until stage three.

= Stage Two: Completing Writes =

At this stage we begin to write the remaining dirty blocks of the atom. Any blocks that were captured and never modified can be released immediately, since they do not take part in the commit operation. To "release" a block means to allow another atom to capture it freely. Relocate blocks and overwrite blocks are treated separately at this point.

== Relocate Blocks ==

A relocate block can be released once it has been flushed to disk. All relocate blocks that were early-flushed in stage one are considered clean at this point, so they are released immediately. The remaining non-flushed relocate blocks are written at this point. Now we consider what happens if another atom requests to capture the block while the write request is being serviced.
A read-capture request is granted just as if the block did not belong to any atom at this point; it is considered clean despite belonging to a not-yet-committed atom. The only requirement on this interaction is that no atom can jump ahead in the commit ordering. Atoms must commit in the order that they reach stage two, or else read-capture from a non-committed atom must explicitly construct and maintain this dependency. A write-capture request can be granted by copying the block. This introduces the first major optimization, called copy-on-capture. The capturing process assumes control of the block, and the committing atom retains an anonymous copy. When the write request completes, the anonymous copy is released (freed). Copy-on-capture is an optimization not performed in ReiserFS version 3 (which creates a copy of each dirty page at commit), but in that version the optimization is less important because the copying does not apply to unformatted nodes. If a relocate block-write finishes before the block is captured it is released without further processing. Despite releasing relocate blocks in stage two, the atom still requires a list of old relocate block locations for deallocation purposes.

== Overwrite Blocks ==

The overwrite blocks (including modified bitmaps and the superblock) are written at this point to their wandered locations as part of the log. Unlike relocate blocks, overwrite blocks are still needed after these writes complete, as they must also be written back to their real location. Similar to relocate blocks, a read-capture request is granted as if the block did not belong to any atom. A write-capture request is granted by copying the block using the copy-on-capture method described above.

== Issues ==

One issue with the copy-on-capture approach is that it does not address the use of memory-mapped files, which can have their contents modified at any point by a process. One answer to this is to exclude mmap() writers from any atomicity guarantees.
A second alternative is to use hardware-level copy-on-write protection. A third alternative is to unmap the mapped blocks and allow ordinary page faults to capture them back again.

= Stage Three: Commit =

When all of the outstanding stage two disk writes have completed, the atom reaches stage three, at which time it finally commits by writing its commit record to the log. Once this record reaches the disk, crash recovery will replay the transaction.

= Stage Four: Post-commit Disk Writes =

The fourth stage begins when the commit record has been forced to the log.

== Overwrite Block-Writes ==

Overwrite blocks need to be written to their real locations at this point, but there is also an ordering constraint. If a number of atoms that involve the same overwrite blocks commit in sequence, they must be sure to overwrite them in the proper order. This requires synchronization for atoms that have reached stage four and are writing overwrite blocks back to their real locations. This also suggests the second major optimization, which we call steal-on-capture. The steal-on-capture optimization is an extension of the copy-on-capture optimization that applies only to the overwrite set. The idea is that only the last transaction to modify an overwrite block actually needs to write that block. This optimization, which is also present in ReiserFS version 3, means that frequently modified overwrite blocks will be written fewer than two times per transaction. With this optimization a frequently modified overwrite block may avoid being overwritten by a series of atoms; as a result crash recovery must replay more atoms than without the optimization. If an atom has overwrite blocks stolen, the atom must be replayed during crash recovery until every stealing atom commits. When an overwrite block-write finishes, the block is released without further processing.

== Deallocate-Set Processing ==

April 2002 note: We are revising our strategy for bitmap handling.
The deallocate set can be deallocated in the in-memory bitmap blocks at this point. The bitmap modifications are not considered part of this atom (since it has committed). Instead, the deallocations are performed in the context of a different stage-one atom (or atoms). We call this process repossession, whereby a stage-one atom assumes responsibility for committing bitmap modifications on behalf of another atom. For each bitmap block with pending deallocations in this stage, a separate stage-one atom may be chosen to repossess and deallocate blocks in that bitmap. This avoids the need to fuse atoms as a result of deallocation. A stage-one atom that has already captured a particular bitmap block will repossess for that block; otherwise a new atom can be selected. For crash recovery purposes, each atom must maintain a list of atoms for which it repossesses bitmap blocks. This repossesses-for list is included in the commit record for each atom. The issue of crash recovery and deallocation will be treated in the next section.

= Stage Five: Commit Completion =

When all of the outstanding stage four disk writes are complete and all of the atoms that stole from this atom commit, the atom no longer needs to be replayed during crash recovery: the overwrite set is either completely written or will be completely written by replaying later atoms. Before the log space occupied by this atom can be reclaimed, however, another topic must be discussed.

== Wandered Overwrite-Block Allocation ==

Overwrite blocks were written to wandered locations during stage two. Wandered block locations are considered part of the log in most respects; they are only needed for crash recovery of an atom that completes stage three but does not complete stage five. In the simplest approach, wandered blocks are not allocated or deallocated in the ordinary sense; instead they are appended to a cyclical log area.
There are some problems with this approach, especially when considering LVM configurations: (1) the overwrite set can be a bottleneck because it is entirely written to the same region of the logical disk, and (2) it places limits on the size of the overwrite set. For these reasons, we allow wandered blocks to be written anywhere in the disk, and as a consequence we allocate wandered blocks in stage one similarly to the relocate set. For maximum performance, the wandered set should be written using a sequential write. To achieve sequential writes in the common case, we allow the system to be configured with an optional number of areas specially reserved for wandered block allocation. In an LVM configuration, for example, reserved wandered block areas can be spread throughout the logical disk space to avoid any single disk being a bottleneck for the wandered set. Wandered block locations still need to be deallocated with this approach, but we must prevent replay of the atom's overwrites before these blocks can be deallocated. At this point (stage five), a log record is written signifying that the atom should not have its overwrites replayed.

= Stage Six: Deallocating Wandered Blocks =

Once the do-not-replay-overwrites record for this atom has been forced to the log, the wandered block locations are deallocated using repossession, the same process used for the deallocate set. At this point, a number of atoms may have repossessed bitmap blocks on behalf of this atom, for both the deallocate set and the wandered set. This atom must wait for all of those atoms to commit (i.e., reach stage four) before the log can wrap around and destroy this atom's commit record. Until that point, the atom is still needed during crash recovery because its deallocations may be incomplete. This completes the life of an atom. Now we must discuss several special topics.
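The repossession mechanism used for the deallocate set (stage four) and the wandered set (stage six) can be sketched as follows. The data structures and the chooser policy are invented for illustration; the real code would select among live stage-one atoms rather than taking the first one.

```python
# Illustrative sketch of repossession bookkeeping; not Reiser4 source.

class StageOneAtom:
    """A stage-one atom that may repossess bitmap deallocations."""
    def __init__(self):
        self.repossesses_for = set()  # goes into this atom's commit record

def repossess(bitmap_block, committed_atom_id, stage_one_atoms, captured_by):
    """Pick a stage-one atom to carry the deallocations of
    committed_atom_id in bitmap_block: prefer an atom that has already
    captured that bitmap block (avoiding fusion), else any stage-one
    atom. The chooser records whom it repossesses for, so that crash
    recovery can tell which deallocations were eventually committed."""
    chooser = captured_by.get(bitmap_block)
    if chooser is None:
        chooser = stage_one_atoms[0]
        captured_by[bitmap_block] = chooser
    chooser.repossesses_for.add(committed_atom_id)
    return chooser
```

The point of the preference rule is that a committed atom's deallocations never force two stage-one atoms to fuse over a shared bitmap block.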
= Reserving Space =

The file system must be able to ensure that there are adequate disk space reserves to complete all active transactions. Since the previous contents of modified blocks are preserved until a transaction commits, the transaction must reserve one block of free disk space for every block it modifies. Ordinarily, it would be possible to simply fail an operation that cannot reserve enough free space to complete, but such a failure leaves the transaction in a state where it likely cannot make further progress. With isolated transactions, it is possible to simply abort the transaction at this point, but another solution is needed to handle this situation without isolated transactions. There are several possible solutions:

# Explicit space reservation: allow the transaction to pre-reserve the amount of space it intends to use. The application makes calls to an interface that reserves the required space. The call to reserve space may fail, so the application should only request a reservation at points where it is possible to recover a consistent state without exceeding the previous reservation. This is the only general-purpose solution until there is support for isolated transactions; the other solutions are best avoided.
# Allow operations to fail when space reserves are exceeded. This risks file system inconsistency, because it may not be possible to recover a consistent state.
# Crash the system. To avoid inconsistency the entire system can be artificially crashed, effectively aborting every non-committed atom in the system.
# Crash just one atom. It is possible to abort a non-committed atom without taking down the entire system, but this has extreme implications. Every process that has taken part in the atom is affected by this act, not just the transcrash that has exceeded its reservation.
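Option (1), backed by an emergency reserve for applications that under-reserve, can be sketched as follows. The interface names (SpaceManager, reserve, charge) are invented, and the policy is simplified to one block per charge.

```python
# Illustrative sketch of explicit space reservation; not Reiser4 source.

class ReservationError(Exception):
    pass

class Tx:
    def __init__(self):
        self.reserved = 0  # blocks pre-reserved by the application
        self.used = 0      # blocks actually modified so far

class SpaceManager:
    def __init__(self, free_blocks, emergency_fraction=0.05):
        self.emergency = int(free_blocks * emergency_fraction)
        self.free = free_blocks - self.emergency
        self.faults = 0  # overruns, to be reported to the system log

    def reserve(self, tx, nblocks):
        """Pre-reserve space; may fail, so callers should only reserve
        at points where a consistent state is still recoverable."""
        if nblocks > self.free:
            raise ReservationError("not enough free space")
        self.free -= nblocks
        tx.reserved += nblocks

    def charge(self, tx):
        """Account for one modified block; dip into the emergency
        reserve when the transaction exceeds its own reservation."""
        tx.used += 1
        if tx.used <= tx.reserved:
            return
        if self.emergency == 0:
            # Here the system must fall back on one of the other
            # options: fail the operation, or crash.
            raise ReservationError("emergency reserves exhausted")
        self.emergency -= 1
        self.faults += 1
```

An application that stays within its reservation never sees a failure after reserve() succeeds; only faulty applications consume the emergency pool.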
We will implement explicit space reservation, but there is always the possibility that an application exceeds its own reservation, forcing us to use at least one of the other solutions as a backup measure. Space reservation is a service agreement between the transaction manager and the application: as long as the application stays within its reservation, it can expect to complete its transactions without failure or crashing. To allow some room for error, we will maintain emergency space reserves, disk space reserved for applications that make incorrect explicit reservations. This is an attempt to prevent faulty applications from failing or bringing down the system. The use of emergency space reserves will be reported to the system log so that faulty applications can be corrected. Note that these measures will not in general protect against attack: a malicious user could exploit a faulty application to bring down the system or compromise data integrity. All of these options will be configurable on a per-file system basis: (1) how much emergency space to reserve (e.g., 5% of disk space) and (2) whether to fail the operation or crash the system when reserves are exceeded.

= Write Atomicity Options =

The transcrash interface provides the application with the ability to make an entire sequence of operations atomic, including all write() system calls. Even unmodified applications use a transcrash internally for each system call to protect file system consistency, but this requires special treatment for the write() system call. Atomically writing a large buffer over pre-existing contents requires a large space reservation, a reservation that is not required by the write semantics (this does not apply to create or append). It is acceptable for the system to break a large write into smaller atomic units to reduce space reservation requirements.
We will provide a per-file system option to limit the size of atomic writes when they are performed outside the scope of an existing transaction (i.e., when the system starts a transcrash internally to protect consistency). This allows the system administrator to choose atomic writes up to some size (the space reservation requirement), beyond which writes will be broken into smaller atomic units.

= Crash Recovery Algorithm =

April 2002 note: We are revising our strategy for crash recovery of bitmaps.

Some atoms may not be completely processed at the time of a crash. The crash recovery algorithm is responsible for determining what steps must be taken to make the file system consistent again. This includes making sure that: (1) all overwrites are complete and (2) all blocks have been deallocated. We avoid discussing potential optimizations of the algorithm at present, to reduce complexity. Assume that after a crash occurs, the recovery manager has a way to detect the active log fragment, which contains the relevant set of log records that must be reprocessed. Also assume that each step can be performed using separate forward and/or reverse passes through the log. Later on we may choose to optimize these points.

Overwrite set processing is relatively simple. Every atom with a commit record found in the active log fragment, but without a corresponding do-not-replay record, has its overwrite set copied from wandered to real block locations. Overwrite recovery processing should be complete before deallocation processing begins. Deallocation processing must deal separately with deallocation of the deallocate set (from stage four: deleted blocks and the old relocate set) and the wandered set (from stage six). The procedure is the same in each case, but since each atom performs two deallocation steps, the recovery algorithm must treat them separately as well.
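Both recovery steps reduce to set arithmetic: overwrite replay first, then the deallocation bookkeeping that the text goes on to describe. The record shapes and names below are invented for illustration; this is not the on-disk log format.

```python
# Illustrative sketch of the two crash-recovery steps; not Reiser4 source.

def atoms_to_replay(log_records):
    """Step 1: every atom with a commit record in the active log
    fragment but no do-not-replay record has its overwrite set copied
    from wandered to real locations."""
    committed, finished = set(), set()
    for kind, atom_id in log_records:
        if kind == "commit":
            committed.add(atom_id)
        elif kind == "do_not_replay":
            finished.add(atom_id)
    return committed - finished

def pending_deallocations(atoms):
    """Step 2: for each committed atom, subtract the bitmap blocks
    committed by atoms that repossess for it; what remains are the
    deallocations that must be reapplied to the on-disk bitmaps.
    Each atom dict holds: name, affected (bitmap blocks its
    deallocations touch), committed_bitmaps (bitmap blocks in its
    committed overwrite set), repossesses_for (names of atoms it
    repossessed bitmap blocks for)."""
    pending = {a["name"]: set(a["affected"]) for a in atoms}
    for a in atoms:
        for victim in a["repossesses_for"]:
            if victim in pending:
                pending[victim] -= a["committed_bitmaps"]
    return pending
```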
The deallocation of an atom may be found in three possible states, depending on whether none, some, or all of the deallocate blocks were repossessed and later committed. For each bitmap that would be modified by an atom's deallocation, the recovery algorithm must determine whether a repossessing atom later commits the same bitmap block. For each atom with a commit record in the active log fragment, the recovery algorithm determines: (1) which bitmap blocks are committed as part of its overwrite set and (2) which bitmap blocks are affected by its deallocation. For every committed atom that repossesses for another atom, the committed bitmap blocks are subtracted from the deallocate-affected bitmap blocks of the repossessed-for atom. After performing this computation, we know the set of deallocate-affected blocks that were not committed by any repossessing atoms; these deallocations are then reapplied to the on-disk bitmap blocks. This completes the crash recovery algorithm.

= Relocation and Fragmentation =

As previously discussed, the choice of which blocks to relocate (instead of overwrite) is a policy decision and, as such, not directly related to transaction management. However, this issue affects fragmentation in the file system and therefore influences the performance of the transaction system in general. The basic tradeoff here is between optimizing read and write performance. The relocate policy optimizes write performance because it allows the system to write blocks without costly seeks whenever possible. This can adversely affect read performance, since blocks that were once adjacent may become scattered throughout the disk. The overwrite policy optimizes read performance because it attempts to maintain on-disk locality by preserving the location of existing blocks. This comes at the cost of write performance, since each block must be written twice per transaction.
Since system and application workloads vary, we will support several relocation policies:

* Always Relocate: This policy includes a block in the relocate set whenever it will reduce the number of blocks written to the disk.
* Never Relocate: This policy disables relocation. Blocks are always written to their original location using overwrite logging.
* Left Neighbor: This policy puts the block in the nearest available location to its left neighbor in the tree ordering. If that location is occupied by some member of the atom being written, it makes the block a member of the overwrite set; otherwise the policy makes the block a member of the relocate set. This policy is simple to code, effective in the absence of a repacker, and will integrate well with an online repacker once that is coded. It will be the default policy initially.

Much more complex optimizations are possible, but they are deferred for a later release. Unlike WAFL, we expect the use of a repacker to play an important role.

= Meta-Data Journaling =

Meta-data journaling is a restricted operating mode in which only file system meta-data are subject to atomicity constraints. In meta-data journaling mode, file data blocks (unformatted nodes) are not captured and therefore need not be flushed as the result of transaction commit. In this case, file data blocks are not considered members of either the relocate or the overwrite set, because they do not participate in the atomic update protocol; memory pressure and age are the only factors that cause unformatted nodes to be written to disk in meta-data journaling mode. This mode is expected to be mostly of academic interest.

= Bitmap Blocks Special Handling =

Reiser4 allocates temporary blocks for wandered logging. That means we have a difference between the commit bitmap block content, which is what we should restore after a system crash, and the working bitmap block content, which is used for free block search/allocation.
(The changes to the bitmap are logged data that we write to disk at atom commit.) We keep each bitmap block in memory in two versions: one for the working bitmap, and another for the commit bitmap. The working bitmap is used just for searching for free blocks: if a block's bit is not set in the working bitmap, that block can be allocated. The working bitmap gets updated at every block allocation. The commit bitmap reflects changes made by already-committed atoms or by the atom which is currently being committed (we assume that atom commits are serialized, and only one atom can be committed at a time). The commit bitmap is updated at every atom commit.

No bitmap data conversion (working -> commit) is needed; we only update the commit bitmap at each transaction commit. We should note that block allocation/deallocation does not touch the commit bitmap until an atom reaches the commit stage. At that stage we apply the atom's changes which were made during the transaction: we take deallocated block numbers from the atom's deleted set and freshly allocated block numbers from the atom's captured lists, and we apply those changes to the commit bitmap before we write the modified commit bitmap blocks to disk. After applying the changes, the commit bitmap blocks are added to the transaction as usual (try_capture() is called).

Having two bitmaps in memory gives us a great advantage because it allows a particular bitmap-block handling optimization: we can allow several independent atoms to modify one bitmap block. Any number of atoms are allowed to allocate new blocks in any bitmap block without capturing it. Block deallocation should be deferred until the atom finishes the commit stage (one reason for this is the elimination of unnecessary dependence between atoms). This means that unnecessary atom fusion can be avoided.
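The two in-memory versions can be sketched as follows. BitmapBlock and its methods are invented names, and the real code operates on bit arrays captured via try_capture() rather than Python lists.

```python
# Illustrative sketch of the working/commit bitmap pair; not Reiser4 source.

class BitmapBlock:
    def __init__(self, nbits):
        self.working = [False] * nbits  # used only for free-block search
        self.commit = [False] * nbits   # what crash recovery must restore

    def allocate(self, bit):
        """Allocation touches only the working bitmap, so independent
        atoms may allocate in the same bitmap block without capturing it."""
        assert not self.working[bit], "block already allocated"
        self.working[bit] = True

    def apply_commit(self, allocated, deallocated):
        """At atom commit (commits are serialized): apply the atom's
        captured allocations and its deleted set to the commit bitmap,
        and perform the deferred deallocations in the working bitmap."""
        for bit in allocated:
            self.commit[bit] = True
        for bit in deallocated:
            self.commit[bit] = False
            self.working[bit] = False  # deallocation was deferred until now
```

Note that allocate() never touches the commit bitmap; the commit bitmap changes only at apply_commit(), matching the rule that allocation/deallocation does not touch the commit bitmap until an atom reaches the commit stage.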
We can keep atoms independent as long as they touch different data blocks and different internal nodes (in principle we could keep atoms independent even if they touch the same non-data internal node blocks, but this would require logical versioning rather than simple node versioning, and in our implementation we do simple node versioning except for bitmap blocks). Another hot spot in the reiser4 filesystem is the superblock, which contains the free blocks counter. A similar technique should be applied to allow several atoms to modify the free blocks counter. In general, points of high contention between multiple atoms can benefit from logical versioning rather than node versioning, and as the system matures more flavors of logical versioning will be added.

= References =

[Hitz94] Dave Hitz, James Lau, and Michael Malcolm, "File System Design for an NFS File Server Appliance", Proceedings of the Winter 1994 USENIX Conference, San Francisco, CA, January 1994, 235-246.

''Reiser4 Transaction Design Document, last updated Apr. 5, 2002, by Joshua MacDonald, Hans Reiser and Alex Zarochentcev.''

[[category:Reiser4]]
Writes are not even guaranteed to be ordered in the traditional semantics, meaning that recently-written data could survive a crash even though less-recently-written data does not survive. Some file systems offer a kind of write-atomicity, known as data-journaling, in which an individual data block is written to a log file before overwriting its real location, but this only ensures that individual blocks are written atomically, not the entire buffer of a write() system call. This technique doubles the amount of data written to the disk, which becomes significant when the disk transfer rate is a limiting performance factor. Something more clever is possible. Instead of writing every modified block twice, we can write the block only once to a new location and then update the block's address in its parent node in the file system. However, the parent modification must also be included in the transaction. The WAFL (Write Anywhere File Layout) technique [Hitz94] handles this by propagating file modifications all the way to the root node of the file system, which is then updated atomically. In general, it is possible to use either approach to update a block - log a copy of the block and overwrite its original location or relocate the block and modify its parent block within the same transaction. In Reiser4 this decision is made independently for each block by a block-allocation plugin based on the set of modified blocks, the current file system layout, and the associated costs of each update method. Definition of Atomicity Most file systems perform write caching, meaning that modified data are not immediately written to the disk. Writes are deferred for a period of time, which allows the system greater control over disk scheduling. A system crash can happen at any time causing some recent modifications to be lost, and this can be a serious problem if an application has made several interdependent modifications, some of which are lost when others are not. 
Such an application is said to require atomicity—a guarantee that all or none of a sequence of interdependent operations will survive a crash. Without atomicity, a system crash can leave the file system in an inconsistent state. Dependent modifications may also arise when an application reads modified data and then produces further output. Consider the following sequence of events: * Process 1 writes file A * Process 2 reads file A * Process 2 writes file B At this point, file B may be dependent on file A. If the write-caching strategy can reverse the commit order of these operations, meaning to commit file B before file A, these processes are exposed to possible inconsistency in the event of a crash. By commiting the sequence of write operations atomically there is no exposure to inconsistency. It is still possible for the write-caching strategy to write these files in any order, as long as the commit mechanism realizes that both writes must complete for the transaction to commit successfully. This means that standard disk-scheduling techniques such as the elevator algorithm are not ruled out by atomicity requirements. A transcrash is a set of operations, of which all or none must survive a crash. An atom maintains the collection of data blocks that a transcrash has attempted to modify along with all data blocks of other atoms that fused with it. Two atoms fuse when one transcrash attempts to read or write data blocks that are part of another atom./ There are two types of transcrash: read-write fusing, and write-only fusing. A write-only-fusing transcrash by default only causes atoms to fuse together as it writes to data blocks outside its own atom. A read-write-fusing transcrash causes atoms to fuse together whenever it reads or writes data blocks outside its own atom. One may always specify within a write-only-fusing transcrash that a specific operation is read-fusing. 
Put another way, read-write-fusing transcrashes assume read dependencies exist, whereas write-only-fusing transcrashes support explicit read dependencies. A block-capture request is the underlying mechanism used to dynamically associate transcrashes, data blocks, and atoms. Initially, transcrashes and data blocks have no associated atom. When a block-capture request specifies a transcrash and a block belonging to different atoms, those atoms are fused together (subject to a few restrictions discussed later).

Readers familiar with the database literature will note that these definitions do not imply isolation or serializability between processes. Isolation requires the ability to undo a sequence of operations when lock conflicts cause a deadlock. Rollback is the ability to abort and undo the effects of the operations in an uncommitted transcrash. Transcrashes do not provide isolation, which would be needed to support separate rollback of separate transcrashes; we only support unified rollback of all transcrashes in progress at the time of crash recovery. However, our architecture is designed to support separate, concurrent atoms so that it can be extended to implement fully isolated transactions in the future. Currently, the only reason a transcrash will be aborted is a system crash. The system cannot individually abort a transcrash, which means transcrashes are only made available to trusted plugins inside the kernel. Once isolation is implemented, it will be possible for untrusted applications to access the transcrash interface for creating (only) isolated transcrashes.

= Stage One: Capturing and Fusing =

The initial stage starts when an atom begins. The beginning of an atom is controlled by the transaction manager itself, but the event is always triggered by a block-capture request.
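The capture-and-fuse bookkeeping described above can be sketched as a toy Python model. This is not the kernel implementation; all names here are hypothetical, and it encodes only the core rule: a capture request against a block owned by another atom fuses the two atoms, so dependent operations commit together or not at all.

```python
# Toy model (hypothetical names) of block capture and atom fusion.

class Atom:
    def __init__(self):
        self.blocks = set()        # blocks captured by this atom

def capture(atom, block, owner):
    """Capture `block` for `atom`; `owner` maps block -> owning Atom.
    Returns the atom the caller belongs to after any fusion."""
    holder = owner.get(block)
    if holder is None:                     # unowned: just capture it
        atom.blocks.add(block)
        owner[block] = atom
        return atom
    if holder is atom:                     # already ours
        return atom
    # Fuse: merge the smaller atom into the larger one.
    big, small = (holder, atom) if len(holder.blocks) >= len(atom.blocks) else (atom, holder)
    big.blocks |= small.blocks
    for b in small.blocks:
        owner[b] = big
    return big

# The dependency example from the text: process 1 writes A, process 2
# reads A and then writes B, so A and B end up in the same atom.
owner = {}
atom = capture(Atom(), "A", owner)     # process 1 writes file A
atom = capture(Atom(), "A", owner)     # process 2 reads file A: fuse
atom = capture(atom, "B", owner)       # process 2 writes file B
assert owner["A"] is owner["B"]
```

Because both files belong to one atom after fusion, the commit mechanism can still schedule the individual disk writes in any order, exactly as the text notes.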
A transaction preserves the previous contents of all modified blocks in their original locations on disk until the transaction commits, meaning it has reached a state in which it will be completed even if there is a crash. The dirty blocks of an atom (those that were captured and subsequently modified) are divided into two sets, relocate and overwrite, each of which is preserved in a different manner. The relocatable set is the set of blocks that have a dirty parent in the atom. The relocate set consists of those members of the relocatable set that we choose to relocate rather than overwrite; whether we relocate or overwrite is a decision made for performance reasons. By writing the relocate set to different locations we avoid writing a second copy of each block to the log. When the current location of a block is its optimal location, relocation is a possible cause of file system fragmentation. We discuss relocation policies in a later section.

The overwrite set contains all dirty blocks not in the relocate set (i.e., those that do not have a dirty parent and those for which overwrite is the better policy). A wandered copy of each overwrite block is written as part of the log before the atom commits, and a second write replaces the original contents after the atom commits. Note that the superblock is the parent of the root node and that the free-space bitmap blocks have no parent; by these definitions, the superblock and modified bitmap blocks are always part of the overwrite set.

(An alternative definition is the minimum overwrite set, which uses the same definition as above with the following modification: if at least three dirty blocks have a common parent that is clean, then their parent is added to the minimum overwrite set, and the parent's dirty children are removed from the overwrite set and placed in the relocate set. This optimization is saved for a later version.)

The system responds to memory pressure by selecting dirty blocks to be flushed.
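The relocate/overwrite partition above can be sketched as follows. This is a simplified Python model: `parent` and the `choose_relocate` policy predicate are hypothetical stand-ins for the tree structure and the block-allocation plugin.

```python
def partition(dirty, parent, choose_relocate):
    """Split an atom's dirty blocks into relocate and overwrite sets.

    dirty: set of dirty block ids; parent: dict block -> parent block;
    choose_relocate: policy predicate applied to relocatable blocks.
    """
    # Relocatable: dirty blocks whose parent is also dirty in this atom.
    relocatable = {b for b in dirty if parent.get(b) in dirty}
    # Relocate: those relocatable blocks the policy chooses to relocate.
    relocate = {b for b in relocatable if choose_relocate(b)}
    # Everything else (including bitmap blocks and the superblock, which
    # have no dirty parent) falls into the overwrite set.
    overwrite = dirty - relocate
    return relocate, overwrite
```

For example, with two dirty leaves under a dirty twig whose own parent is clean, the leaves are relocatable but the twig must be overwritten (wander-logged), matching the rule in the text.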
When dirty blocks are written during stage one it is called early flushing, because the atom remains uncommitted. When early flushing is needed we select blocks only from the relocate set, because their buffers can be released, whereas the overwrite set remains pinned in memory until after the atom commits.

We must ensure that atoms make progress so they can eventually commit. An atom can only commit when it has no open transcrashes, but allowing atoms to fuse allows open transcrashes to join an existing atom that may be trying to commit. For this reason an age is associated with each atom, and when an atom reaches expiration it begins actively flushing to disk. An expired atom takes steps to prevent new transcrashes from prolonging its lifetime: (1) an expired atom will not accept any new transcrashes, and (2) non-expired atoms will block rather than fuse with an expired atom. An expired atom is still allowed to fuse with any other stage-one atom, to avoid stalling expired atoms. Once an expired atom has no open transcrashes it is ready to close, meaning that it is ready to begin commit processing. All repacking, balancing, and allocation tasks have been performed by this point.

Applications that must wait for synchronous commit (e.g., using fsync()) may have to wait for many unrelated blocks to flush, since a large atom may have captured the bitmaps. We will provide only an interface for lazy transcrash commit, which closes a transcrash and waits for it to commit. An application that would like to synchronize its data as early as possible might benefit from logical logging, which is not currently supported by our architecture, or from NVRAM.

At the end of stage one:

* The in-memory free-space bitmaps have been updated so that the new relocate block locations are now allocated.
* The old locations of the relocate set, and any blocks deleted by this atom, are not immediately deallocated, as they cannot be reused until this atom commits.
We must maintain two bitmaps: the commit bitmap (commit_bitmap), which is logged to disk as part of the overwrite set prior to commit, and the working bitmap (working_bitmap), the working in-memory copy. In the working bitmap, the old locations of the relocate set and of deleted blocks are not deallocated until after commit.
* Each atom collects a data structure representing its deallocate set, the list of blocks it must deallocate once it commits. The deallocate set can be represented in a number of ways: as a list of block locations, as a set of bitmaps, or using extent compression. We expect to use a bitmap representation in our first implementation. Regardless of the representation, the deallocate-set data structure is included in the commit record of this atom, where it is used during crash recovery. The deallocate set is also used after the atom commits to update the in-memory bitmaps.
* Wandered locations are allocated for the overwrite set, and a list associating the wandered and real overwrite block locations for this atom is included in the commit record.
* The final commit record is formatted now, although it is not needed until stage three.

= Stage Two: Completing Writes =

At this stage we begin to write the remaining dirty blocks of the atom. Any blocks that were captured but never modified can be released immediately, since they take no part in the commit operation. To "release" a block means to allow another atom to capture it freely. Relocate blocks and overwrite blocks are treated separately at this point.

== Relocate Blocks ==

A relocate block can be released once it has been flushed to disk. All relocate blocks that were early-flushed in stage one are considered clean at this point, so they are released immediately. The remaining non-flushed relocate blocks are written now. Next we consider what happens if another atom requests to capture a block while its write request is being serviced.
A read-capture request is granted just as if the block did not belong to any atom: it is considered clean despite belonging to a not-yet-committed atom. The only requirement on this interaction is that no atom can jump ahead in the commit ordering. Atoms must commit in the order in which they reach stage two, or else a read-capture from a non-committed atom must explicitly construct and maintain this dependency. A write-capture request can be granted by copying the block. This introduces the first major optimization, called copy-on-capture: the capturing process assumes control of the block, and the committing atom retains an anonymous copy. When the write request completes, the anonymous copy is released (freed). Copy-on-capture is an optimization not performed in ReiserFS version 3 (which creates a copy of each dirty page at commit), but in that version the optimization is less important because the copying does not apply to unformatted nodes. If a relocate block-write finishes before the block is captured, the block is released without further processing. Despite releasing relocate blocks in stage two, the atom still requires a list of old relocate block locations for deallocation purposes.

== Overwrite Blocks ==

The overwrite blocks (including modified bitmaps and the superblock) are now written to their wandered locations as part of the log. Unlike relocate blocks, overwrite blocks are still needed after these writes complete, as they must also be written back to their real locations. As with relocate blocks, a read-capture request is granted as if the block did not belong to any atom, and a write-capture request is granted by copying the block using the copy-on-capture method described above.

== Issues ==

One issue with the copy-on-capture approach is that it does not address memory-mapped files, whose contents can be modified at any point by a process. One answer is to exclude mmap() writers from any atomicity guarantees.
A second alternative is to use hardware-level copy-on-write protection. A third alternative is to unmap the mapped blocks and allow ordinary page faults to capture them back again.

= Stage Three: Commit =

When all of the outstanding stage-two disk writes have completed, the atom reaches stage three, at which point it finally commits by writing its commit record to the log. Once this record reaches the disk, crash recovery will replay the transaction.

= Stage Four: Post-commit Disk Writes =

The fourth stage begins when the commit record has been forced to the log.

== Overwrite Block-Writes ==

Overwrite blocks must now be written to their real locations, but there is an ordering constraint: if a number of atoms that involve the same overwrite blocks commit in sequence, they must overwrite those blocks in the proper order. This requires synchronization among atoms that have reached stage four and are writing overwrite blocks back to their real locations. It also suggests the second major optimization, labeled steal-on-capture. Steal-on-capture is an extension of copy-on-capture that applies only to the overwrite set. The idea is that only the last transaction to modify an overwrite block actually needs to write that block. This optimization, which is also present in ReiserFS version 3, means that frequently modified overwrite blocks are written fewer than two times per transaction. With this optimization a frequently modified overwrite block may avoid being overwritten by a series of atoms; as a result, crash recovery must replay more atoms than without the optimization. If an atom has overwrite blocks stolen, the atom must be replayed during crash recovery until every stealing atom commits. When an overwrite block-write finishes, the block is released without further processing.

== Deallocate-Set Processing ==

April 2002 note: We are revising our strategy for bitmap handling.
The deallocate set can now be deallocated in the in-memory bitmap blocks. The bitmap modifications are not considered part of this atom (since it has committed); instead, the deallocations are performed in the context of a different stage-one atom (or atoms). We call this process repossession, whereby a stage-one atom assumes responsibility for committing bitmap modifications on behalf of another atom. For each bitmap block with pending deallocations at this stage, a separate stage-one atom may be chosen to repossess and deallocate blocks in that bitmap. This avoids the need to fuse atoms as a result of deallocation. A stage-one atom that has already captured a particular bitmap block will repossess for that block; otherwise a new atom can be selected. For crash-recovery purposes, each atom must maintain a list of the atoms for which it repossesses bitmap blocks. This repossesses-for list is included in the commit record of each atom. The issue of crash recovery and deallocation is treated in a later section.

= Stage Five: Commit Completion =

When all of the outstanding stage-four disk writes are complete and all of the atoms that stole from this atom have committed, the atom no longer needs to be replayed during crash recovery: the overwrite set is either completely written or will be completely written by replaying later atoms. Before the log space occupied by this atom can be reclaimed, however, another topic must be discussed.

== Wandered Overwrite-Block Allocation ==

Overwrite blocks were written to wandered locations during stage two. Wandered block locations are considered part of the log in most respects; they are only needed for crash recovery of an atom that completes stage three but does not complete stage five. In the simplest approach, wandered blocks are not allocated or deallocated in the ordinary sense; instead they are appended to a cyclical log area.
There are some problems with this approach, especially in LVM configurations: (1) the overwrite set can become a bottleneck, because it is written entirely to the same region of the logical disk, and (2) it places limits on the size of the overwrite set. For these reasons we allow wandered blocks to be written anywhere on the disk, and as a consequence we allocate wandered blocks in stage one, similarly to the relocate set. For maximum performance, the wandered set should be written using a sequential write. To achieve sequential writes in the common case, we allow the system to be configured with an optional number of areas specially reserved for wandered block allocation. In an LVM configuration, for example, reserved wandered-block areas can be spread throughout the logical disk space to avoid any single disk becoming a bottleneck for the wandered set. Wandered block locations still need to be deallocated with this approach, but we must prevent replay of the atom's overwrites before these blocks can be deallocated. At this point (stage five), a log record is written signifying that the atom's overwrites should not be replayed.

= Stage Six: Deallocating Wandered Blocks =

Once the do-not-replay-overwrites record for this atom has been forced to the log, the wandered block locations are deallocated using repossession, the same process used for the deallocate set. At this point a number of atoms may have repossessed bitmap blocks on behalf of this atom, for both the deallocate set and the wandered set. This atom must wait for all of those atoms to commit (i.e., reach stage four) before the log can wrap around and destroy this atom's commit record. Until that point, the atom is still needed during crash recovery because its deallocations may be incomplete. This completes the life of an atom. Now we must discuss several special topics.
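The life of an atom described above can be summarized as a strictly forward pipeline. The following sketch (stage names are hypothetical labels, not identifiers from the Reiser4 source) simply encodes the six stages and their ordering:

```python
# Sketch of the atom lifecycle as a forward-only pipeline: an atom moves
# through the six stages in order and never moves backwards.
from enum import Enum, auto

class Stage(Enum):
    CAPTURE_FUSE = auto()       # 1: capture blocks, fuse atoms, allocate
    WRITE = auto()              # 2: write relocate blocks, wandered copies
    COMMIT = auto()             # 3: force the commit record to the log
    POST_COMMIT = auto()        # 4: overwrite blocks to real locations
    COMMIT_COMPLETE = auto()    # 5: do-not-replay-overwrites record
    DEALLOC_WANDERED = auto()   # 6: repossess/deallocate wandered blocks

PIPELINE = list(Stage)          # Enum preserves definition order

def advance(stage):
    """Return the next stage, or None once the atom's life is complete."""
    i = PIPELINE.index(stage)
    return PIPELINE[i + 1] if i + 1 < len(PIPELINE) else None
```

The forward-only property is what makes the commit-ordering rules above enforceable: an atom that has reached stage two can be passed by no atom behind it.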
= Reserving Space =

The file system must ensure that there are adequate disk-space reserves to complete all active transactions. Since the previous contents of modified blocks are preserved until a transaction commits, the transaction must reserve one block of free disk space for every block it modifies. Ordinarily it would be possible simply to fail an operation that cannot reserve enough free space to complete, but such a failure leaves the transaction in a state where it likely cannot make further progress. With isolated transactions it would be possible simply to abort the transaction at this point, but without isolated transactions another solution is needed. There are several possible solutions:

# Explicit space reservation: allow the transaction to pre-reserve the amount of space it intends to use. The application makes calls to an interface that reserves the required space. The call to reserve space may fail, so the application should only request a reservation at points where it is possible to recover a consistent state without exceeding the previous reservation. This is the only general-purpose solution until there is support for isolated transactions; the other solutions are best avoided.
# Allow operations to fail when space reserves are exceeded. This risks file system inconsistency, because it may not be possible to recover a consistent state.
# Crash the system. To avoid inconsistency the entire system can be artificially crashed, effectively aborting every non-committed atom in the system.
# Crash just one atom. It is possible to abort a non-committed atom without taking down the entire system, but this has extreme implications: every process that has taken part in the atom is affected, not just the transcrash that exceeded its reservation.
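Explicit space reservation backed by an emergency reserve (option 1, combined with the backup measure discussed in the surrounding text) might look like the following sketch. The class and method names are hypothetical, the 5% emergency fraction is the example figure from the text, and the policy on exhausting the emergency reserve is reduced here to a plain failure:

```python
# Sketch of explicit space reservation with an emergency reserve: a
# transcrash pre-reserves one free block per block it intends to modify;
# overruns dip into the emergency reserve and are counted so faulty
# applications can be reported and corrected.

class OutOfSpace(Exception):
    pass

class Reservations:
    def __init__(self, free_blocks, emergency_fraction=0.05):
        self.emergency = int(free_blocks * emergency_fraction)
        self.free = free_blocks - self.emergency
        self.overruns = 0        # stand-in for reporting to the system log

    def reserve(self, blocks):
        """Pre-reserve space; may fail, at a point where the caller can
        still recover a consistent state."""
        if blocks > self.free:
            raise OutOfSpace(blocks)
        self.free -= blocks
        return blocks

    def overrun(self, blocks):
        """The application exceeded its own reservation: draw on the
        emergency reserve, or fail (the configurable backup policy)."""
        if blocks > self.emergency:
            raise OutOfSpace(blocks)
        self.emergency -= blocks
        self.overruns += 1
        return blocks
```

The key property is that `reserve()` is the only call that may fail in normal operation, and it fails before the transcrash has made any modification that would be hard to back out.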
We will implement explicit space reservation, but there is always the possibility that an application exceeds its own reservation, forcing us to use at least one of the other solutions as a backup measure. Space reservation is a service agreement between the transaction manager and the application: as long as the application stays within its reservation, it can expect to complete its transactions without failure or crashing. To allow some room for error we will maintain emergency space reserves, disk space reserved for applications that make incorrect explicit reservations. This is an attempt to prevent faulty applications from failing or bringing down the system. Use of the emergency space reserves will be reported to the system log so that faulty applications can be corrected. Note that these measures will not in general protect against attack: a malicious user could exploit a faulty application to bring down the system or compromise data integrity. All of these options will be configurable on a per-file-system basis: (1) how much emergency space to reserve (e.g., 5% of disk space) and (2) whether to fail the operation or crash the system when reserves are exceeded.

= Write Atomicity Options =

The transcrash interface gives the application the ability to make an entire sequence of operations atomic, including all write() system calls. Even unmodified applications use a transcrash internally for each system call to protect file system consistency, but this requires special treatment for the write() system call. Atomically writing a large buffer over pre-existing contents requires a large space reservation, a reservation that is not required by the write semantics (this does not apply to create or append). It is acceptable for the system to break a large write into smaller atomic units to reduce space-reservation requirements.
We will provide a per-file-system option to limit the size of atomic writes when they are performed outside the scope of an existing transaction (i.e., when the system starts a transcrash internally to protect consistency). This allows the system administrator to choose atomic writes up to some size (the space-reservation requirement), beyond which writes are broken into smaller atomic units.

= Crash Recovery Algorithm =

April 2002 note: We are revising our strategy for crash recovery of bitmaps.

Some atoms may not be completely processed at the time of a crash. The crash-recovery algorithm is responsible for determining what steps must be taken to make the file system consistent again. This includes making sure that (1) all overwrites are complete and (2) all blocks have been deallocated. We avoid discussing potential optimizations of the algorithm here, to reduce complexity. Assume that after a crash the recovery manager has a way to detect the active log fragment, which contains the relevant set of log records that must be reprocessed. Also assume that each step can be performed using separate forward and/or reverse passes through the log; later we may choose to optimize these points.

Overwrite-set processing is relatively simple: every atom with a commit record found in the active log fragment, but without a corresponding do-not-replay record, has its overwrite set copied from wandered to real block locations. Overwrite recovery processing should complete before deallocation processing begins. Deallocation processing must deal separately with deallocation of the deallocate set (from stage four: deleted blocks and the old relocate set) and the wandered set (from stage six). The procedure is the same in each case, but since each atom performs two deallocation steps, the recovery algorithm must treat them separately as well.
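The overwrite-replay pass can be sketched as follows. This is a toy log model with hypothetical record shapes: each commit record carries the atom's wandered-to-real location map, and an atom is replayed unless a matching do-not-replay record also appears in the active log fragment.

```python
# Sketch of overwrite replay during crash recovery (toy log model).

def replay_overwrites(log_records, disk):
    """log_records: list of ('commit', atom_id, wander_map) or
    ('do_not_replay', atom_id) tuples in log order.
    disk: dict location -> contents, mutated in place."""
    skip = {r[1] for r in log_records if r[0] == 'do_not_replay'}
    for kind, atom_id, *rest in log_records:
        if kind == 'commit' and atom_id not in skip:
            (wander_map,) = rest
            for wandered, real in wander_map.items():
                # Copy the wandered copy over the real location.
                disk[real] = disk[wandered]
```

Replaying commit records in log order automatically preserves the stage-four ordering constraint for overwrite blocks shared by a sequence of atoms, and steal-on-capture is handled for free: a stolen block simply appears in a later atom's wander map and is replayed last.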
The deallocation of an atom may be found in one of three possible states, depending on whether none, some, or all of the deallocate blocks were repossessed and later committed. For each bitmap that would be modified by an atom's deallocation, the recovery algorithm must determine whether a repossessing atom later committed the same bitmap block. For each atom with a commit record in the active log fragment, the recovery algorithm determines (1) which bitmap blocks are committed as part of its overwrite set and (2) which bitmap blocks are affected by its deallocation. For every committed atom that repossesses for another atom, the committed bitmap blocks are subtracted from the deallocate-affected bitmap blocks of the repossessed-for atom. After performing this computation we know the set of deallocate-affected blocks that were not committed by any repossessing atom; these deallocations are then reapplied to the on-disk bitmap blocks. This completes the crash-recovery algorithm.

= Relocation and Fragmentation =

As previously discussed, the choice of which blocks to relocate (instead of overwrite) is a policy decision and, as such, not directly related to transaction management. However, this choice affects fragmentation in the file system and therefore influences the performance of the transaction system in general. The basic tradeoff is between optimizing read and write performance. The relocate policy optimizes write performance because it allows the system to write blocks without costly seeks whenever possible. This can hurt read performance, since blocks that were once adjacent may become scattered throughout the disk. The overwrite policy optimizes read performance because it attempts to maintain on-disk locality by preserving the locations of existing blocks. This comes at the cost of write performance, since each block must be written twice per transaction.
Since system and application workloads vary, we will support several relocation policies:

* Always Relocate: includes a block in the relocate set whenever doing so will reduce the number of blocks written to disk.
* Never Relocate: disables relocation; blocks are always written to their original location using overwrite logging.
* Left Neighbor: puts the block in the nearest available location to its left neighbor in the tree ordering. If that location is occupied by some member of the atom being written, the block becomes a member of the overwrite set; otherwise it becomes a member of the relocate set. This policy is simple to code, effective in the absence of a repacker, and will integrate well with an online repacker once that is written. It will be the default policy initially.

Much more complex optimizations are possible, but they are deferred to a later release. Unlike WAFL, we expect a repacker to play an important role.

= Meta-Data Journaling =

Meta-data journaling is a restricted operating mode in which only file system meta-data are subject to atomicity constraints. In meta-data journaling mode, file data blocks (unformatted nodes) are not captured and therefore need not be flushed as a result of transaction commit. In this case file data blocks are not members of either the relocate or the overwrite set, because they do not participate in the atomic update protocol; memory pressure and age are the only factors that cause unformatted nodes to be written to disk in meta-data journaling mode. This mode is expected to be mostly of academic interest.

= Bitmap Blocks Special Handling =

Reiser4 allocates temporary blocks for wandered logging. That means there is a difference between the commit bitmap block contents, which are what must be restored after a system crash, and the working bitmap block contents, which are used for free-block search and allocation.
(The changes to the bitmap are logged data that we write to disk at atom commit.) We keep each bitmap block in memory in two versions: one for the working bitmap and another for the commit bitmap. The working bitmap is used only for searching for free blocks: if a block's bit is not set in the working bitmap, the block can be allocated. The working bitmap is updated at every block allocation. The commit bitmap reflects changes made by already-committed atoms or by the atom currently being committed (we assume that atom commits are serialized, and only one atom can commit at a time). The commit bitmap is updated at every atom commit. No bitmap data conversion (working -> commit) is needed; we only update the commit bitmap at each transaction commit.

Note that block allocation and deallocation do not touch the commit bitmap until an atom reaches the commit stage. At that stage we apply the changes the atom made during the transaction: we take deallocated block numbers from the atom's deleted set and freshly allocated block numbers from the atom's captured lists, and we apply those changes to the commit bitmap before writing the modified commit bitmap blocks to disk. After applying the changes, the commit bitmap blocks are added to the transaction as usual (try_capture() is called).

Keeping two bitmaps in memory gives us a great advantage because it allows a particular bitmap-block-handling optimization: several independent atoms may modify one bitmap block. Any number of atoms are allowed to allocate new blocks in any bitmap block without capturing it. Block deallocation should be deferred until the atom finishes the commit stage (one reason for this is the elimination of unnecessary dependence between atoms). This means that unnecessary atom fusion can be avoided.
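The two-bitmap scheme can be sketched with a toy Python model. This is a simplification with hypothetical names: the real code operates on on-disk bitmap blocks and re-captures them with try_capture(), while here the working bitmap answers allocation queries immediately and the commit bitmap only absorbs an atom's changes at commit.

```python
# Toy model of the working/commit bitmap split described above.

class Bitmaps:
    def __init__(self, nblocks):
        self.working = [False] * nblocks   # True = block in use
        self.commit = [False] * nblocks

    def alloc(self):
        """Allocate from the working bitmap; commit bitmap untouched,
        so no atom needs to capture the bitmap block to allocate."""
        for i, used in enumerate(self.working):
            if not used:
                self.working[i] = True
                return i
        raise RuntimeError("out of space")

    def commit_atom(self, allocated, deleted):
        """At commit, apply the atom's changes to the commit bitmap and
        only now free the deleted blocks in the working bitmap
        (deallocation is deferred until commit)."""
        for i in allocated:
            self.commit[i] = True
        for i in deleted:
            self.commit[i] = False
            self.working[i] = False
```

Because allocations never touch the commit bitmap and deallocations are deferred, two atoms can allocate out of the same bitmap block without fusing, which is exactly the optimization the text describes.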
We can keep atoms independent as long as they touch different data blocks and different internal nodes (in principle we could keep atoms independent even if they touch the same non-data internal node blocks, but this would require logical versioning rather than simple node versioning, and in our implementation we do simple node versioning except for bitmap blocks). Another hot spot in the reiser4 filesystem is the superblock, which contains the free-blocks counter; a similar technique should be applied to allow several atoms to modify the free-blocks counter. In general, points of high contention between multiple atoms can benefit from logical versioning rather than node versioning, and as the system matures more flavors of logical versioning will be added.

= References =

[Hitz94] Dave Hitz, James Lau, and Michael Malcolm, "File System Design for an NFS File Server Appliance", Proceedings of the Winter 1994 USENIX Conference, San Francisco, CA, January 1994, 235-246.

= V4 =

{{wayback|http://www.namesys.com/v4/v4.html|2006-11-13}}

Reasons why Reiser4 is great for you:
* Reiser4 is the fastest filesystem, and here are the benchmarks.
* Reiser4 is an atomic filesystem, which means that your filesystem operations either entirely occur, or they entirely don't, and they don't corrupt due to half occurring. We do this without significant performance losses, because we invented algorithms to do it without copying the data twice.
* Reiser4 uses dancing trees, which obsolete the balanced tree algorithms used in databases (see farther down). This makes Reiser4 more space efficient than other filesystems, because we squish small files together rather than wasting space due to block alignment like they do. It also means that Reiser4 scales better than any other filesystem. Do you want a million files in a directory, and want to create them fast? No problem.
* Reiser4 is based on plugins, which means that it will attract many outside contributors, and you'll be able to upgrade to their innovations without reformatting your disk. If you like to code, you'll really like plugins....
* Reiser4 is architected for military-grade security. You'll find that it is easy to audit the code, and that assertions guard the entrance to every function.

V3 of ReiserFS is used as the default filesystem for SuSE, Lindows, FTOSX, Libranet, Xandros and Yoper. We don't touch the V3 code except to fix a bug, and as a result we don't get bug reports for the current mainstream kernel version. It shipped before the other journaling filesystems for Linux, and is the most stable of them as a result of having been out the longest. We must caution that just as Linux 2.6 is not yet as stable as Linux 2.4, it will also be some substantial time before V4 is as stable as V3.

= Software Engineering Based Reiser4 Design Principles =

== Equal Source Code Access Is A Civil Right ==

Copyright and patent laws were invented to give you an incentive to share your knowledge with the rest of the world in return for a limited-time monopoly on what you shared. That is not the way it works with software, though, because software companies are allowed to keep their source code secret but are still given monopoly rights over their software. There is little meaningful sharing of knowledge when only binaries are shared with the world and all the rest is kept secret. The reasons for the existence of copyright and patent laws have been forgotten, their workings have been twisted, and greed and turf defense are what remain of them. Monopoly interests have taken laws intended to promote progress in the arts and sciences, and now use them to further their own control over us by ensuring that innovations not theirs cannot enter the market for improvements to software.
Think of software objects as forming a society, not yet at the level of an AI society, but still a group of programs interacting, and choosing whether to interact, with each other. Think of social lockout, whether it be in the form of racial discrimination as in the civil rights movement, Mercantilism as happened a few centuries ago, or the endless other forms of division in human society. Is it so surprising that this evil casts its shadow on cyberspace? Is it so surprising that our cybershadows also find ways to engage in social lockout of others? Most of the cyber-world of software lives under tyranny today. We are part of a movement to create a free cyber-world in which we can all participate equally.

Namesys does not oppose copyright laws as they were invented (14-year monopolies which disclosed everything that was temporarily monopolized); it opposes copyright laws as they have been twisted. Namesys opposes unlimited-time monopolies which disclose nothing and lock out all other inventors. Many others in this movement are opposed to copyright law, even in the version in which it was first created. We feel they are not acknowledging that a trade-off is being made, and that this trade-off has value. Yet still we choose to give our software away for free for use with software that is given away for free (e.g. GNU/Linux). Since we don't have a lot of illusions about our ability to entirely change the world, and it is amusing to sell free software, for those who do not want to disclose their software and do not want to give it away for free, we charge a license fee and let them keep their improvements to our software without sharing them. These fees help substantially in allowing us to survive as an organization.
We don't make nearly as much money as we would from charging everyone for usage rights, but we do make just enough to get by, and that is important.;-) We don't really feel that everyone should follow our example and make their software no charge for most users (it is too hard to survive fiscally doing this), but we do think that everyone should disclose their source code, and no one should design their software to exclude working with other software (e.g. Microsoft's Palladium, which makes such a mockery of Athena).

== Software Libre Takes More Than A License --- It Takes A Design ==

Making the source code available to you is not enough by itself to bring you all of the possible benefits of software libre. Many file systems are so difficult to modify that only someone who has worked with the code for years finds it feasible to modify it, and even then small changes can take months of labor due to their ripple effects on the other code and the difficulties of dealing with disk format changes. This is why we have a plugin based architecture in Reiser4, so that it is not just possible, but easy, to improve the software.

Imagine that you were an experimental physicist who had spent his life using only the tools that were in his local hardware store. Then one day you joined a major research lab with a machine shop and a whole bunch of other physicists. All of a sudden you are not using just whatever tools the large tool companies who have never heard of you have made for you. You are now part of a cooperative of physicists all making your own tools, swapping tools with each other, suddenly empowered to have tools that are exactly what you want them to be, or even merely exactly what your colleagues want them to be, rather than what some big tool company, that has to do a market analysis before giving you what you want, wants them to be. That is the transition you will make when you go from version 3 to version 4 of ReiserFS.
The tools your colleagues and sysadmins (your machinists) make are going to be much better for what you need.

== Why Limit Interactions With Objects Strictly? ==

You may wonder why the design we will present is so highly structured, why every object is allowed to control what is done to it by providing a limited interface, and why we pass requests to objects to do things rather than doing things directly to the object. Surely we limit our functionality by doing so, yes? Indeed we do, but is there a reason why the price is worth paying? Is there something that becomes crucial as complexity grows? Chaos theory offers the answer. If you disturb one thing, and disturbing that thing inherently disturbs another thing, which in turn disturbs the first thing plus maybe a whole bunch of other things, and those things all disturb the first thing again, and so on, you get what chaos theory calls a feedback loop. These loops have a marvelous tendency for the end effect of the disturbance to be incalculable, and our inability to calculate such loops is perhaps a significant aspect of our being mere mortals. Of course, as you probably know, most programmers want to be gods, and when they are unable to know what the effect will be of a change they make to their code, they dislike this. As a result, they go to great lengths to reduce the tendency of code changes to the design of one object to have ripple effects upon other objects. A vitally important way to do this is to have very strictly defined interfaces to objects, and for the designer of each object to be able to know that the interface will never be violated when he writes it. This is called "object oriented design", or "structured programming", and if used well it can do a lot to reduce a type of chaotic behavior known as bugs.;-) Verifying the avoidance of interactions that violate the design for an object is a key task in security auditing (inspecting the code to see if it has security holes).
The expressive power of an information system is proportional not to the number of objects that get implemented for it, but instead is proportional to the number of possible effective interactions between objects in it. (Reiser's Law Of Information Economics) This is similar to Adam Smith's observation that the wealth of nations is determined not by the number of their inhabitants, but by how well connected they are to each other. He traced the development of civilization throughout history, and found a consistent correlation between connectivity via roads and waterways, and wealth. He also found a correlation between specialization and wealth, and suggested that greater trade connectivity makes greater specialization economically viable. You can think of namespaces as forming the roads and waterways that connect the components of an operating system. The cost of these connecting namespaces is influenced by the number of interfaces that they must know how to connect to. That cost is, if they are not clever to avoid it, N times N, where N is the number of interfaces, since they must write code that knows how to connect every kind to every kind. One very important way to reduce the cost of fully connective namespaces is to teach all the objects how to use the same interface, so that the namespace can connect them without adding any code to the namespace. Very commonly, objects with different interfaces are segregated into different namespaces. If you have two namespaces, one with N objects, and another with M objects, the expressive power of the objects they connect is proportional to (N times N) plus (M times M), which is less than (N plus M) times (N plus M). Try it on a calculator for some arbitrary N and M. Usually the cost of inventing the namespaces is much less than the cost of the users creating all the objects. 
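The arithmetic claim above --- that (N times N) plus (M times M) is less than (N plus M) times (N plus M) --- can be checked directly; the difference between the two is exactly the 2 times N times M cross-interactions lost when the namespaces are segregated. A minimal sketch (the function name is ours, invented for illustration):

```python
# Interactions possible when N objects and M objects share one namespace,
# versus when they are segregated into two namespaces.
def interactions(*group_sizes):
    """Interactions grow as (number of objects)^2 within each group."""
    return sum(n * n for n in group_sizes)

n, m = 7, 5
unified = interactions(n + m)     # one namespace: (N + M) * (N + M) = 144
segregated = interactions(n, m)   # two namespaces: N*N + M*M = 74

# The shortfall is exactly the 2*N*M cross-interactions lost to segregation.
assert unified - segregated == 2 * n * m
```

Whatever N and M you pick, the unified namespace always wins, which is the point of the paragraph above.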
This is what makes namespaces so exciting to work with: you can have an enormous impact on the productivity of the whole system just by being a bit fanatical in insisting on simplicity and consistency in a few areas. Please remember this analysis later when we describe why we implement everything to support a "file" or "directory" interface, and why we aren't eager to support objects with unnecessarily different namespaces/interfaces --- such as "attributes" that cannot interact with files in all the same ways that files can interact with files.

= Basic Semantics =

To interact with an object you name it, and you say what you want it to do. The filesystem takes the name you give, and looks through things we call directories to find the object, and then gives the object your request to do something.

== Files ==

[Illustration: a character holding an object that looks like a sequence of bytes]

A file is something that tries to look like a sequence of bytes. You can read the bytes, and write the bytes. You can specify what byte to start to read/write from (the offset), and the number of bytes to read/write (the count). [Diagram needed]. You can also cut bytes off of the end of the file.

[Illustration: a character sawing off the end of a file]

Cutting bytes out of the middle or the beginning of a file, and inserting bytes into the middle of a file, are not permitted by any of our current file plugins, all of which implement fairly ancient Unix file semantics, but this is likely to change someday.

=== The Software Engineering Lurking Below File Plugins ===

Your interactions with a file are handled by the file's "plugin". These interactions are structured (in programming, such structures are generally called "interfaces") into a set of limited and defined interactions. (We are too lazy to perform the infinite work of programming plugins to handle infinite types of interactions.) Each way you can interact with a plugin is called a "method". A plugin is composed as a set of such methods.
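The byte-sequence interface described above --- read or write a count of bytes starting at an offset, and cut bytes off the end --- can be exercised from any Unix-like system; a minimal sketch using Python's os-level calls, which take an explicit offset:

```python
import os
import tempfile

# Create a scratch file holding a known sequence of bytes.
fd, path = tempfile.mkstemp()
os.pwrite(fd, b"abcdefghij", 0)     # write count=10 bytes at offset 0

# Read count=4 bytes starting at offset 3.
assert os.pread(fd, 4, 3) == b"defg"

# "Cutting bytes off of the end of the file" is truncation.
os.ftruncate(fd, 5)
assert os.pread(fd, 10, 0) == b"abcde"   # only 5 bytes remain

os.close(fd)
os.remove(path)
```

Note that there is no call here for inserting bytes into the middle of the file: as the text says, ancient Unix file semantics simply do not offer one.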
Among programmers, laziness is considered the highest art form, and we do our best to express our souls in this art. This is why we have layers and layers of laziness built into our plugin architecture. Each method is composed from a library of functions we thought would be useful in constructing plugin methods. Each plugin is composed from a library of methods used by plugins, and a plugin can be considered a one-to-one mapping (that's where you have two sets of things, and for every member of one set, you specify a member of the other set as its match) of every way of interacting with the plugin to a method handling it. For every file, there is a file pluginid. Whenever you attempt to interact with a file, we take the name of the file, find the pluginid for the file, and inside the kernel we have an array of plugins [diagram needed that is suitable for persons who don't know what an array or offset is], and we use the pluginid as the offset of that file's plugin within that array. (An offset is a position relative to something else, and in programming it is typically measured in bytes.) This implies that when you invent a new file plugin, you have to recompile (Programmers don't actually write programs, they got too lazy for that long ago, instead they write instructions for the computer on how to write the program, and when the computer follows these instructions ("source code"), it is called "compiling", which programmers usually pretend was done by them when they speak about it, as in "I recompiled the kernel for my exact CPU this time, and now playing pong is noticeably faster.".) the kernel, and you can only add plugins to the end of the list, and you can never reuse or change pluginids for a plugin, or else you will have to go through the whole filesystem changing all of the pluginids that are no longer correct. 
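The pluginid-as-array-offset scheme just described can be sketched as follows. The plugin names and methods here are hypothetical stand-ins for illustration, not Reiser4's actual plugin table:

```python
# Hypothetical sketch: a kernel-side table of file plugins, indexed by
# pluginid. Each plugin is a mapping from operation name to method.
def unix_file_read(state):
    return "reading ordinary byte sequence"

def crypto_file_read(state):
    return "decrypting, then reading"

PLUGIN_TABLE = [                     # index == pluginid; append-only, and
    {"read": unix_file_read},        # ids are never reused or reordered,
    {"read": crypto_file_read},      # exactly as the text above requires
]

def dispatch(pluginid, operation, state=None):
    plugin = PLUGIN_TABLE[pluginid]  # the pluginid is an offset into the table
    return plugin[operation](state)

assert dispatch(0, "read") == "reading ordinary byte sequence"
assert dispatch(1, "read") == "decrypting, then reading"
```

The append-only, never-reuse discipline in the comments is why, in the text above, adding a plugin means recompiling the kernel rather than renumbering anything stored on disk.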
Someday in a later version we will revise this so that plugins are "dynamically loadable" (which is when you can add something to a program while it is running), and you can add support for new plugins to a running kernel. When we do that we will carefully benchmark and ensure that there is no loss of performance from using dynamic loading (or we won't do it).

Programs are often "layered", which is when the program is divided into layers, and each layer only talks to the layer immediately above it, or immediately below it, and never talks to a part of the program two levels below it, etc. This reduces the complexity of the interfaces for the various parts of the program, and most of the complexity of a program is in coding its interfaces.

[Illustration: characters each communicating with adjacent characters only]

Reiser4 has a "semantic layer", and this semantic layer concerns itself with naming objects and specifying what to do to the objects, and doesn't concern itself with such things as how to pack objects into particular places on disk or in the tree. An IO to a file may affect more than one physical sequence of bytes, or no physical sequence of bytes; it may affect the sequences of bytes offered by other files to the semantic layer, and the file plugin may invoke other plugins and delegate work to them, but its interface is structured for offering the caller the ability to read and/or write what the caller sees as being a single sequence of bytes. Appearances are what is wanted. When we say that security attributes are implemented as files, we mean that security attributes look like a sequence of bytes, but the security attributes may be stored in some compressed form that perhaps might be of fixed length, or even be just a single bit.
For the filesystem to offer the benefits of simplicity it need merely provide a uniform appearance that all things it stores are sequences of bytes, and there is nothing to prevent it from gaining efficiency through using many different storage implementations to offer this uniform appearance. For many files it is valuable for them to support efficient tree traversal to any offset in the sequence of bytes. It is not required though, and Unix/Gnu/Linux has traditionally supported some types of files which could not do this. A pipe will allow you to take the output of one command, and connect it to the input of another command, and each of the commands will see the pipe as a file. This pipe is an example of a file for which you cannot simply jump to the middle of the file efficiently; instead you must go through it from beginning to end in sequential order.

== Names and Objects ==

A name is a means of selecting an object. An object is anything that acts as though it is a single unified entity. What is an object is context dependent. For instance, if you tell an object to delete itself, many distinctly named entities (that are distinct objects in other ways such as reading) might well disappear as though they are a single object in response to the delete request. A namespace is a mapping of names to objects. Filesystems, databases, search engines, and environment variable names within shells are all examples of namespaces. The early papers using the term tended to seek to convey that namespaces have commonality in their structure, are not fundamentally different, should be based on common design principles, and should be unified. Such unification is a bit of a quest for a holy grail. In British mythology King Arthur sent his knights out on a quest for the holy grail, and if only they could become worthy of it, it would appear to them. None of them found it, and yet the quest made them what they became.
Namespaces will never be unified, but the closer we can come to it, the more expressive power the OS will have. Reiser4 seeks to create a storage layer effective for such an eventually unified namespace, and gives it a semantic layer with some minor advantages over the state of the art. Later versions will add more and more expressive semantics to the storage layer. Finding objects is layered. The semantic layer takes names and converts them into keys (we call this "resolving" the name). The storage layer (which contains the tree traversing code) takes keys and finds the bytes that store the parts of the object. Keys are the fundamental name used by the Reiser4 tree. They are the name that the storage layer at the bottom of it all understands. They can be used to find anything in the tree, not just whole objects, but parts of objects as well. Everything in the tree has exactly one key. Duplicate keys are allowed, but their use usually means that all duplicates must be examined to see if they really contain what is sought, and so duplicates are usually rare if high performance is desired. Allowing duplicates can allow keys to be more compact in some circumstances (e.g. hashed directory entries). An objectid cannot be used for finding an object, only keys can. Objectids are used to compose keys so as to ensure that keys are unique.

== Ordering of Name Components ==

When designing the naming system described in the future vision whitepaper I broke names from human and computer languages into their pieces, and then looked at their pieces to see which ones differed from each other in meaningful ways vs. which pieces were different expressions that provided the same functionality. (In more formal language, I would say that I systematically decomposed the ways of naming things that we use in human and computer languages into orthogonal primitives, and then determined their equivalence classes.)
I then selected one way of expression from each set of ways that provided equivalent functionality. (Since that whitepaper is focused on what is not yet implemented, the whitepaper does not list all of the equivalence classes for names, but instead describes those which I thought I could say something interesting to the reader about. For instance, the NOT operator is simply unmentioned in it, as I really have nothing interesting to say about NOT, though it is very useful and will be documented when implemented.) The ordering of two components of a name either has meaning, or it does not. If the resolution of one component of the name depends on what is named by another component, then that pair of name components forms a hierarchical name. Hierarchy can be indicated by means other than ordering. Many human languages indicate structure by use of suffix or tag mechanisms (e.g. Russian and Japanese). The syntactical mechanism one chooses to express hierarchy does not determine the possible semantics one can express so long as at least one effective method for expressing hierarchy is allowed. I chose to only offer one expression from each equivalence class of naming primitives, and here I chose the '/' separated file pathname expression traditional to Unix for pragmatic compatibility with existing operating systems. Reiser4 handles only hierarchical names, and non-hierarchical names are planned only for SSN Reiserfs.

== Directories ==

Hierarchical names are implemented in Reiser4 by use of directories. The first component of a hierarchical name is the name of the directory, and the components that follow are passed to the directory to interpret. We use '/' to separate the components of a hierarchical name. Directories may choose to delegate parts of their task to their sub-directories.
The unix directory plugin, when supplied with a name, will use the part of the name before the first / to select a sub-directory (if there is a / in what it is resolving), and delegate resolving the part of the name after the first / to the sub-directory. A directory can employ any arbitrary method at all of resolving the name components passed to it, so long as it returns a set of keys of objects as the result. In Reiser4, this set of keys always contains exactly one member, but this is designed to change in SSN Reiserfs. (Reiser4 also needs to interact with a standard interface for Unix filesystems called VFS (Virtual File System), and directories are also designed to be able to return what VFS understands, which we won't go into here.) Directories will also return a list of names when asked. This list is not required to be a complete list of all names that they can resolve, and sometimes it is not desirable that it be so. Names can be hidden names in Reiser4. Directory plugins may be able to resolve more names than they can list, especially if they are written such that the number of names that they can resolve is infinite. In particular, such names can resolve to objects that behave like ordinary files (with respect to the standard file system interface: read, write, readdir, etc.), but are not backed by the storage layer. Such objects are called "pseudo files". Here is a list of pseudo files currently implemented in Reiser4 with a description of their semantics.

=== The Unix Directory Plugin ===

The unix directory plugin implements directories by storing a set of directory entries per directory. These directory entries contain a name, and a key.
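The delegation scheme described above --- split on the first '/', resolve the head locally, hand the tail to the sub-directory --- can be sketched recursively. The directory contents here are invented for illustration:

```python
# Hypothetical sketch of unix-directory-style name resolution: each
# directory maps an entry name either to a key (a leaf object) or to
# another directory, to which the rest of the name is delegated.
ROOT = {"usr": {"bin": {"ls": "key-of-ls"}}, "motd": "key-of-motd"}

def resolve(directory, name):
    if "/" in name:
        head, tail = name.split("/", 1)       # component before the first '/'
        return resolve(directory[head], tail) # delegate the remainder
    return directory[name]                    # no '/': resolve locally

assert resolve(ROOT, "usr/bin/ls") == "key-of-ls"
assert resolve(ROOT, "motd") == "key-of-motd"
```

Each directory here happens to be a plain mapping, but as the text notes, any resolution method at all would do, so long as it yields keys.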
When given a name to resolve, the unix directory plugin finds the directory entry containing that name, and then returns the key that is in the directory entry (more precisely, since a key selects not just the file but a particular byte within a file, it returns that part of the key which is sufficient to select the file, and which is sufficient to allow the code to determine what the full keys for the file's various parts are when the byte offset and some other fields (like item type) are added to the partial key to form a whole key). The key can then be used by the tree storage layer to find all the pieces of that which was named.

==== Some Historical Details Of Design Flaws In The Unix Directory Interface ====

Unix differs from Multics, in that Multics defined a file to be a sequence of elements (the elements could be bytes, directory entries, or something else....), while Unix defines a file to be purely a sequence of bytes. In Multics directories were then considered to be a particular type of file which was a sequence of directory entries. For many years, all implementations of Unix directories were as sequences of bytes, and the notion of location within a Unix directory is tied not to a name as you might expect, but to a byte offset within the directory. The problem is that one is using a byte offset to represent a location whose true meaning is not a byte offset but a directory entry, and doing so for a particular file in a system which meaningfully names that file not by byte offset within the directory but by filename. Various efforts are being made in the Unix community to pretend that this byte offset is something more general than a byte offset, and they often try to do so without increasing the size used to store the thing which they pretend is not a byte offset. Since byte offsets are normally smaller than filenames are allowed to be, the result is ugliness and pathetic kludges.
Trust me that you would rather not know about the details of those kludges unless you absolutely have to, and let me say no more.

==== Directories Are Unordered ====

Unix/Linux makes no promises regarding the order of names within directories. The order in which files are created is not necessarily the order in which names will be listed in a directory, and the use of lexicographic (alphabetic) order is surprisingly rare. The unix utilities typically sort directory listings after they are returned by the filesystem, which is why it seems like the filesystem sorts them, and is why listing very large directories can be slow. (Our current default plugin sorts filenames that are less than 15 letters long lexicographically. For those that are more than 15 characters long it sorts them first by their first 8 letters and then by the hash of the whole name.) There is value to allowing the user to specify an arbitrary order for names using an arbitrary ordering function the user supplies. This is not done in Reiser4, but is planned as a feature of later versions. Allowing the creation of a hash plugin is a limited form of this that is currently implemented.

== Files That Are Also Directories ==

In Reiser4 (but not ReiserFS 3) an object can be both a file and a directory at the same time. If you access it as a file, you obtain the named sequence of bytes. If you use it as a directory you can obtain files within it, directory listings, etc. There was a lengthy discussion on the Linux Kernel Mailing List about whether this was technically feasible to do. I won't reproduce it here except to summarize that Linus showed that this was feasible without "breaking" VFS. Allowing an object to be both a file and a directory is one of the features necessary to compose the functionality present in streams and attributes using files and directories.
To implement a regular unix file with all of its metadata, we use a file plugin for the body of the file, a directory plugin for finding file plugins for each of the metadata, and particular file plugins for each of the metadata. We use a unix_file file plugin to access the body of the file, and a unix_file_dir directory plugin to resolve the names of its metadata to particular file plugins for particular metadata. These particular file plugins for unix file metadata (owner, permissions, etc.) are implemented to allow the metadata normally used by unix files to be quite compactly stored.

=== Hidden Directory Entries ===

A file can exist but not be visible when using readdir in the usual way. WAFL does this with the .snapshots directory; it works well for them without disturbing users. This is useful for adding access to a variety of new features and their applications without disturbing the user when they are not relevant.

== New Security Attributes and Set Theoretic Semantic Purity ==

[Illustration: a character holding primitive icons]

=== Minimizing Number Of Primitives Is Important In Abstract Constructions ===

To a theoretician it is extremely important to minimize the number of primitives with which one achieves the desired functionality in an abstract construction. It is a bit hard to explain why this is so, but it is well accepted that breaking an abstract model into more basic primitives is very important. A not very precise explanation of why is to say that by breaking complex primitives into their more basic primitives, then recombining those basic primitives differently, you can usually express new things that the original complex primitives did not express. Let's follow this grand tradition of theoreticians and see what happens if we apply it to Gnu/Linux files and directories.

== Can We Get By Using Just Files and Directories (Composing Streams And Attributes From Files And Directories)? ==

In Gnu/Linux we have files, directories, and attributes. In NTFS they also have streams.
Since Samba is important to Gnu/Linux, there frequently are requests that we add streams to ReiserFS. There are also requests that we add more and more different kinds of attributes using more and more different APIs. Can we do everything that can be done with {files, directories, attributes, streams} using just {files, directories}? I say yes--if we make files and directories more powerful and flexible. I hope that by the end of reading this you will agree. Let us have two basic objects. A file is a sequence of bytes that has a name. A directory is a name space mapping names to a set of objects "within" the directory. We connect these directory name spaces such that one can use compound names whose subcomponents are separated by a delimiter '/'. What is missing from files and directories that attributes and streams offer? In ReiserFS 3, there exist file attributes. File attributes are out-of-band data describing the sequence of bytes which is the file. For example, the permissions defining who can access a file, or the last modification time, are file attributes. File attributes have their own API; creating new file attributes creates new code complexity and compatibility issues galore. ACLs are one example of new file attributes users want. Since in Reiser4 files can also be directories, we can implement traditional file attributes as simply files. To access a file attribute, one need merely name the file, followed by a '/', followed by an attribute name. That is: a traditional file will be implemented to possess some of the features of a directory; it will contain files within the directory corresponding to file attributes which you can access by their names; and it will contain a file body which is what you access when you name the "directory" rather than the file. Unix currently has a variety of attributes that are distinct from files (ACLs, permissions, timestamps, other mostly security related attributes, ...).
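An object that answers both as a file and as a directory, with its attributes exposed as files inside it, can be sketched as one object carrying both interfaces. The class and attribute names below are purely illustrative stand-ins, not Reiser4 code:

```python
# Hypothetical sketch: an object usable as a file (read its body) and as
# a directory (look up attribute files inside it by name).
class FileDir:
    def __init__(self, body: bytes, attributes: dict):
        self.body = body
        self.attributes = attributes   # name -> bytes; each acts as a file

    def read(self) -> bytes:
        """Accessed as a file: return the byte sequence of the body."""
        return self.body

    def lookup(self, name: str) -> bytes:
        """Accessed as a directory: resolve an attribute file by name."""
        return self.attributes[name]

f = FileDir(b"hello world", {"owner": b"hans", "permissions": b"0644"})
assert f.read() == b"hello world"     # naming the object yields the body
assert f.lookup("owner") == b"hans"   # naming object/owner yields an attribute
```

This is the shape of the claim above: nothing beyond files and directories is needed, provided one object may present both interfaces at once.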
This is because a variety of people needed this feature and that, and there was no infrastructure that would allow implementing the features as fully orthogonal features that could be applied to any file. Reiser4 will create that infrastructure.

=== List Of Features Needed To Get Attribute And Stream Functionality From Files And Directories ===

* api efficient for small files
* efficient storage for small files
* plugins, including plugins that can compress a file serving as an attribute into a single bit
* files that also act as directories when accessed as directories
* inheritance (includes file aggregation)
* constraints
* transactions
* hidden directory entries

Each of these additional features is a feature that would benefit the filesystem. So we add them in v4.

= Basic Tree Concepts =

== Trees, Nodes, and Items ==

One way of organizing information is to put it into trees. When we organize information in a computer, we typically sort it into piles (nodes we call them), and there is a name (a pointer) for each pile that the computer will be able to use to find the pile.

[Figure 1. One Example Of A Tree: a height = 4, 4 level, fanout = 3, balanced tree. It starts with a root node, then traverses 2 internal nodes, and ends with the leaf nodes, which hold the data and have no children.]

Some of the nodes can contain pointers, and we can go looking through the nodes to find those pointers to (usually other) nodes. We are particularly interested in how to organize so that we can find things when we search for them. A tree is an organization structure that has some useful properties for that purpose.

Definition of Tree:
# A tree is a set of nodes organized into a root node, and zero or more additional sets of nodes called subtrees.
# Each of the subtrees is a tree.
# No node in the tree points to the root node, and exactly one pointer from a node in the tree points to each non-root node in the tree.
# The root node has a pointer to each of its subtrees, that is, a pointer to the root node of the subtree.

== Fine Points of the Definition ==

[Figure 2. The simplest tree: the absolutely most trivial of all graphs, the single, isolated node.]

[Figure 3. A trivial, linear tree: a trivial, connected, linear (unary) graph --- a linear sequence of nodes connected by paths (edges, pointers).]

It is interesting to argue over whether finite should be a part of the definition of trees. There are many ways of defining trees, and which is the best definition depends on what your purpose is. Donald Knuth (a well known author of algorithm textbooks) supplies several definitions of tree. As his primary definition of tree he even supplies one which has no pointers/edges/lines in the definition, just sets of nodes. Knuth defines trees as being finite sets of nodes. Reiser4 uses a finite tree (the number of nodes is limited). There are papers on infinite trees on the Internet. I think it more appropriate to consider finite an additional qualifier on trees, rather than bundling finite into the definition. However, I personally only deal with finite trees in my storage layer research. It is interesting to consider whether storage layers are inherently more motivated than semantic layers to limit themselves to finite trees rather than infinite trees. This is where some writers would say ".... is left as an exercise for the reader". :-) Oh the temptation.... I will remind the reader of my explanation of why storage layer trees are more motivated to be acyclic, and, at the cost of some effort at honesty, constrain myself to saying that doing more than providing that hint is beyond my level of industry.;-)

Edge is a term often used in tree definitions. A pointer is unidirectional (you can follow it from the node that has it to the node it points to, but you cannot follow it back from the node it points to to the node that has it).
An edge is bidirectional (you can follow it in both directions). Here are three alternative tree definitions, which are interesting in that they are mathematically equivalent to each other, though they are not equivalent to the definition I supplied because edges are not equivalent to pointers. For all three of these definitions, let there be not more than one edge connecting the same two nodes:

* a set of vertices (aka points) connected by edges (aka lines) for which the number of edges is one less than the number of vertices
* or a set of vertices connected by edges which has no cycles (a cycle is a path from a vertex to itself)
* or a set of vertices connected by edges for which there is exactly one path connecting any two vertices

The three alternative definitions do not have a unique root in their tree, and such trees are called free trees. The definition I supplied is a definition of a rooted tree, not a free tree. It also has no cycles, it has one less pointer than it has nodes, and there is exactly one path from the root to any node. Please feel encouraged to read Knuth's writings for more discussions of these topics.

= Graphs vs. Trees =

Consider the purposes for which you might want to use a graph, and those for which you might want to use a tree. In a tree there is exactly one path from the root to each node in the tree, and a tree has the minimum number of pointers sufficient to connect all the nodes. This makes it a simple and efficient structure. Trees are useful for when efficiency with minimal complexity is what is desired, and there is no need to reach a node by more than one route. Reiser4 has both graphs and trees, with trees used for when the filesystem chooses the organization (in what we call the storage layer, which tries to be simple and efficient), and graphs for when the user chooses the organization (in the semantic layer, which tries to be expressive so that the user can do whatever he wants).
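The tree properties just listed (one less pointer than nodes, exactly one path from the root to each node) can be spot-checked on a small rooted tree; the node labels here are arbitrary:

```python
# Spot-check on a small rooted tree, stored as child -> parent pointers:
# the number of pointers (edges) is one less than the number of nodes,
# and following pointers upward gives the unique path to the root.
parent = {"b": "a", "c": "a", "d": "b", "e": "b"}   # root "a" has no parent
nodes = set(parent) | set(parent.values())

assert len(parent) == len(nodes) - 1   # edges = nodes - 1

def path_to_root(node):
    """Follow pointers upward; in a tree this path is unique."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path

assert path_to_root("e") == ["e", "b", "a"]
```

Exactly one pointer reaches each non-root node, which is why the upward walk can never branch or cycle.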
= Ordering The Tree Aids Searching Through It =

== Keys ==

We assign everything stored in the tree a key. We find things by their keys. Use of keys gives us additional flexibility in how we sort things, and if the keys are small, it gives us a compact means of specifying enough to find the thing. It also limits what information we can use for finding things. This limit restricts its usefulness, and so we have a storage layer, which finds things by keys, and a semantic layer, which has a rich naming system. The storage layer chooses keys for things solely to organize storage in a way that will improve performance, and the semantic layer understands names that have meaning to users. As you read, you might want to think about whether this is a useful separation that allows freedom in adding improvements that aid performance in the storage layer, while escaping paying a price for the side effects of those improvements on the flexible naming objectives of the semantic layer.

== Choosing Which Subtree ==

We start our search at the root, because from the root we can reach every other node. How do we choose which subtree of the root to go to from the root? The root contains pointers to its subtrees. For each pointer to a subtree there is a corresponding left delimiting key. Pointers to subtrees, and the subtrees themselves, are ordered by their left delimiting key. A subtree pointer's left delimiting key is equal to the least key of the things in the subtree. Its right delimiting key is larger than the largest key in the subtree, and it is the left delimiting key of the next subtree of this node. Each subtree contains only things whose keys are at least equal to the left delimiting key of its pointer, and are not more than its right delimiting key. If there are no duplicate keys in the tree, then each subtree contains only things whose keys are less than its right delimiting key.
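The subtree choice just described is a simple ordered search. A minimal sketch (the function and variable names are illustrative, not Reiser4's identifiers), assuming no duplicate keys:

```python
import bisect

def choose_subtree(left_keys, child_ptrs, search_key):
    """Pick the child whose delimiting-key range covers search_key.

    left_keys is the sorted list of left delimiting keys; child i holds
    keys from left_keys[i] up to (with no duplicates, not including)
    left_keys[i + 1], which is also child i's right delimiting key.
    """
    # Find the first delimiting key strictly greater than search_key;
    # the child just before that position covers the key.
    i = bisect.bisect_right(left_keys, search_key) - 1
    if i < 0:
        raise KeyError("search key is smaller than the least key in the node")
    return child_ptrs[i]
```

Repeating this choice at each internal node walks a single path from the root down to the leaf level, which is all an exact-key lookup needs.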
If there are no duplicate keys, then by looking within a node at its pointers to subtrees and their delimiting keys we know which subtree of that node contains the thing we are looking for. Duplicate keys are a topic for another time. For now I will just hint that when searching through objects with duplicate keys we find the first of them in the tree, and then we search through all duplicates one-by-one until we find what we are looking for. Allowing duplicate keys can allow for smaller keys, so there is sometimes a tradeoff between key size and the average frequency of such inefficient linear searches. Using duplicate keys can also allow ordering objects with the same key by creation time, if one defines one's insertion algorithms such that they always insert at the end of a set of duplicate keys. The contents of each node in the tree are sorted within the node. So, the entire tree is sorted by key, and for a given key we know just where to go to find at least one thing with that key.

== Nodes ==

=== Leaves, Twigs, and Branches ===

Leaves are nodes that have no children. Internal nodes are nodes that have children.

Figure 4. A height = 4, fanout = 3, balanced tree. A search will start with the root node, the sole level 4 internal node, traverse 2 more internal nodes, and end with a leaf node which holds the data and has no children.

A node that contains items is called a formatted node. If an object is large, and is not compressed and doesn't need to support efficient insertions (compressed objects are special because they need to be able to change their space usage when you write to their middles, because the compression might not be equally efficient for the new data), then it can be more efficient to store it in nodes without any use of items at all.
We do so by default for objects larger than 16k. Unformatted leaves (unfleaves) are leaves that contain only data, and do not contain any formatting information. Only leaves can contain unformatted data. Pointers are stored in items, and so all internal nodes are necessarily formatted nodes. Pointers to unfleaves are different in their structure from pointers to formatted nodes. Extent pointers point to unfleaves. An extent is a sequence of unfleaves that are contiguous in block number order and belong to the same object. An extent pointer contains the starting block number of the extent, and a length. [diagram needed] Because the extent belongs to just one object, we can store just one key for the extent, and then we can calculate the key of any byte within that extent. If the extent is at least 2 blocks long, extent pointers are more compact than regular node pointers would be. Node pointers are pointers to formatted nodes. We do not yet have a compressed version of node pointers, but they are probably soon to come. Notice how with extent pointers we don't have to store the delimiting key of each node pointed to, while with node pointers we do. We will probably introduce key compression at the same time we add compressed node pointers. One would expect keys to compress well, since they are sorted into ascending order. We expect our node and item plugin infrastructure will make such features easy to add at a later date. Twigs are parents of leaves. Extent pointers exist only in twigs. This is a very controversial design decision which I will discuss a bit later. Branches are internal nodes that are not twigs. You might think we would number the root level 1, but since the tree grows at the top, it turns out to be more useful to number as 1 the level with the leaves where object data is stored. The height of the tree will depend upon how many objects we have to store and what the fanout rate (average number of children) of the internal and twig nodes will be.
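That dependence of height on fanout is easy to make concrete. A small sketch (the function name and the numbers in the examples are illustrative only):

```python
def min_levels(leaf_nodes, fanout):
    """Smallest number of levels a height balanced tree with the given
    fanout needs so that its bottom level can hold leaf_nodes leaves."""
    levels, reach = 1, 1  # a 1 level tree is a single (leaf) node
    while reach < leaf_nodes:
        reach *= fanout   # each added level multiplies the reach by the fanout
        levels += 1
    return levels
```

With fanout 3 it takes 4 levels to reach 27 leaves, matching the small trees drawn in the figures; with a fanout in the hundreds, which is closer to practice, 4 levels already cover millions of leaves.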
For reasons of code simplicity, we find it easiest to implement Reiser4 such that it has a minimum height of 2, and the root is always an internal node. There is nothing deeper than judicious laziness to this: it simplifies the code not to deal with one node trees, and nobody cares about the waste of space. An example of a Reiser4 tree:

Figure 5. This Reiser4 tree is a 4 level, balanced tree with a fanout of 3, starting with a root node, traversing branch nodes, including the internal nodes called twig nodes (a Reiser4 feature), and ending with the leaf nodes which hold the data and have no children. In practice Reiser4 fanout is much higher and varies from node to node, but a 4 level tree diagram with 16 million leaf nodes won't fit easily onto my monitor, so I drew something smaller.... ;-)

=== Size of Nodes ===

We choose to make the nodes equal in size. This makes it much easier to allocate the unused space between nodes, because it will be some multiple of the node size, and there are no problems of space being free but not large enough to store a node. Also, disk drives have an interface that assumes equal size blocks, which they find convenient for their error-correction algorithms. If having the nodes be equal in size is not very important, perhaps due to the tree fitting into RAM, then a class of algorithms called skip lists is worthy of consideration. Reiser4 nodes are usually equal to the size of a page, which if you use Gnu/Linux on an Intel CPU is currently 4096 (4k) bytes. There is no measured empirical reason to think this size is better than others; it is just the one that Gnu/Linux makes easiest and cleanest to program into the code, and we have been too busy to experiment with other sizes.

== Sharing Blocks Saves Space ==

If nodes are of equal size, how do we store large objects? We chop them into pieces. We call these pieces items. Items are sized to fit within a single node. Conventional filesystems store files in whole blocks.
Roughly speaking, this means that on average half a block of space is wasted per file because not all of the last block of the file is used. If a file is much smaller than a block, then the space wasted is much larger than the file. It is not effective to store such typical database objects as addresses and phone numbers in separately named files in a conventional filesystem because it will waste more than 90% of the space in the blocks it stores them in. By putting multiple items within a single node in Reiser4, we are able to pack multiple small pieces of files into one block. Our space efficiency is roughly 94% for small files. This does not count per item formatting overhead, whose percentage of total space consumed depends on average item size, and for that reason is hard to quantify. Aligning files to 4k boundaries does have advantages for large files though. When a program wants to operate directly on file data without going through system calls to do it, it can use mmap() to make the file data part of the process's directly accessible address space. Due to some implementation details mmap() needs file data to be 4k aligned, and if the data is already 4k aligned, it makes mmap() much more efficient. In Reiser4 the current default is that files that are larger than 16k are 4k aligned. We don't yet have enough empirical data and experience to know whether 16k is the precise optimal default value for this cutoff point, but so far it seems to at least be a decent choice. == Items == Nodes in the tree are smaller than some of the objects they hold, and larger than some of the objects they hold, so how do we store them? One way is to pour them into items. An item is a data [[containers|container]] that is contained entirely within a single node, and it allows us to manage space within nodes. 
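Since items are how space within a node is managed, the bookkeeping this requires is simple arithmetic. A toy sketch (the node and header sizes here are made up for illustration, not Reiser4's real format constants):

```python
NODE_SIZE = 4096   # bytes per node (illustrative)
HEAD_SIZE = 24     # bytes of per-item formatting overhead (illustrative)
BLOCK_HEAD = 40    # bytes of per-node header (illustrative)

def free_space(item_lengths):
    """Bytes left in a node holding items of the given body lengths,
    counting one fixed-size head per item plus the node header."""
    used = BLOCK_HEAD + sum(item_lengths) + HEAD_SIZE * len(item_lengths)
    return NODE_SIZE - used

def fits(item_lengths, new_item_len):
    """Can one more item (its body plus its head) be inserted?"""
    return free_space(item_lengths) >= new_item_len + HEAD_SIZE
```

Packing many 100-byte items this way wastes only the per-item heads, rather than rounding every small object up to a whole 4k block, which is the space saving this section describes.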
For the default 4.0 node format, every item has a key, an offset to where in the node the item body starts, a length of the item body, and a plugin id that indicates what type of item it is. Items allow us to not have to round up to 4k the amount of space required to store an object.

The structure of an item (the head and the body are stored separately within the node):

 Item_Body
 . . separated . .
 Item_Head: | Item_Key | Item_Offset | Item_Length | Item_Plugin_id |

=== Types Of Items ===

Reiser4 includes many different kinds of items designed to hold different kinds of information:
* static_stat_data: holds the owner, permissions, last access time, creation time, last modification time, size, and the number of links (names) to a file.
* cmpnd_dir_item: holds directory entries, and the keys of the files they link to.
* extent pointers: explained above
* node pointers: explained above
* bodies: hold parts of files that are not large enough to be stored in unfleaves.

== Units ==

We call a unit that which we must place as a whole into an item, without splitting it across multiple items. When traversing an item's contents it is often convenient to do so in units:
* For body items the units are bytes.
* For directory items the units are directory entries. The directory entries contain a name and a key of the file named (or at least the item plugin can pretend they do; in practice the name and key may be compressed).
* For extent items the units are extents. Extent items only contain extents from the same file.
* For static_stat_data the whole stat data item is one indivisible unit of fixed size.

== What the Default Node Formats For ReiserFS 4.0 Look Like ==

An unformatted leaf node (unfleaf node), which is the only node without a Node_Header, has the trivial structure:

 | .................... raw data, with no formatting .................... |
A formatted leaf node has the structure:

 | Block_Head | Item_Body0 | Item_Body1 | - - - | Item_Bodyn | ....Free Space.... | Item_Headn | - - - | Item_Head1 | Item_Head0 |

A twig node, whose item bodies are node pointers and extent pointers, has the structure:

 | Block_Head | Item_Body0 = NodePointer0 | Item_Body1 = ExtentPointer1 | Item_Body2 = NodePointer2 | Item_Body3 = ExtentPointer3 | - - - | Item_Bodyn = NodePointern | ....Free Space.... | Item_Headn | - - - | Item_Head0 |

A branch node has the structure:

 | Block_Head | Item_Body0 = NodePointer0 | - - - | Item_Bodyn = NodePointern | ....Free Space.... | Item_Headn | - - - | Item_Head0 |

= Tree Design Concepts =

== Height Balancing versus Space Balancing ==

Height balanced trees are trees such that each possible search path from root node to leaf node has exactly the same length (length = the number of nodes traversed from root node to leaf node). For instance, the height of the tree in Figure 1 is four, while the height of the left hand tree in Figure 1.3 is three, and that of the single node in Figure 2 is 1. The term balancing is used for several very distinct purposes in the balanced tree literature. Two of the most common are: to describe balancing the height, and to describe balancing the space usage within the nodes of the tree. These quite different definitions are unfortunately a classic source of confusion for readers of the literature. Most algorithms for accomplishing height balancing do so by only growing the tree at the top. Thus the tree never gets out of balance.

Figure 6. An unbalanced tree: a 4 level tree with fanout n = 3 that has lost some nodes to deletions and needs to be balanced.

== Three Principal Considerations in Tree Design ==

Three of the principal considerations in tree design are:
* the fanout rate (see below)
* the tightness of packing
* the amount of the shifting of items in the tree from one node to another that is performed (which creates delays due to waiting while things move around in RAM, and on disk).
== Fanout ==

The fanout rate n refers to how many nodes may be pointed to by each level's nodes (see Figure 7). If each node can point to n nodes of the level below it, then starting from the top, the root node points to n internal nodes at the next level, each of which points to n more internal nodes at its next level, and so on: m levels of internal nodes can point to n<sup>m</sup> leaf nodes containing items in the last level. The more you want to be able to store in the tree, the larger you have to make the fields in the key that first distinguish the objects (the objectids), and then select parts of the object (the offsets). This means your keys must be larger, which decreases fanout (unless you compress your keys, but that will wait for our next version....).

Figure 7. Three 4 level, height balanced trees with fanouts n = 1, 2, and 3. The first graph is a four level tree with fanout n = 1. It has just four nodes, starts with the (red) root node, traverses the (burgundy) internal and (blue) twig nodes, and ends with the (green) leaf node which contains the data. The second tree, with 4 levels and fanout n = 2, starts with a root node, traverses 2 internal nodes, each of which points to two twig nodes (for a total of four twig nodes), and each of these points to 2 leaf nodes for a total of 8 leaf nodes.
Lastly, a 4 level, fanout n = 3 tree is shown which has 1 root node, 3 internal nodes, 9 twig nodes, and 27 leaf nodes.

== What Are B+Trees, and Why Are They Better than B-Trees ==

It is possible to store not just pointers and keys in internal nodes, but also to store the objects those keys correspond to in the internal nodes. This is what the original B-tree algorithms did. Then B+trees were invented, in which only pointers and keys are stored in internal nodes, and all of the objects are stored at the leaf level.

Figure 8. (A B-tree, with objects stored in internal nodes.)

Figure 9. (A B+tree, with objects stored only at the leaf level.)

Warning! I found from experience that most persons who don't first deeply understand why B+trees are better than B-trees won't later understand explanations of the advantages of putting extents on the twig level rather than using BLOBs. The same principles that make B+trees better than B-trees also make Reiser4 faster than using BLOBs like most databases do. So make sure this section fully digests before moving on to the next section, ok? ;-)

=== B+Trees Have Higher Fanout Than B-Trees ===

Fanout is increased when we put only pointers and keys in internal nodes, and don't dilute them with object data. Increased fanout increases our ability to cache all of the internal nodes, because there are fewer internal nodes. Often persons respond to this by saying, "but B-trees cache objects, and caching objects is just as valuable". The answer is that it is not, on average. Of course, discussing averages makes the discussion much harder. We need to discuss some cache design principles for a while before we can get to this.

= Cache Design Principles =

== Reiser's Untie The Uncorrelated Principle of Cache Design ==

Tying the caching of things whose usage does not strongly correlate is bad. Suppose:
* you have two sets of things, A and B.
* you need things from those two sets at semi-random, with there existing a tendency for some items to be needed much more frequently than others, but which items those are can shift slowly over time.
* you can keep things around after you use them in a cache of limited size. * you tie the caching of every thing from A to the caching of another thing from B. (that means, whenever you fetch something from A into the cache, you fetch its partner from B into the cache) Then this increases the amount of cache required to store everything recently accessed from A. If there is a strong correlation between the need for the two particular objects that are tied in each of the pairings, stronger than the gain from spending those cache resources on caching more members of B according to the LRU algorithm, then this might be worthwhile. If there is no such strong correlation, then it is bad. But wait, you might say, you need things from B also, so it is good that some of them were cached. Yes, you need some random subset of B. The problem is that without a correlation existing, the things from B that you need are not especially likely to be those same things from B that were tied to the things from A that were needed. This tendency to inefficiently tie things that are randomly needed exists outside the computer industry. For instance, suppose you like both popcorn and sushi, with your need for them on a particular day being random. Suppose that you like movies randomly. Suppose a theater requires you to eat only popcorn while watching the movie you randomly found optimal to watch, and not eat sushi from the restaurant on the corner while watching that movie. Is this a socially optimum system? Suppose quality is randomly distributed across all the hot dog vendors: if you can only eat the hot dog produced by the best movie displayer on a particular night that you want to watch a movie, and you aren't allowed to bring in hot dogs from outside the movie theater, is it a socially optimum system? Optimal for you? Tying the uncorrelated is a very common error in designing caches, but it is still not enough to describe why B+Trees are better. 
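The cost of tying can be seen in a toy LRU simulation; everything here (the function name, the workload, and the cache size) is invented for illustration:

```python
from collections import OrderedDict

def hit_rate(accesses, cache_slots, tied=False):
    """LRU hit rate for a stream of accesses to A-items; if tied, every
    fetch of A[i] also drags its B partner into the cache, occupying a
    second slot even though the partner is never accessed itself."""
    cache = OrderedDict()  # insertion order doubles as recency order
    hits = 0
    for a in accesses:
        if a in cache:
            hits += 1
            cache.move_to_end(a)       # mark as most recently used
        else:
            cache[a] = True
            if tied:
                cache[("B", a)] = True  # the tied, uncorrelated partner
            while len(cache) > cache_slots:
                cache.popitem(last=False)  # evict least recently used
    return hits / len(accesses)
```

Cycling through six A-items with eight cache slots, the untied cache holds the whole working set and almost always hits, while tying each A to a B partner doubles the footprint past the cache size, so the A-items evict each other and every access misses.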
With internal nodes, we store more than one pointer per node. That means that pointers are not separately cached. You could well argue that pointers and the objects they point to are more strongly correlated than the different pointers are with each other. We need another cache design principle.

== Reiser's Maximize The Variance Principle of Cache Design ==

If two types of things that are cached and accessed, in units that are aggregates, have different average temperatures, then segregating the two types into separate units helps caching. For balanced trees, these units of aggregation are nodes. This principle applies to the situation where it may be necessary to tie things into larger units for efficient access, and guides what things should be tied together. Suppose you have R bytes of RAM for cache, and D bytes of disk. Suppose that 80% of accesses are to the most recently used things, which are stored in H (hotset) bytes of nodes. Reducing the size of H to where it is smaller than R is very important to performance. If you evenly disperse your frequently accessed data, then a larger cache is required and caching is less effective.
# If, all else being equal, we increase the variation in temperature among all aggregates (nodes), then we increase the effectiveness of using a fast small cache.
# If two types of things have different average temperatures (ratios of likelihood of access to size in bytes), then separating them into separate aggregates (nodes) increases the variation in temperature in the system as a whole.
# Conclusion: If all else is equal, and two types of things cached several to an aggregate (node) have different average temperatures, then segregating them into separate nodes helps caching.

== Pointers To Nodes Have A Higher Average Temperature Than The Nodes They Point To ==

Pointers to nodes tend to be frequently accessed relative to the number of bytes required to cache them.
Consider that you have to use the pointers for all tree traversals that reach the nodes beneath them, and that they are smaller than the nodes they point to. Putting only node pointers and delimiting keys into internal nodes concentrates the pointers. Since pointers tend to be more frequently accessed per byte of their size than items storing file bodies, a high average temperature difference exists between pointers and object data. According to the caching principles described above, segregating these two types of things with different average temperatures, pointers and object data, increases the efficiency of caching.

== Segregating By Temperature Directly ==

Now you might say, well, why not segregate by actual temperature instead of by type, which only correlates with temperature? We do what we can easily and effectively code, and temperature segregation is not the only consideration. There are tree designs which rearrange the tree so that objects which have a higher temperature are higher in the tree than objects with a lower temperature. The difference in average temperature between object data and pointers to nodes is so high that I don't find such designs a compelling optimization, and they add complexity. I could be wrong. If one had no compelling semantic basis for aggregating objects near each other (this is true for some applications), and if one wanted to access objects by nodes rather than individually, it would be interesting to have a node repacker sort object data into nodes by temperature. You would need to have the repacker change the keys of the objects it sorts. Perhaps someone will have us implement that for some application someday for Reiser4.

== BLOBs Unbalance the Tree, Reduce Segregation of Pointers and Data, and Thereby Reduce Performance ==

BLOBs, Binary Large OBjects, are a method of storing objects larger than a node by storing pointers to nodes containing the object.
These pointers are commonly stored in what are called the leaf nodes (level 1, except that the BLOBs are then sort of a basement "level B" :-\ ) of a "B*" tree.

Figure 10. A Binary Large OBject (BLOB) has been inserted into a tree that was four levels, with pointers to its blocks stored in a leaf node; in this case the BLOB's blocks are all contiguous. This is what a ReiserFS V3 tree looks like.

BLOBs are a significant unintentional definitional drift, albeit one accepted by the entire database community. This placement of pointers into nodes containing data is a performance problem for ReiserFS V3, which uses BLOBs. (Never accept that "let's just try it my way and see and we can change it if it doesn't work" argument. It took years and a disk format change to get BLOBs out of ReiserFS, and performance suffered the whole time, if tails were turned on.) Because the pointers to BLOBs are diluted by data, it is infeasible to cache all pointers to all nodes in RAM for typical file sets. Reiser4 returns to the classical definition of a height balanced tree, in which the lengths of the paths to all leaf nodes are equal. It does not try to pretend that all of the nodes storing objects larger than a node are somehow not part of the tree, even though the tree stores pointers to them. As a result, the amount of RAM required to store pointers to nodes is dramatically reduced. For typical configurations, RAM is large enough to hold all of the internal nodes.

Figure 11. A Reiser4, 4 level, height balanced tree with fanout = 3, and the data that was stored in BLOBs now stored in extents in the level 1 leaf nodes and pointed to by extent pointers stored in the level 2 twig nodes.
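A back-of-envelope sketch of why RAM typically suffices for the internal nodes (the function name, fanout, and node size are illustrative assumptions, not measured Reiser4 figures):

```python
def internal_node_bytes(data_bytes, node_size=4096, fanout=170):
    """Rough total size of the internal levels of a height balanced
    tree over data_bytes of leaf data, summed level by level up to
    the root."""
    nodes = (data_bytes + node_size - 1) // node_size  # leaf node count
    total = 0
    while nodes > 1:
        nodes = (nodes + fanout - 1) // fanout  # parents of the level below
        total += nodes * node_size
    return total
```

For 100 GB of leaf data these assumptions give well under 1 GB of internal nodes, roughly the data size divided by the fanout, so a typical machine can cache all of them; dilute the internal nodes with object data, as BLOB schemes do, and this stops being true.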
Gray and Reuter say the criterion for searching external memory is to "minimize the number of different pages along the average (or longest) search path. ....by reducing the number of different pages for an arbitrary search path, the probability of having to read a block from disk is reduced." (Transaction Processing: Concepts and Techniques, Morgan Kaufmann Publishers, San Francisco, CA, 1993, p. 834) My problem with this explanation of why the height balanced approach is effective is that it does not convey that you can get away with having a moderately unbalanced tree, provided that you do not significantly increase the total number of internal nodes. In practice, most trees that are unbalanced do have significantly more internal nodes. In practice, most moderately unbalanced trees have a moderate increase in the cost of in-memory tree traversals, and an immoderate increase in the amount of I/O due to the increased number of internal nodes. But if one were to put all the BLOBs together in the same location in the tree, since the number of internal nodes would not significantly increase, the performance penalty for having them on a lower level of the tree than all other leaf nodes would not be a significant additional I/O cost. There would be a moderate increase in that part of the tree traversal time cost which is dependent on RAM speed, but this would not be so critical. Segregating BLOBs could perhaps substantially recover the performance lost by architects not noticing the drift in the definition of height balancing for trees. It might be undesirable to segregate objects by their size rather than just their semantics, though. Perhaps someday someone will try it and see what results.

== Dancing Trees Are Faster Than Balanced Trees ==

Balanced trees have traditionally employed a fixed criterion for determining whether nodes should be squeezed together into fewer nodes so as to save space.
This criterion is traditionally satisfied at the end of every modification to the tree. A typical such criterion is to guarantee that after each modification to the tree the modified node cannot be squeezed together with its left and right neighbor into two or fewer nodes. ReiserFS V3 uses that criterion for its leaf nodes. The more neighboring nodes you consider for squeezing into one fewer nodes, the more memory bandwidth you consume on average per modification to the tree, and the more likely you are to need to read those nodes because they are not in memory. It is a typical pattern in memory management algorithm design that the more tightly packed memory is kept, the more overhead is added to the cost of changing what is stored where in it. This overhead can be significant enough that some commercial databases actually only delete nodes when they are completely empty, and they feel that in practice this works well. Trees that adhere to fixed space usage balancing criteria can have many things rigorously proven about their worst case performance in publishable papers. This is different from their being optimal. An algorithm can have worse bounds on its theoretical worst case performance and be a better algorithm. Just because one cannot rigorously define average usage patterns does not mean they are the slightest bit less important. Sorry mere mortal mathematicians, that is life. Maybe some might prefer to think about the questions that they can define and answer rigorously, but this does not in the slightest make them the right questions. Yes, I am a chaotic.... In Reiser4 we employ not balanced trees, but dancing trees. 
Dancing trees merge insufficiently full nodes, not with every modification to the tree, but instead:
* in response to memory pressure triggering a flush to disk,
* as a result of a transaction closure flushing nodes to disk.

== If It Is In RAM, Dirty, and Contiguous, Then Squeeze It ALL Together Just Before Writing ==

Let a slum be defined as a sequence of nodes that are contiguous in the tree order and dirty in this transaction. (In simpler words, a bunch of dirty nodes that are right next to each other.) A dancing tree responds to memory pressure by squeezing and flushing slums. It is possible that merely squeezing a slum might free up enough space that flushing is unnecessary, but the current implementation of Reiser4 always flushes the slums it squeezes. This is not necessarily the right approach, but we found it simpler and good enough for now. Another simplification we choose to engage in for now is that instead of trying to estimate whether squeezing a slum will save space before squeezing it, we just squeeze it and see. Balanced trees have an inherent tradeoff between balancing cost and space efficiency. If they consider more neighboring nodes, for the purpose of merging them to save a node, with every change to the tree, then they can pack the tree more tightly, at the cost of moving more data with every change to the tree. By contrast, with a dancing tree, you simply take a large slum, shove everything in it as far to the left as it will go, and then free all the nodes in the slum that are left with nothing remaining in them, at the time of committing the slum's contents to disk in response to memory pressure. This gives you extreme space efficiency when slums are large, at a cost in data movement that is lower than it would be with an invariant balancing criterion, because it is done less often. By compressing at the time one flushes to disk, one compresses less often, and that means one can afford to do it more thoroughly.
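The left-shove can be sketched in a few lines (the names and the flat item-size model are invented for illustration; real squeezing must respect item structure and keys):

```python
def squeeze_slum(slum, node_capacity):
    """Pack the items of a slum as far left as possible.

    slum is a list of nodes, each a list of item sizes; returns a new
    list of nodes with the same items packed tight, so nodes emptied
    by the shove simply disappear and can be freed.
    """
    items = [item for node in slum for item in node]
    packed, current, used = [], [], 0
    for size in items:
        if used + size > node_capacity:  # current node is full, start the next
            packed.append(current)
            current, used = [], 0
        current.append(size)
        used += size
    if current:
        packed.append(current)
    return packed
```

Three loosely filled nodes can come back as two full ones, with the third freed at flush time, rather than the tree being rebalanced on every individual insertion and deletion.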
By compressing dirty nodes that are in memory, one avoids performing additional I/O as a result of balancing.

== Procrastination Leads To Wiser Decisions: Allocate on Flush ==

ReiserFS V3 assigns block numbers to nodes as it creates them. XFS is smarter: it waits until the last moment, just before writing nodes to disk. I'd like to thank the XFS team for making an effort to ensure that I understood the merits of their approach. The easy way to see its merits is to consider a file that is deleted before it reaches disk. Such a file should have no effect on the disk layout.

= Reiser4: The Atomic Filesystem =

== Reducing The Damage of Crashing ==

When a computer crashes, data in RAM which has not reached disk is lost. You might at first be tempted to think that we want to then keep all of the data that did reach disk. Suppose that you were performing a transfer of $10 from bank account A to bank account B, and this consisted of two operations: 1) debit $10 from A, and 2) credit $10 to B. Suppose that 1) but not 2) reached disk and the computer crashed. It would be better to disregard 1) than to let 1) but not 2) take effect, yes? When there is a set of operations which we will ensure will all take effect, or none take effect, we call the set as a whole an atom. Reiser4 implements all of its filesystem system calls (requests to the kernel to do something are called system calls) as fully atomic operations, and allows one to define new atomic operations using its plugin infrastructure. Why don't all filesystems do this? Performance. Reiser4 employs new algorithms that allow it to make these operations atomic at little additional cost, where other filesystems have paid a heavy, usually prohibitive, price to do that. We hope to share with you how that is done.

= A Brief History Of How Filesystems Have Handled Crashes =

== Filesystem Checkers ==

Originally filesystems had filesystem checkers that would run after every crash.
The problem with that was that 1) the checkers cannot handle every form of damage well, and 2) the checkers run for a long time. The amount of data stored on hard drives increased faster than the transfer rate (the rate at which hard drives transfer their data from the platter spinning inside them into the computer's RAM when they are asked to do one large continuous read, or the rate in the other direction for writes), which means that the checkers took longer to run, and as the decades ticked by it became less and less reasonable for a mission-critical server to wait for the checker.

== Fixed Location Journaling ==

The solution adopted was to first write each atomic operation to a location on disk called the journal or log, and then, only after each atom had fully reached the journal, write it to the committed area of the filesystem. The problem with this is that twice as much data needs to be written. On the one hand, if the workload is dominated by seeks, this is not as much of a burden as one might think. On the other hand, for writes of large files, it halves performance, because such writes are usually transfer-time dominated. For this reason, meta-data journaling came to dominate general purpose usage. With meta-data journaling, the filesystem guarantees that all of its operations on its meta-data will be done atomically. If a file is being written to, the data being written to that file may be corrupted as a result of non-atomic data operations, but the filesystem's internals will all be consistent. The performance advantage was substantial. V3 of reiserfs offers both meta-data and data journaling, and defaults to meta-data journaling because that is the right solution for most users. Oddly enough, meta-data journaling is much more work to implement because it requires being precise about what needs to be journaled. As is so often the case in programming, doing less work requires more code.
With fixed location data journaling, the overhead of making each operation atomic is too high for it to be appropriate for average applications that don't especially need it, because of the cost of writing twice. Applications that do need atomicity are written to use fsync and rename to accomplish it, and these tools are simply terrible for that job: terrible in performance, and terrible in the ugliness they add to the coding of applications. Stuffing a transaction into a single file just because you need the transaction to be atomic is hardly what one would call flexible semantics. Also, data journaling, with all its performance cost, still does not necessarily guarantee that every system call is fully atomic, much less that one can construct sets of operations that are fully atomic. It usually merely guarantees that the files will not contain random garbage, however many blocks of them happen to get written, and however much the application might view the result as inconsistent data. I hope you understand that we are trying to set a new expectation here for how secure a filesystem should keep your data when we provide these atomicity guarantees.

== Wandering Logs ==

One way to avoid having to write the data twice is to change one's definition of where the log area and the committed area are, instead of moving the data from the log to the committed area. There is an annoying complication to this though, in that there are probably a number of pointers to the data from the rest of the filesystem, and we need them to point to the new data. When the commit occurs, we need to write those pointers so that they point to the data we are committing. Fortunately, these pointers tend to be highly concentrated as a result of our tree design. But wait: if we are going to update those pointers, then we want to commit those pointers atomically also, which we could do if we write them to another location and update the pointers to them, and....
up the tree the changes ripple. When we get to the top of the tree, since disk drives write sectors atomically, the block number of the top can be written atomically into the superblock by the disk, thereby committing everything the new top points to. This is indeed the way WAFL, the Write Anywhere File Layout filesystem invented by Dave Hitz at Network Appliance, works. It always ripples changes all the way to the top, and indeed that works rather well in practice, and most of their users are quite happy with its performance.

== Writing Twice May Be Optimal Sometimes ==

Suppose that a file is currently well laid out, you write to a single block in the middle of it, and you then expect to do many reads of the file. That is an extreme case illustrating that sometimes it is worth writing twice so that a block can keep its current location while committing atomically. If one writes a node twice in this way, one also does not need to update its parent and ripple all the way to the top of the tree. Our code is a toolkit that can be used to implement different layout policies, and one of the available choices is whether to write over a block in its current place, or to relocate it somewhere else. I don't think there is one right answer for all usage patterns. If a block is adjacent to many other dirty blocks in the tree, then this decreases the significance of the cost to read performance of relocating it and its neighbors. If one knows that a repacker will run once a week (a repacker is expected for V4.1, and is (a bit oddly) absent from WAFL), this also decreases the cost of relocation. After a few years of experimentation, measurement, and user feedback, we will say more about our experiences in constructing user-selectable policies.

Do we pay a performance penalty for making Reiser4 atomic? Yes, we do. Is it an acceptable penalty?
We picked up a lot more performance from other improvements in Reiser4 than we lost to atomicity, and so it is not isolated in our measurements, but I am unscientifically confident that the answer is yes. If changes are either large or batched together with enough other changes to become large, the performance penalty is low and drowned out by other performance improvements. Scattered small changes threaten us with read performance losses compared to overwriting in place and taking our chances with the data's consistency if there is a crash, but use of a repacker will mostly alleviate this scenario. I have to say that in my heart I don't have any serious doubts that for the general purpose user the increase in data security is worthwhile. The users, though, will have the final say.

== Committing ==

A transaction preserves the previous contents of all modified blocks in their original location on disk until the transaction commits; to commit means the transaction has reached a state where it will be completed even if there is a crash. The dirty blocks of an atom (which were captured and subsequently modified) are divided into two sets, relocate and overwrite, each of which is preserved in a different manner. The relocatable set is the set of blocks that have a dirty parent in the atom. The relocate set is those members of the relocatable set that will be written to a new or first location rather than overwritten. The overwrite set contains all dirty blocks in the atom that need to be written to their original locations, which is all those not in the relocate set. In practice this is those which do not have a parent we want to dirty, plus those for which overwrite is the better layout policy despite the write-twice cost. Note that the superblock is the parent of the root node, and the free space bitmap blocks have no parent. By these definitions, the superblock and modified bitmap blocks are always part of the overwrite set.
The wandered set is the set of blocks that the overwrite set will be written to temporarily until the overwrite set commits. An interesting definition is the minimum overwrite set, which uses the same definitions as above with the following modification: if at least two dirty blocks have a common parent that is clean, then that parent is added to the minimum overwrite set, and its dirty children are removed from the overwrite set and placed in the relocate set. This policy is an example of what will be experimented with in later versions of Reiser4 using the layout toolkit. For space reasons, we leave out the full details on exactly when we relocate vs. overwrite, and the reader should not regret this, because years of experimenting are probably ahead before we can speak with the authority necessary for a published paper on the effects of the many details and variations possible.

When we commit, we write a wander list, which consists of a mapping of the wandered set to the overwrite set. The wander list is a linked list of blocks containing pairs of block numbers. The last act of committing a transaction is to update the super block to point to the front of that list. Once that is done, if there is a crash, the crash recovery will go through that list and "play" it, which means to write the wandered set over the overwrite set. If there is not a crash, we will also play it. There are many more details of how we handle the deallocation of wandered blocks, the handling of bitmap blocks, and so forth. You are encouraged to read the comments at the top of our source code files (e.g. wander.c) for such details....

= Journalling Optimizations =

== Copy-on-capture ==

Suppose one wants to capture a node which belongs to an atom with stage >= ASTAGE_PRE_COMMIT. This capture request would normally have to wait (sleep in capture_fuse_wait()) while the atom is committed. The copy-on-capture optimization allows us to satisfy the capture request by creating a copy of the node which is being captured.
The commit process takes control of one copy of the node, and the capturing process takes control of the other copy. This does not lead to any node version conflicts, because it is guaranteed that the copy held by the commit process will not be modified.

== Steal-on-capture ==

The idea of the steal-on-capture optimization is that only the last committed transaction to modify an overwrite block actually needs to write that block. Other transactions can skip writing that block post-commit. This optimization, which is also present in ReiserFS version 3, means that frequently modified overwrite blocks will be written less than two times per transaction. With this optimization a frequently modified overwrite block may avoid being overwritten by a series of atoms; as a result, crash recovery must replay more atoms than without the optimization. If an atom has overwrite blocks stolen, the atom must be replayed during crash recovery until every stealing atom commits.

= Repacker =

Another way of escaping from the balancing time vs. space efficiency tradeoff is to use a repacker. 80% of files on the disk remain unchanged for long periods of time. It is efficient to pack them perfectly, by using a repacker that runs much less often than every write to disk. This repacker goes through the entire tree ordering, from left to right and then from right to left, alternating each time it runs. When it goes from left to right in the tree ordering, it shoves everything as far to the left as it will go, and when it goes from right to left it shoves everything as far to the right as it will go. (Left means small in key or in block number. :-) ) In the absence of FS activity, the effect of this over time is to sort by tree order (defragment), and to pack with perfect efficiency. Reiser4.1 will modify the repacker to insert controlled "air holes", as it is well known that insertion efficiency is harmed by overly tight packing.
I hypothesize that it is more efficient to periodically run a repacker that systematically repacks using large I/Os than to perform lots of one-block reads of neighboring nodes of the modification points so as to preserve a balancing invariant in the face of poorly localized modifications to the tree.

= Plugins =

''(Illustration: man holding 3 plugins.)''

== 8 Kinds of Plugins Make Reiser4 The Most Tweakable Filesystem Going ==

== File Plugins ==

Every file possesses a plugin id, and every directory possesses a plugin id. This plugin id will identify a set of methods. The set of methods will embody all of the different possible interactions with the file or directory that come from sources external to ReiserFS. It is a layer of indirection added between the external interface to ReiserFS and the rest of ReiserFS. Each method will have a method id. It will be usual to mix and match methods from other plugins when composing plugins.

== Directory Plugins ==

Reiser4 will implement a plugin for traditional directories. It will implement directory-style access to file attributes as part of the plugin for regular files. Later we will describe why this is useful. Other directory plugins we will leave for later versions. There is no deep reason for this deferral. It is simply the randomness of what features attract sponsors and make it into a release specification; there are no sponsors at the moment for additional directory plugins. I have no doubt that they will appear later; new directory plugins will be too much fun to miss out on. :-)

== Hash Plugins ==

A directory is a mapping from file names to the files themselves. This mapping is implemented through Reiser4's internal balanced tree. Unfortunately, file names cannot be used as keys until keys of variable length are implemented, or unless unreasonable limitations on maximal file name length are imposed. To work around this, the file name is hashed, and the hash is used as the key in the tree. No hash function is perfect, and there will always be hash collisions, that is, file names having the same hash value.
Previous versions of reiserfs (3.5 and 3.6) used a "generation counter" to overcome this problem: keys for file names having the same hash value were distinguished by having different generation counters. This allowed hash collisions to be accommodated at the cost of reducing the number of bits used for hashing. This "generation counter" technique is actually an ad hoc form of support for non-unique keys. Keeping in mind that some form of this has to be implemented anyway, it seemed justifiable to implement more regular support for non-unique keys in Reiser4. Another reason for using hashes is that some (arguably brain-dead) interfaces require them: telldir(3) and seekdir(3). These functions presume that the file system can issue 64-bit "cookies" that can be used to resume a readdir. Cookies are implemented in most filesystems as byte offsets within a directory (which means they cannot shrink directories), and in ReiserFS as hashes of file names plus a generation counter. Curiously enough, the Single UNIX Specification tags telldir(3) and seekdir(3) as an "Extension", because "returning to a given point in a directory is quite difficult to describe formally, in spite of its intuitive appeal, when systems that use B-trees, hashing functions, or other similar mechanisms to order their directories are considered". We order directory entries in ReiserFS by their cookies. This costs us performance compared to ordering lexicographically. (But it is immensely faster than the linear searching employed by most other Unix filesystems.) Depending on the hash and its match to the application usage pattern there may be more or less performance lossage. Hash plugins will probably remain until version 5 or so, when directory plugins and ordering function plugins will obsolete them. Directory entries will then be ordered by file names as they should be (and possibly stem compressed as well).

== Security Plugins ==

Security plugins handle all security checks.
They are normally invoked by file and directory plugins. An example of reading a file:

* Access the plugin id for the file.
* Invoke the read method for the plugin.
* The read method determines the security plugin for the file.
* That security plugin invokes its read-check method to determine whether to permit the read.
* The read-check method for the security plugin reads the file/attributes containing the permissions on the file.
* Since file/attributes are also files, this means invoking the plugin for reading the file/attribute.
* The plugin id for this particular file/attribute for this file happens to be inherited (saving space and centralizing control of it).
* The read method for the file/attribute is coded such that it does not check permissions when called by a security plugin method. (Endless recursion is thereby avoided.)
* The file/attribute plugin employs a decompression algorithm specially designed for efficient decompression of our encoding of ACLs.
* The security plugin determines that the read should be permitted.
* The read method continues and completes.

== Item Plugins ==

The balancing code will be able to balance an item iff it has an item plugin implemented for it. The item plugin will implement each of the methods the balancing code needs (methods such as splitting items, estimating how large the split pieces will be, overwriting, appending to, cutting from, or inserting into the item, etc.). In addition to all of the balancing operations, item plugins will also implement intra-item search plugins. V3 of ReiserFS understood the structure of the items it balanced. This made adding new types of items, storing such new security attributes as other researchers might develop, too expensive in coding time, greatly inhibiting their addition to ReiserFS.
In writing Reiser4 we hoped that there would be a great proliferation in the types of security attributes in ReiserFS if we made adding one a matter not of modification of the balancing code by our most experienced programmers, but of writing an item handler. This is necessary if we are to achieve our goal of making the addition of each new security attribute an order of magnitude or more easier to perform than it is now.

== Key Assignment Plugins ==

When assigning the key to an item, the key assignment plugin is invoked, and it has a key assignment method for each item type. A single key assignment plugin is defined for the whole FS at FS creation time. We know from experience that there is no "correct" key assignment policy; squid has very different needs from average user home directories. Yes, there could be value in varying it more flexibly than just at FS creation time, but we have to draw the line somewhere when deciding what goes into each release....

== Node Search and Item Search Plugins ==

Every node layout has a search method for that layout, and every item that is searched through has a search method for that item. (When doing searches, we search through a node to find an item, and then search within the item for those items that contain multiple things to find.)

== Putting Your New Plugin To Work Will Mean Recompiling ==

If you want to add a new plugin, we think your having to ask the sysadmin to recompile the kernel with your new plugin added to it will be acceptable for version 4.0. We will initially code plugin-id lookup as an in-kernel fixed-length array lookup, code method ids as function pointers, and make no provision for post-compilation loading of plugins. Performance, and coding cost, motivate this.
''(Illustration: character almost drowning while another character hands him a plugin.)''

== Without Plugins We Will Drown ==

People often ask, as ReiserFS grows in features, how we will keep the design from being drowned under the weight of the added complexity and from reaching the point where it is difficult to work on the code. The infrastructure to support security attributes implemented as files also enables lots of features not necessarily security related. The plugins we are choosing to implement in v4.0 are all security related because of our funding source, but users will add other sorts of plugins, just as they took DARPA's TCP/IP and used it for non-military computers. Only by requiring that all features be implemented in the manner that maximizes code reuse will we keep ReiserFS coding complexity down to where we can manage it over the long term.

== Plugins: FS Programming For The Lazy ==

Most plugins will have only a very few of their features unique to them, and the rest of the plugin will be reused code. What Namesys sees as its role as a DARPA contractor is not primarily supplying a suite of security plugins, though we are doing that, but creating an architectural (not just the license) enabling of lots of outside vendors to efficiently create lots of innovative security plugins that Namesys would never have imagined working by itself.

= Enhancing Security =

''(Illustration: superman character complaining about an emergency.)''

By far most casualties in wars have always been civilians. In future information infrastructure attacks, who will take more damage, civilian or military installations? DARPA is funding us to make all Gnu/Linux computers throughout the world a little bit more resistant to attack.

== Fine Graining Security ==

== Good Security Requires Precision In Specification Of Security ==

Suppose you have a large file with many components. A general principle of security is that good security requires precision of permissions.
When security lacks precision, it increases the burden of being secure, and the extent to which users adhere to security requirements in practice is a function of the burden of adhering to them.

== Space Efficiency Concerns Motivate Imprecise Security ==

Many filesystems make it space-inefficient to store small components as separate files, for various reasons. Not being separate files means that they cannot have separate permissions. One of the reasons for using overly aggregated units of security is space efficiency. ReiserFS currently improves this by an order of magnitude over most of the existing alternative art. Space efficiency is the hardest of the reasons to eliminate; its elimination makes it that much more enticing to attempt to eliminate the other reasons.

== Security Definition Units And Data Access Patterns Sometimes Inherently Don't Align ==

Applications sometimes want to operate on a collection of components as a single aggregated stream. (Note that commonly two different applications want to operate on data with different levels of aggregation; the infrastructure for solving this security issue will solve that problem as well.)

== /etc/passwd As Example ==

I am going to use the /etc/passwd file as an example, not because I think that other solutions won't solve its problems better, but because its implementation as a single flat file in the early Unixes is a wonderful illustrative example of poorly granularized security whose problems the readers may have personally experienced. I hope they will be able to imagine that other, less famous data files could have similar problems. Have you ever tried to figure out just exactly what part of your continually changing /etc/passwd file changed near the time of a break-in? Have you ever wished that you could have a modification time on each field in it?
Have you ever wished that users could change part of it, such as the gecos field, themselves (setuid utilities have been written to allow this, but this is a pedagogical, not a practical, example), but not have the power to change it for other users? There were good reasons why /etc/passwd was first implemented as a single file with one single permission governing the entire file. If we can eliminate them one by one, the same techniques for making finer grained security effective will be of value to other highly secure data files.

== Aggregating Files Can Improve The User Interface To Them ==

Consider the use of emacs on a collection of a thousand small 8-32 byte files, like you might have if you deconstructed /etc/passwd into small files with separable ACLs for every field. It is more convenient in screen real estate, buffer management, and other user interface considerations to operate on them as an aggregation all placed into a single buffer rather than as a thousand 8-32 byte buffers.

== How Do We Write Modifications To An Aggregation? ==

Suppose we create a plugin that aggregates all of the files in a directory into a single stream. How does one handle writes to that aggregation that change the length of the components of that aggregation? Richard Stallman pointed out to me that if we separate the aggregated files with delimiters, then emacs need not be changed at all to acquire an effective interface for large numbers of small files accessed via an aggregation plugin. If /new_syntax_access_path/big_directory_of_small_files/.glued is a plugin that aggregates every file in big_directory_of_small_files with a delimiter separating every file within the aggregation, then one can simply type emacs /new_syntax_access_path/big_directory_of_small_files/.glued, and the filesystem has done all the work emacs needs to be effective at this. Not a line of emacs needs to be changed.
One needs to be able to choose different delimiting syntax for different aggregation plugins so that one can, for say the passwd file, aggregate subdirectories into lines, and files within those subdirectories into colon-separated fields within the line. XML would benefit from yet other delimiter construction rules. (We have been told by Philipp Guehring of LivingXML.NET that ReiserFS is higher performance than any database for storing XML, so this issue is not purely theoretical.)

= Aggregation Is Best Implemented As Inheritance =

In summary, to be able to achieve precision in security we need to have inheritance with specifiable delimiters, and we need whole file inheritance to support ACLs.

== One Plugin Using Delimiters That Resemble sys_reiser4() Syntax ==

We provide the infrastructure for constructing plugins that implement arbitrary processing of writes to inheriting files, but we also supply one generic inheriting file plugin that intentionally uses delimiters very close to the sys_reiser4() syntax. We will document the syntax more fully when that code is working; for now, syntax details are in the comments in the file invert.c in the source code.

== API Suitable For Accessing Files That Store Security Attributes ==

A new system call sys_reiser4() will be implemented to support applications that don't have to be fooled into thinking that they are using POSIX. Through this entry point a richer set of semantics will access the same files that are also accessible using POSIX calls. Reiser4() will not implement more than hierarchical names. A full set theoretic naming system, as described on our future vision page, will not be implemented before SSN Reiserfs is implemented (Distributed Reiserfs is our distributed filesystem, Semi-Structured Naming Reiserfs is our enhanced semantics; whether we implement Distributed Reiserfs or SSN Reiserfs first depends on which sponsors we find ;-) ).
Reiser4() will implement all features necessary to access ACLs as files/directories rather than as something neither file nor directory. These include opening and closing transactions, performing a sequence of I/Os in one system call, and accessing files without use of file descriptors (necessary for efficient small I/O). Reiser4() will use a syntax suitable for evolving into SSN Reiserfs syntax with its set theoretic naming.

== Flaws In Traditional File API When Applied To Security Attributes ==

Security related attributes tend to be small. The traditional filesystem API for reading and writing files has these flaws in the context of accessing security attributes:

* Creating a file descriptor is excessive overhead and not useful when accessing an 8 byte attribute.
* A system call for every attribute accessed is too much overhead when accessing lots of little attributes.
* Lacking constraints: it is important to constrain what is written to the attribute, often in complex ways.
* Lacking atomic semantics: often one needs to update multiple attributes as one action that is guaranteed to either fully succeed or fully fail.

== The Usual Resolution Of These Flaws Is A One-Off Solution ==

The usual response to these flaws is that people adding security related and other attributes create a set of methods unique to their attributes, plus non-reusable code to implement those methods, in which their particular attributes are accessed and stored not using the methods for files, but using their particular methods for that attribute. Their particular API for that attribute typically does a one-off instantiation of a lightweight, single-system-call, write-constrained, atomic access, with no code being reusable by those who want to modify file bodies. It is basic and crucial to system design to decompose desired functionality into reusable, orthogonal, separated components.
Persons designing security attributes are typically doing it without the filesystem that they want to use offering them a proper foundation and toolkit. They need more help from us core FS developers. Linus said that we can have a system call to use as our experimental plaything in this. With what I have in mind for the API, one rather flexible system call is all we want for creating atomic, lightweight, batched, constrained accesses to files, with each of those adjectives being an orthogonal optional feature that may or may not be invoked in a particular instance of the new system call.

== One-Off Solutions Are A Lot Of Work To Do A Lot Of ==

Looking at the coin from the other side, we want to make it an order of magnitude less work to add features to ReiserFS, so that both users and Namesys can add at least an order of magnitude more of them. To verify that it is truly more extensible you have to do some extending, and our DARPA funding motivates us to instantiate most of those extensions as new security features. This system call's syntax enables attributes to be implemented as a particular type of file. It avoids uglifying the semantics with two APIs for two supposedly different kinds of objects that don't truly need different treatment. All of its special features that are useful for accessing particular attributes are also available for use on files. It has symmetry, and its features have been fully orthogonalized. There is nothing particularly interesting about this system call to a languages specialist (its ideas were explored decades ago, except by filesystem developers) until SSN Reiserfs, when we will further evolve it into a set theoretic syntax that deconstructs tuple structured names into hierarchy and vicinity set intersection. That is described at www.namesys.com/whitepaper.html.

= Steps For Creating A Security Attribute =

You can create a new security attribute by:

* Defining a pluginid.
* Composing a set of methods for the plugin from ones you create or reuse from other existing plugins.
* Defining a set of items that act as the storage [[containers]] of the object, or reusing existing items from other plugins (e.g. regular files).
* Implementing item handlers for all of the new items you create.
* Creating a key assignment algorithm for all of the new items.

== reiser4() System Call Description ==

The reiser4() system call (still being debugged at the time of writing) executes a sequence of commands separated by commas. Assignment and transaction are the commands supported in Reiser4(); more commands will appear in SSN Reiserfs. <- and <<- are two of the assignment operators.

lhs (assignment target) values:

* /..../process/range/(offset<-(loff_t),last_byte<-(loff_t)) assigns (writes) to the buffer starting at address offset in the process address space, ending at last_byte. (The assignment source may be smaller or larger than the assignment target.) Representation of offset and last_byte is left to the coder to determine. It is an issue that will be of much dispute and little importance. Notice that / is used to indicate that the order of the operands matters; see the future vision whitepaper for details of why this is appropriate syntax design. Note the lack of a file descriptor.
* /filename assigns to the file named filename.
* /filename/..../range/(offset<-(loff_t),last_byte<-(loff_t)) writes to the body, starting at offset, ending not past last_byte.
* /filename/..../range/(offset<-(loff_t)) writes to the body starting at offset.

rhs (assignment source) values:

* /..../process/range/(offset<-(loff_t),last_byte<-(loff_t)) reads from the buffer starting at address offset in the process address space, ending at last_byte. Representation of offset and last_byte is left to the coder to determine, as it is an issue that will be of much dispute and little importance.
* /filename reads the entirety of the file named filename.
* /filename/..../range/(offset<-(loff_t),last_byte<-(loff_t)) reads from the body, starting at offset, ending not past last_byte.
* /filename/..../range/(offset<-(loff_t)) reads from the body starting at offset until the end.
* /filename/..../stat/owner reads from the ownership field of the stat data. (Stat data is that which is returned by the stat() system call (owner, permissions, etc.) and stored on a per file basis by the FS.)

Note that "...." and "process" are style conventions for the name of a hidden subdirectory implementing methods and accessing metadata supported by a plugin. It is possible to rename it, etc. We had a discussion about whether to instead use names that could not clash with any legitimate name likely to be used by users. Vladimir Demidov pointed out that cryptic names have historically harmed the acceptance of several languages, and so it was realized that being novice unfriendly in the naming was worse than risking a name collision, especially since a collision could be cured by using rename on "...." and "process" for the few cases where it is necessary.

= Constraints =

(Note: this is not yet coded.)

Another way security may be insufficiently fine grained is in values: it can be useful to allow persons to change data, but only within certain constraints. For this project we will implement plugins; one type of plugin will be write constraints. Write-constraints are invoked upon write to a file; if they return non-error then the write is allowed. We will implement two trivial sample write-constraint plugins. One will be in the form of a kernel function, loadable as a kernel module, which returns non-error (thus allowing the write) if the file contains the strings "secret" or "sensitive" but not "top-secret". The other, which does exactly the same, will be in the form of a perl program residing in a file and executed in user-space.
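The sample write-constraint described above can be sketched as a simple predicate. This is an illustration of the intended allow/deny logic, not Reiser4 code; the function name is ours:

```python
def write_allowed(content: bytes) -> bool:
    """Toy write-constraint: allow the write iff the proposed file
    contents mention "secret" or "sensitive" but not "top-secret".
    "top-secret" must be checked first, since it contains "secret"
    as a substring."""
    if b"top-secret" in content:
        return False
    return b"secret" in content or b"sensitive" in content
```

Whether the predicate lives in a kernel module or in a user-space perl script, the decision it makes is the same; only the invocation mechanism differs.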
Use of kernel functions will have performance advantages, particularly for small functions, but severe disadvantages in scripting power, flexibility, and the ability to be installed by non-secure sources. Both types of plugins will have their place. Note that ACLs will also embody write constraints. We will implement both constraints that are compiled into the kernel, and constraints that are implemented as user space processes. Specifically, we will implement a plugin that executes an arbitrary constraint contained in an arbitrary named file as a user space process, passes the proposed new file contents to that process as standard input, and iff the process exits without error allows the write to occur. It can be useful to have read constraints as well as write constraints.

= Auditing =

(Note: this is not yet coded.)

We will implement a plugin that notifies administrators by email when access is made to files, e.g. read access. With each plugin implemented, creating additional plugins becomes easier as the available toolkit is enriched. Auditing constitutes a major additional security feature, yet it will be easy to implement once the infrastructure to support it exists. (It would be substantial work to implement it without that infrastructure.) The scope of this project is not the creation of plugins themselves, but the creation of the infrastructure that plugin authors would find useful. We want to enable future contributors to implement more secure systems on the Gnu/Linux platform, not implement them ourselves. By laying a proper foundation and creating a toolkit for them, we hope to reduce the cost of coding new security attributes for those who follow us by an order of magnitude. Employing a proper set of well orthogonalized primitives also changes the addition of these attributes from being a complexity burden upon the architecture into being an empowering extension of the architecture.
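The user-space constraint protocol described above (run an arbitrary program, feed it the proposed contents on standard input, allow the write iff it exits without error) can be sketched as follows. The function name and the trivial inline constraint "program" are our illustrations, not Reiser4's actual plugin interface:

```python
import subprocess
import sys

def constraint_allows(constraint_argv, proposed_contents: bytes) -> bool:
    """Run a constraint program as a user-space process, feed it the
    proposed new file contents on stdin, and allow the write iff the
    process exits without error (exit status 0)."""
    result = subprocess.run(constraint_argv, input=proposed_contents)
    return result.returncode == 0

# A trivial stand-in constraint "program": exit 0 only if stdin
# contains the string "secret".
CONSTRAINT = [sys.executable, "-c",
              "import sys; sys.exit(0 if b'secret' in sys.stdin.buffer.read() else 1)"]
```

The same mechanism would run a perl script residing in a file; nothing about the protocol depends on the constraint's implementation language.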
= Increasing the Allowed Granularity of Security =

[Illustration: man holding a sieve; only objects of a certain size go through.]

(This feature is not yet coded.)

Inheritance of security attributes is important to providing flexibility in their administration. We have spoken about making security more fine grained, but sometimes it needs to be larger grained. Sometimes a large number of files are logically one unit in regard to their security, and it is desirable to have a single point of control over their security. Inheritance of attributes is the mechanism for implementing that. Security administrators should have the power to choose whatever units of security they desire, without having to distort them to make them correspond to semantic units. Inheritance of file bodies using aggregation plugins allows the units of security to be smaller than files; inheritance of attributes allows them to be larger than files.

= Encryption On Commit =

Currently, encrypted files suffer severely in their write performance when implemented using schemes that encrypt at every write() rather than at every commit to disk. We encrypt on flush, such that a file with an encryption plugin id is encrypted not at the time of write, but at the time of flush to disk. Encryption is implemented as a special form of repacking on flush, and it occurs for any node which has its CONTAINS_ENCRYPTED_DATA state flag set.

= Conclusion =

Reiser4 offers a dramatically better infrastructure for creating new filesystem features. Files and directories have all of the features needed to make it unnecessary for file attributes to be something different from files. The effectiveness of this new infrastructure is tested using a variety of new security features. Performance is greatly improved by the use of dancing trees, wandering logs, allocate on flush, a repacker, and encryption on commit. It was an important question whether we could increase the level of abstraction in our design without harming performance.
Reiser4 gives you BOTH the most cleanly abstracted storage AND the highest performance storage of any filesystem.

= Citations =

* [Gray93] Jim Gray and Andreas Reuter. "Transaction Processing: Concepts and Techniques". Morgan Kaufmann Publishers, Inc., 1993. Old but good textbook on transactions. Available at http://www.mkp.com/books_catalog/catalog.asp?ISBN=1-55860-190-2
* [Hitz94] D. Hitz, J. Lau and M. Malcolm. "File system design for an NFS file server appliance". Proceedings of the 1994 USENIX Winter Technical Conference, pp. 235-246, San Francisco, CA, January 1994. Available at http://citeseer.nj.nec.com/hitz95file.html
* [TR3001] D. Hitz. "A Storage Networking Appliance". Tech. Rep. TR3001, Network Appliance, Inc., 1995. Available at http://www.netapp.com/tech_library/3001.html
* [TR3002] D. Hitz, J. Lau and M. Malcolm. "File system design for an NFS file server appliance". Tech. Rep. TR3002, Network Appliance, Inc., 1995. Available at http://www.netapp.com/tech_library/3002.html
* [Ousterh89] J. Ousterhout and F. Douglis. "Beating the I/O Bottleneck: A Case for Log-Structured File Systems". ACM Operating System Reviews, Vol. 23, No. 1, pp. 11-28, January 1989. Available at http://citeseer.nj.nec.com/ousterhout88beating.html
* [Seltzer95] M. Seltzer, K. Smith, H. Balakrishnan, J. Chang, S. McMains and V. Padmanabhan. "File System Logging versus Clustering: A Performance Comparison". Proceedings of the 1995 USENIX Technical Conference, pp. 249-264, New Orleans, LA, January 1995. Available at http://citeseer.nj.nec.com/seltzer95file.html
* [Seltzer95Supp] M. Seltzer. "LFS and FFS Supplementary Information". 1995. http://www.eecs.harvard.edu/~margo/usenix.195/
* [Ousterh93Crit] J. Ousterhout. "A Critique of Seltzer's 1993 USENIX Paper". http://www.eecs.harvard.edu/~margo/usenix.195/ouster_critique1.html
* [Ousterh95Crit] J. Ousterhout. "A Critique of Seltzer's LFS Measurements". http://www.eecs.harvard.edu/~margo/usenix.195/ouster_critique2.html
* [SwD96] A. Sweeney, D.
Doucette, W. Hu, C. Anderson, M. Nishimoto and G. Peck. "Scalability in the XFS File System". Proceedings of the 1996 USENIX Technical Conference, pp. 1-14, San Diego, CA, January 1996. Available at http://citeseer.nj.nec.com/sweeney96scalability.html
* [VelskiiLandis] G.M. Adel'son-Vel'skii and E.M. Landis. "An algorithm for the organization of information". Soviet Math. Doklady 3, 1259-1262, 1962. This paper on AVL trees can be thought of as the founding paper of the field of storing data in trees. Those not conversant in Russian will want to read the [Lewis and Denenberg] treatment of AVL trees in its place. [Wood] contains a modern treatment of trees.
* [Apple] Inside Macintosh, Files, by Apple Computer Inc., Addison-Wesley, 1992. Employs balanced trees for filenames. It was an interesting filesystem architecture for its time in a number of ways, but its problems with internal fragmentation have become more severe as disk drives have grown larger. I look forward to the replacement they are working on.
* [Bach] Maurice J. Bach. "The Design of the Unix Operating System". Prentice-Hall Software Series, Englewood Cliffs, NJ, 1986. Superbly written but sadly dated; contains detailed descriptions of the filesystem routines and interfaces in a manner especially useful for those trying to implement a Unix compatible filesystem. See [Vahalia].
* [BLOB] R. Haskin, Raymond A. Lorie. "On Extending the Functions of a Relational Database System". SIGMOD Conference, 1982, pp. 207-212 (body of paper not on web). Reiser4 obsoletes this approach.
* [Chen] Chen, P.M., Patterson, David A. "A New Approach to I/O Performance Evaluation---Self-Scaling I/O Benchmarks, Predicted I/O Performance". 1993 ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems; also available on Chen's web page.
* [C-FFS] Ganger, Gregory R., Kaashoek, M. Frans. "Embedded Inodes and Explicit Grouping: Exploiting Disk Bandwidth for Small Files."
A very well written paper focused on 1-10k file size issues; they use some similar notions (most especially their concept of grouping compared to my packing localities). Note that they focus on the 1-10k file size range, and not the sub-1k range. The 1-10k range is the weak point in ReiserFS V3 performance. The page with a link to the postscript paper is available at http://amsterdam.lcs.mit.edu/papers/cffs.html
* [ext2fs] by Remi Card. Extensive information and source code are available. Probably our toughest current competitor, it is showing its age though, and recent enhancements of it (journaling, htrees, etc.) have not been performance effective. It embodies both the strengths and weaknesses of the incrementalist approach to coding, and substantially resembles the older FFS filesystem from BSD.
* [FFS] M. McKusick, W. Joy, S. Leffler, R. Fabry. "A Fast File System for UNIX". ACM Transactions on Computer Systems, Vol. 2, No. 3, pp. 181-197, August 1984. Describes the implementation of a filesystem which employs parent directory location knowledge in determining file layout. It uses large blocks for all but the tail of files to improve I/O performance, and uses small blocks called fragments for the tails so as to reduce the cost due to internal fragmentation. Numerous other improvements are also made to what was once the state-of-the-art. FFS remains the architectural foundation for many current block allocation filesystems, and was later bundled with the standard Unix releases. Note that unrequested serialization and the use of fragments place it at a performance disadvantage to ext2fs, though whether ext2fs is thereby made less reliable is a matter of dispute that I take no position on (Reiser4 is an atomic filesystem, which is a different level of reliability entirely). Available at http://citeseer.nj.nec.com/mckusick84fast.html
* [Ganger] Gregory R. Ganger, Yale N. Patt. "Metadata Update Performance in File Systems".
(Abstract only.)
* [Gifford] Describes a filesystem enriched to have more than hierarchical semantics. He shares many goals with this author; forgive me for thinking his work worthwhile. If I had to suggest one improvement in a sentence, I would say his semantic algebra needs closure. (Postscript only.)
* [Hitz, Dave] A rather well designed filesystem optimized for NFS and RAID in combination. Note that RAID increases the merits of write-optimization in block layout algorithms. Available at http://www.netapp.com/technology/level3/3002.html
* [Holton and Das] Holton, Mike., Das, Raj. "The XFS space manager and namespace manager use sophisticated B-Tree indexing technology to represent file location information contained inside directory files and to represent the structure of the files themselves (location of information in a file)". Note that it is still a block (extent) allocation based filesystem; no attempt is made to store the actual file contents in the tree. It is targeted at the needs of the other end of the file size usage spectrum from ReiserFS, and is an excellent design for that purpose (though most filesystems including Reiser4 do well at writing large files, and I think it is medium-sized and smaller files where filesystems can substantively differentiate themselves). SGI has also traditionally been a leader in resisting the use of unrequested serialization of I/O. Unfortunately, the paper is a bit vague on details. Available at http://www.sgi.com/Technology/xfs-whitepaper.html
* [Howard] Howard, J.H., Kazar, M.L., Menees, S.G., Nichols, D.A., Satyanarayanan, M., Sidebotham, R.N., West, M.J. "Scale and Performance in a Distributed File System". ACM Transactions on Computer Systems, 6(1), February 1988. A classic benchmark, it was too CPU bound to effectively stress ext2fs and ReiserFS, and is no longer very effective for modern filesystems.
* [Knuth] Knuth, D.E. The Art of Computer Programming, Vol.
3 (Sorting and Searching), Addison-Wesley, Reading, MA, 1973. The earliest reference discussing trees storing records of varying length.
* [LADDIS] Wittle, Mark, and Keith, Bruce. "LADDIS: The Next Generation in NFS File Server Benchmarking". Proceedings of the Summer 1993 USENIX Conference, July 1993, pp. 111-128.
* [Lewis and Denenberg] Lewis, Harry R., Denenberg, Larry. "Data Structures & Their Algorithms". HarperCollins Publishers, NY, NY, 1991. An algorithms textbook suitable for readers wishing to learn about balanced trees and their AVL predecessors.
* [McCreight] McCreight, E.M. "Pagination of B*-trees with variable length records". Commun. ACM 20 (9), 670-674, 1977. Describes algorithms for trees with variable length records.
* [McVoy and Kleiman] The implementation of write-clustering for Sun's UFS. Available at http://www.sun.ca/white-papers/ufs-cluster.html
* [OLE] "Inside OLE" by Kraig Brockschmidt. Discusses Structured Storage (abstract only). Structured storage is what you get when application developers need features to better manage the storage of objects on disk by the applications they write, and the filesystem group at their company can't be bothered with them. Miserable performance, miserable semantics. Available at http://www.microsoft.com/mspress/books/abs/5-843-2b.htm
* [Ousterhout] J.K. Ousterhout, H. Da Costa, D. Harrison, J.A. Kunze, M.D. Kupfer, and J.G. Thompson. "A Trace-driven Analysis of the UNIX 4.2BSD File System". In Proceedings of the 10th Symposium on Operating Systems Principles, pages 15-24, Orcas Island, WA, December 1985.
* [NTFS] "Inside the Windows NT File System", written by Helen Custer; NTFS was architected by Tom Miller with contributions by Gary Kimura, Brian Andrew, and David Goebel. Microsoft Press, 1994. An easy to read little book. They fundamentally disagree with me on adding serialization of I/O not requested by the application programmer, and I note that the performance penalty they pay for their decision is high, especially compared with ext2fs. Their FS design is perhaps optimal for floppies and other hardware eject media beyond OS control. A less serialized, higher performance log structured architecture is described in [Rosenblum and Ousterhout]. That said, Microsoft is to be commended for recognizing the importance of attempting to optimize for small files, and leading the OS designer effort to integrate small objects into the file name space. This book is notable for not referencing the work of persons not working for Microsoft, or providing any form of proper attribution to previous authors such as [Rosenblum and Ousterhout]. Though perhaps they really didn't read any of the literature, and that explains why theirs is the worst performing filesystem in the industry....
* [Peacock] K. Peacock. "The CounterPoint Fast File System". Proceedings of the Usenix Conference, Winter 1988.
* [Pike] Rob Pike and Peter Weinberger. "The Hideous Name". USENIX Summer 1985 Conference Proceedings, pp. 563, Portland, Oregon, 1985. Short, informal, and drives home why inconsistent naming schemes in an OS are detrimental. Available at http://achille.cs.bell-labs.com/cm/cs/doc/85/1-05.ps.gz. His discussion of naming in Plan 9: http://plan9.bell-labs.com/plan9/doc/names.html
* [Rosenblum and Ousterhout] M. Rosenblum and J. Ousterhout. "The Design and Implementation of a Log-Structured File System". ACM Transactions on Computer Systems, Vol. 10, No. 1, pp. 26-52, February 1992. Available at http://citeseer.nj.nec.com/rosenblum91design.html
This paper was quite influential in a number of ways on many modern filesystems, and the notion of using a cleaner may be applied to a future release of ReiserFS. There is an interesting on-going debate over the relative merits of FFS vs. LFS architectures, and the interested reader may peruse http://www.scriptics.com/people/john.ousterhout/seltzer93.html and the arguments by Margo Seltzer it links to.
* [Snyder] "tmpfs: A Virtual Memory File System". Discusses a filesystem built to use swap space and intended for temporary files; due to a complete lack of disk synchronization it offers extremely high performance.
* [Vahalia] Uresh Vahalia. "Unix Kernel Internals".
* [Reiser93] Reiser, Hans T. Future Vision Whitepaper, 1984, revised 1993. Available at http://www.namesys.com/whitepaper.html

[[category:Reiser4]] [[category:Formatting-fixes-needed]]

{{wayback|http://www.namesys.com/v4/v4.html|2006-11-13}}

Reasons why Reiser4 is great for you:
* Reiser4 is the fastest filesystem, and here are the benchmarks.
* Reiser4 is an atomic filesystem, which means that your filesystem operations either entirely occur, or they entirely don't, and they don't corrupt due to half occurring. We do this without significant performance losses, because we invented algorithms to do it without copying the data twice.
* Reiser4 uses dancing trees, which obsolete the balanced tree algorithms used in databases (see farther down). This makes Reiser4 more space efficient than other filesystems, because we squish small files together rather than wasting space due to block alignment like they do. It also means that Reiser4 scales better than any other filesystem. Do you want a million files in a directory, and want to create them fast? No problem.
* Reiser4 is based on plugins, which means that it will attract many outside contributors, and you'll be able to upgrade to their innovations without reformatting your disk. If you like to code, you'll really like plugins....
* Reiser4 is architected for military grade security. You'll find it is easy to audit the code, and that assertions guard the entrance to every function.

V3 of ReiserFS is used as the default filesystem for SuSE, Lindows, FTOSX, Libranet, Xandros and Yoper. We don't touch the V3 code except to fix a bug, and as a result we don't get bug reports for the current mainstream kernel version. It shipped before the other journaling filesystems for Linux, and is the most stable of them as a result of having been out the longest. We must caution that just as Linux 2.6 is not yet as stable as Linux 2.4, it will also be some substantial time before V4 is as stable as V3.

= Software Engineering Based Reiser4 Design Principles =

== Equal Source Code Access Is A Civil Right ==

Copyright and patent laws were invented to give you an incentive to share your knowledge with the rest of the world in return for a limited time monopoly on what you shared. That is not the way it works with software though, because software companies are allowed to keep their source code secret, but are still given monopoly rights over their software. There is little meaningful sharing of knowledge when only binaries are shared with the world, and all the rest is kept secret. The reasons for the existence of copyright and patent laws have been forgotten, their workings have been twisted, and greed and turf defense are what remain of them. Monopoly interests have taken laws intended to promote progress in the arts and sciences, and now use them to further their own control over us by ensuring that innovations not theirs cannot enter the market for improvements to software.
Think of software objects as forming a society, not yet at the level of an AI society, but still a group of programs interacting, and choosing whether to interact, with each other. Think of social lockout, whether it be in the form of racial discrimination as in the civil rights movement, Mercantilism as happened a few centuries ago, or the endless other forms of division in human society. Is it so surprising that this evil casts its shadow on cyberspace? Is it so surprising that our cybershadows also find ways to engage in social lockout of others? Most of the cyber-world of software lives under tyranny today. We are part of a movement to create a free cyber-world we can all participate in equally.

Namesys does not oppose copyright laws as they were invented (14 year monopolies which disclosed everything that was temporarily monopolized); it opposes copyright laws as they have been twisted. Namesys opposes unlimited time monopolies which disclose nothing, and lock out all other inventors. Many others in this movement are opposed to copyright law, even in the version in which it was first created. We feel they are not acknowledging that a trade-off is being made, and that this trade-off has value. Yet still we choose to give our software away for free for use with software that is given away for free (e.g. Gnu/Linux). Since we don't have a lot of illusions about our ability to entirely change the world, and it is amusing to sell free software, for those who do not want to disclose their software and do not want to give it away for free, we charge a license fee and let them keep their improvements to our software without sharing them. These fees help substantially in allowing us to survive as an organization.
We don't make nearly as much money as we would from charging everyone for usage rights, but we do make just enough to get by, and that is important. ;-) We don't really feel that everyone should follow our example and make their software no charge for most users (it is too hard to survive fiscally doing this), but we do think that everyone should disclose their source code, and no one should design their software to exclude working with other software (e.g. Microsoft's Palladium, which makes such a mockery of Athena).

== Software Libre Takes More Than A License --- It Takes A Design ==

Making the source code available to you is not enough by itself to bring you all of the possible benefits of software libre. Many file systems are so difficult to modify that only someone who has worked with the code for years finds it feasible to modify it, and even then small changes can take months of labor due to their ripple effects on the other code and the difficulties of dealing with disk format changes. This is why we have a plugin based architecture in Reiser4, so that it is not just possible, but easy, to improve the software.

Imagine that you were an experimental physicist who had spent his life using only the tools that were in his local hardware store. Then one day you joined a major research lab with a machine shop and a whole bunch of other physicists. All of a sudden you are not using just whatever tools the large tool companies who have never heard of you have made for you. You are now part of a cooperative of physicists all making your own tools, swapping tools with each other, suddenly empowered to have tools that are exactly what you want them to be, or even merely exactly what your colleagues want them to be, rather than what some big tool company, which has to do a market analysis before giving you what you want, wants them to be. That is the transition you will make when you go from version 3 to version 4 of ReiserFS.
The tools your colleagues and sysadmins (your machinists) make are going to be much better for what you need.

== Why Limit Interactions With Objects Strictly? ==

You may wonder why the design we will present is so highly structured, why every object is allowed to control what is done to it by providing a limited interface, and why we pass requests to objects to do things rather than doing things directly to the object. Surely we limit our functionality by doing so, yes? Indeed we do, but is there a reason why the price is worth paying? Is there something that becomes crucial as complexity grows? Chaos theory offers the answer. If you disturb one thing, and disturbing that thing inherently disturbs another thing, which in turn disturbs the first thing plus maybe a whole bunch of other things, and those things all disturb the first thing again, and so on, you get what chaos theory calls a feedback loop. These loops have a marvelous tendency for the end effect of the disturbance to be incalculable, and our inability to calculate such loops is perhaps a significant aspect of our being mere mortals. Of course, as you probably know, most programmers want to be gods, and when they are unable to know what the effect will be of a change they make to their code, they dislike this. As a result, they go to great lengths to reduce the tendency of code changes to the design of one object to have ripple effects upon other objects. A vitally important way to do this is to have very strictly defined interfaces to objects, and for the designer of each object to be able to know that the interface will never be violated when he writes it. This is called "object oriented design", or "structured programming", and if used well it can do a lot to reduce a type of chaotic behavior known as bugs. ;-) Verifying the avoidance of interactions that violate the design for an object is a key task in security auditing (inspecting the code to see if it has security holes).
The expressive power of an information system is proportional not to the number of objects that get implemented for it, but instead is proportional to the number of possible effective interactions between objects in it. (Reiser's Law Of Information Economics) This is similar to Adam Smith's observation that the wealth of nations is determined not by the number of their inhabitants, but by how well connected they are to each other. He traced the development of civilization throughout history, and found a consistent correlation between connectivity via roads and waterways, and wealth. He also found a correlation between specialization and wealth, and suggested that greater trade connectivity makes greater specialization economically viable. You can think of namespaces as forming the roads and waterways that connect the components of an operating system. The cost of these connecting namespaces is influenced by the number of interfaces that they must know how to connect to. That cost is, if they are not clever to avoid it, N times N, where N is the number of interfaces, since they must write code that knows how to connect every kind to every kind. One very important way to reduce the cost of fully connective namespaces is to teach all the objects how to use the same interface, so that the namespace can connect them without adding any code to the namespace. Very commonly, objects with different interfaces are segregated into different namespaces. If you have two namespaces, one with N objects, and another with M objects, the expressive power of the objects they connect is proportional to (N times N) plus (M times M), which is less than (N plus M) times (N plus M). Try it on a calculator for some arbitrary N and M. Usually the cost of inventing the namespaces is much less than the cost of the users creating all the objects. 
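The calculator exercise suggested above can be done in a few lines. This is just the arithmetic from the text, with names of our choosing:

```python
def expressive_power(*namespace_sizes):
    """Possible pairwise interactions when objects are segregated into
    namespaces of the given sizes: each namespace of size n contributes
    n*n, and objects in different namespaces cannot interact."""
    return sum(n * n for n in namespace_sizes)

n, m = 100, 50
separate = expressive_power(n, m)   # two segregated namespaces: N*N + M*M
unified = expressive_power(n + m)   # one shared namespace: (N+M)*(N+M)
assert separate < unified           # 12500 < 22500 for these sizes
```

The gap, 2*N*M, is exactly the cross-namespace interactions that segregation forbids, which is why unifying interfaces pays off more the larger the namespaces grow.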
This is what makes namespaces so exciting to work with: you can have an enormous impact on the productivity of the whole system just by being a bit fanatical in insisting on simplicity and consistency in a few areas. Please remember this analysis later when we describe why we implement everything to support a "file" or "directory" interface, and why we aren't eager to support objects with unnecessarily different namespaces/interfaces --- such as "attributes" that cannot interact with files in all the same ways that files can interact with files.

= Basic Semantics =

To interact with an object you name it, and you say what you want it to do. The filesystem takes the name you give, and looks through things we call directories to find the object, and then gives the object your request to do something.

== Files ==

[Illustration: character holding an object that looks like a sequence.]

A file is something that tries to look like a sequence of bytes. You can read the bytes, and write the bytes. You can specify what byte to start to read/write from (the offset), and the number of bytes to read/write (the count). [Diagram needed.] You can also cut bytes off of the end of the file.

[Illustration: character sawing off the end of a file.]

Cutting bytes out of the middle or the beginning of a file, and inserting bytes into the middle of a file, are not permitted by any of our current file plugins, all of which implement fairly ancient Unix file semantics, but this is likely to change someday.

=== The Software Engineering Lurking Below File Plugins ===

Your interactions with a file are handled by the file's "plugin". These interactions are structured into a set of limited and defined interactions (in programming, such structures are generally called "interfaces"). (We are too lazy to perform the infinite work of programming plugins to handle infinite types of interactions.) Each way you can interact with a plugin is called a "method". A plugin is composed as a set of such methods.
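The byte-sequence interface described above (read and write at an explicit offset and count, and cut bytes off the end) is the same one POSIX exposes; a short demonstration, not Reiser4 code:

```python
import os
import tempfile

# A scratch file standing in for any object presenting the
# byte-sequence interface.
fd, path = tempfile.mkstemp()
try:
    os.pwrite(fd, b"hello, world", 0)  # write bytes starting at offset 0
    tail = os.pread(fd, 5, 7)          # read count=5 bytes from offset=7
    os.ftruncate(fd, 5)                # cut bytes off the end of the file
    whole = os.pread(fd, 100, 0)       # a read past EOF simply stops early
finally:
    os.close(fd)
    os.remove(path)
```

After the truncate, `whole` holds only the first five bytes; cutting from the middle or inserting into the middle has no such one-call equivalent, just as the text says.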
Among programmers, laziness is considered the highest art form, and we do our best to express our souls in this art. This is why we have layers and layers of laziness built into our plugin architecture. Each method is composed from a library of functions we thought would be useful in constructing plugin methods. Each plugin is composed from a library of methods used by plugins, and a plugin can be considered a one-to-one mapping (that's where you have two sets of things, and for every member of one set, you specify a member of the other set as its match) of every way of interacting with the plugin to a method handling it. For every file, there is a file pluginid. Whenever you attempt to interact with a file, we take the name of the file, find the pluginid for the file, and inside the kernel we have an array of plugins [diagram needed that is suitable for persons who don't know what an array or offset is], and we use the pluginid as the offset of that file's plugin within that array. (An offset is a position relative to something else, and in programming it is typically measured in bytes.) This implies that when you invent a new file plugin, you have to recompile the kernel. (Programmers don't actually write programs; they got too lazy for that long ago. Instead they write instructions for the computer on how to write the program, and when the computer follows these instructions ("source code"), it is called "compiling", which programmers usually pretend was done by them when they speak about it, as in "I recompiled the kernel for my exact CPU this time, and now playing pong is noticeably faster.") You can only add plugins to the end of the list, and you can never reuse or change pluginids for a plugin, or else you will have to go through the whole filesystem changing all of the pluginids that are no longer correct.
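The pluginid dispatch described above can be sketched in miniature. This is a simplified model with names of our invention, not the actual Reiser4 structures: the kernel holds a table of plugins, each plugin maps interaction names to methods, and a file's stored pluginid is simply its plugin's offset in that table:

```python
# Methods for a trivial "unix-file" plugin; the file is modeled as a dict.
def unix_file_read(file, offset, count):
    return file["body"][offset:offset + count]

def unix_file_write(file, offset, data):
    body = file["body"]
    file["body"] = body[:offset] + data + body[offset + len(data):]

# The kernel's plugin table.  Because pluginids are offsets into this
# list, plugins may only be appended; ids can never be reused or renumbered
# without rewriting every pluginid stored in the filesystem.
PLUGIN_TABLE = [
    {"name": "unix-file", "read": unix_file_read, "write": unix_file_write},
]

def vfs_read(file, offset, count):
    plugin = PLUGIN_TABLE[file["pluginid"]]  # pluginid is the array offset
    return plugin["read"](file, offset, count)

f = {"pluginid": 0, "body": b"hello world"}
assert vfs_read(f, 6, 5) == b"world"
```

A new plugin is a new entry appended to `PLUGIN_TABLE`; in the real kernel that append is what forces the recompile.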
Someday in a later version we will revise this so that plugins are "dynamically loadable" (which is when you can add something to a program while it is running), and you can add support for new plugins to a running kernel. When we do that we will carefully benchmark and ensure that there is no loss of performance (or we won't do it) from using dynamic loading.

Programs are often "layered", which is when the program is divided into layers, and each layer only talks to the layer immediately above it, or immediately below it, and never talks to a part of the program two levels below it, etc. This reduces the complexity of the interfaces for the various parts of the program, and most of the complexity of a program is in coding its interfaces.

[Illustration: characters each communicating with adjacent characters only.]

Reiser4 has a "semantic layer", and this semantic layer concerns itself with naming objects and specifying what to do to the objects, and doesn't concern itself with such things as how to pack objects into particular places on disk or in the tree. An IO to a file may affect more than one physical sequence of bytes, or no physical sequence of bytes; it may affect the sequences of bytes offered by other files to the semantic layer, and the file plugin may invoke other plugins and delegate work to them, but its interface is structured for offering the caller the ability to read and/or write what the caller sees as being a single sequence of bytes. Appearances are what is wanted. When we say that security attributes are implemented as files, we mean that security attributes look like a sequence of bytes, but the security attributes may be stored in some compressed form that perhaps might be of fixed length, or even be just a single bit.
For the filesystem to offer the benefits of simplicity it need merely provide a uniform appearance that all things it stores are sequences of bytes, and there is nothing to prevent it from gaining efficiency through using many different storage implementations to offer this uniform appearance. For many files it is valuable for them to support efficient tree traversal to any offset in the sequence of bytes. It is not required though, and Unix/Gnu/Linux has traditionally supported some types of files which could not do this. A pipe will allow you to take the output of one command, and connect it to the input of another command, and each of the commands will see the pipe as a file. This pipe is an example of a file for which you cannot simply jump to the middle of the file efficiently, but instead must go through it from beginning to end in sequential order.

== Names and Objects ==

A name is a means of selecting an object. An object is anything that acts as though it is a single unified entity. What is an object is context dependent. For instance, if you tell an object to delete itself, many distinctly named entities (that are distinct objects in other ways such as reading) might well disappear as though they are a single object in response to the delete request.

A namespace is a mapping of names to objects. Filesystems, databases, search engines, environment variable names within shells, are all examples of namespaces. The early papers using the term tended to seek to convey that namespaces have commonality in their structure, are not fundamentally different, should be based on common design principles, and should be unified. Such unification is a bit of a quest for a holy grail. In British mythology King Arthur sent his knights out on a quest for the holy grail, and if only they could become worthy of it, it would appear to them. None of them found it, and yet the quest made them what they became.
Namespaces will never be unified, but the closer we can come to it, the more expressive power the OS will have. Reiser4 seeks to create a storage layer effective for such an eventually unified namespace, and gives it a semantic layer with some minor advantages over the state of the art. Later versions will add more and more expressive semantics to the storage layer.

Finding objects is layered. The semantic layer takes names and converts them into keys (we call this "resolving" the name). The storage layer (which contains the tree traversing code) takes keys and finds the bytes that store the parts of the object.

Keys are the fundamental name used by the Reiser4 tree. They are the name that the storage layer at the bottom of it all understands. They can be used to find anything in the tree, not just whole objects, but parts of objects as well. Everything in the tree has exactly one key. Duplicate keys are allowed, but their use usually means that all duplicates must be examined to see if they really contain what is sought, and so duplicates are usually rare if high performance is desired. Allowing duplicates can allow keys to be more compact in some circumstances (e.g. hashed directory entries). An objectid cannot be used for finding an object, only keys can. Objectids are used to compose keys so as to ensure that keys are unique.

== Ordering of Name Components ==

When designing the naming system described in the future vision whitepaper I broke names from human and computer languages into their pieces, and then looked at their pieces to see which ones differed from each other in meaningful ways vs. which pieces were different expressions that provided the same functionality. (In more formal language, I would say that I systematically decomposed the ways of naming things that we use in human and computer languages into orthogonal primitives, and then determined their equivalence classes.)
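The two layers described above can be sketched as a pair of toy mappings. This is an illustrative model only (the dictionaries, `make_key`, and the names here are invented, not the Reiser4 data structures): the semantic layer resolves a name to a key, the storage layer finds bytes by key, and keys are composed from an objectid plus an offset so that they stay unique.

```python
# Toy two-layer lookup: name -> key (semantic layer), key -> bytes (storage layer).

def make_key(objectid, offset):
    # an objectid alone cannot be used for lookup; it only makes keys unique
    return (objectid, offset)

semantic_layer = {"greeting": 42}             # name -> objectid (toy "directory")
storage_layer = {make_key(42, 0): b"hello"}   # key -> bytes (toy "tree")

def resolve(name):
    objectid = semantic_layer[name]           # "resolving" the name
    return make_key(objectid, 0)

def read(name):
    return storage_layer[resolve(name)]
```

Here `read("greeting")` returns `b"hello"`: only the key, never the bare objectid, is used to reach the storage layer.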
I then selected one way of expression from each set of ways that provided equivalent functionality. (Since that whitepaper is focused on what is not yet implemented, the whitepaper does not list all of the equivalence classes for names, but instead describes those which I thought I could say something interesting to the reader about. For instance, the NOT operator is simply unmentioned in it, as I really have nothing interesting to say about NOT, though it is very useful and will be documented when implemented.)

The ordering of two components of a name either has meaning, or it does not. If the resolution of one component of the name depends on what is named by another component, then that pair of name components forms a hierarchical name. Hierarchy can be indicated by means other than ordering. Many human languages indicate structure by use of suffix or tag mechanisms (e.g. Russian and Japanese). The syntactical mechanism one chooses to express hierarchy does not determine the possible semantics one can express, so long as at least one effective method for expressing hierarchy is allowed. I chose to offer only one expression from each equivalence class of naming primitives, and here I chose the '/' separated file pathname expression traditional to Unix, for pragmatic compatibility with existing operating systems. Reiser4 handles only hierarchical names, and non-hierarchical names are planned only for SSN Reiserfs.

== Directories ==

Hierarchical names are implemented in Reiser4 by use of directories. The first component of a hierarchical name is the name of the directory, and the components that follow are passed to the directory to interpret. We use `/' to separate the components of a hierarchical name. Directories may choose to delegate parts of their task to their sub-directories.
The unix directory plugin, when supplied with a name, will use the part of the name before the first / to select a sub-directory (if there is a / in what it is resolving), and delegate resolving the part of the name after the first / to the sub-directory. A directory can employ any arbitrary method at all of resolving the name components passed to it, so long as it returns a set of keys of objects as the result. In Reiser4, this set of keys always contains exactly one member, but this is designed to change in SSN Reiserfs. (Reiser4 also needs to interact with a standard interface for Unix filesystems called VFS (Virtual File System), and directories are also designed to be able to return what VFS understands, which we won't go into here.)

Directories will also return a list of names when asked. This list is not required to be a complete list of all names that they can resolve, and sometimes it is not desirable that it be so. Names can be hidden names in Reiser4. Directory plugins may be able to resolve more names than they can list, especially if they are written such that the number of names that they can resolve is infinite. In particular, such names can resolve to objects that behave like ordinary files (with respect to the standard file system interface: read, write, readdir, etc.) but are not backed by the storage layer. Such objects are called "pseudo files". Here is a list of pseudo files currently implemented in Reiser4 with a description of their semantics.

=== The Unix Directory Plugin ===

The unix directory plugin implements directories by storing a set of directory entries per directory. These directory entries contain a name, and a key.
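The delegation described above is ordinary recursive name resolution: resolve the component before the first /, then hand the remainder to the selected sub-directory. A minimal sketch, using plain Python dictionaries as stand-in directories (the structures and names here are illustrative, not the plugin's real data layout):

```python
# Each "directory" maps a name either to a sub-directory (another dict)
# or to a key (here just a string standing in for a real tree key).

def resolve(directory, name):
    if "/" not in name:
        return directory[name]                # base case: return the key
    head, rest = name.split("/", 1)           # component before the first '/'
    return resolve(directory[head], rest)     # delegate the rest downward

root = {"usr": {"bin": {"python": "key-of-python"}}}
```

With this, `resolve(root, "usr/bin/python")` returns `"key-of-python"`; each level handles exactly one component and delegates the remainder.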
When given a name to resolve, the unix directory plugin finds the directory entry containing that name, and then returns the key that is in the directory entry (more precisely, since a key selects not just the file but a particular byte within a file, it returns that part of the key which is sufficient to select the file, and which allows the code to determine the full keys for the file's various parts once the byte offset and some other fields (like item type) are added to the partial key to form a whole key). The key can then be used by the tree storage layer to find all the pieces of that which was named.

==== Some Historical Details Of Design Flaws In The Unix Directory Interface ====

Unix differs from Multics, in that Multics defined a file to be a sequence of elements (the elements could be bytes, directory entries, or something else....), while Unix defines a file to be purely a sequence of bytes. In Multics, directories were then considered to be a particular type of file which was a sequence of directory entries. For many years, all implementations of Unix directories were as sequences of bytes, and the notion of location within a Unix directory is tied not to a name as you might expect, but to a byte offset within the directory. The problem is that one is using a byte offset to represent a location whose true meaning is not a byte offset but a directory entry, and doing so for a particular file in a system which meaningfully names that file not by byte offset within the directory but by filename. Various efforts are being made in the Unix community to pretend that this byte offset is something more general than a byte offset, and they often try to do so without increasing the size used to store the thing which they pretend is not a byte offset. Since byte offsets are normally smaller than filenames are allowed to be, the result is ugliness and pathetic kludges.
Trust me that you would rather not know about the details of those kludges unless you absolutely have to, and let me say no more.

==== Directories Are Unordered ====

Unix/Linux makes no promises regarding the order of names within directories. The order in which files are created is not necessarily the order in which names will be listed in a directory, and the use of lexicographic (alphabetic) order is surprisingly rare. The unix utilities typically sort directory listings after they are returned by the filesystem, which is why it seems like the filesystem sorts them, and is why listing very large directories can be slow. (Our current default plugin sorts filenames that are less than 15 letters long lexicographically. For those that are more than 15 characters long, it sorts them first by their first 8 letters and then by the hash of the whole name.) There is value to allowing the user to specify an arbitrary order for names using an arbitrary ordering function the user supplies. This is not done in Reiser4, but is planned as a feature of later versions. Allowing the creation of a hash plugin is a limited form of this that is currently implemented.

== Files That Are Also Directories ==

In Reiser4 (but not ReiserFS 3) an object can be both a file and a directory at the same time. If you access it as a file, you obtain the named sequence of bytes. If you use it as a directory you can obtain files within it, directory listings, etc. There was a lengthy discussion on the Linux Kernel Mailing List about whether this was technically feasible to do. I won't reproduce it here except to summarize that Linus showed that this was feasible without "breaking" VFS. Allowing an object to be both a file and a directory is one of the features necessary to compose the functionality present in streams and attributes using files and directories.
To implement a regular unix file with all of its metadata, we use a file plugin for the body of the file, a directory plugin for finding file plugins for each of the metadata, and particular file plugins for each of the metadata. We use a unix_file file plugin to access the body of the file, and a unix_file_dir directory plugin to resolve the names of its metadata to particular file plugins for particular metadata. These particular file plugins for unix file metadata (owner, permissions, etc.) are implemented to allow the metadata normally used by unix files to be quite compactly stored.

=== Hidden Directory Entries ===

A file can exist but not be visible when using readdir in the usual way. WAFL does this with the .snapshots directory; it works well for them without disturbing users. This is useful for adding access to a variety of new features and their applications without disturbing the user when they are not relevant.

== New Security Attributes and Set Theoretic Semantic Purity ==

[Illustration: character holding primitive icons.]

=== Minimizing Number Of Primitives Is Important In Abstract Constructions ===

To a theoretician it is extremely important to minimize the number of primitives with which one achieves the desired functionality in an abstract construction. It is a bit hard to explain why this is so, but it is well accepted that breaking an abstract model into more basic primitives is very important. A not very precise explanation of why is to say that by breaking complex primitives into their more basic primitives, then recombining those basic primitives differently, you can usually express new things that the original complex primitives did not express. Let's follow this grand tradition of theoreticians and see what happens if we apply it to Gnu/Linux files and directories.

== Can We Get By Using Just Files and Directories (Composing Streams And Attributes From Files And Directories)? ==

In Gnu/Linux we have files, directories, and attributes. In NTFS they also have streams.
Since Samba is important to Gnu/Linux, there frequently are requests that we add streams to ReiserFS. There are also requests that we add more and more different kinds of attributes using more and more different APIs. Can we do everything that can be done with {files, directories, attributes, streams} using just {files, directories}? I say yes--if we make files and directories more powerful and flexible. I hope that by the end of reading this you will agree.

Let us have two basic objects. A file is a sequence of bytes that has a name. A directory is a name space mapping names to a set of objects "within" the directory. We connect these directory name spaces such that one can use compound names whose subcomponents are separated by a delimiter '/'.

What is missing from files and directories that attributes and streams offer? In ReiserFS 3, there exist file attributes. File attributes are out-of-band data describing the sequence of bytes which is the file. For example, the permissions defining who can access a file, or the last modification time, are file attributes. File attributes have their own API; creating new file attributes creates new code complexity and compatibility issues galore. ACLs are one example of new file attributes users want.

Since in Reiser4 files can also be directories, we can implement traditional file attributes as simply files. To access a file attribute, one need merely name the file, followed by a '/', followed by an attribute name. That is: a traditional file will be implemented to possess some of the features of a directory; it will contain files within the directory corresponding to file attributes which you can access by their names; and it will contain a file body which is what you access when you name the "directory" rather than the file.

Unix currently has a variety of attributes that are distinct from files (ACLs, permissions, timestamps, other mostly security related attributes, ...).
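The file-that-is-also-a-directory idea can be modeled in a few lines. This is a toy sketch, not the Reiser4 implementation: naming the object itself reads its body, while naming "object/attribute" reads an attribute stored as an ordinary small file. All class and attribute names here are invented for illustration:

```python
# A toy object that is both a file (it has a body you can read) and a
# directory (it resolves names to attribute files inside it).

class FileDir:
    def __init__(self, body, attrs=None):
        self.body = body
        self.attrs = attrs or {}
    def read(self):                    # file interface
        return self.body
    def lookup(self, name):            # directory interface
        return self.attrs[name]

essay = FileDir(b"the file body",
                attrs={"owner": FileDir(b"hans"), "mode": FileDir(b"0644")})

def open_path(obj, path):
    # empty path -> the body; "owner" -> the attribute stored as a file
    for component in filter(None, path.split("/")):
        obj = obj.lookup(component)
    return obj.read()
```

So `open_path(essay, "")` yields the body, and `open_path(essay, "owner")` yields `b"hans"`: one uniform read interface covers both the file and its "attributes".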
This is because a variety of people needed this feature and that, and there was no infrastructure that would allow implementing the features as fully orthogonal features that could be applied to any file. Reiser4 will create that infrastructure.

=== List Of Features Needed To Get Attribute And Stream Functionality From Files And Directories ===

* api efficient for small files
* efficient storage for small files
* plugins, including plugins that can compress a file serving as an attribute into a single bit
* files that also act as directories when accessed as directories
* inheritance (includes file aggregation)
* constraints
* transactions
* hidden directory entries

Each of these additional features is a feature that would benefit the filesystem. So we add them in v4.

= Basic Tree Concepts =

== Trees, Nodes, and Items ==

One way of organizing information is to put it into trees. When we organize information in a computer, we typically sort it into piles (nodes we call them), and there is a name (a pointer) for each pile that the computer will be able to use to find the pile.

[Illustration: a height = 4, 4 level, fanout = 3, balanced tree. It starts with a root node, then traverses 2 internal nodes, and ends with the leaf nodes, which hold the data and have no children.]

Figure 1. One Example Of A Tree.

Some of the nodes can contain pointers, and we can go looking through the nodes to find those pointers to (usually other) nodes. We are particularly interested in how to organize so that we can find things when we search for them. A tree is an organization structure that has some useful properties for that purpose.

Definition of Tree:
# A tree is a set of nodes organized into a root node, and zero or more additional sets of nodes called subtrees.
# Each of the subtrees is a tree.
# No node in the tree points to the root node, and exactly one pointer from a node in the tree points to each non-root node in the tree.
# The root node has a pointer to each of its subtrees, that is, a pointer to the root node of the subtree.

== Fine Points of the Definition ==

[Illustration: the absolutely most trivial of all graphs, the single, isolated node.]

Figure 2. The simplest tree.

[Illustration: a trivial, connected, linear (unary) graph: a linear sequence of nodes connected by paths (edges, pointers).]

Figure 3. A trivial, linear tree.

It is interesting to argue over whether finite should be a part of the definition of trees. There are many ways of defining trees, and which is the best definition depends on what your purpose is. Donald Knuth (a well known author of algorithm textbooks) supplies several definitions of tree. As his primary definition of tree he even supplies one which has no pointers/edges/lines in the definition, just sets of nodes. Reiser4 uses a finite tree (the number of nodes is limited). Knuth defines trees as being finite sets of nodes. There are papers on infinite trees on the Internet. I think it more appropriate to consider finite an additional qualifier on trees, rather than bundling finite into the definition. However, I personally only deal with finite trees in my storage layer research. It is interesting to consider whether storage layers are inherently more motivated than semantic layers to limit themselves to finite trees rather than infinite trees. This is where some writers would say ".... is left as an exercise for the reader". :-) Oh the temptation.... I will remind the reader of my explanation of why storage layer trees are more motivated to be acyclic, and, at the cost of some effort at honesty, constrain myself to saying that doing more than providing that hint is beyond my level of industry. ;-)

Edge is a term often used in tree definitions. A pointer is unidirectional (you can follow it from the node that has it to the node it points to, but you cannot follow it back from the node it points to to the node that has it).
An edge is bidirectional (you can follow it in both directions). Here are three alternative tree definitions, which are interesting in how they are mathematically equivalent to each other, though they are not equivalent to the definition I supplied because edges are not equivalent to pointers. For all three of these definitions, let there be not more than one edge connecting the same two nodes:

* a set of vertices (aka points) connected by edges (aka lines) for which the number of edges is one less than the number of vertices
* a set of vertices connected by edges which has no cycles (a cycle is a path from a vertex to itself)
* a set of vertices connected by edges for which there is exactly one path connecting any two vertices

The three alternative definitions do not have a unique root in their tree, and such trees are called free trees. The definition I supplied is a definition of a rooted tree, not a free tree. It also has no cycles, it has one less pointer than it has nodes, and there is exactly one path from the root to any node. Please feel encouraged to read Knuth's writings for more discussions of these topics.

= Graphs vs. Trees =

Consider the purposes for which you might want to use a graph, and those for which you might want to use a tree. In a tree there is exactly one path from the root to each node in the tree, and a tree has the minimum number of pointers sufficient to connect all the nodes. This makes it a simple and efficient structure. Trees are useful for when efficiency with minimal complexity is what is desired, and there is no need to reach a node by more than one route. Reiser4 has both graphs and trees, with trees used for when the filesystem chooses the organization (in what we call the storage layer, which tries to be simple and efficient), and graphs for when the user chooses the organization (in the semantic layer, which tries to be expressive so that the user can do whatever he wants).
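The free-tree characterizations above can be checked empirically (this is a demonstration on one small example, not a proof; the function is a standard union-find connectivity check, with names invented here):

```python
# For a connected undirected graph, "no cycles", "|E| == |V| - 1", and
# "exactly one path between any two vertices" hold or fail together.
# Union-find: merging the endpoints of every edge must never close a cycle.

def is_tree(vertices, edges):
    parent = {v: v for v in vertices}
    def find(v):
        while parent[v] != v:
            v = parent[v]
        return v
    for a, b in edges:
        ra, rb = find(a), find(b)
        if ra == rb:
            return False          # this edge would create a cycle
        parent[ra] = rb
    # connected iff exactly one component remains; |E| == |V| - 1 follows
    return len({find(v) for v in vertices}) == 1
```

For instance, `is_tree({1, 2, 3, 4}, [(1, 2), (1, 3), (3, 4)])` is true (3 edges, 4 vertices, no cycle); adding the edge `(2, 3)` makes it false, since it closes a cycle and gives a second path between 2 and 3.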
== Ordering The Tree Aids Searching Through It ==

=== Keys ===

We assign everything stored in the tree a key. We find things by their keys. Use of keys gives us additional flexibility in how we sort things, and if the keys are small, it gives us a compact means of specifying enough to find the thing. It also limits what information we can use for finding things. This limit restricts its usefulness, and so we have a storage layer, which finds things by keys, and a semantic layer, which has a rich naming system. The storage layer chooses keys for things solely to organize storage in a way that will improve performance, and the semantic layer understands names that have meaning to users. As you read, you might want to think about whether this is a useful separation that allows freedom in adding improvements that aid performance in the storage layer, while escaping paying a price for the side effects of those improvements on the flexible naming objectives of the semantic layer.

== Choosing Which Subtree ==

We start our search at the root, because from the root we can reach every other node. How do we choose which subtree of the root to go to from the root? The root contains pointers to its subtrees. For each pointer to a subtree there is a corresponding left delimiting key. Pointers to subtrees, and the subtrees themselves, are ordered by their left delimiting key. A subtree pointer's left delimiting key is equal to the least key of the things in the subtree. Its right delimiting key is larger than the largest key in the subtree, and it is the left delimiting key of the next subtree of this node. Each subtree contains only things whose keys are at least equal to the left delimiting key of its pointer, and are not more than its right delimiting key. If there are no duplicate keys in the tree, then each subtree contains only things whose keys are less than its right delimiting key.
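With the pointers sorted by left delimiting key, choosing a subtree is a binary search. A minimal sketch (names invented; this assumes no duplicate keys and a search key at least as large as the first left delimiting key):

```python
# Pick the subtree for a search key: the last child whose left delimiting
# key does not exceed the search key.

import bisect

def choose_subtree(left_delimiting_keys, child_pointers, search_key):
    # index of the last delimiting key <= search_key
    i = bisect.bisect_right(left_delimiting_keys, search_key) - 1
    return child_pointers[i]
```

With delimiting keys `[0, 100, 200]` and children `["A", "B", "C"]`, a search for key 150 descends into "B" (its range is [100, 200)), and a search for 200 descends into "C".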
If there are no duplicate keys, then by looking within a node at its pointers to subtrees and their delimiting keys we know what subtree of that node contains the thing we are looking for. Duplicate keys are a topic for another time. For now I will just hint that when searching through objects with duplicate keys we find the first of them in the tree, and then we search through all duplicates one-by-one until we find what we are looking for. Allowing duplicate keys can allow for smaller keys, so there is sometimes a tradeoff between key size and the average frequency of such inefficient linear searches. Using duplicate keys can also allow, if one defines one's insertion algorithms such that they always insert at the end of a set of duplicate keys, ordering objects with the same key by creation time.

The contents of each node in the tree are sorted within the node. So, the entire tree is sorted by key, and for a given key we know just where to go to find at least one thing with that key.

== Nodes ==

=== Leaves, Twigs, and Branches ===

Leaves are nodes that have no children. Internal nodes are nodes that have children.

[Illustration: a height = 4, 4 level, fanout = 3, balanced tree. It starts with an internal root node, then traverses 2 internal branch nodes, and ends with the leaf nodes, which hold the data and have no children.]

Figure 4. A height = 4, fanout = 3, balanced tree. A search will start with the root node, the sole level 4 internal node, traverse 2 more internal nodes, and end with a leaf node which holds the data and has no children.

A node that contains items is called a formatted node. If an object is large, and is not compressed and doesn't need to support efficient insertions (compressed objects are special because they need to be able to change their space usage when you write to their middles, because the compression might not be equally efficient for the new data), then it can be more efficient to store it in nodes without any use of items at all.
We do so by default for objects larger than 16k. Unformatted leaves (unfleaves) are leaves that contain only data, and do not contain any formatting information. Only leaves can contain unformatted data. Pointers are stored in items, and so all internal nodes are necessarily formatted nodes. Pointers to unfleaves are different in their structure from pointers to formatted nodes.

Extent pointers point to unfleaves. An extent is a sequence of unfleaves that are contiguous in block number order and belong to the same object. An extent pointer contains the starting block number of the extent, and a length. [diagram needed] Because the extent belongs to just one object, we can store just one key for the extent, and then we can calculate the key of any byte within that extent. If the extent is at least 2 blocks long, extent pointers are more compact than regular node pointers would be.

Node pointers are pointers to formatted nodes. We do not yet have a compressed version of node pointers, but they are probably soon to come. Notice how with extent pointers we don't have to store the delimiting key of each node pointed to, and with node pointers we need to. We will probably introduce key compression at the same time we add compressed node pointers. One would expect keys to compress well since they are sorted into ascending order. We expect our node and item plugin infrastructure will make such features easy to add at a later date.

Twigs are parents of leaves. Extent pointers exist only in twigs. This is a very controversial design decision I will discuss a bit later. Branches are internal nodes that are not twigs. You might think we would number the root level 1, but since the tree grows at the top, it turns out to be more useful to number as 1 the level with the leaves where object data is stored. The height of the tree will depend upon how many objects we have to store and what the fanout rate (average number of children) of the internal and twig nodes will be.
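The claim above that one key per extent suffices can be shown with a little arithmetic: because the extent's blocks are contiguous and belong to one object, the block holding any byte offset is computable from the extent's starting key offset, starting block, and length. A sketch (names and the 4096-byte block size are assumptions for illustration):

```python
# Map a byte offset within an object to the disk block holding it,
# given one extent pointer: (key byte offset, start block, block count).

BLOCK_SIZE = 4096

def block_for_offset(extent_key_offset, start_block, length, byte_offset):
    index = (byte_offset - extent_key_offset) // BLOCK_SIZE
    assert 0 <= index < length, "offset not covered by this extent"
    return start_block + index   # contiguity makes this a simple addition
```

For an extent starting at key offset 0, disk block 1000, length 4, byte offset 8192 lands in block 1002: no per-block keys or pointers were needed.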
For reasons of code simplicity, we find it easiest to implement Reiser4 such that it has a minimum height of 2, and the root is always an internal node. There is nothing deeper than judicious laziness to this: it simplifies the code to not deal with one node trees, and nobody cares about the waste of space. An example of a Reiser4 tree:

[Illustration: a tree, starting with a root node, then traversing branch nodes, including the internal nodes called twig nodes (a Reiser4 feature), and ending with the leaf nodes, which hold the data and have no children.]

Figure 5. This Reiser4 tree is a 4 level, balanced tree with a fanout of 3. In practice Reiser4 fanout is much higher and varies from node to node, but a 4 level tree diagram with 16 million leaf nodes won't fit easily onto my monitor so I drew something smaller.... ;-)

=== Size of Nodes ===

We choose to make the nodes equal in size. This makes it much easier to allocate the unused space between nodes, because it will be some multiple of node size, and there are no problems of space being free but not large enough to store a node. Also, disk drives have an interface that assumes equal size blocks, which they find convenient for their error-correction algorithms. If having the nodes be equal in size is not very important, perhaps due to the tree fitting into RAM, then using a class of algorithms called skip lists is worthy of consideration. Reiser4 nodes are usually equal to the size of a page, which if you use Gnu/Linux on an Intel CPU is currently 4096 (4k) bytes. There is no measured empirical reason to think this size is better than others; it is just the one that Gnu/Linux makes easiest and cleanest to program into the code, and we have been too busy to experiment with other sizes.

=== Sharing Blocks Saves Space ===

If nodes are of equal size, how do we store large objects? We chop them into pieces. We call these pieces items. Items are sized to fit within a single node. Conventional filesystems store files in whole blocks.
Roughly speaking, this means that on average half a block of space is wasted per file because not all of the last block of the file is used. If a file is much smaller than a block, then the space wasted is much larger than the file. It is not effective to store such typical database objects as addresses and phone numbers in separately named files in a conventional filesystem because it will waste more than 90% of the space in the blocks it stores them in. By putting multiple items within a single node in Reiser4, we are able to pack multiple small pieces of files into one block. Our space efficiency is roughly 94% for small files. This does not count per item formatting overhead, whose percentage of total space consumed depends on average item size, and for that reason is hard to quantify.

Aligning files to 4k boundaries does have advantages for large files though. When a program wants to operate directly on file data without going through system calls to do it, it can use mmap() to make the file data part of the process's directly accessible address space. Due to some implementation details mmap() needs file data to be 4k aligned, and if the data is already 4k aligned, it makes mmap() much more efficient. In Reiser4 the current default is that files that are larger than 16k are 4k aligned. We don't yet have enough empirical data and experience to know whether 16k is the precise optimal default value for this cutoff point, but so far it seems to at least be a decent choice.

== Items ==

Nodes in the tree are smaller than some of the objects they hold, and larger than some of the objects they hold, so how do we store them? One way is to pour them into items. An item is a data container that is contained entirely within a single node, and it allows us to manage space within nodes. For the default 4.0 node format, every item has a key, an offset to where in the node the item body starts, a length of the item body, and a pluginid that indicates what type of item it is.
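The whole-block waste described a few paragraphs above can be sanity-checked with a little arithmetic (the numbers here are illustrative, not measurements of any real filesystem):

```python
# Space efficiency of whole-block storage: a file occupies ceil(size/BLOCK)
# full blocks, so small files waste almost the entire block.

import math

BLOCK = 4096

def whole_block_efficiency(file_size):
    blocks = math.ceil(file_size / BLOCK)
    return file_size / (blocks * BLOCK)
```

A 100-byte "phone number" file stored in its own 4k block is about 2.4% efficient, consistent with the "waste more than 90% of the space" claim, while a file of exactly 4096 bytes is 100% efficient; packing many such small items into one node is what recovers the wasted space.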
Items allow us to not have to round up to 4k the amount of space required to store an object.

The structure of an item: the Item_Head, which is stored separately from the Item_Body, holds the Item_Key, Item_Offset, Item_Length, and Item_Plugin_id.

Types Of Items

Reiser4 includes many different kinds of items designed to hold different kinds of information.

* static_stat_data: holds the owner, permissions, last access time, creation time, last modification time, size, and the number of links (names) to a file.
* cmpnd_dir_item: holds directory entries, and the keys of the files they link to.
* extent pointers: explained above.
* node pointers: explained above.
* bodies: hold parts of files that are not large enough to be stored in unformatted leaves (unfleaves).

== Units ==

We call a unit that which we must place as a whole into an item, without splitting it across multiple items. When traversing an item's contents it is often convenient to do so in units:

* For body items the units are bytes.
* For directory items the units are directory entries. The directory entries contain a name and a key of the file named (or at least the item plugin can pretend they do; in practice the name and key may be compressed).
* For extent items the units are extents. Extent items only contain extents from the same file.
* For static_stat_data the whole stat data item is one indivisible unit of fixed size.

What the Default Node Formats For ReiserFS 4.0 Look Like

An unformatted leaf node (unfleaf node), which is the only node without a Node_Header, has the trivial structure:

 Data.........................................................................

A formatted leaf node has the structure:

 Block_Head | Item_Body0 | Item_Body1 | ... | Item_Bodyn | ...free space... | Item_Headn | ... | Item_Head1 | Item_Head0

A twig node has the structure:

 Block_Head | Item_Body0 = NodePointer0 | Item_Body1 = ExtentPointer1 | Item_Body2 = NodePointer2 | Item_Body3 = ExtentPointer3 | ... | Item_Bodyn = NodePointern | ...free space... | Item_Headn | ... | Item_Head0

A branch node has the structure:

 Block_Head | Item_Body0 = NodePointer0 | ... | Item_Bodyn = NodePointern | ...free space... | Item_Headn | ... | Item_Head0

Tree Design Concepts

Height Balancing versus Space Balancing

Height balanced trees are trees in which each possible search path from root node to leaf node has exactly the same length (length = number of nodes traversed from root node to leaf node). For instance, the height of the tree in Figure 1 is four, while the height of the left hand tree in Figure 1.3 is three, and that of the single node in Figure 2 is 1. The term balancing is used for several very distinct purposes in the balanced tree literature. Two of the most common are: to describe balancing the height, and to describe balancing the space usage within the nodes of the tree. These quite different definitions are unfortunately a classic source of confusion for readers of the literature. Most algorithms for accomplishing height balancing do so by only growing the tree at the top. Thus the tree never gets out of balance.

Figure 6. An unbalanced tree: a 4 level tree with fanout n = 3 that has lost some nodes to deletions and needs to be balanced.

Three of the principal considerations in tree design are:

* the fanout rate (see below)
* the tightness of packing
* the amount of shifting of items from one node to another that is performed (which creates delays due to waiting while things move around in RAM, and on disk).

== Fanout ==

The fanout rate n refers to how many nodes may be pointed to by each level's nodes.
(See Figure 7.) If each node can point to n nodes of the level below it, then starting from the top, the root node points to n internal nodes at the next level, each of which points to n more internal nodes at its next level, and so on: m levels of internal nodes can point to n^m leaf nodes containing items in the last level. The more you want to be able to store in the tree, the larger the fields in the key must be that first distinguish the objects (the objectids), and then select parts of the object (the offsets). This means your keys must be larger, which decreases fanout (unless you compress your keys, but that will wait for our next version....).

Figure 7. Three 4 level, height balanced trees with fanouts n = 1, 2, and 3. The first is a four level tree with fanout n = 1: it has just four nodes, starting with the (red) root node, traversing the (burgundy) internal and (blue) twig nodes, and ending with the (green) leaf node which contains the data. The second tree, with 4 levels and fanout n = 2, starts with a root node and traverses 2 internal nodes, each of which points to two twig nodes (for a total of four twig nodes), and each of these points to 2 leaf nodes, for a total of 8 leaf nodes. Lastly, a 4 level, fanout n = 3 tree is shown which has 1 root node, 3 internal nodes, 9 twig nodes, and 27 leaf nodes.
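The fanout arithmetic above can be sketched directly. This is a small illustrative computation, not filesystem code:

```python
# Sketch of fanout arithmetic: with fanout n and m levels of internal nodes,
# the tree addresses n**m leaves through 1 + n + ... + n**(m-1) internal
# nodes, which is why high fanout makes the internal nodes easy to cache.
def leaves(n, m):
    return n ** m

def internal_nodes(n, m):
    # 1 root + n + n**2 + ... + n**(m-1)
    return sum(n ** i for i in range(m))

for n in (1, 2, 3):
    print(n, leaves(n, 3), internal_nodes(n, 3))
# fanout 3 reaches 27 leaves through only 13 internal nodes (cf. Figure 7)
```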
== What Are B+Trees, and Why Are They Better than B-Trees ==

It is possible to store not just pointers and keys in internal nodes, but also the objects those keys correspond to. This is what the original B-tree algorithms did. Then B+trees were invented, in which only pointers and keys are stored in internal nodes, and all of the objects are stored at the leaf level.

Figure 8. Figure 9.

Warning! I found from experience that most persons who don't first deeply understand why B+trees are better than B-trees won't later understand explanations of the advantages of putting extents on the twig level rather than using BLOBs. The same principles that make B+trees better than B-trees also make Reiser4 faster than using BLOBs like most databases do. So make sure this section fully digests before moving on to the next section, ok? ;-)

B+Trees Have Higher Fanout Than B-Trees

Fanout is increased when we put only pointers and keys in internal nodes, and don't dilute them with object data. Increased fanout increases our ability to cache all of the internal nodes, because there are fewer internal nodes. Often persons respond to this by saying, "but B-trees cache objects, and caching objects is just as valuable". The answer is that, on average, it is not. Of course, discussing averages makes the discussion much harder. We need to discuss some cache design principles for a while before we can get to this.

= Cache Design Principles =

== Reiser's Untie The Uncorrelated Principle of Cache Design ==

Tying the caching of things whose usage does not strongly correlate is bad. Suppose:

* you have two sets of things, A and B.
* you need things from those two sets at semi-random, with there existing a tendency for some items to be needed much more frequently than others, but which items those are can shift slowly over time.
* you can keep things around after you use them in a cache of limited size.
* you tie the caching of every thing from A to the caching of another thing from B (that is, whenever you fetch something from A into the cache, you fetch its partner from B into the cache).

Then this increases the amount of cache required to store everything recently accessed from A. If there is a strong correlation between the need for the two particular objects that are tied in each of the pairings, stronger than the gain from spending those cache resources on caching more members of B according to the LRU algorithm, then this might be worthwhile. If there is no such strong correlation, then it is bad. But wait, you might say, you need things from B also, so it is good that some of them were cached. Yes, you need some random subset of B. The problem is that without a correlation existing, the things from B that you need are not especially likely to be those same things from B that were tied to the things from A that were needed.

This tendency to inefficiently tie things that are randomly needed exists outside the computer industry. For instance, suppose you like both popcorn and sushi, with your need for them on a particular day being random, and suppose that you like movies randomly. If a theater requires you to eat only popcorn while watching the movie you randomly found optimal to watch, and not eat sushi from the restaurant on the corner while watching that movie, is this a socially optimum system? Suppose quality is randomly distributed across all the hot dog vendors: if you can only eat the hot dog produced by the best movie displayer on a particular night that you want to watch a movie, and you aren't allowed to bring in hot dogs from outside the movie theater, is it a socially optimum system? Optimal for you?

Tying the uncorrelated is a very common error in designing caches, but it is still not enough to describe why B+trees are better. With internal nodes, we store more than one pointer per node. That means that pointers are not separately cached.
You could well argue that pointers and the objects they point to are more strongly correlated than the different pointers are with each other. We need another cache design principle.

== Reiser's Maximize The Variance Principle of Cache Design ==

If two types of things that are cached and accessed, in units that are aggregates, have different average temperatures, then segregating the two types into separate units helps caching. For balanced trees, these units of aggregation are nodes. This principle applies to the situation where it may be necessary to tie things into larger units for efficient access, and it guides what things should be tied together.

Suppose you have R bytes of RAM for cache, and D bytes of disk. Suppose that 80% of accesses are to the most recently used things, which are stored in H (hotset) bytes of nodes. Reducing the size of H to where it is smaller than R is very important to performance. If you evenly disperse your frequently accessed data, then a larger cache is required and caching is less effective.

# If, all else being equal, we increase the variation in temperature among all aggregates (nodes), then we increase the effectiveness of using a fast small cache.
# If two types of things have different average temperatures (ratios of likelihood of access to size in bytes), then separating them into separate aggregates (nodes) increases the variation in temperature in the system as a whole.
# Conclusion: if all else is equal, and two types of things cached several to an aggregate (node) have different average temperatures, then segregating them into separate nodes helps caching.

Pointers To Nodes Have A Higher Average Temperature Than The Nodes They Point To

Pointers to nodes tend to be frequently accessed relative to the number of bytes required to cache them. Consider that you have to use the pointers for all tree traversals that reach the nodes beneath them, and they are smaller than the nodes they point to.
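The economics of this segregation can be sketched with made-up numbers. The node size, pointer size, and hot-set size below are assumptions for illustration, not measurements of any real filesystem:

```python
# Made-up numbers sketching the "maximize the variance" argument: how much
# cache it takes to hold 1000 hot 64-byte node pointers when they are packed
# into pointer-only nodes versus diluted one per node among cold object data.
NODE = 4096
HOT_ITEMS = 1000
ITEM = 64

# Segregated: the hot pointers share nodes, so few nodes must stay cached.
segregated_nodes = -(-HOT_ITEMS * ITEM // NODE)   # ceiling division
# Diluted (B-tree style): each hot pointer drags a whole mostly-cold node in.
diluted_nodes = HOT_ITEMS

print(segregated_nodes * NODE)  # 65536 bytes of cache suffice
print(diluted_nodes * NODE)     # 4096000 bytes of cache needed
```

Under these assumptions, segregating the pointers shrinks the cache footprint of the hot set by a factor of more than 60.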
Putting only node pointers and delimiting keys into internal nodes concentrates the pointers. Since pointers tend to be more frequently accessed per byte of their size than items storing file bodies, a high average temperature difference exists between pointers and object data. According to the caching principles described above, segregating these two types of things with different average temperatures, pointers and object data, increases the efficiency of caching.

== Segregating By Temperature Directly ==

Now you might say, well, why not segregate by actual temperature instead of by type, which only correlates with temperature? We do what we can code easily and effectively, and temperature segregation is not the only consideration. There are tree designs which rearrange the tree so that objects with a higher temperature are higher in the tree than objects with a lower temperature. The difference in average temperature between object data and pointers to nodes is so high that I don't find such designs a compelling optimization, and they add complexity. I could be wrong. If one had no compelling semantic basis for aggregating objects near each other (this is true for some applications), and if one wanted to access objects by nodes rather than individually, it would be interesting to have a node repacker sort object data into nodes by temperature. You would need to have the repacker change the keys of the objects it sorts. Perhaps someone will have us implement that for some application someday for Reiser4.

BLOBs Unbalance the Tree, Reduce Segregation of Pointers and Data, and Thereby Reduce Performance

BLOBs, Binary Large OBjects, are a method of storing objects larger than a node by storing pointers to the nodes containing the object. These pointers are commonly stored in what are called the leaf nodes (level 1, except that the BLOBs are then sort of a basement "level B" :-\ ) of a "B*" tree.
Figure 10. A tree that was four levels until a Binary Large OBject (BLOB) was inserted, with pointers to its blocks stored in a leaf node; in this case the BLOB's blocks are all contiguous. This is what a ReiserFS V3 tree looks like.

BLOBs are a significant unintentional definitional drift, albeit one accepted by the entire database community. This placement of pointers into nodes containing data is a performance problem for ReiserFS V3, which uses BLOBs. (Never accept that "let's just try it my way and see and we can change it if it doesn't work" argument. It took years and a disk format change to get BLOBs out of ReiserFS, and performance suffered the whole time, if tails were turned on.) Because the pointers to BLOBs are diluted by data, caching all pointers to all nodes in RAM is infeasible for typical file sets.

Reiser4 returns to the classical definition of a height balanced tree, in which the lengths of the paths to all leaf nodes are equal. It does not try to pretend that all of the nodes storing objects larger than a node are somehow not part of the tree even though the tree stores pointers to them. As a result, the amount of RAM required to store pointers to nodes is dramatically reduced. For typical configurations, RAM is large enough to hold all of the internal nodes.

Figure 11. A Reiser4, 4 level, height balanced tree with fanout = 3, with the data that was stored in BLOBs now stored in extents in the level 1 leaf nodes and pointed to by extent pointers stored in the level 2 twig nodes. In this case the BLOB's blocks are all contiguous.

Gray and Reuter say the criterion for searching external memory is to "minimize the number of different pages along the average (or longest) search path.
....by reducing the number of different pages for an arbitrary search path, the probability of having to read a block from disk is reduced." (1993, Transaction Processing: Concepts and Techniques, Morgan Kaufmann Publishers, San Francisco, CA, p.834 ...)

My problem with this explanation of why the height balanced approach is effective is that it does not convey that you can get away with having a moderately unbalanced tree, provided that you do not significantly increase the total number of internal nodes. In practice, most trees that are unbalanced do have significantly more internal nodes. In practice, most moderately unbalanced trees have a moderate increase in the cost of in-memory tree traversals, and an immoderate increase in the amount of IO due to the increased number of internal nodes. But if one were to put all the BLOBs together in the same location in the tree, then since the number of internal nodes would not significantly increase, the performance penalty for having them on a lower level of the tree than all other leaf nodes would not be a significant additional IO cost. There would be a moderate increase in that part of the tree traversal time cost which is dependent on RAM speed, but this would not be so critical. Segregating BLOBs could perhaps substantially recover the performance lost by architects not noticing the drift in the definition of height balancing for trees. It might be undesirable to segregate objects by their size rather than just their semantics, though. Perhaps someday someone will try it and see what results.

== Dancing Trees Are Faster Than Balanced Trees ==

Balanced trees have traditionally employed a fixed criterion for determining whether nodes should be squeezed together into fewer nodes so as to save space. This criterion is traditionally satisfied at the end of every modification to the tree.
A typical such criterion is to guarantee that after each modification to the tree, the modified node cannot be squeezed together with its left and right neighbors into two or fewer nodes. ReiserFS V3 uses that criterion for its leaf nodes. The more neighboring nodes you consider for squeezing into one fewer node, the more memory bandwidth you consume on average per modification to the tree, and the more likely you are to need to read those nodes because they are not in memory. It is a typical pattern in memory management algorithm design that the more tightly packed memory is kept, the more overhead is added to the cost of changing what is stored where. This overhead can be significant enough that some commercial databases actually only delete nodes when they are completely empty, and they feel that in practice this works well.

Trees that adhere to fixed space usage balancing criteria can have many things rigorously proven about their worst case performance in publishable papers. This is different from their being optimal. An algorithm can have worse bounds on its theoretical worst case performance and still be a better algorithm. Just because one cannot rigorously define average usage patterns does not mean they are the slightest bit less important. Sorry, mere mortal mathematicians, that is life. Maybe some might prefer to think about the questions that they can define and answer rigorously, but this does not in the slightest make them the right questions. Yes, I am a chaotic....

In Reiser4 we employ not balanced trees, but dancing trees. Dancing trees merge insufficiently full nodes, not with every modification to the tree, but instead:

* in response to memory pressure triggering a flush to disk,
* as a result of a transaction closure flushing nodes to disk.

If It Is In RAM, Dirty, and Contiguous, Then Squeeze It ALL Together Just Before Writing

Let a slum be defined as a sequence of nodes that are contiguous in the tree order and dirty in this transaction.
(In simpler words, a bunch of dirty nodes that are right next to each other.) A dancing tree responds to memory pressure by squeezing and flushing slums. It is possible that merely squeezing a slum might free up enough space that flushing is unnecessary, but the current implementation of Reiser4 always flushes the slums it squeezes. This is not necessarily the right approach, but we found it simpler and good enough for now. Another simplification we choose to engage in for now is that instead of trying to estimate whether squeezing a slum will save space before squeezing it, we just squeeze it and see.

Balanced trees have an inherent tradeoff between balancing cost and space efficiency. If they consider more neighboring nodes, for the purpose of merging them to save a node, with every change to the tree, then they can pack the tree more tightly, at the cost of moving more data with every change to the tree. By contrast, with a dancing tree, you simply take a large slum, shove everything in it as far to the left as it will go, and then free all the nodes in the slum that are left with nothing remaining in them, at the time of committing the slum's contents to disk in response to memory pressure. This gives you extreme space efficiency when slums are large, at a cost in data movement that is lower than it would be with an invariant balancing criterion, because it is done less often. By compressing at the time one flushes to disk, one compresses less often, and that means one can afford to do it more thoroughly. By compressing dirty nodes that are in memory, one avoids performing additional I/O as a result of balancing.

Procrastination Leads To Wiser Decisions: Allocate on Flush

ReiserFS V3 assigns block numbers to nodes as it creates them. XFS is smarter: it waits until the last moment just before writing nodes to disk. I'd like to thank the XFS team for making an effort to ensure that I understood the merits of their approach.
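The shove-left squeeze of a slum can be sketched in miniature. This is a toy model only: nodes are just lists of item sizes, the 4096-byte capacity is an assumption, and items are treated as freely shiftable, none of which matches Reiser4's actual flush code:

```python
# Illustrative shove-left squeeze of a slum: a toy model where nodes are
# lists of item sizes and capacity is 4096 bytes.
CAPACITY = 4096

def squeeze(slum):
    """Pack all items leftward; nodes left empty at the end are freed."""
    items = [size for node in slum for size in node]   # tree-order item list
    packed, current, used = [], [], 0
    for size in items:
        if used + size > CAPACITY:   # current node full: move to the next
            packed.append(current)
            current, used = [], 0
        current.append(size)
        used += size
    if current:
        packed.append(current)
    return packed

slum = [[1000, 500], [700], [2000, 300], [600]]
print(len(squeeze(slum)))  # 2: the six items fit in two nodes, freeing two
```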
The easy way to see its merits is to consider a file that is deleted before it reaches disk. Such a file should have no effect on the disk layout.

Reiser4 The Atomic Filesystem

Reducing The Damage of Crashing

When a computer crashes, data in RAM which has not reached disk is lost. You might at first be tempted to think that we then want to keep all of the data that did reach disk. Suppose that you were performing a transfer of $10 from bank account A to bank account B, and this consisted of two operations: 1) debit $10 from A, and 2) credit $10 to B. Suppose that 1) but not 2) reached disk and the computer crashed. It would be better to disregard 1) than to let 1) but not 2) take effect, yes? When there is a set of operations which we ensure will either all take effect or none take effect, we call the set as a whole an atom.

Reiser4 implements all of its filesystem system calls (requests to the kernel to do something are called system calls) as fully atomic operations, and allows one to define new atomic operations using its plugin infrastructure. Why don't all filesystems do this? Performance. Reiser4 employs new algorithms that allow it to make these operations atomic at little additional cost, where other filesystems have paid a heavy, usually prohibitive, price to do that. We hope to share with you how that is done.

= A Brief History Of How Filesystems Have Handled Crashes =

== Filesystem Checkers ==

Originally filesystems had filesystem checkers that would run after every crash. The problem with that was that 1) the checkers cannot handle every form of damage well, and 2) the checkers run for a long time.
The amount of data stored on hard drives increased faster than the transfer rate (the rate at which hard drives transfer their data from the platter spinning inside them into the computer's RAM when they are asked to do one large continuous read, or the rate in the other direction for writes), which means that the checkers took longer to run, and as the decades ticked by it became less and less reasonable for a mission critical server to wait for the checker.

== Fixed Location Journaling ==

A solution was adopted of first writing each atomic operation to a location on disk called the journal or log, and then, only after each atom had fully reached the journal, writing it to the committed area of the filesystem. The problem with this is that twice as much data needs to be written. On the one hand, if the workload is dominated by seeks, this is not as much of a burden as one might think. On the other hand, for writes of large files, it halves performance, because such writes are usually transfer time dominated. For this reason, meta-data journaling came to dominate general purpose usage. With meta-data journaling, the filesystem guarantees that all of its operations on its meta-data will be done atomically. If a file is being written to, the data in that file may be corrupted as a result of non-atomic data operations, but the filesystem's internals will all be consistent. The performance advantage was substantial. V3 of ReiserFS offers both meta-data and data journaling, and defaults to meta-data journaling because that is the right solution for most users. Oddly enough, meta-data journaling is much more work to implement, because it requires being precise about what needs to be journaled. As is so often the case in programming, doing less work requires more code.
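The write-twice scheme above, and the all-or-nothing recovery it buys, can be sketched as a toy model. This is an illustration of the idea only (the `Journal` class, its one-atom log, and the dict-as-disk are invented for the example, not ReiserFS's journal format):

```python
# Toy model of fixed location journaling: an atom's blocks are written to the
# log first; only after the whole atom is in the log is it copied to the
# committed area, so a crash loses either all of an atom's effects or none.
class Journal:
    def __init__(self):
        self.committed = {}   # block -> data: the committed area on "disk"
        self.log = None       # at most one fully written atom, for simplicity

    def write_atom(self, atom, crash_after_log=False):
        self.log = dict(atom)     # step 1: the atom fully reaches the journal
        if crash_after_log:
            return                # simulated crash; the log survives on disk
        self.replay()             # step 2: write it to the committed area

    def replay(self):
        """Run at mount after a crash: play any complete logged atom."""
        if self.log is not None:
            self.committed.update(self.log)
            self.log = None

# The $10 transfer as one atom: debit block A and credit block B together.
j = Journal()
j.committed = {"A": 100, "B": 50}
j.write_atom({"A": 90, "B": 60}, crash_after_log=True)
j.replay()            # crash recovery: both effects appear, never just one
print(j.committed)    # {'A': 90, 'B': 60}
```

Note the cost the section describes: every block is written twice, once to the log and once to the committed area.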
With fixed location data journaling, the overhead of making each operation atomic is too high for it to be appropriate for average applications that don't especially need it, because of the cost of writing twice. Applications that do need atomicity are written to use fsync and rename to accomplish it, and these tools are simply terrible for that job: terrible in performance, and terrible in the ugliness they add to the coding of applications. Stuffing a transaction into a single file just because you need the transaction to be atomic is hardly what one would call flexible semantics. Also, data journaling, with all its performance cost, still does not necessarily guarantee that every system call is fully atomic, much less that one can construct sets of operations that are fully atomic. It usually merely guarantees that the files will not contain random garbage, however many blocks of them happen to get written, and however much the application might view the result as inconsistent data. I hope you understand that we are trying to set a new expectation here for how secure a filesystem should keep your data when we provide these atomicity guarantees.

== Wandering Logs ==

One way to avoid having to write the data twice is to change one's definition of where the log area and the committed area are, instead of moving the data from the log to the committed area. There is an annoying complication to this, though, in that there are probably a number of pointers to the data from the rest of the filesystem, and we need for them to point to the new data. When the commit occurs, we need to write those pointers so that they point to the data we are committing. Fortunately, these pointers tend to be highly concentrated as a result of our tree design. But wait, if we are going to update those pointers, then we want to commit those pointers atomically also, which we could do if we write them to another location and update the pointers to them, and....
up the tree the changes ripple. When we get to the top of the tree, since disk drives write sectors atomically, the block number of the top can be written atomically into the superblock by the disk, thereby committing everything the new top points to. This is indeed the way WAFL, the Write Anywhere File Layout filesystem invented by Dave Hitz at Network Appliance, works. It always ripples changes all the way to the top, and indeed that works rather well in practice, and most of their users are quite happy with its performance.

Writing Twice May Be Optimal Sometimes

Suppose that a file is currently well laid out, and you write to a single block in the middle of it, and you then expect to do many reads of the file. That is an extreme case illustrating that sometimes it is worth writing twice so that a block can keep its current location while committing atomically. If one writes a node twice in this way, one also does not need to update its parent and ripple all the way to the top of the tree. Our code is a toolkit that can be used to implement different layout policies, and one of the available choices is whether to write over a block in its current place, or to relocate it somewhere else. I don't think there is one right answer for all usage patterns. If a block is adjacent to many other dirty blocks in the tree, then this decreases the significance of the cost to read performance of relocating it and its neighbors. If one knows that a repacker will run once a week (a repacker is expected for V4.1, and is (a bit oddly) absent from WAFL), this also decreases the cost of relocation. After a few years of experimentation, measurement, and user feedback, we will say more about our experiences in constructing user selectable policies.

Do we pay a performance penalty for making Reiser4 atomic? Yes, we do. Is it an acceptable penalty?
We picked up a lot more performance from other improvements in Reiser4 than we lost to atomicity, so the atomicity cost is not isolated in our measurements, but I am unscientifically confident that the answer is yes. If changes are either large or batched together with enough other changes to become large, the performance penalty is low and drowned out by other performance improvements. Scattered small changes threaten us with read performance losses compared to overwriting in place and taking our chances with the data's consistency if there is a crash, but use of a repacker will mostly alleviate this scenario. I have to say that in my heart I don't have any serious doubts that for the general purpose user the increase in data security is worthwhile. The users, though, will have the final say.

Committing

A transaction preserves the previous contents of all modified blocks in their original location on disk until the transaction commits; commit means the transaction has reached a state where it will be completed even if there is a crash. The dirty blocks of an atom (which were captured and subsequently modified) are divided into two sets, relocate and overwrite, each of which is preserved in a different manner. The relocatable set is the set of blocks that have a dirty parent in the atom. The relocate set is those members of the relocatable set that will be written to a new or first location rather than overwritten. The overwrite set contains all dirty blocks in the atom that need to be written to their original locations, which is all those not in the relocate set. In practice this is those which do not have a parent we want to dirty, plus those for which overwrite is the better layout policy despite the write twice cost. Note that the superblock is the parent of the root node, and the free space bitmap blocks have no parent. By these definitions, the superblock and modified bitmap blocks are always part of the overwrite set.
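The partition just described can be sketched as follows. The block names, the parent map, and the `prefer_overwrite` policy hook are invented for illustration; they are not Reiser4's data structures:

```python
# Sketch of the relocate/overwrite partition: a dirty block whose parent is
# also dirty in the atom can be relocated, since the parent will be rewritten
# anyway; blocks with no tree parent (superblock, bitmaps) must be overwritten
# in place, as may blocks for which overwrite is simply the better layout.
def partition(dirty, parent, prefer_overwrite=frozenset()):
    relocate, overwrite = set(), set()
    for block in dirty:
        p = parent.get(block)     # None for superblock and bitmap blocks
        if p is not None and p in dirty and block not in prefer_overwrite:
            relocate.add(block)
        else:
            overwrite.add(block)
    return relocate, overwrite

dirty = {"super", "bitmap0", "root", "twig", "leaf"}
parent = {"root": "super", "twig": "root", "leaf": "twig"}
relocate, overwrite = partition(dirty, parent)
print(sorted(relocate))   # ['leaf', 'root', 'twig']
print(sorted(overwrite))  # ['bitmap0', 'super']
```

Consistent with the text, the superblock and the bitmap block land in the overwrite set, while tree nodes with dirty parents are free to wander to new locations.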
The wandered set is the set of blocks that the overwrite set will be written to temporarily until the overwrite set commits. An interesting definition is the minimum overwrite set, which uses the same definitions as above with the following modification: if at least two dirty blocks have a common parent that is clean, then that parent is added to the minimum overwrite set, and its dirty children are removed from the overwrite set and placed in the relocate set. This policy is an example of what will be experimented with in later versions of Reiser4 using the layout toolkit. For space reasons, we leave out the full details on exactly when we relocate vs. overwrite, and the reader should not regret this, because years of experimenting are probably ahead before we can speak with the authority necessary for a published paper on the effects of the many details and variations possible.

When we commit, we write a wander list, which consists of a mapping of the wandered set to the overwrite set. The wander list is a linked list of blocks containing pairs of block numbers. The last act of committing a transaction is to update the superblock to point to the front of that list. Once that is done, if there is a crash, crash recovery will go through that list and "play" it, which means to write the wandered set over the overwrite set. If there is not a crash, we will also play it. There are many more details of how we handle the deallocation of wandered blocks, the handling of bitmap blocks, and so forth. You are encouraged to read the comments at the top of our source code files (e.g. wander.c) for such details....

Journalling optimizations

== Copy-on-capture ==

Suppose one wants to capture a node which belongs to an atom with stage >= ASTAGE_PRE_COMMIT. Such a capture request would normally have to wait (sleep in capture_fuse_wait()) while the atom commits. The copy-on-capture optimization makes it possible to satisfy the capture request by creating a copy of the node being captured.
The commit process takes control of one copy of the node, and the capturing process takes control of the other. This does not lead to any node-version conflicts, because it is guaranteed that the copy held by the commit process will not be modified.

== Steal-on-capture ==

The idea of the steal-on-capture optimization is that only the last committed transaction to modify an overwrite block actually needs to write that block; other transactions can skip writing that block post-commit. This optimization, which is also present in ReiserFS version 3, means that frequently modified overwrite blocks will be written less than two times per transaction. With this optimization a frequently modified overwrite block may avoid being overwritten by a series of atoms; as a result, crash recovery must replay more atoms than without the optimization. If an atom has overwrite blocks stolen, the atom must be replayed during crash recovery until every stealing atom commits.

= Repacker =

Another way of escaping from the balancing time vs. space efficiency tradeoff is to use a repacker. 80% of files on the disk remain unchanged for long periods of time. It is efficient to pack them perfectly, by using a repacker that runs much less often than every write to disk. This repacker goes through the entire tree ordering, from left to right and then from right to left, alternating each time it runs. When it goes from left to right in the tree ordering, it shoves everything as far to the left as it will go, and when it goes from right to left it shoves everything as far to the right as it will go. (Left means small in key or in block number. :-) ) In the absence of FS activity the effect of this over time is to sort by tree order (defragment), and to pack with perfect efficiency. Reiser4.1 will modify the repacker to insert controlled "air holes", as it is well known that insertion efficiency is harmed by overly tight packing.
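The repacker's alternating passes can be illustrated with a toy one-dimensional "disk" (the `repack` helper is invented for illustration; the real repacker slides tree nodes toward one end of the block ordering while preserving tree order):

```python
def repack(slots, direction):
    """One repacker pass over a toy disk: shove all occupied slots
    (non-None) to one end, preserving their relative (tree) order."""
    items = [x for x in slots if x is not None]
    free = [None] * (len(slots) - len(items))
    return items + free if direction == "left" else free + items

disk = ["A", None, "B", None, None, "C"]
packed_left = repack(disk, "left")           # left-to-right pass
packed_right = repack(packed_left, "right")  # next run goes the other way
```

After a left pass the occupied blocks are contiguous at the low end, which is the defragmented, perfectly packed state the text describes.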
I hypothesize that it is more efficient to periodically run a repacker that systematically repacks using large IOs than to perform lots of 1-block reads of neighboring nodes of the modification points so as to preserve a balancing invariant in the face of poorly localized modifications to the tree.

= Plugins =

== 8 Kinds of Plugins Make Reiser4 The Most Tweakable Filesystem Going ==

== File Plugins ==

Every file possesses a plugin id, and every directory possesses a plugin id. This plugin id will identify a set of methods. The set of methods will embody all of the different possible interactions with the file or directory that come from sources external to ReiserFS. It is a layer of indirection added between the external interface to ReiserFS and the rest of ReiserFS. Each method will have a method id. It will be usual to mix and match methods from other plugins when composing plugins.

== Directory Plugins ==

Reiser4 will implement a plugin for traditional directories. It will implement directory-style access to file attributes as part of the plugin for regular files. Later we will describe why this is useful. Other directory plugins we will leave for later versions. There is no deep reason for this deferral; it is simply the randomness of what features attract sponsors and make it into a release specification, and there are no sponsors at the moment for additional directory plugins. I have no doubt that they will appear later; new directory plugins will be too much fun to miss out on. :-)

== Hash Plugins ==

A directory is a mapping from file names to the files themselves. This mapping is implemented through the Reiser4 internal balanced tree. Unfortunately, file names cannot be used as keys until keys of variable length are implemented, unless unreasonable limitations on maximal file name length are imposed. To work around this, the file name is hashed, and the hash is used as a key in the tree. No hash function is perfect, and there will always be hash collisions, that is, file names having the same hash value.
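A toy sketch of hashing names into fixed-size keys, including the generation-counter disambiguation scheme that earlier ReiserFS versions used for colliding names (the hash function and the bit split between hash and counter here are illustrative, not ReiserFS's actual key layout):

```python
import zlib

HASH_BITS = 25  # illustrative split, not the real ReiserFS key layout
GEN_BITS = 7

def directory_key(name, used_keys):
    """Map a file name to a directory key: a hash of the name in the high
    bits, plus a generation counter in the low bits that disambiguates
    names whose hashes collide."""
    h = zlib.crc32(name.encode()) & ((1 << HASH_BITS) - 1)  # stand-in hash
    for gen in range(1 << GEN_BITS):
        key = (h << GEN_BITS) | gen
        if key not in used_keys:
            used_keys.add(key)
            return key
    raise OSError("generation bits exhausted: too many colliding names")

used = set()
k1 = directory_key("passwd", used)
k2 = directory_key("passwd", used)  # same hash, so the next generation is used
```

The cost visible in the sketch is exactly the one the text notes: every bit spent on the generation counter is a bit taken away from the hash.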
Previous versions of ReiserFS (3.5 and 3.6) used a "generation counter" to overcome this problem: keys for file names having the same hash value were distinguished by having different generation counters. This made it possible to amortize hash collisions, at the cost of reducing the number of bits used for hashing. This "generation counter" technique is actually an ad hoc form of support for non-unique keys. Keeping in mind that some form of this has to be implemented anyway, it seemed justifiable to implement more regular support for non-unique keys in Reiser4. Another reason for using hashes is that some (arguably brain-dead) interfaces require them: telldir(3) and seekdir(3). These functions presume that the file system can issue 64-bit "cookies" that can be used to resume a readdir. Cookies are implemented in most filesystems as byte offsets within a directory (which means they cannot shrink directories), and in ReiserFS as hashes of file names plus a generation counter. Curiously enough, the Single UNIX Specification tags telldir(3) and seekdir(3) as "Extension", because "returning to a given point in a directory is quite difficult to describe formally, in spite of its intuitive appeal, when systems that use B-trees, hashing functions, or other similar mechanisms to order their directories are considered". We order directory entries in ReiserFS by their cookies. This costs us performance compared to ordering lexicographically (but is immensely faster than the linear searching employed by most other Unix filesystems). Depending on the hash and its match to the application usage pattern there may be more or less performance loss. Hash plugins will probably remain until version 5 or so, when directory plugins and ordering function plugins will obsolete them. Directory entries will then be ordered by file names, as they should be (and possibly stem compressed as well).

== Security Plugins ==

Security plugins handle all security checks.
They are normally invoked by file and directory plugins. Example of reading a file:

* Access the plugin id for the file.
* Invoke the read method for the plugin.
* The read method determines the security plugin for the file.
* That security plugin invokes its read-check method to determine whether to permit the read.
* The read-check method for the security plugin reads the file/attributes containing the permissions on the file.
* Since file/attributes are also files, this means invoking the plugin for reading the file/attribute.
* The plugin id for this particular file/attribute for this file happens to be inherited (saving space and centralizing control of it).
* The read method for the file/attribute is coded such that it does not check permissions when called by a security plugin method. (Endless recursion is thereby avoided.)
* The file/attribute plugin employs a decompression algorithm specially designed for efficient decompression of our encoding of ACLs.
* The security plugin determines that the read should be permitted.
* The read method continues and completes.

== Item Plugins ==

The balancing code will be able to balance an item iff it has an item plugin implemented for it. The item plugin will implement each of the methods the balancing code needs (methods such as splitting items, estimating how large the split pieces will be, overwriting, appending to, cutting from, or inserting into the item, etc.). In addition to all of the balancing operations, item plugins will also implement intra-item search methods. V3 of ReiserFS understood the structure of the items it balanced. This made adding new types of items, storing such new security attributes as other researchers might develop, too expensive in coding time, greatly inhibiting their addition to ReiserFS.
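What an item plugin bundles can be sketched as a method table that the balancer calls without ever looking inside the item (method and helper names are invented for illustration; the real methods are kernel function pointers):

```python
def split_bytes(item, pos):
    """Cut an item in two at a balance point chosen by the balancer."""
    return item[:pos], item[pos:]

# a toy item plugin for a plain byte-string item type
BYTE_ITEM_PLUGIN = {
    "split": split_bytes,
    "estimate": len,                          # space the item needs in a node
    "append": lambda item, tail: item + tail, # grow the item in place
}

def balance_split(plugin, item, pos):
    """The balancer is item-type agnostic: it only calls plugin methods."""
    return plugin["split"](item, pos)

left, right = balance_split(BYTE_ITEM_PLUGIN, b"abcdef", 2)
```

Adding a new item type then means supplying such a table, not modifying the balancer, which is the point of the paragraph above.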
In writing Reiser4 we hoped that there would be a great proliferation in the types of security attributes in ReiserFS if we made adding one a matter not of modifying the balancing code (a task for our most experienced programmers), but of writing an item handler. This is necessary if we are to achieve our goal of making the addition of each new security attribute an order of magnitude or more easier to perform than it is now.

== Key Assignment Plugins ==

When assigning the key to an item, the key assignment plugin is invoked, and it has a key assignment method for each item type. A single key assignment plugin is defined for the whole FS at FS creation time. We know from experience that there is no "correct" key assignment policy; Squid has very different needs from average user home directories. Yes, there could be value in varying it more flexibly than just at FS creation time, but we have to draw the line somewhere when deciding what goes into each release....

== Node Search and Item Search Plugins ==

Every node layout has a search method for that layout, and every item that is searched through has a search method for that item. (When doing searches, we search through a node to find an item, and then search within the item for those items that contain multiple things to find.)

== Putting Your New Plugin To Work Will Mean Recompiling ==

If you want to add a new plugin, we think your having to ask the sysadmin to recompile the kernel with your new plugin added to it will be acceptable for version 4.0. We will initially code plugin-id lookup as an in-kernel fixed-length array lookup, method ids as function pointers, and make no provision for post-compilation loading of plugins. Performance, and coding cost, motivates this.
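That fixed-array dispatch can be sketched as follows (a Python analogue of the in-kernel array of function-pointer tables; the plugin ids, method names, and the read-only plugin are invented to show the mix-and-match reuse described earlier):

```python
def regular_read(f):
    return f["data"]

def regular_write(f, data):
    f["data"] = data

def reject_write(f, data):
    raise PermissionError("read-only plugin")

# plugin ids index a fixed table; "method ids" are plain function references
PLUGIN_TABLE = [
    {"read": regular_read, "write": regular_write},  # id 0: regular file
    {"read": regular_read, "write": reject_write},   # id 1: reuses read only
]

def invoke(f, method, *args):
    """Dispatch: the file's plugin id selects the method table."""
    return PLUGIN_TABLE[f["plugin_id"]][method](f, *args)

f = {"plugin_id": 0, "data": b"old"}
invoke(f, "write", b"new")
ro = {"plugin_id": 1, "data": b"fixed"}
```

Composing plugin id 1 cost one new function; everything else is reused, which is the economics the text is arguing for.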
= Without Plugins We Will Drown =

People often ask, as ReiserFS grows in features, how will we keep the design from being drowned under the weight of the added complexity and from reaching the point where it is difficult to work on the code? The infrastructure to support security attributes implemented as files also enables lots of features not necessarily security related. The plugins we are choosing to implement in v4.0 are all security related because of our funding source, but users will add other sorts of plugins, just as they took DARPA's TCP/IP and used it for non-military computers. Only by requiring that all features be implemented in the manner that maximizes code reuse will we keep ReiserFS coding complexity down to where we can manage it over the long term.

= Plugins: FS Programming For The Lazy =

Most plugins will have only a very few of their features unique to them, and the rest of the plugin will be reused code. What Namesys sees as its role as a DARPA contractor is not primarily supplying a suite of security plugins, though we are doing that, but creating an architectural (not just the license) enabling of lots of outside vendors to efficiently create lots of innovative security plugins that Namesys would never have imagined if working by itself.

= Enhancing Security =

By far the most casualties in wars have always been civilians. In future information infrastructure attacks, who will take more damage, civilian or military installations? DARPA is funding us to make all Gnu/Linux computers throughout the world a little bit more resistant to attack.

= Fine Graining Security =

== Good Security Requires Precision In Specification Of Security ==

Suppose you have a large file with many components. A general principle of security is that good security requires precision of permissions.
When security lacks precision, it increases the burden of being secure; the extent to which users adhere to security requirements in practice is a function of the burden of adhering to them.

== Space Efficiency Concerns Motivate Imprecise Security ==

Many filesystems make it space-inefficient to store small components as separate files, for various reasons. Not being separate files means that they cannot have separate permissions. One of the reasons for using overly aggregated units of security is space efficiency. ReiserFS currently improves this by an order of magnitude over most of the existing alternative art. Space efficiency is the hardest of the reasons to eliminate; its elimination makes it that much more enticing to attempt to eliminate the other reasons.

== Security Definition Units And Data Access Patterns Sometimes Inherently Don't Align ==

Applications sometimes want to operate on a collection of components as a single aggregated stream. (Note that commonly two different applications want to operate on data with different levels of aggregation; the infrastructure for solving this security issue will solve that problem as well.)

== /etc/passwd As Example ==

I am going to use the /etc/passwd file as an example, not because I think that other solutions won't solve its problems better, but because its implementation as a single flat file in the early Unixes is a wonderful illustrative example of poorly granularized security, one whose problems the readers may have experienced personally, as I have. I hope they will be able to imagine that other, less famous data files could have similar problems. Have you ever tried to figure out just exactly what part of your continually changing /etc/passwd file changed near the time of a break-in? Have you ever wished that you could have a modification time on each field in it?
Have you ever wished that users could change part of it, such as the gecos field, themselves (setuid utilities have been written to allow this, but this is a pedagogical, not a practical, example), but not have the power to change it for other users? There were good reasons why /etc/passwd was first implemented as a single file with one single permission governing the entire file. If we can eliminate them one by one, the same techniques for making finer grained security effective will be of value to other highly secure data files.

== Aggregating Files Can Improve The User Interface To Them ==

Consider the use of emacs on a collection of a thousand small 8-32 byte files, like you might have if you deconstructed /etc/passwd into small files with separable ACLs for every field. It is more convenient in screen real estate, buffer management, and other user interface considerations, to operate on them as an aggregation all placed into a single buffer rather than as a thousand 8-32 byte buffers.

== How Do We Write Modifications To An Aggregation ==

Suppose we create a plugin that aggregates all of the files in a directory into a single stream. How does one handle writes to that aggregation that change the length of the components of that aggregation? Richard Stallman pointed out to me that if we separate the aggregated files with delimiters, then emacs need not be changed at all to acquire an effective interface for large numbers of small files accessed via an aggregation plugin. If /new_syntax_access_path/big_directory_of_small_files/.glued is a plugin that aggregates every file in big_directory_of_small_files with a delimiter separating every file within the aggregation, then one can simply type emacs /new_syntax_access_path/big_directory_of_small_files/.glued, and the filesystem has done all the work emacs needs to be effective at this. Not a line of emacs needs to be changed.
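A minimal user-space sketch of such an aggregation plugin (the delimiter and the `glue`/`unglue` helper names are invented; the real plugin would do this inside the filesystem on read and write of the .glued file):

```python
DELIM = b"\n--8<--\n"  # delimiter choice is per-plugin; this one is invented

def glue(files):
    """Aggregate component files (name -> contents) into one stream."""
    return DELIM.join(files[name] for name in sorted(files))

def unglue(stream, names):
    """Write a modified aggregate back: split on the delimiter and update
    each component, so components may freely grow or shrink."""
    parts = stream.split(DELIM)
    if len(parts) != len(names):
        raise ValueError("component count changed")
    return dict(zip(names, parts))

files = {"gecos": b"Ed", "shell": b"/bin/sh"}
names = sorted(files)
edited = glue(files).replace(b"Ed", b"Edward")  # edit the aggregate in an editor
components = unglue(edited, names)
```

The length-changing edit to one component propagates back to that component alone, which is exactly the write-handling question the paragraph above raises.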
One needs to be able to choose different delimiting syntax for different aggregation plugins, so that one can, say for the passwd file, aggregate subdirectories into lines, and files within those subdirectories into colon-separated fields within the line. XML would benefit from yet other delimiter construction rules. (We have been told by Philipp Guehring of LivingXML.NET that ReiserFS is higher performance than any database for storing XML, so this issue is not purely theoretical.)

= Aggregation Is Best Implemented As Inheritance =

In summary, to be able to achieve precision in security we need to have inheritance with specifiable delimiters, and we need whole-file inheritance to support ACLs.

== One Plugin Using Delimiters That Resemble sys_reiser4() Syntax ==

We provide the infrastructure for constructing plugins that implement arbitrary processing of writes to inheriting files, but we also supply one generic inheriting-file plugin that intentionally uses delimiters very close to the sys_reiser4() syntax. We will document the syntax more fully when that code is working; for now, syntax details are in the comments in the file invert.c in the source code.

= API Suitable For Accessing Files That Store Security Attributes =

A new system call, sys_reiser4(), will be implemented to support applications that don't have to be fooled into thinking that they are using POSIX. Through this entry point a richer set of semantics will access the same files that are also accessible using POSIX calls. reiser4() will not implement more than hierarchical names. A full set-theoretic naming system as described on our future vision page will not be implemented before SSN Reiserfs is implemented (Distributed Reiserfs is our distributed filesystem, Semi-Structured Naming Reiserfs is our enhanced semantics; whether we implement Distributed Reiserfs or SSN Reiserfs first depends on which sponsors we find ;-) ).
reiser4() will implement all features necessary to access ACLs as files/directories rather than as something neither file nor directory. These include opening and closing transactions, performing a sequence of I/Os in one system call, and accessing files without use of file descriptors (necessary for efficient small I/O). reiser4() will use a syntax suitable for evolving into SSN Reiserfs syntax with its set-theoretic naming.

== Flaws In Traditional File API When Applied To Security Attributes ==

Security related attributes tend to be small. The traditional filesystem API for reading and writing files has these flaws in the context of accessing security attributes:

* Creating a file descriptor is excessive overhead and not useful when accessing an 8 byte attribute.
* A system call for every attribute accessed is too much overhead when accessing lots of little attributes.
* Lacking constraints: it is important to constrain what is written to the attribute, often in complex ways.
* Lacking atomic semantics: often one needs to update multiple attributes as one action that is guaranteed to either fully succeed or fully fail.

== The Usual Resolution Of These Flaws Is A One-Off Solution ==

The usual response to these flaws is that people adding security related and other attributes create a set of methods unique to their attributes, plus non-reusable code to implement those methods, in which their particular attributes are accessed and stored not using the methods for files, but using their particular methods for that attribute. Their particular API for that attribute typically does a one-off instantiation of a lightweight, single-system-call, write-constrained, atomic access, with no code being reusable by those who want to modify file bodies. It is basic and crucial to system design to decompose desired functionality into reusable, orthogonal, separated components.
Persons designing security attributes are typically doing it without the filesystem that they want offering them a proper foundation and tool kit. They need more help from us core FS developers. Linus said that we can have a system call to use as our experimental plaything in this. With what I have in mind for the API, one rather flexible system call is all we want for creating atomic lightweight batched constrained accesses to files, with each of those adjectives being an orthogonal optional feature that may or may not be invoked in a particular instance of the new system call.

== One-Off Solutions Are A Lot of Work To Do A Lot Of ==

Looking at the coin from the other side, we want to make it an order of magnitude less work to add features to ReiserFS, so that both users and Namesys can add at least an order of magnitude more of them. To verify that it is truly more extensible you have to do some extending, and our DARPA funding motivates us to instantiate most of those extensions as new security features. This system call's syntax enables attributes to be implemented as a particular type of file. It avoids uglifying the semantics with two APIs for two supposedly different kinds of objects that don't truly need different treatment. All of its special features that are useful for accessing particular attributes are also available for use on files. It has symmetry, and its features have been fully orthogonalized. There is nothing particularly interesting about this system call to a languages specialist (its ideas were explored decades ago, except by filesystem developers) until SSN Reiserfs, when we will further evolve it into a set-theoretic syntax that deconstructs tuple-structured names into hierarchy and vicinity set intersection. That is described at www.namesys.com/whitepaper.html

= Steps For Creating A Security Attribute =

You can create a new security attribute by:

* Defining a plugin id.
* Composing a set of methods for the plugin from ones you create or reuse from other existing plugins.
* Defining a set of items that act as the storage containers of the object, or reusing existing items from other plugins (e.g. regular files).
* Implementing item handlers for all of the new items you create.
* Creating a key assignment algorithm for all of the new items.

= reiser4() System Call Description =

The reiser4() system call (still being debugged at the time of writing) executes a sequence of commands separated by commas. Assignment and transaction are the commands supported in reiser4(); more commands will appear in SSN Reiserfs. <- and <<- are two of the assignment operators.

lhs (assignment target) values:

* /..../process/range/(offset<-(loff_t),last_byte<-(loff_t)) assigns (writes) to the buffer starting at address offset in the process address space, ending at last_byte. (The assignment source may be smaller or larger than the assignment target.) Representation of offset and last_byte is left to the coder to determine; it is an issue that will be of much dispute and little importance. Notice that / is used to indicate that the order of the operands matters; see the future vision whitepaper for details of why this is appropriate syntax design. Note the lack of a file descriptor.
* /filename assigns to the file named filename.
* /filename/..../range/(offset<-(loff_t),last_byte<-(loff_t)) writes to the body, starting at offset, ending not past last_byte.
* /filename/..../range/(offset<-(loff_t)) writes to the body starting at offset.

rhs (assignment source) values:

* /..../process/range/(offset<-(loff_t),last_byte<-(loff_t)) reads from the buffer starting at address offset in the process address space, ending at last_byte. Representation of offset and last_byte is left to the coder to determine, as it is an issue that will be of much dispute and little importance.
* /filename reads the entirety of the file named filename.
* /filename/..../range/(offset<-(loff_t),last_byte<-(loff_t)) reads from the body, starting at offset, ending not past last_byte.
* /filename/..../range/(offset<-(loff_t)) reads from the body starting at offset until the end.
* /filename/..../stat/owner reads from the ownership field of the stat data. (Stat data is that which is returned by the stat() system call (owner, permissions, etc.) and stored on a per-file basis by the FS.)

Note that "...." and "process" are style conventions for the name of a hidden subdirectory implementing methods and accessing metadata supported by a plugin. It is possible to rename it, etc. We had a discussion about whether to instead use names that could not clash with any legitimate name likely to be used by users. Vladimir Demidov suggested that cryptic names have historically harmed the acceptance of several languages, and so it was realized that being novice-unfriendly in the naming was worse than risking a name collision, especially since a collision can be cured by using rename on "...." and "process" for the few cases where it is necessary.

= Constraints =

(Note: this is not yet coded.)

Another way security may be insufficiently fine grained is in values: it can be useful to allow persons to change data, but only within certain constraints. For this project we will implement plugins; one type of plugin will be write constraints. Write constraints are invoked upon a write to a file; if they return non-error, then the write is allowed. We will implement two trivial sample write-constraint plugins. One will be in the form of a kernel function, loadable as a kernel module, which returns non-error (thus allowing the write) if the file consists of the strings "secret" or "sensitive" but not "top-secret". The other, which does exactly the same, will be in the form of a perl program residing in a file and executed in user space.
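The sample write-constraint just described can be sketched as follows (a user-space Python stand-in for the kernel-module and perl versions; the helper names are invented):

```python
def classification_constraint(proposed: bytes) -> bool:
    """Allow a write iff the new contents mention "secret" or "sensitive"
    but never "top-secret" (the sample constraint from the text)."""
    text = proposed.decode(errors="replace")
    if "top-secret" in text:       # checked first: "secret" is a substring
        return False
    return "secret" in text or "sensitive" in text

def constrained_write(f, data, constraint=classification_constraint):
    """Run the write-constraint plugin before letting the write proceed."""
    if not constraint(data):
        raise PermissionError("write rejected by constraint plugin")
    f["data"] = data

f = {"data": b""}
constrained_write(f, b"this is sensitive material")  # permitted
```

A rejected write leaves the file contents untouched, matching the "return non-error or the write is disallowed" contract described above.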
Use of kernel functions will have performance advantages, particularly for small functions, but severe disadvantages in power of scripting, flexibility, and ability to be installed from non-secure sources. Both types of plugins will have their place. Note that ACLs will also embody write constraints. We will implement both constraints that are compiled into the kernel and constraints that are implemented as user space processes. Specifically, we will implement a plugin that executes an arbitrary constraint contained in an arbitrarily named file as a user space process, passes the proposed new file contents to that process as standard input, and iff the process exits without error allows the write to occur. It can be useful to have read constraints as well as write constraints.

= Auditing =

(Note: this is not yet coded.)

We will implement a plugin that notifies administrators by email when access is made to files, e.g. read access. With each plugin implemented, creating additional plugins becomes easier as the available toolkit is enriched. Auditing constitutes a major additional security feature, yet it will be easy to implement once the infrastructure to support it exists. (It would be substantial work to implement it without that infrastructure.) The scope of this project is not the creation of plugins themselves, but the creation of the infrastructure that plugin authors would find useful. We want to enable future contributors to implement more secure systems on the Gnu/Linux platform, not implement them ourselves. By laying a proper foundation and creating a toolkit for them, we hope to reduce the cost of coding new security attributes for those who follow us by an order of magnitude. Employing a proper set of well orthogonalized primitives also changes the addition of these attributes from being a complexity burden upon the architecture into being an empowering extension of the architecture.
= Increasing the Allowed Granularity of Security =

(This feature is not yet coded.)

Inheritance of security attributes is important to providing flexibility in their administration. We have spoken about making security more fine grained, but sometimes it needs to be larger grained. Sometimes a large number of files are logically one unit in regard to their security, and it is desirable to have a single point of control over their security. Inheritance of attributes is the mechanism for implementing that. Security administrators should have the power to choose whatever units of security they desire, without having to distort them to make them correspond to semantic units. Inheritance of file bodies using aggregation plugins allows the units of security to be smaller than files; inheritance of attributes allows them to be larger than files.

= Encryption On Commit =

Currently, encrypted files suffer severely in their write performance when implemented using schemes that encrypt at every write() rather than at every commit to disk. We encrypt on flush, such that a file with an encryption plugin id is encrypted not at the time of write, but at the time of flush to disk. Encryption is implemented as a special form of repacking on flush, and it occurs for any node which has its CONTAINS_ENCRYPTED_DATA state flag set.

= Conclusion =

Reiser4 offers a dramatically better infrastructure for creating new filesystem features. Files and directories have all of the features needed to make it unnecessary for file attributes to be something different from files. The effectiveness of this new infrastructure is tested using a variety of new security features. Performance is greatly improved by the use of dancing trees, wandering logs, allocate-on-flush, a repacker, and encryption on commit. It was an important question whether we could increase the level of abstraction in our design without harming performance.
Reiser4 gives you BOTH the most cleanly abstracted storage AND the highest performance storage of any filesystem.

= Citations =

* [Gray93] Jim Gray and Andreas Reuter. "Transaction Processing: Concepts and Techniques". Morgan Kaufmann Publishers, Inc., 1993. Old but good textbook on transactions. Available at http://www.mkp.com/books_catalog/catalog.asp?ISBN=1-55860-190-2
* [Hitz94] D. Hitz, J. Lau and M. Malcolm. "File system design for an NFS file server appliance". Proceedings of the 1994 USENIX Winter Technical Conference, pp. 235-246, San Francisco, CA, January 1994. Available at http://citeseer.nj.nec.com/hitz95file.html
* [TR3001] D. Hitz. "A Storage Networking Appliance". Tech. Rep. TR3001, Network Appliance, Inc., 1995. Available at http://www.netapp.com/tech_library/3001.html
* [TR3002] D. Hitz, J. Lau and M. Malcolm. "File system design for an NFS file server appliance". Tech. Rep. TR3002, Network Appliance, Inc., 1995. Available at http://www.netapp.com/tech_library/3002.html
* [Ousterh89] J. Ousterhout and F. Douglis. "Beating the I/O Bottleneck: A Case for Log-Structured File Systems". ACM Operating System Reviews, Vol. 23, No. 1, pp. 11-28, January 1989. Available at http://citeseer.nj.nec.com/ousterhout88beating.html
* [Seltzer95] M. Seltzer, K. Smith, H. Balakrishnan, J. Chang, S. McMains and V. Padmanabhan. "File System Logging versus Clustering: A Performance Comparison". Proceedings of the 1995 USENIX Technical Conference, pp. 249-264, New Orleans, LA, January 1995. Available at http://citeseer.nj.nec.com/seltzer95file.html
* [Seltzer95Supp] M. Seltzer. "LFS and FFS Supplementary Information". 1995. http://www.eecs.harvard.edu/~margo/usenix.195/
* [Ousterh93Crit] J. Ousterhout. "A Critique of Seltzer's 1993 USENIX Paper". http://www.eecs.harvard.edu/~margo/usenix.195/ouster_critique1.html
* [Ousterh95Crit] J. Ousterhout. "A Critique of Seltzer's LFS Measurements". http://www.eecs.harvard.edu/~margo/usenix.195/ouster_critique2.html
* [SwD96] A. Sweeney, D. Doucette, W. Hu, C. Anderson, M. Nishimoto and G. Peck. "Scalability in the XFS File System". Proceedings of the 1996 USENIX Technical Conference, pp. 1-14, San Diego, CA, January 1996. Available at http://citeseer.nj.nec.com/sweeney96scalability.html
* [VelskiiLandis] G.M. Adel'son-Vel'skii and E.M. Landis. "An algorithm for the organization of information". Soviet Math. Doklady 3, 1259-1262, 1962. This paper on AVL trees can be thought of as the founding paper of the field of storing data in trees. Those not conversant in Russian will want to read the [Lewis and Denenberg] treatment of AVL trees in its place. [Wood] contains a modern treatment of trees.
* [Apple] Inside Macintosh, Files, by Apple Computer Inc., Addison-Wesley, 1992. Employs balanced trees for filenames; it was an interesting filesystem architecture for its time in a number of ways, but its problems with internal fragmentation have become more severe as disk drives have grown larger. I look forward to the replacement they are working on.
* [Bach] Maurice J. Bach. "The Design of the Unix Operating System". Prentice-Hall Software Series, Englewood Cliffs, NJ, 1986. Superbly written but sadly dated; contains detailed descriptions of the filesystem routines and interfaces in a manner especially useful for those trying to implement a Unix compatible filesystem. See [Vahalia].
* [BLOB] R. Haskin and Raymond A. Lorie. "On Extending the Functions of a Relational Database System". SIGMOD Conference, 1982, pp. 207-212 (body of paper not on web). Reiser4 obsoletes this approach.
* [Chen] P.M. Chen and David A. Patterson. "A New Approach to I/O Performance Evaluation---Self-Scaling I/O Benchmarks, Predicted I/O Performance". 1993 ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems; also available on Chen's web page.
* [C-FFS] Gregory R. Ganger and M. Frans Kaashoek. "Embedded Inodes and Explicit Grouping: Exploiting Disk Bandwidth for Small Files".
A very well written paper focused on 1-10k file size issues; they use some similar notions (most especially their concept of grouping compared to my packing localities). Note that they focus on the 1-10k file size range, and not the sub-1k range. The 1-10k range is the weak point in ReiserFS V3 performance. The page with a link to the PostScript paper is available at http://amsterdam.lcs.mit.edu/papers/cffs.html * [ext2fs] by Remi Card. Extensive information and source code are available. Probably our toughest current competitor, it is showing its age though, and recent enhancements of it (journaling, htrees, etc.) have not been performance effective. It embodies both the strengths and weaknesses of the incrementalist approach to coding, and substantially resembles the older FFS filesystem from BSD. * [FFS] M. McKusick, W. Joy, S. Leffler, R. Fabry. "A Fast File System for UNIX". ACM Transactions on Computer Systems, Vol. 2, No. 3, pp. 181-197, August 1984. Describes the implementation of a filesystem which employs parent directory location knowledge in determining file layout. It uses large blocks for all but the tail of files to improve I/O performance, and uses small blocks called fragments for the tails so as to reduce the cost due to internal fragmentation. Numerous other improvements are also made to what was once the state of the art. FFS remains the architectural foundation for many current block allocation filesystems, and was later bundled with the standard Unix releases. Note that unrequested serialization and the use of fragments place it at a performance disadvantage to ext2fs, though whether ext2fs is thereby made less reliable is a matter of dispute that I take no position on (Reiser4 is an atomic filesystem, which is a different level of reliability entirely). Available at http://citeseer.nj.nec.com/mckusick84fast.html. * [Ganger] Gregory R. Ganger, Yale N. Patt. "Metadata Update Performance in File Systems".
(Abstract only) * [Gifford] Describes a filesystem enriched to have more than hierarchical semantics; he shares many goals with this author, so forgive me for thinking his work worthwhile. If I had to suggest one improvement in a sentence, I would say his semantic algebra needs closure. (Postscript only) * [Hitz, Dave] A rather well designed filesystem optimized for NFS and RAID in combination. Note that RAID increases the merits of write-optimization in block layout algorithms. Available at http://www.netapp.com/technology/level3/3002.html * [Holton and Das] Holton, Mike, and Das, Raj. "The XFS space manager and namespace manager use sophisticated B-Tree indexing technology to represent file location information contained inside directory files and to represent the structure of the files themselves (location of information in a file)". Note that it is still a block (extent) allocation based filesystem; no attempt is made to store the actual file contents in the tree. It is targeted at the needs of the other end of the file size usage spectrum from ReiserFS, and is an excellent design for that purpose (though most filesystems including Reiser4 do well at writing large files, and I think it is medium-sized and smaller files where filesystems can substantively differentiate themselves.) SGI has also traditionally been a leader in resisting the use of unrequested serialization of I/O. Unfortunately, the paper is a bit vague on details. Available at http://www.sgi.com/Technology/xfs-whitepaper.html * [Howard] Howard, J.H., Kazar, M.L., Menees, S.G., Nichols, D.A., Satyanarayanan, M., Sidebotham, R.N., West, M.J. "Scale and Performance in a Distributed File System". ACM Transactions on Computer Systems, 6(1), February 1988. A classic benchmark, it was too CPU bound to effectively stress ext2fs and ReiserFS, and is no longer very effective for modern filesystems. * [Knuth] Knuth, D.E., The Art of Computer Programming, Vol.
3 (Sorting and Searching), Addison-Wesley, Reading, MA, 1973. The earliest reference discussing trees storing records of varying length. * [LADDIS] Wittle, Mark, and Bruce, Keith. "LADDIS: The Next Generation in NFS File Server Benchmarking". Proceedings of the Summer 1993 USENIX Conference, July 1993, pp. 111-128 * [Lewis and Denenberg] Lewis, Harry R., and Denenberg, Larry. "Data Structures & Their Algorithms". HarperCollins Publishers, NY, NY, 1991. An algorithms textbook suitable for readers wishing to learn about balanced trees and their AVL predecessors. * [McCreight] McCreight, E.M. "Pagination of B*-trees with variable length records". Commun. ACM 20 (9), 670-674, 1977. Describes algorithms for trees with variable length records. * [McVoy and Kleiman] The implementation of write-clustering for Sun's UFS. Available at http://www.sun.ca/white-papers/ufs-cluster.html * [OLE] "Inside OLE" by Kraig Brockschmidt discusses Structured Storage (abstract only). Structured storage is what you get when application developers need features to better manage the storage of objects on disk by the applications they write, and the filesystem group at their company can't be bothered with them. Miserable performance, miserable semantics. Available at http://www.microsoft.com/mspress/books/abs/5-843-2b.htm. * [Ousterhout] J.K. Ousterhout, H. Da Costa, D. Harrison, J.A. Kunze, M.D. Kupfer, and J.G. Thompson. "A Trace-driven Analysis of the UNIX 4.2BSD File System". In Proceedings of the 10th Symposium on Operating Systems Principles, pages 15-24, Orcas Island, WA, December 1985.
* [NTFS] "Inside the Windows NT File System" the book is written by Helen Custer, NTFS is architected by Tom Miller with contributions by Gary Kimura, Brian Andrew, and David Goebel, Microsoft Press, 1994, an easy to read little book, they fundamentally disagree with me on adding serialization of I/O not requested by the application programmer, and I note that the performance penalty they pay for their decision is high, especially compared with ext2fs. Their FS design is perhaps optimal for floppies and other hardware eject media beyond OS control. A less serialized higher performance log structured architecture is described in [Rosenblum and Ousterhout]. That said, Microsoft is to be commended for recognizing the importance of attempting to optimize for small files, and leading the OS designer effort to integrate small objects into the file name space. This book is notable for not referencing the work of persons not working for Microsoft, or providing any form of proper attribution to previous authors such as [Rosenblum and Ousterhout]. Though perhaps they really didn't read any of the literature and it explains why theirs is the worst performing filesystem in the industry.... * [Peacock] K. Peacock. "The CounterPoint Fast File System". Proceedings of the Usenix Conference Winter 1988 * [Pike] Rob Pike and Peter Weinberger, The Hideous Name, USENIX Summer 1985 Conference Proceedings, pp. 563, Portland Oregon, 1985. Short, informal, and drives home why inconsistent naming schemes in an OS are detrimental. Available at http://achille.cs.bell-labs.com/cm/cs/doc/85/1-05.ps.gz. His discussion of naming in plan 9: http://plan9.bell-labs.com/plan9/doc/names.html * [Rosenblum and Ousterhout] M. Rosenblum and J. Ousterhout. "The Design and Implementation of a Log-Structured File System". ACM Transactions on Computer Systems, Vol. 10, No. 1, pp. 26-52, February 1992. Available at http://citeseer.nj.nec.com/rosenblum91design.html. 
This paper was quite influential in a number of ways on many modern filesystems, and the notion of using a cleaner may be applied to a future release of ReiserFS. There is an interesting ongoing debate over the relative merits of FFS vs. LFS architectures, and the interested reader may peruse http://www.scriptics.com/people/john.ousterhout/seltzer93.html and the arguments by Margo Seltzer it links to. * [Snyder] "tmpfs: A Virtual Memory File System" discusses a filesystem built to use swap space and intended for temporary files; due to a complete lack of disk synchronization it offers extremely high performance. * [Vahalia] Uresh Vahalia, "Unix Kernel Internals" * [Reiser93] Reiser, Hans T., Future Vision Whitepaper, 1984, Revised 1993. Available at http://www.namesys.com/whitepaper.html. [[category:Reiser4]] [[category:Formatting-fixes-needed]] {{wayback|http://www.namesys.com/v4/v4.html|2006-11-13}} Reasons why Reiser4 is great for you: * Reiser4 is the fastest filesystem, and here are the benchmarks. * Reiser4 is an atomic filesystem, which means that your filesystem operations either entirely occur, or they entirely don't, and they don't corrupt due to half occurring. We do this without significant performance losses, because we invented algorithms to do it without copying the data twice. * Reiser4 uses dancing trees, which obsolete the balanced tree algorithms used in databases (see farther down). This makes Reiser4 more space efficient than other filesystems because we squish small files together rather than wasting space due to block alignment like they do. It also means that Reiser4 scales better than any other filesystem. Do you want a million files in a directory, and want to create them fast? No problem.
* Reiser4 is based on plugins, which means that it will attract many outside contributors, and you'll be able to upgrade to their innovations without reformatting your disk. If you like to code, you'll really like plugins.... * Reiser4 is architected for military grade security. You'll find it is easy to audit the code, and that assertions guard the entrance to every function. V3 of ReiserFS is used as the default filesystem for SuSE, Lindows, FTOSX, Libranet, Xandros and Yoper. We don't touch the V3 code except to fix a bug, and as a result we don't get bug reports for the current mainstream kernel version. It shipped before the other journaling filesystems for Linux, and is the most stable of them as a result of having been out the longest. We must caution that just as Linux 2.6 is not yet as stable as Linux 2.4, it will also be some substantial time before V4 is as stable as V3. = Software Engineering Based Reiser4 Design Principles = == Equal Source Code Access Is A Civil Right == Copyright and patent laws were invented to give you an incentive to share your knowledge with the rest of the world in return for a limited time monopoly on what you shared. That is not the way it works with software though, because software companies are allowed to keep their source code secret, but are still given monopoly rights over their software. There is little meaningful sharing of knowledge when only binaries are shared with the world, and all the rest is kept secret. The reasons for the existence of copyright and patent laws have been forgotten, their workings have been twisted, and greed and turf defense are what remain of them. Monopoly interests have taken laws intended to promote progress in the arts and sciences, and now use them to further their own control over us by ensuring that innovations not theirs cannot enter the market for improvements to software.
Think of software objects as forming a society, not yet at the level of an AI society, but still a group of programs interacting, and choosing whether to interact, with each other. Think of social lockout, whether it be in the form of racial discrimination as in the civil rights movement, mercantilism as happened a few centuries ago, or the endless other forms of division in human society. Is it so surprising that this evil casts its shadow on cyberspace? Is it so surprising that our cybershadows also find ways to engage in social lockout of others? Most of the cyber-world of software lives under tyranny today. We are part of a movement to create a free cyber-world we can all participate equally in. Namesys does not oppose copyright laws as they were invented (14 year monopolies which disclosed everything that was temporarily monopolized); it opposes copyright laws as they have been twisted. Namesys opposes unlimited time monopolies which disclose nothing and lock out all other inventors. Many others in this movement are opposed to copyright law, even the version of it in which it was first created. We feel they are not acknowledging that a trade-off is being made, and that this trade-off has value. Yet still we choose to give our software away for free for use with software that is given away for free (e.g. Gnu/Linux). Since we don't have a lot of illusions about our ability to entirely change the world, and since it is amusing to sell free software, we charge a license fee to those who do not want to disclose their software and do not want to give it away for free, and we let them keep their improvements to our software without sharing them. These fees help substantially in allowing us to survive as an organization.
We don't make nearly as much money as we would from charging everyone for usage rights, but we do make just enough to get by, and that is important.;-) We don't really feel that everyone should follow our example and make their software no charge for most users (it is too hard to survive fiscally doing this), but we do think that everyone should disclose their source code, and no one should design their software to exclude working with other software (e.g. Microsoft's Palladium which makes such a mockery of Athena). == Software Libre Takes More Than A License --- It Takes A Design == Making the source code available to you is not enough by itself to bring you all of the possible benefits of software libre. Many file systems are so difficult to modify that only someone who has worked with the code for years finds it feasible to modify it, and even then small changes can take months of labor due to their ripple effects on the other code and the difficulties of dealing with disk format changes. This is why we have a plugin based architecture in Reiser4, so that it is not just possible, but easy, to improve the software. Imagine that you were an experimental physicist who had spent his life using only the tools that were in his local hardware store. Then one day you joined a major research lab with a machine shop and a whole bunch of other physicists. All of a sudden you are not using just whatever tools the large tool companies who have never heard of you have made for you. You are now part of a cooperative of physicists all making your own tools, swapping tools with each other, suddenly empowered to have tools that are exactly what you want them to be, or even merely exactly what your colleagues want them to be, rather than what some big tool company, that has to do a market analysis before giving you what you want, wants them to be. That is the transition you will make when you go from version 3 to version 4 of ReiserFS. 
The tools your colleagues and sysadmins (your machinists) make are going to be much better for what you need. == Why Limit Interactions With Objects Strictly? == You may wonder why the design we will present is so highly structured, why every object is allowed to control what is done to it by providing a limited interface, and why we pass requests to objects to do things rather than doing things directly to the objects. Surely we limit our functionality by doing so, yes? Indeed we do, but is there a reason why the price is worth paying? Is there something that becomes crucial as complexity grows? Chaos theory offers the answer. If you disturb one thing, and disturbing that thing inherently disturbs another thing, which in turn disturbs the first thing plus maybe a whole bunch of other things, and those things all disturb the first thing again, and so on, you get what chaos theory calls a feedback loop. These loops have a marvelous tendency for the end effect of the disturbance to be incalculable, and our inability to calculate such loops is perhaps a significant aspect of our being mere mortals. Of course, as you probably know, most programmers want to be gods, and when they are unable to know what the effect of a change they make to their code will be, they dislike this. As a result, they go to great lengths to reduce the tendency of code changes to the design of one object to have ripple effects upon other objects. A vitally important way to do this is to have very strictly defined interfaces to objects, and for the designer of each object to be able to know that the interface will never be violated when he writes it. This is called "object oriented design", or "structured programming", and if used well it can do a lot to reduce a type of chaotic behavior known as bugs. ;-) Verifying the avoidance of interactions that violate the design for an object is a key task in security auditing (inspecting the code to see if it has security holes).
The expressive power of an information system is proportional not to the number of objects that get implemented for it, but instead is proportional to the number of possible effective interactions between objects in it. (Reiser's Law Of Information Economics) This is similar to Adam Smith's observation that the wealth of nations is determined not by the number of their inhabitants, but by how well connected they are to each other. He traced the development of civilization throughout history, and found a consistent correlation between connectivity via roads and waterways, and wealth. He also found a correlation between specialization and wealth, and suggested that greater trade connectivity makes greater specialization economically viable. You can think of namespaces as forming the roads and waterways that connect the components of an operating system. The cost of these connecting namespaces is influenced by the number of interfaces that they must know how to connect to. That cost is, if they are not clever to avoid it, N times N, where N is the number of interfaces, since they must write code that knows how to connect every kind to every kind. One very important way to reduce the cost of fully connective namespaces is to teach all the objects how to use the same interface, so that the namespace can connect them without adding any code to the namespace. Very commonly, objects with different interfaces are segregated into different namespaces. If you have two namespaces, one with N objects, and another with M objects, the expressive power of the objects they connect is proportional to (N times N) plus (M times M), which is less than (N plus M) times (N plus M). Try it on a calculator for some arbitrary N and M. Usually the cost of inventing the namespaces is much less than the cost of the users creating all the objects. 
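The counting claim above is easy to verify: for positive N and M, (N plus M) squared exceeds (N squared) plus (M squared) by exactly 2 times N times M, the cross-namespace interactions that segregation forfeits. A minimal sketch of the arithmetic in Python, using only the text's rough proportionality as the model:

```python
# Expressive power modeled as the number of possible pairwise
# interactions within each namespace (the text's "N times N").
def expressive_power(*namespace_sizes):
    return sum(n * n for n in namespace_sizes)

n, m = 30, 70
split = expressive_power(n, m)      # two segregated namespaces
unified = expressive_power(n + m)   # one unified namespace

# The shortfall is exactly the 2*N*M cross-interactions lost.
assert unified - split == 2 * n * m
print(split, unified)  # prints: 5800 10000
```

Any positive N and M give the same result, which is the point of "try it on a calculator": segregating objects into separate namespaces always loses the cross terms.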
This is what makes namespaces so exciting to work with: you can have an enormous impact on the productivity of the whole system just by being a bit fanatical in insisting on simplicity and consistency in a few areas. Please remember this analysis later when we describe why we implement everything to support a "file" or "directory" interface, and why we aren't eager to support objects with unnecessarily different namespaces/interfaces --- such as "attributes" that cannot interact with files in all the same ways that files can interact with files. = Basic Semantics = To interact with an object you name it, and you say what you want it to do. The filesystem takes the name you give, and looks through things we call directories to find the object, and then gives the object your request to do something. == Files == A file is something that tries to look like a sequence of bytes. You can read the bytes, and write the bytes. You can specify what byte to start to read/write from (the offset), and the number of bytes to read/write (the count). [Diagram needed]. You can also cut bytes off of the end of the file. Cutting bytes out of the middle or the beginning of a file, and inserting bytes into the middle of a file, are not permitted by any of our current file plugins, all of which implement fairly ancient Unix file semantics, but this is likely to change someday. === The Software Engineering Lurking Below File Plugins === Your interactions with a file are handled by the file's "plugin". These interactions are structured (in programming, such structures are generally called "interfaces") into a set of limited and defined interactions. (We are too lazy to perform the infinite work of programming plugins to handle infinite types of interactions.) Each way you can interact with a plugin is called a "method". A plugin is composed as a set of such methods.
Among programmers, laziness is considered the highest art form, and we do our best to express our souls in this art. This is why we have layers and layers of laziness built into our plugin architecture. Each method is composed from a library of functions we thought would be useful in constructing plugin methods. Each plugin is composed from a library of methods used by plugins, and a plugin can be considered a one-to-one mapping (that's where you have two sets of things, and for every member of one set, you specify a member of the other set as its match) of every way of interacting with the plugin to a method handling it. For every file, there is a file pluginid. Whenever you attempt to interact with a file, we take the name of the file, find the pluginid for the file, and inside the kernel we have an array of plugins [diagram needed that is suitable for persons who don't know what an array or offset is], and we use the pluginid as the offset of that file's plugin within that array. (An offset is a position relative to something else, and in programming it is typically measured in bytes.) This implies that when you invent a new file plugin, you have to recompile (Programmers don't actually write programs, they got too lazy for that long ago, instead they write instructions for the computer on how to write the program, and when the computer follows these instructions ("source code"), it is called "compiling", which programmers usually pretend was done by them when they speak about it, as in "I recompiled the kernel for my exact CPU this time, and now playing pong is noticeably faster.".) the kernel, and you can only add plugins to the end of the list, and you can never reuse or change pluginids for a plugin, or else you will have to go through the whole filesystem changing all of the pluginids that are no longer correct. 
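The lookup described above, a pluginid used as an offset into a kernel array of plugins, can be sketched in a few lines. This is an illustrative Python model only; every name in it is hypothetical, and the real Reiser4 plugin tables are C structures of function pointers:

```python
# Hypothetical sketch of pluginid-based dispatch: the kernel holds an
# array of plugins, and each file stores only the index (pluginid)
# of its plugin within that array.

class UnixFilePlugin:
    """Toy stand-in for a file plugin: a set of methods."""

    def read(self, file, offset, count):
        return file["bytes"][offset:offset + count]

    def write(self, file, offset, data):
        body = file["bytes"]
        file["bytes"] = body[:offset] + data + body[offset + len(data):]

# Plugins may only ever be appended to this table; reusing or
# renumbering a slot would invalidate pluginids already stored on disk.
PLUGIN_TABLE = [UnixFilePlugin()]

def plugin_for(file):
    return PLUGIN_TABLE[file["pluginid"]]  # pluginid is the array offset

f = {"pluginid": 0, "bytes": b"hello world"}
plugin_for(f).write(f, 6, b"there")
print(plugin_for(f).read(f, 0, 11))  # prints: b'hello there'
```

The append-only rule in the comment is the sketch's version of the constraint in the text: pluginids persist on disk, so the table's existing offsets must never change meaning.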
Someday in a later version we will revise this so that plugins are "dynamically loadable" (which is when you can add something to a program while it is running), and you can add support for new plugins to a running kernel. When we do that we will carefully benchmark it and ensure that there is no loss of performance from using dynamic loading (or we won't do it). Programs are often "layered", which is when the program is divided into layers, and each layer only talks to the layer immediately above it, or immediately below it, and never talks to a part of the program two levels below it, etc. This reduces the complexity of the interfaces for the various parts of the program, and most of the complexity of a program is in coding its interfaces. Reiser4 has a "semantic layer", and this semantic layer concerns itself with naming objects and specifying what to do to the objects, and doesn't concern itself with such things as how to pack objects into particular places on disk or in the tree. An I/O to a file may affect more than one physical sequence of bytes, or no physical sequence of bytes, it may affect the sequences of bytes offered by other files to the semantic layer, and the file plugin may invoke other plugins and delegate work to them, but its interface is structured for offering the caller the ability to read and/or write what the caller sees as being a single sequence of bytes. Appearances are what is wanted. When we say that security attributes are implemented as files, we mean that security attributes look like a sequence of bytes, but the security attributes may be stored in some compressed form that perhaps might be of fixed length, or even be just a single bit.
For the filesystem to offer the benefits of simplicity it need merely provide a uniform appearance that all things it stores are sequences of bytes, and there is nothing to prevent it from gaining efficiency through using many different storage implementations to offer this uniform appearance. For many files it is valuable for them to support efficient tree traversal to any offset in the sequence of bytes. It is not required though, and Unix/Gnu/Linux has traditionally supported some types of files which could not do this. A pipe will allow you to take the output of one command, and connect it to the input of another command, and each of the commands will see the pipe as a file. A pipe is an example of a file for which you cannot simply jump to the middle efficiently; instead you must go through it from beginning to end in sequential order. == Names and Objects == A name is a means of selecting an object. An object is anything that acts as though it is a single unified entity. What is an object is context dependent. For instance, if you tell an object to delete itself, many distinctly named entities (that are distinct objects in other ways such as reading) might well disappear as though they are a single object in response to the delete request. A namespace is a mapping of names to objects. Filesystems, databases, search engines, and environment variable names within shells are all examples of namespaces. The early papers using the term tended to seek to convey that namespaces have commonality in their structure, are not fundamentally different, should be based on common design principles, and should be unified. Such unification is a bit of a quest for a holy grail. In British mythology King Arthur sent his knights out on a quest for the holy grail, and if only they could become worthy of it, it would appear to them. None of them found it, and yet the quest made them what they became.
Namespaces will never be unified, but the closer we can come to it, the more expressive power the OS will have. Reiser4 seeks to create a storage layer effective for such an eventually unified namespace, and gives it a semantic layer with some minor advantages over the state of the art. Later versions will add more and more expressive semantics to the storage layer. Finding objects is layered. The semantic layer takes names and converts them into keys (we call this "resolving" the name). The storage layer (which contains the tree traversing code) takes keys and finds the bytes that store the parts of the object. Keys are the fundamental name used by the Reiser4 tree. They are the name that the storage layer at the bottom of it all understands. They can be used to find anything in the tree, not just whole objects, but parts of objects as well. Everything in the tree has exactly one key. Duplicate keys are allowed, but their use usually means that all duplicates must be examined to see if they really contain what is sought, and so duplicates are usually rare if high performance is desired. Allowing duplicates can allow keys to be more compact in some circumstances (e.g. hashed directory entries). An objectid cannot be used for finding an object, only keys can. Objectids are used to compose keys so as to ensure that keys are unique. == Ordering of Name Components == When designing the naming system described in the future vision whitepaper I broke names from human and computer languages into their pieces, and then looked at their pieces to see which ones differed from each other in meaningful ways vs. which pieces were different expressions that provided the same functionality. (In more formal language, I would say that I systematically decomposed the ways of naming things that we use in human and computer languages into orthogonal primitives, and then determined their equivalence classes.) 
I then selected one way of expression from each set of ways that provided equivalent functionality. (Since that whitepaper is focused on what is not yet implemented, the whitepaper does not list all of the equivalence classes for names, but instead describes those which I thought I could say something interesting to the reader about. For instance, the NOT operator is simply unmentioned in it, as I really have nothing interesting to say about NOT, though it is very useful and will be documented when implemented.) The ordering of two components of a name either has meaning, or it does not. If the resolution of one component of the name depends on what is named by another component, then that pair of name components forms a hierarchical name. Hierarchy can be indicated by means other than ordering. Many human languages indicate structure by use of suffix or tag mechanisms (e.g. Russian and Japanese). The syntactical mechanism one chooses to express hierarchy does not determine the possible semantics one can express so long as at least one effective method for expressing hierarchy is allowed. I choose to only offer one expression from each equivalence class of naming primitives, and here I chose the '/' separated file pathname expression traditional to Unix for pragmatic compatibility with existing operating systems. Reiser4 handles only hierarchical names, and non-hierarchical names are planned only for SSN Reiserfs. == Directories == Hierarchical names are implemented in Reiser4 by use of directories. The first component of a hierarchical name is the name of the directory, and the components that follow are passed to the directory to interpret. We use `/' to separate the components of a hierarchical name. Directories may choose to delegate parts of their task to their sub-directories. 
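The delegation just described can be sketched as a short recursion: each directory resolves the first '/'-separated component of the name and hands the remainder to the selected sub-directory. A hypothetical Python illustration, with the directory layout and keys invented for the example:

```python
# Hypothetical sketch of hierarchical name resolution by delegation.
# Each directory is modeled as a mapping from one name component to
# either a sub-directory or a key (the storage layer's name).

def resolve(directory, name):
    if "/" in name:
        head, rest = name.split("/", 1)       # first component, remainder
        return resolve(directory[head], rest)  # delegate to sub-directory
    return directory[name]                     # final component: the key

root = {
    "etc": {"passwd": "key:42"},
    "usr": {"local": {"bin": {"prog": "key:99"}}},
}

print(resolve(root, "etc/passwd"))        # prints: key:42
print(resolve(root, "usr/local/bin/prog"))  # prints: key:99
```

In Reiser4 terms, the value returned here stands for the key that the semantic layer hands to the storage layer; a real directory plugin could resolve components by any method at all, so long as it produces keys.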
The unix directory plugin when supplied with a name will use the part of the name before the first / to select a sub-directory (if there is a / in what it is resolving), and delegate resolving the part of the name after the first / to the sub-directory. A directory can employ any arbitrary method at all of resolving the name components passed to it, so long as it returns a set of keys of objects as the result. In Reiser4, this set of keys always contains exactly one member, but this is designed to change in SSN Reiserfs. (Reiser4 also needs to interact with a standard interface for Unix filesystems called VFS (Virtual File System), and directories are also designed to be able to return what VFS understands, which we won't go into here.) Directories will also return a list of names when asked. This list is not required to be a complete list of all names that they can resolve, and sometimes it is not desirable that it be so. Names can be hidden names in Reiser4. Directory plugins may be able to resolve more names than they can list, especially if they are written such that the number of names that they can resolve is infinite. In particular, such names can resolve to objects that behave like ordinary files (with respect to the standard file system interface: read, write, readdir, etc.) but are not backed by the storage layer. Such objects are called "pseudo files". Here is a list of pseudo files currently implemented in Reiser4 with a description of their semantics. === The Unix Directory Plugin === The unix directory plugin implements directories by storing a set of directory entries per directory. These directory entries contain a name, and a key.
When given a name to resolve, the unix directory plugin finds the directory entry containing that name, and then returns the key that is in the directory entry (more precisely, since a key selects not just the file but a particular byte within a file, it returns that part of the key which is sufficient to select the file, and which is sufficient to allow the code to determine the full keys for the file's various parts once the byte offset and some other fields (like item type) are added to the partial key to form a whole key). The key can then be used by the tree storage layer to find all the pieces of that which was named. ==== Some Historical Details Of Design Flaws In The Unix Directory Interface ==== Unix differs from Multics in that Multics defined a file to be a sequence of elements (the elements could be bytes, directory entries, or something else....), while Unix defines a file to be purely a sequence of bytes. In Multics directories were then considered to be a particular type of file which was a sequence of directory entries. For many years, all implementations of Unix directories were as sequences of bytes, and the notion of location within a Unix directory is tied not to a name as you might expect, but to a byte offset within the directory. The problem is that one is using a byte offset to represent a location whose true meaning is not a byte offset but a directory entry, and doing so for a particular file in a system which meaningfully names that file not by byte offset within the directory but by filename. Various efforts are being made in the Unix community to pretend that this byte offset is something more general than a byte offset, and they often try to do so without increasing the size used to store the thing which they pretend is not a byte offset. Since byte offsets are normally smaller than filenames are allowed to be, the result is ugliness and pathetic kludges.
Trust me that you would rather not know about the details of those kludges unless you absolutely have to, and let me say no more.

==== Directories Are Unordered ====
Unix/Linux makes no promises regarding the order of names within directories. The order in which files are created is not necessarily the order in which names will be listed in a directory, and the use of lexicographic (alphabetic) order is surprisingly rare. The unix utilities typically sort directory listings after they are returned by the filesystem, which is why it seems like the filesystem sorts them, and is why listing very large directories can be slow. (Our current default plugin sorts filenames that are less than 15 characters long lexicographically. For those that are more than 15 characters long it sorts them first by their first 8 characters and then by the hash of the whole name.) There is value to allowing the user to specify an arbitrary order for names using an arbitrary ordering function the user supplies. This is not done in Reiser4, but is planned as a feature of later versions. Allowing the creation of a hash plugin is a limited form of this that is currently implemented.

== Files That Are Also Directories ==
In Reiser4 (but not ReiserFS 3) an object can be both a file and a directory at the same time. If you access it as a file, you obtain the named sequence of bytes. If you use it as a directory you can obtain files within it, directory listings, etc. There was a lengthy discussion on the Linux Kernel Mailing List about whether this was technically feasible to do. I won't reproduce it here except to summarize that Linus showed that this was feasible without "breaking" VFS. Allowing an object to be both a file and a directory is one of the features necessary to compose the functionality present in streams and attributes using files and directories.
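As a rough illustration of an object that answers both as a file and as a directory, here is a minimal, hypothetical sketch; the `FileDir` class and `lookup` helper are invented for illustration and are not Reiser4 code.

```python
# A minimal, hypothetical sketch of an object that is both a file and a
# directory: named as a file it yields its body; named through as a
# directory it resolves further name components.
class FileDir:
    def __init__(self, body=b""):
        self.body = body        # what you get when you name it as a file
        self.children = {}      # what you search when you name through it

def lookup(root, path):
    """Walk compound names whose components are separated by '/'."""
    obj = root
    for part in path.strip("/").split("/"):
        obj = obj.children[part]
    return obj

root = FileDir()
report = FileDir(b"the file body")
report.children["rwx"] = FileDir(b"0644")   # a permission attribute as a file
root.children["report"] = report

file_body = lookup(root, "/report").body        # accessed as a file
attribute = lookup(root, "/report/rwx").body    # accessed as a directory
```

Note that nothing special distinguishes the attribute lookup from an ordinary path lookup: that is exactly the point of composing attributes from files and directories.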
To implement a regular unix file with all of its metadata, we use a file plugin for the body of the file, a directory plugin for finding file plugins for each of the metadata, and particular file plugins for each of the metadata. We use a unix_file file plugin to access the body of the file, and a unix_file_dir directory plugin to resolve the names of its metadata to particular file plugins for particular metadata. These particular file plugins for unix file metadata (owner, permissions, etc.) are implemented to allow the metadata normally used by unix files to be quite compactly stored.

=== Hidden Directory Entries ===
A file can exist but not be visible when using readdir in the usual way. WAFL does this with the .snapshots directory; it works well for them without disturbing users. This is useful for adding access to a variety of new features and their applications without disturbing the user when they are not relevant.

== New Security Attributes and Set Theoretic Semantic Purity ==

=== Minimizing Number Of Primitives Is Important In Abstract Constructions ===
To a theoretician it is extremely important to minimize the number of primitives with which one achieves the desired functionality in an abstract construction. It is a bit hard to explain why this is so, but it is well accepted that breaking an abstract model into more basic primitives is very important. A not very precise explanation of why is to say that by breaking complex primitives into their more basic primitives, then recombining those basic primitives differently, you can usually express new things that the original complex primitives did not express. Let's follow this grand tradition of theoreticians and see what happens if we apply it to Gnu/Linux files and directories.

== Can We Get By Using Just Files and Directories (Composing Streams And Attributes From Files And Directories)? ==
In Gnu/Linux we have files, directories, and attributes. In NTFS they also have streams.
Since Samba is important to Gnu/Linux, there frequently are requests that we add streams to ReiserFS. There are also requests that we add more and more different kinds of attributes using more and more different APIs. Can we do everything that can be done with {files, directories, attributes, streams} using just {files, directories}? I say yes--if we make files and directories more powerful and flexible. I hope that by the end of reading this you will agree. Let us have two basic objects. A file is a sequence of bytes that has a name. A directory is a name space mapping names to a set of objects "within" the directory. We connect these directory name spaces such that one can use compound names whose subcomponents are separated by a delimiter '/'. What do attributes and streams offer that is missing from files and directories? In ReiserFS 3, there exist file attributes. File attributes are out-of-band data describing the sequence of bytes which is the file. For example, the permissions defining who can access a file, or the last modification time, are file attributes. File attributes have their own API; creating new file attributes creates new code complexity and compatibility issues galore. ACLs are one example of new file attributes users want. Since in Reiser4 files can also be directories, we can implement traditional file attributes as simply files. To access a file attribute, one need merely name the file, followed by a '/', followed by an attribute name. That is: a traditional file will be implemented to possess some of the features of a directory; it will contain files within the directory corresponding to file attributes which you can access by their names; and it will contain a file body which is what you access when you name the "directory" rather than the file. Unix currently has a variety of attributes that are distinct from files (ACLs, permissions, timestamps, other mostly security related attributes, ...).
This is because a variety of people needed this feature and that, and there was no infrastructure that would allow implementing the features as fully orthogonal features that could be applied to any file. Reiser4 will create that infrastructure.

=== List Of Features Needed To Get Attribute And Stream Functionality From Files And Directories ===
* api efficient for small files
* efficient storage for small files
* plugins, including plugins that can compress a file serving as an attribute into a single bit
* files that also act as directories when accessed as directories
* inheritance (includes file aggregation)
* constraints
* transactions
* hidden directory entries
Each of these additional features is a feature that would benefit the filesystem. So we add them in v4.

= Basic Tree Concepts =
== Trees, Nodes, and Items ==
One way of organizing information is to put it into trees. When we organize information in a computer, we typically sort it into piles (nodes we call them), and there is a name (a pointer) for each pile that the computer will be able to use to find the pile.

Figure 1. One Example Of A Tree: a height = 4, fanout = 3, balanced tree. It starts with a root node, then traverses 2 levels of internal nodes, and ends with the leaf nodes, which hold the data and have no children.

Some of the nodes can contain pointers, and we can go looking through the nodes to find those pointers to (usually other) nodes. We are particularly interested in how to organize so that we can find things when we search for them. A tree is an organization structure that has some useful properties for that purpose. Definition of Tree:
# A tree is a set of nodes organized into a root node, and zero or more additional sets of nodes called subtrees.
# Each of the subtrees is a tree.
# No node in the tree points to the root node, and exactly one pointer from a node in the tree points to each non-root node in the tree.
# The root node has a pointer to each of its subtrees, that is, a pointer to the root node of the subtree.

== Fine Points of the Definition ==
Figure 2. The simplest tree: the absolutely most trivial of all graphs, the single, isolated node.

Figure 3. A trivial, linear tree: a trivial, connected, linear (unary) graph, a linear sequence of nodes connected by paths (edges, pointers).

It is interesting to argue over whether finite should be a part of the definition of trees. There are many ways of defining trees, and which is the best definition depends on what your purpose is. Donald Knuth (a well known author of algorithm textbooks) supplies several definitions of tree. As his primary definition of tree he even supplies one which has no pointers/edges/lines in the definition, just sets of nodes. Reiser4 uses a finite tree (the number of nodes is limited). Knuth defines trees as being finite sets of nodes. There are papers on infinite trees on the Internet. I think it more appropriate to consider finite an additional qualifier on trees, rather than bundling finite into the definition. However, I personally only deal with finite trees in my storage layer research. It is interesting to consider whether storage layers are inherently more motivated than semantic layers to limit themselves to finite trees rather than infinite trees. This is where some writers would say ".... is left as an exercise for the reader". :-) Oh the temptation.... I will remind the reader of my explanation of why storage layer trees are more motivated to be acyclic, and, at the cost of some effort at honesty, constrain myself to saying that doing more than providing that hint is beyond my level of industry. ;-) Edge is a term often used in tree definitions. A pointer is unidirectional (you can follow it from the node that has it to the node it points to, but you cannot follow it back from the node it points to, to the node that has it).
An edge is bidirectional (you can follow it in both directions). Here are three alternative tree definitions, which are interesting in how they are mathematically equivalent to each other, though they are not equivalent to the definition I supplied because edges are not equivalent to pointers. For all three of these definitions, let there be not more than one edge connecting the same two nodes. A tree is:
* a set of vertices (aka points) connected by edges (aka lines) for which the number of edges is one less than the number of vertices
* or a set of vertices connected by edges which has no cycles (a cycle is a path from a vertex to itself)
* or a set of vertices connected by edges for which there is exactly one path connecting any two vertices
The three alternative definitions do not have a unique root in their tree, and such trees are called free trees. The definition I supplied is a definition of a rooted tree, not a free tree. It also has no cycles, it has one less pointer than it has nodes, and there is exactly one path from the root to any node. Please feel encouraged to read Knuth's writings for more discussions of these topics.

= Graphs vs. Trees =
Consider the purposes for which you might want to use a graph, and those for which you might want to use a tree. In a tree there is exactly one path from the root to each node in the tree, and a tree has the minimum number of pointers sufficient to connect all the nodes. This makes it a simple and efficient structure. Trees are useful when efficiency with minimal complexity is what is desired, and there is no need to reach a node by more than one route. Reiser4 has both graphs and trees, with trees used when the filesystem chooses the organization (in what we call the storage layer, which tries to be simple and efficient), and graphs used when the user chooses the organization (in the semantic layer, which tries to be expressive so that the user can do whatever he wants).
= Ordering The Tree Aids Searching Through It =
== Keys ==
We assign everything stored in the tree a key. We find things by their keys. Use of keys gives us additional flexibility in how we sort things, and if the keys are small, it gives us a compact means of specifying enough to find the thing. It also limits what information we can use for finding things. This limit restricts its usefulness, and so we have a storage layer, which finds things by keys, and a semantic layer, which has a rich naming system. The storage layer chooses keys for things solely to organize storage in a way that will improve performance, and the semantic layer understands names that have meaning to users. As you read, you might want to think about whether this is a useful separation that allows freedom in adding improvements that aid performance in the storage layer, while escaping paying a price for the side effects of those improvements on the flexible naming objectives of the semantic layer.

== Choosing Which Subtree ==
We start our search at the root, because from the root we can reach every other node. How do we choose which subtree of the root to go to from the root? The root contains pointers to its subtrees. For each pointer to a subtree there is a corresponding left delimiting key. Pointers to subtrees, and the subtrees themselves, are ordered by their left delimiting key. A subtree pointer's left delimiting key is equal to the least key of the things in the subtree. Its right delimiting key is larger than the largest key in the subtree, and it is the left delimiting key of the next subtree of this node. Each subtree contains only things whose keys are at least equal to the left delimiting key of its pointer, and are not more than its right delimiting key. If there are no duplicate keys in the tree, then each subtree contains only things whose keys are less than its right delimiting key.
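The subtree choice described above can be sketched in a few lines. This is a minimal illustration assuming sorted left delimiting keys and no duplicate keys; the names are invented for the sketch, not taken from the Reiser4 internals.

```python
# A minimal sketch of descending the tree: pick the child whose key range
# covers the search key, using the sorted left delimiting keys.
# Assumes no duplicate keys in the tree.
from bisect import bisect_right

def choose_subtree(left_delim_keys, child_pointers, search_key):
    """left_delim_keys[i] is the least key stored under child_pointers[i]."""
    i = bisect_right(left_delim_keys, search_key) - 1
    if i < 0:
        raise KeyError("search key is below the least key in the tree")
    return child_pointers[i]

children = ["subtree-A", "subtree-B", "subtree-C"]
delims = [10, 40, 90]     # each child's left delimiting key
picked = choose_subtree(delims, children, 55)   # 40 <= 55 < 90
```

Because each child's right delimiting key is the next child's left delimiting key, a single sorted list of left delimiting keys is enough to select the child; `bisect_right` finds the last child whose left delimiting key does not exceed the search key.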
If there are no duplicate keys, then by looking within a node at its pointers to subtrees and their delimiting keys we know which subtree of that node contains the thing we are looking for. Duplicate keys are a topic for another time. For now I will just hint that when searching through objects with duplicate keys we find the first of them in the tree, and then we search through all duplicates one-by-one until we find what we are looking for. Allowing duplicate keys can allow for smaller keys, so there is sometimes a tradeoff between key size and the average frequency of such inefficient linear searches. Using duplicate keys can also allow, if one defines one's insertion algorithms such that they always insert at the end of a set of duplicate keys, ordering objects with the same key by creation time. The contents of each node in the tree are sorted within the node. So, the entire tree is sorted by key, and for a given key we know just where to go to find at least one thing with that key.

== Nodes ==
=== Leaves, Twigs, and Branches ===
Leaves are nodes that have no children. Internal nodes are nodes that have children.

Figure 4. A height = 4, fanout = 3, balanced tree. A search will start with the root node, the sole level 4 internal node, traverse 2 more internal nodes, and end with a leaf node which holds the data and has no children.

A node that contains items is called a formatted node. If an object is large, and is not compressed and doesn't need to support efficient insertions (compressed objects are special because they need to be able to change their space usage when you write to their middles, because the compression might not be equally efficient for the new data), then it can be more efficient to store it in nodes without any use of items at all.
We do so by default for objects larger than 16k. Unformatted leaves (unfleaves) are leaves that contain only data, and do not contain any formatting information. Only leaves can contain unformatted data. Pointers are stored in items, and so all internal nodes are necessarily formatted nodes. Pointers to unfleaves are different in their structure from pointers to formatted nodes. Extent pointers point to unfleaves. An extent is a sequence of unfleaves that are contiguous in block number order and belong to the same object. An extent pointer contains the starting block number of the extent, and a length. [diagram needed] Because the extent belongs to just one object, we can store just one key for the extent, and then we can calculate the key of any byte within that extent. If the extent is at least 2 blocks long, extent pointers are more compact than regular node pointers would be. Node pointers are pointers to formatted nodes. We do not yet have a compressed version of node pointers, but they are probably soon to come. Notice how with extent pointers we don't have to store the delimiting key of each node pointed to, while with node pointers we do. We will probably introduce key compression at the same time we add compressed node pointers. One would expect keys to compress well since they are sorted into ascending order. We expect our node and item plugin infrastructure will make such features easy to add at a later date. Twigs are parents of leaves. Extent pointers exist only in twigs. This is a very controversial design decision I will discuss a bit later. Branches are internal nodes that are not twigs. You might think we would number the root level 1, but since the tree grows at the top, it turns out to be more useful to number as 1 the level with the leaves where object data is stored. The height of the tree will depend upon how many objects we have to store and what the fanout rate (average number of children) of the internal and twig nodes will be.
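The arithmetic that makes one key per extent sufficient can be sketched as follows. The field names and the 4096-byte block size are assumptions for the illustration.

```python
# Sketch: with one key per extent we can compute where any byte of the
# extent lives on disk. Field names are illustrative; 4096-byte blocks
# are assumed.
BLOCK_SIZE = 4096

def block_of_byte(extent_start_block, extent_len, key_offset, byte_offset):
    """Map a byte offset within the object to a disk block number.

    key_offset is the byte offset (within the object) of the extent's
    first byte, i.e. the offset component of the extent pointer's key.
    """
    delta = byte_offset - key_offset
    assert 0 <= delta < extent_len * BLOCK_SIZE, "byte not in this extent"
    return extent_start_block + delta // BLOCK_SIZE

# An extent of 3 contiguous unfleaves starting at disk block 1000,
# holding bytes 8192..20479 of the object:
first = block_of_byte(1000, 3, 8192, 8192)
last = block_of_byte(1000, 3, 8192, 20479)
```

This is why the extent pointer needs only a starting block number and a length: contiguity in block number order makes every interior block's address computable rather than stored.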
For reasons of code simplicity, we find it easiest to implement Reiser4 such that it has a minimum height of 2, and the root is always an internal node. There is nothing deeper than judicious laziness to this: it simplifies the code to not deal with one node trees, and nobody cares about the waste of space.

Figure 5. An example of a Reiser4 tree: a 4 level, balanced tree with a fanout of 3, starting with a root node, then traversing branch nodes, including the internal nodes called twig nodes (a Reiser4 feature), and ending with the leaf nodes which hold the data and have no children. In practice Reiser4 fanout is much higher and varies from node to node, but a 4 level tree diagram with 16 million leaf nodes won't fit easily onto my monitor so I drew something smaller.... ;-)

=== Size of Nodes ===
We choose to make the nodes equal in size. This makes it much easier to allocate the unused space between nodes, because it will be some multiple of node size, and there are no problems of space being free but not large enough to store a node. Also, disk drives have an interface that assumes equal size blocks, which they find convenient for their error-correction algorithms. If having the nodes be equal in size is not very important, perhaps due to the tree fitting into RAM, then using a class of algorithms called skip lists is worthy of consideration. Reiser4 nodes are usually equal to the size of a page, which if you use Gnu/Linux on an Intel CPU is currently 4096 (4k) bytes. There is no measured empirical reason to think this size is better than others, it is just the one that Gnu/Linux makes easiest and cleanest to program into the code, and we have been too busy to experiment with other sizes.

=== Sharing Blocks Saves Space ===
If nodes are of equal size, how do we store large objects? We chop them into pieces. We call these pieces items. Items are sized to fit within a single node. Conventional filesystems store files in whole blocks.
Roughly speaking, this means that on average half a block of space is wasted per file because not all of the last block of the file is used. If a file is much smaller than a block, then the space wasted is much larger than the file. It is not effective to store such typical database objects as addresses and phone numbers in separately named files in a conventional filesystem because it will waste more than 90% of the space in the blocks it stores them in. By putting multiple items within a single node in Reiser4, we are able to pack multiple small pieces of files into one block. Our space efficiency is roughly 94% for small files. This does not count per item formatting overhead, whose percentage of total space consumed depends on average item size, and for that reason is hard to quantify. Aligning files to 4k boundaries does have advantages for large files though. When a program wants to operate directly on file data without going through system calls to do it, it can use mmap() to make the file data part of the process's directly accessible address space. Due to some implementation details mmap() needs file data to be 4k aligned, and if the data is already 4k aligned, it makes mmap() much more efficient. In Reiser4 the current default is that files that are larger than 16k are 4k aligned. We don't yet have enough empirical data and experience to know whether 16k is the precise optimal default value for this cutoff point, but so far it seems to at least be a decent choice. == Items == Nodes in the tree are smaller than some of the objects they hold, and larger than some of the objects they hold, so how do we store them? One way is to pour them into items. An item is a data container that is contained entirely within a single node, and it allows us to manage space within nodes. For the default 4.0 node format, every item has a key, an offset to where in the node the item body starts, a length of the item body, and a pluginid that indicates what type of item it is. 
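The item bookkeeping just described (a key, an offset into the node, a length, and a plugin id per item) can be sketched as follows. The sizes, names, and packing policy here are illustrative assumptions, not the actual node format.

```python
# Sketch of the bookkeeping a formatted node needs: each item head
# records a key, the offset of the body within the node, its length,
# and a plugin id. Sizes and the packing policy are illustrative only.
NODE_SIZE = 4096
HEAD_SIZE = 16   # assumed size of one item head

class Node:
    def __init__(self):
        self.heads = []       # (key, offset, length, plugin_id) per item
        self.body_end = 0     # bodies grow from the front of the node

    def free_space(self):
        return NODE_SIZE - self.body_end - HEAD_SIZE * len(self.heads)

    def insert(self, key, body, plugin_id):
        if len(body) + HEAD_SIZE > self.free_space():
            return False      # caller must split or shift to a neighbor
        self.heads.append((key, self.body_end, len(body), plugin_id))
        self.body_end += len(body)
        return True

n = Node()
ok1 = n.insert(1, b"x" * 100, "stat_data")
ok2 = n.insert(2, b"y" * 200, "body")
remaining = n.free_space()
```

The payoff is in `free_space()`: two items totaling 300 bytes of bodies consume 300 bytes plus two small heads, rather than two whole 4k blocks, which is where the small-file space efficiency comes from.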
Items allow us to not have to round up to 4k the amount of space required to store an object.

=== The Structure of an Item ===
 Item_Head:  Item_Key | Item_Offset | Item_Length | Item_Plugin_id
 Item_Body:  (stored separately from the head, elsewhere in the node)

=== Types Of Items ===
Reiser4 includes many different kinds of items designed to hold different kinds of information.
* static_stat_data: holds the owner, permissions, last access time, creation time, last modification time, size, and the number of links (names) to a file.
* cmpnd_dir_item: holds directory entries, and the keys of the files they link to.
* extent pointers: explained above
* node pointers: explained above
* bodies: holds parts of files that are not large enough to be stored in unfleaves.

== Units ==
We call a unit that which we must place as a whole into an item, without splitting it across multiple items. When traversing an item's contents it is often convenient to do so in units:
* For body items the units are bytes.
* For directory items the units are directory entries. The directory entries contain a name and a key of the file named (or at least the item plugin can pretend they do; in practice the name and key may be compressed).
* For extent items the units are extents. Extent items only contain extents from the same file.
* For static_stat_data the whole stat data item is one indivisible unit of fixed size.

== What the Default Node Formats For ReiserFS 4.0 Look Like ==
An unformatted leaf node (unfleaf node), which is the only node without a Node_Header, has the trivial structure:
 ....data, nothing but data....

A formatted leaf node has the structure:
 Block_Head | Item_Body0 | Item_Body1 | - - - | Item_Bodyn | ....Free Space.... | Item_Headn | - - - | Item_Head1 | Item_Head0

A twig node has the structure:
 Block_Head | Item_Body0 = NodePointer0 | Item_Body1 = ExtentPointer1 | Item_Body2 = NodePointer2 | Item_Body3 = ExtentPointer3 | - - - | Item_Bodyn = NodePointern | ....Free Space.... | Item_Headn | - - - | Item_Head0

A branch node has the structure:
 Block_Head | Item_Body0 = NodePointer0 | - - - | Item_Bodyn = NodePointern | ....Free Space.... | Item_Headn | - - - | Item_Head0

= Tree Design Concepts =
== Height Balancing versus Space Balancing ==
Height Balanced Trees are trees such that each possible search path from root node to leaf node has exactly the same length (length = number of nodes traversed from root node to leaf node). For instance the height of the tree in Figure 1 is four, while the height of the left hand tree in Figure 1.3 is three and of the single node in Figure 2 is 1. The term balancing is used for several very distinct purposes in the balanced tree literature. Two of the most common are: to describe balancing the height, and to describe balancing the space usage within the nodes of the tree. These quite different definitions are unfortunately a classic source of confusion for readers of the literature. Most algorithms for accomplishing height balancing do so by only growing the tree at the top. Thus the tree never gets out of balance.

Figure 6. An unbalanced tree: a 4 level tree with fanout n = 3 that has lost some nodes to deletions and needs to be balanced.

== Three Principal Considerations in Tree Design ==
Three of the principal considerations in tree design are:
* the fanout rate (see below)
* the tightness of packing
* the amount of the shifting of items in the tree from one node to another that is performed (which creates delays due to waiting while things move around in RAM, and on disk).

== Fanout ==
The fanout rate n refers to how many nodes may be pointed to by each level's nodes.
(see Figure 7) If each node can point to n nodes of the level below it, then starting from the top, the root node points to n internal nodes at the next level, each of which points to n more internal nodes at its next level, and so on... m levels of internal nodes can point to n<sup>m</sup> leaf nodes containing items in the last level. The more you want to be able to store in the tree, the larger you have to make the fields in the key that first distinguish the objects (the objectids), and then select parts of the object (the offsets). This means your keys must be larger, which decreases fanout (unless you compress your keys, but that will wait for our next version....).

Figure 7. Three 4 level, height balanced trees with fanouts n = 1, 2, and 3. The first graph is a four level tree with fanout n = 1. It has just four nodes: it starts with the (red) root node, traverses the (burgundy) internal and (blue) twig nodes, and ends with the (green) leaf node which contains the data. The second tree, with 4 levels and fanout n = 2, starts with a root node, traverses 2 internal nodes, each of which points to two twig nodes (for a total of four twig nodes), and each of these points to 2 leaf nodes for a total of 8 leaf nodes. Lastly, a 4 level, fanout n = 3 tree is shown which has 1 root node, 3 internal nodes, 9 twig nodes, and 27 leaf nodes.
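The fanout arithmetic (m levels of internal nodes addressing up to n to the power m leaves) can be checked with a short sketch; the helper name is invented for the illustration.

```python
# Sketch: how many levels of internal nodes are needed to address a
# given number of leaves at a given fanout. Integer-exact (no float
# logarithms), purely illustrative.
def internal_levels_needed(fanout, leaves):
    """Smallest m with fanout**m >= leaves."""
    levels, reach = 0, 1
    while reach < leaves:
        reach *= fanout
        levels += 1
    return levels

# The fanout n = 3 tree of Figure 7: 3 internal levels address 27 leaves.
three_levels = internal_levels_needed(3, 27)
# Raising fanout flattens the tree: 16 million leaves at fanout 100.
flat = internal_levels_needed(100, 16_000_000)
```

This is the quantitative face of the key-size tradeoff in the text: anything that shrinks keys raises the fanout n, and the required number of levels shrinks logarithmically in n.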
== What Are B+Trees, and Why Are They Better than B-Trees ==
It is possible to store not just pointers and keys in internal nodes, but also to store the objects those keys correspond to in the internal nodes. This is what the original B-tree algorithms did. Then B+trees were invented in which only pointers and keys are stored in internal nodes, and all of the objects are stored at the leaf level.

Figure 8. Figure 9.

Warning! I found from experience that most persons who don't first deeply understand why B+trees are better than B-trees won't later understand explanations of the advantages of putting extents on the twig level rather than using BLOBs. The same principles that make B+trees better than B-trees also make Reiser4 faster than using BLOBs like most databases do. So make sure you fully digest this section before moving on to the next one, ok? ;-)

=== B+Trees Have Higher Fanout Than B-Trees ===
Fanout is increased when we put only pointers and keys in internal nodes, and don't dilute them with object data. Increased fanout increases our ability to cache all of the internal nodes, because there are fewer internal nodes. Often persons respond to this by saying, "but B-trees cache objects, and caching objects is just as valuable". The answer is that, on average, it is not. Of course, discussing averages makes the discussion much harder. We need to discuss some cache design principles for a while before we can get to this.

= Cache Design Principles =
== Reiser's Untie The Uncorrelated Principle of Cache Design ==
Tying the caching of things whose usage does not strongly correlate is bad. Suppose:
* you have two sets of things, A and B.
* you need things from those two sets at semi-random, with there existing a tendency for some items to be needed much more frequently than others, but which items those are can shift slowly over time.
* you can keep things around after you use them in a cache of limited size.
* you tie the caching of every thing from A to the caching of another thing from B. (that means, whenever you fetch something from A into the cache, you fetch its partner from B into the cache) Then this increases the amount of cache required to store everything recently accessed from A. If there is a strong correlation between the need for the two particular objects that are tied in each of the pairings, stronger than the gain from spending those cache resources on caching more members of B according to the LRU algorithm, then this might be worthwhile. If there is no such strong correlation, then it is bad. But wait, you might say, you need things from B also, so it is good that some of them were cached. Yes, you need some random subset of B. The problem is that without a correlation existing, the things from B that you need are not especially likely to be those same things from B that were tied to the things from A that were needed. This tendency to inefficiently tie things that are randomly needed exists outside the computer industry. For instance, suppose you like both popcorn and sushi, with your need for them on a particular day being random. Suppose that you like movies randomly. Suppose a theater requires you to eat only popcorn while watching the movie you randomly found optimal to watch, and not eat sushi from the restaurant on the corner while watching that movie. Is this a socially optimum system? Suppose quality is randomly distributed across all the hot dog vendors: if you can only eat the hot dog produced by the best movie displayer on a particular night that you want to watch a movie, and you aren't allowed to bring in hot dogs from outside the movie theater, is it a socially optimum system? Optimal for you? Tying the uncorrelated is a very common error in designing caches, but it is still not enough to describe why B+Trees are better. With internal nodes, we store more than one pointer per node. That means that pointers are not separately cached. 
You could well argue that pointers and the objects they point to are more strongly correlated than the different pointers are with each other. We need another cache design principle.

== Reiser's Maximize The Variance Principle of Cache Design ==
If two types of things that are cached and accessed, in units that are aggregates, have different average temperatures, then segregating the two types into separate units helps caching. For balanced trees, these units of aggregation are nodes. This principle applies to the situation where it may be necessary to tie things into larger units for efficient access, and guides what things should be tied together. Suppose you have R bytes of RAM for cache, and D bytes of disk. Suppose that 80% of accesses are to the most recently used things, which are stored in H (hotset) bytes of nodes. Reducing the size of H to where it is smaller than R is very important to performance. If you evenly disperse your frequently accessed data, then a larger cache is required and caching is less effective.
# If, all else being equal, we increase the variation in temperature among all aggregates (nodes), then we increase the effectiveness of using a fast small cache.
# If two types of things have different average temperatures (ratios of likelihood of access to size in bytes), then separating them into separate aggregates (nodes) increases the variation in temperature in the system as a whole.
# Conclusion: If all else is equal, and two types of things cached several to an aggregate (node) have different average temperatures, then segregating them into separate nodes helps caching.

== Pointers To Nodes Have A Higher Average Temperature Than The Nodes They Point To ==
Pointers to nodes tend to be frequently accessed relative to the number of bytes required to cache them. Consider that you have to use the pointers for all tree traversals that reach the nodes beneath them, and they are smaller than the nodes they point to.
Putting only node pointers and delimiting keys into internal nodes concentrates the pointers. Since pointers tend to be more frequently accessed per byte of their size than items storing file bodies, a large difference in average temperature exists between pointers and object data. According to the caching principles described above, segregating these two types of things with different average temperatures, pointers and object data, increases the efficiency of caching.

== Segregating By Temperature Directly ==

Now you might say, well, why not segregate by actual temperature instead of by type, which only correlates with temperature? We do what we can easily and effectively code, with more than just temperature segregation in consideration. There are tree designs which rearrange the tree so that objects with a higher temperature are higher in the tree than pointers with a lower temperature. The difference in average temperature between object data and pointers to nodes is so high that I don't find such designs a compelling optimization, and they add complexity. I could be wrong. If one had no compelling semantic basis for aggregating objects near each other (this is true for some applications), and if one wanted to access objects by nodes rather than individually, it would be interesting to have a node repacker sort object data into nodes by temperature. You would need to have the repacker change the keys of the objects it sorts. Perhaps someone will have us implement that for some application someday for Reiser4.

== BLOBs Unbalance the Tree, Reduce Segregation of Pointers and Data, and Thereby Reduce Performance ==

BLOBs, Binary Large OBjects, are a method of storing objects larger than a node by storing pointers to the nodes containing the object. These pointers are commonly stored in what are called the leaf nodes (level 1, except that the BLOBs are then sort of a basement "level B" :-\ ) of a "B*" tree.
This is a tree that was four levels tall until a BLOB was inserted with a pointer from a leaf node. In this case the BLOB's blocks are all contiguous.

Figure 10. A Binary Large OBject (BLOB) has been inserted with, in a leaf node, pointers to its blocks. This is what a ReiserFS V3 tree looks like.

BLOBs are a significant unintentional definitional drift, albeit one accepted by the entire database community. This placement of pointers into nodes containing data is a performance problem for ReiserFS V3, which uses BLOBs. (Never accept a "let's just try it my way and see, and we can change it if it doesn't work" argument. It took years and a disk format change to get BLOBs out of ReiserFS, and performance suffered the whole time, if tails were turned on.) Because the pointers to BLOBs are diluted by data, caching all pointers to all nodes in RAM becomes infeasible for typical file sets. Reiser4 returns to the classical definition of a height balanced tree, in which the lengths of the paths to all leaf nodes are equal. It does not try to pretend that all of the nodes storing objects larger than a node are somehow not part of the tree even though the tree stores pointers to them. As a result, the amount of RAM required to store pointers to nodes is dramatically reduced. For typical configurations, RAM is large enough to hold all of the internal nodes.

This is a Reiser4 tree with extents in the level 1 leaf nodes and the pointers to them in the level 2 twig nodes. In this case the BLOB's blocks are all contiguous.

Figure 11. A Reiser4, 4 level, height balanced tree with fanout = 3, and the data that was stored in BLOBs now stored in extents in the level 1 leaf nodes and pointed to by extent pointers stored in the level 2 twig nodes.

Gray and Reuter say the criterion for searching external memory is to "minimize the number of different pages along the average (or longest) search path.
... by reducing the number of different pages for an arbitrary search path, the probability of having to read a block from disk is reduced." (Transaction Processing: Concepts and Techniques, Morgan Kaufmann Publishers, San Francisco, CA, 1993, p. 834) My problem with this explanation of why the height balanced approach is effective is that it does not convey that you can get away with having a moderately unbalanced tree, provided that you do not significantly increase the total number of internal nodes. In practice, most trees that are unbalanced do have significantly more internal nodes. In practice, most moderately unbalanced trees have a moderate increase in the cost of in-memory tree traversals, and an immoderate increase in the amount of IO due to the increased number of internal nodes. But if one were to put all the BLOBs together in the same location in the tree, then since the number of internal nodes would not significantly increase, the performance penalty for having them on a lower level of the tree than all other leaf nodes would not be a significant additional IO cost. There would be a moderate increase in that part of the tree traversal time cost which is dependent on RAM speed, but this would not be so critical. Segregating BLOBs could perhaps substantially recover the performance lost by architects not noticing the drift in the definition of height balancing for trees. It might be undesirable to segregate objects by their size rather than just their semantics, though. Perhaps someday someone will try it and see what results.

== Dancing Trees Are Faster Than Balanced Trees ==

Balanced trees have traditionally employed a fixed criterion for determining whether nodes should be squeezed together into fewer nodes so as to save space. This criterion is traditionally satisfied at the end of every modification to the tree.
A typical such criterion is to guarantee that, after each modification to the tree, the modified node cannot be squeezed together with its left and right neighbors into two or fewer nodes. ReiserFS V3 uses that criterion for its leaf nodes. The more neighboring nodes you consider for squeezing into one fewer node, the more memory bandwidth you consume on average per modification to the tree, and the more likely you are to need to read those nodes because they are not in memory. It is a typical pattern in memory management algorithm design that the more tightly packed memory is kept, the more overhead is added to the cost of changing what is stored where in it. This overhead can be significant enough that some commercial databases actually only delete nodes when they are completely empty, and they feel that in practice this works well. Trees that adhere to fixed space usage balancing criteria can have many things rigorously proven about their worst case performance in publishable papers. This is different from their being optimal. An algorithm can have worse bounds on its theoretical worst case performance and be a better algorithm. Just because one cannot rigorously define average usage patterns does not mean they are the slightest bit less important. Sorry, mere mortal mathematicians, that is life. Maybe some might prefer to think about the questions that they can define and answer rigorously, but this does not in the slightest make them the right questions. Yes, I am a chaotic.... In Reiser4 we employ not balanced trees, but dancing trees. Dancing trees merge insufficiently full nodes, not with every modification to the tree, but instead:
* in response to memory pressure triggering a flush to disk,
* as a result of a transaction closure flushing nodes to disk.

== If It Is In RAM, Dirty, and Contiguous, Then Squeeze It ALL Together Just Before Writing ==

Let a slum be defined as a sequence of nodes that are contiguous in the tree order and dirty in this transaction.
(In simpler words: a bunch of dirty nodes that are right next to each other.) A dancing tree responds to memory pressure by squeezing and flushing slums. It is possible that merely squeezing a slum might free up enough space that flushing is unnecessary, but the current implementation of Reiser4 always flushes the slums it squeezes. This is not necessarily the right approach, but we found it simpler and good enough for now. Another simplification we choose to engage in for now is that, instead of trying to estimate whether squeezing a slum will save space before squeezing it, we just squeeze it and see. Balanced trees have an inherent tradeoff between balancing cost and space efficiency. If they consider more neighboring nodes, for the purpose of merging them to save a node, with every change to the tree, then they can pack the tree more tightly, at the cost of moving more data with every change to the tree. By contrast, with a dancing tree, you simply take a large slum, shove everything in it as far to the left as it will go, and then free all the nodes in the slum that are left with nothing remaining in them, at the time of committing the slum's contents to disk in response to memory pressure. This gives you extreme space efficiency when slums are large, at a cost in data movement that is lower than it would be with an invariant balancing criterion, because it is done less often. By compressing at the time one flushes to disk, one compresses less often, and that means one can afford to do it more thoroughly. By compressing dirty nodes that are in memory, one avoids performing additional I/O as a result of balancing.

== Procrastination Leads To Wiser Decisions: Allocate on Flush ==

ReiserFS V3 assigns block numbers to nodes as it creates them. XFS is smarter: it waits until the last moment, just before writing nodes to disk. I'd like to thank the XFS team for making an effort to ensure that I understood the merits of their approach.
The easy way to see its merits is to consider a file that is deleted before it reaches disk. Such a file should have no effect on the disk layout.

= Reiser4: The Atomic Filesystem =

== Reducing The Damage of Crashing ==

When a computer crashes, data in RAM which has not reached disk is lost. You might at first be tempted to think that we want to keep all of the data that did reach disk. Suppose that you were performing a transfer of $10 from bank account A to bank account B, and this consisted of two operations: 1) debit $10 from A, and 2) credit $10 to B. Suppose that 1) but not 2) reached disk and the computer crashed. It would be better to disregard 1) than to let 1) but not 2) take effect, yes? When there is a set of operations which we ensure will all take effect, or none take effect, we call the set as a whole an atom. Reiser4 implements all of its filesystem system calls (requests to the kernel to do something are called system calls) as fully atomic operations, and allows one to define new atomic operations using its plugin infrastructure. Why don't all filesystems do this? Performance. Reiser4 employs new algorithms that allow it to make these operations atomic at little additional cost, where other filesystems have paid a heavy, usually prohibitive, price to do that. We hope to share with you how that is done.

= A Brief History Of How Filesystems Have Handled Crashes =

== Filesystem Checkers ==

Originally filesystems had filesystem checkers that would run after every crash. The problems with that were that 1) the checkers cannot handle every form of damage well, and 2) the checkers run for a long time.
The amount of data stored on hard drives has increased faster than the transfer rate (the rate at which hard drives move data from the platter spinning inside them into the computer's RAM when asked to do one large continuous read, or in the other direction for writes). This means that the checkers took longer and longer to run, and as the decades ticked by it became less and less reasonable for a mission critical server to wait for the checker.

== Fixed Location Journaling ==

A solution was adopted of first writing each atomic operation to a location on disk called the journal or log, and then, only after each atom had fully reached the journal, writing it to the committed area of the filesystem. The problem with this is that twice as much data needs to be written. On the one hand, if the workload is dominated by seeks, this is not as much of a burden as one might think. On the other hand, for writes of large files it halves performance, because such writes are usually transfer time dominated. For this reason, meta-data journaling came to dominate general purpose usage. With meta-data journaling, the filesystem guarantees that all of its operations on its meta-data will be done atomically. If a file is being written to, the data being written may be corrupted as a result of non-atomic data operations, but the filesystem's internals will all be consistent. The performance advantage was substantial. V3 of ReiserFS offers both meta-data and data journaling, and defaults to meta-data journaling because that is the right solution for most users. Oddly enough, meta-data journaling is much more work to implement, because it requires being precise about what needs to be journaled. As is so often the case in programming, doing less work requires more code.
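The cost of writing twice is easy to quantify for the transfer-dominated case. The numbers below are illustrative assumptions, not measurements of any particular filesystem:

```python
# Sketch: why full data journaling roughly halves large-write throughput,
# while metadata journaling barely dents it. All numbers are assumptions
# chosen for illustration, not benchmarks.

DISK_MBPS = 100.0          # sustained transfer rate (assumed)

def effective_throughput(data_mb, journaled_mb):
    """MB/s seen by the application when `journaled_mb` of the
    `data_mb` written must be written twice (journal + commit)."""
    total_io = data_mb + journaled_mb
    return data_mb / (total_io / DISK_MBPS)

# Data journaling: everything is written twice -> throughput halves.
full = effective_throughput(1000, 1000)
# Metadata journaling: say only 1% of the IO is metadata -> ~1% penalty.
meta = effective_throughput(1000, 10)
print(full, meta)
```

This is the transfer-dominated regime the text describes; for seek-dominated workloads the extra sequential journal write costs far less than this model suggests.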
With fixed location data journaling, the overhead of making each operation atomic is too high for it to be appropriate for average applications that don't especially need it --- because of the cost of writing twice. Applications that do need atomicity are written to use fsync and rename to accomplish it, and these tools are simply terrible for that job: terrible in performance, and terrible in the ugliness they add to the coding of applications. Stuffing a transaction into a single file just because you need the transaction to be atomic is hardly what one would call flexible semantics. Also, data journaling, with all its performance cost, still does not necessarily guarantee that every system call is fully atomic, much less that one can construct sets of operations that are fully atomic. It usually merely guarantees that the files will not contain random garbage, however many blocks of them happen to get written, and however much the application might view the result as inconsistent data. I hope you understand that, when we provide these atomicity guarantees, we are trying to set a new expectation for how secure a filesystem should keep your data.

== Wandering Logs ==

One way to avoid having to write the data twice is to change one's definition of where the log area and the committed area are, instead of moving the data from the log to the committed area. There is an annoying complication to this, though, in that there are probably a number of pointers to the data from the rest of the filesystem, and we need them to point to the new data. When the commit occurs, we need to write those pointers so that they point to the data we are committing. Fortunately, these pointers tend to be highly concentrated as a result of our tree design. But wait: if we are going to update those pointers, then we want to commit those pointers atomically also, which we could do if we write them to another location and update the pointers to them, and....
up the tree the changes ripple. When we get to the top of the tree, since disk drives write sectors atomically, the block number of the top can be written atomically into the superblock by the disk, thereby committing everything the new top points to. This is indeed the way WAFL, the Write Anywhere File Layout filesystem invented by Dave Hitz at Network Appliance, works. It always ripples changes all the way to the top, and indeed that works rather well in practice, and most of their users are quite happy with its performance.

== Writing Twice May Be Optimal Sometimes ==

Suppose that a file is currently well laid out, you write to a single block in the middle of it, and you then expect to do many reads of the file. That is an extreme case illustrating that sometimes it is worth writing twice so that a block can keep its current location while committing atomically. If one writes a node twice in this way, one also does not need to update its parent and ripple all the way to the top of the tree. Our code is a toolkit that can be used to implement different layout policies, and one of the available choices is whether to write over a block in its current place, or to relocate it to somewhere else. I don't think there is one right answer for all usage patterns. If a block is adjacent to many other dirty blocks in the tree, then this decreases the significance of the cost to read performance of relocating it and its neighbors. If one knows that a repacker will run once a week (a repacker is expected for V4.1, and is (a bit oddly) absent from WAFL), this also decreases the cost of relocation. After a few years of experimentation, measurement, and user feedback, we will say more about our experiences in constructing user selectable policies. Do we pay a performance penalty for making Reiser4 atomic? Yes, we do. Is it an acceptable penalty?
We picked up a lot more performance from other improvements in Reiser4 than we lost to atomicity, and so it is not isolated in our measurements, but I am unscientifically confident that the answer is yes. If changes are either large or batched together with enough other changes to become large, the performance penalty is low and drowned out by other performance improvements. Scattered small changes threaten us with read performance losses compared to overwriting in place and taking our chances with the data's consistency if there is a crash, but use of a repacker will mostly alleviate this scenario. I have to say that in my heart I don't have any serious doubts that for the general purpose user the increase in data security is worthwhile. The users, though, will have the final say.

== Committing ==

A transaction preserves the previous contents of all modified blocks in their original locations on disk until the transaction commits; commit means that the transaction has reached a state where it will be completed even if there is a crash. The dirty blocks of an atom (which were captured and subsequently modified) are divided into two sets, relocate and overwrite, each of which is preserved in a different manner. The relocatable set is the set of blocks that have a dirty parent in the atom. The relocate set is those members of the relocatable set that will be written to a new or first location rather than overwritten. The overwrite set contains all dirty blocks in the atom that need to be written to their original locations, which is all those not in the relocate set. In practice this is those which do not have a parent we want to dirty, plus those for which overwrite is the better layout policy despite the write twice cost. Note that the superblock is the parent of the root node, and the free space bitmap blocks have no parent. By these definitions, the superblock and modified bitmap blocks are always part of the overwrite set.
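The classification just described can be restated as a short sketch. The `Block` structure and the `prefers_overwrite` policy flag are invented here for illustration; they are not Reiser4's actual data structures:

```python
# Sketch of partitioning an atom's dirty blocks into the relocate and
# overwrite sets, following the definitions in the text. The Block class
# and the prefers_overwrite policy flag are hypothetical.

class Block:
    def __init__(self, name, parent=None, dirty=False,
                 prefers_overwrite=False):
        self.name = name
        self.parent = parent          # None models "no parent" (e.g. bitmaps)
        self.dirty = dirty
        self.prefers_overwrite = prefers_overwrite

def partition(dirty_blocks):
    relocate, overwrite = set(), set()
    for b in dirty_blocks:
        no_dirty_parent = b.parent is None or not b.parent.dirty
        if no_dirty_parent or b.prefers_overwrite:
            overwrite.add(b.name)     # written to a wandered location first,
        else:                         # then back in place when played
            relocate.add(b.name)      # written once, straight to a new place
    return relocate, overwrite

internal = Block("internal")                       # clean, not in the atom
twig = Block("twig", parent=internal, dirty=True)  # parent clean -> overwrite
leaf = Block("leaf", parent=twig, dirty=True)      # parent dirty -> relocate
bitmap = Block("bitmap", dirty=True)               # parentless   -> overwrite
relocate, overwrite = partition([twig, leaf, bitmap])
print(relocate, overwrite)
```

The sketch only covers classification; the wandered mapping and its replay, described next, are a separate step.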
The wandered set is the set of blocks that the overwrite set will be written to temporarily, until the overwrite set commits. An interesting definition is the minimum overwrite set, which uses the same definitions as above with the following modification: if at least two dirty blocks have a common parent that is clean, then that parent is added to the minimum overwrite set, and its dirty children are removed from the overwrite set and placed in the relocate set. This policy is an example of what will be experimented with in later versions of Reiser4 using the layout toolkit. For space reasons, we leave out the full details on exactly when we relocate vs. overwrite, and the reader should not regret this, because years of experimenting are probably ahead before we can speak with the authority necessary for a published paper on the effects of the many details and variations possible. When we commit, we write a wander list, which consists of a mapping of the wandered set to the overwrite set. The wander list is a linked list of blocks containing pairs of block numbers. The last act of committing a transaction is to update the super block to point to the front of that list. Once that is done, if there is a crash, the crash recovery will go through that list and "play" it, which means to write the wandered set over the overwrite set. If there is not a crash, we will also play it. There are many more details of how we handle the deallocation of wandered blocks, the handling of bitmap blocks, and so forth. You are encouraged to read the comments at the top of our source code files (e.g. wander.c) for such details....

= Journalling Optimizations =

== Copy-on-capture ==

Suppose one wants to capture a node which belongs to an atom with stage >= ASTAGE_PRE_COMMIT. This capture request would normally have to wait (sleep in capture_fuse_wait()) while the atom is committed. The copy-on-capture optimization allows the capture request to be satisfied by creating a copy of the node being captured.
The commit process takes control of one copy of the node, and the capturing process takes control of the other copy. This does not lead to any conflicts between node versions, because it is guaranteed that the copy belonging to the commit process will not be modified.

== Steal-on-capture ==

The idea of the steal-on-capture optimization is that only the last committed transaction to modify an overwrite block actually needs to write that block. Other transactions can skip writing that block after they commit. This optimization, which is also present in ReiserFS version 3, means that frequently modified overwrite blocks will be written less than twice per transaction. With this optimization a frequently modified overwrite block may avoid being overwritten by a series of atoms; as a result, crash recovery must replay more atoms than without the optimization. If an atom has overwrite blocks stolen, the atom must be replayed during crash recovery until every stealing atom commits.

= Repacker =

Another way of escaping from the balancing time vs. space efficiency tradeoff is to use a repacker. 80% of files on the disk remain unchanged for long periods of time. It is efficient to pack them perfectly, by using a repacker that runs much less often than every write to disk. This repacker goes through the entire tree ordering, from left to right and then from right to left, alternating each time it runs. When it goes from left to right in the tree ordering, it shoves everything as far to the left as it will go, and when it goes from right to left it shoves everything as far to the right as it will go. (Left means smaller in key or in block number. :-) ) In the absence of FS activity, the effect of this over time is to sort by tree order (defragment), and to pack with perfect efficiency. Reiser4.1 will modify the repacker to insert controlled "air holes", as it is well known that insertion efficiency is harmed by overly tight packing.
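On a toy model where each node is just a byte count, one shove pass looks like this. This is purely illustrative: the real repacker moves items and rewrites keys, and the capacity here is an assumption:

```python
# Toy model of the repacker's (and slum squeeze's) shove pass: nodes are
# buckets holding some number of bytes out of CAPACITY. A left-to-right
# pass shoves contents as far left as they will go and frees the nodes
# that end up empty. Illustrative only; real repacking moves items.

CAPACITY = 4096

def repack(fills, left_to_right=True):
    """Return the new list of node fills after one shove pass."""
    items = fills if left_to_right else list(reversed(fills))
    packed = []
    for fill in items:
        if packed and packed[-1] + fill <= CAPACITY:
            packed[-1] += fill          # merge into the open node
        else:
            packed.append(fill)         # start a new node
    return packed if left_to_right else list(reversed(packed))

slum = [1000, 500, 3000, 200, 100]      # five partly-filled nodes
print(repack(slum))                     # five nodes squeeze into two
```

Alternating the direction on successive runs, as the text describes, keeps the packing from always piling up against the same end of the tree ordering.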
I hypothesize that it is more efficient to periodically run a repacker that systematically repacks using large IOs, than to perform lots of 1 block reads of neighboring nodes of the modification points so as to preserve a balancing invariant in the face of poorly localized modifications to the tree.

= Plugins =

Eight kinds of plugins make Reiser4 the most tweakable filesystem going.

== File Plugins ==

Every file possesses a plugin id, and every directory possesses a plugin id. This plugin id identifies a set of methods. The set of methods embodies all of the different possible interactions with the file or directory that come from sources external to ReiserFS. It is a layer of indirection added between the external interface to ReiserFS and the rest of ReiserFS. Each method has a method id. It will be usual to mix and match methods from other plugins when composing plugins.

== Directory Plugins ==

Reiser4 will implement a plugin for traditional directories. It will implement directory style access to file attributes as part of the plugin for regular files. Later we will describe why this is useful. Other directory plugins we will leave for later versions. There is no deep reason for this deferral. It is simply the randomness of what features attract sponsors and make it into a release specification; there are no sponsors at the moment for additional directory plugins. I have no doubt that they will appear later; new directory plugins will be too much fun to miss out on. :-)

== Hash Plugins ==

A directory is a mapping from file names to files. This mapping is implemented through the Reiser4 internal balanced tree. Unfortunately, file names cannot be used as keys until keys of variable length are implemented, unless unreasonable limitations on maximal file name length are imposed. To work around this, the file name is hashed, and the hash is used as a key in the tree. No hash function is perfect, and there will always be hash collisions, that is, file names having the same hash value.
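A deliberately tiny toy hash makes such collisions easy to demonstrate. This is not one of the real ReiserFS hash functions (r5, tea, ...), which are wider but face the same pigeonhole problem:

```python
# Toy illustration of why hashed directory keys collide: a 16-bit hash
# is guaranteed to collide within 2**16 + 1 distinct names. The hash
# below is invented for this sketch; it is not a ReiserFS hash.

def tiny_hash(name, bits=16):
    h = 0
    for ch in name.encode():
        h = (h * 31 + ch) & ((1 << bits) - 1)
    return h

# Hash generated names until two distinct names share a key.
seen = {}
collision = None
for i in range(100_000):
    name = f"file{i}"
    key = tiny_hash(name)
    if key in seen:
        collision = (seen[key], name)
        break
    seen[key] = name

print(collision)   # two distinct names forced onto the same key
```

With only 2^16 possible keys the loop is certain to find a collision; a 64-bit hash merely makes collisions rare, not impossible, which is why some disambiguation scheme is needed.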
Previous versions of reiserfs (3.5 and 3.6) used a "generation counter" to overcome this problem: keys for file names having the same hash value were distinguished by having different generation counters. This allowed hash collisions to be amortized, at the cost of reducing the number of bits used for hashing. This "generation counter" technique is actually an ad hoc form of support for non-unique keys. Keeping in mind that some form of this has to be implemented anyway, it seemed justifiable to implement more regular support for non-unique keys in Reiser4. Another reason for using hashes is that some (arguably brain-dead) interfaces require them: telldir(3) and seekdir(3). These functions presume that a file system can issue 64 bit "cookies" that can be used to resume a readdir. Cookies are implemented in most filesystems as byte offsets within a directory (which means those filesystems cannot shrink directories), and in ReiserFS as hashes of file names plus a generation counter. Curiously enough, the Single UNIX Specification tags telldir(3) and seekdir(3) as an "Extension", because "returning to a given point in a directory is quite difficult to describe formally, in spite of its intuitive appeal, when systems that use B-trees, hashing functions, or other similar mechanisms to order their directories are considered". We order directory entries in ReiserFS by their cookies. This costs us performance compared to ordering lexicographically. (But it is immensely faster than the linear searching employed by most other Unix filesystems.) Depending on the hash and its match to the application usage pattern, there may be more or less performance lossage. Hash plugins will probably remain until version 5 or so, when directory plugins and ordering function plugins will obsolete them. Directory entries will then be ordered by file names, like they should be (and possibly stem compressed as well).

== Security Plugins ==

Security plugins handle all security checks.
They are normally invoked by file and directory plugins. An example of reading a file:
* Access the plugin id for the file.
* Invoke the read method for the plugin.
* The read method determines the security plugin for the file.
* That security plugin invokes its read check method to determine whether to permit the read.
* The read check method for the security plugin reads the file/attributes containing the permissions on the file.
* Since file/attributes are also files, this means invoking the plugin for reading the file/attribute.
* The plugin id for this particular file/attribute for this file happens to be inherited (saving space and centralizing control of it).
* The read method for the file/attribute is coded such that it does not check permissions when called by a security plugin method. (Endless recursion is thereby avoided.)
* The file/attribute plugin employs a decompression algorithm specially designed for efficient decompression of our encoding of ACLs.
* The security plugin determines that the read should be permitted.
* The read method continues and completes.

== Item Plugins ==

The balancing code will be able to balance an item iff it has an item plugin implemented for it. The item plugin will implement each of the methods the balancing code needs (methods such as splitting items, estimating how large the split pieces will be, overwriting, appending to, cutting from, or inserting into the item, etc.). In addition to all of the balancing operations, item plugins will also implement intra-item search plugins. V3 of ReiserFS understood the structure of the items it balanced. This made adding new types of items, such as the new security attributes other researchers might develop, too expensive in coding time, greatly inhibiting their addition to ReiserFS.
In writing Reiser4, we hoped that there would be a great proliferation in the types of security attributes in ReiserFS if we made adding one a matter not of modifying the balancing code (a job for our most experienced programmers), but of writing an item handler. This is necessary if we are to achieve our goal of making the addition of each new security attribute an order of magnitude or more easier to perform than it is now.

== Key Assignment Plugins ==

When assigning the key to an item, the key assignment plugin is invoked, and it has a key assignment method for each item type. A single key assignment plugin is defined for the whole FS at FS creation time. We know from experience that there is no "correct" key assignment policy; Squid has very different needs from average user home directories. Yes, there could be value in varying it more flexibly than just at FS creation time, but we have to draw the line somewhere when deciding what goes into each release....

== Node Search and Item Search Plugins ==

Every node layout has a search method for that layout, and every item that is searched through has a search method for that item. (When doing searches, we search through a node to find an item, and then, for items that contain multiple things, search within the item to find them.)

== Putting Your New Plugin To Work Will Mean Recompiling ==

If you want to add a new plugin, we think your having to ask the sysadmin to recompile the kernel with your new plugin added to it will be acceptable for version 4.0. We will initially code plugin id lookup as an in-kernel fixed length array lookup, code method ids as function pointers, and make no provision for post-compilation loading of plugins. Performance and coding cost motivate this.
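In user-space miniature, that fixed-array dispatch might look like the sketch below. The plugin ids, method names, and behaviors are all invented for illustration; the in-kernel version would be C arrays of function pointers:

```python
# Sketch of plugin-id lookup as a fixed-length array of method tables,
# mirroring the dispatch scheme described above (function pointers in C;
# plain callables here). All plugin ids and methods are hypothetical.

def unix_file_read(f):
    return f"read {f['name']} as a plain file"

def dir_file_read(f):
    return f"read {f['name']} as a directory listing"

# Fixed-length array indexed by plugin id: no hashing, no runtime loading.
PLUGIN_TABLE = [
    {"read": unix_file_read},   # plugin id 0: regular file
    {"read": dir_file_read},    # plugin id 1: directory
]

def vfs_read(f):
    """Dispatch through the file's plugin id, as the text describes."""
    return PLUGIN_TABLE[f["plugin_id"]]["read"](f)

print(vfs_read({"name": "a", "plugin_id": 0}))
print(vfs_read({"name": "b", "plugin_id": 1}))
```

Adding a plugin in this scheme means appending a method table to the array and recompiling, which is exactly the limitation the paragraph above accepts for version 4.0.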
== Without Plugins We Will Drown ==

People often ask, as ReiserFS grows in features, how we will keep the design from being drowned under the weight of the added complexity, and from reaching the point where it is difficult to work on the code. The infrastructure to support security attributes implemented as files also enables lots of features not necessarily security related. The plugins we are choosing to implement in v4.0 are all security related because of our funding source, but users will add other sorts of plugins, just as they took DARPA's TCP/IP and used it for non-military computers. Only by requiring that all features be implemented in the manner that maximizes code reuse will we keep ReiserFS coding complexity down to where we can manage it over the long term.

== Plugins: FS Programming For The Lazy ==

Most plugins will have only a very few of their features unique to them, and the rest of the plugin will be reused code. What Namesys sees as its role as a DARPA contractor is not primarily supplying a suite of security plugins, though we are doing that, but creating an architecture (not just a license) that enables lots of outside vendors to efficiently create lots of innovative security plugins that Namesys would never have imagined if working by itself.

= Enhancing Security =

By far the most casualties in wars have always been civilian. In future information infrastructure attacks, who will take more damage, civilian or military installations? DARPA is funding us to make all Gnu/Linux computers throughout the world a little bit more resistant to attack.

== Fine Graining Security ==

=== Good Security Requires Precision In Specification Of Security ===

Suppose you have a large file with many components. A general principle of security is that good security requires precision of permissions.
When security lacks precision, it increases the burden of being secure; the extent to which users adhere to security requirements in practice is a function of the burden of adhering to them.

== Space Efficiency Concerns Motivate Imprecise Security ==

Many filesystems make it space-inefficient to store small components as separate files, for various reasons. Components that are not separate files cannot have separate permissions, so space efficiency is one of the reasons for using overly aggregated units of security. ReiserFS currently improves on this by an order of magnitude over most of the existing alternative art. Space efficiency is the hardest of the reasons to eliminate; its elimination makes it that much more enticing to attempt to eliminate the other reasons.

== Security Definition Units And Data Access Patterns Sometimes Inherently Don't Align ==

Applications sometimes want to operate on a collection of components as a single aggregated stream. (Note that commonly two different applications want to operate on data at different levels of aggregation; the infrastructure for solving this security issue will solve that problem as well.)

== /etc/passwd As Example ==

I am going to use the /etc/passwd file as an example, not because I think that other solutions won't solve its problems better, but because its implementation as a single flat file in the early Unixes is a wonderful illustrative example of poorly granularized security, one the readers may share my personal experiences with. I hope they will be able to imagine that other, less famous data files could have similar problems. Have you ever tried to figure out just exactly what part of your continually changing /etc/passwd file changed near the time of a break-in? Have you ever wished that you could have a modification time on each field in it?
Have you ever wished that users could change part of it, such as the gecos field, themselves (setuid utilities have been written to allow this, but this is a pedagogical, not a practical, example), yet not have the power to change it for other users? There were good reasons why /etc/passwd was first implemented as a single file with one single permission governing the entire file. If we can eliminate them one by one, the same techniques for making finer grained security effective will be of value to other highly secure data files.

== Aggregating Files Can Improve The User Interface To Them ==

Consider the use of emacs on a collection of a thousand small 8-32 byte files, like you might have if you deconstructed /etc/passwd into small files with separable ACLs for every field. It is more convenient in screen real estate, buffer management, and other user interface considerations to operate on them as an aggregation all placed into a single buffer rather than as a thousand 8-32 byte buffers.

== How Do We Write Modifications To An Aggregation ==

Suppose we create a plugin that aggregates all of the files in a directory into a single stream. How does one handle writes to that aggregation that change the length of its components? Richard Stallman pointed out to me that if we separate the aggregated files with delimiters, then emacs need not be changed at all to acquire an effective interface for large numbers of small files accessed via an aggregation plugin. If /new_syntax_access_path/big_directory_of_small_files/.glued is a plugin that aggregates every file in big_directory_of_small_files with a delimiter separating every file within the aggregation, then one can simply type emacs /new_syntax_access_path/big_directory_of_small_files/.glued, and the filesystem has done all the work emacs needs to be effective at this. Not a line of emacs needs to be changed.
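The mechanism this relies on can be modeled in a few lines: gluing member files into one stream with a delimiter, and splitting a written-back stream on that delimiter to recover the members, each of which may have grown or shrunk. This is an illustrative user-space model, not the actual invert-plugin code, and it assumes the delimiter never occurs inside a member:

```c
#include <string.h>

/* Glue n member strings into out, separated by delim.
 * Models reading the ".glued" aggregation file. */
static void glue(char *out, size_t outsz, const char **members, int n,
                 char delim)
{
    size_t pos = 0;
    for (int i = 0; i < n && pos + 1 < outsz; i++) {
        size_t len = strlen(members[i]);
        if (i > 0)
            out[pos++] = delim;
        if (pos + len >= outsz)          /* truncate rather than overflow */
            len = outsz - pos - 1;
        memcpy(out + pos, members[i], len);
        pos += len;
    }
    out[pos] = '\0';
}

/* Split buf in place on delim, filling members[]; returns the count.
 * Models a write to the aggregation: each piece would be written back
 * to its own member file, whose length may change. */
static int split(char *buf, char delim, char **members, int max)
{
    int n = 0;
    char *p = buf;
    while (n < max) {
        members[n++] = p;
        char *d = strchr(p, delim);
        if (!d)
            break;
        *d = '\0';
        p = d + 1;
    }
    return n;
}
```

An editor that understands nothing about the aggregation sees one flat buffer; the split on write-back is what lets member files change length independently.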
One needs to be able to choose different delimiting syntax for different aggregation plugins so that one can, for say the passwd file, aggregate subdirectories into lines, and files within those subdirectories into colon-separated fields within the line. XML would benefit from yet other delimiter construction rules. (We have been told by Philipp Guehring of LivingXML.NET that ReiserFS is higher performance than any database for storing XML, so this issue is not purely theoretical.)

= Aggregation Is Best Implemented As Inheritance =

In summary, to be able to achieve precision in security we need to have inheritance with specifiable delimiters, and we need whole-file inheritance to support ACLs.

== One Plugin Using Delimiters That Resemble sys_reiser4() Syntax ==

We provide the infrastructure for constructing plugins that implement arbitrary processing of writes to inheriting files, but we also supply one generic inheriting-file plugin that intentionally uses delimiters very close to the sys_reiser4() syntax. We will document the syntax more fully when that code is working; for now, syntax details are in the comments in the file invert.c in the source code.

= API Suitable For Accessing Files That Store Security Attributes =

A new system call sys_reiser4() will be implemented to support applications that don't have to be fooled into thinking that they are using POSIX. Through this entry point a richer set of semantics will access the same files that are also accessible using POSIX calls. Reiser4() will not implement more than hierarchical names. A full set theoretic naming system as described on our future vision page will not be implemented before SSN Reiserfs is implemented (Distributed Reiserfs is our distributed filesystem, Semi-Structured Naming Reiserfs is our enhanced semantics; whether we implement Distributed Reiserfs or SSN Reiserfs first depends on which sponsors we find ;-) ).
Reiser4() will implement all features necessary to access ACLs as files/directories rather than as something neither file nor directory. These include opening and closing transactions, performing a sequence of I/Os in one system call, and accessing files without use of file descriptors (necessary for efficient small I/O). Reiser4() will use a syntax suitable for evolving into SSN Reiserfs syntax with its set theoretic naming.

== Flaws In Traditional File API When Applied To Security Attributes ==

Security related attributes tend to be small. The traditional filesystem API for reading and writing files has these flaws in the context of accessing security attributes:

* Creating a file descriptor is excessive overhead and not useful when accessing an 8 byte attribute.
* A system call for every attribute accessed is too much overhead when accessing lots of little attributes.
* Lacking constraints: it is important to constrain what is written to the attribute, often in complex ways.
* Lacking atomic semantics: often one needs to update multiple attributes as one action that is guaranteed to either fully succeed or fully fail.

== The Usual Resolution Of These Flaws Is A One-Off Solution ==

The usual response to these flaws is that people adding security related and other attributes create a set of methods unique to their attributes, plus non-reusable code to implement those methods, in which their particular attributes are accessed and stored not using the methods for files but using their particular methods for that attribute. Their particular API for that attribute typically does a one-off instantiation of a lightweight, single-system-call, write-constrained, atomic access, with no code being reusable by those who want to modify file bodies. It is basic and crucial to system design to decompose desired functionality into reusable, orthogonal, separated components.
Persons designing security attributes are typically doing it without the filesystem that they want offering them a proper foundation and tool kit. They need more help from us core FS developers. Linus said that we can have a system call to use as our experimental plaything in this. With what I have in mind for the API, one rather flexible system call is all we want for creating atomic, lightweight, batched, constrained accesses to files, with each of those adjectives being an orthogonal optional feature that may or may not be invoked in a particular instance of the new system call.

== One-Off Solutions Are A Lot Of Work To Do A Lot Of ==

Looking at the coin from the other side, we want to make it an order of magnitude less work to add features to ReiserFS, so that both users and Namesys can add at least an order of magnitude more of them. To verify that it is truly more extensible you have to do some extending, and our DARPA funding motivates us to instantiate most of those extensions as new security features. This system call's syntax enables attributes to be implemented as a particular type of file. It avoids uglifying the semantics with two APIs for two supposedly different kinds of objects that don't truly need different treatment. All of its special features that are useful for accessing particular attributes are also available for use on files. It has symmetry, and its features have been fully orthogonalized. There is nothing particularly interesting about this system call to a languages specialist (its ideas were explored decades ago, except by filesystem developers) until SSN Reiserfs, when we will further evolve it into a set theoretic syntax that deconstructs tuple structured names into hierarchy and vicinity set intersection. That is described at www.namesys.com/whitepaper.html

= Steps For Creating A Security Attribute =

You can create a new security attribute by:

* Defining a pluginid.
* Composing a set of methods for the plugin from ones you create or reuse from other existing plugins.
* Defining a set of items that act as the storage containers of the object, or reusing existing items from other plugins (e.g. regular files).
* Implementing item handlers for all of the new items you create.
* Creating a key assignment algorithm for all of the new items.

= reiser4() System Call Description =

The reiser4() system call (still being debugged at the time of writing) executes a sequence of commands separated by commas. Assignment and transaction are the commands supported in Reiser4(); more commands will appear in SSN Reiserfs. <- and <<- are two of the assignment operators.

lhs (assignment target) values:

* /..../process/range/(offset<-(loff_t),last_byte<-(loff_t)) assigns (writes) to the buffer starting at address offset in the process address space, ending at last_byte. (The assignment source may be smaller or larger than the assignment target.) Representation of offset and last_byte is left to the coder to determine; it is an issue that will be of much dispute and little importance. Notice that / is used to indicate that the order of the operands matters; see the future vision whitepaper for details of why this is appropriate syntax design. Note the lack of a file descriptor.
* /filename assigns to the file named filename.
* /filename/..../range/(offset<-(loff_t),last_byte<-(loff_t)) writes to the body, starting at offset, ending not past last_byte.
* /filename/..../range/(offset<-(loff_t)) writes to the body starting at offset.

rhs (assignment source) values:

* /..../process/range/(offset<-(loff_t),last_byte<-(loff_t)) reads from the buffer starting at address offset in the process address space, ending at last_byte. Representation of offset and last_byte is left to the coder to determine, as it is an issue that will be of much dispute and little importance.
* /filename reads the entirety of the file named filename.
* /filename/..../range/(offset<-(loff_t),last_byte<-(loff_t)) reads from the body, starting at offset, ending not past last_byte.
* /filename/..../range/(offset<-(loff_t)) reads from the body starting at offset until the end.
* /filename/..../stat/owner reads from the ownership field of the stat data. (Stat data is that which is returned by the stat() system call (owner, permissions, etc.) and stored on a per file basis by the FS.)

Note that "...." and "process" are style conventions for the name of a hidden subdirectory implementing methods and accessing metadata supported by a plugin. It is possible to rename it, etc. We had a discussion about whether to instead use names that could not clash with any legitimate name likely to be used by users. Vladimir Demidov suggested that cryptic names have historically harmed the acceptance of several languages, and so it was realized that being novice unfriendly in the naming was worse than risking a name collision, especially since a collision could be cured by using rename on "...." and "process" in the few cases where it is necessary.

= Constraints =

(Note: this is not yet coded.) Another way security may be insufficiently fine grained is in values: it can be useful to allow persons to change data, but only within certain constraints. For this project we will implement plugins; one type of plugin will be write constraints. Write-constraints are invoked upon write to a file; if they return non-error, then the write is allowed. We will implement two trivial sample write-constraint plugins. One will be in the form of a kernel function, loadable as a kernel module, which returns non-error (thus allowing the write) if the file consists of the strings "secret" or "sensitive" but not "top-secret". The other, which does exactly the same, will be in the form of a perl program residing in a file and executed in user-space.
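The first sample write-constraint is simple enough to state as a single predicate. A user-space sketch (the in-kernel version would receive the proposed contents from the write path rather than as a string; the function name is illustrative):

```c
#include <string.h>

/* Returns 0 (non-error: allow the write) if the proposed contents
 * contain "secret" or "sensitive" but not "top-secret"; returns -1
 * (deny) otherwise. The "top-secret" test must come first, because
 * strstr() would otherwise match the "secret" suffix inside it. */
static int classification_constraint(const char *contents)
{
    if (strstr(contents, "top-secret"))
        return -1;
    if (strstr(contents, "secret") || strstr(contents, "sensitive"))
        return 0;
    return -1;
}
```

The user-space variant described in the text would wrap the same predicate in a process that reads the proposed contents from standard input and signals allow/deny through its exit status.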
Use of kernel functions will have performance advantages, particularly for small functions, but severe disadvantages in power of scripting, flexibility, and ability to be installed from non-secure sources. Both types of plugins will have their place. Note that ACLs will also embody write constraints. We will implement both constraints that are compiled into the kernel and constraints that are implemented as user space processes. Specifically, we will implement a plugin that executes an arbitrary constraint contained in an arbitrarily named file as a user space process, passes the proposed new file contents to that process as standard input, and iff the process exits without error allows the write to occur. It can be useful to have read constraints as well as write constraints.

= Auditing =

(Note: this is not yet coded.) We will implement a plugin that notifies administrators by email when access is made to files, e.g. read access. With each plugin implemented, creating additional plugins becomes easier as the available toolkit is enriched. Auditing constitutes a major additional security feature, yet it will be easy to implement once the infrastructure to support it exists. (It would be substantial work to implement it without that infrastructure.) The scope of this project is not the creation of plugins themselves, but the creation of the infrastructure that plugin authors would find useful. We want to enable future contributors to implement more secure systems on the Gnu/Linux platform, not implement them ourselves. By laying a proper foundation and creating a toolkit for those who follow us, we hope to reduce the cost of coding new security attributes by an order of magnitude. Employing a proper set of well orthogonalized primitives also changes the addition of these attributes from being a complexity burden upon the architecture into being an empowering extension of the architecture.
= Increasing the Allowed Granularity of Security =

''[Cartoon: a man holding a sieve; only objects of a certain size go through.]''

(This feature is not yet coded.) Inheritance of security attributes is important to providing flexibility in their administration. We have spoken about making security more fine grained, but sometimes it needs to be larger grained. Sometimes a large number of files are logically one unit in regard to their security, and it is desirable to have a single point of control over their security. Inheritance of attributes is the mechanism for implementing that. Security administrators should have the power to choose whatever units of security they desire, without having to distort them to make them correspond to semantic units. Inheritance of file bodies using aggregation plugins allows the units of security to be smaller than files; inheritance of attributes allows them to be larger than files.

= Encryption On Commit =

Currently, encrypted files suffer severely in their write performance when implemented using schemes that encrypt at every write() rather than at every commit to disk. We encrypt on flush, such that a file with an encryption plugin id is encrypted not at the time of write, but at the time of flush to disk. Encryption is implemented as a special form of repacking on flush, and it occurs for any node which has its CONTAINS_ENCRYPTED_DATA state flag set on it.

= Conclusion =

Reiser4 offers a dramatically better infrastructure for creating new filesystem features. Files and directories have all of the features needed to make it unnecessary for file attributes to be something different from files. The effectiveness of this new infrastructure is tested using a variety of new security features. Performance is greatly improved by the use of dancing trees, wandering logs, allocate on flush, a repacker, and encryption on commit. It was an important question whether we could increase the level of abstraction in our design without harming performance.
Reiser4 gives you BOTH the most cleanly abstracted storage AND the highest performance storage of any filesystem.

= Citations =

* [Gray93] Jim Gray and Andreas Reuter. "Transaction Processing: Concepts and Techniques". Morgan Kaufmann Publishers, Inc., 1993. Old but good textbook on transactions. Available at http://www.mkp.com/books_catalog/catalog.asp?ISBN=1-55860-190-2
* [Hitz94] D. Hitz, J. Lau and M. Malcolm. "File system design for an NFS file server appliance". Proceedings of the 1994 USENIX Winter Technical Conference, pp. 235-246, San Francisco, CA, January 1994. Available at http://citeseer.nj.nec.com/hitz95file.html
* [TR3001] D. Hitz. "A Storage Networking Appliance". Tech. Rep. TR3001, Network Appliance, Inc., 1995. Available at http://www.netapp.com/tech_library/3001.html
* [TR3002] D. Hitz, J. Lau and M. Malcolm. "File system design for an NFS file server appliance". Tech. Rep. TR3002, Network Appliance, Inc., 1995. Available at http://www.netapp.com/tech_library/3002.html
* [Ousterh89] J. Ousterhout and F. Douglis. "Beating the I/O Bottleneck: A Case for Log-Structured File Systems". ACM Operating System Reviews, Vol. 23, No. 1, pp. 11-28, January 1989. Available at http://citeseer.nj.nec.com/ousterhout88beating.html
* [Seltzer95] M. Seltzer, K. Smith, H. Balakrishnan, J. Chang, S. McMains and V. Padmanabhan. "File System Logging versus Clustering: A Performance Comparison". Proceedings of the 1995 USENIX Technical Conference, pp. 249-264, New Orleans, LA, January 1995. Available at http://citeseer.nj.nec.com/seltzer95file.html
* [Seltzer95Supp] M. Seltzer. "LFS and FFS Supplementary Information". 1995. http://www.eecs.harvard.edu/~margo/usenix.195/
* [Ousterh93Crit] J. Ousterhout. "A Critique of Seltzer's 1993 USENIX Paper". http://www.eecs.harvard.edu/~margo/usenix.195/ouster_critique1.html
* [Ousterh95Crit] J. Ousterhout. "A Critique of Seltzer's LFS Measurements". http://www.eecs.harvard.edu/~margo/usenix.195/ouster_critique2.html
* [SwD96] A. Sweeney, D. Doucette, W. Hu, C. Anderson, M. Nishimoto and G. Peck. "Scalability in the XFS File System". Proceedings of the 1996 USENIX Technical Conference, pp. 1-14, San Diego, CA, January 1996. Available at http://citeseer.nj.nec.com/sweeney96scalability.html
* [VelskiiLandis] G.M. Adel'son-Vel'skii and E.M. Landis. "An algorithm for the organization of information". Soviet Math. Doklady 3, pp. 1259-1262, 1962. This paper on AVL trees can be thought of as the founding paper of the field of storing data in trees. Those not conversant in Russian will want to read the [Lewis and Denenberg] treatment of AVL trees in its place. [Wood] contains a modern treatment of trees.
* [Apple] "Inside Macintosh: Files". Apple Computer Inc., Addison-Wesley, 1992. Employs balanced trees for filenames; it was an interesting filesystem architecture for its time in a number of ways, but its problems with internal fragmentation have become more severe as disk drives have grown larger. I look forward to the replacement they are working on.
* [Bach] Maurice J. Bach. "The Design of the Unix Operating System". Prentice-Hall Software Series, Englewood Cliffs, NJ, 1986. Superbly written but sadly dated; contains detailed descriptions of the filesystem routines and interfaces in a manner especially useful for those trying to implement a Unix compatible filesystem. See [Vahalia].
* [BLOB] R. Haskin and Raymond A. Lorie. "On Extending the Functions of a Relational Database System". SIGMOD Conference, 1982, pp. 207-212 (body of paper not on web). Reiser4 obsoletes this approach.
* [Chen] Chen, P.M. and Patterson, David A. "A New Approach to I/O Performance Evaluation --- Self-Scaling I/O Benchmarks, Predicted I/O Performance". 1993 ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems; also available on Chen's web page.
* [C-FFS] Ganger, Gregory R. and Kaashoek, M. Frans. "Embedded Inodes and Explicit Grouping: Exploiting Disk Bandwidth for Small Files". A very well written paper focused on 1-10k file size issues; they use some similar notions (most especially their concept of grouping compared to my packing localities). Note that they focus on the 1-10k file size range, and not the sub-1k range. The 1-10k range is the weak point in ReiserFS V3 performance. A page with a link to the postscript paper is available at http://amsterdam.lcs.mit.edu/papers/cffs.html
* [ext2fs] By Remi Card; extensive information and source code are available. Probably our toughest current competitor, it is showing its age though, and recent enhancements of it (journaling, htrees, etc.) have not been performance effective. It embodies both the strengths and weaknesses of the incrementalist approach to coding, and substantially resembles the older FFS filesystem from BSD.
* [FFS] M. McKusick, W. Joy, S. Leffler and R. Fabry. "A Fast File System for UNIX". ACM Transactions on Computer Systems, Vol. 2, No. 3, pp. 181-197, August 1984. Describes the implementation of a filesystem which employs parent directory location knowledge in determining file layout. It uses large blocks for all but the tail of files to improve I/O performance, and uses small blocks called fragments for the tails so as to reduce the cost due to internal fragmentation. Numerous other improvements are also made to what was once the state-of-the-art. FFS remains the architectural foundation for many current block allocation filesystems, and was later bundled with the standard Unix releases. Note that unrequested serialization and the use of fragments place it at a performance disadvantage to ext2fs, though whether ext2fs is thereby made less reliable is a matter of dispute that I take no position on (Reiser4 is an atomic filesystem, which is a different level of reliability entirely). Available at http://citeseer.nj.nec.com/mckusick84fast.html
* [Ganger] Gregory R. Ganger and Yale N. Patt. "Metadata Update Performance in File Systems". (Abstract only.)
* [Gifford] Describes a filesystem enriched to have more than hierarchical semantics; he shares many goals with this author, so forgive me for thinking his work worthwhile. If I had to suggest one improvement in a sentence, I would say his semantic algebra needs closure. (Postscript only.)
* [Hitz, Dave] A rather well designed filesystem optimized for NFS and RAID in combination. Note that RAID increases the merits of write-optimization in block layout algorithms. Available at http://www.netapp.com/technology/level3/3002.html
* [Holton and Das] Holton, Mike and Das, Raj. "The XFS space manager and namespace manager use sophisticated B-Tree indexing technology to represent file location information contained inside directory files and to represent the structure of the files themselves (location of information in a file)". Note that it is still a block (extent) allocation based filesystem; no attempt is made to store the actual file contents in the tree. It is targeted at the needs of the other end of the file size usage spectrum from ReiserFS, and is an excellent design for that purpose (though most filesystems, including Reiser4, do well at writing large files, and I think it is medium-sized and smaller files where filesystems can substantively differentiate themselves). SGI has also traditionally been a leader in resisting the use of unrequested serialization of I/O. Unfortunately, the paper is a bit vague on details. Available at http://www.sgi.com/Technology/xfs-whitepaper.html
* [Howard] Howard, J.H., Kazar, M.L., Menees, S.G., Nichols, D.A., Satyanarayanan, M., Sidebotham, R.N. and West, M.J. "Scale and Performance in a Distributed File System". ACM Transactions on Computer Systems, 6(1), February 1988. A classic benchmark, it was too CPU bound to effectively stress ext2fs and ReiserFS, and is no longer very effective for modern filesystems.
* [Knuth] Knuth, D.E. "The Art of Computer Programming", Vol. 3 (Sorting and Searching), Addison-Wesley, Reading, MA, 1973. The earliest reference discussing trees storing records of varying length.
* [LADDIS] Mark Wittle and Bruce Keith. "LADDIS: The Next Generation in NFS File Server Benchmarking". Proceedings of the Summer 1993 USENIX Conference, July 1993, pp. 111-128.
* [Lewis and Denenberg] Lewis, Harry R. and Denenberg, Larry. "Data Structures & Their Algorithms". HarperCollins Publishers, NY, NY, 1991. An algorithms textbook suitable for readers wishing to learn about balanced trees and their AVL predecessors.
* [McCreight] McCreight, E.M. "Pagination of B*-trees with variable length records". Commun. ACM 20 (9), pp. 670-674, 1977. Describes algorithms for trees with variable length records.
* [McVoy and Kleiman] The implementation of write-clustering for Sun's UFS. Available at http://www.sun.ca/white-papers/ufs-cluster.html
* [OLE] "Inside OLE" by Kraig Brockschmidt. Discusses Structured Storage (abstract only). Structured storage is what you get when application developers need features to better manage the storage of objects on disk by the applications they write, and the filesystem group at their company can't be bothered with them. Miserable performance, miserable semantics. Available at http://www.microsoft.com/mspress/books/abs/5-843-2b.htm
* [Ousterhout] J.K. Ousterhout, H. Da Costa, D. Harrison, J.A. Kunze, M.D. Kupfer and J.G. Thompson. "A Trace-driven Analysis of the UNIX 4.2BSD File System". In Proceedings of the 10th Symposium on Operating Systems Principles, pp. 15-24, Orcas Island, WA, December 1985.
* [NTFS] "Inside the Windows NT File System". The book is written by Helen Custer; NTFS is architected by Tom Miller with contributions by Gary Kimura, Brian Andrew, and David Goebel. Microsoft Press, 1994. An easy to read little book. They fundamentally disagree with me on adding serialization of I/O not requested by the application programmer, and I note that the performance penalty they pay for their decision is high, especially compared with ext2fs. Their FS design is perhaps optimal for floppies and other hardware eject media beyond OS control. A less serialized, higher performance log structured architecture is described in [Rosenblum and Ousterhout]. That said, Microsoft is to be commended for recognizing the importance of attempting to optimize for small files, and leading the OS designer effort to integrate small objects into the file name space. This book is notable for not referencing the work of persons not working for Microsoft, or providing any form of proper attribution to previous authors such as [Rosenblum and Ousterhout]. Though perhaps they really didn't read any of the literature, which would explain why theirs is the worst performing filesystem in the industry....
* [Peacock] K. Peacock. "The CounterPoint Fast File System". Proceedings of the Usenix Conference, Winter 1988.
* [Pike] Rob Pike and Peter Weinberger. "The Hideous Name". USENIX Summer 1985 Conference Proceedings, p. 563, Portland, Oregon, 1985. Short, informal, and drives home why inconsistent naming schemes in an OS are detrimental. Available at http://achille.cs.bell-labs.com/cm/cs/doc/85/1-05.ps.gz. His discussion of naming in Plan 9: http://plan9.bell-labs.com/plan9/doc/names.html
* [Rosenblum and Ousterhout] M. Rosenblum and J. Ousterhout. "The Design and Implementation of a Log-Structured File System". ACM Transactions on Computer Systems, Vol. 10, No. 1, pp. 26-52, February 1992. Available at http://citeseer.nj.nec.com/rosenblum91design.html.
This paper was quite influential in a number of ways on many modern filesystems, and the notion of using a cleaner may be applied to a future release of ReiserFS. There is an interesting on-going debate over the relative merits of FFS vs. LFS architectures; the interested reader may peruse http://www.scriptics.com/people/john.ousterhout/seltzer93.html and the arguments by Margo Seltzer it links to.
* [Snyder] "tmpfs: A Virtual Memory File System". Discusses a filesystem built to use swap space and intended for temporary files; due to a complete lack of disk synchronization it offers extremely high performance.
* [Vahalia] Uresh Vahalia. "Unix Kernel Internals".
* [Reiser93] Reiser, Hans T. "Future Vision Whitepaper", 1984, revised 1993. Available at http://www.namesys.com/whitepaper.html

[[category:Reiser4]] [[category:Formatting-fixes-needed]]

{{wayback|http://www.namesys.com/v4/v4.html|2006-11-13}}

Reasons why Reiser4 is great for you:

* Reiser4 is the fastest filesystem, and here are the benchmarks.
* Reiser4 is an atomic filesystem, which means that your filesystem operations either entirely occur or they entirely don't, and they don't corrupt due to half occurring. We do this without significant performance losses, because we invented algorithms to do it without copying the data twice.
* Reiser4 uses dancing trees, which obsolete the balanced tree algorithms used in databases (see farther down). This makes Reiser4 more space efficient than other filesystems, because we squish small files together rather than wasting space due to block alignment like they do. It also means that Reiser4 scales better than any other filesystem. Do you want a million files in a directory, and want to create them fast? No problem.
* Reiser4 is based on plugins, which means that it will attract many outside contributors, and you'll be able to upgrade to their innovations without reformatting your disk. If you like to code, you'll really like plugins....
* Reiser4 is architected for military grade security. You'll find it is easy to audit the code, and that assertions guard the entrance to every function.

V3 of ReiserFS is used as the default filesystem for SuSE, Lindows, FTOSX, Libranet, Xandros and Yoper. We don't touch the V3 code except to fix a bug, and as a result we don't get bug reports for the current mainstream kernel version. It shipped before the other journaling filesystems for Linux, and is the most stable of them as a result of having been out the longest. We must caution that just as Linux 2.6 is not yet as stable as Linux 2.4, it will also be some substantial time before V4 is as stable as V3.

== Software Engineering Based Reiser4 Design Principles ==

=== Equal Source Code Access Is A Civil Right ===

Copyright and patent laws were invented to give you an incentive to share your knowledge with the rest of the world in return for a limited time monopoly on what you shared. That is not the way it works with software, though, because software companies are allowed to keep their source code secret but are still given monopoly rights over their software. There is little meaningful sharing of knowledge when only binaries are shared with the world and all the rest is kept secret. The reasons for the existence of copyright and patent laws have been forgotten, their workings have been twisted, and greed and turf defense are what remain of them. Monopoly interests have taken laws intended to promote progress in the arts and sciences, and now use them to further their own control over us by ensuring that innovations not theirs cannot enter the market for improvements to software.
Think of software objects as forming a society, not yet at the level of an AI society, but still a group of programs interacting, and choosing whether to interact, with each other. Think of social lockout, whether it be in the form of racial discrimination as in the civil rights movement, Mercantilism as happened a few centuries ago, or the endless other forms of division in human society. Is it so surprising that this evil casts its shadow on cyberspace? Is it so surprising that our cybershadows also find ways to engage in social lockout of others? Most of the cyber-world of software lives under tyranny today. We are part of a movement to create a free cyber-world we can all participate equally in. Namesys does not oppose copyright laws as they were invented (14-year monopolies which disclosed everything that was temporarily monopolized); it opposes copyright laws as they have been twisted. Namesys opposes unlimited time monopolies which disclose nothing, and lock out all other inventors. Many others in this movement are opposed to copyright law, even the version in which it was first created. We feel they are not acknowledging that a trade-off is being made, and that this trade-off has value. Yet still we choose to give our software away for free for use with software that is given away for free (e.g. Gnu/Linux). Since we don't have a lot of illusions about our ability to entirely change the world, and it is amusing to sell free software, for those who do not want to disclose their software and do not want to give it away for free, we charge a license fee and let them keep their improvements to our software without sharing them. These fees help substantially in allowing us to survive as an organization.
We don't make nearly as much money as we would from charging everyone for usage rights, but we do make just enough to get by, and that is important. ;-) We don't really feel that everyone should follow our example and make their software free of charge for most users (it is too hard to survive fiscally doing this), but we do think that everyone should disclose their source code, and no one should design their software to exclude working with other software (e.g. Microsoft's Palladium, which makes such a mockery of Athena). === Software Libre Takes More Than A License --- It Takes A Design === Making the source code available to you is not enough by itself to bring you all of the possible benefits of software libre. Many file systems are so difficult to modify that only someone who has worked with the code for years finds it feasible to modify it, and even then small changes can take months of labor due to their ripple effects on the other code and the difficulties of dealing with disk format changes. This is why we have a plugin-based architecture in Reiser4, so that it is not just possible, but easy, to improve the software. Imagine that you were an experimental physicist who had spent his life using only the tools that were in his local hardware store. Then one day you joined a major research lab with a machine shop and a whole bunch of other physicists. All of a sudden you are not using just whatever tools the large tool companies who have never heard of you have made for you. You are now part of a cooperative of physicists all making your own tools, swapping tools with each other, suddenly empowered to have tools that are exactly what you want them to be, or even merely exactly what your colleagues want them to be, rather than what some big tool company, that has to do a market analysis before giving you what you want, wants them to be. That is the transition you will make when you go from version 3 to version 4 of ReiserFS.
The tools your colleagues and sysadmins (your machinists) make are going to be much better for what you need. === Why Limit Interactions With Objects Strictly? === You may wonder why the design we will present is so highly structured, why every object is allowed to control what is done to it by providing a limited interface, and why we pass requests to objects to do things rather than doing things directly to the object. Surely we limit our functionality by doing so, yes? Indeed we do, but is there a reason why the price is worth paying? Is there something that becomes crucial as complexity grows? Chaos theory offers the answer. If you disturb one thing, and disturbing that thing inherently disturbs another thing, which in turn disturbs the first thing plus maybe a whole bunch of other things, and those things all disturb the first thing again, and so on, you get what chaos theory calls a feedback loop. These loops have a marvelous tendency for the end effect of the disturbance to be incalculable, and our inability to calculate such loops is perhaps a significant aspect of our being mere mortals. Of course, as you probably know, most programmers want to be gods, and when they are unable to know what the effect will be of a change they make to their code, they dislike this. As a result, they go to great lengths to reduce the tendency of code changes to the design of one object to have ripple effects upon other objects. A vitally important way to do this is to have very strictly defined interfaces to objects, and for the designer of each object to be able to know that the interface will never be violated when he writes it. This is called "object oriented design", or "structured programming", and if used well it can do a lot to reduce a type of chaotic behavior known as bugs. ;-) Verifying the avoidance of interactions that violate the design for an object is a key task in security auditing (inspecting the code to see if it has security holes).
The expressive power of an information system is proportional not to the number of objects that get implemented for it, but instead is proportional to the number of possible effective interactions between objects in it. (Reiser's Law Of Information Economics) This is similar to Adam Smith's observation that the wealth of nations is determined not by the number of their inhabitants, but by how well connected they are to each other. He traced the development of civilization throughout history, and found a consistent correlation between connectivity via roads and waterways, and wealth. He also found a correlation between specialization and wealth, and suggested that greater trade connectivity makes greater specialization economically viable. You can think of namespaces as forming the roads and waterways that connect the components of an operating system. The cost of these connecting namespaces is influenced by the number of interfaces that they must know how to connect to. That cost is, if they are not clever to avoid it, N times N, where N is the number of interfaces, since they must write code that knows how to connect every kind to every kind. One very important way to reduce the cost of fully connective namespaces is to teach all the objects how to use the same interface, so that the namespace can connect them without adding any code to the namespace. Very commonly, objects with different interfaces are segregated into different namespaces. If you have two namespaces, one with N objects, and another with M objects, the expressive power of the objects they connect is proportional to (N times N) plus (M times M), which is less than (N plus M) times (N plus M). Try it on a calculator for some arbitrary N and M. Usually the cost of inventing the namespaces is much less than the cost of the users creating all the objects. 
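The "try it on a calculator" claim above is easy to check mechanically. A minimal sketch (the values of N and M are illustrative, not taken from the text):

```python
# Model: expressive power is proportional to the number of possible
# pairwise interactions between objects sharing a namespace, i.e. N*N.
def expressive_power(n: int) -> int:
    return n * n

n, m = 10, 15  # two namespaces with N and M objects (arbitrary example)

# Segregated namespaces: objects can only interact within their own namespace.
segregated = expressive_power(n) + expressive_power(m)   # N^2 + M^2

# One unified namespace: every object can interact with every object.
unified = expressive_power(n + m)                        # (N + M)^2

# The unified namespace always wins by exactly the cross term 2*N*M,
# since (N + M)^2 = N^2 + 2*N*M + M^2.
assert unified - segregated == 2 * n * m
print(segregated, unified)  # 325 625
```

The algebraic identity makes the general point: the advantage of unification is the `2*N*M` cross-namespace interactions that segregation forfeits.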
This is what makes namespaces so exciting to work with: you can have an enormous impact on the productivity of the whole system just by being a bit fanatical in insisting on simplicity and consistency in a few areas. Please remember this analysis later when we describe why we implement everything to support a "file" or "directory" interface, and why we aren't eager to support objects with unnecessarily different namespaces/interfaces --- such as "attributes" that cannot interact with files in all the same ways that files can interact with files. == Basic Semantics == To interact with an object you name it, and you say what you want it to do. The filesystem takes the name you give, and looks through things we call directories to find the object, and then gives the object your request to do something. === Files === [Figure: character holding an object that looks like a sequence of bytes] A file is something that tries to look like a sequence of bytes. You can read the bytes, and write the bytes. You can specify what byte to start to read/write from (the offset), and the number of bytes to read/write (the count). [Diagram needed]. You can also cut bytes off of the end of the file. [Figure: character sawing off the end of a file] Cutting bytes out of the middle or the beginning of a file, and inserting bytes into the middle of a file, are not permitted by any of our current file plugins, all of which implement fairly ancient Unix file semantics, but this is likely to change someday. ==== The Software Engineering Lurking Below File Plugins ==== Your interactions with a file are handled by the file's "plugin". These interactions are structured (in programming, such structures are generally called "interfaces") into a set of limited and defined interactions. (We are too lazy to perform the infinite work of programming plugins to handle infinite types of interactions.) Each way you can interact with a plugin is called a "method". A plugin is composed as a set of such methods.
Among programmers, laziness is considered the highest art form, and we do our best to express our souls in this art. This is why we have layers and layers of laziness built into our plugin architecture. Each method is composed from a library of functions we thought would be useful in constructing plugin methods. Each plugin is composed from a library of methods used by plugins, and a plugin can be considered a one-to-one mapping (that's where you have two sets of things, and for every member of one set, you specify a member of the other set as its match) of every way of interacting with the plugin to a method handling it. For every file, there is a file pluginid. Whenever you attempt to interact with a file, we take the name of the file, find the pluginid for the file, and inside the kernel we have an array of plugins [diagram needed that is suitable for persons who don't know what an array or offset is], and we use the pluginid as the offset of that file's plugin within that array. (An offset is a position relative to something else, and in programming it is typically measured in bytes.) This implies that when you invent a new file plugin, you have to recompile (Programmers don't actually write programs; they got too lazy for that long ago; instead they write instructions for the computer on how to write the program, and when the computer follows these instructions ("source code"), it is called "compiling", which programmers usually pretend was done by them when they speak about it, as in "I recompiled the kernel for my exact CPU this time, and now playing pong is noticeably faster.") the kernel, and you can only add plugins to the end of the list, and you can never reuse or change pluginids for a plugin, or else you will have to go through the whole filesystem changing all of the pluginids that are no longer correct.
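The pluginid lookup just described amounts to indexing a dispatch table. A minimal sketch in Python (the method names and the dict-based files are hypothetical illustrations; Reiser4's real plugin tables are C structures inside the kernel):

```python
# Hypothetical method implementations for one plugin.
def unix_file_read(file):
    return file["body"]

def unix_file_write(file, data):
    file["body"] = data

# The kernel's array of plugins. A pluginid is simply an offset into
# this array, which is why ids may never be reused or reordered, and
# new plugins may only be appended at the end.
PLUGINS = [
    {"read": unix_file_read, "write": unix_file_write},  # pluginid 0
]

def dispatch(file, method, *args):
    plugin = PLUGINS[file["pluginid"]]  # locate the plugin by offset
    return plugin[method](file, *args)  # invoke the handling method

f = {"pluginid": 0, "body": b"hello"}
dispatch(f, "write", b"bytes")
assert dispatch(f, "read") == b"bytes"
```

Each plugin here is exactly the one-to-one mapping the text describes: every way of interacting with the file ("read", "write") is matched to one method that handles it.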
Someday in a later version we will revise this so that plugins are "dynamically loadable" (which is when you can add something to a program while it is running), and you can add support for new plugins to a running kernel. When we do that we will carefully benchmark and ensure that there is no loss of performance (or we won't do it) from using dynamic loading. Programs are often "layered", which is when the program is divided into layers, and each layer only talks to the layer immediately above it, or immediately below it, and never talks to a part of the program two levels below it, etc. This reduces the complexity of the interfaces for the various parts of the program, and most of the complexity of a program is in coding its interfaces. [Figure: characters each communicating with adjacent characters only] Reiser4 has a "semantic layer", and this semantic layer concerns itself with naming objects and specifying what to do to the objects, and doesn't concern itself with such things as how to pack objects into particular places on disk or in the tree. An IO to a file may affect more than one physical sequence of bytes, or no physical sequence of bytes; it may affect the sequences of bytes offered by other files to the semantic layer, and the file plugin may invoke other plugins and delegate work to them, but its interface is structured for offering the caller the ability to read and/or write what the caller sees as being a single sequence of bytes. Appearances are what is wanted. When we say that security attributes are implemented as files, we mean that security attributes look like a sequence of bytes, but the security attributes may be stored in some compressed form that perhaps might be of fixed length, or even be just a single bit.
For the filesystem to offer the benefits of simplicity it need merely provide a uniform appearance that all things it stores are sequences of bytes, and there is nothing to prevent it from gaining efficiency through using many different storage implementations to offer this uniform appearance. For many files it is valuable for them to support efficient tree traversal to any offset in the sequence of bytes. It is not required though, and Unix/Gnu/Linux has traditionally supported some types of files which could not do this. A pipe will allow you to take the output of one command, and connect it to the input of another command, and each of the commands will see the pipe as a file. This pipe is an example of a file for which you cannot simply jump to the middle of the file efficiently but instead you must go through it from beginning to end in sequential order. === Names and Objects === A name is a means of selecting an object. An object is anything that acts as though it is a single unified entity. What is an object is context dependent. For instance, if you tell an object to delete itself, many distinctly named entities (that are distinct objects in other ways such as reading) might well disappear as though they are a single object in response to the delete request. A namespace is a mapping of names to objects. Filesystems, databases, search engines, and environment variable names within shells are all examples of namespaces. The early papers using the term tended to seek to convey that namespaces have commonality in their structure, are not fundamentally different, should be based on common design principles, and should be unified. Such unification is a bit of a quest for a holy grail. In British mythology King Arthur sent his knights out on a quest for the holy grail, and if only they could become worthy of it, it would appear to them. None of them found it, and yet the quest made them what they became.
Namespaces will never be unified, but the closer we can come to it, the more expressive power the OS will have. Reiser4 seeks to create a storage layer effective for such an eventually unified namespace, and gives it a semantic layer with some minor advantages over the state of the art. Later versions will add more and more expressive semantics to the storage layer. Finding objects is layered. The semantic layer takes names and converts them into keys (we call this "resolving" the name). The storage layer (which contains the tree traversing code) takes keys and finds the bytes that store the parts of the object. Keys are the fundamental name used by the Reiser4 tree. They are the name that the storage layer at the bottom of it all understands. They can be used to find anything in the tree, not just whole objects, but parts of objects as well. Everything in the tree has exactly one key. Duplicate keys are allowed, but their use usually means that all duplicates must be examined to see if they really contain what is sought, and so duplicates are usually rare if high performance is desired. Allowing duplicates can allow keys to be more compact in some circumstances (e.g. hashed directory entries). An objectid cannot be used for finding an object, only keys can. Objectids are used to compose keys so as to ensure that keys are unique. === Ordering of Name Components === When designing the naming system described in the future vision whitepaper I broke names from human and computer languages into their pieces, and then looked at their pieces to see which ones differed from each other in meaningful ways vs. which pieces were different expressions that provided the same functionality. (In more formal language, I would say that I systematically decomposed the ways of naming things that we use in human and computer languages into orthogonal primitives, and then determined their equivalence classes.) 
I then selected one way of expression from each set of ways that provided equivalent functionality. (Since that whitepaper is focused on what is not yet implemented, the whitepaper does not list all of the equivalence classes for names, but instead describes those which I thought I could say something interesting to the reader about. For instance, the NOT operator is simply unmentioned in it, as I really have nothing interesting to say about NOT, though it is very useful and will be documented when implemented.) The ordering of two components of a name either has meaning, or it does not. If the resolution of one component of the name depends on what is named by another component, then that pair of name components forms a hierarchical name. Hierarchy can be indicated by means other than ordering. Many human languages indicate structure by use of suffix or tag mechanisms (e.g. Russian and Japanese). The syntactical mechanism one chooses to express hierarchy does not determine the possible semantics one can express so long as at least one effective method for expressing hierarchy is allowed. I chose to offer only one expression from each equivalence class of naming primitives, and here I chose the '/'-separated file pathname expression traditional to Unix for pragmatic compatibility with existing operating systems. Reiser4 handles only hierarchical names, and non-hierarchical names are planned only for SSN Reiserfs. === Directories === Hierarchical names are implemented in Reiser4 by use of directories. The first component of a hierarchical name is the name of the directory, and the components that follow are passed to the directory to interpret. We use `/' to separate the components of a hierarchical name. Directories may choose to delegate parts of their task to their sub-directories.
The unix directory plugin, when supplied with a name, will use the part of the name before the first / to select a sub-directory (if there is a / in what it is resolving), and delegate resolving the part of the name after the first / to the sub-directory. A directory can employ any arbitrary method at all of resolving the name components passed to it, so long as it returns a set of keys of objects as the result. In Reiser4, this set of keys always contains exactly one member, but this is designed to change in SSN Reiserfs. (Reiser4 also needs to interact with a standard interface for Unix filesystems called VFS (Virtual File System), and directories are also designed to be able to return what VFS understands, which we won't go into here.) Directories will also return a list of names when asked. This list is not required to be a complete list of all names that they can resolve, and sometimes it is not desirable that it be so. Names can be hidden names in Reiser4. Directory plugins may be able to resolve more names than they can list, especially if they are written such that the number of names that they can resolve is infinite. In particular, such names can resolve to objects that behave like ordinary files (with respect to the standard file system interface: read, write, readdir, etc.) but are not backed by the storage layer. Such objects are called "pseudo files". Here is a list of pseudo files currently implemented in Reiser4 with a description of their semantics. ==== The Unix Directory Plugin ==== The unix directory plugin implements directories by storing a set of directory entries per directory. These directory entries contain a name and a key.
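The delegation scheme described above can be sketched as a recursive lookup over name-to-key mappings. A minimal model in Python (the sample tree and the key values are made up for illustration; real keys are packed binary structures):

```python
# A directory maps names either to a key (an object in the tree) or to
# a sub-directory, to which resolution of the rest of the name is delegated.
TREE = {
    "usr": {"bin": {"ls": 1001}},  # hypothetical keys
    "etc": {"passwd": 2002},
}

def resolve(directory, name):
    # Split off the component before the first '/'.
    head, sep, rest = name.partition("/")
    entry = directory[head]
    if sep:                 # more components remain: delegate downward
        return resolve(entry, rest)
    return entry            # the key of the named object

assert resolve(TREE, "usr/bin/ls") == 1001
assert resolve(TREE, "etc/passwd") == 2002
```

Note that `resolve` returns a key, not the object itself; as the text explains, turning keys into bytes is the storage layer's job, not the semantic layer's.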
When given a name to resolve, the unix directory plugin finds the directory entry containing that name, and then returns the key that is in the directory entry. (More precisely, since a key selects not just the file but a particular byte within a file, it returns the part of the key which is sufficient to select the file; the code can later form the whole keys for the file's various parts by adding the byte offset and some other fields, like item type, to this partial key.) The key can then be used by the tree storage layer to find all the pieces of that which was named. ==== Some Historical Details Of Design Flaws In The Unix Directory Interface ==== Unix differs from Multics, in that Multics defined a file to be a sequence of elements (the elements could be bytes, directory entries, or something else....), while Unix defines a file to be purely a sequence of bytes. In Multics directories were then considered to be a particular type of file which was a sequence of directory entries. For many years, all implementations of Unix directories were as sequences of bytes, and the notion of location within a Unix directory is tied not to a name as you might expect, but to a byte offset within the directory. The problem is that one is using a byte offset to represent a location whose true meaning is not a byte offset but a directory entry, and doing so for a particular file in a system which meaningfully names that file not by byte offset within the directory but by filename. Various efforts are being made in the Unix community to pretend that this byte offset is something more general than a byte offset, and they often try to do so without increasing the size used to store the thing which they pretend is not a byte offset. Since byte offsets are normally smaller than filenames are allowed to be, the result is ugliness and pathetic kludges.
Trust me that you would rather not know about the details of those kludges unless you absolutely have to, and let me say no more. ==== Directories Are Unordered ==== Unix/Linux makes no promises regarding the order of names within directories. The order in which files are created is not necessarily the order in which names will be listed in a directory, and the use of lexicographic (alphabetic) order is surprisingly rare. The unix utilities typically sort directory listings after they are returned by the filesystem, which is why it seems like the filesystem sorts them, and is why listing very large directories can be slow. (Our current default plugin sorts filenames that are less than 15 letters long lexicographically. For those that are more than 15 characters long it sorts them first by their first 8 letters, then by the hash of the whole name.) There is value to allowing the user to specify an arbitrary order for names using an arbitrary ordering function the user supplies. This is not done in Reiser4, but is planned as a feature of later versions. Allowing the creation of a hash plugin is a limited form of this that is currently implemented. ==== Files That Are Also Directories ==== In Reiser4 (but not ReiserFS 3) an object can be both a file and a directory at the same time. If you access it as a file, you obtain the named sequence of bytes. If you use it as a directory you can obtain files within it, directory listings, etc. There was a lengthy discussion on the Linux Kernel Mailing List about whether this was technically feasible to do. I won't reproduce it here except to summarize that Linus showed that this was feasible without "breaking" VFS. Allowing an object to be both a file and a directory is one of the features necessary to compose the functionality present in streams and attributes using files and directories.
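A toy model of such a dual-interface object, assuming in-memory dict-backed storage (the class name, method names, and sample data are all illustrative, not Reiser4 internals):

```python
class FileDir:
    """An object usable both as a file (a sequence of bytes) and as a
    directory (a namespace of child objects) at the same time."""

    def __init__(self, body=b"", children=None):
        self.body = body
        self.children = children or {}

    # --- file interface ---
    def read(self, offset=0, count=None):
        end = None if count is None else offset + count
        return self.body[offset:end]

    # --- directory interface ---
    def lookup(self, name):
        return self.children[name]

# Naming the object itself yields the file body; naming a child yields
# an attribute stored as an ordinary file "within" it.
doc = FileDir(b"report text", {"owner": FileDir(b"alice")})
assert doc.read() == b"report text"
assert doc.lookup("owner").read() == b"alice"
```

This is the composition argument in miniature: once one object answers both interfaces, attributes need no separate API; they are just files reached through the object's directory face.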
To implement a regular unix file with all of its metadata, we use a file plugin for the body of the file, a directory plugin for finding file plugins for each of the metadata, and particular file plugins for each of the metadata. We use a unix_file file plugin to access the body of the file, and a unix_file_dir directory plugin to resolve the names of its metadata to particular file plugins for particular metadata. These particular file plugins for unix file metadata (owner, permissions, etc.) are implemented to allow the metadata normally used by unix files to be quite compactly stored. ==== Hidden Directory Entries ==== A file can exist but not be visible when using readdir in the usual way. WAFL does this with the .snapshots directory; it works well for them without disturbing users. This is useful for adding access to a variety of new features and their applications without disturbing the user when they are not relevant. == New Security Attributes and Set Theoretic Semantic Purity == [Figure: character holding primitive icons] === Minimizing Number Of Primitives Is Important In Abstract Constructions === To a theoretician it is extremely important to minimize the number of primitives with which one achieves the desired functionality in an abstract construction. It is a bit hard to explain why this is so, but it is well accepted that breaking an abstract model into more basic primitives is very important. A not very precise explanation of why is to say that by breaking complex primitives into their more basic primitives, then recombining those basic primitives differently, you can usually express new things that the original complex primitives did not express. Let's follow this grand tradition of theoreticians and see what happens if we apply it to Gnu/Linux files and directories. === Can We Get By Using Just Files and Directories (Composing Streams And Attributes From Files And Directories)? === In Gnu/Linux we have files, directories, and attributes. In NTFS they also have streams.
Since Samba is important to Gnu/Linux, there frequently are requests that we add streams to ReiserFS. There are also requests that we add more and more different kinds of attributes using more and more different APIs. Can we do everything that can be done with {files, directories, attributes, streams} using just {files, directories}? I say yes--if we make files and directories more powerful and flexible. I hope that by the end of reading this you will agree. Let us have two basic objects. A file is a sequence of bytes that has a name. A directory is a name space mapping names to a set of objects "within" the directory. We connect these directory name spaces such that one can use compound names whose subcomponents are separated by a delimiter '/'. What is missing from files and directories that attributes and streams offer? In ReiserFS 3, there exist file attributes. File attributes are out-of-band data describing the sequence of bytes which is the file. For example, the permissions defining who can access a file, or the last modification time, are file attributes. File attributes have their own API; creating new file attributes creates new code complexity and compatibility issues galore. ACLs are one example of new file attributes users want. Since in Reiser4 files can also be directories, we can implement traditional file attributes as simply files. To access a file attribute, one need merely name the file, followed by a '/', followed by an attribute name. That is: a traditional file will be implemented to possess some of the features of a directory; it will contain files within the directory corresponding to file attributes which you can access by their names; and it will contain a file body which is what you access when you name the "directory" rather than the file. Unix currently has a variety of attributes that are distinct from files (ACLs, permissions, timestamps, other mostly security related attributes, ...).
This is because a variety of people needed this feature and that, and there was no infrastructure that would allow implementing the features as fully orthogonal features that could be applied to any file. Reiser4 will create that infrastructure. List Of Features Needed To Get Attribute And Stream Functionality From Files And Directories: * api efficient for small files * efficient storage for small files * plugins, including plugins that can compress a file serving as an attribute into a single bit * files that also act as directories when accessed as directories * inheritance (includes file aggregation) * constraints * transactions * hidden directory entries Each of these additional features is a feature that would benefit the filesystem. So we add them in v4. == Basic Tree Concepts == === Trees, Nodes, and Items === One way of organizing information is to put it into trees. When we organize information in a computer, we typically sort it into piles (nodes we call them), and there is a name (a pointer) for each pile that the computer will be able to use to find the pile. [Figure 1. One Example Of A Tree: a height = 4 (4-level), fanout = 3 balanced tree. It starts with a root node, then traverses 2 levels of internal nodes, and ends with the leaf nodes, which hold the data and have no children.] Some of the nodes can contain pointers, and we can go looking through the nodes to find those pointers to (usually other) nodes. We are particularly interested in how to organize so that we can find things when we search for them. A tree is an organization structure that has some useful properties for that purpose. Definition of Tree: # A tree is a set of nodes organized into a root node, and zero or more additional sets of nodes called subtrees. # Each of the subtrees is a tree. # No node in the tree points to the root node, and exactly one pointer from a node in the tree points to each non-root node in the tree.
# The root node has a pointer to each of its subtrees, that is, a pointer to the root node of the subtree. === Fine Points of the Definition === [Figure 2. The simplest tree: the absolutely most trivial of all graphs, the single, isolated node.] [Figure 3. A trivial, linear tree: a trivial, connected, linear (unary) graph, a linear sequence of nodes connected by paths (edges, pointers).] It is interesting to argue over whether finite should be a part of the definition of trees. There are many ways of defining trees, and which is the best definition depends on what your purpose is. Donald Knuth (a well-known author of algorithm textbooks) supplies several definitions of tree. As his primary definition of tree he even supplies one which has no pointers/edges/lines in the definition, just sets of nodes. Reiser4 uses a finite tree (the number of nodes is limited). Knuth defines trees as being finite sets of nodes. There are papers on infinite trees on the Internet. I think it more appropriate to consider finite an additional qualifier on trees, rather than bundling finite into the definition. However, I personally only deal with finite trees in my storage layer research. It is interesting to consider whether storage layers are inherently more motivated than semantic layers to limit themselves to finite trees rather than infinite trees. This is where some writers would say ".... is left as an exercise for the reader". :-) Oh the temptation.... I will remind the reader of my explanation of why storage layer trees are more motivated to be acyclic, and, at the cost of some effort at honesty, constrain myself to saying that doing more than providing that hint is beyond my level of industry. ;-) Edge is a term often used in tree definitions. A pointer is unidirectional (you can follow it from the node that has it to the node it points to, but you cannot follow it back from the node it points to to the node that has it). An edge is bidirectional (you can follow it in both directions).
Here are three alternative tree definitions, which are interesting in how they are mathematically equivalent to each other, though they are not equivalent to the definition I supplied because edges are not equivalent to pointers. For all three of these definitions, let there be not more than one edge connecting the same two nodes:
* a set of vertices (aka points) connected by edges (aka lines) for which the number of edges is one less than the number of vertices
* or a set of vertices connected by edges which has no cycles (a cycle is a path from a vertex to itself)
* or a set of vertices connected by edges for which there is exactly one path connecting any two vertices

The three alternative definitions do not have a unique root in their tree, and such trees are called free trees. The definition I supplied is a definition of a rooted tree, not a free tree. It also has no cycles, it has one less pointer than it has nodes, and there is exactly one path from the root to any node. Please feel encouraged to read Knuth's writings for more discussions of these topics.

Graphs vs. Trees

Consider the purposes for which you might want to use a graph, and those for which you might want to use a tree. In a tree there is exactly one path from the root to each node in the tree, and a tree has the minimum number of pointers sufficient to connect all the nodes. This makes it a simple and efficient structure. Trees are useful for when efficiency with minimal complexity is what is desired, and there is no need to reach a node by more than one route. Reiser4 has both graphs and trees, with trees used for when the filesystem chooses the organization (in what we call the storage layer, which tries to be simple and efficient), and graphs for when the user chooses the organization (in the semantic layer, which tries to be expressive so that the user can do whatever he wants).

Ordering The Tree Aids Searching Through It

Keys

We assign everything stored in the tree a key.
We find things by their keys. Use of keys gives us additional flexibility in how we sort things, and if the keys are small, it gives us a compact means of specifying enough to find the thing. It also limits what information we can use for finding things. This limit restricts its usefulness, and so we have a storage layer, which finds things by keys, and a semantic layer, which has a rich naming system. The storage layer chooses keys for things solely to organize storage in a way that will improve performance, and the semantic layer understands names that have meaning to users. As you read, you might want to think about whether this is a useful separation that allows freedom in adding improvements that aid performance in the storage layer, while escaping paying a price for the side effects of those improvements on the flexible naming objectives of the semantic layer.

Choosing Which Subtree

We start our search at the root, because from the root we can reach every other node. How do we choose which subtree of the root to go to from the root? The root contains pointers to its subtrees. For each pointer to a subtree there is a corresponding left delimiting key. Pointers to subtrees, and the subtrees themselves, are ordered by their left delimiting key. A subtree pointer's left delimiting key is equal to the least key of the things in the subtree. Its right delimiting key is larger than the largest key in the subtree, and it is the left delimiting key of the next subtree of this node. Each subtree contains only things whose keys are at least equal to the left delimiting key of its pointer, and are not more than its right delimiting key. If there are no duplicate keys in the tree, then each subtree contains only things whose keys are less than its right delimiting key. If there are no duplicate keys, then by looking within a node at its pointers to subtrees and their delimiting keys we know what subtree of that node contains the thing we are looking for.
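The subtree-selection rule above amounts to: follow the last child whose left delimiting key is not greater than the search key. A minimal Python sketch of that rule (assuming no duplicate keys; not Reiser4's actual node-search code):

```python
from bisect import bisect_right

def choose_subtree(left_keys, search_key):
    """Pick the child covering search_key, given the node's sorted
    left delimiting keys (one per child subtree)."""
    i = bisect_right(left_keys, search_key) - 1
    # A key below the first delimiting key should not occur in a correct
    # tree; clamping to child 0 is a simplification for this sketch.
    return max(i, 0)

# Children cover key ranges [10,20), [20,30), [30, right delimiting key).
left_keys = [10, 20, 30]
assert choose_subtree(left_keys, 10) == 0   # equal to a left key: that child
assert choose_subtree(left_keys, 19) == 0
assert choose_subtree(left_keys, 25) == 1
assert choose_subtree(left_keys, 30) == 2
```

Applied level by level from the root, this is the whole search: at each internal node the delimiting keys narrow the search to one child, until a leaf is reached.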
Duplicate keys are a topic for another time. For now I will just hint that when searching through objects with duplicate keys we find the first of them in the tree, and then we search through all duplicates one-by-one until we find what we are looking for. Allowing duplicate keys can allow for smaller keys, so there is sometimes a tradeoff between key size and the average frequency of such inefficient linear searches. Using duplicate keys can also allow, if one defines one's insertion algorithms such that they always insert at the end of a set of duplicate keys, ordering objects with the same key by creation time. The contents of each node in the tree are sorted within the node. So, the entire tree is sorted by key, and for a given key we know just where to go to find at least one thing with that key.

Nodes: Leaves, Twigs, and Branches

Leaves are nodes that have no children. Internal nodes are nodes that have children.

Figure 4. A height = 4, fanout = 3, balanced tree. A search will start with the root node, the sole level 4 internal node, traverse 2 more internal nodes, and end with a leaf node which holds the data and has no children.

A node that contains items is called a formatted node. If an object is large, and is not compressed and doesn't need to support efficient insertions (compressed objects are special because they need to be able to change their space usage when you write to their middles because the compression might not be equally efficient for the new data), then it can be more efficient to store it in nodes without any use of items at all. We do so by default for objects larger than 16k. Unformatted leaves (unfleaves) are leaves that contain only data, and do not contain any formatting information. Only leaves can contain unformatted data.
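The default policy just described (uncompressed objects over 16k go to unformatted leaves, everything else is stored as items in formatted nodes) can be written down as a one-line predicate. A sketch only: the function name is invented, and the real decision is made by plugins.

```python
UNFORMATTED_CUTOFF = 16 * 1024   # the 16k default mentioned in the text

def use_unformatted_leaves(object_size, compressed=False):
    """Should this object's body live in raw unfleaves?

    Compressed objects stay in items: their space usage must be able to
    change when their middles are overwritten, which raw unformatted
    blocks cannot accommodate.
    """
    return object_size > UNFORMATTED_CUTOFF and not compressed

assert use_unformatted_leaves(64 * 1024) is True
assert use_unformatted_leaves(64 * 1024, compressed=True) is False
assert use_unformatted_leaves(4 * 1024) is False
```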
Pointers are stored in items, and so all internal nodes are necessarily formatted nodes. Pointers to unfleaves are different in their structure from pointers to formatted nodes. Extent pointers point to unfleaves. An extent is a sequence of contiguous in block number order unfleaves that belong to the same object. An extent pointer contains the starting block number of the extent, and a length. [diagram needed] Because the extent belongs to just one object, we can store just one key for the extent, and then we can calculate the key of any byte within that extent. If the extent is at least 2 blocks long, extent pointers are more compact than regular node pointers would be. Node Pointers are pointers to formatted nodes. We do not yet have a compressed version of node pointers, but they are probably soon to come. Notice how with extent pointers we don't have to store the delimiting key of each node pointed to, and with node pointers we need to. We will probably introduce key compression at the same time we add compressed node pointers. One would expect keys to compress well since they are sorted into ascending order. We expect our node and item plugin infrastructure will make such features easy to add at a later date. Twigs are parents of leaves. Extent Pointers exist only in twigs. This is a very controversial design decision I will discuss a bit later. Branches are internal nodes that are not twigs. You might think we would number the root level 1, but since the tree grows at the top, it turns out to be more useful to number as 1 the level with the leaves where object data is stored. The height of the tree will depend upon how many objects we have to store and what the fanout rate (average number of children) of the internal and twig nodes will be. For reasons of code simplicity, we find it easiest to implement Reiser4 such that it has a minimum height of 2, and the root is always an internal node. 
There is nothing deeper than judicial laziness to this: it simplifies the code to not deal with one node trees, and nobody cares about the waste of space.

An example of a Reiser4 tree:

Figure 5. This Reiser4 tree is a 4 level, balanced tree with a fanout of 3. It starts with a root node, traverses branch nodes, including the internal nodes called twig nodes (a Reiser4 feature), and ends with the leaf nodes which hold the data and have no children. In practice Reiser4 fanout is much higher and varies from node to node, but a 4 level tree diagram with 16 million leaf nodes won't fit easily onto my monitor so I drew something smaller.... ;-)

Size of Nodes

We choose to make the nodes equal in size. This makes it much easier to allocate the unused space between nodes, because it will be some multiple of the node size, and there are no problems of space being free but not large enough to store a node. Also, disk drives have an interface that assumes equal size blocks, which they find convenient for their error-correction algorithms. If having the nodes be equal in size is not very important, perhaps due to the tree fitting into RAM, then using a class of algorithms called skip lists is worthy of consideration. Reiser4 nodes are usually equal to the size of a page, which if you use Gnu/Linux on an Intel CPU is currently 4096 (4k) bytes. There is no measured empirical reason to think this size is better than others, it is just the one that Gnu/Linux makes easiest and cleanest to program into the code, and we have been too busy to experiment with other sizes.

Sharing Blocks Saves Space

If nodes are of equal size, how do we store large objects? We chop them into pieces. We call these pieces items. Items are sized to fit within a single node. Conventional filesystems store files in whole blocks. Roughly speaking, this means that on average half a block of space is wasted per file because not all of the last block of the file is used.
If a file is much smaller than a block, then the space wasted is much larger than the file. It is not effective to store such typical database objects as addresses and phone numbers in separately named files in a conventional filesystem, because it will waste more than 90% of the space in the blocks it stores them in. By putting multiple items within a single node in Reiser4, we are able to pack multiple small pieces of files into one block. Our space efficiency is roughly 94% for small files. This does not count per item formatting overhead, whose percentage of total space consumed depends on average item size, and for that reason is hard to quantify.

Aligning files to 4k boundaries does have advantages for large files though. When a program wants to operate directly on file data without going through system calls to do it, it can use mmap() to make the file data part of the process's directly accessible address space. Due to some implementation details mmap() needs file data to be 4k aligned, and if the data is already 4k aligned, it makes mmap() much more efficient. In Reiser4 the current default is that files that are larger than 16k are 4k aligned. We don't yet have enough empirical data and experience to know whether 16k is the precise optimal default value for this cutoff point, but so far it seems to at least be a decent choice.

Items

Nodes in the tree are smaller than some of the objects they hold, and larger than some of the objects they hold, so how do we store them? One way is to pour them into items. An item is a data container that is contained entirely within a single node, and it allows us to manage space within nodes. For the default 4.0 node format, every item has a key, an offset to where in the node the item body starts, a length of the item body, and a pluginid that indicates what type of item it is. Items allow us to not have to round up to 4k the amount of space required to store an object.

The Structure of an Item (the Item_Head and the Item_Body are stored separated from each other within the node):

 Item_Head: [ Item_Key | Item_Offset | Item_Length | Item_Plugin_id ]
 Item_Body: [ the item's data, found at Item_Offset within the node ]

Types Of Items

Reiser4 includes many different kinds of items designed to hold different kinds of information.
* static_stat_data: holds the owner, permissions, last access time, creation time, last modification time, size, and the number of links (names) to a file.
* cmpnd_dir_item: holds directory entries, and the keys of the files they link to.
* extent pointers: explained above
* node pointers: explained above
* bodies: holds parts of files that are not large enough to be stored in unfleaves.

Units

We call a unit that which we must place as a whole into an item, without splitting it across multiple items. When traversing an item's contents it is often convenient to do so in units:
* For body items the units are bytes.
* For directory items the units are directory entries. The directory entries contain a name and a key of the file named (or at least the item plugin can pretend they do, in practice the name and key may be compressed).
* For extent items the units are extents. Extent items only contain extents from the same file.
* For static_stat_data the whole stat data item is one indivisible unit of fixed size.

What the Default Node Formats For ReiserFS 4.0 Look Like

An unformatted leaf node (unfleaf node), which is the only node without a Node_Header, has the trivial structure:

 [ .......... raw object data, filling the entire block .......... ]

A formatted leaf node has the structure:

 [ Block_Head | Item_Body0 | Item_Body1 | - - - | Item_Bodyn | ....Free Space.... | Item_Headn | - - - | Item_Head1 | Item_Head0 ]

A twig node has the structure:

 [ Block_Head | Item_Body0 (NodePointer0) | Item_Body1 (ExtentPointer1) | Item_Body2 (NodePointer2) | Item_Body3 (ExtentPointer3) | - - - | Item_Bodyn (NodePointern) | ....Free Space.... | Item_Headn | - - - | Item_Head0 ]

A branch node has the structure:

 [ Block_Head | Item_Body0 (NodePointer0) | - - - | Item_Bodyn (NodePointern) | ....Free Space.... | Item_Headn | - - - | Item_Head0 ]

Tree Design Concepts

Height Balancing versus Space Balancing

Height Balanced Trees are trees such that each possible search path from root node to leaf node has exactly the same length (length = number of nodes traversed from root node to leaf node). For instance, the height of the tree in Figure 1 is four, while the height of the left hand tree in Figure 1.3 is three, and of the single node in Figure 2 is 1. The term balancing is used for several very distinct purposes in the balanced tree literature. Two of the most common are: to describe balancing the height, and to describe balancing the space usage within the nodes of the tree. These quite different definitions are unfortunately a classic source of confusion for readers of the literature. Most algorithms for accomplishing height balancing do so by only growing the tree at the top. Thus the tree never gets out of balance.

Figure 6. This is an unbalanced tree: a 4 level tree with fanout n = 3 that has lost some nodes to deletions and needs to be balanced.

Three principal considerations in tree design

Three of the principal considerations in tree design are:
* the fanout rate (see below)
* the tightness of packing
* the amount of the shifting of items in the tree from one node to another that is performed (which creates delays due to waiting while things move around in RAM, and on disk).

Fanout

The fanout rate n refers to how many nodes may be pointed to by each level's nodes.
If each node can point to n nodes of the level below it (see Figure 7), then starting from the top, the root node points to n internal nodes at the next level, each of which points to n more internal nodes at its next level, and so on: m levels of internal nodes can point to n^m leaf nodes containing items in the last level. The more you want to be able to store in the tree, the larger have to be the fields in the key that first distinguish the objects (the objectids), and then select parts of the object (the offsets). This means your keys must be larger, which decreases fanout (unless you compress your keys, but that will wait for our next version....).

Figure 7. Three 4 level, height balanced trees with fanouts n = 1, 2, and 3. The first graph is a four level tree with fanout n = 1. It has just four nodes, starts with the (red) root node, traverses the (burgundy) internal and (blue) twig nodes, and ends with the (green) leaf node which contains the data. The second tree, with 4 levels and fanout n = 2, starts with a root node, traverses 2 internal nodes, each of which points to two twig nodes (for a total of four twig nodes), and each of these points to 2 leaf nodes for a total of 8 leaf nodes. Lastly, a 4 level, fanout n = 3 tree is shown which has 1 root node, 3 internal nodes, 9 twig nodes, and 27 leaf nodes.
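The n^m arithmetic above, and its flip side (how shallow a tree a given fanout buys you), can be sketched in a few lines of Python. The function names are invented for illustration; real Reiser4 fanout varies from node to node.

```python
def max_leaves(fanout, internal_levels):
    """With fanout n and m internal levels, the tree addresses n**m leaves."""
    return fanout ** internal_levels

def height_for(leaf_count, fanout):
    """Smallest number of internal levels whose fanout covers leaf_count."""
    levels = 0
    while fanout ** levels < leaf_count:
        levels += 1
    return levels

assert max_leaves(3, 3) == 27            # the n = 3 tree in Figure 7
assert max_leaves(2, 3) == 8             # the n = 2 tree in Figure 7
# High fanout keeps trees shallow: 16 million leaves need only 3 internal
# levels at fanout 256, which is why larger keys (lower fanout) hurt.
assert height_for(16_000_000, 256) == 3
```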
What Are B+Trees, and Why Are They Better than B-Trees

It is possible to store not just pointers and keys in internal nodes, but also to store the objects those keys correspond to in the internal nodes. This is what the original B-tree algorithms did. Then B+trees were invented, in which only pointers and keys are stored in internal nodes, and all of the objects are stored at the leaf level. Figure 8. Figure 9.

Warning! I found from experience that most persons who don't first deeply understand why B+trees are better than B-trees won't later understand explanations of the advantages of putting extents on the twig level rather than using BLOBs. The same principles that make B+trees better than B-trees also make Reiser4 faster than using BLOBs like most databases do. So make sure this section fully digests before moving on to the next section, ok? ;-)

B+Trees Have Higher Fanout Than B-Trees

Fanout is increased when we put only pointers and keys in internal nodes, and don't dilute them with object data. Increased fanout increases our ability to cache all of the internal nodes, because there are fewer internal nodes. Often persons respond to this by saying, "but B-trees cache objects, and caching objects is just as valuable". The answer is that, on average, it is not. Of course, discussing averages makes the discussion much harder. We need to discuss some cache design principles for a while before we can get to this.

Cache Design Principles

Reiser's Untie The Uncorrelated Principle of Cache Design

Tying the caching of things whose usage does not strongly correlate is bad. Suppose:
* you have two sets of things, A and B.
* you need things from those two sets at semi-random, with there existing a tendency for some items to be needed much more frequently than others, but which items those are can shift slowly over time.
* you can keep things around after you use them in a cache of limited size.
* you tie the caching of every thing from A to the caching of another thing from B.
(that means, whenever you fetch something from A into the cache, you fetch its partner from B into the cache) Then this increases the amount of cache required to store everything recently accessed from A. If there is a strong correlation between the need for the two particular objects that are tied in each of the pairings, stronger than the gain from spending those cache resources on caching more members of B according to the LRU algorithm, then this might be worthwhile. If there is no such strong correlation, then it is bad. But wait, you might say, you need things from B also, so it is good that some of them were cached. Yes, you need some random subset of B. The problem is that without a correlation existing, the things from B that you need are not especially likely to be those same things from B that were tied to the things from A that were needed. This tendency to inefficiently tie things that are randomly needed exists outside the computer industry. For instance, suppose you like both popcorn and sushi, with your need for them on a particular day being random. Suppose that you like movies randomly. Suppose a theater requires you to eat only popcorn while watching the movie you randomly found optimal to watch, and not eat sushi from the restaurant on the corner while watching that movie. Is this a socially optimum system? Suppose quality is randomly distributed across all the hot dog vendors: if you can only eat the hot dog produced by the best movie displayer on a particular night that you want to watch a movie, and you aren't allowed to bring in hot dogs from outside the movie theater, is it a socially optimum system? Optimal for you? Tying the uncorrelated is a very common error in designing caches, but it is still not enough to describe why B+Trees are better. With internal nodes, we store more than one pointer per node. That means that pointers are not separately cached. 
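The untie-the-uncorrelated principle can be illustrated with a toy LRU cache simulation. Everything here is invented for the illustration: the skewed workload, the pairing rule that drags a partner from the other set into the cache on every miss, and the capacity. The point is only that, absent correlation, the dragged-in partners squander cache slots.

```python
from collections import OrderedDict
import random

def lru_hit_rate(accesses, capacity, tied):
    """LRU cache; if tied, every miss also fetches the item's partner
    from the other set (the tying described in the text)."""
    cache, hits = OrderedDict(), 0
    for item in accesses:
        if item in cache:
            hits += 1
            cache.move_to_end(item)
        else:
            cache[item] = True
            if tied:
                partner = ("B", item[1]) if item[0] == "A" else ("A", item[1])
                cache[partner] = True
                cache.move_to_end(partner)
        while len(cache) > capacity:
            cache.popitem(last=False)      # evict least recently used
    return hits / len(accesses)

rng = random.Random(42)

def skewed(set_name):
    """80% of accesses hit a hot fifth of each set; the hot ranges of A
    and B are disjoint, so tied partners are uncorrelated with demand."""
    base = 0 if set_name == "A" else 50
    if rng.random() < 0.8:
        return (set_name, base + rng.randrange(20))
    return (set_name, rng.randrange(100))

workload = [skewed(rng.choice("AB")) for _ in range(20000)]
untied = lru_hit_rate(workload, capacity=50, tied=False)
tied = lru_hit_rate(workload, capacity=50, tied=True)
assert untied > tied   # tying uncorrelated partners wastes cache space
```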
You could well argue that pointers and the objects they point to are more strongly correlated than the different pointers. We need another cache design principle. Reiser's Maximize The Variance Principle of Cache Design If two types of things that are cached and accessed, in units that are aggregates, have different average temperatures, then segregating the two types into separate units helps caching. For balanced trees, these units of aggregates are nodes. This principle applies to the situation where it may be necessary to tie things into larger units for efficient access, and guides what things should be tied together. Suppose you have R bytes of RAM for cache, and D bytes of disk. Suppose that 80% of accesses are to the most recently used things which are stored in H (hotset) bytes of nodes. Reducing the size of H to where it is smaller than R is very important to performance. If you evenly disperse your frequently accessed data, then a larger cache is required and caching is less effective. # If, all else being equal, we increase the variation in temperature among all aggregates (nodes), then we increase the effectiveness of using a fast small cache. # If two types of things have different average temperatures (ratios of likelihood of access to size in bytes), then separating them into separate aggregates (nodes) increases the variation in temperature in the system as a whole. # Conclusion: If all else is equal, if two types of things cached several to an aggregate (node) have different average temperatures then segregating them into separate nodes helps caching. Pointers To Nodes Have A Higher Average Temperature Than The Nodes They Point To Pointers to nodes tend to be frequently accessed relative to the number of bytes required to cache them. Consider that you have to use the pointers for all tree traversals that reach the nodes beneath them and they are smaller than the nodes they point to. 
Putting only node pointers and delimiting keys into internal nodes concentrates the pointers. Since pointers tend to be more frequently accessed per byte of their size than items storing file bodies, a high average temperature difference exists between pointers and object data. According to the caching principles described above, segregating these two types of things with different average temperatures, pointers and object data, increases the efficiency of caching. Segregating By Temperature Directly Now you might say, well, why not segregate by actual temperature instead of by type which only correlates with temperature? We do what we can easily and effectively code, with not just temperature segregation in consideration. There are tree designs which rearrange the tree so that objects which have a higher temperature are higher in the tree than pointers with a lower temperature. The difference in average temperature between object data and pointers to nodes is so high that I don't find such designs a compelling optimization, and they add complexity. I could be wrong. If one had no compelling semantic basis for aggregating objects near each other (this is true for some applications), and if one wanted to access objects by nodes rather than individually, it would be interesting to have a node repacker sort object data into nodes by temperature. You would need to have the repacker change the keys of the objects it sorts. Perhaps someone will have us implement that for some application someday for Reiser4. BLOBs Unbalance the Tree, Reduce Segregation of Pointers and Data, and Thereby Reduce Performance BLOBs, Binary Large OBjects, are a method of storing objects larger than a node by storing pointers to nodes containing the object. These pointers are commonly stored in what is called the leaf nodes (level 1, except that the BLOBs are then sort of a basement "level B" :-\ ) of a "B*" tree. 
Figure 10. A Binary Large OBject (BLOB) has been inserted into a tree that was four levels, with pointers to its blocks stored in a leaf node; in this case the BLOB's blocks are all contiguous. This is what a ReiserFS V3 tree looks like.

BLOBs are a significant unintentional definitional drift, albeit one accepted by the entire database community. This placement of pointers into nodes containing data is a performance problem for ReiserFS V3, which uses BLOBs. (Never accept that "let's just try it my way and see and we can change it if it doesn't work" argument. It took years and a disk format change to get BLOBs out of ReiserFS, and performance suffered the whole time, if tails were turned on.) Because the pointers to BLOBs are diluted by data, it makes caching all pointers to all nodes in RAM infeasible for typical file sets. Reiser4 returns to the classical definition of a height balanced tree, in which the lengths of the paths to all leaf nodes are equal. It does not try to pretend that all of the nodes storing objects larger than a node are somehow not part of the tree, even though the tree stores pointers to them. As a result, the amount of RAM required to store pointers to nodes is dramatically reduced. For typical configurations, RAM is large enough to hold all of the internal nodes.

Figure 11. A Reiser4, 4 level, height balanced tree with fanout = 3, and the data that was stored in BLOBs now stored in extents in the level 1 leaf nodes and pointed to by extent pointers stored in the level 2 twig nodes.

Gray and Reuter say the criterion for searching external memory is to "minimize the number of different pages along the average (or longest) search path.
....by reducing the number of different pages for an arbitrary search path, the probability of having to read a block from disk is reduced." (1993, Transaction Processing: Concepts and Techniques, Morgan Kaufmann Publishers, San Francisco, CA, p. 834)

My problem with this explanation of why the height balanced approach is effective is that it does not convey that you can get away with having a moderately unbalanced tree provided that you do not significantly increase the total number of internal nodes. In practice, most trees that are unbalanced do have significantly more internal nodes. In practice, most moderately unbalanced trees have a moderate increase in the cost of in-memory tree traversals, and an immoderate increase in the amount of IO due to the increased number of internal nodes. But if one were to put all the BLOBs together in the same location in the tree, since the number of internal nodes would not significantly increase, the performance penalty for having them on a lower level of the tree than all other leaf nodes would not be a significant additional IO cost. There would be a moderate increase in that part of the tree traversal time cost which is dependent on RAM speed, but this would not be so critical. Segregating BLOBs could perhaps substantially recover the performance lost by architects not noticing the drift in the definition of height balancing for trees. It might be undesirable to segregate objects by their size rather than just their semantics though. Perhaps someday someone will try it and see what results.

Dancing Trees Are Faster Than Balanced Trees

Balanced trees have traditionally employed a fixed criterion for determining whether nodes should be squeezed together into fewer nodes so as to save space. This criterion is traditionally satisfied at the end of every modification to the tree.
A typical such criterion is to guarantee that after each modification to the tree the modified node cannot be squeezed together with its left and right neighbor into two or fewer nodes. ReiserFS V3 uses that criterion for its leaf nodes. The more neighboring nodes you consider for squeezing into one fewer nodes, the more memory bandwidth you consume on average per modification to the tree, and the more likely you are to need to read those nodes because they are not in memory. It is a typical pattern in memory management algorithm design that the more tightly packed memory is kept, the more overhead is added to the cost of changing what is stored where in it. This overhead can be significant enough that some commercial databases actually only delete nodes when they are completely empty, and they feel that in practice this works well. Trees that adhere to fixed space usage balancing criteria can have many things rigorously proven about their worst case performance in publishable papers. This is different from their being optimal. An algorithm can have worse bounds on its theoretical worst case performance and be a better algorithm. Just because one cannot rigorously define average usage patterns does not mean they are the slightest bit less important. Sorry mere mortal mathematicians, that is life. Maybe some might prefer to think about the questions that they can define and answer rigorously, but this does not in the slightest make them the right questions. Yes, I am a chaotic.... In Reiser4 we employ not balanced trees, but dancing trees. Dancing trees merge insufficiently full nodes, not with every modification to the tree, but instead: * in response to memory pressure triggering a flush to disk, * as a result of a transaction closure flushing nodes to disk If It Is In RAM, Dirty, and Contiguous, Then Squeeze It ALL Together Just Before Writing Let a slum be defined as a sequence of contiguous in the tree order, and dirty in this transaction, nodes. 
(In simpler words, a bunch of dirty nodes that are right next to each other.) A dancing tree responds to memory pressure by squeezing and flushing slums. It is possible that merely squeezing a slum might free up enough space that flushing is unnecessary, but the current implementation of Reiser4 always flushes the slums it squeezes. This is not necessarily the right approach, but we found it simpler and good enough for now. Another simplification we choose to engage in for now is that instead of trying to estimate whether squeezing a slum will save space before squeezing it, we just squeeze it and see. Balanced trees have an inherent tradeoff between balancing cost and space efficiency. If they consider more neighboring nodes, for the purpose of merging them to save a node, with every change to the tree, then they can pack the tree more tightly at the cost of moving more data with every change to the tree. By contrast, with a dancing tree, you simply take a large slum, shove everything in it as far to the left as it will go, and then free all the nodes in the slum that are left with nothing remaining in them, at the time of committing the slum's contents to disk in response to memory pressure. This gives you extreme space efficiency when slums are large, at a cost in data movement that is lower than it would be with an invariant balancing criterion because it is done less often. By compressing at the time one flushes to disk, one compresses less often, and that means one can afford to do it more thoroughly. By compressing dirty nodes that are in memory, one avoids performing additional I/O as a result of balancing. Procrastination Leads To Wiser Decisions: Allocate on Flush ReiserFS V3 assigns block numbers to nodes as it creates them. XFS is smarter, they wait until the last moment just before writing nodes to disk. I'd like to thank the XFS team for making an effort to ensure that I understood the merits of their approach. 
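A toy sketch of the two ideas above: when memory pressure triggers a flush, a slum is shoved leftward into the fewest nodes that will hold its items, and only then are block numbers assigned to the nodes that survive. The names and the tiny node capacity are invented for illustration; this is not Reiser4's flush code.

```python
NODE_CAPACITY = 4   # items per node; tiny, for illustration only

def squeeze_slum(slum):
    """Pack a slum (a list of adjacent dirty nodes, each a list of items,
    in tree order) as far left as it will go. Items keep their order;
    nodes left empty are simply not emitted, which is how space is freed."""
    items = [item for node in slum for item in node]
    return [items[i:i + NODE_CAPACITY] for i in range(0, len(items), NODE_CAPACITY)]

def allocate_on_flush(slum, next_free_block):
    """Assign block numbers only at flush time, after squeezing."""
    packed = squeeze_slum(slum)
    return [(next_free_block + i, node) for i, node in enumerate(packed)]

slum = [["a", "b"], ["c"], ["d", "e", "f"]]   # three half-full dirty nodes
flushed = allocate_on_flush(slum, next_free_block=1000)
assert [n for _, n in flushed] == [["a", "b", "c", "d"], ["e", "f"]]  # one node freed
assert [b for b, _ in flushed] == [1000, 1001]  # contiguous, decided at flush time
```

A file whose nodes are created and deleted entirely between flushes never gets squeezed or allocated at all, which is exactly the benefit of deferring allocation.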
The easy way to see its merits is to consider a file that is deleted before it reaches disk. Such a file should have no effect on the disk layout.

= Reiser4: The Atomic Filesystem =

== Reducing The Damage of Crashing ==

When a computer crashes, data in RAM which has not reached disk is lost. You might at first be tempted to think that we then want to keep all of the data that did reach disk. Suppose that you were performing a transfer of $10 from bank account A to bank account B, and this consisted of two operations: 1) debit $10 from A, and 2) credit $10 to B. Suppose that 1) but not 2) reached disk before the computer crashed. It would be better to disregard 1) than to let 1) but not 2) take effect, yes? When there is a set of operations which we ensure will all take effect, or none take effect, we call the set as a whole an atom. Reiser4 implements all of its filesystem system calls (requests to the kernel to do something are called system calls) as fully atomic operations, and allows one to define new atomic operations using its plugin infrastructure. Why don't all filesystems do this? Performance. Reiser4 employs new algorithms that allow it to make these operations atomic at little additional cost, where other filesystems have paid a heavy, usually prohibitive, price to do that. We hope to share with you how that is done.

== A Brief History Of How Filesystems Have Handled Crashes ==

=== Filesystem Checkers ===

Originally filesystems had checkers that would run after every crash. The problem with that was that 1) the checkers cannot handle every form of damage well, and 2) the checkers run for a long time.
The amount of data stored on hard drives increased faster than the transfer rate (the rate at which a hard drive moves data from the platter spinning inside it into the computer's RAM when asked to do one large continuous read, or the rate in the other direction for writes), which means that the checkers took longer to run, and as the decades ticked by it became less and less reasonable for a mission-critical server to wait for the checker.

=== Fixed Location Journaling ===

A solution was adopted of first writing each atomic operation to a location on disk called the journal or log, and then, only after each atom had fully reached the journal, writing it to the committed area of the filesystem. The problem with this is that twice as much data needs to be written. On the one hand, if the workload is dominated by seeks, this is not as much of a burden as one might think. On the other hand, for writes of large files, it halves performance, because such writes are usually transfer-time dominated. For this reason, meta-data journaling came to dominate general purpose usage. With meta-data journaling, the filesystem guarantees that all of its operations on its meta-data will be done atomically. If a file is being written to, the data being written may be corrupted as a result of non-atomic data operations, but the filesystem's internals will all be consistent. The performance advantage was substantial. V3 of reiserfs offers both meta-data and data journaling, and defaults to meta-data journaling because that is the right solution for most users. Oddly enough, meta-data journaling is much more work to implement, because it requires being precise about what needs to be journaled. As is so often the case in programming, doing less work requires more code.
With fixed location data journaling, the overhead of making each operation atomic is too high for it to be appropriate for average applications that don't especially need it, because of the cost of writing twice. Applications that do need atomicity are written to use fsync and rename to accomplish it, and these tools are simply terrible for that job. Terrible in performance, and terrible in the ugliness they add to the coding of applications. Stuffing a transaction into a single file just because you need the transaction to be atomic is hardly what one would call flexible semantics. Also, data journaling, with all its performance cost, still does not necessarily guarantee that every system call is fully atomic, much less that one can construct sets of operations that are fully atomic. It usually merely guarantees that the files will not contain random garbage, however many blocks of them happen to get written, and however much the application might view the result as inconsistent data. I hope you understand that we are trying to set a new expectation here for how secure a filesystem should keep your data when we provide these atomicity guarantees.

=== Wandering Logs ===

One way to avoid having to write the data twice is to change one's definition of where the log area and the committed area are, instead of moving the data from the log to the committed area. There is an annoying complication to this, though, in that there are probably a number of pointers to the data from the rest of the filesystem, and we need them to point to the new data. When the commit occurs, we need to write those pointers so that they point to the data we are committing. Fortunately, these pointers tend to be highly concentrated as a result of our tree design. But wait: if we are going to update those pointers, then we want to commit those pointers atomically also, which we could do if we write them to another location and update the pointers to them, and....
up the tree the changes ripple. When we get to the top of the tree, since disk drives write sectors atomically, the block number of the top can be written atomically into the superblock by the disk, thereby committing everything the new top points to. This is indeed the way WAFL, the Write Anywhere File Layout filesystem invented by Dave Hitz at Network Appliance, works. It always ripples changes all the way to the top, and indeed that works rather well in practice; most of their users are quite happy with its performance.

== Writing Twice May Be Optimal Sometimes ==

Suppose that a file is currently well laid out, you write to a single block in the middle of it, and you then expect to do many reads of the file. That is an extreme case illustrating that sometimes it is worth writing twice so that a block can keep its current location while committing atomically. If one writes a node twice in this way, one also does not need to update its parent and ripple all the way to the top of the tree. Our code is a toolkit that can be used to implement different layout policies, and one of the available choices is whether to write over a block in its current place, or to relocate it somewhere else. I don't think there is one right answer for all usage patterns. If a block is adjacent to many other dirty blocks in the tree, then this decreases the significance of the cost to read performance of relocating it and its neighbors. If one knows that a repacker will run once a week (a repacker is expected for V4.1, and is, a bit oddly, absent from WAFL), this also decreases the cost of relocation. After a few years of experimentation, measurement, and user feedback, we will say more about our experiences in constructing user-selectable policies.

Do we pay a performance penalty for making Reiser4 atomic? Yes, we do. Is it an acceptable penalty?
We picked up a lot more performance from other improvements in Reiser4 than we lost to atomicity, and so it is not isolated in our measurements, but I am unscientifically confident that the answer is yes. If changes are either large or batched together with enough other changes to become large, the performance penalty is low and drowned out by other performance improvements. Scattered small changes threaten us with read performance losses compared to overwriting in place and taking our chances with the data's consistency if there is a crash, but use of a repacker will mostly alleviate this scenario. I have to say that in my heart I don't have any serious doubts that for the general purpose user the increase in data security is worthwhile. The users, though, will have the final say.

== Committing ==

A transaction preserves the previous contents of all modified blocks in their original location on disk until the transaction commits; commit means the transaction has reached a state where it will be completed even if there is a crash. The dirty blocks of an atom (which were captured and subsequently modified) are divided into two sets, relocate and overwrite, each of which is preserved in a different manner. The relocatable set is the set of blocks that have a dirty parent in the atom. The relocate set is those members of the relocatable set that will be written to a new or first location rather than overwritten. The overwrite set contains all dirty blocks in the atom that need to be written to their original locations, which is all those not in the relocate set. In practice this is those which do not have a parent we want to dirty, plus also those for which overwrite is the better layout policy despite the write-twice cost. Note that the superblock is the parent of the root node, and the free space bitmap blocks have no parent. By these definitions, the superblock and modified bitmap blocks are always part of the overwrite set.
The wandered set is the set of blocks that the overwrite set will be written to temporarily until the overwrite set commits. An interesting definition is the minimum overwrite set, which uses the same definitions as above with the following modification: if at least two dirty blocks have a common parent that is clean, then that parent is added to the minimum overwrite set, and its dirty children are removed from the overwrite set and placed in the relocate set. This policy is an example of what will be experimented with in later versions of Reiser4 using the layout toolkit. For space reasons, we leave out the full details on exactly when we relocate vs. overwrite, and the reader should not regret this, because years of experimenting probably lie ahead before we can speak with the authority necessary for a published paper on the effects of the many details and variations possible.

When we commit, we write a wander list, which consists of a mapping of the wandered set to the overwrite set. The wander list is a linked list of blocks containing pairs of block numbers. The last act of committing a transaction is to update the superblock to point to the front of that list. Once that is done, if there is a crash, crash recovery will go through that list and "play" it, which means to write the wandered set over the overwrite set. If there is not a crash, we will also play it. There are many more details of how we handle the deallocation of wandered blocks, the handling of bitmap blocks, and so forth. You are encouraged to read the comments at the top of our source code files (e.g. wander.c) for such details....

== Journalling Optimizations ==

=== Copy-on-capture ===

Suppose one wants to capture a node which belongs to an atom with stage >= ASTAGE_PRE_COMMIT. Such a capture request would normally have to wait (sleep in capture_fuse_wait()) while the atom is committed. The copy-on-capture optimization allows the capture request to be satisfied by creating a copy of the node being captured.
The commit process takes control of one copy of the node, and the capturing process takes control of the other. This does not lead to any node version conflicts, because it is guaranteed that the copy held by the commit process will not be modified.

=== Steal-on-capture ===

The idea of the steal-on-capture optimization is that only the last committed transaction to modify an overwrite block actually needs to write that block; other transactions can skip writing that block post-commit. This optimization, which is also present in ReiserFS version 3, means that frequently modified overwrite blocks will be written less than twice per transaction. With this optimization a frequently modified overwrite block may avoid being overwritten by a series of atoms; as a result, crash recovery must replay more atoms than without the optimization. If an atom has overwrite blocks stolen, the atom must be replayed during crash recovery until every stealing atom commits.

== Repacker ==

Another way of escaping from the balancing time vs. space efficiency tradeoff is to use a repacker. 80% of files on the disk remain unchanged for long periods of time. It is efficient to pack them perfectly, by using a repacker that runs much less often than every write to disk. This repacker goes through the entire tree ordering, from left to right and then from right to left, alternating each time it runs. When it goes from left to right in the tree ordering, it shoves everything as far to the left as it will go, and when it goes from right to left it shoves everything as far to the right as it will go. (Left means small in key or in block number. :-) ) In the absence of FS activity, the effect of this over time is to sort by tree order (defragment) and to pack with perfect efficiency. Reiser4.1 will modify the repacker to insert controlled "air holes", as it is well known that insertion efficiency is harmed by overly tight packing.
I hypothesize that it is more efficient to periodically run a repacker that systematically repacks using large I/Os than to perform lots of one-block reads of the neighboring nodes of each modification point so as to preserve a balancing invariant in the face of poorly localized modifications to the tree.

= Plugins =

== 8 Kinds of Plugins Make Reiser4 The Most Tweakable Filesystem Going ==

== File Plugins ==

Every file possesses a plugin id, and every directory possesses a plugin id. This plugin id identifies a set of methods. The set of methods embodies all of the different possible interactions with the file or directory that come from sources external to ReiserFS. It is a layer of indirection added between the external interface to ReiserFS and the rest of ReiserFS. Each method has a method id. It will be usual to mix and match methods from other plugins when composing plugins.

== Directory Plugins ==

Reiser4 will implement a plugin for traditional directories. It will implement directory-style access to file attributes as part of the plugin for regular files. Later we will describe why this is useful. Other directory plugins we will leave for later versions. There is no deep reason for this deferral. It is simply the randomness of what features attract sponsors and make it into a release specification; there are no sponsors at the moment for additional directory plugins. I have no doubt that they will appear later; new directory plugins will be too much fun to miss out on. :-)

== Hash Plugins ==

A directory is a mapping from file names to the files themselves. This mapping is implemented through Reiser4's internal balanced tree. Unfortunately, file names cannot be used as keys until keys of variable length are implemented, unless unreasonable limitations on maximal file name length are imposed. To work around this, the file name is hashed, and the hash is used as a key in the tree. No hash function is perfect, and there will always be hash collisions, that is, file names having the same hash value.
Previous versions of reiserfs (3.5 and 3.6) used a "generation counter" to overcome this problem: keys for file names having the same hash value were distinguished by having different generation counters. This amortized hash collisions at the cost of reducing the number of bits used for hashing. This "generation counter" technique is actually an ad hoc form of support for non-unique keys. Keeping in mind that some form of this has to be implemented anyway, it seemed justifiable to implement more regular support for non-unique keys in Reiser4.

Another reason for using hashes is that some (arguably brain-dead) interfaces require them: telldir(3) and seekdir(3). These functions presume that the file system can issue 64-bit "cookies" that can be used to resume a readdir. Cookies are implemented in most filesystems as byte offsets within a directory (which means those filesystems cannot shrink directories), and in ReiserFS as hashes of file names plus a generation counter. Curiously enough, the Single UNIX Specification tags telldir(3) and seekdir(3) as "Extension", because "returning to a given point in a directory is quite difficult to describe formally, in spite of its intuitive appeal, when systems that use B-trees, hashing functions, or other similar mechanisms to order their directories are considered". We order directory entries in ReiserFS by their cookies. This costs us performance compared to ordering lexicographically (but is immensely faster than the linear searching employed by most other Unix filesystems). Depending on the hash and its match to the application usage pattern, there may be more or less performance lossage. Hash plugins will probably remain until version 5 or so, when directory plugins and ordering function plugins will obsolete them. Directory entries will then be ordered by file names like they should be (and possibly stem compressed as well).

== Security Plugins ==

Security plugins handle all security checks.
They are normally invoked by file and directory plugins. Example of reading a file:

* Access the plugin id for the file.
* Invoke the read method for the plugin.
* The read method determines the security plugin for the file.
* That security plugin invokes its read check method to determine whether to permit the read.
* The read check method for the security plugin reads the file/attributes containing the permissions on the file.
* Since file/attributes are also files, this means invoking the plugin for reading the file/attribute.
* The plugin id for this particular file/attribute for this file happens to be inherited (saving space and centralizing control of it).
* The read method for the file/attribute is coded such that it does not check permissions when called by a security plugin method. (Endless recursion is thereby avoided.)
* The file/attribute plugin employs a decompression algorithm specially designed for efficient decompression of our encoding of ACLs.
* The security plugin determines that the read should be permitted.
* The read method continues and completes.

== Item Plugins ==

The balancing code will be able to balance an item iff it has an item plugin implemented for it. The item plugin will implement each of the methods the balancing code needs (methods such as splitting items, estimating how large the split pieces will be, overwriting, appending to, cutting from, or inserting into the item, etc.). In addition to all of the balancing operations, item plugins will also implement intra-item search methods. V3 of ReiserFS understood the structure of the items it balanced. This made adding new types of items, storing such new security attributes as other researchers might develop, too expensive in coding time, greatly inhibiting their addition to ReiserFS.
In writing Reiser4 we hoped that there would be a great proliferation in the types of security attributes in ReiserFS if we made adding one a matter not of modifying the balancing code by our most experienced programmers, but of writing an item handler. This is necessary if we are to achieve our goal of making the addition of each new security attribute an order of magnitude or more easier to perform than it is now.

== Key Assignment Plugins ==

When assigning the key to an item, the key assignment plugin is invoked, and it has a key assignment method for each item type. A single key assignment plugin is defined for the whole FS at FS creation time. We know from experience that there is no "correct" key assignment policy; squid has very different needs from average user home directories. Yes, there could be value in varying it more flexibly than just at FS creation time, but we have to draw the line somewhere when deciding what goes into each release....

== Node Search and Item Search Plugins ==

Every node layout has a search method for that layout, and every item that is searched through has a search method for that item. (When doing searches, we search through a node to find an item, and then search within the item when it contains multiple things to find.)

== Putting Your New Plugin To Work Will Mean Recompiling ==

If you want to add a new plugin, we think your having to ask the sysadmin to recompile the kernel with your new plugin added to it will be acceptable for version 4.0. We will initially code plugin-id lookup as an in-kernel fixed length array lookup, method ids as function pointers, and make no provision for post-compilation loading of plugins. Performance, and coding cost, motivate this.
== Without Plugins We Will Drown ==

People often ask, as ReiserFS grows in features, how will we keep the design from being drowned under the weight of the added complexity, and from reaching the point where it is difficult to work on the code? The infrastructure to support security attributes implemented as files also enables lots of features not necessarily security related. The plugins we are choosing to implement in v4.0 are all security related because of our funding source, but users will add other sorts of plugins, just as they took DARPA's TCP/IP and used it for non-military computers. Only by requiring that all features be implemented in the manner that maximizes code reuse will we keep ReiserFS coding complexity down to where we can manage it over the long term.

== Plugins: FS Programming For The Lazy ==

Most plugins will have only a very few of their features unique to them, and the rest of the plugin will be reused code. What Namesys sees as its role as a DARPA contractor is not primarily supplying a suite of security plugins, though we are doing that, but creating an architectural (not just the license) enabling of lots of outside vendors to efficiently create lots of innovative security plugins that Namesys would never have imagined working by itself.

= Enhancing Security =

By far most casualties in wars have always been civilians. In future information infrastructure attacks, who will take more damage, civilian or military installations? DARPA is funding us to make all GNU/Linux computers throughout the world a little bit more resistant to attack.

== Fine Graining Security ==

=== Good Security Requires Precision In Specification Of Security ===

Suppose you have a large file with many components. A general principle of security is that good security requires precision of permissions.
When security lacks precision, it increases the burden of being secure, and the extent to which users adhere to security requirements in practice is a function of the burden of adhering to them.

=== Space Efficiency Concerns Motivate Imprecise Security ===

Many filesystems make it space-inefficient to store small components as separate files, for various reasons. Not being separate files means that they cannot have separate permissions. Space efficiency is thus one of the reasons for using overly aggregated units of security. ReiserFS currently improves small-file space efficiency by an order of magnitude over most of the existing alternative art. Space efficiency is the hardest of the reasons to eliminate; its elimination makes it that much more enticing to attempt to eliminate the other reasons.

=== Security Definition Units And Data Access Patterns Sometimes Inherently Don't Align ===

Applications sometimes want to operate on a collection of components as a single aggregated stream. (Note that commonly two different applications want to operate on data with different levels of aggregation; the infrastructure for solving this security issue will solve that problem as well.)

=== /etc/passwd As Example ===

I am going to use the /etc/passwd file as an example, not because I think that other solutions won't solve its problems better, but because its implementation as a single flat file in the early Unixes is a wonderful illustrative example of poorly granularized security, one whose problems readers may share my personal experiences with. I hope they will be able to imagine that other, less famous data files could have similar problems. Have you ever tried to figure out just exactly what part of your continually changing /etc/passwd file changed near the time of a break-in? Have you ever wished that you could have a modification time on each field in it?
Have you ever wished that users could change part of it, such as the gecos field, themselves (setuid utilities have been written to allow this, but this is a pedagogical, not a practical, example), without having the power to change it for other users? There were good reasons why /etc/passwd was first implemented as a single file with one single permission governing the entire file. If we can eliminate them one by one, the same techniques for making finer grained security effective will be of value to other highly secure data files.

=== Aggregating Files Can Improve The User Interface To Them ===

Consider the use of emacs on a collection of a thousand small 8-32 byte files, like you might have if you deconstructed /etc/passwd into small files with separable ACLs for every field. It is more convenient in screen real estate, buffer management, and other user interface considerations to operate on them as an aggregation all placed into a single buffer, rather than as a thousand 8-32 byte buffers.

=== How Do We Write Modifications To An Aggregation? ===

Suppose we create a plugin that aggregates all of the files in a directory into a single stream. How does one handle writes to that aggregation that change the length of the components of that aggregation? Richard Stallman pointed out to me that if we separate the aggregated files with delimiters, then emacs need not be changed at all to acquire an effective interface for large numbers of small files accessed via an aggregation plugin. If /new_syntax_access_path/big_directory_of_small_files/.glued is a plugin that aggregates every file in big_directory_of_small_files with a delimiter separating every file within the aggregation, then one can simply type emacs /new_syntax_access_path/big_directory_of_small_files/.glued, and the filesystem has done all the work emacs needs to be effective at this. Not a line of emacs needs to be changed.
One needs to be able to choose different delimiting syntax for different aggregation plugins so that one can, for say the passwd file, aggregate subdirectories into lines, and files within those subdirectories into colon-separated fields within the line. XML would benefit from yet other delimiter construction rules. (We have been told by Philipp Guehring of LivingXML.NET that ReiserFS is higher performance than any database for storing XML, so this issue is not purely theoretical.)

=== Aggregation Is Best Implemented As Inheritance ===

In summary, to be able to achieve precision in security we need to have inheritance with specifiable delimiters, and we need whole file inheritance to support ACLs.

=== One Plugin Using Delimiters That Resemble sys_reiser4() Syntax ===

We provide the infrastructure for constructing plugins that implement arbitrary processing of writes to inheriting files, but we also supply one generic inheriting file plugin that intentionally uses delimiters very close to the sys_reiser4() syntax. We will document the syntax more fully when that code is working; for now, syntax details are in the comments in the file invert.c in the source code.

== API Suitable For Accessing Files That Store Security Attributes ==

A new system call sys_reiser4() will be implemented to support applications that don't have to be fooled into thinking that they are using POSIX. Through this entry point a richer set of semantics will access the same files that are also accessible using POSIX calls. reiser4() will not implement more than hierarchical names. A full set theoretic naming system as described on our future vision page will not be implemented before SSN Reiserfs is implemented. (Distributed Reiserfs is our distributed filesystem, Semi-Structured Naming (SSN) Reiserfs is our enhanced semantics; whether we implement Distributed Reiserfs or SSN Reiserfs first depends on which sponsors we find. ;-) )
reiser4() will implement all features necessary to access ACLs as files/directories rather than as something neither file nor directory. These include opening and closing transactions, performing a sequence of I/Os in one system call, and accessing files without use of file descriptors (necessary for efficient small I/O). reiser4() will use a syntax suitable for evolving into SSN Reiserfs syntax with its set theoretic naming.

=== Flaws In The Traditional File API When Applied To Security Attributes ===

Security related attributes tend to be small. The traditional filesystem API for reading and writing files has these flaws in the context of accessing security attributes:

* Creating a file descriptor is excessive overhead and not useful when accessing an 8 byte attribute.
* A system call for every attribute accessed is too much overhead when accessing lots of little attributes.
* Lacking constraints: it is often important to constrain what is written to the attribute, sometimes in complex ways.
* Lacking atomic semantics: often one needs to update multiple attributes as one action that is guaranteed to either fully succeed or fully fail.

=== The Usual Resolution Of These Flaws Is A One-Off Solution ===

The usual response to these flaws is that people adding security related and other attributes create a set of methods unique to their attributes, plus non-reusable code to implement those methods, in which their particular attributes are accessed and stored not using the methods for files, but using their particular methods for that attribute. Their particular API for that attribute typically does a one-off instantiation of a lightweight, single-system-call, write-constrained, atomic access, with no code being reusable by those who want to modify file bodies. It is basic and crucial to system design to decompose desired functionality into reusable, orthogonal, separated components.
Persons designing security attributes are typically doing it without the filesystem that they want offering them a proper foundation and toolkit. They need more help from us core FS developers. Linus said that we can have a system call to use as our experimental plaything in this. With what I have in mind for the API, one rather flexible system call is all we want for creating atomic, lightweight, batched, constrained accesses to files, with each of those adjectives being an orthogonal optional feature that may or may not be invoked in a particular instance of the new system call.

=== One-Off Solutions Are A Lot of Work To Do A Lot Of ===

Looking at the coin from the other side, we want to make it an order of magnitude less work to add features to ReiserFS, so that both users and Namesys can add at least an order of magnitude more of them. To verify that it is truly more extensible you have to do some extending, and our DARPA funding motivates us to instantiate most of those extensions as new security features. This system call's syntax enables attributes to be implemented as a particular type of file. It avoids uglifying the semantics with two APIs for two supposedly different kinds of objects that don't truly need different treatment. All of its special features that are useful for accessing particular attributes are also available for use on files. It has symmetry, and its features have been fully orthogonalized. There is nothing particularly interesting about this system call to a languages specialist (its ideas were explored decades ago, except by filesystem developers) until SSN Reiserfs, when we will further evolve it into a set theoretic syntax that deconstructs tuple structured names into hierarchy and vicinity set intersection. That is described at www.namesys.com/whitepaper.html.

== Steps For Creating A Security Attribute ==

You can create a new security attribute by:

* Defining a plugin id.
* Composing a set of methods for the plugin from ones you create or reuse from other existing plugins.
* Defining a set of items that act as the storage containers of the object, or reusing existing items from other plugins (e.g. regular files).
* Implementing item handlers for all of the new items you create.
* Creating a key assignment algorithm for all of the new items.

== reiser4() System Call Description ==

The reiser4() system call (still being debugged at the time of writing) executes a sequence of commands separated by commas. Assignment and transaction are the commands supported in reiser4(); more commands will appear in SSN Reiserfs. <- and <<- are two of the assignment operators.

lhs (assignment target) values:
* /..../process/range/(offset<-(loff_t),last_byte<-(loff_t)) assigns (writes) to the buffer starting at address offset in the process address space, ending at last_byte. (The assignment source may be smaller or larger than the assignment target.) Representation of offset and last_byte is left to the coder to determine; it is an issue that will be of much dispute and little importance. Notice that / is used to indicate that the order of the operands matters; see the future vision whitepaper for details of why this is appropriate syntax design. Note the lack of a file descriptor.
* /filename assigns to the file named filename.
* /filename/..../range/(offset<-(loff_t),last_byte<-(loff_t)) writes to the body, starting at offset, ending not past last_byte.
* /filename/..../range/(offset<-(loff_t)) writes to the body starting at offset.

rhs (assignment source) values:
* /..../process/range/(offset<-(loff_t),last_byte<-(loff_t)) reads from the buffer starting at address offset in the process address space, ending at last_byte. Representation of offset and last_byte is left to the coder to determine, as it is an issue that will be of much dispute and little importance.
* /filename reads the entirety of the file named filename.
* /filename/..../range/(offset<-(loff_t),last_byte<-(loff_t)) reads from the body, starting at offset, ending not past last_byte.
* /filename/..../range/(offset<-(loff_t)) reads from the body starting at offset until the end.
* /filename/..../stat/owner reads from the ownership field of the stat data. (Stat data is that which is returned by the stat() system call (owner, permissions, etc.) and stored on a per file basis by the FS.)

Note that "...." and "process" are style conventions for the name of a hidden subdirectory implementing methods and accessing metadata supported by a plugin. It is possible to rename it, etc. We had a discussion about whether to instead use names that could not clash with any legitimate name likely to be used by users. Vladimir Demidov suggested that cryptic names have historically harmed the acceptance of several languages, and so it was realized that being novice unfriendly in the naming was worse than risking a name collision, especially since a collision can be cured by using rename on "...." and "process" for the few cases where it is necessary.

== Constraints ==

(Note: this is not yet coded.) Another way security may be insufficiently fine grained is in values: it can be useful to allow persons to change data, but only within certain constraints. For this project we will implement plugins; one type of plugin will be write constraints. Write constraints are invoked upon write to a file; if they return non-error, then the write is allowed. We will implement two trivial sample write-constraint plugins. One will be in the form of a kernel function, loadable as a kernel module, which returns non-error (thus allowing the write) if the file consists of the strings "secret" or "sensitive" but not "top-secret". The other, which does exactly the same, will be in the form of a perl program residing in a file and executed in user-space.
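The sample write constraint just described (allow "secret" or "sensitive", reject "top-secret") is simple enough to sketch. This is an illustrative user-space Python predicate, not the planned kernel-module or perl-process plugin; the function names are invented for the example.

```python
# Sketch of the sample write-constraint check described in the text:
# the write is allowed only if the proposed contents mention "secret"
# or "sensitive" and never "top-secret".  Note the order of the tests:
# "top-secret" contains "secret" as a substring, so the forbidden
# marker must be checked first.

def write_constraint(proposed: bytes) -> bool:
    text = proposed.decode("utf-8", errors="replace")
    if "top-secret" in text:          # forbidden marker anywhere
        return False
    return "secret" in text or "sensitive" in text

def constrained_write(path, data: bytes):
    """Perform the write iff the constraint returns success."""
    if not write_constraint(data):
        raise PermissionError("write constraint rejected the data")
    with open(path, "wb") as f:
        f.write(data)

assert write_constraint(b"this memo is sensitive")
assert not write_constraint(b"this memo is top-secret")
```

The interesting design point is that the constraint is a separate, pluggable predicate invoked by the write path, rather than logic baked into each attribute's one-off API.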
Use of kernel functions will have performance advantages, particularly for small functions, but severe disadvantages in power of scripting, flexibility, and ability to be installed by non-secure sources. Both types of plugins will have their place. Note that ACLs will also embody write constraints. We will implement both constraints that are compiled into the kernel, and constraints that are implemented as user space processes. Specifically, we will implement a plugin that executes an arbitrary constraint contained in an arbitrarily named file as a user space process, passes the proposed new file contents to that process as standard input, and iff the process exits without error allows the write to occur. It can be useful to have read constraints as well as write constraints.

== Auditing ==

(Note: this is not yet coded.) We will implement a plugin that notifies administrators by email when access is made to files, e.g. read access. With each plugin implemented, creating additional plugins becomes easier as the available toolkit is enriched. Auditing constitutes a major additional security feature, yet it will be easy to implement once the infrastructure to support it exists. (It would be substantial work to implement it without that infrastructure.) The scope of this project is not the creation of plugins themselves, but the creation of the infrastructure that plugin authors would find useful. We want to enable future contributors to implement more secure systems on the Gnu/Linux platform, not implement them ourselves. By laying a proper foundation and creating a toolkit for them, we hope to reduce the cost of coding new security attributes for those who follow us by an order of magnitude. Employing a proper set of well orthogonalized primitives also changes the addition of these attributes from being a complexity burden upon the architecture into being an empowering extension of the architecture.
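The user-space constraint plugin described above — feed the proposed contents to a checker process on standard input, allow the write iff it exits without error — can be sketched as follows. This is an illustrative Python mock-up, not the Reiser4 plugin: the helper name is invented, and an inline stand-in replaces the arbitrarily named constraint file.

```python
# Sketch of the user-space constraint mechanism described above: the
# proposed new file contents go to a checker process on standard
# input, and the write is allowed iff the process exits with status
# zero.  The checker here is an inline stand-in for the constraint
# program that would live in its own file.

import subprocess
import sys

CHECKER = (
    "import sys; data = sys.stdin.buffer.read(); "
    "sys.exit(0 if b'top-secret' not in data else 1)"
)

def user_space_constraint(proposed: bytes) -> bool:
    """Run the checker as a separate process; success means 'allow'."""
    result = subprocess.run([sys.executable, "-c", CHECKER],
                            input=proposed)
    return result.returncode == 0

assert user_space_constraint(b"merely sensitive notes")
assert not user_space_constraint(b"top-secret plans")
```

As the text notes, this buys scripting power and safe installation by non-secure sources, at the cost of a process launch per checked write.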
== Increasing the Allowed Granularity of Security ==

[Illustration: a man holding a sieve; only objects of a certain size go through.]

(This feature is not yet coded.) Inheritance of security attributes is important to providing flexibility in their administration. We have spoken about making security more fine grained, but sometimes it needs to be larger grained. Sometimes a large number of files are logically one unit in regards to their security, and it is desirable to have a single point of control over their security. Inheritance of attributes is the mechanism for implementing that. Security administrators should have the power to choose whatever units of security they desire, without having to distort them to make them correspond to semantic units. Inheritance of file bodies using aggregation plugins allows the units of security to be smaller than files; inheritance of attributes allows them to be larger than files.

== Encryption On Commit ==

Currently, encrypted files suffer severely in their write performance when implemented using schemes that encrypt at every write() rather than at every commit to disk. We encrypt on flush, such that a file with an encryption plugin id is encrypted not at the time of write, but at the time of flush to disk. Encryption is implemented as a special form of repacking on flush, and it occurs for any node which has its CONTAINS_ENCRYPTED_DATA state flag set.

== Conclusion ==

Reiser4 offers a dramatically better infrastructure for creating new filesystem features. Files and directories have all of the features needed to make it unnecessary for file attributes to be something different from files. The effectiveness of this new infrastructure is tested using a variety of new security features. Performance is greatly improved by the use of dancing trees, wandering logs, allocate on flush, a repacker, and encryption on commit. It was an important question whether we could increase the level of abstraction in our design without harming performance.
Reiser4 gives you BOTH the most cleanly abstracted storage AND the highest performance storage of any filesystem.

== Citations ==

* [Gray93] Jim Gray and Andreas Reuter. "Transaction Processing: Concepts and Techniques". Morgan Kaufmann Publishers, Inc., 1993. Old but good textbook on transactions. Available at http://www.mkp.com/books_catalog/catalog.asp?ISBN=1-55860-190-2
* [Hitz94] D. Hitz, J. Lau and M. Malcolm. "File system design for an NFS file server appliance". Proceedings of the 1994 USENIX Winter Technical Conference, pp. 235-246, San Francisco, CA, January 1994. Available at http://citeseer.nj.nec.com/hitz95file.html
* [TR3001] D. Hitz. "A Storage Networking Appliance". Tech. Rep. TR3001, Network Appliance, Inc., 1995. Available at http://www.netapp.com/tech_library/3001.html
* [TR3002] D. Hitz, J. Lau and M. Malcolm. "File system design for an NFS file server appliance". Tech. Rep. TR3002, Network Appliance, Inc., 1995. Available at http://www.netapp.com/tech_library/3002.html
* [Ousterh89] J. Ousterhout and F. Douglis. "Beating the I/O Bottleneck: A Case for Log-Structured File Systems". ACM Operating System Reviews, Vol. 23, No. 1, pp. 11-28, January 1989. Available at http://citeseer.nj.nec.com/ousterhout88beating.html
* [Seltzer95] M. Seltzer, K. Smith, H. Balakrishnan, J. Chang, S. McMains and V. Padmanabhan. "File System Logging versus Clustering: A Performance Comparison". Proceedings of the 1995 USENIX Technical Conference, pp. 249-264, New Orleans, LA, January 1995. Available at http://citeseer.nj.nec.com/seltzer95file.html
* [Seltzer95Supp] M. Seltzer. "LFS and FFS Supplementary Information". 1995. http://www.eecs.harvard.edu/~margo/usenix.195/
* [Ousterh93Crit] J. Ousterhout. "A Critique of Seltzer's 1993 USENIX Paper". http://www.eecs.harvard.edu/~margo/usenix.195/ouster_critique1.html
* [Ousterh95Crit] J. Ousterhout. "A Critique of Seltzer's LFS Measurements". http://www.eecs.harvard.edu/~margo/usenix.195/ouster_critique2.html
* [SwD96] A.
Sweeney, D. Doucette, W. Hu, C. Anderson, M. Nishimoto and G. Peck. "Scalability in the XFS File System". Proceedings of the 1996 USENIX Technical Conference, pp. 1-14, San Diego, CA, January 1996. Available at http://citeseer.nj.nec.com/sweeney96scalability.html
* [VelskiiLandis] G.M. Adel'son-Vel'skii and E.M. Landis. "An algorithm for the organization of information". Soviet Math. Doklady 3, pp. 1259-1262, 1962. This paper on AVL trees can be thought of as the founding paper of the field of storing data in trees. Those not conversant in Russian will want to read the [Lewis and Denenberg] treatment of AVL trees in its place. [Wood] contains a modern treatment of trees.
* [Apple] Inside Macintosh, Files, by Apple Computer Inc., Addison-Wesley, 1992. Employs balanced trees for filenames; it was an interesting filesystem architecture for its time in a number of ways, but its problems with internal fragmentation have become more severe as disk drives have grown larger. I look forward to the replacement they are working on.
* [Bach] Maurice J. Bach. "The Design of the Unix Operating System". Prentice-Hall Software Series, Englewood Cliffs, NJ, 1986. Superbly written but sadly dated; contains detailed descriptions of the filesystem routines and interfaces in a manner especially useful for those trying to implement a Unix compatible filesystem. See [Vahalia].
* [BLOB] R. Haskin, Raymond A. Lorie. "On Extending the Functions of a Relational Database System". SIGMOD Conference 1982, pp. 207-212 (body of paper not on web). Reiser4 obsoletes this approach.
* [Chen] Chen, P.M., Patterson, David A. "A New Approach to I/O Performance Evaluation---Self-Scaling I/O Benchmarks, Predicted I/O Performance". 1993 ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems; also available on Chen's web page.
* [C-FFS] Ganger, Gregory R., Kaashoek, M. Frans. "Embedded Inodes and Explicit Grouping: Exploiting Disk Bandwidth for Small Files."
A very well written paper focused on 1-10k file size issues; they use some similar notions (most especially their concept of grouping, compared to my packing localities). Note that they focus on the 1-10k file size range, and not the sub-1k range. The 1-10k range is the weak point in ReiserFS V3 performance. The page with a link to the postscript paper is available at http://amsterdam.lcs.mit.edu/papers/cffs.html
* [ext2fs] By Remi Card; extensive information and source code are available. Probably our toughest current competitor, though it is showing its age, and recent enhancements of it (journaling, htrees, etc.) have not been performance effective. It embodies both the strengths and weaknesses of the incrementalist approach to coding, and substantially resembles the older FFS filesystem from BSD.
* [FFS] M. McKusick, W. Joy, S. Leffler, R. Fabry. "A Fast File System for UNIX". ACM Transactions on Computer Systems, Vol. 2, No. 3, pp. 181-197, August 1984. Describes the implementation of a filesystem which employs parent directory location knowledge in determining file layout. It uses large blocks for all but the tail of files to improve I/O performance, and uses small blocks called fragments for the tails so as to reduce the cost due to internal fragmentation. Numerous other improvements are also made to what was once the state-of-the-art. FFS remains the architectural foundation for many current block allocation filesystems, and was later bundled with the standard Unix releases. Note that unrequested serialization and the use of fragments place it at a performance disadvantage to ext2fs, though whether ext2fs is thereby made less reliable is a matter of dispute that I take no position on (Reiser4 is an atomic filesystem, which is a different level of reliability entirely). Available at http://citeseer.nj.nec.com/mckusick84fast.html
* [Ganger] Gregory R. Ganger, Yale N. Patt. "Metadata Update Performance in File Systems".
(Abstract only.)
* [Gifford] Describes a filesystem enriched to have more than hierarchical semantics; he shares many goals with this author, so forgive me for thinking his work worthwhile. If I had to suggest one improvement in a sentence, I would say his semantic algebra needs closure. (Postscript only.)
* [Hitz, Dave] A rather well designed filesystem optimized for NFS and RAID in combination. Note that RAID increases the merits of write-optimization in block layout algorithms. Available at http://www.netapp.com/technology/level3/3002.html
* [Holton and Das] Holton, Mike., Das, Raj. "The XFS space manager and namespace manager use sophisticated B-Tree indexing technology to represent file location information contained inside directory files and to represent the structure of the files themselves (location of information in a file)". Note that it is still a block (extent) allocation based filesystem; no attempt is made to store the actual file contents in the tree. It is targeted at the needs of the other end of the file size usage spectrum from ReiserFS, and is an excellent design for that purpose (though most filesystems including Reiser4 do well at writing large files, and I think it is medium-sized and smaller files where filesystems can substantively differentiate themselves). SGI has also traditionally been a leader in resisting the use of unrequested serialization of I/O. Unfortunately, the paper is a bit vague on details. Available at http://www.sgi.com/Technology/xfs-whitepaper.html
* [Howard] Howard, J.H., Kazar, M.L., Menees, S.G., Nichols, D.A., Satyanarayanan, M., Sidebotham, R.N., West, M.J. "Scale and Performance in a Distributed File System". ACM Transactions on Computer Systems, 6(1), February 1988. A classic benchmark, it was too CPU bound to effectively stress ext2fs and ReiserFS, and is no longer very effective for modern filesystems.
* [Knuth] Knuth, D.E. The Art of Computer Programming, Vol.
3 (Sorting and Searching), Addison-Wesley, Reading, MA, 1973. The earliest reference discussing trees storing records of varying length.
* [LADDIS] Wittle, Mark, and Bruce, Keith. "LADDIS: The Next Generation in NFS File Server Benchmarking". Proceedings of the Summer 1993 USENIX Conference, July 1993, pp. 111-128.
* [Lewis and Denenberg] Lewis, Harry R., Denenberg, Larry. "Data Structures & Their Algorithms". HarperCollins Publishers, NY, NY, 1991. An algorithms textbook suitable for readers wishing to learn about balanced trees and their AVL predecessors.
* [McCreight] McCreight, E.M. "Pagination of B*-trees with variable length records". Commun. ACM 20 (9), pp. 670-674, 1977. Describes algorithms for trees with variable length records.
* [McVoy and Kleiman] The implementation of write-clustering for Sun's UFS. Available at http://www.sun.ca/white-papers/ufs-cluster.html
* [OLE] "Inside OLE" by Kraig Brockshmidt; discusses Structured Storage (abstract only). Structured storage is what you get when application developers need features to better manage the storage of objects on disk by the applications they write, and the filesystem group at their company can't be bothered with them. Miserable performance, miserable semantics. Available at http://www.microsoft.com/mspress/books/abs/5-843-2b.htm
* [Ousterhout] J.K. Ousterhout, H. Da Costa, D. Harrison, J.A. Kunze, M.D. Kupfer, and J.G. Thompson. "A Trace-driven Analysis of the UNIX 4.2BSD File System". In Proceedings of the 10th Symposium on Operating Systems Principles, pages 15-24, Orcas Island, WA, December 1985.
* [NTFS] "Inside the Windows NT File System". The book is written by Helen Custer; NTFS is architected by Tom Miller with contributions by Gary Kimura, Brian Andrew, and David Goebel. Microsoft Press, 1994. An easy to read little book. They fundamentally disagree with me on adding serialization of I/O not requested by the application programmer, and I note that the performance penalty they pay for their decision is high, especially compared with ext2fs. Their FS design is perhaps optimal for floppies and other hardware eject media beyond OS control. A less serialized, higher performance log structured architecture is described in [Rosenblum and Ousterhout]. That said, Microsoft is to be commended for recognizing the importance of attempting to optimize for small files, and leading the OS designer effort to integrate small objects into the file name space. This book is notable for not referencing the work of persons not working for Microsoft, or providing any form of proper attribution to previous authors such as [Rosenblum and Ousterhout]. Though perhaps they really didn't read any of the literature, and that explains why theirs is the worst performing filesystem in the industry....
* [Peacock] K. Peacock. "The CounterPoint Fast File System". Proceedings of the Usenix Conference, Winter 1988.
* [Pike] Rob Pike and Peter Weinberger. "The Hideous Name". USENIX Summer 1985 Conference Proceedings, p. 563, Portland, Oregon, 1985. Short, informal, and drives home why inconsistent naming schemes in an OS are detrimental. Available at http://achille.cs.bell-labs.com/cm/cs/doc/85/1-05.ps.gz. His discussion of naming in Plan 9: http://plan9.bell-labs.com/plan9/doc/names.html
* [Rosenblum and Ousterhout] M. Rosenblum and J. Ousterhout. "The Design and Implementation of a Log-Structured File System". ACM Transactions on Computer Systems, Vol. 10, No. 1, pp. 26-52, February 1992. Available at http://citeseer.nj.nec.com/rosenblum91design.html.
This paper was quite influential in a number of ways on many modern filesystems, and the notion of using a cleaner may be applied to a future release of ReiserFS. There is an interesting on-going debate over the relative merits of FFS vs. LFS architectures, and the interested reader may peruse http://www.scriptics.com/people/john.ousterhout/seltzer93.html and the arguments by Margo Seltzer it links to.
* [Snyder] "tmpfs: A Virtual Memory File System". Discusses a filesystem built to use swap space and intended for temporary files; due to a complete lack of disk synchronization it offers extremely high performance.
* [Vahalia] Uresh Vahalia. "Unix Kernel Internals".
* [Reiser93] Reiser, Hans T. Future Vision Whitepaper, 1984, Revised 1993. Available at http://www.namesys.com/whitepaper.html.

[[category:Reiser4]] [[category:Formatting-fixes-needed]]

{{wayback|http://www.namesys.com/v4/v4.html|2006-11-13}}

Reasons why Reiser4 is great for you:
* Reiser4 is the fastest filesystem, and here are the benchmarks.
* Reiser4 is an atomic filesystem, which means that your filesystem operations either entirely occur, or they entirely don't, and they don't corrupt due to half occurring. We do this without significant performance losses, because we invented algorithms to do it without copying the data twice.
* Reiser4 uses dancing trees, which obsolete the balanced tree algorithms used in databases (see farther down). This makes Reiser4 more space efficient than other filesystems, because we squish small files together rather than wasting space due to block alignment like they do. It also means that Reiser4 scales better than any other filesystem. Do you want a million files in a directory, and want to create them fast? No problem.
* Reiser4 is based on plugins, which means that it will attract many outside contributors, and you'll be able to upgrade to their innovations without reformatting your disk. If you like to code, you'll really like plugins....
* Reiser4 is architected for military grade security. You'll find it is easy to audit the code, and that assertions guard the entrance to every function.

V3 of ReiserFS is used as the default filesystem for SuSE, Lindows, FTOSX, Libranet, Xandros and Yoper. We don't touch the V3 code except to fix a bug, and as a result we don't get bug reports for the current mainstream kernel version. It shipped before the other journaling filesystems for Linux, and is the most stable of them as a result of having been out the longest. We must caution that just as Linux 2.6 is not yet as stable as Linux 2.4, it will also be some substantial time before V4 is as stable as V3.

== Software Engineering Based Reiser4 Design Principles ==

=== Equal Source Code Access Is A Civil Right ===

Copyright and patent laws were invented to give you an incentive to share your knowledge with the rest of the world in return for a limited time monopoly on what you shared. That is not the way it works with software though, because software companies are allowed to keep their source code secret, but are still given monopoly rights over their software. There is little meaningful sharing of knowledge when only binaries are shared with the world, and all the rest is kept secret. The reasons for the existence of copyright and patent laws have been forgotten, their workings have been twisted, and greed and turf defense are what remain of them. Monopoly interests have taken laws intended to promote progress in the arts and sciences, and now use them to further their own control over us by ensuring that innovations not theirs cannot enter the market for improvements to software.
Think of software objects as forming a society, not yet at the level of an AI society, but still a group of programs interacting, and choosing whether to interact, with each other. Think of social lockout, whether it be in the form of racial discrimination as in the civil rights movement, Mercantilism as happened a few centuries ago, or the endless other forms of division in human society. Is it so surprising that this evil casts its shadow on cyberspace? Is it so surprising that our cybershadows also find ways to engage in social lockout of others? Most of the cyber-world of software lives under tyranny today. We are part of a movement to create a free cyber-world we can all participate in equally.

Namesys does not oppose copyright laws as they were invented (14 year monopolies which disclosed everything that was temporarily monopolized); it opposes copyright laws as they have been twisted. Namesys opposes unlimited time monopolies which disclose nothing, and lock out all other inventors. Many others in this movement are opposed to copyright law, even the version in which it was first created. We feel they are not acknowledging that a trade-off is being made, and that this trade-off has value. Yet still we choose to give our software away for free for use with software that is given away for free (e.g. Gnu/Linux). Since we don't have a lot of illusions about our ability to entirely change the world, and it is amusing to sell free software, for those who do not want to disclose their software and do not want to give it away for free, we charge a license fee and let them keep their improvements to our software without sharing them. These fees help substantially in allowing us to survive as an organization.
We don't make nearly as much money as we would from charging everyone for usage rights, but we do make just enough to get by, and that is important. ;-) We don't really feel that everyone should follow our example and make their software no-charge for most users (it is too hard to survive fiscally doing this), but we do think that everyone should disclose their source code, and no one should design their software to exclude working with other software (e.g. Microsoft's Palladium, which makes such a mockery of Athena).

=== Software Libre Takes More Than A License --- It Takes A Design ===

Making the source code available to you is not enough by itself to bring you all of the possible benefits of software libre. Many file systems are so difficult to modify that only someone who has worked with the code for years finds it feasible to modify it, and even then small changes can take months of labor due to their ripple effects on the other code and the difficulties of dealing with disk format changes. This is why we have a plugin based architecture in Reiser4, so that it is not just possible, but easy, to improve the software.

Imagine that you were an experimental physicist who had spent his life using only the tools that were in his local hardware store. Then one day you joined a major research lab with a machine shop and a whole bunch of other physicists. All of a sudden you are not using just whatever tools the large tool companies, who have never heard of you, have made for you. You are now part of a cooperative of physicists all making your own tools, swapping tools with each other, suddenly empowered to have tools that are exactly what you want them to be, or even merely exactly what your colleagues want them to be, rather than what some big tool company, which has to do a market analysis before giving you what you want, wants them to be. That is the transition you will make when you go from version 3 to version 4 of ReiserFS.
The tools your colleagues and sysadmins (your machinists) make are going to be much better for what you need.

=== Why Limit Interactions With Objects Strictly? ===

You may wonder why the design we will present is so highly structured, why every object is allowed to control what is done to it by providing a limited interface, and why we pass requests to objects to do things rather than doing things directly to the object. Surely we limit our functionality by doing so, yes? Indeed we do, but is there a reason why the price is worth paying? Is there something that becomes crucial as complexity grows? Chaos theory offers the answer. If you disturb one thing, and disturbing that thing inherently disturbs another thing, which in turn disturbs the first thing plus maybe a whole bunch of other things, and those things all disturb the first thing again, and so on, you get what chaos theory calls a feedback loop. These loops have a marvelous tendency for the end effect of the disturbance to be incalculable, and our inability to calculate such loops is perhaps a significant aspect of our being mere mortals.

Of course, as you probably know, most programmers want to be gods, and when they are unable to know what the effect will be of a change they make to their code, they dislike this. As a result, they go to great lengths to reduce the tendency of changes to the design of one object to have ripple effects upon other objects. A vitally important way to do this is to have very strictly defined interfaces to objects, and for the designer of each object to be able to know that the interface will never be violated when he writes it. This is called "object oriented design", or "structured programming", and if used well it can do a lot to reduce a type of chaotic behavior known as bugs. ;-) Verifying the avoidance of interactions that violate the design for an object is a key task in security auditing (inspecting the code to see if it has security holes).
The expressive power of an information system is proportional not to the number of objects that get implemented for it, but instead to the number of possible effective interactions between objects in it. (Reiser's Law Of Information Economics)

This is similar to Adam Smith's observation that the wealth of nations is determined not by the number of their inhabitants, but by how well connected they are to each other. He traced the development of civilization throughout history, and found a consistent correlation between connectivity via roads and waterways, and wealth. He also found a correlation between specialization and wealth, and suggested that greater trade connectivity makes greater specialization economically viable.

You can think of namespaces as forming the roads and waterways that connect the components of an operating system. The cost of these connecting namespaces is influenced by the number of interfaces that they must know how to connect to. That cost is, if they are not clever enough to avoid it, N times N, where N is the number of interfaces, since they must write code that knows how to connect every kind to every kind. One very important way to reduce the cost of fully connective namespaces is to teach all the objects how to use the same interface, so that the namespace can connect them without adding any code to the namespace.

Very commonly, objects with different interfaces are segregated into different namespaces. If you have two namespaces, one with N objects and another with M objects, the expressive power of the objects they connect is proportional to (N times N) plus (M times M), which is less than (N plus M) times (N plus M). Try it on a calculator for some arbitrary N and M. Usually the cost of inventing the namespaces is much less than the cost of the users creating all the objects.
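The arithmetic the text invites you to try on a calculator can be checked directly. A short sketch (the variable names are mine): for any positive N and M, (N+M) squared exceeds N squared plus M squared by exactly 2NM, which corresponds to the cross-namespace interactions lost by segregating the objects.

```python
# Connectivity arithmetic from the text: two separate namespaces of
# N and M objects allow N*N + M*M interactions; one merged namespace
# allows (N+M)*(N+M).  The gap is always the 2*N*M cross-interactions.

for n, m in [(10, 10), (3, 7), (100, 1)]:
    split = n * n + m * m
    merged = (n + m) ** 2
    assert merged > split           # merging always wins for n, m > 0
    assert merged - split == 2 * n * m
    print(f"N={n:3d} M={m:3d}: split={split:6d} merged={merged:6d}")
```

For N = M = 10, that is 200 possible interactions split versus 400 merged.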
This is what makes namespaces so exciting to work with: you can have an enormous impact on the productivity of the whole system just by being a bit fanatical in insisting on simplicity and consistency in a few areas. Please remember this analysis later when we describe why we implement everything to support a "file" or "directory" interface, and why we aren't eager to support objects with unnecessarily different namespaces/interfaces --- such as "attributes" that cannot interact with files in all the same ways that files can interact with files.

== Basic Semantics ==

To interact with an object you name it, and you say what you want it to do. The filesystem takes the name you give, looks through things we call directories to find the object, and then gives the object your request to do something.

=== Files ===

[Illustration: a character holding an object that looks like a sequence.]

A file is something that tries to look like a sequence of bytes. You can read the bytes, and write the bytes. You can specify what byte to start to read/write from (the offset), and the number of bytes to read/write (the count). [Diagram needed]. You can also cut bytes off of the end of the file.

[Illustration: a character sawing off the end of a file.]

Cutting bytes out of the middle or the beginning of a file, and inserting bytes into the middle of a file, are not permitted by any of our current file plugins, all of which implement fairly ancient Unix file semantics, but this is likely to change someday.

==== The Software Engineering Lurking Below File Plugins ====

Your interactions with a file are handled by the file's "plugin". These interactions are structured (in programming, such structures are generally called "interfaces") into a set of limited and defined interactions. (We are too lazy to perform the infinite work of programming plugins to handle infinite types of interactions.) Each way you can interact with a plugin is called a "method". A plugin is composed as a set of such methods.
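The file interface just described — read and write at an offset with a count, plus cutting bytes off the end — can be sketched as a minimal in-memory class. This is an illustration of the interface, not Reiser4 code; the class and method names are invented.

```python
# Minimal sketch of the byte-sequence file interface described above.
# A file is something that tries to look like a sequence of bytes:
# read/write at an offset with a count, and cut bytes off the end.

class ByteSequenceFile:
    def __init__(self):
        self.data = bytearray()

    def write(self, offset, buf):
        """Write buf starting at offset, growing the file if needed."""
        end = offset + len(buf)
        if end > len(self.data):
            self.data.extend(b"\0" * (end - len(self.data)))
        self.data[offset:end] = buf

    def read(self, offset, count):
        """Read up to count bytes starting at offset."""
        return bytes(self.data[offset:offset + count])

    def truncate(self, length):
        """Cut bytes off the end of the file."""
        del self.data[length:]

f = ByteSequenceFile()
f.write(0, b"hello world")
f.truncate(5)                      # saw off the end of the file
assert f.read(0, 100) == b"hello"
```

Note what the interface deliberately omits, matching the text: there is no insert-into-the-middle and no cut-from-the-middle.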
Among programmers, laziness is considered the highest art form, and we do our best to express our souls in this art. This is why we have layers and layers of laziness built into our plugin architecture. Each method is composed from a library of functions we thought would be useful in constructing plugin methods. Each plugin is composed from a library of methods used by plugins, and a plugin can be considered a one-to-one mapping (that's where you have two sets of things, and for every member of one set, you specify a member of the other set as its match) of every way of interacting with the plugin to a method handling it. For every file, there is a file pluginid. Whenever you attempt to interact with a file, we take the name of the file, find the pluginid for the file, and inside the kernel we have an array of plugins [diagram needed that is suitable for persons who don't know what an array or offset is], and we use the pluginid as the offset of that file's plugin within that array. (An offset is a position relative to something else, and in programming it is typically measured in bytes.) This implies that when you invent a new file plugin, you have to recompile (Programmers don't actually write programs, they got too lazy for that long ago, instead they write instructions for the computer on how to write the program, and when the computer follows these instructions ("source code"), it is called "compiling", which programmers usually pretend was done by them when they speak about it, as in "I recompiled the kernel for my exact CPU this time, and now playing pong is noticeably faster.".) the kernel, and you can only add plugins to the end of the list, and you can never reuse or change pluginids for a plugin, or else you will have to go through the whole filesystem changing all of the pluginids that are no longer correct. 
Someday in a later version we will revise this so that plugins are "dynamically loadable" (which is when you can add something to a program while it is running), and you can add support for new plugins to a running kernel. When we do that we will carefully benchmark and ensure that there is no loss of performance (or we won't do it) from using dynamic loading.

Programs are often "layered", which is when the program is divided into layers, and each layer only talks to the layer immediately above it, or immediately below it, and never talks to a part of the program two levels below it, etc. This reduces the complexity of the interfaces for the various parts of the program, and most of the complexity of a program is in coding its interfaces.

Reiser4 has a "semantic layer", and this semantic layer concerns itself with naming objects and specifying what to do to the objects, and doesn't concern itself with such things as how to pack objects into particular places on disk or in the tree. An IO to a file may affect more than one physical sequence of bytes, or no physical sequence of bytes; it may affect the sequences of bytes offered by other files to the semantic layer, and the file plugin may invoke other plugins and delegate work to them, but its interface is structured for offering the caller the ability to read and/or write what the caller sees as being a single sequence of bytes. Appearances are what is wanted. When we say that security attributes are implemented as files, we mean that security attributes look like a sequence of bytes, but the security attributes may be stored in some compressed form that perhaps might be of fixed length, or even be just a single bit.
For the filesystem to offer the benefits of simplicity it need merely provide a uniform appearance that all things it stores are sequences of bytes, and there is nothing to prevent it from gaining efficiency through using many different storage implementations to offer this uniform appearance. For many files it is valuable for them to support efficient tree traversal to any offset in the sequence of bytes. It is not required though, and Unix/Gnu/Linux has traditionally supported some types of files which could not do this. A pipe will allow you to take the output of one command, and connect it to the input of another command, and each of the commands will see the pipe as a file. This pipe is an example of a file for which you cannot simply jump to the middle of the file efficiently but instead you must go through it from beginning to end in sequential order.

=== Names and Objects ===

A name is a means of selecting an object. An object is anything that acts as though it is a single unified entity. What is an object is context dependent. For instance, if you tell an object to delete itself, many distinctly named entities (that are distinct objects in other ways such as reading) might well disappear as though they are a single object in response to the delete request. A namespace is a mapping of names to objects. Filesystems, databases, search engines, environment variable names within shells, are all examples of namespaces. The early papers using the term tended to seek to convey that namespaces have commonality in their structure, are not fundamentally different, should be based on common design principles, and should be unified. Such unification is a bit of a quest for a holy grail. In British mythology King Arthur sent his knights out on a quest for the holy grail, and if only they could become worthy of it, it would appear to them. None of them found it, and yet the quest made them what they became.
Namespaces will never be unified, but the closer we can come to it, the more expressive power the OS will have. Reiser4 seeks to create a storage layer effective for such an eventually unified namespace, and gives it a semantic layer with some minor advantages over the state of the art. Later versions will add more and more expressive semantics to the storage layer. Finding objects is layered. The semantic layer takes names and converts them into keys (we call this "resolving" the name). The storage layer (which contains the tree traversing code) takes keys and finds the bytes that store the parts of the object. Keys are the fundamental name used by the Reiser4 tree. They are the name that the storage layer at the bottom of it all understands. They can be used to find anything in the tree, not just whole objects, but parts of objects as well. Everything in the tree has exactly one key. Duplicate keys are allowed, but their use usually means that all duplicates must be examined to see if they really contain what is sought, and so duplicates are usually rare if high performance is desired. Allowing duplicates can allow keys to be more compact in some circumstances (e.g. hashed directory entries). An objectid cannot be used for finding an object, only keys can. Objectids are used to compose keys so as to ensure that keys are unique.

=== Ordering of Name Components ===

When designing the naming system described in the future vision whitepaper I broke names from human and computer languages into their pieces, and then looked at their pieces to see which ones differed from each other in meaningful ways vs. which pieces were different expressions that provided the same functionality. (In more formal language, I would say that I systematically decomposed the ways of naming things that we use in human and computer languages into orthogonal primitives, and then determined their equivalence classes.)
I then selected one way of expression from each set of ways that provided equivalent functionality. (Since that whitepaper is focused on what is not yet implemented, the whitepaper does not list all of the equivalence classes for names, but instead describes those which I thought I could say something interesting to the reader about. For instance, the NOT operator is simply unmentioned in it, as I really have nothing interesting to say about NOT, though it is very useful and will be documented when implemented.) The ordering of two components of a name either has meaning, or it does not. If the resolution of one component of the name depends on what is named by another component, then that pair of name components forms a hierarchical name. Hierarchy can be indicated by means other than ordering. Many human languages indicate structure by use of suffix or tag mechanisms (e.g. Russian and Japanese). The syntactical mechanism one chooses to express hierarchy does not determine the possible semantics one can express so long as at least one effective method for expressing hierarchy is allowed. I choose to only offer one expression from each equivalence class of naming primitives, and here I chose the '/' separated file pathname expression traditional to Unix for pragmatic compatibility with existing operating systems. Reiser4 handles only hierarchical names, and non-hierarchical names are planned only for SSN Reiserfs.

=== Directories ===

Hierarchical names are implemented in Reiser4 by use of directories. The first component of a hierarchical name is the name of the directory, and the components that follow are passed to the directory to interpret. We use `/' to separate the components of a hierarchical name. Directories may choose to delegate parts of their task to their sub-directories.
The unix directory plugin when supplied with a name will use the part of the name before the first / to select a sub-directory (if there is a / in what it is resolving), and delegate resolving the part of the name after the first / to the sub-directory. A directory can employ any arbitrary method at all of resolving the name components passed to it, so long as it returns a set of keys of objects as the result. In Reiser4, this set of keys always contains exactly one member, but this is designed to change in SSN Reiserfs. (Reiser4 also needs to interact with a standard interface for Unix filesystems called VFS (Virtual File System), and directories are also designed to be able to return what VFS understands, which we won't go into here.) Directories will also return a list of names when asked. This list is not required to be a complete list of all names that they can resolve, and sometimes it is not desirable that it be so. Names can be hidden names in Reiser4. Directory plugins may be able to resolve more names than they can list, especially if they are written such that the number of names that they can resolve is infinite. In particular, such names can resolve to objects that behave like ordinary files (with respect to the standard file system interface: read, write, readdir, etc.), but are not backed by the storage layer. Such objects are called "pseudo files". Here is a list of pseudo files currently implemented in Reiser4 with a description of their semantics.

==== The Unix Directory Plugin ====

The unix directory plugin implements directories by storing a set of directory entries per directory. These directory entries contain a name, and a key.
When given a name to resolve, the unix directory plugin finds the directory entry containing that name, and then returns the key that is in the directory entry (more precisely, since a key selects not just the file but a particular byte within a file, it returns that part of the key which is sufficient to select the file, and which is sufficient to allow the code to determine what the full keys for the file's various parts are when the byte offset and some other fields (like item type) are added to the partial key to form a whole key). The key can then be used by the tree storage layer to find all the pieces of that which was named.

==== Some Historical Details Of Design Flaws In The Unix Directory Interface ====

Unix differs from Multics, in that Multics defined a file to be a sequence of elements (the elements could be bytes, directory entries, or something else....), while Unix defines a file to be purely a sequence of bytes. In Multics directories were then considered to be a particular type of file which was a sequence of directory entries. For many years, all implementations of Unix directories were as sequences of bytes, and the notion of location within a Unix directory is tied not to a name as you might expect, but to a byte offset within the directory. The problem is that one is using a byte offset to represent a location whose true meaning is not a byte offset but a directory entry, and doing so for a particular file in a system which meaningfully names that file not by byte offset within the directory but by filename. Various efforts are being made in the Unix community to pretend that this byte offset is something more general than a byte offset, and they often try to do so without increasing the size used to store the thing which they pretend is not a byte offset. Since byte offsets are normally smaller than filenames are allowed to be, the result is ugliness and pathetic kludges.
Trust me that you would rather not know about the details of those kludges unless you absolutely have to, and let me say no more.

==== Directories Are Unordered ====

Unix/Linux makes no promises regarding the order of names within directories. The order in which files are created is not necessarily the order in which names will be listed in a directory, and the use of lexicographic (alphabetic) order is surprisingly rare. The unix utilities typically sort directory listings after they are returned by the filesystem, which is why it seems like the filesystem sorts them, and is why listing very large directories can be slow. (Our current default plugin sorts filenames that are less than 15 letters long lexicographically. For those that are more than 15 characters long it sorts them first by their first 8 letters and then by the hash of the whole name.) There is value to allowing the user to specify an arbitrary order for names using an arbitrary ordering function the user supplies. This is not done in Reiser4, but is planned as a feature of later versions. Allowing the creation of a hash plugin is a limited form of this that is currently implemented.

==== Files That Are Also Directories ====

In Reiser4 (but not ReiserFS 3) an object can be both a file and a directory at the same time. If you access it as a file, you obtain the named sequence of bytes. If you use it as a directory you can obtain files within it, directory listings, etc. There was a lengthy discussion on the Linux Kernel Mailing List about whether this was technically feasible to do. I won't reproduce it here except to summarize that Linus showed that this was feasible without "breaking" VFS. Allowing an object to be both a file and a directory is one of the features necessary to compose the functionality present in streams and attributes using files and directories.
To implement a regular unix file with all of its metadata, we use a file plugin for the body of the file, a directory plugin for finding file plugins for each of the metadata, and particular file plugins for each of the metadata. We use a unix_file file plugin to access the body of the file, and a unix_file_dir directory plugin to resolve the names of its metadata to particular file plugins for particular metadata. These particular file plugins for unix file metadata (owner, permissions, etc.) are implemented to allow the metadata normally used by unix files to be quite compactly stored.

==== Hidden Directory Entries ====

A file can exist but not be visible when using readdir in the usual way. WAFL does this with the .snapshots directory; it works well for them without disturbing users. This is useful for adding access to a variety of new features and their applications without disturbing the user when they are not relevant.

== New Security Attributes and Set Theoretic Semantic Purity ==

=== Minimizing Number Of Primitives Is Important In Abstract Constructions ===

To a theoretician it is extremely important to minimize the number of primitives with which one achieves the desired functionality in an abstract construction. It is a bit hard to explain why this is so, but it is well accepted that breaking an abstract model into more basic primitives is very important. A not very precise explanation of why is to say that by breaking complex primitives into their more basic primitives, then recombining those basic primitives differently, you can usually express new things that the original complex primitives did not express. Let's follow this grand tradition of theoreticians and see what happens if we apply it to Gnu/Linux files and directories.

=== Can We Get By Using Just Files and Directories (Composing Streams And Attributes From Files And Directories)? ===

In Gnu/Linux we have files, directories, and attributes. In NTFS they also have streams.
Since Samba is important to Gnu/Linux, there frequently are requests that we add streams to ReiserFS. There are also requests that we add more and more different kinds of attributes using more and more different APIs. Can we do everything that can be done with {files, directories, attributes, streams} using just {files, directories}? I say yes--if we make files and directories more powerful and flexible. I hope that by the end of reading this you will agree.

Let us have two basic objects. A file is a sequence of bytes that has a name. A directory is a name space mapping names to a set of objects "within" the directory. We connect these directory name spaces such that one can use compound names whose subcomponents are separated by a delimiter '/'. What is missing from files and directories that attributes and streams offer? In ReiserFS 3, there exist file attributes. File attributes are out-of-band data describing the sequence of bytes which is the file. For example, the permissions defining who can access a file, or the last modification time, are file attributes. File attributes have their own API; creating new file attributes creates new code complexity and compatibility issues galore. ACLs are one example of new file attributes users want.

Since in Reiser4 files can also be directories, we can implement traditional file attributes as simply files. To access a file attribute, one need merely name the file, followed by a '/', followed by an attribute name. That is: a traditional file will be implemented to possess some of the features of a directory; it will contain files within the directory corresponding to file attributes which you can access by their names; and it will contain a file body which is what you access when you name the "directory" rather than the file. Unix currently has a variety of attributes that are distinct from files (ACLs, permissions, timestamps, other mostly security related attributes, ...).
This is because a variety of people needed this feature and that, and there was no infrastructure that would allow implementing the features as fully orthogonal features that could be applied to any file. Reiser4 will create that infrastructure.

=== List Of Features Needed To Get Attribute And Stream Functionality From Files And Directories ===

* api efficient for small files
* efficient storage for small files
* plugins, including plugins that can compress a file serving as an attribute into a single bit
* files that also act as directories when accessed as directories
* inheritance (includes file aggregation)
* constraints
* transactions
* hidden directory entries

Each of these additional features is a feature that would benefit the filesystem. So we add them in v4.

== Basic Tree Concepts ==

=== Trees, Nodes, and Items ===

One way of organizing information is to put it into trees. When we organize information in a computer, we typically sort it into piles (nodes we call them), and there is a name (a pointer) for each pile that the computer will be able to use to find the pile.

[Figure 1. One Example Of A Tree: a height = 4 (4 level), fanout = 3, balanced tree. It starts with a root node, then traverses 2 levels of internal nodes, and ends with the leaf nodes, which hold the data and have no children.]

Some of the nodes can contain pointers, and we can go looking through the nodes to find those pointers to (usually other) nodes. We are particularly interested in how to organize so that we can find things when we search for them. A tree is an organization structure that has some useful properties for that purpose.

Definition of Tree:
1. A tree is a set of nodes organized into a root node, and zero or more additional sets of nodes called subtrees.
2. Each of the subtrees is a tree.
3. No node in the tree points to the root node, and exactly one pointer from a node in the tree points to each non-root node in the tree.
4.
The root node has a pointer to each of its subtrees, that is, a pointer to the root node of the subtree.

==== Fine Points of the Definition ====

[Figure 2. The simplest tree: the absolutely most trivial of all graphs, the single, isolated node.]

[Figure 3. A trivial, linear tree: a trivial, connected, linear (unary) graph, a linear sequence of nodes connected by paths (edges, pointers).]

It is interesting to argue over whether finite should be a part of the definition of trees. There are many ways of defining trees, and which is the best definition depends on what your purpose is. Donald Knuth (a well known author of algorithm textbooks) supplies several definitions of tree. As his primary definition of tree he even supplies one which has no pointers/edges/lines in the definition, just sets of nodes. Reiser4 uses a finite tree (the number of nodes is limited). Knuth defines trees as being finite sets of nodes. There are papers on infinite trees on the Internet. I think it more appropriate to consider finite an additional qualifier on trees, rather than bundling finite into the definition. However, I personally only deal with finite trees in my storage layer research. It is interesting to consider whether storage layers are inherently more motivated than semantic layers to limit themselves to finite trees rather than infinite trees. This is where some writers would say ".... is left as an exercise for the reader". :-) Oh the temptation.... I will remind the reader of my explanation of why storage layer trees are more motivated to be acyclic, and, at the cost of some effort at honesty, constrain myself to saying that doing more than providing that hint is beyond my level of industry. ;-)

Edge is a term often used in tree definitions. A pointer is unidirectional (you can follow it from the node that has it to the node it points to, but you cannot follow it back from the node it points to to the node that has it). An edge is bidirectional (you can follow it in both directions).
Here are three alternative tree definitions, which are interesting in how they are mathematically equivalent to each other, though they are not equivalent to the definition I supplied because edges are not equivalent to pointers. For all three of these definitions, let there be not more than one edge connecting the same two nodes.

* a set of vertices (aka points) connected by edges (aka lines) for which the number of edges is one less than the number of vertices
* or a set of vertices connected by edges which has no cycles (a cycle is a path from a vertex to itself)
* or a set of vertices connected by edges for which there is exactly one path connecting any two vertices

The three alternative definitions do not have a unique root in their tree, and such trees are called free trees. The definition I supplied is a definition of a rooted tree, not a free tree. It also has no cycles, it has one less pointer than it has nodes, and there is exactly one path from the root to any node. Please feel encouraged to read Knuth's writings for more discussions of these topics.

==== Graphs vs. Trees ====

Consider the purposes for which you might want to use a graph, and those for which you might want to use a tree. In a tree there is exactly one path from the root to each node in the tree, and a tree has the minimum number of pointers sufficient to connect all the nodes. This makes it a simple and efficient structure. Trees are useful for when efficiency with minimal complexity is what is desired, and there is no need to reach a node by more than one route. Reiser4 has both graphs and trees, with trees used for when the filesystem chooses the organization (in what we call the storage layer, which tries to be simple and efficient), and graphs for when the user chooses the organization (in the semantic layer, which tries to be expressive so that the user can do whatever he wants).

=== Ordering The Tree Aids Searching Through It ===

==== Keys ====

We assign everything stored in the tree a key.
We find things by their keys. Use of keys gives us additional flexibility in how we sort things, and if the keys are small, it gives us a compact means of specifying enough to find the thing. It also limits what information we can use for finding things. This limit restricts its usefulness, and so we have a storage layer, which finds things by keys, and a semantic layer, which has a rich naming system. The storage layer chooses keys for things solely to organize storage in a way that will improve performance, and the semantic layer understands names that have meaning to users. As you read, you might want to think about whether this is a useful separation that allows freedom in adding improvements that aid performance in the storage layer, while escaping paying a price for the side effects of those improvements on the flexible naming objectives of the semantic layer.

==== Choosing Which Subtree ====

We start our search at the root, because from the root we can reach every other node. How do we choose which subtree of the root to go to from the root? The root contains pointers to its subtrees. For each pointer to a subtree there is a corresponding left delimiting key. Pointers to subtrees, and the subtrees themselves, are ordered by their left delimiting key. A subtree pointer's left delimiting key is equal to the least key of the things in the subtree. Its right delimiting key is larger than the largest key in the subtree, and it is the left delimiting key of the next subtree of this node. Each subtree contains only things whose keys are at least equal to the left delimiting key of its pointer, and are not more than its right delimiting key. If there are no duplicate keys in the tree, then each subtree contains only things whose keys are less than its right delimiting key. If there are no duplicate keys, then by looking within a node at its pointers to subtrees and their delimiting keys we know which subtree of that node contains the thing we are looking for.
Duplicate keys are a topic for another time. For now I will just hint that when searching through objects with duplicate keys we find the first of them in the tree, and then we search through all duplicates one-by-one until we find what we are looking for. Allowing duplicate keys can allow for smaller keys, so there is sometimes a tradeoff between key size and the average frequency of such inefficient linear searches. Using duplicate keys can also allow, if one defines one's insertion algorithms such that they always insert at the end of a set of duplicate keys, ordering objects with the same key by creation time. The contents of each node in the tree are sorted within the node. So, the entire tree is sorted by key, and for a given key we know just where to go to find at least one thing with that key.

=== Nodes ===

==== Leaves, Twigs, and Branches ====

Leaves are nodes that have no children. Internal nodes are nodes that have children.

[Figure 4. A height = 4, fanout = 3, balanced tree. A search will start with the root node, the sole level 4 internal node, traverse 2 more internal nodes, and end with a leaf node which holds the data and has no children.]

A node that contains items is called a formatted node. If an object is large, and is not compressed and doesn't need to support efficient insertions (compressed objects are special because they need to be able to change their space usage when you write to their middles, because the compression might not be equally efficient for the new data), then it can be more efficient to store it in nodes without any use of items at all. We do so by default for objects larger than 16k. Unformatted leaves (unfleaves) are leaves that contain only data, and do not contain any formatting information. Only leaves can contain unformatted data.
Pointers are stored in items, and so all internal nodes are necessarily formatted nodes. Pointers to unfleaves are different in their structure from pointers to formatted nodes. Extent pointers point to unfleaves. An extent is a sequence of unfleaves that are contiguous in block number order and belong to the same object. An extent pointer contains the starting block number of the extent, and a length. [Diagram needed.] Because the extent belongs to just one object, we can store just one key for the extent, and then we can calculate the key of any byte within that extent. If the extent is at least 2 blocks long, extent pointers are more compact than regular node pointers would be. Node pointers are pointers to formatted nodes. We do not yet have a compressed version of node pointers, but they are probably soon to come. Notice how with extent pointers we don't have to store the delimiting key of each node pointed to, and with node pointers we do. We will probably introduce key compression at the same time we add compressed node pointers. One would expect keys to compress well since they are sorted into ascending order. We expect our node and item plugin infrastructure will make such features easy to add at a later date. Twigs are parents of leaves. Extent pointers exist only in twigs. This is a very controversial design decision I will discuss a bit later. Branches are internal nodes that are not twigs. You might think we would number the root level 1, but since the tree grows at the top, it turns out to be more useful to number as 1 the level with the leaves where object data is stored. The height of the tree will depend upon how many objects we have to store and what the fanout rate (average number of children) of the internal and twig nodes will be. For reasons of code simplicity, we find it easiest to implement Reiser4 such that it has a minimum height of 2, and the root is always an internal node.
There is nothing deeper than judicious laziness to this minimum height of 2: it simplifies the code to not deal with one-node trees, and nobody cares about the waste of space. An example of a Reiser4 tree:

[Figure 5. This Reiser4 tree is a 4 level, balanced tree with a fanout of 3: a root node, then branch nodes, including the internal nodes called twig nodes (a Reiser4 feature), and finally the leaf nodes, which hold the data and have no children.]

In practice Reiser4 fanout is much higher and varies from node to node, but a 4 level tree diagram with 16 million leaf nodes won't fit easily onto my monitor so I drew something smaller.... ;-)

==== Size of Nodes ====

We choose to make the nodes equal in size. This makes it much easier to allocate the unused space between nodes, because it will be some multiple of node size, and there are no problems of space being free but not large enough to store a node. Also, disk drives have an interface that assumes equal size blocks, which they find convenient for their error-correction algorithms. If having the nodes be equal in size is not very important, perhaps due to the tree fitting into RAM, then using a class of algorithms called skip lists is worthy of consideration. Reiser4 nodes are usually equal to the size of a page, which if you use Gnu/Linux on an Intel CPU is currently 4096 (4k) bytes. There is no measured empirical reason to think this size is better than others, it is just the one that Gnu/Linux makes easiest and cleanest to program into the code, and we have been too busy to experiment with other sizes.

==== Sharing Blocks Saves Space ====

If nodes are of equal size, how do we store large objects? We chop them into pieces. We call these pieces items. Items are sized to fit within a single node. Conventional filesystems store files in whole blocks. Roughly speaking, this means that on average half a block of space is wasted per file because not all of the last block of the file is used.
If a file is much smaller than a block, then the space wasted is much larger than the file. It is not effective to store such typical database objects as addresses and phone numbers in separately named files in a conventional filesystem, because doing so wastes more than 90% of the space in the blocks that store them. By putting multiple items within a single node, Reiser4 is able to pack multiple small pieces of files into one block. Our space efficiency is roughly 94% for small files. This does not count per-item formatting overhead, whose percentage of total space consumed depends on average item size, and for that reason is hard to quantify.

Aligning files to 4k boundaries does have advantages for large files though. When a program wants to operate directly on file data without going through system calls to do it, it can use mmap() to make the file data part of the process's directly accessible address space. Due to some implementation details, mmap() needs file data to be 4k aligned, and if the data is already 4k aligned, it makes mmap() much more efficient. In Reiser4 the current default is that files larger than 16k are 4k aligned. We don't yet have enough empirical data and experience to know whether 16k is the precise optimal default value for this cutoff point, but so far it seems to at least be a decent choice.

Items

Nodes in the tree are smaller than some of the objects they hold, and larger than others, so how do we store those objects? One way is to pour them into items. An item is a data container that is contained entirely within a single node, and it allows us to manage space within nodes. For the default 4.0 node format, every item has a key, an offset to where in the node the item body starts, a length of the item body, and a plugin id that indicates what type of item it is. Items allow us to not have to round up to 4k the amount of space required to store an object.

The Structure of an Item

    Item_Body  . . . (stored separately from) . . .  Item_Head
    Item_Head: [ Item_Key | Item_Offset | Item_Length | Item_Plugin_id ]

Types Of Items

Reiser4 includes many different kinds of items designed to hold different kinds of information.

* static_stat_data: holds the owner, permissions, last access time, creation time, last modification time, size, and the number of links (names) to a file.
* cmpnd_dir_item: holds directory entries, and the keys of the files they link to.
* extent pointers: explained above.
* node pointers: explained above.
* bodies: hold parts of files that are not large enough to be stored in unfleaves.

Units

We call a unit that which we must place as a whole into an item, without splitting it across multiple items. When traversing an item's contents it is often convenient to do so in units:

* For body items the units are bytes.
* For directory items the units are directory entries. The directory entries contain a name and a key of the file named (or at least the item plugin can pretend they do; in practice the name and key may be compressed).
* For extent items the units are extents. Extent items only contain extents from the same file.
* For static_stat_data the whole stat data item is one indivisible unit of fixed size.

What the Default Node Formats For ReiserFS 4.0 Look Like

An unformatted leaf node (unfleaf node), which is the only node without a Node_Header, has the trivial structure:

    ...................... (raw data only) ......................

A formatted leaf node has the structure:

    Block_Head | Item_Body0 | Item_Body1 | - - - | Item_Bodyn | ....Free Space....
    | Item_Headn | - - - | Item_Head1 | Item_Head0

A twig node has the structure:

    Block_Head | Item_Body0 (NodePointer0) | Item_Body1 (ExtentPointer1) | Item_Body2 (NodePointer2) | Item_Body3 (ExtentPointer3) | - - - | Item_Bodyn (NodePointern) | ....Free Space.... | Item_Headn | - - - | Item_Head0

A branch node has the structure:

    Block_Head | Item_Body0 (NodePointer0) | - - - | Item_Bodyn (NodePointern) | ....Free Space.... | Item_Headn | - - - | Item_Head0

Tree Design Concepts

Height Balancing versus Space Balancing

Height balanced trees are trees in which each possible search path from root node to leaf node has exactly the same length (length = number of nodes traversed from root node to leaf node). For instance, the height of the tree in Figure 1 is four, while the height of the left-hand tree in Figure 1.3 is three, and that of the single node in Figure 2 is 1. The term balancing is used for several very distinct purposes in the balanced tree literature. Two of the most common are: to describe balancing the height, and to describe balancing the space usage within the nodes of the tree. These quite different definitions are unfortunately a classic source of confusion for readers of the literature. Most algorithms for accomplishing height balancing do so by only growing the tree at the top. Thus the tree never gets out of balance.

Figure 6. This is an unbalanced tree: a 4 level tree with fanout n = 3 that has lost some nodes to deletions and needs to be balanced.

Three Principal Considerations in Tree Design

Three of the principal considerations in tree design are:

* the fanout rate (see below)
* the tightness of packing
* the amount of shifting of items from one node to another that is performed (which creates delays due to waiting while things move around in RAM, and on disk).

Fanout

The fanout rate n refers to how many nodes may be pointed to by each level's nodes.
(see Figure 7) If each node can point to n nodes of the level below it, then starting from the top, the root node points to n internal nodes at the next level, each of which points to n more internal nodes at its next level, and so on: m levels of internal nodes can point to n^m leaf nodes containing items in the last level. The more you want to be able to store in the tree, the larger you have to make the fields in the key that first distinguish the objects (the objectids), and then select parts of the object (the offsets). This means your keys must be larger, which decreases fanout (unless you compress your keys, but that will wait for our next version....).

Figure 7. Three 4 level, height balanced trees with fanouts n = 1, 2, and 3. The first graph is a four level tree with fanout n = 1. It has just four nodes: it starts with the (red) root node, traverses the (burgundy) internal and (blue) twig nodes, and ends with the (green) leaf node, which contains the data. The second tree, with 4 levels and fanout n = 2, starts with a root node and traverses 2 internal nodes, each of which points to two twig nodes (for a total of four twig nodes), and each of these points to 2 leaf nodes, for a total of 8 leaf nodes. Lastly, a 4 level, fanout n = 3 tree is shown, which has 1 root node, 3 internal nodes, 9 twig nodes, and 27 leaf nodes.
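The fanout arithmetic above can be sketched as follows. The fanout value 253 in the second check is a hypothetical figure for a 4k node filled with pointer/key pairs, not a number taken from the Reiser4 sources:

```python
def tree_height(leaf_count, fanout):
    """Smallest height (counting the leaf level as level 1) whose internal
    levels can point to leaf_count leaves: fanout**(height - 1) >= leaf_count.
    Minimum height is 2, matching Reiser4's root-above-leaves rule."""
    height, capacity = 2, fanout
    while capacity < leaf_count:
        capacity *= fanout
        height += 1
    return height

assert tree_height(27, 3) == 4            # the fanout-3 tree of Figure 7
assert tree_height(16_000_000, 253) == 4  # 253**3 exceeds 16 million leaves
```

This is why a realistically high fanout keeps even a 16-million-leaf tree at 4 levels, as the Figure 5 caption remarks.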
What Are B+Trees, and Why Are They Better than B-Trees

It is possible to store not just pointers and keys in internal nodes, but also the objects those keys correspond to. This is what the original B-tree algorithms did. Then B+trees were invented, in which only pointers and keys are stored in internal nodes, and all of the objects are stored at the leaf level.

Figure 8. Figure 9.

Warning! I found from experience that most persons who don't first deeply understand why B+trees are better than B-trees won't later understand explanations of the advantages of putting extents on the twig level rather than using BLOBs. The same principles that make B+trees better than B-trees also make Reiser4 faster than using BLOBs like most databases do. So make sure this section fully digests before moving on to the next section, ok? ;-)

B+Trees Have Higher Fanout Than B-Trees

Fanout is increased when we put only pointers and keys in internal nodes, and don't dilute them with object data. Increased fanout increases our ability to cache all of the internal nodes, because there are fewer internal nodes. Often persons respond to this by saying, "but B-trees cache objects, and caching objects is just as valuable". It is not, on average, is the answer. Of course, discussing averages makes the discussion much harder. We need to discuss some cache design principles for a while before we can get to this.

Cache Design Principles

Reiser's Untie The Uncorrelated Principle of Cache Design

Tying the caching of things whose usage does not strongly correlate is bad. Suppose:

* you have two sets of things, A and B.
* you need things from those two sets at semi-random, with some items tending to be needed much more frequently than others, though which items those are can shift slowly over time.
* you can keep things around after you use them in a cache of limited size.
* you tie the caching of every thing from A to the caching of another thing from B.
(that means, whenever you fetch something from A into the cache, you fetch its partner from B into the cache) Then this increases the amount of cache required to store everything recently accessed from A. If there is a strong correlation between the need for the two particular objects that are tied in each of the pairings, stronger than the gain from spending those cache resources on caching more members of B according to the LRU algorithm, then this might be worthwhile. If there is no such strong correlation, then it is bad. But wait, you might say, you need things from B also, so it is good that some of them were cached. Yes, you need some random subset of B. The problem is that without a correlation existing, the things from B that you need are not especially likely to be those same things from B that were tied to the things from A that were needed. This tendency to inefficiently tie things that are randomly needed exists outside the computer industry. For instance, suppose you like both popcorn and sushi, with your need for them on a particular day being random. Suppose that you like movies randomly. Suppose a theater requires you to eat only popcorn while watching the movie you randomly found optimal to watch, and not eat sushi from the restaurant on the corner while watching that movie. Is this a socially optimum system? Suppose quality is randomly distributed across all the hot dog vendors: if you can only eat the hot dog produced by the best movie displayer on a particular night that you want to watch a movie, and you aren't allowed to bring in hot dogs from outside the movie theater, is it a socially optimum system? Optimal for you? Tying the uncorrelated is a very common error in designing caches, but it is still not enough to describe why B+Trees are better. With internal nodes, we store more than one pointer per node. That means that pointers are not separately cached. 
You could well argue that pointers and the objects they point to are more strongly correlated than the different pointers are with each other. We need another cache design principle.

Reiser's Maximize The Variance Principle of Cache Design

If two types of things that are cached and accessed in units that are aggregates have different average temperatures, then segregating the two types into separate units helps caching. For balanced trees, these units of aggregation are nodes. This principle applies to the situation where it may be necessary to tie things into larger units for efficient access, and it guides what things should be tied together.

Suppose you have R bytes of RAM for cache, and D bytes of disk. Suppose that 80% of accesses are to the most recently used things, which are stored in H (hotset) bytes of nodes. Reducing the size of H to where it is smaller than R is very important to performance. If you evenly disperse your frequently accessed data, then a larger cache is required and caching is less effective.

1. If, all else being equal, we increase the variation in temperature among all aggregates (nodes), then we increase the effectiveness of using a fast small cache.
2. If two types of things have different average temperatures (ratios of likelihood of access to size in bytes), then separating them into separate aggregates (nodes) increases the variation in temperature in the system as a whole.
3. Conclusion: all else being equal, if two types of things cached several to an aggregate (node) have different average temperatures, then segregating them into separate nodes helps caching.

Pointers To Nodes Have A Higher Average Temperature Than The Nodes They Point To

Pointers to nodes tend to be frequently accessed relative to the number of bytes required to cache them. Consider that you have to use the pointers for all tree traversals that reach the nodes beneath them, and they are smaller than the nodes they point to.
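A toy calculation of temperature (likelihood of access per byte, as defined above) shows why mixing hot pointers with cold object data flattens the variance; the access counts and sizes here are invented for illustration:

```python
def temperature(accesses, size_bytes):
    """Temperature = likelihood of access per byte of cache consumed."""
    return accesses / size_bytes

pointer_temp = temperature(accesses=1000, size_bytes=16)  # used by every traversal below it
data_temp = temperature(accesses=10, size_bytes=4096)     # rarely read file body

# A node mixing the two is dragged toward the cold data's temperature,
# so caching it spends most of its bytes on data that is rarely needed:
mixed_temp = temperature(1000 + 10, 16 + 4096)
assert pointer_temp > mixed_temp > data_temp
```

Segregating the pointers into their own nodes keeps the hot aggregate tiny, which is exactly what lets all internal nodes fit in RAM.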
Putting only node pointers and delimiting keys into internal nodes concentrates the pointers. Since pointers tend to be more frequently accessed per byte of their size than items storing file bodies, a high average temperature difference exists between pointers and object data. According to the caching principles described above, segregating these two types of things with different average temperatures, pointers and object data, increases the efficiency of caching.

Segregating By Temperature Directly

Now you might say: well, why not segregate by actual temperature instead of by type, which only correlates with temperature? We do what we can easily and effectively code, with not just temperature segregation in consideration. There are tree designs which rearrange the tree so that objects with a higher temperature are placed higher in the tree than objects with a lower temperature. The difference in average temperature between object data and pointers to nodes is so high that I don't find such designs a compelling optimization, and they add complexity. I could be wrong. If one had no compelling semantic basis for aggregating objects near each other (this is true for some applications), and if one wanted to access objects by nodes rather than individually, it would be interesting to have a node repacker sort object data into nodes by temperature. You would need to have the repacker change the keys of the objects it sorts. Perhaps someone will have us implement that for some application someday for Reiser4.

BLOBs Unbalance the Tree, Reduce Segregation of Pointers and Data, and Thereby Reduce Performance

BLOBs, Binary Large OBjects, are a method of storing objects larger than a node by storing pointers to the nodes containing the object. These pointers are commonly stored in what is called the leaf nodes (level 1, except that the BLOBs are then sort of a basement "level B" :-\ ) of a "B*" tree.
Figure 10. A Binary Large OBject (BLOB) has been inserted, with pointers to its blocks stored in a leaf node of a tree that was four levels before the insertion. In this case the BLOB's blocks are all contiguous. This is what a ReiserFS V3 tree looks like.

BLOBs are a significant unintentional definitional drift, albeit one accepted by the entire database community. This placement of pointers into nodes containing data is a performance problem for ReiserFS V3, which uses BLOBs. (Never accept that "let's just try it my way and see, and we can change it if it doesn't work" argument. It took years and a disk format change to get BLOBs out of ReiserFS, and performance suffered the whole time, if tails were turned on.) Because the pointers to BLOBs are diluted by data, caching all pointers to all nodes in RAM becomes infeasible for typical file sets.

Reiser4 returns to the classical definition of a height balanced tree, in which the lengths of the paths to all leaf nodes are equal. It does not try to pretend that the nodes storing objects larger than a node are somehow not part of the tree even though the tree stores pointers to them. As a result, the amount of RAM required to store pointers to nodes is dramatically reduced. For typical configurations, RAM is large enough to hold all of the internal nodes.

Figure 11. A Reiser4, 4 level, height balanced tree with fanout = 3: the data that was stored in BLOBs is now stored in extents in the level 1 leaf nodes and pointed to by extent pointers stored in the level 2 twig nodes. In this case the extent's blocks are all contiguous.

Gray and Reuter say the criterion for searching external memory is to "minimize the number of different pages along the average (or longest) search path.
....by reducing the number of different pages for an arbitrary search path, the probability of having to read a block from disk is reduced." (1993, Transaction Processing: Concepts and Techniques, Morgan Kaufmann Publishers, San Francisco, CA, p.834 ...)

My problem with this explanation of why the height balanced approach is effective is that it does not convey that you can get away with having a moderately unbalanced tree, provided that you do not significantly increase the total number of internal nodes. In practice, most trees that are unbalanced do have significantly more internal nodes. In practice, most moderately unbalanced trees have a moderate increase in the cost of in-memory tree traversals, and an immoderate increase in the amount of IO due to the increased number of internal nodes. But if one were to put all the BLOBs together in the same location in the tree, since the number of internal nodes would not significantly increase, the performance penalty for having them on a lower level of the tree than all other leaf nodes would not be a significant additional IO cost. There would be a moderate increase in that part of the tree traversal time cost which is dependent on RAM speed, but this would not be so critical. Segregating BLOBs could perhaps substantially recover the performance lost by architects not noticing the drift in the definition of height balancing for trees. It might be undesirable to segregate objects by their size rather than just their semantics, though. Perhaps someday someone will try it and see what results.

Dancing Trees Are Faster Than Balanced Trees

Balanced trees have traditionally employed a fixed criterion for determining whether nodes should be squeezed together into fewer nodes so as to save space. This criterion is traditionally satisfied at the end of every modification to the tree.
A typical such criterion is to guarantee that, after each modification to the tree, the modified node cannot be squeezed together with its left and right neighbors into two or fewer nodes. ReiserFS V3 uses that criterion for its leaf nodes. The more neighboring nodes you consider for squeezing into one fewer node, the more memory bandwidth you consume on average per modification to the tree, and the more likely you are to need to read those nodes because they are not in memory. It is a typical pattern in memory management algorithm design that the more tightly packed memory is kept, the more overhead is added to the cost of changing what is stored where in it. This overhead can be significant enough that some commercial databases actually only delete nodes when they are completely empty, and they feel that in practice this works well.

Trees that adhere to fixed space usage balancing criteria can have many things rigorously proven about their worst case performance in publishable papers. This is different from their being optimal. An algorithm can have worse bounds on its theoretical worst case performance and be a better algorithm. Just because one cannot rigorously define average usage patterns does not mean they are the slightest bit less important. Sorry, mere mortal mathematicians, that is life. Maybe some might prefer to think about the questions that they can define and answer rigorously, but this does not in the slightest make them the right questions. Yes, I am a chaotic....

In Reiser4 we employ not balanced trees but dancing trees. Dancing trees merge insufficiently full nodes, not with every modification to the tree, but instead:

* in response to memory pressure triggering a flush to disk,
* as a result of a transaction closure flushing nodes to disk.

If It Is In RAM, Dirty, and Contiguous, Then Squeeze It ALL Together Just Before Writing

Let a slum be defined as a sequence of nodes that are contiguous in the tree order and dirty in this transaction.
(In simpler words, a bunch of dirty nodes that are right next to each other.) A dancing tree responds to memory pressure by squeezing and flushing slums. It is possible that merely squeezing a slum might free up enough space that flushing is unnecessary, but the current implementation of Reiser4 always flushes the slums it squeezes. This is not necessarily the right approach, but we found it simpler and good enough for now. Another simplification we choose to engage in for now is that instead of trying to estimate whether squeezing a slum will save space before squeezing it, we just squeeze it and see.

Balanced trees have an inherent tradeoff between balancing cost and space efficiency. If they consider more neighboring nodes, for the purpose of merging them to save a node, with every change to the tree, then they can pack the tree more tightly, at the cost of moving more data with every change to the tree. By contrast, with a dancing tree you simply take a large slum, shove everything in it as far to the left as it will go, and then, at the time of committing the slum's contents to disk in response to memory pressure, free all the nodes in the slum that are left with nothing remaining in them. This gives you extreme space efficiency when slums are large, at a cost in data movement that is lower than it would be with an invariant balancing criterion, because it is done less often. By compressing at the time one flushes to disk, one compresses less often, and that means one can afford to do it more thoroughly. By compressing dirty nodes that are in memory, one avoids performing additional I/O as a result of balancing.

Procrastination Leads To Wiser Decisions: Allocate on Flush

ReiserFS V3 assigns block numbers to nodes as it creates them. XFS is smarter: it waits until the last moment just before writing nodes to disk. I'd like to thank the XFS team for making an effort to ensure that I understood the merits of their approach.
The easy way to see the merits of allocate-on-flush is to consider a file that is deleted before it reaches disk. Such a file should have no effect on the disk layout.

Reiser4 The Atomic Filesystem

Reducing The Damage of Crashing

When a computer crashes, data in RAM which has not reached disk is lost. You might at first be tempted to think that we then want to keep all of the data that did reach disk. Suppose that you were performing a transfer of $10 from bank account A to bank account B, and this consisted of two operations: 1) debit $10 from A, and 2) credit $10 to B. Suppose that 1) but not 2) reached disk and the computer crashed. It would be better to disregard 1) than to let 1) but not 2) take effect, yes? When there is a set of operations of which we ensure that either all take effect or none take effect, we call the set as a whole an atom.

Reiser4 implements all of its filesystem system calls (requests to the kernel to do something are called system calls) as fully atomic operations, and allows one to define new atomic operations using its plugin infrastructure. Why don't all filesystems do this? Performance. Reiser4 employs new algorithms that allow it to make these operations atomic at little additional cost, where other filesystems have paid a heavy, usually prohibitive, price to do that. We hope to share with you how that is done.

A Brief History Of How Filesystems Have Handled Crashes

Filesystem Checkers

Originally filesystems had filesystem checkers that would run after every crash. The problems with that were that 1) the checkers cannot handle every form of damage well, and 2) the checkers run for a long time.
The amount of data stored on hard drives increased faster than the transfer rate (the rate at which hard drives transfer their data from the platter spinning inside them into the computer's RAM when they are asked to do one large continuous read, or the rate in the other direction for writes), which means that the checkers took longer to run, and as the decades ticked by it became less and less reasonable for a mission critical server to wait for the checker.

Fixed Location Journaling

A solution was adopted of first writing each atomic operation to a location on disk called the journal or log, and then, only after each atom had fully reached the journal, writing it to the committed area of the filesystem. The problem with this is that twice as much data needs to be written. On the one hand, if the workload is dominated by seeks, this is not as much of a burden as one might think. On the other hand, for writes of large files, it halves performance, because such writes are usually transfer time dominated. For this reason, meta-data journaling came to dominate general purpose usage. With meta-data journaling, the filesystem guarantees that all of its operations on its meta-data will be done atomically. If a file is being written to, the data being written to that file may be corrupted as a result of non-atomic data operations, but the filesystem's internals will all be consistent. The performance advantage was substantial. V3 of ReiserFS offers both meta-data and data journaling, and defaults to meta-data journaling, because that is the right solution for most users. Oddly enough, meta-data journaling is much more work to implement, because it requires being precise about what needs to be journaled. As is so often the case in programming, doing less work requires more code.
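The write-twice cost of fixed location journaling can be sketched with a toy model; an in-memory dict stands in for the disk, and nothing here reflects any real filesystem's format:

```python
journal, committed = {}, {}
writes = {"journal": 0, "committed": 0}

def atomic_write(blocks):
    """blocks: {block_number: content}. The whole atom must reach the
    journal before any of it touches the committed area, so every block
    of the atom is written to disk twice."""
    for bn, content in blocks.items():
        journal[bn] = content
        writes["journal"] += 1
    # Only once the atom is fully journaled may the committed area change;
    # a crash mid-way is recovered by replaying the complete journal copy.
    for bn, content in blocks.items():
        committed[bn] = content
        writes["committed"] += 1
    journal.clear()

atomic_write({7: b"debit $10 from A", 8: b"credit $10 to B"})
assert writes["journal"] == writes["committed"] == 2  # twice the data written
```

For seek-dominated workloads the second, sequential journal write is cheap; for transfer-dominated large-file writes it is exactly the factor-of-two penalty described above.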
With fixed location data journaling, the overhead of making each operation atomic is too high for it to be appropriate for average applications that don't especially need it, because of the cost of writing twice. Applications that do need atomicity are written to use fsync and rename to accomplish it, and these tools are simply terrible for that job: terrible in performance, and terrible in the ugliness they add to the coding of applications. Stuffing a transaction into a single file just because you need the transaction to be atomic is hardly what one would call flexible semantics. Also, data journaling, with all its performance cost, still does not necessarily guarantee that every system call is fully atomic, much less that one can construct sets of operations that are fully atomic. It usually merely guarantees that the files will not contain random garbage, however many blocks of them happen to get written, and however much the application might view the result as inconsistent data. I hope you understand that we are trying to set a new expectation here for how secure a filesystem should keep your data when we provide these atomicity guarantees.

Wandering Logs

One way to avoid having to write the data twice is to change one's definition of where the log area and the committed area are, instead of moving the data from the log to the committed area. There is an annoying complication to this, though, in that there are probably a number of pointers to the data from the rest of the filesystem, and we need them to point to the new data. When the commit occurs, we need to write those pointers so that they point to the data we are committing. Fortunately, these pointers tend to be highly concentrated as a result of our tree design. But wait: if we are going to update those pointers, then we want to commit those pointers atomically also, which we could do if we write them to another location and update the pointers to them, and....
up the tree the changes ripple. When we get to the top of the tree, since disk drives write sectors atomically, the block number of the top can be written atomically into the superblock by the disk, thereby committing everything the new top points to. This is indeed the way WAFL, the Write Anywhere File Layout filesystem invented by Dave Hitz at Network Appliance, works. It always ripples changes all the way to the top, and indeed that works rather well in practice; most of their users are quite happy with its performance.

Writing Twice May Be Optimal Sometimes

Suppose that a file is currently well laid out, you write to a single block in the middle of it, and you then expect to do many reads of the file. That is an extreme case illustrating that sometimes it is worth writing twice, so that a block can keep its current location while committing atomically. If one writes a node twice in this way, one also does not need to update its parent and ripple all the way to the top of the tree. Our code is a toolkit that can be used to implement different layout policies, and one of the available choices is whether to write over a block in its current place, or to relocate it to somewhere else. I don't think there is one right answer for all usage patterns. If a block is adjacent to many other dirty blocks in the tree, then this decreases the significance of the cost to read performance of relocating it and its neighbors. If one knows that a repacker will run once a week (a repacker is expected for V4.1, and is, a bit oddly, absent from WAFL), this also decreases the cost of relocation. After a few years of experimentation, measurement, and user feedback, we will say more about our experiences in constructing user selectable policies.

Do we pay a performance penalty for making Reiser4 atomic? Yes, we do. Is it an acceptable penalty?
We picked up a lot more performance from other improvements in Reiser4 than we lost to atomicity, so the cost is not isolated in our measurements, but I am unscientifically confident that the answer is yes. If changes are either large or batched together with enough other changes to become large, the performance penalty is low and drowned out by other performance improvements. Scattered small changes threaten us with read performance losses compared to overwriting in place and taking our chances with the data's consistency if there is a crash, but use of a repacker will mostly alleviate this scenario. I have to say that in my heart I don't have any serious doubts that for the general purpose user the increase in data security is worthwhile. The users, though, will have the final say.

Committing

A transaction preserves the previous contents of all modified blocks in their original location on disk until the transaction commits; commit means the transaction has reached a state where it will be completed even if there is a crash. The dirty blocks of an atom (which were captured and subsequently modified) are divided into two sets, relocate and overwrite, each of which is preserved in a different manner. The relocatable set is the set of blocks that have a dirty parent in the atom. The relocate set is those members of the relocatable set that will be written to a new or first location rather than overwritten. The overwrite set contains all dirty blocks in the atom that need to be written to their original locations, which is all those not in the relocate set. In practice this is those which do not have a parent we want to dirty, plus those for which overwrite is the better layout policy despite the write-twice cost. Note that the superblock is the parent of the root node, and the free space bitmap blocks have no parent. By these definitions, the superblock and modified bitmap blocks are always part of the overwrite set.
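The division into relocatable and overwrite sets can be sketched as follows, with toy block ids standing in for Reiser4's real structures; consistent with the definitions above, the superblock and the bitmap block, having no dirty parent in the atom, land in the overwrite set:

```python
def classify(dirty, parent):
    """dirty: set of dirty block ids in the atom; parent: block id ->
    parent block id (None for blocks with no parent, e.g. bitmaps).
    A dirty block whose parent is also dirty in the atom is relocatable;
    everything else must be written back to its original location."""
    relocatable = {b for b in dirty if parent.get(b) in dirty}
    overwrite = dirty - relocatable
    return relocatable, overwrite

dirty = {"superblock", "root", "twig", "leaf", "bitmap"}
parent = {"superblock": None, "root": "superblock",
          "twig": "root", "leaf": "twig", "bitmap": None}
reloc, over = classify(dirty, parent)
assert over == {"superblock", "bitmap"}   # written in place, via the wandered set
assert reloc == {"root", "twig", "leaf"}  # may be written to new locations
```

In Reiser4 proper, layout policy further splits the relocatable set into blocks actually relocated and blocks overwritten anyway; this sketch shows only the structural criterion.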
The wandered set is the set of blocks to which the overwrite set will be written temporarily, until the transaction commits. An interesting definition is the minimum overwrite set, which uses the same definitions as above with the following modification: if at least two dirty blocks have a common parent that is clean, then that parent is added to the minimum overwrite set, and its dirty children are removed from the overwrite set and placed in the relocate set. This policy is an example of what will be experimented with in later versions of Reiser4 using the layout toolkit. For space reasons, we leave out the full details of exactly when we relocate vs. overwrite, and the reader should not regret this, because years of experimenting probably lie ahead before we can speak with the authority necessary for a published paper on the effects of the many details and variations possible.

When we commit, we write a wander list, which consists of a mapping from the wandered set to the overwrite set. The wander list is a linked list of blocks containing pairs of block numbers. The last act of committing a transaction is to update the superblock to point to the front of that list. Once that is done, if there is a crash, crash recovery will go through that list and "play" it, which means to write the wandered set over the overwrite set. If there is not a crash, we will also play it. There are many more details of how we handle the deallocation of wandered blocks, the handling of bitmap blocks, and so forth. You are encouraged to read the comments at the top of our source code files (e.g. wander.c) for such details....

Journalling optimizations

Copy-on-capture

Suppose one wants to capture a node which belongs to an atom with stage >= ASTAGE_PRE_COMMIT. Normally this capture request would have to wait (sleep in capture_fuse_wait()) while the atom commits. The copy-on-capture optimization allows the capture request to be satisfied instead by creating a copy of the node being captured.
The commit process takes control of one copy of the node, and the capturing process takes control of the other. This does not lead to any node version conflicts, because it is guaranteed that the copy belonging to the commit process will not be modified.

Steal-on-capture

The idea of the steal-on-capture optimization is that only the last committed transaction to modify an overwrite block actually needs to write that block; other transactions can skip writing that block after they commit. This optimization, which is also present in ReiserFS version 3, means that frequently modified overwrite blocks will be written less than twice per transaction. With this optimization a frequently modified overwrite block may avoid being overwritten by a series of atoms; as a result, crash recovery must replay more atoms than without the optimization. If an atom has overwrite blocks stolen, the atom must be replayed during crash recovery until every stealing atom commits.

Repacker

Another way of escaping from the balancing time vs. space efficiency tradeoff is to use a repacker. 80% of files on the disk remain unchanged for long periods of time. It is efficient to pack them perfectly, by using a repacker that runs much less often than every write to disk. This repacker goes through the entire tree ordering, from left to right and then from right to left, alternating each time it runs. When it goes from left to right in the tree ordering, it shoves everything as far to the left as it will go, and when it goes from right to left it shoves everything as far to the right as it will go. (Left means small in key or in block number:-) ) In the absence of FS activity, the effect of this over time is to sort by tree order (defragment) and to pack with perfect efficiency. Reiser4.1 will modify the repacker to insert controlled "air holes", as it is well known that insertion efficiency is harmed by overly tight packing.
I hypothesize that it is more efficient to periodically run a repacker that systematically repacks using large IOs than to perform lots of 1-block reads of neighboring nodes of the modification points so as to preserve a balancing invariant in the face of poorly localized modifications to the tree.

Plugins

[Illustration: man holding 3 plugins]

8 Kinds of Plugins Make Reiser4 The Most Tweakable Filesystem Going

File Plugins

Every file possesses a plugin id, and every directory possesses a plugin id. This plugin id will identify a set of methods. The set of methods will embody all of the different possible interactions with the file or directory that come from sources external to ReiserFS. It is a layer of indirection added between the external interface to ReiserFS and the rest of ReiserFS. Each method will have a methodid. It will be usual to mix and match methods from other plugins when composing plugins.

Directory Plugins

Reiser4 will implement a plugin for traditional directories. It will implement directory style access to file attributes as part of the plugin for regular files. Later we will describe why this is useful. Other directory plugins we will leave for later versions. There is no deep reason for this deferral. It is simply the randomness of what features attract sponsors and make it into a release specification; there are no sponsors at the moment for additional directory plugins. I have no doubt that they will appear later; new directory plugins will be too much fun to miss out on.:-)

Hash Plugins

A directory is a mapping from file names to files. This mapping is implemented using Reiser4's internal balanced tree. Unfortunately, file names cannot be used as keys until keys of variable length are implemented, or unreasonable limitations on maximal file name length are imposed. To work around this, the file name is hashed, and the hash is used as a key in the tree. No hash function is perfect, and there will always be hash collisions, that is, file names having the same hash value.
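To see why collisions are unavoidable in principle, consider a deliberately weak toy hash. ReiserFS's real hash plugins (r5, tea, rupasov) are far stronger, but any function mapping unbounded names into fixed-size keys must collide on some inputs:

```c
#include <stdint.h>

/* Deliberately weak toy hash: sums the bytes of the name.  Any two
 * names that are permutations of each other collide.  Real directory
 * hashes make collisions rare, but cannot make them impossible. */
static uint32_t toy_hash(const char *name)
{
    uint32_t h = 0;
    while (*name)
        h += (unsigned char)*name++;
    return h;
}
```

Here "ab" and "ba" hash identically, so the hash alone cannot serve as a unique directory key; some collision-resolution mechanism is required.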
Previous versions of reiserfs (3.5 and 3.6) used a "generation counter" to overcome this problem: keys for file names having the same hash value were distinguished by having different generation counters. This made it possible to disambiguate hash collisions, at the cost of reducing the number of bits used for hashing. This "generation counter" technique is actually an ad hoc form of support for non-unique keys. Keeping in mind that some form of this has to be implemented anyway, it seemed justifiable to implement more regular support for non-unique keys in Reiser4.

Another reason for using hashes is that some (arguably brain-dead) interfaces require them: telldir(3) and seekdir(3). These functions presume that the file system can issue 64 bit "cookies" that can be used to resume a readdir. Cookies are implemented in most filesystems as byte offsets within a directory (which means those filesystems cannot shrink directories), and in ReiserFS as hashes of file names plus a generation counter. Curiously enough, the Single UNIX Specification tags telldir(3) and seekdir(3) as "Extension", because "returning to a given point in a directory is quite difficult to describe formally, in spite of its intuitive appeal, when systems that use B-trees, hashing functions, or other similar mechanisms to order their directories are considered".

We order directory entries in ReiserFS by their cookies. This costs us performance compared to ordering lexicographically (but is immensely faster than the linear searching employed by most other Unix filesystems). Depending on the hash and its match to the application usage pattern, there may be more or less performance loss. Hash plugins will probably remain until version 5 or so, when directory plugins and ordering function plugins will obsolete them. Directory entries will then be ordered by file names like they should be (and possibly stem compressed as well).

Security Plugins

Security plugins handle all security checks.
They are normally invoked by file and directory plugins. Example of reading a file:

* Access the pluginid for the file.
* Invoke the read method for the plugin.
* The read method determines the security plugin for the file.
* That security plugin invokes its read check method for determining whether to permit the read.
* The read check method for the security plugin reads file/attributes containing the permissions on the file.
* Since file/attributes are also files, this means invoking the plugin for reading the file/attribute.
* The pluginid for this particular file/attribute for this file happens to be inherited (saving space and centralizing control of it).
* The read method for the file/attribute is coded such that it does not check permissions when called by a security plugin method. (Endless recursion is thereby avoided.)
* The file/attribute plugin employs a decompression algorithm specially designed for efficient decompression of our encoding of ACLs.
* The security plugin determines that the read should be permitted.
* The read method continues and completes.

Item Plugins

The balancing code will be able to balance an item iff it has an item plugin implemented for it. The item plugin will implement each of the methods the balancing code needs (methods such as splitting items, estimating how large the split pieces will be, overwriting, appending to, cutting from, or inserting into the item, etc.). In addition to all of the balancing operations, item plugins will also implement intra-item search methods. V3 of ReiserFS understood the structure of the items it balanced. This made adding new types of items, storing such new security attributes as other researchers might develop, too expensive in coding time, greatly inhibiting their addition to ReiserFS.
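An item plugin's method table can be sketched as a struct of function pointers. The names and signatures below are hypothetical illustrations, not Reiser4's actual struct; the point is that the balancer dispatches through the table and never interprets item bytes itself, so a new item type needs only a new table:

```c
#include <stddef.h>

/* Hypothetical method table for an item plugin.  The balancing code
 * calls only these operations; the item's internal layout is known
 * only to the plugin that implements them. */
struct item_plugin {
    /* estimate how large the pieces of a split would be */
    size_t (*estimate_split)(const void *item, size_t len);
    /* split 'item' at unit 'pos' into 'left' and 'right' */
    int (*split)(void *item, size_t pos, void *left, void *right);
    int (*append)(void *item, const void *data, size_t len);
    int (*cut)(void *item, size_t from, size_t count);
    int (*overwrite)(void *item, size_t at, const void *data, size_t len);
    /* intra-item search: locate a unit by key within the item */
    int (*search)(const void *item, const void *key, size_t *pos_out);
};

/* A trivial plugin that accepts appends and implements nothing else. */
static int nop_append(void *item, const void *data, size_t len)
{
    (void)item; (void)data; (void)len;
    return 0; /* report success without storing anything */
}

static const struct item_plugin toy_item = { .append = nop_append };
```

Writing an item handler then means filling in this table, rather than modifying the balancing code itself.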
In writing Reiser4 we hoped that there would be a great proliferation in the types of security attributes in ReiserFS if we made adding one a matter not of modifying the balancing code (a job for our most experienced programmers) but of writing an item handler. This is necessary if we are to achieve our goal of making the addition of each new security attribute an order of magnitude or more easier to perform than it is now.

Key Assignment Plugins

When assigning the key to an item, the key assignment plugin is invoked, and it has a key assignment method for each item type. A single key assignment plugin is defined for the whole FS at FS creation time. We know from experience that there is no "correct" key assignment policy; squid has very different needs from average user home directories. Yes, there could be value in varying it more flexibly than just at FS creation time, but we have to draw the line somewhere when deciding what goes into each release....

Node Search and Item Search Plugins

Every node layout has a search method for that layout, and every item that is searched through has a search method for that item. (When doing searches, we search through a node to find an item, and then search within the item for those items that contain multiple things to find.)

Putting Your New Plugin To Work Will Mean Recompiling

If you want to add a new plugin, we think your having to ask the sysadmin to recompile the kernel with your new plugin added to it will be acceptable for version 4.0. We will initially code plugin-id lookup as an in-kernel fixed length array lookup, methodids as function pointers, and make no provision for post-compilation loading of plugins. Performance and coding cost motivate this.
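The fixed-length array lookup can be sketched as follows. This is a minimal model under assumed names (the table size, `struct file_plugin`, and `unix_file` are illustrative, not Reiser4's actual definitions):

```c
#include <stddef.h>

#define MAX_PLUGINS 64 /* fixed at compile time; no runtime loading */

struct file_plugin {
    const char *label;
    int (*read)(void *file, void *buf, size_t len);
};

/* Stub read method standing in for a real plugin's implementation. */
static int unix_file_read(void *file, void *buf, size_t len)
{
    (void)file; (void)buf; (void)len;
    return 0;
}

static struct file_plugin unix_file = { "unix_file", unix_file_read };

/* Plugin-id lookup is a plain array index: O(1), no hashing, and no
 * provision for post-compilation loading, exactly as described. */
static struct file_plugin *plugin_table[MAX_PLUGINS];

static struct file_plugin *plugin_by_id(unsigned int id)
{
    return id < MAX_PLUGINS ? plugin_table[id] : NULL;
}
```

A file's stored plugin id indexes straight into the table, and the methodids within the plugin are simply function pointers invoked directly.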
[Illustration: character almost drowning while another character hands him a plugin]

Without Plugins We Will Drown

People often ask, as ReiserFS grows in features, how we will keep the design from being drowned under the weight of the added complexity and from reaching the point where it is difficult to work on the code. The infrastructure to support security attributes implemented as files also enables lots of features not necessarily security related. The plugins we are choosing to implement in v4.0 are all security related because of our funding source, but users will add other sorts of plugins, just as they took DARPA's TCP/IP and used it for non-military computers. Only by requiring that all features be implemented in the manner that maximizes code reuse will we keep ReiserFS coding complexity down to where we can manage it over the long term.

Plugins: FS Programming For The Lazy

Most plugins will have only a very few of their features unique to them; the rest of the plugin will be reused code. What Namesys sees as its role as a DARPA contractor is not primarily supplying a suite of security plugins, though we are doing that, but creating an architectural (not just the license) enabling of lots of outside vendors to efficiently create lots of innovative security plugins that Namesys would never have imagined if working by itself.

Enhancing Security

[Illustration: superman character complaining about emergency]

By far most casualties in wars have always been civilians. In future information infrastructure attacks, who will take more damage, civilian or military installations? DARPA is funding us to make all Gnu/Linux computers throughout the world a little bit more resistant to attack.

Fine Graining Security

Good Security Requires Precision In Specification Of Security

Suppose you have a large file with many components. A general principle of security is that good security requires precision of permissions.
When security lacks precision, it increases the burden of being secure; the extent to which users adhere to security requirements in practice is a function of the burden of adhering to them.

Space Efficiency Concerns Motivate Imprecise Security

Many filesystems make it space-inefficient to store small components as separate files, for various reasons. Not being separate files means that they cannot have separate permissions. One of the reasons for using overly aggregated units of security is thus space efficiency. ReiserFS currently improves this by an order of magnitude over most of the existing alternative art. Space efficiency is the hardest of the reasons to eliminate; its elimination makes it that much more enticing to attempt to eliminate the other reasons.

Security Definition Units And Data Access Patterns Sometimes Inherently Don't Align

Applications sometimes want to operate on a collection of components as a single aggregated stream. (Note that commonly two different applications want to operate on data with different levels of aggregation; the infrastructure for solving this security issue will solve that problem as well.)

/etc/passwd As Example

I am going to use the /etc/passwd file as an example, not because I think that other solutions won't solve its problems better, but because its implementation as a single flat file in the early Unixes is a wonderful illustrative example of poorly granularized security whose problems readers may share my personal experiences with. I hope they will be able to imagine that other, less famous data files could have similar problems. Have you ever tried to figure out just exactly what part of your continually changing /etc/passwd file changed near the time of a break-in? Have you ever wished that you could have a modification time on each field in it?
Have you ever wished that users could change part of it, such as the gecos field, themselves (setuid utilities have been written to allow this, but this is a pedagogical, not a practical, example), but not have the power to change it for other users? There were good reasons why /etc/passwd was first implemented as a single file with one single permission governing the entire file. If we can eliminate them one by one, the same techniques for making finer grained security effective will be of value to other highly secure data files.

Aggregating Files Can Improve The User Interface To Them

Consider the use of emacs on a collection of a thousand small 8-32 byte files, like you might have if you deconstructed /etc/passwd into small files with separable ACLs for every field. It is more convenient in screen real estate, buffer management, and other user interface considerations to operate on them as an aggregation all placed into a single buffer rather than as a thousand 8-32 byte buffers.

How Do We Write Modifications To An Aggregation

Suppose we create a plugin that aggregates all of the files in a directory into a single stream. How does one handle writes to that aggregation that change the length of the components of that aggregation? Richard Stallman pointed out to me that if we separate the aggregated files with delimiters, then emacs need not be changed at all to acquire an effective interface for large numbers of small files accessed via an aggregation plugin. If /new_syntax_access_path/big_directory_of_small_files/.glued is a plugin that aggregates every file in big_directory_of_small_files with a delimiter separating every file within the aggregation, then one can simply type emacs /new_syntax_access_path/big_directory_of_small_files/.glued, and the filesystem has done all the work emacs needs to be effective at this. Not a line of emacs needs to be changed.
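The read side of such an aggregation reduces to concatenation with delimiters. The following is a pure user-space toy, assuming the file contents are already in memory; the real plugin would of course assemble the stream inside the filesystem:

```c
#include <stdio.h>
#include <stddef.h>

/* Join 'n' small "file" contents into one stream, separating them
 * with 'delim', as an aggregation plugin would present them to an
 * application such as emacs.  Returns the number of bytes written. */
static size_t glue(char *out, size_t outsz,
                   const char *const files[], size_t n, const char *delim)
{
    size_t used = 0;
    for (size_t i = 0; i < n && used < outsz; i++) {
        int w = snprintf(out + used, outsz - used, "%s%s",
                         i ? delim : "", files[i]);
        if (w < 0)
            break;
        used += (size_t)w;
    }
    return used;
}
```

With ":" as the delimiter, three field-files "root", "x", "0" glue into the familiar "root:x:0", which is how the passwd example's colon-separated fields would be reconstituted.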
One needs to be able to choose different delimiting syntax for different aggregation plugins so that one can, say for the passwd file, aggregate subdirectories into lines, and files within those subdirectories into colon-separated fields within the line. XML would benefit from yet other delimiter construction rules. (We have been told by Philipp Guehring of LivingXML.NET that ReiserFS is higher performance than any database for storing XML, so this issue is not purely theoretical.)

Aggregation Is Best Implemented As Inheritance

In summary, to be able to achieve precision in security we need to have inheritance with specifiable delimiters, and we need whole file inheritance to support ACLs.

One Plugin Using Delimiters That Resemble sys_reiser4() Syntax

We provide the infrastructure for constructing plugins that implement arbitrary processing of writes to inheriting files, but we also supply one generic inheriting file plugin that intentionally uses delimiters very close to the sys_reiser4() syntax. We will document the syntax more fully when that code is working; for now, syntax details are in the comments in the file invert.c in the source code.

API Suitable For Accessing Files That Store Security Attributes

A new system call sys_reiser4() will be implemented to support applications that don't have to be fooled into thinking that they are using POSIX. Through this entry point a richer set of semantics will access the same files that are also accessible using POSIX calls. Reiser4() will not implement more than hierarchical names. A full set theoretic naming system as described on our future vision page will not be implemented before SSN Reiserfs is implemented. (Distributed Reiserfs is our distributed filesystem, Semi-Structured Naming Reiserfs is our enhanced semantics; whether we implement Distributed Reiserfs or SSN Reiserfs first depends on which sponsors we find ;-) )
Reiser4() will implement all features necessary to access ACLs as files/directories rather than as something neither file nor directory. These include opening and closing transactions, performing a sequence of I/Os in one system call, and accessing files without use of file descriptors (necessary for efficient small I/O). Reiser4() will use a syntax suitable for evolving into SSN Reiserfs syntax with its set theoretic naming.

Flaws In Traditional File API When Applied To Security Attributes

Security related attributes tend to be small. The traditional filesystem API for reading and writing files has these flaws in the context of accessing security attributes:

* Creating a file descriptor is excessive overhead and not useful when accessing an 8 byte attribute.
* A system call for every attribute accessed is too much overhead when accessing lots of little attributes.
* Lacking constraints: it is important to constrain what is written to the attribute, often in complex ways.
* Lacking atomic semantics: often one needs to update multiple attributes as one action that is guaranteed to either fully succeed or fully fail.

The Usual Resolution Of These Flaws Is A One-Off Solution

The usual response to these flaws is that people adding security related and other attributes create a set of methods unique to their attributes, plus non-reusable code to implement those methods, in which their particular attributes are accessed and stored not using the methods for files, but using their particular methods for that attribute. Their particular API for that attribute typically does a one-off instantiation of a lightweight, single-system-call, write-constrained, atomic access, with no code being reusable by those who want to modify file bodies. It is basic and crucial to system design to decompose desired functionality into reusable, orthogonal, separated components.
Persons designing security attributes are typically doing it without the filesystem they are targeting offering them a proper foundation and toolkit. They need more help from us core FS developers. Linus said that we can have a system call to use as our experimental plaything in this. With what I have in mind for the API, one rather flexible system call is all we want for creating atomic, lightweight, batched, constrained accesses to files, with each of those adjectives being an orthogonal optional feature that may or may not be invoked in a particular instance of the new system call.

One-Off Solutions Are A Lot of Work To Do A Lot Of

Looking at the coin from the other side, we want to make it an order of magnitude less work to add features to ReiserFS so that both users and Namesys can add at least an order of magnitude more of them. To verify that it is truly more extensible you have to do some extending, and our DARPA funding motivates us to instantiate most of those extensions as new security features. This system call's syntax enables attributes to be implemented as a particular type of file. It avoids uglifying the semantics with two APIs for two supposedly different kinds of objects that don't truly need different treatment. All of its special features that are useful for accessing particular attributes are also available for use on files. It has symmetry, and its features have been fully orthogonalized. There is nothing particularly interesting about this system call to a languages specialist (its ideas were explored decades ago, except by filesystem developers) until SSN Reiserfs, when we will further evolve it into a set theoretic syntax that deconstructs tuple structured names into hierarchy and vicinity set intersection. That is described at www.namesys.com/whitepaper.html.

Steps For Creating A Security Attribute

You can create a new security attribute by:

* Defining a pluginid.
* Composing a set of methods for the plugin from ones you create or reuse from other existing plugins.
* Defining a set of items that act as the storage containers of the object, or reusing existing items from other plugins (e.g. regular files).
* Implementing item handlers for all of the new items you create.
* Creating a key assignment algorithm for all of the new items.

reiser4() System Call Description

The reiser4() system call (still being debugged at the time of writing) executes a sequence of commands separated by commas. Assignment and transaction are the commands supported in Reiser4(); more commands will appear in SSN Reiserfs. <- and <<- are two of the assignment operators.

lhs (assignment target) values:

* /..../process/range/(offset<-(loff_t),last_byte<-(loff_t)) assigns (writes) to the buffer starting at address offset in the process address space, ending at last_byte. (The assignment source may be smaller or larger than the assignment target.) Representation of offset and last_byte is left to the coder to determine; it is an issue that will be of much dispute and little importance. Notice that / is used to indicate that the order of the operands matters; see the future vision whitepaper for details of why this is appropriate syntax design. Note the lack of a file descriptor.
* /filename assigns to the file named filename.
* /filename/..../range/(offset<-(loff_t),last_byte<-(loff_t)) writes to the body, starting at offset, ending not past last_byte.
* /filename/..../range/(offset<-(loff_t)) writes to the body starting at offset.

rhs (assignment source) values:

* /..../process/range/(offset<-(loff_t),last_byte<-(loff_t)) reads from the buffer starting at address offset in the process address space, ending at last_byte. Representation of offset and last_byte is left to the coder to determine, as it is an issue that will be of much dispute and little importance.
* /filename reads the entirety of the file named filename.
* /filename/..../range/(offset<-(loff_t),last_byte<-(loff_t)) reads from the body, starting at offset, ending not past last_byte.
* /filename/..../range/(offset<-(loff_t)) reads from the body starting at offset until the end.
* /filename/..../stat/owner reads from the ownership field of the stat data. (Stat data is that which is returned by the stat() system call (owner, permissions, etc.) and stored on a per file basis by the FS.)

Note that "...." and "process" are style conventions for the name of a hidden subdirectory implementing methods and accessing metadata supported by a plugin. It is possible to rename it, etc. We had a discussion about whether to instead use names that could not clash with any legitimate name likely to be used by users. Vladimir Demidov suggested that cryptic names have historically harmed the acceptance of several languages, and so it was realized that being novice unfriendly in the naming was worse than risking a name collision, especially since a collision could be cured by using rename on "...." and "process" for the few cases where it is necessary.

Constraints

(Note: this is not yet coded.) Another way security may be insufficiently fine grained is in values: it can be useful to allow persons to change data, but only within certain constraints. For this project we will implement plugins; one type of plugin will be write constraints. Write-constraints are invoked upon write to a file; if they return non-error then the write is allowed. We will implement two trivial sample write-constraint plugins. One will be in the form of a kernel function, loadable as a kernel module, which returns non-error (thus allowing the write) if the file consists of the strings "secret" or "sensitive" but not "top-secret". The other, which does exactly the same, will be in the form of a perl program residing in a file and executed in user-space.
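Whether packaged as a kernel module or as a user-space program fed the proposed contents on standard input, the sample constraint above reduces to a predicate like the following sketch (a toy, not the actual plugin code):

```c
#include <stdbool.h>
#include <string.h>

/* Allow the write iff the proposed contents contain "secret" or
 * "sensitive" but not "top-secret".  Since "top-secret" contains
 * "secret" as a substring, the negative check must come first. */
static bool write_allowed(const char *contents)
{
    if (strstr(contents, "top-secret") != NULL)
        return false;
    return strstr(contents, "secret") != NULL ||
           strstr(contents, "sensitive") != NULL;
}
```

A write of "this is secret" would be permitted, while "top-secret plans" would be rejected; the kernel and user-space variants differ only in where this check runs.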
Use of kernel functions will have performance advantages, particularly for small functions, but severe disadvantages in power of scripting, flexibility, and ability to be installed from non-secure sources. Both types of plugins will have their place. Note that ACLs will also embody write constraints. We will implement both constraints that are compiled into the kernel and constraints that are implemented as user space processes. Specifically, we will implement a plugin that executes an arbitrary constraint contained in an arbitrarily named file as a user space process, passes the proposed new file contents to that process as standard input, and iff the process exits without error allows the write to occur. It can be useful to have read constraints as well as write constraints.

Auditing

(Note: this is not yet coded.) We will implement a plugin that notifies administrators by email when access is made to files, e.g. read access. With each plugin implemented, creating additional plugins becomes easier as the available toolkit is enriched. Auditing constitutes a major additional security feature, yet it will be easy to implement once the infrastructure to support it exists. (It would be substantial work to implement it without that infrastructure.) The scope of this project is not the creation of the plugins themselves, but the creation of the infrastructure that plugin authors would find useful. We want to enable future contributors to implement more secure systems on the Gnu/Linux platform, not implement them ourselves. By laying a proper foundation and creating a toolkit for them, we hope to reduce the cost of coding new security attributes for those who follow us by an order of magnitude. Employing a proper set of well orthogonalized primitives also changes the addition of these attributes from being a complexity burden upon the architecture into being an empowering extension of the architecture.
Increasing the Allowed Granularity of Security

[Illustration: man holding a sieve; only objects of a certain size go through]

(This feature is not yet coded.) Inheritance of security attributes is important to providing flexibility in their administration. We have spoken about making security more fine grained, but sometimes it needs to be larger grained. Sometimes a large number of files are logically one unit in regard to their security, and it is desirable to have a single point of control over their security. Inheritance of attributes is the mechanism for implementing that. Security administrators should have the power to choose whatever units of security they desire without having to distort them to make them correspond to semantic units. Inheritance of file bodies using aggregation plugins allows the units of security to be smaller than files; inheritance of attributes allows them to be larger than files.

Encryption On Commit

Currently, encrypted files suffer severely in their write performance when implemented using schemes that encrypt at every write() rather than at every commit to disk. We encrypt on flush, such that a file with an encryption plugin id is encrypted not at the time of write, but at the time of flush to disk. Encryption is implemented as a special form of repacking on flush, and it occurs for any node which has its CONTAINS_ENCRYPTED_DATA state flag set.

Conclusion

Reiser4 offers a dramatically better infrastructure for creating new filesystem features. Files and directories have all of the features needed to make it unnecessary for file attributes to be something different from files. The effectiveness of this new infrastructure is tested using a variety of new security features. Performance is greatly improved by the use of dancing trees, wandering logs, allocate on flush, a repacker, and encryption on commit. It was an important question whether we could increase the level of abstraction in our design without harming performance.
Reiser4 gives you BOTH the most cleanly abstracted storage AND the highest performance storage of any filesystem. HOME Citations: * [Gray93] Jim Gray and Andreas Reuter. "Transaction Processing: Concepts and Techniques". Morgan Kaufmann Publishers, Inc., 1993. Old but good textbook on transactions. Available at http://www.mkp.com/books_catalog/catalog.asp?ISBN=1-55860-190-2 * [Hitz94] D. Hitz, J. Lau and M. Malcolm. "File system design for an NFS file server appliance". Proceedings of the 1994 USENIX Winter Technical Conference, pp. 235-246, San Francisco, CA, January 1994 Available at http://citeseer.nj.nec.com/hitz95file.html * [TR3001] D. Hitz. "A Storage Networking Appliance". Tech. Rep TR3001, Network Appliance, Inc., 1995 Available at http://www.netapp.com/tech_library/3001.html * [TR3002] D. Hitz, J. Lau and M. Malcolm. "File system design for an NFS file server appliance". Tech. Rep. TR3002, Network Appliance, Inc., 1995 Available at http://www.netapp.com/tech_library/3002.html * [Ousterh89] J. Ousterhout and F. Douglis. "Beating the I/O Bottleneck: A Case for Log-Structured File Systems". ACM Operating System Reviews, Vol. 23, No. 1, pp.11-28, January 1989 Available at http://citeseer.nj.nec.com/ousterhout88beating.html * [Seltzer95] M. Seltzer, K. Smith, H. Balakrishnan, J. Chang, S. McMains and V. Padmanabhan. "File System Logging versus Clustering: A Performance Comparison". Proceedings of the 1995 USENIX Technical Conference, pp. 249-264, New Orleans, LA, January 1995 Available at http://citeseer.nj.nec.com/seltzer95file.html * [Seltzer95Supp] M. Seltzer. "LFS and FFS Supplementary Information". 1995 http://www.eecs.harvard.edu/~margo/usenix.195/ * [Ousterh93Crit] J. Ousterhout. "A Critique of Seltzer's 1993 USENIX Paper" http://www.eecs.harvard.edu/~margo/usenix.195/ouster_critique1.html * [Ousterh95Crit] J. Ousterhout. "A Critique of Seltzer's LFS Measurements" http://www.eecs.harvard.edu/~margo/usenix.195/ouster_critique2.html * [SwD96] A. 
Sweeney, D. Doucette, W. Hu, C. Anderson, M. Nishimoto and G. Peck. "Scalability in the XFS File System". Proceedings of the 1996 USENIX Technical Conference, pp. 1-14, San Diego, CA, January 1996. Available at http://citeseer.nj.nec.com/sweeney96scalability.html
* [VelskiiLandis] G.M. Adel'son-Vel'skii and E.M. Landis. "An algorithm for the organization of information". Soviet Math. Doklady 3, pp. 1259-1262, 1962. This paper on AVL trees can be thought of as the founding paper of the field of storing data in trees. Those not conversant in Russian will want to read the [Lewis and Denenberg] treatment of AVL trees in its place. [Wood] contains a modern treatment of trees.
* [Apple] "Inside Macintosh: Files". Apple Computer Inc., Addison-Wesley, 1992. Employs balanced trees for filenames; it was an interesting filesystem architecture for its time in a number of ways, but its problems with internal fragmentation have become more severe as disk drives have grown larger. I look forward to the replacement they are working on.
* [Bach] Maurice J. Bach. "The Design of the Unix Operating System". Prentice-Hall Software Series, Englewood Cliffs, NJ, 1986. Superbly written but sadly dated; contains detailed descriptions of the filesystem routines and interfaces in a manner especially useful for those trying to implement a Unix-compatible filesystem. See [Vahalia].
* [BLOB] R. Haskin and Raymond A. Lorie. "On Extending the Functions of a Relational Database System". SIGMOD Conference, 1982, pp. 207-212 (body of paper not on web). Reiser4 obsoletes this approach.
* [Chen] P.M. Chen and David A. Patterson. "A New Approach to I/O Performance Evaluation---Self-Scaling I/O Benchmarks, Predicted I/O Performance". 1993 ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems; also available on Chen's web page.
* [C-FFS] Gregory R. Ganger and M. Frans Kaashoek. "Embedded Inodes and Explicit Grouping: Exploiting Disk Bandwidth for Small Files."
A very well written paper focused on 1-10k file size issues; they use some similar notions (most especially their concept of grouping, compared to my packing localities). Note that they focus on the 1-10k file size range, and not the sub-1k range. The 1-10k range is the weak point in ReiserFS V3 performance. The page with a link to the PostScript paper is available at http://amsterdam.lcs.mit.edu/papers/cffs.html
* [ext2fs] by Remi Card; extensive information and source code are available. Probably our toughest current competitor, though it is showing its age, and recent enhancements of it (journaling, htrees, etc.) have not been performance effective. It embodies both the strengths and weaknesses of the incrementalist approach to coding, and substantially resembles the older FFS filesystem from BSD.
* [FFS] M. McKusick, W. Joy, S. Leffler, R. Fabry. "A Fast File System for UNIX". ACM Transactions on Computer Systems, Vol. 2, No. 3, pp. 181-197, August 1984. Describes the implementation of a filesystem which employs parent directory location knowledge in determining file layout. It uses large blocks for all but the tail of files to improve I/O performance, and uses small blocks called fragments for the tails so as to reduce the cost due to internal fragmentation. Numerous other improvements are also made to what was once the state of the art. FFS remains the architectural foundation for many current block allocation filesystems, and was later bundled with the standard Unix releases. Note that unrequested serialization and the use of fragments place it at a performance disadvantage to ext2fs, though whether ext2fs is thereby made less reliable is a matter of dispute that I take no position on (Reiser4 is an atomic filesystem, which is a different level of reliability entirely). Available at http://citeseer.nj.nec.com/mckusick84fast.html
* [Ganger] Gregory R. Ganger, Yale N. Patt. "Metadata Update Performance in File Systems".
(Abstract only.)
* [Gifford] Describes a filesystem enriched to have more than hierarchical semantics; he shares many goals with this author, so forgive me for thinking his work worthwhile. If I had to suggest one improvement in a sentence, I would say his semantic algebra needs closure. (PostScript only.)
* [Hitz, Dave] A rather well designed filesystem optimized for NFS and RAID in combination. Note that RAID increases the merits of write-optimization in block layout algorithms. Available at http://www.netapp.com/technology/level3/3002.html
* [Holton and Das] Holton, Mike and Das, Raj. "The XFS space manager and namespace manager use sophisticated B-Tree indexing technology to represent file location information contained inside directory files and to represent the structure of the files themselves (location of information in a file)." Note that it is still a block (extent) allocation based filesystem; no attempt is made to store the actual file contents in the tree. It is targeted at the needs of the other end of the file size usage spectrum from ReiserFS, and is an excellent design for that purpose (though most filesystems including Reiser4 do well at writing large files, and I think it is medium-sized and smaller files where filesystems can substantively differentiate themselves). SGI has also traditionally been a leader in resisting the use of unrequested serialization of I/O. Unfortunately, the paper is a bit vague on details. Available at http://www.sgi.com/Technology/xfs-whitepaper.html
* [Howard] Howard, J.H., Kazar, M.L., Menees, S.G., Nichols, D.A., Satyanarayanan, M., Sidebotham, R.N., West, M.J. "Scale and Performance in a Distributed File System". ACM Transactions on Computer Systems, 6(1), February 1988. A classic benchmark; it was too CPU bound to effectively stress ext2fs and ReiserFS, and is no longer very effective for modern filesystems.
* [Knuth] Knuth, D.E. The Art of Computer Programming, Vol.
3 (Sorting and Searching). Addison-Wesley, Reading, MA, 1973. The earliest reference discussing trees storing records of varying length.
* [LADDIS] Wittle, Mark and Keith, Bruce. "LADDIS: The Next Generation in NFS File Server Benchmarking". Proceedings of the Summer 1993 USENIX Conference, July 1993, pp. 111-128.
* [Lewis and Denenberg] Lewis, Harry R. and Denenberg, Larry. "Data Structures & Their Algorithms". HarperCollins Publishers, NY, NY, 1991. An algorithms textbook suitable for readers wishing to learn about balanced trees and their AVL predecessors.
* [McCreight] McCreight, E.M. "Pagination of B*-trees with variable length records". Commun. ACM 20 (9), pp. 670-674, 1977. Describes algorithms for trees with variable length records.
* [McVoy and Kleiman] The implementation of write-clustering for Sun's UFS. Available at http://www.sun.ca/white-papers/ufs-cluster.html
* [OLE] "Inside OLE" by Kraig Brockschmidt; discusses Structured Storage (abstract only). Structured storage is what you get when application developers need features to better manage the storage of objects on disk by the applications they write, and the filesystem group at their company can't be bothered with them. Miserable performance, miserable semantics. Available at http://www.microsoft.com/mspress/books/abs/5-843-2b.htm
* [Ousterhout] J.K. Ousterhout, H. Da Costa, D. Harrison, J.A. Kunze, M.D. Kupfer, and J.G. Thompson. "A Trace-driven Analysis of the UNIX 4.2BSD File System". In Proceedings of the 10th Symposium on Operating Systems Principles, pp. 15-24, Orcas Island, WA, December 1985.
* [NTFS] "Inside the Windows NT File System", written by Helen Custer; NTFS was architected by Tom Miller with contributions by Gary Kimura, Brian Andrew, and David Goebel. Microsoft Press, 1994. An easy to read little book. They fundamentally disagree with me on adding serialization of I/O not requested by the application programmer, and I note that the performance penalty they pay for their decision is high, especially compared with ext2fs. Their FS design is perhaps optimal for floppies and other hardware-eject media beyond OS control. A less serialized, higher performance log structured architecture is described in [Rosenblum and Ousterhout]. That said, Microsoft is to be commended for recognizing the importance of attempting to optimize for small files, and leading the OS designer effort to integrate small objects into the file name space. This book is notable for not referencing the work of persons not working for Microsoft, or providing any form of proper attribution to previous authors such as [Rosenblum and Ousterhout]. Though perhaps they really didn't read any of the literature, and that explains why theirs is the worst performing filesystem in the industry....
* [Peacock] K. Peacock. "The CounterPoint Fast File System". Proceedings of the Usenix Conference, Winter 1988.
* [Pike] Rob Pike and Peter Weinberger. "The Hideous Name". USENIX Summer 1985 Conference Proceedings, p. 563, Portland, Oregon, 1985. Short, informal, and drives home why inconsistent naming schemes in an OS are detrimental. Available at http://achille.cs.bell-labs.com/cm/cs/doc/85/1-05.ps.gz. His discussion of naming in Plan 9: http://plan9.bell-labs.com/plan9/doc/names.html
* [Rosenblum and Ousterhout] M. Rosenblum and J. Ousterhout. "The Design and Implementation of a Log-Structured File System". ACM Transactions on Computer Systems, Vol. 10, No. 1, pp. 26-52, February 1992. Available at http://citeseer.nj.nec.com/rosenblum91design.html.
This paper was quite influential in a number of ways on many modern filesystems, and the notion of using a cleaner may be applied to a future release of ReiserFS. There is an interesting ongoing debate over the relative merits of FFS vs. LFS architectures; the interested reader may peruse http://www.scriptics.com/people/john.ousterhout/seltzer93.html and the arguments by Margo Seltzer it links to.
* [Snyder] "tmpfs: A Virtual Memory File System". Discusses a filesystem built to use swap space and intended for temporary files; due to a complete lack of disk synchronization it offers extremely high performance.
* [Vahalia] Uresh Vahalia. "Unix Kernel Internals".
* [Reiser93] Reiser, Hans T. Future Vision Whitepaper, 1984, revised 1993. Available at http://www.namesys.com/whitepaper.html.
[[category:Reiser4]] [[category:Formatting-fixes-needed]]
Reasons why Reiser4 is great for you:
* Reiser4 is the fastest filesystem, and here are the benchmarks.
* Reiser4 is an atomic filesystem, which means that your filesystem operations either entirely occur, or they entirely don't, and they don't corrupt due to half occurring. We do this without significant performance losses, because we invented algorithms to do it without copying the data twice.
* Reiser4 uses dancing trees, which obsolete the balanced tree algorithms used in databases (see farther down). This makes Reiser4 more space efficient than other filesystems, because we squish small files together rather than wasting space due to block alignment like they do. It also means that Reiser4 scales better than any other filesystem. Do you want a million files in a directory, and want to create them fast? No problem.
* Reiser4 is based on plugins, which means that it will attract many outside contributors, and you'll be able to upgrade to their innovations without reformatting your disk. If you like to code, you'll really like plugins....
* Reiser4 is architected for military grade security. You'll find it is easy to audit the code, and that assertions guard the entrance to every function.
V3 of ReiserFS is used as the default filesystem for SuSE, Lindows, FTOSX, Libranet, Xandros and Yoper. We don't touch the V3 code except to fix a bug, and as a result we don't get bug reports for the current mainstream kernel version. It shipped before the other journaling filesystems for Linux, and is the most stable of them as a result of having been out the longest. We must caution that just as Linux 2.6 is not yet as stable as Linux 2.4, it will also be some substantial time before V4 is as stable as V3.
== Software Engineering Based Reiser4 Design Principles ==
=== Equal Source Code Access Is A Civil Right ===
Copyright and patent laws were invented to give you an incentive to share your knowledge with the rest of the world in return for a limited-time monopoly on what you shared. That is not the way it works with software, though, because software companies are allowed to keep their source code secret but are still given monopoly rights over their software. There is little meaningful sharing of knowledge when only binaries are shared with the world and all the rest is kept secret. The reasons for the existence of copyright and patent laws have been forgotten, their workings have been twisted, and greed and turf defense are what remain of them. Monopoly interests have taken laws intended to promote progress in the arts and sciences, and now use them to further their own control over us by ensuring that innovations not theirs cannot enter the market for improvements to software.
Think of software objects as forming a society, not yet at the level of an AI society, but still a group of programs interacting, and choosing whether to interact, with each other. Think of social lockout, whether it be in the form of racial discrimination as in the civil rights movement, Mercantilism as happened a few centuries ago, or the endless other forms of division in human society. Is it so surprising that this evil casts its shadow on cyberspace? Is it so surprising that our cybershadows also find ways to engage in social lockout of others? Most of the cyber-world of software lives under tyranny today. We are part of a movement to create a free cyber-world we can all participate in equally. Namesys does not oppose copyright laws as they were invented (14-year monopolies which disclosed everything that was temporarily monopolized); it opposes copyright laws as they have been twisted. Namesys opposes unlimited-time monopolies which disclose nothing, and lock out all other inventors. Many others in this movement are opposed to copyright law, even the version of it in which it was first created. We feel they are not acknowledging that a trade-off is being made, and that this trade-off has value. Yet still we choose to give our software away for free for use with software that is given away for free (e.g. Gnu/Linux). Since we don't have a lot of illusions about our ability to entirely change the world, and it is amusing to sell free software, for those who do not want to disclose their software and do not want to give it away for free, we charge a license fee and let them keep their improvements to our software without sharing them. These fees help substantially in allowing us to survive as an organization.
We don't make nearly as much money as we would from charging everyone for usage rights, but we do make just enough to get by, and that is important. ;-) We don't really feel that everyone should follow our example and make their software no-charge for most users (it is too hard to survive fiscally doing this), but we do think that everyone should disclose their source code, and no one should design their software to exclude working with other software (e.g. Microsoft's Palladium, which makes such a mockery of Athena).
=== Software Libre Takes More Than A License --- It Takes A Design ===
Making the source code available to you is not enough by itself to bring you all of the possible benefits of software libre. Many filesystems are so difficult to modify that only someone who has worked with the code for years finds it feasible to modify them, and even then small changes can take months of labor due to their ripple effects on the other code and the difficulties of dealing with disk format changes. This is why we have a plugin based architecture in Reiser4, so that it is not just possible, but easy, to improve the software. Imagine that you were an experimental physicist who had spent his life using only the tools that were in his local hardware store. Then one day you joined a major research lab with a machine shop and a whole bunch of other physicists. All of a sudden you are not using just whatever tools the large tool companies, who have never heard of you, have made for you. You are now part of a cooperative of physicists all making your own tools, swapping tools with each other, suddenly empowered to have tools that are exactly what you want them to be, or even merely exactly what your colleagues want them to be, rather than what some big tool company, that has to do a market analysis before giving you what you want, wants them to be. That is the transition you will make when you go from version 3 to version 4 of ReiserFS.
The tools your colleagues and sysadmins (your machinists) make are going to be much better for what you need.
=== Why Limit Interactions With Objects Strictly? ===
You may wonder why the design we will present is so highly structured, why every object is allowed to control what is done to it by providing a limited interface, and why we pass requests to objects to do things rather than doing things directly to the object. Surely we limit our functionality by doing so, yes? Indeed we do, but is there a reason why the price is worth paying? Is there something that becomes crucial as complexity grows? Chaos theory offers the answer. If you disturb one thing, and disturbing that thing inherently disturbs another thing, which in turn disturbs the first thing plus maybe a whole bunch of other things, and those things all disturb the first thing again, and so on, you get what chaos theory calls a feedback loop. These loops have a marvelous tendency for the end effect of the disturbance to be incalculable, and our inability to calculate such loops is perhaps a significant aspect of our being mere mortals. Of course, as you probably know, most programmers want to be gods, and when they are unable to know what the effect will be of a change they make to their code, they dislike this. As a result, they go to great lengths to reduce the tendency of changes to the design of one object to have ripple effects upon other objects. A vitally important way to do this is to have very strictly defined interfaces to objects, and for the designer of each object to be able to know that the interface will never be violated when he writes it. This is called "object oriented design", or "structured programming", and if used well it can do a lot to reduce a type of chaotic behavior known as bugs. ;-) Verifying the avoidance of interactions that violate the design for an object is a key task in security auditing (inspecting the code to see if it has security holes).
The expressive power of an information system is proportional not to the number of objects that get implemented for it, but instead is proportional to the number of possible effective interactions between objects in it. (Reiser's Law Of Information Economics) This is similar to Adam Smith's observation that the wealth of nations is determined not by the number of their inhabitants, but by how well connected they are to each other. He traced the development of civilization throughout history, and found a consistent correlation between connectivity via roads and waterways, and wealth. He also found a correlation between specialization and wealth, and suggested that greater trade connectivity makes greater specialization economically viable. You can think of namespaces as forming the roads and waterways that connect the components of an operating system. The cost of these connecting namespaces is influenced by the number of interfaces that they must know how to connect to. That cost is, if they are not clever to avoid it, N times N, where N is the number of interfaces, since they must write code that knows how to connect every kind to every kind. One very important way to reduce the cost of fully connective namespaces is to teach all the objects how to use the same interface, so that the namespace can connect them without adding any code to the namespace. Very commonly, objects with different interfaces are segregated into different namespaces. If you have two namespaces, one with N objects, and another with M objects, the expressive power of the objects they connect is proportional to (N times N) plus (M times M), which is less than (N plus M) times (N plus M). Try it on a calculator for some arbitrary N and M. Usually the cost of inventing the namespaces is much less than the cost of the users creating all the objects. 
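The arithmetic behind the segregated-vs-unified namespace claim can be checked directly. A tiny illustrative Python snippet (the namespace sizes are arbitrary, not from the text):

```python
# Full connectivity among N objects sharing one interface allows roughly N*N
# interactions. Two segregated namespaces of sizes N and M allow N*N + M*M;
# one unified namespace of N + M objects allows (N + M)**2, which is larger.
def interactions(*namespace_sizes):
    """Pairwise interaction count, each namespace fully connected internally."""
    return sum(n * n for n in namespace_sizes)

n, m = 10, 15
segregated = interactions(n, m)   # 10*10 + 15*15 = 325
unified = interactions(n + m)     # 25*25 = 625
assert segregated < unified
print(segregated, unified)        # 325 625
```

The gap is the cross-term 2*N*M: exactly the object interactions that segregation forecloses.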
This is what makes namespaces so exciting to work with: you can have an enormous impact on the productivity of the whole system just by being a bit fanatical in insisting on simplicity and consistency in a few areas. Please remember this analysis later when we describe why we implement everything to support a "file" or "directory" interface, and why we aren't eager to support objects with unnecessarily different namespaces/interfaces --- such as "attributes" that cannot interact with files in all the same ways that files can interact with files.
== Basic Semantics ==
To interact with an object you name it, and you say what you want it to do. The filesystem takes the name you give, looks through things we call directories to find the object, and then gives the object your request to do something.
=== Files ===
[Figure: a character holding an object that looks like a sequence.] A file is something that tries to look like a sequence of bytes. You can read the bytes, and write the bytes. You can specify what byte to start to read/write from (the offset), and the number of bytes to read/write (the count). [Diagram needed.] You can also cut bytes off of the end of the file. [Figure: a character sawing off the end of a file.] Cutting bytes out of the middle or the beginning of a file, and inserting bytes into the middle of a file, are not permitted by any of our current file plugins, all of which implement fairly ancient Unix file semantics, but this is likely to change someday.
==== The Software Engineering Lurking Below File Plugins ====
Your interactions with a file are handled by the file's "plugin". These interactions are structured into a set of limited and defined interactions (in programming, such structures are generally called "interfaces"). (We are too lazy to perform the infinite work of programming plugins to handle infinite types of interactions.) Each way you can interact with a plugin is called a "method". A plugin is composed as a set of such methods.
Among programmers, laziness is considered the highest art form, and we do our best to express our souls in this art. This is why we have layers and layers of laziness built into our plugin architecture. Each method is composed from a library of functions we thought would be useful in constructing plugin methods. Each plugin is composed from a library of methods used by plugins, and a plugin can be considered a one-to-one mapping (that's where you have two sets of things, and for every member of one set, you specify a member of the other set as its match) of every way of interacting with the plugin to a method handling it. For every file, there is a file pluginid. Whenever you attempt to interact with a file, we take the name of the file, find the pluginid for the file, and inside the kernel we have an array of plugins [diagram needed that is suitable for persons who don't know what an array or offset is], and we use the pluginid as the offset of that file's plugin within that array. (An offset is a position relative to something else, and in programming it is typically measured in bytes.) This implies that when you invent a new file plugin, you have to recompile the kernel. (Programmers don't actually write programs; they got too lazy for that long ago. Instead they write instructions for the computer on how to write the program, and when the computer follows these instructions ("source code"), it is called "compiling", which programmers usually pretend was done by them when they speak about it, as in "I recompiled the kernel for my exact CPU this time, and now playing pong is noticeably faster.") You can only add plugins to the end of the list, and you can never reuse or change pluginids for a plugin, or else you will have to go through the whole filesystem changing all of the pluginids that are no longer correct.
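The dispatch scheme just described --- a per-file plugin id used as an index into a kernel-side array of plugin method tables --- can be sketched in Python. This is a toy model; the class, table, and method names are invented for illustration and are not Reiser4's actual symbols:

```python
# Each plugin is a one-to-one mapping from operations to methods.
# The pluginid stored with a file is an offset into this table, which is why
# ids can only be appended and never reused without a format change.

class UnixFilePlugin:
    """Hypothetical plugin implementing plain byte-sequence semantics."""

    def read(self, file, offset, count):
        return file["bytes"][offset:offset + count]

    def write(self, file, offset, data):
        body = file["bytes"]
        file["bytes"] = body[:offset] + data + body[offset + len(data):]

PLUGIN_TABLE = [UnixFilePlugin()]      # append-only: index 0 is fixed forever

def dispatch(file, op, *args):
    """Look up the file's plugin by id and invoke the named method."""
    plugin = PLUGIN_TABLE[file["pluginid"]]
    return getattr(plugin, op)(file, *args)

f = {"pluginid": 0, "bytes": b"hello world"}
dispatch(f, "write", 0, b"HELLO")
print(dispatch(f, "read", 0, 5))       # b'HELLO'
```

The append-only discipline falls out of the data structure: any reshuffling of the table would silently redirect every on-disk pluginid to the wrong methods.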
Someday in a later version we will revise this so that plugins are "dynamically loadable" (which is when you can add something to a program while it is running), and you can add support for new plugins to a running kernel. When we do that, we will carefully benchmark and ensure that there is no loss of performance from using dynamic loading (or we won't do it). Programs are often "layered", which is when the program is divided into layers, and each layer only talks to the layer immediately above it or immediately below it, and never talks to a part of the program two levels below it, etc. This reduces the complexity of the interfaces for the various parts of the program, and most of the complexity of a program is in coding its interfaces. [Figure: characters each communicating with adjacent characters only.] Reiser4 has a "semantic layer", and this semantic layer concerns itself with naming objects and specifying what to do to the objects, and doesn't concern itself with such things as how to pack objects into particular places on disk or in the tree. An IO to a file may affect more than one physical sequence of bytes, or no physical sequence of bytes; it may affect the sequences of bytes offered by other files to the semantic layer, and the file plugin may invoke other plugins and delegate work to them, but its interface is structured for offering the caller the ability to read and/or write what the caller sees as being a single sequence of bytes. Appearances are what is wanted. When we say that security attributes are implemented as files, we mean that security attributes look like a sequence of bytes, but the security attributes may be stored in some compressed form that perhaps might be of fixed length, or even be just a single bit.
For the filesystem to offer the benefits of simplicity it need merely provide a uniform appearance that all things it stores are sequences of bytes, and there is nothing to prevent it from gaining efficiency through using many different storage implementations to offer this uniform appearance. For many files it is valuable to support efficient tree traversal to any offset in the sequence of bytes. It is not required though, and Unix/Gnu/Linux has traditionally supported some types of files which could not do this. A pipe will allow you to take the output of one command and connect it to the input of another command, and each of the commands will see the pipe as a file. This pipe is an example of a file for which you cannot simply jump to the middle efficiently; instead you must go through it from beginning to end in sequential order.
=== Names and Objects ===
A name is a means of selecting an object. An object is anything that acts as though it is a single unified entity. What is an object is context dependent. For instance, if you tell an object to delete itself, many distinctly named entities (that are distinct objects in other ways, such as reading) might well disappear as though they are a single object in response to the delete request. A namespace is a mapping of names to objects. Filesystems, databases, search engines, and environment variable names within shells are all examples of namespaces. The early papers using the term tended to seek to convey that namespaces have commonality in their structure, are not fundamentally different, should be based on common design principles, and should be unified. Such unification is a bit of a quest for a holy grail. In British mythology King Arthur sent his knights out on a quest for the holy grail, and if only they could become worthy of it, it would appear to them. None of them found it, and yet the quest made them what they became.
Namespaces will never be unified, but the closer we can come to it, the more expressive power the OS will have. Reiser4 seeks to create a storage layer effective for such an eventually unified namespace, and gives it a semantic layer with some minor advantages over the state of the art. Later versions will add more and more expressive semantics to the storage layer. Finding objects is layered. The semantic layer takes names and converts them into keys (we call this "resolving" the name). The storage layer (which contains the tree traversing code) takes keys and finds the bytes that store the parts of the object. Keys are the fundamental name used by the Reiser4 tree. They are the name that the storage layer at the bottom of it all understands. They can be used to find anything in the tree: not just whole objects, but parts of objects as well. Everything in the tree has exactly one key. Duplicate keys are allowed, but their use usually means that all duplicates must be examined to see if they really contain what is sought, so duplicates are usually rare if high performance is desired. Allowing duplicates can allow keys to be more compact in some circumstances (e.g. hashed directory entries). An objectid cannot be used for finding an object; only keys can. Objectids are used to compose keys so as to ensure that keys are unique.
=== Ordering of Name Components ===
When designing the naming system described in the future vision whitepaper, I broke names from human and computer languages into their pieces, and then looked at those pieces to see which ones differed from each other in meaningful ways vs. which pieces were different expressions that provided the same functionality. (In more formal language, I would say that I systematically decomposed the ways of naming things that we use in human and computer languages into orthogonal primitives, and then determined their equivalence classes.)
I then selected one way of expression from each set of ways that provided equivalent functionality. (Since that whitepaper is focused on what is not yet implemented, it does not list all of the equivalence classes for names, but instead describes those which I thought I could say something interesting to the reader about. For instance, the NOT operator is simply unmentioned in it, as I really have nothing interesting to say about NOT, though it is very useful and will be documented when implemented.) The ordering of two components of a name either has meaning, or it does not. If the resolution of one component of the name depends on what is named by another component, then that pair of name components forms a hierarchical name. Hierarchy can be indicated by means other than ordering. Many human languages indicate structure by use of suffix or tag mechanisms (e.g. Russian and Japanese). The syntactical mechanism one chooses to express hierarchy does not determine the possible semantics one can express, so long as at least one effective method for expressing hierarchy is allowed. I chose to offer only one expression from each equivalence class of naming primitives, and here I chose the '/'-separated file pathname expression traditional to Unix, for pragmatic compatibility with existing operating systems. Reiser4 handles only hierarchical names, and non-hierarchical names are planned only for SSN Reiserfs.
=== Directories ===
Hierarchical names are implemented in Reiser4 by use of directories. The first component of a hierarchical name is the name of the directory, and the components that follow are passed to the directory to interpret. We use '/' to separate the components of a hierarchical name. Directories may choose to delegate parts of their task to their sub-directories.
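Delegated resolution of a hierarchical name can be sketched as follows. This is a toy model under invented names: the real directory plugin returns keys for the storage layer rather than Python objects, and the key format here is made up:

```python
# Semantic layer: each directory resolves the component before the first '/'
# and delegates the remainder to the selected sub-directory. Resolution ends
# in a key, which the storage layer (a plain dict standing in for the tree)
# maps to the stored bytes.
root = {
    "usr": {"bin": {"ls": ("key", 42)}},
    "etc": {"passwd": ("key", 7)},
}

storage = {("key", 42): b"ELF...", ("key", 7): b"root:x:0:0:..."}

def resolve(directory, path):
    """Recursively resolve a '/'-separated name to a key."""
    head, sep, rest = path.partition("/")
    entry = directory[head]
    if not sep:
        return entry                   # leaf reached: entry is a key
    return resolve(entry, rest)        # delegate the rest to the sub-directory

key = resolve(root, "etc/passwd")
print(storage[key])                    # b'root:x:0:0:...'
```

Because each directory only ever sees the leading component, a sub-directory is free to use an entirely different resolution method for what remains, which is the flexibility the delegation rule is meant to preserve.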
The unix directory plugin, when supplied with a name, will use the part of the name before the first '/' to select a sub-directory (if there is a '/' in what it is resolving), and delegate resolving the part of the name after the first '/' to the sub-directory. A directory can employ any arbitrary method at all of resolving the name components passed to it, so long as it returns a set of keys of objects as the result. In Reiser4, this set of keys always contains exactly one member, but this is designed to change in SSN Reiserfs. (Reiser4 also needs to interact with a standard interface for Unix filesystems called VFS (Virtual File System), and directories are also designed to be able to return what VFS understands, which we won't go into here.) Directories will also return a list of names when asked. This list is not required to be a complete list of all names that they can resolve, and sometimes it is not desirable that it be so. Names can be hidden names in Reiser4. Directory plugins may be able to resolve more names than they can list, especially if they are written such that the number of names that they can resolve is infinite. In particular, such names can resolve to objects behaving like ordinary files (with respect to the standard file system interface: read, write, readdir, etc.), but not backed by the storage layer. Such objects are called "pseudo files". Here is a list of pseudo files currently implemented in Reiser4, with a description of their semantics.
==== The Unix Directory Plugin ====
The unix directory plugin implements directories by storing a set of directory entries per directory. These directory entries contain a name, and a key.
When given a name to resolve, the unix directory plugin finds the directory entry containing that name, and then returns the key that is in the directory entry. (More precisely, since a key selects not just the file but a particular byte within a file, it returns that part of the key which is sufficient to select the file, and from which the code can determine the full keys for the file's various parts once the byte offset and some other fields (like item type) are added to the partial key to form a whole key.) The key can then be used by the tree storage layer to find all the pieces of that which was named. ==== Some Historical Details Of Design Flaws In The Unix Directory Interface ==== Unix differs from Multics, in that Multics defined a file to be a sequence of elements (the elements could be bytes, directory entries, or something else...), while Unix defines a file to be purely a sequence of bytes. In Multics, directories were then considered to be a particular type of file which was a sequence of directory entries. For many years, all implementations of Unix directories were as sequences of bytes, and the notion of location within a Unix directory is tied not to a name as you might expect, but to a byte offset within the directory. The problem is that one is using a byte offset to represent a location whose true meaning is not a byte offset but a directory entry, and doing so for a particular file in a system which meaningfully names that file not by byte offset within the directory but by filename. Various efforts are being made in the Unix community to pretend that this byte offset is something more general than a byte offset, and they often try to do so without increasing the size used to store the thing which they pretend is not a byte offset. Since byte offsets are normally smaller than filenames are allowed to be, the result is ugliness and pathetic kludges.
Trust me that you would rather not know about the details of those kludges unless you absolutely have to, and let me say no more. ==== Directories Are Unordered ==== Unix/Linux makes no promises regarding the order of names within directories. The order in which files are created is not necessarily the order in which names will be listed in a directory, and the use of lexicographic (alphabetic) order is surprisingly rare. The unix utilities typically sort directory listings after they are returned by the filesystem, which is why it seems like the filesystem sorts them, and is why listing very large directories can be slow. (Our current default plugin sorts filenames that are less than 15 letters long lexicographically. Those that are more than 15 characters long it sorts first by their first 8 letters and then by the hash of the whole name.) There is value to allowing the user to specify an arbitrary order for names using an arbitrary ordering function the user supplies. This is not done in Reiser4, but is planned as a feature of later versions. Allowing the creation of a hash plugin is a limited form of this that is currently implemented. ==== Files That Are Also Directories ==== In Reiser4 (but not ReiserFS 3) an object can be both a file and a directory at the same time. If you access it as a file, you obtain the named sequence of bytes. If you use it as a directory you can obtain files within it, directory listings, etc. There was a lengthy discussion on the Linux Kernel Mailing List about whether this was technically feasible to do. I won't reproduce it here except to summarize that Linus showed that this was feasible without "breaking" VFS. Allowing an object to be both a file and a directory is one of the features necessary to compose the functionality present in streams and attributes using files and directories.
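The default plugin's ordering rule mentioned above (short names sorted lexicographically; long names by their first 8 letters, then by a hash of the whole name) can be sketched as a comparison function. The hash here is a stand-in (`toy_hash` is not the real plugin's hash), and the handling of one short plus one long name is a simplification of mine, not a statement about the real plugin:

```c
#include <stdint.h>
#include <string.h>

/* A stand-in hash; the real default plugin uses its own hash function. */
static uint32_t toy_hash(const char *s)
{
    uint32_t h = 5381;
    while (*s)
        h = h * 33 + (unsigned char)*s++;
    return h;
}

/* Compare two names per the ordering described in the text: names shorter
 * than 15 characters compare lexicographically; longer names compare by
 * their first 8 characters, then by a hash of the whole name. */
int name_cmp(const char *a, const char *b)
{
    if (strlen(a) < 15 && strlen(b) < 15)
        return strcmp(a, b);

    int c = strncmp(a, b, 8);       /* first 8 letters decide first */
    if (c != 0)
        return c;
    uint32_t ha = toy_hash(a), hb = toy_hash(b);
    return (ha > hb) - (ha < hb);   /* then the hash of the whole name */
}
```

The design point is that long names with a shared prefix still sort near each other on disk (good locality), while the hash keeps the per-name key small.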
To implement a regular unix file with all of its metadata, we use a file plugin for the body of the file, a directory plugin for finding file plugins for each of the metadata, and particular file plugins for each of the metadata. We use a unix_file file plugin to access the body of the file, and a unix_file_dir directory plugin to resolve the names of its metadata to particular file plugins for particular metadata. These particular file plugins for unix file metadata (owner, permissions, etc.) are implemented to allow the metadata normally used by unix files to be quite compactly stored. ==== Hidden Directory Entries ==== A file can exist but not be visible when using readdir in the usual way. WAFL does this with the .snapshots directory; it works well for them without disturbing users. This is useful for adding access to a variety of new features and their applications without disturbing the user when they are not relevant. ==== New Security Attributes and Set Theoretic Semantic Purity ==== ==== Minimizing Number Of Primitives Is Important In Abstract Constructions ==== To a theoretician it is extremely important to minimize the number of primitives with which one achieves the desired functionality in an abstract construction. It is a bit hard to explain why this is so, but it is well accepted that breaking an abstract model into more basic primitives is very important. A not very precise explanation is to say that by breaking complex primitives into their more basic primitives, then recombining those basic primitives differently, you can usually express new things that the original complex primitives did not express. Let's follow this grand tradition of theoreticians and see what happens if we apply it to Gnu/Linux files and directories. ==== Can We Get By Using Just Files and Directories (Composing Streams And Attributes From Files And Directories)? ==== In Gnu/Linux we have files, directories, and attributes. In NTFS they also have streams.
Since Samba is important to Gnu/Linux, there frequently are requests that we add streams to ReiserFS. There are also requests that we add more and more different kinds of attributes using more and more different APIs. Can we do everything that can be done with {files, directories, attributes, streams} using just {files, directories}? I say yes--if we make files and directories more powerful and flexible. I hope that by the end of reading this you will agree. Let us have two basic objects. A file is a sequence of bytes that has a name. A directory is a name space mapping names to a set of objects "within" the directory. We connect these directory name spaces such that one can use compound names whose subcomponents are separated by a delimiter '/'. What is missing from files and directories that attributes and streams offer? In ReiserFS 3, there exist file attributes. File attributes are out-of-band data describing the sequence of bytes which is the file. For example, the permissions defining who can access a file, or the last modification time, are file attributes. File attributes have their own API; creating new file attributes creates new code complexity and compatibility issues galore. ACLs are one example of new file attributes users want. Since in Reiser4 files can also be directories, we can implement traditional file attributes as simply files. To access a file attribute, one need merely name the file, followed by a '/', followed by an attribute name. That is: a traditional file will be implemented to possess some of the features of a directory; it will contain files within the directory corresponding to file attributes which you can access by their names; and it will contain a file body which is what you access when you name the "directory" rather than the file. Unix currently has a variety of attributes that are distinct from files (ACLs, permissions, timestamps, other mostly security related attributes, ...).
This is because a variety of people needed this feature and that, and there was no infrastructure that would allow implementing the features as fully orthogonal features that could be applied to any file. Reiser4 will create that infrastructure. ==== List Of Features Needed To Get Attribute And Stream Functionality From Files And Directories ==== * api efficient for small files * efficient storage for small files * plugins, including plugins that can compress a file serving as an attribute into a single bit * files that also act as directories when accessed as directories * inheritance (includes file aggregation) * constraints * transactions * hidden directory entries Each of these additional features is a feature that would benefit the filesystem. So we add them in v4. === Basic Tree Concepts === ==== Trees, Nodes, and Items ==== One way of organizing information is to put it into trees. When we organize information in a computer, we typically sort it into piles (nodes we call them), and there is a name (a pointer) for each pile that the computer will be able to use to find the pile. Figure 1. One Example Of A Tree: a height 4 (4 level), fanout 3, balanced tree. It starts with a root node, traverses 2 levels of internal nodes, and ends with the leaf nodes, which hold the data and have no children. Some of the nodes can contain pointers, and we can go looking through the nodes to find those pointers to (usually other) nodes. We are particularly interested in how to organize so that we can find things when we search for them. A tree is an organization structure that has some useful properties for that purpose. Definition of Tree: 1. A tree is a set of nodes organized into a root node, and zero or more additional sets of nodes called subtrees. 2. Each of the subtrees is a tree. 3. No node in the tree points to the root node, and exactly one pointer from a node in the tree points to each non-root node in the tree. 4.
The root node has a pointer to each of its subtrees, that is, a pointer to the root node of the subtree. ==== Fine Points of the Definition ==== Figure 2. The simplest tree: the absolutely most trivial of all graphs, the single, isolated node. Figure 3. A trivial, linear tree: a trivial, connected, linear (unary) graph, a linear sequence of nodes connected by paths (edges, pointers). It is interesting to argue over whether finite should be a part of the definition of trees. There are many ways of defining trees, and which is the best definition depends on what your purpose is. Donald Knuth (a well known author of algorithm textbooks) supplies several definitions of tree. As his primary definition of tree he even supplies one which has no pointers/edges/lines in the definition, just sets of nodes. Reiser4 uses a finite tree (the number of nodes is limited). Knuth defines trees as being finite sets of nodes. There are papers on infinite trees on the Internet. I think it more appropriate to consider finite an additional qualifier on trees, rather than bundling finite into the definition. However, I personally only deal with finite trees in my storage layer research. It is interesting to consider whether storage layers are inherently more motivated than semantic layers to limit themselves to finite trees rather than infinite trees. This is where some writers would say "... is left as an exercise for the reader". :-) Oh the temptation.... I will remind the reader of my explanation of why storage layer trees are more motivated to be acyclic, and, at the cost of some effort at honesty, constrain myself to saying that doing more than providing that hint is beyond my level of industry. ;-) Edge is a term often used in tree definitions. A pointer is unidirectional (you can follow it from the node that has it to the node it points to, but you cannot follow it back from the node it points to to the node that has it). An edge is bidirectional (you can follow it in both directions).
Here are three alternative tree definitions, which are interesting in how they are mathematically equivalent to each other, though they are not equivalent to the definition I supplied because edges are not equivalent to pointers. For all three of these definitions, let there be not more than one edge connecting the same two nodes. * a set of vertices (aka points) connected by edges (aka lines) for which the number of edges is one less than the number of vertices * or a set of vertices connected by edges which has no cycles (a cycle is a path from a vertex to itself) * or a set of vertices connected by edges for which there is exactly one path connecting any two vertices The three alternative definitions do not have a unique root in their tree, and such trees are called free trees. The definition I supplied is a definition of a rooted tree, not a free tree. It also has no cycles, it has one less pointer than it has nodes, and there is exactly one path from the root to any node. Please feel encouraged to read Knuth's writings for more discussion of these topics. ==== Graphs vs. Trees ==== Consider the purposes for which you might want to use a graph, and those for which you might want to use a tree. In a tree there is exactly one path from the root to each node in the tree, and a tree has the minimum number of pointers sufficient to connect all the nodes. This makes it a simple and efficient structure. Trees are useful when efficiency with minimal complexity is what is desired, and there is no need to reach a node by more than one route. Reiser4 has both graphs and trees, with trees used when the filesystem chooses the organization (in what we call the storage layer, which tries to be simple and efficient), and graphs when the user chooses the organization (in the semantic layer, which tries to be expressive so that the user can do whatever he wants). ==== Ordering The Tree Aids Searching Through It ==== ==== Keys ==== We assign everything stored in the tree a key.
We find things by their keys. Use of keys gives us additional flexibility in how we sort things, and if the keys are small, it gives us a compact means of specifying enough to find the thing. It also limits what information we can use for finding things. This limit restricts its usefulness, and so we have a storage layer, which finds things by keys, and a semantic layer, which has a rich naming system. The storage layer chooses keys for things solely to organize storage in a way that will improve performance, and the semantic layer understands names that have meaning to users. As you read, you might want to think about whether this is a useful separation that allows freedom in adding improvements that aid performance in the storage layer, while escaping paying a price for the side effects of those improvements on the flexible naming objectives of the semantic layer. ==== Choosing Which Subtree ==== We start our search at the root, because from the root we can reach every other node. How do we choose which subtree of the root to go to from the root? The root contains pointers to its subtrees. For each pointer to a subtree there is a corresponding left delimiting key. Pointers to subtrees, and the subtrees themselves, are ordered by their left delimiting key. A subtree pointer's left delimiting key is equal to the least key of the things in the subtree. Its right delimiting key is larger than the largest key in the subtree, and it is the left delimiting key of the next subtree of this node. Each subtree contains only things whose keys are at least equal to the left delimiting key of its pointer, and are not more than its right delimiting key. If there are no duplicate keys in the tree, then each subtree contains only things whose keys are less than its right delimiting key. If there are no duplicate keys, then by looking within a node at its pointers to subtrees and their delimiting keys we know what subtree of that node contains the thing we are looking for.
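The subtree-selection rule described above (assuming no duplicate keys: descend into the last child whose left delimiting key is not greater than the search key) can be sketched as follows. The node layout here is a toy simplification, not Reiser4's on-disk format:

```c
#include <stdint.h>
#include <stddef.h>

/* A toy internal node: child[i] covers keys in [left_key[i], left_key[i+1]),
 * and left_key[i] equals the least key stored in child[i]. */
struct internal_node {
    size_t   nchildren;
    uint64_t left_key[16];   /* left delimiting key of each subtree */
    int      child[16];      /* stand-in for a pointer to the subtree */
};

/* Pick the subtree that must contain `key`, assuming no duplicate keys:
 * the last child whose left delimiting key is <= key.  A real
 * implementation would binary-search the sorted delimiting keys. */
int choose_subtree(const struct internal_node *n, uint64_t key)
{
    size_t i = 0;
    while (i + 1 < n->nchildren && n->left_key[i + 1] <= key)
        i++;
    return n->child[i];
}
```

Repeating this choice at every level is exactly the root-to-leaf search the text describes: because each node's children are ordered by left delimiting key, one comparison pass per node suffices.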
Duplicate keys are a topic for another time. For now I will just hint that when searching through objects with duplicate keys we find the first of them in the tree, and then we search through all duplicates one-by-one until we find what we are looking for. Allowing duplicate keys can allow for smaller keys, so there is sometimes a tradeoff between key size and the average frequency of such inefficient linear searches. Using duplicate keys can also allow, if one defines one's insertion algorithms such that they always insert at the end of a set of duplicate keys, ordering objects with the same key by creation time. The contents of each node in the tree are sorted within the node. So, the entire tree is sorted by key, and for a given key we know just where to go to find at least one thing with that key. ==== Nodes: Leaves, Twigs, and Branches ==== Leaves are nodes that have no children. Internal nodes are nodes that have children. Figure 4. A height = 4, fanout = 3, balanced tree. A search will start with the root node, the sole level 4 internal node, traverse 2 more internal nodes, and end with a leaf node which holds the data and has no children. A node that contains items is called a formatted node. If an object is large, and is not compressed and doesn't need to support efficient insertions (compressed objects are special because they need to be able to change their space usage when you write to their middles, because the compression might not be equally efficient for the new data), then it can be more efficient to store it in nodes without any use of items at all. We do so by default for objects larger than 16k. Unformatted leaves (unfleaves) are leaves that contain only data, and do not contain any formatting information. Only leaves can contain unformatted data.
Pointers are stored in items, and so all internal nodes are necessarily formatted nodes. Pointers to unfleaves are different in their structure from pointers to formatted nodes. Extent pointers point to unfleaves. An extent is a sequence of unfleaves, contiguous in block number order, that belong to the same object. An extent pointer contains the starting block number of the extent, and a length. [diagram needed] Because the extent belongs to just one object, we can store just one key for the extent, and then we can calculate the key of any byte within that extent. If the extent is at least 2 blocks long, extent pointers are more compact than regular node pointers would be. Node pointers are pointers to formatted nodes. We do not yet have a compressed version of node pointers, but they are probably soon to come. Notice how with extent pointers we don't have to store the delimiting key of each node pointed to, and with node pointers we need to. We will probably introduce key compression at the same time we add compressed node pointers. One would expect keys to compress well since they are sorted into ascending order. We expect our node and item plugin infrastructure will make such features easy to add at a later date. Twigs are parents of leaves. Extent pointers exist only in twigs. This is a very controversial design decision I will discuss a bit later. Branches are internal nodes that are not twigs. You might think we would number the root level 1, but since the tree grows at the top, it turns out to be more useful to number as 1 the level with the leaves where object data is stored. The height of the tree will depend upon how many objects we have to store and what the fanout rate (average number of children) of the internal and twig nodes will be. For reasons of code simplicity, we find it easiest to implement Reiser4 such that it has a minimum height of 2, and the root is always an internal node.
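The extent-pointer arithmetic described earlier — one key, a starting block number, and a length, from which the block holding any byte of the extent can be computed — can be sketched as (field names are illustrative, not the on-disk layout):

```c
#include <stdint.h>

#define BLOCK_SIZE 4096ULL

/* An extent pointer: the starting block of a run of unfleaves that are
 * contiguous in block number order and belong to one object, plus the
 * run's length in blocks.  (Illustrative layout only.) */
struct extent {
    uint64_t start_block;
    uint64_t length;        /* in blocks */
};

/* Because an extent belongs to one object and its blocks are contiguous,
 * a single key for the whole extent suffices: the block holding any byte
 * offset within it is pure arithmetic.  Returns 0 if the offset falls
 * outside the extent. */
uint64_t block_for_offset(const struct extent *e, uint64_t byte_offset)
{
    uint64_t blk = byte_offset / BLOCK_SIZE;
    if (blk >= e->length)
        return 0;
    return e->start_block + blk;
}
```

This is also why extent pointers are more compact than node pointers once an extent reaches 2 blocks: two integers replace a per-block (pointer, delimiting key) pair.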
There is nothing deeper than judicious laziness to this: it simplifies the code to not deal with one node trees, and nobody cares about the waste of space. Figure 5. An example of a Reiser4 tree: a 4 level, balanced tree with a fanout of 3, starting with a root node, then traversing branch nodes, including the internal nodes called twig nodes (a Reiser4 feature), and ending with the leaf nodes, which hold the data and have no children. In practice Reiser4 fanout is much higher and varies from node to node, but a 4 level tree diagram with 16 million leaf nodes won't fit easily onto my monitor so I drew something smaller.... ;-) ==== Size of Nodes ==== We choose to make the nodes equal in size. This makes it much easier to allocate the unused space between nodes, because it will be some multiple of node size, and there are no problems of space being free but not large enough to store a node. Also, disk drives have an interface that assumes equal size blocks, which they find convenient for their error-correction algorithms. If having the nodes be equal in size is not very important, perhaps due to the tree fitting into RAM, then a class of algorithms called skip lists is worthy of consideration. Reiser4 nodes are usually equal to the size of a page, which if you use Gnu/Linux on an Intel CPU is currently 4096 (4k) bytes. There is no measured empirical reason to think this size is better than others; it is just the one that Gnu/Linux makes easiest and cleanest to program into the code, and we have been too busy to experiment with other sizes. ==== Sharing Blocks Saves Space ==== If nodes are of equal size, how do we store large objects? We chop them into pieces. We call these pieces items. Items are sized to fit within a single node. Conventional filesystems store files in whole blocks. Roughly speaking, this means that on average half a block of space is wasted per file because not all of the last block of the file is used.
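The half-a-block-per-file waste claimed above is simple arithmetic: a file stored in whole blocks wastes the unused tail of its last block, and over files with uniformly distributed sizes that tail averages about half a block. A few lines make the point concrete:

```c
#include <stdint.h>

#define BLOCK_SIZE 4096ULL

/* Bytes wasted when a file is stored in whole blocks: the unused tail of
 * the last block.  Over files of uniformly distributed size this averages
 * roughly half a block per file, and for a file much smaller than a block
 * the waste dwarfs the file itself. */
uint64_t whole_block_waste(uint64_t file_size)
{
    if (file_size == 0)
        return 0;
    uint64_t tail = file_size % BLOCK_SIZE;
    return tail ? BLOCK_SIZE - tail : 0;
}
```

For a 100-byte "phone number" file the waste is 3996 of 4096 bytes, i.e. more than 97% of the block — the motivating case for packing multiple items into one node instead.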
If a file is much smaller than a block, then the space wasted is much larger than the file. It is not effective to store such typical database objects as addresses and phone numbers in separately named files in a conventional filesystem, because it will waste more than 90% of the space in the blocks it stores them in. By putting multiple items within a single node in Reiser4, we are able to pack multiple small pieces of files into one block. Our space efficiency is roughly 94% for small files. This does not count per item formatting overhead, whose percentage of total space consumed depends on average item size, and for that reason is hard to quantify. Aligning files to 4k boundaries does have advantages for large files though. When a program wants to operate directly on file data without going through system calls to do it, it can use mmap() to make the file data part of the process's directly accessible address space. Due to some implementation details, mmap() needs file data to be 4k aligned, and if the data is already 4k aligned, it makes mmap() much more efficient. In Reiser4 the current default is that files that are larger than 16k are 4k aligned. We don't yet have enough empirical data and experience to know whether 16k is the precise optimal default value for this cutoff point, but so far it seems to at least be a decent choice. ==== Items ==== Nodes in the tree are smaller than some of the objects they hold, and larger than some of the objects they hold, so how do we store them? One way is to pour them into items. An item is a data container that is contained entirely within a single node, and it allows us to manage space within nodes. For the default 4.0 node format, every item has a key, an offset to where in the node the item body starts, a length of the item body, and a pluginid that indicates what type of item it is. Items allow us to not have to round up to 4k the amount of space required to store an object. The Structure of an Item: an Item_Body, stored separately from its Item_Head, which holds the Item_Key, Item_Offset, Item_Length, and Item_Plugin_id. ==== Types Of Items ==== Reiser4 includes many different kinds of items designed to hold different kinds of information. * static_stat_data: holds the owner, permissions, last access time, creation time, last modification time, size, and the number of links (names) to a file. * cmpnd_dir_item: holds directory entries, and the keys of the files they link to. * extent pointers: explained above * node pointers: explained above * bodies: holds parts of files that are not large enough to be stored in unfleaves. ==== Units ==== We call a unit that which we must place as a whole into an item, without splitting it across multiple items. When traversing an item's contents it is often convenient to do so in units: * For body items the units are bytes. * For directory items the units are directory entries. The directory entries contain a name and a key of the file named (or at least the item plugin can pretend they do; in practice the name and key may be compressed). * For extent items the units are extents. Extent items only contain extents from the same file. * For static_stat_data the whole stat data item is one indivisible unit of fixed size. ==== What the Default Node Formats For ReiserFS 4.0 Look Like ==== An unformatted leaf node (unfleaf node), which is the only node without a Node_Header, has the trivial structure of nothing but data. A formatted leaf node has the structure: Block_Head Item_Body0 Item_Body1 - - - Item_Bodyn ....Free Space....
Item_Headn - - - Item_Head1 Item_Head0 A twig node has the structure: Block_Head Item_Body0 NodePointer0 Item_Body1 ExtentPointer1 Item_Body2 NodePointer2 Item_Body3 ExtentPointer3 - - - Item_Bodyn NodePointern ....Free Space.... Item_Headn - - - Item_Head0 A branch node has the structure: Block_Head Item_Body0 NodePointer0 - - - Item_Bodyn NodePointern ....Free Space.... Item_Headn - - - Item_Head0 === Tree Design Concepts === ==== Height Balancing versus Space Balancing ==== Height balanced trees are trees such that each possible search path from root node to leaf node has exactly the same length (length = number of nodes traversed from root node to leaf node). For instance the height of the tree in Figure 1 is four, while the height of the left hand tree in Figure 1.3 is three and of the single node in Figure 2 is 1. The term balancing is used for several very distinct purposes in the balanced tree literature. Two of the most common are: to describe balancing the height, and to describe balancing the space usage within the nodes of the tree. These quite different definitions are unfortunately a classic source of confusion for readers of the literature. Most algorithms for accomplishing height balancing do so by only growing the tree at the top. Thus the tree never gets out of balance. Figure 6. An unbalanced tree: a 4 level tree with fanout n = 3 that has lost some nodes to deletions and needs to be balanced. ==== Three Principal Considerations In Tree Design ==== Three of the principal considerations in tree design are: * the fanout rate (see below) * the tightness of packing * the amount of the shifting of items in the tree from one node to another that is performed (which creates delays due to waiting while things move around in RAM, and on disk). ==== Fanout ==== The fanout rate n refers to how many nodes may be pointed to by each level's nodes.
(See Figure 7.) If each node can point to n nodes of the level below it, then starting from the top, the root node points to n internal nodes at the next level, each of which points to n more internal nodes at its next level, and so on: m levels of internal nodes can point to n^m leaf nodes containing items in the last level. The more you want to be able to store in the tree, the larger you have to make the fields in the key that first distinguish the objects (the objectids), and then select parts of the object (the offsets). This means your keys must be larger, which decreases fanout (unless you compress your keys, but that will wait for our next version....). Figure 7. Three 4 level, height balanced trees with fanouts n = 1, 2, and 3. The first graph is a four level tree with fanout n = 1. It has just four nodes, starts with the (red) root node, traverses the (burgundy) internal and (blue) twig nodes, and ends with the (green) leaf node which contains the data. The second tree, with 4 levels and fanout n = 2, starts with a root node, traverses 2 internal nodes, each of which points to two twig nodes (for a total of four twig nodes), and each of these points to 2 leaf nodes for a total of 8 leaf nodes. Lastly, a 4 level, fanout n = 3 tree is shown which has 1 root node, 3 internal nodes, 9 twig nodes, and 27 leaf nodes.
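The fanout relation just described (m levels of internal nodes address n^m leaves) is worth checking against the trees in Figure 7:

```c
#include <stdint.h>

/* With fanout n, m levels of internal nodes can point to n^m leaf nodes.
 * (Plain integer exponentiation; callers must keep n^m within 64 bits.) */
uint64_t leaf_capacity(uint64_t fanout, unsigned internal_levels)
{
    uint64_t leaves = 1;
    while (internal_levels--)
        leaves *= fanout;
    return leaves;
}
```

The n = 3, 4 level tree of Figure 7 has 3 internal levels (root, branches, twigs) and therefore 3^3 = 27 leaves, matching the figure; real Reiser4 fanouts in the hundreds make even short trees address millions of leaves.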
==== What Are B+Trees, and Why Are They Better than B-Trees? ==== It is possible to store not just pointers and keys in internal nodes, but also to store the objects those keys correspond to in the internal nodes. This is what the original B-tree algorithms did. Then B+trees were invented, in which only pointers and keys are stored in internal nodes, and all of the objects are stored at the leaf level. Figure 8. Figure 9. Warning! I found from experience that most persons who don't first deeply understand why B+trees are better than B-trees won't later understand explanations of the advantages of putting extents on the twig level rather than using BLOBs. The same principles that make B+trees better than B-trees also make Reiser4 faster than using BLOBs like most databases do. So make sure this section fully digests before moving on to the next section, ok? ;-) ==== B+Trees Have Higher Fanout Than B-Trees ==== Fanout is increased when we put only pointers and keys in internal nodes, and don't dilute them with object data. Increased fanout increases our ability to cache all of the internal nodes, because there are fewer internal nodes. Often persons respond to this by saying, "but B-trees cache objects, and caching objects is just as valuable". The answer is that, on average, it is not. Of course, discussing averages makes the discussion much harder. We need to discuss some cache design principles for a while before we can get to this. === Cache Design Principles === ==== Reiser's Untie The Uncorrelated Principle of Cache Design ==== Tying the caching of things whose usage does not strongly correlate is bad. Suppose: * you have two sets of things, A and B. * you need things from those two sets at semi-random, with there existing a tendency for some items to be needed much more frequently than others, but which items those are can shift slowly over time. * you can keep things around after you use them in a cache of limited size. * you tie the caching of every thing from A to the caching of another thing from B.
(that means, whenever you fetch something from A into the cache, you fetch its partner from B into the cache) Then this increases the amount of cache required to store everything recently accessed from A. If there is a strong correlation between the need for the two particular objects that are tied in each of the pairings, stronger than the gain from spending those cache resources on caching more members of B according to the LRU algorithm, then this might be worthwhile. If there is no such strong correlation, then it is bad. But wait, you might say, you need things from B also, so it is good that some of them were cached. Yes, you need some random subset of B. The problem is that without a correlation existing, the things from B that you need are not especially likely to be those same things from B that were tied to the things from A that were needed. This tendency to inefficiently tie things that are randomly needed exists outside the computer industry. For instance, suppose you like both popcorn and sushi, with your need for them on a particular day being random. Suppose that you like movies randomly. Suppose a theater requires you to eat only popcorn while watching the movie you randomly found optimal to watch, and not eat sushi from the restaurant on the corner while watching that movie. Is this a socially optimum system? Suppose quality is randomly distributed across all the hot dog vendors: if you can only eat the hot dog produced by the best movie displayer on a particular night that you want to watch a movie, and you aren't allowed to bring in hot dogs from outside the movie theater, is it a socially optimum system? Optimal for you? Tying the uncorrelated is a very common error in designing caches, but it is still not enough to describe why B+Trees are better. With internal nodes, we store more than one pointer per node. That means that pointers are not separately cached. 
You could well argue that pointers and the objects they point to are more strongly correlated than different pointers are with each other. We need another cache design principle.

Reiser's Maximize The Variance Principle of Cache Design

If two types of things that are cached and accessed in units that are aggregates have different average temperatures, then segregating the two types into separate units helps caching. For balanced trees, these units of aggregation are nodes. This principle applies where it may be necessary to tie things into larger units for efficient access, and it guides what things should be tied together. Suppose you have R bytes of RAM for cache and D bytes of disk, and suppose that 80% of accesses go to the most recently used things, which are stored in H (hotset) bytes of nodes. Reducing H to where it is smaller than R is very important to performance. If you evenly disperse your frequently accessed data, then a larger cache is required and caching is less effective.

1. If, all else being equal, we increase the variation in temperature among all aggregates (nodes), then we increase the effectiveness of using a small, fast cache.
2. If two types of things have different average temperatures (ratios of likelihood of access to size in bytes), then separating them into separate aggregates (nodes) increases the variation in temperature in the system as a whole.
3. Conclusion: all else being equal, if two types of things cached several to an aggregate (node) have different average temperatures, then segregating them into separate nodes helps caching.

Pointers To Nodes Have A Higher Average Temperature Than The Nodes They Point To

Pointers to nodes tend to be frequently accessed relative to the number of bytes required to cache them: every tree traversal that reaches a node must use the pointers above it, and the pointers are smaller than the nodes they point to.
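The hotset argument can be sketched with a toy model. Every number here is an assumption chosen for illustration; the point is only that diluting hot pointers evenly through cold data inflates H past R:

```python
# Toy model of the hotset H versus cache R: pointers are hot per byte,
# object data is cold. All sizes are invented for illustration.
R = 16 * 2**20                  # RAM available for cache: 16 MiB
ptr_bytes = 2 * 2**20           # all node pointers together: 2 MiB (hot)
data_bytes = 200 * 2**20        # object data: 200 MiB (cold)

# Segregated (B+tree style): the hot set is just the pointer nodes.
H_segregated = ptr_bytes
# Mixed (B-tree style): pointers are spread evenly through data nodes,
# so touching every pointer means touching every node.
H_mixed = ptr_bytes + data_bytes

print(H_segregated <= R)   # hotset fits in cache
print(H_mixed <= R)        # hotset no longer fits
```

Under these assumptions the segregated hotset fits comfortably in RAM while the mixed one is an order of magnitude too large, which is the "maximize the variance" principle at work.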
Putting only node pointers and delimiting keys into internal nodes concentrates the pointers. Since pointers tend to be accessed more frequently per byte of their size than items storing file bodies, a large difference in average temperature exists between pointers and object data. According to the caching principles described above, segregating these two types of things with different average temperatures, pointers and object data, increases the efficiency of caching.

Segregating By Temperature Directly

Now you might say: why not segregate by actual temperature, instead of by type, which only correlates with temperature? We do what we can easily and effectively code, and temperature segregation is not the only consideration. There are tree designs which rearrange the tree so that objects with a higher temperature sit higher in the tree than pointers with a lower temperature. The difference in average temperature between object data and pointers to nodes is so large that I don't find such designs a compelling optimization, and they add complexity. I could be wrong. If one had no compelling semantic basis for aggregating objects near each other (true for some applications), and if one wanted to access objects by nodes rather than individually, it would be interesting to have a node repacker sort object data into nodes by temperature. The repacker would need to change the keys of the objects it sorts. Perhaps someone will have us implement that for some application someday for Reiser4.

BLOBs Unbalance the Tree, Reduce Segregation of Pointers and Data, and Thereby Reduce Performance

BLOBs, Binary Large OBjects, are a method of storing objects larger than a node by storing pointers to the nodes containing the object. These pointers are commonly stored in what are called the leaf nodes (level 1, except that the BLOBs are then sort of a basement "level B" :-\ ) of a "B*" tree.
This is a tree that was four levels high until a BLOB was inserted with a pointer from a leaf node. In this case the BLOB's blocks are all contiguous. Figure 10. A Binary Large OBject (BLOB) has been inserted, with pointers to its blocks stored in a leaf node. This is what a ReiserFS V3 tree looks like. BLOBs are a significant unintentional definitional drift, albeit one accepted by the entire database community. This placement of pointers into nodes containing data is a performance problem for ReiserFS V3, which uses BLOBs. (Never accept the "let's just try it my way and see, and we can change it if it doesn't work" argument. It took years and a disk format change to get BLOBs out of ReiserFS, and performance suffered the whole time, if tails were turned on.) Because the pointers to BLOBs are diluted by data, caching all pointers to all nodes in RAM is infeasible for typical file sets. Reiser4 returns to the classical definition of a height balanced tree, in which the lengths of the paths to all leaf nodes are equal. It does not try to pretend that the nodes storing objects larger than a node are somehow not part of the tree even though the tree stores pointers to them. As a result, the amount of RAM required to store pointers to nodes is dramatically reduced. For typical configurations, RAM is large enough to hold all of the internal nodes. This is a Reiser4 tree with extents in the level 1 leaf nodes and the pointers to them in the level 2 twig nodes. In this case the BLOB's blocks are all contiguous. Figure 11. A Reiser4, 4 level, height balanced tree with fanout = 3, and the data that was stored in BLOBs now stored in extents in the level 1 leaf nodes and pointed to by extent pointers stored in the level 2 twig nodes. Gray and Reuter say the criterion for searching external memory is to "minimize the number of different pages along the average (or longest) search path.
....by reducing the number of different pages for an arbitrary search path, the probability of having to read a block from disk is reduced." (1993, Transaction Processing: Concepts and Techniques, Morgan Kaufmann Publishers, San Francisco, CA, p. 834 ...) My problem with this explanation of why the height balanced approach is effective is that it does not convey that you can get away with a moderately unbalanced tree, provided you do not significantly increase the total number of internal nodes. In practice, most unbalanced trees do have significantly more internal nodes. In practice, most moderately unbalanced trees have a moderate increase in the cost of in-memory tree traversals, and an immoderate increase in the amount of I/O due to the increased number of internal nodes. But if one were to put all the BLOBs together in the same location in the tree, the number of internal nodes would not significantly increase, so the performance penalty for having them on a lower level of the tree than all other leaf nodes would not be a significant additional I/O cost. There would be a moderate increase in the part of the tree traversal time that depends on RAM speed, but this would not be so critical. Segregating BLOBs could perhaps substantially recover the performance lost by architects not noticing the drift in the definition of height balancing for trees. It might be undesirable to segregate objects by their size rather than just their semantics, though. Perhaps someday someone will try it and see what results.

Dancing Trees Are Faster Than Balanced Trees

[Illustration: character shoving tree-like characters to the left]

Balanced trees have traditionally employed fixed criteria for determining whether nodes should be squeezed together into fewer nodes so as to save space. These criteria are traditionally satisfied at the end of every modification to the tree.
A typical such criterion is the guarantee that, after each modification to the tree, the modified node cannot be squeezed together with its left and right neighbors into two or fewer nodes. ReiserFS V3 uses that criterion for its leaf nodes. The more neighboring nodes you consider for squeezing into one fewer node, the more memory bandwidth you consume on average per modification to the tree, and the more likely you are to need to read those nodes because they are not in memory. It is a typical pattern in memory management algorithm design that the more tightly packed memory is kept, the more overhead is added to the cost of changing what is stored where. This overhead can be significant enough that some commercial databases actually only delete nodes when they are completely empty, and they feel that in practice this works well. Trees that adhere to fixed space usage balancing criteria can have many things rigorously proven about their worst case performance in publishable papers. That is different from their being optimal. An algorithm can have worse bounds on its theoretical worst case performance and still be a better algorithm. Just because one cannot rigorously define average usage patterns does not make them the slightest bit less important. Sorry, mere mortal mathematicians, that is life. Some might prefer to think about the questions they can define and answer rigorously, but that does not in the slightest make them the right questions. Yes, I am a chaotic.... In Reiser4 we employ not balanced trees but dancing trees. Dancing trees merge insufficiently full nodes, not with every modification to the tree, but instead:

* in response to memory pressure triggering a flush to disk;
* as a result of a transaction closure flushing nodes to disk.

If It Is In RAM, Dirty, and Contiguous, Then Squeeze It ALL Together Just Before Writing

Let a slum be defined as a sequence of nodes that are contiguous in the tree order and dirty in this transaction.
(In simpler words, a bunch of dirty nodes that are right next to each other.) A dancing tree responds to memory pressure by squeezing and flushing slums. It is possible that merely squeezing a slum might free enough space that flushing is unnecessary, but the current implementation of Reiser4 always flushes the slums it squeezes. This is not necessarily the right approach, but we found it simpler and good enough for now. Another simplification we choose to engage in for now: instead of trying to estimate whether squeezing a slum will save space before squeezing it, we just squeeze it and see. Balanced trees have an inherent tradeoff between balancing cost and space efficiency. If they consider more neighboring nodes, for the purpose of merging them to save a node, with every change to the tree, then they can pack the tree more tightly, at the cost of moving more data with every change. By contrast, with a dancing tree, you simply take a large slum, shove everything in it as far to the left as it will go, and then free all the nodes in the slum that are left with nothing in them, at the time of committing the slum's contents to disk in response to memory pressure. This gives you extreme space efficiency when slums are large, at a cost in data movement that is lower than it would be with an invariant balancing criterion, because it is done less often. By compressing at the time one flushes to disk, one compresses less often, which means one can afford to do it more thoroughly. By compressing dirty nodes that are in memory, one avoids performing additional I/O as a result of balancing.

Procrastination Leads To Wiser Decisions: Allocate on Flush

ReiserFS V3 assigns block numbers to nodes as it creates them. XFS is smarter: it waits until the last moment, just before writing nodes to disk. I'd like to thank the XFS team for making an effort to ensure that I understood the merits of their approach.
The easy way to see its merits is to consider a file that is deleted before it reaches disk. Such a file should have no effect on the disk layout.

[Illustration: character squeezing a folding form]

Reiser4 The Atomic Filesystem

Reducing The Damage of Crashing

When a computer crashes, data in RAM which has not reached disk is lost. You might at first be tempted to think that we then want to keep all of the data that did reach disk. Suppose that you were performing a transfer of $10 from bank account A to bank account B, and this consisted of two operations: 1) debit $10 from A, and 2) credit $10 to B. Suppose that 1) but not 2) reached disk, and the computer crashed. It would be better to disregard 1) than to let 1) but not 2) take effect, yes? When there is a set of operations which will either all take effect or none take effect, we call the set as a whole an atom. Reiser4 implements all of its filesystem system calls (requests to the kernel to do something are called system calls) as fully atomic operations, and allows one to define new atomic operations using its plugin infrastructure. Why don't all filesystems do this? Performance. Reiser4 employs new algorithms that allow it to make these operations atomic at little additional cost, where other filesystems have paid a heavy, usually prohibitive, price to do so. We hope to share with you how that is done.

A Brief History Of How Filesystems Have Handled Crashes

Filesystem Checkers

Originally filesystems had filesystem checkers that would run after every crash. The problems with that were that 1) the checkers cannot handle every form of damage well, and 2) the checkers run for a long time.
The amount of data stored on hard drives increased faster than the transfer rate (the rate at which a hard drive transfers data from the platter spinning inside it into the computer's RAM when asked to do one large continuous read, or the rate in the other direction for writes). That meant the checkers took longer to run, and as the decades ticked by it became less and less reasonable for a mission critical server to wait for the checker.

Fixed Location Journaling

The solution adopted was to first write each atomic operation to a location on disk called the journal or log, and then, only after the atom had fully reached the journal, write it to the committed area of the filesystem. The problem with this is that twice as much data needs to be written. On the one hand, if the workload is dominated by seeks, this is not as much of a burden as one might think. On the other hand, for writes of large files it halves performance, because such writes are usually transfer time dominated. For this reason, meta-data journaling came to dominate general purpose usage. With meta-data journaling, the filesystem guarantees that all of its operations on its meta-data will be done atomically. If a file is being written to, the data being written may be corrupted as a result of non-atomic data operations, but the filesystem's internals will all be consistent. The performance advantage was substantial. V3 of ReiserFS offers both meta-data and data journaling, and defaults to meta-data journaling because that is the right solution for most users. Oddly enough, meta-data journaling is much more work to implement, because it requires being precise about what needs to be journaled. As is so often the case in programming, doing less work requires more code.
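The write-twice discipline of fixed location journaling can be sketched with a toy model. The dictionaries standing in for the disk and journal, and the `complete` marker, are invented for illustration:

```python
# Toy fixed-location journal: each atom is written twice, first to the
# journal area, then (once fully journaled) to the committed area.
def commit_atom(disk, journal, blocks):
    """blocks: {block_number: new_contents} forming one atom."""
    journal.clear()
    journal.update(blocks)            # 1st write: atom enters the journal
    journal["complete"] = True        # atom has fully reached the journal
    disk.update(blocks)               # 2nd write: committed area

def recover(disk, journal):
    """Crash recovery: replay the journal only if the atom fully made it;
    a half-written atom is discarded, keeping the disk consistent."""
    if journal.pop("complete", False):
        disk.update(journal)

disk, journal = {1: "old-A", 2: "old-B"}, {}
commit_atom(disk, journal, {1: "new-A", 2: "new-B"})
print(disk[1], disk[2])

# Crash mid-journal: "complete" was never written, so recovery changes nothing.
half_journal = {1: "half-A"}
recover(disk, half_journal)
print(disk[1])
```

Note how atomicity comes precisely from the doubled write: the committed area is only touched after the whole atom is safely in the journal.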
With fixed location data journaling, the cost of writing twice makes the overhead of making each operation atomic too high for average applications that don't especially need it. Applications that do need atomicity are written to use fsync and rename to accomplish it, and these tools are simply terrible for that job: terrible in performance, and terrible in the ugliness they add to the coding of applications. Stuffing a transaction into a single file just because you need the transaction to be atomic is hardly what one would call flexible semantics. Also, data journaling, for all its performance cost, still does not necessarily guarantee that every system call is fully atomic, much less that one can construct sets of operations that are fully atomic. It usually merely guarantees that the files will not contain random garbage, however many blocks of them happen to get written, and however much the application might view the result as inconsistent data. I hope you understand that in providing these atomicity guarantees we are trying to set a new expectation for how secure a filesystem should keep your data.

Wandering Logs

One way to avoid having to write the data twice is to change one's definition of where the log area and the committed area are, instead of moving the data from the log to the committed area. There is an annoying complication, though: there are probably a number of pointers to the data from the rest of the filesystem, and we need them to point to the new data. When the commit occurs, we need to write those pointers so that they point to the data we are committing. Fortunately, these pointers tend to be highly concentrated as a result of our tree design. But wait: if we are going to update those pointers, then we want to commit those pointers atomically also, which we could do by writing them to another location and updating the pointers to them, and....
up the tree the changes ripple. When we get to the top of the tree, since disk drives write sectors atomically, the block number of the top can be written atomically into the superblock by the disk, thereby committing everything the new top points to. This is indeed the way WAFL, the Write Anywhere File Layout filesystem invented by Dave Hitz at Network Appliance, works. It always ripples changes all the way to the top, and indeed that works rather well in practice; most of their users are quite happy with its performance.

Writing Twice May Be Optimal Sometimes

Suppose that a file is currently well laid out, you write to a single block in the middle of it, and you then expect to do many reads of the file. That is an extreme case illustrating that sometimes it is worth writing twice so that a block can keep its current location while committing atomically. If one writes a node twice in this way, one also does not need to update its parent and ripple changes all the way to the top of the tree. Our code is a toolkit that can be used to implement different layout policies, and one of the available choices is whether to write over a block in its current place or to relocate it somewhere else. I don't think there is one right answer for all usage patterns. If a block is adjacent to many other dirty blocks in the tree, then the cost to read performance of relocating it and its neighbors matters less. If one knows that a repacker will run once a week (a repacker is expected for V4.1, and is, a bit oddly, absent from WAFL), this also decreases the cost of relocation. After a few years of experimentation, measurement, and user feedback, we will say more about our experiences in constructing user selectable policies. Do we pay a performance penalty for making Reiser4 atomic? Yes, we do. Is it an acceptable penalty?
We picked up a lot more performance from other improvements in Reiser4 than we lost to atomicity, so the loss is not isolated in our measurements, but I am unscientifically confident that the answer is yes. If changes are either large or batched together with enough other changes to become large, the performance penalty is low and drowned out by other performance improvements. Scattered small changes threaten us with read performance losses compared to overwriting in place and taking our chances with the data's consistency if there is a crash, but use of a repacker will mostly alleviate this scenario. I have to say that in my heart I don't have any serious doubts that, for the general purpose user, the increase in data security is worthwhile. The users, though, will have the final say.

Committing

A transaction preserves the previous contents of all modified blocks in their original location on disk until the transaction commits; commit means the transaction has reached a state where it will be completed even if there is a crash. The dirty blocks of an atom (which were captured and subsequently modified) are divided into two sets, relocate and overwrite, each of which is preserved in a different manner. The relocatable set is the set of blocks that have a dirty parent in the atom. The relocate set is those members of the relocatable set that will be written to a new or first location rather than overwritten. The overwrite set contains all dirty blocks in the atom that need to be written to their original locations, which is all those not in the relocate set: in practice, those which do not have a parent we want to dirty, plus those for which overwrite is the better layout policy despite the write-twice cost. Note that the superblock is the parent of the root node, and the free space bitmap blocks have no parent. By these definitions, the superblock and modified bitmap blocks are always part of the overwrite set.
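The partition just defined can be sketched directly from the definitions. The block names, the parent map, and the `prefer_overwrite` knob are invented for illustration, not Reiser4's actual structures:

```python
# Partition an atom's dirty blocks per the definitions in the text:
# relocatable = dirty blocks whose parent is also dirty (in the atom);
# relocate    = relocatable blocks we choose to write to a new location;
# overwrite   = every other dirty block (superblock, bitmaps, ...).
def partition(dirty, parent_of, prefer_overwrite=frozenset()):
    relocatable = {b for b in dirty if parent_of.get(b) in dirty}
    relocate = relocatable - prefer_overwrite   # layout policy may demur
    overwrite = dirty - relocate
    return relocate, overwrite

dirty = {"superblock", "bitmap", "twig", "leaf1", "leaf2"}
parent_of = {"leaf1": "twig", "leaf2": "twig",
             "twig": "root", "root": "superblock"}  # root itself is clean

relocate, overwrite = partition(dirty, parent_of)
print(sorted(relocate))    # the leaves under the dirty twig
print(sorted(overwrite))   # superblock, bitmap, and the twig itself
```

The twig lands in the overwrite set because its parent (the root) is clean, and the superblock and bitmap land there because they have no dirty parent, matching the note above.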
The wandered set is the set of blocks to which the overwrite set will be written temporarily, until the overwrite set commits. An interesting variation is the minimum overwrite set, which uses the same definitions as above with the following modification: if at least two dirty blocks have a common parent that is clean, then that parent is added to the minimum overwrite set, and its dirty children are removed from the overwrite set and placed in the relocate set. This policy is an example of what will be experimented with in later versions of Reiser4 using the layout toolkit. For space reasons, we leave out the full details of exactly when we relocate vs. overwrite, and the reader should not regret this, because years of experimenting probably lie ahead before we can speak with the authority necessary for a published paper on the effects of the many details and variations possible. When we commit, we write a wander list, which consists of a mapping from the wandered set to the overwrite set. The wander list is a linked list of blocks containing pairs of block numbers. The last act of committing a transaction is to update the superblock to point to the front of that list. Once that is done, if there is a crash, crash recovery will go through the list and "play" it, which means to write the wandered set over the overwrite set. If there is not a crash, we play it as well. There are many more details of how we handle the deallocation of wandered blocks, the handling of bitmap blocks, and so forth. You are encouraged to read the comments at the top of our source code files (e.g. wander.c) for such details....

Journalling Optimizations

Copy-on-capture

Suppose one wants to capture a node which belongs to an atom with stage >= ASTAGE_PRE_COMMIT. This capture request would normally have to wait (sleep in capture_fuse_wait()) while the atom is committed. The copy-on-capture optimization allows the capture request to be satisfied by creating a copy of the node being captured.
The commit process takes control of one copy of the node, and the capturing process takes control of the other. This does not lead to any node version conflicts, because it is guaranteed that the copy held by the commit process will not be modified.

Steal-on-capture

The idea of the steal-on-capture optimization is that only the last committed transaction to modify an overwrite block actually needs to write that block; other transactions can skip writing it after commit. This optimization, which is also present in ReiserFS version 3, means that frequently modified overwrite blocks will be written less than two times per transaction. With this optimization, a frequently modified overwrite block may avoid being overwritten by a series of atoms; as a result, crash recovery must replay more atoms than without the optimization. If an atom has overwrite blocks stolen, the atom must be replayed during crash recovery until every stealing atom commits.

Repacker

Another way of escaping the balancing time vs. space efficiency tradeoff is to use a repacker. 80% of files on the disk remain unchanged for long periods of time. It is efficient to pack them perfectly, using a repacker that runs much less often than every write to disk. This repacker goes through the entire tree ordering, from left to right and then from right to left, alternating each time it runs. When it goes from left to right in the tree ordering, it shoves everything as far to the left as it will go, and when it goes from right to left it shoves everything as far to the right as it will go. (Left means small in key or in block number:-) ). In the absence of FS activity, the effect of this over time is to sort by tree order (defragment) and to pack with perfect efficiency. Reiser4.1 will modify the repacker to insert controlled "air holes", as it is well known that insertion efficiency is harmed by overly tight packing.
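The alternating repacker pass can be sketched as follows. The node capacity and list-of-lists representation are assumptions for illustration; note that the left shove is the same operation a slum squeeze performs at flush time:

```python
# Toy repacker pass: collect items in tree order and pack them tightly
# toward one end, alternating direction between runs. Nodes that end up
# empty are simply freed (not emitted).
CAP = 4  # assumed items per node, for illustration

def repack(nodes, direction):
    items = [i for n in nodes for i in n]     # preserve tree order
    if direction == "left":
        return [items[k:k + CAP] for k in range(0, len(items), CAP)]
    rem = len(items) % CAP                    # right: partial node first
    head = [items[:rem]] if rem else []
    return head + [items[k:k + CAP] for k in range(rem, len(items), CAP)]

nodes = [["a"], ["b", "c"], ["d"], ["e", "f"]]   # four half-empty nodes
print(repack(nodes, "left"))    # packed into two nodes; two nodes freed
print(repack(nodes, "right"))   # the next run would shove the other way
```

Either direction leaves at most one partially filled node, which is the "perfect efficiency" packing the text describes, achieved with large sequential passes rather than per-modification balancing.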
I hypothesize that it is more efficient to periodically run a repacker that systematically repacks using large I/Os than to perform lots of 1 block reads of the neighbors of each modification point so as to preserve a balancing invariant in the face of poorly localized modifications to the tree.

Plugins

[Illustration: man holding 3 plugins]

8 Kinds of Plugins Make Reiser4 The Most Tweakable Filesystem Going

File Plugins

Every file possesses a plugin id, and every directory possesses a plugin id. This plugin id identifies a set of methods. The set of methods embodies all of the different possible interactions with the file or directory that come from sources external to ReiserFS. It is a layer of indirection added between the external interface to ReiserFS and the rest of ReiserFS. Each method has a method id. It will be usual to mix and match methods from other plugins when composing plugins.

Directory Plugins

Reiser4 will implement a plugin for traditional directories. It will implement directory style access to file attributes as part of the plugin for regular files. Later we will describe why this is useful. Other directory plugins we leave for later versions. There is no deep reason for this deferral; it is simply the randomness of what features attract sponsors and make it into a release specification, and there are no sponsors at the moment for additional directory plugins. I have no doubt that they will appear later; new directory plugins will be too much fun to miss out on.:-)

Hash Plugins

A directory is a mapping from file names to the files themselves. This mapping is implemented through Reiser4's internal balanced tree. Unfortunately, file names cannot be used as keys until keys of variable length are implemented, unless unreasonable limitations on maximal file name length are imposed. To work around this, the file name is hashed, and the hash is used as a key in the tree. No hash function is perfect, and there will always be hash collisions, that is, file names having the same hash value.
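The collision problem, and the generation-counter style of workaround that the next paragraphs discuss, can be sketched as follows. The hash function and the bit split between hash and counter are invented for illustration, not the real on-disk layout:

```python
import zlib

HASH_BITS, GEN_BITS = 25, 7   # invented split, not the actual v3 layout

def dir_key(name, taken):
    """Key = truncated hash of the name in the high bits, plus a
    generation counter in the low bits that disambiguates names whose
    truncated hashes collide. `taken` holds keys already in use."""
    h = (zlib.crc32(name.encode()) & ((1 << HASH_BITS) - 1)) << GEN_BITS
    gen = 0
    while (h | gen) in taken:
        gen += 1                                  # next free generation
        if gen == 1 << GEN_BITS:
            raise RuntimeError("too many colliding names for this hash")
    return h | gen

taken = set()
for name in ("passwd", "group", "shadow"):        # hypothetical entries
    taken.add(dir_key(name, taken))
print(len(taken))                                 # one unique key per name
```

The cost visible in the sketch is the point the text makes: the counter bits are taken away from the hash, so fewer bits remain to spread names apart, which is why this amounts to an ad hoc form of non-unique key support.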
Previous versions of ReiserFS (3.5 and 3.6) used a "generation counter" to overcome this problem: keys for file names having the same hash value were distinguished by having different generation counters. This amortized hash collisions at the cost of reducing the number of bits used for hashing. The "generation counter" technique is actually an ad hoc form of support for non-unique keys. Keeping in mind that some form of this has to be implemented anyway, it seemed justifiable to implement more regular support for non-unique keys in Reiser4. Another reason for using hashes is that some (arguably brain-dead) interfaces require them: telldir(3) and seekdir(3). These functions presume that the file system can issue 64 bit "cookies" that can be used to resume a readdir. Cookies are implemented in most filesystems as byte offsets within a directory (which means those filesystems cannot shrink directories), and in ReiserFS as hashes of file names plus a generation counter. Curiously enough, the Single UNIX Specification tags telldir(3) and seekdir(3) as an "Extension", because "returning to a given point in a directory is quite difficult to describe formally, in spite of its intuitive appeal, when systems that use B-trees, hashing functions, or other similar mechanisms to order their directories are considered". We order directory entries in ReiserFS by their cookies. This costs us performance compared to ordering lexicographically (but is immensely faster than the linear searching employed by most other Unix filesystems). Depending on the hash and its match to the application usage pattern, there may be more or less performance lost. Hash plugins will probably remain until version 5 or so, when directory plugins and ordering function plugins will obsolete them. Directory entries will then be ordered by file names, as they should be (and possibly stem compressed as well).

Security Plugins

Security plugins handle all security checks.
They are normally invoked by file and directory plugins. Example of reading a file:

* Access the plugin id for the file.
* Invoke the read method for the plugin.
* The read method determines the security plugin for the file.
* That security plugin invokes its read check method to determine whether to permit the read.
* The read check method for the security plugin reads the file/attributes containing the permissions on the file.
* Since file/attributes are also files, this means invoking the plugin for reading the file/attribute.
* The plugin id for this particular file/attribute for this file happens to be inherited (saving space and centralizing control of it).
* The read method for the file/attribute is coded such that it does not check permissions when called by a security plugin method. (Endless recursion is thereby avoided.)
* The file/attribute plugin employs a decompression algorithm specially designed for efficient decompression of our encoding of ACLs.
* The security plugin determines that the read should be permitted.
* The read method continues and completes.

Item Plugins

The balancing code will be able to balance an item iff it has an item plugin implemented for it. The item plugin will implement each of the methods the balancing code needs (methods such as splitting items, estimating how large the split pieces will be, overwriting, appending to, cutting from, or inserting into the item, etc.). In addition to all of the balancing operations, item plugins will also implement intra-item search plugins. V3 of ReiserFS understood the structure of the items it balanced. This made adding new types of items, storing such new security attributes as other researchers might develop, too expensive in coding time, greatly inhibiting their addition to ReiserFS.
In writing Reiser4 we hoped that there would be a great proliferation in the types of security attributes in ReiserFS if adding one required not a modification of the balancing code by our most experienced programmers, but merely the writing of an item handler. This is necessary if we are to achieve our goal of making the addition of each new security attribute an order of magnitude or more easier than it is now.

Key Assignment Plugins

When assigning the key to an item, the key assignment plugin is invoked; it has a key assignment method for each item type. A single key assignment plugin is defined for the whole FS at FS creation time. We know from experience that there is no "correct" key assignment policy; squid has very different needs from average user home directories. Yes, there could be value in varying it more flexibly than just at FS creation time, but we have to draw the line somewhere when deciding what goes into each release....

Node Search and Item Search Plugins

Every node layout has a search method for that layout, and every item that is searched through has a search method for that item. (When doing searches, we search through a node to find an item, and then search within the item, for those items that contain multiple things to find.)

Putting Your New Plugin To Work Will Mean Recompiling

If you want to add a new plugin, we think that having to ask the sysadmin to recompile the kernel with your new plugin added will be acceptable for version 4.0. We will initially code plugin-id lookup as an in-kernel fixed length array lookup and method ids as function pointers, and make no provision for post-compilation loading of plugins. Performance, and coding cost, motivates this.
[Cartoon: one character almost drowning while another character hands him a plugin.]

== Without Plugins We Will Drown ==

People often ask, as ReiserFS grows in features, how will we keep the design from being drowned under the weight of the added complexity, and from reaching the point where it is difficult to work on the code? The infrastructure to support security attributes implemented as files also enables lots of features not necessarily security related. The plugins we are choosing to implement in v4.0 are all security related because of our funding source, but users will add other sorts of plugins, just as they took DARPA's TCP/IP and used it for non-military computers. Only by requiring that all features be implemented in the manner that maximizes code reuse will we keep ReiserFS coding complexity down to where we can manage it over the long term.

== Plugins: FS Programming For The Lazy ==

Most plugins will have only a very few of their features unique to them, and the rest of the plugin will be reused code. What Namesys sees as its role as a DARPA contractor is not primarily supplying a suite of security plugins, though we are doing that, but creating an architectural (not just the license) enabling of lots of outside vendors to efficiently create lots of innovative security plugins that Namesys would never have imagined if working by itself.

== Enhancing Security ==

[Cartoon: superman character complaining about an emergency.]

By far most casualties in wars have always been civilians. In future information infrastructure attacks, who will take more damage, civilian or military installations? DARPA is funding us to make all Gnu/Linux computers throughout the world a little bit more resistant to attack.

== Fine Graining Security ==

=== Good Security Requires Precision In Specification Of Security ===

Suppose you have a large file with many components. A general principle of security is that good security requires precision of permissions.
When security lacks precision, it increases the burden of being secure; the extent to which users adhere to security requirements in practice is a function of the burden of adhering to them.

=== Space Efficiency Concerns Motivate Imprecise Security ===

Many filesystems make it space-inefficient to store small components as separate files, for various reasons. Not being separate files means that they cannot have separate permissions. One of the reasons for using overly aggregated units of security is thus space efficiency. ReiserFS currently improves this by an order of magnitude over most of the existing alternative art. Space efficiency is the hardest of the reasons to eliminate; its elimination makes it that much more enticing to attempt to eliminate the other reasons.

=== Security Definition Units And Data Access Patterns Sometimes Inherently Don't Align ===

Applications sometimes want to operate on a collection of components as a single aggregated stream. (Note that commonly two different applications want to operate on data with different levels of aggregation; the infrastructure for solving this security issue will solve that problem as well.)

=== /etc/passwd As Example ===

I am going to use the /etc/passwd file as an example, not because I think that other solutions won't solve its problems better, but because its implementation as a single flat file in the early Unixes is a wonderful illustrative example of poorly granularized security, one the readers may share my personal experiences with. I hope they will be able to imagine that other, less famous data files could have similar problems. Have you ever tried to figure out just exactly what part of your continually changing /etc/passwd file changed near the time of a break-in? Have you ever wished that you could have a modification time on each field in it?
Have you ever wished that users could change part of it, such as the gecos field, themselves (setuid utilities have been written to allow this, but this is a pedagogical, not a practical, example), but not have the power to change it for other users? There were good reasons why /etc/passwd was first implemented as a single file with one single permission governing the entire file. If we can eliminate them one by one, the same techniques for making finer grained security effective will be of value to other highly secure data files.

=== Aggregating Files Can Improve The User Interface To Them ===

Consider the use of emacs on a collection of a thousand small 8-32 byte files, like you might have if you deconstructed /etc/passwd into small files with separable ACLs for every field. It is more convenient in screen real estate, buffer management, and other user interface considerations to operate on them as an aggregation all placed into a single buffer, rather than as a thousand 8-32 byte buffers.

=== How Do We Write Modifications To An Aggregation ===

Suppose we create a plugin that aggregates all of the files in a directory into a single stream. How does one handle writes to that aggregation that change the length of the components of that aggregation? Richard Stallman pointed out to me that if we separate the aggregated files with delimiters, then emacs need not be changed at all to acquire an effective interface for large numbers of small files accessed via an aggregation plugin. If /new_syntax_access_path/big_directory_of_small_files/.glued is a plugin that aggregates every file in big_directory_of_small_files with a delimiter separating every file within the aggregation, then one can simply type emacs /new_syntax_access_path/big_directory_of_small_files/.glued, and the filesystem has done all the work emacs needs to be effective at this. Not a line of emacs needs to be changed.
One needs to be able to choose different delimiting syntax for different aggregation plugins, so that one can, for say the passwd file, aggregate subdirectories into lines, and files within those subdirectories into colon-separated fields within the line. XML would benefit from yet other delimiter construction rules. (We have been told by Philipp Guehring of LivingXML.NET that ReiserFS is higher performance than any database for storing XML, so this issue is not purely theoretical.)

=== Aggregation Is Best Implemented As Inheritance ===

In summary, to achieve precision in security we need inheritance with specifiable delimiters, and we need whole file inheritance to support ACLs.

=== One Plugin Using Delimiters That Resemble sys_reiser4() Syntax ===

We provide the infrastructure for constructing plugins that implement arbitrary processing of writes to inheriting files, but we also supply one generic inheriting file plugin that intentionally uses delimiters very close to the sys_reiser4() syntax. We will document the syntax more fully when that code is working; for now, syntax details are in the comments in the file invert.c in the source code.

== API Suitable For Accessing Files That Store Security Attributes ==

A new system call, sys_reiser4(), will be implemented to support applications that don't have to be fooled into thinking that they are using POSIX. Through this entry point a richer set of semantics will access the same files that are also accessible using POSIX calls. reiser4() will not implement more than hierarchical names. A full set theoretic naming system as described on our future vision page will not be implemented before SSN Reiserfs is implemented (Distributed Reiserfs is our distributed filesystem, Semi-Structured Naming Reiserfs is our enhanced semantics; whether we implement Distributed Reiserfs or SSN Reiserfs first depends on which sponsors we find ;-) ).
reiser4() will implement all features necessary to access ACLs as files/directories rather than as something neither file nor directory. These include opening and closing transactions, performing a sequence of I/Os in one system call, and accessing files without use of file descriptors (necessary for efficient small I/O). reiser4() will use a syntax suitable for evolving into SSN Reiserfs syntax, with its set theoretic naming.

=== Flaws In Traditional File API When Applied To Security Attributes ===

Security related attributes tend to be small. The traditional filesystem API for reading and writing files has these flaws in the context of accessing security attributes:
* Creating a file descriptor is excessive overhead, and not useful, when accessing an 8 byte attribute.
* A system call for every attribute accessed is too much overhead when accessing lots of little attributes.
* Lacking constraints: it is important to constrain what is written to the attribute, often in complex ways.
* Lacking atomic semantics: often one needs to update multiple attributes as one action that is guaranteed to either fully succeed or fully fail.

=== The Usual Resolution Of These Flaws Is A One-Off Solution ===

The usual response to these flaws is that people adding security related and other attributes create a set of methods unique to their attributes, plus non-reusable code to implement those methods, in which their particular attributes are accessed and stored not using the methods for files, but using their particular methods for that attribute. Their particular API for that attribute typically does a one-off instantiation of a lightweight, single-system-call, constrained, atomic access, with no code being reusable by those who want to modify file bodies. It is basic and crucial to system design to decompose desired functionality into reusable, orthogonal, separated components.
Persons designing security attributes are typically doing it without the filesystem that they want offering them a proper foundation and toolkit. They need more help from us core FS developers. Linus said that we can have a system call to use as our experimental plaything in this. With what I have in mind for the API, one rather flexible system call is all we want for creating atomic lightweight batched constrained accesses to files, with each of those adjectives being an orthogonal optional feature that may or may not be invoked in a particular instance of the new system call.

=== One-Off Solutions Are A Lot of Work To Do A Lot Of ===

Looking at the coin from the other side, we want to make it an order of magnitude less work to add features to ReiserFS, so that both users and Namesys can add at least an order of magnitude more of them. To verify that it is truly more extensible you have to do some extending, and our DARPA funding motivates us to instantiate most of those extensions as new security features. This system call's syntax enables attributes to be implemented as a particular type of file. It avoids uglifying the semantics with two APIs for two supposedly different kinds of objects that don't truly need different treatment. All of its special features that are useful for accessing particular attributes are also available for use on files. It has symmetry, and its features have been fully orthogonalized. There is nothing particularly interesting about this system call to a languages specialist (its ideas were explored decades ago, except by filesystem developers) until SSN Reiserfs, when we will further evolve it into a set theoretic syntax that deconstructs tuple structured names into hierarchy and vicinity set intersection. That is described at www.namesys.com/whitepaper.html

=== Steps For Creating A Security Attribute ===

You can create a new security attribute by:
* Defining a pluginid.
* Composing a set of methods for the plugin from ones you create or reuse from other existing plugins.
* Defining a set of items that act as the storage containers of the object, or reusing existing items from other plugins (e.g. regular files).
* Implementing item handlers for all of the new items you create.
* Creating a key assignment algorithm for all of the new items.

=== reiser4() System Call Description ===

The reiser4() system call (still being debugged at the time of writing) executes a sequence of commands separated by commas. Assignment and transaction are the commands supported in reiser4(); more commands will appear in SSN Reiserfs. <- and <<- are two of the assignment operators.

lhs (assignment target) values:
* /..../process/range/(offset<-(loff_t),last_byte<-(loff_t)) assigns (writes) to the buffer starting at address offset in the process address space, ending at last_byte. (The assignment source may be smaller or larger than the assignment target.) Representation of offset and last_byte is left to the coder to determine. It is an issue that will be of much dispute and little importance. Notice / is used to indicate that the order of the operands matters; see the future vision whitepaper for details of why this is appropriate syntax design. Note the lack of a file descriptor.
* /filename assigns to the file named filename.
* /filename/..../range/(offset<-(loff_t),last_byte<-(loff_t)) writes to the body, starting at offset, ending not past last_byte.
* /filename/..../range/(offset<-(loff_t)) writes to the body starting at offset.

rhs (assignment source) values:
* /..../process/range/(offset<-(loff_t),last_byte<-(loff_t)) reads from the buffer starting at address offset in the process address space, ending at last_byte. Representation of offset and last_byte is left to the coder to determine, as it is an issue that will be of much dispute and little importance.
* /filename reads the entirety of the file named filename.
* /filename/..../range/(offset<-(loff_t),last_byte<-(loff_t)) reads from the body, starting at offset, ending not past last_byte.
* /filename/..../range/(offset<-(loff_t)) reads from the body starting at offset until the end.
* /filename/..../stat/owner reads from the ownership field of the stat data. (Stat data is that which is returned by the stat() system call (owner, permissions, etc.) and stored on a per file basis by the FS.)

Note that "...." and "process" are style conventions for the name of a hidden subdirectory implementing methods and accessing metadata supported by a plugin. It is possible to rename it, etc. We had a discussion about whether to instead use names that could not clash with any legitimate name likely to be used by users. Vladimir Demidov suggested that cryptic names have historically harmed the acceptance of several languages, and so it was realized that being novice unfriendly in the naming was worse than risking a name collision, especially since a collision could be cured by using rename on "...." and "process" for the few cases where it is necessary.

=== Constraints ===

(Note: this is not yet coded.) Another way security may be insufficiently fine grained is in values: it can be useful to allow persons to change data, but only within certain constraints. For this project we will implement plugins; one type of plugin will be write constraints. Write-constraints are invoked upon write to a file; if they return non-error, then the write is allowed. We will implement two trivial sample write-constraint plugins. One will be in the form of a kernel function, loadable as a kernel module, which returns non-error (thus allowing the write) if the file consists of the strings "secret" or "sensitive" but not "top-secret". The other, which does exactly the same, will be in the form of a perl program residing in a file and executed in user-space.
Use of kernel functions will have performance advantages, particularly for small functions, but severe disadvantages in power of scripting, flexibility, and ability to be installed by non-secure sources. Both types of plugins will have their place. Note that ACLs will also embody write constraints. We will implement both constraints that are compiled into the kernel, and constraints that are implemented as user space processes. Specifically, we will implement a plugin that executes an arbitrary constraint, contained in an arbitrary named file, as a user space process, passes the proposed new file contents to that process as standard input, and iff the process exits without error allows the write to occur. It can be useful to have read constraints as well as write constraints.

=== Auditing ===

(Note: this is not yet coded.) We will implement a plugin that notifies administrators by email when access is made to files, e.g. read access. With each plugin implemented, creating additional plugins becomes easier as the available toolkit is enriched. Auditing constitutes a major additional security feature, yet it will be easy to implement once the infrastructure to support it exists. (It would be substantial work to implement it without that infrastructure.) The scope of this project is not the creation of plugins themselves, but the creation of the infrastructure that plugin authors would find useful. We want to enable future contributors to implement more secure systems on the Gnu/Linux platform, not implement them ourselves. By laying a proper foundation and creating a toolkit for them, we hope to reduce the cost of coding new security attributes for those who follow us by an order of magnitude. Employing a proper set of well orthogonalized primitives also changes the addition of these attributes from being a complexity burden upon the architecture into being an empowering extension of the architecture.
== Increasing the Allowed Granularity of Security ==

[Cartoon: man holding a sieve; only objects of a certain size go through.]

(This feature is not yet coded.) Inheritance of security attributes is important to providing flexibility in their administration. We have spoken about making security more fine grained, but sometimes it needs to be larger grained. Sometimes a large number of files are logically one unit in regard to their security, and it is desirable to have a single point of control over their security. Inheritance of attributes is the mechanism for implementing that. Security administrators should have the power to choose whatever units of security they desire, without having to distort them to make them correspond to semantic units. Inheritance of file bodies using aggregation plugins allows the units of security to be smaller than files; inheritance of attributes allows them to be larger than files.

== Encryption On Commit ==

Currently, encrypted files suffer severely in their write performance when implemented using schemes that encrypt at every write() rather than at every commit to disk. We encrypt on flush, such that a file with an encryption plugin id is encrypted not at the time of write, but at the time of flush to disk. Encryption is implemented as a special form of repacking on flush, and it occurs for any node which has its CONTAINS_ENCRYPTED_DATA state flag set.

== Conclusion ==

Reiser4 offers a dramatically better infrastructure for creating new filesystem features. Files and directories have all of the features needed to make it unnecessary for file attributes to be something different from files. The effectiveness of this new infrastructure is tested using a variety of new security features. Performance is greatly improved by the use of dancing trees, wandering logs, allocate on flush, a repacker, and encryption on commit. It was an important question whether we could increase the level of abstraction in our design without harming performance.
Reiser4 gives you BOTH the most cleanly abstracted storage AND the highest performance storage of any filesystem.

== Citations ==

* [Gray93] Jim Gray and Andreas Reuter. "Transaction Processing: Concepts and Techniques". Morgan Kaufmann Publishers, Inc., 1993. Old but good textbook on transactions. Available at http://www.mkp.com/books_catalog/catalog.asp?ISBN=1-55860-190-2
* [Hitz94] D. Hitz, J. Lau and M. Malcolm. "File system design for an NFS file server appliance". Proceedings of the 1994 USENIX Winter Technical Conference, pp. 235-246, San Francisco, CA, January 1994. Available at http://citeseer.nj.nec.com/hitz95file.html
* [TR3001] D. Hitz. "A Storage Networking Appliance". Tech. Rep. TR3001, Network Appliance, Inc., 1995. Available at http://www.netapp.com/tech_library/3001.html
* [TR3002] D. Hitz, J. Lau and M. Malcolm. "File system design for an NFS file server appliance". Tech. Rep. TR3002, Network Appliance, Inc., 1995. Available at http://www.netapp.com/tech_library/3002.html
* [Ousterh89] J. Ousterhout and F. Douglis. "Beating the I/O Bottleneck: A Case for Log-Structured File Systems". ACM Operating System Reviews, Vol. 23, No. 1, pp. 11-28, January 1989. Available at http://citeseer.nj.nec.com/ousterhout88beating.html
* [Seltzer95] M. Seltzer, K. Smith, H. Balakrishnan, J. Chang, S. McMains and V. Padmanabhan. "File System Logging versus Clustering: A Performance Comparison". Proceedings of the 1995 USENIX Technical Conference, pp. 249-264, New Orleans, LA, January 1995. Available at http://citeseer.nj.nec.com/seltzer95file.html
* [Seltzer95Supp] M. Seltzer. "LFS and FFS Supplementary Information". 1995. http://www.eecs.harvard.edu/~margo/usenix.195/
* [Ousterh93Crit] J. Ousterhout. "A Critique of Seltzer's 1993 USENIX Paper". http://www.eecs.harvard.edu/~margo/usenix.195/ouster_critique1.html
* [Ousterh95Crit] J. Ousterhout. "A Critique of Seltzer's LFS Measurements". http://www.eecs.harvard.edu/~margo/usenix.195/ouster_critique2.html
* [SwD96] A.
Sweeney, D. Doucette, W. Hu, C. Anderson, M. Nishimoto and G. Peck. "Scalability in the XFS File System". Proceedings of the 1996 USENIX Technical Conference, pp. 1-14, San Diego, CA, January 1996. Available at http://citeseer.nj.nec.com/sweeney96scalability.html
* [VelskiiLandis] G.M. Adel'son-Vel'skii and E.M. Landis. "An algorithm for the organization of information". Soviet Math. Doklady 3, pp. 1259-1262, 1962. This paper on AVL trees can be thought of as the founding paper of the field of storing data in trees. Those not conversant in Russian will want to read the [Lewis and Denenberg] treatment of AVL trees in its place. [Wood] contains a modern treatment of trees.
* [Apple] Inside Macintosh: Files. Apple Computer Inc., Addison-Wesley, 1992. Employs balanced trees for filenames; it was an interesting filesystem architecture for its time in a number of ways, but its problems with internal fragmentation have become more severe as disk drives have grown larger. I look forward to the replacement they are working on.
* [Bach] Maurice J. Bach. "The Design of the Unix Operating System". Prentice-Hall Software Series, Englewood Cliffs, NJ, 1986. Superbly written but sadly dated; contains detailed descriptions of the filesystem routines and interfaces in a manner especially useful for those trying to implement a Unix compatible filesystem. See [Vahalia].
* [BLOB] R. Haskin, Raymond A. Lorie. "On Extending the Functions of a Relational Database System". SIGMOD Conference, 1982, pp. 207-212 (body of paper not on web). Reiser4 obsoletes this approach.
* [Chen] Chen, P.M., Patterson, David A. "A New Approach to I/O Performance Evaluation---Self-Scaling I/O Benchmarks, Predicted I/O Performance". 1993 ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems; also available on Chen's web page.
* [C-FFS] Ganger, Gregory R., Kaashoek, M. Frans. "Embedded Inodes and Explicit Grouping: Exploiting Disk Bandwidth for Small Files."
A very well written paper focused on 1-10k file size issues; they use some similar notions (most especially their concept of grouping, compared with my packing localities). Note that they focus on the 1-10k file size range, and not the sub-1k range. The 1-10k range is the weak point in ReiserFS V3 performance. The page with a link to the postscript paper is available at http://amsterdam.lcs.mit.edu/papers/cffs.html
* [ext2fs] by Remi Card; extensive information and source code are available. Probably our toughest current competitor; it is showing its age though, and recent enhancements of it (journaling, htrees, etc.) have not been performance effective. It embodies both the strengths and weaknesses of the incrementalist approach to coding, and substantially resembles the older FFS filesystem from BSD.
* [FFS] M. McKusick, W. Joy, S. Leffler, R. Fabry. "A Fast File System for UNIX". ACM Transactions on Computer Systems, Vol. 2, No. 3, pp. 181-197, August 1984. Describes the implementation of a filesystem which employs parent directory location knowledge in determining file layout. It uses large blocks for all but the tail of files to improve I/O performance, and uses small blocks called fragments for the tails so as to reduce the cost due to internal fragmentation. Numerous other improvements are also made to what was once the state-of-the-art. FFS remains the architectural foundation for many current block allocation filesystems, and was later bundled with the standard Unix releases. Note that unrequested serialization and the use of fragments place it at a performance disadvantage to ext2fs, though whether ext2fs is thereby made less reliable is a matter of dispute that I take no position on (Reiser4 is an atomic filesystem, which is a different level of reliability entirely). Available at http://citeseer.nj.nec.com/mckusick84fast.html
* [Ganger] Gregory R. Ganger, Yale N. Patt. "Metadata Update Performance in File Systems".
(Abstract only.)
* [Gifford] Describes a filesystem enriched to have more than hierarchical semantics; he shares many goals with this author, forgive me for thinking his work worthwhile. If I had to suggest one improvement in a sentence, I would say his semantic algebra needs closure. (Postscript only.)
* [Hitz, Dave] A rather well designed filesystem optimized for NFS and RAID in combination. Note that RAID increases the merits of write-optimization in block layout algorithms. Available at http://www.netapp.com/technology/level3/3002.html
* [Holton and Das] Holton, Mike, Das, Raj. "The XFS space manager and namespace manager use sophisticated B-Tree indexing technology to represent file location information contained inside directory files and to represent the structure of the files themselves (location of information in a file)". Note that it is still a block (extent) allocation based filesystem; no attempt is made to store the actual file contents in the tree. It is targeted at the needs of the other end of the file size usage spectrum from ReiserFS, and is an excellent design for that purpose (though most filesystems including Reiser4 do well at writing large files, and I think it is medium-sized and smaller files where filesystems can substantively differentiate themselves). SGI has also traditionally been a leader in resisting the use of unrequested serialization of I/O. Unfortunately, the paper is a bit vague on details. Available at http://www.sgi.com/Technology/xfs-whitepaper.html
* [Howard] Howard, J.H., Kazar, M.L., Menees, S.G., Nichols, D.A., Satyanarayanan, M., Sidebotham, R.N., West, M.J. "Scale and Performance in a Distributed File System". ACM Transactions on Computer Systems, 6(1), February 1988. A classic benchmark; it was too CPU bound to effectively stress ext2fs and ReiserFS, and is no longer very effective for modern filesystems.
* [Knuth] Knuth, D.E. The Art of Computer Programming, Vol.
3 (Sorting and Searching), Addison-Wesley, Reading, MA, 1973. The earliest reference discussing trees storing records of varying length.
* [LADDIS] Wittle, Mark, and Bruce, Keith. "LADDIS: The Next Generation in NFS File Server Benchmarking". Proceedings of the Summer 1993 USENIX Conference, July 1993, pp. 111-128.
* [Lewis and Denenberg] Lewis, Harry R., Denenberg, Larry. "Data Structures & Their Algorithms". HarperCollins Publishers, NY, NY, 1991. An algorithms textbook suitable for readers wishing to learn about balanced trees and their AVL predecessors.
* [McCreight] McCreight, E.M. "Pagination of B*-trees with variable length records". Commun. ACM 20 (9), pp. 670-674, 1977. Describes algorithms for trees with variable length records.
* [McVoy and Kleiman] The implementation of write-clustering for Sun's UFS. Available at http://www.sun.ca/white-papers/ufs-cluster.html
* [OLE] "Inside OLE" by Kraig Brockschmidt; discusses Structured Storage (abstract only). Structured storage is what you get when application developers need features to better manage the storage of objects on disk by the applications they write, and the filesystem group at their company can't be bothered with them. Miserable performance, miserable semantics. Available at http://www.microsoft.com/mspress/books/abs/5-843-2b.htm
* [Ousterhout] J.K. Ousterhout, H. Da Costa, D. Harrison, J.A. Kunze, M.D. Kupfer, and J.G. Thompson. "A Trace-driven Analysis of the UNIX 4.2BSD File System". In Proceedings of the 10th Symposium on Operating Systems Principles, pp. 15-24, Orcas Island, WA, December 1985.
* [NTFS] "Inside the Windows NT File System", written by Helen Custer; NTFS is architected by Tom Miller with contributions by Gary Kimura, Brian Andrew, and David Goebel. Microsoft Press, 1994. An easy to read little book. They fundamentally disagree with me on adding serialization of I/O not requested by the application programmer, and I note that the performance penalty they pay for their decision is high, especially compared with ext2fs. Their FS design is perhaps optimal for floppies and other hardware-eject media beyond OS control. A less serialized, higher performance log structured architecture is described in [Rosenblum and Ousterhout]. That said, Microsoft is to be commended for recognizing the importance of attempting to optimize for small files, and leading the OS designer effort to integrate small objects into the file name space. This book is notable for not referencing the work of persons not working for Microsoft, or providing any form of proper attribution to previous authors such as [Rosenblum and Ousterhout]. Though perhaps they really didn't read any of the literature, and that explains why theirs is the worst performing filesystem in the industry....
* [Peacock] K. Peacock. "The CounterPoint Fast File System". Proceedings of the Usenix Conference, Winter 1988.
* [Pike] Rob Pike and Peter Weinberger. "The Hideous Name". USENIX Summer 1985 Conference Proceedings, p. 563, Portland, Oregon, 1985. Short, informal, and drives home why inconsistent naming schemes in an OS are detrimental. Available at http://achille.cs.bell-labs.com/cm/cs/doc/85/1-05.ps.gz. His discussion of naming in Plan 9: http://plan9.bell-labs.com/plan9/doc/names.html
* [Rosenblum and Ousterhout] M. Rosenblum and J. Ousterhout. "The Design and Implementation of a Log-Structured File System". ACM Transactions on Computer Systems, Vol. 10, No. 1, pp. 26-52, February 1992. Available at http://citeseer.nj.nec.com/rosenblum91design.html.
This paper was quite influential in a number of ways on many modern filesystems, and the notion of using a cleaner may be applied to a future release of ReiserFS. There is an interesting on-going debate over the relative merits of FFS vs. LFS architectures, and the interested reader may peruse http://www.scriptics.com/people/john.ousterhout/seltzer93.html and the arguments by Margo Seltzer it links to.
* [Snyder] "tmpfs: A Virtual Memory File System". Discusses a filesystem built to use swap space and intended for temporary files; due to a complete lack of disk synchronization it offers extremely high performance.
* [Vahalia] Uresh Vahalia. "Unix Kernel Internals".
* [Reiser93] Reiser, Hans T. Future Vision Whitepaper, 1984, revised 1993. Available at http://www.namesys.com/whitepaper.html.

[[category:Reiser4]] [[category:Formatting-fixes-needed]]

Reasons why Reiser4 is great for you:
* Reiser4 is the fastest filesystem, and here are the benchmarks.
* Reiser4 is an atomic filesystem, which means that your filesystem operations either entirely occur, or they entirely don't, and they don't corrupt due to half occurring. We do this without significant performance losses, because we invented algorithms to do it without copying the data twice.
* Reiser4 uses dancing trees, which obsolete the balanced tree algorithms used in databases (see farther down). This makes Reiser4 more space efficient than other filesystems, because we squish small files together rather than wasting space due to block alignment like they do. It also means that Reiser4 scales better than any other filesystem. Do you want a million files in a directory, and want to create them fast? No problem.
* Reiser4 is based on plugins, which means that it will attract many outside contributors, and you'll be able to upgrade to their innovations without reformatting your disk.
If you like to code, you'll really like plugins.... * Reiser4 is architected for military grade security. You'll find it is easy to audit the code, and that assertions guard the entrance to every function. V3 of ReiserFS is used as the default filesystem for SuSE, Lindows, FTOSX, Libranet, Xandros and Yoper. We don't touch the V3 code except to fix a bug, and as a result we don't get bug reports for the current mainstream kernel version. It shipped before the other journaling filesystems for Linux, and is the most stable of them as a result of having been out the longest. We must caution that just as Linux 2.6 is not yet as stable as Linux 2.4, it will also be some substantial time before V4 is as stable as V3. == Software Engineering Based Reiser4 Design Principles == === Equal Source Code Access Is A Civil Right === Copyright and patent laws were invented to give you an incentive to share your knowledge with the rest of the world in return for a limited time monopoly on what you shared. That is not the way it works with software though, because software companies are allowed to keep their source code secret, but are still given monopoly rights over their software. There is little meaningful sharing of knowledge when only binaries are shared with the world, and all the rest is kept secret. The reasons for the existence of copyright and patent laws have been forgotten, their workings have been twisted, and greed and turf defense are what remain of them. Monopoly interests have taken laws intended to promote progress in the arts and sciences, and now use them to further their own control over us by ensuring that innovations not theirs cannot enter the market for improvements to software. Think of software objects as forming a society, not yet at the level of an AI society, but still a group of programs interacting, and choosing whether to interact, with each other.
Think of social lockout, whether it be in the form of racial discrimination as in the civil rights movement, Mercantilism as happened a few centuries ago, or the endless other forms of division in human society. Is it so surprising that this evil casts its shadow on cyberspace? Is it so surprising that our cybershadows also find ways to engage in social lockout of others? Most of the cyber-world of software lives under tyranny today. We are part of a movement to create a free cyber-world we can all participate equally in. Namesys does not oppose copyright laws as they were invented (14 year monopolies which disclosed everything that was temporarily monopolized); it opposes copyright laws as they have been twisted. Namesys opposes unlimited time monopolies which disclose nothing, and lock out all other inventors. Many others in this movement are opposed to copyright law, even the version in which it was first created. We feel they are not acknowledging that a trade-off is being made, and that this trade-off has value. Yet still we choose to give our software away for free for use with software that is given away for free (e.g. Gnu/Linux). Since we don't have a lot of illusions about our ability to entirely change the world, and it is amusing to sell free software, for those who do not want to disclose their software and do not want to give it away for free, we charge a license fee and let them keep their improvements to our software without sharing them. These fees help substantially in allowing us to survive as an organization.
We don't make nearly as much money as we would from charging everyone for usage rights, but we do make just enough to get by, and that is important. ;-) We don't really feel that everyone should follow our example and make their software no charge for most users (it is too hard to survive fiscally doing this), but we do think that everyone should disclose their source code, and no one should design their software to exclude working with other software (e.g. Microsoft's Palladium, which makes such a mockery of Athena). === Software Libre Takes More Than A License --- It Takes A Design === Making the source code available to you is not enough by itself to bring you all of the possible benefits of software libre. Many file systems are so difficult to modify that only someone who has worked with the code for years finds it feasible to modify them, and even then small changes can take months of labor due to their ripple effects on the other code and the difficulties of dealing with disk format changes. This is why we have a plugin based architecture in Reiser4, so that it is not just possible, but easy, to improve the software. Imagine that you were an experimental physicist who had spent his life using only the tools that were in his local hardware store. Then one day you joined a major research lab with a machine shop and a whole bunch of other physicists. All of a sudden you are not using just whatever tools the large tool companies who have never heard of you have made for you. You are now part of a cooperative of physicists all making your own tools, swapping tools with each other, suddenly empowered to have tools that are exactly what you want them to be, or even merely exactly what your colleagues want them to be, rather than what some big tool company, which has to do a market analysis before giving you what you want, wants them to be. That is the transition you will make when you go from version 3 to version 4 of ReiserFS.
The tools your colleagues and sysadmins (your machinists) make are going to be much better for what you need. === Why Limit Interactions With Objects Strictly? === You may wonder why the design we will present is so highly structured, why every object is allowed to control what is done to it by providing a limited interface, and why we pass requests to objects to do things rather than doing things directly to the object. Surely we limit our functionality by doing so, yes? Indeed we do, but is there a reason why the price is worth paying? Is there something that becomes crucial as complexity grows? Chaos theory offers the answer. If you disturb one thing, and disturbing that thing inherently disturbs another thing, which in turn disturbs the first thing plus maybe a whole bunch of other things, and those things all disturb the first thing again, and so on, you get what chaos theory calls a feedback loop. These loops have a marvelous tendency for the end effect of the disturbance to be incalculable, and our inability to calculate such loops is perhaps a significant aspect of our being mere mortals. Of course, as you probably know, most programmers want to be gods, and when they are unable to know what the effect of a change they make to their code will be, they dislike this. As a result, they go to great lengths to reduce the tendency of changes to the design of one object to have ripple effects upon other objects. A vitally important way to do this is to have very strictly defined interfaces to objects, and for the designer of each object to be able to know that the interface will never be violated when he writes it. This is called "object oriented design", or "structured programming", and if used well it can do a lot to reduce a type of chaotic behavior known as bugs. ;-) Verifying the avoidance of interactions that violate the design for an object is a key task in security auditing (inspecting the code to see if it has security holes).
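The point about strictly defined interfaces can be sketched in a few lines. This is a toy illustration in Python, not Reiser4 code, and all names in it are hypothetical: the object guards its invariants at its interface, so a later change to its internals cannot ripple out past its two methods.

```python
# Toy illustration of a strictly defined interface (hypothetical names).
# Callers may only append or count; they cannot reach the internal state,
# so changing the internals later cannot ripple out past these two methods.
class Log:
    def __init__(self):
        self._entries = []  # internal detail, free to change later

    def append(self, entry: str) -> None:
        if not entry:
            # the interface guards its own invariants at the boundary
            raise ValueError("empty entry")
        self._entries.append(entry)

    def count(self) -> int:
        return len(self._entries)

log = Log()
log.append("mount")
log.append("sync")
assert log.count() == 2
```

Because every caller goes through `append` and `count`, auditing the object for violations of its design reduces to auditing those two entry points.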
The expressive power of an information system is proportional not to the number of objects that get implemented for it, but instead to the number of possible effective interactions between objects in it. (Reiser's Law Of Information Economics) This is similar to Adam Smith's observation that the wealth of nations is determined not by the number of their inhabitants, but by how well connected they are to each other. He traced the development of civilization throughout history, and found a consistent correlation between connectivity via roads and waterways, and wealth. He also found a correlation between specialization and wealth, and suggested that greater trade connectivity makes greater specialization economically viable. You can think of namespaces as forming the roads and waterways that connect the components of an operating system. The cost of these connecting namespaces is influenced by the number of interfaces that they must know how to connect to. That cost is, if they are not clever enough to avoid it, N times N, where N is the number of interfaces, since they must write code that knows how to connect every kind to every kind. One very important way to reduce the cost of fully connective namespaces is to teach all the objects how to use the same interface, so that the namespace can connect them without adding any code to the namespace. Very commonly, objects with different interfaces are segregated into different namespaces. If you have two namespaces, one with N objects, and another with M objects, the expressive power of the objects they connect is proportional to (N times N) plus (M times M), which is less than (N plus M) times (N plus M). Try it on a calculator for some arbitrary N and M. Usually the cost of inventing the namespaces is much less than the cost of the users creating all the objects.
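The inequality is easy to verify: (N+M) squared exceeds N squared plus M squared by exactly 2NM, which is precisely the number of cross-namespace interactions that segregation forfeits. A quick check for a few arbitrary sizes:

```python
# Connectivity arithmetic: two segregated namespaces allow N*N + M*M
# interactions, one unified namespace allows (N + M) * (N + M), which is
# larger whenever N and M are both positive.
for n, m in [(3, 4), (10, 1), (100, 250)]:
    segregated = n * n + m * m
    unified = (n + m) * (n + m)
    assert unified - segregated == 2 * n * m  # the forfeited cross-interactions
    assert unified > segregated
```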
This is what makes namespaces so exciting to work with: you can have an enormous impact on the productivity of the whole system just by being a bit fanatical in insisting on simplicity and consistency in a few areas. Please remember this analysis later when we describe why we implement everything to support a "file" or "directory" interface, and why we aren't eager to support objects with unnecessarily different namespaces/interfaces --- such as "attributes" that cannot interact with files in all the same ways that files can interact with files. == Basic Semantics == To interact with an object you name it, and you say what you want it to do. The filesystem takes the name you give, and looks through things we call directories to find the object, and then gives the object your request to do something. === Files === [Illustration: character holding an object that looks like a sequence of bytes] A file is something that tries to look like a sequence of bytes. You can read the bytes, and write the bytes. You can specify what byte to start to read/write from (the offset), and the number of bytes to read/write (the count). [Diagram needed]. You can also cut bytes off of the end of the file. [Illustration: character sawing off the end of a file] Cutting bytes out of the middle or the beginning of a file, and inserting bytes into the middle of a file, are not permitted by any of our current file plugins, all of which implement fairly ancient Unix file semantics, but this is likely to change someday. ==== The Software Engineering Lurking Below File Plugins ==== Your interactions with a file are handled by the file's "plugin". These interactions are structured (in programming, such structures are generally called "interfaces") into a set of limited and defined interactions. (We are too lazy to perform the infinite work of programming plugins to handle infinite types of interactions.) Each way you can interact with a plugin is called a "method". A plugin is composed as a set of such methods.
Among programmers, laziness is considered the highest art form, and we do our best to express our souls in this art. This is why we have layers and layers of laziness built into our plugin architecture. Each method is composed from a library of functions we thought would be useful in constructing plugin methods. Each plugin is composed from a library of methods used by plugins, and a plugin can be considered a one-to-one mapping (that's where you have two sets of things, and for every member of one set, you specify a member of the other set as its match) of every way of interacting with the plugin to a method handling it. For every file, there is a file pluginid. Whenever you attempt to interact with a file, we take the name of the file, find the pluginid for the file, and inside the kernel we have an array of plugins [diagram needed that is suitable for persons who don't know what an array or offset is], and we use the pluginid as the offset of that file's plugin within that array. (An offset is a position relative to something else, and in programming it is typically measured in bytes.) This implies that when you invent a new file plugin, you have to recompile the kernel. (Programmers don't actually write programs; they got too lazy for that long ago. Instead they write instructions for the computer on how to write the program, and when the computer follows these instructions ("source code"), it is called "compiling", which programmers usually pretend was done by them when they speak about it, as in "I recompiled the kernel for my exact CPU this time, and now playing pong is noticeably faster.") You can only add plugins to the end of the list, and you can never reuse or change pluginids for a plugin, or else you will have to go through the whole filesystem changing all of the pluginids that are no longer correct.
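The pluginid dispatch just described (an in-kernel array indexed by pluginid) can be sketched as follows. This is Python rather than kernel C, and every name in it is hypothetical; it only illustrates the lookup, not the real Reiser4 plugin tables.

```python
# Sketch of pluginid dispatch (hypothetical names, not the kernel code).
# Each file records a small integer pluginid; the system keeps an array of
# plugin tables and uses the pluginid as the index into that array.
def read_unix_file(file, offset, count):
    return file["data"][offset:offset + count]

def read_crypto_file(file, offset, count):
    # a different plugin may transform the bytes before returning them
    return bytes(b ^ 0xFF for b in file["data"][offset:offset + count])

# Append only, never renumber: pluginids 0 and 1 must keep their meaning,
# because files on disk store these integers.
PLUGINS = [
    {"read": read_unix_file},    # pluginid 0
    {"read": read_crypto_file},  # pluginid 1
]

def file_read(file, offset, count):
    plugin = PLUGINS[file["pluginid"]]  # pluginid -> plugin: one array lookup
    return plugin["read"](file, offset, count)

f = {"pluginid": 0, "data": b"hello world"}
assert file_read(f, 0, 5) == b"hello"
```

The append-only rule in the text falls out of this picture: renumbering `PLUGINS` would silently change the meaning of every pluginid already stored on disk.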
Someday in a later version we will revise this so that plugins are "dynamically loadable" (which is when you can add something to a program while it is running), and you can add support for new plugins to a running kernel. When we do that we will carefully benchmark and ensure that there is no loss of performance from using dynamic loading (or we won't do it). Programs are often "layered", which is when the program is divided into layers, and each layer only talks to the layer immediately above it or immediately below it, and never talks to a part of the program two levels below it, etc. This reduces the complexity of the interfaces for the various parts of the program, and most of the complexity of a program is in coding its interfaces. [Illustration: characters each communicating with adjacent characters only] Reiser4 has a "semantic layer", and this semantic layer concerns itself with naming objects and specifying what to do to the objects, and doesn't concern itself with such things as how to pack objects into particular places on disk or in the tree. An IO to a file may affect more than one physical sequence of bytes, or no physical sequence of bytes; it may affect the sequences of bytes offered by other files to the semantic layer; and the file plugin may invoke other plugins and delegate work to them; but its interface is structured for offering the caller the ability to read and/or write what the caller sees as being a single sequence of bytes. Appearances are what is wanted. When we say that security attributes are implemented as files, we mean that security attributes look like a sequence of bytes, but the security attributes may be stored in some compressed form that perhaps might be of fixed length, or even be just a single bit.
For the filesystem to offer the benefits of simplicity it need merely provide a uniform appearance that all things it stores are sequences of bytes, and there is nothing to prevent it from gaining efficiency through using many different storage implementations to offer this uniform appearance. For many files it is valuable for them to support efficient tree traversal to any offset in the sequence of bytes. It is not required though, and Unix/Gnu/Linux has traditionally supported some types of files which could not do this. A pipe will allow you to take the output of one command, and connect it to the input of another command, and each of the commands will see the pipe as a file. This pipe is an example of a file for which you cannot simply jump to the middle efficiently; instead you must go through it from beginning to end in sequential order. === Names and Objects === A name is a means of selecting an object. An object is anything that acts as though it is a single unified entity. What is an object is context dependent. For instance, if you tell an object to delete itself, many distinctly named entities (that are distinct objects in other ways such as reading) might well disappear as though they are a single object in response to the delete request. A namespace is a mapping of names to objects. Filesystems, databases, search engines, and environment variable names within shells are all examples of namespaces. The early papers using the term tended to seek to convey that namespaces have commonality in their structure, are not fundamentally different, should be based on common design principles, and should be unified. Such unification is a bit of a quest for a holy grail. In British mythology King Arthur sent his knights out on a quest for the holy grail, and if only they could become worthy of it, it would appear to them. None of them found it, and yet the quest made them what they became.
Namespaces will never be unified, but the closer we can come to it, the more expressive power the OS will have. Reiser4 seeks to create a storage layer effective for such an eventually unified namespace, and gives it a semantic layer with some minor advantages over the state of the art. Later versions will add more and more expressive semantics to the storage layer. Finding objects is layered. The semantic layer takes names and converts them into keys (we call this "resolving" the name). The storage layer (which contains the tree traversing code) takes keys and finds the bytes that store the parts of the object. Keys are the fundamental name used by the Reiser4 tree. They are the name that the storage layer at the bottom of it all understands. They can be used to find anything in the tree, not just whole objects, but parts of objects as well. Everything in the tree has exactly one key. Duplicate keys are allowed, but their use usually means that all duplicates must be examined to see if they really contain what is sought, and so duplicates are usually rare if high performance is desired. Allowing duplicates can make keys more compact in some circumstances (e.g. hashed directory entries). An objectid cannot be used for finding an object; only keys can. Objectids are used to compose keys so as to ensure that keys are unique. === Ordering of Name Components === When designing the naming system described in the future vision whitepaper I broke names from human and computer languages into their pieces, and then looked at those pieces to see which ones differed from each other in meaningful ways vs. which were different expressions that provided the same functionality. (In more formal language, I would say that I systematically decomposed the ways of naming things that we use in human and computer languages into orthogonal primitives, and then determined their equivalence classes.)
I then selected one way of expression from each set of ways that provided equivalent functionality. (Since that whitepaper is focused on what is not yet implemented, the whitepaper does not list all of the equivalence classes for names, but instead describes those which I thought I could say something interesting to the reader about. For instance, the NOT operator is simply unmentioned in it, as I really have nothing interesting to say about NOT, though it is very useful and will be documented when implemented.) The ordering of two components of a name either has meaning, or it does not. If the resolution of one component of the name depends on what is named by another component, then that pair of name components forms a hierarchical name. Hierarchy can be indicated by means other than ordering. Many human languages indicate structure by use of suffix or tag mechanisms (e.g. Russian and Japanese). The syntactical mechanism one chooses to express hierarchy does not determine the possible semantics one can express, so long as at least one effective method for expressing hierarchy is allowed. I chose to offer only one expression from each equivalence class of naming primitives, and here I chose the '/' separated file pathname expression traditional to Unix, for pragmatic compatibility with existing operating systems. Reiser4 handles only hierarchical names, and non-hierarchical names are planned only for SSN Reiserfs. === Directories === Hierarchical names are implemented in Reiser4 by use of directories. The first component of a hierarchical name is the name of the directory, and the components that follow are passed to the directory to interpret. We use `/' to separate the components of a hierarchical name. Directories may choose to delegate parts of their task to their sub-directories.
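The delegation just described can be sketched as recursive name resolution. This is a toy Python illustration with hypothetical names, not Reiser4 code: each directory resolves the component before the first '/', then hands the rest of the name to the sub-directory it selected.

```python
# Sketch of hierarchical name resolution by delegation (hypothetical names).
# A directory resolves the component before the first '/', then delegates
# the remainder of the name to the sub-directory that component selected.
def resolve(directory, name):
    if "/" in name:
        first, rest = name.split("/", 1)
        return resolve(directory[first], rest)  # delegate to the sub-directory
    return directory[name]                      # final component: the object

root = {"usr": {"bin": {"ls": "key-0x2a"}}}
assert resolve(root, "usr/bin/ls") == "key-0x2a"
```

A real directory plugin is free to resolve components by any method at all, as the text notes; the recursion over '/'-separated components is the part this sketch illustrates.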
The unix directory plugin, when supplied with a name, will use the part of the name before the first / to select a sub-directory (if there is a / in what it is resolving), and delegate resolving the part of the name after the first / to the sub-directory. A directory can employ any arbitrary method at all of resolving the name components passed to it, so long as it returns a set of keys of objects as the result. In Reiser4, this set of keys always contains exactly one member, but this is designed to change in SSN Reiserfs. (Reiser4 also needs to interact with a standard interface for Unix filesystems called VFS (Virtual File System), and directories are also designed to be able to return what VFS understands, which we won't go into here.) Directories will also return a list of names when asked. This list is not required to be a complete list of all names that they can resolve, and sometimes it is not desirable that it be so. Names can be hidden names in Reiser4. Directory plugins may be able to resolve more names than they can list, especially if they are written such that the number of names that they can resolve is infinite. In particular, such names can resolve to objects that behave like ordinary files (with respect to the standard file system interface: read, write, readdir, etc.) but are not backed by the storage layer. Such objects are called "pseudo files". Here is a list of pseudo files currently implemented in Reiser4 with description of their semantics. ==== The Unix Directory Plugin ==== The unix directory plugin implements directories by storing a set of directory entries per directory. These directory entries contain a name, and a key.
When given a name to resolve, the unix directory plugin finds the directory entry containing that name, and then returns the key that is in the directory entry (more precisely, since a key selects not just the file but a particular byte within a file, it returns that part of the key which is sufficient to select the file, and which is sufficient to allow the code to determine what the full keys for the file's various parts are when the byte offset and some other fields (like item type) are added to the partial key to form a whole key). The key can then be used by the tree storage layer to find all the pieces of that which was named. ==== Some Historical Details Of Design Flaws In The Unix Directory Interface ==== Unix differs from Multics in that Multics defined a file to be a sequence of elements (the elements could be bytes, directory entries, or something else....), while Unix defines a file to be purely a sequence of bytes. In Multics, directories were then considered to be a particular type of file which was a sequence of directory entries. For many years, all implementations of Unix directories were as sequences of bytes, and the notion of location within a Unix directory is tied not to a name as you might expect, but to a byte offset within the directory. The problem is that one is using a byte offset to represent a location whose true meaning is not a byte offset but a directory entry, and doing so for a particular file in a system which meaningfully names that file not by byte offset within the directory but by filename. Various efforts are being made in the Unix community to pretend that this byte offset is something more general than a byte offset, and they often try to do so without increasing the size used to store the thing which they pretend is not a byte offset. Since byte offsets are normally smaller than filenames are allowed to be, the result is ugliness and pathetic kludges.
Trust me that you would rather not know about the details of those kludges unless you absolutely have to, and let me say no more. ==== Directories Are Unordered ==== Unix/Linux makes no promises regarding the order of names within directories. The order in which files are created is not necessarily the order in which names will be listed in a directory, and the use of lexicographic (alphabetic) order is surprisingly rare. The unix utilities typically sort directory listings after they are returned by the filesystem, which is why it seems like the filesystem sorts them, and is why listing very large directories can be slow. (Our current default plugin sorts filenames that are less than 15 letters long lexicographically. For those that are more than 15 characters long it sorts them first by their first 8 letters, then by the hash of the whole name.) There is value in allowing the user to specify an arbitrary order for names using an arbitrary ordering function the user supplies. This is not done in Reiser4, but is planned as a feature of later versions. Allowing the creation of a hash plugin is a limited form of this that is currently implemented. ==== Files That Are Also Directories ==== In Reiser4 (but not ReiserFS 3) an object can be both a file and a directory at the same time. If you access it as a file, you obtain the named sequence of bytes. If you use it as a directory you can obtain files within it, directory listings, etc. There was a lengthy discussion on the Linux Kernel Mailing List about whether this was technically feasible to do. I won't reproduce it here except to summarize that Linus showed that this was feasible without "breaking" VFS. Allowing an object to be both a file and a directory is one of the features necessary to compose the functionality present in streams and attributes using files and directories.
To implement a regular unix file with all of its metadata, we use a file plugin for the body of the file, a directory plugin for finding file plugins for each of the metadata, and particular file plugins for each of the metadata. We use a unix_file file plugin to access the body of the file, and a unix_file_dir directory plugin to resolve the names of its metadata to particular file plugins for particular metadata. These particular file plugins for unix file metadata (owner, permissions, etc.) are implemented to allow the metadata normally used by unix files to be quite compactly stored. ==== Hidden Directory Entries ==== A file can exist but not be visible when using readdir in the usual way. WAFL does this with the .snapshots directory; it works well for them without disturbing users. This is useful for adding access to a variety of new features and their applications without disturbing the user when they are not relevant. === New Security Attributes and Set Theoretic Semantic Purity === [Illustration: character holding primitive icons] ==== Minimizing Number Of Primitives Is Important In Abstract Constructions ==== To a theoretician it is extremely important to minimize the number of primitives with which one achieves the desired functionality in an abstract construction. It is a bit hard to explain why this is so, but it is well accepted that breaking an abstract model into more basic primitives is very important. A not very precise explanation of why is to say that by breaking complex primitives into their more basic primitives, then recombining those basic primitives differently, you can usually express new things that the original complex primitives did not express. Let's follow this grand tradition of theoreticians and see what happens if we apply it to Gnu/Linux files and directories. ==== Can We Get By Using Just Files and Directories (Composing Streams And Attributes From Files And Directories)? ==== In Gnu/Linux we have files, directories, and attributes. In NTFS they also have streams.
Since Samba is important to Gnu/Linux, there frequently are requests that we add streams to ReiserFS. There are also requests that we add more and more different kinds of attributes using more and more different APIs. Can we do everything that can be done with {files, directories, attributes, streams} using just {files, directories}? I say yes--if we make files and directories more powerful and flexible. I hope that by the end of reading this you will agree. Let us have two basic objects. A file is a sequence of bytes that has a name. A directory is a name space mapping names to a set of objects "within" the directory. We connect these directory name spaces such that one can use compound names whose subcomponents are separated by a delimiter '/'. What do attributes and streams offer that is missing from files and directories? In ReiserFS 3, there exist file attributes. File attributes are out-of-band data describing the sequence of bytes which is the file. For example, the permissions defining who can access a file, or the last modification time, are file attributes. File attributes have their own API; creating new file attributes creates new code complexity and compatibility issues galore. ACLs are one example of new file attributes users want. Since in Reiser4 files can also be directories, we can implement traditional file attributes as simply files. To access a file attribute, one need merely name the file, followed by a '/', followed by an attribute name. That is: a traditional file will be implemented to possess some of the features of a directory; it will contain files within the directory corresponding to file attributes which you can access by their names; and it will contain a file body which is what you access when you name the "directory" rather than the file. Unix currently has a variety of attributes that are distinct from files (ACLs, permissions, timestamps, other mostly security related attributes, ...).
This is because a variety of people needed this feature and that, and there was no infrastructure that would allow implementing the features as fully orthogonal features that could be applied to any file. Reiser4 will create that infrastructure. List Of Features Needed To Get Attribute And Stream Functionality From Files And Directories: * api efficient for small files * efficient storage for small files * plugins, including plugins that can compress a file serving as an attribute into a single bit * files that also act as directories when accessed as directories * inheritance (includes file aggregation) * constraints * transactions * hidden directory entries Each of these additional features is a feature that would benefit the filesystem. So we add them in v4. == Basic Tree Concepts == === Trees, Nodes, and Items === One way of organizing information is to put it into trees. When we organize information in a computer, we typically sort it into piles (nodes we call them), and there is a name (a pointer) for each pile that the computer will be able to use to find the pile. [Figure 1. One Example Of A Tree: a height-4 (4-level), fanout-3 balanced tree. It starts with a root node, passes through 2 levels of internal nodes, and ends with the leaf nodes, which hold the data and have no children.] Some of the nodes can contain pointers, and we can go looking through the nodes to find those pointers to (usually other) nodes. We are particularly interested in how to organize so that we can find things when we search for them. A tree is an organization structure that has some useful properties for that purpose. Definition of Tree: 1. A tree is a set of nodes organized into a root node, and zero or more additional sets of nodes called subtrees. 2. Each of the subtrees is a tree. 3. No node in the tree points to the root node, and exactly one pointer from a node in the tree points to each non-root node in the tree. 4.
The root node has a pointer to each of its subtrees; that is, a pointer to the root node of the subtree. ==== Fine Points of the Definition ==== [Figure 2. The simplest tree: the absolutely most trivial of all graphs, a single, isolated node.] [Figure 3. A trivial, linear tree: a trivial, connected, linear (unary) graph, a linear sequence of nodes connected by paths (edges, pointers).] It is interesting to argue over whether finite should be a part of the definition of trees. There are many ways of defining trees, and which is the best definition depends on what your purpose is. Donald Knuth (a well known author of algorithm textbooks) supplies several definitions of tree. As his primary definition of tree he even supplies one which has no pointers/edges/lines in the definition, just sets of nodes. Reiser4 uses a finite tree (the number of nodes is limited). Knuth defines trees as being finite sets of nodes. There are papers on infinite trees on the Internet. I think it more appropriate to consider finite an additional qualifier on trees, rather than bundling finite into the definition. However, I personally only deal with finite trees in my storage layer research. It is interesting to consider whether storage layers are inherently more motivated than semantic layers to limit themselves to finite trees rather than infinite trees. This is where some writers would say ".... is left as an exercise for the reader". :-) Oh the temptation.... I will remind the reader of my explanation of why storage layer trees are more motivated to be acyclic, and, at the cost of some effort at honesty, constrain myself to saying that doing more than providing that hint is beyond my level of industry. ;-) Edge is a term often used in tree definitions. A pointer is unidirectional (you can follow it from the node that has it to the node it points to, but you cannot follow it back from the node it points to to the node that has it). An edge is bidirectional (you can follow it in both directions).
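The rooted-tree definition above implies that exactly one pointer reaches each non-root node, so a rooted tree always has one less pointer than it has nodes. A small sketch checking this property on a toy tree (hypothetical names, Python rather than filesystem code):

```python
# A rooted tree as nested (label, children) tuples: a root with two subtrees,
# one of which has two leaf children of its own (5 nodes in total).
tree = ("root", [("a", [("c", []), ("d", [])]), ("b", [])])

def count_nodes_and_pointers(node):
    label, children = node
    nodes, pointers = 1, len(children)  # one pointer per immediate subtree
    for child in children:
        n, p = count_nodes_and_pointers(child)
        nodes += n
        pointers += p
    return nodes, pointers

nodes, pointers = count_nodes_and_pointers(tree)
assert nodes == 5 and pointers == 4  # one less pointer than nodes
```

The same property (edges = vertices - 1) reappears in the first of the free-tree definitions given next, stated with bidirectional edges instead of pointers.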
Here are three alternative tree definitions, which are interesting in how they are mathematically equivalent to each other, though they are not equivalent to the definition I supplied because edges are not equivalent to pointers: For all three of these definitions, let there be not more than one edge connecting the same two nodes. * a set of vertices (aka points) connected by edges (aka lines) for which the number of edges is one less than the number of vertices * or a set of vertices connected by edges which has no cycles (a cycle is a path from a vertex to itself) * or a set of vertices connected by edges for which there is exactly one path connecting any two vertices The three alternative definitions do not have a unique root in their tree, and such trees are called free trees. The definition I supplied is a definition of a rooted tree, not a free tree. It also has no cycles, it has one less pointer than it has nodes, and there is exactly one path from the root to any node. Please feel encouraged to read Knuth's writings for more discussions of these topics. Graphs vs. Trees Consider the purposes for which you might want to use a graph, and those for which you might want to use a tree. In a tree there is exactly one path from the root to each node in the tree, and a tree has the minimum number of pointers sufficient to connect all the nodes. This makes it a simple and efficient structure. Trees are useful when efficiency with minimal complexity is desired, and there is no need to reach a node by more than one route. Reiser4 has both graphs and trees, with trees used when the filesystem chooses the organization (in what we call the storage layer, which tries to be simple and efficient), and graphs when the user chooses the organization (in the semantic layer, which tries to be expressive so that the user can do whatever he wants). Ordering The Tree Aids Searching Through It Keys We assign everything stored in the tree a key.
We find things by their keys. Use of keys gives us additional flexibility in how we sort things, and if the keys are small, it gives us a compact means of specifying enough to find the thing. It also limits what information we can use for finding things. This limit restricts its usefulness, and so we have a storage layer, which finds things by keys, and a semantic layer, which has a rich naming system. The storage layer chooses keys for things solely to organize storage in a way that will improve performance, and the semantic layer understands names that have meaning to users. As you read, you might want to think about whether this is a useful separation that allows freedom in adding improvements that aid performance in the storage layer, while escaping paying a price for the side effects of those improvements on the flexible naming objectives of the semantic layer. Choosing Which Subtree We start our search at the root, because from the root we can reach every other node. How do we choose which subtree of the root to go to from the root? The root contains pointers to its subtrees. For each pointer to a subtree there is a corresponding left delimiting key. Pointers to subtrees, and the subtrees themselves, are ordered by their left delimiting key. A subtree pointer's left delimiting key is equal to the least key of the things in the subtree. Its right delimiting key is larger than the largest key in the subtree, and it is the left delimiting key of the next subtree of this node. Each subtree contains only things whose keys are at least equal to the left delimiting key of its pointer, and are not more than its right delimiting key. If there are no duplicate keys in the tree, then each subtree contains only things whose keys are less than its right delimiting key. If there are no duplicate keys, then by looking within a node at its pointers to subtrees and their delimiting keys we know what subtree of that node contains the thing we are looking for.
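The subtree choice just described amounts to a search over the sorted left delimiting keys: pick the last pointer whose left delimiting key does not exceed the search key. A hypothetical sketch (not the actual Reiser4 node search code):

```python
import bisect

def choose_subtree(left_keys, pointers, search_key):
    """left_keys[i] is the left delimiting key of pointers[i], in sorted
    order. Return the subtree pointer whose key range covers search_key."""
    i = bisect.bisect_right(left_keys, search_key) - 1
    if i < 0:
        raise KeyError("key is smaller than every left delimiting key")
    return pointers[i]
```

With left delimiting keys [0, 10, 20], a search for key 15 descends into the second subtree, and a search for key 20 (equal to a left delimiting key) into the third.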
Duplicate keys are a topic for another time. For now I will just hint that when searching through objects with duplicate keys we find the first of them in the tree, and then we search through all duplicates one-by-one until we find what we are looking for. Allowing duplicate keys can allow for smaller keys, so there is sometimes a tradeoff between key size and the average frequency of such inefficient linear searches. Using duplicate keys can also allow, if one defines one's insertion algorithms such that they always insert at the end of a set of duplicate keys, ordering objects with the same key by creation time. The contents of each node in the tree are sorted within the node. So, the entire tree is sorted by key, and for a given key we know just where to go to find at least one thing with that key. Nodes Leaves, Twigs, and Branches Leaves are nodes that have no children. Internal nodes are nodes that have children. Figure 4. A height = 4, fanout = 3, balanced tree. A search will start with the root node, the sole level 4 internal node, traverse 2 more internal nodes, and end with a leaf node which holds the data and has no children. A node that contains items is called a formatted node. If an object is large, and is not compressed and doesn't need to support efficient insertions (compressed objects are special because they need to be able to change their space usage when you write to their middles because the compression might not be equally efficient for the new data), then it can be more efficient to store it in nodes without any use of items at all. We do so by default for objects larger than 16k. Unformatted leaves (unfleaves) are leaves that contain only data, and do not contain any formatting information. Only leaves can contain unformatted data.
Pointers are stored in items, and so all internal nodes are necessarily formatted nodes. Pointers to unfleaves are different in their structure from pointers to formatted nodes. Extent pointers point to unfleaves. An extent is a sequence of unfleaves, contiguous in block number order, that all belong to the same object. An extent pointer contains the starting block number of the extent, and a length. [diagram needed] Because the extent belongs to just one object, we can store just one key for the extent, and then we can calculate the key of any byte within that extent. If the extent is at least 2 blocks long, extent pointers are more compact than regular node pointers would be. Node pointers are pointers to formatted nodes. We do not yet have a compressed version of node pointers, but they are probably soon to come. Notice how with extent pointers we don't have to store the delimiting key of each node pointed to, while with node pointers we do. We will probably introduce key compression at the same time we add compressed node pointers. One would expect keys to compress well since they are sorted into ascending order. We expect our node and item plugin infrastructure will make such features easy to add at a later date. Twigs are parents of leaves. Extent pointers exist only in twigs. This is a very controversial design decision I will discuss a bit later. Branches are internal nodes that are not twigs. You might think we would number the root level 1, but since the tree grows at the top, it turns out to be more useful to number as 1 the level with the leaves where object data is stored. The height of the tree will depend upon how many objects we have to store and what the fanout rate (average number of children) of the internal and twig nodes will be. For reasons of code simplicity, we find it easiest to implement Reiser4 such that it has a minimum height of 2, and the root is always an internal node.
There is nothing deeper than judicious laziness to this: it simplifies the code to not deal with one-node trees, and nobody cares about the waste of space. An example of a Reiser4 tree: Figure 5. This Reiser4 tree is a 4 level, balanced tree with a fanout of 3, starting with a root node, traversing branch nodes, including the internal nodes called twig nodes (a Reiser4 feature), and ending with the leaf nodes which hold the data and have no children. In practice Reiser4 fanout is much higher and varies from node to node, but a 4 level tree diagram with 16 million leaf nodes won't fit easily onto my monitor so I drew something smaller....;-) Size of Nodes We choose to make the nodes equal in size. This makes it much easier to allocate the unused space between nodes, because it will be some multiple of the node size, and there are no problems of space being free but not large enough to store a node. Also, disk drives have an interface that assumes equal size blocks, which they find convenient for their error-correction algorithms. If having the nodes be equal in size is not very important, perhaps due to the tree fitting into RAM, then using a class of algorithms called skip lists is worthy of consideration. Reiser4 nodes are usually equal to the size of a page, which if you use Gnu/Linux on an Intel CPU is currently 4096 (4k) bytes. There is no measured empirical reason to think this size is better than others, it is just the one that Gnu/Linux makes easiest and cleanest to program into the code, and we have been too busy to experiment with other sizes. Sharing Blocks Saves Space If nodes are of equal size, how do we store large objects? We chop them into pieces. We call these pieces items. Items are sized to fit within a single node. Conventional filesystems store files in whole blocks. Roughly speaking, this means that on average half a block of space is wasted per file because not all of the last block of the file is used.
If a file is much smaller than a block, then the space wasted is much larger than the file. It is not effective to store such typical database objects as addresses and phone numbers in separately named files in a conventional filesystem because it will waste more than 90% of the space in the blocks it stores them in. By putting multiple items within a single node in Reiser4, we are able to pack multiple small pieces of files into one block. Our space efficiency is roughly 94% for small files. This does not count per item formatting overhead, whose percentage of total space consumed depends on average item size, and for that reason is hard to quantify. Aligning files to 4k boundaries does have advantages for large files though. When a program wants to operate directly on file data without going through system calls to do it, it can use mmap() to make the file data part of the process's directly accessible address space. Due to some implementation details mmap() needs file data to be 4k aligned, and if the data is already 4k aligned, it makes mmap() much more efficient. In Reiser4 the current default is that files that are larger than 16k are 4k aligned. We don't yet have enough empirical data and experience to know whether 16k is the precise optimal default value for this cutoff point, but so far it seems to at least be a decent choice. Items Nodes in the tree are smaller than some of the objects they hold, and larger than some of the objects they hold, so how do we store them? One way is to pour them into items. An item is a data container that is contained entirely within a single node, and it allows us to manage space within nodes. For the default 4.0 node format, every item has a key, an offset to where in the node the item body starts, a length of the item body, and a pluginid that indicates what type of item it is. Items allow us to not have to round up to 4k the amount of space required to store an object. The Structure of an Item: an Item_Body, stored separately within the node from its Item_Head (Item_Key, Item_Offset, Item_Length, Item_Plugin_id). Types Of Items Reiser4 includes many different kinds of items designed to hold different kinds of information. * static_stat_data: holds the owner, permissions, last access time, creation time, last modification time, size, and the number of links (names) to a file. * cmpnd_dir_item: holds directory entries, and the keys of the files they link to. * extent pointers: explained above * node pointers: explained above * bodies: holds parts of files that are not large enough to be stored in unfleaves. Units We call a unit that which we must place as a whole into an item, without splitting it across multiple items. When traversing an item's contents it is often convenient to do so in units: * For body items the units are bytes. * For directory items the units are directory entries. The directory entries contain a name and a key of the file named (or at least the item plugin can pretend they do, in practice the name and key may be compressed). * For extent items the units are extents. Extent items only contain extents from the same file. * For static_stat_data the whole stat data item is one indivisible unit of fixed size. What the Default Node Formats For ReiserFS 4.0 Look Like An unformatted leaf node (unfleaf node), which is the only node without a Node_Header, has the trivial structure: ....data.... (nothing but data, with no formatting information). A formatted leaf node has the structure: Block_Head Item_Body0 Item_Body1 - - - Item_Bodyn ....Free Space....
Item_Headn - - - Item_Head1 Item_Head0 A twig node has the structure: Block_Head Item_Body0 NodePointer0 Item_Body1 ExtentPointer1 Item_Body2 NodePointer2 Item_Body3 ExtentPointer3 - - - Item_Bodyn NodePointern ....Free Space.... Item_Headn - - - Item_Head0 A branch node has the structure: Block_Head Item_Body0 NodePointer0 - - - Item_Bodyn NodePointern ........Free Space...... Item_Headn - - - Item_Head0 Tree Design Concepts Height Balancing versus Space Balancing Height Balanced Trees are trees such that each possible search path from root node to leaf node has exactly the same length (length = the number of nodes traversed from root node to leaf node). For instance the height of the tree in Figure 1 is four while the height of the left hand tree in Figure 1.3 is three and of the single node in Figure 2 is 1. The term balancing is used for several very distinct purposes in the balanced tree literature. Two of the most common are: to describe balancing the height, and to describe balancing the space usage within the nodes of the tree. These quite different definitions are unfortunately a classic source of confusion for readers of the literature. Most algorithms for accomplishing height balancing do so by only growing the tree at the top. Thus the tree never gets out of balance. Figure 6. An unbalanced tree: a 4 level tree with fanout n = 3 that has lost some nodes to deletions and needs to be rebalanced. Three principal considerations in tree design Three of the principal considerations in tree design are: * the fanout rate (see below) * the tightness of packing * the amount of shifting of items in the tree from one node to another that is performed (which creates delays due to waiting while things move around in RAM, and on disk). Fanout The fanout rate n refers to how many nodes may be pointed to by each level's nodes.
(see Figure 7) If each node can point to n nodes of the level below it, then starting from the top, the root node points to n internal nodes at the next level, each of which points to n more internal nodes at its next level, and so on... m levels of internal nodes can point to n^m leaf nodes containing items in the last level. The more you want to be able to store in the tree, the larger you have to make the fields in the key that first distinguish the objects (the objectids), and then select parts of the object (the offsets). This means your keys must be larger, which decreases fanout (unless you compress your keys, but that will wait for our next version....). Figure 7. Three 4 level, height balanced trees with fanouts n = 1, 2, and 3. The first graph is a four level tree with fanout n = 1. It has just four nodes, starts with the (red) root node, traverses the (burgundy) internal and (blue) twig nodes, and ends with the (green) leaf node which contains the data. The second tree, with 4 levels and fanout n = 2, starts with a root node, traverses 2 internal nodes, each of which points to two twig nodes (for a total of four twig nodes), and each of these points to 2 leaf nodes for a total of 8 leaf nodes. Lastly, a 4 level, fanout n = 3 tree is shown which has 1 root node, 3 internal nodes, 9 twig nodes, and 27 leaf nodes.
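The n^m relation above can also be turned around to ask how tall a tree must be for a given number of leaves. A small illustrative calculation (the large fanout value used below is hypothetical, not a measured Reiser4 figure):

```python
def internal_levels_needed(leaf_count, fanout):
    """Smallest m with fanout**m >= leaf_count: the number of internal
    levels whose pointers can address every leaf node."""
    m = 0
    while fanout ** m < leaf_count:
        m += 1
    return m
```

internal_levels_needed(27, 3) is 3, matching the 4 level, fanout n = 3 tree of Figure 7. With a hypothetical fanout of 254, three internal levels already address over 16 million leaves, which is why high fanout keeps trees shallow.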
What Are B+Trees, and Why Are They Better than B-Trees It is possible to store not just pointers and keys in internal nodes, but also to store the objects those keys correspond to in the internal nodes. This is what the original B-tree algorithms did. Then B+trees were invented in which only pointers and keys are stored in internal nodes, and all of the objects are stored at the leaf level. Figure 8. Figure 9. Warning! I found from experience that most persons who don't first deeply understand why B+trees are better than B-Trees won't later understand explanations of the advantages of putting extents on the twig level rather than using BLOBs. The same principles that make B+Trees better than B-Trees also make Reiser4 faster than using BLOBs like most databases do. So make sure this section fully digests before moving on to the next section, ok? ;-) B+Trees Have Higher Fanout Than B-Trees Fanout is increased when we put only pointers and keys in internal nodes, and don't dilute them with object data. Increased fanout increases our ability to cache all of the internal nodes because there are fewer internal nodes. Often persons respond to this by saying, "but B-trees cache objects, and caching objects is just as valuable". The answer is that, on average, it is not. Of course, discussing averages makes the discussion much harder. We need to discuss some cache design principles for a while before we can get to this. Cache Design Principles Reiser's Untie The Uncorrelated Principle of Cache Design Tying the caching of things whose usage does not strongly correlate is bad. Suppose: * you have two sets of things, A and B. * you need things from those two sets at semi-random, with there existing a tendency for some items to be needed much more frequently than others, but which items those are can shift slowly over time. * you can keep things around after you use them in a cache of limited size. * you tie the caching of every thing from A to the caching of another thing from B.
(that means, whenever you fetch something from A into the cache, you fetch its partner from B into the cache) Then this increases the amount of cache required to store everything recently accessed from A. If there is a strong correlation between the need for the two particular objects that are tied in each of the pairings, stronger than the gain from spending those cache resources on caching more members of B according to the LRU algorithm, then this might be worthwhile. If there is no such strong correlation, then it is bad. But wait, you might say, you need things from B also, so it is good that some of them were cached. Yes, you need some random subset of B. The problem is that without a correlation existing, the things from B that you need are not especially likely to be those same things from B that were tied to the things from A that were needed. This tendency to inefficiently tie things that are randomly needed exists outside the computer industry. For instance, suppose you like both popcorn and sushi, with your need for them on a particular day being random. Suppose that you like movies randomly. Suppose a theater requires you to eat only popcorn while watching the movie you randomly found optimal to watch, and not eat sushi from the restaurant on the corner while watching that movie. Is this a socially optimum system? Suppose quality is randomly distributed across all the hot dog vendors: if you can only eat the hot dog produced by the best movie displayer on a particular night that you want to watch a movie, and you aren't allowed to bring in hot dogs from outside the movie theater, is it a socially optimum system? Optimal for you? Tying the uncorrelated is a very common error in designing caches, but it is still not enough to describe why B+Trees are better. With internal nodes, we store more than one pointer per node. That means that pointers are not separately cached. 
You could well argue that pointers and the objects they point to are more strongly correlated than the different pointers. We need another cache design principle. Reiser's Maximize The Variance Principle of Cache Design If two types of things that are cached and accessed, in units that are aggregates, have different average temperatures, then segregating the two types into separate units helps caching. For balanced trees, these units of aggregates are nodes. This principle applies to the situation where it may be necessary to tie things into larger units for efficient access, and guides what things should be tied together. Suppose you have R bytes of RAM for cache, and D bytes of disk. Suppose that 80% of accesses are to the most recently used things which are stored in H (hotset) bytes of nodes. Reducing the size of H to where it is smaller than R is very important to performance. If you evenly disperse your frequently accessed data, then a larger cache is required and caching is less effective. 1. If, all else being equal, we increase the variation in temperature among all aggregates (nodes), then we increase the effectiveness of using a fast small cache. 2. If two types of things have different average temperatures (ratios of likelihood of access to size in bytes), then separating them into separate aggregates (nodes) increases the variation in temperature in the system as a whole. 3. Conclusion: If all else is equal, if two types of things cached several to an aggregate (node) have different average temperatures then segregating them into separate nodes helps caching. Pointers To Nodes Have A Higher Average Temperature Than The Nodes They Point To Pointers to nodes tend to be frequently accessed relative to the number of bytes required to cache them. Consider that you have to use the pointers for all tree traversals that reach the nodes beneath them and they are smaller than the nodes they point to. 
Putting only node pointers and delimiting keys into internal nodes concentrates the pointers. Since pointers tend to be more frequently accessed per byte of their size than items storing file bodies, a high average temperature difference exists between pointers and object data. According to the caching principles described above, segregating these two types of things with different average temperatures, pointers and object data, increases the efficiency of caching. Segregating By Temperature Directly Now you might say, well, why not segregate by actual temperature instead of by type which only correlates with temperature? We do what we can easily and effectively code, with not just temperature segregation in consideration. There are tree designs which rearrange the tree so that objects which have a higher temperature are higher in the tree than pointers with a lower temperature. The difference in average temperature between object data and pointers to nodes is so high that I don't find such designs a compelling optimization, and they add complexity. I could be wrong. If one had no compelling semantic basis for aggregating objects near each other (this is true for some applications), and if one wanted to access objects by nodes rather than individually, it would be interesting to have a node repacker sort object data into nodes by temperature. You would need to have the repacker change the keys of the objects it sorts. Perhaps someone will have us implement that for some application someday for Reiser4. BLOBs Unbalance the Tree, Reduce Segregation of Pointers and Data, and Thereby Reduce Performance BLOBs, Binary Large OBjects, are a method of storing objects larger than a node by storing pointers to nodes containing the object. These pointers are commonly stored in what is called the leaf nodes (level 1, except that the BLOBs are then sort of a basement "level B" :-\ ) of a "B*" tree. 
This is a tree that was four levels until a BLOB was inserted with a pointer from a leaf node. In this case the BLOB's blocks are all contiguous. Figure 10. A Binary Large OBject (BLOB) has been inserted with, in a leaf node, pointers to its blocks. This is what a ReiserFS V3 tree looks like. BLOBs are a significant unintentional definitional drift, albeit one accepted by the entire database community. This placement of pointers into nodes containing data is a performance problem for ReiserFS V3 which uses BLOBs (Never accept that "let's just try it my way and see and we can change it if it doesn't work" argument. It took years and a disk format change to get BLOBs out of ReiserFS, and performance suffered the whole time (if tails were turned on.)). Because the pointers to BLOBs are diluted by data, it makes caching all pointers to all nodes in RAM infeasible for typical file sets. Reiser4 returns to the classical definition of a height balanced tree in which the lengths of the paths to all leaf nodes are equal. It does not try to pretend that all of the nodes storing objects larger than a node are somehow not part of the tree even though the tree stores pointers to them. As a result, the amount of RAM required to store pointers to nodes is dramatically reduced. For typical configurations, RAM is large enough to hold all of the internal nodes. This is a Reiser4 tree with extents in the level 1 Leaf Nodes and the pointer to it in the level 2 Twig Nodes. In this case the BLOB's blocks are all contiguous. Figure 11. A Reiser4, 4 level, height balanced tree with fanout = 3 and the data that was stored in BLOBs now stored in extents in the level 1 leaf nodes and pointed to by extent pointers stored in the level 2 twig nodes. Gray and Reuter say the criterion for searching external memory is to "minimize the number of different pages along the average (or longest) search path. 
....by reducing the number of different pages for an arbitrary search path, the probability of having to read a block from disk is reduced." (1993, Transaction Processing: concepts and techniques, Morgan Kaufmann Publishers, San Francisco, CA, p.834 ...) My problem with this explanation of why the height balanced approach is effective is that it does not convey that you can get away with having a moderately unbalanced tree provided that you do not significantly increase the total number of internal nodes. In practice, most trees that are unbalanced do have significantly more internal nodes. In practice, most moderately unbalanced trees have a moderate increase in the cost of in-memory tree traversals, and an immoderate increase in the amount of IO due to the increased number of internal nodes. But if one were to put all the BLOBs together in the same location in the tree, since the amount of internal nodes would not significantly increase, the performance penalty for having them on a lower level of the tree than all other leaf nodes would not be a significant additional IO cost. There would be a moderate increase in that part of the tree traversal time cost which is dependent on RAM speed, but this would not be so critical. Segregating BLOBs could perhaps substantially recover the performance lost by architects not noticing the drift in the definition of height balancing for trees. It might be undesirable to segregate objects by their size rather than just their semantics though. Perhaps someday someone will try it and see what results. Dancing Trees Are Faster Than Balanced Trees Balanced trees have traditionally employed a fixed criterion for determining whether nodes should be squeezed together into fewer nodes so as to save space. This criterion is traditionally satisfied at the end of every modification to the tree.
A typical such criterion is to guarantee that after each modification to the tree the modified node cannot be squeezed together with its left and right neighbor into two or fewer nodes. ReiserFS V3 uses that criterion for its leaf nodes. The more neighboring nodes you consider for squeezing into one fewer nodes, the more memory bandwidth you consume on average per modification to the tree, and the more likely you are to need to read those nodes because they are not in memory. It is a typical pattern in memory management algorithm design that the more tightly packed memory is kept, the more overhead is added to the cost of changing what is stored where in it. This overhead can be significant enough that some commercial databases actually only delete nodes when they are completely empty, and they feel that in practice this works well. Trees that adhere to fixed space usage balancing criteria can have many things rigorously proven about their worst case performance in publishable papers. This is different from their being optimal. An algorithm can have worse bounds on its theoretical worst case performance and be a better algorithm. Just because one cannot rigorously define average usage patterns does not mean they are the slightest bit less important. Sorry mere mortal mathematicians, that is life. Maybe some might prefer to think about the questions that they can define and answer rigorously, but this does not in the slightest make them the right questions. Yes, I am a chaotic.... In Reiser4 we employ not balanced trees, but dancing trees. Dancing trees merge insufficiently full nodes, not with every modification to the tree, but instead: * in response to memory pressure triggering a flush to disk, * as a result of a transaction closure flushing nodes to disk If It Is In RAM, Dirty, and Contiguous, Then Squeeze It ALL Together Just Before Writing Let a slum be defined as a sequence of nodes that are contiguous in the tree order and dirty in this transaction.
(In simpler words, a bunch of dirty nodes that are right next to each other.) A dancing tree responds to memory pressure by squeezing and flushing slums. It is possible that merely squeezing a slum might free up enough space that flushing is unnecessary, but the current implementation of Reiser4 always flushes the slums it squeezes. This is not necessarily the right approach, but we found it simpler and good enough for now. Another simplification we choose to engage in for now is that instead of trying to estimate whether squeezing a slum will save space before squeezing it, we just squeeze it and see. Balanced trees have an inherent tradeoff between balancing cost and space efficiency. If they consider more neighboring nodes, for the purpose of merging them to save a node, with every change to the tree, then they can pack the tree more tightly at the cost of moving more data with every change to the tree. By contrast, with a dancing tree, you simply take a large slum, shove everything in it as far to the left as it will go, and then free all the nodes in the slum that are left with nothing remaining in them, at the time of committing the slum's contents to disk in response to memory pressure. This gives you extreme space efficiency when slums are large, at a cost in data movement that is lower than it would be with an invariant balancing criterion because it is done less often. By compressing at the time one flushes to disk, one compresses less often, and that means one can afford to do it more thoroughly. By compressing dirty nodes that are in memory, one avoids performing additional I/O as a result of balancing. Procrastination Leads To Wiser Decisions: Allocate on Flush ReiserFS V3 assigns block numbers to nodes as it creates them. XFS is smarter, they wait until the last moment just before writing nodes to disk. I'd like to thank the XFS team for making an effort to ensure that I understood the merits of their approach. 
The easy way to see its merits is to consider a file that is deleted before it reaches disk. Such a file should have no effect on the disk layout.

= Reiser4: The Atomic Filesystem =

== Reducing The Damage of Crashing ==

When a computer crashes, data in RAM which has not reached disk is lost. You might at first be tempted to think that we then want to keep all of the data that did reach disk. Suppose that you were performing a transfer of $10 from bank account A to bank account B, and this consisted of two operations: 1) debit $10 from A, and 2) credit $10 to B. Suppose that 1) but not 2) reached disk and the computer crashed. It would be better to disregard 1) than to let 1) but not 2) take effect, yes?

When there is a set of operations which we ensure will all take effect, or none take effect, we call the set as a whole an atom. Reiser4 implements all of its filesystem system calls (requests to the kernel to do something are called system calls) as fully atomic operations, and allows one to define new atomic operations using its plugin infrastructure. Why don't all filesystems do this? Performance. Reiser4 employs new algorithms that allow it to make these operations atomic at little additional cost, where other filesystems have paid a heavy, usually prohibitive, price to do that. We hope to share with you how that is done.

== A Brief History Of How Filesystems Have Handled Crashes ==

=== Filesystem Checkers ===

Originally filesystems had filesystem checkers that would run after every crash. The problem with that was that 1) the checkers cannot handle every form of damage well, and 2) the checkers run for a long time.
The amount of data stored on hard drives increased faster than the transfer rate (the rate at which hard drives transfer their data from the platter spinning inside them into the computer's RAM when they are asked to do one large continuous read, or the rate in the other direction for writes), which means that the checkers took longer to run, and as the decades ticked by it became less and less reasonable for a mission critical server to wait for the checker.

=== Fixed Location Journaling ===

A solution was adopted of first writing each atomic operation to a location on disk called the journal or log, and then, only after each atom had fully reached the journal, writing it to the committed area of the filesystem. The problem with this is that twice as much data needs to be written. On the one hand, if the workload is dominated by seeks, this is not as much of a burden as one might think. On the other hand, for writes of large files, it halves performance, because such writes are usually transfer time dominated.

For this reason, meta-data journaling came to dominate general purpose usage. With meta-data journaling, the filesystem guarantees that all of its operations on its meta-data will be done atomically. If a file is being written to, the data being written to that file may be corrupted as a result of non-atomic data operations, but the filesystem's internals will all be consistent. The performance advantage was substantial. V3 of ReiserFS offers both meta-data and data journaling, and defaults to meta-data journaling because that is the right solution for most users. Oddly enough, meta-data journaling is much more work to implement, because it requires being precise about what needs to be journaled. As is so often the case in programming, doing less work requires more code.
With fixed location data journaling, the overhead of making each operation atomic is too high for it to be appropriate for average applications that don't especially need it, because of the cost of writing twice. Applications that do need atomicity are written to use fsync and rename to accomplish it, and these tools are simply terrible for that job: terrible in performance, and terrible in the ugliness they add to the coding of applications. Stuffing a transaction into a single file just because you need the transaction to be atomic is hardly what one would call flexible semantics. Also, data journaling, with all its performance cost, still does not necessarily guarantee that every system call is fully atomic, much less that one can construct sets of operations that are fully atomic. It usually merely guarantees that the files will not contain random garbage, however many blocks of them happen to get written, and however much the application might view the result as inconsistent data. I hope you understand that we are trying to set a new expectation here for how secure a filesystem should keep your data when we provide these atomicity guarantees.

=== Wandering Logs ===

One way to avoid having to write the data twice is to change one's definition of where the log area and the committed area are, instead of moving the data from the log to the committed area. There is an annoying complication to this, though, in that there are probably a number of pointers to the data from the rest of the filesystem, and we need them to point to the new data. When the commit occurs, we need to write those pointers so that they point to the data we are committing. Fortunately, these pointers tend to be highly concentrated as a result of our tree design. But wait: if we are going to update those pointers, then we want to commit those pointers atomically also, which we could do if we write them to another location and update the pointers to them, and....
up the tree the changes ripple. When we get to the top of the tree, since disk drives write sectors atomically, the block number of the top can be written atomically into the superblock by the disk, thereby committing everything the new top points to. This is indeed the way WAFL, the Write Anywhere File Layout filesystem invented by Dave Hitz at Network Appliance, works. It always ripples changes all the way to the top, and indeed that works rather well in practice; most of their users are quite happy with its performance.

=== Writing Twice May Be Optimal Sometimes ===

Suppose that a file is currently well laid out, you write to a single block in the middle of it, and you then expect to do many reads of the file. That is an extreme case illustrating that sometimes it is worth writing twice so that a block can keep its current location while committing atomically. If one writes a node twice in this way, one also does not need to update its parent and ripple all the way to the top of the tree. Our code is a toolkit that can be used to implement different layout policies, and one of the available choices is whether to write over a block in its current place, or to relocate it to somewhere else. I don't think there is one right answer for all usage patterns. If a block is adjacent to many other dirty blocks in the tree, then this decreases the significance of the cost to read performance of relocating it and its neighbors. If one knows that a repacker will run once a week (a repacker is expected for V4.1, and is, a bit oddly, absent from WAFL), this also decreases the cost of relocation. After a few years of experimentation, measurement, and user feedback, we will say more about our experiences in constructing user selectable policies.

Do we pay a performance penalty for making Reiser4 atomic? Yes, we do. Is it an acceptable penalty?
We picked up a lot more performance from other improvements in Reiser4 than we lost to atomicity, so the cost of atomicity is not isolated in our measurements, but I am unscientifically confident that the answer is yes. If changes are either large or batched together with enough other changes to become large, the performance penalty is low and drowned out by other performance improvements. Scattered small changes threaten us with read performance losses compared to overwriting in place and taking our chances with the data's consistency if there is a crash, but use of a repacker will mostly alleviate this scenario. I have to say that in my heart I don't have any serious doubts that for the general purpose user the increase in data security is worthwhile. The users, though, will have the final say.

== Committing ==

A transaction preserves the previous contents of all modified blocks in their original location on disk until the transaction commits; commit means the transaction has reached a state where it will be completed even if there is a crash. The dirty blocks of an atom (which were captured and subsequently modified) are divided into two sets, relocate and overwrite, each of which is preserved in a different manner. The relocatable set is the set of blocks that have a dirty parent in the atom. The relocate set is those members of the relocatable set that will be written to a new or first location rather than overwritten. The overwrite set contains all dirty blocks in the atom that need to be written to their original locations, which is all those not in the relocate set. In practice this is those which do not have a parent we want to dirty, plus those for which overwrite is the better layout policy despite the write twice cost. Note that the superblock is the parent of the root node, and the free space bitmap blocks have no parent. By these definitions, the superblock and modified bitmap blocks are always part of the overwrite set.
The wandered set is the set of blocks that the overwrite set will be written to temporarily until the overwrite set commits. An interesting definition is the minimum overwrite set, which uses the same definitions as above with the following modification: if at least two dirty blocks have a common parent that is clean, then that parent is added to the minimum overwrite set, and its dirty children are removed from the overwrite set and placed in the relocate set. This policy is an example of what will be experimented with in later versions of Reiser4 using the layout toolkit. For space reasons, we leave out the full details on exactly when we relocate vs. overwrite, and the reader should not regret this, because years of experimenting are probably ahead before we can speak with the authority necessary for a published paper on the effects of the many details and variations possible.

When we commit, we write a wander list, which consists of a mapping from the wandered set to the overwrite set. The wander list is a linked list of blocks containing pairs of block numbers. The last act of committing a transaction is to update the super block to point to the front of that list. Once that is done, if there is a crash, the crash recovery will go through that list and "play" it, which means to write the wandered set over the overwrite set. If there is not a crash, we will also play it. There are many more details of how we handle the deallocation of wandered blocks, the handling of bitmap blocks, and so forth. You are encouraged to read the comments at the top of our source code files (e.g. wander.c) for such details....

== Journalling optimizations ==

=== Copy-on-capture ===

Suppose one wants to capture a node which belongs to an atom with stage >= ASTAGE_PRE_COMMIT. Such a capture request would have to wait (sleep in capture_fuse_wait()) while the atom is being committed. The copy-on-capture optimization allows the capture request to be satisfied by creating a copy of the node which is being captured.
The commit process takes control of one copy of the node, and the capturing process takes control of the other copy. This does not lead to any node version conflicts, because it is guaranteed that the copy held by the commit process will not be modified.

=== Steal-on-capture ===

The idea of the steal-on-capture optimization is that only the last committed transaction to modify an overwrite block actually needs to write that block; other transactions can skip writing that block after they commit. This optimization, which is also present in ReiserFS version 3, means that frequently modified overwrite blocks will be written less than two times per transaction. With this optimization a frequently modified overwrite block may avoid being overwritten by a series of atoms; as a result, crash recovery must replay more atoms than without the optimization. If an atom has overwrite blocks stolen, the atom must be replayed during crash recovery until every stealing atom commits.

== Repacker ==

Another way of escaping from the balancing time vs. space efficiency tradeoff is to use a repacker. 80% of files on the disk remain unchanged for long periods of time. It is efficient to pack them perfectly, by using a repacker that runs much less often than every write to disk. This repacker goes through the entire tree ordering, from left to right and then from right to left, alternating each time it runs. When it goes from left to right in the tree ordering, it shoves everything as far to the left as it will go, and when it goes from right to left it shoves everything as far to the right as it will go. (Left means small in key or in block number:-) ). In the absence of FS activity, the effect of this over time is to sort by tree order (defragment), and to pack with perfect efficiency. Reiser4.1 will modify the repacker to insert controlled "air holes", as it is well known that insertion efficiency is harmed by overly tight packing.
I hypothesize that it is more efficient to periodically run a repacker that systematically repacks using large IOs than to perform lots of 1 block reads of neighboring nodes of the modification points so as to preserve a balancing invariant in the face of poorly localized modifications to the tree.

= Plugins =

== 8 Kinds of Plugins Make Reiser4 The Most Tweakable Filesystem Going ==

=== File Plugins ===

Every file possesses a plugin id, and every directory possesses a plugin id. This plugin id identifies a set of methods. The set of methods embodies all of the different possible interactions with the file or directory that come from sources external to ReiserFS. It is a layer of indirection added between the external interface to ReiserFS and the rest of ReiserFS. Each method has a methodid. It will be usual to mix and match methods from other plugins when composing plugins.

=== Directory Plugins ===

Reiser4 will implement a plugin for traditional directories. It will implement directory style access to file attributes as part of the plugin for regular files. Later we will describe why this is useful. Other directory plugins we will leave for later versions. There is no deep reason for this deferral. It is simply the randomness of what features attract sponsors and make it into a release specification; there are no sponsors at the moment for additional directory plugins. I have no doubt that they will appear later; new directory plugins will be too much fun to miss out on.:-)

=== Hash Plugins ===

A directory is a mapping from file names to files. This mapping is implemented through the Reiser4 internal balanced tree. Unfortunately, file names cannot be used as keys until keys of variable length are implemented, unless unreasonable limitations on maximal file name length are imposed. To work around this, the file name is hashed, and the hash is used as a key in the tree. No hash function is perfect, and there will always be hash collisions, that is, file names having the same value of a hash.
Previous versions of ReiserFS (3.5 and 3.6) used a "generation counter" to overcome this problem: keys for file names having the same hash value were distinguished by having different generation counters. This allowed hash collisions to be handled, at the cost of reducing the number of bits used for hashing. This "generation counter" technique is actually an ad hoc form of support for non-unique keys. Keeping in mind that some form of this has to be implemented anyway, it seemed justifiable to implement more regular support for non-unique keys in Reiser4.

Another reason for using hashes is that some (arguably brain-dead) interfaces require them: telldir(3) and seekdir(3). These functions presume that the file system can issue 64 bit "cookies" that can be used to resume a readdir. Cookies are implemented in most filesystems as byte offsets within a directory (which means they cannot shrink directories), and in ReiserFS as hashes of file names plus a generation counter. Curiously enough, the Single UNIX Specification tags telldir(3) and seekdir(3) as "Extension", because "returning to a given point in a directory is quite difficult to describe formally, in spite of its intuitive appeal, when systems that use B-trees, hashing functions, or other similar mechanisms to order their directories are considered".

We order directory entries in ReiserFS by their cookies. This costs us performance compared to ordering lexicographically (but is immensely faster than the linear searching employed by most other Unix filesystems). Depending on the hash and its match to the application usage pattern, there may be more or less performance loss. Hash plugins will probably remain until version 5 or so, when directory plugins and ordering function plugins will obsolete them. Directory entries will then be ordered by file names like they should be (and possibly stem compressed as well).

=== Security Plugins ===

Security plugins handle all security checks.
They are normally invoked by file and directory plugins. Example of reading a file:

* Access the pluginid for the file.
* Invoke the read method for the plugin.
* The read method determines the security plugin for the file.
* That security plugin invokes its read check method for determining whether to permit the read.
* The read check method for the security plugin reads file/attributes containing the permissions on the file.
* Since file/attributes are also files, this means invoking the plugin for reading the file/attribute.
* The pluginid for this particular file/attribute for this file happens to be inherited (saving space and centralizing control of it).
* The read method for the file/attribute is coded such that it does not check permissions when called by a security plugin method. (Endless recursion is thereby avoided.)
* The file/attribute plugin employs a decompression algorithm specially designed for efficient decompression of our encoding of ACLs.
* The security plugin determines that the read should be permitted.
* The read method continues and completes.

=== Item Plugins ===

The balancing code will be able to balance an item iff it has an item plugin implemented for it. The item plugin will implement each of the methods the balancing code needs (methods such as splitting items, estimating how large the split pieces will be, overwriting, appending to, cutting from, or inserting into the item, etc.). In addition to all of the balancing operations, item plugins will also implement intra-item search plugins. V3 of ReiserFS understood the structure of the items it balanced. This made adding new types of items, storing such new security attributes as other researchers might develop, too expensive in coding time, greatly inhibiting the addition of them to ReiserFS.
In writing Reiser4 we hoped that there would be a great proliferation in the types of security attributes in ReiserFS if we made adding them a matter not of modifying the balancing code by our most experienced programmers, but of writing an item handler. This is necessary if we are to achieve our goal of making the adding of each new security attribute an order of magnitude or more easier to perform than it is now.

=== Key Assignment Plugins ===

When assigning the key to an item, the key assignment plugin is invoked, and it has a key assignment method for each item type. A single key assignment plugin is defined for the whole FS at FS creation time. We know from experience that there is no "correct" key assignment policy; squid has very different needs from average user home directories. Yes, there could be value in varying it more flexibly than just at FS creation time, but we have to draw the line somewhere when deciding what goes into each release....

=== Node Search and Item Search Plugins ===

Every node layout has a search method for that layout, and every item that is searched through has a search method for that item. (When doing searches, we search through a node to find an item, and then search within the item for those items that contain multiple things to find.)

=== Putting Your New Plugin To Work Will Mean Recompiling ===

If you want to add a new plugin, we think your having to ask the sysadmin to recompile the kernel with your new plugin added to it will be acceptable for version 4.0. We will initially code plugin-id lookup as an in-kernel fixed length array lookup, methodids as function pointers, and make no provision for post-compilation loading of plugins. Performance, and coding cost, motivates this.
== Without Plugins We Will Drown ==

People often ask: as ReiserFS grows in features, how will we keep the design from being drowned under the weight of the added complexity, and from reaching the point where it is difficult to work on the code? The infrastructure to support security attributes implemented as files also enables lots of features not necessarily security related. The plugins we are choosing to implement in v4.0 are all security related because of our funding source, but users will add other sorts of plugins, just as they took DARPA's TCP/IP and used it for non-military computers. Only by requiring that all features be implemented in the manner that maximizes code reuse will we keep ReiserFS coding complexity down to where we can manage it over the long term.

== Plugins: FS Programming For The Lazy ==

Most plugins will have only a very few of their features unique to them, and the rest of the plugin will be reused code. What Namesys sees as its role as a DARPA contractor is not primarily supplying a suite of security plugins, though we are doing that, but creating an architectural (not just the license) enabling of lots of outside vendors to efficiently create lots of innovative security plugins that Namesys would never have imagined if working by itself.

= Enhancing Security =

By far most casualties in wars have always been civilians. In future information infrastructure attacks, who will take more damage, civilian or military installations? DARPA is funding us to make all Gnu/Linux computers throughout the world a little bit more resistant to attack.

== Fine Graining Security ==

=== Good Security Requires Precision In Specification Of Security ===

Suppose you have a large file with many components. A general principle of security is that good security requires precision of permissions.
When security lacks precision, it increases the burden of being secure; the extent to which users adhere to security requirements in practice is a function of the burden of adhering to them.

=== Space Efficiency Concerns Motivate Imprecise Security ===

Many filesystems make it space-inefficient to store small components as separate files, for various reasons. Not being separate files means that they cannot have separate permissions. One of the reasons for using overly aggregated units of security is thus space efficiency. ReiserFS currently improves this by an order of magnitude over most of the existing alternative art. Space efficiency is the hardest of the reasons to eliminate; its elimination makes it that much more enticing to attempt to eliminate the other reasons.

=== Security Definition Units And Data Access Patterns Sometimes Inherently Don't Align ===

Applications sometimes want to operate on a collection of components as a single aggregated stream. (Note that commonly two different applications want to operate on data with different levels of aggregation; the infrastructure for solving this security issue will solve that problem as well.)

=== /etc/passwd As Example ===

I am going to use the /etc/passwd file as an example, not because I think that other solutions won't solve its problems better, but because its implementation as a single flat file in the early Unixes is a wonderful illustrative example of poorly granularized security, one whose problems the readers may have experienced personally, as I have. I hope they will be able to imagine that other, less famous data files could have similar problems. Have you ever tried to figure out just exactly what part of your continually changing /etc/passwd file changed near the time of a break-in? Have you ever wished that you could have a modification time on each field in it?
Have you ever wished that users could change part of it, such as the gecos field, themselves (setuid utilities have been written to allow this, but this is a pedagogical, not a practical, example), but not have the power to change it for other users? There were good reasons why /etc/passwd was first implemented as a single file with one single permission governing the entire file. If we can eliminate them one by one, the same techniques for making finer grained security effective will be of value to other highly secure data files.

=== Aggregating Files Can Improve The User Interface To Them ===

Consider the use of emacs on a collection of a thousand small 8-32 byte files, like you might have if you deconstructed /etc/passwd into small files with separable acls for every field. It is more convenient in screen real estate, buffer management, and other user interface considerations to operate on them as an aggregation all placed into a single buffer, rather than as a thousand 8-32 byte buffers.

=== How Do We Write Modifications To An Aggregation ===

Suppose we create a plugin that aggregates all of the files in a directory into a single stream. How does one handle writes to that aggregation that change the length of the components of that aggregation? Richard Stallman pointed out to me that if we separate the aggregated files with delimiters, then emacs need not be changed at all to acquire an effective interface for large numbers of small files accessed via an aggregation plugin. If /new_syntax_access_path/big_directory_of_small_files/.glued is a plugin that aggregates every file in big_directory_of_small_files with a delimiter separating every file within the aggregation, then one can simply type emacs /new_syntax_access_path/big_directory_of_small_files/.glued, and the filesystem has done all the work emacs needs to be effective at this. Not a line of emacs needs to be changed.
One needs to be able to choose different delimiting syntax for different aggregation plugins, so that one can, for say the passwd file, aggregate subdirectories into lines, and files within those subdirectories into colon separated fields within the line. XML would benefit from yet other delimiter construction rules. (We have been told by Philipp Guehring of LivingXML.NET that ReiserFS is higher performance than any database for storing XML, so this issue is not purely theoretical.)

=== Aggregation Is Best Implemented As Inheritance ===

In summary, to be able to achieve precision in security we need to have inheritance with specifiable delimiters, and we need whole file inheritance to support ACLs.

=== One Plugin Using Delimiters That Resemble sys_reiser4() Syntax ===

We provide the infrastructure for your constructing plugins that implement arbitrary processing of writes to inheriting files, but we also supply one generic inheriting file plugin that intentionally uses delimiters very close to the sys_reiser4() syntax. We will document the syntax more fully when that code is working; for now, syntax details are in the comments in the file invert.c in the source code.

== API Suitable For Accessing Files That Store Security Attributes ==

A new system call sys_reiser4() will be implemented to support applications that don't have to be fooled into thinking that they are using POSIX. Through this entry point a richer set of semantics will access the same files that are also accessible using POSIX calls. Reiser4() will not implement more than hierarchical names. A full set theoretic naming system as described on our future vision page will not be implemented before SSN Reiserfs is implemented (Distributed Reiserfs is our distributed filesystem, Semi-Structured Naming Reiserfs is our enhanced semantics; whether we implement Distributed Reiserfs or SSN Reiserfs first depends on which sponsors we find ;-) ).
Reiser4() will implement all features necessary to access ACLs as files/directories rather than as something neither file nor directory. These include opening and closing transactions, performing a sequence of I/Os in one system call, and accessing files without use of file descriptors (necessary for efficient small I/O). Reiser4() will use a syntax suitable for evolving into SSN Reiserfs syntax with its set theoretic naming.

=== Flaws In Traditional File API When Applied To Security Attributes ===

Security related attributes tend to be small. The traditional filesystem API for reading and writing files has these flaws in the context of accessing security attributes:

* Creating a file descriptor is excessive overhead, and not useful, when accessing an 8 byte attribute.
* A system call for every attribute accessed is too much overhead when accessing lots of little attributes.
* Lacking constraints: it is important to constrain what is written to the attribute, often in complex ways.
* Lacking atomic semantics: often one needs to update multiple attributes as one action that is guaranteed to either fully succeed or fully fail.

=== The Usual Resolution Of These Flaws Is A One-Off Solution ===

The usual response to these flaws is that people adding security related and other attributes create a set of methods unique to their attributes, plus non-reusable code to implement those methods, in which their particular attributes are accessed and stored not using the methods for files, but using their particular methods for that attribute. Their particular API for that attribute typically does a one-off instantiation of a lightweight single system call write constrained atomic access, with no code being reusable by those who want to modify file bodies. It is basic and crucial to system design to decompose desired functionality into reusable, orthogonal, separated components.
Persons designing security attributes are typically doing it without the filesystem that they want offering them a proper foundation and tool kit. They need more help from us core FS developers. Linus said that we can have a system call to use as our experimental plaything in this. With what I have in mind for the API, one rather flexible system call is all we want for creating atomic, lightweight, batched, constrained accesses to files, with each of those adjectives to accesses being an orthogonal optional feature that may or may not be invoked in a particular instance of the new system call.

=== One-Off Solutions Are A Lot of Work To Do A Lot Of ===

Looking at the coin from the other side, we want to make it an order of magnitude less work to add features to ReiserFS, so that both users and Namesys can add at least an order of magnitude more of them. To verify that it is truly more extensible you have to do some extending, and our DARPA funding motivates us to instantiate most of those extensions as new security features. This system call's syntax enables attributes to be implemented as a particular type of file. It avoids uglifying the semantics with two APIs for two supposedly different kinds of objects that don't truly need different treatment. All of its special features that are useful for accessing particular attributes are also available for use on files. It has symmetry, and its features have been fully orthogonalized. There is nothing particularly interesting about this system call to a languages specialist (its ideas were explored decades ago, except by filesystem developers) until SSN Reiserfs, when we will further evolve it into a set theoretic syntax that deconstructs tuple structured names into hierarchy and vicinity set intersection. That is described at www.namesys.com/whitepaper.html

== Steps For Creating A Security Attribute ==

You can create a new security attribute by:

* Defining a pluginid.
* Composing a set of methods for the plugin from ones you create or reuse from other existing plugins.
* Defining a set of items that act as the storage containers of the object, or reusing existing items from other plugins (e.g. regular files).
* Implementing item handlers for all of the new items you create.
* Creating a key assignment algorithm for all of the new items.

==== reiser4() System Call Description ====

The reiser4() system call (still being debugged at the time of writing) executes a sequence of commands separated by commas. Assignment and transaction are the commands supported in reiser4(); more commands will appear in SSN ReiserFS. <- and <<- are two of the assignment operators.

lhs (assignment target) values:

* /..../process/range/(offset<-(loff_t),last_byte<-(loff_t)) assigns (writes) to the buffer starting at address offset in the process address space, ending at last_byte. (The assignment source may be smaller or larger than the assignment target.) The representation of offset and last_byte is left to the coder to determine; it is an issue that will be of much dispute and little importance. Notice that / is used to indicate that the order of the operands matters; see the future vision whitepaper for details of why this is appropriate syntax design. Note the lack of a file descriptor.
* /filename assigns to the file named filename.
* /filename/..../range/(offset<-(loff_t),last_byte<-(loff_t)) writes to the body, starting at offset, ending not past last_byte.
* /filename/..../range/(offset<-(loff_t)) writes to the body starting at offset.

rhs (assignment source) values:

* /..../process/range/(offset<-(loff_t),last_byte<-(loff_t)) reads from the buffer starting at address offset in the process address space, ending at last_byte. The representation of offset and last_byte is left to the coder to determine, as it is an issue that will be of much dispute and little importance.
* /filename reads the entirety of the file named filename.
* /filename/..../range/(offset<-(loff_t),last_byte<-(loff_t)) reads from the body, starting at offset, ending not past last_byte.
* /filename/..../range/(offset<-(loff_t)) reads from the body starting at offset until the end.
* /filename/..../stat/owner reads from the ownership field of the stat data. (Stat data is that which is returned by the stat() system call (owner, permissions, etc.) and stored on a per file basis by the FS.)

Note that "...." and "process" are style conventions for the name of a hidden subdirectory implementing methods and accessing metadata supported by a plugin. It is possible to rename it, etc. We had a discussion about whether to instead use names that could not clash with any legitimate name likely to be used by users. Vladimir Demidov pointed out that cryptic names have historically harmed the acceptance of several languages, and so it was realized that being novice unfriendly in the naming was worse than risking a name collision, especially since a collision can be cured by using rename on "...." and "process" for the few cases where it is necessary.

==== Constraints ====

(Note: this is not yet coded.)

Another way security may be insufficiently fine grained is in values: it can be useful to allow persons to change data, but only within certain constraints. For this project we will implement plugins; one type of plugin will be write constraints. Write constraints are invoked upon write to a file; if they return non-error, then the write is allowed. We will implement two trivial sample write-constraint plugins. One will be in the form of a kernel function, loadable as a kernel module, which returns non-error (thus allowing the write) if the file contains the strings "secret" or "sensitive" but not "top-secret". The other, which does exactly the same, will be in the form of a perl program residing in a file and executed in user space.
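For illustration, the first sample write constraint can be sketched as an ordinary predicate (shown here in Python rather than as a kernel module; the function name is invented, and the "secret"/"sensitive"/"top-secret" rule is the one described above):

```python
def write_allowed(proposed: bytes) -> bool:
    """Sample write constraint: permit the write only if the proposed
    new contents mention "secret" or "sensitive" but not "top-secret"."""
    text = proposed.decode("utf-8", errors="replace")
    return ("secret" in text or "sensitive" in text) and "top-secret" not in text
```

For example, `write_allowed(b"a secret memo")` allows the write, while `write_allowed(b"top-secret plans")` rejects it (note that "top-secret" contains "secret", so the exclusion must be checked as well).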
The use of kernel functions will have performance advantages, particularly for small functions, but severe disadvantages in scripting power, flexibility, and the ability to be installed by non-secure sources. Both types of plugins will have their place. Note that ACLs will also embody write constraints.

We will implement both constraints that are compiled into the kernel and constraints that are implemented as user space processes. Specifically, we will implement a plugin that executes an arbitrary constraint contained in an arbitrarily named file as a user space process, passes the proposed new file contents to that process as standard input, and, iff the process exits without error, allows the write to occur. It can be useful to have read constraints as well as write constraints.

==== Auditing ====

(Note: this is not yet coded.)

We will implement a plugin that notifies administrators by email when access is made to files, e.g. read access. With each plugin implemented, creating additional plugins becomes easier as the available toolkit is enriched. Auditing constitutes a major additional security feature, yet it will be easy to implement once the infrastructure to support it exists. (It would be substantial work to implement it without that infrastructure.)

The scope of this project is not the creation of the plugins themselves, but the creation of the infrastructure that plugin authors would find useful. We want to enable future contributors to implement more secure systems on the GNU/Linux platform, not implement them ourselves. By laying a proper foundation and creating a toolkit for them, we hope to reduce the cost of coding new security attributes for those who follow us by an order of magnitude. Employing a proper set of well orthogonalized primitives also changes the addition of these attributes from being a complexity burden upon the architecture into being an empowering extension of the architecture.
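The user-space constraint mechanism described above (an arbitrary checker run as a process, fed the proposed contents on standard input, with the write allowed iff the process exits without error) can be modeled as follows; this is a sketch only, with the checker script an illustrative stand-in for the perl program mentioned in the text:

```python
import subprocess
import sys

# Illustrative stand-in for the constraint program stored in a file;
# it reads the proposed contents from stdin and exits 0 to allow the write.
CHECKER = """
import sys
text = sys.stdin.buffer.read().decode("utf-8", errors="replace")
ok = ("secret" in text or "sensitive" in text) and "top-secret" not in text
sys.exit(0 if ok else 1)
"""

def constrained_write(proposed: bytes) -> bool:
    """Run the user-space checker on the proposed contents;
    the write proceeds iff the checker exits without error."""
    result = subprocess.run([sys.executable, "-c", CHECKER], input=proposed)
    return result.returncode == 0
```

The design point is that the kernel side needs to know nothing about the policy itself, only the exit-status convention, so arbitrarily complex policies can live in ordinary files.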
=== Increasing the Allowed Granularity of Security ===

[Image: man holding sieve; only objects of a certain size go through.]

(This feature is not yet coded.)

Inheritance of security attributes is important to providing flexibility in their administration. We have spoken about making security more fine grained, but sometimes it needs to be larger grained. Sometimes a large number of files are logically one unit in regards to their security, and it is desirable to have a single point of control over their security. Inheritance of attributes is the mechanism for implementing that. Security administrators should have the power to choose whatever units of security they desire, without having to distort them to make them correspond to semantic units. Inheritance of file bodies using aggregation plugins allows the units of security to be smaller than files; inheritance of attributes allows them to be larger than files.

=== Encryption On Commit ===

Currently, encrypted files suffer severely in their write performance when implemented using schemes that encrypt at every write() rather than at every commit to disk. We encrypt on flush, such that a file with an encryption plugin id is encrypted not at the time of write, but at the time of flush to disk. Encryption is implemented as a special form of repacking on flush, and it occurs for any node which has its CONTAINS_ENCRYPTED_DATA state flag set.

== Conclusion ==

Reiser4 offers a dramatically better infrastructure for creating new filesystem features. Files and directories have all of the features needed to make it unnecessary for file attributes to be something different from files. The effectiveness of this new infrastructure is tested using a variety of new security features. Performance is greatly improved by the use of dancing trees, wandering logs, allocate on flush, a repacker, and encryption on commit. It was an important question whether we could increase the level of abstraction in our design without harming performance.
Reiser4 gives you BOTH the most cleanly abstracted storage AND the highest performance storage of any filesystem.

== Citations ==

* [Gray93] Jim Gray and Andreas Reuter. "Transaction Processing: Concepts and Techniques". Morgan Kaufmann Publishers, Inc., 1993. Old but good textbook on transactions. Available at http://www.mkp.com/books_catalog/catalog.asp?ISBN=1-55860-190-2
* [Hitz94] D. Hitz, J. Lau and M. Malcolm. "File system design for an NFS file server appliance". Proceedings of the 1994 USENIX Winter Technical Conference, pp. 235-246, San Francisco, CA, January 1994. Available at http://citeseer.nj.nec.com/hitz95file.html
* [TR3001] D. Hitz. "A Storage Networking Appliance". Tech. Rep. TR3001, Network Appliance, Inc., 1995. Available at http://www.netapp.com/tech_library/3001.html
* [TR3002] D. Hitz, J. Lau and M. Malcolm. "File system design for an NFS file server appliance". Tech. Rep. TR3002, Network Appliance, Inc., 1995. Available at http://www.netapp.com/tech_library/3002.html
* [Ousterh89] J. Ousterhout and F. Douglis. "Beating the I/O Bottleneck: A Case for Log-Structured File Systems". ACM Operating Systems Review, Vol. 23, No. 1, pp. 11-28, January 1989. Available at http://citeseer.nj.nec.com/ousterhout88beating.html
* [Seltzer95] M. Seltzer, K. Smith, H. Balakrishnan, J. Chang, S. McMains and V. Padmanabhan. "File System Logging versus Clustering: A Performance Comparison". Proceedings of the 1995 USENIX Technical Conference, pp. 249-264, New Orleans, LA, January 1995. Available at http://citeseer.nj.nec.com/seltzer95file.html
* [Seltzer95Supp] M. Seltzer. "LFS and FFS Supplementary Information". 1995. http://www.eecs.harvard.edu/~margo/usenix.195/
* [Ousterh93Crit] J. Ousterhout. "A Critique of Seltzer's 1993 USENIX Paper". http://www.eecs.harvard.edu/~margo/usenix.195/ouster_critique1.html
* [Ousterh95Crit] J. Ousterhout. "A Critique of Seltzer's LFS Measurements". http://www.eecs.harvard.edu/~margo/usenix.195/ouster_critique2.html
* [SwD96] A. Sweeney, D. Doucette, W. Hu, C. Anderson, M. Nishimoto and G. Peck. "Scalability in the XFS File System". Proceedings of the 1996 USENIX Technical Conference, pp. 1-14, San Diego, CA, January 1996. Available at http://citeseer.nj.nec.com/sweeney96scalability.html
* [VelskiiLandis] G.M. Adel'son-Vel'skii and E.M. Landis. "An algorithm for the organization of information". Soviet Math. Doklady 3, pp. 1259-1262, 1962. This paper on AVL trees can be thought of as the founding paper of the field of storing data in trees. Those not conversant in Russian will want to read the [Lewis and Denenberg] treatment of AVL trees in its place. [Wood] contains a modern treatment of trees.
* [Apple] Inside Macintosh, Files, by Apple Computer Inc., Addison-Wesley, 1992. Employs balanced trees for filenames. It was an interesting filesystem architecture for its time in a number of ways; now its problems with internal fragmentation have become more severe as disk drives have grown larger. I look forward to the replacement they are working on.
* [Bach] Maurice J. Bach. "The Design of the Unix Operating System". Prentice-Hall Software Series, Englewood Cliffs, NJ, 1986. Superbly written but sadly dated; contains detailed descriptions of the filesystem routines and interfaces in a manner especially useful for those trying to implement a Unix compatible filesystem. See [Vahalia].
* [BLOB] R. Haskin, Raymond A. Lorie. "On Extending the Functions of a Relational Database System". SIGMOD Conference, 1982, pp. 207-212 (body of paper not on web). Reiser4 obsoletes this approach.
* [Chen] Chen, P.M., Patterson, David A. "A New Approach to I/O Performance Evaluation --- Self-Scaling I/O Benchmarks, Predicted I/O Performance". 1993 ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems; also available on Chen's web page.
* [C-FFS] Ganger, Gregory R., Kaashoek, M. Frans. "Embedded Inodes and Explicit Grouping: Exploiting Disk Bandwidth for Small Files." A very well written paper focused on 1-10k file size issues; they use some similar notions (most especially their concept of grouping compared to my packing localities). Note that they focus on the 1-10k file size range, and not the sub-1k range. The 1-10k range is the weak point in ReiserFS V3 performance. The page with a link to the postscript paper is available at http://amsterdam.lcs.mit.edu/papers/cffs.html
* [ext2fs] by Remi Card; extensive information and source code are available. Probably our toughest current competitor, it is showing its age though, and recent enhancements of it (journaling, htrees, etc.) have not been performance effective. It embodies both the strengths and weaknesses of the incrementalist approach to coding, and substantially resembles the older FFS filesystem from BSD.
* [FFS] M. McKusick, W. Joy, S. Leffler, R. Fabry. "A Fast File System for UNIX". ACM Transactions on Computer Systems, Vol. 2, No. 3, pp. 181-197, August 1984. Describes the implementation of a filesystem which employs parent directory location knowledge in determining file layout. It uses large blocks for all but the tail of files to improve I/O performance, and uses small blocks called fragments for the tails so as to reduce the cost due to internal fragmentation. Numerous other improvements are also made to what was once the state of the art. FFS remains the architectural foundation for many current block allocation filesystems, and was later bundled with the standard Unix releases. Note that unrequested serialization and the use of fragments place it at a performance disadvantage to ext2fs, though whether ext2fs is thereby made less reliable is a matter of dispute that I take no position on (Reiser4 is an atomic filesystem, which is a different level of reliability entirely). Available at http://citeseer.nj.nec.com/mckusick84fast.html
* [Ganger] Gregory R. Ganger, Yale N. Patt. "Metadata Update Performance in File Systems". (Abstract only.)
* [Gifford] Describes a filesystem enriched to have more than hierarchical semantics; he shares many goals with this author, so forgive me for thinking his work worthwhile. If I had to suggest one improvement in a sentence, I would say his semantic algebra needs closure. (Postscript only.)
* [Hitz, Dave] A rather well designed filesystem optimized for NFS and RAID in combination. Note that RAID increases the merits of write-optimization in block layout algorithms. Available at http://www.netapp.com/technology/level3/3002.html
* [Holton and Das] Holton, Mike, Das, Raj. "The XFS space manager and namespace manager use sophisticated B-Tree indexing technology to represent file location information contained inside directory files and to represent the structure of the files themselves (location of information in a file)." Note that it is still a block (extent) allocation based filesystem; no attempt is made to store the actual file contents in the tree. It is targeted at the needs of the other end of the file size usage spectrum from ReiserFS, and is an excellent design for that purpose (though most filesystems, including Reiser4, do well at writing large files, and I think it is medium-sized and smaller files where filesystems can substantively differentiate themselves). SGI has also traditionally been a leader in resisting the use of unrequested serialization of I/O. Unfortunately, the paper is a bit vague on details. Available at http://www.sgi.com/Technology/xfs-whitepaper.html
* [Howard] Howard, J.H., Kazar, M.L., Menees, S.G., Nichols, D.A., Satyanarayanan, M., Sidebotham, R.N., West, M.J. "Scale and Performance in a Distributed File System". ACM Transactions on Computer Systems, 6(1), February 1988. A classic benchmark; it was too CPU bound to effectively stress ext2fs and ReiserFS, and is no longer very effective for modern filesystems.
* [Knuth] Knuth, D.E. The Art of Computer Programming, Vol. 3 (Sorting and Searching), Addison-Wesley, Reading, MA, 1973. The earliest reference discussing trees storing records of varying length.
* [LADDIS] Wittle, Mark, and Bruce, Keith. "LADDIS: The Next Generation in NFS File Server Benchmarking". Proceedings of the Summer 1993 USENIX Conference, July 1993, pp. 111-128.
* [Lewis and Denenberg] Lewis, Harry R., Denenberg, Larry. "Data Structures & Their Algorithms". HarperCollins Publishers, NY, NY, 1991. An algorithms textbook suitable for readers wishing to learn about balanced trees and their AVL predecessors.
* [McCreight] McCreight, E.M. "Pagination of B*-trees with variable length records". Commun. ACM 20 (9), pp. 670-674, 1977. Describes algorithms for trees with variable length records.
* [McVoy and Kleiman] The implementation of write-clustering for Sun's UFS. Available at http://www.sun.ca/white-papers/ufs-cluster.html
* [OLE] "Inside OLE" by Kraig Brockschmidt; discusses Structured Storage (abstract only). Structured storage is what you get when application developers need features to better manage the storage of objects on disk by the applications they write, and the filesystem group at their company can't be bothered with them. Miserable performance, miserable semantics. Available at http://www.microsoft.com/mspress/books/abs/5-843-2b.htm
* [Ousterhout] J.K. Ousterhout, H. Da Costa, D. Harrison, J.A. Kunze, M.D. Kupfer, and J.G. Thompson. "A Trace-driven Analysis of the UNIX 4.2BSD File System". In Proceedings of the 10th Symposium on Operating Systems Principles, pages 15-24, Orcas Island, WA, December 1985.
* [NTFS] "Inside the Windows NT File System", written by Helen Custer; NTFS was architected by Tom Miller with contributions by Gary Kimura, Brian Andrew, and David Goebel. Microsoft Press, 1994. An easy to read little book. They fundamentally disagree with me on adding serialization of I/O not requested by the application programmer, and I note that the performance penalty they pay for their decision is high, especially compared with ext2fs. Their FS design is perhaps optimal for floppies and other hardware eject media beyond OS control. A less serialized, higher performance log structured architecture is described in [Rosenblum and Ousterhout]. That said, Microsoft is to be commended for recognizing the importance of attempting to optimize for small files, and leading the OS designer effort to integrate small objects into the file name space. This book is notable for not referencing the work of persons not working for Microsoft, or providing any form of proper attribution to previous authors such as [Rosenblum and Ousterhout]. Though perhaps they really didn't read any of the literature, and that explains why theirs is the worst performing filesystem in the industry....
* [Peacock] K. Peacock. "The CounterPoint Fast File System". Proceedings of the USENIX Conference, Winter 1988.
* [Pike] Rob Pike and Peter Weinberger. "The Hideous Name". USENIX Summer 1985 Conference Proceedings, p. 563, Portland, Oregon, 1985. Short, informal, and drives home why inconsistent naming schemes in an OS are detrimental. Available at http://achille.cs.bell-labs.com/cm/cs/doc/85/1-05.ps.gz. His discussion of naming in Plan 9: http://plan9.bell-labs.com/plan9/doc/names.html
* [Rosenblum and Ousterhout] M. Rosenblum and J. Ousterhout. "The Design and Implementation of a Log-Structured File System". ACM Transactions on Computer Systems, Vol. 10, No. 1, pp. 26-52, February 1992. Available at http://citeseer.nj.nec.com/rosenblum91design.html. This paper was quite influential in a number of ways on many modern filesystems, and the notion of using a cleaner may be applied to a future release of ReiserFS. There is an interesting ongoing debate over the relative merits of FFS vs. LFS architectures; the interested reader may peruse http://www.scriptics.com/people/john.ousterhout/seltzer93.html and the arguments by Margo Seltzer it links to.
* [Snyder] "tmpfs: A Virtual Memory File System" discusses a filesystem built to use swap space and intended for temporary files; due to a complete lack of disk synchronization it offers extremely high performance.
* [Vahalia] Uresh Vahalia. "Unix Kernel Internals".
* [Reiser93] Reiser, Hans T. Future Vision Whitepaper, 1984, revised 1993. Available at http://www.namesys.com/whitepaper.html.

[[category:Reiser4]] [[category:Formatting-fixes-needed]]

----

Reasons why Reiser4 is great for you:

* Reiser4 is the fastest filesystem, and here are the benchmarks.
* Reiser4 is an atomic filesystem, which means that your filesystem operations either entirely occur, or they entirely don't, and they don't corrupt due to half occurring. We do this without significant performance losses, because we invented algorithms to do it without copying the data twice.
* Reiser4 uses dancing trees, which obsolete the balanced tree algorithms used in databases (see farther down). This makes Reiser4 more space efficient than other filesystems, because we squish small files together rather than wasting space due to block alignment like they do. It also means that Reiser4 scales better than any other filesystem. Do you want a million files in a directory, and want to create them fast? No problem.
* Reiser4 is based on plugins, which means that it will attract many outside contributors, and you'll be able to upgrade to their innovations without reformatting your disk.
If you like to code, you'll really like plugins....

* Reiser4 is architected for military grade security. You'll find it is easy to audit the code, and that assertions guard the entrance to every function.

V3 of ReiserFS is used as the default filesystem for SuSE, Lindows, FTOSX, Libranet, Xandros and Yoper. We don't touch the V3 code except to fix a bug, and as a result we don't get bug reports for the current mainstream kernel version. It shipped before the other journaling filesystems for Linux, and is the most stable of them as a result of having been out the longest. We must caution that, just as Linux 2.6 is not yet as stable as Linux 2.4, it will also be some substantial time before V4 is as stable as V3.

== Table of Contents ==

# Software Engineering Based Reiser4 Design Principles
## Equal Source Code Access Is A Civil Right
## Software Libre Takes More Than A License --- It Takes A Design
## Why Limit Interactions With Objects Strictly?
# Basic Semantics
## Files
### The Software Engineering Lurking Below File Plugins
## Names and Objects
## Ordering of Name Components
## Directories
### The Unix Directory Plugin
### Some Historical Details Of Design Flaws In The Unix Directory Interface
### Directories Are Unordered
### Files That Are Also Directories
### Hidden Directory Entries
## New Security Attributes and Set Theoretic Semantic Purity
### Minimizing Number Of Primitives Is Important In Abstract Constructions
### Can We Get By Using Just Files and Directories (Composing Streams And Attributes From Files And Directories)?
### List Of Features Needed To Get Attribute And Stream Functionality From Files And Directories
# Basic Tree Concepts
## Trees, Nodes, and Items
### Definition of Tree
### Fine Points of the Definition
### Graphs vs. Trees
### Ordering The Tree Aids Searching Through It
#### Keys
#### Choosing Which Subtree
## Nodes
### Leaves, Twigs, and Branches
### Size of Nodes
### Sharing Blocks Saves Space
## Items
### The Structure of an Item
### Types Of Items
### Units
## What the Default Node Formats For ReiserFS 4.0 Look Like
# Tree Design Concepts
## Height Balancing versus Space Balancing
## Three Principal Considerations in Tree Design
## Fanout
## What Are B+Trees, and Why Are They Better than B-Trees
### B+Trees Have Higher Fanout Than B-Trees
### Cache Design Principles
#### Reiser's Untie The Uncorrelated Principle of Cache Design
#### Reiser's Maximize The Variance Principle of Cache Design
#### Pointers To Nodes Have A Higher Average Temperature Than The Nodes They Point To
#### Segregating By Temperature Directly
#### BLOBs Unbalance the Tree, Reduce Segregation of Pointers and Data, and Thereby Reduce Performance
## Dancing Trees Are Faster Than Balanced Trees
### If It Is In RAM, Dirty, and Contiguous, Then Squeeze It ALL Together Just Before Writing
### Procrastination Leads To Wiser Decisions: Allocate on Flush
# Reiser4 The Atomic Filesystem
## Reducing The Damage of Crashing
## A Brief History Of How Filesystems Have Handled Crashes
## Filesystem Checkers
## Fixed Location Journaling
## Wandering Logs
## Writing Twice May Be Optimal Sometimes
## Committing
## Journalling Optimizations
### Copy-on-capture
### Steal-on-capture
# Repacker
# Plugins
## 8 Kinds of Plugins Make Reiser4 The Most Tweakable Filesystem Going
### File Plugins
### Directory Plugins
### Hash Plugins
### Security Plugins
### Item Plugins
### Key Assignment Plugins
### Node Search and Item Search Plugins
### Putting Your New Plugin To Work Will Mean Recompiling
## Without Plugins We Will Drown
## Plugins: FS Programming For The Lazy
# Enhancing Security
## Fine Graining Security
### Good Security Requires Precision In Specification Of Security
### Space Efficiency Concerns Motivate Imprecise Security
### Security Definition Units And Data Access Patterns Sometimes Inherently Don't Align
### /etc/passwd As Example
### Aggregating Files Can Improve The User Interface To Them
### How Do We Write Modifications To An Aggregation
### Aggregation Is Best Implemented As Inheritance
### One Plugin Using Delimiters That Resemble sys_reiser4() Syntax
## API Suitable For Accessing Files That Store Security Attributes
### Flaws In Traditional File API When Applied To Security Attributes
### The Usual Resolution Of These Flaws Is A One-Off Solution
### One-Off Solutions Are A Lot of Work To Do A Lot Of
## Steps For Creating A Security Attribute
### reiser4() System Call Description
### Constraints
### Auditing
## Increasing the Allowed Granularity of Security
## Encryption On Commit
# Conclusion
# Citations

== Software Engineering Based Reiser4 Design Principles ==

=== Equal Source Code Access Is A Civil Right ===

Copyright and patent laws were invented to give you an incentive to share your knowledge with the rest of the world in return for a limited time monopoly on what you shared. That is not the way it works with software, though, because software companies are allowed to keep their source code secret, but are still given monopoly rights over their software. There is little meaningful sharing of knowledge when only binaries are shared with the world, and all the rest is kept secret. The reasons for the existence of copyright and patent laws have been forgotten, their workings have been twisted, and greed and turf defense are what remain of them. Monopoly interests have taken laws intended to promote progress in the arts and sciences, and now use them to further their own control over us by ensuring that innovations not theirs cannot enter the market for improvements to software.

Think of software objects as forming a society, not yet at the level of an AI society, but still a group of programs interacting, and choosing whether to interact, with each other. Think of social lockout, whether it be in the form of racial discrimination as in the civil rights movement, Mercantilism as happened a few centuries ago, or the endless other forms of division in human society.
Is it so surprising that this evil casts its shadow on cyberspace? Is it so surprising that our cybershadows also find ways to engage in social lockout of others? Most of the cyber-world of software lives under tyranny today. We are part of a movement to create a free cyber-world we can all participate in equally.

Namesys does not oppose copyright laws as they were invented (14 year monopolies which disclosed everything that was temporarily monopolized); it opposes copyright laws as they have been twisted. Namesys opposes unlimited time monopolies which disclose nothing and lock out all other inventors. Many others in this movement are opposed to copyright law, even the version of it in which it was first created. We feel they are not acknowledging that a trade-off is being made, and that this trade-off has value. Yet still we choose to give our software away for free, for use with software that is given away for free (e.g. GNU/Linux). Since we don't have a lot of illusions about our ability to entirely change the world, and it is amusing to sell free software, for those who do not want to disclose their software and do not want to give it away for free, we charge a license fee and let them keep their improvements to our software without sharing them. These fees help substantially in allowing us to survive as an organization. We don't make nearly as much money as we would from charging everyone for usage rights, but we do make just enough to get by, and that is important. ;-)

We don't really feel that everyone should follow our example and make their software free of charge for most users (it is too hard to survive fiscally doing this), but we do think that everyone should disclose their source code, and that no one should design their software to exclude working with other software (e.g. Microsoft's Palladium, which makes such a mockery of Athena).
=== Software Libre Takes More Than A License --- It Takes A Design ===

Making the source code available to you is not enough by itself to bring you all of the possible benefits of software libre. Many filesystems are so difficult to modify that only someone who has worked with the code for years finds it feasible to modify it, and even then small changes can take months of labor due to their ripple effects on the other code and the difficulties of dealing with disk format changes. This is why we have a plugin based architecture in Reiser4: so that it is not just possible, but easy, to improve the software.

Imagine that you were an experimental physicist who had spent his life using only the tools that were in his local hardware store. Then one day you joined a major research lab with a machine shop and a whole bunch of other physicists. All of a sudden you are not using just whatever tools the large tool companies, who have never heard of you, have made for you. You are now part of a cooperative of physicists all making your own tools, swapping tools with each other, suddenly empowered to have tools that are exactly what you want them to be, or even merely exactly what your colleagues want them to be, rather than what some big tool company, which has to do a market analysis before giving you what you want, wants them to be. That is the transition you will make when you go from version 3 to version 4 of ReiserFS. The tools your colleagues and sysadmins (your machinists) make are going to be much better for what you need.

=== Why Limit Interactions With Objects Strictly? ===

You may wonder why the design we will present is so highly structured, why every object is allowed to control what is done to it by providing a limited interface, and why we pass requests to objects to do things rather than doing things directly to the object. Surely we limit our functionality by doing so, yes? Indeed we do, but is there a reason why the price is worth paying?
Is there something that becomes crucial as complexity grows? Chaos theory offers the answer. If you disturb one thing, and disturbing that thing inherently disturbs another thing, which in turn disturbs the first thing plus maybe a whole bunch of other things, and those things all disturb the first thing again, and so on, you get what chaos theory calls a feedback loop. These loops have a marvelous tendency for the end effect of the disturbance to be incalculable, and our inability to calculate such loops is perhaps a significant aspect of our being mere mortals.

Of course, as you probably know, most programmers want to be gods, and when they are unable to know what the effect will be of a change they make to their code, they dislike this. As a result, they go to great lengths to reduce the tendency of changes to the design of one object to have ripple effects upon other objects. A vitally important way to do this is to have very strictly defined interfaces to objects, and for the designer of each object to be able to know that the interface will never be violated when he writes it. This is called "object oriented design", or "structured programming", and if used well it can do a lot to reduce a type of chaotic behavior known as bugs. ;-)

Verifying the avoidance of interactions that violate the design for an object is a key task in security auditing (inspecting the code to see if it has security holes). The expressive power of an information system is proportional not to the number of objects that get implemented for it, but instead to the number of possible effective interactions between objects in it (Reiser's Law Of Information Economics). This is similar to Adam Smith's observation that the wealth of nations is determined not by the number of their inhabitants, but by how well connected they are to each other.
He traced the development of civilization throughout history, and found a consistent correlation between connectivity via roads and waterways, and wealth. He also found a correlation between specialization and wealth, and suggested that greater trade connectivity makes greater specialization economically viable. You can think of namespaces as forming the roads and waterways that connect the components of an operating system. The cost of these connecting namespaces is influenced by the number of interfaces that they must know how to connect to. That cost is, if they are not clever to avoid it, N times N, where N is the number of interfaces, since they must write code that knows how to connect every kind to every kind. One very important way to reduce the cost of fully connective namespaces is to teach all the objects how to use the same interface, so that the namespace can connect them without adding any code to the namespace. Very commonly, objects with different interfaces are segregated into different namespaces. If you have two namespaces, one with N objects, and another with M objects, the expressive power of the objects they connect is proportional to (N times N) plus (M times M), which is less than (N plus M) times (N plus M). Try it on a calculator for some arbitrary N and M. Usually the cost of inventing the namespaces is much less than the cost of the users creating all the objects. This is what makes namespaces so exciting to work with: you can have an enormous impact on the productivity of the whole system just by being a bit fanatical in insisting on simplicity and consistency in a few areas. Please remember this analysis later when we describe why we implement everything to support a "file" or "directory" interface, and why we aren't eager to support objects with unnecessarily different namespaces/interfaces --- such as "attributes" that cannot interact with files in all the same ways that files can interact with files. 
Basic Semantics To interact with an object you name it, and you say what you want it to do. The filesystem takes the name you give, and looks through things we call directories to find the object, and then gives the object your request to do something. Files character holding an object that looks like a sequence A file is something that tries to look like a sequence of bytes. You can read the bytes, and write the bytes. You can specify what byte to start to read/write from (the offset), and the number of bytes to read/write (the count). [Diagram needed]. You can also cut bytes off of the end of the file. character sawing off end of file Cutting bytes out of the middle or the beginning of a file, and inserting bytes into the middle of a file, are not permitted by any of our current file plugins, all of which implement fairly ancient Unix file semantics, but this is likely to change someday. The Software Engineering Lurking Below File Plugins Your interactions with a file are handled by the file's "plugin". These interactions are structured (in programming, such structures are generally called "interfaces") into a set of limited and defined interactions. (We are too lazy to perform the infinite work of programming plugins to handle infinite types of interactions.) Each way you can interact with a plugin is called a "method". A plugin is composed as a set of such methods. Among programmers, laziness is considered the highest art form, and we do our best to express our souls in this art. This is why we have layers and layers of laziness built into our plugin architecture. Each method is composed from a library of functions we thought would be useful in constructing plugin methods. 
Each plugin is composed from a library of methods used by plugins, and a plugin can be considered a one-to-one mapping (that's where you have two sets of things, and for every member of one set, you specify a member of the other set as its match) of every way of interacting with the plugin to a method handling it. For every file, there is a file pluginid. Whenever you attempt to interact with a file, we take the name of the file, find the pluginid for the file, and inside the kernel we have an array of plugins [diagram needed that is suitable for persons who don't know what an array or offset is], and we use the pluginid as the offset of that file's plugin within that array. (An offset is a position relative to something else, and in programming it is typically measured in bytes.) This implies that when you invent a new file plugin, you have to recompile (Programmers don't actually write programs, they got too lazy for that long ago, instead they write instructions for the computer on how to write the program, and when the computer follows these instructions ("source code"), it is called "compiling", which programmers usually pretend was done by them when they speak about it, as in "I recompiled the kernel for my exact CPU this time, and now playing pong is noticeably faster.".) the kernel, and you can only add plugins to the end of the list, and you can never reuse or change pluginids for a plugin, or else you will have to go through the whole filesystem changing all of the pluginids that are no longer correct. Someday in a later version we will revise this so that plugins are "dynamically loadable" (which is when you can add something to a program while it is running), and you can add support for new plugins to a running kernel. When we do that we will carefully benchmark and ensure that there is no loss of performance (or we won't do it) from using dynamic loading. 
Programs are often "layered", which is when the program is divided into layers, and each layer only talks to the layer immediately above it, or immediately below it, and never talks to a part of the program two levels below it, etc. This reduces the complexity of the interfaces for the various parts of the program, and most of the complexity of a program is in coding its interfaces. characters each communicating with adjacent characters only Reiser4 has a "semantic layer", and this semantic layer concerns itself with naming objects and specifying what to do to the objects, and doesn't concern itself with such things as how to pack objects into particular places on disk or in the tree. An IO to a file may affect more than one physical sequence of bytes, or no physical sequence of bytes, it may affect the sequences of bytes offered by other files to the semantic layer, and the file plugin may invoke other plugins and delegate work to them, but its interface is structured for offering the caller the ability to read and/or write what the caller sees as being a single sequence of bytes. Appearances are what is wanted. When we say that security attributes are implemented as files, we mean that security attributes look like a sequence of bytes, but the security attributes may be stored in some compressed form that perhaps might be of fixed length, or even be just a single bit. For the filesystem to offer the benefits of simplicity it need merely provide a uniform appearance that all things it stores are sequences of bytes, and there is nothing to prevent it from gaining efficiency through using many different storage implementations to offer this uniform appearance. For many files it is valuable for them to support efficient tree traversal to any offset in the sequence of bytes. It is not required though, and Unix/Gnu/Linux has traditionally supported some types of files which could not do this. 
A pipe will allow you take the output of one command, and connect it to the input of another command, and each of the commands will see the pipe as a file. This pipe is an example of a file for which you cannot simply jump to the middle of the file efficiently but instead you must go through it from beginning to end in sequential order. Names and Objects A name is a means of selecting an object. An object is anything that acts as though it is a single unified entity. What is an object is context dependent. For instance, if you tell an object to delete itself, many distinctly named entities (that are distinct objects in other ways such as reading) might well disappear as though they are a single object in response to the delete request. A namespace is a mapping of names to objects. Filesystems, databases, search engines, environment variable names within shells, are all examples of namespaces. The early papers using the term tended to seek to convey that namespaces have commonality in their structure, are not fundamentally different, should be based on common design principles, and should be unified. Such unification is a bit of a quest for a holy grail. In British mythology King Arthur sent his knights out on a quest for the holy grail, and if only they could become worthy of it, it would appear to them. None of them found it, and yet the quest made them what they became. Namespaces will never be unified, but the closer we can come to it, the more expressive power the OS will have. Reiser4 seeks to create a storage layer effective for such an eventually unified namespace, and gives it a semantic layer with some minor advantages over the state of the art. Later versions will add more and more expressive semantics to the storage layer. Finding objects is layered. The semantic layer takes names and converts them into keys (we call this "resolving" the name). 
The storage layer (which contains the tree traversing code) takes keys and finds the bytes that store the parts of the object. Keys are the fundamental name used by the Reiser4 tree. They are the name that the storage layer at the bottom of it all understands. They can be used to find anything in the tree, not just whole objects, but parts of objects as well. Everything in the tree has exactly one key. Duplicate keys are allowed, but their use usually means that all duplicates must be examined to see if they really contain what is sought, and so duplicates are usually rare if high performance is desired. Allowing duplicates can allow keys to be more compact in some circumstances (e.g. hashed directory entries). An objectid cannot be used for finding an object, only keys can. Objectids are used to compose keys so as to ensure that keys are unique. Ordering of Name Components When designing the naming system described in the future vision whitepaper I broke names from human and computer languages into their pieces, and then looked at their pieces to see which ones differed from each other in meaningful ways vs. which pieces were different expressions that provided the same functionality. (In more formal language, I would say that I systematically decomposed the ways of naming things that we use in human and computer languages into orthogonal primitives, and then determined their equivalence classes.) I then selected one way of expression from each set of ways that provided equivalent functionality. (Since that whitepaper is focused on what is not yet implemented, the whitepaper does not list all of the equivalence classes for names, but instead describes those which I thought I could say something interesting to the reader about. For instance, the NOT operator is simply unmentioned in it, as I really have nothing interesting to say about NOT, though it is very useful and will be documented when implemented.) 
The ordering of two components of a name either has meaning, or it does not. If the resolution of one component of the name depends on what is named by another component, then that pair of name components forms a hierarchical name. Hierarchy can be indicated by means other than ordering. Many human languages indicate structure by use of suffix or tag mechanisms (e.g. Russian and Japanese). The syntactical mechanism one chooses to express hierarchy does not determine the possible semantics one can express so long as at least one effective method for expressing hierarchy is allowed. I choose to only offer one expression from each equivalence class of naming primitives, and here I chose the '/' separated file pathname expression traditional to Unix for pragmatic compatibility with existing operating systems. Reiser4 handles only hierarchical names, and non-hierarchical names are planned only for SSN Reiserfs. Directories Hierarchical names are implemented in Reiser4 by use of directories. The first component of a hierarchical name is the name of the directory, and the components that follow are passed to the directory to interpret. We use `/' to separate the components of a hierarchical name. Directories may choose to delegate parts of their task to their sub-directories. The unix directory plugin when supplied with a name will use the part of the name before the first / to select a sub-directory (if there is a / in what it is resolving), and delegate resolving the part of the name after the first / to the sub-directory. A directory can employ any arbitrary method at all of resolving the name components passed to it, so long as it returns a set of keys of objects as the result. In Reiser4, this set of keys always contains exactly one member, but this is designed to change in SSN Reiserfs. 
(Reiser4 also needs to interact with a standard interface for Unix filesystems called VFS (Virtual File System), and directories are also designed to be able to return what VFS understands, which we won't go into here.) Directories will also return a list of names when asked. This list is not required to be a complete list of all names that they can resolve, and sometimes it is not desirable that it be so. Names can be hidden names in Reiser4. Directory plugins may be able to resolve more names than they can list, especially if they are written such that the number of names that they can resolve is infinite. In partuclar, such names can resolve to the objects behaving like ordinary files (with respect to standard file system interface: read, write, readdir, etc.), but not backed up by storage layer. Such objects are called "pseudo files". Here is a list of pseudo files currently implemented in Reiser4 with description of their semantics. The Unix Directory Plugin The unix directory plugin implements directories by storing a set of directory entries per directory. These directory entries contain a name, and a key. When given a name to resolve, the unix directory plugin finds the directory entry containing that name, and then returns the key that is in the directory entry (more precisely, since a key selects not just the file but a particular byte within a file, it returns that part of the key which is sufficient to select the file, and which is sufficient to allow the code to determine what the full keys for those various parts when the byte offset and some other fields (like item type) are added to the partial key to form a whole key). The key can then be used by the tree storage layer to find all the pieces of that which was named. 
Some Historical Details Of Design Flaws In The Unix Directory Interface Unix differs from Multics, in that Multics defined a file to be a sequence of elements (the elements could be bytes, directory entries, or something else....), while Unix defines a file to be purely a sequence of bytes. In Multics directories were then considered to be a particular type of file which was a sequence of directory entries. For many years, all implementations of Unix directories were as sequences of bytes, and the notion of location within a Unix directory is tied not to a name as you might expect, but to a byte offset within the directory. The problem is that one is using a byte offset to represent a location whose true meaning is not a byte offset but a directory entry, and doing so for a particular file in a system which meaningfully names that file not by byte offset within the directory but by filename. Various efforts are being made in the Unix community to pretend that this byte offset is something more general than a byte offset, and they often try to do so without increasing the size used to store the thing which they pretend is not a byte offset. Since byte offsets are normally smaller than filenames are allowed to be, the result is ugliness and pathetic kludges. Trust me that you would rather not know about the details of those kludges unless you absolutely have to, and let me say no more. Directories Are Unordered Unix/Linux makes no promises regarding the order of names within directories. The order in which files are created is not necessarily the order in which names will be listed in a directory, and the use of lexicographic (alphabetic) order is surprisingly rare. The unix utilities typically sort directory listings after they are returned by the filesystem, which is why it seems like the filesystem sorts them, and is why listing very large directories can be slow. (Our current default plugin sorts filenames that are less than 15 letters long lexicographically. 
For those that are more than 15 characters long it sorts them first by their first 8 letters then by the hash of the whole name.) There is value to allowing the user to specify an arbitrary order for names using an arbitrary ordering function the user supplies. This is not done in Reiser4, but is planned as a feature of later versions. Allowing the creation of a hash plugin is a limited form of this that is currently implemented. Files That Are Also Directories In Reiser4 (but not ReiserFS 3) an object can be both a file and a directory at the same time. If you access it as a file, you obtain the named sequence of bytes. If you use it as a directory you can obtain files within it, directory listings, etc. There was a lengthy discussion on the Linux Kernel Mailing List about whether this was technically feasible to do. I won't reproduce it here except to summarize that Linus showed that this was feasible without "breaking" VFS. Allowing an object to be both a file and a directory is one of the features necessary to to compose the functionality present in streams and attributes using files and directories. To implement a regular unix file with all of its metadata, we use a file plugin for the body of the file, a directory plugin for finding file plugins for each of the metadata, and particular file plugins for each of the metadata. We use a unix_file file plugin to access the body of the file, and a unix_file_dir directory plugin to resolve the names of its metadata to particular file plugins for particular metadata. These particular file plugins for unix file metadata (owner, permissions, etc.) are implemented to allow the metadata normally used by unix files to be quite compactly stored. Hidden Directory Entries A file can exist but not be visible when using readdir in the usual way. WAFL does this with the .snapshots directory; it works well for them without disturbing users. 
This is useful for adding access to a variety of new features and their applications without disturbing the user when they are not relevant. New Security Attributes and Set Theoretic Semantic Purity character holding primitive icons Minimizing Number Of Primitives Is Important In Abstract Constructions To a theoretician it is extremely important to minimize the number of primitives with which one achieves the desired functionality in an abstract construction. It is a bit hard to explain why this is so, but it is well accepted that breaking an abstract model into more basic primitives is very important. A not very precise explanation of why is to say that by breaking complex primitives into their more basic primitives, then recombining those basic primitives differently, you can usually express new things that the original complex primitives did not express. Let's follow this grand tradition of theoreticians and see what happens if we apply it to Gnu/Linux files and directories. Can We Get By Using Just Files and Directories (Composing Streams And Attributes From Files And Directories)? In Gnu/Linux we have files, directories, and attributes. In NTFS they also have streams. Since Samba is important to Gnu/Linux, there frequently are requests that we add streams to ReiserFS. There are also requests that we add more and more different kinds of attributes using more and more different APIs. Can we do everything that can be done with {files, directories, attributes, streams} using just {files, directories}? I say yes--if we make files and directories more powerful and flexible. I hope that by the end of reading this you will agree. Let us have two basic objects. A file is a sequence of bytes that has a name. A directory is a name space mapping names to a set of objects "within" the directory. We connect these directory name spaces such that one can use compound names whose subcomponents are separated by a delimiter '/'. 
What is missing from files and directories now that attributes and streams offer? In ReiserFS 3, there exist file attributes. File attributes are out-of-band data describing the sequence of bytes which is the file. For example, the permissions defining who can access a file, or the last modification time, are file attributes. File attributes have their own API; creating new file attributes creates new code complexity and compatibility issues galore. ACLs are one example of new file attributes users want. Since in Reiser4 files can also be directories, we can implement traditional file attributes as simply files. To access a file attribute, one need merely name the file, followed by a '/', followed by an attribute name. That is: a traditional file will be implemented to possess some of the features of a directory; it will contains files within the directory corresponding to file attributes which you can access by their names; and it will contain a file body which is what you access when you name the "directory" rather than the file. Unix currently has a variety of attributes that are distinct from files (ACLS, permissions, timestamps, other mostly security related attributes, ...). This is because a variety of people needed this feature and that, and there was no infrastructure that would allow implementing the features as fully orthogonal features that could be applied to any file. Reiser4 will create that infrastructure. List Of Features Needed To Get Attribute And Stream Functionality From Files And Directories: * api efficient for small files * efficient storage for small files * plugins, including plugins that can compress a file serving as an attribute into a single bit * files that also act as directories when accessed as directories * inheritance (includes file aggregation) * constraints * transactions * hidden directory entries Each of these additional features is a feature that would benefit the filesystem. So we add them in v4. 
Basic Tree Concepts Trees, Nodes, and Items One way of organizing information is to put it into trees. When we organize information in a computer, we typically sort it into piles (nodes we call them), and there is a name (a pointer) for each pile that the computer will be able to use to find the pile. A height =4, 4 level, fanout = 3, balanced tree. It start with a root node, then traverses 2 internal nodes, and ends with the leaf nodes which hold the data and have no children. Figure 1. One Example Of A Tree. Some of the nodes can contain pointers, and we can go looking through the nodes to find those pointers to (usually other) nodes. We are particularly interested in how to organize so that we can find things when we search for them. A tree is an organization structure that has some useful properties for that purpose. Definition of Tree: 1. A tree is a set of nodes organized into a root node, and zero or more additional sets of nodes called subtrees. 2. Each of the subtrees is a tree. 3. No node in the tree points to the root node, and exactly one pointer from a node in the tree points to each non-root node in the tree. 4. The root node has a pointer to each of its subtrees, which is, a pointer to the root node of the subtree. Fine Points of the Definition The absolutely most trivial of all graphs, the single, isolated node. Figure 2. The simplest tree. A trivial, connected, linear (unary) graph-a linear sequence of nodes connected by paths (edges, pointers). Figure 3. A trivial, linear tree. It is interesting to argue over whether finite should be a part of the definition of trees. There are many ways of defining trees, and which is the best definition depends on what your purpose is. Donald Knuth (a well known author of algorithm textbooks) supplies several definitions of tree. As his primary definition of tree he even supplies one which has no pointers/edges/lines in the definition, just sets of nodes. 
Reiser4 uses a finite tree (the number of nodes is limited). Knuth defines trees as being finite sets of nodes. There are papers on infinite trees on the Internet. I think it more appropriate to consider finite an additional qualifier on trees, rather than bundling finite into the definition. However, I personally only deal with finite trees in my storage layer research. It is interesting to consider whether storage layers are inherently more motivated than semantic layers to limit themselves to finite trees rather than infinite trees. This is where some writers would say ".... is left as an exercise for the reader". :-) Oh the temptation.... I will remind the reader of my explanation of why storage layer trees are more motivated to be acyclic, and, at the cost of some effort at honesty, constrain myself to saying that doing more than providing that hint is beyond my level of industry.;-) Edge is a term often used in tree definitions. A pointer is unidirectional (you can follow it from the node that has it to the node it points to, but you cannot follow it back from the node it points to to the node that has it). An edge is bidirectional (you can follow it in both directions). Here are three alternative tree definitions, which are interesting in how they are mathematically equivalent to each other, though they are not equivalent to the definition I supplied because edges are not equivalent to pointers: For all three of these definitions, let there be not more than one edge connecting the same two nodes. * a set of vertices (aka points) connected by edges (aka lines) for which the number of edges is one less than the number of vertices * or a set of vertices connected by edges which has no cycles (a cycle is a path from a vertex to itself) * or a set of vertices connected by edges for which there is exactly one path connecting any two vertices The three alternative definitions do not have a unique root in their tree, and such trees are called free trees. 
The definition I supplied is a definition of a rooted tree not a free tree. It also has no cycles, it has one less pointer than it has nodes, and there is exactly one path from the root to any node. Please feel encouraged to read Knuth's writings for more discussions of these topics. Graphs vs. Trees Consider the purposes for which you might want to use a graph, and those for which you might want to use a tree? In a tree there is exactly one path from the root to each node in the tree, and a tree has the minimum number of pointers sufficient to connect all the nodes. This makes it a simple and efficient structure. Trees are useful for when efficiency with minimal complexity is what is desired, and there is no need to reach a node by more than one route. Reiser4 has both graphs and trees, with trees used for when the filesystem chooses the organization (in what we call the storage layer, which tries to be simple and efficient), and graphs for when the user chooses the organization (in the semantic layer, which tries to be expressive so that the user can do whatever he wants). Ordering The Tree Aids Searching Through It Keys We assign everything stored in the tree a key. We find things by their keys. Use of keys gives us additional flexibility in how we sort things, and if the keys are small, it gives us a compact means of specifying enough to find the thing. It also limits what information we can use for finding things. This limit restricts its usefulness, and so we have a storage layer, which finds things by keys, and a semantic layer, which has a rich naming system. The storage layer chooses keys for things solely to organize storage in a way that will improve performance, and the semantic layer understands names that have meaning to users. 
As you read, you might want to think about whether this is a useful separation that allows freedom in adding improvements that aid performance in the storage layer, while escaping paying a price for the side effects of those improvements on the flexible naming objectives of the semantic layer. Choosing Which Subtree We start our search at the root, because from the root we can reach every other node. How do we choose which subtree of the root to go to from the root? The root contains pointers to its subtrees. For each pointer to a subtree there is a corresponding left delimiting key . Pointers to subtrees, and the subtrees themselves, are ordered by their left delimiting key. A subtree pointer's left delimiting key is equal to the least key of the things in the subtree. Its right delimiting key is larger than the largest key in the subtree, and it is the left delimiting key of the next subtree of this node. Each subtree contains only things whose keys are at least equal to the left delimiting key of its pointer, and are not more than its right delimiting key. If there are no duplicate keys in the tree, then each subtree contains only things whose keys are less than its right delimiting key. If there are no duplicate keys, then by looking within a node at its pointers to subtrees and their delimiting keys we know what subtree of that node contains the thing we are looking for. Duplicate keys are a topic for another time. For now I will just hint that when searching through objects with duplicate keys we find the first of them in the tree, and then we search through all duplicates one-by-one until we find what we are looking for. Allowing duplicate keys can allow for smaller keys, so there is sometimes a tradeoff between key size and the average frequency of such inefficient linear searches. 
Using duplicate keys can also allow, if one defines one's insertion algorithms such that they always insert at the end of a set of duplicate keys, ordering objects with the same key by creation time. The contents of each node in the tree are sorted within the node. So, the entire tree is sorted by key, and for a given key we know just where to go to find at least one thing with that key. Nodes Leaves, Twigs, and Branches Leaves are nodes that have no children. Internal nodes are nodes that have children. A height =4, 4 level, fanout = 3, balanced tree. It start with an internal root node, then traverses 2 internal branch nodes, and ends with the leaf nodes which hold the data and have no children. ) Figure 4. A height = 4, fanout = 3, balanced tree. A search will start with the root node, the sole level 4 internal node, traverse 2 more internal nodes, and end with a leaf node which holds the data and has no children. A node that contains items is called a formatted node. If an object is large, and is not compressed and doesn't need to support efficient insertions (compressed objects are special because they need to be able to change their space usage when you write to their middles because the compression might not be equally efficient for the new data), then it can be more efficient to store it in nodes without any use of items at all. We do so by default for objects larger than 16k. Unformatted leaves (unfleaves) are leaves that contain only data, and do not contain any formatting information. Only leaves can contain unformatted data. Pointers are stored in items, and so all internal nodes are necessarily formatted nodes. Pointers to unfleaves are different in their structure from pointers to formatted nodes. Extent pointers point to unfleaves. An extent is a sequence of contiguous in block number order unfleaves that belong to the same object. An extent pointer contains the starting block number of the extent, and a length. 
[diagram needed] Because the extent belongs to just one object, we can store just one key for the extent, and then we can calculate the key of any byte within that extent. If the extent is at least 2 blocks long, extent pointers are more compact than regular node pointers would be. Node pointers are pointers to formatted nodes. We do not yet have a compressed version of node pointers, but they are probably soon to come. Notice how with extent pointers we don't have to store the delimiting key of each node pointed to, while with node pointers we do. We will probably introduce key compression at the same time we add compressed node pointers. One would expect keys to compress well since they are sorted into ascending order. We expect our node and item plugin infrastructure will make such features easy to add at a later date. Twigs are parents of leaves. Extent pointers exist only in twigs. This is a very controversial design decision I will discuss a bit later. Branches are internal nodes that are not twigs. You might think we would number the root level 1, but since the tree grows at the top, it turns out to be more useful to number as 1 the level with the leaves where object data is stored. The height of the tree will depend upon how many objects we have to store and what the fanout rate (average number of children) of the internal and twig nodes is. For reasons of code simplicity, we find it easiest to implement Reiser4 such that it has a minimum height of 2, and the root is always an internal node. There is nothing deeper than judicious laziness to this: it simplifies the code to not deal with one-node trees, and nobody cares about the waste of space. An example of a Reiser4 tree:

Figure 5. This Reiser4 tree is a 4 level, balanced tree with a fanout of 3, starting with a root node, traversing branch nodes, including the internal nodes called twig nodes (a Reiser4 feature), and ending with the leaf nodes which hold the data and have no children.
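The key arithmetic that a single extent key makes possible can be sketched as follows. This is an illustration only, with hypothetical function and parameter names; it assumes 4k blocks and a byte-granular key offset, which is a simplification of the real key format.

```python
BLOCK_SIZE = 4096  # Reiser4 nodes are usually page sized

def block_of_byte(extent_start_block, extent_length_blocks,
                  extent_key_offset, byte_offset):
    # Because an extent belongs to a single object and its unfleaves
    # are contiguous in block number order, one stored key per extent
    # suffices: the block holding any byte within the extent is
    # computable by arithmetic alone.
    rel = byte_offset - extent_key_offset  # byte position inside the extent
    assert 0 <= rel < extent_length_blocks * BLOCK_SIZE, "byte not in this extent"
    return extent_start_block + rel // BLOCK_SIZE
```

A regular node pointer, by contrast, must store a delimiting key per pointed-to node; an extent pointer amortizes one key over its whole length.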
In practice Reiser4 fanout is much higher and varies from node to node, but a 4 level tree diagram with 16 million leaf nodes won't fit easily onto my monitor, so I drew something smaller... ;-)

= Size of Nodes =

We choose to make the nodes equal in size. This makes it much easier to allocate the unused space between nodes, because it will be some multiple of the node size, and there are no problems of space being free but not large enough to store a node. Also, disk drives have an interface that assumes equal size blocks, which they find convenient for their error-correction algorithms. If having the nodes be equal in size is not very important, perhaps due to the tree fitting into RAM, then a class of algorithms called skip lists is worthy of consideration. Reiser4 nodes are usually equal to the size of a page, which if you use Gnu/Linux on an Intel CPU is currently 4096 (4k) bytes. There is no measured empirical reason to think this size is better than others; it is just the one that Gnu/Linux makes easiest and cleanest to program into the code, and we have been too busy to experiment with other sizes.

= Sharing Blocks Saves Space =

If nodes are of equal size, how do we store large objects? We chop them into pieces. We call these pieces items. Items are sized to fit within a single node. Conventional filesystems store files in whole blocks. Roughly speaking, this means that on average half a block of space is wasted per file, because not all of the last block of the file is used. If a file is much smaller than a block, then the space wasted is much larger than the file. It is not effective to store such typical database objects as addresses and phone numbers in separately named files in a conventional filesystem, because doing so wastes more than 90% of the space in the blocks that store them. By putting multiple items within a single node in Reiser4, we are able to pack multiple small pieces of files into one block. Our space efficiency is roughly 94% for small files.
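The whole-block waste argument is easy to make concrete. A small sketch, with illustrative file sizes (the function name is mine, not from any filesystem's code):

```python
def whole_block_efficiency(file_sizes, block_size=4096):
    # Space efficiency when every file is rounded up to whole
    # blocks, as in a conventional filesystem: bytes actually
    # used divided by bytes allocated.
    used = sum(file_sizes)
    allocated = sum(-(-size // block_size) * block_size  # ceiling division
                    for size in file_sizes)
    return used / allocated
```

A 100-byte address record stored as its own file uses under 3% of its 4k block, which is the ">90% wasted" case above; files averaging half a block of slack land near the two-thirds efficiency that "half a block wasted per file" implies.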
That 94% figure does not count per item formatting overhead, whose percentage of total space consumed depends on average item size, and for that reason is hard to quantify. Aligning files to 4k boundaries does have advantages for large files though. When a program wants to operate directly on file data without going through system calls to do it, it can use mmap() to make the file data part of the process's directly accessible address space. Due to some implementation details, mmap() needs file data to be 4k aligned, and if the data is already 4k aligned, it makes mmap() much more efficient. In Reiser4 the current default is that files larger than 16k are 4k aligned. We don't yet have enough empirical data and experience to know whether 16k is the precise optimal default value for this cutoff point, but so far it seems to at least be a decent choice.

= Items =

Nodes in the tree are smaller than some of the objects they hold, and larger than others, so how do we store those objects? One way is to pour them into items. An item is a data container that is contained entirely within a single node, and it allows us to manage space within nodes. For the default 4.0 node format, every item has a key, an offset to where in the node the item body starts, a length of the item body, and a plugin id that indicates what type of item it is. Items allow us to not have to round up to 4k the amount of space required to store an object.

The structure of an item (the body and the head are stored separated from each other within the node):

 Item_Body .... (separated) .... Item_Head: | Item_Key | Item_Offset | Item_Length | Item_Plugin_id |

= Types of Items =

Reiser4 includes many different kinds of items designed to hold different kinds of information:

* static_stat_data: holds the owner, permissions, last access time, creation time, last modification time, size, and the number of links (names) to a file.
* cmpnd_dir_item: holds directory entries, and the keys of the files they link to.
* extent pointers: explained above.
* node pointers: explained above.
* bodies: hold parts of files that are not large enough to be stored in unfleaves.

= Units =

We call a unit that which we must place as a whole into an item, without splitting it across multiple items. When traversing an item's contents it is often convenient to do so in units:

* For body items the units are bytes.
* For directory items the units are directory entries. The directory entries contain a name and a key of the file named (or at least the item plugin can pretend they do; in practice the name and key may be compressed).
* For extent items the units are extents. Extent items only contain extents from the same file.
* For static_stat_data the whole stat data item is one indivisible unit of fixed size.

= What the Default Node Formats for ReiserFS 4.0 Look Like =

An unformatted leaf node (unfleaf node), which is the only node without a Node_Header, has the trivial structure:

 | .................... raw data .................... |

A formatted leaf node has the structure:

 | Block_Head | Item_Body0 | Item_Body1 | - - - | Item_Bodyn | ....Free Space.... | Item_Headn | - - - | Item_Head1 | Item_Head0 |

A twig node has the structure:

 | Block_Head | Item_Body0 = NodePointer0 | Item_Body1 = ExtentPointer1 | Item_Body2 = NodePointer2 | Item_Body3 = ExtentPointer3 | - - - | Item_Bodyn = NodePointern | ....Free Space.... | Item_Headn | - - - | Item_Head0 |

A branch node has the structure:

 | Block_Head | Item_Body0 = NodePointer0 | - - - | Item_Bodyn = NodePointern | ....Free Space.... | Item_Headn | - - - | Item_Head0 |
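The space accounting that this layout implies — bodies growing up from the front, heads growing down from the back, free space in between — can be sketched as follows. The overhead sizes here are illustrative placeholders, not the on-disk 4.0 values, and the class is mine, not kernel code.

```python
class FormattedNode:
    # Sketch of the formatted-node layout: item bodies grow up from
    # the front (after the block head), item heads grow down from
    # the end, and free space is whatever remains between them.
    BLOCK_HEAD = 24   # hypothetical Block_Head size in bytes
    ITEM_HEAD = 14    # hypothetical Item_Head: key, offset, length, plugin id

    def __init__(self, node_size=4096):
        self.node_size = node_size
        self.body_bytes = 0
        self.items = 0

    def free_space(self):
        return (self.node_size - self.BLOCK_HEAD
                - self.body_bytes - self.items * self.ITEM_HEAD)

    def insert(self, body_len):
        # An item must fit entirely within one node; on failure the
        # caller would shift items to a neighbor or split the node.
        if body_len + self.ITEM_HEAD > self.free_space():
            return False
        self.body_bytes += body_len
        self.items += 1
        return True
```

Note that each inserted item costs its body plus one head, which is exactly the per-item formatting overhead mentioned earlier.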
= Tree Design Concepts =

Height Balancing versus Space Balancing

Height balanced trees are trees such that each possible search path from root node to leaf node has exactly the same length (length = the number of nodes traversed from root node to leaf node). For instance, the height of the tree in Figure 1 is four, while the height of the left hand tree in Figure 1.3 is three, and that of the single node in Figure 2 is 1. The term balancing is used for several very distinct purposes in the balanced tree literature. Two of the most common are: to describe balancing the height, and to describe balancing the space usage within the nodes of the tree. These quite different definitions are unfortunately a classic source of confusion for readers of the literature. Most algorithms for accomplishing height balancing do so by only growing the tree at the top. Thus the tree never gets out of balance.

Figure 6. This is an unbalanced tree: a 4 level tree with fanout n = 3 that has lost some nodes to deletions and needs to be balanced.

= Three Principal Considerations in Tree Design =

Three of the principal considerations in tree design are:

* the fanout rate (see below)
* the tightness of packing
* the amount of shifting of items from one node to another that is performed (which creates delays due to waiting while things move around in RAM, and on disk).

= Fanout =

The fanout rate n refers to how many nodes may be pointed to by each level's nodes (see Figure 7). If each node can point to n nodes of the level below it, then starting from the top, the root node points to n internal nodes at the next level, each of which points to n more internal nodes at its next level, and so on; m levels of internal nodes can point to n^m leaf nodes containing items in the last level.
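The n^m relation fixes the height a tree needs for a given number of leaves. A small integer-arithmetic sketch (function names are mine; the fanout of 253 below is just an illustrative value at which a 4 level tree reaches the 16 million leaves mentioned earlier):

```python
def internal_levels(n_leaves, fanout):
    # m levels of internal nodes can address fanout**m leaves;
    # find the smallest sufficient m by exact integer arithmetic
    # (floating-point logs can round the wrong way here).
    m, reach = 0, 1
    while reach < n_leaves:
        m += 1
        reach *= fanout
    return m

def tree_height(n_leaves, fanout):
    return internal_levels(n_leaves, fanout) + 1  # +1 for the leaf level
```

Because height grows only logarithmically in the number of leaves, a small increase in fanout buys a large increase in addressable leaves at the same height.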
The more you want to be able to store in the tree, the larger you have to make the fields in the key that first distinguish the objects (the objectids), and then select parts of the object (the offsets). This means your keys must be larger, which decreases fanout (unless you compress your keys, but that will wait for our next version...).

Figure 7. Three 4 level, height balanced trees with fanouts n = 1, 2, and 3. The first is a four level tree with fanout n = 1. It has just four nodes: it starts with the root node, traverses the internal and twig nodes, and ends with the leaf node which contains the data. The second tree, with 4 levels and fanout n = 2, starts with a root node, traverses 2 internal nodes, each of which points to two twig nodes (for a total of four twig nodes), and each of these points to 2 leaf nodes, for a total of 8 leaf nodes. Lastly, a 4 level, fanout n = 3 tree is shown which has 1 root node, 3 internal nodes, 9 twig nodes, and 27 leaf nodes.

= What Are B+Trees, and Why Are They Better than B-Trees =

It is possible to store not just pointers and keys in internal nodes, but also the objects those keys correspond to. This is what the original B-tree algorithms did. Then B+trees were invented, in which only pointers and keys are stored in internal nodes, and all of the objects are stored at the leaf level.

Figure 8. Figure 9. [diagrams contrasting the two tree types]
Warning! I have found from experience that most persons who don't first deeply understand why B+trees are better than B-trees won't later understand explanations of the advantages of putting extents on the twig level rather than using BLOBs. The same principles that make B+trees better than B-trees also make Reiser4 faster than using BLOBs as most databases do. So make sure you fully digest this section before moving on to the next one, ok? ;-)

= B+Trees Have Higher Fanout Than B-Trees =

Fanout is increased when we put only pointers and keys in internal nodes, and don't dilute them with object data. Increased fanout increases our ability to cache all of the internal nodes, because there are fewer internal nodes. Often persons respond to this by saying, "but B-trees cache objects, and caching objects is just as valuable". It is not, on average, is the answer. Of course, discussing averages makes the discussion much harder. We need to discuss some cache design principles for a while before we can get to this.

= Cache Design Principles =

Reiser's Untie the Uncorrelated Principle of Cache Design: tying the caching of things whose usage does not strongly correlate is bad. Suppose:

* you have two sets of things, A and B.
* you need things from those two sets at semi-random, with some items tending to be needed much more frequently than others, though which items those are can shift slowly over time.
* you can keep things around after you use them in a cache of limited size.
* you tie the caching of every thing from A to the caching of another thing from B (that is, whenever you fetch something from A into the cache, you fetch its partner from B into the cache).

Then this increases the amount of cache required to store everything recently accessed from A.
If there is a strong correlation between the need for the two particular objects that are tied in each of the pairings, stronger than the gain from spending those cache resources on caching more members of B according to the LRU algorithm, then this might be worthwhile. If there is no such strong correlation, then it is bad. But wait, you might say, you need things from B also, so it is good that some of them were cached. Yes, you need some random subset of B. The problem is that without a correlation existing, the things from B that you need are not especially likely to be those same things from B that were tied to the things from A that were needed. This tendency to inefficiently tie things that are randomly needed exists outside the computer industry. For instance, suppose you like both popcorn and sushi, with your need for them on a particular day being random. Suppose that you like movies randomly. Suppose a theater requires you to eat only popcorn while watching the movie you randomly found optimal to watch, and not eat sushi from the restaurant on the corner while watching that movie. Is this a socially optimum system? Suppose quality is randomly distributed across all the hot dog vendors: if you can only eat the hot dog produced by the best movie displayer on a particular night that you want to watch a movie, and you aren't allowed to bring in hot dogs from outside the movie theater, is it a socially optimum system? Optimal for you? Tying the uncorrelated is a very common error in designing caches, but it is still not enough to describe why B+Trees are better. With internal nodes, we store more than one pointer per node. That means that pointers are not separately cached. You could well argue that pointers and the objects they point to are more strongly correlated than the different pointers. We need another cache design principle. 
= Reiser's Maximize the Variance Principle of Cache Design =

If two types of things that are cached and accessed, in units that are aggregates, have different average temperatures, then segregating the two types into separate units helps caching. For balanced trees, these units of aggregation are nodes. This principle applies to the situation where it may be necessary to tie things into larger units for efficient access, and guides what things should be tied together. Suppose you have R bytes of RAM for cache, and D bytes of disk. Suppose that 80% of accesses are to the most recently used things, which are stored in H (hotset) bytes of nodes. Reducing the size of H to where it is smaller than R is very important to performance. If you evenly disperse your frequently accessed data, then a larger cache is required and caching is less effective.

1. If, all else being equal, we increase the variation in temperature among all aggregates (nodes), then we increase the effectiveness of using a fast small cache.
2. If two types of things have different average temperatures (ratios of likelihood of access to size in bytes), then separating them into separate aggregates (nodes) increases the variation in temperature in the system as a whole.
3. Conclusion: if all else is equal, and two types of things cached several to an aggregate (node) have different average temperatures, then segregating them into separate nodes helps caching.

= Pointers to Nodes Have a Higher Average Temperature Than the Nodes They Point To =

Pointers to nodes tend to be frequently accessed relative to the number of bytes required to cache them. Consider that you have to use the pointers for all tree traversals that reach the nodes beneath them, and they are smaller than the nodes they point to. Putting only node pointers and delimiting keys into internal nodes concentrates the pointers.
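To get a feel for how small the concentrated pointer set is, here is a rough sizing sketch. The fanout of 200 is an illustrative value, not a measured Reiser4 figure, and the function is a back-of-the-envelope calculation, not filesystem code.

```python
def internal_node_bytes(leaf_count, node_size=4096, fanout=200):
    # RAM needed to cache every internal node of a B+tree with the
    # given number of leaves.  Each internal level is roughly
    # 1/fanout the size of the level below it, so the total is a
    # small geometric sum dominated by the lowest internal level.
    total, level = 0, leaf_count
    while level > 1:
        level = -(-level // fanout)  # parents needed for this level
        total += level * node_size
    return total
```

With a fanout in the hundreds, all internal nodes together amount to well under 1% of the leaf data, which is why caching every pointer becomes feasible once object data is kept out of internal nodes.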
Since pointers tend to be more frequently accessed per byte of their size than items storing file bodies, a high average temperature difference exists between pointers and object data. According to the caching principles described above, segregating these two types of things with different average temperatures, pointers and object data, increases the efficiency of caching.

= Segregating by Temperature Directly =

Now you might say: well, why not segregate by actual temperature, instead of by type, which only correlates with temperature? We do what we can easily and effectively code, with not just temperature segregation in consideration. There are tree designs which rearrange the tree so that objects which have a higher temperature are higher in the tree than pointers with a lower temperature. The difference in average temperature between object data and pointers to nodes is so high that I don't find such designs a compelling optimization, and they add complexity. I could be wrong. If one had no compelling semantic basis for aggregating objects near each other (this is true for some applications), and if one wanted to access objects by nodes rather than individually, it would be interesting to have a node repacker sort object data into nodes by temperature. You would need to have the repacker change the keys of the objects it sorts. Perhaps someone will have us implement that for some application someday for Reiser4.

= BLOBs Unbalance the Tree, Reduce Segregation of Pointers and Data, and Thereby Reduce Performance =

BLOBs, Binary Large OBjects, are a method of storing objects larger than a node by storing pointers to the nodes containing the object. These pointers are commonly stored in what are called the leaf nodes (level 1, except that the BLOBs are then sort of a basement "level B" :-\ ) of a "B*" tree.

Figure 10.
A Binary Large OBject (BLOB) has been inserted, with pointers to its blocks stored in a leaf node; in this case the BLOB's blocks are all contiguous. This is what a ReiserFS V3 tree looks like. BLOBs are a significant unintentional definitional drift, albeit one accepted by the entire database community. This placement of pointers into nodes containing data is a performance problem for ReiserFS V3, which uses BLOBs. (Never accept the "let's just try it my way and see, and we can change it if it doesn't work" argument. It took years and a disk format change to get BLOBs out of ReiserFS, and performance suffered the whole time, if tails were turned on.) Because the pointers to BLOBs are diluted by data, caching all pointers to all nodes in RAM is infeasible for typical file sets. Reiser4 returns to the classical definition of a height balanced tree, in which the lengths of the paths to all leaf nodes are equal. It does not try to pretend that all of the nodes storing objects larger than a node are somehow not part of the tree, even though the tree stores pointers to them. As a result, the amount of RAM required to store pointers to nodes is dramatically reduced. For typical configurations, RAM is large enough to hold all of the internal nodes.

Figure 11. A Reiser4, 4 level, height balanced tree with fanout = 3, with the data that was stored in BLOBs now stored in extents in the level 1 leaf nodes and pointed to by extent pointers stored in the level 2 twig nodes.

Gray and Reuter say the criterion for searching external memory is to "minimize the number of different pages along the average (or longest) search path. ... by reducing the number of different pages for an arbitrary search path, the probability of having to read a block from disk is reduced."
(Transaction Processing: Concepts and Techniques, Morgan Kaufmann Publishers, San Francisco, CA, 1993, p. 834.) My problem with this explanation of why the height balanced approach is effective is that it does not convey that you can get away with having a moderately unbalanced tree, provided that you do not significantly increase the total number of internal nodes. In practice, most trees that are unbalanced do have significantly more internal nodes. In practice, most moderately unbalanced trees have a moderate increase in the cost of in-memory tree traversals, and an immoderate increase in the amount of I/O due to the increased number of internal nodes. But if one were to put all the BLOBs together in the same location in the tree, then since the number of internal nodes would not significantly increase, the penalty for having them on a lower level of the tree than all other leaf nodes would not be a significant additional I/O cost. There would be a moderate increase in that part of the tree traversal time cost which depends on RAM speed, but this would not be so critical. Segregating BLOBs could perhaps substantially recover the performance lost by architects not noticing the drift in the definition of height balancing for trees. It might be undesirable to segregate objects by their size rather than just their semantics, though. Perhaps someday someone will try it and see what results.

= Dancing Trees Are Faster Than Balanced Trees =

Balanced trees have traditionally employed a fixed criterion for determining whether nodes should be squeezed together into fewer nodes so as to save space. This criterion is traditionally satisfied at the end of every modification to the tree. A typical such criterion is to guarantee that after each modification to the tree, the modified node cannot be squeezed together with its left and right neighbors into two or fewer nodes. ReiserFS V3 uses that criterion for its leaf nodes.
The more neighboring nodes you consider for squeezing into one fewer node, the more memory bandwidth you consume on average per modification to the tree, and the more likely you are to need to read those nodes because they are not in memory. It is a typical pattern in memory management algorithm design that the more tightly packed memory is kept, the more overhead is added to the cost of changing what is stored where in it. This overhead can be significant enough that some commercial databases actually only delete nodes when they are completely empty, and they feel that in practice this works well. Trees that adhere to fixed space usage balancing criteria can have many things rigorously proven about their worst case performance in publishable papers. This is different from their being optimal. An algorithm can have worse bounds on its theoretical worst case performance and be a better algorithm. Just because one cannot rigorously define average usage patterns does not mean they are the slightest bit less important. Sorry, mere mortal mathematicians, that is life. Maybe some might prefer to think about the questions that they can define and answer rigorously, but this does not in the slightest make them the right questions. Yes, I am a chaotic....

In Reiser4 we employ not balanced trees, but dancing trees. Dancing trees merge insufficiently full nodes, not with every modification to the tree, but instead:

* in response to memory pressure triggering a flush to disk,
* as a result of a transaction closure flushing nodes to disk.

= If It Is in RAM, Dirty, and Contiguous, Then Squeeze It All Together Just Before Writing =

Let a slum be defined as a sequence of nodes that are contiguous in the tree order and dirty in this transaction. (In simpler words, a bunch of dirty nodes that are right next to each other.) A dancing tree responds to memory pressure by squeezing and flushing slums.
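Finding slums is just run detection over the tree order. A minimal sketch, representing each node by an id and a dirty flag (the function name is mine, not the kernel's):

```python
def find_slums(nodes):
    # nodes: list of (node_id, dirty) pairs in tree order.
    # A slum is a maximal run of consecutive dirty nodes;
    # return each slum as a list of node ids.
    slums, run = [], []
    for node_id, dirty in nodes:
        if dirty:
            run.append(node_id)
        elif run:
            slums.append(run)
            run = []
    if run:  # a slum may end at the last node
        slums.append(run)
    return slums
```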
It is possible that merely squeezing a slum might free up enough space that flushing is unnecessary, but the current implementation of Reiser4 always flushes the slums it squeezes. This is not necessarily the right approach, but we found it simpler and good enough for now. Another simplification we choose to engage in for now is that instead of trying to estimate whether squeezing a slum will save space before squeezing it, we just squeeze it and see. Balanced trees have an inherent tradeoff between balancing cost and space efficiency. If they consider more neighboring nodes, for the purpose of merging them to save a node, with every change to the tree, then they can pack the tree more tightly, at the cost of moving more data with every change to the tree. By contrast, with a dancing tree, you simply take a large slum, shove everything in it as far to the left as it will go, and then free all the nodes in the slum that are left with nothing remaining in them, at the time of committing the slum's contents to disk in response to memory pressure. This gives you extreme space efficiency when slums are large, at a cost in data movement that is lower than it would be with an invariant balancing criterion, because it is done less often. By compressing at the time one flushes to disk, one compresses less often, and that means one can afford to do it more thoroughly. By compressing dirty nodes that are in memory, one avoids performing additional I/O as a result of balancing.

= Procrastination Leads to Wiser Decisions: Allocate on Flush =

ReiserFS V3 assigns block numbers to nodes as it creates them. XFS is smarter: it waits until the last moment, just before writing nodes to disk. I'd like to thank the XFS team for making an effort to ensure that I understood the merits of their approach. The easy way to see its merits is to consider a file that is deleted before it reaches disk. Such a file should have no effect on the disk layout.
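The "shove everything left and free the emptied nodes" step described above can be sketched as a packing pass. Items are represented only by their sizes here; real squeezing also merges compatible items, respects item boundaries, and updates keys and parent pointers. The name `squeeze_slum` is mine.

```python
def squeeze_slum(slum, node_capacity):
    # slum: list of nodes, each a list of item sizes, in tree order.
    # Pack all items as far left as they will go; nodes left with
    # nothing in them are simply dropped (freed).
    sizes = [size for node in slum for size in node]
    packed, current, used = [], [], 0
    for size in sizes:
        if current and used + size > node_capacity:
            packed.append(current)     # current node is full; start the next
            current, used = [], 0
        current.append(size)
        used += size
    if current:
        packed.append(current)
    return packed
```

Three one-quarter-full nodes collapse into one, while two nodes that are each over half full stay as two: the pass never makes things worse, and it runs only once per flush rather than on every tree modification.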
= Reiser4, the Atomic Filesystem =

Reducing the Damage of Crashing

When a computer crashes, data in RAM which has not reached disk is lost. You might at first be tempted to think that we then want to keep all of the data that did reach disk. Suppose that you were performing a transfer of $10 from bank account A to bank account B, and this consisted of two operations: 1) debit $10 from A, and 2) credit $10 to B. Suppose that 1) but not 2) reached disk, and the computer crashed. It would be better to disregard 1) than to let 1) but not 2) take effect, yes? When there is a set of operations which we ensure will either all take effect, or none take effect, we call the set as a whole an atom. Reiser4 implements all of its filesystem system calls (requests to the kernel to do something are called system calls) as fully atomic operations, and allows one to define new atomic operations using its plugin infrastructure. Why don't all filesystems do this? Performance. Reiser4 employs new algorithms that allow it to make these operations atomic at little additional cost, where other filesystems have paid a heavy, usually prohibitive, price to do that. We hope to share with you how that is done.

= A Brief History of How Filesystems Have Handled Crashes =

Filesystem Checkers

Originally, filesystems had checkers that would run after every crash. The problems with that were that 1) the checkers cannot handle every form of damage well, and 2) the checkers run for a long time.
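The all-or-nothing behavior of the bank-transfer atom above can be sketched in miniature. This toy models commit as replacing shared state only after every operation in the atom has succeeded against a shadow copy; it illustrates the semantics, not Reiser4's transaction machinery.

```python
def apply_atom(accounts, operations):
    # operations: list of (account, delta) pairs forming one atom.
    # Work on a shadow copy; only if every operation succeeds does
    # the shadow replace the shared state (commit).  Any failure
    # aborts the whole atom, so no partial effect is ever visible.
    shadow = dict(accounts)
    for account, delta in operations:
        shadow[account] = shadow.get(account, 0) + delta
        if shadow[account] < 0:
            return accounts          # abort: nothing takes effect
    return shadow                    # commit: everything takes effect
```

A debit that would overdraw account A causes the credit to B to be discarded along with it, which is exactly the "disregard 1)" outcome argued for above.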
The amount of data stored on hard drives increased faster than the transfer rate (the rate at which hard drives transfer data from the platter spinning inside them into the computer's RAM when asked to do one large continuous read, or the rate in the other direction for writes), which means that the checkers took longer to run, and as the decades ticked by it became less and less reasonable for a mission critical server to wait for the checker.

= Fixed Location Journaling =

A solution was adopted of first writing each atomic operation to a location on disk called the journal or log, and then, only after each atom had fully reached the journal, writing it to the committed area of the filesystem. The problem with this is that twice as much data needs to be written. On the one hand, if the workload is dominated by seeks, this is not as much of a burden as one might think. On the other hand, for writes of large files it halves performance, because such writes are usually transfer time dominated. For this reason, meta-data journaling came to dominate general purpose usage. With meta-data journaling, the filesystem guarantees that all of its operations on its meta-data will be done atomically. If a file is being written to, the data being written may be corrupted as a result of non-atomic data operations, but the filesystem's internals will all be consistent. The performance advantage was substantial. V3 of ReiserFS offers both meta-data and data journaling, and defaults to meta-data journaling, because that is the right solution for most users. Oddly enough, meta-data journaling is much more work to implement, because it requires being precise about what needs to be journaled. As is so often the case in programming, doing less work requires more code.
With fixed location data journaling, the overhead of making each operation atomic is too high for it to be appropriate for average applications that don't especially need it, because of the cost of writing twice. Applications that do need atomicity are written to use fsync() and rename() to accomplish it, and these tools are simply terrible for that job: terrible in performance, and terrible in the ugliness they add to the coding of applications. Stuffing a transaction into a single file just because you need the transaction to be atomic is hardly what one would call flexible semantics. Also, data journaling, with all its performance cost, still does not necessarily guarantee that every system call is fully atomic, much less that one can construct sets of operations that are fully atomic. It usually merely guarantees that the files will not contain random garbage, however many blocks of them happen to get written, and however much the application might view the result as inconsistent data. I hope you understand that we are trying to set a new expectation here for how secure a filesystem should keep your data, when we provide these atomicity guarantees.

= Wandering Logs =

One way to avoid having to write the data twice is to change one's definition of where the log area and the committed area are, instead of moving the data from the log to the committed area. There is an annoying complication to this though: there are probably a number of pointers to the data from the rest of the filesystem, and we need them to point to the new data. When the commit occurs, we need to write those pointers so that they point to the data we are committing. Fortunately, these pointers tend to be highly concentrated as a result of our tree design. But wait: if we are going to update those pointers, then we want to commit those pointers atomically also, which we could do if we write them to another location and update the pointers to them, and....
up the tree the changes ripple. When we get to the top of the tree, since disk drives write sectors atomically, the block number of the top can be written atomically into the superblock by the disk, thereby committing everything the new top points to. This is indeed the way WAFL, the Write Anywhere File Layout filesystem invented by Dave Hitz at Network Appliance, works. It always ripples changes all the way to the top, and indeed that works rather well in practice, and most of their users are quite happy with its performance.

Writing Twice May Be Optimal Sometimes

Suppose that a file is currently well laid out, and you write to a single block in the middle of it, and you then expect to do many reads of the file. That is an extreme case illustrating that sometimes it is worth writing twice so that a block can keep its current location while committing atomically. If one writes a node twice in this way, one also does not need to update its parent and ripple all the way to the top of the tree. Our code is a toolkit that can be used to implement different layout policies, and one of the available choices is whether to write over a block in its current place, or to relocate it to somewhere else. I don't think there is one right answer for all usage patterns. If a block is adjacent to many other dirty blocks in the tree, then this decreases the significance of the cost to read performance of relocating it and its neighbors. If one knows that a repacker will run once a week (a repacker is expected for V4.1, and is (a bit oddly) absent from WAFL), this also decreases the cost of relocation. After a few years of experimentation, measurement, and user feedback, we will say more about our experiences in constructing user selectable policies.

Do we pay a performance penalty for making Reiser4 atomic? Yes, we do. Is it an acceptable penalty?
We picked up a lot more performance from other improvements in Reiser4 than we lost to atomicity, and so it is not isolated in our measurements, but I am unscientifically confident that the answer is yes. If changes are either large or batched together with enough other changes to become large, the performance penalty is low and drowned out by other performance improvements. Scattered small changes threaten us with read performance losses compared to overwriting in place and taking our chances with the data's consistency if there is a crash, but use of a repacker will mostly alleviate this scenario. I have to say that in my heart I don't have any serious doubts that for the general purpose user the increase in data security is worthwhile. The users, though, will have the final say.

Committing

A transaction preserves the previous contents of all modified blocks in their original location on disk until the transaction commits, where commit means the transaction has reached a state in which it will be completed even if there is a crash. The dirty blocks of an atom (which were captured and subsequently modified) are divided into two sets, relocate and overwrite, each of which is preserved in a different manner. The relocatable set is the set of blocks that have a dirty parent in the atom. The relocate set is those members of the relocatable set that will be written to a new or first location rather than overwritten. The overwrite set contains all dirty blocks in the atom that need to be written to their original locations, which is all those not in the relocate set. In practice this is those which do not have a parent we want to dirty, plus also those for which overwrite is the better layout policy despite the write-twice cost. Note that the superblock is the parent of the root node, and the free space bitmap blocks have no parent. By these definitions, the superblock and modified bitmap blocks are always part of the overwrite set.
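The partition of an atom's dirty blocks defined above is mechanical enough to sketch. The data model here (a dict of blocks with `dirty` and `parent` fields) is an assumption for illustration, not Reiser4's actual structures, and for simplicity the sketch relocates every relocatable block rather than consulting a layout policy:

```python
# Sketch of dividing an atom's dirty blocks into the relocate and
# overwrite sets, following the definitions in the text.

def partition(atom):
    """atom: dict block_id -> {'dirty': bool, 'parent': block_id or None}.
    Returns (relocate, overwrite) as sets of block ids."""
    dirty = {b for b, node in atom.items() if node['dirty']}
    # relocatable set: dirty blocks whose parent is dirty and in the atom
    relocatable = {b for b in dirty if atom[b]['parent'] in dirty}
    # a real layout policy may still overwrite some relocatable blocks;
    # here every relocatable block is relocated
    relocate = relocatable
    # everything else is overwritten in place -- this always includes the
    # superblock and bitmap blocks, since they have no parent to dirty
    overwrite = dirty - relocate
    return relocate, overwrite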
The wandered set is the set of blocks that the overwrite set will be written to temporarily until the overwrite set commits. An interesting definition is the minimum overwrite set, which uses the same definitions as above with the following modification. If at least two dirty blocks have a common parent that is clean then its parent is added to the minimum overwrite set. The parent's dirty children are removed from the overwrite set and placed in the relocate set. This policy is an example of what will be experimented with in later versions of Reiser4 using the layout toolkit. For space reasons, we leave out the full details on exactly when we relocate vs. overwrite, and the reader should not regret this because years of experimenting is probably ahead before we can speak with the authority necessary for a published paper on the effects of the many details and variations possible. When we commit we write a wander list which consists of a mapping of the wander set to the overwrite set. The wander list is a linked list of blocks containing pairs of block numbers. The last act of committing a transaction is to update the super block to point to the front of that list. Once that is done, if there is a crash, the crash recovery will go through that list and "play" it, which means to write the wandered set over the overwrite set. If there is not a crash, we will also play it. There are many more details of how we handle the deallocation of wandered blocks, the handling of bitmap blocks, and so forth. You are encouraged to read the comments at the top of our source code files (e.g. wander.c) for such details.... Journalling optimizations Copy-on-capture Suppose one wants to capture a node which belongs to an atom with stage >= ASTAGE_PRE_COMMIT. This capture request should wait (sleep in capture_fuse_wait()) when atom is committed. The copy-on-capture optimization allows to satisfy capture request by creating a copy of a node which is being captured. 
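The wander-list commit and replay described above can be sketched as follows. This is an assumed layout for illustration only (the real details are in wander.c, as noted): overwrite-set blocks first go to temporary wandered locations, the superblock then points at the list of (wandered, original) pairs, and playing the list writes the wandered blocks over the originals.

```python
# Sketch of committing an overwrite set via a wander list and playing it.
# disk: dict block_number -> contents; sb: the superblock as a dict.

def commit_overwrite(disk, sb, overwrite):
    """overwrite: dict original_block -> new contents."""
    wander_list = []
    for orig, data in overwrite.items():
        tmp = max(disk, default=0) + 1      # pick a free wandered block
        disk[tmp] = data                    # 1st write: wandered location
        wander_list.append((tmp, orig))
    sb['wander'] = wander_list              # atomic superblock update = commit

def play(disk, sb):
    """Run by crash recovery, and also in the absence of a crash."""
    for tmp, orig in sb.get('wander', []):
        disk[orig] = disk[tmp]              # 2nd write: original location
    sb['wander'] = []                       # wandered blocks may now be freed
```

Until `play` runs, the original contents are untouched, which is what lets a crash before the superblock update roll the transaction back for free.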
The commit process takes control of one copy of the node, and the capturing process takes control of the other copy. This does not lead to any conflicts between node versions, because it is guaranteed that the copy held by the commit process will not be modified.

Steal-on-capture

The idea of the steal-on-capture optimization is that only the last committed transaction to modify an overwrite block actually needs to write that block. Other transactions can skip writing that block at post-commit time. This optimization, which is also present in ReiserFS version 3, means that frequently modified overwrite blocks will be written less than two times per transaction. With this optimization a frequently modified overwrite block may avoid being overwritten by a series of atoms; as a result, crash recovery must replay more atoms than without the optimization. If an atom has overwrite blocks stolen, the atom must be replayed during crash recovery until every stealing atom commits.

Repacker

Another way of escaping from the balancing time vs. space efficiency tradeoff is to use a repacker. 80% of files on the disk remain unchanged for long periods of time. It is efficient to pack them perfectly, by using a repacker that runs much less often than every write to disk. This repacker goes through the entire tree ordering, from left to right and then from right to left, alternating each time it runs. When it goes from left to right in the tree ordering, it shoves everything as far to the left as it will go, and when it goes from right to left it shoves everything as far to the right as it will go. (Left means small in key or in block number:-) ). In the absence of FS activity the effect of this over time is to sort by tree order (defragment), and to pack with perfect efficiency. Reiser4.1 will modify the repacker to insert controlled "air holes", as it is well known that insertion efficiency is harmed by overly tight packing.
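One left-to-right pass of the alternating repacker described above can be reduced to a toy sketch. The node capacity and list-of-lists representation are assumptions for illustration, not the real repacker:

```python
# Toy sketch of a left-to-right repacking pass over leaf nodes: items are
# shoved as far left as they will go, packing each node to capacity and
# freeing the nodes that empty out. Tree order is preserved throughout.

CAPACITY = 4  # items per node, an assumed constant

def repack_left(nodes):
    """nodes: list of lists of items in tree order. Returns packed nodes."""
    items = [it for node in nodes for it in node]
    return [items[i:i + CAPACITY] for i in range(0, len(items), CAPACITY)]

# A right-to-left pass would pack the same items against the right edge;
# alternating the direction each run avoids always paying the movement
# cost at the same end of the ordering.
```

The "air holes" planned for Reiser4.1 would correspond to packing each node slightly below CAPACITY, trading a little space for cheaper subsequent insertions.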
I hypothesize that it is more efficient to periodically run a repacker that systematically repacks using large IOs than to perform lots of 1-block reads of neighboring nodes of the modification points so as to preserve a balancing invariant in the face of poorly localized modifications to the tree.

Plugins

[Cartoon: man holding 3 plugins]

8 Kinds of Plugins Make Reiser4 The Most Tweakable Filesystem Going

File Plugins

Every file possesses a plugin id, and every directory possesses a plugin id. This plugin id will identify a set of methods. The set of methods will embody all of the different possible interactions with the file or directory that come from sources external to ReiserFS. It is a layer of indirection added between the external interface to ReiserFS and the rest of ReiserFS. Each method will have a methodid. It will be usual to mix and match methods from other plugins when composing plugins.

Directory Plugins

Reiser4 will implement a plugin for traditional directories. It will implement directory-style access to file attributes as part of the plugin for regular files. Later we will describe why this is useful. Other directory plugins we will leave for later versions. There is no deep reason for this deferral. It is simply the randomness of what features attract sponsors and make it into a release specification; there are no sponsors at the moment for additional directory plugins. I have no doubt that they will appear later; new directory plugins will be too much fun to miss out on.:-)

Hash Plugins

A directory is a mapping from file names to the files themselves. This mapping is implemented through Reiser4's internal balanced tree. Unfortunately, file names cannot be used as keys until keys of variable length are implemented, or unless unreasonable limitations on maximal file name length are imposed. To work around this, the file name is hashed and the hash is used as a key in the tree. No hash function is perfect, and there will always be hash collisions, that is, file names having the same hash value.
Previous versions of reiserfs (3.5 and 3.6) used a "generation counter" to overcome this problem: keys for file names having the same hash value were distinguished by having different generation counters. This allowed amortizing hash collisions at the cost of reducing the number of bits used for hashing. This "generation counter" technique is actually an ad hoc form of support for non-unique keys. Keeping in mind that some form of this has to be implemented anyway, it seemed justifiable to implement more regular support for non-unique keys in Reiser4. Another reason for using hashes is that some (arguably brain-dead) interfaces require them: telldir(3) and seekdir(3). These functions presume that the file system can issue 64-bit "cookies" that can be used to resume a readdir. Cookies are implemented in most filesystems as byte offsets within a directory (which means they cannot shrink directories), and in ReiserFS as hashes of file names plus a generation counter. Curiously enough, the Single UNIX Specification tags telldir(3) and seekdir(3) as "Extension", because "returning to a given point in a directory is quite difficult to describe formally, in spite of its intuitive appeal, when systems that use B-trees, hashing functions, or other similar mechanisms to order their directories are considered". We order directory entries in ReiserFS by their cookies. This costs us performance compared to ordering lexicographically. (But it is immensely faster than the linear searching employed by most other Unix filesystems.) Depending on the hash and its match to the application usage pattern there may be more or less performance lossage. Hash plugins will probably remain until version 5 or so, when directory plugins and ordering function plugins will obsolete them. Directory entries will then be ordered by file names, as they should be (and possibly stem compressed as well).

Security Plugins

Security plugins handle all security checks.
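The generation-counter technique described above is easy to sketch. The bit widths and the use of CRC32 as the hash are assumptions for illustration (the real hash plugins differ): colliding names share the high "hash" bits of the key and are told apart by a small counter in the low bits.

```python
# Sketch of generation-counter directory keys: the name's hash forms the
# high bits of the key, and the low GEN_BITS bits hold a counter that
# distinguishes file names whose hashes collide.

import zlib

GEN_BITS = 7  # hypothetical number of key bits sacrificed for the counter

def dir_key(name, existing_keys):
    base = (zlib.crc32(name.encode()) << GEN_BITS) & ((1 << 64) - 1)
    for gen in range(1 << GEN_BITS):       # first free generation wins
        if base + gen not in existing_keys:
            return base + gen
    raise OSError("too many names with this hash value")
```

Sacrificing GEN_BITS bits of hash makes genuine collisions somewhat more likely, which is exactly the "reducing the number of bits used for hashing" cost the text mentions.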
They are normally invoked by file and directory plugins. Example of reading a file:

* Access the pluginid for the file.
* Invoke the read method for the plugin.
* The read method determines the security plugin for the file.
* That security plugin invokes its read check method for determining whether to permit the read.
* The read check method for the security plugin reads file/attributes containing the permissions on the file.
* Since file/attributes are also files, this means invoking the plugin for reading the file/attribute.
* The pluginid for this particular file/attribute for this file happens to be inherited (saving space and centralizing control of it).
* The read method for the file/attribute is coded such that it does not check permissions when called by a security plugin method. (Endless recursion is thereby avoided.)
* The file/attribute plugin employs a decompression algorithm specially designed for efficient decompression of our encoding of ACLs.
* The security plugin determines that the read should be permitted.
* The read method continues and completes.

Item Plugins

The balancing code will be able to balance an item iff it has an item plugin implemented for it. The item plugin will implement each of the methods the balancing code needs (methods such as splitting items, estimating how large the split pieces will be, overwriting, appending to, cutting from, or inserting into the item, etc.). In addition to all of the balancing operations, item plugins will also implement intra-item search methods. V3 of ReiserFS understood the structure of the items it balanced. This made adding new types of items, storing such new security attributes as other researchers might develop, too expensive in coding time, greatly inhibiting their addition to ReiserFS.
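The read path listed above amounts to plugin dispatch with a recursion guard. This hypothetical sketch (invented names, not Reiser4's plugin API) shows the essential moves: reads go through the file's plugin, the permission check reads an attribute that is itself a file, and reads made by the security check skip the check, avoiding endless recursion:

```python
# Sketch of the plugin-dispatch read path: each file carries a plugin id,
# and permission data lives in an attribute that is itself a file.

PLUGINS = {}

class File:
    def __init__(self, plugin_id, data, perm_attr=None):
        self.plugin_id, self.data, self.perm_attr = plugin_id, data, perm_attr

def read(f, by_security_plugin=False):
    return PLUGINS[f.plugin_id](f, by_security_plugin)

def regular_read(f, by_security_plugin):
    if not by_security_plugin and f.perm_attr is not None:
        # the security check reads the permission attribute -- itself a
        # file -- with the recursion guard set, so no further check occurs
        if read(f.perm_attr, by_security_plugin=True) != b"allow":
            raise PermissionError
    return f.data

PLUGINS["regular"] = regular_read

acl = File("regular", b"allow")                  # attribute file
secret = File("regular", b"payload", perm_attr=acl)
assert read(secret) == b"payload"
```

In the real design the attribute's plugin id is inherited and the ACL encoding is compressed; both are omitted here.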
In writing Reiser4 we hoped that there would be a great proliferation in the types of security attributes in ReiserFS if we made adding one a matter not of modification of the balancing code by our most experienced programmers, but of writing an item handler. This is necessary if we are to achieve our goal of making the addition of each new security attribute an order of magnitude or more easier to perform than it is now.

Key Assignment Plugins

When assigning the key to an item, the key assignment plugin is invoked, and it has a key assignment method for each item type. A single key assignment plugin is defined for the whole FS at FS creation time. We know from experience that there is no "correct" key assignment policy; squid has very different needs from average user home directories. Yes, there could be value in varying it more flexibly than just at FS creation time, but we have to draw the line somewhere when deciding what goes into each release....

Node Search and Item Search Plugins

Every node layout has a search method for that layout, and every item that is searched through has a search method for that item. (When doing searches, we search through a node to find an item, and then, for items that contain multiple things to find, we search within the item.)

Putting Your New Plugin To Work Will Mean Recompiling

If you want to add a new plugin, we think your having to ask the sysadmin to recompile the kernel with your new plugin added to it will be acceptable for version 4.0. We will initially code plugin-id lookup as an in-kernel fixed-length array lookup, with methodids as function pointers, and make no provision for post-compilation loading of plugins. Performance, and coding cost, motivates this.
[Cartoon: character almost drowning while another character hands him a plugin]

Without Plugins We Will Drown

People often ask, as ReiserFS grows in features, how we will keep the design from being drowned under the weight of the added complexity, and from reaching the point where it is difficult to work on the code. The infrastructure to support security attributes implemented as files also enables lots of features not necessarily security related. The plugins we are choosing to implement in v4.0 are all security related because of our funding source, but users will add other sorts of plugins, just as they took DARPA's TCP/IP and used it for non-military computers. Only by requiring that all features be implemented in the manner that maximizes code reuse will we keep ReiserFS coding complexity down to where we can manage it over the long term.

Plugins: FS Programming For The Lazy

Most plugins will have only a very few of their features unique to them, and the rest of the plugin will be reused code. What Namesys sees as its role as a DARPA contractor is not primarily supplying a suite of security plugins, though we are doing that, but creating an architectural (not just the license) enabling of lots of outside vendors to efficiently create lots of innovative security plugins that Namesys would never have imagined if working by itself.

Enhancing Security

[Cartoon: superman character complaining about an emergency]

By far most casualties in wars have always been civilians. In future information infrastructure attacks, who will take more damage, civilian or military installations? DARPA is funding us to make all Gnu/Linux computers throughout the world a little bit more resistant to attack.

Fine Graining Security

Good Security Requires Precision In Specification Of Security

Suppose you have a large file with many components. A general principle of security is that good security requires precision of permissions.
When security lacks precision, it increases the burden of being secure; the extent to which users adhere to security requirements in practice is a function of the burden of adhering to them.

Space Efficiency Concerns Motivate Imprecise Security

Many filesystems make it inefficient in space usage to store small components as separate files, for various reasons. Not being separate files means that they cannot have separate permissions. One of the reasons for using overly aggregated units of security is thus space efficiency. ReiserFS currently improves on this by an order of magnitude over most of the existing alternative art. Space efficiency is the hardest of the reasons to eliminate; its elimination makes it that much more enticing to attempt to eliminate the other reasons.

Security Definition Units And Data Access Patterns Sometimes Inherently Don't Align

Applications sometimes want to operate on a collection of components as a single aggregated stream. (Note that commonly two different applications want to operate on data with different levels of aggregation; the infrastructure for solving this security issue will solve that problem as well.)

/etc/passwd As Example

I am going to use the /etc/passwd file as an example, not because I think that other solutions won't solve its problems better, but because its implementation as a single flat file in the early Unixes is a wonderful illustrative example of poorly granularized security whose problems the readers may share my personal experiences with. I hope they will be able to imagine that other, less famous data files could have similar problems. Have you ever tried to figure out just exactly what part of your continually changing /etc/passwd file changed near the time of a break-in? Have you ever wished that you could have a modification time on each field in it?
Have you ever wished that users could change part of it, such as the gecos field, themselves (setuid utilities have been written to allow this, but this is a pedagogical, not a practical, example), but not have the power to change it for other users? There were good reasons why /etc/passwd was first implemented as a single file with one single permission governing the entire file. If we can eliminate them one by one, the same techniques for making finer grained security effective will be of value to other highly secure data files.

Aggregating Files Can Improve The User Interface To Them

Consider the use of emacs on a collection of a thousand small 8-32 byte files, as you might have if you deconstructed /etc/passwd into small files with separable ACLs for every field. It is more convenient in screen real estate, buffer management, and other user interface considerations to operate on them as an aggregation all placed into a single buffer, rather than as a thousand 8-32 byte buffers.

How Do We Write Modifications To An Aggregation

Suppose we create a plugin that aggregates all of the files in a directory into a single stream. How does one handle writes to that aggregation that change the length of the components of that aggregation? Richard Stallman pointed out to me that if we separate the aggregated files with delimiters, then emacs need not be changed at all to acquire an effective interface for large numbers of small files accessed via an aggregation plugin. If /new_syntax_access_path/big_directory_of_small_files/.glued is a plugin that aggregates every file in big_directory_of_small_files with a delimiter separating every file within the aggregation, then one can simply type emacs /new_syntax_access_path/big_directory_of_small_files/.glued, and the filesystem has done all the work emacs needs to be effective at this. Not a line of emacs needs to be changed.
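The delimiter idea above can be sketched as a pair of functions, one for reading the aggregation and one for writing it back. The delimiter and the restriction to length-changing (but not component-adding) edits are assumptions of this sketch, not properties of the real plugin:

```python
# Sketch of delimiter-based aggregation: reading ".glued" concatenates the
# directory's files with a delimiter; writing it back splits on the
# delimiter, so edits that change a component's length are handled naturally.

DELIM = b"\n::\n"  # an assumed delimiter that must not occur in the files

def read_glued(files):
    """files: ordered dict name -> bytes. Returns the aggregated stream."""
    return DELIM.join(files.values())

def write_glued(files, stream):
    """Split the edited stream back into the component files, in order."""
    parts = stream.split(DELIM)
    assert len(parts) == len(files), "adding/removing components not sketched"
    for name, part in zip(files, parts):
        files[name] = part
```

Each component keeps its own permissions because it remains a separate file; only the editor's view is aggregated.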
One needs to be able to choose different delimiting syntax for different aggregation plugins so that one can, for say the passwd file, aggregate subdirectories into lines, and files within those subdirectories into colon-separated fields within the line. XML would benefit from yet other delimiter construction rules. (We have been told by Philipp Guehring of LivingXML.NET that ReiserFS is higher performance than any database for storing XML, so this issue is not purely theoretical.)

Aggregation Is Best Implemented As Inheritance

In summary, to be able to achieve precision in security we need to have inheritance with specifiable delimiters, and we need whole-file inheritance to support ACLs.

One Plugin Using Delimiters That Resemble sys_reiser4() Syntax

We provide the infrastructure for constructing plugins that implement arbitrary processing of writes to inheriting files, but we also supply one generic inheriting-file plugin that intentionally uses delimiters very close to the sys_reiser4() syntax. We will document the syntax more fully when that code is working; for now, syntax details are in the comments in the file invert.c in the source code.

API Suitable For Accessing Files That Store Security Attributes

A new system call sys_reiser4() will be implemented to support applications that don't have to be fooled into thinking that they are using POSIX. Through this entry point a richer set of semantics will access the same files that are also accessible using POSIX calls. Reiser4() will not implement more than hierarchical names. A full set-theoretic naming system as described on our future vision page will not be implemented before SSN Reiserfs is implemented (Distributed Reiserfs is our distributed filesystem, Semi-Structured Naming Reiserfs is our enhanced semantics; whether we implement Distributed Reiserfs or SSN Reiserfs first depends on which sponsors we find ;-) ).
Reiser4() will implement all features necessary to access ACLs as files/directories rather than as something neither file nor directory. These include opening and closing transactions, performing a sequence of I/Os in one system call, and accessing files without use of file descriptors (necessary for efficient small I/O). Reiser4() will use a syntax suitable for evolving into SSN Reiserfs syntax with its set-theoretic naming.

Flaws In Traditional File API When Applied To Security Attributes

Security related attributes tend to be small. The traditional filesystem API for reading and writing files has these flaws in the context of accessing security attributes:

* Creating a file descriptor is excessive overhead and not useful when accessing an 8 byte attribute.
* A system call for every attribute accessed is too much overhead when accessing lots of little attributes.
* Lacking constraints: it is important to constrain what is written to the attribute, often in complex ways.
* Lacking atomic semantics: often one needs to update multiple attributes as one action that is guaranteed to either fully succeed or fully fail.

The Usual Resolution Of These Flaws Is A One-Off Solution

The usual response to these flaws is that people adding security related and other attributes create a set of methods unique to their attributes, plus non-reusable code to implement those methods, in which their particular attributes are accessed and stored not using the methods for files, but using their particular methods for that attribute. Their particular API for that attribute typically does a one-off instantiation of lightweight, single-system-call, write-constrained, atomic access, with no code being reusable by those who want to modify file bodies. It is basic and crucial to system design to decompose desired functionality into reusable, orthogonal, separated components.
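The kind of batched, constrained, atomic access the text argues for can be illustrated with a purely invented API (nothing here is the reiser4() interface): several small attribute updates are submitted in one call and applied all-or-nothing, with constraints checked against the proposed result.

```python
# Sketch of batched atomic attribute updates with constraints: the whole
# batch either passes every constraint and is applied, or leaves the
# attributes untouched.

def batch_update(attrs, updates, constraints=()):
    """attrs: dict name -> value (the stored attributes).
    updates: dict of proposed new values.
    constraints: callables receiving the proposed full state."""
    proposed = {**attrs, **updates}
    for check in constraints:
        if not check(proposed):
            raise ValueError("constraint rejected the batch")
    attrs.update(updates)      # "commit": apply the whole batch at once
```

Each adjective in the text maps to one feature here: batched (one call, many updates), constrained (the `constraints` hooks), atomic (no partial application).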
Persons designing security attributes are typically doing it without the filesystem that they want offering them a proper foundation and toolkit. They need more help from us core FS developers. Linus said that we can have a system call to use as our experimental plaything in this. With what I have in mind for the API, one rather flexible system call is all we want for creating atomic, lightweight, batched, constrained accesses to files, with each of those adjectives to accesses being an orthogonal optional feature that may or may not be invoked in a particular instance of the new system call.

One-Off Solutions Are A Lot of Work To Do A Lot Of

Looking at the coin from the other side, we want to make it an order of magnitude less work to add features to ReiserFS, so that both users and Namesys can add at least an order of magnitude more of them. To verify that it is truly more extensible you have to do some extending, and our DARPA funding motivates us to instantiate most of those extensions as new security features. This system call's syntax enables attributes to be implemented as a particular type of file. It avoids uglifying the semantics with two APIs for two supposedly different kinds of objects that don't truly need different treatment. All of the special features that are useful for accessing particular attributes are also available for use on files. It has symmetry, and its features have been fully orthogonalized. There is nothing particularly interesting about this system call to a languages specialist (its ideas were explored decades ago, except by filesystem developers) until SSN Reiserfs, when we will further evolve it into a set-theoretic syntax that deconstructs tuple-structured names into hierarchy and vicinity set intersection. That is described at www.namesys.com/whitepaper.html

Steps For Creating A Security Attribute

You can create a new security attribute by:

* Defining a pluginid.
* Composing a set of methods for the plugin from ones you create or reuse from other existing plugins.
* Defining a set of items that act as the storage containers of the object, or reusing existing items from other plugins (e.g. regular files).
* Implementing item handlers for all of the new items you create.
* Creating a key assignment algorithm for all of the new items.

reiser4() System Call Description

The reiser4() system call (still being debugged at the time of writing) executes a sequence of commands separated by commas. Assignment and transaction are the commands supported in Reiser4(); more commands will appear in SSN Reiserfs. <- and <<- are two of the assignment operators.

lhs (assignment target) values:

* /..../process/range/(offset<-(loff_t),last_byte<-(loff_t)) assigns (writes) to the buffer starting at address offset in the process address space, ending at last_byte. (The assignment source may be smaller or larger than the assignment target.) Representation of offset and last_byte is left to the coder to determine. It is an issue that will be of much dispute and little importance. Notice that / is used to indicate that the order of the operands matters; see the future vision whitepaper for details of why this is appropriate syntax design. Note the lack of a file descriptor.
* /filename assigns to the file named filename.
* /filename/..../range/(offset<-(loff_t),last_byte<-(loff_t)) writes to the body, starting at offset, ending not past last_byte.
* /filename/..../range/(offset<-(loff_t)) writes to the body starting at offset.

rhs (assignment source) values:

* /..../process/range/(offset<-(loff_t),last_byte<-(loff_t)) reads from the buffer starting at address offset in the process address space, ending at last_byte. Representation of offset and last_byte is left to the coder to determine, as it is an issue that will be of much dispute and little importance.
* /filename reads the entirety of the file named filename.
* /filename/..../range/(offset<-(loff_t),last_byte<-(loff_t)) reads from the body, starting at offset, ending not past last_byte.
* /filename/..../range/(offset<-(loff_t)) reads from the body starting at offset until the end.
* /filename/..../stat/owner reads from the ownership field of the stat data (stat data is that which is returned by the stat() system call (owner, permissions, etc.) and stored on a per-file basis by the FS).

Note that "...." and "process" are style conventions for the name of a hidden subdirectory implementing methods and accessing metadata supported by a plugin. It is possible to rename it, etc. We had a discussion about whether to instead use names that could not clash with any legitimate name likely to be used by users. Vladimir Demidov suggested that cryptic names have historically harmed the acceptance of several languages, and so it was realized that being novice-unfriendly in the naming was worse than risking a name collision, especially since a collision could be cured by using rename on "...." and "process" for the few cases where it is necessary.

Constraints

(Note: this is not yet coded.) Another way security may be insufficiently fine grained is in values: it can be useful to allow persons to change data, but only within certain constraints. For this project we will implement plugins; one type of plugin will be write constraints. Write constraints are invoked upon write to a file; if they return non-error, then the write is allowed. We will implement two trivial sample write-constraint plugins. One will be in the form of a kernel function, loadable as a kernel module, which returns non-error (thus allowing the write) if the file consists of the strings "secret" or "sensitive" but not "top-secret". The other, which does exactly the same, will be in the form of a perl program residing in a file and executed in user space.
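The sample constraint just described is simple enough to sketch directly (in Python here rather than as a kernel module or perl script; the exact matching semantics are an assumption): the proposed contents are accepted only if they contain "secret" or "sensitive" but not "top-secret".

```python
# Sketch of the sample write-constraint plugin: the write is applied only
# if the constraint returns success on the proposed new contents.

def write_constraint(proposed: bytes) -> bool:
    text = proposed.decode(errors="replace")
    if "top-secret" in text:        # checked first: it contains "secret"
        return False
    return "secret" in text or "sensitive" in text

def constrained_write(file_state, proposed):
    """Apply the write only if the constraint allows it."""
    if not write_constraint(proposed):
        raise PermissionError("write constraint rejected the contents")
    file_state["body"] = proposed
```

Note the ordering subtlety: "top-secret" contains the substring "secret", so the rejection test must run before the acceptance test.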
Use of kernel functions will have performance advantages, particularly for small functions, but severe disadvantages in power of scripting, flexibility, and ability to be installed by non-secure sources. Both types of plugins will have their place. Note that ACLs will also embody write constraints. We will implement both constraints that are compiled into the kernel and constraints that are implemented as user-space processes. Specifically, we will implement a plugin that executes an arbitrary constraint contained in an arbitrarily named file as a user-space process, passes the proposed new file contents to that process as standard input, and iff the process exits without error allows the write to occur. It can be useful to have read constraints as well as write constraints.

Auditing

(Note: this is not yet coded.) We will implement a plugin that notifies administrators by email when access is made to files, e.g. read access. With each plugin implemented, creating additional plugins becomes easier as the available toolkit is enriched. Auditing constitutes a major additional security feature, yet it will be easy to implement once the infrastructure to support it exists. (It would be substantial work to implement it without that infrastructure.) The scope of this project is not the creation of plugins themselves, but the creation of the infrastructure that plugin authors would find useful. We want to enable future contributors to implement more secure systems on the Gnu/Linux platform, not implement them ourselves. By laying a proper foundation and creating a toolkit for them, we hope to reduce the cost of coding new security attributes for those who follow us by an order of magnitude. Employing a proper set of well orthogonalized primitives also changes the addition of these attributes from being a complexity burden upon the architecture into being an empowering extension of the architecture.
Increasing the Allowed Granularity of Security [Illustration: a man holding a sieve; only objects of a certain size go through.] (This feature is not yet coded.) Inheritance of security attributes is important to providing flexibility in their administration. We have spoken about making security more fine grained, but sometimes it needs to be larger grained. Sometimes a large number of files are logically one unit with regard to their security, and it is desirable to have a single point of control over their security. Inheritance of attributes is the mechanism for implementing that. Security administrators should have the power to choose whatever units of security they desire without having to distort them to make them correspond to semantic units. Inheritance of file bodies using aggregation plugins allows the units of security to be smaller than files; inheritance of attributes allows them to be larger than files. Encryption On Commit Currently, encrypted files suffer severely in their write performance when implemented using schemes that encrypt at every write() rather than at every commit to disk. We encrypt on flush, such that a file with an encryption plugin id is encrypted not at the time of write, but at the time of flush to disk. Encryption is implemented as a special form of repacking on flush, and it occurs for any node which has its CONTAINS_ENCRYPTED_DATA state flag set on it. Conclusion Reiser4 offers a dramatically better infrastructure for creating new filesystem features. Files and directories have all of the features needed to make it unnecessary for file attributes to be something different from files. The effectiveness of this new infrastructure is tested using a variety of new security features. Performance is greatly improved by the use of dancing trees, wandering logs, allocate on flush, a repacker, and encryption on commit. It was an important question whether we could increase the level of abstraction in our design without harming performance.
Reiser4 gives you BOTH the most cleanly abstracted storage AND the highest performance storage of any filesystem. Citations: * [Gray93] Jim Gray and Andreas Reuter. "Transaction Processing: Concepts and Techniques". Morgan Kaufmann Publishers, Inc., 1993. Old but good textbook on transactions. Available at http://www.mkp.com/books_catalog/catalog.asp?ISBN=1-55860-190-2 * [Hitz94] D. Hitz, J. Lau and M. Malcolm. "File system design for an NFS file server appliance". Proceedings of the 1994 USENIX Winter Technical Conference, pp. 235-246, San Francisco, CA, January 1994. Available at http://citeseer.nj.nec.com/hitz95file.html * [TR3001] D. Hitz. "A Storage Networking Appliance". Tech. Rep. TR3001, Network Appliance, Inc., 1995. Available at http://www.netapp.com/tech_library/3001.html * [TR3002] D. Hitz, J. Lau and M. Malcolm. "File system design for an NFS file server appliance". Tech. Rep. TR3002, Network Appliance, Inc., 1995. Available at http://www.netapp.com/tech_library/3002.html * [Ousterh89] J. Ousterhout and F. Douglis. "Beating the I/O Bottleneck: A Case for Log-Structured File Systems". ACM Operating System Reviews, Vol. 23, No. 1, pp. 11-28, January 1989. Available at http://citeseer.nj.nec.com/ousterhout88beating.html * [Seltzer95] M. Seltzer, K. Smith, H. Balakrishnan, J. Chang, S. McMains and V. Padmanabhan. "File System Logging versus Clustering: A Performance Comparison". Proceedings of the 1995 USENIX Technical Conference, pp. 249-264, New Orleans, LA, January 1995. Available at http://citeseer.nj.nec.com/seltzer95file.html * [Seltzer95Supp] M. Seltzer. "LFS and FFS Supplementary Information". 1995. http://www.eecs.harvard.edu/~margo/usenix.195/ * [Ousterh93Crit] J. Ousterhout. "A Critique of Seltzer's 1993 USENIX Paper". http://www.eecs.harvard.edu/~margo/usenix.195/ouster_critique1.html * [Ousterh95Crit] J. Ousterhout. "A Critique of Seltzer's LFS Measurements". http://www.eecs.harvard.edu/~margo/usenix.195/ouster_critique2.html * [SwD96] A.
Sweeney, D. Doucette, W. Hu, C. Anderson, M. Nishimoto and G. Peck. "Scalability in the XFS File System". Proceedings of the 1996 USENIX Technical Conference, pp. 1-14, San Diego, CA, January 1996. Available at http://citeseer.nj.nec.com/sweeney96scalability.html * [VelskiiLandis] G.M. Adel'son-Vel'skii and E.M. Landis, An algorithm for the organization of information, Soviet Math. Doklady 3, 1259-1262, 1962. This paper on AVL trees can be thought of as the founding paper of the field of storing data in trees. Those not conversant in Russian will want to read the [Lewis and Denenberg] treatment of AVL trees in its place. [Wood] contains a modern treatment of trees. * [Apple] Inside Macintosh, Files, by Apple Computer Inc., Addison-Wesley, 1992. Employs balanced trees for filenames; it was an interesting filesystem architecture for its time in a number of ways, but now its problems with internal fragmentation have become more severe as disk drives have grown larger. I look forward to the replacement they are working on. * [Bach] Maurice J. Bach. "The Design of the Unix Operating System". 1986, Prentice-Hall Software Series, Englewood Cliffs, NJ. Superbly written but sadly dated, contains detailed descriptions of the filesystem routines and interfaces in a manner especially useful for those trying to implement a Unix compatible filesystem. See [Vahalia]. * [BLOB] R. Haskin, Raymond A. Lorie: On Extending the Functions of a Relational Database System. SIGMOD Conference 1982: 207-212 (body of paper not on web). Reiser4 obsoletes this approach. * [Chen] Chen, P.M., Patterson, David A., A New Approach to I/O Performance Evaluation---Self-Scaling I/O Benchmarks, Predicted I/O Performance, 1993 ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems, also available on Chen's web page. * [C-FFS] Ganger, Gregory R., Kaashoek, M. Frans. "Embedded Inodes and Explicit Grouping: Exploiting Disk Bandwidth for Small Files."
A very well written paper focused on 1-10k file size issues; they use some similar notions (most especially their concept of grouping compared to my packing localities). Note that they focus on the 1-10k file size range, and not the sub-1k range. The 1-10k range is the weak point in ReiserFS V3 performance. The page with a link to the PostScript paper is available at http://amsterdam.lcs.mit.edu/papers/cffs.html * [ext2fs] by Remi Card. Extensive information and source code are available. Probably our toughest current competitor, though it is showing its age, and recent enhancements of it (journaling, htrees, etc.) have not been performance effective. It embodies both the strengths and weaknesses of the incrementalist approach to coding, and substantially resembles the older FFS filesystem from BSD. * [FFS] M. McKusick, W. Joy, S. Leffler, R. Fabry. "A Fast File System for UNIX". ACM Transactions on Computer Systems, Vol. 2, No. 3, pp. 181-197, August 1984. Describes the implementation of a filesystem which employs parent directory location knowledge in determining file layout. It uses large blocks for all but the tail of files to improve I/O performance, and uses small blocks called fragments for the tails so as to reduce the cost due to internal fragmentation. Numerous other improvements are also made to what was once the state-of-the-art. FFS remains the architectural foundation for many current block allocation filesystems, and was later bundled with the standard Unix releases. Note that unrequested serialization and the use of fragments place it at a performance disadvantage to ext2fs, though whether ext2fs is thereby made less reliable is a matter of dispute that I take no position on (Reiser4 is an atomic filesystem, which is a different level of reliability entirely). Available at http://citeseer.nj.nec.com/mckusick84fast.html. * [Ganger] Gregory R. Ganger, Yale N. Patt. "Metadata Update Performance in File Systems".
(Abstract only) * [Gifford] Describes a filesystem enriched to have more than hierarchical semantics; he shares many goals with this author, so forgive me for thinking his work worthwhile. If I had to suggest one improvement in a sentence, I would say his semantic algebra needs closure. (Postscript only). * [Hitz, Dave] A rather well designed filesystem optimized for NFS and RAID in combination. Note that RAID increases the merits of write-optimization in block layout algorithms. Available at http://www.netapp.com/technology/level3/3002.html * [Holton and Das] Holton, Mike, and Das, Raj. "The XFS space manager and namespace manager use sophisticated B-Tree indexing technology to represent file location information contained inside directory files and to represent the structure of the files themselves (location of information in a file)". Note that it is still a block (extent) allocation based filesystem; no attempt is made to store the actual file contents in the tree. It is targeted at the needs of the other end of the file size usage spectrum from ReiserFS, and is an excellent design for that purpose (though most filesystems including Reiser4 do well at writing large files, and I think it is medium-sized and smaller files where filesystems can substantively differentiate themselves.) SGI has also traditionally been a leader in resisting the use of unrequested serialization of I/O. Unfortunately, the paper is a bit vague on details. Available at http://www.sgi.com/Technology/xfs-whitepaper.html * [Howard] Howard, J.H., Kazar, M.L., Menees, S.G., Nichols, D.A., Satyanarayanan, M., Sidebotham, R.N., West, M.J. "Scale and Performance in a Distributed File System". ACM Transactions on Computer Systems, 6(1), February 1988. A classic benchmark; it was too CPU bound to effectively stress ext2fs and ReiserFS, and is no longer very effective for modern filesystems. * [Knuth] Knuth, D.E., The Art of Computer Programming, Vol.
3 (Sorting and Searching), Addison-Wesley, Reading, MA, 1973, the earliest reference discussing trees storing records of varying length. * [LADDIS] Wittle, Mark, and Bruce, Keith. "LADDIS: The Next Generation in NFS File Server Benchmarking", Proceedings of the Summer 1993 USENIX Conference, July 1993, pp. 111-128 * [Lewis and Denenberg] Lewis, Harry R., Denenberg, Larry. "Data Structures & Their Algorithms", HarperCollins Publishers, NY, NY, 1991, an algorithms textbook suitable for readers wishing to learn about balanced trees and their AVL predecessors. * [McCreight] McCreight, E.M., Pagination of B*-trees with variable length records, Commun. ACM 20 (9), 670-674, 1977, describes algorithms for trees with variable length records. * [McVoy and Kleiman] The implementation of write-clustering for Sun's UFS. Available at http://www.sun.ca/white-papers/ufs-cluster.html * [OLE] "Inside OLE" by Kraig Brockschmidt, discusses Structured Storage, abstract only. Structured storage is what you get when application developers need features to better manage the storage of objects on disk by the applications they write, and the filesystem group at their company can't be bothered with them. Miserable performance, miserable semantics. Available at http://www.microsoft.com/mspress/books/abs/5-843-2b.htm. * [Ousterhout] J.K. Ousterhout, H. Da Costa, D. Harrison, J.A. Kunze, M.D. Kupfer, and J.G. Thompson. "A Trace-driven Analysis of the UNIX 4.2BSD File System". In Proceedings of the 10th Symposium on Operating Systems Principles, pages 15--24, Orcas Island, WA, December 1985.
* [NTFS] "Inside the Windows NT File System" the book is written by Helen Custer, NTFS is architected by Tom Miller with contributions by Gary Kimura, Brian Andrew, and David Goebel, Microsoft Press, 1994, an easy to read little book, they fundamentally disagree with me on adding serialization of I/O not requested by the application programmer, and I note that the performance penalty they pay for their decision is high, especially compared with ext2fs. Their FS design is perhaps optimal for floppies and other hardware eject media beyond OS control. A less serialized higher performance log structured architecture is described in [Rosenblum and Ousterhout]. That said, Microsoft is to be commended for recognizing the importance of attempting to optimize for small files, and leading the OS designer effort to integrate small objects into the file name space. This book is notable for not referencing the work of persons not working for Microsoft, or providing any form of proper attribution to previous authors such as [Rosenblum and Ousterhout]. Though perhaps they really didn't read any of the literature and it explains why theirs is the worst performing filesystem in the industry.... * [Peacock] K. Peacock. "The CounterPoint Fast File System". Proceedings of the Usenix Conference Winter 1988 * [Pike] Rob Pike and Peter Weinberger, The Hideous Name, USENIX Summer 1985 Conference Proceedings, pp. 563, Portland Oregon, 1985. Short, informal, and drives home why inconsistent naming schemes in an OS are detrimental. Available at http://achille.cs.bell-labs.com/cm/cs/doc/85/1-05.ps.gz. His discussion of naming in plan 9: http://plan9.bell-labs.com/plan9/doc/names.html * [Rosenblum and Ousterhout] M. Rosenblum and J. Ousterhout. "The Design and Implementation of a Log-Structured File System". ACM Transactions on Computer Systems, Vol. 10, No. 1, pp. 26-52, February 1992. Available at http://citeseer.nj.nec.com/rosenblum91design.html. 
This paper was quite influential in a number of ways on many modern filesystems, and the notion of using a cleaner may be applied to a future release of ReiserFS. There is an interesting on-going debate over the relative merits of FFS vs. LFS architectures, and the interested reader may peruse http://www.scriptics.com/people/john.ousterhout/seltzer93.html and the arguments by Margo Seltzer it links to. * [Snyder] "tmpfs: A Virtual Memory File System" discusses a filesystem built to use swap space and intended for temporary files; due to a complete lack of disk synchronization it offers extremely high performance. * [Vahalia] Uresh Vahalia, "Unix Kernel Internals" * [Reiser93] Reiser, Hans T., Future Vision Whitepaper, 1984, Revised 1993. Available at http://www.namesys.com/whitepaper.html. [[category:Reiser4]] [[category:Formatting-fixes-needed]] Reasons why Reiser4 is great for you: * Reiser4 is the fastest filesystem, and here are the benchmarks. * Reiser4 is an atomic filesystem, which means that your filesystem operations either entirely occur, or they entirely don't, and they don't corrupt due to half occurring. We do this without significant performance losses, because we invented algorithms to do it without copying the data twice. * Reiser4 uses dancing trees, which obsolete the balanced tree algorithms used in databases (see farther down). This makes Reiser4 more space efficient than other filesystems because we squish small files together rather than wasting space due to block alignment like they do. It also means that Reiser4 scales better than any other filesystem. Do you want a million files in a directory, and want to create them fast? No problem. * Reiser4 is based on plugins, which means that it will attract many outside contributors, and you'll be able to upgrade to their innovations without reformatting your disk.
If you like to code, you'll really like plugins.... * Reiser4 is architected for military-grade security. You'll find it is easy to audit the code, and that assertions guard the entrance to every function. V3 of reiserfs is used as the default filesystem for SuSE, Lindows, FTOSX, Libranet, Xandros and Yoper. We don't touch the V3 code except to fix a bug, and as a result we don't get bug reports for the current mainstream kernel version. It shipped before the other journaling filesystems for Linux, and is the most stable of them as a result of having been out the longest. We must caution that just as Linux 2.6 is not yet as stable as Linux 2.4, it will also be some substantial time before V4 is as stable as V3.

Table of Contents

1. Software Engineering Based Reiser4 Design Principles
   1. Equal Source Code Access Is A Civil Right
   2. Software Libre Takes More Than A License --- It Takes A Design
   3. Why Limit Interactions With Objects Strictly?
2. Basic Semantics
   1. Files
      1. The Software Engineering Lurking Below File Plugins
   2. Names and Objects
   3. Ordering of Name Components
   4. Directories
      1. The Unix Directory Plugin
      2. Some Historical Details Of Design Flaws In The Unix Directory Interface
      3. Directories Are Unordered
      4. Files That Are Also Directories
      5. Hidden Directory Entries
   5. New Security Attributes and Set Theoretic Semantic Purity
      1. Minimizing Number Of Primitives Is Important In Abstract Constructions
      2. Can We Get By Using Just Files and Directories (Composing Streams And Attributes From Files And Directories)?
      3. List Of Features Needed To Get Attribute And Stream Functionality From Files And Directories
3. Basic Tree Concepts
   1. Trees, Nodes, and Items
      1. Definition of Tree
      2. Fine Points of the Definition
      3. Graphs vs. Trees
      4. Ordering The Tree Aids Searching Through It
         1. Keys
         2. Choosing Which Subtree
   2. Nodes
      1. Leaves, Twigs, and Branches
      2. Size of Nodes
      3. Sharing Blocks Saves Space
   3. Items
      1. The Structure of an Item
      2. Types Of Items
      3. Units
   4. What the Default Node Formats For ReiserFS 4.0 Look Like
4. Tree Design Concepts
   1. Height Balancing versus Space Balancing
   2. Three principle considerations in tree design
   3. Fanout
   4. What Are B+Trees, and Why Are They Better than B-Trees
      1. B+Trees Have Higher Fanout Than B-Trees
      2. Cache Design Principles
         1. Reiser's Untie The Uncorrelated Principle of Cache Design
         2. Reiser's Maximize The Variance Principle of Cache Design
         3. Pointers To Nodes Have A Higher Average Temperature Than The Nodes They Point To
         4. Segregating By Temperature Directly
         5. BLOBs Unbalance the Tree, Reduce Segregation of Pointers and Data, and Thereby Reduce Performance
   5. Dancing Trees Are Faster Than Balanced Trees
      1. If It Is In RAM, Dirty, and Contiguous, Then Squeeze It ALL Together Just Before Writing
      2. Procrastination Leads To Wiser Decisions: Allocate on Flush
5. Reiser4 The Atomic Filesystem
   1. Reducing The Damage of Crashing
   2. A Brief History Of How Filesystems Have Handled Crashes
   3. Filesystem Checkers
   4. Fixed Location Journaling
   5. Wandering Logs
   6. Writing Twice May Be Optimal Sometimes
   7. Committing
   8. Journalling optimizations
      1. Copy-on-capture
      2. Steal-on-capture
6. Repacker
7. Plugins
   1. 8 Kinds of Plugins Make Reiser4 The Most Tweakable Filesystem Going
      1. File Plugins
      2. Directory Plugins
      3. Hash Plugins
      4. Security Plugins
      5. Item Plugins
      6. Key Assignment Plugins
      7. Node Search and Item Search Plugins
      8. Putting Your New Plugin To Work Will Mean Recompiling
   2. Without Plugins We Will Drown
   3. Plugins: FS Programming For The Lazy
8. Enhancing Security
   1. Fine Graining Security
      1. Good Security Requires Precision In Specification Of Security
      2. Space Efficiency Concerns Motivate Imprecise Security
      3. Security Definition Units And Data Access Patterns Sometimes Inherently Don't Align
      4. /etc/passwd As Example
      5. Aggregating Files Can Improve The User Interface To Them
      6. How Do We Write Modifications To An Aggregation
      7. Aggregation Is Best Implemented As Inheritance
      8. One Plugin Using Delimiters That Resemble sys_reiser4() Syntax
   2. API Suitable For Accessing Files That Store Security Attributes
      1. Flaws In Traditional File API When Applied To Security Attributes
      2. The Usual Resolution Of These Flaws Is A One-Off Solution
      3. One-Off Solutions Are A Lot of Work To Do A Lot Of
   3. Steps For Creating A Security Attribute
      1. reiser4() System Call Description
      2. Constraints
      3. Auditing
   4. Increasing the Allowed Granularity of Security
   5. Encryption On Commit
9. Conclusion
10. Citations

Software Engineering Based Reiser4 Design Principles

Equal Source Code Access Is A Civil Right

Copyright and patent laws were invented to give you an incentive to share your knowledge with the rest of the world in return for a limited-time monopoly on what you shared. That is not the way it works with software, though, because software companies are allowed to keep their source code secret, but are still given monopoly rights over their software. There is little meaningful sharing of knowledge when only binaries are shared with the world, and all the rest is kept secret. The reasons for the existence of copyright and patent laws have been forgotten, their workings have been twisted, and greed and turf defense are what remain of them. Monopoly interests have taken laws intended to promote progress in the arts and sciences, and now use them to further their own control over us by ensuring that innovations not theirs cannot enter the market for improvements to software. Think of software objects as forming a society, not yet at the level of an AI society, but still a group of programs interacting, and choosing whether to interact, with each other. Think of social lockout, whether it be in the form of racial discrimination as in the civil rights movement, Mercantilism as happened a few centuries ago, or the endless other forms of division in human society.
Is it so surprising that this evil casts its shadow on cyberspace? Is it so surprising that our cybershadows also find ways to engage in social lockout of others? Most of the cyber-world of software lives under tyranny today. We are part of a movement to create a free cyber-world we can all participate equally in. Namesys does not oppose copyright laws as they were invented (14-year monopolies which disclosed everything that was temporarily monopolized); it opposes copyright laws as they have been twisted. Namesys opposes unlimited-time monopolies which disclose nothing, and lock out all other inventors. Many others in this movement are opposed to copyright law, even the version of it in which it was first created. We feel they are not acknowledging that a trade-off is being made, and that this trade-off has value. Yet still we choose to give our software away for free for use with software that is given away for free (e.g. Gnu/Linux). Since we don't have a lot of illusions about our ability to entirely change the world, and it is amusing to sell free software, for those who do not want to disclose their software and do not want to give it away for free, we charge a license fee and let them keep their improvements to our software without sharing them. These fees help substantially in allowing us to survive as an organization. We don't make nearly as much money as we would from charging everyone for usage rights, but we do make just enough to get by, and that is important. ;-) We don't really feel that everyone should follow our example and make their software no charge for most users (it is too hard to survive fiscally doing this), but we do think that everyone should disclose their source code, and no one should design their software to exclude working with other software (e.g. Microsoft's Palladium which makes such a mockery of Athena).
Software Libre Takes More Than A License --- It Takes A Design Making the source code available to you is not enough by itself to bring you all of the possible benefits of software libre. Many file systems are so difficult to modify that only someone who has worked with the code for years finds it feasible to modify it, and even then small changes can take months of labor due to their ripple effects on the other code and the difficulties of dealing with disk format changes. This is why we have a plugin based architecture in Reiser4, so that it is not just possible, but easy, to improve the software. Imagine that you were an experimental physicist who had spent his life using only the tools that were in his local hardware store. Then one day you joined a major research lab with a machine shop and a whole bunch of other physicists. All of a sudden you are not using just whatever tools the large tool companies who have never heard of you have made for you. You are now part of a cooperative of physicists all making your own tools, swapping tools with each other, suddenly empowered to have tools that are exactly what you want them to be, or even merely exactly what your colleagues want them to be, rather than what some big tool company, that has to do a market analysis before giving you what you want, wants them to be. That is the transition you will make when you go from version 3 to version 4 of ReiserFS. The tools your colleagues and sysadmins (your machinists) make are going to be much better for what you need. Why Limit Interactions With Objects Strictly? You may wonder why the design we will present is so highly structured, why every object is allowed to control what is done to it by its providing a limited interface, and why we pass requests to objects to do things rather than doing things directly to the object? Surely we limit our functionality by doing so, yes? Indeed we do, but is there a reason why the price is worth paying? 
Is there something that becomes crucial as complexity grows? Chaos theory offers the answer. If you disturb one thing, and disturbing that thing inherently disturbs another thing, which in turn disturbs the first thing plus maybe a whole bunch of other things, and those things all disturb the first thing again, and...., etc., you get what chaos theory calls a feedback loop. These loops have a marvelous tendency for the end effect of the disturbance to be incalculable, and our inability to calculate such loops is perhaps a significant aspect of our being mere mortals. Of course, as you probably know, most programmers want to be gods, and when they are unable to know what the effect will be of a change they make to their code, they dislike this. As a result, they go to great lengths to reduce the tendency of code changes to the design of one object to have ripple effects upon other objects. A vitally important way to do this is to have very strictly defined interfaces to objects, and for the designer of each object to be able to know that the interface will never be violated when he writes it. This is called "object oriented design", or "structured programming", and if used well it can do a lot to reduce a type of chaotic behavior known as bugs. ;-) Verifying the avoidance of interactions that violate the design for an object is a key task in security auditing (inspecting the code to see if it has security holes). The expressive power of an information system is proportional not to the number of objects that get implemented for it, but instead is proportional to the number of possible effective interactions between objects in it. (Reiser's Law Of Information Economics) This is similar to Adam Smith's observation that the wealth of nations is determined not by the number of their inhabitants, but by how well connected they are to each other.
He traced the development of civilization throughout history, and found a consistent correlation between connectivity via roads and waterways, and wealth. He also found a correlation between specialization and wealth, and suggested that greater trade connectivity makes greater specialization economically viable. You can think of namespaces as forming the roads and waterways that connect the components of an operating system. The cost of these connecting namespaces is influenced by the number of interfaces that they must know how to connect to. That cost is, if they are not clever to avoid it, N times N, where N is the number of interfaces, since they must write code that knows how to connect every kind to every kind. One very important way to reduce the cost of fully connective namespaces is to teach all the objects how to use the same interface, so that the namespace can connect them without adding any code to the namespace. Very commonly, objects with different interfaces are segregated into different namespaces. If you have two namespaces, one with N objects, and another with M objects, the expressive power of the objects they connect is proportional to (N times N) plus (M times M), which is less than (N plus M) times (N plus M). Try it on a calculator for some arbitrary N and M. Usually the cost of inventing the namespaces is much less than the cost of the users creating all the objects. This is what makes namespaces so exciting to work with: you can have an enormous impact on the productivity of the whole system just by being a bit fanatical in insisting on simplicity and consistency in a few areas. Please remember this analysis later when we describe why we implement everything to support a "file" or "directory" interface, and why we aren't eager to support objects with unnecessarily different namespaces/interfaces --- such as "attributes" that cannot interact with files in all the same ways that files can interact with files. 
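The arithmetic in the argument above can be checked directly; `expressive_power` is an invented helper name for the text's N-times-N estimate:

```python
def expressive_power(*namespace_sizes):
    """Estimate interactions as the square of the number of mutually
    connectable interfaces, per the N-times-N argument in the text."""
    return sum(n * n for n in namespace_sizes)

# two segregated namespaces of 10 and 15 interfaces vs. one unified namespace
n, m = 10, 15
segregated = expressive_power(n, m)   # N^2 + M^2 = 325
unified = expressive_power(n + m)     # (N + M)^2 = 625
# the unified namespace always wins, by exactly the cross term 2*N*M
```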
Basic Semantics To interact with an object you name it, and you say what you want it to do. The filesystem takes the name you give, and looks through things we call directories to find the object, and then gives the object your request to do something. Files [Illustration: a character holding an object that looks like a sequence.] A file is something that tries to look like a sequence of bytes. You can read the bytes, and write the bytes. You can specify what byte to start to read/write from (the offset), and the number of bytes to read/write (the count). [Diagram needed]. You can also cut bytes off of the end of the file. [Illustration: a character sawing off the end of a file.] Cutting bytes out of the middle or the beginning of a file, and inserting bytes into the middle of a file, are not permitted by any of our current file plugins, all of which implement fairly ancient Unix file semantics, but this is likely to change someday. The Software Engineering Lurking Below File Plugins Your interactions with a file are handled by the file's "plugin". These interactions are structured (in programming, such structures are generally called "interfaces") into a set of limited and defined interactions. (We are too lazy to perform the infinite work of programming plugins to handle infinite types of interactions.) Each way you can interact with a plugin is called a "method". A plugin is composed as a set of such methods. Among programmers, laziness is considered the highest art form, and we do our best to express our souls in this art. This is why we have layers and layers of laziness built into our plugin architecture. Each method is composed from a library of functions we thought would be useful in constructing plugin methods.
Each plugin is composed from a library of methods used by plugins, and a plugin can be considered a one-to-one mapping (that's where you have two sets of things, and for every member of one set, you specify a member of the other set as its match) of every way of interacting with the plugin to a method handling it. For every file, there is a file pluginid. Whenever you attempt to interact with a file, we take the name of the file, find the pluginid for the file, and inside the kernel we have an array of plugins [diagram needed that is suitable for persons who don't know what an array or offset is], and we use the pluginid as the offset of that file's plugin within that array. (An offset is a position relative to something else, and in programming it is typically measured in bytes.) This implies that when you invent a new file plugin, you have to recompile (Programmers don't actually write programs, they got too lazy for that long ago, instead they write instructions for the computer on how to write the program, and when the computer follows these instructions ("source code"), it is called "compiling", which programmers usually pretend was done by them when they speak about it, as in "I recompiled the kernel for my exact CPU this time, and now playing pong is noticeably faster.".) the kernel, and you can only add plugins to the end of the list, and you can never reuse or change pluginids for a plugin, or else you will have to go through the whole filesystem changing all of the pluginids that are no longer correct. Someday in a later version we will revise this so that plugins are "dynamically loadable" (which is when you can add something to a program while it is running), and you can add support for new plugins to a running kernel. When we do that we will carefully benchmark and ensure that there is no loss of performance (or we won't do it) from using dynamic loading. 
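The dispatch scheme just described can be sketched in miniature, in Python rather than kernel C; the plugin class and table here are invented for illustration, not reiser4's actual data structures:

```python
class UnixFilePlugin:
    """A plugin bundles one method per supported interaction."""
    def read(self, body: bytes, offset: int, count: int) -> bytes:
        # plain byte-sequence semantics: read `count` bytes at `offset`
        return body[offset:offset + count]

# the in-kernel analogue is an array indexed by pluginid; entries may only
# be appended, and a pluginid must never be reused or renumbered, or every
# pluginid already stored on disk would become wrong
PLUGIN_TABLE = [UnixFilePlugin()]

def dispatch_read(pluginid: int, body: bytes, offset: int, count: int) -> bytes:
    # the pluginid stored with the file selects its plugin by array offset
    return PLUGIN_TABLE[pluginid].read(body, offset, count)
```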
Programs are often "layered", which is when the program is divided into layers, and each layer only talks to the layer immediately above it, or immediately below it, and never talks to a part of the program two levels below it, etc. This reduces the complexity of the interfaces for the various parts of the program, and most of the complexity of a program is in coding its interfaces.

Reiser4 has a "semantic layer", and this semantic layer concerns itself with naming objects and specifying what to do to the objects, and doesn't concern itself with such things as how to pack objects into particular places on disk or in the tree. An IO to a file may affect more than one physical sequence of bytes, or no physical sequence of bytes; it may affect the sequences of bytes offered by other files to the semantic layer, and the file plugin may invoke other plugins and delegate work to them, but its interface is structured for offering the caller the ability to read and/or write what the caller sees as being a single sequence of bytes.

Appearances are what is wanted. When we say that security attributes are implemented as files, we mean that security attributes look like a sequence of bytes, but the security attributes may be stored in some compressed form that perhaps might be of fixed length, or even be just a single bit. For the filesystem to offer the benefits of simplicity it need merely provide a uniform appearance that all things it stores are sequences of bytes, and there is nothing to prevent it from gaining efficiency through using many different storage implementations to offer this uniform appearance.

For many files it is valuable for them to support efficient tree traversal to any offset in the sequence of bytes. It is not required though, and Unix/Gnu/Linux has traditionally supported some types of files which could not do this.
A pipe will allow you to take the output of one command, and connect it to the input of another command, and each of the commands will see the pipe as a file. This pipe is an example of a file for which you cannot simply jump to the middle of the file efficiently, but instead you must go through it from beginning to end in sequential order.

Names and Objects

A name is a means of selecting an object. An object is anything that acts as though it is a single unified entity. What is an object is context dependent. For instance, if you tell an object to delete itself, many distinctly named entities (that are distinct objects in other ways such as reading) might well disappear as though they are a single object in response to the delete request.

A namespace is a mapping of names to objects. Filesystems, databases, search engines, and environment variable names within shells are all examples of namespaces. The early papers using the term tended to seek to convey that namespaces have commonality in their structure, are not fundamentally different, should be based on common design principles, and should be unified.

Such unification is a bit of a quest for a holy grail. In British mythology King Arthur sent his knights out on a quest for the holy grail, and if only they could become worthy of it, it would appear to them. None of them found it, and yet the quest made them what they became. Namespaces will never be unified, but the closer we can come to it, the more expressive power the OS will have. Reiser4 seeks to create a storage layer effective for such an eventually unified namespace, and gives it a semantic layer with some minor advantages over the state of the art. Later versions will add more and more expressive semantics to the storage layer.

Finding objects is layered. The semantic layer takes names and converts them into keys (we call this "resolving" the name).
The storage layer (which contains the tree traversing code) takes keys and finds the bytes that store the parts of the object. Keys are the fundamental name used by the Reiser4 tree. They are the name that the storage layer at the bottom of it all understands. They can be used to find anything in the tree, not just whole objects, but parts of objects as well. Everything in the tree has exactly one key. Duplicate keys are allowed, but their use usually means that all duplicates must be examined to see if they really contain what is sought, and so duplicates are usually rare if high performance is desired. Allowing duplicates can allow keys to be more compact in some circumstances (e.g. hashed directory entries). An objectid cannot be used for finding an object; only keys can. Objectids are used to compose keys so as to ensure that keys are unique.

Ordering of Name Components

When designing the naming system described in the future vision whitepaper I broke names from human and computer languages into their pieces, and then looked at those pieces to see which ones differed from each other in meaningful ways vs. which pieces were different expressions that provided the same functionality. (In more formal language, I would say that I systematically decomposed the ways of naming things that we use in human and computer languages into orthogonal primitives, and then determined their equivalence classes.) I then selected one way of expression from each set of ways that provided equivalent functionality. (Since that whitepaper is focused on what is not yet implemented, the whitepaper does not list all of the equivalence classes for names, but instead describes those which I thought I could say something interesting to the reader about. For instance, the NOT operator is simply unmentioned in it, as I really have nothing interesting to say about NOT, though it is very useful and will be documented when implemented.)
The ordering of two components of a name either has meaning, or it does not. If the resolution of one component of the name depends on what is named by another component, then that pair of name components forms a hierarchical name. Hierarchy can be indicated by means other than ordering. Many human languages indicate structure by use of suffix or tag mechanisms (e.g. Russian and Japanese). The syntactical mechanism one chooses to express hierarchy does not determine the possible semantics one can express, so long as at least one effective method for expressing hierarchy is allowed. I chose to only offer one expression from each equivalence class of naming primitives, and here I chose the '/' separated file pathname expression traditional to Unix, for pragmatic compatibility with existing operating systems. Reiser4 handles only hierarchical names, and non-hierarchical names are planned only for SSN Reiserfs.

Directories

Hierarchical names are implemented in Reiser4 by use of directories. The first component of a hierarchical name is the name of the directory, and the components that follow are passed to the directory to interpret. We use '/' to separate the components of a hierarchical name. Directories may choose to delegate parts of their task to their sub-directories. The unix directory plugin, when supplied with a name, will use the part of the name before the first '/' to select a sub-directory (if there is a '/' in what it is resolving), and delegate resolving the part of the name after the first '/' to the sub-directory. A directory can employ any arbitrary method at all of resolving the name components passed to it, so long as it returns a set of keys of objects as the result. In Reiser4, this set of keys always contains exactly one member, but this is designed to change in SSN Reiserfs.
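The delegation rule described above (split the name at the first '/', resolve the head locally, hand the rest to the sub-directory) can be sketched with ordinary dictionaries standing in for directories. The structures and names here are hypothetical, not the real plugin interface.

```python
# Toy 'unix directory plugin' resolution: each directory resolves only
# its first component and delegates the remainder to a sub-directory.
# A directory is a dict mapping names to sub-dicts or to leaf "keys".

def resolve(directory, name):
    if "/" in name:
        head, rest = name.split("/", 1)
        return resolve(directory[head], rest)   # delegate to sub-directory
    return directory[name]                      # a key, in this toy model

root = {"home": {"alice": {"notes": ("key", 42)}}}
```

Note that nothing in `resolve` requires the directory to be a dict: any object that can map a single component to something resolvable would do, which is the point of letting directory plugins use arbitrary resolution methods.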
(Reiser4 also needs to interact with a standard interface for Unix filesystems called VFS (Virtual File System), and directories are also designed to be able to return what VFS understands, which we won't go into here.)

Directories will also return a list of names when asked. This list is not required to be a complete list of all names that they can resolve, and sometimes it is not desirable that it be so. Names can be hidden names in Reiser4. Directory plugins may be able to resolve more names than they can list, especially if they are written such that the number of names that they can resolve is infinite. In particular, such names can resolve to objects behaving like ordinary files (with respect to the standard file system interface: read, write, readdir, etc.), but not backed by the storage layer. Such objects are called "pseudo files". Here is a list of pseudo files currently implemented in Reiser4, with a description of their semantics.

The Unix Directory Plugin

The unix directory plugin implements directories by storing a set of directory entries per directory. These directory entries contain a name, and a key. When given a name to resolve, the unix directory plugin finds the directory entry containing that name, and then returns the key that is in the directory entry. (More precisely, since a key selects not just the file but a particular byte within a file, it returns that part of the key which is sufficient to select the file, and which is sufficient to allow the code to determine what the full keys for the file's various parts are when the byte offset and some other fields (like item type) are added to the partial key to form a whole key.) The key can then be used by the tree storage layer to find all the pieces of that which was named.
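As a sketch of how a partial key plus a byte offset can form a whole key, one can pack an objectid into the high bits and an offset into the low bits. The field layout below is invented for illustration; it is not the real Reiser4 key format.

```python
# Hypothetical key packing: high bits select the object, low bits the
# byte offset. Because the objectid occupies the high bits, sorting
# whole keys groups all the pieces of one object together in the tree.

OFFSET_BITS = 32

def whole_key(objectid, offset):
    return (objectid << OFFSET_BITS) | offset

# keys for two objects at various byte offsets, as the tree would sort them
keys = sorted([whole_key(2, 0), whole_key(1, 4096), whole_key(1, 0)])
```

The sorted order shows why composing keys from objectids keeps them unique and keeps an object's bytes adjacent: both parts take effect through ordinary integer comparison.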
Some Historical Details Of Design Flaws In The Unix Directory Interface

Unix differs from Multics, in that Multics defined a file to be a sequence of elements (the elements could be bytes, directory entries, or something else....), while Unix defines a file to be purely a sequence of bytes. In Multics directories were then considered to be a particular type of file which was a sequence of directory entries. For many years, all implementations of Unix directories were as sequences of bytes, and the notion of location within a Unix directory is tied not to a name as you might expect, but to a byte offset within the directory. The problem is that one is using a byte offset to represent a location whose true meaning is not a byte offset but a directory entry, and doing so for a particular file in a system which meaningfully names that file not by byte offset within the directory but by filename. Various efforts are being made in the Unix community to pretend that this byte offset is something more general than a byte offset, and they often try to do so without increasing the size used to store the thing which they pretend is not a byte offset. Since byte offsets are normally smaller than filenames are allowed to be, the result is ugliness and pathetic kludges. Trust me that you would rather not know about the details of those kludges unless you absolutely have to, and let me say no more.

Directories Are Unordered

Unix/Linux makes no promises regarding the order of names within directories. The order in which files are created is not necessarily the order in which names will be listed in a directory, and the use of lexicographic (alphabetic) order is surprisingly rare. The unix utilities typically sort directory listings after they are returned by the filesystem, which is why it seems like the filesystem sorts them, and is why listing very large directories can be slow. (Our current default plugin sorts filenames that are less than 15 letters long lexicographically.
For those that are more than 15 characters long it sorts them first by their first 8 letters and then by the hash of the whole name.) There is value to allowing the user to specify an arbitrary order for names using an arbitrary ordering function the user supplies. This is not done in Reiser4, but is planned as a feature of later versions. Allowing the creation of a hash plugin is a limited form of this that is currently implemented.

Files That Are Also Directories

In Reiser4 (but not ReiserFS 3) an object can be both a file and a directory at the same time. If you access it as a file, you obtain the named sequence of bytes. If you use it as a directory, you can obtain files within it, directory listings, etc. There was a lengthy discussion on the Linux Kernel Mailing List about whether this was technically feasible to do. I won't reproduce it here except to summarize that Linus showed that this was feasible without "breaking" VFS.

Allowing an object to be both a file and a directory is one of the features necessary to compose the functionality present in streams and attributes using files and directories. To implement a regular unix file with all of its metadata, we use a file plugin for the body of the file, a directory plugin for finding file plugins for each of the metadata, and particular file plugins for each of the metadata. We use a unix_file file plugin to access the body of the file, and a unix_file_dir directory plugin to resolve the names of its metadata to particular file plugins for particular metadata. These particular file plugins for unix file metadata (owner, permissions, etc.) are implemented to allow the metadata normally used by unix files to be quite compactly stored.

Hidden Directory Entries

A file can exist but not be visible when using readdir in the usual way. WAFL does this with the .snapshots directory; it works well for them without disturbing users.
This is useful for adding access to a variety of new features and their applications without disturbing the user when they are not relevant.

New Security Attributes and Set Theoretic Semantic Purity

Minimizing Number Of Primitives Is Important In Abstract Constructions

To a theoretician it is extremely important to minimize the number of primitives with which one achieves the desired functionality in an abstract construction. It is a bit hard to explain why this is so, but it is well accepted that breaking an abstract model into more basic primitives is very important. A not very precise explanation of why is to say that by breaking complex primitives into their more basic primitives, then recombining those basic primitives differently, you can usually express new things that the original complex primitives did not express. Let's follow this grand tradition of theoreticians and see what happens if we apply it to Gnu/Linux files and directories.

Can We Get By Using Just Files and Directories (Composing Streams And Attributes From Files And Directories)?

In Gnu/Linux we have files, directories, and attributes. In NTFS they also have streams. Since Samba is important to Gnu/Linux, there frequently are requests that we add streams to ReiserFS. There are also requests that we add more and more different kinds of attributes using more and more different APIs. Can we do everything that can be done with {files, directories, attributes, streams} using just {files, directories}? I say yes--if we make files and directories more powerful and flexible. I hope that by the end of reading this you will agree.

Let us have two basic objects. A file is a sequence of bytes that has a name. A directory is a name space mapping names to a set of objects "within" the directory. We connect these directory name spaces such that one can use compound names whose subcomponents are separated by a delimiter '/'.
What is missing from files and directories now that attributes and streams offer? In ReiserFS 3, there exist file attributes. File attributes are out-of-band data describing the sequence of bytes which is the file. For example, the permissions defining who can access a file, or the last modification time, are file attributes. File attributes have their own API; creating new file attributes creates new code complexity and compatibility issues galore. ACLs are one example of new file attributes users want.

Since in Reiser4 files can also be directories, we can implement traditional file attributes as simply files. To access a file attribute, one need merely name the file, followed by a '/', followed by an attribute name. That is: a traditional file will be implemented to possess some of the features of a directory; it will contain files within the directory corresponding to file attributes which you can access by their names; and it will contain a file body which is what you access when you name the "directory" rather than the file.

Unix currently has a variety of attributes that are distinct from files (ACLs, permissions, timestamps, other mostly security related attributes, ...). This is because a variety of people needed this feature and that, and there was no infrastructure that would allow implementing the features as fully orthogonal features that could be applied to any file. Reiser4 will create that infrastructure.

List Of Features Needed To Get Attribute And Stream Functionality From Files And Directories:

* api efficient for small files
* efficient storage for small files
* plugins, including plugins that can compress a file serving as an attribute into a single bit
* files that also act as directories when accessed as directories
* inheritance (includes file aggregation)
* constraints
* transactions
* hidden directory entries

Each of these additional features is a feature that would benefit the filesystem. So we add them in v4.
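A toy model makes the file-that-is-also-a-directory idea concrete: naming the object plainly yields its body, while naming an attribute "inside" it yields that attribute's bytes. The class and the attribute names below are illustrative only, not the real Reiser4 plugins.

```python
# Sketch of an object that is both a file and a directory. Accessed
# with an empty remaining name it behaves as a file (returns its body);
# accessed with an attribute name it behaves as a directory and returns
# the attribute, which is itself just a tiny file.

class FileDir:
    def __init__(self, body, **attrs):
        self.body = body
        self.attrs = attrs              # each attribute is a small "file"

    def lookup(self, name):
        if name == "":
            return self.body            # named as a file: "doc"
        return self.attrs[name]         # named as a directory: "doc/owner"

doc = FileDir(b"contents", owner=b"alice", mode=b"0644")
```

In this model `doc/owner` is resolved by handing the component `owner` to `doc`, exactly as any other directory would resolve a sub-name, which is why no separate attribute API is needed.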
Basic Tree Concepts

Trees, Nodes, and Items

One way of organizing information is to put it into trees. When we organize information in a computer, we typically sort it into piles (nodes we call them), and there is a name (a pointer) for each pile that the computer will be able to use to find the pile.

Figure 1. One Example Of A Tree: a height = 4, 4 level, fanout = 3, balanced tree. It starts with a root node, then traverses 2 internal nodes, and ends with the leaf nodes, which hold the data and have no children.

Some of the nodes can contain pointers, and we can go looking through the nodes to find those pointers to (usually other) nodes. We are particularly interested in how to organize so that we can find things when we search for them. A tree is an organization structure that has some useful properties for that purpose.

Definition of Tree:

1. A tree is a set of nodes organized into a root node, and zero or more additional sets of nodes called subtrees.
2. Each of the subtrees is a tree.
3. No node in the tree points to the root node, and exactly one pointer from a node in the tree points to each non-root node in the tree.
4. The root node has a pointer to each of its subtrees, that is, a pointer to the root node of the subtree.

Fine Points of the Definition

Figure 2. The simplest tree: the absolutely most trivial of all graphs, the single, isolated node.

Figure 3. A trivial, linear tree: a trivial, connected, linear (unary) graph, a linear sequence of nodes connected by paths (edges, pointers).

It is interesting to argue over whether finite should be a part of the definition of trees. There are many ways of defining trees, and which is the best definition depends on what your purpose is. Donald Knuth (a well known author of algorithm textbooks) supplies several definitions of tree. As his primary definition of tree he even supplies one which has no pointers/edges/lines in the definition, just sets of nodes.
Reiser4 uses a finite tree (the number of nodes is limited). Knuth defines trees as being finite sets of nodes. There are papers on infinite trees on the Internet. I think it more appropriate to consider finite an additional qualifier on trees, rather than bundling finite into the definition. However, I personally only deal with finite trees in my storage layer research. It is interesting to consider whether storage layers are inherently more motivated than semantic layers to limit themselves to finite trees rather than infinite trees. This is where some writers would say ".... is left as an exercise for the reader". :-) Oh the temptation.... I will remind the reader of my explanation of why storage layer trees are more motivated to be acyclic, and, at the cost of some effort at honesty, constrain myself to saying that doing more than providing that hint is beyond my level of industry. ;-)

Edge is a term often used in tree definitions. A pointer is unidirectional (you can follow it from the node that has it to the node it points to, but you cannot follow it back from the node it points to to the node that has it). An edge is bidirectional (you can follow it in both directions). Here are three alternative tree definitions, which are interesting in how they are mathematically equivalent to each other, though they are not equivalent to the definition I supplied because edges are not equivalent to pointers. For all three of these definitions, let there be not more than one edge connecting the same two nodes:

* a set of vertices (aka points) connected by edges (aka lines) for which the number of edges is one less than the number of vertices
* or a set of vertices connected by edges which has no cycles (a cycle is a path from a vertex to itself)
* or a set of vertices connected by edges for which there is exactly one path connecting any two vertices

The three alternative definitions do not have a unique root in their tree, and such trees are called free trees.
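The equivalence of those free-tree definitions is easy to spot-check on small graphs: among connected graphs with at most one edge per pair of vertices, having one fewer edge than vertices coincides with being acyclic. A minimal sketch:

```python
# Check two of the free-tree conditions on small undirected graphs.
# Edges are frozensets of two vertices (at most one edge per pair).

def is_connected(nodes, edges):
    start = next(iter(nodes))
    seen, stack = {start}, [start]
    while stack:
        v = stack.pop()
        for e in edges:
            if v in e:
                (w,) = e - {v}
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
    return seen == set(nodes)

def edge_count_ok(nodes, edges):
    # free-tree condition: |edges| = |vertices| - 1
    return len(edges) == len(nodes) - 1

nodes = {1, 2, 3, 4}
tree_edges = {frozenset({1, 2}), frozenset({1, 3}), frozenset({3, 4})}
cycle_edges = tree_edges | {frozenset({2, 3})}   # adds a cycle 1-2-3-1
```

Adding the edge {2, 3} keeps the graph connected but breaks the edge-count condition, which is exactly the cycle the second definition forbids.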
The definition I supplied is a definition of a rooted tree, not a free tree. It also has no cycles, it has one less pointer than it has nodes, and there is exactly one path from the root to any node. Please feel encouraged to read Knuth's writings for more discussions of these topics.

Graphs vs. Trees

Consider the purposes for which you might want to use a graph, and those for which you might want to use a tree. In a tree there is exactly one path from the root to each node in the tree, and a tree has the minimum number of pointers sufficient to connect all the nodes. This makes it a simple and efficient structure. Trees are useful for when efficiency with minimal complexity is what is desired, and there is no need to reach a node by more than one route. Reiser4 has both graphs and trees, with trees used for when the filesystem chooses the organization (in what we call the storage layer, which tries to be simple and efficient), and graphs for when the user chooses the organization (in the semantic layer, which tries to be expressive so that the user can do whatever he wants).

Ordering The Tree Aids Searching Through It

Keys

We assign everything stored in the tree a key. We find things by their keys. Use of keys gives us additional flexibility in how we sort things, and if the keys are small, it gives us a compact means of specifying enough to find the thing. It also limits what information we can use for finding things. This limit restricts its usefulness, and so we have a storage layer, which finds things by keys, and a semantic layer, which has a rich naming system. The storage layer chooses keys for things solely to organize storage in a way that will improve performance, and the semantic layer understands names that have meaning to users.
As you read, you might want to think about whether this is a useful separation that allows freedom in adding improvements that aid performance in the storage layer, while escaping paying a price for the side effects of those improvements on the flexible naming objectives of the semantic layer.

Choosing Which Subtree

We start our search at the root, because from the root we can reach every other node. How do we choose which subtree of the root to go to from the root? The root contains pointers to its subtrees. For each pointer to a subtree there is a corresponding left delimiting key. Pointers to subtrees, and the subtrees themselves, are ordered by their left delimiting key. A subtree pointer's left delimiting key is equal to the least key of the things in the subtree. Its right delimiting key is larger than the largest key in the subtree, and it is the left delimiting key of the next subtree of this node. Each subtree contains only things whose keys are at least equal to the left delimiting key of its pointer, and are not more than its right delimiting key. If there are no duplicate keys in the tree, then each subtree contains only things whose keys are less than its right delimiting key. If there are no duplicate keys, then by looking within a node at its pointers to subtrees and their delimiting keys we know which subtree of that node contains the thing we are looking for.

Duplicate keys are a topic for another time. For now I will just hint that when searching through objects with duplicate keys we find the first of them in the tree, and then we search through all duplicates one-by-one until we find what we are looking for. Allowing duplicate keys can allow for smaller keys, so there is sometimes a tradeoff between key size and the average frequency of such inefficient linear searches.
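The rule for descending by delimiting keys amounts to a binary search among the sorted left delimiting keys: take the last pointer whose left delimiting key does not exceed the search key. A minimal sketch, assuming no duplicate keys (the key values and subtree names are invented):

```python
from bisect import bisect_right

# Choose which child to descend into, given the node's subtree pointers
# sorted by left delimiting key.

def choose_subtree(delimiting_keys, children, key):
    # last pointer whose left delimiting key is <= the search key
    i = bisect_right(delimiting_keys, key) - 1
    return children[i]

keys = [10, 40, 70]                       # left delimiting keys
children = ["subtree-A", "subtree-B", "subtree-C"]
```

Because each left delimiting key is also the previous subtree's right delimiting key, this single comparison rule encodes both bounds described above.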
Using duplicate keys can also allow, if one defines one's insertion algorithms such that they always insert at the end of a set of duplicate keys, ordering objects with the same key by creation time. The contents of each node in the tree are sorted within the node. So, the entire tree is sorted by key, and for a given key we know just where to go to find at least one thing with that key.

Nodes: Leaves, Twigs, and Branches

Leaves are nodes that have no children. Internal nodes are nodes that have children.

Figure 4. A height = 4, fanout = 3, balanced tree. A search will start with the root node, the sole level 4 internal node, traverse 2 more internal nodes, and end with a leaf node which holds the data and has no children.

A node that contains items is called a formatted node. If an object is large, and is not compressed and doesn't need to support efficient insertions (compressed objects are special because they need to be able to change their space usage when you write to their middles, because the compression might not be equally efficient for the new data), then it can be more efficient to store it in nodes without any use of items at all. We do so by default for objects larger than 16k. Unformatted leaves (unfleaves) are leaves that contain only data, and do not contain any formatting information. Only leaves can contain unformatted data. Pointers are stored in items, and so all internal nodes are necessarily formatted nodes.

Pointers to unfleaves are different in their structure from pointers to formatted nodes. Extent pointers point to unfleaves. An extent is a sequence of unfleaves that are contiguous in block number order and belong to the same object. An extent pointer contains the starting block number of the extent, and a length.
[Diagram needed.] Because the extent belongs to just one object, we can store just one key for the extent, and then we can calculate the key of any byte within that extent. If the extent is at least 2 blocks long, extent pointers are more compact than regular node pointers would be.

Node pointers are pointers to formatted nodes. We do not yet have a compressed version of node pointers, but they are probably soon to come. Notice how with extent pointers we don't have to store the delimiting key of each node pointed to, and with node pointers we need to. We will probably introduce key compression at the same time we add compressed node pointers. One would expect keys to compress well since they are sorted into ascending order. We expect our node and item plugin infrastructure will make such features easy to add at a later date.

Twigs are parents of leaves. Extent pointers exist only in twigs. This is a very controversial design decision I will discuss a bit later. Branches are internal nodes that are not twigs.

You might think we would number the root level 1, but since the tree grows at the top, it turns out to be more useful to number as 1 the level with the leaves where object data is stored. The height of the tree will depend upon how many objects we have to store and what the fanout rate (average number of children) of the internal and twig nodes will be. For reasons of code simplicity, we find it easiest to implement Reiser4 such that it has a minimum height of 2, and the root is always an internal node. There is nothing deeper than judicious laziness to this: it simplifies the code to not deal with one node trees, and nobody cares about the waste of space.

An example of a Reiser4 tree: a tree starting with a root node, then traversing branch nodes, including the internal nodes called twig nodes (a Reiser4 feature), and ending with the leaf nodes which hold the data and have no children.

Figure 5. This Reiser4 tree is a 4 level, balanced tree with a fanout of 3.
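The "calculate the key of any byte within that extent" step is plain arithmetic: one stored key plus a block-size division replaces a per-block key. A sketch, assuming 4k blocks and an invented layout:

```python
# Given an extent pointer (starting block number) and the key offset of
# the extent's first byte, compute which block holds any byte offset
# inside the extent. No per-block key or pointer is needed.

BLOCK = 4096

def block_of(extent_start_block, extent_key_offset, byte_offset):
    return extent_start_block + (byte_offset - extent_key_offset) // BLOCK
```

This is why one key per extent suffices, and why an extent pointer (start block + length) beats one node pointer per block as soon as the extent covers at least two blocks.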
In practice Reiser4 fanout is much higher and varies from node to node, but a 4 level tree diagram with 16 million leaf nodes won't fit easily onto my monitor, so I drew something smaller.... ;-)

Size of Nodes

We choose to make the nodes equal in size. This makes it much easier to allocate the unused space between nodes, because it will be some multiple of the node size, and there are no problems of space being free but not large enough to store a node. Also, disk drives have an interface that assumes equal size blocks, which they find convenient for their error-correction algorithms. If having the nodes be equal in size is not very important, perhaps due to the tree fitting into RAM, then using a class of algorithms called skip lists is worthy of consideration. Reiser4 nodes are usually equal to the size of a page, which if you use Gnu/Linux on an Intel CPU is currently 4096 (4k) bytes. There is no measured empirical reason to think this size is better than others; it is just the one that Gnu/Linux makes easiest and cleanest to program into the code, and we have been too busy to experiment with other sizes.

Sharing Blocks Saves Space

If nodes are of equal size, how do we store large objects? We chop them into pieces. We call these pieces items. Items are sized to fit within a single node. Conventional filesystems store files in whole blocks. Roughly speaking, this means that on average half a block of space is wasted per file because not all of the last block of the file is used. If a file is much smaller than a block, then the space wasted is much larger than the file. It is not effective to store such typical database objects as addresses and phone numbers in separately named files in a conventional filesystem because it will waste more than 90% of the space in the blocks it stores them in. By putting multiple items within a single node in Reiser4, we are able to pack multiple small pieces of files into one block. Our space efficiency is roughly 94% for small files.
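The space argument can be illustrated with back-of-envelope arithmetic. The 16-byte item head below is an assumed figure for illustration, not the real Reiser4 overhead:

```python
# Compare storage efficiency of a small file stored in whole blocks
# versus packed as an item sharing a node with other items.

BLOCK = 4096

def whole_block_efficiency(file_size):
    blocks = -(-file_size // BLOCK)          # ceiling division
    return file_size / (blocks * BLOCK)

def packed_efficiency(file_size, item_head=16):
    # the file's item costs its own bytes plus a small per-item head;
    # the rest of the node is available to other items
    return file_size / (file_size + item_head)
```

A 100-byte "phone number" file wastes over 97% of a 4k block on its own, but as a packed item it uses its space at better than 85% efficiency, which is the effect the text describes.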
This does not count per-item formatting overhead, whose percentage of total space consumed depends on average item size, and for that reason is hard to quantify.

Aligning files to 4k boundaries does have advantages for large files though. When a program wants to operate directly on file data without going through system calls to do it, it can use mmap() to make the file data part of the process's directly accessible address space. Due to some implementation details mmap() needs file data to be 4k aligned, and if the data is already 4k aligned, it makes mmap() much more efficient. In Reiser4 the current default is that files that are larger than 16k are 4k aligned. We don't yet have enough empirical data and experience to know whether 16k is the precise optimal default value for this cutoff point, but so far it seems to at least be a decent choice.

Items

Nodes in the tree are smaller than some of the objects they hold, and larger than some of the objects they hold, so how do we store them? One way is to pour them into items. An item is a data container that is contained entirely within a single node, and it allows us to manage space within nodes. For the default 4.0 node format, every item has a key, an offset to where in the node the item body starts, a length of the item body, and a pluginid that indicates what type of item it is. Items allow us to not have to round up to 4k the amount of space required to store an object.

The Structure of an Item

    Item_Body  . . (separated) . .  Item_Head [ Item_Key | Item_Offset | Item_Length | Item_Plugin_id ]

Types Of Items

Reiser4 includes many different kinds of items designed to hold different kinds of information:

* static_stat_data: holds the owner, permissions, last access time, creation time, last modification time, size, and the number of links (names) to a file.
* cmpnd_dir_item: holds directory entries, and the keys of the files they link to.
* extent pointers: explained above.
* node pointers: explained above.
* bodies: hold parts of files that are not large enough to be stored in unfleafs.

== Units ==

We call a unit that which we must place as a whole into an item, without splitting it across multiple items. When traversing an item's contents it is often convenient to do so in units:

* For body items the units are bytes.
* For directory items the units are directory entries. A directory entry contains a name and the key of the file named (or at least the item plugin can pretend it does; in practice the name and key may be compressed).
* For extent items the units are extents. Extent items only contain extents from the same file.
* For static_stat_data the whole stat data item is one indivisible unit of fixed size.

= What the Default Node Formats for ReiserFS 4.0 Look Like =

An unformatted leaf node (unfleaf node), which is the only node without a node header, has the trivial structure of containing nothing but raw data.

A formatted leaf node has the structure:

 Block_Head | Item_Body0 | Item_Body1 | ... | Item_Bodyn | ...Free Space... | Item_Headn | ... | Item_Head1 | Item_Head0

A twig node has the structure:

 Block_Head | Item_Body0 (NodePointer0) | Item_Body1 (ExtentPointer1) | Item_Body2 (NodePointer2) | Item_Body3 (ExtentPointer3) | ... | Item_Bodyn (NodePointern) | ...Free Space... | Item_Headn | ... | Item_Head0

A branch node has the structure:

 Block_Head | Item_Body0 (NodePointer0) | ... | Item_Bodyn (NodePointern) | ...Free Space...
 Item_Headn | ... | Item_Head0

= Tree Design Concepts =

== Height Balancing versus Space Balancing ==

Height balanced trees are trees in which every possible search path from root node to leaf node has exactly the same length (length = number of nodes traversed from root to leaf). For instance, the height of the tree in Figure 1 is four, the height of the left-hand tree in Figure 1.3 is three, and the height of the single node in Figure 2 is 1.

The term balancing is used for several very distinct purposes in the balanced tree literature. Two of the most common are: to describe balancing the height, and to describe balancing the space usage within the nodes of the tree. These quite different definitions are unfortunately a classic source of confusion for readers of the literature.

Most algorithms for accomplishing height balancing do so by only growing the tree at the top. Thus the tree never gets out of balance.

Figure 6. An unbalanced tree: a 4 level tree with fanout n = 3 that has lost some nodes to deletions and needs to be balanced.

== Three Principal Considerations in Tree Design ==

Three of the principal considerations in tree design are:

* the fanout rate (see below)
* the tightness of packing
* the amount of shifting of items from one node to another that is performed (which creates delays due to waiting while things move around in RAM, and on disk).

== Fanout ==

The fanout rate n refers to how many nodes may be pointed to by each level's nodes (see Figure 7). If each node can point to n nodes of the level below it, then, starting from the top, the root node points to n internal nodes at the next level, each of which points to n more internal nodes at its next level, and so on: m levels of internal nodes can point to n^m leaf nodes containing items in the last level.
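The fanout arithmetic can be sketched in a few lines; the key and pointer sizes below are illustrative, not Reiser4's actual on-disk sizes:

```python
def fanout(node_bytes, key_bytes, ptr_bytes=8):
    """How many (key, pointer) pairs fit in one internal node.
    Larger keys mean fewer pairs per node, i.e. lower fanout."""
    return node_bytes // (key_bytes + ptr_bytes)

def leaf_capacity(n, m):
    """m levels of internal nodes with fanout n address n**m leaves."""
    return n ** m
```

With a 4096-byte node and 24-byte keys the fanout is 128; growing the keys to 56 bytes halves it to 64. `leaf_capacity(3, 3)` gives the 27 leaves of the fanout-3 tree in Figure 7, and a fanout around 252 yields the roughly 16 million leaf nodes mentioned earlier.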
The more you want to be able to store in the tree, the larger the fields in the key that first distinguish the objects (the objectids) and then select parts of the object (the offsets) must be. This means your keys must be larger, which decreases fanout (unless you compress your keys, but that will wait for our next version...).

Figure 7. Three 4 level, height balanced trees with fanouts n = 1, 2, and 3. The first graph is a four level tree with fanout n = 1. It has just four nodes: it starts with the (red) root node, traverses the (burgundy) internal and (blue) twig nodes, and ends with the (green) leaf node, which contains the data. The second tree, with 4 levels and fanout n = 2, starts with a root node and traverses 2 internal nodes, each of which points to two twig nodes (for a total of four twig nodes), and each of these points to 2 leaf nodes, for a total of 8 leaf nodes. Lastly, a 4 level, fanout n = 3 tree is shown, which has 1 root node, 3 internal nodes, 9 twig nodes, and 27 leaf nodes.

= What Are B+Trees, and Why Are They Better than B-Trees =

It is possible to store not just pointers and keys in internal nodes, but also the objects those keys correspond to. This is what the original B-tree algorithms did. Then B+trees were invented, in which only pointers and keys are stored in internal nodes, and all of the objects are stored at the leaf level.

Figure 8. Figure 9.
Warning! I have found from experience that most persons who don't first deeply understand why B+trees are better than B-trees won't later understand explanations of the advantages of putting extents on the twig level rather than using BLOBs. The same principles that make B+trees better than B-trees also make Reiser4 faster than using BLOBs the way most databases do. So make sure this section fully digests before moving on to the next one, ok? ;-)

== B+Trees Have Higher Fanout Than B-Trees ==

Fanout is increased when we put only pointers and keys in internal nodes, and don't dilute them with object data. Increased fanout increases our ability to cache all of the internal nodes, because there are fewer internal nodes. Often persons respond to this by saying, "but B-trees cache objects, and caching objects is just as valuable". On average, it is not. Of course, discussing averages makes the discussion much harder; we need to discuss some cache design principles for a while before we can get to this.

= Cache Design Principles =

== Reiser's Untie The Uncorrelated Principle of Cache Design ==

Tying the caching of things whose usage does not strongly correlate is bad. Suppose:

* you have two sets of things, A and B;
* you need things from those two sets at semi-random, with some items tending to be needed much more frequently than others, though which items those are can shift slowly over time;
* you can keep things around after you use them in a cache of limited size;
* you tie the caching of every thing from A to the caching of another thing from B (that is, whenever you fetch something from A into the cache, you fetch its partner from B into the cache).

Then this increases the amount of cache required to store everything recently accessed from A.
If there is a strong correlation between the need for the two particular objects that are tied in each of the pairings, stronger than the gain from spending those cache resources on caching more members of B according to the LRU algorithm, then this might be worthwhile. If there is no such strong correlation, then it is bad.

But wait, you might say, you need things from B also, so it is good that some of them were cached. Yes, you need some random subset of B. The problem is that without a correlation, the things from B that you need are not especially likely to be the same things from B that were tied to the things from A that were needed.

This tendency to inefficiently tie things that are randomly needed exists outside the computer industry as well. For instance, suppose you like both popcorn and sushi, with your need for them on a particular day being random, and suppose you like movies randomly. If a theater requires you to eat only popcorn while watching the movie you randomly found optimal to watch, and not eat sushi from the restaurant on the corner while watching that movie, is this a socially optimal system? Suppose quality is randomly distributed across all the hot dog vendors: if you can only eat the hot dog produced by the best movie displayer on a particular night that you want to watch a movie, and you aren't allowed to bring in hot dogs from outside the theater, is that a socially optimal system? Optimal for you?

Tying the uncorrelated is a very common error in designing caches, but it is still not enough to describe why B+trees are better. With internal nodes, we store more than one pointer per node. That means that pointers are not separately cached. You could well argue that pointers and the objects they point to are more strongly correlated than the different pointers are with each other. We need another cache design principle.
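The cost of tying can be seen in a toy LRU simulation. This is a sketch under assumed sizes, not a model of any real cache: tying an uncorrelated partner to every fetch simply halves the slots left for the things actually needed.

```python
def lru_hit_rate(stream, capacity, slots_per_fetch=1):
    """Simulate an LRU cache over an access stream.  Tying an
    uncorrelated partner from B to every fetch from A means each fetch
    occupies slots_per_fetch cache slots, shrinking the capacity left
    for the things we actually need."""
    cache, hits = [], 0
    effective = capacity // slots_per_fetch
    for key in stream:
        if key in cache:
            hits += 1
            cache.remove(key)   # move to most-recently-used end
        cache.append(key)
        if len(cache) > effective:
            cache.pop(0)        # evict least recently used
    return hits / len(stream)

# a hot working set of 12 items, accessed cyclically
stream = list(range(12)) * 50
```

With 16 slots the untied cache holds the whole 12-item working set and hits 98% of the time; tying a partner to every fetch leaves only 8 useful slots, and the cyclic working set then thrashes LRU down to a 0% hit rate.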
== Reiser's Maximize The Variance Principle of Cache Design ==

If two types of things that are cached and accessed in units that are aggregates have different average temperatures, then segregating the two types into separate units helps caching. For balanced trees, these units of aggregation are nodes. This principle applies to the situation where it may be necessary to tie things into larger units for efficient access, and it guides what things should be tied together.

Suppose you have R bytes of RAM for cache and D bytes of disk. Suppose that 80% of accesses are to the most recently used things, which are stored in H (hotset) bytes of nodes. Reducing the size of H to where it is smaller than R is very important to performance. If you evenly disperse your frequently accessed data, then a larger cache is required and caching is less effective.

# If, all else being equal, we increase the variation in temperature among all aggregates (nodes), then we increase the effectiveness of using a fast small cache.
# If two types of things have different average temperatures (ratios of likelihood of access to size in bytes), then separating them into separate aggregates (nodes) increases the variation in temperature in the system as a whole.
# Conclusion: all else being equal, if two types of things cached several to an aggregate (node) have different average temperatures, then segregating them into separate nodes helps caching.

== Pointers To Nodes Have A Higher Average Temperature Than The Nodes They Point To ==

Pointers to nodes tend to be frequently accessed relative to the number of bytes required to cache them. Consider that you have to use the pointers for all tree traversals that reach the nodes beneath them, and they are smaller than the nodes they point to. Putting only node pointers and delimiting keys into internal nodes concentrates the pointers.
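A back-of-the-envelope sketch of what this concentration buys (the per-pointer size is an assumption for illustration, not Reiser4's actual key + pointer size):

```python
NODE = 4096   # node size in bytes
PTR = 16      # assumed bytes per pointer plus delimiting key

def pointer_hotset_bytes(n_pointers, segregated):
    """RAM needed to cache every node pointer.  Segregated (B+tree
    style): internal nodes hold nothing but pointers and keys.
    Diluted (B-tree / BLOB style): each pointer sits in a node that is
    mostly object data, so caching it drags a whole node into RAM."""
    if segregated:
        per_node = NODE // PTR                 # 256 pointers per node
        nodes = -(-n_pointers // per_node)     # ceiling division
    else:
        nodes = n_pointers                     # one data node per pointer
    return nodes * NODE
```

For a million leaf pointers, segregation needs about 16 MB of cache while dilution needs about 4 GB, which is why concentrating pointers makes "cache all internal nodes" feasible.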
Since pointers tend to be more frequently accessed per byte of their size than items storing file bodies, a large average temperature difference exists between pointers and object data. According to the caching principles described above, segregating these two types of things with different average temperatures, pointers and object data, increases the efficiency of caching.

== Segregating By Temperature Directly ==

Now you might say: why not segregate by actual temperature instead of by type, which only correlates with temperature? We do what we can easily and effectively code, with not just temperature segregation in consideration. There are tree designs which rearrange the tree so that objects with a higher temperature are higher in the tree than pointers with a lower temperature. The difference in average temperature between object data and pointers to nodes is so high that I don't find such designs a compelling optimization, and they add complexity. I could be wrong. If one had no compelling semantic basis for aggregating objects near each other (this is true for some applications), and if one wanted to access objects by nodes rather than individually, it would be interesting to have a node repacker sort object data into nodes by temperature. You would need to have the repacker change the keys of the objects it sorts. Perhaps someone will have us implement that for some application someday for Reiser4.

= BLOBs Unbalance the Tree, Reduce Segregation of Pointers and Data, and Thereby Reduce Performance =

BLOBs (Binary Large OBjects) are a method of storing objects larger than a node by storing pointers to the nodes containing the object. These pointers are commonly stored in what are called the leaf nodes (level 1, except that the BLOBs are then sort of a basement "level B" :-\ ) of a "B*" tree.

Figure 10. A tree that was four levels until a BLOB was inserted with a pointer from a leaf node. In this case the BLOB's blocks are all contiguous.
A Binary Large OBject (BLOB) has been inserted, with pointers to its blocks stored in a leaf node. This is what a ReiserFS V3 tree looks like.

BLOBs are a significant unintentional definitional drift, albeit one accepted by the entire database community. This placement of pointers into nodes containing data is a performance problem for ReiserFS V3, which uses BLOBs. (Never accept the "let's just try it my way and see, and we can change it if it doesn't work" argument. It took years and a disk format change to get BLOBs out of ReiserFS, and performance suffered the whole time, if tails were turned on.) Because the pointers to BLOBs are diluted by data, caching all pointers to all nodes in RAM is infeasible for typical file sets.

Reiser4 returns to the classical definition of a height balanced tree, in which the lengths of the paths to all leaf nodes are equal. It does not try to pretend that the nodes storing objects larger than a node are somehow not part of the tree even though the tree stores pointers to them. As a result, the amount of RAM required to store pointers to nodes is dramatically reduced. For typical configurations, RAM is large enough to hold all of the internal nodes.

Figure 11. A Reiser4, 4 level, height balanced tree with fanout = 3, with the data that was stored in BLOBs now stored in extents in the level 1 leaf nodes and pointed to by extent pointers stored in the level 2 twig nodes. In this case the extents' blocks are all contiguous.

Gray and Reuter say the criterion for searching external memory is to "minimize the number of different pages along the average (or longest) search path. ... by reducing the number of different pages for an arbitrary search path, the probability of having to read a block from disk is reduced."
(1993, Transaction Processing: Concepts and Techniques, Morgan Kaufmann Publishers, San Francisco, CA, p. 834)

My problem with this explanation of why the height balanced approach is effective is that it does not convey that you can get away with a moderately unbalanced tree provided that you do not significantly increase the total number of internal nodes. In practice, most trees that are unbalanced do have significantly more internal nodes. In practice, most moderately unbalanced trees have a moderate increase in the cost of in-memory tree traversals, and an immoderate increase in the amount of IO due to the increased number of internal nodes. But if one were to put all the BLOBs together in the same location in the tree, the number of internal nodes would not significantly increase, and so the performance penalty for having them on a lower level of the tree than all other leaf nodes would not be a significant additional IO cost. There would be a moderate increase in the part of the tree traversal time cost that depends on RAM speed, but this would not be so critical. Segregating BLOBs could perhaps substantially recover the performance lost by architects not noticing the drift in the definition of height balancing for trees. It might be undesirable to segregate objects by their size rather than just their semantics, though. Perhaps someday someone will try it and see what results.

= Dancing Trees Are Faster Than Balanced Trees =

Balanced trees have traditionally employed fixed criteria for determining whether nodes should be squeezed together into fewer nodes so as to save space. These criteria are traditionally satisfied at the end of every modification to the tree. A typical such criterion is to guarantee that after each modification, the modified node cannot be squeezed together with its left and right neighbors into two or fewer nodes. ReiserFS V3 uses that criterion for its leaf nodes.
The more neighboring nodes you consider for squeezing into fewer nodes, the more memory bandwidth you consume on average per modification to the tree, and the more likely you are to need to read those nodes because they are not in memory. It is a typical pattern in memory management algorithm design that the more tightly packed memory is kept, the more overhead is added to the cost of changing what is stored where. This overhead can be significant enough that some commercial databases actually only delete nodes when they are completely empty, and they feel that in practice this works well.

Trees that adhere to fixed space usage balancing criteria can have many things rigorously proven about their worst case performance in publishable papers. This is different from their being optimal. An algorithm can have worse bounds on its theoretical worst case performance and still be a better algorithm. Just because one cannot rigorously define average usage patterns does not make them the slightest bit less important. Sorry, mere mortal mathematicians, that is life. Maybe some might prefer to think about the questions that they can define and answer rigorously, but this does not in the slightest make them the right questions. Yes, I am a chaotic....

In Reiser4 we employ not balanced trees, but dancing trees. Dancing trees merge insufficiently full nodes, not with every modification to the tree, but instead:

* in response to memory pressure triggering a flush to disk;
* as a result of a transaction closure flushing nodes to disk.

== If It Is In RAM, Dirty, and Contiguous, Then Squeeze It All Together Just Before Writing ==

Let a slum be defined as a sequence of nodes that are contiguous in the tree order and dirty in this transaction. (In simpler words, a bunch of dirty nodes that are right next to each other.) A dancing tree responds to memory pressure by squeezing and flushing slums.
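The left-shove that squeezing performs can be sketched as follows. This is a simplification: nodes are modeled as lists of item sizes, items are moved whole, and node header overhead is ignored.

```python
NODE_CAP = 4096  # usable bytes per node (header overhead ignored)

def squeeze_slum(slum):
    """Left-shove a slum: take every item in a run of contiguous dirty
    nodes (each node given as a list of item sizes) and repack them as
    far left as they will go, dropping nodes left empty."""
    items = [item for node in slum for item in node]
    packed, current, used = [], [], 0
    for size in items:
        if used + size > NODE_CAP:
            packed.append(current)   # current node is as full as it gets
            current, used = [], 0
        current.append(size)
        used += size
    if current:
        packed.append(current)
    return packed
```

For example, `squeeze_slum([[1000, 1000], [1000], [500, 500]])` repacks three half-empty nodes into a single node, freeing two nodes at flush time rather than on every modification.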
It is possible that merely squeezing a slum might free enough space that flushing is unnecessary, but the current implementation of Reiser4 always flushes the slums it squeezes. This is not necessarily the right approach, but we found it simpler and good enough for now. Another simplification we choose to engage in for now is that instead of trying to estimate whether squeezing a slum will save space before squeezing it, we just squeeze it and see.

Balanced trees have an inherent tradeoff between balancing cost and space efficiency. If, with every change to the tree, they consider more neighboring nodes for the purpose of merging them to save a node, then they can pack the tree more tightly, at the cost of moving more data with every change. By contrast, with a dancing tree, you simply take a large slum, shove everything in it as far to the left as it will go, and then free all the nodes in the slum that are left with nothing remaining in them, at the time of committing the slum's contents to disk in response to memory pressure. This gives you extreme space efficiency when slums are large, at a cost in data movement that is lower than it would be with an invariant balancing criterion, because it is done less often. By compressing at the time one flushes to disk, one compresses less often, which means one can afford to do it more thoroughly. By compressing dirty nodes that are in memory, one avoids performing additional I/O as a result of balancing.

= Procrastination Leads To Wiser Decisions: Allocate on Flush =

ReiserFS V3 assigns block numbers to nodes as it creates them. XFS is smarter: it waits until the last moment, just before writing nodes to disk. I'd like to thank the XFS team for making an effort to ensure that I understood the merits of their approach. The easy way to see those merits is to consider a file that is deleted before it reaches disk. Such a file should have no effect on the disk layout.
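The deleted-before-flush case can be sketched in a few lines. This is a toy model of delayed allocation, not Reiser4's or XFS's actual allocator:

```python
class DelayedAllocator:
    """Allocate-on-flush sketch: block numbers are assigned only when
    dirty data is flushed, so a file deleted before the flush never
    touches the disk layout at all."""
    def __init__(self):
        self.dirty = {}       # name -> data, not yet on disk
        self.disk = {}        # block number -> (name, data)
        self.next_block = 0
    def write(self, name, data):
        self.dirty[name] = data            # no block number assigned yet
    def delete(self, name):
        self.dirty.pop(name, None)         # never reaches the allocator
    def flush(self):
        for name, data in self.dirty.items():
            self.disk[self.next_block] = (name, data)  # allocate now
            self.next_block += 1
        self.dirty.clear()
```

Writing and then deleting a file before `flush()` leaves `disk` untouched, whereas an allocate-on-create scheme would already have consumed (and fragmented) block numbers.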
= Reiser4 The Atomic Filesystem =

== Reducing The Damage of Crashing ==

When a computer crashes, data in RAM that has not reached disk is lost. You might at first be tempted to think that we then want to keep all of the data that did reach disk. Suppose that you were performing a transfer of $10 from bank account A to bank account B, and this consisted of two operations: 1) debit $10 from A, and 2) credit $10 to B. Suppose that 1) but not 2) reached disk, and the computer crashed. It would be better to disregard 1) than to let 1) but not 2) take effect, yes?

When there is a set of operations which we ensure will either all take effect or none take effect, we call the set as a whole an atom. Reiser4 implements all of its filesystem system calls (requests to the kernel to do something are called system calls) as fully atomic operations, and allows one to define new atomic operations using its plugin infrastructure. Why don't all filesystems do this? Performance. Reiser4 employs new algorithms that allow it to make these operations atomic at little additional cost, where other filesystems have paid a heavy, usually prohibitive, price to do that. We hope to share with you how that is done.

= A Brief History Of How Filesystems Have Handled Crashes =

== Filesystem Checkers ==

Originally filesystems had filesystem checkers that would run after every crash. The problems with that were that 1) the checkers cannot handle every form of damage well, and 2) the checkers run for a long time.
The amount of data stored on hard drives increased faster than the transfer rate (the rate at which hard drives move data from the platter spinning inside them into the computer's RAM when asked to do one large continuous read, or the rate in the other direction for writes), which means that the checkers took longer to run, and as the decades ticked by it became less and less reasonable for a mission critical server to wait for the checker.

== Fixed Location Journaling ==

A solution was adopted of first writing each atomic operation to a location on disk called the journal or log, and then, only after each atom had fully reached the journal, writing it to the committed area of the filesystem. The problem with this is that twice as much data needs to be written. On the one hand, if the workload is dominated by seeks, this is not as much of a burden as one might think. On the other hand, for writes of large files it halves performance, because such writes are usually transfer time dominated. For this reason, meta-data journaling came to dominate general purpose usage. With meta-data journaling, the filesystem guarantees that all of its operations on its meta-data will be done atomically. If a file is being written to, the data being written may be corrupted as a result of non-atomic data operations, but the filesystem's internals will all be consistent. The performance advantage was substantial. V3 of ReiserFS offers both meta-data and data journaling, and defaults to meta-data journaling because that is the right solution for most users. Oddly enough, meta-data journaling is much more work to implement, because it requires being precise about what needs to be journaled. As is so often the case in programming, doing less work requires more code.
With fixed location data journaling, the overhead of making each operation atomic is too high for it to be appropriate for average applications that don't especially need it, because of the cost of writing twice. Applications that do need atomicity are written to use fsync and rename to accomplish it, and these tools are simply terrible for that job: terrible in performance, and terrible in the ugliness they add to the coding of applications. Stuffing a transaction into a single file just because you need the transaction to be atomic is hardly what one would call flexible semantics. Also, data journaling, with all its performance cost, still does not necessarily guarantee that every system call is fully atomic, much less that one can construct sets of operations that are fully atomic. It usually merely guarantees that the files will not contain random garbage, however many blocks of them happen to get written, and however much the application might view the result as inconsistent data. I hope you understand that, in providing these atomicity guarantees, we are trying to set a new expectation for how secure a filesystem should keep your data.

== Wandering Logs ==

One way to avoid having to write the data twice is to change one's definition of where the log area and the committed area are, instead of moving the data from the log to the committed area. There is an annoying complication to this, though: there are probably a number of pointers to the data from the rest of the filesystem, and we need them to point to the new data. When the commit occurs, we need to write those pointers so that they point to the data we are committing. Fortunately, these pointers tend to be highly concentrated as a result of our tree design. But wait: if we are going to update those pointers, then we want to commit those pointers atomically also, which we could do by writing them to another location and updating the pointers to them, and...
up the tree the changes ripple. When we get to the top of the tree, since disk drives write sectors atomically, the block number of the top can be written atomically into the superblock by the disk, thereby committing everything the new top points to. This is indeed the way WAFL, the Write Anywhere File Layout filesystem invented by Dave Hitz at Network Appliance, works. It always ripples changes all the way to the top, and indeed that works rather well in practice; most of their users are quite happy with its performance.

== Writing Twice May Be Optimal Sometimes ==

Suppose that a file is currently well laid out, you write to a single block in the middle of it, and you then expect to do many reads of the file. That is an extreme case illustrating that sometimes it is worth writing twice so that a block can keep its current location while committing atomically. If one writes a node twice in this way, one also does not need to update its parent and ripple all the way to the top of the tree.

Our code is a toolkit that can be used to implement different layout policies, and one of the available choices is whether to write over a block in its current place or to relocate it somewhere else. I don't think there is one right answer for all usage patterns. If a block is adjacent to many other dirty blocks in the tree, then this decreases the significance of the cost to read performance of relocating it and its neighbors. If one knows that a repacker will run once a week (a repacker is expected for V4.1, and is, a bit oddly, absent from WAFL), this also decreases the cost of relocation. After a few years of experimentation, measurement, and user feedback, we will say more about our experiences in constructing user selectable policies.

Do we pay a performance penalty for making Reiser4 atomic? Yes, we do. Is it an acceptable penalty?
We picked up a lot more performance from other improvements in Reiser4 than we lost to atomicity, so the cost is not isolated in our measurements, but I am unscientifically confident that the answer is yes. If changes are either large or batched together with enough other changes to become large, the performance penalty is low and drowned out by other performance improvements. Scattered small changes threaten us with read performance losses compared to overwriting in place and taking our chances with the data's consistency if there is a crash, but use of a repacker will mostly alleviate this scenario. I have to say that in my heart I don't have any serious doubts that, for the general purpose user, the increase in data security is worthwhile. The users, though, will have the final say.

= Committing =

A transaction preserves the previous contents of all modified blocks in their original locations on disk until the transaction commits; commit means the transaction has reached a state where it will be completed even if there is a crash.

The dirty blocks of an atom (which were captured and subsequently modified) are divided into two sets, relocate and overwrite, each of which is preserved in a different manner. The relocatable set is the set of blocks that have a dirty parent in the atom. The relocate set is those members of the relocatable set that will be written to a new or first location rather than overwritten. The overwrite set contains all dirty blocks in the atom that need to be written to their original locations, which is all those not in the relocate set: in practice, those which do not have a parent we want to dirty, plus those for which overwrite is the better layout policy despite the write-twice cost. Note that the superblock is the parent of the root node, and the free space bitmap blocks have no parent. By these definitions, the superblock and modified bitmap blocks are always part of the overwrite set.
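The classification just described can be sketched directly from the definitions. The data model here (block names, a parent-dirtiness map, a policy set) is illustrative, not Reiser4's actual data structures:

```python
def split_atom(dirty_blocks, parent_is_dirty, prefer_overwrite=frozenset()):
    """Divide an atom's dirty blocks into relocate and overwrite sets.
    A block is relocatable if its parent is also dirty in the atom;
    relocatable blocks go to the relocate set unless the layout policy
    prefers overwriting them in place.  Blocks with no dirty parent
    (e.g. the superblock and bitmap blocks) must be overwritten."""
    relocate, overwrite = set(), set()
    for block in dirty_blocks:
        if parent_is_dirty.get(block, False) and block not in prefer_overwrite:
            relocate.add(block)
        else:
            overwrite.add(block)
    return relocate, overwrite
```

For example, with two blocks whose parents are dirty and one superblock-like block with no dirty parent, marking one of the former as "better overwritten" yields a relocate set of one block and an overwrite set of two.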
The wandered set is the set of blocks to which the overwrite set will be written temporarily, until the overwrite set commits.

An interesting variation is the minimum overwrite set, which uses the same definitions as above with the following modification: if at least two dirty blocks have a common parent that is clean, then that parent is added to the minimum overwrite set, and its dirty children are removed from the overwrite set and placed in the relocate set. This policy is an example of what will be experimented with in later versions of Reiser4 using the layout toolkit. For space reasons we leave out the full details of exactly when we relocate vs. overwrite, and the reader should not regret this, because years of experimenting probably lie ahead before we can speak with the authority necessary for a published paper on the effects of the many details and variations possible.

When we commit, we write a wander list, which consists of a mapping from the wandered set to the overwrite set. The wander list is a linked list of blocks containing pairs of block numbers. The last act of committing a transaction is to update the superblock to point to the front of that list. Once that is done, if there is a crash, crash recovery will go through that list and "play" it, which means to write the wandered set over the overwrite set. If there is not a crash, we will also play it. There are many more details of how we handle the deallocation of wandered blocks, the handling of bitmap blocks, and so forth; you are encouraged to read the comments at the top of our source code files (e.g. wander.c) for such details.

= Journalling Optimizations =

== Copy-on-capture ==

Suppose one wants to capture a node which belongs to an atom whose stage is >= ASTAGE_PRE_COMMIT. Such a capture request would normally have to wait (sleep in capture_fuse_wait()) while the atom commits. The copy-on-capture optimization allows the capture request to be satisfied by creating a copy of the node being captured.
The commit process takes control of one copy of the node, and the capturing process takes control of the other. This does not lead to any node version conflicts, because it is guaranteed that the copy belonging to the commit process will not be modified.

== Steal-on-capture ==

The idea of the steal-on-capture optimization is that only the last committed transaction to modify an overwrite block actually needs to write that block; other transactions can skip writing that block post-commit. This optimization, which is also present in ReiserFS version 3, means that frequently modified overwrite blocks are written fewer than two times per transaction. With this optimization a frequently modified overwrite block may avoid being overwritten by a series of atoms; as a result, crash recovery must replay more atoms than it would without the optimization. If an atom has overwrite blocks stolen, the atom must be replayed during crash recovery until every stealing atom commits.

= Repacker =

Another way of escaping the balancing time vs. space efficiency tradeoff is to use a repacker. 80% of files on the disk remain unchanged for long periods of time. It is efficient to pack them perfectly, using a repacker that runs much less often than every write to disk. This repacker goes through the entire tree ordering, from left to right and then from right to left, alternating each time it runs. When it goes from left to right in the tree ordering, it shoves everything as far to the left as it will go, and when it goes from right to left it shoves everything as far to the right as it will go. (Left means small in key or in block number. :-) ) In the absence of filesystem activity, the effect of this over time is to sort the tree by its ordering (defragment) and to pack it with perfect efficiency. Reiser4.1 will modify the repacker to insert controlled "air holes", as it is well known that insertion efficiency is harmed by overly tight packing.
I hypothesize that it is more efficient to periodically run a repacker that systematically repacks using large IOs than to perform lots of one-block reads of the neighboring nodes of the modification points so as to preserve a balancing invariant in the face of poorly localized modifications to the tree.

Plugins

[Illustration: man holding three plugins]

8 Kinds of Plugins Make Reiser4 The Most Tweakable Filesystem Going

File Plugins

Every file possesses a plugin id, and every directory possesses a plugin id. This plugin id identifies a set of methods. The set of methods embodies all of the different possible interactions with the file or directory that come from sources external to ReiserFS. It is a layer of indirection added between the external interface to ReiserFS and the rest of ReiserFS. Each method has a methodid. It will be usual to mix and match methods from other plugins when composing plugins.

Directory Plugins

Reiser4 will implement a plugin for traditional directories. It will implement directory-style access to file attributes as part of the plugin for regular files. Later we will describe why this is useful. Other directory plugins we will leave for later versions. There is no deep reason for this deferral. It is simply the randomness of what features attract sponsors and make it into a release specification; there are no sponsors at the moment for additional directory plugins. I have no doubt that they will appear later; new directory plugins will be too much fun to miss out on.:-)

Hash Plugins

A directory is a mapping from file names to the files themselves. This mapping is implemented through the Reiser4 internal balanced tree. Unfortunately, file names cannot be used directly as keys until keys of variable length are implemented, or unless unreasonable limitations on maximal file name length are imposed. To work around this, the file name is hashed, and the hash is used as a key in the tree. No hash function is perfect, and there will always be hash collisions, that is, file names having the same hash value.
Previous versions of reiserfs (3.5 and 3.6) used a "generation counter" to overcome this problem: keys for file names having the same hash value were distinguished by having different generation counters. This allowed hash collisions to be tolerated, at the cost of reducing the number of bits used for hashing. This "generation counter" technique is actually an ad hoc form of support for non-unique keys. Keeping in mind that some form of this has to be implemented anyway, it seemed justifiable to implement more regular support for non-unique keys in Reiser4.

Another reason for using hashes is that some (arguably brain-dead) interfaces require them: telldir(3) and seekdir(3). These functions presume that the file system can issue 64-bit "cookies" that can be used to resume a readdir. Cookies are implemented in most filesystems as byte offsets within a directory (which means those filesystems cannot shrink directories), and in ReiserFS as hashes of file names plus a generation counter. Curiously enough, the Single UNIX Specification tags telldir(3) and seekdir(3) as "Extension", because "returning to a given point in a directory is quite difficult to describe formally, in spite of its intuitive appeal, when systems that use B-trees, hashing functions, or other similar mechanisms to order their directories are considered".

We order directory entries in ReiserFS by their cookies. This costs us performance compared to ordering lexicographically (but is immensely faster than the linear searching employed by most other Unix filesystems). Depending on the hash and its match to the application usage pattern, there may be more or less performance lossage. Hash plugins will probably remain until version 5 or so, when directory plugins and ordering function plugins will obsolete them. Directory entries will then be ordered by file names, as they should be (and possibly stem compressed as well).

Security Plugins

Security plugins handle all security checks.
They are normally invoked by file and directory plugins. An example of reading a file:

* Access the pluginid for the file.
* Invoke the read method for the plugin.
* The read method determines the security plugin for the file.
* That security plugin invokes its read check method to determine whether to permit the read.
* The read check method for the security plugin reads the file/attributes containing the permissions on the file.
* Since file/attributes are also files, this means invoking the plugin for reading the file/attribute.
* The pluginid for this particular file/attribute for this file happens to be inherited (saving space and centralizing control of it).
* The read method for the file/attribute is coded such that it does not check permissions when called by a security plugin method. (Endless recursion is thereby avoided.)
* The file/attribute plugin employs a decompression algorithm specially designed for efficient decompression of our encoding of ACLs.
* The security plugin determines that the read should be permitted.
* The read method continues and completes.

Item Plugins

The balancing code will be able to balance an item iff it has an item plugin implemented for it. The item plugin will implement each of the methods the balancing code needs (methods such as splitting items, estimating how large the split pieces will be, overwriting, appending to, cutting from, or inserting into the item, etc.). In addition to all of the balancing operations, item plugins will also implement intra-item search methods. V3 of ReiserFS understood the structure of the items it balanced. This made adding new types of items, storing such new security attributes as other researchers might develop, too expensive in coding time, greatly inhibiting their addition to ReiserFS.
In writing Reiser4 we hoped that there would be a great proliferation in the types of security attributes in ReiserFS if we made adding one a matter not of modifying the balancing code (a job for our most experienced programmers) but of writing an item handler. This is necessary if we are to achieve our goal of making the addition of each new security attribute an order of magnitude or more easier to perform than it is now.

Key Assignment Plugins

When assigning the key to an item, the key assignment plugin is invoked, and it has a key assignment method for each item type. A single key assignment plugin is defined for the whole FS at FS creation time. We know from experience that there is no "correct" key assignment policy; squid has very different needs from average user home directories. Yes, there could be value in varying it more flexibly than just at FS creation time, but we have to draw the line somewhere when deciding what goes into each release....

Node Search and Item Search Plugins

Every node layout has a search method for that layout, and every item that is searched through has a search method for that item. (When doing searches, we search through a node to find an item, and then search within the item when it contains multiple things to find.)

Putting Your New Plugin To Work Will Mean Recompiling

If you want to add a new plugin, we think your having to ask the sysadmin to recompile the kernel with your new plugin added to it will be acceptable for version 4.0. We will initially code plugin-id lookup as an in-kernel fixed-length array lookup, code methodids as function pointers, and make no provision for post-compilation loading of plugins. Performance and coding cost motivate this.
[Illustration: one character almost drowning while another hands him a plugin]

Without Plugins We Will Drown

People often ask, as ReiserFS grows in features, how will we keep the design from being drowned under the weight of the added complexity, and from reaching the point where it is difficult to work on the code? The infrastructure to support security attributes implemented as files also enables lots of features not necessarily security related. The plugins we are choosing to implement in v4.0 are all security related because of our funding source, but users will add other sorts of plugins, just as they took DARPA's TCP/IP and used it for non-military computers. Only by requiring that all features be implemented in the manner that maximizes code reuse will we keep ReiserFS coding complexity down to where we can manage it over the long term.

Plugins: FS Programming For The Lazy

Most plugins will have only a very few of their features unique to them, and the rest of the plugin will be reused code. What Namesys sees as its role as a DARPA contractor is not primarily supplying a suite of security plugins, though we are doing that, but creating an architectural enabling (not just a licensing one) of lots of outside vendors to efficiently create lots of innovative security plugins that Namesys would never have imagined working by itself.

Enhancing Security

[Illustration: superman character complaining about an emergency]

By far the most casualties in wars have always been civilian. In future information infrastructure attacks, who will take more damage, civilian or military installations? DARPA is funding us to make all Gnu/Linux computers throughout the world a little bit more resistant to attack.

Fine Graining Security

Good Security Requires Precision In Specification Of Security

Suppose you have a large file with many components. A general principle of security is that good security requires precision of permissions.
When security lacks precision, it increases the burden of being secure; the extent to which users adhere to security requirements in practice is a function of the burden of adhering to them.

Space Efficiency Concerns Motivate Imprecise Security

Many filesystems make it space-inefficient to store small components as separate files, for various reasons. Components that are not separate files cannot have separate permissions, so one of the reasons for using overly aggregated units of security is space efficiency. ReiserFS currently improves this by an order of magnitude over most of the existing alternative art. Space efficiency is the hardest of the reasons to eliminate; eliminating it makes it that much more enticing to attempt to eliminate the other reasons.

Security Definition Units And Data Access Patterns Sometimes Inherently Don't Align

Applications sometimes want to operate on a collection of components as a single aggregated stream. (Note that commonly two different applications want to operate on data at different levels of aggregation; the infrastructure for solving this security issue will solve that problem as well.)

/etc/passwd As Example

I am going to use the /etc/passwd file as an example, not because I think that other solutions won't solve its problems better, but because its implementation as a single flat file in the early Unixes is a wonderful illustrative example of poorly granularized security, one whose problems readers may share my personal experiences with. I hope they will be able to imagine that other, less famous data files could have similar problems. Have you ever tried to figure out just exactly what part of your continually changing /etc/passwd file changed near the time of a break-in? Have you ever wished that you could have a modification time on each field in it?
Have you ever wished that users could change part of it, such as the gecos field, themselves (setuid utilities have been written to allow this, but this is a pedagogical, not a practical, example), but not have the power to change it for other users? There were good reasons why /etc/passwd was first implemented as a single file with one single permission governing the entire file. If we can eliminate them one by one, the same techniques for making finer grained security effective will be of value to other highly secure data files.

Aggregating Files Can Improve The User Interface To Them

Consider the use of emacs on a collection of a thousand small 8-32 byte files, like you might have if you deconstructed /etc/passwd into small files with separable acls for every field. It is more convenient in screen real estate, buffer management, and other user interface considerations to operate on them as an aggregation all placed into a single buffer, rather than as a thousand 8-32 byte buffers.

How Do We Write Modifications To An Aggregation?

Suppose we create a plugin that aggregates all of the files in a directory into a single stream. How does one handle writes to that aggregation that change the length of the components of that aggregation? Richard Stallman pointed out to me that if we separate the aggregated files with delimiters, then emacs need not be changed at all to acquire an effective interface for large numbers of small files accessed via an aggregation plugin. If /new_syntax_access_path/big_directory_of_small_files/.glued is a plugin that aggregates every file in big_directory_of_small_files with a delimiter separating every file within the aggregation, then one can simply type emacs /new_syntax_access_path/big_directory_of_small_files/.glued, and the filesystem has done all the work emacs needs to be effective at this. Not a line of emacs needs to be changed.
One needs to be able to choose different delimiting syntax for different aggregation plugins so that one can, for, say, the passwd file, aggregate subdirectories into lines, and files within those subdirectories into colon-separated fields within the line. XML would benefit from yet other delimiter construction rules. (We have been told by Philipp Guehring of LivingXML.NET that ReiserFS is higher performance than any database for storing XML, so this issue is not purely theoretical.)

Aggregation Is Best Implemented As Inheritance

In summary, to be able to achieve precision in security we need to have inheritance with specifiable delimiters, and we need whole-file inheritance to support ACLs.

One Plugin Using Delimiters That Resemble sys_reiser4() Syntax

We provide the infrastructure for constructing plugins that implement arbitrary processing of writes to inheriting files, but we also supply one generic inheriting-file plugin that intentionally uses delimiters very close to the sys_reiser4() syntax. We will document the syntax more fully when that code is working; for now, syntax details are in the comments in the file invert.c in the source code.

API Suitable For Accessing Files That Store Security Attributes

A new system call, sys_reiser4(), will be implemented to support applications that don't have to be fooled into thinking that they are using POSIX. Through this entry point, a richer set of semantics will access the same files that are also accessible using POSIX calls. reiser4() will not implement more than hierarchical names. A full set theoretic naming system as described on our future vision page will not be implemented before SSN Reiserfs is implemented. (Distributed Reiserfs is our distributed filesystem; Semi-Structured Naming Reiserfs is our enhanced semantics; whether we implement Distributed Reiserfs or SSN Reiserfs first depends on which sponsors we find ;-) )
reiser4() will implement all features necessary to access ACLs as files/directories rather than as something neither file nor directory. These include opening and closing transactions, performing a sequence of I/Os in one system call, and accessing files without the use of file descriptors (necessary for efficient small I/O). reiser4() will use a syntax suitable for evolving into SSN Reiserfs syntax, with its set theoretic naming.

Flaws In Traditional File API When Applied To Security Attributes

Security related attributes tend to be small. The traditional filesystem API for reading and writing files has these flaws in the context of accessing security attributes:

* Creating a file descriptor is excessive overhead, and not useful, when accessing an 8 byte attribute.
* A system call for every attribute accessed is too much overhead when accessing lots of little attributes.
* Lacking constraints: it is important to constrain what is written to the attribute, often in complex ways.
* Lacking atomic semantics: often one needs to update multiple attributes as one action that is guaranteed to either fully succeed or fully fail.

The Usual Resolution Of These Flaws Is A One-Off Solution

The usual response to these flaws is that people adding security related and other attributes create a set of methods unique to their attributes, plus non-reusable code to implement those methods, in which their particular attributes are accessed and stored not using the methods for files but using their particular methods for that attribute. Their particular API for that attribute typically does a one-off instantiation of a lightweight, single-system-call, write-constrained, atomic access, with no code being reusable by those who want to modify file bodies. It is basic and crucial to system design to decompose desired functionality into reusable, orthogonal, separated components.
Persons designing security attributes are typically doing it without the filesystem offering them the proper foundation and toolkit that they want. They need more help from us core FS developers. Linus said that we can have a system call to use as our experimental plaything in this. With what I have in mind for the API, one rather flexible system call is all we want for creating atomic, lightweight, batched, constrained accesses to files, with each of those adjectives being an orthogonal, optional feature that may or may not be invoked in a particular instance of the new system call.

One-Off Solutions Are A Lot of Work To Do A Lot Of

Looking at the coin from the other side, we want to make it an order of magnitude less work to add features to ReiserFS, so that both users and Namesys can add at least an order of magnitude more of them. To verify that it is truly more extensible, you have to do some extending, and our DARPA funding motivates us to instantiate most of those extensions as new security features. This system call's syntax enables attributes to be implemented as a particular type of file. It avoids uglifying the semantics with two APIs for two supposedly different kinds of objects that don't truly need different treatment. All of its special features that are useful for accessing particular attributes are also available for use on files. It has symmetry, and its features have been fully orthogonalized. There is nothing particularly interesting about this system call to a languages specialist (its ideas were explored decades ago, except by filesystem developers) until SSN Reiserfs, when we will further evolve it into a set theoretic syntax that deconstructs tuple structured names into hierarchy and vicinity set intersection. That is described at www.namesys.com/whitepaper.html

Steps For Creating A Security Attribute

You can create a new security attribute by:

* Defining a pluginid.
* Composing a set of methods for the plugin from ones you create or reuse from other existing plugins.
* Defining a set of items that act as the storage containers of the object, or reusing existing items from other plugins (e.g. regular files).
* Implementing item handlers for all of the new items you create.
* Creating a key assignment algorithm for all of the new items.

reiser4() System Call Description

The reiser4() system call (still being debugged at the time of writing) executes a sequence of commands separated by commas. Assignment and transaction are the commands supported in reiser4(); more commands will appear in SSN Reiserfs. <- and <<- are two of the assignment operators.

lhs (assignment target) values:

* /..../process/range/(offset<-(loff_t),last_byte<-(loff_t)) assigns (writes) to the buffer starting at address offset in the process address space, ending at last_byte. (The assignment source may be smaller or larger than the assignment target.) The representation of offset and last_byte is left to the coder to determine; it is an issue that will be of much dispute and little importance. Notice that / is used to indicate that the order of the operands matters; see the future vision whitepaper for details of why this is appropriate syntax design. Note the lack of a file descriptor.
* /filename assigns to the file named filename.
* /filename/..../range/(offset<-(loff_t),last_byte<-(loff_t)) writes to the body, starting at offset, ending not past last_byte.
* /filename/..../range/(offset<-(loff_t)) writes to the body starting at offset.

rhs (assignment source) values:

* /..../process/range/(offset<-(loff_t),last_byte<-(loff_t)) reads from the buffer starting at address offset in the process address space, ending at last_byte. The representation of offset and last_byte is left to the coder to determine, as it is an issue that will be of much dispute and little importance.
* /filename reads the entirety of the file named filename.
* /filename/..../range/(offset<-(loff_t),last_byte<-(loff_t)) reads from the body, starting at offset, ending not past last_byte.
* /filename/..../range/(offset<-(loff_t)) reads from the body starting at offset until the end.
* /filename/..../stat/owner reads from the ownership field of the stat data. (Stat data is that which is returned by the stat() system call (owner, permissions, etc.) and stored on a per file basis by the FS.)

Note that "...." and "process" are style conventions for the name of a hidden subdirectory implementing methods and accessing metadata supported by a plugin. It is possible to rename it, etc. We had a discussion about whether to instead use names that could not clash with any legitimate name likely to be used by users. Vladimir Demidov suggested that cryptic names have historically harmed the acceptance of several languages, and so it was realized that being novice-unfriendly in the naming was worse than risking a name collision, especially since a collision could be cured by using rename on "...." and "process" for the few cases where it is necessary.

Constraints

(Note: this is not yet coded.) Another way security may be insufficiently fine grained is in values: it can be useful to allow persons to change data, but only within certain constraints. For this project we will implement plugins; one type of plugin will be write constraints. Write constraints are invoked upon a write to a file; if they return non-error, then the write is allowed. We will implement two trivial sample write-constraint plugins. One will be in the form of a kernel function, loadable as a kernel module, which returns non-error (thus allowing the write) if the file consists of the strings "secret" or "sensitive" but not "top-secret". The other, which does exactly the same, will be in the form of a perl program residing in a file and executed in user space.
The use of kernel functions will have performance advantages, particularly for small functions, but severe disadvantages in scripting power, flexibility, and the ability to be installed by non-secure sources. Both types of plugins will have their place. Note that ACLs will also embody write constraints. We will implement both constraints that are compiled into the kernel and constraints that are implemented as user space processes. Specifically, we will implement a plugin that executes an arbitrary constraint contained in an arbitrarily named file as a user space process, passes the proposed new file contents to that process as standard input, and iff the process exits without error allows the write to occur. It can be useful to have read constraints as well as write constraints.

Auditing

(Note: this is not yet coded.) We will implement a plugin that notifies administrators by email when access is made to files, e.g. read access. With each plugin implemented, creating additional plugins becomes easier as the available toolkit is enriched. Auditing constitutes a major additional security feature, yet it will be easy to implement once the infrastructure to support it exists. (It would be substantial work to implement it without that infrastructure.) The scope of this project is not the creation of the plugins themselves, but the creation of the infrastructure that plugin authors will find useful. We want to enable future contributors to implement more secure systems on the Gnu/Linux platform, not implement them ourselves. By laying a proper foundation and creating a toolkit for those who follow us, we hope to reduce the cost of coding new security attributes by an order of magnitude. Employing a proper set of well orthogonalized primitives also changes the addition of these attributes from being a complexity burden upon the architecture into being an empowering extension of the architecture.
Increasing the Allowed Granularity of Security

[Illustration: man holding a sieve; only objects of a certain size go through]

(This feature is not yet coded.) Inheritance of security attributes is important to providing flexibility in their administration. We have spoken about making security more fine grained, but sometimes it needs to be larger grained. Sometimes a large number of files are logically one unit with regard to their security, and it is desirable to have a single point of control over their security. Inheritance of attributes is the mechanism for implementing that. Security administrators should have the power to choose whatever units of security they desire, without having to distort them to make them correspond to semantic units. Inheritance of file bodies using aggregation plugins allows the units of security to be smaller than files; inheritance of attributes allows them to be larger than files.

Encryption On Commit

Currently, encrypted files suffer severely in their write performance when implemented using schemes that encrypt at every write() rather than at every commit to disk. We encrypt on flush, such that a file with an encryption plugin id is encrypted not at the time of write but at the time of flush to disk. Encryption is implemented as a special form of repacking on flush, and it occurs for any node which has its CONTAINS_ENCRYPTED_DATA state flag set.

Conclusion

Reiser4 offers a dramatically better infrastructure for creating new filesystem features. Files and directories have all of the features needed to make it unnecessary for file attributes to be something different from files. The effectiveness of this new infrastructure is tested using a variety of new security features. Performance is greatly improved by the use of dancing trees, wandering logs, allocate on flush, a repacker, and encryption on commit. It was an important question whether we could increase the level of abstraction in our design without harming performance.
Reiser4 gives you BOTH the most cleanly abstracted storage AND the highest performance storage of any filesystem.

Citations:

* [Gray93] Jim Gray and Andreas Reuter. "Transaction Processing: Concepts and Techniques". Morgan Kaufmann Publishers, Inc., 1993. Old but good textbook on transactions. Available at http://www.mkp.com/books_catalog/catalog.asp?ISBN=1-55860-190-2
* [Hitz94] D. Hitz, J. Lau and M. Malcolm. "File system design for an NFS file server appliance". Proceedings of the 1994 USENIX Winter Technical Conference, pp. 235-246, San Francisco, CA, January 1994. Available at http://citeseer.nj.nec.com/hitz95file.html
* [TR3001] D. Hitz. "A Storage Networking Appliance". Tech. Rep. TR3001, Network Appliance, Inc., 1995. Available at http://www.netapp.com/tech_library/3001.html
* [TR3002] D. Hitz, J. Lau and M. Malcolm. "File system design for an NFS file server appliance". Tech. Rep. TR3002, Network Appliance, Inc., 1995. Available at http://www.netapp.com/tech_library/3002.html
* [Ousterh89] J. Ousterhout and F. Douglis. "Beating the I/O Bottleneck: A Case for Log-Structured File Systems". ACM Operating System Reviews, Vol. 23, No. 1, pp. 11-28, January 1989. Available at http://citeseer.nj.nec.com/ousterhout88beating.html
* [Seltzer95] M. Seltzer, K. Smith, H. Balakrishnan, J. Chang, S. McMains and V. Padmanabhan. "File System Logging versus Clustering: A Performance Comparison". Proceedings of the 1995 USENIX Technical Conference, pp. 249-264, New Orleans, LA, January 1995. Available at http://citeseer.nj.nec.com/seltzer95file.html
* [Seltzer95Supp] M. Seltzer. "LFS and FFS Supplementary Information". 1995. http://www.eecs.harvard.edu/~margo/usenix.195/
* [Ousterh93Crit] J. Ousterhout. "A Critique of Seltzer's 1993 USENIX Paper". http://www.eecs.harvard.edu/~margo/usenix.195/ouster_critique1.html
* [Ousterh95Crit] J. Ousterhout. "A Critique of Seltzer's LFS Measurements". http://www.eecs.harvard.edu/~margo/usenix.195/ouster_critique2.html
* [SwD96] A.
Sweeney, D. Doucette, W. Hu, C. Anderson, M. Nishimoto and G. Peck. "Scalability in the XFS File System". Proceedings of the 1996 USENIX Technical Conference, pp. 1-14, San Diego, CA, January 1996. Available at http://citeseer.nj.nec.com/sweeney96scalability.html
* [VelskiiLandis] G.M. Adel'son-Vel'skii and E.M. Landis. "An algorithm for the organization of information". Soviet Math. Doklady 3, 1259-1262, 1962. This paper on AVL trees can be thought of as the founding paper of the field of storing data in trees. Those not conversant in Russian will want to read the [Lewis and Denenberg] treatment of AVL trees in its place. [Wood] contains a modern treatment of trees.
* [Apple] "Inside Macintosh: Files", by Apple Computer Inc., Addison-Wesley, 1992. Employs balanced trees for filenames; it was an interesting filesystem architecture for its time in a number of ways, but its problems with internal fragmentation have become more severe as disk drives have grown larger. I look forward to the replacement they are working on.
* [Bach] Maurice J. Bach. "The Design of the Unix Operating System". Prentice-Hall Software Series, Englewood Cliffs, NJ, 1986. Superbly written but sadly dated; contains detailed descriptions of the filesystem routines and interfaces in a manner especially useful for those trying to implement a Unix compatible filesystem. See [Vahalia].
* [BLOB] R. Haskin and Raymond A. Lorie. "On Extending the Functions of a Relational Database System". SIGMOD Conference, 1982, pp. 207-212 (body of paper not on web). Reiser4 obsoletes this approach.
* [Chen] P.M. Chen and David A. Patterson. "A New Approach to I/O Performance Evaluation: Self-Scaling I/O Benchmarks, Predicted I/O Performance". 1993 ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems; also available on Chen's web page.
* [C-FFS] Gregory R. Ganger and M. Frans Kaashoek. "Embedded Inodes and Explicit Grouping: Exploiting Disk Bandwidth for Small Files".
A very well written paper focused on 1-10k file size issues; they use some notions similar to ours (most especially their concept of grouping, compared to my packing localities). Note that they focus on the 1-10k file size range, and not the sub-1k range. The 1-10k range is the weak point in ReiserFS V3 performance. The page with a link to the postscript paper is available at http://amsterdam.lcs.mit.edu/papers/cffs.html
* [ext2fs] By Remi Card; extensive information and source code are available. Probably our toughest current competitor, though it is showing its age, and recent enhancements of it (journaling, htrees, etc.) have not been performance effective. It embodies both the strengths and weaknesses of the incrementalist approach to coding, and substantially resembles the older FFS filesystem from BSD.
* [FFS] M. McKusick, W. Joy, S. Leffler and R. Fabry. "A Fast File System for UNIX". ACM Transactions on Computer Systems, Vol. 2, No. 3, pp. 181-197, August 1984. Describes the implementation of a filesystem which employs parent directory location knowledge in determining file layout. It uses large blocks for all but the tail of files to improve I/O performance, and uses small blocks called fragments for the tails so as to reduce the cost due to internal fragmentation. Numerous other improvements are also made to what was once the state of the art. FFS remains the architectural foundation for many current block allocation filesystems, and was later bundled with the standard Unix releases. Note that unrequested serialization and the use of fragments place it at a performance disadvantage to ext2fs, though whether ext2fs is thereby made less reliable is a matter of dispute that I take no position on (Reiser4 is an atomic filesystem, which is a different level of reliability entirely). Available at http://citeseer.nj.nec.com/mckusick84fast.html
* [Ganger] Gregory R. Ganger and Yale N. Patt. "Metadata Update Performance in File Systems".
(Abstract only.)
* [Gifford] Describes a filesystem enriched to have more than hierarchical semantics. He shares many goals with this author; forgive me for thinking his work worthwhile. If I had to suggest one improvement in a sentence, I would say his semantic algebra needs closure. (PostScript only.)
* [Hitz, Dave] A rather well designed filesystem optimized for NFS and RAID in combination. Note that RAID increases the merits of write-optimization in block layout algorithms. Available at http://www.netapp.com/technology/level3/3002.html
* [Holton and Das] Holton, Mike, Das, Raj. "The XFS space manager and namespace manager use sophisticated B-Tree indexing technology to represent file location information contained inside directory files and to represent the structure of the files themselves (location of information in a file)". Note that it is still a block (extent) allocation based filesystem; no attempt is made to store the actual file contents in the tree. It is targeted at the needs of the other end of the file size usage spectrum from ReiserFS, and is an excellent design for that purpose (though most filesystems, including Reiser4, do well at writing large files, and I think it is medium-sized and smaller files where filesystems can substantively differentiate themselves). SGI has also traditionally been a leader in resisting the use of unrequested serialization of I/O. Unfortunately, the paper is a bit vague on details. Available at http://www.sgi.com/Technology/xfs-whitepaper.html
* [Howard] Howard, J.H., Kazar, M.L., Menees, S.G., Nichols, D.A., Satyanarayanan, M., Sidebotham, R.N., West, M.J. "Scale and Performance in a Distributed File System". ACM Transactions on Computer Systems, 6(1), February 1988. A classic benchmark; it was too CPU bound to effectively stress ext2fs and ReiserFS, and is no longer very effective for modern filesystems.
* [Knuth] Knuth, D.E. The Art of Computer Programming, Vol.
3 (Sorting and Searching), Addison-Wesley, Reading, MA, 1973. The earliest reference discussing trees storing records of varying length.
* [LADDIS] Wittle, Mark, and Keith, Bruce. "LADDIS: The Next Generation in NFS File Server Benchmarking". Proceedings of the Summer 1993 USENIX Conference, July 1993, pp. 111-128.
* [Lewis and Denenberg] Lewis, Harry R., Denenberg, Larry. "Data Structures & Their Algorithms". HarperCollins Publishers, NY, NY, 1991. An algorithms textbook suitable for readers wishing to learn about balanced trees and their AVL predecessors.
* [McCreight] McCreight, E.M. "Pagination of B*-trees with variable length records". Commun. ACM 20 (9), 670-674, 1977. Describes algorithms for trees with variable length records.
* [McVoy and Kleiman] The implementation of write-clustering for Sun's UFS. Available at http://www.sun.ca/white-papers/ufs-cluster.html
* [OLE] "Inside OLE" by Kraig Brockschmidt; discusses Structured Storage (abstract only). Structured storage is what you get when application developers need features to better manage the storage of objects on disk by the applications they write, and the filesystem group at their company can't be bothered with them. Miserable performance, miserable semantics. Available at http://www.microsoft.com/mspress/books/abs/5-843-2b.htm
* [Ousterhout] J.K. Ousterhout, H. Da Costa, D. Harrison, J.A. Kunze, M.D. Kupfer, and J.G. Thompson. "A Trace-driven Analysis of the UNIX 4.2BSD File System". In Proceedings of the 10th Symposium on Operating Systems Principles, pages 15-24, Orcas Island, WA, December 1985.
* [NTFS] "Inside the Windows NT File System". The book is written by Helen Custer; NTFS is architected by Tom Miller with contributions by Gary Kimura, Brian Andrew, and David Goebel. Microsoft Press, 1994. An easy-to-read little book. They fundamentally disagree with me on adding serialization of I/O not requested by the application programmer, and I note that the performance penalty they pay for their decision is high, especially compared with ext2fs. Their FS design is perhaps optimal for floppies and other hardware-eject media beyond OS control. A less serialized, higher performance log-structured architecture is described in [Rosenblum and Ousterhout]. That said, Microsoft is to be commended for recognizing the importance of attempting to optimize for small files, and leading the OS designer effort to integrate small objects into the file name space. This book is notable for not referencing the work of persons not working for Microsoft, or providing any form of proper attribution to previous authors such as [Rosenblum and Ousterhout]. Though perhaps they really didn't read any of the literature, and that explains why theirs is the worst performing filesystem in the industry....
* [Peacock] K. Peacock. "The CounterPoint Fast File System". Proceedings of the Usenix Conference, Winter 1988.
* [Pike] Rob Pike and Peter Weinberger. "The Hideous Name". USENIX Summer 1985 Conference Proceedings, pp. 563, Portland, Oregon, 1985. Short, informal, and drives home why inconsistent naming schemes in an OS are detrimental. Available at http://achille.cs.bell-labs.com/cm/cs/doc/85/1-05.ps.gz. His discussion of naming in Plan 9: http://plan9.bell-labs.com/plan9/doc/names.html
* [Rosenblum and Ousterhout] M. Rosenblum and J. Ousterhout. "The Design and Implementation of a Log-Structured File System". ACM Transactions on Computer Systems, Vol. 10, No. 1, pp. 26-52, February 1992. Available at http://citeseer.nj.nec.com/rosenblum91design.html.
This paper was quite influential in a number of ways on many modern filesystems, and the notion of using a cleaner may be applied to a future release of ReiserFS. There is an interesting ongoing debate over the relative merits of FFS vs. LFS architectures; the interested reader may peruse http://www.scriptics.com/people/john.ousterhout/seltzer93.html and the arguments by Margo Seltzer it links to.
* [Snyder] "tmpfs: A Virtual Memory File System". Discusses a filesystem built to use swap space and intended for temporary files; due to a complete lack of disk synchronization it offers extremely high performance.
* [Vahalia] Uresh Vahalia. "Unix Kernel Internals".
* [Reiser93] Reiser, Hans T. "Future Vision" whitepaper, 1984, revised 1993. Available at http://www.namesys.com/whitepaper.html.
[[category:Reiser4]]

Reasons why Reiser4 is great for you:
* Reiser4 is the fastest filesystem, and here are the benchmarks.
* Reiser4 is an atomic filesystem, which means that your filesystem operations either entirely occur, or they entirely don't, and they don't corrupt due to half occurring. We do this without significant performance losses, because we invented algorithms to do it without copying the data twice.
* Reiser4 uses dancing trees, which obsolete the balanced tree algorithms used in databases (see farther down). This makes Reiser4 more space efficient than other filesystems, because we squish small files together rather than wasting space due to block alignment like they do. It also means that Reiser4 scales better than any other filesystem. Do you want a million files in a directory, and want to create them fast? No problem.
* Reiser4 is based on plugins, which means that it will attract many outside contributors, and you'll be able to upgrade to their innovations without reformatting your disk.
If you like to code, you'll really like plugins....
* Reiser4 is architected for military grade security. You'll find it is easy to audit the code, and that assertions guard the entrance to every function.

V3 of reiserfs is used as the default filesystem for SuSE, Lindows, FTOSX, Libranet, Xandros and Yoper. We don't touch the V3 code except to fix a bug, and as a result we don't get bug reports for the current mainstream kernel version. It shipped before the other journaling filesystems for Linux, and is the most stable of them as a result of having been out the longest. We must caution that, just as Linux 2.6 is not yet as stable as Linux 2.4, it will also be some substantial time before V4 is as stable as V3.

= Table of Contents =
# Software Engineering Based Reiser4 Design Principles
## Equal Source Code Access Is A Civil Right
## Software Libre Takes More Than A License --- It Takes A Design
## Why Limit Interactions With Objects Strictly?
# Basic Semantics
## Files
### The Software Engineering Lurking Below File Plugins
## Names and Objects
## Ordering of Name Components
## Directories
### The Unix Directory Plugin
### Some Historical Details Of Design Flaws In The Unix Directory Interface
### Directories Are Unordered
### Files That Are Also Directories
### Hidden Directory Entries
## New Security Attributes and Set Theoretic Semantic Purity
### Minimizing Number Of Primitives Is Important In Abstract Constructions
### Can We Get By Using Just Files and Directories (Composing Streams And Attributes From Files And Directories)?
### List Of Features Needed To Get Attribute And Stream Functionality From Files And Directories
# Basic Tree Concepts
## Trees, Nodes, and Items
### Definition of Tree
### Fine Points of the Definition
### Graphs vs. Trees
### Ordering The Tree Aids Searching Through It
#### Keys
#### Choosing Which Subtree
## Nodes
### Leaves, Twigs, and Branches
### Size of Nodes
### Sharing Blocks Saves Space
## Items
### The Structure of an Item
### Types Of Items
### Units
## What the Default Node Formats For ReiserFS 4.0 Look Like
# Tree Design Concepts
## Height Balancing versus Space Balancing
## Three principle considerations in tree design
## Fanout
## What Are B+Trees, and Why Are They Better than B-Trees
### B+Trees Have Higher Fanout Than B-Trees
### Cache Design Principles
#### Reiser's Untie The Uncorrelated Principle of Cache Design
#### Reiser's Maximize The Variance Principle of Cache Design
#### Pointers To Nodes Have A Higher Average Temperature Than The Nodes They Point To
#### Segregating By Temperature Directly
#### BLOBs Unbalance the Tree, Reduce Segregation of Pointers and Data, and Thereby Reduce Performance
## Dancing Trees Are Faster Than Balanced Trees
### If It Is In RAM, Dirty, and Contiguous, Then Squeeze It ALL Together Just Before Writing
### Procrastination Leads To Wiser Decisions: Allocate on Flush
# Reiser4 The Atomic Filesystem
## Reducing The Damage of Crashing
## A Brief History Of How Filesystems Have Handled Crashes
## Filesystem Checkers
## Fixed Location Journaling
## Wandering Logs
## Writing Twice May Be Optimal Sometimes
## Committing
## Journalling optimizations
### Copy-on-capture
### Steal-on-capture
# Repacker
# Plugins
## 8 Kinds of Plugins Make Reiser4 The Most Tweakable Filesystem Going
### File Plugins
### Directory Plugins
### Hash Plugins
### Security Plugins
### Item Plugins
### Key Assignment Plugins
### Node Search and Item Search Plugins
### Putting Your New Plugin To Work Will Mean Recompiling
## Without Plugins We Will Drown
## Plugins: FS Programming For The Lazy
# Enhancing Security
## Fine Graining Security
### Good Security Requires Precision In Specification Of Security
### Space Efficiency Concerns Motivate Imprecise Security
### Security Definition Units And Data Access Patterns Sometimes Inherently Don't Align
### /etc/passwd As Example
### Aggregating Files Can Improve The User Interface To Them
### How Do We Write Modifications To An Aggregation
### Aggregation Is Best Implemented As Inheritance
### One Plugin Using Delimiters That Resemble sys_reiser4() Syntax
## API Suitable For Accessing Files That Store Security Attributes
### Flaws In Traditional File API When Applied To Security Attributes
### The Usual Resolution Of These Flaws Is A One-Off Solution
### One-Off Solutions Are A Lot of Work To Do A Lot Of
## Steps For Creating A Security Attribute
### reiser4() System Call Description
### Constraints
### Auditing
## Increasing the Allowed Granularity of Security
## Encryption On Commit
# Conclusion
# Citations

= Software Engineering Based Reiser4 Design Principles =

== Equal Source Code Access Is A Civil Right ==

Copyright and patent laws were invented to give you an incentive to share your knowledge with the rest of the world in return for a limited time monopoly on what you shared. That is not the way it works with software, though, because software companies are allowed to keep their source code secret, but are still given monopoly rights over their software. There is little meaningful sharing of knowledge when only binaries are shared with the world, and all the rest is kept secret. The reasons for the existence of copyright and patent laws have been forgotten, their workings have been twisted, and greed and turf defense are what remain of them. Monopoly interests have taken laws intended to promote progress in the arts and sciences, and now use them to further their own control over us by ensuring that innovations not theirs cannot enter the market for improvements to software.

Think of software objects as forming a society, not yet at the level of an AI society, but still a group of programs interacting, and choosing whether to interact, with each other. Think of social lockout, whether it be in the form of racial discrimination as in the civil rights movement, Mercantilism as happened a few centuries ago, or the endless other forms of division in human society.
Is it so surprising that this evil casts its shadow on cyberspace? Is it so surprising that our cybershadows also find ways to engage in social lockout of others? Most of the cyber-world of software lives under tyranny today. We are part of a movement to create a free cyber-world we can all participate in equally.

Namesys does not oppose copyright laws as they were invented (14-year monopolies which disclosed everything that was temporarily monopolized); it opposes copyright laws as they have been twisted. Namesys opposes unlimited-time monopolies which disclose nothing, and lock out all other inventors. Many others in this movement are opposed to copyright law, even the version of it in which it was first created. We feel they are not acknowledging that a trade-off is being made, and that this trade-off has value. Yet still we choose to give our software away for free for use with software that is given away for free (e.g. Gnu/Linux).

Since we don't have a lot of illusions about our ability to entirely change the world, and it is amusing to sell free software, for those who do not want to disclose their software and do not want to give it away for free, we charge a license fee and let them keep their improvements to our software without sharing them. These fees help substantially in allowing us to survive as an organization. We don't make nearly as much money as we would from charging everyone for usage rights, but we do make just enough to get by, and that is important. ;-) We don't really feel that everyone should follow our example and make their software free of charge for most users (it is too hard to survive fiscally doing this), but we do think that everyone should disclose their source code, and no one should design their software to exclude working with other software (e.g. Microsoft's Palladium, which makes such a mockery of Athena).
== Software Libre Takes More Than A License --- It Takes A Design ==

Making the source code available to you is not enough by itself to bring you all of the possible benefits of software libre. Many file systems are so difficult to modify that only someone who has worked with the code for years finds it feasible to modify it, and even then small changes can take months of labor due to their ripple effects on the other code and the difficulties of dealing with disk format changes. This is why we have a plugin based architecture in Reiser4: so that it is not just possible, but easy, to improve the software.

Imagine that you were an experimental physicist who had spent his life using only the tools that were in his local hardware store. Then one day you joined a major research lab with a machine shop and a whole bunch of other physicists. All of a sudden you are not using just whatever tools the large tool companies, who have never heard of you, have made for you. You are now part of a cooperative of physicists all making your own tools, swapping tools with each other, suddenly empowered to have tools that are exactly what you want them to be, or even merely exactly what your colleagues want them to be, rather than what some big tool company, that has to do a market analysis before giving you what you want, wants them to be. That is the transition you will make when you go from version 3 to version 4 of ReiserFS. The tools your colleagues and sysadmins (your machinists) make are going to be much better for what you need.

== Why Limit Interactions With Objects Strictly? ==

You may wonder why the design we will present is so highly structured, why every object is allowed to control what is done to it by providing a limited interface, and why we pass requests to objects to do things rather than doing things directly to the object. Surely we limit our functionality by doing so, yes? Indeed we do, but is there a reason why the price is worth paying?
Is there something that becomes crucial as complexity grows? Chaos theory offers the answer. If you disturb one thing, and disturbing that thing inherently disturbs another thing, which in turn disturbs the first thing plus maybe a whole bunch of other things, and those things all disturb the first thing again, and so on, you get what chaos theory calls a feedback loop. These loops have a marvelous tendency for the end effect of the disturbance to be incalculable, and our inability to calculate such loops is perhaps a significant aspect of our being mere mortals.

Of course, as you probably know, most programmers want to be gods, and when they are unable to know what the effect will be of a change they make to their code, they dislike this. As a result, they go to great lengths to reduce the tendency of changes to the design of one object to have ripple effects upon other objects. A vitally important way to do this is to have very strictly defined interfaces to objects, and for the designer of each object to be able to know that the interface will never be violated when he writes it. This is called "object oriented design", or "structured programming", and if used well it can do a lot to reduce a type of chaotic behavior known as bugs. ;-) Verifying the avoidance of interactions that violate the design for an object is a key task in security auditing (inspecting the code to see if it has security holes).

The expressive power of an information system is proportional not to the number of objects that get implemented for it, but instead is proportional to the number of possible effective interactions between objects in it. (Reiser's Law Of Information Economics) This is similar to Adam Smith's observation that the wealth of nations is determined not by the number of their inhabitants, but by how well connected they are to each other.
He traced the development of civilization throughout history, and found a consistent correlation between connectivity via roads and waterways, and wealth. He also found a correlation between specialization and wealth, and suggested that greater trade connectivity makes greater specialization economically viable.

You can think of namespaces as forming the roads and waterways that connect the components of an operating system. The cost of these connecting namespaces is influenced by the number of interfaces that they must know how to connect to. That cost is, if they are not clever enough to avoid it, N times N, where N is the number of interfaces, since they must write code that knows how to connect every kind to every kind. One very important way to reduce the cost of fully connective namespaces is to teach all the objects how to use the same interface, so that the namespace can connect them without adding any code to the namespace.

Very commonly, objects with different interfaces are segregated into different namespaces. If you have two namespaces, one with N objects and another with M objects, the expressive power of the objects they connect is proportional to (N times N) plus (M times M), which is less than (N plus M) times (N plus M). Try it on a calculator for some arbitrary N and M. Usually the cost of inventing the namespaces is much less than the cost of the users creating all the objects. This is what makes namespaces so exciting to work with: you can have an enormous impact on the productivity of the whole system just by being a bit fanatical in insisting on simplicity and consistency in a few areas.

Please remember this analysis later when we describe why we implement everything to support a "file" or "directory" interface, and why we aren't eager to support objects with unnecessarily different namespaces/interfaces --- such as "attributes" that cannot interact with files in all the same ways that files can interact with files.
= Basic Semantics =

To interact with an object, you name it, and you say what you want it to do. The filesystem takes the name you give, looks through things we call directories to find the object, and then gives the object your request to do something.

== Files ==

A file is something that tries to look like a sequence of bytes. You can read the bytes, and write the bytes. You can specify what byte to start to read/write from (the offset), and the number of bytes to read/write (the count). [Diagram needed.] You can also cut bytes off of the end of the file. Cutting bytes out of the middle or the beginning of a file, and inserting bytes into the middle of a file, are not permitted by any of our current file plugins, all of which implement fairly ancient Unix file semantics, but this is likely to change someday.

=== The Software Engineering Lurking Below File Plugins ===

Your interactions with a file are handled by the file's "plugin". These interactions are structured (in programming, such structures are generally called "interfaces") into a set of limited and defined interactions. (We are too lazy to perform the infinite work of programming plugins to handle infinite types of interactions.) Each way you can interact with a plugin is called a "method". A plugin is composed as a set of such methods. Among programmers, laziness is considered the highest art form, and we do our best to express our souls in this art. This is why we have layers and layers of laziness built into our plugin architecture. Each method is composed from a library of functions we thought would be useful in constructing plugin methods.
Each plugin is composed from a library of methods used by plugins, and a plugin can be considered a one-to-one mapping (that's where you have two sets of things, and for every member of one set, you specify a member of the other set as its match) of every way of interacting with the plugin to a method handling it.

For every file, there is a file pluginid. Whenever you attempt to interact with a file, we take the name of the file, find the pluginid for the file, and inside the kernel we have an array of plugins [diagram needed that is suitable for persons who don't know what an array or offset is], and we use the pluginid as the offset of that file's plugin within that array. (An offset is a position relative to something else, and in programming it is typically measured in bytes.) This implies that when you invent a new file plugin, you have to recompile the kernel. (Programmers don't actually write programs; they got too lazy for that long ago. Instead they write instructions for the computer on how to write the program, and when the computer follows these instructions ("source code"), it is called "compiling", which programmers usually pretend was done by them when they speak about it, as in "I recompiled the kernel for my exact CPU this time, and now playing pong is noticeably faster.") You can only add plugins to the end of the list, and you can never reuse or change pluginids for a plugin, or else you will have to go through the whole filesystem changing all of the pluginids that are no longer correct.

Someday in a later version we will revise this so that plugins are "dynamically loadable" (which is when you can add something to a program while it is running), and you can add support for new plugins to a running kernel. When we do that, we will carefully benchmark and ensure that there is no loss of performance from using dynamic loading (or we won't do it).
Programs are often "layered", which is when the program is divided into layers, and each layer only talks to the layer immediately above it or immediately below it, and never talks to a part of the program two levels below it, etc. This reduces the complexity of the interfaces for the various parts of the program, and most of the complexity of a program is in coding its interfaces.

Reiser4 has a "semantic layer", and this semantic layer concerns itself with naming objects and specifying what to do to the objects; it doesn't concern itself with such things as how to pack objects into particular places on disk or in the tree. An IO to a file may affect more than one physical sequence of bytes, or no physical sequence of bytes; it may affect the sequences of bytes offered by other files to the semantic layer; and the file plugin may invoke other plugins and delegate work to them. But its interface is structured for offering the caller the ability to read and/or write what the caller sees as being a single sequence of bytes. Appearances are what is wanted. When we say that security attributes are implemented as files, we mean that security attributes look like a sequence of bytes, but the security attributes may be stored in some compressed form that perhaps might be of fixed length, or even be just a single bit. For the filesystem to offer the benefits of simplicity it need merely provide a uniform appearance that all things it stores are sequences of bytes, and there is nothing to prevent it from gaining efficiency through using many different storage implementations to offer this uniform appearance.

For many files it is valuable to support efficient tree traversal to any offset in the sequence of bytes. It is not required, though, and Unix/Gnu/Linux has traditionally supported some types of files which could not do this.
A pipe will allow you to take the output of one command and connect it to the input of another command, and each of the commands will see the pipe as a file. This pipe is an example of a file for which you cannot simply jump to the middle of the file efficiently, but instead must go through it from beginning to end in sequential order.

== Names and Objects ==

A name is a means of selecting an object. An object is anything that acts as though it is a single unified entity. What is an object is context dependent. For instance, if you tell an object to delete itself, many distinctly named entities (that are distinct objects in other ways, such as reading) might well disappear as though they are a single object in response to the delete request.

A namespace is a mapping of names to objects. Filesystems, databases, search engines, and environment variable names within shells are all examples of namespaces. The early papers using the term tended to seek to convey that namespaces have commonality in their structure, are not fundamentally different, should be based on common design principles, and should be unified. Such unification is a bit of a quest for a holy grail. In British mythology, King Arthur sent his knights out on a quest for the holy grail, and if only they could become worthy of it, it would appear to them. None of them found it, and yet the quest made them what they became. Namespaces will never be unified, but the closer we can come to it, the more expressive power the OS will have. Reiser4 seeks to create a storage layer effective for such an eventually unified namespace, and gives it a semantic layer with some minor advantages over the state of the art. Later versions will add more and more expressive semantics to the storage layer.

Finding objects is layered. The semantic layer takes names and converts them into keys (we call this "resolving" the name).
The storage layer (which contains the tree traversing code) takes keys and finds the bytes that store the parts of the object. Keys are the fundamental name used by the Reiser4 tree. They are the name that the storage layer at the bottom of it all understands. They can be used to find anything in the tree: not just whole objects, but parts of objects as well. Everything in the tree has exactly one key. Duplicate keys are allowed, but their use usually means that all duplicates must be examined to see if they really contain what is sought, so duplicates are usually rare if high performance is desired. Allowing duplicates can let keys be more compact in some circumstances (e.g. hashed directory entries). An objectid cannot be used for finding an object; only keys can. Objectids are used to compose keys so as to ensure that keys are unique.

== Ordering of Name Components ==

When designing the naming system described in the future vision whitepaper, I broke names from human and computer languages into their pieces, and then looked at those pieces to see which ones differed from each other in meaningful ways vs. which pieces were different expressions that provided the same functionality. (In more formal language, I would say that I systematically decomposed the ways of naming things that we use in human and computer languages into orthogonal primitives, and then determined their equivalence classes.) I then selected one way of expression from each set of ways that provided equivalent functionality. (Since that whitepaper is focused on what is not yet implemented, it does not list all of the equivalence classes for names, but instead describes those which I thought I could say something interesting to the reader about. For instance, the NOT operator is simply unmentioned in it, as I really have nothing interesting to say about NOT, though it is very useful and will be documented when implemented.)
The ordering of two components of a name either has meaning, or it does not. If the resolution of one component of the name depends on what is named by another component, then that pair of name components forms a hierarchical name. Hierarchy can be indicated by means other than ordering. Many human languages indicate structure by use of suffix or tag mechanisms (e.g. Russian and Japanese). The syntactical mechanism one chooses to express hierarchy does not determine the possible semantics one can express, so long as at least one effective method for expressing hierarchy is allowed. I chose to only offer one expression from each equivalence class of naming primitives, and here I chose the '/'-separated file pathname expression traditional to Unix, for pragmatic compatibility with existing operating systems. Reiser4 handles only hierarchical names; non-hierarchical names are planned only for SSN Reiserfs.

== Directories ==

Hierarchical names are implemented in Reiser4 by use of directories. The first component of a hierarchical name is the name of the directory, and the components that follow are passed to the directory to interpret. We use '/' to separate the components of a hierarchical name. Directories may choose to delegate parts of their task to their sub-directories. The unix directory plugin, when supplied with a name, will use the part of the name before the first '/' to select a sub-directory (if there is a '/' in what it is resolving), and delegate resolving the part of the name after the first '/' to the sub-directory. A directory can employ any arbitrary method at all of resolving the name components passed to it, so long as it returns a set of keys of objects as the result. In Reiser4, this set of keys always contains exactly one member, but this is designed to change in SSN Reiserfs.
(Reiser4 also needs to interact with a standard interface for Unix filesystems called VFS (Virtual File System), and directories are also designed to be able to return what VFS understands, which we won't go into here.) Directories will also return a list of names when asked. This list is not required to be a complete list of all names that they can resolve, and sometimes it is not desirable that it be so. Names can be hidden names in Reiser4. Directory plugins may be able to resolve more names than they can list, especially if they are written such that the number of names that they can resolve is infinite. In particular, such names can resolve to objects that behave like ordinary files (with respect to the standard file system interface: read, write, readdir, etc.) but are not backed by the storage layer. Such objects are called "pseudo files". Here is a list of pseudo files currently implemented in Reiser4 with a description of their semantics. The Unix Directory Plugin The unix directory plugin implements directories by storing a set of directory entries per directory. These directory entries contain a name, and a key. When given a name to resolve, the unix directory plugin finds the directory entry containing that name, and then returns the key that is in the directory entry (more precisely, since a key selects not just the file but a particular byte within a file, it returns that part of the key which is sufficient to select the file, and from which the code can determine the full keys for the file's various parts once the byte offset and some other fields (like item type) are added to the partial key to form a whole key). The key can then be used by the tree storage layer to find all the pieces of that which was named.
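The name resolution by delegation described above can be sketched in a few lines. The dict-backed directory and the integer keys below are stand-ins invented for this example, not the real unix directory plugin's structures.

```python
# Sketch of hierarchical name resolution by delegation: a directory
# resolves the leading component itself and delegates the remainder to
# the sub-directory that component names. The dict storage and integer
# "keys" are illustrative assumptions only.

class Directory:
    def __init__(self, entries):
        self.entries = entries  # name component -> key or sub-Directory

    def resolve(self, name):
        head, sep, rest = name.partition("/")
        found = self.entries[head]
        if sep:                      # a '/' remained: delegate the rest
            return found.resolve(rest)
        return found                 # the (partial) key of the object

root = Directory({"usr": Directory({"bin": Directory({"ls": 1001})})})
```

The returned key is then what the storage layer would use to find the object's pieces in the tree.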
Some Historical Details Of Design Flaws In The Unix Directory Interface Unix differs from Multics, in that Multics defined a file to be a sequence of elements (the elements could be bytes, directory entries, or something else....), while Unix defines a file to be purely a sequence of bytes. In Multics directories were then considered to be a particular type of file which was a sequence of directory entries. For many years, all implementations of Unix directories were as sequences of bytes, and the notion of location within a Unix directory is tied not to a name as you might expect, but to a byte offset within the directory. The problem is that one is using a byte offset to represent a location whose true meaning is not a byte offset but a directory entry, and doing so for a particular file in a system which meaningfully names that file not by byte offset within the directory but by filename. Various efforts are being made in the Unix community to pretend that this byte offset is something more general than a byte offset, and they often try to do so without increasing the size used to store the thing which they pretend is not a byte offset. Since byte offsets are normally smaller than filenames are allowed to be, the result is ugliness and pathetic kludges. Trust me that you would rather not know about the details of those kludges unless you absolutely have to, and let me say no more. Directories Are Unordered Unix/Linux makes no promises regarding the order of names within directories. The order in which files are created is not necessarily the order in which names will be listed in a directory, and the use of lexicographic (alphabetic) order is surprisingly rare. The unix utilities typically sort directory listings after they are returned by the filesystem, which is why it seems like the filesystem sorts them, and is why listing very large directories can be slow. (Our current default plugin sorts filenames that are less than 15 letters long lexicographically. 
For those that are more than 15 characters long it sorts them first by their first 8 letters, then by the hash of the whole name.) There is value to allowing the user to specify an arbitrary order for names using an arbitrary ordering function the user supplies. This is not done in Reiser4, but is planned as a feature of later versions. Allowing the creation of a hash plugin is a limited form of this that is currently implemented. Files That Are Also Directories In Reiser4 (but not ReiserFS 3) an object can be both a file and a directory at the same time. If you access it as a file, you obtain the named sequence of bytes. If you use it as a directory you can obtain files within it, directory listings, etc. There was a lengthy discussion on the Linux Kernel Mailing List about whether this was technically feasible to do. I won't reproduce it here except to summarize that Linus showed that this was feasible without "breaking" VFS. Allowing an object to be both a file and a directory is one of the features necessary to compose the functionality present in streams and attributes using files and directories. To implement a regular unix file with all of its metadata, we use a file plugin for the body of the file, a directory plugin for finding file plugins for each of the metadata, and particular file plugins for each of the metadata. We use a unix_file file plugin to access the body of the file, and a unix_file_dir directory plugin to resolve the names of its metadata to particular file plugins for particular metadata. These particular file plugins for unix file metadata (owner, permissions, etc.) are implemented to allow the metadata normally used by unix files to be quite compactly stored. Hidden Directory Entries A file can exist but not be visible when using readdir in the usual way. WAFL does this with the .snapshots directory; it works well for them without disturbing users.
This is useful for adding access to a variety of new features and their applications without disturbing the user when they are not relevant. New Security Attributes and Set Theoretic Semantic Purity Minimizing Number Of Primitives Is Important In Abstract Constructions To a theoretician it is extremely important to minimize the number of primitives with which one achieves the desired functionality in an abstract construction. It is a bit hard to explain why this is so, but it is well accepted that breaking an abstract model into more basic primitives is very important. A not very precise explanation of why is to say that by breaking complex primitives into their more basic primitives, then recombining those basic primitives differently, you can usually express new things that the original complex primitives did not express. Let's follow this grand tradition of theoreticians and see what happens if we apply it to Gnu/Linux files and directories. Can We Get By Using Just Files and Directories (Composing Streams And Attributes From Files And Directories)? In Gnu/Linux we have files, directories, and attributes. In NTFS they also have streams. Since Samba is important to Gnu/Linux, there frequently are requests that we add streams to ReiserFS. There are also requests that we add more and more different kinds of attributes using more and more different APIs. Can we do everything that can be done with {files, directories, attributes, streams} using just {files, directories}? I say yes--if we make files and directories more powerful and flexible. I hope that by the end of reading this you will agree. Let us have two basic objects. A file is a sequence of bytes that has a name. A directory is a name space mapping names to a set of objects "within" the directory. We connect these directory name spaces such that one can use compound names whose subcomponents are separated by a delimiter '/'.
What do attributes and streams offer that is missing from files and directories? In ReiserFS 3, there exist file attributes. File attributes are out-of-band data describing the sequence of bytes which is the file. For example, the permissions defining who can access a file, or the last modification time, are file attributes. File attributes have their own API; creating new file attributes creates new code complexity and compatibility issues galore. ACLs are one example of new file attributes users want. Since in Reiser4 files can also be directories, we can implement traditional file attributes as simply files. To access a file attribute, one need merely name the file, followed by a '/', followed by an attribute name. That is: a traditional file will be implemented to possess some of the features of a directory; it will contain files within the directory corresponding to file attributes which you can access by their names; and it will contain a file body which is what you access when you name the "directory" rather than the file. Unix currently has a variety of attributes that are distinct from files (ACLs, permissions, timestamps, other mostly security related attributes, ...). This is because a variety of people needed this feature and that, and there was no infrastructure that would allow implementing the features as fully orthogonal features that could be applied to any file. Reiser4 will create that infrastructure. List Of Features Needed To Get Attribute And Stream Functionality From Files And Directories: * api efficient for small files * efficient storage for small files * plugins, including plugins that can compress a file serving as an attribute into a single bit * files that also act as directories when accessed as directories * inheritance (includes file aggregation) * constraints * transactions * hidden directory entries Each of these additional features is a feature that would benefit the filesystem. So we add them in v4.
Basic Tree Concepts Trees, Nodes, and Items One way of organizing information is to put it into trees. When we organize information in a computer, we typically sort it into piles (nodes we call them), and there is a name (a pointer) for each pile that the computer will be able to use to find the pile. A height = 4, 4 level, fanout = 3 balanced tree. It starts with a root node, traverses 2 levels of internal nodes, and ends with the leaf nodes, which hold the data and have no children. Figure 1. One Example Of A Tree. Some of the nodes can contain pointers, and we can go looking through the nodes to find those pointers to (usually other) nodes. We are particularly interested in how to organize so that we can find things when we search for them. A tree is an organization structure that has some useful properties for that purpose. Definition of Tree: 1. A tree is a set of nodes organized into a root node, and zero or more additional sets of nodes called subtrees. 2. Each of the subtrees is a tree. 3. No node in the tree points to the root node, and exactly one pointer from a node in the tree points to each non-root node in the tree. 4. The root node has a pointer to each of its subtrees, that is, a pointer to the root node of the subtree. Fine Points of the Definition The absolutely most trivial of all graphs, the single, isolated node. Figure 2. The simplest tree. A trivial, connected, linear (unary) graph: a linear sequence of nodes connected by paths (edges, pointers). Figure 3. A trivial, linear tree. It is interesting to argue over whether finite should be a part of the definition of trees. There are many ways of defining trees, and which is the best definition depends on what your purpose is. Donald Knuth (a well-known author of algorithm textbooks) supplies several definitions of tree. As his primary definition of tree he even supplies one which has no pointers/edges/lines in the definition, just sets of nodes.
Reiser4 uses a finite tree (the number of nodes is limited). Knuth defines trees as being finite sets of nodes. There are papers on infinite trees on the Internet. I think it more appropriate to consider finite an additional qualifier on trees, rather than bundling finite into the definition. However, I personally only deal with finite trees in my storage layer research. It is interesting to consider whether storage layers are inherently more motivated than semantic layers to limit themselves to finite trees rather than infinite trees. This is where some writers would say ".... is left as an exercise for the reader". :-) Oh the temptation.... I will remind the reader of my explanation of why storage layer trees are more motivated to be acyclic, and, at the cost of some effort at honesty, constrain myself to saying that doing more than providing that hint is beyond my level of industry.;-) Edge is a term often used in tree definitions. A pointer is unidirectional (you can follow it from the node that has it to the node it points to, but you cannot follow it back from the node it points to to the node that has it). An edge is bidirectional (you can follow it in both directions). Here are three alternative tree definitions, which are interesting in how they are mathematically equivalent to each other, though they are not equivalent to the definition I supplied because edges are not equivalent to pointers: For all three of these definitions, let there be not more than one edge connecting the same two nodes. * a set of vertices (aka points) connected by edges (aka lines) for which the number of edges is one less than the number of vertices * or a set of vertices connected by edges which has no cycles (a cycle is a path from a vertex to itself) * or a set of vertices connected by edges for which there is exactly one path connecting any two vertices The three alternative definitions do not have a unique root in their tree, and such trees are called free trees. 
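A minimal sketch can spot-check the equivalence among the three free-tree definitions: for a connected graph, having exactly one fewer edge than vertices already forces acyclicity and unique paths. The function below tests the edge-count criterion together with connectivity.

```python
# Check whether an undirected graph is a free tree using definition 1:
# connected with exactly |V| - 1 edges. By the equivalence noted above,
# this also implies "acyclic" and "unique path between any two vertices".

def is_free_tree(n, edges):
    """n vertices labelled 0..n-1; edges is a list of (u, v) pairs."""
    if len(edges) != n - 1:
        return False
    adj = {v: [] for v in range(n)}
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    seen, stack = {0}, [0]           # traverse from vertex 0
    while stack:
        for w in adj[stack.pop()]:
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return len(seen) == n            # connected?
```

The single isolated node of Figure 2 and the linear tree of Figure 3 both pass; add one extra edge and a cycle appears, so the edge count gives it away.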
The definition I supplied is a definition of a rooted tree not a free tree. It also has no cycles, it has one less pointer than it has nodes, and there is exactly one path from the root to any node. Please feel encouraged to read Knuth's writings for more discussions of these topics. Graphs vs. Trees Consider the purposes for which you might want to use a graph, and those for which you might want to use a tree? In a tree there is exactly one path from the root to each node in the tree, and a tree has the minimum number of pointers sufficient to connect all the nodes. This makes it a simple and efficient structure. Trees are useful for when efficiency with minimal complexity is what is desired, and there is no need to reach a node by more than one route. Reiser4 has both graphs and trees, with trees used for when the filesystem chooses the organization (in what we call the storage layer, which tries to be simple and efficient), and graphs for when the user chooses the organization (in the semantic layer, which tries to be expressive so that the user can do whatever he wants). Ordering The Tree Aids Searching Through It Keys We assign everything stored in the tree a key. We find things by their keys. Use of keys gives us additional flexibility in how we sort things, and if the keys are small, it gives us a compact means of specifying enough to find the thing. It also limits what information we can use for finding things. This limit restricts its usefulness, and so we have a storage layer, which finds things by keys, and a semantic layer, which has a rich naming system. The storage layer chooses keys for things solely to organize storage in a way that will improve performance, and the semantic layer understands names that have meaning to users. 
As you read, you might want to think about whether this is a useful separation that allows freedom in adding improvements that aid performance in the storage layer, while escaping paying a price for the side effects of those improvements on the flexible naming objectives of the semantic layer. Choosing Which Subtree We start our search at the root, because from the root we can reach every other node. How do we choose which subtree of the root to go to from the root? The root contains pointers to its subtrees. For each pointer to a subtree there is a corresponding left delimiting key. Pointers to subtrees, and the subtrees themselves, are ordered by their left delimiting key. A subtree pointer's left delimiting key is equal to the least key of the things in the subtree. Its right delimiting key is larger than the largest key in the subtree, and it is the left delimiting key of the next subtree of this node. Each subtree contains only things whose keys are at least equal to the left delimiting key of its pointer, and are not more than its right delimiting key. If there are no duplicate keys in the tree, then each subtree contains only things whose keys are less than its right delimiting key. If there are no duplicate keys, then by looking within a node at its pointers to subtrees and their delimiting keys we know what subtree of that node contains the thing we are looking for. Duplicate keys are a topic for another time. For now I will just hint that when searching through objects with duplicate keys we find the first of them in the tree, and then we search through all duplicates one-by-one until we find what we are looking for. Allowing duplicate keys can allow for smaller keys, so there is sometimes a tradeoff between key size and the average frequency of such inefficient linear searches.
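Assuming no duplicate keys, choosing the subtree amounts to a binary search over the sorted left delimiting keys. A hedged sketch follows; the flat list of delimiting keys is an abstraction invented for this example, not a node format.

```python
from bisect import bisect_right

def choose_child(delims, search_key):
    """delims[i] is the left delimiting key of child i, in ascending
    order; child i covers keys in [delims[i], delims[i+1]). Assumes no
    duplicate keys in the tree."""
    assert search_key >= delims[0], "key lies left of this node's range"
    # rightmost child whose left delimiting key is <= the search key
    return bisect_right(delims, search_key) - 1
```

With duplicates allowed, the search would instead land on the first of them and then scan forward one-by-one, as described above.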
Using duplicate keys can also allow, if one defines one's insertion algorithms such that they always insert at the end of a set of duplicate keys, ordering objects with the same key by creation time. The contents of each node in the tree are sorted within the node. So, the entire tree is sorted by key, and for a given key we know just where to go to find at least one thing with that key. Nodes Leaves, Twigs, and Branches Leaves are nodes that have no children. Internal nodes are nodes that have children. Figure 4. A height = 4, fanout = 3, balanced tree. A search will start with the root node, the sole level 4 internal node, traverse 2 more internal nodes, and end with a leaf node which holds the data and has no children. A node that contains items is called a formatted node. If an object is large, and is not compressed and doesn't need to support efficient insertions (compressed objects are special because they need to be able to change their space usage when you write to their middles because the compression might not be equally efficient for the new data), then it can be more efficient to store it in nodes without any use of items at all. We do so by default for objects larger than 16k. Unformatted leaves (unfleaves) are leaves that contain only data, and do not contain any formatting information. Only leaves can contain unformatted data. Pointers are stored in items, and so all internal nodes are necessarily formatted nodes. Pointers to unfleaves are different in their structure from pointers to formatted nodes. Extent pointers point to unfleaves. An extent is a sequence of unfleaves, contiguous in block-number order, that belong to the same object. An extent pointer contains the starting block number of the extent, and a length.
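The extent-pointer arithmetic just described can be sketched directly; the 4096-byte block size and the field names below are assumptions made for this example.

```python
BLOCK = 4096  # assumed block size; field names are invented for the sketch

def block_for(key_offset, start_block, length, byte_offset):
    """Map a file byte offset to the disk block that holds it, given an
    extent that begins at byte offset key_offset within the file and
    occupies `length` contiguous blocks starting at start_block."""
    rel = (byte_offset - key_offset) // BLOCK  # which block of the extent
    assert 0 <= rel < length, "byte lies outside this extent"
    return start_block + rel
```

This is why one stored key per extent suffices: every other byte's location is computed, not stored.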
[diagram needed] Because the extent belongs to just one object, we can store just one key for the extent, and then we can calculate the key of any byte within that extent. If the extent is at least 2 blocks long, extent pointers are more compact than regular node pointers would be. Node Pointers are pointers to formatted nodes. We do not yet have a compressed version of node pointers, but they are probably soon to come. Notice how with extent pointers we don't have to store the delimiting key of each node pointed to, and with node pointers we need to. We will probably introduce key compression at the same time we add compressed node pointers. One would expect keys to compress well since they are sorted into ascending order. We expect our node and item plugin infrastructure will make such features easy to add at a later date. Twigs are parents of leaves. Extent Pointers exist only in twigs. This is a very controversial design decision I will discuss a bit later. Branches are internal nodes that are not twigs. You might think we would number the root level 1, but since the tree grows at the top, it turns out to be more useful to number as 1 the level with the leaves where object data is stored. The height of the tree will depend upon how many objects we have to store and what the fanout rate (average number of children) of the internal and twig nodes will be. For reasons of code simplicity, we find it easiest to implement Reiser4 such that it has a minimum height of 2, and the root is always an internal node. There is nothing deeper than judicious laziness to this: it simplifies the code to not deal with one-node trees, and nobody cares about the waste of space. An example of a Reiser4 tree: A tree, starting with a root node, then traversing branch nodes, including the internal nodes called twig nodes (a Reiser4 feature), and ending with the leaf nodes which hold the data and have no children. Figure 5. This Reiser4 tree is a 4 level, balanced tree with a fanout of 3.
In practice Reiser4 fanout is much higher and varies from node to node, but a 4 level tree diagram with 16 million leaf nodes won't fit easily onto my monitor so I drew something smaller.... ;-) Size of Nodes We choose to make the nodes equal in size. This makes it much easier to allocate the unused space between nodes, because it will be some multiple of the node size, and there are no problems of space being free but not large enough to store a node. Also, disk drives have an interface that assumes equal size blocks, which they find convenient for their error-correction algorithms. If having the nodes be equal in size is not very important, perhaps due to the tree fitting into RAM, then using a class of algorithms called skip lists is worthy of consideration. Reiser4 nodes are usually equal to the size of a page, which if you use Gnu/Linux on an Intel CPU is currently 4096 (4k) bytes. There is no measured empirical reason to think this size is better than others; it is just the one that Gnu/Linux makes easiest and cleanest to program into the code, and we have been too busy to experiment with other sizes. Sharing Blocks Saves Space If nodes are of equal size, how do we store large objects? We chop them into pieces. We call these pieces items. Items are sized to fit within a single node. Conventional filesystems store files in whole blocks. Roughly speaking, this means that on average half a block of space is wasted per file because not all of the last block of the file is used. If a file is much smaller than a block, then the space wasted is much larger than the file. It is not effective to store such typical database objects as addresses and phone numbers in separately named files in a conventional filesystem because it will waste more than 90% of the space in the blocks it stores them in. By putting multiple items within a single node in Reiser4, we are able to pack multiple small pieces of files into one block. Our space efficiency is roughly 94% for small files.
This does not count per item formatting overhead, whose percentage of total space consumed depends on average item size, and for that reason is hard to quantify. Aligning files to 4k boundaries does have advantages for large files though. When a program wants to operate directly on file data without going through system calls to do it, it can use mmap() to make the file data part of the process's directly accessible address space. Due to some implementation details mmap() needs file data to be 4k aligned, and if the data is already 4k aligned, it makes mmap() much more efficient. In Reiser4 the current default is that files that are larger than 16k are 4k aligned. We don't yet have enough empirical data and experience to know whether 16k is the precise optimal default value for this cutoff point, but so far it seems to at least be a decent choice. Items Nodes in the tree are smaller than some of the objects they hold, and larger than some of the objects they hold, so how do we store them? One way is to pour them into items. An item is a data container that is contained entirely within a single node, and it allows us to manage space within nodes. For the default 4.0 node format, every item has a key, an offset to where in the node the item body starts, a length of the item body, and a pluginid that indicates what type of item it is. Items allow us to not have to round up to 4k the amount of space required to store an object. The Structure of an Item An item consists of an Item_Body and an Item_Head, stored separated from each other within the node; the Item_Head holds the Item_Key, the Item_Offset, the Item_Length, and the Item_Plugin_id. Types Of Items Reiser4 includes many different kinds of items designed to hold different kinds of information. * static_stat_data: holds the owner, permissions, last access time, creation time, last modification time, size, and the number of links (names) to a file. * cmpnd_dir_item: holds directory entries, and the keys of the files they link to.
* extent pointers: explained above * node pointers: explained above * bodies: hold parts of files that are not large enough to be stored in unfleaves. Units We call a unit that which we must place as a whole into an item, without splitting it across multiple items. When traversing an item's contents it is often convenient to do so in units: * For body items the units are bytes. * For directory items the units are directory entries. The directory entries contain a name and a key of the file named (or at least the item plugin can pretend they do; in practice the name and key may be compressed). * For extent items the units are extents. Extent items only contain extents from the same file. * For static_stat_data the whole stat data item is one indivisible unit of fixed size. What the Default Node Formats For ReiserFS 4.0 Look Like An unformatted leaf node (unfleaf node), which is the only node without a Node_Header, has the trivial structure: nothing but data from end to end, with no formatting information at all. A formatted leaf node has the structure: Block_Head Item_Body0 Item_Body1 - - - Item_Bodyn ....Free Space.... Item_Headn - - - Item_Head1 Item_Head0 A twig node has the structure: Block_Head Item_Body0 NodePointer0 Item_Body1 ExtentPointer1 Item_Body2 NodePointer2 Item_Body3 ExtentPointer3 - - - Item_Bodyn NodePointern ....Free Space.... Item_Headn - - - Item_Head0 A branch node has the structure: Block_Head Item_Body0 NodePointer0 - - - Item_Bodyn NodePointern ........Free Space......
Item_Headn - - - Item_Head0 Tree Design Concepts Height Balancing versus Space Balancing Height Balanced Trees are trees such that each possible search path from root node to leaf node has exactly the same length (Length = number of nodes traversed from root node to leaf node). For instance the height of the tree in Figure 1 is four while the height of the left hand tree in Figure 1.3 is three and of the single node in Figure 2 is 1. The term balancing is used for several very distinct purposes in the balanced tree literature. Two of the most common are: to describe balancing the height, and to describe balancing the space usage within the nodes of the tree. These quite different definitions are unfortunately a classic source of confusion for readers of the literature. Most algorithms for accomplishing height balancing do so by only growing the tree at the top. Thus the tree never gets out of balance. This is a 4 level unbalanced tree with fanout N = 3 that has then lost some nodes to deletions and needs to be balanced. Figure 6. This is an unbalanced tree. Three principal considerations in tree design Three of the principal considerations in tree design are: * the fanout rate (see below) * the tightness of packing * the amount of the shifting of items in the tree from one node to another that is performed (which creates delays due to waiting while things move around in RAM, and on disk). Fanout The fanout rate n refers to how many nodes may be pointed to by each level's nodes. (see Figure 7) If each node can point to n nodes of the level below it, then starting from the top, the root node points to n internal nodes at the next level, each of which points to n more internal nodes at its next level, and so on... m levels of internal nodes can point to n^m leaf nodes containing items in the last level.
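The fanout arithmetic above is easy to check; the concrete fanout values below are illustrative only, since real Reiser4 fanout varies from node to node.

```python
def leaf_count(fanout, internal_levels):
    """m levels of internal nodes with fanout n address n**m leaves."""
    return fanout ** internal_levels

# Capacity grows geometrically with height, so height grows only
# logarithmically with the amount stored: a fanout-3 tree with 3
# internal levels reaches 27 leaves, while a fanout in the hundreds
# reaches millions of leaves at the same height.
```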
The more you want to be able to store in the tree, the larger you have to make the fields in the key that first distinguish the objects (the objectids), and then select parts of the object (the offsets). This means your keys must be larger, which decreases fanout (unless you compress your keys, but that will wait for our next version....). Figure 7. Three 4 level, height balanced trees with fanouts n = 1, 2, and 3. The first graph is a four level tree with fanout n = 1. It has just four nodes, starts with the (red) root node, traverses the (burgundy) internal and (blue) twig nodes, and ends with the (green) leaf node which contains the data. The second tree, with 4 levels and fanout n = 2, starts with a root node, traverses 2 internal nodes, each of which points to two twig nodes (for a total of four twig nodes), and each of these points to 2 leaf nodes for a total of 8 leaf nodes. Lastly, a 4 level, fanout n = 3 tree is shown which has 1 root node, 3 internal nodes, 9 twig nodes, and 27 leaf nodes. What Are B+Trees, and Why Are They Better than B-Trees It is possible to store not just pointers and keys in internal nodes, but also to store the objects those keys correspond to in the internal nodes. This is what the original B-tree algorithms did. Then B+trees were invented in which only pointers and keys are stored in internal nodes, and all of the objects are stored at the leaf level. Figure 8. Figure 9.
Warning! I found from experience that most persons who don't first deeply understand why B+trees are better than B-Trees won't later understand explanations of the advantages of putting extents on the twig level rather than using BLOBs. The same principles that make B+Trees better than B-Trees also make Reiser4 faster than using BLOBs like most databases do. So make sure this section has fully digested before moving on to the next section, ok? ;-) B+Trees Have Higher Fanout Than B-Trees Fanout is increased when we put only pointers and keys in internal nodes, and don't dilute them with object data. Increased fanout increases our ability to cache all of the internal nodes because there are fewer internal nodes. Often persons respond to this by saying, "but B-trees cache objects, and caching objects is just as valuable". The answer is that, on average, it is not. Of course, discussing averages makes the discussion much harder. We need to discuss some cache design principles for a while before we can get to this. Cache Design Principles Reiser's Untie The Uncorrelated Principle of Cache Design Tying the caching of things whose usage does not strongly correlate is bad. Suppose: * you have two sets of things, A and B. * you need things from those two sets at semi-random, with there existing a tendency for some items to be needed much more frequently than others, but which items those are can shift slowly over time. * you can keep things around after you use them in a cache of limited size. * you tie the caching of every thing from A to the caching of another thing from B. (that means, whenever you fetch something from A into the cache, you fetch its partner from B into the cache) Then this increases the amount of cache required to store everything recently accessed from A.
If there is a strong correlation between the need for the two particular objects that are tied in each of the pairings, stronger than the gain from spending those cache resources on caching more members of B according to the LRU algorithm, then this might be worthwhile. If there is no such strong correlation, then it is bad. But wait, you might say, you need things from B also, so it is good that some of them were cached. Yes, you need some random subset of B. The problem is that without a correlation, the things from B that you need are not especially likely to be the same things from B that were tied to the things from A that were needed. This tendency to inefficiently tie things that are randomly needed exists outside the computer industry as well. For instance, suppose you like both popcorn and sushi, with your need for them on a particular day being random, and suppose that which movie you like is also random. If a theater requires you to eat only popcorn while watching the movie you randomly found optimal to watch, and not eat sushi from the restaurant on the corner while watching that movie, is this a socially optimal system? Or suppose quality is randomly distributed across all the hot dog vendors: if you can only eat the hot dog produced by the best movie displayer on a particular night that you want to watch a movie, and you aren't allowed to bring in hot dogs from outside the movie theater, is that a socially optimal system? Optimal for you?

Tying the uncorrelated is a very common error in designing caches, but it is still not enough to explain why B+trees are better. With internal nodes, we store more than one pointer per node. That means that pointers are not separately cached. You could well argue that pointers and the objects they point to are more strongly correlated than the different pointers are with each other. We need another cache design principle.
Reiser's Maximize The Variance Principle of Cache Design

If two types of things that are cached and accessed, in units that are aggregates, have different average temperatures, then segregating the two types into separate units helps caching. For balanced trees, these units of aggregation are nodes. This principle applies to the situation where it may be necessary to tie things into larger units for efficient access, and it guides what things should be tied together. Suppose you have R bytes of RAM for cache, and D bytes of disk. Suppose that 80% of accesses are to the most recently used things, which are stored in H (hotset) bytes of nodes. Reducing the size of H to where it is smaller than R is very important to performance. If you evenly disperse your frequently accessed data, then a larger cache is required and caching is less effective.

1. If, all else being equal, we increase the variation in temperature among all aggregates (nodes), then we increase the effectiveness of using a fast small cache.
2. If two types of things have different average temperatures (ratios of likelihood of access to size in bytes), then separating them into separate aggregates (nodes) increases the variation in temperature in the system as a whole.
3. Conclusion: all else being equal, if two types of things cached several to an aggregate (node) have different average temperatures, then segregating them into separate nodes helps caching.

Pointers To Nodes Have A Higher Average Temperature Than The Nodes They Point To

Pointers to nodes tend to be frequently accessed relative to the number of bytes required to cache them. Consider that you have to use the pointers for all tree traversals that reach the nodes beneath them, and they are smaller than the nodes they point to. Putting only node pointers and delimiting keys into internal nodes concentrates the pointers.
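The principle can be made concrete with a sketch using assumed numbers (the node size, fanout, and access counts below are invented for illustration; they are not measured Reiser4 figures). It shows both that pointer-only nodes run hotter than mixed pointer-and-data nodes, and that the pointer-only internal nodes of even a large tree are few enough to cache entirely in RAM:

```python
NODE = 4096          # assumed node size in bytes
GB = 2 ** 30

def temperature(accesses, size_bytes):
    """Temperature = likelihood of access divided by size in bytes."""
    return accesses / size_bytes

# A node full of pointers is consulted by every traversal that passes
# through it; a node of object data only when that object is read.
ptr_node = temperature(accesses=1000, size_bytes=NODE)
data_node = temperature(accesses=10, size_bytes=NODE)
# A B-tree-style mixed node drags the cold data along with the hot pointers.
mixed_node = temperature(accesses=1010, size_bytes=2 * NODE)
assert ptr_node > mixed_node > data_node   # segregation raises the variance

# With pointers concentrated, internal nodes are a tiny fraction of the
# tree: e.g. 100 GB of leaf-level data with an assumed fanout of 200.
fanout = 200
level = 100 * GB // NODE                   # number of leaf-level nodes
internal = 0
while level > 1:
    level = -(-level // fanout)            # ceiling division: one level up
    internal += level
assert internal * NODE < GB   # all internal nodes fit in well under 1 GB
```

Under these assumptions roughly half a gigabyte of RAM caches every internal node of a 100 GB tree, which is the "RAM is large enough to hold all of the internal nodes" claim made later for Reiser4.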
Since pointers tend to be more frequently accessed per byte of their size than items storing file bodies, a large difference in average temperature exists between pointers and object data. According to the caching principles described above, segregating these two types of things with different average temperatures, pointers and object data, increases the efficiency of caching.

Segregating By Temperature Directly

Now you might say, well, why not segregate by actual temperature instead of by type, which only correlates with temperature? We do what we can easily and effectively code, with not just temperature segregation in consideration. There are tree designs which rearrange the tree so that objects which have a higher temperature are higher in the tree than pointers with a lower temperature. The difference in average temperature between object data and pointers to nodes is so high that I don't find such designs a compelling optimization, and they add complexity. I could be wrong. If one had no compelling semantic basis for aggregating objects near each other (this is true for some applications), and if one wanted to access objects by nodes rather than individually, it would be interesting to have a node repacker sort object data into nodes by temperature. You would need to have the repacker change the keys of the objects it sorts. Perhaps someone will have us implement that for some application someday for Reiser4.

BLOBs Unbalance the Tree, Reduce Segregation of Pointers and Data, and Thereby Reduce Performance

BLOBs, Binary Large OBjects, are a method of storing objects larger than a node by storing pointers to nodes containing the object. These pointers are commonly stored in what are called the leaf nodes (level 1, except that the BLOBs are then sort of a basement "level B" :-\ ) of a "B*" tree. The figure shows a tree that was four levels until a BLOB was inserted with a pointer from a leaf node. In this case the BLOB's blocks are all contiguous. Figure 10.
A Binary Large OBject (BLOB) has been inserted with, in a leaf node, pointers to its blocks. This is what a ReiserFS V3 tree looks like. BLOBs are a significant unintentional definitional drift, albeit one accepted by the entire database community. This placement of pointers into nodes containing data is a performance problem for ReiserFS V3, which uses BLOBs. (Never accept the "let's just try it my way and see, and we can change it if it doesn't work" argument. It took years and a disk format change to get BLOBs out of ReiserFS, and performance suffered the whole time, if tails were turned on.) Because the pointers to BLOBs are diluted by data, caching all pointers to all nodes in RAM is infeasible for typical file sets. Reiser4 returns to the classical definition of a height balanced tree, in which the lengths of the paths to all leaf nodes are equal. It does not try to pretend that all of the nodes storing objects larger than a node are somehow not part of the tree even though the tree stores pointers to them. As a result, the amount of RAM required to store pointers to nodes is dramatically reduced. For typical configurations, RAM is large enough to hold all of the internal nodes.

Figure 11. A Reiser4, 4 level, height balanced tree with fanout = 3, with the data that was stored in BLOBs now stored in extents in the level 1 leaf nodes and pointed to by extent pointers stored in the level 2 twig nodes. In this case the data blocks are all contiguous.

Gray and Reuter say the criterion for searching external memory is to "minimize the number of different pages along the average (or longest) search path. ....by reducing the number of different pages for an arbitrary search path, the probability of having to read a block from disk is reduced."
(Gray and Reuter, 1993, Transaction Processing: Concepts and Techniques, Morgan Kaufmann Publishers, San Francisco, CA, p. 834.) My problem with this explanation of why the height balanced approach is effective is that it does not convey that you can get away with having a moderately unbalanced tree provided that you do not significantly increase the total number of internal nodes. In practice, most trees that are unbalanced do have significantly more internal nodes. In practice, most moderately unbalanced trees have a moderate increase in the cost of in-memory tree traversals, and an immoderate increase in the amount of IO due to the increased number of internal nodes. But if one were to put all the BLOBs together in the same location in the tree, then, since the number of internal nodes would not significantly increase, the penalty for having them on a lower level of the tree than all other leaf nodes would not be a significant additional IO cost. There would be a moderate increase in the part of the tree traversal time cost which depends on RAM speed, but this would not be so critical. Segregating BLOBs could perhaps substantially recover the performance lost by architects not noticing the drift in the definition of height balancing for trees. It might be undesirable to segregate objects by their size rather than just their semantics, though. Perhaps someday someone will try it and see what results.

Dancing Trees Are Faster Than Balanced Trees

[Illustration: character shoving tree-like characters to the left]

Balanced trees have traditionally employed a fixed criterion for determining whether nodes should be squeezed together into fewer nodes so as to save space, a criterion that is satisfied at the end of every modification to the tree. A typical such criterion is to guarantee that after each modification to the tree, the modified node cannot be squeezed together with its left and right neighbors into two or fewer nodes. ReiserFS V3 uses that criterion for its leaf nodes.
The more neighboring nodes you consider for squeezing into one fewer node, the more memory bandwidth you consume on average per modification to the tree, and the more likely you are to need to read those nodes because they are not in memory. It is a typical pattern in memory management algorithm design that the more tightly packed memory is kept, the more overhead is added to the cost of changing what is stored where in it. This overhead can be significant enough that some commercial databases actually delete nodes only when they are completely empty, and they feel that in practice this works well. Trees that adhere to fixed space usage balancing criteria can have many things rigorously proven about their worst case performance in publishable papers. This is different from their being optimal. An algorithm can have worse bounds on its theoretical worst case performance and still be the better algorithm. Just because one cannot rigorously define average usage patterns does not mean they are the slightest bit less important. Sorry, mere mortal mathematicians, that is life. Maybe some might prefer to think about the questions that they can define and answer rigorously, but this does not in the slightest make them the right questions. Yes, I am a chaotic....

In Reiser4 we employ not balanced trees, but dancing trees. Dancing trees merge insufficiently full nodes, not with every modification to the tree, but instead:

* in response to memory pressure triggering a flush to disk,
* as a result of a transaction closure flushing nodes to disk.

If It Is In RAM, Dirty, and Contiguous, Then Squeeze It ALL Together Just Before Writing

Let a slum be defined as a sequence of nodes that are contiguous in the tree order and dirty in this transaction. (In simpler words, a bunch of dirty nodes that are right next to each other.) A dancing tree responds to memory pressure by squeezing and flushing slums.
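The shove-left squeeze can be sketched as follows. This is a hypothetical model, not Reiser4's flush code: each node is a list of (item, size) payloads, and the squeeze packs the slum's items leftward in tree order, freeing nodes that end up empty.

```python
def squeeze_slum(slum, capacity):
    """Shove the items in a slum of dirty nodes as far left as they will
    go; nodes emptied by the squeeze are freed.  `slum` is a list of
    nodes, each a list of (item, size) payloads; `capacity` is the
    usable bytes per node.  Returns (packed_nodes, nodes_freed)."""
    items = [it for node in slum for it in node]   # tree order preserved
    packed, current, used = [], [], 0
    for item, size in items:
        if used + size > capacity:                 # current node is full
            packed.append(current)
            current, used = [], 0
        current.append((item, size))
        used += size
    if current:
        packed.append(current)
    return packed, len(slum) - len(packed)

# Four half-full dirty neighbours squeeze into two full nodes; two freed.
slum = [[("a", 50)], [("b", 50)], [("c", 50)], [("d", 50)]]
packed, freed = squeeze_slum(slum, capacity=100)
assert freed == 2 and len(packed) == 2
```

Because this runs only when a slum is flushed, the per-modification cost of an invariant balancing criterion is avoided, yet the nodes written out are maximally full.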
It is possible that merely squeezing a slum might free up enough space that flushing is unnecessary, but the current implementation of Reiser4 always flushes the slums it squeezes. This is not necessarily the right approach, but we found it simpler and good enough for now. Another simplification we choose to engage in for now is that instead of trying to estimate whether squeezing a slum will save space before squeezing it, we just squeeze it and see. Balanced trees have an inherent tradeoff between balancing cost and space efficiency. If, with every change to the tree, they consider more neighboring nodes for the purpose of merging them to save a node, then they can pack the tree more tightly, at the cost of moving more data with every change to the tree. By contrast, with a dancing tree, you simply take a large slum, shove everything in it as far to the left as it will go, and then free all the nodes in the slum that are left with nothing remaining in them, at the time of committing the slum's contents to disk in response to memory pressure. This gives you extreme space efficiency when slums are large, at a cost in data movement that is lower than it would be with an invariant balancing criterion, because it is done less often. By compressing at the time one flushes to disk, one compresses less often, and that means one can afford to do it more thoroughly. By compressing dirty nodes that are in memory, one avoids performing additional I/O as a result of balancing.

Procrastination Leads To Wiser Decisions: Allocate on Flush

ReiserFS V3 assigns block numbers to nodes as it creates them. XFS is smarter: it waits until the last moment, just before writing nodes to disk. I'd like to thank the XFS team for making an effort to ensure that I understood the merits of their approach. The easy way to see those merits is to consider a file that is deleted before it reaches disk. Such a file should have no effect on the disk layout.
[Illustration: character squeezing a folding form]

Reiser4: The Atomic Filesystem

Reducing The Damage of Crashing

When a computer crashes, the data in RAM which has not yet reached disk is lost. You might at first be tempted to think that we then want to keep all of the data that did reach disk. Suppose that you were performing a transfer of $10 from bank account A to bank account B, and this consisted of two operations: 1) debit $10 from A, and 2) credit $10 to B. Suppose that 1) but not 2) reached disk, and the computer crashed. It would be better to disregard 1) than to let 1) but not 2) take effect, yes? When there is a set of operations which we ensure will all take effect, or none take effect, we call the set as a whole an atom. Reiser4 implements all of its filesystem system calls (requests to the kernel to do something are called system calls) as fully atomic operations, and allows one to define new atomic operations using its plugin infrastructure. Why don't all filesystems do this? Performance. Reiser4 employs new algorithms that allow it to make these operations atomic at little additional cost, where other filesystems have paid a heavy, usually prohibitive, price to do that. We hope to share with you how that is done.

A Brief History Of How Filesystems Have Handled Crashes

Filesystem Checkers

Originally filesystems had filesystem checkers that would run after every crash. The problems with that were that 1) the checkers cannot handle every form of damage well, and 2) the checkers run for a long time.
The amount of data stored on hard drives increased faster than the transfer rate (the rate at which hard drives transfer their data from the platter spinning inside them into the computer's RAM when they are asked to do one large continuous read, or the rate in the other direction for writes), which means that the checkers took longer to run, and as the decades ticked by it became less and less reasonable for a mission critical server to wait for the checker.

Fixed Location Journaling

A solution was adopted of first writing each atomic operation to a location on disk called the journal or log, and then, only after each atom had fully reached the journal, writing it to the committed area of the filesystem. The problem with this is that twice as much data needs to be written. On the one hand, if the workload is dominated by seeks, this is not as much of a burden as one might think. On the other hand, for writes of large files, it halves performance, because such writes are usually transfer time dominated. For this reason, meta-data journaling came to dominate general purpose usage. With meta-data journaling, the filesystem guarantees that all of its operations on its meta-data will be done atomically. If a file is being written to, the data being written to that file may be corrupted as a result of non-atomic data operations, but the filesystem's internals will all be consistent. The performance advantage was substantial. V3 of ReiserFS offers both meta-data and data journaling, and defaults to meta-data journaling because that is the right solution for most users. Oddly enough, meta-data journaling is much more work to implement, because it requires being precise about what needs to be journaled. As is so often the case in programming, doing less work requires more code.
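The cost argument above can be captured in a toy write-amplification model (the byte counts are invented for illustration; real journals also pay commit-block and seek overheads not modelled here):

```python
def bytes_written(data_bytes, meta_bytes, mode):
    """Toy model: total bytes hitting disk for one batch of changes."""
    if mode == "data-journal":    # everything written to the journal,
        return 2 * (data_bytes + meta_bytes)   # then again when committed
    if mode == "meta-journal":    # only meta-data is written twice
        return data_bytes + 2 * meta_bytes
    raise ValueError(mode)

# A large streaming write: 100 MB of data, 1 MB of meta-data.
MB = 2 ** 20
full = bytes_written(100 * MB, 1 * MB, "data-journal")   # 202 MB
meta = bytes_written(100 * MB, 1 * MB, "meta-journal")   # 102 MB
assert full / meta > 1.9   # transfer-dominated writes: roughly halved
```

For seek-dominated workloads the ratio of bytes written matters far less than the ratio of seeks, which is why full data journaling is not always as painful as this model suggests.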
With fixed location data journaling, the overhead of making each operation atomic is too high for it to be appropriate for average applications that don't especially need it, because of the cost of writing twice. Applications that do need atomicity are written to use fsync and rename to accomplish it, and these tools are simply terrible for that job: terrible in performance, and terrible in the ugliness they add to the coding of applications. Stuffing a transaction into a single file just because you need the transaction to be atomic is hardly what one would call flexible semantics. Also, data journaling, with all its performance cost, still does not necessarily guarantee that every system call is fully atomic, much less that one can construct sets of operations that are fully atomic. It usually merely guarantees that the files will not contain random garbage, however many blocks of them happen to get written, and however much the application might view the result as inconsistent data. I hope you understand that we are trying to set a new expectation here for how secure a filesystem should keep your data, when we provide these atomicity guarantees.

Wandering Logs

One way to avoid having to write the data twice is to change one's definition of where the log area and the committed area are, instead of moving the data from the log to the committed area. There is an annoying complication to this though, in that there are probably a number of pointers to the data from the rest of the filesystem, and we need them to point to the new data. When the commit occurs, we need to write those pointers so that they point to the data we are committing. Fortunately, these pointers tend to be highly concentrated as a result of our tree design. But wait: if we are going to update those pointers, then we want to commit those pointers atomically also, which we could do if we write them to another location and update the pointers to them, and....
up the tree the changes ripple. When we get to the top of the tree, since disk drives write sectors atomically, the block number of the top can be written atomically into the superblock by the disk, thereby committing everything the new top points to. This is indeed the way WAFL, the Write Anywhere File Layout filesystem invented by Dave Hitz at Network Appliance, works. It always ripples changes all the way to the top, and indeed that works rather well in practice; most of their users are quite happy with its performance.

Writing Twice May Be Optimal Sometimes

Suppose that a file is currently well laid out, and you write to a single block in the middle of it, and you then expect to do many reads of the file. That is an extreme case illustrating that sometimes it is worth writing twice so that a block can keep its current location while committing atomically. If one writes a node twice in this way, one also does not need to update its parent and ripple changes all the way to the top of the tree. Our code is a toolkit that can be used to implement different layout policies, and one of the available choices is whether to write over a block in its current place, or to relocate it to somewhere else. I don't think there is one right answer for all usage patterns. If a block is adjacent to many other dirty blocks in the tree, then this decreases the significance of the cost to read performance of relocating it and its neighbors. If one knows that a repacker will run once a week (a repacker is expected for V4.1, and is (a bit oddly) absent from WAFL), this also decreases the cost of relocation. After a few years of experimentation, measurement, and user feedback, we will say more about our experiences in constructing user selectable policies. Do we pay a performance penalty for making Reiser4 atomic? Yes, we do. Is it an acceptable penalty?
We picked up a lot more performance from other improvements in Reiser4 than we lost to atomicity, so the cost of atomicity is not isolated in our measurements, but I am unscientifically confident that the answer is yes. If changes are either large or batched together with enough other changes to become large, the performance penalty is low and drowned out by other performance improvements. Scattered small changes threaten us with read performance losses compared to overwriting in place and taking our chances with the data's consistency if there is a crash, but use of a repacker will mostly alleviate this scenario. I have to say that in my heart I don't have any serious doubts that for the general purpose user the increase in data security is worthwhile. The users, though, will have the final say.

Committing

A transaction preserves the previous contents of all modified blocks in their original location on disk until the transaction commits; commit means the transaction has reached a state where it will be completed even if there is a crash. The dirty blocks of an atom (which were captured and subsequently modified) are divided into two sets, relocate and overwrite, each of which is preserved in a different manner. The relocatable set is the set of blocks that have a dirty parent in the atom. The relocate set is those members of the relocatable set that will be written to a new or first location rather than overwritten. The overwrite set contains all dirty blocks in the atom that need to be written to their original locations, which is all those not in the relocate set. In practice this is those which do not have a parent we want to dirty, plus those for which overwrite is the better layout policy despite the write twice cost. Note that the superblock is the parent of the root node, and the free space bitmap blocks have no parent. By these definitions, the superblock and modified bitmap blocks are always part of the overwrite set.
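The set definitions above can be sketched as a small partition function. The block records here are hypothetical (a name plus a parent-is-dirty flag), not Reiser4 structures, and `prefer_overwrite` stands in for the layout-policy decision that the text says is made per block:

```python
def partition(dirty_blocks, prefer_overwrite=frozenset()):
    """dirty_blocks: {block_name: parent_is_dirty_in_atom}.
    A block with a dirty parent is relocatable; those of them not
    forced into overwrite by layout policy form the relocate set.
    Everything else is overwritten in its original location."""
    relocate, overwrite = set(), set()
    for block, parent_dirty in dirty_blocks.items():
        if parent_dirty and block not in prefer_overwrite:
            relocate.add(block)
        else:
            overwrite.add(block)
    return relocate, overwrite

# The superblock and bitmap blocks have no dirty parent (or no parent at
# all), so they always land in the overwrite set; leaf2 is relocatable
# but overwrite is judged the better layout for it.
dirty = {"leaf1": True, "leaf2": True, "bitmap": False, "superblock": False}
relocate, overwrite = partition(dirty, prefer_overwrite={"leaf2"})
assert relocate == {"leaf1"}
assert overwrite == {"leaf2", "bitmap", "superblock"}
```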
The wandered set is the set of blocks that the overwrite set will be written to temporarily, until the overwrite set commits. An interesting definition is the minimum overwrite set, which uses the same definitions as above with the following modification: if at least two dirty blocks have a common parent that is clean, then that parent is added to the minimum overwrite set, and its dirty children are removed from the overwrite set and placed in the relocate set. This policy is an example of what will be experimented with in later versions of Reiser4 using the layout toolkit. For space reasons, we leave out the full details on exactly when we relocate vs. overwrite, and the reader should not regret this, because years of experimenting probably lie ahead before we can speak with the authority necessary for a published paper on the effects of the many details and variations possible. When we commit, we write a wander list, which consists of a mapping of the wandered set to the overwrite set. The wander list is a linked list of blocks containing pairs of block numbers. The last act of committing a transaction is to update the super block to point to the front of that list. Once that is done, if there is a crash, the crash recovery will go through that list and "play" it, which means to write the wandered set over the overwrite set. If there is not a crash, we will also play it. There are many more details of how we handle the deallocation of wandered blocks, the handling of bitmap blocks, and so forth. You are encouraged to read the comments at the top of our source code files (e.g. wander.c) for such details....

Journaling Optimizations

Copy-on-capture

Suppose one wants to capture a node which belongs to an atom with stage >= ASTAGE_PRE_COMMIT. Such a capture request would have to wait (sleep in capture_fuse_wait()) while the atom is committed. The copy-on-capture optimization allows the capture request to be satisfied by creating a copy of the node which is being captured.
The commit process takes control of one copy of the node, and the capturing process takes control of the other copy. This does not lead to any node version conflicts, because it is guaranteed that the copy held by the commit process will not be modified.

Steal-on-capture

The idea of the steal-on-capture optimization is that only the last committed transaction to modify an overwrite block actually needs to write that block; other transactions can skip the post-commit write of that block. This optimization, which is also present in ReiserFS version 3, means that frequently modified overwrite blocks will be written less than two times per transaction. With this optimization, a frequently modified overwrite block may avoid being overwritten by a series of atoms; as a result, crash recovery must replay more atoms than without the optimization. If an atom has overwrite blocks stolen, the atom must be replayed during crash recovery until every stealing atom commits.

Repacker

Another way of escaping from the balancing time vs. space efficiency tradeoff is to use a repacker. 80% of files on the disk remain unchanged for long periods of time. It is efficient to pack them perfectly, by using a repacker that runs much less often than every write to disk. This repacker goes through the entire tree ordering, from left to right and then from right to left, alternating each time it runs. When it goes from left to right in the tree ordering, it shoves everything as far to the left as it will go, and when it goes from right to left it shoves everything as far to the right as it will go. (Left means small in key or in block number:-) ). In the absence of FS activity, the effect of this over time is to sort by tree order (defragment), and to pack with perfect efficiency. Reiser4.1 will modify the repacker to insert controlled "air holes", as it is well known that insertion efficiency is harmed by overly tight packing.
I hypothesize that it is more efficient to periodically run a repacker that systematically repacks using large IOs than to perform lots of 1 block reads of neighboring nodes of the modification points so as to preserve a balancing invariant in the face of poorly localized modifications to the tree.

Plugins

[Illustration: man holding 3 plugins]

8 Kinds of Plugins Make Reiser4 The Most Tweakable Filesystem Going

File Plugins

Every file possesses a plugin id, and every directory possesses a plugin id. This plugin id will identify a set of methods. The set of methods will embody all of the different possible interactions with the file or directory that come from sources external to ReiserFS. It is a layer of indirection added between the external interface to ReiserFS and the rest of ReiserFS. Each method will have a methodid. It will be usual to mix and match methods from other plugins when composing plugins.

Directory Plugins

Reiser4 will implement a plugin for traditional directories. It will implement directory style access to file attributes as part of the plugin for regular files. Later we will describe why this is useful. Other directory plugins we will leave for later versions. There is no deep reason for this deferral. It is simply the randomness of what features attract sponsors and make it into a release specification; there are no sponsors at the moment for additional directory plugins. I have no doubt that they will appear later; new directory plugins will be too much fun to miss out on.:-)

Hash Plugins

A directory is a mapping from file names to the files themselves. This mapping is implemented through the Reiser4 internal balanced tree. Unfortunately, file names cannot be used as keys until keys of variable length are implemented, or unless unreasonable limitations on maximal file name length are imposed. To work around this, the file name is hashed, and the hash is used as a key in the tree. No hash function is perfect, and there will always be hash collisions, that is, file names having the same hash value.
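The generation-counter remedy described next can be sketched as follows. The hash function (CRC32) and field widths here are purely illustrative, not the on-disk key format of any ReiserFS version:

```python
import zlib

def directory_key(name, existing_keys, hash_bits=24, gen_bits=8):
    """Build a directory key from a truncated hash of the name, using a
    generation counter to disambiguate names whose hashes collide."""
    h = zlib.crc32(name.encode()) & ((1 << hash_bits) - 1)
    for generation in range(1 << gen_bits):
        key = (h << gen_bits) | generation
        if key not in existing_keys:
            return key
    raise RuntimeError("too many names hashing to the same value")

keys = set()
for name in ["foo", "bar", "baz"]:
    keys.add(directory_key(name, keys))
assert len(keys) == 3    # distinct keys even if two names' hashes collide
```

Note the cost the text describes: the generation bits are taken out of the hash field, so tolerating collisions this way shrinks the effective hash, which is one motivation for the more regular non-unique key support in Reiser4.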
Previous versions of ReiserFS (3.5 and 3.6) used a "generation counter" to overcome this problem: keys for file names having the same hash value were distinguished by different generation counters. This allowed hash collisions to be amortized, at the cost of reducing the number of bits used for hashing. This "generation counter" technique is actually an ad hoc form of support for non-unique keys. Keeping in mind that some form of this has to be implemented anyway, it seemed justifiable to implement more regular support for non-unique keys in Reiser4. Another reason for using hashes is that some (arguably brain-dead) interfaces require them: telldir(3) and seekdir(3). These functions presume that the file system can issue 64 bit "cookies" that can be used to resume a readdir. Cookies are implemented in most filesystems as byte offsets within a directory (which means they cannot shrink directories), and in ReiserFS as hashes of file names plus a generation counter. Curiously enough, the Single UNIX Specification tags telldir(3) and seekdir(3) as an "Extension", because "returning to a given point in a directory is quite difficult to describe formally, in spite of its intuitive appeal, when systems that use B-trees, hashing functions, or other similar mechanisms to order their directories are considered". We order directory entries in ReiserFS by their cookies. This costs us performance compared to ordering lexicographically. (But it is immensely faster than the linear searching employed by most other Unix filesystems.) Depending on the hash and its match to the application usage pattern, there may be more or less performance lossage. Hash plugins will probably remain until version 5 or so, when directory plugins and ordering function plugins will obsolete them. Directory entries will then be ordered by file names, as they should be (and possibly stem compressed as well).

Security Plugins

Security plugins handle all security checks.
They are normally invoked by file and directory plugins. An example of reading a file:

* Access the pluginid for the file.
* Invoke the read method for the plugin.
* The read method determines the security plugin for the file.
* That security plugin invokes its read check method for determining whether to permit the read.
* The read check method for the security plugin reads the file/attributes containing the permissions on the file.
* Since file/attributes are also files, this means invoking the plugin for reading the file/attribute.
* The pluginid for this particular file/attribute for this file happens to be inherited (saving space and centralizing control of it).
* The read method for the file/attribute is coded such that it does not check permissions when called by a security plugin method. (Endless recursion is thereby avoided.)
* The file/attribute plugin employs a decompression algorithm specially designed for efficient decompression of our encoding of ACLs.
* The security plugin determines that the read should be permitted.
* The read method continues and completes.

Item Plugins

The balancing code will be able to balance an item iff it has an item plugin implemented for it. The item plugin will implement each of the methods the balancing code needs (methods such as splitting items, estimating how large the split pieces will be, overwriting, appending to, cutting from, or inserting into the item, etc.). In addition to all of the balancing operations, item plugins will also implement intra-item search methods. V3 of ReiserFS understood the structure of the items it balanced. This made adding new types of items, storing such new security attributes as other researchers might develop, too expensive in coding time, greatly inhibiting their addition to ReiserFS.
In writing Reiser4 we hoped that there would be a great proliferation in the types of security attributes in ReiserFS if we made adding one a matter requiring not a modification of the balancing code by our most experienced programmers, but merely the writing of an item handler. This is necessary if we are to achieve our goal of making the addition of each new security attribute an order of magnitude or more easier to perform than it is now.

Key Assignment Plugins

When assigning the key to an item, the key assignment plugin is invoked, and it has a key assignment method for each item type. A single key assignment plugin is defined for the whole FS at FS creation time. We know from experience that there is no "correct" key assignment policy; squid has very different needs from average user home directories. Yes, there could be value in varying it more flexibly than just at FS creation time, but we have to draw the line somewhere when deciding what goes into each release....

Node Search and Item Search Plugins

Every node layout has a search method for that layout, and every item that is searched through has a search method for that item. (When doing searches, we search through a node to find an item, and then, for items that contain multiple things, search within the item.)

Putting Your New Plugin To Work Will Mean Recompiling

If you want to add a new plugin, we think your having to ask the sysadmin to recompile the kernel with your new plugin added to it will be acceptable for version 4.0. We will initially code plugin-id lookup as an in-kernel fixed length array lookup, code method ids as function pointers, and make no provision for post-compilation loading of plugins. Performance and coding cost motivate this.
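The two-level lookup described under "Node Search and Item Search Plugins" can be sketched as follows. This is a hypothetical in-memory model, not the on-disk layout: a node-layout search method locates the right item, then that item's own search method finds the unit within it.

```python
# Sketch of two-level search: node search finds an item, item search
# finds a unit inside it. Data layout is invented for illustration.
import bisect

class Item:
    def __init__(self, keys, values):
        self.keys = keys              # sorted keys of the units inside
        self.values = values

    def search(self, key):            # intra-item search method
        i = bisect.bisect_left(self.keys, key)
        if i < len(self.keys) and self.keys[i] == key:
            return self.values[i]
        return None

class Node:
    def __init__(self, items):
        self.items = items            # sorted by each item's smallest key

    def search(self, key):            # node-layout search method
        lows = [it.keys[0] for it in self.items]
        i = bisect.bisect_right(lows, key) - 1
        return self.items[i].search(key) if i >= 0 else None

node = Node([Item([1, 3], "ab"), Item([10, 20], "cd")])
assert node.search(3) == "b"          # found in the first item
assert node.search(20) == "d"         # found in the second item
assert node.search(5) is None         # falls between units
```

Because both search steps are methods, a plugin author can change either the node layout or the item layout independently of the rest of the tree code.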
[Illustration: a character almost drowning while another character hands him a plugin]

Without Plugins We Will Drown

People often ask, as ReiserFS grows in features, how will we keep the design from being drowned under the weight of the added complexity and from reaching the point where it is difficult to work on the code? The infrastructure to support security attributes implemented as files also enables lots of features not necessarily security related. The plugins we are choosing to implement in v4.0 are all security related because of our funding source, but users will add other sorts of plugins, just as they took DARPA's TCP/IP and used it for non-military computers. Only by requiring that all features be implemented in the manner that maximizes code reuse will we keep ReiserFS coding complexity down to where we can manage it over the long term.

Plugins: FS Programming For The Lazy

Most plugins will have only a very few of their features unique to them; the rest of the plugin will be reused code. What Namesys sees as its role as a DARPA contractor is not primarily supplying a suite of security plugins, though we are doing that, but creating an architecture (not just a license) that enables lots of outside vendors to efficiently create lots of innovative security plugins that Namesys would never have imagined working by itself.

Enhancing Security

[Illustration: a superman character complaining about an emergency]

By far the most casualties in wars have always been civilian. In future information infrastructure attacks, who will take more damage, civilian or military installations? DARPA is funding us to make all Gnu/Linux computers throughout the world a little bit more resistant to attack.

Fine Graining Security

Good Security Requires Precision In Specification Of Security

Suppose you have a large file with many components. A general principle of security is that good security requires precision of permissions.
When security lacks precision, it increases the burden of being secure; the extent to which users adhere to security requirements in practice is a function of the burden of adhering to them.

Space Efficiency Concerns Motivate Imprecise Security

Many filesystems make it space-inefficient to store small components as separate files, for various reasons. Not being separate files means that they cannot have separate permissions. One of the reasons for using overly aggregated units of security is thus space efficiency. ReiserFS currently improves on this by an order of magnitude over most of the existing alternative art. Space efficiency is the hardest of the reasons to eliminate; its elimination makes it that much more enticing to attempt to eliminate the other reasons.

Security Definition Units And Data Access Patterns Sometimes Inherently Don't Align

Applications sometimes want to operate on a collection of components as a single aggregated stream. (Note that commonly two different applications want to operate on data at different levels of aggregation; the infrastructure for solving this security issue will also solve that problem.)

/etc/passwd As Example

I am going to use the /etc/passwd file as an example, not because I think that other solutions won't solve its problems better, but because its implementation as a single flat file in the early Unixes is a wonderful illustrative example of poorly granularized security with which readers may share my personal experiences. I hope they will be able to imagine that other, less famous data files could have similar problems. Have you ever tried to figure out just exactly what part of your continually changing /etc/passwd file changed near the time of a break-in? Have you ever wished that you could have a modification time on each field in it?
Have you ever wished that users could change part of it, such as the gecos field, themselves (setuid utilities have been written to allow this, but this is a pedagogical, not a practical, example), but not have the power to change it for other users? There were good reasons why /etc/passwd was first implemented as a single file with one single permission governing the entire file. If we can eliminate them one by one, the same techniques for making finer grained security effective will be of value to other highly secure data files.

Aggregating Files Can Improve The User Interface To Them

Consider the use of emacs on a collection of a thousand small 8-32 byte files, like you might have if you deconstructed /etc/passwd into small files with separable ACLs for every field. It is more convenient in screen real estate, buffer management, and other user interface considerations to operate on them as an aggregation placed into a single buffer rather than as a thousand 8-32 byte buffers.

How Do We Write Modifications To An Aggregation

Suppose we create a plugin that aggregates all of the files in a directory into a single stream. How does one handle writes to that aggregation that change the length of the components of that aggregation? Richard Stallman pointed out to me that if we separate the aggregated files with delimiters, then emacs need not be changed at all to acquire an effective interface for large numbers of small files accessed via an aggregation plugin. If /new_syntax_access_path/big_directory_of_small_files/.glued is a plugin that aggregates every file in big_directory_of_small_files with a delimiter separating every file within the aggregation, then one can simply type emacs /new_syntax_access_path/big_directory_of_small_files/.glued, and the filesystem has done all the work emacs needs to be effective at this. Not a line of emacs needs to be changed.
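Conceptually, what such a ".glued" aggregation plugin does can be sketched in a few lines of Python. This is a user-space toy with an invented colon delimiter and invented member names (the real plugin would do this inside the filesystem): a read concatenates the member files with the delimiter, and a write is split on the delimiter and redistributed back to the members, which may therefore change length.

```python
# Toy model of delimiter-based aggregation: a directory of small files
# is presented as one stream, and writes to the stream are split back
# into the member files.

DELIM = ":"   # the delimiter is a per-plugin choice

def read_glued(directory: dict) -> str:
    """Aggregate all member files into one delimited stream."""
    return DELIM.join(directory[name] for name in sorted(directory))

def write_glued(directory: dict, data: str) -> None:
    """Split the stream on the delimiter and redistribute the pieces;
    members grow or shrink to fit their piece."""
    parts = data.split(DELIM)
    for name, part in zip(sorted(directory), parts):
        directory[name] = part

# One /etc/passwd entry deconstructed into per-field "files":
passwd_entry = {"1-name": "root", "2-passwd": "x", "3-uid": "0"}
assert read_glued(passwd_entry) == "root:x:0"
write_glued(passwd_entry, "toor:x:0")       # edit the aggregated view
assert passwd_entry["1-name"] == "toor"     # only that member changed
```

Each member "file" keeps its own identity (and, in the real design, its own ACL) even though the editor only ever sees the single aggregated stream.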
One needs to be able to choose different delimiting syntax for different aggregation plugins so that one can, for, say, the passwd file, aggregate subdirectories into lines, and files within those subdirectories into colon-separated fields within the line. XML would benefit from yet other delimiter construction rules. (We have been told by Philipp Guehring of LivingXML.NET that ReiserFS is higher performance than any database for storing XML, so this issue is not purely theoretical.)

Aggregation Is Best Implemented As Inheritance

In summary, to be able to achieve precision in security we need to have inheritance with specifiable delimiters, and we need whole file inheritance to support ACLs.

One Plugin Using Delimiters That Resemble sys_reiser4() Syntax

We provide the infrastructure for constructing plugins that implement arbitrary processing of writes to inheriting files, but we also supply one generic inheriting file plugin that intentionally uses delimiters very close to the sys_reiser4() syntax. We will document the syntax more fully when that code is working; for now, syntax details are in the comments in the file invert.c in the source code.

API Suitable For Accessing Files That Store Security Attributes

A new system call sys_reiser4() will be implemented to support applications that don't have to be fooled into thinking that they are using POSIX. Through this entry point a richer set of semantics will access the same files that are also accessible using POSIX calls. reiser4() will not implement more than hierarchical names. A full set theoretic naming system as described on our future vision page will not be implemented before SSN Reiserfs is implemented (Distributed Reiserfs is our distributed filesystem, Semi-Structured Naming Reiserfs is our enhanced semantics; whether we implement Distributed Reiserfs or SSN Reiserfs first depends on which sponsors we find ;-) ).
reiser4() will implement all features necessary to access ACLs as files/directories rather than as something neither file nor directory. These include opening and closing transactions, performing a sequence of I/Os in one system call, and accessing files without use of file descriptors (necessary for efficient small I/O). reiser4() will use a syntax suitable for evolving into SSN Reiserfs syntax with its set theoretic naming.

Flaws In Traditional File API When Applied To Security Attributes

Security related attributes tend to be small. The traditional filesystem API for reading and writing files has these flaws in the context of accessing security attributes:
* Creating a file descriptor is excessive overhead, and not useful, when accessing an 8 byte attribute.
* A system call for every attribute accessed is too much overhead when accessing lots of little attributes.
* Lacking constraints: it is important to constrain what is written to the attribute, often in complex ways.
* Lacking atomic semantics: often one needs to update multiple attributes as one action that is guaranteed to either fully succeed or fully fail.

The Usual Resolution Of These Flaws Is A One-Off Solution

The usual response to these flaws is that people adding security related and other attributes create a set of methods unique to their attributes, plus non-reusable code to implement those methods, in which their particular attributes are accessed and stored not using the methods for files, but using their particular methods for that attribute. Their particular API for that attribute typically does a one-off instantiation of a lightweight, single-system-call, write-constrained, atomic access, with no code being reusable by those who want to modify file bodies. It is basic and crucial to system design to decompose desired functionality into reusable, orthogonal, separated components.
Persons designing security attributes typically do so without the filesystem that they target offering them a proper foundation and tool kit. They need more help from us core FS developers. Linus said that we can have a system call to use as our experimental plaything in this. With what I have in mind for the API, one rather flexible system call is all we want for creating atomic, lightweight, batched, constrained accesses to files, with each of those adjectives being an orthogonal optional feature that may or may not be invoked in a particular instance of the new system call.

One-Off Solutions Are A Lot of Work To Do A Lot Of

Looking at the coin from the other side, we want to make it an order of magnitude less work to add features to ReiserFS, so that both users and Namesys can add at least an order of magnitude more of them. To verify that it is truly more extensible you have to do some extending, and our DARPA funding motivates us to instantiate most of those extensions as new security features. This system call's syntax enables attributes to be implemented as a particular type of file. It avoids uglifying the semantics with two APIs for two supposedly different kinds of objects that don't truly need different treatment. All of its special features that are useful for accessing particular attributes are also available for use on files. It has symmetry, and its features have been fully orthogonalized. There is nothing particularly interesting about this system call to a languages specialist (its ideas were explored decades ago, except by filesystem developers) until SSN Reiserfs, when we will further evolve it into a set theoretic syntax that deconstructs tuple structured names into hierarchy and vicinity set intersection. That is described at www.namesys.com/whitepaper.html

Steps For Creating A Security Attribute

You can create a new security attribute by:
* Defining a pluginid.
* Composing a set of methods for the plugin from ones you create or reuse from other existing plugins.
* Defining a set of items that act as the storage containers of the object, or reusing existing items from other plugins (e.g. regular files).
* Implementing item handlers for all of the new items you create.
* Creating a key assignment algorithm for all of the new items.

reiser4() System Call Description

The reiser4() system call (still being debugged at the time of writing) executes a sequence of commands separated by commas. Assignment and transaction are the commands supported in reiser4(); more commands will appear in SSN Reiserfs. <- and <<- are two of the assignment operators.

lhs (assignment target) values:
* /..../process/range/(offset<-(loff_t),last_byte<-(loff_t)) assigns (writes) to the buffer starting at address offset in the process address space, ending at last_byte. (The assignment source may be smaller or larger than the assignment target.) Representation of offset and last_byte is left to the coder to determine. It is an issue that will be of much dispute and little importance. Notice that / is used to indicate that the order of the operands matters; see the future vision whitepaper for details of why this is appropriate syntax design. Note the lack of a file descriptor.
* /filename assigns to the file named filename.
* /filename/..../range/(offset<-(loff_t),last_byte<-(loff_t)) writes to the body, starting at offset, ending not past last_byte.
* /filename/..../range/(offset<-(loff_t)) writes to the body starting at offset.

rhs (assignment source) values:
* /..../process/range/(offset<-(loff_t),last_byte<-(loff_t)) reads from the buffer starting at address offset in the process address space, ending at last_byte. Representation of offset and last_byte is left to the coder to determine, as it is an issue that will be of much dispute and little importance.
* /filename reads the entirety of the file named filename.
* /filename/..../range/(offset<-(loff_t),last_byte<-(loff_t)) reads from the body, starting at offset, ending not past last_byte.
* /filename/..../range/(offset<-(loff_t)) reads from the body starting at offset until the end.
* /filename/..../stat/owner reads from the ownership field of the stat data. (Stat data is that which is returned by the stat() system call (owner, permissions, etc.) and is stored on a per-file basis by the FS.)

Note that "...." and "process" are style conventions for the name of a hidden subdirectory implementing methods and accessing metadata supported by a plugin. It is possible to rename it, etc. We had a discussion about whether to instead use names that could not clash with any legitimate name likely to be used by users. Vladimir Demidov suggested that cryptic names have historically harmed the acceptance of several languages, and so it was realized that being novice-unfriendly in the naming was worse than risking a name collision, especially since a collision could be cured by using rename on "...." and "process" for the few cases where it is necessary.

Constraints

(Note: this is not yet coded.) Another way security may be insufficiently fine grained is in values: it can be useful to allow persons to change data, but only within certain constraints. For this project we will implement plugins; one type of plugin will be write constraints. Write-constraints are invoked upon write to a file; if they return non-error then the write is allowed. We will implement two trivial sample write-constraint plugins. One will be in the form of a kernel function, loadable as a kernel module, which returns non-error (thus allowing the write) if the file consists of the strings "secret" or "sensitive" but not "top-secret". The other, which does exactly the same, will be in the form of a perl program residing in a file and executed in user-space.
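As a user-space illustration of the sample write-constraint described above — a toy model of the plugin hook, not the kernel module or the perl version, and loosely interpreting "consists of" as a substring test — the constraint permits a write when the proposed contents mention "secret" or "sensitive" but not "top-secret":

```python
# Toy write-constraint hook: the constraint sees the proposed new file
# contents and returns True (allow) or False (reject) before anything
# reaches storage.

def classification_constraint(proposed: bytes) -> bool:
    text = proposed.decode("utf-8", errors="replace")
    if "top-secret" in text:
        return False                  # forbidden classification
    return "secret" in text or "sensitive" in text

def write(storage: dict, name: str, data: bytes, constraint) -> bool:
    """Invoke the constraint on every write; store only on success."""
    if not constraint(data):
        return False                  # constraint rejected the write
    storage[name] = data
    return True

fs = {}
assert write(fs, "memo", b"this is secret", classification_constraint)
assert not write(fs, "memo", b"top-secret plans", classification_constraint)
assert fs["memo"] == b"this is secret"    # rejected write left no trace
```

The same hook shape serves both variants the text describes: an in-kernel function or a user-space process fed the proposed contents on standard input.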
Use of kernel functions will have performance advantages, particularly for small functions, but severe disadvantages in power of scripting, flexibility, and ability to be installed from non-secure sources. Both types of plugins will have their place. Note that ACLs will also embody write constraints. We will implement both constraints that are compiled into the kernel, and constraints that are implemented as user space processes. Specifically, we will implement a plugin that executes an arbitrary constraint contained in an arbitrarily named file as a user space process, passes the proposed new file contents to that process as standard input, and iff the process exits without error allows the write to occur. It can be useful to have read constraints as well as write constraints.

Auditing

(Note: this is not yet coded.) We will implement a plugin that notifies administrators by email when access, e.g. read access, is made to files. With each plugin implemented, creating additional plugins becomes easier as the available toolkit is enriched. Auditing constitutes a major additional security feature, yet it will be easy to implement once the infrastructure to support it exists. (It would be substantial work to implement it without that infrastructure.) The scope of this project is not the creation of plugins themselves, but the creation of the infrastructure that plugin authors would find useful. We want to enable future contributors to implement more secure systems on the Gnu/Linux platform, not implement them ourselves. By laying a proper foundation and creating a toolkit for them, we hope to reduce the cost of coding new security attributes for those who follow us by an order of magnitude. Employing a proper set of well orthogonalized primitives also changes the addition of these attributes from being a complexity burden upon the architecture into being an empowering extension of the architecture.
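To make the lhs/rhs operand forms from the reiser4() system call description above concrete, here is a toy parser for the range syntax. It is illustrative only: it accepts literal byte offsets where the description writes (loff_t), and it is not the actual sys_reiser4() grammar.

```python
# Toy recognizer for the documented operand forms:
#   /filename
#   /filename/..../range/(offset<-N,last_byte<-M)
#   /filename/..../range/(offset<-N)        # until the end
import re

RANGE = re.compile(
    r"^(?P<file>/.*)/\.\.\.\./range/"
    r"\(offset<-(?P<offset>\d+)(?:,last_byte<-(?P<last>\d+))?\)$"
)

def parse(operand: str):
    m = RANGE.match(operand)
    if m is None:
        return {"file": operand}          # bare /filename: whole file
    return {
        "file": m.group("file"),
        "offset": int(m.group("offset")),
        # a missing last_byte means "until the end of the body"
        "last_byte": int(m.group("last")) if m.group("last") else None,
    }

assert parse("/etc/passwd") == {"file": "/etc/passwd"}
assert parse("/f/..../range/(offset<-4,last_byte<-9)") == \
    {"file": "/f", "offset": 4, "last_byte": 9}
assert parse("/f/..../range/(offset<-4)")["last_byte"] is None
```

Note how the "...." convention makes the range method look like an ordinary path component, which is exactly the point of the hidden-subdirectory style discussed above.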
Increasing the Allowed Granularity of Security

[Illustration: a man holding a sieve; only objects of a certain size go through]

(This feature is not yet coded.) Inheritance of security attributes is important to providing flexibility in their administration. We have spoken about making security more fine grained, but sometimes it needs to be larger grained. Sometimes a large number of files are logically one unit with regard to their security, and it is desirable to have a single point of control over their security. Inheritance of attributes is the mechanism for implementing that. Security administrators should have the power to choose whatever units of security they desire, without having to distort them to make them correspond to semantic units. Inheritance of file bodies using aggregation plugins allows the units of security to be smaller than files; inheritance of attributes allows them to be larger than files.

Encryption On Commit

Currently, encrypted files suffer severely in their write performance when implemented using schemes that encrypt at every write() rather than at every commit to disk. We encrypt on flush, such that a file with an encryption plugin id is encrypted not at the time of write, but at the time of flush to disk. Encryption is implemented as a special form of repacking on flush, and it occurs for any node which has its CONTAINS_ENCRYPTED_DATA state flag set.

Conclusion

Reiser4 offers a dramatically better infrastructure for creating new filesystem features. Files and directories have all of the features needed to make it unnecessary for file attributes to be something different from files. The effectiveness of this new infrastructure is tested using a variety of new security features. Performance is greatly improved by the use of dancing trees, wandering logs, allocate on flush, a repacker, and encryption on commit. It was an important question whether we could increase the level of abstraction in our design without harming performance.
Reiser4 gives you BOTH the most cleanly abstracted storage AND the highest performance storage of any filesystem.

Citations:
* [Gray93] Jim Gray and Andreas Reuter. "Transaction Processing: Concepts and Techniques". Morgan Kaufmann Publishers, Inc., 1993. Old but good textbook on transactions. Available at http://www.mkp.com/books_catalog/catalog.asp?ISBN=1-55860-190-2
* [Hitz94] D. Hitz, J. Lau and M. Malcolm. "File system design for an NFS file server appliance". Proceedings of the 1994 USENIX Winter Technical Conference, pp. 235-246, San Francisco, CA, January 1994. Available at http://citeseer.nj.nec.com/hitz95file.html
* [TR3001] D. Hitz. "A Storage Networking Appliance". Tech. Rep. TR3001, Network Appliance, Inc., 1995. Available at http://www.netapp.com/tech_library/3001.html
* [TR3002] D. Hitz, J. Lau and M. Malcolm. "File system design for an NFS file server appliance". Tech. Rep. TR3002, Network Appliance, Inc., 1995. Available at http://www.netapp.com/tech_library/3002.html
* [Ousterh89] J. Ousterhout and F. Douglis. "Beating the I/O Bottleneck: A Case for Log-Structured File Systems". ACM Operating System Reviews, Vol. 23, No. 1, pp. 11-28, January 1989. Available at http://citeseer.nj.nec.com/ousterhout88beating.html
* [Seltzer95] M. Seltzer, K. Smith, H. Balakrishnan, J. Chang, S. McMains and V. Padmanabhan. "File System Logging versus Clustering: A Performance Comparison". Proceedings of the 1995 USENIX Technical Conference, pp. 249-264, New Orleans, LA, January 1995. Available at http://citeseer.nj.nec.com/seltzer95file.html
* [Seltzer95Supp] M. Seltzer. "LFS and FFS Supplementary Information". 1995. http://www.eecs.harvard.edu/~margo/usenix.195/
* [Ousterh93Crit] J. Ousterhout. "A Critique of Seltzer's 1993 USENIX Paper". http://www.eecs.harvard.edu/~margo/usenix.195/ouster_critique1.html
* [Ousterh95Crit] J. Ousterhout. "A Critique of Seltzer's LFS Measurements". http://www.eecs.harvard.edu/~margo/usenix.195/ouster_critique2.html
* [SwD96] A.
Sweeney, D. Doucette, W. Hu, C. Anderson, M. Nishimoto and G. Peck. "Scalability in the XFS File System". Proceedings of the 1996 USENIX Technical Conference, pp. 1-14, San Diego, CA, January 1996. Available at http://citeseer.nj.nec.com/sweeney96scalability.html
* [VelskiiLandis] G.M. Adel'son-Vel'skii and E.M. Landis. "An algorithm for the organization of information". Soviet Math. Doklady 3, pp. 1259-1262, 1962. This paper on AVL trees can be thought of as the founding paper of the field of storing data in trees. Those not conversant in Russian will want to read the [Lewis and Denenberg] treatment of AVL trees in its place. [Wood] contains a modern treatment of trees.
* [Apple] Inside Macintosh, Files, by Apple Computer Inc., Addison-Wesley, 1992. Employs balanced trees for filenames; it was an interesting filesystem architecture for its time in a number of ways, but its problems with internal fragmentation have become more severe as disk drives have grown larger. I look forward to the replacement they are working on.
* [Bach] Maurice J. Bach. "The Design of the Unix Operating System". Prentice-Hall Software Series, Englewood Cliffs, NJ, 1986. Superbly written but sadly dated; contains detailed descriptions of the filesystem routines and interfaces in a manner especially useful for those trying to implement a Unix compatible filesystem. See [Vahalia].
* [BLOB] R. Haskin, Raymond A. Lorie. "On Extending the Functions of a Relational Database System". SIGMOD Conference 1982, pp. 207-212 (body of paper not on web). Reiser4 obsoletes this approach.
* [Chen] Chen, P.M., Patterson, David A. "A New Approach to I/O Performance Evaluation---Self-Scaling I/O Benchmarks, Predicted I/O Performance". 1993 ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems; also available on Chen's web page.
* [C-FFS] Ganger, Gregory R., Kaashoek, M. Frans. "Embedded Inodes and Explicit Grouping: Exploiting Disk Bandwidth for Small Files".
A very well written paper focused on 1-10k file size issues; they use some similar notions (most especially their concept of grouping, compared to my packing localities). Note that they focus on the 1-10k file size range, and not the sub-1k range. The 1-10k range is the weak point in ReiserFS V3 performance. The page with a link to the postscript paper is available at http://amsterdam.lcs.mit.edu/papers/cffs.html
* [ext2fs] By Remi Card; extensive information and source code are available. Probably our toughest current competitor, though it is showing its age, and recent enhancements of it (journaling, htrees, etc.) have not been performance effective. It embodies both the strengths and weaknesses of the incrementalist approach to coding, and substantially resembles the older FFS filesystem from BSD.
* [FFS] M. McKusick, W. Joy, S. Leffler, R. Fabry. "A Fast File System for UNIX". ACM Transactions on Computer Systems, Vol. 2, No. 3, pp. 181-197, August 1984. Describes the implementation of a filesystem which employs parent directory location knowledge in determining file layout. It uses large blocks for all but the tail of files to improve I/O performance, and uses small blocks called fragments for the tails so as to reduce the cost due to internal fragmentation. Numerous other improvements are also made to what was once the state-of-the-art. FFS remains the architectural foundation for many current block allocation filesystems, and was later bundled with the standard Unix releases. Note that unrequested serialization and the use of fragments place it at a performance disadvantage to ext2fs, though whether ext2fs is thereby made less reliable is a matter of dispute that I take no position on (Reiser4 is an atomic filesystem, which is a different level of reliability entirely). Available at http://citeseer.nj.nec.com/mckusick84fast.html
* [Ganger] Gregory R. Ganger, Yale N. Patt. "Metadata Update Performance in File Systems".
(Abstract only.)
* [Gifford] Describes a filesystem enriched to have more than hierarchical semantics. He shares many goals with this author; forgive me for thinking his work worthwhile. If I had to suggest one improvement in a sentence, I would say his semantic algebra needs closure. (Postscript only.)
* [Hitz, Dave] A rather well designed filesystem optimized for NFS and RAID in combination. Note that RAID increases the merits of write-optimization in block layout algorithms. Available at http://www.netapp.com/technology/level3/3002.html
* [Holton and Das] Holton, Mike., Das, Raj. "The XFS space manager and namespace manager use sophisticated B-Tree indexing technology to represent file location information contained inside directory files and to represent the structure of the files themselves (location of information in a file)". Note that it is still a block (extent) allocation based filesystem; no attempt is made to store the actual file contents in the tree. It is targeted at the needs of the other end of the file size usage spectrum from ReiserFS, and is an excellent design for that purpose (though most filesystems, including Reiser4, do well at writing large files, and I think it is medium-sized and smaller files where filesystems can substantively differentiate themselves). SGI has also traditionally been a leader in resisting the use of unrequested serialization of I/O. Unfortunately, the paper is a bit vague on details. Available at http://www.sgi.com/Technology/xfs-whitepaper.html
* [Howard] Howard, J.H., Kazar, M.L., Menees, S.G., Nichols, D.A., Satyanarayanan, M., Sidebotham, R.N., West, M.J. "Scale and Performance in a Distributed File System". ACM Transactions on Computer Systems, 6(1), February 1988. A classic benchmark, it was too CPU bound to effectively stress ext2fs and ReiserFS, and is no longer very effective for modern filesystems.
* [Knuth] Knuth, D.E., The Art of Computer Programming, Vol.
3 (Sorting and Searching), Addison-Wesley, Reading, MA, 1973. The earliest reference discussing trees storing records of varying length.
* [LADDIS] Wittle, Mark., and Bruce, Keith. "LADDIS: The Next Generation in NFS File Server Benchmarking". Proceedings of the Summer 1993 USENIX Conference, July 1993, pp. 111-128.
* [Lewis and Denenberg] Lewis, Harry R., Denenberg, Larry. "Data Structures & Their Algorithms". HarperCollins Publishers, NY, NY, 1991. An algorithms textbook suitable for readers wishing to learn about balanced trees and their AVL predecessors.
* [McCreight] McCreight, E.M. "Pagination of B*-trees with variable length records". Commun. ACM 20 (9), pp. 670-674, 1977. Describes algorithms for trees with variable length records.
* [McVoy and Kleiman] The implementation of write-clustering for Sun's UFS. Available at http://www.sun.ca/white-papers/ufs-cluster.html
* [OLE] "Inside OLE" by Kraig Brockschmidt; discusses Structured Storage (abstract only). Structured storage is what you get when application developers need features to better manage the storage of objects on disk by the applications they write, and the filesystem group at their company can't be bothered with them. Miserable performance, miserable semantics. Available at http://www.microsoft.com/mspress/books/abs/5-843-2b.htm
* [Ousterhout] J.K. Ousterhout, H. Da Costa, D. Harrison, J.A. Kunze, M.D. Kupfer, and J.G. Thompson. "A Trace-driven Analysis of the UNIX 4.2BSD File System". In Proceedings of the 10th Symposium on Operating Systems Principles, pages 15-24, Orcas Island, WA, December 1985.
* [NTFS] "Inside the Windows NT File System", written by Helen Custer; NTFS was architected by Tom Miller with contributions by Gary Kimura, Brian Andrew, and David Goebel. Microsoft Press, 1994. An easy to read little book. They fundamentally disagree with me on adding serialization of I/O not requested by the application programmer, and I note that the performance penalty they pay for their decision is high, especially compared with ext2fs. Their FS design is perhaps optimal for floppies and other hardware eject media beyond OS control. A less serialized, higher performance log structured architecture is described in [Rosenblum and Ousterhout]. That said, Microsoft is to be commended for recognizing the importance of attempting to optimize for small files, and leading the OS designer effort to integrate small objects into the file name space. This book is notable for not referencing the work of persons not working for Microsoft, or providing any form of proper attribution to previous authors such as [Rosenblum and Ousterhout]. Though perhaps they really didn't read any of the literature, and it explains why theirs is the worst performing filesystem in the industry....
* [Peacock] K. Peacock. "The CounterPoint Fast File System". Proceedings of the Usenix Conference, Winter 1988.
* [Pike] Rob Pike and Peter Weinberger. "The Hideous Name". USENIX Summer 1985 Conference Proceedings, p. 563, Portland, Oregon, 1985. Short, informal, and drives home why inconsistent naming schemes in an OS are detrimental. Available at http://achille.cs.bell-labs.com/cm/cs/doc/85/1-05.ps.gz. His discussion of naming in Plan 9: http://plan9.bell-labs.com/plan9/doc/names.html
* [Rosenblum and Ousterhout] M. Rosenblum and J. Ousterhout. "The Design and Implementation of a Log-Structured File System". ACM Transactions on Computer Systems, Vol. 10, No. 1, pp. 26-52, February 1992. Available at http://citeseer.nj.nec.com/rosenblum91design.html.
This paper was quite influential in a number of ways on many modern filesystems, and the notion of using a cleaner may be applied to a future release of ReiserFS. There is an interesting ongoing debate over the relative merits of FFS vs. LFS architectures, and the interested reader may peruse http://www.scriptics.com/people/john.ousterhout/seltzer93.html and the arguments by Margo Seltzer it links to. * [Snyder] "tmpfs: A Virtual Memory File System" discusses a filesystem built to use swap space and intended for temporary files; due to a complete lack of disk synchronization it offers extremely high performance. * [Vahalia] Uresh Vahalia, "Unix Kernel Internals" * [Reiser93] Reiser, Hans T., Future Vision Whitepaper, 1984, Revised 1993. Available at http://www.namesys.com/whitepaper.html.

Why Reiser4 (revision of 2020-12-10 by Edward)

= Summary = Reiser4 is not only a file system. It is a software framework for the creation, assembly and customization of file systems managing local (simple or logical) storage volumes of the operating system. Reiser4 is the successor of ReiserFS (also known as ReiserFS version 3). Reiser4 absorbed the results of academic research in the area of data storage which had been conducted since 1992 by engineers of the Namesys labs in collaboration with Moscow State University and the Program Systems Institute of the Russian Academy of Sciences. For historical reasons Reiser4 currently works only on Linux. However, it can easily be ported to other operating systems thanks to its modular infrastructure.
= History of Reiser4 = Namesys was created by Hans Reiser around 1993 from a number of recent graduates trained in the education system of the former Soviet Union (USSR). By the beginning of the new century Namesys engineers had accumulated a number of innovative ideas in the area of data storage software systems.
However, it was rather problematic to implement them in the context of the then-existing ReiserFS (v3), mostly because of design problems. On the other hand, ReiserFS (v3) had a number of shortcomings which were hard to fix for the same reasons. So a decision was made to develop a new file system from scratch, which was supposed to absorb the experience of the previous developments. In 2002 Namesys got a grant from DARPA for this. Reiser4 development was also sponsored by Linspire. However, in commercial terms Namesys' activity was not successful, which eventually led to financial problems. Since the arrest of Hans Reiser (in October 2006), Reiser4 has been maintained by a former Namesys employee, the mathematician and programmer Edward Shishkin, Ph.D. Currently development continues on a non-commercial basis. In this development mode Reiser4 acquired stability and a number of new features, such as modules for transparent compression (announced in 2007), different transaction models (Journaling, Write-Anywhere (COW), Hybrid), precise asynchronous discard support for SSD drives, metadata and inline data checksums, failover, etc. A new Reiser4 disk format version (4.0.1) was released (*).
= Reiser4 and upstream = In contrast with its predecessor (ReiserFS v3), Reiser4 was not accepted into the upstream Linux kernel, for political reasons. Later Edward Shishkin expressed an interest (*) in porting Reiser4 to other operating systems, specifically to FreeBSD, which is, from his standpoint, "more open to academic research". In this case it would be illogical to expect Reiser4 to be tightly integrated with some particular operating system. Thus, Reiser4 is developed as a standalone, independent project (*). The archive of ports for upstream Linux kernels can be found at the [[Reiser4_patchsets#Stable_patchsets|project sites]].
= Efficiency of disk space usage = Reiser4 provides the most efficient disk space usage among all file systems, across all scenarios and workloads. In particular, on Compilebench ((c) Oracle) Reiser4 shows disk space efficiency 50 times (5000%) better than ext4, and 12 times (1200%) better than Btrfs with compression. The problem of internal fragmentation in Reiser4 is completely resolved by a special technique of liquid records (or virtual keys): for any fraction Q < 1, every keyed record in the tree can be split into 2 parts in the proportion Q, with the possibility of quickly allocating unique keys for both parts. Such a split (as well as a merge) is performed by plugins of the ITEM interface when packing data and metadata into tree nodes at flush time (just before writing to disk).
= Reiser4 structure. Plugins. Heterogeneity in time and in space = The file system, as a complicated subsystem of a modern OS, is highly subject to the problem of creeping featurism caused by progress in hardware and software technologies. To resist this problem, Namesys engineers decided to develop not simply a file system, but a whole software framework providing a reusable environment. Reiser4 has two different code bases, for the kernel module and for the user-space utilities. Both have a modular infrastructure: every calculation in the context of the file system (or a user-space utility) is the execution of a module of some interface. The Reiser4 kernel module has an abstract base, which is a directed acyclic graph (DAG) of interfaces. Every vertex of that graph represents an interface, and every directed edge represents a client-supplier relationship between interfaces. All top interfaces of the graph are suppliers for VFS. Every interface of the graph is implemented by one or more modules. At file system creation time the user can specify those modules depending on the type of workload and storage media.
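The idea of an interface implemented by several interchangeable modules, one of which is selected at file system creation time, can be sketched in C as a table of function pointers. This is a minimal illustration only: the interface name, module names and signatures below are invented and are not Reiser4 APIs.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical sketch: one interface, implemented by two modules
 * ("plugins"); the user picks one by name at creation time. */
struct hash_iface {
    const char *name;
    unsigned long (*hash)(const char *key);
};

static unsigned long hash_djb2(const char *s)
{
    unsigned long h = 5381;
    while (*s)
        h = h * 33 + (unsigned char)*s++;
    return h;
}

static unsigned long hash_len(const char *s) { return strlen(s); }

/* Registry of modules implementing the interface. */
static const struct hash_iface modules[] = {
    { "djb2", hash_djb2 },
    { "len",  hash_len  },
};

/* Module selection, as would happen at mkfs time. */
const struct hash_iface *lookup_module(const char *name)
{
    for (size_t i = 0; i < sizeof(modules) / sizeof(modules[0]); i++)
        if (strcmp(modules[i].name, name) == 0)
            return &modules[i];
    return 0;
}
```

Callers see only the interface; which module sits behind it is a per-volume choice, which is the property the DAG-of-interfaces design relies on.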
For some interfaces it is possible to switch to another module at any moment (usually such switching is accompanied by a corresponding conversion of run-time objects). Finally, for some interfaces Reiser4 performs such switching in an intelligent manner (without user intervention). Thus, the file system is in permanent evolution, adapting to current conditions. For historical reasons Reiser4 modules are called plugins. Heterogeneity means the ability to choose different modules of the same interface to manage different objects of the same type. For example, the bodies of small and large files in Reiser4 are managed by different plugins of the ITEM interface (bodies of small files are packed into formatted blocks, whereas bodies of large files are stored as a number of unformatted blocks, i.e. extents). Another example is compressed and non-compressed files, which are managed by different plugins of the FILE interface. Heterogeneity in time means the ability to switch an object to a different managing plugin (so we can switch to a plugin which is preferable at some moment). Heterogeneity in space means the ability to assign different plugins to different components of the same compound object (e.g. a logical volume). In addition, the modular design makes it possible to safely add various features, whose emergence is driven by the continuous development of hardware storage technologies.
= Atomicity of operations = Atomicity means that file system operations either occur entirely, or they don't occur at all, and nothing is corrupted by an operation half-occurring. All operations in Reiser4 except long writes are atomic. In the case of long writes, Reiser4 is forced to close transactions to free dirty pages in response to memory pressure notifications and to reopen them for the rest of the user's data. So long writes in Reiser4 are split into a number of atomic writes.
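The splitting of a long write into a series of atomic writes can be sketched as simple chunking. The capacity constant below is invented for illustration; in Reiser4 the real limit depends on the file plugin and on memory-pressure notifications.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch: a long write split into atomic chunks.
 * ATOM_CAPACITY is an invented per-transaction byte limit. */
enum { ATOM_CAPACITY = 4096 };

/* Record the length of each atomic write for a request of nbytes;
 * returns the number of atomic writes. lens must have enough slots. */
size_t split_write(size_t nbytes, size_t *lens, size_t max_chunks)
{
    size_t n = 0;
    while (nbytes && n < max_chunks) {
        size_t c = nbytes < ATOM_CAPACITY ? nbytes : ATOM_CAPACITY;
        lens[n++] = c;
        nbytes -= c;
    }
    return n;
}
```

Each chunk here corresponds to one closed transaction, so a crash can leave the file with only a prefix of the chunks applied, which is exactly why such writes are not fully atomic.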
Edward Shishkin suggested a design for full atomicity (atomic writes of any length) in the Write-Anywhere transaction model, where an atom can be flushed without closing the transaction.
= Different transaction models = Reiser4 offers different transaction models, so at mount time the user can choose the one most suitable for the storage media and workload. The Journaling transaction model is recommended for HDD devices, as it doesn't lead to avalanche-like external fragmentation, which results in performance degradation on rotating storage media. The Write-Anywhere transaction model is recommended for SSD devices, which are not sensitive to external fragmentation. In this transaction model the number of I/O requests issued by the file system is minimal (it doesn't write blocks to a journal and then overwrite them in place on disk), which is also important for SSD devices. Reiser4 also offers a unique "hybrid" transaction model, which provides a strong invariant: a parent-first order on the storage tree nodes in terms of disk addresses. This transaction model is recommended for HDD users who don't perform a huge number of random overwrites. In it, one part of an atom's dirty pages (the overwrite set) is committed via the journal, and another part (the relocate set) is written to a different location on disk. All other file systems offer only one hardcoded transaction model: either only journaling (ext4, xfs, jfs, etc.), or only write-anywhere, a.k.a. "copy-on-write" (ZFS, Btrfs, etc.).
= Three-level block allocator = The first (lowest) level implements a map of free space (currently Reiser4 supports only a bitmap, but it could also be implemented as a tree of extents). A tight relationship between block allocation and transaction models was revealed at the Namesys labs and implemented in Reiser4: the second level implements a transaction model.
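The decision this level makes in the hybrid model, sorting each dirty block into the overwrite set or the relocate set, can be sketched as follows. The policy shown (keep a block in place only when its old location already respects the parent-first order) is a simplified illustration, not the actual Reiser4 heuristic.

```c
#include <assert.h>

/* Hypothetical sketch of the hybrid-model decision: each dirty
 * block joins either the atom's overwrite set (committed via the
 * journal, stays in place) or its relocate set (gets a fresh
 * location on disk). */
enum target { OVERWRITE_SET, RELOCATE_SET };

enum target classify_block(int has_old_location, int preserves_parent_first)
{
    /* A block with no on-disk location yet must be relocated;
     * a block whose old location breaks the parent-first order
     * is relocated to restore the invariant. */
    if (has_old_location && preserves_parent_first)
        return OVERWRITE_SET;
    return RELOCATE_SET;
}
```

Under this split, journaling and write-anywhere fall out as the two degenerate cases: everything in the overwrite set, or everything in the relocate set.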
At this level the block allocator decides whether a dirty page will be written back to its old place or get a new location on disk. The third (highest) level implements allocation policies in 2 contexts (forward and reverse) for the whole locality of a specified node.
= Off-line file system check = Any corrupted Reiser4 volume can be repaired off-line by a special user-space utility, fsck.reiser4, which is part of reiser4progs. Fsck.reiser4 performs 3 passes. On the first pass it checks the integrity of the basic data structure (the tree). On the second pass fsck scans the twig level and checks the integrity of the extent regions (contiguous regions on disk where the bodies of large files are stored). On the third (semantic) pass, fsck scans the leaf level and checks the integrity of "semantic" objects (directories, regular files, symlinks, etc.). Fsck.reiser4 absorbed the development experience of its predecessor, reiserfsck. In particular, fsck.reiser4 is free from a shortcoming inherent to reiserfsck, whose rebuilding process gets confused by ReiserFS (v3) images stored in the volume being repaired.
= Precise Asynchronous Discard support for SSD drives = In contrast with other file systems, Reiser4 does not simply inform the block layer about extents being freed on disk. For every such extent Reiser4 checks whether the head and tail of the respective erase units are free in the map of free disk space. If so, Reiser4 issues discard requests for the larger regions. This policy doesn't lead to the accumulation of "non-discarded garbage" on disk, and hence there is no need to periodically run tools like fstrim, which scan the disk and issue discard requests for such "garbage". In Reiser4, issuing discard requests is a delayed action performed on a per-atom basis at transaction commit time. This reduces the number of discard commands (because extents which need to be discarded get merged).
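The head/tail check described above can be sketched as widening a freed extent to full erase-unit boundaries whenever the padding around it is also free in the space map. The erase-unit size and the callback interface below are invented for illustration; units are blocks.

```c
#include <assert.h>

/* Hypothetical sketch of "precise discard": extend a freed extent
 * to erase-unit boundaries when the head and tail padding is also
 * free, so larger aligned discards are issued. */
#define ERASE_UNIT 512ULL /* blocks per erase unit; invented value */

struct extent { unsigned long long start, len; };

/* Demo space-map query: pretend every block is free. */
static int demo_all_free(unsigned long long start, unsigned long long len)
{
    (void)start; (void)len;
    return 1;
}

/* is_free asks the free-space map about the range [pos, pos + len). */
struct extent widen_discard(struct extent e,
                            int (*is_free)(unsigned long long,
                                           unsigned long long))
{
    unsigned long long head = e.start % ERASE_UNIT;
    unsigned long long end = e.start + e.len;
    unsigned long long tail = (ERASE_UNIT - end % ERASE_UNIT) % ERASE_UNIT;

    if (head && is_free(e.start - head, head)) {
        e.start -= head; /* pull start back to the unit boundary */
        e.len += head;
    }
    if (tail && is_free(end, tail))
        e.len += tail;   /* push end forward to the unit boundary */
    return e;
}
```

An unaligned freed extent thus becomes a whole-erase-unit discard when its neighbourhood is free, while an already aligned extent is passed through unchanged.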
= Metadata checksums = A special format of storage tree nodes makes it possible to support [https://reiser4.wiki.kernel.org/index.php/Reiser4_checksums checksums], which protect your metadata from corruption.
= Software Framework, Development model and compatibility = The Reiser4 codebase is organized as a set of interacting internal modules, called plugins. At run time, every calculation in the Reiser4 context is actually the execution of some "supplier" plugin, called by a "customer" plugin with parameters passed in accordance with the supplier's interface. Thus, every call stack is actually a path in a special directed acyclic graph of interfaces, which is the "abstraction base" of the Reiser4 codebase. Reiser4 development obeys a [https://reiser4.wiki.kernel.org/index.php/Reiser4_development_model model], a set of formal actions aimed at keeping all releases backward compatible. [[category:Reiser4]]
At the beginning of the new century Namesys engineers had accumulated a number of innovative ideas in the area of data storage software systems. However, it was rather problematic to implement them in the context of existing at that moment ReiserFS (v3), mostly because of design problems. On the other hand, ReiserFS (v3) had a number of shortcomings, which were hard to fix for the same reasons. So, it was made a decision to develop from scratch a new file system, which was supposed to absorb the experience of previous developments. In 2002 Namesys got a grant from DARPA for this. Reiser4 development was also sponsored by Linspire. However, in commercial terms Namesys activity was not successful, and eventually this had led to financial problems. Since the arrest of Hans Reiser (in October 2006) Reiser4 has been maintained by a former Namesys employee, mathematician, programmer, Ph.D Edward Shishkin. Currently development continues on a non-commercial base. In this development mode Reiser4 acquired stability and a number of many new features like modules for transparent compression (announced in 2007), different transaction models (Journaling, Write-Anywhere), Hybrid transaction model), precise asynchronous discard support for SSD drives, metadata and inline data checksums, failover, etc. A new Reiser4 disk format version (4.0.1) were released (*). = Reiser4 and upstream = In contrast with its predecessor (ReiserFS, v3), Reiser4 was not accepted to the upstream Linux kernel because of political reasons. Later Edward Shishkin expressed an interest (*) to port Reiser4 to other operating systems, specifically, to FreeBSD, which is, according to his standpoint, "more open to academic researches". In this case it would be illogical to expect Reiser4 to be tightly integrated with some particular operating system. Thus, Reiser4 is developed as a standalone independent project (*). 
The archive of ports for upstream Linux kernels can be found at the [[Reiser4_patchsets#Stable_patchsets|project sites]]. = Efficiency of disk space usage = Reiser4 provides the most efficient disk space usage among all file systems in all scenarios and workloads. In particular, on Compilebench ((c) Oracle) Reiser4 shows disk space efficiency 50 times (5000%) better than ext4, and 12 times (1200%) better than Btrfs with compression. The problem of internal fragmentation in Reiser4 is completely resolved by using a special technique of liquid records (or virtual keys). It means that for any fraction Q < 1 every keyed record in the tree can be split into 2 parts in the proportion Q with a possibility to quickly allocate unique keys for both parts. Such split (as well as merge) is performed by plugins of ITEM interface when packing data and metadata into tree nodes at flush time (just before writing to disk). = Reiser4 structure. Plugins. Heterogeneity in time and in space = File system as a complicated subsystem of modern OS is the most subjected to the problem of creeping featurism caused by the progress in hardware and software technologies. To resist this problem Namesys engineers made a decision to develop not simply a file system, but the whole software framework providing reusable environment. Reiser4 has two different code bases for kernel module and user-space utilities. All ones have modular infrastructure. It means that every calculation in the context of the file system (or user-space utility) looks like execution of a module of some interface. Resier4 kernel module has an abstract base, which is a direct acyclic graph (DAG) of interfaces. Every vertex of that graph represents an interface, and every directed edge of that graph represents a client-supplier relationship between interfaces. All top interfaces of that graph are suppliers for VFS. Every interface of that graph is implemented by one or more modules. 
At file system creation time user can specify those modules depending on the types of workload and storage media. For some interfaces there is a possibility to switch to another module at any moment (usually such switching is accompanied with a respective conversion of run-time objects). Finally, for some interfaces reiser4 performs such switching in intelligent manner (without user intervention). Thus, file system is in permanent evolution, adapting to current conditions. For historical reasons Reiser4 modules are called plugins. Heterogeneity means an ability to choose different modules of the same interface to manage different objects of the same type. For example, bodies of small and large files in Reiser4 are managed by different plugins of ITEM interface (bodies of small files are packed to formatted blocks, whereas bodies of large files are stored as a number of unformatted blocks (extents)). Another example is compressed and non-compressed files, which are managed by different plugins of FILE interface. Heterogeneity in time means an ability to switch to different managing plugin for an object (thus, we can switch to a plugin, which is more preferable at some moment). Heterogeneity in space means an ability to assign different plugins to different components of the same compound object (e.g. logical volume). In addition, modular design allows to safely add various features, whose emergence is caused by continuous development of hardware storage technologies. = Atomicity of operations = Atomicity means that filesystem operations either entirely occur, or they entirely don't, and they don't corrupt due to half occurring. All operations in Reiser4 except long writes are atomic. In the case of long writes Reiser4 is forced to close transactions to free dirty pages in a response to memory pressure notifications and reopen them for the rest of user's data. So, long writes in Reiser4 are split into a number of atomic writes. 
Maximal length of atomic write depends on the file plugin. Edward Shishkin suggested a design of full atomicity (atomic writes of any length) in Write-Anywhere transaction model, where atom can be flushed without closing a transaction. = Different transaction models = Reiser4 offers different transaction models, so at mount time user can choose a one, which is more suitable for his type of storage media and workload. Journaling transaction model is recommended for HDD devices, as this transaction model doesn't lead to avalanche-like external fragmentation which results in performance degradation on rotating media storage. Write-Anywhere transaction model is recommended for SSD devices, which are not critical to external fragmentation. In this transaction model number of IO requests issued by a file system is minimal (it doesn't write to journal with the following overwriting blocks on disk), which is also important for SSD devices. Also Reiser4 offers a unique "hybrid" transaction model, which provides a strong invariant - a parent-first order on the storage tree nodes in term of disk addresses. This transaction model is recommended for HDD users, who don't perform a huge number of random overwrites. In this transaction model a part of atom's dirty pages (overwrite set) is committed via journal, and another part (relocate set) is written to different location on disk. All other file systems offers only one hardcoded transaction model. This is either only journalling (ext4, xfs, jfs, etc), or only write-anywhere, AKA "copy-on-write" (ZFS, Btrfs, etc). = Three-level block allocator = The first (lowest) level implements a map of free space (currently reiser4 supports only bitmap, but it also can be implemented as a tree of extents). The second level implements allocation policies in 2 contexts (forward and reverse) for the whole locality of specified node. 
Tight relationship between block allocation and transaction models was revealed at Namesys labs and implemented in Reiser4. The second level implements a transaction model. On this level block allocator decides, if dirty page will be written to the old place, or it will get a new location on disk. The third (highest) level implements allocation policies in 2 contexts (forward and reverse) for the whole locality of specified node. = Off-line file system check = Any corrupted Reiser4 volume can be repaired off-line by a special user-space utility fsck.reiser4, which is a part of reiser4progs. Fsck.reiser4 performs 3 passes. At the first pass it checks integrity of the basic data structure (tree). At the second pass fsck scans twig level and checks integrity of the extent regions (contiguous regions on disk, where bodies of large files are stored). At the third (semantic) pass, fsck scans leaf level and checks integrity of "semantic" objects (directories, regular files, symlinks, etc). Fsck.reiser4 absorbed the development experience of its predecessor reiserfsck. In particular, fsck.reiser4 is free from a shortcoming inherent to reiserfsck, whose rebuilding process gets confused by ReiserFS (v3) images stored in the volume being repaired. = Precise Asynchronous Discard support for SSD drives = In contrast with other file systems, Reiser4 not simply informs the block layer about extents being freed on disk. For every such extent Reiser4 checks if head and tail of respective erase units are free in the map of disk free space. If so, Reiser4 issues discard extents for larger regions. Such policy doesn't lead to accumulation of "not discarded garbage" on disk and, hence, there is no need to run periodically tools like fstrim, which scan disk and issue discard requests for such "garbage". In Reiser4 issuing discard requests is a delayed action, which is performed on per-atom basis at transaction commit time. 
It allows to reduce number of discard commands (because of merging of extents, which need to be discarded). = Metadata checksums = A special format of storage tree nodes allows to support [https://reiser4.wiki.kernel.org/index.php/Reiser4_checksums checksums], which protect your meta-data from corruption. = Software Framework, Development model and compatibility = Reiser4 codebase is organized as a set of interacting internal modules, called plugins. At run time every calculation in Reiser4 context is actually execution of some "supplier" plugin, called by a "customer" plugin with some parameters, being passed in accordance with the supplier's interface. Thus, every call stack is actually a path in a special directed acyclic graph of interfaces, which is an "abstraction base" of Reiser4 codebase. Reiser4 development obeys a [https://reiser4.wiki.kernel.org/index.php/Reiser4_development_model model] which is a set of formal actions, aimed to make all releases (backward) compatible. [[category:Reiser4]] 0e1cf69678fc0f927dabbfb6bfb1fea1500abe91 4397 4396 2020-08-18T19:59:30Z Edward 4 /* Software Framework, Development model and compatibility */ = Summary = Reiser4 is not only a file system. It is a software framework for creation, assembly and customizing file systems managing local (simple or logical) storage volumes of the Operating System. Reiser4 is a successor of ReiserFS (which is also known as ReiserFS of version 3). Reiser4 absorbed results of academic researches in the area of data storage, which had been conducted since 1992 by engineers of Namesys labs in collaboration with Moscow State University and Program Systems Institute of the Russian Academy of Sciences. For historical reasons Reiser4 currently works only for Linux OS. However, it can be easily ported to any operating system due to its modular infrastructure. 
= History of Reiser4 = Namesys was created by Hans Reiser approximately in 1993 from a number of last graduates trained in the format of the old education system of Soviet Union (USSR). At the beginning of the new century Namesys engineers had accumulated a number of innovative ideas in the area of data storage software systems. However, it was rather problematic to implement them in the context of existing at that moment ReiserFS (v3), mostly because of design problems. On the other hand, ReiserFS (v3) had a number of shortcomings, which were hard to fix for the same reasons. So, it was made a decision to develop from scratch a new file system, which was supposed to absorb the experience of previous developments. In 2002 Namesys got a grant from DARPA for this. Reiser4 development was also sponsored by Linspire. However, in commercial terms Namesys activity was not successful, and eventually this had led to financial problems. Since the arrest of Hans Reiser (in October 2006) Reiser4 has been maintained by a former Namesys employee, mathematician, programmer, Ph.D Edward Shishkin. Currently development continues on a non-commercial base. In this development mode Reiser4 acquired stability and a number of many new features like modules for transparent compression (announced in 2007), different transaction models (Journaling, Write-Anywhere (COW), Hybrid transaction model), precise asynchronous discard support for SSD drives, metadata and inline data checksums, failover, etc. A new Reiser4 disk format version (4.0.1) were released (*). = Reiser4 and upstream = In contrast with its predecessor (ReiserFS, v3), Reiser4 was not accepted to the upstream Linux kernel because of political reasons. Later Edward Shishkin expressed an interest (*) to port Reiser4 to other operating systems, specifically, to FreeBSD, which is, according to his standpoint, "more open to academic researches". 
In this case it would be illogical to expect Reiser4 to be tightly integrated with some particular operating system. Thus, Reiser4 is developed as a standalone independent project (*). The archive of ports for upstream Linux kernels can be found at the [[Reiser4_patchsets#Stable_patchsets|project sites]]. = Efficiency of disk space usage = Reiser4 provides the most efficient disk space usage among all file systems in all scenarios and workloads. In particular, on Compilebench ((c) Oracle) Reiser4 shows disk space efficiency 50 times (5000%) better than ext4, and 12 times (1200%) better than Btrfs with compression. The problem of internal fragmentation in Reiser4 is completely resolved by using a special technique of liquid records (or virtual keys). It means that for any fraction Q < 1 every keyed record in the tree can be split into 2 parts in the proportion Q with a possibility to quickly allocate unique keys for both parts. Such split (as well as merge) is performed by plugins of ITEM interface when packing data and metadata into tree nodes at flush time (just before writing to disk). = Reiser4 structure. Plugins. Heterogeneity in time and in space = File system as a complicated subsystem of modern OS is the most subjected to the problem of creeping featurism caused by the progress in hardware and software technologies. To resist this problem Namesys engineers made a decision to develop not simply a file system, but the whole software framework providing reusable environment. Reiser4 has two different code bases for kernel module and user-space utilities. All ones have modular infrastructure. It means that every calculation in the context of the file system (or user-space utility) looks like execution of a module of some interface. Resier4 kernel module has an abstract base, which is a direct acyclic graph (DAG) of interfaces. 
= Summary =

Reiser4 is not only a file system: it is a software framework for creating, assembling and customizing file systems that manage the local (simple or logical) storage volumes of an operating system. Reiser4 is the successor of ReiserFS (also known as ReiserFS version 3). Reiser4 absorbed the results of academic research in the area of data storage conducted since 1992 by engineers of the Namesys labs in collaboration with Moscow State University and the Program Systems Institute of the Russian Academy of Sciences.
For historical reasons Reiser4 currently works only on Linux. However, thanks to its modular infrastructure it could be ported to other operating systems with relatively little effort.

= History of Reiser4 =

Namesys was founded by Hans Reiser around 1993 and staffed by some of the last graduates trained in the old Soviet education system. By the beginning of the new century, Namesys engineers had accumulated a number of innovative ideas in the area of data storage software. However, implementing them in the then-existing ReiserFS (v3) was problematic, mostly because of design limitations; for the same reasons, ReiserFS (v3) had a number of shortcomings that were hard to fix. So the decision was made to develop a new file system from scratch, one that would absorb the experience of the previous development. In 2002 Namesys received a DARPA grant for this work; Reiser4 development was also sponsored by Linspire. In commercial terms, however, Namesys was not successful, which eventually led to financial problems. Since the arrest of Hans Reiser in October 2006, Reiser4 has been maintained by Edward Shishkin, a former Namesys employee, mathematician and programmer. Development currently continues on a non-commercial basis. In this mode Reiser4 has gained stability and many new features, such as modules for transparent compression (announced in 2007), different transaction models (journaling, write-anywhere (COW) and a hybrid model), precise asynchronous discard support for SSD drives, metadata and inline-data checksums, and failover. A new Reiser4 disk format version (4.0.1) was released (*).

= Reiser4 and upstream =

In contrast with its predecessor ReiserFS (v3), Reiser4 was not accepted into the upstream Linux kernel, for political reasons.
Later, Edward Shishkin expressed an interest (*) in porting Reiser4 to other operating systems, specifically to FreeBSD, which in his view is "more open to academic research". It would therefore be illogical to expect Reiser4 to be tightly integrated with any particular operating system; instead, Reiser4 is developed as a standalone, independent project (*). The archive of ports for upstream Linux kernels can be found at the [[Reiser4_patchsets#Stable_patchsets|project sites]].

= Efficiency of disk space usage =

Reiser4 aims at the most efficient disk space usage among file systems across scenarios and workloads. In particular, on Compilebench ((c) Oracle) Reiser4 has shown disk space efficiency 50 times (5000%) better than ext4 and 12 times (1200%) better than Btrfs with compression. The problem of internal fragmentation is resolved in Reiser4 by a special technique of liquid records (or virtual keys): for any fraction Q < 1, every keyed record in the tree can be split into two parts in the proportion Q, with the ability to quickly allocate unique keys for both parts. Such splits (as well as merges) are performed by plugins of the ITEM interface when packing data and metadata into tree nodes at flush time, just before writing to disk.

= Reiser4 structure. Plugins. Heterogeneity in time and in space =

As one of the most complicated subsystems of a modern OS, a file system is especially subject to creeping featurism driven by progress in hardware and software technologies. To resist this problem, Namesys engineers decided to develop not simply a file system, but a whole software framework providing a reusable environment. Reiser4 has two code bases, one for the kernel module and one for the user-space utilities; both have a modular infrastructure. This means that every calculation in the context of the file system (or of a user-space utility) is the execution of a module of some interface.
The Reiser4 kernel module has an abstraction base, which is a directed acyclic graph (DAG) of interfaces. Every vertex of that graph represents an interface, and every directed edge represents a client-supplier relationship between interfaces. The top interfaces of the graph are suppliers for VFS. Every interface is implemented by one or more modules. At file system creation time the user can specify those modules depending on the type of workload and storage media. For some interfaces it is possible to switch to another module at any moment (usually such switching is accompanied by a corresponding conversion of run-time objects), and for some interfaces Reiser4 performs the switching itself, without user intervention. Thus the file system is in permanent evolution, adapting to current conditions. For historical reasons, Reiser4 modules are called plugins.

Heterogeneity means the ability to choose different modules of the same interface to manage different objects of the same type. For example, the bodies of small and large files in Reiser4 are managed by different plugins of the ITEM interface: bodies of small files are packed into formatted blocks, whereas bodies of large files are stored as a number of unformatted blocks (extents). Another example is compressed and non-compressed files, which are managed by different plugins of the FILE interface. Heterogeneity in time means the ability to switch an object to a different managing plugin (for example, to one that is preferable at a given moment). Heterogeneity in space means the ability to assign different plugins to different components of the same compound object (e.g. a logical volume). In addition, the modular design makes it safe to add features whose emergence is driven by the continuous development of storage hardware.
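The heterogeneity described above (one interface, different modules per object) can be sketched in Python. This is a toy model only: the plugin names, the size threshold and the method names are invented for illustration and are not actual Reiser4 identifiers.

```python
# Toy model of ITEM-interface heterogeneity: small file bodies are packed
# into formatted nodes, large file bodies are stored as extents of
# unformatted blocks. All names and the threshold are illustrative.

class FormattedItemPlugin:
    """Packs small file bodies into formatted tree nodes."""
    name = "formatted-item"

    def store(self, data: bytes) -> str:
        return f"{self.name}: packed {len(data)} bytes into a formatted node"


class ExtentItemPlugin:
    """Stores large file bodies as extents of unformatted blocks."""
    name = "extent-item"
    BLOCK = 4096

    def store(self, data: bytes) -> str:
        blocks = -(-len(data) // self.BLOCK)  # ceiling division
        return f"{self.name}: mapped {blocks} unformatted block(s)"


def pick_item_plugin(size: int, threshold: int = 4096):
    """One interface, different modules chosen per object (heterogeneity)."""
    return FormattedItemPlugin() if size < threshold else ExtentItemPlugin()


small = pick_item_plugin(100).store(b"x" * 100)
large = pick_item_plugin(100_000).store(b"x" * 100_000)
```

Dispatching on object properties at this single decision point is what lets new plugins of the same interface be added without touching their clients.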
= Atomicity of operations =

Atomicity means that file system operations either occur entirely or not at all, and nothing is corrupted by an operation half-occurring. All operations in Reiser4 except long writes are atomic. In the case of long writes, Reiser4 is forced to close transactions (to free dirty pages in response to memory-pressure notifications) and reopen them for the rest of the user's data, so long writes are split into a number of atomic writes. The maximal length of an atomic write depends on the file plugin. Edward Shishkin suggested a design for full atomicity (atomic writes of any length) in the write-anywhere transaction model, where an atom can be flushed without closing the transaction.

= Different transaction models =

Reiser4 offers different transaction models, so at mount time the user can choose the one best suited to the storage media and workload. The journaling model is recommended for HDDs, as it does not lead to the avalanche-like external fragmentation that degrades performance on rotating media. The write-anywhere model is recommended for SSDs, which are not sensitive to external fragmentation; in this model the number of I/O requests issued by the file system is minimal (blocks are not written to a journal and then overwritten in place), which also matters for SSDs. In addition, Reiser4 offers a unique "hybrid" transaction model, which provides a strong invariant: a parent-first order on the storage tree nodes in terms of disk addresses. This model is recommended for HDD users who do not perform a huge number of random overwrites. In the hybrid model, part of an atom's dirty pages (the overwrite set) is committed via the journal, while the rest (the relocate set) is written to a different location on disk. All other mainstream file systems hard-code a single transaction model.
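The hybrid model's split of an atom's dirty pages into an overwrite set and a relocate set, described above, can be sketched as follows. The policy shown (pages that never had an on-disk location are relocated, previously placed pages are journaled in place) is a simplifying assumption for illustration, not the actual Reiser4 decision logic.

```python
# Toy model of the hybrid transaction model's commit-time split of an
# atom's dirty pages. The partitioning rule is an illustrative assumption.

from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class DirtyPage:
    block: int                 # logical block number
    disk_addr: Optional[int]   # current on-disk address, None if never placed


def partition_atom(pages: List[DirtyPage]) -> Tuple[List[DirtyPage], List[DirtyPage]]:
    overwrite, relocate = [], []
    for p in pages:
        if p.disk_addr is None:
            relocate.append(p)   # no old location: must be written to a new one
        else:
            overwrite.append(p)  # committed via the journal, then overwritten in place
    return overwrite, relocate


atom = [DirtyPage(1, 100), DirtyPage(2, None), DirtyPage(3, 101), DirtyPage(4, None)]
ow, rel = partition_atom(atom)
```

The point of the two sets is that journaling cost is paid only for the overwrite set, while the relocate set gets fresh, allocator-chosen locations that preserve the parent-first ordering.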
Those hard-coded models are either journaling only (ext4, XFS, JFS, etc.) or write-anywhere only, also known as copy-on-write (ZFS, Btrfs, etc.).

= Three-level block allocator =

A tight relationship between block allocation and transaction models was discovered at the Namesys labs and implemented in Reiser4 as a three-level block allocator. The first (lowest) level implements a map of free space (currently Reiser4 supports only a bitmap, but it could also be implemented as a tree of extents). The second level implements the transaction model: at this level the block allocator decides whether a dirty page will be written back to its old place or get a new location on disk. The third (highest) level implements allocation policies in two contexts (forward and reverse) for the whole locality of a specified node.

= Off-line file system check =

Any corrupted Reiser4 volume can be repaired off-line by a special user-space utility, fsck.reiser4, which is part of reiser4progs. Fsck.reiser4 performs three passes. On the first pass it checks the integrity of the basic data structure (the tree). On the second pass it scans the twig level and checks the integrity of the extent regions (contiguous regions on disk where the bodies of large files are stored). On the third (semantic) pass it scans the leaf level and checks the integrity of "semantic" objects (directories, regular files, symlinks, etc.). Fsck.reiser4 absorbed the development experience of its predecessor, reiserfsck; in particular, it is free of a shortcoming inherent to reiserfsck, whose rebuilding process gets confused by ReiserFS (v3) images stored in the volume being repaired.

= Precise Asynchronous Discard support for SSD drives =

In contrast with other file systems, Reiser4 does not simply inform the block layer about extents being freed on disk. For every such extent, Reiser4 checks in the map of disk free space whether the head and tail of the corresponding erase units are free.
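The head-and-tail check described above can be sketched as follows. The erase-unit size and the set-based free-space map are assumptions made for this illustration, not Reiser4's actual data structures.

```python
# Sketch of "precise discard": before discarding a freed extent, check
# whether the blocks padding it out to erase-unit boundaries are free too,
# and if so, discard the whole aligned region. ERASE_UNIT and the set-based
# free-space map are illustrative assumptions.

ERASE_UNIT = 8  # erase-unit size in blocks (illustrative value)


def aligned_discard(start, length, free):
    """free: set of free block numbers (the free-space map).
    Returns the (start, length) region actually submitted for discard."""
    head = (start // ERASE_UNIT) * ERASE_UNIT               # round start down
    tail = -(-(start + length) // ERASE_UNIT) * ERASE_UNIT  # round end up
    padding = set(range(head, start)) | set(range(start + length, tail))
    if padding <= free:
        return head, tail - head  # whole erase units can be discarded
    return start, length          # fall back to the freed extent itself


free_map = set(range(0, 32))          # every block in [0, 32) is free
region = aligned_discard(3, 10, free_map)
```

Here freeing blocks [3, 13) with a fully free neighborhood yields a discard of the two whole erase units [0, 16), whereas a busy neighborhood would fall back to the extent itself.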
If so, Reiser4 issues a discard request for the larger, erase-unit-aligned region. This policy prevents the accumulation of non-discarded "garbage" on disk, so there is no need to periodically run tools like fstrim, which scan the disk and issue discard requests for such garbage. In Reiser4, issuing discard requests is a delayed action performed on a per-atom basis at transaction commit time, which reduces the number of discard commands (extents that need to be discarded can be merged).

= Metadata checksums =

A special format of the storage tree nodes allows Reiser4 to support [https://reiser4.wiki.kernel.org/index.php/Reiser4_checksums checksums], which protect your metadata from corruption.

= Software Framework, Development model and compatibility =

The Reiser4 codebase is organized as a set of interacting internal modules called plugins. At run time, every calculation in a Reiser4 context is the execution of some "supplier" plugin called by a "customer" plugin, with parameters passed in accordance with the supplier's interface. Thus every call stack is a path in a special directed acyclic graph of interfaces, which is the "abstraction base" of the Reiser4 codebase. Reiser4 development follows a [https://reiser4.wiki.kernel.org/index.php/Reiser4_development_model model], a set of formal actions aimed at keeping all releases backward compatible.

[[category:Reiser4]]
Reiser4 absorbed results of academic researches in the area of data storage, which had been conducted since 1992 by engineers of Namesys labs in collaboration with Moscow State University and Program Systems Institute of the Russian Academy of Sciences. For historical reasons Reiser4 currently works only for Linux OS. However, it can be easily ported to any operating system due to its modular infrastructure. = History of Reiser4 = Namesys was created by Hans Reiser approximately in 1993 from a number of last graduates trained in the format of the old education system of Soviet Union (USSR). At the beginning of the new century Namesys engineers had accumulated a number of innovative ideas in the area of data storage software systems. However, it was rather problematic to implement them in the context of existing at that moment ReiserFS (v3), mostly because of design problems. On the other hand, ReiserFS (v3) had a number of shortcomings, which were hard to fix for the same reasons. So, it was made a decision to develop from scratch a new file system, which was supposed to absorb the experience of previous developments. In 2002 Namesys got a grant from DARPA for this. Reiser4 development was also sponsored by Linspire. However, in commercial terms Namesys activity was not successful, and eventually this had led to financial problems. Since the arrest of Hans Reiser (in October 2006) Reiser4 has been maintained by a former Namesys employee, mathematician, programmer, Ph.D Edward Shishkin. Currently development continues on a non-commercial base. In this development mode Reiser4 acquired stability and a number of many new features like modules for transparent compression (announced in 2007), different transaction models (Journaling, Write-Anywhere (COW), Hybrid transaction model), precise asynchronous discard support for SSD drives, metadata and inline data checksums, failover, etc. A new Reiser4 disk format version (4.0.1) were released (*). 
= Reiser4 and upstream = In contrast with its predecessor (ReiserFS, v3), Reiser4 was not accepted to the upstream Linux kernel because of political reasons. Later Edward Shishkin expressed an interest (*) to port Reiser4 to other operating systems, specifically, to FreeBSD, which is, according to his standpoint, "more open to academic researches". In this case it would be illogical to expect Reiser4 to be tightly integrated with some particular operating system. Thus, Reiser4 is developed as a standalone independent project (*). The archive of ports for upstream Linux kernels can be found at the [[Reiser4_patchsets#Stable_patchsets|project sites]]. = Efficiency of disk space usage = Reiser4 provides the most efficient disk space usage among all file systems in all scenarios and workloads. In particular, on Compilebench ((c) Oracle) Reiser4 shows disk space efficiency 50 times (5000%) better than ext4, and 12 times (1200%) better than Btrfs with compression. The problem of internal fragmentation in Reiser4 is completely resolved by using a special technique of liquid records (or virtual keys). It means that for any fraction Q < 1 every keyed record in the tree can be split into 2 parts in the proportion Q with a possibility to quickly allocate unique keys for both parts. Such split (as well as merge) is performed by plugins of ITEM interface when packing data and metadata into tree nodes at flush time (just before writing to disk). = Reiser4 structure. Plugins. Heterogeneity in time and in space = File system as a complicated subsystem of modern OS is the most subjected to the problem of creeping featurism caused by the progress in hardware and software technologies. To resist this problem Namesys engineers made a decision to develop not simply a file system, but the whole software framework providing reusable environment. Reiser4 has two different code bases for kernel module and user-space utilities. All ones have modular infrastructure. 
It means that every calculation in the context of the file system (or user-space utility) looks like execution of a module of some interface. Resier4 kernel module has an abstract base, which is a direct acyclic graph (DAG) of interfaces. Every vertex of that graph represents an interface, and every directed edge of that graph represents a client-supplier relationship between interfaces. All top interfaces of that graph are suppliers for VFS. Every interface of that graph is implemented by one or more modules. At file system creation time user can specify those modules depending on the types of workload and storage media. For some interfaces there is a possibility to switch to another module at any moment (usually such switching is accompanied with a respective conversion of run-time objects). Finally, for some interfaces reiser4 performs such switching in intelligent manner (without user intervention). Thus, file system is in permanent evolution, adapting to current conditions. For historical reasons Reiser4 modules are called plugins. Heterogeneity means an ability to choose different modules of the same interface to manage different objects of the same type. For example, bodies of small and large files in Reiser4 are managed by different plugins of ITEM interface (bodies of small files are packed to formatted blocks, whereas bodies of large files are stored as a number of unformatted blocks (extents)). Another example is compressed and non-compressed files, which are managed by different plugins of FILE interface. Heterogeneity in time means an ability to switch to different managing plugin for an object (thus, we can switch to a plugin, which is more preferable at some moment). Heterogeneity in space means an ability to assign different plugins to different components of the same compound object (e.g. logical volume). 
In addition, modular design allows to safely add various features, whose emergence is caused by continuous development of hardware storage technologies. = Atomicity of operations = Atomicity means that filesystem operations either entirely occur, or they entirely don't, and they don't corrupt due to half occurring. All operations in Reiser4 except long writes are atomic. In the case of long writes Reiser4 is forced to close transactions to free dirty pages in a response to memory pressure notifications and reopen them for the rest of user's data. So, long writes in Reiser4 are split into a number of atomic writes. Maximal length of atomic write depends on the file plugin. Edward Shishkin suggested a design of full atomicity (atomic writes of any length) in Write-Anywhere transaction model, where atom can be flushed without closing a transaction. = Different transaction models = Reiser4 offers different transaction models, so at mount time user can choose a one, which is more suitable for his type of storage media and workload. Journaling transaction model is recommended for HDD devices, as this transaction model doesn't lead to avalanche-like external fragmentation which results in performance degradation on rotating media storage. Write-Anywhere transaction model is recommended for SSD devices, which are not critical to external fragmentation. In this transaction model number of IO requests issued by a file system is minimal (it doesn't write to journal with the following overwriting blocks on disk), which is also important for SSD devices. Also Reiser4 offers a unique "hybrid" transaction model, which provides a strong invariant - a parent-first order on the storage tree nodes in term of disk addresses. This transaction model is recommended for HDD users, who don't perform a huge number of random overwrites. 
In this transaction model a part of atom's dirty pages (overwrite set) is committed via journal, and another part (relocate set) is written to different location on disk. All other file systems offers only one hardcoded transaction model. This is either only journalling (ext4, xfs, jfs, etc), or only write-anywhere, AKA "copy-on-write" (ZFS, Btrfs, etc). = Three-level block allocator = The first (lowest) level implements a map of free space (currently reiser4 supports only bitmap, but it also can be implemented as a tree of extents). The second level implements allocation policies in 2 contexts (forward and reverse) for the whole locality of specified node. Tight relationship between block allocation and transaction models was revealed at Namesys labs and implemented in Reiser4. The second level implements a transaction model. On this level block allocator decides, if dirty page will be written to the old place, or it will get a new location on disk. The third (highest) level implements allocation policies in 2 contexts (forward and reverse) for the whole locality of specified node. = Off-line file system check = Any corrupted Reiser4 volume can be repaired off-line by a special user-space utility fsck.reiser4, which is a part of reiser4progs. Fsck.reiser4 performs 3 passes. At the first pass it checks integrity of the basic data structure (tree). At the second pass fsck scans twig level and checks integrity of the extent regions (contiguous regions on disk, where bodies of large files are stored). At the third (semantic) pass, fsck scans leaf level and checks integrity of "semantic" objects (directories, regular files, symlinks, etc). Fsck.reiser4 absorbed the development experience of its predecessor reiserfsck. In particular, fsck.reiser4 is free from a shortcoming inherent to reiserfsck, whose rebuilding process gets confused by ReiserFS (v3) images stored in the volume being repaired. 
= Precise Asynchronous Discard support for SSD drives = In contrast with other file systems, Reiser4 not simply informs the block layer about extents being freed on disk. For every such extent Reiser4 checks if head and tail of respective erase units are free in the map of disk free space. If so, Reiser4 issues discard extents for larger regions. Such policy doesn't lead to accumulation of "not discarded garbage" on disk and, hence, there is no need to run periodically tools like fstrim, which scan disk and issue discard requests for such "garbage". In Reiser4 issuing discard requests is a delayed action, which is performed on per-atom basis at transaction commit time. It allows to reduce number of discard commands (because of merging of extents, which need to be discarded). = Metadata and inline-data checksums (unstable stuff) = = Software Framework, Development model and compatibility = [[category:Reiser4]] a022001b70e1f50f2cdcd12b78bea77ea9fab38e 4313 4311 2019-04-15T20:54:04Z Chris goe 2 link to the project sites, so we don't have the same URLs all over the place = Summary = Reiser4 is not only a file system. It is a software framework for creation, assembly and customizing file systems managing local (simple or logical) storage volumes of the Operating System. Reiser4 is a successor of ReiserFS (which is also known as ReiserFS of version 3). Reiser4 absorbed results of academic researches in the area of data storage, which had been conducted since 1992 by engineers of Namesys labs in collaboration with Moscow State University and Program Systems Institute of the Russian Academy of Sciences. For historical reasons Reiser4 currently works only for Linux OS. However, it can be easily ported to any operating system due to its modular infrastructure. = History of Reiser4 = Namesys was created by Hans Reiser approximately in 1993 from a number of last graduates trained in the format of the old education system of Soviet Union (USSR). 
At the beginning of the new century Namesys engineers had accumulated a number of innovative ideas in the area of data storage software systems. However, it was rather problematic to implement them in the context of existing at that moment ReiserFS (v3), mostly because of design problems. On the other hand, ReiserFS (v3) had a number of shortcomings, which were hard to fix for the same reasons. So, it was made a decision to develop from scratch a new file system, which was supposed to absorb the experience of previous developments. In 2002 Namesys got a grant from DARPA for this. Reiser4 development was also sponsored by Linspire. However, in commercial terms Namesys activity was not successful, and eventually this had led to financial problems. Since the arrest of Hans Reiser (in October 2006) Reiser4 has been maintained by a former Namesys employee, mathematician, programmer, Ph.D Edward Shishkin. Currently development continues on a non-commercial base. In this development mode Reiser4 acquired stability and a number of many new features like modules for transparent compression (announced in 2007), different transaction models (Journaling, Write-Anywhere (COW), Hybrid transaction model), precise asynchronous discard support for SSD drives, metadata and inline data checksums, failover, etc. A new Reiser4 disk format version (4.0.1) were released (*). = Reiser4 and upstream = In contrast with its predecessor (ReiserFS, v3), Reiser4 was not accepted to the upstream Linux kernel because of political reasons. Later Edward Shishkin expressed an interest (*) to port Reiser4 to other operating systems, specifically, to FreeBSD, which is, according to his standpoint, "more open to academic researches". In this case it would be illogical to expect Reiser4 to be tightly integrated with some particular operating system. Thus, Reiser4 is developed as a standalone independent project (*). 
The archive of ports for upstream Linux kernels can be found at the [[Reiser4_patchsets#Stable_patchsets|project sites]]. = Efficiency of disk space usage = Reiser4 provides the most efficient disk space usage among all file systems in all scenarios and workloads. In particular, on Compilebench ((c) Oracle) Reiser4 shows disk space efficiency 50 times (5000%) better than ext4, and 12 times (1200%) better than Btrfs with compression. The problem of internal fragmentation in Reiser4 is completely resolved by using a special technique of liquid records (or virtual keys). It means that for any fraction Q < 1 every keyed record in the tree can be split into 2 parts in the proportion Q with a possibility to quickly allocate unique keys for both parts. Such split (as well as merge) is performed by plugins of ITEM interface when packing data and metadata into tree nodes at flush time (just before writing to disk). = Reiser4 structure. Plugins. Heterogeneity in time and in space = File system as a complicated subsystem of modern OS is the most subjected to the problem of creeping featurism caused by the progress in hardware and software technologies. To resist this problem Namesys engineers made a decision to develop not simply a file system, but the whole software framework providing reusable environment. Reiser4 has two different code bases for kernel module and user-space utilities. All ones have modular infrastructure. It means that every calculation in the context of the file system (or user-space utility) looks like execution of a module of some interface. Resier4 kernel module has an abstract base, which is a direct acyclic graph (DAG) of interfaces. Every vertex of that graph represents an interface, and every directed edge of that graph represents a client-supplier relationship between interfaces. All top interfaces of that graph are suppliers for VFS. Every interface of that graph is implemented by one or more modules. 
At file system creation time user can specify those modules depending on the types of workload and storage media. For some interfaces there is a possibility to switch to another module at any moment (usually such switching is accompanied with a respective conversion of run-time objects). Finally, for some interfaces reiser4 performs such switching in intelligent manner (without user intervention). Thus, file system is in permanent evolution, adapting to current conditions. For historical reasons Reiser4 modules are called plugins. Heterogeneity means an ability to choose different modules of the same interface to manage different objects of the same type. For example, bodies of small and large files in Reiser4 are managed by different plugins of ITEM interface (bodies of small files are packed to formatted blocks, whereas bodies of large files are stored as a number of unformatted blocks (extents)). Another example is compressed and non-compressed files, which are managed by different plugins of FILE interface. Heterogeneity in time means an ability to switch to different managing plugin for an object (thus, we can switch to a plugin, which is more preferable at some moment). Heterogeneity in space means an ability to assign different plugins to different components of the same compound object (e.g. logical volume). In addition, modular design allows to safely add various features, whose emergence is caused by continuous development of hardware storage technologies. = Atomicity of operations = Atomicity means that filesystem operations either entirely occur, or they entirely don't, and they don't corrupt due to half occurring. All operations in Reiser4 except long writes are atomic. In the case of long writes Reiser4 is forced to close transactions to free dirty pages in a response to memory pressure notifications and reopen them for the rest of user's data. So, long writes in Reiser4 are split into a number of atomic writes. 
The maximal length of an atomic write depends on the file plugin. Edward Shishkin suggested a design for full atomicity (atomic writes of any length) in the Write-Anywhere transaction model, where an atom can be flushed without closing a transaction.

= Different transaction models =

Reiser4 offers different transaction models, so at mount time the user can choose the one most suitable for his type of storage media and workload.

The Journaling transaction model is recommended for HDDs, as it doesn't lead to the avalanche-like external fragmentation that degrades performance on rotating media.

The Write-Anywhere transaction model is recommended for SSDs, which are not sensitive to external fragmentation. In this model the number of IO requests issued by the file system is minimal (it doesn't write to a journal and then overwrite blocks on disk), which is also important for SSDs.

Reiser4 also offers a unique "hybrid" transaction model, which provides a strong invariant: a parent-first order on the storage tree nodes in terms of disk addresses. This model is recommended for HDD users who don't perform a huge number of random overwrites. Here a part of an atom's dirty pages (the overwrite set) is committed via the journal, and another part (the relocate set) is written to a different location on disk.

All other file systems offer only one hardcoded transaction model: either only journalling (ext4, XFS, JFS, etc.) or only write-anywhere, AKA "copy-on-write" (ZFS, Btrfs, etc.).

= Three-level block allocator =

The first (lowest) level implements a map of free space (currently Reiser4 supports only a bitmap, but it could also be implemented as a tree of extents).
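The lowest level is easy to sketch. A toy bitmap free-space map in C (sizes and first-fit search are illustrative; the real allocator is more involved, and the tree-of-extents alternative mentioned above would replace exactly this small API):

```c
#include <stdint.h>
#include <stddef.h>

/* Toy free-space map: bit i set means block i is in use. */
#define NBLOCKS 1024u
static uint8_t bitmap[NBLOCKS / 8];

static int block_used(size_t i)
{
    return (bitmap[i / 8] >> (i % 8)) & 1;
}

static void block_set(size_t i)
{
    bitmap[i / 8] |= (uint8_t)(1u << (i % 8));
}

/* First-fit search from `start`; the higher allocator levels would
 * bias the start point to implement forward/reverse policies. */
static long find_free_block(size_t start)
{
    for (size_t i = start; i < NBLOCKS; i++)
        if (!block_used(i))
            return (long)i;
    return -1;
}
```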
The second level implements a transaction model: at this level the block allocator decides whether a dirty page will be written back to its old place or get a new location on disk. This tight relationship between block allocation and transaction models was revealed at the Namesys labs and implemented in Reiser4.

The third (highest) level implements allocation policies in two contexts (forward and reverse) for the whole locality of a specified node.

= Off-line file system check =

Any corrupted Reiser4 volume can be repaired off-line by a special user-space utility, fsck.reiser4, which is part of reiser4progs. Fsck.reiser4 performs three passes. The first pass checks the integrity of the basic data structure (the tree). The second pass scans the twig level and checks the integrity of the extent regions (contiguous regions on disk where the bodies of large files are stored). The third (semantic) pass scans the leaf level and checks the integrity of "semantic" objects (directories, regular files, symlinks, etc.).

Fsck.reiser4 absorbed the development experience of its predecessor, reiserfsck. In particular, fsck.reiser4 is free from a shortcoming inherent to reiserfsck, whose rebuilding process gets confused by ReiserFS (v3) images stored in the volume being repaired.

= Precise Asynchronous Discard support for SSD drives =

In contrast with other file systems, Reiser4 doesn't simply inform the block layer about extents being freed on disk. For every such extent, Reiser4 checks in the map of free disk space whether the head and tail of the respective erase units are free; if so, it issues discard requests for the larger, aligned regions. This policy doesn't lead to the accumulation of non-discarded garbage on disk, so there is no need to periodically run tools like fstrim, which scan the disk and issue discard requests for such garbage. In Reiser4, issuing discard requests is a delayed action, performed on a per-atom basis at transaction commit time.
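The erase-unit widening described above can be sketched as follows. This is not the Reiser4 code: the erase-unit size, the toy free-space map, and the "skip on failure" behavior are all simplifications of this sketch (a real implementation would fall back to discarding the aligned interior rather than skipping):

```c
#include <stdint.h>

#define ERASE_UNIT 8u            /* blocks per erase unit; illustrative */

static int freemap[64];          /* toy free-space map: 1 = block is free */
static int is_free(uint64_t b) { return freemap[b]; }

/* Widen the freed extent [start, start+len) to whole erase units when
 * the head and tail padding blocks are also free.  Returns the widened
 * length in blocks and stores its start, or returns 0 (no discard). */
static uint64_t widen_discard(uint64_t start, uint64_t len, uint64_t *out_start)
{
    uint64_t head = start - start % ERASE_UNIT;                       /* round down */
    uint64_t end  = start + len;
    uint64_t tail = (end + ERASE_UNIT - 1) / ERASE_UNIT * ERASE_UNIT; /* round up */

    for (uint64_t b = head; b < start; b++)   /* head padding must be free */
        if (!is_free(b))
            return 0;
    for (uint64_t b = end; b < tail; b++)     /* tail padding must be free */
        if (!is_free(b))
            return 0;

    *out_start = head;
    return tail - head;
}
```

For example, with 8-block erase units, freeing blocks 3–4 while blocks 0–2 and 5–7 are free lets the file system discard the whole unit 0–7 at commit time.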
It allows to reduce number of discard commands (because of merging of extents, which need to be discarded). = Metadata and inline-data checksums (not stable stuff) = = Software Framework, Development model and compatibility = [[category:Reiser4]] bcf7121c59d3bfe322ce84f4531e53b31bd654c8 4311 4307 2019-04-13T16:17:58Z Edward 4 /* Reiser4 and upstream */ - add links = Summary = Reiser4 is not only a file system. It is a software framework for creation, assembly and customizing file systems managing local (simple or logical) storage volumes of the Operating System. Reiser4 is a successor of ReiserFS (which is also known as ReiserFS of version 3). Reiser4 absorbed results of academic researches in the area of data storage, which had been conducted since 1992 by engineers of Namesys labs in collaboration with Moscow State University and Program Systems Institute of the Russian Academy of Sciences. For historical reasons Reiser4 currently works only for Linux OS. However, it can be easily ported to any operating system due to its modular infrastructure. = History of Reiser4 = Namesys was created by Hans Reiser approximately in 1993 from a number of last graduates trained in the format of the old education system of Soviet Union (USSR). At the beginning of the new century Namesys engineers had accumulated a number of innovative ideas in the area of data storage software systems. However, it was rather problematic to implement them in the context of existing at that moment ReiserFS (v3), mostly because of design problems. On the other hand, ReiserFS (v3) had a number of shortcomings, which were hard to fix for the same reasons. So, it was made a decision to develop from scratch a new file system, which was supposed to absorb the experience of previous developments. In 2002 Namesys got a grant from DARPA for this. Reiser4 development was also sponsored by Linspire. 
However, in commercial terms Namesys activity was not successful, and eventually this had led to financial problems. Since the arrest of Hans Reiser (in October 2006) Reiser4 has been maintained by a former Namesys employee, mathematician, programmer, Ph.D Edward Shishkin. Currently development continues on a non-commercial base. In this development mode Reiser4 acquired stability and a number of many new features like modules for transparent compression (announced in 2007), different transaction models (Journaling, Write-Anywhere (COW), Hybrid transaction model), precise asynchronous discard support for SSD drives, metadata and inline data checksums, failover, etc. A new Reiser4 disk format version (4.0.1) were released (*). = Reiser4 and upstream = In contrast with its predecessor (ReiserFS, v3), Reiser4 was not accepted to the upstream Linux kernel because of political reasons. Later Edward Shishkin expressed an interest (*) to port Reiser4 to other operating systems, specifically, to FreeBSD, which is, according to his standpoint, "more open to academic researches". In this case it would be illogical to expect Reiser4 to be tightly integrated with some particular operating system. Thus, Reiser4 is developed as a standalone independent project (*). The archive of ports for upstream Linux kernels can be found at the project's sites on [https://github.com/edward6/reiser4/ GitHub] and [https://sourceforge.net/projects/reiser4/files/reiser4-for-linux-5.x/ Sourceforge]. = Efficiency of disk space usage = Reiser4 provides the most efficient disk space usage among all file systems in all scenarios and workloads. In particular, on Compilebench ((c) Oracle) Reiser4 shows disk space efficiency 50 times (5000%) better than ext4, and 12 times (1200%) better than Btrfs with compression. The problem of internal fragmentation in Reiser4 is completely resolved by using a special technique of liquid records (or virtual keys). 
It means that for any fraction Q < 1 every keyed record in the tree can be split into 2 parts in the proportion Q with a possibility to quickly allocate unique keys for both parts. Such split (as well as merge) is performed by plugins of ITEM interface when packing data and metadata into tree nodes at flush time (just before writing to disk). = Reiser4 structure. Plugins. Heterogeneity in time and in space = File system as a complicated subsystem of modern OS is the most subjected to the problem of creeping featurism caused by the progress in hardware and software technologies. To resist this problem Namesys engineers made a decision to develop not simply a file system, but the whole software framework providing reusable environment. Reiser4 has two different code bases for kernel module and user-space utilities. All ones have modular infrastructure. It means that every calculation in the context of the file system (or user-space utility) looks like execution of a module of some interface. Resier4 kernel module has an abstract base, which is a direct acyclic graph (DAG) of interfaces. Every vertex of that graph represents an interface, and every directed edge of that graph represents a client-supplier relationship between interfaces. All top interfaces of that graph are suppliers for VFS. Every interface of that graph is implemented by one or more modules. At file system creation time user can specify those modules depending on the types of workload and storage media. For some interfaces there is a possibility to switch to another module at any moment (usually such switching is accompanied with a respective conversion of run-time objects). Finally, for some interfaces reiser4 performs such switching in intelligent manner (without user intervention). Thus, file system is in permanent evolution, adapting to current conditions. For historical reasons Reiser4 modules are called plugins. 
Heterogeneity means an ability to choose different modules of the same interface to manage different objects of the same type. For example, bodies of small and large files in Reiser4 are managed by different plugins of ITEM interface (bodies of small files are packed to formatted blocks, whereas bodies of large files are stored as a number of unformatted blocks (extents)). Another example is compressed and non-compressed files, which are managed by different plugins of FILE interface. Heterogeneity in time means an ability to switch to different managing plugin for an object (thus, we can switch to a plugin, which is more preferable at some moment). Heterogeneity in space means an ability to assign different plugins to different components of the same compound object (e.g. logical volume). In addition, modular design allows to safely add various features, whose emergence is caused by continuous development of hardware storage technologies. = Atomicity of operations = Atomicity means that filesystem operations either entirely occur, or they entirely don't, and they don't corrupt due to half occurring. All operations in Reiser4 except long writes are atomic. In the case of long writes Reiser4 is forced to close transactions to free dirty pages in a response to memory pressure notifications and reopen them for the rest of user's data. So, long writes in Reiser4 are split into a number of atomic writes. Maximal length of atomic write depends on the file plugin. Edward Shishkin suggested a design of full atomicity (atomic writes of any length) in Write-Anywhere transaction model, where atom can be flushed without closing a transaction. = Different transaction models = Reiser4 offers different transaction models, so at mount time user can choose a one, which is more suitable for his type of storage media and workload. 
Journaling transaction model is recommended for HDD devices, as this transaction model doesn't lead to avalanche-like external fragmentation which results in performance degradation on rotating media storage. Write-Anywhere transaction model is recommended for SSD devices, which are not critical to external fragmentation. In this transaction model number of IO requests issued by a file system is minimal (it doesn't write to journal with the following overwriting blocks on disk), which is also important for SSD devices. Also Reiser4 offers a unique "hybrid" transaction model, which provides a strong invariant - a parent-first order on the storage tree nodes in term of disk addresses. This transaction model is recommended for HDD users, who don't perform a huge number of random overwrites. In this transaction model a part of atom's dirty pages (overwrite set) is committed via journal, and another part (relocate set) is written to different location on disk. All other file systems offers only one hardcoded transaction model. This is either only journalling (ext4, xfs, jfs, etc), or only write-anywhere, AKA "copy-on-write" (ZFS, Btrfs, etc). = Three-level block allocator = The first (lowest) level implements a map of free space (currently reiser4 supports only bitmap, but it also can be implemented as a tree of extents). The second level implements allocation policies in 2 contexts (forward and reverse) for the whole locality of specified node. Tight relationship between block allocation and transaction models was revealed at Namesys labs and implemented in Reiser4. The second level implements a transaction model. On this level block allocator decides, if dirty page will be written to the old place, or it will get a new location on disk. The third (highest) level implements allocation policies in 2 contexts (forward and reverse) for the whole locality of specified node. 
= Off-line file system check = Any corrupted Reiser4 volume can be repaired off-line by a special user-space utility fsck.reiser4, which is a part of reiser4progs. Fsck.reiser4 performs 3 passes. At the first pass it checks integrity of the basic data structure (tree). At the second pass fsck scans twig level and checks integrity of the extent regions (contiguous regions on disk, where bodies of large files are stored). At the third (semantic) pass, fsck scans leaf level and checks integrity of "semantic" objects (directories, regular files, symlinks, etc). Fsck.reiser4 absorbed the development experience of its predecessor reiserfsck. In particular, fsck.reiser4 is free from a shortcoming inherent to reiserfsck, whose rebuilding process gets confused by ReiserFS (v3) images stored in the volume being repaired. = Precise Asynchronous Discard support for SSD drives = In contrast with other file systems, Reiser4 not simply informs the block layer about extents being freed on disk. For every such extent Reiser4 checks if head and tail of respective erase units are free in the map of disk free space. If so, Reiser4 issues discard extents for larger regions. Such policy doesn't lead to accumulation of "not discarded garbage" on disk and, hence, there is no need to run periodically tools like fstrim, which scan disk and issue discard requests for such "garbage". In Reiser4 issuing discard requests is a delayed action, which is performed on per-atom basis at transaction commit time. It allows to reduce number of discard commands (because of merging of extents, which need to be discarded). = Metadata and inline-data checksums (not stable stuff) = = Software Framework, Development model and compatibility = [[category:Reiser4]] 029588543d504ac4684bbdd987c3c613d747c38c 4307 4283 2019-04-13T15:59:56Z Edward 4 /* Summary */ = Summary = Reiser4 is not only a file system. 
It is a software framework for creation, assembly and customizing file systems managing local (simple or logical) storage volumes of the Operating System. Reiser4 is a successor of ReiserFS (which is also known as ReiserFS of version 3). Reiser4 absorbed results of academic researches in the area of data storage, which had been conducted since 1992 by engineers of Namesys labs in collaboration with Moscow State University and Program Systems Institute of the Russian Academy of Sciences. For historical reasons Reiser4 currently works only for Linux OS. However, it can be easily ported to any operating system due to its modular infrastructure. = History of Reiser4 = Namesys was created by Hans Reiser approximately in 1993 from a number of last graduates trained in the format of the old education system of Soviet Union (USSR). At the beginning of the new century Namesys engineers had accumulated a number of innovative ideas in the area of data storage software systems. However, it was rather problematic to implement them in the context of existing at that moment ReiserFS (v3), mostly because of design problems. On the other hand, ReiserFS (v3) had a number of shortcomings, which were hard to fix for the same reasons. So, it was made a decision to develop from scratch a new file system, which was supposed to absorb the experience of previous developments. In 2002 Namesys got a grant from DARPA for this. Reiser4 development was also sponsored by Linspire. However, in commercial terms Namesys activity was not successful, and eventually this had led to financial problems. Since the arrest of Hans Reiser (in October 2006) Reiser4 has been maintained by a former Namesys employee, mathematician, programmer, Ph.D Edward Shishkin. Currently development continues on a non-commercial base. 
In this development mode Reiser4 acquired stability and a number of many new features like modules for transparent compression (announced in 2007), different transaction models (Journaling, Write-Anywhere (COW), Hybrid transaction model), precise asynchronous discard support for SSD drives, metadata and inline data checksums, failover, etc. A new Reiser4 disk format version (4.0.1) were released (*). = Reiser4 and upstream = In contrast with its predecessor (ReiserFS, v3), Reiser4 was not accepted to the upstream Linux kernel because of political reasons. Later Edward Shishkin expressed an interest (*) to port Reiser4 to other operating systems, in particular, to FreeBSD, which is, according to his standpoint, "more open to academic researches". In this case it would be illogical to expect Reiser4 to be tightly integrated with some specified operating system. Thus, Reiser4 is developed as a standalone independent project (*). The archive of ports for upstream Linux kernels can be found at the project's sites on GitHub and Sourceforge (*). = Efficiency of disk space usage = Reiser4 provides the most efficient disk space usage among all file systems in all scenarios and workloads. In particular, on Compilebench ((c) Oracle) Reiser4 shows disk space efficiency 50 times (5000%) better than ext4, and 12 times (1200%) better than Btrfs with compression. The problem of internal fragmentation in Reiser4 is completely resolved by using a special technique of liquid records (or virtual keys). It means that for any fraction Q < 1 every keyed record in the tree can be split into 2 parts in the proportion Q with a possibility to quickly allocate unique keys for both parts. Such split (as well as merge) is performed by plugins of ITEM interface when packing data and metadata into tree nodes at flush time (just before writing to disk). = Reiser4 structure. Plugins. 
Heterogeneity in time and in space = File system as a complicated subsystem of modern OS is the most subjected to the problem of creeping featurism caused by the progress in hardware and software technologies. To resist this problem Namesys engineers made a decision to develop not simply a file system, but the whole software framework providing reusable environment. Reiser4 has two different code bases for kernel module and user-space utilities. All ones have modular infrastructure. It means that every calculation in the context of the file system (or user-space utility) looks like execution of a module of some interface. Resier4 kernel module has an abstract base, which is a direct acyclic graph (DAG) of interfaces. Every vertex of that graph represents an interface, and every directed edge of that graph represents a client-supplier relationship between interfaces. All top interfaces of that graph are suppliers for VFS. Every interface of that graph is implemented by one or more modules. At file system creation time user can specify those modules depending on the types of workload and storage media. For some interfaces there is a possibility to switch to another module at any moment (usually such switching is accompanied with a respective conversion of run-time objects). Finally, for some interfaces reiser4 performs such switching in intelligent manner (without user intervention). Thus, file system is in permanent evolution, adapting to current conditions. For historical reasons Reiser4 modules are called plugins. Heterogeneity means an ability to choose different modules of the same interface to manage different objects of the same type. For example, bodies of small and large files in Reiser4 are managed by different plugins of ITEM interface (bodies of small files are packed to formatted blocks, whereas bodies of large files are stored as a number of unformatted blocks (extents)). 
Another example is compressed and non-compressed files, which are managed by different plugins of FILE interface. Heterogeneity in time means an ability to switch to different managing plugin for an object (thus, we can switch to a plugin, which is more preferable at some moment). Heterogeneity in space means an ability to assign different plugins to different components of the same compound object (e.g. logical volume). In addition, modular design allows to safely add various features, whose emergence is caused by continuous development of hardware storage technologies. = Atomicity of operations = Atomicity means that filesystem operations either entirely occur, or they entirely don't, and they don't corrupt due to half occurring. All operations in Reiser4 except long writes are atomic. In the case of long writes Reiser4 is forced to close transactions to free dirty pages in a response to memory pressure notifications and reopen them for the rest of user's data. So, long writes in Reiser4 are split into a number of atomic writes. Maximal length of atomic write depends on the file plugin. Edward Shishkin suggested a design of full atomicity (atomic writes of any length) in Write-Anywhere transaction model, where atom can be flushed without closing a transaction. = Different transaction models = Reiser4 offers different transaction models, so at mount time user can choose a one, which is more suitable for his type of storage media and workload. Journaling transaction model is recommended for HDD devices, as this transaction model doesn't lead to avalanche-like external fragmentation which results in performance degradation on rotating media storage. Write-Anywhere transaction model is recommended for SSD devices, which are not critical to external fragmentation. In this transaction model number of IO requests issued by a file system is minimal (it doesn't write to journal with the following overwriting blocks on disk), which is also important for SSD devices. 
Also Reiser4 offers a unique "hybrid" transaction model, which provides a strong invariant - a parent-first order on the storage tree nodes in term of disk addresses. This transaction model is recommended for HDD users, who don't perform a huge number of random overwrites. In this transaction model a part of atom's dirty pages (overwrite set) is committed via journal, and another part (relocate set) is written to different location on disk. All other file systems offers only one hardcoded transaction model. This is either only journalling (ext4, xfs, jfs, etc), or only write-anywhere, AKA "copy-on-write" (ZFS, Btrfs, etc). = Three-level block allocator = The first (lowest) level implements a map of free space (currently reiser4 supports only bitmap, but it also can be implemented as a tree of extents). The second level implements allocation policies in 2 contexts (forward and reverse) for the whole locality of specified node. Tight relationship between block allocation and transaction models was revealed at Namesys labs and implemented in Reiser4. The second level implements a transaction model. On this level block allocator decides, if dirty page will be written to the old place, or it will get a new location on disk. The third (highest) level implements allocation policies in 2 contexts (forward and reverse) for the whole locality of specified node. = Off-line file system check = Any corrupted Reiser4 volume can be repaired off-line by a special user-space utility fsck.reiser4, which is a part of reiser4progs. Fsck.reiser4 performs 3 passes. At the first pass it checks integrity of the basic data structure (tree). At the second pass fsck scans twig level and checks integrity of the extent regions (contiguous regions on disk, where bodies of large files are stored). At the third (semantic) pass, fsck scans leaf level and checks integrity of "semantic" objects (directories, regular files, symlinks, etc). 
Fsck.reiser4 absorbed the development experience of its predecessor reiserfsck. In particular, fsck.reiser4 is free from a shortcoming inherent to reiserfsck, whose rebuilding process gets confused by ReiserFS (v3) images stored in the volume being repaired. = Precise Asynchronous Discard support for SSD drives = In contrast with other file systems, Reiser4 not simply informs the block layer about extents being freed on disk. For every such extent Reiser4 checks if head and tail of respective erase units are free in the map of disk free space. If so, Reiser4 issues discard extents for larger regions. Such policy doesn't lead to accumulation of "not discarded garbage" on disk and, hence, there is no need to run periodically tools like fstrim, which scan disk and issue discard requests for such "garbage". In Reiser4 issuing discard requests is a delayed action, which is performed on per-atom basis at transaction commit time. It allows to reduce number of discard commands (because of merging of extents, which need to be discarded). = Metadata and inline-data checksums (not stable stuff) = = Software Framework, Development model and compatibility = [[category:Reiser4]] 33b0335db6b8cd993130627d6a3261fdb5bf2ac0 4283 4277 2017-07-14T14:23:28Z Edward 4 Fixed typo = Summary = Reiser4 is a software framework for creation, assembly and customizing file systems managing local (simple or logical) storage volumes of the Operating System. Reiser4 is a successor of ReiserFS (which is also known as ReiserFS of version 3). Reiser4 absorbed results of academic researches in the area of data storage, which had been conducted since 1992 by engineers of Namesys labs in collaboration with Moscow State University and Program Systems Institute of the Russian Academy of Sciences. For historical reasons Reiser4 currently works only for Linux OS. However, it can be easily ported to any operating system due to its modular infrastructure. 
= History of Reiser4 = Namesys was created by Hans Reiser approximately in 1993 from a number of last graduates trained in the format of the old education system of Soviet Union (USSR). At the beginning of the new century Namesys engineers had accumulated a number of innovative ideas in the area of data storage software systems. However, it was rather problematic to implement them in the context of existing at that moment ReiserFS (v3), mostly because of design problems. On the other hand, ReiserFS (v3) had a number of shortcomings, which were hard to fix for the same reasons. So, it was made a decision to develop from scratch a new file system, which was supposed to absorb the experience of previous developments. In 2002 Namesys got a grant from DARPA for this. Reiser4 development was also sponsored by Linspire. However, in commercial terms Namesys activity was not successful, and eventually this had led to financial problems. Since the arrest of Hans Reiser (in October 2006) Reiser4 has been maintained by a former Namesys employee, mathematician, programmer, Ph.D Edward Shishkin. Currently development continues on a non-commercial base. In this development mode Reiser4 acquired stability and a number of many new features like modules for transparent compression (announced in 2007), different transaction models (Journaling, Write-Anywhere (COW), Hybrid transaction model), precise asynchronous discard support for SSD drives, metadata and inline data checksums, failover, etc. A new Reiser4 disk format version (4.0.1) were released (*). = Reiser4 and upstream = In contrast with its predecessor (ReiserFS, v3), Reiser4 was not accepted to the upstream Linux kernel because of political reasons. Later Edward Shishkin expressed an interest (*) to port Reiser4 to other operating systems, in particular, to FreeBSD, which is, according to his standpoint, "more open to academic researches". 
In this case it would be illogical to expect Reiser4 to be tightly integrated with some specified operating system. Thus, Reiser4 is developed as a standalone independent project (*). The archive of ports for upstream Linux kernels can be found at the project's sites on GitHub and Sourceforge (*). = Efficiency of disk space usage = Reiser4 provides the most efficient disk space usage among all file systems in all scenarios and workloads. In particular, on Compilebench ((c) Oracle) Reiser4 shows disk space efficiency 50 times (5000%) better than ext4, and 12 times (1200%) better than Btrfs with compression. The problem of internal fragmentation in Reiser4 is completely resolved by using a special technique of liquid records (or virtual keys). It means that for any fraction Q < 1 every keyed record in the tree can be split into 2 parts in the proportion Q with a possibility to quickly allocate unique keys for both parts. Such split (as well as merge) is performed by plugins of ITEM interface when packing data and metadata into tree nodes at flush time (just before writing to disk). = Reiser4 structure. Plugins. Heterogeneity in time and in space = File system as a complicated subsystem of modern OS is the most subjected to the problem of creeping featurism caused by the progress in hardware and software technologies. To resist this problem Namesys engineers made a decision to develop not simply a file system, but the whole software framework providing reusable environment. Reiser4 has two different code bases for kernel module and user-space utilities. All ones have modular infrastructure. It means that every calculation in the context of the file system (or user-space utility) looks like execution of a module of some interface. Resier4 kernel module has an abstract base, which is a direct acyclic graph (DAG) of interfaces. 
Every vertex of that graph represents an interface, and every directed edge of that graph represents a client-supplier relationship between interfaces. All top interfaces of that graph are suppliers for VFS. Every interface of that graph is implemented by one or more modules. At file system creation time user can specify those modules depending on the types of workload and storage media. For some interfaces there is a possibility to switch to another module at any moment (usually such switching is accompanied with a respective conversion of run-time objects). Finally, for some interfaces reiser4 performs such switching in intelligent manner (without user intervention). Thus, file system is in permanent evolution, adapting to current conditions. For historical reasons Reiser4 modules are called plugins. Heterogeneity means an ability to choose different modules of the same interface to manage different objects of the same type. For example, bodies of small and large files in Reiser4 are managed by different plugins of ITEM interface (bodies of small files are packed to formatted blocks, whereas bodies of large files are stored as a number of unformatted blocks (extents)). Another example is compressed and non-compressed files, which are managed by different plugins of FILE interface. Heterogeneity in time means an ability to switch to different managing plugin for an object (thus, we can switch to a plugin, which is more preferable at some moment). Heterogeneity in space means an ability to assign different plugins to different components of the same compound object (e.g. logical volume). In addition, modular design allows to safely add various features, whose emergence is caused by continuous development of hardware storage technologies. = Atomicity of operations = Atomicity means that filesystem operations either entirely occur, or they entirely don't, and they don't corrupt due to half occurring. All operations in Reiser4 except long writes are atomic. 
In the case of long writes Reiser4 is forced to close transactions (to free dirty pages in response to memory-pressure notifications) and reopen them for the rest of the user's data, so long writes in Reiser4 are split into a number of atomic writes. The maximal length of an atomic write depends on the file plugin. Edward Shishkin suggested a design for full atomicity (atomic writes of any length) in the Write-Anywhere transaction model, where an atom can be flushed without closing its transaction. = Different transaction models = Reiser4 offers different transaction models, so at mount time the user can choose the one most suitable for the type of storage media and workload. The Journaling transaction model is recommended for HDDs, as it doesn't lead to the avalanche-like external fragmentation that degrades performance on rotating storage media. The Write-Anywhere transaction model is recommended for SSDs, which are not sensitive to external fragmentation. In this model the number of I/O requests issued by the file system is minimal (blocks are not first written to a journal and then overwritten in place), which also matters for SSDs. Reiser4 also offers a unique "hybrid" transaction model, which provides a strong invariant: a parent-first order on the storage tree nodes in terms of disk addresses. This model is recommended for HDD users who don't perform a huge number of random overwrites. In it, a part of the atom's dirty pages (the overwrite set) is committed via the journal, and another part (the relocate set) is written to a different location on disk. All other file systems offer only one hardcoded transaction model: either only journaling (ext4, XFS, JFS, etc.) or only write-anywhere, a.k.a. "copy-on-write" (ZFS, Btrfs, etc.).
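The trade-off between the two basic commit strategies can be sketched in a toy model (assumed names and a dict-backed "disk"; not the actual Reiser4 code): journaling writes each dirty block twice but keeps it in place, while write-anywhere writes it once to a fresh location and remaps its address.

```python
# Toy comparison of the two commit strategies (illustrative, not Reiser4 code).

def commit_journaling(disk, journal, dirty):
    """Write dirty blocks to the journal first, then overwrite in place:
    two writes per block, but blocks keep their original location."""
    writes = 0
    for addr, data in dirty.items():
        journal.append((addr, data)); writes += 1   # journal write
        disk[addr] = data; writes += 1              # in-place overwrite
    return writes

def commit_write_anywhere(disk, block_map, dirty, next_free):
    """Write each dirty block once, to a fresh location, and remap it:
    fewer writes, at the cost of relocation (external fragmentation)."""
    writes = 0
    for logical, data in dirty.items():
        disk[next_free] = data; writes += 1         # single write, new place
        block_map[logical] = next_free              # remap logical -> physical
        next_free += 1
    return writes

dirty = {10: b"x", 11: b"y"}
j_writes = commit_journaling({}, [], dict(dirty))              # 2 writes/block
w_writes = commit_write_anywhere({}, {}, dict(dirty), next_free=100)
```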
= Three-level block allocator = The first (lowest) level implements a map of free space (currently Reiser4 supports only a bitmap, but it could also be implemented as a tree of extents). A tight relationship between block allocation and transaction models was discovered at Namesys labs and implemented in Reiser4: the second level implements the transaction model, and on this level the block allocator decides whether a dirty page will be written back to its old place or get a new location on disk. The third (highest) level implements allocation policies in 2 contexts (forward and reverse) for the whole locality of a specified node. = Off-line file system check = Any corrupted Reiser4 volume can be repaired off-line by a special user-space utility, fsck.reiser4, which is part of reiser4progs. Fsck.reiser4 performs 3 passes. In the first pass it checks the integrity of the basic data structure (the tree). In the second pass fsck scans the twig level and checks the integrity of the extent regions (contiguous regions on disk where the bodies of large files are stored). In the third (semantic) pass, fsck scans the leaf level and checks the integrity of "semantic" objects (directories, regular files, symlinks, etc.). Fsck.reiser4 absorbed the development experience of its predecessor, reiserfsck. In particular, fsck.reiser4 is free from a shortcoming inherent to reiserfsck, whose rebuilding process gets confused by ReiserFS (v3) images stored in the volume being repaired. = Precise Asynchronous Discard support for SSD drives = In contrast with other file systems, Reiser4 does not simply inform the block layer about extents being freed on disk. For every such extent, Reiser4 checks whether the head and tail of the respective erase units are free in the map of free disk space. If so, Reiser4 issues discard requests for the larger regions.
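The head/tail widening check just described can be sketched as follows (a simplified model with a free-block bitmap and a fixed erase-unit size; the function name, granularity, and layout are assumptions for illustration, not Reiser4's actual code): a freed extent is extended toward erase-unit boundaries across any neighbouring blocks that are also free.

```python
# Simplified sketch of extent widening before discard (illustrative only).
# free[i] is True when block i is free; ERASE_UNIT is the SSD erase-unit
# size in blocks (an assumed value for this example).
ERASE_UNIT = 8

def widen_for_discard(free, start, length):
    """Extend [start, start+length) toward erase-unit boundaries,
    but only across blocks that are free in the space map."""
    lo = start
    unit_lo = (start // ERASE_UNIT) * ERASE_UNIT
    while lo > unit_lo and free[lo - 1]:          # absorb free head of unit
        lo -= 1
    hi = start + length
    unit_hi = -(-hi // ERASE_UNIT) * ERASE_UNIT   # round up to unit boundary
    while hi < unit_hi and hi < len(free) and free[hi]:  # absorb free tail
        hi += 1
    return lo, hi

free = [True] * 16
free[9] = False                        # block 9 is still allocated
span = widen_for_discard(free, 2, 4)   # freed extent [2, 6) widens to [0, 8)
```

A second call such as `widen_for_discard(free, 10, 2)` stops at the busy block 9 on the left but still widens to the erase-unit boundary on the right.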
Such a policy doesn't lead to an accumulation of "undiscarded garbage" on disk, and hence there is no need to periodically run tools like fstrim, which scan the disk and issue discard requests for such "garbage". In Reiser4, issuing discard requests is a delayed action, performed on a per-atom basis at transaction commit time. This reduces the number of discard commands (because extents that need to be discarded get merged). = Metadata and inline-data checksums (not stable stuff) = = Software Framework, Development model and compatibility = [[category:Reiser4]]
So, it was made a decision to develop from scratch a new file system, which was supposed to absorb the experience of previous developments. In 2002 Namesys got a grant from DARPA for this. Reiser4 development was also sponsored by Linspire. However, in commercial terms Namesys activity was not successful, and eventually this had led to financial problems. Since the arrest of Hans Reiser (in October 2006) Reiser4 has been maintained by a former Namesys employee, mathematician, programmer, Ph.D Edward Shishkin. Currently development continues on a non-commercial base. In this development mode Reiser4 acquired stability and a number of many new features like modules for transparent compression (announced in 2007), different transaction models (Journaling, Write-Anywhere (COW), Hybrid transaction model), precise asynchronous discard support for SSD drives, metadata and inline data checksums, failover, etc. A new Reiser4 disk format version (4.0.1) were released (*). = Reiser4 and upstream = In contrast with its predecessor (ReiserFS, v3), Reiser4 was not accepted to the upstream Linux kernel because of political reasons. Later Edward Shishkin expressed an interest (*) to port Reiser4 to other operating systems, in particular, to FreeBSD, which is, according to his standpoint, "more open to academic researches". In this case it would be illogical to expect Reiser4 to be tightly integrated with some specified operating system. Thus, Reiser4 is developed as a standalone independent project (*). The archive of ports for upstream Linux kernels can be found at the project's sites on GitHub and Sourceforge (*). = Efficiency of disk space usage = Reiser4 provides the most efficient disk space usage among all file systems in all scenarios and workloads. In particular, on Compilebench ((c) Oracle) Reiser4 shows disk space efficiency 50 times (5000%) better than ext4, and 12 times (1200%) better than Btrfs with compression. 
The problem of internal fragmentation in Reiser4 is completely resolved by using a special technique of liquid records (or virtual keys). It means that for any fraction Q < 1 every keyed record in the tree can be split into 2 parts in the proportion Q with a possibility to quickly allocate unique keys for both parts. Such split (as well as merge) is performed by plugins of ITEM interface when packing data and metadata into tree nodes at flush time (just before writing to disk). = Reiser4 structure. Plugins. Heterogeneity in time and in space = File system as a complicated subsystem of modern OS is the most subjected to the problem of creeping featurism caused by the progress in hardware and software technologies. To resist this problem Namesys engineers made a decision to develop not simply a file system, but the whole software framework providing reusable environment. Reiser4 has two different code bases for kernel module and user-space utilities. All ones have modular infrastructure. It means that every calculation in the context of the file system (or user-space utility) looks like execution of a module of some interface. Resier4 kernel module has an abstract base, which is a direct acyclic graph (DAG) of interfaces. Every vertex of that graph represents an interface, and every directed edge of that graph represents a client-supplier relationship between interfaces. All top interfaces of that graph are suppliers for VFS. Every interface of that graph is implemented by one or more modules. At file system creation time user can specify those modules depending on the types of workload and storage media. For some interfaces there is a possibility to switch to another module at any moment (usually such switching is accompanied with a respective conversion of run-time objects). Finally, for some interfaces reiser4 performs such switching in intelligent manner (without user intervention). Thus, file system is in permanent evolution, adapting to current conditions. 
For historical reasons Reiser4 modules are called plugins. Heterogeneity means an ability to choose different modules of the same interface to manage different objects of the same type. For example, bodies of small and large files in Reiser4 are managed by different plugins of ITEM interface (bodies of small files are packed to formatted blocks, whereas bodies of large files are stored as a number of unformatted blocks (extents)). Another example is compressed and non-compressed files, which are managed by different plugins of FILE interface. Heterogeneity in time means an ability to switch to different managing plugin for an object (thus, we can switch to a plugin, which is more preferable at some moment). Heterogeneity in space means an ability to assign different plugins to different components of the same compound object (e.g. logical volume). In addition, modular design allows to safely add various features, whose emergence is caused by continuous development of hardware storage technologies. = Atomicity of operations = Atomicity means that filesystem operations either entirely occur, or they entirely don't, and they don't corrupt due to half occurring. All operations in Reiser4 except long writes are atomic. In the case of long writes Reiser4 is forced to close transactions to free dirty pages in a response to memory pressure notifications and reopen them for the rest of user's data. So, long writes in Reiser4 are split into a number of atomic writes. Maximal length of atomic write depends on the file plugin. Edward Shishkin suggested a design of full atomicity (atomic writes of any length) in Write-Anywhere transaction model, where atom can be flushed without closing a transaction. = Different transaction models = Reiser4 offers different transaction models, so at mount time user can choose a one, which is more suitable for his type of storage media and workload. 
Journaling transaction model is recommended for HDD devices, as this transaction model doesn't lead to avalanche-like external fragmentation which results in performance degradation on rotating media storage. Write-Anywhere transaction model is recommended for SSD devices, which are not critical to external fragmentation. In this transaction model number of IO requests issued by a file system is minimal (it doesn't write to journal with the following overwriting blocks on disk), which is also important for SSD devices. Also Reiser4 offers a unique "hybrid" transaction model, which provides a strong invariant - a parent-first order on the storage tree nodes in term of disk addresses. This transaction model is recommended for HDD users, who don't perform a huge number of random overwrites. In this transaction model a part of atom's dirty pages (overwrite set) is committed via journal, and another part (relocate set) is written to different location on disk. All other file systems offers only one hardcoded transaction model. This is either only journalling (ext4, xfs, jfs, etc), or only write-anywhere, AKA "copy-on-write" (ZFS, Btrfs, etc). = Three-level block allocator = The first (lowest) level implements a map of free space (currently reiser4 supports only bitmap, but it also can be implemented as a tree of extents). The second level implements allocation policies in 2 contexts (forward and reverse) for the whole locality of specified node. Tight relationship between block allocation and transaction models was revealed at Namesys labs and implemented in Reiser4. The second level implements a transaction model. On this level block allocator decides, if dirty page will be written to the old place, or it will get a new location on disk. The third (highest) level implements allocation policies in 2 contexts (forward and reverse) for the whole locality of specified node. 
= Off-line file system check = Any corrupted Reiser4 volume can be repaired off-line by a special user-space utility fsck.reiser4, which is a part of reiser4progs. Fsck.reiser4 performs 3 passes. At the first pass it checks integrity of the basic data structure (tree). At the second pass fsck scans twig level and checks integrity of the extent regions (contiguous regions on disk, where bodies of large files are stored). At the third (semantic) pass, fsck scans leaf level and checks integrity of "semantic" objects (directories, regular files, symlinks, etc). Fsck.reiser4 absorbed the development experience of its predecessor reiserfsck. In particular, fsck.reiser4 is free from a shortcoming inherent to reiserfsck, whose rebuilding process gets confused by ReiserFS (v3) images stored in the volume being repaired. = Precise Asynchronous Discard support for SSD drives = In contrast with other file systems, Reiser4 not simply informs the block layer about extents being freed on disk. For every such extent Reiser4 checks if head and tail of respective erase units are free in the map of disk free space. If so, Reiser4 issues discard extents for larger regions. Such policy doesn't lead to accumulation of "not discarded garbage" on disk and, hence, there is no need to run periodically tools like fstrim, which scan disk and issue discard requests for such "garbage". In Reiser4 issuing discard requests is a delayed action, which is performed on per-atom basis at transaction commit time. It allows to reduce number of discard commands (because of merging of extents, which need to be discarded). 
= Metadata and inline-data checksums (not stable stuff) = = Software Framework, Development model and compatibolity = [[category:Reiser4]] fa0baf35550a93afac5270f8bda0bdcb6408b1ba 4275 4243 2017-07-11T15:48:08Z Edward 4 More precise description of Hybrid Transaction Model = Summary = Reiser4 is a software framework for creation, assembly and customizing file systems managing local (simple or logical) storage volumes of the Operating System. Reiser4 is a successor of ReiserFS (which is also known as ReiserFS of version 3). Reiser4 absorbed results of academic researches in the area of data storage, which had been conducted since 1992 by engineers of Namesys labs in collaboration with Moscow State University and Program Systems Institute of the Russian Academy of Sciences. For historical reasons Reiser4 currently works only for Linux OS. However, it can be easily ported to any operating system due to its modular infrastructure. = History of Reiser4 = Namesys was created by Hans Reiser approximately in 1993 from a number of last graduates trained in the format of the old education system of Soviet Union (USSR). At the beginning of the new century Namesys engineers had accumulated a number of innovative ideas in the area of data storage software systems. However, it was rather problematic to implement them in the context of existing at that moment ReiserFS (v3), mostly because of design problems. On the other hand, ReiserFS (v3) had a number of shortcomings, which were hard to fix for the same reasons. So, it was made a decision to develop from scratch a new file system, which was supposed to absorb the experience of previous developments. In 2002 Namesys got a grant from DARPA for this. Reiser4 development was also sponsored by Linspire. However, in commercial terms Namesys activity was not successful, and eventually this had led to financial problems. 
Since the arrest of Hans Reiser (in October 2006) Reiser4 has been maintained by a former Namesys employee, mathematician, programmer, Ph.D Edward Shishkin. Currently development continues on a non-commercial base. In this development mode Reiser4 acquired stability and a number of many new features like modules for transparent compression (announced in 2007), different transaction models (Journaling, Write-Anywhere (COW), Hybrid transaction model), precise asynchronous discard support for SSD drives, metadata and inline data checksums, failover, etc. A new Reiser4 disk format version (4.0.1) were released (*). = Reiser4 and upstream = In contrast with its predecessor (ReiserFS, v3), Reiser4 was not accepted to the upstream Linux kernel because of political reasons. Later Edward Shishkin expressed an interest (*) to port Reiser4 to other operating systems, in particular, to FreeBSD, which is, according to his standpoint, "more open to academic researches". In this case it would be illogical to expect Reiser4 to be tightly integrated with some specified operating system. Thus, Reiser4 is developed as a standalone independent project (*). The archive of the patches for upstream Linux kernels can be found at the project's sites on GitHub and Sourceforge (*). = Efficiency of disk space usage = Reiser4 provides the most efficient disk space usage among all file systems in all scenarios and workloads. In particular, on Compilebench ((c) Oracle) Reiser4 shows disk space efficiency 50 times (5000%) better than ext4, and 12 times (1200%) better than Btrfs with compression. The problem of internal fragmentation in Reiser4 is completely resolved by using a special technique of liquid records (or virtual keys). It means that for any fraction Q < 1 every keyed record in the tree can be split into 2 parts in the proportion Q with a possibility to quickly allocate unique keys for both parts. 
Such split (as well as merge) is performed by plugins of ITEM interface when packing data and metadata into tree nodes at flush time (just before writing to disk). = Reiser4 structure. Plugins. Heterogeneity in time and in space = File system as a complicated subsystem of modern OS is the most subjected to the problem of creeping featurism caused by the progress in hardware and software technologies. To resist this problem Namesys engineers made a decision to develop not simply a file system, but the whole software framework providing reusable environment. Reiser4 has two different code bases for kernel module and user-space utilities. All ones have modular infrastructure. It means that every calculation in the context of the file system (or user-space utility) looks like execution of a module of some interface. Resier4 kernel module has an abstract base, which is a direct acyclic graph (DAG) of interfaces. Every vertex of that graph represents an interface, and every directed edge of that graph represents a client-supplier relationship between interfaces. All top interfaces of that graph are suppliers for VFS. Every interface of that graph is implemented by one or more modules. At file system creation time user can specify those modules depending on the types of workload and storage media. For some interfaces there is a possibility to switch to another module at any moment (usually such switching is accompanied with a respective conversion of run-time objects). Finally, for some interfaces reiser4 performs such switching in intelligent manner (without user intervention). Thus, file system is in permanent evolution, adapting to current conditions. For historical reasons Reiser4 modules are called plugins. Heterogeneity means an ability to choose different modules of the same interface to manage different objects of the same type. 
For example, bodies of small and large files in Reiser4 are managed by different plugins of ITEM interface (bodies of small files are packed to formatted blocks, whereas bodies of large files are stored as a number of unformatted blocks (extents)). Another example is compressed and non-compressed files, which are managed by different plugins of FILE interface. Heterogeneity in time means an ability to switch to different managing plugin for an object (thus, we can switch to a plugin, which is more preferable at some moment). Heterogeneity in space means an ability to assign different plugins to different components of the same compound object (e.g. logical volume). In addition, modular design allows to safely add various features, whose emergence is caused by continuous development of hardware storage technologies. = Atomicity of operations = Atomicity means that filesystem operations either entirely occur, or they entirely don't, and they don't corrupt due to half occurring. All operations in Reiser4 except long writes are atomic. In the case of long writes Reiser4 is forced to close transactions to free dirty pages in a response to memory pressure notifications and reopen them for the rest of user's data. So, long writes in Reiser4 are split into a number of atomic writes. Maximal length of atomic write depends on the file plugin. Edward Shishkin suggested a design of full atomicity (atomic writes of any length) in Write-Anywhere transaction model, where atom can be flushed without closing a transaction. = Different transaction models = Reiser4 offers different transaction models, so at mount time user can choose a one, which is more suitable for his type of storage media and workload. Journaling transaction model is recommended for HDD devices, as this transaction model doesn't lead to avalanche-like external fragmentation which results in performance degradation on rotating media storage. 
Write-Anywhere transaction model is recommended for SSD devices, which are not critical to external fragmentation. In this transaction model number of IO requests issued by a file system is minimal (it doesn't write to journal with the following overwriting blocks on disk), which is also important for SSD devices. Also Reiser4 offers a unique "hybrid" transaction model, which provides a strong invariant - a parent-first order on the storage tree nodes in term of disk addresses. This transaction model is recommended for HDD users, who don't perform a huge number of random overwrites. In this transaction model a part of atom's dirty pages (overwrite set) is committed via journal, and another part (relocate set) is written to different location on disk. All other file systems offers only one hardcoded transaction model. This is either only journalling (ext4, xfs, jfs, etc), or only write-anywhere, AKA "copy-on-write" (ZFS, Btrfs, etc). = Three-level block allocator = The first (lowest) level implements a map of free space (currently reiser4 supports only bitmap, but it also can be implemented as a tree of extents). The second level implements allocation policies in 2 contexts (forward and reverse) for the whole locality of specified node. Tight relationship between block allocation and transaction models was revealed at Namesys labs and implemented in Reiser4. The second level implements a transaction model. On this level block allocator decides, if dirty page will be written to the old place, or it will get a new location on disk. The third (highest) level implements allocation policies in 2 contexts (forward and reverse) for the whole locality of specified node. = Off-line file system check = Any corrupted Reiser4 volume can be repaired off-line by a special user-space utility fsck.reiser4, which is a part of reiser4progs. Fsck.reiser4 performs 3 passes. At the first pass it checks integrity of the basic data structure (tree). 
At the second pass fsck scans twig level and checks integrity of the extent regions (contiguous regions on disk, where bodies of large files are stored). At the third (semantic) pass, fsck scans leaf level and checks integrity of "semantic" objects (directories, regular files, symlinks, etc). Fsck.reiser4 absorbed the development experience of its predecessor reiserfsck. In particular, fsck.reiser4 is free from a shortcoming inherent to reiserfsck, whose rebuilding process gets confused by ReiserFS (v3) images stored in the volume being repaired. = Precise Asynchronous Discard support for SSD drives = In contrast with other file systems, Reiser4 not simply informs the block layer about extents being freed on disk. For every such extent Reiser4 checks if head and tail of respective erase units are free in the map of disk free space. If so, Reiser4 issues discard extents for larger regions. Such policy doesn't lead to accumulation of "not discarded garbage" on disk and, hence, there is no need to run periodically tools like fstrim, which scan disk and issue discard requests for such "garbage". In Reiser4 issuing discard requests is a delayed action, which is performed on per-atom basis at transaction commit time. It allows to reduce number of discard commands (because of merging of extents, which need to be discarded). = Metadata and inline-data checksums (not stable stuff) = = Software Framework, Development model and compatibolity = [[category:Reiser4]] 1d77c28ba4b5e46ca76ea2a6bdf351c3d1af7c0e 4243 4229 2017-06-20T23:23:30Z Chris goe 2 category added = Summary = Reiser4 is a software framework for creation, assembly and customizing file systems managing local (simple or logical) storage volumes of the Operating System. Reiser4 is a successor of ReiserFS (which is also known as ReiserFS of version 3). 
Reiser4 absorbed results of academic researches in the area of data storage, which had been conducted since 1992 by engineers of Namesys labs in collaboration with Moscow State University and Program Systems Institute of the Russian Academy of Sciences. For historical reasons Reiser4 currently works only for Linux OS. However, it can be easily ported to any operating system due to its modular infrastructure. = History of Reiser4 = Namesys was created by Hans Reiser approximately in 1993 from a number of last graduates trained in the format of the old education system of Soviet Union (USSR). At the beginning of the new century Namesys engineers had accumulated a number of innovative ideas in the area of data storage software systems. However, it was rather problematic to implement them in the context of existing at that moment ReiserFS (v3), mostly because of design problems. On the other hand, ReiserFS (v3) had a number of shortcomings, which were hard to fix for the same reasons. So, it was made a decision to develop from scratch a new file system, which was supposed to absorb the experience of previous developments. In 2002 Namesys got a grant from DARPA for this. Reiser4 development was also sponsored by Linspire. However, in commercial terms Namesys activity was not successful, and eventually this had led to financial problems. Since the arrest of Hans Reiser (in October 2006) Reiser4 has been maintained by a former Namesys employee, mathematician, programmer, Ph.D Edward Shishkin. Currently development continues on a non-commercial base. In this development mode Reiser4 acquired stability and a number of many new features like modules for transparent compression (announced in 2007), different transaction models (Journaling, Write-Anywhere (COW), Hybrid transaction model), precise asynchronous discard support for SSD drives, metadata and inline data checksums, failover, etc. A new Reiser4 disk format version (4.0.1) were released (*). 
= Reiser4 and upstream = In contrast with its predecessor (ReiserFS, v3), Reiser4 was not accepted to the upstream Linux kernel because of political reasons. Later Edward Shishkin expressed an interest (*) to port Reiser4 to other operating systems, in particular, to FreeBSD, which is, according to his standpoint, "more open to academic researches". In this case it would be illogical to expect Reiser4 to be tightly integrated with some specified operating system. Thus, Reiser4 is developed as a standalone independent project (*). The archive of the patches for upstream Linux kernels can be found at the project's sites on GitHub and Sourceforge (*). = Efficiency of disk space usage = Reiser4 provides the most efficient disk space usage among all file systems in all scenarios and workloads. In particular, on Compilebench ((c) Oracle) Reiser4 shows disk space efficiency 50 times (5000%) better than ext4, and 12 times (1200%) better than Btrfs with compression. The problem of internal fragmentation in Reiser4 is completely resolved by using a special technique of liquid records (or virtual keys). It means that for any fraction Q < 1 every keyed record in the tree can be split into 2 parts in the proportion Q with a possibility to quickly allocate unique keys for both parts. Such split (as well as merge) is performed by plugins of ITEM interface when packing data and metadata into tree nodes at flush time (just before writing to disk). = Reiser4 structure. Plugins. Heterogeneity in time and in space = File system as a complicated subsystem of modern OS is the most subjected to the problem of creeping featurism caused by the progress in hardware and software technologies. To resist this problem Namesys engineers made a decision to develop not simply a file system, but the whole software framework providing reusable environment. Reiser4 has two different code bases for kernel module and user-space utilities. All ones have modular infrastructure. 
It means that every calculation in the context of the file system (or user-space utility) looks like execution of a module of some interface. Resier4 kernel module has an abstract base, which is a direct acyclic graph (DAG) of interfaces. Every vertex of that graph represents an interface, and every directed edge of that graph represents a client-supplier relationship between interfaces. All top interfaces of that graph are suppliers for VFS. Every interface of that graph is implemented by one or more modules. At file system creation time user can specify those modules depending on the types of workload and storage media. For some interfaces there is a possibility to switch to another module at any moment (usually such switching is accompanied with a respective conversion of run-time objects). Finally, for some interfaces reiser4 performs such switching in intelligent manner (without user intervention). Thus, file system is in permanent evolution, adapting to current conditions. For historical reasons Reiser4 modules are called plugins. Heterogeneity means an ability to choose different modules of the same interface to manage different objects of the same type. For example, bodies of small and large files in Reiser4 are managed by different plugins of ITEM interface (bodies of small files are packed to formatted blocks, whereas bodies of large files are stored as a number of unformatted blocks (extents)). Another example is compressed and non-compressed files, which are managed by different plugins of FILE interface. Heterogeneity in time means an ability to switch to different managing plugin for an object (thus, we can switch to a plugin, which is more preferable at some moment). Heterogeneity in space means an ability to assign different plugins to different components of the same compound object (e.g. logical volume). 
In addition, the modular design makes it safe to add features whose emergence is driven by the continuous development of hardware storage technologies.

= Atomicity of operations =

Atomicity means that file system operations either occur entirely or not at all, and nothing is corrupted by an operation half-occurring. All operations in Reiser4 except long writes are atomic. In the case of long writes, Reiser4 is forced to close a transaction in response to memory-pressure notifications (to free dirty pages) and to open a new one for the rest of the user's data. So long writes in Reiser4 are split into a number of atomic writes; the maximal length of an atomic write depends on the file plugin. Edward Shishkin suggested a design for full atomicity (atomic writes of any length) in the Write-Anywhere transaction model, where an atom can be flushed without closing the transaction.

= Different transaction models =

Reiser4 offers different transaction models, so at mount time the user can choose the one most suitable for the storage media and workload. The Journaling model is recommended for HDDs, as it doesn't lead to the avalanche-like external fragmentation that degrades performance on rotating storage. The Write-Anywhere model is recommended for SSDs, which are not sensitive to external fragmentation; in this model the number of IO requests issued by the file system is minimal (blocks are not first written to a journal and then overwritten in place), which also matters for SSDs. Reiser4 also offers a unique "hybrid" model, recommended for HDD users who don't perform a huge number of random overwrites: part of an atom's dirty pages (the overwrite set) is committed via the journal, and the other part (the relocate set) is written to a different location on disk. All other file systems offer only one hardcoded transaction model.
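A minimal sketch of selecting a transaction model at mount time. It assumes the txmod mount option (txmod=journal|wa|hybrid) provided by the out-of-tree reiser4 patches; verify the exact option name against the documentation shipped with your patch set:

```c
#include <stdio.h>
#include <string.h>
#include <sys/mount.h>

/* Build the fs-specific option string, e.g. "txmod=hybrid". */
void format_opts(char *buf, size_t size, const char *txmod)
{
    snprintf(buf, size, "txmod=%s", txmod);
}

/* Mount a reiser4 volume with an explicit transaction model.
 * The option name "txmod" is an assumption taken from the reiser4
 * patch series, not something guaranteed by mainline kernels. */
int mount_reiser4(const char *dev, const char *mnt, const char *txmod)
{
    char opts[64];
    format_opts(opts, sizeof(opts), txmod);
    /* the last mount(2) argument carries fs-specific options, as
     * with "mount -o"; needs CAP_SYS_ADMIN, returns 0 on success */
    return mount(dev, mnt, "reiser4", 0, opts);
}
```

From the shell the equivalent would be mount -o txmod=hybrid followed by the device and mount point.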
That single model is either journaling only (ext4, XFS, JFS, etc.) or write-anywhere only, a.k.a. "copy-on-write" (ZFS, Btrfs, etc.).

= Three-level block allocator =

The first (lowest) level implements the map of free space (currently Reiser4 supports only a bitmap, but it could also be implemented as a tree of extents). The second level implements the transaction model: on this level the block allocator decides whether a dirty page will be written back in place or get a new location on disk. This tight relationship between block allocation and transaction models was revealed at the Namesys labs and implemented in Reiser4. The third (highest) level implements allocation policies in two contexts (forward and reverse) for the whole locality of a specified node.

= Off-line file system check =

Any corrupted Reiser4 volume can be repaired off-line by a dedicated user-space utility, fsck.reiser4, which is part of reiser4progs. Fsck.reiser4 performs three passes. On the first pass it checks the integrity of the basic data structure (the tree). On the second pass it scans the twig level and checks the integrity of the extent regions (contiguous regions on disk where the bodies of large files are stored). On the third (semantic) pass it scans the leaf level and checks the integrity of "semantic" objects (directories, regular files, symlinks, etc.). Fsck.reiser4 absorbed the development experience of its predecessor reiserfsck; in particular, it is free from a shortcoming of reiserfsck, whose rebuilding process gets confused by ReiserFS (v3) images stored in the volume being repaired.

= Precise Asynchronous Discard support for SSD drives =

In contrast with other file systems, Reiser4 does not simply inform the block layer about extents being freed on disk. For every such extent Reiser4 checks whether the head and tail of the respective erase units are free in the map of disk free space.
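That head/tail check can be sketched as follows. This is a simplified model (a boolean free-space map, block-granular erase units); the function names and the widening policy are illustrative assumptions, not the kernel implementation:

```c
#include <stdbool.h>
#include <stddef.h>

/* Round a block number down / up to an erase-unit boundary. */
unsigned long round_down_eu(unsigned long blk, unsigned long eu)
{
    return blk - blk % eu;
}

unsigned long round_up_eu(unsigned long blk, unsigned long eu)
{
    return (blk + eu - 1) / eu * eu;
}

/* True if every block in [from, to) is free in the free-space map. */
bool range_free(const bool *free_map, unsigned long from, unsigned long to)
{
    for (unsigned long b = from; b < to; b++)
        if (!free_map[b])
            return false;
    return true;
}

/*
 * Widen a freed extent [start, start + len) to erase-unit boundaries
 * wherever the head/tail padding is also free, so the resulting
 * discard request covers whole erase units.  The widened extent is
 * returned via *d_start / *d_len.
 */
void widen_discard(const bool *free_map, unsigned long eu,
                   unsigned long start, unsigned long len,
                   unsigned long *d_start, unsigned long *d_len)
{
    unsigned long end  = start + len;
    unsigned long head = round_down_eu(start, eu);
    unsigned long tail = round_up_eu(end, eu);

    if (range_free(free_map, head, start))  /* head padding free too */
        start = head;
    if (range_free(free_map, end, tail))    /* tail padding free too */
        end = tail;

    *d_start = start;
    *d_len   = end - start;
}
```

For example, freeing blocks 2..5 on a toy 16-block volume with 8-block erase units, where blocks 0..7 are free, widens the discard to the whole first erase unit [0, 8).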
If the head and tail padding are free, Reiser4 issues the discard for the larger, erase-unit-aligned region. This policy prevents the accumulation of non-discarded garbage on disk, so there is no need to periodically run tools like fstrim, which scan the disk and issue discard requests for such garbage. In Reiser4, issuing discard requests is a delayed action performed on a per-atom basis at transaction commit time, which reduces the number of discard commands (adjacent extents that need to be discarded get merged).

= Metadata and inline-data checksums (not stable stuff) =

= Software Framework, Development model and compatibility =

[[category:Reiser4]]
On the other hand, ReiserFS (v3) had a number of shortcomings, which were hard to fix for the same reasons. So, it was made a decision to develop from scratch a new file system, which was supposed to absorb the experience of previous developments. In 2002 Namesys got a grant from DARPA for this. Reiser4 development was also sponsored by Linspire. However, in commercial terms Namesys activity was not successful, and eventually this had led to financial problems. Since the arrest of Hans Reiser (in October 2006) Reiser4 has been maintained by a former Namesys employee, mathematician, programmer, Ph.D Edward Shishkin. Currently development continues on a non-commercial base. In this development mode Reiser4 acquired stability and a number of many new features like modules for transparent compression (announced in 2007), different transaction models (Journaling, Write-Anywhere (COW), Hybrid transaction model), precise asynchronous discard support for SSD drives, metadata and inline data checksums, failover, etc. A new Reiser4 disk format version (4.0.1) were released (*). = Reiser4 and upstream = In contrast with its predecessor (ReiserFS, v3), Reiser4 was not accepted to the upstream Linux kernel because of political reasons. Later Edward Shishkin expressed an interest (*) to port Reiser4 to other operating systems, in particular, to FreeBSD, which is, according to his standpoint, "more open to academic researches". In this case it would be illogical to expect Reiser4 to be tightly integrated with some specified operating system. Thus, Reiser4 is developed as a standalone independent project (*). The archive of the patches for upstream Linux kernels can be found at the project's sites on GitHub and Sourceforge (*). = Efficiency of disk space usage = Reiser4 provides the most efficient disk space usage among all file systems in all scenarios and workloads. 
In particular, on Compilebench ((c) Oracle) Reiser4 shows disk space efficiency 50 times (5000%) better than ext4, and 12 times (1200%) better than Btrfs with compression. The problem of internal fragmentation in Reiser4 is completely resolved by using a special technique of liquid records (or virtual keys). It means that for any fraction Q < 1 every keyed record in the tree can be split into 2 parts in the proportion Q with a possibility to quickly allocate unique keys for both parts. Such split (as well as merge) is performed by plugins of ITEM interface when packing data and metadata into tree nodes at flush time (just before writing to disk). = Reiser4 structure. Plugins. Heterogeneity in time and in space = File system as a complicated subsystem of modern OS is the most subjected to the problem of creeping featurism caused by the progress in hardware and software technologies. To resist this problem Namesys engineers made a decision to develop not simply a file system, but the whole software framework providing reusable environment. Reiser4 has two different code bases for kernel module and user-space utilities. All ones have modular infrastructure. It means that every calculation in the context of the file system (or user-space utility) looks like execution of a module of some interface. Resier4 kernel module has an abstract base, which is a direct acyclic graph (DAG) of interfaces. Every vertex of that graph represents an interface, and every directed edge of that graph represents a client-supplier relationship between interfaces. All top interfaces of that graph are suppliers for VFS. Every interface of that graph is implemented by one or more modules. At file system creation time user can specify those modules depending on the types of workload and storage media. For some interfaces there is a possibility to switch to another module at any moment (usually such switching is accompanied with a respective conversion of run-time objects). 
Finally, for some interfaces reiser4 performs such switching in intelligent manner (without user intervention). Thus, file system is in permanent evolution, adapting to current conditions. For historical reasons Reiser4 modules are called plugins. Heterogeneity means an ability to choose different modules of the same interface to manage different objects of the same type. For example, bodies of small and large files in Reiser4 are managed by different plugins of ITEM interface (bodies of small files are packed to formatted blocks, whereas bodies of large files are stored as a number of unformatted blocks (extents)). Another example is compressed and non-compressed files, which are managed by different plugins of FILE interface. Heterogeneity in time means an ability to switch to different managing plugin for an object (thus, we can switch to a plugin, which is more preferable at some moment). Heterogeneity in space means an ability to assign different plugins to different components of the same compound object (e.g. logical volume). In addition, modular design allows to safely add various features, whose emergence is caused by continuous development of hardware storage technologies. = Atomicity of operations = Atomicity means that filesystem operations either entirely occur, or they entirely don't, and they don't corrupt due to half occurring. All operations in Reiser4 except long writes are atomic. In the case of long writes Reiser4 is forced to close transactions to free dirty pages in a response to memory pressure notifications and reopen them for the rest of user's data. So, long writes in Reiser4 are split into a number of atomic writes. Maximal length of atomic write depends on the file plugin. Edward Shishkin suggested a design of full atomicity (atomic writes of any length) in Write-Anywhere transaction model, where atom can be flushed without closing a transaction. 
= Different transaction models = Reiser4 offers different transaction models, so at mount time user can choose a one, which is more suitable for his type of storage media and workload. Journaling transaction model is recommended for HDD devices, as this transaction model doesn't lead to avalanche-like external fragmentation which results in performance degradation on rotating media storage. Write-Anywhere transaction model is recommended for SSD devices, which are not critical to external fragmentation. In this transaction model number of IO requests issued by a file system is minimal (it doesn't write to journal with the following overwriting blocks on disk), which is also important for SSD devices. Also Reiser4 offers a unique "hybrid" transaction model, which is recommended for HDD users, who don't perform a huge number of random overwrites. In this transaction model a part of atom's dirty pages (overwrite set) is committed via journal, and another part (relocate set) is written to different location on disk. All other file systems offers only one hardcoded transaction model. This is either only journalling (ext4, xfs, jfs, etc), or only write-anywhere, AKA "copy-on-write" (ZFS, Btrfs, etc). = Three-level block allocator = The first (lowest) level implements a map of free space (currently reiser4 supports only bitmap, but it also can be implemented as a tree of extents). The second level implements allocation policies in 2 contexts (forward and reverse) for the whole locality of specified node. Tight relationship between block allocation and transaction models was revealed at Namesys labs and implemented in Reiser4. The second level implements a transaction model. On this level block allocator decides, if dirty page will be written to the old place, or it will get a new location on disk. The third (highest) level implements allocation policies in 2 contexts (forward and reverse) for the whole locality of specified node. 
= Off-line file system check = Any corrupted Reiser4 volume can be repaired off-line by a special user-space utility fsck.reiser4, which is a part of reiser4progs. Fsck.reiser4 performs 3 passes. At the first pass it checks integrity of the basic data structure (tree). At the second pass fsck scans twig level and checks integrity of the extent regions (contiguous regions on disk, where bodies of large files are stored). At the third (semantic) pass, fsck scans leaf level and checks integrity of "semantic" objects (directories, regular files, symlinks, etc). Fsck.reiser4 absorbed the development experience of its predecessor reiserfsck. In particular, fsck.reiser4 is free from a shortcoming inherent to reiserfsck, whose rebuilding process gets confused by ReiserFS (v3) images stored in the volume being repaired. = Precise Asynchronous Discard support for SSD drives = In contrast with other file systems, Reiser4 not simply informs the block layer about extents being freed on disk. For every such extent Reiser4 checks if head and tail of respective erase units are free in the map of disk free space. If so, Reiser4 issues discard extents for larger regions. Such policy doesn't lead to accumulation of "not discarded garbage" on disk and, hence, there is no need to run periodically tools like fstrim, which scan disk and issue discard requests for such "garbage". In Reiser4 issuing discard requests is a delayed action, which is performed on per-atom basis at transaction commit time. It allows to reduce number of discard commands (because of merging of extents, which need to be discarded). = Metadata and inline-data checksums (not stable stuff) = = Software Framework, Development model and compatibolity = 93616ab8ca27da6f741a79040ca964d83d401a8c 4225 4221 2017-05-09T13:06:13Z Edward 4 Fixed typo = Summary = Reiser4 is a software framework for creation, assembly and customizing file systems managing local (simple or logical) storage volumes of the Operating System. 
Reiser4 is a successor of ReiserFS (which is also known as ReiserFS of version 3). Reiser4 absorbed results of academic researches in the area of data storage, which had been conducted since 1992 by engineers of Namesys labs in collaboration with Moscow State University and Program Systems Institute of the Russian Academy of Sciences. For historical reasons Reiser4 currently works only for Linux OS. However, it can be easily ported to any operating system due to its modular infrastructure. = History of Reiser4 = Namesys was created by Hans Reiser approximately in 1993 from a number of last graduates trained in the format of the old education system of Soviet Union (USSR). At the beginning of the new century Namesys engineers had accumulated a number of innovative ideas in the area of data storage software systems. However, it was rather problematic to implement them in the context of existing at that moment ReiserFS (v3), mostly because of design problems. On the other hand, ReiserFS (v3) had a number of shortcomings, which were hard to fix for the same reasons. So, it was made a decision to develop from scratch a new file system, which was supposed to absorb the experience of previous developments. In 2002 Namesys got a grant from DARPA for this. Reiser4 development was also sponsored by Linspire. However, in commercial terms Namesys activity was not successful, and eventually this had led to financial problems. Since the arrest of Hans Reiser (in October 2006) Reiser4 has been maintained by a former Namesys employee, mathematician, programmer, Ph.D Edward Shishkin. Currently development continues on a non-commercial base. In this development mode Reiser4 acquired stability and a number of many new features like modules for transparent compression (announced in 2007), different transaction models (Journaling, Write-Anywhere (COW), Hybrid transaction model), precise asynchronous discard support for SSD drives, metadata and inline data checksums, failover, etc. 
A new Reiser4 disk format version (4.0.1) were released (*). = Reiser4 and upstream = In contrast with its predecessor (ReiserFS, v3), Reiser4 was not accepted to the upstream Linux kernel because of political reasons. Later Edward Shishkin expressed an interest (*) to port Reiser4 to other operating systems, in particular, to FreeBSD, which is, according to his standpoint, "more open to academic researches". In this case it would be illogical to expect Reiser4 to be tightly integrated with some specified operating system. Thus, Reiser4 is developed as a standalone independent project (*). The archive of the patches for upstream Linux kernels can be found at the project's sites on GitHub and Sourceforge (*). = Efficiency of disk space usage = Reiser4 provides the most efficient disk space usage among all file systems in all scenarios and workloads. In particular, on Compilebench ((c) Oracle) Reiser4 shows disk space efficiency 50 times (5000%) better than ext4, and 12 times (1200%) better than Btrfs with compression. The problem of internal fragmentation in Reiser4 is completely resolved by using a special technique of liquid records (or virtual keys). It means that for any fraction Q < 1 every keyed record in the tree can be split into 2 parts in the proportion Q with a possibility to quickly allocate unique keys for both parts. Such split (as well as merge) is performed by plugins of ITEM interface when packing data and metadata into tree nodes at flush time (just before writing to disk). = Reiser4 structure. Plugins. Heterogeneity in time and in space = File system as a complicated subsystem of modern OS is the most subjected to the problem of creeping featurism caused by the progress in hardware and software technologies. To resist this problem Namesys engineers made a decision to develop not simply a file system, but the whole software framework providing reusable environment. Reiser4 has two different code bases for kernel module and user-space utilities. 
All ones have modular infrastructure. It means that every calculation in the context of the file system (or user-space utility) looks like execution of a module of some interface. Resier4 kernel module has an abstract base, which is a direct acyclic graph (DAG) of interfaces. Every vertex of that graph represents an interface, and every directed edge of that graph represents a client-supplier relationship between interfaces. All top interfaces of that graph are suppliers for VFS. Every interface of that graph is implemented by one or more modules. At file system creation time user can specify those modules depending on the types of workload and storage media. For some interfaces there is a possibility to switch to another module at any moment (usually such switching is accompanied with a respective conversion of run-time objects). Finally, for some interfaces reiser4 performs such switching in intelligent manner (without user intervention). Thus, file system is in permanent evolution, adapting to current conditions. For historical reasons Reiser4 modules are called plugins. Heterogeneity means an ability to choose different modules of the same interface to manage different objects of the same type. For example, bodies of small and large files in Reiser4 are managed by different plugins of ITEM interface (bodies of small files are packed to formatted blocks, whereas bodies of large files are stored as a number of unformatted blocks (extents)). Another example is compressed and non-compressed files, which are managed by different plugins of FILE interface. Heterogeneity in time means an ability to switch to different managing plugin for an object (thus, we can switch to a plugin, which is more preferable at some moment). Heterogeneity in space means an ability to assign different plugins to different components of the same compound object (e.g. logical volume). 
In addition, modular design allows to safely add various features, whose emergence is caused by continuous development of hardware storage technologies. = Atomicity of operations = Atomicity means that filesystem operations either entirely occur, or they entirely don't, and they don't corrupt due to half occurring. All operations in Reiser4 except long writes are atomic. In the case of long writes Reiser4 is forced to close transactions to free dirty pages in a response to memory pressure notifications and reopen them for the rest of user's data. So, long writes in Reiser4 are split into a number of atomic writes. Maximal length of atomic write depends on the file plugin. Edward Shishkin suggested a design of full atomicity (atomic writes of any length) in Write-Anywhere transaction model, where atom can be flushed without closing a transaction. = Different transaction models = Reiser4 offers different transaction models, so at mount time user can choose a one, which is more suitable for his type of storage media and workload. Journaling transaction model is recommended for HDD devices, as this transaction model doesn't lead to avalanche-like external fragmentation which results in performance degradation on rotating media storage. Write-Anywhere transaction model is recommended for SSD devices, which are not critical to external fragmentation. In this transaction model number of IO requests issued by a file system is minimal (it doesn't write to journal with the following overwriting blocks on disk), which is also important for SSD devices. Also Reiser4 offers a unique "hybrid" transaction model, which is recommended for HDD users, who don't perform a huge number of random overwrites. In this transaction model a part of atom's dirty pages (overwrite set) is committed via journal, and another part (relocate set) is written to different location on disk. All other file systems offers only one hardcoded transaction model. 
This is either only journalling (ext4, xfs, jfs, etc), or only write-anywhere, AKA "copy-on-write" (ZFS, Btrfs, etc). = Three-level block allocator = The first (lowest) level implements a map of free space (currently reiser4 supports only bitmap, but it also can be implemented as a tree of extents). The second level implements allocation policies in 2 contexts (forward and reverse) for the whole locality of specified node. Tight relationship between block allocation and transaction models was revealed at Namesys labs and implemented in Reiser4. The second level implements a transaction model. On this level block allocator decides, if dirty page will be written to the old place, or it will get a new location on disk. The third (highest) level implements allocation policies in 2 contexts (forward and reverse) for the whole locality of specified node. = Off-line file system check = Any corrupted Reiser4 volume can be repaired off-line by a special user-space utility fsck.reiser4, which is a part of reiser4progs. Fsck.reiser4 performs 3 passes. At the first pass it checks integrity of the basic data structure (tree). At the second pass fsck scans twig level and checks integrity of the extent regions (contiguous regions on disk, where bodies of large files are stored). At the third (semantic) pass, fsck scans leaf level and checks integrity of "semantic" objects (directories, regular files, symlinks, etc). Fsck.reiser4 absorbed the development experience of its predecessor reiserfsck. In particular, fsck.reiser4 is free from a shortcoming inherent to reiserfsck, whose rebuilding process gets confused by ReiserFS (v3) images stored in the volume being repaired. = Precise Discard support for SSD drives = In contrast with other file systems, Reiser4 not simply informs the block layer about extents being freed on disk. For every such extent Reiser4 checks if head and tail of respective erase units are free in the map of disk free space. 
If so, Reiser4 issues discard extents for larger regions. Such policy doesn't lead to accumulation of "not discarded garbage" on disk and, hence, there is no need to run periodically tools like fstrim, which scan disk and issue discard requests for such "garbage". In Reiser4 issuing discard requests is a delayed action, which is performed on per-atom basis at transaction commit time. It allows to reduce number of discard commands (because of merging of extents, which need to be discarded). = Metadata and inline-data checksums (not stable stuff) = = Software Framework, Development model and compatibolity = 2a4d6046360185487eb478d53e8eb7c7f075c5ab 4221 2017-05-09T12:58:55Z Edward 4 Added "Why Reiser4" page = Summary = Reiser4 is a software framework for creation, assembly and customizing file systems managing local (simple or logical) storage volumes of the Operating System. Reiser4 is a successor of ReiserFS (which is also known as ReiserFS of version 3). Reiser4 absorbed results of academic researches in the area of data storage, which had been conducted since 1992 by engineers of Namesys labs in collaboration with Moscow State University and Program Systems Institute of the Russian Academy of Sciences. For historical reasons Reiser4 currently works only for Linux OS. However, it can be easily ported to any operating system due to its modular infrastructure. = History of Reiser4 = Namesys was created by Hans Reiser approximately in 1993 from a number of last graduates trained in the format of the old education system of Soviet Union (USSR). At the beginning of the new century Namesys engineers had accumulated a number of innovative ideas in the area of data storage software systems. However, it was rather problematic to implement them in the context of existing at that moment ReiserFS (v3), mostly because of design problems. On the other hand, ReiserFS (v3) had a number of shortcomings, which were hard to fix for the same reasons. 
So, it was made a decision to develop from scratch a new file system, which was supposed to absorb the experience of previous developments. In 2002 Namesys got a grant from DARPA for this. Reiser4 development was also sponsored by Linspire. However, in commercial terms Namesys activity was not successful, and eventually this had led to financial problems. Since the arrest of Hans Reiser (in October 2006) Reiser4 has been maintained by a former Namesys employee, mathematician, programmer, Ph.D Edward Shishkin. Currently development continues on a non-commercial base. In this development mode Reiser4 acquired stability and a number of many new features like modules for transparent compression (announced in 2007), different transaction models (Journaling, Write-Anywhere (COW), Hybrid transaction model), precise asynchronous discard support for SSD drives, metadata and inline data checksums, failover, etc. A new Reiser4 disk format version (4.0.1) were released (*). = Reiser4 and upstream = In contrast with its predecessor (ReiserFS, v3), Reiser4 was not accepted to the upstream Linux kernel because of political reasons. Later Edward Shishkin expressed an interest (*) to port Reiser4 to other operating systems, in particular, to FreeBSD, which is, according to his standpoint, "more open to academic researches". In this case it would be illogical to expect Reiser4 to be tightly integrated with some specified operating system. Thus, Reiser4 is developed as a standalone independent project (*). The archive of the patches for upstream Linux kernels can be found at the project's sites on GitHub and Sourceforge (*). = Efficiency of disk space usage = Reiser4 provides the most efficient disk space usage among all file systems in all scenarios and workloads. In particular, on Compilebench ((c) Oracle) Reiser4 shows disk space efficiency 50 times (5000%) better than ext4, and 12 times (1200%) better than Btrfs with compression. 
The problem of internal fragmentation in Reiser4 is completely resolved by using a special technique of liquid records (or virtual keys). It means that for any fraction Q < 1 every keyed record in the tree can be split into 2 parts in the proportion Q with a possibility to quickly allocate unique keys for both parts. Such split (as well as merge) is performed by plugins of ITEM interface when packing data and metadata into tree nodes at flush time (just before writing to disk). = Reiser4 structure. Plugins. Heterogeneity in time and in space = File system as a complicated subsystem of modern OS is most subject to the problem of creeping featurism caused by the progress in hardware and software technologies. To resist this problem Namesys engineers made a decision to develop not simply a file system, but the whole software framework providing reusable environment. Reiser4 has two different code bases for kernel module and user-space utilities. All ones have modular infrastructure. It means that every calculation in the context of the file system (or user-space utility) looks like execution of a module of some interface. Resier4 kernel module has an abstract base, which is a direct acyclic graph (DAG) of interfaces. Every vertex of that graph represents an interface, and every directed edge of that graph represents a client-supplier relationship between interfaces. All top interfaces of that graph are suppliers for VFS. Every interface of that graph is implemented by one or more modules. At file system creation time user can specify those modules depending on the types of workload and storage media. For some interfaces there is a possibility to switch to another module at any moment (usually such switching is accompanied with a respective conversion of run-time objects). Finally, for some interfaces reiser4 performs such switching in intelligent manner (without user intervention). Thus, file system is in permanent evolution, adapting to current conditions. 
For historical reasons Reiser4 modules are called plugins. Heterogeneity means an ability to choose different modules of the same interface to manage different objects of the same type. For example, bodies of small and large files in Reiser4 are managed by different plugins of ITEM interface (bodies of small files are packed to formatted blocks, whereas bodies of large files are stored as a number of unformatted blocks (extents)). Another example is compressed and non-compressed files, which are managed by different plugins of FILE interface. Heterogeneity in time means an ability to switch to different managing plugin for an object (thus, we can switch to a plugin, which is more preferable at some moment). Heterogeneity in space means an ability to assign different plugins to different components of the same compound object (e.g. logical volume). In addition, modular design allows to safely add various features, whose emergence is caused by continuous development of hardware storage technologies. = Atomicity of operations = Atomicity means that filesystem operations either entirely occur, or they entirely don't, and they don't corrupt due to half occurring. All operations in Reiser4 except long writes are atomic. In the case of long writes Reiser4 is forced to close transactions to free dirty pages in a response to memory pressure notifications and reopen them for the rest of user's data. So, long writes in Reiser4 are split into a number of atomic writes. Maximal length of atomic write depends on the file plugin. Edward Shishkin suggested a design of full atomicity (atomic writes of any length) in Write-Anywhere transaction model, where atom can be flushed without closing a transaction. = Different transaction models = Reiser4 offers different transaction models, so at mount time user can choose a one, which is more suitable for his type of storage media and workload. 
The Journaling transaction model is recommended for HDD devices, as it doesn't lead to avalanche-like external fragmentation, which results in performance degradation on rotating storage media. The Write-Anywhere transaction model is recommended for SSD devices, which are not sensitive to external fragmentation. In this transaction model the number of IO requests issued by the file system is minimal (it doesn't write to a journal and then overwrite blocks on disk), which is also important for SSD devices. Reiser4 also offers a unique "hybrid" transaction model, recommended for HDD users who don't perform a huge number of random overwrites. In this transaction model one part of the atom's dirty pages (the overwrite set) is committed via the journal, and the other part (the relocate set) is written to a different location on disk. All other file systems offer only one hardcoded transaction model: either journaling only (ext4, xfs, jfs, etc.) or write-anywhere only, AKA "copy-on-write" (ZFS, Btrfs, etc.). = Three-level block allocator = The first (lowest) level implements a map of free space (currently Reiser4 supports only a bitmap, but it could also be implemented as a tree of extents). The second level implements a transaction model: on this level the block allocator decides whether a dirty page will be written to its old place or will get a new location on disk (this tight relationship between block allocation and transaction models was revealed at the Namesys labs and implemented in Reiser4). The third (highest) level implements allocation policies in two contexts (forward and reverse) for the whole locality of a specified node. = Off-line file system check = Any corrupted Reiser4 volume can be repaired off-line by a special user-space utility, fsck.reiser4, which is a part of reiser4progs. Fsck.reiser4 performs 3 passes.
At the first pass it checks the integrity of the basic data structure (the tree). At the second pass, fsck scans the twig level and checks the integrity of the extent regions (contiguous regions on disk where the bodies of large files are stored). At the third (semantic) pass, fsck scans the leaf level and checks the integrity of "semantic" objects (directories, regular files, symlinks, etc.). Fsck.reiser4 absorbed the development experience of its predecessor, reiserfsck. In particular, fsck.reiser4 is free from a shortcoming inherent to reiserfsck, whose rebuilding process gets confused by ReiserFS (v3) images stored in the volume being repaired. = Precise Discard support for SSD drives = In contrast with other file systems, Reiser4 doesn't simply inform the block layer about extents being freed on disk. For every such extent, Reiser4 checks whether the head and tail of the respective erase units are free in the map of disk free space. If so, Reiser4 issues discard requests for the larger regions. This policy doesn't lead to accumulation of "not discarded garbage" on disk, and hence there is no need to periodically run tools like fstrim, which scan the disk and issue discard requests for such "garbage". In Reiser4, issuing discard requests is a delayed action, performed on a per-atom basis at transaction commit time. This reduces the number of discard commands (because extents that need to be discarded get merged).
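The head/tail check described above can be sketched as follows. The erase-unit size, the function names, and the way the free-space map is consulted are assumptions for illustration, not Reiser4's real code:

```python
ERASE_UNIT = 8  # blocks per erase unit (illustrative value)

def discard_region(start, end, is_free):
    """Widen the freed extent [start, end) to erase-unit boundaries (sketch).

    If every block in the partially covered head/tail erase unit is free
    according to the free-space map (is_free), the discard can cover the
    whole unit; otherwise that side is trimmed to the nearest unit
    boundary inside the extent, since discards only make sense on whole
    erase units. Returns the (head, tail) block range to discard, or
    None if no whole erase unit is covered.
    """
    head = (start // ERASE_UNIT) * ERASE_UNIT      # round start down
    tail = -(-end // ERASE_UNIT) * ERASE_UNIT      # round end up
    if not all(is_free(b) for b in range(head, start)):
        head = -(-start // ERASE_UNIT) * ERASE_UNIT  # trim: round start up
    if not all(is_free(b) for b in range(end, tail)):
        tail = (end // ERASE_UNIT) * ERASE_UNIT      # trim: round end down
    return (head, tail) if head < tail else None
```

The point is that the discard request is widened to cover adjacent free space whenever possible, so that no "not discarded garbage" accumulates between freed extents.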
= Metadata and inline-data checksums (not stable stuff) = = Software Framework, Development model and compatibility = {{wayback|http://www.namesys.com/X0reiserfs.html|2006-11-13}} Three reasons why ReiserFS is great for you Last Update: 2002 Hans Reiser Three reasons why ReiserFS is great for you: # ReiserFS has fast journaling, which means that you don't spend your life waiting for fsck every time your laptop battery dies, or the UPS for your mission-critical server gets its batteries disconnected accidentally by the UPS company's service crew, or your kernel was not as ready for prime time as you hoped, or the silly thing decides you mounted it too many times today. # ReiserFS is based on fast balanced trees. Balanced trees are more robust in their performance, and are a more sophisticated algorithmic foundation for a file system. When we started our project, there was a consensus in the industry that balanced trees were too slow for file system usage patterns. We proved that if you just do them right they are better--take a look at the benchmarks. We have fewer worst-case performance scenarios than other file systems and generally better overall performance. If you put 100,000 files in one directory, we think it's fine; many other file systems try to tell you that you are wrong to want to do it. # ReiserFS is more space efficient. If you write 100-byte files, we pack many of them into one block. Other file systems put each of them into their own block. We don't have fixed space allocation for inodes. That saves 6% of your disk. Ok, it's time to fess up. The interesting stuff is still in the future. Because they are nifty, we are going to add database and hypertext like features into the file system.
Only by using balanced trees, with their effective handling of small files (database small fields, hypertext keywords), as our technical foundation can we hope to do this. That was our real motivation. As for performance, we may already be slightly better than the traditional file systems (and substantially better than the journaling ones). But they have been tweaking for decades, while we have just got started. This means that over the next few years we are going to improve faster than they are. Speaking more technically: ReiserFS is a file system using a plug-in based object oriented variant on classical balanced tree algorithms. The results when compared to the ext2fs conventional block allocation based file system, running under the same operating system and employing the same buffering code, suggest that these algorithms are overall more efficient and every passing month are becoming yet more so. Loosely speaking, every month we find another performance cranny that needs work; we fix it. And every month we find some way of improving our overall general usage performance. The improvement in small file space and time performance suggests that we may now revisit a common OS design assumption that one should aggregate small objects using layers above the file system layer. Being more effective at small files does not make us less effective for other files. This is truly a general purpose FS. Our overall traditional FS usage performance is high enough to establish that. ReiserFS has a commitment to opening up the FS design to contributions; we are now adding plug-ins so that you can create your own types of directories and files. = Introduction = The author is one of many OS researchers who are attempting to unify the name spaces in the operating system in varying ways (e.g. [http://plan9.bell-labs.com/sys/doc/names.html Pike, The Use of Name Spaces in Plan9]). 
None of us are well funded compared with the size of the task, and I am far from an exception to this rule. The natural consequence is that we each have attacked one small aspect of the task. My contribution is in incorporating small objects into the file system name space effectively. This implementation offers value to the average Linux user, in that it offers generally good performance compared to the current Linux file system known as ext2fs. It also saves space to an extent that is important for some applications, and convenient for most. It does extremely well for large directories, and has a variety of minor advantages. Since ext2fs is very similar to FFS and UFS in performance, the implementation also offers potential value to commercial OS vendors who desire greater than ext2fs performance without directory size issues, and who appreciate the value of a better foundation for integrating name spaces throughout the OS. = Why Is There A Move Among Some OS Designers Towards Unifying Name Spaces? = An operating system is composed of components that access other components through interfaces. Operating systems are complex enough that, like national economies, the architect cannot centrally plan the interactions of the components that it is composed of. The architect can provide a structural framework that has a marked impact on the efficiency and utility of those interactions. Economists have developed principles that govern large economic systems. Are there system principles that we might use to start a discussion of the ways increasing component interactivity via naming system design impacts the total utility of an operating system? I propose these: * If one increases the number of other components that a particular component can interact with, one increases its expressive power and thereby its utility.
* One can increase the number of other components that a particular component can interact with either by increasing the number of interfaces it has, or by increasing the number of components that are accessible by its current interfaces. * The cost of component interfaces dominates software design cost, just as the cost of wires dominates circuit design cost. * Total system utility tends to be proportional not to the number of components, but to the number of possible component interactions. It is not simply the number of components that one has that determines an OS's expressive power, it is the number of opportunities to use them that determines it. The number of these opportunities is proportional to the number of possible combinations of them, and the number of possible combinations of them is determined by their connectedness. Component connectedness in OS design is determined by name space design, to much the same extent that buses determine it in circuit design. Allow me to illustrate the impact of these principles with the use of an imaginary example. Suppose two imaginary OS vendors with equally talented programmers hire two different OS architects. Suppose one of the architects centers the OS design around a single name space design that allows all of the components to access all other components via a single interface (assume this is possible, it is a theoretical example). Suppose the other allows the ten different design groups in the company that are developing components to create their own ten name spaces. Suppose that the unified name space OS architect has half of the resources of the fragmented name space OS architect and creates half as many components. While the number of components is half as large, the number of connections is (1/2)^2 / ((1/10)^2 * 10) = 2.5 times larger.
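Under the stated simplifying assumption that possible interactions grow as the square of the number of mutually reachable components, the comparison above can be checked numerically:

```python
def connections(n_components):
    # Simplifying assumption from the text: possible interactions
    # grow as the square of the number of mutually reachable components.
    return n_components ** 2

N = 100                                   # arbitrary total component budget
unified = connections(N // 2)             # one name space, half the components
fragmented = 10 * connections(N // 10)    # ten isolated name spaces
ratio = unified / fragmented              # -> 2.5
```

With N = 100: the unified design has 50^2 = 2500 possible connections, while ten isolated name spaces of 10 components each have only 10 * 10^2 = 1000, giving the 2.5x figure in the text.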
If you accept my hypothesis that utility is more proportional to connections than components, then the unified operating system with half the development cost will still offer more expressive utility. That is a powerful motivation. To return briefly to the long ago researched principles governing another member of the class of large systems, the economies of nations, it is perhaps interesting to note that Adam Smith in [http://en.wikisource.org/wiki/The_Wealth_of_Nations "The Wealth of Nations"] engaged in substantial study of the link between the extent of interconnectedness and the development of civilization, where the extent of interconnectedness was determined by waterways, etc. The link he found for economic systems was no less crucial than what is being suggested here for the effect of component interconnectedness on the total utility of software systems. I suggest that I am merely generalizing a long established principle from another field of science, namely that total utility in large systems with components that interact to generate utility is determined by the extent of their interconnection. There are many exceptions to these principles: not all chips on a motherboard sit on the bus, and analogous considerations apply to both OS design and the economies of nations. I hope the reader will accept that space considerations make it appropriate to gloss over these, and will consider the central point that under some circumstances unifying name spaces in a design can dramatically improve the utility of an OS. That can be an enormous motivation, and it has moved a number of OS researchers in their work (e.g. [http://plan9.bell-labs.com/sys/doc/names.html "The Use of Name Spaces in Plan9", Rob Pike] and [http://pdos.csail.mit.edu/~rsc/pike85hideous.pdf "The Hideous Name", Rob Pike and P.J. Weinberger]). Unfortunately, it is not a small technical effort to combine name spaces. 
To combine 10 name spaces requires, if not the effort to create 10 name spaces, perhaps an effort equivalent to creating 5 of the name spaces. Usually each of the name spaces has particular performance and semantic power requirements that require enhancing the unified name space, and it usually requires technical innovation to combine the advantages of each of the separated name spaces into a unified name space. I would characterize none of the research groups currently approaching this unification problem as having funding equivalent to what went into creating 5 of the name spaces they would like to unify, and we are certainly no exception. For this reason I have picked one particular aspect of this larger problem for our focus: allowing small objects to effectively share the same file system interface that large objects use currently. As operating systems increase the number of their components, the higher development cost of a file system able to handle small files becomes more worth the multiplicative effect it has on OS utility, as well as its reduction of OS component interface cost. = Should File Boundaries Be Block Aligned? = Making file boundaries block aligned has a number of effects: it minimizes the number of blocks a file is spread across (which is especially beneficial for multiple block files when locality of reference across files is poor), it wastes disk and buffer space in storing every less than fully packed block, it wastes I/O bandwidth with every access to a less than fully packed block when locality of reference is present, it increases the average number of block fetches required to access every file in a directory, and it results in simpler code. The simpler code of block aligning file systems follows from not needing to create a layering to distinguish the units of the disk controller and buffering algorithms from the units of space allocation, and from not needing to optimize the packing of nodes as is done in balanced tree algorithms. 
For readers who have not been involved in balanced tree implementations, algorithms of this class are notorious for being much more work to implement than one would expect from their description. Sadly, they also appear to offer the highest performance solution for small files, once I remove certain simplifications from their implementation and add certain optimizations common to file system designs. I regret that code complexity (30k lines) is a major disadvantage of the approach compared to the 6k lines of the ext2fs approach. I started our analysis of the problem with an assumption that I needed to aggregate small files in some way, and that the question was, which solution was optimal? The simplest solution was to aggregate all small files in a directory together into either a file or the directory. But any aggregation into a file or directory wastes part of the last block in the aggregation. What does one do if there are only a few small files in a directory, aggregate them into the parent of the directory? What if there are only a few small files in a directory at first, and then there are many small files: how do I decide what level to aggregate them at, and when to take them back from a parent of a directory and store them directly in the directory? As we did our analysis of these questions we realized that this problem was closely related to the balancing of nodes in a balanced tree. The balanced tree approach, by using an ordering of files which are then dynamically aggregated into nodes at a lower level, rather than a static aggregation or grouping, avoids this set of questions. In my approach I store both files and filenames in a balanced tree, with small files, directory entries, inodes, and the tail ends of large files all being more efficiently packed as a result of relaxing the requirements of block alignment, and eliminating the use of a fixed space allocation for inodes.
I have a sophisticated and flexible means for arranging for the aggregation of files for maximal locality of reference, through defining the ordering of items in the tree. The body of large files is stored in unformatted nodes that are attached to the tree but isolated from the effects of possible shifting by the balancing algorithms. Approaches such as [Apple] and [Holton and Das] have stored filenames but not files in balanced trees. None of the file systems C-FFS, NTFS, or XFS aggregate files; all of them block align files, though all of those also do some variation on storing small files in the statically allocated block address fields of inodes if they are small enough to fit there. [C-FFS] has published an excellent discussion of both their approach and why small files rob a conventional file system of performance more in proportion to the number of small files than the number of bytes consumed by small files. However, I must note that their notion of what constitutes small is different from ours by one or two orders of magnitude. Their use of an exo-kernel is simply an excellent approach for operating systems that have that as an available option. Semantics (files), packing (blocks/nodes), caching (read-ahead sizes, etc.), and the hardware interfaces of disk (sectors) and paging (pages) all have different granularity issues associated with them: a central point of our approach is that the optimal granularity of these often differs, and abstracting these into separate layers in which the granularity of one layer does not unintentionally impact other layers can improve space/time performance. Reiserfs innovates in that its semantic layer often conveys to the other layers an ungranulated ordering rather than one granulated by file boundaries. The reader is encouraged to note the areas in which reiserfs needs to go farther in its doing so while reading the algorithms.
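The dynamic aggregation argued for above -- ordering items by key and letting node boundaries fall wherever the packing dictates, instead of block-aligning every file -- can be illustrated with a toy packer; the node capacity and all names are illustrative assumptions, not reiserfs's real balancing code:

```python
NODE_CAPACITY = 4096  # bytes per formatted node (illustrative)

def pack_items(items):
    """Greedily pack key-ordered items into nodes (toy sketch).

    items: list of (key, size_in_bytes) pairs, already sorted by key.
    Returns a list of nodes, each a list of keys. Many small items
    share one node instead of each occupying a block of its own.
    """
    nodes, current, used = [], [], 0
    for key, size in items:
        if used + size > NODE_CAPACITY and current:
            nodes.append(current)     # node full: start a new one
            current, used = [], 0
        current.append(key)
        used += size
    if current:
        nodes.append(current)
    return nodes
```

In this toy model a hundred 100-byte files land in 3 nodes, where a block-aligning file system would spend 100 blocks on them -- the space-efficiency argument of the text in miniature.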
= Balanced Trees and Large File I/O = There has long been an odd informal consensus that balanced trees are too slow for use in storing large files, perhaps originating in the performance of databases that have attempted to emulate file systems using balanced tree algorithms that were not originally architected for file system access patterns or their looser serialization requirements. It is hopefully easy for the reader to understand that storing many small files and tail ends of files in a single node where they can all be fetched in one I/O leads directly to higher performance. Unfortunately, it is quite complex to understand the interplay between I/O efficiency and block size for larger files, and space does not allow a systematic review of traditional approaches. The reader is referred to [FFS], [Peacock], [McVoy], [Holton and Das], [Bach], [OLE], and [NTFS] for treatments of the topic, and discussions of various means of 1) reducing the effect of block size on CPU efficiency, 2) eliminating the need for inserting rotational delay between successive blocks, 3) placing small files into either inodes or directories, and 4) performing read-ahead. More commentary on these is in the annotated bibliography. 
Reiserfs has the following architectural weaknesses that stem directly from the overhead of repacking to save space and increase block size: 1) when the tail (files < 4k are all tail) of a file grows large enough to occupy an entire node by itself it is removed from the formatted node(s) it resides in, and it is converted into an unformatted node ([FFS] pays a similar conversion cost for fragments), 2) a tail that is smaller than one node may be spread across two nodes which requires more I/O to read if locality of reference is poor, 3) aggregating multiple tails into one node introduces separation of file body from tail, which reduces read performance ([FFS] has a similar problem, and for reiserfs files near the node in size the effect can be significant), 4) when you add one byte to a file or tail that is not the last item in a formatted node, then on average half of the whole node is shifted in memory. If any of your applications perform I/O in such a way that they generate many small unbuffered writes, reiserfs will make you pay a higher price for not being able to buffer the I/O. Most applications that create substantial file system load employ effective I/O buffering, often simply as a result of using the I/O functions in the standard C libraries. By avoiding accesses in small blocks/extents reiserfs improves I/O efficiency. Extent based file systems such as VxFS, and write-clustering systems such as ext2fs, are not so effective in applying these techniques that they choose to use 512-byte blocks rather than 1k blocks as their defaults. Ext2fs reports a 20% speedup when 4k rather than 1k blocks are used, but the authors of ext2fs advise the use of 1k blocks to avoid wasting space. There are a number of worthwhile large file optimizations that have not been added to either ext2fs or reiserfs, and both file systems are somewhat primitive in this regard, reiserfs being the more primitive of the two. 
Large files simply were not my research focus, and it being a small research project I did not implement the many well known techniques for enhancing large file I/O. The buffering algorithms are probably more crucial than any other component in large file I/O, and partly out of a desire for a fair comparison of the approaches I have not modified these. I have added no significant optimizations for large files, beyond increasing the block size, that are not found in ext2fs. Except for the size of the blocks, there is not a large inherent difference between: 1) the cost of adding a pointer to an unformatted node to my tree plus writing the node, and 2) adding an address field to an inode plus writing the block. It is likely that except for block size the primary determinants of high performance large file access are orthogonal to the decision of whether to use balanced tree algorithms for small and medium sized files. For large files we get some advantage from not having our tree being more balanced than the tree formed by an inode which points to a triple indirect block. We haven't an easy method for measuring the performance gain from that though. There is performance overhead due to the memory bandwidth cost of balancing nodes for small files. We think it is worth it though. = Serialization and Consistency = The issues of ensuring recoverability with minimal serialization and data displacement necessarily dominate high performance design. Let's define the two extremes in serialization so that the reason for this can be clear. 
Consider the relative speed of a set of I/O's in which every block request in the set is fed to the elevator algorithms of the kernel and the disk drive firmware fully serially, each request awaiting the completion of the previous request. Now consider the other extreme, in which all block requests are fed to the elevator algorithms all together, so that they may all be sorted and performed in close to their sorted order (disk drive firmwares don't use a pure elevator algorithm). The unserialized extreme may be more than an order of magnitude faster, due to the cost of rotations and seeks. Unnecessarily serializing I/O prevents the elevator algorithm from doing its job of placing all of the I/O's in their layout sequence rather than chronological sequence. Most of high performance design centers around making I/O's in the order they are laid out on disk, and laying out blocks on disk in the order that the I/O's will want to be issued. Snyder discusses a file system that obtains high performance from a complete lack of disk synchronization, but is only suitable for temporary files that don't need to survive reboot. I think its known value to Solaris users indicates that the optimal buffering policy varies from file to file. Ganger discusses methods for using ordering of writes rather than serialization for ensuring conventional file system meta-data integrity; [McVoy] previously suggested but did not implement ordering of buffer writes. Ext2fs is fast in substantial part due to avoiding synchronous writes of metadata, and I have much personal experience with it that leads me to prefer compiles that are fast. [ I would like to see it adopt a policy that all dirty buffers for files not flagged as temporary are queued for writing, and that the existence of a dirty buffer means that the disk is busy.
This will require replacing buffer I/O locking with copy-on-write, but an idle disk is such a terrible thing to waste. :-) ] [NTFS] by default adds unnecessary serialization to an extent that even older file systems such as [FFS] do not, and its performance characteristics reflect that. In fairness, it should be said that it is the superior approach for most removable media without software control of ejection (e.g. IBM PC floppies). Reiserfs employs a new scheme called preserve lists for ensuring recoverability, which avoids overwriting old meta-data by writing the new meta-data nearby rather than over the old. = Why Aggregate Small Objects at the File System Level? = There has long been a tradition of file system developers deciding that effective handling of small files is not significant to performance, and of application programmers caring enough about performance to not store small files as separate entities in the file system. To store small objects one may either make the file system efficient for the task, or sidestep the problem by aggregating small objects in a layer above the file system. Sidestepping the problem has three disadvantages: utility, code complexity, and performance. Utility and Code Complexity: Allowing OS designers to effectively use a single namespace with a single interface for both large and small objects decreases coding cost and increases the expressive power of components throughout the OS. I feel reiserfs shows the effects of a larger development investment focused on a simpler interface when compared with many solutions for this currently available in the object oriented toolkit community, such as the Structured Storage available in Microsoft's [OLE]. By simpler I mean I added nothing to the file system API to distinguish large and small objects, and I leave it to the directory semantics and archiving programs to aggregate objects.
Multiple layers cost more to implement, cost more to code the interfaces for utilizing, and provide less flexibility. Performance: It is most commonly the case that when one layers one file system on top of another the performance is substantially reduced, and Structured Storage is not an exception to this general rule. Reiserfs, which does not attempt to delegate the small object problem to a layer above, avoids this performance loss. I have heard it suggested by some that this layering avoids the performance loss from syncing on file close as many file systems do. I suggest that this is adding an error to an error rather than fixing it. Let me make clear that I believe those who write such layers above the file system do not do so out of stupidity. I know of at least one company at which a solution that layers small object storage above the file system exists because the file system developers refused to listen to the non-file system group's description of its needs, and the file system group had to be sidestepped in generating the solution. Current file systems are fairly well designed for the purposes that their users currently use them for: my goal is to change file size usage patterns. The author remembers arguments that once showed clearly that there was no substantial market need for disk drives larger than 10MB based on current usage statistics. While [C-FFS] points out that 80% of file accesses are to files below 10k, I do not believe it reasonable to attempt to provide statistics based on usage measurements of file systems for which small files are inappropriate to use that will show that small files are frequently used. Application programmers are smarter than that. Currently 80% of file accesses are to the first order of magnitude in file size for which it is currently sensible to store the object in the file system. 
I regret that one can only speculate as to whether once file systems become effective for small files and database tasks, usage patterns will change to 80% of file accesses being to files less than 100 bytes. What I can do is show via the 80/20 Banded File Set Benchmark presented later that in such circumstances small file performance potentially dominates total system performance. In summary, the on-going reinvention of incompatible object aggregation techniques above the file system layer is expensive, less expressive, less integrated, slower, and less efficient in its storage than incorporating balanced tree algorithms into the file system. = Tree Definitions = Balanced trees are used in databases, and more generally, wherever a programmer needs to search and store to non-random memory by a key, and has the time to code it this way. The usual evolution for programmers is to first think that hashing will be simpler and more efficient, and then realize only after getting into the sordid details of it that the combination of space efficiency, minimizing disk accesses, and the feasibility of caching the top part of the tree, makes the tree approach more effective. It is the usual thing to first try to do hashing, and then by the time the details are worked out, to have a balanced tree. The cost of effectively handling bucket overflow just isn't less than the cost of balancing, unless the buckets are always all in RAM. Hashing is often a good solution when there is no non-random memory involved, such as when hashing a cache. The Linux dcache code uses hashing for accessing a cache of in-memory directory entries. Sometimes one uses partial or full hashing of keys within that balanced tree. If you do full hashing within a tree, and you cache the top part of that tree, you have something rather similar to extensible hashing, except it is more flexible and efficient. Sometimes programmers code using unbalanced trees. Most filesystems do essentially that.
Balanced trees generally do a better job of minimizing the average number of disk accesses. There is literature establishing that balanced trees are optimal for the worst case when there is no caching of the tree. This is rather pointless literature, as the average case when cached is what is important, and I am afraid that the existing literature proves that which is feasible to prove rather than that which is relevant. That said, practitioners know from experience that making the tree less balanced leads to more I/Os. Discussions of the exceptions to this are rather interesting but not for here.... I regret that I must assume that the reader is familiar with basic balanced tree algorithms [Wood], [Lewis and Denenberg], [Knuth], [McCreight]. No attempt will be made to survey tree design here since balanced trees are one of the most researched and complex topics in algorithm theory, and require treatment at length. I must compound this discourtesy with a concise set of definitions that sorely lack accompanying diagrams, my apologies. Finally, I'll truly annoy the reader by saying that the header files contain nice ascii art, and if you want full definition of the structures, the source is the place. Classically, balanced trees are designed with the set of keys assumed to be defined by the application, and the purpose of the tree design is to optimize searching through those keys. In my approach the purpose of the tree is to optimize the reference locality and space efficient packing of objects, and the keys are defined as best optimizes the algorithm for that. Keys are used in place of inode numbers in the file system, thereby choosing to substitute a mapping of keys to node location (the internal nodes) for a mapping of inode number to file location. Keys are longer than inode numbers, but one needs to cache fewer of them than one would need to cache inode numbers when more than one file is stored in a node. 
In my tree, I still require that a filename be resolved one component at a time. It is an interesting topic for future research whether this is necessary or optimal. This is a more complex issue than a casual reader might realize: directory-at-a-time lookup accomplishes a form of compression, makes mounting other name spaces and file system extensions simpler, makes security simpler, and makes future enhanced semantics simpler. Since small files typically lead to large directories, it is fortuitous that, as a natural consequence of our use of tree algorithms, our directory mechanisms are much more effective for very large directories than those of most other file systems (notable exceptions include [Holton and Das]). The tree has three node types: internal nodes, formatted nodes, and unformatted nodes. The contents of internal and formatted nodes are sorted in the order of their keys. (Unformatted nodes contain no keys.) Internal nodes consist of pointers to sub-trees separated by their delimiting keys. The key that precedes a pointer to a sub-tree is a duplicate of the first key in the first formatted node of that sub-tree. Internal nodes exist solely to allow determining which formatted node contains the item corresponding to a key. ReiserFS starts at the root node, examines its contents, and based on them determines which subtree contains the item corresponding to the desired key. From the root node it descends into the tree, branching at each node, until it reaches the formatted node containing the desired item. The first (bottom) level of the tree consists of unformatted nodes, the second level consists of formatted nodes, and all levels above consist of internal nodes. The highest level contains the root node. The number of levels is increased as needed by adding a new root node at the top of the tree.
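The descent described above can be sketched in C with a simplified in-memory node model. This is an illustration only, not the on-disk format: the struct name, field names, and fixed array sizes are hypothetical; the real code reads nodes from buffers and compares multi-component keys.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified in-memory model of a tree node (names and sizes are
   illustrative, not the on-disk layout described later). */
typedef unsigned int rkey_t;

struct rnode {
    int is_internal;        /* internal node vs. formatted leaf */
    int nr_keys;            /* delimiting keys (internal) or items (leaf) */
    rkey_t keys[8];         /* sorted delimiting keys */
    struct rnode *child[9]; /* nr_keys + 1 children when internal */
};

/* Descend from the root, branching at each internal node, until the
   formatted leaf that must contain the item for `key` is reached.
   Because a delimiting key duplicates the smallest key of the subtree
   to its right, we branch right on key >= keys[i]. */
static struct rnode *search_leaf(struct rnode *root, rkey_t key)
{
    struct rnode *n = root;
    while (n->is_internal) {
        int i = 0;
        while (i < n->nr_keys && key >= n->keys[i])
            i++;
        n = n->child[i];
    }
    return n;
}
```

The loop body is the whole search: internal nodes exist solely to route the lookup, so no item data is examined until the leaf is reached.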
All paths from the root of the tree to all formatted leaves are equal in length, and all paths to all unformatted leaves are also equal in length and one node longer than the paths to the formatted leaves. This equality in path length, and the high fanout it provides, is vital to high performance, and in the Drops section I will describe how the lengthening of the path that occurred as a result of introducing the [BLOB] approach (the use of indirect items and unformatted nodes) proved a measurable mistake. Formatted nodes consist of items. Items have four types: direct items, indirect items, directory items, and stat data items. All items contain a key which is unique to the item; this key is used to sort, and find, the item. Direct items contain the tails of files, a tail being the last part of the file (the last file_size modulo FS block size bytes of the file). Indirect items consist of pointers to unformatted nodes; all but the tail of the file is contained in its unformatted nodes. Directory items contain the key of the first directory entry in the item followed by a number of directory entries. Depending on the configuration of reiserfs, stat data may be stored as a separate item, or it may be embedded in a directory entry; we are still benchmarking to determine which way is best. A file consists of a set of indirect items followed by a set of up to two direct items, the existence of two direct items representing the case when a tail is split across two nodes. If a tail is larger than the maximum size that can fit into a formatted node but is smaller than the unformatted node size (4k), then it is stored in an unformatted node, and a pointer to it plus a count of the space used is stored in an indirect item. Directories consist of a set of directory items. Directory items consist of a set of directory entries. Directory entries contain the filename and the key of the file which is named.
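The tail rules above amount to simple layout arithmetic. The sketch below is an illustration under stated assumptions: the 112-byte figure comes from the MAX_DIRECT_ITEM_LEN = blocksize - 112 definition given later in the structures section, and the function and type names are hypothetical.

```c
#include <assert.h>

/* Illustrative layout arithmetic for a file, following the tail rules
   in the text.  Assumes MAX_DIRECT_ITEM_LEN = blocksize - 112. */
enum tail_placement { TAIL_NONE, TAIL_DIRECT, TAIL_UNFORMATTED };

struct layout {
    unsigned long unformatted_nodes; /* full blocks of file body */
    unsigned long tail_bytes;        /* file_size modulo blocksize */
    enum tail_placement tail;
};

static struct layout file_layout(unsigned long file_size,
                                 unsigned long blocksize)
{
    unsigned long max_direct = blocksize - 112; /* MAX_DIRECT_ITEM_LEN */
    struct layout l;
    l.unformatted_nodes = file_size / blocksize;
    l.tail_bytes = file_size % blocksize;
    if (l.tail_bytes == 0)
        l.tail = TAIL_NONE;
    else if (l.tail_bytes <= max_direct)
        l.tail = TAIL_DIRECT;      /* tail fits in a formatted node */
    else
        l.tail = TAIL_UNFORMATTED; /* too big for a formatted node,
                                      smaller than 4k: goes in an
                                      unformatted node via an indirect
                                      item */
    return l;
}
```

For example, a 13000-byte file on a 4k-block system gets three unformatted nodes plus a 712-byte tail stored as a direct item.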
There is never more than one item of the same item type from the same object stored in a single node (there is no reason one would want to use two separate items rather than combining them). The first item of a file or directory contains its stat data. When performing balancing, and analyzing the packing of a node and its two neighbors, we ensure that the three nodes cannot be compressed into two nodes. I feel that greater compression than this is best left to an FS cleaner rather than attempted dynamically.

== ReiserFS structures ==

The ReiserFS tree has Max_Height = N (the current default is N = 5). The tree lives in disk blocks, and each disk block that belongs to the reiserfs tree begins with a block head. There are three kinds of blocks:

* Internal node: the place for keys and pointers to disk blocks
* Leaf node: the place for items and item headers
* Unformatted node: the place for the data of a big file

=== ReiserFS objects: Files, Directories ===

Max number of objects = 2^32 - 4 = 4,294,967,292

Each object is a number of items:

* File items:
*# StatData item + [Direct item] (for small files: size from 0 bytes to MAX_DIRECT_ITEM_LEN = blocksize - 112 bytes)
*# StatData item + InDirect item + [Direct item] (for big files: size > MAX_DIRECT_ITEM_LEN bytes)
* Directory items:
*# StatData item
*# Directory item

Every reiserfs object has an object ID and a key.

=== Internal Node structures ===

An internal node is the place for keys and pointers to disk blocks:

<pre>
Block_Head | Key 0 | Key 1 | Key 2 | --- | Key N | Pointer 0 | Pointer 1 | Pointer 2 | --- | Pointer N | Pointer N+1 | ..Free Space..

struct block_head
  Field Name           Type            Size (bytes)  Description
  blk_level            unsigned short  2             level of the block in the tree (1 = leaf; 2, 3, 4, ... = internal)
  blk_nr_item          unsigned short  2             number of keys in an internal block, or number of items in a leaf block
  blk_free_space       unsigned short  2             free space of the block, in bytes
  blk_right_delim_key  struct key      16            right delimiting key for this block (leaf nodes only)
  total: 6 bytes (stored in 8) for internal nodes; 22 bytes (stored in 24) for leaf nodes

struct key
  Field Name    Type   Size (bytes)  Description
  k_dir_id      __u32  4             ID of the parent directory
  k_object_id   __u32  4             ID of the object (also the inode number)
  k_offset      __u32  4             offset from the beginning of the object to the current byte of the object
  k_uniqueness  __u32  4             type of the item (StatData = 0, Direct = -1, InDirect = -2, Directory = 500)
  total: 16 bytes

struct disk_child (pointer to a disk block)
  Field Name       Type            Size (bytes)  Description
  dc_block_number  unsigned long   4             disk child's block number
  dc_size          unsigned short  2             disk child's used space
  total: 6 bytes (stored in 8)
</pre>

=== Leaf Node structures ===

A leaf node is the place for items and item headers:

<pre>
Block_Head | IHead 0 | IHead 1 | IHead 2 | --- | IHead N | ...Free Space... | Item N | --- | Item 2 | Item 1 | Item 0

struct block_head
  (as for internal nodes above, including the right delimiting key: total 22 bytes, stored in 24)

Everything in the filesystem is stored as a set of items. Each item has its
item_head. The item_head contains the key of the item, its free space (for
indirect items), and the location of the item itself within the block.

struct item_head (IHead)
  Field Name                          Type        Size (bytes)  Description
  ih_key                              struct key  16            key used to search for the item; all item
                                                                headers are sorted by this key
  u.ih_free_space / u.ih_entry_count  __u16       2             free space in the last unformatted node for an
                                                                InDirect item; 0xFFFF for a Direct or StatData
                                                                item; the number of directory entries for a
                                                                Directory item
  ih_item_len                         __u16       2             total size of the item body
  ih_item_location                    __u16       2             offset of the item body within the block
  ih_reserved                         __u16       2             used by reiserfsck
  total: 24 bytes

There are 4 types of items: stat data items, directory items, indirect items,
and direct items.

struct stat_data (the reiserfs version of a UFS disk inode, minus the address blocks)
  Field Name            Type   Size (bytes)  Description
  sd_mode               __u16  2             file type, permissions
  sd_nlink              __u16  2             number of hard links
  sd_uid                __u16  2             owner id
  sd_gid                __u16  2             group id
  sd_size               __u32  4             file size
  sd_atime              __u32  4             time of last access
  sd_mtime              __u32  4             time the file was last modified
  sd_ctime              __u32  4             time the inode (stat data) was last changed
                                            (except changes to sd_atime and sd_mtime)
  sd_rdev               __u32  4             device
  sd_first_direct_byte  __u32  4             offset from the beginning of the file to the first
                                            byte of the file's direct item:
                                              -1 for directories
                                               1 for small files (the file has direct items only)
                                              >1 for big files (the file has indirect and direct items)
                                              -1 for big files (the file has indirect, but no direct, items)
  total: 32 bytes

Directory item : deHead 0 | deHead 1 | deHead 2 | --- | deHead N | fileName N | --- | fileName 2 | fileName 1 | fileName 0
Direct item    : ........................Small File Body............................
InDirect item  : unfPointer 0 | unfPointer 1 | unfPointer 2 | --- | unfPointer N

unfPointer - a pointer to an unformatted block (unfPointer size = 4 bytes).
Unformatted blocks contain the body of a big file.
struct reiserfs_de_head (deHead)
  Field Name    Type   Size (bytes)  Description
  deh_offset    __u32  4             third component of the directory entry key
                                     (all reiserfs_de_heads are sorted by this value)
  deh_dir_id    __u32  4             objectid of the parent directory of the object
                                     referenced by the directory entry
  deh_objectid  __u32  4             objectid of the object referenced by the directory entry
  deh_location  __u16  2             offset of the name within the whole item
  deh_state     __u16  2             flags: 1) entry contains stat data (for the future);
                                            2) entry is hidden (unlinked)
  total: 16 bytes

fileName - the name of the file (an array of bytes of variable length).
Max length of a file name = blocksize - 64 (for a 4kb blocksize, max name length = 4032 bytes).
</pre>

= Using the Tree to Optimize Layout of Files =

There are four levels at which layout optimization is performed:
# the mapping of logical block numbers to physical locations on disk,
# the assigning of nodes to logical block numbers,
# the ordering of objects within the tree, and
# the balancing of the objects across the nodes they are packed into.

== Physical Layout ==

For SCSI drives this mapping of logical block numbers to physical locations is performed by the disk drive manufacturer; for IDE drives it is done by the device driver; and for all drives it is also potentially done by volume management software. The mapping by the drive manufacturer is usually done using cylinders. I agree with the authors of [ext2fs] and most others that the significant file placement feature of FFS was not the use of actual cylinder boundaries, but the placing of files and their inodes on the basis of their parent directory's location. FFS used explicit knowledge of actual cylinder boundaries in its design. I find that minimizing the distance in logical blocks between semantically adjacent nodes, without tracking cylinder boundaries, accomplishes an excellent approximation of optimizing according to actual cylinder boundaries, and I find its simplicity an aid to implementation elegance.
== Node Layout ==

When I place nodes of the tree on the disk, I search for the first empty block in the bitmap (of used block numbers), starting at the location of the left neighbor of the node in the tree ordering and moving in the direction I last moved in. This was experimentally found to be better than the following alternatives for the benchmarks employed: 1) taking the first non-zero entry in the bitmap; 2) taking the entry after the last one that was assigned, in the direction last moved in (this was 3% faster for writes and 10-20% slower for subsequent reads); 3) starting at the left neighbor and moving in the direction of the right neighbor. When changing block numbers for the purpose of avoiding overwriting sending nodes before shifted items reach disk in their new recipient node (see the description of preserve lists later in the paper), the benchmarks employed were ~10% faster when starting the search from the left neighbor rather than from the node's current block number, even though it adds significant overhead to determine the left neighbor (the current implementation risks I/O to read the parent of the left neighbor). It used to be that we would reverse direction when we reached the end of the disk drive. Fortunately we checked whether it makes a difference which direction one moves in when allocating blocks to a file, and indeed we found that it made a significant difference to always allocate in the increasing block number direction. We hypothesize that this is due to allocating in increasing block numbers matching the disk spin direction.
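A minimal sketch of this free-block search, under simplifying assumptions: the bitmap here is one byte per block rather than one bit, the function name is hypothetical, and on reaching the end of the disk it wraps to block 0 (continuing in the increasing direction) rather than reversing, per the finding above.

```c
#include <assert.h>

/* Sketch of the node-layout policy described above: start the bitmap
   scan at the tree-order left neighbor's block and move only in the
   increasing block number direction, wrapping at the end of the disk
   instead of reversing.  One byte per block for simplicity; a real
   implementation packs bits. */
static long find_free_block(const unsigned char *bitmap, long nr_blocks,
                            long left_neighbor_blocknr)
{
    /* first free block at or after the left neighbor */
    for (long b = left_neighbor_blocknr; b < nr_blocks; b++)
        if (!bitmap[b])
            return b;
    /* wrap: still scanning in the increasing direction from 0 */
    for (long b = 0; b < left_neighbor_blocknr; b++)
        if (!bitmap[b])
            return b;
    return -1; /* disk full: the caller must free space */
}
```

Starting near the left neighbor is what keeps semantically adjacent nodes close together in logical block numbers, which the Physical Layout section argues approximates cylinder-aware placement.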
Some key definition decisions depend very much on usage patterns, and this means that someday one will select from several key definitions when creating the file system. For example, consider the decision of whether to pack all directory entries together at the front of the file system, or to pack the entries near the files they name. For large file usage patterns one should pack all directory items together, since systems with such usage patterns are effective in caching the entries for all directories. For small files the name should be near the file. Similarly, for large files the stat data should be stored separately from the body, either with the other stat data from the same directory, or with the directory entry. (It was likely a mistake for me not to assign stat data its own key in the current implementation, as packing it in with direct and indirect items complicates our code for handling those items, and prevents me from easily experimenting with the effects of changing its key assignment.) It is not necessary for a file's packing to reflect its name; that is merely my default. My next release will offer the option of overriding the default for each file by use of a system call. It is feasible to pack an object completely independently of its semantics using these algorithms, and I predict that there will be many applications, perhaps even most, for which a packing different from that determined by object names is more appropriate. Currently the mandatory tying of packing locality to semantics distorts both semantics and packing from what might otherwise be their independent optimums, much as tying block boundaries to file boundaries distorts I/O and space allocation algorithms from their separate optimums. For example, placing most files accessed while booting at the start of the disk, in their access order, is a very tempting future optimization that the use of packing localities makes feasible to consider.
The structure of a key: each file item has a key with the structure <locality_id, object_id, offset, uniqueness>. The locality_id is by default the object_id of the parent directory. The object_id is the unique id of the file, and is set to the first unused objectid when the object is created. The tendency of this to result in successive object creations in a directory being adjacently packed is fortuitous for many usage patterns. For files, the offset is the offset within the logical object of the first byte of the item. In version 0.2 all directory entries had their own individual keys stored with them and were each distinct items; in the current version I store one key in the item, the key of the first entry, and compute each entry's key as needed from it. For directories, the offset key component is the first four bytes of the filename, which you may think of as the lexicographic rather than numeric offset. For directory items the uniqueness field differentiates filename entries identical in the first four bytes. For all item types it indicates the item type, and for the leftmost item in a buffer it indicates whether the preceding item in the tree is of the same type and object as this item. Placing this information in the key is useful when analyzing balancing conditions, but it increases key length for non-directory items, and is a questionable architectural feature. Every file has a unique objectid, but this cannot be used for finding the object; only keys are used for that. Objectids merely ensure that keys are unique. If you never use the reiserfs features that change an object's key, then the key is immutable; otherwise it is mutable. (This feature aids support for NFS daemons, etc.)
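The tree-ordering consequences of this key structure follow directly from comparing the four components in order. A minimal sketch (the struct and function names are hypothetical; field widths follow the struct key table given earlier):

```c
#include <assert.h>

/* The four-component key from the text.  Comparing component-wise in
   this order is what packs all items of a directory (same locality_id,
   i.e. parent directory objectid) adjacently in the tree. */
struct rkey {
    unsigned int locality_id; /* by default, parent directory's objectid */
    unsigned int object_id;   /* unique id of the file */
    unsigned int offset;      /* byte offset, or name prefix for dirs */
    unsigned int uniqueness;  /* item type / name tie-breaker */
};

static int key_cmp(const struct rkey *a, const struct rkey *b)
{
    if (a->locality_id != b->locality_id)
        return a->locality_id < b->locality_id ? -1 : 1;
    if (a->object_id != b->object_id)
        return a->object_id < b->object_id ? -1 : 1;
    if (a->offset != b->offset)
        return a->offset < b->offset ? -1 : 1;
    if (a->uniqueness != b->uniqueness)
        return a->uniqueness < b->uniqueness ? -1 : 1;
    return 0;
}
```

Because locality_id is the most significant component, two files in the same directory always sort closer to each other than to any file in a different directory, regardless of their objectids.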
We spent quite some time debating internally whether the use of mutable keys for identifying an object had deleterious long term architectural consequences: in the end I decided it was acceptable iff we require any object recording a key to possess a method for updating its copy of it. This is the architectural price of avoiding caching a map of objectid to location that might have very poor locality of reference due to objectids not changing with object semantics. I pack an object with the packing locality of the directory it was first created in unless the key is explicitly changed. It remains packed there even if it is unlinked from the directory. I do not move it from the locality it was created in without an explicit request, unlike the [C-FFS] approach which stores all multiple link files together and pays the cost of moving them from their original locations when the second link occurs. I think a file linked with multiple directories might as well get at least the locality reference benefits of one of those directories. In summary, this approach 1) places files from the same directory together, 2) places directory entries from the same directory together with each other and with the stat data for the directory. Note that there is no interleaving of objects from different directories in the ordering at all, and that all directory entries from the same directory are contiguous. You'll note that this does not accomplish packing the files of small directories with common parents together, and does not employ the full partial ordering in determining the linear ordering, it merely uses parent directory information. I feel the proper place for employing full tree structure knowledge is in the implementation of an FS cleaner, not in the dynamic algorithms. 
== Node Balancing Optimizations ==

When balancing nodes I do so according to the following ordered priorities:
# minimize the number of nodes used;
# minimize the number of nodes affected by the balancing operation;
# minimize the number of uncached nodes affected by the balancing operation;
# if shifting to another formatted node is necessary, maximize the bytes shifted.

Priority 4 is based on the assumption that the location of an insertion of bytes into the tree is an indication of the likely location of future insertions, and that this policy will on average reduce the number of formatted nodes affected by future balance operations. There are more subtle effects as well: if one randomly places nodes next to each other, and one has a choice between those nodes being mostly moderately efficiently packed or packed to an extreme of either well or poorly packed, one is more likely to be able to combine more of the nodes if one chooses the policy of extremism. Extremism is a virtue in space efficient node packing. The maximal shift policy is not applied to internal nodes, as extremism is not a virtue in time efficient internal node balancing.

=== Drops ===

(These are the difficult design issues in the current version that our next version can do better.) Consider dividing a file or directory into drops, with each drop having a separate key, and no two drops from one file or directory occupying the same node without being compressed into one drop. The key for each drop is set to the key for the object (file or directory) plus the offset of the drop within the object. For directories the offset is lexicographic and by filename; for files it is numeric and in bytes. In the course of several file system versions we have experimented with and implemented solid, liquid, and air drops. Solid drops were never shifted, and drops would only solidify when they occupied the entirety of a formatted node.
Liquid drops are shifted in such a way that any liquid drop which spans a node fully occupies the space in its node. Like a physical liquid it is shiftable but not compressible. Air drops merely meet the balancing condition of the tree. Reiserfs 0.2 implemented solid drops for all but the tail of files. If a file was at least one node in size it would align the start of the file with the start of a node, block aligning the file. This block alignment of the start of multi-drop files was a design error that wasted space: even if the locality of reference is so poor as to make one not want to read parts of semantically adjacent files, if the nodes are near to each other then the cost of reading an extra block is thoroughly dwarfed by the cost of the seek and rotation to reach the first node of the file. As a result the block alignment saves little in time, though it costs significant space for 4-20k files. Reiserfs with block alignment of multi-drop files and no indirect items experienced the following rather interesting behavior that was partially responsible for making it only 88% space efficient for files that averaged 13k (the linux kernel) in size. When the tail of a larger than 4k file was followed in the tree ordering by another file larger than 4k, since the drop before was solid and aligned, and the drop afterwards was solid and aligned, no matter what size the tail was, it occupied an entire node. In the current version we place all but the tail of large files into a level of the tree reserved for full unformatted nodes, and create indirect items in the formatted nodes which point to the unformatted nodes. This is known in the database literature as the [BLOB] approach. This extra level added to the tree comes at the cost of making the tree less balanced (I consider the unformatted nodes pointed to as part of the tree) and increasing the maximal depth of the tree by 1. 
For medium sized files, the use of indirect items increases the cost of caching pointers by mixing data with them. The reduction in fanout often causes the read algorithms to fetch only one node at a time of the file being read more frequently, as one waits to read the uncached indirect item before reading the node with the file data. There are more parents per file read with the use of indirect items than with internal nodes, as a direct result of reduced fanout due to mixing tails and indirect items in the node. The most serious flaw is that these reads of various nodes necessary to the reading of the file have additional rotations and seeks compared to the case with drops. With my initial drop approach they are usually sequential in their disk layout, even the tail, and the internal node parent points to all of them in such a way that all of them that are contained by that parent or another internal node in cache can be requested at once in one sequential read. Non-sequential reads of nodes are more than an order of magnitude more costly than sequential reads, and this single consideration dominates effective read optimization. Unformatted nodes make file system recovery faster and less robust, in that one reads their indirect item rather than them to insert them into the recovered tree, and one cannot read them to confirm that their contents are from the file that an indirect item says they are from. In this they make reiserfs similar to an inode based system without logging. A moderately better solution would have been to have simply eliminated the requirement for placement of the start of multi-node files at the start of nodes, rather than introducing BLOBs, and to have depended on the use of a file system cleaner to optimally pack the 80% of files that don't move frequently using algorithms that move even solid drops. 
Yet that still leaves the problem of formatted nodes not being efficient for mmap() purposes (one must copy them before writing rather than merely modifying their page table entries, and memory bandwidth is expensive even if CPU is cheap). For this reason I have the following plan for the next version. I will have three trees: one tree maps keys to unformatted nodes, one tree maps keys to formatted nodes, and one tree maps keys to directory entries and stat_data. Now it is only natural if you are thinking that this would mean that to read a file, accessing first the directory entry and stat_data, then the unformatted node, then the tail, one must hop long distances across the disk, going first to one tree and then to another. This is indeed why it took me two years to realize it could be made to work. My plan is to interleave the nodes of the three trees according to the following algorithm. Block numbers are assigned to nodes when the nodes are created or preserved, and someday will be assigned when the cleaner runs. The choice of block number is based on first determining what other node it should be placed near, and then finding the nearest free block that can be found in the elevator's current direction. Currently we use the left neighbor of the node in the tree as the node it should be placed near. This is nice and simple. Oh well. Time to create a virtual neighbor layer. The new scheme will continue to first determine the node it should be placed near, and then start the search for an empty block from that spot, but it will use a more complicated determination of what node to place it near. This method will cause all nodes from the same packing locality to be near each other, will cause all directory entries and stat_data to be grouped together within that packing locality, and will interleave formatted and unformatted nodes from the same packing locality.
Pseudo-code is best for describing this:

<pre>
/* For use by reiserfs_get_new_blocknrs when determining where in the
   bitmap to start the search for a free block, and for use by the
   read-ahead algorithm when there are not enough nodes to the right
   and in the same packing locality for packing-locality read-ahead
   purposes. */
get_logical_layout_left_neighbors_blocknr(key of current node)
{
    /* Based on examination of the current node's key and type, find
       the virtual neighbor of that node. */
    if body node
        if first body node of file
            if (node in tail tree whose key is less but is in same packing locality exists)
                return blocknr of such node with largest key
            else
                find node with largest key less than key of current node in stat_data tree
                return its blocknr
        else
            return blocknr of node in body tree with largest key less than key of current node
    else if tail node
        if (node in body tree belonging to same file as first tail of current node exists)
            return its blocknr
        else if (node in tail tree with lesser delimiting key but same packing locality exists)
            return blocknr of such node with largest delimiting key
        else
            return blocknr of node with largest key less than key of current node in stat_data tree
    else /* is stat_data tree node */
        if stat_data node with lesser key from same packing locality exists
            return blocknr of such node with largest key
        else
            /* no node from same packing locality with lesser key exists */
}

/* For use by packing-locality read-ahead. */
get_logical_layout_right_neighbors_blocknr(key of current node)
{
    right-handed version of get_logical_layout_left_neighbors_blocknr logic
}
</pre>

It is my hope that this will improve the caching of pointers to unformatted nodes, and the caching of directory entries and stat_data, by separating them from file bodies to a greater extent. I also hope that it will improve read performance for 1-10k files, and that it will allow us to do this without decreasing space efficiency.
=== Code Complexity ===

I thought it appropriate to mention some of the notable effects of simple design decisions on our implementation's code length. When we changed our balancing algorithms to shift parts of items rather than only whole items, so as to pack nodes tighter, this had an impact on code complexity. Another multiplicative determinant of balancing code complexity was the number of item types: introducing indirect items doubled it, and changing directory items from liquid drops to air drops also increased it. Storing stat data in the first direct or indirect item of the file complicated the code for processing those items more than if I had made stat data its own item type. When one finds oneself with an NxN coding complexity issue, it usually indicates the need for adding a layer of abstraction. The NxN effect of the number of item types on balancing code complexity is an instance of that design principle, and we will address it in the next major rewrite. The balancing code will employ a set of item operations which all item types must support. The balancing code will then invoke those operations without caring to understand any more of the meaning of an item's type than that it determines which item-specific operation handler is called. Adding a new item type, say a compressed item, will then merely require writing a set of item operations for that item, rather than requiring modification of most parts of the balancing code as it does now. We now feel that the function that determines what resources are needed to perform a balancing operation, fix_nodes(), might as well be written to decide what operations will be performed during balancing, since it pretty much has to do so anyway. That way, the function that performs the balancing with the nodes locked, do_balance(), can be gutted of most of its complexity.
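The item-operations abstraction described above is, in C, a per-type table of function pointers that the balancing code dispatches through. The paper only sketches the plan, so everything below is an illustrative mock-up: the struct names, the two operations chosen, and the direct-item handlers are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical illustration of the proposed layer of abstraction:
   balancing code calls through a per-item-type operations table
   instead of switching on the item type in every code path. */
struct item;

struct item_operations {
    int  (*bytes_shiftable)(const struct item *it); /* how much may move */
    void (*shift_left)(struct item *it, int bytes); /* move part of item */
};

struct item {
    const struct item_operations *ops; /* selected by item type */
    int len;                           /* bytes in the item body */
};

/* Handlers for one item type (direct items, here fully shiftable). */
static int  direct_bytes_shiftable(const struct item *it) { return it->len; }
static void direct_shift_left(struct item *it, int bytes) { it->len -= bytes; }

static const struct item_operations direct_ops = {
    direct_bytes_shiftable, direct_shift_left
};

/* The balancing code no longer needs to know which item type this is;
   adding a compressed item type means adding one new operations table. */
static int balance_shift(struct item *it, int wanted)
{
    int n = it->ops->bytes_shiftable(it);
    if (n > wanted)
        n = wanted;
    it->ops->shift_left(it, n);
    return n;
}
```

This is the same design later visible in the reiserfs code base as item_operations; the point here is only that balance_shift contains no per-type branching.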
= Buffering & the Preserve List =

We implemented for version 0.2 of our file system a system of write ordering that tracked all shifting of items in the tree, and ensured that no node that had had an item shifted from it was written before the node that had received the item was written. This is necessary to prevent a system crash from causing the loss of an item that might not have been recently created. This tracking approach worked, and the overhead it imposed was not measurable in our benchmarks. When in the next version we changed to partially shifting items and increased the number of item types, this code grew out of control in its complexity. I decided to replace it with a scheme that was far simpler to code and also more effective in typical usage patterns. This scheme was as follows. If an item is shifted from a node, change the block that its buffer will be written to: change it to the nearest free block to the old block's left neighbor, and rather than freeing the old block, place its number on a "preserve list". (Saying nearest is slightly simplistic, in that the blocknr assignment function moves from the left neighbor in the direction of increasing block numbers.) When a "moment of consistency" is achieved, free all of the blocks on the preserve list. A moment of consistency occurs when there are no nodes in memory into which items have been shifted (this could be made more precise, but then it would be more complex). If disk space runs out, force a moment of consistency to occur. This is sufficient to ensure that the file system is recoverable. Note that during the large file benchmarks the preserve list was freed several times in the middle of the benchmark. The percentage of buffers preserved is small in practice except during deletes, and one can arrange for moments of consistency to occur as frequently as one wants.
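The preserve-list mechanics reduce to two operations. The sketch below models them under simplifying assumptions: a byte-per-block bitmap, a fixed-size list, a plain forward scan from the left neighbor, and hypothetical function names; the real code ties this to buffer and transaction state.

```c
#include <assert.h>

/* Minimal model of the preserve-list scheme described above: a node
   that items were shifted out of is rewritten to a new block, and its
   old block is preserved (not reused) until a moment of consistency. */
#define MAX_PRESERVED 1024

static long preserve_list[MAX_PRESERVED];
static int  nr_preserved;

/* Called when an item is shifted out of the node at old_blocknr.
   Picks a new block near the left neighbor, scanning in the direction
   of increasing block numbers, and preserves the old block. */
static long preserve_and_relocate(unsigned char *bitmap, long nr_blocks,
                                  long old_blocknr, long left_neighbor)
{
    long b;
    for (b = left_neighbor; b < nr_blocks; b++)
        if (!bitmap[b])
            break;
    if (b == nr_blocks || nr_preserved == MAX_PRESERVED)
        return -1; /* would force a moment of consistency */
    bitmap[b] = 1;                               /* new home for the buffer */
    preserve_list[nr_preserved++] = old_blocknr; /* do NOT free the old block yet */
    return b;
}

/* Called at a moment of consistency, when no nodes that items were
   shifted into remain unwritten: every preserved block is freed. */
static void free_preserved(unsigned char *bitmap)
{
    while (nr_preserved > 0)
        bitmap[preserve_list[--nr_preserved]] = 0;
}
```

Because the old block stays allocated until free_preserved runs, a crash at any point leaves either the old or the new copy of the node intact on disk, which is the recoverability property the section claims.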
Note that I make no claim that this approach is better than the Soft Updates approach employed by [Ganger] or by us in version 0.2; I merely note that tracking the order of writes is more complex than this approach for balanced trees that partially shift items. We may go back to the old approach some day, though not to the code that I threw out. Preserve lists substantially hamper performance for files in the 1-10k size range. We are re-evaluating them. Ext2fs avoids the metadata shifting problem by never shrinking directories, and using fixed inode space allocations. = Lessons From Log Structured File Systems = Many techniques from other file systems have not been applied, primarily so as to satisfy my goal of giving reiserfs 1.0 only the minimum feature set necessary to be useful; they will appear in later releases. Log Structured File Systems [Rosenblum and Ousterhout] embody several such techniques, which I will describe after I mention two concerns with that approach: * With small object file systems it is not feasible to cache in RAM a map of objectid to location for every object, since there are too many objects. This is an inherent problem in using temporal packing rather than semantic packing for small object file systems. With my approach my internal nodes are the equivalent of this objectid to location map, but internal node total size is proportional to the number of nodes rather than the number of objects. You can think of internal nodes as a compression of object location information made effective by the existence of an ordering function; this compression is both essential for small files and a major feature of my approach. * I like obtaining good though not ideal semantic locality without paying a cleaning cost for active data. This is a less critical concern. I frequently find myself classifying packing and layout optimizations as either appropriate for implementing dynamically or appropriate only for a cleaner.
Optimizations whose computational overhead is large compared to their benefit tend to be appropriate for implementation in a cleaner, and a cleaner's benefits mostly impact the static portion of the file system (which typically consumes ~80% of the space). Such objectives as 100% packing efficiency, exactly ordering block layout by semantic order, using the full semantic tree rather than the parent directory in determining semantic order, and compression are all best implemented by cleaner approaches. In summary, there is much to be learned from the LFS approach, and as I move past my initial objective of supplying a minimal-feature, higher-performance FS I will apply some of those lessons. In the Preserve Lists section I speculate on the possibilities for a fastboot implementation that would merge the better features of preserve lists and logging. = Directions For the Future = To go one more order of magnitude smaller in file size will require adding functionality to the file system API, though it will not require discarding upward compatibility. The use of an exokernel is a better approach to small files if it is an option available to the OS designer; it is not currently an option for Linux users. In the future reiserfs will add such features as lightweight files in which stat_data other than size is inherited from a parent if it is not created individually for the file, an API for reading and writing to files without requiring the overhead of file handles and open(), set-theoretic semantics, and many other features that you would expect from researchers who expect to be able to do all that they could do in a database, in the file system, and never really did understand why not. = Conclusion = Balanced tree file systems are inherently more space efficient than block allocation based file systems, with the differences reaching order of magnitude levels for small files.
While other aspects of design will typically have a greater impact on performance for large files, the use of balanced trees offers performance advantages in direct proportion to the smallness of the file. A moderate advantage was found for large files. Coding cost is mostly in the interfaces, and it is a measure of the OS designer's skill whether those costs are low in the OS. We make it possible for an OS designer to use the same interface for large and small objects, and thereby reduce interface coding cost. This approach is a new tool available to the OS designer for increasing the expressive power of all of the components in the OS through better name space integration. Researchers interested in collaborating or just using my work will find me friendly. I tailor the framework of my collaborations to the needs of those I work with. I GPL reiserfs so as to meet the needs of academic collaborators. While that makes it unusable without a special license for commercial OSes, commercial vendors will find me friendly in setting up a commercial framework for commercial collaboration with commercial needs provided for. = Acknowledgments = Hans Reiser was the project initiator, primary architect, supplier of funding, and one of the programmers. Some folks at times remark that naming the file system Reiserfs was egotistic. It was so named after a potential investor hired all of my employees away from me, then tried to negotiate better terms for his possible investment, and suggested that he could arrange for 100 researchers to swear in a Russian court that I had had nothing to do with this project. That business partnership did not work out. Vladimir Saveljev, while he did not author this paper, worked long hours writing the largest fraction of the lines of code in the file system, and is remarkably gifted at just making things work. Thanks, Vladimir. Anatoly Pinchuk wrote much of the core balancing code, and too much of the rest to list here. Thanks, Anatoly.
It is the policy of the Naming System Venture that if someone quits before project completion, and then takes strong steps to try to prevent others from finishing the project, they shall not be mentioned in the acknowledgments. This was all quite sad, and best forgotten. I would like to thank Alfred Ajlamazyan for his generosity in providing overhead at a time when his institute had little it could easily spare. Grigory Zaigralin is thanked for his work in making the machines run, administering the money, and being his usual determined-to-be-useful self. Igor Chudov, thanks for such effective procurement and hardware maintenance work. Eirik Fuller is thanked for his help with NFS and porting to 2.1. I would like to thank Rémy Card for the superb block allocation based file system (ext2fs) that I depended on for so many years, and that allowed me to benchmark against the best. Linus Torvalds, thank you for Linux. = Business Model and Licensing = I personally favor performing a balance of commercial and public works in my life. I have no axe to grind against software that is charged for, and no regrets at making reiserfs freely available to Linux users. This project is GPL'd, but I sell exceptions to the GPL to commercial OS vendors and file server vendors. It is not usable by them without such exceptions, and many of them are wise enough to understand that: * the porting and integration service we are able to provide with the licensing is by itself worth what we charge, * these services impact their time to market, * and the relationship spreads the development costs across more OS vendors than just them alone. I expect that Linux will prove to be quite effective in market sampling my intended market, but if you suspect that I also like seeing more people use it even if it is free to them, oh well. I believe it is not so much the cost that has made Linux so successful as it is the openness.
Linux is a decentralized economy with honor and recognition as the currency of payment (and thus there is much honor in it). Commercial OS vendors are, at the moment, all closed economies, and doomed to fall in their competition with open economies just as communism eventually fell. At some point an OS vendor will realize that if it: * opens up its source code to decentralized modification, * systematically rewards those who perform the modifications that are proven useful, * systematically merges/integrates those modifications into its branded primary release branch while adding value as the integrator, then it will acquire both the critical mass of the internet development community and the aggressive edge that no large communal group (such as a corporation) can have. Rather than saying to any such vendor that they should do this now, let me simply point out that whoever is first will have an enormous advantage. Since I have more recognition than money to pass around as reward, my policy is to tend to require that those who contribute substantial software to this project have their names attached to a user visible portion of the project. This official policy helps me deal with folks like Vladimir, who was much too modest to ever name the file system checker vsck without my insisting. Smaller contributions are to be noted in the source code, and in the acknowledgments section of this paper. If you choose to contribute to this file system, and your work is accepted into the primary release, you should let me know if you want me to look for opportunities to integrate you into contracts from commercial vendors. Through packaging ourselves as a group, we are more marketable to such OS vendors. Many of us have spent too much time working at day jobs unrelated to our Linux work. This is too hard, and I hope to make things easier for us all.
If you like this business model of selling GPL'd component software with related support services, but you write software not related to this file system, I encourage you to form a component supplier company also. Opportunities may arise for us to cooperate in our marketing, and I will be happy to do so. = References = * G.M. Adel'son-Vel'skii and E.M. Landis, [http://en.scientificcommons.org/19884302 An algorithm for the organization of information], Soviet Math. Doklady 3, 1259-1262, 1962. This paper on AVL trees can be thought of as the founding paper of the field of storing data in trees. Those not conversant in Russian will want to read the [Lewis and Denenberg] treatment of AVL trees in its place. [Wood] contains a modern treatment of trees. * [Apple] Apple Computer Inc, [http://books.google.com/books?as_isbn=0201177323 Inside Macintosh, Files], Addison-Wesley, 1992. Employs balanced trees for filenames; it was an interesting file system architecture for its time in a number of ways, but its problems with internal fragmentation have become more severe as disk drives have grown larger, and the code has not received sufficient further development. * [Bach] Maurice J. Bach, [http://portal.acm.org/citation.cfm?id=8570 The Design of the Unix Operating System], 1986, Prentice-Hall Software Series, Englewood Cliffs, NJ. Superbly written but sadly dated; contains detailed descriptions of the file system routines and interfaces in a manner especially useful for those trying to implement a Unix compatible file system. See [Vahalia]. * [BLOB] R. Haskin, Raymond A. Lorie: [http://portal.acm.org/citation.cfm?id=582353.582390 On Extending the Functions of a Relational Database System], SIGMOD Conference 1982: 207-212 (body of paper not on web). See the Drops section for a discussion of how this approach makes the tree less balanced, and the effect that has on performance. * [Chen] Chen, P.M.
Patterson, David A., [http://www.eecs.berkeley.edu/Pubs/TechRpts/1992/6129.html A New Approach to I/O Performance Evaluation] -- Self-Scaling I/O Benchmarks, Predicted I/O Performance, 1993 ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems; also available on Chen's web page. * [C-FFS] Ganger, Gregory R., Kaashoek, M. Frans, [http://www.ece.cmu.edu/~ganger/papers/cffs.html Embedded Inodes and Explicit Grouping: Exploiting Disk Bandwidth for Small Files]. A very well written paper focused on 1-10k file size issues; they use some similar notions (most especially their concept of grouping compared to my packing localities). Note that they focus on the 1-10k file size range, and not the sub-1k range. The 1-10k range is the weak point in reiserfs performance. * [ext2fs] Rémy Card, [http://e2fsprogs.sourceforge.net/ext2intro.html Design and Implementation of the Second Extended Filesystem]. Extensive information; source code is available. When you consider how small this file system is (~6000 lines), its effectiveness becomes all the more remarkable. * [FFS] M.K. McKusick, W.N. Joy, S.J. Leffler, and R.S. Fabry, [http://www.eecs.berkeley.edu/~brewer/cs262/FFS.pdf A fast file system for UNIX], ACM Transactions on Computer Systems, 2(3):181--197, August 1984. Describes the implementation of a file system which employs parent directory location knowledge in determining file layout. It uses large blocks for all but the tail of files to improve I/O performance, and uses small blocks called fragments for the tails so as to reduce the cost due to internal fragmentation. Numerous other improvements are also made to what was once the state of the art. FFS remains the architectural foundation for many current block allocation file systems, and was later bundled with the standard Unix releases.
Note that unrequested serialization and the use of fragments place it at a performance disadvantage to ext2fs, though whether ext2fs is thereby made less reliable is a matter of dispute that I take no position on (reiserfs uses preserve lists; forgive my egotism in thinking that it is enough work for me to ensure that reiserfs solves the recovery problem, and to perhaps suggest that ext2fs would benefit from the use of preserve lists when shrinking directories). * [Ganger] Gregory R. Ganger, Yale N. Patt, [http://pages.cs.wisc.edu/~remzi/Classes/838/Fall2001/Papers/softupdates-osdi94.pdf Metadata Update Performance in File Systems] * [Gifford] [http://portal.acm.org/citation.cfm?id=121133.121138 Semantic file systems]. Describes a file system enriched to have more than hierarchical semantics; he shares many goals with this author, forgive me for thinking his work worthwhile. If I had to suggest one improvement in a sentence, I would say his semantic algebra needs closure. * [Hitz] Dave Hitz, [http://media.netapp.com/documents/wp_3002.pdf File System Design for an NFS File Server Appliance]. A rather well designed file system optimized for NFS and RAID in combination. Note that RAID increases the merits of write-optimization in block layout algorithms. * [Holton and Das] Holton, Mike, Das, Raj, [http://www.uoks.uj.edu.pl/resources/flugor/IRIX/xfs-whitepaper.html XFS: A Next Generation Journalled 64-Bit Filesystem With Guaranteed Rate I/O]: "The XFS space manager and namespace manager use sophisticated B-Tree indexing technology to represent file location information contained inside directory files and to represent the structure of the files themselves (location of information in a file)." Note that it is still a block (extent) allocation based file system; no attempt is made to store the actual file contents in the tree.
It is targeted at the needs of the other end of the file size usage spectrum from reiserfs, and is an excellent design for that purpose (and I would concede that reiserfs 1.0 is not suitable for their real-time large I/O market). SGI has also traditionally been a leader in resisting the use of unrequested serialization of I/O. Unfortunately, the paper is a bit vague on details, and source code is not freely available. * [Howard] Howard, J.H., Kazar, M.L., Menees, S.G., Nichols, D.A., Satyanarayanan, M., Sidebotham, R.N., West, M.J., [http://www.cs.cmu.edu/~satya/docdir/s11.pdf Scale and Performance in a Distributed File System], ACM Transactions on Computer Systems, 6(1), February 1988. A classic benchmark; it was too CPU bound for both ext2fs and reiserfs. * [Knuth] Knuth, D.E., [http://www-cs-faculty.stanford.edu/~knuth/taocp.html The Art of Computer Programming], Vol. 3 (Sorting and Searching), Addison-Wesley, Reading, MA, 1973. The earliest reference discussing trees storing records of varying length. * [LADDIS] Wittle, Mark, and Bruce, Keith, [http://www.spec.org/sfs93/doc/WhitePaper.ps LADDIS: The Next Generation in NFS File Server Benchmarking], Proceedings of the Summer 1993 USENIX Conference, July 1993, pp. 111-128 * [Lewis and Denenberg] Lewis, Harry R., Denenberg, Larry, [http://portal.acm.org/citation.cfm?id=548586 Data Structures & Their Algorithms], HarperCollins Publishers, NY, NY, 1991. An algorithms textbook suitable for readers wishing to learn about balanced trees and their AVL predecessors. * [McCreight] McCreight, E.M., [http://portal.acm.org/citation.cfm?id=359839 Pagination of B*-trees with variable length records], Commun. ACM 20 (9), 670-674, 1977. Describes algorithms for trees with variable length records. * [McVoy and Kleiman] [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.55.2970&rep=rep1&type=pdf Extent-like Performance from a UNIX File System]: The implementation of write-clustering for Sun's UFS.
* [OLE] [http://portal.acm.org/citation.cfm?id=207534 Inside OLE] by Kraig Brockschmidt; discusses Structured Storage. * [Ousterhout] J.K. Ousterhout, H. Da Costa, D. Harrison, J.A. Kunze, M.D. Kupfer, and J.G. Thompson, [http://portal.acm.org/citation.cfm?id=323627.323631 A trace-driven analysis of the UNIX 4.2BSD file system], in Proceedings of the 10th Symposium on Operating Systems Principles, pages 15-24, Orcas Island, WA, December 1985. * [NTFS] [http://portal.acm.org/citation.cfm?id=527752 Inside the Windows NT File System], Microsoft Press, 1994. The book is written by Helen Custer; NTFS is architected by Tom Miller with contributions by Gary Kimura, Brian Andrew, and David Goebel. An easy to read little book. They fundamentally disagree with me on adding serialization of I/O not requested by the application programmer, and I note that the performance penalty they pay for their decision is high, especially compared with ext2fs. Their FS design is perhaps optimal for floppies and other hardware-eject media beyond OS control. A less serialized, higher performance log structured architecture is described in [Rosenblum and Ousterhout]. That said, Microsoft is to be commended for recognizing the importance of attempting to optimize for small files, and leading the OS designer effort to integrate small objects into the file name space. This book is notable for not referencing the work of persons not working for Microsoft, or providing any form of proper attribution to previous authors such as [Rosenblum and Ousterhout]. * [Peacock] Dr. J. Kent Peacock, "The CounterPoint Fast File System", Proceedings of the Usenix Conference, Winter 1988 * [Pike] Rob Pike and Peter Weinberger, [http://pdos.csail.mit.edu/~rsc/pike85hideous.pdf The Hideous Name], USENIX Summer 1985 Conference Proceedings, pp. 563, Portland, Oregon, 1985. Short, informal, and drives home why inconsistent naming schemes in an OS are detrimental.
His discussion of naming in Plan 9: [http://plan9.bell-labs.com/sys/doc/names.html The Use of Name Spaces in Plan 9] * [Rosenblum and Ousterhout] [http://www.eecs.berkeley.edu/~brewer/cs262/LFS.pdf The Design and Implementation of a Log-Structured File System], Mendel Rosenblum and John K. Ousterhout, February 1992 ACM Transactions on Computer Systems. This paper was quite influential in a number of ways on many modern file systems, and the notion of using a cleaner may be applied to a future release of reiserfs. There is an interesting on-going debate over the relative merits of FFS vs. LFS architectures, and the interested reader may peruse [http://www.eecs.harvard.edu/~margo/papers/icde93/ Transaction Support in a Log-Structured File System] and the arguments by Margo Seltzer it links to. * [Snyder] [http://www.solarisinternals.com/si/reading/tmpfs.pdf tmpfs: A Virtual Memory File System]. Discusses a file system built to use swap space and intended for temporary files; due to a complete lack of disk synchronization it offers extremely high performance. * [Vahalia] Uresh Vahalia, [http://books.google.com/books?as_isbn=0131019082 UNIX internals: the new frontiers] [[category:ReiserFS]] [[category:Formatting-fixes-needed]] {{wayback|http://www.namesys.com/X0reiserfs.html|2006-11-13}} Three reasons why ReiserFS is great for you (Hans Reiser, last update 2002): # ReiserFS has fast journaling, which means that you don't spend your life waiting for fsck every time your laptop battery dies, or the UPS for your mission critical server gets its batteries disconnected accidentally by the UPS company's service crew, or your kernel was not as ready for prime time as you hoped, or the silly thing decides you mounted it too many times today. # ReiserFS is based on fast balanced trees.
Balanced trees are more robust in their performance, and are a more sophisticated algorithmic foundation for a file system. When we started our project, there was a consensus in the industry that balanced trees were too slow for file system usage patterns. We proved that if you just do them right they are better--take a look at the benchmarks. We have fewer worst case performance scenarios than other file systems and generally better overall performance. If you put 100,000 files in one directory, we think it's fine; many other file systems try to tell you that you are wrong to want to do it. # ReiserFS is more space efficient. If you write 100 byte files, we pack many of them into one block. Other file systems put each of them into their own block. We don't have fixed space allocation for inodes. That saves 6% of your disk. Ok, it's time to fess up. The interesting stuff is still in the future. Because they are nifty, we are going to add database and hypertext like features into the file system. Only by using balanced trees, with their effective handling of small files (database small fields, hypertext keywords), as our technical foundation can we hope to do this. That was our real motivation. As for performance, we may already be slightly better than the traditional file systems (and substantially better than the journaling ones). But they have been tweaking for decades, while we have just got started. This means that over the next few years we are going to improve faster than they are. Speaking more technically: ReiserFS is a file system using a plug-in based, object oriented variant on classical balanced tree algorithms. The results when compared to the ext2fs conventional block allocation based file system, running under the same operating system and employing the same buffering code, suggest that these algorithms are overall more efficient and every passing month are becoming yet more so.
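The small-file space claim is easy to check with back-of-envelope arithmetic. The 4 KB block size and file count below are hypothetical figures chosen for illustration, and per-file key/metadata overhead is ignored.

```python
# Rough space comparison for many 100-byte files: one-file-per-block
# allocation vs. packing files together in tree nodes. Block size
# and file count are illustrative assumptions, not measurements.

BLOCK = 4096     # hypothetical block size in bytes
FILE = 100       # file size in bytes
N = 10_000       # number of files

# block-aligned FS: every file occupies at least one whole block
aligned_blocks = N

# packed FS: files share blocks (key/overhead bytes ignored here)
packed_blocks = -(-(N * FILE) // BLOCK)   # ceiling division

savings = 1 - packed_blocks / aligned_blocks
```

Under these toy assumptions the packed layout needs roughly 2.5% of the blocks, which is where the "order of magnitude" space claims for sub-1k files come from.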
Loosely speaking, every month we find another performance cranny that needs work; we fix it. And every month we find some way of improving our overall general usage performance. The improvement in small file space and time performance suggests that we may now revisit a common OS design assumption that one should aggregate small objects using layers above the file system layer. Being more effective at small files does not make us less effective for other files. This is truly a general purpose FS. Our overall traditional FS usage performance is high enough to establish that. ReiserFS has a commitment to opening up the FS design to contributions; we are now adding plug-ins so that you can create your own types of directories and files. = Introduction = The author is one of many OS researchers who are attempting to unify the name spaces in the operating system in varying ways (e.g. [http://plan9.bell-labs.com/sys/doc/names.html Pike, The Use of Name Spaces in Plan 9]). None of us are well funded compared with the size of the task, and I am far from an exception to this rule. The natural consequence is that we each have attacked one small aspect of the task. My contribution is in incorporating small objects into the file system name space effectively. This implementation offers value to the average Linux user, in that it offers generally good performance compared to the current Linux file system known as ext2fs. It also saves space to an extent that is important for some applications, and convenient for most. It does extremely well for large directories, and has a variety of minor advantages. Since ext2fs is very similar to FFS and UFS in performance, the implementation also offers potential value to commercial OS vendors who desire greater than ext2fs performance without directory size issues, and who appreciate the value of a better foundation for integrating name spaces throughout the OS. = Why Is There A Move Among Some OS Designers Towards Unifying Name Spaces?
= An operating system is composed of components that access other components through interfaces. Operating systems are complex enough that, like national economies, the architect cannot centrally plan the interactions of the components of which they are composed. The architect can provide a structural framework that has a marked impact on the efficiency and utility of those interactions. Economists have developed principles that govern large economic systems. Are there system principles that we might use to start a discussion of the ways increasing component interactivity via naming system design impacts the total utility of an operating system? I propose these: * If one increases the number of other components that a particular component can interact with, one increases its expressive power and thereby its utility. * One can increase the number of other components that a particular component can interact with either by increasing the number of interfaces it has, or by increasing the number of components that are accessible by its current interfaces. * The cost of component interfaces dominates software design cost, just as the cost of wires dominates circuit design cost. * Total system utility tends to be proportional not to the number of components, but to the number of possible component interactions. It is not simply the number of components that one has that determines an OS's expressive power, it is the number of opportunities to use them that determines it. The number of these opportunities is proportional to the number of possible combinations of them, and the number of possible combinations of them is determined by their connectedness. Component connectedness in OS design is determined by name space design, to much the same extent that buses determine it in circuit design. Allow me to illustrate the impact of these principles with the use of an imaginary example.
Suppose two imaginary OS vendors with equally talented programmers hire two different OS architects. Suppose one of the architects centers the OS design around a single name space design that allows all of the components to access all other components via a single interface (assume this is possible, it is a theoretical example). Suppose the other allows the ten different design groups in the company that are developing components to create their own ten name spaces. Suppose that the unified name space OS architect has half of the resources of the fragmented name space OS architect and creates half as many components. While the number of components is half as large, the number of connections is (1/2)<sup>2</sup>/((1/10)<sup>2</sup>×10) = 2.5 times larger. If you accept my hypothesis that utility is more proportional to connections than to components, then the unified operating system with half the development cost will still offer more expressive utility. That is a powerful motivation. To return briefly to the long ago researched principles governing another member of the class of large systems, the economies of nations, it is perhaps interesting to note that Adam Smith in [http://en.wikisource.org/wiki/The_Wealth_of_Nations "The Wealth of Nations"] engaged in substantial study of the link between the extent of interconnectedness and the development of civilization, where the extent of interconnectedness was determined by waterways, etc. The link he found for economic systems was no less crucial than what is being suggested here for the effect of component interconnectedness on the total utility of software systems. I suggest that I am merely generalizing a long established principle from another field of science, namely that total utility in large systems with components that interact to generate utility is determined by the extent of their interconnection.
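The two-vendor comparison can be checked numerically, modeling utility as the number of possible component pairings reachable within a single name space (a simplification consistent with the principles above; the component count is an arbitrary illustrative choice).

```python
# Numeric check of the imaginary-vendor example: connections scale
# with the square of the components reachable in one name space.

n = 100  # total components in the fragmented design (any multiple of 10 works)

# unified vendor: half as many components, but all in one name space
unified = (n // 2) ** 2

# fragmented vendor: n components split across ten isolated name
# spaces; connections only form within each space
fragmented = 10 * (n // 10) ** 2

ratio = unified / fragmented  # ratio of possible connections
```

With n = 100 this gives 2500 unified pairings against 1000 fragmented ones, the 2.5× advantage claimed in the text, and the ratio is independent of n.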
There are many exceptions to these principles: not all chips on a motherboard sit on the bus, and analogous considerations apply to both OS design and the economies of nations. I hope the reader will accept that space considerations make it appropriate to gloss over these, and will consider the central point that under some circumstances unifying name spaces in a design can dramatically improve the utility of an OS. That can be an enormous motivation, and it has moved a number of OS researchers in their work (e.g. [http://plan9.bell-labs.com/sys/doc/names.html "The Use of Name Spaces in Plan 9", Rob Pike] and [http://pdos.csail.mit.edu/~rsc/pike85hideous.pdf "The Hideous Name", Rob Pike and P.J. Weinberger]). Unfortunately, it is not a small technical effort to combine name spaces. To combine 10 name spaces requires, if not the effort to create 10 name spaces, perhaps an effort equivalent to creating 5 of them. Usually each of the name spaces has particular performance and semantic power requirements that require enhancing the unified name space, and it usually requires technical innovation to combine the advantages of each of the separated name spaces into a unified name space. I would characterize none of the research groups currently approaching this unification problem as having funding equivalent to what went into creating 5 of the name spaces they would like to unify, and we are certainly no exception. For this reason I have picked one particular aspect of this larger problem for our focus: allowing small objects to effectively share the same file system interface that large objects currently use. As operating systems increase the number of their components, the higher development cost of a file system able to handle small files becomes increasingly worth paying for the multiplicative effect it has on OS utility, as well as for its reduction of OS component interface cost. = Should File Boundaries Be Block Aligned?
= Making file boundaries block aligned has a number of effects: it minimizes the number of blocks a file is spread across (which is especially beneficial for multiple block files when locality of reference across files is poor); it wastes disk and buffer space in storing every less than fully packed block; it wastes I/O bandwidth with every access to a less than fully packed block when locality of reference is present; it increases the average number of block fetches required to access every file in a directory; and it results in simpler code. The simpler code of block aligning file systems follows from not needing to create a layering to distinguish the units of the disk controller and buffering algorithms from the units of space allocation, and from not needing to optimize the packing of nodes as is done in balanced tree algorithms. For readers who have not been involved in balanced tree implementations, algorithms of this class are notorious for being much more work to implement than one would expect from their description. Sadly, they also appear to offer the highest performance solution for small files, once I remove certain simplifications from their implementation and add certain optimizations common to file system designs. I regret that code complexity (30k lines) is a major disadvantage of the approach compared to the 6k lines of the ext2fs approach. I started our analysis of the problem with an assumption that I needed to aggregate small files in some way, and that the question was: which solution was optimal? The simplest solution was to aggregate all small files in a directory together into either a file or the directory. But any aggregation into a file or directory wastes part of the last block in the aggregation. What does one do if there are only a few small files in a directory, aggregate them into the parent of the directory?
What if there are only a few small files in a directory at first, and then there are many small files? How do I decide at what level to aggregate them, and when to take them back from a parent of the directory and store them directly in the directory? As we did our analysis of these questions we realized that this problem was closely related to the balancing of nodes in a balanced tree. The balanced tree approach, by using an ordering of files which are then dynamically aggregated into nodes at a lower level, rather than a static aggregation or grouping, avoids this set of questions. In my approach I store both files and filenames in a balanced tree, with small files, directory entries, inodes, and the tail ends of large files all being more efficiently packed as a result of relaxing the requirements of block alignment, and of eliminating the use of a fixed space allocation for inodes. I have a sophisticated and flexible means of arranging for the aggregation of files for maximal locality of reference, through defining the ordering of items in the tree. The body of large files is stored in unformatted nodes that are attached to the tree but isolated from the effects of possible shifting by the balancing algorithms. Approaches such as [Apple] and [Holton and Das] have stored filenames, but not files, in balanced trees. None of the file systems C-FFS, NTFS, or XFS aggregates files; all of them block align files, though all of them also do some variation on storing small files in the statically allocated block address fields of inodes if they are small enough to fit there. [C-FFS] has published an excellent discussion of both their approach and why small files rob a conventional file system of performance in proportion to the number of small files more than to the number of bytes consumed by small files. However, I must note that their notion of what constitutes small differs from ours by one or two orders of magnitude.
Their use of an exo-kernel is simply an excellent approach for operating systems that have it as an available option. Semantics (files), packing (blocks/nodes), caching (read-ahead sizes, etc.), and the hardware interfaces of disk (sectors) and paging (pages) all have different granularity issues associated with them. A central point of our approach is that the optimal granularity of these often differs, and that abstracting them into separate layers, in which the granularity of one layer does not unintentionally impact the others, can improve space/time performance. Reiserfs innovates in that its semantic layer often conveys to the other layers an ungranulated ordering rather than one granulated by file boundaries. While reading the algorithms, the reader is encouraged to note the areas in which reiserfs needs to go farther in doing so.

= Balanced Trees and Large File I/O =

There has long been an odd informal consensus that balanced trees are too slow for use in storing large files, perhaps originating in the performance of databases that have attempted to emulate file systems using balanced tree algorithms that were not originally architected for file system access patterns or their looser serialization requirements. It is hopefully easy for the reader to understand that storing many small files and tail ends of files in a single node where they can all be fetched in one I/O leads directly to higher performance. Unfortunately, it is quite complex to understand the interplay between I/O efficiency and block size for larger files, and space does not allow a systematic review of traditional approaches.
The reader is referred to [FFS], [Peacock], [McVoy], [Holton and Das], [Bach], [OLE], and [NTFS] for treatments of the topic, and for discussions of various means of:
# reducing the effect of block size on CPU efficiency,
# eliminating the need for inserting rotational delay between successive blocks,
# placing small files into either inodes or directories, and
# performing read-ahead.
More commentary on these is in the annotated bibliography.

Reiserfs has the following architectural weaknesses that stem directly from the overhead of repacking to save space and increase block size:
# when the tail (files < 4k are all tail) of a file grows large enough to occupy an entire node by itself, it is removed from the formatted node(s) it resides in and converted into an unformatted node ([FFS] pays a similar conversion cost for fragments);
# a tail that is smaller than one node may be spread across two nodes, which requires more I/O to read if locality of reference is poor;
# aggregating multiple tails into one node introduces separation of the file body from the tail, which reduces read performance ([FFS] has a similar problem, and for reiserfs files near the node in size the effect can be significant);
# when you add one byte to a file or tail that is not the last item in a formatted node, on average half of the whole node is shifted in memory.
If any of your applications performs I/O in such a way that it generates many small unbuffered writes, reiserfs will make you pay a higher price for not being able to buffer the I/O. Most applications that create substantial file system load employ effective I/O buffering, often simply as a result of using the I/O functions in the standard C libraries. By avoiding accesses in small blocks/extents reiserfs improves I/O efficiency.
Extent based file systems such as VxFS, and write-clustering systems such as ext2fs, are not so effective in applying these techniques that they choose to use 512-byte blocks rather than 1k blocks as their defaults. Ext2fs reports a 20% speedup when 4k rather than 1k blocks are used, but the authors of ext2fs advise the use of 1k blocks to avoid wasting space. There are a number of worthwhile large file optimizations that have not been added to either ext2fs or reiserfs, and both file systems are somewhat primitive in this regard, reiserfs being the more primitive of the two. Large files simply were not my research focus, and this being a small research project, I did not implement the many well known techniques for enhancing large file I/O. The buffering algorithms are probably more crucial than any other component in large file I/O, and partly out of a desire for a fair comparison of the approaches I have not modified them. Beyond increasing the block size, I have added no significant optimizations for large files that are not found in ext2fs. Except for the size of the blocks, there is not a large inherent difference between: 1) the cost of adding a pointer to an unformatted node to my tree plus writing the node, and 2) the cost of adding an address field to an inode plus writing the block. It is likely that, except for block size, the primary determinants of high performance large file access are orthogonal to the decision of whether to use balanced tree algorithms for small and medium sized files. For large files we get some advantage from our tree being more balanced than the tree formed by an inode which points to a triple indirect block, though we have no easy method for measuring the performance gain from that. There is performance overhead due to the memory bandwidth cost of balancing nodes for small files; we think it is worth it though.
= Serialization and Consistency =

The issues of ensuring recoverability with minimal serialization and data displacement necessarily dominate high performance design. Let's define the two extremes in serialization so that the reason for this can be clear. Consider the relative speed of a set of I/Os in which every block request in the set is fed to the elevator algorithms of the kernel and the disk drive firmware fully serially, each request awaiting the completion of the previous request. Now consider the other extreme, in which all block requests are fed to the elevator algorithms together, so that they may all be sorted and performed in close to their sorted order (disk drive firmwares don't use a pure elevator algorithm). The unserialized extreme may be more than an order of magnitude faster, due to the cost of rotations and seeks. Unnecessarily serializing I/O prevents the elevator algorithm from doing its job of placing all of the I/Os in their layout sequence rather than their chronological sequence. Most of high performance design centers around making I/Os in the order they are laid out on disk, and laying out blocks on disk in the order that the I/Os will want to be issued. [Snyder] discusses a file system that obtains high performance from a complete lack of disk synchronization, but is only suitable for temporary files that don't need to survive reboot. I think its known value to Solaris users indicates that the optimal buffering policy varies from file to file. [Ganger] discusses methods for using ordering of writes rather than serialization for ensuring conventional file system meta-data integrity; [McVoy] previously suggested, but did not implement, ordering of buffer writes. Ext2fs is fast in substantial part due to avoiding synchronous writes of meta-data, and I have much personal experience with it that leads me to prefer compiles that are fast.
[ I would like to see it adopt a policy that all dirty buffers for files not flagged as temporary are queued for writing, and that the existence of a dirty buffer means that the disk is busy. This will require replacing buffer I/O locking with copy-on-write, but an idle disk is such a terrible thing to waste. :-) ]

[NTFS] by default adds unnecessary serialization to an extent that even older file systems such as [FFS] do not, and its performance characteristics reflect that. In fairness, it should be said that it is the superior approach for most removable media without software control of ejection (e.g. IBM PC floppies). Reiserfs employs a new scheme called preserve lists for ensuring recoverability, which avoids overwriting old meta-data by writing the new meta-data nearby rather than over the old.

= Why Aggregate Small Objects at the File System Level? =

There has long been a tradition of file system developers deciding that effective handling of small files is not significant to performance, and of application programmers caring enough about performance for small files to not store them as separate entities in the file system. To store small objects one may either make the file system efficient for the task, or sidestep the problem by aggregating small objects in a layer above the file system. Sidestepping the problem has three disadvantages: utility, code complexity, and performance.

Utility and code complexity: allowing OS designers to effectively use a single namespace with a single interface for both large and small objects decreases coding cost and increases the expressive power of components throughout the OS. I feel reiserfs shows the effects of a larger development investment focused on a simpler interface when compared with many solutions for this currently available in the object oriented toolkit community, such as the Structured Storage available in Microsoft's [OLE].
By simpler I mean that I added nothing to the file system API to distinguish large and small objects, and I leave it to the directory semantics and archiving programs to aggregate objects. Multiple layers cost more to implement, cost more to code the interfaces for utilizing, and provide less flexibility.

Performance: it is most commonly the case that when one layers one file system on top of another the performance is substantially reduced, and Structured Storage is not an exception to this general rule. Reiserfs, which does not attempt to delegate the small object problem to a layer above, avoids this performance loss. I have heard it suggested by some that this layering avoids the performance loss from syncing on file close as many file systems do. I suggest that this is adding an error to an error rather than fixing it. Let me make clear that I believe those who write such layers above the file system do not do so out of stupidity. I know of at least one company at which a solution that layers small object storage above the file system exists because the file system developers refused to listen to the non-file-system group's description of its needs, and the file system group had to be sidestepped in generating the solution.

Current file systems are fairly well designed for the purposes that their users currently use them for: my goal is to change file size usage patterns. The author remembers arguments that once showed clearly that there was no substantial market need for disk drives larger than 10MB, based on the usage statistics of the time. While [C-FFS] points out that 80% of file accesses are to files below 10k, I do not believe it reasonable to expect that usage measurements of file systems for which small files are inappropriate to use will show small files being frequently used. Application programmers are smarter than that.
Currently 80% of file accesses are to the first order of magnitude in file size for which it is currently sensible to store the object in the file system. I regret that one can only speculate as to whether, once file systems become effective for small files and database tasks, usage patterns will change to 80% of file accesses being to files of less than 100 bytes. What I can do is show, via the 80/20 Banded File Set Benchmark presented later, that in such circumstances small file performance potentially dominates total system performance. In summary, the on-going reinvention of incompatible object aggregation techniques above the file system layer is expensive, less expressive, less integrated, slower, and less efficient in its storage than incorporating balanced tree algorithms into the file system.

= Tree Definitions =

Balanced trees are used in databases, and more generally wherever a programmer needs to search and store to non-random memory by a key and has the time to code it this way. The usual evolution for programmers is to first think that hashing will be simpler and more efficient, and then to realize only after getting into the sordid details that the combination of space efficiency, minimizing disk accesses, and the feasibility of caching the top part of the tree makes the tree approach more effective. It is the usual thing to first try to do hashing, and then, by the time the details are worked out, to have a balanced tree. The cost of effectively handling bucket overflow just isn't less than the cost of balancing, unless the buckets are always all in RAM. Hashing is often a good solution when there is no non-random memory involved, such as when hashing a cache. The Linux dcache code uses hashing for accessing a cache of in-memory directory entries. Sometimes one uses partial or full hashing of keys within that balanced tree.
If you do full hashing within a tree, and you cache the top part of that tree, you have something rather similar to extensible hashing, except that it is more flexible and efficient. Sometimes programmers code using unbalanced trees; most file systems do essentially that. Balanced trees generally do a better job of minimizing the average number of disk accesses. There is literature establishing that balanced trees are optimal for the worst case when there is no caching of the tree. This is rather pointless literature, as the average case when cached is what is important, and I am afraid that the existing literature proves that which is feasible to prove rather than that which is relevant. That said, practitioners know from experience that making the tree less balanced leads to more I/Os. Discussions of the exceptions to this are rather interesting, but not for here... I regret that I must assume that the reader is familiar with basic balanced tree algorithms [Wood], [Lewis and Denenberg], [Knuth], [McCreight]. No attempt will be made to survey tree design here, since balanced trees are one of the most researched and complex topics in algorithm theory and require treatment at length. I must compound this discourtesy with a concise set of definitions that sorely lack accompanying diagrams; my apologies. Finally, I'll truly annoy the reader by saying that the header files contain nice ASCII art, and if you want the full definition of the structures, the source is the place. Classically, balanced trees are designed with the set of keys assumed to be defined by the application, and the purpose of the tree design is to optimize searching through those keys. In my approach the purpose of the tree is to optimize the reference locality and space efficient packing of objects, and the keys are defined as best optimizes the algorithm for that.
Keys are used in place of inode numbers in the file system, thereby substituting a mapping of keys to node locations (the internal nodes) for a mapping of inode numbers to file locations. Keys are longer than inode numbers, but one needs to cache fewer of them than one would need to cache inode numbers when more than one file is stored in a node. In my tree, I still require that a filename be resolved one component at a time. It is an interesting topic for future research whether this is necessary or optimal. This is a more complex issue than a casual reader might realize: directory-at-a-time lookup accomplishes a form of compression, makes mounting other name spaces and file system extensions simpler, makes security simpler, and makes future enhanced semantics simpler. Since small files typically lead to large directories, it is fortuitous that, as a natural consequence of our use of tree algorithms, our directory mechanisms are much more effective for very large directories than those of most other file systems (notable exceptions include [Holton and Das]).

The tree has three node types: internal nodes, formatted nodes, and unformatted nodes. The contents of internal and formatted nodes are sorted in the order of their keys. (Unformatted nodes contain no keys.) Internal nodes consist of pointers to sub-trees separated by their delimiting keys. The key that precedes a pointer to a sub-tree is a duplicate of the first key in the first formatted node of that sub-tree. Internal nodes exist solely to allow determining which formatted node contains the item corresponding to a key. Reiserfs starts at the root node, examines its contents, and based on them determines which sub-tree contains the item corresponding to the desired key. From the root node reiserfs descends into the tree, branching at each node, until it reaches the formatted node containing the desired item.
The first (bottom) level of the tree consists of unformatted nodes, the second level consists of formatted nodes, and all levels above consist of internal nodes. The highest level contains the root node. The number of levels is increased as needed by adding a new root node at the top of the tree. All paths from the root of the tree to the formatted leaves are equal in length, and all paths to the unformatted leaves are also equal in length and one node longer than the paths to the formatted leaves. This equality in path length, and the high fanout it provides, are vital to high performance, and in the Drops section I will describe how the lengthening of the path that occurred as a result of introducing the [BLOB] approach (the use of indirect items and unformatted nodes) proved a measurable mistake.

Formatted nodes consist of items. Items have four types: direct items, indirect items, directory items, and stat data items. All items contain a key which is unique to the item; this key is used to sort, and to find, the item. Direct items contain the tails of files; a tail is the last part of a file (the last file_size modulo FS block size bytes of the file). Indirect items consist of pointers to unformatted nodes; all but the tail of the file is contained in its unformatted nodes. Directory items contain the key of the first directory entry in the item, followed by a number of directory entries. Depending on the configuration of reiserfs, stat data may be stored as a separate item, or it may be embedded in a directory entry; we are still benchmarking to determine which way is best. A file consists of a set of indirect items followed by a set of up to two direct items, with two direct items representing the case where a tail is split across two nodes.
If a tail is larger than the maximum size of a file that can fit into a formatted node, but is smaller than the unformatted node size (4k), then it is stored in an unformatted node, and a pointer to it, plus a count of the space used, is stored in an indirect item. Directories consist of a set of directory items. Directory items consist of a set of directory entries. Directory entries contain the filename and the key of the file which is named. There is never more than one item of the same item type from the same object stored in a single node (there is no reason one would want to use two separate items rather than combining them). The first item of a file or directory contains its stat data. When performing balancing, and analyzing the packing of a node and its two neighbors, we ensure that the three nodes cannot be compressed into two nodes. I feel greater compression than this is best left to an FS cleaner to perform, rather than attempting it dynamically.

== ReiserFS Structures ==

The ReiserFS tree has Max_Height = N (current default value: N = 5). The tree lies in disk blocks; each disk block that belongs to the reiserfs tree has a block head.

<pre>
The disk block (Internal Node: keys and pointers to disk blocks):

 Block_Head | Key 0 | Key 1 | Key 2 | --- | Key N | Pointer 0 | Pointer 1 | Pointer 2 | --- | Pointer N | Pointer N+1 | ..Free Space..

The disk block (Leaf Node: items and item headers):

 Block_Head | IHead 0 | IHead 1 | IHead 2 | --- | IHead N | ...Free Space... | Item N | --- | Item 2 | Item 1 | Item 0

The disk block (Unformatted Node: the data of a big file):

 ...........................................................................
</pre>

ReiserFS objects are files and directories. The maximum number of objects is 2^32 - 4 = 4 294 967 292. Each object is a set of items:

* File items:
*# StatData item + [Direct item] (for small files: size from 0 bytes to MAX_DIRECT_ITEM_LEN = blocksize - 112 bytes)
*# StatData item + InDirect item + [Direct item] (for big files: size > MAX_DIRECT_ITEM_LEN bytes)
* Directory items:
*# StatData item + Directory item

Every reiserfs object has an Object ID and a Key.

=== Internal Node Structures ===

An internal node (diagrammed above) consists of a block head, keys, and pointers to disk blocks (disk children).

<pre>
struct block_head
Field Name           Type            Size (bytes)  Description
blk_level            unsigned short  2             level of the block in the tree (1 = leaf; 2, 3, 4, ... = internal)
blk_nr_item          unsigned short  2             number of keys in an internal block, or number of items in a leaf block
blk_free_space       unsigned short  2             block free space in bytes
blk_right_delim_key  struct key      16            right delimiting key for this block (leaf nodes only)
total                                6 or 22       6 (padded to 8) bytes for internal nodes; 22 (padded to 24) bytes for leaf nodes

struct key
Field Name    Type   Size (bytes)  Description
k_dir_id      __u32  4             id of the parent directory
k_object_id   __u32  4             id of the object (also the inode number)
k_offset      __u32  4             offset from the beginning of the object to the current byte of the object
k_uniqueness  __u32  4             type of the item (StatData = 0, Direct = -1, InDirect = -2, Directory = 500)
total                16

struct disk_child (pointer to a disk block)
Field Name       Type            Size (bytes)  Description
dc_block_number  unsigned long   4             disk child's block number
dc_size          unsigned short  2             disk child's used space
total                            6             (padded to 8 bytes)
</pre>

=== Leaf Node Structures ===

A leaf node (diagrammed above) consists of a block head (22 bytes, padded to 24, as described above), item headers, and item bodies. Everything in the file system is stored as a set of items. Each item has an item_head; the item_head contains the key of the item, its free space (for indirect items), and the location of the item body within the block.

<pre>
struct item_head (IHead)
Field Name         Type        Size (bytes)  Description
ih_key             struct key  16            key used to search for the item; all item headers are sorted by this key
u.ih_free_space /
u.ih_entry_count   __u16       2             free space in the last unformatted node for an InDirect item;
                                             0xFFFF for a Direct item; 0xFFFF for a StatData item;
                                             the number of directory entries for a Directory item
ih_item_len        __u16       2             total size of the item body
ih_item_location   __u16       2             offset of the item body within the block
ih_reserved        __u16       2             used by reiserfsck
total                          24
</pre>

There are 4 types of items: stat_data items, directory items, indirect items, and direct items.

<pre>
struct stat_data (the reiserfs version of the UFS disk inode, minus the address blocks)
Field Name            Type   Size (bytes)  Description
sd_mode               __u16  2             file type, permissions
sd_nlink              __u16  2             number of hard links
sd_uid                __u16  2             owner id
sd_gid                __u16  2             group id
sd_size               __u32  4             file size
sd_atime              __u32  4             time of last access
sd_mtime              __u32  4             time the file was last modified
sd_ctime              __u32  4             time the inode (stat data) was last changed
                                           (except changes to sd_atime and sd_mtime)
sd_rdev               __u32  4             device
sd_first_direct_byte  __u32  4             offset from the beginning of the file to the
                                           first byte of the file's direct item:
                                            -1 for a directory;
                                             1 for small files (direct items only);
                                            >1 for big files (indirect and direct items);
                                            -1 for big files (indirect items but no direct item)
total                        32

Directory item:  deHead 0 | deHead 1 | deHead 2 | --- | deHead N | fileName N | --- | fileName 2 | fileName 1 | fileName 0
Direct item:     ........................Small File Body............................
InDirect item:   unfPointer 0 | unfPointer 1 | unfPointer 2 | --- | unfPointer N

struct reiserfs_de_head (deHead)
Field Name    Type   Size (bytes)  Description
deh_offset    __u32  4             third component of the directory entry key
                                   (all reiserfs_de_heads are sorted by this value)
deh_dir_id    __u32  4             objectid of the parent directory of the object
                                   referenced by the directory entry
deh_objectid  __u32  4             objectid of the object referenced by the directory entry
deh_location  __u16  2             offset of the name within the whole item
deh_state     __u16  2             flags: 1) entry contains stat data (for the future);
                                   2) entry is hidden (unlinked)
total                16
</pre>

unfPointer is a pointer to an unformatted block (unfPointer size = 4 bytes); unformatted blocks contain the body of a big file. fileName is the name of the file (an array of bytes of variable length); the maximum length of a file name is blocksize - 64 (4032 bytes for a 4kb blocksize).

= Using the Tree to Optimize Layout of Files =

There are four levels at which layout optimization is performed:
# the mapping of logical block numbers to physical locations on disk,
# the assigning of nodes to logical block numbers,
# the ordering of objects within the tree, and
# the balancing of the objects across the nodes they are packed into.

== Physical Layout ==

For SCSI drives the mapping of logical block numbers to physical locations is performed by the disk drive manufacturer; for IDE drives it is done by the device driver; and for all drives it is also potentially done by volume management software.
The logical block number to physical location mapping by the drive manufacturer is usually done using cylinders. I agree with the authors of [ext2fs] and most others that the significant file placement feature of FFS was not the actual cylinder boundaries, but the placing of files and their inodes on the basis of their parent directory location. FFS used explicit knowledge of actual cylinder boundaries in its design. I find that minimizing the distance in logical blocks between semantically adjacent nodes, without tracking cylinder boundaries, accomplishes an excellent approximation of optimizing according to actual cylinder boundaries, and I find its simplicity an aid to implementation elegance.

== Node Layout ==

When I place nodes of the tree on the disk, I search for the first empty block in the bitmap (of used block numbers), starting at the location of the left neighbor of the node in the tree ordering and moving in the direction I last moved in. This was experimentally found to be better, for the benchmarks employed, than the following alternatives: 1) taking the first non-zero entry in the bitmap; 2) taking the entry after the last one that was assigned, in the direction last moved in (this was 3% faster for writes and 10-20% slower for subsequent reads); 3) starting at the left neighbor and moving in the direction of the right neighbor. When changing block numbers so as to avoid overwriting sending nodes before shifted items reach disk in their new recipient node (see the description of preserve lists later in the paper), the benchmarks employed were ~10% faster when starting the search from the left neighbor rather than from the node's current block number, even though it adds significant overhead to determine the left neighbor (the current implementation risks I/O to read the parent of the left neighbor). It used to be that we would reverse direction when we reached the end of the disk drive.
Fortunately we checked to see whether it makes a difference which direction one moves in when allocating blocks to a file, and indeed we found that it made a significant difference to always allocate in the increasing block number direction. We hypothesize that this is due to matching the disk spin direction by allocating using increasing block numbers.

== Ordering within the Tree ==

While I give here an example of how I have defined keys to optimize locality of reference and packing efficiency, I would like to stress that key definition is a powerful and flexible tool that I am far from finished experimenting with. Some key definition decisions depend very much on usage patterns, and this means that someday one will select from several key definitions when creating the file system. For example, consider the decision of whether to pack all directory entries together at the front of the file system, or to pack the entries near the files they name. For large file usage patterns one should pack all directory items together, since systems with such usage patterns are effective in caching the entries for all directories; for small files the name should be near the file. Similarly, for large files the stat data should be stored separately from the body, either with the other stat data from the same directory, or with the directory entry. (It was likely a mistake not to assign stat data its own key in the current implementation, as packing it in with direct and indirect items complicates our code for handling those items, and prevents me from easily experimenting with the effects of changing its key assignment.) It is not necessary for a file's packing to reflect its name; that is merely my default. My next release will offer the option of overriding the default for each file by use of a system call.
It is feasible to pack an object completely independently of its semantics using these algorithms, and I predict that there will be many applications, perhaps even most, for which a packing different from that determined by object names is more appropriate. Currently the mandatory tying of packing locality to semantics results in the distortion of both semantics and packing from what might otherwise be their independent optimums, much as tying block boundaries to file boundaries distorts I/O and space allocation algorithms from their separate optimums. For example, placing most files accessed while booting at the start of the disk, in their access order, is a very tempting future optimization that the use of packing localities makes feasible to consider.

The structure of a key: each file item has a key with the structure <locality_id, object_id, offset, uniqueness>. The locality_id is by default the object_id of the parent directory. The object_id is the unique id of the file, and is set to the first unused objectid when the object is created. The tendency of this to result in successive object creations in a directory being adjacently packed is often fortuitous for many usage patterns. For files, the offset is the offset within the logical object of the first byte of the item. In version 0.2 all directory entries had their own individual keys stored with them and were each distinct items; in the current version I store one key in the item, which is the key of the first entry, and compute each entry's key as needed from that one stored key. For directories, the offset key component is the first four bytes of the filename, which you may think of as the lexicographic rather than numeric offset. For directory items the uniqueness field differentiates filename entries identical in the first 4 bytes.
For all item types the uniqueness field indicates the item type, and for the leftmost item in a buffer it indicates whether the preceding item in the tree is of the same type and object as this item. Placing this information in the key is useful when analyzing balancing conditions, but increases key length for non-directory items, and is a questionable architectural feature. Every file has a unique objectid, but this cannot be used for finding the object; only keys are used for that. Objectids merely ensure that keys are unique. If you never use the reiserfs features that change an object's key then it is immutable; otherwise it is mutable. (This feature aids support for NFS daemons, etc.) We spent quite some time debating internally whether the use of mutable keys for identifying an object had deleterious long term architectural consequences: in the end I decided it was acceptable iff we require any object recording a key to possess a method for updating its copy of it. This is the architectural price of avoiding caching a map of objectid to location that might have very poor locality of reference due to objectids not changing with object semantics. I pack an object with the packing locality of the directory it was first created in unless the key is explicitly changed. It remains packed there even if it is unlinked from the directory. I do not move it from the locality it was created in without an explicit request, unlike the [C-FFS] approach which stores all multiple link files together and pays the cost of moving them from their original locations when the second link occurs. I think a file linked with multiple directories might as well get at least the locality reference benefits of one of those directories. In summary, this approach 1) places files from the same directory together, 2) places directory entries from the same directory together with each other and with the stat data for the directory.
Note that there is no interleaving of objects from different directories in the ordering at all, and that all directory entries from the same directory are contiguous. You'll note that this does not accomplish packing the files of small directories with common parents together, and does not employ the full partial ordering in determining the linear ordering, it merely uses parent directory information. I feel the proper place for employing full tree structure knowledge is in the implementation of an FS cleaner, not in the dynamic algorithms. == Node Balancing Optimizations == When balancing nodes I do so according to the following ordered priorities: # minimize number of nodes used # minimize number of nodes affected by the balancing operation # minimize the number of uncached nodes affected by the balancing operation # if shifting to another formatted node is necessary, maximize the bytes shifted Priority 4) is based on the assumption that the location of an insertion of bytes into the tree is an indication of the likely future location of an insertion, and that policy 4 will on average reduce the number of formatted nodes affected by future balance operations. There are more subtle effects as well, in that if one randomly places nodes next to each other, and one has a choice between those nodes being mostly moderately efficiently packed or packed to an extreme of either well or poorly packed, one is more likely to be able to combine more of the nodes if one chooses the policy of extremism. Extremism is a virtue in space efficient node packing. The maximal shift policy is not applied to internal nodes, as extremism is not a virtue in time efficient internal node balancing. 
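Priority 4, the maximal shift policy, can be illustrated with a small sketch. The byte-granularity model and all names here are assumptions for illustration (the real balancing code shifts whole or partial items, not raw bytes):

```c
/* When inserting insert_bytes into a node of capacity cap holding
 * used bytes would overflow it, and a neighbor has neighbor_free
 * bytes free, shift not the minimum needed to make room but the
 * maximum that fits, leaving the pair packed to an extreme of full
 * and empty rather than two moderately packed nodes.  Returns the
 * number of bytes to shift (0 if the insertion fits in place). */
unsigned maximal_shift(unsigned cap, unsigned used,
                       unsigned insert_bytes, unsigned neighbor_free)
{
    if (used + insert_bytes <= cap)
        return 0;                     /* priorities 1-2: touch no neighbor */
    /* shift as much as the neighbor will take, not just the overflow */
    return neighbor_free < used ? neighbor_free : used;
}
```

The extremism argument is that extremely packed and extremely empty nodes combine with their neighbors more readily in future balancings than uniformly half-full ones do.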
=== Drops === (The difficult design issues in the current version that our next version can do better) Consider dividing a file or directory into drops, with each drop having a separate key, and no two drops from one file or directory occupying the same node without being compressed into one drop. The key for each drop is set to the key for the object (file or directory) plus the offset of the drop within the object. For directories the offset is lexicographic and by filename, for files it is numeric and in bytes. In the course of several file system versions we have experimented with and implemented solid, liquid, and air drops. Solid drops were never shifted, and drops would only solidify when they occupied the entirety of a formatted node. Liquid drops are shifted in such a way that any liquid drop which spans a node fully occupies the space in its node. Like a physical liquid it is shiftable but not compressible. Air drops merely meet the balancing condition of the tree. Reiserfs 0.2 implemented solid drops for all but the tail of files. If a file was at least one node in size it would align the start of the file with the start of a node, block aligning the file. This block alignment of the start of multi-drop files was a design error that wasted space: even if the locality of reference is so poor as to make one not want to read parts of semantically adjacent files, if the nodes are near to each other then the cost of reading an extra block is thoroughly dwarfed by the cost of the seek and rotation to reach the first node of the file. As a result the block alignment saves little in time, though it costs significant space for 4-20k files. Reiserfs with block alignment of multi-drop files and no indirect items experienced the following rather interesting behavior that was partially responsible for making it only 88% space efficient for files that averaged 13k (the linux kernel) in size. 
When the tail of a larger than 4k file was followed in the tree ordering by another file larger than 4k, since the drop before was solid and aligned, and the drop afterwards was solid and aligned, no matter what size the tail was, it occupied an entire node. In the current version we place all but the tail of large files into a level of the tree reserved for full unformatted nodes, and create indirect items in the formatted nodes which point to the unformatted nodes. This is known in the database literature as the [BLOB] approach. This extra level added to the tree comes at the cost of making the tree less balanced (I consider the unformatted nodes pointed to as part of the tree) and increasing the maximal depth of the tree by 1. For medium sized files, the use of indirect items increases the cost of caching pointers by mixing data with them. The reduction in fanout often causes the read algorithms to fetch only one node at a time of the file being read more frequently, as one waits to read the uncached indirect item before reading the node with the file data. There are more parents per file read with the use of indirect items than with internal nodes, as a direct result of reduced fanout due to mixing tails and indirect items in the node. The most serious flaw is that these reads of various nodes necessary to the reading of the file have additional rotations and seeks compared to the case with drops. With my initial drop approach they are usually sequential in their disk layout, even the tail, and the internal node parent points to all of them in such a way that all of them that are contained by that parent or another internal node in cache can be requested at once in one sequential read. Non-sequential reads of nodes are more than an order of magnitude more costly than sequential reads, and this single consideration dominates effective read optimization. 
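An indirect item of the kind described above can be sketched as follows. The layout, the fixed 4k node size, and the names are illustrative assumptions, not the actual on-disk format:

```c
#include <stdint.h>

/* A formatted leaf holds an indirect item: an array of block numbers
 * of the full unformatted nodes containing the body of a large file,
 * in file order (the BLOB approach). */
struct indirect_item {
    uint32_t nptrs;          /* how many unformatted nodes */
    uint32_t blocknr[64];    /* their block numbers */
};

/* Map a byte offset in the file to the unformatted node holding it,
 * assuming 4k nodes; returns -1 past the last pointer (the tail is
 * stored in a direct item instead). */
long body_blocknr(const struct indirect_item *it, uint64_t offset)
{
    uint64_t idx = offset / 4096;
    if (idx >= it->nptrs)
        return -1;
    return it->blocknr[idx];
}
```

Reading the file thus requires the formatted leaf first, then the unformatted nodes it points to, which is the extra level of indirection, and the extra seeks, that the text criticizes.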
Unformatted nodes make file system recovery faster and less robust, in that one reads their indirect item rather than them to insert them into the recovered tree, and one cannot read them to confirm that their contents are from the file that an indirect item says they are from. In this they make reiserfs similar to an inode based system without logging. A moderately better solution would have been to have simply eliminated the requirement for placement of the start of multi-node files at the start of nodes, rather than introducing BLOBs, and to have depended on the use of a file system cleaner to optimally pack the 80% of files that don't move frequently using algorithms that move even solid drops. Yet that still leaves the problem of formatted nodes not being efficient for mmap() purposes (one must copy them before writing rather than merely modifying their page table entries, and memory bandwidth is expensive even if CPU is cheap.) For this reason I have the following plan for the next version. I will have three trees: one tree maps keys to unformatted nodes, one tree maps keys to formatted nodes, and one tree maps keys to directory entries and stat_data. Now it is only natural to think that this would mean that to read a file, accessing first the directory entry and stat_data, then the unformatted node, then the tail, one must hop long distances across the disk, going first to one tree and then to another. This is indeed why it took me two years to realize it could be made to work. My plan is to interleave the nodes of the three trees according to the following algorithm: Block numbers are assigned to nodes when the nodes are created, or preserved, and someday will be assigned when the cleaner runs. The choice of block number is based on first determining what other node it should be placed near, and then finding the nearest free block that can be found in the elevator's current direction.
Currently we use the left neighbor of the node in the tree as the node it should be placed near. This is nice and simple. Oh well. Time to create a virtual neighbor layer. The new scheme will continue to first determine the node it should be placed near, and then start the search for an empty block from that spot, but it will use a more complicated determination of what node to place it near. This method will cause all nodes from the same packing locality to be near each other, will cause all directory entries and stat_data to be grouped together within that packing locality, and will interleave formatted and unformatted nodes from the same packing locality. Pseudo-code is best for describing this: <pre>
/*
 * For use by reiserfs_get_new_blocknrs when determining where in the
 * bitmap to start the search for a free block, and for use by the
 * read-ahead algorithm when there are not enough nodes to the right
 * and in the same packing locality for packing locality read-ahead
 * purposes.
 */
get_logical_layout_left_neighbors_blocknr(key of current node)
{
    /* Based on examination of the current node's key and type, find
       the virtual neighbor of that node. */
    if body node
        if first body node of file
            if (node in tail tree whose key is less but is in same packing locality exists)
                return blocknr of such node with largest key
            else
                find node with largest key less than key of current node in stat_data tree
                return its blocknr
        else
            return blocknr of node in body tree with largest key less than key of current node
    else if tail node
        if (node in body tree belonging to same file as first tail of current node exists)
            return its blocknr
        else if (node in tail tree with lesser delimiting key but same packing locality exists)
            return blocknr of such node with largest delimiting key
        else
            return blocknr of node with largest key less than key of current node in stat_data tree
    else /* is stat_data tree node */
        if stat_data node with lesser key from same packing locality exists
            return blocknr of such node with largest key
        else
            /* no node from same packing locality with lesser key exists */
}

/* for use by packing locality read-ahead */
get_logical_layout_right_neighbors_blocknr(key of current node)
{
    right-handed version of get_logical_layout_left_neighbors_blocknr logic
}
</pre> It is my hope that this will improve caching of pointers to unformatted nodes, and improve caching of directory entries and stat_data, by separating them from file bodies to a greater extent. I also hope that it will improve read performance for 1-10k files, and that it will allow us to do this without decreasing space efficiency. === Code Complexity === I thought it appropriate to mention some of the notable effects of simple design decisions on our implementation's code length. When we changed our balancing algorithms to shift parts of items rather than only whole items, so as to pack nodes tighter, this had an impact on code complexity.
Another multiplicative determinant of balancing code complexity was the number of item types: introducing indirect items doubled it, and changing directory items from liquid drops to air drops increased it further. Storing stat data in the first direct or indirect item of the file complicated the code for processing those items more than if I had made stat data its own item type. When one finds oneself with an NxN coding complexity issue, it usually indicates the need for adding a layer of abstraction. The NxN effect of the number of items on balancing code complexity is an instance of that design principle, and we will address it in the next major rewrite. The balancing code will employ a set of item operations which all item types must support. The balancing code will then invoke those operations without caring to understand any more of the meaning of an item's type than that it determines which item-specific operation handler is called. Adding a new item type, say a compressed item, will then merely require writing a set of item operations for that item rather than requiring modifying most parts of the balancing code as it does now. We now feel that the function to determine what resources are needed to perform a balancing operation, fix_nodes(), might as well be written to decide what operations will be performed during balancing since it pretty much has to do so anyway. That way, the function that performs the balancing with the nodes locked, do_balance(), can be gutted of most of its complexity. = Buffering & the Preserve List = We implemented for version 0.2 of our file system a system of write ordering that tracked all shifting of items in the tree, and ensured that no node that had had an item shifted from it was written before the node that had received the item was written. This is necessary to prevent a system crash from causing the loss of an item that might not be recently created.
This tracking approach worked, and the overhead it imposed was not measurable in our benchmarks. When in the next version we changed to partially shifting items and increased the number of item types, this code grew out of control in its complexity. I decided to replace it with a scheme that was far simpler to code and also more effective in typical usage patterns. This scheme was as follows: If an item is shifted from a node, change the block that its buffer will be written to. Change it to the nearest free block to the old block's left neighbor, and rather than freeing it, place the old block number on a ``preserve list''. (Saying nearest is slightly simplistic, in that the blocknr assignment function moves from the left neighbor in the direction of increasing block numbers.) When a ``moment of consistency'' is achieved, free all of the blocks on the preserve list. A moment of consistency occurs when there are no nodes in memory into which objects have been shifted (this could be made more precise but then it would be more complex). If disk space runs out, force a moment of consistency to occur. This is sufficient to ensure that the file system is recoverable. Note that during the large file benchmarks the preserve list was freed several times in the middle of the benchmark. The percentage of buffers preserved is small in practice except during deletes, and one can arrange for moments of consistency to occur as frequently as one wants to. Note that I make no claim that this approach is better than the Soft Updates approach employed by [Ganger] or by us in version 0.2; I merely note that tracking order of writes is more complex than this approach for balanced trees which partially shift items. We may go back to the old approach some day, though not to the code that I threw out. Preserve lists substantially hamper performance for files in the 1-10k size range. We are re-evaluating them.
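The preserve-list mechanics described above can be sketched as follows. The structure and function names are illustrative assumptions, not reiserfs's actual code, and error handling is omitted:

```c
#include <stdlib.h>

/* When an item is shifted out of a node, the node's old block number
 * is recorded here rather than freed; all recorded blocks are freed
 * together at a moment of consistency. */
struct preserve_list {
    long  *blocks;
    size_t count, cap;
};

void preserve_block(struct preserve_list *pl, long blocknr)
{
    if (pl->count == pl->cap) {
        pl->cap = pl->cap ? pl->cap * 2 : 16;
        pl->blocks = realloc(pl->blocks, pl->cap * sizeof(long));
    }
    pl->blocks[pl->count++] = blocknr;
}

/* At a moment of consistency no node in memory holds shifted items
 * that have not reached disk, so every preserved block may be reused.
 * Returns how many blocks were released; the real code would mark
 * each of them free in the block bitmap. */
size_t preserve_list_flush(struct preserve_list *pl)
{
    size_t freed = pl->count;
    pl->count = 0;
    return freed;
}
```

The appeal of the scheme is visible in the sketch: instead of tracking a write-ordering graph among nodes, recoverability reduces to an append and a bulk free.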
Ext2fs avoids the metadata shifting problem by never shrinking directories, and using fixed inode space allocations. = Lessons From Log Structured File Systems = Many techniques from other file systems haven't been applied primarily so as to satisfy my goal of giving reiserfs 1.0 only the minimum feature set necessary to be useful, and will appear in later releases. Log Structured File Systems [Rosenblum and Ousterhout] embody several such techniques, which I will describe after I mention two concerns with that approach: * With small object file systems it is not feasible to cache in RAM a map of objectid to location for every object since there are too many objects. This is an inherent problem in using temporal packing rather than semantic packing for small object file systems. With my approach my internal nodes are the equivalent of this objectid to location map, but internal node total size is proportional to the number of nodes rather than the number of objects. You can think of internal nodes as a compression of object location information made effective by the existence of an ordering function, and this compression is both essential for small files, and a major feature of my approach. * I like obtaining good though not ideal semantic locality without paying a cleaning cost for active data. This is a less critical concern. I frequently find myself classifying packing and layout optimizations as either appropriate for implementing dynamically or appropriate only for a cleaner. Optimizations whose computational overhead is large compared to their benefit tend to be appropriate for implementation in a cleaner, and a cleaner's benefits mostly impact the static portion of the file system (which typically consumes ~80% of the space.) 
Such objectives as 100% packing efficiency, exactly ordering block layout by semantic order, using the full semantic tree rather than parent directory in determining semantic order, compression, these are all best implemented by cleaner approaches. In summary, there is much to be learned from the LFS approach, and as I move past my initial objective of supplying a minimal feature higher performance FS I will apply some of those lessons. In the Preserve Lists section I speculate on the possibilities for a fastboot implementation that would merge the better features of preserve lists and logging. = Directions For the Future = To go one more order of magnitude smaller in file size will require adding functionality to the file system API, though it will not require discarding upward compatibility. The use of an exokernel is a better approach to small files if it is an option available to the OS designer; it is not currently an option for Linux users. In the future reiserfs will add such features as lightweight files in which stat_data other than size is inherited from a parent if it is not created individually for the file, an API for reading and writing to files without requiring the overhead of file handles and open(), set-theoretic semantics, and many other features that you would expect from researchers who expect to be able to do all that they could do in a database, in the file system, and never really did understand why not. = Conclusion = Balanced tree file systems are inherently more space efficient than block allocation based file systems, with the differences reaching order of magnitude levels for small files. While other aspects of design will typically have a greater impact on performance for large files, in direct proportion to the smallness of the file the use of balanced trees offers performance advantages. A moderate advantage was found for large files.
Coding cost is mostly in the interfaces, and it is a measure of the OS designer's skill whether those costs are low in the OS. We make it possible for an OS designer to use the same interface for large and small objects, and thereby reduce interface coding cost. This approach is a new tool available to the OS designer for increasing the expressive power of all of the components in the OS through better name space integration. Researchers interested in collaborating or just using my work will find me friendly. I tailor the framework of my collaborations to the needs of those we work with. I GPL reiserfs so as to meet the needs of academic collaborators. While that makes it unusable without a special license for commercial OSes, commercial vendors will find me friendly in setting up a commercial framework for commercial collaboration with commercial needs provided for. = Acknowledgments = Hans Reiser was the project initiator, primary architect, supplier of funding, and one of the programmers. Some folks at times remark that naming the filesystem Reiserfs was egotistic. It was so named after a potential investor hired all of my employees away from me, then tried to negotiate better terms for his possible investment, and suggested that he could arrange for 100 researchers to swear in Russian Court that I had had nothing to do with this project. That business partnership did not work out. Vladimir Saveljev, while he did not author this paper, worked long hours writing the largest fraction of the lines of code in the file system, and is remarkably gifted at just making things work. Thanks Vladimir. Anatoly Pinchuk wrote much of the core balancing code, and too much of the rest to list here. Thanks, Anatoly. It is the policy of the Naming System Venture that if someone quits before project completion, and then takes strong steps to try to prevent others from finishing the project, that they shall not be mentioned in the acknowledgements. 
This was all quite sad, and best forgotten. I would like to thank Alfred Ajlamazyan for his generosity in providing overhead at a time when his institute had little it could easily spare. Grigory Zaigralin is thanked for his work in making the machines run, administering the money, and being his usual determined to be useful self. Igor Chudov, thanks for such effective procurement and hardware maintenance work. Eirik Fuller is thanked for his help with NFS and porting to 2.1. I would like to thank Remi Card for the superb block allocation based file system (ext2fs) that I depended on for so many years, and that allowed me to benchmark against the best. Linus Torvalds, thank you for Linux. = Business Model and Licensing = I personally favor performing a balance of commercial and public works in my life. I have no axe to grind against software that is charged for, and no regrets at making reiserfs freely available to Linux users. This project is GPL'd, but I sell exceptions to the GPL to commercial OS vendors and file server vendors. It is not usable to them without such exceptions, and many of them are wise enough to understand that: * the porting and integration service we are able to provide with the licensing is by itself worth what we charge, * that these services impact their time to market, * and that the relationship spreads the development costs across more OS vendors than just them alone I expect that Linux will prove to be quite effective in market sampling my intended market, but if you suspect that I also like seeing more people use it even if it is free to them, oh well. I believe it is not so much the cost that has made Linux so successful as it is the openness. Linux is a decentralized economy with honor and recognition as the currency of payment (and thus there is much honor in it). Commercial OS vendors are, at the moment, all closed economies, and doomed to fall in their competition with open economies just as communism eventually fell. 
At some point an OS vendor will realize that if it: * opens up its source code to decentralized modification, * systematically rewards those who perform the modifications that are proven useful, * systematically merges/integrates those modifications into its branded primary release branch while adding value as the integrator, that it will acquire both the critical mass of the internet development community, and the aggressive edge that no large communal group (such as a corporation) can have. Rather than saying to any such vendor that they should do this now, let me simply point out that whoever is first will have an enormous advantage..... Since I have more recognition than money to pass around as reward, my policy is to tend to require that those who contribute substantial software to this project have their names attached to a user visible portion of the project. This official policy helps me deal with folks like Vladimir, who was much too modest to ever name the file system checker vsck without my insisting. Smaller contributions are to be noted in the source code, and the acknowledgements section of this paper. If you choose to contribute to this file system, and your work is accepted into the primary release, you should let me know if you want me to look for opportunities to integrate you into contracts from commercial vendors. Through packaging ourselves as a group, we are more marketable to such OS vendors. Many of us have spent too much time working at day jobs unrelated to our Linux work. This is too hard, and I hope to make things easier for us all. If you like this business model of selling GPL'd component software with related support services, but you write software not related to this file system, I encourage you to form a component supplier company also. Opportunities may arise for us to cooperate in our marketing, and I will be happy to do so. = References = * G.M. Adel'son-Vel'skii and E.M. 
Landis, [http://en.scientificcommons.org/19884302 An algorithm for the organization of information], Soviet Math. Doklady 3, 1259-1262, 1972, This paper on AVL trees can be thought of as the founding paper of the field of storing data in trees. Those not conversant in Russian will want to read the [Lewis and Denenberg] treatment of AVL trees in its place. [Wood] contains a modern treatment of trees. * [Apple] Apple Computer Inc, [http://books.google.com/books?as_isbn=0201177323 Inside Macintosh, Files], Addison-Wesley, 1992. Employs balanced trees for filenames, it was an interesting file system architecture for its time in a number of ways, now its problems with internal fragmentation have become more severe as disk drives have grown larger, and the code has not received sufficient further development. * [Bach] Maurice J. Bach, [http://portal.acm.org/citation.cfm?id=8570 The Design of the Unix Operating System], 1986, Prentice-Hall Software Series, Englewood Cliffs, NJ, superbly written but sadly dated, contains detailed descriptions of the file system routines and interfaces in a manner especially useful for those trying to implement a Unix compatible file system. See [Vahalia]. * [BLOB] R. Haskin, Raymond A. Lorie: [http://portal.acm.org/citation.cfm?id=582353.582390 On Extending the Functions of a Relational Database System]. SIGMOD Conference (body of paper not on web) 1982: 207-212, See Drops section for a discussion of how this approach makes the tree less balanced, and the effect that has on performance. * [Chen] Chen, P.M. Patterson, David A., [http://www.eecs.berkeley.edu/Pubs/TechRpts/1992/6129.html A New Approach to I/O Performance Evaluation] -- Self-Scaling I/O Benchmarks, Predicted I/O Performance, 1993 ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems, also available on Chen's web page. * [C-FFS] Ganger, Gregory R., Kaashoek, M. 
Frans, [http://www.ece.cmu.edu/~ganger/papers/cffs.html Embedded Inodes and Explicit Grouping: Exploiting Disk Bandwidth for Small Files]. A very well written paper focused on 1-10k file size issues, they use some similar notions (most especially their concept of grouping compared to my packing localities). Note that they focus on the 1-10k file size range, and not the sub-1k range. The 1-10k range is the weak point in reiserfs performance. * [ext2fs] by Rémy Card, [http://e2fsprogs.sourceforge.net/ext2intro.html Design and Implementation of the Second Extended Filesystem]. Extensive information, source code is available. When you consider how small this file system is (~6000 lines), its effectiveness becomes all the more remarkable. * [FFS] M.K. McKusick, W.N. Joy, S.J. Leffler, and R.S. Fabry. [http://www.eecs.berkeley.edu/~brewer/cs262/FFS.pdf A fast file system for UNIX]. ACM Transactions on Computer Systems, 2(3):181--197, August 1984 describes the implementation of a file system which employs parent directory location knowledge in determining file layout. It uses large blocks for all but the tail of files to improve I/O performance, and uses small blocks called fragments for the tails so as to reduce the cost due to internal fragmentation. Numerous other improvements are also made to what was once the state-of-the-art. FFS remains the architectural foundation for many current block allocation file systems, and was later bundled with the standard Unix releases. Note that unrequested serialization and the use of fragments place it at a performance disadvantage to ext2fs, though whether ext2fs is thereby made less reliable is a matter of dispute that I take no position on (reiserfs uses preserve lists; forgive my egotism in thinking that it is enough work for me to ensure that reiserfs solves the recovery problem, and to perhaps suggest that ext2fs would benefit from the use of preserve lists when shrinking directories) * [Ganger] Gregory R. Ganger, Yale N.
Patt, [http://pages.cs.wisc.edu/~remzi/Classes/838/Fall2001/Papers/softupdates-osdi94.pdf Metadata Update Performance in File Systems] * [Gifford], [http://portal.acm.org/citation.cfm?id=121133.121138 Semantic file systems]. Describes a file system enriched to have more than hierarchical semantics; he shares many goals with this author, forgive me for thinking his work worthwhile. If I had to suggest one improvement in a sentence, I would say his semantic algebra needs closure. * [Hitz, Dave] [http://media.netapp.com/documents/wp_3002.pdf File System Design for an NFS File Server Appliance]. A rather well designed file system optimized for NFS and RAID in combination. Note that RAID increases the merits of write-optimization in block layout algorithms. * [Holton and Das] Holton, Mike, Das, Raj, [http://www.uoks.uj.edu.pl/resources/flugor/IRIX/xfs-whitepaper.html XFS: A Next Generation Journalled 64-Bit Filesystem With Guaranteed Rate I/O]: "The XFS space manager and namespace manager use sophisticated B-Tree indexing technology to represent file location information contained inside directory files and to represent the structure of the files themselves (location of information in a file)." Note that it is still a block (extent) allocation based file system; no attempt is made to store the actual file contents in the tree. It is targeted at the needs of the other end of the file size usage spectrum from reiserfs, and is an excellent design for that purpose (and I would concede that reiserfs 1.0 is not suitable for their real-time large I/O market.) SGI has also traditionally been a leader in resisting the use of unrequested serialization of I/O. Unfortunately, the paper is a bit vague on details, and source code is not freely available.
* [Howard] [http://www.cs.cmu.edu/~satya/docdir/s11.pdf Scale and Performance in a Distributed File System], Howard, J.H., Kazar, M.L., Menees, S.G., Nichols, D.A., Satayanarayanan, N., Sidebotham, R.N., West, M.J., ACM Transactions on Computer Systems, 6(1), February 1988 A classic benchmark, it was too CPU bound for both ext2fs and reiserfs. * [Knuth] Knuth, D.E., [http://www-cs-faculty.stanford.edu/~knuth/taocp.html The Art of Computer Programming], Vol. 3 (Sorting and Searching), Addison-Wesley, Reading, MA, 1973, the earliest reference discussing trees storing records of varying length. * [LADDIS] Wittle, Mark., and Bruce, Keith., [http://www.spec.org/sfs93/doc/WhitePaper.ps LADDIS: The Next Generation in NFS File Server Benchmarking], Proceedings of the Summer 1993 USENIX Conference.'', July 1993, pp. 111-128 * [Lewis and Denenberg] Lewis, Harry R., Denenberg, Larry [http://portal.acm.org/citation.cfm?id=548586 Data Structures & Their Algorithms], HarperCollins Publishers, NY, NY, 1991, an algorithms textbook suitable for readers wishing to learn about balanced trees and their AVL predecessors. * [McCreight] McCreight, E.M., [http://portal.acm.org/citation.cfm?id=359839 Pagination of B*-trees with variable length records], Commun. ACM 20 (9), 670-674, 1977, describes algorithms for trees with variable length records. * [McVoy and Kleiman], [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.55.2970&rep=rep1&type=pdf Extent−like Performance from a UNIX File System]: The implementation of write-clustering for Sun's UFS. * [OLE] [http://portal.acm.org/citation.cfm?id=207534 Inside OLE] by Kraig Brockshmidt, discusses Structured Storage * [Ousterhout] J.K. Ousterhout, H. Da Costa, D. Harrison, J.A. Kunze, M.D. Kupfer, and J.G. Thompson. [http://portal.acm.org/citation.cfm?id=323627.323631 A trace-driven analysis of the UNIX 4.2BSD file system]. 
In Proceedings of the 10th Symposium on Operating Systems Principles, pages 15-24, Orcas Island, WA, December 1985. * [NTFS] [http://portal.acm.org/citation.cfm?id=527752 Inside the Windows NT File System]. The book was written by Helen Custer; NTFS was architected by Tom Miller with contributions by Gary Kimura, Brian Andrew, and David Goebel, Microsoft Press, 1994. An easy-to-read little book. They fundamentally disagree with me on adding serialization of I/O not requested by the application programmer, and I note that the performance penalty they pay for their decision is high, especially compared with ext2fs. Their FS design is perhaps optimal for floppies and other hardware eject media beyond OS control. A less serialized, higher performance log structured architecture is described in [Rosenblum and Ousterhout]. That said, Microsoft is to be commended for recognizing the importance of attempting to optimize for small files, and leading the OS designer effort to integrate small objects into the file name space. This book is notable for not referencing the work of persons not working for Microsoft, or providing any form of proper attribution to previous authors such as [Rosenblum and Ousterhout]. * [Peacock] Dr. J. Kent Peacock, "The CounterPoint Fast File System", Proceedings of the Usenix Conference Winter 1988. * [Pike] Rob Pike and Peter Weinberger, [http://pdos.csail.mit.edu/~rsc/pike85hideous.pdf The Hideous Name], USENIX Summer 1985 Conference Proceedings, pp. 563, Portland, Oregon, 1985. Short, informal, and drives home why inconsistent naming schemes in an OS are detrimental. See also Pike's discussion of naming in Plan 9: [http://plan9.bell-labs.com/sys/doc/names.html The Use of Name Spaces in Plan 9]. * [Rosenblum and Ousterhout] [http://www.eecs.berkeley.edu/~brewer/cs262/LFS.pdf The Design and Implementation of a Log-Structured File System], Mendel Rosenblum and John K.
Ousterhout, February 1992 ACM Transactions on Computer Systems. This paper was quite influential in a number of ways on many modern file systems, and the notion of using a cleaner may be applied to a future release of reiserfs. There is an interesting on-going debate over the relative merits of FFS vs. LFS architectures, and the interested reader may peruse [http://www.eecs.harvard.edu/~margo/papers/icde93/ Transaction Support in a Log-Structured File System] and the arguments by Margo Seltzer it links to. * [Snyder] [http://www.solarisinternals.com/si/reading/tmpfs.pdf tmpfs: A Virtual Memory File System] discusses a file system built to use swap space and intended for temporary files; due to a complete lack of disk synchronization it offers extremely high performance. * [Vahalia] Uresh Vahalia, [http://books.google.com/books?as_isbn=0131019082 UNIX internals: the new frontiers]. [[category:ReiserFS]] [[category:Formatting-fixes-needed]] * {{wayback|http://www.namesys.com/X0reiserfs.html|2006-11-13}} = Three reasons why ReiserFS is great for you = Last Update: 2002, Hans Reiser. Three reasons why ReiserFS is great for you: # ReiserFS has fast journaling, which means that you don't spend your life waiting for fsck every time your laptop battery dies, or the UPS for your mission critical server gets its batteries disconnected accidentally by the UPS company's service crew, or your kernel was not as ready for prime time as you hoped, or the silly thing decides you mounted it too many times today. # ReiserFS is based on fast balanced trees. Balanced trees are more robust in their performance, and are a more sophisticated algorithmic foundation for a file system. When we started our project, there was a consensus in the industry that balanced trees were too slow for file system usage patterns. We proved that if you just do them right they are better--take a look at the benchmarks.
We have fewer worst case performance scenarios than other file systems and generally better overall performance. If you put 100,000 files in one directory, we think it's fine; many other file systems try to tell you that you are wrong to want to do it. # ReiserFS is more space efficient. If you write 100 byte files, we pack many of them into one block. Other file systems put each of them into their own block. We don't have fixed space allocation for inodes. That saves 6% of your disk. Ok, it's time to fess up. The interesting stuff is still in the future. Because they are nifty, we are going to add database and hypertext like features into the file system. Only by using balanced trees, with their effective handling of small files (database small fields, hypertext keywords), as our technical foundation can we hope to do this. That was our real motivation. As for performance, we may already be slightly better than the traditional file systems (and substantially better than the journaling ones). But they have been tweaking for decades, while we have just gotten started. This means that over the next few years we are going to improve faster than they are. Speaking more technically: ReiserFS is a file system using a plug-in based object oriented variant on classical balanced tree algorithms. The results when compared to the ext2fs conventional block allocation based file system, running under the same operating system and employing the same buffering code, suggest that these algorithms are overall more efficient and every passing month are becoming yet more so. Loosely speaking, every month we find another performance cranny that needs work; we fix it. And every month we find some way of improving our overall general usage performance. The improvement in small file space and time performance suggests that we may now revisit a common OS design assumption that one should aggregate small objects using layers above the file system layer.
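The space claim for small files is easy to check with back-of-envelope arithmetic. The sketch below uses illustrative numbers only: it assumes a 4k block size and ignores the per-item header overhead that reiserfs pays in practice, so the real saving is somewhat smaller than shown.

```python
import math

# Back-of-envelope comparison (hypothetical workload): 10,000 files of
# 100 bytes each, 4096-byte blocks.
BLOCK = 4096
N_FILES = 10_000
FILE_SIZE = 100

# Block-aligned scheme: every file occupies at least one whole block.
aligned_blocks = N_FILES

# Packed scheme: file bodies share blocks (per-item header overhead
# ignored for simplicity).
packed_blocks = math.ceil(N_FILES * FILE_SIZE / BLOCK)

print(aligned_blocks)  # 10000 blocks, roughly 40 MB
print(packed_blocks)   # 245 blocks, roughly 1 MB
```

The same ratio drives I/O: fetching many 100-byte files costs one block read per file in the aligned scheme, but roughly one read per 40 files once they are packed.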
Being more effective at small files does not make us less effective for other files. This is truly a general purpose FS. Our overall traditional FS usage performance is high enough to establish that. ReiserFS has a commitment to opening up the FS design to contributions; we are now adding plug-ins so that you can create your own types of directories and files. = Introduction = The author is one of many OS researchers who are attempting to unify the name spaces in the operating system in varying ways (e.g. [http://plan9.bell-labs.com/sys/doc/names.html Pike, The Use of Name Spaces in Plan9]). None of us are well funded compared with the size of the task, and I am far from an exception to this rule. The natural consequence is that we each have attacked one small aspect of the task. My contribution is in incorporating small objects into the file system name space effectively. This implementation offers value to the average Linux user, in that it offers generally good performance compared to the current Linux file system known as ext2fs. It also saves space to an extent that is important for some applications, and convenient for most. It does extremely well for large directories, and has a variety of minor advantages. Since ext2fs is very similar to FFS and UFS in performance, the implementation also offers potential value to commercial OS vendors who desire greater than ext2fs performance without directory size issues, and who appreciate the value of a better foundation for integrating name spaces throughout the OS. = Why Is There A Move Among Some OS Designers Towards Unifying Name Spaces? = An operating system is composed of components that access other components through interfaces. Operating systems are complex enough that, like national economies, the architect cannot centrally plan the interactions of the components that it is composed of.
The architect can provide a structural framework that has a marked impact on the efficiency and utility of those interactions. Economists have developed principles that govern large economic systems. Are there system principles that we might use to start a discussion of the ways increasing component interactivity via naming system design impacts the total utility of an operating system? I propose these: * If one increases the number of other components that a particular component can interact with, one increases its expressive power and thereby its utility. * One can increase the number of other components that a particular component can interact with either by increasing the number of interfaces it has, or by increasing the number of components that are accessible by its current interfaces. * The cost of component interfaces dominates software design cost, like the cost of wires dominates circuit design cost. * Total system utility tends to be proportional not to the number of components, but to the number of possible component interactions. It is not simply the number of components that one has that determines an OS's expressive power, it is the number of opportunities to use them that determines it. The number of these opportunities is proportional to the number of possible combinations of them, and the number of possible combinations of them is determined by their connectedness. Component connectedness in OS design is determined by name space design, to much the same extent that buses determine it in circuit design. Allow me to illustrate the impact of these principles with the use of an imaginary example. Suppose two imaginary OS vendors with equally talented programmers hire two different OS architects. Suppose one of the architects centers the OS design around a single name space design that allows all of the components to access all other components via a single interface (assume this is possible, it is a theoretical example).
Suppose the other allows the ten different design groups in the company that are developing components to create their own ten name spaces. Suppose that the unified name space OS architect has half of the resources of the fragmented name space OS architect and creates half as many components. While the number of components is half as large, the number of connections is 1/22/((1/102)*10) times larger. If you accept my hypothesis that utility is more proportional to connections than components, then the unified operating system with half the development cost will still offer more expressive utility. That is a powerful motivation. To return briefly to the long ago researched principles governing another member of the class of large systems, the economies of nations, it is perhaps interesting to note that Adam Smith in [http://en.wikisource.org/wiki/The_Wealth_of_Nations "The Wealth of Nations"] engaged in substantial study of the link between the extent of interconnectedness and the development of civilization, where the extent of interconnectedness was determined by waterways, etc. The link he found for economic systems was no less crucial than what is being suggested here for the effect of component interconnectedness on the total utility of software systems. I suggest that I am merely generalizing a long established principle from another field of science, namely that total utility in large systems with components that interact to generate utility is determined by the extent of their interconnection. There are many exceptions to these principles: not all chips on a motherboard sit on the bus, and analogous considerations apply to both OS design and the economies of nations. I hope the reader will accept that space considerations make it appropriate to gloss over these, and will consider the central point that under some circumstances unifying name spaces in a design can dramatically improve the utility of an OS. 
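The 2.5x figure from the thought experiment can be checked directly. The sketch below assumes, as in the example, that possible connections within a name space grow as the square of the number of components it contains; N stands in for the fragmented vendor's total component count.

```python
# Connection counts for the two imaginary vendors. The fragmented
# vendor ships N components split evenly across ten isolated name
# spaces; the unified vendor ships N/2 components in one name space.
N = 1000  # arbitrary; the ratio does not depend on N

unified_connections = (N / 2) ** 2
fragmented_connections = 10 * (N / 10) ** 2

ratio = unified_connections / fragmented_connections
print(ratio)  # 2.5
```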
That can be an enormous motivation, and it has moved a number of OS researchers in their work (e.g. [http://plan9.bell-labs.com/sys/doc/names.html "The Use of Name Spaces in Plan9", Rob Pike] and [http://pdos.csail.mit.edu/~rsc/pike85hideous.pdf "The Hideous Name", Rob Pike and P.J. Weinberger]). Unfortunately, it is not a small technical effort to combine name spaces. To combine 10 name spaces requires, if not the effort to create 10 name spaces, perhaps an effort equivalent to creating 5 of the name spaces. Usually each of the name spaces has particular performance and semantic power requirements that require enhancing the unified name space, and it usually requires technical innovation to combine the advantages of each of the separated name spaces into a unified name space. I would characterize none of the research groups currently approaching this unification problem as having funding equivalent to what went into creating 5 of the name spaces they would like to unify, and we are certainly no exception. For this reason I have picked one particular aspect of this larger problem for our focus: allowing small objects to effectively share the same file system interface that large objects use currently. As operating systems increase the number of their components, the higher development cost of a file system able to handle small files becomes more worth the multiplicative effect it has on OS utility, as well as its reduction of OS component interface cost. = Should File Boundaries Be Block Aligned? 
= Making file boundaries block aligned has a number of effects: it minimizes the number of blocks a file is spread across (which is especially beneficial for multiple block files when locality of reference across files is poor), it wastes disk and buffer space in storing every less than fully packed block, it wastes I/O bandwidth with every access to a less than fully packed block when locality of reference is present, it increases the average number of block fetches required to access every file in a directory, and it results in simpler code. The simpler code of block aligning file systems follows from not needing to create a layering to distinguish the units of the disk controller and buffering algorithms from the units of space allocation, and from not needing to optimize the packing of nodes as is done in balanced tree algorithms. For readers who have not been involved in balanced tree implementations, algorithms of this class are notorious for being much more work to implement than one would expect from their description. Sadly, they also appear to offer the highest performance solution for small files, once I remove certain simplifications from their implementation and add certain optimizations common to file system designs. I regret that code complexity (30k lines) is a major disadvantage of the approach compared to the 6k lines of the ext2fs approach. I started our analysis of the problem with an assumption that I needed to aggregate small files in some way, and that the question was, which solution was optimal? The simplest solution was to aggregate all small files in a directory together into either a file or the directory. But any aggregation into a file or directory wastes part of the last block in the aggregation. What does one do if there are only a few small files in a directory, aggregate them into the parent of the directory? 
What if there are only a few small files in a directory at first, and then there are many small files? How do I decide what level to aggregate them at, and when to take them back from a parent of a directory and store them directly in the directory? As we did our analysis of these questions we realized that this problem was closely related to the balancing of nodes in a balanced tree. The balanced tree approach, by using an ordering of files which are then dynamically aggregated into nodes at a lower level, rather than a static aggregation or grouping, avoids this set of questions. In my approach I store both files and filenames in a balanced tree, with small files, directory entries, inodes, and the tail ends of large files all being more efficiently packed as a result of relaxing the requirements of block alignment, and eliminating the use of a fixed space allocation for inodes. I have a sophisticated and flexible means for arranging for the aggregation of files for maximal locality of reference, through defining the ordering of items in the tree. The body of large files is stored in unformatted nodes that are attached to the tree but isolated from the effects of possible shifting by the balancing algorithms. Approaches such as [Apple] and [Holton and Das] have stored filenames but not files in balanced trees. None of the file systems C-FFS, NTFS, or XFS aggregate files; all of them block align files, though all of those also do some variation on storing small files in the statically allocated block address fields of inodes if they are small enough to fit there. [C-FFS] has published an excellent discussion of both their approach and why small files rob a conventional file system of performance more in proportion to the number of small files than the number of bytes consumed by small files. However, I must note that their notion of what constitutes small is different from ours by one or two orders of magnitude.
Their use of an exo-kernel is simply an excellent approach for operating systems that have that as an available option. Semantics (files), packing (blocks/nodes), caching (read-ahead sizes, etc.), and the hardware interfaces of disk (sectors) and paging (pages) all have different granularity issues associated with them: a central point of our approach is that the optimal granularity of these often differs, and abstracting these into separate layers in which the granularity of one layer does not unintentionally impact other layers can improve space/time performance. Reiserfs innovates in that its semantic layer often conveys to the other layers an ungranulated ordering rather than one granulated by file boundaries. The reader is encouraged to note the areas in which reiserfs needs to go farther in doing so while reading the algorithms. = Balanced Trees and Large File I/O = There has long been an odd informal consensus that balanced trees are too slow for use in storing large files, perhaps originating in the performance of databases that have attempted to emulate file systems using balanced tree algorithms that were not originally architected for file system access patterns or their looser serialization requirements. It is hopefully easy for the reader to understand that storing many small files and tail ends of files in a single node where they can all be fetched in one I/O leads directly to higher performance. Unfortunately, it is quite complex to understand the interplay between I/O efficiency and block size for larger files, and space does not allow a systematic review of traditional approaches.
The reader is referred to [FFS], [Peacock], [McVoy], [Holton and Das], [Bach], [OLE], and [NTFS] for treatments of the topic, and discussions of various means of 1) reducing the effect of block size on CPU efficiency, 2) eliminating the need for inserting rotational delay between successive blocks, 3) placing small files into either inodes or directories, and 4) performing read-ahead. More commentary on these is in the annotated bibliography. Reiserfs has the following architectural weaknesses that stem directly from the overhead of repacking to save space and increase block size: 1) when the tail (files < 4k are all tail) of a file grows large enough to occupy an entire node by itself it is removed from the formatted node(s) it resides in, and it is converted into an unformatted node ([FFS] pays a similar conversion cost for fragments), 2) a tail that is smaller than one node may be spread across two nodes which requires more I/O to read if locality of reference is poor, 3) aggregating multiple tails into one node introduces separation of file body from tail, which reduces read performance ([FFS] has a similar problem, and for reiserfs files near the node in size the effect can be significant), 4) when you add one byte to a file or tail that is not the last item in a formatted node, then on average half of the whole node is shifted in memory. If any of your applications perform I/O in such a way that they generate many small unbuffered writes, reiserfs will make you pay a higher price for not being able to buffer the I/O. Most applications that create substantial file system load employ effective I/O buffering, often simply as a result of using the I/O functions in the standard C libraries. By avoiding accesses in small blocks/extents reiserfs improves I/O efficiency. 
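The tail-handling rules above can be sketched as a toy placement function. MAX_DIRECT below is a made-up threshold for how large a tail may be before it no longer fits in a formatted node; the real reiserfs constants and balancing conditions are considerably more involved.

```python
# Toy placement sketch for a file's bytes (illustrative only).
BLOCK = 4096        # unformatted node / block size
MAX_DIRECT = 3072   # hypothetical limit on a tail held as a direct item

def layout(file_size):
    """Return (full unformatted nodes, tail bytes, where the tail lives)."""
    full = file_size // BLOCK
    tail = file_size % BLOCK
    if tail == 0:
        where = "none"
    elif tail <= MAX_DIRECT:
        where = "formatted"    # packed into a formatted node as a direct item
    else:
        where = "unformatted"  # too large to pack; occupies a node of its own
    return full, tail, where

print(layout(100))    # (0, 100, 'formatted') - a small file is all tail
print(layout(10000))  # (2, 1808, 'formatted')
print(layout(8192))   # (2, 0, 'none')
```

Note how a file just under the threshold flips from "formatted" to "unformatted" as it grows; that flip is the conversion cost described in point 1 above.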
Extent based file systems such as VxFS, and write-clustering systems such as ext2fs, are not so effective in applying these techniques that they choose to use 512-byte blocks rather than 1k blocks as their defaults. Ext2fs reports a 20% speedup when 4k rather than 1k blocks are used, but the authors of ext2fs advise the use of 1k blocks to avoid wasting space. There are a number of worthwhile large file optimizations that have not been added to either ext2fs or reiserfs, and both file systems are somewhat primitive in this regard, reiserfs being the more primitive of the two. Large files simply were not my research focus, and it being a small research project I did not implement the many well known techniques for enhancing large file I/O. The buffering algorithms are probably more crucial than any other component in large file I/O, and partly out of a desire for a fair comparison of the approaches I have not modified these. I have added no significant optimizations for large files, beyond increasing the block size, that are not found in ext2fs. Except for the size of the blocks, there is not a large inherent difference between: 1) the cost of adding a pointer to an unformatted node to my tree plus writing the node, and 2) adding an address field to an inode plus writing the block. It is likely that except for block size the primary determinants of high performance large file access are orthogonal to the decision of whether to use balanced tree algorithms for small and medium sized files. For large files we get some advantage from not having our tree being more balanced than the tree formed by an inode which points to a triple indirect block. We haven't an easy method for measuring the performance gain from that though. There is performance overhead due to the memory bandwidth cost of balancing nodes for small files. We think it is worth it though. 
= Serialization and Consistency = The issues of ensuring recoverability with minimal serialization and data displacement necessarily dominate high performance design. Let's define the two extremes in serialization so that the reason for this can be clear. Consider the relative speed of a set of I/O's in which every block request in the set is fed to the elevator algorithms of the kernel and the disk drive firmware fully serially, each request awaiting the completion of the previous request. Now consider the other extreme, in which all block requests are fed to the elevator algorithms all together, so that they may all be sorted and performed in close to their sorted order (disk drive firmwares don't use a pure elevator algorithm). The unserialized extreme may be more than an order of magnitude faster, due to the cost of rotations and seeks. Unnecessarily serializing I/O prevents the elevator algorithm from doing its job of placing all of the I/O's in their layout sequence rather than chronological sequence. Most of high performance design centers around making I/O's in the order they are laid out on disk, and laying out blocks on disk in the order that the I/O's will want to be issued. [Snyder] discusses a file system that obtains high performance from a complete lack of disk synchronization, but is only suitable for temporary files that don't need to survive reboot. I think its known value to Solaris users indicates that the optimal buffering policy varies from file to file. Ganger discusses methods for using ordering of writes rather than serialization for ensuring conventional file system meta-data integrity; [McVoy] previously suggested but did not implement ordering of buffer writes. Ext2fs is fast in substantial part due to avoiding synchronous writes of metadata, and I have much personal experience with it that leads me to prefer compiles that are fast.
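The two serialization extremes described above can be illustrated with a toy head-movement model: serving requests one at a time in arrival order versus handing the whole batch to a single elevator sweep. The block numbers are arbitrary, and real drives also pay rotational costs that this sketch ignores.

```python
# Toy model: total distance the disk head travels servicing a batch
# of block requests in a given order.
def head_movement(requests, start=0):
    pos, total = start, 0
    for r in requests:
        total += abs(r - pos)
        pos = r
    return total

arrivals = [900, 10, 850, 40, 700, 5]      # chronological order
serialized = head_movement(arrivals)        # one request at a time, in order
elevator = head_movement(sorted(arrivals))  # batched: one sorted sweep

print(serialized)  # 4795
print(elevator)    # 900
```

Serializing the requests forces the chronological order and more than quintuples the head movement in this example; batching lets the elevator visit the blocks in layout order.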
[ I would like to see it adopt a policy that all dirty buffers for files not flagged as temporary are queued for writing, and that the existence of a dirty buffer means that the disk is busy. This will require replacing buffer I/O locking with copy-on-write, but an idle disk is such a terrible thing to waste. :-) ] [NTFS] by default adds unnecessary serialization to an extent that even older file systems such as [FFS] do not, and its performance characteristics reflect that. In fairness, it should be said that it is the superior approach for most removable media without software control of ejection (e.g. IBM PC floppies). Reiserfs employs a new scheme called preserve lists for ensuring recoverability, which avoids overwriting old meta-data by writing the meta-data nearby rather than over old meta-data. = Why Aggregate Small Objects at the File System Level? = There has long been a tradition of file system developers deciding that effective handling of small files is not significant to performance, and of application programmers who care about performance responding by not storing small objects as separate entities in the file system. To store small objects one may either make the file system efficient for the task, or sidestep the problem by aggregating small objects in a layer above the file system. Sidestepping the problem has three disadvantages: utility, code complexity, and performance. Utility and Code Complexity: Allowing OS designers to effectively use a single namespace with a single interface for both large and small objects decreases coding cost and increases the expressive power of components throughout the OS. I feel reiserfs shows the effects of a larger development investment focused on a simpler interface when compared with many solutions for this currently available in the object oriented toolkit community, such as the Structured Storage available in Microsoft's [OLE].
By simpler I mean I added nothing to the file system API to distinguish large and small objects, and I leave it to the directory semantics and archiving programs to aggregate objects. Multiple layers cost more to implement, cost more to code the interfaces for utilizing, and provide less flexibility. Performance: It is most commonly the case that when one layers one file system on top of another the performance is substantially reduced, and Structured Storage is not an exception to this general rule. Reiserfs, which does not attempt to delegate the small object problem to a layer above, avoids this performance loss. I have heard it suggested by some that this layering avoids the performance loss from syncing on file close as many file systems do. I suggest that this is adding an error to an error rather than fixing it. Let me make clear that I believe those who write such layers above the file system do not do so out of stupidity. I know of at least one company at which a solution that layers small object storage above the file system exists because the file system developers refused to listen to the non-file system group's description of its needs, and the file system group had to be sidestepped in generating the solution. Current file systems are fairly well designed for the purposes that their users currently use them for: my goal is to change file size usage patterns. The author remembers arguments that once showed clearly that there was no substantial market need for disk drives larger than 10MB based on current usage statistics. While [C-FFS] points out that 80% of file accesses are to files below 10k, I do not believe it reasonable to attempt to provide statistics based on usage measurements of file systems for which small files are inappropriate to use that will show that small files are frequently used. Application programmers are smarter than that. 
Currently 80% of file accesses are to the first order of magnitude in file size for which it is currently sensible to store the object in the file system. I regret that one can only speculate as to whether, once file systems become effective for small files and database tasks, usage patterns will change to 80% of file accesses being to files less than 100 bytes. What I can do is show via the 80/20 Banded File Set Benchmark presented later that in such circumstances small file performance potentially dominates total system performance. In summary, the on-going reinvention of incompatible object aggregation techniques above the file system layer is expensive, less expressive, less integrated, slower, and less efficient in its storage than incorporating balanced tree algorithms into the file system. = Tree Definitions = Balanced trees are used in databases, and more generally, wherever a programmer needs to search and store to non-random memory by a key, and has the time to code it this way. The usual evolution for programmers is to first think that hashing will be simpler and more efficient, and then realize only after getting into the sordid details of it that the combination of space efficiency, minimizing disk accesses, and the feasibility of caching the top part of the tree, makes the tree approach more effective. It is the usual thing to first try to do hashing, and then by the time the details are worked out, to have a balanced tree. The cost of effectively handling bucket overflow just isn't less than the cost of balancing, unless the buckets are always all in RAM. Hashing is often a good solution when there is no non-random memory involved, such as when hashing a cache. The Linux dcache code uses hashing for accessing a cache of in-memory directory entries. Sometimes one uses partial or full hashing of keys within that balanced tree.
If you do full hashing within a tree, and you cache the top part of that tree, you have something rather similar to extensible hashing, except it is more flexible and efficient. Sometimes programmers code using unbalanced trees. Most filesystems do essentially that. Balanced trees generally do a better job of minimizing the average number of disk accesses. There is literature establishing that balanced trees are optimal for the worst case when there is no caching of the tree. This is rather pointless literature, as the average case when cached is what is important, and I am afraid that the existing literature proves that which is feasible to prove rather than that which is relevant. That said, practitioners know from experience that making the tree less balanced leads to more I/Os. Discussions of the exceptions to this are rather interesting but not for here.... I regret that I must assume that the reader is familiar with basic balanced tree algorithms [Wood], [Lewis and Denenberg], [Knuth], [McCreight]. No attempt will be made to survey tree design here since balanced trees are one of the most researched and complex topics in algorithm theory, and require treatment at length. I must compound this discourtesy with a concise set of definitions that sorely lack accompanying diagrams, my apologies. Finally, I'll truly annoy the reader by saying that the header files contain nice ascii art, and if you want full definition of the structures, the source is the place. Classically, balanced trees are designed with the set of keys assumed to be defined by the application, and the purpose of the tree design is to optimize searching through those keys. In my approach the purpose of the tree is to optimize the reference locality and space efficient packing of objects, and the keys are defined as best optimizes the algorithm for that. 
Keys are used in place of inode numbers in the file system, thereby choosing to substitute a mapping of keys to node location (the internal nodes) for a mapping of inode number to file location. Keys are longer than inode numbers, but one needs to cache fewer of them than one would need to cache inode numbers when more than one file is stored in a node. In my tree, I still require that a filename be resolved one component at a time. It is an interesting topic for future research whether this is necessary or optimal. This is more complex of an issue than a casual reader might realize: directory at a time lookup accomplishes a form of compression, makes mounting other name spaces and file system extensions simpler, makes security simpler, and makes future enhanced semantics simpler. Since small files typically lead to large directories, it is fortuitous that as a natural consequence of our use of tree algorithms, our directory mechanisms are much more effective for very large directories than most other file systems are (notable exceptions include [Holton and Das]). The tree has three node types: internal nodes, formatted nodes, and unformatted nodes. The contents of internal and formatted nodes are sorted in the order of their keys. (Unformatted nodes contain no keys.) Internal nodes consist of pointers to sub-trees separated by their delimiting keys. The key that precedes a pointer to a sub-tree is a duplicate of the first key in the first formatted node of that sub-tree. Internal nodes exist solely to allow determining which formatted node contains the item corresponding to a key. ReiserFS starts at the root node, examines its contents, and based on it can determine which subtree contains the item corresponding to the desired key. From the root node reiserfs descends into the tree, branching at each node, until it reaches the formatted node containing the desired item. 
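The descent just described can be sketched as a minimal lookup over a two-level tree. Internal nodes hold delimiting keys and child pointers; a search key equal to a delimiting key branches right, since the delimiting key duplicates the first key of the subtree it precedes. Item headers, stat data layout, and the unformatted level are omitted from this sketch.

```python
import bisect

class Internal:
    def __init__(self, keys, children):
        self.keys = keys          # sorted delimiting keys
        self.children = children  # len(children) == len(keys) + 1

class Leaf:
    def __init__(self, items):
        self.items = dict(items)  # key -> item body (a formatted node)

def lookup(node, key):
    # Descend from the root, branching at each internal node, until the
    # formatted node that would contain the item is reached.
    while isinstance(node, Internal):
        node = node.children[bisect.bisect_right(node.keys, key)]
    return node.items.get(key)

leaves = [Leaf([(1, "stat data"), (3, "tail")]),
          Leaf([(7, "dir entry"), (9, "tail")])]
root = Internal([7], leaves)  # 7 duplicates the right leaf's first key

print(lookup(root, 9))  # tail
print(lookup(root, 2))  # None (no such item)
```

With one node cached per level, a lookup costs at most one disk access per level of the tree, which is why the equal path lengths and high fanout discussed below matter.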
The first (bottom) level of the tree consists of unformatted nodes, the second level consists of formatted nodes, and all levels above consist of internal nodes. The highest level contains the root node. The number of levels is increased as needed by adding a new root node at the top of the tree. All paths from the root of the tree to all formatted leaves are equal in length, and all paths to all unformatted leaves are also equal in length and 1 node longer than the paths to the formatted leaves. This equality in path length and the high fanout it provides are vital to high performance, and in the Drops section I will describe how the lengthening of the path that occurred as a result of introducing the [BLOB] approach (the use of indirect items and unformatted nodes) proved a measurable mistake. Formatted nodes consist of items. Items have four types: direct items, indirect items, directory items, and stat data items. All items contain a key which is unique to the item. This key is used to sort, and find, the item. Direct items contain the tails of files, and tails are the last part of the file (the last file_size modulo block size bytes of a file). Indirect items consist of pointers to unformatted nodes. All but the tail of the file is contained in its unformatted nodes. Directory items contain the key of the first directory entry in the item followed by a number of directory entries. Depending on the configuration of reiserfs, stat data may be stored as a separate item, or it may be embedded in a directory entry. We are still benchmarking to determine which way is best. A file consists of a set of indirect items followed by a set of up to two direct items, with the existence of two direct items representing the case when a tail is split across two nodes.
If a tail is larger than the maximum size of a file that can fit into a formatted node but is smaller than the unformatted node size (4k), then it is stored in an unformatted node, and a pointer to it plus a count of the space used is stored in an indirect item. Directories consist of a set of directory items. Directory items consist of a set of directory entries. Directory entries contain the filename and the key of the file which is named. There is never more than one item of the same item type from the same object stored in a single node (there is no reason one would want to use two separate items rather than combining them). The first item of a file or directory contains its stat data. When performing balancing, and analyzing the packing of the node and its two neighbors, we ensure that the three nodes cannot be compressed into two nodes. I feel greater compression than this is best left to an FS cleaner to perform rather than attempting it dynamically.

= ReiserFS Structures =

The ReiserFS tree has Max_Height = N (the current default value is N = 5). The tree lies in disk blocks, and each disk block that belongs to the reiserfs tree begins with a block head.

A disk block holding an internal node of the tree contains keys and pointers to disk blocks:

<pre>
Block_Head | Key 0 | Key 1 | Key 2 | --- | Key N | Pointer 0 | Pointer 1 | Pointer 2 | --- | Pointer N | Pointer N+1 | ..Free Space..
</pre>

A disk block holding a leaf node of the tree contains items and item headers:

<pre>
Block_Head | IHead 0 | IHead 1 | IHead 2 | --- | IHead N | ......Free Space...... | Item N | --- | Item 2 | Item 1 | Item 0
</pre>

A disk block holding an unformatted node contains only the data of a big file.

ReiserFS objects are files and directories. The maximum number of objects is 2^32 - 4 = 4,294,967,292. Each object is a set of items.

File items:
# StatData item + [Direct item] (for small files: size from 0 bytes to MAX_DIRECT_ITEM_LEN = blocksize - 112 bytes)
# StatData item + InDirect item + [Direct item] (for big files: size > MAX_DIRECT_ITEM_LEN bytes)

Directory items:
# StatData item + Directory item

Every reiserfs object has an Object ID and a Key.

== Internal Node Structures ==

struct block_head:
{|
! Field Name !! Type !! Size (bytes) !! Description
|-
| blk_level || unsigned short || 2 || Level of the block in the tree (1 = leaf; 2, 3, 4, ... = internal)
|-
| blk_nr_item || unsigned short || 2 || Number of keys in an internal block, or number of items in a leaf block
|-
| blk_free_space || unsigned short || 2 || Free space in the block, in bytes
|-
| blk_right_delim_key || struct key || 16 || Right delimiting key for this block (leaf nodes only)
|}
Total: 6 bytes (occupying 8) for internal nodes; 22 bytes (occupying 24) for leaf nodes.

struct key:
{|
! Field Name !! Type !! Size (bytes) !! Description
|-
| k_dir_id || __u32 || 4 || ID of the parent directory
|-
| k_object_id || __u32 || 4 || ID of the object (also the inode number)
|-
| k_offset || __u32 || 4 || Offset from the beginning of the object to the current byte of the object
|-
| k_uniqueness || __u32 || 4 || Type of the item (StatData = 0, Direct = -1, InDirect = -2, Directory = 500)
|}
Total: 16 bytes.

struct disk_child (pointer to a disk block):
{|
! Field Name !! Type !! Size (bytes) !! Description
|-
| dc_block_number || unsigned long || 4 || Disk child's block number
|-
| dc_size || unsigned short || 2 || Disk child's used space
|}
Total: 6 bytes (occupying 8).

== Leaf Node Structures ==

A leaf node begins with the same struct block_head as an internal node (22 bytes, occupying 24, since the right delimiting key is present in leaf nodes).

Everything in the filesystem is stored as a set of items. Each item has an item_head. The item_head contains the key of the item, its free space (for indirect items), and the location of the item itself within the block.

struct item_head (IHead):
{|
! Field Name !! Type !! Size (bytes) !! Description
|-
| ih_key || struct key || 16 || Key used to search for the item; all item headers are sorted by this key
|-
| u.ih_free_space / u.ih_entry_count || __u16 || 2 || Free space in the last unformatted node for an InDirect item; 0xFFFF for a Direct item; 0xFFFF for a StatData item; the number of directory entries for a Directory item
|-
| ih_item_len || __u16 || 2 || Total size of the item body
|-
| ih_item_location || __u16 || 2 || Offset of the item body within the block
|-
| ih_reserved || __u16 || 2 || Used by reiserfsck
|}
Total: 24 bytes.

There are 4 types of items: stat_data items, directory items, indirect items, and direct items.

struct stat_data (the reiserfs version of a UFS disk inode, minus the address blocks):
{|
! Field Name !! Type !! Size (bytes) !! Description
|-
| sd_mode || __u16 || 2 || File type and permissions
|-
| sd_nlink || __u16 || 2 || Number of hard links
|-
| sd_uid || __u16 || 2 || Owner id
|-
| sd_gid || __u16 || 2 || Group id
|-
| sd_size || __u32 || 4 || File size
|-
| sd_atime || __u32 || 4 || Time of last access
|-
| sd_mtime || __u32 || 4 || Time the file was last modified
|-
| sd_ctime || __u32 || 4 || Time the inode (stat data) was last changed (except changes to sd_atime and sd_mtime)
|-
| sd_rdev || __u32 || 4 || Device
|-
| sd_first_direct_byte || __u32 || 4 || Offset from the beginning of the file to the first byte of the file's direct item: -1 for a directory; 1 for small files (the file has direct items only); >1 for big files (the file has indirect and direct items); -1 for big files that have indirect items but no direct item
|}
Total: 32 bytes.

A directory item consists of directory entry heads followed by filenames:

<pre>
deHead 0 | deHead 1 | deHead 2 | --- | deHead N | fileName N | --- | fileName 2 | fileName 1 | fileName 0
</pre>

A direct item is simply the body of a small file. An indirect item is an array of pointers to unformatted blocks (each unfPointer is 4 bytes); the unformatted blocks contain the body of a big file.

struct reiserfs_de_head (deHead):
{|
! Field Name !! Type !! Size (bytes) !! Description
|-
| deh_offset || __u32 || 4 || Third component of the directory entry key (all reiserfs_de_head are sorted by this value)
|-
| deh_dir_id || __u32 || 4 || Object id of the parent directory of the object referenced by this entry
|-
| deh_objectid || __u32 || 4 || Object id of the object referenced by this entry
|-
| deh_location || __u16 || 2 || Offset of the name within the whole item
|-
| deh_state || __u16 || 2 || Flags: 1) the entry contains stat data (for future use); 2) the entry is hidden (unlinked)
|}
Total: 16 bytes.

fileName is the name of the file, an array of bytes of variable length. The maximum filename length is blocksize - 64 (for a 4k blocksize, the maximum name length is 4032 bytes).

= Using the Tree to Optimize Layout of Files =

There are four levels at which layout optimization is performed:
# the mapping of logical block numbers to physical locations on disk
# the assigning of nodes to logical block numbers
# the ordering of objects within the tree, and
# the balancing of the objects across the nodes they are packed into.

== Physical Layout ==

This mapping is performed by the disk drive manufacturer for SCSI drives; for IDE drives the mapping of logical block numbers to physical locations is done by the device driver; and for all drives it is also potentially done by volume management software.
The logical block number to physical location mapping by the drive manufacturer is usually done using cylinders. I agree with the authors of [ext2fs] and most others that the significant file placement feature of FFS was not the actual cylinder boundaries, but placing files and their inodes on the basis of their parent directory's location. FFS used explicit knowledge of actual cylinder boundaries in its design. I find that minimizing the distance in logical blocks between semantically adjacent nodes, without tracking cylinder boundaries, accomplishes an excellent approximation of optimizing according to actual cylinder boundaries, and I find its simplicity an aid to implementation elegance.

== Node Layout ==

When I place nodes of the tree on the disk, I search for the first empty block in the bitmap (of used block numbers), starting at the location of the left neighbor of the node in the tree ordering and moving in the direction I last moved in. This was experimentally found to be better than the following alternatives for the benchmarks employed: 1) taking the first non-zero entry in the bitmap, 2) taking the entry after the last one that was assigned in the direction last moved in (this was 3% faster for writes and 10-20% slower for subsequent reads), and 3) starting at the left neighbor and moving in the direction of the right neighbor. When changing block numbers for the purpose of avoiding overwriting sending nodes before shifted items reach disk in their new recipient node (see the description of preserve lists later in the paper), the benchmarks employed were ~10% faster when starting the search from the left neighbor rather than from the node's current block number, even though it adds significant overhead to determine the left neighbor (the current implementation risks I/O to read the parent of the left neighbor). It used to be that we would reverse direction when we reached the end of the disk drive.
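The search policy described above can be sketched as follows. The byte-per-block bitmap and the `find_free_block` helper are illustrative simplifications, not the on-disk bitmap format:

```c
/* Sketch of the block-allocation search described in the text:
 * begin at the block number of the node's left neighbor in the
 * tree ordering and scan the used-block bitmap in the direction
 * last moved in (dir = +1 or -1), returning the first free block,
 * or -1 if the edge of the device is reached first.  One byte per
 * block (0 = free, 1 = in use) is an illustrative simplification. */
static long find_free_block(const unsigned char *bitmap, long nblocks,
                            long start, int dir)
{
    for (long b = start; b >= 0 && b < nblocks; b += dir)
        if (!bitmap[b])
            return b;           /* first free block in that direction */
    return -1;                  /* ran off the end of the device      */
}
```

A directional scan like this keeps newly allocated nodes close to their left neighbors in the tree order, which is the property the text is optimizing for.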
Fortunately we checked to see if it makes a difference which direction one moves in when allocating blocks to a file, and indeed we found it made a significant difference to always allocate in the increasing block number direction. We hypothesize that this is due to matching disk spin direction by allocating using increasing block numbers.

== Ordering within the Tree ==

While I give here an example of how I have defined keys to optimize locality of reference and packing efficiency, I would like to stress that key definition is a powerful and flexible tool that I am far from finished experimenting with. Some key definition decisions depend very much on usage patterns, and this means that someday one will select from several key definitions when creating the file system. For example, consider the decision of whether to pack all directory entries together at the front of the file system, or to pack the entries near the files they name. For large file usage patterns one should pack all directory items together, since systems with such usage patterns are effective in caching the entries for all directories. For small files the name should be near the file. Similarly, for large files the stat data should be stored separately from the body, either with the other stat data from the same directory, or with the directory entry. (It was likely a mistake for me to not assign stat data its own key in the current implementation, as packing it in with direct and indirect items complicates our code for handling those items, and prevents me from easily experimenting with the effects of changing its key assignment.) It is not necessary for a file's packing to reflect its name; that is merely my default. With each file my next release will offer the option of overriding the default by use of a system call.
It is feasible to pack an object completely independently of its semantics using these algorithms, and I predict that there will be many applications, perhaps even most, for which a packing different than that determined by object names is more appropriate. Currently the mandatory tying of packing locality and semantics results in the distortion of both semantics and packing from what might otherwise be their independent optimums, much as tying block boundaries to file boundaries distorts I/O and space allocation algorithms from their separate optimums. For example, placing most files accessed while booting in their access order at the start of the disk is a very tempting future optimization that the use of packing localities makes feasible to consider.

The Structure of a Key: Each file item has a key with structure <locality_id, object_id, offset, uniqueness>. The locality_id is by default the object_id of the parent directory. The object_id is the unique id of the file, and this is set to the first unused objectid when the object is created. The tendency of this to result in successive object creations in a directory being adjacently packed is often fortuitous for many usage patterns. For files the offset is the offset within the logical object of the first byte of the item. In version 0.2 all directory entries had their own individual keys stored with them and were each distinct items; in the current version I store one key in the item, which is the key of the first entry, and compute each entry's key as needed from the one key stored in the item. For directories the offset key component is the first four bytes of the filename, which you may think of as the lexicographic rather than numeric offset. For directory items the uniqueness field differentiates filename entries identical in the first 4 bytes.
For all item types it indicates the item type and for the leftmost item in a buffer it indicates whether the preceding item in the tree is of the same type and object as this item. Placing this information in the key is useful when analyzing balancing conditions, but increases key length for non-directory items, and is a questionable architectural feature. Every file has a unique objectid, but this cannot be used for finding the object, only keys are used for that. Objectids merely ensure that keys are unique. If you never use the reiserfs features that change an object's key then it is immutable, otherwise it is mutable. (This feature aids support for NFS daemons, etc.) We spent quite some time debating internally whether the use of mutable keys for identifying an object had deleterious long term architectural consequences: in the end I decided it was acceptable iff we require any object recording a key to possess a method for updating its copy of it. This is the architectural price of avoiding caching a map of objectid to location that might have very poor locality of reference due to objectids not changing with object semantics. I pack an object with the packing locality of the directory it was first created in unless the key is explicitly changed. It remains packed there even if it is unlinked from the directory. I do not move it from the locality it was created in without an explicit request, unlike the [C-FFS] approach which stores all multiple link files together and pays the cost of moving them from their original locations when the second link occurs. I think a file linked with multiple directories might as well get at least the locality reference benefits of one of those directories. In summary, this approach 1) places files from the same directory together, 2) places directory entries from the same directory together with each other and with the stat data for the directory. 
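The semantic packing described above follows from comparing keys lexicographically, component by component. A minimal sketch, assuming flat 32-bit components (the on-disk key additionally encodes the item type in its last component):

```c
/* The four-component key from the text.  Lexicographic comparison
 * means all items of one packing locality (parent directory) sort
 * together, and within it all items of one object sort together. */
struct key {
    unsigned int locality_id;   /* by default the parent directory's id */
    unsigned int object_id;     /* the object's own unique id           */
    unsigned int offset;        /* byte offset, or 4-byte name prefix   */
    unsigned int uniqueness;    /* item type / filename tie-breaker     */
};

/* Compare two keys component by component; returns <0, 0, or >0. */
static int key_cmp(const struct key *a, const struct key *b)
{
    if (a->locality_id != b->locality_id)
        return a->locality_id < b->locality_id ? -1 : 1;
    if (a->object_id != b->object_id)
        return a->object_id < b->object_id ? -1 : 1;
    if (a->offset != b->offset)
        return a->offset < b->offset ? -1 : 1;
    if (a->uniqueness != b->uniqueness)
        return a->uniqueness < b->uniqueness ? -1 : 1;
    return 0;
}
```

Because the locality_id is the most significant component, two files created in the same directory sort adjacent to each other regardless of when they were created, which is exactly the packing behavior the text describes.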
Note that there is no interleaving of objects from different directories in the ordering at all, and that all directory entries from the same directory are contiguous. You'll note that this does not accomplish packing the files of small directories with common parents together, and does not employ the full partial ordering in determining the linear ordering; it merely uses parent directory information. I feel the proper place for employing full tree structure knowledge is in the implementation of an FS cleaner, not in the dynamic algorithms.

== Node Balancing Optimizations ==

When balancing nodes I do so according to the following ordered priorities:
# minimize the number of nodes used
# minimize the number of nodes affected by the balancing operation
# minimize the number of uncached nodes affected by the balancing operation
# if shifting to another formatted node is necessary, maximize the bytes shifted

Priority 4 is based on the assumption that the location of an insertion of bytes into the tree is an indication of the likely future location of an insertion, and that policy 4 will on average reduce the number of formatted nodes affected by future balance operations. There are more subtle effects as well, in that if one randomly places nodes next to each other, and one has a choice between those nodes being mostly moderately efficiently packed or packed to an extreme of either well or poorly packed, one is more likely to be able to combine more of the nodes if one chooses the policy of extremism. Extremism is a virtue in space efficient node packing. The maximal shift policy is not applied to internal nodes, as extremism is not a virtue in time efficient internal node balancing.
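Priority 1 restates the packing invariant given earlier: a node and its two neighbors must not be compressible into two nodes. A minimal sketch of that check, assuming a 4k block with a 24-byte leaf block head (the constant and helper name are illustrative):

```c
/* Usable bytes in a leaf node: 4k block minus the 24-byte leaf
 * block head described in the structure tables (an illustrative
 * simplification that ignores per-item header overhead). */
enum { NODE_CAPACITY = 4096 - 24 };

/* Return 1 if the used bytes of a node and its two neighbors would
 * fit into two nodes, i.e. the balancing invariant is violated and
 * the three nodes should be merged into two. */
static int needs_merge(int used_left, int used_mid, int used_right)
{
    return used_left + used_mid + used_right <= 2 * NODE_CAPACITY;
}
```

Checking the invariant over a three-node window, rather than pairwise, is what lets balancing guarantee that no two adjacent nodes could silently be collapsed into one.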
=== Drops ===

(The difficult design issues in the current version that our next version can do better.)

Consider dividing a file or directory into drops, with each drop having a separate key, and no two drops from one file or directory occupying the same node without being compressed into one drop. The key for each drop is set to the key for the object (file or directory) plus the offset of the drop within the object. For directories the offset is lexicographic and by filename; for files it is numeric and in bytes. In the course of several file system versions we have experimented with and implemented solid, liquid, and air drops. Solid drops were never shifted, and drops would only solidify when they occupied the entirety of a formatted node. Liquid drops are shifted in such a way that any liquid drop which spans a node fully occupies the space in its node. Like a physical liquid it is shiftable but not compressible. Air drops merely meet the balancing condition of the tree. Reiserfs 0.2 implemented solid drops for all but the tail of files. If a file was at least one node in size it would align the start of the file with the start of a node, block aligning the file. This block alignment of the start of multi-drop files was a design error that wasted space: even if the locality of reference is so poor as to make one not want to read parts of semantically adjacent files, if the nodes are near to each other then the cost of reading an extra block is thoroughly dwarfed by the cost of the seek and rotation to reach the first node of the file. As a result the block alignment saves little in time, though it costs significant space for 4-20k files. Reiserfs with block alignment of multi-drop files and no indirect items experienced the following rather interesting behavior that was partially responsible for making it only 88% space efficient for files that averaged 13k (the linux kernel) in size.
When the tail of a larger than 4k file was followed in the tree ordering by another file larger than 4k, since the drop before was solid and aligned, and the drop afterwards was solid and aligned, no matter what size the tail was, it occupied an entire node. In the current version we place all but the tail of large files into a level of the tree reserved for full unformatted nodes, and create indirect items in the formatted nodes which point to the unformatted nodes. This is known in the database literature as the [BLOB] approach. This extra level added to the tree comes at the cost of making the tree less balanced (I consider the unformatted nodes pointed to as part of the tree) and increasing the maximal depth of the tree by 1. For medium sized files, the use of indirect items increases the cost of caching pointers by mixing data with them. The reduction in fanout often causes the read algorithms to fetch only one node at a time of the file being read more frequently, as one waits to read the uncached indirect item before reading the node with the file data. There are more parents per file read with the use of indirect items than with internal nodes, as a direct result of reduced fanout due to mixing tails and indirect items in the node. The most serious flaw is that these reads of various nodes necessary to the reading of the file have additional rotations and seeks compared to the case with drops. With my initial drop approach they are usually sequential in their disk layout, even the tail, and the internal node parent points to all of them in such a way that all of them that are contained by that parent or another internal node in cache can be requested at once in one sequential read. Non-sequential reads of nodes are more than an order of magnitude more costly than sequential reads, and this single consideration dominates effective read optimization. 
Unformatted nodes make file system recovery faster and less robust, in that one reads their indirect item rather than them to insert them into the recovered tree, and one cannot read them to confirm that their contents are from the file that an indirect item says they are from. In this they make reiserfs similar to an inode based system without logging. A moderately better solution would have been to have simply eliminated the requirement for placement of the start of multi-node files at the start of nodes, rather than introducing BLOBs, and to have depended on the use of a file system cleaner to optimally pack the 80% of files that don't move frequently using algorithms that move even solid drops. Yet that still leaves the problem of formatted nodes not being efficient for mmap() purposes (one must copy them before writing rather than merely modifying their page table entries, and memory bandwidth is expensive even if CPU is cheap). For this reason I have the following plan for the next version. I will have three trees: one tree maps keys to unformatted nodes, one tree maps keys to formatted nodes, and one tree maps keys to directory entries and stat_data. It is only natural to think that this would mean that to read a file and access first the directory entry and stat_data, then the unformatted node, then the tail, one must hop long distances across the disk, going first to one tree and then the other. This is indeed why it took me two years to realize it could be made to work. My plan is to interleave the nodes of the three trees according to the following algorithm: Block numbers are assigned to nodes when the nodes are created, or preserved, and someday will be assigned when the cleaner runs. The choice of block number is based on first determining what other node it should be placed near, and then finding the nearest free block that can be found in the elevator's current direction.
Currently we use the left neighbor of the node in the tree as the node it should be placed near. This is nice and simple. Oh well. Time to create a virtual neighbor layer. The new scheme will continue to first determine the node it should be placed near, and then start the search for an empty block from that spot, but it will use a more complicated determination of what node to place it near. This method will cause all nodes from the same packing locality to be near each other, will cause all directory entries and stat_data to be grouped together within that packing locality, and will interleave formatted and unformatted nodes from the same packing locality. Pseudo-code is best for describing this:

<pre>
/* For use by reiserfs_get_new_blocknrs when determining where in the
   bitmap to start the search for a free block, and for use by the
   read-ahead algorithm when there are not enough nodes to the right
   and in the same packing locality for packing-locality read-ahead
   purposes. */
get_logical_layout_left_neighbors_blocknr(key of current node)
{
    /* Based on examination of the current node's key and type,
       find the virtual neighbor of that node. */
    if body node
        if first body node of file
            if (node in tail tree whose key is less but is in same packing locality exists)
                return blocknr of such node with largest key
            else
                find node with largest key less than key of current node in stat_data tree
                return its blocknr
        else
            return blocknr of node in body tree with largest key less than key of current node
    else if tail node
        if (node in body tree belonging to same file as first tail of current node exists)
            return its blocknr
        else if (node in tail tree with lesser delimiting key but same packing locality exists)
            return blocknr of such node with largest delimiting key
        else
            return blocknr of node with largest key less than key of current node in stat_data tree
    else /* is stat_data tree node */
        if stat_data node with lesser key from same packing locality exists
            return blocknr of such node with largest key
        else
            /* no node from same packing locality with lesser key exists */
}

/* For use by packing-locality read-ahead. */
get_logical_layout_right_neighbors_blocknr(key of current node)
{
    right-handed version of get_logical_layout_left_neighbors_blocknr logic
}
</pre>

It is my hope that this will improve caching of pointers to unformatted nodes and caching of directory entries and stat_data, by separating them from file bodies to a greater extent. I also hope that it will improve read performance for 1-10k files, and that it will allow us to do this without decreasing space efficiency.

=== Code Complexity ===

I thought it appropriate to mention some of the notable effects of simple design decisions on our implementation's code length. When we changed our balancing algorithms to shift parts of items rather than only whole items, so as to pack nodes tighter, this had an impact on code complexity.
Another multiplicative determinant of balancing code complexity was the number of item types: introducing indirect items doubled this, and changing directory items from being liquid drops to being air drops also increased it. Storing stat data in the first direct or indirect item of the file complicated the code for processing those items more than if I had made stat data its own item type. When one finds oneself with an NxN coding complexity issue, it usually indicates the need for adding a layer of abstraction. The NxN effect of the number of items on balancing code complexity is an instance of that design principle, and we will address it in the next major rewrite. The balancing code will employ a set of item operations which all item types must support. The balancing code will then invoke those operations without caring to understand any more of the meaning of an item's type than that it determines which item-specific operation handler is called. Adding a new item type, say a compressed item, will then merely require writing a set of item operations for that item rather than requiring modifying most parts of the balancing code as it does now. We now feel that the function to determine what resources are needed to perform a balancing operation, fix_nodes(), might as well be written to decide what operations will be performed during balancing since it pretty much has to do so anyway. That way, the function that performs the balancing with the nodes locked, do_balance(), can be gutted of most of its complexity.

= Buffering & the Preserve List =

We implemented for version 0.2 of our file system a system of write ordering that tracked all shifting of items in the tree, and ensured that no node that had had an item shifted from it was written before the node that had received the item was written. This is necessary to prevent a system crash from causing the loss of an item that might not be recently created.
This tracking approach worked, and the overhead it imposed was not measurable in our benchmarks. When in the next version we changed to partially shifting items and increased the number of item types, this code grew out of control in its complexity. I decided to replace it with a scheme that was far simpler to code and also more effective in typical usage patterns. This scheme was as follows: if an item is shifted from a node, change the block that its buffer will be written to. Change it to the nearest free block to the old block's left neighbor, and rather than freeing the old block, place its number on a "preserve list". (Saying nearest is slightly simplistic, in that the blocknr assignment function moves from the left neighbor in the direction of increasing block numbers.) When a "moment of consistency" is achieved, free all of the blocks on the preserve list. A moment of consistency occurs when there are no nodes in memory into which objects have been shifted (this could be made more precise, but then it would be more complex). If disk space runs out, force a moment of consistency to occur. This is sufficient to ensure that the file system is recoverable. Note that during the large file benchmarks the preserve list was freed several times in the middle of the benchmark. The percentage of buffers preserved is small in practice except during deletes, and one can arrange for moments of consistency to occur as frequently as one wants. Note that I make no claim that this approach is better than the Soft Updates approach employed by [Granger] or by us in version 0.2; I merely note that tracking the order of writes is more complex than this approach for balanced trees which partially shift items. We may go back to the old approach some day, though not to the code that I threw out. Preserve lists substantially hamper performance for files in the 1-10k size range. We are re-evaluating them.
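The scheme can be sketched as follows; the fixed-capacity list and the helper names are illustrative simplifications of what a kernel implementation would do:

```c
/* Sketch of the preserve-list scheme described in the text: when
 * items are shifted out of a node, the node gets a new block number
 * and its old block number is put on a preserve list instead of
 * being freed; at a "moment of consistency" everything on the list
 * is freed at once.  A fixed-capacity array stands in for whatever
 * dynamic structure a real implementation would use. */
struct preserve_list {
    long blocknr[256];
    int count;
};

/* Called when a node is relocated: defer freeing its old block. */
static void preserve_block(struct preserve_list *pl, long old_blocknr)
{
    if (pl->count < 256)
        pl->blocknr[pl->count++] = old_blocknr;
}

/* Called at a moment of consistency: no node in memory holds shifted
 * items that are not yet safely on disk, so the old copies are no
 * longer needed.  The actual bitmap update is elided here; returns
 * the number of blocks released. */
static int free_preserved(struct preserve_list *pl)
{
    int freed = pl->count;
    pl->count = 0;
    return freed;
}
```

Deferring the frees this way is what guarantees that, at any crash point before a moment of consistency, the old copy of every shifted item still exists on disk.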
Ext2fs avoids the metadata shifting problem by never shrinking directories, and using fixed inode space allocations.

= Lessons From Log Structured File Systems =

Many techniques from other file systems haven't been applied primarily so as to satisfy my goal of giving reiserfs 1.0 only the minimum feature set necessary to be useful, and will appear in later releases. Log Structured File Systems [Rosenblum and Ousterhout] embody several such techniques, which I will describe after I mention two concerns with that approach:
* With small object file systems it is not feasible to cache in RAM a map of objectid to location for every object since there are too many objects. This is an inherent problem in using temporal packing rather than semantic packing for small object file systems. With my approach my internal nodes are the equivalent of this objectid to location map, but internal node total size is proportional to the number of nodes rather than the number of objects. You can think of internal nodes as a compression of object location information made effective by the existence of an ordering function, and this compression is both essential for small files, and a major feature of my approach.
* I like obtaining good though not ideal semantic locality without paying a cleaning cost for active data. This is a less critical concern.

I frequently find myself classifying packing and layout optimizations as either appropriate for implementing dynamically or appropriate only for a cleaner. Optimizations whose computational overhead is large compared to their benefit tend to be appropriate for implementation in a cleaner, and a cleaner's benefits mostly impact the static portion of the file system (which typically consumes ~80% of the space).
Such objectives as 100% packing efficiency, exactly ordering block layout by semantic order, using the full semantic tree rather than the parent directory in determining semantic order, and compression are all best implemented by cleaner approaches. In summary, there is much to be learned from the LFS approach, and as I move past my initial objective of supplying a minimal-feature, higher-performance FS I will apply some of those lessons. In the Preserve Lists section I speculate on the possibilities for a fastboot implementation that would merge the better features of preserve lists and logging. = Directions For the Future = To go one more order of magnitude smaller in file size will require adding functionality to the file system API, though it will not require discarding upward compatibility. The use of an exokernel is a better approach to small files if it is an option available to the OS designer; it is not currently an option for Linux users. In the future reiserfs will add such features as lightweight files in which stat_data other than size is inherited from a parent if it is not created individually for the file, an API for reading and writing to files without requiring the overhead of file handles and open(), set-theoretic semantics, and many other features that you would expect from researchers who expect to be able to do all that they could do in a database, in the file system, and never really did understand why not. = Conclusion = Balanced tree file systems are inherently more space efficient than block allocation based file systems, with the differences reaching order-of-magnitude levels for small files. While other aspects of design will typically have a greater impact on performance for large files, in direct proportion to the smallness of the file the use of balanced trees offers performance advantages. A moderate advantage was found for large files.
Coding cost is mostly in the interfaces, and it is a measure of the OS designer's skill whether those costs are low in the OS. We make it possible for an OS designer to use the same interface for large and small objects, and thereby reduce interface coding cost. This approach is a new tool available to the OS designer for increasing the expressive power of all of the components in the OS through better name space integration. Researchers interested in collaborating or just using my work will find me friendly. I tailor the framework of my collaborations to the needs of those I work with. I GPL reiserfs so as to meet the needs of academic collaborators. While that makes it unusable without a special license for commercial OSes, commercial vendors will find me friendly in setting up a commercial framework for commercial collaboration with commercial needs provided for. = Acknowledgments = Hans Reiser was the project initiator, primary architect, supplier of funding, and one of the programmers. Some folks at times remark that naming the file system Reiserfs was egotistic. It was so named after a potential investor hired all of my employees away from me, then tried to negotiate better terms for his possible investment, and suggested that he could arrange for 100 researchers to swear in Russian court that I had had nothing to do with this project. That business partnership did not work out. Vladimir Saveljev, while he did not author this paper, worked long hours writing the largest fraction of the lines of code in the file system, and is remarkably gifted at just making things work. Thanks, Vladimir. Anatoly Pinchuk wrote much of the core balancing code, and too much of the rest to list here. Thanks, Anatoly. It is the policy of the Naming System Venture that if someone quits before project completion, and then takes strong steps to try to prevent others from finishing the project, they shall not be mentioned in the acknowledgments.
This was all quite sad, and best forgotten. I would like to thank Alfred Ajlamazyan for his generosity in providing overhead at a time when his institute had little it could easily spare. Grigory Zaigralin is thanked for his work in making the machines run, administering the money, and being his usual determined-to-be-useful self. Igor Chudov, thanks for such effective procurement and hardware maintenance work. Eirik Fuller is thanked for his help with NFS and porting to 2.1. I would like to thank Rémy Card for the superb block allocation based file system (ext2fs) that I depended on for so many years, and that allowed me to benchmark against the best. Linus Torvalds, thank you for Linux. = Business Model and Licensing = I personally favor performing a balance of commercial and public works in my life. I have no axe to grind against software that is charged for, and no regrets at making reiserfs freely available to Linux users. This project is GPL'd, but I sell exceptions to the GPL to commercial OS vendors and file server vendors. It is not usable to them without such exceptions, and many of them are wise enough to understand: * that the porting and integration service we are able to provide with the licensing is by itself worth what we charge, * that these services impact their time to market, * and that the relationship spreads the development costs across more OS vendors than just them alone. I expect that Linux will prove to be quite effective in market sampling my intended market, but if you suspect that I also like seeing more people use it even if it is free to them, oh well. I believe it is not so much the cost that has made Linux so successful as it is the openness. Linux is a decentralized economy with honor and recognition as the currency of payment (and thus there is much honor in it). Commercial OS vendors are, at the moment, all closed economies, and doomed to fall in their competition with open economies just as communism eventually fell.
At some point an OS vendor will realize that if it: * opens up its source code to decentralized modification, * systematically rewards those who perform the modifications that are proven useful, * systematically merges/integrates those modifications into its branded primary release branch while adding value as the integrator, then it will acquire both the critical mass of the internet development community and the aggressive edge that no large communal group (such as a corporation) can have. Rather than saying to any such vendor that they should do this now, let me simply point out that whoever is first will have an enormous advantage. Since I have more recognition than money to pass around as reward, my policy is to tend to require that those who contribute substantial software to this project have their names attached to a user-visible portion of the project. This official policy helps me deal with folks like Vladimir, who was much too modest to ever name the file system checker vsck without my insisting. Smaller contributions are to be noted in the source code, and in the acknowledgments section of this paper. If you choose to contribute to this file system, and your work is accepted into the primary release, you should let me know if you want me to look for opportunities to integrate you into contracts from commercial vendors. Through packaging ourselves as a group, we are more marketable to such OS vendors. Many of us have spent too much time working at day jobs unrelated to our Linux work. This is too hard, and I hope to make things easier for us all. If you like this business model of selling GPL'd component software with related support services, but you write software not related to this file system, I encourage you to form a component supplier company also. Opportunities may arise for us to cooperate in our marketing, and I will be happy to do so. = References = * G.M. Adel'son-Vel'skii and E.M.
Landis, [http://en.scientificcommons.org/19884302 An algorithm for the organization of information], Soviet Math. Doklady 3, 1259-1262, 1962. This paper on AVL trees can be thought of as the founding paper of the field of storing data in trees. Those not conversant in Russian will want to read the [Lewis and Denenberg] treatment of AVL trees in its place. [Wood] contains a modern treatment of trees. * [Apple] Apple Computer Inc., [http://books.google.com/books?as_isbn=0201177323 Inside Macintosh, Files], Addison-Wesley, 1992. Employs balanced trees for filenames; it was an interesting file system architecture for its time in a number of ways, but its problems with internal fragmentation have become more severe as disk drives have grown larger, and the code has not received sufficient further development. * [Bach] Maurice J. Bach, [http://portal.acm.org/citation.cfm?id=8570 The Design of the Unix Operating System], 1986, Prentice-Hall Software Series, Englewood Cliffs, NJ. Superbly written but sadly dated; contains detailed descriptions of the file system routines and interfaces in a manner especially useful for those trying to implement a Unix-compatible file system. See [Vahalia]. * [BLOB] R. Haskin, Raymond A. Lorie, [http://portal.acm.org/citation.cfm?id=582353.582390 On Extending the Functions of a Relational Database System], SIGMOD Conference 1982: 207-212 (body of paper not on web). See the Drops section for a discussion of how this approach makes the tree less balanced, and the effect that has on performance. * [Chen] Chen, P.M., Patterson, David A., [http://www.eecs.berkeley.edu/Pubs/TechRpts/1992/6129.html A New Approach to I/O Performance Evaluation] -- Self-Scaling I/O Benchmarks, Predicted I/O Performance, 1993 ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems; also available on Chen's web page. * [C-FFS] Ganger, Gregory R., Kaashoek, M.
Frans, [http://www.ece.cmu.edu/~ganger/papers/cffs.html Embedded Inodes and Explicit Grouping: Exploiting Disk Bandwidth for Small Files]. A very well written paper focused on 1-10k file size issues; they use some similar notions (most especially their concept of grouping compared to my packing localities). Note that they focus on the 1-10k file size range, and not the sub-1k range. The 1-10k range is the weak point in reiserfs performance. * [ext2fs] Rémy Card, [http://e2fsprogs.sourceforge.net/ext2intro.html Design and Implementation of the Second Extended Filesystem]. Extensive information; source code is available. When you consider how small this file system is (~6000 lines), its effectiveness becomes all the more remarkable. * [FFS] M.K. McKusick, W.N. Joy, S.J. Leffler, and R.S. Fabry, [http://www.eecs.berkeley.edu/~brewer/cs262/FFS.pdf A fast file system for UNIX], ACM Transactions on Computer Systems, 2(3):181-197, August 1984. Describes the implementation of a file system which employs parent directory location knowledge in determining file layout. It uses large blocks for all but the tail of files to improve I/O performance, and uses small blocks called fragments for the tails so as to reduce the cost due to internal fragmentation. Numerous other improvements are also made to what was once the state of the art. FFS remains the architectural foundation for many current block allocation file systems, and was later bundled with the standard Unix releases. Note that unrequested serialization and the use of fragments place it at a performance disadvantage to ext2fs, though whether ext2fs is thereby made less reliable is a matter of dispute that I take no position on (reiserfs uses preserve lists; forgive my egotism in thinking that it is enough work for me to ensure that reiserfs solves the recovery problem, and to perhaps suggest that ext2fs would benefit from the use of preserve lists when shrinking directories). * [Ganger] Gregory R. Ganger, Yale N.
Patt, [http://pages.cs.wisc.edu/~remzi/Classes/838/Fall2001/Papers/softupdates-osdi94.pdf Metadata Update Performance in File Systems]. * [Gifford] [http://portal.acm.org/citation.cfm?id=121133.121138 Semantic file systems]. Describes a file system enriched to have more than hierarchical semantics; he shares many goals with this author, forgive me for thinking his work worthwhile. If I had to suggest one improvement in a sentence, I would say his semantic algebra needs closure. * [Hitz] Dave Hitz, [http://media.netapp.com/documents/wp_3002.pdf File System Design for an NFS File Server Appliance]. A rather well designed file system optimized for NFS and RAID in combination. Note that RAID increases the merits of write-optimization in block layout algorithms. * [Holton and Das] Holton, Mike, Das, Raj, [http://www.uoks.uj.edu.pl/resources/flugor/IRIX/xfs-whitepaper.html XFS: A Next Generation Journalled 64-Bit Filesystem With Guaranteed Rate I/O]: "The XFS space manager and namespace manager use sophisticated B-Tree indexing technology to represent file location information contained inside directory files and to represent the structure of the files themselves (location of information in a file)." Note that it is still a block (extent) allocation based file system; no attempt is made to store the actual file contents in the tree. It is targeted at the needs of the other end of the file size usage spectrum from reiserfs, and is an excellent design for that purpose (and I would concede that reiserfs 1.0 is not suitable for their real-time large I/O market). SGI has also traditionally been a leader in resisting the use of unrequested serialization of I/O. Unfortunately, the paper is a bit vague on details, and source code is not freely available.
* [Howard] Howard, J.H., Kazar, M.L., Menees, S.G., Nichols, D.A., Satyanarayanan, M., Sidebotham, R.N., West, M.J., [http://www.cs.cmu.edu/~satya/docdir/s11.pdf Scale and Performance in a Distributed File System], ACM Transactions on Computer Systems, 6(1), February 1988. A classic benchmark; it was too CPU bound for both ext2fs and reiserfs. * [Knuth] Knuth, D.E., [http://www-cs-faculty.stanford.edu/~knuth/taocp.html The Art of Computer Programming], Vol. 3 (Sorting and Searching), Addison-Wesley, Reading, MA, 1973. The earliest reference discussing trees storing records of varying length. * [LADDIS] Wittle, Mark, and Bruce, Keith, [http://www.spec.org/sfs93/doc/WhitePaper.ps LADDIS: The Next Generation in NFS File Server Benchmarking], Proceedings of the Summer 1993 USENIX Conference, July 1993, pp. 111-128. * [Lewis and Denenberg] Lewis, Harry R., Denenberg, Larry, [http://portal.acm.org/citation.cfm?id=548586 Data Structures & Their Algorithms], HarperCollins Publishers, NY, NY, 1991. An algorithms textbook suitable for readers wishing to learn about balanced trees and their AVL predecessors. * [McCreight] McCreight, E.M., [http://portal.acm.org/citation.cfm?id=359839 Pagination of B*-trees with variable length records], Commun. ACM 20 (9), 670-674, 1977. Describes algorithms for trees with variable length records. * [McVoy and Kleiman] [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.55.2970&rep=rep1&type=pdf Extent-like Performance from a UNIX File System]: the implementation of write-clustering for Sun's UFS. * [OLE] Kraig Brockschmidt, [http://portal.acm.org/citation.cfm?id=207534 Inside OLE], discusses Structured Storage. * [Ousterhout] J.K. Ousterhout, H. Da Costa, D. Harrison, J.A. Kunze, M.D. Kupfer, and J.G. Thompson, [http://portal.acm.org/citation.cfm?id=323627.323631 A trace-driven analysis of the UNIX 4.2BSD file system].
In Proceedings of the 10th Symposium on Operating Systems Principles, pages 15-24, Orcas Island, WA, December 1985. * [NTFS] [http://portal.acm.org/citation.cfm?id=527752 Inside the Windows NT File System], by Helen Custer; NTFS was architected by Tom Miller with contributions by Gary Kimura, Brian Andrew, and David Goebel, Microsoft Press, 1994. An easy-to-read little book; they fundamentally disagree with me on adding serialization of I/O not requested by the application programmer, and I note that the performance penalty they pay for their decision is high, especially compared with ext2fs. Their FS design is perhaps optimal for floppies and other hardware-eject media beyond OS control. A less serialized, higher performance log structured architecture is described in [Rosenblum and Ousterhout]. That said, Microsoft is to be commended for recognizing the importance of attempting to optimize for small files, and leading the OS designer effort to integrate small objects into the file name space. This book is notable for not referencing the work of persons not working for Microsoft, or providing any form of proper attribution to previous authors such as [Rosenblum and Ousterhout]. * [Peacock] Dr. J. Kent Peacock, "The CounterPoint Fast File System", Proceedings of the Usenix Conference Winter 1988. * [Pike] Rob Pike and Peter Weinberger, [http://pdos.csail.mit.edu/~rsc/pike85hideous.pdf The Hideous Name], USENIX Summer 1985 Conference Proceedings, pp. 563, Portland, Oregon, 1985. Short, informal, and drives home why inconsistent naming schemes in an OS are detrimental. See also Pike's discussion of naming in Plan 9: [http://plan9.bell-labs.com/sys/doc/names.html The Use of Name Spaces in Plan 9]. * [Rosenblum and Ousterhout] [http://www.eecs.berkeley.edu/~brewer/cs262/LFS.pdf The Design and Implementation of a Log-Structured File System], Mendel Rosenblum and John K.
Ousterhout, February 1992 ACM Transactions on Computer Systems. This paper was quite influential in a number of ways on many modern file systems, and the notion of using a cleaner may be applied to a future release of reiserfs. There is an interesting ongoing debate over the relative merits of FFS vs. LFS architectures; the interested reader may peruse [http://www.eecs.harvard.edu/~margo/papers/icde93/ Transaction Support in a Log-Structured File System] and the arguments by Margo Seltzer it links to. * [Snyder] [http://www.solarisinternals.com/si/reading/tmpfs.pdf tmpfs: A Virtual Memory File System]. Discusses a file system built to use swap space and intended for temporary files; due to a complete lack of disk synchronization it offers extremely high performance. * [Vahalia] Uresh Vahalia, [http://books.google.com/books?as_isbn=0131019082 UNIX Internals: The New Frontiers].

[[category:ReiserFS]] [[category:Formatting-fixes-needed]]

* {{wayback|http://www.namesys.com/X0reiserfs.html|2006-11-13}}

Three reasons why ReiserFS is great for you

Last Update: 2002, Hans Reiser

Three reasons why ReiserFS is great for you: # ReiserFS has fast journaling, which means that you don't spend your life waiting for fsck every time your laptop battery dies, or the UPS for your mission-critical server gets its batteries disconnected accidentally by the UPS company's service crew, or your kernel was not as ready for prime time as you hoped, or the silly thing decides you mounted it too many times today. # ReiserFS is based on fast balanced trees. Balanced trees are more robust in their performance, and are a more sophisticated algorithmic foundation for a file system. When we started our project, there was a consensus in the industry that balanced trees were too slow for file system usage patterns. We proved that if you just do them right they are better--take a look at the benchmarks.
We have fewer worst-case performance scenarios than other file systems, and generally better overall performance. If you put 100,000 files in one directory, we think it's fine; many other file systems try to tell you that you are wrong to want to do it. # ReiserFS is more space efficient. If you write 100-byte files, we pack many of them into one block. Other file systems put each of them into their own block. We don't have fixed space allocation for inodes. That saves 6% of your disk. Ok, it's time to fess up. The interesting stuff is still in the future. Because they are nifty, we are going to add database and hypertext like features into the file system. Only by using balanced trees, with their effective handling of small files (database small fields, hypertext keywords), as our technical foundation can we hope to do this. That was our real motivation. As for performance, we may already be slightly better than the traditional file systems (and substantially better than the journaling ones). But they have been tweaking for decades, while we have just got started. This means that over the next few years we are going to improve faster than they are. Speaking more technically: ReiserFS is a file system using a plug-in based object-oriented variant on classical balanced tree algorithms. The results when compared to the ext2fs conventional block allocation based file system, running under the same operating system and employing the same buffering code, suggest that these algorithms are overall more efficient and every passing month are becoming yet more so. Loosely speaking, every month we find another performance cranny that needs work; we fix it. And every month we find some way of improving our overall general usage performance. The improvement in small file space and time performance suggests that we may now revisit a common OS design assumption that one should aggregate small objects using layers above the file system layer.
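The space-efficiency claim for 100-byte files can be checked with simple arithmetic. A 4 KB block size is assumed for the block-aligned file system, and per-item packing overhead is ignored for simplicity:

```python
# Space used by 10,000 files of 100 bytes each; 4096-byte blocks assumed.
n_files, file_size, block = 10_000, 100, 4096

# Block-aligned allocation: one block per file, regardless of file size.
aligned_blocks = n_files                    # 10,000 blocks, ~40 MB on disk

# Packing many small files per block (item headers ignored for simplicity):
packed_blocks = -(-n_files * file_size // block)   # ceil(1,000,000 / 4096)

print(aligned_blocks, packed_blocks)  # 10000 vs 245 blocks
```

The roughly 40x difference here is the "order of magnitude" gap for small files referred to in the Conclusion; real packing overhead narrows it somewhat.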
Being more effective at small files does not make us less effective for other files. This is truly a general purpose FS. Our overall traditional FS usage performance is high enough to establish that. ReiserFS has a commitment to opening up the FS design to contributions; we are now adding plug-ins so that you can create your own types of directories and files. = Introduction = The author is one of many OS researchers who are attempting to unify the name spaces in the operating system in varying ways (e.g. [http://plan9.bell-labs.com/sys/doc/names.html Pike, The Use of Name Spaces in Plan 9]). None of us are well funded compared with the size of the task, and I am far from an exception to this rule. The natural consequence is that we each have attacked one small aspect of the task. My contribution is in incorporating small objects into the file system name space effectively. This implementation offers value to the average Linux user, in that it offers generally good performance compared to the current Linux file system known as ext2fs. It also saves space to an extent that is important for some applications, and convenient for most. It does extremely well for large directories, and has a variety of minor advantages. Since ext2fs is very similar to FFS and UFS in performance, the implementation also offers potential value to commercial OS vendors who desire greater than ext2fs performance without directory size issues, and who appreciate the value of a better foundation for integrating name spaces throughout the OS. = Why Is There A Move Among Some OS Designers Towards Unifying Name Spaces? = An operating system is composed of components that access other components through interfaces. Operating systems are complex enough that, like national economies, the architect cannot centrally plan the interactions of the components of which they are composed.
The architect can provide a structural framework that has a marked impact on the efficiency and utility of those interactions. Economists have developed principles that govern large economic systems. Are there system principles that we might use to start a discussion of the ways increasing component interactivity via naming system design impacts the total utility of an operating system? I propose these: * If one increases the number of other components that a particular component can interact with, one increases its expressive power and thereby its utility. * One can increase the number of other components that a particular component can interact with either by increasing the number of interfaces it has, or by increasing the number of components that are accessible by its current interfaces. * The cost of component interfaces dominates software design cost, just as the cost of wires dominates circuit design cost. * Total system utility tends to be proportional not to the number of components, but to the number of possible component interactions. It is not simply the number of components that one has that determines an OS's expressive power, it is the number of opportunities to use them that determines it. The number of these opportunities is proportional to the number of possible combinations of them, and the number of possible combinations is determined by their connectedness. Component connectedness in OS design is determined by name space design, to much the same extent that buses determine it in circuit design. Allow me to illustrate the impact of these principles with the use of an imaginary example. Suppose two imaginary OS vendors with equally talented programmers hire two different OS architects. Suppose one of the architects centers the OS design around a single name space design that allows all of the components to access all other components via a single interface (assume this is possible, it is a theoretical example).
Suppose the other allows the ten different design groups in the company that are developing components to create their own ten name spaces. Suppose that the unified name space OS architect has half of the resources of the fragmented name space OS architect and creates half as many components. While the number of components is half as large, the number of connections is (1/2)²/((1/10)²·10) = 2.5 times larger. If you accept my hypothesis that utility is more proportional to connections than components, then the unified operating system with half the development cost will still offer more expressive utility. That is a powerful motivation. To return briefly to the long ago researched principles governing another member of the class of large systems, the economies of nations, it is perhaps interesting to note that Adam Smith in [http://en.wikisource.org/wiki/The_Wealth_of_Nations "The Wealth of Nations"] engaged in substantial study of the link between the extent of interconnectedness and the development of civilization, where the extent of interconnectedness was determined by waterways, etc. The link he found for economic systems was no less crucial than what is being suggested here for the effect of component interconnectedness on the total utility of software systems. I suggest that I am merely generalizing a long established principle from another field of science, namely that total utility in large systems with components that interact to generate utility is determined by the extent of their interconnection. There are many exceptions to these principles: not all chips on a motherboard sit on the bus, and analogous considerations apply to both OS design and the economies of nations. I hope the reader will accept that space considerations make it appropriate to gloss over these, and will consider the central point that under some circumstances unifying name spaces in a design can dramatically improve the utility of an OS.
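The connection arithmetic in the example above works out as follows, taking the number of possible connections as proportional to the square of the number of mutually reachable components:

```python
# Unified OS: half as many components (n/2), all mutually reachable.
# Fragmented OS: n components split into 10 isolated name spaces of n/10 each.
# Connection counts are expressed as fractions of n^2, so n cancels out.

unified = (1 / 2) ** 2            # (n/2)^2 / n^2
fragmented = 10 * (1 / 10) ** 2   # 10 * (n/10)^2 / n^2

print(unified / fragmented)  # 2.5: the unified design has 2.5x the connections
```

So the unified design, with half the components, still offers 2.5 times as many possible component interactions.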
That can be an enormous motivation, and it has moved a number of OS researchers in their work (e.g. [http://plan9.bell-labs.com/sys/doc/names.html "The Use of Name Spaces in Plan9", Rob Pike] and [http://pdos.csail.mit.edu/~rsc/pike85hideous.pdf "The Hideous Name", Rob Pike and P.J. Weinberger]). Unfortunately, it is not a small technical effort to combine name spaces. To combine 10 name spaces requires, if not the effort to create 10 name spaces, perhaps an effort equivalent to creating 5 of the name spaces. Usually each of the name spaces has particular performance and semantic power requirements that require enhancing the unified name space, and it usually requires technical innovation to combine the advantages of each of the separated name spaces into a unified name space. I would characterize none of the research groups currently approaching this unification problem as having funding equivalent to what went into creating 5 of the name spaces they would like to unify, and we are certainly no exception. For this reason I have picked one particular aspect of this larger problem for our focus: allowing small objects to effectively share the same file system interface that large objects use currently. As operating systems increase the number of their components, the higher development cost of a file system able to handle small files becomes more worth the multiplicative effect it has on OS utility, as well as its reduction of OS component interface cost. = Should File Boundaries Be Block Aligned? 
= Making file boundaries block aligned has a number of effects:
* it minimizes the number of blocks a file is spread across (which is especially beneficial for multiple-block files when locality of reference across files is poor),
* it wastes disk and buffer space in storing every less-than-fully-packed block,
* it wastes I/O bandwidth with every access to a less-than-fully-packed block when locality of reference is present,
* it increases the average number of block fetches required to access every file in a directory, and
* it results in simpler code.
The simpler code of block-aligning file systems follows from not needing to create a layering to distinguish the units of the disk controller and buffering algorithms from the units of space allocation, and from not needing to optimize the packing of nodes as is done in balanced tree algorithms. For readers who have not been involved in balanced tree implementations: algorithms of this class are notorious for being much more work to implement than one would expect from their description. Sadly, they also appear to offer the highest performance solution for small files, once I remove certain simplifications from their implementation and add certain optimizations common to file system designs. I regret that code complexity (30k lines) is a major disadvantage of the approach compared to the 6k lines of the ext2fs approach. I started our analysis of the problem with an assumption that I needed to aggregate small files in some way, and that the question was: which solution was optimal? The simplest solution was to aggregate all small files in a directory together into either a file or the directory. But any aggregation into a file or directory wastes part of the last block in the aggregation. What does one do if there are only a few small files in a directory, aggregate them into the parent of the directory?
What if there are only a few small files in a directory at first, and then there are many small files? How do I decide what level to aggregate them at, and when to take them back from a parent of a directory and store them directly in the directory? As we did our analysis of these questions we realized that this problem was closely related to the balancing of nodes in a balanced tree. The balanced tree approach, by using an ordering of files which are then dynamically aggregated into nodes at a lower level, rather than a static aggregation or grouping, avoids this set of questions. In my approach I store both files and filenames in a balanced tree, with small files, directory entries, inodes, and the tail ends of large files all being more efficiently packed as a result of relaxing the requirements of block alignment, and eliminating the use of a fixed space allocation for inodes. I have a sophisticated and flexible means for arranging for the aggregation of files for maximal locality of reference, through defining the ordering of items in the tree. The body of large files is stored in unformatted nodes that are attached to the tree but isolated from the effects of possible shifting by the balancing algorithms. Approaches such as [Apple] and [Holton and Das] have stored filenames but not files in balanced trees. None of the file systems C-FFS, NTFS, or XFS aggregates files; all of them block-align files, though all of those also do some variation on storing small files in the statically allocated block address fields of inodes if they are small enough to fit there. [C-FFS] has published an excellent discussion of both their approach and why small files rob a conventional file system of performance more in proportion to the number of small files than the number of bytes consumed by small files. However, I must note that their notion of what constitutes small is different from ours by one or two orders of magnitude.
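The contrast drawn above between static aggregation and dynamic aggregation by ordering can be sketched as a toy: items are given an ordering (here a hypothetical (directory, name) key, a stand-in for reiserfs's real key format) and greedily packed into fixed-size nodes, so grouping falls out of the ordering rather than from any per-directory aggregation decision. This is an illustration, not the actual balancing code:

```python
# Toy sketch: pack variably sized items into nodes in sorted key order.
# The (parent_dir, name) key is a hypothetical stand-in for the real format.

NODE_SIZE = 4096  # bytes per node (assumed)

def pack_by_key(items):
    """items: list of ((parent_dir, name), size) pairs. Returns a list of
    nodes, each a list of keys, filled greedily in key order."""
    nodes, current, used = [], [], 0
    for key, size in sorted(items):
        if used + size > NODE_SIZE and current:
            nodes.append(current)
            current, used = [], 0
        current.append(key)
        used += size
    if current:
        nodes.append(current)
    return nodes

# Small files from the same directory sort together and so land in the same
# nodes, without anyone deciding "aggregate this directory's files":
items = [(("etc", f"conf{i}"), 100) for i in range(50)] + \
        [(("var", f"log{i}"), 100) for i in range(50)]
nodes = pack_by_key(items)
```

When files are added or deleted, rebalancing re-packs neighboring nodes, which is exactly the question ("what level do I aggregate at, and when do I take files back?") that static aggregation cannot answer cleanly.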
Their use of an exo-kernel is simply an excellent approach for operating systems that have that as an available option. Semantics (files), packing (blocks/nodes), caching (read-ahead sizes, etc.), and the hardware interfaces of disk (sectors) and paging (pages) all have different granularity issues associated with them: a central point of our approach is that the optimal granularity of these often differs, and abstracting these into separate layers in which the granularity of one layer does not unintentionally impact other layers can improve space/time performance. Reiserfs innovates in that its semantic layer often conveys to the other layers an ungranulated ordering rather than one granulated by file boundaries. The reader is encouraged to note, while reading the algorithms, the areas in which reiserfs needs to go farther in doing so. = Balanced Trees and Large File I/O = There has long been an odd informal consensus that balanced trees are too slow for use in storing large files, perhaps originating in the performance of databases that have attempted to emulate file systems using balanced tree algorithms that were not originally architected for file system access patterns or their looser serialization requirements. It is hopefully easy for the reader to understand that storing many small files and tail ends of files in a single node where they can all be fetched in one I/O leads directly to higher performance. Unfortunately, it is quite complex to understand the interplay between I/O efficiency and block size for larger files, and space does not allow a systematic review of traditional approaches.
The reader is referred to [FFS], [Peacock], [McVoy], [Holton and Das], [Bach], [OLE], and [NTFS] for treatments of the topic, and for discussions of various means of 1) reducing the effect of block size on CPU efficiency, 2) eliminating the need for inserting rotational delay between successive blocks, 3) placing small files into either inodes or directories, and 4) performing read-ahead. More commentary on these is in the annotated bibliography. Reiserfs has the following architectural weaknesses that stem directly from the overhead of repacking to save space and increase block size: 1) when the tail (files < 4k are all tail) of a file grows large enough to occupy an entire node by itself, it is removed from the formatted node(s) it resides in and converted into an unformatted node ([FFS] pays a similar conversion cost for fragments), 2) a tail that is smaller than one node may be spread across two nodes, which requires more I/O to read if locality of reference is poor, 3) aggregating multiple tails into one node introduces separation of the file body from the tail, which reduces read performance ([FFS] has a similar problem, and for reiserfs files near the node in size the effect can be significant), 4) when you add one byte to a file or tail that is not the last item in a formatted node, on average half of the whole node is shifted in memory. If any of your applications perform I/O in such a way that they generate many small unbuffered writes, reiserfs will make you pay a higher price for not being able to buffer the I/O. Most applications that create substantial file system load employ effective I/O buffering, often simply as a result of using the I/O functions in the standard C libraries. By avoiding accesses in small blocks/extents reiserfs improves I/O efficiency.
Extent based file systems such as VxFS, and write-clustering systems such as ext2fs, are not so effective in applying these techniques that they choose to use 512-byte blocks rather than 1k blocks as their defaults. Ext2fs reports a 20% speedup when 4k rather than 1k blocks are used, but the authors of ext2fs advise the use of 1k blocks to avoid wasting space. There are a number of worthwhile large file optimizations that have not been added to either ext2fs or reiserfs, and both file systems are somewhat primitive in this regard, reiserfs being the more primitive of the two. Large files simply were not my research focus, and as this was a small research project I did not implement the many well known techniques for enhancing large file I/O. The buffering algorithms are probably more crucial than any other component in large file I/O, and partly out of a desire for a fair comparison of the approaches I have not modified them. I have added no significant optimizations for large files, beyond increasing the block size, that are not found in ext2fs. Except for the size of the blocks, there is not a large inherent difference between: 1) the cost of adding a pointer to an unformatted node to my tree plus writing the node, and 2) the cost of adding an address field to an inode plus writing the block. It is likely that, except for block size, the primary determinants of high performance large file access are orthogonal to the decision of whether to use balanced tree algorithms for small and medium sized files. For large files we get some advantage from our tree not being less balanced than the tree formed by an inode which points to a triple indirect block. We haven't an easy method for measuring the performance gain from that though. There is performance overhead due to the memory bandwidth cost of balancing nodes for small files. We think it is worth it though.
= Serialization and Consistency = The issues of ensuring recoverability with minimal serialization and data displacement necessarily dominate high performance design. Let's define the two extremes in serialization so that the reason for this can be clear. Consider the relative speed of a set of I/Os in which every block request in the set is fed to the elevator algorithms of the kernel and the disk drive firmware fully serially, each request awaiting the completion of the previous request. Now consider the other extreme, in which all block requests are fed to the elevator algorithms all together, so that they may all be sorted and performed in close to their sorted order (disk drive firmwares don't use a pure elevator algorithm). The unserialized extreme may be more than an order of magnitude faster, due to the cost of rotations and seeks. Unnecessarily serializing I/O prevents the elevator algorithm from doing its job of placing all of the I/Os in their layout sequence rather than chronological sequence. Most of high performance design centers around making I/Os in the order they are laid out on disk, and laying out blocks on disk in the order that the I/Os will want to be issued. [Snyder] discusses a file system that obtains high performance from a complete lack of disk synchronization, but is only suitable for temporary files that don't need to survive reboot. I think its known value to Solaris users indicates that the optimal buffering policy varies from file to file. [Ganger] discusses methods for using ordering of writes rather than serialization for ensuring conventional file system meta-data integrity; [McVoy] previously suggested but did not implement ordering of buffer writes. Ext2fs is fast in substantial part due to avoiding synchronous writes of metadata, and I have much personal experience with it that leads me to prefer compiles that are fast.
[ I would like to see it adopt a policy that all dirty buffers for files not flagged as temporary are queued for writing, and that the existence of a dirty buffer means that the disk is busy. This will require replacing buffer I/O locking with copy-on-write, but an idle disk is such a terrible thing to waste. :-) ] [NTFS] by default adds unnecessary serialization to an extent that even older file systems such as [FFS] do not, and its performance characteristics reflect that. In fairness, it should be said that it is the superior approach for most removable media without software control of ejection (e.g. IBM PC floppies). Reiserfs employs a new scheme called preserve lists for ensuring recoverability, which avoids overwriting old meta-data by writing the new meta-data nearby rather than over the old meta-data. = Why Aggregate Small Objects at the File System Level? = There has long been a tradition of file system developers deciding that effective handling of small files is not significant to performance, and of application programmers caring enough about performance to not store small files as separate entities in the file system. To store small objects one may either make the file system efficient for the task, or sidestep the problem by aggregating small objects in a layer above the file system. Sidestepping the problem has three disadvantages: utility, code complexity, and performance. Utility and Code Complexity: Allowing OS designers to effectively use a single namespace with a single interface for both large and small objects decreases coding cost and increases expressive power of components throughout the OS. I feel reiserfs shows the effects of a larger development investment focused on a simpler interface when compared with many solutions for this currently available in the object oriented toolkit community, such as the Structured Storage available in Microsoft's [OLE].
By simpler I mean I added nothing to the file system API to distinguish large and small objects, and I leave it to the directory semantics and archiving programs to aggregate objects. Multiple layers cost more to implement, cost more to code the interfaces for utilizing, and provide less flexibility. Performance: It is most commonly the case that when one layers one file system on top of another the performance is substantially reduced, and Structured Storage is not an exception to this general rule. Reiserfs, which does not attempt to delegate the small object problem to a layer above, avoids this performance loss. I have heard it suggested by some that this layering avoids the performance loss from syncing on file close as many file systems do. I suggest that this is adding an error to an error rather than fixing it. Let me make clear that I believe those who write such layers above the file system do not do so out of stupidity. I know of at least one company at which a solution that layers small object storage above the file system exists because the file system developers refused to listen to the non-file system group's description of its needs, and the file system group had to be sidestepped in generating the solution. Current file systems are fairly well designed for the purposes that their users currently use them for: my goal is to change file size usage patterns. The author remembers arguments that once showed clearly that there was no substantial market need for disk drives larger than 10MB, based on then-current usage statistics. While [C-FFS] points out that 80% of file accesses are to files below 10k, I do not believe it reasonable to expect that usage measurements of file systems for which small files are inappropriate will show small files being frequently used. Application programmers are smarter than that.
Currently 80% of file accesses are to the first order of magnitude in file size for which it is currently sensible to store the object in the file system. I regret that one can only speculate as to whether, once file systems become effective for small files and database tasks, usage patterns will change to 80% of file accesses being to files less than 100 bytes. What I can do is show, via the 80/20 Banded File Set Benchmark presented later, that in such circumstances small file performance potentially dominates total system performance. In summary, the on-going reinvention of incompatible object aggregation techniques above the file system layer is expensive, less expressive, less integrated, slower, and less efficient in its storage than incorporating balanced tree algorithms into the file system. = Tree Definitions = Balanced trees are used in databases, and more generally, wherever a programmer needs to search and store to non-random memory by a key, and has the time to code it this way. The usual evolution for programmers is to first think that hashing will be simpler and more efficient, and then realize only after getting into the sordid details that the combination of space efficiency, minimizing disk accesses, and the feasibility of caching the top part of the tree makes the tree approach more effective. The cost of effectively handling bucket overflow just isn't less than the cost of balancing, unless the buckets are always all in RAM. Hashing is often a good solution when there is no non-random memory involved, such as when hashing a cache. The Linux dcache code uses hashing for accessing a cache of in-memory directory entries. Sometimes one uses partial or full hashing of keys within a balanced tree.
If you do full hashing within a tree, and you cache the top part of that tree, you have something rather similar to extensible hashing, except that it is more flexible and efficient. Sometimes programmers code using unbalanced trees; most filesystems do essentially that. Balanced trees generally do a better job of minimizing the average number of disk accesses. There is literature establishing that balanced trees are optimal for the worst case when there is no caching of the tree. This is rather pointless literature, as the average case when cached is what is important, and I am afraid that the existing literature proves that which is feasible to prove rather than that which is relevant. That said, practitioners know from experience that making the tree less balanced leads to more I/Os. Discussions of the exceptions to this are rather interesting, but not for here. I regret that I must assume that the reader is familiar with basic balanced tree algorithms [Wood], [Lewis and Denenberg], [Knuth], [McCreight]. No attempt will be made to survey tree design here, since balanced trees are one of the most researched and complex topics in algorithm theory, and require treatment at length. I must compound this discourtesy with a concise set of definitions that sorely lack accompanying diagrams; my apologies. Finally, I'll truly annoy the reader by saying that the header files contain nice ascii art, and if you want the full definition of the structures, the source is the place. Classically, balanced trees are designed with the set of keys assumed to be defined by the application, and the purpose of the tree design is to optimize searching through those keys. In my approach the purpose of the tree is to optimize the reference locality and space efficient packing of objects, and the keys are defined as best optimizes the algorithm for that.
Keys are used in place of inode numbers in the file system, thereby substituting a mapping of keys to node location (the internal nodes) for a mapping of inode number to file location. Keys are longer than inode numbers, but one needs to cache fewer of them than one would need to cache inode numbers when more than one file is stored in a node. In my tree, I still require that a filename be resolved one component at a time. It is an interesting topic for future research whether this is necessary or optimal. This is a more complex issue than a casual reader might realize: directory at a time lookup accomplishes a form of compression, makes mounting other name spaces and file system extensions simpler, makes security simpler, and makes future enhanced semantics simpler. Since small files typically lead to large directories, it is fortuitous that, as a natural consequence of our use of tree algorithms, our directory mechanisms are much more effective for very large directories than those of most other file systems (notable exceptions include [Holton and Das]). The tree has three node types: internal nodes, formatted nodes, and unformatted nodes. The contents of internal and formatted nodes are sorted in the order of their keys. (Unformatted nodes contain no keys.) Internal nodes consist of pointers to sub-trees separated by their delimiting keys. The key that precedes a pointer to a sub-tree is a duplicate of the first key in the first formatted node of that sub-tree. Internal nodes exist solely to allow determining which formatted node contains the item corresponding to a key. ReiserFS starts at the root node, examines its contents, and based on them determines which subtree contains the item corresponding to the desired key. From the root node reiserfs descends into the tree, branching at each node, until it reaches the formatted node containing the desired item.
The first (bottom) level of the tree consists of unformatted nodes, the second level consists of formatted nodes, and all levels above consist of internal nodes. The highest level contains the root node. The number of levels is increased as needed by adding a new root node at the top of the tree. All paths from the root of the tree to all formatted leaves are equal in length, and all paths to all unformatted leaves are also equal in length and one node longer than the paths to the formatted leaves. This equality in path length, and the high fanout it provides, are vital to high performance, and in the Drops section I will describe how the lengthening of the path that occurred as a result of introducing the [BLOB] approach (the use of indirect items and unformatted nodes) proved a measurable mistake. Formatted nodes consist of items. Items have four types: direct items, indirect items, directory items, and stat data items. All items contain a key which is unique to the item. This key is used to sort, and find, the item. Direct items contain the tails of files; a tail is the last part of the file (the last file_size modulo FS-block-size bytes of a file). Indirect items consist of pointers to unformatted nodes. All but the tail of a file is contained in its unformatted nodes. Directory items contain the key of the first directory entry in the item followed by a number of directory entries. Depending on the configuration of reiserfs, stat data may be stored as a separate item, or it may be embedded in a directory entry. We are still benchmarking to determine which way is best. A file consists of a set of indirect items followed by a set of up to two direct items, with the existence of two direct items representing the case when a tail is split across two nodes.
If a tail is larger than the maximum size of a file that can fit into a formatted node but is smaller than the unformatted node size (4k), then it is stored in an unformatted node, and a pointer to it plus a count of the space used is stored in an indirect item. Directories consist of a set of directory items. Directory items consist of a set of directory entries. Directory entries contain the filename and the key of the file which is named. There is never more than one item of the same item type from the same object stored in a single node (there is no reason one would want to use two separate items rather than combining them). The first item of a file or directory contains its stat data. When performing balancing, and analyzing the packing of the node and its two neighbors, we ensure that the three nodes cannot be compressed into two nodes. I feel greater compression than this is best left to an FS cleaner to perform rather than attempting it dynamically.

= ReiserFS Structures =

The ReiserFS tree has Max_Height = N (the current default is N = 5). The tree lives in disk blocks, and each disk block that belongs to the reiserfs tree begins with a block head.

An internal node of the tree holds keys and pointers to disk blocks:

 Block_Head | Key 0 | Key 1 | Key 2 | --- | Key N | Pointer 0 | Pointer 1 | Pointer 2 | --- | Pointer N | Pointer N+1 | ..Free Space..

A leaf node of the tree holds item headers (growing from the left end) and item bodies (growing from the right end):

 Block_Head | IHead 0 | IHead 1 | IHead 2 | --- | IHead N | ....Free Space.... | Item N | --- | Item 2 | Item 1 | Item 0

An unformatted node holds the raw data of a big file; it has no block head or other formatting.

ReiserFS objects are files and directories. The maximum number of objects is 2^32 - 4 = 4,294,967,292. Each object is a set of items.

File items:
# StatData item + [Direct item] (for small files: size from 0 bytes to MAX_DIRECT_ITEM_LEN = blocksize - 112 bytes)
# StatData item + InDirect item + [Direct item] (for big files: size > MAX_DIRECT_ITEM_LEN bytes)

Directory items:
# StatData item + Directory item

Every reiserfs object has an object ID and a key.

== Internal Node Structures ==

struct block_head (present at the start of every formatted node):

 Field Name           Type            Size (bytes)  Description
 blk_level            unsigned short  2             level of the block in the tree (1 = leaf; 2, 3, 4, ... = internal)
 blk_nr_item          unsigned short  2             number of keys in an internal block, or number of items in a leaf block
 blk_free_space       unsigned short  2             free space in the block, in bytes
 blk_right_delim_key  struct key      16            right delimiting key for this block (leaf nodes only)

The total is 6 bytes (padded to 8) for internal nodes and 22 bytes (padded to 24) for leaf nodes.

struct key:

 Field Name    Type   Size (bytes)  Description
 k_dir_id      __u32  4             ID of the parent directory
 k_object_id   __u32  4             ID of the object (also the inode number)
 k_offset      __u32  4             offset from the beginning of the object to the current byte of the object
 k_uniqueness  __u32  4             type of the item (StatData = 0, Direct = -1, InDirect = -2, Directory = 500)

total: 16 bytes

struct disk_child (pointer to a disk block):

 Field Name       Type            Size (bytes)  Description
 dc_block_number  unsigned long   4             disk child's block number
 dc_size          unsigned short  2             disk child's used space

total: 6 bytes (padded to 8)

== Leaf Node Structures ==

A leaf node begins with the same struct block_head described above; for leaf nodes the right delimiting key is present, so the head occupies 22 bytes (padded to 24).

Everything in the filesystem is stored as a set of items. Each item has an item_head. The item_head contains the key of the item, its free space (for indirect items), and the location of the item body within the block.

struct item_head (IHead):

 Field Name                          Type        Size (bytes)  Description
 ih_key                              struct key  16            key used to search for the item; all item headers are sorted by this key
 u.ih_free_space / u.ih_entry_count  __u16       2             free space in the last unformatted node for an InDirect item; 0xFFFF for a Direct or StatData item; the number of directory entries for a Directory item
 ih_item_len                         __u16       2             total size of the item body
 ih_item_location                    __u16       2             offset of the item body within the block
 ih_reserved                         __u16       2             used by reiserfsck

total: 24 bytes

There are 4 types of items: stat data items, directory items, indirect items, and direct items.

struct stat_data (the reiserfs version of the UFS disk inode, minus the address blocks):

 Field Name            Type   Size (bytes)  Description
 sd_mode               __u16  2             file type and permissions
 sd_nlink              __u16  2             number of hard links
 sd_uid                __u16  2             owner id
 sd_gid                __u16  2             group id
 sd_size               __u32  4             file size
 sd_atime              __u32  4             time of last access
 sd_mtime              __u32  4             time the file was last modified
 sd_ctime              __u32  4             time the inode (stat data) was last changed (except for changes to sd_atime and sd_mtime)
 sd_rdev               __u32  4             device
 sd_first_direct_byte  __u32  4             offset from the beginning of the file to the first byte of the file's direct item: -1 for a directory; 1 for a small file (direct items only); >1 for a big file with indirect and direct items; -1 for a big file with indirect items but no direct item

total: 32 bytes

A directory item holds entry heads at one end and file names at the other:

 deHead 0 | deHead 1 | deHead 2 | --- | deHead N | fileName N | --- | fileName 2 | fileName 1 | fileName 0

A direct item holds a small file body directly. An indirect item is an array of pointers to unformatted blocks (each unfPointer is 4 bytes); the unformatted blocks contain the body of a big file.

struct reiserfs_de_head (deHead):

 Field Name    Type   Size (bytes)  Description
 deh_offset    __u32  4             third component of the directory entry key (all reiserfs_de_heads are sorted by this value)
 deh_dir_id    __u32  4             objectid of the parent directory of the object referenced by the directory entry
 deh_objectid  __u32  4             objectid of the object referenced by the directory entry
 deh_location  __u16  2             offset of the name within the whole item
 deh_state     __u16  2             flags: 1) entry contains stat data (for the future), 2) entry is hidden (unlinked)

total: 16 bytes

fileName is the name of the file, an array of bytes of variable length. The maximum length of a file name is blocksize - 64 (for a 4k blocksize, the maximum name length is 4032 bytes).

= Using the Tree to Optimize Layout of Files =

There are four levels at which layout optimization is performed:
# the mapping of logical block numbers to physical locations on disk,
# the assigning of nodes to logical block numbers,
# the ordering of objects within the tree, and
# the balancing of the objects across the nodes they are packed into.

== Physical Layout ==

For SCSI drives this mapping of logical block numbers to physical locations is performed by the disk drive manufacturer; for IDE drives it is done by the device driver; and for all drives it is also potentially done by volume management software.
The logical block number to physical location mapping by the drive manufacturer is usually done using cylinders. I agree with the authors of [ext2fs] and most others that the significant file placement feature of FFS was not the use of actual cylinder boundaries, but the placing of files and their inodes on the basis of their parent directory location. FFS used explicit knowledge of actual cylinder boundaries in its design. I find that minimizing the distance in logical blocks of semantically adjacent nodes, without tracking cylinder boundaries, accomplishes an excellent approximation of optimizing according to actual cylinder boundaries, and I find its simplicity an aid to implementation elegance. == Node Layout == When I place nodes of the tree on the disk, I search for the first empty block in the bitmap (of used block numbers), starting at the location of the left neighbor of the node in the tree ordering and moving in the direction I last moved in. This was experimentally found to be better than the following alternatives for the benchmarks employed: 1) taking the first non-zero entry in the bitmap, 2) taking the entry after the last one that was assigned, in the direction last moved in (this was 3% faster for writes and 10-20% slower for subsequent reads), 3) starting at the left neighbor and moving in the direction of the right neighbor. When changing block numbers for the purpose of avoiding overwriting sending nodes before shifted items reach disk in their new recipient node (see the description of preserve lists later in the paper), the benchmarks employed were ~10% faster when starting the search from the left neighbor rather than from the node's current block number, even though it adds significant overhead to determine the left neighbor (the current implementation risks I/O to read the parent of the left neighbor). It used to be that we would reverse direction when we reached the end of the disk drive.
Fortunately we checked to see if it makes a difference which direction one moves in when allocating blocks to a file, and indeed we found it made a significant difference to always allocate in the increasing block number direction. We hypothesize that this is due to matching disk spin direction by allocating using increasing block numbers. == Ordering within the Tree == While I give here an example of how I have defined keys to optimize locality of reference and packing efficiency, I would like to stress that key definition is a powerful and flexible tool that I am far from finished experimenting with. Some key definition decisions depend very much on usage patterns, and this means that someday one will select from several key definitions when creating the file system. For example, consider the decision of whether to pack all directory entries together at the front of the file system, or to pack the entries near the files they name. For large file usage patterns one should pack all directory items together, since systems with such usage patterns are effective in caching the entries for all directories. For small files the name should be near the file. Similarly, for large files the stat data should be stored separately from the body, either with the other stat data from the same directory, or with the directory entry. (It was likely a mistake for me to not assign stat data its own key in the current implementation, as packing it in with direct and indirect items complicates our code for handling those items, and prevents me from easily experimenting with the effects of changing its key assignment.) It is not necessary for a file's packing to reflect its name, that is merely my default. With each file my next release will offer the option of overriding the default by use of a system call. 
It is feasible to pack an object completely independently of its semantics using these algorithms, and I predict that there will be many applications, perhaps even most, for which a packing different from that determined by object names is more appropriate. Currently the mandatory tying of packing locality and semantics results in the distortion of both semantics and packing from what might otherwise be their independent optimums, much as tying block boundaries to file boundaries distorts I/O and space allocation algorithms from their separate optimums. For example, placing most files accessed while booting in their access order at the start of the disk is a very tempting future optimization that the use of packing localities makes feasible to consider. The Structure of a Key: Each file item has a key with the structure <locality_id, object_id, offset, uniqueness>. The locality_id is by default the object_id of the parent directory. The object_id is the unique id of the file, and this is set to the first unused objectid when the object is created. The tendency of this to result in successive object creations in a directory being adjacently packed is often fortuitous for many usage patterns. For files the offset is the offset within the logical object of the first byte of the item. In version 0.2 all directory entries had their own individual keys stored with them and were each distinct items; in the current version I store one key in the item, which is the key of the first entry, and compute each entry's key as needed from the one key stored in the item. For directories the offset key component is the first four bytes of the filename, which you may think of as the lexicographic rather than numeric offset. For directory items the uniqueness field differentiates filename entries identical in the first 4 bytes.
For all item types it indicates the item type and for the leftmost item in a buffer it indicates whether the preceding item in the tree is of the same type and object as this item. Placing this information in the key is useful when analyzing balancing conditions, but increases key length for non-directory items, and is a questionable architectural feature. Every file has a unique objectid, but this cannot be used for finding the object, only keys are used for that. Objectids merely ensure that keys are unique. If you never use the reiserfs features that change an object's key then it is immutable, otherwise it is mutable. (This feature aids support for NFS daemons, etc.) We spent quite some time debating internally whether the use of mutable keys for identifying an object had deleterious long term architectural consequences: in the end I decided it was acceptable iff we require any object recording a key to possess a method for updating its copy of it. This is the architectural price of avoiding caching a map of objectid to location that might have very poor locality of reference due to objectids not changing with object semantics. I pack an object with the packing locality of the directory it was first created in unless the key is explicitly changed. It remains packed there even if it is unlinked from the directory. I do not move it from the locality it was created in without an explicit request, unlike the [C-FFS] approach which stores all multiple link files together and pays the cost of moving them from their original locations when the second link occurs. I think a file linked with multiple directories might as well get at least the locality reference benefits of one of those directories. In summary, this approach 1) places files from the same directory together, 2) places directory entries from the same directory together with each other and with the stat data for the directory. 
Note that there is no interleaving of objects from different directories in the ordering at all, and that all directory entries from the same directory are contiguous. You'll note that this does not accomplish packing the files of small directories with common parents together, and does not employ the full partial ordering in determining the linear ordering; it merely uses parent directory information. I feel the proper place for employing full tree structure knowledge is in the implementation of an FS cleaner, not in the dynamic algorithms.

== Node Balancing Optimizations ==

When balancing nodes I do so according to the following ordered priorities:
# minimize the number of nodes used
# minimize the number of nodes affected by the balancing operation
# minimize the number of uncached nodes affected by the balancing operation
# if shifting to another formatted node is necessary, maximize the bytes shifted

Priority 4 is based on the assumption that the location of an insertion of bytes into the tree is an indication of the likely location of future insertions, and that this policy will on average reduce the number of formatted nodes affected by future balance operations. There are more subtle effects as well: if one places nodes next to each other at random, and one has a choice between those nodes being mostly moderately packed or packed to an extreme of either well or poorly, one is more likely to be able to combine nodes by choosing the policy of extremism. Extremism is a virtue in space-efficient node packing. The maximal shift policy is not applied to internal nodes, as extremism is not a virtue in time-efficient internal node balancing.
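The ordered priorities above amount to a lexicographic comparison between candidate balancing plans; a later criterion only matters when all earlier ones tie. A sketch (the plan struct and its fields are my invention for illustration, not reiserfs data structures):

```c
/* Hypothetical cost summary of one candidate balancing plan. */
struct balance_plan {
    int nodes_used;     /* priority 1: fewer is better */
    int nodes_affected; /* priority 2: fewer is better */
    int uncached_nodes; /* priority 3: fewer is better */
    int bytes_shifted;  /* priority 4: more is better (maximal shift) */
};

/* Return nonzero if plan a is preferable to plan b under the ordered
 * priorities: each criterion is consulted only if all earlier ones tie. */
int plan_better(const struct balance_plan *a, const struct balance_plan *b)
{
    if (a->nodes_used != b->nodes_used)
        return a->nodes_used < b->nodes_used;
    if (a->nodes_affected != b->nodes_affected)
        return a->nodes_affected < b->nodes_affected;
    if (a->uncached_nodes != b->uncached_nodes)
        return a->uncached_nodes < b->uncached_nodes;
    return a->bytes_shifted > b->bytes_shifted;
}
```

Note how the last comparison is reversed: among plans equal on the first three criteria, the one that shifts the most bytes wins, which is the extremism policy.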
=== Drops ===

(The difficult design issues in the current version that our next version can do better.)

Consider dividing a file or directory into drops, with each drop having a separate key, and no two drops from one file or directory occupying the same node without being compressed into one drop. The key for each drop is set to the key for the object (file or directory) plus the offset of the drop within the object. For directories the offset is lexicographic and by filename; for files it is numeric and in bytes. In the course of several file system versions we have experimented with and implemented solid, liquid, and air drops. Solid drops were never shifted, and drops would only solidify when they occupied the entirety of a formatted node. Liquid drops are shifted in such a way that any liquid drop which spans a node fully occupies the space in its node: like a physical liquid it is shiftable but not compressible. Air drops merely meet the balancing condition of the tree.

Reiserfs 0.2 implemented solid drops for all but the tail of files. If a file was at least one node in size, the start of the file was aligned with the start of a node, block-aligning the file. This block alignment of the start of multi-drop files was a design error that wasted space: even if the locality of reference is so poor as to make one not want to read parts of semantically adjacent files, if the nodes are near each other then the cost of reading an extra block is thoroughly dwarfed by the cost of the seek and rotation to reach the first node of the file. As a result the block alignment saves little time, though it costs significant space for 4-20k files. Reiserfs with block alignment of multi-drop files and no indirect items experienced the following rather interesting behavior, which was partially responsible for making it only 88% space efficient for files that averaged 13k in size (the Linux kernel).
When the tail of a larger-than-4k file was followed in the tree ordering by another file larger than 4k, then since the drop before was solid and aligned, and the drop afterwards was solid and aligned, the tail occupied an entire node no matter what size it was. In the current version we place all but the tail of large files into a level of the tree reserved for full unformatted nodes, and create indirect items in the formatted nodes which point to the unformatted nodes. This is known in the database literature as the [BLOB] approach. This extra level added to the tree comes at the cost of making the tree less balanced (I consider the unformatted nodes pointed to as part of the tree) and increasing the maximal depth of the tree by 1.

For medium-sized files, the use of indirect items increases the cost of caching pointers by mixing data with them. The reduction in fanout more frequently causes the read algorithms to fetch the file being read only one node at a time, as one must wait to read the uncached indirect item before reading the node with the file data. There are more parents per file read with the use of indirect items than with internal nodes, as a direct result of the reduced fanout due to mixing tails and indirect items in the node. The most serious flaw is that these reads of the various nodes necessary to reading the file incur additional rotations and seeks compared to the case with drops. With my initial drop approach they are usually sequential in their disk layout, even the tail, and the internal node parent points to all of them in such a way that all of them that are contained by that parent or another internal node in cache can be requested at once in one sequential read. Non-sequential reads of nodes are more than an order of magnitude more costly than sequential reads, and this single consideration dominates effective read optimization.
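As an illustration of the indirect-item (BLOB) layout just described, a formatted node holds an indirect item whose body is an array of block numbers of the full unformatted nodes holding the file body, while the tail remains in a formatted node. The struct and function names here are hypothetical, for illustration only:

```c
#include <stdint.h>

/* Hypothetical sketch of an indirect item living in a formatted node. */
struct indirect_item {
    uint32_t nr_blocks; /* number of unformatted nodes pointed to */
    uint32_t blocknr[]; /* block numbers of full unformatted nodes */
};

/* Map a byte offset in the file body to the unformatted node holding it.
 * Offsets at or past nr_blocks * blocksize fall in the tail, which is
 * stored as a direct item in a formatted node; return 0 for that case. */
uint32_t body_blocknr(const struct indirect_item *ind,
                      uint32_t offset, uint32_t blocksize)
{
    uint32_t idx = offset / blocksize;
    return idx < ind->nr_blocks ? ind->blocknr[idx] : 0; /* 0: in tail */
}
```

This makes concrete why reads pay for reduced fanout: the indirect item must be fetched and decoded before any unformatted node of the body can be requested.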
Unformatted nodes make file system recovery faster but less robust, in that one reads their indirect item rather than the nodes themselves to insert them into the recovered tree, and one cannot read them to confirm that their contents are from the file that an indirect item says they are from. In this they make reiserfs similar to an inode-based system without logging. A moderately better solution would have been to simply eliminate the requirement that the start of multi-node files be placed at the start of nodes, rather than introducing BLOBs, and to depend on the use of a file system cleaner to optimally pack the 80% of files that don't move frequently, using algorithms that move even solid drops. Yet that still leaves the problem of formatted nodes not being efficient for mmap() purposes (one must copy them before writing rather than merely modifying their page table entries, and memory bandwidth is expensive even if CPU is cheap). For this reason I have the following plan for the next version.

I will have three trees: one tree maps keys to unformatted nodes, one tree maps keys to formatted nodes, and one tree maps keys to directory entries and stat_data. Now it is only natural if you are thinking that this would mean that to read a file, accessing first the directory entry and stat_data, then the unformatted node, then the tail, one must hop long distances across the disk, going first to one tree and then the others. This is indeed why it took me two years to realize it could be made to work. My plan is to interleave the nodes of the three trees according to the following algorithm: block numbers are assigned to nodes when the nodes are created or preserved, and someday will be assigned when the cleaner runs. The choice of block number is based on first determining what other node it should be placed near, and then finding the nearest free block that can be found in the elevator's current direction.
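The block number assignment just described (start from the chosen neighbor, then scan the bitmap in the elevator's current direction of increasing block numbers) can be sketched as follows; the function and parameter names are mine, not reiserfs's:

```c
#include <stdint.h>

/* Hypothetical sketch: find the first free block at or after 'start',
 * scanning in the direction of increasing block numbers (the elevator's
 * current direction); returns -1 if none is found before 'nblocks'. */
long find_free_block_near(const uint8_t *bitmap, long nblocks, long start)
{
    for (long b = start; b < nblocks; b++)
        if (!(bitmap[b / 8] & (1 << (b % 8)))) /* bit clear = free */
            return b;
    return -1; /* caller may wrap the search or free the preserve list */
}
```

'start' would be the block number returned by the virtual-neighbor lookup, so nodes land as close as possible to their logical-layout neighbors.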
Currently we use the left neighbor of the node in the tree as the node it should be placed near. This is nice and simple. Oh well. Time to create a virtual neighbor layer. The new scheme will continue to first determine the node it should be placed near, and then start the search for an empty block from that spot, but it will use a more complicated determination of what node to place it near. This method will cause all nodes from the same packing locality to be near each other, will cause all directory entries and stat_data to be grouped together within that packing locality, and will interleave formatted and unformatted nodes from the same packing locality. Pseudo-code is best for describing this:

<pre>
/* For use by reiserfs_get_new_blocknrs when determining where in the
   bitmap to start the search for a free block, and for use by the
   read-ahead algorithm when there are not enough nodes to the right
   and in the same packing locality for packing locality read-ahead
   purposes. */
get_logical_layout_left_neighbors_blocknr(key of current node)
{
    /* Based on examination of current node key and type,
       find the virtual neighbor of that node. */
    if body node
        if first body node of file
            if (node in tail tree whose key is less but is in same packing locality exists)
                return blocknr of such node with largest key
            else
                find node with largest key less than key of current node in stat_data tree
                return its blocknr
        else
            return blocknr of node in body tree with largest key less than key of current node
    else if tail node
        if (node in body tree belonging to same file as first tail of current node exists)
            return its blocknr
        else if (node in tail tree with lesser delimiting key but same packing locality exists)
            return blocknr of such node with largest delimiting key
        else
            return blocknr of node with largest key less than key of current node in stat_data tree
    else /* is stat_data tree node */
        if stat_data node with lesser key from same packing locality exists
            return blocknr of such node with largest key
        /* else: no node from same packing locality with lesser key exists */
}

/* For use by packing locality read-ahead. */
get_logical_layout_right_neighbors_blocknr(key of current node)
{
    right-handed version of get_logical_layout_left_neighbors_blocknr logic
}
</pre>

It is my hope that this will improve the caching of pointers to unformatted nodes, plus improve the caching of directory entries and stat_data, by separating them from file bodies to a greater extent. I also hope that it will improve read performance for 1-10k files, and that it will allow us to do this without decreasing space efficiency.

=== Code Complexity ===

I thought it appropriate to mention some of the notable effects of simple design decisions on our implementation's code length. When we changed our balancing algorithms to shift parts of items rather than only whole items, so as to pack nodes tighter, this had an impact on code complexity.
Another multiplicative determinant of balancing code complexity was the number of item types: introducing indirect items doubled it, and changing directory items from liquid drops to air drops also increased it. Storing stat data in the first direct or indirect item of the file complicated the code for processing those items more than if I had made stat data its own item type. When one finds oneself with an NxN coding complexity issue, it usually indicates the need for adding a layer of abstraction. The NxN effect of the number of item types on balancing code complexity is an instance of that design principle, and we will address it in the next major rewrite. The balancing code will employ a set of item operations which all item types must support. The balancing code will then invoke those operations without caring to understand any more of the meaning of an item's type than that it determines which item-specific operation handler is called. Adding a new item type, say a compressed item, will then merely require writing a set of item operations for that item, rather than modifying most parts of the balancing code as it does now. We now feel that the function to determine what resources are needed to perform a balancing operation, fix_nodes(), might as well be written to decide what operations will be performed during balancing, since it pretty much has to do so anyway. That way, the function that performs the balancing with the nodes locked, do_balance(), can be gutted of most of its complexity.

= Buffering & the Preserve List =

We implemented for version 0.2 of our file system a system of write ordering that tracked all shifting of items in the tree, and ensured that no node that had had an item shifted from it was written before the node that had received the item was written. This is necessary to prevent a system crash from causing the loss of an item that might not be recently created.
This tracking approach worked, and the overhead it imposed was not measurable in our benchmarks. When in the next version we changed to partially shifting items and increased the number of item types, this code grew out of control in its complexity. I decided to replace it with a scheme that was far simpler to code and also more effective in typical usage patterns. The scheme is as follows: if an item is shifted from a node, change the block that its buffer will be written to. Change it to the nearest free block to the old block's left neighbor, and rather than freeing the old block, place its number on a "preserve list". (Saying nearest is slightly simplistic, in that the blocknr assignment function moves from the left neighbor in the direction of increasing block numbers.) When a "moment of consistency" is achieved, free all of the blocks on the preserve list. A moment of consistency occurs when there are no nodes in memory into which objects have been shifted (this could be made more precise, but then it would be more complex). If disk space runs out, force a moment of consistency to occur. This is sufficient to ensure that the file system is recoverable.

Note that during the large file benchmarks the preserve list was freed several times in the middle of the benchmark. The percentage of buffers preserved is small in practice except during deletes, and one can arrange for moments of consistency to occur as frequently as one wants. Note that I make no claim that this approach is better than the Soft Updates approach employed by [Ganger] or by us in version 0.2; I merely note that tracking the order of writes is more complex than this approach for balanced trees which partially shift items. We may go back to the old approach some day, though not to the code that I threw out. Preserve lists substantially hamper performance for files in the 1-10k size range. We are re-evaluating them.
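A minimal sketch of the preserve-list bookkeeping described above (all names here are mine, for illustration): when an item is shifted out of a node, the node's old block number is preserved rather than freed, and the whole list is freed at a moment of consistency.

```c
#include <stdlib.h>

/* Hypothetical sketch of the preserve list. */
struct preserved { long blocknr; struct preserved *next; };
static struct preserved *preserve_list = NULL;

/* Called when an item has been shifted out of the node at old_blocknr:
 * the buffer will be written to a new location, and the old block is
 * preserved (not freed) until a moment of consistency. */
void preserve_block(long old_blocknr)
{
    struct preserved *p = malloc(sizeof *p);
    p->blocknr = old_blocknr;
    p->next = preserve_list;
    preserve_list = p;
}

/* Called at a moment of consistency: no node in memory has had items
 * shifted into it, so every preserved block may now be freed at once. */
void free_preserve_list(void (*free_block)(long))
{
    while (preserve_list) {
        struct preserved *p = preserve_list;
        preserve_list = p->next;
        free_block(p->blocknr);
        free(p);
    }
}
```

The recovery property follows because the on-disk copy of a shifted-from node is never overwritten before the shifted-to node reaches disk; the old blocks only become reusable after a moment of consistency.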
Ext2fs avoids the metadata shifting problem by never shrinking directories, and by using fixed inode space allocations.

= Lessons From Log Structured File Systems =

Many techniques from other file systems haven't been applied, primarily so as to satisfy my goal of giving reiserfs 1.0 only the minimum feature set necessary to be useful, and will appear in later releases. Log Structured File Systems [Rosenblum and Ousterhout] embody several such techniques, which I will describe after I mention two concerns with that approach:
* With small object file systems it is not feasible to cache in RAM a map of objectid to location for every object, since there are too many objects. This is an inherent problem in using temporal rather than semantic packing for small object file systems. With my approach my internal nodes are the equivalent of this objectid-to-location map, but total internal node size is proportional to the number of nodes rather than the number of objects. You can think of internal nodes as a compression of object location information made effective by the existence of an ordering function; this compression is both essential for small files and a major feature of my approach.
* I like obtaining good though not ideal semantic locality without paying a cleaning cost for active data. This is a less critical concern.

I frequently find myself classifying packing and layout optimizations as either appropriate for implementing dynamically or appropriate only for a cleaner. Optimizations whose computational overhead is large compared to their benefit tend to be appropriate for implementation in a cleaner, and a cleaner's benefits mostly impact the static portion of the file system (which typically consumes ~80% of the space).
Such objectives as 100% packing efficiency, exactly ordering block layout by semantic order, using the full semantic tree rather than the parent directory in determining semantic order, and compression are all best implemented by cleaner approaches. In summary, there is much to be learned from the LFS approach, and as I move past my initial objective of supplying a minimal-feature higher performance FS, I will apply some of those lessons. In the Preserve Lists section I speculate on the possibilities for a fastboot implementation that would merge the better features of preserve lists and logging.

= Directions For the Future =

To go one more order of magnitude smaller in file size will require adding functionality to the file system API, though it will not require discarding upward compatibility. The use of an exokernel is a better approach to small files if it is an option available to the OS designer; it is not currently an option for Linux users. In the future reiserfs will add such features as lightweight files, in which stat_data other than size is inherited from a parent if not created individually for the file; an API for reading and writing files without the overhead of file handles and open(); set-theoretic semantics; and many other features that you would expect from researchers who expect to be able to do all that they could do in a database, in the file system, and never really did understand why not.

= Conclusion =

Balanced tree file systems are inherently more space efficient than block allocation based file systems, with the differences reaching order-of-magnitude levels for small files. While other aspects of design will typically have a greater impact on performance for large files, in direct proportion to the smallness of the file the use of balanced trees offers performance advantages. A moderate advantage was found for large files.
Coding cost is mostly in the interfaces, and it is a measure of the OS designer's skill whether those costs are low in the OS. We make it possible for an OS designer to use the same interface for large and small objects, and thereby reduce interface coding cost. This approach is a new tool available to the OS designer for increasing the expressive power of all of the components in the OS through better name space integration.

Researchers interested in collaborating or just using my work will find me friendly. I tailor the framework of my collaborations to the needs of those we work with. I GPL reiserfs so as to meet the needs of academic collaborators. While that makes it unusable without a special license for commercial OSes, commercial vendors will find me friendly in setting up a commercial framework for commercial collaboration with commercial needs provided for.

= Acknowledgments =

Hans Reiser was the project initiator, primary architect, supplier of funding, and one of the programmers. Some folks at times remark that naming the file system Reiserfs was egotistic. It was so named after a potential investor hired all of my employees away from me, then tried to negotiate better terms for his possible investment, and suggested that he could arrange for 100 researchers to swear in a Russian court that I had had nothing to do with this project. That business partnership did not work out.

Vladimir Saveljev, while he did not author this paper, worked long hours writing the largest fraction of the lines of code in the file system, and is remarkably gifted at just making things work. Thanks, Vladimir. Anatoly Pinchuk wrote much of the core balancing code, and too much of the rest to list here. Thanks, Anatoly. It is the policy of the Naming System Venture that if someone quits before project completion, and then takes strong steps to try to prevent others from finishing the project, they shall not be mentioned in the acknowledgments.
This was all quite sad, and best forgotten. I would like to thank Alfred Ajlamazyan for his generosity in providing overhead at a time when his institute had little it could easily spare. Grigory Zaigralin is thanked for his work in making the machines run, administering the money, and being his usual determined-to-be-useful self. Igor Chudov, thanks for such effective procurement and hardware maintenance work. Eirik Fuller is thanked for his help with NFS and porting to 2.1. I would like to thank Rémy Card for the superb block allocation based file system (ext2fs) that I depended on for so many years, and that allowed me to benchmark against the best. Linus Torvalds, thank you for Linux.

= Business Model and Licensing =

I personally favor performing a balance of commercial and public works in my life. I have no axe to grind against software that is charged for, and no regrets at making reiserfs freely available to Linux users. This project is GPL'd, but I sell exceptions to the GPL to commercial OS vendors and file server vendors. It is not usable to them without such exceptions, and many of them are wise enough to understand that:
* the porting and integration service we are able to provide with the licensing is by itself worth what we charge,
* these services impact their time to market,
* and the relationship spreads the development costs across more OS vendors than just them alone.

I expect that Linux will prove to be quite effective in market sampling my intended market, but if you suspect that I also like seeing more people use it even if it is free to them, oh well. I believe it is not so much the cost that has made Linux so successful as it is the openness. Linux is a decentralized economy with honor and recognition as the currency of payment (and thus there is much honor in it). Commercial OS vendors are, at the moment, all closed economies, and doomed to fall in their competition with open economies just as communism eventually fell.
At some point an OS vendor will realize that if it:
* opens up its source code to decentralized modification,
* systematically rewards those who perform the modifications that are proven useful,
* systematically merges/integrates those modifications into its branded primary release branch while adding value as the integrator,
then it will acquire both the critical mass of the internet development community and the aggressive edge that no large communal group (such as a corporation) can have. Rather than saying to any such vendor that they should do this now, let me simply point out that whoever is first will have an enormous advantage.

Since I have more recognition than money to pass around as reward, my policy is to tend to require that those who contribute substantial software to this project have their names attached to a user-visible portion of the project. This official policy helps me deal with folks like Vladimir, who was much too modest to ever name the file system checker vsck without my insisting. Smaller contributions are to be noted in the source code and in the acknowledgments section of this paper. If you choose to contribute to this file system, and your work is accepted into the primary release, you should let me know if you want me to look for opportunities to integrate you into contracts from commercial vendors. Through packaging ourselves as a group, we are more marketable to such OS vendors. Many of us have spent too much time working at day jobs unrelated to our Linux work. This is too hard, and I hope to make things easier for us all. If you like this business model of selling GPL'd component software with related support services, but you write software not related to this file system, I encourage you to form a component supplier company also. Opportunities may arise for us to cooperate in our marketing, and I will be happy to do so.

= References =

* G.M. Adel'son-Vel'skii and E.M.
Landis, [http://en.scientificcommons.org/19884302 An algorithm for the organization of information], Soviet Math. Doklady 3, 1259-1263, 1962. This paper on AVL trees can be thought of as the founding paper of the field of storing data in trees. Those not conversant in Russian will want to read the [Lewis and Denenberg] treatment of AVL trees in its place. [Wood] contains a modern treatment of trees.
* [Apple] Apple Computer Inc., [http://books.google.com/books?as_isbn=0201177323 Inside Macintosh, Files], Addison-Wesley, 1992. Employs balanced trees for filenames; it was an interesting file system architecture for its time in a number of ways, but its problems with internal fragmentation have become more severe as disk drives have grown larger, and the code has not received sufficient further development.
* [Bach] Maurice J. Bach, [http://portal.acm.org/citation.cfm?id=8570 The Design of the Unix Operating System], Prentice-Hall Software Series, Englewood Cliffs, NJ, 1986. Superbly written but sadly dated; contains detailed descriptions of the file system routines and interfaces in a manner especially useful for those trying to implement a Unix-compatible file system. See [Vahalia].
* [BLOB] R. Haskin, Raymond A. Lorie, [http://portal.acm.org/citation.cfm?id=582353.582390 On Extending the Functions of a Relational Database System], SIGMOD Conference 1982, 207-212 (body of paper not on web). See the Drops section for a discussion of how this approach makes the tree less balanced, and the effect that has on performance.
* [Chen] Chen, P.M., Patterson, David A., [http://www.eecs.berkeley.edu/Pubs/TechRpts/1992/6129.html A New Approach to I/O Performance Evaluation -- Self-Scaling I/O Benchmarks, Predicted I/O Performance], 1993 ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems; also available on Chen's web page.
* [C-FFS] Ganger, Gregory R., Kaashoek, M.
Frans, [http://www.ece.cmu.edu/~ganger/papers/cffs.html Embedded Inodes and Explicit Grouping: Exploiting Disk Bandwidth for Small Files]. A very well written paper focused on 1-10k file size issues; they use some similar notions (most especially their concept of grouping, compared to my packing localities). Note that they focus on the 1-10k file size range, and not the sub-1k range. The 1-10k range is the weak point in reiserfs performance.
* [ext2fs] Rémy Card, [http://e2fsprogs.sourceforge.net/ext2intro.html Design and Implementation of the Second Extended Filesystem]. Extensive information; source code is available. When you consider how small this file system is (~6000 lines), its effectiveness becomes all the more remarkable.
* [FFS] M.K. McKusick, W.N. Joy, S.J. Leffler, and R.S. Fabry, [http://www.eecs.berkeley.edu/~brewer/cs262/FFS.pdf A fast file system for UNIX], ACM Transactions on Computer Systems, 2(3):181-197, August 1984. Describes the implementation of a file system which employs parent directory location knowledge in determining file layout. It uses large blocks for all but the tail of files to improve I/O performance, and uses small blocks called fragments for the tails so as to reduce the cost due to internal fragmentation. Numerous other improvements are also made to what was once the state of the art. FFS remains the architectural foundation for many current block allocation file systems, and was later bundled with the standard Unix releases. Note that unrequested serialization and the use of fragments place it at a performance disadvantage to ext2fs, though whether ext2fs is thereby made less reliable is a matter of dispute that I take no position on (reiserfs uses preserve lists; forgive my egotism in thinking that it is enough work for me to ensure that reiserfs solves the recovery problem, and to perhaps suggest that ext2fs would benefit from the use of preserve lists when shrinking directories).
* [Ganger] Gregory R. Ganger, Yale N.
Patt, [http://pages.cs.wisc.edu/~remzi/Classes/838/Fall2001/Papers/softupdates-osdi94.pdf Metadata Update Performance in File Systems].
* [Gifford] [http://portal.acm.org/citation.cfm?id=121133.121138 Semantic file systems]. Describes a file system enriched to have more than hierarchical semantics; he shares many goals with this author, forgive me for thinking his work worthwhile. If I had to suggest one improvement in a sentence, I would say his semantic algebra needs closure.
* [Hitz] Dave Hitz, [http://media.netapp.com/documents/wp_3002.pdf File System Design for an NFS File Server Appliance]. A rather well designed file system optimized for NFS and RAID in combination. Note that RAID increases the merits of write-optimization in block layout algorithms.
* [Holton and Das] Holton, Mike, Das, Raj, [http://www.uoks.uj.edu.pl/resources/flugor/IRIX/xfs-whitepaper.html XFS: A Next Generation Journalled 64-Bit Filesystem With Guaranteed Rate I/O]: "The XFS space manager and namespace manager use sophisticated B-Tree indexing technology to represent file location information contained inside directory files and to represent the structure of the files themselves (location of information in a file)." Note that it is still a block (extent) allocation based file system; no attempt is made to store the actual file contents in the tree. It is targeted at the needs of the other end of the file size usage spectrum from reiserfs, and is an excellent design for that purpose (and I would concede that reiserfs 1.0 is not suitable for their real-time large I/O market). SGI has also traditionally been a leader in resisting the use of unrequested serialization of I/O. Unfortunately, the paper is a bit vague on details, and source code is not freely available.
* [Howard] Howard, J.H., Kazar, M.L., Menees, S.G., Nichols, D.A., Satyanarayanan, M., Sidebotham, R.N., West, M.J., [http://www.cs.cmu.edu/~satya/docdir/s11.pdf Scale and Performance in a Distributed File System], ACM Transactions on Computer Systems, 6(1), February 1988. A classic benchmark; it was too CPU bound for both ext2fs and reiserfs.
* [Knuth] Knuth, D.E., [http://www-cs-faculty.stanford.edu/~knuth/taocp.html The Art of Computer Programming], Vol. 3 (Sorting and Searching), Addison-Wesley, Reading, MA, 1973. The earliest reference discussing trees storing records of varying length.
* [LADDIS] Wittle, Mark, and Keith, Bruce, [http://www.spec.org/sfs93/doc/WhitePaper.ps LADDIS: The Next Generation in NFS File Server Benchmarking], Proceedings of the Summer 1993 USENIX Conference, July 1993, pp. 111-128.
* [Lewis and Denenberg] Lewis, Harry R., Denenberg, Larry, [http://portal.acm.org/citation.cfm?id=548586 Data Structures & Their Algorithms], HarperCollins Publishers, NY, NY, 1991. An algorithms textbook suitable for readers wishing to learn about balanced trees and their AVL predecessors.
* [McCreight] McCreight, E.M., [http://portal.acm.org/citation.cfm?id=359839 Pagination of B*-trees with variable length records], Commun. ACM 20 (9), 670-674, 1977. Describes algorithms for trees with variable length records.
* [McVoy and Kleiman] [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.55.2970&rep=rep1&type=pdf Extent-like Performance from a UNIX File System]: the implementation of write-clustering for Sun's UFS.
* [OLE] Kraig Brockschmidt, [http://portal.acm.org/citation.cfm?id=207534 Inside OLE]. Discusses Structured Storage.
* [Ousterhout] J.K. Ousterhout, H. Da Costa, D. Harrison, J.A. Kunze, M.D. Kupfer, and J.G. Thompson, [http://portal.acm.org/citation.cfm?id=323627.323631 A trace-driven analysis of the UNIX 4.2BSD file system].
In Proceedings of the 10th Symposium on Operating Systems Principles, pages 15-24, Orcas Island, WA, December 1985.
* [NTFS] Helen Custer, [http://portal.acm.org/citation.cfm?id=527752 Inside the Windows NT File System], Microsoft Press, 1994. NTFS was architected by Tom Miller with contributions by Gary Kimura, Brian Andrew, and David Goebel. An easy to read little book; they fundamentally disagree with me on adding serialization of I/O not requested by the application programmer, and I note that the performance penalty they pay for their decision is high, especially compared with ext2fs. Their FS design is perhaps optimal for floppies and other hardware-eject media beyond OS control. A less serialized, higher performance log structured architecture is described in [Rosenblum and Ousterhout]. That said, Microsoft is to be commended for recognizing the importance of attempting to optimize for small files, and for leading the OS designer effort to integrate small objects into the file name space. This book is notable for not referencing the work of persons not working for Microsoft, or providing any form of proper attribution to previous authors such as [Rosenblum and Ousterhout].
* [Peacock] Dr. J. Kent Peacock, "The CounterPoint Fast File System", Proceedings of the Usenix Conference, Winter 1988.
* [Pike] Rob Pike and Peter Weinberger, [http://pdos.csail.mit.edu/~rsc/pike85hideous.pdf The Hideous Name], USENIX Summer 1985 Conference Proceedings, p. 563, Portland, Oregon, 1985. Short, informal, and drives home why inconsistent naming schemes in an OS are detrimental. His discussion of naming in Plan 9: [http://plan9.bell-labs.com/sys/doc/names.html The Use of Name Spaces in Plan 9].
* [Rosenblum and Ousterhout] [http://www.eecs.berkeley.edu/~brewer/cs262/LFS.pdf The Design and Implementation of a Log-Structured File System], Mendel Rosenblum and John K.
Ousterhout, February 1992 ACM Transactions on Computer Systems, this paper was quite influential in a number of ways on many modern file systems, and the notion of using a cleaner may be applied to a future release of reiserfs. There is an interesting on-going debate over the relative merits of FFS vs. LFS architectures, and the interested reader may peruse [http://www.eecs.harvard.edu/~margo/papers/icde93/ Transaction Support in a Log-Structured File System] and the arguments by Margo Seltzer it links to. * [Snyder] , [http://www.solarisinternals.com/si/reading/tmpfs.pdf tmpfs: A Virtual Memory File System] discusses a file system built to use swap space and intended for temporary files, due to a complete lack of disk synchronization it offers extremely high performance. * [Vahalia] Uresh Vahalia, [http://books.google.com/books?as_isbn=0131019082 UNIX internals: the new frontiers] [[category:ReiserFS]] [[category:Formatting-fixes-needed]] 3741056cf51d6c091c4e6603ee85166cd0991460 1745 1724 2010-04-25T04:47:16Z Chris goe 2 cleanup {{wayback|http://www.namesys.com/X0reiserfs.html|2006-11-13}} Three reasons why ReiserFS is great for you Last Update: 2002 Hans Reiser Three reasons why ReiserFS is great for you: # ReiserFS has fast journaling, which means that you don't spend your life waiting for fsck every time your laptop battery dies, or the UPS for your mission critical server gets its batteries disconnected accidentally by the UPS company's service crew, or your kernel was not as ready for prime time as you hoped, or the silly thing decides you mounted it too many times today. # ReiserFS is based on fast balanced trees. Balanced trees are more robust in their performance, and are a more sophisticated algorithmic foundation for a file system. When we started our project, there was a consensus in the industry that balanced trees were too slow for file system usage patterns. We proved that if you just do them right they are better--take a look at the benchmarks. 
We have fewer worst case performance scenarios than other file systems and generally better overall performance. If you put 100,000 files in one directory, we think that's fine; many other file systems try to tell you that you are wrong to want to do it.
# ReiserFS is more space efficient. If you write 100 byte files, we pack many of them into one block. Other file systems put each of them into their own block. We don't have fixed space allocation for inodes. That saves 6% of your disk.
Ok, it's time to fess up. The interesting stuff is still in the future. Because they are nifty, we are going to add database and hypertext-like features into the file system. Only by using balanced trees, with their effective handling of small files (database small fields, hypertext keywords), as our technical foundation can we hope to do this. That was our real motivation. As for performance, we may already be slightly better than the traditional file systems (and substantially better than the journaling ones). But they have been tweaking for decades, while we have just gotten started. This means that over the next few years we are going to improve faster than they are. Speaking more technically: ReiserFS is a file system using a plug-in based, object oriented variant on classical balanced tree algorithms. The results when compared to the ext2fs conventional block allocation based file system, running under the same operating system and employing the same buffering code, suggest that these algorithms are overall more efficient and every passing month are becoming yet more so. Loosely speaking, every month we find another performance cranny that needs work; we fix it. And every month we find some way of improving our overall general usage performance. The improvement in small file space and time performance suggests that we may now revisit a common OS design assumption: that one should aggregate small objects using layers above the file system layer.
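The space claim above can be checked with a little arithmetic. A minimal sketch, assuming a 4k block size and ignoring per-item header overhead (both simplifications of this sketch, not reiserfs's actual on-disk accounting):

```python
# Hypothetical sketch: disk consumed by 100-byte files under block-aligned
# storage vs. packed storage. Ignores item headers and other real overhead.

BLOCK = 4096  # assumed block size

def aligned_bytes(sizes, block=BLOCK):
    """Each file is rounded up to a whole block of its own."""
    return sum((s + block - 1) // block * block for s in sizes)

def packed_bytes(sizes, block=BLOCK):
    """Files are packed end to end; only the final block is partly empty."""
    total = sum(sizes)
    return (total + block - 1) // block * block

hundred_small = [100] * 100
# 100 files * 4096 bytes block-aligned, vs. ceil(10000 / 4096) = 3 blocks packed
```

With these assumed numbers, one hundred 100-byte files occupy 409,600 bytes block-aligned but only 12,288 bytes packed, a factor of more than thirty.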
Being more effective at small files does not make us less effective for other files. This is truly a general purpose FS. Our overall traditional FS usage performance is high enough to establish that. ReiserFS has a commitment to opening up the FS design to contributions; we are now adding plug-ins so that you can create your own types of directories and files.

= Introduction =

The author is one of many OS researchers who are attempting to unify the name spaces in the operating system in varying ways (e.g. [http://plan9.bell-labs.com/sys/doc/names.html Pike, The Use of Name Spaces in Plan 9]). None of us are well funded compared with the size of the task, and I am far from an exception to this rule. The natural consequence is that we each have attacked one small aspect of the task. My contribution is in incorporating small objects into the file system name space effectively. This implementation offers value to the average Linux user, in that it offers generally good performance compared to the current Linux file system known as ext2fs. It also saves space to an extent that is important for some applications, and convenient for most. It does extremely well for large directories, and has a variety of minor advantages. Since ext2fs is very similar to FFS and UFS in performance, the implementation also offers potential value to commercial OS vendors who desire greater than ext2fs performance without directory size issues, and who appreciate the value of a better foundation for integrating name spaces throughout the OS.

= Why Is There A Move Among Some OS Designers Towards Unifying Name Spaces? =

An operating system is composed of components that access other components through interfaces. Operating systems are complex enough that, like national economies, the architect cannot centrally plan the interactions of the components it is composed of.
The architect can provide a structural framework that has a marked impact on the efficiency and utility of those interactions. Economists have developed principles that govern large economic systems. Are there system principles that we might use to start a discussion of the ways increasing component interactivity via naming system design impacts the total utility of an operating system? I propose these:
* If one increases the number of other components that a particular component can interact with, one increases its expressive power and thereby its utility.
* One can increase the number of other components that a particular component can interact with either by increasing the number of interfaces it has, or by increasing the number of components that are accessible by its current interfaces.
* The cost of component interfaces dominates software design cost, just as the cost of wires dominates circuit design cost.
* Total system utility tends to be proportional not to the number of components, but to the number of possible component interactions.
It is not simply the number of components one has that determines an OS's expressive power; it is the number of opportunities to use them. The number of these opportunities is proportional to the number of possible combinations of components, which is in turn determined by their connectedness. Component connectedness in OS design is determined by name space design, to much the same extent that buses determine it in circuit design. Allow me to illustrate the impact of these principles with an imaginary example. Suppose two imaginary OS vendors with equally talented programmers hire two different OS architects. Suppose one of the architects centers the OS design around a single name space design that allows all of the components to access all other components via a single interface (assume this is possible, it is a theoretical example).
Suppose the other allows the ten different design groups in the company that are developing components to create their own ten name spaces. Suppose that the unified name space OS architect has half of the resources of the fragmented name space OS architect and creates half as many components. While the number of components is half as large, the number of connections is (1/2)²/((1/10)²·10) = 2.5 times larger. If you accept my hypothesis that utility is more proportional to connections than components, then the unified operating system with half the development cost will still offer more expressive utility. That is a powerful motivation. To return briefly to the long ago researched principles governing another member of the class of large systems, the economies of nations, it is perhaps interesting to note that Adam Smith in [http://en.wikisource.org/wiki/The_Wealth_of_Nations "The Wealth of Nations"] engaged in substantial study of the link between the extent of interconnectedness and the development of civilization, where the extent of interconnectedness was determined by waterways, etc. The link he found for economic systems was no less crucial than what is being suggested here for the effect of component interconnectedness on the total utility of software systems. I suggest that I am merely generalizing a long established principle from another field of science, namely that total utility in large systems with components that interact to generate utility is determined by the extent of their interconnection. There are many exceptions to these principles: not all chips on a motherboard sit on the bus, and analogous considerations apply to both OS design and the economies of nations. I hope the reader will accept that space considerations make it appropriate to gloss over these, and will consider the central point that under some circumstances unifying name spaces in a design can dramatically improve the utility of an OS.
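The arithmetic of the thought experiment can be verified with a short sketch. The total component count is an arbitrary illustrative number, and treating utility as simply n² interactions per name space is the example's own simplification:

```python
# Utility is approximated as proportional to possible pairwise interactions,
# here simply n^2 for n components sharing one name space (illustrative only).

def connections(components_per_space, num_spaces):
    """Interactions available when each name space only reaches its own members."""
    return num_spaces * components_per_space ** 2

n = 1000                               # arbitrary total for the fragmented vendor
unified = connections(n // 2, 1)       # half the components, one shared name space
fragmented = connections(n // 10, 10)  # all the components, ten isolated spaces
ratio = unified / fragmented           # (1/2)^2 / ((1/10)^2 * 10) = 2.5
```

The ratio is independent of n: halving the components but sharing one name space still yields 2.5 times the interactions of the fully staffed, ten-way-fragmented design.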
That can be an enormous motivation, and it has moved a number of OS researchers in their work (e.g. [http://plan9.bell-labs.com/sys/doc/names.html "The Use of Name Spaces in Plan 9", Rob Pike] and [http://pdos.csail.mit.edu/~rsc/pike85hideous.pdf "The Hideous Name", Rob Pike and P.J. Weinberger]). Unfortunately, it is not a small technical effort to combine name spaces. To combine 10 name spaces requires, if not the effort of creating 10 name spaces, perhaps an effort equivalent to creating 5 of them. Usually each of the name spaces has particular performance and semantic power requirements that require enhancing the unified name space, and it usually requires technical innovation to combine the advantages of each of the separated name spaces into a unified one. I would characterize none of the research groups currently approaching this unification problem as having funding equivalent to what went into creating 5 of the name spaces they would like to unify, and we are certainly no exception. For this reason I have picked one particular aspect of this larger problem for our focus: allowing small objects to effectively share the same file system interface that large objects use currently. As operating systems increase the number of their components, the higher development cost of a file system able to handle small files becomes increasingly worth the multiplicative effect it has on OS utility, as well as its reduction of OS component interface cost. = Should File Boundaries Be Block Aligned?
=

Making file boundaries block aligned has a number of effects:
* it minimizes the number of blocks a file is spread across (which is especially beneficial for multiple block files when locality of reference across files is poor),
* it wastes disk and buffer space in storing every less than fully packed block,
* it wastes I/O bandwidth with every access to a less than fully packed block when locality of reference is present,
* it increases the average number of block fetches required to access every file in a directory, and
* it results in simpler code.
The simpler code of block aligning file systems follows from not needing to create a layering to distinguish the units of the disk controller and buffering algorithms from the units of space allocation, and from not needing to optimize the packing of nodes as is done in balanced tree algorithms. For readers who have not been involved in balanced tree implementations: algorithms of this class are notorious for being much more work to implement than one would expect from their description. Sadly, they also appear to offer the highest performance solution for small files, once I remove certain simplifications from their implementation and add certain optimizations common to file system designs. I regret that code complexity (30k lines) is a major disadvantage of the approach compared to the 6k lines of the ext2fs approach. I started our analysis of the problem with an assumption that I needed to aggregate small files in some way, and that the question was: which solution was optimal? The simplest solution was to aggregate all small files in a directory together into either a file or the directory. But any aggregation into a file or directory wastes part of the last block in the aggregation. What does one do if there are only a few small files in a directory: aggregate them into the parent of the directory?
What if there are only a few small files in a directory at first, and then there are many small files? How do I decide what level to aggregate them at, and when to take them back from a parent of a directory and store them directly in the directory? As we did our analysis of these questions we realized that this problem was closely related to the balancing of nodes in a balanced tree. The balanced tree approach, by using an ordering of files which are then dynamically aggregated into nodes at a lower level, rather than a static aggregation or grouping, avoids this set of questions. In my approach I store both files and filenames in a balanced tree, with small files, directory entries, inodes, and the tail ends of large files all being more efficiently packed as a result of relaxing the requirements of block alignment, and eliminating the use of a fixed space allocation for inodes. I have a sophisticated and flexible means for arranging for the aggregation of files for maximal locality of reference, through defining the ordering of items in the tree. The body of large files is stored in unformatted nodes that are attached to the tree but isolated from the effects of possible shifting by the balancing algorithms. Approaches such as [Apple] and [Holton and Das] have stored filenames but not files in balanced trees. None of the file systems C-FFS, NTFS, or XFS aggregate files; all of them block align files, though all of those also do some variation on storing small files in the statically allocated block address fields of inodes if they are small enough to fit there. [C-FFS] has published an excellent discussion of both their approach and why small files rob a conventional file system of performance more in proportion to the number of small files than the number of bytes consumed by small files. However, I must note that their notion of what constitutes small is different from ours by one or two orders of magnitude.
Their use of an exo-kernel is simply an excellent approach for operating systems that have that as an available option. Semantics (files), packing (blocks/nodes), caching (read-ahead sizes, etc.), and the hardware interfaces of disk (sectors) and paging (pages) all have different granularity issues associated with them: a central point of our approach is that the optimal granularity of these often differs, and abstracting these into separate layers in which the granularity of one layer does not unintentionally impact other layers can improve space/time performance. Reiserfs innovates in that its semantic layer often conveys to the other layers an ungranulated ordering rather than one granulated by file boundaries. The reader is encouraged to note, while reading the algorithms, the areas in which reiserfs needs to go farther in doing so.

= Balanced Trees and Large File I/O =

There has long been an odd informal consensus that balanced trees are too slow for use in storing large files, perhaps originating in the performance of databases that have attempted to emulate file systems using balanced tree algorithms that were not originally architected for file system access patterns or their looser serialization requirements. It is hopefully easy for the reader to understand that storing many small files and tail ends of files in a single node where they can all be fetched in one I/O leads directly to higher performance. Unfortunately, it is quite complex to understand the interplay between I/O efficiency and block size for larger files, and space does not allow a systematic review of traditional approaches.
The reader is referred to [FFS], [Peacock], [McVoy], [Holton and Das], [Bach], [OLE], and [NTFS] for treatments of the topic, and discussions of various means of 1) reducing the effect of block size on CPU efficiency, 2) eliminating the need for inserting rotational delay between successive blocks, 3) placing small files into either inodes or directories, and 4) performing read-ahead. More commentary on these is in the annotated bibliography. Reiserfs has the following architectural weaknesses that stem directly from the overhead of repacking to save space and increase block size: 1) when the tail (files < 4k are all tail) of a file grows large enough to occupy an entire node by itself it is removed from the formatted node(s) it resides in, and it is converted into an unformatted node ([FFS] pays a similar conversion cost for fragments), 2) a tail that is smaller than one node may be spread across two nodes which requires more I/O to read if locality of reference is poor, 3) aggregating multiple tails into one node introduces separation of file body from tail, which reduces read performance ([FFS] has a similar problem, and for reiserfs files near the node in size the effect can be significant), 4) when you add one byte to a file or tail that is not the last item in a formatted node, then on average half of the whole node is shifted in memory. If any of your applications perform I/O in such a way that they generate many small unbuffered writes, reiserfs will make you pay a higher price for not being able to buffer the I/O. Most applications that create substantial file system load employ effective I/O buffering, often simply as a result of using the I/O functions in the standard C libraries. By avoiding accesses in small blocks/extents reiserfs improves I/O efficiency. 
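Weakness 4 in the list above (on average, half the node shifted per insertion) follows directly from the geometry of a packed node, as a small sketch shows. The 4k node size is the one assumed elsewhere in the text:

```python
NODE_SIZE = 4096  # formatted node size assumed in the text

def shift_cost(insert_offset, used_bytes=NODE_SIZE):
    """Bytes that must be moved to open a gap at insert_offset in a fully
    packed node: everything from the offset to the end of the node slides."""
    return used_bytes - insert_offset

# Averaged over every possible insertion offset, the cost is half the node.
average = sum(shift_cost(o) for o in range(NODE_SIZE)) / NODE_SIZE
```

An insert at the front of the node moves the entire 4k; an insert at the very end moves nothing; averaged over uniform offsets the cost is about 2k, i.e. half the node, which is the memory bandwidth overhead the text describes.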
Extent-based file systems such as VxFS, and write-clustering systems such as ext2fs, are not so effective in applying these techniques that they choose to use 512-byte blocks rather than 1k blocks as their defaults. Ext2fs reports a 20% speedup when 4k rather than 1k blocks are used, but the authors of ext2fs advise the use of 1k blocks to avoid wasting space. There are a number of worthwhile large file optimizations that have not been added to either ext2fs or reiserfs, and both file systems are somewhat primitive in this regard, reiserfs being the more primitive of the two. Large files simply were not my research focus, and, this being a small research project, I did not implement the many well known techniques for enhancing large file I/O. The buffering algorithms are probably more crucial than any other component in large file I/O, and partly out of a desire for a fair comparison of the approaches I have not modified these. I have added no significant optimizations for large files, beyond increasing the block size, that are not found in ext2fs. Except for the size of the blocks, there is not a large inherent difference between: 1) the cost of adding a pointer to an unformatted node to my tree plus writing the node, and 2) the cost of adding an address field to an inode plus writing the block. It is likely that, except for block size, the primary determinants of high performance large file access are orthogonal to the decision of whether to use balanced tree algorithms for small and medium sized files. For large files we get some advantage from our tree being more balanced than the tree formed by an inode which points to a triple indirect block. We haven't an easy method for measuring the performance gain from that, though. There is performance overhead due to the memory bandwidth cost of balancing nodes for small files. We think it is worth it, though.
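The point about balance can be made concrete: an ext2fs-style inode reaches the largest files through up to three levels of indirect blocks, and a balanced tree with a plausible internal-node fanout needs no more levels for a comparable file. The fanout figure below is a hypothetical illustration of this sketch, not reiserfs's actual constant:

```python
def lookup_levels(leaf_blocks, fanout):
    """Internal levels a balanced tree needs so one root spans all leaf blocks."""
    levels, span = 0, 1
    while span < leaf_blocks:
        span *= fanout
        levels += 1
    return levels

# A ~4 GB file in 4k blocks is about 1,000,000 blocks; with a hypothetical
# fanout of 170 pointers per internal node, three levels suffice, the same
# depth as ext2fs's triple indirection.
```

So for large files the per-lookup depth of the two designs converges; as the text argues, the real differences lie elsewhere (block size, buffering, layout).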
= Serialization and Consistency =

The issues of ensuring recoverability with minimal serialization and data displacement necessarily dominate high performance design. Let's define the two extremes in serialization so that the reason for this can be clear. Consider the relative speed of a set of I/Os in which every block request in the set is fed to the elevator algorithms of the kernel and the disk drive firmware fully serially, each request awaiting the completion of the previous request. Now consider the other extreme, in which all block requests are fed to the elevator algorithms all together, so that they may all be sorted and performed in close to their sorted order (disk drive firmware doesn't use a pure elevator algorithm). The unserialized extreme may be more than an order of magnitude faster, due to the cost of rotations and seeks. Unnecessarily serializing I/O prevents the elevator algorithm from doing its job of placing all of the I/Os in their layout sequence rather than chronological sequence. Most of high performance design centers around making I/Os in the order they are laid out on disk, and laying out blocks on disk in the order that the I/Os will want to be issued. [Snyder] discusses a file system that obtains high performance from a complete lack of disk synchronization, but is only suitable for temporary files that don't need to survive reboot. I think its known value to Solaris users indicates that the optimal buffering policy varies from file to file. Ganger discusses methods for using ordering of writes rather than serialization for ensuring conventional file system meta-data integrity; [McVoy] previously suggested but did not implement ordering of buffer writes. Ext2fs is fast in substantial part due to avoiding synchronous writes of metadata, and I have much personal experience with it that leads me to prefer compiles that are fast.
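A toy model shows why handing the elevator a whole batch beats feeding it requests one at a time. The block addresses are invented for this sketch, and real drive firmware is cleverer than a pure sorted sweep:

```python
def head_travel(addresses, start=0):
    """Total seek distance when block addresses are visited in the given order."""
    position, total = start, 0
    for block in addresses:
        total += abs(block - position)
        position = block
    return total

requests = [900, 10, 800, 20, 700, 30]   # arbitrary block addresses
serialized = head_travel(requests)       # one request at a time, arrival order
batched = head_travel(sorted(requests))  # whole set handed over, one sorted sweep
```

Here the serialized order accumulates 4710 block-widths of head movement against 900 for the sorted sweep: better than a factor of five from nothing but letting the elevator see the whole set at once.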
[ I would like to see ext2fs adopt a policy that all dirty buffers for files not flagged as temporary are queued for writing, and that the existence of a dirty buffer means that the disk is busy. This will require replacing buffer I/O locking with copy-on-write, but an idle disk is such a terrible thing to waste. :-) ] [NTFS] by default adds unnecessary serialization to an extent that even older file systems such as [FFS] do not, and its performance characteristics reflect that. In fairness, it should be said that it is the superior approach for most removable media without software control of ejection (e.g. IBM PC floppies). Reiserfs employs a new scheme called preserve lists for ensuring recoverability, which avoids overwriting old meta-data by writing the new meta-data nearby rather than over the old.

= Why Aggregate Small Objects at the File System Level? =

There has long been a tradition of file system developers deciding that effective handling of small files is not significant to performance, and of application programmers caring enough about performance to not store small files as separate entities in the file system. To store small objects one may either make the file system efficient for the task, or sidestep the problem by aggregating small objects in a layer above the file system. Sidestepping the problem has three disadvantages: utility, code complexity, and performance. Utility and code complexity: allowing OS designers to effectively use a single namespace with a single interface for both large and small objects decreases coding cost and increases the expressive power of components throughout the OS. I feel reiserfs shows the effects of a larger development investment focused on a simpler interface when compared with many solutions for this currently available in the object oriented toolkit community, such as the Structured Storage available in Microsoft's [OLE].
By simpler I mean I added nothing to the file system API to distinguish large and small objects, and I leave it to the directory semantics and archiving programs to aggregate objects. Multiple layers cost more to implement, cost more to code the interfaces for utilizing, and provide less flexibility. Performance: it is most commonly the case that when one layers one file system on top of another the performance is substantially reduced, and Structured Storage is not an exception to this general rule. Reiserfs, which does not attempt to delegate the small object problem to a layer above, avoids this performance loss. I have heard it suggested by some that this layering avoids the performance loss from syncing on file close, as many file systems do. I suggest that this is adding an error to an error rather than fixing it. Let me make clear that I believe those who write such layers above the file system do not do so out of stupidity. I know of at least one company at which a solution that layers small object storage above the file system exists because the file system developers refused to listen to the non-file-system group's description of its needs, and the file system group had to be sidestepped in generating the solution. Current file systems are fairly well designed for the purposes that their users currently use them for: my goal is to change file size usage patterns. The author remembers arguments that once showed clearly that there was no substantial market need for disk drives larger than 10MB, based on then-current usage statistics. While [C-FFS] points out that 80% of file accesses are to files below 10k, I do not believe it reasonable to expect that usage measurements of file systems for which small files are inappropriate will show small files being frequently used. Application programmers are smarter than that.
Currently 80% of file accesses are to the first order of magnitude in file size for which it is currently sensible to store the object in the file system. I regret that one can only speculate as to whether, once file systems become effective for small files and database tasks, usage patterns will change to 80% of file accesses being to files less than 100 bytes. What I can do is show, via the 80/20 Banded File Set Benchmark presented later, that in such circumstances small file performance potentially dominates total system performance. In summary, the ongoing reinvention of incompatible object aggregation techniques above the file system layer is expensive, less expressive, less integrated, slower, and less efficient in its storage than incorporating balanced tree algorithms into the file system.

= Tree Definitions =

Balanced trees are used in databases and, more generally, wherever a programmer needs to search and store to non-random memory by a key, and has the time to code it this way. The usual evolution for programmers is to first think that hashing will be simpler and more efficient, and then realize only after getting into the sordid details that the combination of space efficiency, minimizing disk accesses, and the feasibility of caching the top part of the tree makes the tree approach more effective. By the time the details of bucket overflow are worked out, one has something close to a balanced tree: the cost of effectively handling bucket overflow just isn't less than the cost of balancing, unless the buckets are always all in RAM. Hashing is often a good solution when there is no non-random memory involved, such as when hashing a cache. The Linux dcache code uses hashing for accessing a cache of in-memory directory entries. Sometimes one uses partial or full hashing of keys within a balanced tree.
If you do full hashing within a tree, and you cache the top part of that tree, you have something rather similar to extensible hashing, except it is more flexible and efficient. Sometimes programmers code using unbalanced trees. Most filesystems do essentially that. Balanced trees generally do a better job of minimizing the average number of disk accesses. There is literature establishing that balanced trees are optimal for the worst case when there is no caching of the tree. This is rather pointless literature, as the average case when cached is what is important, and I am afraid that the existing literature proves that which is feasible to prove rather than that which is relevant. That said, practitioners know from experience that making the tree less balanced leads to more I/Os. Discussions of the exceptions to this are rather interesting but not for here.... I regret that I must assume that the reader is familiar with basic balanced tree algorithms [Wood], [Lewis and Denenberg], [Knuth], [McCreight]. No attempt will be made to survey tree design here since balanced trees are one of the most researched and complex topics in algorithm theory, and require treatment at length. I must compound this discourtesy with a concise set of definitions that sorely lack accompanying diagrams, my apologies. Finally, I'll truly annoy the reader by saying that the header files contain nice ascii art, and if you want full definition of the structures, the source is the place. Classically, balanced trees are designed with the set of keys assumed to be defined by the application, and the purpose of the tree design is to optimize searching through those keys. In my approach the purpose of the tree is to optimize the reference locality and space efficient packing of objects, and the keys are defined as best optimizes the algorithm for that. 
Keys are used in place of inode numbers in the file system, thereby substituting a mapping of keys to node location (the internal nodes) for a mapping of inode number to file location. Keys are longer than inode numbers, but one needs to cache fewer of them than one would need to cache inode numbers when more than one file is stored in a node. In my tree, I still require that a filename be resolved one component at a time. It is an interesting topic for future research whether this is necessary or optimal. This is a more complex issue than a casual reader might realize: directory-at-a-time lookup accomplishes a form of compression, makes mounting other name spaces and file system extensions simpler, makes security simpler, and makes future enhanced semantics simpler. Since small files typically lead to large directories, it is fortuitous that, as a natural consequence of our use of tree algorithms, our directory mechanisms are much more effective for very large directories than most other file systems are (notable exceptions include [Holton and Das]). The tree has three node types: internal nodes, formatted nodes, and unformatted nodes. The contents of internal and formatted nodes are sorted in the order of their keys. (Unformatted nodes contain no keys.) Internal nodes consist of pointers to sub-trees separated by their delimiting keys. The key that precedes a pointer to a sub-tree is a duplicate of the first key in the first formatted node of that sub-tree. Internal nodes exist solely to allow determining which formatted node contains the item corresponding to a key. ReiserFS starts at the root node, examines its contents, and based on it determines which subtree contains the item corresponding to the desired key. From the root node reiserfs descends into the tree, branching at each node, until it reaches the formatted node containing the desired item.
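The branching step of the descent can be sketched as follows. Since the key before a pointer duplicates the first key of that pointer's subtree, picking the child is a sorted search among the delimiting keys. The function name and the flat-list representation of a node's keys are inventions of this sketch:

```python
from bisect import bisect_right

def child_to_follow(delimiting_keys, search_key):
    """Index of the subtree pointer to descend into within an internal node.

    Pointer i covers keys from delimiting_keys[i-1] (inclusive, since that
    key duplicates the subtree's first key) up to delimiting_keys[i].
    """
    return bisect_right(delimiting_keys, search_key)
```

For example, with delimiting keys [10, 20, 30], a search for 5 follows pointer 0, a search for exactly 10 follows pointer 1 (whose subtree begins with key 10), and a search for 99 follows the last pointer.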
The first (bottom) level of the tree consists of unformatted nodes, the second level consists of formatted nodes, and all levels above consist of internal nodes. The highest level contains the root node. The number of levels is increased as needed by adding a new root node at the top of the tree. All paths from the root of the tree to all formatted leaves are equal in length, and all paths to all unformatted leaves are also equal in length and one node longer than the paths to the formatted leaves. This equality in path length, and the high fanout it provides, is vital to high performance, and in the Drops section I will describe how the lengthening of the path that occurred as a result of introducing the [BLOB] approach (the use of indirect items and unformatted nodes) proved a measurable mistake. Formatted nodes consist of items. Items have four types: direct items, indirect items, directory items, and stat data items. All items contain a key which is unique to the item. This key is used to sort, and to find, the item. Direct items contain the tails of files; tails are the last part of the file (the last file_size modulo FS block size bytes of a file). Indirect items consist of pointers to unformatted nodes. All but the tail of the file is contained in its unformatted nodes. Directory items contain the key of the first directory entry in the item followed by a number of directory entries. Depending on the configuration of reiserfs, stat data may be stored as a separate item, or it may be embedded in a directory entry. We are still benchmarking to determine which way is best. A file consists of a set of indirect items followed by a set of up to two direct items, with two direct items representing the case when a tail is split across two nodes.
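The tail definition above ("the last file_size modulo FS block size" bytes) can be written down directly. This is a sketch of the definition only; it deliberately ignores the conversion rules (tails near the node size being moved to unformatted nodes, MAX_DIRECT_ITEM_LEN limits) that the surrounding text describes:

```python
BLOCK_SIZE = 4096  # unformatted node size assumed in the text

def body_and_tail(file_size, block_size=BLOCK_SIZE):
    """(full unformatted blocks for the body, bytes left over as the tail)."""
    return file_size // block_size, file_size % block_size

# A file smaller than one block is all tail: files < 4k are all tail per
# the text, and go into direct items packed alongside other small items.
```

For a 10,000-byte file this gives two full unformatted blocks and a 1,808-byte tail; a 100-byte file is all tail.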
If a tail is larger than the maximum size of a file that can fit into a formatted node but is smaller than the unformatted node size (4k), then it is stored in an unformatted node, and a pointer to it plus a count of the space used is stored in an indirect item. Directories consist of a set of directory items. Directory items consist of a set of directory entries. Directory entries contain the filename and the key of the file which is named. There is never more than one item of the same item type from the same object stored in a single node (there is no reason one would want to use two separate items rather than combining them). The first item of a file or directory contains its stat data. When performing balancing, and analyzing the packing of a node and its two neighbors, we ensure that the three nodes cannot be compressed into two nodes. I feel greater compression than this is best left to an FS cleaner to perform rather than attempting it dynamically.

= ReiserFS Structures =

The ReiserFS tree has Max_Height = N (the current default value is N = 5). The tree lives in disk blocks; each disk block that belongs to the ReiserFS tree has a block head.

An internal node of the tree holds keys and pointers to disk blocks:

<pre>
| Block_Head | Key 0 | Key 1 | Key 2 | ... | Key N | Pointer 0 | Pointer 1 | Pointer 2 | ... | Pointer N | Pointer N+1 | ..Free Space.. |
</pre>

A leaf node of the tree holds items and item headers:

<pre>
| Block_Head | IHead 0 | IHead 1 | IHead 2 | ... | IHead N | ......Free Space...... | Item N | ... | Item 2 | Item 1 | Item 0 |
</pre>

An unformatted node of the tree holds the data of a big file:

<pre>
| .............................file data............................. |
</pre>

ReiserFS objects are files and directories. The maximum number of objects is 2^32-4 = 4 294 967 292. Each object is a set of items:

File items:
# StatData item + [Direct item] (small files: size from 0 bytes to MAX_DIRECT_ITEM_LEN = blocksize - 112 bytes)
# StatData item + InDirect item + [Direct item] (big files: size > MAX_DIRECT_ITEM_LEN bytes)

Directory items:
# StatData item + Directory item

Every ReiserFS object has an object ID and a key.

== Internal Node structures ==

struct block_head:

{|
! Field Name !! Type !! Size (bytes) !! Description
|-
| blk_level || unsigned short || 2 || Level of the block in the tree (1 = leaf; 2, 3, 4, ... = internal)
|-
| blk_nr_item || unsigned short || 2 || Number of keys in an internal block, or number of items in a leaf block
|-
| blk_free_space || unsigned short || 2 || Free space in the block, in bytes
|-
| blk_right_delim_key || struct key || 16 || Right delimiting key for this block (leaf nodes only)
|-
| total || || 6 or 22 || 6 (stored as 8) bytes for internal nodes; 22 (stored as 24) bytes for leaf nodes
|}

struct key:

{|
! Field Name !! Type !! Size (bytes) !! Description
|-
| k_dir_id || __u32 || 4 || ID of the parent directory
|-
| k_object_id || __u32 || 4 || ID of the object (also the inode number)
|-
| k_offset || __u32 || 4 || Offset from the beginning of the object to the current byte of the object
|-
| k_uniqueness || __u32 || 4 || Type of the item (StatData = 0, Direct = -1, InDirect = -2, Directory = 500)
|-
| total || || 16 || 16 bytes
|}

struct disk_child (pointer to a disk block):

{|
! Field Name !! Type !! Size (bytes) !! Description
|-
| dc_block_number || unsigned long || 4 || Disk child's block number
|-
| dc_size || unsigned short || 2 || Disk child's used space
|-
| total || || 6 || 6 (stored as 8) bytes
|}

== Leaf Node structures ==

struct block_head is as above; for leaf nodes it includes the right delimiting key, for a total of 22 (stored as 24) bytes.

Everything in the file system is stored as a set of items. Each item has an item head. The item head contains the key of the item, its free space (for indirect items), and the location of the item itself within the block.

struct item_head (IHead):

{|
! Field Name !! Type !! Size (bytes) !! Description
|-
| ih_key || struct key || 16 || Key used to search for the item; all item headers are sorted by this key
|-
| u.ih_free_space / u.ih_entry_count || __u16 || 2 || Free space in the last unformatted node for an InDirect item; 0xFFFF for a Direct item; 0xFFFF for a StatData item; the number of directory entries for a Directory item
|-
| ih_item_len || __u16 || 2 || Total size of the item body
|-
| ih_item_location || __u16 || 2 || Offset of the item body within the block
|-
| ih_reserved || __u16 || 2 || Used by reiserfsck
|-
| total || || 24 || 24 bytes
|}

There are 4 types of items: stat data items, directory items, indirect items, and direct items.

struct stat_data (the ReiserFS version of the UFS disk inode, minus the address blocks):

{|
! Field Name !! Type !! Size (bytes) !! Description
|-
| sd_mode || __u16 || 2 || File type and permissions
|-
| sd_nlink || __u16 || 2 || Number of hard links
|-
| sd_uid || __u16 || 2 || Owner ID
|-
| sd_gid || __u16 || 2 || Group ID
|-
| sd_size || __u32 || 4 || File size
|-
| sd_atime || __u32 || 4 || Time of last access
|-
| sd_mtime || __u32 || 4 || Time the file was last modified
|-
| sd_ctime || __u32 || 4 || Time the inode (stat data) was last changed (except changes to sd_atime and sd_mtime)
|-
| sd_rdev || __u32 || 4 || Device
|-
| sd_first_direct_byte || __u32 || 4 || Offset from the beginning of the file to the first byte of the file's direct item: -1 for a directory; 1 for small files (direct items only); >1 for big files with indirect and direct items; -1 for big files with indirect items but no direct item
|-
| total || || 32 || 32 bytes
|}

A directory item consists of directory entry heads followed by the file names:

<pre>
| deHead 0 | deHead 1 | deHead 2 | ... | deHead N | fileName N | ... | fileName 2 | fileName 1 | fileName 0 |
</pre>

A direct item holds a small file body:

<pre>
| ....................Small File Body.................... |
</pre>

An indirect item is an array of pointers to unformatted blocks (each unfPointer is 4 bytes); the unformatted blocks contain the body of a big file:

<pre>
| unfPointer 0 | unfPointer 1 | unfPointer 2 | ... | unfPointer N |
</pre>

struct reiserfs_de_head (deHead):

{|
! Field Name !! Type !! Size (bytes) !! Description
|-
| deh_offset || __u32 || 4 || Third component of the directory entry key (all reiserfs_de_heads are sorted by this value)
|-
| deh_dir_id || __u32 || 4 || Object ID of the parent directory of the object referenced by this directory entry
|-
| deh_objectid || __u32 || 4 || Object ID of the object referenced by this directory entry
|-
| deh_location || __u16 || 2 || Offset of the name within the whole item
|-
| deh_state || __u16 || 2 || Flags: 1) entry contains stat data (for the future); 2) entry is hidden (unlinked)
|-
| total || || 16 || 16 bytes
|}

fileName is the name of the file (an array of bytes of variable length). The maximum length of a file name is blocksize - 64 (for a 4kb blocksize, the maximum name length is 4032 bytes).

= Using the Tree to Optimize Layout of Files =

There are four levels at which layout optimization is performed:
# the mapping of logical block numbers to physical locations on disk,
# the assignment of nodes to logical block numbers,
# the ordering of objects within the tree, and
# the balancing of the objects across the nodes they are packed into.

== Physical Layout ==

For SCSI drives, the mapping of logical block numbers to physical locations is performed by the disk drive manufacturer; for IDE drives this mapping is done by the device driver; and for all drives it is also potentially done by volume management software.
The logical block number to physical location mapping by the drive manufacturer is usually done using cylinders. I agree with the authors of [ext2fs] and most others that the significant file placement feature of FFS was not the actual cylinder boundaries, but the placing of files and their inodes on the basis of their parent directory's location. FFS used explicit knowledge of actual cylinder boundaries in its design. I find that minimizing the distance in logical blocks between semantically adjacent nodes, without tracking cylinder boundaries, accomplishes an excellent approximation of optimizing according to actual cylinder boundaries, and I find its simplicity an aid to implementation elegance.

== Node Layout ==

When I place nodes of the tree on the disk, I search for the first empty block in the bitmap (of used block numbers), starting at the location of the left neighbor of the node in the tree ordering and moving in the direction I last moved in. This was experimentally found to be better, for the benchmarks employed, than the following alternatives: 1) taking the first non-zero entry in the bitmap, 2) taking the entry after the last one that was assigned, in the direction last moved in (this was 3% faster for writes and 10-20% slower for subsequent reads), and 3) starting at the left neighbor and moving in the direction of the right neighbor. When changing block numbers for the purpose of avoiding overwriting sending nodes before shifted items reach disk in their new recipient node (see the description of preserve lists later in the paper), the benchmarks employed were ~10% faster when starting the search from the left neighbor rather than from the node's current block number, even though it adds significant overhead to determine the left neighbor (the current implementation risks I/O to read the parent of the left neighbor). It used to be that we would reverse direction when we reached the end of the disk drive.
Fortunately we checked to see if it makes a difference which direction one moves in when allocating blocks to a file, and indeed we found that it made a significant difference to always allocate in the increasing block number direction. We hypothesize that this is due to matching the disk spin direction by allocating using increasing block numbers.

== Ordering within the Tree ==

While I give here an example of how I have defined keys to optimize locality of reference and packing efficiency, I would like to stress that key definition is a powerful and flexible tool that I am far from finished experimenting with. Some key definition decisions depend very much on usage patterns, and this means that someday one will select from several key definitions when creating the file system. For example, consider the decision of whether to pack all directory entries together at the front of the file system, or to pack the entries near the files they name. For large file usage patterns one should pack all directory items together, since systems with such usage patterns are effective in caching the entries for all directories. For small files the name should be near the file. Similarly, for large files the stat data should be stored separately from the body, either with the other stat data from the same directory, or with the directory entry. (It was likely a mistake for me not to assign stat data its own key in the current implementation, as packing it in with direct and indirect items complicates our code for handling those items, and prevents me from easily experimenting with the effects of changing its key assignment.) It is not necessary for a file's packing to reflect its name; that is merely my default. With each file, my next release will offer the option of overriding the default by use of a system call.
It is feasible to pack an object completely independently of its semantics using these algorithms, and I predict that there will be many applications, perhaps even most, for which a packing different from that determined by object names is more appropriate. Currently the mandatory tying of packing locality to semantics distorts both semantics and packing from what might otherwise be their independent optimums, much as tying block boundaries to file boundaries distorts I/O and space allocation algorithms from their separate optimums. For example, placing most files accessed while booting at the start of the disk, in their access order, is a very tempting future optimization that the use of packing localities makes feasible to consider. The Structure of a Key: Each file item has a key with the structure <locality_id, object_id, offset, uniqueness>. The locality_id is by default the object_id of the parent directory. The object_id is the unique id of the file, and it is set to the first unused objectid when the object is created. The tendency of this to pack successive object creations in a directory adjacently is fortuitous for many usage patterns. For files, the offset is the offset within the logical object of the first byte of the item. In version 0.2 all directory entries had their own individual keys stored with them and were each distinct items; in the current version I store one key in the item, which is the key of the first entry, and compute each entry's key as needed from that one stored key. For directories, the offset key component is the first four bytes of the filename, which you may think of as the lexicographic rather than numeric offset. For directory items the uniqueness field differentiates filename entries identical in the first 4 bytes.
For all item types it indicates the item type, and for the leftmost item in a buffer it indicates whether the preceding item in the tree is of the same type and object as this item. Placing this information in the key is useful when analyzing balancing conditions, but it increases key length for non-directory items, and is a questionable architectural feature. Every file has a unique objectid, but this cannot be used for finding the object; only keys are used for that. Objectids merely ensure that keys are unique. If you never use the ReiserFS features that change an object's key then it is immutable; otherwise it is mutable. (This feature aids support for NFS daemons, etc.) We spent quite some time debating internally whether the use of mutable keys for identifying an object had deleterious long-term architectural consequences: in the end I decided it was acceptable iff we require any object recording a key to possess a method for updating its copy of it. This is the architectural price of avoiding caching a map of objectid to location that might have very poor locality of reference due to objectids not changing with object semantics. I pack an object with the packing locality of the directory it was first created in unless the key is explicitly changed. It remains packed there even if it is unlinked from the directory. I do not move it from the locality it was created in without an explicit request, unlike the [C-FFS] approach, which stores all multiply linked files together and pays the cost of moving them from their original locations when the second link occurs. I think a file linked into multiple directories might as well get at least the locality-of-reference benefits of one of those directories. In summary, this approach 1) places files from the same directory together, and 2) places directory entries from the same directory together with each other and with the stat data for the directory.
Note that there is no interleaving of objects from different directories in the ordering at all, and that all directory entries from the same directory are contiguous. You'll note that this does not accomplish packing the files of small directories with common parents together, and does not employ the full partial ordering in determining the linear ordering; it merely uses parent directory information. I feel the proper place for employing full tree structure knowledge is in the implementation of an FS cleaner, not in the dynamic algorithms.

== Node Balancing Optimizations ==

When balancing nodes I do so according to the following ordered priorities:
# minimize the number of nodes used
# minimize the number of nodes affected by the balancing operation
# minimize the number of uncached nodes affected by the balancing operation
# if shifting to another formatted node is necessary, maximize the bytes shifted

Priority 4 is based on the assumption that the location of an insertion of bytes into the tree is an indication of the likely location of future insertions, and that this policy will on average reduce the number of formatted nodes affected by future balance operations. There are more subtle effects as well: if one places nodes next to each other randomly, and one has a choice between those nodes being mostly moderately well packed or packed to an extreme of either well or poorly, one is more likely to be able to combine more of the nodes if one chooses the policy of extremism. Extremism is a virtue in space-efficient node packing. The maximal shift policy is not applied to internal nodes, as extremism is not a virtue in time-efficient internal node balancing.
=== Drops ===

(The difficult design issues in the current version that our next version can do better.)

Consider dividing a file or directory into drops, with each drop having a separate key, and no two drops from one file or directory occupying the same node without being compressed into one drop. The key for each drop is set to the key of the object (file or directory) plus the offset of the drop within the object. For directories the offset is lexicographic and by filename; for files it is numeric and in bytes. In the course of several file system versions we have experimented with and implemented solid, liquid, and air drops. Solid drops were never shifted, and drops would only solidify when they occupied the entirety of a formatted node. Liquid drops are shifted in such a way that any liquid drop which spans a node fully occupies the space in its node; like a physical liquid, it is shiftable but not compressible. Air drops merely meet the balancing condition of the tree. Reiserfs 0.2 implemented solid drops for all but the tails of files. If a file was at least one node in size, it would align the start of the file with the start of a node, block-aligning the file. This block alignment of the start of multi-drop files was a design error that wasted space: even if the locality of reference is so poor as to make one not want to read parts of semantically adjacent files, if the nodes are near each other then the cost of reading an extra block is thoroughly dwarfed by the cost of the seek and rotation needed to reach the first node of the file. As a result the block alignment saves little time, though it costs significant space for 4-20k files. Reiserfs with block alignment of multi-drop files and no indirect items experienced the following rather interesting behavior, which was partially responsible for making it only 88% space efficient for files averaging 13k in size (the Linux kernel).
When the tail of a larger-than-4k file was followed in the tree ordering by another file larger than 4k, then since the drop before it was solid and aligned and the drop after it was solid and aligned, no matter what size the tail was it occupied an entire node. In the current version we place all but the tail of large files into a level of the tree reserved for full unformatted nodes, and create indirect items in the formatted nodes which point to the unformatted nodes. This is known in the database literature as the [BLOB] approach. This extra level added to the tree comes at the cost of making the tree less balanced (I consider the unformatted nodes pointed to as part of the tree) and increasing the maximal depth of the tree by 1. For medium-sized files, the use of indirect items increases the cost of caching pointers by mixing data with them. The reduction in fanout often causes the read algorithms to more frequently fetch only one node of the file being read at a time, as one waits to read the uncached indirect item before reading the node with the file data. There are more parents per file read with the use of indirect items than with internal nodes, as a direct result of the reduced fanout due to mixing tails and indirect items in the node. The most serious flaw is that these reads of the various nodes necessary to the reading of the file involve additional rotations and seeks compared to the case with drops. With my initial drop approach they are usually sequential in their disk layout, even the tail, and the internal node parent points to all of them in such a way that all of them that are contained by that parent, or by another internal node in cache, can be requested at once in one sequential read. Non-sequential reads of nodes are more than an order of magnitude more costly than sequential reads, and this single consideration dominates effective read optimization.
Unformatted nodes make file system recovery faster but less robust, in that one reads their indirect items rather than the nodes themselves to insert them into the recovered tree, and one cannot read them to confirm that their contents are from the file that an indirect item says they are from. In this they make ReiserFS similar to an inode-based system without logging. A moderately better solution would have been to simply eliminate the requirement that the start of a multi-node file be placed at the start of a node, rather than introducing BLOBs, and to depend on the use of a file system cleaner to optimally pack the 80% of files that don't move frequently, using algorithms that move even solid drops. Yet that still leaves the problem of formatted nodes not being efficient for mmap() purposes (one must copy them before writing rather than merely modifying their page table entries, and memory bandwidth is expensive even if CPU is cheap). For this reason I have the following plan for the next version. I will have three trees: one tree maps keys to unformatted nodes, one tree maps keys to formatted nodes, and one tree maps keys to directory entries and stat_data. Now, it is only natural if you are thinking that this would mean that to read a file, accessing first the directory entry and stat_data, then the unformatted nodes, then the tail, one must hop long distances across the disk, going first to one tree and then to another. This is indeed why it took me two years to realize it could be made to work. My plan is to interleave the nodes of the three trees according to the following algorithm: block numbers are assigned to nodes when the nodes are created or preserved, and someday will be assigned when the cleaner runs. The choice of block number is based on first determining what other node it should be placed near, and then finding the nearest free block that can be found in the elevator's current direction.
Currently we use the left neighbor of the node in the tree as the node it should be placed near. This is nice and simple. Oh well. Time to create a virtual neighbor layer. The new scheme will continue to first determine the node it should be placed near, and then start the search for an empty block from that spot, but it will use a more complicated determination of what node to place it near. This method will cause all nodes from the same packing locality to be near each other, will cause all directory entries and stat_data to be grouped together within that packing locality, and will interleave formatted and unformatted nodes from the same packing locality. Pseudo-code is best for describing this:

<pre>
/* For use by reiserfs_get_new_blocknrs when determining where in the
   bitmap to start the search for a free block, and for use by the
   read-ahead algorithm when there are not enough nodes to the right and
   in the same packing locality for packing locality read-ahead purposes. */
get_logical_layout_left_neighbors_blocknr(key of current node)
{
    /* Based on examination of the current node's key and type, find the
       virtual neighbor of that node. */
    if body node
        if first body node of file
            if (node in tail tree whose key is less but is in same packing locality exists)
                return blocknr of such node with largest key
            else
                find node with largest key less than key of current node in stat_data tree
                return its blocknr
        else
            return blocknr of node in body tree with largest key less than key of current node
    else if tail node
        if (node in body tree belonging to same file as first tail of current node exists)
            return its blocknr
        else if (node in tail tree with lesser delimiting key but same packing locality exists)
            return blocknr of such node with largest delimiting key
        else
            return blocknr of node with largest key less than key of current node in stat_data tree
    else /* is a stat_data tree node */
        if stat_data node with lesser key from same packing locality exists
            return blocknr of such node with largest key
        else
            /* no node from same packing locality with lesser key exists */
}

/* For use by packing locality read-ahead. */
get_logical_layout_right_neighbors_blocknr(key of current node)
{
    right-handed version of get_logical_layout_left_neighbors_blocknr logic
}
</pre>

It is my hope that this will improve caching of pointers to unformatted nodes, plus improve caching of directory entries and stat_data, by separating them from file bodies to a greater extent. I also hope that it will improve read performance for 1-10k files, and that it will allow us to do this without decreasing space efficiency.

=== Code Complexity ===

I thought it appropriate to mention some of the notable effects of simple design decisions on our implementation's code length. When we changed our balancing algorithms to shift parts of items rather than only whole items, so as to pack nodes tighter, this had an impact on code complexity.
Another multiplicative determinant of balancing code complexity was the number of item types; introducing indirect items doubled this, and changing directory items from liquid drops to air drops also increased it. Storing stat data in the first direct or indirect item of the file complicated the code for processing those items more than if I had made stat data its own item type. When one finds oneself with an NxN coding complexity issue, it usually indicates the need for adding a layer of abstraction. The NxN effect of the number of item types on balancing code complexity is an instance of that design principle, and we will address it in the next major rewrite. The balancing code will employ a set of item operations which all item types must support. The balancing code will then invoke those operations without caring to understand any more of the meaning of an item's type than that it determines which item-specific operation handler is called. Adding a new item type, say a compressed item, will then merely require writing a set of item operations for that item, rather than requiring modification of most parts of the balancing code as it does now. We now feel that the function that determines what resources are needed to perform a balancing operation, fix_nodes(), might as well be written to decide what operations will be performed during balancing, since it pretty much has to do so anyway. That way, the function that performs the balancing with the nodes locked, do_balance(), can be gutted of most of its complexity.

= Buffering & the Preserve List =

We implemented for version 0.2 of our file system a system of write ordering that tracked all shifting of items in the tree, and ensured that no node that had had an item shifted from it was written before the node that had received the item was written. This is necessary to prevent a system crash from causing the loss of an item that might not have been recently created.
This tracking approach worked, and the overhead it imposed was not measurable in our benchmarks. When in the next version we changed to partially shifting items and increased the number of item types, this code grew out of control in its complexity. I decided to replace it with a scheme that was far simpler to code and also more effective in typical usage patterns. The scheme is as follows: if an item is shifted from a node, change the block that its buffer will be written to. Change it to the nearest free block to the old block's left neighbor, and rather than freeing the old block, place its number on a "preserve list". (Saying nearest is slightly simplistic, in that the blocknr assignment function moves from the left neighbor in the direction of increasing block numbers.) When a "moment of consistency" is achieved, free all of the blocks on the preserve list. A moment of consistency occurs when there are no nodes in memory into which objects have been shifted (this could be made more precise, but then it would be more complex). If disk space runs out, force a moment of consistency to occur. This is sufficient to ensure that the file system is recoverable. Note that during the large file benchmarks the preserve list was freed several times in the middle of the benchmark. The percentage of buffers preserved is small in practice except during deletes, and one can arrange for moments of consistency to occur as frequently as one wants. Note that I make no claim that this approach is better than the Soft Updates approach employed by [Granger] or by us in version 0.2; I merely note that tracking the order of writes is more complex than this approach for balanced trees which partially shift items. We may go back to the old approach some day, though not to the code that I threw out. Preserve lists substantially hamper performance for files in the 1-10k size range. We are re-evaluating them.
Ext2fs avoids the metadata shifting problem by never shrinking directories and by using fixed inode space allocations.

= Lessons From Log Structured File Systems =

Many techniques from other file systems haven't been applied, primarily so as to satisfy my goal of giving reiserfs 1.0 only the minimum feature set necessary to be useful; they will appear in later releases. Log Structured File Systems [Rosenblum and Ousterhout] embody several such techniques, which I will describe after I mention two concerns with that approach:
* With small-object file systems it is not feasible to cache in RAM a map of objectid to location for every object, since there are too many objects. This is an inherent problem in using temporal rather than semantic packing for small-object file systems. With my approach the internal nodes are the equivalent of this objectid-to-location map, but total internal node size is proportional to the number of nodes rather than the number of objects. You can think of internal nodes as a compression of object location information made effective by the existence of an ordering function; this compression is both essential for small files and a major feature of my approach.
* I like obtaining good though not ideal semantic locality without paying a cleaning cost for active data. This is a less critical concern.
I frequently find myself classifying packing and layout optimizations as either appropriate for implementing dynamically or appropriate only for a cleaner. Optimizations whose computational overhead is large compared to their benefit tend to be appropriate for implementation in a cleaner, and a cleaner's benefits mostly impact the static portion of the file system (which typically consumes ~80% of the space).
Such objectives as 100% packing efficiency, exactly ordering block layout by semantic order, using the full semantic tree rather than just the parent directory in determining semantic order, and compression are all best implemented by cleaner approaches. In summary, there is much to be learned from the LFS approach, and as I move past my initial objective of supplying a minimal-feature, higher-performance FS I will apply some of those lessons. In the Preserve Lists section I speculate on the possibilities for a fastboot implementation that would merge the better features of preserve lists and logging.

= Directions For the Future =

To go one more order of magnitude smaller in file size will require adding functionality to the file system API, though it will not require discarding upward compatibility. The use of an exokernel is a better approach to small files if it is an option available to the OS designer; it is not currently an option for Linux users. In the future reiserfs will add such features as lightweight files, in which stat_data other than size is inherited from a parent if it is not created individually for the file, an API for reading and writing files without the overhead of file handles and open(), set-theoretic semantics, and many other features that you would expect from researchers who expect to be able to do all that they could do in a database, in the file system, and never really did understand why not.

= Conclusion =

Balanced tree file systems are inherently more space efficient than block allocation based file systems, with the differences reaching order-of-magnitude levels for small files. While other aspects of design will typically have a greater impact on performance for large files, in direct proportion to the smallness of the file the use of balanced trees offers performance advantages. A moderate advantage was found for large files.
Coding cost is mostly in the interfaces, and it is a measure of the OS designer's skill whether those costs are low in the OS. We make it possible for an OS designer to use the same interface for large and small objects, and thereby reduce interface coding cost. This approach is a new tool available to the OS designer for increasing the expressive power of all of the components in the OS through better name space integration. Researchers interested in collaborating or just using my work will find me friendly. I tailor the framework of my collaborations to the needs of those we work with. I GPL reiserfs so as to meet the needs of academic collaborators. While that makes it unusable without a special license for commercial OSes, commercial vendors will find me friendly in setting up a commercial framework for commercial collaboration with commercial needs provided for. = Acknowledgments = Hans Reiser was the project initiator, primary architect, supplier of funding, and one of the programmers. Some folks at times remark that naming the filesystem Reiserfs was egotistic. It was so named after a potential investor hired all of my employees away from me, then tried to negotiate better terms for his possible investment, and suggested that he could arrange for 100 researchers to swear in Russian Court that I had had nothing to do with this project. That business partnership did not work out. Vladimir Saveljev, while he did not author this paper, worked long hours writing the largest fraction of the lines of code in the file system, and is remarkably gifted at just making things work. Thanks Vladimir. Anatoly Pinchuk wrote much of the core balancing code, and too much of the rest to list here. Thanks, Anatoly. It is the policy of the Naming System Venture that if someone quits before project completion, and then takes strong steps to try to prevent others from finishing the project, that they shall not be mentioned in the acknowledgements. 
This was all quite sad, and best forgotten. I would like to thank Alfred Ajlamazyan for his generosity in providing overhead at a time when his institute had little it could easily spare. Grigory Zaigralin is thanked for his work in making the machines run, administering the money, and being his usual determined to be useful self. Igor Chudov, thanks for such effective procurement and hardware maintenance work. Eirik Fuller is thanked for his help with NFS and porting to 2.1. I would like to thank Remi Card for the superb block allocation based file system (ext2fs) that I depended on for so many years, and that allowed me to benchmark against the best. Linus Torvalds, thank you for Linux. = Business Model and Licensing = I personally favor performing a balance of commercial and public works in my life. I have no axe to grind against software that is charged for, and no regrets at making reiserfs freely available to Linux users. This project is GPL'd, but I sell exceptions to the GPL to commercial OS vendors and file server vendors. It is not usable to them without such exceptions, and many of them are wise enough to understand that: * the porting and integration service we are able to provide with the licensing is by itself worth what we charge, * that these services impact their time to market, * and that the relationship spreads the development costs across more OS vendors than just them alone I expect that Linux will prove to be quite effective in market sampling my intended market, but if you suspect that I also like seeing more people use it even if it is free to them, oh well. I believe it is not so much the cost that has made Linux so successful as it is the openness. Linux is a decentralized economy with honor and recognition as the currency of payment (and thus there is much honor in it). Commercial OS vendors are, at the moment, all closed economies, and doomed to fall in their competition with open economies just as communism eventually fell. 
At some point an OS vendor will realize that if it: * opens up its source code to decentralized modification, * systematically rewards those who perform the modifications that are proven useful, * systematically merges/integrates those modifications into its branded primary release branch while adding value as the integrator, that it will acquire both the critical mass of the internet development community, and the aggressive edge that no large communal group (such as a corporation) can have. Rather than saying to any such vendor that they should do this now, let me simply point out that whoever is first will have an enormous advantage. Since I have more recognition than money to pass around as reward, my policy is to tend to require that those who contribute substantial software to this project have their names attached to a user visible portion of the project. This official policy helps me deal with folks like Vladimir, who was much too modest to ever name the file system checker vsck without my insisting. Smaller contributions are to be noted in the source code, and the acknowledgements section of this paper. If you choose to contribute to this file system, and your work is accepted into the primary release, you should let me know if you want me to look for opportunities to integrate you into contracts from commercial vendors. Through packaging ourselves as a group, we are more marketable to such OS vendors. Many of us have spent too much time working at day jobs unrelated to our Linux work. This is too hard, and I hope to make things easier for us all. If you like this business model of selling GPL'd component software with related support services, but you write software not related to this file system, I encourage you to form a component supplier company also. Opportunities may arise for us to cooperate in our marketing, and I will be happy to do so. = References = * G.M. Adel'son-Vel'skii and E.M.
Landis, [http://en.scientificcommons.org/19884302 An algorithm for the organization of information], Soviet Math. Doklady 3, 1259-1262, 1962. This paper on AVL trees can be thought of as the founding paper of the field of storing data in trees. Those not conversant in Russian will want to read the [Lewis and Denenberg] treatment of AVL trees in its place. [Wood] contains a modern treatment of trees. * [Apple] Apple Computer Inc, [http://books.google.com/books?as_isbn=0201177323 Inside Macintosh, Files], Addison-Wesley, 1992. It employs balanced trees for filenames, and it was an interesting file system architecture for its time in a number of ways, but now its problems with internal fragmentation have become more severe as disk drives have grown larger, and the code has not received sufficient further development. * [Bach] Maurice J. Bach, [http://portal.acm.org/citation.cfm?id=8570 The Design of the Unix Operating System], 1986, Prentice-Hall Software Series, Englewood Cliffs, NJ. Superbly written but sadly dated, it contains detailed descriptions of the file system routines and interfaces in a manner especially useful for those trying to implement a Unix compatible file system. See [Vahalia]. * [BLOB] R. Haskin, Raymond A. Lorie: [http://portal.acm.org/citation.cfm?id=582353.582390 On Extending the Functions of a Relational Database System]. SIGMOD Conference (body of paper not on web) 1982: 207-212. See the Drops section for a discussion of how this approach makes the tree less balanced, and the effect that has on performance. * [Chen] Chen, P.M., Patterson, David A., [http://www.eecs.berkeley.edu/Pubs/TechRpts/1992/6129.html A New Approach to I/O Performance Evaluation] -- Self-Scaling I/O Benchmarks, Predicted I/O Performance, 1993 ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems, also available on Chen's web page. * [C-FFS] Ganger, Gregory R., Kaashoek, M.
Frans, [http://www.ece.cmu.edu/~ganger/papers/cffs.html Embedded Inodes and Explicit Grouping: Exploiting Disk Bandwidth for Small Files]. A very well written paper focused on 1-10k file size issues; they use some similar notions (most especially their concept of grouping compared to my packing localities). Note that they focus on the 1-10k file size range, and not the sub-1k range. The 1-10k range is the weak point in reiserfs performance. * [ext2fs] by Rémy Card, [http://e2fsprogs.sourceforge.net/ext2intro.html Design and Implementation of the Second Extended Filesystem]. Extensive information; source code is available. When you consider how small this file system is (~6000 lines), its effectiveness becomes all the more remarkable. * [FFS] M.K. McKusick, W.N. Joy, S.J. Leffler, and R.S. Fabry. [http://www.eecs.berkeley.edu/~brewer/cs262/FFS.pdf A fast file system for UNIX]. ACM Transactions on Computer Systems, 2(3):181--197, August 1984; describes the implementation of a file system which employs parent directory location knowledge in determining file layout. It uses large blocks for all but the tail of files to improve I/O performance, and uses small blocks called fragments for the tails so as to reduce the cost due to internal fragmentation. Numerous other improvements are also made to what was once the state-of-the-art. FFS remains the architectural foundation for many current block allocation file systems, and was later bundled with the standard Unix releases. Note that unrequested serialization and the use of fragments place it at a performance disadvantage to ext2fs, though whether ext2fs is thereby made less reliable is a matter of dispute that I take no position on (reiserfs uses preserve lists; forgive my egotism in thinking that it is enough work for me to ensure that reiserfs solves the recovery problem, and to perhaps suggest that ext2fs would benefit from the use of preserve lists when shrinking directories). * [Ganger] Gregory R. Ganger, Yale N.
Patt, [http://pages.cs.wisc.edu/~remzi/Classes/838/Fall2001/Papers/softupdates-osdi94.pdf Metadata Update Performance in File Systems] * [Gifford] [http://portal.acm.org/citation.cfm?id=121133.121138 Semantic file systems]. Describes a file system enriched to have more than hierarchical semantics; he shares many goals with this author (forgive me for thinking his work worthwhile). If I had to suggest one improvement in a sentence, I would say his semantic algebra needs closure. * [Hitz] Dave Hitz, [http://media.netapp.com/documents/wp_3002.pdf File System Design for an NFS File Server Appliance]. A rather well designed file system optimized for NFS and RAID in combination. Note that RAID increases the merits of write-optimization in block layout algorithms. * [Holton and Das] Holton, Mike, and Das, Raj, [http://www.uoks.uj.edu.pl/resources/flugor/IRIX/xfs-whitepaper.html XFS: A Next Generation Journalled 64-Bit Filesystem With Guaranteed Rate I/O]: "The XFS space manager and namespace manager use sophisticated B-Tree indexing technology to represent file location information contained inside directory files and to represent the structure of the files themselves (location of information in a file)." Note that it is still a block (extent) allocation based file system; no attempt is made to store the actual file contents in the tree. It is targeted at the needs of the other end of the file size usage spectrum from reiserfs, and is an excellent design for that purpose (and I would concede that reiserfs 1.0 is not suitable for their real-time large I/O market). SGI has also traditionally been a leader in resisting the use of unrequested serialization of I/O. Unfortunately, the paper is a bit vague on details, and source code is not freely available.
* [Howard] [http://www.cs.cmu.edu/~satya/docdir/s11.pdf Scale and Performance in a Distributed File System], Howard, J.H., Kazar, M.L., Menees, S.G., Nichols, D.A., Satyanarayanan, M., Sidebotham, R.N., West, M.J., ACM Transactions on Computer Systems, 6(1), February 1988. A classic benchmark; it was too CPU bound for both ext2fs and reiserfs. * [Knuth] Knuth, D.E., [http://www-cs-faculty.stanford.edu/~knuth/taocp.html The Art of Computer Programming], Vol. 3 (Sorting and Searching), Addison-Wesley, Reading, MA, 1973; the earliest reference discussing trees storing records of varying length. * [LADDIS] Wittle, Mark, and Bruce, Keith, [http://www.spec.org/sfs93/doc/WhitePaper.ps LADDIS: The Next Generation in NFS File Server Benchmarking], Proceedings of the Summer 1993 USENIX Conference, July 1993, pp. 111-128. * [Lewis and Denenberg] Lewis, Harry R., Denenberg, Larry, [http://portal.acm.org/citation.cfm?id=548586 Data Structures & Their Algorithms], HarperCollins Publishers, NY, NY, 1991; an algorithms textbook suitable for readers wishing to learn about balanced trees and their AVL predecessors. * [McCreight] McCreight, E.M., [http://portal.acm.org/citation.cfm?id=359839 Pagination of B*-trees with variable length records], Commun. ACM 20 (9), 670-674, 1977; describes algorithms for trees with variable length records. * [McVoy and Kleiman] [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.55.2970&rep=rep1&type=pdf Extent-like Performance from a UNIX File System]: The implementation of write-clustering for Sun's UFS. * [OLE] [http://portal.acm.org/citation.cfm?id=207534 Inside OLE] by Kraig Brockschmidt; discusses Structured Storage. * [Ousterhout] J.K. Ousterhout, H. Da Costa, D. Harrison, J.A. Kunze, M.D. Kupfer, and J.G. Thompson. [http://portal.acm.org/citation.cfm?id=323627.323631 A trace-driven analysis of the UNIX 4.2BSD file system].
In Proceedings of the 10th Symposium on Operating Systems Principles, pages 15-24, Orcas Island, WA, December 1985. * [NTFS] [http://portal.acm.org/citation.cfm?id=527752 Inside the Windows NT File System]. The book is written by Helen Custer; NTFS was architected by Tom Miller with contributions by Gary Kimura, Brian Andrew, and David Goebel. Microsoft Press, 1994. An easy to read little book; they fundamentally disagree with me on adding serialization of I/O not requested by the application programmer, and I note that the performance penalty they pay for their decision is high, especially compared with ext2fs. Their FS design is perhaps optimal for floppies and other hardware eject media beyond OS control. A less serialized higher performance log structured architecture is described in [Rosenblum and Ousterhout]. That said, Microsoft is to be commended for recognizing the importance of attempting to optimize for small files, and leading the OS designer effort to integrate small objects into the file name space. This book is notable for not referencing the work of persons not working for Microsoft, or providing any form of proper attribution to previous authors such as [Rosenblum and Ousterhout]. * [Peacock] Dr. J. Kent Peacock, "The CounterPoint Fast File System", Proceedings of the Usenix Conference, Winter 1988. * [Pike] Rob Pike and Peter Weinberger, [http://pdos.csail.mit.edu/~rsc/pike85hideous.pdf The Hideous Name], USENIX Summer 1985 Conference Proceedings, pp. 563, Portland, Oregon, 1985. Short, informal, and drives home why inconsistent naming schemes in an OS are detrimental. His discussion of naming in Plan 9: [http://plan9.bell-labs.com/sys/doc/names.html The Use of Name Spaces in Plan 9] * [Rosenblum and Ousterhout] [http://www.eecs.berkeley.edu/~brewer/cs262/LFS.pdf The Design and Implementation of a Log-Structured File System], Mendel Rosenblum and John K.
Ousterhout, ACM Transactions on Computer Systems, February 1992. This paper was quite influential in a number of ways on many modern file systems, and the notion of using a cleaner may be applied to a future release of reiserfs. There is an interesting on-going debate over the relative merits of FFS vs. LFS architectures, and the interested reader may peruse [http://www.eecs.harvard.edu/~margo/papers/icde93/ Transaction Support in a Log-Structured File System] and the arguments by Margo Seltzer it links to. * [Snyder] [http://www.solarisinternals.com/si/reading/tmpfs.pdf tmpfs: A Virtual Memory File System] discusses a file system built to use swap space and intended for temporary files; due to a complete lack of disk synchronization it offers extremely high performance. * [Vahalia] Uresh Vahalia, [http://books.google.com/books?as_isbn=0131019082 UNIX internals: the new frontiers] [[category:ReiserFS]] [[category:Formatting-fixes-needed]] {{wayback|http://www.namesys.com/X0reiserfs.html|2006-11-13}} Three reasons why ReiserFS is great for you. Last Update: 2002, Hans Reiser. Three reasons why ReiserFS is great for you: # ReiserFS has fast journaling, which means that you don't spend your life waiting for fsck every time your laptop battery dies, or the UPS for your mission-critical server gets its batteries disconnected accidentally by the UPS company's service crew, or your kernel was not as ready for prime time as you hoped, or the silly thing decides you mounted it too many times today. # ReiserFS is based on fast balanced trees. Balanced trees are more robust in their performance, and are a more sophisticated algorithmic foundation for a file system. When we started our project, there was a consensus in the industry that balanced trees were too slow for file system usage patterns.
We proved that if you just do them right they are better--take a look at the benchmarks. We have fewer worst case performance scenarios than other file systems and generally better overall performance. If you put 100,000 files in one directory, we think it's fine; many other file systems try to tell you that you are wrong to want to do it. # ReiserFS is more space efficient. If you write 100 byte files, we pack many of them into one block. Other file systems put each of them into their own block. We don't have fixed space allocation for inodes. That saves 6% of your disk. Ok, it's time to fess up. The interesting stuff is still in the future. Because they are nifty, we are going to add database and hypertext like features into the file system. Only by using balanced trees, with their effective handling of small files (database small fields, hypertext keywords), as our technical foundation can we hope to do this. That was our real motivation. As for performance, we may already be slightly better than the traditional file systems (and substantially better than the journaling ones). But they have been tweaking for decades, while we have just got started. This means that over the next few years we are going to improve faster than they are. Speaking more technically: ReiserFS is a file system using a plug-in based object oriented variant on classical balanced tree algorithms. The results when compared to the ext2fs conventional block allocation based file system, running under the same operating system and employing the same buffering code, suggest that these algorithms are overall more efficient and every passing month are becoming yet more so. Loosely speaking, every month we find another performance cranny that needs work; we fix it. And every month we find some way of improving our overall general usage performance.
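The space arithmetic behind reason 3 can be sanity-checked directly; the 4096-byte block size below is assumed purely for illustration:

```python
block = 4096       # block size (assumed for illustration)
file_size = 100    # the 100-byte files from the example above

# A block-aligning file system spends one whole block per file:
utilization_aligned = file_size / block   # ~2.4% of the block is used

# A packing file system fits many such files into one block
# (ignoring per-item headers for simplicity):
files_per_block = block // file_size      # ~40 files share one block

print(round(utilization_aligned * 100, 1), files_per_block)  # -> 2.4 40
```

So for workloads dominated by 100-byte files, block alignment uses roughly forty times the space that packing does, before even counting the fixed inode table.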
The improvement in small file space and time performance suggests that we may now revisit a common OS design assumption that one should aggregate small objects using layers above the file system layer. Being more effective at small files does not make us less effective for other files. This is truly a general purpose FS. Our overall traditional FS usage performance is high enough to establish that. ReiserFS has a commitment to opening up the FS design to contributions; we are now adding plug-ins so that you can create your own types of directories and files. = Introduction = The author is one of many OS researchers who are attempting to unify the name spaces in the operating system in varying ways (e.g. [http://plan9.bell-labs.com/sys/doc/names.html Pike, The Use of Name Spaces in Plan9]). None of us are well funded compared with the size of the task, and I am far from an exception to this rule. The natural consequence is that we each have attacked one small aspect of the task. My contribution is in incorporating small objects into the file system name space effectively. This implementation offers value to the average Linux user, in that it provides generally good performance compared to the current Linux file system known as ext2fs. It also saves space to an extent that is important for some applications, and convenient for most. It does extremely well for large directories, and has a variety of minor advantages. Since ext2fs is very similar to FFS and UFS in performance, the implementation also offers potential value to commercial OS vendors who desire greater than ext2fs performance without directory size issues, and who appreciate the value of a better foundation for integrating name spaces throughout the OS. = Why Is There A Move Among Some OS Designers Towards Unifying Name Spaces? = An operating system is composed of components that access other components through interfaces.
Operating systems are complex enough that, like national economies, the architect cannot centrally plan the interactions of the components that they are composed of. The architect can provide a structural framework that has a marked impact on the efficiency and utility of those interactions. Economists have developed principles that govern large economic systems. Are there system principles that we might use to start a discussion of the ways increasing component interactivity via naming system design impacts the total utility of an operating system? I propose these: * If one increases the number of other components that a particular component can interact with, one increases its expressive power and thereby its utility. * One can increase the number of other components that a particular component can interact with either by increasing the number of interfaces it has, or by increasing the number of components that are accessible by its current interfaces. * The cost of component interfaces dominates software design cost, just as the cost of wires dominates circuit design cost. * Total system utility tends to be proportional not to the number of components, but to the number of possible component interactions. It is not simply the number of components that one has that determines an OS's expressive power; it is the number of opportunities to use them that determines it. The number of these opportunities is proportional to the number of possible combinations of them, and the number of possible combinations is determined by their connectedness. Component connectedness in OS design is determined by name space design, to much the same extent that buses determine it in circuit design. Allow me to illustrate the impact of these principles with the use of an imaginary example. Suppose two imaginary OS vendors with equally talented programmers hire two different OS architects.
Suppose one of the architects centers the OS design around a single name space design that allows all of the components to access all other components via a single interface (assume this is possible, it is a theoretical example). Suppose the other allows the ten different design groups in the company that are developing components to create their own ten name spaces. Suppose that the unified name space OS architect has half of the resources of the fragmented name space OS architect and creates half as many components. While the number of components is half as large, the number of connections is (1/2)²/((1/10)²·10) = 2.5 times larger. If you accept my hypothesis that utility is more proportional to connections than components, then the unified operating system with half the development cost will still offer more expressive utility. That is a powerful motivation. To return briefly to the long ago researched principles governing another member of the class of large systems, the economies of nations, it is perhaps interesting to note that Adam Smith in [http://en.wikisource.org/wiki/The_Wealth_of_Nations "The Wealth of Nations"] engaged in substantial study of the link between the extent of interconnectedness and the development of civilization, where the extent of interconnectedness was determined by waterways, etc. The link he found for economic systems was no less crucial than what is being suggested here for the effect of component interconnectedness on the total utility of software systems. I suggest that I am merely generalizing a long established principle from another field of science, namely that total utility in large systems with components that interact to generate utility is determined by the extent of their interconnection. There are many exceptions to these principles: not all chips on a motherboard sit on the bus, and analogous considerations apply to both OS design and the economies of nations.
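The connection counting in the imaginary example above can be written out explicitly; N below is an arbitrary illustrative component count, and the ratio is independent of its value:

```python
# Unified design: N/2 components, all mutually reachable -> (N/2)^2 pairs.
# Fragmented design: N components split evenly across 10 name spaces,
# reachable only within a space -> 10 * (N/10)^2 pairs.
N = 1000  # illustrative component count

unified_connections = (N / 2) ** 2
fragmented_connections = 10 * (N / 10) ** 2

print(unified_connections / fragmented_connections)  # -> 2.5
```

Half the components, yet two and a half times the possible interactions: that is the whole argument in one ratio.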
I hope the reader will accept that space considerations make it appropriate to gloss over these, and will consider the central point that under some circumstances unifying name spaces in a design can dramatically improve the utility of an OS. That can be an enormous motivation, and it has moved a number of OS researchers in their work (e.g. [http://plan9.bell-labs.com/sys/doc/names.html "The Use of Name Spaces in Plan9", Rob Pike] and [http://pdos.csail.mit.edu/~rsc/pike85hideous.pdf "The Hideous Name", Rob Pike and P.J. Weinberger]). Unfortunately, it is not a small technical effort to combine name spaces. To combine 10 name spaces requires, if not the effort to create 10 name spaces, perhaps an effort equivalent to creating 5 of the name spaces. Usually each of the name spaces has particular performance and semantic power requirements that require enhancing the unified name space, and it usually requires technical innovation to combine the advantages of each of the separated name spaces into a unified name space. I would characterize none of the research groups currently approaching this unification problem as having funding equivalent to what went into creating 5 of the name spaces they would like to unify, and we are certainly no exception. For this reason I have picked one particular aspect of this larger problem for our focus: allowing small objects to effectively share the same file system interface that large objects use currently. As operating systems increase the number of their components, the higher development cost of a file system able to handle small files becomes more worth the multiplicative effect it has on OS utility, as well as its reduction of OS component interface cost. = Should File Boundaries Be Block Aligned? 
= Making file boundaries block aligned has a number of effects: it minimizes the number of blocks a file is spread across (which is especially beneficial for multiple block files when locality of reference across files is poor), it wastes disk and buffer space in storing every less than fully packed block, it wastes I/O bandwidth with every access to a less than fully packed block when locality of reference is present, it increases the average number of block fetches required to access every file in a directory, and it results in simpler code. The simpler code of block aligning file systems follows from not needing to create a layering to distinguish the units of the disk controller and buffering algorithms from the units of space allocation, and from not needing to optimize the packing of nodes as is done in balanced tree algorithms. For readers who have not been involved in balanced tree implementations, algorithms of this class are notorious for being much more work to implement than one would expect from their description. Sadly, they also appear to offer the highest performance solution for small files, once I remove certain simplifications from their implementation and add certain optimizations common to file system designs. I regret that code complexity (30k lines) is a major disadvantage of the approach compared to the 6k lines of the ext2fs approach. I started our analysis of the problem with an assumption that I needed to aggregate small files in some way, and that the question was, which solution was optimal? The simplest solution was to aggregate all small files in a directory together into either a file or the directory. But any aggregation into a file or directory wastes part of the last block in the aggregation. What does one do if there are only a few small files in a directory, aggregate them into the parent of the directory? 
What if there are only a few small files in a directory at first, and then there are many small files; how do I decide what level to aggregate them at, and when to take them back from a parent of a directory and store them directly in the directory? As we did our analysis of these questions we realized that this problem was closely related to the balancing of nodes in a balanced tree. The balanced tree approach, by using an ordering of files which are then dynamically aggregated into nodes at a lower level, rather than a static aggregation or grouping, avoids this set of questions. In my approach I store both files and filenames in a balanced tree, with small files, directory entries, inodes, and the tail ends of large files all being more efficiently packed as a result of relaxing the requirements of block alignment, and eliminating the use of a fixed space allocation for inodes. I have a sophisticated and flexible means for arranging for the aggregation of files for maximal locality of reference, through defining the ordering of items in the tree. The body of large files is stored in unformatted nodes that are attached to the tree but isolated from the effects of possible shifting by the balancing algorithms. Approaches such as [Apple] and [Holton and Das] have stored filenames but not files in balanced trees. None of the file systems C-FFS, NTFS, or XFS aggregates files; all of them block align files, though all of those also do some variation on storing small files in the statically allocated block address fields of inodes if they are small enough to fit there. [C-FFS] has published an excellent discussion of both their approach and why small files rob a conventional file system of performance more in proportion to the number of small files than the number of bytes consumed by small files. However, I must note that their notion of what constitutes small is different from ours by one or two orders of magnitude.
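A minimal sketch of what "defining the ordering of items in the tree" buys: the key layout below (parent directory, object, offset) and the greedy packer are illustrative simplifications, not the actual reiserfs key format or balancer.

```python
# Sorting items by (parent_dir, object, offset) places a directory's
# entries and small-file bodies adjacently, so a greedy packer
# (standing in for the balancer) can put them into shared nodes.
NODE_SIZE = 4096
items = [
    ("dirA", "file2", 0, b"x" * 100),
    ("dirB", "file1", 0, b"y" * 100),
    ("dirA", "file1", 0, b"z" * 100),
]
items.sort(key=lambda it: (it[0], it[1], it[2]))  # semantic order

nodes, current, used = [], [], 0
for it in items:
    size = len(it[3])
    if used + size > NODE_SIZE:   # node full: start a new one
        nodes.append(current)
        current, used = [], 0
    current.append(it)
    used += size
nodes.append(current)

print(len(nodes))  # -> 1: all three small files share one node
```

Because aggregation falls out of the key ordering dynamically, there is no static "which level do I aggregate at?" decision to make or undo.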
Their use of an exokernel is simply an excellent approach for operating systems that have that as an available option. Semantics (files), packing (blocks/nodes), caching (read-ahead sizes, etc.), and the hardware interfaces of disk (sectors) and paging (pages) all have different granularity issues associated with them: a central point of our approach is that the optimal granularity of these often differs, and abstracting these into separate layers in which the granularity of one layer does not unintentionally impact other layers can improve space/time performance. Reiserfs innovates in that its semantic layer often conveys to the other layers an ungranulated ordering rather than one granulated by file boundaries. The reader is encouraged to note the areas in which reiserfs needs to go farther in doing so while reading the algorithms. = Balanced Trees and Large File I/O = There has long been an odd informal consensus that balanced trees are too slow for use in storing large files, perhaps originating in the performance of databases that have attempted to emulate file systems using balanced tree algorithms that were not originally architected for file system access patterns or their looser serialization requirements. It is hopefully easy for the reader to understand that storing many small files and tail ends of files in a single node where they can all be fetched in one I/O leads directly to higher performance. Unfortunately, it is quite complex to understand the interplay between I/O efficiency and block size for larger files, and space does not allow a systematic review of traditional approaches.
The reader is referred to [FFS], [Peacock], [McVoy], [Holton and Das], [Bach], [OLE], and [NTFS] for treatments of the topic, and discussions of various means of 1) reducing the effect of block size on CPU efficiency, 2) eliminating the need for inserting rotational delay between successive blocks, 3) placing small files into either inodes or directories, and 4) performing read-ahead. More commentary on these is in the annotated bibliography. Reiserfs has the following architectural weaknesses that stem directly from the overhead of repacking to save space and increase block size: 1) when the tail (files < 4k are all tail) of a file grows large enough to occupy an entire node by itself it is removed from the formatted node(s) it resides in, and it is converted into an unformatted node ([FFS] pays a similar conversion cost for fragments), 2) a tail that is smaller than one node may be spread across two nodes which requires more I/O to read if locality of reference is poor, 3) aggregating multiple tails into one node introduces separation of file body from tail, which reduces read performance ([FFS] has a similar problem, and for reiserfs files near the node in size the effect can be significant), 4) when you add one byte to a file or tail that is not the last item in a formatted node, then on average half of the whole node is shifted in memory. If any of your applications perform I/O in such a way that they generate many small unbuffered writes, reiserfs will make you pay a higher price for not being able to buffer the I/O. Most applications that create substantial file system load employ effective I/O buffering, often simply as a result of using the I/O functions in the standard C libraries. By avoiding accesses in small blocks/extents reiserfs improves I/O efficiency. 
Extent based file systems such as VxFS, and write-clustering systems such as ext2fs, are not so effective in applying these techniques that they choose to use 512-byte blocks rather than 1k blocks as their defaults. Ext2fs reports a 20% speedup when 4k rather than 1k blocks are used, but the authors of ext2fs advise the use of 1k blocks to avoid wasting space. There are a number of worthwhile large file optimizations that have not been added to either ext2fs or reiserfs, and both file systems are somewhat primitive in this regard, reiserfs being the more primitive of the two. Large files simply were not my research focus, and it being a small research project I did not implement the many well known techniques for enhancing large file I/O. The buffering algorithms are probably more crucial than any other component in large file I/O, and partly out of a desire for a fair comparison of the approaches I have not modified these. I have added no significant optimizations for large files, beyond increasing the block size, that are not found in ext2fs. Except for the size of the blocks, there is not a large inherent difference between: 1) the cost of adding a pointer to an unformatted node to my tree plus writing the node, and 2) adding an address field to an inode plus writing the block. It is likely that except for block size the primary determinants of high performance large file access are orthogonal to the decision of whether to use balanced tree algorithms for small and medium sized files. For large files we get some advantage from not having our tree being more balanced than the tree formed by an inode which points to a triple indirect block. We haven't an easy method for measuring the performance gain from that though. There is performance overhead due to the memory bandwidth cost of balancing nodes for small files. We think it is worth it though. 
= Serialization and Consistency =

The issues of ensuring recoverability with minimal serialization and data displacement necessarily dominate high performance design. Let's define the two extremes in serialization so that the reason for this can be clear. Consider the relative speed of a set of I/Os in which every block request in the set is fed to the elevator algorithms of the kernel and the disk drive firmware fully serially, each request awaiting the completion of the previous request. Now consider the other extreme, in which all block requests are fed to the elevator algorithms all together, so that they may all be sorted and performed in close to their sorted order (disk drive firmware doesn't use a pure elevator algorithm). The unserialized extreme may be more than an order of magnitude faster, due to the cost of rotations and seeks. Unnecessarily serializing I/O prevents the elevator algorithm from doing its job of placing all of the I/Os in their layout sequence rather than chronological sequence. Most of high performance design centers around making I/Os in the order they are laid out on disk, and laying out blocks on disk in the order that the I/Os will want to be issued. [Snyder] discusses a file system that obtains high performance from a complete lack of disk synchronization, but is only suitable for temporary files that don't need to survive reboot. I think its known value to Solaris users indicates that the optimal buffering policy varies from file to file. [Ganger] discusses methods for using ordering of writes rather than serialization for ensuring conventional file system meta-data integrity; [McVoy] previously suggested but did not implement ordering of buffer writes. Ext2fs is fast in substantial part due to avoiding synchronous writes of metadata, and I have much personal experience with it that leads me to prefer compiles that are fast.
[ I would like to see it adopt a policy that all dirty buffers for files not flagged as temporary are queued for writing, and that the existence of a dirty buffer means that the disk is busy. This will require replacing buffer I/O locking with copy-on-write, but an idle disk is such a terrible thing to waste. :-) ] [NTFS] by default adds unnecessary serialization to an extent that even older file systems such as [FFS] do not, and its performance characteristics reflect that. In fairness, it should be said that it is the superior approach for most removable media without software control of ejection (e.g. IBM PC floppies). Reiserfs employs a new scheme called preserve lists for ensuring recoverability, which avoids overwriting old meta-data by writing the new meta-data nearby rather than over the old meta-data.

= Why Aggregate Small Objects at the File System Level? =

There has long been a tradition of file system developers deciding that effective handling of small files is not significant to performance, and of application programmers caring enough about performance to not store small files as separate entities in the file system. To store small objects one may either make the file system efficient for the task, or sidestep the problem by aggregating small objects in a layer above the file system. Sidestepping the problem has three disadvantages: utility, code complexity, and performance. Utility and Code Complexity: Allowing OS designers to effectively use a single namespace with a single interface for both large and small objects decreases coding cost and increases the expressive power of components throughout the OS. I feel reiserfs shows the effects of a larger development investment focused on a simpler interface when compared with many solutions for this currently available in the object oriented toolkit community, such as the Structured Storage available in Microsoft's [OLE].
By simpler I mean I added nothing to the file system API to distinguish large and small objects, and I leave it to the directory semantics and archiving programs to aggregate objects. Multiple layers cost more to implement, cost more to code the interfaces for utilizing, and provide less flexibility. Performance: It is most commonly the case that when one layers one file system on top of another the performance is substantially reduced, and Structured Storage is not an exception to this general rule. Reiserfs, which does not attempt to delegate the small object problem to a layer above, avoids this performance loss. I have heard it suggested by some that this layering avoids the performance loss from syncing on file close as many file systems do. I suggest that this is adding an error to an error rather than fixing it. Let me make clear that I believe those who write such layers above the file system do not do so out of stupidity. I know of at least one company at which a solution that layers small object storage above the file system exists because the file system developers refused to listen to the non-file system group's description of its needs, and the file system group had to be sidestepped in generating the solution. Current file systems are fairly well designed for the purposes that their users currently use them for: my goal is to change file size usage patterns. The author remembers arguments that once showed clearly that there was no substantial market need for disk drives larger than 10MB based on current usage statistics. While [C-FFS] points out that 80% of file accesses are to files below 10k, I do not believe it reasonable to attempt to provide statistics based on usage measurements of file systems for which small files are inappropriate to use that will show that small files are frequently used. Application programmers are smarter than that. 
Currently 80% of file accesses are to the first order of magnitude in file size for which it is currently sensible to store the object in the file system. I regret that one can only speculate as to whether, once file systems become effective for small files and database tasks, usage patterns will change to 80% of file accesses being to files less than 100 bytes. What I can do is show via the 80/20 Banded File Set Benchmark presented later that in such circumstances small file performance potentially dominates total system performance. In summary, the on-going reinvention of incompatible object aggregation techniques above the file system layer is expensive, less expressive, less integrated, slower, and less efficient in its storage than incorporating balanced tree algorithms into the file system.

= Tree Definitions =

Balanced trees are used in databases, and more generally, wherever a programmer needs to search and store to non-random memory by a key, and has the time to code it this way. The usual evolution for programmers is to first think that hashing will be simpler and more efficient, and then realize only after getting into the sordid details of it that the combination of space efficiency, minimizing disk accesses, and the feasibility of caching the top part of the tree, makes the tree approach more effective. It is the usual thing to first try to do hashing, and then by the time the details are worked out, to have a balanced tree. The cost of effectively handling bucket overflow just isn't less than the cost of balancing, unless the buckets are always all in RAM. Hashing is often a good solution when there is no non-random memory involved, such as when hashing a cache. The Linux dcache code uses hashing for accessing a cache of in-memory directory entries. Sometimes one uses partial or full hashing of keys within a balanced tree.
If you do full hashing within a tree, and you cache the top part of that tree, you have something rather similar to extensible hashing, except it is more flexible and efficient. Sometimes programmers code using unbalanced trees. Most filesystems do essentially that. Balanced trees generally do a better job of minimizing the average number of disk accesses. There is literature establishing that balanced trees are optimal for the worst case when there is no caching of the tree. This is rather pointless literature, as the average case when cached is what is important, and I am afraid that the existing literature proves that which is feasible to prove rather than that which is relevant. That said, practitioners know from experience that making the tree less balanced leads to more I/Os. Discussions of the exceptions to this are rather interesting but not for here.... I regret that I must assume that the reader is familiar with basic balanced tree algorithms [Wood], [Lewis and Denenberg], [Knuth], [McCreight]. No attempt will be made to survey tree design here since balanced trees are one of the most researched and complex topics in algorithm theory, and require treatment at length. I must compound this discourtesy with a concise set of definitions that sorely lack accompanying diagrams, my apologies. Finally, I'll truly annoy the reader by saying that the header files contain nice ascii art, and if you want full definition of the structures, the source is the place. Classically, balanced trees are designed with the set of keys assumed to be defined by the application, and the purpose of the tree design is to optimize searching through those keys. In my approach the purpose of the tree is to optimize the reference locality and space efficient packing of objects, and the keys are defined as best optimizes the algorithm for that. 
Keys are used in place of inode numbers in the file system, thereby choosing to substitute a mapping of keys to node location (the internal nodes) for a mapping of inode number to file location. Keys are longer than inode numbers, but one needs to cache fewer of them than one would need to cache inode numbers when more than one file is stored in a node. In my tree, I still require that a filename be resolved one component at a time. It is an interesting topic for future research whether this is necessary or optimal. This is more complex of an issue than a casual reader might realize: directory at a time lookup accomplishes a form of compression, makes mounting other name spaces and file system extensions simpler, makes security simpler, and makes future enhanced semantics simpler. Since small files typically lead to large directories, it is fortuitous that as a natural consequence of our use of tree algorithms, our directory mechanisms are much more effective for very large directories than most other file systems are (notable exceptions include [Holton and Das]). The tree has three node types: internal nodes, formatted nodes, and unformatted nodes. The contents of internal and formatted nodes are sorted in the order of their keys. (Unformatted nodes contain no keys.) Internal nodes consist of pointers to sub-trees separated by their delimiting keys. The key that precedes a pointer to a sub-tree is a duplicate of the first key in the first formatted node of that sub-tree. Internal nodes exist solely to allow determining which formatted node contains the item corresponding to a key. ReiserFS starts at the root node, examines its contents, and based on it can determine which subtree contains the item corresponding to the desired key. From the root node reiserfs descends into the tree, branching at each node, until it reaches the formatted node containing the desired item. 
The first (bottom) level of the tree consists of unformatted nodes, the second level consists of formatted nodes, and all levels above consist of internal nodes. The highest level contains the root node. The number of levels is increased as needed by adding a new root node at the top of the tree. All paths from the root of the tree to all formatted leaves are equal in length, and all paths to all unformatted leaves are also equal in length and 1 node longer than the paths to the formatted leaves. This equality in path length, and the high fanout it provides is vital to high performance, and in the Drops section I will describe how the lengthening of the path length that occurred as a result of introducing the [BLOB] approach (the use of indirect items and unformatted nodes) proved a measurable mistake. Formatted nodes consist of items. Items have four types: direct items, indirect items, directory items, and stat data items. All items contain a key which is unique to the item. This key is used to sort, and find, the item. Direct items contain the tails of files, and tails are the last part of the file (the last file_size modulo FS block size of a file). Indirect items consist of pointers to unformatted nodes. All but the tail of the file is contained in its unformatted nodes. Directory items contain the key of the first directory entry in the item followed by a number of directory entries. Depending on the configuration of reiserfs, stat data may be stored as a separate item, or it may be embedded in a directory entry. We are still benchmarking to determine which way is best. A file consists of a set of indirect items followed by a set of up to two direct items, with the existence of two direct items representing the case when a tail is split across two nodes. 
If a tail is larger than the maximum size of a file that can fit into a formatted node but is smaller than the unformatted node size (4k), then it is stored in an unformatted node, and a pointer to it plus a count of the space used is stored in an indirect item. Directories consist of a set of directory items. Directory items consist of a set of directory entries. Directory entries contain the filename and the key of the file which is named. There is never more than one item of the same item type from the same object stored in a single node (there is no reason one would want to use two separate items rather than combining them). The first item of a file or directory contains its stat data. When performing balancing, and analyzing the packing of the node and its two neighbors, we ensure that the three nodes cannot be compressed into two nodes. I feel greater compression than this is best left to an FS cleaner to perform rather than attempting it dynamically.

== ReiserFS Structures ==

The ReiserFS tree has Max_Height = N (the current default value is N = 5). The tree lies in the disk blocks; each disk block that belongs to the reiserfs tree has a block head. The disk block layouts are:

Internal node (the place for keys and pointers to disk blocks):
 Block_Head | Key 0 | Key 1 | Key 2 | --- | Key N | Pointer 0 | Pointer 1 | Pointer 2 | --- | Pointer N | Pointer N+1 | ..Free Space..

Leaf node (the place for items and item headers):
 Block_Head | IHead 0 | IHead 1 | IHead 2 | --- | IHead N | ..Free Space.. | Item N | --- | Item 2 | Item 1 | Item 0

Unformatted node (the place for the data of a big file): raw data only, with no headers.

ReiserFS objects are files and directories. Max number of objects = 2^32 - 4 = 4 294 967 292. Each object is a set of items.

File items:
# StatData item + [Direct item] (for small files: size from 0 bytes to MAX_DIRECT_ITEM_LEN = blocksize - 112 bytes)
# StatData item + InDirect item + [Direct item] (for big files: size > MAX_DIRECT_ITEM_LEN bytes)

Directory items:
# StatData item + Directory item

Every reiserfs object has an Object ID and a Key.

=== Internal Node Structures ===

struct block_head:
{|
! Field Name !! Type !! Size (in bytes) !! Description
|-
| blk_level || unsigned short || 2 || Level of the block in the tree (1 = leaf; 2, 3, 4, ... = internal)
|-
| blk_nr_item || unsigned short || 2 || Number of keys in an internal block, or number of items in a leaf block
|-
| blk_free_space || unsigned short || 2 || Free space in the block, in bytes
|-
| blk_right_delim_key || struct key || 16 || Right delimiting key for this block (leaf nodes only)
|}
Total: 6 bytes (padded to 8) for internal nodes; 22 bytes (padded to 24) for leaf nodes.

struct key:
{|
! Field Name !! Type !! Size (in bytes) !! Description
|-
| k_dir_id || __u32 || 4 || ID of the parent directory
|-
| k_object_id || __u32 || 4 || ID of the object (also the inode number)
|-
| k_offset || __u32 || 4 || Offset from the beginning of the object to the current byte of the object
|-
| k_uniqueness || __u32 || 4 || Type of the item (StatData = 0, Direct = -1, InDirect = -2, Directory = 500)
|}
Total: 16 bytes.

struct disk_child (pointer to a disk block):
{|
! Field Name !! Type !! Size (in bytes) !! Description
|-
| dc_block_number || unsigned long || 4 || Disk child's block number
|-
| dc_size || unsigned short || 2 || Disk child's used space
|}
Total: 6 bytes (padded to 8).

=== Leaf Node Structures ===

A leaf node begins with the same struct block_head described above; for leaf nodes the total is 22 bytes (padded to 24), since leaf nodes include the right delimiting key. Everything in the filesystem is stored as a set of items. Each item has its item_head. The item_head contains the key of the item, its free space (for indirect items), and the location of the item itself within the block.

struct item_head (IHead):
{|
! Field Name !! Type !! Size (in bytes) !! Description
|-
| ih_key || struct key || 16 || Key used to search for the item; all item headers are sorted by this key
|-
| u.ih_free_space / u.ih_entry_count || __u16 || 2 || Free space in the last unformatted node for an InDirect item; 0xFFFF for a Direct item; 0xFFFF for a StatData item; the number of directory entries for a Directory item
|-
| ih_item_len || __u16 || 2 || Total size of the item body
|-
| ih_item_location || __u16 || 2 || Offset of the item body within the block
|-
| ih_reserved || __u16 || 2 || Used by reiserfsck
|}
Total: 24 bytes.

There are 4 types of items: stat data items, directory items, indirect items, and direct items.

struct stat_data (the reiserfs version of a UFS disk inode, minus the address blocks):
{|
! Field Name !! Type !! Size (in bytes) !! Description
|-
| sd_mode || __u16 || 2 || File type and permissions
|-
| sd_nlink || __u16 || 2 || Number of hard links
|-
| sd_uid || __u16 || 2 || Owner id
|-
| sd_gid || __u16 || 2 || Group id
|-
| sd_size || __u32 || 4 || File size
|-
| sd_atime || __u32 || 4 || Time of last access
|-
| sd_mtime || __u32 || 4 || Time the file was last modified
|-
| sd_ctime || __u32 || 4 || Time the inode (stat data) was last changed (except changes to sd_atime and sd_mtime)
|-
| sd_rdev || __u32 || 4 || Device
|-
| sd_first_direct_byte || __u32 || 4 || Offset from the beginning of the file to the first byte of the file's direct item: -1 for a directory; 1 for small files (direct items only); >1 for big files with indirect and direct items; -1 for big files with indirect items but no direct item
|}
Total: 32 bytes.

Item body layouts:

Directory item:
 deHead 0 | deHead 1 | deHead 2 | --- | deHead N | fileName N | --- | fileName 2 | fileName 1 | fileName 0

Direct item:
 ..........Small File Body..........

InDirect item:
 unfPointer 0 | unfPointer 1 | unfPointer 2 | --- | unfPointer N

unfPointer is a pointer to an unformatted block (unfPointer size = 4 bytes). Unformatted blocks contain the body of a big file.

struct reiserfs_de_head (deHead):
{|
! Field Name !! Type !! Size (in bytes) !! Description
|-
| deh_offset || __u32 || 4 || Third component of the directory entry key (all reiserfs_de_head are sorted by this value)
|-
| deh_dir_id || __u32 || 4 || Objectid of the parent directory of the object referenced by the directory entry
|-
| deh_objectid || __u32 || 4 || Objectid of the object referenced by the directory entry
|-
| deh_location || __u16 || 2 || Offset of the name within the whole item
|-
| deh_state || __u16 || 2 || Flags: 1) entry contains stat data (for the future); 2) entry is hidden (unlinked)
|}
Total: 16 bytes.

fileName is the name of the file, an array of bytes of variable length. Max length of a file name = blocksize - 64 (for a 4kb blocksize, max name length = 4032 bytes).

= Using the Tree to Optimize Layout of Files =

There are four levels at which layout optimization is performed: 1) the mapping of logical block numbers to physical locations on disk, 2) the assigning of nodes to logical block numbers, 3) the ordering of objects within the tree, and 4) the balancing of the objects across the nodes they are packed into.

== Physical Layout ==

For SCSI drives this mapping is performed by the disk drive manufacturer; for IDE drives the logical block number to physical location mapping is done by the device driver; and for all drives it is also potentially done by volume management software.
The logical block number to physical location mapping by the drive manufacturer is usually done using cylinders. I agree with the authors of [ext2fs] and most others that the significant file placement feature of FFS was not the actual cylinder boundaries, but placing files and their inodes on the basis of their parent directory location. FFS used explicit knowledge of actual cylinder boundaries in its design. I find that minimizing the distance in logical blocks of semantically adjacent nodes without tracking cylinder boundaries accomplishes an excellent approximation of optimizing according to actual cylinder boundaries, and I find its simplicity an aid to implementation elegance.

== Node Layout ==

When I place nodes of the tree on the disk, I search for the first empty block in the bitmap (of used block numbers), starting at the location of the left neighbor of the node in the tree ordering and moving in the direction I last moved in. This was experimentally found to be better than the following alternatives for the benchmarks employed: 1) taking the first non-zero entry in the bitmap, 2) taking the entry after the last one that was assigned in the direction last moved in (this was 3% faster for writes and 10-20% slower for subsequent reads), 3) starting at the left neighbor and moving in the direction of the right neighbor. When changing block numbers for the purpose of avoiding overwriting sending nodes before shifted items reach disk in their new recipient node (see the description of preserve lists later in this paper), the benchmarks employed were ~10% faster when starting the search from the left neighbor rather than from the node's current block number, even though it adds significant overhead to determine the left neighbor (the current implementation risks I/O to read the parent of the left neighbor). It used to be that we would reverse direction when we reached the end of the disk drive.
Fortunately we checked to see if it makes a difference which direction one moves in when allocating blocks to a file, and indeed we found it made a significant difference to always allocate in the increasing block number direction. We hypothesize that this is due to matching disk spin direction by allocating using increasing block numbers.

== Ordering within the Tree ==

While I give here an example of how I have defined keys to optimize locality of reference and packing efficiency, I would like to stress that key definition is a powerful and flexible tool that I am far from finished experimenting with. Some key definition decisions depend very much on usage patterns, and this means that someday one will select from several key definitions when creating the file system. For example, consider the decision of whether to pack all directory entries together at the front of the file system, or to pack the entries near the files they name. For large file usage patterns one should pack all directory items together, since systems with such usage patterns are effective in caching the entries for all directories. For small files the name should be near the file. Similarly, for large files the stat data should be stored separately from the body, either with the other stat data from the same directory, or with the directory entry. (It was likely a mistake for me to not assign stat data its own key in the current implementation, as packing it in with direct and indirect items complicates our code for handling those items, and prevents me from easily experimenting with the effects of changing its key assignment.) It is not necessary for a file's packing to reflect its name; that is merely my default. With each file my next release will offer the option of overriding the default by use of a system call.
It is feasible to pack an object completely independently of its semantics using these algorithms, and I predict that there will be many applications, perhaps even most, for which a packing different from that determined by object names is more appropriate. Currently the mandatory tying of packing locality and semantics results in the distortion of both semantics and packing from what might otherwise be their independent optimums, much as tying block boundaries to file boundaries distorts I/O and space allocation algorithms from their separate optimums. For example, placing most files accessed while booting in their access order at the start of the disk is a very tempting future optimization that the use of packing localities makes feasible to consider.
The Structure of a Key: Each file item has a key with the structure <locality_id, object_id, offset, uniqueness>. The locality_id is by default the object_id of the parent directory. The object_id is the unique id of the file, and this is set to the first unused objectid when the object is created. The tendency of this to result in successive object creations in a directory being adjacently packed is often fortuitous for many usage patterns. For files the offset is the offset within the logical object of the first byte of the item. In version 0.2 all directory entries had their own individual keys stored with them and were each distinct items; in the current version I store one key in the item, which is the key of the first entry, and compute each entry's key as needed from the one key stored in the item. For directories the offset key component is the first four bytes of the filename, which you may think of as the lexicographic rather than numeric offset. For directory items the uniqueness field differentiates filename entries identical in the first 4 bytes.
For all item types it indicates the item type and for the leftmost item in a buffer it indicates whether the preceding item in the tree is of the same type and object as this item. Placing this information in the key is useful when analyzing balancing conditions, but increases key length for non-directory items, and is a questionable architectural feature. Every file has a unique objectid, but this cannot be used for finding the object, only keys are used for that. Objectids merely ensure that keys are unique. If you never use the reiserfs features that change an object's key then it is immutable, otherwise it is mutable. (This feature aids support for NFS daemons, etc.) We spent quite some time debating internally whether the use of mutable keys for identifying an object had deleterious long term architectural consequences: in the end I decided it was acceptable iff we require any object recording a key to possess a method for updating its copy of it. This is the architectural price of avoiding caching a map of objectid to location that might have very poor locality of reference due to objectids not changing with object semantics. I pack an object with the packing locality of the directory it was first created in unless the key is explicitly changed. It remains packed there even if it is unlinked from the directory. I do not move it from the locality it was created in without an explicit request, unlike the [C-FFS] approach which stores all multiple link files together and pays the cost of moving them from their original locations when the second link occurs. I think a file linked with multiple directories might as well get at least the locality reference benefits of one of those directories. In summary, this approach 1) places files from the same directory together, 2) places directory entries from the same directory together with each other and with the stat data for the directory. 
Note that there is no interleaving of objects from different directories in the ordering at all, and that all directory entries from the same directory are contiguous. You'll note that this does not accomplish packing the files of small directories with common parents together, and does not employ the full partial ordering in determining the linear ordering; it merely uses parent directory information. I feel the proper place for employing full tree structure knowledge is in the implementation of an FS cleaner, not in the dynamic algorithms.

== Node Balancing Optimizations ==

When balancing nodes I do so according to the following ordered priorities:
# minimize the number of nodes used
# minimize the number of nodes affected by the balancing operation
# minimize the number of uncached nodes affected by the balancing operation
# if shifting to another formatted node is necessary, maximize the bytes shifted
Priority 4 is based on the assumption that the location of an insertion of bytes into the tree is an indication of the likely future location of an insertion, and that this policy will on average reduce the number of formatted nodes affected by future balance operations. There are more subtle effects as well: if one randomly places nodes next to each other, and one has a choice between those nodes being mostly moderately efficiently packed or packed to an extreme of either well or poorly packed, one is more likely to be able to combine more of the nodes if one chooses the policy of extremism. Extremism is a virtue in space efficient node packing. The maximal shift policy is not applied to internal nodes, as extremism is not a virtue in time efficient internal node balancing.
=== Drops === (The difficult design issues in the current version that our next version can do better) Consider dividing a file or directory into drops, with each drop having a separate key, and no two drops from one file or directory occupying the same node without being compressed into one drop. The key for each drop is set to the key for the object (file or directory) plus the offset of the drop within the object. For directories the offset is lexicographic and by filename, for files it is numeric and in bytes. In the course of several file system versions we have experimented with and implemented solid, liquid, and air drops. Solid drops were never shifted, and drops would only solidify when they occupied the entirety of a formatted node. Liquid drops are shifted in such a way that any liquid drop which spans a node fully occupies the space in its node. Like a physical liquid it is shiftable but not compressible. Air drops merely meet the balancing condition of the tree. Reiserfs 0.2 implemented solid drops for all but the tail of files. If a file was at least one node in size it would align the start of the file with the start of a node, block aligning the file. This block alignment of the start of multi-drop files was a design error that wasted space: even if the locality of reference is so poor as to make one not want to read parts of semantically adjacent files, if the nodes are near to each other then the cost of reading an extra block is thoroughly dwarfed by the cost of the seek and rotation to reach the first node of the file. As a result the block alignment saves little in time, though it costs significant space for 4-20k files. Reiserfs with block alignment of multi-drop files and no indirect items experienced the following rather interesting behavior that was partially responsible for making it only 88% space efficient for files that averaged 13k (the linux kernel) in size. 
When the tail of a larger than 4k file was followed in the tree ordering by another file larger than 4k, since the drop before was solid and aligned, and the drop afterwards was solid and aligned, no matter what size the tail was, it occupied an entire node. In the current version we place all but the tail of large files into a level of the tree reserved for full unformatted nodes, and create indirect items in the formatted nodes which point to the unformatted nodes. This is known in the database literature as the [BLOB] approach. This extra level added to the tree comes at the cost of making the tree less balanced (I consider the unformatted nodes pointed to as part of the tree) and increasing the maximal depth of the tree by 1. For medium sized files, the use of indirect items increases the cost of caching pointers by mixing data with them. The reduction in fanout often causes the read algorithms to fetch only one node at a time of the file being read more frequently, as one waits to read the uncached indirect item before reading the node with the file data. There are more parents per file read with the use of indirect items than with internal nodes, as a direct result of reduced fanout due to mixing tails and indirect items in the node. The most serious flaw is that these reads of various nodes necessary to the reading of the file have additional rotations and seeks compared to the case with drops. With my initial drop approach they are usually sequential in their disk layout, even the tail, and the internal node parent points to all of them in such a way that all of them that are contained by that parent or another internal node in cache can be requested at once in one sequential read. Non-sequential reads of nodes are more than an order of magnitude more costly than sequential reads, and this single consideration dominates effective read optimization. 
Unformatted nodes make file system recovery faster and less robust, in that one reads their indirect item rather than them to insert them into the recovered tree, and one cannot read them to confirm that their contents are from the file that an indirect item says they are from. In this they make reiserfs similar to an inode-based system without logging. A moderately better solution would have been to have simply eliminated the requirement for placement of the start of multi-node files at the start of nodes, rather than introducing BLOBs, and to have depended on the use of a file system cleaner to optimally pack the 80% of files that don't move frequently using algorithms that move even solid drops. Yet that still leaves the problem of formatted nodes not being efficient for mmap() purposes (one must copy them before writing rather than merely modifying their page table entries, and memory bandwidth is expensive even if CPU is cheap.) For this reason I have the following plan for the next version. I will have three trees: one tree maps keys to unformatted nodes, one tree maps keys to formatted nodes, and one tree maps keys to directory entries and stat_data. Now it is only natural if you are thinking that that would mean that to read a file and access first the directory entry and stat_data, then the unformatted node, then the tail, one must hop long distances across the disk, going first to one tree and then the other. This is indeed why it took me two years to realize it could be made to work. My plan is to interleave the nodes of the three trees according to the following algorithm: Block numbers are assigned to nodes when the nodes are created, or preserved, and someday will be assigned when the cleaner runs. The choice of block number is based on first determining what other node it should be placed near, and then finding the nearest free block that can be found in the elevator's current direction.
Currently we use the left neighbor of the node in the tree as the node it should be placed near. This is nice and simple. Oh well. Time to create a virtual neighbor layer. The new scheme will continue to first determine the node it should be placed near, and then start the search for an empty block from that spot, but it will use a more complicated determination of what node to place it near. This method will cause all nodes from the same packing locality to be near each other, will cause all directory entries and stat_data to be grouped together within that packing locality, and will interleave formatted and unformatted nodes from the same packing locality. Pseudo-code is best for describing this:

 /* for use by reiserfs_get_new_blocknrs when determining where in the
    bitmap to start the search for a free block, and for use by the
    read-ahead algorithm when there are not enough nodes to the right
    and in the same packing locality for packing locality read-ahead
    purposes */
 get_logical_layout_left_neighbors_blocknr(key of current node)
 {
    /* Based on examination of current node key and type, find the
       virtual neighbor of that node.
    */
    if body node
        if first body node of file
            if (node in tail tree whose key is less but is in same packing locality exists)
                return blocknr of such node with largest key
            else
                find node with largest key less than key of current node in stat_data tree
                return its blocknr
        else
            return blocknr of node in body tree with largest key less than key of current node
    else if tail node
        if (node in body tree belonging to same file as first tail of current node exists)
            return its blocknr
        else if (node in tail tree with lesser delimiting key but same packing locality exists)
            return blocknr of such node with largest delimiting key
        else
            return blocknr of node with largest key less than key of current node in stat_data tree
    else /* is stat_data tree node */
        if stat_data node with lesser key from same packing locality exists
            return blocknr of such node with largest key
        else
            /* no node from same packing locality with lesser key exists */
 }

 /* for use by packing locality read-ahead */
 get_logical_layout_right_neighbors_blocknr(key of current node)
 {
    right-handed version of get_logical_layout_left_neighbors_blocknr logic
 }

It is my hope that this will improve caching of pointers to unformatted nodes, plus improving caching of directory entries and stat_data, by separating them from file bodies to a greater extent. I also hope that it will improve read performance for 1-10k files, and that it will allow us to do this without decreasing space efficiency.

=== Code Complexity ===

I thought it appropriate to mention some of the notable effects of simple design decisions on our implementation's code length. When we changed our balancing algorithms to shift parts of items rather than only whole items, so as to pack nodes tighter, this had an impact on code complexity. Another multiplicative determinant of balancing code complexity was the number of item types: introducing indirect items doubled it, and changing directory items from liquid drops to air drops increased it further.
Storing stat data in the first direct or indirect item of the file complicated the code for processing those items more than if I had made stat data its own item type. When one finds oneself with an NxN coding complexity issue, it usually indicates the need for adding a layer of abstraction. The NxN effect of the number of item types on balancing code complexity is an instance of that design principle, and we will address it in the next major rewrite. The balancing code will employ a set of item operations which all item types must support. The balancing code will then invoke those operations without caring to understand any more of the meaning of an item's type than that it determines which item-specific item operation handler is called. Adding a new item type, say a compressed item, will then merely require writing a set of item operations for that item rather than requiring modifying most parts of the balancing code as it does now. We now feel that the function to determine what resources are needed to perform a balancing operation, fix_nodes(), might as well be written to decide what operations will be performed during balancing since it pretty much has to do so anyway. That way, the function that performs the balancing with the nodes locked, do_balance(), can be gutted of most of its complexity.

= Buffering & the Preserve List =

We implemented for version 0.2 of our file system a system of write ordering that tracked all shifting of items in the tree, and ensured that no node that had had an item shifted from it was written before the node that had received the item was written. This is necessary to prevent a system crash from causing the loss of an item that might not have been recently created. This tracking approach worked, and the overhead it imposed was not measurable in our benchmarks. When in the next version we changed to partially shifting items and increased the number of item types, this code grew out of control in its complexity.
I decided to replace it with a scheme far simpler to code that was also more effective in typical usage patterns. This scheme was as follows: If an item is shifted from a node, change the block that its buffer will be written to. Change it to the nearest free block to the old block's left neighbor, and rather than freeing the old block, place its number on a "preserve list". (Saying nearest is slightly simplistic, in that the blocknr assignment function moves from the left neighbor in the direction of increasing block numbers.) When a "moment of consistency" is achieved, free all of the blocks on the preserve list. A moment of consistency occurs when there are no nodes in memory into which objects have been shifted (this could be made more precise but then it would be more complex). If disk space runs out, force a moment of consistency to occur. This is sufficient to ensure that the file system is recoverable. Note that during the large file benchmarks the preserve list was freed several times in the middle of the benchmark. The percentage of buffers preserved is small in practice except during deletes, and one can arrange for moments of consistency to occur as frequently as one wants to. Note that I make no claim that this approach is better than the Soft Updates approach employed by [Granger] or by us in version 0.2; I merely note that tracking order of writes is more complex than this approach for balanced trees which partially shift items. We may go back to the old approach some day, though not to the code that I threw out. Preserve lists substantially hamper performance for files in the 1-10k size range. We are re-evaluating them. Ext2fs avoids the metadata shifting problem by never shrinking directories, and using fixed inode space allocations.
= Lessons From Log Structured File Systems =

Many techniques from other file systems haven't been applied primarily so as to satisfy my goal of giving reiserfs 1.0 only the minimum feature set necessary to be useful, and will appear in later releases. Log Structured File Systems [Rosenblum and Ousterhout] embody several such techniques, which I will describe after I mention two concerns with that approach:

* With small object file systems it is not feasible to cache in RAM a map of objectid to location for every object since there are too many objects. This is an inherent problem in using temporal packing rather than semantic packing for small object file systems. With my approach my internal nodes are the equivalent of this objectid to location map, but internal node total size is proportional to the number of nodes rather than the number of objects. You can think of internal nodes as a compression of object location information made effective by the existence of an ordering function, and this compression is both essential for small files, and a major feature of my approach.
* I like obtaining good though not ideal semantic locality without paying a cleaning cost for active data. This is a less critical concern.

I frequently find myself classifying packing and layout optimizations as either appropriate for implementing dynamically or appropriate only for a cleaner. Optimizations whose computational overhead is large compared to their benefit tend to be appropriate for implementation in a cleaner, and a cleaner's benefits mostly impact the static portion of the file system (which typically consumes ~80% of the space.) Such objectives as 100% packing efficiency, exactly ordering block layout by semantic order, using the full semantic tree rather than parent directory in determining semantic order, and compression are all best implemented by cleaner approaches.
In summary, there is much to be learned from the LFS approach, and as I move past my initial objective of supplying a minimal-feature, higher-performance FS I will apply some of those lessons. In the Preserve Lists section I speculate on the possibilities for a fastboot implementation that would merge the better features of preserve lists and logging.

= Directions For the Future =

To go one more order of magnitude smaller in file size will require adding functionality to the file system API, though it will not require discarding upward compatibility. The use of an exokernel is a better approach to small files if it is an option available to the OS designer; it is not currently an option for Linux users. In the future reiserfs will add such features as lightweight files in which stat_data other than size is inherited from a parent if it is not created individually for the file, an API for reading and writing to files without requiring the overhead of file handles and open(), set-theoretic semantics, and many other features that you would expect from researchers who expect to be able to do all that they could do in a database, in the file system, and never really did understand why not.

= Conclusion =

Balanced tree file systems are inherently more space efficient than block allocation based file systems, with the differences reaching order of magnitude levels for small files. While other aspects of design will typically have a greater impact on performance for large files, in direct proportion to the smallness of the file the use of balanced trees offers performance advantages. A moderate advantage was found for large files. Coding cost is mostly in the interfaces, and it is a measure of the OS designer's skill whether those costs are low in the OS. We make it possible for an OS designer to use the same interface for large and small objects, and thereby reduce interface coding cost.
This approach is a new tool available to the OS designer for increasing the expressive power of all of the components in the OS through better name space integration. Researchers interested in collaborating or just using my work will find me friendly. I tailor the framework of my collaborations to the needs of those we work with. I GPL reiserfs so as to meet the needs of academic collaborators. While that makes it unusable without a special license for commercial OSes, commercial vendors will find me friendly in setting up a commercial framework for commercial collaboration with commercial needs provided for.

= Acknowledgments =

Hans Reiser was the project initiator, primary architect, supplier of funding, and one of the programmers. Some folks at times remark that naming the filesystem Reiserfs was egotistic. It was so named after a potential investor hired all of my employees away from me, then tried to negotiate better terms for his possible investment, and suggested that he could arrange for 100 researchers to swear in Russian court that I had had nothing to do with this project. That business partnership did not work out. Vladimir Saveljev, while he did not author this paper, worked long hours writing the largest fraction of the lines of code in the file system, and is remarkably gifted at just making things work. Thanks, Vladimir. Anatoly Pinchuk wrote much of the core balancing code, and too much of the rest to list here. Thanks, Anatoly. It is the policy of the Naming System Venture that if someone quits before project completion, and then takes strong steps to try to prevent others from finishing the project, that they shall not be mentioned in the acknowledgments. This was all quite sad, and best forgotten. I would like to thank Alfred Ajlamazyan for his generosity in providing overhead at a time when his institute had little it could easily spare.
Grigory Zaigralin is thanked for his work in making the machines run, administering the money, and being his usual determined to be useful self. Igor Chudov, thanks for such effective procurement and hardware maintenance work. Eirik Fuller is thanked for his help with NFS and porting to 2.1. I would like to thank Remi Card for the superb block allocation based file system (ext2fs) that I depended on for so many years, and that allowed me to benchmark against the best. Linus Torvalds, thank you for Linux.

= Business Model and Licensing =

I personally favor performing a balance of commercial and public works in my life. I have no axe to grind against software that is charged for, and no regrets at making reiserfs freely available to Linux users. This project is GPL'd, but I sell exceptions to the GPL to commercial OS vendors and file server vendors. It is not usable to them without such exceptions, and many of them are wise enough to understand that:

* the porting and integration service we are able to provide with the licensing is by itself worth what we charge,
* these services impact their time to market,
* and the relationship spreads the development costs across more OS vendors than just them alone.

I expect that Linux will prove to be quite effective in market sampling my intended market, but if you suspect that I also like seeing more people use it even if it is free to them, oh well. I believe it is not so much the cost that has made Linux so successful as it is the openness. Linux is a decentralized economy with honor and recognition as the currency of payment (and thus there is much honor in it). Commercial OS vendors are, at the moment, all closed economies, and doomed to fall in their competition with open economies just as communism eventually fell.
At some point an OS vendor will realize that if it:

* opens up its source code to decentralized modification,
* systematically rewards those who perform the modifications that are proven useful,
* systematically merges/integrates those modifications into its branded primary release branch while adding value as the integrator,

then it will acquire both the critical mass of the internet development community, and the aggressive edge that no large communal group (such as a corporation) can have. Rather than saying to any such vendor that they should do this now, let me simply point out that whoever is first will have an enormous advantage... Since I have more recognition than money to pass around as reward, my policy is to tend to require that those who contribute substantial software to this project have their names attached to a user-visible portion of the project. This official policy helps me deal with folks like Vladimir, who was much too modest to ever name the file system checker vsck without my insisting. Smaller contributions are to be noted in the source code, and the acknowledgments section of this paper. If you choose to contribute to this file system, and your work is accepted into the primary release, you should let me know if you want me to look for opportunities to integrate you into contracts from commercial vendors. Through packaging ourselves as a group, we are more marketable to such OS vendors. Many of us have spent too much time working at day jobs unrelated to our Linux work. This is too hard, and I hope to make things easier for us all. If you like this business model of selling GPL'd component software with related support services, but you write software not related to this file system, I encourage you to form a component supplier company also. Opportunities may arise for us to cooperate in our marketing, and I will be happy to do so.

= References =

* G.M. Adel'son-Vel'skii and E.M.
Landis, [http://en.scientificcommons.org/19884302 An algorithm for the organization of information], Soviet Math. Doklady 3, 1259-1262, 1962. This paper on AVL trees can be thought of as the founding paper of the field of storing data in trees. Those not conversant in Russian will want to read the [Lewis and Denenberg] treatment of AVL trees in its place. [Wood] contains a modern treatment of trees.
* [Apple] Apple Computer Inc, [http://books.google.com/books?as_isbn=0201177323 Inside Macintosh, Files], Addison-Wesley, 1992. Employs balanced trees for filenames; it was an interesting file system architecture for its time in a number of ways. Now its problems with internal fragmentation have become more severe as disk drives have grown larger, and the code has not received sufficient further development.
* [Bach] Maurice J. Bach, [http://portal.acm.org/citation.cfm?id=8570 The Design of the Unix Operating System], 1986, Prentice-Hall Software Series, Englewood Cliffs, NJ. Superbly written but sadly dated; contains detailed descriptions of the file system routines and interfaces in a manner especially useful for those trying to implement a Unix compatible file system. See [Vahalia].
* [BLOB] R. Haskin, Raymond A. Lorie: [http://portal.acm.org/citation.cfm?id=582353.582390 On Extending the Functions of a Relational Database System]. SIGMOD Conference (body of paper not on web) 1982: 207-212. See the Drops section for a discussion of how this approach makes the tree less balanced, and the effect that has on performance.
* [Chen] Chen, P.M., Patterson, David A., [http://www.eecs.berkeley.edu/Pubs/TechRpts/1992/6129.html A New Approach to I/O Performance Evaluation] -- Self-Scaling I/O Benchmarks, Predicted I/O Performance, 1993 ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems; also available on Chen's web page.
* [C-FFS] Ganger, Gregory R., Kaashoek, M.
Frans, [http://www.ece.cmu.edu/~ganger/papers/cffs.html Embedded Inodes and Explicit Grouping: Exploiting Disk Bandwidth for Small Files]. A very well written paper focused on 1-10k file size issues; they use some similar notions (most especially their concept of grouping compared to my packing localities). Note that they focus on the 1-10k file size range, and not the sub-1k range. The 1-10k range is the weak point in reiserfs performance.
* [ext2fs] Rémy Card, [http://e2fsprogs.sourceforge.net/ext2intro.html Design and Implementation of the Second Extended Filesystem]. Extensive information; source code is available. When you consider how small this file system is (~6000 lines), its effectiveness becomes all the more remarkable.
* [FFS] M.K. McKusick, W.N. Joy, S.J. Leffler, and R.S. Fabry. [http://www.eecs.berkeley.edu/~brewer/cs262/FFS.pdf A fast file system for UNIX]. ACM Transactions on Computer Systems, 2(3):181--197, August 1984. Describes the implementation of a file system which employs parent directory location knowledge in determining file layout. It uses large blocks for all but the tail of files to improve I/O performance, and uses small blocks called fragments for the tails so as to reduce the cost due to internal fragmentation. Numerous other improvements are also made to what was once the state-of-the-art. FFS remains the architectural foundation for many current block allocation file systems, and was later bundled with the standard Unix releases. Note that unrequested serialization and the use of fragments places it at a performance disadvantage to ext2fs, though whether ext2fs is thereby made less reliable is a matter of dispute that I take no position on (reiserfs uses preserve lists; forgive my egotism in thinking that it is enough work for me to ensure that reiserfs solves the recovery problem, and to perhaps suggest that ext2fs would benefit from the use of preserve lists when shrinking directories).
* [Ganger] Gregory R. Ganger, Yale N.
Patt, [http://pages.cs.wisc.edu/~remzi/Classes/838/Fall2001/Papers/softupdates-osdi94.pdf Metadata Update Performance in File Systems]
* [Gifford] [http://portal.acm.org/citation.cfm?id=121133.121138 Semantic file systems]. Describes a file system enriched to have more than hierarchical semantics. He shares many goals with this author; forgive me for thinking his work worthwhile. If I had to suggest one improvement in a sentence, I would say his semantic algebra needs closure.
* [Hitz, Dave] [http://media.netapp.com/documents/wp_3002.pdf File System Design for an NFS File Server Appliance]. A rather well designed file system optimized for NFS and RAID in combination. Note that RAID increases the merits of write-optimization in block layout algorithms.
* [Holton and Das] Holton, Mike, Das, Raj, [http://www.uoks.uj.edu.pl/resources/flugor/IRIX/xfs-whitepaper.html XFS: A Next Generation Journalled 64-Bit Filesystem With Guaranteed Rate I/O]: "The XFS space manager and namespace manager use sophisticated B-Tree indexing technology to represent file location information contained inside directory files and to represent the structure of the files themselves (location of information in a file)." Note that it is still a block (extent) allocation based file system; no attempt is made to store the actual file contents in the tree. It is targeted at the needs of the other end of the file size usage spectrum from reiserfs, and is an excellent design for that purpose (and I would concede that reiserfs 1.0 is not suitable for their real-time large I/O market.) SGI has also traditionally been a leader in resisting the use of unrequested serialization of I/O. Unfortunately, the paper is a bit vague on details, and source code is not freely available.
* [Howard] [http://www.cs.cmu.edu/~satya/docdir/s11.pdf Scale and Performance in a Distributed File System], Howard, J.H., Kazar, M.L., Menees, S.G., Nichols, D.A., Satyanarayanan, M., Sidebotham, R.N., West, M.J., ACM Transactions on Computer Systems, 6(1), February 1988. A classic benchmark; it was too CPU-bound for both ext2fs and reiserfs.
* [Knuth] Knuth, D.E., [http://www-cs-faculty.stanford.edu/~knuth/taocp.html The Art of Computer Programming], Vol. 3 (Sorting and Searching), Addison-Wesley, Reading, MA, 1973. The earliest reference discussing trees storing records of varying length.
* [LADDIS] Wittle, Mark, and Bruce, Keith, [http://www.spec.org/sfs93/doc/WhitePaper.ps LADDIS: The Next Generation in NFS File Server Benchmarking], Proceedings of the Summer 1993 USENIX Conference, July 1993, pp. 111-128.
* [Lewis and Denenberg] Lewis, Harry R., Denenberg, Larry, [http://portal.acm.org/citation.cfm?id=548586 Data Structures & Their Algorithms], HarperCollins Publishers, NY, NY, 1991. An algorithms textbook suitable for readers wishing to learn about balanced trees and their AVL predecessors.
* [McCreight] McCreight, E.M., [http://portal.acm.org/citation.cfm?id=359839 Pagination of B*-trees with variable length records], Commun. ACM 20 (9), 670-674, 1977. Describes algorithms for trees with variable length records.
* [McVoy and Kleiman] [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.55.2970&rep=rep1&type=pdf Extent-like Performance from a UNIX File System]: The implementation of write-clustering for Sun's UFS.
* [OLE] [http://portal.acm.org/citation.cfm?id=207534 Inside OLE] by Kraig Brockschmidt; discusses Structured Storage.
* [Ousterhout] J.K. Ousterhout, H. Da Costa, D. Harrison, J.A. Kunze, M.D. Kupfer, and J.G. Thompson. [http://portal.acm.org/citation.cfm?id=323627.323631 A trace-driven analysis of the UNIX 4.2BSD file system].
In Proceedings of the 10th Symposium on Operating Systems Principles, pages 15-24, Orcas Island, WA, December 1985.
* [NTFS] [http://portal.acm.org/citation.cfm?id=527752 Inside the Windows NT File System]. The book is by Helen Custer; NTFS was architected by Tom Miller with contributions by Gary Kimura, Brian Andrew, and David Goebel, Microsoft Press, 1994. An easy-to-read little book; they fundamentally disagree with me on adding serialization of I/O not requested by the application programmer, and I note that the performance penalty they pay for their decision is high, especially compared with ext2fs. Their FS design is perhaps optimal for floppies and other hardware-eject media beyond OS control. A less serialized, higher performance log structured architecture is described in [Rosenblum and Ousterhout]. That said, Microsoft is to be commended for recognizing the importance of attempting to optimize for small files, and leading the OS designer effort to integrate small objects into the file name space. This book is notable for not referencing the work of persons not working for Microsoft, or providing any form of proper attribution to previous authors such as [Rosenblum and Ousterhout].
* [Peacock] Dr. J. Kent Peacock, "The CounterPoint Fast File System", Proceedings of the Usenix Conference Winter 1988.
* [Pike] Rob Pike and Peter Weinberger, [http://pdos.csail.mit.edu/~rsc/pike85hideous.pdf The Hideous Name], USENIX Summer 1985 Conference Proceedings, pp. 563, Portland, Oregon, 1985. Short, informal, and drives home why inconsistent naming schemes in an OS are detrimental. His discussion of naming in Plan 9: [http://plan9.bell-labs.com/sys/doc/names.html The Use of Name Spaces in Plan 9]
* [Rosenblum and Ousterhout] [http://www.eecs.berkeley.edu/~brewer/cs262/LFS.pdf The Design and Implementation of a Log-Structured File System], Mendel Rosenblum and John K.
Ousterhout, February 1992, ACM Transactions on Computer Systems. This paper was quite influential in a number of ways on many modern file systems, and the notion of using a cleaner may be applied to a future release of reiserfs. There is an interesting on-going debate over the relative merits of FFS vs. LFS architectures, and the interested reader may peruse [http://www.eecs.harvard.edu/~margo/papers/icde93/ Transaction Support in a Log-Structured File System] and the arguments by Margo Seltzer it links to.
* [Snyder] [http://www.solarisinternals.com/si/reading/tmpfs.pdf tmpfs: A Virtual Memory File System] discusses a file system built to use swap space and intended for temporary files; due to a complete lack of disk synchronization it offers extremely high performance.
* [Vahalia] Uresh Vahalia, [http://books.google.com/books?as_isbn=0131019082 UNIX internals: the new frontiers]

[[category:ReiserFS]] [[category:Formatting-fixes-needed]]

{{wayback|http://www.namesys.com/X0reiserfs.html|2006-11-13}}

Three reasons why ReiserFS is great for you

Last Update: 2002, Hans Reiser

Three reasons why ReiserFS is great for you:

# ReiserFS has fast journaling, which means that you don't spend your life waiting for fsck every time your laptop battery dies, or the UPS for your mission critical server gets its batteries disconnected accidentally by the UPS company's service crew, or your kernel was not as ready for prime time as you hoped, or the silly thing decides you mounted it too many times today.
# ReiserFS is based on fast balanced trees. Balanced trees are more robust in their performance, and are a more sophisticated algorithmic foundation for a file system. When we started our project, there was a consensus in the industry that balanced trees were too slow for file system usage patterns. We proved that if you just do them right they are better--take a look at the benchmarks.
We have fewer worst case performance scenarios than other file systems and generally better overall performance. If you put 100,000 files in one directory, we think it's fine; many other file systems try to tell you that you are wrong to want to do it.
# ReiserFS is more space efficient. If you write 100-byte files, we pack many of them into one block. Other file systems put each of them into their own block. We don't have fixed space allocation for inodes. That saves 6% of your disk.
Ok, it's time to fess up. The interesting stuff is still in the future. Because they are nifty, we are going to add database and hypertext-like features into the file system. Only by using balanced trees, with their effective handling of small files (database small fields, hypertext keywords), as our technical foundation can we hope to do this. That was our real motivation. As for performance, we may already be slightly better than the traditional file systems (and substantially better than the journaling ones). But they have been tweaking for decades, while we have just got started. This means that over the next few years we are going to improve faster than they are. Speaking more technically: ReiserFS is a file system using a plug-in based object oriented variant on classical balanced tree algorithms. The results when compared to the ext2fs conventional block allocation based file system, running under the same operating system and employing the same buffering code, suggest that these algorithms are overall more efficient and every passing month are becoming yet more so. Loosely speaking, every month we find another performance cranny that needs work; we fix it. And every month we find some way of improving our overall general usage performance. The improvement in small file space and time performance suggests that we may now revisit a common OS design assumption that one should aggregate small objects using layers above the file system layer.
Being more effective at small files does not make us less effective for other files. This is truly a general purpose FS. Our overall traditional FS usage performance is high enough to establish that. ReiserFS has a commitment to opening up the FS design to contributions; we are now adding plug-ins so that you can create your own types of directories and files.

= Introduction =

The author is one of many OS researchers who are attempting to unify the name spaces in the operating system in varying ways (e.g. [http://plan9.bell-labs.com/sys/doc/names.html Pike, The Use of Name Spaces in Plan 9]). None of us are well funded compared with the size of the task, and I am far from an exception to this rule. The natural consequence is that we each have attacked one small aspect of the task. My contribution is in incorporating small objects into the file system name space effectively. This implementation offers value to the average Linux user, in that it provides generally good performance compared to the current Linux file system known as ext2fs. It also saves space to an extent that is important for some applications, and convenient for most. It does extremely well for large directories, and has a variety of minor advantages. Since ext2fs is very similar to FFS and UFS in performance, the implementation also offers potential value to commercial OS vendors who desire greater than ext2fs performance without directory size issues, and who appreciate the value of a better foundation for integrating name spaces throughout the OS.

= Why Is There A Move Among Some OS Designers Towards Unifying Name Spaces? =

An operating system is composed of components that access other components through interfaces. Operating systems are complex enough that, like national economies, the architect cannot centrally plan the interactions of the components that it is composed of.
The architect can provide a structural framework that has a marked impact on the efficiency and utility of those interactions. Economists have developed principles that govern large economic systems. Are there system principles that we might use to start a discussion of the ways increasing component interactivity via naming system design impacts the total utility of an operating system? I propose these:
* If one increases the number of other components that a particular component can interact with, one increases its expressive power and thereby its utility.
* One can increase the number of other components that a particular component can interact with either by increasing the number of interfaces it has, or by increasing the number of components that are accessible by its current interfaces.
* The cost of component interfaces dominates software design cost, just as the cost of wires dominates circuit design cost.
* Total system utility tends to be proportional not to the number of components, but to the number of possible component interactions.
It is not simply the number of components that one has that determines an OS's expressive power, it is the number of opportunities to use them that determines it. The number of these opportunities is proportional to the number of possible combinations of them, and the number of possible combinations of them is determined by their connectedness. Component connectedness in OS design is determined by name space design, to much the same extent that buses determine it in circuit design. Allow me to illustrate the impact of these principles with the use of an imaginary example. Suppose two imaginary OS vendors with equally talented programmers hire two different OS architects. Suppose one of the architects centers the OS design around a single name space design that allows all of the components to access all other components via a single interface (assume this is possible, it is a theoretical example).
Suppose the other allows the ten different design groups in the company that are developing components to create their own ten name spaces. Suppose that the unified name space OS architect has half of the resources of the fragmented name space OS architect and creates half as many components. While the number of components is half as large, the number of connections is (1/2)^2 / ((1/10)^2 * 10) = 2.5 times larger. If you accept my hypothesis that utility is more proportional to connections than components, then the unified operating system with half the development cost will still offer more expressive utility. That is a powerful motivation. To return briefly to the long ago researched principles governing another member of the class of large systems, the economies of nations, it is perhaps interesting to note that Adam Smith in [http://en.wikisource.org/wiki/The_Wealth_of_Nations "The Wealth of Nations"] engaged in substantial study of the link between the extent of interconnectedness and the development of civilization, where the extent of interconnectedness was determined by waterways, etc. The link he found for economic systems was no less crucial than what is being suggested here for the effect of component interconnectedness on the total utility of software systems. I suggest that I am merely generalizing a long established principle from another field of science, namely that total utility in large systems with components that interact to generate utility is determined by the extent of their interconnection. There are many exceptions to these principles: not all chips on a motherboard sit on the bus, and analogous considerations apply to both OS design and the economies of nations. I hope the reader will accept that space considerations make it appropriate to gloss over these, and will consider the central point that under some circumstances unifying name spaces in a design can dramatically improve the utility of an OS.
That can be an enormous motivation, and it has moved a number of OS researchers in their work (e.g. [http://plan9.bell-labs.com/sys/doc/names.html "The Use of Name Spaces in Plan9", Rob Pike] and [http://pdos.csail.mit.edu/~rsc/pike85hideous.pdf "The Hideous Name", Rob Pike and P.J. Weinberger]). Unfortunately, it is not a small technical effort to combine name spaces. To combine 10 name spaces requires, if not the effort to create 10 name spaces, perhaps an effort equivalent to creating 5 of the name spaces. Usually each of the name spaces has particular performance and semantic power requirements that require enhancing the unified name space, and it usually requires technical innovation to combine the advantages of each of the separated name spaces into a unified name space. I would characterize none of the research groups currently approaching this unification problem as having funding equivalent to what went into creating 5 of the name spaces they would like to unify, and we are certainly no exception. For this reason I have picked one particular aspect of this larger problem for our focus: allowing small objects to effectively share the same file system interface that large objects use currently. As operating systems increase the number of their components, the higher development cost of a file system able to handle small files becomes more worth the multiplicative effect it has on OS utility, as well as its reduction of OS component interface cost. = Should File Boundaries Be Block Aligned? 
=

Making file boundaries block aligned has a number of effects: 1) it minimizes the number of blocks a file is spread across (which is especially beneficial for multiple-block files when locality of reference across files is poor); 2) it wastes disk and buffer space in storing every less than fully packed block; 3) it wastes I/O bandwidth with every access to a less than fully packed block when locality of reference is present; 4) it increases the average number of block fetches required to access every file in a directory; and 5) it results in simpler code. The simpler code of block aligning file systems follows from not needing to create a layering to distinguish the units of the disk controller and buffering algorithms from the units of space allocation, and from not needing to optimize the packing of nodes as is done in balanced tree algorithms. For readers who have not been involved in balanced tree implementations, algorithms of this class are notorious for being much more work to implement than one would expect from their description. Sadly, they also appear to offer the highest performance solution for small files, once I remove certain simplifications from their implementation and add certain optimizations common to file system designs. I regret that code complexity (30k lines) is a major disadvantage of the approach compared to the 6k lines of the ext2fs approach. I started our analysis of the problem with an assumption that I needed to aggregate small files in some way, and that the question was, which solution was optimal? The simplest solution was to aggregate all small files in a directory together into either a file or the directory. But any aggregation into a file or directory wastes part of the last block in the aggregation. What does one do if there are only a few small files in a directory, aggregate them into the parent of the directory?
What if there are only a few small files in a directory at first, and then there are many small files? How do I decide what level to aggregate them at, and when to take them back from a parent of a directory and store them directly in the directory? As we did our analysis of these questions we realized that this problem was closely related to the balancing of nodes in a balanced tree. The balanced tree approach, by using an ordering of files which are then dynamically aggregated into nodes at a lower level, rather than a static aggregation or grouping, avoids this set of questions. In my approach I store both files and filenames in a balanced tree, with small files, directory entries, inodes, and the tail ends of large files all being more efficiently packed as a result of relaxing the requirements of block alignment, and eliminating the use of a fixed space allocation for inodes. I have a sophisticated and flexible means for arranging for the aggregation of files for maximal locality of reference, through defining the ordering of items in the tree. The body of large files is stored in unformatted nodes that are attached to the tree but isolated from the effects of possible shifting by the balancing algorithms. Approaches such as [Apple] and [Holton and Das] have stored filenames but not files in balanced trees. None of the file systems C-FFS, NTFS, or XFS aggregates files; all of them block align files, though all of them also do some variation on storing small files in the statically allocated block address fields of inodes if they are small enough to fit there. [C-FFS] has published an excellent discussion of both their approach and why small files rob a conventional file system of performance more in proportion to the number of small files than the number of bytes consumed by small files. However, I must note that their notion of what constitutes small is different from ours by one or two orders of magnitude.
Their use of an exo-kernel is simply an excellent approach for operating systems that have that as an available option. Semantics (files), packing (blocks/nodes), caching (read-ahead sizes, etc.), and the hardware interfaces of disk (sectors) and paging (pages) all have different granularity issues associated with them: a central point of our approach is that the optimal granularity of these often differs, and abstracting these into separate layers in which the granularity of one layer does not unintentionally impact other layers can improve space/time performance. Reiserfs innovates in that its semantic layer often conveys to the other layers an ungranulated ordering rather than one granulated by file boundaries. The reader is encouraged to note the areas in which reiserfs needs to go farther in doing so while reading the algorithms.

= Balanced Trees and Large File I/O =

There has long been an odd informal consensus that balanced trees are too slow for use in storing large files, perhaps originating in the performance of databases that have attempted to emulate file systems using balanced tree algorithms that were not originally architected for file system access patterns or their looser serialization requirements. It is hopefully easy for the reader to understand that storing many small files and tail ends of files in a single node where they can all be fetched in one I/O leads directly to higher performance. Unfortunately, it is quite complex to understand the interplay between I/O efficiency and block size for larger files, and space does not allow a systematic review of traditional approaches.
The reader is referred to [FFS], [Peacock], [McVoy], [Holton and Das], [Bach], [OLE], and [NTFS] for treatments of the topic, and discussions of various means of 1) reducing the effect of block size on CPU efficiency, 2) eliminating the need for inserting rotational delay between successive blocks, 3) placing small files into either inodes or directories, and 4) performing read-ahead. More commentary on these is in the annotated bibliography. Reiserfs has the following architectural weaknesses that stem directly from the overhead of repacking to save space and increase block size: 1) when the tail (files < 4k are all tail) of a file grows large enough to occupy an entire node by itself it is removed from the formatted node(s) it resides in, and it is converted into an unformatted node ([FFS] pays a similar conversion cost for fragments), 2) a tail that is smaller than one node may be spread across two nodes which requires more I/O to read if locality of reference is poor, 3) aggregating multiple tails into one node introduces separation of file body from tail, which reduces read performance ([FFS] has a similar problem, and for reiserfs files near the node in size the effect can be significant), 4) when you add one byte to a file or tail that is not the last item in a formatted node, then on average half of the whole node is shifted in memory. If any of your applications perform I/O in such a way that they generate many small unbuffered writes, reiserfs will make you pay a higher price for not being able to buffer the I/O. Most applications that create substantial file system load employ effective I/O buffering, often simply as a result of using the I/O functions in the standard C libraries. By avoiding accesses in small blocks/extents reiserfs improves I/O efficiency. 
Extent based file systems such as VxFS, and write-clustering systems such as ext2fs, are not so effective in applying these techniques that they choose to use 512-byte blocks rather than 1k blocks as their defaults. Ext2fs reports a 20% speedup when 4k rather than 1k blocks are used, but the authors of ext2fs advise the use of 1k blocks to avoid wasting space. There are a number of worthwhile large file optimizations that have not been added to either ext2fs or reiserfs, and both file systems are somewhat primitive in this regard, reiserfs being the more primitive of the two. Large files simply were not my research focus, and it being a small research project I did not implement the many well known techniques for enhancing large file I/O. The buffering algorithms are probably more crucial than any other component in large file I/O, and partly out of a desire for a fair comparison of the approaches I have not modified these. I have added no significant optimizations for large files, beyond increasing the block size, that are not found in ext2fs. Except for the size of the blocks, there is not a large inherent difference between: 1) the cost of adding a pointer to an unformatted node to my tree plus writing the node, and 2) adding an address field to an inode plus writing the block. It is likely that except for block size the primary determinants of high performance large file access are orthogonal to the decision of whether to use balanced tree algorithms for small and medium sized files. For large files we get some advantage from not having our tree being more balanced than the tree formed by an inode which points to a triple indirect block. We haven't an easy method for measuring the performance gain from that though. There is performance overhead due to the memory bandwidth cost of balancing nodes for small files. We think it is worth it though. 
= Serialization and Consistency =

The issues of ensuring recoverability with minimal serialization and data displacement necessarily dominate high performance design. Let's define the two extremes in serialization so that the reason for this can be clear. Consider the relative speed of a set of I/O's in which every block request in the set is fed to the elevator algorithms of the kernel and the disk drive firmware fully serially, each request awaiting the completion of the previous request. Now consider the other extreme, in which all block requests are fed to the elevator algorithms all together, so that they may all be sorted and performed in close to their sorted order (disk drive firmwares don't use a pure elevator algorithm). The unserialized extreme may be more than an order of magnitude faster, due to the cost of rotations and seeks. Unnecessarily serializing I/O prevents the elevator algorithm from doing its job of placing all of the I/O's in their layout sequence rather than chronological sequence. Most of high performance design centers around making I/O's in the order they are laid out on disk, and laying out blocks on disk in the order that the I/O's will want to be issued. [Snyder] discusses a file system that obtains high performance from a complete lack of disk synchronization, but is only suitable for temporary files that don't need to survive reboot. I think its known value to Solaris users indicates that the optimal buffering policy varies from file to file. Ganger discusses methods for using ordering of writes rather than serialization for ensuring conventional file system meta-data integrity; [McVoy] previously suggested but did not implement ordering of buffer writes. Ext2fs is fast in substantial part due to avoiding synchronous writes of metadata, and I have much personal experience with it that leads me to prefer compiles that are fast.
[ I would like to see it adopt a policy that all dirty buffers for files not flagged as temporary are queued for writing, and that the existence of a dirty buffer means that the disk is busy. This will require replacing buffer I/O locking with copy-on-write, but an idle disk is such a terrible thing to waste. :-) ] [NTFS] by default adds unnecessary serialization to an extent that even older file systems such as [FFS] do not, and its performance characteristics reflect that. In fairness, it should be said that it is the superior approach for most removable media without software control of ejection (e.g. IBM PC floppies). Reiserfs employs a new scheme called preserve lists for ensuring recoverability, which avoids overwriting old meta-data by writing the meta-data nearby rather than over old meta-data.

= Why Aggregate Small Objects at the File System Level? =

There has long been a tradition of file system developers deciding that effective handling of small files is not significant to performance, and of application programmers caring enough about performance to not store small files as separate entities in the file system. To store small objects one may either make the file system efficient for the task, or sidestep the problem by aggregating small objects in a layer above the file system. Sidestepping the problem has three disadvantages: utility, code complexity, and performance. Utility and Code Complexity: Allowing OS designers to effectively use a single namespace with a single interface for both large and small objects decreases coding cost and increases the expressive power of components throughout the OS. I feel reiserfs shows the effects of a larger development investment focused on a simpler interface when compared with many solutions for this currently available in the object oriented toolkit community, such as the Structured Storage available in Microsoft's [OLE].
By simpler I mean I added nothing to the file system API to distinguish large and small objects, and I leave it to the directory semantics and archiving programs to aggregate objects. Multiple layers cost more to implement, cost more to code the interfaces for utilizing, and provide less flexibility. Performance: It is most commonly the case that when one layers one file system on top of another the performance is substantially reduced, and Structured Storage is not an exception to this general rule. Reiserfs, which does not attempt to delegate the small object problem to a layer above, avoids this performance loss. I have heard it suggested by some that this layering avoids the performance loss from syncing on file close as many file systems do. I suggest that this is adding an error to an error rather than fixing it. Let me make clear that I believe those who write such layers above the file system do not do so out of stupidity. I know of at least one company at which a solution that layers small object storage above the file system exists because the file system developers refused to listen to the non-file system group's description of its needs, and the file system group had to be sidestepped in generating the solution. Current file systems are fairly well designed for the purposes that their users currently use them for: my goal is to change file size usage patterns. The author remembers arguments that once showed clearly that there was no substantial market need for disk drives larger than 10MB based on current usage statistics. While [C-FFS] points out that 80% of file accesses are to files below 10k, I do not believe it reasonable to attempt to provide statistics based on usage measurements of file systems for which small files are inappropriate to use that will show that small files are frequently used. Application programmers are smarter than that. 
Currently 80% of file accesses are to the first order of magnitude in file size for which it is currently sensible to store the object in the file system. I regret that one can only speculate as to whether, once file systems become effective for small files and database tasks, usage patterns will change to 80% of file accesses being to files less than 100 bytes. What I can do is show via the 80/20 Banded File Set Benchmark presented later that in such circumstances small file performance potentially dominates total system performance. In summary, the on-going reinvention of incompatible object aggregation techniques above the file system layer is expensive, less expressive, less integrated, slower, and less efficient in its storage than incorporating balanced tree algorithms into the file system.

= Tree Definitions =

Balanced trees are used in databases, and more generally, wherever a programmer needs to search and store to non-random memory by a key, and has the time to code it this way. The usual evolution for programmers is to first think that hashing will be simpler and more efficient, and then realize only after getting into the sordid details of it that the combination of space efficiency, minimizing disk accesses, and the feasibility of caching the top part of the tree, makes the tree approach more effective. It is the usual thing to first try to do hashing, and then by the time the details are worked out, to have a balanced tree. The cost of effectively handling bucket overflow just isn't less than the cost of balancing, unless the buckets are always all in RAM. Hashing is often a good solution when there is no non-random memory involved, such as when hashing a cache. The Linux dcache code uses hashing for accessing a cache of in-memory directory entries. Sometimes one uses partial or full hashing of keys within a balanced tree.
If you do full hashing within a tree, and you cache the top part of that tree, you have something rather similar to extensible hashing, except it is more flexible and efficient. Sometimes programmers code using unbalanced trees. Most filesystems do essentially that. Balanced trees generally do a better job of minimizing the average number of disk accesses. There is literature establishing that balanced trees are optimal for the worst case when there is no caching of the tree. This is rather pointless literature, as the average case when cached is what is important, and I am afraid that the existing literature proves that which is feasible to prove rather than that which is relevant. That said, practitioners know from experience that making the tree less balanced leads to more I/Os. Discussions of the exceptions to this are rather interesting but not for here.... I regret that I must assume that the reader is familiar with basic balanced tree algorithms [Wood], [Lewis and Denenberg], [Knuth], [McCreight]. No attempt will be made to survey tree design here since balanced trees are one of the most researched and complex topics in algorithm theory, and require treatment at length. I must compound this discourtesy with a concise set of definitions that sorely lack accompanying diagrams, my apologies. Finally, I'll truly annoy the reader by saying that the header files contain nice ascii art, and if you want full definition of the structures, the source is the place. Classically, balanced trees are designed with the set of keys assumed to be defined by the application, and the purpose of the tree design is to optimize searching through those keys. In my approach the purpose of the tree is to optimize the reference locality and space efficient packing of objects, and the keys are defined as best optimizes the algorithm for that. 
Keys are used in place of inode numbers in the file system, thereby choosing to substitute a mapping of keys to node location (the internal nodes) for a mapping of inode number to file location. Keys are longer than inode numbers, but one needs to cache fewer of them than one would need to cache inode numbers when more than one file is stored in a node. In my tree, I still require that a filename be resolved one component at a time. It is an interesting topic for future research whether this is necessary or optimal. This is a more complex issue than a casual reader might realize: directory-at-a-time lookup accomplishes a form of compression, makes mounting other name spaces and file system extensions simpler, makes security simpler, and makes future enhanced semantics simpler. Since small files typically lead to large directories, it is fortuitous that as a natural consequence of our use of tree algorithms, our directory mechanisms are much more effective for very large directories than those of most other file systems (notable exceptions include [Holton and Das]). The tree has three node types: internal nodes, formatted nodes, and unformatted nodes. The contents of internal and formatted nodes are sorted in the order of their keys. (Unformatted nodes contain no keys.) Internal nodes consist of pointers to sub-trees separated by their delimiting keys. The key that precedes a pointer to a sub-tree is a duplicate of the first key in the first formatted node of that sub-tree. Internal nodes exist solely to allow determining which formatted node contains the item corresponding to a key. ReiserFS starts at the root node, examines its contents, and based on them determines which subtree contains the item corresponding to the desired key. From the root node reiserfs descends into the tree, branching at each node, until it reaches the formatted node containing the desired item.
The first (bottom) level of the tree consists of unformatted nodes, the second level consists of formatted nodes, and all levels above consist of internal nodes. The highest level contains the root node. The number of levels is increased as needed by adding a new root node at the top of the tree. All paths from the root of the tree to all formatted leaves are equal in length, and all paths to all unformatted leaves are also equal in length and 1 node longer than the paths to the formatted leaves. This equality in path length, and the high fanout it provides is vital to high performance, and in the Drops section I will describe how the lengthening of the path length that occurred as a result of introducing the [BLOB] approach (the use of indirect items and unformatted nodes) proved a measurable mistake. Formatted nodes consist of items. Items have four types: direct items, indirect items, directory items, and stat data items. All items contain a key which is unique to the item. This key is used to sort, and find, the item. Direct items contain the tails of files, and tails are the last part of the file (the last file_size modulo FS block size of a file). Indirect items consist of pointers to unformatted nodes. All but the tail of the file is contained in its unformatted nodes. Directory items contain the key of the first directory entry in the item followed by a number of directory entries. Depending on the configuration of reiserfs, stat data may be stored as a separate item, or it may be embedded in a directory entry. We are still benchmarking to determine which way is best. A file consists of a set of indirect items followed by a set of up to two direct items, with the existence of two direct items representing the case when a tail is split across two nodes. 
If a tail is larger than the maximum size of a file that can fit into a formatted node, but smaller than the unformatted node size (4k), then it is stored in an unformatted node, and a pointer to it plus a count of the space used is stored in an indirect item.

Directories consist of a set of directory items. Directory items consist of a set of directory entries. Directory entries contain the filename and the key of the file which is named. There is never more than one item of the same type from the same object stored in a single node (there is no reason one would want two separate items rather than combining them). The first item of a file or directory contains its stat data.

When performing balancing, and analyzing the packing of a node and its two neighbors, we ensure that the three nodes cannot be compressed into two nodes. I feel greater compression than this is best left to an FS cleaner to perform rather than attempting it dynamically.

== ReiserFS structures ==

The ReiserFS tree has Max_Height = N (the current default value is N = 5). The tree lies in disk blocks, and each disk block that belongs to the reiserfs tree has a block head.

An internal node of the tree is a disk block holding keys and pointers to disk blocks:

 Block_Head | Key 0 | Key 1 | Key 2 | --- | Key N | Pointer 0 | Pointer 1 | Pointer 2 | --- | Pointer N | Pointer N+1 | ..Free Space..

A leaf node of the tree is a disk block holding items and item headers, with the item bodies packed from the opposite end:

 Block_Head | IHead 0 | IHead 1 | IHead 2 | --- | IHead N | ......Free Space...... | Item N | --- | Item 2 | Item 1 | Item 0

An unformatted node of the tree is a disk block holding nothing but the data of a big file.

ReiserFS objects are files and directories. The maximum number of objects is 2^32-4 = 4 294 967 292. Each object is a number of items.

File items:
# StatData item + [Direct item] (for small files: size from 0 bytes to MAX_DIRECT_ITEM_LEN = blocksize - 112 bytes)
# StatData item + InDirect item + [Direct item] (for big files: size > MAX_DIRECT_ITEM_LEN bytes)

Directory items:
# StatData item + Directory item

Every reiserfs object has an Object ID and a Key.

=== Internal Node structures ===

struct block_head:

 Field Name           Type            Size (bytes)  Description
 blk_level            unsigned short  2             Level of the block in the tree (1 = leaf; 2, 3, 4, ... = internal)
 blk_nr_item          unsigned short  2             Number of keys in an internal block, or number of items in a leaf block
 blk_free_space       unsigned short  2             Free space in the block, in bytes
 blk_right_delim_key  struct key      16            Right delimiting key for this block (leaf nodes only)
 
 Total: 6 bytes (occupying 8) for internal nodes; 22 bytes (occupying 24) for leaf nodes.

struct key:

 Field Name    Type   Size (bytes)  Description
 k_dir_id      __u32  4             ID of the parent directory
 k_object_id   __u32  4             ID of the object (also the inode number)
 k_offset      __u32  4             Offset from the beginning of the object to the current byte of the object
 k_uniqueness  __u32  4             Type of the item (StatData = 0, Direct = -1, InDirect = -2, Directory = 500)
 
 Total: 16 bytes.

struct disk_child (pointer to a disk block):

 Field Name       Type            Size (bytes)  Description
 dc_block_number  unsigned long   4             Disk child's block number
 dc_size          unsigned short  2             Disk child's used space
 
 Total: 6 bytes (occupying 8).

=== Leaf Node structures ===

A leaf node begins with the same struct block_head shown above; for a leaf node its total is 22 bytes (occupying 24), since the right delimiting key is present.

Everything in the filesystem is stored as a set of items. Each item has an item_head. The item_head contains the key of the item, its free space (for indirect items), and the location of the item itself within the block.

struct item_head (IHead):

 Field Name          Type        Size (bytes)  Description
 ih_key              struct key  16            Key used to search for the item; all item headers are sorted by this key
 u.ih_free_space /
 u.ih_entry_count    __u16       2             Free space in the last unformatted node, for an InDirect item; 0xFFFF for a Direct item; 0xFFFF for a StatData item; the number of directory entries, for a Directory item
 ih_item_len         __u16       2             Total size of the item body
 ih_item_location    __u16       2             Offset of the item body within the block
 ih_reserved         __u16       2             Used by reiserfsck
 
 Total: 24 bytes.

There are 4 types of items: stat_data items, directory items, indirect items, and direct items.

struct stat_data (the reiserfs version of the UFS disk inode, minus the address blocks):

 Field Name            Type   Size (bytes)  Description
 sd_mode               __u16  2             file type and permissions
 sd_nlink              __u16  2             number of hard links
 sd_uid                __u16  2             owner id
 sd_gid                __u16  2             group id
 sd_size               __u32  4             file size
 sd_atime              __u32  4             time of last access
 sd_mtime              __u32  4             time the file was last modified
 sd_ctime              __u32  4             time the inode (stat data) was last changed (except for changes to sd_atime and sd_mtime)
 sd_rdev               __u32  4             device
 sd_first_direct_byte  __u32  4             Offset from the beginning of the file to the first byte of the file's direct item:
                                              -1 for a directory;
                                               1 for small files (the file has direct items only);
                                              >1 for big files (the file has indirect and direct items);
                                              -1 for big files (the file has indirect items but no direct item)
 
 Total: 32 bytes.

A directory item holds directory entry heads followed by the names, packed from the opposite end:

 deHead 0 | deHead 1 | deHead 2 | --- | deHead N | fileName N | --- | fileName 2 | fileName 1 | fileName 0

A direct item holds a small file body:

 ........................Small File Body............................

An indirect item holds pointers to unformatted blocks (unfPointer size = 4 bytes); unformatted blocks contain the body of a big file:

 unfPointer 0 | unfPointer 1 | unfPointer 2 | --- | unfPointer N

struct reiserfs_de_head (deHead):

 Field Name    Type   Size (bytes)  Description
 deh_offset    __u32  4             Third component of the directory entry key (all reiserfs_de_heads are sorted by this value)
 deh_dir_id    __u32  4             Objectid of the parent directory of the object referenced by the directory entry
 deh_objectid  __u32  4             Objectid of the object referenced by the directory entry
 deh_location  __u16  2             Offset of the name within the whole item
 deh_state     __u16  2             Flags: 1) the entry contains stat data (for the future); 2) the entry is hidden (unlinked)
 
 Total: 16 bytes.

fileName is the name of the file: an array of bytes of variable length. The maximum length of a file name is blocksize - 64 (for a 4kb blocksize, 4032 bytes).

= Using the Tree to Optimize Layout of Files =

There are four levels at which layout optimization is performed:

# the mapping of logical block numbers to physical locations on disk,
# the assigning of nodes to logical block numbers,
# the ordering of objects within the tree, and
# the balancing of the objects across the nodes they are packed into.

== Physical Layout ==

For SCSI drives, the mapping of logical block numbers to physical locations is performed by the disk drive manufacturer; for IDE drives it is done by the device driver; and for all drives it is also potentially done by volume management software.
The logical block number to physical location mapping by the drive manufacturer is usually done using cylinders. FFS used explicit knowledge of actual cylinder boundaries in its design. I agree with the authors of [ext2fs] and most others that the significant file placement feature of FFS was not the use of actual cylinder boundaries, but the placing of files and their inodes on the basis of their parent directory's location. I find that minimizing the distance in logical blocks between semantically adjacent nodes, without tracking cylinder boundaries, accomplishes an excellent approximation of optimizing according to actual cylinder boundaries, and I find its simplicity an aid to implementation elegance.

== Node Layout ==

When I place nodes of the tree on the disk, I search for the first empty block in the bitmap (of used block numbers), starting at the location of the node's left neighbor in the tree ordering and moving in the direction I last moved in. This was experimentally found to be better, for the benchmarks employed, than the following alternatives:

# taking the first non-zero entry in the bitmap;
# taking the entry after the last one that was assigned, in the direction last moved in (this was 3% faster for writes and 10-20% slower for subsequent reads);
# starting at the left neighbor and moving in the direction of the right neighbor.

When changing block numbers so as to avoid overwriting sending nodes before shifted items reach disk in their new recipient node (see the description of preserve lists later in this paper), the benchmarks employed were ~10% faster when starting the search from the left neighbor rather than from the node's current block number, even though determining the left neighbor adds significant overhead (the current implementation risks I/O to read the parent of the left neighbor). It used to be that we would reverse direction when we reached the end of the disk drive.
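The search described above can be sketched as a simple bitmap scan. This is an illustrative sketch, not ReiserFS source; `find_free_block` is a name invented here, and wrapping around with modulo arithmetic at the end of the device is an assumption of this sketch (the paper only says the old behavior of reversing direction was abandoned):

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative sketch (not ReiserFS source) of the block-allocation search:
 * start the scan at the tree-order left neighbor's block number and always
 * move in the increasing-block-number direction, wrapping to the start of
 * the device when the end is reached (the wrap is an assumption here).
 * bitmap[i] != 0 means block i is in use. Returns -1 if the device is full. */
long find_free_block(const unsigned char *bitmap, size_t nblocks, size_t left_neighbor)
{
    size_t i, blk;
    for (i = 0; i < nblocks; i++) {
        blk = (left_neighbor + i) % nblocks; /* increasing direction, with wrap */
        if (!bitmap[blk])
            return (long)blk;
    }
    return -1; /* no free block anywhere */
}
```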
Fortunately we checked to see whether it makes a difference which direction one moves in when allocating blocks to a file, and indeed we found that it made a significant difference to always allocate in the increasing block number direction. We hypothesize that this is due to matching the disk spin direction by allocating using increasing block numbers.

== Ordering within the Tree ==

While I give here an example of how I have defined keys to optimize locality of reference and packing efficiency, I would like to stress that key definition is a powerful and flexible tool that I am far from finished experimenting with. Some key definition decisions depend very much on usage patterns, and this means that someday one will select from several key definitions when creating the file system. For example, consider the decision of whether to pack all directory entries together at the front of the file system, or to pack the entries near the files they name. For large file usage patterns one should pack all directory items together, since systems with such usage patterns are effective in caching the entries for all directories. For small files the name should be near the file. Similarly, for large files the stat data should be stored separately from the body, either with the other stat data from the same directory, or with the directory entry. (It was likely a mistake for me not to assign stat data its own key in the current implementation, as packing it in with direct and indirect items complicates our code for handling those items, and prevents me from easily experimenting with the effects of changing its key assignment.)

It is not necessary for a file's packing to reflect its name; that is merely my default. With each file, my next release will offer the option of overriding the default by use of a system call.
It is feasible to pack an object completely independently of its semantics using these algorithms, and I predict that there will be many applications, perhaps even most, for which a packing different from that determined by object names is more appropriate. Currently the mandatory tying of packing locality and semantics distorts both semantics and packing from what might otherwise be their independent optimums, much as tying block boundaries to file boundaries distorts I/O and space allocation algorithms from their separate optimums. For example, placing most files accessed while booting in their access order at the start of the disk is a very tempting future optimization that the use of packing localities makes feasible to consider.

The Structure of a Key: Each file item has a key with structure <locality_id, object_id, offset, uniqueness>. The locality_id is by default the object_id of the parent directory. The object_id is the unique id of the file, and is set to the first unused objectid when the object is created. This tends to result in objects created successively in a directory being packed adjacently, which is fortuitous for many usage patterns. For files, the offset is the offset within the logical object of the first byte of the item. For directories, the offset key component is the first four bytes of the filename, which you may think of as the lexicographic rather than numeric offset. (In version 0.2 all directory entries had their own individual keys stored with them and were each distinct items; in the current version I store one key in the item, the key of the first entry, and compute each entry's key as needed from that one stored key.) For directory items the uniqueness field differentiates filename entries identical in the first 4 bytes.
For all item types it indicates the item type and for the leftmost item in a buffer it indicates whether the preceding item in the tree is of the same type and object as this item. Placing this information in the key is useful when analyzing balancing conditions, but increases key length for non-directory items, and is a questionable architectural feature. Every file has a unique objectid, but this cannot be used for finding the object, only keys are used for that. Objectids merely ensure that keys are unique. If you never use the reiserfs features that change an object's key then it is immutable, otherwise it is mutable. (This feature aids support for NFS daemons, etc.) We spent quite some time debating internally whether the use of mutable keys for identifying an object had deleterious long term architectural consequences: in the end I decided it was acceptable iff we require any object recording a key to possess a method for updating its copy of it. This is the architectural price of avoiding caching a map of objectid to location that might have very poor locality of reference due to objectids not changing with object semantics. I pack an object with the packing locality of the directory it was first created in unless the key is explicitly changed. It remains packed there even if it is unlinked from the directory. I do not move it from the locality it was created in without an explicit request, unlike the [C-FFS] approach which stores all multiple link files together and pays the cost of moving them from their original locations when the second link occurs. I think a file linked with multiple directories might as well get at least the locality reference benefits of one of those directories. In summary, this approach 1) places files from the same directory together, 2) places directory entries from the same directory together with each other and with the stat data for the directory. 
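The key structure and ordering described above can be sketched as follows. This is an illustrative sketch, not ReiserFS source; the field widths follow the struct key table earlier in this paper, while `key_cmp` and `name_offset` are names invented here:

```c
#include <assert.h>
#include <string.h>

typedef unsigned int u32;

/* Illustrative sketch (not ReiserFS source): a key as described above,
 * ordered lexicographically by (locality_id, object_id, offset, uniqueness). */
struct key {
    u32 locality_id; /* by default, the object_id of the parent directory */
    u32 object_id;   /* unique id of the file */
    u32 offset;      /* byte offset for files; first 4 bytes of name for dirs */
    u32 uniqueness;  /* item type; disambiguates names equal in first 4 bytes */
};

/* Returns -1, 0, or 1; items in the tree are sorted by this ordering. */
int key_cmp(const struct key *a, const struct key *b)
{
    if (a->locality_id != b->locality_id) return a->locality_id < b->locality_id ? -1 : 1;
    if (a->object_id   != b->object_id)   return a->object_id   < b->object_id   ? -1 : 1;
    if (a->offset      != b->offset)      return a->offset      < b->offset      ? -1 : 1;
    if (a->uniqueness  != b->uniqueness)  return a->uniqueness  < b->uniqueness  ? -1 : 1;
    return 0;
}

/* "Lexicographic offset" for a directory entry: pack the first four bytes
 * of the name into an integer so that numeric key order matches name order. */
u32 name_offset(const char *name)
{
    u32 off = 0;
    size_t i, len = strlen(name);
    for (i = 0; i < 4; i++)
        off = (off << 8) | (i < len ? (unsigned char)name[i] : 0);
    return off;
}
```

Because all four components of a directory's entries share the same locality_id and object_id, the entries sort contiguously and in name order, which is what makes packing them together fall out of the key definition.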
Note that there is no interleaving of objects from different directories in the ordering at all, and that all directory entries from the same directory are contiguous. You'll note that this does not accomplish packing the files of small directories with common parents together, and does not employ the full partial ordering in determining the linear ordering; it merely uses parent directory information. I feel the proper place for employing full tree structure knowledge is in the implementation of an FS cleaner, not in the dynamic algorithms.

== Node Balancing Optimizations ==

When balancing nodes I do so according to the following ordered priorities:

# minimize the number of nodes used;
# minimize the number of nodes affected by the balancing operation;
# minimize the number of uncached nodes affected by the balancing operation;
# if shifting to another formatted node is necessary, maximize the bytes shifted.

Priority 4 is based on the assumption that the location of an insertion of bytes into the tree is an indication of the likely future location of an insertion, and that this policy will on average reduce the number of formatted nodes affected by future balance operations. There are more subtle effects as well: if one randomly places nodes next to each other, and one has a choice between those nodes being mostly moderately efficiently packed or packed to an extreme of either well or poorly packed, one is more likely to be able to combine more of the nodes if one chooses the policy of extremism. Extremism is a virtue in space-efficient node packing. The maximal shift policy is not applied to internal nodes, as extremism is not a virtue in time-efficient internal node balancing.
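The ordered priorities above amount to a lexicographic comparison of candidate balancing plans, which can be sketched as follows. This is an illustrative sketch only; `struct balance_plan` and `plan_preferred` are inventions of this sketch, not ReiserFS code:

```c
#include <assert.h>

/* Illustrative sketch (not ReiserFS source) of choosing between two candidate
 * balancing plans by the ordered priorities listed above: fewer nodes used,
 * then fewer nodes affected, then fewer uncached nodes affected, then more
 * bytes shifted. Returns 1 if plan a is preferable to plan b, 0 otherwise. */
struct balance_plan {
    int nodes_used;
    int nodes_affected;
    int uncached_affected;
    int bytes_shifted;
};

int plan_preferred(const struct balance_plan *a, const struct balance_plan *b)
{
    if (a->nodes_used != b->nodes_used)
        return a->nodes_used < b->nodes_used;               /* priority 1 */
    if (a->nodes_affected != b->nodes_affected)
        return a->nodes_affected < b->nodes_affected;       /* priority 2 */
    if (a->uncached_affected != b->uncached_affected)
        return a->uncached_affected < b->uncached_affected; /* priority 3 */
    return a->bytes_shifted > b->bytes_shifted;             /* priority 4: maximal shift */
}
```

Note the asymmetry: the first three priorities minimize, while the fourth maximizes, reflecting the "extremism" argument above.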
=== Drops ===

(The difficult design issues in the current version that our next version can do better.)

Consider dividing a file or directory into drops, with each drop having a separate key, and no two drops from one file or directory occupying the same node without being compressed into one drop. The key for each drop is set to the key for the object (file or directory) plus the offset of the drop within the object. For directories the offset is lexicographic and by filename; for files it is numeric and in bytes. In the course of several file system versions we have experimented with and implemented solid, liquid, and air drops. Solid drops were never shifted, and drops would only solidify when they occupied the entirety of a formatted node. Liquid drops are shifted in such a way that any liquid drop which spans a node fully occupies the space in its node: like a physical liquid it is shiftable but not compressible. Air drops merely meet the balancing condition of the tree.

Reiserfs 0.2 implemented solid drops for all but the tail of files. If a file was at least one node in size, it would align the start of the file with the start of a node, block aligning the file. This block alignment of the start of multi-drop files was a design error that wasted space: even if the locality of reference is so poor as to make one not want to read parts of semantically adjacent files, if the nodes are near to each other then the cost of reading an extra block is thoroughly dwarfed by the cost of the seek and rotation needed to reach the first node of the file. As a result the block alignment saves little time, though it costs significant space for 4-20k files. Reiserfs with block alignment of multi-drop files and no indirect items experienced the following rather interesting behavior, which was partially responsible for making it only 88% space efficient for files that averaged 13k in size (the Linux kernel).
When the tail of a larger than 4k file was followed in the tree ordering by another file larger than 4k, since the drop before was solid and aligned, and the drop afterwards was solid and aligned, no matter what size the tail was, it occupied an entire node. In the current version we place all but the tail of large files into a level of the tree reserved for full unformatted nodes, and create indirect items in the formatted nodes which point to the unformatted nodes. This is known in the database literature as the [BLOB] approach. This extra level added to the tree comes at the cost of making the tree less balanced (I consider the unformatted nodes pointed to as part of the tree) and increasing the maximal depth of the tree by 1. For medium sized files, the use of indirect items increases the cost of caching pointers by mixing data with them. The reduction in fanout often causes the read algorithms to fetch only one node at a time of the file being read more frequently, as one waits to read the uncached indirect item before reading the node with the file data. There are more parents per file read with the use of indirect items than with internal nodes, as a direct result of reduced fanout due to mixing tails and indirect items in the node. The most serious flaw is that these reads of various nodes necessary to the reading of the file have additional rotations and seeks compared to the case with drops. With my initial drop approach they are usually sequential in their disk layout, even the tail, and the internal node parent points to all of them in such a way that all of them that are contained by that parent or another internal node in cache can be requested at once in one sequential read. Non-sequential reads of nodes are more than an order of magnitude more costly than sequential reads, and this single consideration dominates effective read optimization. 
Unformatted nodes make file system recovery faster and less robust, in that one reads their indirect item rather than the nodes themselves to insert them into the recovered tree, and one cannot read them to confirm that their contents are from the file that an indirect item says they are from. In this they make reiserfs similar to an inode based system without logging. A moderately better solution would have been simply to eliminate the requirement that multi-node files start at the start of nodes, rather than introducing BLOBs, and to depend on the use of a file system cleaner to optimally pack the 80% of files that don't move frequently, using algorithms that move even solid drops. Yet that still leaves the problem of formatted nodes not being efficient for mmap() purposes (one must copy them before writing rather than merely modifying their page table entries, and memory bandwidth is expensive even if CPU is cheap).

For this reason I have the following plan for the next version. I will have three trees: one tree maps keys to unformatted nodes, one tree maps keys to formatted nodes, and one tree maps keys to directory entries and stat_data. Now it is only natural if you are thinking that this would mean that to read a file, accessing first the directory entry and stat_data, then the unformatted nodes, then the tail, one must hop long distances across the disk, going first to one tree and then another. This is indeed why it took me two years to realize it could be made to work. My plan is to interleave the nodes of the three trees according to the following algorithm: block numbers are assigned to nodes when the nodes are created, or preserved, and someday will be assigned when the cleaner runs. The choice of block number is based on first determining what other node it should be placed near, and then finding the nearest free block that can be found in the elevator's current direction.
Currently we use the left neighbor of the node in the tree as the node it should be placed near. This is nice and simple. Oh well. Time to create a virtual neighbor layer. The new scheme will continue to first determine the node it should be placed near, and then start the search for an empty block from that spot, but it will use a more complicated determination of what node to place it near. This method will cause all nodes from the same packing locality to be near each other, will cause all directory entries and stat_data to be grouped together within that packing locality, and will interleave formatted and unformatted nodes from the same packing locality. Pseudo-code is best for describing this:

 /* For use by reiserfs_get_new_blocknrs when determining where in the bitmap
    to start the search for a free block, and for use by the read-ahead
    algorithm when there are not enough nodes to the right and in the same
    packing locality for packing-locality read-ahead purposes. */
 get_logical_layout_left_neighbors_blocknr(key of current node)
 {
     /* Based on examination of the current node's key and type, find the
        virtual neighbor of that node. */
     if body node
         if first body node of file
             if (node in tail tree whose key is less but is in same packing locality exists)
                 return blocknr of such node with largest key
             else
                 find node with largest key less than key of current node in stat_data tree
                 return its blocknr
         else
             return blocknr of node in body tree with largest key less than key of current node
     else if tail node
         if (node in body tree belonging to same file as first tail of current node exists)
             return its blocknr
         else if (node in tail tree with lesser delimiting key but same packing locality exists)
             return blocknr of such node with largest delimiting key
         else
             return blocknr of node with largest key less than key of current node in stat_data tree
     else /* is a stat_data tree node */
         if stat_data node with lesser key from same packing locality exists
             return blocknr of such node with largest key
         else
             /* no node from same packing locality with lesser key exists */
 }
 
 /* For use by packing-locality read-ahead. */
 get_logical_layout_right_neighbors_blocknr(key of current node)
 {
     right-handed version of get_logical_layout_left_neighbors_blocknr logic
 }

It is my hope that this will improve the caching of pointers to unformatted nodes, plus improve the caching of directory entries and stat_data, by separating them from file bodies to a greater extent. I also hope that it will improve read performance for 1-10k files, and that it will allow us to do this without decreasing space efficiency.

=== Code Complexity ===

I thought it appropriate to mention some of the notable effects of simple design decisions on our implementation's code length. When we changed our balancing algorithms to shift parts of items rather than only whole items, so as to pack nodes tighter, this had an impact on code complexity. Another multiplicative determinant of balancing code complexity was the number of item types: introducing indirect items doubled it, and changing directory items from being liquid drops to being air drops also increased it.
Storing stat data in the first direct or indirect item of the file complicated the code for processing those items more than if I had made stat data its own item type. When one finds oneself with an NxN coding complexity issue, it usually indicates the need for adding a layer of abstraction. The NxN effect of the number of item types on balancing code complexity is an instance of that design principle, and we will address it in the next major rewrite. The balancing code will employ a set of item operations which all item types must support. The balancing code will then invoke those operations without caring to understand any more of the meaning of an item's type than that it determines which item-specific operation handler is called. Adding a new item type, say a compressed item, will then merely require writing a set of item operations for that item, rather than requiring modification of most parts of the balancing code as it does now. We now feel that the function which determines what resources are needed to perform a balancing operation, fix_nodes(), might as well be written to decide what operations will be performed during balancing, since it pretty much has to do so anyway. That way, the function that performs the balancing with the nodes locked, do_balance(), can be gutted of most of its complexity.

= Buffering & the Preserve List =

We implemented for version 0.2 of our file system a system of write ordering that tracked all shifting of items in the tree, and ensured that no node that had had an item shifted from it was written before the node that had received the item was written. This is necessary to prevent a system crash from causing the loss of an item that might not be recently created. This tracking approach worked, and the overhead it imposed was not measurable in our benchmarks. When in the next version we changed to partially shifting items and increased the number of item types, this code grew out of control in its complexity.
I decided to replace it with a scheme that was far simpler to code and also more effective in typical usage patterns. This scheme was as follows: if an item is shifted from a node, change the block that its buffer will be written to. Change it to the nearest free block to the old block's left neighbor, and rather than freeing the old block, place its number on a "preserve list". (Saying nearest is slightly simplistic, in that the blocknr assignment function moves from the left neighbor in the direction of increasing block numbers.) When a "moment of consistency" is achieved, free all of the blocks on the preserve list. A moment of consistency occurs when there are no nodes in memory into which objects have been shifted (this could be made more precise, but then it would be more complex). If disk space runs out, force a moment of consistency to occur. This is sufficient to ensure that the file system is recoverable. Note that during the large file benchmarks the preserve list was freed several times in the middle of the benchmark. The percentage of buffers preserved is small in practice except during deletes, and one can arrange for moments of consistency to occur as frequently as one wants.

Note that I make no claim that this approach is better than the Soft Updates approach employed by [Granger] or by us in version 0.2; I merely note that tracking the order of writes is more complex than this approach for balanced trees which partially shift items. We may go back to the old approach some day, though not to the code that I threw out. Preserve lists substantially hamper performance for files in the 1-10k size range, and we are re-evaluating them. Ext2fs avoids the metadata shifting problem by never shrinking directories and using fixed inode space allocations.
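The preserve-list scheme described above can be sketched as follows. This is an illustrative sketch only, not ReiserFS source; `struct preserve_list`, `MAX_PRESERVED`, and the fixed-array representation are inventions of this sketch:

```c
#include <assert.h>

/* Illustrative sketch (not ReiserFS source) of the preserve-list scheme:
 * when an item is shifted out of a node, the node's buffer is retargeted to
 * a new block number and the old block number is kept on a preserve list
 * instead of being freed; at a "moment of consistency" the whole list is
 * freed at once. */
#define MAX_PRESERVED 1024

struct preserve_list {
    unsigned long blocks[MAX_PRESERVED];
    int count;
};

void preserve_block(struct preserve_list *pl, unsigned long old_blocknr)
{
    if (pl->count < MAX_PRESERVED)
        pl->blocks[pl->count++] = old_blocknr; /* not freed yet: the old contents
                                                  must survive until the shifted
                                                  item reaches disk elsewhere */
}

/* Called once no node in memory holds items shifted from an unwritten node
 * (or forced, if disk space runs out). Returns how many blocks were freed. */
int moment_of_consistency(struct preserve_list *pl)
{
    int freed = pl->count;
    pl->count = 0; /* every preserved block may now be reused */
    return freed;
}
```

The point of the scheme is visible in the shape of the API: there is no per-write ordering bookkeeping at all, only a deferred bulk free, which is why it replaced the much more complex write-ordering tracker.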
= Lessons From Log Structured File Systems =

Many techniques from other file systems haven't been applied, primarily so as to satisfy my goal of giving reiserfs 1.0 only the minimum feature set necessary to be useful, and will appear in later releases. Log Structured File Systems [Rosenblum and Ousterhout] embody several such techniques, which I will describe after I mention two concerns with that approach:

* With small object file systems it is not feasible to cache in RAM a map of objectid to location for every object, since there are too many objects. This is an inherent problem in using temporal packing rather than semantic packing for small object file systems. With my approach the internal nodes are the equivalent of this objectid-to-location map, but total internal node size is proportional to the number of nodes rather than the number of objects. You can think of internal nodes as a compression of object location information made effective by the existence of an ordering function; this compression is both essential for small files and a major feature of my approach.
* I like obtaining good though not ideal semantic locality without paying a cleaning cost for active data. This is a less critical concern.

I frequently find myself classifying packing and layout optimizations as either appropriate for implementing dynamically or appropriate only for a cleaner. Optimizations whose computational overhead is large compared to their benefit tend to be appropriate for implementation in a cleaner, and a cleaner's benefits mostly impact the static portion of the file system (which typically consumes ~80% of the space). Such objectives as 100% packing efficiency, exactly ordering block layout by semantic order, using the full semantic tree rather than the parent directory in determining semantic order, and compression are all best implemented by cleaner approaches.
In summary, there is much to be learned from the LFS approach, and as I move past my initial objective of supplying a minimal-feature, higher-performance FS I will apply some of those lessons. In the Preserve Lists section I speculate on the possibilities for a fastboot implementation that would merge the better features of preserve lists and logging.

= Directions For the Future =

To go one more order of magnitude smaller in file size will require adding functionality to the file system API, though it will not require discarding upward compatibility. The use of an exokernel is a better approach to small files if it is an option available to the OS designer; it is not currently an option for Linux users. In the future reiserfs will add such features as lightweight files, in which stat_data other than size is inherited from a parent if it is not created individually for the file; an API for reading and writing files without the overhead of file handles and open(); set-theoretic semantics; and many other features that you would expect from researchers who expect to be able to do all that they could do in a database, in the file system, and never really did understand why not.

= Conclusion =

Balanced tree file systems are inherently more space efficient than block allocation based file systems, with the differences reaching order of magnitude levels for small files. While other aspects of design will typically have a greater impact on performance for large files, in direct proportion to the smallness of the file the use of balanced trees offers performance advantages. A moderate advantage was found for large files. Coding cost is mostly in the interfaces, and it is a measure of the OS designer's skill whether those costs are low in the OS. We make it possible for an OS designer to use the same interface for large and small objects, and thereby reduce interface coding cost.
This approach is a new tool available to the OS designer for increasing the expressive power of all of the components in the OS through better name space integration. Researchers interested in collaborating or just using my work will find me friendly. I tailor the framework of my collaborations to the needs of those we work with. I GPL reiserfs so as to meet the needs of academic collaborators. While that makes it unusable without a special license for commercial OSes, commercial vendors will find me friendly in setting up a commercial framework for commercial collaboration, with commercial needs provided for.

= Acknowledgments =

Hans Reiser was the project initiator, primary architect, supplier of funding, and one of the programmers. Some folks at times remark that naming the filesystem Reiserfs was egotistic. It was so named after a potential investor hired all of my employees away from me, then tried to negotiate better terms for his possible investment, and suggested that he could arrange for 100 researchers to swear in Russian court that I had had nothing to do with this project. That business partnership did not work out.

Vladimir Saveljev, while he did not author this paper, worked long hours writing the largest fraction of the lines of code in the file system, and is remarkably gifted at just making things work. Thanks, Vladimir. Anatoly Pinchuk wrote much of the core balancing code, and too much of the rest to list here. Thanks, Anatoly.

It is the policy of the Naming System Venture that if someone quits before project completion, and then takes strong steps to try to prevent others from finishing the project, they shall not be mentioned in the acknowledgements. This was all quite sad, and best forgotten.

I would like to thank Alfred Ajlamazyan for his generosity in providing overhead at a time when his institute had little it could easily spare.
Grigory Zaigralin is thanked for his work in making the machines run, administering the money, and being his usual determined-to-be-useful self. Igor Chudov, thanks for such effective procurement and hardware maintenance work. Eirik Fuller is thanked for his help with NFS and porting to 2.1. I would like to thank Rémy Card for the superb block allocation based file system (ext2fs) that I depended on for so many years, and that allowed me to benchmark against the best. Linus Torvalds, thank you for Linux.

= Business Model and Licensing =

I personally favor performing a balance of commercial and public works in my life. I have no axe to grind against software that is charged for, and no regrets at making reiserfs freely available to Linux users. This project is GPL'd, but I sell exceptions to the GPL to commercial OS vendors and file server vendors. It is not usable to them without such exceptions, and many of them are wise enough to understand that:
* the porting and integration service we are able to provide with the licensing is by itself worth what we charge,
* these services impact their time to market,
* and the relationship spreads the development costs across more OS vendors than just them alone.
I expect that Linux will prove to be quite effective in market sampling my intended market, but if you suspect that I also like seeing more people use it even if it is free to them, oh well. I believe it is not so much the cost that has made Linux so successful as it is the openness. Linux is a decentralized economy with honor and recognition as the currency of payment (and thus there is much honor in it). Commercial OS vendors are, at the moment, all closed economies, and doomed to fall in their competition with open economies just as communism eventually fell.
At some point an OS vendor will realize that if it:
* opens up its source code to decentralized modification,
* systematically rewards those who perform the modifications that are proven useful,
* systematically merges/integrates those modifications into its branded primary release branch while adding value as the integrator,
then it will acquire both the critical mass of the internet development community, and the aggressive edge that no large communal group (such as a corporation) can have. Rather than saying to any such vendor that they should do this now, let me simply point out that whoever is first will have an enormous advantage... Since I have more recognition than money to pass around as reward, my policy is to tend to require that those who contribute substantial software to this project have their names attached to a user visible portion of the project. This official policy helps me deal with folks like Vladimir, who was much too modest to ever name the file system checker vsck without my insisting. Smaller contributions are to be noted in the source code, and in the acknowledgements section of this paper. If you choose to contribute to this file system, and your work is accepted into the primary release, you should let me know if you want me to look for opportunities to integrate you into contracts from commercial vendors. Through packaging ourselves as a group, we are more marketable to such OS vendors. Many of us have spent too much time working at day jobs unrelated to our Linux work. This is too hard, and I hope to make things easier for us all. If you like this business model of selling GPL'd component software with related support services, but you write software not related to this file system, I encourage you to form a component supplier company also. Opportunities may arise for us to cooperate in our marketing, and I will be happy to do so.

= References =

* G.M. Adel'son-Vel'skii and E.M.
Landis, [http://en.scientificcommons.org/19884302 An algorithm for the organization of information], Soviet Math. Doklady 3, 1259-1262, 1962. This paper on AVL trees can be thought of as the founding paper of the field of storing data in trees. Those not conversant in Russian will want to read the [Lewis and Denenberg] treatment of AVL trees in its place. [Wood] contains a modern treatment of trees.
* [Apple] Apple Computer Inc, [http://books.google.com/books?as_isbn=0201177323 Inside Macintosh, Files], Addison-Wesley, 1992. Employs balanced trees for filenames; it was an interesting file system architecture for its time in a number of ways, but its problems with internal fragmentation have become more severe as disk drives have grown larger, and the code has not received sufficient further development.
* [Bach] Maurice J. Bach, [http://portal.acm.org/citation.cfm?id=8570 The Design of the Unix Operating System], 1986, Prentice-Hall Software Series, Englewood Cliffs, NJ. Superbly written but sadly dated; contains detailed descriptions of the file system routines and interfaces in a manner especially useful for those trying to implement a Unix compatible file system. See [Vahalia].
* [BLOB] R. Haskin, Raymond A. Lorie: [http://portal.acm.org/citation.cfm?id=582353.582390 On Extending the Functions of a Relational Database System]. SIGMOD Conference (body of paper not on web) 1982: 207-212. See the Drops section for a discussion of how this approach makes the tree less balanced, and the effect that has on performance.
* [Chen] Chen, P.M., Patterson, David A., [http://www.eecs.berkeley.edu/Pubs/TechRpts/1992/6129.html A New Approach to I/O Performance Evaluation] -- Self-Scaling I/O Benchmarks, Predicted I/O Performance, 1993 ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems, also available on Chen's web page.
* [C-FFS] Ganger, Gregory R., Kaashoek, M.
Frans, [http://www.ece.cmu.edu/~ganger/papers/cffs.html Embedded Inodes and Explicit Grouping: Exploiting Disk Bandwidth for Small Files]. A very well written paper focused on 1-10k file size issues; they use some similar notions (most especially their concept of grouping compared to my packing localities). Note that they focus on the 1-10k file size range, and not the sub-1k range. The 1-10k range is the weak point in reiserfs performance.
* [ext2fs] Rémy Card, [http://e2fsprogs.sourceforge.net/ext2intro.html Design and Implementation of the Second Extended Filesystem]. Extensive information; source code is available. When you consider how small this file system is (~6000 lines), its effectiveness becomes all the more remarkable.
* [FFS] M.K. McKusick, W.N. Joy, S.J. Leffler, and R.S. Fabry. [http://www.eecs.berkeley.edu/~brewer/cs262/FFS.pdf A fast file system for UNIX]. ACM Transactions on Computer Systems, 2(3):181--197, August 1984. Describes the implementation of a file system which employs parent directory location knowledge in determining file layout. It uses large blocks for all but the tail of files to improve I/O performance, and uses small blocks called fragments for the tails so as to reduce the cost due to internal fragmentation. Numerous other improvements are also made to what was once the state of the art. FFS remains the architectural foundation for many current block allocation file systems, and was later bundled with the standard Unix releases. Note that unrequested serialization and the use of fragments place it at a performance disadvantage to ext2fs, though whether ext2fs is thereby made less reliable is a matter of dispute that I take no position on (reiserfs uses preserve lists; forgive my egotism in thinking that it is enough work for me to ensure that reiserfs solves the recovery problem, and to perhaps suggest that ext2fs would benefit from the use of preserve lists when shrinking directories).
* [Ganger] Gregory R. Ganger, Yale N.
Patt, [http://pages.cs.wisc.edu/~remzi/Classes/838/Fall2001/Papers/softupdates-osdi94.pdf Metadata Update Performance in File Systems]
* [Gifford] [http://portal.acm.org/citation.cfm?id=121133.121138 Semantic file systems]. Describes a file system enriched to have more than hierarchical semantics; he shares many goals with this author, forgive me for thinking his work worthwhile. If I had to suggest one improvement in a sentence, I would say his semantic algebra needs closure.
* [Hitz] Dave Hitz, [http://media.netapp.com/documents/wp_3002.pdf File System Design for an NFS File Server Appliance]. A rather well designed file system optimized for NFS and RAID in combination. Note that RAID increases the merits of write-optimization in block layout algorithms.
* [Holton and Das] Holton, Mike, Das, Raj, [http://www.uoks.uj.edu.pl/resources/flugor/IRIX/xfs-whitepaper.html XFS: A Next Generation Journalled 64-Bit Filesystem With Guaranteed Rate I/O]: "The XFS space manager and namespace manager use sophisticated B-Tree indexing technology to represent file location information contained inside directory files and to represent the structure of the files themselves (location of information in a file)." Note that it is still a block (extent) allocation based file system; no attempt is made to store the actual file contents in the tree. It is targeted at the needs of the other end of the file size usage spectrum from reiserfs, and is an excellent design for that purpose (and I would concede that reiserfs 1.0 is not suitable for their real-time large I/O market). SGI has also traditionally been a leader in resisting the use of unrequested serialization of I/O. Unfortunately, the paper is a bit vague on details, and source code is not freely available.
* [Howard] [http://www.cs.cmu.edu/~satya/docdir/s11.pdf Scale and Performance in a Distributed File System], Howard, J.H., Kazar, M.L., Menees, S.G., Nichols, D.A., Satyanarayanan, M., Sidebotham, R.N., West, M.J., ACM Transactions on Computer Systems, 6(1), February 1988. A classic benchmark; it was too CPU bound for both ext2fs and reiserfs.
* [Knuth] Knuth, D.E., [http://www-cs-faculty.stanford.edu/~knuth/taocp.html The Art of Computer Programming], Vol. 3 (Sorting and Searching), Addison-Wesley, Reading, MA, 1973. The earliest reference discussing trees storing records of varying length.
* [LADDIS] Wittle, Mark, and Bruce, Keith, [http://www.spec.org/sfs93/doc/WhitePaper.ps LADDIS: The Next Generation in NFS File Server Benchmarking], Proceedings of the Summer 1993 USENIX Conference, July 1993, pp. 111-128.
* [Lewis and Denenberg] Lewis, Harry R., Denenberg, Larry, [http://portal.acm.org/citation.cfm?id=548586 Data Structures & Their Algorithms], HarperCollins Publishers, NY, NY, 1991. An algorithms textbook suitable for readers wishing to learn about balanced trees and their AVL predecessors.
* [McCreight] McCreight, E.M., [http://portal.acm.org/citation.cfm?id=359839 Pagination of B*-trees with variable length records], Commun. ACM 20 (9), 670-674, 1977. Describes algorithms for trees with variable length records.
* [McVoy and Kleiman] [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.55.2970&rep=rep1&type=pdf Extent-like Performance from a UNIX File System]: the implementation of write-clustering for Sun's UFS.
* [OLE] [http://portal.acm.org/citation.cfm?id=207534 Inside OLE] by Kraig Brockschmidt; discusses Structured Storage.
* [Ousterhout] J.K. Ousterhout, H. Da Costa, D. Harrison, J.A. Kunze, M.D. Kupfer, and J.G. Thompson. [http://portal.acm.org/citation.cfm?id=323627.323631 A trace-driven analysis of the UNIX 4.2BSD file system].
In Proceedings of the 10th Symposium on Operating Systems Principles, pages 15-24, Orcas Island, WA, December 1985.
* [NTFS] [http://portal.acm.org/citation.cfm?id=527752 Inside the Windows NT File System], written by Helen Custer; NTFS was architected by Tom Miller with contributions by Gary Kimura, Brian Andrew, and David Goebel, Microsoft Press, 1994. An easy to read little book. They fundamentally disagree with me on adding serialization of I/O not requested by the application programmer, and I note that the performance penalty they pay for their decision is high, especially compared with ext2fs. Their FS design is perhaps optimal for floppies and other hardware-eject media beyond OS control. A less serialized, higher performance log-structured architecture is described in [Rosenblum and Ousterhout]. That said, Microsoft is to be commended for recognizing the importance of attempting to optimize for small files, and leading the OS designer effort to integrate small objects into the file name space. This book is notable for not referencing the work of persons not working for Microsoft, or providing any form of proper attribution to previous authors such as [Rosenblum and Ousterhout].
* [Peacock] Dr. J. Kent Peacock, "The CounterPoint Fast File System", Proceedings of the Usenix Conference, Winter 1988.
* [Pike] Rob Pike and Peter Weinberger, [http://pdos.csail.mit.edu/~rsc/pike85hideous.pdf The Hideous Name], USENIX Summer 1985 Conference Proceedings, pp. 563, Portland, Oregon, 1985. Short, informal, and drives home why inconsistent naming schemes in an OS are detrimental. See also Pike's discussion of naming in Plan 9: [http://plan9.bell-labs.com/sys/doc/names.html The Use of Name Spaces in Plan 9].
* [Rosenblum and Ousterhout] [http://www.eecs.berkeley.edu/~brewer/cs262/LFS.pdf The Design and Implementation of a Log-Structured File System], Mendel Rosenblum and John K.
Ousterhout, February 1992, ACM Transactions on Computer Systems. This paper was quite influential in a number of ways on many modern file systems, and the notion of using a cleaner may be applied to a future release of reiserfs. There is an interesting ongoing debate over the relative merits of FFS vs. LFS architectures; the interested reader may peruse [http://www.eecs.harvard.edu/~margo/papers/icde93/ Transaction Support in a Log-Structured File System] and the arguments by Margo Seltzer it links to.
* [Snyder] [http://www.solarisinternals.com/si/reading/tmpfs.pdf tmpfs: A Virtual Memory File System]. Discusses a file system built to use swap space and intended for temporary files; due to a complete lack of disk synchronization it offers extremely high performance.
* [Vahalia] Uresh Vahalia, [http://books.google.com/books?as_isbn=0131019082 UNIX Internals: The New Frontiers]
[[category:ReiserFS]] [[category:Formatting-fixes-needed]]

{{wayback|http://www.namesys.com/X0reiserfs.html|2006-11-13}}

= Three reasons why ReiserFS is great for you =
Last Update: 2002, Hans Reiser

Three reasons why ReiserFS is great for you:
# ReiserFS has fast journaling, which means that you don't spend your life waiting for fsck every time your laptop battery dies, or the UPS for your mission critical server gets its batteries disconnected accidentally by the UPS company's service crew, or your kernel was not as ready for prime time as you hoped, or the silly thing decides you mounted it too many times today.
# ReiserFS is based on fast balanced trees. Balanced trees are more robust in their performance, and are a more sophisticated algorithmic foundation for a file system. When we started our project, there was a consensus in the industry that balanced trees were too slow for file system usage patterns.
We proved that if you just do them right they are better--take a look at the benchmarks. We have fewer worst case performance scenarios than other file systems and generally better overall performance. If you put 100,000 files in one directory, we think it's fine; many other file systems try to tell you that you are wrong to want to do it.
# ReiserFS is more space efficient. If you write 100 byte files, we pack many of them into one block. Other file systems put each of them into their own block. We don't have fixed space allocation for inodes. That saves 6% of your disk.
Ok, it's time to fess up. The interesting stuff is still in the future. Because they are nifty, we are going to add database and hypertext like features into the file system. Only by using balanced trees, with their effective handling of small files (database small fields, hypertext keywords), as our technical foundation can we hope to do this. That was our real motivation. As for performance, we may already be slightly better than the traditional file systems (and substantially better than the journaling ones). But they have been tweaking for decades, while we have just gotten started. This means that over the next few years we are going to improve faster than they are. Speaking more technically: ReiserFS is a file system using a plug-in based object oriented variant on classical balanced tree algorithms. The results when compared to the ext2fs conventional block allocation based file system, running under the same operating system and employing the same buffering code, suggest that these algorithms are overall more efficient and every passing month are becoming yet more so. Loosely speaking, every month we find another performance cranny that needs work; we fix it. And every month we find some way of improving our overall general usage performance.
The improvement in small file space and time performance suggests that we may now revisit a common OS design assumption that one should aggregate small objects using layers above the file system layer. Being more effective at small files does not make us less effective for other files. This is truly a general purpose FS. Our overall traditional FS usage performance is high enough to establish that. ReiserFS has a commitment to opening up the FS design to contributions; we are now adding plug-ins so that you can create your own types of directories and files.

= Introduction =

The author is one of many OS researchers who are attempting to unify the name spaces in the operating system in varying ways (e.g. [http://plan9.bell-labs.com/sys/doc/names.html Pike, The Use of Name Spaces in Plan 9]). None of us are well funded compared with the size of the task, and I am far from an exception to this rule. The natural consequence is that we each have attacked one small aspect of the task. My contribution is in incorporating small objects into the file system name space effectively. This implementation offers value to the average Linux user, in that it offers generally good performance compared to the current Linux file system known as ext2fs. It also saves space to an extent that is important for some applications, and convenient for most. It does extremely well for large directories, and has a variety of minor advantages. Since ext2fs is very similar to FFS and UFS in performance, the implementation also offers potential value to commercial OS vendors who desire greater than ext2fs performance without directory size issues, and who appreciate the value of a better foundation for integrating name spaces throughout the OS.

= Why Is There A Move Among Some OS Designers Towards Unifying Name Spaces? =

An operating system is composed of components that access other components through interfaces.
Operating systems are complex enough that, like national economies, the architect cannot centrally plan the interactions of the components that they are composed of. The architect can provide a structural framework that has a marked impact on the efficiency and utility of those interactions. Economists have developed principles that govern large economic systems. Are there system principles that we might use to start a discussion of the ways increasing component interactivity via naming system design impacts the total utility of an operating system? I propose these:
* If one increases the number of other components that a particular component can interact with, one increases its expressive power and thereby its utility.
* One can increase the number of other components that a particular component can interact with either by increasing the number of interfaces it has, or by increasing the number of components that are accessible by its current interfaces.
* The cost of component interfaces dominates software design cost, much as the cost of wires dominates circuit design cost.
* Total system utility tends to be proportional not to the number of components, but to the number of possible component interactions.
It is not simply the number of components that one has that determines an OS's expressive power, it is the number of opportunities to use them that determines it. The number of these opportunities is proportional to the number of possible combinations of them, and the number of possible combinations of them is determined by their connectedness. Component connectedness in OS design is determined by name space design, to much the same extent that buses determine it in circuit design. Allow me to illustrate the impact of these principles with the use of an imaginary example. Suppose two imaginary OS vendors with equally talented programmers hire two different OS architects.
Suppose one of the architects centers the OS design around a single name space design that allows all of the components to access all other components via a single interface (assume this is possible, it is a theoretical example). Suppose the other allows the ten different design groups in the company that are developing components to create their own ten name spaces. Suppose that the unified name space OS architect has half of the resources of the fragmented name space OS architect and creates half as many components. While the number of components is half as large, the number of connections is (1/2)<sup>2</sup> / ((1/10)<sup>2</sup> × 10) = 2.5 times larger. If you accept my hypothesis that utility is more proportional to connections than components, then the unified operating system with half the development cost will still offer more expressive utility. That is a powerful motivation. To return briefly to the long ago researched principles governing another member of the class of large systems, the economies of nations, it is perhaps interesting to note that Adam Smith in [http://en.wikisource.org/wiki/The_Wealth_of_Nations "The Wealth of Nations"] engaged in substantial study of the link between the extent of interconnectedness and the development of civilization, where the extent of interconnectedness was determined by waterways, etc. The link he found for economic systems was no less crucial than what is being suggested here for the effect of component interconnectedness on the total utility of software systems. I suggest that I am merely generalizing a long established principle from another field of science, namely that total utility in large systems with components that interact to generate utility is determined by the extent of their interconnection. There are many exceptions to these principles: not all chips on a motherboard sit on the bus, and analogous considerations apply to both OS design and the economies of nations.
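The arithmetic in this example can be checked with a short sketch. This is a hypothetical illustration only; the counting rule (unordered pairs of components that share a name space) is my assumption about the model the text intends, and the component counts are invented:

```python
def connections(components_per_space: int, num_spaces: int) -> int:
    """Possible pairwise interactions, counting only pairs that
    share a name space (unordered, self-pairs excluded)."""
    n = components_per_space
    return num_spaces * n * (n - 1) // 2

# Fragmented architect: 1000 components split across 10 name spaces.
fragmented = connections(100, 10)
# Unified architect: half the components, all in one shared name space.
unified = connections(500, 1)

print(unified / fragmented)  # close to the 2.5 ratio in the text
```

With the exact n<sup>2</sup> proportionality used in the text the ratio is exactly 2.5; counting unordered pairs n(n-1)/2 gives a value slightly above it, which is why the text's approximation is reasonable for large component counts.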
I hope the reader will accept that space considerations make it appropriate to gloss over these, and will consider the central point that under some circumstances unifying name spaces in a design can dramatically improve the utility of an OS. That can be an enormous motivation, and it has moved a number of OS researchers in their work (e.g. [http://plan9.bell-labs.com/sys/doc/names.html "The Use of Name Spaces in Plan 9", Rob Pike] and [http://pdos.csail.mit.edu/~rsc/pike85hideous.pdf "The Hideous Name", Rob Pike and P.J. Weinberger]). Unfortunately, it is not a small technical effort to combine name spaces. To combine 10 name spaces requires, if not the effort to create 10 name spaces, perhaps an effort equivalent to creating 5 of the name spaces. Usually each of the name spaces has particular performance and semantic power requirements that require enhancing the unified name space, and it usually requires technical innovation to combine the advantages of each of the separated name spaces into a unified name space. I would characterize none of the research groups currently approaching this unification problem as having funding equivalent to what went into creating 5 of the name spaces they would like to unify, and we are certainly no exception. For this reason I have picked one particular aspect of this larger problem for our focus: allowing small objects to effectively share the same file system interface that large objects use currently. As operating systems increase the number of their components, the higher development cost of a file system able to handle small files becomes more worth the multiplicative effect it has on OS utility, as well as its reduction of OS component interface cost.

= Should File Boundaries Be Block Aligned? =

Making file boundaries block aligned has a number of effects:
* it minimizes the number of blocks a file is spread across (which is especially beneficial for multiple block files when locality of reference across files is poor),
* it wastes disk and buffer space in storing every less than fully packed block,
* it wastes I/O bandwidth with every access to a less than fully packed block when locality of reference is present,
* it increases the average number of block fetches required to access every file in a directory,
* and it results in simpler code.
The simpler code of block aligning file systems follows from not needing to create a layering to distinguish the units of the disk controller and buffering algorithms from the units of space allocation, and from not needing to optimize the packing of nodes as is done in balanced tree algorithms. For readers who have not been involved in balanced tree implementations, algorithms of this class are notorious for being much more work to implement than one would expect from their description. Sadly, they also appear to offer the highest performance solution for small files, once I remove certain simplifications from their implementation and add certain optimizations common to file system designs. I regret that code complexity (30k lines) is a major disadvantage of the approach compared to the 6k lines of the ext2fs approach. I started our analysis of the problem with an assumption that I needed to aggregate small files in some way, and that the question was: which solution was optimal? The simplest solution was to aggregate all small files in a directory together into either a file or the directory. But any aggregation into a file or directory wastes part of the last block in the aggregation. What does one do if there are only a few small files in a directory, aggregate them into the parent of the directory?
What if there are only a few small files in a directory at first, and then there are many small files? How do I decide what level to aggregate them at, and when to take them back from a parent of a directory and store them directly in the directory? As we did our analysis of these questions we realized that this problem was closely related to the balancing of nodes in a balanced tree. The balanced tree approach, by using an ordering of files which are then dynamically aggregated into nodes at a lower level, rather than a static aggregation or grouping, avoids this set of questions. In my approach I store both files and filenames in a balanced tree, with small files, directory entries, inodes, and the tail ends of large files all being more efficiently packed as a result of relaxing the requirements of block alignment, and eliminating the use of a fixed space allocation for inodes. I have a sophisticated and flexible means for arranging the aggregation of files for maximal locality of reference, through defining the ordering of items in the tree. The body of large files is stored in unformatted nodes that are attached to the tree but isolated from the effects of possible shifting by the balancing algorithms. Approaches such as [Apple] and [Holton and Das] have stored filenames but not files in balanced trees. None of the file systems C-FFS, NTFS, or XFS aggregate files; all of them block align files, though all of those also do some variation on storing small files in the statically allocated block address fields of inodes if they are small enough to fit there. [C-FFS] has published an excellent discussion of both their approach and why small files rob a conventional file system of performance more in proportion to the number of small files than the number of bytes consumed by small files. However, I must note that their notion of what constitutes small is different from ours by one or two orders of magnitude.
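The space argument for packing small files into shared nodes can be made concrete with a toy calculation. This is a hypothetical sketch: the 4k block size and the perfect packing are simplifying assumptions of mine, and real reiserfs nodes also spend space on keys and item headers:

```python
BLOCK = 4096  # assumed block/node size for illustration

def blocks_if_aligned(sizes):
    # Block-aligned layout: every file starts on a block boundary,
    # so each file occupies at least one block of its own.
    return sum(max(1, -(-s // BLOCK)) for s in sizes)

def blocks_if_packed(sizes):
    # Idealized tree packing: file bodies share nodes, so only the
    # total byte count matters (item overhead ignored here).
    return -(-sum(sizes) // BLOCK)

small = [100] * 1000  # a thousand 100-byte files
print(blocks_if_aligned(small))  # 1000 blocks
print(blocks_if_packed(small))   # 25 blocks
```

The forty-fold gap in this idealized case is what lies behind the order-of-magnitude claim for small files; for files near or above the block size the two layouts converge.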
Their use of an exokernel is simply an excellent approach for operating systems that have that as an available option. Semantics (files), packing (blocks/nodes), caching (read-ahead sizes, etc.), and the hardware interfaces of disk (sectors) and paging (pages) all have different granularity issues associated with them: a central point of our approach is that the optimal granularity of these often differs, and abstracting these into separate layers in which the granularity of one layer does not unintentionally impact other layers can improve space/time performance. Reiserfs innovates in that its semantic layer often conveys to the other layers an ungranulated ordering rather than one granulated by file boundaries. The reader is encouraged to note the areas in which reiserfs needs to go farther in doing so while reading the algorithms.

= Balanced Trees and Large File I/O =

There has long been an odd informal consensus that balanced trees are too slow for use in storing large files, perhaps originating in the performance of databases that have attempted to emulate file systems using balanced tree algorithms that were not originally architected for file system access patterns or their looser serialization requirements. It is hopefully easy for the reader to understand that storing many small files and tail ends of files in a single node where they can all be fetched in one I/O leads directly to higher performance. Unfortunately, it is quite complex to understand the interplay between I/O efficiency and block size for larger files, and space does not allow a systematic review of traditional approaches.
The reader is referred to [FFS], [Peacock], [McVoy], [Holton and Das], [Bach], [OLE], and [NTFS] for treatments of the topic, and discussions of various means of 1) reducing the effect of block size on CPU efficiency, 2) eliminating the need for inserting rotational delay between successive blocks, 3) placing small files into either inodes or directories, and 4) performing read-ahead. More commentary on these is in the annotated bibliography. Reiserfs has the following architectural weaknesses that stem directly from the overhead of repacking to save space and increase block size: 1) when the tail (files < 4k are all tail) of a file grows large enough to occupy an entire node by itself it is removed from the formatted node(s) it resides in, and it is converted into an unformatted node ([FFS] pays a similar conversion cost for fragments), 2) a tail that is smaller than one node may be spread across two nodes which requires more I/O to read if locality of reference is poor, 3) aggregating multiple tails into one node introduces separation of file body from tail, which reduces read performance ([FFS] has a similar problem, and for reiserfs files near the node in size the effect can be significant), 4) when you add one byte to a file or tail that is not the last item in a formatted node, then on average half of the whole node is shifted in memory. If any of your applications perform I/O in such a way that they generate many small unbuffered writes, reiserfs will make you pay a higher price for not being able to buffer the I/O. Most applications that create substantial file system load employ effective I/O buffering, often simply as a result of using the I/O functions in the standard C libraries. By avoiding accesses in small blocks/extents reiserfs improves I/O efficiency. 
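Weakness (4) above, the memory cost of inserting into a packed node, can be illustrated with a minimal sketch. The layout is hypothetical (a real reiserfs formatted node also holds keys and item headers, and shifts bytes in place rather than rebuilding the buffer); the point is only that everything after the insertion offset must move:

```python
def insert_bytes(node: bytearray, offset: int, data: bytes) -> bytearray:
    # Inserting into the middle of a contiguously packed node moves
    # every byte after the insertion point; averaged over random
    # offsets that is about half the node per small write.
    return node[:offset] + data + node[offset:]

node = bytearray(b"AAAABBBBCCCC")
node = insert_bytes(node, 4, b"x")  # shifts the 8 trailing bytes
```

This is the memory-bandwidth price of packing: a block-aligned file system appends to the last block of a file without disturbing its neighbors, while a packing file system pays a shift whenever the touched item is not last in its node.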
Extent based file systems such as VxFS, and write-clustering systems such as ext2fs, are not so effective in applying these techniques that they choose to use 512-byte blocks rather than 1k blocks as their defaults. Ext2fs reports a 20% speedup when 4k rather than 1k blocks are used, but the authors of ext2fs advise the use of 1k blocks to avoid wasting space. There are a number of worthwhile large file optimizations that have not been added to either ext2fs or reiserfs, and both file systems are somewhat primitive in this regard, reiserfs being the more primitive of the two. Large files simply were not my research focus, and it being a small research project I did not implement the many well known techniques for enhancing large file I/O. The buffering algorithms are probably more crucial than any other component in large file I/O, and partly out of a desire for a fair comparison of the approaches I have not modified these. I have added no significant optimizations for large files, beyond increasing the block size, that are not found in ext2fs. Except for the size of the blocks, there is not a large inherent difference between: 1) the cost of adding a pointer to an unformatted node to my tree plus writing the node, and 2) adding an address field to an inode plus writing the block. It is likely that except for block size the primary determinants of high performance large file access are orthogonal to the decision of whether to use balanced tree algorithms for small and medium sized files. For large files we get some advantage from not having our tree being more balanced than the tree formed by an inode which points to a triple indirect block. We haven't an easy method for measuring the performance gain from that though. There is performance overhead due to the memory bandwidth cost of balancing nodes for small files. We think it is worth it though. 
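The space side of the block-size tradeoff discussed here (the ext2fs authors' preference for 1k blocks over the ~20% faster 4k blocks) comes down to internal fragmentation in each file's last block. A numeric sketch, with an invented file-size sample since the real distribution is workload-dependent:

```python
def wasted_bytes(sizes, block):
    # Internal fragmentation: unused space in each file's last block.
    return sum((block - s % block) % block for s in sizes)

sizes = [500, 1500, 3000, 12000, 40000]  # hypothetical sample
for block in (1024, 4096):
    print(block, wasted_bytes(sizes, block))
```

Larger blocks waste more space per file but cost fewer I/Os per byte; packing tails into shared nodes is reiserfs's way of taking the larger block size without the corresponding waste.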
= Serialization and Consistency =

The issues of ensuring recoverability with minimal serialization and data displacement necessarily dominate high performance design. Let's define the two extremes in serialization so that the reason for this can be clear. Consider the relative speed of a set of I/Os in which every block request in the set is fed to the elevator algorithms of the kernel and the disk drive firmware fully serially, each request awaiting the completion of the previous request. Now consider the other extreme, in which all block requests are fed to the elevator algorithms together, so that they may all be sorted and performed in close to their sorted order (disk drive firmwares don't use a pure elevator algorithm). The unserialized extreme may be more than an order of magnitude faster, due to the cost of rotations and seeks. Unnecessarily serializing I/O prevents the elevator algorithm from doing its job of placing all of the I/Os in their layout sequence rather than their chronological sequence. Most of high performance design centers around making I/Os in the order they are laid out on disk, and laying out blocks on disk in the order that the I/Os will want to be issued.

Snyder discusses a file system that obtains high performance from a complete lack of disk synchronization, but is only suitable for temporary files that don't need to survive reboot. I think its known value to Solaris users indicates that the optimal buffering policy varies from file to file. Ganger discusses methods for using ordering of writes rather than serialization for ensuring conventional file system meta-data integrity; [McVoy] previously suggested but did not implement ordering of buffer writes. Ext2fs is fast in substantial part due to avoiding synchronous writes of metadata, and I have much personal experience with it that leads me to prefer compiles that are fast.
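The effect of batching requests so the elevator can sort them can be illustrated with a small sketch. This is a hypothetical model, not kernel code: `elevator_sort` approximates one sweep of an elevator by sorting a batch into ascending block order, and `total_seek_distance` is a crude cost model counting head travel when requests are serviced in a given order.

```c
#include <stdlib.h>

/* Hypothetical illustration (not reiserfs or kernel code): a batch of
 * block requests handed to the elevator together can be sorted into
 * layout order, while serially issued requests must be serviced in
 * arrival order, paying a seek per request. */
struct blk_request {
    unsigned long block;   /* target block number on disk */
};

static int cmp_by_block(const void *a, const void *b)
{
    unsigned long ba = ((const struct blk_request *)a)->block;
    unsigned long bb = ((const struct blk_request *)b)->block;
    return (ba > bb) - (ba < bb);
}

/* Sort a whole batch into ascending block order, approximating one
 * sweep of the elevator across the disk. */
void elevator_sort(struct blk_request *reqs, size_t n)
{
    qsort(reqs, n, sizeof(*reqs), cmp_by_block);
}

/* Crude seek-cost model: total distance the head travels (in blocks,
 * starting from block 0) when requests are serviced in array order. */
unsigned long total_seek_distance(const struct blk_request *reqs, size_t n)
{
    unsigned long dist = 0, pos = 0;
    for (size_t i = 0; i < n; i++) {
        dist += reqs[i].block > pos ? reqs[i].block - pos
                                    : pos - reqs[i].block;
        pos = reqs[i].block;
    }
    return dist;
}
```

For a batch like {900, 10, 500, 20}, servicing in arrival order travels far more than servicing the sorted order; the real gap is larger still once rotational latency is counted.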
[ I would like to see it adopt a policy that all dirty buffers for files not flagged as temporary are queued for writing, and that the existence of a dirty buffer means that the disk is busy. This will require replacing buffer I/O locking with copy-on-write, but an idle disk is such a terrible thing to waste. :-) ]

[NTFS] by default adds unnecessary serialization to an extent that even older file systems such as [FFS] do not, and its performance characteristics reflect that. In fairness, it should be said that this is the superior approach for most removable media without software control of ejection (e.g. IBM PC floppies). Reiserfs employs a new scheme called preserve lists for ensuring recoverability, which avoids overwriting old meta-data by writing new meta-data nearby rather than over the old.

= Why Aggregate Small Objects at the File System Level? =

There has long been a tradition of file system developers deciding that effective handling of small files is not significant to performance, and of application programmers caring enough about performance to not store small files as separate entities in the file system. To store small objects one may either make the file system efficient for the task, or sidestep the problem by aggregating small objects in a layer above the file system. Sidestepping the problem has three disadvantages: utility, code complexity, and performance.

Utility and Code Complexity: Allowing OS designers to effectively use a single namespace with a single interface for both large and small objects decreases coding cost and increases the expressive power of components throughout the OS. I feel reiserfs shows the effects of a larger development investment focused on a simpler interface when compared with many solutions for this currently available in the object oriented toolkit community, such as the Structured Storage available in Microsoft's [OLE].
By simpler I mean that I added nothing to the file system API to distinguish large and small objects, and I leave it to the directory semantics and archiving programs to aggregate objects. Multiple layers cost more to implement, cost more to code the interfaces for utilizing, and provide less flexibility.

Performance: It is most commonly the case that when one layers one file system on top of another the performance is substantially reduced, and Structured Storage is not an exception to this general rule. Reiserfs, which does not attempt to delegate the small object problem to a layer above, avoids this performance loss. I have heard it suggested by some that this layering avoids the performance loss caused by syncing on file close, as many file systems do. I suggest that this is adding an error to an error rather than fixing it.

Let me make clear that I believe those who write such layers above the file system do not do so out of stupidity. I know of at least one company at which a solution that layers small object storage above the file system exists because the file system developers refused to listen to the non-file-system group's description of its needs, and the file system group had to be sidestepped in generating the solution. Current file systems are fairly well designed for the purposes that their users currently use them for: my goal is to change file size usage patterns. The author remembers arguments that once showed clearly, based on then-current usage statistics, that there was no substantial market need for disk drives larger than 10MB. While [C-FFS] points out that 80% of file accesses are to files below 10k, I do not believe it reasonable to expect usage measurements of file systems for which small files are inappropriate to show that small files are frequently used. Application programmers are smarter than that.
Currently 80% of file accesses are to the first order of magnitude in file size for which it is sensible to store the object in the file system. I regret that one can only speculate as to whether, once file systems become effective for small files and database tasks, usage patterns will change to 80% of file accesses being to files less than 100 bytes. What I can do is show, via the 80/20 Banded File Set Benchmark presented later, that in such circumstances small file performance potentially dominates total system performance. In summary, the on-going reinvention of incompatible object aggregation techniques above the file system layer is expensive, less expressive, less integrated, slower, and less efficient in its storage than incorporating balanced tree algorithms into the file system.

= Tree Definitions =

Balanced trees are used in databases, and more generally, wherever a programmer needs to search and store to non-random memory by a key, and has the time to code it this way. The usual evolution for programmers is to first think that hashing will be simpler and more efficient, and then realize only after getting into the sordid details of it that the combination of space efficiency, minimizing disk accesses, and the feasibility of caching the top part of the tree makes the tree approach more effective. It is the usual thing to first try to do hashing, and then by the time the details are worked out, to have a balanced tree. The cost of effectively handling bucket overflow just isn't less than the cost of balancing, unless the buckets are always all in RAM. Hashing is often a good solution when there is no non-random memory involved, such as when hashing a cache. The Linux dcache code uses hashing for accessing a cache of in-memory directory entries. Sometimes one uses partial or full hashing of keys within a balanced tree.
If you do full hashing within a tree, and you cache the top part of that tree, you have something rather similar to extensible hashing, except that it is more flexible and efficient. Sometimes programmers code using unbalanced trees; most filesystems do essentially that. Balanced trees generally do a better job of minimizing the average number of disk accesses. There is literature establishing that balanced trees are optimal for the worst case when there is no caching of the tree. This is rather pointless literature, as the average case when cached is what is important, and I am afraid that the existing literature proves that which is feasible to prove rather than that which is relevant. That said, practitioners know from experience that making the tree less balanced leads to more I/Os. Discussions of the exceptions to this are rather interesting, but not for here....

I regret that I must assume that the reader is familiar with basic balanced tree algorithms [Wood], [Lewis and Denenberg], [Knuth], [McCreight]. No attempt will be made to survey tree design here, since balanced trees are one of the most researched and complex topics in algorithm theory and require treatment at length. I must compound this discourtesy with a concise set of definitions that sorely lack accompanying diagrams, my apologies. Finally, I'll truly annoy the reader by saying that the header files contain nice ASCII art, and if you want the full definition of the structures, the source is the place.

Classically, balanced trees are designed with the set of keys assumed to be defined by the application, and the purpose of the tree design is to optimize searching through those keys. In my approach the purpose of the tree is to optimize the reference locality and space efficient packing of objects, and the keys are defined as best optimizes the algorithm for that.
Keys are used in place of inode numbers in the file system, thereby substituting a mapping of keys to node location (the internal nodes) for a mapping of inode number to file location. Keys are longer than inode numbers, but one needs to cache fewer of them than one would need to cache inode numbers when more than one file is stored in a node.

In my tree, I still require that a filename be resolved one component at a time. It is an interesting topic for future research whether this is necessary or optimal. This is a more complex issue than a casual reader might realize: directory-at-a-time lookup accomplishes a form of compression, makes mounting other name spaces and file system extensions simpler, makes security simpler, and makes future enhanced semantics simpler. Since small files typically lead to large directories, it is fortuitous that as a natural consequence of our use of tree algorithms, our directory mechanisms are much more effective for very large directories than those of most other file systems (notable exceptions include [Holton and Das]).

The tree has three node types: internal nodes, formatted nodes, and unformatted nodes. The contents of internal and formatted nodes are sorted in the order of their keys. (Unformatted nodes contain no keys.) Internal nodes consist of pointers to sub-trees separated by their delimiting keys. The key that precedes a pointer to a sub-tree is a duplicate of the first key in the first formatted node of that sub-tree. Internal nodes exist solely to allow determining which formatted node contains the item corresponding to a key. ReiserFS starts at the root node, examines its contents, and from them determines which subtree contains the item corresponding to the desired key. From the root node reiserfs descends into the tree, branching at each node, until it reaches the formatted node containing the desired item.
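The branching step of this descent can be sketched as follows. This is a simplified illustration with integer keys rather than reiserfs's four-part keys, and the names are mine, not the implementation's: at each internal node, the child to descend into is the last one whose delimiting key does not exceed the search key.

```c
#include <stddef.h>

/* Simplified sketch of the descent described above: an internal node
 * holds sorted delimiting keys and one more child pointer than keys;
 * the key preceding a child duplicates the first key of that child's
 * subtree. Integer keys stand in for the real 4-part keys. */
struct internal_node {
    size_t nr_keys;
    const unsigned long *keys;     /* delimiting keys, sorted ascending */
    const void *const *children;   /* nr_keys + 1 subtree pointers      */
};

/* Return the index of the child subtree that may contain `key`: the
 * number of delimiting keys <= key (child 0 if key is smaller than
 * every delimiting key). Binary search keeps the cost logarithmic in
 * the node's fanout. */
size_t which_child(const struct internal_node *node, unsigned long key)
{
    size_t lo = 0, hi = node->nr_keys;   /* answer lies in [lo, hi] */
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (node->keys[mid] <= key)
            lo = mid + 1;                /* key is at or right of mid's child */
        else
            hi = mid;
    }
    return lo;   /* index into node->children */
}
```

The full lookup simply repeats this step from the root until it reaches a formatted node, then searches that node's sorted item headers for the item itself.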
The first (bottom) level of the tree consists of unformatted nodes, the second level consists of formatted nodes, and all levels above consist of internal nodes. The highest level contains the root node. The number of levels is increased as needed by adding a new root node at the top of the tree. All paths from the root of the tree to the formatted leaves are equal in length, and all paths to the unformatted leaves are also equal in length and one node longer than the paths to the formatted leaves. This equality in path length, and the high fanout it provides, are vital to high performance, and in the Drops section I will describe how the lengthening of the path that occurred as a result of introducing the [BLOB] approach (the use of indirect items and unformatted nodes) proved a measurable mistake.

Formatted nodes consist of items. Items have four types: direct items, indirect items, directory items, and stat data items. All items contain a key which is unique to the item. This key is used to sort, and find, the item. Direct items contain the tails of files; a tail is the last part of a file (the last file_size modulo block_size bytes of the file). Indirect items consist of pointers to unformatted nodes. All but the tail of a file is contained in its unformatted nodes. Directory items contain the key of the first directory entry in the item, followed by a number of directory entries. Depending on the configuration of reiserfs, stat data may be stored as a separate item, or it may be embedded in a directory entry. We are still benchmarking to determine which way is best. A file consists of a set of indirect items followed by a set of up to two direct items, with two direct items representing the case when a tail is split across two nodes.
If a tail is larger than the maximum size of a file that can fit into a formatted node but is smaller than the unformatted node size (4k), then it is stored in an unformatted node, and a pointer to it plus a count of the space used is stored in an indirect item. Directories consist of a set of directory items. Directory items consist of a set of directory entries. Directory entries contain the filename and the key of the file which is named. There is never more than one item of the same item type from the same object stored in a single node (there is no reason one would want to use two separate items rather than combining them). The first item of a file or directory contains its stat data. When performing balancing, and analyzing the packing of a node and its two neighbors, we ensure that the three nodes cannot be compressed into two nodes. I feel greater compression than this is best left to an FS cleaner to perform rather than attempting it dynamically.

== ReiserFS Structures ==

The ReiserFS tree has Max_Height = N (the current default is N = 5). The tree lives in disk blocks; each disk block that belongs to the reiserfs tree begins with a block head.

An internal node of the tree is the place for keys and pointers to disk blocks:

 Block_Head | Key 0 | Key 1 | Key 2 | --- | Key N | Pointer 0 | Pointer 1 | Pointer 2 | --- | Pointer N | Pointer N+1 | ..free space..

A leaf node of the tree is the place for items and item headers:

 Block_Head | IHead 0 | IHead 1 | IHead 2 | --- | IHead N | ........free space........ | Item N | --- | Item 2 | Item 1 | Item 0

An unformatted node of the tree is the place for the data of a big file, and has no internal format.

ReiserFS objects are files and directories. The maximum number of objects is 2^32 - 4 = 4 294 967 292. Each object is a set of items:

* File items: 1. StatData item + [Direct item] (small files: size from 0 bytes to MAX_DIRECT_ITEM_LEN = blocksize - 112 bytes); 2. StatData item + InDirect item + [Direct item] (big files: size > MAX_DIRECT_ITEM_LEN bytes).
* Directory items: 1. StatData item + Directory item.

Every reiserfs object has an Object ID and a Key.

=== Internal Node Structures ===

struct block_head:

 blk_level : unsigned short, 2 bytes : level of the block in the tree (1 = leaf; 2, 3, 4, ... = internal)
 blk_nr_item : unsigned short, 2 bytes : number of keys in an internal block, or number of items in a leaf block
 blk_free_space : unsigned short, 2 bytes : free space in the block, in bytes
 blk_right_delim_key : struct key, 16 bytes : right delimiting key for this block (leaf nodes only)

The total is 6 (stored as 8) bytes for internal nodes and 22 (stored as 24) bytes for leaf nodes.

struct key:

 k_dir_id : __u32, 4 bytes : ID of the parent directory
 k_object_id : __u32, 4 bytes : ID of the object (also the number of the inode)
 k_offset : __u32, 4 bytes : offset from the beginning of the object to the current byte of the object
 k_uniqueness : __u32, 4 bytes : type of the item (StatData = 0, Direct = -1, InDirect = -2, Directory = 500)

The total is 16 bytes.

struct disk_child (pointer to a disk block):

 dc_block_number : unsigned long, 4 bytes : disk child's block number
 dc_size : unsigned short, 2 bytes : disk child's used space

The total is 6 (stored as 8) bytes.

=== Leaf Node Structures ===

A leaf node begins with the same struct block_head described above (22, stored as 24, bytes, including the right delimiting key).

Everything in the filesystem is stored as a set of items. Each item has an item_head, which contains the key of the item, its free space (for indirect items), and the location of the item body within the block.

struct item_head (IHead):

 ih_key : struct key, 16 bytes : key used to search for the item; all item headers are sorted by this key
 u.ih_free_space / u.ih_entry_count : __u16, 2 bytes : free space in the last unformatted node for an InDirect item; 0xFFFF for a Direct or StatData item; the number of directory entries for a Directory item
 ih_item_len : __u16, 2 bytes : total size of the item body
 ih_item_location : __u16, 2 bytes : offset of the item body within the block
 ih_reserved : __u16, 2 bytes : used by reiserfsck

The total is 24 bytes.

There are 4 types of items: stat data items, directory items, indirect items, and direct items.

struct stat_data (the reiserfs version of a UFS disk inode, minus the address blocks):

 sd_mode : __u16, 2 bytes : file type and permissions
 sd_nlink : __u16, 2 bytes : number of hard links
 sd_uid : __u16, 2 bytes : owner id
 sd_gid : __u16, 2 bytes : group id
 sd_size : __u32, 4 bytes : file size
 sd_atime : __u32, 4 bytes : time of last access
 sd_mtime : __u32, 4 bytes : time the file was last modified
 sd_ctime : __u32, 4 bytes : time the inode (stat data) was last changed (except changes to sd_atime and sd_mtime)
 sd_rdev : __u32, 4 bytes : device
 sd_first_direct_byte : __u32, 4 bytes : offset from the beginning of the file to the first byte of its direct item: -1 for a directory; 1 for a small file (direct items only); > 1 for a big file with indirect and direct items; -1 for a big file with indirect items but no direct item

The total is 32 bytes.

A directory item consists of entry heads followed by the names, packed from the opposite end of the item:

 deHead 0 | deHead 1 | deHead 2 | --- | deHead N | fileName N | --- | fileName 2 | fileName 1 | fileName 0

A direct item holds a small file body directly. An indirect item is an array of pointers to unformatted blocks (unfPointer 0, unfPointer 1, ..., unfPointer N; each unfPointer is 4 bytes); the unformatted blocks contain the body of a big file.

struct reiserfs_de_head (deHead):

 deh_offset : __u32, 4 bytes : third component of the directory entry key (all entry heads are sorted by this value)
 deh_dir_id : __u32, 4 bytes : objectid of the parent directory of the object referenced by the entry
 deh_objectid : __u32, 4 bytes : objectid of the object referenced by the entry
 deh_location : __u16, 2 bytes : offset of the name within the whole item
 deh_state : __u16, 2 bytes : flags: 1) entry contains stat data (for the future); 2) entry is hidden (unlinked)

The total is 16 bytes.

fileName is the name of the file, an array of bytes of variable length. The maximum length of a file name is blocksize - 64 (for a 4kb blocksize, the maximum name length is 4032 bytes).

= Using the Tree to Optimize Layout of Files =

There are four levels at which layout optimization is performed: 1) the mapping of logical block numbers to physical locations on disk, 2) the assigning of nodes to logical block numbers, 3) the ordering of objects within the tree, and 4) the balancing of the objects across the nodes they are packed into.

== Physical Layout ==

For SCSI drives the mapping of logical block numbers to physical locations is performed by the disk drive manufacturer, for IDE drives it is done by the device driver, and for all drives it is also potentially done by volume management software.
The logical block number to physical location mapping by the drive manufacturer is usually done using cylinders. I agree with the authors of [ext2fs] and most others that the significant file placement feature of FFS was not the use of actual cylinder boundaries, but the placing of files and their inodes on the basis of their parent directory location. FFS used explicit knowledge of actual cylinder boundaries in its design. I find that minimizing the distance in logical blocks between semantically adjacent nodes, without tracking cylinder boundaries, accomplishes an excellent approximation of optimizing according to actual cylinder boundaries, and I find its simplicity an aid to implementation elegance.

== Node Layout ==

When I place nodes of the tree on the disk, I search for the first empty block in the bitmap (of used block numbers), starting at the location of the left neighbor of the node in the tree ordering and moving in the direction I last moved in. This was experimentally found to be better than the following alternatives for the benchmarks employed: 1) taking the first non-zero entry in the bitmap, 2) taking the entry after the last one that was assigned, in the direction last moved in (this was 3% faster for writes and 10-20% slower for subsequent reads), and 3) starting at the left neighbor and moving in the direction of the right neighbor. When changing block numbers for the purpose of avoiding overwriting sending nodes before shifted items reach disk in their new recipient node (see the description of preserve lists later in the paper), the benchmarks employed were ~10% faster when starting the search from the left neighbor rather than from the node's current block number, even though it adds significant overhead to determine the left neighbor (the current implementation risks I/O to read the parent of the left neighbor). It used to be that we would reverse direction when we reached the end of the disk drive.
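The stated search policy can be sketched as a scan over the used-block bitmap. This is a sketch of the policy as described, not the actual reiserfs allocator; the function name and the boolean-array representation of the bitmap are mine.

```c
#include <stdbool.h>
#include <stddef.h>

/* Sketch of the block allocation search described above: start at the
 * block of the node's left neighbor in the tree order and scan the
 * used-block bitmap in the direction of the last move until a free
 * block is found. (Illustrative only; the text notes that allocation
 * was later changed to always proceed in the increasing direction.) */
long find_free_block(const bool *used, size_t nr_blocks,
                     size_t start, int direction /* +1 or -1 */)
{
    for (long b = (long)start; b >= 0 && b < (long)nr_blocks; b += direction)
        if (!used[b])
            return b;       /* first free block in that direction */
    return -1;              /* hit the edge of the disk without success */
}
```

Starting the scan at the left neighbor rather than at the node's current block number is what keeps semantically adjacent nodes close in logical block numbers.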
Fortunately we checked to see if it makes a difference which direction one moves in when allocating blocks to a file, and indeed we found that it made a significant difference to always allocate in the increasing block number direction. We hypothesize that this is due to matching the disk spin direction by allocating using increasing block numbers.

== Ordering within the Tree ==

While I give here an example of how I have defined keys to optimize locality of reference and packing efficiency, I would like to stress that key definition is a powerful and flexible tool that I am far from finished experimenting with. Some key definition decisions depend very much on usage patterns, and this means that someday one will select from several key definitions when creating the file system. For example, consider the decision of whether to pack all directory entries together at the front of the file system, or to pack the entries near the files they name. For large file usage patterns one should pack all directory items together, since systems with such usage patterns are effective in caching the entries for all directories. For small files the name should be near the file. Similarly, for large files the stat data should be stored separately from the body, either with the other stat data from the same directory, or with the directory entry. (It was likely a mistake for me not to assign stat data its own key in the current implementation, as packing it in with direct and indirect items complicates our code for handling those items, and prevents me from easily experimenting with the effects of changing its key assignment.) It is not necessary for a file's packing to reflect its name; that is merely my default. With each file my next release will offer the option of overriding the default by use of a system call.
It is feasible to pack an object completely independently of its semantics using these algorithms, and I predict that there will be many applications, perhaps even most, for which a packing different from that determined by object names is more appropriate. Currently the mandatory tying of packing locality to semantics distorts both semantics and packing from what might otherwise be their independent optimums, much as tying block boundaries to file boundaries distorts I/O and space allocation algorithms from their separate optimums. For example, placing most files accessed while booting in their access order at the start of the disk is a very tempting future optimization that the use of packing localities makes feasible to consider.

The Structure of a Key: Each file item has a key with the structure <locality_id, object_id, offset, uniqueness>. The locality_id is by default the object_id of the parent directory. The object_id is the unique id of the file, and is set to the first unused objectid when the object is created. The tendency of this to result in successive object creations in a directory being adjacently packed is often fortuitous for many usage patterns. For files the offset is the offset within the logical object of the first byte of the item. In version 0.2 all directory entries had their own individual keys stored with them and were each distinct items; in the current version I store one key in the item, which is the key of the first entry, and compute each entry's key as needed from that one stored key. For directories the offset key component is the first four bytes of the filename, which you may think of as the lexicographic rather than numeric offset. For directory items the uniqueness field differentiates filename entries identical in the first 4 bytes.
For all item types the uniqueness field indicates the item type, and for the leftmost item in a buffer it indicates whether the preceding item in the tree is of the same type and object as this item. Placing this information in the key is useful when analyzing balancing conditions, but increases key length for non-directory items, and is a questionable architectural feature.

Every file has a unique objectid, but this cannot be used for finding the object; only keys are used for that. Objectids merely ensure that keys are unique. If you never use the reiserfs features that change an object's key then it is immutable, otherwise it is mutable. (This feature aids support for NFS daemons, etc.) We spent quite some time debating internally whether the use of mutable keys for identifying an object had deleterious long term architectural consequences: in the end I decided it was acceptable iff we require any object recording a key to possess a method for updating its copy of it. This is the architectural price of avoiding caching a map of objectid to location, a map that might have very poor locality of reference due to objectids not changing with object semantics.

I pack an object with the packing locality of the directory it was first created in unless the key is explicitly changed. It remains packed there even if it is unlinked from the directory. I do not move it from the locality it was created in without an explicit request, unlike the [C-FFS] approach, which stores all multiple-link files together and pays the cost of moving them from their original locations when the second link occurs. I think a file linked with multiple directories might as well get at least the locality of reference benefits of one of those directories. In summary, this approach 1) places files from the same directory together, and 2) places directory entries from the same directory together with each other and with the stat data for the directory.
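The sort order this key definition produces can be shown with a small sketch. The struct and function names here are illustrative (the on-disk struct key uses k_dir_id for what the text calls locality_id): comparing component by component, most significant first, sorts all items of one packing locality together, then all items of one object, then by offset within the object.

```c
#include <stdint.h>

/* Sketch of the 4-part key <locality_id, object_id, offset, uniqueness>
 * described above. Names are illustrative; on disk the first field is
 * k_dir_id. Each component is a __u32, so the key is 16 bytes, matching
 * the struct key table. */
struct r_key {
    uint32_t locality_id;   /* object_id of the parent directory */
    uint32_t object_id;     /* unique id of the file             */
    uint32_t offset;        /* byte (or lexicographic) offset    */
    uint32_t uniqueness;    /* item type / tie-breaker           */
};

_Static_assert(sizeof(struct r_key) == 16, "matches the 16-byte on-disk key");

/* Lexicographic comparison, most significant component first. */
int r_key_cmp(const struct r_key *a, const struct r_key *b)
{
    if (a->locality_id != b->locality_id)
        return a->locality_id < b->locality_id ? -1 : 1;
    if (a->object_id != b->object_id)
        return a->object_id < b->object_id ? -1 : 1;
    if (a->offset != b->offset)
        return a->offset < b->offset ? -1 : 1;
    if (a->uniqueness != b->uniqueness)
        return a->uniqueness < b->uniqueness ? -1 : 1;
    return 0;
}
```

Because locality_id is the most significant component, two files created in the same directory sort adjacently no matter how their objectids compare to those of files in other directories.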
Note that there is no interleaving of objects from different directories in the ordering at all, and that all directory entries from the same directory are contiguous. You'll note that this does not accomplish packing the files of small directories with common parents together, and does not employ the full partial ordering in determining the linear ordering; it merely uses parent directory information. I feel the proper place for employing full tree structure knowledge is in the implementation of an FS cleaner, not in the dynamic algorithms.

== Node Balancing Optimizations ==

When balancing nodes I do so according to the following ordered priorities:

1. minimize the number of nodes used
2. minimize the number of nodes affected by the balancing operation
3. minimize the number of uncached nodes affected by the balancing operation
4. if shifting to another formatted node is necessary, maximize the bytes shifted

Priority 4 is based on the assumption that the location of an insertion of bytes into the tree is an indication of the likely location of future insertions, and that this policy will on average reduce the number of formatted nodes affected by future balance operations. There are more subtle effects as well: if one randomly places nodes next to each other, and one has a choice between those nodes being mostly moderately packed or packed to an extreme of either well or poorly, one is more likely to be able to combine more of the nodes if one chooses the policy of extremism. Extremism is a virtue in space efficient node packing. The maximal shift policy is not applied to internal nodes, as extremism is not a virtue in time efficient internal node balancing.
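The packing condition stated in the Tree Definitions section, that a node and its two neighbors must never be compressible into two nodes, is the check behind priority 1 and can be sketched as follows. This is a hypothetical simplification: real balancing accounts for item headers and per-item shift costs, and the capacity figure here is an assumed constant, not the implementation's.

```c
#include <stdbool.h>

/* Assumed usable bytes per 4k node; illustrative, not reiserfs's figure. */
#define NODE_CAPACITY 4072u

/* Sketch of the balancing condition described above: returns true if
 * the bytes used by three adjacent nodes would fit into two nodes,
 * i.e. the packing is too loose and balancing must compress it. */
bool three_nodes_compressible(unsigned left_used, unsigned cur_used,
                              unsigned right_used)
{
    return left_used + cur_used + right_used <= 2u * NODE_CAPACITY;
}
```

Anything tighter than this (e.g. compressing four nodes into three) is deliberately left to an FS cleaner rather than attempted dynamically.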
=== Drops ===

(The difficult design issues in the current version that our next version can do better.)

Consider dividing a file or directory into drops, with each drop having a separate key, and no two drops from one file or directory occupying the same node without being compressed into one drop. The key for each drop is set to the key for the object (file or directory) plus the offset of the drop within the object. For directories the offset is lexicographic and by filename; for files it is numeric and in bytes. In the course of several file system versions we have experimented with and implemented solid, liquid, and air drops. Solid drops were never shifted, and drops would only solidify when they occupied the entirety of a formatted node. Liquid drops are shifted in such a way that any liquid drop which spans a node fully occupies the space in its node; like a physical liquid it is shiftable but not compressible. Air drops merely meet the balancing condition of the tree.

Reiserfs 0.2 implemented solid drops for all but the tails of files. If a file was at least one node in size it would align the start of the file with the start of a node, block aligning the file. This block alignment of the start of multi-drop files was a design error that wasted space: even if the locality of reference is so poor as to make one not want to read parts of semantically adjacent files, if the nodes are near to each other then the cost of reading an extra block is thoroughly dwarfed by the cost of the seek and rotation needed to reach the first node of the file. As a result the block alignment saves little time, though it costs significant space for 4-20k files. Reiserfs with block alignment of multi-drop files and no indirect items experienced the following rather interesting behavior, which was partially responsible for making it only 88% space efficient for files averaging 13k in size (the Linux kernel sources).
When the tail of a larger than 4k file was followed in the tree ordering by another file larger than 4k, since the drop before was solid and aligned, and the drop afterwards was solid and aligned, no matter what size the tail was, it occupied an entire node. In the current version we place all but the tail of large files into a level of the tree reserved for full unformatted nodes, and create indirect items in the formatted nodes which point to the unformatted nodes. This is known in the database literature as the [BLOB] approach. This extra level added to the tree comes at the cost of making the tree less balanced (I consider the unformatted nodes pointed to as part of the tree) and increasing the maximal depth of the tree by 1. For medium sized files, the use of indirect items increases the cost of caching pointers by mixing data with them. The reduction in fanout often causes the read algorithms to fetch only one node at a time of the file being read more frequently, as one waits to read the uncached indirect item before reading the node with the file data. There are more parents per file read with the use of indirect items than with internal nodes, as a direct result of reduced fanout due to mixing tails and indirect items in the node. The most serious flaw is that these reads of various nodes necessary to the reading of the file have additional rotations and seeks compared to the case with drops. With my initial drop approach they are usually sequential in their disk layout, even the tail, and the internal node parent points to all of them in such a way that all of them that are contained by that parent or another internal node in cache can be requested at once in one sequential read. Non-sequential reads of nodes are more than an order of magnitude more costly than sequential reads, and this single consideration dominates effective read optimization. 
Unformatted nodes make file system recovery faster but less robust, in that one reads their indirect item rather than the nodes themselves to insert them into the recovered tree, and one cannot read them to confirm that their contents belong to the file the indirect item says they belong to. In this respect they make reiserfs similar to an inode-based system without logging.

A moderately better solution would have been simply to eliminate the requirement that the start of multi-node files be placed at the start of nodes, rather than introducing BLOBs, and to depend on a file system cleaner to optimally pack the 80% of files that don't move frequently, using algorithms that move even solid drops. Yet that still leaves the problem of formatted nodes not being efficient for mmap() purposes (one must copy them before writing rather than merely modifying their page table entries, and memory bandwidth is expensive even if CPU is cheap).

For this reason I have the following plan for the next version. I will have three trees: one tree maps keys to unformatted nodes, one tree maps keys to formatted nodes, and one tree maps keys to directory entries and stat_data. It is only natural to think that this would mean that to read a file, accessing first the directory entry and stat_data, then the unformatted node, then the tail, one must hop long distances across the disk, going first to one tree and then to another. This is indeed why it took me two years to realize it could be made to work. My plan is to interleave the nodes of the three trees according to the following algorithm. Block numbers are assigned to nodes when the nodes are created or preserved, and someday will be assigned when the cleaner runs. The choice of block number is based on first determining what other node it should be placed near, and then finding the nearest free block in the elevator's current direction.
Currently we use the left neighbor of the node in the tree as the node it should be placed near. This is nice and simple. Oh well. Time to create a virtual neighbor layer. The new scheme will continue to first determine the node it should be placed near, and then start the search for an empty block from that spot, but it will use a more complicated determination of what node to place it near. This method will cause all nodes from the same packing locality to be near each other, will cause all directory entries and stat_data to be grouped together within that packing locality, and will interleave formatted and unformatted nodes from the same packing locality. Pseudo-code is best for describing this:

 /* For use by reiserfs_get_new_blocknrs when determining where in the
    bitmap to start the search for a free block, and for use by the
    read-ahead algorithm when there are not enough nodes to the right in
    the same packing locality for packing-locality read-ahead purposes. */
 get_logical_layout_left_neighbors_blocknr(key of current node)
 {
     /* Based on examination of the current node's key and type, find the
        virtual neighbor of that node. */
     if (body node)
         if (first body node of file)
             if (node in tail tree whose key is less but is in same packing locality exists)
                 return blocknr of such node with largest key
             else
                 find node with largest key less than key of current node in stat_data tree
                 return its blocknr
         else
             return blocknr of node in body tree with largest key less than key of current node
     else if (tail node)
         if (node in body tree belonging to same file as first tail of current node exists)
             return its blocknr
         else if (node in tail tree with lesser delimiting key but same packing locality exists)
             return blocknr of such node with largest delimiting key
         else
             return blocknr of node with largest key less than key of current node in stat_data tree
     else /* is stat_data tree node */
         if (stat_data node with lesser key from same packing locality exists)
             return blocknr of such node with largest key
         else
             /* no node from same packing locality with lesser key exists */
 }

 /* For use by packing-locality read-ahead. */
 get_logical_layout_right_neighbors_blocknr(key of current node)
 {
     right-handed version of get_logical_layout_left_neighbors_blocknr logic
 }

It is my hope that this will improve caching of pointers to unformatted nodes, plus improve caching of directory entries and stat_data, by separating them from file bodies to a greater extent. I also hope that it will improve read performance for 1-10k files, and that it will allow us to do this without decreasing space efficiency.

=== Code Complexity ===

I thought it appropriate to mention some of the notable effects of simple design decisions on our implementation's code length. When we changed our balancing algorithms to shift parts of items rather than only whole items, so as to pack nodes tighter, this had a substantial impact on code complexity. Another multiplicative determinant of balancing code complexity was the number of item types: introducing indirect items doubled it, and changing directory items from liquid drops to air drops also increased it.
Storing stat data in the first direct or indirect item of the file complicated the code for processing those items more than if I had made stat data its own item type. When one finds oneself with an N×N coding complexity issue, it usually indicates the need to add a layer of abstraction. The N×N effect of the number of item types on balancing code complexity is an instance of that design principle, and we will address it in the next major rewrite. The balancing code will employ a set of item operations which all item types must support. The balancing code will then invoke those operations without caring to understand any more of the meaning of an item's type than that it determines which item-specific operation handler is called. Adding a new item type, say a compressed item, will then merely require writing a set of item operations for that item, rather than requiring modification of most parts of the balancing code as it does now. We now feel that the function which determines what resources are needed to perform a balancing operation, fix_nodes(), might as well be written to decide what operations will be performed during balancing, since it pretty much has to do so anyway. That way, the function that performs the balancing with the nodes locked, do_balance(), can be gutted of most of its complexity.

= Buffering & the Preserve List =

We implemented for version 0.2 of our file system a system of write ordering that tracked all shifting of items in the tree, and ensured that no node that had had an item shifted from it was written before the node that had received the item was written. This is necessary to prevent a system crash from causing the loss of an item that might not have been recently created. This tracking approach worked, and the overhead it imposed was not measurable in our benchmarks. But when in the next version we changed to partially shifting items and increased the number of item types, this code grew out of control in its complexity.
I decided to replace it with a scheme that was far simpler to code and also more effective in typical usage patterns. The scheme is as follows. If an item is shifted from a node, change the block that its buffer will be written to: change it to the nearest free block to the old block's left neighbor, and rather than freeing the old block, place its number on a "preserve list". (Saying "nearest" is slightly simplistic, in that the blocknr assignment function moves from the left neighbor in the direction of increasing block numbers.) When a "moment of consistency" is achieved, free all of the blocks on the preserve list. A moment of consistency occurs when there are no nodes in memory into which items have been shifted (this could be made more precise, but then it would be more complex). If disk space runs out, force a moment of consistency to occur. This is sufficient to ensure that the file system is recoverable. Note that during the large file benchmarks the preserve list was freed several times in the middle of the benchmark. The percentage of buffers preserved is small in practice except during deletes, and one can arrange for moments of consistency to occur as frequently as one wants. Note that I make no claim that this approach is better than the Soft Updates approach employed by [Ganger] or by us in version 0.2; I merely note that tracking the order of writes is more complex than this approach for balanced trees which partially shift items. We may go back to the old approach some day, though not to the code that I threw out. Preserve lists substantially hamper performance for files in the 1-10k size range, and we are re-evaluating them. Ext2fs avoids the metadata shifting problem by never shrinking directories and by using fixed inode space allocations.
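The preserve-list bookkeeping described above can be sketched as follows. The names and the fixed-size table are hypothetical, not the reiserfs implementation; only the idea is taken from the text:

```c
/* Hypothetical sketch of the preserve-list scheme, not reiserfs code. */

#define PRESERVE_MAX 1024

static long preserved_blocknrs[PRESERVE_MAX];
static int preserved_count = 0;

/* When an item is shifted out of a node, the node's buffer is retargeted
 * to a freshly allocated block near the old block's left neighbor, and
 * the old block number is recorded here instead of being freed, so that
 * a crash before the new location reaches disk still finds the old,
 * consistent copy.  Returns 0 on success, or -1 when the list is full
 * (the caller would then force a moment of consistency). */
static int preserve_block(long old_blocknr)
{
    if (preserved_count == PRESERVE_MAX)
        return -1;
    preserved_blocknrs[preserved_count++] = old_blocknr;
    return 0;
}

/* At a moment of consistency -- no node in memory holds items shifted
 * into it that are not yet on disk -- every preserved block may be
 * released (returned to the bitmap allocator, in a real file system).
 * Returns the number of blocks released. */
static int free_preserved(void)
{
    int released = preserved_count;
    preserved_count = 0;
    return released;
}
```

The key property is that freeing is deferred in bulk: no per-write ordering dependencies need to be tracked, which is what made this so much simpler than the version 0.2 write-ordering code.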
= Lessons From Log Structured File Systems =

Many techniques from other file systems haven't been applied, primarily so as to satisfy my goal of giving reiserfs 1.0 only the minimum feature set necessary to be useful; they will appear in later releases. Log structured file systems [Rosenblum and Ousterhout] embody several such techniques, which I will describe after mentioning two concerns with that approach:

* With small object file systems it is not feasible to cache in RAM a map of objectid to location for every object, since there are too many objects. This is an inherent problem in using temporal packing rather than semantic packing for small object file systems. With my approach the internal nodes are the equivalent of this objectid-to-location map, but total internal node size is proportional to the number of nodes rather than the number of objects. You can think of internal nodes as a compression of object location information made effective by the existence of an ordering function; this compression is both essential for small files and a major feature of my approach.
* I like obtaining good though not ideal semantic locality without paying a cleaning cost for active data. This is a less critical concern.

I frequently find myself classifying packing and layout optimizations as either appropriate for implementing dynamically or appropriate only for a cleaner. Optimizations whose computational overhead is large compared to their benefit tend to be appropriate for implementation in a cleaner, and a cleaner's benefits mostly impact the static portion of the file system (which typically consumes ~80% of the space). Such objectives as 100% packing efficiency, exactly ordering block layout by semantic order, using the full semantic tree rather than the parent directory in determining semantic order, and compression are all best implemented by cleaner approaches.
In summary, there is much to be learned from the LFS approach, and as I move past my initial objective of supplying a minimal-feature, higher-performance FS I will apply some of those lessons. In the Preserve List section I speculate on the possibility of a fastboot implementation that would merge the better features of preserve lists and logging.

= Directions For the Future =

To go one more order of magnitude smaller in file size will require adding functionality to the file system API, though it will not require discarding upward compatibility. The use of an exokernel is a better approach to small files if it is an option available to the OS designer; it is not currently an option for Linux users. In the future reiserfs will add such features as lightweight files, in which stat_data other than size is inherited from a parent if not created individually for the file; an API for reading and writing files without the overhead of file handles and open(); set-theoretic semantics; and many other features that you would expect from researchers who expect to be able to do all that they could do in a database, in the file system, and never really did understand why not.

= Conclusion =

Balanced tree file systems are inherently more space efficient than block allocation based file systems, with the differences reaching order-of-magnitude levels for small files. While other aspects of design will typically have a greater impact on performance for large files, in direct proportion to the smallness of the file the use of balanced trees offers performance advantages; a moderate advantage was found even for large files. Coding cost is mostly in the interfaces, and it is a measure of the OS designer's skill whether those costs are low in the OS. We make it possible for an OS designer to use the same interface for large and small objects, and thereby reduce interface coding cost.
This approach is a new tool available to the OS designer for increasing the expressive power of all of the components in the OS through better name space integration. Researchers interested in collaborating, or just in using my work, will find me friendly; I tailor the framework of my collaborations to the needs of those I work with. I GPL reiserfs so as to meet the needs of academic collaborators. While that makes it unusable without a special license for commercial OSes, commercial vendors will find me friendly in setting up a commercial framework for collaboration, with commercial needs provided for.

= Acknowledgments =

Hans Reiser was the project initiator, primary architect, supplier of funding, and one of the programmers. Some folks at times remark that naming the file system Reiserfs was egotistic. It was so named after a potential investor hired all of my employees away from me, then tried to negotiate better terms for his possible investment, and suggested that he could arrange for 100 researchers to swear in Russian court that I had had nothing to do with this project. That business partnership did not work out. Vladimir Saveljev, while he did not author this paper, worked long hours writing the largest fraction of the lines of code in the file system, and is remarkably gifted at just making things work. Thanks, Vladimir. Anatoly Pinchuk wrote much of the core balancing code, and too much of the rest to list here. Thanks, Anatoly. It is the policy of the Naming System Venture that if someone quits before project completion, and then takes strong steps to try to prevent others from finishing the project, they shall not be mentioned in the acknowledgments. This was all quite sad, and is best forgotten. I would like to thank Alfred Ajlamazyan for his generosity in providing overhead at a time when his institute had little it could easily spare.
Grigory Zaigralin is thanked for making the machines run, administering the money, and being his usual determined-to-be-useful self. Igor Chudov, thanks for such effective procurement and hardware maintenance work. Eirik Fuller is thanked for his help with NFS and the porting to 2.1. I would like to thank Rémy Card for the superb block allocation based file system (ext2fs) that I depended on for so many years, and that allowed me to benchmark against the best. Linus Torvalds, thank you for Linux.

= Business Model and Licensing =

I personally favor performing a balance of commercial and public works in my life. I have no axe to grind against software that is charged for, and no regrets at making reiserfs freely available to Linux users. This project is GPL'd, but I sell exceptions to the GPL to commercial OS vendors and file server vendors. It is not usable to them without such exceptions, and many of them are wise enough to understand that:

* the porting and integration service we are able to provide with the licensing is by itself worth what we charge,
* these services impact their time to market,
* and the relationship spreads the development costs across more OS vendors than just them alone.

I expect that Linux will prove to be quite effective in market sampling my intended market, but if you suspect that I also like seeing more people use it even if it is free to them, oh well. I believe it is not so much the cost that has made Linux so successful as it is the openness. Linux is a decentralized economy with honor and recognition as the currency of payment (and thus there is much honor in it). Commercial OS vendors are, at the moment, all closed economies, and doomed to fall in their competition with open economies just as communism eventually fell.
At some point an OS vendor will realize that if it:

* opens up its source code to decentralized modification,
* systematically rewards those who perform the modifications that prove useful,
* and systematically merges/integrates those modifications into its branded primary release branch while adding value as the integrator,

then it will acquire both the critical mass of the internet development community and the aggressive edge that no large communal group (such as a corporation) can have. Rather than saying to any such vendor that they should do this now, let me simply point out that whoever is first will have an enormous advantage.

Since I have more recognition than money to pass around as reward, my policy is to tend to require that those who contribute substantial software to this project have their names attached to a user-visible portion of the project. This official policy helps me deal with folks like Vladimir, who was much too modest to ever have named the file system checker vsck without my insisting. Smaller contributions are to be noted in the source code and in the acknowledgments section of this paper. If you choose to contribute to this file system, and your work is accepted into the primary release, you should let me know if you want me to look for opportunities to integrate you into contracts from commercial vendors. Through packaging ourselves as a group, we are more marketable to such OS vendors. Many of us have spent too much time working at day jobs unrelated to our Linux work. This is too hard, and I hope to make things easier for us all. If you like this business model of selling GPL'd component software with related support services, but you write software not related to this file system, I encourage you to form a component supplier company also. Opportunities may arise for us to cooperate in our marketing, and I will be happy to do so.

= References =

* [Adel'son-Vel'skii and Landis] G.M. Adel'son-Vel'skii and E.M. Landis, [http://en.scientificcommons.org/19884302 An algorithm for the organization of information], Soviet Math. Doklady 3, 1259-1262, 1962. This paper on AVL trees can be thought of as the founding paper of the field of storing data in trees. Those not conversant in Russian will want to read the [Lewis and Denenberg] treatment of AVL trees in its place. [Wood] contains a modern treatment of trees.
* [Apple] Apple Computer Inc., [http://books.google.com/books?as_isbn=0201177323 Inside Macintosh, Files], Addison-Wesley, 1992. Employs balanced trees for filenames. It was an interesting file system architecture for its time in a number of ways; now its problems with internal fragmentation have become more severe as disk drives have grown larger, and the code has not received sufficient further development.
* [Bach] Maurice J. Bach, [http://portal.acm.org/citation.cfm?id=8570 The Design of the Unix Operating System], Prentice-Hall Software Series, Englewood Cliffs, NJ, 1986. Superbly written but sadly dated; contains detailed descriptions of the file system routines and interfaces in a manner especially useful for those trying to implement a Unix-compatible file system. See [Vahalia].
* [BLOB] R. Haskin and Raymond A. Lorie, [http://portal.acm.org/citation.cfm?id=582353.582390 On Extending the Functions of a Relational Database System], SIGMOD Conference, 1982, pp. 207-212 (body of paper not on the web). See the Drops section for a discussion of how this approach makes the tree less balanced, and the effect that has on performance.
* [Chen] P.M. Chen and David A. Patterson, [http://www.eecs.berkeley.edu/Pubs/TechRpts/1992/6129.html A New Approach to I/O Performance Evaluation -- Self-Scaling I/O Benchmarks, Predicted I/O Performance], 1993 ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems; also available on Chen's web page.
* [C-FFS] Gregory R. Ganger and M. Frans Kaashoek, [http://www.ece.cmu.edu/~ganger/papers/cffs.html Embedded Inodes and Explicit Grouping: Exploiting Disk Bandwidth for Small Files]. A very well written paper focused on 1-10k file size issues; they use some similar notions (most especially their concept of grouping, comparable to my packing localities). Note that they focus on the 1-10k file size range, and not the sub-1k range. The 1-10k range is the weak point in reiserfs performance.
* [ext2fs] Rémy Card, [http://e2fsprogs.sourceforge.net/ext2intro.html extensive information]; source code is available. When you consider how small this file system is (~6000 lines), its effectiveness becomes all the more remarkable.
* [FFS] M.K. McKusick, W.N. Joy, S.J. Leffler, and R.S. Fabry, A fast file system for UNIX, ACM Transactions on Computer Systems, 2(3):181-197, August 1984. Describes the implementation of a file system which employs parent directory location knowledge in determining file layout. It uses large blocks for all but the tail of files to improve I/O performance, and small blocks called fragments for the tails so as to reduce the cost of internal fragmentation. Numerous other improvements are also made to what was once the state of the art. FFS remains the architectural foundation for many current block allocation file systems, and was later bundled with the standard Unix releases. Note that unrequested serialization and the use of fragments place it at a performance disadvantage to ext2fs, though whether ext2fs is thereby made less reliable is a matter of dispute on which I take no position (reiserfs uses preserve lists; forgive my egotism in thinking it enough work for me to ensure that reiserfs solves the recovery problem, and to perhaps suggest that ext2fs would benefit from the use of preserve lists when shrinking directories).
* [Ganger] Gregory R. Ganger and Yale N. Patt, "Metadata Update Performance in File Systems" (abstract only).
* [Gifford] (postscript only). Describes a file system enriched to have more than hierarchical semantics. He shares many goals with this author; forgive me for thinking his work worthwhile. If I had to suggest one improvement in a sentence, I would say his semantic algebra needs closure.
* [Hitz] Dave Hitz, http://www.netapp.com/technology/level3/3002.html A rather well designed file system optimized for NFS and RAID in combination. Note that RAID increases the merits of write-optimization in block layout algorithms.
* [Holton and Das] Mike Holton and Raj Das, "The XFS space manager and namespace manager use sophisticated B-Tree indexing technology to represent file location information contained inside directory files and to represent the structure of the files themselves (location of information in a file)." Note that XFS is still a block (extent) allocation based file system; no attempt is made to store the actual file contents in the tree. It is targeted at the needs of the other end of the file size usage spectrum from reiserfs, and is an excellent design for that purpose (and I would concede that reiserfs 1.0 is not suitable for their real-time large-I/O market). SGI has also traditionally been a leader in resisting the use of unrequested serialization of I/O. Unfortunately, the paper is a bit vague on details, and source code is not freely available.
* [Howard] J.H. Howard, M.L. Kazar, S.G. Menees, D.A. Nichols, M. Satyanarayanan, R.N. Sidebotham, and M.J. West, "Scale and Performance in a Distributed File System", ACM Transactions on Computer Systems, 6(1), February 1988. A classic benchmark; it was too CPU-bound for both ext2fs and reiserfs.
* [Knuth] D.E. Knuth, The Art of Computer Programming, Vol. 3 (Sorting and Searching), Addison-Wesley, Reading, MA, 1973. The earliest reference discussing trees storing records of varying length.
* [LADDIS] Mark Wittle and Bruce Keith, "LADDIS: The Next Generation in NFS File Server Benchmarking", Proceedings of the Summer 1993 USENIX Conference, July 1993, pp. 111-128.
* [Lewis and Denenberg] Harry R. Lewis and Larry Denenberg, Data Structures & Their Algorithms, HarperCollins Publishers, NY, NY, 1991. An algorithms textbook suitable for readers wishing to learn about balanced trees and their AVL predecessors.
* [McCreight] E.M. McCreight, Pagination of B*-trees with variable length records, Commun. ACM 20(9), 670-674, 1977. Describes algorithms for trees with variable-length records.
* [McVoy and Kleiman] The implementation of write-clustering for Sun's UFS.
* [OLE] Kraig Brockschmidt, Inside OLE, discusses Structured Storage. [http://www.microsoft.com/mspress/books/abs/5-843-2b.htm abstract only]
* [Ousterhout] J.K. Ousterhout, H. Da Costa, D. Harrison, J.A. Kunze, M.D. Kupfer, and J.G. Thompson, A trace-driven analysis of the UNIX 4.2BSD file system, in Proceedings of the 10th Symposium on Operating Systems Principles, pages 15-24, Orcas Island, WA, December 1985.
* [NTFS] Helen Custer, Inside the Windows NT File System, Microsoft Press, 1994. NTFS was architected by Tom Miller with contributions by Gary Kimura, Brian Andrew, and David Goebel. An easy-to-read little book. They fundamentally disagree with me on adding serialization of I/O not requested by the application programmer, and I note that the performance penalty they pay for their decision is high, especially compared with ext2fs. Their FS design is perhaps optimal for floppies and other hardware-eject media beyond OS control. A less serialized, higher performance log structured architecture is described in [Rosenblum and Ousterhout]. That said, Microsoft is to be commended for recognizing the importance of attempting to optimize for small files, and for leading the OS designer effort to integrate small objects into the file name space. This book is notable for not referencing the work of persons not working for Microsoft, or providing any form of proper attribution to previous authors such as [Rosenblum and Ousterhout].
* [Peacock] K. Peacock, "The CounterPoint Fast File System", Proceedings of the Usenix Conference, Winter 1988.
* [Pike] Rob Pike and Peter Weinberger, The Hideous Name, USENIX Summer 1985 Conference Proceedings, p. 563, Portland, Oregon, 1985. Short, informal, and drives home why inconsistent naming schemes in an OS are detrimental. http://achille.cs.bell-labs.com/cm/cs/doc/85/1-05.ps.gz His discussion of naming in Plan 9: http://plan9.bell-labs.com/plan9/doc/names.html
* [Rosenblum and Ousterhout] Mendel Rosenblum and John K. Ousterhout, "The Design and Implementation of a Log-Structured File System", ACM Transactions on Computer Systems, February 1992. This paper was quite influential in a number of ways on many modern file systems, and the notion of using a cleaner may be applied to a future release of reiserfs. There is an interesting ongoing debate over the relative merits of FFS vs. LFS architectures; the interested reader may peruse http://www.scriptics.com/people/john.ousterhout/seltzer93.html and the arguments by Margo Seltzer it links to.
* [Snyder] "tmpfs: A Virtual Memory File System". Discusses a file system built to use swap space and intended for temporary files; due to a complete lack of disk synchronization it offers extremely high performance.
* [Vahalia] Uresh Vahalia, "Unix Kernel Internals".

[[category:ReiserFS]] [[category:Formatting-fixes-needed]]

{{wayback|http://www.namesys.com/X0reiserfs.html|2006-11-13}}

= Three reasons why ReiserFS is great for you =

Last Update: 2002, Hans Reiser

Three reasons why ReiserFS is great for you:

# ReiserFS has fast journaling, which means that you don't spend your life waiting for fsck every time your laptop battery dies, or the UPS for your mission-critical server gets its batteries disconnected accidentally by the UPS company's service crew, or your kernel was not as ready for prime time as you hoped, or the silly thing decides you mounted it too many times today.
# ReiserFS is based on fast balanced trees. Balanced trees are more robust in their performance, and are a more sophisticated algorithmic foundation for a file system. When we started our project, there was a consensus in the industry that balanced trees were too slow for file system usage patterns. We proved that if you just do them right they are better -- take a look at the benchmarks. We have fewer worst-case performance scenarios than other file systems and generally better overall performance. If you put 100,000 files in one directory, we think it's fine; many other file systems try to tell you that you are wrong to want to do it.
# ReiserFS is more space efficient. If you write 100 byte files, we pack many of them into one block. Other file systems put each of them into its own block. We don't have fixed space allocation for inodes. That saves 6% of your disk.

Ok, it's time to fess up. The interesting stuff is still in the future. Because they are nifty, we are going to add database and hypertext like features into the file system. Only by using balanced trees, with their effective handling of small files (database small fields, hypertext keywords), as our technical foundation can we hope to do this.
That was our real motivation. As for performance, we may already be slightly better than the traditional file systems (and substantially better than the journaling ones). But they have been tweaking for decades, while we have just gotten started. This means that over the next few years we are going to improve faster than they are.

Speaking more technically: ReiserFS is a file system using a plug-in based, object-oriented variant on classical balanced tree algorithms. The results, when compared to the conventional block-allocation-based ext2fs running under the same operating system and employing the same buffering code, suggest that these algorithms are overall more efficient and with every passing month are becoming yet more so. Loosely speaking, every month we find another performance cranny that needs work; we fix it. And every month we find some way of improving our overall general usage performance. The improvement in small file space and time performance suggests that we may now revisit a common OS design assumption: that one should aggregate small objects using layers above the file system layer. Being more effective at small files does not make us less effective for other files. This is truly a general purpose FS; our overall traditional FS usage performance is high enough to establish that. ReiserFS has a commitment to opening up the FS design to contributions; we are now adding plug-ins so that you can create your own types of directories and files.

= Introduction =

The author is one of many OS researchers who are attempting to unify the name spaces in the operating system in varying ways (e.g. [http://plan9.bell-labs.com/sys/doc/names.html Pike, The Use of Name Spaces in Plan 9]). None of us is well funded compared with the size of the task, and I am far from an exception to this rule. The natural consequence is that we have each attacked one small aspect of the task.
My contribution is in incorporating small objects into the file system name space effectively. This implementation offers value to the average Linux user, in that it offers generally good performance compared to the current Linux file system, ext2fs. It also saves space to an extent that is important for some applications, and convenient for most. It does extremely well for large directories, and has a variety of minor advantages. Since ext2fs is very similar to FFS and UFS in performance, the implementation also offers potential value to commercial OS vendors who desire greater-than-ext2fs performance without directory size issues, and who appreciate the value of a better foundation for integrating name spaces throughout the OS.

= Why Is There A Move Among Some OS Designers Towards Unifying Name Spaces? =

An operating system is composed of components that access other components through interfaces. Operating systems are complex enough that, like national economies, the architect cannot centrally plan the interactions of the components they are composed of. The architect can, however, provide a structural framework that has a marked impact on the efficiency and utility of those interactions. Economists have developed principles that govern large economic systems. Are there system principles that we might use to start a discussion of the ways increasing component interactivity via naming system design impacts the total utility of an operating system? I propose these:

* If one increases the number of other components that a particular component can interact with, one increases its expressive power and thereby its utility.
* One can increase the number of other components that a particular component can interact with either by increasing the number of interfaces it has, or by increasing the number of components that are accessible by its current interfaces.
* The cost of component interfaces dominates software design cost, much as the cost of wires dominates circuit design cost. * Total system utility tends to be proportional not to the number of components, but to the number of possible component interactions. It is not simply the number of components that one has that determines an OS's expressive power, it is the number of opportunities to use them that determines it. The number of these opportunities is proportional to the number of possible combinations of them, and the number of possible combinations of them is determined by their connectedness. Component connectedness in OS design is determined by name space design, to much the same extent that buses determine it in circuit design. Allow me to illustrate the impact of these principles with the use of an imaginary example. Suppose two imaginary OS vendors with equally talented programmers hire two different OS architects. Suppose one of the architects centers the OS design around a single name space design that allows all of the components to access all other components via a single interface (assume this is possible, it is a theoretical example). Suppose the other allows the ten different design groups in the company that are developing components to create their own ten name spaces. Suppose that the unified name space OS architect has half of the resources of the fragmented name space OS architect and creates half as many components. While the number of components is half as large, the number of connections is (1/2)² / ((1/10)² × 10) = 2.5 times larger. If you accept my hypothesis that utility is more proportional to connections than components, then the unified operating system with half the development cost will still offer more expressive utility. That is a powerful motivation.
To return briefly to the long ago researched principles governing another member of the class of large systems, the economies of nations, it is perhaps interesting to note that Adam Smith in [http://en.wikisource.org/wiki/The_Wealth_of_Nations "The Wealth of Nations"] engaged in substantial study of the link between the extent of interconnectedness and the development of civilization, where the extent of interconnectedness was determined by waterways, etc. The link he found for economic systems was no less crucial than what is being suggested here for the effect of component interconnectedness on the total utility of software systems. I suggest that I am merely generalizing a long established principle from another field of science, namely that total utility in large systems with components that interact to generate utility is determined by the extent of their interconnection. There are many exceptions to these principles: not all chips on a motherboard sit on the bus, and analogous considerations apply to both OS design and the economies of nations. I hope the reader will accept that space considerations make it appropriate to gloss over these, and will consider the central point that under some circumstances unifying name spaces in a design can dramatically improve the utility of an OS. That can be an enormous motivation, and it has moved a number of OS researchers in their work (e.g. [http://plan9.bell-labs.com/sys/doc/names.html "The Use of Name Spaces in Plan9", Rob Pike] and [http://pdos.csail.mit.edu/~rsc/pike85hideous.pdf "The Hideous Name", Rob Pike and P.J. Weinberger]). Unfortunately, it is not a small technical effort to combine name spaces. To combine 10 name spaces requires, if not the effort to create 10 name spaces, perhaps an effort equivalent to creating 5 of the name spaces. 
Usually each of the name spaces has particular performance and semantic power requirements that require enhancing the unified name space, and it usually requires technical innovation to combine the advantages of each of the separated name spaces into a unified name space. I would characterize none of the research groups currently approaching this unification problem as having funding equivalent to what went into creating 5 of the name spaces they would like to unify, and we are certainly no exception. For this reason I have picked one particular aspect of this larger problem for our focus: allowing small objects to effectively share the same file system interface that large objects use currently. As operating systems increase the number of their components, the higher development cost of a file system able to handle small files becomes more worth the multiplicative effect it has on OS utility, as well as its reduction of OS component interface cost. = Should File Boundaries Be Block Aligned? = Making file boundaries block aligned has a number of effects: it minimizes the number of blocks a file is spread across (which is especially beneficial for multiple block files when locality of reference across files is poor), it wastes disk and buffer space in storing every less than fully packed block, it wastes I/O bandwidth with every access to a less than fully packed block when locality of reference is present, it increases the average number of block fetches required to access every file in a directory, and it results in simpler code. The simpler code of block aligning file systems follows from not needing to create a layering to distinguish the units of the disk controller and buffering algorithms from the units of space allocation, and from not needing to optimize the packing of nodes as is done in balanced tree algorithms. 
For readers who have not been involved in balanced tree implementations, algorithms of this class are notorious for being much more work to implement than one would expect from their description. Sadly, they also appear to offer the highest performance solution for small files, once I remove certain simplifications from their implementation and add certain optimizations common to file system designs. I regret that code complexity (30k lines) is a major disadvantage of the approach compared to the 6k lines of the ext2fs approach. I started our analysis of the problem with an assumption that I needed to aggregate small files in some way, and that the question was, which solution was optimal? The simplest solution was to aggregate all small files in a directory together into either a file or the directory. But any aggregation into a file or directory wastes part of the last block in the aggregation. What does one do if there are only a few small files in a directory, aggregate them into the parent of the directory? What if there are only a few small files in a directory at first, and then there are many small files? How do I decide what level to aggregate them at, and when to take them back from a parent of a directory and store them directly in the directory? As we did our analysis of these questions we realized that this problem was closely related to the balancing of nodes in a balanced tree. The balanced tree approach, by using an ordering of files which are then dynamically aggregated into nodes at a lower level, rather than a static aggregation or grouping, avoids this set of questions. In my approach I store both files and filenames in a balanced tree, with small files, directory entries, inodes, and the tail ends of large files all being more efficiently packed as a result of relaxing the requirements of block alignment, and eliminating the use of a fixed space allocation for inodes.
I have a sophisticated and flexible means for arranging for the aggregation of files for maximal locality of reference, through defining the ordering of items in the tree. The body of large files is stored in unformatted nodes that are attached to the tree but isolated from the effects of possible shifting by the balancing algorithms. Approaches such as [Apple] and [Holton and Das] have stored filenames but not files in balanced trees. None of the file systems C-FFS, NTFS, or XFS aggregates files; all of them block-align files, though all of those also do some variation on storing small files in the statically allocated block address fields of inodes if they are small enough to fit there. [C-FFS] has published an excellent discussion of both their approach and why small files rob a conventional file system of performance more in proportion to the number of small files than to the number of bytes consumed by small files. However, I must note that their notion of what constitutes small is different from ours by one or two orders of magnitude. Their use of an exo-kernel is simply an excellent approach for operating systems that have that as an available option. Semantics (files), packing (blocks/nodes), caching (read-ahead sizes, etc.), and the hardware interfaces of disk (sectors) and paging (pages) all have different granularity issues associated with them: a central point of our approach is that the optimal granularity of these often differs, and abstracting these into separate layers in which the granularity of one layer does not unintentionally impact other layers can improve space/time performance. Reiserfs innovates in that its semantic layer often conveys to the other layers an ungranulated ordering rather than one granulated by file boundaries. The reader is encouraged, while reading the algorithms, to note the areas in which reiserfs needs to go farther in doing so.
= Balanced Trees and Large File I/O = There has long been an odd informal consensus that balanced trees are too slow for use in storing large files, perhaps originating in the performance of databases that have attempted to emulate file systems using balanced tree algorithms that were not originally architected for file system access patterns or their looser serialization requirements. It is hopefully easy for the reader to understand that storing many small files and tail ends of files in a single node where they can all be fetched in one I/O leads directly to higher performance. Unfortunately, it is quite complex to understand the interplay between I/O efficiency and block size for larger files, and space does not allow a systematic review of traditional approaches. The reader is referred to [FFS], [Peacock], [McVoy], [Holton and Das], [Bach], [OLE], and [NTFS] for treatments of the topic, and discussions of various means of 1) reducing the effect of block size on CPU efficiency, 2) eliminating the need for inserting rotational delay between successive blocks, 3) placing small files into either inodes or directories, and 4) performing read-ahead. More commentary on these is in the annotated bibliography. 
Reiserfs has the following architectural weaknesses that stem directly from the overhead of repacking to save space and increase block size: 1) when the tail (files < 4k are all tail) of a file grows large enough to occupy an entire node by itself it is removed from the formatted node(s) it resides in, and it is converted into an unformatted node ([FFS] pays a similar conversion cost for fragments), 2) a tail that is smaller than one node may be spread across two nodes which requires more I/O to read if locality of reference is poor, 3) aggregating multiple tails into one node introduces separation of file body from tail, which reduces read performance ([FFS] has a similar problem, and for reiserfs files near the node in size the effect can be significant), 4) when you add one byte to a file or tail that is not the last item in a formatted node, then on average half of the whole node is shifted in memory. If any of your applications perform I/O in such a way that they generate many small unbuffered writes, reiserfs will make you pay a higher price for not being able to buffer the I/O. Most applications that create substantial file system load employ effective I/O buffering, often simply as a result of using the I/O functions in the standard C libraries. By avoiding accesses in small blocks/extents reiserfs improves I/O efficiency. Extent based file systems such as VxFS, and write-clustering systems such as ext2fs, are not so effective in applying these techniques that they choose to use 512-byte blocks rather than 1k blocks as their defaults. Ext2fs reports a 20% speedup when 4k rather than 1k blocks are used, but the authors of ext2fs advise the use of 1k blocks to avoid wasting space. There are a number of worthwhile large file optimizations that have not been added to either ext2fs or reiserfs, and both file systems are somewhat primitive in this regard, reiserfs being the more primitive of the two. 
Large files simply were not my research focus, and it being a small research project I did not implement the many well known techniques for enhancing large file I/O. The buffering algorithms are probably more crucial than any other component in large file I/O, and partly out of a desire for a fair comparison of the approaches I have not modified these. I have added no significant optimizations for large files, beyond increasing the block size, that are not found in ext2fs. Except for the size of the blocks, there is not a large inherent difference between: 1) the cost of adding a pointer to an unformatted node to my tree plus writing the node, and 2) adding an address field to an inode plus writing the block. It is likely that except for block size the primary determinants of high performance large file access are orthogonal to the decision of whether to use balanced tree algorithms for small and medium sized files. For large files we get some advantage from not having our tree being more balanced than the tree formed by an inode which points to a triple indirect block. We haven't an easy method for measuring the performance gain from that though. There is performance overhead due to the memory bandwidth cost of balancing nodes for small files. We think it is worth it though. = Serialization and Consistency = The issues of ensuring recoverability with minimal serialization and data displacement necessarily dominate high performance design. Let's define the two extremes in serialization so that the reason for this can be clear. 
Consider the relative speed of a set of I/O's in which every block request in the set is fed to the elevator algorithms of the kernel and the disk drive firmware fully serially, each request awaiting the completion of the previous request. Now consider the other extreme, in which all block requests are fed to the elevator algorithms all together, so that they may all be sorted and performed in close to their sorted order (disk drive firmwares don't use a pure elevator algorithm). The unserialized extreme may be more than an order of magnitude faster, due to the cost of rotations and seeks. Unnecessarily serializing I/O prevents the elevator algorithm from doing its job of placing all of the I/O's in their layout sequence rather than chronological sequence. Most of high performance design centers around making I/O's in the order they are laid out on disk, and laying out blocks on disk in the order that the I/O's will want to be issued. Snyder discusses a file system that obtains high performance from a complete lack of disk synchronization, but is only suitable for temporary files that don't need to survive reboot. I think its known value to Solaris users indicates that the optimal buffering policy varies from file to file. Ganger discusses methods for using ordering of writes rather than serialization for ensuring conventional file system meta-data integrity; [McVoy] previously suggested but did not implement ordering of buffer writes. Ext2fs is fast in substantial part due to avoiding synchronous writes of metadata, and I have much personal experience with it that leads me to prefer compiles that are fast. [ I would like to see it adopt a policy that all dirty buffers for files not flagged as temporary are queued for writing, and that the existence of a dirty buffer means that the disk is busy.
This will require replacing buffer I/O locking with copy-on-write, but an idle disk is such a terrible thing to waste. :-) ] [NTFS] by default adds unnecessary serialization to an extent that even older file systems such as [FFS] do not, and its performance characteristics reflect that. In fairness, it should be said that it is the superior approach for most removable media without software control of ejection (e.g. IBM PC floppies). Reiserfs employs a new scheme called preserve lists for ensuring recoverability, which avoids overwriting old meta-data by writing the meta-data nearby rather than over old meta-data. = Why Aggregate Small Objects at the File System Level? = There has long been a tradition of file system developers deciding that effective handling of small files is not significant to performance, and of application programmers caring enough about performance for small files to not store them as separate entities in the file system. To store small objects one may either make the file system efficient for the task, or sidestep the problem by aggregating small objects in a layer above the file system. Sidestepping the problem has three disadvantages: utility, code complexity, and performance. Utility and Code Complexity: Allowing OS designers to effectively use a single namespace with a single interface for both large and small objects decreases coding cost and increases the expressive power of components throughout the OS. I feel reiserfs shows the effects of a larger development investment focused on a simpler interface when compared with many solutions for this currently available in the object oriented toolkit community, such as the Structured Storage available in Microsoft's [OLE]. By simpler I mean I added nothing to the file system API to distinguish large and small objects, and I leave it to the directory semantics and archiving programs to aggregate objects.
Multiple layers cost more to implement, cost more to code the interfaces for utilizing, and provide less flexibility. Performance: It is most commonly the case that when one layers one file system on top of another the performance is substantially reduced, and Structured Storage is not an exception to this general rule. Reiserfs, which does not attempt to delegate the small object problem to a layer above, avoids this performance loss. I have heard it suggested by some that this layering avoids the performance loss from syncing on file close as many file systems do. I suggest that this is adding an error to an error rather than fixing it. Let me make clear that I believe those who write such layers above the file system do not do so out of stupidity. I know of at least one company at which a solution that layers small object storage above the file system exists because the file system developers refused to listen to the non-file system group's description of its needs, and the file system group had to be sidestepped in generating the solution. Current file systems are fairly well designed for the purposes that their users currently use them for: my goal is to change file size usage patterns. The author remembers arguments that once showed clearly that there was no substantial market need for disk drives larger than 10MB based on current usage statistics. While [C-FFS] points out that 80% of file accesses are to files below 10k, I do not believe it reasonable to attempt to provide statistics based on usage measurements of file systems for which small files are inappropriate to use that will show that small files are frequently used. Application programmers are smarter than that. Currently 80% of file accesses are to the first order of magnitude in file size for which it is currently sensible to store the object in the file system. 
I regret that one can only speculate as to whether once file systems become effective for small files and database tasks, usage patterns will change to 80% of file accesses being to files less than 100 bytes. What I can do is show via the 80/20 Banded File Set Benchmark presented later that in such circumstances small file performance potentially dominates total system performance. In summary, the on-going reinvention of incompatible object aggregation techniques above the file system layer is expensive, less expressive, less integrated, slower, and less efficient in its storage than incorporating balanced tree algorithms into the file system. = Tree Definitions = Balanced trees are used in databases, and more generally, wherever a programmer needs to search and store to non-random memory by a key, and has the time to code it this way. The usual evolution for programmers is to first think that hashing will be simpler and more efficient, and then realize only after getting into the sordid details of it that the combination of space efficiency, minimizing disk accesses, and the feasibility of caching the top part of the tree makes the tree approach more effective. It is the usual thing to first try to do hashing, and then by the time the details are worked out, to have a balanced tree. The cost of effectively handling bucket overflow just isn't less than the cost of balancing, unless the buckets are always all in RAM. Hashing is often a good solution when there is no non-random memory involved, such as when hashing a cache. The Linux dcache code uses hashing for accessing a cache of in-memory directory entries. Sometimes one uses partial or full hashing of keys within that balanced tree. If you do full hashing within a tree, and you cache the top part of that tree, you have something rather similar to extensible hashing, except it is more flexible and efficient. Sometimes programmers code using unbalanced trees. Most filesystems do essentially that.
Balanced trees generally do a better job of minimizing the average number of disk accesses. There is literature establishing that balanced trees are optimal for the worst case when there is no caching of the tree. This is rather pointless literature, as the average case when cached is what is important, and I am afraid that the existing literature proves that which is feasible to prove rather than that which is relevant. That said, practitioners know from experience that making the tree less balanced leads to more I/Os. Discussions of the exceptions to this are rather interesting but not for here.... I regret that I must assume that the reader is familiar with basic balanced tree algorithms [Wood], [Lewis and Denenberg], [Knuth], [McCreight]. No attempt will be made to survey tree design here since balanced trees are one of the most researched and complex topics in algorithm theory, and require treatment at length. I must compound this discourtesy with a concise set of definitions that sorely lack accompanying diagrams, my apologies. Finally, I'll truly annoy the reader by saying that the header files contain nice ascii art, and if you want full definition of the structures, the source is the place. Classically, balanced trees are designed with the set of keys assumed to be defined by the application, and the purpose of the tree design is to optimize searching through those keys. In my approach the purpose of the tree is to optimize the reference locality and space efficient packing of objects, and the keys are defined as best optimizes the algorithm for that. Keys are used in place of inode numbers in the file system, thereby choosing to substitute a mapping of keys to node location (the internal nodes) for a mapping of inode number to file location. Keys are longer than inode numbers, but one needs to cache fewer of them than one would need to cache inode numbers when more than one file is stored in a node. 
In my tree, I still require that a filename be resolved one component at a time. It is an interesting topic for future research whether this is necessary or optimal. This is more complex of an issue than a casual reader might realize: directory at a time lookup accomplishes a form of compression, makes mounting other name spaces and file system extensions simpler, makes security simpler, and makes future enhanced semantics simpler. Since small files typically lead to large directories, it is fortuitous that as a natural consequence of our use of tree algorithms, our directory mechanisms are much more effective for very large directories than most other file systems are (notable exceptions include [Holton and Das]). The tree has three node types: internal nodes, formatted nodes, and unformatted nodes. The contents of internal and formatted nodes are sorted in the order of their keys. (Unformatted nodes contain no keys.) Internal nodes consist of pointers to sub-trees separated by their delimiting keys. The key that precedes a pointer to a sub-tree is a duplicate of the first key in the first formatted node of that sub-tree. Internal nodes exist solely to allow determining which formatted node contains the item corresponding to a key. ReiserFS starts at the root node, examines its contents, and based on it can determine which subtree contains the item corresponding to the desired key. From the root node reiserfs descends into the tree, branching at each node, until it reaches the formatted node containing the desired item. The first (bottom) level of the tree consists of unformatted nodes, the second level consists of formatted nodes, and all levels above consist of internal nodes. The highest level contains the root node. The number of levels is increased as needed by adding a new root node at the top of the tree. 
All paths from the root of the tree to all formatted leaves are equal in length, and all paths to all unformatted leaves are also equal in length and 1 node longer than the paths to the formatted leaves. This equality in path length, and the high fanout it provides is vital to high performance, and in the Drops section I will describe how the lengthening of the path length that occurred as a result of introducing the [BLOB] approach (the use of indirect items and unformatted nodes) proved a measurable mistake. Formatted nodes consist of items. Items have four types: direct items, indirect items, directory items, and stat data items. All items contain a key which is unique to the item. This key is used to sort, and find, the item. Direct items contain the tails of files, and tails are the last part of the file (the last file_size modulo FS block size of a file). Indirect items consist of pointers to unformatted nodes. All but the tail of the file is contained in its unformatted nodes. Directory items contain the key of the first directory entry in the item followed by a number of directory entries. Depending on the configuration of reiserfs, stat data may be stored as a separate item, or it may be embedded in a directory entry. We are still benchmarking to determine which way is best. A file consists of a set of indirect items followed by a set of up to two direct items, with the existence of two direct items representing the case when a tail is split across two nodes. If a tail is larger than the maximum size of a file that can fit into a formatted node but is smaller than the unformatted node size (4k), then it is stored in an unformatted node, and a pointer to it plus a count of the space used is stored in an indirect item. Directories consist of a set of directory items. Directory items consist of a set of directory entries. Directory entries contain the filename and the key of the file which is named. 
There is never more than one item of the same item type from the same object stored in a single node (there is no reason one would want to use two separate items rather than combining them). The first item of a file or directory contains its stat data. When performing balancing, and analyzing the packing of the node and its two neighbors, we ensure that the three nodes cannot be compressed into two nodes. I feel greater compression than this is best left to an FS cleaner to perform rather than attempting it dynamically. = ReiserFS Structures = The ReiserFS tree has Max_Height = N (current default value: N = 5). The tree lives in disk blocks, and each disk block that belongs to the reiserfs tree begins with a block head. An internal node of the tree is the place for keys and pointers to disk blocks:
 Block_Head | Key 0 | Key 1 | Key 2 | --- | Key N | Pointer 0 | Pointer 1 | Pointer 2 | --- | Pointer N | Pointer N+1 | ..Free Space..
A leaf node of the tree is the place for the items and the item headers:
 Block_Head | IHead 0 | IHead 1 | IHead 2 | --- | IHead N | ...Free Space... | Item N | --- | Item 2 | Item 1 | Item 0
An unformatted node of the tree is the place for the data of a big file; it contains raw data only, with no block head. ReiserFS objects are files and directories. The maximum number of objects is 2^32 - 4 = 4,294,967,292. Each object is a set of items. File items:
# StatData item + [Direct item] (for small files: size from 0 bytes to MAX_DIRECT_ITEM_LEN = blocksize - 112 bytes)
# StatData item + InDirect item + [Direct item] (for big files: size > MAX_DIRECT_ITEM_LEN bytes)
Directory items:
# StatData item + Directory item
Every reiserfs object has an Object ID and a Key.
== Internal Node structures == An internal node of the tree is the place for keys and pointers to disk blocks:
 Block_Head | Key 0 | Key 1 | Key 2 | --- | Key N | Pointer 0 | Pointer 1 | Pointer 2 | --- | Pointer N | Pointer N+1 | ..Free Space..
struct block_head:
* blk_level (unsigned short, 2 bytes): level of the block in the tree (1 = leaf; 2, 3, 4, ... = internal)
* blk_nr_item (unsigned short, 2 bytes): number of keys in an internal block, or number of items in a leaf block
* blk_free_space (unsigned short, 2 bytes): free space in the block, in bytes
* blk_right_delim_key (struct key, 16 bytes): right delimiting key for this block (leaf nodes only)
* total: 6 bytes of fields (8 bytes as stored) for internal nodes; 22 bytes of fields (24 bytes as stored) for leaf nodes
struct key:
* k_dir_id (__u32, 4 bytes): ID of the parent directory
* k_object_id (__u32, 4 bytes): ID of the object (this is also the inode number)
* k_offset (__u32, 4 bytes): offset from the beginning of the object to the current byte of the object
* k_uniqueness (__u32, 4 bytes): type of the item (StatData = 0, Direct = -1, InDirect = -2, Directory = 500)
* total: 16 bytes
struct disk_child (pointer to a disk block):
* dc_block_number (unsigned long, 4 bytes): disk child's block number
* dc_size (unsigned short, 2 bytes): disk child's used space
* total: 6 bytes of fields (8 bytes as stored)
== Leaf Node structures == A leaf node of the tree is the place for the items and the item headers:
 Block_Head | IHead 0 | IHead 1 | IHead 2 | --- | IHead N | ...Free Space... | Item N | --- | Item 2 | Item 1 | Item 0
The leaf block head has the same struct block_head layout given above; for leaf nodes the total is 22 bytes of fields (24 bytes as stored), including the right delimiting key. Everything in the file system is stored as a set of items. Each item has its item_head. The item_head contains the key of the item, its free space (for indirect items), and the location of the item itself within the block.
struct item_head (IHead):
* ih_key (struct key, 16 bytes): key used to search for the item; all item headers are sorted by this key
* u.ih_free_space / u.ih_entry_count (__u16, 2 bytes): free space in the last unformatted node for an InDirect item; 0xFFFF for a Direct item; 0xFFFF for a StatData item; the number of directory entries for a Directory item
* ih_item_len (__u16, 2 bytes): total size of the item body
* ih_item_location (__u16, 2 bytes): offset of the item body within the block
* ih_reserved (__u16, 2 bytes): used by reiserfsck
* total: 24 bytes
There are 4 types of items: stat data items, directory items, indirect items, and direct items.
struct stat_data (the reiserfs version of the UFS disk inode, minus the address blocks):
* sd_mode (__u16, 2 bytes): file type and permissions
* sd_nlink (__u16, 2 bytes): number of hard links
* sd_uid (__u16, 2 bytes): owner id
* sd_gid (__u16, 2 bytes): group id
* sd_size (__u32, 4 bytes): file size
* sd_atime (__u32, 4 bytes): time of last access
* sd_mtime (__u32, 4 bytes): time the file was last modified
* sd_ctime (__u32, 4 bytes): time the inode (stat data) was last changed (except changes to sd_atime and sd_mtime)
* sd_rdev (__u32, 4 bytes): device
* sd_first_direct_byte (__u32, 4 bytes): offset from the beginning of the file to the first byte of the file's direct item: -1 for a directory; 1 for small files (the file has direct items only); >1 for big files (the file has indirect and direct items); -1 for big files that have indirect items but no direct item
* total: 32 bytes
A directory item is laid out as:
 deHead 0 | deHead 1 | deHead 2 | --- | deHead N | fileName N | --- | fileName 2 | fileName 1 | fileName 0
A direct item holds a small file body directly. An indirect item is an array of pointers to unformatted blocks:
 unfPointer 0 | unfPointer 1 | unfPointer 2 | --- | unfPointer N
Each unfPointer (4 bytes) points to an unformatted block; unformatted blocks contain the body of a big file.
struct reiserfs_de_head (deHead):
* deh_offset (__u32, 4 bytes): third component of the directory entry key (all reiserfs_de_heads are sorted by this value)
* deh_dir_id (__u32, 4 bytes): objectid of the parent directory of the object referenced by the directory entry
* deh_objectid (__u32, 4 bytes): objectid of the object referenced by the directory entry
* deh_location (__u16, 2 bytes): offset of the name within the whole item
* deh_state (__u16, 2 bytes): flags: 1) entry contains stat data (for the future); 2) entry is hidden (unlinked)
* total: 16 bytes
fileName is the name of the file, an array of bytes of variable length. The maximum length of a file name is blocksize - 64 (for a 4 KB blocksize, the maximum name length is 4032 bytes). = Using the Tree to Optimize Layout of Files = There are four levels at which layout optimization is performed: 1) the mapping of logical block numbers to physical locations on disk, 2) the assigning of nodes to logical block numbers, 3) the ordering of objects within the tree, and 4) the balancing of the objects across the nodes they are packed into. == Physical Layout == This is performed by the disk drive manufacturer for SCSI drives; for IDE drives this mapping of logical block numbers to physical locations is done by the device driver, and for all drives it is also potentially done by volume management software.
The logical block number to physical location mapping by the drive manufacturer is usually done using cylinders. I agree with the authors of [ext2fs] and most others that the significant file placement feature of FFS was not the actual cylinder boundaries, but placing files and their inodes on the basis of their parent directory location. FFS used explicit knowledge of actual cylinder boundaries in its design. I find that minimizing the distance in logical blocks of semantically adjacent nodes, without tracking cylinder boundaries, accomplishes an excellent approximation of optimizing according to actual cylinder boundaries, and I find its simplicity an aid to implementation elegance.

== Node Layout ==

When I place nodes of the tree on the disk, I search for the first empty block in the bitmap (of used block numbers), starting at the location of the left neighbor of the node in the tree ordering and moving in the direction I last moved in. This was experimentally found to be better than the following alternatives for the benchmarks employed: 1) taking the first non-zero entry in the bitmap; 2) taking the entry after the last one that was assigned, in the direction last moved in (this was 3% faster for writes and 10-20% slower for subsequent reads); 3) starting at the left neighbor and moving in the direction of the right neighbor.

When changing block numbers for the purpose of avoiding overwriting sending nodes before shifted items reach disk in their new recipient node (see the description of preserve lists later in the paper), the benchmarks employed were ~10% faster when starting the search from the left neighbor rather than from the node's current block number, even though it adds significant overhead to determine the left neighbor (the current implementation risks I/O to read the parent of the left neighbor).

It used to be that we would reverse direction when we reached the end of the disk drive.
Fortunately we checked to see whether it makes a difference which direction one moves in when allocating blocks to a file, and indeed we found that it made a significant difference to always allocate in the increasing block number direction. We hypothesize that this is due to matching the disk spin direction by allocating using increasing block numbers.

== Ordering within the Tree ==

While I give here an example of how I have defined keys to optimize locality of reference and packing efficiency, I would like to stress that key definition is a powerful and flexible tool that I am far from finished experimenting with. Some key definition decisions depend very much on usage patterns, and this means that someday one will select from several key definitions when creating the file system.

For example, consider the decision of whether to pack all directory entries together at the front of the file system, or to pack the entries near the files they name. For large file usage patterns one should pack all directory items together, since systems with such usage patterns are effective in caching the entries for all directories. For small files the name should be near the file. Similarly, for large files the stat data should be stored separately from the body, either with the other stat data from the same directory, or with the directory entry. (It was likely a mistake for me not to assign stat data its own key in the current implementation, as packing it in with direct and indirect items complicates our code for handling those items, and prevents me from easily experimenting with the effects of changing its key assignment.)

It is not necessary for a file's packing to reflect its name; that is merely my default. With each file my next release will offer the option of overriding the default by use of a system call.
It is feasible to pack an object completely independently of its semantics using these algorithms, and I predict that there will be many applications, perhaps even most, for which a packing different from that determined by object names is more appropriate. Currently the mandatory tying of packing locality and semantics results in the distortion of both semantics and packing from what might otherwise be their independent optimums, much as tying block boundaries to file boundaries distorts I/O and space allocation algorithms from their separate optimums. For example, placing most files accessed while booting in their access order at the start of the disk is a very tempting future optimization that the use of packing localities makes feasible to consider.

The Structure of a Key: Each file item has a key with structure <locality_id, object_id, offset, uniqueness>. The locality_id is by default the object_id of the parent directory. The object_id is the unique id of the file, and it is set to the first unused objectid when the object is created. The tendency of this to result in successive object creations in a directory being adjacently packed is fortuitous for many usage patterns.

For files, the offset is the offset within the logical object of the first byte of the item. In version 0.2 all directory entries had their own individual keys stored with them and were each distinct items; in the current version I store one key in the item, which is the key of the first entry, and compute each entry's key as needed from the one key stored in the item. For directories, the offset key component is the first four bytes of the filename, which you may think of as the lexicographic rather than numeric offset. For directory items the uniqueness field differentiates filename entries identical in the first 4 bytes.
For all item types it indicates the item type and for the leftmost item in a buffer it indicates whether the preceding item in the tree is of the same type and object as this item. Placing this information in the key is useful when analyzing balancing conditions, but increases key length for non-directory items, and is a questionable architectural feature. Every file has a unique objectid, but this cannot be used for finding the object, only keys are used for that. Objectids merely ensure that keys are unique. If you never use the reiserfs features that change an object's key then it is immutable, otherwise it is mutable. (This feature aids support for NFS daemons, etc.) We spent quite some time debating internally whether the use of mutable keys for identifying an object had deleterious long term architectural consequences: in the end I decided it was acceptable iff we require any object recording a key to possess a method for updating its copy of it. This is the architectural price of avoiding caching a map of objectid to location that might have very poor locality of reference due to objectids not changing with object semantics. I pack an object with the packing locality of the directory it was first created in unless the key is explicitly changed. It remains packed there even if it is unlinked from the directory. I do not move it from the locality it was created in without an explicit request, unlike the [C-FFS] approach which stores all multiple link files together and pays the cost of moving them from their original locations when the second link occurs. I think a file linked with multiple directories might as well get at least the locality reference benefits of one of those directories. In summary, this approach 1) places files from the same directory together, 2) places directory entries from the same directory together with each other and with the stat data for the directory. 
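The key ordering described above — compare packing locality, then object, then offset, then uniqueness — can be sketched in C. This is a minimal illustration using the field names from the struct key table earlier in the paper, not the actual reiserfs comparator:

```c
#include <stdint.h>

/* Sketch of the reiserfs key ordering.  Comparing k_dir_id (the
 * packing locality) first is what packs files from the same parent
 * directory together in the tree. */
struct key {
    uint32_t k_dir_id;      /* packing locality: parent directory's object id */
    uint32_t k_object_id;   /* unique id of the object itself */
    uint32_t k_offset;      /* byte offset (files) or first 4 name bytes (directories) */
    uint32_t k_uniqueness;  /* item type; disambiguates names equal in first 4 bytes */
};

static int cmp_u32(uint32_t a, uint32_t b)
{
    return (a > b) - (a < b);
}

/* Total order on keys: compare component by component. */
int key_cmp(const struct key *a, const struct key *b)
{
    int r;
    if ((r = cmp_u32(a->k_dir_id, b->k_dir_id)))
        return r;
    if ((r = cmp_u32(a->k_object_id, b->k_object_id)))
        return r;
    if ((r = cmp_u32(a->k_offset, b->k_offset)))
        return r;
    return cmp_u32(a->k_uniqueness, b->k_uniqueness);
}
```

Note that two objects created in the same directory sort adjacently no matter when they were created, which is the semantic (rather than temporal) packing the text argues for.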
Note that there is no interleaving of objects from different directories in the ordering at all, and that all directory entries from the same directory are contiguous. You'll note that this does not accomplish packing the files of small directories with common parents together, and does not employ the full partial ordering in determining the linear ordering; it merely uses parent directory information. I feel the proper place for employing full tree structure knowledge is in the implementation of an FS cleaner, not in the dynamic algorithms.

== Node Balancing Optimizations ==

When balancing nodes I do so according to the following ordered priorities:

1. minimize the number of nodes used
2. minimize the number of nodes affected by the balancing operation
3. minimize the number of uncached nodes affected by the balancing operation
4. if shifting to another formatted node is necessary, maximize the bytes shifted

Priority 4 is based on the assumption that the location of an insertion of bytes into the tree is an indication of the likely future location of an insertion, and that this policy will on average reduce the number of formatted nodes affected by future balance operations. There are more subtle effects as well: if one randomly places nodes next to each other, and one has a choice between those nodes being mostly moderately efficiently packed, or packed to an extreme of either well or poorly packed, one is more likely to be able to combine more of the nodes if one chooses the policy of extremism. Extremism is a virtue in space efficient node packing. The maximal shift policy is not applied to internal nodes, as extremism is not a virtue in time efficient internal node balancing.
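The ordered priorities above amount to a lexicographic comparison of candidate balancing plans. A hedged sketch follows; the struct and all names are illustrative, not reiserfs code:

```c
/* Hypothetical scoring of candidate balancing plans by the ordered
 * priorities: fewer nodes used, then fewer nodes touched, then fewer
 * uncached nodes touched, then (for formatted nodes) more bytes
 * shifted.  Purely an illustration of the lexicographic ordering. */
struct balance_plan {
    int nodes_used;        /* priority 1: minimize */
    int nodes_affected;    /* priority 2: minimize */
    int uncached_affected; /* priority 3: minimize */
    int bytes_shifted;     /* priority 4: maximize */
};

/* Returns <0 if a is the better plan, >0 if b is, 0 if tied. */
int plan_cmp(const struct balance_plan *a, const struct balance_plan *b)
{
    if (a->nodes_used != b->nodes_used)
        return a->nodes_used - b->nodes_used;
    if (a->nodes_affected != b->nodes_affected)
        return a->nodes_affected - b->nodes_affected;
    if (a->uncached_affected != b->uncached_affected)
        return a->uncached_affected - b->uncached_affected;
    return b->bytes_shifted - a->bytes_shifted; /* maximized, so reversed */
}
```

Because the comparison is lexicographic, a plan that uses one node fewer always wins, no matter how many more nodes it touches or how little it shifts.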
=== Drops ===

(The difficult design issues in the current version that our next version can do better.)

Consider dividing a file or directory into drops, with each drop having a separate key, and no two drops from one file or directory occupying the same node without being compressed into one drop. The key for each drop is set to the key for the object (file or directory) plus the offset of the drop within the object. For directories the offset is lexicographic and by filename; for files it is numeric and in bytes.

In the course of several file system versions we have experimented with and implemented solid, liquid, and air drops. Solid drops were never shifted, and drops would only solidify when they occupied the entirety of a formatted node. Liquid drops are shifted in such a way that any liquid drop which spans a node fully occupies the space in its node; like a physical liquid it is shiftable but not compressible. Air drops merely meet the balancing condition of the tree.

Reiserfs 0.2 implemented solid drops for all but the tail of files. If a file was at least one node in size, it would align the start of the file with the start of a node, block aligning the file. This block alignment of the start of multi-drop files was a design error that wasted space: even if the locality of reference is so poor as to make one not want to read parts of semantically adjacent files, if the nodes are near to each other then the cost of reading an extra block is thoroughly dwarfed by the cost of the seek and rotation to reach the first node of the file. As a result the block alignment saves little in time, though it costs significant space for 4-20k files.

Reiserfs with block alignment of multi-drop files and no indirect items experienced the following rather interesting behavior, which was partially responsible for making it only 88% space efficient for files that averaged 13k (the linux kernel) in size.
When the tail of a larger than 4k file was followed in the tree ordering by another file larger than 4k, since the drop before was solid and aligned, and the drop afterwards was solid and aligned, no matter what size the tail was, it occupied an entire node. In the current version we place all but the tail of large files into a level of the tree reserved for full unformatted nodes, and create indirect items in the formatted nodes which point to the unformatted nodes. This is known in the database literature as the [BLOB] approach. This extra level added to the tree comes at the cost of making the tree less balanced (I consider the unformatted nodes pointed to as part of the tree) and increasing the maximal depth of the tree by 1. For medium sized files, the use of indirect items increases the cost of caching pointers by mixing data with them. The reduction in fanout often causes the read algorithms to fetch only one node at a time of the file being read more frequently, as one waits to read the uncached indirect item before reading the node with the file data. There are more parents per file read with the use of indirect items than with internal nodes, as a direct result of reduced fanout due to mixing tails and indirect items in the node. The most serious flaw is that these reads of various nodes necessary to the reading of the file have additional rotations and seeks compared to the case with drops. With my initial drop approach they are usually sequential in their disk layout, even the tail, and the internal node parent points to all of them in such a way that all of them that are contained by that parent or another internal node in cache can be requested at once in one sequential read. Non-sequential reads of nodes are more than an order of magnitude more costly than sequential reads, and this single consideration dominates effective read optimization. 
Unformatted nodes make file system recovery faster and less robust, in that one reads their indirect item rather than the nodes themselves to insert them into the recovered tree, and one cannot read them to confirm that their contents are from the file that an indirect item says they are from. In this they make reiserfs similar to an inode based system without logging.

A moderately better solution would have been to simply eliminate the requirement for placement of the start of multi-node files at the start of nodes, rather than introducing BLOBs, and to depend on the use of a file system cleaner to optimally pack the 80% of files that don't move frequently, using algorithms that move even solid drops. Yet that still leaves the problem of formatted nodes not being efficient for mmap() purposes (one must copy them before writing, rather than merely modifying their page table entries, and memory bandwidth is expensive even if CPU is cheap).

For this reason I have the following plan for the next version. I will have three trees: one tree maps keys to unformatted nodes, one tree maps keys to formatted nodes, and one tree maps keys to directory entries and stat_data. Now it is only natural if you are thinking that that would mean that to read a file and access first the directory entry and stat_data, then the unformatted node, then the tail, one must hop long distances across the disk, going first to one tree and then to the other. This is indeed why it took me two years to realize it could be made to work.

My plan is to interleave the nodes of the three trees according to the following algorithm: block numbers are assigned to nodes when the nodes are created or preserved, and someday will be assigned when the cleaner runs. The choice of block number is based on first determining what other node it should be placed near, and then finding the nearest free block that can be found in the elevator's current direction.
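The nearest-free-block step can be sketched as a scan of the used-block bitmap from a hint position in the elevator's current direction. This is a minimal illustration, not the reiserfs allocator; the function names are mine:

```c
#include <stdint.h>

/* Sketch: scan a used-block bitmap from a hint block in one direction
 * until a free block is found.  The hint would be the block number of
 * the chosen neighbor node; dir is the elevator's current direction. */
static int bit_is_set(const uint8_t *bitmap, long bit)
{
    return (bitmap[bit / 8] >> (bit % 8)) & 1;
}

/* dir is +1 or -1; returns a free block number, or -1 if none found
 * before running off the end of the device in that direction. */
long find_free_block(const uint8_t *bitmap, long nblocks, long hint, int dir)
{
    for (long b = hint; b >= 0 && b < nblocks; b += dir)
        if (!bit_is_set(bitmap, b)) /* a set bit means the block is in use */
            return b;
    return -1;
}
```

A real allocator would also handle wrapping or reversing at the device boundary; as the text notes, reiserfs found always moving in the increasing block number direction to perform best.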
Currently we use the left neighbor of the node in the tree as the node it should be placed near. This is nice and simple. Oh well. Time to create a virtual neighbor layer.

The new scheme will continue to first determine the node it should be placed near, and then start the search for an empty block from that spot, but it will use a more complicated determination of what node to place it near. This method will cause all nodes from the same packing locality to be near each other, will cause all directory entries and stat_data to be grouped together within that packing locality, and will interleave formatted and unformatted nodes from the same packing locality. Pseudo-code is best for describing this:

 /* For use by reiserfs_get_new_blocknrs when determining where in the
    bitmap to start the search for a free block, and for use by the
    read-ahead algorithm when there are not enough nodes to the right
    and in the same packing locality for packing locality read-ahead
    purposes. */
 get_logical_layout_left_neighbors_blocknr(key of current node)
 {
     /* Based on examination of the current node's key and type, find
        the virtual neighbor of that node. */
     if body node
         if first body node of file
             if (node in tail tree whose key is less but is in same packing locality exists)
                 return blocknr of such node with largest key
             else
                 find node with largest key less than key of current node in stat_data tree
                 return its blocknr
         else
             return blocknr of node in body tree with largest key less than key of current node
     else if tail node
         if (node in body tree belonging to same file as first tail of current node exists)
             return its blocknr
         else if (node in tail tree with lesser delimiting key but same packing locality exists)
             return blocknr of such node with largest delimiting key
         else
             return blocknr of node with largest key less than key of current node in stat_data tree
     else /* is stat_data tree node */
         if stat_data node with lesser key from same packing locality exists
             return blocknr of such node with largest key
         else
             /* no node from same packing locality with lesser key exists */
 }

 /* For use by packing locality read-ahead. */
 get_logical_layout_right_neighbors_blocknr(key of current node)
 {
     right-handed version of get_logical_layout_left_neighbors_blocknr logic
 }

It is my hope that this will improve caching of pointers to unformatted nodes, plus improve caching of directory entries and stat_data, by separating them from file bodies to a greater extent. I also hope that it will improve read performance for 1-10k files, and that it will allow us to do this without decreasing space efficiency.

=== Code Complexity ===

I thought it appropriate to mention some of the notable effects of simple design decisions on our implementation's code length. When we changed our balancing algorithms to shift parts of items rather than only whole items, so as to pack nodes tighter, this had an impact on code complexity. Another multiplicative determinant of balancing code complexity was the number of item types: introducing indirect items doubled it, and changing directory items from being liquid drops to being air drops also increased it.
Storing stat data in the first direct or indirect item of the file complicated the code for processing those items more than if I had made stat data its own item type. When one finds oneself with an NxN coding complexity issue, it usually indicates the need for adding a layer of abstraction. The NxN effect of the number of item types on balancing code complexity is an instance of that design principle, and we will address it in the next major rewrite. The balancing code will employ a set of item operations which all item types must support. The balancing code will then invoke those operations without caring to understand any more of the meaning of an item's type than that it determines which item specific operation handler is called. Adding a new item type, say a compressed item, will then merely require writing a set of item operations for that item, rather than requiring modifying most parts of the balancing code as it does now.

We now feel that the function which determines what resources are needed to perform a balancing operation, fix_nodes(), might as well be written to decide what operations will be performed during balancing, since it pretty much has to do so anyway. That way, the function that performs the balancing with the nodes locked, do_balance(), can be gutted of most of its complexity.

= Buffering & the Preserve List =

We implemented for version 0.2 of our file system a system of write ordering that tracked all shifting of items in the tree, and ensured that no node that had had an item shifted from it was written before the node that had received the item was written. This is necessary to prevent a system crash from causing the loss of an item that might not be recently created. This tracking approach worked, and the overhead it imposed was not measurable in our benchmarks. When in the next version we changed to partially shifting items and increased the number of item types, this code grew out of control in its complexity.
I decided to replace it with a scheme far simpler to code that was also more effective in typical usage patterns. This scheme was as follows: if an item is shifted from a node, change the block that its buffer will be written to. Change it to the nearest free block to the old block's left neighbor, and rather than freeing the old block, place its number on a ``preserve list''. (Saying nearest is slightly simplistic, in that the blocknr assignment function moves from the left neighbor in the direction of increasing block numbers.) When a ``moment of consistency'' is achieved, free all of the blocks on the preserve list. A moment of consistency occurs when there are no nodes in memory into which objects have been shifted (this could be made more precise, but then it would be more complex). If disk space runs out, force a moment of consistency to occur. This is sufficient to ensure that the file system is recoverable.

Note that during the large file benchmarks the preserve list was freed several times in the middle of the benchmark. The percentage of buffers preserved is small in practice except during deletes, and one can arrange for moments of consistency to occur as frequently as one wants to. Note that I make no claim that this approach is better than the Soft Updates approach employed by [Ganger] or by us in version 0.2; I merely note that tracking the order of writes is more complex than this approach for balanced trees which partially shift items. We may go back to the old approach some day, though not to the code that I threw out. Preserve lists substantially hamper performance for files in the 1-10k size range; we are re-evaluating them. Ext2fs avoids the metadata shifting problem by never shrinking directories, and by using fixed inode space allocations.
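The preserve-list mechanism described above can be sketched in a few lines. This is a minimal single-threaded illustration of the idea, not the reiserfs implementation; the names are mine:

```c
#include <stdlib.h>

/* Sketch of a preserve list: when a node's items are shifted to a new
 * block, the old block number is kept here instead of being freed, so
 * the old on-disk copy survives until the recipient reaches disk. */
struct preserved {
    long blocknr;
    struct preserved *next;
};

static struct preserved *preserve_list;

/* Called when a node's items were shifted elsewhere: its old block
 * must not be reused until a moment of consistency. */
void preserve_block(long blocknr)
{
    struct preserved *p = malloc(sizeof *p);
    p->blocknr = blocknr;
    p->next = preserve_list;
    preserve_list = p;
}

/* Called at a moment of consistency (no node in memory still holds
 * shifted items): release every preserved block.  Returns how many
 * blocks were released. */
long release_preserved(void)
{
    long n = 0;
    while (preserve_list) {
        struct preserved *p = preserve_list;
        preserve_list = p->next;
        /* here the real code would clear p->blocknr in the used-block bitmap */
        free(p);
        n++;
    }
    return n;
}
```

If free space runs out, the file system forces a moment of consistency and then calls the release step, exactly as the text describes.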
= Lessons From Log Structured File Systems =

Many techniques from other file systems haven't been applied, primarily so as to satisfy my goal of giving reiserfs 1.0 only the minimum feature set necessary to be useful, and will appear in later releases. Log Structured File Systems [Rosenblum and Ousterhout] embody several such techniques, which I will describe after I mention two concerns with that approach:

* With small object file systems it is not feasible to cache in RAM a map of objectid to location for every object, since there are too many objects. This is an inherent problem in using temporal packing rather than semantic packing for small object file systems. With my approach the internal nodes are the equivalent of this objectid to location map, but internal node total size is proportional to the number of nodes rather than the number of objects. You can think of internal nodes as a compression of object location information made effective by the existence of an ordering function; this compression is both essential for small files and a major feature of my approach.
* I like obtaining good though not ideal semantic locality without paying a cleaning cost for active data. This is a less critical concern.

I frequently find myself classifying packing and layout optimizations as either appropriate for implementing dynamically or appropriate only for a cleaner. Optimizations whose computational overhead is large compared to their benefit tend to be appropriate for implementation in a cleaner, and a cleaner's benefits mostly impact the static portion of the file system (which typically consumes ~80% of the space). Such objectives as 100% packing efficiency, exactly ordering block layout by semantic order, using the full semantic tree rather than the parent directory in determining semantic order, and compression are all best implemented by cleaner approaches.
In summary, there is much to be learned from the LFS approach, and as I move past my initial objective of supplying a minimal-feature, higher-performance FS I will apply some of those lessons. In the Preserve Lists section I speculate on the possibilities for a fastboot implementation that would merge the better features of preserve lists and logging.

= Directions For the Future =

To go one more order of magnitude smaller in file size will require adding functionality to the file system API, though it will not require discarding upward compatibility. The use of an exokernel is a better approach to small files if it is an option available to the OS designer; it is not currently an option for Linux users. In the future reiserfs will add such features as lightweight files, in which stat_data other than size is inherited from a parent if it is not created individually for the file; an API for reading and writing files without requiring the overhead of file handles and open(); set-theoretic semantics; and many other features that you would expect from researchers who expect to be able to do all that they could do in a database, in the file system, and never really did understand why not.

= Conclusion =

Balanced tree file systems are inherently more space efficient than block allocation based file systems, with the differences reaching order of magnitude levels for small files. While other aspects of design will typically have a greater impact on performance for large files, in direct proportion to the smallness of the file the use of balanced trees offers performance advantages. A moderate advantage was found for large files. Coding cost is mostly in the interfaces, and it is a measure of the OS designer's skill whether those costs are low in the OS. We make it possible for an OS designer to use the same interface for large and small objects, and thereby reduce interface coding cost.
This approach is a new tool available to the OS designer for increasing the expressive power of all of the components in the OS through better name space integration.

Researchers interested in collaborating or just using my work will find me friendly. I tailor the framework of my collaborations to the needs of those I work with. I GPL reiserfs so as to meet the needs of academic collaborators. While that makes it unusable without a special license for commercial OSes, commercial vendors will find me friendly in setting up a commercial framework for collaboration, with commercial needs provided for.

= Acknowledgments =

Hans Reiser was the project initiator, primary architect, supplier of funding, and one of the programmers. Some folks at times remark that naming the filesystem Reiserfs was egotistic. It was so named after a potential investor hired all of my employees away from me, then tried to negotiate better terms for his possible investment, and suggested that he could arrange for 100 researchers to swear in Russian court that I had had nothing to do with this project. That business partnership did not work out.

Vladimir Saveljev, while he did not author this paper, worked long hours writing the largest fraction of the lines of code in the file system, and is remarkably gifted at just making things work. Thanks, Vladimir. Anatoly Pinchuk wrote much of the core balancing code, and too much of the rest to list here. Thanks, Anatoly.

It is the policy of the Naming System Venture that if someone quits before project completion, and then takes strong steps to try to prevent others from finishing the project, they shall not be mentioned in the acknowledgments. This was all quite sad, and best forgotten.

I would like to thank Alfred Ajlamazyan for his generosity in providing overhead at a time when his institute had little it could easily spare.
Grigory Zaigralin is thanked for his work in making the machines run, administering the money, and being his usual determined-to-be-useful self. Igor Chudov, thanks for such effective procurement and hardware maintenance work. Eirik Fuller is thanked for his help with NFS and porting to 2.1. I would like to thank Remi Card for the superb block allocation based file system (ext2fs) that I depended on for so many years, and that allowed me to benchmark against the best. Linus Torvalds, thank you for Linux.

= Business Model and Licensing =

I personally favor performing a balance of commercial and public works in my life. I have no axe to grind against software that is charged for, and no regrets at making reiserfs freely available to Linux users. This project is GPL'd, but I sell exceptions to the GPL to commercial OS vendors and file server vendors. It is not usable to them without such exceptions, and many of them are wise enough to understand that:

* the porting and integration service we are able to provide with the licensing is by itself worth what we charge,
* these services impact their time to market,
* and the relationship spreads the development costs across more OS vendors than just them alone.

I expect that Linux will prove to be quite effective in market sampling my intended market, but if you suspect that I also like seeing more people use it even if it is free to them, oh well. I believe it is not so much the cost that has made Linux so successful as it is the openness. Linux is a decentralized economy with honor and recognition as the currency of payment (and thus there is much honor in it). Commercial OS vendors are, at the moment, all closed economies, and doomed to fall in their competition with open economies just as communism eventually fell.
At some point an OS vendor will realize that if it: * opens up its source code to decentralized modification, * systematically rewards those who perform the modifications that are proven useful, * systematically merges/integrates those modifications into its branded primary release branch while adding value as the integrator, that it will acquire both the critical mass of the internet development community, and the aggressive edge that no large communal group (such as a corporation) can have. Rather than saying to any such vendor that they should do this now, let me simply point out that whoever is first will have an enormous advantage..... Since I have more recognition than money to pass around as reward, my policy is to tend to require that those who contribute substantial software to this project have their names attached to a user visible portion of the project. This official policy helps me deal with folks like Vladimir, who was much too modest to ever name the file system checker vsck without my insisting. Smaller contributions are to be noted in the source code, and the acknowledgements section of this paper. If you choose to contribute to this file system, and your work is accepted into the primary release, you should let me know if you want me to look for opportunities to integrate you into contracts from commercial vendors. Through packaging ourselves as a group, we are more marketable to such OS vendors. Many of us have spent too much time working at day jobs unrelated to our Linux work. This is too hard, and I hope to make things easier for us all. If you like this business model of selling GPL'd component software with related support services, but you write software not related to this file system, I encourage you to form a component supplier company also. Opportunities may arise for us to cooperate in our marketing, and I will be happy to do so. = References = G.M. Adel'son-Vel'skii and E.M. 
Landis, An algorithm for the organization of information, Soviet Math. Doklady 3, 1259-1262, 1962. This paper on AVL trees can be thought of as the founding paper of the field of storing data in trees. Those not conversant in Russian will want to read the [Lewis and Denenberg] treatment of AVL trees in its place. [Wood] contains a modern treatment of trees. [Apple] Inside Macintosh, Files, by Apple Computer Inc., Addison-Wesley, 1992. Employs balanced trees for filenames; it was an interesting file system architecture for its time in a number of ways, but its problems with internal fragmentation have become more severe as disk drives have grown larger, and the code has not received sufficient further development. [Bach] Maurice J. Bach, ``The Design of the Unix Operating System'', 1986, Prentice-Hall Software Series, Englewood Cliffs, NJ. Superbly written but sadly dated; contains detailed descriptions of the file system routines and interfaces in a manner especially useful for those trying to implement a Unix compatible file system. See [Vahalia]. [BLOB] R. Haskin, Raymond A. Lorie: On Extending the Functions of a Relational Database System. SIGMOD Conference 1982: 207-212 (body of paper not on web). See the Drops section for a discussion of how this approach makes the tree less balanced, and the effect that has on performance. [Chen] Chen, P.M., Patterson, David A., A New Approach to I/O Performance Evaluation---Self-Scaling I/O Benchmarks, Predicted I/O Performance, 1993 ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems; also available on Chen's web page. [C-FFS] Ganger, Gregory R., Kaashoek, M. Frans. A very well written paper focused on 1-10k file size issues; they use some similar notions (most especially their concept of grouping compared to my packing localities). Note that they focus on the 1-10k file size range, and not the sub-1k range. The 1-10k range is the weak point in reiserfs performance.
[ext2fs] by Remi Card; extensive information and source code are available. When you consider how small this file system is (~6000 lines), its effectiveness becomes all the more remarkable. [FFS] M.K. McKusick, W.N. Joy, S.J. Leffler, and R.S. Fabry. A fast file system for UNIX. ACM Transactions on Computer Systems, 2(3):181--197, August 1984. Describes the implementation of a file system which employs parent directory location knowledge in determining file layout. It uses large blocks for all but the tail of files to improve I/O performance, and uses small blocks called fragments for the tails so as to reduce the cost due to internal fragmentation. Numerous other improvements are also made to what was once the state-of-the-art. FFS remains the architectural foundation for many current block allocation file systems, and was later bundled with the standard Unix releases. Note that unrequested serialization and the use of fragments place it at a performance disadvantage to ext2fs, though whether ext2fs is thereby made less reliable is a matter of dispute that I take no position on (reiserfs uses preserve lists; forgive my egotism in thinking that it is enough work for me to ensure that reiserfs solves the recovery problem, and to perhaps suggest that ext2fs would benefit from the use of preserve lists when shrinking directories). [Ganger] Gregory R. Ganger, Yale N. Patt, ``Metadata Update Performance in File Systems'' (abstract only). [Gifford] (postscript only). Describes a file system enriched to have more than hierarchical semantics; he shares many goals with this author, forgive me for thinking his work worthwhile. If I had to suggest one improvement in a sentence, I would say his semantic algebra needs closure. [Hitz, Dave] http://www.netapp.com/technology/level3/3002.html A rather well designed file system optimized for NFS and RAID in combination. Note that RAID increases the merits of write-optimization in block layout algorithms.
[Holton and Das] Holton, Mike, Das, Raj: ``The XFS space manager and namespace manager use sophisticated B-Tree indexing technology to represent file location information contained inside directory files and to represent the structure of the files themselves (location of information in a file).'' Note that it is still a block (extent) allocation based file system; no attempt is made to store the actual file contents in the tree. It is targeted at the needs of the other end of the file size usage spectrum from reiserfs, and is an excellent design for that purpose (and I would concede that reiserfs 1.0 is not suitable for their real-time large I/O market). SGI has also traditionally been a leader in resisting the use of unrequested serialization of I/O. Unfortunately, the paper is a bit vague on details, and source code is not freely available. [Howard] ``Scale and Performance in a Distributed File System'', Howard, J.H., Kazar, M.L., Menees, S.G., Nichols, D.A., Satyanarayanan, M., Sidebotham, R.N., West, M.J., ACM Transactions on Computer Systems, 6(1), February 1988. A classic benchmark; it was too CPU bound for both ext2fs and reiserfs. [Knuth] Knuth, D.E., The Art of Computer Programming, Vol. 3 (Sorting and Searching), Addison-Wesley, Reading, MA, 1973. The earliest reference discussing trees storing records of varying length. [LADDIS] Wittle, Mark, and Bruce, Keith, ``LADDIS: The Next Generation in NFS File Server Benchmarking'', Proceedings of the Summer 1993 USENIX Conference, July 1993, pp. 111-128. [Lewis and Denenberg] Lewis, Harry R., Denenberg, Larry, ``Data Structures & Their Algorithms'', HarperCollins Publishers, NY, NY, 1991. An algorithms textbook suitable for readers wishing to learn about balanced trees and their AVL predecessors. [McCreight] McCreight, E.M., Pagination of B*-trees with variable length records, Commun. ACM 20 (9), 670-674, 1977. Describes algorithms for trees with variable length records.
[McVoy and Kleiman] The implementation of write-clustering for Sun's UFS. [OLE] ``Inside OLE'' by Kraig Brockschmidt; discusses Structured Storage. http://www.microsoft.com/mspress/books/abs/5-843-2b.htm (abstract only). [Ousterhout] J.K. Ousterhout, H. Da Costa, D. Harrison, J.A. Kunze, M.D. Kupfer, and J.G. Thompson. A trace-driven analysis of the UNIX 4.2BSD file system. In Proceedings of the 10th Symposium on Operating Systems Principles, pages 15--24, Orcas Island, WA, December 1985. [NTFS] ``Inside the Windows NT File System'', Microsoft Press, 1994. The book is written by Helen Custer; NTFS is architected by Tom Miller with contributions by Gary Kimura, Brian Andrew, and David Goebel. An easy to read little book. They fundamentally disagree with me on adding serialization of I/O not requested by the application programmer, and I note that the performance penalty they pay for their decision is high, especially compared with ext2fs. Their FS design is perhaps optimal for floppies and other hardware eject media beyond OS control. A less serialized, higher performance log structured architecture is described in [Rosenblum and Ousterhout]. That said, Microsoft is to be commended for recognizing the importance of attempting to optimize for small files, and leading the OS designer effort to integrate small objects into the file name space. This book is notable for not referencing the work of persons not working for Microsoft, or providing any form of proper attribution to previous authors such as [Rosenblum and Ousterhout]. [Peacock] K. Peacock, ``The CounterPoint Fast File System'', Proceedings of the Usenix Conference, Winter 1988. [Pike] Rob Pike and Peter Weinberger, The Hideous Name, USENIX Summer 1985 Conference Proceedings, pp. 563, Portland, Oregon, 1985. Short, informal, and drives home why inconsistent naming schemes in an OS are detrimental.
http://achille.cs.bell-labs.com/cm/cs/doc/85/1-05.ps.gz His discussion of naming in Plan 9: http://plan9.bell-labs.com/plan9/doc/names.html [Rosenblum and Ousterhout] ``The Design and Implementation of a Log-Structured File System'', Mendel Rosenblum and John K. Ousterhout, February 1992 ACM Transactions on Computer Systems. This paper was quite influential in a number of ways on many modern file systems, and the notion of using a cleaner may be applied to a future release of reiserfs. There is an interesting on-going debate over the relative merits of FFS vs. LFS architectures, and the interested reader may peruse http://www.scriptics.com/people/john.ousterhout/seltzer93.html and the arguments by Margo Seltzer it links to. [Snyder] ``tmpfs: A Virtual Memory File System''. Discusses a file system built to use swap space and intended for temporary files; due to a complete lack of disk synchronization it offers extremely high performance. [Vahalia] Uresh Vahalia, ``Unix Kernel Internals''

[[category:ReiserFS]] [[category:Formatting-fixes-needed]]

{{wayback|http://www.namesys.com/X0reiserfs.html|2006-11-13}}

Three reasons why ReiserFS is great for you (Last Update: 2002, Hans Reiser):
# ReiserFS has fast journaling, which means that you don't spend your life waiting for fsck every time your laptop battery dies, or the UPS for your mission critical server gets its batteries disconnected accidentally by the UPS company's service crew, or your kernel was not as ready for prime time as you hoped, or the silly thing decides you mounted it too many times today.
# ReiserFS is based on fast balanced trees. Balanced trees are more robust in their performance, and are a more sophisticated algorithmic foundation for a file system.
When we started our project, there was a consensus in the industry that balanced trees were too slow for file system usage patterns. We proved that if you just do them right they are better--take a look at the benchmarks. We have fewer worst case performance scenarios than other file systems and generally better overall performance. If you put 100,000 files in one directory, we think it's fine; many other file systems try to tell you that you are wrong to want to do it.
# ReiserFS is more space efficient. If you write 100-byte files, we pack many of them into one block. Other file systems put each of them into their own block. We don't have fixed space allocation for inodes. That saves 6% of your disk.
Ok, it's time to fess up. The interesting stuff is still in the future. Because they are nifty, we are going to add database and hypertext like features into the file system. Only by using balanced trees, with their effective handling of small files (database small fields, hypertext keywords), as our technical foundation can we hope to do this. That was our real motivation. As for performance, we may already be slightly better than the traditional file systems (and substantially better than the journaling ones). But they have been tweaking for decades, while we have just got started. This means that over the next few years we are going to improve faster than they are. Speaking more technically: ReiserFS is a file system using a plug-in based object oriented variant on classical balanced tree algorithms. The results when compared to the ext2fs conventional block allocation based file system, running under the same operating system and employing the same buffering code, suggest that these algorithms are overall more efficient and every passing month are becoming yet more so. Loosely speaking, every month we find another performance cranny that needs work; we fix it. And every month we find some way of improving our overall general usage performance.
The improvement in small file space and time performance suggests that we may now revisit a common OS design assumption that one should aggregate small objects using layers above the file system layer. Being more effective at small files does not make us less effective for other files. This is truly a general purpose FS. Our overall traditional FS usage performance is high enough to establish that. ReiserFS has a commitment to opening up the FS design to contributions; we are now adding plug-ins so that you can create your own types of directories and files.

= Introduction =

The author is one of many OS researchers who are attempting to unify the name spaces in the operating system in varying ways (e.g. [http://plan9.bell-labs.com/sys/doc/names.html Pike, The Use of Name Spaces in Plan9]). None of us are well funded compared with the size of the task, and I am far from an exception to this rule. The natural consequence is that we each have attacked one small aspect of the task. My contribution is in incorporating small objects into the file system name space effectively. This implementation offers value to the average Linux user, in that it offers generally good performance compared to the current Linux file system known as ext2fs. It also saves space to an extent that is important for some applications, and convenient for most. It does extremely well for large directories, and has a variety of minor advantages. Since ext2fs is very similar to FFS and UFS in performance, the implementation also offers potential value to commercial OS vendors who desire greater than ext2fs performance without directory size issues, and who appreciate the value of a better foundation for integrating name spaces throughout the OS.

= Why Is There A Move Among Some OS Designers Towards Unifying Name Spaces? =

An operating system is composed of components that access other components through interfaces.
Operating systems are complex enough that, like national economies, the architect cannot centrally plan the interactions of the components that they are composed of. The architect can provide a structural framework that has a marked impact on the efficiency and utility of those interactions. Economists have developed principles that govern large economic systems. Are there system principles that we might use to start a discussion of the ways increasing component interactivity via naming system design impacts the total utility of an operating system? I propose these:
* If one increases the number of other components that a particular component can interact with, one increases its expressive power and thereby its utility.
* One can increase the number of other components that a particular component can interact with either by increasing the number of interfaces it has, or by increasing the number of components that are accessible by its current interfaces.
* The cost of component interfaces dominates software design cost, like the cost of wires dominates circuit design cost.
* Total system utility tends to be proportional not to the number of components, but to the number of possible component interactions.
It is not simply the number of components that one has that determines an OS's expressive power, it is the number of opportunities to use them that determines it. The number of these opportunities is proportional to the number of possible combinations of them, and the number of possible combinations of them is determined by their connectedness. Component connectedness in OS design is determined by name space design, to much the same extent that buses determine it in circuit design. Allow me to illustrate the impact of these principles with the use of an imaginary example. Suppose two imaginary OS vendors with equally talented programmers hire two different OS architects.
Suppose one of the architects centers the OS design around a single name space design that allows all of the components to access all other components via a single interface (assume this is possible, it is a theoretical example). Suppose the other allows the ten different design groups in the company that are developing components to create their own ten name spaces. Suppose that the unified name space OS architect has half of the resources of the fragmented name space OS architect and creates half as many components. While the number of components is half as large, the number of connections is (1/2)^2 / ((1/10)^2 * 10) = 2.5 times larger. If you accept my hypothesis that utility is more proportional to connections than components, then the unified operating system with half the development cost will still offer more expressive utility. That is a powerful motivation. To return briefly to the long ago researched principles governing another member of the class of large systems, the economies of nations, it is perhaps interesting to note that Adam Smith in ``The Wealth of Nations'' engaged in substantial study of the link between the extent of interconnectedness and the development of civilization, where the extent of interconnectedness was determined by waterways, etc. The link he found for economic systems was no less crucial than what is being suggested here for the effect of component interconnectedness on the total utility of software systems. I suggest that I am merely generalizing a long established principle from another field of science, namely that total utility in large systems with components that interact to generate utility is determined by the extent of their interconnection. There are many exceptions to these principles: not all chips on a motherboard sit on the bus, and analogous considerations apply to both OS design and the economies of nations.
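The arithmetic in the two-vendor example can be checked with a short sketch (the component count N is hypothetical; the 2.5 ratio is independent of it):

```python
# Sketch of the two-vendor connection-count comparison (illustrative only).
N = 1000  # hypothetical total components built by the fragmented vendor

# Unified vendor: half the resources, so N/2 components, each able to
# interact with every other component through the single name space.
unified = (N / 2) ** 2

# Fragmented vendor: ten name spaces of N/10 components each; components
# interact only within their own name space.
fragmented = 10 * (N / 10) ** 2

ratio = unified / fragmented
print(ratio)  # 2.5
```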
I hope the reader will accept that space considerations make it appropriate to gloss over these, and will consider the central point that under some circumstances unifying name spaces in a design can dramatically improve the utility of an OS. That can be an enormous motivation, and it has moved a number of OS researchers in their work [e.g. Pike, ``The Use of Name Spaces in Plan9'' and ``The Hideous Name'', http://magnum.cooper.edu:9000/~rp/html/rob.html]. Unfortunately, it is not a small technical effort to combine name spaces. To combine 10 name spaces requires, if not the effort to create 10 name spaces, perhaps an effort equivalent to creating 5 of the name spaces. Usually each of the name spaces has particular performance and semantic power requirements that require enhancing the unified name space, and it usually requires technical innovation to combine the advantages of each of the separated name spaces into a unified name space. I would characterize none of the research groups currently approaching this unification problem as having funding equivalent to what went into creating 5 of the name spaces they would like to unify, and we are certainly no exception. For this reason I have picked one particular aspect of this larger problem for our focus: allowing small objects to effectively share the same file system interface that large objects use currently. As operating systems increase the number of their components, the higher development cost of a file system able to handle small files becomes more worth the multiplicative effect it has on OS utility, as well as its reduction of OS component interface cost.

= Should File Boundaries Be Block Aligned? =

Making file boundaries block aligned has a number of effects: it minimizes the number of blocks a file is spread across (which is especially beneficial for multiple block files when locality of reference across files is poor), it wastes disk and buffer space in storing every less than fully packed block, it wastes I/O bandwidth with every access to a less than fully packed block when locality of reference is present, it increases the average number of block fetches required to access every file in a directory, and it results in simpler code. The simpler code of block aligning file systems follows from not needing to create a layering to distinguish the units of the disk controller and buffering algorithms from the units of space allocation, and from not needing to optimize the packing of nodes as is done in balanced tree algorithms. For readers who have not been involved in balanced tree implementations, algorithms of this class are notorious for being much more work to implement than one would expect from their description. Sadly, they also appear to offer the highest performance solution for small files, once I remove certain simplifications from their implementation and add certain optimizations common to file system designs. I regret that code complexity (30k lines) is a major disadvantage of the approach compared to the 6k lines of the ext2fs approach. I started our analysis of the problem with an assumption that I needed to aggregate small files in some way, and that the question was, which solution was optimal? The simplest solution was to aggregate all small files in a directory together into either a file or the directory. But any aggregation into a file or directory wastes part of the last block in the aggregation. What does one do if there are only a few small files in a directory, aggregate them into the parent of the directory?
What if there are only a few small files in a directory at first, and then there are many small files? How do I decide what level to aggregate them at, and when to take them back from a parent of a directory and store them directly in the directory? As we did our analysis of these questions we realized that this problem was closely related to the balancing of nodes in a balanced tree. The balanced tree approach, by using an ordering of files which are then dynamically aggregated into nodes at a lower level, rather than a static aggregation or grouping, avoids this set of questions. In my approach I store both files and filenames in a balanced tree, with small files, directory entries, inodes, and the tail ends of large files all being more efficiently packed as a result of relaxing the requirements of block alignment, and eliminating the use of a fixed space allocation for inodes. I have a sophisticated and flexible means for arranging for the aggregation of files for maximal locality of reference, through defining the ordering of items in the tree. The body of large files is stored in unformatted nodes that are attached to the tree but isolated from the effects of possible shifting by the balancing algorithms. Approaches such as [Apple] and [Holton and Das] have stored filenames but not files in balanced trees. None of the file systems C-FFS, NTFS, or XFS aggregate files; all of them block align files, though all of them also do some variation on storing small files in the statically allocated block address fields of inodes if the files are small enough to fit there. [C-FFS] has published an excellent discussion of both their approach and why small files rob a conventional file system of performance more in proportion to the number of small files than the number of bytes consumed by small files. However, I must note that their notion of what constitutes small is different from ours by one or two orders of magnitude.
Their use of an exo-kernel is simply an excellent approach for operating systems that have that as an available option. Semantics (files), packing (blocks/nodes), caching (read-ahead sizes, etc.), and the hardware interfaces of disk (sectors) and paging (pages) all have different granularity issues associated with them: a central point of our approach is that the optimal granularity of these often differs, and abstracting these into separate layers in which the granularity of one layer does not unintentionally impact other layers can improve space/time performance. Reiserfs innovates in that its semantic layer often conveys to the other layers an ungranulated ordering rather than one granulated by file boundaries. The reader is encouraged to note the areas in which reiserfs needs to go farther in doing so while reading the algorithms.

= Balanced Trees and Large File I/O =

There has long been an odd informal consensus that balanced trees are too slow for use in storing large files, perhaps originating in the performance of databases that have attempted to emulate file systems using balanced tree algorithms that were not originally architected for file system access patterns or their looser serialization requirements. It is hopefully easy for the reader to understand that storing many small files and tail ends of files in a single node where they can all be fetched in one I/O leads directly to higher performance. Unfortunately, it is quite complex to understand the interplay between I/O efficiency and block size for larger files, and space does not allow a systematic review of traditional approaches.
The reader is referred to [FFS], [Peacock], [McVoy], [Holton and Das], [Bach], [OLE], and [NTFS] for treatments of the topic, and discussions of various means of 1) reducing the effect of block size on CPU efficiency, 2) eliminating the need for inserting rotational delay between successive blocks, 3) placing small files into either inodes or directories, and 4) performing read-ahead. More commentary on these is in the annotated bibliography. Reiserfs has the following architectural weaknesses that stem directly from the overhead of repacking to save space and increase block size: 1) when the tail (files < 4k are all tail) of a file grows large enough to occupy an entire node by itself it is removed from the formatted node(s) it resides in, and it is converted into an unformatted node ([FFS] pays a similar conversion cost for fragments), 2) a tail that is smaller than one node may be spread across two nodes which requires more I/O to read if locality of reference is poor, 3) aggregating multiple tails into one node introduces separation of file body from tail, which reduces read performance ([FFS] has a similar problem, and for reiserfs files near the node in size the effect can be significant), 4) when you add one byte to a file or tail that is not the last item in a formatted node, then on average half of the whole node is shifted in memory. If any of your applications perform I/O in such a way that they generate many small unbuffered writes, reiserfs will make you pay a higher price for not being able to buffer the I/O. Most applications that create substantial file system load employ effective I/O buffering, often simply as a result of using the I/O functions in the standard C libraries. By avoiding accesses in small blocks/extents reiserfs improves I/O efficiency. 
Extent based file systems such as VxFS, and write-clustering systems such as ext2fs, are not so effective in applying these techniques that they choose to use 512-byte blocks rather than 1k blocks as their defaults. Ext2fs reports a 20% speedup when 4k rather than 1k blocks are used, but the authors of ext2fs advise the use of 1k blocks to avoid wasting space. There are a number of worthwhile large file optimizations that have not been added to either ext2fs or reiserfs, and both file systems are somewhat primitive in this regard, reiserfs being the more primitive of the two. Large files simply were not my research focus, and it being a small research project I did not implement the many well known techniques for enhancing large file I/O. The buffering algorithms are probably more crucial than any other component in large file I/O, and partly out of a desire for a fair comparison of the approaches I have not modified these. I have added no significant optimizations for large files, beyond increasing the block size, that are not found in ext2fs. Except for the size of the blocks, there is not a large inherent difference between: 1) the cost of adding a pointer to an unformatted node to my tree plus writing the node, and 2) adding an address field to an inode plus writing the block. It is likely that except for block size the primary determinants of high performance large file access are orthogonal to the decision of whether to use balanced tree algorithms for small and medium sized files. For large files we get some advantage from not having our tree being more balanced than the tree formed by an inode which points to a triple indirect block. We haven't an easy method for measuring the performance gain from that though. There is performance overhead due to the memory bandwidth cost of balancing nodes for small files. We think it is worth it though. 
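The space side of the packing argument can be made concrete with a short sketch (illustrative sizes and a greedy packer, not ReiserFS code): many small files packed into shared fixed-size nodes occupy far fewer blocks than one-block-per-file allocation.

```python
# Illustrative sketch: pack variable-length items into fixed-size nodes,
# the way small files and tails share a formatted node, and compare with
# block-aligned allocation of one block per file.
NODE_SIZE = 4096

def pack_items(sizes, node_size=NODE_SIZE):
    """Greedily pack item sizes into nodes; return the number of nodes used."""
    nodes = []
    free = 0
    for size in sizes:
        if size > free:      # current node cannot hold this item; start a new one
            nodes.append(0)
            free = node_size
        nodes[-1] += size
        free -= size
    return len(nodes)

small_files = [100] * 200            # two hundred 100-byte files
packed = pack_items(small_files)     # 40 items fit per 4k node, so 5 nodes
block_aligned = len(small_files)     # one block per file: 200 blocks
print(packed, block_aligned)         # 5 200
```

Real balancing is much more involved (items shift between neighbors on insert and delete), but the space ratio is the point of the sketch.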
= Serialization and Consistency =

The issues of ensuring recoverability with minimal serialization and data displacement necessarily dominate high performance design. Let's define the two extremes in serialization so that the reason for this can be clear. Consider the relative speed of a set of I/O's in which every block request in the set is fed to the elevator algorithms of the kernel and the disk drive firmware fully serially, each request awaiting the completion of the previous request. Now consider the other extreme, in which all block requests are fed to the elevator algorithms all together, so that they may all be sorted and performed in close to their sorted order (disk drive firmwares don't use a pure elevator algorithm). The unserialized extreme may be more than an order of magnitude faster, due to the cost of rotations and seeks. Unnecessarily serializing I/O prevents the elevator algorithm from doing its job of placing all of the I/O's in their layout sequence rather than chronological sequence. Most of high performance design centers around making I/O's in the order they are laid out on disk, and laying out blocks on disk in the order that the I/O's will want to be issued. [Snyder] discusses a file system that obtains high performance from a complete lack of disk synchronization, but is only suitable for temporary files that don't need to survive reboot. I think its known value to Solaris users indicates that the optimal buffering policy varies from file to file. [Ganger] discusses methods for using ordering of writes rather than serialization for ensuring conventional file system meta-data integrity; [McVoy] previously suggested but did not implement ordering of buffer writes. Ext2fs is fast in substantial part due to avoiding synchronous writes of metadata, and I have much personal experience with it that leads me to prefer compiles that are fast.
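The two extremes above can be sketched with a simple head-movement model (illustrative block numbers, not the kernel's actual elevator): feeding the whole batch at once lets it be serviced in sorted block order in one sweep.

```python
# Sketch of serialized vs. batched request submission under a simple
# seek-distance model (illustrative, not kernel code).
def seek_distance(order, start=0):
    """Total head travel to service block requests in the given order."""
    total, pos = 0, start
    for block in order:
        total += abs(block - pos)
        pos = block
    return total

requests = [900, 10, 850, 40, 700, 5]  # chronological arrival order

serialized = seek_distance(requests)        # one request at a time, in order
elevator = seek_distance(sorted(requests))  # whole batch, one sorted sweep
print(serialized, elevator)                 # 4795 900
```

The sorted sweep's cost is roughly the span of the batch, while the serialized order pays for every back-and-forth seek.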
[I would like to see it adopt a policy that all dirty buffers for files not flagged as temporary are queued for writing, and that the existence of a dirty buffer means that the disk is busy. This will require replacing buffer I/O locking with copy-on-write, but an idle disk is such a terrible thing to waste. :-)] [NTFS] by default adds unnecessary serialization to an extent that even older file systems such as [FFS] do not, and its performance characteristics reflect that. In fairness, it should be said that it is the superior approach for most removable media without software control of ejection (e.g. IBM PC floppies). Reiserfs employs a new scheme called preserve lists for ensuring recoverability, which avoids overwriting old meta-data by writing the new meta-data nearby rather than over the old meta-data.

= Why Aggregate Small Objects at the File System Level? =

There has long been a tradition of file system developers deciding that effective handling of small files is not significant to performance, and of application programmers caring enough about performance to not store small files as separate entities in the file system. To store small objects one may either make the file system efficient for the task, or sidestep the problem by aggregating small objects in a layer above the file system. Sidestepping the problem has three disadvantages: utility, code complexity, and performance. Utility and Code Complexity: Allowing OS designers to effectively use a single namespace with a single interface for both large and small objects decreases coding cost and increases expressive power of components throughout the OS. I feel reiserfs shows the effects of a larger development investment focused on a simpler interface when compared with many solutions for this currently available in the object oriented toolkit community, such as the Structured Storage available in Microsoft's [OLE].
By simpler I mean I added nothing to the file system API to distinguish large and small objects, and I leave it to the directory semantics and archiving programs to aggregate objects. Multiple layers cost more to implement, cost more to code the interfaces for utilizing, and provide less flexibility.

Performance: It is most commonly the case that when one layers one file system on top of another the performance is substantially reduced, and Structured Storage is not an exception to this general rule. Reiserfs, which does not attempt to delegate the small object problem to a layer above, avoids this performance loss. I have heard it suggested by some that this layering avoids the performance loss from syncing on file close, as many file systems do. I suggest that this is adding an error to an error rather than fixing it.

Let me make clear that I believe those who write such layers above the file system do not do so out of stupidity. I know of at least one company at which a solution that layers small object storage above the file system exists because the file system developers refused to listen to the non-file-system group's description of its needs, and the file system group had to be sidestepped in generating the solution.

Current file systems are fairly well designed for the purposes their users currently use them for: my goal is to change file size usage patterns. The author remembers arguments that once showed clearly, based on then-current usage statistics, that there was no substantial market need for disk drives larger than 10MB. While [C-FFS] points out that 80% of file accesses are to files below 10k, I do not believe it reasonable to expect usage statistics gathered from file systems in which small files are impractical to show that small files are frequently used. Application programmers are smarter than that.
Currently 80% of file accesses are to the first order of magnitude in file size for which it is currently sensible to store the object in the file system. I regret that one can only speculate as to whether, once file systems become effective for small files and database tasks, usage patterns will change so that 80% of file accesses are to files less than 100 bytes. What I can do is show, via the 80/20 Banded File Set Benchmark presented later, that in such circumstances small file performance potentially dominates total system performance. In summary, the on-going reinvention of incompatible object aggregation techniques above the file system layer is expensive, less expressive, less integrated, slower, and less efficient in its storage than incorporating balanced tree algorithms into the file system.

= Tree Definitions =

Balanced trees are used in databases, and more generally wherever a programmer needs to search and store to non-random memory by a key, and has the time to code it this way. The usual evolution for programmers is to first think that hashing will be simpler and more efficient, and then realize only after getting into the sordid details that the combination of space efficiency, minimized disk accesses, and the feasibility of caching the top part of the tree makes the tree approach more effective. It is the usual thing to first try to do hashing, and then, by the time the details are worked out, to have a balanced tree. The cost of effectively handling bucket overflow just isn't less than the cost of balancing, unless the buckets are always all in RAM. Hashing is often a good solution when there is no non-random memory involved, such as when hashing a cache: the Linux dcache code uses hashing for accessing a cache of in-memory directory entries. Sometimes one uses partial or full hashing of keys within a balanced tree.
If you do full hashing within a tree, and you cache the top part of that tree, you have something rather similar to extensible hashing, except that it is more flexible and efficient. Sometimes programmers code using unbalanced trees; most filesystems do essentially that. Balanced trees generally do a better job of minimizing the average number of disk accesses. There is literature establishing that balanced trees are optimal for the worst case when there is no caching of the tree. This is rather pointless literature, as the average case when cached is what is important, and I am afraid that the existing literature proves that which is feasible to prove rather than that which is relevant. That said, practitioners know from experience that making the tree less balanced leads to more I/Os. Discussions of the exceptions to this are rather interesting, but not for here...

I regret that I must assume that the reader is familiar with basic balanced tree algorithms [Wood], [Lewis and Denenberg], [Knuth], [McCreight]. No attempt will be made to survey tree design here, since balanced trees are one of the most researched and complex topics in algorithm theory and require treatment at length. I must compound this discourtesy with a concise set of definitions that sorely lack accompanying diagrams, my apologies. Finally, I'll truly annoy the reader by saying that the header files contain nice ascii art, and if you want the full definition of the structures, the source is the place.

Classically, balanced trees are designed with the set of keys assumed to be defined by the application, and the purpose of the tree design is to optimize searching through those keys. In my approach the purpose of the tree is to optimize the reference locality and space-efficient packing of objects, and the keys are defined as best optimizes the algorithm for that.
Keys are used in place of inode numbers in the file system, thereby substituting a mapping of keys to node locations (the internal nodes) for a mapping of inode numbers to file locations. Keys are longer than inode numbers, but one needs to cache fewer of them than one would need to cache inode numbers when more than one file is stored in a node. In my tree, I still require that a filename be resolved one component at a time. It is an interesting topic for future research whether this is necessary or optimal. This is a more complex issue than a casual reader might realize: directory-at-a-time lookup accomplishes a form of compression, makes mounting other name spaces and file system extensions simpler, makes security simpler, and makes future enhanced semantics simpler. Since small files typically lead to large directories, it is fortuitous that, as a natural consequence of our use of tree algorithms, our directory mechanisms are much more effective for very large directories than those of most other file systems (notable exceptions include [Holton and Das]).

The tree has three node types: internal nodes, formatted nodes, and unformatted nodes. The contents of internal and formatted nodes are sorted in the order of their keys. (Unformatted nodes contain no keys.) Internal nodes consist of pointers to sub-trees separated by their delimiting keys. The key that precedes a pointer to a sub-tree is a duplicate of the first key in the first formatted node of that sub-tree. Internal nodes exist solely to allow determining which formatted node contains the item corresponding to a key. ReiserFS starts at the root node, examines its contents, and based on it determines which subtree contains the item corresponding to the desired key. From the root node reiserfs descends into the tree, branching at each node, until it reaches the formatted node containing the desired item.
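The descent just described can be sketched in a few lines. This is an illustrative in-memory model, not reiserfs code: integer keys stand in for struct key, and the `Internal`/`Leaf` classes and `lookup` function are invented for the sketch. The invariant from the text — the key preceding a child pointer duplicates the first key of that child's subtree — is what makes `bisect_right` pick the correct child:

```python
import bisect

class Internal:
    """Internal node: n+1 child pointers separated by n delimiting keys."""
    def __init__(self, keys, children):
        assert len(children) == len(keys) + 1
        self.keys, self.children = keys, children

class Leaf:
    """Formatted node: items sorted by key (modelled as a dict here)."""
    def __init__(self, items):
        self.items = dict(items)

def lookup(node, key):
    """Descend from the root, branching at each internal node."""
    while isinstance(node, Internal):
        # Rightmost child whose delimiting key is <= the search key.
        node = node.children[bisect.bisect_right(node.keys, key)]
    return node.items.get(key)

leaves = [Leaf([(1, "a"), (4, "b")]), Leaf([(9, "c"), (12, "d")])]
root = Internal(keys=[9], children=leaves)  # 9 duplicates leaf 2's first key

print(lookup(root, 4), lookup(root, 9))     # -> b c
```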
The first (bottom) level of the tree consists of unformatted nodes, the second level consists of formatted nodes, and all levels above consist of internal nodes. The highest level contains the root node. The number of levels is increased as needed by adding a new root node at the top of the tree. All paths from the root of the tree to all formatted leaves are equal in length, and all paths to all unformatted leaves are also equal in length and one node longer than the paths to the formatted leaves. This equality in path length, and the high fanout it provides, is vital to high performance, and in the Drops section I will describe how the lengthening of the path that occurred as a result of introducing the [BLOB] approach (the use of indirect items and unformatted nodes) proved a measurable mistake.

Formatted nodes consist of items. Items have four types: direct items, indirect items, directory items, and stat data items. All items contain a key which is unique to the item. This key is used to sort, and to find, the item. Direct items contain the tails of files; a tail is the last part of a file (the last file_size modulo FS-block-size bytes of the file). Indirect items consist of pointers to unformatted nodes. All but the tail of the file is contained in its unformatted nodes. Directory items contain the key of the first directory entry in the item, followed by a number of directory entries. Depending on the configuration of reiserfs, stat data may be stored as a separate item, or it may be embedded in a directory entry; we are still benchmarking to determine which way is best. A file consists of a set of indirect items followed by a set of up to two direct items, with the existence of two direct items representing the case when a tail is split across two nodes.
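A back-of-envelope calculation shows why this fanout matters. Using the on-disk sizes given in the structure tables below (16-byte keys, disk_child pointers occupying 8 bytes, an 8-byte internal block head) together with the 4k block size mentioned in the text, each internal node holds on the order of 170 child pointers. The exact arithmetic here is my own estimate, not a figure from the paper:

```python
BLOCK = 4096                # unformatted-node / block size from the text
HEAD, KEY, PTR = 8, 16, 8   # internal block head, struct key, disk_child

# An internal node with n child pointers holds n-1 delimiting keys:
#   HEAD + (n-1)*KEY + n*PTR <= BLOCK
fanout = (BLOCK - HEAD + KEY) // (KEY + PTR)
print(fanout)               # -> 171

# Two internal levels then address fanout**2 formatted nodes:
print(fanout ** 2)          # -> 29241
```

So even a shallow tree covers tens of thousands of formatted nodes, which is why keeping all paths equally short pays off.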
If a tail is larger than the maximum size of a file that can fit into a formatted node but is smaller than the unformatted node size (4k), then it is stored in an unformatted node, and a pointer to it, plus a count of the space used, is stored in an indirect item. Directories consist of a set of directory items. Directory items consist of a set of directory entries. Directory entries contain the filename and the key of the file which is named. There is never more than one item of the same item type from the same object stored in a single node (there is no reason one would want two separate items rather than one combined item). The first item of a file or directory contains its stat data. When performing balancing, and analyzing the packing of a node and its two neighbors, we ensure that the three nodes cannot be compressed into two nodes. I feel greater compression than this is best left to an FS cleaner to perform rather than attempting it dynamically.

== ReiserFS structures ==

The ReiserFS tree has Max_Height = N (the current default for N is 5). The tree lies in disk blocks, and each disk block that belongs to the reiserfs tree has a block head.

An internal node of the tree holds keys and pointers to disk blocks:

 Block_Head | Key 0 | Key 1 | Key 2 | --- | Key N | Pointer 0 | Pointer 1 | Pointer 2 | --- | Pointer N | Pointer N+1 | ..Free Space..

A leaf node of the tree holds items and item headers, packed from opposite ends of the block:

 Block_Head | IHead 0 | IHead 1 | IHead 2 | --- | IHead N | ......Free Space...... | Item N | --- | Item 2 | Item 1 | Item 0

An unformatted node of the tree holds the data of a big file, and nothing else.

ReiserFS objects are files and directories. The maximum number of objects is 2^32-4 = 4,294,967,292. Each object is a set of items.

File items:
# StatData item + [Direct item] (for small files: size from 0 bytes to MAX_DIRECT_ITEM_LEN = blocksize - 112 bytes)
# StatData item + InDirect item + [Direct item] (for big files: size > MAX_DIRECT_ITEM_LEN bytes)

Directory items:
# StatData item + Directory item

Every reiserfs object has an object ID and a key.

=== Internal Node structures ===

struct block_head:

{|
! Field Name !! Type !! Size (bytes) !! Description
|-
| blk_level || unsigned short || 2 || Level of the block in the tree (1 = leaf; 2, 3, 4, ... = internal)
|-
| blk_nr_item || unsigned short || 2 || Number of keys in an internal block, or number of items in a leaf block
|-
| blk_free_space || unsigned short || 2 || Free space in the block, in bytes
|-
| blk_right_delim_key || struct key || 16 || Right delimiting key for this block (leaf nodes only)
|}

Total: 6 bytes (occupying 8 on disk) for internal nodes; 22 bytes (occupying 24) for leaf nodes.

struct key:

{|
! Field Name !! Type !! Size (bytes) !! Description
|-
| k_dir_id || __u32 || 4 || ID of the parent directory
|-
| k_object_id || __u32 || 4 || ID of the object (also the inode number)
|-
| k_offset || __u32 || 4 || Offset from the beginning of the object to the current byte of the object
|-
| k_uniqueness || __u32 || 4 || Type of the item (StatData = 0, Direct = -1, InDirect = -2, Directory = 500)
|}

Total: 16 bytes.

struct disk_child (pointer to a disk block):

{|
! Field Name !! Type !! Size (bytes) !! Description
|-
| dc_block_number || unsigned long || 4 || Disk child's block number
|-
| dc_size || unsigned short || 2 || Disk child's used space
|}

Total: 6 bytes (occupying 8 on disk).

=== Leaf Node structures ===

A leaf node begins with the same struct block_head described above (total 22 bytes, occupying 24, since leaf nodes also store the right delimiting key).

Everything in the filesystem is stored as a set of items. Each item has an item_head. The item_head contains the key of the item, its free space (for indirect items), and the location of the item itself within the block.

struct item_head (IHead):

{|
! Field Name !! Type !! Size (bytes) !! Description
|-
| ih_key || struct key || 16 || Key used to search for the item; all item headers are sorted by this key
|-
| u.ih_free_space / u.ih_entry_count || __u16 || 2 || Free space in the last unformatted node for an InDirect item; 0xFFFF for a Direct item; 0xFFFF for a StatData item; the number of directory entries for a Directory item
|-
| ih_item_len || __u16 || 2 || Total size of the item body
|-
| ih_item_location || __u16 || 2 || Offset of the item body within the block
|-
| ih_reserved || __u16 || 2 || Used by reiserfsck
|}

Total: 24 bytes.

There are four types of items: stat_data items, directory items, indirect items, and direct items.

struct stat_data (the reiserfs version of a UFS disk inode, minus the address blocks):

{|
! Field Name !! Type !! Size (bytes) !! Description
|-
| sd_mode || __u16 || 2 || File type and permissions
|-
| sd_nlink || __u16 || 2 || Number of hard links
|-
| sd_uid || __u16 || 2 || Owner id
|-
| sd_gid || __u16 || 2 || Group id
|-
| sd_size || __u32 || 4 || File size
|-
| sd_atime || __u32 || 4 || Time of last access
|-
| sd_mtime || __u32 || 4 || Time the file was last modified
|-
| sd_ctime || __u32 || 4 || Time the inode (stat data) was last changed (except for changes to sd_atime and sd_mtime)
|-
| sd_rdev || __u32 || 4 || Device
|-
| sd_first_direct_byte || __u32 || 4 || Offset from the beginning of the file to the first byte of the file's direct item: -1 for a directory; 1 for small files (direct items only); >1 for big files with indirect and direct items; -1 for big files with indirect items but no direct item
|}

Total: 32 bytes.

A directory item holds directory entry headers followed by the names, packed from opposite ends:

 deHead 0 | deHead 1 | deHead 2 | --- | deHead N | fileName N | --- | fileName 2 | fileName 1 | fileName 0

A direct item holds a small file's body:

 ........................Small File Body............................

An indirect item holds pointers to unformatted blocks:

 unfPointer 0 | unfPointer 1 | unfPointer 2 | --- | unfPointer N

unfPointer is a pointer to an unformatted block (unfPointer size = 4 bytes); unformatted blocks contain the body of a big file.

struct reiserfs_de_head (deHead):

{|
! Field Name !! Type !! Size (bytes) !! Description
|-
| deh_offset || __u32 || 4 || Third component of the directory entry key (all reiserfs_de_heads are sorted by this value)
|-
| deh_dir_id || __u32 || 4 || Objectid of the parent directory of the object referenced by the directory entry
|-
| deh_objectid || __u32 || 4 || Objectid of the object referenced by the directory entry
|-
| deh_location || __u16 || 2 || Offset of the name within the whole item
|-
| deh_state || __u16 || 2 || Flags: (1) entry contains stat data (for the future); (2) entry is hidden (unlinked)
|}

Total: 16 bytes.

fileName is the name of the file, an array of bytes of variable length. The maximum length of a file name is blocksize - 64 (for a 4kb blocksize, 4032 bytes).

= Using the Tree to Optimize Layout of Files =

There are four levels at which layout optimization is performed:
# the mapping of logical block numbers to physical locations on disk,
# the assigning of nodes to logical block numbers,
# the ordering of objects within the tree, and
# the balancing of the objects across the nodes they are packed into.

== Physical Layout ==

For SCSI drives, the mapping of logical block numbers to physical locations is performed by the disk drive manufacturer; for IDE drives it is done by the device driver; and for all drives it is also potentially done by volume management software.
The logical block number to physical location mapping by the drive manufacturer is usually done using cylinders. I agree with the authors of [ext2fs] and most others that the significant file placement feature of FFS was not the actual cylinder boundaries, but placing files and their inodes on the basis of their parent directory location. FFS used explicit knowledge of actual cylinder boundaries in its design. I find that minimizing the distance in logical blocks of semantically adjacent nodes, without tracking cylinder boundaries, accomplishes an excellent approximation of optimizing according to actual cylinder boundaries, and I find its simplicity an aid to implementation elegance.

== Node Layout ==

When I place nodes of the tree on the disk, I search for the first empty block in the bitmap (of used block numbers), starting at the location of the left neighbor of the node in the tree ordering and moving in the direction I last moved in. This was experimentally found to be better, for the benchmarks employed, than the following alternatives:
# taking the first non-zero entry in the bitmap,
# taking the entry after the last one that was assigned, in the direction last moved in (this was 3% faster for writes and 10-20% slower for subsequent reads),
# starting at the left neighbor and moving in the direction of the right neighbor.

When changing block numbers for the purpose of avoiding overwriting sending nodes before shifted items reach disk in their new recipient node (see the description of preserve lists later in the paper), the benchmarks employed were ~10% faster when starting the search from the left neighbor rather than from the node's current block number, even though determining the left neighbor adds significant overhead (the current implementation risks I/O to read the parent of the left neighbor). It used to be that we would reverse direction when we reached the end of the disk drive.
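The search just described reduces to scanning a bitmap from a hint position in a remembered direction. The sketch below is illustrative Python, not the reiserfs implementation (which works on packed bitmap blocks); `find_free_block` and its arguments are invented names:

```python
def find_free_block(bitmap, start, direction=+1):
    """Return the first free block at or after `start`, scanning in
    `direction` (+1 or -1); None if no free block lies that way."""
    i = start
    while 0 <= i < len(bitmap):
        if not bitmap[i]:        # 0 = free, 1 = in use
            return i
        i += direction
    return None

used = [1, 1, 0, 1, 0, 1, 1, 0]  # toy allocation bitmap

# Start from the left neighbor's block (say block 3), moving right:
print(find_free_block(used, start=3, direction=+1))   # -> 4
# The same hint scanning left lands somewhere else entirely:
print(find_free_block(used, start=3, direction=-1))   # -> 2
```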
Fortunately we checked to see if it makes a difference which direction one moves in when allocating blocks to a file, and indeed we found that it made a significant difference to always allocate in the increasing block number direction. We hypothesize that this is due to matching the disk spin direction by allocating using increasing block numbers.

== Ordering within the Tree ==

While I give here an example of how I have defined keys to optimize locality of reference and packing efficiency, I would like to stress that key definition is a powerful and flexible tool that I am far from finished experimenting with. Some key definition decisions depend very much on usage patterns, and this means that someday one will select from several key definitions when creating the file system. For example, consider the decision of whether to pack all directory entries together at the front of the file system, or to pack the entries near the files they name. For large file usage patterns one should pack all directory items together, since systems with such usage patterns are effective in caching the entries for all directories. For small files the name should be near the file. Similarly, for large files the stat data should be stored separately from the body, either with the other stat data from the same directory, or with the directory entry. (It was likely a mistake for me to not assign stat data its own key in the current implementation, as packing it in with direct and indirect items complicates our code for handling those items, and prevents me from easily experimenting with the effects of changing its key assignment.)

It is not necessary for a file's packing to reflect its name; that is merely my default. With each file, my next release will offer the option of overriding the default by use of a system call.
It is feasible to pack an object completely independently of its semantics using these algorithms, and I predict that there will be many applications, perhaps even most, for which a packing different from that determined by object names is more appropriate. Currently the mandatory tying of packing locality to semantics results in the distortion of both semantics and packing from what might otherwise be their independent optimums, much as tying block boundaries to file boundaries distorts I/O and space allocation algorithms from their separate optimums. For example, placing most files accessed while booting at the start of the disk, in their access order, is a very tempting future optimization that the use of packing localities makes feasible to consider.

The Structure of a Key: Each file item has a key with the structure <locality_id, object_id, offset, uniqueness>. The locality_id is by default the object_id of the parent directory. The object_id is the unique id of the file, and is set to the first unused objectid when the object is created. The tendency of this to pack successive object creations in a directory adjacently is fortuitous for many usage patterns. For files, the offset is the offset within the logical object of the first byte of the item. In version 0.2 all directory entries had their own individual keys stored with them and were each distinct items; in the current version I store one key in the item, which is the key of the first entry, and compute each entry's key as needed from that one stored key. For directories, the offset key component is the first four bytes of the filename, which you may think of as the lexicographic rather than numeric offset. For directory items the uniqueness field differentiates filename entries identical in the first four bytes.
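Plain tuple comparison is enough to see the packing effect of this key structure. The sketch below is illustrative only: the ids and offsets are made up, the uniqueness values are taken from the key table earlier in the text, and real reiserfs key comparison is more involved than Python's tuple ordering:

```python
STATDATA, DIRECT, INDIRECT, DIRECTORY = 0, -1, -2, 500  # k_uniqueness values

# <locality_id, object_id, offset, uniqueness> keys, in creation order:
items = [
    (20, 21, 0,    STATDATA),  # file 21, created in directory 20
    (10, 11, 0,    STATDATA),  # file 11, created in directory 10
    (10, 12, 0,    STATDATA),  # file 12, same directory 10
    (10, 11, 4096, DIRECT),    # tail of file 11
]

items.sort()
# Files of directory 10 sort together, and both items of file 11 end up
# adjacent, regardless of the order in which the items were created:
print([k[:2] for k in items])  # -> [(10, 11), (10, 11), (10, 12), (20, 21)]
```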
For all item types the uniqueness field indicates the item type, and for the leftmost item in a buffer it indicates whether the preceding item in the tree is of the same type and object as this item. Placing this information in the key is useful when analyzing balancing conditions, but it increases key length for non-directory items, and is a questionable architectural feature. Every file has a unique objectid, but this cannot be used for finding the object; only keys are used for that. Objectids merely ensure that keys are unique. If you never use the reiserfs features that change an object's key then the key is immutable, otherwise it is mutable. (This feature aids support for NFS daemons, etc.) We spent quite some time debating internally whether the use of mutable keys for identifying an object had deleterious long term architectural consequences: in the end I decided it was acceptable iff we require any object recording a key to possess a method for updating its copy of it. This is the architectural price of avoiding caching a map of objectid to location, a map that might have very poor locality of reference due to objectids not changing with object semantics.

I pack an object with the packing locality of the directory it was first created in, unless the key is explicitly changed. It remains packed there even if it is unlinked from the directory. I do not move it from the locality it was created in without an explicit request, unlike the [C-FFS] approach, which stores all multiple-link files together and pays the cost of moving them from their original locations when the second link occurs. I think a file linked with multiple directories might as well get at least the locality of reference benefits of one of those directories. In summary, this approach 1) places files from the same directory together, and 2) places directory entries from the same directory together with each other and with the stat data for the directory.
Note that there is no interleaving of objects from different directories in the ordering at all, and that all directory entries from the same directory are contiguous. You'll note that this does not accomplish packing the files of small directories with common parents together, and does not employ the full partial ordering in determining the linear ordering; it merely uses parent directory information. I feel the proper place for employing full tree structure knowledge is in the implementation of an FS cleaner, not in the dynamic algorithms.

== Node Balancing Optimizations ==

When balancing nodes I do so according to the following ordered priorities:
# minimize the number of nodes used,
# minimize the number of nodes affected by the balancing operation,
# minimize the number of uncached nodes affected by the balancing operation,
# if shifting to another formatted node is necessary, maximize the bytes shifted.

Priority 4 is based on the assumption that the location of an insertion of bytes into the tree is an indication of the likely future location of an insertion, and that this policy will on average reduce the number of formatted nodes affected by future balance operations. There are more subtle effects as well: if one randomly places nodes next to each other, and one has a choice between those nodes being mostly moderately efficiently packed, or packed to an extreme of either well or poorly packed, one is more likely to be able to combine more of the nodes under the policy of extremism. Extremism is a virtue in space-efficient node packing. The maximal shift policy is not applied to internal nodes, as extremism is not a virtue in time-efficient internal node balancing.
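Ordered priorities like these map naturally onto lexicographic comparison. The sketch below is illustrative only — the candidate plans and their numbers are invented, and the real balancing code does not enumerate plans this way — but it shows how priority 4 only breaks ties left by priorities 1-3:

```python
def plan_cost(plan):
    """Lower tuples are better; the maximized quantity is negated."""
    return (plan["nodes_used"],            # 1. minimize nodes used
            plan["nodes_affected"],        # 2. minimize nodes affected
            plan["uncached_affected"],     # 3. minimize uncached affected
            -plan["bytes_shifted"])        # 4. maximize bytes shifted

candidates = [
    {"name": "A", "nodes_used": 3, "nodes_affected": 2,
     "uncached_affected": 0, "bytes_shifted": 100},
    {"name": "B", "nodes_used": 2, "nodes_affected": 3,
     "uncached_affected": 1, "bytes_shifted": 50},
    {"name": "C", "nodes_used": 2, "nodes_affected": 3,
     "uncached_affected": 1, "bytes_shifted": 900},
]

best = min(candidates, key=plan_cost)
print(best["name"])   # -> C  (B and C tie on 1-3; C shifts more bytes)
```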
=== Drops ===

(The difficult design issues in the current version that our next version can do better.)

Consider dividing a file or directory into drops, with each drop having a separate key, and no two drops from one file or directory occupying the same node without being compressed into one drop. The key for each drop is set to the key for the object (file or directory) plus the offset of the drop within the object. For directories the offset is lexicographic and by filename; for files it is numeric and in bytes. In the course of several file system versions we have experimented with and implemented solid, liquid, and air drops. Solid drops were never shifted, and drops would only solidify when they occupied the entirety of a formatted node. Liquid drops are shifted in such a way that any liquid drop which spans a node fully occupies the space in its node: like a physical liquid it is shiftable but not compressible. Air drops merely meet the balancing condition of the tree.

Reiserfs 0.2 implemented solid drops for all but the tails of files. If a file was at least one node in size, it would align the start of the file with the start of a node, block-aligning the file. This block alignment of the start of multi-drop files was a design error that wasted space: even if the locality of reference is so poor as to make one not want to read parts of semantically adjacent files, if the nodes are near each other then the cost of reading an extra block is thoroughly dwarfed by the cost of the seek and rotation needed to reach the first node of the file. As a result the block alignment saves little time, though it costs significant space for 4-20k files. Reiserfs with block alignment of multi-drop files and no indirect items experienced the following rather interesting behavior, which was partially responsible for making it only 88% space efficient for files that averaged 13k in size (the linux kernel).
When the tail of a larger-than-4k file was followed in the tree ordering by another file larger than 4k, then since the drop before it was solid and aligned, and the drop after it was solid and aligned, the tail occupied an entire node no matter what size it was. In the current version we place all but the tail of large files into a level of the tree reserved for full unformatted nodes, and create indirect items in the formatted nodes which point to the unformatted nodes. This is known in the database literature as the [BLOB] approach.

This extra level added to the tree comes at the cost of making the tree less balanced (I consider the unformatted nodes pointed to as part of the tree) and increasing the maximal depth of the tree by one. For medium sized files, the use of indirect items increases the cost of caching pointers by mixing data with them. The reduction in fanout often causes the read algorithms to fetch only one node of the file at a time more frequently, as one waits to read the uncached indirect item before reading the node with the file data. There are more parents per file read with the use of indirect items than with internal nodes, as a direct result of the reduced fanout due to mixing tails and indirect items in the node. The most serious flaw is that these reads of the various nodes necessary to reading the file incur additional rotations and seeks compared to the case with drops. With my initial drop approach the nodes are usually sequential in their disk layout, even the tail, and the internal node parent points to all of them in such a way that all of them contained by that parent, or by another internal node in cache, can be requested at once in one sequential read. Non-sequential reads of nodes are more than an order of magnitude more costly than sequential reads, and this single consideration dominates effective read optimization.
Unformatted nodes make file system recovery faster but less robust, in that one reads their indirect item rather than the nodes themselves to insert them into the recovered tree, and one cannot read them to confirm that their contents are from the file that an indirect item says they are from. In this they make reiserfs similar to an inode-based system without logging. A moderately better solution would have been to simply eliminate the requirement for placing the start of multi-node files at the start of nodes, rather than introducing BLOBs, and to depend on the use of a file system cleaner to optimally pack the 80% of files that don't move frequently, using algorithms that move even solid drops. Yet that still leaves the problem of formatted nodes not being efficient for mmap() purposes (one must copy them before writing rather than merely modifying their page table entries, and memory bandwidth is expensive even if CPU is cheap). For this reason I have the following plan for the next version.

I will have three trees: one tree maps keys to unformatted nodes, one tree maps keys to formatted nodes, and one tree maps keys to directory entries and stat_data. Now, it is only natural if you are thinking that this would mean that to read a file — accessing first the directory entry and stat_data, then the unformatted node, then the tail — one must hop long distances across the disk, going first to one tree and then to another. This is indeed why it took me two years to realize it could be made to work. My plan is to interleave the nodes of the three trees according to the following algorithm: block numbers are assigned to nodes when the nodes are created or preserved, and someday will be assigned when the cleaner runs. The choice of block number is based on first determining what other node it should be placed near, and then finding the nearest free block that can be found in the elevator's current direction.
Currently we use the left neighbor of the node in the tree as the node it should be placed near. This is nice and simple. Oh well. Time to create a virtual neighbor layer. The new scheme will continue to first determine the node it should be placed near, and then start the search for an empty block from that spot, but it will use a more complicated determination of what node to place it near. This method will cause all nodes from the same packing locality to be near each other, will cause all directory entries and stat_data to be grouped together within that packing locality, and will interleave formatted and unformatted nodes from the same packing locality. Pseudo-code is best for describing this:

 /* For use by reiserfs_get_new_blocknrs when determining where in the
    bitmap to start the search for a free block, and for use by the
    read-ahead algorithm when there are not enough nodes to the right and
    in the same packing locality for packing-locality read-ahead purposes. */
 get_logical_layout_left_neighbors_blocknr(key of current node)
 {
     /* Based on examination of the current node's key and type, find the
        virtual neighbor of that node. */
     if body node
         if first body node of file
             if (node in tail tree whose key is less but is in same packing locality exists)
                 return blocknr of such node with largest key
             else
                 find node with largest key less than key of current node in stat_data tree
                 return its blocknr
         else
             return blocknr of node in body tree with largest key less than key of current node
     else if tail node
         if (node in body tree belonging to same file as first tail of current node exists)
             return its blocknr
         else if (node in tail tree with lesser delimiting key but same packing locality exists)
             return blocknr of such node with largest delimiting key
         else
             return blocknr of node with largest key less than key of current node in stat_data tree
     else /* is stat_data tree node */
         if stat_data node with lesser key from same packing locality exists
             return blocknr of such node with largest key
         else
             /* no node from same packing locality with lesser key exists */
 }

 /* For use by packing-locality read-ahead. */
 get_logical_layout_right_neighbors_blocknr(key of current node)
 {
     right-handed version of get_logical_layout_left_neighbors_blocknr logic
 }

It is my hope that this will improve the caching of pointers to unformatted nodes, plus improve the caching of directory entries and stat_data, by separating them from file bodies to a greater extent. I also hope that it will improve read performance for 1-10k files, and that it will allow us to do this without decreasing space efficiency.

=== Code Complexity ===

I thought it appropriate to mention some of the notable effects of simple design decisions on our implementation's code length. When we changed our balancing algorithms to shift parts of items rather than only whole items, so as to pack nodes tighter, this had an impact on code complexity. Another multiplicative determinant of balancing code complexity was the number of item types; introducing indirect items doubled it, and changing directory items from liquid drops to air drops also increased it.
Storing stat_data in the first direct or indirect item of the file complicated the code for processing those items more than if I had made stat_data its own item type. When one finds oneself with an NxN coding complexity issue, it usually indicates the need for a layer of abstraction. The NxN effect of the number of item types on balancing code complexity is an instance of that design principle, and we will address it in the next major rewrite. The balancing code will employ a set of item operations which all item types must support, and will invoke those operations without understanding any more of the meaning of an item's type than that it determines which item-specific operation handler is called. Adding a new item type, say a compressed item, will then merely require writing a set of item operations for that item, rather than modifying most parts of the balancing code as it does now. We now feel that the function that determines what resources are needed to perform a balancing operation, fix_nodes(), might as well be written to decide what operations will be performed during balancing, since it pretty much has to do so anyway. That way, the function that performs the balancing with the nodes locked, do_balance(), can be gutted of most of its complexity.

= Buffering & the Preserve List =

For version 0.2 of our file system we implemented a system of write ordering that tracked all shifting of items in the tree, and ensured that no node that had had an item shifted from it was written before the node that had received the item was written. This is necessary to prevent a system crash from causing the loss of an item that was not necessarily recently created. This tracking approach worked, and the overhead it imposed was not measurable in our benchmarks. But when in the next version we changed to partially shifting items and increased the number of item types, this code grew out of control in its complexity.
I decided to replace it with a scheme that was far simpler to code and also more effective in typical usage patterns. This scheme was as follows. If an item is shifted from a node, change the block that its buffer will be written to: change it to the nearest free block to the old block's left neighbor, and rather than freeing the old block, place its number on a ``preserve list''. (Saying nearest is slightly simplistic, in that the blocknr assignment function moves from the left neighbor in the direction of increasing block numbers.) When a ``moment of consistency'' is achieved, free all of the blocks on the preserve list. A moment of consistency occurs when there are no nodes in memory into which objects have been shifted (this could be made more precise, but then it would be more complex). If disk space runs out, force a moment of consistency to occur. This is sufficient to ensure that the file system is recoverable.

Note that during the large file benchmarks the preserve list was freed several times in the middle of the benchmark. The percentage of buffers preserved is small in practice except during deletes, and one can arrange for moments of consistency to occur as frequently as one wants to. I make no claim that this approach is better than the Soft Updates approach employed by [Granger] or by us in version 0.2; I merely note that tracking the order of writes is more complex than this approach for balanced trees which partially shift items. We may go back to the old approach some day, though not to the code that I threw out. Preserve lists substantially hamper performance for files in the 1-10k size range, and we are re-evaluating them. Ext2fs avoids the metadata shifting problem by never shrinking directories, and by using fixed inode space allocations.
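The mechanics of the scheme can be sketched as follows. This is a deliberately simplified illustration, not the reiserfs implementation: the names, the fixed-size list, and the single counter standing in for "nodes in memory into which objects have been shifted" are all assumptions.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_PRESERVED 1024

static long   preserve_list[MAX_PRESERVED];
static size_t n_preserved;
static int    pending_shifts;   /* shifted-into nodes not yet on disk */
static int    blocks_freed;     /* stands in for returning blocks to the bitmap */

static void free_block(long blk) { (void)blk; blocks_freed++; }

/* An item was shifted out of the node at 'old_blk'; the node will be
 * rewritten at 'new_blk' (chosen near its left neighbor), and the old
 * block is preserved rather than freed. */
static long preserve_and_relocate(long old_blk, long new_blk)
{
    assert(n_preserved < MAX_PRESERVED);
    preserve_list[n_preserved++] = old_blk;
    pending_shifts++;
    return new_blk;
}

/* A node that received shifted items has reached disk.  When no such
 * nodes remain in memory, a moment of consistency is achieved and the
 * preserved blocks may finally be freed. */
static void shifted_node_written(void)
{
    if (--pending_shifts == 0)
        while (n_preserved > 0)
            free_block(preserve_list[--n_preserved]);
}
```

The point of the structure is visible even in this toy form: no block is reused while any pre-crash tree state still depends on it, so a crash at any moment leaves a recoverable tree on disk.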
= Lessons From Log Structured File Systems =

Many techniques from other file systems haven't been applied, primarily so as to satisfy my goal of giving reiserfs 1.0 only the minimum feature set necessary to be useful; they will appear in later releases. Log Structured File Systems [Rosenblum and Ousterhout] embody several such techniques, which I will describe after I mention two concerns with that approach:

* With small object file systems it is not feasible to cache in RAM a map of objectid to location for every object, since there are too many objects. This is an inherent problem in using temporal packing rather than semantic packing for small object file systems. With my approach the internal nodes are the equivalent of this objectid-to-location map, but total internal node size is proportional to the number of nodes rather than the number of objects. You can think of internal nodes as a compression of object location information made effective by the existence of an ordering function; this compression is both essential for small files and a major feature of my approach.
* I like obtaining good though not ideal semantic locality without paying a cleaning cost for active data. This is a less critical concern.

I frequently find myself classifying packing and layout optimizations as either appropriate for implementing dynamically or appropriate only for a cleaner. Optimizations whose computational overhead is large compared to their benefit tend to be appropriate for implementation in a cleaner, and a cleaner's benefits mostly impact the static portion of the file system (which typically consumes ~80% of the space). Such objectives as 100% packing efficiency, exactly ordering block layout by semantic order, using the full semantic tree rather than the parent directory in determining semantic order, and compression are all best implemented by cleaner approaches.
In summary, there is much to be learned from the LFS approach, and as I move past my initial objective of supplying a minimal-feature, higher-performance FS I will apply some of those lessons. In the Preserve Lists section I speculate on the possibilities for a fastboot implementation that would merge the better features of preserve lists and logging.

= Directions For the Future =

To go one more order of magnitude smaller in file size will require adding functionality to the file system API, though it will not require discarding upward compatibility. The use of an exokernel is a better approach to small files if it is an option available to the OS designer; it is not currently an option for Linux users. In the future reiserfs will add such features as lightweight files, in which stat_data other than size is inherited from a parent if it is not created individually for the file; an API for reading and writing files without the overhead of file handles and open(); set-theoretic semantics; and many other features that you would expect from researchers who expect to be able to do all that they could do in a database, in the file system, and never really did understand why not.

= Conclusion =

Balanced tree file systems are inherently more space efficient than block allocation based file systems, with the differences reaching order of magnitude levels for small files. While other aspects of design will typically have a greater impact on performance for large files, the use of balanced trees offers performance advantages in direct proportion to the smallness of the file. A moderate advantage was found for large files. Coding cost is mostly in the interfaces, and it is a measure of the OS designer's skill whether those costs are low in the OS. We make it possible for an OS designer to use the same interface for large and small objects, and thereby reduce interface coding cost.
This approach is a new tool available to the OS designer for increasing the expressive power of all of the components in the OS through better name space integration. Researchers interested in collaborating with or just using my work will find me friendly. I tailor the framework of my collaborations to the needs of those I work with. I GPL reiserfs so as to meet the needs of academic collaborators. While that makes it unusable without a special license for commercial OSes, commercial vendors will find me friendly in setting up a commercial framework for collaboration, with commercial needs provided for.

= Acknowledgments =

Hans Reiser was the project initiator, primary architect, supplier of funding, and one of the programmers. Some folks at times remark that naming the filesystem Reiserfs was egotistic. It was so named after a potential investor hired all of my employees away from me, then tried to negotiate better terms for his possible investment, and suggested that he could arrange for 100 researchers to swear in Russian court that I had had nothing to do with this project. That business partnership did not work out.

Vladimir Saveljev, while he did not author this paper, worked long hours writing the largest fraction of the lines of code in the file system, and is remarkably gifted at just making things work. Thanks, Vladimir. Anatoly Pinchuk wrote much of the core balancing code, and too much of the rest to list here. Thanks, Anatoly. It is the policy of the Naming System Venture that if someone quits before project completion, and then takes strong steps to try to prevent others from finishing the project, they shall not be mentioned in the acknowledgements. This was all quite sad, and best forgotten.

I would like to thank Alfred Ajlamazyan for his generosity in providing overhead at a time when his institute had little it could easily spare.
Grigory Zaigralin is thanked for his work in making the machines run, administering the money, and being his usual determined-to-be-useful self. Igor Chudov, thanks for such effective procurement and hardware maintenance work. Eirik Fuller is thanked for his help with NFS and porting to 2.1. I would like to thank Remi Card for the superb block allocation based file system (ext2fs) that I depended on for so many years, and that allowed me to benchmark against the best. Linus Torvalds, thank you for Linux.

= Business Model and Licensing =

I personally favor performing a balance of commercial and public works in my life. I have no axe to grind against software that is charged for, and no regrets at making reiserfs freely available to Linux users. This project is GPL'd, but I sell exceptions to the GPL to commercial OS vendors and file server vendors. It is not usable to them without such exceptions, and many of them are wise enough to understand that:

* the porting and integration service we are able to provide with the licensing is by itself worth what we charge,
* these services improve their time to market,
* and the relationship spreads the development costs across more OS vendors than just them alone.

I expect that Linux will prove to be quite effective in market sampling my intended market, but if you suspect that I also like seeing more people use it even if it is free to them, oh well. I believe it is not so much the cost that has made Linux so successful as it is the openness. Linux is a decentralized economy with honor and recognition as the currency of payment (and thus there is much honor in it). Commercial OS vendors are, at the moment, all closed economies, and doomed to fall in their competition with open economies just as communism eventually fell.
At some point an OS vendor will realize that if it:

* opens up its source code to decentralized modification,
* systematically rewards those who perform the modifications that are proven useful,
* systematically merges/integrates those modifications into its branded primary release branch while adding value as the integrator,

then it will acquire both the critical mass of the internet development community, and the aggressive edge that no large communal group (such as a corporation) can have. Rather than saying to any such vendor that they should do this now, let me simply point out that whoever is first will have an enormous advantage.

Since I have more recognition than money to pass around as reward, my policy is to tend to require that those who contribute substantial software to this project have their names attached to a user visible portion of the project. This official policy helps me deal with folks like Vladimir, who was much too modest to ever name the file system checker vsck without my insisting. Smaller contributions are to be noted in the source code, and in the acknowledgements section of this paper. If you choose to contribute to this file system, and your work is accepted into the primary release, you should let me know if you want me to look for opportunities to integrate you into contracts from commercial vendors. Through packaging ourselves as a group, we are more marketable to such OS vendors. Many of us have spent too much time working at day jobs unrelated to our Linux work. This is too hard, and I hope to make things easier for us all. If you like this business model of selling GPL'd component software with related support services, but you write software not related to this file system, I encourage you to form a component supplier company also. Opportunities may arise for us to cooperate in our marketing, and I will be happy to do so.

= References =

G.M. Adel'son-Vel'skii and E.M.
Landis, An algorithm for the organization of information, Soviet Math. Doklady 3, 1259-1262, 1962. This paper on AVL trees can be thought of as the founding paper of the field of storing data in trees. Those not conversant in Russian will want to read the [Lewis and Denenberg] treatment of AVL trees in its place. [Wood] contains a modern treatment of trees.

[Apple] Inside Macintosh, Files, Apple Computer Inc., Addison-Wesley, 1992. Employs balanced trees for filenames. It was an interesting file system architecture for its time in a number of ways; now its problems with internal fragmentation have become more severe as disk drives have grown larger, and the code has not received sufficient further development.

[Bach] Maurice J. Bach, ``The Design of the Unix Operating System'', Prentice-Hall Software Series, Englewood Cliffs, NJ, 1986. Superbly written but sadly dated; contains detailed descriptions of the file system routines and interfaces in a manner especially useful for those trying to implement a Unix compatible file system. See [Vahalia].

[BLOB] R. Haskin, Raymond A. Lorie: On Extending the Functions of a Relational Database System. SIGMOD Conference 1982: 207-212 (body of paper not on web). See the Drops section for a discussion of how this approach makes the tree less balanced, and the effect that has on performance.

[Chen] Chen, P.M., Patterson, David A., A New Approach to I/O Performance Evaluation---Self-Scaling I/O Benchmarks, Predicted I/O Performance, 1993 ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems; also available on Chen's web page.

[C-FFS] Ganger, Gregory R., Kaashoek, M. Frans (page with link to postscript paper). A very well written paper focused on 1-10k file size issues; they use some similar notions (most especially their concept of grouping compared to my packing localities). Note that they focus on the 1-10k file size range, and not the sub-1k range. The 1-10k range is the weak point in reiserfs performance.
[ext2fs] by Remi Card; extensive information and source code are available. When you consider how small this file system is (~6000 lines), its effectiveness becomes all the more remarkable.

[FFS] M.K. McKusick, W.N. Joy, S.J. Leffler, and R.S. Fabry. A fast file system for UNIX. ACM Transactions on Computer Systems, 2(3):181-197, August 1984. Describes the implementation of a file system which employs parent directory location knowledge in determining file layout. It uses large blocks for all but the tail of files to improve I/O performance, and uses small blocks called fragments for the tails so as to reduce the cost due to internal fragmentation. Numerous other improvements were also made to what was once the state of the art. FFS remains the architectural foundation for many current block allocation file systems, and was later bundled with the standard Unix releases. Note that unrequested serialization and the use of fragments place it at a performance disadvantage to ext2fs, though whether ext2fs is thereby made less reliable is a matter of dispute on which I take no position (reiserfs uses preserve lists; forgive my egotism in thinking that it is enough work for me to ensure that reiserfs solves the recovery problem, and to perhaps suggest that ext2fs would benefit from the use of preserve lists when shrinking directories).

[Ganger] Gregory R. Ganger, Yale N. Patt, ``Metadata Update Performance in File Systems'' (abstract only).

[Gifford] (postscript only). Describes a file system enriched to have more than hierarchical semantics; he shares many goals with this author, forgive me for thinking his work worthwhile. If I had to suggest one improvement in a sentence, I would say his semantic algebra needs closure.

[Hitz] Dave Hitz, http://www.netapp.com/technology/level3/3002.html A rather well designed file system optimized for NFS and RAID in combination. Note that RAID increases the merits of write-optimization in block layout algorithms.
[Holton and Das] Holton, Mike, Das, Raj: ``The XFS space manager and namespace manager use sophisticated B-Tree indexing technology to represent file location information contained inside directory files and to represent the structure of the files themselves (location of information in a file).'' Note that it is still a block (extent) allocation based file system; no attempt is made to store the actual file contents in the tree. It is targeted at the needs of the other end of the file size usage spectrum from reiserfs, and is an excellent design for that purpose (and I would concede that reiserfs 1.0 is not suitable for their real-time large I/O market). SGI has also traditionally been a leader in resisting the use of unrequested serialization of I/O. Unfortunately, the paper is a bit vague on details, and source code is not freely available.

[Howard] ``Scale and Performance in a Distributed File System'', Howard, J.H., Kazar, M.L., Menees, S.G., Nichols, D.A., Satyanarayanan, M., Sidebotham, R.N., West, M.J., ACM Transactions on Computer Systems, 6(1), February 1988. A classic benchmark; it was too CPU bound for both ext2fs and reiserfs.

[Knuth] Knuth, D.E., The Art of Computer Programming, Vol. 3 (Sorting and Searching), Addison-Wesley, Reading, MA, 1973. The earliest reference discussing trees storing records of varying length.

[LADDIS] Wittle, Mark, and Bruce, Keith, ``LADDIS: The Next Generation in NFS File Server Benchmarking'', Proceedings of the Summer 1993 USENIX Conference, July 1993, pp. 111-128.

[Lewis and Denenberg] Lewis, Harry R., Denenberg, Larry, ``Data Structures & Their Algorithms'', HarperCollins Publishers, NY, NY, 1991. An algorithms textbook suitable for readers wishing to learn about balanced trees and their AVL predecessors.

[McCreight] McCreight, E.M., Pagination of B*-trees with variable length records, Commun. ACM 20 (9), 670-674, 1977. Describes algorithms for trees with variable length records.
[McVoy and Kleiman] The implementation of write-clustering for Sun's UFS.

[OLE] ``Inside OLE'' by Kraig Brockschmidt; discusses Structured Storage. http://www.microsoft.com/mspress/books/abs/5-843-2b.htm (abstract only)

[Ousterhout] J.K. Ousterhout, H. Da Costa, D. Harrison, J.A. Kunze, M.D. Kupfer, and J.G. Thompson. A trace-driven analysis of the UNIX 4.2BSD file system. In Proceedings of the 10th Symposium on Operating Systems Principles, pages 15-24, Orcas Island, WA, December 1985.

[NTFS] ``Inside the Windows NT File System'', written by Helen Custer; NTFS is architected by Tom Miller with contributions by Gary Kimura, Brian Andrew, and David Goebel. Microsoft Press, 1994. An easy to read little book. They fundamentally disagree with me on adding serialization of I/O not requested by the application programmer, and I note that the performance penalty they pay for their decision is high, especially compared with ext2fs. Their FS design is perhaps optimal for floppies and other hardware-eject media beyond OS control. A less serialized, higher performance log structured architecture is described in [Rosenblum and Ousterhout]. That said, Microsoft is to be commended for recognizing the importance of attempting to optimize for small files, and leading the OS designer effort to integrate small objects into the file name space. This book is notable for not referencing the work of persons not working for Microsoft, or providing any form of proper attribution to previous authors such as [Rosenblum and Ousterhout].

[Peacock] K. Peacock, ``The CounterPoint Fast File System'', Proceedings of the Usenix Conference, Winter 1988.

[Pike] Rob Pike and Peter Weinberger, The Hideous Name, USENIX Summer 1985 Conference Proceedings, pp. 563, Portland, Oregon, 1985. Short, informal, and drives home why inconsistent naming schemes in an OS are detrimental.
http://achille.cs.bell-labs.com/cm/cs/doc/85/1-05.ps.gz His discussion of naming in Plan 9: http://plan9.bell-labs.com/plan9/doc/names.html

[Rosenblum and Ousterhout] ``The Design and Implementation of a Log-Structured File System'', Mendel Rosenblum and John K. Ousterhout, ACM Transactions on Computer Systems, February 1992. This paper was quite influential in a number of ways on many modern file systems, and the notion of using a cleaner may be applied to a future release of reiserfs. There is an interesting ongoing debate over the relative merits of FFS vs. LFS architectures; the interested reader may peruse http://www.scriptics.com/people/john.ousterhout/seltzer93.html and the arguments by Margo Seltzer it links to.

[Snyder] ``tmpfs: A Virtual Memory File System''. Discusses a file system built to use swap space and intended for temporary files; due to a complete lack of disk synchronization it offers extremely high performance.

[Vahalia] Uresh Vahalia, ``Unix Kernel Internals''

[[category:ReiserFS]] [[category:Formatting-fixes-needed]]

{{wayback|http://www.namesys.com/X0reiserfs.html|2006-11-13}}

Three reasons why ReiserFS is great for you

Last Update: 2002, Hans Reiser

Three reasons why ReiserFS is great for you:

# ReiserFS has fast journaling, which means that you don't spend your life waiting for fsck every time your laptop battery dies, or the UPS for your mission-critical server gets its batteries disconnected accidentally by the UPS company's service crew, or your kernel was not as ready for prime time as you hoped, or the silly thing decides you mounted it too many times today.
# ReiserFS is based on fast balanced trees. Balanced trees are more robust in their performance, and are a more sophisticated algorithmic foundation for a file system.
When we started our project, there was a consensus in the industry that balanced trees were too slow for file system usage patterns. We proved that if you just do them right they are better--take a look at the benchmarks. We have fewer worst-case performance scenarios than other file systems and generally better overall performance. If you put 100,000 files in one directory, we think it's fine; many other file systems try to tell you that you are wrong to want to do it.
# ReiserFS is more space efficient. If you write 100 byte files, we pack many of them into one block. Other file systems put each of them into their own block. We don't have fixed space allocation for inodes. That saves 6% of your disk.

Ok, it's time to fess up. The interesting stuff is still in the future. Because they are nifty, we are going to add database and hypertext-like features into the file system. Only by using balanced trees, with their effective handling of small files (database small fields, hypertext keywords), as our technical foundation can we hope to do this. That was our real motivation. As for performance, we may already be slightly better than the traditional file systems (and substantially better than the journaling ones). But they have been tweaking for decades, while we have just got started. This means that over the next few years we are going to improve faster than they are.

Speaking more technically: ReiserFS is a file system using a plug-in based, object oriented variant on classical balanced tree algorithms. The results when compared to the ext2fs conventional block allocation based file system, running under the same operating system and employing the same buffering code, suggest that these algorithms are overall more efficient and with every passing month are becoming yet more so. Loosely speaking, every month we find another performance cranny that needs work; we fix it. And every month we find some way of improving our overall general usage performance.
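The space claim for small files can be checked with back-of-envelope arithmetic. The sketch below assumes 4 KiB blocks and ignores per-item metadata and internal nodes (which the real tree does have), so it is an idealized upper bound on the advantage, not a measurement:

```c
#include <assert.h>

/* Blocks consumed by 'nfiles' files of 'fsize' bytes each when every
 * file is block-aligned (each file rounds up to whole blocks). */
static long blocks_block_aligned(long nfiles, long fsize, long bs)
{
    return nfiles * ((fsize + bs - 1) / bs);
}

/* Blocks consumed when tails are packed together in the tree
 * (idealized: item headers and internal nodes are ignored). */
static long blocks_tail_packed(long nfiles, long fsize, long bs)
{
    return (nfiles * fsize + bs - 1) / bs;
}
```

For 100,000 files of 100 bytes with 4096-byte blocks, the block-aligned layout needs 100,000 blocks while the packed layout needs only 2,442, which is where the order-of-magnitude small-file space advantage comes from.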
The improvement in small file space and time performance suggests that we may now revisit a common OS design assumption: that one should aggregate small objects using layers above the file system layer. Being more effective at small files does not make us less effective for other files. This is truly a general purpose FS; our overall traditional FS usage performance is high enough to establish that. ReiserFS has a commitment to opening up the FS design to contributions; we are now adding plug-ins so that you can create your own types of directories and files.

= Introduction =

The author is one of many OS researchers who are attempting to unify the name spaces in the operating system in varying ways [e.g. Pike, ``The Use of Name Spaces in Plan9'']. None of us is well funded compared with the size of the task, and I am far from an exception to this rule. The natural consequence is that we have each attacked one small aspect of the task. My contribution is in incorporating small objects into the file system name space effectively. This implementation offers value to the average Linux user, in that it offers generally good performance compared to the current Linux file system, ext2fs. It also saves space to an extent that is important for some applications, and convenient for most. It does extremely well for large directories, and has a variety of minor advantages. Since ext2fs is very similar to FFS and UFS in performance, the implementation also offers potential value to commercial OS vendors who desire greater than ext2fs performance without directory size issues, and who appreciate the value of a better foundation for integrating name spaces throughout the OS.

= Why Is There A Move Among Some OS Designers Towards Unifying Name Spaces? =

An operating system is composed of components that access other components through interfaces.
Operating systems are complex enough that, like national economies, the architect cannot centrally plan the interactions of the components they are composed of. The architect can, however, provide a structural framework that has a marked impact on the efficiency and utility of those interactions. Economists have developed principles that govern large economic systems. Are there system principles that we might use to start a discussion of the ways increasing component interactivity via naming system design impacts the total utility of an operating system? I propose these:

* If one increases the number of other components that a particular component can interact with, one increases its expressive power and thereby its utility.
* One can increase the number of other components that a particular component can interact with either by increasing the number of interfaces it has, or by increasing the number of components that are accessible by its current interfaces.
* The cost of component interfaces dominates software design cost, just as the cost of wires dominates circuit design cost.
* Total system utility tends to be proportional not to the number of components, but to the number of possible component interactions.

It is not simply the number of components that one has that determines an OS's expressive power; it is the number of opportunities to use them that determines it. The number of these opportunities is proportional to the number of possible combinations of components, and the number of possible combinations is determined by their connectedness. Component connectedness in OS design is determined by name space design, to much the same extent that buses determine it in circuit design. Allow me to illustrate the impact of these principles with an imaginary example. Suppose two imaginary OS vendors with equally talented programmers hire two different OS architects.
Suppose one of the architects centers the OS design around a single name space design that allows all of the components to access all other components via a single interface (assume this is possible; it is a theoretical example). Suppose the other allows the ten different design groups in the company that are developing components to create their own ten name spaces. Suppose that the unified name space OS architect has half of the resources of the fragmented name space OS architect and creates half as many components. While the number of components is half as large, the number of connections is (1/2)²/((1/10)²·10) = 2.5 times larger. If you accept my hypothesis that utility is more proportional to connections than to components, then the unified operating system with half the development cost will still offer more expressive utility. That is a powerful motivation.

To return briefly to the long researched principles governing another member of the class of large systems, the economies of nations: it is perhaps interesting to note that Adam Smith in ``The Wealth of Nations'' engaged in substantial study of the link between the extent of interconnectedness and the development of civilization, where the extent of interconnectedness was determined by waterways, etc. The link he found for economic systems was no less crucial than what is being suggested here for the effect of component interconnectedness on the total utility of software systems. I suggest that I am merely generalizing a long established principle from another field of science, namely that total utility in large systems with components that interact to generate utility is determined by the extent of their interconnection. There are many exceptions to these principles: not all chips on a motherboard sit on the bus, and analogous considerations apply to both OS design and the economies of nations.
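The arithmetic in the two-architect example can be checked mechanically. The counting convention here (n×n ordered pairs of mutually reachable components per name space) is an illustrative assumption; any convention proportional to n² cancels out of the ratio:

```c
#include <assert.h>

/* Possible component interactions when 'nspaces' disjoint name spaces
 * each make 'n_per_space' components mutually reachable, counted as
 * n*n ordered pairs per space. */
static long interactions(long n_per_space, long nspaces)
{
    return nspaces * n_per_space * n_per_space;
}
```

With 100 components total: the unified design with half the components gives 50×50 = 2500 interactions, while ten fragmented name spaces of 10 components each give 10×(10×10) = 1000, a factor of 2.5 in favor of the smaller unified system.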
I hope the reader will accept that space considerations make it appropriate to gloss over these, and will consider the central point that under some circumstances unifying name spaces in a design can dramatically improve the utility of an OS. That can be an enormous motivation, and it has moved a number of OS researchers in their work [e.g. Pike, ``The Use of Name Spaces in Plan 9'' and ``The Hideous Name'', http://magnum.cooper.edu:9000/~rp/html/rob.html]. Unfortunately, it is not a small technical effort to combine name spaces. To combine 10 name spaces requires, if not the effort to create 10 name spaces, perhaps an effort equivalent to creating 5 of the name spaces. Usually each of the name spaces has particular performance and semantic power requirements that require enhancing the unified name space, and it usually requires technical innovation to combine the advantages of each of the separated name spaces into a unified name space. I would characterize none of the research groups currently approaching this unification problem as having funding equivalent to what went into creating 5 of the name spaces they would like to unify, and we are certainly no exception. For this reason I have picked one particular aspect of this larger problem for our focus: allowing small objects to effectively share the same file system interface that large objects currently use. As operating systems increase the number of their components, the higher development cost of a file system able to handle small files becomes more worth the multiplicative effect it has on OS utility, as well as its reduction of OS component interface cost. = Should File Boundaries Be Block Aligned? = Making file boundaries block aligned has a number of effects: 1) it minimizes the number of blocks a file is spread across (which is especially beneficial for multiple block files when locality of reference across files is poor); 2) it wastes disk and buffer space in storing every less than fully packed block; 3) it wastes I/O bandwidth with every access to a less than fully packed block when locality of reference is present; 4) it increases the average number of block fetches required to access every file in a directory; and 5) it results in simpler code. The simpler code of block aligning file systems follows from not needing to create a layering to distinguish the units of the disk controller and buffering algorithms from the units of space allocation, and from not needing to optimize the packing of nodes as is done in balanced tree algorithms. For readers who have not been involved in balanced tree implementations: algorithms of this class are notorious for being much more work to implement than one would expect from their description. Sadly, they also appear to offer the highest performance solution for small files, once I remove certain simplifications from their implementation and add certain optimizations common to file system designs. I regret that code complexity (30k lines) is a major disadvantage of the approach compared to the 6k lines of the ext2fs approach. I started our analysis of the problem with an assumption that I needed to aggregate small files in some way, and that the question was which solution was optimal. The simplest solution was to aggregate all small files in a directory together into either a file or the directory. But any aggregation into a file or directory wastes part of the last block in the aggregation. What does one do if there are only a few small files in a directory, aggregate them into the parent of the directory?
What if there are only a few small files in a directory at first, and then there are many small files? How do I decide what level to aggregate them at, and when to take them back from a parent of the directory and store them directly in the directory? As we did our analysis of these questions we realized that this problem was closely related to the balancing of nodes in a balanced tree. The balanced tree approach, by using an ordering of files which are then dynamically aggregated into nodes at a lower level, rather than a static aggregation or grouping, avoids this set of questions. In my approach I store both files and filenames in a balanced tree, with small files, directory entries, inodes, and the tail ends of large files all being more efficiently packed as a result of relaxing the requirements of block alignment and eliminating the use of a fixed space allocation for inodes. I have a sophisticated and flexible means of arranging the aggregation of files for maximal locality of reference, through defining the ordering of items in the tree. The body of a large file is stored in unformatted nodes that are attached to the tree but isolated from the effects of possible shifting by the balancing algorithms. Approaches such as [Apple] and [Holton and Das] have stored filenames, but not files, in balanced trees. None of the file systems C-FFS, NTFS, or XFS aggregates files; all of them block align files, though all of them also do some variation on storing small files in the statically allocated block address fields of inodes if they are small enough to fit there. [C-FFS] has published an excellent discussion of both their approach and why small files rob a conventional file system of performance more in proportion to the number of small files than to the number of bytes consumed by small files. However, I must note that their notion of what constitutes small differs from ours by one or two orders of magnitude.
Their use of an exo-kernel is simply an excellent approach for operating systems that have that as an available option. Semantics (files), packing (blocks/nodes), caching (read-ahead sizes, etc.), and the hardware interfaces of disk (sectors) and paging (pages) all have different granularity issues associated with them: a central point of our approach is that the optimal granularity of these often differs, and abstracting these into separate layers, in which the granularity of one layer does not unintentionally impact other layers, can improve space/time performance. Reiserfs innovates in that its semantic layer often conveys to the other layers an ungranulated ordering rather than one granulated by file boundaries. The reader is encouraged to note the areas in which reiserfs needs to go farther in doing so while reading the algorithms. = Balanced Trees and Large File I/O = There has long been an odd informal consensus that balanced trees are too slow for use in storing large files, perhaps originating in the performance of databases that have attempted to emulate file systems using balanced tree algorithms that were not originally architected for file system access patterns or their looser serialization requirements. It is hopefully easy for the reader to understand that storing many small files and tail ends of files in a single node where they can all be fetched in one I/O leads directly to higher performance. Unfortunately, it is quite complex to understand the interplay between I/O efficiency and block size for larger files, and space does not allow a systematic review of traditional approaches.
The reader is referred to [FFS], [Peacock], [McVoy], [Holton and Das], [Bach], [OLE], and [NTFS] for treatments of the topic, and discussions of various means of 1) reducing the effect of block size on CPU efficiency, 2) eliminating the need for inserting rotational delay between successive blocks, 3) placing small files into either inodes or directories, and 4) performing read-ahead. More commentary on these is in the annotated bibliography. Reiserfs has the following architectural weaknesses that stem directly from the overhead of repacking to save space and increase block size: 1) when the tail (files < 4k are all tail) of a file grows large enough to occupy an entire node by itself it is removed from the formatted node(s) it resides in, and it is converted into an unformatted node ([FFS] pays a similar conversion cost for fragments), 2) a tail that is smaller than one node may be spread across two nodes which requires more I/O to read if locality of reference is poor, 3) aggregating multiple tails into one node introduces separation of file body from tail, which reduces read performance ([FFS] has a similar problem, and for reiserfs files near the node in size the effect can be significant), 4) when you add one byte to a file or tail that is not the last item in a formatted node, then on average half of the whole node is shifted in memory. If any of your applications perform I/O in such a way that they generate many small unbuffered writes, reiserfs will make you pay a higher price for not being able to buffer the I/O. Most applications that create substantial file system load employ effective I/O buffering, often simply as a result of using the I/O functions in the standard C libraries. By avoiding accesses in small blocks/extents reiserfs improves I/O efficiency. 
Extent-based file systems such as VxFS, and write-clustering systems such as ext2fs, are not so effective in applying these techniques that they choose 512-byte rather than 1k blocks as their defaults. Ext2fs reports a 20% speedup when 4k rather than 1k blocks are used, but the authors of ext2fs advise the use of 1k blocks to avoid wasting space. There are a number of worthwhile large file optimizations that have not been added to either ext2fs or reiserfs, and both file systems are somewhat primitive in this regard, reiserfs being the more primitive of the two. Large files simply were not my research focus, and it being a small research project, I did not implement the many well known techniques for enhancing large file I/O. The buffering algorithms are probably more crucial than any other component in large file I/O, and partly out of a desire for a fair comparison of the approaches I have not modified these. I have added no significant optimizations for large files, beyond increasing the block size, that are not found in ext2fs. Except for the size of the blocks, there is not a large inherent difference between: 1) the cost of adding a pointer to an unformatted node to my tree plus writing the node, and 2) adding an address field to an inode plus writing the block. It is likely that, except for block size, the primary determinants of high performance large file access are orthogonal to the decision of whether to use balanced tree algorithms for small and medium sized files. For large files we get some advantage from our tree being no less balanced than the tree formed by an inode which points to a triple indirect block, though we have no easy method of measuring the performance gain from that. There is performance overhead due to the memory bandwidth cost of balancing nodes for small files; we think it is worth it, though.
= Serialization and Consistency = The issues of ensuring recoverability with minimal serialization and data displacement necessarily dominate high performance design. Let's define the two extremes in serialization so that the reason for this can be clear. Consider the relative speed of a set of I/Os in which every block request in the set is fed to the elevator algorithms of the kernel and the disk drive firmware fully serially, each request awaiting the completion of the previous request. Now consider the other extreme, in which all block requests are fed to the elevator algorithms together, so that they may all be sorted and performed in close to their sorted order (disk drive firmware doesn't use a pure elevator algorithm). The unserialized extreme may be more than an order of magnitude faster, due to the cost of rotations and seeks. Unnecessarily serializing I/O prevents the elevator algorithm from doing its job of placing all of the I/Os in their layout sequence rather than their chronological sequence. Most of high performance design centers around making I/Os in the order they are laid out on disk, and laying out blocks on disk in the order that the I/Os will want to be issued. [Snyder] discusses a file system that obtains high performance from a complete lack of disk synchronization, but is only suitable for temporary files that don't need to survive reboot. I think its known value to Solaris users indicates that the optimal buffering policy varies from file to file. [Ganger] discusses methods for using ordering of writes rather than serialization for ensuring conventional file system meta-data integrity; [McVoy] previously suggested but did not implement ordering of buffer writes. Ext2fs is fast in substantial part due to avoiding synchronous writes of metadata, and I have much personal experience with it that leads me to prefer compiles that are fast.
[I would like to see it adopt a policy that all dirty buffers for files not flagged as temporary are queued for writing, and that the existence of a dirty buffer means that the disk is busy. This will require replacing buffer I/O locking with copy-on-write, but an idle disk is such a terrible thing to waste. :-)] [NTFS] by default adds unnecessary serialization to an extent that even older file systems such as [FFS] do not, and its performance characteristics reflect that. In fairness, it should be said that it is the superior approach for most removable media without software control of ejection (e.g. IBM PC floppies). Reiserfs employs a new scheme called preserve lists for ensuring recoverability, which avoids overwriting old meta-data by writing new meta-data nearby rather than over the old. = Why Aggregate Small Objects at the File System Level? = There has long been a tradition of file system developers deciding that effective handling of small files is not significant to performance, and of application programmers caring enough about performance for small files not to store them as separate entities in the file system. To store small objects one may either make the file system efficient for the task, or sidestep the problem by aggregating small objects in a layer above the file system. Sidestepping the problem has three disadvantages: utility, code complexity, and performance. Utility and Code Complexity: allowing OS designers to effectively use a single namespace with a single interface for both large and small objects decreases coding cost and increases the expressive power of components throughout the OS. I feel reiserfs shows the effects of a larger development investment focused on a simpler interface when compared with many solutions for this currently available in the object oriented toolkit community, such as the Structured Storage available in Microsoft's [OLE].
By simpler I mean that I added nothing to the file system API to distinguish large and small objects, and I leave it to the directory semantics and archiving programs to aggregate objects. Multiple layers cost more to implement, cost more to code the interfaces for utilizing, and provide less flexibility. Performance: it is most commonly the case that when one layers one file system on top of another the performance is substantially reduced, and Structured Storage is not an exception to this general rule. Reiserfs, which does not attempt to delegate the small object problem to a layer above, avoids this performance loss. I have heard it suggested by some that this layering avoids the performance loss from syncing on file close, as many file systems do. I suggest that this is adding an error to an error rather than fixing it. Let me make clear that I believe those who write such layers above the file system do not do so out of stupidity. I know of at least one company at which a solution that layers small object storage above the file system exists because the file system developers refused to listen to the non-file-system group's description of its needs, and the file system group had to be sidestepped in creating the solution. Current file systems are fairly well designed for the purposes that their users currently use them for: my goal is to change file size usage patterns. The author remembers arguments that once showed clearly that there was no substantial market need for disk drives larger than 10MB, based on then-current usage statistics. While [C-FFS] points out that 80% of file accesses are to files below 10k, I do not believe it reasonable to expect usage measurements taken on file systems for which small files are inappropriate to show that small files are frequently used. Application programmers are smarter than that.
Currently 80% of file accesses are to the first order of magnitude in file size for which it is currently sensible to store the object in the file system. I regret that one can only speculate as to whether, once file systems become effective for small files and database tasks, usage patterns will change to 80% of file accesses being to files less than 100 bytes. What I can do is show, via the 80/20 Banded File Set Benchmark presented later, that in such circumstances small file performance potentially dominates total system performance. In summary, the on-going reinvention of incompatible object aggregation techniques above the file system layer is expensive, less expressive, less integrated, slower, and less efficient in its storage than incorporating balanced tree algorithms into the file system. = Tree Definitions = Balanced trees are used in databases, and more generally, wherever a programmer needs to search and store to non-random memory by a key, and has the time to code it this way. The usual evolution for programmers is to first think that hashing will be simpler and more efficient, and then realize only after getting into the sordid details that the combination of space efficiency, minimized disk accesses, and the feasibility of caching the top part of the tree makes the tree approach more effective. It is the usual thing to first try to do hashing, and then by the time the details are worked out, to have a balanced tree. The cost of effectively handling bucket overflow just isn't less than the cost of balancing, unless the buckets are always all in RAM. Hashing is often a good solution when there is no non-random memory involved, such as when hashing a cache. The Linux dcache code uses hashing for accessing a cache of in-memory directory entries. Sometimes one uses partial or full hashing of keys within a balanced tree.
If you do full hashing within a tree, and you cache the top part of that tree, you have something rather similar to extensible hashing, except that it is more flexible and efficient. Sometimes programmers code using unbalanced trees; most file systems do essentially that. Balanced trees generally do a better job of minimizing the average number of disk accesses. There is literature establishing that balanced trees are optimal for the worst case when there is no caching of the tree. This is rather pointless literature, as the average case when cached is what is important, and I am afraid that the existing literature proves that which is feasible to prove rather than that which is relevant. That said, practitioners know from experience that making the tree less balanced leads to more I/Os. Discussions of the exceptions to this are rather interesting, but not for here.... I regret that I must assume that the reader is familiar with basic balanced tree algorithms [Wood], [Lewis and Denenberg], [Knuth], [McCreight]. No attempt will be made to survey tree design here, since balanced trees are one of the most researched and complex topics in algorithm theory, and require treatment at length. I must compound this discourtesy with a concise set of definitions that sorely lack accompanying diagrams; my apologies. Finally, I'll truly annoy the reader by saying that the header files contain nice ASCII art, and if you want the full definition of the structures, the source is the place. Classically, balanced trees are designed with the set of keys assumed to be defined by the application, and the purpose of the tree design is to optimize searching through those keys. In my approach the purpose of the tree is to optimize the reference locality and space-efficient packing of objects, and the keys are defined as best optimizes the algorithm for that.
Keys are used in place of inode numbers in the file system, thereby substituting a mapping of keys to node locations (the internal nodes) for a mapping of inode numbers to file locations. Keys are longer than inode numbers, but one needs to cache fewer of them than one would need to cache inode numbers when more than one file is stored in a node. In my tree, I still require that a filename be resolved one component at a time. It is an interesting topic for future research whether this is necessary or optimal. This is a more complex issue than a casual reader might realize: directory-at-a-time lookup accomplishes a form of compression, makes mounting other name spaces and file system extensions simpler, makes security simpler, and makes future enhanced semantics simpler. Since small files typically lead to large directories, it is fortuitous that, as a natural consequence of our use of tree algorithms, our directory mechanisms are much more effective for very large directories than those of most other file systems (notable exceptions include [Holton and Das]). The tree has three node types: internal nodes, formatted nodes, and unformatted nodes. The contents of internal and formatted nodes are sorted in the order of their keys. (Unformatted nodes contain no keys.) Internal nodes consist of pointers to sub-trees separated by their delimiting keys. The key that precedes a pointer to a sub-tree is a duplicate of the first key in the first formatted node of that sub-tree. Internal nodes exist solely to allow determining which formatted node contains the item corresponding to a key. ReiserFS starts at the root node, examines its contents, and based on them determines which subtree contains the item corresponding to the desired key. From the root node reiserfs descends into the tree, branching at each node, until it reaches the formatted node containing the desired item.
The first (bottom) level of the tree consists of unformatted nodes, the second level consists of formatted nodes, and all levels above consist of internal nodes. The highest level contains the root node. The number of levels is increased as needed by adding a new root node at the top of the tree. All paths from the root of the tree to the formatted leaves are equal in length, and all paths to the unformatted leaves are also equal in length and one node longer than the paths to the formatted leaves. This equality in path length, and the high fanout it provides, are vital to high performance, and in the Drops section I will describe how the lengthening of the path that occurred as a result of introducing the [BLOB] approach (the use of indirect items and unformatted nodes) proved a measurable mistake. Formatted nodes consist of items. Items have four types: direct items, indirect items, directory items, and stat data items. All items contain a key which is unique to the item. This key is used to sort, and find, the item. Direct items contain the tails of files; tails are the last part of the file (the last file_size modulo block size bytes of a file). Indirect items consist of pointers to unformatted nodes. All but the tail of a file is contained in its unformatted nodes. Directory items contain the key of the first directory entry in the item, followed by a number of directory entries. Depending on the configuration of reiserfs, stat data may be stored as a separate item, or it may be embedded in a directory entry. We are still benchmarking to determine which way is best. A file consists of a set of indirect items followed by a set of up to two direct items, with two direct items representing the case when a tail is split across two nodes.
If a tail is larger than the maximum size of a file that can fit into a formatted node but is smaller than the unformatted node size (4k), then it is stored in an unformatted node, and a pointer to it plus a count of the space used is stored in an indirect item. Directories consist of a set of directory items. Directory items consist of a set of directory entries. Directory entries contain the filename and the key of the file which is named. There is never more than one item of the same item type from the same object stored in a single node (there is no reason one would want to use two separate items rather than combining them). The first item of a file or directory contains its stat data. When performing balancing, and analyzing the packing of a node and its two neighbors, we ensure that the three nodes cannot be compressed into two nodes. I feel greater compression than this is best left to an FS cleaner to perform rather than attempting it dynamically.

= ReiserFS Structures =

The ReiserFS tree has Max_Height = N (the current default is N = 5). The tree lies in disk blocks, and each disk block that belongs to the tree begins with a block head. An internal node is the place for keys and pointers to disk blocks:

 Block_Head | Key 0 | Key 1 | --- | Key N | Pointer 0 | Pointer 1 | --- | Pointer N | Pointer N+1 | ..Free Space..

A leaf node is the place for items and item headers, with headers and item bodies growing toward each other:

 Block_Head | IHead 0 | IHead 1 | --- | IHead N | ......Free Space...... | Item N | --- | Item 1 | Item 0

An unformatted node is the place for the data of a big file; it contains only raw file contents, with no keys or headers.

ReiserFS objects are files and directories. The maximum number of objects is 2^32 - 4 = 4 294 967 292. Each object is a set of items:

* File items: 1) StatData item + [Direct item] (for small files: size from 0 bytes up to MAX_DIRECT_ITEM_LEN = blocksize - 112 bytes); 2) StatData item + InDirect item + [Direct item] (for big files: size > MAX_DIRECT_ITEM_LEN bytes).
* Directory items: StatData item + Directory item.

Every ReiserFS object has an object ID and a key.

== Internal Node Structures ==

struct block_head:
* blk_level (unsigned short, 2 bytes): level of the block in the tree (1 = leaf; 2, 3, 4, ... = internal)
* blk_nr_item (unsigned short, 2 bytes): number of keys in an internal block, or number of items in a leaf block
* blk_free_space (unsigned short, 2 bytes): free space in the block, in bytes
* blk_right_delim_key (struct key, 16 bytes): right delimiting key for this block (leaf nodes only)
* total: 6 (padded to 8) bytes for internal nodes; 22 (padded to 24) bytes for leaf nodes

struct key:
* k_dir_id (__u32, 4 bytes): ID of the parent directory
* k_object_id (__u32, 4 bytes): ID of the object (also its inode number)
* k_offset (__u32, 4 bytes): offset from the beginning of the object to the current byte of the object
* k_uniqueness (__u32, 4 bytes): type of the item (StatData = 0, Direct = -1, InDirect = -2, Directory = 500)
* total: 16 bytes

struct disk_child (pointer to a disk block):
* dc_block_number (unsigned long, 4 bytes): disk child's block number
* dc_size (unsigned short, 2 bytes): disk child's used space
* total: 6 (padded to 8) bytes

== Leaf Node Structures ==

Everything in the file system is stored as a set of items. Each item has an item head, which contains the key of the item, its free space (for indirect items), and the location of the item body within the block.

struct item_head (IHead):
* ih_key (struct key, 16 bytes): key used to search for the item; all item headers are sorted by this key
* u.ih_free_space / u.ih_entry_count (__u16, 2 bytes): free space in the last unformatted node for an InDirect item; 0xFFFF for a Direct item; 0xFFFF for a StatData item; the number of directory entries for a Directory item
* ih_item_len (__u16, 2 bytes): total size of the item body
* ih_item_location (__u16, 2 bytes): offset of the item body within the block
* ih_reserved (__u16, 2 bytes): used by reiserfsck
* total: 24 bytes

There are four types of items: stat data items, directory items, indirect items, and direct items.

struct stat_data (the ReiserFS version of a UFS disk inode, minus the address blocks):
* sd_mode (__u16, 2 bytes): file type and permissions
* sd_nlink (__u16, 2 bytes): number of hard links
* sd_uid (__u16, 2 bytes): owner id
* sd_gid (__u16, 2 bytes): group id
* sd_size (__u32, 4 bytes): file size
* sd_atime (__u32, 4 bytes): time of last access
* sd_mtime (__u32, 4 bytes): time the file was last modified
* sd_ctime (__u32, 4 bytes): time the inode (stat data) was last changed (except for changes to sd_atime and sd_mtime)
* sd_rdev (__u32, 4 bytes): device
* sd_first_direct_byte (__u32, 4 bytes): offset from the beginning of the file to the first byte of the file's direct item; -1 for a directory; 1 for small files (direct items only); > 1 for big files with indirect and direct items; -1 for big files with indirect but no direct items
* total: 32 bytes

A directory item consists of entry heads and file names, packed from opposite ends of the item:

 deHead 0 | deHead 1 | --- | deHead N | fileName N | --- | fileName 1 | fileName 0

A direct item contains a small file body. An indirect item is an array of pointers to unformatted blocks (each unfPointer is 4 bytes); the unformatted blocks contain the body of a big file.

struct reiserfs_de_head (deHead):
* deh_offset (__u32, 4 bytes): third component of the directory entry key (all entry heads are sorted by this value)
* deh_dir_id (__u32, 4 bytes): object id of the parent directory of the object referenced by the entry
* deh_objectid (__u32, 4 bytes): object id of the object referenced by the entry
* deh_location (__u16, 2 bytes): offset of the name within the whole item
* deh_state (__u16, 2 bytes): flags: 1) entry contains stat data (for the future); 2) entry is hidden (unlinked)
* total: 16 bytes

fileName is the name of the file, an array of bytes of variable length. The maximum file name length is blocksize - 64 (4032 bytes for a 4k block size).

= Using the Tree to Optimize Layout of Files =

There are four levels at which layout optimization is performed: 1) the mapping of logical block numbers to physical locations on disk, 2) the assigning of nodes to logical block numbers, 3) the ordering of objects within the tree, and 4) the balancing of the objects across the nodes they are packed into.

== Physical Layout ==

This mapping is performed by the disk drive manufacturer for SCSI drives; for IDE drives the logical block number to physical location mapping is done by the device driver; and for all drives it is also potentially done by volume management software.
The logical block number to physical location mapping by the drive manufacturer is usually done using cylinders. I agree with the authors of [ext2fs] and most others that the significant file placement feature of FFS was not the use of actual cylinder boundaries, but the placing of files and their inodes on the basis of their parent directory's location. FFS used explicit knowledge of actual cylinder boundaries in its design. I find that minimizing the distance in logical blocks of semantically adjacent nodes, without tracking cylinder boundaries, accomplishes an excellent approximation of optimizing according to actual cylinder boundaries, and I find its simplicity an aid to implementation elegance. == Node Layout == When I place nodes of the tree on the disk, I search for the first empty block in the bitmap (of used block numbers), starting at the location of the left neighbor of the node in the tree ordering and moving in the direction I last moved in. This was experimentally found to be better than the following alternatives for the benchmarks employed: 1) taking the first non-zero entry in the bitmap, 2) taking the entry after the last one that was assigned, in the direction last moved in (this was 3% faster for writes and 10-20% slower for subsequent reads), 3) starting at the left neighbor and moving in the direction of the right neighbor. When changing block numbers for the purpose of avoiding overwriting sending nodes before shifted items reach disk in their new recipient node (see the description of preserve lists later in the paper), the benchmarks employed were ~10% faster when starting the search from the left neighbor rather than from the node's current block number, even though it adds significant overhead to determine the left neighbor (the current implementation risks I/O to read the parent of the left neighbor). It used to be that we would reverse direction when we reached the end of the disk drive.
Fortunately we checked to see if it makes a difference which direction one moves in when allocating blocks to a file, and indeed we found it made a significant difference to always allocate in the increasing block number direction. We hypothesize that this is due to matching disk spin direction by allocating using increasing block numbers.

== Ordering within the Tree ==

While I give here an example of how I have defined keys to optimize locality of reference and packing efficiency, I would like to stress that key definition is a powerful and flexible tool that I am far from finished experimenting with. Some key definition decisions depend very much on usage patterns, and this means that someday one will select from several key definitions when creating the file system. For example, consider the decision of whether to pack all directory entries together at the front of the file system, or to pack the entries near the files they name. For large file usage patterns one should pack all directory items together, since systems with such usage patterns are effective in caching the entries for all directories. For small files the name should be near the file. Similarly, for large files the stat data should be stored separately from the body, either with the other stat data from the same directory, or with the directory entry. (It was likely a mistake for me to not assign stat data its own key in the current implementation, as packing it in with direct and indirect items complicates our code for handling those items, and prevents me from easily experimenting with the effects of changing its key assignment.) It is not necessary for a file's packing to reflect its name, that is merely my default. With each file my next release will offer the option of overriding the default by use of a system call.
It is feasible to pack an object completely independently of its semantics using these algorithms, and I predict that there will be many applications, perhaps even most, for which a packing different from that determined by object names is more appropriate. Currently the mandatory tying of packing locality to semantics results in the distortion of both semantics and packing from what might otherwise be their independent optimums, much as tying block boundaries to file boundaries distorts I/O and space allocation algorithms from their separate optimums. For example, placing most files accessed while booting in their access order at the start of the disk is a very tempting future optimization that the use of packing localities makes feasible to consider.

The Structure of a Key: each file item has a key with structure <locality_id, object_id, offset, uniqueness>. The locality_id is by default the object_id of the parent directory. The object_id is the unique id of the file, and is set to the first unused objectid when the object is created. The tendency of this to result in successive object creations in a directory being adjacently packed is often fortuitous for many usage patterns. For files the offset is the offset within the logical object of the first byte of the item. In version 0.2 all directory entries had their own individual keys stored with them and were each distinct items; in the current version I store one key in the item, which is the key of the first entry, and compute each entry's key as needed from the one key stored in the item. For directories the offset key component is the first four bytes of the filename, which you may think of as the lexicographic rather than numeric offset. For directory items the uniqueness field differentiates filename entries identical in the first 4 bytes.
For all item types it indicates the item type and for the leftmost item in a buffer it indicates whether the preceding item in the tree is of the same type and object as this item. Placing this information in the key is useful when analyzing balancing conditions, but increases key length for non-directory items, and is a questionable architectural feature. Every file has a unique objectid, but this cannot be used for finding the object, only keys are used for that. Objectids merely ensure that keys are unique. If you never use the reiserfs features that change an object's key then it is immutable, otherwise it is mutable. (This feature aids support for NFS daemons, etc.) We spent quite some time debating internally whether the use of mutable keys for identifying an object had deleterious long term architectural consequences: in the end I decided it was acceptable iff we require any object recording a key to possess a method for updating its copy of it. This is the architectural price of avoiding caching a map of objectid to location that might have very poor locality of reference due to objectids not changing with object semantics. I pack an object with the packing locality of the directory it was first created in unless the key is explicitly changed. It remains packed there even if it is unlinked from the directory. I do not move it from the locality it was created in without an explicit request, unlike the [C-FFS] approach which stores all multiple link files together and pays the cost of moving them from their original locations when the second link occurs. I think a file linked with multiple directories might as well get at least the locality reference benefits of one of those directories. In summary, this approach 1) places files from the same directory together, 2) places directory entries from the same directory together with each other and with the stat data for the directory. 
Note that there is no interleaving of objects from different directories in the ordering at all, and that all directory entries from the same directory are contiguous. You'll note that this does not accomplish packing the files of small directories with common parents together, and does not employ the full partial ordering in determining the linear ordering; it merely uses parent directory information. I feel the proper place for employing full tree structure knowledge is in the implementation of an FS cleaner, not in the dynamic algorithms.

== Node Balancing Optimizations ==

When balancing nodes I do so according to the following ordered priorities:

1. minimize number of nodes used
2. minimize number of nodes affected by the balancing operation
3. minimize the number of uncached nodes affected by the balancing operation
4. if shifting to another formatted node is necessary, maximize the bytes shifted

Priority 4 is based on the assumption that the location of an insertion of bytes into the tree is an indication of the likely future location of an insertion, and that policy 4 will on average reduce the number of formatted nodes affected by future balance operations. There are more subtle effects as well: if one places nodes next to each other randomly, and one has a choice between those nodes being mostly moderately efficiently packed, or packed to an extreme of either well or poorly packed, one is more likely to be able to combine more of the nodes if one chooses the policy of extremism. Extremism is a virtue in space efficient node packing. The maximal shift policy is not applied to internal nodes, as extremism is not a virtue in time efficient internal node balancing.
=== Drops === (The difficult design issues in the current version that our next version can do better) Consider dividing a file or directory into drops, with each drop having a separate key, and no two drops from one file or directory occupying the same node without being compressed into one drop. The key for each drop is set to the key for the object (file or directory) plus the offset of the drop within the object. For directories the offset is lexicographic and by filename, for files it is numeric and in bytes. In the course of several file system versions we have experimented with and implemented solid, liquid, and air drops. Solid drops were never shifted, and drops would only solidify when they occupied the entirety of a formatted node. Liquid drops are shifted in such a way that any liquid drop which spans a node fully occupies the space in its node. Like a physical liquid it is shiftable but not compressible. Air drops merely meet the balancing condition of the tree. Reiserfs 0.2 implemented solid drops for all but the tail of files. If a file was at least one node in size it would align the start of the file with the start of a node, block aligning the file. This block alignment of the start of multi-drop files was a design error that wasted space: even if the locality of reference is so poor as to make one not want to read parts of semantically adjacent files, if the nodes are near to each other then the cost of reading an extra block is thoroughly dwarfed by the cost of the seek and rotation to reach the first node of the file. As a result the block alignment saves little in time, though it costs significant space for 4-20k files. Reiserfs with block alignment of multi-drop files and no indirect items experienced the following rather interesting behavior that was partially responsible for making it only 88% space efficient for files that averaged 13k (the linux kernel) in size. 
When the tail of a larger than 4k file was followed in the tree ordering by another file larger than 4k, since the drop before was solid and aligned, and the drop afterwards was solid and aligned, no matter what size the tail was, it occupied an entire node. In the current version we place all but the tail of large files into a level of the tree reserved for full unformatted nodes, and create indirect items in the formatted nodes which point to the unformatted nodes. This is known in the database literature as the [BLOB] approach. This extra level added to the tree comes at the cost of making the tree less balanced (I consider the unformatted nodes pointed to as part of the tree) and increasing the maximal depth of the tree by 1. For medium sized files, the use of indirect items increases the cost of caching pointers by mixing data with them. The reduction in fanout more frequently causes the read algorithms to fetch only one node of the file being read at a time, as one waits to read the uncached indirect item before reading the node with the file data. There are more parents per file read with the use of indirect items than with internal nodes, as a direct result of the reduced fanout due to mixing tails and indirect items in the node. The most serious flaw is that these reads of the various nodes necessary to the reading of the file incur additional rotations and seeks compared to the case with drops. With my initial drop approach they are usually sequential in their disk layout, even the tail, and the internal node parent points to all of them in such a way that all of them that are contained by that parent (or by another internal node in cache) can be requested at once in one sequential read. Non-sequential reads of nodes are more than an order of magnitude more costly than sequential reads, and this single consideration dominates effective read optimization.
Unformatted nodes make file system recovery faster and less robust, in that one reads their indirect item rather than them to insert them into the recovered tree, and one cannot read them to confirm that their contents are from the file that an indirect item says they are from. In this they make reiserfs similar to an inode based system without logging. A moderately better solution would have been to have simply eliminated the requirement for placement of the start of multi-node files at the start of nodes, rather than introducing BLOBs, and to have depended on the use of a file system cleaner to optimally pack the 80% of files that don't move frequently, using algorithms that move even solid drops. Yet that still leaves the problem of formatted nodes not being efficient for mmap() purposes (one must copy them before writing rather than merely modifying their page table entries, and memory bandwidth is expensive even if CPU is cheap). For this reason I have the following plan for the next version. I will have three trees: one tree maps keys to unformatted nodes, one tree maps keys to formatted nodes, and one tree maps keys to directory entries and stat_data. Now it is only natural if you are thinking that that would mean that to read a file, accessing first the directory entry and stat_data, then the unformatted node, then the tail, one must hop long distances across the disk, going first to one tree and then the other. This is indeed why it took me two years to realize it could be made to work. My plan is to interleave the nodes of the three trees according to the following algorithm: block numbers are assigned to nodes when the nodes are created, or preserved, and someday will be assigned when the cleaner runs. The choice of block number is based on first determining what other node it should be placed near, and then finding the nearest free block that can be found in the elevator's current direction.
Currently we use the left neighbor of the node in the tree as the node it should be placed near. This is nice and simple. Oh well. Time to create a virtual neighbor layer. The new scheme will continue to first determine the node it should be placed near, and then start the search for an empty block from that spot, but it will use a more complicated determination of what node to place it near. This method will cause all nodes from the same packing locality to be near each other, will cause all directory entries and stat_data to be grouped together within that packing locality, and will interleave formatted and unformatted nodes from the same packing locality. Pseudo-code is best for describing this:

 /* For use by reiserfs_get_new_blocknrs when determining where in the
    bitmap to start the search for a free block, and for use by the
    read-ahead algorithm when there are not enough nodes to the right
    and in the same packing locality for packing locality read-ahead
    purposes. */
 get_logical_layout_left_neighbors_blocknr(key of current node)
 {
     /* Based on examination of the current node's key and type, find
        the virtual neighbor of that node. */
     if body node
         if first body node of file
             if (node in tail tree whose key is less but is in same packing locality exists)
                 return blocknr of such node with largest key
             else
                 find node with largest key less than key of current node in stat_data tree
                 return its blocknr
         else
             return blocknr of node in body tree with largest key less than key of current node
     else if tail node
         if (node in body tree belonging to same file as first tail of current node exists)
             return its blocknr
         else if (node in tail tree with lesser delimiting key but same packing locality exists)
             return blocknr of such node with largest delimiting key
         else
             return blocknr of node with largest key less than key of current node in stat_data tree
     else /* is a stat_data tree node */
         if stat_data node with lesser key from same packing locality exists
             return blocknr of such node with largest key
         else
             /* no node from same packing locality with lesser key exists */
 }

 /* for use by packing locality read-ahead */
 get_logical_layout_right_neighbors_blocknr(key of current node)
 {
     right-handed version of the get_logical_layout_left_neighbors_blocknr logic
 }

It is my hope that this will improve the caching of pointers to unformatted nodes, plus improve the caching of directory entries and stat_data, by separating them from file bodies to a greater extent. I also hope that it will improve read performance for 1-10k files, and that it will allow us to do this without decreasing space efficiency.

=== Code Complexity ===

I thought it appropriate to mention some of the notable effects of simple design decisions on our implementation's code length. When we changed our balancing algorithms to shift parts of items rather than only whole items, so as to pack nodes tighter, this had an impact on code complexity. Another multiplicative determinant of balancing code complexity was the number of item types: introducing indirect items doubled it, and changing directory items from being liquid drops to being air drops also increased it.
Storing stat data in the first direct or indirect item of the file complicated the code for processing those items more than if I had made stat data its own item type. When one finds oneself with an NxN coding complexity issue, it usually indicates the need for adding a layer of abstraction. The NxN effect of the number of item types on balancing code complexity is an instance of that design principle, and we will address it in the next major rewrite. The balancing code will employ a set of item operations which all item types must support. The balancing code will then invoke those operations without caring to understand any more of the meaning of an item's type than that it determines which item-specific operation handler is called. Adding a new item type, say a compressed item, will then merely require writing a set of item operations for that item, rather than requiring modifying most parts of the balancing code as it does now. We now feel that the function that determines what resources are needed to perform a balancing operation, fix_nodes(), might as well be written to decide what operations will be performed during balancing, since it pretty much has to do so anyway. That way, the function that performs the balancing with the nodes locked, do_balance(), can be gutted of most of its complexity.

= Buffering & the Preserve List =

We implemented for version 0.2 of our file system a system of write ordering that tracked all shifting of items in the tree, and ensured that no node that had had an item shifted from it was written before the node that had received the item was written. This is necessary to prevent a system crash from causing the loss of an item that was not necessarily recently created. This tracking approach worked, and the overhead it imposed was not measurable in our benchmarks. When in the next version we changed to partially shifting items and increased the number of item types, this code grew out of control in its complexity.
I decided to replace it with a scheme that was far simpler to code and also more effective in typical usage patterns. The scheme is as follows: if an item is shifted from a node, change the block that its buffer will be written to. Change it to the nearest free block to the old block's left neighbor, and rather than freeing the old block, place its number on a "preserve list". (Saying nearest is slightly simplistic, in that the blocknr assignment function moves from the left neighbor in the direction of increasing block numbers.) When a "moment of consistency" is achieved, free all of the blocks on the preserve list. A moment of consistency occurs when there are no nodes in memory into which objects have been shifted (this could be made more precise, but then it would be more complex). If disk space runs out, force a moment of consistency to occur. This is sufficient to ensure that the file system is recoverable. Note that during the large file benchmarks the preserve list was freed several times in the middle of the benchmark. The percentage of buffers preserved is small in practice except during deletes, and one can arrange for moments of consistency to occur as frequently as one wants. Note that I make no claim that this approach is better than the Soft Updates approach employed by [Ganger] or by us in version 0.2; I merely note that tracking the order of writes is more complex than this approach for balanced trees which partially shift items. We may go back to the old approach some day, though not to the code that I threw out. Preserve lists substantially hamper performance for files in the 1-10k size range. We are re-evaluating them. Ext2fs avoids the metadata shifting problem by never shrinking directories and by using fixed inode space allocations.
= Lessons From Log Structured File Systems =

Many techniques from other file systems haven't been applied, primarily so as to satisfy my goal of giving reiserfs 1.0 only the minimum feature set necessary to be useful, and will appear in later releases. Log Structured File Systems [Rosenblum and Ousterhout] embody several such techniques, which I will describe after I mention two concerns with that approach:

* With small object file systems it is not feasible to cache in RAM a map of objectid to location for every object, since there are too many objects. This is an inherent problem in using temporal packing rather than semantic packing for small object file systems. With my approach my internal nodes are the equivalent of this objectid to location map, but total internal node size is proportional to the number of nodes rather than the number of objects. You can think of internal nodes as a compression of object location information made effective by the existence of an ordering function; this compression is both essential for small files and a major feature of my approach.

* I like obtaining good though not ideal semantic locality without paying a cleaning cost for active data. This is a less critical concern.

I frequently find myself classifying packing and layout optimizations as either appropriate for implementing dynamically or appropriate only for a cleaner. Optimizations whose computational overhead is large compared to their benefit tend to be appropriate for implementation in a cleaner, and a cleaner's benefits mostly impact the static portion of the file system (which typically consumes ~80% of the space). Such objectives as 100% packing efficiency, exactly ordering block layout by semantic order, using the full semantic tree rather than the parent directory in determining semantic order, and compression are all best implemented by cleaner approaches.
In summary, there is much to be learned from the LFS approach, and as I move past my initial objective of supplying a minimal-feature, higher-performance FS, I will apply some of those lessons. In the Preserve Lists section I speculate on the possibilities for a fastboot implementation that would merge the better features of preserve lists and logging.

= Directions For the Future =

To go one more order of magnitude smaller in file size will require adding functionality to the file system API, though it will not require discarding upward compatibility. The use of an exokernel is a better approach to small files if it is an option available to the OS designer; it is not currently an option for Linux users. In the future reiserfs will add such features as lightweight files in which stat_data other than size is inherited from a parent if it is not created individually for the file, an API for reading and writing to files without requiring the overhead of file handles and open(), set-theoretic semantics, and many other features that you would expect from researchers who expect to be able to do all that they could do in a database in the file system, and never really did understand why not.

= Conclusion =

Balanced tree file systems are inherently more space efficient than block allocation based file systems, with the differences reaching order of magnitude levels for small files. While other aspects of design will typically have a greater impact on performance for large files, in direct proportion to the smallness of the file the use of balanced trees offers performance advantages. A moderate advantage was found for large files. Coding cost is mostly in the interfaces, and it is a measure of the OS designer's skill whether those costs are low in the OS. We make it possible for an OS designer to use the same interface for large and small objects, and thereby reduce interface coding cost.
This approach is a new tool available to the OS designer for increasing the expressive power of all of the components in the OS through better name space integration. Researchers interested in collaborating or just using my work will find me friendly. I tailor the framework of my collaborations to the needs of those I work with. I GPL reiserfs so as to meet the needs of academic collaborators. While that makes it unusable without a special license for commercial OSes, commercial vendors will find me friendly in setting up a commercial framework for commercial collaboration with commercial needs provided for.

= Acknowledgments =

Hans Reiser was the project initiator, primary architect, supplier of funding, and one of the programmers. Some folks at times remark that naming the filesystem Reiserfs was egotistic. It was so named after a potential investor hired all of my employees away from me, then tried to negotiate better terms for his possible investment, and suggested that he could arrange for 100 researchers to swear in a Russian court that I had had nothing to do with this project. That business partnership did not work out. Vladimir Saveljev, while he did not author this paper, worked long hours writing the largest fraction of the lines of code in the file system, and is remarkably gifted at just making things work. Thanks, Vladimir. Anatoly Pinchuk wrote much of the core balancing code, and too much of the rest to list here. Thanks, Anatoly. It is the policy of the Naming System Venture that if someone quits before project completion, and then takes strong steps to try to prevent others from finishing the project, they shall not be mentioned in the acknowledgements. This was all quite sad, and best forgotten. I would like to thank Alfred Ajlamazyan for his generosity in providing overhead at a time when his institute had little it could easily spare.
Grigory Zaigralin is thanked for his work in making the machines run, administering the money, and being his usual determined-to-be-useful self. Igor Chudov, thanks for such effective procurement and hardware maintenance work. Eirik Fuller is thanked for his help with NFS and porting to 2.1. I would like to thank Remi Card for the superb block allocation based file system (ext2fs) that I depended on for so many years, and that allowed me to benchmark against the best. Linus Torvalds, thank you for Linux.

= Business Model and Licensing =

I personally favor performing a balance of commercial and public works in my life. I have no axe to grind against software that is charged for, and no regrets at making reiserfs freely available to Linux users. This project is GPL'd, but I sell exceptions to the GPL to commercial OS vendors and file server vendors. It is not usable to them without such exceptions, and many of them are wise enough to understand that:

* the porting and integration service we are able to provide with the licensing is by itself worth what we charge,
* these services impact their time to market,
* and the relationship spreads the development costs across more OS vendors than just them alone.

I expect that Linux will prove to be quite effective in market sampling my intended market, but if you suspect that I also like seeing more people use it even if it is free to them, oh well. I believe it is not so much the cost that has made Linux so successful as it is the openness. Linux is a decentralized economy with honor and recognition as the currency of payment (and thus there is much honor in it). Commercial OS vendors are, at the moment, all closed economies, and doomed to fall in their competition with open economies just as communism eventually fell.
At some point an OS vendor will realize that if it:

* opens up its source code to decentralized modification,
* systematically rewards those who perform the modifications that are proven useful,
* systematically merges/integrates those modifications into its branded primary release branch while adding value as the integrator,

then it will acquire both the critical mass of the internet development community, and the aggressive edge that no large communal group (such as a corporation) can have. Rather than saying to any such vendor that they should do this now, let me simply point out that whoever is first will have an enormous advantage.

Since I have more recognition than money to pass around as reward, my policy is to tend to require that those who contribute substantial software to this project have their names attached to a user visible portion of the project. This official policy helps me deal with folks like Vladimir, who was much too modest to have ever named the file system checker vsck without my insisting. Smaller contributions are to be noted in the source code and in the acknowledgements section of this paper. If you choose to contribute to this file system, and your work is accepted into the primary release, you should let me know if you want me to look for opportunities to integrate you into contracts from commercial vendors. Through packaging ourselves as a group, we are more marketable to such OS vendors. Many of us have spent too much time working at day jobs unrelated to our Linux work. This is too hard, and I hope to make things easier for us all. If you like this business model of selling GPL'd component software with related support services, but you write software not related to this file system, I encourage you to form a component supplier company also. Opportunities may arise for us to cooperate in our marketing, and I will be happy to do so.

= References =

G.M. Adel'son-Vel'skii and E.M.
Landis, An algorithm for the organization of information, Soviet Math. Doklady 3, 1259-1262, 1962. This paper on AVL trees can be thought of as the founding paper of the field of storing data in trees. Those not conversant in Russian will want to read the [Lewis and Denenberg] treatment of AVL trees in its place. [Wood] contains a modern treatment of trees.

[Apple] Inside Macintosh, Files, by Apple Computer Inc., Addison-Wesley, 1992. Employs balanced trees for filenames; it was an interesting file system architecture for its time in a number of ways, but its problems with internal fragmentation have become more severe as disk drives have grown larger, and the code has not received sufficient further development.

[Bach] Maurice J. Bach, "The Design of the Unix Operating System", 1986, Prentice-Hall Software Series, Englewood Cliffs, NJ. Superbly written but sadly dated; contains detailed descriptions of the file system routines and interfaces in a manner especially useful for those trying to implement a Unix compatible file system. See [Vahalia].

[BLOB] R. Haskin, Raymond A. Lorie: On Extending the Functions of a Relational Database System. SIGMOD Conference 1982: 207-212 (body of paper not on the web). See the Drops section for a discussion of how this approach makes the tree less balanced, and the effect that has on performance.

[Chen] Chen, P.M., Patterson, David A., A New Approach to I/O Performance Evaluation---Self-Scaling I/O Benchmarks, Predicted I/O Performance, 1993 ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems; also available on Chen's web page.

[C-FFS] Ganger, Gregory R., Kaashoek, M. Frans (page with link to postscript paper). A very well written paper focused on 1-10k file size issues; they use some similar notions (most especially their concept of grouping compared to my packing localities). Note that they focus on the 1-10k file size range, and not the sub-1k range. The 1-10k range is the weak point in reiserfs performance.
[ext2fs] by Remi Card; extensive information and source code are available. When you consider how small this file system is (~6000 lines), its effectiveness becomes all the more remarkable.

[FFS] M.K. McKusick, W.N. Joy, S.J. Leffler, and R.S. Fabry. A fast file system for UNIX. ACM Transactions on Computer Systems, 2(3):181-197, August 1984. Describes the implementation of a file system which employs parent directory location knowledge in determining file layout. It uses large blocks for all but the tail of files to improve I/O performance, and uses small blocks called fragments for the tails so as to reduce the cost due to internal fragmentation. Numerous other improvements are also made to what was once the state-of-the-art. FFS remains the architectural foundation for many current block allocation file systems, and was later bundled with the standard Unix releases. Note that unrequested serialization and the use of fragments place it at a performance disadvantage to ext2fs, though whether ext2fs is thereby made less reliable is a matter of dispute that I take no position on (reiserfs uses preserve lists; forgive my egotism in thinking that it is enough work for me to ensure that reiserfs solves the recovery problem, and to perhaps suggest that ext2fs would benefit from the use of preserve lists when shrinking directories).

[Ganger] Gregory R. Ganger, Yale N. Patt, "Metadata Update Performance in File Systems" (abstract only).

[Gifford] (postscript only). Describes a file system enriched to have more than hierarchical semantics; he shares many goals with this author, forgive me for thinking his work worthwhile. If I had to suggest one improvement in a sentence, I would say his semantic algebra needs closure.

[Hitz] Dave Hitz, http://www.netapp.com/technology/level3/3002.html. A rather well designed file system optimized for NFS and RAID in combination. Note that RAID increases the merits of write-optimization in block layout algorithms.
[Holton and Das] Holton, Mike, Das, Raj: ``The XFS space manager and namespace manager use sophisticated B-Tree indexing technology to represent file location information contained inside directory files and to represent the structure of the files themselves (location of information in a file).'' Note that it is still a block (extent) allocation based file system; no attempt is made to store the actual file contents in the tree. It is targeted at the needs of the other end of the file size usage spectrum from reiserfs, and is an excellent design for that purpose (and I would concede that reiserfs 1.0 is not suitable for their real-time large I/O market). SGI has also traditionally been a leader in resisting the use of unrequested serialization of I/O. Unfortunately, the paper is a bit vague on details, and source code is not freely available. [Howard] ``Scale and Performance in a Distributed File System'', Howard, J.H., Kazar, M.L., Menees, S.G., Nichols, D.A., Satyanarayanan, M., Sidebotham, R.N., West, M.J., ACM Transactions on Computer Systems, 6(1), February 1988. A classic benchmark; it was too CPU bound for both ext2fs and reiserfs. [Knuth] Knuth, D.E., The Art of Computer Programming, Vol. 3 (Sorting and Searching), Addison-Wesley, Reading, MA, 1973, the earliest reference discussing trees storing records of varying length. [LADDIS] Wittle, Mark, and Bruce, Keith, ``LADDIS: The Next Generation in NFS File Server Benchmarking'', Proceedings of the Summer 1993 USENIX Conference, July 1993, pp. 111-128. [Lewis and Denenberg] Lewis, Harry R., Denenberg, Larry, ``Data Structures & Their Algorithms'', HarperCollins Publishers, NY, NY, 1991, an algorithms textbook suitable for readers wishing to learn about balanced trees and their AVL predecessors. [McCreight] McCreight, E.M., Pagination of B*-trees with variable length records, Commun. ACM 20 (9), 670-674, 1977, describes algorithms for trees with variable length records.
[McVoy and Kleiman], the implementation of write-clustering for Sun's UFS. [OLE] ``Inside OLE'' by Kraig Brockshmidt, discusses Structured Storage, http://www.microsoft.com/mspress/books/abs/5-843-2b.htm (abstract only). [Ousterhout] J.K. Ousterhout, H. Da Costa, D. Harrison, J.A. Kunze, M.D. Kupfer, and J.G. Thompson. A trace-driven analysis of the UNIX 4.2BSD file system. In Proceedings of the 10th Symposium on Operating Systems Principles, pages 15--24, Orcas Island, WA, December 1985. [NTFS] ``Inside the Windows NT File System''; the book is written by Helen Custer, while NTFS was architected by Tom Miller with contributions by Gary Kimura, Brian Andrew, and David Goebel. Microsoft Press, 1994. An easy-to-read little book. They fundamentally disagree with me on adding serialization of I/O not requested by the application programmer, and I note that the performance penalty they pay for their decision is high, especially compared with ext2fs. Their FS design is perhaps optimal for floppies and other hardware-eject media beyond OS control. A less serialized, higher performance log-structured architecture is described in [Rosenblum and Ousterhout]. That said, Microsoft is to be commended for recognizing the importance of attempting to optimize for small files, and leading the OS designer effort to integrate small objects into the file name space. This book is notable for not referencing the work of persons not working for Microsoft, or providing any form of proper attribution to previous authors such as [Rosenblum and Ousterhout]. [Peacock] K. Peacock, ``The CounterPoint Fast File System'', Proceedings of the Usenix Conference, Winter 1988. [Pike] Rob Pike and Peter Weinberger, The Hideous Name, USENIX Summer 1985 Conference Proceedings, pp. 563, Portland, Oregon, 1985. Short, informal, and drives home why inconsistent naming schemes in an OS are detrimental.
http://achille.cs.bell-labs.com/cm/cs/doc/85/1-05.ps.gz His discussion of naming in Plan 9: http://plan9.bell-labs.com/plan9/doc/names.html [Rosenblum and Ousterhout] ``The Design and Implementation of a Log-Structured File System'', Mendel Rosenblum and John K. Ousterhout, February 1992, ACM Transactions on Computer Systems. This paper was quite influential in a number of ways on many modern file systems, and the notion of using a cleaner may be applied to a future release of reiserfs. There is an interesting on-going debate over the relative merits of FFS vs. LFS architectures, and the interested reader may peruse http://www.scriptics.com/people/john.ousterhout/seltzer93.html and the arguments by Margo Seltzer it links to. [Snyder] ``tmpfs: A Virtual Memory File System'' discusses a file system built to use swap space and intended for temporary files; due to a complete lack of disk synchronization it offers extremely high performance. [Vahalia] Uresh Vahalia, ``Unix Kernel Internals'' [[category:ReiserFS]] [[category:Formatting-fixes-needed]] Three reasons why ReiserFS is great for you Last Update: 2002 Hans Reiser Three reasons why ReiserFS is great for you: # ReiserFS has fast journaling, which means that you don't spend your life waiting for fsck every time your laptop battery dies, or the UPS for your mission critical server gets its batteries disconnected accidentally by the UPS company's service crew, or your kernel was not as ready for prime time as you hoped, or the silly thing decides you mounted it too many times today. # ReiserFS is based on fast balanced trees. Balanced trees are more robust in their performance, and are a more sophisticated algorithmic foundation for a file system. When we started our project, there was a consensus in the industry that balanced trees were too slow for file system usage patterns.
We proved that if you just do them right they are better--take a look at the benchmarks. We have fewer worst case performance scenarios than other file systems and generally better overall performance. If you put 100,000 files in one directory, we think it's fine; many other file systems try to tell you that you are wrong to want to do it. # ReiserFS is more space efficient. If you write 100 byte files, we pack many of them into one block. Other file systems put each of them into their own block. We don't have fixed space allocation for inodes. That saves 6% of your disk. Ok, it's time to fess up. The interesting stuff is still in the future. Because they are nifty, we are going to add database and hypertext like features into the file system. Only by using balanced trees, with their effective handling of small files (database small fields, hypertext keywords), as our technical foundation can we hope to do this. That was our real motivation. As for performance, we may already be slightly better than the traditional file systems (and substantially better than the journaling ones). But they have been tweaking for decades, while we have just got started. This means that over the next few years we are going to improve faster than they are. Speaking more technically: ReiserFS is a file system using a plug-in based object oriented variant on classical balanced tree algorithms. The results when compared to the ext2fs conventional block allocation based file system, running under the same operating system and employing the same buffering code, suggest that these algorithms are overall more efficient and every passing month are becoming yet more so. Loosely speaking, every month we find another performance cranny that needs work; we fix it. And every month we find some way of improving our overall general usage performance.
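The space-efficiency claim above (packing many 100-byte files into shared blocks rather than giving each file its own block) can be made concrete with a little arithmetic. A minimal sketch; the 4k block size and file count are assumptions for illustration, not measurements:

```python
BLOCK_SIZE = 4096   # assumed block size for illustration
FILE_SIZE = 100     # the 100-byte files from the example above
N_FILES = 1000      # hypothetical file count

# Block-aligned layout: every file occupies its own block.
aligned = N_FILES * BLOCK_SIZE

# Packed layout: file bodies share blocks, rounded up to whole blocks.
packed_blocks = -(-N_FILES * FILE_SIZE // BLOCK_SIZE)  # ceiling division
packed = packed_blocks * BLOCK_SIZE

# Packing wins by roughly a factor of BLOCK_SIZE / FILE_SIZE,
# before accounting for per-item metadata.
```

On these assumed numbers the block-aligned layout needs 4,096,000 bytes while the packed layout needs 102,400, a factor of 40; real packing overhead (item headers, balancing slack) narrows but does not erase the gap.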
The improvement in small file space and time performance suggests that we may now revisit a common OS design assumption that one should aggregate small objects using layers above the file system layer. Being more effective at small files does not make us less effective for other files. This is truly a general purpose FS. Our overall traditional FS usage performance is high enough to establish that. ReiserFS has a commitment to opening up the FS design to contributions; we are now adding plug-ins so that you can create your own types of directories and files. = Introduction = The author is one of many OS researchers who are attempting to unify the name spaces in the operating system in varying ways [e.g. Pike, ``The Use of Name Spaces in Plan 9'']. None of us are well funded compared with the size of the task, and I am far from an exception to this rule. The natural consequence is that we each have attacked one small aspect of the task. My contribution is in incorporating small objects into the file system name space effectively. This implementation offers value to the average Linux user, in that it offers generally good performance compared to the current Linux file system known as ext2fs. It also saves space to an extent that is important for some applications, and convenient for most. It does extremely well for large directories, and has a variety of minor advantages. Since ext2fs is very similar to FFS and UFS in performance, the implementation also offers potential value to commercial OS vendors who desire greater than ext2fs performance without directory size issues, and who appreciate the value of a better foundation for integrating name spaces throughout the OS. = Why Is There A Move Among Some OS Designers Towards Unifying Name Spaces? = An operating system is composed of components that access other components through interfaces.
Operating systems are complex enough that, like national economies, the architect cannot centrally plan the interactions of the components that they are composed of. The architect can provide a structural framework that has a marked impact on the efficiency and utility of those interactions. Economists have developed principles that govern large economic systems. Are there system principles that we might use to start a discussion of the ways increasing component interactivity via naming system design impacts the total utility of an operating system? I propose these: * If one increases the number of other components that a particular component can interact with, one increases its expressive power and thereby its utility. * One can increase the number of other components that a particular component can interact with either by increasing the number of interfaces it has, or by increasing the number of components that are accessible by its current interfaces. * The cost of component interfaces dominates software design cost, much as the cost of wires dominates circuit design cost. * Total system utility tends to be proportional not to the number of components, but to the number of possible component interactions. It is not simply the number of components that one has that determines an OS's expressive power, it is the number of opportunities to use them that determines it. The number of these opportunities is proportional to the number of possible combinations of components, and the number of possible combinations is determined by their connectedness. Component connectedness in OS design is determined by name space design, to much the same extent that buses determine it in circuit design. Allow me to illustrate the impact of these principles with the use of an imaginary example. Suppose two imaginary OS vendors with equally talented programmers hire two different OS architects.
Suppose one of the architects centers the OS design around a single name space design that allows all of the components to access all other components via a single interface (assume this is possible, it is a theoretical example). Suppose the other allows the ten different design groups in the company that are developing components to create their own ten name spaces. Suppose that the unified name space OS architect has half of the resources of the fragmented name space OS architect and creates half as many components. While the number of components is half as large, the number of connections is (1/2)^2 / ((1/10)^2 * 10) = 2.5 times larger. If you accept my hypothesis that utility is more proportional to connections than components, then the unified operating system with half the development cost will still offer more expressive utility. That is a powerful motivation. To return briefly to the long ago researched principles governing another member of the class of large systems, the economies of nations, it is perhaps interesting to note that Adam Smith in ``The Wealth of Nations'' engaged in substantial study of the link between the extent of interconnectedness and the development of civilization, where the extent of interconnectedness was determined by waterways, etc. The link he found for economic systems was no less crucial than what is being suggested here for the effect of component interconnectedness on the total utility of software systems. I suggest that I am merely generalizing a long established principle from another field of science, namely that total utility in large systems with components that interact to generate utility is determined by the extent of their interconnection. There are many exceptions to these principles: not all chips on a motherboard sit on the bus, and analogous considerations apply to both OS design and the economies of nations.
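The arithmetic behind that ratio can be checked directly. A sketch assuming, as the argument does, that utility scales with the square of the number of mutually reachable components; the component count is a hypothetical:

```python
n = 100  # hypothetical component count for the fragmented-design vendor

# One unified name space over half as many components:
unified = (n // 2) ** 2

# Ten isolated name spaces of n/10 components each:
fragmented = 10 * (n // 10) ** 2

# (1/2)^2 / ((1/10)^2 * 10) = 2.5: fewer components, more connections.
ratio = unified / fragmented
```

The ratio is independent of n; halving the components while unifying the name space still leaves 2.5 times as many possible interactions as ten fragmented name spaces.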
I hope the reader will accept that space considerations make it appropriate to gloss over these, and will consider the central point that under some circumstances unifying name spaces in a design can dramatically improve the utility of an OS. That can be an enormous motivation, and it has moved a number of OS researchers in their work [e.g. Pike, ``The Use of Name Spaces in Plan 9'' and ``The Hideous Name'', http://magnum.cooper.edu:9000/~rp/html/rob.html]. Unfortunately, it is not a small technical effort to combine name spaces. To combine 10 name spaces requires, if not the effort to create 10 name spaces, perhaps an effort equivalent to creating 5 of the name spaces. Usually each of the name spaces has particular performance and semantic power requirements that require enhancing the unified name space, and it usually requires technical innovation to combine the advantages of each of the separated name spaces into a unified name space. I would characterize none of the research groups currently approaching this unification problem as having funding equivalent to what went into creating 5 of the name spaces they would like to unify, and we are certainly no exception. For this reason I have picked one particular aspect of this larger problem for our focus: allowing small objects to effectively share the same file system interface that large objects use currently. As operating systems increase the number of their components, the higher development cost of a file system able to handle small files becomes more worth the multiplicative effect it has on OS utility, as well as its reduction of OS component interface cost. = Should File Boundaries Be Block Aligned? =
Making file boundaries block aligned has a number of effects: it minimizes the number of blocks a file is spread across (which is especially beneficial for multiple block files when locality of reference across files is poor), it wastes disk and buffer space in storing every less than fully packed block, it wastes I/O bandwidth with every access to a less than fully packed block when locality of reference is present, it increases the average number of block fetches required to access every file in a directory, and it results in simpler code. The simpler code of block aligning file systems follows from not needing to create a layering to distinguish the units of the disk controller and buffering algorithms from the units of space allocation, and from not needing to optimize the packing of nodes as is done in balanced tree algorithms. For readers who have not been involved in balanced tree implementations, algorithms of this class are notorious for being much more work to implement than one would expect from their description. Sadly, they also appear to offer the highest performance solution for small files, once I remove certain simplifications from their implementation and add certain optimizations common to file system designs. I regret that code complexity (30k lines) is a major disadvantage of the approach compared to the 6k lines of the ext2fs approach. I started our analysis of the problem with an assumption that I needed to aggregate small files in some way, and that the question was, which solution was optimal? The simplest solution was to aggregate all small files in a directory together into either a file or the directory. But any aggregation into a file or directory wastes part of the last block in the aggregation. What does one do if there are only a few small files in a directory, aggregate them into the parent of the directory?
What if there are only a few small files in a directory at first, and then there are many small files? How do I decide what level to aggregate them at, and when to take them back from a parent of a directory and store them directly in the directory? As we did our analysis of these questions we realized that this problem was closely related to the balancing of nodes in a balanced tree. The balanced tree approach, by using an ordering of files which are then dynamically aggregated into nodes at a lower level, rather than a static aggregation or grouping, avoids this set of questions. In my approach I store both files and filenames in a balanced tree, with small files, directory entries, inodes, and the tail ends of large files all being more efficiently packed as a result of relaxing the requirements of block alignment, and eliminating the use of a fixed space allocation for inodes. I have a sophisticated and flexible means for arranging for the aggregation of files for maximal locality of reference, through defining the ordering of items in the tree. The body of large files is stored in unformatted nodes that are attached to the tree but isolated from the effects of possible shifting by the balancing algorithms. Approaches such as [Apple] and [Holton and Das] have stored filenames but not files in balanced trees. None of the file systems C-FFS, NTFS, or XFS aggregates files; all of them block align files, though all of those also do some variation on storing small files in the statically allocated block address fields of inodes if they are small enough to fit there. [C-FFS] has published an excellent discussion of both their approach and why small files rob a conventional file system of performance more in proportion to the number of small files than the number of bytes consumed by small files. However, I must note that their notion of what constitutes small is different from ours by one or two orders of magnitude.
Their use of an exo-kernel is simply an excellent approach for operating systems that have that as an available option. Semantics (files), packing (blocks/nodes), caching (read-ahead sizes, etc.), and the hardware interfaces of disk (sectors) and paging (pages) all have different granularity issues associated with them: a central point of our approach is that the optimal granularity of these often differs, and abstracting these into separate layers in which the granularity of one layer does not unintentionally impact other layers can improve space/time performance. Reiserfs innovates in that its semantic layer often conveys to the other layers an ungranulated ordering rather than one granulated by file boundaries. The reader is encouraged to note the areas in which reiserfs needs to go farther in doing so while reading the algorithms. = Balanced Trees and Large File I/O = There has long been an odd informal consensus that balanced trees are too slow for use in storing large files, perhaps originating in the performance of databases that have attempted to emulate file systems using balanced tree algorithms that were not originally architected for file system access patterns or their looser serialization requirements. It is hopefully easy for the reader to understand that storing many small files and tail ends of files in a single node where they can all be fetched in one I/O leads directly to higher performance. Unfortunately, it is quite complex to understand the interplay between I/O efficiency and block size for larger files, and space does not allow a systematic review of traditional approaches.
The reader is referred to [FFS], [Peacock], [McVoy], [Holton and Das], [Bach], [OLE], and [NTFS] for treatments of the topic, and discussions of various means of 1) reducing the effect of block size on CPU efficiency, 2) eliminating the need for inserting rotational delay between successive blocks, 3) placing small files into either inodes or directories, and 4) performing read-ahead. More commentary on these is in the annotated bibliography. Reiserfs has the following architectural weaknesses that stem directly from the overhead of repacking to save space and increase block size: 1) when the tail (files < 4k are all tail) of a file grows large enough to occupy an entire node by itself it is removed from the formatted node(s) it resides in, and it is converted into an unformatted node ([FFS] pays a similar conversion cost for fragments), 2) a tail that is smaller than one node may be spread across two nodes which requires more I/O to read if locality of reference is poor, 3) aggregating multiple tails into one node introduces separation of file body from tail, which reduces read performance ([FFS] has a similar problem, and for reiserfs files near the node in size the effect can be significant), 4) when you add one byte to a file or tail that is not the last item in a formatted node, then on average half of the whole node is shifted in memory. If any of your applications perform I/O in such a way that they generate many small unbuffered writes, reiserfs will make you pay a higher price for not being able to buffer the I/O. Most applications that create substantial file system load employ effective I/O buffering, often simply as a result of using the I/O functions in the standard C libraries. By avoiding accesses in small blocks/extents reiserfs improves I/O efficiency. 
Extent based file systems such as VxFS, and write-clustering systems such as ext2fs, are not so effective in applying these techniques that they choose to use 512-byte blocks rather than 1k blocks as their defaults. Ext2fs reports a 20% speedup when 4k rather than 1k blocks are used, but the authors of ext2fs advise the use of 1k blocks to avoid wasting space. There are a number of worthwhile large file optimizations that have not been added to either ext2fs or reiserfs, and both file systems are somewhat primitive in this regard, reiserfs being the more primitive of the two. Large files simply were not my research focus, and it being a small research project I did not implement the many well known techniques for enhancing large file I/O. The buffering algorithms are probably more crucial than any other component in large file I/O, and partly out of a desire for a fair comparison of the approaches I have not modified these. I have added no significant optimizations for large files, beyond increasing the block size, that are not found in ext2fs. Except for the size of the blocks, there is not a large inherent difference between: 1) the cost of adding a pointer to an unformatted node to my tree plus writing the node, and 2) adding an address field to an inode plus writing the block. It is likely that except for block size the primary determinants of high performance large file access are orthogonal to the decision of whether to use balanced tree algorithms for small and medium sized files. For large files we get some advantage from not having our tree being more balanced than the tree formed by an inode which points to a triple indirect block. We haven't an easy method for measuring the performance gain from that though. There is performance overhead due to the memory bandwidth cost of balancing nodes for small files. We think it is worth it though. 
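The memory bandwidth cost of balancing mentioned above is easy to picture: a formatted node is a packed run of bytes, so inserting into it anywhere but the end shifts everything after the insertion point. A toy model (not the actual reiserfs node layout):

```python
def insert_into_node(node: bytearray, offset: int, data: bytes) -> int:
    """Insert data at offset; return how many bytes had to be shifted."""
    shifted = len(node) - offset   # everything after the offset moves
    node[offset:offset] = data     # the slice assignment does the memmove
    return shifted

# A 12-byte "node" holding three packed items:
node = bytearray(b"AAAABBBBCCCC")
moved = insert_into_node(node, 4, b"x")
# A one-byte insert at a uniformly random offset moves len(node)/2
# bytes on average, which is the cost the text describes.
```

Here the single-byte insert at offset 4 moves the 8 trailing bytes; averaged over all offsets, half the node is shifted per small insert.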
= Serialization and Consistency = The issues of ensuring recoverability with minimal serialization and data displacement necessarily dominate high performance design. Let's define the two extremes in serialization so that the reason for this can be clear. Consider the relative speed of a set of I/O's in which every block request in the set is fed to the elevator algorithms of the kernel and the disk drive firmware fully serially, each request awaiting the completion of the previous request. Now consider the other extreme, in which all block requests are fed to the elevator algorithms all together, so that they may all be sorted and performed in close to their sorted order (disk drive firmwares don't use a pure elevator algorithm). The unserialized extreme may be more than an order of magnitude faster, due to the cost of rotations and seeks. Unnecessarily serializing I/O prevents the elevator algorithm from doing its job of placing all of the I/O's in their layout sequence rather than chronological sequence. Most of high performance design centers around making I/O's in the order they are laid out on disk, and laying out blocks on disk in the order that the I/O's will want to be issued. [Snyder] discusses a file system that obtains high performance from a complete lack of disk synchronization, but is only suitable for temporary files that don't need to survive reboot. I think its known value to Solaris users indicates that the optimal buffering policy varies from file to file. [Ganger] discusses methods for using ordering of writes rather than serialization for ensuring conventional file system meta-data integrity; [McVoy] previously suggested but did not implement ordering of buffer writes. Ext2fs is fast in substantial part due to avoiding synchronous writes of metadata, and I have much personal experience with it that leads me to prefer compiles that are fast.
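The two serialization extremes above can be compared with a toy seek-distance model: treat block numbers as positions on a one-dimensional disk and sum head movements. A sketch that ignores rotational delay; the request stream is hypothetical:

```python
def total_seek_distance(requests, start=0):
    """Sum of head movements when requests are served in the given order."""
    pos, dist = start, 0
    for block in requests:
        dist += abs(block - pos)
        pos = block
    return dist

# Hypothetical block requests in chronological (fully serialized) order:
stream = [70, 10, 65, 5, 60]

serialized = total_seek_distance(stream)        # one at a time, as issued
elevator = total_seek_distance(sorted(stream))  # batched, then sorted
# The sorted pass sweeps each disk region once, so elevator << serialized.
```

On this stream the serialized order travels 300 units while the elevator order travels 70, illustrating why withholding requests from the elevator algorithm is so costly.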
[ I would like to see it adopt a policy that all dirty buffers for files not flagged as temporary are queued for writing, and that the existence of a dirty buffer means that the disk is busy. This will require replacing buffer I/O locking with copy-on-write, but an idle disk is such a terrible thing to waste. :-) ] [NTFS] by default adds unnecessary serialization to an extent that even older file systems such as [FFS] do not, and its performance characteristics reflect that. In fairness, it should be said that it is the superior approach for most removable media without software control of ejection (e.g. IBM PC floppies). Reiserfs employs a new scheme called preserve lists for ensuring recoverability, which avoids overwriting old meta-data by writing the new meta-data nearby rather than over the old. = Why Aggregate Small Objects at the File System Level? = There has long been a tradition of file system developers deciding that effective handling of small files is not significant to performance, and of application programmers caring enough about performance to not store small files as separate entities in the file system. To store small objects one may either make the file system efficient for the task, or sidestep the problem by aggregating small objects in a layer above the file system. Sidestepping the problem has three disadvantages: utility, code complexity, and performance. Utility and Code Complexity: Allowing OS designers to effectively use a single namespace with a single interface for both large and small objects decreases coding cost and increases the expressive power of components throughout the OS. I feel reiserfs shows the effects of a larger development investment focused on a simpler interface when compared with many solutions for this currently available in the object oriented toolkit community, such as the Structured Storage available in Microsoft's [OLE].
By simpler I mean I added nothing to the file system API to distinguish large and small objects, and I leave it to the directory semantics and archiving programs to aggregate objects. Multiple layers cost more to implement, cost more to code the interfaces for utilizing, and provide less flexibility. Performance: It is most commonly the case that when one layers one file system on top of another the performance is substantially reduced, and Structured Storage is not an exception to this general rule. Reiserfs, which does not attempt to delegate the small object problem to a layer above, avoids this performance loss. I have heard it suggested by some that this layering avoids the performance loss from syncing on file close as many file systems do. I suggest that this is adding an error to an error rather than fixing it. Let me make clear that I believe those who write such layers above the file system do not do so out of stupidity. I know of at least one company at which a solution that layers small object storage above the file system exists because the file system developers refused to listen to the non-file system group's description of its needs, and the file system group had to be sidestepped in generating the solution. Current file systems are fairly well designed for the purposes that their users currently use them for: my goal is to change file size usage patterns. The author remembers arguments that once showed clearly that there was no substantial market need for disk drives larger than 10MB based on current usage statistics. While [C-FFS] points out that 80% of file accesses are to files below 10k, I do not believe it reasonable to attempt to provide statistics based on usage measurements of file systems for which small files are inappropriate to use that will show that small files are frequently used. Application programmers are smarter than that. 
Currently 80% of file accesses are to the first order of magnitude in file size for which it is currently sensible to store the object in the file system. I regret that one can only speculate as to whether, once file systems become effective for small files and database tasks, usage patterns will change to 80% of file accesses being to files less than 100 bytes. What I can do is show via the 80/20 Banded File Set Benchmark presented later that in such circumstances small file performance potentially dominates total system performance. In summary, the on-going reinvention of incompatible object aggregation techniques above the file system layer is expensive, less expressive, less integrated, slower, and less efficient in its storage than incorporating balanced tree algorithms into the file system. = Tree Definitions = Balanced trees are used in databases, and more generally, wherever a programmer needs to search and store to non-random memory by a key, and has the time to code it this way. The usual evolution for programmers is to first think that hashing will be simpler and more efficient, and then realize only after getting into the sordid details of it that the combination of space efficiency, minimizing disk accesses, and the feasibility of caching the top part of the tree, makes the tree approach more effective. It is the usual thing to first try to do hashing, and then by the time the details are worked out, to have a balanced tree. The cost of effectively handling bucket overflow just isn't less than the cost of balancing, unless the buckets are always all in RAM. Hashing is often a good solution when there is no non-random memory involved, such as when hashing a cache. The Linux dcache code uses hashing for accessing a cache of in-memory directory entries. Sometimes one uses partial or full hashing of keys within that balanced tree.
If you do full hashing within a tree, and you cache the top part of that tree, you have something rather similar to extensible hashing, except it is more flexible and efficient. Sometimes programmers code using unbalanced trees. Most filesystems do essentially that. Balanced trees generally do a better job of minimizing the average number of disk accesses. There is literature establishing that balanced trees are optimal for the worst case when there is no caching of the tree. This is rather pointless literature, as the average case when cached is what is important, and I am afraid that the existing literature proves that which is feasible to prove rather than that which is relevant. That said, practitioners know from experience that making the tree less balanced leads to more I/Os. Discussions of the exceptions to this are rather interesting but not for here.... I regret that I must assume that the reader is familiar with basic balanced tree algorithms [Wood], [Lewis and Denenberg], [Knuth], [McCreight]. No attempt will be made to survey tree design here since balanced trees are one of the most researched and complex topics in algorithm theory, and require treatment at length. I must compound this discourtesy with a concise set of definitions that sorely lack accompanying diagrams, my apologies. Finally, I'll truly annoy the reader by saying that the header files contain nice ascii art, and if you want full definition of the structures, the source is the place. Classically, balanced trees are designed with the set of keys assumed to be defined by the application, and the purpose of the tree design is to optimize searching through those keys. In my approach the purpose of the tree is to optimize the reference locality and space efficient packing of objects, and the keys are defined as best optimizes the algorithm for that. 
Keys are used in place of inode numbers in the file system, thereby choosing to substitute a mapping of keys to node location (the internal nodes) for a mapping of inode number to file location. Keys are longer than inode numbers, but one needs to cache fewer of them than one would need to cache inode numbers when more than one file is stored in a node. In my tree, I still require that a filename be resolved one component at a time. It is an interesting topic for future research whether this is necessary or optimal. This is a more complex issue than a casual reader might realize: directory-at-a-time lookup accomplishes a form of compression, makes mounting other name spaces and file system extensions simpler, makes security simpler, and makes future enhanced semantics simpler. Since small files typically lead to large directories, it is fortuitous that, as a natural consequence of our use of tree algorithms, our directory mechanisms are much more effective for very large directories than those of most other file systems (notable exceptions include [Holton and Das]). The tree has three node types: internal nodes, formatted nodes, and unformatted nodes. The contents of internal and formatted nodes are sorted in the order of their keys. (Unformatted nodes contain no keys.) Internal nodes consist of pointers to sub-trees separated by their delimiting keys. The key that precedes a pointer to a sub-tree is a duplicate of the first key in the first formatted node of that sub-tree. Internal nodes exist solely to allow determining which formatted node contains the item corresponding to a key. ReiserFS starts at the root node, examines its contents, and based on them determines which subtree contains the item corresponding to the desired key. From the root node reiserfs descends into the tree, branching at each node, until it reaches the formatted node containing the desired item.
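The descent can be sketched as a search over a node's sorted delimiting keys. Types and keys here are simplified to plain integers; these are not the actual reiserfs structures:

```c
#include <stddef.h>

/* A simplified internal node: nr_keys sorted delimiting keys and
 * nr_keys + 1 child block numbers. */
struct internal_node {
    size_t nr_keys;
    const unsigned long *keys;      /* sorted delimiting keys */
    const unsigned long *children;  /* nr_keys + 1 child block numbers */
};

/* Return the child block number to follow for `key`. Because the key
 * preceding a child pointer duplicates the first key of that subtree,
 * we follow the child after the last delimiting key <= key (binary
 * search). */
unsigned long child_for_key(const struct internal_node *node, unsigned long key)
{
    size_t lo = 0, hi = node->nr_keys;
    while (lo < hi) {
        size_t mid = (lo + hi) / 2;
        if (node->keys[mid] <= key)
            lo = mid + 1;
        else
            hi = mid;
    }
    return node->children[lo];
}
```

Repeating this at each level, from the root down, reaches the one formatted node that can contain the item with the desired key.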
The first (bottom) level of the tree consists of unformatted nodes, the second level consists of formatted nodes, and all levels above consist of internal nodes. The highest level contains the root node. The number of levels is increased as needed by adding a new root node at the top of the tree. All paths from the root of the tree to all formatted leaves are equal in length, and all paths to all unformatted leaves are also equal in length and 1 node longer than the paths to the formatted leaves. This equality in path length, and the high fanout it provides is vital to high performance, and in the Drops section I will describe how the lengthening of the path length that occurred as a result of introducing the [BLOB] approach (the use of indirect items and unformatted nodes) proved a measurable mistake. Formatted nodes consist of items. Items have four types: direct items, indirect items, directory items, and stat data items. All items contain a key which is unique to the item. This key is used to sort, and find, the item. Direct items contain the tails of files, and tails are the last part of the file (the last file_size modulo FS block size of a file). Indirect items consist of pointers to unformatted nodes. All but the tail of the file is contained in its unformatted nodes. Directory items contain the key of the first directory entry in the item followed by a number of directory entries. Depending on the configuration of reiserfs, stat data may be stored as a separate item, or it may be embedded in a directory entry. We are still benchmarking to determine which way is best. A file consists of a set of indirect items followed by a set of up to two direct items, with the existence of two direct items representing the case when a tail is split across two nodes. 
If a tail is larger than the maximum size of a file that can fit into a formatted node but is smaller than the unformatted node size (4k), then it is stored in an unformatted node, and a pointer to it plus a count of the space used is stored in an indirect item. Directories consist of a set of directory items. Directory items consist of a set of directory entries. Directory entries contain the filename and the key of the file which is named. There is never more than one item of the same item type from the same object stored in a single node (there is no reason one would want to use two separate items rather than combining). The first item of a file or directory contains its stat data. When performing balancing, and analyzing the packing of the node and its two neighbors, we ensure that the three nodes cannot be compressed into two nodes. I feel greater compression than this is best left to an FS cleaner to perform rather than attempting it dynamically.

== ReiserFS Structures ==

The ReiserFS tree has Max_Height = N (current default value: N = 5). The tree lies in disk blocks, and each disk block that belongs to the reiserfs tree has a block head.

An internal node of the tree is the place for keys and pointers to disk blocks:

 Block_Head | Key 0 | Key 1 | Key 2 | --- | Key N | Pointer 0 | Pointer 1 | Pointer 2 | --- | Pointer N | Pointer N+1 | ..free space..

A leaf node of the tree is the place for items and item headers:

 Block_Head | IHead 0 | IHead 1 | IHead 2 | --- | IHead N | .....free space..... | Item N | --- | Item 2 | Item 1 | Item 0

An unformatted node of the tree contains only the data of a big file.

ReiserFS objects are files and directories. The maximum number of objects is 2^32 - 4 = 4,294,967,292. Each object is a set of items.

File items:
# StatData item + [Direct item] (small files: size from 0 bytes to MAX_DIRECT_ITEM_LEN = blocksize - 112 bytes)
# StatData item + InDirect item + [Direct item] (big files: size > MAX_DIRECT_ITEM_LEN bytes)

Directory items:
# StatData item + Directory item

Every reiserfs object has an object ID and a key.

=== Internal Node Structures ===

 struct block_head
   blk_level            unsigned short   2   level of the block in the tree (1 = leaf; 2, 3, 4, ... = internal)
   blk_nr_item          unsigned short   2   number of keys in an internal block, or number of items in a leaf block
   blk_free_space       unsigned short   2   free space in the block, in bytes
   blk_right_delim_key  struct key      16   right delimiting key for this block (leaf nodes only)
   total: 6 (stored as 8) bytes for internal nodes; 22 (stored as 24) bytes for leaf nodes

 struct key
   k_dir_id       __u32   4   ID of the parent directory
   k_object_id    __u32   4   ID of the object (also the number of the inode)
   k_offset       __u32   4   offset from the beginning of the object to the current byte of the object
   k_uniqueness   __u32   4   type of the item (StatData = 0, Direct = -1, InDirect = -2, Directory = 500)
   total: 16 bytes

 struct disk_child (pointer to a disk block)
   dc_block_number   unsigned long    4   disk child's block number
   dc_size           unsigned short   2   disk child's used space
   total: 6 (stored as 8) bytes

=== Leaf Node Structures ===

struct block_head is as given above; for leaf nodes the total is 22 (stored as 24) bytes.

Everything in the filesystem is stored as a set of items. Each item has its item_head. The item_head contains the key of the item, its free space (for indirect items), and the location of the item itself within the block.

 struct item_head (IHead)
   ih_key                              struct key   16   key used to search for the item; all item headers are sorted by this key
   u.ih_free_space / u.ih_entry_count  __u16         2   free space in the last unformatted node for an InDirect item;
                                                         0xFFFF for a Direct item; 0xFFFF for a StatData item;
                                                         the number of directory entries for a Directory item
   ih_item_len                         __u16         2   total size of the item body
   ih_item_location                    __u16         2   offset of the item body within the block
   ih_reserved                         __u16         2   used by reiserfsck
   total: 24 bytes

There are 4 types of items: stat_data items, directory items, indirect items, and direct items.

 struct stat_data (the reiserfs version of the UFS disk inode, minus the address blocks)
   sd_mode                __u16   2   file type, permissions
   sd_nlink               __u16   2   number of hard links
   sd_uid                 __u16   2   owner id
   sd_gid                 __u16   2   group id
   sd_size                __u32   4   file size
   sd_atime               __u32   4   time of last access
   sd_mtime               __u32   4   time the file was last modified
   sd_ctime               __u32   4   time the inode (stat data) was last changed (except changes to sd_atime and sd_mtime)
   sd_rdev                __u32   4   device
   sd_first_direct_byte   __u32   4   offset from the beginning of the file to the first byte of the file's direct item:
                                      (-1) for a directory; (1) for small files (the file has direct items only);
                                      (>1) for big files (the file has indirect and direct items);
                                      (-1) for big files (the file has indirect items but no direct item)
   total: 32 bytes

Directory item layout:

 deHead 0 | deHead 1 | deHead 2 | --- | deHead N | fileName N | --- | fileName 2 | fileName 1 | fileName 0

Direct item layout:

 ..........small file body..........

InDirect item layout:

 unfPointer 0 | unfPointer 1 | unfPointer 2 | --- | unfPointer N

where unfPointer is a pointer to an unformatted block (unfPointer size = 4 bytes); unformatted blocks contain the body of a big file.

 struct reiserfs_de_head (deHead)
   deh_offset     __u32   4   third component of the directory entry key (all reiserfs_de_heads are sorted by this value)
   deh_dir_id     __u32   4   objectid of the parent directory of the object referenced by the directory entry
   deh_objectid   __u32   4   objectid of the object referenced by the directory entry
   deh_location   __u16   2   offset of the name within the whole item
   deh_state      __u16   2   flags: 1) entry contains stat data (for the future); 2) entry is hidden (unlinked)
   total: 16 bytes

fileName is the name of the file, an array of bytes of variable length. The maximum length of a file name is blocksize - 64 (for a 4kb blocksize, the maximum name length is 4032 bytes).

= Using the Tree to Optimize Layout of Files =

There are four levels at which layout optimization is performed:
# the mapping of logical block numbers to physical locations on disk,
# the assigning of nodes to logical block numbers,
# the ordering of objects within the tree, and
# the balancing of the objects across the nodes they are packed into.

== Physical Layout ==

For SCSI drives the mapping of logical block numbers to physical locations is performed by the disk drive manufacturer, for IDE drives it is done by the device driver, and for all drives it is also potentially done by volume management software.
The logical block number to physical location mapping by the drive manufacturer is usually done using cylinders. I agree with the authors of [ext2fs] and most others that the significant file placement feature for FFS was not the actual cylinder boundaries, but placing files and their inodes on the basis of their parent directory location. FFS used explicit knowledge of actual cylinder boundaries in its design. I find that minimizing the distance in logical blocks of semantically adjacent nodes without tracking cylinder boundaries accomplishes an excellent approximation of optimizing according to actual cylinder boundaries, and I find its simplicity an aid to implementation elegance.

== Node Layout ==

When I place nodes of the tree on the disk, I search the bitmap of used block numbers for the first empty block, starting at the location of the node's left neighbor in the tree ordering and moving in the direction I last moved in. This was experimentally found to be better than the following alternatives for the benchmarks employed:
# taking the first non-zero entry in the bitmap,
# taking the entry after the last one that was assigned in the direction last moved in (this was 3% faster for writes and 10-20% slower for subsequent reads), and
# starting at the left neighbor and moving in the direction of the right neighbor.
When changing block numbers for the purpose of avoiding overwriting sending nodes before shifted items reach disk in their new recipient node (see the description of preserve lists later in this paper), the benchmarks employed were ~10% faster when starting the search from the left neighbor rather than the node's current block number, even though it adds significant overhead to determine the left neighbor (the current implementation risks I/O to read the parent of the left neighbor). It used to be that we would reverse direction when we reached the end of the disk drive.
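The chosen policy (start at the left neighbor's block and continue in the direction last moved in) can be sketched over a simple byte-per-block bitmap; this is an illustration, not the reiserfs implementation:

```c
#include <stddef.h>

/* Scan a byte-per-block bitmap (nonzero = in use) for the first free
 * block, starting at `hint` (e.g. the left neighbor's block number)
 * and moving in `direction` (+1 or -1). Returns the block number, or
 * -1 if none is found before the edge of the device is reached. */
long find_free_block(const unsigned char *bitmap, size_t nblocks,
                     size_t hint, int direction)
{
    long i;
    for (i = (long)hint; i >= 0 && i < (long)nblocks; i += direction)
        if (!bitmap[i])
            return i;
    return -1;
}
```

As the surrounding text notes, always allocating in the increasing block number direction was later found to perform measurably better than reversing at the end of the device.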
Fortunately we checked to see if it makes a difference which direction one moves in when allocating blocks to a file, and indeed we found it made a significant difference to always allocate in the increasing block number direction. We hypothesize that this is due to matching disk spin direction by allocating using increasing block numbers.

== Ordering within the Tree ==

While I give here an example of how I have defined keys to optimize locality of reference and packing efficiency, I would like to stress that key definition is a powerful and flexible tool that I am far from finished experimenting with. Some key definition decisions depend very much on usage patterns, and this means that someday one will select from several key definitions when creating the file system. For example, consider the decision of whether to pack all directory entries together at the front of the file system, or to pack the entries near the files they name. For large file usage patterns one should pack all directory items together, since systems with such usage patterns are effective in caching the entries for all directories. For small files the name should be near the file. Similarly, for large files the stat data should be stored separately from the body, either with the other stat data from the same directory, or with the directory entry. (It was likely a mistake for me to not assign stat data its own key in the current implementation, as packing it in with direct and indirect items complicates our code for handling those items, and prevents me from easily experimenting with the effects of changing its key assignment.) It is not necessary for a file's packing to reflect its name; that is merely my default. With each file my next release will offer the option of overriding the default by use of a system call.
It is feasible to pack an object completely independently of its semantics using these algorithms, and I predict that there will be many applications, perhaps even most, for which a packing different than that determined by object names is more appropriate. Currently the mandatory tying of packing locality and semantics results in the distortion of both semantics and packing from what might otherwise be their independent optimums, much as tying block boundaries to file boundaries distorts I/O and space allocation algorithms from their separate optimums. For example, placing most files accessed while booting in their access order at the start of the disk is a very tempting future optimization that the use of packing localities makes feasible to consider. The Structure of a Key: Each file item has a key with structure <locality_id, object_id, offset, uniqueness>. The locality_id is by default the object_id of the parent directory. The object_id is the unique id of the file, and this is set to the first unused objectid when the object is created. The tendency of this to result in successive object creations in a directory being adjacently packed is often fortuitous for many usage patterns. For files the offset is the offset within the logical object of the first byte of the item. In version 0.2 all directory entries had their own individual keys stored with them and were each distinct items, in the current version I store one key in the item which is the key of the first entry, and compute each entry's key as needed from the one key stored in the item. For directories the offset key component is the first four bytes of the filename, which you may think of as the lexicographic rather than numeric offset. For directory items the uniqueness field differentiates filename entries identical in the first 4 bytes. 
For all item types it indicates the item type and for the leftmost item in a buffer it indicates whether the preceding item in the tree is of the same type and object as this item. Placing this information in the key is useful when analyzing balancing conditions, but increases key length for non-directory items, and is a questionable architectural feature. Every file has a unique objectid, but this cannot be used for finding the object, only keys are used for that. Objectids merely ensure that keys are unique. If you never use the reiserfs features that change an object's key then it is immutable, otherwise it is mutable. (This feature aids support for NFS daemons, etc.) We spent quite some time debating internally whether the use of mutable keys for identifying an object had deleterious long term architectural consequences: in the end I decided it was acceptable iff we require any object recording a key to possess a method for updating its copy of it. This is the architectural price of avoiding caching a map of objectid to location that might have very poor locality of reference due to objectids not changing with object semantics. I pack an object with the packing locality of the directory it was first created in unless the key is explicitly changed. It remains packed there even if it is unlinked from the directory. I do not move it from the locality it was created in without an explicit request, unlike the [C-FFS] approach which stores all multiple link files together and pays the cost of moving them from their original locations when the second link occurs. I think a file linked with multiple directories might as well get at least the locality reference benefits of one of those directories. In summary, this approach 1) places files from the same directory together, 2) places directory entries from the same directory together with each other and with the stat data for the directory. 
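The key just described orders everything in the tree. A minimal sketch of the tuple and its lexicographic comparison (field names follow the struct key table earlier in the paper; the comparison function is an illustration, not the kernel's code):

```c
#include <stdint.h>

/* The <locality_id, object_id, offset, uniqueness> key. The locality
 * id (dir_id) comes first, so all items packed in the same locality
 * (by default, the same parent directory) sort together. */
struct key {
    uint32_t dir_id;      /* locality: parent directory's object id   */
    uint32_t object_id;   /* unique id of the file itself             */
    uint32_t offset;      /* byte offset, or first 4 bytes of a name  */
    uint32_t uniqueness;  /* item type / filename disambiguator       */
};

/* Plain lexicographic comparison, field by field. Returns <0, 0, >0. */
int key_cmp(const struct key *a, const struct key *b)
{
    if (a->dir_id != b->dir_id)
        return a->dir_id < b->dir_id ? -1 : 1;
    if (a->object_id != b->object_id)
        return a->object_id < b->object_id ? -1 : 1;
    if (a->offset != b->offset)
        return a->offset < b->offset ? -1 : 1;
    if (a->uniqueness != b->uniqueness)
        return a->uniqueness < b->uniqueness ? -1 : 1;
    return 0;
}
```

Because dir_id is the most significant field, two files created in the same directory sort adjacently no matter when they were created, which is the packing-locality property the text summarizes.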
Note that there is no interleaving of objects from different directories in the ordering at all, and that all directory entries from the same directory are contiguous. You'll note that this does not accomplish packing the files of small directories with common parents together, and does not employ the full partial ordering in determining the linear ordering; it merely uses parent directory information. I feel the proper place for employing full tree structure knowledge is in the implementation of an FS cleaner, not in the dynamic algorithms.

== Node Balancing Optimizations ==

When balancing nodes I do so according to the following ordered priorities:
# minimize the number of nodes used;
# minimize the number of nodes affected by the balancing operation;
# minimize the number of uncached nodes affected by the balancing operation;
# if shifting to another formatted node is necessary, maximize the bytes shifted.
Priority 4 is based on the assumption that the location of an insertion of bytes into the tree is an indication of the likely future location of an insertion, and that policy 4 will on average reduce the number of formatted nodes affected by future balance operations. There are more subtle effects as well, in that if one randomly places nodes next to each other, and one has a choice between those nodes being mostly moderately efficiently packed or packed to an extreme of either well or poorly packed, one is more likely to be able to combine more of the nodes if one chooses the policy of extremism. Extremism is a virtue in space efficient node packing. The maximal shift policy is not applied to internal nodes, as extremism is not a virtue in time efficient internal node balancing.
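Priority 1 corresponds, in its simplest form, to the check stated earlier in the paper: a node and its two neighbors must never be left in a state where their contents could be compressed into two nodes. A minimal sketch, with an assumed node capacity:

```c
/* Assumed capacity of a formatted node: a 4 KiB block minus a 24-byte
 * block head. The exact figure is illustrative; the point is the test
 * itself, which balancing applies to a node and its two neighbors. */
enum { NODE_CAPACITY = 4096 - 24 };

/* Returns 1 if the used bytes of three adjacent nodes would fit in
 * two nodes, in which case balancing must merge them (priority 1:
 * minimize the number of nodes used). */
int can_merge_three_into_two(int used_left, int used_mid, int used_right)
{
    return used_left + used_mid + used_right <= 2 * NODE_CAPACITY;
}
```

Tighter packing than this three-into-two invariant is, as the text says, left to an FS cleaner rather than attempted dynamically.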
=== Drops ===

(The difficult design issues in the current version that our next version can do better.)

Consider dividing a file or directory into drops, with each drop having a separate key, and no two drops from one file or directory occupying the same node without being compressed into one drop. The key for each drop is set to the key for the object (file or directory) plus the offset of the drop within the object. For directories the offset is lexicographic and by filename; for files it is numeric and in bytes. In the course of several file system versions we have experimented with and implemented solid, liquid, and air drops. Solid drops were never shifted, and drops would only solidify when they occupied the entirety of a formatted node. Liquid drops are shifted in such a way that any liquid drop which spans a node fully occupies the space in its node. Like a physical liquid it is shiftable but not compressible. Air drops merely meet the balancing condition of the tree. Reiserfs 0.2 implemented solid drops for all but the tail of files. If a file was at least one node in size it would align the start of the file with the start of a node, block aligning the file. This block alignment of the start of multi-drop files was a design error that wasted space: even if the locality of reference is so poor as to make one not want to read parts of semantically adjacent files, if the nodes are near to each other then the cost of reading an extra block is thoroughly dwarfed by the cost of the seek and rotation to reach the first node of the file. As a result the block alignment saves little in time, though it costs significant space for 4-20k files. Reiserfs with block alignment of multi-drop files and no indirect items experienced the following rather interesting behavior, which was partially responsible for making it only 88% space efficient for files that averaged 13k in size (the linux kernel).
When the tail of a larger than 4k file was followed in the tree ordering by another file larger than 4k, since the drop before was solid and aligned, and the drop afterwards was solid and aligned, no matter what size the tail was, it occupied an entire node. In the current version we place all but the tail of large files into a level of the tree reserved for full unformatted nodes, and create indirect items in the formatted nodes which point to the unformatted nodes. This is known in the database literature as the [BLOB] approach. This extra level added to the tree comes at the cost of making the tree less balanced (I consider the unformatted nodes pointed to as part of the tree) and increasing the maximal depth of the tree by 1. For medium sized files, the use of indirect items increases the cost of caching pointers by mixing data with them. The reduction in fanout often causes the read algorithms to fetch only one node at a time of the file being read more frequently, as one waits to read the uncached indirect item before reading the node with the file data. There are more parents per file read with the use of indirect items than with internal nodes, as a direct result of reduced fanout due to mixing tails and indirect items in the node. The most serious flaw is that these reads of various nodes necessary to the reading of the file have additional rotations and seeks compared to the case with drops. With my initial drop approach they are usually sequential in their disk layout, even the tail, and the internal node parent points to all of them in such a way that all of them that are contained by that parent or another internal node in cache can be requested at once in one sequential read. Non-sequential reads of nodes are more than an order of magnitude more costly than sequential reads, and this single consideration dominates effective read optimization. 
Unformatted nodes make file system recovery faster and less robust, in that one reads their indirect item rather than them to insert them into the recovered tree, and one cannot read them to confirm that their contents are from the file that an indirect item says they are from. In this they make reiserfs similar to an inode based system without logging. A moderately better solution would have been to have simply eliminated the requirement for placement of the start of multi-node files at the start of nodes, rather than introducing BLOBs, and to have depended on the use of a file system cleaner to optimally pack the 80% of files that don't move frequently using algorithms that move even solid drops. Yet that still leaves the problem of formatted nodes not being efficient for mmap() purposes (one must copy them before writing rather than merely modifying their page table entries, and memory bandwidth is expensive even if CPU is cheap). For this reason I have the following plan for the next version. I will have three trees: one tree maps keys to unformatted nodes, one tree maps keys to formatted nodes, and one tree maps keys to directory entries and stat_data. Now it is only natural if you are thinking that that would mean that to read a file and access first the directory entry and stat_data, then the unformatted node, then the tail, one must hop long distances across the disk, going first to one tree and then to the other. This is indeed why it took me two years to realize it could be made to work. My plan is to interleave the nodes of the three trees according to the following algorithm: block numbers are assigned to nodes when the nodes are created, or preserved, and someday will be assigned when the cleaner runs. The choice of block number is based on first determining what other node it should be placed near, and then finding the nearest free block that can be found in the elevator's current direction.
Currently we use the left neighbor of the node in the tree as the node it should be placed near. This is nice and simple. Oh well. Time to create a virtual neighbor layer. The new scheme will continue to first determine the node it should be placed near, and then start the search for an empty block from that spot, but it will use a more complicated determination of what node to place it near. This method will cause all nodes from the same packing locality to be near each other, will cause all directory entries and stat_data to be grouped together within that packing locality, and will interleave formatted and unformatted nodes from the same packing locality. Pseudo-code is best for describing this:

 /* For use by reiserfs_get_new_blocknrs when determining where in the
    bitmap to start the search for a free block, and for use by the
    read-ahead algorithm when there are not enough nodes to the right
    and in the same packing locality for packing locality read-ahead
    purposes. */
 get_logical_layout_left_neighbors_blocknr(key of current node)
 {
     /* Based on examination of the current node's key and type, find
        the virtual neighbor of that node. */
     if body node
         if first body node of file
             if (node in tail tree whose key is less but is in same packing locality exists)
                 return blocknr of such node with largest key
             else
                 find node with largest key less than key of current node in stat_data tree
                 return its blocknr
         else
             return blocknr of node in body tree with largest key less than key of current node
     else if tail node
         if (node in body tree belonging to same file as first tail of current node exists)
             return its blocknr
         else if (node in tail tree with lesser delimiting key but same packing locality exists)
             return blocknr of such node with largest delimiting key
         else
             return blocknr of node with largest key less than key of current node in stat_data tree
     else /* is a stat_data tree node */
         if stat_data node with lesser key from same packing locality exists
             return blocknr of such node with largest key
         else
             /* no node from same packing locality with lesser key exists */
 }

 /* For use by packing locality read-ahead. */
 get_logical_layout_right_neighbors_blocknr(key of current node)
 {
     right-handed version of the get_logical_layout_left_neighbors_blocknr logic
 }

It is my hope that this will improve caching of pointers to unformatted nodes, plus improving caching of directory entries and stat_data, by separating them from file bodies to a greater extent. I also hope that it will improve read performance for 1-10k files, and that it will allow us to do this without decreasing space efficiency.

=== Code Complexity ===

I thought it appropriate to mention some of the notable effects of simple design decisions on our implementation's code length. When we changed our balancing algorithms to shift parts of items rather than only whole items, so as to pack nodes tighter, this had an impact on code complexity. Another multiplicative determinant of balancing code complexity was the number of item types; introducing indirect items doubled this, and changing directory items from being liquid drops to being air drops also increased it.
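A common way to factor out such NxN complexity is a per-item-type operations table, dispatched by type, so the balancing code never interprets item bodies itself. A hypothetical sketch; these names and the toy implementations are mine, not the actual reiserfs interfaces:

```c
#include <stddef.h>

struct item { size_t len; };   /* toy item: just a length */

/* Operations every item type must supply; the balancing code calls
 * through this table and needs no type-specific branches. */
struct item_operations {
    size_t (*bytes_in)(const struct item *it);
    int    (*mergeable)(const struct item *left, const struct item *right);
};

static size_t generic_bytes(const struct item *it) { return it->len; }
static int never_merge(const struct item *l, const struct item *r)
{ (void)l; (void)r; return 0; }
static int always_merge(const struct item *l, const struct item *r)
{ (void)l; (void)r; return 1; }

enum item_type { STAT_DATA, DIRECT, INDIRECT, DIRECTORY, NR_ITEM_TYPES };

/* One row per item type: adding a compressed item would mean adding
 * one row here rather than touching the balancing code. */
const struct item_operations item_ops[NR_ITEM_TYPES] = {
    [STAT_DATA] = { generic_bytes, never_merge  },
    [DIRECT]    = { generic_bytes, always_merge },
    [INDIRECT]  = { generic_bytes, always_merge },
    [DIRECTORY] = { generic_bytes, always_merge },
};

/* Example dispatch: the caller knows the type only as a table index. */
size_t item_bytes(enum item_type t, const struct item *it)
{
    return item_ops[t].bytes_in(it);
}
```

The design choice is the standard one for NxN growth: the N item types and the N balancing operations meet only in the table, so each grows independently.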
Storing stat data in the first direct or indirect item of the file complicated the code for processing those items more than if I had made stat data its own item type. When one finds oneself with an NxN coding complexity issue, it usually indicates the need for adding a layer of abstraction. The NxN effect of the number of item types on balancing code complexity is an instance of that design principle, and we will address it in the next major rewrite. The balancing code will employ a set of item operations which all item types must support. The balancing code will then invoke those operations without caring to understand any more of the meaning of an item's type than that it determines which item specific item operation handler is called. Adding a new item type, say a compressed item, will then merely require writing a set of item operations for that item rather than requiring modifying most parts of the balancing code as it does now. We now feel that the function which determines what resources are needed to perform a balancing operation, fix_nodes(), might as well be written to decide what operations will be performed during balancing, since it pretty much has to do so anyway. That way, the function that performs the balancing with the nodes locked, do_balance(), can be gutted of most of its complexity.

= Buffering & the Preserve List =

We implemented for version 0.2 of our file system a system of write ordering that tracked all shifting of items in the tree, and ensured that no node that had had an item shifted from it was written before the node that had received the item was written. This is necessary to prevent a system crash from causing the loss of an item that might not be recently created. This tracking approach worked, and the overhead it imposed was not measurable in our benchmarks. When in the next version we changed to partially shifting items and increased the number of item types, this code grew out of control in its complexity.
I decided to replace it with a scheme that was far simpler to code and also more effective in typical usage patterns. This scheme was as follows. If an item is shifted from a node, change the block that its buffer will be written to: change it to the nearest free block to the old block's left neighbor, and rather than freeing the old block, place its block number on a "preserve list". (Saying nearest is slightly simplistic, in that the blocknr assignment function moves from the left neighbor in the direction of increasing block numbers.) When a "moment of consistency" is achieved, free all of the blocks on the preserve list. A moment of consistency occurs when there are no nodes in memory into which objects have been shifted (this could be made more precise, but then it would be more complex). If disk space runs out, force a moment of consistency to occur. This is sufficient to ensure that the file system is recoverable. Note that during the large file benchmarks the preserve list was freed several times in the middle of the benchmark. The percentage of buffers preserved is small in practice except during deletes, and one can arrange for moments of consistency to occur as frequently as one wants to. Note that I make no claim that this approach is better than the Soft Updates approach employed by [Granger] or by us in version 0.2; I merely note that tracking the order of writes is more complex than this approach for balanced trees which partially shift items. We may go back to the old approach some day, though not to the code that I threw out. Preserve lists substantially hamper performance for files in the 1-10k size range. We are re-evaluating them. Ext2fs avoids the metadata shifting problem by never shrinking directories, and by using fixed inode space allocations.
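The preserve-list mechanism described above can be sketched as follows; the structure, the fixed list size, and the byte-per-block bitmap are illustrative, not reiserfs's actual code:

```c
#include <stddef.h>

enum { PRESERVE_MAX = 1024 };

struct preserve_list {
    unsigned long blocknr[PRESERVE_MAX];  /* old blocks, not yet freed */
    size_t count;
    size_t shifted_nodes_in_memory;  /* 0 means a moment of consistency */
};

/* Called when a node that had items shifted out of it is relocated to
 * a new block instead of being overwritten in place: queue the old
 * block number rather than freeing it. Returns -1 when the list is
 * full, in which case the caller must force a moment of consistency. */
int preserve_block(struct preserve_list *pl, unsigned long old_blocknr)
{
    if (pl->count == PRESERVE_MAX)
        return -1;
    pl->blocknr[pl->count++] = old_blocknr;
    return 0;
}

/* At a moment of consistency (no dirty recipient nodes in memory),
 * free every preserved block by clearing its bit in the used-block
 * bitmap. Returns the number of blocks freed (0 if not yet safe). */
size_t preserve_flush(struct preserve_list *pl, unsigned char *bitmap)
{
    size_t i;
    if (pl->shifted_nodes_in_memory != 0)
        return 0;
    for (i = 0; i < pl->count; i++)
        bitmap[pl->blocknr[i]] = 0;
    i = pl->count;
    pl->count = 0;
    return i;
}
```

The crash-safety argument is exactly the one in the text: until the flush, both the old copy of a shifted-from node and the new copies of the recipient nodes exist on disk, so a crash loses at most recent work, never an old item.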
= Lessons From Log Structured File Systems =

Many techniques from other file systems have not yet been applied, primarily so as to satisfy my goal of giving reiserfs 1.0 only the minimum feature set necessary to be useful; they will appear in later releases. Log Structured File Systems [Rosenblum and Ousterhout] embody several such techniques, which I will describe after mentioning two concerns with that approach:

* With small object file systems it is not feasible to cache in RAM a map of objectid to location for every object, since there are too many objects. This is an inherent problem in using temporal packing rather than semantic packing for small object file systems. In my approach the internal nodes are the equivalent of this objectid-to-location map, but total internal node size is proportional to the number of nodes rather than the number of objects. You can think of internal nodes as a compression of object location information made possible by the existence of an ordering function; this compression is both essential for small files and a major feature of my approach.
* I like obtaining good though not ideal semantic locality without paying a cleaning cost for active data. This is a less critical concern.

I frequently find myself classifying packing and layout optimizations as either appropriate for implementing dynamically or appropriate only for a cleaner. Optimizations whose computational overhead is large compared to their benefit tend to be appropriate for implementation in a cleaner, and a cleaner's benefits mostly impact the static portion of the file system (which typically consumes ~80% of the space). Such objectives as 100% packing efficiency, exactly ordering block layout by semantic order, using the full semantic tree rather than the parent directory in determining semantic order, and compression are all best implemented by cleaner approaches.
In summary, there is much to be learned from the LFS approach, and as I move past my initial objective of supplying a minimal-feature, higher-performance FS I will apply some of those lessons. In the Preserve List section I speculate on the possibilities for a fastboot implementation that would merge the better features of preserve lists and logging.

= Directions For the Future =

To go one more order of magnitude smaller in file size will require adding functionality to the file system API, though it will not require discarding upward compatibility. The use of an exokernel is a better approach to small files when it is an option available to the OS designer; it is not currently an option for Linux users. In the future reiserfs will add such features as lightweight files, in which stat_data other than size is inherited from a parent if it is not created individually for the file; an API for reading and writing files without the overhead of file handles and open(); set-theoretic semantics; and many other features that you would expect from researchers who expect to be able to do all that they could do in a database, in the file system, and never really did understand why not.

= Conclusion =

Balanced tree file systems are inherently more space efficient than block allocation based file systems, with the differences reaching order-of-magnitude levels for small files. While other aspects of design will typically have a greater impact on performance for large files, the use of balanced trees offers performance advantages in direct proportion to the smallness of the file. A moderate advantage was found for large files. Coding cost is mostly in the interfaces, and it is a measure of the OS designer's skill whether those costs are low in the OS. We make it possible for an OS designer to use the same interface for large and small objects, and thereby reduce interface coding cost.
This approach is a new tool available to the OS designer for increasing the expressive power of all of the components in the OS through better name space integration.

Researchers interested in collaborating or just using my work will find me friendly. I tailor the framework of my collaborations to the needs of those I work with. I GPL reiserfs so as to meet the needs of academic collaborators. While that makes it unusable without a special license for commercial OSes, commercial vendors will find me friendly in setting up a commercial framework for collaboration with commercial needs provided for.

= Acknowledgments =

Hans Reiser was the project initiator, primary architect, supplier of funding, and one of the programmers. Some folks at times remark that naming the file system Reiserfs was egotistic. It was so named after a potential investor hired all of my employees away from me, then tried to negotiate better terms for his possible investment, and suggested that he could arrange for 100 researchers to swear in a Russian court that I had had nothing to do with this project. That business partnership did not work out.

Vladimir Saveljev, while he did not author this paper, worked long hours writing the largest fraction of the lines of code in the file system, and is remarkably gifted at just making things work. Thanks, Vladimir. Anatoly Pinchuk wrote much of the core balancing code, and too much of the rest to list here. Thanks, Anatoly.

It is the policy of the Naming System Venture that if someone quits before project completion, and then takes strong steps to try to prevent others from finishing the project, they shall not be mentioned in the acknowledgments. This was all quite sad, and best forgotten.

I would like to thank Alfred Ajlamazyan for his generosity in providing overhead at a time when his institute had little it could easily spare.
Grigory Zaigralin is thanked for his work in making the machines run, administering the money, and being his usual determined-to-be-useful self. Igor Chudov, thanks for such effective procurement and hardware maintenance work. Eirik Fuller is thanked for his help with NFS and porting to 2.1. I would like to thank Remi Card for the superb block allocation based file system (ext2fs) that I depended on for so many years, and that allowed me to benchmark against the best. Linus Torvalds, thank you for Linux.

= Business Model and Licensing =

I personally favor performing a balance of commercial and public works in my life. I have no axe to grind against software that is charged for, and no regrets at making reiserfs freely available to Linux users. This project is GPL'd, but I sell exceptions to the GPL to commercial OS vendors and file server vendors. It is not usable to them without such exceptions, and many of them are wise enough to understand that:

* the porting and integration service we are able to provide with the licensing is by itself worth what we charge,
* these services impact their time to market,
* and the relationship spreads the development costs across more OS vendors than just them alone.

I expect that Linux will prove to be quite effective in market sampling my intended market, but if you suspect that I also like seeing more people use it even if it is free to them, oh well. I believe it is not so much the cost that has made Linux so successful as it is the openness. Linux is a decentralized economy with honor and recognition as the currency of payment (and thus there is much honor in it). Commercial OS vendors are, at the moment, all closed economies, and doomed to fall in their competition with open economies just as communism eventually fell.
At some point an OS vendor will realize that if it:

* opens up its source code to decentralized modification,
* systematically rewards those who perform the modifications that are proven useful,
* systematically merges/integrates those modifications into its branded primary release branch while adding value as the integrator,

then it will acquire both the critical mass of the internet development community and the aggressive edge that no large communal group (such as a corporation) can have. Rather than saying to any such vendor that they should do this now, let me simply point out that whoever is first will have an enormous advantage.

Since I have more recognition than money to pass around as reward, my policy is to tend to require that those who contribute substantial software to this project have their names attached to a user-visible portion of the project. This official policy helps me deal with folks like Vladimir, who was much too modest to ever name the file system checker vsck without my insisting. Smaller contributions are to be noted in the source code and in the acknowledgments section of this paper.

If you choose to contribute to this file system, and your work is accepted into the primary release, you should let me know if you want me to look for opportunities to integrate you into contracts from commercial vendors. Through packaging ourselves as a group, we are more marketable to such OS vendors. Many of us have spent too much time working at day jobs unrelated to our Linux work. This is too hard, and I hope to make things easier for us all. If you like this business model of selling GPL'd component software with related support services, but you write software not related to this file system, I encourage you to form a component supplier company also. Opportunities may arise for us to cooperate in our marketing, and I will be happy to do so.

= References =

G.M. Adel'son-Vel'skii and E.M.
Landis, ``An algorithm for the organization of information'', Soviet Math. Doklady 3, 1259-1262, 1962. This paper on AVL trees can be thought of as the founding paper of the field of storing data in trees. Those not conversant in Russian will want to read the [Lewis and Denenberg] treatment of AVL trees in its place. [Wood] contains a modern treatment of trees.

[Apple] Inside Macintosh, Files, Apple Computer Inc., Addison-Wesley, 1992. Employs balanced trees for filenames. It was an interesting file system architecture for its time in a number of ways, but its problems with internal fragmentation have become more severe as disk drives have grown larger, and the code has not received sufficient further development.

[Bach] Maurice J. Bach, ``The Design of the Unix Operating System'', Prentice-Hall Software Series, Englewood Cliffs, NJ, 1986. Superbly written but sadly dated; contains detailed descriptions of the file system routines and interfaces in a manner especially useful for those trying to implement a Unix-compatible file system. See [Vahalia].

[BLOB] R. Haskin, Raymond A. Lorie, ``On Extending the Functions of a Relational Database System'', SIGMOD Conference 1982: 207-212 (body of paper not on the web). See the Drops section for a discussion of how this approach makes the tree less balanced, and the effect that has on performance.

[Chen] Chen, P.M., Patterson, David A., ``A New Approach to I/O Performance Evaluation: Self-Scaling I/O Benchmarks, Predicted I/O Performance'', 1993 ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems; also available on Chen's web page.

[C-FFS] Ganger, Gregory R., Kaashoek, M. Frans. A very well written paper focused on 1-10k file size issues; they use some similar notions (most especially their concept of grouping compared to my packing localities). Note that they focus on the 1-10k file size range, and not the sub-1k range. The 1-10k range is the weak point in reiserfs performance.
[ext2fs] by Remi Card; extensive information and source code are available. When you consider how small this file system is (~6000 lines), its effectiveness becomes all the more remarkable.

[FFS] M.K. McKusick, W.N. Joy, S.J. Leffler, and R.S. Fabry, ``A fast file system for UNIX'', ACM Transactions on Computer Systems, 2(3):181-197, August 1984. Describes the implementation of a file system which employs parent directory location knowledge in determining file layout. It uses large blocks for all but the tail of files to improve I/O performance, and uses small blocks called fragments for the tails so as to reduce the cost due to internal fragmentation. Numerous other improvements are also made to what was once the state of the art. FFS remains the architectural foundation for many current block allocation file systems, and was later bundled with the standard Unix releases. Note that unrequested serialization and the use of fragments place it at a performance disadvantage to ext2fs, though whether ext2fs is thereby made less reliable is a matter of dispute that I take no position on (reiserfs uses preserve lists; forgive my egotism in thinking that it is enough work for me to ensure that reiserfs solves the recovery problem, and to perhaps suggest that ext2fs would benefit from the use of preserve lists when shrinking directories).

[Ganger] Gregory R. Ganger, Yale N. Patt, ``Metadata Update Performance in File Systems'' (abstract only).

[Gifford] Describes a file system enriched to have more than hierarchical semantics. He shares many goals with this author; forgive me for thinking his work worthwhile. If I had to suggest one improvement in a sentence, I would say his semantic algebra needs closure.

[Hitz] Dave Hitz, http://www.netapp.com/technology/level3/3002.html A rather well designed file system optimized for NFS and RAID in combination. Note that RAID increases the merits of write-optimization in block layout algorithms.
[Holton and Das] Holton, Mike, Das, Raj. ``The XFS space manager and namespace manager use sophisticated B-Tree indexing technology to represent file location information contained inside directory files and to represent the structure of the files themselves (location of information in a file).'' Note that it is still a block (extent) allocation based file system; no attempt is made to store the actual file contents in the tree. It is targeted at the needs of the other end of the file size usage spectrum from reiserfs, and is an excellent design for that purpose (and I would concede that reiserfs 1.0 is not suitable for their real-time large I/O market). SGI has also traditionally been a leader in resisting the use of unrequested serialization of I/O. Unfortunately, the paper is a bit vague on details, and source code is not freely available.

[Howard] Howard, J.H., Kazar, M.L., Menees, S.G., Nichols, D.A., Satyanarayanan, M., Sidebotham, R.N., West, M.J., ``Scale and Performance in a Distributed File System'', ACM Transactions on Computer Systems, 6(1), February 1988. A classic benchmark; it was too CPU bound for both ext2fs and reiserfs.

[Knuth] Knuth, D.E., The Art of Computer Programming, Vol. 3 (Sorting and Searching), Addison-Wesley, Reading, MA, 1973. The earliest reference discussing trees storing records of varying length.

[LADDIS] Wittle, Mark, and Bruce, Keith, ``LADDIS: The Next Generation in NFS File Server Benchmarking'', Proceedings of the Summer 1993 USENIX Conference, July 1993, pp. 111-128.

[Lewis and Denenberg] Lewis, Harry R., Denenberg, Larry, ``Data Structures & Their Algorithms'', HarperCollins Publishers, NY, NY, 1991. An algorithms textbook suitable for readers wishing to learn about balanced trees and their AVL predecessors.

[McCreight] McCreight, E.M., ``Pagination of B*-trees with variable length records'', Commun. ACM 20 (9), 670-674, 1977. Describes algorithms for trees with variable length records.
[McVoy and Kleiman] The implementation of write clustering for Sun's UFS.

[OLE] Kraig Brockschmidt, ``Inside OLE''. Discusses Structured Storage (abstract only at http://www.microsoft.com/mspress/books/abs/5-843-2b.htm).

[Ousterhout] J.K. Ousterhout, H. Da Costa, D. Harrison, J.A. Kunze, M.D. Kupfer, and J.G. Thompson, ``A trace-driven analysis of the UNIX 4.2BSD file system'', in Proceedings of the 10th Symposium on Operating Systems Principles, pages 15-24, Orcas Island, WA, December 1985.

[NTFS] Helen Custer, ``Inside the Windows NT File System'', Microsoft Press, 1994. NTFS was architected by Tom Miller with contributions by Gary Kimura, Brian Andrew, and David Goebel. An easy to read little book. They fundamentally disagree with me on adding serialization of I/O not requested by the application programmer, and I note that the performance penalty they pay for their decision is high, especially compared with ext2fs. Their FS design is perhaps optimal for floppies and other hardware-eject media beyond OS control. A less serialized, higher performance log structured architecture is described in [Rosenblum and Ousterhout]. That said, Microsoft is to be commended for recognizing the importance of attempting to optimize for small files, and for leading the OS designer effort to integrate small objects into the file name space. This book is notable for not referencing the work of persons not working for Microsoft, or providing any form of proper attribution to previous authors such as [Rosenblum and Ousterhout].

[Peacock] K. Peacock, ``The CounterPoint Fast File System'', Proceedings of the USENIX Conference, Winter 1988.

[Pike] Rob Pike and Peter Weinberger, ``The Hideous Name'', USENIX Summer 1985 Conference Proceedings, pp. 563, Portland, Oregon, 1985. Short, informal, and drives home why inconsistent naming schemes in an OS are detrimental.
http://achille.cs.bell-labs.com/cm/cs/doc/85/1-05.ps.gz His discussion of naming in Plan 9: http://plan9.bell-labs.com/plan9/doc/names.html

[Rosenblum and Ousterhout] Mendel Rosenblum and John K. Ousterhout, ``The Design and Implementation of a Log-Structured File System'', ACM Transactions on Computer Systems, February 1992. This paper was quite influential in a number of ways on many modern file systems, and the notion of using a cleaner may be applied to a future release of reiserfs. There is an interesting ongoing debate over the relative merits of FFS vs. LFS architectures; the interested reader may peruse http://www.scriptics.com/people/john.ousterhout/seltzer93.html and the arguments by Margo Seltzer it links to.

[Snyder] ``tmpfs: A Virtual Memory File System''. Discusses a file system built to use swap space and intended for temporary files; due to a complete lack of disk synchronization it offers extremely high performance.

[Vahalia] Uresh Vahalia, ``Unix Kernel Internals''.

[[category:ReiserFS]] [[category:Formatting-fixes-needed]]

Three reasons why ReiserFS is great for you:

ReiserFS has fast journaling, which means that you don't spend your life waiting for fsck every time your laptop battery dies, or the UPS for your mission-critical server gets its batteries disconnected accidentally by the UPS company's service crew, or your kernel was not as ready for prime time as you hoped, or the silly thing decides you mounted it too many times today.

ReiserFS is based on fast balanced trees. Balanced trees are more robust in their performance, and are a more sophisticated algorithmic foundation for a file system. When we started our project, there was a consensus in the industry that balanced trees were too slow for file system usage patterns. We proved that if you just do them right they are better--take a look at the benchmarks.
We have fewer worst-case performance scenarios than other file systems and generally better overall performance. If you put 100,000 files in one directory, we think it's fine; many other file systems try to tell you that you are wrong to want to do it. ReiserFS is more space efficient: if you write 100-byte files, we pack many of them into one block, while other file systems put each of them into its own block. We don't have fixed space allocation for inodes, which saves 6% of your disk.

Ok, it's time to fess up. The interesting stuff is still in the future. Because they are nifty, we are going to add database and hypertext-like features into the file system. Only by using balanced trees, with their effective handling of small files (database small fields, hypertext keywords), as our technical foundation can we hope to do this. That was our real motivation. As for performance, we may already be slightly better than the traditional file systems (and substantially better than the journaling ones). But they have been tweaking for decades, while we have just gotten started. This means that over the next few years we are going to improve faster than they are.

Speaking more technically: ReiserFS is a file system using a plug-in based, object oriented variant on classical balanced tree algorithms. The results when compared to the ext2fs conventional block allocation based file system, running under the same operating system and employing the same buffering code, suggest that these algorithms are overall more efficient and every passing month are becoming yet more so. Loosely speaking, every month we find another performance cranny that needs work, and we fix it; and every month we find some way of improving our overall general usage performance. The improvement in small file space and time performance suggests that we may now revisit a common OS design assumption: that one should aggregate small objects using layers above the file system layer.
Being more effective at small files does not make us less effective for other files. This is truly a general purpose FS; our overall traditional FS usage performance is high enough to establish that. ReiserFS has a commitment to opening up the FS design to contributions; we are now adding plug-ins so that you can create your own types of directories and files.

Table of Contents:

1. Introduction
2. Why Is There A Move Among Some OS Designers Towards Unifying Name Spaces?
3. Should File Boundaries Be Block Aligned?
4. Balanced Trees and Large File I/O
5. Serialization and Consistency
6. Why Aggregate Small Objects at the File System Level?
7. Tree Definitions
8. Using the Tree to Optimize Layout of Files
   1. Physical Layout
   2. Node Layout
   3. Ordering within the Tree
   4. Node Balancing Optimizations
      1. Drops (the difficult design issues in the current version that our next version can do better)
      2. Code Complexity
9. Buffering & the Preserve List
10. Lessons From Log Structured File Systems
11. Directions For the Future
12. Conclusion
13. Acknowledgments
14. Business Model and Licensing
15. References

Introduction

The author is one of many OS researchers who are attempting to unify the name spaces in the operating system in varying ways [e.g. Pike, ``The Use of Name Spaces in Plan 9'']. None of us is well funded compared with the size of the task, and I am far from an exception to this rule. The natural consequence is that we each have attacked one small aspect of the task. My contribution is in incorporating small objects into the file system name space effectively.

This implementation offers value to the average Linux user, in that it offers generally good performance compared to the current Linux file system known as ext2fs. It also saves space to an extent that is important for some applications, and convenient for most. It does extremely well for large directories, and has a variety of minor advantages.
Since ext2fs is very similar to FFS and UFS in performance, the implementation also offers potential value to commercial OS vendors who desire greater than ext2fs performance without directory size issues, and who appreciate the value of a better foundation for integrating name spaces throughout the OS.

Why Is There A Move Among Some OS Designers Towards Unifying Name Spaces?

An operating system is composed of components that access other components through interfaces. Operating systems are complex enough that, like national economies, the architect cannot centrally plan the interactions of the components they are composed of. The architect can, however, provide a structural framework that has a marked impact on the efficiency and utility of those interactions. Economists have developed principles that govern large economic systems. Are there system principles that we might use to start a discussion of the ways increasing component interactivity via naming system design impacts the total utility of an operating system? I propose these:

* If one increases the number of other components that a particular component can interact with, one increases its expressive power and thereby its utility.
* One can increase the number of other components that a particular component can interact with either by increasing the number of interfaces it has, or by increasing the number of components that are accessible through its current interfaces.
* The cost of component interfaces dominates software design cost, much as the cost of wires dominates circuit design cost.
* Total system utility tends to be proportional not to the number of components, but to the number of possible component interactions.

It is not simply the number of components that one has that determines an OS's expressive power; it is the number of opportunities to use them that determines it.
The number of these opportunities is proportional to the number of possible combinations of components, and the number of possible combinations is determined by their connectedness. Component connectedness in OS design is determined by name space design, to much the same extent that buses determine it in circuit design.

Allow me to illustrate the impact of these principles with an imaginary example. Suppose two imaginary OS vendors with equally talented programmers hire two different OS architects. Suppose one of the architects centers the OS design around a single name space design that allows all of the components to access all other components via a single interface (assume this is possible; it is a theoretical example). Suppose the other allows the ten different design groups in the company that are developing components to create their own ten name spaces. Suppose that the unified name space OS architect has half of the resources of the fragmented name space OS architect and creates half as many components. While the number of components is half as large, the number of connections is (1/2)^2 / ((1/10)^2 * 10) = 2.5 times larger. If you accept my hypothesis that utility is more proportional to connections than components, then the unified operating system with half the development cost will still offer more expressive utility. That is a powerful motivation.

To return briefly to the long ago researched principles governing another member of the class of large systems, the economies of nations: it is perhaps interesting to note that Adam Smith in ``The Wealth of Nations'' engaged in substantial study of the link between the extent of interconnectedness and the development of civilization, where the extent of interconnectedness was determined by waterways, etc. The link he found for economic systems was no less crucial than what is being suggested here for the effect of component interconnectedness on the total utility of software systems.
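The arithmetic behind the two-vendor example can be checked directly, under the stated assumption that possible interactions within a fully connected name space grow with the square of the number of components it contains (the component budget of 1000 below is purely illustrative):

```python
def interactions(component_counts):
    # Possible component interactions, summed over name spaces;
    # modeled simply as n**2 within each fully connected name space.
    return sum(n ** 2 for n in component_counts)

total = 1000                                   # illustrative component budget
unified = interactions([total // 2])           # one name space, half the components
fragmented = interactions([total // 10] * 10)  # ten name spaces, full budget

print(unified / fragmented)   # (1/2)**2 / ((1/10)**2 * 10) = 2.5
```

The ratio is independent of the component budget chosen, since both sides scale with the square of the total.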
I suggest that I am merely generalizing a long established principle from another field of science, namely that total utility in large systems with components that interact to generate utility is determined by the extent of their interconnection. There are many exceptions to these principles: not all chips on a motherboard sit on the bus, and analogous considerations apply to both OS design and the economies of nations. I hope the reader will accept that space considerations make it appropriate to gloss over these, and will consider the central point: that under some circumstances unifying name spaces in a design can dramatically improve the utility of an OS. That can be an enormous motivation, and it has moved a number of OS researchers in their work [e.g. Pike, ``The Use of Name Spaces in Plan 9'' and ``The Hideous Name'', http://magnum.cooper.edu:9000/~rp/html/rob.html].

Unfortunately, it is not a small technical effort to combine name spaces. To combine 10 name spaces requires, if not the effort to create 10 name spaces, perhaps an effort equivalent to creating 5 of them. Usually each of the name spaces has particular performance and semantic power requirements that require enhancing the unified name space, and it usually requires technical innovation to combine the advantages of each of the separated name spaces into a unified name space. I would characterize none of the research groups currently approaching this unification problem as having funding equivalent to what went into creating 5 of the name spaces they would like to unify, and we are certainly no exception. For this reason I have picked one particular aspect of this larger problem for our focus: allowing small objects to effectively share the same file system interface that large objects currently use.
As operating systems increase the number of their components, the higher development cost of a file system able to handle small files becomes more worth the multiplicative effect it has on OS utility, as well as its reduction of OS component interface cost.

Should File Boundaries Be Block Aligned?

Making file boundaries block aligned has a number of effects:

* it minimizes the number of blocks a file is spread across (which is especially beneficial for multiple-block files when locality of reference across files is poor),
* it wastes disk and buffer space in storing every less than fully packed block,
* it wastes I/O bandwidth with every access to a less than fully packed block when locality of reference is present,
* it increases the average number of block fetches required to access every file in a directory,
* and it results in simpler code.

The simpler code of block aligning file systems follows from not needing to create a layering to distinguish the units of the disk controller and buffering algorithms from the units of space allocation, and from not needing to optimize the packing of nodes as is done in balanced tree algorithms. For readers who have not been involved in balanced tree implementations: algorithms of this class are notorious for being much more work to implement than one would expect from their description. Sadly, they also appear to offer the highest performance solution for small files, once I remove certain simplifications from their implementation and add certain optimizations common to file system designs. I regret that code complexity (30k lines) is a major disadvantage of the approach compared to the 6k lines of the ext2fs approach.

I started our analysis of the problem with an assumption that I needed to aggregate small files in some way, and that the question was: which solution is optimal? The simplest solution was to aggregate all small files in a directory together into either a file or the directory.
But any aggregation into a file or directory wastes part of the last block in the aggregation. What does one do if there are only a few small files in a directory: aggregate them into the parent of the directory? What if there are only a few small files in a directory at first, and then there are many? How do I decide what level to aggregate them at, and when to take them back from a parent of a directory and store them directly in the directory? As we did our analysis of these questions we realized that this problem is closely related to the balancing of nodes in a balanced tree. The balanced tree approach, by using an ordering of files which are then dynamically aggregated into nodes at a lower level, rather than a static aggregation or grouping, avoids this set of questions.

In my approach I store both files and filenames in a balanced tree, with small files, directory entries, inodes, and the tail ends of large files all being more efficiently packed as a result of relaxing the requirements of block alignment and eliminating the use of a fixed space allocation for inodes. I have a sophisticated and flexible means of arranging for the aggregation of files for maximal locality of reference, through defining the ordering of items in the tree. The body of large files is stored in unformatted nodes that are attached to the tree but isolated from the effects of possible shifting by the balancing algorithms. Approaches such as [Apple] and [Holton and Das] have stored filenames, but not files, in balanced trees.
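The space effect of relaxing block alignment can be made concrete with a rough calculation. This is an illustrative sketch only (the 4096-byte block size and file counts are assumptions, and per-item header overhead of the tree is ignored), comparing blocks consumed when every file rounds up to whole blocks against an idealized packing where small files share blocks:

```python
import math

BLOCK_SIZE = 4096  # assumed block size, for illustration

def blocks_block_aligned(file_sizes, block_size=BLOCK_SIZE):
    # Each file occupies whole blocks; every partial last block is wasted space.
    return sum(max(1, math.ceil(size / block_size)) for size in file_sizes)

def blocks_packed(file_sizes, block_size=BLOCK_SIZE):
    # Idealized balanced-tree packing: file bodies share blocks
    # with no alignment requirement (header overhead ignored).
    return math.ceil(sum(file_sizes) / block_size)

small = [100] * 1000  # a thousand 100-byte files
print(blocks_block_aligned(small))  # 1000 blocks
print(blocks_packed(small))         # 25 blocks
```

For the 100-byte files of this example the block-aligned layout uses forty times the space of the packed layout, which is the order-of-magnitude difference for small files claimed above; for files much larger than a block the two layouts converge.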
None of the file systems C-FFS, NTFS, or XFS aggregate files, and all of them block align files, though all of them also do some variation on storing small files in the statically allocated block address fields of inodes if they are small enough to fit there. [C-FFS] has published an excellent discussion of both their approach and of why small files rob a conventional file system of performance more in proportion to the number of small files than to the number of bytes consumed by small files. However, I must note that their notion of what constitutes small differs from ours by one or two orders of magnitude. Their use of an exo-kernel is simply an excellent approach for operating systems that have that as an available option.

Semantics (files), packing (blocks/nodes), caching (read-ahead sizes, etc.), and the hardware interfaces of disk (sectors) and paging (pages) all have different granularity issues associated with them. A central point of our approach is that the optimal granularity of these often differs, and that abstracting them into separate layers, in which the granularity of one layer does not unintentionally impact the other layers, can improve space/time performance. Reiserfs innovates in that its semantic layer often conveys to the other layers an ungranulated ordering rather than one granulated by file boundaries. The reader is encouraged to note, while reading the algorithms, the areas in which reiserfs needs to go farther in doing so.

Balanced Trees and Large File I/O

There has long been an odd informal consensus that balanced trees are too slow for use in storing large files, perhaps originating in the performance of databases that have attempted to emulate file systems using balanced tree algorithms that were not originally architected for file system access patterns or their looser serialization requirements.
It is hopefully easy for the reader to understand that storing many small files and tail ends of files in a single node where they can all be fetched in one I/O leads directly to higher performance. Unfortunately, it is quite complex to understand the interplay between I/O efficiency and block size for larger files, and space does not allow a systematic review of traditional approaches. The reader is referred to [FFS], [Peacock], [McVoy], [Holton and Das], [Bach], [OLE], and [NTFS] for treatments of the topic, and for discussions of various means of 1) reducing the effect of block size on CPU efficiency, 2) eliminating the need for inserting rotational delay between successive blocks, 3) placing small files into either inodes or directories, and 4) performing read-ahead. More commentary on these is in the annotated bibliography.

Reiserfs has the following architectural weaknesses that stem directly from the overhead of repacking to save space and increase block size:

1) when the tail (files < 4k are all tail) of a file grows large enough to occupy an entire node by itself, it is removed from the formatted node(s) it resides in and converted into an unformatted node ([FFS] pays a similar conversion cost for fragments);

2) a tail that is smaller than one node may be spread across two nodes, which requires more I/O to read if locality of reference is poor;

3) aggregating multiple tails into one node introduces separation of file body from tail, which reduces read performance ([FFS] has a similar problem, and for reiserfs files near the node in size the effect can be significant);

4) when you add one byte to a file or tail that is not the last item in a formatted node, on average half of the whole node is shifted in memory.

If any of your applications perform I/O in such a way that they generate many small unbuffered writes, reiserfs will make you pay a higher price for not being able to buffer the I/O.
Most applications that create substantial file system load employ effective I/O buffering, often simply as a result of using the I/O functions in the standard C libraries. By avoiding accesses in small blocks/extents, reiserfs improves I/O efficiency. Extent-based file systems such as VxFS, and write-clustering file systems such as ext2fs, are not so effective in applying these techniques that they choose to use 512-byte rather than 1k blocks as their defaults. Ext2fs reports a 20% speedup when 4k rather than 1k blocks are used, yet the authors of ext2fs advise the use of 1k blocks to avoid wasting space.

There are a number of worthwhile large file optimizations that have been added to neither ext2fs nor reiserfs, and both file systems are somewhat primitive in this regard, reiserfs the more primitive of the two. Large files simply were not my research focus, and it being a small research project, I did not implement the many well-known techniques for enhancing large file I/O. The buffering algorithms are probably more crucial than any other component in large file I/O, and partly out of a desire for a fair comparison of the approaches I have not modified them. I have added no significant optimizations for large files, beyond increasing the block size, that are not found in ext2fs. Except for the size of the blocks, there is not a large inherent difference between 1) the cost of adding a pointer to an unformatted node to my tree plus writing the node, and 2) the cost of adding an address field to an inode plus writing the block. It is likely that, except for block size, the primary determinants of high performance large file access are orthogonal to the decision of whether to use balanced tree algorithms for small and medium sized files. For large files we get some advantage from not having our tree be more balanced than the tree formed by an inode which points to a triple indirect block; we have no easy method for measuring the performance gain from that, though.
There is performance overhead due to the memory bandwidth cost of balancing nodes for small files. We think it is worth it, though.

Serialization and Consistency

The issues of ensuring recoverability with minimal serialization and data displacement necessarily dominate high performance design. Let's define the two extremes in serialization so that the reason for this can be clear. Consider the relative speed of a set of I/Os in which every block request in the set is fed to the elevator algorithms of the kernel and the disk drive firmware fully serially, each request awaiting the completion of the previous request. Now consider the other extreme, in which all block requests are fed to the elevator algorithms together, so that they may all be sorted and performed in close to their sorted order (disk drive firmwares don't use a pure elevator algorithm). The unserialized extreme may be more than an order of magnitude faster, due to the cost of rotations and seeks. Unnecessarily serializing I/O prevents the elevator algorithm from doing its job of placing all of the I/Os in their layout sequence rather than their chronological sequence. Most of high performance design centers around issuing I/Os in the order they are laid out on disk, and laying out blocks on disk in the order that the I/Os will want to be issued.

[Snyder] discusses a file system that obtains high performance from a complete lack of disk synchronization, but is only suitable for temporary files that don't need to survive reboot. I think its known value to Solaris users indicates that the optimal buffering policy varies from file to file. [Ganger] discusses methods for using ordering of writes, rather than serialization, for ensuring conventional file system meta-data integrity; [McVoy] previously suggested, but did not implement, ordering of buffer writes.
Ext2fs is fast in substantial part due to avoiding synchronous writes of metadata, and I have much personal experience with it that leads me to prefer compiles that are fast. [I would like to see it adopt a policy that all dirty buffers for files not flagged as temporary are queued for writing, and that the existence of a dirty buffer means that the disk is busy. This will require replacing buffer I/O locking with copy-on-write, but an idle disk is such a terrible thing to waste. :-)] [NTFS] by default adds unnecessary serialization to an extent that even older file systems such as [FFS] do not, and its performance characteristics reflect that. In fairness, it should be said that this is the superior approach for most removable media without software control of ejection (e.g. IBM PC floppies). Reiserfs employs a new scheme called preserve lists for ensuring recoverability, which avoids overwriting old meta-data by writing the new meta-data nearby rather than over the old meta-data.

Why Aggregate Small Objects at the File System Level?

There has long been a tradition of file system developers deciding that effective handling of small files is not significant to performance, and of application programmers caring enough about performance to not store small files as separate entities in the file system. To store small objects one may either make the file system efficient for the task, or sidestep the problem by aggregating small objects in a layer above the file system. Sidestepping the problem has three disadvantages: utility, code complexity, and performance.

Utility and Code Complexity: Allowing OS designers to effectively use a single namespace with a single interface for both large and small objects decreases coding cost and increases the expressive power of components throughout the OS.
I feel reiserfs shows the effects of a larger development investment focused on a simpler interface when compared with many of the solutions for this currently available in the object oriented toolkit community, such as the Structured Storage available in Microsoft's [OLE]. By simpler I mean that I added nothing to the file system API to distinguish large and small objects, and I leave it to the directory semantics and archiving programs to aggregate objects. Multiple layers cost more to implement, cost more to code the interfaces for utilizing, and provide less flexibility.

Performance: It is most commonly the case that when one layers one file system on top of another, the performance is substantially reduced, and Structured Storage is not an exception to this general rule. Reiserfs, which does not attempt to delegate the small object problem to a layer above, avoids this performance loss. I have heard it suggested that this layering avoids the performance loss caused by syncing on file close, as many file systems do. I suggest that this is adding an error to an error rather than fixing it.

Let me make clear that I believe those who write such layers above the file system do not do so out of stupidity. I know of at least one company at which a solution that layers small object storage above the file system exists because the file system developers refused to listen to the non-file-system group's description of its needs, and the file system group had to be sidestepped in generating the solution. Current file systems are fairly well designed for the purposes their users currently use them for: my goal is to change file size usage patterns. The author remembers arguments that once showed clearly that there was no substantial market need for disk drives larger than 10MB, based on then-current usage statistics.
While [C-FFS] points out that 80% of file accesses are to files below 10k, I do not believe it reasonable to attempt to provide statistics, based on usage measurements of file systems for which small files are inappropriate to use, that will show that small files are frequently used. Application programmers are smarter than that. Currently 80% of file accesses are to the first order of magnitude in file size for which it is currently sensible to store the object in the file system. I regret that one can only speculate as to whether, once file systems become effective for small files and database tasks, usage patterns will change to 80% of file accesses being to files less than 100 bytes. What I can do is show, via the 80/20 Banded File Set Benchmark presented later, that in such circumstances small file performance potentially dominates total system performance. In summary, the on-going reinvention of incompatible object aggregation techniques above the file system layer is expensive, less expressive, less integrated, slower, and less efficient in its storage than incorporating balanced tree algorithms into the file system.

Tree Definitions

Balanced trees are used in databases and, more generally, wherever a programmer needs to search and store to non-random memory by a key and has the time to code it this way. The usual evolution for programmers is to first think that hashing will be simpler and more efficient, and then realize only after getting into the sordid details that the combination of space efficiency, minimized disk accesses, and the feasibility of caching the top part of the tree makes the tree approach more effective. It is the usual thing to first try to do hashing, and then, by the time the details are worked out, to have a balanced tree. The cost of effectively handling bucket overflow just isn't less than the cost of balancing, unless the buckets are always all in RAM.
Hashing is often a good solution when there is no non-random memory involved, such as when hashing a cache: the Linux dcache code uses hashing for accessing a cache of in-memory directory entries. Sometimes one uses partial or full hashing of keys within a balanced tree. If you do full hashing within a tree, and you cache the top part of that tree, you have something rather similar to extensible hashing, except that it is more flexible and efficient. Sometimes programmers code using unbalanced trees; most file systems do essentially that. Balanced trees generally do a better job of minimizing the average number of disk accesses. There is literature establishing that balanced trees are optimal for the worst case when there is no caching of the tree. This is rather pointless literature, as the average case when cached is what is important, and I am afraid that the existing literature proves that which is feasible to prove rather than that which is relevant. That said, practitioners know from experience that making the tree less balanced leads to more I/Os. Discussions of the exceptions to this are rather interesting, but not for here....

I regret that I must assume that the reader is familiar with basic balanced tree algorithms [Wood], [Lewis and Denenberg], [Knuth], [McCreight]. No attempt will be made to survey tree design here, since balanced trees are one of the most researched and complex topics in algorithm theory and require treatment at length. I must compound this discourtesy with a concise set of definitions that sorely lack accompanying diagrams; my apologies. Finally, I'll truly annoy the reader by saying that the header files contain nice ASCII art, and if you want the full definition of the structures, the source is the place.

Classically, balanced trees are designed with the set of keys assumed to be defined by the application, and the purpose of the tree design is to optimize searching through those keys.
In my approach the purpose of the tree is to optimize the reference locality and space-efficient packing of objects, and the keys are defined as best optimizes the algorithm for that. Keys are used in place of inode numbers in the file system, thereby substituting a mapping of keys to node locations (the internal nodes) for a mapping of inode numbers to file locations. Keys are longer than inode numbers, but one needs to cache fewer of them than one would need to cache inode numbers when more than one file is stored in a node.

In my tree, I still require that a filename be resolved one component at a time. It is an interesting topic for future research whether this is necessary or optimal. This is a more complex issue than a casual reader might realize: directory-at-a-time lookup accomplishes a form of compression, makes mounting other name spaces and file system extensions simpler, makes security simpler, and makes future enhanced semantics simpler. Since small files typically lead to large directories, it is fortuitous that, as a natural consequence of our use of tree algorithms, our directory mechanisms are much more effective for very large directories than those of most other file systems (notable exceptions include [Holton and Das]).

The tree has three node types: internal nodes, formatted nodes, and unformatted nodes. The contents of internal and formatted nodes are sorted in the order of their keys. (Unformatted nodes contain no keys.) Internal nodes consist of pointers to sub-trees separated by their delimiting keys. The key that precedes a pointer to a sub-tree is a duplicate of the first key in the first formatted node of that sub-tree. Internal nodes exist solely to allow determining which formatted node contains the item corresponding to a key. ReiserFS starts at the root node, examines its contents, and based on it determines which subtree contains the item corresponding to the desired key.
From the root node reiserfs descends into the tree, branching at each node, until it reaches the formatted node containing the desired item. The first (bottom) level of the tree consists of unformatted nodes, the second level consists of formatted nodes, and all levels above consist of internal nodes. The highest level contains the root node. The number of levels is increased as needed by adding a new root node at the top of the tree. All paths from the root of the tree to the formatted leaves are equal in length, and all paths to the unformatted leaves are also equal in length and one node longer than the paths to the formatted leaves. This equality of path length, and the high fanout it provides, is vital to high performance, and in the Drops section I will describe how the lengthening of the path that occurred as a result of introducing the [BLOB] approach (the use of indirect items and unformatted nodes) proved a measurable mistake.

Formatted nodes consist of items. Items have four types: direct items, indirect items, directory items, and stat data items. All items contain a key which is unique to the item; this key is used to sort, and to find, the item. Direct items contain the tails of files; tails are the last part of a file (the last file_size modulo FS block size bytes of the file). Indirect items consist of pointers to unformatted nodes. All but the tail of a file is contained in its unformatted nodes. Directory items contain the key of the first directory entry in the item, followed by a number of directory entries. Depending on the configuration of reiserfs, stat data may be stored as a separate item, or it may be embedded in a directory entry; we are still benchmarking to determine which way is best. A file consists of a set of indirect items followed by a set of up to two direct items, with two direct items representing the case when a tail is split across two nodes.
If a tail is larger than the maximum size of a file that can fit into a formatted node, but is smaller than the unformatted node size (4k), then it is stored in an unformatted node, and a pointer to it plus a count of the space used is stored in an indirect item. Directories consist of a set of directory items. Directory items consist of a set of directory entries. Directory entries contain the filename and the key of the file which is named. There is never more than one item of the same item type from the same object stored in a single node (there is no reason one would want to use two separate items rather than combining them). The first item of a file or directory contains its stat data. When performing balancing and analyzing the packing of a node and its two neighbors, we ensure that the three nodes cannot be compressed into two. I feel greater compression than this is best left to an FS cleaner rather than attempting it dynamically.

ReiserFS Structures

The ReiserFS tree has Max_Height = N (the current default is N = 5). The tree lies in disk blocks, and each disk block that belongs to the reiserfs tree has a block head.

An internal node of the tree holds keys and pointers to disk blocks:

  Block_Head | Key 0 | Key 1 | Key 2 | ... | Key N | Pointer 0 | Pointer 1 | Pointer 2 | ... | Pointer N | Pointer N+1 | ..Free Space..

A leaf node of the tree holds item headers and items:

  Block_Head | IHead 0 | IHead 1 | IHead 2 | ... | IHead N | ......Free Space...... | Item N | ... | Item 2 | Item 1 | Item 0

An unformatted node holds the raw data of a big file and has no internal structure.

ReiserFS objects are files and directories. The maximum number of objects is 2^32 - 4 = 4,294,967,292. Each object is a set of items, and every reiserfs object has an object ID and a key.

File items:
1. StatData item + [Direct item] (small files: size from 0 bytes to MAX_DIRECT_ITEM_LEN = blocksize - 112 bytes)
2. StatData item + InDirect item + [Direct item] (big files: size > MAX_DIRECT_ITEM_LEN bytes)

Directory items:
1. StatData item + Directory item

Internal Node Structures

struct block_head (total 6 bytes, padded to 8, for internal nodes; 22 bytes, padded to 24, for leaf nodes):
  blk_level            unsigned short   2  level of the block in the tree (1 = leaf; 2, 3, 4, ... = internal)
  blk_nr_item          unsigned short   2  number of keys in an internal block, or number of items in a leaf block
  blk_free_space       unsigned short   2  free space in the block, in bytes
  blk_right_delim_key  struct key      16  right delimiting key for this block (leaf nodes only)

struct key (total 16 bytes):
  k_dir_id       __u32  4  ID of the parent directory
  k_object_id    __u32  4  ID of the object (also the inode number)
  k_offset       __u32  4  offset from the beginning of the object to the current byte of the object
  k_uniqueness   __u32  4  type of the item (StatData = 0, Direct = -1, InDirect = -2, Directory = 500)

struct disk_child (pointer to a disk block; total 6 bytes, padded to 8):
  dc_block_number  unsigned long   4  disk child's block number
  dc_size          unsigned short  2  disk child's used space

Leaf Node Structures

Everything in the file system is stored as a set of items. Each item has an item_head. The item_head contains the key of the item, its free space (for indirect items), and the location of the item body within the block.

struct item_head (IHead; total 24 bytes):
  ih_key                               struct key  16  key used to search for the item; all item headers are sorted by this key
  u.ih_free_space / u.ih_entry_count   __u16        2  free space in the last unformatted node for an InDirect item; 0xFFFF for a Direct or StatData item; the number of directory entries for a Directory item
  ih_item_len                          __u16        2  total size of the item body
  ih_item_location                     __u16        2  offset of the item body within the block
  ih_reserved                          __u16        2  used by reiserfsck

There are four types of items: stat data items, directory items, indirect items, and direct items.

struct stat_data (the reiserfs version of the UFS disk inode, minus the address blocks; total 32 bytes):
  sd_mode               __u16  2  file type and permissions
  sd_nlink              __u16  2  number of hard links
  sd_uid                __u16  2  owner id
  sd_gid                __u16  2  group id
  sd_size               __u32  4  file size
  sd_atime              __u32  4  time of last access
  sd_mtime              __u32  4  time the file was last modified
  sd_ctime              __u32  4  time the inode (stat data) was last changed (except changes to sd_atime and sd_mtime)
  sd_rdev               __u32  4  device
  sd_first_direct_byte  __u32  4  offset from the beginning of the file to the first byte of the file's direct item:
                                  -1 for a directory; 1 for small files (direct items only); >1 for big files with
                                  indirect and direct items; -1 for big files with indirect items but no direct item

A directory item is laid out as:
  deHead 0 | deHead 1 | deHead 2 | ... | deHead N | fileName N | ... | fileName 2 | fileName 1 | fileName 0

A direct item holds a small file body. An indirect item is an array of unfPointers, each a 4-byte pointer to an unformatted block; unformatted blocks contain the body of a big file.

struct reiserfs_de_head (deHead; total 16 bytes):
  deh_offset    __u32  4  third component of the directory entry key (all reiserfs_de_heads are sorted by this value)
  deh_dir_id    __u32  4  objectid of the parent directory of the object referenced by the directory entry
  deh_objectid  __u32  4  objectid of the object referenced by the directory entry
  deh_location  __u16  2  offset of the name within the whole item
  deh_state     __u16  2  flags: 1) entry contains stat data (for the future); 2) entry is hidden (unlinked)

fileName is the name of the file, an array of bytes of variable length. The maximum length of a file name is blocksize - 64 (for a 4k blocksize, the maximum name length is 4032 bytes).

Using the Tree to Optimize Layout of Files

There are four levels at which layout optimization is performed: 1) the mapping of logical block numbers to physical locations on disk, 2) the assignment of nodes to logical block numbers, 3) the ordering of objects within the tree, and 4) the balancing of the objects across the nodes they are packed into.

Physical Layout

This is performed by the disk drive manufacturer for SCSI drives; for IDE drives the mapping of logical block numbers to physical locations is done by the device driver, and for all drives it is also potentially done by volume management software. The logical block number to physical location mapping by the drive manufacturer is usually done using cylinders.
I agree with the authors of [ext2fs] and most others that the significant file placement feature of FFS was not the actual cylinder boundaries, but the placing of files and their inodes on the basis of their parent directory's location. FFS used explicit knowledge of actual cylinder boundaries in its design. I find that minimizing the distance in logical blocks between semantically adjacent nodes, without tracking cylinder boundaries, accomplishes an excellent approximation of optimizing according to actual cylinder boundaries, and I find its simplicity an aid to implementation elegance.

Node Layout

When I place nodes of the tree on the disk, I search for the first empty block in the bitmap (of used block numbers), starting at the location of the left neighbor of the node in the tree ordering and moving in the direction I last moved in. This was experimentally found to be better, for the benchmarks employed, than the following alternatives: 1) taking the first non-zero entry in the bitmap, 2) taking the entry after the last one that was assigned, in the direction last moved in (this was 3% faster for writes and 10-20% slower for subsequent reads), 3) starting at the left neighbor and moving in the direction of the right neighbor. When changing block numbers for the purpose of avoiding overwriting sending nodes before shifted items reach disk in their new recipient node (see the description of preserve lists later in the paper), the benchmarks employed were ~10% faster when starting the search from the left neighbor rather than from the node's current block number, even though it adds significant overhead to determine the left neighbor (the current implementation risks I/O to read the parent of the left neighbor). It used to be that we would reverse direction when we reached the end of the disk drive.
Fortunately we checked whether it makes a difference which direction one moves in when allocating blocks to a file, and indeed we found it made a significant difference to always allocate in the increasing block number direction. We hypothesize that this is due to matching the disk's spin direction by allocating using increasing block numbers.

Ordering within the Tree

While I give here an example of how I have defined keys to optimize locality of reference and packing efficiency, I would like to stress that key definition is a powerful and flexible tool that I am far from finished experimenting with. Some key definition decisions depend very much on usage patterns, and this means that someday one will select from several key definitions when creating the file system. For example, consider the decision of whether to pack all directory entries together at the front of the file system, or to pack the entries near the files they name. For large file usage patterns one should pack all directory items together, since systems with such usage patterns are effective in caching the entries for all directories. For small files the name should be near the file. Similarly, for large files the stat data should be stored separately from the body, either with the other stat data from the same directory, or with the directory entry. (It was likely a mistake for me not to assign stat data its own key in the current implementation, as packing it in with direct and indirect items complicates our code for handling those items, and prevents me from easily experimenting with the effects of changing its key assignment.) It is not necessary for a file's packing to reflect its name; that is merely my default. With each file my next release will offer the option of overriding the default by use of a system call.
It is feasible to pack an object completely independently of its semantics using these algorithms, and I predict that there will be many applications, perhaps even most, for which a packing different from that determined by object names is more appropriate. Currently the mandatory tying of packing locality to semantics distorts both semantics and packing from what might otherwise be their independent optimums, much as tying block boundaries to file boundaries distorts I/O and space allocation algorithms from their separate optimums. For example, placing most files accessed while booting at the start of the disk, in their access order, is a very tempting future optimization that the use of packing localities makes feasible to consider.

The Structure of a Key

Each file item has a key with the structure <locality_id, object_id, offset, uniqueness>. The locality_id is by default the object_id of the parent directory. The object_id is the unique id of the file, and is set to the first unused objectid when the object is created. The tendency of this to result in successive object creations in a directory being adjacently packed is often fortuitous for many usage patterns. For files, the offset is the offset within the logical object of the first byte of the item. In version 0.2 all directory entries had their own individual keys stored with them and were each distinct items; in the current version I store one key in the item, which is the key of the first entry, and compute each entry's key as needed from that one stored key. For directories, the offset key component is the first four bytes of the filename, which you may think of as a lexicographic rather than numeric offset. For directory items, the uniqueness field differentiates filename entries identical in the first 4 bytes.
For all item types it indicates the item type and for the leftmost item in a buffer it indicates whether the preceding item in the tree is of the same type and object as this item. Placing this information in the key is useful when analyzing balancing conditions, but increases key length for non-directory items, and is a questionable architectural feature. Every file has a unique objectid, but this cannot be used for finding the object, only keys are used for that. Objectids merely ensure that keys are unique. If you never use the reiserfs features that change an object's key then it is immutable, otherwise it is mutable. (This feature aids support for NFS daemons, etc.) We spent quite some time debating internally whether the use of mutable keys for identifying an object had deleterious long term architectural consequences: in the end I decided it was acceptable iff we require any object recording a key to possess a method for updating its copy of it. This is the architectural price of avoiding caching a map of objectid to location that might have very poor locality of reference due to objectids not changing with object semantics. I pack an object with the packing locality of the directory it was first created in unless the key is explicitly changed. It remains packed there even if it is unlinked from the directory. I do not move it from the locality it was created in without an explicit request, unlike the [C-FFS] approach which stores all multiple link files together and pays the cost of moving them from their original locations when the second link occurs. I think a file linked with multiple directories might as well get at least the locality reference benefits of one of those directories. In summary, this approach 1) places files from the same directory together, 2) places directory entries from the same directory together with each other and with the stat data for the directory. 
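The key structure described above can be sketched as a C struct with a component-wise comparison. This is an illustrative sketch only: the field widths and types here are assumptions, not the actual reiserfs on-disk key format.

```c
#include <stdint.h>

/* Illustrative sketch of the four-part reiserfs key described above.
 * Field widths are assumptions for illustration, not the on-disk layout. */
struct key {
    uint32_t locality_id; /* object_id of the parent directory, by default   */
    uint32_t object_id;   /* unique id of the file                           */
    uint64_t offset;      /* byte offset in files; first 4 bytes of the
                           * filename in directories                         */
    uint32_t uniqueness;  /* item type; disambiguates names equal in the
                           * first 4 bytes                                   */
};

/* Keys order the tree: compare component by component, most significant
 * first, so all items of one packing locality are contiguous, and within
 * a locality all items of one object are contiguous. */
int key_cmp(const struct key *a, const struct key *b)
{
    if (a->locality_id != b->locality_id)
        return a->locality_id < b->locality_id ? -1 : 1;
    if (a->object_id != b->object_id)
        return a->object_id < b->object_id ? -1 : 1;
    if (a->offset != b->offset)
        return a->offset < b->offset ? -1 : 1;
    if (a->uniqueness != b->uniqueness)
        return a->uniqueness < b->uniqueness ? -1 : 1;
    return 0;
}
```

Because locality_id is the most significant component, two files created in the same directory sort next to each other regardless of their object_ids, which is exactly the packing behavior the text describes.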
Note that there is no interleaving of objects from different directories in the ordering at all, and that all directory entries from the same directory are contiguous. You'll note that this does not accomplish packing the files of small directories with common parents together, and does not employ the full partial ordering in determining the linear ordering; it merely uses parent directory information. I feel the proper place for employing full tree structure knowledge is in the implementation of an FS cleaner, not in the dynamic algorithms.

Node Balancing Optimizations

When balancing nodes I do so according to the following ordered priorities:

1. minimize the number of nodes used
2. minimize the number of nodes affected by the balancing operation
3. minimize the number of uncached nodes affected by the balancing operation
4. if shifting to another formatted node is necessary, maximize the bytes shifted

Priority 4 is based on the assumption that the location of an insertion of bytes into the tree is an indication of the likely location of future insertions, and that this policy will on average reduce the number of formatted nodes affected by future balance operations. There are more subtle effects as well: if one randomly places nodes next to each other, and one has a choice between those nodes being mostly moderately efficiently packed or packed to an extreme of either well or poorly packed, one is more likely to be able to combine more of the nodes if one chooses the policy of extremism. Extremism is a virtue in space-efficient node packing. The maximal shift policy is not applied to internal nodes, as extremism is not a virtue in time-efficient internal node balancing.

Drops (the difficult design issues in the current version that our next version can do better)

Consider dividing a file or directory into drops, with each drop having a separate key, and no two drops from one file or directory occupying the same node without being compressed into one drop.
The key for each drop is set to the key for the object (file or directory) plus the offset of the drop within the object. For directories the offset is lexicographic and by filename, for files it is numeric and in bytes. In the course of several file system versions we have experimented with and implemented solid, liquid, and air drops. Solid drops were never shifted, and drops would only solidify when they occupied the entirety of a formatted node. Liquid drops are shifted in such a way that any liquid drop which spans a node fully occupies the space in its node. Like a physical liquid it is shiftable but not compressible. Air drops merely meet the balancing condition of the tree. Reiserfs 0.2 implemented solid drops for all but the tail of files. If a file was at least one node in size it would align the start of the file with the start of a node, block aligning the file. This block alignment of the start of multi-drop files was a design error that wasted space: even if the locality of reference is so poor as to make one not want to read parts of semantically adjacent files, if the nodes are near to each other then the cost of reading an extra block is thoroughly dwarfed by the cost of the seek and rotation to reach the first node of the file. As a result the block alignment saves little in time, though it costs significant space for 4-20k files. Reiserfs with block alignment of multi-drop files and no indirect items experienced the following rather interesting behavior that was partially responsible for making it only 88% space efficient for files that averaged 13k (the linux kernel) in size. When the tail of a larger than 4k file was followed in the tree ordering by another file larger than 4k, since the drop before was solid and aligned, and the drop afterwards was solid and aligned, no matter what size the tail was, it occupied an entire node. 
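The internal fragmentation described above is easy to quantify. The sketch below is an illustration under assumed parameters (4096-byte nodes, a 13k file); it shows the space consumed when a multi-node file's tail is block aligned and therefore occupies an entire node by itself, and is not a reproduction of the paper's measured 88% figure, which averaged over a whole kernel tree.

```c
/* Sketch of the fragmentation cost of block-aligning tails.
 * NODE_SIZE is an assumption for illustration. */
#define NODE_SIZE 4096UL

/* Bytes of node space consumed by a file whose tail is block aligned:
 * full nodes for the body, plus one whole node for any tail. */
unsigned long aligned_space(unsigned long file_size)
{
    unsigned long nodes = file_size / NODE_SIZE;
    if (file_size % NODE_SIZE)
        nodes++;              /* the tail gets a node to itself */
    return nodes * NODE_SIZE;
}

/* Space efficiency in whole percent for one file of the given size. */
unsigned long efficiency_percent(unsigned long file_size)
{
    return file_size * 100 / aligned_space(file_size);
}
```

For a single 13k file this comes to four nodes (16k) holding 13k of data, i.e. roughly 81% efficiency for that file, which illustrates why packing tails together instead recovers significant space for files in the 4-20k range.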
In the current version we place all but the tail of large files into a level of the tree reserved for full unformatted nodes, and create indirect items in the formatted nodes which point to the unformatted nodes. This is known in the database literature as the [BLOB] approach. This extra level added to the tree comes at the cost of making the tree less balanced (I consider the unformatted nodes pointed to as part of the tree) and increasing the maximal depth of the tree by 1. For medium sized files, the use of indirect items increases the cost of caching pointers by mixing data with them. The reduction in fanout often causes the read algorithms to fetch only one node at a time of the file being read more frequently, as one waits to read the uncached indirect item before reading the node with the file data. There are more parents per file read with the use of indirect items than with internal nodes, as a direct result of reduced fanout due to mixing tails and indirect items in the node. The most serious flaw is that these reads of various nodes necessary to the reading of the file have additional rotations and seeks compared to the case with drops. With my initial drop approach they are usually sequential in their disk layout, even the tail, and the internal node parent points to all of them in such a way that all of them that are contained by that parent or another internal node in cache can be requested at once in one sequential read. Non-sequential reads of nodes are more than an order of magnitude more costly than sequential reads, and this single consideration dominates effective read optimization. Unformatted nodes make file system recovery faster and less robust, in that one reads their indirect item rather than them to insert them into the recovered tree, and one cannot read them to confirm that their contents are from the file that an indirect item says they are from. In this they make reiserfs similar to an inode based system without logging. 
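The indirect item just described can be sketched as an array of block numbers held in a formatted node and pointing at full unformatted nodes. The layout and names below are hypothetical illustrations, not the reiserfs on-disk item format.

```c
#include <stddef.h>

#define DEMO_NODE_SIZE 4096UL  /* assumed node size, for illustration */

/* Hypothetical sketch of an indirect item: a count and an array of
 * pointers (block numbers) to unformatted nodes holding the file body. */
struct indirect_item {
    size_t count;               /* how many unformatted nodes          */
    unsigned long blocknr[16];  /* block numbers of unformatted nodes  */
};

/* Map a byte offset in the file body to the block holding it.
 * Returns 0 when the offset is past the pointed-to nodes, i.e. it
 * falls in the tail, which is stored as a direct item elsewhere. */
unsigned long body_block(const struct indirect_item *it, unsigned long offset)
{
    size_t idx = offset / DEMO_NODE_SIZE;
    return idx < it->count ? it->blocknr[idx] : 0;
}
```

The point of the critique above is visible in this sketch: reading the body requires first fetching the formatted node holding the indirect item, then the unformatted nodes it points to, and nothing in the unformatted nodes themselves says which file they belong to, which is what makes recovery faster but less robust.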
A moderately better solution would have been to simply eliminate the requirement that the start of a multi-node file be placed at the start of a node, rather than introducing BLOBs, and to depend on a file system cleaner to optimally pack the 80% of files that don't move frequently, using algorithms that move even solid drops. Yet that still leaves the problem of formatted nodes not being efficient for mmap() purposes (one must copy them before writing rather than merely modifying their page table entries, and memory bandwidth is expensive even if CPU is cheap). For this reason I have the following plan for the next version. I will have three trees: one tree maps keys to unformatted nodes, one tree maps keys to formatted nodes, and one tree maps keys to directory entries and stat_data. Now it is only natural to think that this would mean that to read a file, accessing first the directory entry and stat_data, then the unformatted nodes, then the tail, one must hop long distances across the disk, going first to one tree and then the other. This is indeed why it took me two years to realize it could be made to work. My plan is to interleave the nodes of the three trees according to the following algorithm. Block numbers are assigned to nodes when the nodes are created or preserved, and someday will be assigned when the cleaner runs. The choice of block number is based on first determining what other node the new node should be placed near, and then finding the nearest free block in the elevator's current direction. Currently we use the node's left neighbor in the tree as the node it should be placed near. This is nice and simple. Oh well. Time to create a virtual neighbor layer. The new scheme will continue to first determine the node it should be placed near, and then start the search for an empty block from that spot, but it will use a more complicated determination of what node to place it near.
This method will cause all nodes from the same packing locality to be near each other, will cause all directory entries and stat_data to be grouped together within that packing locality, and will interleave formatted and unformatted nodes from the same packing locality. Pseudo-code is best for describing this:

    /* For use by reiserfs_get_new_blocknrs when determining where in the
     * bitmap to start the search for a free block, and for use by the
     * read-ahead algorithm when there are not enough nodes to the right
     * in the same packing locality for packing locality read-ahead. */
    get_logical_layout_left_neighbors_blocknr(key of current node)
    {
        /* Based on examination of the current node's key and type,
         * find the virtual neighbor of that node. */
        if body node
            if first body node of file
                if (node in tail tree with lesser key but same packing locality exists)
                    return blocknr of such node with largest key
                else
                    find node with largest key less than key of current node in stat_data tree
                    return its blocknr
            else
                return blocknr of node in body tree with largest key less than key of current node
        else if tail node
            if (node in body tree belonging to same file as first tail of current node exists)
                return its blocknr
            else if (node in tail tree with lesser delimiting key but same packing locality exists)
                return blocknr of such node with largest delimiting key
            else
                return blocknr of node with largest key less than key of current node in stat_data tree
        else /* is a stat_data tree node */
            if stat_data node with lesser key from same packing locality exists
                return blocknr of such node with largest key
            /* else no node from same packing locality with lesser key exists */
    }

    /* For use by packing locality read-ahead. */
    get_logical_layout_right_neighbors_blocknr(key of current node)
    {
        right-handed version of get_logical_layout_left_neighbors_blocknr logic
    }

It is my hope that this will improve caching of pointers to unformatted nodes, plus improving caching of directory entries and
stat_data, by separating them from file bodies to a greater extent. I also hope that it will improve read performance for 1-10k files, and that it will allow us to do this without decreasing space efficiency.

Code Complexity

I thought it appropriate to mention some of the notable effects of simple design decisions on our implementation's code length. When we changed our balancing algorithms to shift parts of items rather than only whole items, so as to pack nodes tighter, this had an impact on code complexity. Another multiplicative determinant of balancing code complexity was the number of item types: introducing indirect items doubled it, and changing directory items from liquid drops to air drops increased it further. Storing stat data in the first direct or indirect item of the file complicated the code for processing those items more than if I had made stat data its own item type. When one finds oneself with an NxN coding complexity issue, it usually indicates the need for adding a layer of abstraction. The NxN effect of the number of item types on balancing code complexity is an instance of that design principle, and we will address it in the next major rewrite. The balancing code will employ a set of item operations which all item types must support. The balancing code will then invoke those operations without understanding any more of the meaning of an item's type than that it determines which item-specific operation handler is called. Adding a new item type, say a compressed item, will then merely require writing a set of item operations for that item, rather than modifying most parts of the balancing code as it does now. We now feel that the function that determines what resources are needed to perform a balancing operation, fix_nodes(), might as well be written to decide what operations will be performed during balancing, since it pretty much has to do so anyway.
That way, the function that performs the balancing with the nodes locked, do_balance(), can be gutted of most of its complexity.

Buffering & the Preserve List

For version 0.2 of our file system we implemented a system of write ordering that tracked all shifting of items in the tree, and ensured that no node that had had an item shifted from it was written before the node that had received the item. This is necessary to prevent a system crash from causing the loss of an item that might not be recently created. This tracking approach worked, and the overhead it imposed was not measurable in our benchmarks. When in the next version we changed to partially shifting items and increased the number of item types, this code grew out of control in its complexity. I decided to replace it with a scheme that was far simpler to code and also more effective in typical usage patterns. The scheme is as follows: if an item is shifted from a node, change the block that its buffer will be written to. Change it to the nearest free block to the old block's left neighbor, and rather than freeing the old block, place its number on a "preserve list". (Saying "nearest" is slightly simplistic, in that the blocknr assignment function moves from the left neighbor in the direction of increasing block numbers.) When a "moment of consistency" is achieved, free all of the blocks on the preserve list. A moment of consistency occurs when there are no nodes in memory into which items have been shifted (this could be made more precise, but then it would be more complex). If disk space runs out, force a moment of consistency to occur. This is sufficient to ensure that the file system is recoverable. Note that during the large file benchmarks the preserve list was freed several times in the middle of the benchmark. The percentage of buffers preserved is small in practice except during deletes, and one can arrange for moments of consistency to occur as frequently as one wants.
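The preserve-list bookkeeping just described can be sketched as follows. This is a minimal illustration under assumed names and types; the callback stands in for the real bitmap update, and none of these identifiers are from the reiserfs source.

```c
#include <stdlib.h>

/* Hypothetical sketch of the preserve list: old block numbers of
 * nodes that had items shifted out of them, not yet freed. */
struct preserved {
    unsigned long blocknr;
    struct preserved *next;
};

static struct preserved *preserve_list = NULL;

/* Called when a node that had items shifted from it is rewritten to a
 * new block: its old block is preserved instead of freed. */
int preserve_block(unsigned long old_blocknr)
{
    struct preserved *p = malloc(sizeof(*p));
    if (!p)
        return -1;
    p->blocknr = old_blocknr;
    p->next = preserve_list;
    preserve_list = p;
    return 0;
}

/* Called at a moment of consistency: no node in memory has had items
 * shifted into it, so the old copies are no longer needed for recovery.
 * free_block() stands in for the real bitmap update.  Returns the
 * number of blocks freed. */
unsigned long flush_preserve_list(void (*free_block)(unsigned long))
{
    unsigned long freed = 0;
    while (preserve_list) {
        struct preserved *p = preserve_list;
        preserve_list = p->next;
        free_block(p->blocknr);
        free(p);
        freed++;
    }
    return freed;
}

/* Demo callback counting frees, for illustration only. */
static unsigned long demo_freed;
static void demo_free_block(unsigned long blocknr) { (void)blocknr; demo_freed++; }
```

The crash-safety argument is visible in the sketch: until the flush, both the old and new copies of a shifted node exist on disk, so a crash before the moment of consistency can still recover the shifted items from the old blocks.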
Note that I make no claim that this approach is better than the Soft Updates approach employed by [Ganger] or by us in version 0.2; I merely note that tracking the order of writes is more complex than this approach for balanced trees which partially shift items. We may go back to the old approach some day, though not to the code that I threw out. Preserve lists substantially hamper performance for files in the 1-10k size range. We are re-evaluating them. Ext2fs avoids the metadata shifting problem by never shrinking directories and by using fixed inode space allocations.

Lessons From Log Structured File Systems

Many techniques from other file systems haven't been applied, primarily so as to satisfy my goal of giving reiserfs 1.0 only the minimum feature set necessary to be useful, and will appear in later releases. Log Structured File Systems [Rosenblum and Ousterhout] embody several such techniques, which I will describe after I mention two concerns with that approach:

* With small object file systems it is not feasible to cache in RAM a map of objectid to location for every object, since there are too many objects. This is an inherent problem in using temporal rather than semantic packing for small object file systems. With my approach the internal nodes are the equivalent of this objectid-to-location map, but total internal node size is proportional to the number of nodes rather than the number of objects. You can think of internal nodes as a compression of object location information made effective by the existence of an ordering function; this compression is both essential for small files and a major feature of my approach.
* I like obtaining good though not ideal semantic locality without paying a cleaning cost for active data. This is a less critical concern.

I frequently find myself classifying packing and layout optimizations as either appropriate for implementing dynamically or appropriate only for a cleaner.
Optimizations whose computational overhead is large compared to their benefit tend to be appropriate for implementation in a cleaner, and a cleaner's benefits mostly impact the static portion of the file system (which typically consumes ~80% of the space). Such objectives as 100% packing efficiency, exactly ordering block layout by semantic order, using the full semantic tree rather than only the parent directory in determining semantic order, and compression are all best implemented by cleaner approaches. In summary, there is much to be learned from the LFS approach, and as I move past my initial objective of supplying a minimal-feature, higher-performance FS I will apply some of those lessons. In the Preserve Lists section I speculate on the possibilities for a fastboot implementation that would merge the better features of preserve lists and logging.

Directions For the Future

To go one more order of magnitude smaller in file size will require adding functionality to the file system API, though it will not require discarding upward compatibility. The use of an exokernel is a better approach to small files if it is an option available to the OS designer; it is not currently an option for Linux users. In the future reiserfs will add such features as lightweight files, in which stat_data other than size is inherited from a parent unless created individually for the file, an API for reading and writing files without the overhead of file handles and open(), set-theoretic semantics, and many other features that you would expect from researchers who expect to be able to do all that they could do in a database, in the file system, and never really did understand why not.

Conclusion

Balanced tree file systems are inherently more space efficient than block allocation based file systems, with the differences reaching order of magnitude levels for small files.
While other aspects of design will typically have a greater impact on performance for large files, the use of balanced trees offers performance advantages in direct proportion to the smallness of the file. A moderate advantage was found for large files. Coding cost is mostly in the interfaces, and it is a measure of the OS designer's skill whether those costs are low in the OS. We make it possible for an OS designer to use the same interface for large and small objects, and thereby reduce interface coding cost. This approach is a new tool available to the OS designer for increasing the expressive power of all of the components in the OS through better name space integration. Researchers interested in collaborating or just using my work will find me friendly. I tailor the framework of my collaborations to the needs of those I work with. I GPL reiserfs so as to meet the needs of academic collaborators. While that makes it unusable without a special license for commercial OSes, commercial vendors will find me friendly in setting up a framework for commercial collaboration with commercial needs provided for.

Acknowledgments

Hans Reiser was the project initiator, primary architect, supplier of funding, and one of the programmers. Some folks at times remark that naming the filesystem Reiserfs was egotistic. It was so named after a potential investor hired all of my employees away from me, then tried to negotiate better terms for his possible investment, and suggested that he could arrange for 100 researchers to swear in Russian court that I had had nothing to do with this project. That business partnership did not work out. Vladimir Saveljev, while he did not author this paper, worked long hours writing the largest fraction of the lines of code in the file system, and is remarkably gifted at just making things work. Thanks, Vladimir. Anatoly Pinchuk wrote much of the core balancing code, and too much of the rest to list here. Thanks, Anatoly.
It is the policy of the Naming System Venture that if someone quits before project completion, and then takes strong steps to try to prevent others from finishing the project, they shall not be mentioned in the acknowledgements. This was all quite sad, and is best forgotten. I would like to thank Alfred Ajlamazyan for his generosity in providing overhead at a time when his institute had little it could easily spare. Grigory Zaigralin is thanked for making the machines run, administering the money, and being his usual determined-to-be-useful self. Igor Chudov, thanks for such effective procurement and hardware maintenance work. Eirik Fuller is thanked for his help with NFS and porting to 2.1. I would like to thank Remi Card for the superb block allocation based file system (ext2fs) that I depended on for so many years, and that allowed me to benchmark against the best. Linus Torvalds, thank you for Linux.

Business Model and Licensing

I personally favor performing a balance of commercial and public works in my life. I have no axe to grind against software that is charged for, and no regrets at making reiserfs freely available to Linux users. This project is GPL'd, but I sell exceptions to the GPL to commercial OS vendors and file server vendors. It is not usable to them without such exceptions, and many of them are wise enough to understand that:

* the porting and integration service we are able to provide with the licensing is by itself worth what we charge,
* these services impact their time to market,
* and the relationship spreads the development costs across more OS vendors than just them alone.

I expect that Linux will prove to be quite effective in market sampling my intended market, but if you suspect that I also like seeing more people use it even if it is free to them, oh well. I believe it is not so much the cost that has made Linux so successful as it is the openness.
Linux is a decentralized economy with honor and recognition as the currency of payment (and thus there is much honor in it). Commercial OS vendors are, at the moment, all closed economies, and doomed to fall in their competition with open economies just as communism eventually fell. At some point an OS vendor will realize that if it:

* opens up its source code to decentralized modification,
* systematically rewards those who perform the modifications that prove useful,
* systematically merges/integrates those modifications into its branded primary release branch while adding value as the integrator,

then it will acquire both the critical mass of the internet development community and the aggressive edge that no large communal group (such as a corporation) can have. Rather than saying to any such vendor that they should do this now, let me simply point out that whoever is first will have an enormous advantage. Since I have more recognition than money to pass around as reward, my policy is to tend to require that those who contribute substantial software to this project have their names attached to a user-visible portion of the project. This official policy helps me deal with folks like Vladimir, who was much too modest to ever name the file system checker vsck without my insisting. Smaller contributions are noted in the source code and in the acknowledgements section of this paper. If you choose to contribute to this file system, and your work is accepted into the primary release, you should let me know if you want me to look for opportunities to integrate you into contracts from commercial vendors. Through packaging ourselves as a group, we are more marketable to such OS vendors. Many of us have spent too much time working at day jobs unrelated to our Linux work. This is too hard, and I hope to make things easier for us all.
If you like this business model of selling GPL'd component software with related support services, but you write software not related to this file system, I encourage you to form a component supplier company also. Opportunities may arise for us to cooperate in our marketing, and I will be happy to do so. References G.M. Adel'son-Vel'skii and E.M. Landis, An algorithm for the organization of information, Soviet Math. Doklady 3, 1259-1262, 1972, This paper on AVL trees can be thought of as the founding paper of the field of storing data in trees. Those not conversant in Russian will want to read the [Lewis and Denenberg] treatment of AVL trees in its place. [Wood] contains a modern treatment of trees. [Apple] Inside Macintosh, Files, by Apple Computer Inc., Addison-Wesley, 1992. Employs balanced trees for filenames, it was an interesting file system architecture for its time in a number of ways, now its problems with internal fragmentation have become more severe as disk drives have grown larger, and the code has not received sufficient further development. [Bach] Maurice J. Bach, ``The Design of the Unix Operating System'', 1986, Prentice-Hall Software Series, Englewood Cliffs, NJ, superbly written but sadly dated, contains detailed descriptions of the file system routines and interfaces in a manner especially useful for those trying to implement a Unix compatible file system. See [Vahalia]. [BLOB] R. Haskin, Raymond A. Lorie: On Extending the Functions of a Relational Database System. SIGMOD Conference (body of paper not on web) 1982: 207-212, See Drops section for a discussion of how this approach makes the tree less balanced, and the effect that has on performance. [Chen] Chen, P.M. Patterson, David A., A New Approach to I/O Performance Evaluation---Self-Scaling I/O Benchmarks, Predicted I/O Performance, 1993 ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems, also available on Chen's web page. [C-FFS] Ganger, Gregory R., Kaashoek, M. 
Frans, page with link to postscript paper A very well written paper focused on 1-10k file size issues, they use some similar notions (most especially their concept of grouping compared to my packing localities). Note that they focus on the 1-10k file size range, and not the sub-1k range. The 1-10k range is the weakpoint in reiserfs performance. [ext2fs] by Remi Card extensive information, source code is available When you consider how small this file system is (~6000 lines), its effectiveness becomes all the more remarkable. [FFS] M.K. McKusick, W.N. Joy, S.J. Leffler, and R.S. Fabry. A fast file system for UNIX. ACM Transactions on Computer Systems, 2(3):181--197, August 1984 describes the implementation of a file system which employs parent directory location knowledge in determining file layout. It uses large blocks for all but the tail of files to improve I/O performance, and uses small blocks called fragments for the tails so as to reduce the cost due to internal fragmentation. Numerous other improvements are also made to what was once the state-of-the-art. FFS remains the architectural foundation for many current block allocation file systems, and was later bundled with the standard Unix releases. Note that unrequested serialization and the use of fragments places it at a performance disadvantage to ext2fs, though whether ext2fs is thereby made less reliable is a matter of dispute that I take no position on (reiserfs uses preserve lists, forgive my egotism in thinking that it is enough work for me to ensure that reiserfs solves the recovery problem, and to perhaps suggest that ext2fs would benefit from the use of preserve lists when shrinking directories) [Ganger] Gregory R. Ganger, Yale N. Patt, ``Metadata Update Performance in File Systems'' abstract only [Gifford], postscript only Describes a file system enriched to have more than hierarchical semantics, he shares many goals with this author, forgive me for thinking his work worthwhile. 
If I had to suggest one improvement in a sentence, I would say his semantic algebra needs closure. [Hitz, Dave]http://www.netapp.com/technology/level3/3002.html A rather well designed file system optimized for NFS and RAID in combination. Note that RAID increases the merits of write-optimization in block layout algorithms. [Holton and Das] , Holton, Mike., Das, Raj., ``The XFS space manager and namespace manager use sophisticated B-Tree indexing technology to represent file location information contained inside directory files and to represent the structure of the files themselves (location of information in a file).'' Note that it is still a block (extent) allocation based file system, no attempt is made to store the actual file contents in the tree. It is targeted at the needs of the other end of the file size usage spectrum from reiserfs, and is an excellent design for that purpose (and I would concede that reiserfs 1.0 is not suitable for their real-time large I/O market.) SGI has also traditionally been a leader in resisting the use of unrequested serialization of I/O. Unfortunately, the paper is a bit vague on details, and source code is not freely available. [Howard] ``Scale and Performance in a Distributed File System'', Howard, J.H., Kazar, M.L., Menees, S.G., Nichols, D.A., Satayanarayanan, N., Sidebotham, R.N., West, M.J., ACM Transactions on Computer Systems, 6(1), February 1988 A classic benchmark, it was too CPU bound for both ext2fs and reiserfs. [Knuth] Knuth, D.E., The Art of Computer Programming, Vol. 3 (Sorting and Searching), Addison-Wesley, Reading, MA, 1973, the earliest reference discussing trees storing records of varying length. [LADDIS] Wittle, Mark., and Bruce, Keith., ``LADDIS: The Next Generation in NFS File Server Benchmarking'', Proceedings of the Summer 1993 USENIX Conference.'', July 1993, pp. 
111-128 [Lewis and Denenberg] Lewis, Harry R., Denenberg, Larry ``Data Structures & Their Algorithms'', HarperCollins Publishers, NY, NY, 1991, an algorithms textbook suitable for readers wishing to learn about balanced trees and their AVL predecessors. [McCreight] McCreight, E.M., Pagination of B*-trees with variable length records, Commun. ACM 20 (9), 670-674, 1977, describes algorithms for trees with variable length records. [McVoy and Kleiman], the implementation of write-clustering for Sun's UFS. [OLE] ``Inside OLE'' by Kraig Brockshmidt, discusses Structured Storage, HREF="http://www.microsoft.com/mspress/books/abs/5-843-2b.htm" abstract only [Ousterhout] J.K. Ousterhout, H. Da Costa, D. Harrison, J.A. Kunze, M.D. Kupfer, and J.G. Thompson. A trace-driven analysis of the UNIX 4.2BSD file system. In Proceedings of the 10th Symposium on Operating Systems Principles, pages 15--24, Orcas Island, WA, December 1985. [NTFS] ``Inside the Windows NT File System'' the book is written by Helen Custer, NTFS is architected by Tom Miller with contributions by Gary Kimura, Brian Andrew, and David Goebel, Microsoft Press, 1994, an easy to read little book, they fundamentally disagree with me on adding serialization of I/O not requested by the application programmer, and I note that the performance penalty they pay for their decision is high, especially compared with ext2fs. Their FS design is perhaps optimal for floppies and other hardware eject media beyond OS control. A less serialized higher performance log structured architecture is described in [Rosenblum and Ousterhout]. That said, Microsoft is to be commended for recognizing the importance of attempting to optimize for small files, and leading the OS designer effort to integrate small objects into the file name space. This book is notable for not referencing the work of persons not working for Microsoft, or providing any form of proper attribution to previous authors such as [Rosenblum and Ousterhout]. [Peacock] K. 
Peacock, ``The CounterPoint Fast File System'', Proceedings of the Usenix Conference, Winter 1988. [Pike] Rob Pike and Peter Weinberger, The Hideous Name, USENIX Summer 1985 Conference Proceedings, pp. 563, Portland, Oregon, 1985. Short, informal, and drives home why inconsistent naming schemes in an OS are detrimental. http://achille.cs.bell-labs.com/cm/cs/doc/85/1-05.ps.gz His discussion of naming in Plan 9: http://plan9.bell-labs.com/plan9/doc/names.html [Rosenblum and Ousterhout] ``The Design and Implementation of a Log-Structured File System'', Mendel Rosenblum and John K. Ousterhout, ACM Transactions on Computer Systems, February 1992. This paper was quite influential in a number of ways on many modern file systems, and the notion of using a cleaner may be applied to a future release of reiserfs. There is an interesting on-going debate over the relative merits of FFS vs. LFS architectures, and the interested reader may peruse http://www.scriptics.com/people/john.ousterhout/seltzer93.html and the arguments by Margo Seltzer it links to. [Snyder] ``tmpfs: A Virtual Memory File System'' discusses a file system built to use swap space and intended for temporary files; due to a complete lack of disk synchronization it offers extremely high performance. [Vahalia] Uresh Vahalia, ``Unix Kernel Internals''. [[category:ReiserFS]] [[category:Formatting-fixes-needed]]

Three reasons why ReiserFS is great for you: ReiserFS has fast journaling, which means that you don't spend your life waiting for fsck every time your laptop battery dies, or the UPS for your mission critical server gets its batteries disconnected accidentally by the UPS company's service crew, or your kernel was not as ready for prime time as you hoped, or the silly thing decides you mounted it too many times today. ReiserFS is based on fast balanced trees.
Balanced trees are more robust in their performance, and are a more sophisticated algorithmic foundation for a file system. When we started our project, there was a consensus in the industry that balanced trees were too slow for file system usage patterns. We proved that if you just do them right they are better--take a look at the benchmarks. We have fewer worst case performance scenarios than other file systems and generally better overall performance. If you put 100,000 files in one directory, we think it's fine; many other file systems try to tell you that you are wrong to want to do it. ReiserFS is more space efficient. If you write 100 byte files, we pack many of them into one block. Other file systems put each of them into their own block. We don't have fixed space allocation for inodes. That saves 6% of your disk. Ok, it's time to fess up. The interesting stuff is still in the future. Because they are nifty, we are going to add database and hypertext like features into the file system. Only by using balanced trees, with their effective handling of small files (database small fields, hypertext keywords), as our technical foundation can we hope to do this. That was our real motivation. As for performance, we may already be slightly better than the traditional file systems (and substantially better than the journaling ones). But they have been tweaking for decades, while we have just got started. This means that over the next few years we are going to improve faster than they are. Speaking more technically: ReiserFS is a file system using a plug-in based object oriented variant on classical balanced tree algorithms. The results when compared to the ext2fs conventional block allocation based file system, running under the same operating system and employing the same buffering code, suggest that these algorithms are overall more efficient and every passing month are becoming yet more so.
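The space-efficiency claim above is easy to quantify. A minimal sketch (illustrative numbers only; it ignores the per-item headers and balancing overhead of the real on-disk format) compares block-aligned allocation with tree-node packing for 1,000 files of 100 bytes each:

```python
BLOCK_SIZE = 4096   # assumed FS block size
FILE_SIZE = 100
N_FILES = 1000

# Block-aligned allocation: each 100-byte file occupies a full block.
aligned_blocks = N_FILES

# Packed into tree nodes: small files share blocks
# (per-item header overhead ignored for simplicity).
packed_blocks = -(-N_FILES * FILE_SIZE // BLOCK_SIZE)  # ceiling division

print(aligned_blocks, packed_blocks)  # 1000 vs 25
```

The two-orders-of-magnitude gap in blocks consumed is the point; real savings are smaller once headers and partially filled nodes are accounted for.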
Loosely speaking, every month we find another performance cranny that needs work; we fix it. And every month we find some way of improving our overall general usage performance. The improvement in small file space and time performance suggests that we may now revisit a common OS design assumption that one should aggregate small objects using layers above the file system layer. Being more effective at small files does not make us less effective for other files. This is truly a general purpose FS. Our overall traditional FS usage performance is high enough to establish that. ReiserFS has a commitment to opening up the FS design to contributions; we are now adding plug-ins so that you can create your own types of directories and files.

Table of Contents:

# Introduction
# Why Is There A Move Among Some OS Designers Towards Unifying Name Spaces?
# Should File Boundaries Be Block Aligned?
# Balanced Trees and Large File I/O
# Serialization and Consistency
# Why Aggregate Small Objects at the File System Level?
# Tree Definitions
# Using the Tree to Optimize Layout of Files
## Physical Layout
## Node Layout
## Ordering within the Tree
## Node Balancing Optimizations
### Drops (the difficult design issues in the current version that our next version can do better)
### Code Complexity
# Buffering & the Preserve List
# Lessons From Log Structured File Systems
# Directions For the Future
# Conclusion
# Acknowledgments
# Business Model and Licensing
# References

= Introduction =

The author is one of many OS researchers who are attempting to unify the name spaces in the operating system in varying ways [e.g. Pike ``The Use of Name Spaces in Plan9'']. None of us are well funded compared with the size of the task, and I am far from an exception to this rule. The natural consequence is that we each have attacked one small aspect of the task. My contribution is in incorporating small objects into the file system name space effectively.
This implementation offers value to the average Linux user, in that it offers generally good performance compared to the current Linux file system known as ext2fs. It also saves space to an extent that is important for some applications, and convenient for most. It does extremely well for large directories, and has a variety of minor advantages. Since ext2fs is very similar to FFS and UFS in performance, the implementation also offers potential value to commercial OS vendors who desire greater than ext2fs performance without directory size issues, and who appreciate the value of a better foundation for integrating name spaces throughout the OS.

= Why Is There A Move Among Some OS Designers Towards Unifying Name Spaces? =

An operating system is composed of components that access other components through interfaces. Operating systems are complex enough that, like national economies, the architect cannot centrally plan the interactions of the components that it is composed of. The architect can provide a structural framework that has a marked impact on the efficiency and utility of those interactions. Economists have developed principles that govern large economic systems. Are there system principles that we might use to start a discussion of the ways increasing component interactivity via naming system design impacts the total utility of an operating system? I propose these:

* If one increases the number of other components that a particular component can interact with, one increases its expressive power and thereby its utility.
* One can increase the number of other components that a particular component can interact with either by increasing the number of interfaces it has, or by increasing the number of components that are accessible by its current interfaces.
* The cost of component interfaces dominates software design cost, much as the cost of wires dominates circuit design cost.
* Total system utility tends to be proportional not to the number of components, but to the number of possible component interactions.

It is not simply the number of components that one has that determines an OS's expressive power, it is the number of opportunities to use them that determines it. The number of these opportunities is proportional to the number of possible combinations of them, and the number of possible combinations of them is determined by their connectedness. Component connectedness in OS design is determined by name space design, to much the same extent that buses determine it in circuit design. Allow me to illustrate the impact of these principles with an imaginary example. Suppose two imaginary OS vendors with equally talented programmers hire two different OS architects. Suppose one of the architects centers the OS design around a single name space design that allows all of the components to access all other components via a single interface (assume this is possible; it is a theoretical example). Suppose the other allows the ten different design groups in the company that are developing components to create their own ten name spaces. Suppose that the unified name space OS architect has half of the resources of the fragmented name space OS architect and creates half as many components. While the number of components is half as large, the number of connections is (1/2)^2 / ((1/10)^2 * 10) = 2.5 times larger. If you accept my hypothesis that utility is more proportional to connections than components, then the unified operating system with half the development cost will still offer more expressive utility. That is a powerful motivation.
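The arithmetic behind the 2.5x figure can be checked directly. A small sketch, assuming for illustration that the fragmented-namespace vendor builds 1,000 components and that utility-bearing connections scale with the square of the number of components sharing a name space:

```python
n = 1000  # components the fragmented-namespace vendor builds (illustrative)

# Unified design: half as many components, all mutually reachable.
unified_connections = (n // 2) ** 2

# Fragmented design: ten name spaces of n/10 components each;
# components only interact within their own name space.
fragmented_connections = 10 * (n // 10) ** 2

print(unified_connections / fragmented_connections)  # 2.5
```

The ratio is independent of n, which is why the argument works for any total component count.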
To return briefly to the long ago researched principles governing another member of the class of large systems, the economies of nations, it is perhaps interesting to note that Adam Smith in ``The Wealth of Nations'' engaged in substantial study of the link between the extent of interconnectedness and the development of civilization, where the extent of interconnectedness was determined by waterways, etc. The link he found for economic systems was no less crucial than what is being suggested here for the effect of component interconnectedness on the total utility of software systems. I suggest that I am merely generalizing a long established principle from another field of science, namely that total utility in large systems with components that interact to generate utility is determined by the extent of their interconnection. There are many exceptions to these principles: not all chips on a motherboard sit on the bus, and analogous considerations apply to both OS design and the economies of nations. I hope the reader will accept that space considerations make it appropriate to gloss over these, and will consider the central point that under some circumstances unifying name spaces in a design can dramatically improve the utility of an OS. That can be an enormous motivation, and it has moved a number of OS researchers in their work [e.g. Pike ``The Use of Name Spaces in Plan9'' and ``The Hideous Name'', http://magnum.cooper.edu:9000/~rp/html/rob.html]. Unfortunately, it is not a small technical effort to combine name spaces. To combine 10 name spaces requires, if not the effort to create 10 name spaces, perhaps an effort equivalent to creating 5 of the name spaces. Usually each of the name spaces has particular performance and semantic power requirements that require enhancing the unified name space, and it usually requires technical innovation to combine the advantages of each of the separated name spaces into a unified name space.
I would characterize none of the research groups currently approaching this unification problem as having funding equivalent to what went into creating 5 of the name spaces they would like to unify, and we are certainly no exception. For this reason I have picked one particular aspect of this larger problem for our focus: allowing small objects to effectively share the same file system interface that large objects use currently. As operating systems increase the number of their components, the higher development cost of a file system able to handle small files becomes more worth the multiplicative effect it has on OS utility, as well as its reduction of OS component interface cost.

= Should File Boundaries Be Block Aligned? =

Making file boundaries block aligned has a number of effects: it minimizes the number of blocks a file is spread across (which is especially beneficial for multiple block files when locality of reference across files is poor), it wastes disk and buffer space in storing every less than fully packed block, it wastes I/O bandwidth with every access to a less than fully packed block when locality of reference is present, it increases the average number of block fetches required to access every file in a directory, and it results in simpler code. The simpler code of block aligning file systems follows from not needing to create a layering to distinguish the units of the disk controller and buffering algorithms from the units of space allocation, and from not needing to optimize the packing of nodes as is done in balanced tree algorithms. For readers who have not been involved in balanced tree implementations, algorithms of this class are notorious for being much more work to implement than one would expect from their description. Sadly, they also appear to offer the highest performance solution for small files, once I remove certain simplifications from their implementation and add certain optimizations common to file system designs.
I regret that code complexity (30k lines) is a major disadvantage of the approach compared to the 6k lines of the ext2fs approach. I started our analysis of the problem with an assumption that I needed to aggregate small files in some way, and that the question was, which solution was optimal? The simplest solution was to aggregate all small files in a directory together into either a file or the directory. But any aggregation into a file or directory wastes part of the last block in the aggregation. What does one do if there are only a few small files in a directory, aggregate them into the parent of the directory? What if there are only a few small files in a directory at first, and then there are many? How do I decide what level to aggregate them at, and when to take them back from a parent of a directory and store them directly in the directory? As we did our analysis of these questions we realized that this problem was closely related to the balancing of nodes in a balanced tree. The balanced tree approach, by using an ordering of files which are then dynamically aggregated into nodes at a lower level, rather than a static aggregation or grouping, avoids this set of questions. In my approach I store both files and filenames in a balanced tree, with small files, directory entries, inodes, and the tail ends of large files all being more efficiently packed as a result of relaxing the requirements of block alignment, and eliminating the use of a fixed space allocation for inodes. I have a sophisticated and flexible means for arranging for the aggregation of files for maximal locality of reference, through defining the ordering of items in the tree. The body of large files is stored in unformatted nodes that are attached to the tree but isolated from the effects of possible shifting by the balancing algorithms. Approaches such as [Apple] and [Holton and Das] have stored filenames but not files in balanced trees.
None of the file systems C-FFS, NTFS, or XFS aggregate files; all of them block align files, though all of those also do some variation on storing small files in the statically allocated block address fields of inodes if they are small enough to fit there. [C-FFS] has published an excellent discussion of both their approach and why small files rob a conventional file system of performance more in proportion to the number of small files than the number of bytes consumed by small files. However, I must note that their notion of what constitutes small is different from ours by one or two orders of magnitude. Their use of an exo-kernel is simply an excellent approach for operating systems that have that as an available option. Semantics (files), packing (blocks/nodes), caching (read ahead sizes, etc.), and the hardware interfaces of disk (sectors) and paging (pages) all have different granularity issues associated with them: a central point of our approach is that the optimal granularity of these often differs, and abstracting these into separate layers in which the granularity of one layer does not unintentionally impact other layers can improve space/time performance. Reiserfs innovates in that its semantic layer often conveys to the other layers an ungranulated ordering rather than one granulated by file boundaries. The reader is encouraged to note the areas in which reiserfs needs to go farther in doing so while reading the algorithms.

= Balanced Trees and Large File I/O =

There has long been an odd informal consensus that balanced trees are too slow for use in storing large files, perhaps originating in the performance of databases that have attempted to emulate file systems using balanced tree algorithms that were not originally architected for file system access patterns or their looser serialization requirements.
It is hopefully easy for the reader to understand that storing many small files and tail ends of files in a single node where they can all be fetched in one I/O leads directly to higher performance. Unfortunately, it is quite complex to understand the interplay between I/O efficiency and block size for larger files, and space does not allow a systematic review of traditional approaches. The reader is referred to [FFS], [Peacock], [McVoy], [Holton and Das], [Bach], [OLE], and [NTFS] for treatments of the topic, and discussions of various means of 1) reducing the effect of block size on CPU efficiency, 2) eliminating the need for inserting rotational delay between successive blocks, 3) placing small files into either inodes or directories, and 4) performing read-ahead. More commentary on these is in the annotated bibliography. Reiserfs has the following architectural weaknesses that stem directly from the overhead of repacking to save space and increase block size: 1) when the tail (files < 4k are all tail) of a file grows large enough to occupy an entire node by itself it is removed from the formatted node(s) it resides in, and it is converted into an unformatted node ([FFS] pays a similar conversion cost for fragments), 2) a tail that is smaller than one node may be spread across two nodes which requires more I/O to read if locality of reference is poor, 3) aggregating multiple tails into one node introduces separation of file body from tail, which reduces read performance ([FFS] has a similar problem, and for reiserfs files near the node in size the effect can be significant), 4) when you add one byte to a file or tail that is not the last item in a formatted node, then on average half of the whole node is shifted in memory. If any of your applications perform I/O in such a way that they generate many small unbuffered writes, reiserfs will make you pay a higher price for not being able to buffer the I/O. 
Most applications that create substantial file system load employ effective I/O buffering, often simply as a result of using the I/O functions in the standard C libraries. By avoiding accesses in small blocks/extents reiserfs improves I/O efficiency. Extent based file systems such as VxFS, and write-clustering systems such as ext2fs, are not so effective in applying these techniques that they choose to use 512-byte blocks rather than 1k blocks as their defaults. Ext2fs reports a 20% speedup when 4k rather than 1k blocks are used, but the authors of ext2fs advise the use of 1k blocks to avoid wasting space. There are a number of worthwhile large file optimizations that have not been added to either ext2fs or reiserfs, and both file systems are somewhat primitive in this regard, reiserfs being the more primitive of the two. Large files simply were not my research focus, and it being a small research project I did not implement the many well known techniques for enhancing large file I/O. The buffering algorithms are probably more crucial than any other component in large file I/O, and partly out of a desire for a fair comparison of the approaches I have not modified these. I have added no significant optimizations for large files, beyond increasing the block size, that are not found in ext2fs. Except for the size of the blocks, there is not a large inherent difference between: 1) the cost of adding a pointer to an unformatted node to my tree plus writing the node, and 2) adding an address field to an inode plus writing the block. It is likely that except for block size the primary determinants of high performance large file access are orthogonal to the decision of whether to use balanced tree algorithms for small and medium sized files. For large files we get some advantage from not having our tree being more balanced than the tree formed by an inode which points to a triple indirect block. We haven't an easy method for measuring the performance gain from that though. 
There is performance overhead due to the memory bandwidth cost of balancing nodes for small files. We think it is worth it though.

= Serialization and Consistency =

The issues of ensuring recoverability with minimal serialization and data displacement necessarily dominate high performance design. Let's define the two extremes in serialization so that the reason for this can be clear. Consider the relative speed of a set of I/O's in which every block request in the set is fed to the elevator algorithms of the kernel and the disk drive firmware fully serially, each request awaiting the completion of the previous request. Now consider the other extreme, in which all block requests are fed to the elevator algorithms all together, so that they may all be sorted and performed in close to their sorted order (disk drive firmwares don't use a pure elevator algorithm). The unserialized extreme may be more than an order of magnitude faster, due to the cost of rotations and seeks. Unnecessarily serializing I/O prevents the elevator algorithm from doing its job of placing all of the I/O's in their layout sequence rather than chronological sequence. Most of high performance design centers around making I/O's in the order they are laid out on disk, and laying out blocks on disk in the order that the I/O's will want to be issued. [Snyder] discusses a file system that obtains high performance from a complete lack of disk synchronization, but is only suitable for temporary files that don't need to survive reboot. I think its known value to Solaris users indicates that the optimal buffering policy varies from file to file. Ganger discusses methods for using ordering of writes rather than serialization for ensuring conventional file system meta-data integrity; [McVoy] previously suggested but did not implement ordering of buffer writes.
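The gap between the two extremes can be illustrated with a toy seek model (the block numbers and the linear head-travel cost are illustrative assumptions, not reiserfs measurements; real drives also pay rotational latency, which the elevator likewise helps amortize):

```python
def head_travel(start, blocks):
    """Total seek distance when servicing requests in the given order."""
    pos, dist = start, 0
    for b in blocks:
        dist += abs(b - pos)
        pos = b
    return dist

requests = [7200, 150, 6400, 300, 5900, 512]  # chronological arrival order

serial = head_travel(0, requests)            # fully serialized: one at a time
batched = head_travel(0, sorted(requests))   # elevator: one sweep in layout order

print(serial, batched)  # 37588 vs 7200
```

Handing the whole batch to the elevator turns six scattered seeks into a single monotonic sweep, which is the effect the text describes.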
Ext2fs is fast in substantial part due to avoiding synchronous writes of metadata, and I have much personal experience with it that leads me to prefer compiles that are fast. [ I would like to see it adopt a policy that all dirty buffers for files not flagged as temporary are queued for writing, and that the existence of a dirty buffer means that the disk is busy. This will require replacing buffer I/O locking with copy-on-write, but an idle disk is such a terrible thing to waste. :-) ] [NTFS] by default adds unnecessary serialization to an extent that even older file systems such as [FFS] do not, and its performance characteristics reflect that. In fairness, it should be said that it is the superior approach for most removable media without software control of ejection (e.g. IBM PC floppies). Reiserfs employs a new scheme called preserve lists for ensuring recoverability, which avoids overwriting old meta-data by writing the meta-data nearby rather than over old meta-data.

= Why Aggregate Small Objects at the File System Level? =

There has long been a tradition of file system developers deciding that effective handling of small files is not significant to performance, and of application programmers caring enough about performance for small files to not store them as separate entities in the file system. To store small objects one may either make the file system efficient for the task, or sidestep the problem by aggregating small objects in a layer above the file system. Sidestepping the problem has three disadvantages: utility, code complexity, and performance. Utility and Code Complexity: Allowing OS designers to effectively use a single namespace with a single interface for both large and small objects decreases coding cost and increases expressive power of components throughout the OS.
I feel reiserfs shows the effects of a larger development investment focused on a simpler interface when compared with many solutions for this currently available in the object oriented toolkit community, such as the Structured Storage available in Microsoft's [OLE]. By simpler I mean I added nothing to the file system API to distinguish large and small objects, and I leave it to the directory semantics and archiving programs to aggregate objects. Multiple layers cost more to implement, cost more to code the interfaces for utilizing, and provide less flexibility. Performance: It is most commonly the case that when one layers one file system on top of another the performance is substantially reduced, and Structured Storage is not an exception to this general rule. Reiserfs, which does not attempt to delegate the small object problem to a layer above, avoids this performance loss. I have heard it suggested by some that this layering avoids the performance loss from syncing on file close as many file systems do. I suggest that this is adding an error to an error rather than fixing it. Let me make clear that I believe those who write such layers above the file system do not do so out of stupidity. I know of at least one company at which a solution that layers small object storage above the file system exists because the file system developers refused to listen to the non-file system group's description of its needs, and the file system group had to be sidestepped in generating the solution. Current file systems are fairly well designed for the purposes that their users currently use them for: my goal is to change file size usage patterns. The author remembers arguments that once showed clearly that there was no substantial market need for disk drives larger than 10MB based on current usage statistics. 
While [C-FFS] points out that 80% of file accesses are to files below 10k, I do not believe it reasonable to attempt to provide statistics based on usage measurements of file systems for which small files are inappropriate to use that will show that small files are frequently used. Application programmers are smarter than that. Currently 80% of file accesses are to the first order of magnitude in file size for which it is currently sensible to store the object in the file system. I regret that one can only speculate as to whether once file systems become effective for small files and database tasks, usage patterns will change to 80% of file accesses being to files less than 100 bytes. What I can do is show via the 80/20 Banded File Set Benchmark presented later that in such circumstances small file performance potentially dominates total system performance. In summary, the on-going reinvention of incompatible object aggregation techniques above the file system layer is expensive, less expressive, less integrated, slower, and less efficient in its storage than incorporating balanced tree algorithms into the file system.

= Tree Definitions =

Balanced trees are used in databases, and more generally, wherever a programmer needs to search and store to non-random memory by a key, and has the time to code it this way. The usual evolution for programmers is to first think that hashing will be simpler and more efficient, and then to realize, only after getting into the sordid details, that the combination of space efficiency, minimizing disk accesses, and the feasibility of caching the top part of the tree makes the tree approach more effective. One typically starts out trying to do hashing, and by the time the details are worked out, has a balanced tree. The cost of effectively handling bucket overflow just isn't less than the cost of balancing, unless the buckets are always all in RAM.
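The value of caching the top part of the tree is easy to estimate. A sketch, assuming an illustrative fanout of roughly 170 delimiting keys per 4k internal node and a cache pinning the top two levels (the numbers are assumptions for the calculation, not reiserfs constants):

```python
import math

def tree_height(n_items, fanout):
    """Levels needed to index n_items leaves with the given internal fanout."""
    return max(1, math.ceil(math.log(n_items, fanout)))

n_items, fanout = 10_000_000, 170   # illustrative file system size and node fanout
height = tree_height(n_items, fanout)
cached_levels = 2                   # root and second level held in RAM
disk_accesses = max(0, height - cached_levels)

print(height, disk_accesses)  # 4 levels, 2 disk accesses per uncached lookup
```

With the top of a high-fanout tree cached, a lookup among ten million objects costs only a couple of I/Os, which is the property a hash table with disk-resident buckets struggles to match.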
Hashing is often a good solution when there is no non-random memory involved, such as when hashing a cache. The Linux dcache code uses hashing for accessing a cache of in-memory directory entries. Sometimes one uses partial or full hashing of keys within that balanced tree. If you do full hashing within a tree, and you cache the top part of that tree, you have something rather similar to extensible hashing, except it is more flexible and efficient. Sometimes programmers code using unbalanced trees. Most filesystems do essentially that. Balanced trees generally do a better job of minimizing the average number of disk accesses. There is literature establishing that balanced trees are optimal for the worst case when there is no caching of the tree. This is rather pointless literature, as the average case when cached is what is important, and I am afraid that the existing literature proves that which is feasible to prove rather than that which is relevant. That said, practitioners know from experience that making the tree less balanced leads to more I/Os. Discussions of the exceptions to this are rather interesting but not for here.... I regret that I must assume that the reader is familiar with basic balanced tree algorithms [Wood], [Lewis and Denenberg], [Knuth], [McCreight]. No attempt will be made to survey tree design here since balanced trees are one of the most researched and complex topics in algorithm theory, and require treatment at length. I must compound this discourtesy with a concise set of definitions that sorely lack accompanying diagrams, my apologies. Finally, I'll truly annoy the reader by saying that the header files contain nice ascii art, and if you want full definition of the structures, the source is the place. Classically, balanced trees are designed with the set of keys assumed to be defined by the application, and the purpose of the tree design is to optimize searching through those keys. 
In my approach the purpose of the tree is to optimize the reference locality and space efficient packing of objects, and the keys are defined as best optimizes the algorithm for that. Keys are used in place of inode numbers in the file system, thereby choosing to substitute a mapping of keys to node location (the internal nodes) for a mapping of inode number to file location. Keys are longer than inode numbers, but one needs to cache fewer of them than one would need to cache inode numbers when more than one file is stored in a node. In my tree, I still require that a filename be resolved one component at a time. It is an interesting topic for future research whether this is necessary or optimal. This is more complex of an issue than a casual reader might realize: directory at a time lookup accomplishes a form of compression, makes mounting other name spaces and file system extensions simpler, makes security simpler, and makes future enhanced semantics simpler. Since small files typically lead to large directories, it is fortuitous that as a natural consequence of our use of tree algorithms, our directory mechanisms are much more effective for very large directories than most other file systems are (notable exceptions include [Holton and Das]). The tree has three node types: internal nodes, formatted nodes, and unformatted nodes. The contents of internal and formatted nodes are sorted in the order of their keys. (Unformatted nodes contain no keys.) Internal nodes consist of pointers to sub-trees separated by their delimiting keys. The key that precedes a pointer to a sub-tree is a duplicate of the first key in the first formatted node of that sub-tree. Internal nodes exist solely to allow determining which formatted node contains the item corresponding to a key. ReiserFS starts at the root node, examines its contents, and based on it can determine which subtree contains the item corresponding to the desired key. 
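The way delimiting keys steer a lookup through an internal node can be sketched with a hypothetical helper (the real search is C code over the on-disk node layout; `child_index` and its arguments are illustrative):

```python
from bisect import bisect_right

def child_index(delim_keys, key):
    """Pick which subtree pointer to follow in an internal node.

    delim_keys[i] duplicates the first key of the subtree reached
    through pointer i+1, so an exact match descends to the right.
    """
    return bisect_right(delim_keys, key)

delim_keys = [100, 250, 700]          # an internal node with 4 subtree pointers
print(child_index(delim_keys, 17))    # 0: leftmost subtree
print(child_index(delim_keys, 250))   # 2: subtree whose first key it duplicates
print(child_index(delim_keys, 9000))  # 3: rightmost subtree
```

Repeating this step at each internal node from the root downward is exactly the descent the text describes.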
From the root node reiserfs descends into the tree, branching at each node, until it reaches the formatted node containing the desired item. The first (bottom) level of the tree consists of unformatted nodes, the second level consists of formatted nodes, and all levels above consist of internal nodes. The highest level contains the root node. The number of levels is increased as needed by adding a new root node at the top of the tree. All paths from the root of the tree to the formatted leaves are equal in length, and all paths to the unformatted leaves are also equal in length and one node longer than the paths to the formatted leaves. This equality in path length, and the high fanout it provides, are vital to high performance, and in the Drops section I will describe how the lengthening of the path that resulted from introducing the [BLOB] approach (the use of indirect items and unformatted nodes) proved to be a measurable mistake.

Formatted nodes consist of items. Items have four types: direct items, indirect items, directory items, and stat data items. Every item contains a key which is unique to the item; this key is used to sort, and to find, the item. Direct items contain the tails of files: a tail is the last part of the file (the last file_size modulo FS-block-size bytes of it). Indirect items consist of pointers to unformatted nodes; all but the tail of the file is contained in its unformatted nodes. Directory items contain the key of the first directory entry in the item, followed by a number of directory entries. Depending on the configuration of reiserfs, stat data may be stored as a separate item, or it may be embedded in a directory entry; we are still benchmarking to determine which way is best.

A file consists of a set of indirect items followed by a set of up to two direct items, with two direct items representing the case when a tail is split across two nodes. 
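The tail definition above can be made concrete with a pair of trivial helpers (the names are mine, not reiserfs identifiers, and the MAX_DIRECT_ITEM_LEN bound on what a formatted node can hold is ignored for simplicity):

```c
/* Hypothetical helpers applying the definition above: the tail of a
 * file is its last file_size modulo block-size bytes, and the rest of
 * the file fills whole unformatted nodes. */
static unsigned long tail_len(unsigned long file_size, unsigned long blocksize)
{
    return file_size % blocksize;
}

static unsigned long full_node_count(unsigned long file_size, unsigned long blocksize)
{
    return file_size / blocksize;
}
```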
If a tail is larger than the maximum size of a file that can fit into a formatted node, but is smaller than the unformatted node size (4k), then it is stored in an unformatted node, and a pointer to it plus a count of the space used is stored in an indirect item.

Directories consist of a set of directory items. Directory items consist of a set of directory entries. Directory entries contain the filename and the key of the file which is named. There is never more than one item of the same item type from the same object stored in a single node (there is no reason one would want to use two separate items rather than combining them). The first item of a file or directory contains its stat data.

When performing balancing, and analyzing the packing of a node and its two neighbors, we ensure that the three nodes cannot be compressed into two nodes. I feel greater compression than this is best left to an FS cleaner to perform rather than attempting it dynamically.

ReiserFS structures

The ReiserFS tree has Max_Height = N (the current default value for N is 5). The tree lies in disk blocks, and each disk block that belongs to the reiserfs tree has a block head.

An internal node of the tree is a disk block holding keys and pointers to disk blocks:

  Block_Head | Key 0 | Key 1 | ... | Key N | Pointer 0 | Pointer 1 | ... | Pointer N | Pointer N+1 | ..Free Space..

A leaf node of the tree is a disk block holding items and item headers:

  Block_Head | IHead 0 | IHead 1 | ... | IHead N | ......Free Space...... | Item N | ... | Item 2 | Item 1 | Item 0

An unformatted node of the tree is a disk block holding the data of a big file.

ReiserFS objects are files and directories. The maximum number of objects is 2^32 - 4 = 4 294 967 292. Each object is a number of items:

File items:
1. StatData item + [Direct item] (for small files: size from 0 bytes to MAX_DIRECT_ITEM_LEN = blocksize - 112 bytes)
2. StatData item + InDirect item + [Direct item] (for big files: size > MAX_DIRECT_ITEM_LEN bytes)

Directory items:
1. StatData item + Directory item

Every reiserfs object has an Object ID and a Key.

Internal Node structures

An internal node of the tree is a disk block holding keys and pointers to disk blocks:

  Block_Head | Key 0 | Key 1 | ... | Key N | Pointer 0 | Pointer 1 | ... | Pointer N | Pointer N+1 | ..Free Space..

struct block_head

  Field Name          | Type           | Size (bytes) | Description
  blk_level           | unsigned short | 2            | level of the block in the tree (1 = leaf; 2, 3, 4, ... = internal)
  blk_nr_item         | unsigned short | 2            | number of keys in an internal block, or number of items in a leaf block
  blk_free_space      | unsigned short | 2            | free space of the block, in bytes
  blk_right_delim_key | struct key     | 16           | right delimiting key for this block (leaf nodes only)

  total: 6 bytes (8 on disk) for internal nodes; 22 bytes (24 on disk) for leaf nodes

struct key

  Field Name   | Type  | Size (bytes) | Description
  k_dir_id     | __u32 | 4            | ID of the parent directory
  k_object_id  | __u32 | 4            | ID of the object (also the inode number)
  k_offset     | __u32 | 4            | offset from the beginning of the object to the current byte of the object
  k_uniqueness | __u32 | 4            | type of the item (StatData = 0, Direct = -1, InDirect = -2, Directory = 500)

  total: 16 bytes

struct disk_child (pointer to a disk block)

  Field Name      | Type           | Size (bytes) | Description
  dc_block_number | unsigned long  | 4            | disk child's block number
  dc_size         | unsigned short | 2            | disk child's used space

  total: 6 bytes (8 on disk)

Leaf Node structures

A leaf node of the tree is a disk block holding items and item headers:

  Block_Head | IHead 0 | IHead 1 | ... | IHead N | ......Free Space...... | Item N | ... | Item 2 | Item 1 | Item 0

struct block_head

  Field Name          | Type           | Size (bytes) | Description
  blk_level           | unsigned short | 2            | level of the block in the tree (1 = leaf; 2, 3, 4, ... = internal)
  blk_nr_item         | unsigned short | 2            | number of keys in an internal block, or number of items in a leaf block
  blk_free_space      | unsigned short | 2            | free space of the block, in bytes
  blk_right_delim_key | struct key     | 16           | right delimiting key for this block (leaf nodes only)

  total: 22 bytes (24 on disk) for leaf nodes

Everything in the filesystem is stored as a set of items. Each item has its item_head. The item_head contains the key of the item, its free space (for indirect items), and the location of the item itself within the block.

struct item_head (IHead)

  Field Name                         | Type       | Size (bytes) | Description
  ih_key                             | struct key | 16           | key used to search for the item; all item headers are sorted by this key
  u.ih_free_space / u.ih_entry_count | __u16      | 2            | free space in the last unformatted node for an InDirect item; 0xFFFF for a Direct item; 0xFFFF for a StatData item; the number of directory entries for a Directory item
  ih_item_len                        | __u16      | 2            | total size of the item body
  ih_item_location                   | __u16      | 2            | offset of the item body within the block
  ih_reserved                        | __u16      | 2            | used by reiserfsck

  total: 24 bytes

There are 4 types of items: stat_data items, directory items, indirect items, and direct items.

struct stat_data (the reiserfs version of the UFS disk inode, minus the address blocks)

  Field Name           | Type  | Size (bytes) | Description
  sd_mode              | __u16 | 2            | file type and permissions
  sd_nlink             | __u16 | 2            | number of hard links
  sd_uid               | __u16 | 2            | owner id
  sd_gid               | __u16 | 2            | group id
  sd_size              | __u32 | 4            | file size
  sd_atime             | __u32 | 4            | time of last access
  sd_mtime             | __u32 | 4            | time the file was last modified
  sd_ctime             | __u32 | 4            | time the inode (stat data) was last changed (except changes to sd_atime and sd_mtime)
  sd_rdev              | __u32 | 4            | device
  sd_first_direct_byte | __u32 | 4            | offset from the beginning of the file to the first byte of the file's direct item: -1 for a directory; 1 for small files (direct items only); >1 for big files (indirect and direct items); -1 for big files with an indirect item but no direct item

  total: 32 bytes

A Directory item:

  deHead 0 | deHead 1 | deHead 2 | ... | deHead N | fileName N | ... | fileName 2 | fileName 1 | fileName 0

A Direct item:

  ........ small file body ........

An InDirect item:

  unfPointer 0 | unfPointer 1 | unfPointer 2 | ... | unfPointer N

unfPointer is a pointer to an unformatted block (unfPointer size = 4 bytes). Unformatted blocks contain the body of a big file.

struct reiserfs_de_head (deHead)

  Field Name   | Type  | Size (bytes) | Description
  deh_offset   | __u32 | 4            | third component of the directory entry key (all reiserfs_de_heads are sorted by this value)
  deh_dir_id   | __u32 | 4            | objectid of the parent directory of the object referenced by the entry
  deh_objectid | __u32 | 4            | objectid of the object referenced by the entry
  deh_location | __u16 | 2            | offset of the name within the whole item
  deh_state    | __u16 | 2            | flags: 1) entry contains stat data (for the future); 2) entry is hidden (unlinked)

  total: 16 bytes

fileName is the name of the file (an array of bytes of variable length). The maximum length of a file name is blocksize - 64 (for a 4kb blocksize the maximum name length is 4032 bytes).

Using the Tree to Optimize Layout of Files

There are four levels at which layout optimization is performed: 1) the mapping of logical block numbers to physical locations on disk; 2) the assignment of nodes to logical block numbers; 3) the ordering of objects within the tree; and 4) the balancing of the objects across the nodes they are packed into.

Physical Layout

For SCSI drives this mapping of logical block numbers to physical locations is performed by the disk drive manufacturer, for IDE drives it is done by the device driver, and for all drives it is also potentially done by volume management software. The logical block number to physical location mapping by the drive manufacturer is usually done using cylinders. 
I agree with the authors of [ext2fs] and most others that the significant file placement feature of FFS was not the actual cylinder boundaries, but the placing of files and their inodes on the basis of their parent directory's location. FFS used explicit knowledge of actual cylinder boundaries in its design. I find that minimizing the distance in logical blocks between semantically adjacent nodes, without tracking cylinder boundaries, accomplishes an excellent approximation of optimizing according to actual cylinder boundaries, and I find its simplicity an aid to implementation elegance.

Node Layout

When I place nodes of the tree on the disk, I search for the first empty block in the bitmap (of used block numbers), starting at the location of the left neighbor of the node in the tree ordering and moving in the direction I last moved in. This was experimentally found to be better, for the benchmarks employed, than the following alternatives: 1) taking the first non-zero entry in the bitmap; 2) taking the entry after the last one that was assigned, in the direction last moved in (this was 3% faster for writes and 10-20% slower for subsequent reads); 3) starting at the left neighbor and moving in the direction of the right neighbor.

When changing block numbers so as to avoid overwriting sending nodes before shifted items reach disk in their new recipient node (see the description of preserve lists later in the paper), the benchmarks employed were ~10% faster when starting the search from the left neighbor rather than from the node's current block number, even though it adds significant overhead to determine the left neighbor (the current implementation risks I/O to read the parent of the left neighbor).

It used to be that we would reverse direction when we reached the end of the disk drive. 
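The winning search policy can be sketched as a simple bitmap scan (an illustrative sketch, not the reiserfs allocator itself; it scans toward increasing block numbers, and a set bit marks a used block):

```c
/* Illustrative sketch of the search described above: begin at the
 * block of the node's left neighbor in the tree ordering and scan
 * toward increasing block numbers for the first free block. */
static long find_free_block(const unsigned char *bitmap, long nr_blocks,
                            long left_neighbor_blk)
{
    for (long b = left_neighbor_blk; b < nr_blocks; b++)
        if (!(bitmap[b / 8] & (1u << (b % 8))))
            return b;        /* first free block at or after the neighbor */
    return -1;               /* nothing free in that direction */
}
```

Starting from the left neighbor rather than an arbitrary point is what keeps semantically adjacent nodes close together on disk.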
Fortunately we checked to see whether it makes a difference which direction one moves in when allocating blocks to a file, and indeed we found that it makes a significant difference to always allocate in the direction of increasing block numbers. We hypothesize that this is due to matching disk spin direction by allocating using increasing block numbers.

Ordering within the Tree

While I give here an example of how I have defined keys to optimize locality of reference and packing efficiency, I would like to stress that key definition is a powerful and flexible tool that I am far from finished experimenting with. Some key definition decisions depend very much on usage patterns, which means that someday one will select from several key definitions when creating the file system.

For example, consider the decision of whether to pack all directory entries together at the front of the file system, or to pack the entries near the files they name. For large-file usage patterns one should pack all directory items together, since systems with such usage patterns are effective in caching the entries for all directories. For small files the name should be near the file. Similarly, for large files the stat data should be stored separately from the body, either with the other stat data from the same directory, or with the directory entry. (It was likely a mistake not to assign stat data its own key in the current implementation, as packing it in with direct and indirect items complicates our code for handling those items, and prevents me from easily experimenting with the effects of changing its key assignment.)

It is not necessary for a file's packing to reflect its name; that is merely my default. With each file, my next release will offer the option of overriding the default by use of a system call. 
It is feasible to pack an object completely independently of its semantics using these algorithms, and I predict that there will be many applications, perhaps even most, for which a packing different from that determined by object names is more appropriate. Currently the mandatory tying of packing locality to semantics distorts both semantics and packing from what might otherwise be their independent optimums, much as tying block boundaries to file boundaries distorts I/O and space allocation algorithms from their separate optimums. For example, placing most files accessed while booting at the start of the disk, in their access order, is a very tempting future optimization that the use of packing localities makes feasible to consider.

The Structure of a Key

Each file item has a key with the structure <locality_id, object_id, offset, uniqueness>. The locality_id is by default the object_id of the parent directory. The object_id is the unique id of the file, and is set to the first unused objectid when the object is created. The tendency of this to pack successive object creations in a directory adjacently is fortuitous for many usage patterns.

For files, the offset is the offset within the logical object of the first byte of the item. In version 0.2 all directory entries had their own individual keys stored with them and were each distinct items; in the current version I store one key in the item, which is the key of the first entry, and compute each entry's key as needed from it. For directories, the offset key component is the first four bytes of the filename, which you may think of as the lexicographic rather than numeric offset. For directory items the uniqueness field differentiates filename entries identical in the first 4 bytes. 
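A sketch of this key and its ordering (the field names echo the struct key table earlier; the comparator itself is illustrative, not the reiserfs source):

```c
/* Sketch of the <locality_id, object_id, offset, uniqueness> key.
 * Comparing locality_id first is what keeps all objects from one
 * directory adjacent in the tree ordering. */
struct r_key {
    unsigned int locality_id;  /* object_id of the parent directory */
    unsigned int object_id;    /* unique id of the file */
    unsigned int offset;       /* byte offset, or first 4 name bytes */
    unsigned int uniqueness;   /* item type and tie-breaker */
};

static int key_cmp(const struct r_key *a, const struct r_key *b)
{
    if (a->locality_id != b->locality_id)
        return a->locality_id < b->locality_id ? -1 : 1;
    if (a->object_id != b->object_id)
        return a->object_id < b->object_id ? -1 : 1;
    if (a->offset != b->offset)
        return a->offset < b->offset ? -1 : 1;
    if (a->uniqueness != b->uniqueness)
        return a->uniqueness < b->uniqueness ? -1 : 1;
    return 0;
}
```

Because the comparison is lexicographic across the four components, changing how the components are assigned changes the physical packing order without touching the tree algorithms at all, which is why key definition is such a flexible tool.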
For all item types the uniqueness field indicates the item type, and for the leftmost item in a buffer it indicates whether the preceding item in the tree is of the same type and object as this item. Placing this information in the key is useful when analyzing balancing conditions, but it increases key length for non-directory items, and is a questionable architectural feature.

Every file has a unique objectid, but this cannot be used for finding the object; only keys are used for that. Objectids merely ensure that keys are unique. If you never use the reiserfs features that change an object's key, then the key is immutable; otherwise it is mutable. (This feature aids support for NFS daemons, etc.) We spent quite some time debating internally whether the use of mutable keys for identifying an object had deleterious long-term architectural consequences; in the end I decided it was acceptable iff we require any object recording a key to possess a method for updating its copy of it. This is the architectural price of avoiding caching a map of objectid to location, which might have very poor locality of reference due to objectids not changing with object semantics.

I pack an object with the packing locality of the directory it was first created in, unless the key is explicitly changed. It remains packed there even if it is unlinked from the directory. I do not move it from the locality it was created in without an explicit request, unlike the [C-FFS] approach, which stores all multiple-link files together and pays the cost of moving them from their original locations when the second link occurs. I think a file linked with multiple directories might as well get at least the locality of reference benefits of one of those directories.

In summary, this approach 1) places files from the same directory together, and 2) places directory entries from the same directory together with each other and with the stat data for the directory. 
Note that there is no interleaving of objects from different directories in the ordering at all, and that all directory entries from the same directory are contiguous. You'll note that this does not pack the files of small directories with common parents together, and does not employ the full partial ordering in determining the linear ordering; it merely uses parent directory information. I feel the proper place for employing full tree structure knowledge is in the implementation of an FS cleaner, not in the dynamic algorithms.

Node Balancing Optimizations

When balancing nodes I do so according to the following ordered priorities:

1. minimize the number of nodes used
2. minimize the number of nodes affected by the balancing operation
3. minimize the number of uncached nodes affected by the balancing operation
4. if shifting to another formatted node is necessary, maximize the bytes shifted

Priority 4 is based on the assumption that the location of an insertion of bytes into the tree is an indication of the likely location of future insertions, and that this policy will on average reduce the number of formatted nodes affected by future balance operations. There are more subtle effects as well: if one randomly places nodes next to each other, and one has a choice between those nodes being mostly moderately efficiently packed, or packed to an extreme of either well or poorly packed, one is more likely to be able to combine more of the nodes if one chooses the policy of extremism. Extremism is a virtue in space-efficient node packing. The maximal shift policy is not applied to internal nodes, as extremism is not a virtue in time-efficient internal node balancing.

Drops (the difficult design issues in the current version that our next version can do better)

Consider dividing a file or directory into drops, with each drop having a separate key, and no two drops from one file or directory occupying the same node without being compressed into one drop. 
The key for each drop is set to the key for the object (file or directory) plus the offset of the drop within the object. For directories the offset is lexicographic and by filename, for files it is numeric and in bytes. In the course of several file system versions we have experimented with and implemented solid, liquid, and air drops. Solid drops were never shifted, and drops would only solidify when they occupied the entirety of a formatted node. Liquid drops are shifted in such a way that any liquid drop which spans a node fully occupies the space in its node. Like a physical liquid it is shiftable but not compressible. Air drops merely meet the balancing condition of the tree. Reiserfs 0.2 implemented solid drops for all but the tail of files. If a file was at least one node in size it would align the start of the file with the start of a node, block aligning the file. This block alignment of the start of multi-drop files was a design error that wasted space: even if the locality of reference is so poor as to make one not want to read parts of semantically adjacent files, if the nodes are near to each other then the cost of reading an extra block is thoroughly dwarfed by the cost of the seek and rotation to reach the first node of the file. As a result the block alignment saves little in time, though it costs significant space for 4-20k files. Reiserfs with block alignment of multi-drop files and no indirect items experienced the following rather interesting behavior that was partially responsible for making it only 88% space efficient for files that averaged 13k (the linux kernel) in size. When the tail of a larger than 4k file was followed in the tree ordering by another file larger than 4k, since the drop before was solid and aligned, and the drop afterwards was solid and aligned, no matter what size the tail was, it occupied an entire node. 
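As a rough illustration of that cost (my own arithmetic, not a figure from the benchmarks): under block alignment, every node a file touches is wholly charged to it, so packing efficiency is file size divided by the space of the nodes it occupies:

```c
/* Rough arithmetic of my own: with block-aligned solid drops, a tail
 * that cannot share a node occupies a whole node, so packing
 * efficiency is file_size / (nodes * blocksize). */
static double aligned_packing_efficiency(unsigned long file_size,
                                         unsigned long blocksize)
{
    unsigned long nodes = file_size / blocksize +
                          (file_size % blocksize != 0);
    return (double)file_size / (double)(nodes * blocksize);
}
```

For example, a 13312-byte (13k) file on 4k blocks occupies 4 nodes, 16384 bytes, for roughly 81% efficiency when every tail is forced into its own node; this is the same regime as the 88% overall figure measured above.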
In the current version we place all but the tail of large files into a level of the tree reserved for full unformatted nodes, and create indirect items in the formatted nodes which point to the unformatted nodes. This is known in the database literature as the [BLOB] approach. This extra level added to the tree comes at the cost of making the tree less balanced (I consider the unformatted nodes pointed to as part of the tree) and of increasing the maximal depth of the tree by 1.

For medium-sized files, the use of indirect items increases the cost of caching pointers by mixing data with them. The reduction in fanout more frequently causes the read algorithms to fetch the file being read only one node at a time, as one waits to read the uncached indirect item before reading the node with the file data. There are more parents per file read with the use of indirect items than with internal nodes, as a direct result of the reduced fanout due to mixing tails and indirect items in the node. The most serious flaw is that these reads of the various nodes necessary to the reading of the file incur additional rotations and seeks compared to the case with drops. With my initial drop approach they are usually sequential in their disk layout, even the tail, and the internal node parent points to all of them in such a way that all of them that are contained by that parent, or by another internal node in cache, can be requested at once in one sequential read. Non-sequential reads of nodes are more than an order of magnitude more costly than sequential reads, and this single consideration dominates effective read optimization.

Unformatted nodes make file system recovery faster but less robust, in that one reads their indirect item rather than the nodes themselves to insert them into the recovered tree, and one cannot read them to confirm that their contents are from the file that an indirect item says they are from. In this they make reiserfs similar to an inode-based system without logging. 
A moderately better solution would have been to simply eliminate the requirement that the start of a multi-node file be placed at the start of a node, rather than introducing BLOBs, and to depend on the use of a file system cleaner to optimally pack the 80% of files that don't move frequently, using algorithms that move even solid drops. Yet that still leaves the problem of formatted nodes not being efficient for mmap() purposes (one must copy them before writing, rather than merely modifying their page table entries, and memory bandwidth is expensive even if CPU is cheap).

For this reason I have the following plan for the next version. I will have three trees: one tree maps keys to unformatted nodes, one tree maps keys to formatted nodes, and one tree maps keys to directory entries and stat_data. Now it is only natural if you are thinking that this would mean that to read a file, accessing first the directory entry and stat_data, then the unformatted node, then the tail, one must hop long distances across the disk, going first to one tree and then to another. This is indeed why it took me two years to realize it could be made to work. My plan is to interleave the nodes of the three trees according to the following algorithm.

Block numbers are assigned to nodes when the nodes are created or preserved, and someday will be assigned when the cleaner runs. The choice of block number is based on first determining what other node it should be placed near, and then finding the nearest free block that can be found in the elevator's current direction. Currently we use the left neighbor of the node in the tree as the node it should be placed near. This is nice and simple. Oh well. Time to create a virtual neighbor layer. The new scheme will continue to first determine the node it should be placed near, and then start the search for an empty block from that spot, but it will use a more complicated determination of what node to place it near. 
This method will cause all nodes from the same packing locality to be near each other, will cause all directory entries and stat_data to be grouped together within that packing locality, and will interleave formatted and unformatted nodes from the same packing locality. Pseudo-code is best for describing this:

  /* For use by reiserfs_get_new_blocknrs when determining where in the
     bitmap to start the search for a free block, and for use by the
     read-ahead algorithm when there are not enough nodes to the right
     and in the same packing locality for packing-locality read-ahead
     purposes. */
  get_logical_layout_left_neighbors_blocknr(key of current node)
  {
      /* Based on examination of the current node's key and type, find
         the virtual neighbor of that node. */
      if body node
          if first body node of file
              if (node in tail tree whose key is less but is in same packing locality exists)
                  return blocknr of such node with largest key
              else
                  find node with largest key less than key of current node in stat_data tree
                  return its blocknr
          else
              return blocknr of node in body tree with largest key less than key of current node
      else if tail node
          if (node in body tree belonging to same file as first tail of current node exists)
              return its blocknr
          else if (node in tail tree with lesser delimiting key but same packing locality exists)
              return blocknr of such node with largest delimiting key
          else
              return blocknr of node with largest key less than key of current node in stat_data tree
      else /* is stat_data tree node */
          if stat_data node with lesser key from same packing locality exists
              return blocknr of such node with largest key
          else
              /* no node from same packing locality with lesser key exists */
  }

  /* For use by packing-locality read-ahead. */
  get_logical_layout_right_neighbors_blocknr(key of current node)
  {
      right-handed version of get_logical_layout_left_neighbors_blocknr logic
  }

It is my hope that this will improve the caching of pointers to unformatted nodes, plus the caching of directory entries and 
stat_data, by separating them from file bodies to a greater extent. I also hope that it will improve read performance for 1-10k files, and that it will allow us to do this without decreasing space efficiency.

Code Complexity

I thought it appropriate to mention some of the notable effects of simple design decisions on our implementation's code length. When we changed our balancing algorithms to shift parts of items rather than only whole items, so as to pack nodes tighter, this had an impact on code complexity. Another multiplicative determinant of balancing code complexity was the number of item types: introducing indirect items doubled it, and changing directory items from liquid drops to air drops increased it further. Storing stat data in the first direct or indirect item of the file complicated the code for processing those items more than if I had made stat data its own item type.

When one finds oneself with an NxN coding complexity issue, it usually indicates the need for adding a layer of abstraction. The NxN effect of the number of item types on balancing code complexity is an instance of that design principle, and we will address it in the next major rewrite. The balancing code will employ a set of item operations which all item types must support. The balancing code will then invoke those operations without caring to understand any more of the meaning of an item's type than that it determines which item-specific operation handler is called. Adding a new item type, say a compressed item, will then merely require writing a set of item operations for that item, rather than requiring modification of most parts of the balancing code as it does now. We now feel that the function which determines what resources are needed to perform a balancing operation, fix_nodes(), might as well be written to decide what operations will be performed during balancing, since it pretty much has to do so anyway. 
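One possible shape of such an item-operations layer (all names here are illustrative, not the shipped interface):

```c
/* Each item type registers the handlers the balancing code needs, and
 * the balancing code dispatches through the table instead of switching
 * on item type. */
struct item_ops {
    int (*bytes_number)(const void *item, int item_len); /* logical size */
    int (*can_shift)(const void *item, int free_space);  /* shift check */
};

enum item_type { TYPE_STAT_DATA, TYPE_DIRECT, TYPE_INDIRECT,
                 TYPE_DIRECTORY, TYPE_MAX };

static const struct item_ops *item_ops_table[TYPE_MAX];

/* Adding a new item type, e.g. a compressed item, then means writing
 * one new ops struct and registering it, not editing the balancer. */
static int item_bytes(enum item_type t, const void *item, int item_len)
{
    return item_ops_table[t]->bytes_number(item, item_len);
}

/* Example ops for direct items: the body is just bytes. */
static int direct_bytes(const void *item, int item_len)
{
    (void)item;
    return item_len;
}

static const struct item_ops direct_ops = { direct_bytes, 0 };
```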
That way, the function that performs the balancing with the nodes locked, do_balance(), can be gutted of most of its complexity.

Buffering & the Preserve List

We implemented for version 0.2 of our file system a system of write ordering that tracked all shifting of items in the tree, and ensured that no node that had had an item shifted from it was written before the node that had received the item was written. This is necessary to prevent a system crash from causing the loss of an item that might not have been recently created. This tracking approach worked, and the overhead it imposed was not measurable in our benchmarks. When in the next version we changed to partially shifting items and increased the number of item types, this code grew out of control in its complexity. I decided to replace it with a far simpler-to-code scheme that was also more effective in typical usage patterns.

This scheme was as follows. If an item is shifted from a node, change the block that its buffer will be written to. Change it to the nearest free block to the old block's left neighbor, and rather than freeing the old block, place its number on a ``preserve list''. (Saying nearest is slightly simplistic, in that the blocknr assignment function moves from the left neighbor in the direction of increasing block numbers.) When a ``moment of consistency'' is achieved, free all of the blocks on the preserve list. A moment of consistency occurs when there are no nodes in memory into which objects have been shifted (this could be made more precise, but then it would be more complex). If disk space runs out, force a moment of consistency to occur. This is sufficient to ensure that the file system is recoverable.

Note that during the large file benchmarks the preserve list was freed several times in the middle of the benchmark. The percentage of buffers preserved is small in practice, except during deletes, and one can arrange for moments of consistency to occur as frequently as one wants to. 
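The bookkeeping side of this scheme can be sketched as follows (a hypothetical sketch under the description above; the real code ties this to buffer handling, and the fixed-size list and bitmap interface are my simplifications):

```c
/* When items are shifted out of a node, the node is relocated and its
 * old block number is recorded; a moment of consistency frees every
 * recorded block by clearing its bit in the used-block bitmap. */
#define PRESERVE_MAX 1024

static long preserve_list[PRESERVE_MAX];
static int  preserve_count;

static void preserve_block(long old_blocknr)
{
    if (preserve_count < PRESERVE_MAX)
        preserve_list[preserve_count++] = old_blocknr;
}

/* Call only when no in-memory node still holds shifted items that
 * have not reached disk; returns how many blocks were freed. */
static int moment_of_consistency(unsigned char *bitmap)
{
    int freed = preserve_count;
    for (int i = 0; i < preserve_count; i++) {
        long b = preserve_list[i];
        bitmap[b / 8] &= (unsigned char)~(1u << (b % 8));
    }
    preserve_count = 0;
    return freed;
}
```

Until the moment of consistency, both the old and the new copy of a relocated node exist on disk, which is what makes the file system recoverable after a crash.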
Note that I make no claim that this approach is better than the Soft Updates approach employed by [Ganger] or by us in version 0.2; I merely note that tracking the order of writes is more complex than this approach for balanced trees which partially shift items. We may go back to the old approach some day, though not to the code that I threw out. Preserve lists substantially hamper performance for files in the 1-10k size range. We are re-evaluating them. Ext2fs avoids the metadata shifting problem by never shrinking directories and by using fixed inode space allocations.

Lessons From Log Structured File Systems

Many techniques from other file systems haven't been applied yet, primarily so as to satisfy my goal of giving reiserfs 1.0 only the minimum feature set necessary to be useful; they will appear in later releases. Log Structured File Systems [Rosenblum and Ousterhout] embody several such techniques, which I will describe after I mention two concerns with that approach:

* With small-object file systems it is not feasible to cache in RAM a map of objectid to location for every object, since there are too many objects. This is an inherent problem in using temporal packing rather than semantic packing for small-object file systems. With my approach, my internal nodes are the equivalent of this objectid-to-location map, but total internal node size is proportional to the number of nodes rather than the number of objects. You can think of internal nodes as a compression of object location information made effective by the existence of an ordering function; this compression is both essential for small files and a major feature of my approach.

* I like obtaining good though not ideal semantic locality without paying a cleaning cost for active data. This is a less critical concern.

I frequently find myself classifying packing and layout optimizations as either appropriate for implementing dynamically or appropriate only for a cleaner. 
Optimizations whose computational overhead is large compared to their benefit tend to be appropriate for implementation in a cleaner, and a cleaner's benefits mostly impact the static portion of the file system (which typically consumes ~80% of the space). Such objectives as 100% packing efficiency, exactly ordering block layout by semantic order, using the full semantic tree rather than the parent directory in determining semantic order, and compression are all best implemented by cleaner approaches. In summary, there is much to be learned from the LFS approach, and as I move past my initial objective of supplying a minimal-feature, higher-performance FS I will apply some of those lessons. In the Preserve Lists section I speculate on the possibilities for a fastboot implementation that would merge the better features of preserve lists and logging.

Directions For the Future

To go one more order of magnitude smaller in file size will require adding functionality to the file system API, though it will not require discarding upward compatibility. The use of an exokernel is a better approach to small files if it is an option available to the OS designer; it is not currently an option for Linux users. In the future reiserfs will add such features as lightweight files in which stat_data other than size is inherited from a parent if it is not created individually for the file, an API for reading and writing files without requiring the overhead of file handles and open(), set-theoretic semantics, and many other features that you would expect from researchers who expect to be able to do all that they could do in a database, in the file system, and never really did understand why not.

Conclusion

Balanced tree file systems are inherently more space efficient than block allocation based file systems, with the differences reaching order of magnitude levels for small files.
While other aspects of design will typically have a greater impact on performance for large files, the use of balanced trees offers performance advantages in direct proportion to the smallness of the file. A moderate advantage was found for large files. Coding cost is mostly in the interfaces, and it is a measure of the OS designer's skill whether those costs are low in the OS. We make it possible for an OS designer to use the same interface for large and small objects, and thereby reduce interface coding cost. This approach is a new tool available to the OS designer for increasing the expressive power of all of the components in the OS through better name space integration. Researchers interested in collaborating or just using my work will find me friendly. I tailor the framework of my collaborations to the needs of those I work with. I GPL reiserfs so as to meet the needs of academic collaborators. While that makes it unusable without a special license for commercial OSes, commercial vendors will find me friendly in setting up a commercial framework for commercial collaboration with commercial needs provided for.

Acknowledgments

Hans Reiser was the project initiator, primary architect, supplier of funding, and one of the programmers. Some folks at times remark that naming the file system Reiserfs was egotistic. It was so named after a potential investor hired all of my employees away from me, then tried to negotiate better terms for his possible investment, and suggested that he could arrange for 100 researchers to swear in Russian court that I had had nothing to do with this project. That business partnership did not work out. Vladimir Saveljev, while he did not author this paper, worked long hours writing the largest fraction of the lines of code in the file system, and is remarkably gifted at just making things work. Thanks, Vladimir. Anatoly Pinchuk wrote much of the core balancing code, and too much of the rest to list here. Thanks, Anatoly.
It is the policy of the Naming System Venture that if someone quits before project completion, and then takes strong steps to try to prevent others from finishing the project, they shall not be mentioned in the acknowledgments. This was all quite sad, and best forgotten. I would like to thank Alfred Ajlamazyan for his generosity in providing overhead at a time when his institute had little it could easily spare. Grigory Zaigralin is thanked for his work in making the machines run, administering the money, and being his usual determined-to-be-useful self. Igor Chudov, thanks for such effective procurement and hardware maintenance work. Eirik Fuller is thanked for his help with NFS and porting to 2.1. I would like to thank Remi Card for the superb block allocation based file system (ext2fs) that I depended on for so many years, and that allowed me to benchmark against the best. Linus Torvalds, thank you for Linux.

Business Model and Licensing

I personally favor performing a balance of commercial and public works in my life. I have no axe to grind against software that is charged for, and no regrets at making reiserfs freely available to Linux users. This project is GPL'd, but I sell exceptions to the GPL to commercial OS vendors and file server vendors. It is not usable to them without such exceptions, and many of them are wise enough to understand that:

* the porting and integration service we are able to provide with the licensing is by itself worth what we charge,
* these services impact their time to market,
* and the relationship spreads the development costs across more OS vendors than just them alone.

I expect that Linux will prove to be quite effective in market sampling my intended market, but if you suspect that I also like seeing more people use it even if it is free to them, oh well. I believe it is not so much the cost that has made Linux so successful as it is the openness.
Linux is a decentralized economy with honor and recognition as the currency of payment (and thus there is much honor in it). Commercial OS vendors are, at the moment, all closed economies, and doomed to fall in their competition with open economies just as communism eventually fell. At some point an OS vendor will realize that if it:

* opens up its source code to decentralized modification,
* systematically rewards those who perform the modifications that are proven useful,
* systematically merges/integrates those modifications into its branded primary release branch while adding value as the integrator,

it will acquire both the critical mass of the internet development community, and the aggressive edge that no large communal group (such as a corporation) can have. Rather than saying to any such vendor that they should do this now, let me simply point out that whoever is first will have an enormous advantage. Since I have more recognition than money to pass around as reward, my policy is to tend to require that those who contribute substantial software to this project have their names attached to a user-visible portion of the project. This official policy helps me deal with folks like Vladimir, who was much too modest to ever name the file system checker vsck without my insisting. Smaller contributions are to be noted in the source code, and in the acknowledgments section of this paper. If you choose to contribute to this file system, and your work is accepted into the primary release, you should let me know if you want me to look for opportunities to integrate you into contracts from commercial vendors. Through packaging ourselves as a group, we are more marketable to such OS vendors. Many of us have spent too much time working at day jobs unrelated to our Linux work. This is too hard, and I hope to make things easier for us all.
If you like this business model of selling GPL'd component software with related support services, but you write software not related to this file system, I encourage you to form a component supplier company also. Opportunities may arise for us to cooperate in our marketing, and I will be happy to do so.

References

G.M. Adel'son-Vel'skii and E.M. Landis, An algorithm for the organization of information, Soviet Math. Doklady 3, 1259-1262, 1962. This paper on AVL trees can be thought of as the founding paper of the field of storing data in trees. Those not conversant in Russian will want to read the [Lewis and Denenberg] treatment of AVL trees in its place. [Wood] contains a modern treatment of trees.

[Apple] Inside Macintosh, Files, by Apple Computer Inc., Addison-Wesley, 1992. Employs balanced trees for filenames; it was an interesting file system architecture for its time in a number of ways, but its problems with internal fragmentation have become more severe as disk drives have grown larger, and the code has not received sufficient further development.

[Bach] Maurice J. Bach, ``The Design of the Unix Operating System'', 1986, Prentice-Hall Software Series, Englewood Cliffs, NJ. Superbly written but sadly dated; contains detailed descriptions of the file system routines and interfaces in a manner especially useful for those trying to implement a Unix-compatible file system. See [Vahalia].

[BLOB] R. Haskin, Raymond A. Lorie: On Extending the Functions of a Relational Database System. SIGMOD Conference 1982: 207-212 (body of paper not on web). See the Drops section for a discussion of how this approach makes the tree less balanced, and the effect that has on performance.

[Chen] Chen, P.M., Patterson, David A., A New Approach to I/O Performance Evaluation---Self-Scaling I/O Benchmarks, Predicted I/O Performance, 1993 ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems; also available on Chen's web page.

[C-FFS] Ganger, Gregory R., Kaashoek, M.
Frans, page with link to postscript paper. A very well written paper focused on 1-10k file size issues; they use some similar notions (most especially their concept of grouping compared to my packing localities). Note that they focus on the 1-10k file size range, and not the sub-1k range. The 1-10k range is the weak point in reiserfs performance.

[ext2fs] by Remi Card; extensive information, source code is available. When you consider how small this file system is (~6000 lines), its effectiveness becomes all the more remarkable.

[FFS] M.K. McKusick, W.N. Joy, S.J. Leffler, and R.S. Fabry. A fast file system for UNIX. ACM Transactions on Computer Systems, 2(3):181--197, August 1984. Describes the implementation of a file system which employs parent directory location knowledge in determining file layout. It uses large blocks for all but the tail of files to improve I/O performance, and uses small blocks called fragments for the tails so as to reduce the cost due to internal fragmentation. Numerous other improvements are also made to what was once the state of the art. FFS remains the architectural foundation for many current block allocation file systems, and was later bundled with the standard Unix releases. Note that unrequested serialization and the use of fragments place it at a performance disadvantage to ext2fs, though whether ext2fs is thereby made less reliable is a matter of dispute on which I take no position (reiserfs uses preserve lists; forgive my egotism in thinking that it is enough work for me to ensure that reiserfs solves the recovery problem, and to perhaps suggest that ext2fs would benefit from the use of preserve lists when shrinking directories).

[Ganger] Gregory R. Ganger, Yale N. Patt, ``Metadata Update Performance in File Systems'', abstract only.

[Gifford] postscript only. Describes a file system enriched to have more than hierarchical semantics; he shares many goals with this author, forgive me for thinking his work worthwhile.
If I had to suggest one improvement in a sentence, I would say his semantic algebra needs closure.

[Hitz] Dave Hitz, http://www.netapp.com/technology/level3/3002.html. A rather well designed file system optimized for NFS and RAID in combination. Note that RAID increases the merits of write-optimization in block layout algorithms.

[Holton and Das] Holton, Mike, Das, Raj. ``The XFS space manager and namespace manager use sophisticated B-Tree indexing technology to represent file location information contained inside directory files and to represent the structure of the files themselves (location of information in a file).'' Note that it is still a block (extent) allocation based file system; no attempt is made to store the actual file contents in the tree. It is targeted at the needs of the other end of the file size usage spectrum from reiserfs, and is an excellent design for that purpose (and I would concede that reiserfs 1.0 is not suitable for their real-time large I/O market). SGI has also traditionally been a leader in resisting the use of unrequested serialization of I/O. Unfortunately, the paper is a bit vague on details, and source code is not freely available.

[Howard] ``Scale and Performance in a Distributed File System'', Howard, J.H., Kazar, M.L., Menees, S.G., Nichols, D.A., Satyanarayanan, M., Sidebotham, R.N., West, M.J., ACM Transactions on Computer Systems, 6(1), February 1988. A classic benchmark; it was too CPU bound for both ext2fs and reiserfs.

[Knuth] Knuth, D.E., The Art of Computer Programming, Vol. 3 (Sorting and Searching), Addison-Wesley, Reading, MA, 1973. The earliest reference discussing trees storing records of varying length.

[LADDIS] Wittle, Mark, and Bruce, Keith, ``LADDIS: The Next Generation in NFS File Server Benchmarking'', Proceedings of the Summer 1993 USENIX Conference, July 1993, pp.
111-128.

[Lewis and Denenberg] Lewis, Harry R., Denenberg, Larry, ``Data Structures & Their Algorithms'', HarperCollins Publishers, NY, NY, 1991. An algorithms textbook suitable for readers wishing to learn about balanced trees and their AVL predecessors.

[McCreight] McCreight, E.M., Pagination of B*-trees with variable length records, Commun. ACM 20 (9), 670-674, 1977. Describes algorithms for trees with variable length records.

[McVoy and Kleiman] The implementation of write-clustering for Sun's UFS.

[OLE] ``Inside OLE'' by Kraig Brockschmidt; discusses Structured Storage. http://www.microsoft.com/mspress/books/abs/5-843-2b.htm (abstract only).

[Ousterhout] J.K. Ousterhout, H. Da Costa, D. Harrison, J.A. Kunze, M.D. Kupfer, and J.G. Thompson. A trace-driven analysis of the UNIX 4.2BSD file system. In Proceedings of the 10th Symposium on Operating Systems Principles, pages 15--24, Orcas Island, WA, December 1985.

[NTFS] ``Inside the Windows NT File System''. The book is written by Helen Custer; NTFS is architected by Tom Miller with contributions by Gary Kimura, Brian Andrew, and David Goebel. Microsoft Press, 1994. An easy to read little book. They fundamentally disagree with me on adding serialization of I/O not requested by the application programmer, and I note that the performance penalty they pay for their decision is high, especially compared with ext2fs. Their FS design is perhaps optimal for floppies and other hardware-eject media beyond OS control. A less serialized, higher performance log structured architecture is described in [Rosenblum and Ousterhout]. That said, Microsoft is to be commended for recognizing the importance of attempting to optimize for small files, and leading the OS designer effort to integrate small objects into the file name space. This book is notable for not referencing the work of persons not working for Microsoft, or providing any form of proper attribution to previous authors such as [Rosenblum and Ousterhout].

[Peacock] K.
Peacock, ``The CounterPoint Fast File System'', Proceedings of the Usenix Conference, Winter 1988.

[Pike] Rob Pike and Peter Weinberger, The Hideous Name, USENIX Summer 1985 Conference Proceedings, pp. 563, Portland, Oregon, 1985. Short, informal, and drives home why inconsistent naming schemes in an OS are detrimental. http://achille.cs.bell-labs.com/cm/cs/doc/85/1-05.ps.gz His discussion of naming in Plan 9: http://plan9.bell-labs.com/plan9/doc/names.html

[Rosenblum and Ousterhout] ``The Design and Implementation of a Log-Structured File System'', Mendel Rosenblum and John K. Ousterhout, February 1992, ACM Transactions on Computer Systems. This paper was quite influential in a number of ways on many modern file systems, and the notion of using a cleaner may be applied to a future release of reiserfs. There is an interesting ongoing debate over the relative merits of FFS vs. LFS architectures; the interested reader may peruse http://www.scriptics.com/people/john.ousterhout/seltzer93.html and the arguments by Margo Seltzer it links to.

[Snyder] ``tmpfs: A Virtual Memory File System''. Discusses a file system built to use swap space and intended for temporary files; due to a complete lack of disk synchronization it offers extremely high performance.

[Vahalia] Uresh Vahalia, ``Unix Kernel Internals''

[[category:ReiserFS]]

Three reasons why ReiserFS is great for you:

ReiserFS has fast journaling, which means that you don't spend your life waiting for fsck every time your laptop battery dies, or the UPS for your mission critical server gets its batteries disconnected accidentally by the UPS company's service crew, or your kernel was not as ready for prime time as you hoped, or the silly thing decides you mounted it too many times today. ReiserFS is based on fast balanced trees.
Balanced trees are more robust in their performance, and are a more sophisticated algorithmic foundation for a file system. When we started our project, there was a consensus in the industry that balanced trees were too slow for file system usage patterns. We proved that if you just do them right they are better--take a look at the benchmarks. We have fewer worst case performance scenarios than other file systems and generally better overall performance. If you put 100,000 files in one directory, we think it's fine; many other file systems try to tell you that you are wrong to want to do it. ReiserFS is more space efficient. If you write 100 byte files, we pack many of them into one block. Other file systems put each of them into their own block. We don't have fixed space allocation for inodes. That saves 6% of your disk. Ok, it's time to fess up. The interesting stuff is still in the future. Because they are nifty, we are going to add database and hypertext like features into the file system. Only by using balanced trees, with their effective handling of small files (database small fields, hypertext keywords), as our technical foundation can we hope to do this. That was our real motivation. As for performance, we may already be slightly better than the traditional file systems (and substantially better than the journaling ones). But they have been tweaking for decades, while we have just got started. This means that over the next few years we are going to improve faster than they are. Speaking more technically: ReiserFS is a file system using a plug-in based object oriented variant on classical balanced tree algorithms. The results when compared to the ext2fs conventional block allocation based file system, running under the same operating system and employing the same buffering code, suggest that these algorithms are overall more efficient and with every passing month are becoming yet more so.
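The space claim above (many 100-byte files packed into one block versus one block each) can be checked with back-of-the-envelope arithmetic. The sketch below is my own illustration: the block size and file count are made-up, and per-item header and internal-node overhead are ignored.

```c
#include <assert.h>

#define BLOCK 4096UL  /* assumed node/block size for this illustration */

/* Block-aligned allocation: each small file occupies a whole block. */
static unsigned long aligned_bytes(unsigned long nfiles)
{
    return nfiles * BLOCK;
}

/* Packed allocation: files laid end-to-end, rounded up to whole nodes.
 * Item headers and internal nodes are ignored in this sketch. */
static unsigned long packed_bytes(unsigned long nfiles, unsigned long fsize)
{
    unsigned long total = nfiles * fsize;
    return (total + BLOCK - 1) / BLOCK * BLOCK;
}
```

For 10,000 files of 100 bytes this works out to roughly a 40x difference, which is the sort of order-of-magnitude gap for small files claimed here and in the paper's conclusion.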
Loosely speaking, every month we find another performance cranny that needs work; we fix it. And every month we find some way of improving our overall general usage performance. The improvement in small file space and time performance suggests that we may now revisit a common OS design assumption that one should aggregate small objects using layers above the file system layer. Being more effective at small files does not make us less effective for other files. This is truly a general purpose FS. Our overall traditional FS usage performance is high enough to establish that. ReiserFS has a commitment to opening up the FS design to contributions; we are now adding plug-ins so that you can create your own types of directories and files.

Table of Contents:

1. Introduction
2. Why Is There A Move Among Some OS Designers Towards Unifying Name Spaces?
3. Should File Boundaries Be Block Aligned?
4. Balanced Trees and Large File I/O
5. Serialization and Consistency
6. Why Aggregate Small Objects at the File System Level?
7. Tree Definitions
8. Using the Tree to Optimize Layout of Files
   1. Physical Layout
   2. Node Layout
   3. Ordering within the Tree
   4. Node Balancing Optimizations
      1. Drops (the difficult design issues in the current version that our next version can do better)
      2. Code Complexity
9. Buffering & the Preserve List
10. Lessons From Log Structured File Systems
11. Directions For the Future
12. Conclusion
13. Acknowledgments
14. Business Model and Licensing
15. References

Introduction

The author is one of many OS researchers who are attempting to unify the name spaces in the operating system in varying ways [e.g. Pike, ``The Use of Name Spaces in Plan 9'']. None of us are well funded compared with the size of the task, and I am far from an exception to this rule. The natural consequence is that we each have attacked one small aspect of the task. My contribution is in incorporating small objects into the file system name space effectively.
This implementation offers value to the average Linux user, in that it offers generally good performance compared to the current Linux file system known as ext2fs. It also saves space to an extent that is important for some applications, and convenient for most. It does extremely well for large directories, and has a variety of minor advantages. Since ext2fs is very similar to FFS and UFS in performance, the implementation also offers potential value to commercial OS vendors who desire greater than ext2fs performance without directory size issues, and who appreciate the value of a better foundation for integrating name spaces throughout the OS.

Why Is There A Move Among Some OS Designers Towards Unifying Name Spaces?

An operating system is composed of components that access other components through interfaces. Operating systems are complex enough that, like national economies, the architect cannot centrally plan the interactions of the components that it is composed of. The architect can provide a structural framework that has a marked impact on the efficiency and utility of those interactions. Economists have developed principles that govern large economic systems. Are there system principles that we might use to start a discussion of the ways increasing component interactivity via naming system design impacts the total utility of an operating system? I propose these:

* If one increases the number of other components that a particular component can interact with, one increases its expressive power and thereby its utility.
* One can increase the number of other components that a particular component can interact with either by increasing the number of interfaces it has, or by increasing the number of components that are accessible by its current interfaces.
* The cost of component interfaces dominates software design cost, like the cost of wires dominates circuit design cost.
* Total system utility tends to be proportional not to the number of components, but to the number of possible component interactions.

It is not simply the number of components that one has that determines an OS's expressive power, it is the number of opportunities to use them that determines it. The number of these opportunities is proportional to the number of possible combinations of them, and the number of possible combinations of them is determined by their connectedness. Component connectedness in OS design is determined by name space design, to much the same extent that buses determine it in circuit design. Allow me to illustrate the impact of these principles with the use of an imaginary example. Suppose two imaginary OS vendors with equally talented programmers hire two different OS architects. Suppose one of the architects centers the OS design around a single name space design that allows all of the components to access all other components via a single interface (assume this is possible, it is a theoretical example). Suppose the other allows the ten different design groups in the company that are developing components to create their own ten name spaces. Suppose that the unified name space OS architect has half of the resources of the fragmented name space OS architect and creates half as many components. While the number of components is half as large, the number of connections is (1/2)^2 / ((1/10)^2 * 10) = 2.5 times larger. If you accept my hypothesis that utility is more proportional to connections than components, then the unified operating system with half the development cost will still offer more expressive utility. That is a powerful motivation.
To return briefly to the long ago researched principles governing another member of the class of large systems, the economies of nations, it is perhaps interesting to note that Adam Smith in ``The Wealth of Nations'' engaged in substantial study of the link between the extent of interconnectedness and the development of civilization, where the extent of interconnectedness was determined by waterways, etc. The link he found for economic systems was no less crucial than what is being suggested here for the effect of component interconnectedness on the total utility of software systems. I suggest that I am merely generalizing a long established principle from another field of science, namely that total utility in large systems with components that interact to generate utility is determined by the extent of their interconnection. There are many exceptions to these principles: not all chips on a motherboard sit on the bus, and analogous considerations apply to both OS design and the economies of nations. I hope the reader will accept that space considerations make it appropriate to gloss over these, and will consider the central point that under some circumstances unifying name spaces in a design can dramatically improve the utility of an OS. That can be an enormous motivation, and it has moved a number of OS researchers in their work [e.g. Pike, ``The Use of Name Spaces in Plan 9'' and ``The Hideous Name'', http://magnum.cooper.edu:9000/~rp/html/rob.html]. Unfortunately, it is not a small technical effort to combine name spaces. To combine 10 name spaces requires, if not the effort to create 10 name spaces, perhaps an effort equivalent to creating 5 of the name spaces. Usually each of the name spaces has particular performance and semantic power requirements that require enhancing the unified name space, and it usually requires technical innovation to combine the advantages of each of the separated name spaces into a unified name space.
I would characterize none of the research groups currently approaching this unification problem as having funding equivalent to what went into creating 5 of the name spaces they would like to unify, and we are certainly no exception. For this reason I have picked one particular aspect of this larger problem for our focus: allowing small objects to effectively share the same file system interface that large objects use currently. As operating systems increase the number of their components, the higher development cost of a file system able to handle small files becomes more worth the multiplicative effect it has on OS utility, as well as its reduction of OS component interface cost.

Should File Boundaries Be Block Aligned?

Making file boundaries block aligned has a number of effects: it minimizes the number of blocks a file is spread across (which is especially beneficial for multiple block files when locality of reference across files is poor), it wastes disk and buffer space in storing every less than fully packed block, it wastes I/O bandwidth with every access to a less than fully packed block when locality of reference is present, it increases the average number of block fetches required to access every file in a directory, and it results in simpler code. The simpler code of block aligning file systems follows from not needing to create a layering to distinguish the units of the disk controller and buffering algorithms from the units of space allocation, and from not needing to optimize the packing of nodes as is done in balanced tree algorithms. For readers who have not been involved in balanced tree implementations, algorithms of this class are notorious for being much more work to implement than one would expect from their description. Sadly, they also appear to offer the highest performance solution for small files, once I remove certain simplifications from their implementation and add certain optimizations common to file system designs.
I regret that code complexity (30k lines) is a major disadvantage of the approach compared to the 6k lines of the ext2fs approach. I started our analysis of the problem with an assumption that I needed to aggregate small files in some way, and that the question was: which solution was optimal? The simplest solution was to aggregate all small files in a directory together into either a file or the directory. But any aggregation into a file or directory wastes part of the last block in the aggregation. What does one do if there are only a few small files in a directory, aggregate them into the parent of the directory? What if there are only a few small files in a directory at first, and then there are many small files? How do I decide what level to aggregate them at, and when to take them back from a parent of a directory and store them directly in the directory? As we did our analysis of these questions we realized that this problem was closely related to the balancing of nodes in a balanced tree. The balanced tree approach, by using an ordering of files which are then dynamically aggregated into nodes at a lower level, rather than a static aggregation or grouping, avoids this set of questions. In my approach I store both files and filenames in a balanced tree, with small files, directory entries, inodes, and the tail ends of large files all being more efficiently packed as a result of relaxing the requirements of block alignment, and eliminating the use of a fixed space allocation for inodes. I have a sophisticated and flexible means for arranging the aggregation of files for maximal locality of reference, through defining the ordering of items in the tree. The body of large files is stored in unformatted nodes that are attached to the tree but isolated from the effects of possible shifting by the balancing algorithms. Approaches such as [Apple] and [Holton and Das] have stored filenames but not files in balanced trees.
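"Defining the ordering of items in the tree" is the crux of this approach: a comparison function on keys decides which items become node neighbors, and hence disk neighbors. The sketch below is loosely modeled on the description in the text, with a locality field ordered ahead of per-object and per-offset fields; the field names and widths are my own invention, not the actual reiserfs key format.

```c
#include <assert.h>

/* Illustrative tree key: ordering by (locality, objectid, offset) packs a
 * directory's entries, small files, and tails near one another on disk. */
struct tree_key {
    unsigned long locality;  /* e.g. the parent directory's id */
    unsigned long objectid;  /* the object itself */
    unsigned long offset;    /* position of this item within the object */
};

/* Total order on keys; the most significant field is the locality,
 * so semantic neighbors sort (and therefore pack) together. */
static int key_cmp(const struct tree_key *a, const struct tree_key *b)
{
    if (a->locality != b->locality)
        return a->locality < b->locality ? -1 : 1;
    if (a->objectid != b->objectid)
        return a->objectid < b->objectid ? -1 : 1;
    if (a->offset != b->offset)
        return a->offset < b->offset ? -1 : 1;
    return 0;
}
```

Because locality is compared first, two small files in the same directory sort adjacent even if unrelated files were created between them in time: this is the semantic packing contrasted earlier with the temporal packing of log structured file systems.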
None of the file systems C-FFS, NTFS, or XFS aggregate files; all of them block align files, though all of them also do some variation on storing small files in the statically allocated block address fields of inodes if they are small enough to fit there. [C-FFS] has published an excellent discussion of both their approach and why small files rob a conventional file system of performance more in proportion to the number of small files than the number of bytes consumed by small files. However, I must note that their notion of what constitutes small is different from ours by one or two orders of magnitude. Their use of an exokernel is simply an excellent approach for operating systems that have that as an available option. Semantics (files), packing (blocks/nodes), caching (read ahead sizes, etc.), and the hardware interfaces of disk (sectors) and paging (pages) all have different granularity issues associated with them: a central point of our approach is that the optimal granularity of these often differs, and abstracting these into separate layers in which the granularity of one layer does not unintentionally impact other layers can improve space/time performance. Reiserfs innovates in that its semantic layer often conveys to the other layers an ungranulated ordering rather than one granulated by file boundaries. The reader is encouraged to note, while reading the algorithms, the areas in which reiserfs needs to go farther in doing so.

Balanced Trees and Large File I/O

There has long been an odd informal consensus that balanced trees are too slow for use in storing large files, perhaps originating in the performance of databases that have attempted to emulate file systems using balanced tree algorithms that were not originally architected for file system access patterns or their looser serialization requirements.
It is hopefully easy for the reader to understand that storing many small files and tail ends of files in a single node, where they can all be fetched in one I/O, leads directly to higher performance. Unfortunately, it is quite complex to understand the interplay between I/O efficiency and block size for larger files, and space does not allow a systematic review of traditional approaches. The reader is referred to [FFS], [Peacock], [McVoy], [Holton and Das], [Bach], [OLE], and [NTFS] for treatments of the topic, and discussions of various means of 1) reducing the effect of block size on CPU efficiency, 2) eliminating the need for inserting rotational delay between successive blocks, 3) placing small files into either inodes or directories, and 4) performing read-ahead. More commentary on these is in the annotated bibliography. Reiserfs has the following architectural weaknesses that stem directly from the overhead of repacking to save space and increase block size:

1) when the tail (files < 4k are all tail) of a file grows large enough to occupy an entire node by itself, it is removed from the formatted node(s) it resides in and converted into an unformatted node ([FFS] pays a similar conversion cost for fragments);

2) a tail that is smaller than one node may be spread across two nodes, which requires more I/O to read if locality of reference is poor;

3) aggregating multiple tails into one node introduces separation of a file's body from its tail, which reduces read performance ([FFS] has a similar problem, and for reiserfs files near the node in size the effect can be significant);

4) when you add one byte to a file or tail that is not the last item in a formatted node, on average half of the whole node is shifted in memory.

If any of your applications perform I/O in such a way that they generate many small unbuffered writes, reiserfs will make you pay a higher price for not being able to buffer the I/O.
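The fourth weakness, the in-memory shift on insertion, can be sketched as follows. This is an invented illustration of the cost, not reiserfs source code:

```c
/* Illustration of the in-node shift cost described above: inserting
 * len bytes at offset pos of a node's byte array moves everything
 * after pos.  When pos averages the middle of the node, about half
 * the node's bytes are moved per insertion.  Invented names; not
 * reiserfs source. */
#include <string.h>

enum { NODE_SIZE = 4096 };

/* used: bytes currently occupied; caller guarantees used + len <= NODE_SIZE. */
static void node_insert(unsigned char *node, unsigned used, unsigned pos,
                        const unsigned char *data, unsigned len)
{
    memmove(node + pos + len, node + pos, used - pos); /* shift the tail up */
    memcpy(node + pos, data, len);                     /* place the new bytes */
}
```

Buffered writes amortize this shift across many bytes at once, which is why unbuffered one-byte writes are the worst case.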
Most applications that create substantial file system load employ effective I/O buffering, often simply as a result of using the I/O functions in the standard C libraries. By avoiding accesses in small blocks/extents reiserfs improves I/O efficiency. Extent-based file systems such as VxFS, and write-clustering systems such as ext2fs, are not so effective in applying these techniques that they choose to use 512-byte blocks rather than 1k blocks as their defaults. Ext2fs reports a 20% speedup when 4k rather than 1k blocks are used, but the authors of ext2fs advise the use of 1k blocks to avoid wasting space. There are a number of worthwhile large file optimizations that have not been added to either ext2fs or reiserfs, and both file systems are somewhat primitive in this regard, reiserfs being the more primitive of the two. Large files simply were not my research focus, and as this was a small research project I did not implement the many well known techniques for enhancing large file I/O. The buffering algorithms are probably more crucial than any other component in large file I/O, and partly out of a desire for a fair comparison of the approaches I have not modified them. I have added no significant optimizations for large files, beyond increasing the block size, that are not found in ext2fs. Except for the size of the blocks, there is not a large inherent difference between 1) the cost of adding a pointer to an unformatted node to my tree plus writing the node, and 2) adding an address field to an inode plus writing the block. It is likely that, except for block size, the primary determinants of high performance large file access are orthogonal to the decision of whether to use balanced tree algorithms for small and medium sized files. For large files we get some advantage from our tree not being less balanced than the tree formed by an inode which points to a triple indirect block. We have no easy method for measuring the performance gain from that, though.
There is performance overhead due to the memory bandwidth cost of balancing nodes for small files. We think it is worth it though.

Serialization and Consistency

The issues of ensuring recoverability with minimal serialization and data displacement necessarily dominate high performance design. Let's define the two extremes in serialization so that the reason for this can be clear. Consider the relative speed of a set of I/Os in which every block request in the set is fed to the elevator algorithms of the kernel and the disk drive firmware fully serially, each request awaiting the completion of the previous request. Now consider the other extreme, in which all block requests are fed to the elevator algorithms together, so that they may all be sorted and performed in close to their sorted order (disk drive firmwares don't use a pure elevator algorithm). The unserialized extreme may be more than an order of magnitude faster, due to the cost of rotations and seeks. Unnecessarily serializing I/O prevents the elevator algorithm from doing its job of placing all of the I/Os in their layout sequence rather than their chronological sequence. Most high performance design centers around making I/Os in the order they are laid out on disk, and laying out blocks on disk in the order that the I/Os will want to be issued. [Snyder] discusses a file system that obtains high performance from a complete lack of disk synchronization, but is only suitable for temporary files that don't need to survive reboot. I think its known value to Solaris users indicates that the optimal buffering policy varies from file to file. [Ganger] discusses methods for using ordering of writes rather than serialization for ensuring conventional file system meta-data integrity; [McVoy] previously suggested, but did not implement, ordering of buffer writes.
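The two serialization extremes can be made concrete with a toy model in which the cost of servicing a batch of block requests is the total head travel. The block numbers and the linear cost model are invented for illustration; real firmware behavior is more complicated:

```c
/* Toy model of the two serialization extremes described above: the
 * same block requests serviced in chronological order versus sorted
 * (elevator) order, with cost measured as total linear head travel.
 * Block numbers and the cost model are invented for illustration. */
#include <stdlib.h>

static int cmp_block(const void *a, const void *b)
{
    unsigned long x = *(const unsigned long *)a;
    unsigned long y = *(const unsigned long *)b;
    return (x > y) - (x < y);
}

/* Total head movement when requests are serviced in the given order,
 * starting from block 0. */
static unsigned long head_travel(const unsigned long *req, int n)
{
    unsigned long travel = 0, pos = 0;
    for (int i = 0; i < n; i++) {
        travel += (req[i] > pos) ? req[i] - pos : pos - req[i];
        pos = req[i];
    }
    return travel;
}

/* Elevator order: sort the whole batch before servicing it. */
static unsigned long elevator_travel(unsigned long *req, int n)
{
    qsort(req, n, sizeof req[0], cmp_block);
    return head_travel(req, n);
}
```

For a batch that alternates between the two ends of the disk, the sorted order's travel is a small fraction of the chronological order's, which is the point of feeding requests to the elevator together rather than one at a time.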
Ext2fs is fast in substantial part due to avoiding synchronous writes of metadata, and I have much personal experience with it that leads me to prefer compiles that are fast. [I would like to see it adopt a policy that all dirty buffers for files not flagged as temporary are queued for writing, and that the existence of a dirty buffer means that the disk is busy. This will require replacing buffer I/O locking with copy-on-write, but an idle disk is such a terrible thing to waste. :-) ] [NTFS] by default adds unnecessary serialization to an extent that even older file systems such as [FFS] do not, and its performance characteristics reflect that. In fairness, it should be said that this is the superior approach for most removable media without software control of ejection (e.g. IBM PC floppies). Reiserfs employs a new scheme called preserve lists for ensuring recoverability, which avoids overwriting old meta-data by writing new meta-data nearby rather than over the old.

Why Aggregate Small Objects at the File System Level?

There has long been a tradition of file system developers deciding that effective handling of small files is not significant to performance, and of application programmers who care about performance not storing small objects as separate entities in the file system. To store small objects one may either make the file system efficient for the task, or sidestep the problem by aggregating small objects in a layer above the file system. Sidestepping the problem has three disadvantages: utility, code complexity, and performance. Utility and Code Complexity: Allowing OS designers to effectively use a single namespace with a single interface for both large and small objects decreases coding cost and increases the expressive power of components throughout the OS.
I feel reiserfs shows the effects of a larger development investment focused on a simpler interface when compared with many solutions for this currently available in the object oriented toolkit community, such as the Structured Storage available in Microsoft's [OLE]. By simpler I mean I added nothing to the file system API to distinguish large and small objects, and I leave it to the directory semantics and archiving programs to aggregate objects. Multiple layers cost more to implement, cost more to code the interfaces for utilizing, and provide less flexibility. Performance: It is most commonly the case that when one layers one file system on top of another the performance is substantially reduced, and Structured Storage is not an exception to this general rule. Reiserfs, which does not attempt to delegate the small object problem to a layer above, avoids this performance loss. I have heard it suggested by some that this layering avoids the performance loss from syncing on file close as many file systems do. I suggest that this is adding an error to an error rather than fixing it. Let me make clear that I believe those who write such layers above the file system do not do so out of stupidity. I know of at least one company at which a solution that layers small object storage above the file system exists because the file system developers refused to listen to the non-file system group's description of its needs, and the file system group had to be sidestepped in generating the solution. Current file systems are fairly well designed for the purposes that their users currently use them for: my goal is to change file size usage patterns. The author remembers arguments that once showed clearly that there was no substantial market need for disk drives larger than 10MB based on current usage statistics. 
While [C-FFS] points out that 80% of file accesses are to files below 10k, I do not believe it reasonable to attempt, based on usage measurements of file systems for which small files are inappropriate to use, to provide statistics showing that small files are frequently used. Application programmers are smarter than that. Currently 80% of file accesses are to the first order of magnitude in file size for which it is currently sensible to store the object in the file system. I regret that one can only speculate as to whether, once file systems become effective for small files and database tasks, usage patterns will change to 80% of file accesses being to files of less than 100 bytes. What I can do is show, via the 80/20 Banded File Set Benchmark presented later, that in such circumstances small file performance potentially dominates total system performance. In summary, the on-going reinvention of incompatible object aggregation techniques above the file system layer is expensive, less expressive, less integrated, slower, and less efficient in its storage than incorporating balanced tree algorithms into the file system.

Tree Definitions

Balanced trees are used in databases, and more generally wherever a programmer needs to search and store to non-random-access memory by a key, and has the time to code it this way. The usual evolution for programmers is to first think that hashing will be simpler and more efficient, and then to realize only after getting into the sordid details that the combination of space efficiency, minimized disk accesses, and the feasibility of caching the top part of the tree makes the tree approach more effective. It is the usual thing to first try to do hashing, and then by the time the details are worked out, to have a balanced tree. The cost of effectively handling bucket overflow just isn't less than the cost of balancing, unless the buckets are always all in RAM.
Hashing is often a good solution when there is no non-random-access memory involved, such as when hashing a cache. The Linux dcache code uses hashing for accessing a cache of in-memory directory entries. Sometimes one uses partial or full hashing of keys within a balanced tree. If you do full hashing within a tree, and you cache the top part of that tree, you have something rather similar to extensible hashing, except that it is more flexible and efficient. Sometimes programmers code using unbalanced trees; most filesystems do essentially that. Balanced trees generally do a better job of minimizing the average number of disk accesses. There is literature establishing that balanced trees are optimal for the worst case when there is no caching of the tree. This is rather pointless literature, as the average case when cached is what is important, and I am afraid that the existing literature proves that which is feasible to prove rather than that which is relevant. That said, practitioners know from experience that making the tree less balanced leads to more I/Os. Discussions of the exceptions to this are rather interesting, but not for here. I regret that I must assume that the reader is familiar with basic balanced tree algorithms [Wood], [Lewis and Denenberg], [Knuth], [McCreight]. No attempt will be made to survey tree design here, since balanced trees are one of the most researched and complex topics in algorithm theory and require treatment at length. I must compound this discourtesy with a concise set of definitions that sorely lack accompanying diagrams; my apologies. Finally, I'll truly annoy the reader by saying that the header files contain nice ascii art, and if you want the full definition of the structures, the source is the place. Classically, balanced trees are designed with the set of keys assumed to be defined by the application, and the purpose of the tree design is to optimize searching through those keys.
In my approach the purpose of the tree is to optimize the reference locality and space-efficient packing of objects, and the keys are defined as best optimizes the algorithm for that. Keys are used in place of inode numbers in the file system, thereby substituting a mapping of keys to node locations (the internal nodes) for a mapping of inode numbers to file locations. Keys are longer than inode numbers, but one needs to cache fewer of them than one would need to cache inode numbers when more than one file is stored in a node. In my tree, I still require that a filename be resolved one component at a time. It is an interesting topic for future research whether this is necessary or optimal. This is a more complex issue than a casual reader might realize: directory-at-a-time lookup accomplishes a form of compression, makes mounting other name spaces and file system extensions simpler, makes security simpler, and makes future enhanced semantics simpler. Since small files typically lead to large directories, it is fortuitous that, as a natural consequence of our use of tree algorithms, our directory mechanisms are much more effective for very large directories than those of most other file systems (notable exceptions include [Holton and Das]). The tree has three node types: internal nodes, formatted nodes, and unformatted nodes. The contents of internal and formatted nodes are sorted in the order of their keys. (Unformatted nodes contain no keys.) Internal nodes consist of pointers to sub-trees separated by their delimiting keys. The key that precedes a pointer to a sub-tree is a duplicate of the first key in the first formatted node of that sub-tree. Internal nodes exist solely to allow determining which formatted node contains the item corresponding to a key. ReiserFS starts at the root node, examines its contents, and based on them determines which subtree contains the item corresponding to the desired key.
From the root node reiserfs descends into the tree, branching at each node, until it reaches the formatted node containing the desired item. The first (bottom) level of the tree consists of unformatted nodes, the second level consists of formatted nodes, and all levels above consist of internal nodes. The highest level contains the root node. The number of levels is increased as needed by adding a new root node at the top of the tree. All paths from the root of the tree to the formatted leaves are equal in length, and all paths to the unformatted leaves are also equal in length, one node longer than the paths to the formatted leaves. This equality in path length, and the high fanout it provides, are vital to high performance, and in the Drops section I will describe how the lengthening of the path that occurred as a result of introducing the [BLOB] approach (the use of indirect items and unformatted nodes) proved a measurable mistake. Formatted nodes consist of items. Items have four types: direct items, indirect items, directory items, and stat data items. All items contain a key which is unique to the item. This key is used to sort, and find, the item. Direct items contain the tails of files, a tail being the last part of a file (the last file_size modulo FS-block-size bytes of the file). Indirect items consist of pointers to unformatted nodes. All but the tail of a file is contained in its unformatted nodes. Directory items contain the key of the first directory entry in the item followed by a number of directory entries. Depending on the configuration of reiserfs, stat data may be stored as a separate item, or it may be embedded in a directory entry. We are still benchmarking to determine which way is best. A file consists of a set of indirect items followed by a set of up to two direct items, with the existence of two direct items representing the case when a tail is split across two nodes.
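The descent step described above can be sketched as follows, using a simplified in-memory key rather than the on-disk struct key presented later. Because the key preceding a child pointer duplicates the first key of that child's subtree, the child to follow is indexed by the number of delimiting keys less than or equal to the search key:

```c
/* Sketch of choosing which child pointer to follow in an internal
 * node: keys[i] delimits children i and i+1 and equals the first key
 * of the subtree under pointer i+1, so the child to descend into is
 * indexed by the number of keys <= the search key.  Simplified
 * in-memory key; not the on-disk layout. */
struct tree_key { unsigned dir_id, object_id, offset, uniqueness; };

/* Lexicographic comparison over the four key components. */
static int key_cmp(const struct tree_key *a, const struct tree_key *b)
{
    if (a->dir_id != b->dir_id)         return a->dir_id < b->dir_id ? -1 : 1;
    if (a->object_id != b->object_id)   return a->object_id < b->object_id ? -1 : 1;
    if (a->offset != b->offset)         return a->offset < b->offset ? -1 : 1;
    if (a->uniqueness != b->uniqueness) return a->uniqueness < b->uniqueness ? -1 : 1;
    return 0;
}

/* Binary search: return the index (0..nkeys) of the child to follow. */
static int child_index(const struct tree_key *keys, int nkeys,
                       const struct tree_key *target)
{
    int lo = 0, hi = nkeys;
    while (lo < hi) {
        int mid = lo + (hi - lo) / 2;
        if (key_cmp(&keys[mid], target) <= 0)
            lo = mid + 1;   /* target lies at or right of keys[mid] */
        else
            hi = mid;
    }
    return lo;
}
```

Repeating this step from the root downward reaches the formatted node containing the item in a number of node reads equal to the tree height.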
If a tail is larger than the maximum size of a file that can fit into a formatted node but is smaller than the unformatted node size (4k), then it is stored in an unformatted node, and a pointer to it plus a count of the space used is stored in an indirect item. Directories consist of a set of directory items. Directory items consist of a set of directory entries. Directory entries contain the filename and the key of the file which is named. There is never more than one item of the same item type from the same object stored in a single node (there is no reason one would want two separate items rather than combining them). The first item of a file or directory contains its stat data. When performing balancing, and analyzing the packing of a node and its two neighbors, we ensure that the three nodes cannot be compressed into two nodes. I feel greater compression than this is best left to an FS cleaner to perform rather than attempting it dynamically.

ReiserFS structures

The ReiserFS tree has Max_Height = N (current default: N = 5). The tree resides in disk blocks; each disk block that belongs to the reiserfs tree begins with a block head.

An internal node of the tree holds keys and pointers to disk blocks:

Block_Head | Key 0 | Key 1 | Key 2 | --- | Key N | Pointer 0 | Pointer 1 | Pointer 2 | --- | Pointer N | Pointer N+1 | ..Free Space..

A leaf node of the tree holds items and item headers:

Block_Head | IHead 0 | IHead 1 | IHead 2 | --- | IHead N | ......Free Space...... | Item N | --- | Item 2 | Item 1 | Item 0

An unformatted node of the tree holds the data of a big file (no headers, just data).

ReiserFS objects: files and directories. Max number of objects = 2^32-4 = 4,294,967,292. Each object is a number of items. File items: 1.
StatData item + [Direct item] (for small files: size from 0 bytes to MAX_DIRECT_ITEM_LEN = blocksize - 112 bytes) 2. StatData item + InDirect item + [Direct item] (for big files: size > MAX_DIRECT_ITEM_LEN bytes). Directory items: 1. StatData item + Directory item. Every reiserfs object has an Object ID and a Key.

Internal Node structures

An internal node of the tree holds keys and pointers to disk blocks:

Block_Head | Key 0 | Key 1 | Key 2 | --- | Key N | Pointer 0 | Pointer 1 | Pointer 2 | --- | Pointer N | Pointer N+1 | ..Free Space..

struct block_head

  Field Name           Type            Size (bytes)  Description
  blk_level            unsigned short  2             level of the block in the tree (1 = leaf; 2, 3, 4, ... = internal)
  blk_nr_item          unsigned short  2             number of keys in an internal block, or number of items in a leaf block
  blk_free_space       unsigned short  2             free space in the block, in bytes
  blk_right_delim_key  struct key      16            right delimiting key for this block (leaf nodes only)
  total: 6 (padded to 8) bytes for internal nodes; 22 (padded to 24) bytes for leaf nodes

struct key

  Field Name    Type   Size (bytes)  Description
  k_dir_id      __u32  4             ID of the parent directory
  k_object_id   __u32  4             ID of the object (also the inode number)
  k_offset      __u32  4             offset from the beginning of the object to the current byte of the object
  k_uniqueness  __u32  4             type of the item (StatData = 0, Direct = -1, InDirect = -2, Directory = 500)
  total: 16 bytes

struct disk_child (pointer to a disk block)

  Field Name       Type            Size (bytes)  Description
  dc_block_number  unsigned long   4             disk child's block number
  dc_size          unsigned short  2             disk child's used space
  total: 6 (padded to 8) bytes

Leaf Node structures

A leaf node of the tree holds items and item headers:

Block_Head | IHead 0 | IHead 1 | IHead 2 | --- | IHead N | ......Free Space...... | Item N | --- | Item 2 | Item 1 | Item 0

The block head is as above (struct block_head, 22 bytes padded to 24 for leaf nodes, including the right delimiting key). Everything in the filesystem is stored as a set of items, and each item has an item_head. The item_head contains the key of the item and its free space (for indirect items), and specifies the location of the item itself within the block.

struct item_head (IHead)

  Field Name                          Type        Size (bytes)  Description
  ih_key                              struct key  16            key used to search for the item; all item headers are sorted by this key
  u.ih_free_space / u.ih_entry_count  __u16       2             free space in the last unformatted node for an InDirect item; 0xFFFF for a Direct or StatData item; the number of directory entries for a Directory item
  ih_item_len                         __u16       2             total size of the item body
  ih_item_location                    __u16       2             offset of the item body within the block
  ih_reserved                         __u16       2             used by reiserfsck
  total: 24 bytes

There are 4 types of items: stat_data items, directory items, indirect items, and direct items.

struct stat_data (the reiserfs version of the UFS disk inode, minus the address blocks)

  Field Name            Type   Size (bytes)  Description
  sd_mode               __u16  2             file type and permissions
  sd_nlink              __u16  2             number of hard links
  sd_uid                __u16  2             owner id
  sd_gid                __u16  2             group id
  sd_size               __u32  4             file size
  sd_atime              __u32  4             time of last access
  sd_mtime              __u32  4             time the file was last modified
  sd_ctime              __u32  4             time the inode (stat data) was last changed (except for changes to sd_atime and sd_mtime)
  sd_rdev               __u32  4             device
  sd_first_direct_byte  __u32  4             offset from the beginning of the file to the first byte of the file's direct item: -1 for a directory; 1 for small files (direct items only); >1 for big files (indirect and direct items); -1 for big files with indirect items but no direct item
  total: 32 bytes

Directory item: deHead 0 | deHead 1 | deHead 2 | --- | deHead N | fileName N | --- | fileName 2 | fileName 1 | fileName 0

Direct item: ......Small File Body......

InDirect item: unfPointer 0 | unfPointer 1 | unfPointer 2 | --- | unfPointer N, where each unfPointer is a 4-byte pointer to an unformatted block. Unformatted blocks contain the body of a big file.

struct reiserfs_de_head (deHead)

  Field Name    Type   Size (bytes)  Description
  deh_offset    __u32  4             third component of the directory entry key (all reiserfs_de_heads are sorted by this value)
  deh_dir_id    __u32  4             objectid of the parent directory of the object referenced by the entry
  deh_objectid  __u32  4             objectid of the object referenced by the entry
  deh_location  __u16  2             offset of the name within the whole item
  deh_state     __u16  2             flags: 1) entry contains stat data (for the future); 2) entry is hidden (unlinked)
  total: 16 bytes

fileName is the name of the file, an array of bytes of variable length. Max length of a file name = blocksize - 64 (for a 4k blocksize, max name length = 4032 bytes).

Using the Tree to Optimize Layout of Files

There are four levels at which layout optimization is performed: 1) the mapping of logical block numbers to physical locations on disk, 2) the assigning of nodes to logical block numbers, 3) the ordering of objects within the tree, and 4) the balancing of the objects across the nodes they are packed into.

Physical Layout

For SCSI drives this mapping is performed by the disk drive manufacturer; for IDE drives the logical-block-number-to-physical-location mapping is done by the device driver; and for all drives it is also potentially done by volume management software. The logical block number to physical location mapping by the drive manufacturer is usually done using cylinders.
I agree with the authors of [ext2fs] and most others that the significant file placement feature of FFS was not the actual cylinder boundaries, but placing files and their inodes on the basis of their parent directory's location. FFS used explicit knowledge of actual cylinder boundaries in its design. I find that minimizing the distance in logical blocks of semantically adjacent nodes, without tracking cylinder boundaries, accomplishes an excellent approximation of optimizing according to actual cylinder boundaries, and I find its simplicity an aid to implementation elegance.

Node Layout

When I place nodes of the tree on the disk, I search for the first empty block in the bitmap (of used block numbers), starting at the location of the left neighbor of the node in the tree ordering and moving in the direction I last moved in. This was experimentally found to be better than the following alternatives for the benchmarks employed: 1) taking the first non-zero entry in the bitmap, 2) taking the entry after the last one that was assigned, in the direction last moved in (this was 3% faster for writes and 10-20% slower for subsequent reads), 3) starting at the left neighbor and moving in the direction of the right neighbor. When changing block numbers for the purpose of avoiding overwriting sending nodes before shifted items reach disk in their new recipient node (see the description of preserve lists later in the paper), the benchmarks employed were ~10% faster when starting the search from the left neighbor rather than from the node's current block number, even though it adds significant overhead to determine the left neighbor (the current implementation risks I/O to read the parent of the left neighbor). It used to be that we would reverse direction when we reached the end of the disk drive.
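The search just described can be sketched as follows, with a one-byte-per-block bitmap for clarity (the real bitmap is bit-packed, and the names here are invented):

```c
/* Sketch of the block-allocation search described above: scan a
 * used-block bitmap for the first free block, starting at the left
 * neighbor's position and moving in the direction last moved in.
 * One byte per block for clarity; the real bitmap is bit-packed,
 * and these names are invented. */
static long find_free_block(const unsigned char *used, long nblocks,
                            long start, int dir /* +1 or -1 */)
{
    for (long b = start; b >= 0 && b < nblocks; b += dir)
        if (!used[b])
            return b;       /* first empty entry in that direction */
    return -1;              /* no free block before hitting the edge */
}
```

Starting from the left neighbor keeps newly allocated nodes close to their neighbors in the tree ordering, which is the point of the whole scheme.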
Fortunately we checked to see whether it makes a difference which direction one moves in when allocating blocks to a file, and indeed we found that it made a significant difference to always allocate in the increasing block number direction. We hypothesize that this is due to matching the disk spin direction by allocating using increasing block numbers.

Ordering within the Tree

While I give here an example of how I have defined keys to optimize locality of reference and packing efficiency, I would like to stress that key definition is a powerful and flexible tool that I am far from finished experimenting with. Some key definition decisions depend very much on usage patterns, and this means that someday one will select from several key definitions when creating the file system. For example, consider the decision of whether to pack all directory entries together at the front of the file system, or to pack the entries near the files they name. For large file usage patterns one should pack all directory items together, since systems with such usage patterns are effective in caching the entries for all directories. For small files the name should be near the file. Similarly, for large files the stat data should be stored separately from the body, either with the other stat data from the same directory, or with the directory entry. (It was likely a mistake not to assign stat data its own key in the current implementation, as packing it in with direct and indirect items complicates our code for handling those items, and prevents me from easily experimenting with the effects of changing its key assignment.) It is not necessary for a file's packing to reflect its name; that is merely my default. My next release will offer the option of overriding the default for each file by use of a system call.
It is feasible to pack an object completely independently of its semantics using these algorithms, and I predict that there will be many applications, perhaps even most, for which a packing different from that determined by object names is more appropriate. Currently the mandatory tying of packing locality to semantics results in the distortion of both semantics and packing from what might otherwise be their independent optimums, much as tying block boundaries to file boundaries distorts I/O and space allocation algorithms from their separate optimums. For example, placing most files accessed while booting at the start of the disk, in their access order, is a very tempting future optimization that the use of packing localities makes feasible to consider.

The Structure of a Key

Each file item has a key with the structure <locality_id, object_id, offset, uniqueness>. The locality_id is by default the object_id of the parent directory. The object_id is the unique id of the file, and is set to the first unused objectid when the object is created. The tendency of this to result in successive object creations in a directory being adjacently packed is often fortuitous for many usage patterns. For files the offset is the offset within the logical object of the first byte of the item. In version 0.2 all directory entries had their own individual keys stored with them and were each distinct items; in the current version I store one key in the item, which is the key of the first entry, and compute each entry's key as needed from the one key stored in the item. For directories the offset key component is the first four bytes of the filename, which you may think of as a lexicographic rather than numeric offset. For directory items the uniqueness field differentiates filename entries identical in the first 4 bytes.
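The key assignment just described can be sketched in a simplified in-memory form (not the on-disk struct): the locality_id defaults to the parent directory's objectid, so objects created in one directory sort, and therefore pack, together; for a directory entry the offset component is the name's first four bytes:

```c
/* Sketch of the key assignment described above.  Simplified in-memory
 * form; not the on-disk struct key. */
#include <stdint.h>
#include <string.h>

struct fs_key { uint32_t locality_id, object_id, offset, uniqueness; };

/* Key of byte `offset` of a file created in directory `parent_oid`:
 * the locality comes from the parent, so siblings pack together. */
static struct fs_key file_key(uint32_t parent_oid, uint32_t oid,
                              uint32_t offset, uint32_t uniqueness)
{
    struct fs_key k = { parent_oid, oid, offset, uniqueness };
    return k;
}

/* Offset component for a directory entry: the name's first four
 * bytes, packed big-endian so that numeric order matches
 * lexicographic order; short names are zero-padded. */
static uint32_t name_offset(const char *name)
{
    unsigned char b[4] = { 0, 0, 0, 0 };
    size_t n = strlen(name);
    memcpy(b, name, n < 4 ? n : 4);
    return ((uint32_t)b[0] << 24) | ((uint32_t)b[1] << 16)
         | ((uint32_t)b[2] << 8)  |  (uint32_t)b[3];
}
```

Because entries with identical first four bytes collapse to one offset value, the uniqueness field is needed to differentiate them, as the text notes.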
For all item types the uniqueness field indicates the item type, and for the leftmost item in a buffer it indicates whether the preceding item in the tree is of the same type and object as this item. Placing this information in the key is useful when analyzing balancing conditions, but increases key length for non-directory items, and is a questionable architectural feature. Every file has a unique objectid, but this cannot be used for finding the object; only keys are used for that. Objectids merely ensure that keys are unique. If you never use the reiserfs features that change an object's key then the key is immutable; otherwise it is mutable. (This feature aids support for NFS daemons, etc.) We spent quite some time debating internally whether the use of mutable keys for identifying an object had deleterious long term architectural consequences: in the end I decided it was acceptable iff we require any object recording a key to possess a method for updating its copy of it. This is the architectural price of avoiding caching a map of objectid to location, a map that might have very poor locality of reference due to objectids not changing with object semantics. I pack an object with the packing locality of the directory it was first created in unless the key is explicitly changed, and it remains packed there even if it is unlinked from the directory. I do not move it from the locality it was created in without an explicit request, unlike the [C-FFS] approach, which stores all multiple-link files together and pays the cost of moving them from their original locations when the second link occurs. I think a file linked into multiple directories might as well get at least the locality-of-reference benefits of one of those directories. In summary, this approach 1) places files from the same directory together, and 2) places directory entries from the same directory together with each other and with the stat data for the directory.
Note that there is no interleaving of objects from different directories in the ordering at all, and that all directory entries from the same directory are contiguous. You'll note that this does not accomplish packing the files of small directories with common parents together, and does not employ the full partial ordering in determining the linear ordering; it merely uses parent directory information. I feel the proper place for employing full tree structure knowledge is in the implementation of an FS cleaner, not in the dynamic algorithms.

== Node Balancing Optimizations ==

When balancing nodes I do so according to the following ordered priorities:

# minimize the number of nodes used
# minimize the number of nodes affected by the balancing operation
# minimize the number of uncached nodes affected by the balancing operation
# if shifting to another formatted node is necessary, maximize the bytes shifted

Priority 4 is based on the assumption that the location of an insertion of bytes into the tree is an indication of the likely future location of an insertion, and that this policy will on average reduce the number of formatted nodes affected by future balance operations. There are more subtle effects as well: if one randomly places nodes next to each other, and one has a choice between those nodes being mostly moderately efficiently packed or packed to an extreme of either well or poorly packed, one is more likely to be able to combine more of the nodes if one chooses the policy of extremism. Extremism is a virtue in space-efficient node packing. The maximal shift policy is not applied to internal nodes, as extremism is not a virtue in time-efficient internal node balancing.

== Drops ==

(These are the difficult design issues in the current version that our next version can do better.) Consider dividing a file or directory into drops, with each drop having a separate key, and no two drops from one file or directory occupying the same node without being compressed into one drop.
The key for each drop is set to the key for the object (file or directory) plus the offset of the drop within the object. For directories the offset is lexicographic and by filename; for files it is numeric and in bytes. In the course of several file system versions we have experimented with and implemented solid, liquid, and air drops. Solid drops were never shifted, and drops would only solidify when they occupied the entirety of a formatted node. Liquid drops are shifted in such a way that any liquid drop which spans a node fully occupies the space in its node: like a physical liquid it is shiftable but not compressible. Air drops merely meet the balancing condition of the tree. Reiserfs 0.2 implemented solid drops for all but the tail of files. If a file was at least one node in size, the start of the file was aligned with the start of a node, block-aligning the file. This block alignment of the start of multi-drop files was a design error that wasted space: even if the locality of reference is so poor as to make one not want to read parts of semantically adjacent files, if the nodes are near to each other then the cost of reading an extra block is thoroughly dwarfed by the cost of the seek and rotation to reach the first node of the file. As a result, the block alignment saves little time, though it costs significant space for 4-20k files. Reiserfs with block alignment of multi-drop files and no indirect items experienced the following rather interesting behavior, which was partially responsible for making it only 88% space efficient for files that averaged 13k in size (the Linux kernel sources): when the tail of a larger-than-4k file was followed in the tree ordering by another file larger than 4k, then since the drop before was solid and aligned and the drop afterwards was solid and aligned, the tail occupied an entire node no matter what size it was.
In the current version we place all but the tail of large files into a level of the tree reserved for full unformatted nodes, and create indirect items in the formatted nodes which point to the unformatted nodes. This is known in the database literature as the [BLOB] approach. This extra level added to the tree comes at the cost of making the tree less balanced (I consider the unformatted nodes pointed to as part of the tree) and increasing the maximal depth of the tree by 1. For medium-sized files, the use of indirect items increases the cost of caching pointers by mixing data with them. The reduction in fanout often causes the read algorithms to fetch only one node of the file at a time, as one waits to read the uncached indirect item before reading the node with the file data. There are more parents per file read with the use of indirect items than with internal nodes, as a direct result of the reduced fanout caused by mixing tails and indirect items in the node. The most serious flaw is that these reads of the various nodes necessary to reading the file require additional rotations and seeks compared to the case with drops. With my initial drop approach they are usually sequential in their disk layout, even the tail, and the internal node parent points to all of them in such a way that all of them that are contained by that parent, or by another internal node in cache, can be requested at once in one sequential read. Non-sequential reads of nodes are more than an order of magnitude more costly than sequential reads, and this single consideration dominates effective read optimization. Unformatted nodes make file system recovery faster but less robust, in that one reads their indirect item rather than the nodes themselves to insert them into the recovered tree, and one cannot read them to confirm that their contents are from the file that an indirect item says they are from. In this they make reiserfs similar to an inode-based system without logging.
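A rough sketch of the indirect-item idea described above: a formatted node holds an array of block numbers pointing at full unformatted nodes, while the remainder of the file stays in a direct (tail) item. All names, sizes, and the lookup helper are hypothetical illustrations, not the reiserfs code:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define NODE_SIZE 4096  /* assumed node size for illustration */

/* Hypothetical indirect item: one block number per full unformatted
 * node of file body; the tail lives separately in a direct item. */
struct indirect_item {
    uint32_t nblocks;      /* number of unformatted nodes */
    uint32_t blocknr[];    /* block number of each unformatted node */
};

/* Map a byte offset to where its data lives: full nodes are reached
 * through the indirect item, the remainder is in the tail.
 * Returns the block number, or -1 past EOF, or -2 for the tail. */
static int32_t block_for_offset(const struct indirect_item *ind,
                                uint64_t offset, uint64_t file_size)
{
    uint64_t tail_start = (uint64_t)ind->nblocks * NODE_SIZE;
    if (offset >= file_size)
        return -1;
    if (offset >= tail_start)
        return -2;
    return (int32_t)ind->blocknr[offset / NODE_SIZE];
}
```

The extra hop through `blocknr[]` is exactly the cost discussed above: the indirect item must be read before any data node of the file can be fetched.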
A moderately better solution would have been to simply eliminate the requirement that multi-node files start at the start of nodes, rather than introducing BLOBs, and to depend on the use of a file system cleaner to optimally pack the 80% of files that don't move frequently, using algorithms that move even solid drops. Yet that still leaves the problem of formatted nodes not being efficient for mmap() purposes (one must copy them before writing rather than merely modifying their page table entries, and memory bandwidth is expensive even if CPU is cheap). For this reason I have the following plan for the next version. I will have three trees: one tree maps keys to unformatted nodes, one tree maps keys to formatted nodes, and one tree maps keys to directory entries and stat_data. Now it is only natural if you are thinking that this would mean that to read a file, accessing first the directory entry and stat_data, then the unformatted node, then the tail, one must hop long distances across the disk, going first to one tree and then to another. This is indeed why it took me two years to realize it could be made to work. My plan is to interleave the nodes of the three trees according to the following algorithm. Block numbers are assigned to nodes when the nodes are created or preserved, and someday will be assigned when the cleaner runs. The choice of block number is based on first determining what other node this node should be placed near, and then finding the nearest free block in the elevator's current direction. Currently we use the left neighbor of the node in the tree as the node it should be placed near. This is nice and simple. Oh well. Time to create a virtual neighbor layer. The new scheme will continue to first determine the node it should be placed near, and then start the search for an empty block from that spot, but it will use a more complicated determination of what node to place it near.
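The free-block search itself (start near the chosen neighbor and scan in the direction of increasing block numbers) can be sketched as below; the bitmap size, the wrap-around behavior, and all names are my assumptions for illustration, not the reiserfs allocator. The virtual neighbor layer changes only how the starting hint is chosen, not this search:

```c
#include <assert.h>
#include <stdbool.h>

#define NBLOCKS 64            /* toy bitmap size, for illustration only */

static bool used[NBLOCKS];    /* true = block allocated */

/* Return the first free block at or after `hint`, scanning toward
 * increasing block numbers and wrapping around once; -1 if full. */
static int find_free_near(int hint)
{
    for (int i = 0; i < NBLOCKS; i++) {
        int b = (hint + i) % NBLOCKS;
        if (!used[b])
            return b;
    }
    return -1;
}
```

Starting the scan from a semantically chosen neighbor is what makes nodes that are near in the tree ordering tend to be near on disk.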
This method will cause all nodes from the same packing locality to be near each other, will cause all directory entries and stat_data to be grouped together within that packing locality, and will interleave formatted and unformatted nodes from the same packing locality. Pseudo-code is best for describing this:

<pre>
/* For use by reiserfs_get_new_blocknrs when determining where in the
   bitmap to start the search for a free block, and for use by the
   read-ahead algorithm when there are not enough nodes to the right
   and in the same packing locality for packing-locality read-ahead
   purposes. */
get_logical_layout_left_neighbors_blocknr(key of current node)
{
    /* Based on examination of the current node's key and type,
       find the virtual neighbor of that node. */
    if body node
        if first body node of file
            if (node in tail tree whose key is less but is in same packing locality exists)
                return blocknr of such node with largest key
            else
                find node with largest key less than key of current node in stat_data tree
                return its blocknr
        else
            return blocknr of node in body tree with largest key less than key of current node
    else if tail node
        if (node in body tree belonging to same file as first tail of current node exists)
            return its blocknr
        else if (node in tail tree with lesser delimiting key but same packing locality exists)
            return blocknr of such node with largest delimiting key
        else
            return blocknr of node with largest key less than key of current node in stat_data tree
    else /* is stat_data tree node */
        if stat_data node with lesser key from same packing locality exists
            return blocknr of such node with largest key
        else
            /* no node from same packing locality with lesser key exists */
}

/* For use by packing-locality read-ahead. */
get_logical_layout_right_neighbors_blocknr(key of current node)
{
    right-handed version of get_logical_layout_left_neighbors_blocknr logic
}
</pre>

It is my hope that this will improve caching of pointers to unformatted nodes, plus improve caching of directory entries and stat_data, by separating them from file bodies to a greater extent. I also hope that it will improve read performance for 1-10k files, and that it will allow us to do this without decreasing space efficiency.

== Code Complexity ==

I thought it appropriate to mention some of the notable effects of simple design decisions on our implementation's code length. When we changed our balancing algorithms to shift parts of items rather than only whole items, so as to pack nodes tighter, this had a substantial impact on code complexity. Another multiplicative determinant of balancing code complexity was the number of item types: introducing indirect items doubled it, and changing directory items from liquid drops to air drops increased it further. Storing stat data in the first direct or indirect item of the file complicated the code for processing those items more than if I had made stat data its own item type. When one finds oneself with an NxN coding complexity issue, it usually indicates the need for adding a layer of abstraction. The NxN effect of the number of item types on balancing code complexity is an instance of that design principle, and we will address it in the next major rewrite. The balancing code will employ a set of item operations which all item types must support. The balancing code will then invoke those operations without caring to understand any more of the meaning of an item's type than that it determines which item-specific operation handler is called. Adding a new item type, say a compressed item, will then merely require writing a set of item operations for that item, rather than requiring modifying most parts of the balancing code as it does now. We now feel that the function to determine what resources are needed to perform a balancing operation, fix_nodes(), might as well be written to decide what operations will be performed during balancing, since it pretty much has to do so anyway.
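The item-operations abstraction proposed above can be sketched as a per-type table of function pointers through which the balancer dispatches. Everything here (the names, the particular operations, the direct-item handler) is a hypothetical illustration of the idea, not the planned interface:

```c
#include <assert.h>

enum item_type { DIRECT, INDIRECT, DIRENTRY, STAT_DATA, NR_TYPES };

struct item {
    enum item_type type;
    int len;                   /* bytes in the item body */
};

/* Every item type supplies the same set of operations. */
struct item_ops {
    int (*bytes_to_shift)(const struct item *it, int free_space);
    /* ... shift, paste, cut, check, print, etc. ... */
};

/* Example handler for direct items: their bodies shift byte by byte. */
static int direct_bytes_to_shift(const struct item *it, int free_space)
{
    return it->len < free_space ? it->len : free_space;
}

static const struct item_ops direct_ops = {
    .bytes_to_shift = direct_bytes_to_shift,
};

/* One row per item type; a new compressed item would just add a row
 * here instead of touching the balancing code. */
static const struct item_ops *ops_table[NR_TYPES] = {
    [DIRECT] = &direct_ops,
    /* [INDIRECT] = &indirect_ops, ... */
};

/* The balancer dispatches without knowing what the type means. */
static int shiftable_bytes(const struct item *it, int free_space)
{
    return ops_table[it->type]->bytes_to_shift(it, free_space);
}
```

With this shape, fix_nodes() would choose among the operations while computing resource needs, leaving only the mechanical invocation for later.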
That way, the function that performs the balancing with the nodes locked, do_balance(), can be gutted of most of its complexity.

== Buffering & the Preserve List ==

We implemented for version 0.2 of our file system a system of write ordering that tracked all shifting of items in the tree, and ensured that no node that had had an item shifted from it was written before the node that had received the item was written. This is necessary to prevent a system crash from causing the loss of an item that might not be recently created. This tracking approach worked, and the overhead it imposed was not measurable in our benchmarks. When in the next version we changed to partially shifting items and increased the number of item types, this code grew out of control in its complexity. I decided to replace it with a scheme that was far simpler to code and also more effective in typical usage patterns. The scheme is as follows: if an item is shifted from a node, change the block that the node's buffer will be written to. Change it to the nearest free block to the old block's left neighbor, and rather than freeing the old block, place its number on a "preserve list". (Saying nearest is slightly simplistic, in that the blocknr assignment function moves from the left neighbor in the direction of increasing block numbers.) When a "moment of consistency" is achieved, free all of the blocks on the preserve list. A moment of consistency occurs when there are no nodes in memory into which objects have been shifted (this could be made more precise, but then it would be more complex). If disk space runs out, force a moment of consistency to occur. This is sufficient to ensure that the file system is recoverable. Note that during the large file benchmarks the preserve list was freed several times in the middle of the benchmark. The percentage of buffers preserved is small in practice except during deletes, and one can arrange for moments of consistency to occur as frequently as one wants.
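A minimal sketch of the preserve-list scheme just described, with all names and the in-memory bookkeeping invented for illustration (the real code tracks dirty buffers, not a bare counter):

```c
#include <assert.h>
#include <stdlib.h>

/* Old block numbers of relocated nodes, kept until it is safe to
 * reuse them. */
struct preserved {
    long blocknr;
    struct preserved *next;
};

static struct preserved *preserve_list;
static int shifted_nodes_in_memory;   /* nodes that have received items */

/* When items are shifted out of a node, the node is written to a new
 * block; its old block is preserved rather than freed. */
static void preserve_block(long old_blocknr)
{
    struct preserved *p = malloc(sizeof(*p));
    p->blocknr = old_blocknr;
    p->next = preserve_list;
    preserve_list = p;
}

/* A moment of consistency: no node in memory has had items shifted
 * into it, so every preserved block may finally be freed.
 * Returns how many blocks were freed (0 if not yet consistent). */
static int try_free_preserved(void)
{
    int freed = 0;
    if (shifted_nodes_in_memory != 0)
        return 0;
    while (preserve_list) {
        struct preserved *p = preserve_list;
        preserve_list = p->next;
        /* here the real code would clear the block in the bitmap */
        free(p);
        freed++;
    }
    return freed;
}
```

Crash safety follows because the pre-shift copy of every node survives on disk until the post-shift state is fully consistent.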
Note that I make no claim that this approach is better than the soft updates approach employed by [Ganger] or by us in version 0.2; I merely note that tracking the order of writes is more complex than this approach for balanced trees which partially shift items. We may go back to the old approach some day, though not to the code that I threw out. Preserve lists substantially hamper performance for files in the 1-10k size range. We are re-evaluating them. Ext2fs avoids the metadata shifting problem by never shrinking directories and by using fixed inode space allocations.

== Lessons From Log Structured File Systems ==

Many techniques from other file systems haven't been applied, primarily so as to satisfy my goal of giving reiserfs 1.0 only the minimum feature set necessary to be useful, and will appear in later releases. Log structured file systems [Rosenblum and Ousterhout] embody several such techniques, which I will describe after I mention two concerns with that approach:

* With small-object file systems it is not feasible to cache in RAM a map of objectid to location for every object, since there are too many objects. This is an inherent problem in using temporal packing rather than semantic packing for small-object file systems. With my approach the internal nodes are the equivalent of this objectid-to-location map, but total internal node size is proportional to the number of nodes rather than the number of objects. You can think of internal nodes as a compression of object location information made effective by the existence of an ordering function; this compression is both essential for small files and a major feature of my approach.
* I like obtaining good though not ideal semantic locality without paying a cleaning cost for active data. This is a less critical concern.

I frequently find myself classifying packing and layout optimizations as either appropriate for implementing dynamically or appropriate only for a cleaner.
Optimizations whose computational overhead is large compared to their benefit tend to be appropriate for implementation in a cleaner, and a cleaner's benefits mostly impact the static portion of the file system (which typically consumes ~80% of the space). Such objectives as 100% packing efficiency, exactly ordering block layout by semantic order, using the full semantic tree rather than the parent directory in determining semantic order, and compression are all best implemented by cleaner approaches. In summary, there is much to be learned from the LFS approach, and as I move past my initial objective of supplying a minimal-feature, higher-performance FS I will apply some of those lessons. In the preserve list section I speculate on the possibilities for a fastboot implementation that would merge the better features of preserve lists and logging.

== Directions For the Future ==

To go one more order of magnitude smaller in file size will require adding functionality to the file system API, though it will not require discarding upward compatibility. The use of an exokernel is a better approach to small files if it is an option available to the OS designer; it is not currently an option for Linux users. In the future reiserfs will add such features as lightweight files, in which stat_data other than size is inherited from a parent if it is not created individually for the file; an API for reading and writing to files without requiring the overhead of file handles and open(); set-theoretic semantics; and many other features that you would expect from researchers who expect to be able to do all that they could do in a database, in the file system, and never really did understand why not.

== Conclusion ==

Balanced tree file systems are inherently more space efficient than block allocation based file systems, with the differences reaching order-of-magnitude levels for small files.
While other aspects of design will typically have a greater impact on performance for large files, the use of balanced trees offers performance advantages in direct proportion to the smallness of the file. A moderate advantage was found for large files. Coding cost is mostly in the interfaces, and it is a measure of the OS designer's skill whether those costs are low in the OS. We make it possible for an OS designer to use the same interface for large and small objects, and thereby reduce interface coding cost. This approach is a new tool available to the OS designer for increasing the expressive power of all of the components in the OS through better name space integration. Researchers interested in collaborating or just using my work will find me friendly. I tailor the framework of my collaborations to the needs of those I work with. I GPL reiserfs so as to meet the needs of academic collaborators. While that makes it unusable without a special license for commercial OSes, commercial vendors will find me friendly in setting up a commercial framework for commercial collaboration with commercial needs provided for.

== Acknowledgments ==

Hans Reiser was the project initiator, primary architect, supplier of funding, and one of the programmers. Some folks at times remark that naming the filesystem Reiserfs was egotistic. It was so named after a potential investor hired all of my employees away from me, then tried to negotiate better terms for his possible investment, and suggested that he could arrange for 100 researchers to swear in Russian court that I had had nothing to do with this project. That business partnership did not work out. Vladimir Saveljev, while he did not author this paper, worked long hours writing the largest fraction of the lines of code in the file system, and is remarkably gifted at just making things work. Thanks, Vladimir. Anatoly Pinchuk wrote much of the core balancing code, and too much of the rest to list here. Thanks, Anatoly.
It is the policy of the Naming System Venture that if someone quits before project completion, and then takes strong steps to try to prevent others from finishing the project, they shall not be mentioned in the acknowledgments. This was all quite sad, and best forgotten. I would like to thank Alfred Ajlamazyan for his generosity in providing overhead at a time when his institute had little it could easily spare. Grigory Zaigralin is thanked for his work in making the machines run, administering the money, and being his usual determined-to-be-useful self. Igor Chudov, thanks for such effective procurement and hardware maintenance work. Eirik Fuller is thanked for his help with NFS and porting to 2.1. I would like to thank Remi Card for the superb block allocation based file system (ext2fs) that I depended on for so many years, and that allowed me to benchmark against the best. Linus Torvalds, thank you for Linux.

== Business Model and Licensing ==

I personally favor performing a balance of commercial and public works in my life. I have no axe to grind against software that is charged for, and no regrets at making reiserfs freely available to Linux users. This project is GPL'd, but I sell exceptions to the GPL to commercial OS vendors and file server vendors. It is not usable to them without such exceptions, and many of them are wise enough to understand that:

* the porting and integration service we are able to provide with the licensing is by itself worth what we charge;
* these services impact their time to market;
* the relationship spreads the development costs across more OS vendors than just them alone.

I expect that Linux will prove to be quite effective in market sampling my intended market, but if you suspect that I also like seeing more people use it even if it is free to them, oh well. I believe it is not so much the cost that has made Linux so successful as it is the openness.
Linux is a decentralized economy with honor and recognition as the currency of payment (and thus there is much honor in it). Commercial OS vendors are, at the moment, all closed economies, and doomed to fall in their competition with open economies just as communism eventually fell. At some point an OS vendor will realize that if it:

* opens up its source code to decentralized modification,
* systematically rewards those who perform the modifications that are proven useful,
* systematically merges/integrates those modifications into its branded primary release branch while adding value as the integrator,

then it will acquire both the critical mass of the internet development community and the aggressive edge that no large communal group (such as a corporation) can have. Rather than saying to any such vendor that they should do this now, let me simply point out that whoever is first will have an enormous advantage... Since I have more recognition than money to pass around as reward, my policy is to tend to require that those who contribute substantial software to this project have their names attached to a user-visible portion of the project. This official policy helps me deal with folks like Vladimir, who was much too modest to ever name the file system checker vsck without my insisting. Smaller contributions are to be noted in the source code and in the acknowledgments section of this paper. If you choose to contribute to this file system, and your work is accepted into the primary release, you should let me know if you want me to look for opportunities to integrate you into contracts from commercial vendors. Through packaging ourselves as a group, we are more marketable to such OS vendors. Many of us have spent too much time working at day jobs unrelated to our Linux work. This is too hard, and I hope to make things easier for us all.
If you like this business model of selling GPL'd component software with related support services, but you write software not related to this file system, I encourage you to form a component supplier company also. Opportunities may arise for us to cooperate in our marketing, and I will be happy to do so.

== References ==

* G.M. Adel'son-Vel'skii and E.M. Landis, "An algorithm for the organization of information", Soviet Math. Doklady 3, 1259-1262, 1962. This paper on AVL trees can be thought of as the founding paper of the field of storing data in trees. Those not conversant in Russian will want to read the [Lewis and Denenberg] treatment of AVL trees in its place. [Wood] contains a modern treatment of trees.
* [Apple] Inside Macintosh: Files, Apple Computer Inc., Addison-Wesley, 1992. Employs balanced trees for filenames. It was an interesting file system architecture for its time in a number of ways; now its problems with internal fragmentation have become more severe as disk drives have grown larger, and the code has not received sufficient further development.
* [Bach] Maurice J. Bach, "The Design of the Unix Operating System", Prentice-Hall Software Series, Englewood Cliffs, NJ, 1986. Superbly written but sadly dated; contains detailed descriptions of the file system routines and interfaces in a manner especially useful for those trying to implement a Unix-compatible file system. See [Vahalia].
* [BLOB] R. Haskin and Raymond A. Lorie, "On Extending the Functions of a Relational Database System", SIGMOD Conference 1982, 207-212 (body of paper not on the web). See the Drops section for a discussion of how this approach makes the tree less balanced, and the effect that has on performance.
* [Chen] Chen, P.M. and Patterson, David A., "A New Approach to I/O Performance Evaluation: Self-Scaling I/O Benchmarks, Predicted I/O Performance", 1993 ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems; also available on Chen's web page.
* [C-FFS] Ganger, Gregory R. and Kaashoek, M. Frans (page with link to postscript paper). A very well written paper focused on 1-10k file size issues; they use some similar notions (most especially their concept of grouping compared to my packing localities). Note that they focus on the 1-10k file size range, and not the sub-1k range. The 1-10k range is the weak point in reiserfs performance.
* [ext2fs] by Remi Card; extensive information and source code are available. When you consider how small this file system is (~6000 lines), its effectiveness becomes all the more remarkable.
* [FFS] M.K. McKusick, W.N. Joy, S.J. Leffler, and R.S. Fabry, "A fast file system for UNIX", ACM Transactions on Computer Systems, 2(3):181-197, August 1984. Describes the implementation of a file system which employs parent directory location knowledge in determining file layout. It uses large blocks for all but the tail of files to improve I/O performance, and uses small blocks called fragments for the tails so as to reduce the cost of internal fragmentation. Numerous other improvements are also made to what was once the state of the art. FFS remains the architectural foundation for many current block allocation file systems, and was later bundled with the standard Unix releases. Note that unrequested serialization and the use of fragments place it at a performance disadvantage to ext2fs, though whether ext2fs is thereby made less reliable is a matter of dispute that I take no position on (reiserfs uses preserve lists; forgive my egotism in thinking that it is enough work for me to ensure that reiserfs solves the recovery problem, and to perhaps suggest that ext2fs would benefit from the use of preserve lists when shrinking directories).
* [Ganger] Gregory R. Ganger and Yale N. Patt, "Metadata Update Performance in File Systems" (abstract only).
* [Gifford] (postscript only). Describes a file system enriched to have more than hierarchical semantics; he shares many goals with this author, forgive me for thinking his work worthwhile. If I had to suggest one improvement in a sentence, I would say his semantic algebra needs closure.
* [Hitz] Dave Hitz, http://www.netapp.com/technology/level3/3002.html A rather well designed file system optimized for NFS and RAID in combination. Note that RAID increases the merits of write-optimization in block layout algorithms.
* [Holton and Das] Holton, Mike and Das, Raj: "The XFS space manager and namespace manager use sophisticated B-Tree indexing technology to represent file location information contained inside directory files and to represent the structure of the files themselves (location of information in a file)." Note that it is still a block (extent) allocation based file system; no attempt is made to store the actual file contents in the tree. It is targeted at the needs of the other end of the file size usage spectrum from reiserfs, and is an excellent design for that purpose (and I would concede that reiserfs 1.0 is not suitable for their real-time large I/O market). SGI has also traditionally been a leader in resisting the use of unrequested serialization of I/O. Unfortunately, the paper is a bit vague on details, and source code is not freely available.
* [Howard] Howard, J.H., Kazar, M.L., Menees, S.G., Nichols, D.A., Satyanarayanan, M., Sidebotham, R.N., and West, M.J., "Scale and Performance in a Distributed File System", ACM Transactions on Computer Systems, 6(1), February 1988. A classic benchmark; it was too CPU bound for both ext2fs and reiserfs.
* [Knuth] Knuth, D.E., The Art of Computer Programming, Vol. 3 (Sorting and Searching), Addison-Wesley, Reading, MA, 1973. The earliest reference discussing trees storing records of varying length.
* [LADDIS] Wittle, Mark and Keith, Bruce, "LADDIS: The Next Generation in NFS File Server Benchmarking", Proceedings of the Summer 1993 USENIX Conference, July 1993, pp. 111-128.
* [Lewis and Denenberg] Lewis, Harry R. and Denenberg, Larry, "Data Structures & Their Algorithms", HarperCollins Publishers, NY, NY, 1991. An algorithms textbook suitable for readers wishing to learn about balanced trees and their AVL predecessors.
* [McCreight] McCreight, E.M., "Pagination of B*-trees with variable-length records", Commun. ACM 20(9), 670-674, 1977. Describes algorithms for trees with variable-length records.
* [McVoy and Kleiman] The implementation of write-clustering for Sun's UFS.
* [OLE] Kraig Brockschmidt, "Inside OLE", Microsoft Press; discusses Structured Storage (abstract only at http://www.microsoft.com/mspress/books/abs/5-843-2b.htm).
* [Ousterhout] J.K. Ousterhout, H. Da Costa, D. Harrison, J.A. Kunze, M.D. Kupfer, and J.G. Thompson, "A trace-driven analysis of the UNIX 4.2BSD file system", Proceedings of the 10th Symposium on Operating Systems Principles, pages 15-24, Orcas Island, WA, December 1985.
* [NTFS] Helen Custer, "Inside the Windows NT File System", Microsoft Press, 1994. NTFS was architected by Tom Miller with contributions by Gary Kimura, Brian Andrew, and David Goebel. An easy to read little book. They fundamentally disagree with me on adding serialization of I/O not requested by the application programmer, and I note that the performance penalty they pay for their decision is high, especially compared with ext2fs. Their FS design is perhaps optimal for floppies and other hardware-eject media beyond OS control. A less serialized, higher performance log structured architecture is described in [Rosenblum and Ousterhout]. That said, Microsoft is to be commended for recognizing the importance of attempting to optimize for small files, and leading the OS designer effort to integrate small objects into the file name space. This book is notable for not referencing the work of persons not working for Microsoft, or providing any form of proper attribution to previous authors such as [Rosenblum and Ousterhout].
* [Peacock] K. Peacock, "The CounterPoint Fast File System", Proceedings of the USENIX Conference, Winter 1988.
* [Pike] Rob Pike and Peter Weinberger, "The Hideous Name", USENIX Summer 1985 Conference Proceedings, pp. 563, Portland, Oregon, 1985. Short, informal, and drives home why inconsistent naming schemes in an OS are detrimental. http://achille.cs.bell-labs.com/cm/cs/doc/85/1-05.ps.gz His discussion of naming in Plan 9: http://plan9.bell-labs.com/plan9/doc/names.html
* [Rosenblum and Ousterhout] Mendel Rosenblum and John K. Ousterhout, "The Design and Implementation of a Log-Structured File System", ACM Transactions on Computer Systems, February 1992. This paper was quite influential in a number of ways on many modern file systems, and the notion of using a cleaner may be applied to a future release of reiserfs. There is an interesting ongoing debate over the relative merits of FFS vs. LFS architectures; the interested reader may peruse http://www.scriptics.com/people/john.ousterhout/seltzer93.html and the arguments by Margo Seltzer it links to.
* [Snyder] "tmpfs: A Virtual Memory File System". Discusses a file system built to use swap space and intended for temporary files; due to a complete lack of disk synchronization it offers extremely high performance.
* [Vahalia] Uresh Vahalia, "Unix Kernel Internals".

This document has been retrieved from [http://web.archive.org/web/20061113154621/www.namesys.com/whitepaper.html archive.org].
-- 2006-11-13 0496d0698aa881c8f05b9bc68ad8fecbcf11a63d 1688 1589 2010-04-16T04:25:28Z Chris goe 2 moved to article page * document from: http://web.archive.org/web/20061113154621/www.namesys.com/whitepaper.html d31848305609a9bae072ec7f103a24f915924e73 1589 2009-07-06T01:42:44Z Chris goe 2 Created page with '* document from: http://web.archive.org/web/20061113154621/www.namesys.com/whitepaper.html By Hans Reiser http://namesys.com 6114 La Salle ave., #405, Oakland, CA 94611 email...' * document from: http://web.archive.org/web/20061113154621/www.namesys.com/whitepaper.html By Hans Reiser http://namesys.com 6114 La Salle ave., #405, Oakland, CA 94611 email: reiser@namesys.com a529653db8cb33c6be65d8ca7e6acde47d87c7c9 Talk:Reiser4 1 93 1733 2010-04-25T04:19:30Z Chris goe 2 moved [[Talk:Reiser4]] to [[Talk:V4]]:&#32;We'll use the Reiser4 page for something else #REDIRECT [[Talk:V4]] 6e14242776b5bdb9b64a361d7ea749f83354f983 Talk:Reiser4 Howto/GRUB 1 60 1578 1576 2009-07-04T20:11:45Z Chris goe 2 /* undefined reference to __stack_chk_fail */ Yes, this howto is heavily based on [http://m.domaindlx.com/LinuxHelp/installs/grub-reiser4.htm this howto over here] -- [[User:Chris goe|Chris goe]] 18:33, 3 July 2009 (UTC) == implicit declaration of function 'objplug' == Hm, compiling against reiser4progs-1.0.7 fails with: <pre> fsys_reiser4.c: In function 'reiser4_read': fsys_reiser4.c:126: warning: implicit declaration of function 'objplug' fsys_reiser4.c:126: error: invalid type argument of '->' (have 'int') fsys_reiser4.c:127: warning: implicit declaration of function 'plug_call' fsys_reiser4.c:127: error: invalid type argument of '->' (have 'int') fsys_reiser4.c:128: error: 'seek' undeclared (first use in this function) fsys_reiser4.c:128: error: (Each undeclared identifier is reported only once fsys_reiser4.c:128: error: for each function it appears in.) 
fsys_reiser4.c:128: error: 'reiser4_object_t' has no member named 'ent' fsys_reiser4.c:133: error: invalid type argument of '->' (have 'int') fsys_reiser4.c:134: error: invalid type argument of '->' (have 'int') fsys_reiser4.c:135: error: 'reiser4_object_t' has no member named 'ent' fsys_reiser4.c: In function 'reiser4_dir': fsys_reiser4.c:155: error: invalid type argument of '->' (have 'int') fsys_reiser4.c:156: error: 'close' undeclared (first use in this function) fsys_reiser4.c:156: error: 'reiser4_object_t' has no member named 'ent' fsys_reiser4.c:195: error: 'reiser4_object_t' has no member named 'ent' fsys_reiser4.c:195: error: 'OPSET_OBJ' undeclared (first use in this function) fsys_reiser4.c:202: error: invalid type argument of '->' (have 'int') fsys_reiser4.c:203: error: 'reiser4_object_t' has no member named 'ent' fsys_reiser4.c:209: error: invalid type argument of '->' (have 'int') fsys_reiser4.c:210: error: invalid type argument of '->' (have 'int') fsys_reiser4.c:210: error: 'readdir' undeclared (first use in this function) fsys_reiser4.c:211: error: 'reiser4_object_t' has no member named 'ent' fsys_reiser4.c:233: error: 'reiser4_object_t' has no member named 'ent' make[2]: *** [pre_stage2_exec-fsys_reiser4.o] Error 1 make[2]: Leaving directory `/usr/local/src/grub-0.97/stage2' make[1]: *** [all-recursive] Error 1 make[1]: Leaving directory `/usr/local/src/grub-0.97' make: *** [all] Error 2 </pre> Gonna try with <tt>reiser4progs-1.0.5.tar.gz</tt> later on. 
-- [[User:Chris goe|Chris goe]] 20:12, 3 July 2009 (UTC) == undefined reference to __stack_chk_fail == Compiling against reiser4progs-1.0.6 failed as well: <pre> gcc -fno-stack-protector -I/opt/reiser4progs-1.0.5/include -L/opt/libaal/lib -L/opt/reiser4progs-1.0.5/lib -o pre_stage2.exec -nostdlib -Wl,-N -Wl,-Ttext -Wl,8200 pre_stage2_exec-asm.o pre_stage2_exec-bios.o pre_stage2_exec-boot.o pre_stage2_exec-builtins.o pre_stage2_exec-char_io.o pre_stage2_exec-cmdline.o pre_stage2_exec-common.o pre_stage2_exec-console.o pre_stage2_exec-disk_io.o pre_stage2_exec-fsys_ext2fs.o pre_stage2_exec-fsys_fat.o pre_stage2_exec-fsys_ffs.o pre_stage2_exec-fsys_iso9660.o pre_stage2_exec-fsys_jfs.o pre_stage2_exec-fsys_minix.o pre_stage2_exec-fsys_reiserfs.o pre_stage2_exec-fsys_reiser4.o pre_stage2_exec-fsys_ufs2.o pre_stage2_exec-fsys_vstafs.o pre_stage2_exec-fsys_xfs.o pre_stage2_exec-gunzip.o pre_stage2_exec-hercules.o pre_stage2_exec-md5.o pre_stage2_exec-serial.o pre_stage2_exec-smp-imps.o pre_stage2_exec-stage2.o pre_stage2_exec-terminfo.o pre_stage2_exec-tparm.o pre_stage2_exec-graphics.o -lreiser4-minimal -laal-minimal /opt/reiser4progs-1.0.5/lib/libreiser4-minimal.a(libreiser4_minimal_la-semantic.o): In function `cb_find_entry': semantic.c:(.text+0x1a0): undefined reference to `__stack_chk_fail' /opt/reiser4progs-1.0.5/lib/libreiser4-minimal.a(libaux_minimal_la-aux.o): In function `aux_parse_path': aux.c:(.text+0x28d): undefined reference to `__stack_chk_fail_local' /opt/reiser4progs-1.0.5/lib/libreiser4-minimal.a(libdir40_minimal_la-dir40.o): In function `dir40_readdir': dir40.c:(.text+0x604): undefined reference to `__stack_chk_fail_local' /usr/bin/ld: pre_stage2.exec: hidden symbol `__stack_chk_fail_local' isn't defined /usr/bin/ld: final link failed: Nonrepresentable section on output collect2: ld returned 1 exit status make[2]: *** [pre_stage2.exec] Error 1 make[2]: Leaving directory `/usr/local/src/grub-0.97/stage2' make[1]: *** [all-recursive] Error 1 make[1]: 
Leaving directory `/usr/local/src/grub-0.97' make: *** [all] Error 2 </pre> Although <tt>-fno-stack-protector</tt> has been added, the reiser4progs compiled earlier were still using SSP - so I had to recompile reiser4progs again with <tt>-fno-stack-protector</tt> as well. 59218ac8728b11de17dd1cea80a1e7b9acb98bec 1576 1575 2009-07-04T19:41:54Z Chris goe 2 /* undefined reference to __stack_chk_fail */ new section Yes, this howto is heavily based on [http://m.domaindlx.com/LinuxHelp/installs/grub-reiser4.htm this howto over here] -- [[User:Chris goe|Chris goe]] 18:33, 3 July 2009 (UTC) == implicit declaration of function 'objplug' == Hm, compiling against reiser4progs-1.0.7 fails with: <pre> fsys_reiser4.c: In function 'reiser4_read': fsys_reiser4.c:126: warning: implicit declaration of function 'objplug' fsys_reiser4.c:126: error: invalid type argument of '->' (have 'int') fsys_reiser4.c:127: warning: implicit declaration of function 'plug_call' fsys_reiser4.c:127: error: invalid type argument of '->' (have 'int') fsys_reiser4.c:128: error: 'seek' undeclared (first use in this function) fsys_reiser4.c:128: error: (Each undeclared identifier is reported only once fsys_reiser4.c:128: error: for each function it appears in.) 
fsys_reiser4.c:128: error: 'reiser4_object_t' has no member named 'ent' fsys_reiser4.c:133: error: invalid type argument of '->' (have 'int') fsys_reiser4.c:134: error: invalid type argument of '->' (have 'int') fsys_reiser4.c:135: error: 'reiser4_object_t' has no member named 'ent' fsys_reiser4.c: In function 'reiser4_dir': fsys_reiser4.c:155: error: invalid type argument of '->' (have 'int') fsys_reiser4.c:156: error: 'close' undeclared (first use in this function) fsys_reiser4.c:156: error: 'reiser4_object_t' has no member named 'ent' fsys_reiser4.c:195: error: 'reiser4_object_t' has no member named 'ent' fsys_reiser4.c:195: error: 'OPSET_OBJ' undeclared (first use in this function) fsys_reiser4.c:202: error: invalid type argument of '->' (have 'int') fsys_reiser4.c:203: error: 'reiser4_object_t' has no member named 'ent' fsys_reiser4.c:209: error: invalid type argument of '->' (have 'int') fsys_reiser4.c:210: error: invalid type argument of '->' (have 'int') fsys_reiser4.c:210: error: 'readdir' undeclared (first use in this function) fsys_reiser4.c:211: error: 'reiser4_object_t' has no member named 'ent' fsys_reiser4.c:233: error: 'reiser4_object_t' has no member named 'ent' make[2]: *** [pre_stage2_exec-fsys_reiser4.o] Error 1 make[2]: Leaving directory `/usr/local/src/grub-0.97/stage2' make[1]: *** [all-recursive] Error 1 make[1]: Leaving directory `/usr/local/src/grub-0.97' make: *** [all] Error 2 </pre> Gonna try with <tt>reiser4progs-1.0.5.tar.gz</tt> later on. 
-- [[User:Chris goe|Chris goe]] 20:12, 3 July 2009 (UTC) == undefined reference to __stack_chk_fail == Compiling against reiser4progs-1.0.6 failed as well: <pre> gcc -fno-stack-protector -I/opt/reiser4progs-1.0.5/include -L/opt/libaal/lib -L/opt/reiser4progs-1.0.5/lib -o pre_stage2.exec -nostdlib -Wl,-N -Wl,-Ttext -Wl,8200 pre_stage2_exec-asm.o pre_stage2_exec-bios.o pre_stage2_exec-boot.o pre_stage2_exec-builtins.o pre_stage2_exec-char_io.o pre_stage2_exec-cmdline.o pre_stage2_exec-common.o pre_stage2_exec-console.o pre_stage2_exec-disk_io.o pre_stage2_exec-fsys_ext2fs.o pre_stage2_exec-fsys_fat.o pre_stage2_exec-fsys_ffs.o pre_stage2_exec-fsys_iso9660.o pre_stage2_exec-fsys_jfs.o pre_stage2_exec-fsys_minix.o pre_stage2_exec-fsys_reiserfs.o pre_stage2_exec-fsys_reiser4.o pre_stage2_exec-fsys_ufs2.o pre_stage2_exec-fsys_vstafs.o pre_stage2_exec-fsys_xfs.o pre_stage2_exec-gunzip.o pre_stage2_exec-hercules.o pre_stage2_exec-md5.o pre_stage2_exec-serial.o pre_stage2_exec-smp-imps.o pre_stage2_exec-stage2.o pre_stage2_exec-terminfo.o pre_stage2_exec-tparm.o pre_stage2_exec-graphics.o -lreiser4-minimal -laal-minimal /opt/reiser4progs-1.0.5/lib/libreiser4-minimal.a(libreiser4_minimal_la-semantic.o): In function `cb_find_entry': semantic.c:(.text+0x1a0): undefined reference to `__stack_chk_fail' /opt/reiser4progs-1.0.5/lib/libreiser4-minimal.a(libaux_minimal_la-aux.o): In function `aux_parse_path': aux.c:(.text+0x28d): undefined reference to `__stack_chk_fail_local' /opt/reiser4progs-1.0.5/lib/libreiser4-minimal.a(libdir40_minimal_la-dir40.o): In function `dir40_readdir': dir40.c:(.text+0x604): undefined reference to `__stack_chk_fail_local' /usr/bin/ld: pre_stage2.exec: hidden symbol `__stack_chk_fail_local' isn't defined /usr/bin/ld: final link failed: Nonrepresentable section on output collect2: ld returned 1 exit status make[2]: *** [pre_stage2.exec] Error 1 make[2]: Leaving directory `/usr/local/src/grub-0.97/stage2' make[1]: *** [all-recursive] Error 1 make[1]: 
Leaving directory `/usr/local/src/grub-0.97' make: *** [all] Error 2 </pre> a286238e58451d2209dd5a6f201dd825a317d5ee Talk:ReiserFS 1 91 1727 2010-04-25T04:15:04Z Chris goe 2 moved [[Talk:ReiserFS]] to [[Talk:X0reiserfs]]:&#32;We'll use the ReiserFS article for something else #REDIRECT [[Talk:X0reiserfs]] e04117d7c3339b5d239f9d69c72a275ba5b97e70 Talk:Txn-doc 1 72 1736 1692 2010-04-25T04:26:04Z Chris goe 2 moved to article page da39a3ee5e6b4b0d3255bfef95601890afd80709 1692 1598 2010-04-16T04:31:29Z Chris goe 2 wording This document has been retrieved from [http://web.archive.org/web/20061113154854/www.namesys.com/txn-doc.html archive.org]. -- 2006-11-13 37110d5d8403d5ab23efa02147bde6c494b73a15 1598 2009-07-06T01:58:56Z Chris goe 2 Created page with '* from: http://web.archive.org/web/20061113154854/www.namesys.com/txn-doc.html Last Update: Apr. 5, 2002 Joshua MacDonald, Hans Reiser and Alex Zarochentcev' * from: http://web.archive.org/web/20061113154854/www.namesys.com/txn-doc.html Last Update: Apr. 5, 2002 Joshua MacDonald, Hans Reiser and Alex Zarochentcev 0701346084f73fdd59b7f339c9cb78af00d5b8a9 Talk:V4 1 73 1740 1732 2010-04-25T04:33:52Z Chris goe 2 moved to article page da39a3ee5e6b4b0d3255bfef95601890afd80709 1732 1690 2010-04-25T04:19:30Z Chris goe 2 moved [[Talk:Reiser4]] to [[Talk:V4]]:&#32;We'll use the Reiser4 page for something else This document has been retrieved from [http://web.archive.org/web/20061113154600/www.namesys.com/v4/v4.html archive.org].
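The <tt>__stack_chk_fail</tt> link failures discussed on the GRUB howto talk page above happen when objects built with GCC's stack-smashing protector (SSP) are linked into the freestanding stage2 image. A quick way to verify that a rebuilt object really is SSP-free is to look for <tt>__stack_chk_*</tt> symbols with <tt>nm</tt>. A sketch against a hypothetical throwaway file (paths and file names are illustrative; gcc and binutils assumed available), not a verbatim step from the howto:

```shell
# Compile a throwaway file with the stack protector disabled, then
# confirm no __stack_chk_* symbols remain. The same nm check applies
# to the reiser4progs minimal libraries before linking them into GRUB.
cat > /tmp/ssp_check.c <<'EOF'
int main(void) { char buf[64]; buf[0] = 0; return buf[0]; }
EOF
gcc -fno-stack-protector -c /tmp/ssp_check.c -o /tmp/ssp_check.o
if nm /tmp/ssp_check.o | grep -q __stack_chk; then
    echo "built with SSP"
else
    echo "no SSP references"
fi
```

If the check reports SSP symbols, the objects were compiled before the flag took effect and need a clean rebuild with <tt>-fno-stack-protector</tt>, which is the conclusion the talk page reaches as well.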
-- 2006-11-13 f36add9d4aa85a5555a5e49cf7f0f9f6ee8f59f4 1606 2009-07-06T06:42:57Z Chris goe 2 Created page with '* from: http://web.archive.org/web/20061113154600/www.namesys.com/v4/v4.html' * from: http://web.archive.org/web/20061113154600/www.namesys.com/v4/v4.html 5c1d8cf719ca8f3dfb7adec8a811ca2bf349d759 Talk:X0reiserfs 1 89 1726 1708 2010-04-25T04:15:04Z Chris goe 2 moved [[Talk:ReiserFS]] to [[Talk:X0reiserfs]]:&#32;We'll use the ReiserFS article for something else da39a3ee5e6b4b0d3255bfef95601890afd80709 1708 1698 2010-04-25T02:54:07Z Chris goe 2 moved to article page da39a3ee5e6b4b0d3255bfef95601890afd80709 1698 2010-04-25T02:24:16Z Chris goe 2 http://web.archive.org/web/20061113154730/www.namesys.com/X0reiserfs.html This document has been retrieved from [http://web.archive.org/web/20061113154730/www.namesys.com/X0reiserfs.html archive.org].
-- 2006-11-13 d3f0515b7361c497dd16b26ff2e42cc15a0ce677 User:Chris goe 2 2 4267 4143 2017-06-25T17:34:17Z Chris goe 2 <pre>
 ____________________________________
< reiser4wiki_at_nerdbynature_dot_de >
 ------------------------------------
        \   ^__^
         \  (oo)\_______
            (__)\       )\/\
                ||----w |
                ||     ||
</pre> * [[/maintenance/]] 7edb34206c442ac711ec592f5a01c0ede981592e 4143 4055 2016-05-26T16:22:29Z Chris goe 2 [mailto:reiser4wiki_at_nerdbynature_dot_de Christian Kujau] * [[/maintenance/]] ff6114c082b5e3fea7f3ebec14fab4f30ba10c0a 4055 4054 2015-04-14T17:31:49Z Chris goe 2 / [mailto:reiser4wiki@nerdbynature.de Christian Kujau] * [[/maintenance/]] 54a4ab73547edb621b074f760dc9a85075f04aea 4054 2212 2015-04-14T17:31:24Z Chris goe 2 -replayonly [mailto:reiser4wiki@nerdbynature.de Christian Kujau] * [[/maintenance]] 015134718111639f82d0b72a8d8a2bac4cd0dd54 2212 1802 2011-04-04T17:44:36Z Chris goe 2 maintenance added [mailto:reiser4wiki@nerdbynature.de Christian Kujau] * [[/replayonly]] * [[/maintenance]] db3fc514847faa016f616061fb7cc8cdb6640d75 1802 1771 2010-10-18T00:18:10Z Chris goe 2 mod_speling [mailto:reiser4wiki@nerdbynature.de Christian Kujau] * [[/replayonly]] 471b98af98e61b3c182afa59ef80dbe2f815af37 1771 1668 2010-08-25T20:47:55Z Chris goe 2 [mailto:reser4wiki@nerdbynature.de Christian Kujau] * [[/replayonly]] 864b157d89a376c8ce356ab94213dfad3e749ac5 1668 1612 2010-02-10T05:19:28Z Chris goe 2 * [[/replayonly]] ---- [mailto:lists___nospam@nerdbynature.de me] 906cbc4deeaae2c1a8941d7e748d9e7b9264a527 1612 1283 2009-07-21T06:12:01Z Chris goe 2 [mailto:lists___nospam@nerdbynature.de me] 617d7300808186d37cfece9bf5306e961aecc826 1283 2009-06-25T00:23:33Z Chris goe 2 Created page with '[mailto:lists___nospam@nerdbynature.de Christian]' [mailto:lists___nospam@nerdbynature.de Christian] 94c2b55e327b04a08a28e8f890523dc4487d7247 User:Chris goe/maintenance 2 112 4053 2222 2015-04-14T16:48:51Z Chris goe 2 maintenance page update * [[Special:BrokenRedirects]] * [[Special:DeadendPages]] *
[[Special:DoubleRedirects]] * [[Special:Lonelypages]] * [[Special:UncategorizedCategories]] * [[Special:UncategorizedFiles]] * [[Special:UncategorizedPages]] * [[Special:UncategorizedTemplates]] * [[Special:UnusedCategories]] * [[Special:UnusedFiles]] * [[Special:UnusedTemplates]] * [[Special:WantedCategories]] * [[Special:WantedFiles]] * [[Special:WantedPages]] * [[Special:WantedTemplates]] <small><div align=right>v{{CURRENTVERSION}}/{{CONTENTLANGUAGE}}</div></small> 9fc51291549e73612c6023f909a0dde48191c2d4 2222 2011-04-04T17:44:50Z Chris goe 2 Created page with '* [[Special:BrokenRedirects]] * [[Special:DoubleRedirects]] * [[Special:Lonelypages]] * [[Special:Uncategorizedpages]] * [[Special:Uncategorizedtemplates]] * [[Special:Unusedcate…' * [[Special:BrokenRedirects]] * [[Special:DoubleRedirects]] * [[Special:Lonelypages]] * [[Special:Uncategorizedpages]] * [[Special:Uncategorizedtemplates]] * [[Special:Unusedcategories]] * [[Special:Wantedcategories]] * [[Special:Wantedpages]] <!-- * [[Special:WantedFiles]] (since [http://www.mediawiki.org/wiki/Release_notes/1.14 1.14]) --> 22197f6e21a31c645f3950d1b4b5938163322ac0 User:DusanC 2 1085 4049 2015-04-14T16:36:26Z Chris goe 2 Creating user page for new user. An lkml lurker since '99, Reiser4 user since 2007. Using Gentoo since 2007 because I've been interested in how stuff works. Currently trying to get a continuous (preferably automated) Reiser4 testing effort off the ground, and to put together some documentation too, like the reiser4 FAQ that I did for Gentoo: https://forums.gentoo.org/viewtopic-t-706171.html 8e310744b0b8c8b05ba2c2c1197a9ab9c2ec30c3 User:Georgios Tsalikis 2 1083 4047 2015-04-14T16:35:43Z Chris goe 2 Creating user page for new user. I am a Computer Science "major" and a Reiser4 follower. Algorithms are my favorite topic for the moment, and artificial intelligence is something I would like to work on in the future. I have attended, but not yet completed, medical school.
I am also seeking informal studies in higher mathematics. a404995046f80e8f7a5775354a1d1e7f63a078bd User:Korgsysop 2 1060 3695 2013-07-23T17:30:45Z Korgsysop 30308 Created page with "Kernel.org sysop user." Kernel.org sysop user. 1e314d448c4c10732ec035302bd182e122ea5e7f User:Thanatermesis 2 1087 4051 2015-04-14T16:37:53Z Chris goe 2 Creating user page for new user. Founder and developer of Elive GNU/Linux; not much more to say. (Your biography must be at least 50 words long....) 58708592c1899abce2a2beec8d4912f3d4e4feba User talk:DusanC 3 1086 4050 2015-04-14T16:36:36Z Chris goe 2 Welcome! '''Welcome to ''Reiser4 FS Wiki''!''' We hope you will contribute much and well. You will probably want to read the [[Help:Contents|help pages]]. Again, welcome and have fun! [[User:Chris goe|Chris goe]] ([[User talk:Chris goe|talk]]) 16:36, 14 April 2015 (UTC) 7d512fc026e5bf1fb38f810c2339162714cad0f0 User talk:Georgios Tsalikis 3 1084 4048 2015-04-14T16:35:53Z Chris goe 2 Welcome! '''Welcome to ''Reiser4 FS Wiki''!''' We hope you will contribute much and well. You will probably want to read the [[Help:Contents|help pages]]. Again, welcome and have fun! [[User:Chris goe|Chris goe]] ([[User talk:Chris goe|talk]]) 16:35, 14 April 2015 (UTC) 8851ae7a57169a38aeeb0a09385803922327866a User talk:Thanatermesis 3 1088 4052 2015-04-14T16:38:03Z Chris goe 2 Welcome! '''Welcome to ''Reiser4 FS Wiki''!''' We hope you will contribute much and well. You will probably want to read the [[Help:Contents|help pages]]. Again, welcome and have fun!
[[User:Chris goe|Chris goe]] ([[User talk:Chris goe|talk]]) 16:38, 14 April 2015 (UTC) 6193ee3c6c23648196737a53a5b4bffb92a0f3a7 Reiser4 FS Wiki:About 4 77 1694 1648 2010-04-16T05:49:38Z Chris goe 2 s/servername/server/ This Wiki (currently available under {{SERVER}}) exists to document development of the [[ReiserFS]] and [[Reiser4]] filesystems. As the header states, quite a few pages are copies of the last working copy of the [http://www.archive.org/web/web.php Wayback Machine] from the original sources - the [http://www.namesys.com namesys.com] site. However, at the time of writing (and since creation of this wiki, 2009-06-25), the original content is not accessible any more. As the pages were created, the first changelog entry usually lists the original source where the content was copied from (example: [{{SERVER}}/index.php?title=Reiser4&diff=1611&oldid=1319 Reiser4]). Please refer to [[Reiser4_FS_Wiki:Copyrights|the copyrights page]] for details. b9dee238713b9ae06b1d47d2987e0891a9eb7915 1648 2010-01-30T03:48:07Z Chris goe 2 Created page with 'This Wiki (currently available under {{SERVERNAME}}) exists to document development of the [[ReiserFS]] and [[Reiser4]] filesystems. As the header states, quite a few pages are …' This Wiki (currently available under {{SERVERNAME}}) exists to document development of the [[ReiserFS]] and [[Reiser4]] filesystems. As the header states, quite a few pages are copies of the last working copy of the [http://www.archive.org/web/web.php Wayback Machine] from the original sources - the [http://www.namesys.com namesys.com] site. However, at the time of writing (and since creation of this wiki, 2009-06-25), the original content is not accessible any more. As the pages were created, the first changelog entry usually lists the original source where the content was copied from (example: [http://{{SERVERNAME}}/index.php?title=Reiser4&diff=1611&oldid=1319 Reiser4]).
Please refer to [[Reiser4_FS_Wiki:Copyrights|the copyrights page]] for details. c7976172acb06f0b38dff8448dd1dacfe1cb0c09 Reiser4 FS Wiki:Copyrights 4 80 1654 1652 2010-01-30T04:15:18Z Chris goe 2 JAVASCRIPT APPENDED BY WAYBACK MACHINE, COPYRIGHT INTERNET ARCHIVE == Reiser4 wiki copyright == All submitted work is governed by the [http://creativecommons.org/licenses/by-sa/3.0/ Creative Commons Attribution/Share-Alike License], except for content where different terms apply: === archive.org copyright === Please see their [http://www.archive.org/about/terms.php terms of use]. Sites copied from the internet archive usually contain the following note: // JAVASCRIPT APPENDED BY WAYBACK MACHINE, COPYRIGHT INTERNET ARCHIVE. // ALL OTHER CONTENT MAY ALSO BE PROTECTED BY COPYRIGHT (17 U.S.C. // SECTION 108(a)(3)). === namesys.com copyright === As <tt>namesys.com</tt> is down, we were not able to copy any content from this site or look up their terms of use. Please let us know if you find something. 173df2d303d6cbaf09e565e3e32ccdb5a612f510 1652 2010-01-30T04:04:40Z Chris goe 2 Created page with '== Reiser4 wiki copyright == All submitted work is governed by the [http://creativecommons.org/licenses/by-sa/3.0/ Creative Commons Attribution/Share-Alike License], except for …' == Reiser4 wiki copyright == All submitted work is governed by the [http://creativecommons.org/licenses/by-sa/3.0/ Creative Commons Attribution/Share-Alike License], except for content where different terms apply: === archive.org copyright === Please see their [http://www.archive.org/about/terms.php terms of use]. === namesys.com copyright === As <tt>namesys.com</tt> is down, we were not able to copy any content from this site or look up their terms of use. Please let us know if you find something.
8c2804f281b35b150e9ab16164e05fc9948926be Reiser4 FS Wiki:Current events 4 25 1932 1365 2010-10-27T22:06:25Z Chris goe 2 -> Main_Page #REDIRECT [[Main_Page]] f84835069b21a3cdfd413b302965d01449ae7b8f 1365 1352 2009-06-25T09:08:04Z Chris goe 2 redirect fixed #REDIRECT [[Reiser4:News]] 1738bc0c81ecdf2f082684453349f4072770f8b1 1352 1332 2009-06-25T08:31:27Z Chris goe 2 -> news_contents #REDIRECT [[Reiser4:News_Contents]] f6576750b97872ba6c2e94be60793441502cf71f 1332 2009-06-25T07:50:21Z Chris goe 2 -> Reiser4:News #REDIRECT [[Reiser4:News]] 1738bc0c81ecdf2f082684453349f4072770f8b1 File:Bilbo.jpg 6 65 1591 2009-07-06T01:55:28Z Chris goe 2 http://www.namesys.com.wstub.archive.org/pics/whitepaper/bilbo.jpg http://www.namesys.com.wstub.archive.org/pics/whitepaper/bilbo.jpg f78a07006ead8e8ebe040c0bd9aa088f5e6ac3ed File:Compilebench-0.6.pdf 6 1071 3931 3901 2014-06-12T20:37:56Z Chris goe 2 compilebench linked [https://oss.oracle.com/~mason/compilebench/ Compilebench-0.6 (c) Oracle] HP xw4600 Intel(R) Core(TM)2 Quad CPU Q9300 2.50GHz HDD Hitachi HDE72101 7200 RPM 2G RAM Linux-3.14.1 + reiser4-for-3.14.1.patch f5b070cb3ad74a52b94ee76c80c4ca5fc84a3b3a 3901 3891 2014-06-12T20:15:11Z Chris goe 2 description added Compilebench-0.6 (c) Oracle HP xw4600 Intel(R) Core(TM)2 Quad CPU Q9300 2.50GHz HDD Hitachi HDE72101 7200 RPM 2G RAM Linux-3.14.1 + reiser4-for-3.14.1.patch 8adc8d7d00f67433ce01481cfbee4f2b2511db4d 3891 3871 2014-06-12T12:01:55Z Edward 4 Edward uploaded a new version of &quot;[[File:Compilebench-0.6.pdf]]&quot; da39a3ee5e6b4b0d3255bfef95601890afd80709 3871 2014-06-11T15:48:52Z Edward 4 da39a3ee5e6b4b0d3255bfef95601890afd80709 File:File size dist.png 6 56 1540 2009-06-30T17:15:28Z Chris goe 2 http://web.archive.org/web/20061113154921/http://www.namesys.com/internal-benchmarks/mongo/file_size_dist.jpg (it's a .png really) http://web.archive.org/web/20061113154921/http://www.namesys.com/internal-benchmarks/mongo/file_size_dist.jpg (it's a .png really) 
a171c6e197f3090a7193f190be9ed3d6ab9a7000 File:Fs-bench-py.txt 6 62 2002 1982 2010-10-27T22:14:00Z Chris goe 2 formatting fixes From: http://vizzzion.org/stuff/fs-bench.py MD5: eb5e3e3ecc1c17f215c3128f6a7236ae 9a5d6f4ef71fd7f644b5c6da85f4baec3217f583 1982 1570 2010-10-27T22:13:04Z Chris goe 2 From: http://vizzzion.org/stuff/fs-bench.py MD5: eb5e3e3ecc1c17f215c3128f6a7236ae 603b9f6e471400b25d5c23ff00da6fa46cce1258 1570 2009-07-03T20:22:12Z Chris goe 2 MD5 (fs-bench.py) = eb5e3e3ecc1c17f215c3128f6a7236ae http://vizzzion.org/stuff/fs-bench.py MD5 (fs-bench.py) = eb5e3e3ecc1c17f215c3128f6a7236ae http://vizzzion.org/stuff/fs-bench.py f78c63b8ab4e255eaa3a6b9cf8d5605aacd90d34 File:Grub-0.97-libaal-1.0.5-reiser4progs-1.0.5.patch.txt 6 57 1992 1542 2010-10-27T22:13:43Z Chris goe 2 formatting fixes From: http://m.domaindlx.com/LinuxHelp/installs/grub-0.97-libaal-1.0.5-reiser4progs-1.0.5.patch.bz2 MD5: 423d04e95c4c2d90b840f67e8a3a5024 c5397dfd548b013404508aa73c6ec4beff3e30d1 1542 2009-07-02T18:50:14Z Chris goe 2 md5sum: 423d04e95c4c2d90b840f67e8a3a5024 originally from http://m.domaindlx.com/LinuxHelp/installs/grub-0.97-libaal-1.0.5-reiser4progs-1.0.5.patch.bz2 md5sum: 423d04e95c4c2d90b840f67e8a3a5024 originally from http://m.domaindlx.com/LinuxHelp/installs/grub-0.97-libaal-1.0.5-reiser4progs-1.0.5.patch.bz2 967edf9ddc2380b27baef3c7369cc8b8d63f3b10 File:Me.jpg 6 54 1538 2009-06-30T14:43:33Z KorgWikiSysop 1 da39a3ee5e6b4b0d3255bfef95601890afd80709 File:My-secrets.jpg 6 66 1592 2009-07-06T01:55:47Z Chris goe 2 http://www.namesys.com.wstub.archive.org/pics/whitepaper/my-secrets.jpg http://www.namesys.com.wstub.archive.org/pics/whitepaper/my-secrets.jpg 23ff76fae197953d7052577061e7b0de15fcd3bd File:Passwd.jpg 6 67 1593 2009-07-06T01:56:02Z Chris goe 2 http://www.namesys.com.wstub.archive.org/pics/whitepaper/passwd.jpg http://www.namesys.com.wstub.archive.org/pics/whitepaper/passwd.jpg 98ddb8a26118c7f47f56341cfd5db2668c19ad06 File:Persistent-nesting-7.diff.txt 6 59 2012 1972 
2010-10-27T22:14:22Z Chris goe 2 formatting fixes Author: Chris Mason <mason@suse.com> Origin: ftp://ftp.suse.com/people/mason/patches/intermezzo-alpha/ Mirror: http://mirror.fraunhofer.de/ftp.suse.com/people/mason/patches/intermezzo-alpha/ MD5: 16b9d303c7199ba745fa1e7a2f0c657e 0caf1635c3373033c1607e9afddac22bcc716f05 1972 1656 2010-10-27T22:12:22Z Chris goe 2 formatting fixes Author: Chris Mason <mason@suse.com> Origin: ftp://ftp.suse.com/people/mason/patches/intermezzo-alpha/ Mirror: http://mirror.fraunhofer.de/ftp.suse.com/people/mason/patches/intermezzo-alpha/ MD5SUM: 16b9d303c7199ba745fa1e7a2f0c657e 77b334549c76b523746d6074c76a0e8a2a64a4d5 1656 1555 2010-02-03T00:59:36Z Chris goe 2 author added * Author: Chris Mason <mason@suse.com> * Origin: ftp://ftp.suse.com/people/mason/patches/intermezzo-alpha/ * Mirror: http://mirror.fraunhofer.de/ftp.suse.com/people/mason/patches/intermezzo-alpha/ * MD5SUM: 16b9d303c7199ba745fa1e7a2f0c657e 81c627f68e655b2ffb2b809c5bff8fa7d464d673 1555 2009-07-03T03:24:57Z Chris goe 2 origin: ftp://ftp.suse.com/people/mason/patches/intermezzo-alpha/ mirror: http://mirror.fraunhofer.de/ftp.suse.com/people/mason/patches/intermezzo-alpha/ md5sum: 16b9d303c7199ba745fa1e7a2f0c657e origin: ftp://ftp.suse.com/people/mason/patches/intermezzo-alpha/ mirror: http://mirror.fraunhofer.de/ftp.suse.com/people/mason/patches/intermezzo-alpha/ md5sum: 16b9d303c7199ba745fa1e7a2f0c657e 77a80e49801908ddfbe4621b4f64717796ca20ec File:Pruner.jpg 6 68 1594 2009-07-06T01:56:18Z Chris goe 2 http://www.namesys.com.wstub.archive.org/pics/whitepaper/pruner.jpg http://www.namesys.com.wstub.archive.org/pics/whitepaper/pruner.jpg 43108ae1fc03e514eabc502ffe34ba88082779de File:Reindeer.jpg 6 69 1595 2009-07-06T01:56:31Z Chris goe 2 http://www.namesys.com.wstub.archive.org/pics/whitepaper/reindeer.jpg http://www.namesys.com.wstub.archive.org/pics/whitepaper/reindeer.jpg 8dec8f696e20e0fa6cac9bbafd2a8aa91a34d85d File:Slow.c.txt 6 63 2022 1962 2010-10-27T22:15:34Z Chris goe 
2 md5 sum added From: http://vizzzion.org/stuff/slow.c MD5: 8f462bc167721dbb518da005e0bd84d4 > "slow.c is a benchmark program from the ReiserFS team. > I found on the Reiserfs4 website." d324c7faa1519948f33d4d6bc5aba40f16da3442 1962 1572 2010-10-27T22:11:51Z Chris goe 2 formatting fixes From: http://vizzzion.org/stuff/slow.c > "slow.c is a benchmark program from the ReiserFS team. > I found on the Reiserfs4 website." 8b02ea62ecada001bb823d522302f8c1fd04d37a 1572 2009-07-04T19:30:04Z Chris goe 2 from: http://vizzzion.org/stuff/slow.c ...which in turn found this one on: "slow.c is a benchmark program from the ReiserFS team. I found on the Reiserfs4 website." from: http://vizzzion.org/stuff/slow.c ...which in turn found this one on: "slow.c is a benchmark program from the ReiserFS team. I found on the Reiserfs4 website." 178fa97c26dd74e57ae28f56d2b16be4d1ec8d05 File:Syntax-barrier.jpg 6 70 1596 2009-07-06T01:57:04Z Chris goe 2 http://www.namesys.com.wstub.archive.org/pics/whitepaper/syntax-barrier.jpg http://www.namesys.com.wstub.archive.org/pics/whitepaper/syntax-barrier.jpg 1ee0381bd38e5a962bba2c6de8330ad07baed17a File:Ultimatum.jpg 6 71 1597 2009-07-06T01:57:16Z Chris goe 2 http://www.namesys.com.wstub.archive.org/pics/whitepaper/ultimatum.jpg http://www.namesys.com.wstub.archive.org/pics/whitepaper/ultimatum.jpg 92167d4cf2af3bda10df857eb689bcbdd7e6f9f8 MediaWiki:Copyright 8 78 1653 1649 2010-01-30T04:09:57Z Chris goe 2 Reiser4_FS_Wiki:Copyrights linked Most of the content is available under the [http://creativecommons.org/licenses/by-sa/3.0/ Creative Commons Attribution/Share-Alike License], except for the [[Reiser4_FS_Wiki:About|parts copied from the internet archive]], where [[Reiser4_FS_Wiki:Copyrights|different terms]] may apply. 
f89d2d7b74b15c5c06c1dc51cfdb3eb84bd1a940 1649 2010-01-30T03:51:50Z Chris goe 2 see http://www.mediawiki.org/wiki/Copyright Most of the content is available under the [http://creativecommons.org/licenses/by-sa/3.0/ Creative Commons Attribution/Share-Alike License], except for the [[Reiser4_FS_Wiki:About|parts copied from the internet archive]], where different terms may apply. 38a3a0d9081463f004c9fc358f0a9a990e8abe00 MediaWiki:Copyrightwarning 8 79 1651 1650 2010-01-30T03:57:06Z Chris goe 2 see http://www.mediawiki.org/wiki/MediaWiki:Copyrightwarning By saving, you agree to irrevocably release your contribution under the [http://creativecommons.org/licenses/by-sa/3.0/ Creative Commons Attribution/Share-Alike License 3.0]. You agree to be credited by re-users, at minimum, through a hyperlink or URL to the page you are contributing to. If you do not want your writing to be edited mercilessly and redistributed at will, then do not submit it here.<br /> You are also promising us that you wrote this yourself, or copied it from a public domain or similar free resource (see [[Reiser4 FS Wiki:Copyrights|Copyrights]] for details). '''Do not submit copyrighted work without permission!''' 8d29acbecb58c102d9af16019e4ef0d3f68f67b7 1650 2010-01-30T03:55:15Z Chris goe 2 Created page with 'By saving, you agree to irrevocably release your contribution under the [http://creativecommons.org/licenses/by-sa/3.0/ Creative Commons Attribution/Share-Alike License 3.0]. You…' By saving, you agree to irrevocably release your contribution under the [http://creativecommons.org/licenses/by-sa/3.0/ Creative Commons Attribution/Share-Alike License 3.0]. You agree to be credited by re-users, at minimum, through a hyperlink or URL to the page you are contributing to. If you do not want your writing to be edited mercilessly and redistributed at will, then do not submit it here.<br /> You are also promising us that you wrote this yourself, or copied it from a public domain or similar free resource. 
'''Do not submit copyrighted work without permission!''' a5709e379bb85280cdd3f37d6f8bbd8895445f1a MediaWiki:Sidebar 8 4 1942 1428 2010-10-27T22:08:18Z Chris goe 2 test. * navigation ** mainpage|mainpage-description ** FAQ|FAQ <!--** currentevents-url|currentevents--> ** recentchanges-url|recentchanges ** :category:Reiser4|Reiser4 ** :category:ReiserFS|ReiserFS ** Manpages|Manpages * SEARCH * TOOLBOX * LANGUAGES 0c81a45f46ebd06cc8ba0c1df2c397222ab6615a 1428 1331 2009-06-27T01:18:35Z Chris goe 2 Frequently_Asked_Questions -> FAQ * navigation ** mainpage|mainpage-description ** FAQ|FAQ ** currentevents-url|currentevents ** recentchanges-url|recentchanges ** Manpages|Manpages * SEARCH * TOOLBOX * LANGUAGES 95f25669abd04c60181a899796ad3d7eca6e9ed8 1331 1288 2009-06-25T07:49:15Z Chris goe 2 manpages added * navigation ** mainpage|mainpage-description ** Frequently_Asked_Questions|FAQ ** currentevents-url|currentevents ** recentchanges-url|recentchanges ** Manpages|Manpages * SEARCH * TOOLBOX * LANGUAGES 6a4c3a2e300da4b115be4d80329ff4b8aaff06a1 1288 1287 2009-06-25T05:53:58Z Chris goe 2 * navigation ** mainpage|mainpage-description ** Frequently_Asked_Questions|FAQ ** currentevents-url|currentevents ** recentchanges-url|recentchanges ** helppage|help * SEARCH * TOOLBOX * LANGUAGES 240329a850db349f87859b663ef3f5c03732c777 1287 2009-06-25T05:53:33Z Chris goe 2 Created page with '* navigation ** mainpage|mainpage-description ** FAQ|Frequently_Asked_Questions ** currentevents-url|currentevents ** recentchanges-url|recentchanges ** helppage|help * SEARCH * ...' 
* navigation ** mainpage|mainpage-description ** FAQ|Frequently_Asked_Questions ** currentevents-url|currentevents ** recentchanges-url|recentchanges ** helppage|help * SEARCH * TOOLBOX * LANGUAGES b98a4fecadaa74375e5c25708c3a2cb55c4feaae MediaWiki:Sitenotice 8 75 4319 4269 2019-04-16T08:54:02Z Chris goe 2 URL updated '''Welcome to the Reiser4 Wiki, the Wiki for users and developers of the [[ReiserFS]] and [[Reiser4]] filesystems.''' For now, most of the documentation is just a [https://web.archive.org/web/20071006090909/http://www.namesys.com/ snapshot of the old Namesys site] (archive.org, 2007-09-29). There was also a [https://web.archive.org/web/20070706050724/http://pub.namesys.com/ Reiser4 Wiki] (archive.org, 2007-07-06) once on <tt>pub.namesys.com</tt>. 32c19af03fbf2f60a029ce6c7e917c6fe1aa06e1 4269 2361 2017-06-25T17:38:36Z Chris goe 2 https everywhere! '''Welcome to the Reiser4 Wiki, the Wiki for users and developers of the [[ReiserFS]] and [[Reiser4]] filesystems.''' For now, most of the documentation is just a [https://wayback.archive.org/web/20071006090909/http://www.namesys.com/ snapshot of the old Namesys site] (archive.org, 2007-09-29). There was also a [https://web.archive.org/web/20070706050724/http://pub.namesys.com/ Reiser4 Wiki] (archive.org, 2007-07-06) once on <tt>pub.namesys.com</tt>. 7bd3d54b9168b351f3010c9d5e1ce132024ed196 2361 1902 2012-09-16T08:08:54Z Chris goe 2 -formatting '''Welcome to the Reiser4 Wiki, the Wiki for users and developers of the [[ReiserFS]] and [[Reiser4]] filesystems.''' For now, most of the documentation is just a [http://web.archive.org/web/20070929195459/http://www.namesys.com/ snapshot of the old Namesys site] (archive.org, 2007-09-29). There was also a [http://web.archive.org/web/20070706050724/http://pub.namesys.com/ Reiser4 Wiki] (archive.org, 2007-07-06) once on <tt>pub.namesys.com</tt>. 
4909d9bfb7c837f0d88974f7e389e6a7ebf101d5 1902 1614 2010-10-18T01:17:13Z Chris goe 2 Welcome to the Reiser4 Wiki, the Wiki for users and developers of the ReiserFS and Reiser4 filesystems. '''Welcome to the Reiser4 Wiki, the Wiki for users and developers of the [[ReiserFS]] and [[Reiser4]] filesystems.''' <font color="green" size=-1> * For now, most of the documentation is just a [http://web.archive.org/web/20070929195459/http://www.namesys.com/ snapshot of the old Namesys site] (archive.org, 2007-09-29). * There was also a [http://web.archive.org/web/20070706050724/http://pub.namesys.com/ Reiser4 Wiki] (archive.org, 2007-07-06) once on <tt>pub.namesys.com</tt>. </font> 4bc66d1125e1a99ca72bb6762fb35bc22a040ec1 1614 992 2009-07-21T06:24:50Z Chris goe 2 sitenotice created <font color="green" size=-1> * For now, most of the documentation is just a [http://web.archive.org/web/20070929195459/http://www.namesys.com/ snapshot of the old Namesys site] (archive.org, 2007-09-29). * There was also a [http://web.archive.org/web/20070706050724/http://pub.namesys.com/ Reiser4 Wiki] (archive.org, 2007-07-06) once on <tt>pub.namesys.com</tt>. 
</font> f5b1adb350f0feb653d7ba4f6f4f1dbfed25ed6e 992 2006-08-29T14:11:11Z MediaWiki default 0 - 3bc15c8aae3e4124dd409035f32ea2fd6835efc9 Template:Box 10 81 2082 1666 2010-10-27T22:48:51Z Chris goe 2 category added <div style="background: lightblue; border: 2px; padding: 5px; margin: 10px; width: {{{2}}}%"> {{{1}}}</div><noinclude> Example: <nowiki>{{box|Hello, Box!|30}}</nowiki> will be turned into: {{box|Hello, Box!|30}} [[category:Template]] </noinclude> f841b67e6695a3e32a8a33b97b6babbd631bf061 1666 1665 2010-02-03T01:22:38Z Chris goe 2 <div style="background: lightblue; border: 2px; padding: 5px; margin: 10px; width: {{{2}}}%"> {{{1}}}</div><noinclude> Example: <nowiki>{{box|Hello, Box!|30}}</nowiki> will be turned into: {{box|Hello, Box!|30}} </noinclude> 5b971371b3ad104e467118475d8282be5a3801ab 1665 1664 2010-02-03T01:22:05Z Chris goe 2 <div style="background: lightblue; border: 2px; padding: 5px; margin: 10px; width: {{{2}}}%"> {{{1}}}</div><noinclude> Example: <nowiki>{{box|hello, Box!|30}}</nowiki> will be turned into: {{box|hello, Box!|30}} </noinclude> caf70cb6c67ed38e0544fb56a512e54af1296790 1664 1663 2010-02-03T01:21:31Z Chris goe 2 . <div style="background: lightblue; border: 2px; padding: 5px; margin: 10px; width: {{{2}}}%"> {{{1}}} </div><noinclude> Example: <nowiki>{{box|hello, Box!|30}}</nowiki> will be turned into: {{box|hello, Box!|30}} </noinclude> 0d77b765fce545fb3df7bb691d744c08dfed556b 1663 1662 2010-02-03T01:20:22Z Chris goe 2 formatting fixes <div style="background: lightblue; border: 2px; padding: 5px; margin: 10px; width: 70%"> {{{1}}} </div><noinclude> Example: <nowiki>{{box|hello, Box!}}</nowiki> will be turned into: {{box|hello, Box!}} </noinclude> 4b6231ae7e7306dd08dd1c23d8c0bb60a86ee118 1662 1661 2010-02-03T01:19:43Z Chris goe 2 . 
<div style="background: lightblue; border: 2px; padding: 5px; margin: 10px; border-style: solid; border-color: blue; width: 70%"> {{{1}}} </div><noinclude> Example: <nowiki>{{box|hello, Box!}}</nowiki> will be turned into: {{box|hello, Box!}} </noinclude> 27e7a87c15472c11da321b2678c2c34528707c13 1661 1660 2010-02-03T01:17:59Z Chris goe 2 <div style="background: lightblue; border: 2px; padding: 5px; margin: 10px; border-style: solid; border-color: blue; width: 70%"> {{{text}}} </div><noinclude> Example: <nowiki>{{box|text}}</nowiki> will be turned into: {{box|text}} </noinclude> 7e6fee356b649b591a30fd51d1d7c510867a31d3 1660 1659 2010-02-03T01:16:07Z Chris goe 2 <div style="background: lightblue; border: 2px; padding: 5px; margin: 10px; border-style: solid; border-color: blue; width: 70%"> {{{text}}} </div><noinclude> Example: <nowiki>{{box|text}}</nowiki> will be turned into: {{box|text}} </noinclude> . bf241a1fac7ab9526c1088f9fdb2e62accd5446e 1659 1658 2010-02-03T01:16:00Z Chris goe 2 <div style="background: lightblue; border: 2px; padding: 5px; margin: 10px; border-style: solid; border-color: blue; width: 70%"> {{{text}}} </div><noinclude> Example: <nowiki>{{box|text}}</nowiki> will be turned into: {{box|text}} </noinclude> 7e6fee356b649b591a30fd51d1d7c510867a31d3 1658 1657 2010-02-03T01:13:45Z Chris goe 2 box created <div style="background: lightblue; border: 2px; padding: 5px; margin: 10px; border-style: none"></div><noinclude> Example: <nowiki>{{box|text}}</nowiki> will be turned into: {{box|text}} </noinclude> b1aaa62af96412e5c215711d434b22248866222c 1657 2010-02-03T01:08:01Z Chris goe 2 Created page with '<div style="background: lightblue; border: 2px; padding: 5px; margin: 10px; border-style: solid; border-color: blue; width: 70%"> {{{text}}} </div><noinclude> </noinclude>' <div style="background: lightblue; border: 2px; padding: 5px; margin: 10px; border-style: solid; border-color: blue; width: 70%"> {{{text}}} </div><noinclude> </noinclude> 
30e5bd80b1a6740108385560dcdd332de81aef43 Template:Listaddress 10 53 2052 1537 2010-10-27T22:41:49Z Chris goe 2 category added <reiserfs-devel@vger.kernel.org><noinclude> This template should always contain the current mailinglist address people can use to report any ReiserFS/Reiser4 issues. [[category:Template]] </noinclude> 0ba2432544262a8bdefb1cfc159cef8369103baa 1537 1535 2009-06-28T18:33:25Z Chris goe 2 <> added <reiserfs-devel@vger.kernel.org><noinclude> This template should always contain the current mailinglist address people can use to report any ReiserFS/Reiser4 issues. </noinclude> 150a65fff64f680297562dcbb2200533cd638a2d 1535 2009-06-28T18:30:49Z Chris goe 2 Created page with 'reiserfs-devel@vger.kernel.org <noinclude> This template should always contain the current mailinglist address people can use to report any ReiserFS/Reiser4 issues. </noinclude>' reiserfs-devel@vger.kernel.org <noinclude> This template should always contain the current mailinglist address people can use to report any ReiserFS/Reiser4 issues. </noinclude> c4e3c853f3a8f58633f6f8cda1d40682d7a6d203 Template:Wayback 10 83 2072 1743 2010-10-27T22:47:02Z Chris goe 2 category added <includeonly> <font color="green">This document has been retrieved from [http://web.archive.org/web/*/{{{1}}} archive.org] in its version from {{{2}}}.<br> It was written by its respective author(s) and ''not'' by the author(s) of this article.<br> Please apply only formatting changes, adding or correcting sources and maybe<br> spelling- and punctuation fixes to this document. Thanks!</font></includeonly><noinclude> Please use this template when copying documents from archive.org to this wiki. 
Example: <nowiki>{{wayback|http://www.namesys.com/X0reiserfs.html|2006-11-13}}</nowiki> will be turned into: {{wayback|http://www.namesys.com/X0reiserfs.html|2006-11-13}} [[category:Template]] </noinclude> 7c61d00154a4ddea3d2d8407e115bd5cba704a6f 1743 1721 2010-04-25T04:37:55Z Chris goe 2 no <br> <includeonly> <font color="green">This document has been retrieved from [http://web.archive.org/web/*/{{{1}}} archive.org] in its version from {{{2}}}.<br> It was written by its respective author(s) and ''not'' by the author(s) of this article.<br> Please apply only formatting changes, adding or correcting sources and maybe<br> spelling- and punctuation fixes to this document. Thanks!</font></includeonly><noinclude> Please use this template when copying documents from archive.org to this wiki. Example: <nowiki>{{wayback|http://www.namesys.com/X0reiserfs.html|2006-11-13}}</nowiki> will be turned into: {{wayback|http://www.namesys.com/X0reiserfs.html|2006-11-13}} </noinclude> fb5e72e5a020a096c7cee75caff356c70367f5e2 1721 1720 2010-04-25T03:15:25Z Chris goe 2 . <includeonly> <font color="green">This document has been retrieved from [http://web.archive.org/web/*/{{{1}}} archive.org] in its version from {{{2}}}.<br> It was written by its respective author(s) and ''not'' by the author(s) of this article.<br> Please apply only formatting changes, adding or correcting sources and maybe<br> spelling- and punctuation fixes to this document. Thanks!</font></includeonly> <noinclude> Please use this template when copying documents from archive.org to this wiki. Example: <nowiki>{{wayback|http://www.namesys.com/X0reiserfs.html|2006-11-13}}</nowiki> will be turned into: {{wayback|http://www.namesys.com/X0reiserfs.html|2006-11-13}} </noinclude> b0494a86d87127b3b917685a9679a8a53b077d8a 1720 1719 2010-04-25T03:14:42Z Chris goe 2 . 
<includeonly> <font color="green">This document has been retrieved from [http://web.archive.org/web/*/{{{1}}} archive.org] in its version from {{{2}}}.<br> It was written by its respective author(s) and ''not'' by the author(s) of this article.<br> Please apply only formatting changes and adding or correcting sources (and maybe<br> spelling and punctuation fixes) to this document. Thanks!</font></includeonly> <noinclude> Please use this template when copying documents from archive.org to this wiki. Example: <nowiki>{{wayback|http://www.namesys.com/X0reiserfs.html|2006-11-13}}</nowiki> will be turned into: {{wayback|http://www.namesys.com/X0reiserfs.html|2006-11-13}} </noinclude> 94fd140a97dbc595be1306768b9cfd648538f42f 1719 1718 2010-04-25T03:14:09Z Chris goe 2 . <includeonly> <font color="green">This document has been retrieved from [http://web.archive.org/web/*/{{{1}}} archive.org] in its version from {{{2}}}.<br> It was written by its respective author(s) and ''not'' by the author(s) of this article.<br> Please apply only formatting changes and adding or correcting sources (and maybe<br> spelling and punctuation fixes) to this document. Thanks!</font></includeonly> <noinclude> Please use this template when copying documents from archive.org to this wiki. Example: <nowiki>{{wayback|http://www.namesys.com/X0reiserfs.html|2006-11-13}}</nowiki> will be turned into: {{wayback|http://www.namesys.com/X0reiserfs.html|2006-11-13}} </noinclude> b62be432341ec5e8f6750397060d214c56f3550b 1718 1715 2010-04-25T03:13:41Z Chris goe 2 yes, I just hate 404s :-\ <includeonly> <font color="green">This document has been retrieved from [http://web.archive.org/web/*/{{{1}}} archive.org] in its version from {{{2}}}.<br> It was written by its respective author(s) and ''not'' by the author(s) of this article.<br> Please apply only formatting changes and adding or correcting sources (and maybe spelling and punctuation fixes)<br> to this document. 
Thanks!</font></includeonly> <noinclude> Please use this template when copying documents from archive.org to this wiki. Example: <nowiki>{{wayback|http://www.namesys.com/X0reiserfs.html|2006-11-13}}</nowiki> will be turned into: {{wayback|http://www.namesys.com/X0reiserfs.html|2006-11-13}} </noinclude> c0322d7a02580cb3c850b8b69d8bccbecc27083f 1715 1714 2010-04-25T02:59:24Z Chris goe 2 . <includeonly> <font color="green">This document has been retrieved from [http://web.archive.org/web/*/{{{1}}} archive.org] in its version from {{{2}}}.<br> It was written by its respective author(s) and ''not'' by the author(s) of this article.<br> Please apply only formatting changes (and maybe spelling and punctuation fixes)<br> to this document. Thanks!</font></includeonly> <noinclude> Please use this template when copying documents from archive.org to this wiki. Example: <nowiki>{{wayback|http://www.namesys.com/X0reiserfs.html|2006-11-13}}</nowiki> will be turned into: {{wayback|http://www.namesys.com/X0reiserfs.html|2006-11-13}} </noinclude> 62bf94a9de4effb1a2f744c8690f85f6af2f6f48 1714 1713 2010-04-25T02:59:06Z Chris goe 2 . <includeonly> <font color="green">This document has been retrieved from [http://web.archive.org/web/*/{{{1}}} archive.org] in its version from {{{2}}}.<br> It was written by its respective author(s) and ''not'' by the author(s) of this article.<br> Please apply only formatting changes (and maybe spelling and punctuation fixes)<br> to this document. Thanks!</font></includeonly> <noinclude> Please use this template when copying documents from archive.org to this wiki. Example: <nowiki>{{wayback|http://www.namesys.com/X0reiserfs.html|2006-11-13}}</nowiki> will be turned into: {{wayback|http://www.namesys.com/X0reiserfs.html|2006-11-13}} </noinclude> 8346f9f581b69408cea50f93619a4f66dfb695d9 1713 1712 2010-04-25T02:58:36Z Chris goe 2 . 
<includeonly> <font color="green">This document has been retrieved from [http://web.archive.org/web/*/{{{1}}} archive.org] in its version from {{{2}}}.<br> It was written by its respective author(s) and ''not'' by the author(s) of this article.<br> Please apply only formatting changes (and maybe spelling and punctuation fixes)<br> to this document. Thanks!</font></includeonly> <noinclude> Please use this template when copying documents from archive.org to this wiki. Example: <nowiki>{{wayback|http://www.namesys.com/X0reiserfs.html|2006-11-13}}</nowiki> will be turned into: {{wayback|http://www.namesys.com/X0reiserfs.html|2006-11-13}} </noinclude> 7f443105d850ac3184d51f94a0a2595393476932 1712 1711 2010-04-25T02:57:39Z Chris goe 2 . <includeonly><font color="green"> This document has been retrieved from [http://web.archive.org/web/*/{{{1}}} archive.org] in its version from {{{2}}}.<br> It was written by its respective author(s) and ''not'' by the author(s) of this article.<br> Please apply only formatting changes (and maybe spelling and punctuation fixes)<br> to this document. Thanks!</font></includeonly> <noinclude> Please use this template when copying documents from archive.org to this wiki. Example: <nowiki>{{wayback|http://www.namesys.com/X0reiserfs.html|2006-11-13}}</nowiki> will be turned into: {{wayback|http://www.namesys.com/X0reiserfs.html|2006-11-13}} </noinclude> 30cda4998ec2b2fe4bc9ba033aab5e092c77d4a6 1711 1710 2010-04-25T02:57:05Z Chris goe 2 . <includeonly><font color="green"> This document has been retrieved from [http://web.archive.org/web/*/{{{1}}} archive.org] in its version from {{{2}}}.<br> It was written by its respective author(s) and ''not'' by the author(s) of this article.<br> Please apply only formatting changes (and maybe spelling and punctuation fixes)<br> to this document. Thanks!</font></includeonly> <noinclude> Please use this template when copying documents from archive.org to this wiki. 
Example: <nowiki>{{wayback|http://www.namesys.com/X0reiserfs.html|2006-11-13}}</nowiki> will be turned into: {{wayback|http://www.namesys.com/X0reiserfs.html|2006-11-13}} </noinclude> 7ac518f427831c85633d5a0db6a8bba2a7de169f 1710 1707 2010-04-25T02:56:19Z Chris goe 2 more spaces <includeonly><font color="green"> This document has been retrieved from [http://web.archive.org/web/*/{{{1}}} archive.org] in its version from {{{2}}}.<br> It was written by its respective author(s) and ''not'' by the author(s) of this article.<br> Please apply only formatting changes (and maybe spelling and punctuation fixes)<br> to this document. Thanks!</font></includeonly> <noinclude> Please use this template when copying documents from archive.org to this wiki. Example: <pre><nowiki>{{wayback|http://www.namesys.com/X0reiserfs.html|2006-11-13}}</nowiki></pre> will be turned into: {{wayback|http://www.namesys.com/X0reiserfs.html|2006-11-13}} </noinclude> 4843823f9292518bfac614576b001fad4a3a13a2 1707 1706 2010-04-25T02:52:58Z Chris goe 2 test <includeonly><font color="green">This document has been retrieved from [http://web.archive.org/web/*/{{{1}}} archive.org] in its version from {{{2}}}.<br> It was written by its respective author(s) and ''not'' by the author(s) of this article.<br> Please apply only formatting changes (and maybe spelling and punctuation fixes)<br> to this document. Thanks!</font></includeonly> <noinclude> Please use this template when copying documents from archive.org to this wiki. 
Example: <nowiki>{{wayback|http://www.namesys.com/X0reiserfs.html|2006-11-13}}</nowiki> will be turned into: {{wayback|http://www.namesys.com/X0reiserfs.html|2006-11-13}} </noinclude> c99a224488e0dd9c82ae764be4f938c88a0e4c3f 1706 1705 2010-04-25T02:50:02Z Chris goe 2 wording <includeonly><font color="green">This document has been retrieved from [http://web.archive.org/web/*/{{{1}}} archive.org] in its version from {{{2}}}.<br> It was written by its respective author(s) and ''not'' by the author(s) of this article.<br> Please apply only formatting changes (and maybe spelling and punctuation fixes)<br> to this document. Thanks!</font></includeonly> <noinclude> Please use this template when copying documents from archive.org to this wiki. Example: <nowiki>{{wayback|http://www.namesys.com/X0reiserfs.html|2006-11-13}}</nowiki> will be turned into: {{wayback|http://www.namesys.com/X0reiserfs.html|2006-11-13}} </noinclude> ff2e7355948df26469caa34eadf8a986e17efe7a 1705 1704 2010-04-25T02:47:51Z Chris goe 2 formatting fixes <includeonly><font color="green">This document has been retrieved from [http://web.archive.org/web/*/{{{1}}} archive.org] in its version from {{{2}}}.<br> Please apply only formatting changes (and maybe spelling and punctuation fixes)<br> to this document. Thanks!</font></includeonly> <noinclude> Please use this template when copying documents from archive.org to this wiki. Example: <nowiki>{{wayback|http://www.namesys.com/X0reiserfs.html|2006-11-13}}</nowiki> will be turned into: {{wayback|http://www.namesys.com/X0reiserfs.html|2006-11-13}} </noinclude> 2b818aea1bd6e301a9c27220b14d3d6224844ceb 1704 1703 2010-04-25T02:47:39Z Chris goe 2 formatting fixes <includeonly><font color="green">This document has been retrieved from [http://web.archive.org/web/*/{{{1}}} archive.org] in its version from {{{2}}}.<br> Please apply only formatting changes (and maybe spelling and punctuation fixes)<br>to this document. 
Thanks!</font></includeonly> <noinclude> Please use this template when copying documents from archive.org to this wiki. Example: <nowiki>{{wayback|http://www.namesys.com/X0reiserfs.html|2006-11-13}}</nowiki> will be turned into: {{wayback|http://www.namesys.com/X0reiserfs.html|2006-11-13}} </noinclude> a5a056a18144ea5d76e325e705de302477286ff8 1703 1702 2010-04-25T02:47:21Z Chris goe 2 formatting fixes <includeonly><font color="green">This document has been retrieved from [http://web.archive.org/web/*/{{{1}}} archive.org] in its version from {{{2}}}.<br>Please apply only formatting changes (and maybe spelling and punctuation fixes)<br>to this document. Thanks!</font></includeonly> <noinclude> Please use this template when copying documents from archive.org to this wiki. Example: <nowiki>{{wayback|http://www.namesys.com/X0reiserfs.html|2006-11-13}}</nowiki> will be turned into: {{wayback|http://www.namesys.com/X0reiserfs.html|2006-11-13}} </noinclude> b82adb6ad4230b6307bee865a2e20d6b7855626b 1702 1701 2010-04-25T02:46:59Z Chris goe 2 formatting fixes <includeonly><font color="green"> This document has been retrieved from [http://web.archive.org/web/*/{{{1}}} archive.org] in its version from {{{2}}}.<br>Please apply only formatting changes (and maybe spelling and punctuation fixes)<br>to this document. Thanks!</font></includeonly> <noinclude> Please use this template when copying documents from archive.org to this wiki. Example: <nowiki>{{wayback|http://www.namesys.com/X0reiserfs.html|2006-11-13}}</nowiki> will be turned into: {{wayback|http://www.namesys.com/X0reiserfs.html|2006-11-13}} </noinclude> 2c1627785d9c02b9fed4160961b8287742731928 1701 1700 2010-04-25T02:45:06Z Chris goe 2 formatting fixes <includeonly><font color="green">This document has been retrieved from [http://web.archive.org/web/*/{{{1}}} archive.org] in its version from {{{2}}}.<br>Please apply only formatting changes (and maybe spelling and punctuation fixes)<br>to this document. 
Thanks!</font></includeonly> <noinclude> Example: <nowiki>{{wayback|http://www.namesys.com/X0reiserfs.html|2006-11-13}}</nowiki> will be turned into: {{wayback|http://www.namesys.com/txn-doc.html|2006-11-13}} </noinclude> b9a705b85369f51c1df1feef828ee8dc4db292d1 1700 1699 2010-04-25T02:44:23Z Chris goe 2 formatting fixes <includeonly><font color="green">This document has been retrieved from [http://web.archive.org/web/*/{{{1}}} archive.org] in its version from {{{2}}}.<br/>Please apply only formatting changes (and maybe spelling and punctuation fixes)<br/> to this document. Thanks!</font></includeonly> <noinclude> Example: <nowiki>{{wayback|http://www.namesys.com/X0reiserfs.html|2006-11-13}}</nowiki> will be turned into: {{wayback|http://www.namesys.com/txn-doc.html|2006-11-13}} </noinclude> eb209090fd53565f3380a384d226f80aeafff9f0 1699 1673 2010-04-25T02:42:57Z Chris goe 2 change purpose... <includeonly><font color="green">This document has been retrieved from [http://web.archive.org/web/*/{{{1}}} archive.org] in its version from {{{2}}}. Please apply only formatting changes (and maybe spelling and punctuation fixes) to this document. 
Thanks!</font></includeonly> <noinclude> Example: <nowiki>{{wayback|http://www.namesys.com/X0reiserfs.html|2006-11-13}}</nowiki> will be turned into: {{wayback|http://www.namesys.com/txn-doc.html|2006-11-13}} </noinclude> 17abd85c9b96d67c6fe3a2e874f4abea8c4de4bd 1673 1672 2010-02-10T06:24:34Z Chris goe 2 use the most current version http://web.archive.org/web/{{{1}}} <noinclude> Example: <nowiki>{{wayback|http://www.namesys.com/txn-doc.html}}</nowiki> will be turned into: {{wayback|http://www.namesys.com/txn-doc.html}} </noinclude> 294eacda855898a7f694cd48e59d54f6ff5e74bb 1672 1671 2010-02-10T06:15:13Z Chris goe 2 http://web.archive.org/web/*/{{{1}}} <noinclude> Example: <nowiki>{{wayback|http://www.namesys.com/txn-doc.html}}</nowiki> will be turned into: {{wayback|http://www.namesys.com/txn-doc.html}} </noinclude> 3fd96818fd1b21d5b7c8098e694f418ae6e5260d 1671 2010-02-10T06:14:57Z Chris goe 2 Created page with 'http://web.archive.org/web/*/{{{1}}} <noinclude> Example: <nowiki>{{wayback|http://www.namesys.com/txn-doc.html}}</nowiki> will be turned into: {{wayback|http://www.names…' http://web.archive.org/web/*/{{{1}}} <noinclude> Example: <nowiki>{{wayback|http://www.namesys.com/txn-doc.html}}</nowiki> will be turned into: {{wayback|http://www.namesys.com/txn-doc.html}} </noinclude> 820cfa96973318e478e1221ffb33f313c656a5b0 Help:Contents 12 1095 4129 2016-05-24T20:33:21Z Chris goe 2 https://www.mediawiki.org/wiki/Help:Editing #REDIRECT [https://www.mediawiki.org/wiki/Help:Editing Help:Editing] 7730fc6de0199289733252f1cc07e53c75c4ff53 Category:Formatting-fixes-needed 14 52 1519 2009-06-27T19:18:28Z Chris goe 2 description added '''temporary category with articles taken from the archive.org sites that need to be wikified. 
Help needed :-)''' c562baabbd53d3d826e589f2690bc7528285e590 Category:Reiser4 14 7 1294 2009-06-25T06:04:45Z Chris goe 2 Created page with 'Everything related to Reiser4' Everything related to Reiser4 0e3d0fcd653bd40ede47df8c665a610462ddc77c Category:ReiserFS 14 6 1293 2009-06-25T06:04:17Z Chris goe 2 Created page with 'Everything related to ReiserFS' Everything related to ReiserFS 1b68cb523d28b1d3b764eb3b2cf45e1adfbf6695 Category:Template 14 102 2042 2010-10-27T22:20:41Z Chris goe 2 Created page with 'Templates' Templates f25b700ed9f092123a43acb205a6869342cf9dd6